Session 9 · 08 · 20 min

Code-Gen Eval Pipeline

What you'll learn
  • Build a dataset of AWS code-generation tasks with format and solution_criteria fields
  • Use code prefill in run_prompt() to get clean code output from the model
  • Walk through the full code-gen eval pipeline end to end

Code generation is a different eval problem

Classifying sentiment or triaging tickets produces text that a judge can evaluate holistically. Code generation adds a new dimension: the output must be syntactically valid as well as semantically correct. This lesson builds a pipeline that evaluates AWS code tasks (Python functions, JSON IAM policies, and regex patterns) with an LLM judge. Lesson 09 then combines that judge with programmatic syntax validation.

The dataset

Each code-gen test case has five fields: task (the coding problem), format (python, json, or regex), expected (a reference solution), solution_criteria (the rubric), and a difficulty field.

08_code_gen_evals.py
dataset = [
    {
        "task": "Write a Python function `list_s3_buckets()` that uses boto3 to return a list of all S3 bucket names in the account.",
        "format": "python",
        "expected": "import boto3\ndef list_s3_buckets():\n    s3 = boto3.client(\"s3\")\n    return [b[\"Name\"] for b in s3.list_buckets()[\"Buckets\"]]",
        "solution_criteria": "Function is named list_s3_buckets, uses boto3.client('s3'), calls list_buckets(), returns a list of Name strings.",
        "difficulty": "easy",
    },
    {
        "task": "Write a Python function `upload_to_s3(file_path, bucket, key)` that uploads a local file to S3 and returns the S3 URI.",
        "format": "python",
        "expected": "import boto3\ndef upload_to_s3(file_path, bucket, key):\n    boto3.client(\"s3\").upload_file(file_path, bucket, key)\n    return f\"s3://{bucket}/{key}\"",
        "solution_criteria": "Takes file_path, bucket, key. Uses upload_file. Returns s3://bucket/key URI.",
        "difficulty": "easy",
    },
    {
        "task": "Write a JSON IAM policy that allows s3:GetObject and s3:PutObject on all objects in a bucket named my-data-bucket.",
        "format": "json",
        "expected": "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Action\":[\"s3:GetObject\",\"s3:PutObject\"],\"Resource\":\"arn:aws:s3:::my-data-bucket/*\"}]}",
        "solution_criteria": "Valid IAM policy JSON with s3:GetObject and s3:PutObject on arn:aws:s3:::my-data-bucket/*",
        "difficulty": "medium",
    },
    {
        "task": "Write a Python regex pattern that matches a valid AWS S3 bucket name (3-63 lowercase alphanumeric chars and hyphens, no leading/trailing hyphen).",
        "format": "regex",
        "expected": "^[a-z0-9][a-z0-9\\-]{1,61}[a-z0-9]$",
        "solution_criteria": "Regex matches valid S3 bucket names: 3-63 chars, lowercase alphanumeric and hyphens, no leading or trailing hyphen.",
        "difficulty": "hard",
    },
]
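
Before running the pipeline, it can help to confirm that every test case carries the five fields described above and a format the pipeline knows how to prefill. The following is a minimal sketch, not part of the lesson file: `validate_dataset`, `REQUIRED_FIELDS`, and `ALLOWED_FORMATS` are hypothetical names introduced here for illustration.

```python
# Hypothetical helper: checks dataset shape before any model calls are made.
REQUIRED_FIELDS = {"task", "format", "expected", "solution_criteria", "difficulty"}
ALLOWED_FORMATS = {"python", "json", "regex"}

def validate_dataset(cases):
    """Return a list of problems; an empty list means the dataset looks well-formed."""
    problems = []
    for i, case in enumerate(cases):
        missing = REQUIRED_FIELDS - set(case)
        if missing:
            problems.append(f"case {i}: missing fields {sorted(missing)}")
        fmt = case.get("format")
        if fmt not in ALLOWED_FORMATS:
            problems.append(f"case {i}: unknown format {fmt!r}")
    return problems

# Small illustrative input: the second case is deliberately malformed.
sample = [
    {"task": "t", "format": "python", "expected": "e",
     "solution_criteria": "c", "difficulty": "easy"},
    {"task": "t", "format": "yaml", "expected": "e"},
]
problems = validate_dataset(sample)
for line in problems:
    print(line)
```

Catching a missing or misspelled format here is cheaper than discovering it as a KeyError in the prefill lookup partway through a run.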

run_prompt() with code prefill

08_code_gen_evals.py (continued)
PREFILLS = {                                                        # ①
    "python": "```python\n",
    "json":   "```json\n",
    "regex":  "```\n",
}
STOP = ["```"]

def run_prompt(test_case):
    fmt = test_case["format"]
    user_prompt = (
        f"Write AWS {fmt.upper()} code for the following task. "
        "Return ONLY the code, with no explanation.\n\n"
        # Single quotes inside the braces keep this valid on Python versions before 3.12.
        f"Task:\n{test_case['task']}"
    )
    messages = [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": PREFILLS[fmt]},          # ②
    ]
    return chat(messages, stop_sequences=STOP)                     # ③
① PREFILLS maps each format to its opening fence, so the model starts inside the correct code block.
② Prefilling the assistant turn with the opening fence steers the model to continue with code rather than commentary.
③ stop_sequences=["```"] ends generation at the closing fence, so the returned text is the code alone.
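
To see why the prefill and stop sequence together yield clean code, the effect of a stop sequence can be simulated locally. This is only an illustrative sketch: in the real pipeline the truncation happens inside the model API, and `truncate_at_stop` and the sample continuation below are hypothetical, not taken from the lesson file.

```python
def truncate_at_stop(completion, stop_sequences):
    """Return the completion up to, but not including, the earliest stop sequence."""
    cut = len(completion)
    for stop in stop_sequences:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]

# Hypothetical raw continuation a model might produce after the "```python\n" prefill:
# code first, then a closing fence, then commentary we do not want.
raw = (
    "import boto3\n\n"
    "def list_s3_buckets():\n"
    "    ...\n"
    "```\n"
    "This function lists buckets."
)
clean = truncate_at_stop(raw, ["```"])
print(clean)
```

Because the opening fence is already in the prefill, the first fence the model emits is the closing one, so stopping there drops both the fence and any trailing explanation.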

LLM judge for code

08_code_gen_evals.py (continued)
import json  # needed for json.loads() below

def grade_by_model(test_case, output):
    # Single quotes inside the f-string braces keep this valid on Python versions before 3.12.
    eval_prompt = (
        "You are an expert AWS engineer reviewing generated code.\n\n"
        f"Task:\n<task>\n{test_case['task']}\n</task>\n\n"
        f"Generated code:\n<output>\n{output}\n</output>\n\n"
        f"Reference solution:\n<expected>\n{test_case['expected']}\n</expected>\n\n"
        f"Criteria:\n<criteria>\n{test_case['solution_criteria']}\n</criteria>\n\n"
        "Score the code 1-10. Consider: correctness, idiomatic AWS SDK usage, "
        "edge case handling, and adherence to criteria.\n"
        # This piece is a plain string, not an f-string, so the braces are written once.
        'Return JSON: {"score": int, "reasoning": str, "strengths": [str], "weaknesses": [str]}'
    )
    messages = [
        {"role": "user", "content": eval_prompt},
        {"role": "assistant", "content": "```json"},
    ]
    # The stop sequence ends generation at the closing fence, leaving only the JSON body.
    text = chat(messages, stop_sequences=["```"])
    return json.loads(text)
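
A judge can return JSON that parses but does not match the requested shape, for example a score outside 1-10 or a missing key. The sketch below shows one way to check the reply before it reaches the report. `parse_grade` and `REQUIRED_KEYS` are hypothetical names, and the strictness of the checks is an assumption, not something the lesson file prescribes.

```python
import json

# Keys the judge prompt asks for.
REQUIRED_KEYS = {"score", "reasoning", "strengths", "weaknesses"}

def parse_grade(text):
    """Parse the judge's JSON reply and verify it has the shape the prompt asked for."""
    grade = json.loads(text.strip())  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(grade, dict):
        raise ValueError("judge reply is not a JSON object")
    missing = REQUIRED_KEYS - set(grade)
    if missing:
        raise ValueError(f"judge reply is missing keys: {sorted(missing)}")
    score = grade["score"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 10:
        raise ValueError(f"score must be an integer from 1 to 10, got {score!r}")
    return grade

# Illustrative reply, including the leading newline left over after the "```json" prefill.
reply = '\n{"score": 8, "reasoning": "Correct.", "strengths": ["uses boto3"], "weaknesses": []}\n'
grade = parse_grade(reply)
print(grade["score"], grade["reasoning"])
```

Every failure surfaces as a ValueError, so a caller could catch that one exception type and retry or record the case as ungraded.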

Full pipeline

Code-gen eval pipeline
Dataset (task + format + criteria) → run_prompt() (prefill + stop) → LLM judge (semantic scoring) → HTML report (PASS / PARTIAL / FAIL)
08_code_gen_evals.py (continued)
def run_test_case(test_case):
    output = run_prompt(test_case)
    grade  = grade_by_model(test_case, output)
    return {
        "task":      test_case["task"],
        "format":    test_case["format"],
        "output":    output,
        "score":     grade["score"],
        "reasoning": grade["reasoning"],
        "strengths": grade["strengths"],
        "weaknesses": grade["weaknesses"],
    }

results = [run_test_case(case) for case in dataset]
generate_eval_html_report(results, "code_gen_report.html")
$ python 08_code_gen_evals.py
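
Alongside the HTML report, a quick numeric summary of the results list can make regressions easier to spot. The sketch below is an assumption-laden illustration: `label` and `summarize` are hypothetical helpers, and the PASS / PARTIAL / FAIL cut-offs (8 and 5) are chosen here for the example, since the lesson does not state which thresholds the report uses.

```python
from collections import defaultdict
from statistics import mean

def label(score, pass_at=8, partial_at=5):
    """Map a 1-10 score to a bucket. The thresholds are illustrative assumptions."""
    if score >= pass_at:
        return "PASS"
    if score >= partial_at:
        return "PARTIAL"
    return "FAIL"

def summarize(results):
    """Compute the overall mean, the mean per format, and a label per result."""
    by_format = defaultdict(list)
    for r in results:
        by_format[r["format"]].append(r["score"])
    return {
        "overall_mean": mean(r["score"] for r in results),
        "mean_by_format": {fmt: mean(scores) for fmt, scores in by_format.items()},
        "labels": [label(r["score"]) for r in results],
    }

# Made-up scores in the shape run_test_case() returns (only the fields used here).
sample_results = [
    {"format": "python", "score": 9},
    {"format": "python", "score": 6},
    {"format": "json", "score": 3},
]
summary = summarize(sample_results)
print(summary)
```

Note that `statistics.mean` raises on an empty sequence, so an empty results list would need its own handling.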

Knowledge check

  • Code check: What does setting the assistant prefill to "```python\n" accomplish in a code-gen prompt?
  • Knowledge check: Why does the code-gen dataset have a "format" field in addition to "task"?

Recap: what you just learned
  • Code-gen datasets include format (python/json/regex), expected solution, and solution_criteria
  • run_prompt() selects a prefill by format and uses stop_sequences to extract clean code
  • The LLM judge evaluates correctness and idiomatic AWS usage semantically
  • Lesson 09 adds programmatic syntax validation on top of the LLM score
Next up: 09 — Dual Grading: LLM + Syntax