Session 9 · 08 · 20 min
Code-Gen Eval Pipeline
What you'll learn
- Build a dataset of AWS code-generation tasks with format and solution_criteria fields
- Use code prefill in run_prompt() to get clean code output from the model
- Walk through the full code-gen eval pipeline end to end
Code generation is a different eval problem
Classifying sentiment or triaging tickets produces text that a judge can evaluate holistically. Code generation adds a new dimension: the output must be syntactically valid as well as semantically correct. This lesson builds a pipeline that evaluates AWS code tasks (Python functions, JSON IAM policies, and regex patterns) with an LLM judge. Lesson 09 then combines that judge with programmatic syntax validation.
The dataset
Each code-gen test case has five fields: task (the coding problem), format (python, json, or regex), expected (a reference solution), solution_criteria (the rubric the judge scores against), and difficulty (easy, medium, or hard).
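A quick shape check on the five fields can catch a malformed test case before any model call is spent on it. The sketch below is one way to do that. `REQUIRED_FIELDS`, `ALLOWED_FORMATS`, and `validate_case` are hypothetical names introduced here for illustration, not part of the lesson's file, and the two sample cases are invented.

```python
# Hypothetical helper names; the lesson's file does not define these.
REQUIRED_FIELDS = {"task", "format", "expected", "solution_criteria", "difficulty"}
ALLOWED_FORMATS = {"python", "json", "regex"}


def validate_case(case):
    """Return a list of problems found in one test case (empty list means OK)."""
    problems = []
    missing = REQUIRED_FIELDS - set(case)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    fmt = case.get("format")
    if fmt not in ALLOWED_FORMATS:
        problems.append(f"unknown format: {fmt!r}")
    return problems


# Illustrative cases only (not taken from the lesson's dataset).
good = {
    "task": "Write a regex that matches a 12-digit AWS account ID.",
    "format": "regex",
    "expected": r"^\d{12}$",
    "solution_criteria": "Matches exactly 12 digits.",
    "difficulty": "easy",
}
bad = {"task": "Write an IAM policy.", "format": "yaml"}

print(validate_case(good))
print(validate_case(bad))
```

Running a check like this over the whole dataset before the eval starts keeps a typo in a format value from surfacing later as a KeyError inside the prompt builder.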
08_code_gen_evals.py
dataset = [
    {
        "task": "Write a Python function `list_s3_buckets()` that uses boto3 to return a list of all S3 bucket names in the account.",
        "format": "python",
        "expected": "import boto3\ndef list_s3_buckets():\n    s3 = boto3.client(\"s3\")\n    return [b[\"Name\"] for b in s3.list_buckets()[\"Buckets\"]]",
        "solution_criteria": "Function is named list_s3_buckets, uses boto3.client('s3'), calls list_buckets(), returns a list of Name strings.",
        "difficulty": "easy",
    },
    {
        "task": "Write a Python function `upload_to_s3(file_path, bucket, key)` that uploads a local file to S3 and returns the S3 URI.",
        "format": "python",
        "expected": "import boto3\ndef upload_to_s3(file_path, bucket, key):\n    boto3.client(\"s3\").upload_file(file_path, bucket, key)\n    return f\"s3://{bucket}/{key}\"",
        "solution_criteria": "Takes file_path, bucket, key. Uses upload_file. Returns s3://bucket/key URI.",
        "difficulty": "easy",
    },
    {
        "task": "Write a JSON IAM policy that allows s3:GetObject and s3:PutObject on all objects in a bucket named my-data-bucket.",
        "format": "json",
        "expected": "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Action\":[\"s3:GetObject\",\"s3:PutObject\"],\"Resource\":\"arn:aws:s3:::my-data-bucket/*\"}]}",
        "solution_criteria": "Valid IAM policy JSON with s3:GetObject and s3:PutObject on arn:aws:s3:::my-data-bucket/*",
        "difficulty": "medium",
    },
    {
        "task": "Write a Python regex pattern that matches a valid AWS S3 bucket name (3-63 lowercase alphanumeric chars and hyphens, no leading/trailing hyphen).",
        "format": "regex",
        "expected": "^[a-z0-9][a-z0-9\\-]{1,61}[a-z0-9]$",
        "solution_criteria": "Regex matches valid S3 bucket names: 3-63 chars, lowercase alphanumeric and hyphens, no leading or trailing hyphen.",
        "difficulty": "hard",
    },
]
run_prompt() with code prefill
08_code_gen_evals.py (continued)
PREFILLS = {  # ①
    "python": "```python\n",
    "json": "```json\n",
    "regex": "```\n",
}
STOP = ["```"]

def run_prompt(test_case):
    fmt = test_case["format"]
    user_prompt = (
        f"Write AWS {fmt.upper()} code for the following task. "
        "Return ONLY the code, with no explanation.\n\n"
        # Single quotes inside the braces: backslash-escaped quotes in an
        # f-string expression are a syntax error before Python 3.12.
        f"Task:\n{test_case['task']}"
    )
    messages = [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": PREFILLS[fmt]},  # ②
    ]
    return chat(messages, stop_sequences=STOP)  # ③
① PREFILLS maps each format to the matching opening fence, so the model starts inside the correct code block.
② Prefilling the assistant turn with the opening fence steers the model to continue with code rather than prose.
③ stop_sequences=["```"] ends generation at the closing fence, so the returned text is the code alone.
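To see why prefill plus a stop sequence yields bare code, it can help to simulate the exchange without calling a model. The sketch below is an illustration under stated assumptions: `fake_chat` is a hypothetical stand-in for chat(), the model continuation is invented, and stop handling is emulated as "cut at the first stop sequence and exclude it".

```python
FENCE = "`" * 3  # three backticks, built this way so the example embeds cleanly


def fake_chat(messages, stop_sequences):
    """Hypothetical stand-in for chat(): replies as if continuing the prefill."""
    # Invented text a model might produce after the prefilled opening fence:
    # the code, a closing fence, then commentary we do not want.
    continuation = (
        'import boto3\ns3 = boto3.client("s3")\n'
        + FENCE
        + "\nThis code creates an S3 client."
    )
    # Emulate stop-sequence handling: truncate at the earliest stop sequence.
    cut = len(continuation)
    for stop in stop_sequences:
        idx = continuation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return continuation[:cut]


messages = [
    {"role": "user", "content": "Write Python code that creates an S3 client."},
    {"role": "assistant", "content": FENCE + "python\n"},  # opening fence as prefill
]
code = fake_chat(messages, stop_sequences=[FENCE])
print(code)
```

Because the opening fence lives in the prefill and the closing fence triggers the stop, neither fence appears in the returned text, and the trailing commentary is never generated.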
LLM judge for code
08_code_gen_evals.py (continued)
import json  # needed for json.loads() below

def grade_by_model(test_case, output):
    eval_prompt = (
        "You are an expert AWS engineer reviewing generated code.\n\n"
        f"Task:\n<task>\n{test_case['task']}\n</task>\n\n"
        f"Generated code:\n<output>\n{output}\n</output>\n\n"
        f"Reference solution:\n<expected>\n{test_case['expected']}\n</expected>\n\n"
        f"Criteria:\n<criteria>\n{test_case['solution_criteria']}\n</criteria>\n\n"
        "Score the code 1-10. Consider: correctness, idiomatic AWS SDK usage, "
        "edge case handling, and adherence to criteria.\n"
        # Plain (non-f) string, so braces are written singly, not doubled.
        'Return JSON: {"score": int, "reasoning": str, "strengths": [str], "weaknesses": [str]}'
    )
    messages = [
        {"role": "user", "content": eval_prompt},
        {"role": "assistant", "content": "```json"},
    ]
    text = chat(messages, stop_sequences=["```"])
    return json.loads(text)
Full pipeline
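The pipeline reads score, reasoning, strengths, and weaknesses straight from the judge's parsed reply, so a reply with a missing key or an out-of-range score would either raise late or quietly skew the report. A stricter parse is one option. The sketch below assumes the reply shape requested in the judge prompt. `parse_grade` and `EXPECTED_KEYS` are hypothetical names, and the sample reply is invented.

```python
import json

# Keys the judge prompt asks for; the helper names here are hypothetical.
EXPECTED_KEYS = {"score", "reasoning", "strengths", "weaknesses"}


def parse_grade(text):
    """Parse the judge's JSON reply and check it has the requested shape."""
    grade = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(grade, dict):
        raise ValueError("judge reply is not a JSON object")
    missing = EXPECTED_KEYS - set(grade)
    if missing:
        raise ValueError(f"judge reply is missing keys: {sorted(missing)}")
    score = grade["score"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 10:
        raise ValueError(f"score must be an integer from 1 to 10, got {score!r}")
    return grade


# Invented reply, with the leading/trailing newlines a fenced reply tends to have.
reply = (
    '\n{"score": 8, "reasoning": "Correct and concise.", '
    '"strengths": ["uses boto3.client"], "weaknesses": ["no pagination"]}\n'
)
print(parse_grade(reply)["score"])
```

Swapping the final json.loads(text) for a helper like this would make a bad judge reply fail at the point where it is produced, rather than inside the report code.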
Code-gen eval pipeline: Dataset (task + format + criteria) → run_prompt() (prefill + stop) → LLM judge (semantic scoring) → HTML report (PASS / PARTIAL / FAIL)
08_code_gen_evals.py (continued)
def run_test_case(test_case):
    # Generate code for one task, then have the judge score it.
    output = run_prompt(test_case)
    grade = grade_by_model(test_case, output)
    return {
        "task": test_case["task"],
        "format": test_case["format"],
        "output": output,
        "score": grade["score"],
        "reasoning": grade["reasoning"],
        "strengths": grade["strengths"],
        "weaknesses": grade["weaknesses"],
    }

# Run every case in order and write the report.
results = [run_test_case(case) for case in dataset]
generate_eval_html_report(results, "code_gen_report.html")
$ python 08_code_gen_evals.py
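Alongside the HTML report, a short console summary per format can show at a glance whether, say, regex tasks score lower than Python ones. The sketch below is illustrative only. `label` and `summarize` are hypothetical helpers, the sample scores are made up, and the PASS / PARTIAL cut-offs are assumptions, since the lesson does not state the thresholds the report uses.

```python
from collections import defaultdict
from statistics import mean

# Assumed cut-offs for illustration; the report's real thresholds are not given.
PASS_MIN = 8
PARTIAL_MIN = 5


def label(score):
    """Map a 1-10 judge score to a coarse label using the assumed cut-offs."""
    if score >= PASS_MIN:
        return "PASS"
    if score >= PARTIAL_MIN:
        return "PARTIAL"
    return "FAIL"


def summarize(results):
    """Group scores by format and return {format: (mean score, [labels])}."""
    by_format = defaultdict(list)
    for r in results:
        by_format[r["format"]].append(r["score"])
    return {
        fmt: (mean(scores), [label(s) for s in scores])
        for fmt, scores in by_format.items()
    }


# Made-up scores, shaped like the dicts run_test_case() returns.
sample = [
    {"format": "python", "score": 9},
    {"format": "python", "score": 6},
    {"format": "regex", "score": 3},
]
print(summarize(sample))
```

With only a handful of cases per format the means are noisy, so a summary like this is better read as a pointer to which outputs to inspect than as a benchmark number.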
Knowledge check
Code Check
What does setting the assistant prefill to "```python\n" accomplish in a code-gen prompt?
Knowledge Check
Why does the code-gen dataset have a "format" field in addition to "task"?
Recap — what you just learned
- Code-gen datasets include format (python/json/regex), expected solution, and solution_criteria
- run_prompt() selects a prefill by format and uses stop_sequences to extract clean code
- The LLM judge evaluates correctness and idiomatic AWS usage semantically
- Lesson 09 adds programmatic syntax validation on top of the LLM score