Session 9 · 06 · 20 min
Evaluating Structured Output
What you'll learn
- Adapt run_prompt() to return JSON using prefill + stop_sequences
- Handle JSONDecodeError gracefully in run_test_case()
- Run a full eval loop on structured triage output and generate an HTML report
Structured output changes the eval
The sentiment eval in lessons 02-04 produced plain-text output. The triage task requires a structured JSON response with four fields: category, urgency, summary, and action_items. Two things change: run_prompt() needs to extract JSON from the model, and run_test_case() needs to handle invalid JSON from the model instead of crashing.
run_prompt() for structured JSON
06_structured_evals.py
import json  # used by grade_by_model() and run_test_case() later in this file

# chat() is not defined in this section; it is assumed to come from the earlier lessons.
def run_prompt(test_case):
    user_prompt = (
        "You are a customer support triage agent.\n"
        "Analyse the following support ticket and return a JSON object with:\n"
        " category: \"billing\" | \"technical\" | \"shipping\" | \"general\"\n"
        " urgency: \"low\" | \"medium\" | \"high\" | \"critical\"\n"
        " summary: one sentence describing the issue\n"
        " action_items: list of concrete next steps\n\n"
        # Single quotes inside the braces: backslashes in f-string expressions
        # are a syntax error before Python 3.12.
        f"Ticket:\n{test_case['task']}"
    )
    messages = [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": "```json"},  # ①
    ]
    text = chat(messages, stop_sequences=["```"])  # ②
    return text  # ③

① The prefill starts the assistant turn inside a JSON code block, so the model continues with JSON right away.
② stop_sequences=["```"] ends generation at the closing fence, so the returned text contains only the JSON body.
③ Return the raw text; run_test_case() will handle parsing and errors.
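To see why the prefill and the stop sequence together leave parseable text, the sketch below simulates the truncation locally. It makes no API call: apply_stop_sequence() and the sample continuation are hypothetical stand-ins for what the model and chat() do, included only for illustration.

```python
import json


def apply_stop_sequence(completion: str, stop: str) -> str:
    """Mimic a stop sequence: keep only the text before the first match."""
    index = completion.find(stop)
    return completion if index == -1 else completion[:index]


# Hypothetical text a model might produce after the "```json" prefill.
continuation = (
    '\n{"category": "billing", "urgency": "high"}\n'
    "```\nLet me know if you need anything else."
)

raw = apply_stop_sequence(continuation, "```")
parsed = json.loads(raw)  # json.loads tolerates the surrounding newlines
print(parsed["category"])
```

Because the prefill already opened the fence, the first fence the model writes is the closing one, and everything before it is the JSON body.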
Adapting the judge for structured output
The judge prompt now needs to compare the model's JSON output against both the expected JSON and the plain-English criteria. We pass both into the judge context.
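A test case for this judge needs three keys: task, expected, and criteria, since those are the keys grade_by_model() reads. The entry below is a hypothetical example of that shape; the ticket text and field values are invented for illustration and are not taken from the course dataset.

```python
import json

# Hypothetical dataset entry; the keys match what grade_by_model() reads.
example_case = {
    "task": "I was charged twice for my subscription this month.",
    "expected": {
        "category": "billing",
        "urgency": "high",
        "summary": "Customer was charged twice for one subscription period.",
        "action_items": ["Verify the duplicate charge", "Refund the extra payment"],
    },
    "criteria": "Must classify the ticket as billing and propose a refund step.",
}

# The expected JSON is serialized before it is placed in the judge prompt.
expected_block = json.dumps(example_case["expected"], indent=2)
print(expected_block)
```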
06_structured_evals.py (continued)
def grade_by_model(test_case, output):
    eval_prompt = (
        "You are an expert reviewer judging a customer-support triage model.\n\n"
        f"Ticket:\n<task>\n{test_case['task']}\n</task>\n\n"
        f"Model output:\n<output>\n{output}\n</output>\n\n"
        f"Expected output:\n<expected>\n{json.dumps(test_case['expected'], indent=2)}\n</expected>\n\n"
        f"Criteria:\n<criteria>\n{test_case['criteria']}\n</criteria>\n\n"
        "Score the output 1-10 and explain your reasoning.\n"
        # Plain (non-f) string, so the braces are written singly.
        "Return JSON: {\"score\": int, \"reasoning\": str, \"strengths\": [str], \"weaknesses\": [str]}"
    )
    messages = [
        {"role": "user", "content": eval_prompt},
        {"role": "assistant", "content": "```json"},
    ]
    text = chat(messages, stop_sequences=["```"])
    # Raises json.JSONDecodeError if the judge's own reply is not valid JSON.
    return json.loads(text)

run_test_case() with error handling
06_structured_evals.py (continued)
def run_test_case(test_case):
    raw = run_prompt(test_case)
    try:
        parsed = json.loads(raw)  # ①
    except json.JSONDecodeError:
        return {  # ②
            "task": test_case["task"],
            "output": raw,
            "score": 0,
            "reasoning": "Model returned invalid JSON.",
            "strengths": [],
            "weaknesses": ["Output could not be parsed as JSON"],
        }
    grade = grade_by_model(test_case, json.dumps(parsed, indent=2))
    return {
        "task": test_case["task"],
        "output": json.dumps(parsed, indent=2),
        "score": grade["score"],
        "reasoning": grade["reasoning"],
        "strengths": grade["strengths"],
        "weaknesses": grade["weaknesses"],
    }

① Try to parse the model output as JSON. A parse failure is automatically a score of 0.
② Return a minimal result dict instead of crashing so the eval loop continues for other cases.
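Valid JSON is not necessarily valid triage output: the model could omit a field or invent a category. One optional extension, not part of the lesson's code, is a cheap code-based check that runs before the judge. The sketch below assumes the four fields and the allowed values listed in the prompt; validate_triage() is a hypothetical helper name.

```python
# Tuples rather than sets, so an unhashable value (e.g. a list) fails the
# membership test cleanly instead of raising TypeError.
ALLOWED_CATEGORIES = ("billing", "technical", "shipping", "general")
ALLOWED_URGENCIES = ("low", "medium", "high", "critical")


def validate_triage(parsed):
    """Return a list of schema problems; an empty list means well-formed."""
    if not isinstance(parsed, dict):
        return ["Top-level JSON value is not an object"]
    problems = []
    if parsed.get("category") not in ALLOWED_CATEGORIES:
        problems.append(f"Unexpected category: {parsed.get('category')!r}")
    if parsed.get("urgency") not in ALLOWED_URGENCIES:
        problems.append(f"Unexpected urgency: {parsed.get('urgency')!r}")
    if not isinstance(parsed.get("summary"), str):
        problems.append("summary is missing or not a string")
    items = parsed.get("action_items")
    if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
        problems.append("action_items is missing or not a list of strings")
    return problems


good = {
    "category": "billing",
    "urgency": "high",
    "summary": "Customer was charged twice.",
    "action_items": ["Refund the duplicate charge"],
}
bad = {"category": "refunds", "urgency": "high", "summary": "Charged twice."}

print(validate_triage(good))
print(validate_triage(bad))
```

Any problems found this way could be appended to the weaknesses list, or used to skip the judge call entirely; which is appropriate depends on how strictly the eval should treat schema drift.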
Running the full eval
06_structured_evals.py (continued)
results = [run_test_case(case) for case in dataset]
generate_eval_html_report(results, filename="triage_eval_report.html")
avg = sum(r["score"] for r in results) / len(results)
print(f"Average score: {avg:.1f}/10")

$ python 06_structured_evals.py
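Because parse failures are scored 0, they pull the average down alongside genuinely poor answers. Reporting them separately can make the number easier to interpret. The following is a sketch, not part of the lesson's script: summarize() is a hypothetical helper, and it identifies parse failures by the reasoning string that run_test_case() sets.

```python
PARSE_FAILURE_REASON = "Model returned invalid JSON."


def summarize(results):
    """Report the overall average and the average over parseable outputs only."""
    if not results:
        return {"count": 0, "parse_failures": 0, "average": None, "average_parsed": None}
    parsed = [r for r in results if r["reasoning"] != PARSE_FAILURE_REASON]
    return {
        "count": len(results),
        "parse_failures": len(results) - len(parsed),
        "average": sum(r["score"] for r in results) / len(results),
        "average_parsed": (
            sum(r["score"] for r in parsed) / len(parsed) if parsed else None
        ),
    }


# Hypothetical results, shaped like the dicts run_test_case() returns.
sample = [
    {"score": 8, "reasoning": "Accurate category and urgency."},
    {"score": 6, "reasoning": "Summary is too vague."},
    {"score": 0, "reasoning": PARSE_FAILURE_REASON},
]
stats = summarize(sample)
print(stats)
```

A large gap between the two averages suggests the prompt's formatting instructions need work more than its triage logic does.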
Knowledge check
Code Check
What score does run_test_case() assign when the model returns invalid JSON?
Knowledge Check
What extra context does the structured judge prompt include compared to the sentiment judge?
Recap — what you just learned
- run_prompt() uses prefill + stop_sequences to extract clean JSON from the model
- Catch JSONDecodeError in run_test_case() and return score=0 rather than crashing
- The structured judge prompt includes expected JSON so it can compare field-by-field
- The HTML report works for structured evals without any changes