Session 9· 06· 20 min

Evaluating Structured Output

What you'll learn
  • Adapt run_prompt() to return JSON using prefill + stop_sequences
  • Handle JSONDecodeError gracefully in run_test_case()
  • Run a full eval loop on structured triage output and generate an HTML report

Structured output changes the eval

The sentiment eval in lessons 02-04 had a simple text output. The triage task requires a structured JSON response with four fields. Two things change: run_prompt() needs to extract JSON from the model, and run_test_case() needs to handle the case where the model returns invalid JSON rather than crashing.

run_prompt() for structured JSON

06_structured_evals.py
def run_prompt(test_case):
    user_prompt = (
        "You are a customer support triage agent.\n"
        "Analyse the following support ticket and return a JSON object with:\n"
        "  category: \"billing\" | \"technical\" | \"shipping\" | \"general\"\n"
        "  urgency:  \"low\" | \"medium\" | \"high\" | \"critical\"\n"
        "  summary:  one sentence describing the issue\n"
        "  action_items: list of concrete next steps\n\n"
        f"Ticket:\n{test_case[\"task\"]}"
    )
    messages = [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": "```json"},              ①
    ]
    text = chat(messages, stop_sequences=["```"])                  ②
    return text                                                    ③
Prefill forces the model into a JSON code block immediately.
stop_sequences=["```"] cuts off before the closing fence.
Return the raw text — run_test_case() will handle parsing and errors.

Adapting the judge for structured output

The judge prompt now needs to compare the model's JSON output against both the expected JSON and the plain-English criteria. We pass both into the judge context.

06_structured_evals.py (continued)
def grade_by_model(test_case, output):
    eval_prompt = (
        "You are an expert reviewer judging a customer-support triage model.\n\n"
        f"Ticket:\n<task>\n{test_case[\"task\"]}\n</task>\n\n"
        f"Model output:\n<output>\n{output}\n</output>\n\n"
        f"Expected output:\n<expected>\n{json.dumps(test_case[\"expected\"], indent=2)}\n</expected>\n\n"
        f"Criteria:\n<criteria>\n{test_case[\"criteria\"]}\n</criteria>\n\n"
        "Score the output 1-10 and explain your reasoning.\n"
        "Return JSON: {{\"score\": int, \"reasoning\": str, \"strengths\": [str], \"weaknesses\": [str]}}"
    )
    messages = [
        {"role": "user", "content": eval_prompt},
        {"role": "assistant", "content": "```json"},
    ]
    text = chat(messages, stop_sequences=["```"])
    return json.loads(text)

run_test_case() with error handling

06_structured_evals.py (continued)
def run_test_case(test_case):
    raw = run_prompt(test_case)
    try:
        parsed = json.loads(raw)                                  ①
    except json.JSONDecodeError:
        return {                                                    ②
            "task":    test_case["task"],
            "output":  raw,
            "score":   0,
            "reasoning": "Model returned invalid JSON.",
            "strengths":  [],
            "weaknesses": ["Output could not be parsed as JSON"],
        }
    grade = grade_by_model(test_case, json.dumps(parsed, indent=2))
    return {
        "task":      test_case["task"],
        "output":    json.dumps(parsed, indent=2),
        "score":     grade["score"],
        "reasoning": grade["reasoning"],
        "strengths": grade["strengths"],
        "weaknesses": grade["weaknesses"],
    }
Try to parse the model output as JSON. A parse failure is automatically a score of 0.
Return a minimal result dict instead of crashing so the eval loop continues for other cases.

Running the full eval

06_structured_evals.py (continued)
results = [run_test_case(case) for case in dataset]
generate_eval_html_report(results, filename="triage_eval_report.html")

avg = sum(r["score"] for r in results) / len(results)
print(f"Average score: {avg:.1f}/10")
$ python 06_structured_evals.py

Knowledge check

Code Check
What score does run_test_case() assign when the model returns invalid JSON?
Knowledge Check
What extra context does the structured judge prompt include compared to the sentiment judge?
Recap — what you just learned
  • run_prompt() uses prefill + stop_sequences to extract clean JSON from the model
  • Catch JSONDecodeError in run_test_case() and return score=0 rather than crashing
  • The structured judge prompt includes expected JSON so it can compare field-by-field
  • The HTML report works for structured evals without any changes
Next up: 07 — Adversarial Test Cases