Session 9 · 03 · 20 min

LLM-as-Judge Scoring

What you'll learn
  • Use the prefill + stop_sequences trick to get clean JSON from a judge prompt
  • Build grade_by_model() returning score, reasoning, strengths, and weaknesses
  • Wire run_test_case() to run both the prompt and the judge for each case

Why use an LLM as a judge?

Regex checks can verify a label matches an expected string, but they cannot tell you whether the reason given is accurate or sensible. An LLM judge reads the output against a plain-English rubric and returns a structured score with explanations — giving you richer signal than a binary pass/fail.

The judge is a second LLM call
Every test case now costs two API calls: one for the prompt under test and one for the judge. This is normal — the judge call is usually cheap (short input, low max_tokens) and the signal it returns is far more valuable than a simple string match.

The prefill + stop_sequences trick

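To get JSON from the judge without surrounding prose, we use two Claude features together: we prefill the assistant turn with "```json" so the model starts inside a code block, then set stop_sequences=["```"] so generation stops at the closing fence. The stop sequence itself is not included in the returned text, so what comes back is just the JSON body. This works reliably in practice but is not guaranteed (for example, the response can be cut off by max_tokens mid-object), so json.loads() can still raise.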
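At the API level, the two features look roughly like this. This is a minimal sketch assuming the Anthropic Python SDK's Messages API; the client, model, and max_tokens parameters and the judge_messages() helper are assumptions added for illustration, and the chat() helper used in this course may be written differently.

```python
FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly


def judge_messages(eval_prompt):
    """Build the user turn plus the prefilled assistant turn."""
    return [
        {"role": "user", "content": eval_prompt},
        # Prefill: the model continues from here, i.e. inside a JSON code block.
        {"role": "assistant", "content": FENCE + "json"},
    ]


def chat(messages, stop_sequences=None, *, client, model, max_tokens=500):
    """Send messages and return only the text the model generated.

    Assumption: `client` behaves like anthropic.Anthropic(), and `model`
    is whatever model ID your own setup uses.
    """
    params = {"model": model, "max_tokens": max_tokens, "messages": messages}
    if stop_sequences:
        params["stop_sequences"] = stop_sequences
    response = client.messages.create(**params)
    # The prefilled text is not echoed back: the response holds only the
    # continuation, ending where the stop sequence would have started.
    return response.content[0].text
```

With a real client you would call chat(judge_messages(prompt), stop_sequences=[FENCE], client=anthropic.Anthropic(), model=...) and pass the result to json.loads().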

03_llm_judge.py
import json

def grade_by_model(test_case, output):                               # ①
    # Single quotes inside the f-strings: backslash-escaped quotes in an
    # f-string expression are a SyntaxError before Python 3.12.
    eval_prompt = (
        "You are an expert reviewer judging the output of a sentiment-classification prompt.\n\n"
        f"Original task:\n<task>\n{test_case['task']}\n</task>\n\n"
        f"Model's output:\n<output>\n{output}\n</output>\n\n"
        f"What a good answer looks like:\n<criteria>\n{test_case['criteria']}\n</criteria>\n\n"
        "Score the output and explain your reasoning. Respond with JSON in this exact shape:\n"
        "{\n"
        "  \"strengths\": [\"short bullet\", \"...\"],\n"
        "  \"weaknesses\": [\"short bullet\", \"...\"],\n"
        "  \"reasoning\": \"one or two sentences\",\n"
        "  \"score\": <integer 1-10>\n"
        "}"
    )
    messages = [
        {"role": "user", "content": eval_prompt},
        {"role": "assistant", "content": "```json"},               # ②
    ]
    text = chat(messages, stop_sequences=["```"])                   # ③
    return json.loads(text)                                         # ④
① grade_by_model() takes the test case (for context and criteria) and the model output to judge.
② Prefilling the assistant turn with "```json" forces the model to start its response inside a JSON block.
③ stop_sequences=["```"] ends generation before the closing fence, leaving only the JSON content.
④ json.loads() parses the clean JSON string into a Python dict.
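Because the judge is itself a model, its JSON can occasionally be malformed or drift from the requested shape. One way to guard against that is to validate the parsed result before using it. The parse_grade() helper and REQUIRED_KEYS constant below are hypothetical names, not part of the lesson's code; this is a sketch of the idea, not the course's implementation.

```python
import json

# The four fields the judge prompt asks for.
REQUIRED_KEYS = ("strengths", "weaknesses", "reasoning", "score")


def parse_grade(text):
    """Parse the judge's JSON and check it matches the requested shape."""
    try:
        grade = json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"Judge did not return valid JSON: {err}") from err
    if not isinstance(grade, dict):
        raise ValueError("Judge JSON must be an object")
    missing = [key for key in REQUIRED_KEYS if key not in grade]
    if missing:
        raise ValueError(f"Judge JSON is missing keys: {missing}")
    score = grade["score"]
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 10:
        raise ValueError(f"Score must be an integer from 1 to 10, got {score!r}")
    return grade
```

In grade_by_model(), the final line could then become return parse_grade(text), so a bad judge response fails loudly at the source instead of surfacing later as a KeyError.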

The full eval loop

03_llm_judge.py (continued)
def run_test_case(test_case):
    output = run_prompt(test_case)
    grade  = grade_by_model(test_case, output)
    return {
        "task":           test_case["task"],
        "expected_label": test_case["expected_label"],
        "output":         output,
        "score":          grade["score"],
        "reasoning":      grade["reasoning"],
        "strengths":      grade["strengths"],
        "weaknesses":     grade["weaknesses"],
    }

results = [run_test_case(case) for case in dataset]

for r in results:
    print(f"Score {r['score']}/10 | {r['reasoning']}")
Run the script from a terminal:
$ python 03_llm_judge.py
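Once every case has a score, a small summary makes the run easier to read than a list of per-case lines. The sketch below assumes results has the shape returned by run_test_case(); the summarize() name and the threshold of 7 are hypothetical choices for illustration.

```python
from statistics import mean


def summarize(results, threshold=7):
    """Return the average score and the cases scoring below `threshold`."""
    if not results:
        return {"average": None, "below_threshold": []}
    scores = [r["score"] for r in results]
    return {
        "average": round(mean(scores), 2),
        # Keep the full result dicts so reasoning and weaknesses stay attached.
        "below_threshold": [r for r in results if r["score"] < threshold],
    }
```

Printing the reasoning and weaknesses of the below_threshold cases is usually the quickest way to see what to change in the prompt under test.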

Knowledge check

  • What does prefilling the assistant turn with "```json" accomplish?
  • What four fields does grade_by_model() return?

Recap: what you just learned
  • Prefilling the assistant turn and setting stop_sequences=["```"] returns the judge's JSON without surrounding prose
  • grade_by_model() returns score (1-10), reasoning, strengths, and weaknesses
  • run_test_case() chains run_prompt() and grade_by_model() for a single case
  • An LLM judge provides far richer signal than a regex match
Next up: 04 — HTML Eval Reports