Session 9 · 03 · 20 min
LLM-as-Judge Scoring
What you'll learn
- ▸ Use the prefill + stop_sequences trick to get clean JSON from a judge prompt
- ▸ Build grade_by_model() returning score, reasoning, strengths, and weaknesses
- ▸ Wire run_test_case() to run both the prompt and the judge for each case
Why use an LLM as a judge?
Regex checks can verify a label matches an expected string, but they cannot tell you whether the reason given is accurate or sensible. An LLM judge reads the output against a plain-English rubric and returns a structured score with explanations — giving you richer signal than a binary pass/fail.
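To make that limitation concrete, here is a minimal sketch. The "Label: ... / Reason: ..." output format, the two sample outputs, and the label_matches() helper are all made up for illustration and are not part of the course code. Both outputs pass the same regex check, although only one of them gives a reason that makes sense.

```python
import re


def label_matches(output: str, expected_label: str) -> bool:
    """Return True if the output contains 'Label: <expected_label>' (case-insensitive)."""
    pattern = rf"label:\s*{re.escape(expected_label)}\b"
    return re.search(pattern, output, flags=re.IGNORECASE) is not None


# Hypothetical model outputs for a negative product review.
good = "Label: negative\nReason: The reviewer says the battery died after two days."
bad = "Label: negative\nReason: The reviewer loved the battery life."

print(label_matches(good, "negative"))  # True
print(label_matches(bad, "negative"))   # True, even though the reason contradicts the label
```

The regex sees only the label, so it cannot separate these two outputs. Telling them apart requires something that actually reads the reason, which is the job of the judge.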
The judge is a second LLM call
Every test case now costs two API calls: one for the prompt under test and one for the judge. This is normal — the judge call is usually cheap (short input, low max_tokens) and the signal it returns is far more valuable than a simple string match.
The prefill + stop_sequences trick
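To get valid JSON from the judge without any wrapping text, we combine two Claude features. First, we prefill the assistant turn with "```json" so the model continues as if it had already opened a JSON code block. Second, we set stop_sequences=["```"] so generation halts when the model tries to write the closing fence. The returned text contains neither the prefill nor the stop sequence, so what remains is normally just the JSON object. This makes a parseable response very likely rather than guaranteed, so it is still worth handling a json.loads() failure.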
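The mechanics can be simulated offline, with no API call. The sketch below assumes a made-up reply (FULL_REPLY) and a hypothetical simulate_generation() helper. It only mimics what the API hands back: the text after the prefill, cut just before the stop sequence.

```python
import json

FENCE = "`" * 3            # three backticks, built this way to keep the example readable
PREFILL = FENCE + "json"   # what we put in the assistant turn
STOP = FENCE               # what we pass as the stop sequence

# Made-up text a model might write if it were allowed to produce a full fenced block.
FULL_REPLY = (
    PREFILL + '\n{"score": 8, "reasoning": "Label is right."}\n' + FENCE + "\nHope that helps!"
)


def simulate_generation(full_reply: str, prefill: str, stop: str) -> str:
    """Return the text after the prefill, truncated before the first stop sequence."""
    if not full_reply.startswith(prefill):
        raise ValueError("reply does not start with the prefill")
    continuation = full_reply[len(prefill):]  # the prefill is not echoed back
    return continuation.split(stop, 1)[0]     # generation halts before the stop sequence


text = simulate_generation(FULL_REPLY, PREFILL, STOP)
grade = json.loads(text)  # leading and trailing newlines are accepted by json.loads
print(grade["score"])
```

Without the stop sequence, the closing fence and the trailing chatter would stay in the text, and json.loads() would raise an error.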
03_llm_judge.py
import json

# chat() is the message-sending helper, assumed to be defined earlier in the course.


def grade_by_model(test_case, output):  # ①
    # Single quotes inside the f-string braces: backslash-escaped quotes are a syntax error there.
    eval_prompt = (
        "You are an expert reviewer judging the output of a sentiment-classification prompt.\n\n"
        f"Original task:\n<task>\n{test_case['task']}\n</task>\n\n"
        f"Model's output:\n<output>\n{output}\n</output>\n\n"
        f"What a good answer looks like:\n<criteria>\n{test_case['criteria']}\n</criteria>\n\n"
        "Score the output and explain your reasoning. Respond with JSON in this exact shape:\n"
        "{\n"
        "  \"strengths\": [\"short bullet\", \"...\"],\n"
        "  \"weaknesses\": [\"short bullet\", \"...\"],\n"
        "  \"reasoning\": \"one or two sentences\",\n"
        "  \"score\": <integer 1-10>\n"
        "}"
    )
    messages = [
        {"role": "user", "content": eval_prompt},
        {"role": "assistant", "content": "```json"},  # ②
    ]
    text = chat(messages, stop_sequences=["```"])  # ③
    return json.loads(text)  # ④

① grade_by_model() takes the test case (for context and criteria) and the model output to judge.
② Prefilling the assistant turn with "```json" makes the model start its response inside a JSON block.
③ stop_sequences=["```"] ends generation before the closing fence, leaving only the JSON content.
④ json.loads() parses the JSON string into a Python dict.
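Because the judge is itself a model, its JSON can occasionally be malformed or miss a field. One way to guard against that is to validate the parsed result before using it. The parse_grade() helper and REQUIRED_KEYS constant below are hypothetical names, not part of the course code. They check the shape that the prompt above asks for.

```python
import json

# The four fields the judge prompt asks for.
REQUIRED_KEYS = {"strengths", "weaknesses", "reasoning", "score"}


def parse_grade(text: str) -> dict:
    """Parse the judge's JSON and check it has the requested shape."""
    try:
        grade = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Judge did not return valid JSON: {exc}") from exc

    if not isinstance(grade, dict):
        raise ValueError("Judge JSON is not an object")

    missing = REQUIRED_KEYS - grade.keys()
    if missing:
        raise ValueError(f"Judge JSON is missing keys: {sorted(missing)}")

    score = grade["score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 10:
        raise ValueError(f"Score must be an integer from 1 to 10, got {score!r}")

    return grade


sample = '{"strengths": ["correct label"], "weaknesses": [], "reasoning": "Matches the criteria.", "score": 9}'
print(parse_grade(sample)["score"])
```

A caller could catch the ValueError and retry the judge call or mark the case as ungraded. Which of those is appropriate depends on how the eval results are used.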
The full eval loop
03_llm_judge.py (continued)
# run_prompt() and dataset are assumed to be defined earlier in the course.


def run_test_case(test_case):
    output = run_prompt(test_case)             # call 1: the prompt under test
    grade = grade_by_model(test_case, output)  # call 2: the judge
    return {
        "task": test_case["task"],
        "expected_label": test_case["expected_label"],
        "output": output,
        "score": grade["score"],
        "reasoning": grade["reasoning"],
        "strengths": grade["strengths"],
        "weaknesses": grade["weaknesses"],
    }


results = [run_test_case(case) for case in dataset]
for r in results:
    print(f"Score {r['score']}/10 | {r['reasoning']}")

$ python 03_llm_judge.py
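Per-case scores are easier to compare across prompt versions once they are aggregated. The sketch below assumes results shaped like the dicts returned by run_test_case(). The summarize() helper and the pass_threshold of 7 are illustrative choices, not part of the course code.

```python
from statistics import mean


def summarize(results, pass_threshold=7):
    """Aggregate judge scores: count, average, and share of cases at or above the threshold."""
    scores = [r["score"] for r in results]
    if not scores:
        return {"count": 0, "average": None, "pass_rate": None}
    passed = sum(1 for s in scores if s >= pass_threshold)
    return {
        "count": len(scores),
        "average": mean(scores),
        "pass_rate": passed / len(scores),
    }


# Made-up results; only the "score" field is needed here.
sample_results = [{"score": 9}, {"score": 6}, {"score": 3}]
summary = summarize(sample_results)
print(f"{summary['count']} cases, average {summary['average']:.1f}/10, "
      f"pass rate {summary['pass_rate']:.0%}")
```

The threshold is a judgment call. A stricter rubric or a noisier judge may call for a different cut-off.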
Knowledge check
Code Check
What does prefilling the assistant turn with "```json" accomplish?
Knowledge Check
What four fields does grade_by_model() return?
Recap — what you just learned
- ✓ Prefilling the assistant turn with "```json" and setting stop_sequences=["```"] gets clean JSON back from a Claude judge prompt
- ✓ grade_by_model() returns score (1-10), reasoning, strengths, and weaknesses
- ✓ run_test_case() chains run_prompt() and grade_by_model() for a single case, at a cost of two API calls
- ✓ An LLM judge provides far richer signal than a regex match