Session 9· 09· 20 min
Dual Grading: LLM + Syntax
What you'll learn
- ▸Build programmatic validators for Python, JSON, and regex syntax
- ▸Combine LLM semantic score with syntax score into a blended grade
- ▸Know when to use dual grading versus LLM-only scoring
Why two grades?
An LLM judge can tell you whether code is logically correct and idiomatic. What it cannot reliably do is guarantee that the code is syntactically valid: a subtle indentation error or an unclosed bracket can slip past it. A programmatic syntax validator adds a fast, deterministic check whose result does not depend on how the judge reads the code. Blending the two scores covers both semantic quality and parse validity.
LLM judge
semantic
- •Understands intent and correctness
- •Catches wrong API calls
- •Evaluates idiomatic style
- •Can be fooled by plausible-looking but broken code
Syntax validator
programmatic
- •Deterministic: parse succeeds or fails
- •Fast and cheap (no API call)
- •Cannot evaluate correctness
- •Reliably catches anything the parser rejects
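To make the contrast concrete, here is a minimal sketch of a parse check on code that reads fine at a glance. The snippet and the helper name `parses` are made up for illustration and are not part of the lesson's files.

```python
import ast

# Hypothetical model output: plausible at a glance, but the list is never closed.
snippet = (
    "def total(prices):\n"
    "    values = [p * 1.2 for p in prices\n"
    "    return sum(values)\n"
)
fixed = snippet.replace("in prices", "in prices]")

def parses(source: str) -> bool:
    """Return True if Python's parser accepts the source. The code is never run."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

print(parses(snippet))  # False
print(parses(fixed))    # True
```

The check gives the same answer every time and needs no API call, which is the property the syntax column above relies on.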
The three validators
09_dual_grading.py
import ast, json, re

def validate_json(text):  # ①
    try:
        json.loads(text.strip())
        return 10
    except json.JSONDecodeError:
        return 0

def validate_python(text):  # ②
    try:
        ast.parse(text.strip())
        return 10
    except (SyntaxError, ValueError):  # ValueError: e.g. null bytes on some Python versions
        return 0

def validate_regex(text):  # ③
    try:
        re.compile(text.strip())
        return 10
    except re.error:
        return 0

①validate_json() uses json.loads: either valid JSON (10) or not (0). No partial credit.
②validate_python() uses ast.parse, which rejects anything the parser cannot read. It does not compile or run the code, so compile-time checks (for example, return outside a function) are not applied.
③validate_regex() uses re.compile, which checks that the pattern compiles. It does not test the pattern against any strings.
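The sketch below probes what each underlying check accepts and rejects. The helper `passes` is a hypothetical stand-in that mirrors the validators' try/except pattern so the block runs on its own. The sample inputs are invented.

```python
import ast
import json
import re

def passes(check, text, errors):
    """Hypothetical helper: 10 if check(text) succeeds, 0 if it raises `errors`."""
    try:
        check(text.strip())
        return 10
    except errors:
        return 0

# JSON: only strict, double-quoted syntax is accepted.
print(passes(json.loads, '{"a": 1}', json.JSONDecodeError))  # 10
print(passes(json.loads, "{'a': 1}", json.JSONDecodeError))  # 0

# Python: malformed code is rejected, but compile-time checks are skipped.
print(passes(ast.parse, "def f(:\n    pass", SyntaxError))   # 0
print(passes(ast.parse, "return 1", SyntaxError))            # 10
print(passes(ast.parse, "", SyntaxError))                    # 10 (empty input parses)

# Regex: the pattern only has to compile.
print(passes(re.compile, r"(\d+", re.error))                 # 0
print(passes(re.compile, r"\d+", re.error))                  # 10
```

Note that an empty output earns a full syntax score from the Python check. If empty responses are possible in your dataset, it may be worth handling them before the validator runs.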
grade_syntax() dispatcher
09_dual_grading.py (continued)
VALIDATORS = {
    "python": validate_python,
    "json": validate_json,
    "regex": validate_regex,
}

def grade_syntax(test_case, output):
    validator = VALIDATORS.get(test_case["format"])
    if validator is None:
        return 10  # No validator: assume syntactically valid
    return validator(output)
The blended score
We take a simple average of the LLM semantic score and the programmatic syntax score. Because the syntax score is 0 whenever parsing fails, a syntactically broken submission can earn at most 5, however generous the LLM judge was.
09_dual_grading.py (continued)
def run_test_case(test_case):
    output = run_prompt(test_case)
    model_score = grade_by_model(test_case, output)["score"]  # ①
    syntax_score = grade_syntax(test_case, output)  # ②
    blended = (model_score + syntax_score) / 2  # ③
    return {
        "task": test_case["task"],
        "format": test_case["format"],
        "output": output,
        "model_score": model_score,
        "syntax_score": syntax_score,
        "score": blended,
    }

results = [run_test_case(case) for case in dataset]
generate_eval_html_report(results, "dual_grading_report.html")

①model_score comes from grade_by_model(): semantic quality, 1-10.
②syntax_score comes from grade_syntax(): parse success (10) or failure (0).
③Blended = simple average. A syntactically invalid submission caps at 5 no matter what the LLM judge says.
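To see the arithmetic end to end without calling a model, here is a self-contained sketch. Everything marked as a stand-in is an assumption for illustration: the canned outputs, the fixed judge scores, and the compact `PARSERS` table replace the real run_prompt(), grade_by_model(), and validators.

```python
import ast
import json
import re

# Parser and expected exception per format, mirroring the three validators.
PARSERS = {
    "python": (ast.parse, SyntaxError),
    "json": (json.loads, json.JSONDecodeError),
    "regex": (re.compile, re.error),
}

def grade_syntax(test_case, output):
    entry = PARSERS.get(test_case["format"])
    if entry is None:
        return 10  # same fallback as the dispatcher: no validator, assume valid
    parse, error = entry
    try:
        parse(output.strip())
        return 10
    except error:
        return 0

# Stand-ins (assumptions): canned outputs and judge scores instead of API calls.
CANNED = {
    "sum a list": ("def total(xs):\n    return sum(xs)", 9),
    "user record": ("{'name': 'Ada'}", 8),  # single quotes: not valid JSON
    "haiku": ("An old silent pond", 7),
}

def run_prompt(test_case):
    return CANNED[test_case["task"]][0]

def grade_by_model(test_case, output):
    return {"score": CANNED[test_case["task"]][1]}

def run_test_case(test_case):
    output = run_prompt(test_case)
    model_score = grade_by_model(test_case, output)["score"]
    syntax_score = grade_syntax(test_case, output)
    return {
        "task": test_case["task"],
        "model_score": model_score,
        "syntax_score": syntax_score,
        "score": (model_score + syntax_score) / 2,
    }

dataset = [
    {"task": "sum a list", "format": "python"},
    {"task": "user record", "format": "json"},
    {"task": "haiku", "format": "text"},
]
results = [run_test_case(case) for case in dataset]
for r in results:
    print(r["task"], r["model_score"], r["syntax_score"], r["score"])
# sum a list 9 10 9.5
# user record 8 0 4.0
# haiku 7 10 8.5
```

The invalid JSON drops from the judge's 8 to a blended 4.0. The "haiku" row shows a side effect of the fallback: a format with no validator receives a syntax score of 10, which lifts the blended grade above the judge's own 7.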
When to use dual grading
Use dual grading
- •Code generation tasks (Python, SQL, JSON, regex)
- •Any output that must parse/compile
- •When you need a hard syntax gate
- •Production evals where broken code = critical failure
LLM-only is fine
- •Free-text classification tasks
- •Summarisation and translation
- •Creative writing
- •Any task where "syntax" is not meaningful
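The simple average lowers the grade of a broken submission but does not zero it. Where a strict gate is wanted, one possible variant is sketched below. `blend_gated` is a hypothetical alternative, not the blend used in 09_dual_grading.py.

```python
def blend_average(model_score, syntax_score):
    """The lesson's blend: a plain mean of the two scores."""
    return (model_score + syntax_score) / 2

def blend_gated(model_score, syntax_score):
    """Hypothetical stricter variant: a failed parse zeroes the grade,
    otherwise the judge's score stands unchanged."""
    return 0 if syntax_score == 0 else model_score

for model_score, syntax_score in [(8, 10), (8, 0), (10, 0)]:
    print(
        model_score,
        syntax_score,
        blend_average(model_score, syntax_score),
        blend_gated(model_score, syntax_score),
    )
# 8 10 9.0 8
# 8 0 4.0 0
# 10 0 5.0 0
```

Which blend fits depends on whether a report reader should still see partial credit for code that cannot run.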
Session 9 complete
You can now build a full eval pipeline: hand-written and generated datasets, LLM-as-judge scoring, HTML reports, adversarial cases, structured output evals, and dual LLM+syntax grading for code generation. These are the core techniques used in production AI evaluation systems.
Knowledge check
Code Check
A model returns syntactically invalid Python for a code-gen task. What is the blended score if the LLM judge gives it 8/10?
Knowledge Check
Why does validate_python() use ast.parse() instead of exec()?
Knowledge Check
Which type of task does NOT benefit from dual grading?
Recap — what you just learned
- ✓validate_python() uses ast.parse, validate_json() uses json.loads, validate_regex() uses re.compile
- ✓grade_syntax() dispatches to the right validator based on test_case["format"]
- ✓Blended score = (model_score + syntax_score) / 2 — syntax failure caps the score at 5
- ✓Use dual grading for any output that must parse/compile; LLM-only for free text