Session 9 · 09 · 20 min

Dual Grading: LLM + Syntax

What you'll learn
  • Build programmatic validators for Python, JSON, and regex syntax
  • Combine LLM semantic score with syntax score into a blended grade
  • Know when to use dual grading versus LLM-only scoring

Why two grades?

An LLM judge can tell you whether code is logically correct and idiomatic. What it cannot do reliably is guarantee that the code is syntactically valid: a subtle indentation error or an unclosed bracket can slip through. A programmatic syntax validator adds a fast, deterministic check that does not depend on the judge's reading of the code. Blending the two scores combines semantic judgment with a guaranteed parse check.

LLM judge (semantic)
  • Understands intent and correctness
  • Catches wrong API calls
  • Evaluates idiomatic style
  • Can be fooled by plausible-looking but broken code
Syntax validator (programmatic)
  • Deterministic: parse succeeds or fails
  • Fast and cheap (no API call)
  • Cannot evaluate correctness
  • Catches all syntax errors reliably

The three validators

09_dual_grading.py
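import ast
import json
import re

def validate_json(text):                                           # ①
    """Return 10 if text parses as JSON, else 0."""
    try:
        json.loads(text.strip())
        return 10
    except json.JSONDecodeError:
        return 0

def validate_python(text):                                         # ②
    """Return 10 if text parses as Python source, else 0. Never executes it."""
    try:
        ast.parse(text.strip())
        return 10
    except (SyntaxError, ValueError):  # ValueError: null bytes on older Pythons
        return 0

def validate_regex(text):                                          # ③
    """Return 10 if text compiles as a regular expression, else 0."""
    try:
        re.compile(text.strip())
        return 10
    except re.error:
        return 0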
① validate_json() uses json.loads: the text is either valid JSON (10) or not (0). No partial credit.
② validate_python() uses ast.parse, which catches syntax errors for the Python version running the validator. It does not run the code.
③ validate_regex() uses re.compile, which checks that the pattern compiles. It does not test the pattern against any strings.
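To see what "parses" does and does not cover, here is a minimal sketch that calls the same three standard-library parsers directly. The helper parses() and the sample strings are hypothetical and exist only for this illustration; they are not part of 09_dual_grading.py.

```python
import ast
import json
import re

def parses(parser, text, error_type):
    """Return True if parser(text) succeeds, False if it raises error_type."""
    try:
        parser(text)
        return True
    except error_type:
        return False

# Python: an unclosed bracket is a SyntaxError, but a call to an
# undefined function still parses because nothing is executed.
python_unclosed = parses(ast.parse, "print((1 + 2)", SyntaxError)
python_undefined = parses(ast.parse, "undefined_function(42)", SyntaxError)

# JSON: single-quoted keys are fine in a Python dict literal but not in JSON.
json_single_quotes = parses(json.loads, "{'a': 1}", json.JSONDecodeError)
json_double_quotes = parses(json.loads, '{"a": 1}', json.JSONDecodeError)

# Regex: an unbalanced parenthesis fails to compile.
regex_unbalanced = parses(re.compile, r"(\d+", re.error)
regex_balanced = parses(re.compile, r"(\d+)", re.error)

print(python_unclosed, python_undefined)       # False True
print(json_single_quotes, json_double_quotes)  # False True
print(regex_unbalanced, regex_balanced)        # False True
```

The second Python case is the important one: code that would fail at runtime still earns a syntax score of 10, which is why the LLM judge is still needed for correctness.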

grade_syntax() dispatcher

09_dual_grading.py (continued)
VALIDATORS = {
    "python": validate_python,
    "json":   validate_json,
    "regex":  validate_regex,
}

def grade_syntax(test_case, output):
    validator = VALIDATORS.get(test_case["format"])
    if validator is None:
        return 10  # No validator — assume syntactically valid
    return validator(output)
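One consequence of the default branch is worth seeing in action. The sketch below repeats the dispatcher with only the JSON validator registered (trimmed for brevity) and feeds it the same broken output under three format labels. The sample test cases are hypothetical.

```python
import json

def validate_json(text):
    try:
        json.loads(text.strip())
        return 10
    except json.JSONDecodeError:
        return 0

VALIDATORS = {"json": validate_json}  # trimmed to one entry for this sketch

def grade_syntax(test_case, output):
    validator = VALIDATORS.get(test_case["format"])
    if validator is None:
        return 10  # no validator registered for this format
    return validator(output)

broken = '{"name": "Ada",}'  # trailing comma is invalid JSON

known = grade_syntax({"format": "json"}, broken)        # validator runs
unknown = grade_syntax({"format": "markdown"}, broken)  # nothing registered
typo = grade_syntax({"format": "JSON"}, broken)         # lookup is case-sensitive

print(known, unknown, typo)  # 0 10 10
```

Because an unrecognised format silently scores 10, a misspelled or differently cased "format" value in the dataset skips validation entirely. Normalising the key or logging unknown formats are possible safeguards if that matters for your dataset.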

The blended score

We take a simple average of the LLM semantic score and the programmatic syntax score. If a submission is syntactically broken, its syntax score of 0 halves the blended score, even if the LLM judge was generous.

09_dual_grading.py (continued)
def run_test_case(test_case):
    output       = run_prompt(test_case)
    model_score  = grade_by_model(test_case, output)["score"]       # ①
    syntax_score = grade_syntax(test_case, output)                  # ②
    blended      = (model_score + syntax_score) / 2                 # ③
    return {
        "task":         test_case["task"],
        "format":       test_case["format"],
        "output":       output,
        "model_score":  model_score,
        "syntax_score": syntax_score,
        "score":        blended,
    }

results = [run_test_case(case) for case in dataset]
generate_eval_html_report(results, "dual_grading_report.html")
① model_score comes from grade_by_model(): semantic quality, 1-10.
② syntax_score comes from grade_syntax(): parse success (10) or failure (0).
③ Blended = simple average. A syntactically invalid submission caps at 5 no matter what the LLM judge says.
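To make the arithmetic concrete, here is the averaging step on its own. The helper name blend() is hypothetical and not part of 09_dual_grading.py; the score pairs are made-up inputs.

```python
def blend(model_score, syntax_score):
    """Simple average of the semantic score and the syntax score."""
    return (model_score + syntax_score) / 2

broken_good = blend(9, 0)      # judge liked it, but it does not parse
broken_perfect = blend(10, 0)  # best possible score for broken output
valid_good = blend(9, 10)      # same judge score, output parses
valid_weak = blend(1, 10)      # judge disliked it, but it parses

print(broken_good, broken_perfect, valid_good, valid_weak)  # 4.5 5.0 9.5 5.5
```

A side effect of plain averaging is that a weak but parseable answer (5.5) lands slightly above a perfect-looking but broken one (5.0). Whether that ordering is acceptable depends on how much a parse failure should cost in your eval.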

When to use dual grading

Use dual grading
  • Code generation tasks (Python, JSON, regex; SQL too, once you add a validator for it)
  • Any output that must parse/compile
  • When you need a hard syntax gate
  • Production evals where broken code = critical failure
LLM-only is fine
  • Free-text classification tasks
  • Summarisation and translation
  • Creative writing
  • Any task where "syntax" is not meaningful
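The averaged score in this lesson lowers a broken submission's grade but never zeroes it. If you want a strict gate, one possible variant is sketched below. It is an assumption for illustration, not part of 09_dual_grading.py, and the name gated_score() is hypothetical.

```python
def gated_score(model_score, syntax_score):
    """Hypothetical hard gate: a syntax failure zeroes the score outright."""
    if syntax_score == 0:
        return 0
    # Output parses: fall back to the same simple average used in the lesson.
    return (model_score + syntax_score) / 2

gated_broken = gated_score(9, 0)   # does not parse
gated_valid = gated_score(9, 10)   # parses

print(gated_broken, gated_valid)  # 0 9.5
```

With this variant a broken submission can never outrank a parseable one, at the cost of losing the judge's signal for failed cases. The model_score is still available separately in the result record if you keep it there.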
Session 9 complete
You can now build a full eval pipeline: hand-written and generated datasets, LLM-as-judge scoring, HTML reports, adversarial cases, structured output evals, and dual LLM+syntax grading for code generation. These are the core techniques used in production AI evaluation systems.

Knowledge check

Code Check
A model returns syntactically invalid Python for a code-gen task. What is the blended score if the LLM judge gives it 8/10?
Knowledge Check
Why does validate_python() use ast.parse() instead of exec()?
Knowledge Check
Which type of task does NOT benefit from dual grading?
Recap — what you just learned
  • validate_python() uses ast.parse, validate_json() uses json.loads, validate_regex() uses re.compile
  • grade_syntax() dispatches to the right validator based on test_case["format"]
  • Blended score = (model_score + syntax_score) / 2 — syntax failure caps the score at 5
  • Use dual grading for any output that must parse/compile; LLM-only for free text