Session 9 · 09 · 20 min

Dual Grading: LLM + Syntax

What you'll learn

▸Build programmatic validators for Python, JSON, and regex syntax
▸Combine LLM semantic score with syntax score into a blended grade
▸Know when to use dual grading versus LLM-only scoring

Why two grades?

An LLM judge can tell you whether code is logically correct and idiomatic. What it cannot do reliably is guarantee that the code is syntactically valid: a subtle indentation error or an unclosed bracket can slip past it. A programmatic syntax validator adds a fast, deterministic check that does not depend on the judge's reading of the code. Blending the two scores covers both semantic quality and parse validity.

LLM judge

semantic

•Understands intent and correctness
•Catches wrong API calls
•Evaluates idiomatic style
•Can be fooled by plausible-looking but broken code

Syntax validator

programmatic

•Deterministic: parse succeeds or fails
•Fast and cheap (no API call)
•Cannot evaluate correctness
•Catches all syntax errors reliably

The three validators

09_dual_grading.py

import ast, json, re

def validate_json(text):                                           # ①
    try:
        json.loads(text.strip())
        return 10
    except json.JSONDecodeError:
        return 0

def validate_python(text):                                         # ②
    try:
        ast.parse(text.strip())
        return 10
    except (SyntaxError, ValueError):  # ValueError: null bytes on Python < 3.12
        return 0

def validate_regex(text):                                          # ③
    try:
        re.compile(text.strip())
        return 10
    except re.error:
        return 0

①validate_json() uses json.loads — either valid JSON (10) or not (0). No partial credit.

②validate_python() uses ast.parse — catches all syntax errors. Does not run the code.

③validate_regex() uses re.compile — checks the pattern compiles. Does not test it against strings.
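To see the all-or-nothing behaviour concretely, the sketch below runs each validator on one valid and one broken input. The validators are repeated so the snippet runs on its own; the sample strings are made up for illustration and are not part of the course dataset.

```python
import ast
import json
import re

def validate_json(text):
    try:
        json.loads(text.strip())
        return 10
    except json.JSONDecodeError:
        return 0

def validate_python(text):
    try:
        ast.parse(text.strip())
        return 10
    except SyntaxError:
        return 0

def validate_regex(text):
    try:
        re.compile(text.strip())
        return 10
    except re.error:
        return 0

# Hypothetical sample outputs: one valid and one broken per format.
samples = [
    (validate_json,   '{"name": "Ada", "age": 36}'),
    (validate_json,   "{'name': 'Ada'}"),                   # single quotes are not valid JSON
    (validate_python, "def add(a, b):\n    return a + b"),
    (validate_python, "def add(a, b)\n    return a + b"),   # missing colon
    (validate_regex,  r"\d{3}-\d{4}"),
    (validate_regex,  r"(\d{3}-\d{4}"),                     # unclosed group
]

scores = [validator(text) for validator, text in samples]
print(scores)  # [10, 0, 10, 0, 10, 0]
```

A function that parses but returns the wrong answer would still score 10 here, which is why the semantic score from the LLM judge is needed alongside.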

grade_syntax() dispatcher

09_dual_grading.py (continued)

VALIDATORS = {
    "python": validate_python,
    "json":   validate_json,
    "regex":  validate_regex,
}

def grade_syntax(test_case, output):
    validator = VALIDATORS.get(test_case["format"])
    if validator is None:
        return 10  # No validator — assume syntactically valid
    return validator(output)
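A short usage sketch of the dispatcher follows. To keep it self-contained, the three validators are restated compactly through a hypothetical `_score` helper (not part of the lesson's file); the test cases are minimal dictionaries invented for illustration, since `grade_syntax()` only reads the "format" key.

```python
import ast
import json
import re

def _score(parse, text, error):
    """Return 10 if parse(text) succeeds, 0 if it raises `error`."""
    try:
        parse(text.strip())
        return 10
    except error:
        return 0

# Compact stand-ins for the three validators defined earlier.
VALIDATORS = {
    "python": lambda text: _score(ast.parse, text, SyntaxError),
    "json":   lambda text: _score(json.loads, text, json.JSONDecodeError),
    "regex":  lambda text: _score(re.compile, text, re.error),
}

def grade_syntax(test_case, output):
    validator = VALIDATORS.get(test_case["format"])
    if validator is None:
        return 10  # No validator: assume syntactically valid
    return validator(output)

# Hypothetical test cases; only the "format" key matters here.
json_score  = grade_syntax({"format": "json"}, "[1, 2, 3]")
regex_score = grade_syntax({"format": "regex"}, "[a-z")            # unterminated character set
sql_score   = grade_syntax({"format": "sql"}, "SELEC * FRM users")  # no SQL validator registered
print(json_score, regex_score, sql_score)  # 10 0 10
```

The last case is worth noticing: a format with no registered validator receives full syntax marks even when the output is clearly broken, so any format you want checked needs its own entry in VALIDATORS.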

The blended score

We take a simple average of the LLM semantic score and the programmatic syntax score. If a submission is syntactically broken, its syntax score of 0 halves the blended score, even if the LLM judge was generous.

09_dual_grading.py (continued)

def run_test_case(test_case):
    output       = run_prompt(test_case)
    model_score  = grade_by_model(test_case, output)["score"]       # ①
    syntax_score = grade_syntax(test_case, output)                  # ②
    blended      = (model_score + syntax_score) / 2                 # ③
    return {
        "task":         test_case["task"],
        "format":       test_case["format"],
        "output":       output,
        "model_score":  model_score,
        "syntax_score": syntax_score,
        "score":        blended,
    }

results = [run_test_case(case) for case in dataset]
generate_eval_html_report(results, "dual_grading_report.html")

①model_score comes from grade_by_model() — semantic quality, 1-10.

②syntax_score comes from grade_syntax() — parse success (10) or failure (0).

③Blended = simple average. A syntactically invalid submission caps at 5 no matter what the LLM judge says.
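The arithmetic behind the cap can be checked directly. The sketch below isolates the averaging step in a small `blend` helper (a name introduced here for illustration) and evaluates it on a few made-up judge scores.

```python
def blend(model_score, syntax_score):
    """Simple average, matching the formula in run_test_case()."""
    return (model_score + syntax_score) / 2

# Hypothetical judge scores paired with a parse failure (0) or success (10).
print(blend(10, 0))   # 5.0  best possible judge score, broken syntax
print(blend(9, 0))    # 4.5
print(blend(1, 10))   # 5.5  worst possible judge score, valid syntax
print(blend(9, 10))   # 9.5
```

With a judge scale of 1-10, broken submissions land between 0.5 and 5.0 and valid ones between 5.5 and 10, so the two ranges never overlap and a blended score of 5 or below always signals a parse failure.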

When to use dual grading

Use dual grading

•Code generation tasks (Python, SQL, JSON, regex)
•Any output that must parse/compile
•When you need a hard syntax gate
•Production evals where broken code = critical failure

LLM-only is fine

•Free-text classification tasks
•Summarisation and translation
•Creative writing
•Any task where "syntax" is not meaningful
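Note that the simple average behaves as a cap rather than a strict gate: a broken submission can still score up to 5. Where a hard syntax gate is required, one possible variant is sketched below. It is an assumption for illustration, not part of the lesson's code, and `gated_score` is a hypothetical name.

```python
def gated_score(model_score, syntax_score):
    """Hypothetical variant: a parse failure zeroes the grade outright."""
    if syntax_score == 0:
        return 0
    return model_score  # syntax is valid, so the judge's score stands

print(gated_score(9, 0))    # 0  broken syntax fails regardless of the judge
print(gated_score(9, 10))   # 9  valid syntax passes the judge's score through
```

Whether a gate or an average fits better depends on how costly a broken output is in your setting; the average keeps some signal from the judge, while the gate treats any parse failure as a total failure.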

Session 9 complete

You can now build a full eval pipeline: hand-written and generated datasets, LLM-as-judge scoring, HTML reports, adversarial cases, structured output evals, and dual LLM+syntax grading for code generation. These are the core techniques used in production AI evaluation systems.

Knowledge check

Code Check

A model returns syntactically invalid Python for a code-gen task. What is the blended score if the LLM judge gives it 8/10?

Knowledge Check

Why does validate_python() use ast.parse() instead of exec()?

Knowledge Check

Which type of task does NOT benefit from dual grading?

Recap — what you just learned

✓validate_python() uses ast.parse, validate_json() uses json.loads, validate_regex() uses re.compile
✓grade_syntax() dispatches to the right validator based on test_case["format"]
✓Blended score = (model_score + syntax_score) / 2 — syntax failure caps the score at 5
✓Use dual grading for any output that must parse/compile; LLM-only for free text
