Session 10 · 05 · 15 min

A/B Testing Prompts

What you'll learn

▸ Run both prompt variants against the full dataset in a single script
▸ Compute per-case blended scores by averaging judge and syntax grades
▸ Read the delta printout to decide which variant wins

The A/B pattern

Once you have two wrappers (weak and engineered), run them both against every case in the dataset and compare blended scores side-by-side. This makes the improvement concrete and auditable.

run_one_case()

ab_test.py

def run_one_case(test_case, wrap_user):
    output = run_generation(test_case, wrap_user)
    mg = grade_by_model(test_case, output)
    syn = grade_syntax(output, test_case)
    blended = (mg["score"] + syn) / 2
    return {
        "output_preview": output.strip()[:160],
        "judge": mg["score"],
        "syntax": syn,
        "blended": blended,
        "reasoning": mg.get("reasoning", ""),
    }

① wrap_user is passed as a parameter, so the same function runs both variants without duplication.

② run_generation calls the model; grade_by_model calls the LLM judge.

③ grade_syntax is a deterministic format check (valid JSON, valid Python, etc.).

④ blended is the arithmetic mean of the judge score and the syntax score, giving a single comparable number. This assumes both grades are on the same 0-1 scale.

⑤ output_preview is truncated to 160 characters for the summary table; the full output is in the HTML report.
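To make the deterministic check concrete, here is a minimal sketch of what a grade_syntax function could look like. It is not the course's implementation: the "format" key on test_case is a hypothetical field assumed for illustration, and the 1.0 / 0.0 return values are an assumption chosen to match a 0-1 scale.

```python
import ast
import json


def grade_syntax(output, test_case):
    """Return 1.0 if the output parses in the expected format, else 0.0.

    Assumption: test_case carries a hypothetical "format" key
    ("json" or "python"); a real dataset may name this differently.
    """
    text = output.strip()
    fmt = test_case.get("format", "json")
    try:
        if fmt == "json":
            json.loads(text)      # raises json.JSONDecodeError on bad JSON
        elif fmt == "python":
            ast.parse(text)       # raises SyntaxError on bad Python
        else:
            return 0.0            # unknown format: nothing to validate against
    except (SyntaxError, ValueError):  # JSONDecodeError subclasses ValueError
        return 0.0
    return 1.0
```

Because the check only parses the text, the same output always receives the same score, which is what lets it offset some of the run-to-run variation in the judge score.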

Running both variants

ab_test.py

# Run both variants
weak_rows = [run_one_case(tc, wrap_user_weak) for tc in DEMO_DATASET]
eng_rows  = [run_one_case(tc, wrap_user_engineered) for tc in DEMO_DATASET]

Keep DEMO_DATASET identical for both loops. Do not shuffle or filter between runs.

summarize() — per-case blended scores

ab_test.py

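def summarize(label, rows):
    """Print the mean and per-case blended scores, then return the mean."""
    if not rows:
        # Guard against an empty dataset, which would divide by zero below.
        raise ValueError(f"{label}: no rows to summarize")
    avg = sum(r["blended"] for r in rows) / len(rows)
    print(f"\n=== {label} (n={len(rows)}) ===")
    print(f"  Mean blended: {avg:.3f}")
    for i, r in enumerate(rows):
        # One line per case so individual regressions stay visible.
        print(f"  [{i}] judge={r['judge']:.2f} syn={r['syntax']:.2f} "
              f"blended={r['blended']:.2f}  {r['output_preview'][:60]!r}")
    return avg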

weak_avg = summarize("WEAK", weak_rows)
eng_avg  = summarize("ENGINEERED", eng_rows)

Delta comparison printout

ab_test.py

delta = eng_avg - weak_avg
print(f"\nDelta (engineered - weak): {delta:+.3f}")
if delta > 0.05:
    print("  Verdict: ENGINEERED wins")
elif delta < -0.05:
    print("  Verdict: WEAK wins (unexpected — investigate)")
else:
    print("  Verdict: no meaningful difference")

What delta threshold to use

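The 0.05 threshold is a practical heuristic for a 0-1 blended score, not a statistical test. On a 5-case demo dataset the noise is high, so you need a larger delta: if a single case flips its syntax score from 0 to 1, that case's blended score moves by 0.5 and the mean moves by 0.1. On a 50-case dataset, a delta of 0.02 may already be meaningful. Always look at the per-case breakdown, because a high average can hide a regression on one important case.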
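One way to surface a hidden regression is to pair the rows by index and compute a delta per case. The sketch below assumes each row is a dict with a "blended" key, as returned by run_one_case; the per_case_deltas helper, the regression_tol parameter, and the demo scores are hypothetical and exist only for illustration.

```python
def per_case_deltas(weak_rows, eng_rows, regression_tol=0.0):
    """Pair rows by index and report engineered - weak for each case."""
    if len(weak_rows) != len(eng_rows):
        # Pairing by index only makes sense if both runs used the same cases.
        raise ValueError("both variants must be run on the same cases")
    report = []
    for i, (w, e) in enumerate(zip(weak_rows, eng_rows)):
        d = e["blended"] - w["blended"]
        report.append({"case": i, "delta": d, "regressed": d < -regression_tol})
    return report


# Made-up scores, only to show the shape of the output.
demo_weak = [{"blended": 0.50}, {"blended": 0.90}, {"blended": 0.60}]
demo_eng = [{"blended": 0.95}, {"blended": 0.40}, {"blended": 0.90}]

for row in per_case_deltas(demo_weak, demo_eng):
    flag = "  <-- regression" if row["regressed"] else ""
    print(f"  [{row['case']}] delta={row['delta']:+.2f}{flag}")
```

In this made-up data the mean blended score rises by about 0.083, which clears the 0.05 threshold, yet case 1 drops by 0.50. The length check also enforces the rule that both variants run on the identical dataset.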

Temperature and reproducibility

If temperature > 0, the same prompt can produce different scores across runs. Run the A/B test at least twice and average the deltas before committing to a winner.
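Repeating the test and averaging can be sketched as follows. The run_ab_once callable is a hypothetical stand-in for whatever code runs both variants over the dataset and returns eng_avg - weak_avg; the fixed deltas replayed below are invented so the example is deterministic.

```python
import statistics


def average_delta(run_ab_once, n_trials=3):
    """Call run_ab_once() n_trials times and summarize the deltas.

    run_ab_once is a hypothetical zero-argument callable that runs both
    variants and returns (eng_avg - weak_avg) for that trial.
    """
    if n_trials < 2:
        raise ValueError("need at least two trials to estimate spread")
    deltas = [run_ab_once() for _ in range(n_trials)]
    return {
        "deltas": deltas,
        "mean": statistics.mean(deltas),
        "stdev": statistics.stdev(deltas),  # sample standard deviation
    }


# Stand-in for real runs: replays invented deltas instead of calling a model.
_fake_deltas = iter([0.04, 0.08, 0.06])
result = average_delta(lambda: next(_fake_deltas), n_trials=3)
print(f"mean delta={result['mean']:+.3f}  stdev={result['stdev']:.3f}")
```

Reporting the spread alongside the mean is a design choice: if the standard deviation across trials is as large as the mean delta, the averaged result is still too noisy to pick a winner.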

Code Check

After running the A/B test, the engineered variant averages 0.82 and the weak variant averages 0.79. Delta = +0.03. What should you conclude?
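One way to check your answer is to apply the same verdict rule used in the delta printout to these two averages. The verdict function below is a hypothetical wrapper around that rule, with the 0.05 threshold as its default.

```python
def verdict(eng_avg, weak_avg, threshold=0.05):
    """Apply the delta rule from the comparison printout and return the verdict."""
    delta = eng_avg - weak_avg
    if delta > threshold:
        return "ENGINEERED wins"
    if delta < -threshold:
        return "WEAK wins (unexpected, investigate)"
    return "no meaningful difference"


print(verdict(0.82, 0.79))  # prints "no meaningful difference"
```

A delta of +0.03 sits inside the ±0.05 band, so the rule does not declare a winner; the per-case breakdown and repeated trials are the next things to look at.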

Recap — what you just learned

✓ run_one_case() accepts wrap_user as a parameter so both variants use the same infrastructure
✓ Blended score = average of judge score and deterministic syntax score
✓ summarize() prints per-case breakdowns so regressions on individual cases are visible
✓ A delta above 0.05 on a representative dataset is a reasonable signal that the change helped; small datasets need a larger delta
✓ Run multiple trials at temperature > 0 before declaring a winner
