Session 10· 05· 15 min
A/B Testing Prompts
What you'll learn
- ▸Run both prompt variants against the full dataset in a single script
- ▸Compute per-case blended scores by averaging judge and syntax grades
- ▸Read the delta printout to decide which variant wins
The A/B pattern
Once you have two wrappers (weak and engineered), run them both against every case in the dataset and compare blended scores side-by-side. This makes the improvement concrete and auditable.
run_one_case()
ab_test.py
def run_one_case(test_case, wrap_user):
output = run_generation(test_case, wrap_user)
mg = grade_by_model(test_case, output)
syn = grade_syntax(output, test_case)
blended = (mg["score"] + syn) / 2
return {
"output_preview": output.strip()[:160],
"judge": mg["score"],
"syntax": syn,
"blended": blended,
"reasoning": mg.get("reasoning", ""),
}①wrap_user is passed as a parameter — the same function runs both variants without duplication.
②run_generation calls the model; grade_by_model calls the LLM judge.
③grade_syntax is a deterministic format check (valid JSON, valid Python, etc.).
④blended is the arithmetic mean of judge score and syntax score — a single comparable number.
⑤output_preview is truncated to 160 chars for the summary table; full output is in the HTML report.
Running both variants
ab_test.py
# Run both variants
weak_rows = [run_one_case(tc, wrap_user_weak) for tc in DEMO_DATASET]
eng_rows = [run_one_case(tc, wrap_user_engineered) for tc in DEMO_DATASET]Keep DEMO_DATASET identical for both loops. Do not shuffle or filter between runs.
summarize() — per-case blended scores
ab_test.py
def summarize(label, rows):
avg = sum(r["blended"] for r in rows) / len(rows)
print(f"\n=== {label} (n={len(rows)}) ===")
print(f" Mean blended: {avg:.3f}")
for i, r in enumerate(rows):
print(f" [{i}] judge={r['judge']:.2f} syn={r['syntax']:.2f} "
f"blended={r['blended']:.2f} {r['output_preview'][:60]!r}")
return avg
weak_avg = summarize("WEAK", weak_rows)
eng_avg = summarize("ENGINEERED", eng_rows)Delta comparison printout
ab_test.py
delta = eng_avg - weak_avg
print(f"\nDelta (engineered - weak): {delta:+.3f}")
if delta > 0.05:
print(" Verdict: ENGINEERED wins")
elif delta < -0.05:
print(" Verdict: WEAK wins (unexpected — investigate)")
else:
print(" Verdict: no meaningful difference")What delta threshold to use
The 0.05 threshold is a practical heuristic for a 0-1 blended score. For a 5-case demo dataset you need a larger delta (noise is high). For a 50-case dataset, 0.02 may be meaningful. Always look at the per-case breakdown — a high average can hide a regression on one important case.
Temperature and reproducibility
If temperature > 0, the same prompt can produce different scores across runs. Run the A/B test at least twice and average the deltas before committing to a winner.
Code Check
After running the A/B test, the engineered variant averages 0.82 and the weak variant averages 0.79. Delta = +0.03. What should you conclude?
Recap — what you just learned
- ✓run_one_case() accepts wrap_user as a parameter so both variants use the same infrastructure
- ✓Blended score = average of judge score and deterministic syntax score
- ✓summarize() prints per-case breakdowns so regressions on individual cases are visible
- ✓A delta above 0.05 on a representative dataset is a reasonable threshold for "the change helped"
- ✓Run multiple trials at temperature > 0 before declaring a winner