Session 10 · 06 · 10 min

Pre-Run Checklist

What you'll learn

▸Verify eval conditions are identical between A/B runs using a structured checklist
▸Summarise all six prompt-engineering stages and know when to apply each

Why a checklist?

Eval runs are easy to invalidate through small oversights: a shuffled dataset, a judge prompt that drifted, a changed temperature. Running through a pre-run checklist takes about 30 seconds and can prevent hours of wasted investigation.

The checklist

Same dataset file and case order as the baseline — no additions, deletions, or shuffles
Same model id (including version suffix) as the baseline run
Same temperature as the baseline run
Judge prompt is unchanged from the baseline — no wording edits, no new criteria
If temperature > 0, plan at least two runs per variant and average the deltas, because sampling noise can make a single run misleading
Worst-case rows in the HTML report have been read and make sense — they are real failures, not evaluation bugs
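The first four checklist items can be checked mechanically. The following is a minimal sketch, assuming each run records its conditions in a small config object; `RunConfig`, `sha256_text`, and `check_conditions` are hypothetical names for illustration, not part of any eval tool used in this course. Hashing the dataset text detects additions, deletions, and reordering in one comparison. Reading the worst-case rows remains a manual step.

```python
import hashlib
from dataclasses import dataclass


def sha256_text(text: str) -> str:
    """Fingerprint a string so that any edit, however small, changes the hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class RunConfig:
    # Hypothetical record of the conditions one eval run was made under.
    dataset_sha256: str       # hash of the dataset contents (covers content and order)
    model_id: str             # full model id, including version suffix
    temperature: float
    judge_prompt_sha256: str  # hash of the exact judge prompt text


def check_conditions(baseline: RunConfig, candidate: RunConfig) -> list[str]:
    """Return human-readable mismatches; an empty list means conditions match."""
    problems = []
    if baseline.dataset_sha256 != candidate.dataset_sha256:
        problems.append("dataset differs from baseline (content or order)")
    if baseline.model_id != candidate.model_id:
        problems.append(
            f"model id differs: {baseline.model_id} vs {candidate.model_id}"
        )
    if baseline.temperature != candidate.temperature:
        problems.append(
            f"temperature differs: {baseline.temperature} vs {candidate.temperature}"
        )
    if baseline.judge_prompt_sha256 != candidate.judge_prompt_sha256:
        problems.append("judge prompt differs from baseline")
    return problems
```

An empty result means the A/B comparison is being made under matching conditions; any entry in the list is a reason to stop before spending a run.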

Model version drift

Cloud model aliases like "claude-3-5-sonnet-latest" can resolve to a newer checkpoint after a provider release. Pin the exact version string (e.g., "claude-3-5-sonnet-20241022") to ensure baseline and re-run use the same model.
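A guard like the one below can catch an unpinned alias before a run starts. It is a sketch that assumes pinned ids end in an eight-digit date suffix, as in the example above; other providers may use different versioning schemes, and `is_pinned` is a hypothetical helper name.

```python
import re

# Assumption: a pinned model id ends in "-YYYYMMDD", e.g. "-20241022".
_DATE_SUFFIX = re.compile(r"-\d{8}$")


def is_pinned(model_id: str) -> bool:
    """Return True if the model id ends in an eight-digit date suffix."""
    return bool(_DATE_SUFFIX.search(model_id))
```

For example, `is_pinned("claude-3-5-sonnet-20241022")` is true, while `is_pinned("claude-3-5-sonnet-latest")` is false.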

All six prompt-engineering stages

The six stages covered in this session form a complete toolkit. Apply them in order — each stage builds on the previous.

The six stages

1. Improvement loop

freeze → baseline → change one → re-run → keep/revert

2. Specify

inputs, outputs, format, forbidden

3. Constrain

delimiters, respond-with-only, stop, prefill

4. Structure

numbered lists, headings, XML tags

5. Demonstrate

zero/one/few-shot from real cases

6. Align & iterate

criteria vocab, worst-case triage, persona test

Stage | Core question | Key tool

•1. Improvement loop | Am I measuring this correctly? | Frozen eval + delta comparison
•2. Specify | What does "correct" look like? | Input/output/format/forbidden list
•3. Constrain | How do I enforce format? | Delimiters, prefill, stop sequences
•4. Structure | Are my requirements scannable? | Numbered lists, ## headings, XML tags
•5. Demonstrate | Do I need examples? | Real passing cases from eval dataset
•6. Align & iterate | Do generator and judge share vocabulary? | HTML report worst-case rows
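To make stages 2 to 5 concrete, here is one way a prompt might combine numbered rules, a heading, XML tag delimiters, and few-shot examples. It is a sketch only: `build_prompt`, the tag names, and the closing instruction are illustrative assumptions, and in practice the examples would be real passing cases from the eval dataset. Prefill and stop sequences are set on the API call rather than in the prompt text, so they are not shown.

```python
def build_prompt(
    task: str,
    rules: list[str],
    examples: list[tuple[str, str]],
    user_input: str,
) -> str:
    """Assemble a prompt from a task, numbered rules, examples, and the input."""
    # Structure: numbered rules are easy for the model (and reviewer) to scan.
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    # Demonstrate: each (input, output) pair becomes one tagged example.
    shots = "\n".join(
        f"<example>\n<input>{inp}</input>\n<output>{out}</output>\n</example>"
        for inp, out in examples
    )
    return (
        f"{task}\n\n"
        f"## Rules\n{numbered}\n\n"
        f"<examples>\n{shots}\n</examples>\n\n"
        # Constrain: delimiters separate the user's text from the instructions.
        f"<input>\n{user_input}\n</input>\n\n"
        "Respond with only the output, no explanation."
    )
```

Keeping the builder in one function also supports the improvement loop: a single change to the prompt shows up as a single, reviewable diff.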

You have the full loop

Freeze the eval. Record a baseline. Apply one stage at a time. Re-run. Keep or revert. Repeat until scores plateau or the task is solved. This is the complete prompt-engineering workflow for evaluated systems.
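The keep-or-revert decision can be written down as a small rule. The sketch below assumes each variant has been run the same number of times and that run i of the baseline is paired with run i of the candidate; `mean_delta`, `keep_or_revert`, and the `min_gain` threshold are hypothetical, and the scores in the usage note are made-up placeholders.

```python
from statistics import mean


def mean_delta(baseline_scores: list[float], candidate_scores: list[float]) -> float:
    """Average the candidate-minus-baseline delta across paired runs."""
    if not baseline_scores or len(baseline_scores) != len(candidate_scores):
        raise ValueError("need the same non-zero number of runs on both sides")
    return mean(c - b for b, c in zip(baseline_scores, candidate_scores))


def keep_or_revert(
    baseline_scores: list[float],
    candidate_scores: list[float],
    min_gain: float = 0.0,  # assumption: any positive average gain counts as a win
) -> str:
    """Return 'keep' if the averaged delta exceeds min_gain, else 'revert'."""
    delta = mean_delta(baseline_scores, candidate_scores)
    return "keep" if delta > min_gain else "revert"
```

With placeholder scores, `keep_or_revert([0.70, 0.72], [0.75, 0.77])` returns `"keep"`, and `keep_or_revert([0.70, 0.72], [0.69, 0.71])` returns `"revert"`. Raising `min_gain` is one way to avoid keeping changes whose gain is within run-to-run noise.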

Knowledge Check

You are about to run a re-evaluation after editing the user wrapper. Which checklist item is most critical to verify first?

Knowledge Check

After applying all six stages, your blended score is still below target. What is the right next action?

Recap — what you just learned

✓Run the pre-run checklist (about 30 seconds) before every eval run to prevent invalid comparisons
✓Pin exact model version strings to prevent checkpoint drift between runs
✓The six stages are: improvement loop, specify, constrain, structure, demonstrate, align & iterate
✓Apply stages in order — each one builds on the previous and the loop is the foundation
✓Worst-case rows in the HTML report, once confirmed as real failures rather than evaluation bugs, are the next prompt target
