Session 10 · 06 · 10 min
Pre-Run Checklist
What you'll learn
- ▸ Verify eval conditions are identical between A/B runs using a structured checklist
- ▸ Summarise all six prompt-engineering stages and know when to apply each
Why a checklist?
Eval runs are easy to invalidate through small oversights: a shuffled dataset, a judge prompt that drifted, a temperature that changed between runs. Working through a pre-run checklist takes about 30 seconds and can prevent hours of wasted investigation.
The checklist
- Same dataset file and case order as the baseline — no additions, deletions, or shuffles
- Same model id (including version suffix) as the baseline run
- Same temperature as the baseline run
- Judge prompt is unchanged from the baseline — no wording edits, no new criteria
- If temperature > 0, scores vary from run to run, so plan at least two runs and average the deltas
- Worst-case rows in the HTML report have been read and make sense — they are real failures, not evaluation bugs
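Most of the checklist items above can be checked mechanically. The following is a minimal sketch, assuming each run records its configuration somewhere; `RunConfig`, `fingerprint`, and `config_mismatches` are hypothetical names for illustration, not part of any eval framework. Hashing the raw dataset bytes catches additions, deletions, and shuffles in one comparison.

```python
import hashlib
from dataclasses import dataclass, fields


def fingerprint(data: bytes) -> str:
    """Short SHA-256 digest; any edit, reorder, or added row changes it."""
    return hashlib.sha256(data).hexdigest()[:12]


@dataclass(frozen=True)
class RunConfig:
    # Hypothetical record of the conditions a run was executed under.
    dataset_sha: str        # digest of the dataset file bytes (content and order)
    model_id: str           # full model id, including version suffix
    temperature: float
    judge_prompt_sha: str   # digest of the judge prompt text


def config_mismatches(baseline: RunConfig, candidate: RunConfig) -> list:
    """Return one message per field that differs between the two runs."""
    problems = []
    for f in fields(RunConfig):
        b, c = getattr(baseline, f.name), getattr(candidate, f.name)
        if b != c:
            problems.append(f"{f.name}: baseline={b!r}, candidate={c!r}")
    return problems


judge = fingerprint(b"Score the answer from 1 to 5.")
baseline = RunConfig(fingerprint(b"case-1\ncase-2\n"),
                     "claude-3-5-sonnet-20241022", 0.0, judge)
# Same cases in a different order: the dataset fingerprint no longer matches.
candidate = RunConfig(fingerprint(b"case-2\ncase-1\n"),
                      "claude-3-5-sonnet-20241022", 0.0, judge)

for problem in config_mismatches(baseline, candidate):
    print("MISMATCH", problem)
```

In practice the bytes would come from the real dataset file and judge prompt file. The last checklist item, reading the worst-case rows, still needs a human.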
Model version drift
Cloud model aliases like "claude-3-5-sonnet-latest" can resolve to a newer checkpoint after a provider release. Pin the exact version string (e.g., "claude-3-5-sonnet-20241022") to ensure baseline and re-run use the same model.
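A small guard can flag unpinned ids before a run starts. This is a sketch under one assumption: a pinned id ends in an eight-digit date stamp, as in the example above. Other providers use different version conventions, so the pattern and the `is_pinned` helper are illustrative only.

```python
import re

# Assumption: pinned ids end in an 8-digit date stamp such as "-20241022".
PINNED_SUFFIX = re.compile(r"-\d{8}$")


def is_pinned(model_id: str) -> bool:
    """True if the id carries an explicit date-stamped version suffix."""
    return bool(PINNED_SUFFIX.search(model_id))


for model_id in ("claude-3-5-sonnet-20241022", "claude-3-5-sonnet-latest"):
    status = "pinned" if is_pinned(model_id) else "ALIAS: pin before running"
    print(f"{model_id}: {status}")
```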
All six prompt-engineering stages
The six stages covered in this session form a complete toolkit. Apply them in order — each stage builds on the previous.
The six stages
1. Improve loop
freeze → baseline → change one → re-run → keep/revert
2. Specify
inputs, outputs, format, forbidden
3. Constrain
delimiters, respond-with-only, stop, prefill
4. Structure
numbered lists, headings, XML tags
5. Demonstrate
zero/one/few-shot from real cases
6. Align & Iterate
criteria vocab, worst-case triage, persona test
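To show how stages 2 to 5 can combine in a single prompt, here is a minimal sketch. The `build_prompt` helper, the tag names, and the sentiment task are all hypothetical choices for illustration; the session does not prescribe this exact layout.

```python
def build_prompt(task, rules, examples, user_input):
    """Assemble a prompt from a task statement, rules, examples, and input."""
    # Structure: requirements as a numbered list so they are scannable.
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    # Demonstrate: few-shot pairs, ideally real passing cases from the eval set.
    shots = "\n".join(
        f"<example>\n<input>{inp}</input>\n<output>{out}</output>\n</example>"
        for inp, out in examples
    )
    return (
        f"{task}\n\n"                                # Specify: what is wanted
        f"Rules:\n{numbered}\n\n"
        f"<examples>\n{shots}\n</examples>\n\n"
        f"<input>\n{user_input}\n</input>\n\n"       # Constrain: delimited input
        "Respond with only the answer, no preamble."  # Constrain: respond-with-only
    )


prompt = build_prompt(
    task="Classify the sentiment of the review as positive or negative.",
    rules=["Output exactly one word.", "Do not explain your choice."],
    examples=[("Loved it.", "positive"), ("Broke in a day.", "negative")],
    user_input="Arrived late but works fine.",
)
print(prompt)
```

Following the improve loop, each of these elements would be introduced as a separate change and re-evaluated, rather than all at once.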
| Stage | Core question | Key tool |
| --- | --- | --- |
| 1. Improve loop | Am I measuring this correctly? | Frozen eval + delta comparison |
| 2. Specify | What does "correct" look like? | Input/output/format/forbidden list |
| 3. Constrain | How do I enforce format? | Delimiters, prefill, stop sequences |
| 4. Structure | Are my requirements scannable? | Numbered lists, ## headings, XML tags |
| 5. Demonstrate | Do I need examples? | Real passing cases from eval dataset |
| 6. Align & Iterate | Do generator and judge share vocabulary? | HTML report worst-case rows |
You have the full loop
Freeze the eval. Record a baseline. Apply one stage at a time. Re-run. Keep or revert. Repeat until scores plateau or the task is solved. This is the complete prompt-engineering workflow for evaluated systems.
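The keep-or-revert step can be sketched as a small function. This assumes each condition was run the same number of times and that runs are paired in order; `mean_delta`, `keep_or_revert`, and the `min_gain` threshold are hypothetical names, and the scores are made-up illustration values, not results.

```python
from statistics import mean


def mean_delta(baseline_scores, candidate_scores):
    """Average per-run delta (candidate minus baseline), runs paired in order."""
    if len(baseline_scores) != len(candidate_scores):
        raise ValueError("need the same number of runs for both conditions")
    return mean(c - b for b, c in zip(baseline_scores, candidate_scores))


def keep_or_revert(baseline_scores, candidate_scores, min_gain=0.0):
    """Keep the change only if the averaged delta exceeds min_gain."""
    delta = mean_delta(baseline_scores, candidate_scores)
    return "keep" if delta > min_gain else "revert"


# Two runs per condition, as the checklist suggests when temperature > 0.
baseline_runs = [0.70, 0.74]
candidate_runs = [0.78, 0.76]
print(round(mean_delta(baseline_runs, candidate_runs), 3))
print(keep_or_revert(baseline_runs, candidate_runs))
```

Setting `min_gain` above zero is one way to avoid keeping changes whose gain is within run-to-run noise; the right threshold depends on how much the scores vary.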
Knowledge Check
You are about to run a re-evaluation after editing the user wrapper. Which checklist item is most critical to verify first?
Knowledge Check
After applying all six stages, your blended score is still below target. What is the right next action?
Recap — what you just learned
- ✓ Run the pre-run checklist (about 30 seconds) before every eval run to prevent invalid comparisons
- ✓ Pin exact model version strings to prevent checkpoint drift between runs
- ✓ The six stages are: improve loop, specify, constrain, structure, demonstrate, align & iterate
- ✓ Apply stages in order: each one builds on the previous, and the loop is the foundation
- ✓ Once confirmed as real failures rather than evaluation bugs, worst-case rows in the HTML report are the next prompt target