The Improvement Loop
- ▸Describe the 5-step prompt-engineering rhythm and why each step matters
- ▸Explain why changing one variable at a time is essential for measured progress
- ▸Distinguish eyeballing from measured improvement
The 5-step rhythm
Every productive prompt-engineering session follows the same rhythm. Skipping any step leads to wasted effort or false confidence.
Step 1 — Freeze the eval
Before touching any prompt, lock three things: the dataset file (no new cases), the judge prompt (exact wording), and the model ID and temperature. If any of these changes between runs, the scores are no longer comparable.
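One way to make the freeze checkable is to hash the three locked inputs into a single fingerprint and store it with every run. The sketch below is an assumption-laden illustration, not part of any particular eval tool: the function name `eval_fingerprint`, the sample dataset line, the judge wording, and the model id `example-model` are all hypothetical.

```python
import hashlib
import json


def eval_fingerprint(dataset_bytes: bytes, judge_prompt: str,
                     model_id: str, temperature: float) -> str:
    """Return a short hash that changes if any frozen input changes."""
    payload = json.dumps(
        {
            "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
            "judge_sha256": hashlib.sha256(judge_prompt.encode("utf-8")).hexdigest(),
            "model_id": model_id,
            "temperature": temperature,
        },
        sort_keys=True,  # stable key order so the hash is reproducible
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Hypothetical inputs, for illustration only.
dataset = b'{"id": "case-1", "input": "..."}\n'
judge = "Score the answer from 1 to 5."

before = eval_fingerprint(dataset, judge, "example-model", 0.0)
after = eval_fingerprint(dataset, judge + " Be strict.", "example-model", 0.0)

print(before == after)  # False: the judge wording changed
```

If two runs carry different fingerprints, their scores should not be compared.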
Step 2 — Record a baseline
Run the eval once and save the per-case scores. The HTML report is your paper trail. Note the overall blended average AND the two or three worst-scoring cases — those are your targets.
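As a rough sketch of what recording a baseline can look like, the snippet below takes per-case scores and returns the overall average plus the lowest-scoring cases. It assumes each case already has a single blended score; the case ids and values are made up for illustration.

```python
from statistics import mean


def summarize_baseline(scores: dict[str, float], n_worst: int = 3):
    """Return the overall average and the n_worst lowest-scoring cases."""
    overall = mean(scores.values())
    worst = sorted(scores.items(), key=lambda item: item[1])[:n_worst]
    return overall, worst


# Hypothetical per-case blended scores.
baseline = {"case-1": 4.0, "case-2": 2.0, "case-3": 5.0, "case-4": 3.0}

overall, worst = summarize_baseline(baseline, n_worst=2)
print(overall)  # 3.5
print(worst)    # [('case-2', 2.0), ('case-4', 3.0)]
```

The `worst` list is the set of targets for the next experiment.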
Step 3 — Change one thing
Edit only one part of the generator prompt per experiment: either the system prompt OR the user wrapper, not both. If both change at once, a score delta cannot be attributed to either edit. One change at a time is the only way to know what worked.
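The one-change rule can be enforced mechanically before a run starts. This is a minimal sketch that assumes the generator prompt is stored as a dict of named parts; the keys `system` and `user` and the helper names are hypothetical.

```python
def changed_parts(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List the prompt parts whose text differs between two versions."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


def assert_single_change(old: dict[str, str], new: dict[str, str]) -> str:
    """Return the one changed part, or raise if zero or several changed."""
    changed = changed_parts(old, new)
    if len(changed) != 1:
        raise ValueError(f"expected exactly one changed part, got {changed}")
    return changed[0]


# Hypothetical prompt versions.
old = {"system": "You are a helpful assistant.", "user": "Question: {question}"}
new = {"system": "You are a concise assistant.", "user": "Question: {question}"}

print(assert_single_change(old, new))  # system
```

Saving `old` and `new` alongside the scores also gives you the exact diff of what was tested.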
Step 4 — Re-run under identical conditions
Use the same script, same model id, same temperature. If temperature > 0, run at least twice and average to reduce noise before declaring a winner.
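Averaging repeated runs can be done per case, so that the per-case comparison in the next step still works. The sketch below assumes each run is a dict from case id to score; the data is invented for illustration.

```python
from statistics import mean


def average_runs(runs: list[dict[str, float]]) -> dict[str, float]:
    """Average per-case scores across repeated runs of the same eval."""
    case_ids = set(runs[0])
    for run in runs[1:]:
        if set(run) != case_ids:
            # A differing case set means the dataset was not frozen.
            raise ValueError("runs cover different cases")
    return {cid: mean(run[cid] for run in runs) for cid in sorted(case_ids)}


# Two hypothetical runs at temperature > 0.
run_a = {"case-1": 4.0, "case-2": 2.0}
run_b = {"case-1": 5.0, "case-2": 3.0}

averaged = average_runs([run_a, run_b])
print(averaged)  # {'case-1': 4.5, 'case-2': 2.5}
```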
Step 5 — Keep or revert based on data
Compare new scores to baseline. If the blended average improved and no case got significantly worse, keep the change. Otherwise revert. Gut feeling is not a decision criterion here.
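The keep-or-revert rule can be written down as a small function. This is a sketch under one loud assumption: "significantly worse" is modeled as a per-case drop of at least `max_drop` points, a hypothetical threshold you would tune to your own scoring scale.

```python
from statistics import mean


def keep_or_revert(baseline: dict[str, float], candidate: dict[str, float],
                   max_drop: float = 1.0):
    """Return ("keep" | "revert", regressed case ids)."""
    if baseline.keys() != candidate.keys():
        raise ValueError("baseline and candidate cover different cases")
    improved = mean(candidate.values()) > mean(baseline.values())
    # max_drop is an assumed threshold for "significantly worse".
    regressions = sorted(
        cid for cid in baseline if baseline[cid] - candidate[cid] >= max_drop
    )
    decision = "keep" if improved and not regressions else "revert"
    return decision, regressions


# Hypothetical scores: the average rises, but case-3 drops by 1.5 points.
baseline = {"case-1": 4.0, "case-2": 2.0, "case-3": 5.0}
candidate = {"case-1": 4.0, "case-2": 4.0, "case-3": 3.5}

print(keep_or_revert(baseline, candidate))  # ('revert', ['case-3'])
```

The example shows why the average alone is not enough: the change raises the mean yet still gets reverted because one case regressed.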
Eyeballing vs measured improvement
Eyeballing
- Read a few outputs and decide they seem better
- No record of what changed
- No score to compare against
- Placebo effect is real: new wording feels fresh
- Cannot catch regressions on edge cases
Measured improvement
- Numeric score on every case before and after
- Diff of exact prompt change is saved
- Delta is positive AND statistically plausible
- Worst-case rows are checked, not just averages
- Reproducible: someone else can verify the result
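For the "statistically plausible" check, one rough heuristic (an assumption here, not a method the lesson prescribes) is to compare the mean per-case delta against its standard error. The scores below are invented, and with only a handful of cases this is a sanity check rather than a formal test.

```python
from math import sqrt
from statistics import mean, stdev


def paired_delta_summary(baseline: dict[str, float], candidate: dict[str, float]):
    """Return (mean per-case delta, standard error of that mean)."""
    deltas = [candidate[cid] - baseline[cid] for cid in sorted(baseline)]
    avg = mean(deltas)
    # stdev needs at least two cases; more cases give a tighter estimate.
    std_err = stdev(deltas) / sqrt(len(deltas))
    return avg, std_err


# Hypothetical before/after scores on the same frozen cases.
before_scores = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
after_scores = {"a": 2.0, "b": 3.0, "c": 3.0, "d": 5.0}

avg_delta, std_err = paired_delta_summary(before_scores, after_scores)
# Rule of thumb (assumed): treat the gain as plausible if it exceeds 2 standard errors.
plausible = avg_delta > 2 * std_err
print(avg_delta, std_err, plausible)
```

A delta that is positive but smaller than its own noise estimate is a reason to re-run, not to keep the change.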
- ✓The 5-step loop is: freeze eval → baseline → change one thing → re-run → keep/revert
- ✓Never change the dataset or judge between A/B runs
- ✓One prompt variable per experiment is non-negotiable
- ✓Eyeballing feels faster but produces unreliable results; scores are the source of truth