Session 10 · 01 · 12 min

The Improvement Loop

What you'll learn

▸Describe the 5-step prompt-engineering rhythm and why each step matters
▸Explain why changing one variable at a time is essential for measured progress
▸Distinguish eyeballing from measured improvement

Session overview

Session 10 is about prompt engineering for evaluations. You already have a working eval harness (dataset, judge, HTML report). Now you will learn how to use that harness to improve your prompts systematically rather than guessing.

The 5-step rhythm

Every productive prompt-engineering session follows the same rhythm. Skipping any step leads to wasted effort or false confidence.

The Prompt Improvement Loop

Freeze eval (lock dataset + judge) → Baseline (record scores) → Change one thing (one prompt edit) → Re-run eval (same conditions) → Keep / Revert (data decides)

Step 1 — Freeze the eval

Before touching any prompt, lock three things: the dataset file (no new cases), the judge prompt (exact wording), and the model id + temperature. If any of those change between runs, you cannot compare scores.

Common mistake

Adding a new test case between baseline and re-run. A score jump might just reflect an easy new case, not prompt improvement. Keep the dataset identical until you decide to do a deliberate expansion.
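
One way to make the freeze checkable is to fingerprint the frozen inputs and store the fingerprint next to each run. The sketch below is only an illustration: the function name `eval_fingerprint`, the sample dataset text, the judge wording, and the model id `example-model` are all hypothetical and not part of the course harness.

```python
import hashlib
import json


def eval_fingerprint(dataset_text: str, judge_prompt: str,
                     model_id: str, temperature: float) -> str:
    """Return a short hash that changes if any frozen input changes."""
    payload = json.dumps(
        {
            "dataset": dataset_text,
            "judge_prompt": judge_prompt,
            "model_id": model_id,
            "temperature": temperature,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Hypothetical frozen inputs, for illustration only.
JUDGE = "Score the answer from 1 to 5 for accuracy."
DATASET = '{"id": 1, "input": "hi"}\n'
DATASET_PLUS_ONE = DATASET + '{"id": 2, "input": "new case"}\n'

baseline_fp = eval_fingerprint(DATASET, JUDGE, "example-model", 0.0)
rerun_fp = eval_fingerprint(DATASET, JUDGE, "example-model", 0.0)
changed_fp = eval_fingerprint(DATASET_PLUS_ONE, JUDGE, "example-model", 0.0)

print(baseline_fp == rerun_fp)    # identical inputs give the same fingerprint
print(baseline_fp == changed_fp)  # an added test case changes the fingerprint
```

If the fingerprint of a re-run differs from the baseline's, the two score sets should not be compared.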

Step 2 — Record a baseline

Run the eval once and save the per-case scores. The HTML report is your paper trail. Note the overall blended average AND the two or three worst-scoring cases — those are your targets.
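
A baseline record could look roughly like the following. This is a minimal sketch that assumes per-case scores are available as a mapping from case id to a numeric score; the case ids, the scores, and the helper `summarize_baseline` are made up for illustration.

```python
import json
from statistics import mean


def summarize_baseline(scores: dict[str, float], worst_n: int = 3) -> dict:
    """Compute the overall average and pick the lowest-scoring cases."""
    overall = mean(scores.values())
    worst = sorted(scores, key=scores.get)[:worst_n]
    return {"average": overall, "worst_cases": worst, "per_case": scores}


# Hypothetical per-case scores from one eval run.
baseline_scores = {"case-01": 4.0, "case-02": 2.0, "case-03": 5.0, "case-04": 3.0}

summary = summarize_baseline(baseline_scores, worst_n=2)
print(json.dumps(summary, indent=2))
```

Writing the same JSON to a file alongside the HTML report keeps the baseline available for later comparison.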

Step 3 — Change one thing

Edit only one part of the generator prompt per experiment: either the system prompt OR the user wrapper, not both. One change at a time is the only way to know what worked.
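
The one-change rule can be enforced mechanically before an experiment starts. The sketch below assumes the generator prompt is stored as a dictionary with `system` and `user_wrapper` entries; that layout, the prompt texts, and the helper `changed_parts` are assumptions for illustration.

```python
def changed_parts(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List the prompt components whose text differs between two versions."""
    keys = sorted(set(old) | set(new))
    return [key for key in keys if old.get(key) != new.get(key)]


# Hypothetical prompt versions.
baseline_prompt = {
    "system": "You are a helpful assistant.",
    "user_wrapper": "Question: {question}",
}
candidate_prompt = {
    "system": "You are a concise assistant. Answer in two sentences.",
    "user_wrapper": "Question: {question}",
}

diff = changed_parts(baseline_prompt, candidate_prompt)
if len(diff) != 1:
    raise ValueError(f"expected exactly one changed part, got {diff}")
print(diff)
```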

Step 4 — Re-run under identical conditions

Use the same script, same model id, same temperature. If temperature > 0, run at least twice and average to reduce noise before declaring a winner.
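
Averaging repeated runs per case might be done as below. The sketch assumes each run yields a mapping from case id to score; the run data and the helper `average_runs` are hypothetical.

```python
from statistics import mean


def average_runs(runs: list[dict[str, float]]) -> dict[str, float]:
    """Average each case's score across repeated runs of the same eval."""
    case_ids = runs[0].keys()
    for run in runs:
        if run.keys() != case_ids:
            raise ValueError("runs cover different cases; the dataset was not frozen")
    return {cid: mean(run[cid] for run in runs) for cid in case_ids}


# Two hypothetical runs of the same prompt at temperature > 0.
run_a = {"case-01": 4.0, "case-02": 2.0}
run_b = {"case-01": 5.0, "case-02": 3.0}

averaged = average_runs([run_a, run_b])
print(averaged)
```

The spread between `run_a` and `run_b` also gives a rough sense of how much score movement is just noise.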

Step 5 — Keep or revert based on data

Compare the new scores to the baseline case by case. If the blended average improved and no individual case dropped by more than normal run-to-run variation, keep the change. Otherwise revert. Gut feeling is not a decision criterion here.
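
The keep-or-revert rule can be written down as a small function so the decision is the same every time. In this sketch the threshold `max_case_drop` is an assumed value that you would choose for your own scoring scale, and the scores are invented for illustration.

```python
from statistics import mean


def keep_or_revert(baseline: dict[str, float], candidate: dict[str, float],
                   max_case_drop: float = 1.0):
    """Keep only if the average improved and no case dropped by max_case_drop or more."""
    if baseline.keys() != candidate.keys():
        raise ValueError("case ids differ; scores are not comparable")
    avg_delta = mean(candidate.values()) - mean(baseline.values())
    regressions = {
        cid: candidate[cid] - baseline[cid]
        for cid in baseline
        if baseline[cid] - candidate[cid] >= max_case_drop
    }
    decision = "keep" if avg_delta > 0 and not regressions else "revert"
    return decision, avg_delta, regressions


# Hypothetical scores: candidate A improves gently, candidate B breaks case-03.
baseline = {"case-01": 4.0, "case-02": 2.0, "case-03": 5.0}
candidate_a = {"case-01": 4.0, "case-02": 4.0, "case-03": 4.5}
candidate_b = {"case-01": 5.0, "case-02": 5.0, "case-03": 3.0}

decision_a, delta_a, regressions_a = keep_or_revert(baseline, candidate_a)
decision_b, delta_b, regressions_b = keep_or_revert(baseline, candidate_b)
print(decision_a, regressions_a)
print(decision_b, regressions_b)
```

Candidate B has the higher average, yet it is reverted because one case fell sharply, which is the situation an average alone would hide.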

Eyeballing vs measured improvement

Eyeballing: "looks good to me"

•Read a few outputs and decide they seem better
•No record of what changed
•No score to compare against
•Placebo effect is real — new wording feels fresh
•Cannot catch regressions on edge cases

Measured improvement: "the eval decides"

•Numeric score on every case before and after
•Diff of exact prompt change is saved
•Delta is positive AND larger than the run-to-run noise
•Worst-case rows are checked, not just averages
•Reproducible: someone else can verify the result
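
Saving the exact prompt change can be as simple as storing a unified diff with the scores. The sketch below uses Python's standard `difflib`; the prompt texts and the labels `prompt_baseline` and `prompt_candidate` are hypothetical.

```python
import difflib


def prompt_diff(old: str, new: str) -> str:
    """Return a unified diff of two prompt versions as a single string."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile="prompt_baseline",
        tofile="prompt_candidate",
        lineterm="",
    )
    return "\n".join(lines)


# Hypothetical system prompt before and after one edit.
old_prompt = "You are a helpful assistant.\nAnswer the question."
new_prompt = "You are a helpful assistant.\nAnswer the question in two sentences."

diff_text = prompt_diff(old_prompt, new_prompt)
print(diff_text)
```

Keeping this text next to the before and after scores lets someone else see exactly which edit produced which delta.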

Why one variable at a time?

If you change both the system prompt and the user wrapper and scores go up, you do not know which change helped. You might keep an edit that contributed nothing, or one that actually hurt while the other masked it. If scores go down, you do not know which change to revert. Single-variable experiments are the only way to build reliable intuition.

Knowledge Check

You change the system prompt and the user wrapper in the same run. Scores improve by 4 points. What is the problem?

Recap — what you just learned

✓The 5-step loop is: freeze eval → baseline → change one thing → re-run → keep/revert
✓Never change the dataset, judge, model id, or temperature between A/B runs
✓One prompt variable per experiment is non-negotiable
✓Eyeballing feels faster but produces unreliable results; scores are the source of truth
