Session 10 · 04 · 12 min

Align & Iterate

What you'll learn
  • Align generator output vocabulary with the judge's solution_criteria to eliminate terminology gaps
  • Use the HTML report's worst-case rows to find the next prompt fix
  • Decide when role/persona prompting helps and when it adds noise

Align generator with judge

Your eval has two prompts: the generator prompt and the judge prompt. They need to share vocabulary. If the judge grades on "double-quoted keys" but the generator prompt never mentions quoting, the generator has no reason to obey that constraint.

Alignment flow
  1. Read solution_criteria
  2. Copy key phrases into generator output rules
  3. Re-run eval
  4. Check judge reasoning in HTML report
  5. Close remaining gaps
Why copy criteria verbatim?
When the generator sees the exact phrase the judge will look for, it is more likely to produce output that satisfies that phrase. This is not teaching the model to cheat: the criteria are the spec, and both sides should know the spec.
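
One way to spot a vocabulary gap before re-running the eval is a plain substring check. The sketch below is only an illustration: the helper `missing_phrases`, the phrase list, and the sample generator prompt are all hypothetical, and a real check may need looser matching than exact substrings.

```python
def missing_phrases(key_phrases, generator_prompt):
    """Return the key phrases that never appear in the generator prompt."""
    prompt = generator_prompt.lower()
    return [p for p in key_phrases if p.lower() not in prompt]


# Hypothetical phrases lifted by hand from a judge's solution_criteria.
key_phrases = ["double-quoted keys", "no markdown code fence"]

# Hypothetical generator prompt that mentions only one of them.
generator_prompt = "Emit only JSON. Use double-quoted keys."

gaps = missing_phrases(key_phrases, generator_prompt)
print(gaps)  # ['no markdown code fence']
```

Every phrase in `gaps` is a candidate to copy into the generator's output rules.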

Update criteria when you change the task prompt

If you add a new output rule to the generator prompt (e.g., "no trailing newline"), add a matching criterion to the judge prompt in the same commit. Otherwise the judge will not penalize violations of the new rule, and your scores will be misleading.
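
A possible way to keep the two prompts in step is to store each generator rule next to its judge criterion and render both prompts from the same list. This is a sketch under assumptions: the `Rule` class, the `RULES` list, and both render functions are hypothetical names, not part of any particular eval framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    generator_rule: str   # instruction shown to the generator
    judge_criterion: str  # matching check given to the judge


# Hypothetical rule set; adding an entry here updates both prompts at once.
RULES = [
    Rule("Use double-quoted keys.", "All keys are double-quoted."),
    Rule("Do not end the output with a trailing newline.",
         "Output has no trailing newline."),
]


def generator_rules(rules):
    """Render the output-rules section of the generator prompt."""
    return "\n".join(f"- {r.generator_rule}" for r in rules)


def solution_criteria(rules):
    """Render the solution_criteria section of the judge prompt."""
    return "\n".join(f"- {r.judge_criterion}" for r in rules)


print(generator_rules(RULES))
print(solution_criteria(RULES))
```

Because a rule cannot be added without writing its criterion, the "same commit" discipline is enforced by the data structure rather than by memory.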

Reading the HTML report

The HTML report sorts cases by score ascending. The top of the table is your to-do list. For each failing case, read the judge reasoning column — it tells you exactly which criterion was violated.

  1. Open the report and go to the lowest-scored cases
  2. Read the judge reasoning for each
  3. Group failures: are they all the same violation (e.g., markdown fences) or diverse?
  4. If they share a violation, add one generator rule targeting that violation
  5. If they are diverse, pick the most common violation, add a rule for it, and re-run the eval
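
The grouping step can be roughed out in code if the report rows are also available as data. The sketch below assumes hypothetical rows with `score` and `reasoning` fields, a made-up score scale, and a hand-written keyword table; none of these come from the report format itself.

```python
from collections import Counter

# Hypothetical mapping from a violation label to a phrase that signals it.
VIOLATION_KEYWORDS = {
    "markdown_fence": "markdown code fence",
    "unquoted_keys": "double-quoted",
    "trailing_newline": "trailing newline",
}


def label_violations(reasoning):
    """Return the labels whose signal phrase appears in the judge reasoning."""
    text = reasoning.lower()
    labels = [label for label, phrase in VIOLATION_KEYWORDS.items()
              if phrase in text]
    return labels or ["other"]


def rank_violations(rows, max_score):
    """Count violations among rows scoring at or below max_score."""
    counts = Counter()
    for row in rows:
        if row["score"] <= max_score:
            counts.update(label_violations(row["reasoning"]))
    return counts.most_common()


# Hypothetical rows, as if copied from the worst end of the report.
rows = [
    {"score": 2, "reasoning": "Output contains markdown code fence."},
    {"score": 3, "reasoning": "Output contains markdown code fence around the JSON."},
    {"score": 4, "reasoning": "Keys are not double-quoted."},
    {"score": 9, "reasoning": "Meets all criteria."},
]

ranked = rank_violations(rows, max_score=5)
print(ranked)  # [('markdown_fence', 2), ('unquoted_keys', 1)]
```

The first entry in `ranked` is the violation to target with the next generator rule. Keyword matching is crude, so a quick read of the underlying reasoning text is still worthwhile.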

Role and persona prompting

Adding a role to the system prompt ("You are an expert AWS engineer") can shift tone and confidence. The effect is real but small and inconsistent across models.

When role/persona helps
  • Task requires domain-specific vocabulary (e.g., "You are a regex specialist")
  • Task requires a specific tone (formal, terse, educational)
  • Model is over-explaining — a terse persona cuts verbosity
When role/persona hurts or is neutral
  • Persona is vague ("You are a helpful assistant") — adds no signal
  • Persona conflicts with output rules ("You are a conversational assistant" + "emit only JSON")
  • Persona bloats the system prompt without measurable score improvement
Measure the persona
Treat persona prompting as a single-variable experiment like any other change. Add the persona, re-run the eval, compare scores. If there is no improvement, remove it. Do not accumulate unvalidated prompt elements.
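
The comparison itself can be as simple as a difference of mean scores. In the sketch below, `compare_runs`, the score lists, and the `min_gain` threshold are all hypothetical; the threshold in particular is an arbitrary stand-in for whatever counts as a meaningful improvement on your eval, and it does not account for run-to-run noise.

```python
from statistics import mean


def compare_runs(baseline_scores, persona_scores, min_gain=0.5):
    """Compare mean scores and decide whether the persona earns its place."""
    gain = mean(persona_scores) - mean(baseline_scores)
    return {"gain": gain, "keep_persona": gain >= min_gain}


# Hypothetical per-case scores from two runs over the same cases.
baseline = [6, 7, 5, 8]
with_persona = [6, 7, 6, 8]

result = compare_runs(baseline, with_persona)
print(result)  # {'gain': 0.25, 'keep_persona': False}
```

Here the persona moves the mean by 0.25, below the assumed threshold, so it would be removed.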
Knowledge Check
The judge reasoning column consistently says "output contains markdown code fence". What is the correct next action?
Recap — what you just learned
  • Copy key phrases from solution_criteria into generator output rules to close vocabulary gaps
  • Always update judge criteria when you add a new generator rule
  • Read the HTML report from the lowest score up and treat the worst cases as your next prompt target
  • Role/persona prompting is a measurable experiment, not a free improvement