Align & Iterate
- ▸Align generator output vocabulary with the judge's solution_criteria to eliminate terminology gaps
- ▸Use the HTML report's worst-case rows to find the next prompt fix
- ▸Decide when role/persona prompting helps and when it adds noise
Align generator with judge
Your eval has two prompts: the generator prompt and the judge prompt. They need to share vocabulary. If the judge grades on "double-quoted keys" but the generator prompt never mentions quoting, the generator has no reason to obey that constraint.
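A quick way to spot such a gap is to check whether the phrases the judge grades on appear anywhere in the generator prompt. The sketch below assumes you have pulled the key phrases out of solution_criteria by hand; the function name, the phrase list, and the prompt text are hypothetical and only illustrate the idea.

```python
def missing_terms(key_phrases, generator_prompt):
    """Return the key phrases that never appear in the generator prompt."""
    prompt = generator_prompt.lower()
    return [p for p in key_phrases if p.lower() not in prompt]


# Hypothetical phrases copied by hand from the judge's solution_criteria.
key_phrases = ["double-quoted keys", "markdown fences"]

# Hypothetical generator prompt that mentions fences but not quoting.
generator_prompt = (
    "Emit a single JSON object. "
    "Do not wrap the output in markdown fences."
)

gaps = missing_terms(key_phrases, generator_prompt)
print(gaps)  # ['double-quoted keys']
```

A plain substring match is deliberately crude: it only flags phrases that are absent word for word, which is the kind of gap this section is about.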
Update criteria when you change the task prompt
If you add a new output rule to the generator prompt (e.g., "no trailing newline"), add a matching criterion to the judge prompt in the same commit. Otherwise the judge will not penalize violations of the new rule, and your scores will overstate how well the output follows it.
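One way to make the "same commit" habit hard to forget is to store each generator rule next to its judge criterion and render both prompts from that single list. This is a minimal sketch under that assumption; the `Rule` class, the `RULES` list, and the render helpers are hypothetical names, not part of any eval framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    generator_rule: str   # instruction that goes into the generator prompt
    judge_criterion: str  # matching line that goes into the judge prompt


# Hypothetical rule set: adding a rule here forces you to write its criterion.
RULES = [
    Rule("Use double-quoted keys.", "All keys are double-quoted."),
    Rule("Do not end the output with a trailing newline.",
         "The output has no trailing newline."),
]


def render_generator_rules(rules):
    """Bullet list of output rules for the generator prompt."""
    return "\n".join(f"- {r.generator_rule}" for r in rules)


def render_judge_criteria(rules):
    """Bullet list of grading criteria for the judge prompt."""
    return "\n".join(f"- {r.judge_criterion}" for r in rules)


print(render_generator_rules(RULES))
print(render_judge_criteria(RULES))
```

Because a `Rule` cannot be constructed without both fields, a new generator rule cannot be added without also writing its judge criterion.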
Reading the HTML report
The HTML report sorts cases by score ascending. The top of the table is your to-do list. For each failing case, read the judge reasoning column — it tells you exactly which criterion was violated.
- Open the report and go to the lowest-scored cases
- Read the judge reasoning for each
- Group failures: are they all the same violation (e.g., markdown fences) or diverse?
- If they share a violation, add one generator rule targeting that violation
- If they are diverse, pick the most common violation, fix it, and re-run the eval
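The grouping step above can be sketched as a simple tally. The rows below are hypothetical: they stand in for cases copied out of the report, with a `violation` label you assign by hand after reading the judge reasoning. The field names are assumptions, not the report's actual schema.

```python
from collections import Counter

# Hypothetical rows; "violation" is a label assigned by hand per case.
rows = [
    {"case": "a", "score": 2, "violation": "markdown fences"},
    {"case": "b", "score": 1, "violation": "markdown fences"},
    {"case": "c", "score": 3, "violation": "single-quoted keys"},
    {"case": "d", "score": 9, "violation": None},
]


def worst_cases(rows, n):
    """Return the n lowest-scored rows, lowest first."""
    return sorted(rows, key=lambda r: r["score"])[:n]


def most_common_violation(rows):
    """Return (violation, count) for the most frequent labelled violation."""
    counts = Counter(r["violation"] for r in rows if r["violation"])
    return counts.most_common(1)[0]


worst = worst_cases(rows, 3)
print([r["case"] for r in worst])    # ['b', 'a', 'c']
print(most_common_violation(worst))  # ('markdown fences', 2)
```

Here two of the three worst cases share one violation, so a single generator rule targeting markdown fences would be the next change to try.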
Role and persona prompting
Adding a role to the system prompt ("You are an expert AWS engineer") can shift tone and confidence. The effect is real but small and inconsistent across models.
Helps when:
- Task requires domain-specific vocabulary (e.g., "You are a regex specialist")
- Task requires a specific tone (formal, terse, educational)
- Model is over-explaining, and a terse persona cuts verbosity
Adds noise when:
- Persona is vague ("You are a helpful assistant") and adds no signal
- Persona conflicts with output rules ("You are a conversational assistant" + "emit only JSON")
- Persona bloats the system prompt without measurable score improvement
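Whether a persona earns its place can be checked by running the same cases with and without it and comparing mean scores. The sketch below assumes you already have two score lists in the same case order; the numbers and the function name are made up for illustration.

```python
from statistics import mean


def persona_delta(baseline_scores, persona_scores):
    """Mean score change from adding the persona, over the same cases."""
    if len(baseline_scores) != len(persona_scores):
        raise ValueError("score lists must cover the same cases")
    return mean(persona_scores) - mean(baseline_scores)


# Hypothetical judge scores for four cases, in the same case order.
baseline = [6, 7, 5, 8]
with_persona = [7, 7, 5, 9]

delta = persona_delta(baseline, with_persona)
print(f"{delta:+.2f}")  # +0.50
```

A difference this small on a handful of cases may be run-to-run noise, so it is worth repeating the comparison before keeping the persona.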
- ✓Copy key phrases from solution_criteria into generator output rules to close vocabulary gaps
- ✓Always update judge criteria when you add a new generator rule
- ✓Read the HTML report from the lowest score up and treat the worst cases as your next prompt target
- ✓Role/persona prompting is a measurable experiment, not a free improvement