Session 9 · 04 · 20 min
HTML Eval Reports
What you'll learn
- Generate a readable HTML report from eval results
- Assign PASS / PARTIAL / FAIL thresholds to each score
- Use Claude to grow the dataset when results look too easy
Why a report?
A list of scores in the terminal is hard to act on. An HTML report lets you share results with teammates, track regressions over time by saving the file, and spot patterns — e.g. the judge consistently gives low scores to sarcasm cases. This lesson adds a generate_eval_html_report() function that turns your results list into a standalone HTML file.
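For orientation, here is a minimal sketch of the input the report function works on. The keys (task, expected_label, score, reasoning) are the ones the report code in this lesson reads; the sample values and the timestamped_name helper are illustrative assumptions, not part of the lesson's files.

```python
from datetime import datetime

# Illustrative results list; the values are made up, only the keys matter.
sample_results = [
    {
        "task": "Great phone, battery lasts two days.",
        "expected_label": "positive",
        "score": 9,
        "reasoning": "Label and explanation both match the review.",
    },
    {
        "task": "Oh great, it broke after a week. Fantastic.",
        "expected_label": "negative",
        "score": 4,
        "reasoning": "Sarcasm was read as praise.",
    },
]


def timestamped_name(prefix="eval_report", now=None):
    """Hypothetical helper: build a per-run filename so older reports are kept."""
    now = now or datetime.now()
    return f"{prefix}_{now:%Y%m%d_%H%M%S}.html"


print(timestamped_name(now=datetime(2024, 1, 2, 3, 4, 5)))
# eval_report_20240102_030405.html
```

Saving each run under its own name is one simple way to keep older reports around for regression comparisons; overwriting a single eval_report.html works too if you commit the file to version control.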
PASS / PARTIAL / FAIL thresholds
PASS: score >= 8
- Label and reason are both correct
- Meets all criteria
- Ready for production
PARTIAL: 5 <= score < 8
- Label correct but reason weak
- Minor criteria misses
- Needs prompt improvement
FAIL: score < 5
- Wrong label or wildly off-criteria
- Systematic failure mode
- Prompt needs rework
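The thresholds above can be sketched as a small function with configurable cut-offs, which makes it easy to tighten the bar later. The make_status_fn name and its parameters are hypothetical, and the scores are made up to exercise each boundary.

```python
from collections import Counter


def make_status_fn(pass_at=8, partial_at=5):
    """Return a score -> status function for the given cut-offs (hypothetical helper)."""
    if partial_at > pass_at:
        raise ValueError("partial_at must not exceed pass_at")

    def status(score):
        if score >= pass_at:
            return "PASS"
        if score >= partial_at:
            return "PARTIAL"
        return "FAIL"

    return status


status = make_status_fn()         # lesson defaults: 8 and 5
scores = [10, 8, 7.5, 5, 4.9, 0]  # made-up scores covering each boundary
tally = Counter(status(s) for s in scores)
print(dict(tally))  # {'PASS': 2, 'PARTIAL': 2, 'FAIL': 2}
```

Note that a fractional score such as 7.5 falls in PARTIAL, because the band is defined as 5 <= score < 8 rather than as the integers 5 to 7.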
generate_eval_html_report()
04_eval_reports.py
from html import escape


def get_status(score):
    """Map a 0-10 judge score to PASS / PARTIAL / FAIL."""
    if score >= 8:
        return "PASS"
    elif score >= 5:
        return "PARTIAL"
    return "FAIL"


def generate_eval_html_report(results, filename="eval_report.html"):
    total = len(results)
    passed = sum(1 for r in results if get_status(r["score"]) == "PASS")
    avg = sum(r["score"] for r in results) / total if total else 0

    colors = {"PASS": "#22c55e", "PARTIAL": "#f59e0b", "FAIL": "#ef4444"}
    rows = ""
    for r in results:
        status = get_status(r["score"])
        color = colors[status]
        # Truncate long tasks; only append "..." when something was cut.
        task = r["task"] if len(r["task"]) <= 60 else r["task"][:60] + "..."
        # Escape free text so characters like < or & cannot break the markup.
        rows += (
            f"<tr><td>{escape(task)}</td>"
            f"<td>{escape(str(r['expected_label']))}</td>"
            f"<td>{r['score']}/10</td>"
            f'<td style="color:{color};font-weight:bold">{status}</td>'
            f"<td>{escape(str(r['reasoning']))}</td></tr>\n"
        )

    html = f"""<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>Eval Report</title></head><body>
<h1>Eval Report</h1>
<p>Cases: {total} | Passed: {passed} | Avg score: {avg:.1f}/10</p>
<table border=1 cellpadding=6>
<tr><th>Task</th><th>Expected</th><th>Score</th><th>Status</th><th>Reasoning</th></tr>
{rows}</table>
</body></html>"""
    # Write as UTF-8 so non-ASCII review text survives on every platform.
    with open(filename, "w", encoding="utf-8") as f:
        f.write(html)
    print(f"Report saved to {filename}")
Growing the dataset with Claude
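If all five cases pass, the dataset is probably too easy to tell you anything useful. Use Claude to generate harder cases that target the edge cases your prompt struggles with. The function below samples at temperature=1.0, which makes the output more varied, so the new cases are less likely to be near-copies of the examples you already have.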
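As a rough sketch of the "too easy" check described above, the helper below flags a results list in which every case passes. The function names are hypothetical, and the cut-off of 8 mirrors this lesson's PASS threshold.

```python
def pass_rate(results, pass_at=8):
    """Fraction of results scoring at or above pass_at (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["score"] >= pass_at) / len(results)


def looks_too_easy(results, pass_at=8):
    """True when there is at least one case and every case passes."""
    return bool(results) and pass_rate(results, pass_at) == 1.0


all_pass = [{"score": 9}, {"score": 8}, {"score": 10}]  # made-up scores
mixed = [{"score": 9}, {"score": 6}]
print(looks_too_easy(all_pass), looks_too_easy(mixed))  # True False
```

A perfect pass rate on a handful of cases says more about the dataset than about the prompt, so treat it as a cue to add harder cases rather than as a result.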
04_eval_reports.py (continued)
def generate_more_cases(existing_dataset, n=5):
    # Show Claude the current cases so the new ones match their format.
    examples = "\n".join(
        f'- "{c["task"]}" -> {c["expected_label"]}'
        for c in existing_dataset
    )
    prompt = (
        f"Here are {len(existing_dataset)} product review classification examples:\n{examples}\n\n"
        f"Generate {n} NEW examples that are harder: include sarcasm, mixed signals, and ambiguous cases. "
        "Return a Python list of dicts with keys: task, expected_label, criteria."
    )
    messages = [{"role": "user", "content": prompt}]
    # chat() is not defined in this listing; it is assumed to be the course's message helper.
    return chat(messages, temperature=1.0)
Review is not optional
Generated cases can be wrong. Always read Claude's suggestions before adding them to your dataset. Bad ground-truth labels corrupt your eval signal.
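One way to set up that review is to parse the reply into candidate cases first and only then read them one by one. This sketch assumes the reply arrives as a string containing a Python list literal, as the prompt requests. The parse_generated_cases helper and the sample reply are hypothetical, and a real reply may need looser handling (for example if the model answers in JSON).

```python
import ast

REQUIRED_KEYS = {"task", "expected_label", "criteria"}


def parse_generated_cases(text):
    """Extract the outermost [...] span from the reply and parse it as a literal.

    ast.literal_eval only evaluates literals, so no generated code is executed.
    Returns (candidates, problems): well-formed dicts, and notes on rejected items.
    """
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end <= start:
        return [], ["no list found in reply"]
    try:
        parsed = ast.literal_eval(text[start:end + 1])
    except (ValueError, SyntaxError, TypeError) as exc:
        return [], [f"could not parse list: {exc}"]
    if not isinstance(parsed, list):
        return [], ["reply did not contain a single list"]

    candidates, problems = [], []
    for i, item in enumerate(parsed):
        if isinstance(item, dict) and REQUIRED_KEYS.issubset(item):
            candidates.append(item)
        else:
            problems.append(f"item {i} is missing required keys")
    return candidates, problems


# Made-up reply: one usable case and one malformed item.
reply = (
    "Here you go:\n"
    '[{"task": "Love it. Not.", "expected_label": "negative", '
    '"criteria": "detects sarcasm"}, {"task": "meh"}]'
)
cases, problems = parse_generated_cases(reply)
for case in cases:
    # Print each candidate for a human to accept or reject; nothing is auto-added.
    print(f"REVIEW: {case['task']!r} -> {case['expected_label']}")
print(problems)
```

Parsing only checks the shape of each case. Whether the expected_label is actually right still has to be judged by a person.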
$ python 04_eval_reports.py
Knowledge check
What score threshold marks a PARTIAL result?
Why should you use temperature=1.0 when generating new dataset cases?
Recap — what you just learned
- PASS: score >= 8, PARTIAL: 5 <= score < 8, FAIL: score < 5; adjust the thresholds to your quality bar
- generate_eval_html_report() saves a standalone HTML file you can share
- Use Claude at temperature=1.0 to generate harder edge-case examples
- Always manually review generated cases before adding them to the dataset