Session 9 · 04 · 20 min

HTML Eval Reports

What you'll learn

▸Generate a readable HTML report from eval results
▸Assign PASS / PARTIAL / FAIL thresholds to each score
▸Use Claude to grow the dataset when results look too easy

Why a report?

A list of scores in the terminal is hard to act on. An HTML report lets you share results with teammates, track regressions over time by saving the file from each run, and spot patterns, such as the judge consistently giving low scores to sarcasm cases. This lesson adds a generate_eval_html_report() function that turns your results list into a standalone HTML file.
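Tracking regressions by saving the file only works if each run keeps its own copy. One possible approach, sketched below, is to put a timestamp in the filename. The helper name timestamped_report_name and its prefix parameter are hypothetical and not part of the lesson code.

```python
from datetime import datetime

def timestamped_report_name(prefix="eval_report", now=None):
    """Build a report filename that includes the run time (hypothetical helper)."""
    # Accept an explicit time so the function is easy to test; default to the current time.
    now = now or datetime.now()
    return f"{prefix}_{now:%Y%m%d_%H%M%S}.html"

# Fixed example time, chosen only for illustration.
print(timestamped_report_name(now=datetime(2025, 1, 31, 14, 25, 0)))
# eval_report_20250131_142500.html
```

The returned name can be passed as the filename argument of the report function, so older reports are not overwritten.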

PASS / PARTIAL / FAIL thresholds

PASS

score >= 8

•Label and reason are both correct
•Meets all criteria
•Ready for production

PARTIAL

5 <= score < 8

•Label correct but reason weak
•Minor criteria misses
•Needs prompt improvement

FAIL

score < 5

•Wrong label or wildly off-criteria
•Systematic failure mode
•Prompt needs rework

generate_eval_html_report()

04_eval_reports.py

def get_status(score):
    if score >= 8:
        return "PASS"
    elif score >= 5:
        return "PARTIAL"
    return "FAIL"

from html import escape  # escape text before embedding it in HTML

def generate_eval_html_report(results, filename="eval_report.html"):
    total  = len(results)
    passed = sum(1 for r in results if get_status(r["score"]) == "PASS")
    avg    = sum(r["score"] for r in results) / total if total else 0

    colors = {"PASS": "#22c55e", "PARTIAL": "#f59e0b", "FAIL": "#ef4444"}
    rows = ""
    for r in results:
        status = get_status(r["score"])
        task = r["task"]
        # Append an ellipsis only when the task text is actually truncated.
        short_task = task[:60] + "..." if len(task) > 60 else task
        rows += (
            f"<tr><td>{escape(short_task)}</td>"
            f"<td>{escape(str(r['expected_label']))}</td>"
            f"<td>{r['score']}/10</td>"
            f'<td style="color:{colors[status]};font-weight:bold">{status}</td>'
            f"<td>{escape(str(r['reasoning']))}</td></tr>\n"
        )

    html = f"""<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>Eval Report</title></head><body>
<h1>Eval Report</h1>
<p>Cases: {total} | Passed: {passed} | Avg score: {avg:.1f}/10</p>
<table border=1 cellpadding=6>
<tr><th>Task</th><th>Expected</th><th>Score</th><th>Status</th><th>Reasoning</th></tr>
{rows}</table>
</body></html>"""

    # Write UTF-8 explicitly so non-ASCII review text survives on every platform.
    with open(filename, "w", encoding="utf-8") as f:
        f.write(html)
    print(f"Report saved to {filename}")
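The summary line in the report shows only the number of passes. If a full breakdown is useful, a per-status tally could be computed along these lines. This is a minimal sketch: count_statuses is a hypothetical helper, the sample scores are made up, and get_status is repeated here only so the snippet runs on its own.

```python
from collections import Counter

def get_status(score):
    # Same thresholds as the lesson: PASS >= 8, PARTIAL >= 5, otherwise FAIL.
    if score >= 8:
        return "PASS"
    elif score >= 5:
        return "PARTIAL"
    return "FAIL"

def count_statuses(results):
    """Return how many results fall into each status (hypothetical helper)."""
    counts = Counter(get_status(r["score"]) for r in results)
    # Always report all three statuses, including those with zero results.
    return {status: counts.get(status, 0) for status in ("PASS", "PARTIAL", "FAIL")}

# Made-up scores for illustration only.
sample_results = [{"score": 9}, {"score": 7.5}, {"score": 5}, {"score": 3}]
print(count_statuses(sample_results))
# {'PASS': 1, 'PARTIAL': 2, 'FAIL': 1}
```

Note that a score of 7.5 lands in PARTIAL, because the PARTIAL range runs up to, but does not include, 8.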

Growing the dataset with Claude

If all five cases pass, your dataset is probably too easy to expose weaknesses in your prompt. Use Claude to generate more cases that target edge cases the prompt is likely to struggle with.

04_eval_reports.py (continued)

def generate_more_cases(existing_dataset, n=5):
    # Show Claude the current cases so the new ones follow the same format.
    examples = "\n".join(
        f'- "{c["task"]}" -> {c["expected_label"]}'
        for c in existing_dataset
    )
    prompt = (
        f"Here are {len(existing_dataset)} product review classification examples:\n{examples}\n\n"
        f"Generate {n} NEW examples that are harder: include sarcasm, mixed signals, and ambiguous cases. "
        "Return a Python list of dicts with keys: task, expected_label, criteria."
    )
    messages = [{"role": "user", "content": prompt}]
    # chat() is assumed to be defined earlier in this file and to return the reply text.
    # temperature=1.0 favors varied examples over repeatable output.
    return chat(messages, temperature=1.0)
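The function above returns whatever chat() produces, which is presumably a string rather than a list. Before the cases can be reviewed or added to the dataset, that text has to be parsed. A minimal sketch follows, assuming the reply contains one list written as JSON or as a Python literal. The helper name parse_generated_cases and the sample reply are hypothetical.

```python
import ast
import json

def parse_generated_cases(text):
    """Parse a model reply into a list of dicts (hypothetical helper)."""
    # Keep only the outermost [...] span in case the list is wrapped in prose or a code fence.
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no list found in response")
    snippet = text[start:end + 1]
    try:
        cases = json.loads(snippet)
    except json.JSONDecodeError:
        # Fall back to Python literal syntax. literal_eval accepts only literals,
        # so it does not run arbitrary code the way eval() would.
        cases = ast.literal_eval(snippet)
    if not isinstance(cases, list) or not all(isinstance(c, dict) for c in cases):
        raise ValueError("expected a list of dicts")
    return cases

# Made-up reply for illustration only.
reply = 'Here you go:\n[{"task": "Great, it broke in a day.", "expected_label": "negative", "criteria": "detects sarcasm"}]'
cases = parse_generated_cases(reply)
print(cases[0]["expected_label"])
# negative
```

Parsing failures raise an exception instead of returning an empty list, so a malformed reply is not mistaken for "no new cases".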

Review is not optional

Generated cases can be wrong. Always read Claude's suggestions before adding them to your dataset. Bad ground-truth labels corrupt your eval signal.
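A few mechanical checks can narrow down where to look first, although they do not replace reading every case. The sketch below flags generated cases with missing keys, labels that never appear in the existing dataset, or tasks that duplicate an existing one. The helper name find_suspect_cases and the sample data are hypothetical.

```python
REQUIRED_KEYS = {"task", "expected_label", "criteria"}

def find_suspect_cases(new_cases, existing_dataset):
    """Return (index, problem) pairs for cases that need a closer look (hypothetical helper)."""
    known_labels = {c["expected_label"] for c in existing_dataset}
    known_tasks = {c["task"].strip().lower() for c in existing_dataset}
    problems = []
    for i, case in enumerate(new_cases):
        missing = REQUIRED_KEYS - set(case)
        if missing:
            problems.append((i, f"missing keys: {sorted(missing)}"))
            continue  # The remaining checks need all keys to be present.
        if case["expected_label"] not in known_labels:
            problems.append((i, f"unknown label: {case['expected_label']!r}"))
        if case["task"].strip().lower() in known_tasks:
            problems.append((i, "duplicate of an existing task"))
    return problems

# Made-up data for illustration only.
existing = [{"task": "Love it", "expected_label": "positive", "criteria": "clear praise"}]
generated = [
    {"task": "love it ", "expected_label": "positive", "criteria": "clear praise"},
    {"task": "Meh", "expected_label": "neutral", "criteria": "lukewarm tone"},
    {"task": "No label here"},
]
for index, problem in find_suspect_cases(generated, existing):
    print(index, problem)
```

A case that passes these checks can still carry a wrong ground-truth label, so the manual read-through remains the real safeguard.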

$ python 04_eval_reports.py

Knowledge check

What score threshold marks a PARTIAL result?

Why should you use temperature=1.0 when generating new dataset cases?

Recap — what you just learned

✓PASS >= 8, PARTIAL >= 5 and < 8, FAIL < 5; adjust thresholds to your quality bar
✓generate_eval_html_report() saves a standalone HTML file you can share
✓Use Claude at temperature=1.0 to generate harder edge-case examples
✓Always manually review generated cases before adding them to the dataset

Next up: 05 — Generating Eval Datasets