SESSION 9
Session 9 — LLM Evaluations
Build eval datasets, score outputs with LLM-as-Judge, and generate HTML reports.
2–3 hours•9 exercises · 3 phases
What you'll be able to do by the end
- ✓ Explain why eyeballing LLM outputs is insufficient and what evals fix
- ✓ Build a hand-written eval dataset with task, expected label, and criteria
- ✓ Score LLM outputs automatically using the LLM-as-Judge pattern
- ✓ Generate eval datasets with Claude for complex tasks
- ✓ Create adversarial test cases that deliberately break your prompts
- ✓ Build dual-graded evals that combine LLM scoring with programmatic validation
Prerequisites
Comfortable with Python functions and JSONAn Anthropic API key (ANTHROPIC_API_KEY) set in .envJupyter Notebook or VS Code with the Jupyter extension
The 3-phase arc
Phase 1 builds the eval loop from scratch. Phase 2 scales it with generated datasets and structured tasks. Phase 3 adds programmatic validation for code-generation evals.
Phase 1
Fundamentals
Phase 2
Datasets
Phase 3
Code Evals
Exercises by phase
Phase 1 — Eval Fundamentals
Why evals, your first dataset, LLM-as-Judge, and HTML reports.
Phase 2 — Scaling Datasets
Generate eval datasets with Claude, evaluate structured output, adversarial testing.
Phase 3 — Code Generation Evals
Eval pipelines for code tasks with dual grading: LLM judge + syntax validation.