Session 9 — LLM Evaluations

Build eval datasets, score outputs with LLM-as-Judge, and generate HTML reports.

2–3 hours · 9 exercises · 3 phases

What you'll be able to do by the end

✓ Explain why eyeballing LLM outputs is insufficient and what evals fix
✓ Build a hand-written eval dataset with task, expected label, and criteria
✓ Score LLM outputs automatically using the LLM-as-Judge pattern
✓ Generate eval datasets with Claude for complex tasks
✓ Create adversarial test cases that deliberately break your prompts
✓ Build dual-graded evals that combine LLM scoring with programmatic validation
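
To make the dataset and LLM-as-Judge goals concrete, here is a minimal sketch of the eval loop. It is an illustration under assumptions, not the session's actual notebook code: the field names (`task`, `expected`, `criteria`), the 1-5 scoring scale, and every function name are hypothetical. The `generate` and `judge` arguments are plain callables so the loop runs offline; in practice each would wrap a call to a model.

```python
import json
from typing import Callable

# Hypothetical hand-written eval dataset. The field names are illustrative.
EVAL_DATASET = [
    {
        "task": "Classify the sentiment of: 'The battery died after two days.'",
        "expected": "negative",
        "criteria": "Answer must be a single label: positive, negative, or neutral.",
    },
    {
        "task": "Classify the sentiment of: 'Setup took five minutes and it just works.'",
        "expected": "positive",
        "criteria": "Answer must be a single label: positive, negative, or neutral.",
    },
]


def build_judge_prompt(case: dict, output: str) -> str:
    """Assemble the text sent to the judge model for one eval case."""
    return (
        "You are grading a model's answer.\n"
        f"Task: {case['task']}\n"
        f"Expected label: {case['expected']}\n"
        f"Criteria: {case['criteria']}\n"
        f"Model answer: {output}\n"
        'Reply with JSON only: {"score": <integer 1-5>, "reason": "<one sentence>"}'
    )


def run_eval(
    dataset: list[dict],
    generate: Callable[[str], str],
    judge: Callable[[str], str],
) -> list[dict]:
    """Run every case through the model under test, then through the judge."""
    results = []
    for case in dataset:
        output = generate(case["task"])
        verdict = json.loads(judge(build_judge_prompt(case, output)))
        results.append(
            {
                "task": case["task"],
                "output": output,
                "score": int(verdict["score"]),
                "reason": verdict["reason"],
            }
        )
    return results


def mean_score(results: list[dict]) -> float:
    """Average judge score across all cases."""
    return sum(r["score"] for r in results) / len(results)


# Offline stand-ins so the sketch runs without an API key.
def fake_generate(task: str) -> str:
    return "negative"  # a deliberately naive model that always says "negative"


def fake_judge(prompt: str) -> str:
    # Mimics a judge by comparing the answer to the expected label.
    expected = prompt.split("Expected label: ")[1].split("\n")[0]
    answer = prompt.split("Model answer: ")[1].split("\n")[0]
    score = 5 if answer == expected else 1
    reason = "exact match" if score == 5 else "label mismatch"
    return json.dumps({"score": score, "reason": reason})


results = run_eval(EVAL_DATASET, fake_generate, fake_judge)
for r in results:
    print(r["score"], r["reason"])
print("mean:", mean_score(results))
```

Passing the model and the judge in as callables keeps the loop testable: the same `run_eval` works with stubs like these or with functions that call a real model.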

Prerequisites

Comfortable with Python functions and JSON
An Anthropic API key (ANTHROPIC_API_KEY) set in .env
Jupyter Notebook or VS Code with the Jupyter extension

The 3-phase arc

Phase 1 builds the eval loop from scratch. Phase 2 scales it with generated datasets and structured tasks. Phase 3 adds programmatic validation for code-generation evals.
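
As a rough picture of where the arc ends up, the sketch below combines a programmatic check with a judge score for a single output. It is a sketch under stated assumptions: the validators (syntax checks via `ast.parse` and `json.loads`), the 50/50 weighting, the 1-5 judge scale, and the `dual_grade` name are all hypothetical choices for illustration, not the session's prescribed method.

```python
import ast
import json


def validate_python(code: str) -> bool:
    """Return True if the text parses as Python source."""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def validate_json(text: str) -> bool:
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical mapping from an expected output format to its validator.
VALIDATORS = {"python": validate_python, "json": validate_json}


def dual_grade(output: str, fmt: str, judge_score: int, max_judge_score: int = 5) -> dict:
    """Combine a programmatic syntax check with an LLM judge score.

    Assumes judge_score was already produced by a judge model on a
    1..max_judge_score scale. The equal weighting is an arbitrary choice.
    """
    syntax_ok = VALIDATORS[fmt](output)
    syntax_score = 1.0 if syntax_ok else 0.0
    judge_normalized = judge_score / max_judge_score
    combined = 0.5 * syntax_score + 0.5 * judge_normalized
    return {"syntax_ok": syntax_ok, "judge": judge_normalized, "combined": combined}


good = dual_grade('{"a": 1}', "json", judge_score=5)
bad = dual_grade("def f(:\n    pass", "python", judge_score=4)
print(good)
print(bad)
```

The point of grading twice is that the two checks fail differently: a judge can be persuaded by plausible-looking code that does not parse, while a syntax check says nothing about whether the output solves the task.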

Phase 1: Fundamentals
Phase 2: Datasets
Phase 3: Code Evals