Session 9 — LLM Evaluations

Build eval datasets, score outputs with LLM-as-Judge, and generate HTML reports.

2–3 hours · 9 exercises · 3 phases
What you'll be able to do by the end
  • ✓ Explain why eyeballing LLM outputs is insufficient and what evals fix
  • ✓ Build a hand-written eval dataset with task, expected label, and criteria
  • ✓ Score LLM outputs automatically using the LLM-as-Judge pattern
  • ✓ Generate eval datasets with Claude for complex tasks
  • ✓ Create adversarial test cases that deliberately break your prompts
  • ✓ Build dual-graded evals that combine LLM scoring with programmatic validation
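
To make the outcomes above concrete, here is a minimal sketch of an eval case (task, expected label, criteria) and a scoring loop with a pluggable judge. All names (`EvalCase`, `build_judge_prompt`, `exact_match_judge`, `run_eval`) are hypothetical and not taken from the session materials. The judge shown is a simple exact-match stand-in; under the LLM-as-Judge pattern you would instead send the text from `build_judge_prompt` to a model and parse a score from its reply.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalCase:
    task: str            # prompt sent to the model under test
    expected_label: str  # reference answer for this task
    criteria: str        # what a grader should look for


# A judge maps (case, model output) to a score between 0.0 and 1.0.
Judge = Callable[[EvalCase, str], float]


def build_judge_prompt(case: EvalCase, output: str) -> str:
    """Assemble the grading prompt an LLM judge would receive (hypothetical wording)."""
    return (
        "You are grading a model's answer.\n"
        f"Task: {case.task}\n"
        f"Expected label: {case.expected_label}\n"
        f"Criteria: {case.criteria}\n"
        f"Model output: {output}\n"
        "Reply with a score from 0 to 1."
    )


def exact_match_judge(case: EvalCase, output: str) -> float:
    """Stand-in judge: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == case.expected_label.strip().lower() else 0.0


def run_eval(cases: Sequence[EvalCase], model: Callable[[str], str], judge: Judge) -> float:
    """Run every case through the model, score it with the judge, return the mean score."""
    if not cases:
        raise ValueError("eval dataset is empty")
    scores = [judge(case, model(case.task)) for case in cases]
    return sum(scores) / len(scores)
```

Because the judge is just a function argument, the same loop can run with a cheap deterministic check during development and with a model-backed judge once the dataset is stable.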

Prerequisites

The 3-phase arc

Phase 1 builds the eval loop from scratch. Phase 2 scales it with generated datasets and structured tasks. Phase 3 adds programmatic validation for code-generation evals.

Phase 1 (Fundamentals) → Phase 2 (Datasets) → Phase 3 (Code Evals)

Exercises by phase

Phase 1 — Eval Fundamentals
Why evals, your first dataset, LLM-as-Judge, and HTML reports.
Phase 2 — Scaling Datasets
Generate eval datasets with Claude, evaluate structured output, adversarial testing.
Phase 3 — Code Generation Evals
Eval pipelines for code tasks with dual grading: LLM judge + syntax validation.
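
The dual grading described for Phase 3 could be sketched roughly as follows, assuming the generated code is Python (the page does not say which language the exercises target). The function names and the 0.7 pass threshold are illustrative assumptions. The syntax check uses the standard library's `ast.parse`, which only confirms that the code parses, not that it runs correctly.

```python
import ast


def syntax_ok(code: str) -> bool:
    """Programmatic check: does the generated code parse as Python?"""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as embedded null bytes on some Python versions.
        return False


def dual_grade(code: str, judge_score: float, pass_threshold: float = 0.7) -> dict:
    """Combine an LLM judge score with syntax validation.

    judge_score is assumed to come from a separate LLM-as-Judge step and to
    lie in [0, 1]; pass_threshold is an arbitrary illustrative cutoff.
    """
    valid = syntax_ok(code)
    return {
        "syntax_valid": valid,
        "judge_score": judge_score,
        # A case passes only if both graders agree it is acceptable.
        "passed": valid and judge_score >= pass_threshold,
    }
```

Requiring both checks to pass means a well-argued but unparseable answer fails, and so does syntactically valid code that the judge scores poorly against the criteria.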

When you finish