Session 9 · 01 · 15 min

Why Evals Matter

What you'll learn
  • Explain why LLM non-determinism makes manual inspection unreliable
  • Run a 5-call demo to observe output variance at temperature=1.0
  • Understand the 4-stage eval pipeline: dataset → run → judge → report

The problem with eyeballing

When you test an ordinary deterministic Python function, calling it twice with the same input gives the same output. LLMs are different: they are non-deterministic by default. Two calls with the same prompt may produce different labels, different phrasings, and, critically, different correctness. If you read a few outputs and conclude "looks good", you will miss systematic failure modes that only appear across many runs.
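
To make the source of that variance concrete, here is a toy sketch of temperature sampling over three labels. It is not how any particular model is implemented: the `toy_logits` scores and the `sample_label` helper are invented for illustration, and a real LLM samples token by token over a far larger vocabulary.

```python
import math
import random


def sample_label(logits, temperature, rng):
    """Sample one label from a dict of label -> score using temperature scaling."""
    if temperature == 0:
        # Greedy choice: always pick the highest-scoring label.
        return max(logits, key=logits.get)
    scaled = {label: score / temperature for label, score in logits.items()}
    # Softmax weights, subtracting the max for numerical stability.
    peak = max(scaled.values())
    weights = {label: math.exp(value - peak) for label, value in scaled.items()}
    labels = list(weights)
    return rng.choices(labels, weights=[weights[name] for name in labels], k=1)[0]


# Hypothetical scores for a sarcastic review; not taken from any real model.
toy_logits = {"negative": 2.0, "positive": 1.5, "neutral": 0.5}
rng = random.Random(0)

greedy = [sample_label(toy_logits, 0, rng) for _ in range(5)]
sampled = [sample_label(toy_logits, 1.0, rng) for _ in range(200)]

print("temperature=0:  ", greedy)
print("temperature=1.0:", sorted(set(sampled)))
```

In this toy, the greedy path returns the same label every time, while sampling at temperature=1.0 draws each label in proportion to its softmax weight, so repeated calls can disagree.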

Eyeballing does not scale
Reading model outputs by hand is slow, biased, and impossible to automate. As soon as you have more than a handful of test cases — or need to compare two prompt versions — you need a structured eval pipeline.

The 5-run demo: sarcasm at temperature=1.0

A sarcastic review like "Oh great, another product that breaks on day one" is ambiguous on the surface: the words sound positive, but the intent is negative. Run the same classification prompt five times at temperature=1.0 and check whether the label changes between runs. This is the core demonstration of why evals exist.

why_evals_demo.py
import anthropic

# The client reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()
model = "claude-opus-4-5"

sarcastic_review = "Oh great, another product that breaks on day one. Absolutely thrilled."

# Constrain the answer to a single word so the label is easy to compare across runs.
prompt = f"""Classify the following product review as positive, negative, or neutral.

Review: {sarcastic_review}

Respond with only one word: positive, negative, or neutral."""

print("Running 5 times at temperature=1.0:")
for i in range(5):
    # Same prompt, same settings on every call; only the sampling differs.
    response = client.messages.create(
        model=model,
        max_tokens=10,
        temperature=1.0,
        messages=[{"role": "user", "content": prompt}],
    )
    # The first content block holds the text of the reply.
    label = response.content[0].text.strip()
    print(f"  Run {i+1}: {label}")

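The run this lesson describes produced three different labels for the same input across five runs; your own counts may differ, and some runs may agree on every call. Disagreement like this is not a bug: it is a fundamental property of sampling from language models. Evals give you a systematic way to measure and track this variance instead of being surprised by it in production.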
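
One simple way to put a number on that variance is to tally the labels and report how often the runs agree with the majority. The sketch below assumes you have collected the labels from the loop into a list; `summarize_labels` and the `example_labels` values are made up for illustration and are not output from a real model.

```python
from collections import Counter


def summarize_labels(labels):
    """Return (counts, majority_label, agreement_rate) for a non-empty list of labels."""
    # Normalize case, whitespace, and a trailing period so "Negative." matches "negative".
    cleaned = [label.strip().strip(".").lower() for label in labels]
    counts = Counter(cleaned)
    majority_label, majority_count = counts.most_common(1)[0]
    agreement_rate = majority_count / len(cleaned)
    return counts, majority_label, agreement_rate


# Made-up labels for illustration only; real runs will differ.
example_labels = ["negative", "Negative", "positive", "negative", "neutral"]
counts, majority, agreement = summarize_labels(example_labels)

print(dict(counts))
print(f"majority={majority} agreement={agreement:.0%}")
```

An agreement rate well below 100% on a single input is a signal that one spot check of that input tells you little about how the prompt behaves.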

The eval pipeline

The 4-stage eval pipeline
  • Dataset: hand-written or generated test cases
  • Run: call the prompt for each case
  • Judge: LLM or heuristic scores output
  • Report: HTML summary with scores

Every eval you build in this session follows this same four-stage pipeline. The dataset defines what you test. The run step calls your prompt. The judge scores the output against a rubric. The report aggregates scores so you can catch regressions and compare prompt versions at a glance.
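
As a rough sketch of how the four stages fit together, the example below wires them up with a stand-in model so it runs without an API key. All names here (`fake_model`, `run`, `judge`, `report`) and the three test cases are hypothetical, the judge is a plain exact-match heuristic instead of an LLM, and the pipeline built later in this session may look different.

```python
from html import escape

# Stage 1: dataset. Hypothetical hand-written cases with expected labels.
dataset = [
    {"review": "Works perfectly, love it.", "expected": "positive"},
    {"review": "Broke on day one.", "expected": "negative"},
    {"review": "It arrived on Tuesday.", "expected": "neutral"},
]


def fake_model(review):
    """Stand-in for a real model call; always answers 'negative'."""
    return "negative"


def run(cases, classify):
    """Stage 2: call the prompt (here, `classify`) once per case."""
    return [{**case, "output": classify(case["review"])} for case in cases]


def judge(results):
    """Stage 3: heuristic judge, 1 if the output matches the expected label, else 0."""
    return [
        {**r, "score": int(r["output"].strip().lower() == r["expected"])}
        for r in results
    ]


def report(scored_cases):
    """Stage 4: aggregate scores into a small HTML summary string."""
    rows = "".join(
        f"<tr><td>{escape(s['review'])}</td><td>{escape(s['output'])}</td>"
        f"<td>{s['score']}</td></tr>"
        for s in scored_cases
    )
    accuracy = sum(s["score"] for s in scored_cases) / len(scored_cases)
    return f"<p>Accuracy: {accuracy:.0%}</p><table>{rows}</table>"


scored = judge(run(dataset, fake_model))
html_report = report(scored)
print(html_report)
```

Passing the model in as a plain callable keeps the run stage independent of any one client, so swapping `fake_model` for a function that calls a real API should not require changes to the judge or the report.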

Knowledge check

  • Why does running the same prompt twice sometimes produce different outputs?
  • Which stage of the eval pipeline scores the model output?

Recap — what you just learned
  • LLMs are non-deterministic at temperature > 0: the same prompt can produce different outputs on every call, and even temperature=0 does not guarantee identical outputs
  • Eyeballing a few outputs is not a reliable quality signal; structured evals are
  • The eval pipeline has four stages: dataset → run → judge → report
  • This session builds that pipeline from scratch, step by step
Next up: 02 — Your First Eval