Why Evals Matter
- ▸Explain why LLM non-determinism makes manual inspection unreliable
- ▸Run a 5-call demo to observe output variance at temperature=1.0
- ▸Understand the 4-stage eval pipeline: dataset → run → judge → report
The problem with eyeballing
When you test a regular, deterministic Python function, calling it twice with the same input gives the same output. LLMs are different: they sample their output, so they are non-deterministic by default. Two calls with the same prompt may produce different labels, different phrasings, and, critically, different correctness. If you read a few outputs and conclude "looks good", you will miss systematic failure modes that only show up across many runs.
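To put a rough number on how easily eyeballing misses a failure mode, here is a small arithmetic sketch. It assumes each output fails independently at a fixed rate; the 10% rate and the function name are hypothetical, chosen only for illustration.

```python
def chance_of_seeing_no_failure(failure_rate: float, samples: int) -> float:
    """Probability that `samples` independent outputs are all correct.

    Assumes every output fails independently with the same probability.
    """
    return (1.0 - failure_rate) ** samples


# Hypothetical failure rate of 10%, picked for illustration only.
for n in (5, 20, 50):
    p = chance_of_seeing_no_failure(0.10, n)
    print(f"{n:>2} outputs read: {p:.0%} chance of seeing no failure")
```

Under these assumptions, reading only five outputs leaves a better-than-even chance (about 59%) of never seeing the failure at all.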
The 5-run demo: sarcasm at temperature=1.0
A sarcastic review like "Oh great, another product that breaks on day one" is ambiguous to a classifier: the words are positive, but the intent is negative. Run the same classification prompt five times at temperature=1.0 and watch whether the label changes between calls. This is the core demonstration of why evals exist.
import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()
model = "claude-opus-4-5"

sarcastic_review = "Oh great, another product that breaks on day one. Absolutely thrilled."
prompt = f"""Classify the following product review as positive, negative, or neutral.
Review: {sarcastic_review}
Respond with only one word: positive, negative, or neutral."""

print("Running 5 times at temperature=1.0:")
for i in range(5):
    # Same prompt and settings on every call; only the sampling differs.
    response = client.messages.create(
        model=model,
        max_tokens=10,
        temperature=1.0,
        messages=[{"role": "user", "content": prompt}],
    )
    label = response.content[0].text.strip()
    print(f" Run {i+1}: {label}")

Three different labels for the same input across five runs. This is not a bug; it is a fundamental property of language models. Evals give you a systematic way to measure and track this variance instead of being surprised by it in production.
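Rather than reading the five printed labels by eye, the variance can be summarized in a few numbers. The sketch below is one possible way to do that. `summarize_labels` is a hypothetical helper, and the example labels are typed in by hand for illustration; they are not real model output.

```python
from collections import Counter


def summarize_labels(labels: list[str]) -> dict:
    """Normalize labels and report how often each one appeared."""
    if not labels:
        raise ValueError("need at least one label")
    # Treat "Negative", "negative." and " negative" as the same label.
    normalized = [label.strip().lower().rstrip(".") for label in labels]
    counts = Counter(normalized)
    majority, majority_count = counts.most_common(1)[0]
    return {
        "counts": dict(counts),
        "distinct": len(counts),
        "majority": majority,
        # Share of runs that agree with the most common label.
        "agreement": majority_count / len(normalized),
    }


# Hypothetical labels for illustration only (not real model output).
example_labels = ["negative", "Negative", "positive", "negative", "neutral"]
summary = summarize_labels(example_labels)
print(summary)
```

In the loop above, you could append each `label` to a list and pass that list to this helper after the fifth call.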
The eval pipeline
Every eval you build in this session follows the same four-stage pipeline: dataset → run → judge → report. The dataset defines what you test. The run step calls the model with your prompt on each dataset item. The judge scores each output against a rubric. The report aggregates the scores so you can catch regressions and compare prompt versions at a glance.
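The four stages can be sketched end to end in a few lines. This is a minimal illustration, not the session's actual implementation. All names (`Example`, `stub_classifier`, `run`, `judge`, `report`) and the three dataset rows are hypothetical, the model call is replaced by a keyword stub so the code runs offline, and the judge is a simple exact-match check standing in for a fuller rubric.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    review: str
    expected: str  # gold label for this review


# Stage 1: dataset (hypothetical rows, for illustration only).
DATASET = [
    Example("Works perfectly, five stars.", "positive"),
    Example("Oh great, another product that breaks on day one.", "negative"),
    Example("It arrived on Tuesday.", "neutral"),
]


def stub_classifier(review: str) -> str:
    """Offline stand-in for the model call; it takes sarcasm literally."""
    text = review.lower()
    if "great" in text or "perfectly" in text:
        return "positive"
    if "breaks" in text:
        return "negative"
    return "neutral"


def run(dataset: list[Example], classify: Callable[[str], str]) -> list[str]:
    """Stage 2: call the classifier once per dataset item."""
    return [classify(example.review) for example in dataset]


def judge(example: Example, output: str) -> bool:
    """Stage 3: exact-match check, a simplified stand-in for a rubric."""
    return output.strip().lower() == example.expected


def report(dataset: list[Example], outputs: list[str]) -> dict:
    """Stage 4: aggregate per-item verdicts into summary numbers."""
    verdicts = [judge(ex, out) for ex, out in zip(dataset, outputs)]
    return {
        "total": len(verdicts),
        "passed": sum(verdicts),
        "accuracy": sum(verdicts) / len(verdicts),
        "failures": [ex.review for ex, ok in zip(dataset, verdicts) if not ok],
    }


outputs = run(DATASET, stub_classifier)
results = report(DATASET, outputs)
print(results)
```

Keeping the stages as separate functions means the stub can later be swapped for a real model call without touching the judge or the report.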
Knowledge check
- ✓LLMs are non-deterministic at temperature > 0 — the same prompt can produce different outputs on every call
- ✓Eyeballing a few outputs is not a reliable quality signal; structured evals are
- ✓The eval pipeline has four stages: dataset → run → judge → report
- ✓This session builds that pipeline from scratch, step by step