Session 9 · 02 · 20 min

Your First Eval

What you'll learn
  • Set up the Anthropic SDK with a temperature=0 chat helper
  • Build a 5-case hand-written dataset for sentiment classification
  • Write run_prompt() and verify outputs before adding a judge

Setup

Create a Session9/ directory with a virtual environment, install the Anthropic SDK, and add your API key to a .env file as ANTHROPIC_API_KEY. We use temperature=0 for the prompt under test so that results are as repeatable as possible from run to run, and reserve temperature > 0 for dataset generation later.

$ mkdir Session9 && cd Session9 && python -m venv .venv && source .venv/bin/activate && pip install anthropic python-dotenv
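
Before running anything that calls the API, it can help to confirm that the key is actually visible to Python. The check below is a minimal sketch: require_api_key is a hypothetical helper name, not part of the Anthropic SDK, and it reads from os.environ, so call load_dotenv() first if your key lives in .env.

```python
import os


def require_api_key(name="ANTHROPIC_API_KEY"):
    """Return the key stored in the environment variable `name`, or raise a clear error.

    Hypothetical helper for illustration; not part of the Anthropic SDK.
    """
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set. Add it to .env or export it in your shell.")
    return key
```

Failing early with a readable message is usually easier to debug than an authentication error raised from inside the first API call.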

The chat() helper

Rather than calling client.messages.create() directly every time, we wrap it in a small chat() helper that accepts messages, an optional system prompt, a temperature, and stop sequences. The rest of the session reuses this helper.

02_first_eval.py
import anthropic
import os
from dotenv import load_dotenv

load_dotenv()
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
model = "claude-opus-4-5"

def chat(messages, system=None, temperature=0, stop_sequences=None):  # ①
    """Send messages to the model and return the reply text."""
    params = {
        "model": model,
        "max_tokens": 1000,
        "messages": messages,
        "temperature": temperature,
    }
    # Add optional fields only when the caller provides them.
    if system:
        params["system"] = system
    if stop_sequences:
        params["stop_sequences"] = stop_sequences
    return client.messages.create(**params).content[0].text  # ②
① chat() accepts a list of message dicts, an optional system prompt, temperature (default 0), and stop sequences.
② It returns only the text string, not the full response object, to keep downstream code simple.
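
One way to check which parameters a given call would send, without spending an API request, is to factor the dictionary-building step into its own function. The sketch below mirrors the logic inside chat(). The name build_params is hypothetical and introduced only for illustration, and the system prompt and stop sequence in the usage lines are example values.

```python
MODEL = "claude-opus-4-5"  # same model string as in 02_first_eval.py


def build_params(messages, system=None, temperature=0, stop_sequences=None):
    """Assemble the keyword arguments that chat() passes to client.messages.create()."""
    params = {
        "model": MODEL,
        "max_tokens": 1000,
        "messages": messages,
        "temperature": temperature,
    }
    # Optional fields are added only when the caller supplies them.
    if system:
        params["system"] = system
    if stop_sequences:
        params["stop_sequences"] = stop_sequences
    return params


messages = [{"role": "user", "content": "Say hello."}]
plain = build_params(messages)
custom = build_params(messages, system="Be brief.", stop_sequences=["\n\n"])
print(sorted(plain))   # keys sent for a default call
print(sorted(custom))  # keys sent when system and stop_sequences are given
```

Comparing the two printed key lists shows that "system" and "stop_sequences" are left out entirely when they are not supplied.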

The dataset

A hand-written dataset is the foundation of a good eval. Each test case has three fields: task (the input you give the model), expected_label (the ground truth), and criteria (a plain-English description of what a correct answer looks like, used by the judge later).

02_first_eval.py (continued)
dataset = [
    {
        "task": "This laptop exceeded my expectations. Fast, quiet, and the battery lasts all day.",
        "expected_label": "positive",
        "criteria": "Label is positive. Reason mentions speed, noise, or battery life.",
    },
    {
        "task": "Arrived broken. Screen cracked. Support ignored my emails for two weeks.",
        "expected_label": "negative",
        "criteria": "Label is negative. Reason mentions damage or poor support.",
    },
    {
        "task": "It's a keyboard. Does what keyboards do.",
        "expected_label": "neutral",
        "criteria": "Label is neutral. Reason reflects lack of strong sentiment.",
    },
    {
        "task": "Oh great, another product that breaks on day one. Absolutely thrilled.",
        "expected_label": "negative",
        "criteria": "Label is negative despite sarcastic positive language. Reason detects sarcasm.",
    },
    {
        "task": "Works fine I guess. Nothing special. Charging is a bit slow but otherwise ok.",
        "expected_label": "neutral",
        "criteria": "Label is neutral. Reason captures mixed or lukewarm sentiment.",
    },
]
Quality over quantity
Five well-designed cases, including edge cases like sarcasm, will teach you more than fifty bland positive/negative reviews. When you expand your dataset in lesson 05, you will generate variety automatically.
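
Because every later step relies on the three fields described above, a quick structural check can catch typos in a hand-written dataset before any API call is made. The following is a sketch only. validate_dataset, REQUIRED_FIELDS, and ALLOWED_LABELS are hypothetical names, and the two sample cases are shortened stand-ins, one of them deliberately broken.

```python
REQUIRED_FIELDS = {"task", "expected_label", "criteria"}
ALLOWED_LABELS = {"positive", "negative", "neutral"}


def validate_dataset(dataset):
    """Return a list of human-readable problems; an empty list means no problems were found."""
    problems = []
    for i, case in enumerate(dataset):
        missing = REQUIRED_FIELDS - set(case)
        if missing:
            problems.append(f"case {i}: missing fields {sorted(missing)}")
        label = case.get("expected_label")
        if label is not None and label not in ALLOWED_LABELS:
            problems.append(f"case {i}: unexpected label {label!r}")
    return problems


sample = [
    {
        "task": "It's a keyboard. Does what keyboards do.",
        "expected_label": "neutral",
        "criteria": "Label is neutral.",
    },
    # Deliberately broken: no criteria field and a label outside the allowed set.
    {"task": "Arrived broken.", "expected_label": "bad"},
]
for problem in validate_dataset(sample):
    print(problem)
```

Running the same function on the full dataset list from 02_first_eval.py should return an empty list if every case is well-formed.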

The run_prompt() function

02_first_eval.py (continued)
def run_prompt(test_case):  # ①
    """Fill the prompt template with one test case and return the model's raw reply."""
    user_prompt = (
        "Classify the following product review as positive, negative, or neutral.\n\n"
        f"Review:\n{test_case['task']}\n\n"
        "Respond in this exact format:\n"
        "Label: <positive | negative | neutral>\n"
        "Reason: <one short sentence>\n"
    )
    messages = [{"role": "user", "content": user_prompt}]
    return chat(messages)  # ②

# Quick smoke test
for case in dataset:
    output = run_prompt(case)
    print(f"Expected: {case['expected_label']}\n{output}\n{'-' * 40}")
① run_prompt() wraps the prompt template and calls chat(). Keeping it separate makes it easy to swap prompts.
② chat() returns the raw model output as a string. The judge in lesson 03 will score it.
$ python 02_first_eval.py
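
The smoke test leaves the comparison to your eyes. Since the prompt asks for a "Label:" line, a small parser can turn that into a pass/fail check before any judge exists. This is a sketch that assumes the model follows the requested format. parse_label and label_matches are hypothetical helper names, and the sample strings below are hand-written, not real model responses.

```python
import re

# Matches a line such as "Label: negative", ignoring case and leading whitespace.
LABEL_PATTERN = re.compile(
    r"^\s*Label:\s*(positive|negative|neutral)\b",
    re.IGNORECASE | re.MULTILINE,
)


def parse_label(output):
    """Extract the label from a 'Label: ...' line, or return None if no such line exists."""
    match = LABEL_PATTERN.search(output)
    return match.group(1).lower() if match else None


def label_matches(output, expected_label):
    """Return True when the parsed label equals the expected one."""
    return parse_label(output) == expected_label


fake_output = "Label: Negative\nReason: The product arrived damaged."
print(parse_label(fake_output))                # negative
print(label_matches(fake_output, "negative"))  # True
print(parse_label("I think this is fine."))    # None
```

An exact-match check like this only covers the label. Whether the reason is sensible is left to the judge in lesson 03.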

Knowledge check

Knowledge Check
Why do we use temperature=0 for the prompt under test?
Code Check
What are the three fields every test case in this dataset contains?
Recap — what you just learned
  • The chat() helper wraps client.messages.create() and returns plain text
  • A dataset is a list of dicts with task, expected_label, and criteria fields
  • run_prompt() isolates the prompt template so you can swap it without touching the eval loop
  • Start small and deliberate: 5 well-chosen cases beat 50 repetitive ones
Next up: 03 — LLM-as-Judge Scoring