Your First Eval
- Set up the Anthropic SDK with a temperature=0 chat helper
- Build a 5-case hand-written dataset for sentiment classification
- Write run_prompt() and verify outputs before adding a judge
Setup
Create a Session9/ directory with a virtual environment, install the Anthropic SDK (plus python-dotenv, which the code below uses to load the key), and add your API key to a .env file as ANTHROPIC_API_KEY. We use temperature=0 for the prompt under test so results are as repeatable as possible, and reserve temperature > 0 for dataset generation later.
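A missing or empty key is the most common setup failure, so it can help to fail early with a clear message. The following is a minimal sketch, not part of the SDK: `require_api_key` is a hypothetical helper name, and it assumes the key is stored under ANTHROPIC_API_KEY as in the code used in this session.

```python
import os


def require_api_key(env=None, name="ANTHROPIC_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; `env` can be any mapping,
    which makes the check easy to exercise without touching os.environ.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return key


# Example with an explicit mapping instead of the real environment
print(require_api_key({"ANTHROPIC_API_KEY": "sk-test"}))  # sk-test
```

In a real script you would call `require_api_key()` right after `load_dotenv()` and before creating the client.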
The chat() helper
Rather than calling client.messages.create() directly every time, we wrap it in a small chat() helper that accepts messages, an optional system prompt, a temperature, and optional stop sequences, and returns the response text. The rest of the session uses this helper.
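The helper forwards the system prompt and stop sequences only when they are provided. That behaviour can be checked offline, without any API call, by isolating the parameter assembly in a pure function. This is a sketch for illustration: `build_params` is a hypothetical name, not an SDK function, and it takes the model as an argument so the block stays self-contained.

```python
def build_params(model, messages, system=None, temperature=0, stop_sequences=None):
    """Assemble keyword arguments for a Messages API call (hypothetical helper)."""
    params = {
        "model": model,
        "max_tokens": 1000,
        "messages": messages,
        "temperature": temperature,
    }
    # Optional fields are added only when the caller supplies them
    if system:
        params["system"] = system
    if stop_sequences:
        params["stop_sequences"] = stop_sequences
    return params


messages = [{"role": "user", "content": "Classify this review: Great laptop."}]

plain = build_params("claude-opus-4-5", messages)
print(sorted(plain))  # ['max_tokens', 'messages', 'model', 'temperature']

with_extras = build_params(
    "claude-opus-4-5",
    messages,
    system="You are a strict sentiment classifier.",
    stop_sequences=["\n\n"],
)
print("system" in with_extras, "stop_sequences" in with_extras)  # True True
```

Keeping the defaults at temperature=0 means every caller gets the repeatable setting unless it explicitly asks for something else.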
import anthropic
import os
from dotenv import load_dotenv

load_dotenv()  # read ANTHROPIC_API_KEY from the .env file
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
model = "claude-opus-4-5"


def chat(messages, system=None, temperature=0, stop_sequences=None):
    # Parameters sent with every request
    params = {
        "model": model,
        "max_tokens": 1000,
        "messages": messages,
        "temperature": temperature,
    }
    # Optional parameters are only sent when provided
    if system:
        params["system"] = system
    if stop_sequences:
        params["stop_sequences"] = stop_sequences
    # Return the text of the first content block
    return client.messages.create(**params).content[0].text

The dataset
A hand-written dataset is the foundation of a good eval. Each test case has three fields: task (the input you give the model), expected_label (the ground truth), and criteria (a plain-English description of what a correct answer looks like, used by the judge later).
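Because every later step relies on these three fields, a quick structural check can catch typos in hand-written cases. The sketch below is one possible approach, not part of the original material: `validate_case`, `REQUIRED_FIELDS`, and `ALLOWED_LABELS` are hypothetical names, and the allowed labels are assumed to be the three classes used in this session (positive, negative, neutral).

```python
REQUIRED_FIELDS = ("task", "expected_label", "criteria")
ALLOWED_LABELS = {"positive", "negative", "neutral"}  # assumed label set


def validate_case(case):
    """Return a list of problems found in one test case (empty if valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = case.get(field)
        # Each field must be a non-empty string
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    label = case.get("expected_label")
    if isinstance(label, str) and label.strip() and label not in ALLOWED_LABELS:
        problems.append(f"unknown label: {label}")
    return problems


sample = {
    "task": "It's a keyboard. Does what keyboards do.",
    "expected_label": "neutral",
    "criteria": "Label is neutral. Reason reflects lack of strong sentiment.",
}
print(validate_case(sample))  # []
```

Running the validator over the whole list before the first API call keeps malformed cases from showing up later as confusing eval failures.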
dataset = [
    {
        "task": "This laptop exceeded my expectations. Fast, quiet, and the battery lasts all day.",
        "expected_label": "positive",
        "criteria": "Label is positive. Reason mentions speed, noise, or battery life.",
    },
    {
        "task": "Arrived broken. Screen cracked. Support ignored my emails for two weeks.",
        "expected_label": "negative",
        "criteria": "Label is negative. Reason mentions damage or poor support.",
    },
    {
        "task": "It's a keyboard. Does what keyboards do.",
        "expected_label": "neutral",
        "criteria": "Label is neutral. Reason reflects lack of strong sentiment.",
    },
    {
        "task": "Oh great, another product that breaks on day one. Absolutely thrilled.",
        "expected_label": "negative",
        "criteria": "Label is negative despite sarcastic positive language. Reason detects sarcasm.",
    },
    {
        "task": "Works fine I guess. Nothing special. Charging is a bit slow but otherwise ok.",
        "expected_label": "neutral",
        "criteria": "Label is neutral. Reason captures mixed or lukewarm sentiment.",
    },
]

The run_prompt() function
def run_prompt(test_case):
    # Build the prompt under test from a single test case
    user_prompt = (
        "Classify the following product review as positive, negative, or neutral.\n\n"
        f"Review:\n{test_case['task']}\n\n"
        "Respond in this exact format:\n"
        "Label: <positive | negative | neutral>\n"
        "Reason: <one short sentence>\n"
    )
    messages = [{"role": "user", "content": user_prompt}]
    return chat(messages)  # temperature defaults to 0


# Quick smoke test: print each output next to its expected label
for case in dataset:
    output = run_prompt(case)
    print(f"Expected: {case['expected_label']}\n{output}\n{'-' * 40}")

Knowledge check
- ✓ The chat() helper wraps client.messages.create() and returns plain text
- ✓ A dataset is a list of dicts with task, expected_label, and criteria fields
- ✓ run_prompt() isolates the prompt template so you can swap it without touching the eval loop
- ✓ Start small and deliberate: 5 well-chosen cases beat 50 repetitive ones
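Since run_prompt() asks for a fixed "Label: ... / Reason: ..." format, the smoke-test outputs can also be checked against expected_label in code rather than by eye. The following is a minimal sketch under stated assumptions: `extract_label` and `label_matches` are hypothetical helpers, and the sample string is hand-written for illustration, not a real model response.

```python
import re

# Matches a line such as "Label: negative" (case-insensitive)
LABEL_RE = re.compile(
    r"^\s*Label:\s*(positive|negative|neutral)\b",
    re.IGNORECASE | re.MULTILINE,
)


def extract_label(output):
    """Return the lowercase label found in the output, or None if absent."""
    match = LABEL_RE.search(output)
    return match.group(1).lower() if match else None


def label_matches(output, expected_label):
    """True when the parsed label equals the expected label."""
    return extract_label(output) == expected_label


# Hand-written stand-in for a model response
sample_output = "Label: negative\nReason: The praise is sarcastic."
print(extract_label(sample_output))              # negative
print(label_matches(sample_output, "negative"))  # True
```

An output that does not follow the requested format yields None, which is itself a useful signal that the prompt needs tightening before a judge is added.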