Session 9· 05· 20 min

Generating Eval Datasets

What you'll learn
  • Understand why hand-writing datasets does not scale beyond ~20 cases
  • Build generate_triage_dataset() to create customer-support test cases at scale
  • Cache generated datasets to disk and use the REGENERATE flag to control re-generation

Why generation?

Crafting test cases by hand is thorough but slow. A team can realistically write 20-30 good cases in an afternoon. But a robust eval suite for a real product needs hundreds — covering rare edge cases, unusual phrasings, and language variations that humans would never think to include. LLM-generated datasets fill that gap, with the caveat that a human must review them.

The new task: customer-support triage
For this lesson we switch from simple sentiment labels to a harder structured-output task: classify a customer support ticket into category, urgency, summary, and action_items. This is more representative of real production evals.

The triage task schema

05_dataset_generation.py
# Expected output shape for each ticket
# {
#   "category":     "billing" | "technical" | "shipping" | "general",
#   "urgency":       "low" | "medium" | "high" | "critical",
#   "summary":       "one sentence describing the issue",
#   "action_items":  ["step 1", "step 2", ...]
# }

generate_triage_dataset()

05_dataset_generation.py (continued)
def generate_triage_dataset(n=10):
    prompt = (
        "Generate a dataset of customer support tickets for an e-commerce platform.\n"
        f"Create {n} diverse tickets covering: billing disputes, technical bugs, "
        "shipping problems, and general enquiries.\n\n"
        "For each ticket, return a dict with:\n"
        "  ticket: str  (the raw customer message)\n"
        "  expected: dict with keys category, urgency, summary, action_items\n"
        "  criteria: str (plain English rubric for judging model output)\n\n"
        "Include edge cases: polite but critical urgency, angry but low-priority, "
        "ambiguous category.\n"
        "Return a JSON array of these dicts."
    )
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": "```json"},              ①
    ]
    text = chat(messages, temperature=1.0, stop_sequences=["```"])  ②
    return json.loads(text)
Prefill + stop_sequences gives us clean JSON — same trick as the judge prompt.
temperature=1.0 ensures variety. Without it, all generated tickets sound similar.

Caching to disk and the REGENERATE pattern

Generating a large dataset on every script run wastes API credits. Instead, save the generated dataset to a JSON file on first run and reload it on subsequent runs. Set REGENERATE=True only when you deliberately want fresh data.

05_dataset_generation.py (continued)
import json, os

DATASET_FILE = "triage_dataset.json"
REGENERATE   = False                                              ①

if REGENERATE or not os.path.exists(DATASET_FILE):               ②
    print("Generating dataset...")
    dataset = generate_triage_dataset(n=20)
    with open(DATASET_FILE, "w") as f:
        json.dump(dataset, f, indent=2)
    print(f"Saved {len(dataset)} cases to {DATASET_FILE}")
else:
    with open(DATASET_FILE) as f:
        dataset = json.load(f)
    print(f"Loaded {len(dataset)} cases from {DATASET_FILE}")
Set REGENERATE = True to force fresh generation; False to load the cached file.
The condition generates only when forced or when the file does not exist yet.
Review is not optional
Generated cases occasionally have wrong expected labels or vague criteria. Before running your eval, skim every generated case and fix any that look off. Bad labels corrupt your eval signal — the judge will score based on wrong ground truth.
$ python 05_dataset_generation.py

Knowledge check

Knowledge Check
What is the primary reason for caching the generated dataset to disk?
Code Check
Why is temperature=1.0 used when generating the dataset but temperature=0 used for the prompt under test?
Recap — what you just learned
  • Hand-writing scales to ~20 cases; generation scales to hundreds
  • Use temperature=1.0 for generation to maximise variety
  • Cache to JSON on first run; set REGENERATE=True only when you want fresh data
  • Always review generated cases before adding them to your eval dataset
Next up: 06 — Evaluating Structured Output