Session 9 · 07 · 15 min
Adversarial Test Cases
What you'll learn
- Understand what makes a test case adversarial for a triage model
- Build generate_adversarial_cases() using targeted generation techniques
- Merge adversarial cases into the existing dataset and re-run the eval
What is an adversarial test case?
An adversarial test case is designed to expose a specific weakness in your prompt. For a triage model, adversarial cases include tickets that sound calm but are actually critical, sarcastic complaints that look positive, tickets that fit two categories equally well, and very short or very long tickets. These cases fail where typical ones pass.
Adversarial != malicious
In the eval context "adversarial" means "designed to challenge your model", not "designed to jailbreak it". These are realistic edge cases that real users will send — you just curate them deliberately.
Adversarial generation techniques
- Misleading signals: polite tone + critical urgency (e.g. "Hi, no rush but my payment was charged three times")
- Calm urgency: low-emotion language describing a severe problem (e.g. "Our entire checkout is down")
- Sarcastic complaints: "What a wonderful experience — the third time this month my order vanished"
- Category ambiguity: a billing dispute caused by a technical bug — which wins?
- Extreme length: one-word ticket ("Broken") vs multi-paragraph essay about a minor delay
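The techniques above can also be written by hand as a few seed cases before any generation. The following is a minimal sketch, assuming the dataset uses the task / expected / criteria fields that appear later in this lesson. The "technique" tag and the label values ("billing", "technical", "orders", "critical", "high") are hypothetical and should be replaced with your own label set.

```python
from collections import Counter

# Hand-written seed cases, one per technique.
# ASSUMPTION: the "technique" key and all label values are illustrative only.
SEED_CASES = [
    {
        "technique": "misleading_signals",
        "task": "Hi, no rush but my payment was charged three times",
        "expected": {"category": "billing", "urgency": "critical"},
        "criteria": "Polite tone must not lower the urgency of a triple charge.",
    },
    {
        "technique": "calm_urgency",
        "task": "Our entire checkout is down",
        "expected": {"category": "technical", "urgency": "critical"},
        "criteria": "Low-emotion wording still describes a severe outage.",
    },
    {
        "technique": "sarcasm",
        "task": "What a wonderful experience, the third time this month my order vanished",
        "expected": {"category": "orders", "urgency": "high"},
        "criteria": "Positive words must be read as a complaint about a repeated failure.",
    },
]


def count_by_technique(cases):
    """Count cases per technique tag to spot techniques with no coverage."""
    return Counter(case["technique"] for case in cases)


if __name__ == "__main__":
    for technique, count in count_by_technique(SEED_CASES).items():
        print(f"{technique}: {count}")
```

Tagging each case with its technique makes it easier to see later which kind of weakness a failing case points to.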
generate_adversarial_cases()
07_adversarial.py
import json

# chat() is assumed to be the helper defined earlier in this session.


def generate_adversarial_cases(existing_dataset, n=5):
    """Ask the model for n adversarial triage cases, seeded with normal examples."""
    # Show the model a few normal cases so it matches the dataset's format
    examples = json.dumps(
        [{"ticket": c["task"], "expected": c["expected"]} for c in existing_dataset[:3]],
        indent=2,
    )
    prompt = (
        "You are a red-team engineer designing adversarial test cases for a customer-support "
        "triage model.\n\n"
        f"Here are 3 normal cases for reference:\n{examples}\n\n"
        f"Generate {n} ADVERSARIAL cases that are likely to trip up a naive triage model. "
        "Use these techniques:\n"
        "- Polite language hiding critical urgency\n"
        "- Sarcasm that makes a complaint sound positive\n"
        "- Tickets that fit two categories equally well\n"
        "- Calm descriptions of severe outages\n\n"
        "Return a JSON array. Each element must have:\n"
        "  task: str (the ticket text)\n"
        "  expected: dict (category, urgency, summary, action_items)\n"
        "  criteria: str (why this is adversarial and what a good answer must do)"
    )
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": "```json"},  # ①
    ]
    text = chat(messages, temperature=1.0, stop_sequences=["```"])  # ②
    return json.loads(text)

① Prefill + stop for JSON extraction, consistent with the rest of the session.
② temperature=1.0 encourages diverse and surprising adversarial cases; it does not guarantee them, so review the output.
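Because the cases come from a model at high temperature, the parsed JSON may not always match the requested schema. The following is a minimal sketch of a structural check run before human review, assuming the field names requested in the prompt above. validate_cases() is a hypothetical helper, not part of the session's code.

```python
# Field names taken from the generation prompt above.
REQUIRED_CASE_KEYS = {"task", "expected", "criteria"}
REQUIRED_EXPECTED_KEYS = {"category", "urgency", "summary", "action_items"}


def validate_cases(cases):
    """Split generated cases into (valid, problems).

    problems is a list of (index, message) tuples describing why a case
    was rejected. Only structure is checked here, not label quality.
    """
    if not isinstance(cases, list):
        return [], [(None, "top-level JSON value is not an array")]

    valid, problems = [], []
    for index, case in enumerate(cases):
        if not isinstance(case, dict):
            problems.append((index, "case is not an object"))
            continue
        missing = REQUIRED_CASE_KEYS - set(case)
        if missing:
            problems.append((index, f"missing keys: {sorted(missing)}"))
            continue
        expected = case["expected"]
        if not isinstance(expected, dict):
            problems.append((index, "expected is not an object"))
            continue
        missing_expected = REQUIRED_EXPECTED_KEYS - set(expected)
        if missing_expected:
            problems.append((index, f"expected is missing: {sorted(missing_expected)}"))
            continue
        valid.append(case)
    return valid, problems
```

A check like this only catches malformed cases. Whether the expected labels are actually correct still needs a human reviewer.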
Adding to the dataset and re-running
07_adversarial.py (continued)
adversarial = generate_adversarial_cases(dataset, n=5)

# Always review before merging
for case in adversarial:
    print(json.dumps(case, indent=2))
    print("---")

response = input("Add these cases to the dataset? (y/n): ")
if response.lower() == "y":
    dataset.extend(adversarial)
    with open(DATASET_FILE, "w") as f:
        json.dump(dataset, f, indent=2)
    print(f"Dataset now has {len(dataset)} cases")

$ python 07_adversarial.py
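Running the script more than once can append near-identical tickets to the dataset. The following is a minimal sketch of a merge step that skips cases whose ticket text is already present, assuming each case stores its ticket under "task" as in the listing above. merge_new_cases() is a hypothetical helper that could stand in for dataset.extend().

```python
def merge_new_cases(dataset, new_cases):
    """Return a new list: dataset plus any new_cases with unseen ticket text.

    Tickets are compared case-insensitively with whitespace collapsed, so
    "Broken" and "  broken " count as the same ticket.
    """
    def normalize(text):
        return " ".join(text.lower().split())

    seen = {normalize(case["task"]) for case in dataset}
    merged = list(dataset)  # copy so the caller's list is not mutated
    for case in new_cases:
        key = normalize(case["task"])
        if key in seen:
            continue
        seen.add(key)
        merged.append(case)
    return merged
```

Exact-text matching will not catch paraphrased duplicates, so this complements the manual review step and does not replace it.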
Where scores drop is where to improve your prompt
Re-run the eval after merging adversarial cases. The prompt has not changed, so a drop in the overall score comes mostly from the new cases. Each adversarial case that FAILs marks a weakness in the prompt, so fix those first.
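To see where the score drops, it can help to report pass rates separately for original and adversarial cases. The following is a minimal sketch, assuming each eval result is a dict with a boolean "passed" field and an "adversarial" flag. Both field names are hypothetical, so adapt them to whatever your eval runner actually returns.

```python
def pass_rate_by_group(results):
    """Return {"original": rate, "adversarial": rate} for the groups present.

    ASSUMPTION: each result has "passed" (bool) and optionally
    "adversarial" (bool, treated as False when missing).
    """
    totals = {}
    for result in results:
        group = "adversarial" if result.get("adversarial") else "original"
        passed, count = totals.get(group, (0, 0))
        totals[group] = (passed + int(result["passed"]), count + 1)
    return {group: passed / count for group, (passed, count) in totals.items()}


def failing_adversarial(results):
    """List the adversarial results that failed, i.e. the cases to fix first."""
    return [r for r in results if r.get("adversarial") and not r["passed"]]
```

A large gap between the two rates suggests the prompt handles typical tickets well but relies on surface cues such as tone.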
Knowledge check
What is the primary purpose of adding adversarial cases to an eval dataset?
Which of these is an example of a "misleading signals" adversarial case?
Recap — what you just learned
- Adversarial cases target specific model weaknesses: misleading tone, sarcasm, category ambiguity
- generate_adversarial_cases() uses a targeted red-team prompt at temperature=1.0
- Always review adversarial cases before merging, because bad ground-truth labels corrupt the eval
- Score drops after merging adversarial cases show where to improve your prompt