Session 9· 9 questions

End-of-session Quiz

What you'll learn
  • Confirm you understand eval datasets, LLM-as-Judge, and dual grading
  • Identify any gaps before moving to Session 10

9 questions covering eval fundamentals, dataset generation, and dual grading. Aim for 7+ before moving to Session 10.

Phase 1 — Eval Fundamentals

Knowledge Check
Why is eyeballing LLM outputs insufficient for evaluating prompt changes?
Knowledge Check
What three fields should every eval test case have?
Code Check
What is the prefill + stop_sequences trick used for?

Phase 2 — Scaling Datasets

Knowledge Check
When should you generate eval datasets with Claude instead of hand-writing them?
Knowledge Check
Why use temperature=1.0 for dataset generation but temperature=0 for eval runs?
Knowledge Check
What is the purpose of adversarial test cases?

Phase 3 — Code Generation Evals

Code Check
What does grade_syntax() check that grade_by_model() cannot?
Code Check
How is the blended score calculated in dual grading?
Knowledge Check
What does the HTML eval report show that raw JSON results do not?
Scored 7+ out of 9?
You are ready for Session 10 — Prompt Engineering. If you missed more than 2, revisit the matching lessons before moving on.