Fasttrack · 10 · 20 min

Evals & Prompt Engineering

What you'll learn
  • Engineer prompts at three levels: vague, specific, and few-shot
  • Build a mini eval harness with labeled test cases
  • Use LLM-as-Judge for open-ended evaluation
You cannot improve what you cannot measure — evals are how you know your prompts are actually working. Without an eval harness, prompt changes are guesswork. With one, every change is an experiment with a measurable outcome.
Three levels of prompt engineering: vague → specific → few-shot

Level 1: Vague prompt

A vague prompt gives the model too much freedom. For a classification task, the model might return "This review is positive", "Positive!", "I think this is a positive review", or "POSITIVE": four replies that all carry the right answer, yet each needs different parsing. Because an eval compares the reply against a fixed label, most of these count as misses, which is why accuracy on structured tasks is typically around 25% with vague prompts.

prompt_levels.py
# Level 1: Vague, unpredictable output format
# Assumes `model` is a chat model already set up as in the earlier lessons
vague_prompt = 'Classify this review: "{review}"'

response = model.invoke(vague_prompt.format(review='The product broke after one day.'))
print(response.content)
# Might return: "This review is negative and expresses disappointment."
# Or: "NEGATIVE"
# Or: "Negative sentiment detected."
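
The parsing burden described above can be made concrete. The sketch below uses a hypothetical helper, `normalize_label`, which is not part of the lesson's code. It assumes the three labels used by the later prompts in this lesson and pulls the first one it finds out of a free-form reply.

```python
import re
from typing import Optional

# Assumption: the label set matches the one used by the later prompts in this lesson.
LABELS = ("POSITIVE", "NEGATIVE", "NEUTRAL")
_LABEL_RE = re.compile(r"\b(" + "|".join(LABELS) + r")\b", re.IGNORECASE)


def normalize_label(raw: str) -> Optional[str]:
    """Return the first known label found in a free-form reply, or None."""
    match = _LABEL_RE.search(raw)
    return match.group(1).upper() if match else None


replies = [
    "This review is negative and expresses disappointment.",
    "NEGATIVE",
    "Negative sentiment detected.",
    "The customer seems unhappy.",  # no label word at all
]
for reply in replies:
    print(repr(reply), "->", normalize_label(reply))
```

A helper like this is a workaround rather than a fix: a reply such as "not positive" would still be read as POSITIVE, and a reply with no label word yields None. Constraining the output in the prompt itself, as the next level does, removes most of the need for it.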

Level 2: Specific prompt

Adding explicit constraints — the allowed output values and a strict format instruction — dramatically improves accuracy and consistency. The model stops guessing about format and focuses on the classification itself. Accuracy typically jumps to around 88%.

prompt_levels.py
# Level 2: Specific — constrained output format
specific_prompt = (
    'Classify the sentiment of the following review.\n'
    'Respond with exactly one word: POSITIVE, NEGATIVE, or NEUTRAL.\n'
    'Do not include any other text, punctuation, or explanation.\n\n'
    'Review: "{review}"'
)

response = model.invoke(specific_prompt.format(review='The product broke after one day.'))
print(response.content)  # Typically "NEGATIVE"; far more consistent, though not guaranteed

Level 3: Few-shot prompt

A few-shot prompt includes two or three input-output examples before the real task. The model pattern-matches to your examples and tends to mirror their format closely. This is especially valuable for ambiguous edge cases, because your examples implicitly define how to handle them.

prompt_levels.py
# Level 3: Few-shot, the model follows the format of your examples
few_shot_prompt = (
    'Classify sentiment. Reply with only POSITIVE, NEGATIVE, or NEUTRAL.\n\n'
    'Review: "Arrived on time and worked perfectly."'
    '\nLabel: POSITIVE\n\n'
    'Review: "Completely useless, stopped working in an hour."'
    '\nLabel: NEGATIVE\n\n'
    "Review: \"It's fine, does what it says on the tin.\""
    '\nLabel: NEUTRAL\n\n'
    'Review: "{review}"\nLabel:'
)

response = model.invoke(few_shot_prompt.format(review='The product broke after one day.'))
print(response.content)  # Typically "NEGATIVE", in the same bare-label format as the examples
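
When the example set grows or changes, it can be easier to assemble the few-shot prompt from data than to maintain one long string literal. The following is a minimal sketch, not the lesson's own code: `EXAMPLES`, `INSTRUCTION`, and `build_few_shot_prompt` are hypothetical names, and the three example pairs are the ones from the prompt above.

```python
from typing import Sequence, Tuple

# The same three labeled examples used in the hand-written prompt above.
EXAMPLES: Tuple[Tuple[str, str], ...] = (
    ("Arrived on time and worked perfectly.", "POSITIVE"),
    ("Completely useless, stopped working in an hour.", "NEGATIVE"),
    ("It's fine, does what it says on the tin.", "NEUTRAL"),
)

INSTRUCTION = "Classify sentiment. Reply with only POSITIVE, NEGATIVE, or NEUTRAL."


def build_few_shot_prompt(
    review: str, examples: Sequence[Tuple[str, str]] = EXAMPLES
) -> str:
    """Assemble the instruction, the labeled examples, and the unlabeled review."""
    blocks = [INSTRUCTION]
    for text, label in examples:
        blocks.append(f'Review: "{text}"\nLabel: {label}')
    # The final block ends with an empty "Label:" for the model to complete.
    blocks.append(f'Review: "{review}"\nLabel:')
    return "\n\n".join(blocks)


prompt = build_few_shot_prompt("The product broke after one day.")
print(prompt)
```

Building the string with f-strings also means example text containing `{` or `}` needs no escaping, which it would inside a `str.format` template.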
Level 1: Vague
~25% accuracy
  • "Classify this review"
  • Model guesses the output format
  • Hard to parse programmatically
  • Inconsistent across runs
Level 2: Specific
~88% accuracy
  • Explicit allowed values listed
  • "Reply with only one word"
  • Consistent, parseable output
  • Much easier to evaluate automatically
Level 3: Few-shot
~88% accuracy
  • 2-3 input/output examples included
  • Model mimics example format exactly
  • Handles edge cases via example
  • Best for ambiguous or nuanced labels

Building an eval harness

An eval harness: test cases → LLM → judge → score report
eval_harness.py
DATASET = [                                                  # ①
    {'review': 'Works perfectly, very happy.', 'label': 'POSITIVE'},
    {'review': 'Broke on day one.', 'label': 'NEGATIVE'},
    {'review': 'Arrived quickly and as described.', 'label': 'POSITIVE'},
    {'review': 'Too expensive for what you get.', 'label': 'NEGATIVE'},
    {'review': 'Average product, nothing special.', 'label': 'NEUTRAL'},
    {'review': 'Excellent build quality.', 'label': 'POSITIVE'},
    {'review': 'Stopped working after a week.', 'label': 'NEGATIVE'},
    {'review': 'Does the job. Nothing more.', 'label': 'NEUTRAL'},
]

def classify(review: str) -> str:                             # ②
    response = model.invoke(specific_prompt.format(review=review))
    return response.content.strip().upper()

def evaluate(dataset: list) -> dict:                          # ③
    correct = 0
    for case in dataset:
        prediction = classify(case['review'])
        if prediction == case['label']:
            correct += 1
        else:
            print(f'WRONG: "{case["review"]}" -> got {prediction}, expected {case["label"]}')
    accuracy = correct / len(dataset) if dataset else 0.0  # avoid dividing by zero on an empty dataset
    return {'accuracy': accuracy, 'correct': correct, 'total': len(dataset)}

results = evaluate(DATASET)
print(f'Accuracy: {results["accuracy"]:.0%} ({results["correct"]}/{results["total"]})')
① DATASET contains labeled examples — the ground truth your prompts must match
② classify() runs your prompt and returns the prediction — swap prompts here to compare them
③ evaluate() runs every test case and computes accuracy — the single number you optimize
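
To compare prompts side by side, one option is to pass the classifier into the harness as a function instead of hard-coding it. The sketch below is an illustration under assumptions, not the lesson's implementation: `evaluate_with` is a hypothetical variant of `evaluate()` that also reports per-label accuracy, and the two stand-in classifiers are plain Python rules so the example runs without a model or API key. In practice each stand-in would be a function that calls the model with a different prompt.

```python
from typing import Callable, Dict, List

# A small labeled set in the same shape as DATASET above.
CASES: List[Dict[str, str]] = [
    {"review": "Works perfectly, very happy.", "label": "POSITIVE"},
    {"review": "Broke on day one.", "label": "NEGATIVE"},
    {"review": "Average product, nothing special.", "label": "NEUTRAL"},
    {"review": "Stopped working after a week.", "label": "NEGATIVE"},
]


def evaluate_with(
    classify_fn: Callable[[str], str], cases: List[Dict[str, str]]
) -> Dict[str, float]:
    """Score one classifier: overall accuracy plus accuracy per label."""
    if not cases:
        raise ValueError("cases must not be empty")
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for case in cases:
        label = case["label"]
        totals[label] = totals.get(label, 0) + 1
        if classify_fn(case["review"]) == label:
            hits[label] = hits.get(label, 0) + 1
    report = {"overall": sum(hits.values()) / len(cases)}
    for label, total in totals.items():
        report[label] = hits.get(label, 0) / total
    return report


# Stand-in classifiers (hypothetical) in place of real model calls.
def always_negative(review: str) -> str:
    return "NEGATIVE"


def keyword_rules(review: str) -> str:
    text = review.lower()
    if "broke" in text or "stopped" in text:
        return "NEGATIVE"
    if "perfectly" in text or "happy" in text:
        return "POSITIVE"
    return "NEUTRAL"


for name, fn in [("always_negative", always_negative), ("keyword_rules", keyword_rules)]:
    print(name, evaluate_with(fn, CASES))
```

The per-label breakdown matters because a single accuracy number can hide a classifier that never predicts one of the classes, as `always_negative` shows.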

LLM-as-Judge

For open-ended tasks like text summarization, code review, or creative writing, exact string matching is useless as an eval metric. A good summary and a great summary both differ from the reference, so string equality cannot tell them apart. LLM-as-Judge uses a separate LLM call to score outputs against a rubric. Judge scores often track human ratings reasonably well, especially with a clear rubric, and the approach scales to volumes where manual review is impractical.

llm_judge.py
def llm_judge(question: str, response: str, reference: str) -> int:
    """Score a response from 1 (poor) to 5 (excellent) using an LLM judge."""
    judge_prompt = (
        'You are an expert evaluator. Score the response on a scale of 1-5.\n\n'
        f'Question: {question}\n'
        f'Reference answer: {reference}\n'
        f'Response to score: {response}\n\n'
        'Scoring rubric:\n'
        '5 = Complete, accurate, clear, well-structured\n'
        '4 = Mostly correct with minor gaps\n'
        '3 = Partially correct but missing key points\n'
        '2 = Mostly incorrect or unclear\n'
        '1 = Wrong or irrelevant\n\n'
        'Reply with only the integer score (1-5).'
    )
    result = judge_model.invoke(judge_prompt)
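    text = result.content.strip()
    # Fail loudly if the judge adds words or returns a score outside 1-5
    if text not in {'1', '2', '3', '4', '5'}:
        raise ValueError(f'Judge returned an unexpected score: {text!r}')
    return int(text)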

score = llm_judge(
    question='What is RAG?',
    response=summary_to_evaluate,  # the model output you want graded, defined elsewhere
    reference='RAG combines retrieval of relevant documents with LLM generation.',
)
print(f'Quality score: {score}/5')
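
Judges do not always honor "reply with only the integer", so it may help to parse the reply defensively before converting it. The helper below, `parse_score`, is a hypothetical sketch and not part of the lesson's code. It assumes the 1-5 scale from the rubric above, accepts a few common wrappers such as "Score: 4" or "4/5", and returns None for anything else instead of guessing.

```python
import re
from typing import Optional

# Accepts "4", "Score: 4", "4/5", "score = 4/5." and nothing looser.
_SCORE_RE = re.compile(
    r"^\s*(?:score\s*[:=]?\s*)?([1-5])(?:\s*/\s*5)?\s*\.?\s*$",
    re.IGNORECASE,
)


def parse_score(raw: str) -> Optional[int]:
    """Return the 1-5 score from a judge reply, or None if it is ambiguous."""
    match = _SCORE_RE.match(raw)
    return int(match.group(1)) if match else None


for reply in ["4", "Score: 4/5", " 3\n", "On a scale of 1-5, this is a 3", "7"]:
    print(repr(reply), "->", parse_score(reply))
```

Returning None for ambiguous replies lets the caller decide whether to retry the judge call or drop the case, which is usually safer than recording a score that may be wrong.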

The complete Fasttrack journey

All 10 lessons at a glance
01 Call: invoke & stream
02 Providers: any LLM
03 Structured: Pydantic
04 Memory: history
05 Tools: @tool loop
06 RAG: PDF QA
07 Agent: ReAct loop
08 AgentMem: checkpointer
09 Patterns: orchestration
10 Evals: measure & improve
You now have the building blocks for almost any LangChain application. You can call any LLM, swap providers instantly, extract typed data, give agents memory, build tool-using agents, answer questions about documents, orchestrate multiple agents, and measure whether your prompts are working. Dive deeper into any specific session for hands-on practice with each topic.
Recap — what you just learned
  • Prompt engineering has three levels: vague (≈25%), specific (≈88%), few-shot (≈88%)
  • An eval harness with labeled test cases gives you a single accuracy number to optimize
  • LLM-as-Judge enables evaluation of open-ended tasks that cannot be exact-matched
  • You now have all ten building blocks, from LLM calls through multi-agent orchestration to evals