Session 05 · 10 min

How to Choose an LLM for Your Use Case

What you'll learn
  • Pick a model based on latency, quality, and cost trade-offs
  • Match a use case to a provider tier (small/fast vs large/smart)
  • Walk through a decision flow instead of guessing

There is no "best" model — only the best model for your constraints. The three knobs you trade off are quality, latency, and cost. You cannot maximise all three at once.

The three tiers inside every provider

Every frontier provider organises its line-up into roughly three tiers. Learn the tier names and 90% of model selection becomes automatic.

Small / Fast
gpt-4o-mini · haiku · gemini-flash
  • Sub-second latency
  • 10–30x cheaper than flagship
  • Handles routing, classification, extraction
  • Good enough for most prototyping
  • Use when: high volume or chatbots
Mid / Balanced
gpt-4o · sonnet · gemini-pro
  • 1–3 second latency
  • Strong general reasoning
  • Handles multi-step tool calls
  • The default starting point
  • Use when: real products, normal traffic
Large / Smart
o1 · opus · gemini-pro-reasoning
  • 5–30 second latency
  • Deep reasoning, math, novel problems
  • 5–30x more expensive
  • Use sparingly — as a fallback
  • Use when: quality is everything
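The three tiers above can be captured as a small lookup table. A minimal sketch — the model names come from the list above, but the latency ranges and relative-cost multipliers are illustrative placeholders, not quoted prices; always check each provider's current pricing page.

```python
# Illustrative tier registry. Latencies and cost multipliers are rough,
# assumed figures for comparison only — not real provider pricing.
TIERS = {
    "small/fast": {
        "models": ["gpt-4o-mini", "haiku", "gemini-flash"],
        "typical_latency_s": (0.2, 1.0),
        "relative_cost": 1,    # baseline
    },
    "mid/balanced": {
        "models": ["gpt-4o", "sonnet", "gemini-pro"],
        "typical_latency_s": (1.0, 3.0),
        "relative_cost": 15,   # assumed ~10-30x the small tier
    },
    "large/smart": {
        "models": ["o1", "opus", "gemini-pro-reasoning"],
        "typical_latency_s": (5.0, 30.0),
        "relative_cost": 150,  # assumed 5-30x the mid tier again
    },
}

def models_in(tier: str) -> list[str]:
    """Return the example model names for a tier."""
    return TIERS[tier]["models"]
```

A registry like this keeps tier choices in one place, so swapping models later is a one-line change.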

Decision flow — match your use case to a tier

Q1. Does the task need deep reasoning or long chain-of-thought?
If yes
Start with a Large/Smart reasoning model (o1, opus, gemini-pro-reasoning)
If no
Start with a Mid/Balanced model (gpt-4o, sonnet, gemini-pro)
Q2. Will you run 100+ requests per minute in production?
If yes
Downgrade to the Small/Fast tier (mini, haiku, flash) — cost scales fast at volume
If no
Keep the tier you picked above
Q3. Does the user wait on the response in real time?
If yes
Prefer Small/Fast tier OR enable streaming so first token arrives < 1s
If no (batch / async)
Latency is free — optimise for quality
Q4. Does the context ever exceed 100k tokens?
If yes
Prefer Gemini (2M) or Claude (200k); truncate or RAG otherwise
If no
Any provider works — pick on quality and cost
Q5. Is user data sensitive / regulated?
If yes
Use enterprise tier with zero-retention OR self-host open-source
If no
Standard API tier is fine
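The five questions above are mechanical enough to encode directly. A sketch of the flow as a function — the tier names match the taxonomy above; the note strings are paraphrases of the answers, not provider guidance:

```python
def choose_tier(needs_deep_reasoning: bool,
                high_volume: bool,      # Q2: 100+ requests/minute?
                user_waits: bool,       # Q3: real-time response?
                long_context: bool,     # Q4: ever > 100k tokens?
                sensitive_data: bool):  # Q5: regulated data?
    """Encode the Q1-Q5 decision flow; returns a tier plus side notes."""
    notes = []
    # Q1: deep-reasoning tasks start large; everything else starts mid.
    tier = "large/smart" if needs_deep_reasoning else "mid/balanced"
    # Q2: high volume pushes you down a tier — cost scales fast.
    if high_volume:
        tier = "small/fast"
    # Q3: real-time users need low latency or streaming.
    if user_waits and tier != "small/fast":
        notes.append("enable streaming so the first token arrives < 1s")
    # Q4: long contexts narrow the provider choice.
    if long_context:
        notes.append("prefer a long-context provider, or truncate / use RAG")
    # Q5: sensitive data changes the deployment, not the tier.
    if sensitive_data:
        notes.append("use an enterprise zero-retention tier or self-host")
    return tier, notes
```

For example, a real-time, high-volume chatbot resolves to the Small/Fast tier regardless of the Q1 answer.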
The universal starting move
Start with a cheap, fast model (gpt-4o-mini or gemini-2.0-flash). Measure quality. Upgrade only if it is measurably not good enough. Most developers overpay by starting at the top.
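This "start cheap, upgrade on failure" move also works at runtime as an escalation ladder: try the cheap model first and fall back to a larger one only when the answer fails your quality check. A minimal sketch — `call_model` and `is_good_enough` are hypothetical placeholders for your own API wrapper and quality check, and the model ladder is just one example:

```python
def answer_with_escalation(prompt, call_model, is_good_enough):
    """Try models from cheapest to most expensive; stop at the first
    output that passes the quality check.

    call_model(model, prompt) -> str   is your own API wrapper.
    is_good_enough(output) -> bool     is your own quality check.
    """
    ladder = ["gpt-4o-mini", "gpt-4o", "o1"]  # cheap -> expensive
    for model in ladder:
        output = call_model(model, prompt)
        if is_good_enough(output):
            return model, output
    # Nothing passed: return the top-tier attempt as a best effort.
    return model, output
```

Most requests never leave the cheap model, so the expensive tier is paid for only on the hard cases — exactly the "use sparingly, as a fallback" pattern from the tier table.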

Common use cases → recommended starting point

 Use case                          Recommended starting tier
 High-volume classifier / router   Small/Fast
 Customer support chatbot          Small/Fast
 Document Q&A / RAG                Mid/Balanced
 Code generation assistant         Mid/Balanced
 Multi-step agent with tools       Mid/Balanced
 Legal / medical analysis          Large/Smart
 Math / scientific reasoning       Large/Smart
 Creative writing                  Mid/Balanced
 Simple data extraction            Small/Fast
 Summarisation                     Small/Fast

How to actually compare two models

  1. Write 10–20 representative prompts covering your real use case. Keep them in a file.
  2. Run each prompt through both models. Record outputs, token counts, and latency.
  3. Score each output: correct, borderline, wrong. Compare win rates.
  4. Compute $ per 1000 successful responses — not $ per call.
  5. Pick the model that wins on "quality at your price" — not the one with the highest benchmark score.
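Step 4 — cost per 1000 successful responses rather than per call — is the one most people skip, so here is a sketch of the arithmetic. The result-record shape and the prices are assumptions for illustration; plug in your provider's real per-million-token prices:

```python
def cost_per_1000_successes(results, price_in_per_mtok, price_out_per_mtok):
    """$ per 1000 *successful* responses for one model on your eval set.

    results: list of dicts like
        {"score": "correct" | "borderline" | "wrong",
         "tokens_in": int, "tokens_out": int}
    Prices are $ per million tokens (assumed figures — check your
    provider's pricing page).
    """
    total_cost = sum(
        r["tokens_in"] / 1e6 * price_in_per_mtok
        + r["tokens_out"] / 1e6 * price_out_per_mtok
        for r in results
    )
    successes = sum(1 for r in results if r["score"] == "correct")
    if successes == 0:
        return float("inf")  # never succeeds: infinitely expensive
    return total_cost / successes * 1000
```

A cheap model with a 70% win rate often beats an expensive one with 85% on this metric — which is the whole point of step 4.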
Public benchmarks lie
MMLU, HumanEval, and the other leaderboards correlate loosely with your use case. A model that tops the benchmark may still lose to a cheaper one on YOUR prompts. Always run your own eval set before committing.
Rule of thumb
Start cheap and mid-tier. Upgrade only on measurable failure. Switch providers when one gives the same quality for less money. Keep a small eval set around — it is the single most valuable asset in an LLM project.