Session 05 · 10 min

How to Choose an LLM for Your Use Case

What you'll learn
  • Pick a model based on latency, quality, and cost trade-offs
  • Match a use case to a provider tier (small/fast vs large/smart)
  • Walk through a decision flow instead of guessing

There is no "best" model — only the best model for your constraints. The three knobs you trade off are quality, latency, and cost. You cannot maximise all three at once.

The three tiers inside every provider

Every frontier provider organises its line-up into roughly three tiers. Learn the tier names and 90% of model selection becomes automatic.

Small / Fast
gpt-4o-mini · haiku · gemini-flash
  • Sub-second latency
  • 10–30x cheaper than flagship
  • Handles routing, classification, extraction
  • Good enough for most prototyping
  • Use when: high volume or chatbots
Mid / Balanced
gpt-4o · sonnet · gemini-pro
  • 1–3 second latency
  • Strong general reasoning
  • Handles multi-step tool calls
  • The default starting point
  • Use when: real products, normal traffic
Large / Smart
o1 · opus · gemini-pro-reasoning
  • 5–30 second latency
  • Deep reasoning, math, novel problems
  • 5–30x more expensive
  • Use sparingly — as a fallback
  • Use when: quality is everything
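The three tiers above can be captured as a small lookup table. A minimal sketch — the model names come from the list above, but the latency ranges and relative-cost multipliers are illustrative placeholders, not quoted prices; always check each provider's current pricing page.

```python
# Illustrative tier registry. Latencies and cost multipliers are rough,
# assumed figures for comparison only — not real provider pricing.
TIERS = {
    "small/fast": {
        "models": ["gpt-4o-mini", "haiku", "gemini-flash"],
        "typical_latency_s": (0.2, 1.0),
        "relative_cost": 1,    # baseline
    },
    "mid/balanced": {
        "models": ["gpt-4o", "sonnet", "gemini-pro"],
        "typical_latency_s": (1.0, 3.0),
        "relative_cost": 15,   # assumed ~10-30x the small tier
    },
    "large/smart": {
        "models": ["o1", "opus", "gemini-pro-reasoning"],
        "typical_latency_s": (5.0, 30.0),
        "relative_cost": 150,  # assumed 5-30x the mid tier again
    },
}

def models_in(tier: str) -> list[str]:
    """Return the example model names for a tier."""
    return TIERS[tier]["models"]
```

A registry like this keeps tier choices in one place, so swapping models later is a one-line change.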

Decision flow — match your use case to a tier

Q1. Does the task need deep reasoning or long chain-of-thought?
If yes
Start with a Large/Smart reasoning model (o1, opus, gemini-pro-reasoning)
If no
Start with a Mid/Balanced model (gpt-4o, sonnet, gemini-pro)
Q2. Will you run 100+ requests per minute in production?
If yes
Downgrade to the Small/Fast tier (mini, haiku, flash) — cost scales fast at volume
If no
Keep the tier you picked above
Q3. Does the user wait on the response in real time?
If yes
Prefer Small/Fast tier OR enable streaming so first token arrives < 1s
If no (batch / async)
Latency is free — optimise for quality
Q4. Does the context ever exceed 100k tokens?
If yes
Prefer Gemini (2M) or Claude (200k); truncate or RAG otherwise
If no
Any provider works — pick on quality and cost
Q5. Is user data sensitive / regulated?
If yes
Use enterprise tier with zero-retention OR self-host open-source
If no
Standard API tier is fine
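The five questions above are mechanical enough to encode directly. A sketch of the flow as a function — the tier names match the taxonomy above; the note strings are paraphrases of the answers, not provider guidance:

```python
def choose_tier(needs_deep_reasoning: bool,
                high_volume: bool,      # Q2: 100+ requests/minute?
                user_waits: bool,       # Q3: real-time response?
                long_context: bool,     # Q4: ever > 100k tokens?
                sensitive_data: bool):  # Q5: regulated data?
    """Encode the Q1-Q5 decision flow; returns a tier plus side notes."""
    notes = []
    # Q1: deep-reasoning tasks start large; everything else starts mid.
    tier = "large/smart" if needs_deep_reasoning else "mid/balanced"
    # Q2: high volume pushes you down a tier — cost scales fast.
    if high_volume:
        tier = "small/fast"
    # Q3: real-time users need low latency or streaming.
    if user_waits and tier != "small/fast":
        notes.append("enable streaming so the first token arrives < 1s")
    # Q4: long contexts narrow the provider choice.
    if long_context:
        notes.append("prefer a long-context provider, or truncate / use RAG")
    # Q5: sensitive data changes the deployment, not the tier.
    if sensitive_data:
        notes.append("use an enterprise zero-retention tier or self-host")
    return tier, notes
```

For example, a real-time, high-volume chatbot resolves to the Small/Fast tier regardless of the Q1 answer.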
The universal starting move
Start with a cheap, fast model (gpt-4o-mini or gemini-2.0-flash). Measure quality. Upgrade only if it is measurably not good enough. Most developers overpay by starting at the top.
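This "start cheap, upgrade on failure" move also works at runtime as an escalation ladder: try the cheap model first and fall back to a larger one only when the answer fails your quality check. A minimal sketch — `call_model` and `is_good_enough` are hypothetical placeholders for your own API wrapper and quality check, and the model ladder is just one example:

```python
def answer_with_escalation(prompt, call_model, is_good_enough):
    """Try models from cheapest to most expensive; stop at the first
    output that passes the quality check.

    call_model(model, prompt) -> str   is your own API wrapper.
    is_good_enough(output) -> bool     is your own quality check.
    """
    ladder = ["gpt-4o-mini", "gpt-4o", "o1"]  # cheap -> expensive
    for model in ladder:
        output = call_model(model, prompt)
        if is_good_enough(output):
            return model, output
    # Nothing passed: return the top-tier attempt as a best effort.
    return model, output
```

Most requests never leave the cheap model, so the expensive tier is paid for only on the hard cases — exactly the "use sparingly, as a fallback" pattern from the tier table.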

Common use cases → recommended starting point

 Use case                          Recommended starting tier
 High-volume classifier / router   Small/Fast
 Customer support chatbot          Small/Fast
 Document Q&A / RAG                Mid/Balanced
 Code generation assistant         Mid/Balanced
 Multi-step agent with tools       Mid/Balanced
 Legal / medical analysis          Large/Smart
 Math / scientific reasoning       Large/Smart
 Creative writing                  Mid/Balanced
 Simple data extraction            Small/Fast
 Summarisation                     Small/Fast

How to actually compare two models

  1. Write 10–20 representative prompts covering your real use case. Keep them in a file.
  2. Run each prompt through both models. Record outputs, token counts, and latency.
  3. Score each output: correct, borderline, wrong. Compare win rates.
  4. Compute $ per 1000 successful responses — not $ per call.
  5. Pick the model that wins on "quality at your price" — not the one with the highest benchmark score.
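Step 4 — cost per 1000 successful responses rather than per call — is the one most people skip, so here is a sketch of the arithmetic. The result-record shape and the prices are assumptions for illustration; plug in your provider's real per-million-token prices:

```python
def cost_per_1000_successes(results, price_in_per_mtok, price_out_per_mtok):
    """$ per 1000 *successful* responses for one model on your eval set.

    results: list of dicts like
        {"score": "correct" | "borderline" | "wrong",
         "tokens_in": int, "tokens_out": int}
    Prices are $ per million tokens (assumed figures — check your
    provider's pricing page).
    """
    total_cost = sum(
        r["tokens_in"] / 1e6 * price_in_per_mtok
        + r["tokens_out"] / 1e6 * price_out_per_mtok
        for r in results
    )
    successes = sum(1 for r in results if r["score"] == "correct")
    if successes == 0:
        return float("inf")  # never succeeds: infinitely expensive
    return total_cost / successes * 1000
```

A cheap model with a 70% win rate often beats an expensive one with 85% on this metric — which is the whole point of step 4.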
Public benchmarks lie
MMLU, HumanEval, and the other leaderboards correlate loosely with your use case. A model that tops the benchmark may still lose to a cheaper one on YOUR prompts. Always run your own eval set before committing.
Rule of thumb
Start cheap and mid-tier. Upgrade only on measurable failure. Switch providers when one gives the same quality for less money. Keep a small eval set around — it is the single most valuable asset in an LLM project.