Session 0 · 06 · 10 min
Cost Factors — how LLM bills actually work
What you'll learn
- Understand how LLM usage is billed (input vs output tokens)
- Estimate the cost of a single API call before you run it
- Know the five levers you can pull to cut costs in half
LLM APIs bill per token. Input tokens (your prompt) and output tokens (the reply) are priced separately — output is typically 3–5× more expensive. Every other cost factor is a consequence of this simple rule.
The pricing formula

```
cost = (input_tokens × input_price_per_1M / 1,000,000)
     + (output_tokens × output_price_per_1M / 1,000,000)
```

That is it. Everything on this page is a way to drive one of those four numbers down.
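The formula translates directly into code. A minimal sketch (the function name and the example prices are illustrative, matching the sample table below):

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_1m, output_price_per_1m):
    """Dollar cost of one call, given per-1M-token prices."""
    return (input_tokens * input_price_per_1m / 1_000_000
            + output_tokens * output_price_per_1m / 1_000_000)

# 1,000 input + 500 output tokens at gpt-4o's illustrative prices
print(estimate_cost(1_000, 500, 2.50, 10.00))  # → 0.0075
```

Note that the 500 output tokens cost twice as much as the 1,000 input tokens: output pricing dominates surprisingly often.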
Sample prices (illustrative, April 2026)
| Model | Input / 1M tokens | Output / 1M tokens | Context window |
|---|---|---|---|
| gpt-4o-mini | $0.15 | $0.60 | 128k |
| gpt-4o | $2.50 | $10.00 | 128k |
| o1 | $15.00 | $60.00 | 200k |
| claude-haiku-4-5 | $0.25 | $1.25 | 200k |
| claude-sonnet-4-5 | $3.00 | $15.00 | 200k |
| claude-opus-4-6 | $15.00 | $75.00 | 200k |
| gemini-2.0-flash | $0.10 | $0.40 | 1M |
| gemini-2.5-pro | $1.25 | $5.00 | 2M |
Prices move constantly
The numbers above are rough and change every quarter. Always check the provider's current pricing page before you size a budget. The *relative* ratios (mini is ~20× cheaper than full, output is ~4× input) stay roughly constant.
Worked example — a typical chatbot call
Imagine your user asks: "Summarise this 2-page article for me."
What lands in the tokenizer:

| Part | Tokens |
|---|---|
| System message | 50 |
| Article | 1,000 |
| User question | 15 |
| Input total (system + article + user) | 1,065 |
| Output (summary reply) | 200 |

Cost on gpt-4o-mini: ~$0.0003, or about 3,300 calls per $1.
With gpt-4o the same call costs $0.0046 — 15× more. With o1 it is $0.028 — 90× more. Same prompt, same output length, just a different tier.
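The tier comparison above can be reproduced with the illustrative table prices. The price dict is an assumption copied from the sample table, not a live price feed:

```python
PRICES = {  # (input, output) per 1M tokens, illustrative numbers from the table
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
    "o1": (15.00, 60.00),
}

def call_cost(model, input_tokens=1_065, output_tokens=200):
    """Cost of the worked-example call (1,065 in / 200 out) on a given tier."""
    inp, out = PRICES[model]
    return input_tokens * inp / 1e6 + output_tokens * out / 1e6

for model in PRICES:
    print(f"{model}: ${call_cost(model):.4f}")
```

The exact multiples shift slightly with rounding, but the shape is always the same: identical prompt, identical output length, wildly different bill.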
The five cost levers you control
1. Downgrade the model tier (−90%). The single biggest lever. Start on mini/haiku/flash. Upgrade only when you measure a failure your user actually cares about.
2. Shrink the prompt (−50%). System messages creep. Shorten them. Strip whitespace. Truncate retrieved context. Every token removed is saved on every future call.
3. Cap output length (−30%). Output is 3–5× the price of input. Set max_output_tokens. Ask the model for bullets, not essays. Stop it when it drifts into filler.
4. Cache repeated prompts (−50%). OpenAI, Anthropic, and Google all offer prompt caching. The same system message across many calls gets billed once at a discount.
5. Batch async calls (−50%). OpenAI's Batch API and Anthropic's Message Batches give 50% off if you can wait up to 24h for results (great for data processing jobs).
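When you stack several levers, the savings multiply rather than add. This is a simplification (it assumes each lever acts on an independent part of the bill), but it is a useful first-order model:

```python
def combined_multiplier(savings):
    """Fraction of the original bill left after stacking independent savings.

    savings: list of fractional reductions, e.g. 0.90 for "−90%".
    """
    remaining = 1.0
    for s in savings:
        remaining *= (1 - s)
    return remaining

# Downgrade tier (−90%) + shrink prompt (−50%) + cap output (−30%)
m = combined_multiplier([0.90, 0.50, 0.30])
print(f"{m:.3f} of the original bill, roughly {1 / m:.0f}x cheaper")
```

Three levers applied together leave about 3.5% of the original bill, which is why a cheap-model-first habit compounds so well.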
What a token actually looks like in the wild
Short common words ("the", "is", "of") are one token each. Long, rare, or foreign words split into multiple tokens. Code and emoji are often worse. Here is a sentence with mixed content:
```python
def my_function(x): return x * 2  # double
```

This one line is 12 tokens.
Estimate tokens for free
Use OpenAI's tiktoken library (Python) or the online tokenizer at platform.openai.com/tokenizer. Paste your prompt and see the exact token count instantly.
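When you only need a ballpark figure without installing anything, the common rule of thumb of roughly 4 characters per English token gets you close. This is a heuristic, not a tokenizer; use tiktoken when you need exact counts:

```python
def rough_token_count(text: str) -> int:
    """Ballpark token estimate: ~4 characters per token for English prose.

    Code, emoji, and non-English text usually tokenize worse than this.
    """
    return max(1, round(len(text) / 4))

prompt = "Summarise this 2-page article for me."
print(rough_token_count(prompt))
```

For budget sizing, a ±25% error on token counts is usually fine; the model tier you pick matters far more.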
Hidden cost traps
- Infinite loops. A tool-calling agent that never stops retrying burns tokens fast. Always set a hard max_iterations.
- Streaming you ignore. You pay for every streamed token even if the user disconnects. Handle disconnects.
- Long conversation history. Every turn resends the full chat. A 50-turn chat can cost 50× more than turn 1. Summarise or truncate.
- Redundant retries. 3 retries on a 500 error = 3× cost. Cap retries.
- Over-retrieved RAG. Dumping 20 chunks into every prompt when 3 would do. Re-rank.
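The conversation-history trap is worth seeing in numbers. If every turn adds roughly t tokens and each call resends the full history, turn N's input is about N×t, so the cumulative input cost over N turns grows quadratically. A simplified model (fixed tokens per turn, system prompt ignored, illustrative gpt-4o-mini input price):

```python
def history_cost(turns, tokens_per_turn=250, input_price_per_1m=0.15):
    """Cumulative input cost when each call resends the whole chat history."""
    total_input = sum(k * tokens_per_turn for k in range(1, turns + 1))
    return total_input * input_price_per_1m / 1_000_000

print(history_cost(1))   # turn 1 alone
print(history_cost(50))  # 50 turns: ~1,275x the turn-1 input cost cumulatively
```

Turn 50 on its own costs 50× turn 1, and the running total over the whole chat is far worse, which is why summarising or truncating history pays off so quickly.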
Budget planning checklist
- Measure real token counts on 10 representative calls.
- Multiply by daily call volume.
- Multiply by 30 for monthly cost.
- Add 2× safety buffer (traffic grows, prompts grow).
- Set a hard spend cap on the provider dashboard.
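Steps 2–4 of the checklist collapse into one line of arithmetic. All numbers below are placeholders for your own measurements:

```python
def monthly_budget(cost_per_call, calls_per_day, days=30, safety_buffer=2.0):
    """Scale a measured per-call cost up to a monthly spend cap."""
    return cost_per_call * calls_per_day * days * safety_buffer

# e.g. the $0.0003 worked-example call at 10,000 calls per day
print(f"${monthly_budget(0.0003, 10_000):.2f} / month")  # → $180.00 / month
```

That final number, buffer included, is what you enter as the hard spend cap on the provider dashboard.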
Session 0 complete
You know what an LLM is, how it differs from ML, how the API works, which providers exist, how to pick one, and how to estimate the bill. Now head to Session 1 and build your first app.