Session 0 · 06 · 10 min
Cost Factors — how LLM bills actually work
What you'll learn
- Understand how LLM usage is billed (input vs output tokens)
- Estimate the cost of a single API call before you run it
- Know the five levers you can pull to cut costs in half
LLM APIs bill per token. Input tokens (your prompt) and output tokens (the reply) are priced separately — output is typically 3–5× more expensive. Every other cost factor is a consequence of this simple rule.
The pricing formula

```
cost = (input_tokens × input_price_per_1M / 1,000,000)
     + (output_tokens × output_price_per_1M / 1,000,000)
```

That is it. Everything on this page is a way to drive one of those four numbers down.
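The formula translates directly into code. A minimal sketch (the function name and the example prices are illustrative, matching the sample table below):

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_1m, output_price_per_1m):
    """Dollar cost of one call, given per-1M-token prices."""
    return (input_tokens * input_price_per_1m / 1_000_000
            + output_tokens * output_price_per_1m / 1_000_000)

# 1,000 input + 500 output tokens at gpt-4o's illustrative prices
print(estimate_cost(1_000, 500, 2.50, 10.00))  # → 0.0075
```

Note that the 500 output tokens cost twice as much as the 1,000 input tokens: output pricing dominates surprisingly often.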
Sample prices (illustrative, April 2026)
| Model | Input / 1M tokens | Output / 1M tokens | Context window |
|---|---|---|---|
| gpt-4o-mini | $0.15 | $0.60 | 128k |
| gpt-4o | $2.50 | $10.00 | 128k |
| o1 | $15.00 | $60.00 | 200k |
| claude-haiku-4-5 | $0.25 | $1.25 | 200k |
| claude-sonnet-4-5 | $3.00 | $15.00 | 200k |
| claude-opus-4-6 | $15.00 | $75.00 | 200k |
| gemini-2.0-flash | $0.10 | $0.40 | 1M |
| gemini-2.5-pro | $1.25 | $5.00 | 2M |
Prices move constantly
The numbers above are rough and change every quarter. Always check the provider's current pricing page before you size a budget. The *relative* ratios (mini is ~20× cheaper than full, output is ~4× input) stay roughly constant.
Worked example — a typical chatbot call
Imagine your user asks: "Summarise this 2-page article for me."
What lands in the tokenizer:

| Part | Tokens |
|---|---|
| System message | 50 |
| Article | 1,000 |
| User question | 15 |
| Input total (system + article + user) | 1,065 |
| Output (summary reply) | 200 |

Cost on gpt-4o-mini: ~$0.0003, or about 3,300 calls per $1.
With gpt-4o the same call costs $0.0046 — 15× more. With o1 it is $0.028 — 90× more. Same prompt, same output length, just a different tier.
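The tier comparison above can be reproduced with the illustrative table prices. The price dict is an assumption copied from the sample table, not a live price feed:

```python
PRICES = {  # (input, output) per 1M tokens, illustrative numbers from the table
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
    "o1": (15.00, 60.00),
}

def call_cost(model, input_tokens=1_065, output_tokens=200):
    """Cost of the worked-example call (1,065 in / 200 out) on a given tier."""
    inp, out = PRICES[model]
    return input_tokens * inp / 1e6 + output_tokens * out / 1e6

for model in PRICES:
    print(f"{model}: ${call_cost(model):.4f}")
```

The exact multiples shift slightly with rounding, but the shape is always the same: identical prompt, identical output length, wildly different bill.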
The five cost levers you control
1. Downgrade the model tier (−90%). The single biggest lever. Start on mini/haiku/flash. Upgrade only when you measure a failure your user actually cares about.
2. Shrink the prompt (−50%). System messages creep. Shorten them. Strip whitespace. Truncate retrieved context. Every token removed is saved on every future call.
3. Cap output length (−30%). Output is 3–5× the price of input. Set max_output_tokens. Ask the model for bullets, not essays. Stop it when it drifts into filler.
4. Cache repeated prompts (−50%). OpenAI, Anthropic, and Google all offer prompt caching. The same system message across many calls gets billed once at a discount.
5. Batch async calls (−50%). OpenAI's Batch API and Anthropic's Message Batches give 50% off if you can wait up to 24h for results (great for data processing jobs).
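When you stack several levers, the savings multiply rather than add. This is a simplification (it assumes each lever acts on an independent part of the bill), but it is a useful first-order model:

```python
def combined_multiplier(savings):
    """Fraction of the original bill left after stacking independent savings.

    savings: list of fractional reductions, e.g. 0.90 for "−90%".
    """
    remaining = 1.0
    for s in savings:
        remaining *= (1 - s)
    return remaining

# Downgrade tier (−90%) + shrink prompt (−50%) + cap output (−30%)
m = combined_multiplier([0.90, 0.50, 0.30])
print(f"{m:.3f} of the original bill, roughly {1 / m:.0f}x cheaper")
```

Three levers applied together leave about 3.5% of the original bill, which is why a cheap-model-first habit compounds so well.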
What a token actually looks like in the wild
Short common words ("the", "is", "of") are one token each. Long, rare, or foreign words split into multiple tokens. Code and emoji are often worse. Here is a sentence with mixed content:
```python
def my_function(x): return x * 2  # double
```

This one line is 12 tokens.
Estimate tokens for free
Use OpenAI's tiktoken library (Python) or the online tokenizer at platform.openai.com/tokenizer. Paste your prompt and see the exact token count instantly.
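When you only need a ballpark figure without installing anything, the common rule of thumb of roughly 4 characters per English token gets you close. This is a heuristic, not a tokenizer; use tiktoken when you need exact counts:

```python
def rough_token_count(text: str) -> int:
    """Ballpark token estimate: ~4 characters per token for English prose.

    Code, emoji, and non-English text usually tokenize worse than this.
    """
    return max(1, round(len(text) / 4))

prompt = "Summarise this 2-page article for me."
print(rough_token_count(prompt))
```

For budget sizing, a ±25% error on token counts is usually fine; the model tier you pick matters far more.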
Hidden cost traps
- Infinite loops. A tool-calling agent that never stops retrying burns tokens fast. Always set a hard max_iterations.
- Streaming you ignore. You pay for every streamed token even if the user disconnects. Handle disconnects.
- Long conversation history. Every turn resends the full chat. A 50-turn chat can cost 50× more than turn 1. Summarise or truncate.
- Redundant retries. 3 retries on a 500 error = 3× cost. Cap retries.
- Over-retrieved RAG. Dumping 20 chunks into every prompt when 3 would do. Re-rank.
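The conversation-history trap is worth seeing in numbers. If every turn adds roughly t tokens and each call resends the full history, turn N's input is about N×t, so the cumulative input cost over N turns grows quadratically. A simplified model (fixed tokens per turn, system prompt ignored, illustrative gpt-4o-mini input price):

```python
def history_cost(turns, tokens_per_turn=250, input_price_per_1m=0.15):
    """Cumulative input cost when each call resends the whole chat history."""
    total_input = sum(k * tokens_per_turn for k in range(1, turns + 1))
    return total_input * input_price_per_1m / 1_000_000

print(history_cost(1))   # turn 1 alone
print(history_cost(50))  # 50 turns: ~1,275x the turn-1 input cost cumulatively
```

Turn 50 on its own costs 50× turn 1, and the running total over the whole chat is far worse, which is why summarising or truncating history pays off so quickly.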
Budget planning checklist
- Measure real token counts on 10 representative calls.
- Multiply by daily call volume.
- Multiply by 30 for monthly cost.
- Add 2× safety buffer (traffic grows, prompts grow).
- Set a hard spend cap on the provider dashboard.
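Steps 2–4 of the checklist collapse into one line of arithmetic. All numbers below are placeholders for your own measurements:

```python
def monthly_budget(cost_per_call, calls_per_day, days=30, safety_buffer=2.0):
    """Scale a measured per-call cost up to a monthly spend cap."""
    return cost_per_call * calls_per_day * days * safety_buffer

# e.g. the $0.0003 worked-example call at 10,000 calls per day
print(f"${monthly_budget(0.0003, 10_000):.2f} / month")  # → $180.00 / month
```

That final number, buffer included, is what you enter as the hard spend cap on the provider dashboard.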
Session 0 complete
You know what an LLM is, how it differs from ML, how the API works, which providers exist, how to pick one, and how to estimate the bill. Now head to Session 1 and build your first app.