Session 0 · 03 · 10 min

LLM API Basics — request in, text out

What you'll learn
  • Understand the request/response shape of an LLM API call
  • Know what system, user, and assistant roles mean
  • Understand tokens, temperature, and the context window

An LLM API is remarkably simple: you send a JSON request over HTTPS, the provider runs the model, and you get JSON back with the generated text. No databases, no GPUs, no infrastructure. Just one HTTP call.

The round trip

What happens when you call client.responses.create():

  1. Your code (Python / JS / curl) sends an HTTPS request: JSON body + API key.
  2. The provider runs the model.
  3. An HTTPS response comes back: JSON with the output.
  4. Your code reads the text.
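The same round trip can be sketched without the SDK, because it really is just an HTTPS POST. A minimal sketch, assuming the standard OpenAI endpoint and an OPENAI_API_KEY environment variable; the body is built and printed but not sent, so it runs offline:

```python
import json
import os

# The JSON body the SDK builds for you. The endpoint and header names
# are the standard OpenAI ones; the actual send is left out so this
# sketch needs no network connection or real API key.
url = "https://api.openai.com/v1/responses"
headers = {
    "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', 'sk-...')}",
    "Content-Type": "application/json",
}
body = {
    "model": "gpt-4o-mini",
    "input": [
        {"role": "system", "content": "You are a helpful tutor."},
        {"role": "user", "content": "Explain HTTP in 2 lines."},
    ],
    "temperature": 0.7,
    "max_output_tokens": 200,
}

print(json.dumps(body, indent=2))
# To actually send it: requests.post(url, headers=headers, json=body)
```

Seeing the raw body makes the point of the section concrete: the SDK is a thin convenience over one JSON-over-HTTPS call.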

A minimal request

This is the entire shape of an LLM call. Everything else is a variation.

minimal_call.py
from openai import OpenAI
client = OpenAI()

response = client.responses.create(
    model="gpt-4o-mini",
    input=[
        {"role": "system", "content": "You are a helpful tutor."},
        {"role": "user",   "content": "Explain HTTP in 2 lines."},
    ],
    temperature=0.7,
    max_output_tokens=200,
)

print(response.output_text)

The three message roles

  • system — sets persona, rules, and output format. "You are a JSON-only assistant."
  • user — the human's question or request.
  • assistant — the model's previous replies (only when simulating multi-turn memory).

The system message is powerful
The model weighs system instructions heavily. Put persona, rules, format, and constraints there — never in the user message. It makes prompts reusable across different questions.
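That reusability is easy to see in code. A small sketch (the helper name and prompt text are my own) that pairs one fixed system message with any user question:

```python
SYSTEM = "You are a helpful tutor. Answer in at most 2 lines."

def make_input(question: str) -> list[dict]:
    """Pair the fixed system message with any user question."""
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": question},
    ]

# The same persona and rules apply to every request:
for q in ["Explain HTTP.", "Explain DNS."]:
    messages = make_input(q)
    # client.responses.create(model="gpt-4o-mini", input=messages)
    print(messages[0]["role"], "|", messages[1]["content"])
```

Because the rules live in the system message, only the user message changes per call.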

The three parameters that matter most

  • model — which model runs, e.g. gpt-4o-mini, claude-sonnet-4-5
  • temperature — creativity: 0 = deterministic · 1 = default · 1.5 = wild
  • max_output_tokens — cap on response length
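In code, all three are keyword arguments on the same call. A sketch (the helper is hypothetical; the 0–2 temperature range matches OpenAI's documented bounds) that bundles and sanity-checks them before calling:

```python
def call_kwargs(model: str, temperature: float = 1.0,
                max_output_tokens: int = 200) -> dict:
    """Bundle the three knobs; reject a temperature outside the API's 0-2 range early."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0 and 2")
    return {
        "model": model,
        "temperature": temperature,
        "max_output_tokens": max_output_tokens,
    }

# client.responses.create(input=messages, **call_kwargs("gpt-4o-mini", temperature=0))
print(call_kwargs("gpt-4o-mini", temperature=0))
```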

Tokens — the units you pay for

You pay for tokens, not characters or words. Both input (prompt) and output (reply) are metered. Here is what the request above looks like as tokens:

You are a helpful tutor. Explain HTTP in 2 lines.
Input prompt · 12 tokens
Rule of thumb
1 token ≈ 0.75 English words · 1 page of text ≈ 500 tokens · Today's frontier models bill roughly $0.15 – $5 per MILLION tokens, depending on model and direction (input vs output).
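The rule of thumb translates directly into a back-of-the-envelope estimator. A sketch using the approximate ratios above, not a real tokenizer, with illustrative per-million prices as defaults:

```python
def estimate_tokens(text: str) -> int:
    # Rule of thumb: 1 token ~= 0.75 English words, so tokens ~= words / 0.75.
    return round(len(text.split()) / 0.75)

def estimate_cost(input_tokens: int, output_tokens: int,
                  usd_per_m_in: float = 0.15, usd_per_m_out: float = 0.60) -> float:
    # Prices are illustrative, billed per MILLION tokens;
    # check your provider's pricing page for real numbers.
    return input_tokens * usd_per_m_in / 1e6 + output_tokens * usd_per_m_out / 1e6

prompt = "You are a helpful tutor. Explain HTTP in 2 lines."
print(estimate_tokens(prompt))          # roughly matches the 12-token prompt above
print(estimate_cost(1_000_000, 0))      # 1M input tokens at $0.15/M
```

For exact counts you would use the provider's tokenizer; the estimate is only for quick budgeting.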

The context window

Every model has a maximum number of tokens it can process in a single call — the context window. It includes BOTH your input AND the reply the model generates. Exceed it and the API rejects the request.

  • 8k — small model — ~16 pages of text
  • 128k — modern standard — ~300 pages of text
  • 2M — Gemini long-context — ~4,000 pages

Longer context is not free
Even when a model supports 2 million tokens, every token costs money AND slows down the response. Send only what the model actually needs.
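Because the window covers input plus the reply you allow for, you can check the budget before calling. A small sketch (the window sizes are the round numbers above, not exact model limits):

```python
def fits_context(input_tokens: int, max_output_tokens: int,
                 window: int = 128_000) -> bool:
    """The prompt AND the reply budget must both fit inside the window."""
    return input_tokens + max_output_tokens <= window

print(fits_context(7_900, 200, window=8_000))    # False: reply budget overflows an 8k window
print(fits_context(7_900, 200, window=128_000))  # True: plenty of room in a 128k window
```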

The response object

The API returns a JSON object with the generated text plus metadata: which model ran, how many tokens were used (input + output), and why generation stopped ("stop", "length", or "tool_calls").

response.json (trimmed)
{
  "id": "resp_abc123",
  "model": "gpt-4o-mini",
  "output_text": "HTTP is a request-response protocol used by the web...",
  "usage": {
    "input_tokens": 14,
    "output_tokens": 27,
    "total_tokens": 41
  },
  "finish_reason": "stop"
}
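Reading that metadata in code is a couple of dictionary lookups. A sketch that parses the trimmed JSON above as a literal (with the SDK you would read the fields off the response object directly):

```python
import json

# The trimmed response body from above, as a raw JSON string.
raw = """{
  "id": "resp_abc123",
  "model": "gpt-4o-mini",
  "output_text": "HTTP is a request-response protocol used by the web...",
  "usage": {"input_tokens": 14, "output_tokens": 27, "total_tokens": 41},
  "finish_reason": "stop"
}"""

response = json.loads(raw)
print(response["output_text"])
print(response["usage"]["total_tokens"])  # 41

# A finish_reason of "length" means the reply hit max_output_tokens:
if response["finish_reason"] == "length":
    print("Reply was cut off: raise max_output_tokens or shorten the prompt.")
```

Checking usage and finish_reason on every call is the simplest way to catch truncated replies and runaway token spend.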
That is the whole API
Send messages. Pick a model. Tune temperature and max_output_tokens. Read the response text. Everything else you will meet in this course — tools, structured output, streaming — is a layer on top of this one pattern.