Session 1 · H9 · 10 min
Text to Speech
What you'll learn
- ▸Use client.audio.speech.create() to synthesize audio
- ▸Stream the response directly to an mp3 file
- ▸Pick a voice via environment variable
What you will build
A script that turns a --text string into an mp3 file saved to output/speech.mp3. Third endpoint in Session 1, same client, same pattern.
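The excerpt below assumes the same scaffolding as the earlier sessions: an `env_model` helper and an argparse `--text` flag. As a reminder, a minimal sketch of that scaffolding might look like this (`env_model` is the course's helper pattern; its exact body here is an assumption):

```python
import argparse
import os

def env_model(name: str, default: str) -> str:
    # Assumed shape of the course helper: read an env var,
    # fall back to a default when it is not set.
    return os.environ.get(name, default)

parser = argparse.ArgumentParser()
parser.add_argument("--text", required=True, help="the sentence to speak")
# Passing argv explicitly here only to demo; in the script, use parse_args().
args = parser.parse_args(["--text", "Welcome to LLM API basics."])

voice = env_model("VOICE_TTS", "alloy")  # "alloy" unless VOICE_TTS is set
```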
New concept — streaming response
Why stream?
Audio files can be large. Instead of downloading all the bytes into memory and then writing a file, with_streaming_response writes bytes to disk as they arrive. That is faster to first byte and memory-safe for long inputs.
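The idea is generic, not TTS-specific: write each chunk the moment it arrives, so peak memory is one chunk rather than the whole file. A minimal illustration (using an in-memory buffer as a stand-in for the network and the disk):

```python
import io

def write_as_it_arrives(chunks, dst) -> int:
    # Each chunk is written immediately; we never hold more than
    # one chunk in memory at a time.
    total = 0
    for chunk in chunks:
        dst.write(chunk)
        total += len(chunk)
    return total

chunks = (bytes([i]) * 1024 for i in range(8))  # stand-in for network chunks
buf = io.BytesIO()                              # stand-in for the output file
written = write_as_it_arrives(chunks, buf)      # 8 KiB total, 1 KiB at a time
```

stream_to_file in the SDK does exactly this loop for you, against a real file.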
The code
src/09_text_to_speech.py (excerpt)
from pathlib import Path

voice = env_model("VOICE_TTS", "alloy")  ①
model = env_model("MODEL_TTS", "gpt-4o-mini-tts")

with client.audio.speech.with_streaming_response.create(  ②
    model=model,
    voice=voice,
    input=args.text,  ③
) as response:
    response.stream_to_file(Path("output/speech.mp3"))  ④

print("Saved -> output/speech.mp3")

① "alloy" is one of several built-in voices (nova, echo, onyx, shimmer, fable).
②Use the streaming variant to write directly to disk as bytes arrive.
③input is the TEXT to speak — not a prompt, just the sentence itself.
④stream_to_file handles the file open/write/close for you.
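For contrast, here is a sketch of the regular (non-streaming) call, which buffers the whole mp3 in memory before writing it. This assumes the non-streaming response exposes the raw bytes via .content, as recent openai-python versions do; synthesize_buffered is a name invented for this sketch:

```python
from pathlib import Path

def synthesize_buffered(client, text: str, out_path: str = "output/speech.mp3") -> Path:
    # Non-streaming variant: fine for a short sentence, but the whole
    # payload sits in memory before a single byte reaches the disk.
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    response = client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice="alloy",
        input=text,
    )
    path.write_bytes(response.content)  # entire mp3 held in memory here
    return path
```

Either way works; the streaming variant simply removes the "entire file in memory" step.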
Run it
$ python src/09_text_to_speech.py --text "Welcome to LLM API basics."
Right-click output/speech.mp3 in VS Code's explorer → "Reveal in Finder / File Explorer" → double-click to play.
Knowledge check
What is the difference between the "with_streaming_response" variant and a regular TTS call?
Recap — what you just learned
- ✓TTS uses a third endpoint: client.audio.speech.create()
- ✓Pick a voice from the built-in set (alloy, nova, echo, onyx, shimmer, fable)
- ✓with_streaming_response + stream_to_file writes the mp3 without loading it all in memory