Session 1 · H9 · 10 min

Text to Speech

What you'll learn
  • Use client.audio.speech.create() to synthesize audio
  • Stream the response directly to an mp3 file
  • Pick a voice via environment variable

What you will build

A script that turns a --text string into an mp3 file saved to output/speech.mp3. It's the third endpoint in Session 1: same client, same pattern.

New concept — streaming response

Why stream?
Audio files can be big. Instead of downloading all the bytes into memory and then writing a file, with_streaming_response writes bytes to disk as they arrive. That is faster to first byte and keeps memory use flat, even for long inputs.
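The idea is the same as any chunked download. A minimal sketch in plain Python (no SDK involved; the chunk source is simulated) shows why the full file never has to fit in memory:

```python
from pathlib import Path

def write_streamed(chunks, path: Path) -> None:
    # Each chunk is written as soon as it arrives, so peak memory
    # stays around one chunk, never the whole file.
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)

# Simulated "network" source: three small byte chunks.
fake_chunks = (bytes([i]) * 4 for i in range(3))
write_streamed(fake_chunks, Path("demo.bin"))
print(Path("demo.bin").stat().st_size)  # prints 12
```

This is what stream_to_file does for you under the hood, with the HTTP response body as the chunk source.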

The code

src/09_text_to_speech.py (excerpt)
from pathlib import Path

voice = env_model("VOICE_TTS", "alloy")                          ①
model = env_model("MODEL_TTS", "gpt-4o-mini-tts")

Path("output").mkdir(exist_ok=True)  # stream_to_file won't create the folder

with client.audio.speech.with_streaming_response.create(          ②
    model=model,
    voice=voice,
    input=args.text,                                              ③
) as response:
    response.stream_to_file(Path("output/speech.mp3"))            ④

print("Saved -> output/speech.mp3")
"alloy" is one of several built-in voices (nova, echo, onyx, shimmer, fable).
Use the streaming variant to write directly to disk as bytes arrive.
input is the TEXT to speak — not a prompt, just the sentence itself.
stream_to_file handles the file open/write/close for you.
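The excerpt leans on the course's env_model helper without showing it. Assuming it simply reads an environment variable and falls back to a default when the variable is unset or empty, a minimal sketch (the real definition lives elsewhere in the course repo):

```python
import os

def env_model(name: str, default: str) -> str:
    # Hypothetical reconstruction: return the environment variable's value,
    # or the default when it is unset or blank.
    value = os.environ.get(name, "").strip()
    return value or default

env_model("VOICE_TTS", "alloy")  # "alloy" unless VOICE_TTS is set
```

Running with VOICE_TTS=nova in the environment would then pick the "nova" voice without touching the code.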

Run it

$ python src/09_text_to_speech.py --text "Welcome to LLM API basics."
Right-click output/speech.mp3 in VS Code's explorer → "Reveal in Finder / File Explorer" → double-click to play.
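If you want to run the excerpt standalone, here is a self-contained sketch. The function names (build_parser, synthesize, main) are illustrative, the env_model helper is replaced with a plain os.environ.get fallback, and OPENAI_API_KEY is assumed to be set when you actually synthesize:

```python
import argparse
import os
from pathlib import Path

def build_parser() -> argparse.ArgumentParser:
    # --text mirrors the excerpt's args.text
    parser = argparse.ArgumentParser(description="Turn --text into output/speech.mp3")
    parser.add_argument("--text", required=True, help="the sentence to speak")
    return parser

def synthesize(text: str, out_path: Path) -> None:
    from openai import OpenAI  # deferred so the rest of the file imports without the SDK
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    out_path.parent.mkdir(parents=True, exist_ok=True)  # stream_to_file won't create folders
    with client.audio.speech.with_streaming_response.create(
        model=os.environ.get("MODEL_TTS", "gpt-4o-mini-tts"),
        voice=os.environ.get("VOICE_TTS", "alloy"),
        input=text,  # the literal text to speak, not a prompt
    ) as response:
        response.stream_to_file(out_path)

def main() -> None:
    args = build_parser().parse_args()
    synthesize(args.text, Path("output/speech.mp3"))
    print("Saved -> output/speech.mp3")

# Invoke main() when run as a script:
#   python src/09_text_to_speech.py --text "Welcome to LLM API basics."
```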

Knowledge check

What is the difference between the "with_streaming_response" variant and a regular TTS call?
Recap — what you just learned
  • TTS uses a third endpoint: client.audio.speech.create()
  • Pick a voice from the built-in set (alloy, nova, echo, onyx, shimmer, fable)
  • with_streaming_response + stream_to_file writes the mp3 without loading it all in memory
Next up: H10 — Mini CLI App