Session 1 · H9 · 10 min
Text to Speech
What you'll learn
- ▸Use client.audio.speech.create() to synthesize audio
- ▸Stream the response directly to an mp3 file
- ▸Pick a voice via environment variable
What you will build
A script that turns a --text string into an mp3 file saved to output/speech.mp3. Third endpoint in Session 1, same client, same pattern.
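The excerpt below assumes the same scaffolding as the earlier sessions: an `env_model` helper and an argparse `--text` flag. As a reminder, a minimal sketch of that scaffolding might look like this (`env_model` is the course's helper pattern; its exact body here is an assumption):

```python
import argparse
import os

def env_model(name: str, default: str) -> str:
    # Assumed shape of the course helper: read an env var,
    # fall back to a default when it is not set.
    return os.environ.get(name, default)

parser = argparse.ArgumentParser()
parser.add_argument("--text", required=True, help="the sentence to speak")
# Passing argv explicitly here only to demo; in the script, use parse_args().
args = parser.parse_args(["--text", "Welcome to LLM API basics."])

voice = env_model("VOICE_TTS", "alloy")  # "alloy" unless VOICE_TTS is set
```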
New concept — streaming response
Why stream?
Audio files can be large. Instead of downloading all the bytes into memory and then writing a file, with_streaming_response writes bytes to disk as they arrive. That is faster to first byte and memory-safe for long inputs.
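The idea is generic, not TTS-specific: write each chunk the moment it arrives, so peak memory is one chunk rather than the whole file. A minimal illustration (using an in-memory buffer as a stand-in for the network and the disk):

```python
import io

def write_as_it_arrives(chunks, dst) -> int:
    # Each chunk is written immediately; we never hold more than
    # one chunk in memory at a time.
    total = 0
    for chunk in chunks:
        dst.write(chunk)
        total += len(chunk)
    return total

chunks = (bytes([i]) * 1024 for i in range(8))  # stand-in for network chunks
buf = io.BytesIO()                              # stand-in for the output file
written = write_as_it_arrives(chunks, buf)      # 8 KiB total, 1 KiB at a time
```

stream_to_file in the SDK does exactly this loop for you, against a real file.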
The code
src/09_text_to_speech.py (excerpt)
from pathlib import Path

voice = env_model("VOICE_TTS", "alloy")  ①
model = env_model("MODEL_TTS", "gpt-4o-mini-tts")

with client.audio.speech.with_streaming_response.create(  ②
    model=model,
    voice=voice,
    input=args.text,  ③
) as response:
    response.stream_to_file(Path("output/speech.mp3"))  ④

print("Saved -> output/speech.mp3")

① "alloy" is one of several built-in voices (nova, echo, onyx, shimmer, fable).
②Use the streaming variant to write directly to disk as bytes arrive.
③input is the TEXT to speak — not a prompt, just the sentence itself.
④stream_to_file handles the file open/write/close for you.
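For contrast, here is a sketch of the regular (non-streaming) call, which buffers the whole mp3 in memory before writing it. This assumes the non-streaming response exposes the raw bytes via .content, as recent openai-python versions do; synthesize_buffered is a name invented for this sketch:

```python
from pathlib import Path

def synthesize_buffered(client, text: str, out_path: str = "output/speech.mp3") -> Path:
    # Non-streaming variant: fine for a short sentence, but the whole
    # payload sits in memory before a single byte reaches the disk.
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    response = client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice="alloy",
        input=text,
    )
    path.write_bytes(response.content)  # entire mp3 held in memory here
    return path
```

Either way works; the streaming variant simply removes the "entire file in memory" step.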
Run it
$ python src/09_text_to_speech.py --text "Welcome to LLM API basics."
Right-click output/speech.mp3 in VS Code's explorer → "Reveal in Finder / File Explorer" → double-click to play.
Knowledge check
What is the difference between the "with_streaming_response" variant and a regular TTS call?
Recap — what you just learned
- ✓TTS uses a third endpoint: client.audio.speech.create()
- ✓Pick a voice from the built-in set (alloy, nova, echo, onyx, shimmer, fable)
- ✓with_streaming_response + stream_to_file writes the mp3 without loading it all in memory