Session 1 · H11 · 30 min

Multimodal Story Pipeline (capstone)

What you'll learn
  • Chain text → audio → image generation for each scene
  • Carry full scene context between calls to keep consistency
  • Produce a folder of scenes.json + audio + images from one storyline

What you will build

The capstone of Session 1. Given a storyline and a scene count, the script generates scenes, then for EACH scene generates narration audio and an illustration. Because the API is stateless, it sends the full scene list on every image call so the visuals stay consistent.
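The resend-everything idea can be sketched as a small prompt builder. The function name and prompt wording below are my own illustration, not necessarily what the script uses:

```python
def build_image_prompt(scenes: list[dict], index: int) -> str:
    """Embed the FULL scene list in every image prompt so the
    stateless image endpoint sees the same story context each call."""
    story = "\n".join(
        f"Scene {i + 1}: {s['description']}" for i, s in enumerate(scenes)
    )
    return (
        f"Full story (for visual consistency):\n{story}\n\n"
        f"Illustrate only scene {index + 1}: {scenes[index]['description']}"
    )
```

Every call receives the whole story, so characters and settings introduced in earlier scenes stay visible to the model even though it remembers nothing between calls.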

End-to-end pipeline
Storyline (one sentence in) → Text model → scenes.json
scenes.json → TTS model → scene audio
scenes.json → Image model → scene images
All files land in one folder: output/story_…
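Wired together, the per-scene loop looks roughly like this. The model names (`tts-1`, `dall-e-3`), voice, and helper structure are assumptions; `client` is an `openai.OpenAI()` instance passed in by the caller:

```python
from pathlib import Path


def scene_path(root: Path, kind: str, n: int, ext: str) -> Path:
    # Matches the scene_XX naming used in the output folder.
    return root / kind / f"scene_{n:02d}.{ext}"


def generate_assets(client, scenes: list[dict], root: Path) -> None:
    """For each scene: narration via TTS, then an illustration whose
    prompt resends the whole scene list (the API is stateless)."""
    story = "\n".join(s["description"] for s in scenes)
    for i, scene in enumerate(scenes, start=1):
        # Narration audio (model and voice names are assumptions).
        speech = client.audio.speech.create(
            model="tts-1", voice="alloy", input=scene["description"]
        )
        audio = scene_path(root, "audio", i, "mp3")
        audio.parent.mkdir(parents=True, exist_ok=True)
        audio.write_bytes(speech.content)

        # Illustration: full story context on EVERY call.
        client.images.generate(
            model="dall-e-3",
            prompt=f"Story:\n{story}\n\nIllustrate scene {i}: "
                   f"{scene['description']}",
        )
```

Note that only the current scene is narrated, but the image prompt always carries all scenes.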

Run it

$ python src/11_story_scene_pipeline.py --storyline "A young inventor builds a water-saving robot for her village" --num_scenes 5
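The two flags in the run command map naturally onto a small argparse parser; a sketch (the script's actual defaults and help text may differ):

```python
import argparse


def parse_args(argv=None):
    # Flags mirror the run command shown above.
    parser = argparse.ArgumentParser(description="Multimodal story pipeline")
    parser.add_argument("--storyline", required=True,
                        help="One-sentence story premise")
    parser.add_argument("--num_scenes", type=int, default=5,
                        help="How many scenes to generate")
    return parser.parse_args(argv)
```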

What it produces

  • output/story_pipeline/scenes.json — the structured scene list
  • output/story_pipeline/audio/scene_XX.mp3 — one narration per scene
  • output/story_pipeline/images/scene_XX.png — one illustration per scene
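scenes.json might look like the following. The field names and scene text are my guesses for illustration (check the file the script actually writes); the point is that downstream audio and image steps read a description per scene:

```python
import json

# Assumed shape of scenes.json; real field names may differ.
scenes = json.loads("""
[
  {"scene": 1, "description": "A young inventor sketches a water-saving robot."},
  {"scene": 2, "description": "The robot waters the village garden at dawn."}
]
""")
assert all("description" in s for s in scenes)  # downstream steps read this field
```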

Why this exercise matters

Everything comes together
H4 taught you the API is stateless. H5 taught you to resend context. H6–H9 taught you to call the vision, image, and TTS endpoints. H11 makes you use ALL of it in one script. This is production-shaped code.

Knowledge check

Why does the pipeline send the FULL scene list on every image generation call?

Recap — what you just learned

  • You can now chain text, vision, image, and TTS endpoints in one pipeline
  • Statelessness means you resend shared context (the scene list) on every call
  • This is the same pattern real multimodal apps use for consistency
  • You have the full Session 1 toolkit — every future session layers on top of this
Next up: Session 1 Troubleshooting

Session 1 complete
You know the OpenAI SDK end-to-end. Head to the Troubleshooting page for common gotchas, then take the end-of-session quiz to confirm your understanding before moving on to the next session.