Session 1 · H11 · 30 min

Multimodal Story Pipeline (capstone)

What you'll learn
  • Chain text → audio → image generation for each scene
  • Carry full scene context between calls to keep consistency
  • Produce a folder of scenes.json + audio + images from one storyline

What you will build

The capstone of Session 1. Given a storyline and a scene count, the script generates scenes, then for EACH scene generates narration audio and an illustration. Because the API is stateless, it sends the full scene list on every image call so the visuals stay consistent.
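The resend-everything idea can be sketched as a small prompt builder. The function name and prompt wording below are my own illustration, not necessarily what the script uses:

```python
def build_image_prompt(scenes: list[dict], index: int) -> str:
    """Embed the FULL scene list in every image prompt so the
    stateless image endpoint sees the same story context each call."""
    story = "\n".join(
        f"Scene {i + 1}: {s['description']}" for i, s in enumerate(scenes)
    )
    return (
        f"Full story (for visual consistency):\n{story}\n\n"
        f"Illustrate only scene {index + 1}: {scenes[index]['description']}"
    )
```

Every call receives the whole story, so characters and settings introduced in earlier scenes stay visible to the model even though it remembers nothing between calls.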

End-to-end pipeline
Storyline (one sentence in) → Text model → scenes.json
scenes.json → TTS model → scene audio
scenes.json → Image model → scene images
All files land in one folder: output/story_…
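Wired together, the per-scene loop looks roughly like this. The model names (`tts-1`, `dall-e-3`), voice, and helper structure are assumptions; `client` is an `openai.OpenAI()` instance passed in by the caller:

```python
from pathlib import Path


def scene_path(root: Path, kind: str, n: int, ext: str) -> Path:
    # Matches the scene_XX naming used in the output folder.
    return root / kind / f"scene_{n:02d}.{ext}"


def generate_assets(client, scenes: list[dict], root: Path) -> None:
    """For each scene: narration via TTS, then an illustration whose
    prompt resends the whole scene list (the API is stateless)."""
    story = "\n".join(s["description"] for s in scenes)
    for i, scene in enumerate(scenes, start=1):
        # Narration audio (model and voice names are assumptions).
        speech = client.audio.speech.create(
            model="tts-1", voice="alloy", input=scene["description"]
        )
        audio = scene_path(root, "audio", i, "mp3")
        audio.parent.mkdir(parents=True, exist_ok=True)
        audio.write_bytes(speech.content)

        # Illustration: full story context on EVERY call.
        client.images.generate(
            model="dall-e-3",
            prompt=f"Story:\n{story}\n\nIllustrate scene {i}: "
                   f"{scene['description']}",
        )
```

Note that only the current scene is narrated, but the image prompt always carries all scenes.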

Run it

$ python src/11_story_scene_pipeline.py --storyline "A young inventor builds a water-saving robot for her village" --num_scenes 5
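The two flags in the run command map naturally onto a small argparse parser; a sketch (the script's actual defaults and help text may differ):

```python
import argparse


def parse_args(argv=None):
    # Flags mirror the run command shown above.
    parser = argparse.ArgumentParser(description="Multimodal story pipeline")
    parser.add_argument("--storyline", required=True,
                        help="One-sentence story premise")
    parser.add_argument("--num_scenes", type=int, default=5,
                        help="How many scenes to generate")
    return parser.parse_args(argv)
```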

What it produces

  • output/story_pipeline/scenes.json — the structured scene list
  • output/story_pipeline/audio/scene_XX.mp3 — one narration per scene
  • output/story_pipeline/images/scene_XX.png — one illustration per scene
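scenes.json might look like the following. The field names and scene text are my guesses for illustration (check the file the script actually writes); the point is that downstream audio and image steps read a description per scene:

```python
import json

# Assumed shape of scenes.json; real field names may differ.
scenes = json.loads("""
[
  {"scene": 1, "description": "A young inventor sketches a water-saving robot."},
  {"scene": 2, "description": "The robot waters the village garden at dawn."}
]
""")
assert all("description" in s for s in scenes)  # downstream steps read this field
```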

Why this exercise matters

Everything comes together
H4 taught you the API is stateless. H5 taught you to resend context. H6–H9 taught you to call the vision, image, and TTS endpoints. H11 makes you use ALL of it in one script. This is production-shaped code.

Knowledge check

Why does the pipeline send the FULL scene list on every image generation call?

Recap — what you just learned

  • You can now chain text, vision, image, and TTS endpoints in one pipeline
  • Statelessness means you resend shared context (the scene list) on every call
  • This is the same pattern real multimodal apps use for consistency
  • You have the full Session 1 toolkit — every future session layers on top of this
Next up: Session 1 Troubleshooting

Session 1 complete
You know the OpenAI SDK end-to-end. Head to the Troubleshooting page for common gotchas, then take the end-of-session quiz to confirm your understanding before moving on to the next session.