The AI Video Production Playbook: How I Made a 90-Second Concept Video for $15

I built a production pipeline using Claude Code + 7 AI tools that turns a concept into a polished 90-second video — from one terminal, in one day, for under $15. Here's the exact workflow.

The Problem

Creating a 90-second concept video traditionally requires:

  • A motion graphics designer ($5K–$15K)
  • A voiceover artist ($500–$2K)
  • 2–4 weeks of back-and-forth
  • A video editor for assembly

With AI tools, one person can do this in a day — but only if you follow the right workflow. Most people waste hours fighting AI to render text correctly or regenerating entire videos because a single frame was wrong.

I learned this the hard way. My first attempt (V1) had blurry text, misspelled brand names, and a dark aesthetic that made my consumer product look like a surveillance tool. V2 — using the workflow below — fixed everything.

This playbook documents what worked.

The Pipeline

The workflow has four phases. Each one feeds the next — skip a step and you'll pay for it downstream.

Phase 1: Pre-Production

  1. Write the voiceover script FIRST
  2. Time each scene to the script
  3. Create a scene map (what happens in each scene)
  4. Write a Style Bible + Character Bible

Phase 2: Image Generation

  1. Generate draft frames with a fast tool (Gemini via Nano Banana MCP)
  2. Review ALL frames as a batch
  3. Fix pass — edit or regenerate only the failures
  4. Polish text-heavy scenes with a specialized tool (DALL-E 3 or Ideogram)
  5. Composite in Figma — add text, branding, typography as layers
  6. Export final frames at 1920x1080

Phase 3: Video Generation

  1. Send Frame A → Frame C pairs to Kling (image-to-video)
  2. Check status, download clips
  3. Slow down clips to match scene timing from the script

Phase 4: Post-Production

  1. FFmpeg — concatenate all clips into one video
  2. Add voiceover (ElevenLabs)
  3. Add background music (Suno or Udio)
  4. Add sound effects per scene
  5. Final edit in CapCut — text animations, keyframes
  6. Export in 16:9, 9:16, and 1:1
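
The FFmpeg parts of Phase 4 (step 1's concatenation and step 6's aspect-ratio exports) can be sketched as small command builders. This is a minimal sketch, not the exact pipeline from the project: `concat_command` and `center_crop` are names I'm introducing for illustration, and the crop math assumes 1920x1080 source frames.

```python
def concat_command(clips, list_file="clips.txt", output="video.mp4"):
    """Write ffmpeg's concat-demuxer list file and return the command to run."""
    with open(list_file, "w") as f:
        f.writelines(f"file '{c}'\n" for c in clips)
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", output]

def center_crop(width, height, aw, ah):
    """Largest centered crop with aspect ratio aw:ah, as an ffmpeg crop filter."""
    w = min(width, height * aw // ah)
    h = min(height, width * ah // aw)
    w -= w % 2  # most encoders require even dimensions
    h -= h % 2
    return f"crop={w}:{h}:(iw-{w})/2:(ih-{h})/2"

print(center_crop(1920, 1080, 9, 16))  # 9:16 vertical export
print(center_crop(1920, 1080, 1, 1))   # 1:1 square export
```

Pair the crop filter with `-filter:v` on the concatenated master to produce the vertical and square cuts from the same 16:9 edit.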

The Tools

Tool | Role | Why
--- | --- | ---
Nano Banana (Gemini Flash) | Draft image generation | Fast, free, follows prompts well
DALL-E 3 (OpenAI) | Text-heavy image polish | Best at following complex instructions
Ideogram | Text perfection | Purpose-built for text in images
Figma | Compositing + typography | Pixel-perfect text, brand consistency
Kling | Image-to-video | Best stylized video gen, supports start+end frames
CapCut | Video editing | Text animations, keyframes, transitions
ElevenLabs | Voiceover | Natural narration with emotion control
FFmpeg | Assembly | Concat, speed, format conversion

Every single one of these is accessible from Claude Code via MCP servers or Bash commands. One terminal. No context switching.

The 5 Rules That Save Hours

Rule 1: Images first. Always.

The entire pipeline flows downstream from your images. A blurry Frame A produces a blurry video. Fixing a bad image takes 30 seconds. Fixing a bad video means re-running a 5-minute Kling job and waiting for it all over again.

Spend 80% of your time getting images right. The rest is mechanical.

Rule 2: Never ask AI to render your brand name.

AI image generators will misspell your brand. "ProductName" becomes "ProdutcName" or "ProduceName." Every. Single. Time. I watched it happen across three different generators.

Solution: Generate the scene WITHOUT text pressure. Then open in Figma, add your brand name as a text layer with your exact font, size, and color. Export. Send to Kling. Perfect text, every time.

This applies to all important text: taglines, data labels, impact stats.

Rule 3: Write a Style Bible and reuse it.

A Style Bible is a paragraph that defines your visual aesthetic. Prepend it to every prompt so you don't re-describe the look 9 times.

Example:

LEGO stop-motion style, Pixar-quality character design, cinematic 16:9 
composition, warm golden lighting, shallow depth of field, photorealistic 
LEGO miniature photography.

Pair it with a Character Bible for your protagonist:

Male LEGO minifigure with glasses, black wavy hair, expressive Pixar-style 
face, wearing a plaid flannel shirt.

Your prompts become: [Style Bible] + [Character Bible] + [Scene description]

This cuts prompt-writing time by 60% and keeps visual consistency across scenes.
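In code, the [Style Bible] + [Character Bible] + [Scene description] formula is just string assembly. A minimal sketch, using the example Bibles above; `build_prompt` is a name I'm introducing for illustration:

```python
STYLE_BIBLE = (
    "LEGO stop-motion style, Pixar-quality character design, cinematic 16:9 "
    "composition, warm golden lighting, shallow depth of field, "
    "photorealistic LEGO miniature photography."
)
CHARACTER_BIBLE = (
    "Male LEGO minifigure with glasses, black wavy hair, expressive "
    "Pixar-style face, wearing a plaid flannel shirt."
)

def build_prompt(scene_description, with_character=True):
    """Prepend the Style Bible (and the Character Bible when the
    protagonist appears) to a scene description."""
    parts = [STYLE_BIBLE]
    if with_character:
        parts.append(CHARACTER_BIBLE)
    parts.append(scene_description)
    return " ".join(parts)

print(build_prompt("He opens a laptop at a cluttered desk."))
```

Because every prompt starts from the same two constants, a change to the aesthetic propagates to all nine scenes in one edit.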

Rule 4: Batch generate, batch review, batch fix.

Wrong: Generate Scene 1 → review → fix → generate Scene 2 → review → fix → repeat 9 times.

Right: Generate all 18 frames → review everything at once → fix only the failures.

You'll spot cross-scene inconsistencies that you'd miss going one-at-a-time. And the fix pass is faster because you've already calibrated your eye.
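The batch pattern reduces to three passes over the scene list. A sketch of the order of operations only; `generate_frame` is a hypothetical stub standing in for whichever image tool you call:

```python
def generate_frame(scene, key):
    """Hypothetical stub for an image-generation call."""
    return f"frame_{scene}{key}.png"

# Pass 1 — batch generate: all 18 frames (9 scenes x Frame A/C) before any review
frames = [(s, k) for s in range(1, 10) for k in ("A", "C")]
paths = [generate_frame(s, k) for s, k in frames]

# Pass 2 — batch review (a human step): collect only the failures
failures = [(4, "A"), (7, "C")]  # e.g. garbled text, off-model character

# Pass 3 — fix pass: regenerate just those
fixed = [generate_frame(s, k) for s, k in failures]
```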

Rule 5: Script drives timing, not the other way around.

Most people generate 9 equal-length clips (10 seconds each) and then try to fit a voiceover on top. This creates awkward pacing.

Instead: Write the voiceover script first. Time each section. Then make scene durations match the script.

Scene 1 (problem):       12 seconds — needs time to feel the pain
Scene 2 (processing):     8 seconds — quick, mechanical  
Scene 3 (concept):        8 seconds — visual, not narrative
Scene 9 (payoff):        15 seconds — emotional climax, needs to breathe
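Translating those script-driven durations into Phase 3's slow-down step is simple arithmetic with FFmpeg's `setpts` filter. A sketch, assuming the Kling clips come back at a fixed 10 seconds (your clip lengths may differ):

```python
def setpts_factor(clip_seconds, target_seconds):
    """Multiplier for ffmpeg's setpts filter: > 1 slows a clip down,
    < 1 speeds it up. Kling clips are silent, so no audio to retime."""
    return target_seconds / clip_seconds

# Scene durations taken from the voiceover script, per the table above
targets = {1: 12, 2: 8, 3: 8, 9: 15}
for scene, target in targets.items():
    factor = setpts_factor(10.0, target)  # assuming 10-second Kling clips
    print(f'scene{scene}: ffmpeg -i in.mp4 -filter:v "setpts={factor:.2f}*PTS" out.mp4')
```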

The Frame A → Frame C Strategy

For each scene, generate TWO keyframe images:

  • Frame A — the starting state (establishing shot)
  • Frame C — the ending state (where the motion ends)

Kling generates video that transitions from Frame A to Frame C. The more similar they are (same composition, slight changes), the smoother the video.

Good A→C pairs:

  • Same scene, camera pushes in slightly
  • Character changes pose (neutral → smiling)
  • Objects appear or light up (cards fan out, connections glow)

Bad A→C pairs:

  • Completely different compositions (Kling won't know what to do)
  • Too many simultaneous changes (creates visual chaos)

The Multi-Tool Composite Workflow

This is the real unlock. No single tool is best at everything:

Scene type | Best tool | Why
--- | --- | ---
Cinematic/emotional | Gemini or Midjourney | Best aesthetics and lighting
Text-heavy (data, labels) | DALL-E 3 or Ideogram | Best text rendering
Brand name visible | Any tool + Figma | Never trust AI with your brand
Character consistency | Same tool + Character Bible | Reference previous outputs

The composite workflow:

  1. Generate the visual in Gemini (ignore text quality)
  2. Import PNG into Figma (1920x1080 frame)
  3. Add text layers — brand name, labels, data, taglines
  4. Add brand assets — logo, consistent colors
  5. Export as PNG
  6. Send to Kling

AI does what it's good at (visuals). Figma does what it's good at (typography). You never fight AI over spelling again.

Common Mistakes

Mistake | Why it hurts | Fix
--- | --- | ---
Sending "good enough" images to Kling | Bad input = bad video, and video gen is slow | Spend the extra minute editing the image
Fighting AI to spell your brand | You'll waste 5+ attempts | Figma text overlay
Vague motion prompts | "camera moves" = random motion | "slow zoom in, documents scatter slightly"
Adding voiceover last | Timing mismatches force re-editing | Write script first
Editing one scene at a time | Miss cross-scene issues | Batch generate, batch review
One tool for everything | Every tool has weaknesses | Multi-tool pipeline
Dark aesthetic for consumer product | Feels cold and intimidating | Warm, bright, Pixar-level charm
Internal jargon in the video | Viewers don't know your architecture | User-facing language only

The Automation Layer

Here's what makes this workflow different from "just use Midjourney and iMovie": everything runs from Claude Code via MCP servers.

# Image generation
nano-banana → generate_image / continue_editing
openai-image → gpt_image_generate / gpt_image_edit
ideogram → ideogram_generate / ideogram_edit / ideogram_upscale

# Compositing
figma → use_figma / create_new_file / get_screenshot

# Video generation  
kling → generate_image_to_video / check_video_status

# Assembly
ffmpeg → concat, speed, crop (via Bash)

# Editing
capcut → create_draft / add_video / add_text / save_draft

# Audio
elevenlabs → text-to-speech (via API)

One person. One terminal. No context switching between apps. You describe what you want, Claude generates it, you review, iterate, ship.

Cost Breakdown

For a 90-second concept video with 9 scenes:

Item | Cost
--- | ---
Gemini (Nano Banana) | Free
DALL-E 3 (HD, ~10 images) | ~$1
Ideogram (~10 images) | ~$1
Figma | Included in premium
Kling (9 clips) | ~$5–10
ElevenLabs voiceover | ~$1–3
FFmpeg / CapCut | Free
Total | ~$8–15

Compare that to $5,000–$15,000 for a traditional motion graphics studio: roughly a 500x cost reduction.

Template: Scene Planning Sheet

Copy this for each scene in your video:

Scene [#]: [Title]
Duration: [X seconds, based on voiceover script]
Voiceover: "[Exact narration for this scene]"

Frame A (start):
- Composition: [what the camera sees]
- Key elements: [characters, objects, text]
- Mood: [emotional tone]

Frame C (end):
- What changed: [motion, new elements, lighting shift]
- Key elements: [what appeared or transformed]

Text overlays (for Figma):
- [List exact text to add in post]

Kling motion prompt:
- "[Specific motion description]"

Sound effects:
- [What sounds play during this scene]
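If you drive the pipeline from a script rather than by hand, the planning sheet maps naturally onto a small data structure. A sketch under that assumption; the field names are mine, not from any tool here:

```python
from dataclasses import dataclass, field

@dataclass
class Scene:
    number: int
    title: str
    duration_s: float      # from the voiceover script
    voiceover: str
    frame_a: str           # starting-state description
    frame_c: str           # ending-state description
    text_overlays: list = field(default_factory=list)  # added in Figma, not by AI
    motion_prompt: str = ""                            # sent to Kling
    sound_effects: list = field(default_factory=list)

opening = Scene(
    number=1, title="The Problem", duration_s=12.0,
    voiceover="Every week, the same pile of paperwork...",
    frame_a="Minifigure buried behind stacks of paper",
    frame_c="Same shot, camera pushed in slightly, minifigure sighs",
    motion_prompt="slow zoom in, documents scatter slightly",
)
```

One `Scene` per entry in the sheet gives you a single source of truth for prompt assembly, clip retiming, and the Figma overlay list.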

This playbook was developed while producing a LEGO Pixar-style concept video for a SaaS product I'm building. The workflow was iterated over two production cycles (V1 and V2) and refined through real failures and fixes. V1 taught me what NOT to do. V2 — using this pipeline — produced 18 frames in under 15 minutes, all with legible text, consistent characters, and warm Pixar aesthetics.

Tools used: Claude Code (Opus) + Nano Banana + DALL-E 3 + Ideogram + Figma + Kling + CapCut + ElevenLabs + FFmpeg