Down the Rabbit Hole
My Alice in Wonderland Journey Through AI Video Generation

"Curiouser and curiouser!" cried Alice. I know exactly how she felt. What started as a simple question—"Can AI generate marketing videos?"—turned into a months-long expedition through a wonderland of impossible text, spelling nightmares, and the profound discovery that a prompt alone cannot conjure video magic.
Like Alice chasing a white rabbit, I fell into a hole I didn't know existed. And like Alice, I emerged with a completely different understanding of reality.
The White Rabbit: "Google Has a Video AI"
It started, as these things do, with a passing comment. Someone mentioned Google's Veo could generate videos from text prompts. Eight seconds of video from nothing but words.
Eight seconds, I thought. That's nothing. That's barely a blink.
But the rabbit hole beckoned. And like Alice, I couldn't resist following.
"Would you tell me, please, which way I ought to go from here?"
"That depends a good deal on where you want to get to," said the Cat.
— Lewis Carroll, Alice's Adventures in Wonderland
The problem was, I didn't know where I wanted to get to. I just knew I wanted to see what was possible. And that, as it turns out, is exactly how you end up at a mad tea party with no sense of time.
The Pool of Tears: First Experiments
My first Veo prompt was embarrassingly naive:
"Create a professional marketing video showing invoice automation saving time for finance teams. Show the software interface with clear labels."
What I got back was... surreal.
The video looked professional enough. Smooth motion. Nice lighting. But the text on screen? Gibberish. Pure, beautiful, utterly nonsensical gibberish.
Where I'd asked for "Marketing_Budget.xlsx," I got "Markefing_Bujet.xlsz." Where I wanted "Invoice Processing," I received "Invoise Prosseccing." The AI had created a parallel universe where everything looked right but nothing was spelled correctly.
The Harsh Reality of AI Text Rendering
- File names: "Marketing_Buget.xlsx" instead of "Marketing_Budget.xlsx"
- Email subjects: "Expensestes" instead of "Expenses"
- UI labels: Pure gibberish characters
- Currency symbols: € appearing where $ should be
Success rate with on-screen text: ~40%
I had fallen into Wonderland's Pool of Tears. The AI could generate visuals that looked professional, but it fundamentally didn't understand spelling. It worked with patterns and pixels, not grammar and orthography.
Alice cried so much she nearly drowned. I spent three weeks trying to make text render correctly before accepting reality.
The Caterpillar's Question: "Who Are You?"
After the text disasters, I had an identity crisis. Was I trying to be a video producer? A prompt engineer? A software developer building video tools?
The Caterpillar in Alice asks the most profound question: "Who are you?" And Alice can't answer because she's changed so many times since falling down the rabbit hole.
I felt the same way. Each experiment transformed my understanding.
Week 1: "I'm a prompt engineer!"
Surely the right words would unlock perfect videos. I wrote prompts like poetry. They produced beautiful gibberish.
Week 3: "I'm a video producer!"
Maybe I needed to think in storyboards, shot sequences, narrative arcs. The videos got better, but text remained a nightmare.
Week 6: "I'm a systems architect!"
The realization hit: this wasn't about single videos. It was about building systems that could reliably produce videos at scale.
Week 12: "I'm an orchestrator."
Multiple AI models, multiple techniques, multiple stitching strategies—all coordinated into a unified workflow.
The Mad Tea Party: Discovering Extension Chaining
The Mad Hatter's tea party is famous for its absurdity. Time doesn't work right. Everyone moves around the table endlessly. Nothing makes sense until you accept the rules are different here.
That's exactly what happened when I discovered Veo's extension chaining.
See, Veo generates 8-second clips. That's the limit. Eight seconds. But—and here's where it gets mad—you can extend a video by passing back a special token called veoVideoToken.
The Extension Chain Discovery
- Initial clip: 8 seconds
- Each extension: 7 seconds
- Maximum extensions: 8
- Total possible length: 64 seconds of continuous video
But here's the mad part: each extension continues seamlessly from the previous clip's last frame. No cuts. No transitions. One continuous shot.
This changed everything. Instead of stitching together six separate 5-second clips with jarring cuts, I could generate a flowing 30-second video that felt like a single continuous take.
The Mad Hatter would have loved it. "It's always six o'clock now," he says in the book, stuck in permanent tea-time. With extension chaining, I could stretch time itself—eight seconds becoming sixty-four through a chain of temporal extensions.
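Strip away the metaphor and the chain is just a loop that keeps passing the previous clip's token forward. Here's a rough sketch in Node, where `generateClip` and `extendClip` are hypothetical wrappers around whatever Veo client you're using, not real SDK calls:

```javascript
// Sketch of an extension chain. `generateClip` and `extendClip` are
// hypothetical wrappers around the Veo API, not real method names.
async function buildChain(basePrompt, extensionPrompts) {
  // Initial generation: one 8-second clip plus its veoVideoToken.
  let { videoUri, veoVideoToken } = await generateClip(basePrompt);
  const clips = [videoUri];

  // Each extension continues from the previous clip's last frame
  // and adds ~7 seconds. Veo caps the chain at 8 extensions.
  for (const prompt of extensionPrompts.slice(0, 8)) {
    // Extend promptly: the token is short-lived.
    ({ videoUri, veoVideoToken } = await extendClip(veoVideoToken, prompt));
    clips.push(videoUri);
  }

  return clips; // 8s + (n x 7s), up to 64 seconds of continuous video
}
```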
But like the tea party, there were hidden rules:
- Tokens expire—take too long between extensions and the chain breaks
- Regeneration cascades—change clip 3 and clips 4-8 all need regeneration
- Style drift—too many extensions and the visual coherence wanders
Welcome to the tea party. Move down, move down.
The Queen's Croquet: The Spelling Prevention Techniques
In Wonderland, the Queen plays croquet with flamingos as mallets and hedgehogs as balls. The game is impossible because the equipment won't cooperate. The flamingo twists its neck. The hedgehog uncurls and walks away.
That's what trying to render text felt like.
No matter how precisely I specified the words, the AI would introduce errors. It wasn't malicious—the hedgehog isn't trying to ruin your game—it just doesn't understand what you want.
So I developed eight techniques to work around the uncooperative equipment:
1. Don't ask for "Marketing_Budget.xlsx"—ask for "Excel files with department names visible." The AI can't misspell what you never asked it to spell.
2. Replace "95% TIME SAVINGS" with progress bars and color indicators. No text means no spelling errors.
3. Keep text short. "LATE" works. "Month-End Close: Day 2 of 3" fails. Short text has dramatically higher accuracy.
4. Put critical text in "quotes" to signal literal strings: File name displays "Sales_Q4.xlsx".
5. Spell it out syllable by syllable. "File showing Mar-ket-ing Bud-get dot x-l-s-x" works better than the literal filename.
6. Add to every prompt: "No gibberish text. No text artifacts. No nonsensical characters."
7. Keep the camera still when text must appear. Camera movement increases text errors.
8. Limit the element count. Show 2-3 file names, not 10. Fewer elements means fewer errors.
The Queen's croquet game isn't winnable by force. You win by understanding that flamingos aren't mallets and working with what you've got.
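A few of these guardrails are easy to bake into prompt construction so you don't have to remember them every time. A minimal sketch, with a structure and field names of my own invention rather than anything from the Veo API:

```javascript
// Hypothetical prompt builder that applies a few of the guardrails above.
const NEGATIVE_SUFFIX =
  'No gibberish text. No text artifacts. No nonsensical characters.';

function buildSafePrompt({ scene, criticalText = [] }) {
  const parts = [scene];

  // Keep the scene sparse: 2-3 labeled elements, not 10.
  criticalText.slice(0, 3).forEach((text) => {
    // Quote literal strings so the model treats them as exact text.
    parts.push(`Screen displays "${text}"`);
  });

  // Camera movement increases text errors, so lock it off when text matters.
  if (criticalText.length > 0) {
    parts.push('Static camera, no movement.');
  }

  // Always append the negative instruction.
  parts.push(NEGATIVE_SUFFIX);
  return parts.join(' ');
}

// Example:
// buildSafePrompt({
//   scene: 'Clean finance dashboard on a laptop screen',
//   criticalText: ['Sales_Q4.xlsx'],
// });
```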
The Drink Me Bottle: Image-to-Video Changes Everything
In Alice's story, she finds a bottle labeled "DRINK ME." It makes her shrink—but that shrinking is what allows her to fit through the tiny door into the beautiful garden.
My "Drink Me" moment was discovering image-to-video generation.
Here's the problem with pure text-to-video: the AI decides what the first frame looks like. Sometimes it's perfect. Sometimes it's wildly off from what you imagined. You're rolling dice with every generation.
But what if you could specify the first frame?
The Image-to-Video Workflow
- Step 1: Generate a precise starting frame with Imagen 3
- Step 2: Pass that image to Veo as input
- Step 3: Veo animates from your exact image
Result: Controlled composition, brand consistency, predictable starting points.
This was transformative. Instead of hoping Veo would generate the dashboard layout I wanted, I could show it the exact dashboard and say "animate this."
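Here's what that handoff looks like as a sketch. `generateFrame` and `animateFromImage` stand in for the real Imagen and Veo calls, which will differ in your setup:

```javascript
// Sketch of the image-to-video handoff. `generateFrame` (Imagen) and
// `animateFromImage` (Veo) are hypothetical wrappers, not real SDK calls.
async function controlledClip() {
  // Step 1: pin down the exact first frame with an image model.
  const frame = await generateFrame(
    'Clean SaaS dashboard, brand-blue header, three KPI cards, no readable text'
  );

  // Steps 2 & 3: hand that frame to the video model and describe only motion.
  const clip = await animateFromImage(frame, {
    prompt: 'Slow push-in on the dashboard as the KPI cards animate upward',
    durationSeconds: 8,
  });

  return clip; // Starts from your composition instead of a dice roll.
}
```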
The "Drink Me" bottle made Alice small enough to enter Wonderland properly. Image-to-video made AI video controllable enough to use professionally.
A prompt is not enough. You need to prime the context with an image.
The Cheshire Cat: Google's Generosity
The Cheshire Cat appears and disappears throughout Alice's journey, offering cryptic guidance with that famous grin. You never know when he'll show up or what he'll say, but his presence is always... helpful, in its own strange way.
Google has been my Cheshire Cat.
When I started this journey, I was terrified of costs. AI video generation isn't cheap. Each Veo clip costs roughly $0.10-0.15. Imagen frames are $0.02-0.04. Extensions add up. A 30-second marketing video might run $0.75. Do that hundreds of times while experimenting and you're looking at real money.
Then Google appeared, grinning, with thousands of dollars in cloud credits.
The Real Costs of AI Video
| Video Type | Clips | Est. Cost |
|---|---|---|
| 15s Social Clip | 3 clips | ~$0.36 |
| 30s Marketing Video | 6 clips | ~$0.72 |
| 30s Continuous | 1 + 3 ext | ~$0.42 |
| 60s Product Demo | 1 + 7 ext | ~$0.92 |
With credits: Hundreds of experiments for free. Without credits: This blog post wouldn't exist.
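If you want to budget before you experiment, the per-unit ranges above are enough for a back-of-the-envelope bracket. A quick sketch, with the assumption (mine, not Google's) that an extension costs roughly the same as a clip:

```javascript
// Rough cost bracket using the ranges quoted above:
// $0.10-0.15 per Veo clip or extension, $0.02-0.04 per Imagen frame.
// Pricing changes; treat this as back-of-the-envelope only.
function estimateCost({ clips = 0, extensions = 0, imagenFrames = 0 }) {
  const units = clips + extensions;
  return {
    low: units * 0.10 + imagenFrames * 0.02,
    high: units * 0.15 + imagenFrames * 0.04,
  };
}

// A 30-second continuous video (1 clip + 3 extensions):
// estimateCost({ clips: 1, extensions: 3 }) -> { low: 0.40, high: 0.60 }
```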
The Cheshire Cat's generosity allowed me to fail spectacularly, learn from every failure, and eventually build something that works. You can't discover the rules of Wonderland without playing the game, and you can't play the game if every move costs real money.
Building the Video Builder: A Web UI Detour
At some point during my journey, I thought: "This needs a graphical interface. Users shouldn't have to write JSON configs and run Node scripts."
So I built a Video Builder MVP. A full web interface with:
- Clip-by-clip storyboard editing
- Live preview of prompts
- Voiceover script timing calculations
- Generation status tracking
- FFmpeg stitching integration
It was a beautiful detour. Like Alice exploring the Duchess's kitchen—interesting, but not where she needed to be.
The web UI worked. But I realized something important: the complexity wasn't in the interface—it was in the orchestration.
Users didn't need buttons to click. They needed an intelligent system that could take a concept and guide them through approach selection, storyboard generation, spelling prevention, source selection, and stitching strategies.
They needed a skill, not a UI.
The Trial: Building the Unified Video Creator Skill
At the end of Alice's journey, there's a trial. The Queen wants to execute everyone. Cards are flying. Nothing makes sense. And Alice, finally, has had enough.
"You're nothing but a pack of cards!" she shouts. And Wonderland dissolves.
My trial was building the unified skill. Taking everything I'd learned—every failed prompt, every spelling disaster, every successful technique—and encoding it into a system that could guide anyone through the process.
The Video Creator Skill: What I Built
- Approach templates (3 video approaches)
- Concept-to-storyboard guide
- Prompt patterns for each source
- Voiceover timing rules
- Spelling prevention techniques
- Stitching strategies
- Veo clip generator
- Veo extension chainer
- Imagen frame generator
- FFmpeg stitcher
- Video orchestrator
21 files. Thousands of lines. Months of learning distilled into a reusable system.
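In spirit, the orchestrator is a short pipeline that threads those pieces together. This sketch uses stand-in function names, one per component above, rather than the skill's actual code:

```javascript
// Hypothetical orchestrator flow; every function name is a stand-in
// for one of the skill's components, not code lifted from the skill.
async function createVideo(concept, outPath) {
  const approach = await suggestApproach(concept);              // approach templates
  const storyboard = await buildStoryboard(concept, approach);  // concept-to-storyboard guide

  const clipPaths = [];
  for (const shot of storyboard.shots) {
    const prompt = buildSafePrompt(shot);                       // spelling prevention
    const frame = shot.needsExactComposition
      ? await generateFrame(shot.framePrompt)                   // Imagen frame generator
      : null;
    clipPaths.push(await generateClip(prompt, { startImage: frame })); // Veo clip generator
  }

  // Stitching strategy: an extension chain for one continuous take,
  // FFmpeg concat (sketched later) for distinct scenes.
  return stitchClips(clipPaths, outPath);                       // FFmpeg stitcher
}
```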
The trial isn't about convicting anyone. It's about Alice realizing she's grown—literally and figuratively—and she doesn't have to play by Wonderland's rules anymore.
Building the skill was my moment of clarity. The chaos of AI video generation could be tamed. Not by fighting it, but by encoding the rules of this strange world into systems that others could follow.
Waking Up: What I Learned
Alice wakes up on the riverbank, her sister brushing dead leaves off her face. Was it a dream? Did it really happen?
I emerged from my AI video rabbit hole with these truths:
1. A Prompt Is Not Enough
You cannot summon professional video from words alone. You need starting images for context. You need voiceover scripts for timing. You need spelling prevention techniques. You need to understand which approach fits your content.
2. Text Is the Enemy (Until You Understand It)
AI video models don't understand spelling. They work with patterns. The moment you accept this and work with the limitation—generic descriptions, text-free visuals, voiceover for information—everything gets easier.
3. Extension Chaining Is Magic (With Caveats)
Turning 8 seconds into 64 seconds of continuous video is transformative. But the chain is fragile. Tokens expire. Regeneration cascades. Use it for continuous narratives, not for everything.
4. Image-to-Video Is the Real Secret
Generate your starting frame with Imagen. Then animate with Veo. This gives you control over composition, branding, and visual consistency that pure text-to-video can't match.
5. The Cost Is Real (But Manageable)
A 30-second marketing video costs less than a dollar. That's nothing compared to traditional video production. But experimentation adds up—Google's credits let me learn without going broke.
6. Systems Beat Interfaces
I built a web UI. It was fine. But what users really need is intelligent orchestration—a system that knows when to use FFmpeg vs. extensions, how to prevent spelling errors automatically, and how to transform concepts into storyboards.
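For the distinct-scenes case, the stitching itself is the easy part: FFmpeg's concat demuxer joins clips without re-encoding. A minimal Node wrapper, assuming all clips share the same codec and resolution:

```javascript
// Losslessly join separately generated clips with FFmpeg's concat demuxer.
const { execFileSync } = require('child_process');
const { writeFileSync } = require('fs');

function stitchClips(clipPaths, outPath) {
  // The concat demuxer reads a small text file listing the inputs.
  writeFileSync('clips.txt', clipPaths.map((p) => `file '${p}'`).join('\n'));

  // -c copy skips re-encoding; it works when every clip shares codec/resolution.
  execFileSync('ffmpeg', [
    '-f', 'concat', '-safe', '0',
    '-i', 'clips.txt',
    '-c', 'copy',
    outPath,
  ]);
}

// stitchClips(['clip1.mp4', 'clip2.mp4', 'clip3.mp4'], 'marketing_30s.mp4');
```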
The Beautiful Garden
At the beginning of Alice's adventure, she glimpses a beautiful garden through a tiny door. The entire journey—the shrinking, the growing, the Mad Hatter, the Queen—is really just about getting into that garden.
My beautiful garden? It's this:
From a single concept, in a single conversation, create a professional AI-generated video.
Describe your idea. Get 2-3 approach suggestions. See a storyboard. Watch it generate. Download your video.
That's the garden I was trying to reach. I'm finally here.
The journey isn't over. HeyGen avatar integration is coming. Audio mixing needs work. There are always more spelling techniques to discover.
But I can see the garden now. I can walk in it.
And unlike Alice, I don't have to wake up.
An Invitation
Lewis Carroll ends Alice's Adventures with her sister imagining Alice grown up, telling children about her strange dream.
I'm writing this for the same reason: so you don't have to fall down the same rabbit hole blind.
If you want to explore AI video generation:
- Start with Veo 3.1—it's the best model right now
- Accept the text limitations early—don't fight them
- Use image-to-video for control—prompts alone aren't enough
- Learn extension chaining—it's the key to longer videos
- Build systems, not just videos—encode your learnings into reusable workflows
The rabbit hole is always there. The white rabbit is always running past, checking his watch, muttering about being late.
Follow him if you dare.
Just don't say I didn't warn you about the tea party.
Ready to Fall Down the Rabbit Hole?
We've built the map. We've documented the tea party etiquette. We've figured out how to get into the beautiful garden.
Let us guide your AI video journey—without the months of trial and error.
Start Your Journey

Written by Nolan in collaboration with Claude AI—my Cheshire Cat through this entire adventure, appearing whenever I needed guidance and grinning at my failures until they became successes. Learn more at upnorthdigital.ai.