Down the Rabbit Hole
My Alice in Wonderland Journey Through AI Video Generation

"Curiouser and curiouser!" cried Alice. I know exactly how she felt. What started as a simple question—"Can AI generate marketing videos?"—turned into a months-long expedition through a wonderland of impossible text, spelling nightmares, and the profound discovery that a prompt alone cannot conjure video magic.
Like Alice chasing a white rabbit, I fell into a hole I didn't know existed. And like Alice, I emerged with a completely different understanding of reality.
The White Rabbit: "Google Has a Video AI"
It started, as these things do, with a passing comment. Someone mentioned Google's Veo could generate videos from text prompts. Eight seconds of video from nothing but words.
Eight seconds, I thought. That's nothing. That's barely a blink.
But the rabbit hole beckoned. And like Alice, I couldn't resist following.
"Would you tell me, please, which way I ought to go from here?"
"That depends a good deal on where you want to get to," said the Cat.
— Lewis Carroll, Alice's Adventures in Wonderland
The problem was, I didn't know where I wanted to get to. I just knew I wanted to see what was possible. And that, as it turns out, is exactly how you end up at a mad tea party with no sense of time.
The Pool of Tears: First Experiments
My first Veo prompt was embarrassingly naive:
"Create a professional marketing video showing invoice automation saving time for finance teams. Show the software interface with clear labels."
What I got back was... surreal.
The video looked professional enough. Smooth motion. Nice lighting. But the text on screen? Gibberish. Pure, beautiful, utterly nonsensical gibberish.
Where I'd asked for "Marketing_Budget.xlsx," I got "Markefing_Bujet.xlsz." Where I wanted "Invoice Processing," I received "Invoise Prosseccing." The AI had created a parallel universe where everything looked right but nothing was spelled correctly.
The Harsh Reality of AI Text Rendering
- File names: "Marketing_Buget.xlsx" instead of "Marketing_Budget.xlsx"
- Email subjects: "Expensestes" instead of "Expenses"
- UI labels: Pure gibberish characters
- Currency symbols: € appearing where $ should be
Success rate with on-screen text: ~40%
I had fallen into Wonderland's Pool of Tears. The AI could generate visuals that looked professional, but it fundamentally didn't understand spelling. It worked with patterns and pixels, not grammar and orthography.
Alice cried so much she nearly drowned. I spent three weeks trying to make text render correctly before accepting reality.
The Caterpillar's Question: "Who Are You?"
After the text disasters, I had an identity crisis. Was I trying to be a video producer? A prompt engineer? A software developer building video tools?
The Caterpillar in Alice asks the most profound question: "Who are you?" And Alice can't answer because she's changed so many times since falling down the rabbit hole.
I felt the same way. Each experiment transformed my understanding.
Week 1: "I'm a prompt engineer!"
Surely the right words would unlock perfect videos. I wrote prompts like poetry. They produced beautiful gibberish.
Week 3: "I'm a video producer!"
Maybe I needed to think in storyboards, shot sequences, narrative arcs. The videos got better, but text remained a nightmare.
Week 6: "I'm a systems architect!"
The realization hit: this wasn't about single videos. It was about building systems that could reliably produce videos at scale.
Week 12: "I'm an orchestrator."
Multiple AI models, multiple techniques, multiple stitching strategies—all coordinated into a unified workflow.
The Mad Tea Party: Discovering Extension Chaining
The Mad Hatter's tea party is famous for its absurdity. Time doesn't work right. Everyone moves around the table endlessly. Nothing makes sense until you accept the rules are different here.
That's exactly what happened when I discovered Veo's extension chaining.
See, Veo generates 8-second clips. That's the limit. Eight seconds. But—and here's where it gets mad—you can extend a video by passing back a special token called veoVideoToken.
The Extension Chain Discovery
- Initial clip: 8 seconds
- Each extension: 7 seconds
- Maximum extensions: 8
- Total possible length: 64 seconds of continuous video
But here's the mad part: each extension continues seamlessly from the previous clip's last frame. No cuts. No transitions. One continuous shot.
This changed everything. Instead of stitching together six separate 5-second clips with jarring cuts, I could generate a flowing 30-second video that felt like a single continuous take.
The Mad Hatter would have loved it. "It's always six o'clock now," he says in the book, stuck in permanent tea-time. With extension chaining, I could stretch time itself—eight seconds becoming sixty-four through a chain of temporal extensions.
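Strip away the metaphor and the chain is just a loop that keeps passing the previous clip's token forward. Here's a rough sketch in Node, where `generateClip` and `extendClip` are hypothetical wrappers around whatever Veo client you're using, not real SDK calls:

```javascript
// Sketch of an extension chain. `generateClip` and `extendClip` are
// hypothetical wrappers around the Veo API, not real method names.
async function buildChain(basePrompt, extensionPrompts) {
  // Initial generation: one 8-second clip plus its veoVideoToken.
  let { videoUri, veoVideoToken } = await generateClip(basePrompt);
  const clips = [videoUri];

  // Each extension continues from the previous clip's last frame
  // and adds ~7 seconds. Veo caps the chain at 8 extensions.
  for (const prompt of extensionPrompts.slice(0, 8)) {
    // Extend promptly: the token is short-lived.
    ({ videoUri, veoVideoToken } = await extendClip(veoVideoToken, prompt));
    clips.push(videoUri);
  }

  return clips; // 8s + (n x 7s), up to 64 seconds of continuous video
}
```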
But like the tea party, there were hidden rules:
- Tokens expire—take too long between extensions and the chain breaks
- Regeneration cascades—change clip 3 and clips 4-8 all need regeneration
- Style drift—too many extensions and the visual coherence wanders
Welcome to the tea party. Move down, move down.
The Queen's Croquet: The Spelling Prevention Techniques
In Wonderland, the Queen plays croquet with flamingos as mallets and hedgehogs as balls. The game is impossible because the equipment won't cooperate. The flamingo twists its neck. The hedgehog uncurls and walks away.
That's what trying to render text felt like.
No matter how precisely I specified the words, the AI would introduce errors. It wasn't malicious—the hedgehog isn't trying to ruin your game—it just doesn't understand what you want.
So I developed eight techniques to work around the uncooperative equipment:
1. Don't ask for "Marketing_Budget.xlsx"—ask for "Excel files with department names visible." The AI can't misspell what you never asked it to spell.
2. Replace "95% TIME SAVINGS" with progress bars and color indicators. No text means no spelling errors.
3. Keep text short. "LATE" works. "Month-End Close: Day 2 of 3" fails. Short text has dramatically higher accuracy.
4. Put critical text in "quotes" to signal literal strings: File name displays "Sales_Q4.xlsx".
5. Spell it out syllable by syllable. "File showing Mar-ket-ing Bud-get dot x-l-s-x" works better than the literal filename.
6. Add to every prompt: "No gibberish text. No text artifacts. No nonsensical characters."
7. Keep the camera still when text must appear. Camera movement increases text errors.
8. Limit the element count. Show 2-3 file names, not 10. Fewer elements means fewer errors.
The Queen's croquet game isn't winnable by force. You win by understanding that flamingos aren't mallets and working with what you've got.
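A few of these guardrails are easy to bake into prompt construction so you don't have to remember them every time. A minimal sketch, with a structure and field names of my own invention rather than anything from the Veo API:

```javascript
// Hypothetical prompt builder that applies a few of the guardrails above.
const NEGATIVE_SUFFIX =
  'No gibberish text. No text artifacts. No nonsensical characters.';

function buildSafePrompt({ scene, criticalText = [] }) {
  const parts = [scene];

  // Keep the scene sparse: 2-3 labeled elements, not 10.
  criticalText.slice(0, 3).forEach((text) => {
    // Quote literal strings so the model treats them as exact text.
    parts.push(`Screen displays "${text}"`);
  });

  // Camera movement increases text errors, so lock it off when text matters.
  if (criticalText.length > 0) {
    parts.push('Static camera, no movement.');
  }

  // Always append the negative instruction.
  parts.push(NEGATIVE_SUFFIX);
  return parts.join(' ');
}

// Example:
// buildSafePrompt({
//   scene: 'Clean finance dashboard on a laptop screen',
//   criticalText: ['Sales_Q4.xlsx'],
// });
```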
The Drink Me Bottle: Image-to-Video Changes Everything
In Alice's story, she finds a bottle labeled "DRINK ME." It makes her shrink—but that shrinking is what allows her to fit through the tiny door into the beautiful garden.
My "Drink Me" moment was discovering image-to-video generation.
Here's the problem with pure text-to-video: the AI decides what the first frame looks like. Sometimes it's perfect. Sometimes it's wildly off from what you imagined. You're rolling dice with every generation.
But what if you could specify the first frame?
The Image-to-Video Workflow
- Step 1: Generate a precise starting frame with Imagen 3
- Step 2: Pass that image to Veo as input
- Step 3: Veo animates from your exact image
Result: Controlled composition, brand consistency, predictable starting points.
This was transformative. Instead of hoping Veo would generate the dashboard layout I wanted, I could show it the exact dashboard and say "animate this."
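Here's what that handoff looks like as a sketch. `generateFrame` and `animateFromImage` stand in for the real Imagen and Veo calls, which will differ in your setup:

```javascript
// Sketch of the image-to-video handoff. `generateFrame` (Imagen) and
// `animateFromImage` (Veo) are hypothetical wrappers, not real SDK calls.
async function controlledClip() {
  // Step 1: pin down the exact first frame with an image model.
  const frame = await generateFrame(
    'Clean SaaS dashboard, brand-blue header, three KPI cards, no readable text'
  );

  // Steps 2 & 3: hand that frame to the video model and describe only motion.
  const clip = await animateFromImage(frame, {
    prompt: 'Slow push-in on the dashboard as the KPI cards animate upward',
    durationSeconds: 8,
  });

  return clip; // Starts from your composition instead of a dice roll.
}
```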
The "Drink Me" bottle made Alice small enough to enter Wonderland properly. Image-to-video made AI video controllable enough to use professionally.
A prompt is not enough. You need to prime the context with an image.
The Cheshire Cat: Google's Generosity
The Cheshire Cat appears and disappears throughout Alice's journey, offering cryptic guidance with that famous grin. You never know when he'll show up or what he'll say, but his presence is always... helpful, in its own strange way.
Google has been my Cheshire Cat.
When I started this journey, I was terrified of costs. AI video generation isn't cheap. Each Veo clip costs roughly $0.10-0.15. Imagen frames are $0.02-0.04. Extensions add up. A 30-second marketing video might run $0.75. Do that hundreds of times while experimenting and you're looking at real money.
Then Google appeared, grinning, with thousands of dollars in cloud credits.
The Real Costs of AI Video
| Video Type | Clips | Est. Cost |
|---|---|---|
| 15s Social Clip | 3 clips | ~$0.36 |
| 30s Marketing Video | 6 clips | ~$0.72 |
| 30s Continuous | 1 + 3 ext | ~$0.42 |
| 60s Product Demo | 1 + 7 ext | ~$0.92 |
With credits: Hundreds of experiments for free. Without credits: This blog post wouldn't exist.
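If you want to budget before you experiment, the per-unit ranges above are enough for a back-of-the-envelope bracket. A quick sketch, with the assumption (mine, not Google's) that an extension costs roughly the same as a clip:

```javascript
// Rough cost bracket using the ranges quoted above:
// $0.10-0.15 per Veo clip or extension, $0.02-0.04 per Imagen frame.
// Pricing changes; treat this as back-of-the-envelope only.
function estimateCost({ clips = 0, extensions = 0, imagenFrames = 0 }) {
  const units = clips + extensions;
  return {
    low: units * 0.10 + imagenFrames * 0.02,
    high: units * 0.15 + imagenFrames * 0.04,
  };
}

// A 30-second continuous video (1 clip + 3 extensions):
// estimateCost({ clips: 1, extensions: 3 }) -> { low: 0.40, high: 0.60 }
```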
The Cheshire Cat's generosity allowed me to fail spectacularly, learn from every failure, and eventually build something that works. You can't discover the rules of Wonderland without playing the game, and you can't play the game if every move costs real money.
Building the Video Builder: A Web UI Detour
At some point during my journey, I thought: "This needs a graphical interface. Users shouldn't have to write JSON configs and run Node scripts."
So I built a Video Builder MVP. A full web interface with:
- Clip-by-clip storyboard editing
- Live preview of prompts
- Voiceover script timing calculations
- Generation status tracking
- FFmpeg stitching integration
It was a beautiful detour. Like Alice exploring the Duchess's kitchen—interesting, but not where she needed to be.
The web UI worked. But I realized something important: the complexity wasn't in the interface—it was in the orchestration.
Users didn't need buttons to click. They needed an intelligent system that could take a concept and guide them through approach selection, storyboard generation, spelling prevention, source selection, and stitching strategies.
They needed a skill, not a UI.
The Trial: Building the Unified Video Creator Skill
At the end of Alice's journey, there's a trial. The Queen wants to execute everyone. Cards are flying. Nothing makes sense. And Alice, finally, has had enough.
"You're nothing but a pack of cards!" she shouts. And Wonderland dissolves.
My trial was building the unified skill. Taking everything I'd learned—every failed prompt, every spelling disaster, every successful technique—and encoding it into a system that could guide anyone through the process.
The Video Creator Skill: What I Built
- Approach templates (3 video approaches)
- Concept-to-storyboard guide
- Prompt patterns for each source
- Voiceover timing rules
- Spelling prevention techniques
- Stitching strategies
- Veo clip generator
- Veo extension chainer
- Imagen frame generator
- FFmpeg stitcher
- Video orchestrator
21 files. Thousands of lines. Months of learning distilled into a reusable system.
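In spirit, the orchestrator is a short pipeline that threads those pieces together. This sketch uses stand-in function names, one per component above, rather than the skill's actual code:

```javascript
// Hypothetical orchestrator flow; every function name is a stand-in
// for one of the skill's components, not code lifted from the skill.
async function createVideo(concept, outPath) {
  const approach = await suggestApproach(concept);              // approach templates
  const storyboard = await buildStoryboard(concept, approach);  // concept-to-storyboard guide

  const clipPaths = [];
  for (const shot of storyboard.shots) {
    const prompt = buildSafePrompt(shot);                       // spelling prevention
    const frame = shot.needsExactComposition
      ? await generateFrame(shot.framePrompt)                   // Imagen frame generator
      : null;
    clipPaths.push(await generateClip(prompt, { startImage: frame })); // Veo clip generator
  }

  // Stitching strategy: an extension chain for one continuous take,
  // FFmpeg concat (sketched later) for distinct scenes.
  return stitchClips(clipPaths, outPath);                       // FFmpeg stitcher
}
```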
The trial isn't about convicting anyone. It's about Alice realizing she's grown—literally and figuratively—and she doesn't have to play by Wonderland's rules anymore.
Building the skill was my moment of clarity. The chaos of AI video generation could be tamed. Not by fighting it, but by encoding the rules of this strange world into systems that others could follow.
Waking Up: What I Learned
Alice wakes up on the riverbank, her sister brushing dead leaves off her face. Was it a dream? Did it really happen?
I emerged from my AI video rabbit hole with these truths:
1. A Prompt Is Not Enough
You cannot summon professional video from words alone. You need starting images for context. You need voiceover scripts for timing. You need spelling prevention techniques. You need to understand which approach fits your content.
2. Text Is the Enemy (Until You Understand It)
AI video models don't understand spelling. They work with patterns. The moment you accept this and work with the limitation—generic descriptions, text-free visuals, voiceover for information—everything gets easier.
3. Extension Chaining Is Magic (With Caveats)
Turning 8 seconds into 64 seconds of continuous video is transformative. But the chain is fragile. Tokens expire. Regeneration cascades. Use it for continuous narratives, not for everything.
4. Image-to-Video Is the Real Secret
Generate your starting frame with Imagen. Then animate with Veo. This gives you control over composition, branding, and visual consistency that pure text-to-video can't match.
5. The Cost Is Real (But Manageable)
A 30-second marketing video costs less than a dollar. That's nothing compared to traditional video production. But experimentation adds up—Google's credits let me learn without going broke.
6. Systems Beat Interfaces
I built a web UI. It was fine. But what users really need is intelligent orchestration—a system that knows when to use FFmpeg vs. extensions, how to prevent spelling errors automatically, and how to transform concepts into storyboards.
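For the distinct-scenes case, the stitching itself is the easy part: FFmpeg's concat demuxer joins clips without re-encoding. A minimal Node wrapper, assuming all clips share the same codec and resolution:

```javascript
// Losslessly join separately generated clips with FFmpeg's concat demuxer.
const { execFileSync } = require('child_process');
const { writeFileSync } = require('fs');

function stitchClips(clipPaths, outPath) {
  // The concat demuxer reads a small text file listing the inputs.
  writeFileSync('clips.txt', clipPaths.map((p) => `file '${p}'`).join('\n'));

  // -c copy skips re-encoding; it works when every clip shares codec/resolution.
  execFileSync('ffmpeg', [
    '-f', 'concat', '-safe', '0',
    '-i', 'clips.txt',
    '-c', 'copy',
    outPath,
  ]);
}

// stitchClips(['clip1.mp4', 'clip2.mp4', 'clip3.mp4'], 'marketing_30s.mp4');
```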
The Beautiful Garden
At the beginning of Alice's adventure, she glimpses a beautiful garden through a tiny door. The entire journey—the shrinking, the growing, the Mad Hatter, the Queen—is really just about getting into that garden.
My beautiful garden? It's this:
From a single concept, in a single conversation, create a professional AI-generated video.
Describe your idea. Get 2-3 approach suggestions. See a storyboard. Watch it generate. Download your video.
That's the garden I was trying to reach. I'm finally here.
The journey isn't over. HeyGen avatar integration is coming. Audio mixing needs work. There are always more spelling techniques to discover.
But I can see the garden now. I can walk in it.
And unlike Alice, I don't have to wake up.
An Invitation
Lewis Carroll ends Alice's Adventures with her sister imagining Alice grown up, telling children about her strange dream.
I'm writing this for the same reason: so you don't have to fall down the same rabbit hole blind.
If you want to explore AI video generation:
- Start with Veo 3.1—it's the best model right now
- Accept the text limitations early—don't fight them
- Use image-to-video for control—prompts alone aren't enough
- Learn extension chaining—it's the key to longer videos
- Build systems, not just videos—encode your learnings into reusable workflows
The rabbit hole is always there. The white rabbit is always running past, checking his watch, muttering about being late.
Follow him if you dare.
Just don't say I didn't warn you about the tea party.
Ready to Fall Down the Rabbit Hole?
We've built the map. We've documented the tea party etiquette. We've figured out how to get into the beautiful garden.
Let us guide your AI video journey—without the months of trial and error.
Start Your Journey

Written by Nolan in collaboration with Claude AI—my Cheshire Cat through this entire adventure, appearing whenever I needed guidance and grinning at my failures until they became successes. Learn more at upnorthdigital.ai.