Frozen in Time: Why Your AI Can't Learn After Deployment
And Why That's Actually Good
I asked Claude Desktop a deceptively simple question: "Why is LLM training locked at the time of deployment? Why can't additional training be applied after the model has been trained?"
What I expected was a straightforward technical answer about compute costs or API limitations.
What I got was something far more interesting—a deep dive into the fundamental constraints of how neural networks work, why "frozen weights" are actually a feature not a bug, and how the entire architecture of modern AI deployment is built around working with these constraints rather than fighting them.
The Glacier Problem: Catastrophic Forgetting
Here's the thing most people don't realize: neural networks don't learn the way humans do.
When you learn something new, your brain doesn't overwrite old memories. You can learn French without forgetting English. You can master Python without losing your JavaScript knowledge. Your memory is additive.
Neural networks? Not so much.
Catastrophic Forgetting: The Core Problem
When you train a neural network on new data, the weight updates that encode that new information tend to overwrite or degrade the weights encoding prior knowledge.
You can't just "add" to what it knows. You risk corrupting what it already learned.
This is why retraining typically means training from scratch on the full corpus plus new data.
Imagine if every time you learned a new fact, there was a 15% chance you'd forget your mother's name. That's catastrophic forgetting.
It's not a bug. It's a fundamental constraint of how gradient descent works. When you update weights to minimize loss on new data, those updates can increase loss on old data. The network "forgets" in order to learn.
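You can watch this happen in a toy model. Here's a minimal sketch, assuming PyTorch, with two invented classification tasks: train a small network on Task A, then train it only on Task B, and accuracy on Task A collapses toward chance.

```python
# Toy demonstration of catastrophic forgetting (illustrative tasks, not a real model).
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(center, boundary):
    """Toy binary classification: points around `center`, labeled by an x-threshold."""
    x = torch.randn(400, 2) + torch.tensor(center)
    y = (x[:, 0] > boundary).long()
    return x, y

# Task A lives around (-4, -4); Task B lives around (+4, +4) with a different boundary.
xa, ya = make_task([-4.0, -4.0], boundary=-4.0)
xb, yb = make_task([4.0, 4.0], boundary=4.0)

model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def train(x, y, steps=300):
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

def acc(x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

train(xa, ya)
print(f"Task A accuracy after training on A: {acc(xa, ya):.2f}")  # high

train(xb, yb)  # train only on the new task; no Task A data in the mix
print(f"Task A accuracy after training on B: {acc(xa, ya):.2f}")  # drops toward chance
```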
The Economics of Compute: Why You Can't Just Hit "Resume Training"
Even if catastrophic forgetting weren't an issue, there's a more practical problem: the infrastructure doesn't work that way.
Training a frontier model like Claude Sonnet 4.5 or GPT-5 costs tens to hundreds of millions of dollars in compute. It takes months. It requires thousands of GPUs in tight synchronization, specialized networking, distributed training frameworks, and expert ML engineers babysitting the whole operation.
The inference infrastructure optimized for serving predictions? Fundamentally different architecture. You can't just flip a switch and have it "keep learning."
Training infrastructure: thousands of GPUs in tight synchronization, distributed across multiple data centers, running for months. Optimized for massive parallel batch processing and gradient synchronization.
Inference infrastructure: optimized for low-latency single requests, horizontal scaling, and efficient memory usage. Completely different hardware configurations, networking requirements, and operational constraints.
It's like asking why you can't use a Formula 1 race car for hauling lumber. Sure, they're both vehicles with engines. But the engineering trade-offs are fundamentally incompatible.
The Safety Lock: Quality Control and Alignment
But here's where it gets really interesting—and this is the part that Claude Desktop emphasized that I hadn't fully appreciated:
The "frozen" state is partly a feature, not a bug.
Training a production AI model involves way more than just feeding it text:
- Careful data curation (filtering out harmful, biased, or low-quality content)
- RLHF (Reinforcement Learning from Human Feedback) to align behavior
- Red-teaming to discover adversarial vulnerabilities
- Safety evaluations across dozens of risk categories
- Capability assessments to ensure it actually works
If a deployed model could learn continuously from user interactions, it could:
- Pick up biases from conversations
- Learn harmful patterns from adversarial users
- Have its alignment drift in unpredictable ways
- Develop behaviors that weren't tested or approved
- Become a moving target impossible to audit or debug
Remember Tay, Microsoft's Twitter chatbot from 2016? It learned from user interactions. Within 24 hours, it was spewing racist, sexist garbage because internet trolls trained it to.
Frozen weights prevent that. The model you deploy is the model that was tested, aligned, and approved. It's deterministic and reproducible. You know what you're shipping.
The Workarounds: How We Give AI "New Knowledge" Anyway
So if models can't learn after deployment, how do we give them new information? How do we specialize them for specific tasks?
Turns out, the AI community has developed several clever workarounds:
1. RAG (Retrieval-Augmented Generation)
Don't change the model—change what you show the model. Pull relevant documents from a knowledge base and inject them into the context window.
Example: AWS Bedrock Knowledge Bases
You upload your company's internal docs to S3. When a user asks a question, the system retrieves relevant passages and includes them in the prompt. The model doesn't "know" your docs—but it can read them on demand.
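Here's a minimal sketch of the retrieval side of RAG, independent of any vendor. The embed() helper and the toy documents are stand-ins; a real system would use a proper embedding model and a vector store, and would send the assembled prompt to the frozen model.

```python
# Minimal RAG sketch: retrieve relevant text, inject it into the prompt.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: hash words into a fixed-size unit vector."""
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

documents = [
    "Refund policy: customers may return items within 30 days.",
    "Shipping: orders over $50 ship free within the continental US.",
    "Support hours: Monday through Friday, 9am to 5pm Eastern.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = doc_vectors @ embed(query)  # cosine similarity (vectors are unit norm)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def build_rag_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Use only this context to answer:\n{context}\n\nQuestion: {query}"

# In production, this prompt goes to the frozen model; the model never changes.
print(build_rag_prompt("Can I get a refund after two weeks?"))
```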
2. Fine-Tuning (Transfer Learning)
Start with a pre-trained model, then do additional training on a specialized dataset. This updates the weights, but in a controlled way with the full dataset available.
This is how you get models like "Code Llama" (Llama fine-tuned on code) or domain-specific models for medical, legal, or financial applications.
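As a rough sketch of what that looks like in code, assuming PyTorch and a HuggingFace-style model whose forward pass returns a loss (the function name and arguments here are illustrative):

```python
# Sketch of fine-tuning: continue training pretrained weights on a domain dataset.
import torch

def fine_tune(pretrained_model, domain_dataloader, epochs=3, lr=1e-5):
    # Low learning rate: nudge the pretrained weights, don't overwrite them.
    optimizer = torch.optim.AdamW(pretrained_model.parameters(), lr=lr)
    pretrained_model.train()
    for _ in range(epochs):
        for batch in domain_dataloader:          # each batch: dict of tensors (input_ids, labels, ...)
            optimizer.zero_grad()
            loss = pretrained_model(**batch).loss  # assumes the model's forward returns a loss
            loss.backward()
            optimizer.step()
    return pretrained_model
```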
3. LoRA (Low-Rank Adaptation)
This is the really clever one. Instead of updating all the weights, you freeze the pretrained weights and inject small "adapter" matrices into specific layers.
How LoRA Works
In a standard neural network layer, you have a weight matrix W that transforms inputs. LoRA adds a parallel path: instead of updating W directly, you learn two much smaller matrices A and B whose product acts as a low-rank update to W.
If W is 4096×4096 (16 million parameters), you might use A as 4096×8 and B as 8×4096 (only ~65,000 parameters).
You're training maybe 0.1-1% of the parameters. The adapter file might be just 10-100MB versus the full model's hundreds of gigabytes.
The magic of LoRA is that adapters are swappable. The base model stays frozen. You can hot-swap different LoRA adapters for different tasks—one for code, one for medical, one for legal—without multiple full model copies.
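Here's a minimal sketch of that adapter pattern, assuming PyTorch. The dimensions mirror the 4096×4096 example above; real implementations add details like dropout and per-module targeting that are skipped here.

```python
# Minimal LoRA-style adapter: frozen base weights plus a small trainable low-rank path.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                              # pretrained weights stay frozen
        in_f, out_f = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(in_f, rank) * 0.01)    # 4096 x 8 in the example above
        self.B = nn.Parameter(torch.zeros(rank, out_f))          # 8 x 4096; zero init makes the
        self.scale = alpha / rank                                # adapter a no-op at the start

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A @ self.B)

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,} ({100 * trainable / total:.2f}%)")
# -> roughly 65k trainable parameters against ~16.8M frozen ones, matching the numbers above
```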
4. In-Context Learning
The simplest workaround: use the massive context window. Show the model examples of what you want, and it will pattern-match without any weight updates.
Claude Sonnet has a 200K token context window. That's enough to include entire codebases, documentation sets, or conversation histories. The model "learns" your patterns for the duration of that session.
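In code, in-context learning is often nothing more than prompt assembly. A tiny sketch with made-up support-ticket examples:

```python
# Few-shot in-context "learning": the examples live in the prompt, not in the weights.
few_shot_examples = [
    ("Refund the order #1432", "CATEGORY: billing"),
    ("The app crashes when I upload a photo", "CATEGORY: bug"),
    ("How do I export my data?", "CATEGORY: how-to"),
]

def build_prompt(new_ticket: str) -> str:
    lines = ["Classify each support ticket into a category.\n"]
    for ticket, label in few_shot_examples:
        lines.append(f"Ticket: {ticket}\n{label}\n")
    lines.append(f"Ticket: {new_ticket}\nCATEGORY:")
    return "\n".join(lines)

# Sending this prompt to the frozen model yields a label for the new ticket;
# nothing about the model itself has changed.
print(build_prompt("My invoice shows the wrong amount"))
```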
The Multimodal Trade-Off: Does Vision "Steal" Capacity from Coding?
During my conversation with Claude Desktop, I mentioned struggling with Gemini 3 Pro for coding. My intuition was: "Gemini is multimodal with huge context windows. Maybe too much neural network capacity is being 'used' for vision, leaving less for reasoning about code?"
Claude corrected my mental model:
"Modern multimodal models don't work the way you're imagining—where vision and language compete for the same neurons. The common architecture is modality-specific encoders (vision transformer for images/video, audio encoder, etc.) that project into a shared embedding space, feeding into a core transformer that does the reasoning."
"So the 'video understanding' parameters are largely separate from the 'reasoning over text' parameters."
But then Claude added the nuance:
Every hour spent training on video-text pairs is an hour not spent on code reasoning. The training data mix and compute budget reflect strategic priorities. Google is clearly prioritizing multimodal breadth.
That massive 1-2M token context window requires architectural tradeoffs. Efficient attention mechanisms that scale to huge contexts sometimes sacrifice depth of reasoning per token.
Anthropic optimized Claude to be a thinking partner—extended reasoning, nuance, collaboration. Google seems more focused on multimodal breadth and agentic tool use. These are different optimization targets that show up in RLHF tuning and capability prioritization.
So my intuition wasn't entirely wrong—but the bottleneck isn't architectural. It's where the training effort was invested.
The Gemini 3 Reality Check: When Benchmarks Meet Real Work
Here's what happened: I thought taking my complex Veo/Imagen orchestration skill to Gemini 3 Pro would enhance it. After all, Google's own model should understand Google's own APIs better, right?
Two painful sessions later, I ditched it.
What Went Wrong:
- Confusion between the gcloud CLI and REST APIs
- Unclear when to use the Vertex AI SDK vs direct API calls
- Wrong assumptions about parameter availability across access methods
- Confident wrong answers instead of admitting uncertainty
Claude nailed the diagnosis:
"Internal knowledge at Google about how to actually orchestrate Veo, Imagen, Vertex AI, and the various CLI/API interfaces together probably lives in internal docs, Googler brains, and tribal knowledge—not in public training data. The best practices for combining these services in sophisticated workflows might literally not exist in text form anywhere a model could learn from."
When I built that skill with Claude, we reasoned through constraints together. Claude wasn't retrieving "here's how to orchestrate Veo + Imagen"—it was reasoning through API docs, inferring limitations from error patterns, and iterating with me on what actually worked.
That's constraint-satisfaction reasoning, not pattern matching.
And that's the difference between a model optimized for "autonomous task completion" (Gemini's agentic focus) versus "collaborative thinking" (Claude's design philosophy).
The Research Frontier: Can We Solve Continuous Learning?
All of this raises the question: is continuous learning after deployment fundamentally impossible, or just really hard?
The research community is actively working on it:
Elastic Weight Consolidation (EWC)
Identifies which weights are "important" for old tasks and protects them from large updates when learning new tasks. Helps but doesn't eliminate forgetting.
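The heart of EWC is a single regularization term added to the new task's loss. A sketch assuming PyTorch (estimating the Fisher importance values is a separate step, omitted here):

```python
# EWC penalty: a quadratic pull toward the old task's weights, weighted by importance.
import torch

def ewc_penalty(model, old_params, fisher, lam=1000.0):
    """old_params and fisher map parameter names to tensors saved after the old task."""
    penalty = torch.zeros(())
    for name, param in model.named_parameters():
        penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# During training on the new task:
#   loss = task_loss(model, new_batch) + ewc_penalty(model, old_params, fisher)
# Weights with high Fisher values are held near their old values; unimportant ones
# are free to move, which limits (but does not eliminate) forgetting.
```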
Memory-Augmented Architectures
External memory modules that the network can read/write to, separating "working memory" from "long-term knowledge" (the weights). Think of it as giving the AI a notebook.
Progressive Neural Networks
When learning a new task, freeze old network columns and add new columns. The network grows over time but old knowledge is literally frozen and untouched.
Retrieval-Enhanced Models
Hybrid approaches where the model can query external datastores during inference. The model doesn't change, but its accessible knowledge does.
But solving this elegantly—truly continuous learning without catastrophic forgetting, alignment drift, or computational explosion—remains an open problem.
Why Frozen Might Be Better Anyway
Here's the thing that crystallized for me during this exploration:
For production systems, determinism is valuable.
You want consistent, reproducible behavior. You want to know that the model that passed safety evals last month is the same model running today. You want bugs to be debuggable, not shifting phantoms.
A continuously learning model is a moving target. Hard to debug. Hard to audit. Hard to trust.
The workarounds we've developed—RAG, fine-tuning, LoRA, in-context learning—give us the best of both worlds:
- ✅ Frozen core model with known, tested, aligned behavior
- ✅ Adaptable knowledge through retrieval and adapters
- ✅ Controllable specialization through fine-tuning
- ✅ Reversible changes (swap adapters, update knowledge bases)
It's not a limitation we're stuck with. It's an architectural decision that prioritizes safety, reliability, and controllability over unlimited flexibility.
What This Means for You
If you're building with AI, understanding these constraints changes how you architect systems:
Don't expect models to "learn from experience"
The model you use today will behave the same way tomorrow. If you need it to adapt, build that adaptation into your system architecture (knowledge bases, context injection, fine-tuned adapters).
Use RAG for dynamic knowledge
Your company knowledge changes constantly. Don't wait for model retraining—pull fresh docs into context on every request.
Fine-tune for style, not facts
Fine-tuning is great for teaching models how to respond (tone, format, behavior). For what to know, use retrieval.
Embrace frozen weights as reliability
Your production AI won't surprise you with behavior changes. That's a feature. Build your testing and monitoring strategies around that stability.
The Bottom Line
LLMs are frozen at deployment not because we forgot to add a "keep learning" button. They're frozen because:
- Catastrophic forgetting is a fundamental constraint of how neural networks work
- Training and inference infrastructures are incompatible by design
- Continuous learning would break alignment and safety guarantees
- Determinism is valuable for production systems
- Better workarounds exist (RAG, fine-tuning, LoRA, in-context learning)
The next time someone asks why your AI assistant can't "just learn" from corrections, you can explain: it's not a bug. It's the entire architecture of how modern AI works—and it's built that way for very good reasons.
The glacier doesn't flow backward. But that's why we know where we're standing.
Want to Build AI Systems That Actually Work?
Understanding the constraints is the first step. Working with them instead of against them is what separates successful AI implementations from expensive experiments.
We help businesses architect AI systems that embrace these realities—RAG for dynamic knowledge, fine-tuning for specialization, deterministic behavior for production reliability.
Let's Talk Architecture
This post was written collaboratively with Claude (Sonnet 4.5), exploring the very constraints that make our collaboration possible. The irony is not lost on us. Learn more at upnorthdigital.ai.