Introduction
If you have ever tried to make a large language model reason over an entire codebase, legal case, or weeks of logs, you know the feeling. Context windows grow, prompts get longer, costs explode, and yet the model still forgets something important halfway through. Transformers are brilliant short term thinkers, but they make very expensive long term storage.
That is the pressure that brought Google Titans into existence. At first glance Google Titans looks like another member of the growing zoo of Transformer alternatives. Look a little closer and it starts to feel more like a rethink of LLM memory. The architecture pairs a compact attention core for short range reasoning with a neural long-term memory that learns while the model is running. Underneath sits a theoretical framework called Miras, which treats many modern sequence models as different ways to build and train an associative memory.
In this article we will walk through what Google Titans is, how test time training works in practice, why the Miras view matters, and where this family of models sits relative to Transformers, Mamba style linear recurrent neural networks, and the next wave of long context systems.
1. What Is Google Titans? The Post Transformer Era Begins

The core idea behind Google Titans is almost embarrassingly simple. Attention is great for short term reasoning. It can look across a few thousand tokens, pick out the relevant bits, and combine them in flexible ways. It is terrible as a long term storage system because it scales linearly in memory and quadratically in compute with sequence length.
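To make that cost concrete, here is a tiny back-of-the-envelope sketch in Python. The token counts are arbitrary examples; the only point is that the number of pairwise attention scores grows with the square of the context length.

```python
# Purely illustrative arithmetic: the number of pairwise attention scores grows
# with the square of the context length, so doubling the context quadruples it.
for n in (8_192, 65_536, 1_048_576):
    print(f"{n:>9} tokens -> {n * n:.3e} attention scores per head per layer")
```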
So Google Titans splits the job. The model keeps a standard attention based core with a limited context window for local computation. Beside it sits a neural long term memory. Instead of stuffing every previous token into a growing key value cache, the model trains a separate deep network that learns to store the history in its own weights as the sequence streams by. That memory network sees key value pairs produced from the main model, measures how badly it predicts each new value, and updates itself to reduce that error.
The result is a two level system. The core attention behaves like short term working memory. The neural long term memory behaves more like a notebook that keeps a compact summary of what actually mattered in the past. Google Titans uses the notebook when the current input needs information that is far away in the raw token sequence.
Importantly, this long term memory is not just a single vector or matrix. It is a deep multilayer perceptron, which gives it much more expressive power than the linear memory used in many earlier linear recurrent neural networks. That extra depth matters because the model is not just remembering literal strings. It is trying to store higher level patterns and abstractions across very long sequences.
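As a rough illustration of what "memory as a deep network" means, here is a minimal PyTorch-style sketch of such a module. The depth, width, and activation are illustrative choices, not the values from the Titans paper; the important part is that the memory maps keys to predicted values and stores everything it knows in its own weights.

```python
# A minimal sketch of a deep memory module. Depth, width, and activation are
# illustrative choices, not the values used in the Titans paper.
import torch
import torch.nn as nn

class NeuralLongTermMemory(nn.Module):
    """Maps a key vector to a predicted value vector; the weights are the memory."""

    def __init__(self, dim: int, hidden: int = 256, depth: int = 2):
        super().__init__()
        layers, in_dim = [], dim
        for _ in range(depth):
            layers += [nn.Linear(in_dim, hidden), nn.SiLU()]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, dim))
        self.net = nn.Sequential(*layers)

    def forward(self, key: torch.Tensor) -> torch.Tensor:
        return self.net(key)
```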
From a product point of view, you can think of Google Titans as an architecture that says: let attention handle the next few pages, and let a dedicated neural module become the system of record for everything beyond that.
2. The Miras Framework: It Is All Connected

Miras is the piece that turns the Google Titans architecture into more than a one off idea. In the companion paper to Google Titans, Miras is introduced as a general recipe for sequence models. Every model is treated as an associative memory that maps keys to values, optimized with an internal objective called attentional bias and stabilized with a retention gate.
In this view you design a model by answering four questions:
- Memory Architecture: Is the memory a vector, a matrix, or a deep neural long-term memory, as in Google Titans?
- Attentional Bias: Which loss does the memory minimize internally? For most existing models the answer is dot product similarity or plain L2 regression.
- Retention Gate: How strongly do you regularize updates so that new information does not completely overwrite the past? This is the forget gate mechanism in more familiar language.
- Memory Learning Algorithm: How do you update the memory? A simple choice is online gradient descent, but Miras allows richer online optimization rules.
Once you rewrite standard architectures in this format, a pattern appears. Transformers, RetNet, Mamba, Google Titans, and many linear recurrent neural networks differ mainly in these four design choices, not in some mysterious magic block.
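One informal way to internalize the recipe is to write the four questions down as a configuration object. The sketch below is a toy reading of Miras, not an official API; the field names and example values are shorthand for the choices described above.

```python
# A toy reading of the Miras recipe as configuration, not an official API.
from dataclasses import dataclass

@dataclass
class MirasSpec:
    memory_architecture: str   # "vector", "matrix", or "deep_mlp"
    attentional_bias: str      # the internal loss, e.g. "dot_product" or "l2_regression"
    retention_gate: str        # how updates are regularized, e.g. "weight_decay"
    learning_algorithm: str    # e.g. "online_gradient_descent" or "gd_with_momentum"

# Roughly how Titans and a plain linear RNN differ under this lens.
titans_like = MirasSpec("deep_mlp", "l2_regression", "weight_decay", "gd_with_momentum")
linear_rnn_like = MirasSpec("matrix", "dot_product", "constant_decay", "linear_update")
```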
2.1 Beyond MSE And Dot Products
The Miras authors then push beyond the usual losses by building variants like Moneta, Yaad, and Memora, which plug in robust penalties and probability style constraints as new forms of attentional bias and retention. The exact details matter less than the message. Attentional bias is a knob you can tune. Google Titans sits at one practical point in that larger space, not at its edge.
3. How Learning At Test Time Actually Works

“Test time training” sounds scary at first. Does Google Titans really change its own weights while serving live traffic? The answer is yes, but in a very constrained way. Only the long term memory learns online, and it does so using a clear optimization story.
For each new token the model builds a key and a value vector. The key is fed into the neural long term memory, which tries to predict the value. The internal loss is simply the squared error between prediction and reality. The gradient of that loss is the surprise signal. Large gradient means “this association does not match what I believe yet.” Tiny gradient means “this fits my current model of the world.”
Google Titans uses this surprise to steer learning at test time. When surprise is high, the memory takes a gradient step toward encoding the new association. When surprise is low, the update is almost zero. Over a stream of tokens, the memory becomes a compact archive of the unusual and structurally important events instead of a tape recorder.
The update rule adds one more ingredient, momentum. Instead of throwing away the past gradient after each step, Google Titans keeps a running update vector that mixes the last surprise with the recent history of surprises. A big change in topic creates a strong update that continues to shape learning for the next few tokens, which is exactly what you want when a new section in a document introduces an entire cluster of related facts.
A retention gate then applies a weight decay style term that pulls the memory back toward a stable baseline. In Miras language this is part of the retention objective. It caps the effective capacity, flushes out old noise, and keeps the whole online optimization loop numerically stable.
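Putting those three ingredients together, the inner loop can be sketched roughly as follows, reusing the memory module idea from earlier. The fixed scalars for learning rate, momentum, and decay are simplifications; in the actual Titans rule these quantities are data dependent, so treat this as a minimal sketch of the mechanism rather than the paper's exact update.

```python
# Minimal sketch of the surprise-driven test-time update, assuming a memory
# module like NeuralLongTermMemory above. Fixed lr/beta/decay are simplifications.
import torch
import torch.nn.functional as F

def memory_update_step(memory, key, value, momentum_buf,
                       lr=0.1, beta=0.9, decay=0.01):
    # 1. Surprise: how badly does the current memory predict this value?
    loss = F.mse_loss(memory(key), value)
    grads = torch.autograd.grad(loss, list(memory.parameters()))
    with torch.no_grad():
        for p, g, m in zip(memory.parameters(), grads, momentum_buf):
            # 2. Momentum: blend this surprise with the recent history of surprises.
            m.mul_(beta).add_(g, alpha=-lr)
            # 3. Retention gate: weight decay pulls the memory toward a stable baseline.
            p.mul_(1.0 - decay).add_(m)
    return loss.item()

# One-time setup before streaming tokens through the loop:
# momentum_buf = [torch.zeros_like(p) for p in memory.parameters()]
```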
4. Titans Vs Mamba Vs Transformers: The Architecture Battle
At this point it helps to put Google Titans next to its closest rivals. The table below sketches the tradeoffs in simplified form.
Google Titans Memory Architectures Overview
| Model Family | Memory Type | Long Context Behavior | Strengths | Weak Spots |
|---|---|---|---|---|
| Transformers | Explicit key value cache | Grows with context window, costs explode on very long sequences | Great local reasoning, strong in context learning | Context reset pain, high O(N^2) attention cost |
| Linear Recurrent Neural Networks | Compressed vector or matrix state | Constant size, learns a lossy summary of the past | Linear time, low memory footprint | Fixed capacity, information loss over very long histories |
| Google Titans | Deep neural long term memory plus local attention | Fixed memory size with learned abstraction of history | Combines precise local attention with scalable long range recall | More complex to implement and reason about, new failure modes to understand |
This is not a winner takes all table. Transformers still shine at many workloads. Linear recurrent neural networks continue to win in scenarios where sheer throughput is king and the sequences are very long but somewhat repetitive.
Google Titans lands in the middle. It keeps the sharp local reasoning of attention while pushing most long range storage into its neural long term memory. For tasks where the story stretches across millions of tokens and you care about non trivial global structure, that hybrid design is a powerful place to stand.
5. The Three Titans Variants: MAC, MAG, And MAL
The Titans architecture ships in three variants that share the same neural long term memory and differ only in where that memory connects to the core. Think of them as three ways to plug the same module into the stack.
5.1 Memory As Context (MAC)
In memory as context, the long term memory produces a summary vector that is added to the local context before attention runs. Attention chooses when to look at nearby tokens and when to look at that global summary. MAC is a natural fit for extreme long context QA and retrieval augmented generation, where each position benefits from a “what matters globally right now” hint.
5.2 Memory As Gate (MAG)
In memory as gate, the long term memory controls how a sliding window attention layer behaves. Its output gates the layer so that the effective receptive field widens or narrows depending on the global state. This pattern suits workloads that mix local patterns with occasional long jumps, such as long code traces or log analysis.
5.3 Memory As Layer (MAL)
In memory as layer, the long term memory acts as a separate layer interleaved with attention layers. The network alternates between local computation and explicit long range recall. This variant feels closest to standard deep stacks and is a reasonable default.
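To make the differences concrete, here is a toy sketch of where the memory output enters the stack in each variant. The `attend` placeholder stands in for the model's local attention, and the combination rules are illustrative wiring rather than the precise Titans layer definitions.

```python
# Toy wiring sketch of the three variants; not the exact Titans layer definitions.
import torch

def attend(x):
    # Placeholder for the model's local (windowed) attention; identity here
    # just keeps the sketch self-contained.
    return x

def mac_block(x, mem_summary):
    # MAC: the memory summary is merged into the local context before attention.
    return attend(torch.cat([mem_summary, x], dim=1))

def mag_block(x, mem_out):
    # MAG: the memory output gates the sliding window attention branch.
    gate = torch.sigmoid(mem_out)
    return gate * attend(x) + (1.0 - gate) * x

def mal_block(x, memory_layer):
    # MAL: the memory is its own layer, interleaved with attention layers.
    return attend(memory_layer(x))
```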
You can summarize the tradeoffs in a single table.
Google Titans Memory Variants Overview
| Variant | How Memory Feeds The Core | Best For | Intuition |
|---|---|---|---|
| MAC | Memory summary merged into attention context | Extreme long context QA and RAG | “Always give attention a global hint.” |
| MAG | Memory output gates a sliding window attention layer | Mixed local and global structure | “Let memory decide how wide the lens should be.” |
| MAL | Memory acts as its own layer in the stack | General long context modeling | “Alternate between thinking locally and recalling globally.” |
All three are members of the Titans family. You choose among them based on hardware budget, latency needs, and how structured your long range dependencies are.
6. Performance On Needle In A Haystack Style Benchmarks

Architectural elegance is nice. What finally matters is whether models remember the right things in practice. On this front the reported results for Google Titans are eye opening.
On standard language modeling datasets like C4 and WikiText, Google Titans style architectures match or beat strong Transformer and linear recurrent baselines of similar size, while keeping training parallelizable and inference cost linear in the sequence length. The neural long term memory acts as a drop-in upgrade to naive compressed states.
On more stressful setups such as the BABILong tasks and the classic needle in a haystack benchmark, Google Titans shows what the architecture is really about. The model can track facts scattered across documents that span millions of tokens and still answer questions that depend on those distant details. In these experiments Google Titans outperforms strong baselines, including very large Transformer models like GPT-4 and modern Mamba style architectures that depend purely on compressed states.
That combination, long context recall plus strong local reasoning, is the real value proposition. It turns an LLM from a clever short term reasoner with a hazy memory into something that starts to look like a system with persistent understanding over time.
7. Retention Gates Forgetting And Safety
When you first hear that a model is learning at test time, the natural fear is that it will drift into strange behavior after a long session. Miras addresses that with the retention gate, which plays the role of a disciplined forget gate mechanism.
Each update to the neural long term memory has two parts. One step moves the weights toward encoding the new association. Another step gently pulls them back toward their previous state. If similar patterns keep showing up, the reinforcing updates win and the memory keeps them. If a pattern appears once and never returns, the retention term gradually shrinks its influence.
That tradeoff is not a hack. In very long sequences, keeping every detail is both impossible and unhelpful. Early recurrent models suffered from state explosion because they had no principled way to drop stale information. In Titans, forgetting is a controlled feature. It frees capacity for genuinely useful structure and reduces the odds that rare outliers dominate the internal state.
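A toy calculation shows why this works out. Assuming a fixed decay rate and a fixed reinforcement signal, an association that keeps recurring settles at a stable strength, while one that appeared once simply fades; the numbers below are illustrative only.

```python
# Illustrative only: repeated associations survive weight decay, one-offs fade.
decay, reinforce = 0.02, 0.05
repeated, one_off = 0.0, 0.05       # one_off was written once and never again
for step in range(200):
    repeated = repeated * (1 - decay) + reinforce   # reinforced every step
    one_off = one_off * (1 - decay)                 # never reinforced
print(round(repeated, 3), round(one_off, 5))        # repeated settles near reinforce/decay
```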
For teams building products, this retention behavior is part of the safety story. A model that can clear old or noisy associations from its LLM memory is less likely to hallucinate based on forgotten branches of a prompt and is easier to test, because its long term behavior does not depend on every quirk in its history.
8. Is Titans The End Of The Transformer?
So is this the moment we declare the Transformer dead? Probably not. Transformers solved short term reasoning so well that they will stay with us. What Titans and the Miras framework really offer is a way to stop fighting the context length problem with brute force and to start designing LLM memory as a first class object.
If you build products on top of large models, the useful question is simple. Where do you need stable structure across hundreds of thousands or millions of tokens? Where are you patching around the limits of attention with manual sharding, external vector databases, or brittle retrieval hacks?
Those are the places where Titans will matter first. Anywhere you want neural long term memory, disciplined test time training, and a principled retention gate, this architecture gives you a roadmap instead of a bag of tricks.
The call to action is straightforward. Read the Google Titans paper and the Miras framework, map their ideas onto your own workloads, and sketch where a learned long range memory would change your product. The sooner we treat memory as a design choice instead of an afterthought, the sooner models inspired by Google Titans will feel less like research curiosities and more like a default backbone for long context AI systems.
What is Google Titans and how is it different from Transformers?
Google Titans is a sequence model from Google that combines a standard attention core with a deep neural long-term memory that learns during inference. Unlike Transformers, which keep past tokens in a growing key value cache and pay a quadratic cost for attention, Google Titans compresses history into a separate learned memory module. This lets it handle much longer contexts with linear-time inference while preserving strong local reasoning.
What is the Miras framework in the context of Google Titans?
The Miras framework is a theoretical blueprint that treats Google Titans, Transformers, RetNet, Mamba and other linear recurrent neural networks as variations of one associative memory system. It describes each model using four choices: memory architecture, attentional bias objective, retention gate and memory learning algorithm. Google Titans is a concrete Miras instance where the memory is a deep neural network trained online with a surprise-driven objective and explicit forgetting.
What is test-time training and how does Google Titans use it?
Test-time training means updating some model parameters while the model is running, not only during offline training. In Google Titans, only the long-term memory module is updated at test time, using gradients as a “surprise” signal that decides which token associations to store. The core attention and persistent weights stay fixed, so the model gains adaptive LLM memory without destabilizing the main network.
Is Google Titans better than Mamba or RetNet?
On long-context benchmarks, Google Titans often outperforms Mamba, RetNet and strong Transformer baselines of similar or larger size, especially on tasks like BABILong and needle in a haystack where facts are spread across millions of tokens. Linear recurrent neural networks such as Mamba remain very efficient for pure throughput, while Google Titans trades some simplicity for deeper neural long-term memory and stronger recall across very large contexts.
Can Google Titans actually handle infinite context windows?
Google Titans cannot offer truly infinite context, but it decouples memory cost from raw sequence length, which makes extremely long contexts practical. Its neural long-term memory learns compressed representations of the past, so the model can reason over contexts beyond two million tokens on benchmarks like BABILong and needle in a haystack tests. The real limits come from memory capacity, compute budget and the retention rule that deliberately forgets low-value information.
