Titans: Learning to Memorize at Test Time
Table of Contents
- Why Traditional Memory Systems Fall Short
- The Titans Architecture Explained
- How Neural Memory Learns to Memorize at Test Time
- The Surprise Metric: Teaching Machines What to Remember
- Three Titans Variants for Memory Integration
- Titans vs. Transformers: Performance Benchmarks
- Scaling Beyond 2 Million Tokens
- Real-World Applications of Titans Neural Memory
- The Future of Test-Time Learning Architectures
- Frequently Asked Questions
📌 Key Takeaways
- Neural long-term memory: Titans introduces a deep memory module that learns to memorize and forget data during inference, not just during training.
- Surprise-driven learning: The architecture uses gradient-based surprise metrics to determine which tokens are most worth remembering, inspired by human memory research.
- Three flexible variants: MAC, MAG, and MAL offer different strategies for incorporating long-term memory into attention-based architectures.
- Beyond 2M context: Titans scales to over 2 million token context windows while maintaining high accuracy on needle-in-haystack retrieval tasks.
- Outperforms Transformers: Across language modeling, commonsense reasoning, genomics, and time series benchmarks, Titans consistently beats both Transformers and modern linear recurrent models.
Why Traditional Memory Systems in Neural Networks Fall Short
Transformers have dominated the artificial intelligence landscape for nearly a decade, powering everything from large language models to computer vision systems. Their attention mechanism functions as an associative memory block, storing key-value pairs and retrieving them through pairwise similarity computations. This approach delivers remarkable accuracy in modeling token dependencies, but it comes at a steep cost: quadratic time and memory complexity relative to the context length.
When processing millions of tokens in real-world applications—long documents, genomic sequences, continuous time series data—the standard Transformer architecture simply breaks down. The attention matrix grows so large that even state-of-the-art hardware cannot process it efficiently. This fundamental limitation has driven researchers toward alternative approaches, including linear Transformers that replace softmax attention with kernel functions, reducing complexity to linear time.
However, linear Transformers introduce trade-offs of their own. They compress historical data into fixed-size matrix-valued states, which works well for short sequences but fails catastrophically when the context grows long. The very scenarios where linear complexity would be most advantageous—extremely long sequences—are precisely the scenarios where fixed-size compression loses the most information. Building reliable systems for such workloads requires architectures that can maintain faithful representations of extended context without sacrificing computational tractability.
Modern recurrent models like Mamba, Griffin, and xLSTM have attempted to bridge this gap by adding adaptive forgetting mechanisms and improved write operations. Yet these architectures still rely on linear memory structures—single vectors or matrices—that fundamentally limit how much historical information they can retain. The Titans paper, authored by Ali Behrouz, Peilin Zhong, and Vahab Mirrokni at Google Research, argues that this limitation stems from a flawed analogy with human memory. The human brain does not rely on a single memory system. Instead, it operates through a confederation of interconnected yet independent modules: short-term memory, working memory, and long-term memory. Each serves a different function with different neural structures. Any architecture that tries to handle all memory functions with a single mechanism will inevitably compromise on some dimension of performance.
The Titans Architecture: A New Paradigm for Neural Memory
Titans addresses this fundamental gap by introducing a family of architectures built around three distinct yet interconnected memory components, each inspired by a different aspect of human cognition. The Core module handles short-term memory through standard attention with a limited window size. This is the primary data processing pathway, responsible for capturing accurate local dependencies between tokens within a manageable context window. By constraining the attention window, the Core maintains the precision of full softmax attention without incurring quadratic costs over the entire sequence.
The Long-term Memory module is the true innovation of Titans. Rather than compressing all historical data into a fixed-size state, this module is a deep neural network—specifically a multi-layer perceptron—that learns to memorize data into its own parameters at test time. This is a fundamentally different approach from traditional recurrent hidden states. Instead of a passive storage buffer, the long-term memory is an active learner that continually adapts its internal weights based on the data it encounters during inference.
The third component, Persistent Memory, consists of learnable but data-independent parameters that encode general knowledge about the task. Unlike the long-term memory, which adapts to each specific input sequence, persistent memory captures patterns that remain consistent across all inputs—structural regularities, syntactic patterns, and domain-specific priors that the model has learned during training.
This three-component design mirrors the human cognitive architecture more faithfully than any previous neural network model. Short-term attention handles immediate context with high fidelity. Long-term neural memory stores compressed abstractions of extended history. Persistent memory provides the stable background knowledge that anchors both. Together, these modules create an architecture that can process sequences of virtually unlimited length while maintaining both local precision and global awareness.
How Titans Neural Memory Learns to Memorize at Test Time
The concept of test-time learning is what makes Titans truly revolutionary. Traditional neural networks learn during training and then freeze their parameters for inference. Titans breaks this paradigm by treating the long-term memory module as a meta-model that continues learning during inference. The memory module’s parameters are updated based on each new token it processes, effectively training itself on the fly to memorize the specific sequence it is currently handling.
This approach draws directly from meta-learning research, where models are trained not to solve specific tasks but to learn how to learn. In Titans, the outer training loop optimizes the overall architecture parameters—attention weights, projection matrices, and persistent memory—while an inner loop continuously optimizes the long-term memory module’s weights during inference. The outer loop teaches the memory module how to memorize effectively. The inner loop applies that learned memorization strategy to specific input sequences.
The memory module operates as an associative memory system. For each input token, two linear projections create a key and a value. The memory module learns to store the associations between keys and values by optimizing an associative memory loss function: the squared error between the memory’s predicted value for a given key and the actual value. When the memory needs to retrieve information, a query projection is applied to the current input, and the memory module performs a forward pass without weight updates to produce the relevant stored information.
What makes this system particularly powerful is that the memory module uses deep neural networks—MLPs with multiple layers—rather than the single-layer linear maps used by previous approaches. The authors demonstrate theoretically and empirically that deep memory modules are strictly more expressive than their linear counterparts. A linear memory module can only capture linear dependencies in historical data, equivalent to online linear regression. A deep memory module, by contrast, can learn nonlinear abstractions, enabling it to compress and retrieve far more complex patterns from extended sequences. This depth is what allows Titans to handle context windows that would overwhelm any linear memory system.
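The write-and-read cycle described above can be sketched in a few lines. This is a toy version, not the authors' implementation: the two-layer tanh MLP, the sizes, the learning rate, and the class name are all illustrative, and the key/value projections are assumed to have been applied already.

```python
import numpy as np

class NeuralMemory:
    """Toy deep associative memory written to by gradient descent at test time.

    A minimal sketch of the Titans idea: store key-value associations by
    taking gradient steps on the loss ||M(k) - v||^2, retrieve with a plain
    forward pass. All architectural choices here are illustrative.
    """

    def __init__(self, dim, hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (hidden, dim))
        self.W2 = rng.normal(0, 0.1, (dim, hidden))

    def read(self, q):
        # Retrieval: a forward pass with no weight update.
        return self.W2 @ np.tanh(self.W1 @ q)

    def write(self, k, v, lr=0.1):
        # One gradient step on the associative loss ||M(k) - v||^2.
        h = np.tanh(self.W1 @ k)
        e = self.W2 @ h - v                # prediction error (momentary surprise)
        dW2 = 2 * np.outer(e, h)           # dL/dW2
        dW1 = 2 * np.outer((self.W2.T @ e) * (1 - h**2), k)  # dL/dW1 via backprop
        self.W2 -= lr * dW2
        self.W1 -= lr * dW1
        return float(e @ e)                # loss before this step
```

Repeated writes of the same association drive the loss toward zero, after which `read(k)` reproduces `v`; a linear memory could only do this for patterns expressible as online linear regression.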
The Surprise Metric: Teaching Machines What Matters to Remember
Not all tokens deserve equal attention from the memory system. Titans implements an elegant solution inspired by human neuropsychology: events that violate expectations are more memorable. The architecture formalizes this intuition through a gradient-based surprise metric that determines how much each new token should influence the memory’s parameters.
The surprise of an input token is measured by the gradient of the memory’s loss function with respect to that token. When the memory already stores a good approximation of the token’s key-value association, the gradient is small—the token is unsurprising and requires minimal memory adjustment. When the token diverges significantly from what the memory has learned to expect, the gradient is large—the token is surprising and triggers a substantial memory update.
However, a purely momentary surprise metric has a critical flaw. After a highly surprising event, the gradients can become extremely small for subsequent tokens, even if those tokens carry important information related to the surprising event. In human experience, a dramatic event doesn’t just make the moment itself memorable—it heightens our attention for the entire surrounding period. Titans captures this phenomenon by decomposing surprise into two components: momentary surprise, which measures the current token’s deviation from expectations, and past surprise, which maintains a running momentum of recent surprise signals.
The mathematical formulation is remarkably elegant. The memory update at each step combines the current gradient (momentary surprise) with a decayed version of the previous surprise signal (past surprise). The decay rate is data-dependent, allowing the model to control whether past surprise should persist—because the current token is contextually related to the surprising event—or decay rapidly—because the context has shifted entirely. This mechanism is mathematically equivalent to gradient descent with momentum, a well-understood optimization technique, which gives the approach both theoretical grounding and practical reliability.
Complementing the surprise mechanism is an adaptive forgetting gate that manages the memory’s finite capacity. Even with deep neural networks, memory resources are limited. The forgetting gate is a data-dependent scalar that controls how much of the existing memory should be retained versus overwritten. It can preserve the entire memory when new data is consistent with stored patterns, or clear significant portions when the context shifts dramatically. This forgetting mechanism generalizes the gating strategies used in modern recurrent models like Mamba and xLSTM, but operates at the level of deep network parameters rather than simple state vectors.
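The full update rule—momentary surprise, momentum-carried past surprise, and the forgetting gate—can be written compactly. The sketch below uses a linear memory M(k) = Wk and fixed scalar gates for brevity; in the paper the gates (eta, theta, alpha) are data-dependent and the memory is a deep MLP.

```python
import numpy as np

def titans_update(W, S, k, v, eta=0.9, theta=0.1, alpha=0.01):
    """One memory step with momentum surprise and a forgetting gate.

    Illustrative sketch of the Titans update for a linear memory M(k) = W @ k:
        S_t = eta * S_{t-1} - theta * grad    (past + momentary surprise)
        W_t = (1 - alpha) * W_{t-1} + S_t     (adaptive forgetting)
    Scalar eta/theta/alpha stand in for the paper's data-dependent gates.
    """
    grad = 2 * np.outer(W @ k - v, k)   # momentary surprise: dL/dW of ||Wk - v||^2
    S = eta * S - theta * grad          # decayed past surprise plus new gradient
    W = (1 - alpha) * W + S             # forget a fraction, then apply the update
    return W, S
```

Setting eta to zero recovers plain gradient descent (purely momentary surprise); setting alpha to zero disables forgetting, which is exactly gradient descent with momentum, as the paper notes.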
Three Titans Variants for Integrating Memory
With the neural long-term memory module designed, the critical architectural question becomes: how should this memory interact with the rest of the model? Titans answers this with three distinct variants, each offering different trade-offs between expressiveness, efficiency, and ease of integration.
MAC (Memory as Context) is the most intuitive variant. It prepends the output of the long-term memory module as additional context tokens to the attention mechanism. When the attention core processes a segment of the sequence, it attends not only to the tokens within its local window but also to a set of memory tokens retrieved from the long-term memory. This effectively expands the attention’s receptive field without increasing the window size, giving the model access to distant historical information through the compressed representations stored in the memory module. MAC preserves the standard attention computation, making it straightforward to implement within existing Transformer frameworks.
MAG (Memory as Gated branch) takes a parallel processing approach. The input is simultaneously fed through the attention core and the long-term memory module, and their outputs are combined through a learned gating mechanism. The gate determines, for each token, how much weight to assign to the attention output versus the memory output. For tokens where local context is sufficient, the gate can rely primarily on attention. For tokens that require historical context beyond the local window, the gate can draw more heavily on the memory module’s output. This variant offers the greatest flexibility in balancing short-term precision with long-term recall.
MAL (Memory as Layer) processes the data sequentially, first through the long-term memory module and then through the attention mechanism. The memory module enriches the input representations with historical context before attention refines them using local dependencies. This variant is conceptually the simplest and creates a clean separation of concerns: memory handles long-range dependencies, attention handles local ones. In practice, the researchers found that all three variants outperform both standard Transformers and modern linear recurrent models, with MAC and MAG generally ahead of the simpler MAL design.
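Under heavy simplification, the three integration patterns reduce to a few lines each. In this sketch, `attend` is a stand-in for the windowed attention core, and the memory outputs (`mem_tokens`, `mem_out`, `memory_layer`) are placeholders for whatever the long-term memory module produces; none of these names come from the paper.

```python
import numpy as np

def attend(x):
    # Stand-in for softmax attention over one segment (illustrative only).
    w = np.exp(x @ x.T / np.sqrt(x.shape[-1]))
    return (w / w.sum(-1, keepdims=True)) @ x

def mac(x, mem_tokens):
    # Memory as Context: prepend retrieved memory tokens, attend over the
    # extended sequence, then keep only the positions for the real tokens.
    return attend(np.concatenate([mem_tokens, x]))[len(mem_tokens):]

def mag(x, mem_out, gate):
    # Memory as Gate: run attention and memory in parallel, blend per token.
    return gate * attend(x) + (1 - gate) * mem_out

def mal(x, memory_layer):
    # Memory as Layer: memory enriches representations, attention refines them.
    return attend(memory_layer(x))
```

The structural differences are visible directly: MAC widens what attention can see, MAG arbitrates between two parallel branches, and MAL stacks the two mechanisms in sequence.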
Titans vs. Transformers: How the Benchmarks Compare
The experimental evaluation of Titans spans an impressively diverse set of tasks: language modeling, commonsense reasoning, recall-intensive retrieval, needle-in-haystack search, time series forecasting, and DNA sequence modeling. Across every category, Titans demonstrates consistent superiority over both Transformers and modern linear recurrent alternatives.
In language modeling benchmarks, Titans achieves lower perplexity than Transformers operating with equivalent context windows. More notably, Titans with a limited attention window of 512 tokens plus its neural memory achieves comparable or superior performance to full-attention Transformers processing the entire sequence—a remarkable result given the enormous computational savings. Against modern linear recurrent models like Mamba, Mamba2, DeltaNet, and Gated DeltaNet, Titans consistently delivers better perplexity scores, demonstrating that its deep neural memory captures information that linear compression fundamentally cannot.
On commonsense reasoning tasks including HellaSwag, WinoGrande, PIQA, ARC-Easy, and ARC-Challenge, Titans achieves the highest average accuracy among all tested architectures. The margins are particularly significant on tasks requiring synthesis of information from extended contexts, where the long-term memory module provides a clear advantage over models that either truncate context or compress it linearly.
The recall-intensive benchmarks provide perhaps the most dramatic demonstration of Titans’ capabilities. These tasks require the model to retrieve specific information from long sequences—precisely the scenario where traditional approaches struggle most, and precisely the capability that matters most for practical deployment of long-context systems.
Scaling Beyond 2 Million Tokens with Titans
Perhaps the most headline-worthy result from the Titans paper is its ability to scale to context windows exceeding 2 million tokens while maintaining high accuracy on retrieval tasks. Standard Transformers, with their quadratic complexity, become computationally infeasible well before reaching this scale. Even efficient variants like FlashAttention, while dramatically improving throughput, cannot eliminate the fundamental quadratic scaling that limits practical context lengths.
The needle-in-haystack evaluation provides a rigorous test of this capability. A specific piece of information (the “needle”) is embedded at various positions within sequences of varying lengths (the “haystack”). The model must locate and retrieve this information accurately. As context length increases, most architectures see dramatic degradation in retrieval accuracy. Transformers fail entirely beyond their maximum context window. Linear recurrent models see progressive accuracy drops as their fixed-size states overflow with compressed information.
Titans maintains consistently high retrieval accuracy even at the 2M token mark. The deep neural memory module’s ability to selectively memorize surprising tokens while forgetting redundant ones means that the relevant “needle” information is preserved with high fidelity regardless of how much “haystack” surrounds it. The surprise-driven memory management ensures that unusual or distinctive information—precisely the kind of information a needle-in-haystack test evaluates—receives disproportionate attention from the memory system.
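The evaluation protocol itself is simple to sketch. A minimal harness for this kind of sweep might look as follows; `model` is a hypothetical callable that takes a prompt string and returns an answer, and the prompt format is invented for illustration.

```python
def needle_accuracy(model, needle, haystack_tokens, positions, lengths):
    """Sketch of a needle-in-a-haystack sweep.

    Inserts `needle` at several relative depths (`positions`, fractions in
    [0, 1)) inside haystacks of several `lengths`, then checks whether the
    model's answer contains the needle. `model` is a hypothetical callable.
    """
    results = {}
    for n in lengths:
        for frac in positions:
            hay = list(haystack_tokens[:n])
            hay.insert(int(frac * n), needle)       # bury the needle
            answer = model(" ".join(hay) + " What was the needle?")
            results[(n, frac)] = needle in answer   # exact-retrieval check
    return results
```

Plotting the resulting grid (length on one axis, depth on the other) gives the familiar needle-in-a-haystack heatmap; the paper's claim is that Titans' cells stay accurate out past 2M tokens.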
This scaling capability has profound implications for practical applications. Legal document analysis often requires processing thousands of pages of contracts, precedents, and regulations. Genomic research involves sequences of billions of base pairs. Financial analysis may span decades of tick-by-tick market data. In each case, the ability to maintain a faithful, queryable memory of extended historical context while processing new information in real time is not merely convenient—it is essential. Agentic AI systems, in particular, depend on models that can maintain coherent reasoning across extended interactions.
The parallelizable training algorithm that enables this scaling deserves special mention. Despite the recurrent nature of the memory update process, the Titans team developed a tensorized mini-batch approach that converts the sequential gradient computations into matrix multiplication operations. This means the neural memory can be trained efficiently on modern GPU hardware that is optimized for large matrix operations, avoiding the sequential bottleneck that has historically limited recurrent architectures. The result is a model that combines the expressive power of deep recurrent memory with the training efficiency of parallel architectures.
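The core observation behind this tensorized approach can be shown for a linear memory: if every gradient inside a chunk is taken with respect to the chunk-initial weights, the per-token loop collapses into two matrix multiplications. The sketch below demonstrates only that equivalence; the paper applies the trick to deep memory modules, with further machinery for the gates, which is omitted here.

```python
import numpy as np

def chunk_gradients(W0, K, V, theta):
    """Per-chunk gradients as matmuls, for a linear memory M(k) = W @ k.

    All gradients are taken w.r.t. the chunk-initial weights W0, so the
    whole chunk is processed at once. K, V are (chunk, dim); theta is a
    per-token (chunk,) learning-rate weighting. Illustrative sketch only.
    """
    E = K @ W0.T - V                       # (chunk, dim) prediction errors: one matmul
    return 2 * (theta[:, None] * E).T @ K  # sum_t theta_t * dL_t/dW0: one more matmul

def chunk_gradients_loop(W0, K, V, theta):
    # Reference sequential version: same result, token by token.
    G = np.zeros_like(W0)
    for t in range(len(K)):
        G += theta[t] * 2 * np.outer(W0 @ K[t] - V[t], K[t])
    return G
```

Because both versions produce identical gradients, the memory can be trained on GPU-friendly batched matmuls instead of a token-by-token recurrence, which is what makes 2M-token training runs tractable.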
Real-World Applications of Titans Neural Memory Architecture
The practical implications of Titans extend far beyond academic benchmarks. Any domain that requires processing, understanding, and reasoning over extended sequences stands to benefit from this architecture. In natural language processing, Titans enables language models that can maintain coherent context across entire books, lengthy conversation histories, or vast knowledge bases without the truncation or summarization that current models require.
In genomics and bioinformatics, the ability to process sequences of millions of nucleotides opens new possibilities for understanding long-range interactions in DNA and RNA sequences. The paper demonstrates strong performance on DNA modeling tasks, suggesting that Titans can capture biological patterns that span far beyond what current fixed-context models can handle. Regulatory elements that influence gene expression from distances of millions of base pairs, structural features of chromosomes, and complex patterns of genetic variation all require the kind of extended context modeling that Titans provides.
Time series forecasting represents another natural application. Financial markets, climate data, industrial sensor networks, and healthcare monitoring systems all generate continuous streams of data where patterns can span hours, days, or months. Titans’ demonstrated superiority on time series benchmarks suggests it can capture both short-term fluctuations and long-term trends simultaneously—a capability that has eluded previous architectures.
For enterprise AI and agentic systems, Titans’ memory architecture offers a path toward AI agents that truly remember their past interactions and learn from experience. Current AI assistants operate within limited context windows, requiring elaborate retrieval-augmented generation pipelines to access historical information. Titans’ built-in long-term memory could dramatically simplify these architectures while improving both accuracy and latency. An AI agent powered by Titans could maintain a continuous, evolving understanding of its operating environment, user preferences, and task history without external memory systems.
Video understanding and multimodal processing are also prime candidates. Videos consist of thousands of frames, each containing dense visual information. Understanding narrative, tracking objects, and reasoning about events across an entire video requires maintaining context over extremely long sequences of visual tokens. Titans’ ability to scale to millions of tokens while selectively memorizing significant events maps directly onto the challenges of video understanding, where most frames are redundant but occasional key frames carry critical narrative information.
The Future of Test-Time Learning and Neural Memory Architectures
Titans represents more than an incremental improvement in neural architecture design. It introduces a paradigm shift in how we think about memory in artificial intelligence. The idea that a neural network should actively learn and adapt its memory during inference—not just during training—challenges one of the most fundamental assumptions in deep learning. This approach opens several compelling research directions that could reshape the field.
The current implementation uses simple multi-layer perceptrons as the memory architecture, a deliberate choice by the authors to focus on the broader design principles rather than optimizing the memory module itself. This leaves enormous room for improvement. More sophisticated architectures specifically designed for efficient memorization—such as memory-optimized neural networks, content-addressable stores, or hierarchical memory structures—could dramatically improve both the capacity and retrieval accuracy of the long-term memory module.
The connection between Titans’ surprise-driven memory and predictive coding theories in neuroscience is particularly intriguing. Predictive coding posits that the brain continuously generates predictions about incoming sensory data and updates its internal models based on prediction errors—essentially a biological version of Titans’ surprise metric. This convergence between artificial and biological approaches to memory suggests that Titans may be approaching something fundamental about how intelligent systems should process and store information.
The parallelization strategy developed for Titans’ training also has broader implications. The tensorized mini-batch approach that converts sequential gradient computations into matrix operations could be applied to other recurrent or sequential learning systems, potentially unlocking efficient training for a wide class of models that are currently limited by their sequential nature. As hardware continues to evolve toward more specialized AI accelerator designs, architectures like Titans that can leverage massive parallelism while maintaining recurrent expressiveness will become increasingly important.
For the broader AI industry, Titans signals a move toward architectures that blur the line between training and inference. Models that continue to learn and adapt at deployment time could fundamentally change how we think about AI lifecycle management, model updating, and knowledge integration. Rather than periodic retraining on new data, models could continuously incorporate new information into their long-term memory, maintaining relevance without the costly retraining pipelines that current approaches require.
The open questions are as exciting as the answers Titans provides. Can the memory module be extended to multimodal data—storing and retrieving visual, auditory, and textual information in a unified framework? Can the surprise metric be refined to handle more nuanced notions of importance beyond prediction error? Can the architecture scale to truly continuous, lifelong learning scenarios? These questions define the frontier of what may prove to be one of the most significant architectural innovations in deep learning since the original Transformer.
Frequently Asked Questions
What is the Titans architecture in neural networks?
Titans is a family of neural network architectures developed by Google Research that introduces a neural long-term memory module capable of learning to memorize data at test time. It combines three components: a short-term attention core, a deep neural long-term memory, and persistent memory parameters, enabling it to process sequences exceeding 2 million tokens while outperforming both Transformers and modern recurrent models.
How does Titans learn to memorize at test time?
Titans uses a meta-learning approach where a neural memory module trains itself during inference by measuring the surprise of incoming data through gradient-based signals. Events that violate expectations produce larger gradients, making them more memorable. This is combined with momentum-based past surprise tracking and an adaptive forgetting mechanism to manage memory capacity efficiently.
What are the three variants of Titans architecture?
The three Titans variants are MAC (Memory as Context), which prepends memory tokens to the attention context; MAG (Memory as Gated branch), which uses a gated mechanism to blend attention output with memory retrieval; and MAL (Memory as Layer), which processes data through memory and attention sequentially. Each variant offers different trade-offs between performance and computational efficiency.
How does Titans compare to standard Transformers?
Titans outperforms Transformers on language modeling, commonsense reasoning, and recall tasks while using significantly less compute. Unlike Transformers, which have quadratic complexity and are limited to fixed context windows, Titans scales to over 2 million tokens with linear complexity. In needle-in-haystack benchmarks, Titans achieves higher accuracy at extreme context lengths where Transformers fail entirely.
What is the surprise metric used in Titans neural memory?
The surprise metric in Titans measures how unexpected an input token is by computing the gradient of the associative memory loss with respect to the input. A larger gradient indicates the data diverges from what the memory has stored, making it more surprising and thus more worthy of memorization. This is enhanced with a momentum term that tracks past surprise, preventing the model from missing important information that follows a surprising event.