Beyond LLMs
Test-Time Training and Adaptive Inference
Updating model parameters at inference time using self-supervised objectives on the test input itself. TTT layers replace fixed linear recurrences (as in Mamba) with learned update rules, blurring the boundary between training and inference.
Prerequisites
- Stochastic Gradient Descent Convergence
- Recurrent Neural Networks
- Continuous Thought Machines
Why This Matters
Standard neural networks freeze their weights after training. At inference time, the same weights process every input. Transformers use attention with a KV cache that grows with context. Mamba and other SSMs compress context into a fixed-size state vector, keeping the dynamics fixed.
Test-Time Training (TTT) is a different approach entirely. It is not built on Mamba or transformers. A TTT layer updates its own parameters on each input using a self-supervised gradient step. The "hidden state" is not a vector (as in SSMs) or a KV cache (as in attention), but a set of weights that are updated by gradient descent at every token.
TTT, Mamba, and attention are three competing architectures for sequence modeling. They solve the same problem (processing sequences with bounded compute per token) through different mechanisms. TTT uses gradient-based learning as its recurrence. Mamba uses structured state spaces. Attention uses content-based retrieval. TTT has shown stronger performance than Mamba on long-context tasks (beyond 16k tokens) because gradient-based weight updates can store more retrievable information than a fixed-size linear state.
The TTT Layer
TTT Layer: Learning as State Update
Statement
A TTT layer maintains a weight matrix $W_t$ that is updated at each timestep via a self-supervised gradient step. To preserve causality (no leakage from $x_t$ into the prediction at step $t$), the update at step $t$ is computed from the previous input $x_{t-1}$:

$$W_t = W_{t-1} - \eta \, \nabla_W \ell(W_{t-1}; x_{t-1})$$

where $\ell$ is a self-supervised loss (e.g., reconstruction loss, masked-prediction loss) evaluated on $x_{t-1}$. The output at step $t$ is computed using these causally-updated weights:

$$z_t = f(x_t; W_t)$$

so $z_t$ depends on $x_t$ and on $W_t$, which itself depends only on $x_{<t}$. (Implementations may equivalently compute $W_t$ from $x_t$ and use it to predict $x_{t+1}$; the constraint is that the update step and the prediction step use disjoint input information.)

This replaces the linear recurrence of SSMs with a gradient-based update: the "state" is the weight matrix $W_t$, and the "transition function" is one step of gradient descent.
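A minimal NumPy sketch of this recurrence for a linear inner model, assuming the reconstruction loss $\ell(W; x) = \tfrac{1}{2}\|Wx - x\|^2$ (function names, dimensions, and the learning rate are illustrative choices, not from any reference implementation):

```python
import numpy as np

def ttt_linear_step(W, x_prev, x_t, lr=0.05):
    """One causal TTT-Linear step.

    Update W with a gradient step on the reconstruction loss
    l(W; x) = 0.5 * ||W x - x||^2 evaluated on the *previous*
    input, then compute the output from the current input.
    """
    # Gradient of the loss at x_prev: (W x - x) x^T
    err = W @ x_prev - x_prev
    W_new = W - lr * np.outer(err, x_prev)
    # The output uses the causally-updated weights
    z_t = W_new @ x_t
    return W_new, z_t

rng = np.random.default_rng(0)
d = 8
W = np.zeros((d, d))           # initial "state" of the layer
xs = rng.normal(size=(16, d))  # a toy input sequence

for t in range(1, len(xs)):
    W, z = ttt_linear_step(W, xs[t - 1], xs[t])
```

After a few steps, $W$ has drifted toward reconstructing the inputs seen so far, which is exactly the "state as learned weights" behavior described above.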
Intuition
In a standard SSM, the hidden state is a compressed summary of the context, and the compression is linear (matrix multiply). This limits what can be stored. In a TTT layer, the "state" is an entire weight matrix, and the "compression" is gradient descent on a loss function. Gradient descent is a much richer update rule: it can store key-value associations, learn patterns, and adapt to the specific distribution of the current context.
The self-supervised loss acts as the "what to remember" signal. If the loss is reconstruction ($\ell(W; x) = \tfrac{1}{2}\|f(x; W) - x\|^2$), the weights learn to reconstruct the input distribution seen so far. If it is masked prediction, the weights learn to predict missing tokens.
Why It Matters
TTT layers achieve the expressiveness benefits of attention (content-addressable memory, the ability to store and retrieve arbitrary information) with the linear-time complexity of SSMs (the state update is $O(1)$ per step, independent of context length). The key insight: when the inner model and the outer model share the same width $d$, the weight matrix carries $d^2$ real-valued parameters, compared to $d \cdot n$ for Mamba (with state expansion $n \approx 16$ typically), so the state-parameter budget scales quadratically with $d$ instead of linearly. Parameter count is not the same as information capacity in bits (that depends on precision and on how many distinguishable patterns can be recovered by the inner loss), but it sets an upper bound on retrievable associations. This budget argument is what drives TTT's advantage on long context. The comparison assumes a common $d$: in practice the outer model width can be hundreds while the TTT inner width is set much smaller (e.g., $d_{\text{inner}} = 64$), and the effective-capacity comparison against Mamba must use the specific architectural dimensions.
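The budget arithmetic can be made concrete. The dimensions below are hypothetical (a common Mamba state-expansion value and a small TTT inner width), chosen only to show how strongly the comparison depends on which widths you plug in:

```python
# State-parameter budgets under illustrative dimensions
# (hypothetical numbers, not taken from any specific trained model).
d_outer = 1024   # outer model width (assumed)
n = 16           # Mamba per-channel state expansion (common default)
d_inner = 64     # TTT inner-model width (assumed, much smaller than d_outer)

mamba_state = d_outer * n        # fixed-size linear state: d * n
ttt_state_same_d = d_outer ** 2  # TTT at the *same* width: d^2
ttt_state_inner = d_inner ** 2   # TTT at its actual inner width

print(mamba_state, ttt_state_same_d, ttt_state_inner)
```

At a common width $d$, $d^2$ dwarfs $d \cdot n$; at a small inner width the comparison can flip ($64^2 = 4096 < 1024 \cdot 16 = 16384$), which is why the text insists on using the specific architectural dimensions.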
Failure Mode
The gradient step at each position adds computation: you must compute $\nabla_W \ell$ at every token, which involves a forward and backward pass through the TTT layer's inner model. This is more expensive per step than Mamba's linear recurrence. The learning rate $\eta$ is critical: too large and the weights oscillate, too small and the layer does not adapt. In practice, a linear self-supervised model (TTT-Linear, where $f(x; W) = Wx$) keeps the cost manageable while still outperforming fixed recurrences.
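For a linear inner model with reconstruction loss, the per-token gradient has a closed form, $\nabla_W \ell = (Wx - x)x^\top$, so no autodiff machinery is needed in the inner loop. A quick finite-difference sanity check of that formula (illustrative code, not a reference implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
W = rng.normal(size=(d, d))
x = rng.normal(size=d)

def loss(W, x):
    # Self-supervised reconstruction loss 0.5 * ||W x - x||^2
    r = W @ x - x
    return 0.5 * float(r @ r)

# Closed-form gradient for the linear inner model: (W x - x) x^T
grad = np.outer(W @ x - x, x)

# Central finite difference on one weight entry
eps = 1e-6
E = np.zeros_like(W)
E[2, 3] = eps
fd = (loss(W + E, x) - loss(W - E, x)) / (2 * eps)
```

The analytic entry and the finite-difference estimate agree to numerical precision, confirming the rank-one structure of the update.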
TTT vs Attention vs SSM
| Property | Attention | SSM (Mamba) | TTT |
|---|---|---|---|
| State type | KV cache (grows with context) | Fixed-size vector | Weight matrix (fixed size) |
| State size | $O(T \cdot d)$ | $O(d \cdot n)$ | $O(d^2)$ |
| Per-token cost | $O(T \cdot d)$ | $O(d \cdot n)$ | $O(d^2)$ |
| Retrieval ability | Exact (attends to all tokens) | Approximate (compressed) | Learned (gradient-based) |
| State parameter budget | Unbounded (grows with $T$) | $d \cdot n$ real params | $d_{\text{inner}}^2$ real params (inner width) |
| Parallelizable | Yes (within sequence) | Yes (parallel scan) | Partially (mini-batch of tokens) |
Connection to Online Learning and Meta-Learning
TTT is online learning applied inside a neural network layer. At each timestep, the layer receives a new data point , updates its parameters to reduce the self-supervised loss, and makes a prediction. This sits in the online learning family that includes online convex optimization, but it is not literally OCO: OCO regret theorems require a convex inner loss over a convex domain, whereas generic neural TTT layers minimize a learned, nonconvex self-supervised loss in weight space. The OCO connection is a useful analogy and a source of update-rule design patterns (gradient descent, Adam-style preconditioning, mirror descent), not an off-the-shelf regret guarantee.
The connection to meta-learning is direct: the outer network (trained end-to-end) learns the self-supervised loss, learning rate, and initialization that make the inner TTT updates maximally useful. The outer training optimizes the learning rule; the inner TTT applies it at test time. This is "learning to learn" in a literal sense.
Common Confusions
TTT is not fine-tuning at test time
Fine-tuning updates the entire model on a labeled dataset. TTT updates a single layer's weights using a self-supervised loss on the current input, with no labels. The update is local (one layer), fast (one gradient step per token), and unsupervised. It is closer to online learning than to fine-tuning.
TTT on evaluation data or already-seen tokens is data leakage
If you apply TTT during evaluation where the self-supervised loss uses tokens the model will be scored on, you are leaking test information into the model's parameters. The self-supervised update must use only tokens that precede the prediction target in the sequence. For causal language modeling, this means the TTT update at position $t$ can only use tokens $x_{<t}$, not $x_t$ itself. Violating this produces artificially low perplexity numbers that do not reflect real-world performance. This is the same principle as train-test split hygiene: the model cannot learn from the data it is being evaluated on.
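This hygiene property is mechanically checkable: perturbing $x_t$ must leave $W_t$ unchanged, because the update at position $t$ only ever saw $x_{<t}$. A sketch, assuming a linear inner model with reconstruction loss (illustrative names and dimensions):

```python
import numpy as np

def causal_update(W, x_prev, lr=0.1):
    # The TTT update at position t uses only the preceding token x_{t-1}
    err = W @ x_prev - x_prev          # reconstruction residual
    return W - lr * np.outer(err, x_prev)

rng = np.random.default_rng(2)
d = 6
xs = rng.normal(size=(4, d))

def weights_at(xs, t):
    """Run the recurrence up to position t and return W_t."""
    W = np.zeros((d, d))
    for i in range(1, t + 1):
        W = causal_update(W, xs[i - 1])
    return W

t = 3
xs_perturbed = xs.copy()
xs_perturbed[t] += 10.0               # perturb x_t itself
W_clean = weights_at(xs, t)
W_pert = weights_at(xs_perturbed, t)
```

If `W_clean` and `W_pert` ever differed, the update would be reading the token it is about to be scored on, i.e., the leakage described above.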
The weight matrix state is not the same as attention KV cache
Both grow in information content, but through different mechanisms. The KV cache stores exact key-value pairs and grows linearly with context length. The TTT weight state has fixed size ($d \times d$) but stores information through gradient-based compression, which can be lossy. Very long contexts may lose early information if later gradients overwrite it (catastrophic forgetting within a sequence).
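The lossiness is easy to demonstrate with a toy associative memory: write key-value pairs into a small weight matrix by gradient steps on $\tfrac{1}{2}\|Wk - v\|^2$, then watch recall of the first pair degrade as later pairs overwrite it (dimensions, learning rate, and step counts are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
W = np.zeros((d, d))

def store(W, k, v, lr=0.5, steps=10):
    # Gradient steps on 0.5 * ||W k - v||^2 to write one association
    for _ in range(steps):
        W = W - lr * np.outer(W @ k - v, k)
    return W

keys = rng.normal(size=(20, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-norm keys
vals = rng.normal(size=(20, d))

W = store(W, keys[0], vals[0])
err_after_store = np.linalg.norm(W @ keys[0] - vals[0])

for k, v in zip(keys[1:], vals[1:]):     # 19 later writes interfere
    W = store(W, k, v)
err_at_end = np.linalg.norm(W @ keys[0] - vals[0])
```

The first association is recalled almost exactly right after storage, but with 20 pairs crammed into a $4 \times 4$ state, later gradient updates overwrite it: the fixed-size state is a lossy compressor, unlike a growing KV cache.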
Exercises
Problem
A TTT-Linear layer has $f(x; W) = Wx$ with self-supervised loss $\ell(W; x) = \tfrac{1}{2}\|Wx - x\|^2$. Compute the gradient $\nabla_W \ell$ and describe what the weight update is doing geometrically.
Problem
Compare the state parameter budget of a Mamba state ($d = 1024$, $n = 16$, giving $d \cdot n$ parameters) with a TTT weight state whose inner model has width $d_{\text{inner}} = 64$ (so the inner weight matrix has $64^2$ entries). Which carries more real-valued parameters? Then argue which can store more retrievable information, and under what assumptions. Why is this different from the table row "$d \cdot n$ vs $d^2$," which assumes a common $d$?
References
Canonical:
- Sun et al., "Learning to (Learn at Test Time): RNNs with Expressive Hidden States" (ICML 2024, arXiv:2407.04620). The TTT paper; TTT-Linear and TTT-MLP are variants introduced as sections within this paper, not separate publications.
Related foundations:
- Irie et al., "The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention" (ICML 2022, arXiv:2202.05798). Related dual-form perspective on attention as gradient-based memory.
- Sun et al., "Test-Time Training with Self-Supervision for Generalization under Distribution Shifts" (ICML 2020). The original test-time training framework; TTT layers generalize the idea from per-sample adaptation to per-token adaptation inside a recurrent layer.
- Schmidhuber, "Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks" (Neural Computation 1992). Fast-weight memory systems, the conceptual ancestor of TTT-style weight-as-state recurrences.
Next Topics
- Online learning and bandits: the theoretical framework for learning from sequential data
- Meta-learning: learning the learning rule itself
Last reviewed: April 26, 2026