
Test-Time Training and Adaptive Inference

Updating model parameters at inference time using self-supervised objectives on the test input itself. TTT layers replace fixed linear recurrences (as in Mamba) with learned update rules, blurring the boundary between training and inference.

Advanced · Tier 2 · Frontier · Frontier watch · ~40 min

Why This Matters

Standard neural networks freeze their weights after training. At inference time, the same weights process every input. Transformers use attention with a KV cache that grows with context. Mamba and other SSMs compress context into a fixed-size state vector, keeping the dynamics fixed.

Test-Time Training (TTT) is a different approach entirely. It is not built on Mamba or transformers. A TTT layer updates its own parameters on each input using a self-supervised gradient step. The "hidden state" is not a vector (as in SSMs) or a KV cache (as in attention), but a set of weights that are updated by gradient descent at every token.

TTT, Mamba, and attention are three competing architectures for sequence modeling. They solve the same problem (processing sequences with bounded compute per token) through different mechanisms. TTT uses gradient-based learning as its recurrence. Mamba uses structured state spaces. Attention uses content-based retrieval. TTT has shown stronger performance than Mamba on long-context tasks (beyond 16k tokens) because gradient-based weight updates can store more retrievable information than a fixed-size linear state.

The TTT Layer

Proposition

TTT Layer: Learning as State Update

Statement

A TTT layer maintains a weight matrix $W_t$ that is updated at each timestep via a self-supervised gradient step. To preserve causality (no leakage from $x_t$ into the prediction at step $t$), the update at step $t$ is computed from the previous input $x_{t-1}$:

$$W_t = W_{t-1} - \eta \nabla_W \ell(W_{t-1}; x_{t-1})$$

where $\ell$ is a self-supervised loss (e.g., reconstruction loss, masked-prediction loss) evaluated on $x_{t-1}$. The output at step $t$ is computed using these causally updated weights:

$$y_t = f_{W_t}(x_t)$$

so $y_t$ depends on $x_t$ and on $W_t$, which itself depends only on $x_1, \ldots, x_{t-1}$. (Implementations may equivalently compute $W_{t+1}$ from $x_t$ and use it to predict $x_{t+1}$; the constraint is that the update step and the prediction step use disjoint input information.)

This replaces the linear recurrence $h_t = A h_{t-1} + B x_t$ of SSMs with a gradient-based update: the "state" is the weight matrix $W_t$, and the "transition function" is one step of gradient descent.
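The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name `ttt_linear_step`, the toy dimensions, and the learning rate are all illustrative choices, and the loss is the TTT-Linear reconstruction loss $\ell(W; x) = \|Wx - x\|^2$ discussed later in the text.

```python
import numpy as np

def ttt_linear_step(W, x_prev, x_t, eta=0.1):
    """One TTT-Linear step: update W on the *previous* input, then
    use the updated weights to produce the output for x_t.

    Self-supervised reconstruction loss: l(W; x) = ||W x - x||^2,
    whose gradient is grad_W l = 2 (W x - x) x^T (a rank-one matrix).
    """
    # Gradient step uses x_prev only -> no leakage from x_t into W_t.
    err = W @ x_prev - x_prev
    W_new = W - eta * 2.0 * np.outer(err, x_prev)
    # Output at step t uses the causally updated weights.
    y_t = W_new @ x_t
    return W_new, y_t

rng = np.random.default_rng(0)
d = 8
W = np.zeros((d, d))                        # initial "state"
xs = rng.normal(size=(16, d)) / np.sqrt(d)  # toy token stream, ~unit norm

for t in range(1, len(xs)):
    W, y = ttt_linear_step(W, xs[t - 1], xs[t])
```

The "state transition" is literally one step of gradient descent: each token writes a rank-one correction into $W$, in contrast to the fixed matrices $A, B$ of an SSM recurrence.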

Intuition

In a standard SSM, the hidden state $h_t$ is a compressed summary of the context, and the compression is linear (a matrix multiply). This limits what can be stored. In a TTT layer, the "state" is an entire weight matrix, and the "compression" is gradient descent on a loss function. Gradient descent is a much richer update rule: it can store key-value associations, learn patterns, and adapt to the specific distribution of the current context.

The self-supervised loss acts as the "what to remember" signal. If the loss is reconstruction ($\ell = \|Wx - x\|^2$), the weights learn to reconstruct the input distribution seen so far. If it is masked prediction, the weights learn to predict missing tokens.
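For the reconstruction case, this signal can be written down explicitly. Differentiating $\ell(W; x) = \|Wx - x\|^2$ with respect to $W$ gives a rank-one gradient, so the state update is an outer-product write:

```latex
\nabla_W \ell = 2\,(Wx - x)\,x^\top
\qquad\Longrightarrow\qquad
W_t = W_{t-1} - 2\eta\,(W_{t-1} x_{t-1} - x_{t-1})\,x_{t-1}^\top
```

Each token therefore deposits a rank-one correction into the weight state, a delta-rule-style write that echoes the fast-weight memories cited in the references.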

Why It Matters

TTT layers achieve the expressiveness benefits of attention (content-addressable memory, the ability to store and retrieve arbitrary information) with the linear-time complexity of SSMs (the state update is $O(d^2)$ per step, independent of context length). The key insight: when the inner model and the outer model share the same width $d$, the $d \times d$ weight matrix carries $O(d^2)$ real-valued parameters compared to $O(d \cdot N)$ for Mamba (with $N = 16$ typically), so the state-parameter budget scales quadratically with $d$ instead of linearly. Parameter count is not the same as information capacity in bits (that depends on precision and on how many distinguishable patterns the inner loss can recover), but it sets an upper bound on retrievable associations. This budget argument is what drives TTT's advantage on long context. The comparison assumes a common $d$: in practice the outer model width can be hundreds while the TTT inner width is set much smaller (e.g., $d_{\text{inner}} = 64$), and the effective-capacity comparison against Mamba must use the specific architectural dimensions.
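The budget caveat is easy to make concrete with the dimensions quoted in the text ($d = 512$, $N = 16$ for Mamba; a hypothetical TTT inner width of $64$):

```python
# State-parameter budgets under the dimensions quoted in the text.
d_mamba, N = 512, 16       # Mamba: per-channel state of size N
mamba_state = d_mamba * N  # 512 * 16 = 8192 real-valued state entries

d_inner = 64               # illustrative TTT inner width
ttt_state = d_inner ** 2   # 64 * 64 = 4096 entries in the weight matrix

print(mamba_state, ttt_state)  # 8192 4096
```

With these particular dimensions, the Mamba state actually carries more real-valued parameters than the TTT weight state, which is exactly why the common-$d$ scaling argument cannot be applied blindly (this is the point of the advanced exercise below, which echoes the same numbers).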

Failure Mode

The gradient step at each position adds computation: you must compute $\nabla_W \ell$ at every token, which involves a forward and backward pass through the TTT layer's inner model. This is more expensive per step than Mamba's linear recurrence. The learning rate $\eta$ is critical: too large and the weights oscillate, too small and the layer does not adapt. In practice, a linear self-supervised model (TTT-Linear, where $\ell = \|Wx - x\|^2$) keeps the cost manageable while still outperforming fixed recurrences.
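The learning-rate sensitivity can be seen in a toy experiment. The sketch below (illustrative values, not from the paper) repeats the reconstruction-loss gradient step on a single fixed unit input; the per-step error scales by $1 - 2\eta\|x\|^2$, so a small $\eta$ contracts the error while a large one makes it oscillate in sign and grow:

```python
import numpy as np

def run(eta, steps=20, d=4):
    """Repeat the TTT-Linear gradient step on one fixed unit input
    and track the reconstruction error ||W x - x|| after each step."""
    x = np.zeros(d)
    x[0] = 1.0                             # unit-norm probe input
    W = np.zeros((d, d))
    errs = []
    for _ in range(steps):
        err = W @ x - x
        W -= eta * 2.0 * np.outer(err, x)  # gradient of ||Wx - x||^2
        errs.append(np.linalg.norm(W @ x - x))
    return errs

small = run(eta=0.1)  # error contracts by factor |1 - 2*eta| = 0.8 per step
large = run(eta=1.5)  # factor |1 - 2*eta| = 2: error grows each step
```

In a real TTT layer the inner loss, the token norms, and $\eta$ are all learned end-to-end, so the stable range is discovered by the outer training rather than set by hand.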

TTT vs Attention vs SSM

| Property | Attention | SSM (Mamba) | TTT |
| --- | --- | --- | --- |
| State type | KV cache (grows with context) | Fixed-size vector | Weight matrix (fixed size) |
| State size | $O(n \cdot d)$ | $O(d \cdot N)$, $N \approx 16$ | $O(d^2)$ |
| Per-token cost | $O(n \cdot d)$ | $O(d \cdot N)$ | $O(d^2)$ |
| Retrieval ability | Exact (attends to all tokens) | Approximate (compressed) | Learned (gradient-based) |
| State parameter budget | Unbounded (grows with $n$) | $O(dN)$ real params | $O(d^2)$ real params (inner width) |
| Parallelizable | Yes (within sequence) | Yes (parallel scan) | Partially (mini-batch of tokens) |

Connection to Online Learning and Meta-Learning

TTT is online learning applied inside a neural network layer. At each timestep, the layer receives a new data point $x_t$, updates its parameters to reduce the self-supervised loss, and makes a prediction. This sits in the online learning family that includes online convex optimization, but it is not literally OCO: OCO regret theorems require a convex inner loss over a convex domain, whereas generic neural TTT layers minimize a learned, nonconvex self-supervised loss in weight space. The OCO connection is a useful analogy and a source of update-rule design patterns (gradient descent, Adam-style preconditioning, mirror descent), not an off-the-shelf regret guarantee.

The connection to meta-learning is direct: the outer network (trained end-to-end) learns the self-supervised loss, learning rate, and initialization that make the inner TTT updates maximally useful. The outer training optimizes the learning rule; the inner TTT applies it at test time. This is "learning to learn" in a literal sense.

Common Confusions

Watch Out

TTT is not fine-tuning at test time

Fine-tuning updates the entire model on a labeled dataset. TTT updates a single layer's weights using a self-supervised loss on the current input, with no labels. The update is local (one layer), fast (one gradient step per token), and unsupervised. It is closer to online learning than to fine-tuning.

Watch Out

TTT on evaluation data or already-seen tokens is data leakage

If you apply TTT during evaluation where the self-supervised loss uses tokens the model will be scored on, you are leaking test information into the model's parameters. The self-supervised update must use only tokens that precede the prediction target in the sequence. For causal language modeling, this means the TTT update at position $t$ can only use tokens $x_1, \ldots, x_{t-1}$, not $x_t$ itself. Violating this produces artificially low perplexity numbers that do not reflect real-world performance. This is the same principle as train-test split hygiene: the model cannot learn from the data it is being evaluated on.
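The difference between the two orderings is one line of code. The sketch below (illustrative, using the TTT-Linear reconstruction loss) contrasts a causal loop, which updates on $x_{t-1}$ before predicting $x_t$, with a leaky loop that updates on $x_t$ first; on pure-noise inputs the leaky version looks artificially good precisely because it has partially memorized the token it is scored on:

```python
import numpy as np

def grad_step(W, x, eta=0.1):
    """One gradient step on the reconstruction loss ||Wx - x||^2."""
    return W - eta * 2.0 * np.outer(W @ x - x, x)

def causal_outputs(xs, d, eta=0.1):
    """Correct: W_t is built from x_1..x_{t-1} before predicting x_t."""
    W = np.zeros((d, d))
    ys = []
    for t, x in enumerate(xs):
        if t > 0:
            W = grad_step(W, xs[t - 1], eta)  # past tokens only
        ys.append(W @ x)
    return ys

def leaky_outputs(xs, d, eta=0.1):
    """Leaky: updating on x_t before predicting it leaks the target."""
    W = np.zeros((d, d))
    ys = []
    for x in xs:
        W = grad_step(W, x, eta)  # <- leak: x_t enters W_t
        ys.append(W @ x)
    return ys

rng = np.random.default_rng(1)
d = 32
xs = [v / np.linalg.norm(v) for v in rng.normal(size=(20, d))]
err_causal = np.mean([np.linalg.norm(y - x)
                      for y, x in zip(causal_outputs(xs, d), xs)])
err_leaky = np.mean([np.linalg.norm(y - x)
                     for y, x in zip(leaky_outputs(xs, d), xs)])
```

Since the tokens here are independent noise, no causal model should reconstruct them well; the leaky loop's lower error is exactly the kind of inflated score the warning above describes.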

Watch Out

The weight matrix state is not the same as attention KV cache

Both grow in information content, but through different mechanisms. The KV cache stores exact key-value pairs and grows linearly with context length. The TTT weight state has fixed size ($d^2$) but stores information through gradient-based compression, which can be lossy. Very long contexts may lose early information if later gradients overwrite it (catastrophic forgetting within a sequence).

Exercises

ExerciseCore

Problem

A TTT-Linear layer has $W \in \mathbb{R}^{64 \times 64}$ with self-supervised loss $\ell = \|Wx - x\|^2$. Compute the gradient $\nabla_W \ell$ and describe what the weight update $W \leftarrow W - \eta \nabla_W \ell$ is doing geometrically.

ExerciseAdvanced

Problem

Compare the state parameter budget of a Mamba state ($d = 512$, $N = 16$) with a TTT weight state ($W \in \mathbb{R}^{64 \times 64}$, so the inner model has $d_{\text{inner}} = 64$). Which carries more real-valued parameters? Then argue which can store more retrievable information, and under what assumptions. Why is this different from the table row "$O(d^2)$ vs $O(dN)$", which assumes a common $d$?

References

Canonical:

  • Sun et al., "Learning to (Learn at Test Time): RNNs with Expressive Hidden States" (ICML 2024, arXiv:2407.04620). The TTT paper; TTT-Linear and TTT-MLP are variants introduced as sections within this paper, not separate publications.

Related foundations:

  • Irie et al., "The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention" (ICML 2022, arXiv:2202.05798). Related dual-form perspective on attention as gradient-based memory.
  • Sun et al., "Test-Time Training with Self-Supervision for Generalization under Distribution Shifts" (ICML 2020). The original test-time training framework; TTT layers generalize the idea from per-sample adaptation to per-token adaptation inside a recurrent layer.
  • Schmidhuber, "Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks" (Neural Computation 1992). Fast-weight memory systems, the conceptual ancestor of TTT-style weight-as-state recurrences.

Next Topics

  • Online learning and bandits: the theoretical framework for learning from sequential data
  • Meta-learning: learning the learning rule itself

Last reviewed: April 26, 2026
