

Latent Reasoning

Techniques that spend computation in hidden state rather than visible chain-of-thought tokens: pause tokens, continuous thought, recurrent depth, and what is empirical vs speculative.

Advanced · Tier 2 · Frontier · Frontier watch · ~50 min

Why This Matters

Chain-of-thought (CoT) prompting improves LLM reasoning by generating intermediate tokens that externalize the computation. But this approach has a cost: every reasoning token consumes output bandwidth, increases latency, and can expose a rationale that is not always faithful to the computation that produced the answer.

Latent reasoning asks a narrower, technical question: can some useful inference compute happen in the model's hidden states before, between, or instead of visible reasoning tokens? Instead of producing a full chain of tokens $t_1, t_2, \ldots, t_n$, a model may allocate extra positions, pass continuous thought states forward, or iterate a recurrent block $z \to f(z) \to f(f(z)) \to \cdots$ before decoding an answer.

This is early-stage research. The motivation is to decouple some inference compute from output length, while being honest that hidden computation is harder to train and harder to audit.

Mental Model

Standard CoT is like working a math problem by writing out every step on paper. Latent reasoning is like allowing an internal scratchpad before the final line, but the metaphor should not be pushed too far. The scratchpad is a learned hidden state, not a human thought, and it may be hard to interpret.

Core Concepts

Definition

Chain-of-Thought Bottleneck

In standard chain-of-thought reasoning, inference compute scales linearly with the number of generated tokens. Each token requires a full forward pass through the model. The CoT bottleneck is the constraint that reasoning depth is coupled to output sequence length: more computation requires more output tokens.
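For a sense of scale, here is a back-of-envelope sketch of how decode compute grows with reasoning length. It uses the common rough approximation of about 2 FLOPs per parameter per generated token and ignores attention and KV-cache costs; the model size and token counts are illustrative, not taken from any specific system.

```python
# Back-of-envelope: decode-time compute grows linearly with reasoning tokens.
# Uses the rough ~2 * n_params FLOPs-per-generated-token approximation and
# ignores attention/KV-cache costs; all numbers are illustrative.

def decode_flops(n_params: float, n_reasoning_tokens: int, n_answer_tokens: int) -> float:
    """Approximate FLOPs spent decoding one response."""
    return 2.0 * n_params * (n_reasoning_tokens + n_answer_tokens)

n_params = 7e9  # hypothetical 7B-parameter model
for cot_len in (0, 100, 1000):
    print(cot_len, f"{decode_flops(n_params, cot_len, n_answer_tokens=20):.2e} FLOPs")
# Ten times more chain-of-thought tokens -> roughly ten times more decode compute.
```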

Definition

Latent Reasoning

Latent reasoning performs iterative computation in the model's hidden state space without generating intermediate tokens. Given hidden state $z_t$ at position $t$, the model applies a recurrent function $f$ for $M$ steps:

$$z_t^{(0)} = z_t, \quad z_t^{(m)} = f\bigl(z_t^{(m-1)}\bigr) \quad \text{for } m = 1, \ldots, M$$

The final state $z_t^{(M)}$ is then used to predict the next token. The parameter $M$ controls recurrent depth independently of output length within architectures that support such iteration.
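As a concrete illustration, here is a minimal sketch of this recurrence in PyTorch. The choice of a small MLP as $f$ and a linear decoding head are assumptions made for brevity, not any particular paper's architecture.

```python
# Minimal sketch of z^(m) = f(z^(m-1)): the same function f is applied M times
# to a hidden state before decoding, so latent depth is independent of output length.
import torch
import torch.nn as nn

class LatentRecurrence(nn.Module):
    def __init__(self, d_model: int = 64, vocab_size: int = 100):
        super().__init__()
        self.f = nn.Sequential(           # the recurrent function f, reused at every step
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, z_t: torch.Tensor, num_steps: int) -> torch.Tensor:
        z = z_t
        for _ in range(num_steps):        # M hidden-state updates, no tokens emitted
            z = self.f(z)
        return self.lm_head(z)            # logits for the next token

model = LatentRecurrence()
z = torch.randn(1, 64)                    # hidden state at position t
logits_shallow = model(z, num_steps=1)
logits_deep = model(z, num_steps=8)       # more latent compute, same parameters
```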

Definition

Continuous Thought

Continuous thought replaces discrete token representations with continuous vector embeddings in the reasoning process. Instead of sampling a token $x_t$, embedding it, and feeding it back, the model directly passes the continuous hidden state to the next iteration. This eliminates the information bottleneck of discrete tokenization during reasoning.

Approaches

Recurrent Depth

The simplest form: apply the same transformer block (or a subset of blocks) multiple times. A standard transformer with $L$ layers computes:

$$z^{(l)} = \text{Block}_l\bigl(z^{(l-1)}\bigr) \quad \text{for } l = 1, \ldots, L$$

A recurrent-depth model with $L$ layers and $M$ recurrence steps computes:

$$z^{(l, m)} = \text{Block}_l\bigl(z^{(l-1, m)}\bigr) \quad \text{for } l = 1, \ldots, L$$

and then feeds $z^{(L, m)}$ back as the input for recurrence step $m+1$. The effective depth is $L \times M$ while the parameter count remains that of $L$ layers.
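A sketch of the looped computation, assuming toy layer sizes and PyTorch's stock encoder layer; real recurrent-depth models add input injection, normalization, and adapters around the loop.

```python
# Recurrent depth: L transformer blocks applied M times with shared weights,
# giving effective depth L * M at the parameter cost of L blocks.
import torch
import torch.nn as nn

class RecurrentDepthStack(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, num_layers: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads,
                                       dim_feedforward=4 * d_model, batch_first=True)
            for _ in range(num_layers)
        ])

    def forward(self, z: torch.Tensor, num_recurrences: int) -> torch.Tensor:
        for _ in range(num_recurrences):   # outer loop over recurrence steps m = 1..M
            for block in self.blocks:      # inner loop over the shared L-layer stack
                z = block(z)
        return z

stack = RecurrentDepthStack(num_layers=4)
x = torch.randn(2, 16, 64)                 # (batch, seq, d_model)
out = stack(x, num_recurrences=3)          # effective depth 4 * 3 = 12 layer applications
n_params = sum(p.numel() for p in stack.parameters())  # unchanged by num_recurrences
```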

Coconut (Chain of Continuous Thought)

Coconut (Hao et al., 2024) replaces discrete chain-of-thought tokens with continuous hidden states. During training:

  1. Start with a standard CoT training example
  2. Replace each reasoning token with a special "thought" token
  3. Instead of embedding the thought token discretely, pass the continuous hidden state from the previous position directly
  4. The model learns to reason in this continuous space

At inference, the model generates a fixed number of continuous thought steps (no discrete tokens emitted), then generates the answer.
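A minimal sketch of that inference loop is below. The `backbone`, `embed`, and `lm_head` callables are hypothetical stand-ins for a decoder's components, and the greedy answer decoding is a simplification; the point is that the thought steps feed hidden states back without ever sampling a token.

```python
# Coconut-style inference sketch: a fixed number of continuous-thought steps,
# then ordinary token-by-token decoding of the answer.
import torch

@torch.no_grad()
def generate_with_continuous_thoughts(backbone, embed, lm_head, prompt_ids,
                                      num_thoughts: int = 4, max_answer: int = 32):
    inputs = embed(prompt_ids)                        # (1, T, d) discrete prompt embeddings
    for _ in range(num_thoughts):
        hidden = backbone(inputs)                     # (1, T', d) hidden states
        thought = hidden[:, -1:, :]                   # last position's continuous state
        inputs = torch.cat([inputs, thought], dim=1)  # fed back directly, no token sampled
    answer = []
    for _ in range(max_answer):
        hidden = backbone(inputs)
        next_id = lm_head(hidden[:, -1, :]).argmax(dim=-1)       # greedy answer decoding
        answer.append(next_id)
        inputs = torch.cat([inputs, embed(next_id).unsqueeze(1)], dim=1)
    return torch.stack(answer, dim=1)
```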

Pause Tokens

A simpler variant: insert learnable "pause" tokens into the input (Goyal, Ji, Rawat, Menon, Kumar, and Nagarajan 2024, arXiv:2310.02226, ICLR 2024). These tokens carry no semantic content but give the model additional forward-pass computation before it must produce an output. The model learns to use these extra positions for internal computation. Crucially, pause tokens require pretraining-time inclusion to help: bolting them on at inference alone produces no gain. This is a useful empirical signal that latent compute is not free; the model needs to learn how to use the extra positions.
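A schematic of the mechanism, assuming toy dimensions. The learnable pause embeddings here only illustrate where the extra positions come from; per the finding above, they help only if the model was trained with them.

```python
# Pause tokens as learnable, content-free embeddings appended after the prompt,
# giving the model extra forward-pass positions before it must answer.
import torch
import torch.nn as nn

class PausePadding(nn.Module):
    def __init__(self, d_model: int = 64, num_pause: int = 8):
        super().__init__()
        self.pause = nn.Parameter(torch.randn(num_pause, d_model) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq, d_model)
        batch = token_embeddings.size(0)
        pause = self.pause.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([token_embeddings, pause], dim=1)   # compute-only positions

x = torch.randn(2, 10, 64)
x_with_pause = PausePadding()(x)   # (2, 18, 64); the answer is decoded after the pauses
```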

Looped Transformers

Giannou et al. (2023, arXiv:2301.13196) study transformers whose blocks are iterated $M$ times with shared weights. They give an explicit construction showing that a looped transformer simulates arbitrary Turing machines with polynomial overhead. This is the theoretical precedent for the recurrent-depth and Huginn-style architectures: more loops equals more compute equals a strictly larger class of functions the architecture can represent, at fixed parameter count.

Quiet-STaR

Quiet-STaR (Zelikman, Harik, Shao, Jayasiri, Haber, and Goodman 2024, arXiv:2403.09629) teaches a base language model to generate silent rationale tokens before each observed token, trained with a REINFORCE-style objective on next-token likelihood. The rationale tokens are generated but not shown to the user. This sits between discrete CoT and fully continuous latent reasoning: the intermediate computation is still discrete tokens, but they are silent and trained by reward rather than supervised demonstration. Quiet-STaR improves zero-shot reasoning on GSM8K and CommonsenseQA without task-specific fine-tuning.
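The shape of the training signal can be sketched as follows, assuming you already have per-example log-likelihoods of the observed next tokens with and without a sampled rationale. This is a schematic of the objective only, not the paper's implementation, which also uses a baseline, a mixing head, and parallel rationale sampling.

```python
# Quiet-STaR-style REINFORCE signal (schematic): reward a silent rationale by
# how much it improved the likelihood of the tokens that actually followed.
import torch

def rationale_reinforce_loss(logp_next_with: torch.Tensor,     # (batch,) log p(next tokens | ctx + rationale)
                             logp_next_without: torch.Tensor,  # (batch,) log p(next tokens | ctx)
                             rationale_logps: torch.Tensor     # (batch,) log p(rationale tokens | ctx)
                             ) -> torch.Tensor:
    reward = (logp_next_with - logp_next_without).detach()     # likelihood improvement
    return -(reward * rationale_logps).mean()                  # push up helpful rationales
```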

Expressivity and Empirical Scaling

The claim that latent reasoning is "strictly more expressive than a standard forward pass" is not a proven theorem about arbitrary latent-reasoning architectures. Two nearby results are well established: a formal expressivity gap for chain-of-thought augmented transformers (Merrill and Sabharwal 2024), and a constructive simulation result for looped transformers (Giannou et al. 2023). Together with the empirical scaling of Huginn (Geiping et al. 2025), these pin down what is and is not known.

Proposition

CoT Lifts Transformer Expressivity

Statement

Merrill and Sabharwal (2024) prove that a fixed-depth log-precision transformer decoding $T(n)$ intermediate tokens before the answer simulates computations in roughly $\mathrm{TIME}(T(n))$ on a Turing machine. With $T(n) = \mathrm{poly}(n)$ intermediate steps, the expressivity class lifts from $\mathrm{TC}^0$ (the base fixed-depth class) to $\mathrm{P/poly}$.

Intuition

A single forward pass of a fixed-depth transformer is parallel and shallow, so it sits inside $\mathrm{TC}^0$. Chaining $T(n)$ decoded steps gives the model a sequential tape to write on, turning it into a serial machine. Polynomially many steps suffice to simulate any polynomial-time algorithm.

Why It Matters

This is the expressivity backbone for the whole "more test-time compute equals more reasoning" story. The lift is driven by the number of sequential steps, not by whether those steps are emitted as tokens or threaded through hidden state. It motivates latent reasoning but does not by itself prove a latent scheme is more expressive than a CoT scheme with the same step count.

Failure Mode

The result is about representability, not learnability or sample complexity. It assumes log precision and exact decoding. It also assumes the intermediate steps are actually used: a transformer that ignores its scratchpad stays in $\mathrm{TC}^0$. For latent-only schemes (no tokens emitted), the expressivity gain depends on whether hidden-state iteration is a faithful surrogate for decoded steps. This is not settled in general.

Proposition

Looped Transformers Simulate Turing Machines

Statement

Giannou, Rajput, Sohn, Lee, Lee, and Papailiopoulos (2023) construct a looped transformer (a fixed transformer block applied for $M$ iterations) that simulates any Turing machine with polynomial blowup in $M$ and in width. The construction realizes instruction decoding, memory read/write, and branching inside attention and MLP layers.

Intuition

Looping a transformer block is the architectural analogue of running a program for more steps. Each iteration advances a simulated tape head and applies one instruction. More loops equals more compute equals more reasoning, in a literal Turing-machine sense.

Why It Matters

This is the cleanest theoretical justification for recurrent-depth architectures. It shows that, in principle, a latent loop can exploit extra test-time compute to solve harder problems, not just run longer. It does not say gradient descent finds such a solution, only that the architecture admits one.

Failure Mode

The construction is existence, not learnability. Training a looped transformer to behave as a universal interpreter requires curricula, step-level supervision, or very long training runs. Practical looped models exploit a narrow slice of this capacity.

Empirical Finding: Huginn Scales with Loop Depth

Geiping, McLeish, Jain, Kirchenbauer, Singh, Bartoldson, Kailkhura, Bhatele, and Goldstein (2025, arXiv:2502.05171) train Huginn-3.5B, an open-weights latent-reasoning model at scale. The architecture applies a shared recurrent block for a variable number of iterations $M$ at inference. They report that test-time accuracy on reasoning benchmarks (GSM8K, ARC, MMLU) improves as $M$ increases, without retraining and without emitting more tokens. This is an empirical scaling observation, not a theorem. The effective compute gain per loop eventually saturates, and the gain depends on the task having latent structure the recurrent block can exploit.

The Information Bottleneck Argument

Standard CoT forces all inter-step communication through discrete tokens. Each token is one of a finite vocabulary $\mathcal{V}$, so the channel capacity per step is at most $\log_2 |\mathcal{V}|$ bits (about 15 bits for a 32K vocabulary). Continuous hidden states sit in $\mathbb{R}^d$ with $d$ typically 4096 or more, so the raw channel capacity per step is much larger; a back-of-envelope comparison follows the caveats below.

The capacity argument is necessary but not sufficient. Three caveats matter:

  1. Effective dimension is much smaller than $d$. Trained transformer activations concentrate on a low-dimensional manifold; the usable information is a fraction of $d$, not the full coordinate count.
  2. Finite precision degrades the channel. In bf16 or fp16 the per-coordinate precision is roughly 8-11 bits, and rounding compounds across iterations.
  3. The 15-bit ceiling is only a bottleneck when the model needs more than 15 bits of inter-step state. For many reasoning chains, a single token carries enough state and the discrete channel is fine.
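The numbers from this section, worked out explicitly; the continuous figure is a raw upper bound that the caveats above shrink substantially.

```python
# Back-of-envelope channel capacities per reasoning step.
import math

vocab_size = 32_000
bits_per_token = math.log2(vocab_size)      # ~15 bits per discrete reasoning token

d = 4096
bits_per_coord = 16                         # fp16/bf16 storage width (nominal, not usable precision)
raw_bits_per_state = d * bits_per_coord     # 65,536 bits as a naive upper bound

print(f"discrete token:   ~{bits_per_token:.1f} bits")
print(f"continuous state:  {raw_bits_per_state} bits (raw, before the caveats above)")
```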

Discrete tokens also do work that the bandwidth argument ignores. They give an interpretable trace, they serve as a regularizer (the model must compress its reasoning into human-readable form), and they create the auditing surface that faithfulness research (Lanham 2023, Turpin 2023) inspects. Latent reasoning trades all three for higher channel capacity and lower output-token cost.

What Is Established vs Speculative

Established. Extra sequential computation can change what a transformer-like system can represent or solve. CoT expressivity theory shows that decoded intermediate steps can extend transformer computation. Looped-transformer work shows that repeated transformer blocks can implement program-like computation. Pause-token, Quiet-STaR, Coconut, and Huginn-style papers report empirical gains under specific training setups.

Not established. Latent reasoning is not a universal replacement for CoT, not automatically faithful, and not free compute. More hidden steps can waste runtime or amplify an internal error. If a benchmark gain is reported, ask whether the comparison controls for total compute, training data, context length, and the amount of hidden or visible reasoning allowed.

Challenges

Training difficulty. Latent reasoning requires learning what to compute in hidden space without the supervision signal that discrete CoT tokens provide. In standard CoT training, each intermediate token has a ground-truth target. In latent reasoning, only the final answer provides supervision. The model must discover useful intermediate representations on its own.

Interpretability loss. With discrete CoT, you can inspect the reasoning trace. With latent reasoning, the computation is opaque. This makes debugging, auditing, and alignment substantially harder.

Optimal depth selection. How many recurrence steps $M$ should the model use? Too few and the model cannot solve hard problems. Too many and compute is wasted on easy problems. Adaptive depth (dynamically choosing $M$ per input) is an open problem.
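One possible shape for adaptive depth, sketched below, is an ACT-style halting head that accumulates a stop probability each recurrence step and exits once it crosses a threshold. This is an illustration of the open problem, not a method from the papers cited on this page.

```python
# ACT-style adaptive recurrence depth (illustrative sketch only).
import torch
import torch.nn as nn

class AdaptiveRecurrence(nn.Module):
    def __init__(self, d_model: int = 64, max_steps: int = 16):
        super().__init__()
        self.f = nn.GRUCell(d_model, d_model)   # stand-in for the recurrent update
        self.halt = nn.Linear(d_model, 1)       # per-step halting score
        self.max_steps = max_steps

    def forward(self, z: torch.Tensor, threshold: float = 0.99):
        halt_budget = torch.zeros(z.size(0), 1)
        for step in range(1, self.max_steps + 1):
            z = self.f(z, z)                                   # one latent update
            halt_budget = halt_budget + torch.sigmoid(self.halt(z))
            if bool((halt_budget >= threshold).all()):         # easy inputs exit early
                break
        return z, step                                         # state and depth actually used
```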

Watch Out

Latent reasoning is not just deeper networks

Adding more layers to a standard transformer increases depth but also increases parameter count. Latent reasoning reuses the same parameters for multiple passes. The distinction matters: a 120-layer transformer has 10x the parameters of a 12-layer one, but a 12-layer recurrent transformer with 10 passes has the same parameters as the base 12-layer model.
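The arithmetic, with a hypothetical per-layer parameter count:

```python
# Same effective depth, very different parameter counts (illustrative numbers).
params_per_layer = 200e6                       # hypothetical per-layer size

static_120_layers = 120 * params_per_layer     # deeper network: 10x the parameters
recurrent_12x10 = 12 * params_per_layer        # 12 layers looped 10 times: base-model size
effective_depth = 12 * 10                      # 120 layer applications either way
print(f"{static_120_layers:.1e} vs {recurrent_12x10:.1e} parameters, effective depth {effective_depth}")
```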

Watch Out

Continuous thought is not soft prompting

Soft prompting learns continuous embeddings as inputs to the model. Continuous thought passes continuous hidden states between reasoning steps. Soft prompts are fixed after training. Continuous thought states are dynamically computed at inference time based on the input.

Watch Out

Hidden chain-of-thought is not automatically latent reasoning

Many production reasoning surfaces hide, summarize, or route intermediate reasoning differently from the final answer. That product surface does not by itself imply latent reasoning in the technical sense of this page. A hidden sequence of discrete rationale tokens is still token-space computation if the model samples it from the vocabulary. The latent-reasoning research line (Coconut, Huginn, recurrent-depth models) is about propagating continuous state or looping hidden computation between decoded tokens. Conflating the two confuses two test-time-compute axes: the number of decoded tokens and the number of recurrent or hidden-state updates.

Summary

  • Standard CoT couples reasoning depth to output length; latent reasoning decouples them
  • Recurrent depth: apply the same transformer blocks multiple times, trading compute for effective depth without adding parameters
  • Continuous thought (Coconut): replace discrete reasoning tokens with continuous hidden state propagation
  • Information capacity per reasoning step is much higher in continuous space than in discrete token space
  • Challenges: training is harder without intermediate supervision, interpretability is lost, and adaptive depth is unsolved
  • This is early-stage research with promising but setup-dependent results

Exercises

ExerciseCore

Problem

A standard transformer has $L = 24$ layers. A recurrent variant uses $L = 8$ layers with $M = 5$ recurrence steps. Compare their effective depth and parameter counts (assume each layer has the same number of parameters).

ExerciseAdvanced

Problem

A vocabulary of size $|\mathcal{V}| = 32000$ provides at most $\log_2(32000) \approx 15$ bits of information per discrete reasoning token. A continuous hidden state has dimension $d = 4096$ with 16-bit floating point entries. What is the theoretical maximum information per continuous reasoning step? Why is this comparison misleading in practice?

References

Canonical (latent and recurrent reasoning):

  • Hao et al. (2024). Training Large Language Models to Reason in a Continuous Latent Space (Coconut).
  • Geiping, McLeish, Jain, Kirchenbauer, Singh, Bartoldson, Kailkhura, Bhatele, and Goldstein (2025). Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach. arXiv:2502.05171.

Expressivity theory:

  • Merrill and Sabharwal (2024). The Expressive Power of Transformers with Chain of Thought. ICLR 2024.
  • Giannou, Rajput, Sohn, Lee, Lee, and Papailiopoulos (2023). Looped Transformers as Programmable Computers. arXiv:2301.13196.

Silent rationales and test-time compute:

  • Goyal, Ji, Rawat, Menon, Kumar, and Nagarajan (2024). Think Before You Speak: Training Language Models with Pause Tokens. arXiv:2310.02226, ICLR 2024.
  • Zelikman, Harik, Shao, Jayasiri, Haber, and Goodman (2024). Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking. arXiv:2403.09629.

Next Topics

The natural next steps from latent reasoning:

  • Multi-token prediction: another approach to planning ahead, but in token space rather than hidden state space
  • Memory systems for LLMs: long-term memory complements latent reasoning by providing persistent storage beyond the hidden state

Last reviewed: April 22, 2026
