LLM Construction
Multi-Token Prediction
Predicting k future tokens simultaneously using auxiliary prediction heads: forces planning, improves code generation, and connects to speculative decoding.
Prerequisites
- Transformer architecture
Why This Matters
Standard language models predict one token at a time. At each position $t$, the model produces a distribution $p_\theta(x_{t+1} \mid x_{1:t})$ over the next token given the prefix $x_{1:t}$. This is the autoregressive objective that underlies GPT, Llama, and nearly every decoder-only LLM.
Multi-token prediction changes the training objective: predict the next $k$ tokens simultaneously. This forces the model to plan ahead rather than making greedy, myopic predictions. Empirically, multi-token prediction improves performance on tasks requiring planning (code generation, mathematical reasoning) while providing a natural connection to speculative decoding at inference time.
Mental Model
Think of single-token prediction as a chess player who only considers the next move. Multi-token prediction is like requiring the player to announce the next $k$ moves in advance. The player must think further ahead, considering how each move constrains future options. The model learns internal representations that encode longer-horizon structure.
Formal Setup
Let $x_1, \dots, x_T$ be a sequence of $T$ tokens. The standard autoregressive loss is:

$$\mathcal{L}_{\text{AR}} = -\sum_{t=1}^{T-1} \log p_\theta(x_{t+1} \mid x_{1:t})$$
Multi-Token Prediction Objective
The multi-token prediction objective with lookahead $k$ uses $k$ independent prediction heads sharing a common trunk (transformer body). The loss is:

$$\mathcal{L}_{\text{MTP}} = -\sum_{t} \sum_{i=1}^{k} \log p_\theta^{(i)}(x_{t+i} \mid x_{1:t})$$

where positions with $t + i > T$ are dropped from the sum.
Each head $i$ predicts the token $i$ steps ahead, $x_{t+i}$, given the same hidden representation $h_t$ from the trunk at position $t$. The trunk is trained with gradients from all $k$ heads simultaneously.
Auxiliary Prediction Head
An auxiliary prediction head is a separate output layer (typically a linear projection to vocabulary logits) attached to the shared transformer trunk. Head $i$ maps the trunk's hidden state $h_t$ at position $t$ to a distribution over $x_{t+i}$. During inference, only head 1 (next-token) is required, but the other heads can serve as draft predictions for speculative decoding.
Architecture
The key architectural choice: the trunk is shared, but the heads are independent. Each head $i$ is a separate linear layer $W_i \in \mathbb{R}^{V \times d}$ mapping from the trunk's hidden dimension $d$ to the vocabulary size $V$.
During training, a single forward pass through the trunk produces hidden states $h_t$ at each position. Each head independently computes:

$$p_\theta^{(i)}(x_{t+i} = v \mid x_{1:t}) = \mathrm{softmax}(W_i h_t)_v$$

for vocabulary token $v$. The memory overhead is $k - 1$ additional weight matrices of size $V \times d$, which is small relative to the trunk.
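A minimal PyTorch sketch of the head stack and its training loss, assuming a trunk that already produces hidden states `h` of shape `(B, T, d)`; the class and method names (`MTPHeads`, `loss`) are illustrative, not taken from any particular codebase:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHeads(nn.Module):
    """k independent linear heads over a shared trunk state (sketch)."""

    def __init__(self, d_model: int, vocab_size: int, k: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size, bias=False) for _ in range(k)
        )

    def loss(self, h: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # h: (B, T, d) trunk hidden states; tokens: (B, T) input token ids.
        # Head i at position t predicts tokens[:, t + i], so the last i
        # positions have no target and are dropped for that head.
        T = tokens.size(1)
        total = h.new_zeros(())
        for i, head in enumerate(self.heads, start=1):
            if T - i <= 0:
                break
            logits = head(h[:, : T - i])               # (B, T - i, V)
            targets = tokens[:, i:]                    # (B, T - i)
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return total
```

Only head 1 is needed for plain autoregressive decoding; the others can be dropped after training or reused in the speculative loop sketched later.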
Gradient Structure
By linearity of the gradient and the chain rule, the gradient of the multi-token loss with respect to the trunk parameters $\phi$ decomposes as:

$$\nabla_\phi \mathcal{L}_{\text{MTP}} = \sum_{i=1}^{k} \nabla_\phi \mathcal{L}_i$$

where $\mathcal{L}_i$ is the loss from head $i$. This is an algebraic identity, not a substantive theorem. The engineering question is how to parallelize head gradients without stalling the trunk, and how to manage the activation memory of vocabulary-sized logit tensors.
The content is in the consequence. With single-token prediction, the trunk at position $t$ only needs to represent enough information to predict $x_{t+1}$. With multi-token prediction, the same representation $h_t$ must support predicting $x_{t+1}, \dots, x_{t+k}$. This is a strictly harder task that empirically produces richer internal representations. The trunk must encode not just "what comes next" but "what trajectory the sequence is on."
This explains why multi-token training improves downstream performance even when only the next-token head is used at inference. The auxiliary heads are scaffolding that can be discarded after training or repurposed for speculative decoding.
Failure mode. If the $i$-step-ahead prediction is nearly independent of the current context (high-entropy futures), the auxiliary heads contribute noisy gradients that may not improve trunk representations. This is more likely for large $i$ in domains with high local entropy, such as open-ended dialogue. The benefit is largest for structured domains like code, where future tokens are strongly constrained by the current context.
Speculative-Decoding Speedup Bound
Expected Speedup from k-Head Speculative Decoding
Statement
Let $N$ be the number of draft tokens accepted in one verification step under independent acceptance with probability $\alpha$ per position, up to $k$ drafts. Then

$$\mathbb{E}[N] = \sum_{i=1}^{k} \alpha^i = \frac{\alpha (1 - \alpha^k)}{1 - \alpha}$$

and the expected wall-clock speedup over single-token decoding is $1 + \mathbb{E}[N]$, which is bounded above by $\frac{1}{1 - \alpha}$ as $k \to \infty$.
Intuition
Acceptance is a geometric stopping time: draft $i$ survives only if drafts $1$ through $i$ were all accepted. The expected run-length is the sum of these survival probabilities. Adding more heads past the point where $\alpha^k$ is small brings diminishing returns, which is why the Medusa and EAGLE papers use $k$ in the range of 3 to 5 rather than scaling it arbitrarily.
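Spelled out, under the same independence assumption as the statement:

$$\mathbb{E}[N] = \sum_{i=1}^{k} \Pr(N \ge i) = \sum_{i=1}^{k} \alpha^i = \frac{\alpha (1 - \alpha^k)}{1 - \alpha}, \qquad \lim_{k \to \infty} \left(1 + \mathbb{E}[N]\right) = \frac{1}{1 - \alpha}.$$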
Why It Matters
The bound exposes the central trade-off: speedup is gated by the acceptance rate, not by the head count. Doubling $k$ from 4 to 8 buys little at a fixed acceptance rate; at $\alpha = 0.7$, for example, the expected number of accepted drafts only moves from about 1.77 to about 2.20 (checked numerically below). Engineering effort should go into raising $\alpha$ (better draft head quality, EAGLE-style feature conditioning, DeepSeek-V3-style MTP training) rather than stacking more heads.
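A quick numeric check of the diminishing-returns claim; the acceptance rate of 0.7 is an illustrative assumption, not a measured value:

```python
def expected_accepted(alpha: float, k: int) -> float:
    """E[N] = sum_{i=1..k} alpha^i under independent per-position acceptance."""
    return sum(alpha ** i for i in range(1, k + 1))

for k in (4, 8):
    e_n = expected_accepted(0.7, k)
    print(f"k={k}: E[N] ~ {e_n:.2f}, speedup ~ {1 + e_n:.2f}")
# k=4: E[N] ~ 1.77, speedup ~ 2.77
# k=8: E[N] ~ 2.20, speedup ~ 3.20
```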
Failure Mode
The constant, independent per-position acceptance assumption is optimistic. In practice, acceptance rates decay with draft depth (deeper heads are less accurate) and are correlated through the context: a hard context depresses acceptance across all draft positions at once. This makes the bound an upper estimate; measured speedups on real workloads typically fall below the idealized $1 + \mathbb{E}[N]$.
Connection to Speculative Decoding
Multi-token prediction provides a natural draft model for speculative decoding without needing a separate model. At inference time:
- The auxiliary heads (2 through $k$) generate $k - 1$ draft tokens in parallel (one forward pass)
- The next-token head verifies these drafts in the following forward pass
- Accept or reject using the standard speculative decoding rejection sampling scheme
This is called self-speculative decoding: the model is its own draft model. The advantage over external draft models is that the heads share the trunk's representation, so draft quality tends to be higher.
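A minimal sketch of one greedy self-speculative step, written against the hypothetical `trunk`/`heads` interface from the earlier snippet. It uses exact-match (greedy) verification rather than the full rejection-sampling scheme, so it illustrates the control flow rather than reimplementing any published system:

```python
import torch

@torch.no_grad()
def self_speculative_step(trunk, heads, prefix):
    """One greedy draft-then-verify step. `trunk` maps (1, T) token ids to
    (1, T, d) hidden states; heads[0] is the next-token head and heads[1:]
    are the draft heads (assumes at least one draft head)."""
    h = trunk(prefix)[:, -1]                              # (1, d) state at the last position
    next_tok = heads[0](h).argmax(-1)                     # committed next token
    drafts = [head(h).argmax(-1) for head in heads[1:]]   # k-1 parallel draft tokens

    # Verification pass: run the trunk once over prefix + committed token + drafts,
    # then check each draft against the next-token head's greedy choice there.
    candidate = torch.cat(
        [prefix, next_tok.view(1, 1), torch.stack(drafts, dim=1)], dim=1
    )
    h_all = trunk(candidate)
    accepted = [next_tok]
    for j, draft in enumerate(drafts):
        pos = prefix.size(1) + j      # state after consuming everything before draft j
        verified = heads[0](h_all[:, pos]).argmax(-1)
        if verified.item() != draft.item():
            # A full implementation would also emit `verified` here as a bonus token.
            break
        accepted.append(draft)
    return torch.cat([prefix, torch.stack(accepted, dim=1)], dim=1)
```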
DeepSeek-V3 (2024) is the flagship production example. It trains an MTP objective as a dense auxiliary signal and reuses the extra heads at inference for speculative decoding. Medusa (Cai et al., 2024) and EAGLE (Li et al., 2024) are the canonical multi-head speculative-decoding precedents; they target inference acceleration rather than training-signal density, and EAGLE adds a small autoregressive draft network on top of the trunk features.
Training Procedure and Memory Efficiency
Training with $k$ heads naively requires storing $k$ copies of the vocabulary logits at each position, which is $O(k \cdot T \cdot V)$ memory. For example, with $k = 4$, $V = 32{,}000$, and a sequence of $T = 4096$ tokens, this is $4 \times 4096 \times 32{,}000 \times 4$ bytes $\approx 2.1$ GB of logit activations in float32 per sequence, on top of the trunk's activations.
The memory-efficient approach computes the auxiliary losses sequentially. At each position $t$, the trunk state $h_t$ is computed once and stored. Then each head computes its logits, computes the cross-entropy loss, backpropagates through the head to get $\partial \mathcal{L}_i / \partial h_t$, and discards the logits. The gradient contributions from all heads are accumulated into $h_t$'s gradient before backpropagating through the trunk. This reduces peak memory from $O(k \cdot T \cdot V)$ to $O(T \cdot V)$ at the cost of running the $k$ head computations sequentially.
The trunk backward pass is unchanged: it receives the accumulated gradient and backpropagates normally. The computational cost increases by approximately the cost of forward and backward passes through the head layers, which is small relative to the trunk.
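A sketch of the sequential-accumulation trick, again using the hypothetical `trunk`/`heads` interface from the earlier snippets (and assuming $T > k$); only one vocabulary-sized logit tensor is alive at any time:

```python
import torch
import torch.nn.functional as F

def mtp_backward_memory_efficient(trunk, heads, tokens):
    """One training step's loss and backward pass with sequential heads."""
    h = trunk(tokens)                               # (B, T, d)
    h_det = h.detach().requires_grad_(True)         # cut the graph at the trunk output
    T = tokens.size(1)
    total = 0.0
    for i, head in enumerate(heads, start=1):       # head i predicts position t + i
        logits = head(h_det[:, : T - i])             # (B, T - i, V)
        loss_i = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            tokens[:, i:].reshape(-1),
        )
        # Backprop through this head only: populates the head's weight grads and
        # accumulates into h_det.grad; the logits are freed when the graph is released.
        loss_i.backward()
        total += loss_i.item()
    h.backward(h_det.grad)                           # single backward pass through the trunk
    return total
```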
Training cost breakdown for Llama-scale model
All FLOP counts use the forward+backward accounting (factor 6 per weight-token pair), per Hoffmann et al. 2022 / Kaplan et al. 2020 convention. Forward-only would be factor 2. The comparison only makes sense when both sides use the same accounting.
For a 7B-parameter model with $d = 4096$, $V = 32{,}000$, and $k = 4$ heads:
Trunk forward + backward: dominated by attention and MLP, approximately $6 \times 7\text{B} \approx 42$ GFLOP per token.
Head forward + backward: the three auxiliary heads hold $3 \times V \times d \approx 393$M parameters, so approximately $6 \times 393\text{M} \approx 2.4$ GFLOP per token.
The auxiliary heads therefore add about 5.6% to the per-token compute. The ratio is $3Vd / 7\text{B} \approx 0.056$, independent of the factor 6. The memory overhead (393M extra parameters) likewise adds about 5.6% to the model size. These are modest costs for the training-signal improvement.
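The arithmetic behind these numbers, with the head shapes and 7B trunk size stated above taken as given:

```python
d, V, k = 4096, 32_000, 4
trunk_params = 7e9
extra_head_params = (k - 1) * V * d           # heads 2-4: ~393M parameters
trunk_gflop = 6 * trunk_params / 1e9          # ~42 GFLOP per token, forward + backward
head_gflop = 6 * extra_head_params / 1e9      # ~2.4 GFLOP per token
print(extra_head_params / 1e6)                # ~393.2
print(head_gflop / trunk_gflop)               # ~0.056 -- independent of the factor 6
```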
When Multi-Token Prediction Helps
The benefit depends on the domain:
Code generation: strong improvement. Code has rigid syntactic structure where future tokens are highly constrained by the current context. Predicting $k$ tokens ahead forces the model to plan syntactic closures, variable usage, and control flow.
Mathematical reasoning: moderate improvement. Multi-step derivations benefit from planning, but the model must learn to sequence logical steps.
Open-ended text: marginal improvement. Natural language has high local entropy. The 5th token ahead is often weakly determined by the current position alone.
Multi-token prediction is not the same as beam search
Beam search explores multiple alternative continuations at each step. Multi-token prediction generates a single continuation but predicts multiple positions ahead. Beam search is an inference algorithm. Multi-token prediction is a training objective (and optionally an inference optimization via speculative decoding).
The auxiliary heads are not autoregressive with each other
Each head predicts from the trunk state $h_t$ independently. Head $i$ does not condition on the predictions of heads $1$ through $i - 1$. This independence is what makes parallel prediction possible, but it also limits the expressiveness of later heads: they cannot model dependencies between the predicted tokens themselves.
Summary
- Standard LLMs predict one token at a time; multi-token prediction predicts the next $k$ tokens using $k$ heads sharing a common trunk
- The trunk receives gradients from all heads, learning richer representations that encode longer-horizon structure
- Auxiliary heads can be repurposed as draft predictors for self-speculative decoding at inference
- Benefits are largest for structured domains (code, math) where future tokens are strongly constrained by context
- The heads predict independently. They do not condition on each other's outputs
Exercises
Problem
A model uses multi-token prediction with $k = 4$ heads. The trunk has hidden dimension $d = 4096$ and vocabulary size $V = 32{,}000$. How many additional parameters do the auxiliary heads add (heads 2, 3, 4), and what fraction is this of a 7B-parameter trunk?
Problem
Suppose you train with $k = 4$ but at inference only use head 1 (standard autoregressive decoding). Would you expect the model to outperform a model trained with $k = 1$ on code completion tasks? What about on open-ended story generation? Explain the mechanism.
Problem
Medusa (Cai et al., 2024, arXiv:2401.10774) augments a frozen base model with $k$ independent heads and accepts drafts under speculative-decoding rejection sampling. Assume token-level independent acceptance with probability $\alpha$ per draft position. Derive the expected number of accepted tokens per decoding step as a function of $k$ and $\alpha$, and the expected wall-clock speedup over single-token decoding when the verification pass dominates the cost.
References
Canonical:
- Gloeckle et al., "Better & Faster Large Language Models via Multi-token Prediction" (Meta, 2024), arXiv:2404.19737
- DeepSeek-AI, "DeepSeek-V3 Technical Report" (2024), arXiv:2412.19437. Uses MTP as an auxiliary training objective for dense supervision and at inference for speculative decoding.
Speculative-decoding heads:
- Cai et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads" (2024), arXiv:2401.10774
- Li et al., "EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty" (2024), arXiv:2401.15077
Reference architecture:
- Phuong & Hutter, "Formal Algorithms for Transformers" (2022), arXiv:2207.09238
Next Topics
The natural next steps from multi-token prediction:
- Latent reasoning: another approach to making models plan ahead, but in continuous hidden state space rather than token space
- Speculative decoding and quantization: the inference framework where multi-token prediction heads serve as draft models
Last reviewed: April 18, 2026