LLM Construction
Attention Variants and Efficiency
Multi-head, multi-query, grouped-query, linear, and sparse attention: how each variant trades expressivity for efficiency, and when to use which.
Why This Matters
Standard scaled dot-product attention has $O(n^2 d)$ time complexity, where $n$ is sequence length. Naive implementations also materialize the $n \times n$ attention matrix in HBM, costing $O(n^2)$ memory; tiled implementations like FlashAttention (Dao et al. 2022) preserve the $O(n^2 d)$ time count but reduce HBM traffic to $O(n^2 d^2 / M)$ for on-chip SRAM size $M$, so in practice the binding constraint on long sequences is typically memory bandwidth rather than FLOPs. For long sequences (documents, code, genomics), this remains the bottleneck. Different attention variants trade expressivity for efficiency, and the right choice depends on the deployment scenario: training throughput, inference latency, KV cache size, or sequence length.
Understanding these variants requires knowing exactly what each one changes and what it preserves.
Baseline: Multi-Head Attention (MHA)
Multi-Head Attention
Given $h$ attention heads, each head $i$ has its own projection matrices $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_k}$ where $d_k = d/h$:

$$\text{head}_i = \text{softmax}\!\left(\frac{(XW_i^Q)(XW_i^K)^\top}{\sqrt{d_k}}\right) XW_i^V, \qquad \text{MHA}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$

where $W^O \in \mathbb{R}^{d \times d}$ is the output projection.
Parameter count per layer: $4d^2$ (three projections $W^Q, W^K, W^V$ plus output $W^O$, each $d \times d$ in total across heads).
KV cache per token: $2 h d_k = 2d$ values must be stored for each token in the context. For a model with $L$ layers, the total KV cache per token is $2Ld$.
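A minimal NumPy sketch of this layout (illustrative only; the projection shapes and the per-layer cache count follow the definitions above, and all weights here are random placeholders):

```python
# Minimal multi-head attention forward pass in NumPy: project, split the d
# columns into h heads of width d_k, attend per head, concatenate, mix with W^O.
import numpy as np

def mha(X, Wq, Wk, Wv, Wo, h):
    """X: (n, d); Wq/Wk/Wv/Wo: (d, d); h heads with d_k = d // h."""
    n, d = X.shape
    d_k = d // h
    Q = (X @ Wq).reshape(n, h, d_k)
    K = (X @ Wk).reshape(n, h, d_k)
    V = (X @ Wv).reshape(n, h, d_k)
    heads = []
    for i in range(h):
        scores = Q[:, i, :] @ K[:, i, :].T / np.sqrt(d_k)        # (n, n)
        A = np.exp(scores - scores.max(axis=-1, keepdims=True))
        A /= A.sum(axis=-1, keepdims=True)                       # row softmax
        heads.append(A @ V[:, i, :])                             # (n, d_k)
    return np.concatenate(heads, axis=-1) @ Wo                   # (n, d)

rng = np.random.default_rng(0)
n, d, h = 8, 64, 4
X = rng.standard_normal((n, d))
W = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4)]
print(mha(X, *W, h=h).shape)          # (8, 64)
print("KV values cached per token per layer:", 2 * d)   # 2*h*d_k = 2d
```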
Multi-Head Expressivity
Statement
Each attention head produces a per-head output whose rank is bounded by $d_k$: each row of the attention-weighted value matrix $A_i V_i$ in head $i$ is a convex combination of rows of $V_i = XW_i^V$, so its rank is at most $d_k$. Concatenating $h$ heads gives a matrix of rank at most $h d_k = d$, and the output projection $W^O$ mixes these rank-$d_k$ pieces.
The structural observation is that MHA decomposes the layer's output into $h$ parallel rank-$d_k$ channels with independent attention patterns, rather than a single rank-$d$ channel driven by one pattern. Different heads can attend to different subspaces of the key-value space, producing outputs that no single head of dimension $d$ with one attention matrix can produce.
Intuition
Think of each head as asking a different question about the input. Head 1 might attend based on syntactic relationships (subject to verb). Head 2 might attend based on semantic similarity. Head 3 might attend to positional proximity. By running these in parallel and combining, the model captures multiple types of dependencies simultaneously.
Proof Sketch
A single softmax attention head produces a convex combination of value vectors along one rank-constrained channel, so it cannot simultaneously place sharp mass on two incompatible sets of positions (e.g., nearest token and most semantically similar token). With two heads, each can place sharp mass on one pattern, and $W^O$ linearly combines the two rank-$d_k$ outputs. Each per-head output matrix has rank at most $d_k$, and the concatenation has rank at most $h d_k = d$, so once the heads jointly span the model dimension the linear part of the layer is saturated. This is a heuristic for why returns diminish as $h$ grows at fixed $d$, not a hard expressivity theorem: each head still contributes its own input-dependent softmax pattern, so adding more narrow heads can in principle express attention combinations that a smaller number of heads cannot, even with the same total channel width. The rank-saturation argument should be read as motivating the standard $d_k = d/h$ choice, not as a proof that more heads strictly add nothing.
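A small numerical check of the rank bound (a sketch with random weights, separate from the argument above):

```python
# Each per-head output A_i V_i has rank at most d_k; the concatenation of
# h heads has rank at most h * d_k = d.
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 32, 16, 4
d_k = d // h
X = rng.standard_normal((n, d))

head_outputs = []
for _ in range(h):
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) for _ in range(3))
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d_k)
    A = np.exp(scores - scores.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)
    head_outputs.append(A @ (X @ Wv))               # (n, d_k)

concat = np.concatenate(head_outputs, axis=-1)      # (n, h*d_k) = (n, d)
print([np.linalg.matrix_rank(H) for H in head_outputs])  # each <= d_k = 4
print(np.linalg.matrix_rank(concat))                      # <= h*d_k = 16
```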
Why It Matters
This explains why multi-head attention outperforms single-head attention with the same total dimension. The computational cost is identical (same number of parameters and FLOPs), but the representational capacity increases.
Failure Mode
In practice, many heads learn redundant or near-identical patterns. Michel et al. (2019) showed that most heads can be pruned after training with minimal quality loss. The theoretical expressivity advantage is not always used. This observation motivates the reduced-head variants below.
Multi-Query Attention (MQA)
Multi-Query Attention
Multi-query attention (Shazeer, 2019) uses separate query projections $W_i^Q$ per head but shares a single $W^K$ and $W^V$ across all heads:

$$\text{head}_i = \text{softmax}\!\left(\frac{(XW_i^Q)(XW^K)^\top}{\sqrt{d_k}}\right) XW^V$$
All heads compute attention over the same key-value pairs but with different queries.
KV cache per token: $2 d_k$ (one set of keys and values instead of $h$ sets). This is an $h$-fold reduction compared to MHA.
Tradeoff: MQA reduces KV cache by a factor of $h$. For current production LLMs with $h$ in the 8-to-64 range (LLaMA-2-70B has $h = 64$, the LLaMA-3 family $h = 32$ to $64$, Mistral 7B $h = 32$, Gemma-7B $h = 16$), this is roughly an 8x to 64x cache reduction, enabling much longer contexts and larger batch sizes during inference. The older MQA experiments in Shazeer (2019), run on Transformer baselines with different head counts, saw the upper end of this range; modern decoder-only LLMs sit in the 8x to 32x band. Quality degrades slightly because heads can no longer attend to different subspaces of keys and values. For many tasks, this degradation is small (less than 1% on benchmarks).
Grouped-Query Attention (GQA)
Grouped-Query Attention
Grouped-query attention (Ainslie et al., 2023) divides the $h$ query heads into $g$ groups. Each group shares one set of key-value projections $W_{g(i)}^K, W_{g(i)}^V \in \mathbb{R}^{d \times d_k}$:

$$\text{head}_i = \text{softmax}\!\left(\frac{(XW_i^Q)(XW_{g(i)}^K)^\top}{\sqrt{d_k}}\right) XW_{g(i)}^V$$
where $g(i) = \lfloor i / (h/g) \rfloor$ maps head $i$ to its group. Heads are 0-indexed: $i \in \{0, \ldots, h-1\}$ and $g(i) \in \{0, \ldots, g-1\}$, so heads $i$ and $j$ share keys and values whenever $\lfloor i/(h/g) \rfloor = \lfloor j/(h/g) \rfloor$. $h$ must be divisible by $g$ in practice so that groups are balanced (each group owns exactly $h/g$ query heads).
KV cache per token: $2 g d_k$. When $g = h$, GQA is MHA. When $g = 1$, GQA is MQA.
GQA is the current standard for large language models (Llama 2/3, Mistral). It recovers most of MHA's quality while getting most of MQA's cache efficiency. Typical choices: $h = 32$ query heads with $g = 8$ KV groups (4x cache reduction).
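A sketch of the usual implementation trick, assuming $h$ is divisible by $g$: compute $g$ KV heads and repeat each one $h/g$ times so the per-head attention code is unchanged (the function and variable names here are illustrative, not a reference implementation):

```python
# Grouped-query attention with h query heads and g KV groups. Only the g KV
# heads are cached: 2 * g * d_k values per token per layer.
import numpy as np

def gqa(X, Wq, Wk, Wv, Wo, h, g):
    """X: (n, d); Wq, Wo: (d, d); Wk, Wv: (d, g*d_k) with d_k = d // h."""
    n, d = X.shape
    d_k = d // h
    Q = (X @ Wq).reshape(n, h, d_k)
    K = (X @ Wk).reshape(n, g, d_k)
    V = (X @ Wv).reshape(n, g, d_k)
    # g(i) = i // (h // g): repeat each KV group h//g times along the head axis.
    K = np.repeat(K, h // g, axis=1)          # (n, h, d_k)
    V = np.repeat(V, h // g, axis=1)
    outs = []
    for i in range(h):
        s = Q[:, i] @ K[:, i].T / np.sqrt(d_k)
        A = np.exp(s - s.max(-1, keepdims=True))
        A /= A.sum(-1, keepdims=True)
        outs.append(A @ V[:, i])
    return np.concatenate(outs, axis=-1) @ Wo

rng = np.random.default_rng(0)
n, d, h, g = 8, 64, 8, 2
X = rng.standard_normal((n, d))
Wq, Wo = rng.standard_normal((d, d)), rng.standard_normal((d, d))
Wk = rng.standard_normal((d, g * (d // h)))
Wv = rng.standard_normal((d, g * (d // h)))
print(gqa(X, Wq, Wk, Wv, Wo, h, g).shape)     # (8, 64)
# g = h recovers MHA; g = 1 recovers MQA.
```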
Summary of Head-Sharing Variants
| Variant | KV heads | KV cache per token | Quality | Inference speed |
|---|---|---|---|---|
| MHA | $h$ | $2hd_k = 2d$ | Best | Slowest (large cache) |
| GQA | $g$ ($1 < g < h$) | $2gd_k$ | Near MHA | Fast |
| MQA | $1$ | $2d_k$ | Slightly worse | Fastest |
Linear Attention
Standard attention computes $\text{softmax}(QK^\top/\sqrt{d_k})\,V$, which requires materializing the $n \times n$ attention matrix. Linear attention replaces the softmax with a kernel function to avoid this.
Per-token autoregressive-decode cost is $O(d_\phi d)$, constant in sequence length $n$. This is the practical motivation. Causal linear attention admits a recurrent formulation: maintain a running state $S_t = \sum_{j \le t} \phi(k_j) v_j^\top \in \mathbb{R}^{d_\phi \times d}$ and a running $d_\phi$-vector $z_t = \sum_{j \le t} \phi(k_j)$, updated in $O(d_\phi d)$ per step. Training and prefill over a length-$n$ sequence still cost $O(n\, d_\phi\, d)$ (one update per position); the $O(1)$-in-$n$ advantage only applies to per-step generation after the recurrent state is warmed up, and only relative to the softmax cache read cost of $O(nd)$ per step. Softmax attention, by contrast, grows linearly per decoding step because it reads the entire KV cache.
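A minimal sketch of this recurrence, using the elu+1 feature map discussed below; the function name and the small denominator floor are illustrative choices, not a reference implementation:

```python
# Recurrent (decode-time) form of causal linear attention: a state S of shape
# (d_phi, d) and a normalizer z of shape (d_phi,), updated in O(d_phi * d)
# per token regardless of context length.
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1, elementwise > 0

def linear_attn_decode(qs, ks, vs):
    d_phi, d = phi(ks[0]).shape[0], vs.shape[-1]
    S = np.zeros((d_phi, d))
    z = np.zeros(d_phi)
    outs = []
    for q, k, v in zip(qs, ks, vs):
        fk = phi(k)
        S += np.outer(fk, v)                     # accumulate phi(k_j) v_j^T
        z += fk                                  # accumulate phi(k_j)
        fq = phi(q)
        outs.append((fq @ S) / (fq @ z + 1e-6))  # small floor for stability
    return np.array(outs)

rng = np.random.default_rng(0)
qs, ks, vs = rng.standard_normal((3, 5, 4))
print(linear_attn_decode(qs, ks, vs).shape)      # (5, 4)
```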
Linear Attention as Kernel Approximation
Statement
Define $\phi: \mathbb{R}^{d_k} \to \mathbb{R}_{\ge 0}^{d_\phi}$ as a feature map. Linear attention replaces the softmax attention with:

$$\text{LinAttn}(Q, K, V)_t = \frac{\phi(q_t)^\top \sum_j \phi(k_j)\, v_j^\top}{\phi(q_t)^\top \sum_j \phi(k_j)}$$
The key observation: $\sum_j \phi(k_j) v_j^\top = \phi(K)^\top V$ is a $d_\phi \times d$ matrix that can be computed once and reused for all queries. This gives $O(n\, d_\phi\, d)$ complexity instead of $O(n^2 d)$.
Intuition
Standard attention computes pairwise similarity between all query-key pairs, then uses these similarities to weight values. Linear attention factorizes this: first summarize all key-value pairs into a fixed-size matrix, then match each query against this summary. The summary is a compressed representation of the entire context.
Proof Sketch
Write the standard kernel attention: $\text{Attn}(Q, K, V)_t = \dfrac{\sum_j \text{sim}(q_t, k_j)\, v_j}{\sum_j \text{sim}(q_t, k_j)}$ where $\text{sim}(q, k) = \exp(q^\top k / \sqrt{d_k})$ recovers softmax attention. Substitute the kernel decomposition $\text{sim}(q, k) = \phi(q)^\top \phi(k)$ and rearrange: the numerator becomes $\phi(q_t)^\top \sum_j \phi(k_j)\, v_j^\top$, which separates the query from the key-value aggregation.
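A quick numerical check of this rearrangement (non-causal for simplicity, random inputs, elu+1 as the feature map; the variable names are illustrative):

```python
# The O(n^2) kernel-attention result equals the factorized form
# phi(Q) @ (phi(K)^T V) normalized by phi(Q) @ sum_j phi(k_j).
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = rng.standard_normal((3, n, d))
phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))    # elu(x) + 1

# Quadratic form: explicit n x n similarity matrix.
sim = phi(Q) @ phi(K).T                                # (n, n)
out_quadratic = (sim @ V) / sim.sum(-1, keepdims=True)

# Factorized form: summarize keys/values once, then match every query.
KV = phi(K).T @ V                                      # (d_phi, d)
z = phi(K).sum(0)                                      # (d_phi,)
out_linear = (phi(Q) @ KV) / (phi(Q) @ z)[:, None]

print(np.allclose(out_quadratic, out_linear))          # True
```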
Why It Matters
Linear attention achieves $O(n)$ complexity in sequence length (for fixed $d$ and $d_\phi$). This enables attention over sequences of length $10^5$ or more, which is impractical with standard attention.
Failure Mode
The feature map $\phi$ must map into the non-negative orthant for the denominator to be a valid (positive) normalizer: otherwise cancellations can produce near-zero or negative denominators and the output blows up or flips sign. Naive random Fourier features take values in $[-1, 1]$ and violate this. Two standard safe choices: $\phi(x) = \mathrm{elu}(x) + 1$ (Katharopoulos et al. 2020), elementwise and non-negative; and Performer's positive random features (Choromanski et al. 2020), which give an unbiased estimate of the softmax kernel while staying non-negative.
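A sketch of the positive-random-feature identity behind Performer: with $w \sim \mathcal{N}(0, I)$, $\phi(x) = \exp(w^\top x - \lVert x \rVert^2/2)/\sqrt{m}$ is strictly positive and gives an unbiased Monte Carlo estimate of $\exp(q^\top k)$ (the feature count $m$ and the input scaling below are arbitrary choices for the demo):

```python
# Positive random features: a strictly positive, unbiased estimator of the
# (unscaled) softmax kernel exp(q . k).
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 100_000                        # more features -> lower variance
q, k = rng.standard_normal((2, d)) * 0.5
W = rng.standard_normal((m, d))          # random projections w_1, ..., w_m

def prf(x):
    return np.exp(W @ x - x @ x / 2.0) / np.sqrt(m)   # all entries > 0

print(np.exp(q @ k))        # exact kernel value
print(prf(q) @ prf(k))      # Monte Carlo estimate, close to the exact value
```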
Even with a valid feature map, $\phi(q)^\top \phi(k)$ must approximate the softmax kernel well. On language modeling benchmarks, linear attention models consistently underperform softmax attention by a meaningful margin. The $n \times n$ matrix computed by softmax attention contains fine-grained pairwise information that the $d_\phi \times d$ summary cannot fully capture when $d_\phi \ll n$.
Sparse Attention
Instead of replacing softmax, sparse attention restricts which query-key pairs are computed.
Local window attention: each token attends only to the $w$ nearest tokens. Complexity: $O(nwd)$. Captures local dependencies but misses long-range ones (a mask sketch appears after the patterns below).
Strided attention: attend to every $s$-th token plus a local window. Captures periodic structure and some long-range dependencies.
Hash-based (LSH) attention (Kitaev et al., 2020): use locality-sensitive hashing to find the most similar keys for each query. Only compute attention for likely-high-similarity pairs. Expected complexity: $O(n \log n \cdot d)$.
Block-sparse patterns: combine local windows with a few global tokens that attend to everything. Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020) use this approach.
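A sketch of the causal local-window mask from the first pattern above (window size $w$; the helper name is illustrative):

```python
# Causal local-window mask: token t may attend to positions s with
# t - w < s <= t, so each row has at most w allowed entries and the score
# computation drops from O(n^2) to O(n * w).
import numpy as np

def local_causal_mask(n, w):
    idx = np.arange(n)
    return (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < w)

print(local_causal_mask(6, 3).astype(int))
```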
The practical problem: sparse patterns must be known in advance or computed cheaply. For autoregressive generation, where the attention pattern depends on the content being generated, fixed sparse patterns may miss important long-range dependencies.
Mistral 7B (Jiang et al. 2023) combines sliding-window attention with a rolling KV buffer, trading exact long-range coverage for practical throughput. The attention-sink strategy from StreamingLLM (Xiao et al. 2023) is an inference-time KV-retention trick and is not part of Mistral's trained architecture, though the two can be composed at deployment time. DeepSeek-AI (2025) proposed Native Sparse Attention, training the sparse pattern jointly with the model rather than fixing it by hand.
Hardware-Aware and Deployment Variants
A parallel line of work leaves the mathematical definition of attention unchanged and attacks the memory hierarchy instead. This matters because on modern GPUs, nominally more efficient variants often lose to a well-implemented quadratic kernel.
FlashAttention (Dao et al. 2022, Dao 2023 for v2, Shah et al. 2024 for v3) fuses the attention computation into a single tiled kernel that never materializes the $n \times n$ attention matrix in HBM. Standard MHA with FlashAttention is frequently faster than linear or sparse variants because it is memory-bound, not FLOP-bound. This reframes the efficiency conversation: the asymptotic $O(n^2)$ FLOP count is not the binding constraint at typical lengths; HBM traffic is.
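FlashAttention's exactness rests on the online-softmax recurrence; below is a single-query NumPy sketch of that recurrence only (the real kernel also tiles queries and runs fused on-chip, so this is the algebraic core, not the implementation):

```python
# Online softmax over key/value blocks: keep a running max m, running sum l,
# and running weighted output, and never build the full score row.
import numpy as np

def online_softmax_attention(q, K, V, block=64):
    m = -np.inf                      # running max of scores seen so far
    l = 0.0                          # running sum of exp(score - m)
    acc = np.zeros(V.shape[-1])
    scale = 1.0 / np.sqrt(q.shape[-1])
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q * scale       # scores for this block
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)               # rescale old running stats
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ V[start:start + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
n, d = 1000, 32
q, K, V = rng.standard_normal(d), rng.standard_normal((n, d)), rng.standard_normal((n, d))

scores = K @ q / np.sqrt(d)                          # reference: full softmax
weights = np.exp(scores - scores.max())
weights /= weights.sum()
print(np.allclose(online_softmax_attention(q, K, V), weights @ V))   # True
```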
PagedAttention (Kwon et al. 2023, the kernel behind vLLM) manages the KV cache as a paged virtual-memory allocator, cutting fragmentation and enabling high-throughput batched serving. It is orthogonal to MHA/GQA/MQA and composes with all of them.
Ring Attention (Liu et al. 2023) shards the KV sequence across devices and overlaps communication with computation in a ring, enabling context parallelism for sequences that do not fit in one device's memory.
Multi-head Latent Attention (MLA) (DeepSeek-V2, DeepSeek-AI 2024; DeepSeek-V3 2024) compresses keys and values into a low-rank latent vector per token, caching only the latent. This gives a KV cache smaller than MQA's while recovering near-MHA quality, and was the main inference-cost lever behind DeepSeek-V2/V3 at deployment scale.
Differential Attention (Ye et al. 2024) computes two softmax attention maps and takes their difference, reducing attention noise on irrelevant tokens. The math is standard attention applied twice; the change is in how the outputs are combined.
Hybrid architectures replace some attention layers with selective state space models. Mamba (Gu and Dao 2023) and Mamba-2 (Dao and Gu 2024) give sequence mixing with input-dependent dynamics. Jamba (Lieber et al. 2024) and Zamba interleave Mamba blocks with a minority of attention layers, trading some recall ability for long-context throughput.
Common Confusions
MQA does not reduce training FLOPs
MQA and GQA save memory for the KV cache during inference, which allows larger batch sizes and longer contexts. During training, the dominant cost is the matrix multiplications for $QK^\top$ and the attention-weighted sum over values, which are the same for all variants (the shared K, V projections are simply broadcast). The training speedup from MQA is modest; the inference speedup is large.
Linear attention is not just softmax without exp
Dropping the exponential from softmax gives unnormalized attention, not linear attention. Linear attention specifically uses a kernel decomposition $\text{sim}(q, k) = \phi(q)^\top \phi(k)$ to factorize the computation. The feature map $\phi$ is what makes the factorization possible. Simply removing exp would give negative attention weights and no computational benefit.
Asymptotic complexity is not the binding constraint
Standard attention is $O(n^2)$ in both time and memory, but on current GPUs it is usually memory-bandwidth bound, not FLOP bound. FlashAttention (Dao et al. 2022) made standard MHA faster than many "more efficient" linear and sparse variants at sequence lengths up to tens of thousands, because it reduces HBM traffic rather than FLOPs. When comparing variants, compare wall-clock time with a tuned kernel, not asymptotic complexity alone.
Canonical Examples
KV cache savings with GQA
Consider a model with $d = 4096$, $h = 32$ heads ($d_k = 128$), $L = 32$ layers, serving at sequence length $n = 8192$ in float16 (2 bytes per value). The per-sequence KV cache is $2 \cdot L \cdot n \cdot (\text{KV heads}) \cdot d_k \cdot 2$ bytes. MHA: 4.29 GB. GQA with $g = 8$: 1.07 GB (4x reduction, as expected from $h/g = 4$). MQA: 0.134 GB (32x reduction). On an 80 GB GPU, the difference between 4.29 GB and 0.134 GB per sequence determines whether you can serve roughly 10 or 500 concurrent requests at this length. These are the numbers that drive deployment-time variant choice.
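The same arithmetic as a small script, so the numbers can be recomputed for other shapes (the config values are the ones assumed in the example above):

```python
# KV cache size per sequence: 2 (keys + values) * kv_heads * d_k * layers
# * sequence length * bytes per value.
def kv_cache_gb(d_model, n_heads, kv_heads, n_layers, seq_len, bytes_per=2):
    d_k = d_model // n_heads
    return 2 * kv_heads * d_k * n_layers * seq_len * bytes_per / 1e9

d, h, L, n = 4096, 32, 32, 8192
for name, g in [("MHA", 32), ("GQA g=8", 8), ("MQA", 1)]:
    print(f"{name}: {kv_cache_gb(d, h, g, L, n):.3f} GB")
# MHA: 4.295 GB, GQA g=8: 1.074 GB, MQA: 0.134 GB
```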
Exercises
Problem
A Transformer has $h$ query heads and uses GQA with $g$ KV groups ($g$ divides $h$). How many query heads share each KV group? What is the KV cache reduction factor compared to standard MHA?
Problem
Show that standard softmax attention can be written as kernel attention: $\text{Attn}(Q, K, V)_t = \dfrac{\sum_j \kappa(q_t, k_j)\, v_j}{\sum_j \kappa(q_t, k_j)}$ with $\kappa(q, k) = \exp(q^\top k / \sqrt{d_k})$. Why can this kernel not be exactly decomposed as $\kappa(q, k) = \phi(q)^\top \phi(k)$ for a finite-dimensional $\phi$?
References
Canonical:
- Vaswani et al., "Attention Is All You Need" (NeurIPS 2017), Section 3.2
- Shazeer, "Fast Transformer Decoding: One Write-Head is All You Need" (arXiv 2019). MQA.
Head-sharing variants:
- Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (EMNLP 2023)
- DeepSeek-AI, "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" (arXiv 2024), Section 2.1. Multi-head Latent Attention.
- DeepSeek-AI, "DeepSeek-V3 Technical Report" (arXiv 2024), Section 2. MLA at deployment scale.
Linear attention:
- Katharopoulos et al., "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" (ICML 2020). ELU+1 feature map and the causal recurrence.
- Choromanski et al., "Rethinking Attention with Performers" (ICLR 2021). Positive random features for unbiased softmax-kernel estimation.
- Wang et al., "Linformer: Self-Attention with Linear Complexity" (arXiv 2020). Low-rank projection of keys and values.
Sparse and hybrid:
- Kitaev et al., "Reformer: The Efficient Transformer" (ICLR 2020). LSH attention.
- Beltagy et al., "Longformer: The Long-Document Transformer" (arXiv 2020)
- Zaheer et al., "Big Bird: Transformers for Longer Sequences" (NeurIPS 2020)
- Jiang et al., "Mistral 7B" (arXiv 2023), Section 2. Sliding-window attention and rolling KV buffer.
- DeepSeek-AI, "Native Sparse Attention" (arXiv 2024). Trainable sparse pattern.
- Ye et al., "Differential Transformer" (arXiv 2024). Noise-cancelling attention.
Hardware-aware and deployment:
- Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (NeurIPS 2022)
- Dao, "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" (arXiv 2023)
- Shah et al., "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision" (arXiv 2024)
- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023). vLLM.
- Liu et al., "Ring Attention with Blockwise Transformers for Near-Infinite Context" (arXiv 2023)
State-space hybrids:
- Gu and Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (arXiv 2023)
- Dao and Gu, "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality" (ICML 2024). Mamba-2.
- Lieber et al., "Jamba: A Hybrid Transformer-Mamba Language Model" (arXiv 2024)
Pruning and redundancy:
- Michel, Levy, Neubig, "Are Sixteen Heads Really Better than One?" (NeurIPS 2019). Empirical redundancy of heads.
Next Topics
- KV cache: how these attention variants affect the memory and speed of autoregressive generation
Last reviewed: April 26, 2026