LLM Construction
Efficient Transformers Survey
Sub-quadratic attention variants (linear attention, Linformer, Performer, Longformer, BigBird) and why FlashAttention, a hardware-aware exact method, made most of them unnecessary in practice.
Prerequisites
- Attention Variants and Efficiency
Why This Matters
Standard self-attention computes all pairwise interactions for a sequence of length $n$, costing $O(n^2)$ time and $O(n^2)$ memory. For long sequences (documents, genomes, high-resolution images), this quadratic cost is prohibitive.
Between 2019 and 2022, dozens of papers proposed sub-quadratic attention variants. Most are now rarely used. FlashAttention (2022) showed that the bottleneck was memory bandwidth, not FLOPs, and that exact attention with IO-aware tiling is faster in practice than most approximate methods up to sequences of length 8K-16K. Understanding why this happened teaches more about systems-level thinking than about any single algorithm.
The Quadratic Bottleneck
For queries $Q \in \mathbb{R}^{n \times d}$, keys $K \in \mathbb{R}^{n \times d}$, and values $V \in \mathbb{R}^{n \times d}$:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V$$

The matrix $QK^\top$ has $n^2$ entries. Storing it costs $O(n^2)$ memory. For $n = 100{,}000$, the attention matrix alone requires 40 GB in float32 ($n^2 \times 4$ bytes). The question: can we avoid materializing this matrix?
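To make the cost concrete, here is a minimal NumPy sketch of the naive computation; the $n \times n$ score matrix is built explicitly, which is exactly the object the methods below try to avoid (the sequence length and head dimension are illustrative):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard softmax attention, materializing the full n x n score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n, n) -- the quadratic object
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                              # (n, d)

n, d = 4096, 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d)).astype(np.float32)
print(naive_attention(Q, K, V).shape)               # (4096, 64)
print(f"score matrix alone: {n * n * 4 / 1e9:.2f} GB in float32 at n={n}")
```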
Linear Attention
Linear Attention via Kernel Decomposition
Statement
Replace the softmax kernel $\exp(q_i^\top k_j)$ with a feature map $\phi : \mathbb{R}^d \to \mathbb{R}^m$ such that $\exp(q_i^\top k_j) \approx \phi(q_i)^\top \phi(k_j)$. Then:

$$\mathrm{Attn}(Q, K, V)_i = \frac{\sum_j \phi(q_i)^\top \phi(k_j)\, v_j}{\sum_j \phi(q_i)^\top \phi(k_j)} = \frac{\phi(q_i)^\top \left(\sum_j \phi(k_j)\, v_j^\top\right)}{\phi(q_i)^\top \left(\sum_j \phi(k_j)\right)}$$

The sums $S = \sum_j \phi(k_j)\, v_j^\top \in \mathbb{R}^{m \times d}$ and $z = \sum_j \phi(k_j) \in \mathbb{R}^m$ can be precomputed in $O(nmd)$ time. Each attention output costs $O(md)$, giving total cost $O(nmd)$, which is linear in $n$ if the feature dimension $m$ is fixed.
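A minimal sketch of the factored computation, using $\phi(x) = \mathrm{elu}(x) + 1$ from Katharopoulos et al. as one concrete non-negative feature map (here $m = d$; the choice of $\phi$ and the shapes are illustrative):

```python
import numpy as np

def phi(x):
    """One concrete feature map, elu(x) + 1; any non-negative map with the
    factorization property could be substituted."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """O(n * m * d) attention: aggregate keys/values first, then query the summary."""
    Qf, Kf = phi(Q), phi(K)                 # (n, m), with m = d for this phi
    S = Kf.T @ V                            # (m, d)  -- sum_j phi(k_j) v_j^T
    z = Kf.sum(axis=0)                      # (m,)    -- sum_j phi(k_j)
    return (Qf @ S) / (Qf @ z)[:, None]     # (n, d), no n x n matrix anywhere

n, d = 4096, 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
print(linear_attention(Q, K, V).shape)      # (4096, 64)
```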
Intuition
Instead of computing the full $n \times n$ attention matrix and then multiplying by $V$, change the order of operations. First aggregate key-value pairs into a compact summary (the $m \times d$ matrix $S = \sum_j \phi(k_j)\, v_j^\top$), then query this summary. This is the associativity trick: $(\phi(Q)\phi(K)^\top)\, V = \phi(Q)\, (\phi(K)^\top V)$, up to normalization.
Proof Sketch
Write $\mathrm{Attn}(Q, K, V)_i = \frac{\sum_j \mathrm{sim}(q_i, k_j)\, v_j}{\sum_j \mathrm{sim}(q_i, k_j)}$. Substitute $\mathrm{sim}(q_i, k_j) = \phi(q_i)^\top \phi(k_j)$ and factor $\phi(q_i)^\top$ out of both sums.
Why It Matters
Linear attention reduces the complexity from $O(n^2 d)$ to $O(nmd)$. For fixed $m$, this is a large speedup for very long sequences. The idea also enables recurrent computation: process tokens one by one, updating $S$ and $z$ incrementally, giving constant memory per step.
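The same factorization run as a recurrence, for the causal/autoregressive case: a sketch that updates the running state $(S, z)$ one token at a time, so memory per step is constant regardless of how many tokens have been seen (same assumed feature map as above):

```python
import numpy as np

def phi(x):
    # Same non-negative feature map as above (elu(x) + 1).
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V):
    """Recurrent form: the state (S, z) summarizes all past key-value pairs."""
    n, d = Q.shape
    m = phi(K[0]).shape[0]
    S = np.zeros((m, d))            # running sum of phi(k_j) v_j^T
    z = np.zeros(m)                 # running sum of phi(k_j)
    out = np.empty_like(V)
    for t in range(n):
        kf, qf = phi(K[t]), phi(Q[t])
        S += np.outer(kf, V[t])     # fold token t into the state
        z += kf
        out[t] = (qf @ S) / (qf @ z)
    return out
```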
Failure Mode
The approximation quality depends on $\phi$. No known feature map perfectly approximates softmax attention. The approximation error degrades attention's ability to do sharp, sparse lookups (e.g., copying a specific token). In practice, linear attention models underperform standard attention on tasks requiring precise retrieval from context.
Sparse Attention Methods
Rather than approximating the softmax, sparse methods compute exact attention on a subset of the $n^2$ query-key pairs.
Longformer (Beltagy et al., 2020): Each token attends to a fixed-size local window (e.g., 512 tokens around it) plus a small set of global tokens. Cost is $O(nw)$ where $w$ is the window size. Good for document-level tasks where local context dominates.
BigBird (Zaheer et al., 2020): Combines local window attention, global tokens, and random attention edges. Proved that this sparse pattern is a universal approximator of sequence-to-sequence functions and is Turing complete, matching the theoretical expressiveness of full attention.
Block-sparse attention: Partition tokens into blocks and attend only within blocks (or within a structured pattern of blocks). Efficient on GPUs because block operations map well to hardware. This is not the same as FlashAttention: FlashAttention computes the full, dense exact softmax attention and only uses block-wise tiling of the same arithmetic to fit the inner products in fast on-chip SRAM. Block-sparse attention drops pairs; FlashAttention drops nothing.
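A sketch of what such a sparse pattern looks like as a mask, combining a local window, a few global tokens, and random edges in the BigBird style; the window size, number of global tokens, and random-edge count are illustrative, and a real kernel would compute only the allowed entries rather than building a dense mask:

```python
import numpy as np

def sparse_attention_mask(n, window=64, n_global=2, n_random=3, seed=0):
    """Boolean (n, n) mask: local window + global tokens + random edges.
    Dense here only for illustration; a sparse kernel touches only the True
    entries, so the cost is O(n * (window + n_global + n_random))."""
    rng = np.random.default_rng(seed)
    i = np.arange(n)
    mask = np.abs(i[:, None] - i[None, :]) <= window // 2   # local band
    mask[:n_global, :] = True                                 # global tokens attend everywhere
    mask[:, :n_global] = True                                 # everyone attends to global tokens
    for row in range(n):                                      # random long-range edges
        mask[row, rng.integers(0, n, size=n_random)] = True
    return mask

m = sparse_attention_mask(n=1024)
print(f"fraction of pairs kept: {m.mean():.3f}")              # far below 1.0
```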
Low-Rank Approximations
Linformer (Wang et al., 2020): Project keys and values to a lower dimension $k \ll n$ using learned projection matrices $E, F \in \mathbb{R}^{k \times n}$:

$$\mathrm{Attention}(Q, K, V) \approx \mathrm{softmax}\!\left(\frac{Q\,(EK)^\top}{\sqrt{d}}\right) FV$$

Cost is $O(nk)$ instead of $O(n^2)$. The claim: the attention matrix is approximately low-rank for most practical inputs, so projecting to rank $k$ loses little information.
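A sketch of the Linformer computation with random matrices standing in for the learned projections $E, F$ (in the actual model they are trained parameters, and $k$ is a hyperparameter; the values below are illustrative):

```python
import numpy as np

def linformer_attention(Q, K, V, E, F):
    """Scores are (n, k) instead of (n, n); E, F are (k, n) projections."""
    d = Q.shape[-1]
    K_proj, V_proj = E @ K, F @ V                     # (k, d) each
    scores = Q @ K_proj.T / np.sqrt(d)                # (n, k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V_proj                           # (n, d)

n, d, k = 4096, 64, 256
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
E, F = rng.standard_normal((2, k, n)) / np.sqrt(n)    # stand-ins for learned projections
print(linformer_attention(Q, K, V, E, F).shape)       # (4096, 64)
```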
Performer (Choromanski et al., 2021): Uses positive random features (the FAVOR+ scheme) to give an unbiased estimate of the exponential dot-product kernel that defines softmax attention:

$$\exp(q^\top k) = \mathbb{E}_{\omega \sim \mathcal{N}(0, I_d)}\!\left[\exp\!\left(\omega^\top q - \tfrac{\|q\|^2}{2}\right)\exp\!\left(\omega^\top k - \tfrac{\|k\|^2}{2}\right)\right] \approx \phi(q)^\top \phi(k),$$

where $\phi(x) = \frac{1}{\sqrt{m}}\, e^{-\|x\|^2/2}\left(e^{\omega_1^\top x}, \ldots, e^{\omega_m^\top x}\right)$ with random $\omega_1, \ldots, \omega_m \sim \mathcal{N}(0, I_d)$. The $1/\sqrt{d}$ factor that appears in the standard scaled-dot-product attention is absorbed into the query/key pre-normalization, not into the covariance of $\omega$. This is not the classical Rahimi-Recht random Fourier feature construction (which targets shift-invariant kernels via cosines): the exponential dot-product kernel is not shift-invariant, and the trigonometric RFF estimator was empirically shown to be numerically unstable for softmax attention. FAVOR+ uses non-negative exponentials specifically to keep estimated attention weights non-negative.
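A small numerical sketch of the positive-feature estimator for a single query/key pair, using i.i.d. Gaussian draws (the full FAVOR+ scheme additionally orthogonalizes them); the norms and the number of features $m$ are illustrative:

```python
import numpy as np

def favor_features(X, W):
    """Positive random features: phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m)."""
    m = W.shape[0]
    return np.exp(X @ W.T - 0.5 * np.sum(X**2, axis=-1, keepdims=True)) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 16, 4096
q = rng.standard_normal(d) * 0.3            # moderate norms keep the estimator's variance small
k = rng.standard_normal(d) * 0.3
W = rng.standard_normal((m, d))             # i.i.d. Gaussian omegas

exact = np.exp(q @ k)
approx = favor_features(q[None], W) @ favor_features(k[None], W).T
print(exact, approx.item())                 # close for modest norms; variance grows with ||q||, ||k||
```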
Performer Approximation Quality
Statement
The Performer estimator is unbiased:

$$\mathbb{E}_{\omega_1, \ldots, \omega_m}\!\left[\phi(q)^\top \phi(k)\right] = \exp(q^\top k),$$

and for fixed $q, k$ with bounded norms its variance scales as $O(1/m)$. Choromanski et al. give pointwise concentration bounds and an error bound on the approximated attention output that depends polynomially on $n$ and on the maximum query/key norm; orthogonal random features (FAVOR+) further reduce variance relative to i.i.d. Gaussian draws.
Intuition
A Monte Carlo construction that mirrors the kernel-trick form of attention. Variance falls like $1/m$, but the constants depend on how peaked the true softmax weights are: very sharp attention patterns require many more random features than smooth ones.
Proof Sketch
Unbiasedness follows from the Gaussian moment-generating-function identity $\mathbb{E}_{\omega \sim \mathcal{N}(0, I_d)}\!\left[e^{\omega^\top z}\right] = e^{\|z\|^2/2}$ applied to $z = q + k$. Pointwise concentration follows from sub-exponential tail bounds for products of Gaussian-driven exponentials with bounded norms.
Why It Matters
This gives a principled way to trade accuracy for speed. But the constants in the bounds matter: there is no uniform relative-error guarantee at modest $m$, and on attention distributions that are highly peaked the required $m$ can erase the asymptotic speedup over exact attention.
Failure Mode
For tasks requiring sharp attention patterns (copying, exact retrieval), the variance can be large enough that quality degrades noticeably. The classical trigonometric random-Fourier-feature construction also gives an unbiased softmax-kernel estimator in principle but produces signed estimates and is numerically unstable for attention; this is why Performer specifically uses non-negative features.
Why FlashAttention Won
FlashAttention computes exact softmax attention but tiles the computation to minimize HBM (high bandwidth memory) reads and writes. The key insight: the attention bottleneck for sequences up to 16K is not FLOPs but memory IO. The attention matrix does not need to be materialized; it can be computed and consumed block by block.
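A CPU-side sketch of the idea in NumPy: exact attention computed over blocks of keys/values with a running (online) softmax, so the $n \times n$ matrix is never formed. The real kernel also tiles over queries, fuses everything into on-chip SRAM, and handles the backward pass with recomputation; the block size here is illustrative:

```python
import numpy as np

def tiled_attention(Q, K, V, block=256):
    """Exact softmax attention, one block of keys/values at a time."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    running_max = np.full(n, -np.inf)
    running_sum = np.zeros(n)
    for start in range(0, n, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = Q @ Kb.T * scale                    # (n, block): only one block of columns
        block_max = scores.max(axis=1)
        new_max = np.maximum(running_max, block_max)
        correction = np.exp(running_max - new_max)   # rescale previous accumulators
        p = np.exp(scores - new_max[:, None])
        out = out * correction[:, None] + p @ Vb
        running_sum = running_sum * correction + p.sum(axis=1)
        running_max = new_max
    return out / running_sum[:, None]
```

Up to floating-point reordering, this returns the same output as the naive implementation sketched earlier; only the order of operations and the memory traffic change.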
FlashAttention achieves 2-4x wall-clock speedup over standard attention for typical LLM training sequence lengths (2K-8K), while computing the exact same output. This made most approximate attention methods unnecessary for their target use case (moderate-length sequences on modern GPUs).
Sub-quadratic methods still matter when $n$ is truly enormous (100K+ tokens), where the FLOP cost itself dominates. But for most current LLM training and inference, FlashAttention (and its successors) suffice.
Common Confusions
Sub-quadratic FLOPs does not mean faster wall-clock time
A method with $O(n)$ FLOPs can be slower than an $O(n^2)$ method if it has poor memory access patterns, cannot use tensor cores, or requires custom kernels. On modern GPUs, memory bandwidth is often the bottleneck, not arithmetic. This is why FlashAttention (which uses more FLOPs due to recomputation) is faster than standard attention (which materializes the full attention matrix).
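A back-of-envelope sketch of that argument; the hardware numbers are assumptions (roughly A100-class figures), not measurements:

```python
# Rough arithmetic-intensity estimate for standard (materialized) attention.
# Hardware numbers are illustrative assumptions, not measurements.
n, d = 4096, 64
bytes_per_el = 2                                   # fp16/bf16

flops = 4 * n * n * d                              # QK^T and PV matmuls, ~2*n^2*d each
hbm_bytes = 2 * n * n * bytes_per_el + 3 * n * d * bytes_per_el  # write + read the n x n matrix, plus Q, K, V

bandwidth = 2e12                                   # bytes/s  (assumed HBM bandwidth)
compute = 3e14                                     # FLOP/s   (assumed tensor-core throughput)

print("time if compute-bound:", flops / compute)
print("time if memory-bound :", hbm_bytes / bandwidth)
# Memory-bound when flops / hbm_bytes < compute / bandwidth, i.e. when the
# arithmetic intensity (~d FLOP/byte here) is below the hardware balance point.
```

With these assumed numbers the arithmetic intensity of the materialized pipeline (about $d$ FLOP per byte) sits well below the hardware balance point (about 150 FLOP per byte), so moving the $n \times n$ matrix through HBM, not the matmuls, sets the runtime.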
Linear attention is not a drop-in replacement
Linear attention changes the model's expressive power, not just its cost. The softmax attention matrix can be arbitrarily sparse and peaked. Linear attention with a fixed feature map cannot represent sharp attention patterns. Models trained with linear attention often require architectural changes (gating, decay factors) to match softmax attention quality.
Summary
- Standard attention costs $O(n^2)$ time and memory
- Linear attention uses the kernel trick to achieve $O(n)$ cost but sacrifices sharp attention patterns
- Sparse methods (Longformer, BigBird) compute exact attention on a structured subset of pairs
- Low-rank methods (Linformer) project to a fixed-rank subspace so the attention matrix is approximated by a low-rank factorization
- Kernel-approximation methods (Performer) replace softmax with a random-feature kernel, which is not a low-rank structure on the attention matrix itself
- FlashAttention computes exact attention with IO-aware tiling, making approximate methods unnecessary for moderate sequence lengths (up to roughly 16K tokens)
- For truly long sequences ($n$ of 100K+ tokens), sub-quadratic methods and state-space models remain relevant
Exercises
Problem
In linear attention with feature map $\phi : \mathbb{R}^d \to \mathbb{R}^m$, what are the shapes of the precomputed quantities $S = \sum_j \phi(k_j)\, v_j^\top$ and $z = \sum_j \phi(k_j)$? What is the total memory cost of storing $S$ and $z$ compared to storing the full $n \times n$ attention matrix?
Problem
Explain why FlashAttention uses more FLOPs than standard attention (due to recomputation in the backward pass) yet achieves faster wall-clock time. Quantify the tradeoff: if HBM bandwidth is $B$ bytes/sec and compute throughput is $C$ FLOP/sec, under what condition on $n$ and $d$ is standard attention memory-bound?
References
Canonical:
- Vaswani et al., "Attention Is All You Need" (NeurIPS 2017)
- Katharopoulos et al., "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" (ICML 2020), arXiv:2006.16236
Architecture variants cited in the body:
- Beltagy, Peters, Cohan, "Longformer: The Long-Document Transformer" (2020), arXiv:2004.05150
- Zaheer et al., "Big Bird: Transformers for Longer Sequences" (NeurIPS 2020), arXiv:2007.14062
- Wang et al., "Linformer: Self-Attention with Linear Complexity" (2020), arXiv:2006.04768
- Choromanski et al., "Rethinking Attention with Performers" (ICLR 2021), arXiv:2009.14794
Current:
- Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (NeurIPS 2022), arXiv:2205.14135
- Dao, "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" (ICLR 2024), arXiv:2307.08691
- Shah et al., "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision" (2024), arXiv:2407.08608
- Tay et al., "Efficient Transformers: A Survey" (ACM Computing Surveys 2022), arXiv:2009.06732
Further reading (not covered in body):
- Kitaev, Kaiser, Levskaya, "Reformer: The Efficient Transformer" (ICLR 2020), arXiv:2001.04451 — LSH-based sparse attention.
- Liu et al., "Ring Attention with Blockwise Transformers for Near-Infinite Context" (2023), arXiv:2310.01889 — sequence parallelism for multi-million-token contexts.
- Gu and Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (2023), arXiv:2312.00752 — the SSM-based alternative that largely replaced linear-attention work post-2023.
Next Topics
- Mamba and state-space models: a recurrent architecture that avoids attention entirely for long sequences
- KV cache: memory optimization for autoregressive inference
Last reviewed: April 26, 2026