Sparse Autoencoders for Interpretability: TopK, JumpReLU, Matryoshka, and Scaling
Sparse autoencoders (SAEs) decompose neural network activations into interpretable feature directions by learning an overcomplete dictionary with a sparsity constraint. Coverage of the superposition hypothesis, the L1 SAE objective and its shrinkage failure mode, TopK SAEs (Gao et al. 2024 / OpenAI), JumpReLU SAEs (Rajamanoharan et al. 2024 / DeepMind), Matryoshka SAEs (Bushnaq / Bussmann 2024), Anthropic's million-feature Claude 3 Sonnet decomposition (Templeton et al. 2024), feature splitting, dead features, evaluation, and steering.
Prerequisites
- Lasso regression
- Autoencoders
- Mechanistic interpretability: features, circuits, and causal faithfulness
- Residual stream and transformer internals
- Truth directions and linear probes
Why This Matters
Individual neurons in a trained transformer are polysemantic: a single neuron activates for many unrelated concepts. In Claude 3 Sonnet, concepts as different as the Golden Gate Bridge, security vulnerabilities, sycophantic praise, and DNA sequences are each spread across many neurons rather than owned by any single one (Templeton et al. 2024). Reading neuron activations directly tells you almost nothing about what the model is computing.
The superposition hypothesis (Elhage et al. 2022) explains why: networks pack many more features than neurons by encoding each feature as a direction in activation space. When features are sparse (each one is active on only a small fraction of inputs), nearly-orthogonal packing lets a $d$-dimensional space hold $m \gg d$ distinguishable feature directions (Welch / Kabatyanskii-Levenshtein bounds for sphere packings). Sparse autoencoders (SAEs) recover those feature directions by training an overcomplete dictionary on activation samples.
The 2024 wave of SAE research addressed three real problems with the original Bricken-style L1 SAE:
- Shrinkage. L1 penalizes the magnitude of active features, so the decoder has to over-scale to compensate, biasing reconstruction. JumpReLU and Gated SAEs separate gating from magnitude (Rajamanoharan et al. 2024).
- Sparsity-reconstruction tradeoff. L1 imposes a soft penalty whose hyperparameter trades off reconstruction against L0 sparsity in a model-specific way. TopK SAEs replace L1 with an exact L0 constraint (Gao et al. 2024 from OpenAI on GPT-4 activations).
- Feature granularity. A fixed dictionary size commits to one resolution. Matryoshka SAEs train a nested family of dictionaries so that the first $m_1$ features form a coarse decomposition and adding more reveals finer-grained subfeatures (Bushnaq et al. 2024; Bussmann 2024).
Anthropic's Scaling Monosemanticity (Templeton et al. May 2024) and DeepMind's Gemma Scope (Lieberum et al. July 2024, arXiv:2408.05147) showed that SAEs work at frontier-model scale, recovering millions of interpretable features in production-class models. SAEs are now the default tool for mechanistic interpretability research at every major lab.
Mental Model
You have a large dataset of activations $x_i \in \mathbb{R}^d$ collected by running the base model on text. You hypothesize that each $x$ is approximately a sparse linear combination of feature directions $f_1, \dots, f_m \in \mathbb{R}^d$ with $m \gg d$ (overcomplete). The SAE jointly learns the dictionary $\{f_j\}$ and a per-input sparse code $z(x) \in \mathbb{R}^m_{\geq 0}$ such that $x \approx \sum_j z_j(x) f_j$, with $z(x)$ having only a handful of nonzero entries. The decoder is the matrix $W_{\text{dec}} \in \mathbb{R}^{d \times m}$ whose columns are the $f_j$; the encoder is a small neural network that produces the code $z(x)$ from each input.
The interpretability payoff: each $f_j$ is a direction in activation space whose meaning can be characterized by inspecting which inputs cause $z_j(x)$ to be large. A typical "good" feature might fire only on French text about wine, only on Python list comprehensions, or only on inputs that prompt a refusal. Once features are extracted, you can probe the model's internal vocabulary, ablate features to test causal effect, and steer generation by adding feature directions to the residual stream (Anthropic's "Golden Gate Claude" demo, May 2024).
Feature direction
A unit vector $f_j \in \mathbb{R}^d$ in activation space corresponding to a column of the SAE decoder $W_{\text{dec}}$. The reconstructed activation is $\hat{x} = \sum_j z_j(x) f_j$, where $z_j(x)$ is the encoder's score for feature $j$ on input $x$. A feature direction is monosemantic if its top-activating inputs share a single human-interpretable concept; polysemantic if they correspond to several unrelated concepts.
Overcomplete dictionary
A dictionary of $m$ feature directions with $m \gg d$. The ratio $m/d$ is the overcompleteness factor; typical published values range from a few times $d$ to a few hundred times $d$, with frontier-scale Anthropic and DeepMind SAEs pushing to still larger dictionaries on production models. Overcompleteness is the architectural commitment to the superposition hypothesis: there are more features than dimensions.
The Superposition Hypothesis
Superposition Hypothesis
Statement
A neural network with activation dimension $d$ can represent $m \gg d$ features by encoding each feature as a direction in $\mathbb{R}^d$. In the toy-model regime of Elhage et al. (2022), the number of representable features grows as the per-feature activation probability $p$ shrinks. In the very-sparse limit $p \to 0$, the count approaches the Welch/Kabatyanskii-Levenshtein bound of $\exp(\Theta(\epsilon^2 d))$ nearly-orthogonal directions with pairwise inner product bounded by $\epsilon$. This scaling should be read as an informal observation from a controlled toy model, not a tight theoretical bound; see Scherlis-Sachan-Jermyn-Benton-Shlegeris (2022, arXiv:2210.01892) for a more rigorous capacity analysis.
Intuition
In $\mathbb{R}^d$, you can fit exactly $d$ orthogonal directions. If you allow a small angular slack $\epsilon$ (small interference), you can fit exponentially many more nearly-orthogonal directions; this is the Johnson-Lindenstrauss phenomenon applied to feature representation. The network exploits sparsity: when two features whose directions weakly interfere are unlikely to be active together, the interference rarely produces output errors, so the cost of packing more features is low.
Proof Sketch
Elhage et al. (2022) study the toy model $x \mapsto \mathrm{ReLU}(W^\top W x + b)$ with sparse $x$ and analyze the optimal $W$ as a function of the per-feature activation probability $p$. The optimum goes through three regimes as $p$ decreases from $1$ toward $0$: (1) for dense features the model learns $d$ orthogonal directions and discards the rest; (2) at moderate sparsity it learns antipodal pairs that share a direction up to sign; (3) at high sparsity it learns polytopes (digons, triangles, tetrahedra) that pack features into nearly-orthogonal subsets with controlled interference. The transition is governed by a phase diagram in sparsity and feature importance. In the high-sparsity regime, sphere-packing bounds (Welch / Kabatyanskii-Levenshtein) cap the number of $\epsilon$-orthogonal directions at $\exp(O(\epsilon^2 d))$.
Why It Matters
Superposition is the empirical reason individual neurons are unreliable units of analysis. The meaningful unit is a feature direction (a decoder column), not a neuron axis. SAEs are the construction that recovers feature directions from the superposed representation, producing a dictionary of monosemantic atoms from a polysemantic activation space. Without superposition, polysemanticity would be an artifact of training noise and would disappear with scale; with superposition, polysemanticity is a prediction and SAEs are the right tool to handle it.
Failure Mode
The hypothesis is well-supported in the toy models of Elhage et al. (2022) and in small transformers (Bricken et al. 2023). Open questions for frontier models: (1) what fraction of features are truly linearly represented vs. encoded in cross-layer composition (cf. the polytope-lens critique, Black et al. 2023); (2) how does the feature count scale with model size (Templeton et al. 2024 found roughly linear scaling of useful feature count with SAE width up to the largest dictionaries tested); (3) whether the clean basis assumption (each feature has a single direction) holds, given growing evidence of feature splitting and feature absorption (Bussmann et al. 2024). Treat the hypothesis as a working framework that has earned empirical credit, not as a theorem about real models.
The L1 SAE Objective and the Shrinkage Problem
L1 Sparse Autoencoder (Bricken et al. 2023)
Statement
The L1 SAE learns an encoder $(W_{\text{enc}}, b_{\text{enc}})$ and decoder $(W_{\text{dec}}, b_{\text{dec}})$ minimizing

$$\mathcal{L}(x) = \lVert x - \hat{x}(x) \rVert_2^2 + \lambda \lVert z(x) \rVert_1$$

with

$$z(x) = \mathrm{ReLU}\big(W_{\text{enc}}(x - b_{\text{dec}}) + b_{\text{enc}}\big), \qquad \hat{x}(x) = W_{\text{dec}}\, z(x) + b_{\text{dec}},$$

and $\lVert (W_{\text{dec}})_{:,j} \rVert_2 = 1$ for all $j$ (unit-norm decoder columns). The decoder columns are the learned feature directions; the encoder activations $z_j(x)$ score how much each feature is present in $x$.
Intuition
This is dictionary learning posed as a single-hidden-layer autoencoder with a sparsity penalty. The L1 penalty pushes the encoder activations to zero whenever the marginal cost of activation exceeds the marginal reconstruction gain, producing a sparse code in which only a small subset of features is active per input. The unit-norm constraint on decoder columns prevents the optimizer from cheating by scaling features arbitrarily small (which would make the L1 cost vanish without producing genuine sparsity).
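A minimal PyTorch sketch of this architecture, following the formulation above; shapes, initialization, and the $\lambda$ value are illustrative choices, not the reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class L1SAE(nn.Module):
    """Minimal L1 sparse autoencoder (Bricken et al. 2023 style)."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_dict) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_dict))
        self.W_dec = nn.Parameter(torch.randn(d_dict, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))  # pre-encoder bias

    def forward(self, x):
        z = F.relu((x - self.b_dec) @ self.W_enc + self.b_enc)  # sparse code
        x_hat = z @ self.W_dec + self.b_dec                     # reconstruction
        return x_hat, z

def l1_sae_loss(x, x_hat, z, lam=1e-3):
    recon = (x - x_hat).pow(2).sum(-1).mean()
    sparsity = z.abs().sum(-1).mean()
    return recon + lam * sparsity

@torch.no_grad()
def renorm_decoder(sae):
    # Keep feature directions unit-norm so the L1 penalty cannot be gamed
    # by shrinking codes while inflating the decoder.
    sae.W_dec.data /= sae.W_dec.data.norm(dim=-1, keepdim=True)
```

In this layout the feature directions are the rows of `W_dec` (the transpose of the column convention in the text); `renorm_decoder` would be called after each optimizer step to maintain the unit-norm constraint.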
Proof Sketch
The objective is amortized LASSO. Fixing the dictionary $W_{\text{dec}}$, the optimal per-input code $\arg\min_z \lVert x - W_{\text{dec}} z \rVert_2^2 + \lambda \lVert z \rVert_1$ is exactly LASSO, whose solution (for an orthonormal dictionary) is the soft-thresholding of the least-squares code. The encoder approximates this LASSO solution in a single forward pass; the unit-norm constraint on $W_{\text{dec}}$ identifies the solution up to permutation of columns and removes the trivial scale degeneracy. Joint training of encoder and decoder amortizes the LASSO solve across the data distribution.
Why It Matters
The L1 SAE is the historical baseline (Bricken et al. October 2023) that opened the modern interpretability research program at Anthropic and elsewhere. It produces interpretable features at small scale (single MLP layer of a 1L transformer), and the same architecture scales to GPT-2-class models (Cunningham et al. 2023, arXiv:2309.08600). Every later SAE variant is described relative to this baseline.
Failure Mode
- Shrinkage. The L1 penalty biases active feature magnitudes downward (the well-known LASSO shrinkage problem). If the optimal reconstruction code is $z^*$, the L1-penalized solution is approximately the soft-thresholded code $\mathrm{sign}(z^*)\max(|z^*| - \lambda/2,\, 0)$. To compensate, the SAE has to learn an over-scaled decoder, which biases reconstruction in a structured way. Empirically, L1 SAEs sit on a reconstruction-sparsity Pareto frontier that is strictly worse than TopK and JumpReLU (Gao et al. 2024 Figure 4; Rajamanoharan et al. 2024 Figure 2).
- Dead features. Many dictionary elements never activate. Bricken et al. (2023) reported up to 90% dead features at the optimal $\lambda$; the standard fix is resampling (re-initialize dead encoder rows toward high-loss data points at regular intervals during training).
- Feature splitting. A single semantic concept can be split across multiple dictionary elements that fire on slightly different sub-distributions; this is the SAE-side artifact of forcing too many features to be present (Anthropic's Scaling Monosemanticity discusses this at length).
TopK SAEs: Hard Sparsity
TopK Sparse Autoencoder (Gao et al. 2024 / OpenAI)
Statement
The TopK SAE replaces the L1 penalty with an exact L0 constraint. Define $\mathrm{TopK}_k(\cdot)$ as the operator that keeps the $k$ largest entries of a vector and zeros the rest. The encoder is

$$z(x) = \mathrm{TopK}_k\big(W_{\text{enc}}(x - b_{\text{pre}})\big)$$

and the loss is pure reconstruction:

$$\mathcal{L}(x) = \big\lVert x - W_{\text{dec}}\, z(x) - b_{\text{pre}} \big\rVert_2^2.$$

No sparsity penalty appears; sparsity is enforced exactly by construction. Gao et al. (2024, arXiv:2406.04093) show on GPT-4 activations that TopK SAEs strictly Pareto-dominate L1 SAEs on the sparsity-reconstruction frontier and eliminate the shrinkage problem.
Intuition
L1 enforces sparsity softly by paying a cost proportional to magnitude; this couples gating (which features fire) to magnitude (how much they fire). TopK enforces sparsity as a hard constraint by selecting exactly the top $k$ features per input: gating is a discrete decision, and magnitude is set by reconstruction alone. Removing the magnitude penalty removes shrinkage.
Proof Sketch
Shrinkage-elimination argument. Suppose the true sparse code on input $x$ has exactly $k$ nonzero entries. The L1 SAE's optimum is the soft-thresholded code, biased downward by roughly $\lambda/2$ per active entry. The TopK SAE sets the active indices via TopK and the active magnitudes via least-squares regression onto the chosen columns; no penalty appears in the magnitude solve, so the magnitudes are unbiased. The Pareto improvement follows directly: at fixed L0, TopK achieves the least-squares optimum; L1 achieves a strictly worse value due to soft-thresholding.
Gradient pass-through. TopK is non-differentiable, but the gradient flows through the selected indices identically to a standard linear layer; non-selected indices receive zero gradient. There is no straight-through estimator needed for the selection step itself; the discrete choice is treated as a hard mask in the backward pass.
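A sketch of this forward/backward behavior in PyTorch; `topk_activation` is a hypothetical helper name, but the mechanism (an out-of-place scatter acting as a hard mask) is exactly why no straight-through estimator is needed:

```python
import torch

def topk_activation(u: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest entries of each row, zero the rest."""
    vals, idx = torch.topk(u, k, dim=-1)
    # Out-of-place scatter: gradients flow to `vals` (hence to the selected
    # entries of u) and nowhere else; the selection acts as a hard mask.
    return torch.zeros_like(u).scatter(-1, idx, vals)

u = torch.randn(4, 1024, requires_grad=True)
z = topk_activation(u, k=32)
z.sum().backward()
assert u.grad.count_nonzero() == 4 * 32  # gradient only at selected indices
```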
Auxiliary loss for dead features. Gao et al. add an auxiliary AuxK loss that asks the top-$k_{\text{aux}}$ dead features to reconstruct the residual error, keeping the dictionary alive without resampling.
Why It Matters
TopK SAEs are the current default at OpenAI (Gao et al. trained 16M-feature SAEs on GPT-4 activations) and one of two defaults at most other interp labs. The pattern of "discrete top-K selection, dense regression on the chosen subset" is a clean architectural template that translates to any setting where you want a hard sparsity constraint without a penalty hyperparameter.
Failure Mode
TopK fixes $k$ globally, but the true number of active features per input can vary (a long sentence about many topics needs more features than a short uniform one). Per-input adaptive $k$ is an open design question; some recent work uses a learned gating network or a soft TopK relaxation. TopK also still suffers from feature splitting and feature absorption; the L0 constraint controls count but not granularity, so SAEs at different $k$ can still find substantially different feature decompositions.
JumpReLU SAEs: Separating Gating from Magnitude
JumpReLU Sparse Autoencoder (Rajamanoharan et al. 2024 / DeepMind)
Statement
The JumpReLU activation is a per-coordinate piecewise function with a learnable threshold $\theta_j > 0$:

$$\mathrm{JumpReLU}_{\theta_j}(u_j) = u_j \cdot \mathbf{1}[u_j > \theta_j].$$

The encoder is $z(x) = \mathrm{JumpReLU}_\theta(W_{\text{enc}} x + b_{\text{enc}})$ applied coordinate-wise, and the loss combines reconstruction with an L0 penalty:

$$\mathcal{L}(x) = \lVert x - \hat{x}(x) \rVert_2^2 + \lambda \sum_j \mathbf{1}[z_j(x) > 0].$$
The L0 indicator is non-differentiable; Rajamanoharan et al. (2024, arXiv:2407.14435) train it with a straight-through estimator using a kernel-density approximation of the Heaviside step.
Intuition
A standard ReLU activation commits two errors at once: it zeros small magnitudes (good, that is gating), and it shrinks remaining magnitudes near the threshold (bad, that is shrinkage). JumpReLU separates these: the magnitude pass-through is exact ($z_j = u_j$ whenever the unit is active), and the gating decision is a hard threshold at $\theta_j$. Compared to TopK, JumpReLU has per-feature adaptive sparsity: a feature with a tiny threshold fires often, one with a large threshold fires rarely. The L0 penalty then trades off how many features fire, summed across the dictionary.
Proof Sketch
Step 1 (magnitude pass-through). When $u_j > \theta_j$, $z_j = u_j$ exactly. Reconstruction sees the unbiased magnitude, so there is no shrinkage bias on active features.
Step 2 (gating decision). The indicator $\mathbf{1}[u_j > \theta_j]$ is a hard threshold whose gradient is zero almost everywhere. Rajamanoharan et al. use a straight-through estimator: in the forward pass, use the indicator; in the backward pass, replace it with a smooth kernel of narrow bandwidth $\varepsilon$. As $\varepsilon \to 0$ the kernel becomes a delta function and the surrogate gradient is unbiased on average but high-variance.
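A sketch of this estimator as a custom autograd function. The rectangle kernel and the pseudo-derivative $-(\theta/\varepsilon)\,K((u-\theta)/\varepsilon)$ follow the paper's construction, but the details here (kernel choice, batching, the $\varepsilon$ value) are simplified assumptions:

```python
import torch

class JumpReLUSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, u, theta, eps):
        gate = (u > theta).to(u.dtype)       # hard gate in the forward pass
        ctx.save_for_backward(u, theta, gate)
        ctx.eps = eps
        return u * gate                      # exact magnitudes when active

    @staticmethod
    def backward(ctx, grad_out):
        u, theta, gate = ctx.saved_tensors
        eps = ctx.eps
        # Rectangle kernel K centered on the threshold, bandwidth eps.
        window = ((u - theta).abs() < eps / 2).to(u.dtype)
        grad_u = grad_out * gate             # pass-through on active units
        # Surrogate pseudo-derivative w.r.t. theta: -(theta/eps) * K(...).
        grad_theta = (grad_out * (-theta / eps) * window).sum(dim=0)
        return grad_u, grad_theta, None

u = torch.randn(8, 512)
theta = torch.full((512,), 0.5, requires_grad=True)
z = JumpReLUSTE.apply(u, theta, 0.1)         # eps = 0.1 is illustrative
z.sum().backward()
```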
Step 3 (Pareto improvement over L1 + ReLU). At any fixed L0 sparsity level, JumpReLU achieves strictly better reconstruction than the L1 + ReLU baseline because (1) magnitudes are unbiased and (2) the per-feature threshold can adapt to feature scale (no single global $\lambda$ forcing one threshold on all features). Empirically (Rajamanoharan et al. Figure 2), JumpReLU matches or slightly beats TopK on the sparsity-reconstruction frontier across Gemma-2-2B and Gemma-2-9B layers.
Why It Matters
JumpReLU is one of two architectures (with TopK) used to train DeepMind's Gemma Scope (Lieberum et al. July 2024, arXiv:2408.05147), a public release of JumpReLU SAEs across Gemma-2-2B and Gemma-2-9B layers and sites. Per-feature adaptive thresholds make JumpReLU especially well-suited to scale: in TopK every input gets exactly $k$ features, while in JumpReLU some inputs naturally activate more features than the average. This matches the empirical reality that sentences vary in semantic complexity.
Failure Mode
The straight-through estimator is unbiased only in expectation and has high variance for small kernel bandwidth $\varepsilon$; large $\varepsilon$ biases the gating decision. Picking $\varepsilon$ is an additional hyperparameter that L1 and TopK SAEs avoid. JumpReLU SAEs also rely on per-feature thresholds being well-initialized; cold-start training can leave many features dead until thresholds settle. The Gated SAE variant (Rajamanoharan et al. April 2024, arXiv:2404.16014) is closely related and has the same shrinkage-elimination property; the choice between Gated and JumpReLU is mostly a matter of implementation taste.
Matryoshka SAEs: Multi-Resolution Dictionaries
Matryoshka Sparse Autoencoder (Bushnaq et al. / Bussmann 2024)
Statement
A Matryoshka SAE trains a single encoder/decoder pair such that the first $m_i$ dictionary elements form a stand-alone SAE for every prefix size $m_1 < m_2 < \dots < m_P = m$ in a chosen geometric series. The training loss is

$$\mathcal{L}(x) = \sum_{i=1}^{P} w_i \,\big\lVert x - \hat{x}_{1:m_i}(x) \big\rVert_2^2 + \text{sparsity penalty},$$

where $\hat{x}_{1:m_i}$ reconstructs using only the first $m_i$ features, with weights $w_i > 0$ and per-prefix sparsity targets scaled to maintain a roughly constant ratio of active features to prefix size.
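A sketch of the prefix-sum loss under the notation above; the prefix sizes and uniform weights are illustrative assumptions:

```python
import torch

def matryoshka_loss(x, z, W_dec, b_dec,
                    prefixes=(2048, 8192, 32768), weights=None):
    """Prefix-sum reconstruction loss.

    x: (batch, d) activations; z: (batch, m) sparse codes from any encoder
    (TopK, JumpReLU, ...); W_dec: (m, d) decoder whose first m_i rows must
    stand alone as a dictionary for every prefix size m_i in `prefixes`.
    """
    weights = weights if weights is not None else [1.0] * len(prefixes)
    loss = x.new_zeros(())
    for w_i, m_i in zip(weights, prefixes):
        x_hat_i = z[:, :m_i] @ W_dec[:m_i] + b_dec   # prefix-only decode
        loss = loss + w_i * (x - x_hat_i).pow(2).sum(-1).mean()
    return loss
```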
Intuition
A standard SAE commits to one dictionary size and one feature granularity. Increasing $m$ produces finer features (more concept splitting) but also more dead features and more redundancy. Matryoshka SAEs hedge: the smallest prefix (say, the first 64 dictionary elements) forms a coarse decomposition (high-level concepts like "Python code", "French text", "math notation"), a middle prefix forms a moderate decomposition (sub-types of those concepts), and the full dictionary forms a fine decomposition. A user can choose the granularity post hoc by truncating the dictionary at any prefix length.
Proof Sketch
Step 1 (nested feature representation). Training with the prefix-sum loss forces the first dictionary columns to carry as much reconstruction quality as possible at the smallest size, then the next columns to add residual reconstruction, and so on. This is identical in spirit to the Matryoshka Representation Learning construction (Kusupati et al. 2022, arXiv:2205.13147) for embeddings, applied to dictionary learning.
Step 2 (avoiding feature splitting). A standard SAE forced to use features has no constraint on whether two features are coarse-then-fine or two siblings of the same coarse concept. The Matryoshka loss biases the optimization toward a hierarchy: coarse concepts in the first prefix, refinements in later prefixes. Empirically (Bussmann et al. 2024 LessWrong report) this produces fewer split features and clearer feature trees.
Step 3 (computational cost). The Matryoshka loss costs one decoder pass per prefix per training step relative to a standard SAE; with $P$ prefixes the overhead is roughly $P\times$ in decoder compute. Some implementations amortize this by sharing the encoder forward pass and only doing separate decoder solves.
Why It Matters
Feature splitting is the most-cited critique of standard SAEs: a single semantic concept (say, "the word 'the' in English") can be split across 5-10 dictionary elements that each fire on slightly different contexts, making downstream analysis ambiguous. Matryoshka SAEs partially address this by forcing the dictionary to organize hierarchically. The construction also gives a knob for users who want different feature granularities for different analyses (coarse for steering, fine for circuit analysis).
Failure Mode
Matryoshka SAEs are still empirical; rigorous evaluations against TopK and JumpReLU baselines on the same tasks are limited as of 2026 Q1. The training cost (several times a standard SAE) limits scaling. The hierarchy is implicit in the loss, not enforced; some Matryoshka runs still produce flat splits at certain depths. The closely-related Gated SAE and cross-layer transcoder lines (Lindsey et al. 2024-2025) attack feature granularity from different angles, and the field has not converged on a single dominant approach.
Worked Examples
Sphere-packing capacity for d = 4096
Let $d = 4096$ (the residual stream dimension of a typical 7B-parameter transformer). How many features can the network represent in superposition?
For pairwise inner-product slack $\epsilon = 0.1$ (allowing 10% "off-axis" projection between any two features), packing arguments give a count of nearly-orthogonal unit vectors in $\mathbb{R}^{4096}$ that is exponential in $\epsilon^2 d \approx 41$: hundreds of millions of directions or more, depending on the constant in the exponent. Tightening the slack to $\epsilon = 0.05$ divides the exponent by four and collapses the count by many orders of magnitude, so the number of features the model can pack scales exponentially in $d$ but falls off extremely steeply in the slack tolerance $\epsilon$.
This is one reason Anthropic's Scaling Monosemanticity needed dictionary sizes up to 34M features to find interpretable features in Claude 3 Sonnet's residual stream: the model can plausibly be running with many millions of features in superposition, and the SAE has to be at least that large to recover them.
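Back-of-envelope arithmetic for the exponents, assuming a generic $\exp(c\,\epsilon^2 d)$ packing count with an unspecified constant $c$ (set to $1/2$ here as a rough random-vector estimate, not the tight Welch/Kabatyanskii-Levenshtein constant):

```python
import math

d = 4096
c = 0.5  # assumed constant in exp(c * eps^2 * d); bound-dependent
for eps in (0.1, 0.05):
    log10_count = c * eps**2 * d / math.log(10)
    print(f"eps={eps}: ~10^{log10_count:.1f} nearly-orthogonal directions")
# eps=0.10 -> ~10^8.9  (hundreds of millions of directions)
# eps=0.05 -> ~10^2.2  (halving eps divides the exponent by four)
```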
L1 shrinkage on a single feature
Consider a single-feature toy: $x = a f$ with feature direction $f$ unit-norm and scalar magnitude $a > 0$. With the dictionary fixed at $f$, the L1-penalized optimum is $\hat{z} = \max(a - \lambda/2,\, 0)$ (the soft-thresholded code; the encoder learns to approximate this). Reconstruction error: $\lVert x - \hat{z} f \rVert_2 = \lambda/2$ when $a > \lambda/2$.
The TopK SAE with $k = 1$ recovers $\hat{z} = a$ exactly (least-squares regression onto the chosen direction with no penalty). Reconstruction error: $0$.
This is the cleanest demonstration of why TopK strictly Pareto-dominates L1 on the sparsity-reconstruction frontier: the magnitude-shrinkage cost paid by L1 is unnecessary once you separate gating from magnitude.
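The same toy in a few lines of NumPy; the feature dimension and the values of $a$ and $\lambda$ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=64); f /= np.linalg.norm(f)  # unit feature direction
a, lam = 3.0, 1.0
x = a * f

# L1-optimal code: soft threshold (minimize (a - z)^2 + lam * z over z >= 0)
z_l1 = max(a - lam / 2, 0.0)
# TopK (k=1) code: plain least squares onto f, no penalty
z_topk = f @ x

print(np.linalg.norm(x - z_l1 * f))    # lam/2 = 0.5  (shrinkage bias)
print(np.linalg.norm(x - z_topk * f))  # 0.0          (exact recovery)
```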
Anthropic's Golden Gate Claude steering demo (May 2024)
Anthropic trained an SAE on Claude 3 Sonnet's middle-layer residual stream and identified a feature that activates on the Golden Gate Bridge: text mentioning the bridge, images of the bridge, references to fog or San Francisco that contextually evoke the bridge. The feature is monosemantic in the sense that its top activations are all bridge-related across modalities.
To steer, Anthropic added a multiple of the feature direction to the residual stream during generation: $h_\ell \leftarrow h_\ell + \alpha f$ for layer $\ell$ and steering coefficient $\alpha$. With $\alpha$ moderate, the model started referencing the Golden Gate Bridge in answers to unrelated questions; with $\alpha$ very large, the model became unable to talk about anything else (the public Golden Gate Claude demo).
Lesson: SAE features are causally meaningful rather than merely correlated with concepts. Adding a feature direction at inference time produces predictable behavioral changes, which is the operational test of mechanistic understanding.
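A hypothetical sketch of residual-stream steering with a PyTorch forward hook. The module path and tuple-handling convention are assumptions about a HuggingFace-style model, not Anthropic's internal tooling:

```python
import torch

def make_steering_hook(feature_dir: torch.Tensor, alpha: float):
    """Forward hook that adds alpha * feature_dir to a block's residual
    output. Assumes the hooked module returns the residual-stream tensor
    (HuggingFace blocks typically return tuples; the tensor is index 0)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * feature_dir.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Usage (model, layer index, and feature_dir are placeholders):
# handle = model.transformer.h[20].register_forward_hook(
#     make_steering_hook(feature_dir, alpha=8.0))
# ... generate ...
# handle.remove()
```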
Dead features and the Gao et al. AuxK fix
A standard L1 SAE trained on the layer-6 residual stream of GPT-2 small typically has 60-90% dead features after convergence (Cunningham et al. 2023; Bricken et al. 2023). Fix 1: resampling, i.e. re-initialize dead encoder rows to point toward inputs the SAE reconstructs poorly, at regular intervals during training.
Fix 2 (Gao et al. 2024): the AuxK loss. Compute the residual error $e = x - \hat{x}$, run a separate TopK over the dead features only with size $k_{\text{aux}}$ (typically 512), and add the auxiliary reconstruction loss scaled by a small coefficient. This activates dead features on the residual error and keeps the dictionary alive without hard resampling.
Empirical result: Gao et al. report dead-feature rates below 7% for their largest TopK SAEs (16M features on GPT-4 activations), compared to far higher rates for L1 SAEs of equivalent size.
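A sketch of the AuxK computation described above; the `dead_mask` bookkeeping (features inactive over a trailing window) is assumed to exist, and details differ from Gao et al.'s implementation:

```python
import torch

def auxk_loss(x, x_hat, u, W_dec, dead_mask, k_aux=512):
    """Auxiliary loss reviving dead features on the residual error.

    x, x_hat: (batch, d); u: (batch, m) encoder pre-activations;
    W_dec: (m, d) decoder; dead_mask: (m,) bool, True for dead features.
    """
    n_dead = int(dead_mask.sum())
    if n_dead == 0:
        return x.new_zeros(())
    e = (x - x_hat).detach()                   # target: the residual error
    u_dead = u.masked_fill(~dead_mask, float("-inf"))
    vals, idx = torch.topk(u_dead, min(k_aux, n_dead), dim=-1)
    z_aux = torch.zeros_like(u).scatter(-1, idx, vals)
    return (e - z_aux @ W_dec).pow(2).sum(-1).mean()
```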
Feature splitting in a small SAE
Train an L1 SAE with a small dictionary on the Pythia-1B layer-4 residual stream, then train another with a much larger dictionary on the same data. The smaller SAE has a single feature that fires on "the word 'the' followed by an adjective"; the larger SAE has six features that fire on slightly different versions of this pattern (the + adjective + noun, the + comparative adjective, the + national-origin adjective, etc.).
This is feature splitting: as $m$ grows, semantic concepts subdivide into finer-grained features. There is no objective ground truth; the "right" granularity depends on the downstream analysis. Matryoshka SAEs partially address this by giving access to the same SAE at multiple granularities post hoc.
Common Confusions
SAE features are learned decompositions, not discovered ground truth
Two SAE training runs with different random seeds produce different feature dictionaries. Two SAE architectures (TopK vs JumpReLU vs Matryoshka) produce different decompositions of the same activation distribution. The features are useful summaries of the activation space, not metaphysically real "atoms" of cognition. The metric Max Mean Cosine Similarity (MMCS) across runs measures dictionary stability and typically lands at 0.7-0.85 for L1 SAEs and 0.85-0.95 for TopK/JumpReLU on Gemma-scale models (Lieberum et al. 2024).
Overcomplete does not guarantee finding all features
An SAE with $m \gg d$ does not necessarily find all features the network uses. It finds the directions that best explain the activation distribution under the architectural sparsity constraint. Rare features (active on a tiny fraction of tokens) may be missed; concentrated features (a single direction explaining a lot of variance) may absorb related concepts; common features may split across dictionary elements. Increasing $m$ helps recall at the cost of feature splitting.
L0 sparsity is per-input, not per-feature
The standard sparsity metric $L_0 = \lVert z(x) \rVert_0$ counts how many features fire on the average input. A typical Gemma Scope SAE activates tens to hundreds of features per token out of a dictionary of tens of thousands to a million, a tiny active fraction. This says nothing about which features tend to fire; some features may fire on a large fraction of inputs (these are usually positional or role-encoding features and deserve separate attention), others may fire almost never (rare specialized concepts). The marginal feature firing rate is a separate diagnostic.
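The two views in code, per-input L0 versus per-feature firing rate; the synthetic codes are illustrative:

```python
import torch

def sparsity_diagnostics(z: torch.Tensor):
    """z: (n_tokens, m) SAE codes. Returns the two distinct sparsity views."""
    active = z != 0
    l0_per_input = active.sum(dim=1).float().mean()   # features per token
    firing_rate = active.float().mean(dim=0)          # per-feature frequency
    return l0_per_input, firing_rate

z = torch.relu(torch.randn(10_000, 4096) - 2.5)       # toy sparse codes
l0, rates = sparsity_diagnostics(z)
print(f"mean L0 per input: {l0:.1f}")
print(f"dead features: {(rates == 0).sum().item()}")
print(f"near-always-on features (>50% of inputs): {(rates > 0.5).sum().item()}")
```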
SAEs train per layer, not on the whole model
Each SAE is fit to one layer and one site (residual stream, MLP output, attention output). Different layers have different SAEs and different feature dictionaries. The same concept can appear at multiple layers with different feature directions; tracing a concept through the network requires cross-layer transcoders (Lindsey et al. 2024) or careful pairing of features across layers via causal probing.
Reconstruction R² is a necessary but insufficient metric
A high $R^2$ means the SAE preserves most of the activation variance, but it does not directly mean the features are interpretable or useful. Two SAEs with the same $R^2$ at the same $L_0$ can have very different feature quality; the standard supplementary diagnostics are (1) autointerp score (use a strong LLM to label features and compute correctness), (2) probe transfer (do features predict downstream behaviors?), (3) causal effects under ablation, (4) MMCS across seeds. The community is still converging on a unified evaluation suite; SAE-Bench (2024) is one attempt.
SAE-based steering is not the only valid feature-causality test
"Adding the feature direction makes the model say bridge-related things" is a necessary condition for the feature to be causally meaningful, not a sufficient one. Spurious directions in activation space can also produce coherent steering effects without corresponding to interpretable features (Marks et al. 2024 on the steering vs probing gap). Strong evidence requires both behavioral steering and causal ablation effects, ideally with control comparisons against random or matched-norm directions.
Evaluation: How Do You Know an SAE Is Good?
The community has converged on a multi-axis evaluation that no single metric replaces:
- Reconstruction quality. $R^2$, or equivalently the fraction of variance unexplained (FVU). Modern SAEs target high reconstruction fidelity at the chosen sparsity level, with frontier-scale TopK SAEs setting the best reported tradeoffs (Gao et al. 2024). (The first three axes are sketched in code after this list.)
- Sparsity. $L_0 = \lVert z(x) \rVert_0$. Typical operating points: 32-512 active features on a dictionary of tens of thousands to tens of millions of elements.
- Downstream loss preservation. Substitute the SAE's reconstruction for the actual activation in the model's forward pass and measure the increase in cross-entropy loss. A "perfect" SAE has zero degradation; a typical SAE adds 0.05-0.2 nats per token.
- Autointerp. Have a strong LLM (typically Claude or GPT-4) generate a natural-language description of each feature from its top-activating examples, then evaluate whether the description predicts activations on held-out examples. Modern SAEs score well on this test for the most-active features and worse for rare ones.
- Feature stability across seeds. MMCS across two independent training runs.
- Causal effects under ablation. Zero feature $j$, measure behavioral change; compare to ablating a random direction with matched norm.
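A minimal sketch of the first three axes. The FVU and L0 helpers are generic; the downstream-CE helper assumes a hypothetical `run_with_replacement` harness (real code would use forward hooks or a tool like TransformerLens's `run_with_hooks`):

```python
import torch

def fvu(x: torch.Tensor, x_hat: torch.Tensor) -> float:
    """Fraction of variance unexplained; 1 - FVU plays the role of R^2."""
    resid = (x - x_hat).pow(2).sum()
    total = (x - x.mean(dim=0)).pow(2).sum()
    return (resid / total).item()

def mean_l0(z: torch.Tensor) -> float:
    return (z != 0).float().sum(dim=-1).mean().item()

@torch.no_grad()
def downstream_ce_delta(run_with_replacement, tokens, sae) -> float:
    """CE increase when the SAE reconstruction replaces the live activation.

    `run_with_replacement(tokens, fn)` is a hypothetical harness that
    applies fn to one layer's residual stream mid-forward and returns the
    model's cross-entropy loss.
    """
    ce_clean = run_with_replacement(tokens, lambda h: h)
    ce_spliced = run_with_replacement(tokens, lambda h: sae(h)[0])
    return ce_spliced - ce_clean
```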
SAE-Bench (Karvonen et al. 2024) and the Gemma Scope evaluation suite (Lieberum et al. 2024) are public benchmarks attempting to standardize these axes. As of April 2026 the community has not converged on a single best metric or weighting.
Frontier-Scale SAEs: What They Find
Anthropic's Scaling Monosemanticity (Templeton et al. May 2024) trained SAEs on Claude 3 Sonnet's residual stream with dictionary sizes of 1M, 4M, and 34M features. Notable findings:
- Multi-modal features. Features fire on the same concept across English text, code referencing the concept, and images of the concept. The "Golden Gate Bridge" feature fires on text mentioning the bridge, on photos of the bridge, and on related concepts (San Francisco, fog, suspension bridges) at lower magnitudes.
- Safety-relevant features. Features for sycophancy, deception, biological-weapons synthesis, SQL injection, racist content. These were used to demonstrate that SAE features can localize dangerous behaviors, with implications for both safety auditing and steering the model away from those behaviors.
- Compositional features. Features for syntactic patterns ("subject-verb agreement"), semantic relations ("X is a member of Y"), task-style patterns ("the model is being asked to refuse").
- Scaling pattern. Useful feature count scales roughly linearly with SAE width up to the 34M-feature run, suggesting that frontier models contain at least that many distinct features.
DeepMind's Gemma Scope (Lieberum et al. July 2024) released JumpReLU SAEs across layers and sites of Gemma-2-2B and Gemma-2-9B as a public artifact. This made frontier-scale SAE research reproducible outside major labs for the first time.
Exercises
Problem
An SAE with dictionary size $m$ is trained on $d$-dimensional activations; on a typical input, $k$ features are nonzero. What is the overcompleteness factor? What fraction of features is active per input? Evaluate both for representative values such as $d = 4096$, $m = 32d$, $k = 64$ (illustrative choices). Why must the overcompleteness factor be much larger than 1 (cite the superposition hypothesis)?
Problem
Show that, with the dictionary fixed, the per-input SAE encoding objective is exactly the LASSO problem (with $W_{\text{dec}}$ as the design matrix and $z$ as coefficients). Then sketch why the L1 ball geometry produces sparsity while the L2 ball does not.
Problem
Derive the L1 shrinkage explicitly. Suppose $W_{\text{dec}}$ has orthonormal columns and the true sparse code is $z^*$ with $x = W_{\text{dec}} z^*$ (no noise). Compute the L1-penalized optimum $\hat{z}$ in closed form, and describe the bias of $\hat{z}$ relative to $z^*$.
Problem
Implement TopK SAE backward-pass logic in pseudocode. Specifically, given a forward pass that computes $u = W_{\text{enc}}(x - b_{\text{pre}})$, $z = \mathrm{TopK}_k(u)$ (the top-$k$ entries kept, others zeroed), $\hat{x} = W_{\text{dec}} z + b_{\text{pre}}$, and $\mathcal{L} = \lVert x - \hat{x} \rVert_2^2$, derive the gradient $\partial \mathcal{L} / \partial W_{\text{enc}}$ and explain why no straight-through estimator is needed.
Problem
Capacity bound for superposition. Use the Welch bound to derive a quantitative cap on how many features a $d$-dimensional residual stream can carry in superposition with pairwise inner-product slack $\epsilon$. Then estimate the cap for $d = 4096$ (a 7B-parameter transformer) at $\epsilon = 0.1$ and at $\epsilon = 0.05$. What does this imply about the feasibility of frontier-scale SAEs with tens of millions of features?
Summary
- Polysemanticity is empirical: individual neurons fire on many concepts.
- Superposition is the explanatory hypothesis: networks pack features as nearly-orthogonal directions.
- L1 SAE (Bricken et al. 2023): minimize $\lVert x - \hat{x} \rVert_2^2 + \lambda \lVert z \rVert_1$, decoder columns unit-norm. Suffers from shrinkage and dead features.
- TopK SAE (Gao et al. 2024 / OpenAI): replace L1 with hard top-$k$ selection. Eliminates shrinkage, Pareto-dominates L1.
- JumpReLU SAE (Rajamanoharan et al. 2024 / DeepMind): per-feature learnable threshold with straight-through gradient on the L0 indicator. Per-feature adaptive sparsity.
- Matryoshka SAE (Bushnaq / Bussmann 2024): nested dictionaries trained with prefix-sum loss. Multi-resolution decomposition.
- Frontier-scale evidence: Anthropic's Claude 3 Sonnet decomposition (Templeton et al. May 2024) finds millions of interpretable features; DeepMind's Gemma Scope (Lieberum et al. July 2024) releases public SAEs.
- Steering: adding a feature direction to the residual stream produces predictable behavioral changes (Golden Gate Claude).
- Evaluation: $R^2$, $L_0$, downstream loss preservation, autointerp, MMCS across seeds, ablation effects. No single metric suffices.
- Open problems: feature splitting, feature absorption, cross-layer composition, adaptive sparsity, evaluating "real" features.
References
Canonical:
- Olshausen & Field, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature 381 (1996). The original sparse coding paper for natural images, foundational to dictionary learning for feature extraction.
- Bricken, Templeton, Batson, Chen et al., "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning," Anthropic Transformer Circuits Thread (October 2023). The L1 SAE construction applied to a 1L transformer; the empirical demonstration that SAEs recover monosemantic features.
- Elhage, Hume, Olsson, Schiefer, Henighan, Kravec, Hatfield-Dodds, Lasenby, Drain, Chen, Grosse, McCandlish, Kaplan, Amodei, Wattenberg, Olah, "Toy Models of Superposition" (2022). Foundational toy-model analysis of superposition, the phase diagram in sparsity and feature importance, and the polytope solutions.
- Cunningham, Ewart, Riggs, Huben, Sharkey, "Sparse Autoencoders Find Highly Interpretable Features in Language Models," arXiv:2309.08600 (ICLR 2024). Independent reproduction of Bricken et al. on Pythia and GPT-2; established that SAE features generalize across model families.
- Templeton, Conerly, Marcus, Lindsey, Bricken, Chen, Pearce, Citro, Ameisen, Jermyn, Anil, Denison, Askell, Lasenby, Wu, Kravec, Schiefer, Maxwell, Joseph, Hatfield-Dodds, Tamkin, Nguyen, McLean, Burke, Hume, Carter, Henighan, Olah, "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet," Anthropic Transformer Circuits Thread (May 2024). The frontier-scale demonstration: SAEs up to 34M features on Claude 3 Sonnet, multi-modal features, safety-relevant features, the Golden Gate Claude steering result.
- Scherlis, Sachan, Jermyn, Benton, Shlegeris, "Polysemanticity and Capacity in Neural Networks," arXiv:2210.01892 (2022). Rigorous capacity analysis for superposition.
Current:
- Gao, la Tour, Tillman, Goh, Troll, Radford, Sutskever, Leike, Wu, "Scaling and evaluating sparse autoencoders," arXiv:2406.04093 (June 2024). The OpenAI TopK SAE paper. Trained 16M-feature SAEs on GPT-4 activations; introduces the AuxK loss for dead features; introduces multi-axis evaluation including downstream loss preservation.
- Rajamanoharan, Conmy, Lieberum, Smith, Varma, Sonnerat, Krakovna, Kramár, Shah, Nanda, "Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders," arXiv:2407.14435 (July 2024). The DeepMind JumpReLU SAE paper with straight-through gradient on the L0 indicator.
- Rajamanoharan, Conmy, Smith, Sonnerat, Lieberum, Krakovna, Kramár, Shah, Nanda, "Improving Dictionary Learning with Gated Sparse Autoencoders," arXiv:2404.16014 (April 2024). The DeepMind Gated SAE: separate gating from magnitude estimation. JumpReLU's predecessor; closely related architecture.
- Lieberum, Rajamanoharan, Conmy, Smith, Sonnerat, Varma, Kramár, Dragan, Shah, Nanda, "Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2," arXiv:2408.05147 (August 2024). The DeepMind public release of JumpReLU SAEs across Gemma-2-2B and Gemma-2-9B layers and sites.
- Bussmann, Leask, Nanda, "Showing SAE Latents are not Atomic Using Meta-SAEs," LessWrong / Alignment Forum (August 2024). Empirical demonstration that L1/TopK SAE features themselves decompose into finer features under a meta-SAE; one motivation for Matryoshka.
- Bushnaq, Nanda, et al., "Matryoshka Sparse Autoencoders," LessWrong / Alignment Forum (October 2024). Matryoshka construction with nested dictionaries.
- Karvonen, Wright, Rager, Angell, Brinkmann, Smith, Verdun, Bau, Marks, "SAEBench: A Comprehensive Benchmark for Sparse Autoencoders" (2024). Standardized benchmark across sparsity, fidelity, and interpretability axes.
- Marks, Rager, Michaud, Belrose, Sharkey, Bau, "Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models," arXiv:2403.19647 (2024). Causal-graph extraction from SAE features; one of the strongest demonstrations of SAE-based mechanistic understanding.
- Lindsey, Templeton, Marcus, Chen, et al., "Cross-Layer Transcoders," Anthropic Transformer Circuits Thread (2024-2025). Architecture for tracing concepts across layers using SAE-style decompositions.
Critique:
- Anders, Neuhaus, Rauker, Sharkey, "The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks," arXiv:2405.10928 (May 2024). Argues that SAE features may not be the right basis for circuit-level analysis; introduces the local interaction basis as a competitor.
- Engels, Liao, Michaud, Gurnee, Tegmark, "Not All Language Model Features Are Linear," arXiv:2405.14860 (May 2024). Empirical finding that some features are circular (encoded as 2D rotational patterns), undermining the strict linear-feature assumption underlying L1/TopK/JumpReLU SAEs.
- Bussmann, Leask, Nanda, "Meta-SAEs and Feature Splitting" (LessWrong, 2024, ongoing). Documents the feature splitting phenomenon and its implications for interpretation.
- Black, Olszewski, Cundy, Conmy, "Interpreting Neural Networks through the Polytope Lens," arXiv:2310.07466 (2023). Argues that polytope structure (rather than linear features) is the more fundamental object; complementary to but distinct from the SAE program.
- Hindupur, Lubana, Fel, Ba, "Probing the Inadequacy of Sparse Autoencoders for Linearly Decodable Concepts," arXiv:2502.05407 (Feb 2025). Documents cases where SAEs fail to recover concepts that linear probes find, raising questions about completeness.
- Marks, Templeton, Sharkey, et al., "What is the Difference Between Steering and Probing?" (2024). The steering-probing gap: SAE features can be probed accurately without producing clean steering, and vice versa, suggesting the two operations measure different things.
Next Topics
- Mechanistic interpretability: the broader research program SAEs feed into; circuits, probing, and behavioral analysis.
- Induction heads: a specific interpretable circuit discovered by direct mechanistic analysis (predates SAEs); a useful comparison case.
- Residual stream and transformer internals: the architectural framework for understanding where SAE features live (residual stream vs MLP output vs attention output).
- Feature importance and interpretability: classical (non-mechanistic) feature attribution methods for comparison.
Last reviewed: April 19, 2026