Sparse Autoencoders for Interpretability: TopK, JumpReLU, Matryoshka, and Scaling
Sparse autoencoders (SAEs) decompose neural network activations into interpretable feature directions by learning an overcomplete dictionary with a sparsity constraint. Coverage of the superposition hypothesis, the L1 SAE objective and its shrinkage failure mode, TopK SAEs (Gao et al. 2024 / OpenAI), JumpReLU SAEs (Rajamanoharan et al. 2024 / DeepMind), Matryoshka SAEs (Bushnaq / Bussmann 2024), Anthropic's million-feature Claude 3 Sonnet decomposition (Templeton et al. 2024), feature splitting, dead features, evaluation, and steering.
Prerequisites
- Lasso regression
- Autoencoders
- Mechanistic interpretability: features, circuits, and causal faithfulness
- Residual stream and transformer internals
- Truth directions and linear probes
Why This Matters
Individual neurons in a trained transformer are polysemantic: a single neuron activates for many unrelated concepts. In Claude 3 Sonnet, concepts as different as the Golden Gate Bridge, security vulnerabilities, sycophantic praise, and DNA sequences are each spread across many neurons rather than owned by any single one (Templeton et al. 2024). Reading neuron activations directly tells you almost nothing about what the model is computing.
The superposition hypothesis (Elhage et al. 2022) explains why: networks pack many more features than neurons by encoding each feature as a direction in activation space. When features are sparse (each one is active on only a small fraction of inputs), nearly-orthogonal packing lets a $d$-dimensional space hold $m \gg d$ distinguishable feature directions (Welch / Kabatyanskii-Levenshtein bounds for sphere packings). Sparse autoencoders (SAEs) recover those feature directions by training an overcomplete dictionary on activation samples.
The 2024 wave of SAE research addressed three real problems with the original Bricken-style L1 SAE:
- Shrinkage. L1 penalizes the magnitude of active features, so the decoder has to over-scale to compensate, biasing reconstruction. JumpReLU and Gated SAEs separate gating from magnitude (Rajamanoharan et al. 2024).
- Sparsity-reconstruction tradeoff. L1 imposes a soft penalty whose hyperparameter trades off reconstruction against L0 sparsity in a model-specific way. TopK SAEs replace L1 with an exact L0 constraint (Gao et al. 2024 from OpenAI on GPT-4 activations).
- Feature granularity. A fixed dictionary size commits to one resolution. Matryoshka SAEs train a nested family of dictionaries so that the first $m_1$ features form a coarse decomposition and adding more reveals finer-grained subfeatures (Bushnaq et al. 2024; Bussmann 2024).
Anthropic's Scaling Monosemanticity (Templeton et al. May 2024) and DeepMind's Gemma Scope (Lieberum et al. July 2024, arXiv:2408.05147) showed that SAEs work at frontier-model scale, recovering millions of interpretable features in production-class models. SAEs are now the default tool for mechanistic interpretability research at every major lab.
Mental Model
You have a large dataset of activations $x_i \in \mathbb{R}^d$ collected by running the base model on text. You hypothesize that each $x$ is approximately a sparse linear combination of feature directions $f_1, \dots, f_m \in \mathbb{R}^d$ with $m \gg d$ (overcomplete). The SAE jointly learns the dictionary $\{f_j\}$ and a per-input sparse code $z(x) \in \mathbb{R}^m_{\geq 0}$ such that $x \approx \sum_j z_j(x) f_j$, with $z(x)$ having only a handful of nonzero entries. The decoder is the matrix $W_{\text{dec}} \in \mathbb{R}^{d \times m}$ whose columns are the $f_j$; the encoder is a small neural network that produces the code $z(x)$ from each input.
The interpretability payoff: each $f_j$ is a direction in activation space whose meaning can be characterized by inspecting which inputs cause $z_j(x)$ to be large. A typical "good" feature might fire only on French text about wine, only on Python list comprehensions, or only on inputs that prompt a refusal. Once features are extracted, you can probe the model's internal vocabulary, ablate features to test causal effect, and steer generation by adding feature directions to the residual stream (Anthropic's "Golden Gate Claude" demo, May 2024).
Feature direction
A unit vector $f_j \in \mathbb{R}^d$ in activation space corresponding to a column of the SAE decoder $W_{\text{dec}}$. The reconstructed activation is $\hat{x} = \sum_j z_j(x) f_j$, where $z_j(x)$ is the encoder's score for feature $j$ on input $x$. A feature direction is monosemantic if its top-activating inputs share a single human-interpretable concept; polysemantic if they correspond to several unrelated concepts.
Overcomplete dictionary
A dictionary of $m$ feature directions with $m \gg d$. The ratio $m/d$ is the overcompleteness factor; typical published values range from a few times $d$ to a few hundred times $d$, with frontier-scale Anthropic and DeepMind SAEs pushing to still larger dictionaries on production models. Overcompleteness is the architectural commitment to the superposition hypothesis: there are more features than dimensions.
The Superposition Hypothesis
Superposition Hypothesis
Statement
A neural network with activation dimension $d$ can represent $m \gg d$ features by encoding each feature as a direction in $\mathbb{R}^d$. In the toy-model regime of Elhage et al. (2022), the number of representable features grows as the per-feature activation probability $p$ shrinks. In the very-sparse limit $p \to 0$, the count approaches the Welch/Kabatyanskii-Levenshtein bound of $\exp(\Theta(\epsilon^2 d))$ nearly-orthogonal directions with pairwise inner product bounded by $\epsilon$. This scaling should be read as an informal observation from a controlled toy model, not a tight theoretical bound; see Scherlis-Sachan-Jermyn-Benton-Shlegeris (2022, arXiv:2210.01892) for a more rigorous capacity analysis.
Intuition
In $\mathbb{R}^d$, you can fit exactly $d$ orthogonal directions. If you allow a small angular slack $\epsilon$ (small interference), you can fit exponentially many more nearly-orthogonal directions; this is the Johnson-Lindenstrauss phenomenon applied to feature representation. The network exploits sparsity: when two features whose directions weakly interfere are unlikely to be active together, the interference rarely produces output errors, so the cost of packing more features is low.
Proof Sketch
Elhage et al. (2022) study the toy model $x \mapsto \mathrm{ReLU}(W^\top W x + b)$ with sparse $x$ and analyze the optimal $W$ as a function of the per-feature activation probability $p$. The optimum goes through three regimes as $p$ decreases from $1$ toward $0$: (1) for dense features the model learns $d$ orthogonal directions and discards the rest; (2) at moderate sparsity it learns antipodal pairs that share a direction up to sign; (3) at high sparsity it learns polytopes (digons, triangles, tetrahedra) that pack features into nearly-orthogonal subsets with controlled interference. The transition is governed by a phase diagram in sparsity and feature importance. In the high-sparsity regime, sphere-packing bounds (Welch / Kabatyanskii-Levenshtein) cap the number of $\epsilon$-orthogonal directions at $\exp(O(\epsilon^2 d))$.
Why It Matters
Superposition is the empirical reason individual neurons are unreliable units of analysis. The meaningful unit is a feature direction (a decoder column), not a neuron axis. SAEs are the construction that recovers feature directions from the superposed representation, producing a dictionary of monosemantic atoms from a polysemantic activation space. Without superposition, polysemanticity would be an artifact of training noise and would disappear with scale; with superposition, polysemanticity is a prediction and SAEs are the right tool to handle it.
Failure Mode
The hypothesis is well-supported in the toy models of Elhage et al. (2022) and in small transformers (Bricken et al. 2023). Open questions for frontier models: (1) what fraction of features are truly linearly represented vs. encoded in cross-layer composition (cf. the polytope-lens critique, Black et al. 2023); (2) how does the feature count scale with model size (Templeton et al. 2024 found roughly linear scaling of useful feature count with SAE width up to the largest dictionaries tested); (3) whether the clean basis assumption (each feature has a single direction) holds, given growing evidence of feature splitting and feature absorption (Bussmann et al. 2024). Treat the hypothesis as a working framework that has earned empirical credit, not as a theorem about real models.
The L1 SAE Objective and the Shrinkage Problem
L1 Sparse Autoencoder (Bricken et al. 2023)
Statement
The L1 SAE learns an encoder $(W_{\text{enc}}, b_{\text{enc}})$ and decoder $(W_{\text{dec}}, b_{\text{dec}})$ minimizing

$$\mathcal{L}(x) = \lVert x - \hat{x}(x) \rVert_2^2 + \lambda \lVert z(x) \rVert_1$$

with

$$z(x) = \mathrm{ReLU}\big(W_{\text{enc}}(x - b_{\text{dec}}) + b_{\text{enc}}\big), \qquad \hat{x}(x) = W_{\text{dec}}\, z(x) + b_{\text{dec}},$$

and $\lVert (W_{\text{dec}})_{:,j} \rVert_2 = 1$ for all $j$ (unit-norm decoder columns). The decoder columns are the learned feature directions; the encoder activations $z_j(x)$ score how much each feature is present in $x$.
Intuition
This is dictionary learning posed as a single-hidden-layer autoencoder with a sparsity penalty. The L1 penalty pushes the encoder activations to zero whenever the marginal cost of activation exceeds the marginal reconstruction gain, producing a sparse code in which only a small subset of features is active per input. The unit-norm constraint on decoder columns prevents the optimizer from cheating by scaling features arbitrarily small (which would make the L1 cost vanish without producing genuine sparsity).
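A minimal PyTorch sketch of this architecture, following the formulation above; shapes, initialization, and the $\lambda$ value are illustrative choices, not the reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class L1SAE(nn.Module):
    """Minimal L1 sparse autoencoder (Bricken et al. 2023 style)."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_dict) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_dict))
        self.W_dec = nn.Parameter(torch.randn(d_dict, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))  # pre-encoder bias

    def forward(self, x):
        z = F.relu((x - self.b_dec) @ self.W_enc + self.b_enc)  # sparse code
        x_hat = z @ self.W_dec + self.b_dec                     # reconstruction
        return x_hat, z

def l1_sae_loss(x, x_hat, z, lam=1e-3):
    recon = (x - x_hat).pow(2).sum(-1).mean()
    sparsity = z.abs().sum(-1).mean()
    return recon + lam * sparsity

@torch.no_grad()
def renorm_decoder(sae):
    # Keep feature directions unit-norm so the L1 penalty cannot be gamed
    # by shrinking codes while inflating the decoder.
    sae.W_dec.data /= sae.W_dec.data.norm(dim=-1, keepdim=True)
```

In this layout the feature directions are the rows of `W_dec` (the transpose of the column convention in the text); `renorm_decoder` would be called after each optimizer step to maintain the unit-norm constraint.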
Proof Sketch
The objective is amortized LASSO. Fixing the dictionary $W_{\text{dec}}$, the optimal per-input code $\arg\min_z \lVert x - W_{\text{dec}} z \rVert_2^2 + \lambda \lVert z \rVert_1$ is exactly LASSO, whose solution (for an orthonormal dictionary) is the soft-thresholding of the least-squares code. The encoder approximates this LASSO solution in a single forward pass; the unit-norm constraint on $W_{\text{dec}}$ identifies the solution up to permutation of columns and removes the trivial scale degeneracy. Joint training of encoder and decoder amortizes the LASSO solve across the data distribution.
Why It Matters
The L1 SAE is the historical baseline (Bricken et al. October 2023) that opened the modern interpretability research program at Anthropic and elsewhere. It produces interpretable features at small scale (single MLP layer of a 1L transformer), and the same architecture scales to GPT-2-class models (Cunningham et al. 2023, arXiv:2309.08600). Every later SAE variant is described relative to this baseline.
Failure Mode
- Shrinkage. The L1 penalty biases active feature magnitudes downward (the well-known LASSO shrinkage problem). If the optimal reconstruction code is $z^*$, the L1-penalized solution is approximately the soft-thresholded code $\mathrm{sign}(z^*)\max(|z^*| - \lambda/2,\, 0)$. To compensate, the SAE has to learn an over-scaled decoder, which biases reconstruction in a structured way. Empirically, L1 SAEs sit on a reconstruction-sparsity Pareto frontier that is strictly worse than TopK and JumpReLU (Gao et al. 2024 Figure 4; Rajamanoharan et al. 2024 Figure 2).
- Dead features. Many dictionary elements never activate. Bricken et al. (2023) reported up to 90% dead features at the optimal $\lambda$; the standard fix is resampling (re-initialize dead encoder rows toward high-loss data points at regular intervals during training).
- Feature splitting. A single semantic concept can be split across multiple dictionary elements that fire on slightly different sub-distributions; this is the SAE-side artifact of forcing too many features to be present (Anthropic's Scaling Monosemanticity discusses this at length).
TopK SAEs: Hard Sparsity
TopK Sparse Autoencoder (Gao et al. 2024 / OpenAI)
Statement
The TopK SAE replaces the L1 penalty with an exact L0 constraint. Define $\mathrm{TopK}_k(\cdot)$ as the operator that keeps the $k$ largest entries of a vector and zeros the rest. The encoder is

$$z(x) = \mathrm{TopK}_k\big(W_{\text{enc}}(x - b_{\text{pre}})\big)$$

and the loss is pure reconstruction:

$$\mathcal{L}(x) = \big\lVert x - W_{\text{dec}}\, z(x) - b_{\text{pre}} \big\rVert_2^2.$$

No sparsity penalty appears; sparsity is enforced exactly by construction. Gao et al. (2024, arXiv:2406.04093) show on GPT-4 activations that TopK SAEs strictly Pareto-dominate L1 SAEs on the sparsity-reconstruction frontier and eliminate the shrinkage problem.
Intuition
L1 enforces sparsity softly by paying a cost proportional to magnitude; this couples gating (which features fire) to magnitude (how much they fire). TopK enforces sparsity as a hard constraint by selecting exactly the top $k$ features per input: gating is a discrete decision, and magnitude is set by reconstruction alone. Removing the magnitude penalty removes shrinkage.
Proof Sketch
Shrinkage-elimination argument. Suppose the true sparse code on input $x$ has exactly $k$ nonzero entries. The L1 SAE's optimum is the soft-thresholded code, biased downward by roughly $\lambda/2$ per active entry. The TopK SAE sets the active indices via TopK and the active magnitudes via least-squares regression onto the chosen columns; no penalty appears in the magnitude solve, so the magnitudes are unbiased. The Pareto improvement follows directly: at fixed L0, TopK achieves the least-squares optimum; L1 achieves a strictly worse value due to soft-thresholding.
Gradient pass-through. TopK is non-differentiable, but the gradient flows through the selected indices identically to a standard linear layer; non-selected indices receive zero gradient. There is no straight-through estimator needed for the selection step itself; the discrete choice is treated as a hard mask in the backward pass.
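A sketch of this forward/backward behavior in PyTorch; `topk_activation` is a hypothetical helper name, but the mechanism (an out-of-place scatter acting as a hard mask) is exactly why no straight-through estimator is needed:

```python
import torch

def topk_activation(u: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest entries of each row, zero the rest."""
    vals, idx = torch.topk(u, k, dim=-1)
    # Out-of-place scatter: gradients flow to `vals` (hence to the selected
    # entries of u) and nowhere else; the selection acts as a hard mask.
    return torch.zeros_like(u).scatter(-1, idx, vals)

u = torch.randn(4, 1024, requires_grad=True)
z = topk_activation(u, k=32)
z.sum().backward()
assert u.grad.count_nonzero() == 4 * 32  # gradient only at selected indices
```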
Auxiliary loss for dead features. Gao et al. add an auxiliary AuxK loss that asks the top-$k_{\text{aux}}$ dead features to reconstruct the residual error, keeping the dictionary alive without resampling.
Why It Matters
TopK SAEs are the current default at OpenAI (Gao et al. trained 16M-feature SAEs on GPT-4 activations) and one of two defaults at most other interp labs. The pattern of "discrete top-K selection, dense regression on the chosen subset" is a clean architectural template that translates to any setting where you want a hard sparsity constraint without a penalty hyperparameter.
Failure Mode
TopK fixes $k$ globally, but the true number of active features per input can vary (a long sentence about many topics needs more features than a short uniform one). Per-input adaptive $k$ is an open design question; some recent work uses a learned gating network or a soft TopK relaxation. TopK also still suffers from feature splitting and feature absorption; the L0 constraint controls count but not granularity, so SAEs at different $k$ can still find substantially different feature decompositions.
JumpReLU SAEs: Separating Gating from Magnitude
JumpReLU Sparse Autoencoder (Rajamanoharan et al. 2024 / DeepMind)
Statement
The JumpReLU activation is a per-coordinate piecewise function with a learnable threshold $\theta_j > 0$:

$$\mathrm{JumpReLU}_{\theta_j}(u_j) = u_j \cdot \mathbf{1}[u_j > \theta_j].$$

The encoder is $z(x) = \mathrm{JumpReLU}_\theta(W_{\text{enc}} x + b_{\text{enc}})$ applied coordinate-wise, and the loss combines reconstruction with an L0 penalty:

$$\mathcal{L}(x) = \lVert x - \hat{x}(x) \rVert_2^2 + \lambda \sum_j \mathbf{1}[z_j(x) > 0].$$
The L0 indicator is non-differentiable; Rajamanoharan et al. (2024, arXiv:2407.14435) train it with a straight-through estimator using a kernel-density approximation of the Heaviside step.
Intuition
A standard ReLU activation commits two errors at once: it zeros small magnitudes (good, that is gating), and it shrinks remaining magnitudes near the threshold (bad, that is shrinkage). JumpReLU separates these: the magnitude pass-through is exact ($z_j = u_j$ whenever the unit is active), and the gating decision is a hard threshold at $\theta_j$. Compared to TopK, JumpReLU has per-feature adaptive sparsity: a feature with a tiny threshold fires often, one with a large threshold fires rarely. The L0 penalty then trades off how many features fire, summed across the dictionary.
Proof Sketch
Step 1 (magnitude pass-through). When $u_j > \theta_j$, $z_j = u_j$ exactly. Reconstruction sees the unbiased magnitude, so there is no shrinkage bias on active features.
Step 2 (gating decision). The indicator $\mathbf{1}[u_j > \theta_j]$ is a hard threshold whose gradient is zero almost everywhere. Rajamanoharan et al. use a straight-through estimator: in the forward pass, use the indicator; in the backward pass, replace it with a smooth kernel of narrow bandwidth $\varepsilon$. As $\varepsilon \to 0$ the kernel becomes a delta function and the surrogate gradient is unbiased on average but high-variance.
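A sketch of this estimator as a custom autograd function. The rectangle kernel and the pseudo-derivative $-(\theta/\varepsilon)\,K((u-\theta)/\varepsilon)$ follow the paper's construction, but the details here (kernel choice, batching, the $\varepsilon$ value) are simplified assumptions:

```python
import torch

class JumpReLUSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, u, theta, eps):
        gate = (u > theta).to(u.dtype)       # hard gate in the forward pass
        ctx.save_for_backward(u, theta, gate)
        ctx.eps = eps
        return u * gate                      # exact magnitudes when active

    @staticmethod
    def backward(ctx, grad_out):
        u, theta, gate = ctx.saved_tensors
        eps = ctx.eps
        # Rectangle kernel K centered on the threshold, bandwidth eps.
        window = ((u - theta).abs() < eps / 2).to(u.dtype)
        grad_u = grad_out * gate             # pass-through on active units
        # Surrogate pseudo-derivative w.r.t. theta: -(theta/eps) * K(...).
        grad_theta = (grad_out * (-theta / eps) * window).sum(dim=0)
        return grad_u, grad_theta, None

u = torch.randn(8, 512)
theta = torch.full((512,), 0.5, requires_grad=True)
z = JumpReLUSTE.apply(u, theta, 0.1)         # eps = 0.1 is illustrative
z.sum().backward()
```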
Step 3 (Pareto improvement over L1 + ReLU). At any fixed L0 sparsity level, JumpReLU achieves strictly better reconstruction than the L1 + ReLU baseline because (1) magnitudes are unbiased and (2) the per-feature threshold can adapt to feature scale (no single global $\lambda$ forcing one threshold on all features). Empirically (Rajamanoharan et al. Figure 2), JumpReLU matches or slightly beats TopK on the sparsity-reconstruction frontier across Gemma-2-2B and Gemma-2-9B layers.
Why It Matters
JumpReLU is one of two architectures (with TopK) used to train DeepMind's Gemma Scope (Lieberum et al. July 2024, arXiv:2408.05147), a public release of JumpReLU SAEs across Gemma-2-2B and Gemma-2-9B layers and sites. Per-feature adaptive thresholds make JumpReLU especially well-suited to scale: in TopK every input gets exactly $k$ features, while in JumpReLU some inputs naturally activate more features than the average. This matches the empirical reality that sentences vary in semantic complexity.
Failure Mode
The straight-through estimator is unbiased only in expectation and has high variance for small kernel bandwidth $\varepsilon$; large $\varepsilon$ biases the gating decision. Picking $\varepsilon$ is an additional hyperparameter that L1 and TopK SAEs avoid. JumpReLU SAEs also rely on per-feature thresholds being well-initialized; cold-start training can leave many features dead until thresholds settle. The Gated SAE variant (Rajamanoharan et al. April 2024, arXiv:2404.16014) is closely related and has the same shrinkage-elimination property; the choice between Gated and JumpReLU is mostly a matter of implementation taste.
Matryoshka SAEs: Multi-Resolution Dictionaries
Matryoshka Sparse Autoencoder (Bushnaq et al. / Bussmann 2024)
Statement
A Matryoshka SAE trains a single encoder/decoder pair such that the first $m_i$ dictionary elements form a stand-alone SAE for every prefix size $m_1 < m_2 < \dots < m_P = m$ in a chosen geometric series. The training loss is

$$\mathcal{L}(x) = \sum_{i=1}^{P} w_i \,\big\lVert x - \hat{x}_{1:m_i}(x) \big\rVert_2^2 + \text{sparsity penalty},$$

where $\hat{x}_{1:m_i}$ reconstructs using only the first $m_i$ features, with weights $w_i > 0$ and per-prefix sparsity targets scaled to maintain a roughly constant ratio of active features to prefix size.
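A sketch of the prefix-sum loss under the notation above; the prefix sizes and uniform weights are illustrative assumptions:

```python
import torch

def matryoshka_loss(x, z, W_dec, b_dec,
                    prefixes=(2048, 8192, 32768), weights=None):
    """Prefix-sum reconstruction loss.

    x: (batch, d) activations; z: (batch, m) sparse codes from any encoder
    (TopK, JumpReLU, ...); W_dec: (m, d) decoder whose first m_i rows must
    stand alone as a dictionary for every prefix size m_i in `prefixes`.
    """
    weights = weights if weights is not None else [1.0] * len(prefixes)
    loss = x.new_zeros(())
    for w_i, m_i in zip(weights, prefixes):
        x_hat_i = z[:, :m_i] @ W_dec[:m_i] + b_dec   # prefix-only decode
        loss = loss + w_i * (x - x_hat_i).pow(2).sum(-1).mean()
    return loss
```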
Intuition
A standard SAE commits to one dictionary size and one feature granularity. Increasing $m$ produces finer features (more concept splitting) but also more dead features and more redundancy. Matryoshka SAEs hedge: the smallest prefix (say, the first 64 dictionary elements) forms a coarse decomposition (high-level concepts like "Python code", "French text", "math notation"), a middle prefix forms a moderate decomposition (sub-types of those concepts), and the full dictionary forms a fine decomposition. A user can choose the granularity post hoc by truncating the dictionary at any prefix length.
Proof Sketch
Step 1 (nested feature representation). Training with the prefix-sum loss forces the first dictionary columns to carry as much reconstruction quality as possible at the smallest size, then the next columns to add residual reconstruction, and so on. This is identical in spirit to the Matryoshka Representation Learning construction (Kusupati et al. 2022, arXiv:2205.13147) for embeddings, applied to dictionary learning.
Step 2 (avoiding feature splitting). A standard SAE forced to use features has no constraint on whether two features are coarse-then-fine or two siblings of the same coarse concept. The Matryoshka loss biases the optimization toward a hierarchy: coarse concepts in the first prefix, refinements in later prefixes. Empirically (Bussmann et al. 2024 LessWrong report) this produces fewer split features and clearer feature trees.
Step 3 (computational cost). The Matryoshka loss costs one decoder pass per prefix per training step relative to a standard SAE; with $P$ prefixes the overhead is roughly $P\times$ in decoder compute. Some implementations amortize this by sharing the encoder forward pass and only doing separate decoder solves.
Why It Matters
Feature splitting is the most-cited critique of standard SAEs: a single semantic concept (say, "the word 'the' in English") can be split across 5-10 dictionary elements that each fire on slightly different contexts, making downstream analysis ambiguous. Matryoshka SAEs partially address this by forcing the dictionary to organize hierarchically. The construction also gives a knob for users who want different feature granularities for different analyses (coarse for steering, fine for circuit analysis).
Failure Mode
Matryoshka SAEs are still empirical; rigorous evaluations against TopK and JumpReLU baselines on the same tasks are limited as of 2026 Q1. The training cost (several times a standard SAE) limits scaling. The hierarchy is implicit in the loss, not enforced; some Matryoshka runs still produce flat splits at certain depths. The closely-related Gated SAE and cross-layer transcoder lines (Lindsey et al. 2024-2025) attack feature granularity from different angles, and the field has not converged on a single dominant approach.
Worked Examples
Sphere-packing capacity for d = 4096
Let $d = 4096$ (the residual stream dimension of a typical 7B-parameter transformer). How many features can the network represent in superposition?
For pairwise inner-product slack $\epsilon = 0.1$ (allowing 10% "off-axis" projection between any two features), packing arguments give a count of nearly-orthogonal unit vectors in $\mathbb{R}^{4096}$ that is exponential in $\epsilon^2 d \approx 41$: hundreds of millions of directions or more, depending on the constant in the exponent. Tightening the slack to $\epsilon = 0.05$ divides the exponent by four and collapses the count by many orders of magnitude, so the number of features the model can pack scales exponentially in $d$ but falls off extremely steeply in the slack tolerance $\epsilon$.
This is one reason Anthropic's Scaling Monosemanticity needed dictionary sizes up to 34M features to find interpretable features in Claude 3 Sonnet's residual stream: the model can plausibly be running with many millions of features in superposition, and the SAE has to be at least that large to recover them.
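Back-of-envelope arithmetic for the exponents, assuming a generic $\exp(c\,\epsilon^2 d)$ packing count with an unspecified constant $c$ (set to $1/2$ here as a rough random-vector estimate, not the tight Welch/Kabatyanskii-Levenshtein constant):

```python
import math

d = 4096
c = 0.5  # assumed constant in exp(c * eps^2 * d); bound-dependent
for eps in (0.1, 0.05):
    log10_count = c * eps**2 * d / math.log(10)
    print(f"eps={eps}: ~10^{log10_count:.1f} nearly-orthogonal directions")
# eps=0.10 -> ~10^8.9  (hundreds of millions of directions)
# eps=0.05 -> ~10^2.2  (halving eps divides the exponent by four)
```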
L1 shrinkage on a single feature
Consider a single-feature toy: $x = a f$ with feature direction $f$ unit-norm and scalar magnitude $a > 0$. With the dictionary fixed at $f$, the L1-penalized optimum is $\hat{z} = \max(a - \lambda/2,\, 0)$ (the soft-thresholded code; the encoder learns to approximate this). Reconstruction error: $\lVert x - \hat{z} f \rVert_2 = \lambda/2$ when $a > \lambda/2$.
The TopK SAE with $k = 1$ recovers $\hat{z} = a$ exactly (least-squares regression onto the chosen direction with no penalty). Reconstruction error: $0$.
This is the cleanest demonstration of why TopK strictly Pareto-dominates L1 on the sparsity-reconstruction frontier: the magnitude-shrinkage cost paid by L1 is unnecessary once you separate gating from magnitude.
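The same toy in a few lines of NumPy; the feature dimension and the values of $a$ and $\lambda$ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=64); f /= np.linalg.norm(f)  # unit feature direction
a, lam = 3.0, 1.0
x = a * f

# L1-optimal code: soft threshold (minimize (a - z)^2 + lam * z over z >= 0)
z_l1 = max(a - lam / 2, 0.0)
# TopK (k=1) code: plain least squares onto f, no penalty
z_topk = f @ x

print(np.linalg.norm(x - z_l1 * f))    # lam/2 = 0.5  (shrinkage bias)
print(np.linalg.norm(x - z_topk * f))  # 0.0          (exact recovery)
```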
Anthropic's Golden Gate Claude steering demo (May 2024)
Anthropic trained an SAE on Claude 3 Sonnet's middle-layer residual stream and identified a feature that activates on the Golden Gate Bridge: text mentioning the bridge, images of the bridge, references to fog or San Francisco that contextually evoke the bridge. The feature is monosemantic in the sense that its top activations are all bridge-related across modalities.
To steer, Anthropic added a multiple of the feature direction to the residual stream during generation: $h_\ell \leftarrow h_\ell + \alpha f$ for layer $\ell$ and steering coefficient $\alpha$. With $\alpha$ moderate, the model started referencing the Golden Gate Bridge in answers to unrelated questions; with $\alpha$ very large, the model became unable to talk about anything else (the public Golden Gate Claude demo).
Lesson: SAE features are causally meaningful rather than merely correlated with concepts. Adding a feature direction at inference time produces predictable behavioral changes, which is the operational test of mechanistic understanding.
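A hypothetical sketch of residual-stream steering with a PyTorch forward hook. The module path and tuple-handling convention are assumptions about a HuggingFace-style model, not Anthropic's internal tooling:

```python
import torch

def make_steering_hook(feature_dir: torch.Tensor, alpha: float):
    """Forward hook that adds alpha * feature_dir to a block's residual
    output. Assumes the hooked module returns the residual-stream tensor
    (HuggingFace blocks typically return tuples; the tensor is index 0)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * feature_dir.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Usage (model, layer index, and feature_dir are placeholders):
# handle = model.transformer.h[20].register_forward_hook(
#     make_steering_hook(feature_dir, alpha=8.0))
# ... generate ...
# handle.remove()
```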
Dead features and the Gao et al. AuxK fix
A standard L1 SAE trained on the layer-6 residual stream of GPT-2 small typically has 60-90% dead features after convergence (Cunningham et al. 2023; Bricken et al. 2023). Fix 1: resampling, i.e. re-initialize dead encoder rows to point toward inputs the SAE reconstructs poorly, at regular intervals during training.
Fix 2 (Gao et al. 2024): the AuxK loss. Compute the residual error $e = x - \hat{x}$, run a separate TopK over the dead features only with size $k_{\text{aux}}$ (typically 512), and add the auxiliary reconstruction loss scaled by a small coefficient. This activates dead features on the residual error and keeps the dictionary alive without hard resampling.
Empirical result: Gao et al. report dead-feature rates below 7% for their largest TopK SAEs (16M features on GPT-4 activations), compared to far higher rates for L1 SAEs of equivalent size.
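A sketch of the AuxK computation described above; the `dead_mask` bookkeeping (features inactive over a trailing window) is assumed to exist, and details differ from Gao et al.'s implementation:

```python
import torch

def auxk_loss(x, x_hat, u, W_dec, dead_mask, k_aux=512):
    """Auxiliary loss reviving dead features on the residual error.

    x, x_hat: (batch, d); u: (batch, m) encoder pre-activations;
    W_dec: (m, d) decoder; dead_mask: (m,) bool, True for dead features.
    """
    n_dead = int(dead_mask.sum())
    if n_dead == 0:
        return x.new_zeros(())
    e = (x - x_hat).detach()                   # target: the residual error
    u_dead = u.masked_fill(~dead_mask, float("-inf"))
    vals, idx = torch.topk(u_dead, min(k_aux, n_dead), dim=-1)
    z_aux = torch.zeros_like(u).scatter(-1, idx, vals)
    return (e - z_aux @ W_dec).pow(2).sum(-1).mean()
```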
Feature splitting in a small SAE
Train an L1 SAE with a small dictionary on the Pythia-1B layer-4 residual stream, then train another with a much larger dictionary on the same data. The smaller SAE has a single feature that fires on "the word 'the' followed by an adjective"; the larger SAE has six features that fire on slightly different versions of this pattern (the + adjective + noun, the + comparative adjective, the + national-origin adjective, etc.).
This is feature splitting: as $m$ grows, semantic concepts subdivide into finer-grained features. There is no objective ground truth; the "right" granularity depends on the downstream analysis. Matryoshka SAEs partially address this by giving access to the same SAE at multiple granularities post hoc.
Common Confusions
SAE features are learned decompositions, not discovered ground truth
Two SAE training runs with different random seeds produce different feature dictionaries. Two SAE architectures (TopK vs JumpReLU vs Matryoshka) produce different decompositions of the same activation distribution. The features are useful summaries of the activation space, not metaphysically real "atoms" of cognition. The metric Max Mean Cosine Similarity (MMCS) across runs measures dictionary stability and typically lands at 0.7-0.85 for L1 SAEs and 0.85-0.95 for TopK/JumpReLU on Gemma-scale models (Lieberum et al. 2024).
Overcomplete does not guarantee finding all features
An SAE with $m \gg d$ does not necessarily find all features the network uses. It finds the directions that best explain the activation distribution under the architectural sparsity constraint. Rare features (active on a tiny fraction of tokens) may be missed; concentrated features (a single direction explaining a lot of variance) may absorb related concepts; common features may split across dictionary elements. Increasing $m$ helps recall at the cost of feature splitting.
L0 sparsity is per-input, not per-feature
The standard sparsity metric $L_0 = \lVert z(x) \rVert_0$ counts how many features fire on the average input. A typical Gemma Scope SAE activates tens to hundreds of features per token out of a dictionary of tens of thousands to a million, a tiny active fraction. This says nothing about which features tend to fire; some features may fire on a large fraction of inputs (these are usually positional or role-encoding features and deserve separate attention), others may fire almost never (rare specialized concepts). The marginal feature firing rate is a separate diagnostic.
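The two views in code, per-input L0 versus per-feature firing rate; the synthetic codes are illustrative:

```python
import torch

def sparsity_diagnostics(z: torch.Tensor):
    """z: (n_tokens, m) SAE codes. Returns the two distinct sparsity views."""
    active = z != 0
    l0_per_input = active.sum(dim=1).float().mean()   # features per token
    firing_rate = active.float().mean(dim=0)          # per-feature frequency
    return l0_per_input, firing_rate

z = torch.relu(torch.randn(10_000, 4096) - 2.5)       # toy sparse codes
l0, rates = sparsity_diagnostics(z)
print(f"mean L0 per input: {l0:.1f}")
print(f"dead features: {(rates == 0).sum().item()}")
print(f"near-always-on features (>50% of inputs): {(rates > 0.5).sum().item()}")
```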
SAEs train per layer, not on the whole model
Each SAE is fit to one layer and one site (residual stream, MLP output, attention output). Different layers have different SAEs and different feature dictionaries. The same concept can appear at multiple layers with different feature directions; tracing a concept through the network requires cross-layer transcoders (Lindsey et al. 2024) or careful pairing of features across layers via causal probing.
Reconstruction R² is a necessary but insufficient metric
A high $R^2$ means the SAE preserves most of the activation variance, but it does not directly mean the features are interpretable or useful. Two SAEs with the same $R^2$ at the same $L_0$ can have very different feature quality; the standard supplementary diagnostics are (1) autointerp score (use a strong LLM to label features and compute correctness), (2) probe transfer (do features predict downstream behaviors?), (3) causal effects under ablation, (4) MMCS across seeds. The community is still converging on a unified evaluation suite; SAE-Bench (2024) is one attempt.
SAE-based steering is not the only valid feature-causality test
"Adding the feature direction makes the model say bridge-related things" is a necessary condition for the feature to be causally meaningful, not a sufficient one. Spurious directions in activation space can also produce coherent steering effects without corresponding to interpretable features (Marks et al. 2024 on the steering vs probing gap). Strong evidence requires both behavioral steering and causal ablation effects, ideally with control comparisons against random or matched-norm directions.
Evaluation: How Do You Know an SAE Is Good?
The community has converged on a multi-axis evaluation that no single metric replaces:
- Reconstruction quality. $R^2$, or equivalently the fraction of variance unexplained (FVU). Modern SAEs target high reconstruction fidelity at the chosen sparsity level, with frontier-scale TopK SAEs setting the best reported tradeoffs (Gao et al. 2024). (The first three axes are sketched in code after this list.)
- Sparsity. $L_0 = \lVert z(x) \rVert_0$. Typical operating points: 32-512 active features on a dictionary of tens of thousands to tens of millions of elements.
- Downstream loss preservation. Substitute the SAE's reconstruction for the actual activation in the model's forward pass and measure the increase in cross-entropy loss. A "perfect" SAE has zero degradation; a typical SAE adds 0.05-0.2 nats per token.
- Autointerp. Have a strong LLM (typically Claude or GPT-4) generate a natural-language description of each feature from its top-activating examples, then evaluate whether the description predicts activations on held-out examples. Modern SAEs score well on this test for the most-active features and worse for rare ones.
- Feature stability across seeds. MMCS across two independent training runs.
- Causal effects under ablation. Zero feature $j$, measure behavioral change; compare to ablating a random direction with matched norm.
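A minimal sketch of the first three axes. The FVU and L0 helpers are generic; the downstream-CE helper assumes a hypothetical `run_with_replacement` harness (real code would use forward hooks or a tool like TransformerLens's `run_with_hooks`):

```python
import torch

def fvu(x: torch.Tensor, x_hat: torch.Tensor) -> float:
    """Fraction of variance unexplained; 1 - FVU plays the role of R^2."""
    resid = (x - x_hat).pow(2).sum()
    total = (x - x.mean(dim=0)).pow(2).sum()
    return (resid / total).item()

def mean_l0(z: torch.Tensor) -> float:
    return (z != 0).float().sum(dim=-1).mean().item()

@torch.no_grad()
def downstream_ce_delta(run_with_replacement, tokens, sae) -> float:
    """CE increase when the SAE reconstruction replaces the live activation.

    `run_with_replacement(tokens, fn)` is a hypothetical harness that
    applies fn to one layer's residual stream mid-forward and returns the
    model's cross-entropy loss.
    """
    ce_clean = run_with_replacement(tokens, lambda h: h)
    ce_spliced = run_with_replacement(tokens, lambda h: sae(h)[0])
    return ce_spliced - ce_clean
```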
SAE-Bench (Karvonen et al. 2024) and the Gemma Scope evaluation suite (Lieberum et al. 2024) are public benchmarks attempting to standardize these axes. As of April 2026 the community has not converged on a single best metric or weighting.
Frontier-Scale SAEs: What They Find
Anthropic's Scaling Monosemanticity (Templeton et al. May 2024) trained SAEs on Claude 3 Sonnet's residual stream with dictionary sizes of 1M, 4M, and 34M features. Notable findings:
- Multi-modal features. Features fire on the same concept across English text, code referencing the concept, and images of the concept. The "Golden Gate Bridge" feature fires on text mentioning the bridge, on photos of the bridge, and on related concepts (San Francisco, fog, suspension bridges) at lower magnitudes.
- Safety-relevant features. Features for sycophancy, deception, biological-weapons synthesis, SQL injection, racist content. These were used to demonstrate that SAE features can localize dangerous behaviors, with implications for both safety auditing and steering the model away from those behaviors.
- Compositional features. Features for syntactic patterns ("subject-verb agreement"), semantic relations ("X is a member of Y"), task-style patterns ("the model is being asked to refuse").
- Scaling pattern. Useful feature count scales roughly linearly with SAE width up to the 34M-feature run, suggesting that frontier models contain at least that many distinct features.
DeepMind's Gemma Scope (Lieberum et al. July 2024) released JumpReLU SAEs across layers and sites of Gemma-2-2B and Gemma-2-9B as a public artifact. This made frontier-scale SAE research reproducible outside major labs for the first time.
Exercises
Problem
An SAE with dictionary size $m$ is trained on $d$-dimensional activations; on a typical input, $k$ features are nonzero. What is the overcompleteness factor? What fraction of features is active per input? Evaluate both for representative values such as $d = 4096$, $m = 32d$, $k = 64$ (illustrative choices). Why must the overcompleteness factor be much larger than 1 (cite the superposition hypothesis)?
Problem
Show that, with the dictionary fixed, the per-input SAE encoding objective is exactly the LASSO problem (with $W_{\text{dec}}$ as the design matrix and $z$ as coefficients). Then sketch why the L1 ball geometry produces sparsity while the L2 ball does not.
Problem
Derive the L1 shrinkage explicitly. Suppose $W_{\text{dec}}$ has orthonormal columns and the true sparse code is $z^*$ with $x = W_{\text{dec}} z^*$ (no noise). Compute the L1-penalized optimum $\hat{z}$ in closed form, and describe the bias of $\hat{z}$ relative to $z^*$.
Problem
Implement TopK SAE backward-pass logic in pseudocode. Specifically, given a forward pass that computes $u = W_{\text{enc}}(x - b_{\text{pre}})$, $z = \mathrm{TopK}_k(u)$ (the top-$k$ entries kept, others zeroed), $\hat{x} = W_{\text{dec}} z + b_{\text{pre}}$, and $\mathcal{L} = \lVert x - \hat{x} \rVert_2^2$, derive the gradient $\partial \mathcal{L} / \partial W_{\text{enc}}$ and explain why no straight-through estimator is needed.
Problem
Capacity bound for superposition. Use the Welch bound to derive a quantitative cap on how many features a $d$-dimensional residual stream can carry in superposition with pairwise inner-product slack $\epsilon$. Then estimate the cap for $d = 4096$ (a 7B-parameter transformer) at $\epsilon = 0.1$ and at $\epsilon = 0.05$. What does this imply about the feasibility of frontier-scale SAEs with tens of millions of features?
Summary
- Polysemanticity is empirical: individual neurons fire on many concepts.
- Superposition is the explanatory hypothesis: networks pack features as nearly-orthogonal directions.
- L1 SAE (Bricken et al. 2023): minimize $\lVert x - \hat{x} \rVert_2^2 + \lambda \lVert z \rVert_1$, decoder columns unit-norm. Suffers from shrinkage and dead features.
- TopK SAE (Gao et al. 2024 / OpenAI): replace L1 with hard top-$k$ selection. Eliminates shrinkage, Pareto-dominates L1.
- JumpReLU SAE (Rajamanoharan et al. 2024 / DeepMind): per-feature learnable threshold with straight-through gradient on the L0 indicator. Per-feature adaptive sparsity.
- Matryoshka SAE (Bushnaq / Bussmann 2024): nested dictionaries trained with prefix-sum loss. Multi-resolution decomposition.
- Frontier-scale evidence: Anthropic's Claude 3 Sonnet decomposition (Templeton et al. May 2024) finds millions of interpretable features; DeepMind's Gemma Scope (Lieberum et al. July 2024) releases public SAEs.
- Steering: adding a feature direction to the residual stream produces predictable behavioral changes (Golden Gate Claude).
- Evaluation: $R^2$, $L_0$, downstream loss preservation, autointerp, MMCS across seeds, ablation effects. No single metric suffices.
- Open problems: feature splitting, feature absorption, cross-layer composition, adaptive sparsity, evaluating "real" features.
References
Canonical:
- Olshausen & Field, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature 381 (1996). The original sparse coding paper for natural images, foundational to dictionary learning for feature extraction.
- Bricken, Templeton, Batson, Chen et al., "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning," Anthropic Transformer Circuits Thread (October 2023). The L1 SAE construction applied to a 1L transformer; the empirical demonstration that SAEs recover monosemantic features.
- Elhage, Hume, Olsson, Schiefer, Henighan, Kravec, Hatfield-Dodds, Lasenby, Drain, Chen, Grosse, McCandlish, Kaplan, Amodei, Wattenberg, Olah, "Toy Models of Superposition" (2022). Foundational toy-model analysis of superposition, the phase diagram in sparsity and feature importance, and the polytope solutions.
- Cunningham, Ewart, Riggs, Huben, Sharkey, "Sparse Autoencoders Find Highly Interpretable Features in Language Models," arXiv:2309.08600 (ICLR 2024). Independent reproduction of Bricken et al. on Pythia and GPT-2; established that SAE features generalize across model families.
- Templeton, Conerly, Marcus, Lindsey, Bricken, Chen, Pearce, Citro, Ameisen, Jermyn, Anil, Denison, Askell, Lasenby, Wu, Kravec, Schiefer, Maxwell, Joseph, Hatfield-Dodds, Tamkin, Nguyen, McLean, Burke, Hume, Carter, Henighan, Olah, "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet," Anthropic Transformer Circuits Thread (May 2024). The frontier-scale demonstration: SAEs up to 34M features on Claude 3 Sonnet, multi-modal features, safety-relevant features, the Golden Gate Claude steering result.
- Scherlis, Sachan, Jermyn, Benton, Shlegeris, "Polysemanticity and Capacity in Neural Networks," arXiv:2210.01892 (2022). Rigorous capacity analysis for superposition.
Current:
- Gao, la Tour, Tillman, Goh, Troll, Radford, Sutskever, Leike, Wu, "Scaling and evaluating sparse autoencoders," arXiv:2406.04093 (June 2024). The OpenAI TopK SAE paper. Trained 16M-feature SAEs on GPT-4 activations; introduces the AuxK loss for dead features; introduces multi-axis evaluation including downstream loss preservation.
- Rajamanoharan, Conmy, Lieberum, Smith, Varma, Sonnerat, Krakovna, Kramár, Shah, Nanda, "Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders," arXiv:2407.14435 (July 2024). The DeepMind JumpReLU SAE paper with straight-through gradient on the L0 indicator.
- Rajamanoharan, Conmy, Smith, Sonnerat, Lieberum, Krakovna, Kramár, Shah, Nanda, "Improving Dictionary Learning with Gated Sparse Autoencoders," arXiv:2404.16014 (April 2024). The DeepMind Gated SAE: separate gating from magnitude estimation. JumpReLU's predecessor; closely related architecture.
- Lieberum, Rajamanoharan, Conmy, Smith, Sonnerat, Varma, Kramár, Dragan, Shah, Nanda, "Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2," arXiv:2408.05147 (August 2024). The DeepMind public release of JumpReLU SAEs across Gemma-2-2B and Gemma-2-9B layers and sites.
- Bussmann, Leask, Nanda, "Showing SAE Latents are not Atomic Using Meta-SAEs," LessWrong / Alignment Forum (August 2024). Empirical demonstration that L1/TopK SAE features themselves decompose into finer features under a meta-SAE; one motivation for Matryoshka.
- Bushnaq, Nanda, et al., "Matryoshka Sparse Autoencoders," LessWrong / Alignment Forum (October 2024). Matryoshka construction with nested dictionaries.
- Karvonen, Wright, Rager, Angell, Brinkmann, Smith, Verdun, Bau, Marks, "SAEBench: A Comprehensive Benchmark for Sparse Autoencoders" (2024). Standardized benchmark across sparsity, fidelity, and interpretability axes.
- Marks, Rager, Michaud, Belrose, Sharkey, Bau, "Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models," arXiv:2403.19647 (2024). Causal-graph extraction from SAE features; one of the strongest demonstrations of SAE-based mechanistic understanding.
- Lindsey, Templeton, Marcus, Chen, et al., "Cross-Layer Transcoders," Anthropic Transformer Circuits Thread (2024-2025). Architecture for tracing concepts across layers using SAE-style decompositions.
Critique:
- Anders, Neuhaus, Rauker, Sharkey, "The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks," arXiv:2405.10928 (May 2024). Argues that SAE features may not be the right basis for circuit-level analysis; introduces the local interaction basis as a competitor.
- Engels, Liao, Michaud, Gurnee, Tegmark, "Not All Language Model Features Are Linear," arXiv:2405.14860 (May 2024). Empirical finding that some features are circular (encoded as 2D rotational patterns), undermining the strict linear-feature assumption underlying L1/TopK/JumpReLU SAEs.
- Bussmann, Leask, Nanda, "Meta-SAEs and Feature Splitting" (LessWrong, 2024, ongoing). Documents the feature splitting phenomenon and its implications for interpretation.
- Black, Olszewski, Cundy, Conmy, "Interpreting Neural Networks through the Polytope Lens," arXiv:2310.07466 (2023). Argues that polytope structure (rather than linear features) is the more fundamental object; complementary to but distinct from the SAE program.
- Hindupur, Lubana, Fel, Ba, "Probing the Inadequacy of Sparse Autoencoders for Linearly Decodable Concepts," arXiv:2502.05407 (Feb 2025). Documents cases where SAEs fail to recover concepts that linear probes find, raising questions about completeness.
- Marks, Templeton, Sharkey, et al., "What is the Difference Between Steering and Probing?" (2024). The steering-probing gap: SAE features can be probed accurately without producing clean steering, and vice versa, suggesting the two operations measure different things.
Next Topics
- Mechanistic interpretability: the broader research program SAEs feed into; circuits, probing, and behavioral analysis.
- Induction heads: a specific interpretable circuit discovered by direct mechanistic analysis (predates SAEs); a useful comparison case.
- Residual stream and transformer internals: the architectural framework for understanding where SAE features live (residual stream vs MLP output vs attention output).
- Feature importance and interpretability: classical (non-mechanistic) feature attribution methods for comparison.
Last reviewed: April 19, 2026