
AI Safety

Mechanistic Interpretability: Features, Circuits, and Causal Faithfulness

Reverse-engineering trained neural networks. Coverage of the superposition hypothesis, sparse autoencoders for feature extraction, the linear representation hypothesis and its counterexamples, induction heads and IOI as canonical circuits, sparse feature circuits (Marks et al. 2024), cross-layer transcoders (Lindsey et al. 2024), activation patching (noising vs denoising), faithfulness checks, frontier-scale evidence from Anthropic's Scaling Monosemanticity (Templeton et al. 2024) and DeepMind's Gemma Scope (Lieberum et al. 2024), and the limits of current interpretability.

Research · Tier 1 · Frontier · Core spine · ~90 min

Why This Matters

[Figure: eight features (legal, color, math, food, music, sport, code, weather) packed as nearly-orthogonal directions in a 2D activation space. Interference between two features matters only when both are active at once, which is rare if features are sparse: sparsity means a low collision rate and tolerable interference.]

Modern transformer language models make consequential decisions, but their internal computations are opaque to direct inspection. Mechanistic interpretability is the research program that treats trained networks as objects of scientific study: identify what individual components compute, how they compose into circuits, and how those circuits implement the algorithms producing observed behavior.

The program rests on three claims that have accumulated empirical support since 2021. First, transformer representations are approximately linear: many semantic features correspond to directions in activation space, recoverable by probes and by sparse autoencoders (Bricken et al. 2023, Templeton et al. 2024). Second, transformer computation is circuit-structured: specific behaviors are implemented by identifiable subgraphs of the computational graph involving a small number of attention heads and MLP neurons (Olsson et al. 2022 induction heads, Wang et al. 2023 IOI, Hanna et al. 2023 Greater-Than). Third, faithful causal claims about model internals require intervention, not just correlation: probing tells you what is encoded, activation patching tells you what is used (Heimersheim & Nanda 2024).

The frontier-scale lift in 2024 was decisive. Anthropic's Scaling Monosemanticity (Templeton et al. May 2024) extracted millions of interpretable features from Claude 3 Sonnet's residual stream, including the famous Golden Gate Bridge feature whose amplification produced "Golden Gate Claude". DeepMind's Gemma Scope (Lieberum et al. July 2024, arXiv:2408.05147) released approximately 400 SAEs trained on every layer and site of Gemma 2, making mechanistic interpretability research possible without proprietary model access. Marks et al. 2024 (Sparse Feature Circuits, arXiv:2403.19647) showed that circuits can be discovered in SAE feature space rather than neuron space, dramatically improving readability. The field is no longer toy-model demonstrations; it is producing causal claims about deployed models.

For AI safety, the goal is evaluable trust: if we can verify what a model computes, we can detect deceptive alignment, find failure modes before deployment, and edit specific behaviors without retraining. None of these capabilities are mature, but each is now within experimental reach.

Mental Model

Treat a trained transformer as a learned program. The forward pass is an algorithm: at each layer, the residual stream carries a high-dimensional working memory, attention heads route information between positions, and MLP layers transform it through learned nonlinear functions. Mechanistic interpretability reverse-engineers this program in three layers of analysis:

  1. Features. What does the model represent? Find directions in activation space that correspond to interpretable concepts. Tools: sparse autoencoders, linear probes, sparse coding, the linear representation hypothesis. Output: a dictionary of features with human-readable labels.

  2. Circuits. How does the model compute? Identify subgraphs (attention heads + MLPs + residual stream directions) that implement specific algorithms (induction, indirect object resolution, factual recall, arithmetic). Tools: activation patching, attention pattern analysis, ACDC, sparse feature circuits. Output: a circuit diagram with causal annotations.

  3. Faithfulness. Are our explanations true to the actual computation? Test that the proposed feature/circuit accounts for the observed behavior under intervention, not just under observation. Tools: noising and denoising activation patching, causal scrubbing, completeness checks. Output: a causal claim with confidence.

The central technical obstacle is superposition (Elhage et al. 2022): models pack many more features than they have dimensions, encoding each feature as a direction rather than a neuron. So Layer 1 (features) cannot proceed by inspecting individual neurons; it requires methods like SAEs that recover the underlying feature directions from their superposed projections. The central methodological obstacle is faithfulness: many descriptions are intuitive but not causally accurate. Probing accuracy of 95% does not mean the model uses the probed information; an SAE with high reconstruction quality does not guarantee its features are the model's actual computational primitives.
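To make the SAE idea concrete, here is a minimal TopK sparse autoencoder sketch in PyTorch. It is a generic illustration of the dictionary-learning setup, not any particular paper's implementation; the dimensions, the expansion factor, and k are placeholder choices.

```python
# A minimal TopK sparse autoencoder: an overcomplete dictionary trained to
# reconstruct activations while keeping only k features active per example.
# Dimensions and k are placeholder choices, not values from any cited paper.
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model: int, n_features: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        pre = torch.relu(self.encoder(x))
        topk = torch.topk(pre, self.k, dim=-1)
        # Zero out everything except the k largest feature activations.
        feats = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)
        return self.decoder(feats), feats

sae = TopKSAE(d_model=768, n_features=16384, k=32)
x = torch.randn(8, 768)                      # stand-in for residual-stream activations
recon, feats = sae(x)
loss = ((recon - x) ** 2).mean()             # reconstruction loss; TopK enforces sparsity directly
```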

The Superposition Hypothesis

Definition

Superposition

A model exhibits superposition if and only if it represents $m$ features in $d$-dimensional activation space with $m > d$, by encoding features as nearly-orthogonal (rather than exactly orthogonal) directions. Sparse activation patterns make this possible: when each feature fires on only a small fraction of inputs, the typical interference between simultaneously-active features stays small.

Definition

Polysemantic neuron

A polysemantic neuron activates on inputs corresponding to several unrelated features, because the neuron axis is not aligned with any single feature direction in the superposed representation. Anthropic's Claude 3 Sonnet residual stream contains neurons that fire on the Golden Gate Bridge, security vulnerabilities, sycophantic praise, and DNA sequences (Templeton et al. 2024). Polysemanticity is the empirical signature of superposition.

Proposition

Geometry of Superposition

Statement

Consider $m$ features represented as unit vectors $\{v_1, \ldots, v_m\}$ in $\mathbb{R}^d$, with the activation for an input being $x = \sum_j z_j v_j$, where $z_j$ is the intensity of feature $j$ (zero with probability $1 - s_j$). The expected squared interference on the read-out for feature $i$, when $z_i$ is fixed and the other features are active independently with probabilities $s_j$, is

$$\mathbb{E}[\mathrm{interference}_i^2] = \sum_{j \neq i} (v_i \cdot v_j)^2 \, \mathbb{E}[z_j^2] \, s_j.$$

For accurate read-out, this must be small relative to the signal $z_i$. In the high-sparsity limit ($s_j \to 0$), the model can use up to $\exp(\Omega(d \varepsilon^2))$ features with pairwise inner products bounded by $\varepsilon$ (Welch and Kabatyanskii-Levenshtein bounds for sphere packings).

Intuition

Orthogonality is not binary. In $d$ dimensions, exactly orthogonal vectors are capped at $d$, but nearly orthogonal vectors (with pairwise dot product bounded by $\varepsilon$) are exponentially abundant. When most features are rare, the average interference from the non-orthogonal directions stays small, so the network can pack more features than dimensions without losing read-out accuracy.

Proof Sketch

The interference for feature $i$ on read-out direction $v_i$ comes from active non-$i$ features projected onto $v_i$. Their squared magnitudes sum independently (by independence of $z_j$ across $j$), giving the variance formula. The Welch / Kabatyanskii-Levenshtein bounds say that in $\mathbb{R}^d$, the number of unit vectors with pairwise inner product $\leq \varepsilon$ is at most $\exp(O(d \varepsilon^2 \log(1/\varepsilon)))$ for small $\varepsilon$, and matching constructions achieve this packing density (random Gaussian directions, equiangular tight frames).

Elhage et al. (2022) show empirically that small toy models go through three regimes as feature sparsity increases: a dense regime (the model uses $d$ orthogonal directions and ignores the extras), an antipodal regime (pairs of features share a direction by sign), and a polytope regime (features pack into geometric structures like digons, triangles, and tetrahedra in subspaces of $\mathbb{R}^d$). Phase transitions between regimes are governed by a sparsity-vs-importance phase diagram.
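The interference formula above can be checked numerically. A minimal NumPy sketch, with random Gaussian directions standing in for learned features: pack $m > d$ unit vectors into $\mathbb{R}^d$ and measure squared read-out error at several activation probabilities; interference grows as features become less sparse.

```python
# Numerical check of the interference formula: pack m > d random unit directions
# into R^d and measure squared read-out error at several activation probabilities.
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 512
V = rng.normal(size=(m, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)    # m nearly-orthogonal unit directions

def mean_sq_interference(p_active: float, trials: int = 2000) -> float:
    total = 0.0
    for _ in range(trials):
        z = rng.random(m) * (rng.random(m) < p_active)   # sparse feature intensities
        x = z @ V                                        # superposed activation in R^d
        i = rng.integers(m)
        total += (x @ V[i] - z[i]) ** 2                  # read-out error for feature i
    return total / trials

for p in (0.01, 0.05, 0.2):
    print(f"activation probability {p}: mean squared interference {mean_sq_interference(p):.3f}")
```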

Why It Matters

Superposition is the geometric reason individual neurons are unreliable units of analysis. The meaningful computational unit is a feature direction (typically a decoder column of an SAE), not a neuron. Every interpretability method that looks at neurons in isolation (e.g., neuron activation maximization, naive probing of single neurons) is fighting against superposition. Methods that look at directions (SAEs, linear probes, principal components, sparse coding) recover the underlying structure.

Failure Mode

Phase boundary. When feature sparsity is low (many features active at once), interference dominates and the model is forced to discard features rather than pack them. The transition is sharp; below a critical sparsity, no useful superposition is possible.

Anisotropic feature importance. The Elhage analysis assumes equal-importance features. Real models likely have a long-tailed importance distribution; the superposition geometry shifts to favor important features with cleaner directions and to pack low-importance features more densely. This makes the simple $m \sim d/p$ scaling formula (with $p$ the feature activation probability) a rough heuristic, not a precise prediction.

Non-linear features. Some features may not be representable as a single direction at all (Engels et al. 2024 Not All Language Model Features Are Linear, arXiv:2405.14860). Days-of-week features in Mistral 7B form a circular structure that no single direction captures. The superposition framework is incomplete for these cases.

The Linear Representation Hypothesis

Proposition

Linear Representation Hypothesis (working framework)

Statement

Many interpretable features in trained transformer activations correspond to directions in activation space: there exists a unit vector $v_f \in \mathbb{R}^d$ such that (1) the projection $\langle v_f, x \rangle$ of the residual stream $x$ correlates with the conceptual presence of feature $f$ in the input, and (2) the intervention $x \mapsto x + \alpha v_f$ produces predictable behavioral changes corresponding to feature $f$ being more present.

This is a working framework, not a theorem. It has empirical support across many features, modalities, and model scales (Park-Choe-Veitch 2024 arXiv:2311.03658, Marks-Tegmark 2023 truth direction, Bricken et al. 2023, Templeton et al. 2024). It also has documented exceptions: Engels et al. 2024 (arXiv:2405.14860) find features (e.g., days of the week, months of the year) that organize as low-dimensional manifolds rather than single directions. For the truth-direction branch of this story, see Truth Directions and Linear Probes.

Intuition

The residual stream is a high-dimensional vector that accumulates contributions from each layer. If the model uses a feature for prediction, it must read that feature out from the residual stream by projecting onto some direction (since linear projections are what dot-product attention and the unembedding matrix compute). So features the model actually uses are likely to be linearly readable, even if they were trained into the model nonlinearly.

Proof Sketch

The hypothesis is empirical, not provable in general. Three lines of evidence support it:

  1. Probing transfers. Linear probes on intermediate layers achieve high accuracy on diverse semantic, syntactic, and factual properties (Tenney 2019, Belinkov 2022 surveys). The probed information is linearly readable.
  2. Steering works. Activation steering (adding a direction to the residual stream) produces semantically coherent behavioral shifts (Anthropic Golden Gate Claude, Turner et al. 2024 contrastive activation addition).
  3. SAE features are interpretable. Decoder columns of trained SAEs correspond to concept-aligned features, often at multiple semantic granularities (feature splitting; Bricken 2023, Bushnaq 2024).

Counterevidence: Engels et al. (2024) find clusters of features (days of the week, calendar months) that lie on low-dimensional manifolds (circular structures) rather than single directions. These are explicit exceptions to the linear representation hypothesis as stated.

Why It Matters

The linear representation hypothesis is the foundation that licenses every linear-projection-based interpretability tool: SAEs, linear probes, contrast vectors, steering vectors, the unembedding interpretation. If features were generically nonlinear (e.g., XOR-of-two-other-features without a corresponding single direction), most of the current toolkit would not work. The hypothesis is provisional: empirical support is strong for many features but explicit exceptions exist for features with intrinsic geometric structure (cyclic, hierarchical).

Failure Mode

Manifold features. The Engels et al. (2024) days-of-week feature is a circle in a 2D subspace of residual-stream space, not a single direction. Linear probes identify the spanning subspace but not the cyclic order; SAE features split the circle into arc segments. The right primitive here is a 2D manifold feature, not a 1D direction feature.

Cross-layer composition. A feature might be representable linearly only after a particular sequence of operations across layers; the direction changes layer to layer (cross-layer transcoders, Lindsey 2024). The static linear-representation framing under-describes these features.

Distributed-but-not-linear. Some computations (e.g., modular addition, Nanda et al. 2023 grokking analysis) use Fourier-basis decompositions that are linear in a transformed basis but not in the standard residual-stream basis. The right tool is the inverse transform, not raw projection.

Circuits: Induction Heads, IOI, Greater-Than

Definition

Circuit

A circuit is a subgraph of the computational graph of a neural network whose components (attention heads, MLP neurons, residual stream directions) jointly implement a specific identifiable behavior. A circuit is complete if and only if it accounts for the behavior under both correlation and intervention; a circuit is minimal if and only if no proper subgraph reproduces the behavior.

The reverse-engineering of circuits has produced canonical case studies:

Induction heads (Olsson et al. 2022). A two-head circuit in transformers that implements pattern continuation: when token $A$ was followed by token $B$ earlier in the context and $A$ appears again, predict $B$. Step 1: a previous-token head writes information about each token's predecessor into the residual stream. Step 2: an induction head uses that previous-token information to attend to the position just after the earlier occurrence of the current token and copies what it finds there. Olsson et al. show that induction heads emerge during training at a sharp phase transition coinciding with a loss bump; they argue induction heads explain a substantial fraction of in-context learning. Later work shows other heads contribute too; "induction heads do all of in-context learning" is an overclaim.
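A standard diagnostic for candidate induction heads is to feed the model a random-token sequence repeated twice and score each head on how much attention it pays to the token just after the previous occurrence of the current token. A minimal sketch, assuming the TransformerLens library and GPT-2 small are available:

```python
# Score every attention head in GPT-2 small for the induction pattern on a
# repeated random-token sequence (TransformerLens assumed; a standard diagnostic,
# not a full circuit analysis).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

seq_len, batch = 50, 4
first_half = torch.randint(1000, 10000, (batch, seq_len))
tokens = torch.cat([first_half, first_half], dim=1)     # second half repeats the first

_, cache = model.run_with_cache(tokens)

scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]                   # [batch, head, dest, src]
    # Induction: at destination i in the repeat, attend to src = i - seq_len + 1,
    # the token that followed the previous occurrence of the current token.
    stripe = pattern.diagonal(offset=1 - seq_len, dim1=-2, dim2=-1)
    scores[layer] = stripe.mean(dim=(0, -1))

top = torch.topk(scores.flatten(), k=5)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    layer, head = divmod(idx, model.cfg.n_heads)
    print(f"L{layer}H{head}: induction score {score:.2f}")
```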

Indirect object identification (IOI) (Wang, Variengien, Conmy, Shlegeris, Steinhardt 2023, arXiv:2211.00593). A reverse-engineered end-to-end circuit in GPT-2 small for predicting the indirect object in sentences like "When Mary and John went to the store, John gave a drink to ___". The circuit involves Duplicate Token Heads (detect the repeated name), S-Inhibition Heads (suppress the repeated name from the candidate set), and Name Mover Heads (copy the remaining name to the output position), with Backup Name Movers and Negative Name Movers playing supporting roles. The full circuit involves 26 attention heads grouped into seven functional classes, validated by comprehensive ablation experiments.

Greater-Than (Hanna, Liu, Variengien 2023, arXiv:2305.00586). A circuit in GPT-2 small for the task "The war lasted from the year 17XX to the year 17", where the model must complete with a year > XX. The identified circuit uses MLPs in layers 8-9 (the only layers that contribute monotonically) plus three "year head" attention heads. The key insight: the model represents years not as a single direction but as a probability distribution across the unembedding for digit tokens, and computes "greater than" by suppressing low-digit tokens monotonically. The circuit reveals that GPT-2 small implements numerical comparison through a learned digit ordering, not symbolic comparison.

These circuits are case studies; collectively they support the claim that transformer behavior is decomposable into legible algorithms, but they cover a tiny fraction of total model behavior. The general task of automated circuit discovery (Conmy et al. 2023 ACDC, arXiv:2304.14997) is open: ACDC iteratively prunes the computational graph to find minimal subgraphs that preserve behavior, but the search is expensive and the resulting circuits are noisy.

Sparse Feature Circuits

Proposition

Sparse Feature Circuits (Marks et al. 2024)

Statement

Circuit discovery in SAE feature space is dramatically more interpretable than in neuron or residual-stream-direction space. A sparse feature circuit identifies the SAE features at each layer that causally contribute to a target behavior, and the connections between them across layers (mediated by attention and MLP). Marks, Rager, Michaud, Belinkov, Bau, and Mueller (2024, arXiv:2403.19647) find that circuits in feature space are typically 10-100 features wide, compared to thousands of neurons or thousands of dimensions in the standard residual-stream view, and the features are individually labelable.

Intuition

Circuits in neuron space have to deal with polysemanticity at every node: each neuron contributes to many features and the interpretability of each edge is muddled. Circuits in SAE feature space replace polysemantic edges with monosemantic ones; each feature has a (mostly) single interpretable meaning, so the circuit reads as a sequence of feature-level operations.

Proof Sketch

Marks et al. (2024) implement sparse feature circuits by training SAEs at every layer, then performing circuit discovery using attribution patching (a fast linear approximation to activation patching) on SAE features. The result is a sparse subgraph of features (not neurons or directions) that accounts for the target behavior under intervention. They demonstrate the method on bias detection in classifier transformers: gender-bias circuits are sparse and surgically editable, and ablating roughly ten SAE features removes the bias without harming general performance.

The technical components are: (1) per-layer SAE training, (2) attribution patching to score features by causal contribution, (3) edge-attribution patching for the connections between features across layers, (4) graph pruning to retain only high-effect features and edges.
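A simplified sketch of step (2), attribution patching over SAE features: approximate each feature's causal effect by the gradient of a behavioral metric times the clean-to-corrupted change in that feature's activation. It assumes a TransformerLens-style model, an SAE object exposing encode/decode, and a scalar metric of the logits; these names are placeholders, not a specific library's API.

```python
# Simplified attribution patching over SAE features: first-order estimate of each
# feature's effect as grad(metric) * (corrupted - clean feature activation).
# `model` is a TransformerLens-style model; `sae.encode`/`sae.decode` and `metric`
# are assumed interfaces (placeholders).
import torch

def feature_attribution(model, sae, hook_name, clean_tokens, corrupt_tokens, metric):
    _, clean_cache = model.run_with_cache(clean_tokens)
    _, corrupt_cache = model.run_with_cache(corrupt_tokens)
    clean_feats = sae.encode(clean_cache[hook_name]).detach()
    corrupt_feats = sae.encode(corrupt_cache[hook_name]).detach()

    feats = clean_feats.clone().requires_grad_(True)

    def substitute(act, hook):
        return sae.decode(feats)        # run the clean prompt through the SAE basis

    logits = model.run_with_hooks(clean_tokens, fwd_hooks=[(hook_name, substitute)])
    metric(logits).backward()           # gradient of the behavioral metric w.r.t. features

    # Per-feature linear estimate of the effect of patching corrupted values in.
    return (feats.grad * (corrupt_feats - clean_feats)).sum(dim=(0, 1))
```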

Why It Matters

Sparse feature circuits are the current best method for interpretable end-to-end circuit analysis at scale. They unify the feature-extraction perspective (SAEs as the basis change) with the circuit-decomposition perspective (subgraph isolation), producing diagrams that are both faithful and human-readable. The downstream applications include automated bias auditing, surgical model editing, and faithful circuit-level explanations.

Failure Mode

The method depends on the SAE basis being accurate. If SAEs miss real features (false negatives) or split a single semantic feature into many (false positives, feature splitting), the resulting circuits will be wrong in either direction. SAE quality is itself an open problem (see sparse autoencoders for the current evaluation challenges). Cross-layer feature alignment is also tricky: the same semantic feature may have different SAE indices at different layers, requiring cross-layer transcoders (Lindsey et al. 2024) to track features through the network.

Cross-Layer Transcoders

A standard SAE is trained per layer, so the dictionary of features is layer-specific. To track a feature through the network (from where it is constructed to where it influences the output), you need a way to identify the same feature at different layers. Cross-layer transcoders (Lindsey, Lindner, Templeton, Marks, Henighan, Carter, Olah 2024) train a single SAE on activations from multiple layers, with the encoder reading from one layer and the decoder writing to another. The result is a feature dictionary that is consistent across the layer range, enabling true layer-to-layer feature tracking.

Cross-layer transcoders address a real limitation of standard SAEs: a feature for "the concept of dogs" might be present at layer 4, attended-over by an attention head at layer 5, and integrated into the unembedding logic at layer 6. Without cross-layer alignment, this single feature appears as three nominally different features in three layer-specific SAEs, making circuit tracing combinatorially hard. With cross-layer transcoders, the feature has a single index throughout, and circuit discovery becomes a standard graph problem on a stable feature set.

The technique is computationally expensive (training a single SAE on activations from many layers requires more data and parameters than per-layer SAEs) and the architectural design space (which layers to span, how to weight per-layer reconstruction) is still being explored. The Anthropic 2024 release shows feasibility on Claude 3 Sonnet and shows interpretable cross-layer features (e.g., "this is a Python list comprehension" tracked across multiple layers).
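A minimal sketch of the cross-layer idea: one shared set of feature activations, with per-layer encoders and decoders trained to reconstruct every layer at once. This is a generic illustration under the description above; the released crosscoder architectures differ in details such as weight tying and loss weighting.

```python
# A generic cross-layer dictionary: one shared set of feature activations, with
# per-layer encoders and decoders reconstructing every layer at once.
import torch
import torch.nn as nn

class CrossLayerDictionary(nn.Module):
    def __init__(self, d_model: int, n_features: int, n_layers: int):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(d_model, n_features) for _ in range(n_layers))
        self.decoders = nn.ModuleList(nn.Linear(n_features, d_model) for _ in range(n_layers))

    def forward(self, acts):            # acts: list of [batch, d_model] tensors, one per layer
        feats = torch.relu(sum(enc(a) for enc, a in zip(self.encoders, acts)))
        recons = [dec(feats) for dec in self.decoders]
        recon_loss = sum(((r - a) ** 2).mean() for r, a in zip(recons, acts))
        sparsity_loss = 1e-3 * feats.abs().mean()        # placeholder L1 weight
        return feats, recons, recon_loss + sparsity_loss

xcoder = CrossLayerDictionary(d_model=768, n_features=8192, n_layers=4)
acts = [torch.randn(16, 768) for _ in range(4)]          # stand-ins for four layers' activations
feats, recons, loss = xcoder(acts)
```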

Probing and Activation Patching

Proposition

Linear Probing Bound

Statement

Let $\mathcal{V} = \min_{W, b} \mathbb{E}[-\log p_{W, b}(y \mid x)]$ be the minimum cross-entropy of a linear probe predicting label $y$ from representation $x$. Then $\mathcal{V} \geq H(Y \mid X)$ (the conditional entropy), with equality iff the Bayes-optimal predictor $p(y \mid x)$ is in the linear-softmax family.

Probing accuracy quantifies what is encoded in the representation, not what the model uses. High probing accuracy is necessary but not sufficient for the model to causally rely on the probed information.

Intuition

A linear probe extracts the linearly-readable information about the label. If the probe is accurate, the information is present and accessible. If inaccurate, the information is either absent or encoded nonlinearly (so the probe class cannot read it). Linear probes are the simplest interpretability instrument and the most widely used; they form the empirical case for the linear representation hypothesis.

Proof Sketch

Cross-entropy is bounded below by the conditional entropy $H(Y \mid X)$ (Gibbs' inequality). The linear-probe minimum $\mathcal{V}$ is the cross-entropy under the constraint that the predictor is a linear function of $x$, which is at least the unconstrained minimum $H(Y \mid X)$.
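A minimal probing sketch with scikit-learn, using synthetic activations so it runs standalone; in practice the activation matrix would be cached residual-stream activations and the labels the property of interest.

```python
# Train a linear probe on (synthetic) activations; high accuracy shows the property
# is linearly readable, which is necessary but not sufficient for causal use.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 768))                        # stand-in for cached activations
labels = (acts[:, :10].sum(axis=1) > 0).astype(int)        # synthetic linearly-readable property

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")

# The normalized probe weight is a candidate feature direction, to be tested by
# intervention (patching, steering), not taken as causal evidence on its own.
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```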

Why It Matters

Probing has revealed linear encoding of syntax (Tenney 2019), semantics (Conneau 2018), factual associations (Meng 2022 ROME), spatial relationships, time, sentiment, and many other properties. It is the empirical workhorse of the linear representation hypothesis. The probing literature also includes well-known probe-causality gaps: a representation can encode a property the model does not actually use for its output (Belinkov 2022 survey).

Failure Mode

Descriptive vs causal. High probe accuracy says the information is present, not that the model uses it. To establish causality, intervene (next theorem). Belinkov 2022 documents many cases where probe accuracy and causal use diverge.

Probe selection bias. Probing finds whatever linearly-encoded information exists; it does not find nonlinearly-encoded information. A probe accuracy of 60% might mean the representation carries only part of the information, or that all of it is present but partly encoded nonlinearly, so the linear probe class can read only some of it. The probe class matters.

Spurious linear directions. Hewitt & Liang (2019) show that sufficiently expressive probes achieve high accuracy even on control tasks with random labels; probe accuracy must therefore be calibrated against control-task or random baselines.

Proposition

Activation Patching Validity (Heimersheim & Nanda 2024)

Statement

Activation patching has two standard directions:

Noising (resample ablation). Run on clean input. At a chosen site, replace the clean activation with the corresponding activation from a corrupted run. Measure the degradation in clean-run output. Large degradation: the patched site carries information necessary for the clean prediction.

Denoising. Run on corrupted input. At a chosen site, replace the corrupted activation with the corresponding clean activation. Measure the recovery toward clean-run output. Large recovery: the patched site is sufficient for the behavior, given the rest of the corrupted context.

Necessity and sufficiency are different conditions. The two patching directions can localize a behavior to different sites; treating them interchangeably is a common methodological error (Heimersheim & Nanda 2024, arXiv:2404.15255).

Intuition

Necessity asks "if I delete this, does the behavior break?"; sufficiency asks "if I provide just this, does the behavior occur?" These are logically distinct. A backup mechanism (e.g., redundant Name Mover Heads in IOI) makes sites individually unnecessary but jointly sufficient; a bottleneck site is necessary but not sufficient on its own.

Proof Sketch

Heimersheim and Nanda (2024) systematically catalog the differences between noising and denoising and argue that neither is universally correct. They recommend reporting both, and additionally: (1) varying the corruption distribution (the patch baseline matters), (2) using attribution patching (Syed et al. 2023 arXiv:2305.07440) for fast approximate attribution, (3) checking additivity (does patching multiple sites equal the sum of single-site patches?), and (4) running causal scrubbing (Chan et al. 2022) for end-to-end faithfulness verification.
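A minimal denoising sketch on the IOI task, assuming TransformerLens and GPT-2 small: run the corrupted prompt, splice in the clean residual stream at one layer and position, and report how much of the clean logit difference is recovered. The layer sweep and the final-position choice are illustrative simplifications.

```python
# Denoising on the IOI task with GPT-2 small (TransformerLens assumed): run the
# corrupted prompt, splice the clean residual stream in at the final position,
# layer by layer, and report how much of the clean logit difference is recovered.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
clean = model.to_tokens("When Mary and John went to the store, John gave a drink to")
corrupt = model.to_tokens("When Mary and John went to the store, Mary gave a drink to")
mary, john = model.to_single_token(" Mary"), model.to_single_token(" John")

def logit_diff(logits):
    return (logits[0, -1, mary] - logits[0, -1, john]).item()

_, clean_cache = model.run_with_cache(clean)
clean_diff, corrupt_diff = logit_diff(model(clean)), logit_diff(model(corrupt))

for layer in range(model.cfg.n_layers):
    name = f"blocks.{layer}.hook_resid_post"

    def patch(resid, hook, clean_act=clean_cache[name]):
        resid[:, -1] = clean_act[:, -1]          # patch only the final position
        return resid

    patched_diff = logit_diff(model.run_with_hooks(corrupt, fwd_hooks=[(name, patch)]))
    recovery = (patched_diff - corrupt_diff) / (clean_diff - corrupt_diff)
    print(f"layer {layer}: recovered {recovery:.0%} of the clean logit difference")
```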

Why It Matters

Activation patching is the central causal tool of mechanistic interpretability. Probing provides correlation; patching provides causation. The technique has matured between 2022 and 2024; Heimersheim-Nanda is now the standard reference for how to do patching correctly.

Notable applications: Meng-Bau-Andonian-Belinkov 2022 (ROME, arXiv:2202.05262) localized factual associations to mid-layer MLPs in GPT using a denoising variant called causal tracing. Geva et al. 2023 (Dissecting Recall of Factual Associations, arXiv:2304.14767) refined the picture by patching at finer granularity. Wang et al. 2023 IOI used patching extensively to validate the proposed circuit.

Failure Mode

Method-dependence. Different patching variants localize to different sites. The "right" method depends on the question (necessity vs sufficiency); reporting only one direction is incomplete.

Distribution-dependence. The corruption distribution determines what counterfactual is being measured. Patching with random activations vs activations from a different prompt vs activations from a structurally-paired prompt all give different answers. Vig et al. (2020) and Heimersheim-Nanda (2024) document this sensitivity.

Polysemantic sites. Patching at a polysemantic neuron or direction conflates many features simultaneously, making it hard to attribute the effect to a specific computation. This is one motivation for switching to feature-level patching (sparse feature circuits, Marks et al. 2024).

Frontier-Scale Evidence

The 2024 lift in mechanistic interpretability research came from two scaling demonstrations on production models:

Anthropic Scaling Monosemanticity (Templeton, Conerly, Marcus, Lindsey, Bricken, Chen, Pearce, Citro, Ameisen, Jermyn, Henighan, Olah 2024, transformer-circuits.pub). Trained SAEs on Claude 3 Sonnet's middle-layer residual stream with dictionary sizes up to ~$10^7$ features. Recovered a wide range of interpretable features: programming languages, geographic locations, scientific concepts, emotional tones, the Golden Gate Bridge, security vulnerabilities, deceptive-behavior triggers. The Golden Gate Claude demo (May 2024) made the steering result viral: amplifying the bridge feature by a large coefficient produces a model that compulsively references the Golden Gate Bridge in unrelated answers, evidence that the SAE feature is causally tied to the represented concept.
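Feature steering of this kind reduces to adding a scaled feature direction to the residual stream during generation. A minimal sketch, assuming TransformerLens; the random vector, the layer, and the coefficient are placeholders for a real SAE decoder column and tuned values.

```python
# Add a scaled feature direction to the residual stream during generation
# (TransformerLens assumed). The vector, layer, and coefficient are placeholders;
# with a real SAE decoder column, generations shift toward the feature's concept.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
v = torch.randn(model.cfg.d_model)
v = v / v.norm()                                  # unit feature direction (stand-in)

def steer(resid, hook, direction=v, coeff=8.0):
    return resid + coeff * direction              # added at every position

with model.hooks(fwd_hooks=[("blocks.6.hook_resid_post", steer)]):
    print(model.generate("The most famous landmark in the world is", max_new_tokens=30))
```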

DeepMind Gemma Scope (Lieberum, Rajamanoharan, Conmy, Smith, Sonnerat, Varma, Kramár, Dragan, Shah, Nanda 2024, arXiv:2408.05147). Released approximately 400 SAEs trained on Gemma 2 (2B and 9B parameter models) at every layer and site (residual stream, MLP outputs, attention outputs). Together with feature labels, attention pattern visualizations, and steering interfaces, Gemma Scope is the largest open release of SAEs at the time of writing and is the de-facto research platform for academic mechanistic interpretability.

These two releases collectively established that mechanistic interpretability methods do scale to production models, and that the feature dictionary at frontier scale is rich (millions of features), interpretable (human labels and behavioral steering), and causally meaningful (intervention experiments work). What they have not yet established: completeness (do we find all the features the model uses?) or compositional understanding (do we know how features combine into circuits at scale?). Both remain open.

The Faithfulness Problem

A central methodological worry: interpretability methods can produce explanations that look right without being causally accurate. Two concrete failure modes:

The interpretability illusion (Bolukbasi et al. 2021, "An Interpretability Illusion for BERT"). A neuron that appears to detect concept $X$ on inspection (top activations are all $X$-related) might also activate strongly on other concepts whose data was not in the inspection set. Interpretation must consider the distribution of inputs the explanation covers, not just the cherry-picked top examples.

Fake circuit explanations (Chan et al. 2022, "Causal Scrubbing"). A circuit with reasonable-looking attention patterns might implement an incorrect algorithm that happens to produce correct outputs on the test distribution. Causal scrubbing is a stress-test: replace each component of the proposed circuit with a random sample from a distribution where the circuit's hypothesis says it should not matter, and check that the model's behavior is preserved. Many published circuits have failed causal scrubbing, exposing incomplete or wrong interpretations.
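The building block of causal scrubbing is resample ablation: replace a component's activation with its activation on a different input drawn from a distribution where the hypothesis says that component should not matter, and check that behavior is preserved. A minimal single-site sketch, assuming TransformerLens; the prompts and the chosen site are placeholders, and full causal scrubbing applies such replacements jointly across the whole hypothesized circuit.

```python
# Single-site resample ablation: replace one component's final-position activation
# with its value on a different prompt and check how much the output distribution moves.
# TransformerLens assumed; prompts and the chosen site are placeholders.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
target = model.to_tokens("The Eiffel Tower is located in the city of")
others = [model.to_tokens(p) for p in (
    "The Colosseum is located in the city of",
    "The Kremlin is located in the city of",
)]
site = "blocks.5.attn.hook_z"                     # placeholder site from a circuit hypothesis

baseline = model(target)[0, -1].softmax(-1)
for other in others:
    _, cache = model.run_with_cache(other)

    def resample(act, hook, replacement=cache[site]):
        act[:, -1] = replacement[:, -1]           # swap in the other prompt's activation
        return act

    patched = model.run_with_hooks(target, fwd_hooks=[(site, resample)])[0, -1].softmax(-1)
    shift = ((patched - baseline).abs().sum() / 2).item()   # total variation distance
    print(f"resampling {site}: output distribution moved by {shift:.3f}")
```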

The remediation is adversarial validation: design counterexample inputs, test the explanation on out-of-distribution data, run causal scrubbing, check completeness with edge attribution. This is expensive but the field has converged on the view that circuits without faithfulness checks are not durable claims.

Common Confusions

Watch Out

SAE features are learned, not the model's ground-truth features

SAE features are an inferred basis that fits the activation distribution; they are not necessarily the features the model uses internally. SAE training has its own loss function (reconstruction + sparsity), and different SAE configurations (L1 vs TopK vs JumpReLU, dictionary size, sparsity level) produce different feature dictionaries. Feature splitting (Bushnaq, Bussmann 2024) shows that what looks like a single feature at one width decomposes into multiple at higher width, suggesting the "true" features may have hierarchical structure. Treat SAE features as a useful working basis, not a privileged decomposition.

Watch Out

Probing is not causal evidence

Probe accuracy says "the information is linearly readable from the representation." It does not say "the model uses this information for its computation." Many properties are encoded in transformer activations but not used for the next-token prediction (e.g., the sentence's syntactic parse may be encoded but downstream computation may not depend on it). Causal claims require intervention, not just probing.

Watch Out

A circuit explains one behavior, not the model

The IOI circuit explains the indirect-object task on a specific input distribution. It does not explain the model's general capabilities, nor does it imply that all attention is circuit-structured. A model with billions of parameters and trillions of training tokens implements thousands of behaviors; finding circuits for a few does not mean we understand the model.

Watch Out

Activation patching localization is method-dependent

Noising (resample ablation) and denoising patch in opposite directions and give different localizations. Patching with random activations vs paired-prompt activations gives different results. Patching at attention output vs residual stream gives different sites. Heimersheim-Nanda (2024) argue that no single patching protocol is universally correct; report multiple variants and check additivity.

Watch Out

Steering is necessary but not sufficient evidence for causal features

"Adding the feature direction to the residual stream produces semantically related output" is consistent with the feature being causally meaningful. It is also consistent with the feature being a spurious linear direction whose addition triggers unrelated downstream effects (Marks et al. 2024 steering-vs-probing gap). Strong evidence requires steering plus ablation plus matched-norm random control.

Watch Out

Interpretability is not safety

Even with perfect mechanistic understanding of a model's circuits, you still need to decide what behaviors are acceptable, how to modify undesirable behaviors, and how to verify modifications work without breaking other capabilities. Interpretability provides the measurement layer for safety; the intervention layer (RLHF, constitutional AI, scalable oversight) and the evaluation layer (red-teaming, capability evaluations) are also required.

Exercises

ExerciseCore

Problem

In $\mathbb{R}^d$, you can have at most $d$ exactly orthogonal unit vectors. How many unit vectors with pairwise inner products bounded by $\varepsilon = 0.1$ in absolute value can you have? Give the Welch lower bound for $d = 4096$.

ExerciseCore

Problem

Suppose you have a trained SAE with decoder $W_{\mathrm{dec}}$, encoder $f$, and a feature $j$ that is active on input $x$ with activation $z_j > 0$. Describe how you would test whether feature $j$ causally affects the model's prediction of token $t$ at the next position. State the difference between ablation (zero out the feature) and steering (add a multiple of the feature direction) in terms of necessity vs sufficiency.

ExerciseAdvanced

Problem

The IOI circuit identified by Wang et al. 2023 in GPT-2 small involves Backup Name Movers: heads that remain quiet under normal conditions but activate when the primary Name Mover Heads are ablated, partially restoring the IOI behavior. What does this say about the necessity of the primary Name Mover Heads? Describe an activation-patching experiment that would reveal the backup mechanism.

ExerciseAdvanced

Problem

The Engels et al. 2024 Not All Language Model Features Are Linear (arXiv:2405.14860) paper shows that days-of-the-week features in Mistral 7B form a 2D circular manifold rather than a 1D direction. Suppose you train an SAE on Mistral 7B's residual stream. What do you expect to happen to the days-of-week features in the SAE's dictionary? Predict whether you will see (a) seven separate single-direction features, one per day, (b) a single 2D feature (somehow), (c) something else.

ExerciseResearch

Problem

The Marks et al. 2024 sparse-feature-circuits method requires SAEs at every layer and uses attribution patching to find feature-level circuits. Design an experiment to test the completeness of a sparse feature circuit: how would you check that the identified features explain all of the model's behavior on the target task, not just most of it? Specify the metric, the baselines, and the failure conditions.

Frequently Asked Questions

What is superposition?
The hypothesis that a network with d neurons can represent more than d concepts because features lie along non-orthogonal directions in activation space. Each neuron then activates for many unrelated concepts (polysemanticity). Anthropic's toy-models paper (Elhage et al. 2022) shows superposition emerges when the number of features exceeds the embedding dimension and features are sparse.
How is mechanistic interpretability different from saliency methods?
Saliency methods (Grad-CAM, integrated gradients) ask which inputs influence the output. Mech interp asks how the network computes the output: which components, in which sequence, with what algorithm. Saliency is descriptive; mech interp is mechanistic. The two address different questions and answer them with different tools.
What does a sparse autoencoder do?
An SAE is an overcomplete autoencoder trained on activations with a sparsity constraint. It learns a dictionary where most directions are inactive on most inputs; the active directions are candidate monosemantic features. TopK SAEs enforce exactly k active directions per input; JumpReLU uses a learned threshold. Output: thousands to millions of human-interpretable feature directions.
Are circuits actually used by the model, or just consistent explanations?
This is the faithfulness problem. Activation patching (replacing a component's activation with a different input's activation) gives causal evidence: if patching a circuit element changes the behavior, the element matters. Causal scrubbing extends this to test whether a circuit hypothesis fully explains the behavior. Both are necessary but not sufficient.
Will mechanistic interpretability scale to frontier models?
Partially. Anthropic's Scaling Monosemanticity (2024) extracted millions of features from Claude 3 Sonnet; Gemma Scope (2024) released SAEs for Gemma 2. Feature extraction scales with compute. Circuit-level reverse engineering at frontier scale remains harder; the open problem is moving from feature catalogs to algorithmic descriptions.

References

Canonical:

  • Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., et al. (2022). "Toy Models of Superposition." Anthropic, transformer-circuits.pub. The foundational paper establishing the superposition hypothesis with toy-model phase diagrams.
  • Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., et al. (2022). "In-context Learning and Induction Heads." Anthropic, transformer-circuits.pub. Identifies induction heads as a candidate mechanism for in-context pattern copying.
  • Wang, K., Variengien, A., Conmy, A., Shlegeris, B., & Steinhardt, J. (2023). "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small." ICLR 2023, arXiv:2211.00593. Reverse-engineered IOI circuit.
  • Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., et al. (2023). "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." Anthropic, transformer-circuits.pub. The L1 sparse autoencoder for interpretability.
  • Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L. (2023). "Sparse Autoencoders Find Highly Interpretable Features in Language Models." arXiv:2309.08600. SAE scaling to GPT-2-class models.

Current frontier (2024-2025):

  • Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., et al. (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Anthropic, transformer-circuits.pub. Frontier-scale SAE results, Golden Gate Claude.
  • Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., et al. (2024). "Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2." arXiv:2408.05147. Public release of ~400 SAEs on Gemma 2.
  • Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., & Mueller, A. (2024). "Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models." arXiv:2403.19647. Circuit discovery in SAE feature space.
  • Lindsey, J., Lindner, D., Templeton, A., Marks, S., Henighan, T., Carter, S., & Olah, C. (2024). "Sparse Crosscoders for Cross-Layer Features and Model Diffing." Anthropic, transformer-circuits.pub. Cross-layer transcoders.
  • Conmy, A., Mavor-Parker, A., Lynch, A., Heimersheim, S., & Garriga-Alonso, A. (2023). "Towards Automated Circuit Discovery for Mechanistic Interpretability." NeurIPS 2023, arXiv:2304.14997. ACDC algorithm for automated circuit discovery.
  • Hanna, M., Liu, O., & Variengien, A. (2023). "How does GPT-2 compute greater-than? Interpreting mathematical abilities in a pre-trained language model." arXiv:2305.00586. Greater-Than circuit case study.
  • Park, K., Choe, Y. J., & Veitch, V. (2024). "The Linear Representation Hypothesis and the Geometry of Large Language Models." ICML 2024, arXiv:2311.03658. Formal articulation of the linear representation hypothesis.
  • Heimersheim, S., & Nanda, N. (2024). "How to use and interpret activation patching." arXiv:2404.15255. Methodological reference for patching variants.
  • Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). "Locating and Editing Factual Associations in GPT" (ROME). NeurIPS 2022, arXiv:2202.05262. Causal tracing for factual associations.
  • Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., et al. (2023). "Language models can explain neurons in language models." OpenAI. Automated neuron interpretation via LLM labeling.

Critique and limits:

  • Engels, J., Liao, I., Michaud, E. J., Gurnee, W., & Tegmark, M. (2024). "Not All Language Model Features Are Linear." arXiv:2405.14860. Days-of-week feature is a 2D circular manifold, contradicting strict linear representation hypothesis.
  • Bolukbasi, T., Pearce, A., Yuan, A., Coenen, A., Reif, E., Viégas, F., & Wattenberg, M. (2021). "An Interpretability Illusion for BERT." arXiv:2104.07143. The interpretability illusion: top-activation inspection misleads.
  • Chan, L., Garriga-Alonso, A., Goldwasser, N., Greenblatt, R., Nanda, N., Riche, J., et al. (2022). "Causal Scrubbing: a method for rigorously testing interpretability hypotheses." Alignment Forum / LessWrong. The causal scrubbing protocol for faithfulness checking.
  • Hewitt, J., & Liang, P. (2019). "Designing and Interpreting Probes with Control Tasks." EMNLP 2019, arXiv:1909.03368. Probe accuracy must be calibrated against random-feature baselines.
  • Belinkov, Y. (2022). "Probing Classifiers: Promises, Shortcomings, and Advances." Computational Linguistics, 48(1), 207-219. Survey of probing methods and their limits.
  • Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., & Shieber, S. (2020). "Investigating Gender Bias in Language Models Using Causal Mediation Analysis." NeurIPS 2020, arXiv:2004.12265. Patching distribution-dependence for causal claims.
  • Hindupur, S. K., et al. (2025). "Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry." arXiv:2502.05407. SAE geometric assumptions and their failure modes.
  • Anders, A., et al. (2024). "Sums of Sums: Latent Interventions for the Linear Representation Hypothesis." arXiv:2405.10928. Critique of linear representation in causal-intervention setting.

Last reviewed: April 19, 2026
