Sparse Autoencoders for Interpretability: TopK, JumpReLU, Matryoshka, and Scaling
Sparse autoencoders (SAEs) decompose neural network activations into interpretable feature directions by learning an overcomplete dictionary under a sparsity constraint. Covers the superposition hypothesis; the L1 SAE objective and its shrinkage failure mode; TopK SAEs (Gao et al. 2024 / OpenAI); JumpReLU SAEs (Rajamanoharan et al. 2024 / DeepMind); Matryoshka SAEs (Bushnaq / Bussmann 2024); Anthropic's million-feature decomposition of Claude 3 Sonnet (Templeton et al. 2024); feature splitting; dead features; evaluation; and steering.
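The core decomposition above can be sketched in a few lines. This is a minimal, illustrative TopK-SAE forward pass in NumPy (all names and shapes are assumptions, not a reference implementation): the encoder maps a model activation into an overcomplete feature space, and the TopK activation keeps only the k largest feature pre-activations, enforcing sparsity exactly rather than via an L1 penalty.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Sketch of one TopK SAE forward pass (illustrative names/shapes).

    x:     (d_model,)        input activation vector
    W_enc: (d_model, d_sae)  encoder weights, d_sae >> d_model (overcomplete)
    W_dec: (d_sae, d_model)  decoder dictionary; rows are feature directions
    k:     number of features allowed to fire per input
    """
    pre = (x - b_dec) @ W_enc + b_enc      # encoder pre-activations
    f = np.maximum(pre, 0.0)               # ReLU
    if k < f.size:
        drop = np.argpartition(f, -k)[:-k] # indices of all but the top k
        f[drop] = 0.0                      # zero everything outside the top-k
    x_hat = f @ W_dec + b_dec              # sparse reconstruction of x
    return f, x_hat

rng = np.random.default_rng(0)
d_model, d_sae, k = 8, 32, 4
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
f, x_hat = topk_sae_forward(rng.normal(size=d_model),
                            W_enc, np.zeros(d_sae), W_dec, np.zeros(d_model), k)
print((f > 0).sum())  # at most k features are active
```

Training would minimize the reconstruction error ||x - x_hat||^2; with TopK the sparsity level is fixed architecturally, which avoids the activation shrinkage induced by an L1 penalty.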