
ML Methods

Autoencoders

Encoder-decoder architectures for unsupervised representation learning: undercomplete bottlenecks, sparse and denoising variants, and the connection between linear autoencoders and PCA.

Advanced · Tier 2 · Stable · Supporting · ~45 min

Why This Matters

Autoencoders are the simplest framework for unsupervised representation learning: compress data through a bottleneck, then reconstruct it, and the bottleneck representation captures the most important structure. This idea is foundational. It leads directly to variational autoencoders, which connect to probabilistic generative modeling, and to modern representation learning methods used in self-supervised learning.

Mental Model

Think of an autoencoder as a compression algorithm that learns what to compress. The encoder maps high-dimensional input to a low-dimensional code. The decoder maps the code back to the original space. Training minimizes reconstruction error. Whatever structure the code must capture to reconstruct well is the learned representation.

Formal Setup

Definition

Autoencoder

An autoencoder consists of two functions:

  • Encoder: $f_\theta: \mathbb{R}^D \to \mathbb{R}^d$ mapping input $x$ to a latent code $z = f_\theta(x)$
  • Decoder: $g_\phi: \mathbb{R}^d \to \mathbb{R}^D$ mapping the code back to a reconstruction $\hat{x} = g_\phi(z)$

Training minimizes the reconstruction loss:

$$\mathcal{L}(\theta, \phi) = \frac{1}{n}\sum_{i=1}^{n} \|x_i - g_\phi(f_\theta(x_i))\|^2$$

The composition $g_\phi \circ f_\theta$ learns to approximate the identity function, but only through the information bottleneck of the code $z$.
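The setup above can be sketched directly in NumPy. This is a toy illustration with untrained random weights, not a training loop; the names `encode`, `decode`, and `reconstruction_loss` are illustrative, not from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n = 10, 2, 100                     # input dim, code dim, number of points
X = rng.normal(size=(n, D))

# Randomly initialized linear encoder/decoder; training would update these.
W_enc = rng.normal(size=(D, d)) * 0.1
W_dec = rng.normal(size=(d, D)) * 0.1

def encode(x):
    return x @ W_enc                     # z = f_theta(x), code in R^d

def decode(z):
    return z @ W_dec                     # x_hat = g_phi(z), back to R^D

def reconstruction_loss(X):
    """Mean squared reconstruction error over the dataset."""
    X_hat = decode(encode(X))
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

print(reconstruction_loss(X))
```

With random weights the loss is large; gradient descent on `W_enc` and `W_dec` would drive it toward the best rank-$d$ reconstruction.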

Undercomplete Autoencoders

Definition

Undercomplete Autoencoder

An autoencoder is undercomplete if the code dimension $d$ is smaller than the input dimension $D$ (i.e., $d < D$). The bottleneck forces the encoder to learn a compressed representation that retains the most important information for reconstruction.

If $d \geq D$ and the encoder and decoder have enough capacity, the autoencoder can learn the identity function, which captures nothing useful about the data structure. The bottleneck is what makes autoencoders nontrivial.
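The trivial solution is easy to exhibit: with $d = D$ and linear maps, identity weights achieve zero reconstruction error while learning nothing. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
D = 5
X = rng.normal(size=(100, D))

# With code dimension d = D, identity weights reconstruct perfectly.
W_enc = np.eye(D)
W_dec = np.eye(D)
X_hat = X @ W_enc @ W_dec
print(np.allclose(X_hat, X))   # True: zero reconstruction error, nothing learned
```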

The Linear Autoencoder = PCA Connection

Theorem

Linear Autoencoders Recover PCA

Statement

If the encoder and decoder are both linear (no nonlinearity), then the optimal $d$-dimensional autoencoder (minimizing reconstruction error) recovers the subspace spanned by the top $d$ principal components of the data. The optimal composition $W'W$ is the orthogonal projection onto the span of the top $d$ eigenvectors of the data covariance matrix $\Sigma = \frac{1}{n}X^\top X$. The individual factors $W$ and $W'$ are not unique: any invertible $A \in \mathbb{R}^{d \times d}$ gives $W \mapsto AW$, $W' \mapsto W' A^{-1}$ with the same product, so the linear autoencoder identifies the PCA subspace, not the PCA principal directions themselves. Recovering the eigenvectors as encoder rows requires an extra constraint (e.g., orthonormal rows of $W$, or tied weights $W' = W^\top$).

Intuition

PCA finds the $d$-dimensional subspace that preserves the most variance; equivalently, that minimizes reconstruction error under orthogonal projection. A linear autoencoder with a $d$-dimensional bottleneck is solving exactly the same optimization problem, just without the orthogonality constraint. The optimal solution finds the same subspace (though the encoder/decoder matrices may not be orthogonal individually).

Proof Sketch

Let the encoder be $z = Wx$ with $W \in \mathbb{R}^{d \times D}$ and the decoder be $\hat{x} = W'z$ with $W' \in \mathbb{R}^{D \times d}$. The reconstruction loss is $\|x - W'Wx\|^2$. The optimal $W'W$ is the rank-$d$ projection onto the top $d$ eigenvectors of the covariance matrix, by the Eckart–Young theorem (the best rank-$d$ approximation in Frobenius norm comes from the truncated SVD).
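The closed-form optimum can be verified numerically. The sketch below (assuming centered data and the eigendecomposition convention of `numpy.linalg.eigh`, which returns eigenvalues in ascending order) builds the optimal projection from the top-$d$ eigenvectors and checks that the mean reconstruction error equals the variance in the discarded directions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, D, d = 500, 6, 2
X = rng.normal(size=(n, D)) @ rng.normal(size=(D, D))   # correlated data
X -= X.mean(axis=0)                                     # center

# Covariance and its top-d eigenvectors (principal directions)
Sigma = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(Sigma)                # ascending order
V = eigvecs[:, -d:]                                     # top-d eigenvectors, D x d

# Optimal linear autoencoder: encoder W = V^T, decoder W' = V,
# so W'W = V V^T is the orthogonal projection onto the top-d subspace.
X_hat = X @ V @ V.T

# Mean reconstruction error equals the total variance in the D-d
# discarded eigendirections.
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
discarded_variance = eigvals[:-d].sum()
print(mse, discarded_variance)   # these two quantities agree
```

This also makes the non-uniqueness concrete: replacing `V.T` by `A @ V.T` and `V` by `V @ np.linalg.inv(A)` for any invertible `A` leaves `X_hat` unchanged.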

Why It Matters

This theorem reveals that nonlinearity is what makes deep autoencoders more powerful than PCA. Without nonlinear activations, you are just doing PCA with extra steps. Nonlinear autoencoders can learn curved, nonlinear manifolds that PCA (restricted to linear subspaces) cannot capture.

Failure Mode

The equivalence is exact only for linear activations and squared loss. With nonlinear activations, the autoencoder can capture nonlinear structure but also has more complex loss landscapes with local minima.

Sparse Autoencoders

Definition

Sparse Autoencoder

A sparse autoencoder uses an overcomplete code ($d \geq D$) but adds a sparsity penalty on the activations:

$$\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}\|x_i - g_\phi(f_\theta(x_i))\|^2 + \lambda \sum_{j=1}^{d} |z_j|$$

The $\ell_1$ penalty (or KL divergence from a target activation level) encourages most code units to be zero for any given input. Each input activates only a small subset of features, and each feature specializes in a particular pattern.

Sparse autoencoders are useful for feature discovery: the learned features are often interpretable (edges, textures, object parts for images). They are also used in mechanistic interpretability to decompose neural network activations into interpretable directions.
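The penalized objective is straightforward to write down. A toy sketch with a ReLU encoder and untrained weights (the function name `sparse_loss` and the choice $\lambda = 0.1$ are illustrative assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
n, D, d = 50, 8, 32                      # overcomplete: d >= D
X = rng.normal(size=(n, D))
W_enc = rng.normal(size=(D, d)) * 0.1
W_dec = rng.normal(size=(d, D)) * 0.1

def sparse_loss(X, lam=0.1):
    """Reconstruction MSE plus an l1 penalty on the code activations."""
    Z = np.maximum(X @ W_enc, 0.0)       # ReLU encoder activations
    X_hat = Z @ W_dec
    recon = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    sparsity = lam * np.mean(np.sum(np.abs(Z), axis=1))
    return recon + sparsity

print(sparse_loss(X))
```

Setting `lam=0.0` removes the penalty and recovers the plain reconstruction loss; with `lam > 0`, gradient descent trades reconstruction accuracy for sparser codes.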

Denoising Autoencoders

Definition

Denoising Autoencoder

A denoising autoencoder corrupts each input $x$ to produce $\tilde{x}$ (e.g., by adding Gaussian noise or randomly masking features) and trains to reconstruct the clean original:

$$\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}\|x_i - g_\phi(f_\theta(\tilde{x}_i))\|^2$$

where $\tilde{x}_i = x_i + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, or $\tilde{x}_i$ is $x_i$ with random entries zeroed out.

The denoising objective prevents the autoencoder from learning the identity even without a bottleneck. To reconstruct clean data from noisy input, the model must learn the structure of the data manifold: it must know what "clean" looks like. This connects to score matching: the optimal denoising function is related to the gradient of the log data density.
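Both corruption schemes from the definition are one-liners. The sketch below (function names `corrupt_gaussian` and `corrupt_mask` are illustrative) also shows why the identity map is no longer optimal: mapping $\tilde{x} \mapsto \tilde{x}$ still incurs loss against the clean target.

```python
import numpy as np

rng = np.random.default_rng(3)

def corrupt_gaussian(x, sigma=0.3):
    # x_tilde = x + eps, eps ~ N(0, sigma^2 I)
    return x + sigma * rng.normal(size=x.shape)

def corrupt_mask(x, p=0.25):
    # Zero out each entry independently with probability p
    mask = rng.random(x.shape) >= p
    return x * mask

X = rng.normal(size=(100, 20))
X_tilde = corrupt_mask(X)

# The identity map on the corrupted input is penalized, because the
# loss compares against the CLEAN x, not x_tilde.
identity_loss = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
print(identity_loss)
```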

Canonical Examples

Example

Image reconstruction with convolutional autoencoder

For $28 \times 28$ MNIST images ($D = 784$), a convolutional encoder reduces spatial dimensions via strided convolutions down to a $d = 32$-dimensional code vector. The decoder uses transposed convolutions to upsample back to $28 \times 28$. Reconstruction error is typically MSE. The 32-dimensional code captures digit identity, stroke thickness, and orientation: the factors of variation that matter for reconstruction.

Example

Linear autoencoder on 2D data

Consider 1000 points in $\mathbb{R}^{10}$ lying near a 2D plane (with small noise). A linear autoencoder with $d = 2$ learns to project onto this plane, exactly what PCA does. The reconstruction error equals the variance in the 8 discarded dimensions. A nonlinear autoencoder with $d = 2$ could capture a curved manifold that PCA misses.

Common Confusions

Watch Out

Overcomplete autoencoders are not useless

If $d > D$ with no regularization, the autoencoder can learn the identity (trivial solution). But with sparsity, denoising, or other regularization, overcomplete autoencoders learn useful representations. The constraint is not just dimensionality; it is the information bottleneck in a broader sense.

Watch Out

Autoencoders are not generative models by default

A standard autoencoder learns to reconstruct training data, but the latent space has no particular structure. Sampling a random point in the latent space and decoding it typically produces garbage. Making autoencoders generative requires imposing structure on the latent space. This is exactly what variational autoencoders do.

Summary

  • Autoencoders learn representations by minimizing reconstruction error through a bottleneck
  • Undercomplete ($d < D$): the bottleneck forces compression
  • Linear autoencoder with squared loss = PCA (same subspace, possibly different basis)
  • Sparse autoencoders: overcomplete code with an $\ell_1$ penalty that encourages interpretable features
  • Denoising autoencoders: corrupt the input, reconstruct the clean original; this learns the data manifold without needing a bottleneck
  • Standard autoencoders are not generative models; the latent space lacks the structure needed for sampling

Exercises

ExerciseCore

Problem

A linear autoencoder with encoder $W \in \mathbb{R}^{2 \times 10}$ and decoder $W' \in \mathbb{R}^{10 \times 2}$ is trained on centered data. What is the rank of the reconstruction map $W'W$, and how does this relate to PCA?

ExerciseAdvanced

Problem

Why does a denoising autoencoder learn useful representations even without a bottleneck (when $d = D$)? What prevents it from learning the identity?

References

Canonical:

  • Goodfellow, Bengio, Courville, Deep Learning, MIT Press (2016), Chapter 14
  • Bourlard & Kamp, "Auto-Association by Multilayer Perceptrons and Singular Value Decomposition," Biological Cybernetics 59(4-5):291-294 (1988)
  • Hinton & Salakhutdinov, "Reducing the Dimensionality of Data with Neural Networks," Science 313(5786):504-507 (2006)
  • Vincent, Larochelle, Bengio, Manzagol, "Extracting and Composing Robust Features with Denoising Autoencoders," ICML (2008)
  • Vincent, "A Connection Between Score Matching and Denoising Autoencoders," Neural Computation 23(7):1661-1674 (2011)
  • Kingma & Welling, "Auto-Encoding Variational Bayes," ICLR (2014)

Current:

  • Cunningham, Ewart, Riggs, Huben, Sharkey, "Sparse Autoencoders Find Highly Interpretable Features in Language Models," ICLR (2024), arXiv:2309.08600
  • Bricken et al., "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning," Anthropic Transformer Circuits Thread (2023)

Next Topics

The natural next steps from autoencoders:

  • Variational autoencoders: making the latent space generative
  • Dimensionality reduction: PCA and nonlinear alternatives

Last reviewed: April 27, 2026
