
ML Methods

Autoencoders

Encoder-decoder architectures for unsupervised representation learning: undercomplete bottlenecks, sparse and denoising variants, and the connection between linear autoencoders and PCA.

Advanced · Tier 2 · Stable · Supporting · ~45 min

Why This Matters

Autoencoders are the simplest framework for unsupervised representation learning: compress data through a bottleneck, then reconstruct it, and the bottleneck representation captures the most important structure. This idea is foundational. It leads directly to variational autoencoders, which connect to probabilistic generative modeling, and to modern representation learning methods used in self-supervised learning.

Mental Model

Think of an autoencoder as a compression algorithm that learns what to compress. The encoder maps high-dimensional input to a low-dimensional code. The decoder maps the code back to the original space. Training minimizes reconstruction error. Whatever structure the code must capture to reconstruct well is the learned representation.

Formal Setup

Definition

Autoencoder

An autoencoder consists of two functions:

  • Encoder: $f_\theta: \mathbb{R}^D \to \mathbb{R}^d$ mapping input $x$ to a latent code $z = f_\theta(x)$
  • Decoder: $g_\phi: \mathbb{R}^d \to \mathbb{R}^D$ mapping the code back to a reconstruction $\hat{x} = g_\phi(z)$

Training minimizes the reconstruction loss:

$$\mathcal{L}(\theta, \phi) = \frac{1}{n}\sum_{i=1}^{n} \|x_i - g_\phi(f_\theta(x_i))\|^2$$

The composition $g_\phi \circ f_\theta$ learns to approximate the identity function, but only through the information bottleneck of the code $z$.
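The setup above can be sketched directly in NumPy. This is a toy illustration with untrained random weights, not a training loop; the names `encode`, `decode`, and `reconstruction_loss` are illustrative, not from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n = 10, 2, 100                     # input dim, code dim, number of points
X = rng.normal(size=(n, D))

# Randomly initialized linear encoder/decoder; training would update these.
W_enc = rng.normal(size=(D, d)) * 0.1
W_dec = rng.normal(size=(d, D)) * 0.1

def encode(x):
    return x @ W_enc                     # z = f_theta(x), code in R^d

def decode(z):
    return z @ W_dec                     # x_hat = g_phi(z), back to R^D

def reconstruction_loss(X):
    """Mean squared reconstruction error over the dataset."""
    X_hat = decode(encode(X))
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

print(reconstruction_loss(X))
```

With random weights the loss is large; gradient descent on `W_enc` and `W_dec` would drive it toward the best rank-$d$ reconstruction.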

Undercomplete Autoencoders

Definition

Undercomplete Autoencoder

An autoencoder is undercomplete if the code dimension $d$ is smaller than the input dimension $D$ (i.e., $d < D$). The bottleneck forces the encoder to learn a compressed representation that retains the most important information for reconstruction.

If $d \geq D$ and the encoder and decoder have enough capacity, the autoencoder can learn the identity function, which captures nothing useful about the data structure. The bottleneck is what makes autoencoders nontrivial.
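The trivial solution is easy to exhibit: with $d = D$ and linear maps, identity weights achieve zero reconstruction error while learning nothing. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
D = 5
X = rng.normal(size=(100, D))

# With code dimension d = D, identity weights reconstruct perfectly.
W_enc = np.eye(D)
W_dec = np.eye(D)
X_hat = X @ W_enc @ W_dec
print(np.allclose(X_hat, X))   # True: zero reconstruction error, nothing learned
```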

The Linear Autoencoder = PCA Connection

Theorem

Linear Autoencoders Recover PCA

Statement

If the encoder and decoder are both linear (no nonlinearity), then the optimal $d$-dimensional autoencoder (minimizing reconstruction error) recovers the subspace spanned by the top $d$ principal components of the data. The optimal composition $W'W$ is the orthogonal projection onto the span of the top $d$ eigenvectors of the data covariance matrix $\Sigma = \frac{1}{n}X^\top X$. The individual factors $W$ and $W'$ are not unique: any invertible $A \in \mathbb{R}^{d \times d}$ gives $W \mapsto AW$, $W' \mapsto W' A^{-1}$ with the same product, so the linear autoencoder identifies the PCA subspace, not the PCA principal directions themselves. Recovering the eigenvectors as encoder rows requires an extra constraint (e.g., orthonormal rows of $W$, or tied weights $W' = W^\top$).

Intuition

PCA finds the $d$-dimensional subspace that preserves the most variance; equivalently, that minimizes reconstruction error under orthogonal projection. A linear autoencoder with a $d$-dimensional bottleneck is solving exactly the same optimization problem, just without the orthogonality constraint. The optimal solution finds the same subspace (though the encoder/decoder matrices may not be orthogonal individually).

Proof Sketch

Let the encoder be $z = Wx$ with $W \in \mathbb{R}^{d \times D}$ and the decoder be $\hat{x} = W'z$ with $W' \in \mathbb{R}^{D \times d}$. The reconstruction loss is $\|x - W'Wx\|^2$. The optimal $W'W$ is the rank-$d$ projection onto the top $d$ eigenvectors of the covariance matrix, by the Eckart–Young theorem (the best rank-$d$ approximation in Frobenius norm comes from the truncated SVD).
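The closed-form optimum can be verified numerically. The sketch below (assuming centered data and the eigendecomposition convention of `numpy.linalg.eigh`, which returns eigenvalues in ascending order) builds the optimal projection from the top-$d$ eigenvectors and checks that the mean reconstruction error equals the variance in the discarded directions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, D, d = 500, 6, 2
X = rng.normal(size=(n, D)) @ rng.normal(size=(D, D))   # correlated data
X -= X.mean(axis=0)                                     # center

# Covariance and its top-d eigenvectors (principal directions)
Sigma = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(Sigma)                # ascending order
V = eigvecs[:, -d:]                                     # top-d eigenvectors, D x d

# Optimal linear autoencoder: encoder W = V^T, decoder W' = V,
# so W'W = V V^T is the orthogonal projection onto the top-d subspace.
X_hat = X @ V @ V.T

# Mean reconstruction error equals the total variance in the D-d
# discarded eigendirections.
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
discarded_variance = eigvals[:-d].sum()
print(mse, discarded_variance)   # these two quantities agree
```

This also makes the non-uniqueness concrete: replacing `V.T` by `A @ V.T` and `V` by `V @ np.linalg.inv(A)` for any invertible `A` leaves `X_hat` unchanged.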

Why It Matters

This theorem reveals that nonlinearity is what makes deep autoencoders more powerful than PCA. Without nonlinear activations, you are just doing PCA with extra steps. Nonlinear autoencoders can learn curved, nonlinear manifolds that PCA (restricted to linear subspaces) cannot capture.

Failure Mode

The equivalence is exact only for linear activations and squared loss. With nonlinear activations, the autoencoder can capture nonlinear structure but also has more complex loss landscapes with local minima.

Sparse Autoencoders

Definition

Sparse Autoencoder

A sparse autoencoder uses an overcomplete code ($d \geq D$) but adds a sparsity penalty on the activations:

$$\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}\|x_i - g_\phi(f_\theta(x_i))\|^2 + \lambda \sum_{j=1}^{d} |z_j|$$

The $\ell_1$ penalty (or KL divergence from a target activation level) encourages most code units to be zero for any given input. Each input activates only a small subset of features, and each feature specializes in a particular pattern.

Sparse autoencoders are useful for feature discovery: the learned features are often interpretable (edges, textures, object parts for images). They are also used in mechanistic interpretability to decompose neural network activations into interpretable directions.
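The penalized objective is straightforward to write down. A toy sketch with a ReLU encoder and untrained weights (the function name `sparse_loss` and the choice $\lambda = 0.1$ are illustrative assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
n, D, d = 50, 8, 32                      # overcomplete: d >= D
X = rng.normal(size=(n, D))
W_enc = rng.normal(size=(D, d)) * 0.1
W_dec = rng.normal(size=(d, D)) * 0.1

def sparse_loss(X, lam=0.1):
    """Reconstruction MSE plus an l1 penalty on the code activations."""
    Z = np.maximum(X @ W_enc, 0.0)       # ReLU encoder activations
    X_hat = Z @ W_dec
    recon = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    sparsity = lam * np.mean(np.sum(np.abs(Z), axis=1))
    return recon + sparsity

print(sparse_loss(X))
```

Setting `lam=0.0` removes the penalty and recovers the plain reconstruction loss; with `lam > 0`, gradient descent trades reconstruction accuracy for sparser codes.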

Denoising Autoencoders

Definition

Denoising Autoencoder

A denoising autoencoder corrupts each input $x$ to produce $\tilde{x}$ (e.g., by adding Gaussian noise or randomly masking features) and trains to reconstruct the clean original:

$$\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}\|x_i - g_\phi(f_\theta(\tilde{x}_i))\|^2$$

where $\tilde{x}_i = x_i + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, or $\tilde{x}_i$ is $x_i$ with random entries zeroed out.

The denoising objective prevents the autoencoder from learning the identity even without a bottleneck. To reconstruct clean data from noisy input, the model must learn the structure of the data manifold: it must know what "clean" looks like. This connects to score matching: the optimal denoising function is related to the gradient of the log data density.
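Both corruption schemes from the definition are one-liners. The sketch below (function names `corrupt_gaussian` and `corrupt_mask` are illustrative) also shows why the identity map is no longer optimal: mapping $\tilde{x} \mapsto \tilde{x}$ still incurs loss against the clean target.

```python
import numpy as np

rng = np.random.default_rng(3)

def corrupt_gaussian(x, sigma=0.3):
    # x_tilde = x + eps, eps ~ N(0, sigma^2 I)
    return x + sigma * rng.normal(size=x.shape)

def corrupt_mask(x, p=0.25):
    # Zero out each entry independently with probability p
    mask = rng.random(x.shape) >= p
    return x * mask

X = rng.normal(size=(100, 20))
X_tilde = corrupt_mask(X)

# The identity map on the corrupted input is penalized, because the
# loss compares against the CLEAN x, not x_tilde.
identity_loss = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
print(identity_loss)
```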

Canonical Examples

Example

Image reconstruction with convolutional autoencoder

For $28 \times 28$ MNIST images ($D = 784$), a convolutional encoder reduces spatial dimensions via strided convolutions down to a $d = 32$-dimensional code vector. The decoder uses transposed convolutions to upsample back to $28 \times 28$. Reconstruction error is typically MSE. The 32-dimensional code captures digit identity, stroke thickness, and orientation: the factors of variation that matter for reconstruction.

Example

Linear autoencoder on 2D data

Consider 1000 points in $\mathbb{R}^{10}$ lying near a 2D plane (with small noise). A linear autoencoder with $d = 2$ learns to project onto this plane, exactly what PCA does. The reconstruction error equals the variance in the 8 discarded dimensions. A nonlinear autoencoder with $d = 2$ could capture a curved manifold that PCA misses.

Common Confusions

Watch Out

Overcomplete autoencoders are not useless

If $d > D$ with no regularization, the autoencoder can learn the identity (trivial solution). But with sparsity, denoising, or other regularization, overcomplete autoencoders learn useful representations. The constraint is not just dimensionality; it is the information bottleneck in a broader sense.

Watch Out

Autoencoders are not generative models by default

A standard autoencoder learns to reconstruct training data, but the latent space has no particular structure. Sampling a random point in the latent space and decoding it typically produces garbage. Making autoencoders generative requires imposing structure on the latent space. This is exactly what variational autoencoders do.

Summary

  • Autoencoders learn representations by minimizing reconstruction error through a bottleneck
  • Undercomplete ($d < D$): the bottleneck forces compression
  • Linear autoencoder with squared loss = PCA (same subspace, possibly different basis)
  • Sparse autoencoders: overcomplete code with an $\ell_1$ penalty that encourages interpretable features
  • Denoising autoencoders: corrupt the input, reconstruct the clean original; this learns the data manifold without needing a bottleneck
  • Standard autoencoders are not generative models; the latent space lacks the structure needed for sampling

Exercises

ExerciseCore

Problem

A linear autoencoder with encoder $W \in \mathbb{R}^{2 \times 10}$ and decoder $W' \in \mathbb{R}^{10 \times 2}$ is trained on centered data. What is the rank of the reconstruction map $W'W$, and how does this relate to PCA?

ExerciseAdvanced

Problem

Why does a denoising autoencoder learn useful representations even without a bottleneck (when $d = D$)? What prevents it from learning the identity?

References

Canonical:

  • Goodfellow, Bengio, Courville, Deep Learning, MIT Press (2016), Chapter 14
  • Bourlard & Kamp, "Auto-Association by Multilayer Perceptrons and Singular Value Decomposition," Biological Cybernetics 59(4-5):291-294 (1988)
  • Hinton & Salakhutdinov, "Reducing the Dimensionality of Data with Neural Networks," Science 313(5786):504-507 (2006)
  • Vincent, Larochelle, Bengio, Manzagol, "Extracting and Composing Robust Features with Denoising Autoencoders," ICML (2008)
  • Vincent, "A Connection Between Score Matching and Denoising Autoencoders," Neural Computation 23(7):1661-1674 (2011)
  • Kingma & Welling, "Auto-Encoding Variational Bayes," ICLR (2014)

Current:

  • Cunningham, Ewart, Riggs, Huben, Sharkey, "Sparse Autoencoders Find Highly Interpretable Features in Language Models," ICLR (2024), arXiv:2309.08600
  • Bricken et al., "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning," Anthropic Transformer Circuits Thread (2023)

Next Topics

The natural next steps from autoencoders:

  • Variational autoencoders: making the latent space generative
  • Dimensionality reduction: PCA and nonlinear alternatives

Last reviewed: April 27, 2026
