

Diffusion Models

Generative models that learn to reverse a noise-adding process: forward and reverse SDEs, score matching, the DDPM ELBO, deterministic ODE sampling, classifier-free guidance, latent diffusion, and the bridge to flow matching.

Advanced · Tier 1 · Current · Core spine · ~95 min

Why This Matters

Diffusion keeps the corruption process fixed and spends all learnable capacity on reversing it

Training is cheap because any noisy state can be sampled in closed form. Sampling is expensive because the model has to walk a noisy prior back toward structure one denoising move at a time.

Training target

Sample $\mathbf{x}_0$, draw $t$ and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, then form $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}$ without simulating the whole chain.

Equivalent parameterizations

Predicting the noise $\boldsymbol{\epsilon}$ or the clean sample $\mathbf{x}_0$, the score $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$, or a velocity-style target $\mathbf{v}$ all describe the same denoising direction once you rescale by the known noise level.

What the sampler actually changes

DDPM keeps stochasticity, DDIM makes the path deterministic, and newer ODE solvers buy quality at lower step counts. The model is the same; the numerical path is what changes.

Why guidance helps and hurts

Classifier-free guidance extrapolates with $\tilde{\boldsymbol{\epsilon}} = (1 + w)\,\boldsymbol{\epsilon}_\theta(c) - w\,\boldsymbol{\epsilon}_\theta(\varnothing)$. Moderate $w$ sharpens the prompt; large $w$ pushes the sampler off the density it actually learned.

Diffusion is the dominant generative architecture for images, video, audio, 3D geometry, molecules, and robot policies as of 2025-2026. The frontier image systems (Stable Diffusion 3.5, Imagen 3, Flux.1, DALL-E 3, Midjourney v6) and the frontier video systems (Sora, Veo 3, Kling, Runway Gen-3) are all either diffusion models or their close relative, flow matching. Outside generative media, diffusion appears in protein-structure design, robotics action policies (Chi et al. 2023), molecular conformer generation (Xu et al. 2022), and decision-making.

Three properties separate diffusion from earlier generative families:

  1. Stable training. Unlike GANs, diffusion has no minimax game. The loss is a weighted regression against a known noise target. Training scales monotonically with compute and data.
  2. Mode coverage. Unlike VAEs and GANs, diffusion does not aggressively collapse modes. The score-matching objective averages over the data distribution rather than fighting it.
  3. Likelihood-grade theory. The discrete formulation gives a variational ELBO upper bound on negative log-likelihood (Sohl-Dickstein 2015, Ho et al. 2020). The objective most diffusion models actually train on, $\mathcal{L}_{\text{simple}}$, drops the per-timestep weights and is not the tight ELBO; it is a reweighted surrogate that improves sample quality at the cost of bound tightness. The continuous-time formulation gives exact likelihoods through an ODE change-of-variables (Song et al. 2021), independent of how the training loss was weighted.

The price is iterative sampling: a single image requires 4-50 forward passes through a large model. Most of the engineering progress since 2020 has been about cutting that cost.

If you want to watch timestep, guidance scale, and sampler choice change that tradeoff directly, open the Diffusion Lab.

Mental Model

Take a clean image, gradually add Gaussian noise over many steps until the image is pure noise, and then learn to undo each step. The forward process is fixed and analytically tractable; only the reverse direction is learned. At sampling time, draw a Gaussian noise vector and integrate the learned reverse dynamics back to a clean image.

There are several common parameterizations for what the network actually predicts: the noise $\boldsymbol{\epsilon}$ that was added (DDPM, Ho et al. 2020), the score $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ of the perturbed data (NCSN, Song & Ermon 2019), the clean image $\mathbf{x}_0$, or the linear combination $\mathbf{v}$ (Salimans & Ho 2022). At a fixed noise level $t$ these targets are linear functions of each other, so the algebraic relationship is exact (Vincent 2011 gives the denoising-score identity). What the trained network ends up computing is not identical across choices: the implicit per-timestep loss weighting differs, which changes which noise levels the model is optimized for, and that interacts with the choice of sampler at inference time. The parameterization is a real design choice, not just a notational rewrite: schedule, weighting convention, and sampler all matter.

Formal Setup

Definition

Forward Process (Discrete-Time DDPM)

The forward process is a Markov chain that adds Gaussian noise over $T$ steps. Starting from $\mathbf{x}_0 \sim q(\mathbf{x}_0)$:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\big(\mathbf{x}_t;\; \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1},\; \beta_t \mathbf{I}\big)$$

where $\beta_1, \ldots, \beta_T \in (0, 1)$ is a fixed variance schedule. With $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, the marginal at any step has a closed form:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t;\; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\; (1 - \bar{\alpha}_t)\mathbf{I}\big)$$

so $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. As $T \to \infty$ with the schedule chosen so that $\bar{\alpha}_T \to 0$, the prior $q(\mathbf{x}_T)$ converges to $\mathcal{N}(\mathbf{0}, \mathbf{I})$.
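As a concrete illustration, here is a minimal PyTorch sketch of sampling $\mathbf{x}_t$ from the closed-form marginal. The linear beta schedule, the value of $T$, and the tensor shapes are illustrative assumptions, not part of any specific codebase.

```python
import torch

# Illustrative linear schedule (DDPM-style); T, the beta range, and shapes are assumptions.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # beta_1 ... beta_T
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Draw x_t ~ q(x_t | x_0) in closed form: no simulation of the chain."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)    # broadcast over (B, C, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

x0 = torch.randn(8, 3, 32, 32)                 # stand-in batch of "clean" data
t = torch.randint(0, T, (8,))                  # one timestep per example
eps = torch.randn_like(x0)
xt = q_sample(x0, t, eps)
```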

Definition

Reverse Process (Learned Markov Kernel)

The reverse process is a parametric Markov chain trained to undo the forward process:

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\; \Sigma_\theta(\mathbf{x}_t, t)\big)$$

In Ho et al. (2020) the variance is fixed to $\sigma_t^2 \mathbf{I}$ (with $\sigma_t^2 \in \{\beta_t, \tilde{\beta}_t\}$), and the mean is reparameterized via a noise predictor $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$:

$$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\Big(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\Big)$$

Nichol & Dhariwal (2021) introduced a learned $\Sigma_\theta$ for tighter likelihood bounds.

Definition

Score Function

The score of a density $p(\mathbf{x})$ is

$$\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x})$$

A score model $\mathbf{s}_\theta(\mathbf{x}, t) \approx \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ targets the score of the perturbed distribution at noise level $t$. The score does not require knowing the normalizing constant of $p$, which is what makes it tractable to learn from samples alone (Hyvärinen 2005).

Definition

Forward SDE (Continuous-Time)

A continuous-time forward process is an Itô SDE on $[0, T]$:

$$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,dt + g(t)\,d\mathbf{w}$$

with drift $\mathbf{f}: \mathbb{R}^d \times [0, T] \to \mathbb{R}^d$, scalar diffusion coefficient $g: [0, T] \to \mathbb{R}_{\geq 0}$, and standard Wiener process $\mathbf{w}$. The two canonical choices are Variance Preserving (VP-SDE), with $\mathbf{f}(\mathbf{x}, t) = -\tfrac{1}{2}\beta(t)\mathbf{x}$ and $g(t) = \sqrt{\beta(t)}$, the continuous limit of DDPM, and Variance Exploding (VE-SDE), with $\mathbf{f} = \mathbf{0}$ and $g(t) = \sqrt{d[\sigma^2(t)]/dt}$, the continuous limit of NCSN (Song et al. 2021).

Main Theorems

Theorem

DDPM Variational Lower Bound

Statement

The negative log-likelihood admits the variational upper bound

$$-\log p_\theta(\mathbf{x}_0) \;\leq\; \mathbb{E}_q\Big[\, L_T + \sum_{t=2}^{T} L_{t-1} + L_0 \,\Big]$$

with

$$\begin{aligned} L_T &= D_{\mathrm{KL}}\big(q(\mathbf{x}_T \mid \mathbf{x}_0) \,\big\Vert\, p(\mathbf{x}_T)\big), \\ L_{t-1} &= D_{\mathrm{KL}}\big(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\big\Vert\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\big), \\ L_0 &= -\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1). \end{aligned}$$

When the reverse mean is reparameterized via a noise predictor $\boldsymbol{\epsilon}_\theta$ and the variance is fixed, each $L_{t-1}$ reduces to a weighted squared error. Dropping the timestep weights gives the simplified DDPM loss of Ho et al. (2020):

$$\mathcal{L}_{\text{simple}}(\theta) \;=\; \mathbb{E}_{t \sim \mathrm{Unif}\{1, \ldots, T\},\; \mathbf{x}_0,\; \boldsymbol{\epsilon}}\Big[\,\big\Vert\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon},\; t\big)\big\Vert^2\,\Big]$$

Intuition

Training reduces to: sample a clean image $\mathbf{x}_0$, sample a timestep $t$ and a noise vector $\boldsymbol{\epsilon}$, form the noisy view $\mathbf{x}_t$ in closed form, and regress the network output against the noise. No sampling of long chains during training, no GAN-style adversary, no per-step backpropagation through time.
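A minimal training-step sketch of $\mathcal{L}_{\text{simple}}$ under the same assumed schedule as the forward-process sketch; `model` is a hypothetical noise-prediction network taking $(\mathbf{x}_t, t)$.

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, T, alpha_bars):
    """One L_simple step: sample t and eps, form x_t in closed form, regress against eps."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    eps_pred = model(xt, t)
    return F.mse_loss(eps_pred, eps)            # unweighted: w_t = 1 for all t
```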

Proof Sketch

Apply the standard ELBO decomposition to $-\log p_\theta(\mathbf{x}_0)$:

  1. Write $-\log p_\theta(\mathbf{x}_0) \leq \mathbb{E}_q\big[-\log \tfrac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\big]$ by Jensen.
  2. Use Bayes to rewrite the forward chain as the time-reversed $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$, which is a closed-form Gaussian because the forward kernel is Gaussian with known means.
  3. Group terms by $t$: prior at $t = T$, KL between true and learned reverse kernel for $1 < t \leq T$, and reconstruction at $t = 1$.
  4. Each KL between two Gaussians with the same fixed variance reduces to a squared-mean difference. Reparameterizing the mean as $\boldsymbol{\epsilon}_\theta$ folds the $\mathbf{x}_t$-dependent constants into a single timestep weight $w_t$.
  5. The simplified loss sets $w_t = 1$. Karras et al. (2022) note that uniform timestep sampling implicitly weights $L_{t-1}$ by SNR-dependent factors that happen to align with perceptual quality, which is why $\mathcal{L}_{\text{simple}}$ outperforms the tight ELBO for sample quality despite being looser as a likelihood bound.

Why It Matters

This is why diffusion training is engineering-friendly. The loss is a single mean-squared error over noise predictions, parallelized across timesteps and data points. The variational bound certifies that minimizing $\mathcal{L}_{\text{simple}}$ drives down a valid upper bound on $-\log p_\theta(\mathbf{x}_0)$, even though the bound itself is loose.

Failure Mode

$\mathcal{L}_{\text{simple}}$ underweights large-$t$ steps for likelihood purposes. If the goal is reporting NLL or BPD (e.g., density estimation benchmarks), use the variance-weighted variational loss or the EDM parameterization (Karras et al. 2022). For sample quality on standard benchmarks, $\mathcal{L}_{\text{simple}}$ is the strong baseline.

Theorem

Denoising Score Matching Equivalence

Statement

For the perturbation kernel $q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}}; \mathbf{x}, \sigma^2 \mathbf{I})$ and the induced marginal $q_\sigma(\tilde{\mathbf{x}}) = \int q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$, the denoising score matching objective

$$\mathcal{L}_{\text{DSM}}(\theta) = \mathbb{E}_{p(\mathbf{x})\, q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})}\Big[\,\big\Vert\mathbf{s}_\theta(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})\big\Vert^2\,\Big]$$

is equal, up to a $\theta$-independent constant, to the marginal score matching objective $\mathbb{E}_{q_\sigma(\tilde{\mathbf{x}})}\big[\Vert\mathbf{s}_\theta(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}})\Vert^2\big]$.

Equivalently, the noise-prediction objective $\mathbb{E}\big[\Vert\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\Vert^2\big]$ and the score model are linked by

$$\mathbf{s}_\theta(\mathbf{x}_t, t) \;=\; -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1 - \bar{\alpha}_t}}$$

so denoising and score learning are the same algorithm.

Intuition

You cannot evaluate $\nabla \log p(\mathbf{x})$ for the data distribution because $p$ is unknown. But after corrupting $\mathbf{x}$ with known Gaussian noise, the conditional score $\nabla \log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})$ is just the negative noise direction, which the network can learn from $(\mathbf{x}, \tilde{\mathbf{x}})$ pairs. The marginal score follows by an integration-by-parts argument that shifts the $\mathbf{x}$ expectation onto the network output.

Why It Matters

This is the bridge between the DDPM view (denoise pixels) and the score-SDE view (estimate $\nabla \log p_t$). They are not two methods; they are one method with two parameterizations. The same trained network drives both discrete-time DDPM samplers and continuous-time SDE/ODE solvers.
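In code the conversion between the two parameterizations is one line; a sketch under the same assumed schedule:

```python
def score_from_eps(eps_pred, t, alpha_bars):
    """s_theta(x_t, t) = -eps_theta(x_t, t) / sqrt(1 - a_bar_t)."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return -eps_pred / (1.0 - a_bar).sqrt()
```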

Failure Mode

The equivalence assumes a Gaussian perturbation kernel. With non-Gaussian corruption (e.g., cold diffusion or masking), the conditional-score formula changes and the simple "predict the noise" target no longer applies (Bansal et al. 2022).

Theorem

Anderson Reverse-Time SDE

Statement

The forward SDE $d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,dt + g(t)\,d\mathbf{w}$ on $t \in [0, T]$ has a reverse-time SDE with the same marginal distributions:

$$d\mathbf{x} \;=\; \big[\mathbf{f}(\mathbf{x}, t) \;-\; g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\big]\,dt \;+\; g(t)\,d\bar{\mathbf{w}}$$

where $\bar{\mathbf{w}}$ is a standard Wiener process running backward in time and the SDE is integrated from $t = T$ down to $t = 0$.

Intuition

Forward dynamics push samples from data to noise; the reverse correction term $-g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ is the "uphill" force that sends them back to high-density regions. The score is the only data-dependent quantity, and the diffusion coefficient $g(t)$ is unchanged.

Proof Sketch

Anderson (1982) proves this by writing the joint $p(\mathbf{x}_s, \mathbf{x}_t)$ for $s < t$, conditioning to obtain the time-reversed transition density, and verifying it satisfies the reverse-time Fokker-Planck equation. Modern proofs (Haussmann & Pardoux 1986, Föllmer 1985) start from the time-reversal of the SDE directly. The intuition is that the forward Fokker-Planck equation for $p_t$ involves a drift term and a Laplacian term; reversing time flips the sign of the drift, and the Laplacian becomes a drift correction whose magnitude is exactly $g(t)^2 \nabla \log p_t$.

Why It Matters

This is the foundational result that makes score-based generative modeling work. Train a neural network to approximate $\nabla \log p_t$ at every $t$, plug it into the reverse SDE, and integrate from $T$ to $0$ starting from Gaussian noise. Score-SDE (Song et al. 2021) unifies DDPM and NCSN as discrete solvers for VP- and VE-flavored versions of this same equation.
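A sketch of one Euler-Maruyama step of the reverse-time VP-SDE; `score_fn`, `beta_fn`, and the step size `dt` are stand-ins for a trained score model and a chosen schedule, not a specific implementation.

```python
import torch

def reverse_sde_step(x, t, dt, score_fn, beta_fn):
    """One Euler-Maruyama step of dx = [f - g^2 * score] dt + g dw, integrated from t toward t - dt.

    VP-SDE: f(x, t) = -0.5 * beta(t) * x and g(t) = sqrt(beta(t)); dt > 0 is the step size.
    """
    beta = beta_fn(t)
    drift = -0.5 * beta * x - beta * score_fn(x, t)      # f - g^2 * score
    noise = torch.randn_like(x)
    return x - drift * dt + (beta * dt) ** 0.5 * noise   # minus sign: stepping backward in time
```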

Failure Mode

The score model is least accurate near $t = 0$ (the data manifold has near-zero density off the manifold and the score blows up) and near $t = T$ (the noisy distribution is nearly Gaussian and the score is small but easy to mispredict in direction). EDM (Karras et al. 2022) handles both ends with explicit preconditioning of the network input/output.

Theorem

Probability Flow ODE

Statement

The marginal densities $\{p_t\}_{t \in [0, T]}$ of the forward SDE are also the marginals of the deterministic ODE

$$\frac{d\mathbf{x}}{dt} \;=\; \mathbf{f}(\mathbf{x}, t) \;-\; \tfrac{1}{2}\, g(t)^2\, \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$$

Integrating this ODE backward in time from a sample $\mathbf{x}_T \sim p_T$ produces a sample $\mathbf{x}_0 \sim p_0$ without any injected noise. The ODE is a deterministic, invertible bijection between the prior and the data distribution at the level of single trajectories.

Intuition

Halving the score-correction term and dropping the Wiener increment preserves the family of marginal densities while removing all randomness from sample paths. Stochastic and deterministic samplers see different per-trajectory dynamics but the same marginal distribution at every $t$.

Why It Matters

Three consequences land here. First, deterministic sampling: DDIM (Song, Meng, Ermon 2020) is a discretization of this ODE for the VP-SDE, which is why it samples in 10-50 steps instead of 1000. Second, exact likelihood evaluation: integrating the ODE forward gives a continuous normalizing flow, and the change-of-variables formula yields exact log-likelihoods (Song et al. 2021). Third, interpolation and editing: noise-to-sample is bijective along each trajectory, so editing in noise space corresponds to consistent edits in sample space, the basis of techniques like SDEdit (Meng et al. 2022).

Failure Mode

The ODE shares the score model's error budget but loses the regularizing stochasticity. In regions where the score is mispredicted, deterministic trajectories accumulate the error monotonically, while the SDE injects fresh noise that can correct course. EDM and DPM-Solver compensate with higher-order solvers (Heun, multistep) rather than more steps.

Theorem

Classifier-Free Guidance

Statement

Let $p_t(\mathbf{x} \mid c)$ be the noise-perturbed conditional and $p_t(\mathbf{x})$ the unconditional marginal. The Bayesian rule $\log p_t(\mathbf{x} \mid c) = \log p_t(c \mid \mathbf{x}) + \log p_t(\mathbf{x}) + \mathrm{const}$ implies

$$\nabla_{\mathbf{x}} \log p_t(\mathbf{x} \mid c) \;=\; \nabla_{\mathbf{x}} \log p_t(c \mid \mathbf{x}) \;+\; \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$$

Sampling from a sharpened conditional $p_t(\mathbf{x} \mid c)\, p_t(c \mid \mathbf{x})^{w}$ (more weight on the implicit classifier) corresponds to using the guided score

$$\tilde{\mathbf{s}}_\theta(\mathbf{x}, t, c) \;=\; (1 + w)\,\mathbf{s}_\theta(\mathbf{x}, t, c) \;-\; w\,\mathbf{s}_\theta(\mathbf{x}, t)$$

or equivalently in noise-prediction form

$$\tilde{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t, c) \;=\; (1 + w)\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c) \;-\; w\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$$

The conditional and unconditional scores are produced by the same network, trained with the conditioning signal randomly dropped to a null token at rate $p_{\text{drop}} \approx 0.1$ (Ho & Salimans 2022).

Intuition

The difference between the model's own conditional and unconditional predictions, $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$, is proportional to the implicit classifier gradient $\nabla_{\mathbf{x}} \log p_t(c \mid \mathbf{x})$ (with proportionality factor $-\sqrt{1 - \bar{\alpha}_t}$). Extrapolating in that direction sharpens the conditional, biasing samples toward higher implicit-classifier confidence in $c$.
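In noise-prediction form the guidance rule is two forward passes and a linear combination; a sketch where `model`, `cond`, and `null_cond` (the learned null token) are hypothetical stand-ins:

```python
def cfg_eps(model, xt, t, cond, null_cond, w):
    """eps_tilde = (1 + w) * eps(x_t, t, c) - w * eps(x_t, t, null)."""
    eps_cond = model(xt, t, cond)
    eps_uncond = model(xt, t, null_cond)
    return (1.0 + w) * eps_cond - w * eps_uncond
```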

Why It Matters

Every text-to-image and text-to-video system in production uses classifier-free guidance. It replaced the earlier classifier guidance of Dhariwal & Nichol (2021), which required training a separate noise-conditioned classifier. CFG needs only a single dropout rate and one extra forward pass per sampling step.

Failure Mode

At high $w$ (typically $w > 15$ for image models), samples become saturated, high-contrast, and lose diversity. At very high $w$ they fall off the data manifold entirely, producing the characteristic "burnt" look of overguided diffusion. The cause is that the implicit-classifier extrapolation pushes samples into low-density regions where the score model was never trained accurately. CFG-zero, CFG-rescaled (Lin et al. 2024), and dynamic thresholding (Imagen) are workarounds that cap or rescale the extrapolation magnitude.

Discrete-Time Samplers

The reverse SDE and probability flow ODE both need a numerical integrator. Sampler choice trades quality for steps. Three families dominate practice.

Ancestral / DDPM sampling. A first-order Euler-Maruyama discretization of the reverse VP-SDE. Use $T = 1000$ steps for high quality, $T = 250$ for fast research baselines. Stochastic (injects fresh Gaussian noise per step), so it self-corrects but needs many steps.

DDIM. A first-order discretization of the probability flow ODE for the VP-SDE (Song, Meng, Ermon 2020), with a single hyperparameter $\eta \in [0, 1]$ that interpolates between fully deterministic sampling ($\eta = 0$) and ancestral sampling ($\eta = 1$). Hits competitive FID at 25-50 steps; a minimal deterministic loop is sketched below.
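A minimal deterministic DDIM ($\eta = 0$) loop under the assumed schedule from earlier; the coarse timestep subsequence and the `model` interface are illustrative assumptions.

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, alpha_bars, num_steps=50):
    """Deterministic DDIM: x_prev = sqrt(a_bar_prev) * x0_hat + sqrt(1 - a_bar_prev) * eps_hat."""
    T = alpha_bars.shape[0]
    ts = torch.linspace(T - 1, 0, num_steps).long()      # coarse subsequence of timesteps
    x = torch.randn(shape)                               # start from the Gaussian prior
    for i, t in enumerate(ts):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[ts[i + 1]] if i + 1 < len(ts) else torch.tensor(1.0)
        t_batch = torch.full((shape[0],), int(t), dtype=torch.long)
        eps_hat = model(x, t_batch)
        x0_hat = (x - (1.0 - a_t).sqrt() * eps_hat) / a_t.sqrt()   # predicted clean sample
        x = a_prev.sqrt() * x0_hat + (1.0 - a_prev).sqrt() * eps_hat
    return x
```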

EDM Heun & DPM-Solver. Higher-order ODE solvers tailored to the diffusion ODE. EDM (Karras et al. 2022) uses Heun's second-order method on a σ\sigma-parameterized ODE with the Karras schedule; DPM-Solver (Lu et al. 2022) and DPM-Solver++ exploit the semi-linear structure of the diffusion ODE to reach FID-competitive samples in 10-20 steps. UniPC (Zhao et al. 2023) generalizes both with a multistep predictor-corrector.

Distillation-based fast samplers. Progressive distillation (Salimans & Ho 2022), consistency models (Song et al. 2023), and consistency distillation (Luo et al. 2023) train student networks that take large jumps in noise level, reaching 1-4 step generation at the cost of an extra training stage.

Latent Diffusion

Pixel-space diffusion at $1024 \times 1024$ is computationally heavy because every denoising step must process every pixel. Latent Diffusion Models (LDM, Rombach et al. 2022) move the diffusion process into the latent space of a pretrained autoencoder.

The pipeline has three stages:

  1. A VAE encoder $\mathcal{E}$ maps an image $\mathbf{x} \in \mathbb{R}^{H \times W \times 3}$ to a latent $\mathbf{z} = \mathcal{E}(\mathbf{x}) \in \mathbb{R}^{h \times w \times c}$ with $h, w \approx H/8, W/8$ and $c$ between 4 and 16.
  2. A diffusion model is trained on the latents with the standard DDPM or score-SDE objective.
  3. A VAE decoder $\mathcal{D}$ maps generated latents back to pixels.

The autoencoder is trained once with a perceptual loss and a small KL regularizer; it is held fixed during diffusion training. The compute saving is roughly $(H/h)^2 = 64\times$ per denoising step, which is what made Stable Diffusion practical on consumer GPUs. Stable Diffusion 1.x/2.x, SDXL, SD3, and SD3.5 are all latent diffusion. Imagen and Flux.1 use related cascaded or higher-resolution-latent variants.
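A sketch of how the three stages compose; `vae_encode`, `vae_decode`, `denoiser`, `sampler`, and the latent shape are hypothetical stand-ins, not a specific library API.

```python
import torch

def latent_diffusion_sample(vae_decode, denoiser, sampler, latent_shape=(1, 4, 64, 64)):
    """Sampling: run any diffusion sampler over latents, then decode to pixels once."""
    z = sampler(denoiser, latent_shape)        # e.g. the DDIM loop above, applied to latents
    return vae_decode(z)                       # frozen VAE decoder maps latents -> pixels

def latent_diffusion_train_step(vae_encode, denoiser, x, diffusion_loss):
    """Training: encode once with the frozen VAE, then apply the standard noise-prediction loss."""
    with torch.no_grad():
        z0 = vae_encode(x)                     # autoencoder trained separately and held fixed
    return diffusion_loss(denoiser, z0)        # e.g. the L_simple training step sketched earlier
```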

The trade-off is fidelity at fine scales. The autoencoder bottleneck loses high-frequency detail; tiny text and faces are the canonical failure mode of latent diffusion. SDXL added a refiner stage, and SD3 increased the latent channel count from 4 to 16, both targeting this.

Flow Matching

Theorem

Conditional Flow Matching

Statement

Let $\mathbf{x}_t = (1 - t)\,\boldsymbol{\epsilon} + t\,\mathbf{x}_0$ for $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, $\mathbf{x}_0 \sim p_{\text{data}}$, $t \in [0, 1]$. The conditional flow matching loss

$$\mathcal{L}_{\text{CFM}}(\theta) \;=\; \mathbb{E}_{t,\, \mathbf{x}_0,\, \boldsymbol{\epsilon}}\Big[\,\big\Vert\mathbf{v}_\theta(\mathbf{x}_t, t) - (\mathbf{x}_0 - \boldsymbol{\epsilon})\big\Vert^2\,\Big]$$

trains a velocity field $\mathbf{v}_\theta$ whose ODE $d\mathbf{x}/dt = \mathbf{v}_\theta(\mathbf{x}, t)$ transports the source $\mathcal{N}(\mathbf{0}, \mathbf{I})$ to $p_{\text{data}}$ along straight lines from each sampled noise vector to its paired data point (Lipman et al. 2023).

Intuition

Diffusion takes curved paths from data to noise and back through a stochastic SDE; flow matching takes straight lines through a deterministic ODE. The training target is no longer "the noise that was added" but "the constant velocity that interpolates between this noise and this data point."
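A minimal training-step sketch of the conditional flow matching loss with the linear path from the statement above; `model` is a hypothetical velocity-prediction network.

```python
import torch
import torch.nn.functional as F

def cfm_training_step(model, x0):
    """Conditional flow matching with the linear path x_t = (1 - t) * eps + t * x0."""
    eps = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1)
    xt = (1.0 - t) * eps + t * x0
    target = x0 - eps                           # constant velocity along the straight path
    v_pred = model(xt, t.flatten())
    return F.mse_loss(v_pred, target)
```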

Why It Matters

Flow matching is a strict generalization of the probability flow ODE for diffusion: choosing the conditional path $\mathbf{x}_t = \alpha_t \mathbf{x}_0 + \sigma_t \boldsymbol{\epsilon}$ with $(\alpha_t, \sigma_t) = (\sqrt{\bar{\alpha}_t}, \sqrt{1 - \bar{\alpha}_t})$ recovers the diffusion ODE; the linear path $(\alpha_t, \sigma_t) = (t, 1 - t)$ is rectified flow (Liu et al. 2023), which yields straighter trajectories and faster sampling. Stable Diffusion 3 (Esser et al. 2024) and most newer image/video systems are flow-matching trained because the straighter paths permit larger ODE step sizes.

Failure Mode

The linearity is in the latent (or pixel) space, not in the data manifold's intrinsic geometry. For data on highly curved manifolds (molecular conformations, 3D rotations, group-valued data), straight pixel-space paths cross low-density regions and the velocity field has to make large, hard-to-fit corrections. Riemannian flow matching (Chen & Lipman 2024) replaces straight lines with manifold geodesics for these cases.

Failure Modes and Engineering Realities

Schedule sensitivity. The variance schedule $\{\beta_t\}$ is not a hyperparameter you can sweep casually. The cosine schedule of Nichol & Dhariwal (2021) outperforms the linear DDPM schedule across image scales because it spends more steps near $\bar{\alpha}_t \approx 0.5$, where the model learns the most. EDM (Karras et al. 2022) reframes the problem in terms of $\sigma$ rather than $\beta$ and shows that the implied distribution over $\log \sigma$ during training is the right object to design.
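The cosine schedule fits in a few lines; the offset $s = 0.008$ and the clipping of per-step betas follow Nichol & Dhariwal (2021), while the rest of the sketch is an illustrative assumption.

```python
import torch

def cosine_alpha_bars(T: int, s: float = 0.008) -> torch.Tensor:
    """bar_alpha_t = f(t)/f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2)."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T + s) / (1 + s)) * torch.pi / 2) ** 2
    alpha_bars = f / f[0]
    betas = 1.0 - alpha_bars[1:] / alpha_bars[:-1]
    betas = betas.clamp(max=0.999)              # avoid a degenerate final step, as in the paper
    return torch.cumprod(1.0 - betas, dim=0)
```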

Exposure bias. During training the network sees true noisy samples $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}$. During sampling it sees its own previous predictions, which carry error. The mismatch compounds across steps; Ning et al. (2024) analyze it and propose epsilon-scaling, a training-free correction that improves FID across DDIM, EDM, and LDM.

Truncation at the endpoints. The score blows up as $t \to 0$ (the data manifold has near-zero off-manifold density) and the model is poorly calibrated as $t \to T$. Karras et al. (2022) address this with explicit input/output preconditioning that keeps the network's effective inputs and targets in a fixed numerical range across all $\sigma$.

Mode coverage versus prompt fidelity. High classifier-free guidance sharpens prompt adherence at the cost of mode coverage; the resulting samples look prototypical and lose the long-tail diversity present at $w = 1$. There is no free lunch; the practical recipe is to use moderate guidance ($w \in [3, 8]$ for SD-style image models) and rely on prompting and conditioning channels (ControlNet, IP-Adapter) for fine control rather than pushing $w$.

Sampler-architecture mismatch. A network trained with the simplified $\boldsymbol{\epsilon}$-prediction loss and sampled with DDPM ancestral sampling will not necessarily transfer cleanly to a DPM-Solver++ ODE solver at low step counts. EDM Heun plus EDM-trained networks form one matched stack; rectified-flow Euler plus flow-matching networks form another. Mixing across stacks can leave sample quality on the table.

Where This Shows Up

  • Image generation: Stable Diffusion 3.5, SDXL, DALL-E 3, Imagen 3, Midjourney v6, Flux.1, Ideogram
  • Video generation: Sora (OpenAI), Veo 3 (Google), Kling 2 (Kuaishou), Runway Gen-3, Pika, video world models
  • Audio synthesis: AudioLDM 2, Stable Audio 2, Diff-A-Riff, music generation systems
  • 3D generation: DreamFusion (SDS-based text-to-3D), MVDream multi-view diffusion, point-cloud diffusion, and Gaussian splatting generators
  • Molecular and protein design: RFdiffusion (RF-diffusion for protein structure), GeoDiff for molecular conformers, Boltz-1 for structure prediction with diffusion components
  • Robotics: Diffusion Policy (Chi et al. 2023) for visuomotor control, 3D Diffuser Actor for manipulation
  • Decision making and planning: Janner et al. 2022 trajectory diffusion; world-model planners that diffuse over future states

Diffusion vs. GANs in One Table

If someone asks "why are GANs not diffusion models?" the clean answer is: diffusion learns how to reverse a fixed corruption process, while GANs learn to fool a discriminator. Both may start from noise at sampling time, but the objective, the intermediate states, and the failure modes are different.

| Question | Diffusion | GAN |
| --- | --- | --- |
| What is the training target? | Predict noise, denoised sample, or score at many noise levels | Fool a discriminator or critic |
| Is there a fixed forward process? | Yes: add Gaussian noise by design | No: latent prior is sampled directly |
| What is optimized? | Regression or ELBO-style objective | Minimax game |
| How is a sample generated? | Iterative reverse denoising from $\mathbf{x}_T$ to $\mathbf{x}_0$ | One generator pass from $\mathbf{z}$ to $\mathbf{x}$ |
| Typical weakness | Slow sampling | Mode collapse and unstable training |
| Typical strength | Stable training, strong mode coverage, flexible conditioning | Very fast inference once trained |

Common Confusions

Watch Out

DDPM and score-based models are not different methods

The DDPM derivation starts from a variational bound on a Markov chain; the NCSN/score-SDE derivation starts from score matching plus the reverse-time SDE. They produce the same training loss up to per-timestep weighting and the same network architecture. The two papers (Ho et al. 2020, Song & Ermon 2019) arrived at the same algorithm from different angles within months of each other, and Song et al. (2021) made the equivalence rigorous via the SDE view. Treat them as one framework with two parameterizations.

Watch Out

Probability flow ODE is not flow matching

Both produce deterministic ODEs that transport noise to data, and both are called "ODE samplers" in casual usage. They differ in what is held fixed:

  • Probability flow ODE is derived from a forward stochastic process; the velocity field equals $\mathbf{f} - \tfrac{1}{2} g^2 \nabla \log p_t$ and is determined once the SDE is chosen.
  • Flow matching starts from a chosen probability path and trains a velocity field directly via regression. There is no underlying SDE.

Diffusion's probability flow ODE is a special case of flow matching, but flow matching is the more general construction. Rectified flow (Liu et al. 2023) is one specific flow-matching path that is not a diffusion probability flow ODE.

Watch Out

More steps is not always better

DDPM originally used $T = 1000$ steps because the discretization error of ancestral sampling on the reverse VP-SDE is large at coarse step counts. Modern higher-order solvers on the probability flow ODE (DDIM, DPM-Solver++, EDM Heun) close most of the quality gap at 10-25 steps. Step count is a sampler choice, not a model property; a single trained model can be sampled at any step count from 1 (with distillation) to 1000.

Watch Out

Guidance scale is not a quality dial

Increasing the classifier-free guidance scale $w$ does not monotonically improve quality. Past a sweet spot (typically $w \in [5, 12]$ for image models) the samples saturate, lose color diversity, and pick up high-frequency artifacts. The mechanism is that the implicit-classifier extrapolation pushes the score into low-density regions where it was never trained. CFG-rescaled and dynamic thresholding are practical workarounds, but high $w$ trading sample diversity for prompt fidelity is a structural feature of the method, not a bug to be tuned away.

Watch Out

Latent diffusion is not a separate generative model family

Stable Diffusion is a latent diffusion model: an autoencoder plus a standard diffusion process on the latents. The diffusion math is unchanged; only the operating space differs. Failure modes of latent diffusion (small text, faces) are autoencoder failures, not diffusion failures.

Summary

  • Forward process adds Gaussian noise on a fixed schedule; the closed-form marginal $q(\mathbf{x}_t \mid \mathbf{x}_0)$ makes training trivially parallel.
  • Reverse process is a learned Markov kernel (DDPM) or, equivalently, a learned score $\nabla \log p_t$ plugged into the reverse-time SDE (score-SDE). The two views train the same network with the same loss.
  • DDPM ELBO decomposes into per-step KL terms that reduce to a weighted noise-prediction MSE; the simplified $\mathcal{L}_{\text{simple}}$ drops the weights and trains best in practice.
  • Reverse-time SDE (Anderson 1982) is the foundational result: for any forward SDE, the reverse process equals the forward dynamics minus $g(t)^2 \nabla \log p_t$ plus a backward Wiener increment.
  • Probability flow ODE removes the stochasticity while preserving marginal densities. It enables deterministic sampling (DDIM), exact likelihoods, and invertible noise-to-sample bijections.
  • Classifier-free guidance sharpens conditional samples by extrapolating in the direction of the implicit classifier $\boldsymbol{\epsilon}(c) - \boldsymbol{\epsilon}(\varnothing)$. It is the mechanism behind every modern text-to-image system.
  • Latent diffusion moves the diffusion process into a pretrained autoencoder's latent space, cutting per-step compute by roughly $64\times$ and making consumer-GPU image synthesis practical.
  • Flow matching generalizes the probability flow ODE by training a velocity field directly against any chosen probability path; rectified flow is the linear-path special case used by SD3 and Flux.

Exercises

ExerciseCore

Problem

Given $\bar{\alpha}_t = 0.01$ (i.e., 99% of the variance is noise), write $\mathbf{x}_t$ in terms of $\mathbf{x}_0$ and $\boldsymbol{\epsilon}$. What fraction of the original signal amplitude remains, and what is the signal-to-noise ratio?

ExerciseCore

Problem

Show that the DDPM noise-prediction objective and the score model are linked by $\mathbf{s}_\theta(\mathbf{x}_t, t) = -\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) / \sqrt{1 - \bar{\alpha}_t}$. What does this imply about training a single network and using it both as a DDPM denoiser and as a score model in the reverse-time SDE?

ExerciseAdvanced

Problem

Derive the probability flow ODE from the forward SDE $d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,dt + g(t)\,d\mathbf{w}$ by writing the Fokker-Planck equation for $p_t$ and re-expressing the diffusion term as a drift correction.

ExerciseAdvanced

Problem

A practitioner trains a diffusion model with $\mathcal{L}_{\text{simple}}$ and reports excellent FID but poor NLL relative to a normalizing-flow baseline. Is this a contradiction? Explain how $\mathcal{L}_{\text{simple}}$ relates to a tight likelihood bound and what to change if NLL is the priority.

ExerciseCore

Problem

An interviewer says: "Both GANs and diffusion start from noise and end at an image. Why are they different?" Give a mechanism-level answer.

ExerciseResearch

Problem

Latent diffusion processes images through a fixed VAE bottleneck. Suppose you want to fine-tune Stable Diffusion 1.5 (4-channel latent, $f = 8$ downsampling) to render legible small text. The diffusion training loss decreases but generated text remains garbled. Where is the bottleneck, and what would you change?


Frequently Asked Questions

How is a diffusion model different from a VAE?
A VAE has an encoder that learns a compressed latent and a decoder that maps back. A diffusion model has no learned encoder; the forward process is a fixed Gaussian noise schedule, and only the reverse denoiser is trained. Latent diffusion combines both: a VAE provides the latent space, and a diffusion model operates inside it.
Why does DDIM sample faster than DDPM?
DDPM is the stochastic reverse SDE discretized at every training timestep, typically 1000 steps. DDIM uses the deterministic probability-flow ODE with the same trained score, so the sampler is non-Markovian and can take much larger steps. With 50 steps DDIM produces samples comparable to 1000-step DDPM from the same model.
What is classifier-free guidance?
A way to steer samples toward a condition without a separate classifier. Train one model on both conditional and unconditional targets (with the condition $c$ randomly dropped to a null token). At sampling time, combine the conditional and unconditional predictions: $\tilde{\boldsymbol{\epsilon}} = (1 + w)\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c) - w\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$. Higher $w$ trades sample diversity for prompt fidelity.
Why use latent diffusion (Stable Diffusion) instead of pixel diffusion?
Pixel-space diffusion at $1024 \times 1024$ is computationally heavy, and the diffusion model wastes capacity on perceptual details a learned VAE can recover. Latent diffusion runs the U-Net on an 8x downsampled VAE latent, cuts compute roughly 64x, and concentrates the diffusion model's modeling power on the semantic structure of the image. This is the design choice that made high-resolution image generation tractable.
Are diffusion and flow matching the same thing?
Mathematically very close. The probability-flow ODE of a diffusion process IS a flow. Flow matching (Lipman et al. 2023) directly learns the velocity field rather than the score; the training objective has lower variance and the resulting flow is straighter, enabling fewer-step sampling. Rectified flow takes this further by iteratively rectifying the flow toward straight-line trajectories.

References

Foundations:

  • Hyvärinen, "Estimation of Non-Normalized Statistical Models by Score Matching", JMLR 6:695-709 (2005). The original score-matching objective with the integration-by-parts trick that avoids the normalizing constant.
  • Vincent, "A Connection Between Score Matching and Denoising Autoencoders", Neural Computation 23(7):1661-1674 (2011). Establishes the denoising-score-matching equivalence that licenses learning $\nabla \log p$ by predicting noise.
  • Anderson, "Reverse-time diffusion equation models", Stochastic Processes and their Applications 12(3):313-326 (1982). The reverse-time SDE result that powers all score-based sampling.
  • Sohl-Dickstein, Weiss, Maheswaranathan, Ganguli, "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" (ICML 2015), arXiv:1503.03585. The original diffusion-model paper, five years before DDPM scaled it.

Canonical (2019-2021):

  • Song & Ermon, "Generative Modeling by Estimating Gradients of the Data Distribution" (NCSN, NeurIPS 2019), arXiv:1907.05600. The score-based-generation paper rediscovering and scaling Vincent (2011).
  • Ho, Jain, Abbeel, "Denoising Diffusion Probabilistic Models" (NeurIPS 2020), arXiv:2006.11239. The paper that made diffusion competitive with GANs on images. Section 3 derives $\mathcal{L}_{\text{simple}}$.
  • Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021), arXiv:2011.13456. Unifies DDPM and NCSN under the continuous-time SDE framework; introduces the probability flow ODE.
  • Song, Meng, Ermon, "Denoising Diffusion Implicit Models" (DDIM, ICLR 2021), arXiv:2010.02502. Deterministic sampling with 10-50 steps via the probability flow ODE perspective.
  • Nichol & Dhariwal, "Improved Denoising Diffusion Probabilistic Models" (ICML 2021), arXiv:2102.09672. Cosine schedule, learned variances, log-likelihood improvements.

Conditioning and guidance:

  • Dhariwal & Nichol, "Diffusion Models Beat GANs on Image Synthesis" (NeurIPS 2021), arXiv:2105.05233. Classifier guidance with a separately trained noise-conditioned classifier; established the FID parity claim.
  • Ho & Salimans, "Classifier-Free Diffusion Guidance" (NeurIPS Workshop 2021, full version 2022), arXiv:2207.12598. The single-network conditioning trick behind every production text-to-image system.

Practical training and sampling (2022-2023):

  • Karras, Aittala, Aila, Laine, "Elucidating the Design Space of Diffusion-Based Generative Models" (EDM, NeurIPS 2022), arXiv:2206.00364. The reference parameterization analysis with a second-order Heun sampler; reframes everything in terms of $\sigma$.
  • Lu, Zhou, Bao, Chen, Li, Zhu, "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling" (NeurIPS 2022), arXiv:2206.00927; and the DPM-Solver++ followup. High-order ODE solvers tailored to the diffusion ODE.
  • Salimans & Ho, "Progressive Distillation for Fast Sampling of Diffusion Models" (ICLR 2022), arXiv:2202.00512. Distill an $N$-step teacher into an $\lfloor N/2 \rfloor$-step student, iterated down to 1-4 steps.
  • Song, Dhariwal, Chen, Sutskever, "Consistency Models" (ICML 2023), arXiv:2303.01469. Single-step generation by training a network to map any point on the ODE trajectory to its origin.
  • Lipman, Chen, Ben-Hamu, Nickel, Le, "Flow Matching for Generative Modeling" (ICLR 2023), arXiv:2210.02747. Continuous normalizing flows recast as regression against velocity fields; foundation for SD3 and Flux.
  • Liu, Gong, Liu, "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow" (ICLR 2023), arXiv:2209.03003. The straight-line conditional path that yields the simplest flow-matching objective.

Latent and frontier image systems:

  • Rombach, Blattmann, Lorenz, Esser, Ommer, "High-Resolution Image Synthesis with Latent Diffusion Models" (LDM/Stable Diffusion, CVPR 2022), arXiv:2112.10752. The autoencoder-plus-diffusion architecture behind SD 1.x/2.x/SDXL.
  • Esser, Kulal, Blattmann et al., "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis" (Stable Diffusion 3, ICML 2024), arXiv:2403.03206. Flow-matching MMDiT architecture; widely deployed open-weights system as of 2024.
  • Saharia et al., "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding" (Imagen, NeurIPS 2022), arXiv:2205.11487. Cascaded pixel-space diffusion with T5-XXL text conditioning.

Failure modes and corrections:

  • Ning, Li, Su, Salah, Ertugrul, "Elucidating the Exposure Bias in Diffusion Models" (ICLR 2024), arXiv:2308.15321. The training-vs-sampling input-distribution mismatch; introduces epsilon-scaling.
  • Lin, Liu, Liu, "Common Diffusion Noise Schedules and Sample Steps are Flawed" (WACV 2024), arXiv:2305.08891. Zero-SNR schedule fix and CFG rescale; explains the saturated-color failure mode of high-CFG sampling.

Selected applications:

  • Chi, Feng, Du et al., "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion" (RSS 2023), arXiv:2303.04137. Diffusion as a robotics action prior.
  • Xu, Yu, Song, Shi, Ermon, Tang, "GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation" (ICLR 2022), arXiv:2203.02923. Diffusion on molecular geometry.
  • Watson, Juergens, Bennett et al., "De novo design of protein structure and function with RFdiffusion" (Nature 2023). RoseTTAFold + diffusion for protein backbone design.

Next Topics

The natural next steps from diffusion models:

  • Flow matching: the generalization that drives Stable Diffusion 3, Flux, and most recent video models.
  • Score matching: the underlying objective with the full Hyvärinen-Vincent derivation and its extensions.
  • Video world models: how Sora, Veo, and Kling extend image diffusion to spatiotemporal data with simulation-grade physical fidelity claims.
  • Gaussian splatting: the dominant 3D representation that diffusion models increasingly target.

Last reviewed: April 19, 2026
