

Diffusion Models

Generative models that learn to reverse a noise-adding process: forward and reverse SDEs, score matching, the DDPM ELBO, deterministic ODE sampling, classifier-free guidance, latent diffusion, and the bridge to flow matching.

Advanced · Tier 1 · Current · Core spine · ~95 min

Why This Matters

Diffusion keeps the corruption process fixed and spends all learnable capacity on reversing it

Training is cheap because any noisy state can be sampled in closed form. Sampling is expensive because the model has to walk a noisy prior back toward structure one denoising move at a time.

Training target

Sample $\mathbf{x}_0$, draw $t$ and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, then form $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}$ without simulating the whole chain.

Equivalent parameterizations

Predicting the noise $\boldsymbol{\epsilon}$ or the clean sample $\mathbf{x}_0$, the score $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$, or a velocity-style target $\mathbf{v}$ all describe the same denoising direction once you rescale by the known noise level.

What the sampler actually changes

DDPM keeps stochasticity, DDIM makes the path deterministic, and newer ODE solvers buy quality at lower step counts. The model is the same; the numerical path is what changes.

Why guidance helps and hurts

Classifier-free guidance extrapolates with $\tilde{\boldsymbol{\epsilon}} = (1 + w)\,\boldsymbol{\epsilon}_\theta(c) - w\,\boldsymbol{\epsilon}_\theta(\varnothing)$. Moderate $w$ sharpens the prompt; large $w$ pushes the sampler off the density it actually learned.

Diffusion is the dominant generative architecture for images, video, audio, 3D geometry, molecules, and robot policies as of 2025-2026. The frontier image systems (Stable Diffusion 3.5, Imagen 3, Flux.1, DALL-E 3, Midjourney v6) and the frontier video systems (Sora, Veo 3, Kling, Runway Gen-3) are all either diffusion models or their close relative, flow matching. Outside generative media, diffusion appears in protein-structure design, robotics action policies (Chi et al. 2023), molecular conformer generation (Xu et al. 2022), and decision-making.

Three properties separate diffusion from earlier generative families:

  1. Stable training. Unlike GANs, diffusion has no minimax game. The loss is a weighted regression against a known noise target. Training scales monotonically with compute and data.
  2. Mode coverage. Unlike VAEs and GANs, diffusion does not aggressively collapse modes. The score-matching objective averages over the data distribution rather than fighting it.
  3. Likelihood-grade theory. The discrete formulation gives a variational ELBO upper bound on negative log-likelihood (Sohl-Dickstein 2015, Ho et al. 2020). The objective most diffusion models actually train on, $\mathcal{L}_{\text{simple}}$, drops the per-timestep weights and is not the tight ELBO; it is a reweighted surrogate that improves sample quality at the cost of bound tightness. The continuous-time formulation gives exact likelihoods through an ODE change-of-variables (Song et al. 2021), independent of how the training loss was weighted.

The price is iterative sampling: a single image requires 4-50 forward passes through a large model. Most of the engineering progress since 2020 has been about cutting that cost.

If you want to watch timestep, guidance scale, and sampler choice change that tradeoff directly, open the Diffusion Lab.

Mental Model

Take a clean image, gradually add Gaussian noise over many steps until the image is pure noise, and then learn to undo each step. The forward process is fixed and analytically tractable; only the reverse direction is learned. At sampling time, draw a Gaussian noise vector and integrate the learned reverse dynamics back to a clean image.

There are several common parameterizations for what the network actually predicts: the noise $\boldsymbol{\epsilon}$ that was added (DDPM, Ho et al. 2020), the score $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ of the perturbed data (NCSN, Song & Ermon 2019), the clean image $\mathbf{x}_0$, or the linear combination $\mathbf{v}$ (Salimans & Ho 2022). At a fixed noise level $t$ these targets are linear functions of each other, so the algebraic relationship is exact (Vincent 2011 gives the denoising-score identity). What the trained network ends up computing is not identical across choices: the implicit per-timestep loss weighting differs, which changes which noise levels the model is optimized for, and that interacts with the choice of sampler at inference time. The parameterization is a real design choice, not just a notational rewrite: schedule, weighting convention, and sampler all matter.

Formal Setup

Definition

Forward Process (Discrete-Time DDPM)

The forward process is a Markov chain that adds Gaussian noise over $T$ steps. Starting from $\mathbf{x}_0 \sim q(\mathbf{x}_0)$:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\big(\mathbf{x}_t;\; \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1},\; \beta_t \mathbf{I}\big)$$

where $\beta_1, \ldots, \beta_T \in (0, 1)$ is a fixed variance schedule. With $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, the marginal at any step has a closed form:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t;\; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\; (1 - \bar{\alpha}_t)\mathbf{I}\big)$$

so $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. As $T \to \infty$ with the schedule chosen so that $\bar{\alpha}_T \to 0$, the prior $q(\mathbf{x}_T)$ converges to $\mathcal{N}(\mathbf{0}, \mathbf{I})$.
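As a concrete illustration, here is a minimal PyTorch sketch of sampling $\mathbf{x}_t$ from the closed-form marginal. The linear beta schedule, the value of $T$, and the tensor shapes are illustrative assumptions, not part of any specific codebase.

```python
import torch

# Illustrative linear schedule (DDPM-style); T, the beta range, and shapes are assumptions.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # beta_1 ... beta_T
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Draw x_t ~ q(x_t | x_0) in closed form: no simulation of the chain."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)    # broadcast over (B, C, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

x0 = torch.randn(8, 3, 32, 32)                 # stand-in batch of "clean" data
t = torch.randint(0, T, (8,))                  # one timestep per example
eps = torch.randn_like(x0)
xt = q_sample(x0, t, eps)
```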

Definition

Reverse Process (Learned Markov Kernel)

The reverse process is a parametric Markov chain trained to undo the forward process:

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\; \Sigma_\theta(\mathbf{x}_t, t)\big)$$

In Ho et al. (2020) the variance is fixed to $\sigma_t^2 \mathbf{I}$ (with $\sigma_t^2 \in \{\beta_t, \tilde{\beta}_t\}$), and the mean is reparameterized via a noise predictor $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$:

$$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\Big(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\Big)$$

Nichol & Dhariwal (2021) introduced a learned $\Sigma_\theta$ for tighter likelihood bounds.

Definition

Score Function

The score of a density $p(\mathbf{x})$ is

$$\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x})$$

A score model $\mathbf{s}_\theta(\mathbf{x}, t) \approx \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ targets the score of the perturbed distribution at noise level $t$. The score does not require knowing the normalizing constant of $p$, which is what makes it tractable to learn from samples alone (Hyvärinen 2005).

Definition

Forward SDE (Continuous-Time)

A continuous-time forward process is an Itô SDE on $[0, T]$:

$$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,dt + g(t)\,d\mathbf{w}$$

with drift $\mathbf{f}: \mathbb{R}^d \times [0, T] \to \mathbb{R}^d$, scalar diffusion coefficient $g: [0, T] \to \mathbb{R}_{\geq 0}$, and standard Wiener process $\mathbf{w}$. The two canonical choices are Variance Preserving (VP-SDE), with $\mathbf{f}(\mathbf{x}, t) = -\tfrac{1}{2}\beta(t)\mathbf{x}$ and $g(t) = \sqrt{\beta(t)}$, the continuous limit of DDPM, and Variance Exploding (VE-SDE), with $\mathbf{f} = \mathbf{0}$ and $g(t) = \sqrt{d[\sigma^2(t)]/dt}$, the continuous limit of NCSN (Song et al. 2021).

Main Theorems

Theorem

DDPM Variational Lower Bound

Statement

The negative log-likelihood admits the variational upper bound

$$-\log p_\theta(\mathbf{x}_0) \;\leq\; \mathbb{E}_q\Big[\, L_T + \sum_{t=2}^{T} L_{t-1} + L_0 \,\Big]$$

with

$$\begin{aligned} L_T &= D_{\mathrm{KL}}\big(q(\mathbf{x}_T \mid \mathbf{x}_0) \,\big\Vert\, p(\mathbf{x}_T)\big), \\ L_{t-1} &= D_{\mathrm{KL}}\big(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\big\Vert\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\big), \\ L_0 &= -\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1). \end{aligned}$$

When the reverse mean is reparameterized via a noise predictor $\boldsymbol{\epsilon}_\theta$ and the variance is fixed, each $L_{t-1}$ reduces to a weighted squared error. Dropping the timestep weights gives the simplified DDPM loss of Ho et al. (2020):

$$\mathcal{L}_{\text{simple}}(\theta) \;=\; \mathbb{E}_{t \sim \mathrm{Unif}\{1, \ldots, T\},\; \mathbf{x}_0,\; \boldsymbol{\epsilon}}\Big[\,\big\Vert\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon},\; t\big)\big\Vert^2\,\Big]$$

Intuition

Training reduces to: sample a clean image $\mathbf{x}_0$, sample a timestep $t$ and a noise vector $\boldsymbol{\epsilon}$, form the noisy view $\mathbf{x}_t$ in closed form, and regress the network output against the noise. No sampling of long chains during training, no GAN-style adversary, no per-step backpropagation through time.
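A minimal training-step sketch of $\mathcal{L}_{\text{simple}}$ under the same assumed schedule as the forward-process sketch; `model` is a hypothetical noise-prediction network taking $(\mathbf{x}_t, t)$.

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, T, alpha_bars):
    """One L_simple step: sample t and eps, form x_t in closed form, regress against eps."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    eps_pred = model(xt, t)
    return F.mse_loss(eps_pred, eps)            # unweighted: w_t = 1 for all t
```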

Proof Sketch

Apply the standard ELBO decomposition to $-\log p_\theta(\mathbf{x}_0)$:

  1. Write $-\log p_\theta(\mathbf{x}_0) \leq \mathbb{E}_q\big[-\log \tfrac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\big]$ by Jensen.
  2. Use Bayes to rewrite the forward chain as the time-reversed $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$, which is a closed-form Gaussian because the forward kernel is Gaussian with known means.
  3. Group terms by $t$: prior at $t = T$, KL between true and learned reverse kernel for $1 < t \leq T$, and reconstruction at $t = 1$.
  4. Each KL between two Gaussians with the same fixed variance reduces to a squared-mean difference. Reparameterizing the mean as $\boldsymbol{\epsilon}_\theta$ folds the $\mathbf{x}_t$-dependent constants into a single timestep weight $w_t$.
  5. The simplified loss sets $w_t = 1$. Karras et al. (2022) note that uniform timestep sampling implicitly weights $L_{t-1}$ by SNR-dependent factors that happen to align with perceptual quality, which is why $\mathcal{L}_{\text{simple}}$ outperforms the tight ELBO for sample quality despite being looser as a likelihood bound.

Why It Matters

This is why diffusion training is engineering-friendly. The loss is a single mean-squared error over noise predictions, parallelized across timesteps and data points. The variational bound certifies that minimizing $\mathcal{L}_{\text{simple}}$ drives down a valid upper bound on $-\log p_\theta(\mathbf{x}_0)$, even though the bound itself is loose.

Failure Mode

$\mathcal{L}_{\text{simple}}$ underweights large-$t$ steps for likelihood purposes. If the goal is reporting NLL or BPD (e.g., density estimation benchmarks), use the variance-weighted variational loss or the EDM parameterization (Karras et al. 2022). For sample quality on standard benchmarks, $\mathcal{L}_{\text{simple}}$ is the strong baseline.

Theorem

Denoising Score Matching Equivalence

Statement

For the perturbation kernel $q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}}; \mathbf{x}, \sigma^2 \mathbf{I})$ and the induced marginal $q_\sigma(\tilde{\mathbf{x}}) = \int q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$, the denoising score matching objective

$$\mathcal{L}_{\text{DSM}}(\theta) = \mathbb{E}_{p(\mathbf{x})\, q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})}\Big[\,\big\Vert\mathbf{s}_\theta(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})\big\Vert^2\,\Big]$$

is equal, up to a $\theta$-independent constant, to the marginal score matching objective $\mathbb{E}_{q_\sigma(\tilde{\mathbf{x}})}\big[\Vert\mathbf{s}_\theta(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}})\Vert^2\big]$.

Equivalently, the noise-prediction objective $\mathbb{E}\big[\Vert\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\Vert^2\big]$ and the score model are linked by

$$\mathbf{s}_\theta(\mathbf{x}_t, t) \;=\; -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1 - \bar{\alpha}_t}}$$

so denoising and score learning are the same algorithm.

Intuition

You cannot evaluate $\nabla \log p(\mathbf{x})$ for the data distribution because $p$ is unknown. But after corrupting $\mathbf{x}$ with known Gaussian noise, the conditional score $\nabla \log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})$ is just the negative noise direction, which the network can learn from $(\mathbf{x}, \tilde{\mathbf{x}})$ pairs. The marginal score follows by an integration-by-parts argument that shifts the $\mathbf{x}$ expectation onto the network output.

Why It Matters

This is the bridge between the DDPM view (denoise pixels) and the score-SDE view (estimate $\nabla \log p_t$). They are not two methods; they are one method with two parameterizations. The same trained network drives both discrete-time DDPM samplers and continuous-time SDE/ODE solvers.
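In code the conversion between the two parameterizations is one line; a sketch under the same assumed schedule:

```python
def score_from_eps(eps_pred, t, alpha_bars):
    """s_theta(x_t, t) = -eps_theta(x_t, t) / sqrt(1 - a_bar_t)."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return -eps_pred / (1.0 - a_bar).sqrt()
```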

Failure Mode

The equivalence assumes a Gaussian perturbation kernel. With non-Gaussian corruption (e.g., cold diffusion or masking), the conditional-score formula changes and the simple "predict the noise" target no longer applies (Bansal et al. 2022).

Theorem

Anderson Reverse-Time SDE

Statement

The forward SDE $d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,dt + g(t)\,d\mathbf{w}$ on $t \in [0, T]$ has a reverse-time SDE with the same marginal distributions:

$$d\mathbf{x} \;=\; \big[\mathbf{f}(\mathbf{x}, t) \;-\; g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\big]\,dt \;+\; g(t)\,d\bar{\mathbf{w}}$$

where $\bar{\mathbf{w}}$ is a standard Wiener process running backward in time and the SDE is integrated from $t = T$ down to $t = 0$.

Intuition

Forward dynamics push samples from data to noise; the reverse correction term $-g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ is the "uphill" force that sends them back to high-density regions. The score is the only data-dependent quantity, and the diffusion coefficient $g(t)$ is unchanged.

Proof Sketch

Anderson (1982) proves this by writing the joint $p(\mathbf{x}_s, \mathbf{x}_t)$ for $s < t$, conditioning to obtain the time-reversed transition density, and verifying it satisfies the reverse-time Fokker-Planck equation. Modern proofs (Haussmann & Pardoux 1986, Föllmer 1985) start from the time-reversal of the SDE directly. The intuition is that the forward Fokker-Planck equation for $p_t$ involves a drift term and a Laplacian term; reversing time flips the sign of the drift, and the Laplacian becomes a drift correction whose magnitude is exactly $g(t)^2 \nabla \log p_t$.

Why It Matters

This is the foundational result that makes score-based generative modeling work. Train a neural network to approximate $\nabla \log p_t$ at every $t$, plug it into the reverse SDE, and integrate from $T$ to $0$ starting from Gaussian noise. Score-SDE (Song et al. 2021) unifies DDPM and NCSN as discrete solvers for VP- and VE-flavored versions of this same equation.
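A sketch of one Euler-Maruyama step of the reverse-time VP-SDE; `score_fn`, `beta_fn`, and the step size `dt` are stand-ins for a trained score model and a chosen schedule, not a specific implementation.

```python
import torch

def reverse_sde_step(x, t, dt, score_fn, beta_fn):
    """One Euler-Maruyama step of dx = [f - g^2 * score] dt + g dw, integrated from t toward t - dt.

    VP-SDE: f(x, t) = -0.5 * beta(t) * x and g(t) = sqrt(beta(t)); dt > 0 is the step size.
    """
    beta = beta_fn(t)
    drift = -0.5 * beta * x - beta * score_fn(x, t)      # f - g^2 * score
    noise = torch.randn_like(x)
    return x - drift * dt + (beta * dt) ** 0.5 * noise   # minus sign: stepping backward in time
```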

Failure Mode

The score model is least accurate near $t = 0$ (the data manifold has near-zero density off the manifold and the score blows up) and near $t = T$ (the noisy distribution is nearly Gaussian and the score is small but easy to mispredict in direction). EDM (Karras et al. 2022) handles both ends with explicit preconditioning of the network input/output.

Theorem

Probability Flow ODE

Statement

The marginal densities $\{p_t\}_{t \in [0, T]}$ of the forward SDE are also the marginals of the deterministic ODE

$$\frac{d\mathbf{x}}{dt} \;=\; \mathbf{f}(\mathbf{x}, t) \;-\; \tfrac{1}{2}\, g(t)^2\, \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$$

Integrating this ODE backward in time from a sample $\mathbf{x}_T \sim p_T$ produces a sample $\mathbf{x}_0 \sim p_0$ without any injected noise. The ODE is a deterministic, invertible bijection between the prior and the data distribution at the level of single trajectories.

Intuition

Halving the score-correction term and dropping the Wiener increment preserves the family of marginal densities while removing all randomness from sample paths. Stochastic and deterministic samplers see different per-trajectory dynamics but the same marginal distribution at every $t$.

Why It Matters

Three consequences land here. First, deterministic sampling: DDIM (Song, Meng, Ermon 2020) is a discretization of this ODE for the VP-SDE, which is why it samples in 10-50 steps instead of 1000. Second, exact likelihood evaluation: integrating the ODE forward gives a continuous normalizing flow, and the change-of-variables formula yields exact log-likelihoods (Song et al. 2021). Third, interpolation and editing: noise-to-sample is bijective along each trajectory, so editing in noise space corresponds to consistent edits in sample space, the basis of techniques like SDEdit (Meng et al. 2022).

Failure Mode

The ODE shares the score model's error budget but loses the regularizing stochasticity. In regions where the score is mispredicted, deterministic trajectories accumulate the error monotonically, while the SDE injects fresh noise that can correct course. EDM and DPM-Solver compensate with higher-order solvers (Heun, multistep) rather than more steps.

Theorem

Classifier-Free Guidance

Statement

Let $p_t(\mathbf{x} \mid c)$ be the noise-perturbed conditional and $p_t(\mathbf{x})$ the unconditional marginal. The Bayesian rule $\log p_t(\mathbf{x} \mid c) = \log p_t(c \mid \mathbf{x}) + \log p_t(\mathbf{x}) + \mathrm{const}$ implies

$$\nabla_{\mathbf{x}} \log p_t(\mathbf{x} \mid c) \;=\; \nabla_{\mathbf{x}} \log p_t(c \mid \mathbf{x}) \;+\; \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$$

Sampling from a sharpened conditional $p_t(\mathbf{x} \mid c)\, p_t(c \mid \mathbf{x})^{w}$ (more weight on the implicit classifier) corresponds to using the guided score

$$\tilde{\mathbf{s}}_\theta(\mathbf{x}, t, c) \;=\; (1 + w)\,\mathbf{s}_\theta(\mathbf{x}, t, c) \;-\; w\,\mathbf{s}_\theta(\mathbf{x}, t)$$

or equivalently in noise-prediction form

$$\tilde{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t, c) \;=\; (1 + w)\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c) \;-\; w\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$$

The conditional and unconditional scores are produced by the same network, trained with the conditioning signal randomly dropped to a null token at rate $p_{\text{drop}} \approx 0.1$ (Ho & Salimans 2022).

Intuition

The difference between the model's own conditional and unconditional predictions, $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$, is proportional to the implicit classifier gradient $\nabla_{\mathbf{x}} \log p_t(c \mid \mathbf{x})$ (with proportionality factor $-\sqrt{1 - \bar{\alpha}_t}$). Extrapolating in that direction sharpens the conditional, biasing samples toward higher implicit-classifier confidence in $c$.
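In noise-prediction form the guidance rule is two forward passes and a linear combination; a sketch where `model`, `cond`, and `null_cond` (the learned null token) are hypothetical stand-ins:

```python
def cfg_eps(model, xt, t, cond, null_cond, w):
    """eps_tilde = (1 + w) * eps(x_t, t, c) - w * eps(x_t, t, null)."""
    eps_cond = model(xt, t, cond)
    eps_uncond = model(xt, t, null_cond)
    return (1.0 + w) * eps_cond - w * eps_uncond
```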

Why It Matters

Every text-to-image and text-to-video system in production uses classifier-free guidance. It replaced the earlier classifier guidance of Dhariwal & Nichol (2021), which required training a separate noise-conditioned classifier. CFG needs only a single dropout rate and one extra forward pass per sampling step.

Failure Mode

At high $w$ (typically $w > 15$ for image models), samples become saturated, high-contrast, and lose diversity. At very high $w$ they fall off the data manifold entirely, producing the characteristic "burnt" look of overguided diffusion. The cause is that the implicit-classifier extrapolation pushes samples into low-density regions where the score model was never trained accurately. CFG-zero, CFG-rescaled (Lin et al. 2024), and dynamic thresholding (Imagen) are workarounds that cap or rescale the extrapolation magnitude.

Discrete-Time Samplers

The reverse SDE and probability flow ODE both need a numerical integrator. Sampler choice trades quality for steps. Three families dominate practice.

Ancestral / DDPM sampling. A first-order Euler-Maruyama discretization of the reverse VP-SDE. Use $T = 1000$ steps for high quality, $T = 250$ for fast research baselines. Stochastic (injects fresh Gaussian noise per step), so it self-corrects but needs many steps.

DDIM. A first-order discretization of the probability flow ODE for the VP-SDE (Song, Meng, Ermon 2020), with a single hyperparameter $\eta \in [0, 1]$ that interpolates between fully deterministic sampling ($\eta = 0$) and ancestral sampling ($\eta = 1$). Hits competitive FID at 25-50 steps; a minimal deterministic loop is sketched below.
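A minimal deterministic DDIM ($\eta = 0$) loop under the assumed schedule from earlier; the coarse timestep subsequence and the `model` interface are illustrative assumptions.

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, alpha_bars, num_steps=50):
    """Deterministic DDIM: x_prev = sqrt(a_bar_prev) * x0_hat + sqrt(1 - a_bar_prev) * eps_hat."""
    T = alpha_bars.shape[0]
    ts = torch.linspace(T - 1, 0, num_steps).long()      # coarse subsequence of timesteps
    x = torch.randn(shape)                               # start from the Gaussian prior
    for i, t in enumerate(ts):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[ts[i + 1]] if i + 1 < len(ts) else torch.tensor(1.0)
        t_batch = torch.full((shape[0],), int(t), dtype=torch.long)
        eps_hat = model(x, t_batch)
        x0_hat = (x - (1.0 - a_t).sqrt() * eps_hat) / a_t.sqrt()   # predicted clean sample
        x = a_prev.sqrt() * x0_hat + (1.0 - a_prev).sqrt() * eps_hat
    return x
```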

EDM Heun & DPM-Solver. Higher-order ODE solvers tailored to the diffusion ODE. EDM (Karras et al. 2022) uses Heun's second-order method on a σ\sigma-parameterized ODE with the Karras schedule; DPM-Solver (Lu et al. 2022) and DPM-Solver++ exploit the semi-linear structure of the diffusion ODE to reach FID-competitive samples in 10-20 steps. UniPC (Zhao et al. 2023) generalizes both with a multistep predictor-corrector.

Distillation-based fast samplers. Progressive distillation (Salimans & Ho 2022), consistency models (Song et al. 2023), and consistency distillation (Luo et al. 2023) train student networks that take large jumps in noise level, reaching 1-4 step generation at the cost of an extra training stage.

Latent Diffusion

Pixel-space diffusion at $1024 \times 1024$ is computationally heavy because every denoising step must process every pixel. Latent Diffusion Models (LDM, Rombach et al. 2022) move the diffusion process into the latent space of a pretrained autoencoder.

The pipeline has three stages:

  1. A VAE encoder $\mathcal{E}$ maps an image $\mathbf{x} \in \mathbb{R}^{H \times W \times 3}$ to a latent $\mathbf{z} = \mathcal{E}(\mathbf{x}) \in \mathbb{R}^{h \times w \times c}$ with $h, w \approx H/8, W/8$ and $c$ between 4 and 16.
  2. A diffusion model is trained on the latents with the standard DDPM or score-SDE objective.
  3. A VAE decoder $\mathcal{D}$ maps generated latents back to pixels.

The autoencoder is trained once with a perceptual loss and a small KL regularizer; it is held fixed during diffusion training. The compute saving is roughly $(H/h)^2 = 64\times$ per denoising step, which is what made Stable Diffusion practical on consumer GPUs. Stable Diffusion 1.x/2.x, SDXL, SD3, and SD3.5 are all latent diffusion. Imagen and Flux.1 use related cascaded or higher-resolution-latent variants.
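A sketch of how the three stages compose; `vae_encode`, `vae_decode`, `denoiser`, `sampler`, and the latent shape are hypothetical stand-ins, not a specific library API.

```python
import torch

def latent_diffusion_sample(vae_decode, denoiser, sampler, latent_shape=(1, 4, 64, 64)):
    """Sampling: run any diffusion sampler over latents, then decode to pixels once."""
    z = sampler(denoiser, latent_shape)        # e.g. the DDIM loop above, applied to latents
    return vae_decode(z)                       # frozen VAE decoder maps latents -> pixels

def latent_diffusion_train_step(vae_encode, denoiser, x, diffusion_loss):
    """Training: encode once with the frozen VAE, then apply the standard noise-prediction loss."""
    with torch.no_grad():
        z0 = vae_encode(x)                     # autoencoder trained separately and held fixed
    return diffusion_loss(denoiser, z0)        # e.g. the L_simple training step sketched earlier
```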

The trade-off is fidelity at fine scales. The autoencoder bottleneck loses high-frequency detail; tiny text and faces are the canonical failure mode of latent diffusion. SDXL added a refiner stage, and SD3 increased the latent channel count from 4 to 16, both targeting this.

Flow Matching

Theorem

Conditional Flow Matching

Statement

Let $\mathbf{x}_t = (1 - t)\,\boldsymbol{\epsilon} + t\,\mathbf{x}_0$ for $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, $\mathbf{x}_0 \sim p_{\text{data}}$, $t \in [0, 1]$. The conditional flow matching loss

$$\mathcal{L}_{\text{CFM}}(\theta) \;=\; \mathbb{E}_{t,\, \mathbf{x}_0,\, \boldsymbol{\epsilon}}\Big[\,\big\Vert\mathbf{v}_\theta(\mathbf{x}_t, t) - (\mathbf{x}_0 - \boldsymbol{\epsilon})\big\Vert^2\,\Big]$$

trains a velocity field $\mathbf{v}_\theta$ whose ODE $d\mathbf{x}/dt = \mathbf{v}_\theta(\mathbf{x}, t)$ transports the source $\mathcal{N}(\mathbf{0}, \mathbf{I})$ to $p_{\text{data}}$ along straight lines from each sampled noise vector to its paired data point (Lipman et al. 2023).

Intuition

Diffusion takes curved paths from data to noise and back through a stochastic SDE; flow matching takes straight lines through a deterministic ODE. The training target is no longer "the noise that was added" but "the constant velocity that interpolates between this noise and this data point."
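A minimal training-step sketch of the conditional flow matching loss with the linear path from the statement above; `model` is a hypothetical velocity-prediction network.

```python
import torch
import torch.nn.functional as F

def cfm_training_step(model, x0):
    """Conditional flow matching with the linear path x_t = (1 - t) * eps + t * x0."""
    eps = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1)
    xt = (1.0 - t) * eps + t * x0
    target = x0 - eps                           # constant velocity along the straight path
    v_pred = model(xt, t.flatten())
    return F.mse_loss(v_pred, target)
```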

Why It Matters

Flow matching is a strict generalization of the probability flow ODE for diffusion: choosing the conditional path $\mathbf{x}_t = \alpha_t \mathbf{x}_0 + \sigma_t \boldsymbol{\epsilon}$ with $(\alpha_t, \sigma_t) = (\sqrt{\bar{\alpha}_t}, \sqrt{1 - \bar{\alpha}_t})$ recovers the diffusion ODE; the linear path $(\alpha_t, \sigma_t) = (t, 1 - t)$ is rectified flow (Liu et al. 2023), which yields straighter trajectories and faster sampling. Stable Diffusion 3 (Esser et al. 2024) and most newer image/video systems are flow-matching trained because the straighter paths permit larger ODE step sizes.

Failure Mode

The linearity is in the latent (or pixel) space, not in the data manifold's intrinsic geometry. For data on highly curved manifolds (molecular conformations, 3D rotations, group-valued data), straight pixel-space paths cross low-density regions and the velocity field has to make large, hard-to-fit corrections. Riemannian flow matching (Chen & Lipman 2024) replaces straight lines with manifold geodesics for these cases.

Failure Modes and Engineering Realities

Schedule sensitivity. The variance schedule $\{\beta_t\}$ is not a hyperparameter you can sweep casually. The cosine schedule of Nichol & Dhariwal (2021) outperforms the linear DDPM schedule across image scales because it spends more steps near $\bar{\alpha}_t \approx 0.5$, where the model learns the most. EDM (Karras et al. 2022) reframes the problem in terms of $\sigma$ rather than $\beta$ and shows that the implied distribution over $\log \sigma$ during training is the right object to design.
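The cosine schedule fits in a few lines; the offset $s = 0.008$ and the clipping of per-step betas follow Nichol & Dhariwal (2021), while the rest of the sketch is an illustrative assumption.

```python
import torch

def cosine_alpha_bars(T: int, s: float = 0.008) -> torch.Tensor:
    """bar_alpha_t = f(t)/f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2)."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T + s) / (1 + s)) * torch.pi / 2) ** 2
    alpha_bars = f / f[0]
    betas = 1.0 - alpha_bars[1:] / alpha_bars[:-1]
    betas = betas.clamp(max=0.999)              # avoid a degenerate final step, as in the paper
    return torch.cumprod(1.0 - betas, dim=0)
```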

Exposure bias. During training the network sees true noisy samples $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}$. During sampling it sees its own previous predictions, which carry error. The mismatch compounds across steps; Ning et al. (2024) analyze it and propose epsilon-scaling, a training-free correction that improves FID across DDIM, EDM, and LDM.

Truncation at the endpoints. The score blows up as $t \to 0$ (the data manifold has near-zero off-manifold density) and the model is poorly calibrated as $t \to T$. Karras et al. (2022) address this with explicit input/output preconditioning that keeps the network's effective inputs and targets in a fixed numerical range across all $\sigma$.

Mode coverage versus prompt fidelity. High classifier-free guidance sharpens prompt adherence at the cost of mode coverage; the resulting samples look prototypical and lose the long-tail diversity present at $w = 1$. There is no free lunch; the practical recipe is to use moderate guidance ($w \in [3, 8]$ for SD-style image models) and rely on prompting and conditioning channels (ControlNet, IP-Adapter) for fine control rather than pushing $w$.

Sampler-architecture mismatch. A network trained with the simplified $\boldsymbol{\epsilon}$-prediction loss and sampled with DDPM ancestral sampling will not necessarily transfer cleanly to a DPM-Solver++ ODE solver at low step counts. EDM Heun plus EDM-trained networks form one matched stack; rectified-flow Euler plus flow-matching networks form another. Mixing across stacks can leave sample quality on the table.

Where This Shows Up

  • Image generation: Stable Diffusion 3.5, SDXL, DALL-E 3, Imagen 3, Midjourney v6, Flux.1, Ideogram
  • Video generation: Sora (OpenAI), Veo 3 (Google), Kling 2 (Kuaishou), Runway Gen-3, Pika, video world models
  • Audio synthesis: AudioLDM 2, Stable Audio 2, Diff-A-Riff, music generation systems
  • 3D generation: DreamFusion (SDS-based text-to-3D), MVDream multi-view diffusion, point-cloud diffusion, and Gaussian splatting generators
  • Molecular and protein design: RFdiffusion (RF-diffusion for protein structure), GeoDiff for molecular conformers, Boltz-1 for structure prediction with diffusion components
  • Robotics: Diffusion Policy (Chi et al. 2023) for visuomotor control, 3D Diffuser Actor for manipulation
  • Decision making and planning: Janner et al. 2022 trajectory diffusion; world-model planners that diffuse over future states

Diffusion vs. GANs in One Table

If someone asks "why are GANs not diffusion models?" the clean answer is: diffusion learns how to reverse a fixed corruption process, while GANs learn to fool a discriminator. Both may start from noise at sampling time, but the objective, the intermediate states, and the failure modes are different.

| Question | Diffusion | GAN |
| --- | --- | --- |
| What is the training target? | Predict noise, denoised sample, or score at many noise levels | Fool a discriminator or critic |
| Is there a fixed forward process? | Yes: add Gaussian noise by design | No: latent prior is sampled directly |
| What is optimized? | Regression or ELBO-style objective | Minimax game |
| How is a sample generated? | Iterative reverse denoising from $\mathbf{x}_T$ to $\mathbf{x}_0$ | One generator pass from $\mathbf{z}$ to $\mathbf{x}$ |
| Typical weakness | Slow sampling | Mode collapse and unstable training |
| Typical strength | Stable training, strong mode coverage, flexible conditioning | Very fast inference once trained |

Common Confusions

Watch Out

DDPM and score-based models are not different methods

The DDPM derivation starts from a variational bound on a Markov chain; the NCSN/score-SDE derivation starts from score matching plus the reverse-time SDE. They produce the same training loss up to per-timestep weighting and the same network architecture. The two papers (Ho et al. 2020, Song & Ermon 2019) arrived at the same algorithm from different angles within months of each other, and Song et al. (2021) made the equivalence rigorous via the SDE view. Treat them as one framework with two parameterizations.

Watch Out

Probability flow ODE is not flow matching

Both produce deterministic ODEs that transport noise to data, and both are called "ODE samplers" in casual usage. They differ in what is held fixed:

  • Probability flow ODE is derived from a forward stochastic process; the velocity field equals $\mathbf{f} - \tfrac{1}{2} g^2 \nabla \log p_t$ and is determined once the SDE is chosen.
  • Flow matching starts from a chosen probability path and trains a velocity field directly via regression. There is no underlying SDE.

Diffusion's probability flow ODE is a special case of flow matching, but flow matching is the more general construction. Rectified flow (Liu et al. 2023) is one specific flow-matching path that is not a diffusion probability flow ODE.

Watch Out

More steps is not always better

DDPM originally used $T = 1000$ steps because the discretization error of ancestral sampling on the reverse VP-SDE is large at coarse step counts. Modern higher-order solvers on the probability flow ODE (DDIM, DPM-Solver++, EDM Heun) close most of the quality gap at 10-25 steps. Step count is a sampler choice, not a model property; a single trained model can be sampled at any step count from 1 (with distillation) to 1000.

Watch Out

Guidance scale is not a quality dial

Increasing the classifier-free guidance scale $w$ does not monotonically improve quality. Past a sweet spot (typically $w \in [5, 12]$ for image models) the samples saturate, lose color diversity, and pick up high-frequency artifacts. The mechanism is that the implicit-classifier extrapolation pushes the score into low-density regions where it was never trained. CFG-rescaled and dynamic thresholding are practical workarounds, but high $w$ trading sample diversity for prompt fidelity is a structural feature of the method, not a bug to be tuned away.

Watch Out

Latent diffusion is not a separate generative model family

Stable Diffusion is a latent diffusion model: an autoencoder plus a standard diffusion process on the latents. The diffusion math is unchanged; only the operating space differs. Failure modes of latent diffusion (small text, faces) are autoencoder failures, not diffusion failures.

Summary

  • Forward process adds Gaussian noise on a fixed schedule; the closed-form marginal $q(\mathbf{x}_t \mid \mathbf{x}_0)$ makes training trivially parallel.
  • Reverse process is a learned Markov kernel (DDPM) or, equivalently, a learned score $\nabla \log p_t$ plugged into the reverse-time SDE (score-SDE). The two views train the same network with the same loss.
  • DDPM ELBO decomposes into per-step KL terms that reduce to a weighted noise-prediction MSE; the simplified $\mathcal{L}_{\text{simple}}$ drops the weights and trains best in practice.
  • Reverse-time SDE (Anderson 1982) is the foundational result: for any forward SDE, the reverse process equals the forward dynamics minus $g(t)^2 \nabla \log p_t$ plus a backward Wiener increment.
  • Probability flow ODE removes the stochasticity while preserving marginal densities. It enables deterministic sampling (DDIM), exact likelihoods, and invertible noise-to-sample bijections.
  • Classifier-free guidance sharpens conditional samples by extrapolating in the direction of the implicit classifier $\boldsymbol{\epsilon}(c) - \boldsymbol{\epsilon}(\varnothing)$. It is the mechanism behind every modern text-to-image system.
  • Latent diffusion moves the diffusion process into a pretrained autoencoder's latent space, cutting per-step compute by roughly $64\times$ and making consumer-GPU image synthesis practical.
  • Flow matching generalizes the probability flow ODE by training a velocity field directly against any chosen probability path; rectified flow is the linear-path special case used by SD3 and Flux.

Exercises

ExerciseCore

Problem

Given $\bar{\alpha}_t = 0.01$ (i.e., 99% of the variance is noise), write $\mathbf{x}_t$ in terms of $\mathbf{x}_0$ and $\boldsymbol{\epsilon}$. What fraction of the original signal amplitude remains, and what is the signal-to-noise ratio?

ExerciseCore

Problem

Show that the DDPM noise-prediction objective and the score model are linked by $\mathbf{s}_\theta(\mathbf{x}_t, t) = -\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) / \sqrt{1 - \bar{\alpha}_t}$. What does this imply about training a single network and using it both as a DDPM denoiser and as a score model in the reverse-time SDE?

ExerciseAdvanced

Problem

Derive the probability flow ODE from the forward SDE $d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,dt + g(t)\,d\mathbf{w}$ by writing the Fokker-Planck equation for $p_t$ and re-expressing the diffusion term as a drift correction.

ExerciseAdvanced

Problem

A practitioner trains a diffusion model with $\mathcal{L}_{\text{simple}}$ and reports excellent FID but poor NLL relative to a normalizing-flow baseline. Is this a contradiction? Explain how $\mathcal{L}_{\text{simple}}$ relates to a tight likelihood bound and what to change if NLL is the priority.

ExerciseCore

Problem

An interviewer says: "Both GANs and diffusion start from noise and end at an image. Why are they different?" Give a mechanism-level answer.

ExerciseResearch

Problem

Latent diffusion processes images through a fixed VAE bottleneck. Suppose you want to fine-tune Stable Diffusion 1.5 (4-channel latent, $f = 8$ downsampling) to render legible small text. The diffusion training loss decreases but generated text remains garbled. Where is the bottleneck, and what would you change?


Frequently Asked Questions

How is a diffusion model different from a VAE?
A VAE has an encoder that learns a compressed latent and a decoder that maps back. A diffusion model has no learned encoder; the forward process is a fixed Gaussian noise schedule, and only the reverse denoiser is trained. Latent diffusion combines both: a VAE provides the latent space, and a diffusion model operates inside it.
Why does DDIM sample faster than DDPM?
DDPM is the stochastic reverse SDE discretized at every training timestep, typically 1000 steps. DDIM uses the deterministic probability-flow ODE with the same trained score, so the sampler is non-Markovian and can take much larger steps. With 50 steps DDIM produces samples comparable to 1000-step DDPM from the same model.
What is classifier-free guidance?
A way to steer samples toward a condition without a separate classifier. Train one model on both conditional and unconditional targets (with the condition $c$ randomly dropped to a null token). At sampling time, combine the conditional and unconditional predictions: $\tilde{\boldsymbol{\epsilon}} = (1 + w)\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c) - w\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$. Higher $w$ trades sample diversity for prompt fidelity.
Why use latent diffusion (Stable Diffusion) instead of pixel diffusion?
Pixel-space diffusion at $1024 \times 1024$ is computationally heavy, and the diffusion model wastes capacity on perceptual details a learned VAE can recover. Latent diffusion runs the U-Net on an 8x downsampled VAE latent, cuts compute roughly 64x, and concentrates the diffusion model's modeling power on the semantic structure of the image. This is the design choice that made high-resolution image generation tractable.
Are diffusion and flow matching the same thing?
Mathematically very close. The probability-flow ODE of a diffusion process IS a flow. Flow matching (Lipman et al. 2023) directly learns the velocity field rather than the score; the training objective has lower variance and the resulting flow is straighter, enabling fewer-step sampling. Rectified flow takes this further by iteratively rectifying the flow toward straight-line trajectories.

References

Foundations:

  • Hyvärinen, "Estimation of Non-Normalized Statistical Models by Score Matching", JMLR 6:695-709 (2005). The original score-matching objective with the integration-by-parts trick that avoids the normalizing constant.
  • Vincent, "A Connection Between Score Matching and Denoising Autoencoders", Neural Computation 23(7):1661-1674 (2011). Establishes the denoising-score-matching equivalence that licenses learning $\nabla \log p$ by predicting noise.
  • Anderson, "Reverse-time diffusion equation models", Stochastic Processes and their Applications 12(3):313-326 (1982). The reverse-time SDE result that powers all score-based sampling.
  • Sohl-Dickstein, Weiss, Maheswaranathan, Ganguli, "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" (ICML 2015), arXiv:1503.03585. The original diffusion-model paper, five years before DDPM scaled it.

Canonical (2019-2021):

  • Song & Ermon, "Generative Modeling by Estimating Gradients of the Data Distribution" (NCSN, NeurIPS 2019), arXiv:1907.05600. The score-based-generation paper rediscovering and scaling Vincent (2011).
  • Ho, Jain, Abbeel, "Denoising Diffusion Probabilistic Models" (NeurIPS 2020), arXiv:2006.11239. The paper that made diffusion competitive with GANs on images. Section 3 derives $\mathcal{L}_{\text{simple}}$.
  • Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021), arXiv:2011.13456. Unifies DDPM and NCSN under the continuous-time SDE framework; introduces the probability flow ODE.
  • Song, Meng, Ermon, "Denoising Diffusion Implicit Models" (DDIM, ICLR 2021), arXiv:2010.02502. Deterministic sampling with 10-50 steps via the probability flow ODE perspective.
  • Nichol & Dhariwal, "Improved Denoising Diffusion Probabilistic Models" (ICML 2021), arXiv:2102.09672. Cosine schedule, learned variances, log-likelihood improvements.

Conditioning and guidance:

  • Dhariwal & Nichol, "Diffusion Models Beat GANs on Image Synthesis" (NeurIPS 2021), arXiv:2105.05233. Classifier guidance with a separately trained noise-conditioned classifier; established the FID parity claim.
  • Ho & Salimans, "Classifier-Free Diffusion Guidance" (NeurIPS Workshop 2021, full version 2022), arXiv:2207.12598. The single-network conditioning trick behind every production text-to-image system.

Practical training and sampling (2022-2023):

  • Karras, Aittala, Aila, Laine, "Elucidating the Design Space of Diffusion-Based Generative Models" (EDM, NeurIPS 2022), arXiv:2206.00364. The reference parameterization analysis with a second-order Heun sampler; reframes everything in terms of $\sigma$.
  • Lu, Zhou, Bao, Chen, Li, Zhu, "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling" (NeurIPS 2022), arXiv:2206.00927; and the DPM-Solver++ followup. High-order ODE solvers tailored to the diffusion ODE.
  • Salimans & Ho, "Progressive Distillation for Fast Sampling of Diffusion Models" (ICLR 2022), arXiv:2202.00512. Distill an $N$-step teacher into an $\lfloor N/2 \rfloor$-step student, iterated down to 1-4 steps.
  • Song, Dhariwal, Chen, Sutskever, "Consistency Models" (ICML 2023), arXiv:2303.01469. Single-step generation by training a network to map any point on the ODE trajectory to its origin.
  • Lipman, Chen, Ben-Hamu, Nickel, Le, "Flow Matching for Generative Modeling" (ICLR 2023), arXiv:2210.02747. Continuous normalizing flows recast as regression against velocity fields; foundation for SD3 and Flux.
  • Liu, Gong, Liu, "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow" (ICLR 2023), arXiv:2209.03003. The straight-line conditional path that yields the simplest flow-matching objective.

Latent and frontier image systems:

  • Rombach, Blattmann, Lorenz, Esser, Ommer, "High-Resolution Image Synthesis with Latent Diffusion Models" (LDM/Stable Diffusion, CVPR 2022), arXiv:2112.10752. The autoencoder-plus-diffusion architecture behind SD 1.x/2.x/SDXL.
  • Esser, Kulal, Blattmann et al., "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis" (Stable Diffusion 3, ICML 2024), arXiv:2403.03206. Flow-matching MMDiT architecture; widely deployed open-weights system as of 2024.
  • Saharia et al., "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding" (Imagen, NeurIPS 2022), arXiv:2205.11487. Cascaded pixel-space diffusion with T5-XXL text conditioning.

Failure modes and corrections:

  • Ning, Li, Su, Salah, Ertugrul, "Elucidating the Exposure Bias in Diffusion Models" (ICLR 2024), arXiv:2308.15321. The training-vs-sampling input-distribution mismatch; introduces epsilon-scaling.
  • Lin, Liu, Liu, "Common Diffusion Noise Schedules and Sample Steps are Flawed" (WACV 2024), arXiv:2305.08891. Zero-SNR schedule fix and CFG rescale; explains the saturated-color failure mode of high-CFG sampling.

Selected applications:

  • Chi, Feng, Du et al., "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion" (RSS 2023), arXiv:2303.04137. Diffusion as a robotics action prior.
  • Xu, Yu, Song, Shi, Ermon, Tang, "GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation" (ICLR 2022), arXiv:2203.02923. Diffusion on molecular geometry.
  • Watson, Juergens, Bennett et al., "De novo design of protein structure and function with RFdiffusion" (Nature 2023). RoseTTAFold + diffusion for protein backbone design.

Next Topics

The natural next steps from diffusion models:

  • Flow matching: the generalization that drives Stable Diffusion 3, Flux, and most recent video models.
  • Score matching: the underlying objective with the full Hyvärinen-Vincent derivation and its extensions.
  • Video world models: how Sora, Veo, and Kling extend image diffusion to spatiotemporal data with simulation-grade physical fidelity claims.
  • Gaussian splatting: the dominant 3D representation that diffusion models increasingly target.

Last reviewed: April 19, 2026
