
Paper breakdown

Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, and Pieter Abbeel · 2020 · NeurIPS 2020

Trains a single network to predict the Gaussian noise added at a random timestep, with a reweighted variational bound that reduces to a simple mean-squared-error loss. Establishes the noise-prediction parameterisation and U-Net backbone underlying every modern image and video diffusion system.

Overview

Ho, Jain, and Abbeel (2020) took diffusion probabilistic models — first written down by Sohl-Dickstein et al. (2015) — and showed how to train them at the scale and quality needed to compete with GANs on image generation. Two changes did most of the work. They reparameterised the reverse process to predict the noise $\epsilon$ that was added during the forward process, rather than the mean of the reverse Gaussian. They then dropped the variational-bound weighting in favour of a uniformly weighted mean-squared-error loss, $\mathcal{L}_{\text{simple}}$, which they argued empirically gives better samples than the formally correct ELBO.

The forward process is fixed: a Markov chain that adds Gaussian noise to data over $T = 1000$ steps until the final state is approximately $\mathcal{N}(0, I)$. The reverse process is a Markov chain with the same form, parameterised by a U-Net that takes a noisy image $x_t$ and the timestep $t$ and outputs the predicted noise $\epsilon_\theta(x_t, t)$. Training samples a random $t$, a random data point $x_0$, and a fresh noise vector $\epsilon$, constructs $x_t$ in closed form, and minimises $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$.

The result on CIFAR-10 was an Inception score of 9.46 and an FID of 3.17, beating every prior likelihood-based model and matching contemporary GANs without the training instabilities. The paper is now the canonical reference for image diffusion: later systems such as Stable Diffusion, Imagen, and DALL-E 2 build directly on the noise-prediction parameterisation and U-Net backbone introduced here, and more recent video systems such as Sora keep the diffusion framework while replacing the backbone. Song et al. (2021) later showed the DDPM training objective is a discretisation of denoising score matching, unifying this paper with the score-based generative models line.

Mathematical Contributions

The forward process

Pick a variance schedule $\beta_1, \beta_2, \ldots, \beta_T \in (0, 1)$. The forward chain is

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t I\big).$$

A useful closed form (Equation 4 in the paper) lets you sample $x_t$ from $x_0$ directly without iterating the chain. With $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$,

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\, (1 - \bar{\alpha}_t) I\big).$$

So $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The schedule is chosen so that $\bar{\alpha}_T \approx 0$ and $x_T$ is approximately a standard Gaussian, regardless of $x_0$. The original paper uses $T = 1000$ and a linear schedule from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$.
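
To make the closed form concrete, here is a minimal PyTorch sketch of the linear schedule and of sampling $x_t$ directly from $x_0$. The helper names `make_linear_schedule` and `q_sample` are ours, not from the paper's released code.

```python
import torch

def make_linear_schedule(T: int = 1000, beta_1: float = 1e-4, beta_T: float = 0.02):
    """Linear beta schedule and the cumulative products used throughout DDPM."""
    betas = torch.linspace(beta_1, beta_T, T)      # beta_1 ... beta_T
    alphas = 1.0 - betas                           # alpha_t = 1 - beta_t
    alpha_bars = torch.cumprod(alphas, dim=0)      # bar(alpha)_t = prod_{s<=t} alpha_s
    return betas, alphas, alpha_bars

def q_sample(x0: torch.Tensor, t: torch.Tensor, alpha_bars: torch.Tensor, noise: torch.Tensor):
    """Sample x_t ~ q(x_t | x_0) in closed form: sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    abar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over image dims
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise

betas, alphas, alpha_bars = make_linear_schedule()
x0 = torch.randn(8, 3, 32, 32)                     # stand-in for a batch of images
t = torch.randint(0, 1000, (8,))                   # random timesteps (0-indexed here)
eps = torch.randn_like(x0)
xt = q_sample(x0, t, alpha_bars, eps)
print(alpha_bars[-1])                              # ~4e-5: x_T is very nearly pure noise
```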

The reverse process

The reverse Markov chain $p_\theta(x_{t-1} \mid x_t)$ is also Gaussian:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\big).$$

The paper fixes the variance to $\sigma_t^2 I$ where $\sigma_t^2 = \beta_t$ (or the slightly smaller $\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$) and learns only the mean. Both choices give comparable sample quality.
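
Both variance choices can be precomputed once from the schedule; a short continuation of the sketch above:

```python
# Two fixed choices for the reverse-process variance sigma_t^2 (DDPM does not learn it):
sigma2_large = betas                                               # sigma_t^2 = beta_t
alpha_bars_prev = torch.cat([torch.ones(1), alpha_bars[:-1]])      # bar(alpha)_{t-1}, with bar(alpha)_0 = 1
sigma2_small = (1.0 - alpha_bars_prev) / (1.0 - alpha_bars) * betas  # tilde(beta)_t
```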

The variational bound

The training objective is the standard variational bound on the negative log-likelihood $-\log p_\theta(x_0)$, written as a sum of KL terms (Equation 5):

$$\mathcal{L} = \mathbb{E}_q\Big[ D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big) + \sum_{t > 1} D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big) - \log p_\theta(x_0 \mid x_1) \Big].$$

The first term has no learnable parameters. The middle terms are KL divergences between Gaussians with fixed variances, which reduce, up to constants, to a scaled squared distance between their means. The forward posterior $q(x_{t-1} \mid x_t, x_0)$ is itself a known Gaussian with mean

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t.$$
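
Stated in full (Equation 7 in the paper), the posterior combines this mean with the variance $\tilde{\beta}_t$ quoted earlier:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\, \tilde{\mu}_t(x_t, x_0),\, \tilde{\beta}_t I\big), \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t.$$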

Noise-prediction parameterisation

Substituting $x_0 = (x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon) / \sqrt{\bar{\alpha}_t}$ and rearranging gives (Equation 11):

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right).$$

The network outputs $\epsilon_\theta(x_t, t)$ — the noise — rather than $\mu_\theta$ directly. Because $\mu_\theta$ is an affine function of $\epsilon_\theta$, predicting either is equivalent in principle, but the noise parameterisation has uniform target scale across $t$, which makes optimisation more stable.
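
A sketch of that substitution in code, continuing with the assumed schedule tensors and `q_sample` output from above. With the true noise $\epsilon$ in place of a network prediction, the two means coincide, which is a quick sanity check of Equation 11.

```python
def mu_from_eps(xt, t, eps_pred, betas, alphas, alpha_bars):
    """Equation 11: reverse-process mean expressed through the predicted noise (image batches)."""
    beta = betas[t].view(-1, 1, 1, 1)
    alpha = alphas[t].view(-1, 1, 1, 1)
    abar = alpha_bars[t].view(-1, 1, 1, 1)
    return (xt - beta / (1.0 - abar).sqrt() * eps_pred) / alpha.sqrt()

def mu_posterior(xt, x0, t, betas, alphas, alpha_bars):
    """The forward-posterior mean tilde(mu)_t(x_t, x_0)."""
    beta = betas[t].view(-1, 1, 1, 1)
    alpha = alphas[t].view(-1, 1, 1, 1)
    abar = alpha_bars[t].view(-1, 1, 1, 1)
    abar_prev = abar / alpha                       # bar(alpha)_{t-1} = bar(alpha)_t / alpha_t
    return (abar_prev.sqrt() * beta * x0 + alpha.sqrt() * (1.0 - abar_prev) * xt) / (1.0 - abar)

# With the *true* noise, the two parameterisations agree up to floating-point error.
assert torch.allclose(mu_from_eps(xt, t, eps, betas, alphas, alpha_bars),
                      mu_posterior(xt, x0, t, betas, alphas, alpha_bars), atol=1e-3)
```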

The simple loss

After some algebra, the per-timestep KL term becomes proportional to a squared error in $\epsilon$ space (Equation 12):

$$L_{t-1} = \mathbb{E}_{x_0, \epsilon}\left[ \frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)}\, \|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\, t)\|^2 \right].$$

The paper observes that dropping the $t$-dependent weighting improves sample quality and uses (Equation 14):

$$\mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_{t, x_0, \epsilon}\Big[ \|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\, t)\|^2 \Big]$$

with $t$ sampled uniformly from $\{1, \ldots, T\}$. This is no longer the variational bound; it down-weights small $t$ (where the original ELBO weight is large) in favour of large $t$ (where most of the visual structure is decided). Trading a tighter likelihood for better samples is an explicit choice the paper makes and defends empirically.
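
A minimal training-step sketch of $\mathcal{L}_{\text{simple}}$, assuming a time-conditioned noise-prediction network `model(x_t, t)` (any module with that call signature) and the `q_sample` helper from the sketch above:

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, x0, alpha_bars, T=1000):
    """One gradient step on L_simple: uniform t, closed-form x_t, plain MSE on the noise."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # t ~ Uniform{1,...,T} (0-indexed)
    eps = torch.randn_like(x0)                                 # fresh Gaussian noise
    xt = q_sample(x0, t, alpha_bars, eps)                      # x_t in closed form
    loss = F.mse_loss(model(xt, t), eps)                       # ||eps - eps_theta(x_t, t)||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```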

The training and sampling algorithms

| Step | Training | Sampling |
|---|---|---|
| 1 | Sample $x_0 \sim q$, $t \sim \mathrm{Unif}\{1, \ldots, T\}$, $\epsilon \sim \mathcal{N}(0, I)$ | $x_T \sim \mathcal{N}(0, I)$ |
| 2 | Form $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ | For $t = T, T{-}1, \ldots, 1$: |
| 3 | Take a gradient step on $\lVert\epsilon - \epsilon_\theta(x_t, t)\rVert^2$ | $z \sim \mathcal{N}(0, I)$ if $t > 1$ else $z = 0$ |
| 4 |  | $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\big(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\big) + \sigma_t z$ |

Training is one forward and one backward pass per gradient step, comparable in cost to training a vanilla autoencoder. Sampling is $T$ sequential network evaluations, which is the price the model pays for not being a one-shot generator like a GAN or VAE.
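
A matching sketch of ancestral sampling (Algorithm 2 in the paper), again assuming a trained noise-prediction `model` and the precomputed schedule tensors from above, with $\sigma_t^2 = \beta_t$:

```python
@torch.no_grad()
def sample(model, shape, betas, alphas, alpha_bars):
    """Ancestral DDPM sampling: T sequential denoising steps starting from pure Gaussian noise."""
    T = betas.shape[0]
    x = torch.randn(shape)                              # x_T ~ N(0, I)
    for t in reversed(range(T)):                        # t = T, T-1, ..., 1 (0-indexed)
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_pred = model(x, t_batch)
        mean = (x - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps_pred) / alphas[t].sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * z                  # sigma_t = sqrt(beta_t)
    return x                                            # approximate sample from p_theta(x_0)
```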

Connection to score matching

A separate, equivalent view (made explicit in Song & Ermon, 2019, and unified with DDPM in Song et al., 2021): the noise predictor $\epsilon_\theta(x_t, t)$ is, up to a known scale, the score $\nabla_{x_t} \log q(x_t)$. The DDPM update is then a discretisation of an annealed Langevin sampler driven by that score. This justifies the noise-prediction objective as denoising score matching, and it is the lens used by the later formulations (DDIM, score SDEs, EDM).
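
In code, the conversion from noise prediction to score estimate is a single rescaling; a small sketch under the same assumptions as above:

```python
def score_estimate(model, xt, t, alpha_bars):
    """Estimated score of the noisy marginal: score = -eps_theta(x_t, t) / sqrt(1 - abar_t)."""
    abar = alpha_bars[t].view(-1, 1, 1, 1)
    return -model(xt, t) / (1.0 - abar).sqrt()
```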

What It Gets Right

The first thing is the parameterisation. Predicting noise rather than the reverse mean keeps the regression target on a unit scale at every $t$, which is what makes a single shared network able to denoise across $T = 1000$ noise levels. Earlier diffusion-style attempts that predicted $x_0$ or $\mu_\theta$ directly never reached competitive image quality; this small change is most of the gap.

The second is the loss simplification. Dropping the formal ELBO weights is mathematically a step backward — the resulting objective is a reweighted bound, not the bound — but it improved FID by a wide margin. The paper reports the result honestly and does not pretend the simple loss is theoretically motivated; later work (Karras et al. 2022, EDM) recovered the choice by deriving the optimal weighting in a different parameterisation.

The third is implementation discipline. The U-Net backbone with sinusoidal time embedding and the linear noise schedule were straightforward enough that the paper's CIFAR-10 numbers were independently reproduced within months, and the code was robust enough to scale to 256×256 images and far beyond when later groups did the engineering.

Common Misconceptions

DDPM is not a one-step generator. Sampling needs $T$ sequential network calls, ~1000 in the original recipe. DDIM (Song et al., 2021), DPM-Solver (Lu et al., 2022), and consistency models (Song et al., 2023) cut this to 1–50 steps with various tradeoffs, but the original DDPM is iterative.

DDPM does not maximise likelihood. The simple loss is a reweighting of the variational bound and gives worse log-likelihoods than the ELBO weighting; the paper trades likelihood for sample quality. Anyone reporting bits-per-dimension for a DDPM-style model needs to be careful about which loss was used at training time.

The noise predictor and the score are two views of the same object, not independent ideas: $\epsilon_\theta(x_t, t) \approx -\sqrt{1 - \bar{\alpha}_t}\, \nabla_{x_t} \log q(x_t)$. The 2020 DDPM paper does not state this connection — Song et al. (2021) made it explicit — and a reader who picks up only one of the two papers can come away thinking they are different methods.

Connections to TheoremPath Topics

  • Diffusion models — modern treatment covering DDPM, score SDEs, probability-flow ODEs, and flow matching.
  • Score matching — the identity that makes the noise predictor equivalent to a score estimator.
  • Variational autoencoders — the variational lower bound that DDPM derives from.
  • Markov chains and steady state — the forward process is a fixed Markov chain converging to a Gaussian stationary distribution.
  • Langevin dynamics — the continuous-time sampler whose discretisation underlies score-based and diffusion samplers.
  • KL divergence — every term in the DDPM ELBO is a KL between two Gaussians.

Further Reading

  • Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." ICML. arXiv:1503.03585. The original diffusion-probabilistic-model formulation that DDPM scaled up.
  • Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). "Score-Based Generative Modeling through Stochastic Differential Equations." ICLR. arXiv:2011.13456. Continuous-time SDE view; shows DDPM is a discretisation of denoising score matching and unifies it with NCSN.
  • Song, J., Meng, C., & Ermon, S. (2021). "Denoising Diffusion Implicit Models." ICLR. arXiv:2010.02502. DDIM. Same trained model, deterministic non-Markovian sampler, 10–50 steps instead of 1000.
  • Nichol, A. Q., & Dhariwal, P. (2021). "Improved Denoising Diffusion Probabilistic Models." ICML. arXiv:2102.09672. Cosine schedule, learned variance, hybrid loss; the practical recipe that fed Imagen and DALL-E 2.
  • Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). "Elucidating the Design Space of Diffusion-Based Generative Models." NeurIPS. arXiv:2206.00364. EDM. Re-derives the parameterisation, weighting, and noise schedule from first principles; current best practice for unconditional image diffusion.

References

Canonical:

  • Ho, J., Jain, A., & Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models." NeurIPS. arXiv:2006.11239.

Direct precursors:

  • Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." ICML. arXiv:1503.03585.
  • Song, Y., & Ermon, S. (2019). "Generative Modeling by Estimating Gradients of the Data Distribution." NeurIPS. arXiv:1907.05600. NCSN; denoising score matching at multiple noise scales.

Unification and theory:

  • Song, Y. et al. (2021). "Score-Based Generative Modeling through Stochastic Differential Equations." ICLR. arXiv:2011.13456.
  • Vincent, P. (2011). "A Connection Between Score Matching and Denoising Autoencoders." Neural Computation 23(7). The denoising-score-matching identity used implicitly throughout DDPM.

Practical follow-ons:

  • Song, J., Meng, C., & Ermon, S. (2021). "Denoising Diffusion Implicit Models." ICLR. arXiv:2010.02502.
  • Nichol, A. Q., & Dhariwal, P. (2021). "Improved Denoising Diffusion Probabilistic Models." ICML. arXiv:2102.09672.
  • Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR. arXiv:2112.10752. Stable Diffusion.
  • Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). "Elucidating the Design Space of Diffusion-Based Generative Models." NeurIPS. arXiv:2206.00364.


Last reviewed: May 7, 2026