
Paper breakdown

Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, and Pieter Abbeel · 2020 · NeurIPS 2020

Trains a single network to predict the Gaussian noise added at a random timestep, with a reweighted variational bound that reduces to a simple mean-squared-error loss. Establishes the noise-prediction parameterisation and U-Net backbone underlying every modern image and video diffusion system.

Overview

Ho, Jain, and Abbeel (2020) took diffusion probabilistic models — first written down by Sohl-Dickstein et al. (2015) — and showed how to train them at the scale and quality needed to compete with GANs on image generation. Two changes did most of the work. They reparameterised the reverse process to predict the noise $\epsilon$ that was added during the forward process, rather than the mean of the reverse Gaussian. They then dropped the variational-bound weighting in favour of a uniformly weighted mean-squared-error loss, $\mathcal{L}_{\text{simple}}$, which they argued empirically gives better samples than the formally correct ELBO.

The forward process is fixed: a Markov chain that adds Gaussian noise to data over $T = 1000$ steps until the final state is approximately $\mathcal{N}(0, I)$. The reverse process is a Markov chain with the same form, parameterised by a U-Net that takes a noisy image $x_t$ and the timestep $t$ and outputs the predicted noise $\epsilon_\theta(x_t, t)$. Training samples a random $t$, a random data point $x_0$, and a fresh noise vector $\epsilon$, constructs $x_t$ in closed form, and minimises $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$.

The result on CIFAR-10 was an Inception score of 9.46 and an FID of 3.17, beating every prior likelihood-based model and matching contemporary GANs without the training instabilities. The paper is now the canonical reference for image diffusion: later systems such as Stable Diffusion, Imagen, and DALL-E 2 build directly on the noise-prediction parameterisation and U-Net backbone introduced here, and more recent video systems such as Sora keep the diffusion framework while replacing the backbone. Song et al. (2021) later showed the DDPM training objective is a discretisation of denoising score matching, unifying this paper with the score-based generative models line.

Mathematical Contributions

The forward process

Pick a variance schedule $\beta_1, \beta_2, \ldots, \beta_T \in (0, 1)$. The forward chain is

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t I\big).$$

A useful closed form (Equation 4 in the paper) lets you sample $x_t$ from $x_0$ directly without iterating the chain. With $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$,

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\, (1 - \bar{\alpha}_t) I\big).$$

So $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The schedule is chosen so that $\bar{\alpha}_T \approx 0$ and $x_T$ is approximately a standard Gaussian, regardless of $x_0$. The original paper uses $T = 1000$ and a linear schedule from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$.
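
To make the closed form concrete, here is a minimal PyTorch sketch of the linear schedule and of sampling $x_t$ directly from $x_0$. The helper names `make_linear_schedule` and `q_sample` are ours, not from the paper's released code.

```python
import torch

def make_linear_schedule(T: int = 1000, beta_1: float = 1e-4, beta_T: float = 0.02):
    """Linear beta schedule and the cumulative products used throughout DDPM."""
    betas = torch.linspace(beta_1, beta_T, T)      # beta_1 ... beta_T
    alphas = 1.0 - betas                           # alpha_t = 1 - beta_t
    alpha_bars = torch.cumprod(alphas, dim=0)      # bar(alpha)_t = prod_{s<=t} alpha_s
    return betas, alphas, alpha_bars

def q_sample(x0: torch.Tensor, t: torch.Tensor, alpha_bars: torch.Tensor, noise: torch.Tensor):
    """Sample x_t ~ q(x_t | x_0) in closed form: sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    abar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over image dims
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise

betas, alphas, alpha_bars = make_linear_schedule()
x0 = torch.randn(8, 3, 32, 32)                     # stand-in for a batch of images
t = torch.randint(0, 1000, (8,))                   # random timesteps (0-indexed here)
eps = torch.randn_like(x0)
xt = q_sample(x0, t, alpha_bars, eps)
print(alpha_bars[-1])                              # ~4e-5: x_T is very nearly pure noise
```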

The reverse process

The reverse Markov chain $p_\theta(x_{t-1} \mid x_t)$ is also Gaussian:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\big).$$

The paper fixes the variance to $\sigma_t^2 I$ where $\sigma_t^2 = \beta_t$ (or the slightly smaller $\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$) and learns only the mean. Both choices give comparable sample quality.
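
Both variance choices can be precomputed once from the schedule; a short continuation of the sketch above:

```python
# Two fixed choices for the reverse-process variance sigma_t^2 (DDPM does not learn it):
sigma2_large = betas                                               # sigma_t^2 = beta_t
alpha_bars_prev = torch.cat([torch.ones(1), alpha_bars[:-1]])      # bar(alpha)_{t-1}, with bar(alpha)_0 = 1
sigma2_small = (1.0 - alpha_bars_prev) / (1.0 - alpha_bars) * betas  # tilde(beta)_t
```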

The variational bound

The training objective is the standard variational bound on the negative log-likelihood $-\log p_\theta(x_0)$, written as a sum of KL terms (Equation 5):

$$\mathcal{L} = \mathbb{E}_q\Big[ D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big) + \sum_{t > 1} D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big) - \log p_\theta(x_0 \mid x_1) \Big].$$

The first term has no learnable parameters. The middle terms are KL divergences between Gaussians with fixed variances, which reduce, up to constants, to a scaled squared distance between their means. The forward posterior $q(x_{t-1} \mid x_t, x_0)$ is itself a known Gaussian with mean

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t.$$
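
Stated in full (Equation 7 in the paper), the posterior combines this mean with the variance $\tilde{\beta}_t$ quoted earlier:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\, \tilde{\mu}_t(x_t, x_0),\, \tilde{\beta}_t I\big), \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t.$$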

Noise-prediction parameterisation

Substituting $x_0 = (x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon) / \sqrt{\bar{\alpha}_t}$ and rearranging gives (Equation 11):

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right).$$

The network outputs $\epsilon_\theta(x_t, t)$ — the noise — rather than $\mu_\theta$ directly. Because $\mu_\theta$ is an affine function of $\epsilon_\theta$, predicting either is equivalent in principle, but the noise parameterisation has uniform target scale across $t$, which makes optimisation more stable.
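
A sketch of that substitution in code, continuing with the assumed schedule tensors and `q_sample` output from above. With the true noise $\epsilon$ in place of a network prediction, the two means coincide, which is a quick sanity check of Equation 11.

```python
def mu_from_eps(xt, t, eps_pred, betas, alphas, alpha_bars):
    """Equation 11: reverse-process mean expressed through the predicted noise (image batches)."""
    beta = betas[t].view(-1, 1, 1, 1)
    alpha = alphas[t].view(-1, 1, 1, 1)
    abar = alpha_bars[t].view(-1, 1, 1, 1)
    return (xt - beta / (1.0 - abar).sqrt() * eps_pred) / alpha.sqrt()

def mu_posterior(xt, x0, t, betas, alphas, alpha_bars):
    """The forward-posterior mean tilde(mu)_t(x_t, x_0)."""
    beta = betas[t].view(-1, 1, 1, 1)
    alpha = alphas[t].view(-1, 1, 1, 1)
    abar = alpha_bars[t].view(-1, 1, 1, 1)
    abar_prev = abar / alpha                       # bar(alpha)_{t-1} = bar(alpha)_t / alpha_t
    return (abar_prev.sqrt() * beta * x0 + alpha.sqrt() * (1.0 - abar_prev) * xt) / (1.0 - abar)

# With the *true* noise, the two parameterisations agree up to floating-point error.
assert torch.allclose(mu_from_eps(xt, t, eps, betas, alphas, alpha_bars),
                      mu_posterior(xt, x0, t, betas, alphas, alpha_bars), atol=1e-3)
```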

The simple loss

After some algebra, the per-timestep KL term becomes proportional to a squared error in $\epsilon$ space (Equation 12):

$$L_{t-1} = \mathbb{E}_{x_0, \epsilon}\left[ \frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)}\, \|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\, t)\|^2 \right].$$

The paper observes that dropping the $t$-dependent weighting improves sample quality and uses (Equation 14):

$$\mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_{t, x_0, \epsilon}\Big[ \|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\, t)\|^2 \Big]$$

with $t$ sampled uniformly from $\{1, \ldots, T\}$. This is no longer the variational bound; it down-weights small $t$ (where the original ELBO weight is large) in favour of large $t$ (where most of the visual structure is decided). Trading a tighter likelihood for better samples is an explicit choice the paper makes and defends empirically.
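
A minimal training-step sketch of $\mathcal{L}_{\text{simple}}$, assuming a time-conditioned noise-prediction network `model(x_t, t)` (any module with that call signature) and the `q_sample` helper from the sketch above:

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, x0, alpha_bars, T=1000):
    """One gradient step on L_simple: uniform t, closed-form x_t, plain MSE on the noise."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # t ~ Uniform{1,...,T} (0-indexed)
    eps = torch.randn_like(x0)                                 # fresh Gaussian noise
    xt = q_sample(x0, t, alpha_bars, eps)                      # x_t in closed form
    loss = F.mse_loss(model(xt, t), eps)                       # ||eps - eps_theta(x_t, t)||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```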

The training and sampling algorithms

| Step | Training | Sampling |
|---|---|---|
| 1 | Sample $x_0 \sim q$, $t \sim \mathrm{Unif}\{1, \ldots, T\}$, $\epsilon \sim \mathcal{N}(0, I)$ | $x_T \sim \mathcal{N}(0, I)$ |
| 2 | Form $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ | For $t = T, T{-}1, \ldots, 1$: |
| 3 | Take a gradient step on $\lVert\epsilon - \epsilon_\theta(x_t, t)\rVert^2$ | $z \sim \mathcal{N}(0, I)$ if $t > 1$ else $z = 0$ |
| 4 |  | $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\big(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\big) + \sigma_t z$ |

Training is one forward and one backward pass per gradient step, comparable in cost to training a vanilla autoencoder. Sampling is $T$ sequential network evaluations, which is the price the model pays for not being a one-shot generator like a GAN or VAE.
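
A matching sketch of ancestral sampling (Algorithm 2 in the paper), again assuming a trained noise-prediction `model` and the precomputed schedule tensors from above, with $\sigma_t^2 = \beta_t$:

```python
@torch.no_grad()
def sample(model, shape, betas, alphas, alpha_bars):
    """Ancestral DDPM sampling: T sequential denoising steps starting from pure Gaussian noise."""
    T = betas.shape[0]
    x = torch.randn(shape)                              # x_T ~ N(0, I)
    for t in reversed(range(T)):                        # t = T, T-1, ..., 1 (0-indexed)
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_pred = model(x, t_batch)
        mean = (x - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps_pred) / alphas[t].sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * z                  # sigma_t = sqrt(beta_t)
    return x                                            # approximate sample from p_theta(x_0)
```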

Connection to score matching

A separate, equivalent view (made explicit in Song & Ermon, 2019, and unified with DDPM in Song et al., 2021): the noise predictor $\epsilon_\theta(x_t, t)$ is, up to a known scale, the score $\nabla_{x_t} \log q(x_t)$. The DDPM update is then a discretisation of an annealed Langevin sampler driven by that score. This justifies the noise-prediction objective as denoising score matching, and it is the lens used by the later formulations (DDIM, score SDEs, EDM).
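
In code, the conversion from noise prediction to score estimate is a single rescaling; a small sketch under the same assumptions as above:

```python
def score_estimate(model, xt, t, alpha_bars):
    """Estimated score of the noisy marginal: score = -eps_theta(x_t, t) / sqrt(1 - abar_t)."""
    abar = alpha_bars[t].view(-1, 1, 1, 1)
    return -model(xt, t) / (1.0 - abar).sqrt()
```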

What It Gets Right

The first thing is the parameterisation. Predicting noise rather than the reverse mean keeps the regression target on a unit scale at every $t$, which is what makes a single shared network able to denoise across $T = 1000$ noise levels. Earlier diffusion-style attempts that predicted $x_0$ or $\mu_\theta$ directly never reached competitive image quality; this small change is most of the gap.

The second is the loss simplification. Dropping the formal ELBO weights is mathematically a step backward — the resulting objective is a reweighted bound, not the bound — but it improved FID by a wide margin. The paper reports the result honestly and does not pretend the simple loss is theoretically motivated; later work (Karras et al. 2022, EDM) recovered the choice by deriving the optimal weighting in a different parameterisation.

The third is implementation discipline. The U-Net backbone with sinusoidal time embedding and the linear noise schedule were straightforward enough that the paper's CIFAR-10 numbers were independently reproduced within months, and the code was robust enough to scale to 256×256 images and far beyond when later groups did the engineering.

Common Misconceptions

DDPM is not a one-step generator. Sampling needs $T$ sequential network calls, ~1000 in the original recipe. DDIM (Song et al., 2021), DPM-Solver (Lu et al., 2022), and consistency models (Song et al., 2023) cut this to 1–50 steps with various tradeoffs, but the original DDPM is iterative.

DDPM does not maximise likelihood. The simple loss is a reweighting of the variational bound and gives worse log-likelihoods than the ELBO weighting; the paper trades likelihood for sample quality. Anyone reporting bits-per-dimension for a DDPM-style model needs to be careful about which loss was used at training time.

The noise predictor and the score are two views of the same object, not independent ideas: $\epsilon_\theta(x_t, t) \approx -\sqrt{1 - \bar{\alpha}_t}\, \nabla_{x_t} \log q(x_t)$. The 2020 DDPM paper does not state this connection — Song et al. (2021) made it explicit — and a reader who picks up only one of the two papers can come away thinking they are different methods.

Connections to TheoremPath Topics

  • Diffusion models — modern treatment covering DDPM, score SDEs, probability-flow ODEs, and flow matching.
  • Score matching — the identity that makes the noise predictor equivalent to a score estimator.
  • Variational autoencoders — the variational lower bound that DDPM derives from.
  • Markov chains and steady state — the forward process is a fixed Markov chain converging to a Gaussian stationary distribution.
  • Langevin dynamics — the continuous-time sampler whose discretisation underlies score-based and diffusion samplers.
  • KL divergence — every term in the DDPM ELBO is a KL between two Gaussians.

Further Reading

  • Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." ICML. arXiv:1503.03585. The original diffusion-probabilistic-model formulation that DDPM scaled up.
  • Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). "Score-Based Generative Modeling through Stochastic Differential Equations." ICLR. arXiv:2011.13456. Continuous-time SDE view; shows DDPM is a discretisation of denoising score matching and unifies it with NCSN.
  • Song, J., Meng, C., & Ermon, S. (2021). "Denoising Diffusion Implicit Models." ICLR. arXiv:2010.02502. DDIM. Same trained model, deterministic non-Markovian sampler, 10–50 steps instead of 1000.
  • Nichol, A. Q., & Dhariwal, P. (2021). "Improved Denoising Diffusion Probabilistic Models." ICML. arXiv:2102.09672. Cosine schedule, learned variance, hybrid loss; the practical recipe that fed Imagen and DALL-E 2.
  • Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). "Elucidating the Design Space of Diffusion-Based Generative Models." NeurIPS. arXiv:2206.00364. EDM. Re-derives the parameterisation, weighting, and noise schedule from first principles; current best practice for unconditional image diffusion.

References

Canonical:

  • Ho, J., Jain, A., & Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models." NeurIPS. arXiv:2006.11239.

Direct precursors:

  • Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." ICML. arXiv:1503.03585.
  • Song, Y., & Ermon, S. (2019). "Generative Modeling by Estimating Gradients of the Data Distribution." NeurIPS. arXiv:1907.05600. NCSN; denoising score matching at multiple noise scales.

Unification and theory:

  • Song, Y. et al. (2021). "Score-Based Generative Modeling through Stochastic Differential Equations." ICLR. arXiv:2011.13456.
  • Vincent, P. (2011). "A Connection Between Score Matching and Denoising Autoencoders." Neural Computation 23(7). The denoising-score-matching identity used implicitly throughout DDPM.

Practical follow-ons:

  • Song, J., Meng, C., & Ermon, S. (2021). "Denoising Diffusion Implicit Models." ICLR. arXiv:2010.02502.
  • Nichol, A. Q., & Dhariwal, P. (2021). "Improved Denoising Diffusion Probabilistic Models." ICML. arXiv:2102.09672.
  • Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR. arXiv:2112.10752. Stable Diffusion.
  • Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). "Elucidating the Design Space of Diffusion-Based Generative Models." NeurIPS. arXiv:2206.00364.


Last reviewed: May 7, 2026