Paper breakdown
Denoising Diffusion Probabilistic Models
Jonathan Ho, Ajay Jain, and Pieter Abbeel · 2020 · NeurIPS 2020
Trains a single network to predict the Gaussian noise added at a random timestep, with a reweighted variational bound that reduces to a simple mean-squared-error loss. Establishes the noise-prediction parameterisation and U-Net backbone underlying every modern image and video diffusion system.
Overview
Ho, Jain, and Abbeel (2020) took diffusion probabilistic models — first written down by Sohl-Dickstein et al. (2015) — and showed how to train them at the scale and quality needed to compete with GANs on image generation. Two changes did most of the work. They reparameterised the reverse process to predict the noise that was added during the forward process, rather than the mean of the reverse Gaussian. They then dropped the variational-bound weighting in favour of a uniformly weighted mean-squared-error loss, $L_{\text{simple}}$, which they argued empirically gives better samples than the formally correct ELBO.
The forward process is fixed: a Markov chain that adds Gaussian noise to data over $T$ steps until the final state $x_T$ is approximately $\mathcal{N}(0, I)$. The reverse process is a Markov chain with the same form, parameterised by a U-Net that takes a noisy image $x_t$ and the timestep $t$ and outputs the predicted noise $\epsilon_\theta(x_t, t)$. Training samples a random $t$, a random data point $x_0$, and a fresh noise vector $\epsilon$, constructs $x_t$ in closed form, and minimises $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$.
The result on CIFAR-10 was an Inception score of 9.46 and an FID of 3.17, beating every prior likelihood-based model and matching contemporary GANs without the training instabilities. The paper is now the canonical reference for image diffusion: every later system — Stable Diffusion, Imagen, DALL-E 2, Sora — uses the noise-prediction parameterisation and the U-Net backbone introduced here. Song et al. (2021) later showed the DDPM training objective is a discretisation of denoising score matching, unifying this paper with the score-based generative models line.
Mathematical Contributions
The forward process
Pick a variance schedule $\beta_1, \dots, \beta_T$. The forward chain is

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right).$$

A useful closed form (Equation 4 in the paper) lets you sample from $q(x_t \mid x_0)$ directly without iterating the chain. With $\alpha_t := 1 - \beta_t$ and $\bar\alpha_t := \prod_{s=1}^{t} \alpha_s$,

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\right).$$

So $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The schedule is chosen so that $\bar\alpha_T \approx 0$ and $x_T$ is approximately a standard Gaussian, regardless of $x_0$. The original paper uses $T = 1000$ and a linear schedule from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$.
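As a concreteness check, here is a minimal PyTorch sketch of the closed-form forward sampler under the paper's linear schedule; the helper name `q_sample` is illustrative, not taken from the authors' code.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # linear schedule beta_1 .. beta_T
alphas = 1.0 - betas                        # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)   # abar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    """One-shot sample from q(x_t | x_0): x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over image dims
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
```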
The reverse process
The reverse Markov chain is also Gaussian:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right),$$

with $p(x_T) = \mathcal{N}(0, I)$. The paper fixes the variance to $\sigma_t^2 = \beta_t$ (or the slightly smaller $\tilde\beta_t = \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\,\beta_t$) and learns only the mean $\mu_\theta$. Both choices give comparable sample quality.
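The two fixed variance choices differ only through $\bar\alpha$; a short sketch continuing the tensors defined above:

```python
# Two fixed reverse-process variances compared in the paper (illustrative sketch).
sigma2_large = betas                                             # sigma_t^2 = beta_t
alpha_bars_prev = torch.cat([torch.ones(1), alpha_bars[:-1]])    # abar_{t-1}, with abar_0 = 1
sigma2_small = betas * (1 - alpha_bars_prev) / (1 - alpha_bars)  # tilde-beta_t
```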
The variational bound
The training objective is the standard variational bound on the negative log-likelihood $-\log p_\theta(x_0)$, written as a sum of KL terms (Equation 5):

$$L = \mathbb{E}_q\!\left[ \underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big)}_{L_T} + \sum_{t>1} \underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} \underbrace{- \log p_\theta(x_0 \mid x_1)}_{L_0} \right].$$

The first term $L_T$ has no learnable parameters. The middle terms are KL divergences between two Gaussians with the same variance, which reduce to the squared distance between their means. The forward posterior $q(x_{t-1} \mid x_t, x_0)$ is itself a known Gaussian with mean

$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1 - \bar\alpha_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t}\, x_t$$

and variance $\tilde\beta_t I$.
Noise-prediction parameterisation
Substituting $x_0 = \frac{1}{\sqrt{\bar\alpha_t}}\big(x_t - \sqrt{1 - \bar\alpha_t}\,\epsilon\big)$ into the posterior mean $\tilde\mu_t$ and rearranging gives (Equation 11):

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right).$$

The network outputs $\epsilon_\theta(x_t, t)$ — the noise — rather than $\mu_\theta$ directly. Because $\tilde\mu_t$ is an affine function of $\epsilon$, predicting either is equivalent in principle, but the noise parameterisation has uniform target scale ($\epsilon \sim \mathcal{N}(0, I)$) across $t$, which makes optimisation more stable.
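The affine equivalence is easy to see in code: given a predicted noise, the implied clean image falls out of the forward identity. A hypothetical helper, reusing the tensors from the forward-process sketch:

```python
def predict_x0(x_t, t, eps_pred):
    """Invert x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps to recover an estimate of x0."""
    ab = alpha_bars[t].view(-1, *([1] * (x_t.dim() - 1)))
    return (x_t - (1.0 - ab).sqrt() * eps_pred) / ab.sqrt()
```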
The simple loss
After algebra, the per-timestep KL term becomes proportional to a squared error in $\epsilon$ space (Equation 12):

$$L_{t-1} = \mathbb{E}_{x_0, \epsilon}\!\left[ \frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1 - \bar\alpha_t)} \left\| \epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon,\ t\big) \right\|^2 \right].$$

The paper observes that dropping the $t$-dependent weighting improves sample quality and uses (Equation 14):

$$L_{\text{simple}}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[ \left\| \epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon,\ t\big) \right\|^2 \right],$$

with $t$ sampled uniformly from $\{1, \dots, T\}$. This is no longer the variational bound; it down-weights small $t$ (where the original ELBO weight is large) in favour of large $t$ (where most of the visual structure is decided). Trading a tighter likelihood for better samples is an explicit choice the paper makes and defends empirically.
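A sketch of one training step under $L_{\text{simple}}$, assuming a hypothetical `model(x_t, t)` that returns the predicted noise (the U-Net itself is elided) and reusing `q_sample` from above:

```python
import torch.nn.functional as F

def ddpm_loss(model, x0):
    """L_simple (Eq. 14): uniform-t MSE between true and predicted noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)  # t ~ Uniform{1..T} (0-indexed here)
    eps = torch.randn_like(x0)                       # fresh noise eps ~ N(0, I)
    x_t = q_sample(x0, t, eps)                       # closed-form forward jump
    return F.mse_loss(model(x_t, t), eps)            # ||eps - eps_theta(x_t, t)||^2
```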
The training and sampling algorithms
| Step | Training | Sampling |
|---|---|---|
| 1 | Sample $x_0 \sim q(x_0)$, $t \sim \mathrm{Uniform}\{1, \dots, T\}$, $\epsilon \sim \mathcal{N}(0, I)$ | Sample $x_T \sim \mathcal{N}(0, I)$ |
| 2 | Form $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$ | For $t = T, \dots, 1$: $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\big(x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)\big) + \sigma_t z$ |
| 3 | Take a gradient step on $\nabla_\theta \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2$ | $z \sim \mathcal{N}(0, I)$ if $t > 1$, else $z = 0$ |
| 4 | | Return $x_0$ |
Training is one forward and one backward pass per gradient step, identical in cost to a vanilla autoencoder. Sampling is $T$ sequential network evaluations, which is the price the model pays for not being a one-shot generator like a GAN or VAE.
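The sampling column translates into a short loop. A sketch under the same assumptions as above (hypothetical `model` returning $\epsilon_\theta$, variance choice $\sigma_t^2 = \beta_t$, schedule tensors from the earlier sketches):

```python
@torch.no_grad()
def ddpm_sample(model, shape, device="cpu"):
    """Ancestral sampling (Algorithm 2): T sequential denoising steps from pure noise."""
    x = torch.randn(shape, device=device)              # x_T ~ N(0, I)
    for t in reversed(range(T)):
        ts = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, ts)                             # predicted noise at step t
        coef = betas[t] / (1.0 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()     # Eq. 11 reverse mean
        z = torch.randn_like(x) if t > 0 else 0.0      # no noise added on the final step
        x = mean + betas[t].sqrt() * z                 # sigma_t^2 = beta_t
    return x
```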
Connection to score matching
A separate, equivalent view (made explicit in Song & Ermon, 2019, and unified with DDPM in Song et al., 2021): the noise predictor $\epsilon_\theta(x_t, t)$ is, up to a known scale, the score $\nabla_{x_t} \log q(x_t)$. The DDPM update is then a discretisation of an annealed Langevin sampler driven by that score. This justifies the noise-prediction objective as denoising score matching, and it is the lens used by every continuous-time formulation (DDIM, score SDEs, EDM) that came later.
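Concretely, differentiating the Gaussian perturbation kernel gives the identity (a standard derivation, not spelled out in the 2020 paper):

```latex
\nabla_{x_t} \log q(x_t \mid x_0)
  = -\frac{x_t - \sqrt{\bar\alpha_t}\, x_0}{1 - \bar\alpha_t}
  = -\frac{\epsilon}{\sqrt{1 - \bar\alpha_t}}
\qquad\Longrightarrow\qquad
\nabla_{x_t} \log q(x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar\alpha_t}}.
```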
What It Gets Right
The first thing is the parameterisation. Predicting noise rather than the reverse mean keeps the regression target on a unit scale at every $t$, which is what makes a single shared network able to denoise across noise levels. Earlier diffusion-style attempts that predicted $x_0$ or the reverse mean directly never reached competitive image quality; this small change is most of the gap.
The second is the loss simplification. Dropping the formal ELBO weights is mathematically a step backward — the resulting objective is a reweighted bound, not the bound — but it improved FID by a wide margin. The paper reports the result honestly and does not pretend the simple loss is theoretically motivated; later work (Karras et al. 2022, EDM) recovered the choice by deriving the optimal weighting in a different parameterisation.
The third is implementation discipline. The U-Net backbone with sinusoidal time embedding and the linear noise schedule were straightforward enough that the paper's CIFAR-10 numbers were independently reproduced within months, and the code was robust enough to scale to 256×256 images and far beyond when later groups did the engineering.
Common Misconceptions
DDPM is not a one-step generator. Sampling needs $T$ sequential network calls, ~1000 in the original recipe. DDIM (Song et al., 2021), DPM-Solver (Lu et al., 2022), and consistency models (Song et al., 2023) cut this to 1–50 steps with various tradeoffs, but the original DDPM is iterative.
DDPM does not maximise likelihood. The simple loss is a reweighting of the variational bound and gives worse log-likelihoods than the ELBO weighting; the paper trades likelihood for sample quality. Anyone reporting bits-per-dimension for a DDPM-style model needs to be careful about which loss was used at training time.
The noise predictor and the score are two views of the same object, not independent ideas: $\epsilon_\theta(x_t, t) \approx -\sqrt{1 - \bar\alpha_t}\,\nabla_{x_t} \log q(x_t)$. The 2020 DDPM paper only notes the resemblance to denoising score matching in passing — Song et al. (2021) made the connection explicit — and a reader who picks up only one of the two papers can come away thinking they are different methods.
Connections to TheoremPath Topics
- Diffusion models — modern treatment covering DDPM, score SDEs, probability-flow ODEs, and flow matching.
- Score matching — the identity that makes the noise predictor equivalent to a score estimator.
- Variational autoencoders — the variational lower bound that DDPM derives from.
- Markov chains and steady state — the forward process is a fixed Markov chain converging to a Gaussian stationary distribution.
- Langevin dynamics — the continuous-time sampler whose discretisation underlies score-based and diffusion samplers.
- KL divergence — every term in the DDPM ELBO is a KL between two Gaussians.
Further Reading
- Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." ICML. arXiv:1503.03585. The original diffusion-probabilistic-model formulation that DDPM scaled up.
- Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). "Score-Based Generative Modeling through Stochastic Differential Equations." ICLR. arXiv:2011.13456. Continuous-time SDE view; shows DDPM is a discretisation of denoising score matching and unifies it with NCSN.
- Song, J., Meng, C., & Ermon, S. (2021). "Denoising Diffusion Implicit Models." ICLR. arXiv:2010.02502. DDIM. Same trained model, deterministic non-Markovian sampler, 10–50 steps instead of 1000.
- Nichol, A. Q., & Dhariwal, P. (2021). "Improved Denoising Diffusion Probabilistic Models." ICML. arXiv:2102.09672. Cosine schedule, learned variance, hybrid loss; the practical recipe that fed Imagen and DALL-E 2.
- Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). "Elucidating the Design Space of Diffusion-Based Generative Models." NeurIPS. arXiv:2206.00364. EDM. Re-derives the parameterisation, weighting, and noise schedule from first principles; current best practice for unconditional image diffusion.
References
Canonical:
- Ho, J., Jain, A., & Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models." NeurIPS. arXiv:2006.11239.
Direct precursors:
- Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." ICML. arXiv:1503.03585.
- Song, Y., & Ermon, S. (2019). "Generative Modeling by Estimating Gradients of the Data Distribution." NeurIPS. arXiv:1907.05600. NCSN; denoising score matching at multiple noise scales.
Unification and theory:
- Song, Y. et al. (2021). "Score-Based Generative Modeling through Stochastic Differential Equations." ICLR. arXiv:2011.13456.
- Vincent, P. (2011). "A Connection Between Score Matching and Denoising Autoencoders." Neural Computation 23(7). The denoising-score-matching identity used implicitly throughout DDPM.
Practical follow-ons:
- Song, J., Meng, C., & Ermon, S. (2021). "Denoising Diffusion Implicit Models." ICLR. arXiv:2010.02502.
- Nichol, A. Q., & Dhariwal, P. (2021). "Improved Denoising Diffusion Probabilistic Models." ICML. arXiv:2102.09672.
- Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR. arXiv:2112.10752. Stable Diffusion.
- Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). "Elucidating the Design Space of Diffusion-Based Generative Models." NeurIPS. arXiv:2206.00364.
Last reviewed: May 7, 2026