Energy-Based Models
A unifying framework for generative modeling: assign low energy to likely configurations via E(x), define probability through the Boltzmann distribution, and train without computing the intractable partition function.
Why This Matters
Energy-based models provide the most general framework for specifying probability distributions over high-dimensional data. Every probabilistic model --- GANs, VAEs, diffusion models, Boltzmann machines --- can be viewed through the energy lens. Understanding EBMs gives you a unified language for generative modeling and reveals why certain training strategies work.
The recent explosion of diffusion models is an EBM story: the score function that diffusion models learn is exactly the gradient of the log-density of an energy-based model.
An energy-based model is a factor graph whose local terms add up to one scalar score
The rectangles are additive energy terms over observed and latent variables. The rounded block on the right is the easy forward part: once the configuration is fixed, it computes one number, the energy.
The subtle point is that inference and scoring are different. Given a full configuration, the D-shaped block computes the energy directly. But recovering good latent states still means searching for configurations where the factor terms agree.
Diagram language
Observed versus latent structure
The filled circles are the visible data. The hollow circles are the hidden units or latent variables whose configuration determines whether the local energy terms can all be small at once.
Probability only appears after normalization
The factor graph defines a total energy $E(x, z) = \sum_i E_i(x, z)$ as a sum of local factor terms. The corresponding probability is $p(x, z) = e^{-E(x, z)} / Z$, so the partition function $Z$ is what turns a score into a normalized model.
Why this notation helps
Thinking in factors makes it obvious which variables interact locally and which part is just a forward computation. That is the same distinction that reappears in modern world-model diagrams.
Mental Model
Think of the energy function as a landscape over the data space. Valleys (low energy) correspond to likely data configurations. Peaks (high energy) correspond to unlikely configurations. Training an EBM means sculpting this landscape so that the valleys align with the data distribution: real images get low energy, noise gets high energy.
The challenge: converting this energy landscape into a proper probability distribution requires normalizing over all possible configurations --- an intractable integral in high dimensions. The entire field of EBM training is about avoiding or approximating this normalization.
Formal Setup and Notation
Let $x \in \mathbb{R}^d$ be a data point. An energy-based model defines an energy function $E_\theta(x)$ parameterized by $\theta$.
Energy Function
The energy function $E_\theta$ assigns a scalar energy $E_\theta(x)$ to each configuration $x$. Lower energy means higher probability. There are no constraints on $E_\theta$ --- it can be any function from $\mathbb{R}^d$ to $\mathbb{R}$, including a neural network.
Boltzmann Distribution
The energy function induces a probability distribution via the Boltzmann distribution (also called the Gibbs distribution):

$$p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z(\theta)}$$

where the partition function is:

$$Z(\theta) = \int e^{-E_\theta(x)} \, dx$$

The partition function ensures $p_\theta$ integrates to 1. Computing $Z(\theta)$ requires integrating over all of $\mathbb{R}^d$ --- this is intractable for high-dimensional $x$.
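To make the normalization concrete, here is a small sketch (not from the text; the double-well energy is an illustrative choice) where the space is a discretized 1D grid, so the partition function reduces to a tractable sum:

```python
import numpy as np

# On a discretized 1D space, Z is a tractable sum and we can normalize
# the energy into a proper distribution explicitly.
xs = np.linspace(-4.0, 4.0, 2001)
dx = xs[1] - xs[0]

def energy(x):
    # illustrative double-well energy with valleys near x = +/-1.5
    return (x**2 - 2.25)**2 / 4.0

Z = np.sum(np.exp(-energy(xs))) * dx   # Riemann approximation of Z(theta)
p = np.exp(-energy(xs)) / Z            # Boltzmann distribution on the grid
```

In high dimensions this grid sum is exactly what becomes intractable: the number of grid points grows exponentially with $d$.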
Core Definitions
The score function of an EBM is the gradient of the log-density:

$$s_\theta(x) = \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)$$

The partition function disappears because it does not depend on $x$. This observation is the foundation of score-based methods.

The free energy is $F(\theta) = -\log Z(\theta)$. The log-likelihood of a single observation is:

$$\log p_\theta(x) = -E_\theta(x) - \log Z(\theta)$$
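A quick numerical sanity check of the $Z$-independence of the score (a sketch with an assumed Gaussian energy): adding any constant $c$ to the energy rescales $Z$ by $e^{-c}$ but leaves the score untouched.

```python
import numpy as np

def energy(x, c=0.0):
    # Gaussian energy plus a constant offset c; the offset only rescales Z
    return 0.5 * x**2 + c

def score(x, c=0.0, eps=1e-5):
    # numerical derivative of log p = -E - log Z; log Z is constant in x,
    # so the score is just -dE/dx
    return -(energy(x + eps, c) - energy(x - eps, c)) / (2 * eps)
```

For this energy the score is $-x$, regardless of the offset.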
Main Theorems
MLE Gradient for Energy-Based Models
Statement
The gradient of the log-likelihood with respect to parameters is:

$$\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{x' \sim p_\theta}\left[\nabla_\theta E_\theta(x')\right]$$

Equivalently, the gradient of the average log-likelihood over the data distribution is:

$$\nabla_\theta\, \mathbb{E}_{x \sim p_{\text{data}}}\left[\log p_\theta(x)\right] = -\mathbb{E}_{x \sim p_{\text{data}}}\left[\nabla_\theta E_\theta(x)\right] + \mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta E_\theta(x)\right]$$
Intuition
MLE training pushes down the energy of real data (first term) and pushes up the energy of model samples (second term). This is a "contrastive" update: make real data more likely by making it lower energy, and make model samples less likely by making them higher energy. The challenge is computing the second expectation, which requires sampling from $p_\theta$ --- itself an intractable distribution.
Proof Sketch
Differentiate $\log p_\theta(x) = -E_\theta(x) - \log Z(\theta)$. The first term gives $-\nabla_\theta E_\theta(x)$. For the second, note:

$$\nabla_\theta \log Z(\theta) = \frac{1}{Z(\theta)} \int \nabla_\theta e^{-E_\theta(x)} \, dx = -\int \frac{e^{-E_\theta(x)}}{Z(\theta)}\, \nabla_\theta E_\theta(x) \, dx = -\mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta E_\theta(x)\right]$$

Combining: $\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{x' \sim p_\theta}\left[\nabla_\theta E_\theta(x')\right]$.
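The identity can be verified numerically on a tiny discrete space, where the expectation under $p_\theta$ is an exact sum. A sketch with an assumed linear energy $E_\theta(x) = \theta\, f(x)$ over four states (the feature values are hand-picked for illustration):

```python
import numpy as np

# Check: d/dtheta log p_theta(x) = -dE/dtheta(x) + E_{x'~p_theta}[dE/dtheta(x')]
# For E_theta(x) = theta * f(x), dE/dtheta is just the feature f(x).
f = np.array([0.0, 1.0, -1.0, 2.0])   # feature value of each of 4 states

def log_p(theta, x):
    E = theta * f
    return -E[x] - np.log(np.sum(np.exp(-E)))

theta, x0, eps = 0.7, 2, 1e-6
# finite-difference gradient of the log-likelihood at state x0
g_fd = (log_p(theta + eps, x0) - log_p(theta - eps, x0)) / (2 * eps)
# analytic contrastive form: -f(x0) + E_{p_theta}[f]
p = np.exp(-theta * f) / np.sum(np.exp(-theta * f))
g_an = -f[x0] + np.sum(p * f)
```

The two gradients agree to finite-difference precision.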
Why It Matters
This gradient formula is the starting point for all EBM training methods. Every training algorithm for EBMs is a strategy for approximating the model expectation $\mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta E_\theta(x)\right]$, which requires sampling from $p_\theta$.
Failure Mode
If the MCMC sampler used to approximate the model expectation does not mix well, the gradient estimate is biased. The model may learn to place low energy only on the MCMC chain's trajectory rather than on the true data manifold, leading to poor generalization.
Training Methods
Contrastive Divergence (CD)
Since sampling from $p_\theta$ is intractable, Hinton (2002) proposed contrastive divergence: initialize an MCMC chain at a data point and run it for only $k$ steps (typically $k = 1$). Use the resulting sample as an approximation to a sample from $p_\theta$.
The CD-$k$ gradient is:

$$\nabla_\theta \log p_\theta(x) \approx -\nabla_\theta E_\theta(x) + \nabla_\theta E_\theta(\tilde{x}_k)$$

where $\tilde{x}_k$ is the state of the MCMC chain after $k$ steps started at $x$.
CD is biased (the chain has not converged to $p_\theta$) but works surprisingly well in practice. It was the workhorse for training restricted Boltzmann machines and deep belief networks.
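A minimal CD-$k$ sketch for a continuous toy family $E_\theta(x) = \theta x^2 / 2$, using Langevin dynamics as the MCMC transition (an assumption for illustration; classic CD for RBMs uses block Gibbs sampling instead):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_x_E(x, theta):      # dE/dx, drives the sampler
    return theta * x

def grad_theta_E(x, theta):  # dE/dtheta = x^2 / 2
    return 0.5 * x**2

def cd_k_grad(x_data, theta, k=1, step=0.1):
    x = x_data.copy()                       # chain starts at the data (CD trick)
    for _ in range(k):                      # only k short Langevin steps
        x = (x - step * grad_x_E(x, theta)
               + np.sqrt(2 * step) * rng.standard_normal(x.shape))
    # estimate of d/dtheta log p: -data term + (approximate) model term
    return -grad_theta_E(x_data, theta).mean() + grad_theta_E(x, theta).mean()

x_data = rng.standard_normal(500)           # pretend observations, roughly N(0, 1)
g = cd_k_grad(x_data, theta=2.0, k=5)       # theta too large: model too peaked
```

Ascending this estimate nudges $\theta$ toward the data's precision; because the chain never converges, the model term is biased toward the data, which is exactly the CD bias discussed above.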
Score Matching
Score Matching Avoids the Partition Function
Statement
The Fisher divergence between $p_{\text{data}}$ and $p_\theta$:

$$D_F\left(p_{\text{data}} \,\|\, p_\theta\right) = \frac{1}{2}\, \mathbb{E}_{x \sim p_{\text{data}}}\left[\left\| \nabla_x \log p_{\text{data}}(x) - \nabla_x \log p_\theta(x) \right\|^2\right]$$

can be rewritten (via integration by parts) as:

$$\mathbb{E}_{x \sim p_{\text{data}}}\left[\operatorname{tr}\left(\nabla_x^2 \log p_\theta(x)\right) + \frac{1}{2}\left\| \nabla_x \log p_\theta(x) \right\|^2\right] + C$$

where the constant $C$ does not depend on $\theta$.
Intuition
Score matching trains the model by matching the gradient of the log-density (the score) rather than the density itself. Since the score does not involve $Z(\theta)$, the partition function drops out entirely. You can train an EBM without ever computing or approximating $Z(\theta)$.
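A 1D sketch of the score matching objective for an assumed toy family $E_\theta(x) = \theta x^2 / 2$ (so $s_\theta(x) = -\theta x$ and $s_\theta'(x) = -\theta$): minimizing the integration-by-parts form recovers the data's precision, with no $Z(\theta)$ anywhere.

```python
import numpy as np

rng = np.random.default_rng(1)
data = 2.0 * rng.standard_normal(5000)   # true distribution N(0, 4)

def sm_loss(theta):
    # Hyvarinen objective in 1D: E_data[ s'(x) + 0.5 * s(x)^2 ]
    s = -theta * data                    # model score at the data points
    ds = -theta                          # derivative of the score
    return np.mean(ds + 0.5 * s**2)

# grid search; the minimizer should sit near the true precision 1/4
thetas = np.linspace(0.05, 1.0, 96)
best = thetas[np.argmin([sm_loss(t) for t in thetas])]
```

Analytically the loss is $2\theta^2 - \theta$ for this data, minimized at $\theta = 1/4$, matching the variance-4 data.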
Connection to Diffusion Models
Diffusion models learn a score function $s_\theta(x, \sigma) \approx \nabla_x \log p_\sigma(x)$, where $p_\sigma$ is the data distribution convolved with Gaussian noise of variance $\sigma^2$. This is precisely a noise-conditional score matching objective applied to an EBM at each noise level $\sigma$.
The denoising score matching objective:

$$\mathbb{E}_{x \sim p_{\text{data}}}\, \mathbb{E}_{\tilde{x} \sim \mathcal{N}(x,\, \sigma^2 I)}\left[\left\| s_\theta(\tilde{x}) - \frac{x - \tilde{x}}{\sigma^2} \right\|^2\right]$$

trains the model to point toward the clean data from the noisy version. This is equivalent to learning the energy gradient at multiple noise scales.
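A sketch of the DSM regression target for standard normal data (an assumed toy setting): fitting a linear score $s(\tilde{x}) = a\tilde{x}$ to the target $(x - \tilde{x})/\sigma^2$ should recover the score of the smoothed density $\mathcal{N}(0, 1 + \sigma^2)$, i.e. $a \approx -1/(1 + \sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5
x = rng.standard_normal(20000)                 # clean data ~ N(0, 1)
xt = x + sigma * rng.standard_normal(20000)    # noisy version ~ N(0, 1 + sigma^2)
target = (x - xt) / sigma**2                   # grad_xt log N(xt; x, sigma^2)

# closed-form least-squares fit of s(xt) = a * xt to the DSM target
a = np.sum(xt * target) / np.sum(xt * xt)
```

The fitted slope matches the score of the *noisy* marginal, which is the sense in which DSM learns the energy gradient at that noise scale.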
Canonical Examples
Restricted Boltzmann Machine
An RBM defines energy over visible units $v$ and hidden units $h$: $E(v, h) = -v^\top W h - b^\top v - c^\top h$. The marginal $p(v)$ is obtained by summing out $h$, which is tractable because the hidden units are conditionally independent given $v$. RBMs were the first scalable EBMs, trained with CD.
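A sketch of the tractable marginalization (shapes and parameter scales are illustrative): because the hiddens factorize given $v$, the sum over all $2^{n_h}$ hidden configurations collapses into a product over units, giving a closed-form free energy $F(v) = -b^\top v - \sum_j \log\left(1 + e^{c_j + (v^\top W)_j}\right)$.

```python
import numpy as np

rng = np.random.default_rng(3)
nv, nh = 6, 4                                  # illustrative sizes
W = 0.1 * rng.standard_normal((nv, nh))
b = 0.05 * rng.standard_normal(nv)
c = 0.05 * rng.standard_normal(nh)

def free_energy(v):
    # F(v) = -log sum_h exp(-E(v, h)), computed without enumerating h
    return -(v @ b) - np.sum(np.log1p(np.exp(c + v @ W)))

def p_h_given_v(v):
    # each hidden unit is an independent Bernoulli given v
    return 1.0 / (1.0 + np.exp(-(c + v @ W)))

v = rng.integers(0, 2, nv).astype(float)       # one binary visible vector
```

Brute-force enumeration over the $2^{n_h}$ hidden states confirms the closed form.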
Modern deep EBM
Parameterize $E_\theta(x)$ as a deep convolutional network mapping images to a scalar energy. Train via MCMC-based contrastive learning or score matching. The resulting model assigns low energy to realistic images and high energy to noise, enabling generation via Langevin dynamics (gradient descent on $E_\theta$ with added noise).
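The Langevin sampler itself is only a few lines. A 1D sketch with an assumed quadratic energy $E(x) = x^2/2$, whose Boltzmann distribution is $\mathcal{N}(0, 1)$:

```python
import numpy as np

rng = np.random.default_rng(4)

def langevin_sample(grad_E, n=2000, steps=300, eta=0.05):
    # x_{t+1} = x_t - eta * grad E(x_t) + sqrt(2 * eta) * noise
    x = 3.0 * rng.standard_normal(n)      # arbitrary initialization
    for _ in range(steps):
        x = x - eta * grad_E(x) + np.sqrt(2 * eta) * rng.standard_normal(n)
    return x

samples = langevin_sample(lambda x: x)    # grad E = x for E(x) = x^2 / 2
```

After enough steps the samples approximate $\mathcal{N}(0, 1)$ up to a small discretization bias in the step size; in a deep EBM, `grad_E` would be the network's input gradient computed by backpropagation.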
Common Confusions
Low energy does not mean the point is a mode
A point can have low energy but low probability if it sits in a region where many other points also have low energy (high partition function contribution from that region). Probability depends on both energy and volume. This is the energy-entropy tradeoff.
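The tradeoff shows up already in one dimension. An illustrative sketch (the well depths and widths are assumed, not from the text): a deep narrow well holds the global energy minimum, yet a shallower but much wider well carries more total probability mass.

```python
import numpy as np

xs = np.linspace(-6.0, 6.0, 12001)
dx = xs[1] - xs[0]

def energy(x):
    # deep narrow well at x = -2 (depth 3, width 0.05),
    # shallow wide well at x = +2 (depth 2, width 1.0)
    narrow = -3.0 * np.exp(-((x + 2.0) ** 2) / (2 * 0.05**2))
    wide = -2.0 * np.exp(-((x - 2.0) ** 2) / (2 * 1.0**2))
    return narrow + wide

p = np.exp(-energy(xs))
p /= np.sum(p) * dx                       # normalize on the grid

mass_narrow = np.sum(p[xs < 0]) * dx      # mass around the deep narrow well
mass_wide = np.sum(p[xs > 0]) * dx        # mass around the shallow wide well
```

The density *peak* sits over the narrow well (lowest energy), but most of the *mass* sits over the wide one: volume wins.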
EBMs are not just Boltzmann machines
Boltzmann machines are a specific type of EBM with bilinear energy functions over binary variables. Modern EBMs use arbitrary neural network energy functions over continuous spaces. The framework is far more general than its historical association with Boltzmann machines suggests.
Summary
- Energy function $E_\theta(x)$: low energy = high probability
- Boltzmann distribution: $p_\theta(x) = e^{-E_\theta(x)} / Z(\theta)$
- Partition function $Z(\theta)$ is intractable in high dimensions
- MLE gradient = push down data energy, push up model-sample energy
- Score function $\nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)$ avoids $Z(\theta)$
- Diffusion models are score-matching EBMs at multiple noise levels
Exercises
Problem
Show that the score function $\nabla_x \log p_\theta(x)$ does not depend on the partition function $Z(\theta)$.
Problem
Derive the MLE gradient for an EBM. Specifically, show that $\nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x) + \mathbb{E}_{x' \sim p_\theta}\left[\nabla_\theta E_\theta(x')\right]$.
Problem
Contrastive divergence with $k = 1$ step is known to be a biased estimator of the MLE gradient. Explain intuitively why this bias exists and describe a scenario where it causes the model to fail.
References
Canonical:
- Ackley, Hinton, Sejnowski, "A Learning Algorithm for Boltzmann Machines" (Cognitive Science, 1985)
- LeCun, Chopra, Hadsell, Ranzato, Huang, A Tutorial on Energy-Based Learning (2006)
- Hinton, "Training Products of Experts by Minimizing Contrastive Divergence" (Neural Computation, 2002)
- Hyvärinen, "Estimation of Non-Normalized Statistical Models by Score Matching" (JMLR, 2005)
- Vincent, "A Connection Between Score Matching and Denoising Autoencoders" (Neural Computation, 2011)
Current:
- Song & Ermon, "Generative Modeling by Estimating Gradients of the Data Distribution" (NeurIPS 2019). arXiv:1907.05600
- Du & Mordatch, "Implicit Generation and Modeling with Energy-Based Models" (NeurIPS 2019). arXiv:1903.08689
- Ho, Jain, Abbeel, "Denoising Diffusion Probabilistic Models" (NeurIPS 2020). arXiv:2006.11239
- Grathwohl et al., "Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One" (ICLR 2020). arXiv:1912.03263
- Song et al., "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021). arXiv:2011.13456
Next Topics
The natural next steps from energy-based models:
- Diffusion models: score matching at multiple noise levels, the modern EBM
- Variational autoencoders: a different approach to intractable normalization via amortized inference
Last reviewed: April 25, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency (layer 0B · tier 1)
- Feedforward Networks and Backpropagation (layer 2 · tier 1)
- Score Matching (layer 3 · tier 1)
- Normalizing Flows (layer 3 · tier 3)
- Neural SDEs and the Diffusion Bridge (layer 4 · tier 3)
Derived topics
- Diffusion Models (layer 4 · tier 1)