Hamiltonian Monte Carlo

Sneiderman, Robby

Sampling MCMC

Hamiltonian Monte Carlo

HMC uses gradients and Hamiltonian dynamics to propose large moves that still land in high-density regions. The real practical story includes leapfrog geometry, warmup, divergences, and when centered hierarchical models break the sampler.

AdvancedTier 2StableSupporting~60 min

Prerequisites

Metropolis Hastings Markov Chain Monte Carlo Gibbs Sampling Griddy Gibbs

Quiz (5)Pulse Check Prereq Map

Why This Matters

Metropolis-Hastings (MH) with a random walk proposal is the simplest MCMC algorithm, but it fails in high dimensions. The random walk takes tiny steps to maintain a reasonable acceptance rate, resulting in extremely slow exploration. In the canonical analysis of Roberts, Gelman, Gilks (1997) for product-form target distributions, the cost of one effective sample under random-walk MH scales as $O(d^2)$ in dimension $d$ . For even moderate $d$ this is prohibitively slow under that analysis, and the empirical picture is similar for many smooth high-dimensional posteriors.

Five-section infographic on Hamiltonian Monte Carlo: (1) why HMC beats random-walk Metropolis, taking long curved trajectories instead of tiny local steps; (2) energy view of HMC moving on the negative log posterior surface, with potential energy U(q) and kinetic energy K(p); (3) the leapfrog integrator with explicit half-step / full-step / half-step formulas; (4) the sampling cycle: sample momentum, simulate dynamics, Metropolis correction; (5) practical notes on warmup adaptation, step size, trajectory length, mass matrix, divergences, and modern HMC/NUTS workflows. — HMC uses gradients and momentum to travel far through parameter space without turning sampling into a random walk.

Hamiltonian Monte Carlo (HMC) solves this by using gradient information to propose distant points in parameter space that are still likely to be accepted. The companion analysis (Beskos, Pillai, Roberts, Sanz-Serna, Stuart 2013) shows that the cost of one effective sample under HMC scales as $O(d^{5/4})$ for the same product-form targets. These exponents are reference rates derived for product-form targets in the diffusion limit, not universal mixing-time guarantees. They predict the right qualitative ordering on a broad range of smooth posteriors, which is why Bayesian inference in hundreds or thousands of dimensions is practical with HMC, but worst-case mixing times for non-log-concave or geometrically pathological targets can be much worse.

HMC is the engine behind Stan, one of the most widely used probabilistic programming languages, and is the default MCMC algorithm for serious Bayesian modeling.

Mental Model

Imagine a ball rolling on a surface whose height at position $q$ is the $U(q) = -\log p(q | \text{data})$ . The ball naturally rolls toward regions of high posterior density (low $U$ ) and maintains to traverse ridges between modes.

HMC simulates this physics. Instead of randomly proposing a new point, it gives the current point a random "kick" (momentum) and then follows the laws of physics (Hamilton's equations) for a fixed time. The endpoint of this trajectory is the proposal. Because the trajectory follows the posterior landscape, the proposal tends to be in a region of similar density, even if it is far from the starting point.

Formal Setup and Notation

We want to sample from a target distribution with density $\pi(q) \propto e^{-U(q)}$ where $q \in \mathbb{R}^d$ are the parameters and $U(q) = -\log \pi(q)$ is the potential energy (up to a constant).

HMC introduces an auxiliary momentum variable $p \in \mathbb{R}^d$ on and defines the joint distribution:

$\pi(q, p) \propto \exp(-H(q, p))$

where the Hamiltonian is:

$H(q, p) = U(q) + K(p)$

The kinetic energy is typically $K(p) = \frac{1}{2}p^T M^{-1} p$ , where $M$ is a mass matrix (often $M = I$ ). The marginal distribution of $q$ is the target $\pi(q)$ , and $p | q \sim \mathcal{N}(0, M)$ .

Core Definitions

Definition

Hamilton's Equations

The Hamiltonian dynamics are governed by:

$\frac{dq}{dt} = \frac{\partial H}{\partial p} = M^{-1} p, \qquad \frac{dp}{dt} = -\frac{\partial H}{\partial q} = -\nabla U(q)$

These equations describe how position $q$ and momentum $p$ evolve over time. The key property: the Hamiltonian $H(q, p)$ is conserved along the trajectory. A particle moving along this trajectory stays on a contour of constant $H$ .

Definition

Leapfrog Integrator

Since we cannot solve Hamilton's equations analytically for general $U$ , we use a numerical integrator. The leapfrog (Stormer-Verlet) integrator with step size $\epsilon$ :

$p_{t+\epsilon/2} = p_t - \frac{\epsilon}{2}\nabla U(q_t)$ $q_{t+\epsilon} = q_t + \epsilon \, M^{-1} p_{t+\epsilon/2}$ $p_{t+\epsilon} = p_{t+\epsilon/2} - \frac{\epsilon}{2}\nabla U(q_{t+\epsilon})$

This is repeated $L$ times to produce the proposal. Each step requires one gradient evaluation $\nabla U(q)$ .

Definition

HMC Algorithm

One iteration of HMC:

Resample momentum: Draw $p \sim \mathcal{N}(0, M)$
Simulate dynamics: Starting from $(q, p)$ , run $L$ leapfrog steps with step size $\epsilon$ to get $(q^*, p^*)$
Negate momentum: Set $p^* \leftarrow -p^*$ (for reversibility)
Metropolis accept/reject: Accept $(q^*, p^*)$ with probability $\min(1, \exp(-H(q^*, p^*) + H(q, p)))$

If the leapfrog integrator were exact (zero discretization error), the Hamiltonian would be perfectly conserved and the acceptance probability would be 1. In practice, the acceptance rate depends on $\epsilon$ and $L$ .

Main Theorems

Theorem

Conservation Properties of Hamiltonian Dynamics

Statement

The exact Hamiltonian flow has three crucial properties:

Energy conservation: $H(q(t), p(t)) = H(q(0), p(0))$ for all $t$ .
Volume preservation (Liouville's theorem): The flow preserves the Lebesgue measure on phase space $(q, p)$ .
Reversibility: If $(q(T), -p(T))$ is evolved for time $T$ , it returns to $(q(0), -p(0))$ .

Together, these ensure that the dynamics leave the joint distribution $\pi(q, p) \propto e^{-H(q,p)}$ invariant.

Intuition

Energy conservation means the proposal has the same Hamiltonian value as the starting point, so acceptance probability is 1. Volume preservation means we do not need a Jacobian correction in the acceptance probability. Reversibility ensures detailed balance. These three properties are exactly what we need for a valid MCMC proposal.

Why It Matters

These properties explain why HMC works. The Hamiltonian flow is a measure-preserving, reversible map on phase space that conserves energy. It is the ideal MCMC proposal: it moves you to a distant point with the same target density, guaranteed. The only reason we need the Metropolis-Hastings correction is that the leapfrog integrator introduces discretization error, breaking exact energy conservation.

report a correction →

Proposition

Leapfrog Preserves Volume

Statement

The leapfrog integrator is a symplectic integrator: it preserves the symplectic 2-form on phase space. In particular, it is volume-preserving (the Jacobian of the leapfrog map has determinant 1) and time-reversible.

The local truncation error in the position variable is $O(\epsilon^3)$ per step. More importantly, because leapfrog is symplectic, the energy error along the trajectory is bounded by $O(\epsilon^2)$ uniformly in the number of steps (the error does not accumulate linearly, unlike non-symplectic integrators). This bounded-energy-error property comes from the shadow-Hamiltonian analysis (Hairer, Lubich, Wanner 2006).

Intuition

The leapfrog integrator inherits the volume-preservation and reversibility of exact Hamiltonian dynamics, even though it does not conserve energy exactly. This is why leapfrog is used instead of, say, Euler's method: Euler's method is neither volume-preserving nor reversible, and would require a Jacobian correction that is expensive to compute.

The non-accumulation of energy error is the key miracle: after $L$ steps, the Hamiltonian error is still $O(\epsilon^2)$ , not $O(L\epsilon^2)$ . This is because the leapfrog trajectory stays on a "shadow Hamiltonian" that is $O(\epsilon^2)$ -close to the true Hamiltonian.

Why It Matters

Volume preservation means the Metropolis acceptance probability is simply $\min(1, \exp(-\Delta H))$ with no Jacobian term. Reversibility ensures detailed balance. The $O(\epsilon^2)$ global error means we can take large steps (large $\epsilon$ ) while maintaining high acceptance rates, which is exactly what makes HMC efficient.

report a correction →

Proposition

HMC Satisfies Detailed Balance

Statement

The HMC transition kernel satisfies detailed balance with respect to the joint distribution $\pi(q, p) \propto e^{-H(q,p)}$ . Therefore, $\pi(q)$ is a stationary distribution of the HMC Markov chain (marginalizing over $p$ ).

Intuition

The momentum resampling step leaves $\pi(q,p)$ invariant because $p$ is drawn from its conditional $\mathcal{N}(0, M)$ . The leapfrog + Metropolis step leaves $\pi(q,p)$ invariant because leapfrog is volume-preserving and reversible, and the Metropolis correction accounts for the energy error. Composing two $\pi$ -invariant steps gives a $\pi$ -invariant chain.

report a correction →

Why HMC Beats Random-Walk MH

The fundamental problem with random-walk MH in $d$ dimensions, under the product-target diffusion-limit analysis of Roberts, Gelman, Gilks (1997):

To maintain acceptance rate $\sim 0.234$ (optimal), the step size must scale as $O(1/\sqrt{d})$ .
The chain takes steps of size $O(1/\sqrt{d})$ , so it needs $O(d)$ steps to traverse one standard deviation of the target.
Total cost to get one effective sample: $O(d^2)$ density evaluations.

HMC with properly tuned $\epsilon$ and $L$ , under the matching analysis of Beskos, Pillai, Roberts, Sanz-Serna, Stuart (2013):

The leapfrog trajectory can traverse the entire target distribution in one proposal.
The optimal step size scales as $O(d^{-1/4})$ .
Total cost per effective sample: $O(d^{5/4})$ gradient evaluations.

The improvement from $O(d^2)$ to $O(d^{5/4})$ in this product-target regime is the canonical reason HMC is practical for high-dimensional Bayesian inference. The cost is requiring $\nabla U(q)$ , which is not always available, and these scalings are not guaranteed for non-product or non-log-concave targets — multimodal posteriors, funnels, and posteriors with heavy tails or sharp ridges can mix much more slowly.

MCMC Sampler Comparison

The following table compares HMC against other MCMC methods across the dimensions that matter for practical Bayesian inference.

Property	Random-walk MH	Gibbs sampling	HMC (NUTS)	Langevin MCMC
Cost per step	$O(1)$ density eval	$O(d)$ conditional evals	$O(L)$ gradient evals	$O(1)$ gradient eval
Scaling with $d$	$O(d^2)$ for one effective sample	$O(d^2)$ (worse with correlation)	$O(d^{5/4})$	$O(d^{5/3})$
Requires gradients	No	No	Yes	Yes
Handles correlation	Poorly (slow mixing)	Poorly (blocked Gibbs helps)	Well (momentum carries through correlated dimensions)	Moderately
Tuning parameters	Proposal scale	Conditional distributions	$\epsilon$ , $L$ (NUTS auto-tunes both)	Step size only
Optimal acceptance rate	0.234	1.0 (exact conditionals)	0.65 (HMC) or 0.80 (NUTS)	0.574
Multimodal targets	Very slow	Very slow	Slow (cannot jump between modes)	Very slow

When HMC Fails

HMC is not a universal solution. Several target distribution geometries cause problems.

Multimodal targets. HMC explores a single mode efficiently but cannot jump between well-separated modes. The Hamiltonian dynamics are continuous, so the trajectory must pass through the low-density region between modes. If this region has very low density (high energy barrier), the trajectory would need enormous kinetic energy (momentum) to cross it, which is extremely unlikely under the Gaussian momentum distribution. For multimodal posteriors, parallel tempering or mixture proposals are needed alongside HMC.

Discontinuous densities. The leapfrog integrator requires $\nabla U(q)$ , which does not exist at density discontinuities. Models with discrete parameters, hard constraints, or indicator functions cannot be sampled directly with HMC. Reparameterization (replacing hard constraints with smooth approximations) is the standard workaround.

Funnel geometries. Neal's funnel is the canonical example: $v \sim \mathcal{N}(0, 3^2)$ , $x_i | v \sim \mathcal{N}(0, e^v)$ . When $v$ is large and negative, the $x_i$ are tightly concentrated near zero, requiring very small step sizes. When $v$ is large and positive, the $x_i$ spread widely, allowing large steps. No single step size works across the entire distribution. Non-centered parameterizations (writing the model in terms of standard normal variates and transforming) resolve this for many hierarchical models.

Very high dimensions without structure. At $d > 10{,}000$ , even HMC's $O(d^{5/4})$ scaling becomes expensive. Each leapfrog step requires a full gradient computation over all parameters. For large neural networks, variational inference (which uses gradient descent to optimize an approximate posterior) is preferred over MCMC because its cost per iteration does not grow with the number of posterior samples needed.

NUTS: No-U-Turn Sampler

Definition

NUTS (No-U-Turn Sampler)

HMC has two tuning parameters: step size $\epsilon$ and number of leapfrog steps $L$ . Choosing $L$ is difficult:

Too small $L$ : the trajectory does not explore enough, behaving like a random walk
Too large $L$ : the trajectory doubles back on itself, wasting computation

NUTS (Hoffman & Gelman, 2014) automatically selects $L$ by running the leapfrog trajectory until it makes a "U-turn": when the trajectory starts moving back toward the starting point. Formally, it uses a doubling scheme: double the trajectory length until the criterion $p \cdot (q_{\text{forward}} - q_{\text{backward}}) < 0$ is met.

NUTS also adapts $\epsilon$ during warmup to achieve a target acceptance rate (typically 0.8). This makes HMC tuning-free.

NUTS is the default sampler in Stan and PyMC. In practice, you almost never tune HMC by hand. You use NUTS and let it adapt. For the canonical geometry failure case and the centered-vs-non-centered fix, see No-U-Turn Sampler and Neal's Funnel and Centered vs. Non-Centered Hierarchical Models.

What You Actually Check In Practice

HMC is not one of those topics where the theory alone tells you whether a run is trustworthy. In real workflows you look at the chain output and the sampler's geometry warnings together.

Split $\hat{R}$ and ESS tell you whether chains agree and whether the Monte Carlo precision is usable. See Burn-in and Convergence Diagnostics.
Divergences tell you the numerical trajectory repeatedly hits regions of curvature that the current step size or parameterization cannot resolve.
Maximum treedepth warnings in NUTS say the sampler keeps needing longer paths than your current cap allows. That is often a geometry or tuning clue, not just a nuisance.
Energy diagnostics such as E-BFMI ask whether the momentum refresh is exploring the energy levels of the posterior well enough.

The hierarchy of trust is important: good $\hat{R}$ is necessary, but divergences can still invalidate the apparent success. HMC diagnostics are not just "convergence diagnostics plus gradients." They are geometry diagnostics.

Centered Hierarchies Are Where HMC's Reputation Gets Tested

The famous HMC success stories come from high-dimensional smooth posteriors with manageable curvature. The famous HMC failure stories come from hierarchical models with a global scale parameter and weak local data.

That is why Neal's funnel matters so much. A centered parameterization can force HMC to use one step size for both the wide mouth and the narrow neck of the posterior. No single step size does both well. The result is usually:

divergences near the neck,
very uneven exploration of the scale direction,
and misleading summaries that look acceptable until you inspect pair plots or divergence locations.

The usual fix is not "run longer." It is to change coordinates, often through the non-centered form. For the full geometry story, see No-U-Turn Sampler and Neal's Funnel and Centered vs. Non-Centered Hierarchical Models.

Connection to Symplectic Geometry

The leapfrog integrator is a symplectic integrator, meaning it preserves the symplectic 2-form $\omega = \sum_i dq_i \wedge dp_i$ . Volume preservation is a consequence of symplecticity (by Liouville's theorem for symplectic maps).

Symplecticity is strictly stronger than volume preservation. It ensures that the leapfrog trajectory is qualitatively correct: it stays on a "shadow Hamiltonian" $\tilde{H}$ that is $O(\epsilon^2)$ -close to $H$ . Non-symplectic integrators (like Euler) can be volume-preserving but still exhibit energy drift, making them unsuitable for HMC.

This connection to symplectic geometry is not just theoretical elegance. It is the practical reason why HMC works. The mathematical structure of Hamiltonian systems guarantees that the leapfrog integrator has the exact properties needed for efficient MCMC.

Proof Ideas and Templates Used

The correctness of HMC uses:

Liouville's theorem: Hamiltonian flow preserves phase space volume, so no Jacobian correction is needed.
Detailed balance via reversibility: The momentum flip plus leapfrog reversibility ensures the proposal is its own inverse, satisfying detailed balance with the Metropolis correction.
Backward error analysis: The leapfrog trajectory exactly solves a modified Hamiltonian $\tilde{H} = H + O(\epsilon^2)$ , explaining why the energy error is bounded.

Canonical Examples

Example

Sampling from a correlated Gaussian

Consider $\pi(q) = \mathcal{N}(0, \Sigma)$ with $\Sigma$ having condition number $\kappa$ . Random-walk MH takes $O(\kappa^2)$ steps per effective sample. HMC with mass matrix $M = \Sigma^{-1}$ (preconditioning) takes $O(\kappa^{1/2})$ leapfrog steps per effective sample. For $\kappa = 100$ , this is the difference between 10,000 steps and 10 steps.

Example

Bayesian logistic regression

With $d = 100$ parameters and $n = 1000$ data points, random-walk MH requires tens of thousands of iterations to converge. HMC (via Stan) typically converges in a few hundred iterations, each requiring $L \approx 10$ --50 leapfrog steps. The gradient $\nabla U(q)$ is the gradient of the negative log-posterior, computable in $O(nd)$ time.

Common Confusions

Watch Out

HMC requires gradients, not Hessians

HMC uses only first-order gradient information $\nabla U(q)$ , not the Hessian. The mass matrix $M$ can be set to an estimate of the posterior covariance (as NUTS does during warmup), but this is a preconditioning choice, not a requirement. Each leapfrog step costs one gradient evaluation, the same as one step of gradient descent.

Watch Out

The momentum is auxiliary and gets discarded

After each HMC step, the momentum $p$ is resampled from scratch. It is an auxiliary variable introduced solely to enable Hamiltonian dynamics. The samples of interest are the positions $q$ only. The momentum is never part of the output; it is a computational device.

Watch Out

HMC is not gradient descent

Although HMC uses gradients, it is structurally different from gradient descent. Gradient descent deterministically moves toward a mode. HMC uses gradients to navigate the entire posterior distribution, including regions far from the mode. The random momentum resampling ensures exploration, and the Metropolis correction ensures correctness. HMC samples from the posterior; gradient descent finds the MAP estimate.

Watch Out

The acceptance rate should not be 1.0

An acceptance rate of 1.0 means $\epsilon$ is too small and the chain is taking unnecessarily short steps. The optimal acceptance rate for HMC is around 0.65 (for standard HMC) or 0.80 (for NUTS). Lower rates mean $\epsilon$ is too large (too much discretization error); higher rates mean $\epsilon$ is too small (too many wasted gradient evaluations).

Watch Out

A good R-hat does not erase divergences

If the chains all agree on the part of the posterior they did explore, then $\hat{R}$ can look fine while the sampler is still failing in a narrow, high-curvature region. Divergences are not redundant with $\hat{R}$ . They are evidence about the geometry of the numerical trajectory itself.

Summary

HMC = Hamiltonian dynamics + leapfrog integrator + Metropolis correction
Position $q$ = parameters, momentum $p$ = auxiliary (Gaussian)
Hamiltonian $H(q,p) = U(q) + K(p)$ where $U = -\log\pi(q)$
Leapfrog is symplectic: volume-preserving, reversible, $O(\epsilon^2)$ energy error
Scaling: $O(d^{5/4})$ vs. random walk MH's $O(d^2)$
NUTS automatically tunes trajectory length $L$ and step size $\epsilon$
NUTS is the default in Stan and PyMC, but diagnostics like divergences and E-BFMI still matter
Hierarchical geometry and parameterization often matter more than "sampler branding"

Exercises

ExerciseCore

Problem

Write out the three steps of one leapfrog iteration for $U(q) = \frac{1}{2}q^T\Sigma^{-1}q$ (Gaussian target) with $M = I$ . What is $\nabla U(q)$ ?

ExerciseAdvanced

Problem

Explain why Euler's method ( $q_{t+\epsilon} = q_t + \epsilon p_t$ , $p_{t+\epsilon} = p_t - \epsilon \nabla U(q_t)$ ) is unsuitable for HMC, even though it is simpler than leapfrog.

ExerciseAdvanced

Problem

In $d = 1000$ dimensions, estimate the relative cost (in gradient evaluations) of getting one effective sample from random-walk MH vs. HMC. Assume the target is a standard Gaussian.

References

Canonical:

Duane, Kennedy, Pendleton, Roweth, "Hybrid Monte Carlo," Physics Letters B 195(2) (1987), 216-222 — the original HMC paper (called Hybrid Monte Carlo in lattice QCD).
Neal, "MCMC Using Hamiltonian Dynamics," Handbook of MCMC (2011), Chapter 5 (arXiv:1206.1901) — the standard modern exposition.
Hoffman & Gelman, "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo," JMLR 15 (2014), 1593-1623 (arXiv:1111.4246)

Scaling theory:

Beskos, Pillai, Roberts, Sanz-Serna, Stuart, "Optimal tuning of the hybrid Monte Carlo algorithm," Bernoulli 19(5A) (2013), 1501-1534 (arXiv:1001.4460). Source of the $O(d^{5/4})$ scaling quoted above, and the optimal acceptance rate of 0.651.
Roberts, Gelman, Gilks, "Weak convergence and optimal scaling of random walk Metropolis algorithms," Annals of Applied Probability 7(1) (1997), 110-120. Source of the $O(d^2)$ random-walk MH scaling and the 0.234 acceptance rate.
Roberts & Rosenthal, "Optimal scaling of discrete approximations to Langevin diffusions," JRSSB 60(1) (1998), 255-268. Source of the $O(d^{5/3})$ Langevin MCMC scaling.

Geometric / adaptive:

Girolami & Calderhead, "Riemann manifold Langevin and Hamiltonian Monte Carlo methods," JRSSB 73(2) (2011), 123-214. Riemannian HMC with position-dependent mass matrix.
Betancourt, "A Conceptual Introduction to Hamiltonian Monte Carlo" (2017, arXiv:1701.02434)
Betancourt & Girolami, "Hamiltonian Monte Carlo for Hierarchical Models" (2013, arXiv:1312.0906). Canonical source on funnel pathologies, geometry, and reparameterization.
Hairer, Lubich, Wanner, Geometric Numerical Integration (Springer, 2006), Chapters I-VI — symplectic integrator theory and shadow-Hamiltonian bounds.

Current / practical:

Stan Development Team, Stan User's Guide and Stan Reference Manual, diagnostics and adaptation chapters.
Robert & Casella, Monte Carlo Statistical Methods (2004), Chapters 3-7
Brooks et al., Handbook of MCMC (2011), Chapters 1-5

Next Topics

The natural next steps from HMC:

Burn-in and Convergence Diagnostics: the scalar and geometry diagnostics that tell you whether an HMC run is trustworthy
No-U-Turn Sampler and Neal's Funnel: automatic path-length adaptation and the canonical hierarchical failure geometry
Riemannian HMC: adapting the mass matrix to local curvature
Variational inference: the optimization-based alternative to MCMC
Importance sampling: a non-MCMC approach to estimating expectations, useful when the target is easy to evaluate but hard to sample from directly

Last reviewed: May 6, 2026

Canonical graph

Required before and derived from this topic

These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.

Full prerequisite chain All derived topics

Required prerequisites

5

Gibbs Samplinglayer 2 · tier 1
Markov Chain Monte Carlolayer 2 · tier 1
Metropolis-Hastings Algorithmlayer 2 · tier 1
Variance Reduction Techniqueslayer 2 · tier 2
Griddy Gibbs Samplinglayer 2 · tier 3

Derived topics

3

Burn-in and Convergence Diagnosticslayer 2 · tier 2
Langevin Dynamicslayer 3 · tier 2
No-U-Turn Sampler and Neal's Funnellayer 3 · tier 2

Graph-backed continuations

Burn-in and Convergence Diagnostics No-U-Turn Sampler and Neal's Funnel Langevin Dynamics