
Metropolis-Hastings Algorithm

The foundational MCMC algorithm: build a Markov chain with the right stationary distribution by accepting or rejecting proposals according to a carefully balanced ratio. The real story is not only correctness, but also proposal geometry, diagnostics, and where plain MH becomes painfully slow.


Why This Matters

If you need to sample from a probability distribution that you can evaluate only up to a normalizing constant, you are already in Metropolis-Hastings territory. That describes most posterior distributions in Bayesian statistics. The Metropolis-Hastings algorithm is the foundational construction that makes such problems computationally tractable. It sits underneath Gibbs sampling, motivates why Hamiltonian Monte Carlo had to be invented, and still provides the cleanest introduction to how MCMC achieves correctness at all.

Before MH, Bayesian inference was largely restricted to conjugate models where closed-form posteriors exist. MH broke that barrier and made Bayesian computation general-purpose. The modern lesson is slightly different: MH gives you the correctness template, but its practical efficiency depends almost entirely on proposal design and posterior geometry.

Figure: an MH chain exploring a two-mode target, with accepted and rejected proposals marked. The chain explores high-density regions; rejections prevent low-density drift.

Mental Model

You are exploring a landscape where the height at each point represents the (unnormalized) probability density of your target distribution. You stand at some point x. A friend proposes a new location x' according to some rule. You evaluate how much more (or less) probable x' is compared to x. If x' is more probable, you always move there. If x' is less probable, you move there with probability equal to the ratio of the densities. Over time, you spend more time in high-probability regions, exactly in proportion to their probability.

Formal Setup and Notation

Let \pi(x) be the target distribution we wish to sample from. We assume we can evaluate \pi(x) up to a normalizing constant; that is, we can compute \tilde{\pi}(x) where \pi(x) = \tilde{\pi}(x)/Z for an unknown constant Z.

Let q(x' \mid x) be a proposal distribution: a conditional density from which we can easily draw candidates.

Definition

Proposal Distribution

The proposal distribution q(x' \mid x) is a conditional density that, given the current state x, generates a candidate next state x'. The proposal must be chosen so that it is easy to sample from and (ideally) explores the target distribution efficiently. Common choices include:

  • Random walk: q(x' \mid x) = \mathcal{N}(x, \sigma^2 I), which proposes near the current state
  • Independence sampler: q(x' \mid x) = q(x'), which proposes independently of the current state
Definition

Acceptance Ratio

The acceptance ratio (or acceptance probability) for moving from state x to proposed state x' is:

\alpha(x, x') = \min\!\left(1,\; \frac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)}\right)

When the proposal is symmetric, i.e., q(x' \mid x) = q(x \mid x'), this simplifies to the original Metropolis ratio:

\alpha(x, x') = \min\!\left(1,\; \frac{\pi(x')}{\pi(x)}\right)

Definition

Metropolis-Hastings Algorithm

Given target \pi, proposal q, and initial state x_0:

  1. Propose: Draw x' \sim q(\cdot \mid x_t)
  2. Compute the acceptance ratio \alpha(x_t, x') using the unnormalized target and proposal densities
  3. Accept/Reject: Draw u \sim \text{Uniform}(0,1). If u \leq \alpha, set x_{t+1} = x'; otherwise set x_{t+1} = x_t
  4. Repeat from step 1

We use \tilde{\pi} (unnormalized) because Z cancels in the ratio.
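The four steps above translate directly into code. The following is a minimal sketch, not a production sampler: the callback names (log_target, proposal_sample, log_proposal) are placeholder conventions chosen here, and the ratio is computed in log space to avoid underflow.

```python
import numpy as np

def metropolis_hastings(log_target, proposal_sample, log_proposal, x0, n_steps, seed=0):
    """Generic MH with a possibly asymmetric proposal, computed in log space.

    log_target(x)         : log pi~(x), the unnormalized log target density
    proposal_sample(x, r) : draw x' ~ q(. | x) using numpy Generator r
    log_proposal(a, b)    : log q(a | b)
    """
    rng = np.random.default_rng(seed)
    x = x0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        x_prop = proposal_sample(x, rng)
        # log alpha = log pi~(x') + log q(x | x') - log pi~(x) - log q(x' | x)
        log_alpha = (log_target(x_prop) + log_proposal(x, x_prop)
                     - log_target(x) - log_proposal(x_prop, x))
        if np.log(rng.uniform()) <= min(0.0, log_alpha):  # accept with prob min(1, ratio)
            x = x_prop
        samples[t] = x                                    # on rejection, x_t is repeated
    return samples

# Usage: random-walk proposal (symmetric, so the q terms cancel) targeting N(0, 1).
chain = metropolis_hastings(
    log_target=lambda x: -0.5 * x * x,
    proposal_sample=lambda x, r: x + r.standard_normal(),
    log_proposal=lambda a, b: -0.5 * (a - b) ** 2,
    x0=0.0, n_steps=50_000)
```

Because the correctness argument holds for any valid proposal q, this same loop serves random-walk, independence, and more exotic proposals; only the three callbacks change.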

Why the Algorithm Works

The key insight is that MH constructs a Markov chain whose transition kernel satisfies detailed balance with respect to \pi. Detailed balance is a sufficient condition for \pi to be the stationary distribution of the chain.

Definition

Detailed Balance

A Markov chain with transition kernel K(x \to x') satisfies detailed balance with respect to \pi if and only if:

\pi(x)\, K(x \to x') = \pi(x')\, K(x' \to x)

for all states x, x'. This says: the probability flow from x to x' under \pi equals the flow from x' to x. Detailed balance implies \pi is stationary (integrate both sides over x), but is stronger: it says the chain is reversible.

Main Theorems

Theorem

MH Satisfies Detailed Balance

Statement

The Metropolis-Hastings transition kernel

K(x \to x') = q(x' \mid x)\,\alpha(x, x') \quad \text{for } x' \neq x

satisfies detailed balance with respect to \pi:

\pi(x)\, q(x' \mid x)\,\alpha(x, x') = \pi(x')\, q(x \mid x')\,\alpha(x', x)

Therefore \pi is a stationary distribution of the MH chain.

Intuition

The acceptance ratio is specifically designed so that the "excess flow" from x to x' is throttled back to match the flow in the reverse direction. If \pi(x')\,q(x \mid x') > \pi(x)\,q(x' \mid x), then the move from x to x' is accepted with probability 1, and the reverse move is accepted with probability less than 1, by exactly the right amount to balance the flows.

Proof Sketch

Without loss of generality, assume \pi(x)\,q(x' \mid x) \leq \pi(x')\,q(x \mid x').

Then \alpha(x, x') = 1 and \alpha(x', x) = \frac{\pi(x)\,q(x' \mid x)}{\pi(x')\,q(x \mid x')}.

The left side of the detailed balance equation becomes: \pi(x)\,q(x' \mid x) \cdot 1 = \pi(x)\,q(x' \mid x).

The right side becomes: \pi(x')\,q(x \mid x') \cdot \frac{\pi(x)\,q(x' \mid x)}{\pi(x')\,q(x \mid x')} = \pi(x)\,q(x' \mid x).

The two sides are equal. The case with the inequality reversed follows by exchanging the roles of x and x'.

Why It Matters

This is the theoretical guarantee that MH actually samples from the correct distribution in the long run. Without detailed balance, you would have no reason to believe the chain converges to \pi.

Failure Mode

Detailed balance alone only guarantees \pi is a stationary distribution. For \pi to be the unique stationary distribution and for the chain to converge to it from any starting point, you also need the chain to be irreducible and aperiodic, which is the content of the ergodicity theorem.

Theorem

Ergodicity of the MH Chain

Statement

If the MH chain is irreducible and aperiodic, then for any initial distribution \mu_0:

\lim_{t \to \infty} \| K^t(\mu_0, \cdot) - \pi(\cdot) \|_{\text{TV}} = 0

Moreover, for any \pi-integrable function f:

\frac{1}{T}\sum_{t=1}^{T} f(x_t) \xrightarrow{a.s.} \mathbb{E}_\pi[f(X)]

Intuition

Irreducibility means the chain can reach any state from any other state. Aperiodicity means it does not get trapped in deterministic cycles. Together with detailed balance, these conditions ensure the chain "forgets" its starting point and converges to \pi. The ergodic theorem then says that time averages along the chain converge to expectations under \pi.

Proof Sketch

Detailed balance implies \pi-reversibility, which implies \pi is stationary. Irreducibility and aperiodicity together with the existence of a stationary distribution imply convergence in total variation by the fundamental theorem of Markov chains. The ergodic theorem for Markov chains then gives the almost sure convergence of time averages.

Why It Matters

This theorem is what lets you use the output of an MH chain to compute expectations, posterior means, credible intervals, and any other quantity that can be expressed as \mathbb{E}_\pi[f(X)]. It is the theoretical justification for all of MCMC-based Bayesian inference.

Failure Mode

If the proposal is too narrow, the chain explores slowly and may not effectively reach all regions of high probability within a practical number of iterations. The chain is still ergodic in theory, but convergence may be so slow that your finite sample is useless. Diagnosing this requires convergence diagnostics (trace plots, \hat{R}, effective sample size).

Random Walk MH vs Independence Sampler

Definition

Random Walk Metropolis

In random walk MH, the proposal is centered at the current state:

q(x' \mid x) = g(x' - x)

for some symmetric density g. A common choice is g = \mathcal{N}(0, \sigma^2 I). The acceptance ratio simplifies to \alpha = \min(1, \pi(x')/\pi(x)) because g is symmetric.

The step size \sigma controls a tradeoff: too small means slow exploration; too large means most proposals are rejected. The optimal acceptance rate for random walk MH in the specific high-dimensional diffusion-limit analysis of Roberts, Gelman, and Gilks (1997) is approximately 0.234. That is a useful benchmark for Gaussian random walks, not a universal target for every MH implementation.
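The tradeoff is easy to observe by measuring empirical acceptance rates at different step sizes. A sketch, with a standard normal target and illustrative step-size values:

```python
import numpy as np

def rw_metropolis(log_target, sigma, x0, n_steps, seed=0):
    """Random-walk Metropolis; returns the chain and the empirical acceptance rate."""
    rng = np.random.default_rng(seed)
    x, accepted = x0, 0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        x_prop = x + sigma * rng.standard_normal()
        # symmetric proposal, so alpha = min(1, pi(x')/pi(x)); compare in log space
        if np.log(rng.uniform()) <= log_target(x_prop) - log_target(x):
            x, accepted = x_prop, accepted + 1
        chain[t] = x
    return chain, accepted / n_steps

log_std_normal = lambda x: -0.5 * x * x   # N(0, 1) up to a constant
for sigma in (0.1, 2.4, 25.0):
    _, rate = rw_metropolis(log_std_normal, sigma, 0.0, 20_000)
    print(f"sigma={sigma:5.1f}  acceptance rate={rate:.2f}")
```

Tiny steps are almost always accepted but barely move; huge steps are almost always rejected; intermediate scales balance the two.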

Definition

Independence Sampler

In the independence sampler, the proposal ignores the current state:

q(x' \mid x) = q(x')

The acceptance ratio becomes \alpha = \min\!\left(1, \frac{\pi(x')\,q(x)}{\pi(x)\,q(x')}\right), which is the ratio of importance weights w(x') / w(x) with w(x) = \pi(x)/q(x). This works well only if q is a good approximation to \pi with heavier tails.
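A sketch of the independence sampler in code. The target (standard normal) and proposal (a wider normal, so the importance weights stay bounded) are illustrative choices, not a recommendation:

```python
import numpy as np

def independence_mh(log_target, log_prop, draw_prop, x0, n_steps, seed=0):
    """Independence MH: alpha = min(1, w(x')/w(x)) with w(x) = pi(x)/q(x)."""
    rng = np.random.default_rng(seed)
    x = x0
    log_w = log_target(x) - log_prop(x)      # current log importance weight
    chain = np.empty(n_steps)
    for t in range(n_steps):
        x_prop = draw_prop(rng)              # proposal ignores the current state
        log_w_prop = log_target(x_prop) - log_prop(x_prop)
        if np.log(rng.uniform()) <= log_w_prop - log_w:
            x, log_w = x_prop, log_w_prop
        chain[t] = x
    return chain

# Target N(0, 1); proposal N(0, 2^2) has heavier tails than the target,
# so the weights pi/q are bounded and the sampler mixes well.
chain = independence_mh(
    log_target=lambda x: -0.5 * x * x,
    log_prop=lambda x: -0.5 * (x / 2.0) ** 2,
    draw_prop=lambda r: 2.0 * r.standard_normal(),
    x0=0.0, n_steps=20_000)
```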

Burn-in

The chain's initial samples are influenced by the starting point x_0 and do not yet represent draws from \pi. The burn-in period is the initial segment of the chain that is discarded before collecting samples for inference. Choosing the burn-in length is an art informed by convergence diagnostics.

Formally, burn-in discards samples x_1, \ldots, x_B and uses only x_{B+1}, \ldots, x_T for estimation. There is no universal formula for B; it depends on the mixing rate of the chain.
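A small sketch of the transient, using an intentionally bad starting point far in the tail of a standard normal target (the specific numbers here are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
log_target = lambda x: -0.5 * x * x       # N(0, 1) up to a constant
x, chain = 50.0, np.empty(5_000)          # start far from the target's mass
for t in range(5_000):
    x_prop = x + rng.standard_normal()
    if np.log(rng.uniform()) <= log_target(x_prop) - log_target(x):
        x = x_prop
    chain[t] = x

B = 1_000                                 # burn-in length, chosen by inspecting the trace
kept = chain[B:]
# Early samples still carry the transient descent from x_0 = 50;
# the post-burn-in segment fluctuates around the true mean 0.
print(np.mean(chain[:200]), np.mean(kept))
```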

Diagnostics and Proposal Geometry

The practical failure mode of MH is not incorrectness; it is slow mixing. Two chains can both target the right stationary distribution and yet differ by orders of magnitude in usable effective sample size.

Three diagnostics matter more than raw acceptance rate:

  • Trace behavior and between-chain agreement. If chains started from dispersed initial values still occupy different regions, you do not yet have evidence of practical convergence. See burn-in and convergence diagnostics.
  • Effective sample size (ESS). High autocorrelation means thousands of MH steps may correspond to only tens of effectively independent draws.
  • Mode traversal. On multimodal targets, a random-walk proposal can look locally healthy while almost never crossing between important regions.

This is why plain random-walk MH becomes brittle in higher dimension. Proposal scales large enough to move globally are often rejected, while scales small enough to be accepted move only locally. Hamiltonian Monte Carlo was invented to escape this random-walk regime by proposing long, geometry-informed moves.

Where Plain MH Still Matters

Metropolis-Hastings remains valuable even when you would not deploy vanilla random-walk MH on a serious posterior:

  • it is the cleanest proof template for why MCMC can target an unnormalized distribution;
  • it still works well on low-dimensional targets, discrete spaces, and carefully engineered independence proposals;
  • many specialized samplers are best understood as "better proposals inside the MH framework," including pseudo-marginal MH and some reversible-jump constructions.

Canonical Examples

Example

Sampling from a mixture of Gaussians

Target: \pi(x) = 0.3\,\mathcal{N}(-3, 1) + 0.7\,\mathcal{N}(3, 1).

Using random walk MH with proposal q(x' \mid x) = \mathcal{N}(x, \sigma^2):

  • If \sigma = 0.1: most proposals are accepted but the chain moves slowly and may get stuck in one mode for long stretches
  • If \sigma = 10: the chain proposes large jumps but most are rejected because they land in low-probability regions
  • If \sigma \approx 2: the chain efficiently explores both modes

At each step, compute \alpha = \min(1, \pi(x')/\pi(x)) (since the proposal is symmetric). If the chain is at x = -3 and proposes x' = 3, then \alpha = \min(1, 0.7/0.3) = 1, so the move is always accepted. If the chain is at x = 3 and proposes x' = -3, then \alpha = \min(1, 0.3/0.7) \approx 0.43.
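The \sigma \approx 2 case can be checked numerically. A sketch (chain length and seed are arbitrary choices):

```python
import numpy as np

# log of pi(x) = 0.3 N(-3, 1) + 0.7 N(3, 1); logaddexp keeps it stable,
# and the shared 1/sqrt(2 pi) constant is dropped since it cancels in MH.
log_mix = lambda x: np.logaddexp(np.log(0.3) - 0.5 * (x + 3) ** 2,
                                 np.log(0.7) - 0.5 * (x - 3) ** 2)

rng = np.random.default_rng(0)
x, chain = -3.0, np.empty(100_000)
for t in range(100_000):
    x_prop = x + 2.0 * rng.standard_normal()   # sigma = 2 random walk
    if np.log(rng.uniform()) <= log_mix(x_prop) - log_mix(x):
        x = x_prop
    chain[t] = x

# The chain should visit both modes and spend roughly 70% of its time near +3.
frac_right = np.mean(chain > 0)
print(f"fraction of samples near +3: {frac_right:.2f}")
```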

Example

Bayesian inference for a normal mean

Prior: \mu \sim \mathcal{N}(0, \tau^2). Likelihood: x_i \mid \mu \sim \mathcal{N}(\mu, \sigma^2) for i = 1, \ldots, n.

The posterior is:

\pi(\mu \mid x) \propto \exp\!\left(-\frac{\mu^2}{2\tau^2} - \frac{\sum_i (x_i - \mu)^2}{2\sigma^2}\right)

This is a case where the posterior is available in closed form (it is normal with a precision-weighted mean), so we can verify that MH gives the right answer. Using random walk MH, the chain's samples converge to this known normal posterior.
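A sketch of this verification (the particular values of tau, sigma, n, and the proposal scale are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
tau, sigma, n = 2.0, 1.0, 20
data = rng.normal(1.5, sigma, size=n)            # synthetic observations

# Closed-form posterior: normal with precision-weighted mean
post_prec = n / sigma**2 + 1.0 / tau**2
post_mean = (data.sum() / sigma**2) / post_prec
post_sd = post_prec ** -0.5

# Unnormalized log posterior: log prior + log likelihood
log_post = lambda mu: -mu**2 / (2 * tau**2) - np.sum((data - mu) ** 2) / (2 * sigma**2)

mu, chain = 0.0, np.empty(50_000)
for t in range(50_000):
    prop = mu + 0.5 * rng.standard_normal()      # random-walk proposal
    if np.log(rng.uniform()) <= log_post(prop) - log_post(mu):
        mu = prop
    chain[t] = mu

kept = chain[1_000:]                             # discard burn-in
print(f"MH: {kept.mean():.3f} +/- {kept.std():.3f}   "
      f"closed form: {post_mean:.3f} +/- {post_sd:.3f}")
```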

Common Confusions

Watch Out

MH does NOT require knowing the normalizing constant

This is the single most important practical feature of MH. Because the acceptance ratio involves \pi(x')/\pi(x), any normalizing constant Z cancels:

\frac{\pi(x')}{\pi(x)} = \frac{\tilde{\pi}(x')/Z}{\tilde{\pi}(x)/Z} = \frac{\tilde{\pi}(x')}{\tilde{\pi}(x)}

You only need to evaluate the unnormalized target density. In Bayesian inference, this means you only need the prior times the likelihood; you never need to compute the evidence (marginal likelihood) p(\text{data}).
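One way to see the cancellation concretely: running the same chain (same random seed, so the same random numbers) with and without the normalizing constant produces identical accept/reject decisions. A small sketch with a standard normal target:

```python
import numpy as np

log_unnorm = lambda x: -0.5 * x * x                            # N(0, 1) up to Z
log_norm = lambda x: -0.5 * x * x - 0.5 * np.log(2 * np.pi)    # fully normalized

def run(log_target, seed=0, n_steps=2_000):
    rng = np.random.default_rng(seed)
    x, out = 0.0, np.empty(n_steps)
    for t in range(n_steps):
        x_prop = x + rng.standard_normal()
        # the constant -0.5 * log(2 pi) cancels in this difference
        if np.log(rng.uniform()) <= log_target(x_prop) - log_target(x):
            x = x_prop
        out[t] = x
    return out

# Same seed, same decisions, same chain: Z never mattered.
print("identical chains:", np.array_equal(run(log_unnorm), run(log_norm)))
```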

Watch Out

Rejected proposals are NOT wasted

When a proposal is rejected and the chain stays at x_t, this is not a failure. It is part of the algorithm working correctly. Repeated copies of x_t in the chain are needed to correctly represent the probability mass at that point. An acceptance rate of 100% would mean the chain is not properly weighting different regions of the state space.

Watch Out

MH samples are NOT independent

Consecutive samples x_t, x_{t+1} are correlated because each depends on the previous. This autocorrelation reduces the effective sample size. If you need approximately independent samples, you can thin the chain (keep every k-th sample), though it is generally more efficient to simply run the chain longer and report the effective sample size.
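The effective sample size can be estimated directly from the chain's empirical autocorrelations. The sketch below uses a deliberately simple truncate-at-first-negative-lag rule; production diagnostics libraries (e.g., ArviZ) use more robust estimators:

```python
import numpy as np

def effective_sample_size(chain):
    """ESS = n / (1 + 2 * sum of autocorrelations), truncated at the first negative lag."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    # empirical autocorrelation at lags 0 .. n-1
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    tau = 1.0
    for rho in acf[1:]:
        if rho < 0:              # crude truncation rule
            break
        tau += 2.0 * rho
    return n / tau

rng = np.random.default_rng(0)
iid = rng.standard_normal(5_000)          # independent draws: ESS close to n
ar = np.empty(5_000)
ar[0] = 0.0
for t in range(1, 5_000):                 # strongly autocorrelated chain: ESS << n
    ar[t] = 0.95 * ar[t - 1] + rng.standard_normal()

print(f"iid ESS ~ {effective_sample_size(iid):.0f}, "
      f"AR(1) ESS ~ {effective_sample_size(ar):.0f}")
```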

Watch Out

A 0.234 acceptance rate is not a universal law

The famous 0.234 number applies to one asymptotic regime: Gaussian random-walk proposals on high-dimensional product targets. It is a useful benchmark for that setting, not a universal optimum. Independence samplers, discrete proposals, heavy-tailed proposals, and geometry-aware methods can all have very different healthy acceptance rates.

Watch Out

High acceptance can still mean a bad sampler

If the proposal scale is tiny, acceptance can be near 100% while the chain barely moves. That is not good mixing; it is local dithering. Acceptance rate must be read together with ESS, trace behavior, and whether the chain actually traverses the relevant posterior mass.

Summary

  • MH constructs a Markov chain with target \pi as its stationary distribution
  • Acceptance ratio: \alpha = \min(1, \pi(x')\,q(x \mid x')/(\pi(x)\,q(x' \mid x)))
  • Only the unnormalized target density is needed; normalizing constants cancel
  • Detailed balance is the core theoretical guarantee
  • Ergodicity (irreducibility + aperiodicity) ensures convergence from any start
  • Random walk MH: optimal acceptance rate \approx 0.234 in the high-dimensional Gaussian regime
  • Burn-in period must be discarded before using samples for inference
  • Finite-time usefulness depends on proposal geometry, ESS, and whether the chain traverses modes in practice

Exercises

ExerciseCore

Problem

Suppose the proposal distribution is symmetric, i.e., q(x' \mid x) = q(x \mid x') for all x, x'. Derive the acceptance ratio and show that it depends only on the ratio \pi(x')/\pi(x).

ExerciseCore

Problem

Verify the detailed balance condition for MH directly. That is, show:

\pi(x)\,q(x' \mid x)\,\alpha(x, x') = \pi(x')\,q(x \mid x')\,\alpha(x', x)

for the general (asymmetric) case.

ExerciseAdvanced

Problem

Consider an independence sampler with proposal q(x') = \mathcal{N}(0, 1) targeting \pi(x) \propto e^{-|x|} (a Laplace distribution). Write down the acceptance ratio. Will this sampler work well? Why or why not?

References

Canonical:

  • Metropolis, Rosenbluth, Rosenbluth, Teller, Teller (1953), "Equation of State Calculations by Fast Computing Machines"
  • Hastings (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications"

Current:

  • Robert & Casella, Monte Carlo Statistical Methods (2004), Chapters 6-7
  • Brooks, Gelman, Jones, Meng, Handbook of Markov Chain Monte Carlo (2011), Chapter 1
  • Tierney, "Markov Chains for Exploring Posterior Distributions" (Annals of Statistics, 1994)
  • Roberts, Gelman, and Gilks, "Weak convergence and optimal scaling of random walk Metropolis algorithms" (Annals of Applied Probability, 1997)
  • Gelman et al., Bayesian Data Analysis (2013), Chapters 11-12


Last reviewed: April 25, 2026
