
Statistical Estimation

Bayesian Estimation

The Bayesian approach to parameter estimation: encode prior beliefs, update with data via Bayes rule, and obtain a full posterior distribution over parameters. Topics include conjugate priors, MAP estimation, and the Bernstein-von Mises theorem, which shows that the posterior concentrates around the true parameter.

Core · Tier 2 · Stable · Supporting · ~60 min

Why This Matters

Maximum likelihood estimation gives you a single point estimate of the parameter. Bayesian estimation gives you a full distribution over the parameter. This distribution tells you not just "what is the best guess?" but "how uncertain am I, and in what directions?"

Bayesian methods dominate when you have small samples, informative prior knowledge, or need to quantify uncertainty. They are the foundation of Gaussian processes, Bayesian neural networks, and modern probabilistic programming. The Bernstein-von Mises theorem provides the bridge back to frequentist theory: as the sample size grows, the posterior converges to a Gaussian centered at the MLE.

theorem visual

Prior, likelihood, posterior

Bayes rule multiplies belief by evidence, then renormalizes. The posterior is not a point estimate; it is the updated uncertainty curve.

[Figure: prior, likelihood, and posterior densities on $[0, 1]$, with the prior mean, posterior mean, and sample rate marked.]

starting belief

Prior: $\mathrm{Beta}(10, 10)$. This encodes twenty pseudo-observations centered near a fair coin.

data pull

Observed data: $k$ heads in $n$ flips. The likelihood peaks near the sample rate $k/n$.

updated uncertainty

Posterior: $\mathrm{Beta}(10 + k, 10 + n - k)$, with mean pulled from the prior mean toward the sample rate.

Mental Model

You start with a prior belief $p(\theta)$ about the parameter before seeing any data. You observe data $\mathbf{x}$. Bayes rule updates your belief to the posterior $p(\theta \mid \mathbf{x})$. The posterior is a compromise between the prior and the likelihood: with little data, the prior dominates; with lots of data, the likelihood dominates and the prior is washed out.

Think of the prior as your starting position and the data as a force that pulls you toward the truth. The posterior is where you end up after the pull. Ignoring the prior leads to the base rate fallacy, one of the most common reasoning errors in applied probability.

Formal Setup and Notation

Let $\theta \in \Theta$ be a parameter, $p(\theta)$ a prior distribution, and $p(\mathbf{x} \mid \theta)$ the likelihood of observing data $\mathbf{x}$ given $\theta$.

Definition

Posterior Distribution

The posterior distribution is given by Bayes rule:

$$p(\theta \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \theta)\, p(\theta)}{p(\mathbf{x})}$$

where $p(\mathbf{x}) = \int p(\mathbf{x} \mid \theta)\, p(\theta)\, d\theta$ is the marginal likelihood (evidence). The posterior is proportional to prior times likelihood:

$$p(\theta \mid \mathbf{x}) \propto p(\theta) \cdot p(\mathbf{x} \mid \theta)$$

Definition

MAP Estimation

The maximum a posteriori (MAP) estimator is the mode of the posterior:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \; p(\theta \mid \mathbf{x}) = \arg\max_\theta \; \left[\log p(\mathbf{x} \mid \theta) + \log p(\theta)\right]$$

MAP is like MLE with a regularization term $\log p(\theta)$. With a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2)$, MAP is equivalent to $\ell_2$-regularized MLE (ridge regression).
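This equivalence can be checked numerically. The sketch below, with made-up data values, computes the closed-form MAP estimate for a Gaussian mean under a $\mathcal{N}(0, \tau^2)$ prior; setting the gradient of the log-posterior to zero gives $\hat{\theta}_{\text{MAP}} = n\bar{x}/(n + \lambda)$ with ridge penalty $\lambda = \sigma^2/\tau^2$.

```python
# MAP for a Gaussian mean with a N(0, tau^2) prior -- a minimal sketch.
# Maximizing log p(x|theta) + log p(theta) in closed form gives
# theta_MAP = n * xbar / (n + lambda), with lambda = sigma^2 / tau^2.

def map_gaussian_mean(xs, sigma_sq, tau_sq):
    n = len(xs)
    xbar = sum(xs) / n
    lam = sigma_sq / tau_sq          # effective ridge regularization
    return n * xbar / (n + lam)

xs = [2.0, 1.5, 2.5, 2.0]            # hypothetical observations, xbar = 2.0
theta_map = map_gaussian_mean(xs, sigma_sq=1.0, tau_sq=4.0)
print(theta_map)                     # shrunk from the MLE 2.0 toward the prior mean 0
```

A tighter prior (smaller $\tau^2$) means a larger $\lambda$ and more shrinkage toward zero.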

Conjugate Priors

A prior $p(\theta)$ is conjugate to a likelihood $p(\mathbf{x} \mid \theta)$ if and only if the posterior $p(\theta \mid \mathbf{x})$ belongs to the same family as the prior. Conjugacy makes Bayesian updates analytically tractable.

Proposition

Conjugate Prior Updates

Statement

The three most important conjugate pairs are:

Beta-Binomial. Prior: $\theta \sim \text{Beta}(\alpha, \beta)$. Likelihood: $k$ successes in $n$ Bernoulli trials. Posterior: $\theta \mid k \sim \text{Beta}(\alpha + k, \beta + n - k)$.

Normal-Normal. Prior: $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$. Likelihood: $x_1, \ldots, x_n \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$. Posterior: $\mu \mid \mathbf{x} \sim \mathcal{N}(\mu_n, \sigma_n^2)$ where:

$$\mu_n = \frac{\sigma^2 \mu_0 + n \sigma_0^2 \bar{x}}{\sigma^2 + n \sigma_0^2}, \qquad \sigma_n^2 = \frac{\sigma^2 \sigma_0^2}{\sigma^2 + n \sigma_0^2}$$

Gamma-Poisson. Prior: $\lambda \sim \text{Gamma}(\alpha, \beta)$. Likelihood: $x_1, \ldots, x_n \sim \text{Poisson}(\lambda)$. Posterior: $\lambda \mid \mathbf{x} \sim \text{Gamma}(\alpha + \sum_i x_i, \beta + n)$.

Intuition

In each case, the prior parameters act as "pseudo-observations." For the beta-binomial, $\alpha$ acts like prior successes and $\beta$ like prior failures. The posterior combines these with the actual observed counts. As $n \to \infty$, the prior contribution vanishes and the posterior concentrates on the MLE.

Why It Matters

Conjugate priors are the workhorses of practical Bayesian analysis. They give closed-form posteriors, making inference fast and interpretable. Even when the true prior is not conjugate, conjugate families serve as tractable approximations.
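The beta-binomial update is a one-liner in code. A minimal sketch, with hypothetical counts (a $\text{Beta}(2, 2)$ prior and 7 heads in 10 flips):

```python
# Beta-binomial conjugate update -- a sketch with made-up counts.
def update_beta(alpha, beta, k, n):
    """Posterior Beta parameters after k successes in n Bernoulli trials."""
    return alpha + k, beta + (n - k)

a, b = update_beta(alpha=2, beta=2, k=7, n=10)   # Beta(2,2) prior, 7/10 heads
posterior_mean = a / (a + b)                     # (alpha + k) / (alpha + beta + n)
print(a, b, posterior_mean)                      # 9 5 0.6428...
```

The MLE is $k/n = 0.7$; the four pseudo-observations of the prior pull the posterior mean back toward 0.5, giving $9/14 \approx 0.643$.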

The Normal-Normal Update in Detail

The normal-normal case is particularly revealing. The posterior mean is a precision-weighted average of the prior mean and the data mean:

$$\mu_n = \frac{\tau_0}{\tau_0 + n\tau} \mu_0 + \frac{n\tau}{\tau_0 + n\tau} \bar{x}$$

where $\tau_0 = 1/\sigma_0^2$ is the prior precision and $\tau = 1/\sigma^2$ is the data precision per observation. The posterior precision is the sum $\tau_0 + n\tau$. More data means more precision. A tighter prior (large $\tau_0$) means the prior has more influence.

With $n = 0$ (no data), the posterior equals the prior. As $n \to \infty$, the posterior mean approaches $\bar{x}$ and the posterior variance approaches $\sigma^2/n$, recovering the MLE and its sampling distribution.
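The precision-weighted update translates directly into code. A sketch with invented numbers (prior $\mathcal{N}(5, 4)$, known data variance $\sigma^2 = 1$):

```python
# Precision-weighted normal-normal update -- a sketch with made-up data.
def update_normal(mu0, sigma0_sq, sigma_sq, xs):
    """Posterior (mean, variance) for a normal mean with known sigma^2."""
    tau0 = 1.0 / sigma0_sq           # prior precision
    tau = 1.0 / sigma_sq             # precision per observation
    n = len(xs)
    xbar = sum(xs) / n
    tau_n = tau0 + n * tau           # posterior precision: precisions add
    mu_n = (tau0 * mu0 + n * tau * xbar) / tau_n
    return mu_n, 1.0 / tau_n

xs = [6.2, 5.8, 6.5, 6.1]            # hypothetical data, xbar = 6.15
mu_n, var_n = update_normal(mu0=5.0, sigma0_sq=4.0, sigma_sq=1.0, xs=xs)
print(mu_n, var_n)                   # mean pulled from 5.0 toward 6.15
```

With four observations at precision 1 against a prior precision of 0.25, the data get weight $4/4.25 \approx 0.94$, so the posterior mean sits close to $\bar{x}$.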

Posterior Consistency: Bernstein-von Mises

Theorem

Bernstein-von Mises Theorem

Statement

Under regularity conditions, as $n \to \infty$, the posterior distribution converges in total variation to a Gaussian centered at the MLE:

$$\left\| p(\theta \mid \mathbf{x}_n) - \mathcal{N}\!\left(\hat{\theta}_{\text{MLE}}, \frac{1}{n I(\theta_0)}\right) \right\|_{\text{TV}} \xrightarrow{P} 0$$

where $I(\theta_0)$ is the Fisher information at the true parameter. The posterior concentrates around the true parameter at rate $1/\sqrt{n}$ and its width matches the frequentist standard error.

Intuition

With enough data, the likelihood overwhelms the prior and the posterior becomes approximately Gaussian centered at the MLE. The prior does not matter asymptotically (as long as it puts positive mass near the truth). Bayesian and frequentist inference agree in the large-sample limit.

Proof Sketch

Expand the log-posterior around the MLE using a Taylor expansion. The log-likelihood term dominates the log-prior for large $n$. The quadratic approximation to the log-likelihood gives a Gaussian with precision $n I(\theta_0)$. The prior contributes a term of order $O(1)$ that becomes negligible compared to the $O(n)$ likelihood term.

Why It Matters

This theorem bridges Bayesian and frequentist statistics. It justifies using Bayesian credible intervals as approximate confidence intervals in large samples. It also shows that Bayesian inference is consistent: the posterior concentrates on the truth.

Failure Mode

The theorem requires the model to be correctly specified ($\theta_0$ is in the model). If the model is misspecified, the posterior concentrates on the KL-closest parameter to the truth, not the truth itself. The theorem also fails for non-regular models, infinite-dimensional parameters (nonparametric Bayesian models require separate theory), and improper priors in some cases.
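The theorem can be checked numerically in the Bernoulli case, where everything is closed-form. The sketch below, with hypothetical large-sample counts, compares the exact Beta posterior variance to the frequentist variance $\hat{p}(1-\hat{p})/n = 1/(n I(\hat{p}))$:

```python
# Numerical check of Bernstein-von Mises for Bernoulli data -- a sketch.
# For large n, the Beta posterior variance should approach the frequentist
# variance p_hat * (1 - p_hat) / n, whatever mild prior we started from.

n, k = 10_000, 6_000                 # hypothetical large-sample counts
alpha, beta = 2, 2                   # a weak prior
a, b = alpha + k, beta + n - k
post_var = a * b / ((a + b) ** 2 * (a + b + 1))   # exact Beta(a, b) variance
p_hat = k / n
freq_var = p_hat * (1 - p_hat) / n   # 1 / (n * Fisher information at p_hat)
print(post_var, freq_var)            # agree to within ~0.1%
```

Repeating this with a much stronger prior (say $\text{Beta}(200, 200)$) still converges, just more slowly, which is the "prior washes out" claim in quantitative form.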

Credible Intervals vs Confidence Intervals

A 95% credible interval $[a, b]$ satisfies $P(\theta \in [a, b] \mid \mathbf{x}) = 0.95$. It is a statement about $\theta$ given the data.

A 95% confidence interval $[a, b]$ satisfies: if you repeat the experiment many times, 95% of the intervals will contain the true $\theta$. It is a statement about the procedure, not about $\theta$ for the observed data.

These are structurally different interpretations. However, by Bernstein-von Mises, Bayesian credible intervals and frequentist confidence intervals coincide asymptotically. In finite samples they can differ, especially when the prior is informative.
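In practice, an equal-tailed credible interval is just a pair of posterior quantiles, which can be read off posterior samples. A Monte Carlo sketch for a hypothetical $\text{Beta}(9, 5)$ posterior:

```python
import random

# Equal-tailed 95% credible interval by Monte Carlo -- a sketch.
# Hypothetical posterior Beta(9, 5), e.g. a Beta(2,2) prior after 7/10 heads.
random.seed(0)
draws = sorted(random.betavariate(9, 5) for _ in range(100_000))
lo, hi = draws[2_500], draws[97_500]   # empirical 2.5% and 97.5% quantiles
print(lo, hi)                          # roughly (0.38, 0.87)
```

The interpretation is directly Bayesian: given the data, $\theta$ lies in $[{\tt lo}, {\tt hi}]$ with posterior probability 0.95.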

Marginal Likelihood (Bayesian Evidence)

The denominator in Bayes rule is the marginal likelihood (also called the evidence):

$$m(x) = \int p(x \mid \theta)\, \pi(\theta)\, d\theta$$

It plays three distinct roles.

  • Normalization constant. It makes the posterior $p(\theta \mid x) = p(x \mid \theta)\,\pi(\theta)/m(x)$ integrate to 1. For pointwise inference on $\theta$, it is a constant and can be ignored.
  • Bayes factor. For model comparison between $M_0$ and $M_1$, the ratio $B_{10} = m_1(x)/m_0(x)$ quantifies evidence for $M_1$ over $M_0$.
  • Empirical Bayes. Given a prior $\pi(\theta \mid \tau)$ with hyperparameter $\tau$, the marginal $m(x \mid \tau) = \int p(x \mid \theta)\, \pi(\theta \mid \tau)\, d\theta$ can be maximized over $\tau$ to pick a data-driven prior. This is the core mechanism behind small-area estimation and shrinkage estimators.

Computation of $m(x)$ is usually intractable because the integral is high-dimensional. Standard approximations are the Laplace approximation (Gaussian fit at the mode), variational inference (fit a tractable $q$ to the posterior), bridge sampling, and nested sampling.
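In the conjugate case the integral is tractable. For the beta-binomial model, $m(x) = \binom{n}{k} B(\alpha + k, \beta + n - k)/B(\alpha, \beta)$, which the sketch below evaluates in log space using only the standard library:

```python
from math import lgamma, exp

# Exact log marginal likelihood for the beta-binomial model -- a sketch.
# m(x) = C(n,k) * B(alpha + k, beta + n - k) / B(alpha, beta).

def log_beta_fn(a, b):
    """log of the Beta function via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(alpha, beta, k, n):
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta_fn(alpha + k, beta + n - k) - log_beta_fn(alpha, beta)

# Sanity check: under a uniform Beta(1,1) prior, m(x) = 1/(n+1) for every k.
print(exp(log_marginal(1, 1, 7, 10)))   # 1/11 = 0.0909...
```

The uniform-prior identity $m(x) = 1/(n+1)$ is a useful unit test: every head count from 0 to $n$ is equally likely before seeing the data.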

Posterior Predictive Distribution

To predict a new observation $x_{\text{new}}$ with parameter uncertainty integrated out, use the posterior predictive distribution:

$$p(x_{\text{new}} \mid x) = \int p(x_{\text{new}} \mid \theta)\, \pi(\theta \mid x)\, d\theta$$

Unlike plugging in the MAP or posterior mean, this propagates the full posterior uncertainty into the prediction. For the beta-binomial conjugate setup, the posterior predictive for a single future trial is Bernoulli with success probability equal to the posterior mean, and the posterior predictive for $m$ future trials is beta-binomial with parameters $(m, \alpha + k, \beta + n - k)$.
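When the integral has no closed form, the standard recipe is Monte Carlo: draw $\theta$ from the posterior, average $p(x_{\text{new}} \mid \theta)$ over the draws. A sketch for a hypothetical $\text{Beta}(9, 5)$ posterior, where the single-trial answer is known exactly (the posterior mean), so the estimate can be checked:

```python
import random

# Posterior predictive by Monte Carlo -- a sketch.
# Draw theta ~ posterior, then average p(x_new = heads | theta) = theta.
random.seed(0)
a, b = 9, 5                          # hypothetical Beta posterior parameters
M = 100_000
p_heads = sum(random.betavariate(a, b) for _ in range(M)) / M
print(p_heads)                       # -> posterior mean a/(a+b) = 0.6428...
```

For a single Bernoulli trial the averaging collapses to the posterior mean; the Monte Carlo machinery pays off for nonlinear predictive quantities such as tail probabilities of future counts.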

Bayes Factors and Model Comparison

The Bayes factor comparing $M_1$ to $M_0$ is

$$B_{10} = \frac{m_1(x)}{m_0(x)} = \frac{\int p(x \mid \theta_1, M_1)\, \pi(\theta_1 \mid M_1)\, d\theta_1}{\int p(x \mid \theta_0, M_0)\, \pi(\theta_0 \mid M_0)\, d\theta_0}$$

Jeffreys' scale interprets $B_{10} > 10$ as strong evidence for $M_1$, $B_{10} > 30$ as very strong, and $B_{10} > 100$ as decisive.

Compared to the frequentist likelihood ratio test, Bayes factors penalize model complexity automatically: a more flexible model spreads its prior mass over more parameter values, which lowers $m(x)$ unless the data strongly prefer that flexibility. This is often called "Occam's razor via the evidence."

Lindley's paradox. In point-null hypothesis testing with a diffuse prior under $M_1$, the Bayes factor can strongly favor $M_0$ on the same data where the frequentist $p$-value rejects $M_0$. The divergence grows with $n$: frequentist rejection thresholds scale like $1/\sqrt{n}$ while the Bayes factor penalty for a diffuse prior scales like $\sqrt{n}$. The two frameworks are answering different questions.
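The paradox is easy to reproduce numerically. The sketch below uses hypothetical coin-flip counts: $M_0$ is the point null $\theta = 1/2$, $M_1$ puts a uniform prior on $\theta$, and both marginal likelihoods are exact for the binomial model:

```python
from math import lgamma, log, exp, sqrt, erfc

# Lindley's paradox, numerically -- a sketch with hypothetical counts.
# M0: theta = 1/2 exactly.  M1: theta ~ Uniform(0, 1).

def log_beta_fn(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

n, k = 1_000_000, 501_600
log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
log_m0 = log_choose + n * log(0.5)                     # point-null likelihood
log_m1 = log_choose + log_beta_fn(k + 1, n - k + 1)    # uniform-prior evidence
bf01 = exp(log_m0 - log_m1)                            # Bayes factor for M0

z = (k - 0.5 * n) / sqrt(0.25 * n)   # frequentist z statistic (here 3.2)
p_value = erfc(z / sqrt(2))          # two-sided normal approximation
print(bf01, z, p_value)              # BF favors M0 while the p-value rejects it
```

At these counts the $p$-value is around $10^{-3}$, yet $B_{01} > 1$: the diffuse alternative wastes prior mass on values of $\theta$ far from the data, so the point null wins the evidence comparison.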

When Bayesian is Better

Bayesian estimation shines in several settings:

  • Small samples. The prior regularizes estimation when data is scarce. Without a prior, MLE can overfit or be undefined
  • Informative priors. When domain knowledge constrains the parameter (e.g., a probability must be near 0.5, a physical constant is known approximately), the prior encodes this and improves the estimate
  • Uncertainty quantification. The full posterior gives calibrated uncertainty bands, not just a point estimate. This is critical for decision making under uncertainty
  • Hierarchical models. Bayesian methods naturally handle multi-level structure where parameters at one level serve as priors for the next. This is the basis of small-area estimation, where empirical Bayes borrows strength across related units

Common Confusions

Watch Out

The prior is not arbitrary

A common criticism is that the prior is "subjective." But the prior can be chosen systematically: use domain knowledge, previous studies, or weakly-informative priors that regularize without strongly constraining. By Bernstein-von Mises, reasonable priors all lead to the same posterior with enough data. The choice matters most when data is scarce, which is exactly when you should use prior knowledge. The Monty Hall problem is a classic example where the canonical error is not in the prior (uniform over doors is correct) but in failing to condition on the host's asymmetric action: the host must open a non-winning, non-chosen door, and this constraint makes his choice informative about the remaining door.

Watch Out

MAP is not full Bayesian inference

MAP gives a point estimate (the posterior mode). Full Bayesian inference uses the entire posterior distribution. MAP ignores posterior uncertainty and can give misleading results when the posterior is skewed or multimodal. Use the posterior mean or full posterior for uncertainty quantification.

Watch Out

Flat priors are not always noninformative

A flat (uniform) prior on $\theta$ is not flat on $\theta^2$ or $\log\theta$. The notion of "noninformative" depends on the parameterization. The Jeffreys prior $p(\theta) \propto \sqrt{I(\theta)}$ is invariant to reparameterization but can be improper (not integrating to 1). Reference priors and weakly-informative priors are more practical alternatives.

Summary

  • Posterior $\propto$ prior $\times$ likelihood. The posterior is a compromise between prior belief and observed data
  • Conjugate priors give closed-form posteriors: beta-binomial, normal-normal, gamma-Poisson
  • MAP estimation adds $\log p(\theta)$ to the log-likelihood, equivalent to regularization
  • Bernstein-von Mises: the posterior converges to $\mathcal{N}(\hat{\theta}_{\text{MLE}}, 1/(n I(\theta_0)))$ as $n \to \infty$
  • Credible intervals are probability statements about $\theta$; confidence intervals are probability statements about the procedure
  • Bayesian methods are best when data is scarce, priors are informative, or you need full uncertainty quantification

Exercises

ExerciseCore

Problem

You have a coin that you believe is roughly fair. You choose a $\text{Beta}(10, 10)$ prior for the probability $\theta$ of heads. You flip the coin 20 times and observe 15 heads. What is the posterior distribution? What is the posterior mean?

ExerciseAdvanced

Problem

Show that MAP estimation with a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2)$ and Gaussian likelihood $x_i \sim \mathcal{N}(\theta, \sigma^2)$ is equivalent to ridge regression. What is the effective regularization parameter $\lambda$ in terms of $\sigma^2$ and $\tau^2$?

ExerciseResearch

Problem

The Bernstein-von Mises theorem requires the model to be correctly specified. What happens to the posterior when the model is misspecified? Give a concrete example where the posterior concentrates on a parameter value that is not the "true" parameter, and explain what that parameter represents.

References

Canonical:

  • Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer. Ch. 4.
  • Bernardo, J.M., & Smith, A.F.M. (1994). Bayesian Theory. Wiley. Ch. 1-5.
  • Robert, C.P. (2007). The Bayesian Choice, 2nd ed. Springer. Ch. 2-4, 7.
  • Gelman et al. (2013). Bayesian Data Analysis, 3rd ed. CRC Press. Ch. 2-3, 5.
  • Lehmann, E.L., & Casella, G. (1998). Theory of Point Estimation, 2nd ed. Springer. Ch. 4, 5.
  • Casella, G., & Berger, R.L. (2001). Statistical Inference, 2nd ed. Duxbury. Ch. 4.7, 7.2.3.

Current:

  • McElreath, R. (2020). Statistical Rethinking, 2nd ed. CRC Press.
  • van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge. Ch. 10 (Bernstein-von Mises).


Last reviewed: April 22, 2026
