
Statistical Estimation

Bayesian Estimation

The Bayesian approach to parameter estimation: encode prior beliefs, update with data via Bayes rule, and obtain a full posterior distribution over parameters. Topics include conjugate priors, MAP estimation, and the Bernstein-von Mises theorem, which shows that the posterior concentrates around the true parameter.

Core · Tier 2 · Stable · Supporting · ~60 min

Why This Matters

Maximum likelihood estimation gives you a single point estimate of the parameter. Bayesian estimation gives you a full distribution over the parameter. This distribution tells you not just "what is the best guess?" but "how uncertain am I, and in what directions?"

Bayesian methods dominate when you have small samples, informative prior knowledge, or need to quantify uncertainty. They are the foundation of Gaussian processes, Bayesian neural networks, and modern probabilistic programming. The Bernstein-von Mises theorem provides the bridge back to frequentist theory: as the sample size grows, the posterior converges to a Gaussian centered at the MLE.

theorem visual

Prior, likelihood, posterior

Bayes rule multiplies belief by evidence, then renormalizes. The posterior is not a point estimate; it is the updated uncertainty curve.

[Figure: prior, likelihood, and posterior densities on $[0, 1]$, with the prior mean, posterior mean, and sample rate marked.]

starting belief

Prior: $\mathrm{Beta}(10, 10)$. This encodes twenty pseudo-observations centered near a fair coin.

data pull

Observed data: $k$ heads in $n$ flips. The likelihood peaks near the sample rate $k/n$.

updated uncertainty

Posterior: $\mathrm{Beta}(10 + k, 10 + n - k)$, with mean pulled from the prior mean toward the sample rate.

Mental Model

You start with a prior belief $p(\theta)$ about the parameter before seeing any data. You observe data $\mathbf{x}$. Bayes rule updates your belief to the posterior $p(\theta \mid \mathbf{x})$. The posterior is a compromise between the prior and the likelihood: with little data, the prior dominates; with lots of data, the likelihood dominates and the prior is washed out.

Think of the prior as your starting position and the data as a force that pulls you toward the truth. The posterior is where you end up after the pull. Ignoring the prior leads to the base rate fallacy, one of the most common reasoning errors in applied probability.

Formal Setup and Notation

Let $\theta \in \Theta$ be a parameter, $p(\theta)$ a prior distribution, and $p(\mathbf{x} \mid \theta)$ the likelihood of observing data $\mathbf{x}$ given $\theta$.

Definition

Posterior Distribution

The posterior distribution is given by Bayes rule:

$$p(\theta \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \theta)\, p(\theta)}{p(\mathbf{x})}$$

where $p(\mathbf{x}) = \int p(\mathbf{x} \mid \theta)\, p(\theta)\, d\theta$ is the marginal likelihood (evidence). The posterior is proportional to prior times likelihood:

$$p(\theta \mid \mathbf{x}) \propto p(\theta) \cdot p(\mathbf{x} \mid \theta)$$

Definition

MAP Estimation

The maximum a posteriori (MAP) estimator is the mode of the posterior:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \; p(\theta \mid \mathbf{x}) = \arg\max_\theta \; \left[\log p(\mathbf{x} \mid \theta) + \log p(\theta)\right]$$

MAP is like MLE with a regularization term $\log p(\theta)$. With a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2)$, MAP is equivalent to $\ell_2$-regularized MLE (ridge regression).
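This equivalence can be checked numerically. The sketch below, with made-up data values, computes the closed-form MAP estimate for a Gaussian mean under a $\mathcal{N}(0, \tau^2)$ prior; setting the gradient of the log-posterior to zero gives $\hat{\theta}_{\text{MAP}} = n\bar{x}/(n + \lambda)$ with ridge penalty $\lambda = \sigma^2/\tau^2$.

```python
# MAP for a Gaussian mean with a N(0, tau^2) prior -- a minimal sketch.
# Maximizing log p(x|theta) + log p(theta) in closed form gives
# theta_MAP = n * xbar / (n + lambda), with lambda = sigma^2 / tau^2.

def map_gaussian_mean(xs, sigma_sq, tau_sq):
    n = len(xs)
    xbar = sum(xs) / n
    lam = sigma_sq / tau_sq          # effective ridge regularization
    return n * xbar / (n + lam)

xs = [2.0, 1.5, 2.5, 2.0]            # hypothetical observations, xbar = 2.0
theta_map = map_gaussian_mean(xs, sigma_sq=1.0, tau_sq=4.0)
print(theta_map)                     # shrunk from the MLE 2.0 toward the prior mean 0
```

A tighter prior (smaller $\tau^2$) means a larger $\lambda$ and more shrinkage toward zero.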

Conjugate Priors

A prior $p(\theta)$ is conjugate to a likelihood $p(\mathbf{x} \mid \theta)$ if and only if the posterior $p(\theta \mid \mathbf{x})$ belongs to the same family as the prior. Conjugacy makes Bayesian updates analytically tractable.

Proposition

Conjugate Prior Updates

Statement

The three most important conjugate pairs are:

Beta-Binomial. Prior: $\theta \sim \text{Beta}(\alpha, \beta)$. Likelihood: $k$ successes in $n$ Bernoulli trials. Posterior: $\theta \mid k \sim \text{Beta}(\alpha + k, \beta + n - k)$.

Normal-Normal. Prior: $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$. Likelihood: $x_1, \ldots, x_n \sim \mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$. Posterior: $\mu \mid \mathbf{x} \sim \mathcal{N}(\mu_n, \sigma_n^2)$ where:

$$\mu_n = \frac{\sigma^2 \mu_0 + n \sigma_0^2 \bar{x}}{\sigma^2 + n \sigma_0^2}, \qquad \sigma_n^2 = \frac{\sigma^2 \sigma_0^2}{\sigma^2 + n \sigma_0^2}$$

Gamma-Poisson. Prior: $\lambda \sim \text{Gamma}(\alpha, \beta)$. Likelihood: $x_1, \ldots, x_n \sim \text{Poisson}(\lambda)$. Posterior: $\lambda \mid \mathbf{x} \sim \text{Gamma}(\alpha + \sum_i x_i, \beta + n)$.

Intuition

In each case, the prior parameters act as "pseudo-observations." For the beta-binomial, $\alpha$ acts like prior successes and $\beta$ like prior failures. The posterior combines these with the actual observed counts. As $n \to \infty$, the prior contribution vanishes and the posterior concentrates on the MLE.

Why It Matters

Conjugate priors are the workhorses of practical Bayesian analysis. They give closed-form posteriors, making inference fast and interpretable. Even when the true prior is not conjugate, conjugate families serve as tractable approximations.
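The beta-binomial update is a one-liner in code. A minimal sketch, with hypothetical counts (a $\text{Beta}(2, 2)$ prior and 7 heads in 10 flips):

```python
# Beta-binomial conjugate update -- a sketch with made-up counts.
def update_beta(alpha, beta, k, n):
    """Posterior Beta parameters after k successes in n Bernoulli trials."""
    return alpha + k, beta + (n - k)

a, b = update_beta(alpha=2, beta=2, k=7, n=10)   # Beta(2,2) prior, 7/10 heads
posterior_mean = a / (a + b)                     # (alpha + k) / (alpha + beta + n)
print(a, b, posterior_mean)                      # 9 5 0.6428...
```

The MLE is $k/n = 0.7$; the four pseudo-observations of the prior pull the posterior mean back toward 0.5, giving $9/14 \approx 0.643$.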

The Normal-Normal Update in Detail

The normal-normal case is particularly revealing. The posterior mean is a precision-weighted average of the prior mean and the data mean:

$$\mu_n = \frac{\tau_0}{\tau_0 + n\tau} \mu_0 + \frac{n\tau}{\tau_0 + n\tau} \bar{x}$$

where $\tau_0 = 1/\sigma_0^2$ is the prior precision and $\tau = 1/\sigma^2$ is the data precision per observation. The posterior precision is the sum $\tau_0 + n\tau$. More data means more precision. A tighter prior (large $\tau_0$) means the prior has more influence.

With $n = 0$ (no data), the posterior equals the prior. As $n \to \infty$, the posterior mean approaches $\bar{x}$ and the posterior variance approaches $\sigma^2/n$, recovering the MLE and its sampling distribution.
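The precision-weighted update translates directly into code. A sketch with invented numbers (prior $\mathcal{N}(5, 4)$, known data variance $\sigma^2 = 1$):

```python
# Precision-weighted normal-normal update -- a sketch with made-up data.
def update_normal(mu0, sigma0_sq, sigma_sq, xs):
    """Posterior (mean, variance) for a normal mean with known sigma^2."""
    tau0 = 1.0 / sigma0_sq           # prior precision
    tau = 1.0 / sigma_sq             # precision per observation
    n = len(xs)
    xbar = sum(xs) / n
    tau_n = tau0 + n * tau           # posterior precision: precisions add
    mu_n = (tau0 * mu0 + n * tau * xbar) / tau_n
    return mu_n, 1.0 / tau_n

xs = [6.2, 5.8, 6.5, 6.1]            # hypothetical data, xbar = 6.15
mu_n, var_n = update_normal(mu0=5.0, sigma0_sq=4.0, sigma_sq=1.0, xs=xs)
print(mu_n, var_n)                   # mean pulled from 5.0 toward 6.15
```

With four observations at precision 1 against a prior precision of 0.25, the data get weight $4/4.25 \approx 0.94$, so the posterior mean sits close to $\bar{x}$.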

Posterior Consistency: Bernstein-von Mises

Theorem

Bernstein-von Mises Theorem

Statement

Under regularity conditions, as $n \to \infty$, the posterior distribution converges in total variation to a Gaussian centered at the MLE:

$$\left\| p(\theta \mid \mathbf{x}_n) - \mathcal{N}\!\left(\hat{\theta}_{\text{MLE}}, \frac{1}{n I(\theta_0)}\right) \right\|_{\text{TV}} \xrightarrow{P} 0$$

where $I(\theta_0)$ is the Fisher information at the true parameter. The posterior concentrates around the true parameter at rate $1/\sqrt{n}$ and its width matches the frequentist standard error.

Intuition

With enough data, the likelihood overwhelms the prior and the posterior becomes approximately Gaussian centered at the MLE. The prior does not matter asymptotically (as long as it puts positive mass near the truth). Bayesian and frequentist inference agree in the large-sample limit.

Proof Sketch

Expand the log-posterior around the MLE using a Taylor expansion. The log-likelihood term dominates the log-prior for large $n$. The quadratic approximation to the log-likelihood gives a Gaussian with precision $n I(\theta_0)$. The prior contributes a term of order $O(1)$ that becomes negligible compared to the $O(n)$ likelihood term.

Why It Matters

This theorem bridges Bayesian and frequentist statistics. It justifies using Bayesian credible intervals as approximate confidence intervals in large samples. It also shows that Bayesian inference is consistent: the posterior concentrates on the truth.

Failure Mode

The theorem requires the model to be correctly specified ($\theta_0$ is in the model). If the model is misspecified, the posterior concentrates on the KL-closest parameter to the truth, not the truth itself. The theorem also fails for non-regular models, infinite-dimensional parameters (nonparametric Bayesian models require separate theory), and improper priors in some cases.
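The theorem can be checked numerically in the Bernoulli case, where everything is closed-form. The sketch below, with hypothetical large-sample counts, compares the exact Beta posterior variance to the frequentist variance $\hat{p}(1-\hat{p})/n = 1/(n I(\hat{p}))$:

```python
# Numerical check of Bernstein-von Mises for Bernoulli data -- a sketch.
# For large n, the Beta posterior variance should approach the frequentist
# variance p_hat * (1 - p_hat) / n, whatever mild prior we started from.

n, k = 10_000, 6_000                 # hypothetical large-sample counts
alpha, beta = 2, 2                   # a weak prior
a, b = alpha + k, beta + n - k
post_var = a * b / ((a + b) ** 2 * (a + b + 1))   # exact Beta(a, b) variance
p_hat = k / n
freq_var = p_hat * (1 - p_hat) / n   # 1 / (n * Fisher information at p_hat)
print(post_var, freq_var)            # agree to within ~0.1%
```

Repeating this with a much stronger prior (say $\text{Beta}(200, 200)$) still converges, just more slowly, which is the "prior washes out" claim in quantitative form.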

Credible Intervals vs Confidence Intervals

A 95% credible interval $[a, b]$ satisfies $P(\theta \in [a, b] \mid \mathbf{x}) = 0.95$. It is a statement about $\theta$ given the data.

A 95% confidence interval $[a, b]$ satisfies: if you repeat the experiment many times, 95% of the intervals will contain the true $\theta$. It is a statement about the procedure, not about $\theta$ for the observed data.

These are structurally different interpretations. However, by Bernstein-von Mises, Bayesian credible intervals and frequentist confidence intervals coincide asymptotically. In finite samples they can differ, especially when the prior is informative.
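In practice, an equal-tailed credible interval is just a pair of posterior quantiles, which can be read off posterior samples. A Monte Carlo sketch for a hypothetical $\text{Beta}(9, 5)$ posterior:

```python
import random

# Equal-tailed 95% credible interval by Monte Carlo -- a sketch.
# Hypothetical posterior Beta(9, 5), e.g. a Beta(2,2) prior after 7/10 heads.
random.seed(0)
draws = sorted(random.betavariate(9, 5) for _ in range(100_000))
lo, hi = draws[2_500], draws[97_500]   # empirical 2.5% and 97.5% quantiles
print(lo, hi)                          # roughly (0.38, 0.87)
```

The interpretation is directly Bayesian: given the data, $\theta$ lies in $[{\tt lo}, {\tt hi}]$ with posterior probability 0.95.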

Marginal Likelihood (Bayesian Evidence)

The denominator in Bayes rule is the marginal likelihood (also called the evidence):

$$m(x) = \int p(x \mid \theta)\, \pi(\theta)\, d\theta$$

It plays three distinct roles.

  • Normalization constant. It makes the posterior $p(\theta \mid x) = p(x \mid \theta)\,\pi(\theta)/m(x)$ integrate to 1. For pointwise inference on $\theta$, it is a constant and can be ignored.
  • Bayes factor. For model comparison between $M_0$ and $M_1$, the ratio $B_{10} = m_1(x)/m_0(x)$ quantifies evidence for $M_1$ over $M_0$.
  • Empirical Bayes. Given a prior $\pi(\theta \mid \tau)$ with hyperparameter $\tau$, the marginal $m(x \mid \tau) = \int p(x \mid \theta)\, \pi(\theta \mid \tau)\, d\theta$ can be maximized over $\tau$ to pick a data-driven prior. This is the core mechanism behind small-area estimation and shrinkage estimators.

Computation of $m(x)$ is usually intractable because the integral is high-dimensional. Standard approximations are the Laplace approximation (Gaussian fit at the mode), variational inference (fit a tractable $q$ to the posterior), bridge sampling, and nested sampling.
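In the conjugate case the integral is tractable. For the beta-binomial model, $m(x) = \binom{n}{k} B(\alpha + k, \beta + n - k)/B(\alpha, \beta)$, which the sketch below evaluates in log space using only the standard library:

```python
from math import lgamma, exp

# Exact log marginal likelihood for the beta-binomial model -- a sketch.
# m(x) = C(n,k) * B(alpha + k, beta + n - k) / B(alpha, beta).

def log_beta_fn(a, b):
    """log of the Beta function via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(alpha, beta, k, n):
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta_fn(alpha + k, beta + n - k) - log_beta_fn(alpha, beta)

# Sanity check: under a uniform Beta(1,1) prior, m(x) = 1/(n+1) for every k.
print(exp(log_marginal(1, 1, 7, 10)))   # 1/11 = 0.0909...
```

The uniform-prior identity $m(x) = 1/(n+1)$ is a useful unit test: every head count from 0 to $n$ is equally likely before seeing the data.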

Posterior Predictive Distribution

To predict a new observation $x_{\text{new}}$ with parameter uncertainty integrated out, use the posterior predictive distribution:

$$p(x_{\text{new}} \mid x) = \int p(x_{\text{new}} \mid \theta)\, \pi(\theta \mid x)\, d\theta$$

Unlike plugging in the MAP or posterior mean, this propagates the full posterior uncertainty into the prediction. For the beta-binomial conjugate setup, the posterior predictive for a single future trial is Bernoulli with success probability equal to the posterior mean, and the posterior predictive for $m$ future trials is beta-binomial with parameters $(m, \alpha + k, \beta + n - k)$.
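When the integral has no closed form, the standard recipe is Monte Carlo: draw $\theta$ from the posterior, average $p(x_{\text{new}} \mid \theta)$ over the draws. A sketch for a hypothetical $\text{Beta}(9, 5)$ posterior, where the single-trial answer is known exactly (the posterior mean), so the estimate can be checked:

```python
import random

# Posterior predictive by Monte Carlo -- a sketch.
# Draw theta ~ posterior, then average p(x_new = heads | theta) = theta.
random.seed(0)
a, b = 9, 5                          # hypothetical Beta posterior parameters
M = 100_000
p_heads = sum(random.betavariate(a, b) for _ in range(M)) / M
print(p_heads)                       # -> posterior mean a/(a+b) = 0.6428...
```

For a single Bernoulli trial the averaging collapses to the posterior mean; the Monte Carlo machinery pays off for nonlinear predictive quantities such as tail probabilities of future counts.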

Bayes Factors and Model Comparison

The Bayes factor comparing $M_1$ to $M_0$ is

$$B_{10} = \frac{m_1(x)}{m_0(x)} = \frac{\int p(x \mid \theta_1, M_1)\, \pi(\theta_1 \mid M_1)\, d\theta_1}{\int p(x \mid \theta_0, M_0)\, \pi(\theta_0 \mid M_0)\, d\theta_0}$$

Jeffreys' scale interprets $B_{10} > 10$ as strong evidence for $M_1$, $B_{10} > 30$ as very strong, and $B_{10} > 100$ as decisive.

Compared to the frequentist likelihood ratio test, Bayes factors penalize model complexity automatically: a more flexible model spreads its prior mass over more parameter values, which lowers $m(x)$ unless the data strongly prefer that flexibility. This is often called "Occam's razor via the evidence."

Lindley's paradox. In point-null hypothesis testing with a diffuse prior under $M_1$, the Bayes factor can strongly favor $M_0$ on the same data where the frequentist $p$-value rejects $M_0$. The divergence grows with $n$: frequentist rejection thresholds scale like $1/\sqrt{n}$ while the Bayes factor penalty for a diffuse prior scales like $\sqrt{n}$. The two frameworks are answering different questions.
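The paradox is easy to reproduce numerically. The sketch below uses hypothetical coin-flip counts: $M_0$ is the point null $\theta = 1/2$, $M_1$ puts a uniform prior on $\theta$, and both marginal likelihoods are exact for the binomial model:

```python
from math import lgamma, log, exp, sqrt, erfc

# Lindley's paradox, numerically -- a sketch with hypothetical counts.
# M0: theta = 1/2 exactly.  M1: theta ~ Uniform(0, 1).

def log_beta_fn(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

n, k = 1_000_000, 501_600
log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
log_m0 = log_choose + n * log(0.5)                     # point-null likelihood
log_m1 = log_choose + log_beta_fn(k + 1, n - k + 1)    # uniform-prior evidence
bf01 = exp(log_m0 - log_m1)                            # Bayes factor for M0

z = (k - 0.5 * n) / sqrt(0.25 * n)   # frequentist z statistic (here 3.2)
p_value = erfc(z / sqrt(2))          # two-sided normal approximation
print(bf01, z, p_value)              # BF favors M0 while the p-value rejects it
```

At these counts the $p$-value is around $10^{-3}$, yet $B_{01} > 1$: the diffuse alternative wastes prior mass on values of $\theta$ far from the data, so the point null wins the evidence comparison.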

When Bayesian is Better

Bayesian estimation shines in several settings:

  • Small samples. The prior regularizes estimation when data is scarce. Without a prior, MLE can overfit or be undefined
  • Informative priors. When domain knowledge constrains the parameter (e.g., a probability must be near 0.5, a physical constant is known approximately), the prior encodes this and improves the estimate
  • Uncertainty quantification. The full posterior gives calibrated uncertainty bands, not just a point estimate. This is critical for decision making under uncertainty
  • Hierarchical models. Bayesian methods naturally handle multi-level structure where parameters at one level serve as priors for the next. This is the basis of small-area estimation, where empirical Bayes borrows strength across related units

Common Confusions

Watch Out

The prior is not arbitrary

A common criticism is that the prior is "subjective." But the prior can be chosen systematically: use domain knowledge, previous studies, or weakly-informative priors that regularize without strongly constraining. By Bernstein-von Mises, reasonable priors all lead to the same posterior with enough data. The choice matters most when data is scarce, which is exactly when you should use prior knowledge. The Monty Hall problem is a classic example where the canonical error is not in the prior (uniform over doors is correct) but in failing to condition on the host's asymmetric action: the host must open a non-winning, non-chosen door, and this constraint makes his choice informative about the remaining door.

Watch Out

MAP is not full Bayesian inference

MAP gives a point estimate (the posterior mode). Full Bayesian inference uses the entire posterior distribution. MAP ignores posterior uncertainty and can give misleading results when the posterior is skewed or multimodal. Use the posterior mean or full posterior for uncertainty quantification.

Watch Out

Flat priors are not always noninformative

A flat (uniform) prior on $\theta$ is not flat on $\theta^2$ or $\log\theta$. The notion of "noninformative" depends on the parameterization. The Jeffreys prior $p(\theta) \propto \sqrt{I(\theta)}$ is invariant to reparameterization but can be improper (not integrating to 1). Reference priors and weakly-informative priors are more practical alternatives.

Summary

  • Posterior $\propto$ prior $\times$ likelihood. The posterior is a compromise between prior belief and observed data
  • Conjugate priors give closed-form posteriors: beta-binomial, normal-normal, gamma-Poisson
  • MAP estimation adds $\log p(\theta)$ to the log-likelihood, equivalent to regularization
  • Bernstein-von Mises: the posterior converges to $\mathcal{N}(\hat{\theta}_{\text{MLE}}, 1/(n I(\theta_0)))$ as $n \to \infty$
  • Credible intervals are probability statements about $\theta$; confidence intervals are probability statements about the procedure
  • Bayesian methods are best when data is scarce, priors are informative, or you need full uncertainty quantification

Exercises

ExerciseCore

Problem

You have a coin that you believe is roughly fair. You choose a $\text{Beta}(10, 10)$ prior for the probability $\theta$ of heads. You flip the coin 20 times and observe 15 heads. What is the posterior distribution? What is the posterior mean?

ExerciseAdvanced

Problem

Show that MAP estimation with a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2)$ and Gaussian likelihood $x_i \sim \mathcal{N}(\theta, \sigma^2)$ is equivalent to ridge regression. What is the effective regularization parameter $\lambda$ in terms of $\sigma^2$ and $\tau^2$?

ExerciseResearch

Problem

The Bernstein-von Mises theorem requires the model to be correctly specified. What happens to the posterior when the model is misspecified? Give a concrete example where the posterior concentrates on a parameter value that is not the "true" parameter, and explain what that parameter represents.

References

Canonical:

  • Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer. Ch. 4.
  • Bernardo, J.M., & Smith, A.F.M. (1994). Bayesian Theory. Wiley. Ch. 1-5.
  • Robert, C.P. (2007). The Bayesian Choice, 2nd ed. Springer. Ch. 2-4, 7.
  • Gelman et al. (2013). Bayesian Data Analysis, 3rd ed. CRC Press. Ch. 2-3, 5.
  • Lehmann, E.L., & Casella, G. (1998). Theory of Point Estimation, 2nd ed. Springer. Ch. 4, 5.
  • Casella, G., & Berger, R.L. (2001). Statistical Inference, 2nd ed. Duxbury. Ch. 4.7, 7.2.3.

Current:

  • McElreath, R. (2020). Statistical Rethinking, 2nd ed. CRC Press.
  • van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge. Ch. 10 (Bernstein-von Mises).


Last reviewed: April 22, 2026
