
Statistical Estimation

Asymptotic Statistics: M-Estimators, Delta Method, LAN

The large-sample toolbox for statistical inference: continuous mapping theorem, Slutsky, the delta method, M- and Z-estimator consistency and asymptotic normality, MLE as a special M-estimator, local asymptotic normality (Le Cam), the asymptotic equivalence of Wald / score / likelihood-ratio tests, and influence-function representations. These results justify essentially every confidence interval, standard error, and p-value in applied statistics, and they are the language of modern semiparametric theory.

Advanced · Tier 1 · Stable · Core spine · ~90 min

Why This Matters

Three asymptotically equivalent ways to measure how far the data carries the parameter from $H_0$ on the same quadratic log-likelihood.

[Figure: quadratic log-likelihood $\ell(\theta) - \ell(\hat\theta)$ plotted against $\theta$, marking $\hat\theta$ (MLE) and $\theta_0$ ($H_0$). Wald: $n(\hat\theta - \theta_0)^2 \cdot I(\theta_0)$. LRT: $-2\log\Lambda = 2[\ell(\hat\theta) - \ell(\theta_0)]$. Score: $\nabla\ell(\theta_0)^\top I^{-1} \nabla\ell(\theta_0)$, the slope of the tangent at $\theta_0$.]

Under LAN, all three statistics equal $\Delta_n^\top I^{-1} \Delta_n + o_p(1)$ and converge to $\chi^2_k$; they differ in finite samples and under misspecification.

Almost every confidence interval, standard error, and p-value in applied statistics relies on asymptotic theory. When you report $\hat\theta \pm 1.96 \cdot \widehat{\text{SE}}$, you are invoking asymptotic normality of $\hat\theta$. When you compare a likelihood-ratio statistic to a $\chi^2_k$ table, you are invoking Wilks' theorem. When you compute a sandwich standard error after a misspecified model fit, you are invoking M-estimator asymptotic normality with the model-robust variance formula. When you bootstrap, you are using a finite-sample proxy for an asymptotic distribution.

The page that follows develops the toolbox in the order it actually gets used. First: the four "plumbing" results (continuous mapping, Slutsky, delta method, joint convergence) that turn one CLT into a thousand applied results. Second: the M- and Z-estimator framework (van der Vaart Ch 5) that subsumes MLE, GMM, OLS, and quantile regression as one theorem. Third: Le Cam's local asymptotic normality, which says all regular parametric problems look like Gaussian shift experiments at the $1/\sqrt{n}$ scale and which underwrites the asymptotic minimax lower bound. Fourth: the Wald, score, and likelihood-ratio tests, which are asymptotically equivalent under the null but differ in finite samples and under misspecification.

The synthesis that anchors stats-PhD intuition: every regular estimator has an influence function that determines its asymptotic variance, and the MLE achieves the unique minimum-variance influence function (the efficient influence function), whose variance is the inverse Fisher information. Everything else is variation around this center.

Mental Model

Think of asymptotic statistics as three stacked layers.

  1. Convergence calculus. Convergence in distribution and in probability are partial-order-like relations on sequences of random variables. The continuous mapping theorem pushes them through continuous functions, Slutsky lets you absorb constants, the delta method pushes them through smooth deterministic maps. These are the algebraic moves you make to manipulate limits.

  2. Estimator asymptotics. Most estimators of interest are defined as minimizers (M-estimators) or zeros (Z-estimators) of a sample criterion:

    $$\hat\theta_n = \arg\min_\theta \frac{1}{n}\sum_i m(X_i; \theta) \quad\text{or}\quad \frac{1}{n}\sum_i \psi(X_i; \hat\theta_n) = 0.$$

    Under regularity conditions they are $\sqrt{n}$-consistent and asymptotically normal with a variance you can write down in closed form. MLE is the special case $\psi = \nabla \log f$, with variance $I(\theta_0)^{-1}$.

  3. Local geometry. Around the truth, the statistical experiment looks Gaussian at the $1/\sqrt{n}$ scale. Local asymptotic normality (Le Cam) makes this precise via the contiguity of nearby measures and the convergence of likelihood ratios to the log-likelihood ratio of a Gaussian shift. The asymptotic minimax bound, the Wald/score/LRT equivalence, and semiparametric efficiency theory all sit on this layer.

Notation

Throughout: $\xrightarrow{d}$ for convergence in distribution, $\xrightarrow{p}$ for convergence in probability, $\xrightarrow{a.s.}$ for almost sure. $X_n = O_p(a_n)$ means $X_n / a_n$ is bounded in probability ($\sup_n P(|X_n/a_n| > M) \to 0$ as $M \to \infty$). $X_n = o_p(a_n)$ means $X_n / a_n \xrightarrow{p} 0$. We write $\ell(\theta) = \log f(X; \theta)$ for a single-observation log-likelihood and $\ell_n(\theta) = \sum_{i=1}^n \log f(X_i; \theta)$ for the sample log-likelihood. The score is $s(X; \theta) = \nabla_\theta \log f(X; \theta)$, with sample average $\bar s_n(\theta) = \frac{1}{n}\sum_i s(X_i; \theta)$. Fisher information is $I(\theta) = \mathbb{E}_\theta[s(X; \theta) s(X; \theta)^\top] = -\mathbb{E}_\theta[\nabla^2 \log f(X; \theta)]$, with the equality holding under regularity (the information identity).

The Convergence Calculus

Theorem

Continuous Mapping Theorem (CMT)

Statement

Let $X_n \xrightarrow{d} X$ in a metric space and let $g$ be measurable and continuous at every point of a set $C$ with $P(X \in C) = 1$. Then

$$g(X_n) \xrightarrow{d} g(X).$$

The same holds with $\xrightarrow{d}$ replaced by $\xrightarrow{p}$ or $\xrightarrow{a.s.}$.

Intuition

Continuous functions preserve convergence. The "continuous on a set of $X$-measure one" condition allows $g$ to be discontinuous at points the limit never visits, which matters in practice (e.g., $g(x) = 1/x$ when $X$ is positive almost surely).

Proof Sketch

Use the Portmanteau theorem: $X_n \xrightarrow{d} X$ iff $E[f(X_n)] \to E[f(X)]$ for every bounded continuous $f$. For $g$ continuous on a $P_X$-measure-one set $C$, the composition $f \circ g$ is bounded and its discontinuity points lie in the complement of $C$, hence have $P_X$-measure zero. Apply Portmanteau to $f \circ g$ to conclude. See van der Vaart, Asymptotic Statistics, Theorem 2.3.

Why It Matters

CMT is the most-used convergence result. Every "transform an estimator through a smooth function" argument starts here. Slutsky and the delta method are corollaries.

Failure Mode

Discontinuity of $g$ at a point where the limit puts positive mass breaks the conclusion. Example: let $X_n = 1/n$ a.s., $X = 0$, and $g(x) = \mathbb{1}\{x = 0\}$. Then $X_n \xrightarrow{a.s.} 0$ but $g(X_n) = 0$ for all $n$ while $g(X) = 1$. The discontinuity at zero coincides with the limit's support.

Theorem

Slutsky's Theorem

Statement

If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$ for a deterministic constant $c$, then jointly $(X_n, Y_n) \xrightarrow{d} (X, c)$, and therefore for any continuous $f$,

$$f(X_n, Y_n) \xrightarrow{d} f(X, c).$$

The standard corollaries:

$$X_n + Y_n \xrightarrow{d} X + c, \quad X_n Y_n \xrightarrow{d} cX, \quad X_n / Y_n \xrightarrow{d} X/c \;\; (c \neq 0).$$

Intuition

A sequence converging in probability to a constant behaves like that constant in the limit. Marginal convergence $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{d} Y$ does not imply joint convergence in general; Slutsky works precisely because the $Y_n$ limit is degenerate (a point mass), so there is only one admissible coupling.

Proof Sketch

$Y_n \xrightarrow{p} c$ gives joint convergence $(X_n, Y_n) \xrightarrow{d} (X, c)$: for bounded Lipschitz $f$, $|E[f(X_n, Y_n)] - E[f(X_n, c)]|$ is controlled by $P(|Y_n - c| > \varepsilon)$ plus an $\varepsilon$-term, and $E[f(X_n, c)] \to E[f(X, c)]$ by marginal convergence. Apply CMT to the continuous map $f$. See van der Vaart Lemma 2.8.

Why It Matters

Slutsky is the glue that holds asymptotic arguments together. Every time you replace $\sigma$ with $\hat\sigma$ in a $z$-statistic and claim the limit is still standard normal, you are using Slutsky.

Failure Mode

Slutsky fails if $Y_n$ has a non-degenerate limit. Example: if $X_n, Y_n$ are independent standard normals, $X_n \xrightarrow{d} N(0,1)$ and $Y_n \xrightarrow{d} N(0,1)$, but $X_n + Y_n$ has variance $2$, not the variance $1$ that "treating $Y_n$ as $0$" would suggest. The fix is joint convergence and CMT applied to the joint limit.
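A short simulation illustrates the studentization move Slutsky licenses: replacing the unknown $\sigma$ with $\hat\sigma$ leaves the limit standard normal. This is a sketch, not part of the original text; the Exp(1) population and sample sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 400, 10000

# reps samples of size n from Exp(1): mean 1, sd 1, but pretend sd is unknown.
x = rng.exponential(1.0, size=(reps, n))
xbar = x.mean(axis=1)
s = x.std(axis=1, ddof=1)  # sigma-hat, converges in probability to sigma = 1

# Slutsky: sqrt(n)(xbar - 1)/s has the same N(0, 1) limit as
# sqrt(n)(xbar - 1)/sigma, because s ->p sigma.
t = np.sqrt(n) * (xbar - 1.0) / s
coverage = np.mean(np.abs(t) <= 1.96)
print(round(coverage, 3))  # near the nominal 0.95
```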

Theorem

Delta Method (univariate and multivariate)

Statement

Univariate. If $\sqrt{n}(T_n - \mu) \xrightarrow{d} N(0, \sigma^2)$ and $g : \mathbb{R} \to \mathbb{R}$ is differentiable at $\mu$ with $g'(\mu) \neq 0$, then

$$\sqrt{n}(g(T_n) - g(\mu)) \xrightarrow{d} N(0, [g'(\mu)]^2 \sigma^2).$$

Multivariate. If $\sqrt{n}(T_n - \mu) \xrightarrow{d} N_k(0, \Sigma)$ and $g : \mathbb{R}^k \to \mathbb{R}^m$ has Jacobian $G(\mu) = \nabla g(\mu) \in \mathbb{R}^{m \times k}$ at $\mu$, then

$$\sqrt{n}(g(T_n) - g(\mu)) \xrightarrow{d} N_m\!\left(0, G(\mu) \Sigma G(\mu)^\top\right).$$

Second-order. If additionally $g'(\mu) = 0$ and $g''(\mu) \neq 0$, then

$$n(g(T_n) - g(\mu)) \xrightarrow{d} \tfrac{1}{2} g''(\mu) \, \sigma^2 \chi^2_1.$$

The convergence rate changes from $\sqrt{n}$ to $n$ and the limit is no longer normal.

Intuition

A smooth function of an approximately normal random vector is itself approximately normal, with covariance pushed forward through the Jacobian. The second-order version covers the degenerate case where the gradient vanishes (e.g., the variance of the sample variance at the true variance has this structure).

Proof Sketch

Univariate: Taylor expand $g(T_n) = g(\mu) + g'(\mu)(T_n - \mu) + R_n$, where differentiability at $\mu$ gives $R_n = o_p(|T_n - \mu|) = o_p(1/\sqrt{n})$. Multiply by $\sqrt{n}$: $\sqrt{n}(g(T_n) - g(\mu)) = g'(\mu) \sqrt{n}(T_n - \mu) + o_p(1)$. Apply Slutsky to absorb the remainder.

Multivariate: same argument with the gradient replaced by the Jacobian $G(\mu)$. The push-forward of a Gaussian through a linear map is Gaussian with covariance $G \Sigma G^\top$.

Second-order: when $g'(\mu) = 0$ the Taylor expansion gives $g(T_n) = g(\mu) + \frac{1}{2} g''(\mu)(T_n - \mu)^2 + o_p(|T_n - \mu|^2)$. Multiplying by $n$ and using $\sqrt{n}(T_n - \mu) \xrightarrow{d} N(0, \sigma^2)$ with CMT gives $n(T_n - \mu)^2 \xrightarrow{d} \sigma^2 \chi^2_1$.

Why It Matters

The delta method is the workhorse of applied statistics. Variance of $\log(\hat p)$, of $1/\bar X$, of $\hat p_1 / \hat p_2$, of any smooth functional of moments. It is also the basis for variance-stabilizing transformations (the Anscombe and Freeman-Tukey families come from solving an ODE in $g'$).

Failure Mode

Three failure modes. (1) $g'(\mu) = 0$: use the second-order version. (2) $g$ not differentiable at $\mu$: e.g., $g(x) = |x|$ at $0$ gives a folded-normal limit, not a normal. (3) $T_n$ converges at a non-$\sqrt{n}$ rate: the delta method must use that rate, not $\sqrt{n}$, in the expansion.
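A numerical check of the first-order formula, using $g = \log$ on a binomial proportion. This sketch is not from the original text; the values of $n$, $p$, and the replication count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, p = 500, 40000, 0.3

phat = rng.binomial(n, p, size=reps) / n

# Delta method with g = log: Var(log phat) ~ [g'(p)]^2 p(1-p)/n = (1-p)/(n p).
predicted = (1 - p) / (n * p)
empirical = np.log(phat).var()
print(predicted, empirical)  # close
```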

[Figure: tangent linearization of $g$ around $\mu$; an input spread $\sigma/\sqrt{n}$ on the $x$-axis maps to an output spread $|g'(\mu)| \cdot \sigma/\sqrt{n}$ on the $y$-axis.]

Ribbon on the x-axis: input spread $\pm\sigma/\sqrt{n}$. Ribbon on the y-axis: output spread $\pm|g'(\mu)| \cdot \sigma/\sqrt{n}$, the delta-method prediction. The dashed tangent line is the linearization the method uses; the curve is the true $g$. The approximation tightens as $\sigma/\sqrt{n}$ shrinks.

M- and Z-Estimators: The Unified Framework

Most estimators of interest in statistics arise as either M-estimators (arg-min of a sample criterion) or Z-estimators (zero of a sample estimating equation). Under regularity these are equivalent (set $\psi = \nabla m$), but the Z-formulation is more convenient for asymptotic analysis because it avoids second-order arg-min arguments.

Definition

M-estimator and Z-estimator

Given iid data $X_1, \ldots, X_n$ from $P$ and a parameter space $\Theta \subset \mathbb{R}^k$, an M-estimator is

$$\hat\theta_n = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n m(X_i; \theta).$$

A Z-estimator is any solution

$$\hat\theta_n \in \Theta \;\; \text{such that} \;\; \frac{1}{n} \sum_{i=1}^n \psi(X_i; \hat\theta_n) = 0,$$

where $\psi : \mathcal{X} \times \Theta \to \mathbb{R}^k$ is the estimating function. The population analogue is $\theta_0 = \arg\min M(\theta)$ where $M(\theta) = E_P[m(X; \theta)]$, or equivalently $\Psi(\theta_0) = 0$ where $\Psi(\theta) = E_P[\psi(X; \theta)]$.

Examples:

  • MLE. $m(X; \theta) = -\log f(X; \theta)$, $\psi(X; \theta) = -\nabla \log f(X; \theta) = -s(X; \theta)$.
  • OLS. $m(X, Y; \beta) = (Y - X^\top \beta)^2$, $\psi(X, Y; \beta) = -X(Y - X^\top \beta)$.
  • GMM (just-identified). $\psi(X; \theta) =$ the moment conditions evaluated at $\theta$.
  • Quantile regression. $\psi(X, Y; \beta) = X(\tau - \mathbb{1}\{Y < X^\top \beta\})$ for the $\tau$-th quantile.
  • Huber's robust location. $\psi(X; \theta) = \psi_c(X - \theta)$, where $\psi_c$ is the Huber influence function (linear inside $[-c, c]$, capped outside).
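The last example can be made concrete with a few lines of code: solving the Huber Z-equation by Newton iteration. This sketch is not from the original text; the tuning constant $c = 1.345$, the contamination fractions, and the function names are illustrative.

```python
import numpy as np

def huber_psi(u, c=1.345):
    # Huber's psi: identity inside [-c, c], clipped to +/-c outside.
    return np.clip(u, -c, c)

def huber_location(x, c=1.345, tol=1e-10, max_iter=200):
    # Solve the Z-equation (1/n) sum_i psi_c(x_i - theta) = 0.
    # Newton step: the derivative of the sample estimating equation in theta
    # is minus the fraction of points inside the clipping band.
    theta = np.median(x)  # robust starting point
    for _ in range(max_iter):
        r = x - theta
        step = huber_psi(r, c).mean() / np.mean(np.abs(r) <= c)
        theta += step
        if abs(step) < tol:
            break
    return theta

rng = np.random.default_rng(2)
# 90% N(0, 1) plus 10% gross outliers near 20.
x = np.concatenate([rng.normal(0, 1, 900), rng.normal(20, 1, 100)])
print(x.mean())           # dragged toward the outliers, near 2
print(huber_location(x))  # stays near the bulk of the data
```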
Theorem

M-Estimator Asymptotic Normality

Statement

Under the stated regularity conditions, the Z-estimator $\hat\theta_n$ solving $\frac{1}{n}\sum_i \psi(X_i; \hat\theta_n) = 0$ satisfies

$$\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} N\!\left(0, V(\theta_0)^{-1} A(\theta_0) V(\theta_0)^{-\top}\right),$$

where $V(\theta_0) = \nabla_\theta \Psi(\theta_0) = E_P[\nabla_\theta \psi(X; \theta_0)]$ and $A(\theta_0) = E_P[\psi(X; \theta_0)\psi(X; \theta_0)^\top]$.

The variance is the sandwich form $V^{-1} A V^{-\top}$, with $V$ the "bread" and $A$ the "meat." When the model is correctly specified and $\psi = -s$ is the negative score (the MLE case), the information identity gives $V(\theta_0) = A(\theta_0) = I(\theta_0)$, the sandwich collapses to $I(\theta_0)^{-1}$, and the variance equals the inverse Fisher information.

Intuition

Expand the estimating equation around $\theta_0$:

$$0 = \frac{1}{n}\sum_i \psi(X_i; \hat\theta_n) \approx \frac{1}{n}\sum_i \psi(X_i; \theta_0) + V(\theta_0) (\hat\theta_n - \theta_0).$$

Solve for $\hat\theta_n - \theta_0$, multiply by $\sqrt{n}$:

$$\sqrt{n}(\hat\theta_n - \theta_0) \approx -V(\theta_0)^{-1} \cdot \frac{1}{\sqrt{n}}\sum_i \psi(X_i; \theta_0).$$

The right side is a CLT applied to mean-zero iid vectors with covariance $A(\theta_0)$, scaled by the deterministic matrix $-V(\theta_0)^{-1}$. The push-forward gives the sandwich variance.

Proof Sketch

Write the estimating equation with a mean-value-theorem step (or use a stochastic equicontinuity argument when $\psi$ is non-smooth). The linearization

$$0 = \bar \psi_n(\theta_0) + \overline{\nabla \psi}_n(\tilde\theta) (\hat\theta_n - \theta_0)$$

with $\tilde\theta$ between $\theta_0$ and $\hat\theta_n$, plus a uniform LLN on $\nabla \psi$, gives $\overline{\nabla \psi}_n(\tilde\theta) \xrightarrow{p} V(\theta_0)$. The CLT gives $\sqrt{n}\, \bar \psi_n(\theta_0) \xrightarrow{d} N(0, A(\theta_0))$. Apply Slutsky and CMT to invert $V$ and conclude. See van der Vaart Asymptotic Statistics Theorem 5.21.

A more refined version (Theorem 5.41, Z-estimators with non-smooth $\psi$) replaces pointwise differentiability with Hellinger differentiability of the model and stochastic equicontinuity of the empirical-process term. This handles quantile regression and other $\psi$ with discontinuities.

Why It Matters

This is the most general asymptotic-normality theorem in regular parametric inference. MLE asymptotic normality is the special case where the model is correctly specified and $\psi$ is the score. OLS, GMM, Huber's robust location, and quantile regression all follow as special cases with their own $V$ and $A$. The sandwich variance is also exactly what you should report when you suspect model misspecification: it remains valid when the information identity fails ($V \neq A$ in the convention $\psi = -s$), while the naive model-based variance $A^{-1}$ does not.

Failure Mode

Three failure modes worth naming. (1) Lack of identification: if $V(\theta_0)$ is singular, the asymptotic variance is undefined and the estimator may converge at a slower rate (cube-root asymptotics for maximum-score estimators, $n^{1/3}$ rate). (2) Boundary parameter: if $\theta_0$ is on the boundary of $\Theta$ the limit is not normal but a half-normal or, more generally, the projection of a normal onto a cone (Self & Liang 1987, JASA). (3) Misspecification: $\Psi(\theta_0) = 0$ defines the "least-false" parameter, but the sandwich variance still applies; the naive Fisher-information-based variance does not.
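The sandwich recipe above can be sketched for OLS with heteroskedastic errors, where the model-based variance is wrong but the sandwich is not. This code is illustrative, not from the original text; the data-generating process and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# Heteroskedastic errors: noise sd grows with |x|, so the homoskedastic
# model-based OLS variance is wrong; the sandwich (Eicker-Huber-White) is not.
y = 1.0 + 2.0 * x + rng.normal(size=n) * (0.5 + np.abs(x))

beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta

bread = np.linalg.inv(X.T @ X / n)            # V^{-1} (up to sign convention)
meat = (X * resid[:, None] ** 2).T @ X / n    # estimate of A = E[psi psi^T]
sandwich = bread @ meat @ bread / n           # V^{-1} A V^{-T} / n
naive = np.linalg.inv(X.T @ X) * resid.var()  # sigma^2 (X'X)^{-1}

se_sand = np.sqrt(np.diag(sandwich))
se_naive = np.sqrt(np.diag(naive))
print(beta, se_sand, se_naive)
```

With this design the robust slope standard error exceeds the naive one, because the noise is largest exactly where the regressor is most informative.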

MLE as a Special Case

Corollary

Asymptotic Normality of MLE

Statement

Under the regularity conditions above, the maximum likelihood estimator $\hat\theta_n$ satisfies

$$\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} N\!\left(0, I(\theta_0)^{-1}\right),$$

where $I(\theta_0) = E_{\theta_0}[s(X; \theta_0) s(X; \theta_0)^\top] = -E_{\theta_0}[\nabla^2 \log f(X; \theta_0)]$ is the Fisher information matrix, with the second equality holding by the information identity.

Intuition

The MLE is asymptotically normal, centered at the truth, with covariance $I(\theta_0)^{-1}/n$. This is the smallest variance any regular estimator can achieve (Cramér-Rao lower bound), so the MLE is asymptotically efficient. The result follows from the M-estimator theorem with $\psi = -s$, $V(\theta_0) = -E[\nabla s] = I(\theta_0)$, $A(\theta_0) = E[s s^\top] = I(\theta_0)$, and the sandwich $V^{-1} A V^{-\top}$ collapsing to $I^{-1} I I^{-1} = I^{-1}$.

Proof Sketch

Apply the M-estimator theorem with $\psi(X; \theta) = -\nabla \log f(X; \theta)$. The information identity (which holds under regularity, by differentiating $\int f(x; \theta)\, dx = 1$ twice and exchanging integration and differentiation) gives $V(\theta_0) = I(\theta_0) = A(\theta_0)$, so the sandwich variance reduces to $I(\theta_0)^{-1}$.

Why It Matters

Justifies the standard practice of reporting MLE point estimates with standard errors derived from the inverse observed information. It also underwrites likelihood-based inference: confidence intervals from inverting the Wald, score, or LRT statistic; profile-likelihood intervals for nuisance-parameter problems; and the asymptotic chi-squared distribution of $-2 \log \Lambda$ in nested-model testing.

Failure Mode

Regularity fails at boundary parameters (e.g., $\sigma = 0$ in $N(\mu, \sigma^2)$), non-identified models (mixture models with unknown component count), models where the support depends on the parameter (e.g., $\text{Uniform}(0, \theta)$, where $\hat\theta_{\text{MLE}} = X_{(n)}$ converges at rate $n$ to an exponential limit), and singular information ($I(\theta_0)$ not of full rank, e.g., at a point of non-identification in mixture models).

Le Cam Theory: Local Asymptotic Normality

Theorem

Local Asymptotic Normality (Le Cam)

Statement

Under differentiability in quadratic mean (DQM) at $\theta_0$, the log-likelihood ratio admits the LAN expansion

$$\log \frac{dP_{\theta_0 + h/\sqrt{n}}^{\otimes n}}{dP_{\theta_0}^{\otimes n}}(X_1, \ldots, X_n) = h^\top \Delta_n - \frac{1}{2} h^\top I(\theta_0) h + o_{P_{\theta_0}}(1),$$

where the central sequence $\Delta_n = \frac{1}{\sqrt{n}}\sum_i s(X_i; \theta_0)$ satisfies $\Delta_n \xrightarrow{d} N_k(0, I(\theta_0))$ under $P_{\theta_0}$.

Le Cam's three lemmas then give:

  1. The product measures $P_{\theta_0 + h/\sqrt{n}}^{\otimes n}$ and $P_{\theta_0}^{\otimes n}$ are mutually contiguous.
  2. Convergence in distribution under $P_{\theta_0}$ transfers to $P_{\theta_0 + h/\sqrt{n}}$ via the change-of-measure formula.
  3. The asymptotic experiment is the Gaussian shift experiment $\Delta \sim N_k(I(\theta_0) h, I(\theta_0))$: every regular procedure in the original problem has a corresponding procedure in the Gaussian experiment with the same asymptotic risk.

Intuition

At the local scale $h/\sqrt{n}$ around the truth, every regular parametric problem reduces to a Gaussian shift problem. The sufficient statistic is the central sequence $\Delta_n$ (the normalized score), the parameter is the local direction $h$, and the Fisher information sets the signal-to-noise ratio. This is the deepest structural result in classical parametric statistics: it says the MLE is asymptotically optimal not because of clever proofs but because it is the natural estimator in the limiting Gaussian experiment.

Differentiability in quadratic mean (DQM) is weaker than pointwise differentiability of $\log f$: it requires only that the square-root density $\sqrt{f}$ be differentiable in $L^2$ as a function of $\theta$. This handles models with bounded support, non-smooth densities, and many other "almost regular" cases where the textbook regularity conditions fail.

Proof Sketch

Hellinger / DQM expansion: under DQM at $\theta_0$ with score $s$,

$$\sqrt{f_{\theta_0 + h/\sqrt{n}}(x)} - \sqrt{f_{\theta_0}(x)} = \frac{h^\top s(x; \theta_0)}{2\sqrt{n}} \sqrt{f_{\theta_0}(x)} + R_n(x; h),$$

with $\|R_n(\cdot; h)\|_{L^2} = o(1/\sqrt{n})$. Square, sum over $i$, take logs, and Taylor expand $\log(1 + u) = u - u^2/2 + O(u^3)$. The linear term aggregates to $h^\top \Delta_n$; the quadratic term aggregates by a LLN-type argument to $-\frac{1}{2} h^\top I(\theta_0) h$. The remainder is $o_p(1)$ uniformly in $h$ on bounded sets. See van der Vaart Asymptotic Statistics Theorem 7.2 and Le Cam-Yang Asymptotics in Statistics Chapter 6.

Le Cam's three lemmas follow from the LAN expansion via standard arguments. Mutual contiguity follows from the boundedness in probability of the log-likelihood ratio; the change-of-measure formula comes directly from contiguity; the Gaussian shift representation follows from the joint convergence $(\log L_n, \Delta_n) \xrightarrow{d} (h^\top \Delta - \frac{1}{2} h^\top I h, \Delta)$.

Why It Matters

LAN underwrites the asymptotic minimax lower bound: no regular estimator can have asymptotic variance smaller than $I(\theta_0)^{-1}$. It also gives the asymptotic equivalence of Wald, score, and LRT tests under the null and under contiguous alternatives, and it is the foundation of semiparametric efficiency theory (Bickel-Klaassen-Ritov-Wellner 1993): in semiparametric models the analogous expansion identifies the efficient score as the projection of the parametric score onto the orthogonal complement of the nuisance tangent space.

Failure Mode

LAN fails for non-regular models. Canonical examples include parameter-dependent support (e.g., $\text{Uniform}(0, \theta)$, where the MLE converges at rate $n$ rather than $\sqrt{n}$ and the limit is exponential), boundary-parameter problems (parameter on the edge of $\Theta$, where the limit is a censored normal), change-point models (where the rate is $n$ and the limit is a compound Poisson functional), mixture models at non-identified configurations, and unidentified parameters generally. Note that the standard Cauchy location family $f_\theta(x) = 1/[\pi(1 + (x-\theta)^2)]$ is regular: its Fisher information equals $1/2$ and LAN holds, so the location MLE is regular and asymptotically normal at the $\sqrt{n}$ rate. The Cauchy sample mean is pathological because the Cauchy has no first moment, but likelihood-based location inference is fine. Many machine-learning models with high-dimensional parameter spaces and unclear identification also fall outside LAN. The corresponding theory is LAMN (locally asymptotically mixed normal) and LAQ (locally asymptotically quadratic), with weaker optimality conclusions.
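For the Gaussian location model the LAN expansion can be verified by direct computation, since there it holds exactly rather than up to $o_p(1)$. A minimal sketch (sample size, $h$, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, h, theta0 = 200, 1.5, 0.0
x = rng.normal(theta0, 1.0, size=n)

# N(theta, 1) model: I(theta) = 1 and the LAN expansion is exact:
# log LR = h * Delta_n - h^2 / 2 with Delta_n = sqrt(n) * (xbar - theta0).
theta_local = theta0 + h / np.sqrt(n)

def loglik(th):
    return -0.5 * np.sum((x - th) ** 2)

log_lr = loglik(theta_local) - loglik(theta0)
delta_n = np.sqrt(n) * (x.mean() - theta0)
lan = h * delta_n - 0.5 * h**2
print(log_lr, lan)  # identical up to floating-point error
```

For non-Gaussian regular models the two quantities differ by a genuinely stochastic $o_p(1)$ term instead of agreeing exactly.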

The Wald, Score, and Likelihood-Ratio Tests

Theorem

Asymptotic Equivalence of Wald, Score, and LRT

Statement

Under the null $H_0 : \theta = \theta_0$ and LAN, define the three test statistics:

  • Wald. $W_n = n (\hat\theta_n - \theta_0)^\top I(\theta_0) (\hat\theta_n - \theta_0)$.
  • Score (Rao / Lagrange multiplier). $R_n = \frac{1}{n} \nabla \ell_n(\theta_0)^\top I(\theta_0)^{-1} \nabla \ell_n(\theta_0) = \Delta_n^\top I(\theta_0)^{-1} \Delta_n$.
  • Likelihood ratio. $L_n = -2 \log \frac{L_n(\theta_0)}{L_n(\hat\theta_n)} = 2(\ell_n(\hat\theta_n) - \ell_n(\theta_0))$.

Then under $H_0$, all three converge to the same chi-squared limit:

$$W_n, R_n, L_n \xrightarrow{d} \chi^2_k \quad \text{where} \;\; k = \dim(\theta).$$

Moreover, $W_n - L_n = o_p(1)$ and $R_n - L_n = o_p(1)$, so the three tests are asymptotically equivalent under $H_0$ and under contiguous alternatives $\theta_n = \theta_0 + h/\sqrt{n}$ (where the common limit becomes a non-central $\chi^2_k(\lambda)$ with non-centrality $\lambda = h^\top I(\theta_0) h$).

Intuition

The three tests measure "distance from $H_0$" in three different but asymptotically equivalent ways. Wald measures distance in the parameter space (how far is $\hat\theta_n$ from $\theta_0$, weighted by the information). Score measures the slope of the log-likelihood at $\theta_0$ (is the gradient $\Delta_n$ near zero, or far?). LRT measures the depth of the log-likelihood (how much does the maximum exceed the null value?). All three reduce to the same quadratic form in the central sequence $\Delta_n$ under LAN, and the chi-squared limit is the distribution of the squared length of a standard normal $k$-vector.

Proof Sketch

Substitute the LAN expansion into each statistic and Taylor-expand to second order around $\theta_0$. The MLE satisfies $\hat\theta_n - \theta_0 = I(\theta_0)^{-1} \Delta_n / \sqrt{n} + o_p(1/\sqrt{n})$, so

$$W_n = \Delta_n^\top I(\theta_0)^{-1} \Delta_n + o_p(1),$$

which matches $R_n$ exactly. For the LRT, expand $\ell_n(\hat\theta_n)$ around $\theta_0$ to second order; the cross-term contributions cancel and the remainder is also $\Delta_n^\top I(\theta_0)^{-1} \Delta_n + o_p(1)$. Since $\Delta_n \xrightarrow{d} N(0, I(\theta_0))$ under $H_0$, the quadratic form has a $\chi^2_k$ limit. See van der Vaart Asymptotic Statistics Theorem 16.7.

Why It Matters

Justifies reporting any of the three test statistics with the same chi-squared reference distribution. The choice among them is a finite-sample concern: Wald requires the MLE and an estimate of $I(\theta_0)$; score does not require fitting the alternative model and is preferred in some computational settings; LRT is invariant under parameter transformations and tends to have better small-sample behavior in nested-model comparisons. The non-central chi-squared limit under contiguous alternatives gives the local power calculation: power $\to P(\chi^2_k(\lambda) > \chi^2_{k, 1-\alpha})$, a function of the non-centrality $\lambda = h^\top I(\theta_0) h$.

Failure Mode

The three tests can disagree substantially in finite samples, particularly when $n/k$ is small or the model is misspecified. Wald is parameterization-dependent (Wald CIs for $\theta$ and for $g(\theta)$ are not invariant under reparameterization, while LRT CIs are). Under misspecification only the robust score test with a sandwich denominator preserves a valid chi-squared limit; the naive Wald and LRT statistics use the wrong variance. At boundary parameters the limit is a mixture of chi-squared distributions, not a single $\chi^2$ (Self-Liang 1987).
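The trinity can be computed in a few lines for a Poisson mean test. A sketch, not from the original text (sample size, $\lambda_0$, and seed are illustrative); note that with the expected information evaluated at $\lambda_0$, the Wald and score statistics coincide algebraically in this model, while the LRT differs by a term of order $n^{-1/2}$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, lam0 = 5000, 3.0
x = rng.poisson(lam0, size=n)  # data generated under H0
lam_hat = x.mean()             # Poisson MLE

# Poisson model: I(lam) = 1/lam; the score at lam0 is n*(xbar - lam0)/lam0.
wald = n * (lam_hat - lam0) ** 2 / lam0
score = (n * (lam_hat - lam0) / lam0) ** 2 * lam0 / n
lrt = 2 * n * (lam_hat * np.log(lam_hat / lam0) - (lam_hat - lam0))
print(wald, score, lrt)  # all close; each is compared to a chi2_1 table
```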

Definitions: Influence Functions, Efficiency, Contiguity

Definition

Asymptotically Linear Estimator and Influence Function

A regular estimator $\hat\theta_n$ of $\theta_0$ is asymptotically linear with influence function $\varphi : \mathcal{X} \to \mathbb{R}^k$ if and only if

$$\sqrt{n}(\hat\theta_n - \theta_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \varphi(X_i) + o_p(1),$$

where $E[\varphi(X)] = 0$ and $E[\|\varphi(X)\|^2] < \infty$. The asymptotic variance of $\hat\theta_n$ is then $E[\varphi(X) \varphi(X)^\top]$.

For an M-estimator with estimating function $\psi$, $\varphi(X) = -V(\theta_0)^{-1} \psi(X; \theta_0)$. For the MLE, $\varphi(X) = I(\theta_0)^{-1} s(X; \theta_0)$, the efficient influence function. Influence functions parameterize the entire space of regular estimators in a parametric model, and they are the central object of semiparametric efficiency theory.
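The asymptotically linear representation can be checked numerically for the exponential-rate MLE, whose efficient influence function is $\varphi(x) = I(\lambda)^{-1} s(x; \lambda) = \lambda - \lambda^2 x$. A sketch with illustrative constants:

```python
import numpy as np

rng = np.random.default_rng(6)
n, lam = 5000, 2.0
x = rng.exponential(1 / lam, size=n)
lam_hat = 1 / x.mean()  # exponential-rate MLE

# Efficient influence function of the rate MLE:
# phi(x) = I(lam)^{-1} s(x; lam) = lam^2 * (1/lam - x) = lam - lam^2 * x.
phi = lam - lam**2 * x

lhs = np.sqrt(n) * (lam_hat - lam)  # sqrt(n)(theta-hat - theta_0)
rhs = np.sqrt(n) * phi.mean()       # (1/sqrt(n)) sum_i phi(X_i)
print(lhs, rhs)  # close; the gap is the o_p(1) remainder
```

The empirical variance of `phi` also approximates $I(\lambda)^{-1} = \lambda^2$, the MLE's asymptotic variance.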

Definition

Asymptotic Relative Efficiency

The asymptotic relative efficiency (ARE) of estimator $T_n$ relative to $S_n$, when both are $\sqrt{n}$-consistent and asymptotically normal with variances $\sigma_T^2$ and $\sigma_S^2$, is

$$\text{ARE}(T_n, S_n) = \frac{\sigma_S^2}{\sigma_T^2}.$$

In this asymptotic-variance sense, $T_n$ is more efficient than $S_n$ if and only if $\text{ARE} > 1$. The MLE achieves $\text{ARE} = 1$ relative to the Cramér-Rao bound by construction (under regularity). For testing, Pitman ARE is the analogous notion based on local power, and it equals the estimation ARE for tests derived from estimators (a striking equivalence due to Pitman 1949 and Noether 1955).
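The classic example is the mean versus the median at the normal, where the ARE of the mean relative to the median is $\pi/2 \approx 1.57$ (the median's asymptotic variance is $1/(4 f(0)^2) = \pi/2$). A simulation sketch with illustrative sample sizes:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 200, 20000
x = rng.normal(0.0, 1.0, size=(reps, n))

# n * Var(estimator) approximates the asymptotic variance.
var_mean = n * x.mean(axis=1).var()       # -> 1
var_med = n * np.median(x, axis=1).var()  # -> pi/2 = 1/(4 f(0)^2)
are = var_med / var_mean                  # ARE of mean relative to median
print(are)  # near pi/2 = 1.571 at the normal
```

At heavier-tailed populations the ordering flips, which is the usual motivation for robust location estimators.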

Definition

Contiguity

Two sequences of probability measures $P_n$ and $Q_n$ on possibly different sample spaces satisfy $P_n \triangleleft Q_n$ ($P_n$ is contiguous with respect to $Q_n$) if and only if for every sequence of measurable sets $A_n$, $Q_n(A_n) \to 0$ implies $P_n(A_n) \to 0$. They are mutually contiguous if both $P_n \triangleleft Q_n$ and $Q_n \triangleleft P_n$.

Under LAN, $P_{\theta_0 + h/\sqrt{n}}^{\otimes n}$ and $P_{\theta_0}^{\otimes n}$ are mutually contiguous for every fixed $h$. Le Cam's first lemma characterizes contiguity in terms of the limiting log-likelihood ratio: $Q_n \triangleleft P_n$ holds iff the likelihood ratios $L_n = dQ_n / dP_n$ are uniformly integrable under $P_n$. The third lemma converts statements about $P_n$-distributions into statements about $Q_n$-distributions via the joint limit of $(T_n, \log L_n)$.

Worked Examples

Example

Delta method: variance-stabilizing transform for Poisson

If $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$, then $\sqrt{n}(\bar X - \lambda) \xrightarrow{d} N(0, \lambda)$. The variance depends on $\lambda$, so a confidence interval on $\lambda$ requires a plug-in estimate. Apply the delta method with $g(x) = \sqrt{x}$: $g'(\lambda) = 1/(2\sqrt{\lambda})$, so

$$\sqrt{n}(\sqrt{\bar X} - \sqrt{\lambda}) \xrightarrow{d} N(0, \lambda \cdot 1/(4\lambda)) = N(0, 1/4).$$

The variance is now constant. Confidence intervals on $\sqrt{\lambda}$ do not need a plug-in step; transform back at the end. The same idea gives the Anscombe transform $g(x) = \sqrt{x + 3/8}$ for finite-$n$ correction and the Freeman-Tukey $g(x) = \sqrt{x} + \sqrt{x+1}$ for ratios of Poissons.

[Figure: Poisson($\lambda$) samples before and after the Anscombe transform $\sqrt{y + 3/8}$; the transformed panel keeps variance near $1/4$ regardless of $\lambda$.]

Left panel: raw Poisson samples have variance $\lambda$ (mean equals variance). Right panel: the Anscombe transform $\sqrt{Y + 3/8}$ yields variance close to $1/4$ across $\lambda$; as $\lambda$ grows, the raw variance scales up while the transformed variance stays put.

Example

MLE for the exponential rate, with confidence interval

If $X_1, \ldots, X_n \sim \text{Exp}(\lambda)$ (density $\lambda e^{-\lambda x}$), the MLE is $\hat\lambda = 1/\bar X$. The log-likelihood is $\ell(\lambda) = n \log \lambda - \lambda \sum X_i$, so the score is $s(X; \lambda) = 1/\lambda - X$ and the Fisher information is $I(\lambda) = E[(1/\lambda - X)^2] = \text{Var}(X) = 1/\lambda^2$. By asymptotic normality of the MLE,

$$\sqrt{n}(\hat\lambda - \lambda) \xrightarrow{d} N(0, \lambda^2).$$

A naive 95% CI is $\hat\lambda \pm 1.96\,\hat\lambda/\sqrt{n}$. A better CI uses the variance-stabilizing transform $g(\lambda) = \log \lambda$: $\sqrt{n}(\log \hat\lambda - \log \lambda) \xrightarrow{d} N(0, 1)$, giving the CI $\hat\lambda \cdot \exp(\pm 1.96/\sqrt{n})$, which is symmetric on the log scale and always positive.
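A coverage comparison of the two intervals can be sketched by simulation; the true rate, sample size, and replication count below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n, reps, z = 2.5, 40, 20_000, 1.96

lam_hat = 1.0 / rng.exponential(1.0 / lam, size=(reps, n)).mean(axis=1)

# Naive Wald interval: lam_hat +/- z * lam_hat / sqrt(n).
half = z * lam_hat / np.sqrt(n)
cover_naive = np.mean((lam_hat - half <= lam) & (lam <= lam_hat + half))

# Log-scale interval: lam_hat * exp(+/- z / sqrt(n)); always positive.
lo = lam_hat * np.exp(-z / np.sqrt(n))
hi = lam_hat * np.exp(z / np.sqrt(n))
cover_log = np.mean((lo <= lam) & (lam <= hi))

print(cover_naive, cover_log)
```

Both intervals fall somewhat short of the nominal 95% at $n = 40$, reflecting the skew of $\hat\lambda$; the log-scale interval tends to come closer.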

Example

Sandwich variance for misspecified MLE

Suppose you fit an exponential model to data that are actually Gamma with shape $\alpha \neq 1$. The "MLE" $\hat\lambda = 1/\bar X$ converges in probability to $\lambda_0 = 1/E_P[X]$ (the least-false parameter), not to anything intrinsic to the Gamma. The naive variance estimate $\hat\lambda^2/n$ relies on the information identity, which fails under misspecification, so it is generally wrong. The sandwich variance is

$$\text{Var}_{\text{sand}}(\hat\lambda) = \frac{1}{n} V(\lambda_0)^{-1} A(\lambda_0) V(\lambda_0)^{-1}, \quad V(\lambda_0) = -E[\partial \psi / \partial \lambda] = 1/\lambda_0^2, \quad A(\lambda_0) = E[\psi^2] = \text{Var}(X).$$

For Gamma$(\alpha, \alpha\lambda_0)$ in the rate parameterization (so that $E[X] = 1/\lambda_0$), $\text{Var}(X) = 1/(\alpha\lambda_0^2)$, giving $\text{Var}_{\text{sand}} = \frac{1}{n}\,\lambda_0^4 \cdot \frac{1}{\alpha\lambda_0^2} = \lambda_0^2/(n\alpha)$. The naive exponential-model variance would have given $\lambda_0^2/n$, which is $\alpha$ times too large when $\alpha > 1$ and too small when $\alpha < 1$. This is the standard "robust SEs differ from model-based SEs under misspecification" phenomenon.
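The sandwich formula can be checked by Monte Carlo; the shape, rate, sample size, and replication count below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, lam0, n, reps = 4.0, 1.5, 200, 40_000

# Gamma(shape=alpha, scale=1/(alpha*lam0)): mean 1/lam0, var 1/(alpha*lam0^2).
x = rng.gamma(alpha, 1.0 / (alpha * lam0), size=(reps, n))
lam_hat = 1.0 / x.mean(axis=1)        # least-false "MLE" of the exponential rate

emp_var = lam_hat.var()               # Monte Carlo truth
sandwich = lam0**2 / (n * alpha)      # model-robust formula
naive = lam0**2 / n                   # information-identity formula: wrong here

print(emp_var, sandwich, naive)
```

The empirical variance of $\hat\lambda$ matches the sandwich value and is a factor $\alpha$ smaller than the naive model-based value, as the example predicts.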

Common Confusions

Watch Out

Asymptotic normality is not a property of the estimator

Asymptotic normality is a property of the sequence $\sqrt{n}(\hat\theta_n - \theta_0)$, not of the estimator at any fixed sample size. The actual finite-sample distribution of an MLE can be heavily skewed, multimodal, or heavy-tailed even when the asymptotic distribution is a clean normal. Bootstrap or exact methods should be used for small samples. The asymptotic approximation is a tool, not a fact about $\hat\theta_n$ at $n = 30$.
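As an illustrative sketch, the exponential-rate MLE from the worked example above is visibly right-skewed at $n = 30$ even though its asymptotic law is normal (the rate and replication count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
lam, n, reps = 1.0, 30, 50_000

lam_hat = 1.0 / rng.exponential(1.0 / lam, size=(reps, n)).mean(axis=1)

# Standardized third moment: ~0 for a normal; clearly positive here.
z = (lam_hat - lam_hat.mean()) / lam_hat.std()
skew = (z**3).mean()
print(lam_hat.mean(), skew)
```

The MLE is also biased upward at this sample size ($E[\hat\lambda] = n\lambda/(n-1)$ exactly), another finite-sample feature invisible to the first-order asymptotics.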

Watch Out

Efficiency is asymptotic; finite-sample winners can be biased

The MLE is asymptotically efficient under regularity, but finite-sample efficiency is a different question. James and Stein (1961) showed that for estimating a multivariate normal mean with $k \geq 3$, the MLE $\hat\mu = \bar X$ is dominated in mean squared error, for every $\mu$, by the biased shrinkage estimator $\hat\mu_{\text{JS}} = (1 - (k-2)/\|\bar X\|^2)\,\bar X$. The JS estimator is asymptotically equivalent to the MLE (both have asymptotic variance $I^{-1}/n$), but it strictly improves finite-sample MSE by trading bias for variance.
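A minimal numerical check of the dominance claim at one arbitrary $\mu$ (the theorem holds at every $\mu$; a simulation can only probe one point at a time). The setup assumes a single $N(\mu, I_k)$ observation per replication:

```python
import numpy as np

rng = np.random.default_rng(5)
k, reps = 10, 50_000
mu = np.full(k, 0.5)                      # arbitrary true mean vector

x = rng.normal(mu, 1.0, size=(reps, k))   # one N(mu, I_k) observation per rep
shrink = 1.0 - (k - 2) / (x**2).sum(axis=1)
js = shrink[:, None] * x                  # James-Stein (not positive-part)

mse_mle = ((x - mu) ** 2).sum(axis=1).mean()   # E||X - mu||^2 = k
mse_js = ((js - mu) ** 2).sum(axis=1).mean()   # strictly smaller
print(mse_mle, mse_js)
```

At this $\mu$, near the origin, the improvement is dramatic; the gain shrinks as $\|\mu\|$ grows but never becomes a loss.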

Watch Out

Convergence in distribution does not imply convergence of moments

$X_n \xrightarrow{d} X$ does not imply $E[X_n] \to E[X]$ or $\text{Var}(X_n) \to \text{Var}(X)$. The classical counterexample is $X_n = n \cdot \mathbb{1}\{U \leq 1/n\}$ for $U \sim \text{Uniform}(0,1)$: $X_n \xrightarrow{d} 0$ (since $X_n = 0$ with probability $1 - 1/n$), but $E[X_n] = 1$ for every $n$. The fix is uniform integrability: if $\{X_n\}$ is uniformly integrable, then convergence in distribution implies convergence of expectations. In practice, asymptotic-normality results give the limiting distribution; whether the variance of $\sqrt{n}(\hat\theta_n - \theta_0)$ actually converges to $I^{-1}$ requires a separate argument.
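The counterexample runs in a few lines; the sample sizes tried are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)
reps = 2_000_000
results = {}

for n in (10, 1000, 100_000):
    u = rng.uniform(size=reps)
    x_n = n * (u <= 1.0 / n)              # X_n = n * 1{U <= 1/n}
    # P(X_n = 0) -> 1 (so X_n ->d 0), yet E[X_n] stays ~1 for every n.
    results[n] = ((x_n == 0).mean(), x_n.mean())
    print(n, results[n])
```

The mass escaping to $+\infty$ carries exactly one unit of expectation at every $n$, which is what uniform integrability rules out.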

Watch Out

Le Cam regularity (DQM) is weaker than classical regularity

Pointwise differentiability of $\log f(x; \theta)$ in $\theta$, combined with bounded-by-integrable-envelope conditions, is the textbook "regularity" assumption (Lehmann-Casella, Cox-Hinkley). Le Cam's differentiability in quadratic mean (DQM) is weaker: it requires only that $\sqrt{f}$ be differentiable in $L^2(\mu)$ as a function of $\theta$. Many models satisfy DQM without satisfying the textbook conditions — the Laplace location family is DQM although its density is not differentiable at the center, $\text{Beta}(\alpha, \beta)$ (fixed support $[0,1]$) is DQM despite non-smooth densities at the endpoints, and many semiparametric models qualify as well — while genuinely irregular models still fail: Uniform on $[0, \theta]$ is not DQM because the support moves with $\theta$. Modern asymptotic theory works at the DQM level; classical regularity is a sufficient-but-not-necessary special case.

Watch Out

Super-efficient estimators exist but only on a measure-zero set

Hodges (1951) constructed an estimator that is asymptotically normal with variance $I(\theta_0)^{-1}/n$ for every $\theta_0 \neq 0$ and asymptotic variance $0$ at $\theta_0 = 0$. This violates the Cramér-Rao bound at the single point $\theta = 0$. The resolution: the Hodges estimator is not regular (its asymptotic distribution depends discontinuously on contiguous perturbations $\theta_0 + h/\sqrt{n}$). Le Cam's convolution theorem says: among regular estimators (those with continuous-in-$h$ asymptotic distributions), no estimator can beat $I^{-1}$. Super-efficiency is a measure-theoretic curiosity, not a practical estimation tool, and the Hodges estimator has terrible local risk near zero (the local-minimax risk over a $1/\sqrt{n}$-neighborhood is unbounded).
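A simulation sketch of the bad local risk, using the standard construction $\hat\theta_H = \bar X \cdot \mathbb{1}\{|\bar X| \geq n^{-1/4}\}$ for the $N(\theta, 1)$ mean (the $n^{-1/4}$ threshold is one conventional choice, not fixed by the text above):

```python
import numpy as np

rng = np.random.default_rng(7)
reps = 40_000

def hodges_risk(theta, n):
    """Normalized risk n * MSE of the Hodges estimator at a given theta."""
    xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)  # Xbar ~ N(theta, 1/n)
    est = np.where(np.abs(xbar) >= n ** -0.25, xbar, 0.0)  # truncate to zero
    return n * ((est - theta) ** 2).mean()

for n in (100, 10_000):
    # Superb at theta = 0, terrible at the contiguous point theta = n^(-1/4).
    print(n, hodges_risk(0.0, n), hodges_risk(n ** -0.25, n))
```

The normalized risk at $\theta = 0$ collapses toward zero while the risk at $\theta = n^{-1/4}$ grows without bound, in line with the unbounded local-minimax risk.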

Watch Out

MLE consistency requires more than just asymptotic normality of the score

The standard "expand the score equation around $\theta_0$" derivation of asymptotic normality assumes $\hat\theta_n \xrightarrow{p} \theta_0$. Establishing consistency is a separate (often harder) step, requiring either (a) compactness of $\Theta$ plus continuity of $\theta \mapsto E[\log f(X; \theta)]$ plus identifiability, (b) Wald's classical consistency proof using a uniform LLN, or (c) the modern empirical-process approach via Glivenko-Cantelli plus a well-separated maximum at $\theta_0$. Skipping the consistency step and "deriving asymptotic normality" from the score equation alone is a circular argument.
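A small illustration of the compactness route for the exponential model of the earlier example: maximize the empirical criterion $M_n(\theta) = n^{-1}\sum_i \log f(X_i; \theta)$ over a compact grid and watch the argmax settle on $\theta_0$ as $n$ grows (the grid and sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(8)
theta0 = 2.0
grid = np.linspace(0.2, 6.0, 400)     # compact parameter set Theta

def argmax_loglik(n):
    x = rng.exponential(1.0 / theta0, size=n)
    # Exponential average log-likelihood on the grid: log(th) - th * xbar.
    Mn = np.log(grid) - grid * x.mean()
    return grid[Mn.argmax()]

ests = [argmax_loglik(n) for n in (50, 500, 50_000)]
print(ests)
```

The uniform LLN makes $M_n$ track its population counterpart $M(\theta) = \log\theta - \theta/\theta_0$ over the whole grid, which is what forces the argmax toward $\theta_0$.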

Summary

  • Convergence calculus: CMT, Slutsky, delta method (incl. multivariate and second-order). These are the algebraic moves that let you turn one CLT into many applied confidence intervals.
  • M-/Z-estimators: $\hat\theta_n$ is asymptotically normal with sandwich variance $V^{-1} A V^{-\top}$, where $V$ is the derivative of the population estimating function and $A$ is its variance. MLE, OLS, GMM, quantile regression, and Huber-M are special cases.
  • MLE asymptotics: under correct specification and regularity, $\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} N(0, I(\theta_0)^{-1})$. The information identity collapses the sandwich.
  • LAN (Le Cam): every regular parametric problem looks like a Gaussian shift experiment at the $1/\sqrt{n}$ scale. This gives the asymptotic minimax bound and the Wald/score/LRT equivalence.
  • Wald, score, LRT: all three converge to $\chi^2_k$ under $H_0$ and to non-central $\chi^2_k(\lambda)$ under contiguous alternatives. Choose among them on finite-sample, robustness, and computational grounds.
  • Influence functions parameterize the entire space of regular estimators; the efficient influence function is $I^{-1} s$.

Exercises

ExerciseCore

Problem

Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and $\hat p = \bar X$. Use the delta method to find the asymptotic distribution of the log-odds $g(\hat p) = \log(\hat p / (1 - \hat p))$, then construct a 95% confidence interval for the log-odds and transform back to a CI for $p$ that respects $p \in [0, 1]$.

ExerciseAdvanced

Problem

Compute the influence function and asymptotic variance of Huber's robust location estimator with tuning constant $c$, defined by

$$\hat\theta_n \;\; \text{such that} \;\; \sum_i \psi_c(X_i - \hat\theta_n) = 0, \quad \psi_c(u) = \begin{cases} u & |u| \leq c, \\ c \, \text{sign}(u) & |u| > c. \end{cases}$$

Specialize to $X_i \sim N(\theta, \sigma^2)$. What is the ARE of Huber-M relative to the sample mean as $c \to \infty$? As $c \to 0$?

ExerciseAdvanced

Problem

Show that under the LAN expansion, the Wald statistic $W_n = n (\hat\theta_n - \theta_0)^\top I(\theta_0) (\hat\theta_n - \theta_0)$ and the score statistic $R_n = \Delta_n^\top I(\theta_0)^{-1} \Delta_n$ differ by $o_p(1)$ under $H_0 : \theta = \theta_0$, and that both equal the LRT statistic $-2 \log \Lambda_n$ up to $o_p(1)$.

ExerciseAdvanced

Problem

Consider the change-point model $X_i \sim N(\mu_1, 1)$ for $i \leq \tau n$ and $X_i \sim N(\mu_2, 1)$ for $i > \tau n$, with unknown change point $\tau \in (0, 1)$ and known $\mu_1 \neq \mu_2$. Show that the MLE of $\tau$ converges at rate $n$ (not $\sqrt{n}$) and identify the limiting distribution.

ExerciseResearch

Problem

In the LAN expansion, the experiment at local scale $h/\sqrt{n}$ looks like observing $\Delta \sim N(I(\theta_0) h, I(\theta_0))$. (a) Derive the optimal estimator of $h$ and its risk in this Gaussian shift experiment, and recover the Cramér-Rao bound. (b) For a semiparametric model with parametric component $\theta$ and infinite-dimensional nuisance $\eta$, what is the analogue of $I(\theta_0)$ that controls the asymptotic variance of any regular estimator of $\theta$?

References

  • van der Vaart, Asymptotic Statistics (Cambridge, 1998). Chapters 2-3 (convergence calculus, delta method), Chapter 5 (M- and Z-estimators), Chapters 7-8 (local asymptotic normality, contiguity, convolution theorem), Chapter 16 (Wald/score/LRT equivalence). The standard graduate text and the source most cited above.
  • Lehmann & Casella, Theory of Point Estimation, 2nd ed. (Springer, 1998). Chapter 6 covers MLE asymptotic theory under classical regularity. Less general than van der Vaart but gives more finite-sample finesse.
  • Le Cam, Asymptotic Methods in Statistical Decision Theory (Springer, 1986). The original treatise on contiguity, LAN, and the convolution theorem. Dense and difficult, but the source.
  • Le Cam & Yang, Asymptotics in Statistics: Some Basic Concepts, 2nd ed. (Springer, 2000). A more readable distillation of Le Cam's program; Chapter 6 covers DQM and LAN.
  • Pollard, Convergence of Stochastic Processes (Springer, 1984). Chapters 4-5 develop the empirical-process tools (Glivenko-Cantelli, Donsker, stochastic equicontinuity) underlying modern M-estimator theory.
  • Bickel, Klaassen, Ritov, Wellner, Efficient and Adaptive Estimation for Semiparametric Models (Springer, 1993). The foundational text on semiparametric efficiency, projection-onto-tangent-spaces, and the efficient influence function.

Next Topics

  • Bootstrap methods: resampling as a finite-sample proxy for the asymptotic distribution; second-order accurate for smooth functionals.
  • Fisher information and Cramér-Rao bound: the geometric and information-theoretic perspective on the asymptotic variance lower bound.
  • Empirical processes and chaining: the modern technical machinery for M-estimator asymptotics with non-smooth $\psi$ and high-dimensional parameter spaces.
  • Semiparametric efficiency and TMLE: the infinite-dimensional generalization of LAN underlying modern causal-inference methods.

Last reviewed: April 25, 2026
