
Statistical Estimation

Asymptotic Statistics: M-Estimators, Delta Method, LAN

The large-sample toolbox for statistical inference: continuous mapping theorem, Slutsky, the delta method, M- and Z-estimator consistency and asymptotic normality, MLE as a special M-estimator, local asymptotic normality (Le Cam), the asymptotic equivalence of Wald / score / likelihood-ratio tests, and influence-function representations. These results justify essentially every confidence interval, standard error, and p-value in applied statistics, and they are the language of modern semiparametric theory.

Advanced · Tier 1 · Stable · Core spine · ~90 min

Why This Matters

Three asymptotically equivalent ways to measure how far the data carries the parameter from $H_0$ on the same quadratic log-likelihood.

[Figure: quadratic log-likelihood $\ell(\theta) - \ell(\hat\theta)$ plotted against $\theta$, marking $\hat\theta$ (MLE) and $\theta_0$ ($H_0$). Wald: $n(\hat\theta - \theta_0)^2 \cdot I(\theta_0)$. LRT: $-2\log\Lambda = 2[\ell(\hat\theta) - \ell(\theta_0)]$. Score: $\nabla\ell(\theta_0)^\top I^{-1} \nabla\ell(\theta_0)$, the slope of the tangent at $\theta_0$.]

Under LAN, all three statistics equal $\Delta_n^\top I^{-1} \Delta_n + o_p(1)$ and converge to $\chi^2_k$; they differ in finite samples and under misspecification.

Almost every confidence interval, standard error, and p-value in applied statistics relies on asymptotic theory. When you report $\hat\theta \pm 1.96 \cdot \widehat{\text{SE}}$, you are invoking asymptotic normality of $\hat\theta$. When you compare a likelihood-ratio statistic to a $\chi^2_k$ table, you are invoking Wilks' theorem. When you compute a sandwich standard error after a misspecified model fit, you are invoking M-estimator asymptotic normality with the model-robust variance formula. When you bootstrap, you are using a finite-sample proxy for an asymptotic distribution.

The page that follows develops the toolbox in the order it actually gets used. First: the four "plumbing" results (continuous mapping, Slutsky, delta method, joint convergence) that turn one CLT into a thousand applied results. Second: the M- and Z-estimator framework (van der Vaart Ch 5) that subsumes MLE, GMM, OLS, and quantile regression as one theorem. Third: Le Cam's local asymptotic normality, which says all regular parametric problems look like Gaussian shift experiments at the $1/\sqrt{n}$ scale and which underwrites the asymptotic minimax lower bound. Fourth: the Wald, score, and likelihood-ratio tests, which are asymptotically equivalent under the null but differ in finite samples and under misspecification.

The synthesis that anchors stats-PhD intuition: every regular estimator has an influence function that determines its asymptotic variance, and the MLE achieves the unique minimum-variance influence function (the efficient influence function), whose variance is the inverse Fisher information. Everything else is variation around this center.

Mental Model

Think of asymptotic statistics as three stacked layers.

  1. Convergence calculus. Convergence in distribution and in probability are partial-order-like relations on sequences of random variables. The continuous mapping theorem pushes them through continuous functions, Slutsky lets you absorb constants, the delta method pushes them through smooth deterministic maps. These are the algebraic moves you make to manipulate limits.

  2. Estimator asymptotics. Most estimators of interest are defined as minimizers (M-estimators) or zeros (Z-estimators) of a sample criterion:

    $$\hat\theta_n = \arg\min_\theta \frac{1}{n}\sum_i m(X_i; \theta) \quad\text{or}\quad \frac{1}{n}\sum_i \psi(X_i; \hat\theta_n) = 0.$$

    Under regularity conditions they are $\sqrt{n}$-consistent and asymptotically normal with a variance you can write down in closed form. MLE is the special case $\psi = \nabla \log f$, with variance $I(\theta_0)^{-1}$.

  3. Local geometry. Around the truth, the statistical experiment looks Gaussian at the $1/\sqrt{n}$ scale. Local asymptotic normality (Le Cam) makes this precise via the contiguity of nearby measures and the convergence of likelihood ratios to the log-likelihood ratio of a Gaussian shift. The asymptotic minimax bound, the Wald/score/LRT equivalence, and semiparametric efficiency theory all sit on this layer.

Notation

Throughout: $\xrightarrow{d}$ for convergence in distribution, $\xrightarrow{p}$ for convergence in probability, $\xrightarrow{a.s.}$ for almost sure. $X_n = O_p(a_n)$ means $X_n / a_n$ is bounded in probability ($\sup_n P(|X_n/a_n| > M) \to 0$ as $M \to \infty$). $X_n = o_p(a_n)$ means $X_n / a_n \xrightarrow{p} 0$. We write $\ell(\theta) = \log f(X; \theta)$ for a single-observation log-likelihood and $\ell_n(\theta) = \sum_{i=1}^n \log f(X_i; \theta)$ for the sample log-likelihood. The score is $s(X; \theta) = \nabla_\theta \log f(X; \theta)$, with sample average $\bar s_n(\theta) = \frac{1}{n}\sum_i s(X_i; \theta)$. Fisher information is $I(\theta) = \mathbb{E}_\theta[s(X; \theta) s(X; \theta)^\top] = -\mathbb{E}_\theta[\nabla^2 \log f(X; \theta)]$, with the equality holding under regularity (the information identity).

The Convergence Calculus

Theorem

Continuous Mapping Theorem (CMT)

Statement

Let $X_n \xrightarrow{d} X$ in a metric space and let $g$ be measurable and continuous at every point of a set $C$ with $P(X \in C) = 1$. Then

$$g(X_n) \xrightarrow{d} g(X).$$

The same holds with $\xrightarrow{d}$ replaced by $\xrightarrow{p}$ or $\xrightarrow{a.s.}$.

Intuition

Continuous functions preserve convergence. The "continuous on a set of $X$-measure one" condition allows $g$ to be discontinuous at points the limit never visits, which matters in practice (e.g., $g(x) = 1/x$ when $X$ is positive almost surely).

Proof Sketch

Use the Portmanteau theorem: $X_n \xrightarrow{d} X$ iff $E[f(X_n)] \to E[f(X)]$ for every bounded continuous $f$. For $g$ continuous on a $P_X$-measure-one set $C$, the composition $f \circ g$ is bounded and its discontinuity points lie in the complement of $C$, hence have $P_X$-measure zero. Apply Portmanteau to $f \circ g$ to conclude. See van der Vaart, Asymptotic Statistics, Theorem 2.3.

Why It Matters

CMT is the most-used convergence result. Every "transform an estimator through a smooth function" argument starts here. Slutsky and the delta method are corollaries.

Failure Mode

Discontinuity of $g$ at a point where the limit puts positive mass breaks the conclusion. Example: let $X_n = 1/n$ a.s., $X = 0$, and $g(x) = \mathbb{1}\{x = 0\}$. Then $X_n \xrightarrow{a.s.} 0$ but $g(X_n) = 0$ for all $n$ while $g(X) = 1$. The discontinuity at zero coincides with the limit's support.

Theorem

Slutsky's Theorem

Statement

If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$ for a deterministic constant $c$, then jointly $(X_n, Y_n) \xrightarrow{d} (X, c)$, and therefore for any continuous $f$,

$$f(X_n, Y_n) \xrightarrow{d} f(X, c).$$

The standard corollaries:

$$X_n + Y_n \xrightarrow{d} X + c, \quad X_n Y_n \xrightarrow{d} cX, \quad X_n / Y_n \xrightarrow{d} X/c \;\; (c \neq 0).$$

Intuition

A sequence converging in probability to a constant behaves like that constant in the limit. Marginal convergence $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{d} Y$ does not imply joint convergence in general; Slutsky works precisely because the $Y_n$ limit is degenerate (a point mass), so there is only one admissible coupling.

Proof Sketch

$Y_n \xrightarrow{p} c$ gives joint convergence $(X_n, Y_n) \xrightarrow{d} (X, c)$: for bounded Lipschitz $f$, $|E[f(X_n, Y_n)] - E[f(X_n, c)]|$ is controlled by $P(|Y_n - c| > \varepsilon)$ plus an $\varepsilon$-term, and $E[f(X_n, c)] \to E[f(X, c)]$ by marginal convergence. Apply CMT to the continuous map $f$. See van der Vaart Lemma 2.8.

Why It Matters

Slutsky is the glue that holds asymptotic arguments together. Every time you replace $\sigma$ with $\hat\sigma$ in a $z$-statistic and claim the limit is still standard normal, you are using Slutsky.

Failure Mode

Slutsky fails if $Y_n$ has a non-degenerate limit. Example: if $X_n, Y_n$ are independent standard normals, $X_n \xrightarrow{d} N(0,1)$ and $Y_n \xrightarrow{d} N(0,1)$, but $X_n + Y_n$ has variance $2$, not the variance $1$ that "treating $Y_n$ as $0$" would suggest. The fix is joint convergence and CMT applied to the joint limit.
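A short simulation illustrates the studentization move Slutsky licenses: replacing the unknown $\sigma$ with $\hat\sigma$ leaves the limit standard normal. This is a sketch, not part of the original text; the Exp(1) population and sample sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 400, 10000

# reps samples of size n from Exp(1): mean 1, sd 1, but pretend sd is unknown.
x = rng.exponential(1.0, size=(reps, n))
xbar = x.mean(axis=1)
s = x.std(axis=1, ddof=1)  # sigma-hat, converges in probability to sigma = 1

# Slutsky: sqrt(n)(xbar - 1)/s has the same N(0, 1) limit as
# sqrt(n)(xbar - 1)/sigma, because s ->p sigma.
t = np.sqrt(n) * (xbar - 1.0) / s
coverage = np.mean(np.abs(t) <= 1.96)
print(round(coverage, 3))  # near the nominal 0.95
```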

Theorem

Delta Method (univariate and multivariate)

Statement

Univariate. If $\sqrt{n}(T_n - \mu) \xrightarrow{d} N(0, \sigma^2)$ and $g : \mathbb{R} \to \mathbb{R}$ is differentiable at $\mu$ with $g'(\mu) \neq 0$, then

$$\sqrt{n}(g(T_n) - g(\mu)) \xrightarrow{d} N(0, [g'(\mu)]^2 \sigma^2).$$

Multivariate. If $\sqrt{n}(T_n - \mu) \xrightarrow{d} N_k(0, \Sigma)$ and $g : \mathbb{R}^k \to \mathbb{R}^m$ has Jacobian $G(\mu) = \nabla g(\mu) \in \mathbb{R}^{m \times k}$ at $\mu$, then

$$\sqrt{n}(g(T_n) - g(\mu)) \xrightarrow{d} N_m\!\left(0, G(\mu) \Sigma G(\mu)^\top\right).$$

Second-order. If additionally $g'(\mu) = 0$ and $g''(\mu) \neq 0$, then

$$n(g(T_n) - g(\mu)) \xrightarrow{d} \tfrac{1}{2} g''(\mu) \, \sigma^2 \chi^2_1.$$

The convergence rate changes from $\sqrt{n}$ to $n$ and the limit is no longer normal.

Intuition

A smooth function of an approximately normal random vector is itself approximately normal, with covariance pushed forward through the Jacobian. The second-order version covers the degenerate case where the gradient vanishes (e.g., the variance of the sample variance at the true variance has this structure).

Proof Sketch

Univariate: Taylor expand $g(T_n) = g(\mu) + g'(\mu)(T_n - \mu) + R_n$, where differentiability at $\mu$ gives $R_n = o_p(|T_n - \mu|) = o_p(1/\sqrt{n})$. Multiply by $\sqrt{n}$: $\sqrt{n}(g(T_n) - g(\mu)) = g'(\mu) \sqrt{n}(T_n - \mu) + o_p(1)$. Apply Slutsky to absorb the remainder.

Multivariate: same argument with the gradient replaced by the Jacobian $G(\mu)$. The push-forward of a Gaussian through a linear map is Gaussian with covariance $G \Sigma G^\top$.

Second-order: when $g'(\mu) = 0$ the Taylor expansion gives $g(T_n) = g(\mu) + \frac{1}{2} g''(\mu)(T_n - \mu)^2 + o_p(|T_n - \mu|^2)$. Multiplying by $n$ and using $\sqrt{n}(T_n - \mu) \xrightarrow{d} N(0, \sigma^2)$ with CMT gives $n(T_n - \mu)^2 \xrightarrow{d} \sigma^2 \chi^2_1$.

Why It Matters

The delta method is the workhorse of applied statistics. Variance of $\log(\hat p)$, of $1/\bar X$, of $\hat p_1 / \hat p_2$, of any smooth functional of moments. It is also the basis for variance-stabilizing transformations (the Anscombe and Freeman-Tukey families come from solving an ODE in $g'$).

Failure Mode

Three failure modes. (1) $g'(\mu) = 0$: use the second-order version. (2) $g$ not differentiable at $\mu$: e.g., $g(x) = |x|$ at $0$ gives a folded-normal limit, not a normal. (3) $T_n$ converges at a non-$\sqrt{n}$ rate: the delta method must use that rate, not $\sqrt{n}$, in the expansion.
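A numerical check of the first-order formula, using $g = \log$ on a binomial proportion. This sketch is not from the original text; the values of $n$, $p$, and the replication count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, p = 500, 40000, 0.3

phat = rng.binomial(n, p, size=reps) / n

# Delta method with g = log: Var(log phat) ~ [g'(p)]^2 p(1-p)/n = (1-p)/(n p).
predicted = (1 - p) / (n * p)
empirical = np.log(phat).var()
print(predicted, empirical)  # close
```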

[Figure: tangent linearization of $g$ around $\mu$; an input spread $\sigma/\sqrt{n}$ on the $x$-axis maps to an output spread $|g'(\mu)| \cdot \sigma/\sqrt{n}$ on the $y$-axis.]

Ribbon on the x-axis: input spread $\pm\sigma/\sqrt{n}$. Ribbon on the y-axis: output spread $\pm|g'(\mu)| \cdot \sigma/\sqrt{n}$, the delta-method prediction. The dashed tangent line is the linearization the method uses; the curve is the true $g$. The approximation tightens as $\sigma/\sqrt{n}$ shrinks.

M- and Z-Estimators: The Unified Framework

Most estimators of interest in statistics arise as either M-estimators (arg-min of a sample criterion) or Z-estimators (zero of a sample estimating equation). Under regularity these are equivalent (set $\psi = \nabla m$), but the Z-formulation is more convenient for asymptotic analysis because it avoids second-order arg-min arguments.

Definition

M-estimator and Z-estimator

Given iid data $X_1, \ldots, X_n$ from $P$ and a parameter space $\Theta \subset \mathbb{R}^k$, an M-estimator is

$$\hat\theta_n = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n m(X_i; \theta).$$

A Z-estimator is any solution

$$\hat\theta_n \in \Theta \;\; \text{such that} \;\; \frac{1}{n} \sum_{i=1}^n \psi(X_i; \hat\theta_n) = 0,$$

where $\psi : \mathcal{X} \times \Theta \to \mathbb{R}^k$ is the estimating function. The population analogue is $\theta_0 = \arg\min M(\theta)$ where $M(\theta) = E_P[m(X; \theta)]$, or equivalently $\Psi(\theta_0) = 0$ where $\Psi(\theta) = E_P[\psi(X; \theta)]$.

Examples:

  • MLE. $m(X; \theta) = -\log f(X; \theta)$, $\psi(X; \theta) = -\nabla \log f(X; \theta) = -s(X; \theta)$.
  • OLS. $m(X, Y; \beta) = (Y - X^\top \beta)^2$, $\psi(X, Y; \beta) = -X(Y - X^\top \beta)$.
  • GMM (just-identified). $\psi(X; \theta) =$ the moment conditions evaluated at $\theta$.
  • Quantile regression. $\psi(X, Y; \beta) = X(\tau - \mathbb{1}\{Y < X^\top \beta\})$ for the $\tau$-th quantile.
  • Huber's robust location. $\psi(X; \theta) = \psi_c(X - \theta)$, where $\psi_c$ is the Huber influence function (linear inside $[-c, c]$, capped outside).
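The last example can be made concrete with a few lines of code: solving the Huber Z-equation by Newton iteration. This sketch is not from the original text; the tuning constant $c = 1.345$, the contamination fractions, and the function names are illustrative.

```python
import numpy as np

def huber_psi(u, c=1.345):
    # Huber's psi: identity inside [-c, c], clipped to +/-c outside.
    return np.clip(u, -c, c)

def huber_location(x, c=1.345, tol=1e-10, max_iter=200):
    # Solve the Z-equation (1/n) sum_i psi_c(x_i - theta) = 0.
    # Newton step: the derivative of the sample estimating equation in theta
    # is minus the fraction of points inside the clipping band.
    theta = np.median(x)  # robust starting point
    for _ in range(max_iter):
        r = x - theta
        step = huber_psi(r, c).mean() / np.mean(np.abs(r) <= c)
        theta += step
        if abs(step) < tol:
            break
    return theta

rng = np.random.default_rng(2)
# 90% N(0, 1) plus 10% gross outliers near 20.
x = np.concatenate([rng.normal(0, 1, 900), rng.normal(20, 1, 100)])
print(x.mean())           # dragged toward the outliers, near 2
print(huber_location(x))  # stays near the bulk of the data
```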
Theorem

M-Estimator Asymptotic Normality

Statement

Under the stated regularity conditions, the Z-estimator $\hat\theta_n$ solving $\frac{1}{n}\sum_i \psi(X_i; \hat\theta_n) = 0$ satisfies

$$\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} N\!\left(0, V(\theta_0)^{-1} A(\theta_0) V(\theta_0)^{-\top}\right),$$

where $V(\theta_0) = \nabla_\theta \Psi(\theta_0) = E_P[\nabla_\theta \psi(X; \theta_0)]$ and $A(\theta_0) = E_P[\psi(X; \theta_0)\psi(X; \theta_0)^\top]$.

The variance is the sandwich form $V^{-1} A V^{-\top}$, with $V$ the "bread" and $A$ the "meat." When the model is correctly specified and $\psi = -s$ is the negative score (the MLE case), the information identity gives $V(\theta_0) = A(\theta_0) = I(\theta_0)$, the sandwich collapses to $I(\theta_0)^{-1}$, and the variance equals the inverse Fisher information.

Intuition

Expand the estimating equation around $\theta_0$:

$$0 = \frac{1}{n}\sum_i \psi(X_i; \hat\theta_n) \approx \frac{1}{n}\sum_i \psi(X_i; \theta_0) + V(\theta_0) (\hat\theta_n - \theta_0).$$

Solve for $\hat\theta_n - \theta_0$, multiply by $\sqrt{n}$:

$$\sqrt{n}(\hat\theta_n - \theta_0) \approx -V(\theta_0)^{-1} \cdot \frac{1}{\sqrt{n}}\sum_i \psi(X_i; \theta_0).$$

The right side is a CLT applied to mean-zero iid vectors with covariance $A(\theta_0)$, scaled by the deterministic matrix $-V(\theta_0)^{-1}$. The push-forward gives the sandwich variance.

Proof Sketch

Write the estimating equation with a mean-value-theorem step (or use a stochastic equicontinuity argument when $\psi$ is non-smooth). The linearization

$$0 = \bar \psi_n(\theta_0) + \overline{\nabla \psi}_n(\tilde\theta) (\hat\theta_n - \theta_0)$$

with $\tilde\theta$ between $\theta_0$ and $\hat\theta_n$, plus a uniform LLN on $\nabla \psi$, gives $\overline{\nabla \psi}_n(\tilde\theta) \xrightarrow{p} V(\theta_0)$. The CLT gives $\sqrt{n}\, \bar \psi_n(\theta_0) \xrightarrow{d} N(0, A(\theta_0))$. Apply Slutsky and CMT to invert $V$ and conclude. See van der Vaart Asymptotic Statistics Theorem 5.21.

A more refined version (Theorem 5.41, Z-estimators with non-smooth $\psi$) replaces pointwise differentiability with Hellinger differentiability of the model and stochastic equicontinuity of the empirical-process term. This handles quantile regression and other $\psi$ with discontinuities.

Why It Matters

This is the most general asymptotic-normality theorem in regular parametric inference. MLE asymptotic normality is the special case where the model is correctly specified and $\psi$ is the score. OLS, GMM, Huber's robust location, and quantile regression all follow as special cases with their own $V$ and $A$. The sandwich variance is also exactly what you should report when you suspect model misspecification: it remains valid when the information identity fails ($V \neq A$ in the convention $\psi = -s$), while the naive model-based variance $A^{-1}$ does not.

Failure Mode

Three failure modes worth naming. (1) Lack of identification: if $V(\theta_0)$ is singular, the asymptotic variance is undefined and the estimator may converge at a slower rate (cube-root asymptotics for maximum-score estimators, $n^{1/3}$ rate). (2) Boundary parameter: if $\theta_0$ is on the boundary of $\Theta$ the limit is not normal but a half-normal or, more generally, the projection of a normal onto a cone (Self & Liang 1987, JASA). (3) Misspecification: $\Psi(\theta_0) = 0$ defines the "least-false" parameter, but the sandwich variance still applies; the naive Fisher-information-based variance does not.
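The sandwich recipe above can be sketched for OLS with heteroskedastic errors, where the model-based variance is wrong but the sandwich is not. This code is illustrative, not from the original text; the data-generating process and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# Heteroskedastic errors: noise sd grows with |x|, so the homoskedastic
# model-based OLS variance is wrong; the sandwich (Eicker-Huber-White) is not.
y = 1.0 + 2.0 * x + rng.normal(size=n) * (0.5 + np.abs(x))

beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta

bread = np.linalg.inv(X.T @ X / n)            # V^{-1} (up to sign convention)
meat = (X * resid[:, None] ** 2).T @ X / n    # estimate of A = E[psi psi^T]
sandwich = bread @ meat @ bread / n           # V^{-1} A V^{-T} / n
naive = np.linalg.inv(X.T @ X) * resid.var()  # sigma^2 (X'X)^{-1}

se_sand = np.sqrt(np.diag(sandwich))
se_naive = np.sqrt(np.diag(naive))
print(beta, se_sand, se_naive)
```

With this design the robust slope standard error exceeds the naive one, because the noise is largest exactly where the regressor is most informative.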

MLE as a Special Case

Corollary

Asymptotic Normality of MLE

Statement

Under the regularity conditions above, the maximum likelihood estimator $\hat\theta_n$ satisfies

$$\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} N\!\left(0, I(\theta_0)^{-1}\right),$$

where $I(\theta_0) = E_{\theta_0}[s(X; \theta_0) s(X; \theta_0)^\top] = -E_{\theta_0}[\nabla^2 \log f(X; \theta_0)]$ is the Fisher information matrix, with the second equality holding by the information identity.

Intuition

The MLE is asymptotically normal, centered at the truth, with covariance $I(\theta_0)^{-1}/n$. This is the smallest variance any regular estimator can achieve (Cramér-Rao lower bound), so the MLE is asymptotically efficient. The result follows from the M-estimator theorem with $\psi = -s$, $V(\theta_0) = -E[\nabla s] = I(\theta_0)$, $A(\theta_0) = E[s s^\top] = I(\theta_0)$, and the sandwich $V^{-1} A V^{-\top}$ collapsing to $I^{-1} I I^{-1} = I^{-1}$.

Proof Sketch

Apply the M-estimator theorem with $\psi(X; \theta) = -\nabla \log f(X; \theta)$. The information identity (which holds under regularity, by differentiating $\int f(x; \theta)\, dx = 1$ twice and exchanging integration and differentiation) gives $V(\theta_0) = I(\theta_0) = A(\theta_0)$, so the sandwich variance reduces to $I(\theta_0)^{-1}$.

Why It Matters

Justifies the standard practice of reporting MLE point estimates with standard errors derived from the inverse observed information. It also underwrites likelihood-based inference: confidence intervals from inverting the Wald, score, or LRT statistic; profile-likelihood intervals for nuisance-parameter problems; and the asymptotic chi-squared distribution of $-2 \log \Lambda$ in nested-model testing.

Failure Mode

Regularity fails at boundary parameters (e.g., $\sigma = 0$ in $N(\mu, \sigma^2)$), non-identified models (mixture models with unknown component count), models where the support depends on the parameter (e.g., $\text{Uniform}(0, \theta)$, where $\hat\theta_{\text{MLE}} = X_{(n)}$ converges at rate $n$ to an exponential limit), and singular information ($I(\theta_0)$ not of full rank, e.g., at a point of non-identification in mixture models).

Le Cam Theory: Local Asymptotic Normality

Theorem

Local Asymptotic Normality (Le Cam)

Statement

Under differentiability in quadratic mean (DQM) at $\theta_0$, the log-likelihood ratio admits the LAN expansion

$$\log \frac{dP_{\theta_0 + h/\sqrt{n}}^{\otimes n}}{dP_{\theta_0}^{\otimes n}}(X_1, \ldots, X_n) = h^\top \Delta_n - \frac{1}{2} h^\top I(\theta_0) h + o_{P_{\theta_0}}(1),$$

where the central sequence $\Delta_n = \frac{1}{\sqrt{n}}\sum_i s(X_i; \theta_0)$ satisfies $\Delta_n \xrightarrow{d} N_k(0, I(\theta_0))$ under $P_{\theta_0}$.

Le Cam's three lemmas then give:

  1. The product measures $P_{\theta_0 + h/\sqrt{n}}^{\otimes n}$ and $P_{\theta_0}^{\otimes n}$ are mutually contiguous.
  2. Convergence in distribution under $P_{\theta_0}$ transfers to $P_{\theta_0 + h/\sqrt{n}}$ via the change-of-measure formula.
  3. The asymptotic experiment is the Gaussian shift experiment $\Delta \sim N_k(I(\theta_0) h, I(\theta_0))$: every regular procedure in the original problem has a corresponding procedure in the Gaussian experiment with the same asymptotic risk.

Intuition

At the local scale $h/\sqrt{n}$ around the truth, every regular parametric problem reduces to a Gaussian shift problem. The sufficient statistic is the central sequence $\Delta_n$ (the normalized score), the parameter is the local direction $h$, and the Fisher information sets the signal-to-noise ratio. This is the deepest structural result in classical parametric statistics: it says the MLE is asymptotically optimal not because of clever proofs but because it is the natural estimator in the limiting Gaussian experiment.

Differentiability in quadratic mean (DQM) is weaker than pointwise differentiability of $\log f$: it requires only that the square-root density $\sqrt{f}$ be differentiable in $L^2$ as a function of $\theta$. This handles models with bounded support, non-smooth densities, and many other "almost regular" cases where the textbook regularity conditions fail.

Proof Sketch

Hellinger / DQM expansion: under DQM at $\theta_0$ with score $s$,

$$\sqrt{f_{\theta_0 + h/\sqrt{n}}(x)} - \sqrt{f_{\theta_0}(x)} = \frac{h^\top s(x; \theta_0)}{2\sqrt{n}} \sqrt{f_{\theta_0}(x)} + R_n(x; h),$$

with $\|R_n(\cdot; h)\|_{L^2} = o(1/\sqrt{n})$. Square, sum over $i$, take logs, and Taylor expand $\log(1 + u) = u - u^2/2 + O(u^3)$. The linear term aggregates to $h^\top \Delta_n$; the quadratic term aggregates by a LLN-type argument to $-\frac{1}{2} h^\top I(\theta_0) h$. The remainder is $o_p(1)$ uniformly in $h$ on bounded sets. See van der Vaart Asymptotic Statistics Theorem 7.2 and Le Cam-Yang Asymptotics in Statistics Chapter 6.

Le Cam's three lemmas follow from the LAN expansion via standard arguments. Mutual contiguity follows from the boundedness in probability of the log-likelihood ratio; the change-of-measure formula comes directly from contiguity; the Gaussian shift representation follows from the joint convergence $(\log L_n, \Delta_n) \xrightarrow{d} (h^\top \Delta - \frac{1}{2} h^\top I h, \Delta)$.

Why It Matters

LAN underwrites the asymptotic minimax lower bound: no regular estimator can have asymptotic variance smaller than $I(\theta_0)^{-1}$. It also gives the asymptotic equivalence of Wald, score, and LRT tests under the null and under contiguous alternatives, and it is the foundation of semiparametric efficiency theory (Bickel-Klaassen-Ritov-Wellner 1993): in semiparametric models the analogous expansion identifies the efficient score as the projection of the parametric score onto the orthogonal complement of the nuisance tangent space.

Failure Mode

LAN fails for non-regular models. Canonical examples include parameter-dependent support (e.g., $\text{Uniform}(0, \theta)$, where the MLE converges at rate $n$ rather than $\sqrt{n}$ and the limit is exponential), boundary-parameter problems (parameter on the edge of $\Theta$, where the limit is a censored normal), change-point models (where the rate is $n$ and the limit is a compound Poisson functional), mixture models at non-identified configurations, and unidentified parameters generally. Note that the standard Cauchy location family $f_\theta(x) = 1/[\pi(1 + (x-\theta)^2)]$ is regular: its Fisher information equals $1/2$ and LAN holds, so the location MLE is regular and asymptotically normal at the $\sqrt{n}$ rate. The Cauchy sample mean is pathological because the Cauchy has no first moment, but likelihood-based location inference is fine. Many machine-learning models with high-dimensional parameter spaces and unclear identification also fall outside LAN. The corresponding theory is LAMN (locally asymptotically mixed normal) and LAQ (locally asymptotically quadratic), with weaker optimality conclusions.
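For the Gaussian location model the LAN expansion can be verified by direct computation, since there it holds exactly rather than up to $o_p(1)$. A minimal sketch (sample size, $h$, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, h, theta0 = 200, 1.5, 0.0
x = rng.normal(theta0, 1.0, size=n)

# N(theta, 1) model: I(theta) = 1 and the LAN expansion is exact:
# log LR = h * Delta_n - h^2 / 2 with Delta_n = sqrt(n) * (xbar - theta0).
theta_local = theta0 + h / np.sqrt(n)

def loglik(th):
    return -0.5 * np.sum((x - th) ** 2)

log_lr = loglik(theta_local) - loglik(theta0)
delta_n = np.sqrt(n) * (x.mean() - theta0)
lan = h * delta_n - 0.5 * h**2
print(log_lr, lan)  # identical up to floating-point error
```

For non-Gaussian regular models the two quantities differ by a genuinely stochastic $o_p(1)$ term instead of agreeing exactly.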

The Wald, Score, and Likelihood-Ratio Tests

Theorem

Asymptotic Equivalence of Wald, Score, and LRT

Statement

Under the null $H_0 : \theta = \theta_0$ and LAN, define the three test statistics:

  • Wald. $W_n = n (\hat\theta_n - \theta_0)^\top I(\theta_0) (\hat\theta_n - \theta_0)$.
  • Score (Rao / Lagrange multiplier). $R_n = \frac{1}{n} \nabla \ell_n(\theta_0)^\top I(\theta_0)^{-1} \nabla \ell_n(\theta_0) = \Delta_n^\top I(\theta_0)^{-1} \Delta_n$.
  • Likelihood ratio. $L_n = -2 \log \frac{L_n(\theta_0)}{L_n(\hat\theta_n)} = 2(\ell_n(\hat\theta_n) - \ell_n(\theta_0))$.

Then under $H_0$, all three converge to the same chi-squared limit:

$$W_n, R_n, L_n \xrightarrow{d} \chi^2_k \quad \text{where} \;\; k = \dim(\theta).$$

Moreover, $W_n - L_n = o_p(1)$ and $R_n - L_n = o_p(1)$, so the three tests are asymptotically equivalent under $H_0$ and under contiguous alternatives $\theta_n = \theta_0 + h/\sqrt{n}$ (where the common limit becomes a non-central $\chi^2_k(\lambda)$ with non-centrality $\lambda = h^\top I(\theta_0) h$).

Intuition

The three tests measure "distance from $H_0$" in three different but asymptotically equivalent ways. Wald measures distance in the parameter space (how far is $\hat\theta_n$ from $\theta_0$, weighted by the information). Score measures the slope of the log-likelihood at $\theta_0$ (is the gradient $\Delta_n$ near zero, or far?). LRT measures the depth of the log-likelihood (how much does the maximum exceed the null value?). All three reduce to the same quadratic form in the central sequence $\Delta_n$ under LAN, and the chi-squared limit is the distribution of the squared length of a standard normal $k$-vector.

Proof Sketch

Substitute the LAN expansion into each statistic and Taylor-expand to second order around $\theta_0$. The MLE satisfies $\hat\theta_n - \theta_0 = I(\theta_0)^{-1} \Delta_n / \sqrt{n} + o_p(1/\sqrt{n})$, so

$$W_n = \Delta_n^\top I(\theta_0)^{-1} \Delta_n + o_p(1),$$

which matches $R_n$ exactly. For the LRT, expand $\ell_n(\hat\theta_n)$ around $\theta_0$ to second order; the cross-term contributions cancel and the remainder is also $\Delta_n^\top I(\theta_0)^{-1} \Delta_n + o_p(1)$. Since $\Delta_n \xrightarrow{d} N(0, I(\theta_0))$ under $H_0$, the quadratic form has a $\chi^2_k$ limit. See van der Vaart Asymptotic Statistics Theorem 16.7.

Why It Matters

Justifies reporting any of the three test statistics with the same chi-squared reference distribution. The choice among them is a finite-sample concern: Wald requires the MLE and an estimate of $I(\theta_0)$; score does not require fitting the alternative model and is preferred in some computational settings; LRT is invariant under parameter transformations and tends to have better small-sample behavior in nested-model comparisons. The non-central chi-squared limit under contiguous alternatives gives the local power calculation: power $\to P(\chi^2_k(\lambda) > \chi^2_{k, 1-\alpha})$, a function of the non-centrality $\lambda = h^\top I(\theta_0) h$.

Failure Mode

The three tests can disagree substantially in finite samples, particularly when $n/k$ is small or the model is misspecified. Wald is parameterization-dependent (Wald CIs for $\theta$ and for $g(\theta)$ are not invariant under reparameterization, while LRT CIs are). Under misspecification only the robust score test with a sandwich denominator preserves a valid chi-squared limit; the naive Wald and LRT statistics use the wrong variance. At boundary parameters the limit is a mixture of chi-squared distributions, not a single $\chi^2$ (Self-Liang 1987).
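The trinity can be computed in a few lines for a Poisson mean test. A sketch, not from the original text (sample size, $\lambda_0$, and seed are illustrative); note that with the expected information evaluated at $\lambda_0$, the Wald and score statistics coincide algebraically in this model, while the LRT differs by a term of order $n^{-1/2}$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, lam0 = 5000, 3.0
x = rng.poisson(lam0, size=n)  # data generated under H0
lam_hat = x.mean()             # Poisson MLE

# Poisson model: I(lam) = 1/lam; the score at lam0 is n*(xbar - lam0)/lam0.
wald = n * (lam_hat - lam0) ** 2 / lam0
score = (n * (lam_hat - lam0) / lam0) ** 2 * lam0 / n
lrt = 2 * n * (lam_hat * np.log(lam_hat / lam0) - (lam_hat - lam0))
print(wald, score, lrt)  # all close; each is compared to a chi2_1 table
```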

Definitions: Influence Functions, Efficiency, Contiguity

Definition

Asymptotically Linear Estimator and Influence Function

A regular estimator $\hat\theta_n$ of $\theta_0$ is asymptotically linear with influence function $\varphi : \mathcal{X} \to \mathbb{R}^k$ if and only if

$$\sqrt{n}(\hat\theta_n - \theta_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \varphi(X_i) + o_p(1),$$

where $E[\varphi(X)] = 0$ and $E[\|\varphi(X)\|^2] < \infty$. The asymptotic variance of $\hat\theta_n$ is then $E[\varphi(X) \varphi(X)^\top]$.

For an M-estimator with estimating function $\psi$, $\varphi(X) = -V(\theta_0)^{-1} \psi(X; \theta_0)$. For the MLE, $\varphi(X) = I(\theta_0)^{-1} s(X; \theta_0)$, the efficient influence function. Influence functions parameterize the entire space of regular estimators in a parametric model, and they are the central object of semiparametric efficiency theory.
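The asymptotically linear representation can be checked numerically for the exponential-rate MLE, whose efficient influence function is $\varphi(x) = I(\lambda)^{-1} s(x; \lambda) = \lambda - \lambda^2 x$. A sketch with illustrative constants:

```python
import numpy as np

rng = np.random.default_rng(6)
n, lam = 5000, 2.0
x = rng.exponential(1 / lam, size=n)
lam_hat = 1 / x.mean()  # exponential-rate MLE

# Efficient influence function of the rate MLE:
# phi(x) = I(lam)^{-1} s(x; lam) = lam^2 * (1/lam - x) = lam - lam^2 * x.
phi = lam - lam**2 * x

lhs = np.sqrt(n) * (lam_hat - lam)  # sqrt(n)(theta-hat - theta_0)
rhs = np.sqrt(n) * phi.mean()       # (1/sqrt(n)) sum_i phi(X_i)
print(lhs, rhs)  # close; the gap is the o_p(1) remainder
```

The empirical variance of `phi` also approximates $I(\lambda)^{-1} = \lambda^2$, the MLE's asymptotic variance.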

Definition

Asymptotic Relative Efficiency

The asymptotic relative efficiency (ARE) of estimator $T_n$ relative to $S_n$, when both are $\sqrt{n}$-consistent and asymptotically normal with variances $\sigma_T^2$ and $\sigma_S^2$, is

$$\text{ARE}(T_n, S_n) = \frac{\sigma_S^2}{\sigma_T^2}.$$

In this asymptotic-variance sense, $T_n$ is more efficient than $S_n$ if and only if $\text{ARE} > 1$. The MLE achieves $\text{ARE} = 1$ relative to the Cramér-Rao bound by construction (under regularity). For testing, Pitman ARE is the analogous notion based on local power, and it equals the estimation ARE for tests derived from estimators (a striking equivalence due to Pitman 1949 and Noether 1955).
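The classic example is the mean versus the median at the normal, where the ARE of the mean relative to the median is $\pi/2 \approx 1.57$ (the median's asymptotic variance is $1/(4 f(0)^2) = \pi/2$). A simulation sketch with illustrative sample sizes:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 200, 20000
x = rng.normal(0.0, 1.0, size=(reps, n))

# n * Var(estimator) approximates the asymptotic variance.
var_mean = n * x.mean(axis=1).var()       # -> 1
var_med = n * np.median(x, axis=1).var()  # -> pi/2 = 1/(4 f(0)^2)
are = var_med / var_mean                  # ARE of mean relative to median
print(are)  # near pi/2 = 1.571 at the normal
```

At heavier-tailed populations the ordering flips, which is the usual motivation for robust location estimators.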

Definition

Contiguity

Two sequences of probability measures $P_n$ and $Q_n$ on possibly different sample spaces satisfy $P_n \triangleleft Q_n$ ($P_n$ is contiguous with respect to $Q_n$) if and only if for every sequence of measurable sets $A_n$, $Q_n(A_n) \to 0$ implies $P_n(A_n) \to 0$. They are mutually contiguous if both $P_n \triangleleft Q_n$ and $Q_n \triangleleft P_n$.

Under LAN, $P_{\theta_0 + h/\sqrt{n}}^{\otimes n}$ and $P_{\theta_0}^{\otimes n}$ are mutually contiguous for every fixed $h$. Le Cam's first lemma characterizes contiguity in terms of the limiting log-likelihood ratio: $Q_n \triangleleft P_n$ holds iff the likelihood ratios $L_n = dQ_n / dP_n$ are uniformly integrable under $P_n$. The third lemma converts statements about $P_n$-distributions into statements about $Q_n$-distributions via the joint limit of $(T_n, \log L_n)$.

Worked Examples

Example

Delta method: variance-stabilizing transform for Poisson

If $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$, then $\sqrt{n}(\bar X - \lambda) \xrightarrow{d} N(0, \lambda)$. The variance depends on $\lambda$, so a confidence interval on $\lambda$ requires a plug-in estimate. Apply the delta method with $g(x) = \sqrt{x}$: $g'(\lambda) = 1/(2\sqrt{\lambda})$, so

$$\sqrt{n}(\sqrt{\bar X} - \sqrt{\lambda}) \xrightarrow{d} N(0, \lambda \cdot 1/(4\lambda)) = N(0, 1/4).$$

The variance is now constant. Confidence intervals on $\sqrt{\lambda}$ do not need a plug-in step; transform back at the end. The same idea gives the Anscombe transform $g(x) = \sqrt{x + 3/8}$ for finite-$n$ correction and the Freeman-Tukey $g(x) = \sqrt{x} + \sqrt{x+1}$ for ratios of Poissons.

[Figure: Poisson($\lambda$) samples before and after the Anscombe transform $\sqrt{y + 3/8}$; the transformed panel keeps variance near $1/4$ regardless of $\lambda$.]

Left panel: raw Poisson samples have variance $\lambda$ (mean equals variance). Right panel: the Anscombe transform $\sqrt{Y + 3/8}$ yields variance close to $1/4$ across $\lambda$; as $\lambda$ grows, the raw variance scales up while the transformed variance stays put.

Example

MLE for the exponential rate, with confidence interval

If $X_1, \ldots, X_n \sim \text{Exp}(\lambda)$ (density $\lambda e^{-\lambda x}$), the MLE is $\hat\lambda = 1/\bar X$. The log-likelihood is $\ell(\lambda) = n \log \lambda - \lambda \sum X_i$, so the score is $s(X; \lambda) = 1/\lambda - X$ and the Fisher information is $I(\lambda) = E[(1/\lambda - X)^2] = \text{Var}(X) = 1/\lambda^2$. By asymptotic normality of the MLE,

$$\sqrt{n}(\hat\lambda - \lambda) \xrightarrow{d} N(0, \lambda^2).$$

A naive 95% CI is $\hat\lambda \pm 1.96\,\hat\lambda/\sqrt{n}$. A better CI uses the variance-stabilizing transform $g(\lambda) = \log \lambda$: $\sqrt{n}(\log \hat\lambda - \log \lambda) \xrightarrow{d} N(0, 1)$, giving the CI $\hat\lambda \cdot \exp(\pm 1.96/\sqrt{n})$, which is symmetric on the log scale and always positive.
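A coverage comparison of the two intervals can be sketched by simulation; the true rate, sample size, and replication count below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n, reps, z = 2.5, 40, 20_000, 1.96

lam_hat = 1.0 / rng.exponential(1.0 / lam, size=(reps, n)).mean(axis=1)

# Naive Wald interval: lam_hat +/- z * lam_hat / sqrt(n).
half = z * lam_hat / np.sqrt(n)
cover_naive = np.mean((lam_hat - half <= lam) & (lam <= lam_hat + half))

# Log-scale interval: lam_hat * exp(+/- z / sqrt(n)); always positive.
lo = lam_hat * np.exp(-z / np.sqrt(n))
hi = lam_hat * np.exp(z / np.sqrt(n))
cover_log = np.mean((lo <= lam) & (lam <= hi))

print(cover_naive, cover_log)
```

Both intervals fall somewhat short of the nominal 95% at $n = 40$, reflecting the skew of $\hat\lambda$; the log-scale interval tends to come closer.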

Example

Sandwich variance for misspecified MLE

Suppose you fit an exponential model to data that are actually Gamma with shape $\alpha \neq 1$. The "MLE" $\hat\lambda = 1/\bar X$ converges in probability to $\lambda_0 = 1/E_P[X]$ (the least-false parameter), not to anything intrinsic to the Gamma. The naive variance estimate $\hat\lambda^2/n$ relies on the information identity, which fails under misspecification, so it is generally wrong. The sandwich variance is

$$\text{Var}_{\text{sand}}(\hat\lambda) = \frac{1}{n} V(\lambda_0)^{-1} A(\lambda_0) V(\lambda_0)^{-1}, \quad V(\lambda_0) = -E[\partial \psi / \partial \lambda] = 1/\lambda_0^2, \quad A(\lambda_0) = E[\psi^2] = \text{Var}(X).$$

For Gamma$(\alpha, \alpha\lambda_0)$ in the rate parameterization (so that $E[X] = 1/\lambda_0$), $\text{Var}(X) = 1/(\alpha\lambda_0^2)$, giving $\text{Var}_{\text{sand}} = \frac{1}{n}\,\lambda_0^4 \cdot \frac{1}{\alpha\lambda_0^2} = \lambda_0^2/(n\alpha)$. The naive exponential-model variance would have given $\lambda_0^2/n$, which is $\alpha$ times too large when $\alpha > 1$ and too small when $\alpha < 1$. This is the standard "robust SEs differ from model-based SEs under misspecification" phenomenon.
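The sandwich formula can be checked by Monte Carlo; the shape, rate, sample size, and replication count below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, lam0, n, reps = 4.0, 1.5, 200, 40_000

# Gamma(shape=alpha, scale=1/(alpha*lam0)): mean 1/lam0, var 1/(alpha*lam0^2).
x = rng.gamma(alpha, 1.0 / (alpha * lam0), size=(reps, n))
lam_hat = 1.0 / x.mean(axis=1)        # least-false "MLE" of the exponential rate

emp_var = lam_hat.var()               # Monte Carlo truth
sandwich = lam0**2 / (n * alpha)      # model-robust formula
naive = lam0**2 / n                   # information-identity formula: wrong here

print(emp_var, sandwich, naive)
```

The empirical variance of $\hat\lambda$ matches the sandwich value and is a factor $\alpha$ smaller than the naive model-based value, as the example predicts.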

Common Confusions

Watch Out

Asymptotic normality is not a property of the estimator

Asymptotic normality is a property of the sequence $\sqrt{n}(\hat\theta_n - \theta_0)$, not of the estimator at any fixed sample size. The actual finite-sample distribution of an MLE can be heavily skewed, multimodal, or heavy-tailed even when the asymptotic distribution is a clean normal. Bootstrap or exact methods should be used for small samples. The asymptotic approximation is a tool, not a fact about $\hat\theta_n$ at $n = 30$.
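As an illustrative sketch, the exponential-rate MLE from the worked example above is visibly right-skewed at $n = 30$ even though its asymptotic law is normal (the rate and replication count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
lam, n, reps = 1.0, 30, 50_000

lam_hat = 1.0 / rng.exponential(1.0 / lam, size=(reps, n)).mean(axis=1)

# Standardized third moment: ~0 for a normal; clearly positive here.
z = (lam_hat - lam_hat.mean()) / lam_hat.std()
skew = (z**3).mean()
print(lam_hat.mean(), skew)
```

The MLE is also biased upward at this sample size ($E[\hat\lambda] = n\lambda/(n-1)$ exactly), another finite-sample feature invisible to the first-order asymptotics.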

Watch Out

Efficiency is asymptotic; finite-sample winners can be biased

The MLE is asymptotically efficient under regularity, but finite-sample efficiency is a different question. James and Stein (1961) showed that for estimating a multivariate normal mean with $k \geq 3$, the MLE $\hat\mu = \bar X$ is dominated in mean squared error, for every $\mu$, by the biased shrinkage estimator $\hat\mu_{\text{JS}} = (1 - (k-2)/\|\bar X\|^2)\,\bar X$. The JS estimator is asymptotically equivalent to the MLE (both have asymptotic variance $I^{-1}/n$), but it strictly improves finite-sample MSE by trading bias for variance.
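A minimal numerical check of the dominance claim at one arbitrary $\mu$ (the theorem holds at every $\mu$; a simulation can only probe one point at a time). The setup assumes a single $N(\mu, I_k)$ observation per replication:

```python
import numpy as np

rng = np.random.default_rng(5)
k, reps = 10, 50_000
mu = np.full(k, 0.5)                      # arbitrary true mean vector

x = rng.normal(mu, 1.0, size=(reps, k))   # one N(mu, I_k) observation per rep
shrink = 1.0 - (k - 2) / (x**2).sum(axis=1)
js = shrink[:, None] * x                  # James-Stein (not positive-part)

mse_mle = ((x - mu) ** 2).sum(axis=1).mean()   # E||X - mu||^2 = k
mse_js = ((js - mu) ** 2).sum(axis=1).mean()   # strictly smaller
print(mse_mle, mse_js)
```

At this $\mu$, near the origin, the improvement is dramatic; the gain shrinks as $\|\mu\|$ grows but never becomes a loss.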

Watch Out

Convergence in distribution does not imply convergence of moments

$X_n \xrightarrow{d} X$ does not imply $E[X_n] \to E[X]$ or $\text{Var}(X_n) \to \text{Var}(X)$. The classical counterexample is $X_n = n \cdot \mathbb{1}\{U \leq 1/n\}$ for $U \sim \text{Uniform}(0,1)$: $X_n \xrightarrow{d} 0$ (since $X_n = 0$ with probability $1 - 1/n$), but $E[X_n] = 1$ for every $n$. The fix is uniform integrability: if $\{X_n\}$ is uniformly integrable, then convergence in distribution implies convergence of expectations. In practice, asymptotic-normality results give the limiting distribution; whether the variance of $\sqrt{n}(\hat\theta_n - \theta_0)$ actually converges to $I^{-1}$ requires a separate argument.
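The counterexample runs in a few lines; the sample sizes tried are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)
reps = 2_000_000
results = {}

for n in (10, 1000, 100_000):
    u = rng.uniform(size=reps)
    x_n = n * (u <= 1.0 / n)              # X_n = n * 1{U <= 1/n}
    # P(X_n = 0) -> 1 (so X_n ->d 0), yet E[X_n] stays ~1 for every n.
    results[n] = ((x_n == 0).mean(), x_n.mean())
    print(n, results[n])
```

The mass escaping to $+\infty$ carries exactly one unit of expectation at every $n$, which is what uniform integrability rules out.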

Watch Out

Le Cam regularity (DQM) is weaker than classical regularity

Pointwise differentiability of $\log f(x; \theta)$ in $\theta$, combined with bounded-by-integrable-envelope conditions, is the textbook "regularity" assumption (Lehmann-Casella, Cox-Hinkley). Le Cam's differentiability in quadratic mean (DQM) is weaker: it requires only that $\sqrt{f}$ be differentiable in $L^2(\mu)$ as a function of $\theta$. Many models satisfy DQM without satisfying the textbook conditions — the Laplace location family is DQM although its density is not differentiable at the center, $\text{Beta}(\alpha, \beta)$ (fixed support $[0,1]$) is DQM despite non-smooth densities at the endpoints, and many semiparametric models qualify as well — while genuinely irregular models still fail: Uniform on $[0, \theta]$ is not DQM because the support moves with $\theta$. Modern asymptotic theory works at the DQM level; classical regularity is a sufficient-but-not-necessary special case.

Watch Out

Super-efficient estimators exist but only on a measure-zero set

Hodges (1951) constructed an estimator that is asymptotically normal with variance $I(\theta_0)^{-1}/n$ for every $\theta_0 \neq 0$ and asymptotic variance $0$ at $\theta_0 = 0$. This violates the Cramér-Rao bound at the single point $\theta = 0$. The resolution: the Hodges estimator is not regular (its asymptotic distribution depends discontinuously on contiguous perturbations $\theta_0 + h/\sqrt{n}$). Le Cam's convolution theorem says: among regular estimators (those with continuous-in-$h$ asymptotic distributions), no estimator can beat $I^{-1}$. Super-efficiency is a measure-theoretic curiosity, not a practical estimation tool, and the Hodges estimator has terrible local risk near zero (the local-minimax risk over a $1/\sqrt{n}$-neighborhood is unbounded).
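A simulation sketch of the bad local risk, using the standard construction $\hat\theta_H = \bar X \cdot \mathbb{1}\{|\bar X| \geq n^{-1/4}\}$ for the $N(\theta, 1)$ mean (the $n^{-1/4}$ threshold is one conventional choice, not fixed by the text above):

```python
import numpy as np

rng = np.random.default_rng(7)
reps = 40_000

def hodges_risk(theta, n):
    """Normalized risk n * MSE of the Hodges estimator at a given theta."""
    xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)  # Xbar ~ N(theta, 1/n)
    est = np.where(np.abs(xbar) >= n ** -0.25, xbar, 0.0)  # truncate to zero
    return n * ((est - theta) ** 2).mean()

for n in (100, 10_000):
    # Superb at theta = 0, terrible at the contiguous point theta = n^(-1/4).
    print(n, hodges_risk(0.0, n), hodges_risk(n ** -0.25, n))
```

The normalized risk at $\theta = 0$ collapses toward zero while the risk at $\theta = n^{-1/4}$ grows without bound, in line with the unbounded local-minimax risk.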

Watch Out

MLE consistency requires more than just asymptotic normality of the score

The standard "expand the score equation around $\theta_0$" derivation of asymptotic normality assumes $\hat\theta_n \xrightarrow{p} \theta_0$. Establishing consistency is a separate (often harder) step, requiring either (a) compactness of $\Theta$ plus continuity of $\theta \mapsto E[\log f(X; \theta)]$ plus identifiability, (b) Wald's classical consistency proof using a uniform LLN, or (c) the modern empirical-process approach via Glivenko-Cantelli plus a well-separated maximum at $\theta_0$. Skipping the consistency step and "deriving asymptotic normality" from the score equation alone is a circular argument.
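A small illustration of the compactness route for the exponential model of the earlier example: maximize the empirical criterion $M_n(\theta) = n^{-1}\sum_i \log f(X_i; \theta)$ over a compact grid and watch the argmax settle on $\theta_0$ as $n$ grows (the grid and sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(8)
theta0 = 2.0
grid = np.linspace(0.2, 6.0, 400)     # compact parameter set Theta

def argmax_loglik(n):
    x = rng.exponential(1.0 / theta0, size=n)
    # Exponential average log-likelihood on the grid: log(th) - th * xbar.
    Mn = np.log(grid) - grid * x.mean()
    return grid[Mn.argmax()]

ests = [argmax_loglik(n) for n in (50, 500, 50_000)]
print(ests)
```

The uniform LLN makes $M_n$ track its population counterpart $M(\theta) = \log\theta - \theta/\theta_0$ over the whole grid, which is what forces the argmax toward $\theta_0$.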

Summary

  • Convergence calculus: CMT, Slutsky, delta method (incl. multivariate and second-order). These are the algebraic moves that let you turn one CLT into many applied confidence intervals.
  • M-/Z-estimators: $\hat\theta_n$ is asymptotically normal with sandwich variance $V^{-1} A V^{-\top}$, where $V$ is the derivative of the population estimating function and $A$ is its variance. MLE, OLS, GMM, quantile regression, and Huber-M are special cases.
  • MLE asymptotics: under correct specification and regularity, $\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} N(0, I(\theta_0)^{-1})$. The information identity collapses the sandwich.
  • LAN (Le Cam): every regular parametric problem looks like a Gaussian shift experiment at the $1/\sqrt{n}$ scale. This gives the asymptotic minimax bound and the Wald/score/LRT equivalence.
  • Wald, score, LRT: all three converge to $\chi^2_k$ under $H_0$ and to non-central $\chi^2_k(\lambda)$ under contiguous alternatives. Choose among them on finite-sample, robustness, and computational grounds.
  • Influence functions parameterize the entire space of regular estimators; the efficient influence function is $I^{-1} s$.

Exercises

ExerciseCore

Problem

Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and $\hat p = \bar X$. Use the delta method to find the asymptotic distribution of the log-odds $g(\hat p) = \log(\hat p / (1 - \hat p))$, then construct a 95% confidence interval for the log-odds and transform back to a CI for $p$ that respects $p \in [0, 1]$.

ExerciseAdvanced

Problem

Compute the influence function and asymptotic variance of Huber's robust location estimator with tuning constant $c$, defined by

$$\hat\theta_n \;\; \text{such that} \;\; \sum_i \psi_c(X_i - \hat\theta_n) = 0, \quad \psi_c(u) = \begin{cases} u & |u| \leq c, \\ c \, \text{sign}(u) & |u| > c. \end{cases}$$

Specialize to $X_i \sim N(\theta, \sigma^2)$. What is the ARE of Huber-M relative to the sample mean as $c \to \infty$? As $c \to 0$?

ExerciseAdvanced

Problem

Show that under the LAN expansion, the Wald statistic $W_n = n (\hat\theta_n - \theta_0)^\top I(\theta_0) (\hat\theta_n - \theta_0)$ and the score statistic $R_n = \Delta_n^\top I(\theta_0)^{-1} \Delta_n$ differ by $o_p(1)$ under $H_0 : \theta = \theta_0$, and that both equal the LRT statistic $-2 \log \Lambda_n$ up to $o_p(1)$.

ExerciseAdvanced

Problem

Consider the change-point model $X_i \sim N(\mu_1, 1)$ for $i \leq \tau n$ and $X_i \sim N(\mu_2, 1)$ for $i > \tau n$, with unknown change point $\tau \in (0, 1)$ and known $\mu_1 \neq \mu_2$. Show that the MLE of $\tau$ converges at rate $n$ (not $\sqrt{n}$) and identify the limiting distribution.

ExerciseResearch

Problem

In the LAN expansion, the experiment at local scale $h/\sqrt{n}$ looks like observing $\Delta \sim N(I(\theta_0) h, I(\theta_0))$. (a) Derive the optimal estimator of $h$ and its risk in this Gaussian shift experiment, and recover the Cramér-Rao bound. (b) For a semiparametric model with parametric component $\theta$ and infinite-dimensional nuisance $\eta$, what is the analogue of $I(\theta_0)$ that controls the asymptotic variance of any regular estimator of $\theta$?

References

  • van der Vaart, Asymptotic Statistics (Cambridge, 1998). Chapters 2-3 (convergence calculus, delta method), Chapter 5 (M- and Z-estimators), Chapters 7-8 (local asymptotic normality, contiguity, convolution theorem), Chapter 16 (Wald/score/LRT equivalence). The standard graduate text and the source most cited above.
  • Lehmann & Casella, Theory of Point Estimation, 2nd ed. (Springer, 1998). Chapter 6 covers MLE asymptotic theory under classical regularity. Less general than van der Vaart but gives more finite-sample finesse.
  • Le Cam, Asymptotic Methods in Statistical Decision Theory (Springer, 1986). The original treatise on contiguity, LAN, and the convolution theorem. Dense and difficult, but the source.
  • Le Cam & Yang, Asymptotics in Statistics: Some Basic Concepts, 2nd ed. (Springer, 2000). A more readable distillation of Le Cam's program; Chapter 6 covers DQM and LAN.
  • Pollard, Convergence of Stochastic Processes (Springer, 1984). Chapters 4-5 develop the empirical-process tools (Glivenko-Cantelli, Donsker, stochastic equicontinuity) underlying modern M-estimator theory.
  • Bickel, Klaassen, Ritov, Wellner, Efficient and Adaptive Estimation for Semiparametric Models (Springer, 1993). The foundational text on semiparametric efficiency, projection-onto-tangent-spaces, and the efficient influence function.

Next Topics

  • Bootstrap methods: resampling as a finite-sample proxy for the asymptotic distribution; second-order accurate for smooth functionals.
  • Fisher information and Cramér-Rao bound: the geometric and information-theoretic perspective on the asymptotic variance lower bound.
  • Empirical processes and chaining: the modern technical machinery for M-estimator asymptotics with non-smooth $\psi$ and high-dimensional parameter spaces.
  • Semiparametric efficiency and TMLE: the infinite-dimensional generalization of LAN underlying modern causal-inference methods.

Last reviewed: April 25, 2026
