
Statistical Estimation

Central Limit Theorem

The CLT family: classical Lindeberg-Levy, the Lindeberg-Feller condition for non-identical summands, Berry-Esseen rate, multivariate CLT, the delta method, and the martingale CLT that underwrites SGD asymptotics. Why these results power MLE asymptotic normality, bootstrap validity, and the Gaussian-noise approximation of stochastic optimization, and where the CLT breaks in high dimensions and under heavy tails.


Why This Matters

The central limit theorem explains why the Gaussian distribution appears everywhere. It is not because nature is inherently Gaussian. It is because many quantities of interest are averages, or behave like averages, and averages of independent finite-variance terms are approximately Gaussian regardless of the underlying distribution.

In machine learning and statistics the CLT is the engine behind:

  • Asymptotic normality of MLE: $\sqrt{n}(\hat\theta_{\text{MLE}} - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1})$, where $I(\theta_0)$ is the Fisher information. The score function is a sum of iid mean-zero terms, the CLT applies, and the delta method propagates the limit through the score-equation expansion.
  • Confidence intervals and hypothesis tests: $\bar{X} \pm z_{\alpha/2}\,s/\sqrt{n}$; Wald, score, and LRT statistics all read their critical values off the standard normal because of the CLT.
  • Bootstrap validity: the bootstrap works because the empirical CDF satisfies its own functional CLT (Donsker's theorem), so resampling from $\hat F_n$ approximates sampling from $F$.
  • SGD noise as Gaussian: at each step the stochastic gradient is a sum over a minibatch of per-sample gradients minus the population gradient. The CLT (martingale form) gives the noise a Gaussian-like shape, which is the basis for SGD-as-SDE diffusion analyses.
  • Monte Carlo error bars: estimator variance $\sigma^2/n$, confidence band width $\sigma/\sqrt{n}$, and the universal $1/\sqrt{n}$ rate for simulation all come from the CLT.

The CLT also comes with a sharp caveat: it is asymptotic, so it tells you nothing about a fixed $n$ unless you carry along a rate. Concentration inequalities give the non-asymptotic counterpart with explicit constants. Berry-Esseen sits in between.

Want to feel the convergence instead of only reading about it? The Sampling and Limit Laws Lab lets you switch the source distribution, increase $n$, and watch the sample-mean histogram move toward the Gaussian picture in real time.
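If the lab is not at hand, the same experiment takes a few lines of NumPy. A minimal sketch (the Exp(1) source, trial counts, and sample sizes are illustrative choices, not the lab's settings) that tracks how the spread and skewness of the sample mean behave as $n$ grows:

```python
# Minimal CLT simulation: spread and skewness of the sample mean as n grows.
# The Exp(1) source and the specific n values are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

def sample_mean_draws(n, trials=5_000):
    """Independent realizations of the sample mean of n Exp(1) variables."""
    return rng.exponential(1.0, size=(trials, n)).mean(axis=1)

for n in (1, 5, 30, 200):
    m = sample_mean_draws(n)
    z = (m - m.mean()) / m.std()
    # CLT prediction: sd ~ sigma/sqrt(n) = 1/sqrt(n); skewness -> 0 as n grows.
    print(f"n={n:4d}  sd={m.std():.3f}  (CLT: {1/np.sqrt(n):.3f})  skewness={np.mean(z**3):+.2f}")
```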

Quick Version

| Statement | What it says |
| --- | --- |
| Classical CLT (Lindeberg-Levy) | iid finite-variance: $\sqrt{n}(\bar X_n - \mu)/\sigma \to \mathcal{N}(0,1)$ |
| Lindeberg-Feller | Non-identical summands: a triangular array obeys the CLT iff the Lindeberg condition holds |
| Berry-Esseen | CDF error is $O(1/\sqrt{n})$ with constant $0.4748\,\rho/\sigma^3$ |
| Multivariate CLT | $\sqrt{n}(\bar X_n - \mu) \to \mathcal{N}(0, \Sigma)$ with the full covariance |
| Delta method | Smooth transformations preserve asymptotic normality with the Jacobian |
| Martingale CLT | Replaces independence with conditional moment conditions on $\mathcal{F}_{t-1}$ |
| Hard requirement | Finite second moment (or the equivalent condition on conditional variances) |
[Figure: sample-mean shape becomes Gaussian as n grows]

Start with one bounded source. Average more copies of it. The sample mean narrows around the true center and its shape becomes smoother and more bell-like: the source begins flat or angular (n = 1: raw source; n = 2: still angular; n = 5: bell-like; n = 12: close to Gaussian), approaching the Gaussian target the CLT predicts.

[Figure: sample-mean histogram (K = 1500 trials) versus the predicted N(μ, σ²/N)]

Bars: empirical histogram of the sample mean across K trials. Dashed curve: the limiting Normal N(μ, σ²/N) predicted by the CLT.

Mental Model

The law of large numbers tells you that the sample mean converges. The CLT tells you how: the fluctuations around the true mean are Gaussian with standard deviation $\sigma/\sqrt{n}$. If you zoom in by multiplying by $\sqrt{n}$, the magnified fluctuations have a specific universal shape, the bell curve.

The CLT is a universality result. Whatever distribution the $X_i$ come from, the standardized average looks the same in the limit, so long as the variance is finite. The Lindeberg condition makes precise what this universality demands of the summands: no single term can dominate the variance budget at the relevant scale.

Formal Setup

Definition

Convergence in Distribution

A sequence $X_n$ converges in distribution to $X$ if for every continuous bounded function $g$:

$$\lim_{n \to \infty} \mathbb{E}[g(X_n)] = \mathbb{E}[g(X)].$$

Equivalently, $F_{X_n}(t) \to F_X(t)$ at every continuity point $t$ of $F_X$.

Convergence in distribution is the weakest of the standard convergence modes. It is a statement about distributions, not about random variables: $X_n$ and $X$ need not live on the same probability space, and $X_n - X$ need not be small in any pathwise sense.

The rigorous proofs of every CLT on this page use characteristic functions and Lévy's continuity theorem: pointwise convergence $\varphi_{Z_n}(t) \to \varphi_Z(t)$ at every $t \in \mathbb{R}$ implies $Z_n \xrightarrow{d} Z$, provided the limiting characteristic function is continuous at zero. The Gaussian $\mathcal{N}(0,1)$ has $\varphi(t) = e^{-t^2/2}$, and showing that the characteristic function of the standardized sum converges to $e^{-t^2/2}$ is the entire game.

Classical CLT

Theorem

Classical Central Limit Theorem (Lindeberg-Levy)

Statement

If $X_1, X_2, \ldots$ are iid with mean $\mu$ and variance $\sigma^2 \in (0, \infty)$, then

$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1).$$

Equivalently, $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$, and at the working level $\bar{X}_n \approx \mathcal{N}(\mu, \sigma^2/n)$ for large $n$.

Intuition

Each $X_i$ contributes a random kick to the sum. For large $n$ the aggregate of many small independent kicks is Gaussian, and the only properties of the kicks that survive the limit are the mean and the variance. The $\sqrt{n}$ scaling is the right zoom: the sum $\sum_i X_i$ grows as $n\mu$, its standard deviation grows as $\sigma\sqrt{n}$, and standardizing leaves a quantity of order $1$ whose distribution converges.

Proof Sketch

Let $Y_i = (X_i - \mu)/\sigma$, so the $Y_i$ are iid with mean $0$ and variance $1$. Set $Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n Y_i$. The characteristic function of $Y_i$ admits the Taylor expansion

$$\varphi_Y(t) = 1 - \frac{t^2}{2} + o(t^2).$$

By independence, $\varphi_{Z_n}(t) = [\varphi_Y(t/\sqrt{n})]^n$, so

$$\varphi_{Z_n}(t) = \left[1 - \frac{t^2}{2n} + o(t^2/n)\right]^n \to e^{-t^2/2}.$$

The limit is the characteristic function of $\mathcal{N}(0,1)$. Apply Lévy's continuity theorem to conclude $Z_n \xrightarrow{d} \mathcal{N}(0,1)$.
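The convergence $\varphi_{Z_n}(t) \to e^{-t^2/2}$ can also be checked numerically. A small Monte Carlo sketch (Exp(1) summands and the grid of $t$ values are assumed for illustration):

```python
# Monte Carlo check that the characteristic function of the standardized sum
# converges pointwise to exp(-t^2/2). Exp(1) summands are an assumed example.
import numpy as np

rng = np.random.default_rng(1)
t_grid = np.array([0.5, 1.0, 2.0])

for n in (5, 50, 500):
    # Z_n = sqrt(n) * (Xbar_n - mu) / sigma, with mu = sigma = 1 for Exp(1).
    z = np.sqrt(n) * (rng.exponential(1.0, size=(20_000, n)).mean(axis=1) - 1.0)
    phi = np.exp(1j * t_grid[:, None] * z).mean(axis=1)   # estimate of E[exp(i t Z_n)]
    err = np.abs(phi - np.exp(-t_grid**2 / 2))
    print(f"n={n:>3}  |phi_Zn(t) - exp(-t^2/2)| at t={t_grid.tolist()}: {np.round(err, 3)}")
```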

Why It Matters

The classical CLT is the single most-used theorem in applied statistics. Every Wald confidence interval, $z$-test, and plug-in standard error invokes it. In ML theory it is the asymptotic half of every generalization argument that uses a normal approximation in place of a tail bound, and it is the result behind the $1/\sqrt{n}$ baseline rate that concentration inequalities try to match non-asymptotically.

Failure Mode

Three failure modes worth naming. (1) Infinite variance: stable distributions with index $\alpha < 2$ (Cauchy, fat-tailed financial returns) have $S_n/n^{1/\alpha}$ converging to a non-Gaussian stable limit. (2) Strong dependence: long-range-dependent processes can have CLT-like statements with non-Gaussian limits or non-$\sqrt{n}$ rates. (3) High dimension: the iid CLT is one-dimensional; operator-norm CLTs for sample covariances in dimensions $d \asymp n$ require additional structure and the limits are not Gaussian.
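Failure mode (1) is easy to see numerically. A sketch with standard Cauchy data (an assumed example of an $\alpha = 1$ stable law): the sample mean never tightens, no matter how large $n$ gets.

```python
# Infinite-variance failure: for Cauchy data the sample mean is itself Cauchy for
# every n, so averaging never concentrates and no sqrt(n) Gaussian limit exists.
import numpy as np

rng = np.random.default_rng(2)

def mean_draws(n, trials=1_000):
    return np.array([rng.standard_cauchy(n).mean() for _ in range(trials)])

for n in (10, 1_000, 100_000):
    q25, q75 = np.percentile(mean_draws(n), [25, 75])
    # The IQR stays near 2 (the Cauchy IQR); a finite-variance source would shrink ~ 1/sqrt(n).
    print(f"n={n:>7}  IQR of the sample mean = {q75 - q25:.2f}")
```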

Lindeberg-Feller CLT

The classical CLT assumes identical distribution. This is too restrictive for many applications: in regression with deterministic covariates, the score terms $x_i \epsilon_i$ have variances that depend on $i$; in stratified sampling, summands come from different populations; in neural-network analyses, layerwise contributions to a loss may be heterogeneous. The Lindeberg-Feller theorem replaces "iid" with a condition that controls the relative size of the largest summand.

Theorem

Lindeberg-Feller CLT

Statement

Let $\{X_{n,i}\}$ be a triangular array with $k_n$ independent random variables in row $n$, with $\mathbb{E}[X_{n,i}] = 0$ and $\sigma_{n,i}^2 = \mathrm{Var}(X_{n,i}) < \infty$. Set $s_n^2 = \sum_{i=1}^{k_n} \sigma_{n,i}^2$. The Lindeberg condition is

$$L_n(\epsilon) := \frac{1}{s_n^2} \sum_{i=1}^{k_n} \mathbb{E}\!\left[X_{n,i}^2 \, \mathbf{1}\{|X_{n,i}| > \epsilon\, s_n\}\right] \to 0 \quad \text{for every } \epsilon > 0.$$

If $L_n(\epsilon) \to 0$ for every $\epsilon > 0$, then

$$\frac{1}{s_n} \sum_{i=1}^{k_n} X_{n,i} \xrightarrow{d} \mathcal{N}(0, 1).$$

The condition is essentially necessary: under the uniform asymptotic negligibility condition $\max_i \sigma_{n,i}^2 / s_n^2 \to 0$, the Lindeberg condition is equivalent to the CLT holding together with uniform negligibility.

Intuition

$L_n(\epsilon)$ is the fraction of the total variance carried by the "big" parts of the summands (the parts above $\epsilon\, s_n$ in absolute value). Lindeberg says: at every threshold $\epsilon$, the big parts contribute a vanishing share of the variance. No single summand, and no constant fraction of summands, is allowed to dominate.

The Lindeberg condition is harder to verify directly than the simpler Lyapunov condition: there exists $\delta > 0$ such that

$$\frac{1}{s_n^{2+\delta}} \sum_{i=1}^{k_n} \mathbb{E}\!\left[|X_{n,i}|^{2+\delta}\right] \to 0.$$

Lyapunov implies Lindeberg by Markov's inequality, and it is the form most often used in applied work because $(2+\delta)$-th moments are easier to bound than truncated second moments.

Proof Sketch

Lindeberg's replacement method. Replace the $X_{n,i}$ one at a time with independent Gaussians $G_{n,i} \sim \mathcal{N}(0, \sigma_{n,i}^2)$ and control the change in $\mathbb{E}[f(\text{sum})]$ for smooth bounded $f$ via a third-order Taylor expansion. The remainder is bounded in terms of $L_n(\epsilon) + \epsilon$ for a free $\epsilon > 0$; let $n \to \infty$ and then $\epsilon \to 0$. After full replacement the sum is exactly $\mathcal{N}(0, s_n^2)$, and dividing by $s_n$ standardizes it to $\mathcal{N}(0,1)$. See Billingsley, Probability and Measure, Section 27, or Durrett, Probability: Theory and Examples, Section 3.4.

Why It Matters

This is the right form of the CLT for theory. Whenever you read "by the CLT" in a learning-theory or asymptotic-statistics paper that does not have iid data, the unstated assumption is almost always Lindeberg or Lyapunov on a triangular array. Specific places it shows up:

  • OLS asymptotics with non-random covariates: the score sum $\sum_i x_i \epsilon_i$ has independent but not identically distributed terms. Lyapunov on $\mathbb{E}|\epsilon_i|^{2+\delta}$ plus a balanced design ($\max_i \|x_i\|^2 / \sum_i \|x_i\|^2 \to 0$) gives asymptotic normality of $\hat\beta$.
  • Influence-function-based asymptotics: in semiparametric efficiency theory, estimators are linearized as sample averages of an influence function, but with finite-sample plug-in errors that make the summands non-iid. Lindeberg controls the error term.
  • Non-iid generalization analyses: when training data come from heterogeneous sources or active-learning streams, Lindeberg-style conditions are what permit Gaussian limits of empirical risks.

Failure Mode

The Lindeberg condition fails when one or a few summands carry a non-vanishing share of the variance. Canonical example: $X_{n,n} = n$ with probability $1/n^2$ and $0$ otherwise, with $X_{n,i}$ tiny for $i < n$, makes $\sigma_{n,n}^2 \approx 1$ a constant fraction of $s_n^2$. The sum does not converge to a Gaussian.

A subtle case: with $X_{n,i} = \pm c$ each with probability $1/2$ for a fixed constant $c$, the Lindeberg condition holds; but with $X_{n,1} = \pm\sqrt{n}$ and $X_{n,i}$ tiny for $i \geq 2$, the variance budget is dominated by a single Bernoulli-like term. The standardized sum stays approximately $\pm 1$ with probability $1/2$ each, not Gaussian.
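The dominant-summand failure can be checked numerically. A sketch with assumed concrete values (the "tiny" summands are taken to be $\pm 1$, so the big term carries about half the variance budget rather than nearly all of it):

```python
# Lindeberg sum L_n(eps) for two triangular arrays of +-magnitude summands:
#   healthy:   X_{n,i} = +-1 for all i
#   dominated: X_{n,1} = +-sqrt(n), X_{n,i} = +-1 for i >= 2 (one term holds ~half the variance)
import numpy as np

def lindeberg_sum(magnitudes, eps):
    """L_n(eps) when summand i is +-magnitudes[i] with prob 1/2 (variance = magnitude^2)."""
    variances = magnitudes**2
    s_n = np.sqrt(variances.sum())
    big = magnitudes > eps * s_n          # summands whose whole mass exceeds eps * s_n
    return variances[big].sum() / variances.sum()

eps = 0.5
for n in (10, 100, 10_000):
    healthy = np.ones(n)
    dominated = np.concatenate(([np.sqrt(n)], np.ones(n - 1)))
    print(f"n={n:>6}  healthy L_n({eps}) = {lindeberg_sum(healthy, eps):.3f}   "
          f"dominated L_n({eps}) = {lindeberg_sum(dominated, eps):.3f}")
```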

Berry-Esseen: Rate of Convergence

The CLT says $Z_n \to \mathcal{N}(0,1)$ but does not say how large $n$ must be. Berry-Esseen quantifies the gap.

Theorem

Berry-Esseen Theorem

Statement

If $X_1, \ldots, X_n$ are iid with $\mathbb{E}[X_i] = 0$, $\mathrm{Var}(X_i) = \sigma^2$, and $\mathbb{E}[|X_i|^3] = \rho < \infty$, then

$$\sup_t \left| \Pr\!\left[\frac{\bar{X}_n}{\sigma/\sqrt{n}} \leq t\right] - \Phi(t) \right| \leq \frac{C\,\rho}{\sigma^3 \sqrt{n}},$$

where $\Phi$ is the standard-normal CDF and $C$ is a universal constant. The best known upper bound is $C \leq 0.4748$ (Shevtsova, 2011); a lower bound of $C \geq 0.4097$ (Esseen, 1956) shows the optimal $C$ lies in this narrow range.

The CLT rate of convergence is therefore $O(1/\sqrt{n})$, with the constant set by the third absolute moment of the centered summand.

Intuition

The bound has two factors. $1/\sqrt{n}$ is the universal rate. The non-Gaussianity factor $\rho/\sigma^3$ measures how skewed and heavy-tailed the centered $X$ is, relative to a Gaussian baseline. A $\mathcal{N}(0,1)$ variable has $\rho = \mathbb{E}|Z|^3 = \sqrt{8/\pi} \approx 1.595$ and $\sigma^3 = 1$, so the ratio is $\approx 1.6$. Heavier tails or larger skewness blow the constant up, and you need more samples before the Gaussian approximation lands.

Proof Sketch

The proof smooths the indicator $\mathbf{1}\{x \leq t\}$ against a kernel of bandwidth $1/T$, giving the Esseen smoothing inequality

$$|F(t) - \Phi(t)| \leq \frac{1}{\pi}\int_{-T}^{T} \frac{|\varphi_F(s) - e^{-s^2/2}|}{|s|}\,ds + \frac{c}{T}.$$

Bound $|\varphi_{Z_n}(s) - e^{-s^2/2}|$ by a third-order Taylor remainder of the per-summand characteristic function at the standardized scale, then optimize $T$ to balance the smoothing error against the truncation error; this gives the $1/\sqrt{n}$ rate.

Why It Matters

Berry-Esseen tells you when the CLT approximation is good enough to trust. For centered Bernoulli$(1/2)$ summands: $\rho = 1/8$, $\sigma^3 = 1/8$, so $\rho/\sigma^3 = 1$ and the bound is $0.4748/\sqrt{n}$. At $n = 100$ the worst-case CDF error is $\leq 0.0475$; at $n = 10\,000$ it is $\leq 0.00475$.

Practical consequence: for CLT-based confidence intervals to be reliable to within $1\%$ uniformly across the CDF, you need $n \sim 10^4$ in the worst case. This is the back-of-envelope number behind the usual advice that CLT-based critical values are fine for $n \geq 100$ with roughly symmetric data and need $n \geq 1000$ for skewed data.
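A quick check of how conservative the bound is in the Bernoulli$(1/2)$ case (the exact binomial CDF comes from scipy; the sample sizes are illustrative):

```python
# Berry-Esseen bound vs. the actual worst-case CDF error for Bernoulli(1/2) sums.
import numpy as np
from scipy import stats

def max_cdf_error(n, p=0.5):
    """sup_t |P(Z_n <= t) - Phi(t)|, evaluated at the jump points of the binomial CDF."""
    k = np.arange(n + 1)
    t = (k - n * p) / np.sqrt(n * p * (1 - p))
    gauss = stats.norm.cdf(t)
    right = stats.binom.cdf(k, n, p)       # CDF value just after the jump at k
    left = stats.binom.cdf(k - 1, n, p)    # CDF value just before the jump at k
    return max(np.abs(right - gauss).max(), np.abs(left - gauss).max())

rho_over_sigma3 = 1.0                      # (1/8) / (1/8) for centered Bernoulli(1/2)
for n in (100, 1_000, 10_000):
    bound = 0.4748 * rho_over_sigma3 / np.sqrt(n)
    print(f"n={n:>6}  bound = {bound:.5f}   actual sup error = {max_cdf_error(n):.5f}")
```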

In ML theory, Berry-Esseen is the right reference point when comparing CLT-based confidence intervals against concentration inequalities. For sub-Gaussian data, Hoeffding gives a non-asymptotic Gaussian-type tail at every $n$; Berry-Esseen tells you that the asymptotic statement is accurate to $O(1/\sqrt{n})$, which can be stronger or weaker than Hoeffding's constant depending on the regime.

Failure Mode

Berry-Esseen requires a finite third moment. Without it the rate can be slower than $1/\sqrt{n}$; under finite $(2+\delta)$-th moments only, the rate degrades to $1/n^{\delta/2}$. The bound is uniform over $t$, so it controls the central CDF well but is loose deep in the tails: at $t$ where $\Phi(t)$ itself is exponentially small, the relative error of the Gaussian approximation can be much larger than the absolute Berry-Esseen budget suggests. Edgeworth expansions give sharper approximations inside the bulk, and the Cramér large-deviation regime gives sharper tail rates under exponential-moment conditions.

Multivariate CLT

Theorem

Multivariate CLT

Statement

If $X_1, \ldots, X_n \in \mathbb{R}^d$ are iid with mean $\mu$ and covariance matrix $\Sigma$, then

$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}_d(0, \Sigma).$$

The full covariance structure is preserved in the limit: the joint distribution of the standardized sample-mean vector converges to a multivariate Gaussian with the same covariance as a single $X_i$.

Intuition

The Cramér-Wold device reduces multivariate weak convergence to weak convergence of all one-dimensional projections. For every $a \in \mathbb{R}^d$, the projected sequence $a^\top X_i$ is one-dimensional iid with variance $a^\top \Sigma a$, and the classical CLT gives $a^\top \sqrt{n}(\bar X_n - \mu) \xrightarrow{d} \mathcal{N}(0, a^\top \Sigma a)$. Since the collection of one-dimensional projections characterizes a multivariate Gaussian uniquely, the joint limit is $\mathcal{N}_d(0, \Sigma)$.

Proof Sketch

Cramér-Wold theorem: $X_n \xrightarrow{d} X$ in $\mathbb{R}^d$ if and only if $a^\top X_n \xrightarrow{d} a^\top X$ for every $a \in \mathbb{R}^d$. Each projection is a univariate iid CLT with limit $\mathcal{N}(0, a^\top \Sigma a)$, and these limits agree with the projections of $\mathcal{N}_d(0, \Sigma)$. Conclude joint convergence.

Why It Matters

The multivariate CLT is the workhorse for vector-valued estimators. The MLE in a $d$-dimensional regular parametric family is asymptotically $\mathcal{N}_d(\theta_0, I(\theta_0)^{-1}/n)$, which comes from the multivariate CLT applied to the $d$-vector score $\nabla_\theta \log p(X; \theta_0)$. OLS asymptotic normality, GMM asymptotics, and the joint distribution of estimated regression coefficients all follow the same template: write the estimator as a sum of iid random vectors plus a negligible remainder, then apply the multivariate CLT.

The covariance $\Sigma$ matters: scalar tests of single coefficients use the diagonal entries, while joint tests (chi-squared statistics, F-tests) use the inverse of $\Sigma$ to whiten the vector. Misreporting $\Sigma$ by ignoring off-diagonal correlations is the most common practical error in multivariate inference.

Failure Mode

A singular $\Sigma$ collapses the limit to a degenerate Gaussian on a lower-dimensional subspace. The marginal CLT still holds in each direction, but inverting $\Sigma$ in a Wald or Mahalanobis statistic requires care. In high dimensions ($d$ growing with $n$), the multivariate CLT in operator norm fails outright: the sample covariance $\hat\Sigma_n$ has eigenvalues that do not converge to those of $\Sigma$ unless $d/n \to 0$. The Marchenko-Pastur law replaces the multivariate CLT in the regime $d/n \to c \in (0, \infty)$.

Delta Method: CLT Through Smooth Functions

The delta method is what lets you propagate a CLT through a smooth nonlinear transformation. It is the second-most-used theorem in applied statistics after the CLT itself, and it is the bridge from "CLT for the sample mean" to "CLT for any smooth functional of moments". The full treatment with regularity conditions and second-order versions lives in asymptotic statistics; the result is restated here because most applications of the CLT in practice immediately chain into a delta-method step.

Theorem

Delta Method

Statement

Let $T_n \in \mathbb{R}^k$ satisfy $\sqrt{n}(T_n - \mu) \xrightarrow{d} \mathcal{N}_k(0, \Sigma)$, and let $g : \mathbb{R}^k \to \mathbb{R}^m$ be differentiable at $\mu$ with Jacobian $G(\mu) = \nabla g(\mu) \in \mathbb{R}^{m \times k}$. Then

$$\sqrt{n}(g(T_n) - g(\mu)) \xrightarrow{d} \mathcal{N}_m\!\left(0,\; G(\mu)\, \Sigma\, G(\mu)^\top\right).$$

In the scalar case with $g'(\mu) \neq 0$, the asymptotic variance is $[g'(\mu)]^2 \sigma^2$. When $g'(\mu) = 0$, the second-order delta method gives $n(g(T_n) - g(\mu)) \xrightarrow{d} \tfrac{1}{2} g''(\mu)\, \sigma^2\, \chi^2_1$.

Intuition

A first-order Taylor expansion of $g$ at $\mu$ gives $g(T_n) - g(\mu) = G(\mu)(T_n - \mu) + R_n$ with remainder $R_n = o_p(\|T_n - \mu\|) = o_p(1/\sqrt{n})$. Multiplying by $\sqrt{n}$ gives $\sqrt{n}(g(T_n) - g(\mu)) = G(\mu)\, \sqrt{n}(T_n - \mu) + o_p(1)$, and the leading term pushes the Gaussian forward through the linear map $G(\mu)$.

Proof Sketch

Apply the multivariate Taylor theorem to $g$ at $\mu$. The remainder satisfies $\|R_n\| / \|T_n - \mu\| \to 0$ in probability, which combined with the $\sqrt{n}$-rate hypothesis on $T_n$ gives $\sqrt{n}\, R_n \xrightarrow{p} 0$. Slutsky's theorem converts the joint limit of $(\sqrt{n}(T_n - \mu), \sqrt{n} R_n)$ to the conclusion.

Why It Matters

The delta method is how every asymptotic standard error in applied statistics is computed beyond simple sample means.

  • Asymptotic variance of $\log \hat p$ for $\hat p = \bar X_n$, $X_i \sim \mathrm{Bern}(p)$: $g(p) = \log p$, $g'(p) = 1/p$, asymptotic variance $(1/p)^2 \cdot p(1-p) = (1-p)/p$.
  • Variance of the log-odds $\log(\hat p/(1-\hat p))$: $g'(p) = 1/(p(1-p))$, asymptotic variance $1/(p(1-p))$. This is the basis of logit-scale Wald intervals, which often have better coverage than untransformed Wald intervals (a numeric check follows this list).
  • Variance of a ratio $\hat\mu_1/\hat\mu_2$: a standard delta-method exercise giving the variance of the ratio in terms of the variances and covariance of the components.
  • Sandwich variance for misspecified MLE: the delta method through the score-equation expansion produces the $V^{-1} A V^{-\top}$ sandwich form, with $V$ the expected score derivative and $A$ the score covariance.
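A numeric check of the log-odds entry above, under assumed values $p = 0.3$ and $n = 500$ (a sketch, not a prescription):

```python
# Delta-method standard error of the log-odds vs. a direct simulation.
# Assumed values: p = 0.3, n = 500.
import numpy as np

rng = np.random.default_rng(3)
p, n, trials = 0.3, 500, 20_000

p_hat = rng.binomial(n, p, size=trials) / n
p_hat = np.clip(p_hat, 1e-6, 1 - 1e-6)          # guard against log(0) in extreme trials
log_odds = np.log(p_hat / (1 - p_hat))

# g(p) = log(p/(1-p)), g'(p) = 1/(p(1-p)); asymptotic variance of g(p_hat) is
# [g'(p)]^2 * p(1-p)/n = 1/(n p (1-p)).
delta_sd = np.sqrt(1.0 / (n * p * (1 - p)))
print(f"delta-method sd = {delta_sd:.4f}   simulated sd = {log_odds.std():.4f}")
```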

Failure Mode

Three failure modes. (1) $g'(\mu) = 0$: use the second-order delta method; the rate jumps from $\sqrt{n}$ to $n$ and the limit is a scaled $\chi^2_1$, not normal. (2) $g$ not differentiable at $\mu$: e.g. $g(x) = |x|$ at $\mu = 0$ gives a folded-normal limit. (3) $T_n$ converges at a rate other than $\sqrt{n}$ (e.g. extreme-value problems with rate $n$): the delta method must use the actual rate, not $\sqrt{n}$.

Martingale CLT: Beyond Independence

The classical CLT requires independence. Many sequences in machine learning are not independent but are nearly so in a conditional sense: for SGD, the stochastic gradient at step $t$ depends on all past iterates, yet it is mean-zero conditional on them once centered at the population gradient. The right tool is the martingale CLT, which replaces "iid" with "martingale difference plus conditional Lindeberg".

Theorem

Martingale CLT (Brown 1971 / McLeish 1974)

Statement

Let $\{X_t\}_{t=1}^n$ be a martingale difference sequence with respect to a filtration $\{\mathcal{F}_t\}$, meaning $\mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = 0$ and $\mathbb{E}[X_t^2] < \infty$ for every $t$. Suppose

$$V_n := \frac{1}{n} \sum_{t=1}^n \mathbb{E}\!\left[X_t^2 \mid \mathcal{F}_{t-1}\right] \xrightarrow{p} \sigma^2 > 0,$$

and the conditional Lindeberg condition holds: for every $\epsilon > 0$,

$$\frac{1}{n} \sum_{t=1}^n \mathbb{E}\!\left[X_t^2 \, \mathbf{1}\{|X_t| > \epsilon \sqrt{n}\} \,\Big|\, \mathcal{F}_{t-1}\right] \xrightarrow{p} 0.$$

Then

$$\frac{1}{\sqrt{n}} \sum_{t=1}^n X_t \xrightarrow{d} \mathcal{N}(0, \sigma^2).$$

The conditional variance condition replaces "identically distributed", the conditional Lindeberg replaces the unconditional Lindeberg, and the martingale-difference property replaces independence.

Intuition

A martingale difference is "locally mean-zero": each term has mean zero given the past, even though it can depend on the past in arbitrary nonlinear ways. The conditional variance condition fixes a long-run average level of conditional variance; the conditional Lindeberg condition forbids any single term from dominating the variance budget at the scale $\sqrt{n}$. Together these are exactly the right replacements for iid: predictable second moments stand in for identical distribution, and conditional negligibility stands in for genuine independence.

Proof Sketch

McLeish's proof uses the auxiliary product $\prod_t (1 + i t X_t / \sqrt{n})$ and shows it is close in $L^1$ to $e^{i t S_n / \sqrt{n}}$ under the stated conditions, which reduces the problem to controlling the conditional second moment of the truncated log-product. The conditional Lindeberg condition makes the truncation error $o_p(1)$; the conditional variance condition pins down the deterministic limit. See Hall and Heyde, Martingale Limit Theory and Its Application (1980), Theorem 3.2, or Durrett, Probability: Theory and Examples, Theorem 4.3.

Lyapunov-type substitute: if the stronger condition holds that $\mathbb{E}[|X_t|^{2+\delta} \mid \mathcal{F}_{t-1}]$ is uniformly bounded for some $\delta > 0$, the conditional Lindeberg condition follows by Markov's inequality (the same logic as in the iid case).

Why It Matters

The martingale CLT is the asymptotic backbone of modern stochastic-process theory and underwrites several core ML applications.

  • SGD as approximate Gaussian noise: at step $t$ the stochastic gradient $g_t = \nabla \ell(\theta_t; \xi_t)$ for a random batch $\xi_t$ has $\mathbb{E}[g_t \mid \mathcal{F}_{t-1}] = \nabla L(\theta_t)$, where $L$ is the population loss. The noise $\eta_t = g_t - \nabla L(\theta_t)$ is a martingale difference, and under bounded-variance assumptions the martingale CLT gives $\sum_t \eta_t / \sqrt{n} \to \mathcal{N}(0, \Sigma_{\text{noise}})$. This is what licenses the SGD-as-SDE approximation $d\theta_t = -\nabla L(\theta_t)\, dt + \sqrt{\eta\, \Sigma_{\text{noise}}}\, dW_t$ used in stochastic-modified-equation analyses (Li, Tai, E, 2017).
  • Time-series asymptotics: an AR$(1)$ process $X_t = \rho X_{t-1} + \epsilon_t$ with $|\rho| < 1$ has scaled partial sums whose Gaussian limit comes from a martingale CLT applied to the innovation sequence $\epsilon_t$.
  • Online learning regret bounds: the regret of an online algorithm is a sum of martingale differences (current loss minus its conditional expectation given the past). Martingale CLT gives the asymptotic Gaussian distribution of regret around its mean, which is the basis of asymptotic confidence intervals for online estimators.
  • Stochastic-approximation iterates: Robbins-Monro and Polyak-Ruppert averaging produce iterates whose asymptotic normality $\sqrt{n}(\bar\theta_n - \theta^*) \xrightarrow{d} \mathcal{N}(0, \Sigma)$ is a martingale CLT statement about the noise sequence at the stationary point.

Failure Mode

The martingale CLT can fail when the conditional variance does not stabilize. Heteroscedastic SGD with a non-stationary step size can have $V_n$ that does not converge to a deterministic limit, and the conclusion has to be modified: $S_n / \sqrt{n V_n}$ converges to $\mathcal{N}(0,1)$ in the sense of stable convergence, with the limiting variance itself random. This is the LAMN (locally asymptotically mixed normal) regime, and it shows up in unit-root time series and in some non-stationary stochastic-approximation problems.

Strong dependence breaks the conclusion outright: if the conditional variance $\mathbb{E}[X_t^2 \mid \mathcal{F}_{t-1}]$ has long-range dependence, the limit can be fractional Brownian motion at a non-$\sqrt{n}$ rate, not Gaussian.

Applications in Machine Learning

The CLT is invoked dozens of times across ML theory, sometimes silently. This section makes the key invocations explicit, with the precise form of the CLT used in each.

Asymptotic normality of MLE

For a regular parametric family with score $s(X; \theta) = \nabla_\theta \log p(X; \theta)$ and Fisher information $I(\theta_0) = \mathbb{E}_{\theta_0}[s\, s^\top]$, the MLE satisfies the score-equation expansion $\sqrt{n}(\hat\theta_n - \theta_0) = I(\theta_0)^{-1} \cdot \frac{1}{\sqrt{n}} \sum_i s(X_i; \theta_0) + o_p(1)$. The leading term is a multivariate iid CLT applied to the mean-zero score vectors, giving $\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal{N}_d(0, I(\theta_0)^{-1})$. The full derivation, including the regularity conditions and the extension to misspecified models with sandwich variance, is covered under asymptotic statistics.

SGD noise: martingale CLT in action

Stochastic gradient descent with minibatch size $b$ at iterate $\theta_t$ takes the step $\theta_{t+1} = \theta_t - \eta\, g_t$, where $g_t$ is the empirical gradient on a random minibatch. Decompose the noise as $\eta_t = g_t - \nabla L(\theta_t)$, where $L$ is the population loss. Under bounded fourth moments and a stationary distribution of iterates, the martingale CLT gives the rescaled cumulative noise $\frac{1}{\sqrt{n}} \sum_t \eta_t \xrightarrow{d} \mathcal{N}_d(0, \Sigma(\theta^*))$, where $\Sigma(\theta^*)$ is the per-sample gradient covariance at the optimum (divided by the batch size $b$). This is the asymptotic-Gaussian noise approximation that underpins the SGD-as-SDE diffusion analysis: at stationarity, SGD is approximately a discretization of the Langevin SDE $d\theta_t = -\nabla L(\theta_t)\, dt + \sqrt{\eta\, \Sigma(\theta_t)}\, dW_t$.
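A toy illustration of the martingale structure, under an assumed one-dimensional least-squares objective $L(\theta) = \tfrac12\,\mathbb{E}[(\xi\theta - 1)^2]$ with $\xi \sim \mathrm{Exp}(1)$, so the per-step noise genuinely depends on the current iterate and is not iid; the step size, horizon, and trial counts are arbitrary:

```python
# SGD noise as a martingale difference: the per-step noise depends on theta_t (the past),
# yet the rescaled cumulative noise still looks Gaussian, as the martingale CLT predicts.
# Assumed toy problem: L(theta) = 0.5 * E[(xi*theta - 1)^2], xi ~ Exp(1), so that
# E[xi] = 1, E[xi^2] = 2 and grad L(theta) = 2*theta - 1.
import numpy as np

rng = np.random.default_rng(4)

def rescaled_cumulative_noise(n_steps=4_000, lr=0.01):
    theta, total = 0.0, 0.0
    for _ in range(n_steps):
        xi = rng.exponential(1.0)
        g = xi * (xi * theta - 1.0)            # per-sample stochastic gradient
        total += g - (2.0 * theta - 1.0)       # martingale-difference noise eta_t
        theta -= lr * g
    return total / np.sqrt(n_steps)

draws = np.array([rescaled_cumulative_noise() for _ in range(400)])
z = (draws - draws.mean()) / draws.std()
# For a Gaussian limit, skewness and excess kurtosis of the draws should be near zero.
print(f"mean={draws.mean():+.3f}  sd={draws.std():.3f}  "
      f"skew={np.mean(z**3):+.2f}  excess kurtosis={np.mean(z**4) - 3:+.2f}")
```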

The non-asymptotic version of the same statement is a Hoeffding or Bernstein-type concentration bound on the cumulative noise; the martingale CLT is the asymptotic counterpart that gives the precise covariance.

Bootstrap validity via the empirical-process CLT

The bootstrap works because resampling from $\hat F_n$ approximates sampling from $F$, and the rigorous justification is a functional CLT. Donsker's theorem states that the empirical process $\sqrt{n}(\hat F_n - F)$ converges weakly to a Brownian bridge $B \circ F$ in $\ell^\infty([-\infty, \infty])$. The Bickel-Freedman (1981) bootstrap theorem applies the same machinery in the bootstrap world conditional on the data, giving conditional weak convergence of $\sqrt{n}(\hat F_n^* - \hat F_n)$ to the same Brownian bridge with probability $1$. Smooth functionals of $\hat F_n$ inherit asymptotic normality from this, and the bootstrap distribution of a smooth functional matches the true sampling distribution up to the same $O(1/\sqrt{n})$ Edgeworth-corrected rate.
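A small sketch of the consistency statement: the bootstrap distribution of $\sqrt{n}(\bar X_n^* - \bar X_n)$, built from one observed sample, against the true sampling distribution of $\sqrt{n}(\bar X_n - \mu)$ estimated by Monte Carlo (Exp(1) data and $n = 200$ are assumed values).

```python
# Bootstrap distribution of the sample mean vs. the true sampling distribution.
# Assumed setup: Exp(1) data, n = 200, B = 2000 bootstrap resamples.
import numpy as np

rng = np.random.default_rng(5)
n, B, trials = 200, 2_000, 2_000

# True sampling distribution of sqrt(n)*(mean - mu), estimated by Monte Carlo.
true_draws = np.sqrt(n) * (rng.exponential(1.0, size=(trials, n)).mean(axis=1) - 1.0)

# Bootstrap world: one observed sample x, resample with replacement from F_n-hat.
x = rng.exponential(1.0, size=n)
boot_means = rng.choice(x, size=(B, n), replace=True).mean(axis=1)
boot_draws = np.sqrt(n) * (boot_means - x.mean())

for q in (0.05, 0.50, 0.95):
    print(f"q={q:.2f}  true quantile = {np.quantile(true_draws, q):+.3f}   "
          f"bootstrap quantile = {np.quantile(boot_draws, q):+.3f}")
```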

Sample covariance and where the CLT fails in high dimension

For iid $X_i \in \mathbb{R}^d$ with $\mathbb{E}[X_i] = 0$ and finite fourth moments, the sample covariance $\hat\Sigma_n = \frac{1}{n} \sum_i X_i X_i^\top$ satisfies an entrywise CLT: $\sqrt{n}\,(\hat\Sigma_n - \Sigma)_{jk} \xrightarrow{d} \mathcal{N}(0, V_{jk})$, where $V_{jk}$ involves the fourth moments of the corresponding components. With $d$ fixed and $n \to \infty$, the operator-norm error $\|\hat\Sigma_n - \Sigma\|_{\text{op}}$ is $O_p(\sqrt{d/n})$.

In the high-dimensional regime $d/n \to c > 0$, the operator-norm error does not vanish: empirical eigenvalues are biased in a deterministic way described by the Marchenko-Pastur distribution, and the entrywise CLT for $\hat\Sigma_n$ does not assemble into a CLT for the spectrum or for matrix-valued functionals like $\hat\Sigma_n^{-1}$. This is why covariance-based methods (PCA, GLS, Wald tests) need shrinkage or regularization in high dimensions: the multivariate CLT gives no operator-norm guarantee.
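A quick illustration of the two regimes, under an assumed identity covariance (so every population eigenvalue is exactly 1): with $d \ll n$ the empirical spectrum hugs 1, while with $d = n$ it spreads across the Marchenko-Pastur support even though each entry of $\hat\Sigma_n$ obeys its own CLT.

```python
# Sample-covariance eigenvalue spreading: fine for d << n, badly biased for d ~ n.
# Population covariance is the identity, so all true eigenvalues equal 1.
import numpy as np

rng = np.random.default_rng(6)
n = 1_000
for d in (10, 200, 1_000):
    X = rng.standard_normal((n, d))
    eig = np.linalg.eigvalsh(X.T @ X / n)
    c = d / n
    # Marchenko-Pastur support: [(1 - sqrt(c))^2, (1 + sqrt(c))^2].
    print(f"d/n={c:.2f}  empirical eig range = [{eig.min():.2f}, {eig.max():.2f}]   "
          f"MP support = [{(1 - np.sqrt(c))**2:.2f}, {(1 + np.sqrt(c))**2:.2f}]")
```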

Asymptotic normality of empirical risk

For a fixed hypothesis $h$, the empirical risk $\hat R_n(h) = \frac{1}{n} \sum_i \ell(h(X_i), Y_i)$ is the sample mean of iid losses with mean $R(h)$ and variance $\sigma^2(h) = \mathrm{Var}(\ell(h(X), Y))$. The classical CLT gives $\sqrt{n}(\hat R_n(h) - R(h)) \xrightarrow{d} \mathcal{N}(0, \sigma^2(h))$, which justifies the standard variance-based confidence interval for held-out test loss.

This is pointwise in $h$: the CLT does not give a Gaussian limit for the worst-case empirical-risk gap $\sup_h |\hat R_n(h) - R(h)|$. Uniform versions of the CLT (Donsker classes, Rademacher complexity) are what control the generalization gap.

Canonical Examples

Example

Sample mean of Exponential variables

Let $X_i \sim \mathrm{Exp}(1)$ with $\mu = 1$, $\sigma^2 = 1$, and $\rho = \mathbb{E}|X_i - 1|^3$. The classical CLT gives $\sqrt{n}(\bar X_n - 1) \xrightarrow{d} \mathcal{N}(0, 1)$, so $\bar X_{30} \approx \mathcal{N}(1, 1/30)$.

A standard 95% Wald confidence interval is $\bar X_{30} \pm 1.96/\sqrt{30} \approx \bar X_{30} \pm 0.358$.

The Exponential is skewed (skewness $= 2$), and its third absolute central moment gives $\rho/\sigma^3 \approx 2.4$, so the Berry-Esseen bound on the CDF error is $\approx 0.21$ at $n = 30$, which is large. By $n = 100$ the bound drops to $\approx 0.11$, and by $n = 1000$ to $\approx 0.04$. Symmetrizing transformations (e.g. the Wilson-Hilferty cube root for $\chi^2$, or the log for Exponential rates) reduce the constant and give better finite-sample coverage.
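The coverage consequence is easy to check by simulation (a sketch treating $\sigma = 1$ as known; trial counts are arbitrary):

```python
# Empirical coverage of the 95% Wald interval for the mean of Exp(1) data.
# sigma = 1 is treated as known; the skewness costs coverage at small n.
import numpy as np

rng = np.random.default_rng(7)
trials, z = 10_000, 1.96
for n in (30, 100, 1_000):
    xbar = rng.exponential(1.0, size=(trials, n)).mean(axis=1)
    covered = np.abs(xbar - 1.0) <= z / np.sqrt(n)
    print(f"n={n:>5}  empirical coverage = {covered.mean():.3f}  (nominal 0.950)")
```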

Example

Normal approximation to the Binomial

If $X_i \sim \mathrm{Bern}(p)$, then $S_n = \sum_i X_i \sim \mathrm{Bin}(n, p)$ and the classical CLT gives

$$\frac{S_n - np}{\sqrt{n p (1-p)}} \xrightarrow{d} \mathcal{N}(0,1).$$

The textbook rule of thumb $np \geq 5$ and $n(1-p) \geq 5$ ensures the Berry-Esseen factor $\rho/\sigma^3 = (p^2 + (1-p)^2)/\sqrt{p(1-p)}$ times $1/\sqrt{n}$ is small enough for two-sided coverage to be roughly correct. For $p = 1/2$, $n = 100$: $S_{100} \approx \mathcal{N}(50, 25)$, and $\Pr[S_{100} \geq 60] \approx \Pr[Z \geq 2] \approx 0.023$ (CLT) versus $\approx 0.028$ exactly; the continuity-corrected approximation $\Pr[Z \geq 1.9] \approx 0.029$ is closer.
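These numbers can be reproduced with scipy's exact binomial CDF (a sketch; the continuity correction is included for contrast):

```python
# Exact binomial tail vs. the CLT approximation for P[S_100 >= 60], p = 1/2.
import numpy as np
from scipy import stats

n, p, k = 100, 0.5, 60
mu, sd = n * p, np.sqrt(n * p * (1 - p))

exact = stats.binom.sf(k - 1, n, p)               # P[S_n >= 60], exact
plain_clt = stats.norm.sf((k - mu) / sd)          # P[Z >= 2.0]
corrected = stats.norm.sf((k - 0.5 - mu) / sd)    # continuity-corrected: P[Z >= 1.9]
print(f"exact = {exact:.4f}   CLT = {plain_clt:.4f}   continuity-corrected = {corrected:.4f}")
```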

[Figure: Binomial(n, p) PMF with the de Moivre–Laplace Normal(np, np(1−p)) overlay]

Bars: exact Binomial PMF. Dashed curve: the Normal(np, np(1−p)) limit predicted by de Moivre–Laplace. The overlay tightens around the bars as n grows, with the worst fit near p = 0 or p = 1 where the distribution is skewed.

Example

Lyapunov over Lindeberg in OLS asymptotics

Take $X_i \in \mathbb{R}^d$ deterministic, $\epsilon_i$ iid mean-zero with $\mathbb{E}|\epsilon_i|^{2+\delta} < \infty$, and the linear model $Y_i = X_i^\top \beta + \epsilon_i$. The score sum $\sum_i X_i \epsilon_i$ has independent but not identically distributed terms (the $X_i$ vary). The Lyapunov condition reduces to a balanced-design condition

$$\frac{\sum_i \|X_i\|^{2+\delta}}{\left(\sum_i \|X_i\|^2\right)^{1+\delta/2}} \to 0,$$

which is satisfied by any reasonable design in which no single covariate vector dominates. Lyapunov implies Lindeberg, and the Lindeberg-Feller CLT gives asymptotic normality of $\hat\beta_{\text{OLS}}$ with the sandwich variance. The Lyapunov condition is the form practitioners verify; the Lindeberg condition is the form theorists prove.
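A numeric check of the balanced-design ratio with $\delta = 1$, for two assumed designs (a benign Gaussian design and one whose first row is inflated to carry about half of the total design mass):

```python
# Lyapunov balanced-design ratio (delta = 1): should vanish for a benign design,
# but stays bounded away from zero when one covariate row dominates.
import numpy as np

rng = np.random.default_rng(8)

def lyapunov_ratio(X, delta=1.0):
    norms = np.linalg.norm(X, axis=1)
    return (norms**(2 + delta)).sum() / (norms**2).sum()**(1 + delta / 2)

for n in (100, 1_000, 10_000):
    benign = rng.standard_normal((n, 3))
    spiked = benign.copy()
    spiked[0] *= np.sqrt(n)                 # one row now carries about half the design mass
    print(f"n={n:>6}  benign ratio = {lyapunov_ratio(benign):.4f}   "
          f"spiked ratio = {lyapunov_ratio(spiked):.4f}")
```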

Common Confusions

Watch Out

The CLT does NOT say individual observations are Gaussian

The CLT says the average $\bar X_n$ is approximately Gaussian for large $n$. Individual observations $X_i$ retain their original distribution: they could be Bernoulli, Exponential, Poisson, or anything else with finite variance. The Gaussian emerges only through averaging. A single coin flip is not Gaussian; the average of a million coin flips is.

Watch Out

The CLT is asymptotic, concentration is non-asymptotic

The CLT says $Z_n \xrightarrow{d} \mathcal{N}(0,1)$ as $n \to \infty$. It does not say the approximation is good for any specific $n$. Concentration inequalities (Hoeffding, Bernstein, sub-Gaussian tails) give finite-sample bounds with explicit constants. Berry-Esseen sits between the two as a finite-$n$ correction to the CLT, but it is uniform over the CDF and can be loose deep in the tails. For tight non-asymptotic tail probabilities, use a concentration inequality, not Berry-Esseen.

Watch Out

Convergence in distribution does not imply convergence in probability

The CLT gives convergence in distribution: $Z_n \xrightarrow{d} Z$ where $Z \sim \mathcal{N}(0,1)$. This does not mean $Z_n - Z \to 0$ in any sense. The random variables $Z_n$ are determined by the data; $Z$ is a "fresh" Gaussian on a possibly different probability space. Convergence in distribution is a statement about the shape of the distribution, not about the values of the random variables.

Watch Out

Finite variance is required, not just finite mean

The LLN requires only a finite mean. The CLT requires a finite variance. For distributions with finite mean but infinite variance (e.g. stable distributions with index $1 < \alpha < 2$), the sample mean still converges (by the LLN) but the CLT does not hold in its standard form. The normalized sum $S_n/n^{1/\alpha}$ converges to a non-Gaussian stable distribution instead.

Watch Out

The Lindeberg condition is not just about iid

The Lindeberg condition was designed for triangular arrays of independent, non-identically distributed variables. For iid sequences with finite variance, Lindeberg holds automatically, so reading "by Lindeberg's CLT" in a paper on iid data is just a longer way of citing the classical CLT. The condition becomes load-bearing only when the summands have heterogeneous variances or grow with $n$, as in regression with deterministic covariates, stratified sampling, and the triangular-array bootstrap.

Watch Out

The martingale CLT does NOT say the noise is iid

SGD noise $\eta_t = g_t - \nabla L(\theta_t)$ is not iid: $\eta_t$ depends on $\theta_t$, which depends on past noise. The martingale CLT allows this dependence so long as the noise is mean-zero conditional on the past and the conditional second moments stabilize. Reading the SGD-as-SDE literature as "SGD noise is iid Gaussian" is a common oversimplification; the precise statement is "the rescaled cumulative noise has a Gaussian limit by the martingale CLT".

Summary

  • Classical (Lindeberg-Levy) CLT: $\sqrt{n}(\bar X_n - \mu)/\sigma \xrightarrow{d} \mathcal{N}(0,1)$ for iid finite-variance variables.
  • Lindeberg-Feller: triangular arrays of independent variables converge to a Gaussian iff the Lindeberg condition holds (essentially: no summand dominates the variance).
  • Lyapunov condition: a $(2+\delta)$-th moment criterion that implies Lindeberg and is easier to verify in practice.
  • Berry-Esseen: CDF error $\leq 0.4748\,\rho/(\sigma^3\sqrt{n})$; the $O(1/\sqrt{n})$ rate is tight.
  • Multivariate CLT: $\sqrt{n}(\bar X_n - \mu) \xrightarrow{d} \mathcal{N}_d(0, \Sigma)$; breaks down in high dimension.
  • Delta method: smooth $g$ pushes the Gaussian forward by the Jacobian; the second-order version handles vanishing gradients.
  • Martingale CLT: replaces independence with $\mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = 0$ plus conditional variance and conditional Lindeberg conditions; underwrites SGD asymptotics, time-series CLTs, and stochastic-approximation theory.
  • ML uses: MLE asymptotic normality, SGD-noise Gaussian approximation, bootstrap validity, sample-covariance asymptotics, asymptotic normality of held-out empirical risk.
  • Failure modes: infinite variance gives stable-law limits; high dimension breaks operator-norm CLTs; strong dependence breaks martingale CLTs.

Exercises

ExerciseCore

Problem

A factory produces bolts whose lengths have mean $\mu = 10$ cm and standard deviation $\sigma = 0.5$ cm (the distribution of lengths is unknown). You measure $n = 100$ bolts and compute $\bar X_{100}$. Using the CLT, find an interval that contains $\mu$ with approximately 95% probability. Then apply Berry-Esseen to bound the worst-case CDF error of the Gaussian approximation, assuming a third absolute moment of $\rho = 0.5$ cm$^3$.

ExerciseAdvanced

Problem

Let $X_1, \ldots, X_n \sim \mathrm{Bern}(p)$ with $p$ unknown, and let $\hat p = \bar X_n$. Use the CLT and the delta method to: (a) derive the asymptotic distribution of $\hat p$ and construct a Wald 95% confidence interval for $p$; (b) derive the asymptotic distribution of the log-odds $g(\hat p) = \log(\hat p/(1-\hat p))$ and the corresponding asymptotic variance; (c) explain why the log-odds Wald interval typically has better finite-sample coverage than the untransformed Wald interval, especially near $p = 0$ or $p = 1$.

ExerciseAdvanced

Problem

Construct a triangular array $\{X_{n,i}\}_{i=1}^n$ of independent mean-zero random variables, each with variance $1$, for which the classical CLT holds but a naive threshold check of the Lindeberg condition at $\epsilon = 0.5$ fails. Then construct a different array where the Lindeberg condition fails for some $\epsilon$ and the CLT also fails. Use these to explain why "Lindeberg fails" needs context: the question is at which $\epsilon$ and how the variance is allocated.

ExerciseResearch

Problem

The CLT assumes finite variance. For a symmetric stable distribution with index $\alpha \in (0, 2)$, the variance is infinite. State the generalized CLT for this case, give the correct normalization rate, and explain why this matters for: (i) financial-returns models with $\alpha \approx 1.5$-$1.7$, (ii) ML algorithms whose theory assumes sub-Gaussian tails (concentration inequalities, generalization bounds), and (iii) the validity of the bootstrap for the sample mean.


Frequently Asked Questions

What does the Central Limit Theorem say?
If you average many independent random variables with finite variance, the average looks Gaussian regardless of the original distribution. Formally: for iid $X_i$ with mean $\mu$ and variance $\sigma^2$, $\sqrt{n}(\bar X_n - \mu)/\sigma$ converges in distribution to $\mathcal{N}(0,1)$. The convergence is in distribution, and the rate is $O(1/\sqrt{n})$ under a finite third moment.
How fast does the CLT converge?
Berry-Esseen gives the rate: $\sup_t |\Pr[Z_n \leq t] - \Phi(t)| \leq 0.4748\,\rho/(\sigma^3\sqrt{n})$. The convergence is $O(1/\sqrt{n})$; heavier tails (a larger third absolute moment) slow it down. For symmetric distributions, higher-order Edgeworth expansions give faster rates in the bulk.
When does the CLT apply to non-iid data?
Lindeberg-Feller covers triangular arrays of independent but non-identically-distributed variables, with the Lindeberg condition replacing iid. Martingale CLTs (Brown 1971, McLeish 1974) handle dependent sequences with mean-zero conditional structure, which is the right tool for SGD asymptotics, time series, and stochastic approximation. Stationary mixing sequences have their own CLTs (Ibragimov, Rio).
When does the CLT fail?
Three regimes. (1) Infinite variance: stable-law limits at rate $n^{1/\alpha}$ instead of Gaussian limits at rate $\sqrt{n}$. (2) High dimension: operator-norm CLTs for sample covariances fail when $d/n \to c > 0$, replaced by Marchenko-Pastur asymptotics. (3) Strong dependence: long-range dependence breaks the classical CLT and its martingale variants; fractional Brownian limits at non-$\sqrt{n}$ rates can apply instead.
Why does the CLT make the MLE asymptotically normal?
Score-equation expansion: $\sqrt{n}(\hat\theta_n - \theta_0) = I(\theta_0)^{-1}\,\frac{1}{\sqrt{n}}\sum_i s(X_i;\theta_0) + o_p(1)$, where the sum is over iid mean-zero scores (Bartlett's first identity) with covariance $I(\theta_0)$. The multivariate CLT plus Slutsky give $\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal{N}_d(0, I(\theta_0)^{-1})$. The full chain (consistency + score expansion + CLT + Slutsky) is the M-estimator framework on the [asymptotic statistics page](/topics/asymptotic-statistics).
Is SGD noise actually Gaussian?
Approximately, by the martingale CLT applied to the per-step noise $\eta_t = g_t - \nabla L(\theta_t)$. The noise is a martingale difference (mean zero given the past), and under bounded second moments and a stationary iterate distribution, the rescaled cumulative noise is Gaussian in the limit. This is what licenses the SGD-as-SDE diffusion approximation; the precise statement is asymptotic, not exact, and the per-step noise distribution is generally non-Gaussian (it is determined by the per-sample-gradient distribution).

References

Canonical:

  • Billingsley, Probability and Measure (3rd ed., 1995), Sections 27-29 (classical CLT, Lindeberg-Feller, Berry-Esseen)
  • Durrett, Probability: Theory and Examples (5th ed., 2019), Section 3.4 (CLT family) and Section 4.3 (martingale CLT)
  • Feller, An Introduction to Probability Theory and Its Applications Vol. 2 (1971), Chapters VIII, XV, XVI

Asymptotic statistics:

  • van der Vaart, Asymptotic Statistics (1998), Chapters 2-3 (CLT, delta method, Slutsky) and Chapter 5 (M-estimator asymptotic normality)
  • Lehmann, Elements of Large-Sample Theory (1999), Chapters 2-3 (delta method, Lindeberg-Feller in regression)

Martingale CLT:

  • Hall and Heyde, Martingale Limit Theory and Its Application (1980), Theorem 3.2 (the standard reference)
  • Brown, "Martingale central limit theorems" (1971), Annals of Mathematical Statistics
  • McLeish, "Dependent central limit theorems and invariance principles" (1974), Annals of Probability

Modern probabilistic perspective:

  • Vershynin, High-Dimensional Probability (2018), Section 2.7 (CLT and Berry-Esseen)
  • Kallenberg, Foundations of Modern Probability (3rd ed., 2021), Chapters 5-6
  • Casella & Berger, Statistical Inference (2002), Chapters 5-10 (applied perspective)

ML applications:

  • Li, Tai, E, "Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms" (2017), arXiv:1511.06251 (SGD-as-SDE, martingale CLT setup)
  • Bickel and Freedman, "Some Asymptotic Theory for the Bootstrap" (1981), Annals of Statistics (bootstrap consistency via CLT)
