
Statistical Estimation

Central Limit Theorem

The CLT family: classical Lindeberg-Levy, the Lindeberg-Feller condition for non-identical summands, Berry-Esseen rate, multivariate CLT, the delta method, and the martingale CLT that underwrites SGD asymptotics. Why these results power MLE asymptotic normality, bootstrap validity, and the Gaussian-noise approximation of stochastic optimization, and where the CLT breaks in high dimensions and under heavy tails.


Why This Matters

The central limit theorem explains why the Gaussian distribution appears everywhere. It is not because nature is inherently Gaussian. It is because many quantities of interest are averages, or behave like averages, and averages of independent finite-variance terms are approximately Gaussian regardless of the underlying distribution.

In machine learning and statistics the CLT is the engine behind:

  • Asymptotic normality of MLE: $\sqrt{n}(\hat\theta_{\text{MLE}} - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1})$, where $I(\theta_0)$ is the Fisher information. The score function is a sum of iid mean-zero terms, the CLT applies, and the delta method propagates the limit through the score-equation expansion.
  • Confidence intervals and hypothesis tests: $\bar{X} \pm z_{\alpha/2}\,s/\sqrt{n}$; Wald, score, and LRT statistics all read their critical values off the standard normal because of the CLT.
  • Bootstrap validity: the bootstrap works because the empirical CDF satisfies its own functional CLT (Donsker's theorem), so resampling from $\hat F_n$ approximates sampling from $F$.
  • SGD noise as Gaussian: at each step the stochastic gradient is a sum over a minibatch of per-sample gradients minus the population gradient. The CLT (martingale form) gives the noise a Gaussian-like shape, which is the basis for SGD-as-SDE diffusion analyses.
  • Monte Carlo error bars: estimator variance $\sigma^2/n$, confidence band width $\sigma/\sqrt{n}$, and the universal $1/\sqrt{n}$ rate for simulation all come from the CLT.

The CLT also comes with a sharp caveat: it is asymptotic, so it tells you nothing about a fixed $n$ unless you carry along a rate. Concentration inequalities give the non-asymptotic counterpart with explicit constants. Berry-Esseen sits in between.

Want to feel the convergence instead of only reading about it? The Sampling and Limit Laws Lab lets you switch the source distribution, increase $n$, and watch the sample-mean histogram move toward the Gaussian picture in real time.
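If the lab is not at hand, the same experiment takes a few lines of NumPy. A minimal sketch (the Exp(1) source, trial counts, and sample sizes are illustrative choices, not the lab's settings) that tracks how the spread and skewness of the sample mean behave as $n$ grows:

```python
# Minimal CLT simulation: spread and skewness of the sample mean as n grows.
# The Exp(1) source and the specific n values are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

def sample_mean_draws(n, trials=5_000):
    """Independent realizations of the sample mean of n Exp(1) variables."""
    return rng.exponential(1.0, size=(trials, n)).mean(axis=1)

for n in (1, 5, 30, 200):
    m = sample_mean_draws(n)
    z = (m - m.mean()) / m.std()
    # CLT prediction: sd ~ sigma/sqrt(n) = 1/sqrt(n); skewness -> 0 as n grows.
    print(f"n={n:4d}  sd={m.std():.3f}  (CLT: {1/np.sqrt(n):.3f})  skewness={np.mean(z**3):+.2f}")
```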

Quick Version

| Statement | What it says |
| --- | --- |
| Classical CLT (Lindeberg-Levy) | iid finite-variance: $\sqrt{n}(\bar X_n - \mu)/\sigma \to \mathcal{N}(0,1)$ |
| Lindeberg-Feller | Non-identical summands: a triangular array obeys the CLT iff the Lindeberg condition holds |
| Berry-Esseen | CDF error is $O(1/\sqrt{n})$ with constant $0.4748\,\rho/\sigma^3$ |
| Multivariate CLT | $\sqrt{n}(\bar X_n - \mu) \to \mathcal{N}(0, \Sigma)$ with the full covariance |
| Delta method | Smooth transformations preserve asymptotic normality with the Jacobian |
| Martingale CLT | Replaces independence with conditional moment conditions on $\mathcal{F}_{t-1}$ |
| Hard requirement | Finite second moment (or the equivalent condition on conditional variances) |
[Figure: sample-mean shape becomes Gaussian as n grows]

Start with one bounded source. Average more copies of it. The sample mean narrows around the true center and its shape becomes smoother and more bell-like: the source begins flat or angular (n = 1: raw source; n = 2: still angular; n = 5: bell-like; n = 12: close to Gaussian), approaching the Gaussian target the CLT predicts.

[Figure: sample-mean histogram (K = 1500 trials) versus the predicted N(μ, σ²/N)]

Bars: empirical histogram of the sample mean across K trials. Dashed curve: the limiting Normal N(μ, σ²/N) predicted by the CLT.

Mental Model

The law of large numbers tells you that the sample mean converges. The CLT tells you how: the fluctuations around the true mean are Gaussian with standard deviation $\sigma/\sqrt{n}$. If you zoom in by multiplying by $\sqrt{n}$, the magnified fluctuations have a specific universal shape, the bell curve.

The CLT is a universality result. Whatever distribution the $X_i$ come from, the standardized average looks the same in the limit, so long as the variance is finite. The Lindeberg condition makes precise what this universality demands of the summands: no single term can dominate the variance budget at the relevant scale.

Formal Setup

Definition

Convergence in Distribution

A sequence $X_n$ converges in distribution to $X$ if for every continuous bounded function $g$:

$$\lim_{n \to \infty} \mathbb{E}[g(X_n)] = \mathbb{E}[g(X)].$$

Equivalently, $F_{X_n}(t) \to F_X(t)$ at every continuity point $t$ of $F_X$.

Convergence in distribution is the weakest of the standard convergence modes. It is a statement about distributions, not about random variables: $X_n$ and $X$ need not live on the same probability space, and $X_n - X$ need not be small in any pathwise sense.

The rigorous proofs of every CLT on this page use characteristic functions and Lévy's continuity theorem: pointwise convergence $\varphi_{Z_n}(t) \to \varphi_Z(t)$ at every $t \in \mathbb{R}$ implies $Z_n \xrightarrow{d} Z$, provided the limiting characteristic function is continuous at zero. The Gaussian $\mathcal{N}(0,1)$ has $\varphi(t) = e^{-t^2/2}$, and showing that the characteristic function of the standardized sum converges to $e^{-t^2/2}$ is the entire game.

Classical CLT

Theorem

Classical Central Limit Theorem (Lindeberg-Levy)

Statement

If $X_1, X_2, \ldots$ are iid with mean $\mu$ and variance $\sigma^2 \in (0, \infty)$, then

$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1).$$

Equivalently, $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$, and at the working level $\bar{X}_n \approx \mathcal{N}(\mu, \sigma^2/n)$ for large $n$.

Intuition

Each $X_i$ contributes a random kick to the sum. For large $n$ the aggregate of many small independent kicks is Gaussian, and the only properties of the kicks that survive the limit are the mean and the variance. The $\sqrt{n}$ scaling is the right zoom: the sum $\sum_i X_i$ grows as $n\mu$, its standard deviation grows as $\sigma\sqrt{n}$, and standardizing leaves a quantity of order $1$ whose distribution converges.

Proof Sketch

Let $Y_i = (X_i - \mu)/\sigma$, so the $Y_i$ are iid with mean $0$ and variance $1$. Set $Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n Y_i$. The characteristic function of $Y_i$ admits the Taylor expansion

$$\varphi_Y(t) = 1 - \frac{t^2}{2} + o(t^2).$$

By independence, $\varphi_{Z_n}(t) = [\varphi_Y(t/\sqrt{n})]^n$, so

$$\varphi_{Z_n}(t) = \left[1 - \frac{t^2}{2n} + o(t^2/n)\right]^n \to e^{-t^2/2}.$$

The limit is the characteristic function of $\mathcal{N}(0,1)$. Apply Lévy's continuity theorem to conclude $Z_n \xrightarrow{d} \mathcal{N}(0,1)$.
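The convergence $\varphi_{Z_n}(t) \to e^{-t^2/2}$ can also be checked numerically. A small Monte Carlo sketch (Exp(1) summands and the grid of $t$ values are assumed for illustration):

```python
# Monte Carlo check that the characteristic function of the standardized sum
# converges pointwise to exp(-t^2/2). Exp(1) summands are an assumed example.
import numpy as np

rng = np.random.default_rng(1)
t_grid = np.array([0.5, 1.0, 2.0])

for n in (5, 50, 500):
    # Z_n = sqrt(n) * (Xbar_n - mu) / sigma, with mu = sigma = 1 for Exp(1).
    z = np.sqrt(n) * (rng.exponential(1.0, size=(20_000, n)).mean(axis=1) - 1.0)
    phi = np.exp(1j * t_grid[:, None] * z).mean(axis=1)   # estimate of E[exp(i t Z_n)]
    err = np.abs(phi - np.exp(-t_grid**2 / 2))
    print(f"n={n:>3}  |phi_Zn(t) - exp(-t^2/2)| at t={t_grid.tolist()}: {np.round(err, 3)}")
```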

Why It Matters

The classical CLT is the single most-used theorem in applied statistics. Every Wald confidence interval, $z$-test, and plug-in standard error invokes it. In ML theory it is the asymptotic half of every generalization argument that uses a normal approximation in place of a tail bound, and it is the result behind the $1/\sqrt{n}$ baseline rate that concentration inequalities try to match non-asymptotically.

Failure Mode

Three failure modes worth naming. (1) Infinite variance: stable distributions with index $\alpha < 2$ (Cauchy, fat-tailed financial returns) have $S_n/n^{1/\alpha}$ converging to a non-Gaussian stable limit. (2) Strong dependence: long-range-dependent processes can have CLT-like statements with non-Gaussian limits or non-$\sqrt{n}$ rates. (3) High dimension: the iid CLT is one-dimensional; operator-norm CLTs for sample covariances in dimensions $d \asymp n$ require additional structure and the limits are not Gaussian.
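Failure mode (1) is easy to see numerically. A sketch with standard Cauchy data (an assumed example of an $\alpha = 1$ stable law): the sample mean never tightens, no matter how large $n$ gets.

```python
# Infinite-variance failure: for Cauchy data the sample mean is itself Cauchy for
# every n, so averaging never concentrates and no sqrt(n) Gaussian limit exists.
import numpy as np

rng = np.random.default_rng(2)

def mean_draws(n, trials=1_000):
    return np.array([rng.standard_cauchy(n).mean() for _ in range(trials)])

for n in (10, 1_000, 100_000):
    q25, q75 = np.percentile(mean_draws(n), [25, 75])
    # The IQR stays near 2 (the Cauchy IQR); a finite-variance source would shrink ~ 1/sqrt(n).
    print(f"n={n:>7}  IQR of the sample mean = {q75 - q25:.2f}")
```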

Lindeberg-Feller CLT

The classical CLT assumes identical distribution. This is too restrictive for many applications: in regression with deterministic covariates, the score terms $x_i \epsilon_i$ have variances that depend on $i$; in stratified sampling, summands come from different populations; in neural-network analyses, layerwise contributions to a loss may be heterogeneous. The Lindeberg-Feller theorem replaces "iid" with a condition that controls the relative size of the largest summand.

Theorem

Lindeberg-Feller CLT

Statement

Let $\{X_{n,i}\}$ be a triangular array with $k_n$ independent random variables in row $n$, with $\mathbb{E}[X_{n,i}] = 0$ and $\sigma_{n,i}^2 = \mathrm{Var}(X_{n,i}) < \infty$. Set $s_n^2 = \sum_{i=1}^{k_n} \sigma_{n,i}^2$. The Lindeberg condition is

$$L_n(\epsilon) := \frac{1}{s_n^2} \sum_{i=1}^{k_n} \mathbb{E}\!\left[X_{n,i}^2 \, \mathbf{1}\{|X_{n,i}| > \epsilon\, s_n\}\right] \to 0 \quad \text{for every } \epsilon > 0.$$

If $L_n(\epsilon) \to 0$ for every $\epsilon > 0$, then

$$\frac{1}{s_n} \sum_{i=1}^{k_n} X_{n,i} \xrightarrow{d} \mathcal{N}(0, 1).$$

The condition is essentially necessary: under the uniform asymptotic negligibility condition $\max_i \sigma_{n,i}^2 / s_n^2 \to 0$, the Lindeberg condition is equivalent to the CLT holding together with uniform negligibility.

Intuition

$L_n(\epsilon)$ is the fraction of the total variance carried by the "big" parts of the summands (the parts above $\epsilon\, s_n$ in absolute value). Lindeberg says: at every threshold $\epsilon$, the big parts contribute a vanishing share of the variance. No single summand, and no constant fraction of summands, is allowed to dominate.

The Lindeberg condition is harder to verify directly than the simpler Lyapunov condition: there exists $\delta > 0$ such that

$$\frac{1}{s_n^{2+\delta}} \sum_{i=1}^{k_n} \mathbb{E}\!\left[|X_{n,i}|^{2+\delta}\right] \to 0.$$

Lyapunov implies Lindeberg by Markov's inequality, and it is the form most often used in applied work because $(2+\delta)$-th moments are easier to bound than truncated second moments.

Proof Sketch

Lindeberg's replacement method. Replace the $X_{n,i}$ one at a time with independent Gaussians $G_{n,i} \sim \mathcal{N}(0, \sigma_{n,i}^2)$ and control the change in $\mathbb{E}[f(\text{sum})]$ for smooth bounded $f$ via a third-order Taylor expansion. The remainder is bounded in terms of $L_n(\epsilon) + \epsilon$ for a free $\epsilon > 0$; let $n \to \infty$ and then $\epsilon \to 0$. After full replacement the sum is exactly $\mathcal{N}(0, s_n^2)$, and dividing by $s_n$ standardizes it to $\mathcal{N}(0,1)$. See Billingsley, Probability and Measure, Section 27, or Durrett, Probability: Theory and Examples, Section 3.4.

Why It Matters

This is the right form of the CLT for theory. Whenever you read "by the CLT" in a learning-theory or asymptotic-statistics paper that does not have iid data, the unstated assumption is almost always Lindeberg or Lyapunov on a triangular array. Specific places it shows up:

  • OLS asymptotics with non-random covariates: the score sum $\sum_i x_i \epsilon_i$ has independent but not identically distributed terms. Lyapunov on $\mathbb{E}|\epsilon_i|^{2+\delta}$ plus a balanced design ($\max_i \|x_i\|^2 / \sum_i \|x_i\|^2 \to 0$) gives asymptotic normality of $\hat\beta$.
  • Influence-function-based asymptotics: in semiparametric efficiency theory, estimators are linearized as sample averages of an influence function, but with finite-sample plug-in errors that make the summands non-iid. Lindeberg controls the error term.
  • Non-iid generalization analyses: when training data come from heterogeneous sources or active-learning streams, Lindeberg-style conditions are what permit Gaussian limits of empirical risks.

Failure Mode

The Lindeberg condition fails when one or a few summands carry a non-vanishing share of the variance. Canonical example: $X_{n,n} = n$ with probability $1/n^2$ and $0$ otherwise, with $X_{n,i}$ tiny for $i < n$, makes $\sigma_{n,n}^2 \approx 1$ a constant fraction of $s_n^2$. The sum does not converge to a Gaussian.

A subtle case: with $X_{n,i} = \pm c$ each with probability $1/2$ for a fixed constant $c$, the Lindeberg condition holds; but with $X_{n,1} = \pm\sqrt{n}$ and $X_{n,i}$ tiny for $i \geq 2$, the variance budget is dominated by a single Bernoulli-like term. The standardized sum stays approximately $\pm 1$ with probability $1/2$ each, not Gaussian.
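The dominant-summand failure can be checked numerically. A sketch with assumed concrete values (the "tiny" summands are taken to be $\pm 1$, so the big term carries about half the variance budget rather than nearly all of it):

```python
# Lindeberg sum L_n(eps) for two triangular arrays of +-magnitude summands:
#   healthy:   X_{n,i} = +-1 for all i
#   dominated: X_{n,1} = +-sqrt(n), X_{n,i} = +-1 for i >= 2 (one term holds ~half the variance)
import numpy as np

def lindeberg_sum(magnitudes, eps):
    """L_n(eps) when summand i is +-magnitudes[i] with prob 1/2 (variance = magnitude^2)."""
    variances = magnitudes**2
    s_n = np.sqrt(variances.sum())
    big = magnitudes > eps * s_n          # summands whose whole mass exceeds eps * s_n
    return variances[big].sum() / variances.sum()

eps = 0.5
for n in (10, 100, 10_000):
    healthy = np.ones(n)
    dominated = np.concatenate(([np.sqrt(n)], np.ones(n - 1)))
    print(f"n={n:>6}  healthy L_n({eps}) = {lindeberg_sum(healthy, eps):.3f}   "
          f"dominated L_n({eps}) = {lindeberg_sum(dominated, eps):.3f}")
```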

Berry-Esseen: Rate of Convergence

The CLT says $Z_n \to \mathcal{N}(0,1)$ but does not say how large $n$ must be. Berry-Esseen quantifies the gap.

Theorem

Berry-Esseen Theorem

Statement

If $X_1, \ldots, X_n$ are iid with $\mathbb{E}[X_i] = 0$, $\mathrm{Var}(X_i) = \sigma^2$, and $\mathbb{E}[|X_i|^3] = \rho < \infty$, then

$$\sup_t \left| \Pr\!\left[\frac{\bar{X}_n}{\sigma/\sqrt{n}} \leq t\right] - \Phi(t) \right| \leq \frac{C\,\rho}{\sigma^3 \sqrt{n}},$$

where $\Phi$ is the standard-normal CDF and $C$ is a universal constant. The best known upper bound is $C \leq 0.4748$ (Shevtsova, 2011); a lower bound of $C \geq 0.4097$ (Esseen, 1956) shows the optimal $C$ lies in this narrow range.

The CLT rate of convergence is therefore $O(1/\sqrt{n})$, with the constant set by the third absolute moment of the centered summand.

Intuition

The bound has two factors. $1/\sqrt{n}$ is the universal rate. The non-Gaussianity factor $\rho/\sigma^3$ measures how skewed and heavy-tailed the centered $X$ is, relative to a Gaussian baseline. A $\mathcal{N}(0,1)$ variable has $\rho = \mathbb{E}|Z|^3 = \sqrt{8/\pi} \approx 1.595$ and $\sigma^3 = 1$, so the ratio is $\approx 1.6$. Heavier tails or larger skewness blow the constant up, and you need more samples before the Gaussian approximation lands.

Proof Sketch

The proof smooths the indicator $\mathbf{1}\{x \leq t\}$ against a kernel of bandwidth $1/T$, giving the Esseen smoothing inequality

$$|F(t) - \Phi(t)| \leq \frac{1}{\pi}\int_{-T}^{T} \frac{|\varphi_F(s) - e^{-s^2/2}|}{|s|}\,ds + \frac{c}{T}.$$

Bound $|\varphi_{Z_n}(s) - e^{-s^2/2}|$ by a third-order Taylor remainder of the per-summand characteristic function at the standardized scale, then optimize $T$ to balance the smoothing error against the truncation error; this gives the $1/\sqrt{n}$ rate.

Why It Matters

Berry-Esseen tells you when the CLT approximation is good enough to trust. For centered Bernoulli$(1/2)$ summands: $\rho = 1/8$, $\sigma^3 = 1/8$, so $\rho/\sigma^3 = 1$ and the bound is $0.4748/\sqrt{n}$. At $n = 100$ the worst-case CDF error is $\leq 0.0475$; at $n = 10\,000$ it is $\leq 0.00475$.

Practical consequence: for CLT-based confidence intervals to be reliable to within $1\%$ uniformly across the CDF, you need $n \sim 10^4$ in the worst case. This is the back-of-envelope number behind the usual advice that CLT-based critical values are fine for $n \geq 100$ with roughly symmetric data and need $n \geq 1000$ for skewed data.
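A quick check of how conservative the bound is in the Bernoulli$(1/2)$ case (the exact binomial CDF comes from scipy; the sample sizes are illustrative):

```python
# Berry-Esseen bound vs. the actual worst-case CDF error for Bernoulli(1/2) sums.
import numpy as np
from scipy import stats

def max_cdf_error(n, p=0.5):
    """sup_t |P(Z_n <= t) - Phi(t)|, evaluated at the jump points of the binomial CDF."""
    k = np.arange(n + 1)
    t = (k - n * p) / np.sqrt(n * p * (1 - p))
    gauss = stats.norm.cdf(t)
    right = stats.binom.cdf(k, n, p)       # CDF value just after the jump at k
    left = stats.binom.cdf(k - 1, n, p)    # CDF value just before the jump at k
    return max(np.abs(right - gauss).max(), np.abs(left - gauss).max())

rho_over_sigma3 = 1.0                      # (1/8) / (1/8) for centered Bernoulli(1/2)
for n in (100, 1_000, 10_000):
    bound = 0.4748 * rho_over_sigma3 / np.sqrt(n)
    print(f"n={n:>6}  bound = {bound:.5f}   actual sup error = {max_cdf_error(n):.5f}")
```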

In ML theory, Berry-Esseen is the right reference point when comparing CLT-based confidence intervals against concentration inequalities. For sub-Gaussian data, Hoeffding gives a non-asymptotic Gaussian-type tail at every $n$; Berry-Esseen tells you that the asymptotic statement is accurate to $O(1/\sqrt{n})$, which can be stronger or weaker than Hoeffding's constant depending on the regime.

Failure Mode

Berry-Esseen requires a finite third moment. Without it the rate can be slower than $1/\sqrt{n}$; under finite $(2+\delta)$-th moments only, the rate degrades to $1/n^{\delta/2}$. The bound is uniform over $t$, so it controls the central CDF well but is loose deep in the tails: at $t$ where $\Phi(t)$ itself is exponentially small, the relative error of the Gaussian approximation can be much larger than the absolute Berry-Esseen budget suggests. Edgeworth expansions give sharper approximations inside the bulk, and the Cramér large-deviation regime gives sharper tail rates under exponential-moment conditions.

Multivariate CLT

Theorem

Multivariate CLT

Statement

If $X_1, \ldots, X_n \in \mathbb{R}^d$ are iid with mean $\mu$ and covariance matrix $\Sigma$, then

$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}_d(0, \Sigma).$$

The full covariance structure is preserved in the limit: the joint distribution of the standardized sample-mean vector converges to a multivariate Gaussian with the same covariance as a single $X_i$.

Intuition

The Cramér-Wold device reduces multivariate weak convergence to weak convergence of all one-dimensional projections. For every $a \in \mathbb{R}^d$, the projected sequence $a^\top X_i$ is one-dimensional iid with variance $a^\top \Sigma a$, and the classical CLT gives $a^\top \sqrt{n}(\bar X_n - \mu) \xrightarrow{d} \mathcal{N}(0, a^\top \Sigma a)$. Since the collection of one-dimensional projections characterizes a multivariate Gaussian uniquely, the joint limit is $\mathcal{N}_d(0, \Sigma)$.

Proof Sketch

Cramér-Wold theorem: $X_n \xrightarrow{d} X$ in $\mathbb{R}^d$ if and only if $a^\top X_n \xrightarrow{d} a^\top X$ for every $a \in \mathbb{R}^d$. Each projection is a univariate iid CLT with limit $\mathcal{N}(0, a^\top \Sigma a)$, and these limits agree with the projections of $\mathcal{N}_d(0, \Sigma)$. Conclude joint convergence.

Why It Matters

The multivariate CLT is the workhorse for vector-valued estimators. The MLE in a $d$-dimensional regular parametric family is asymptotically $\mathcal{N}_d(\theta_0, I(\theta_0)^{-1}/n)$, which comes from the multivariate CLT applied to the $d$-vector score $\nabla_\theta \log p(X; \theta_0)$. OLS asymptotic normality, GMM asymptotics, and the joint distribution of estimated regression coefficients all follow the same template: write the estimator as a sum of iid random vectors plus a negligible remainder, then apply the multivariate CLT.

The covariance $\Sigma$ matters: scalar tests of single coefficients use the diagonal entries, while joint tests (chi-squared statistics, F-tests) use the inverse of $\Sigma$ to whiten the vector. Misreporting $\Sigma$ by ignoring off-diagonal correlations is the most common practical error in multivariate inference.

Failure Mode

A singular $\Sigma$ collapses the limit to a degenerate Gaussian on a lower-dimensional subspace. The marginal CLT still holds in each direction, but inverting $\Sigma$ in a Wald or Mahalanobis statistic requires care. In high dimensions ($d$ growing with $n$), the multivariate CLT in operator norm fails outright: the sample covariance $\hat\Sigma_n$ has eigenvalues that do not converge to those of $\Sigma$ unless $d/n \to 0$. The Marchenko-Pastur law replaces the multivariate CLT in the regime $d/n \to c \in (0, \infty)$.

Delta Method: CLT Through Smooth Functions

The delta method is what lets you propagate a CLT through a smooth nonlinear transformation. It is the second-most-used theorem in applied statistics after the CLT itself, and it is the bridge from "CLT for the sample mean" to "CLT for any smooth functional of moments". The full treatment with regularity conditions and second-order versions lives in asymptotic statistics; the result is restated here because most applications of the CLT in practice immediately chain into a delta-method step.

Theorem

Delta Method

Statement

Let $T_n \in \mathbb{R}^k$ satisfy $\sqrt{n}(T_n - \mu) \xrightarrow{d} \mathcal{N}_k(0, \Sigma)$, and let $g : \mathbb{R}^k \to \mathbb{R}^m$ be differentiable at $\mu$ with Jacobian $G(\mu) = \nabla g(\mu) \in \mathbb{R}^{m \times k}$. Then

$$\sqrt{n}(g(T_n) - g(\mu)) \xrightarrow{d} \mathcal{N}_m\!\left(0,\; G(\mu)\, \Sigma\, G(\mu)^\top\right).$$

In the scalar case with $g'(\mu) \neq 0$, the asymptotic variance is $[g'(\mu)]^2 \sigma^2$. When $g'(\mu) = 0$, the second-order delta method gives $n(g(T_n) - g(\mu)) \xrightarrow{d} \tfrac{1}{2} g''(\mu)\, \sigma^2\, \chi^2_1$.

Intuition

A first-order Taylor expansion of $g$ at $\mu$ gives $g(T_n) - g(\mu) = G(\mu)(T_n - \mu) + R_n$ with remainder $R_n = o_p(\|T_n - \mu\|) = o_p(1/\sqrt{n})$. Multiplying by $\sqrt{n}$ gives $\sqrt{n}(g(T_n) - g(\mu)) = G(\mu)\, \sqrt{n}(T_n - \mu) + o_p(1)$, and the leading term pushes the Gaussian forward through the linear map $G(\mu)$.

Proof Sketch

Apply the multivariate Taylor theorem to $g$ at $\mu$. The remainder satisfies $\|R_n\| / \|T_n - \mu\| \to 0$ in probability, which combined with the $\sqrt{n}$-rate hypothesis on $T_n$ gives $\sqrt{n}\, R_n \xrightarrow{p} 0$. Slutsky's theorem converts the joint limit of $(\sqrt{n}(T_n - \mu), \sqrt{n} R_n)$ to the conclusion.

Why It Matters

The delta method is how every asymptotic standard error in applied statistics is computed beyond simple sample means.

  • Asymptotic variance of $\log \hat p$ for $\hat p = \bar X_n$, $X_i \sim \mathrm{Bern}(p)$: $g(p) = \log p$, $g'(p) = 1/p$, asymptotic variance $(1/p)^2 \cdot p(1-p) = (1-p)/p$.
  • Variance of the log-odds $\log(\hat p/(1-\hat p))$: $g'(p) = 1/(p(1-p))$, asymptotic variance $1/(p(1-p))$. This is the basis of logit-scale Wald intervals, which often have better coverage than untransformed Wald intervals (a numeric check follows this list).
  • Variance of a ratio $\hat\mu_1/\hat\mu_2$: a standard delta-method exercise giving the variance of the ratio in terms of the variances and covariance of the components.
  • Sandwich variance for misspecified MLE: the delta method through the score-equation expansion produces the $V^{-1} A V^{-\top}$ sandwich form, with $V$ the expected score derivative and $A$ the score covariance.
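A numeric check of the log-odds entry above, under assumed values $p = 0.3$ and $n = 500$ (a sketch, not a prescription):

```python
# Delta-method standard error of the log-odds vs. a direct simulation.
# Assumed values: p = 0.3, n = 500.
import numpy as np

rng = np.random.default_rng(3)
p, n, trials = 0.3, 500, 20_000

p_hat = rng.binomial(n, p, size=trials) / n
p_hat = np.clip(p_hat, 1e-6, 1 - 1e-6)          # guard against log(0) in extreme trials
log_odds = np.log(p_hat / (1 - p_hat))

# g(p) = log(p/(1-p)), g'(p) = 1/(p(1-p)); asymptotic variance of g(p_hat) is
# [g'(p)]^2 * p(1-p)/n = 1/(n p (1-p)).
delta_sd = np.sqrt(1.0 / (n * p * (1 - p)))
print(f"delta-method sd = {delta_sd:.4f}   simulated sd = {log_odds.std():.4f}")
```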

Failure Mode

Three failure modes. (1) $g'(\mu) = 0$: use the second-order delta method; the rate jumps from $\sqrt{n}$ to $n$ and the limit is a scaled $\chi^2_1$, not normal. (2) $g$ not differentiable at $\mu$: e.g. $g(x) = |x|$ at $\mu = 0$ gives a folded-normal limit. (3) $T_n$ converges at a rate other than $\sqrt{n}$ (e.g. extreme-value problems with rate $n$): the delta method must use the actual rate, not $\sqrt{n}$.

Martingale CLT: Beyond Independence

The classical CLT requires independence. Many sequences in machine learning are not independent but are nearly so in a conditional sense: for SGD, the stochastic gradient at step $t$ depends on all past iterates, yet it is mean-zero conditional on them once centered at the population gradient. The right tool is the martingale CLT, which replaces "iid" with "martingale difference plus conditional Lindeberg".

Theorem

Martingale CLT (Brown 1971 / McLeish 1974)

Statement

Let $\{X_t\}_{t=1}^n$ be a martingale difference sequence with respect to a filtration $\{\mathcal{F}_t\}$, meaning $\mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = 0$ and $\mathbb{E}[X_t^2] < \infty$ for every $t$. Suppose

$$V_n := \frac{1}{n} \sum_{t=1}^n \mathbb{E}\!\left[X_t^2 \mid \mathcal{F}_{t-1}\right] \xrightarrow{p} \sigma^2 > 0,$$

and the conditional Lindeberg condition holds: for every $\epsilon > 0$,

$$\frac{1}{n} \sum_{t=1}^n \mathbb{E}\!\left[X_t^2 \, \mathbf{1}\{|X_t| > \epsilon \sqrt{n}\} \,\Big|\, \mathcal{F}_{t-1}\right] \xrightarrow{p} 0.$$

Then

$$\frac{1}{\sqrt{n}} \sum_{t=1}^n X_t \xrightarrow{d} \mathcal{N}(0, \sigma^2).$$

The conditional variance condition replaces "identically distributed", the conditional Lindeberg replaces the unconditional Lindeberg, and the martingale-difference property replaces independence.

Intuition

A martingale difference is "locally mean-zero": each term has mean zero given the past, even though it can depend on the past in arbitrary nonlinear ways. The conditional variance condition fixes a long-run average level of conditional variance; the conditional Lindeberg condition forbids any single term from dominating the variance budget at the scale $\sqrt{n}$. Together these are exactly the right replacements for iid: predictable second moments stand in for identical distribution, and conditional negligibility stands in for genuine independence.

Proof Sketch

McLeish's proof uses the auxiliary product $\prod_t (1 + i t X_t / \sqrt{n})$ and shows it is close in $L^1$ to $e^{i t S_n / \sqrt{n}}$ under the stated conditions, which reduces the problem to controlling the conditional second moment of the truncated log-product. The conditional Lindeberg condition makes the truncation error $o_p(1)$; the conditional variance condition pins down the deterministic limit. See Hall and Heyde, Martingale Limit Theory and Its Application (1980), Theorem 3.2, or Durrett, Probability: Theory and Examples, Theorem 4.3.

Lyapunov-type substitute: if the stronger condition holds that $\mathbb{E}[|X_t|^{2+\delta} \mid \mathcal{F}_{t-1}]$ is uniformly bounded for some $\delta > 0$, the conditional Lindeberg condition follows by Markov's inequality (the same logic as in the iid case).

Why It Matters

The martingale CLT is the asymptotic backbone of modern stochastic-process theory and underwrites several core ML applications.

  • SGD as approximate Gaussian noise: at step $t$ the stochastic gradient $g_t = \nabla \ell(\theta_t; \xi_t)$ for a random batch $\xi_t$ has $\mathbb{E}[g_t \mid \mathcal{F}_{t-1}] = \nabla L(\theta_t)$, where $L$ is the population loss. The noise $\eta_t = g_t - \nabla L(\theta_t)$ is a martingale difference, and under bounded-variance assumptions the martingale CLT gives $\sum_t \eta_t / \sqrt{n} \to \mathcal{N}(0, \Sigma_{\text{noise}})$. This is what licenses the SGD-as-SDE approximation $d\theta_t = -\nabla L(\theta_t)\, dt + \sqrt{\eta\, \Sigma_{\text{noise}}}\, dW_t$ used in stochastic-modified-equation analyses (Li, Tai, E, 2017).
  • Time-series asymptotics: an AR$(1)$ process $X_t = \rho X_{t-1} + \epsilon_t$ with $|\rho| < 1$ has scaled partial sums whose Gaussian limit comes from a martingale CLT applied to the innovation sequence $\epsilon_t$.
  • Online learning regret bounds: the regret of an online algorithm is a sum of martingale differences (current loss minus its conditional expectation given the past). Martingale CLT gives the asymptotic Gaussian distribution of regret around its mean, which is the basis of asymptotic confidence intervals for online estimators.
  • Stochastic-approximation iterates: Robbins-Monro and Polyak-Ruppert averaging produce iterates whose asymptotic normality $\sqrt{n}(\bar\theta_n - \theta^*) \xrightarrow{d} \mathcal{N}(0, \Sigma)$ is a martingale CLT statement about the noise sequence at the stationary point.

Failure Mode

The martingale CLT can fail when the conditional variance does not stabilize. Heteroscedastic SGD with a non-stationary step size can have $V_n$ that does not converge to a deterministic limit, and the conclusion has to be modified: $S_n / \sqrt{n V_n}$ converges to $\mathcal{N}(0,1)$ in the sense of stable convergence, with the limiting variance itself random. This is the LAMN (locally asymptotically mixed normal) regime, and it shows up in unit-root time series and in some non-stationary stochastic-approximation problems.

Strong dependence breaks the conclusion outright: if the conditional variance $\mathbb{E}[X_t^2 \mid \mathcal{F}_{t-1}]$ has long-range dependence, the limit can be fractional Brownian motion at a non-$\sqrt{n}$ rate, not Gaussian.

Applications in Machine Learning

The CLT is invoked dozens of times across ML theory, sometimes silently. This section makes the key invocations explicit, with the precise form of the CLT used in each.

Asymptotic normality of MLE

For a regular parametric family with score $s(X; \theta) = \nabla_\theta \log p(X; \theta)$ and Fisher information $I(\theta_0) = \mathbb{E}_{\theta_0}[s\, s^\top]$, the MLE satisfies the score-equation expansion $\sqrt{n}(\hat\theta_n - \theta_0) = I(\theta_0)^{-1} \cdot \frac{1}{\sqrt{n}} \sum_i s(X_i; \theta_0) + o_p(1)$. The leading term is a multivariate iid CLT applied to the mean-zero score vectors, giving $\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal{N}_d(0, I(\theta_0)^{-1})$. The full derivation, including the regularity conditions and the extension to misspecified models with sandwich variance, is covered under asymptotic statistics.

SGD noise: martingale CLT in action

Stochastic gradient descent with minibatch size $b$ at iterate $\theta_t$ takes the step $\theta_{t+1} = \theta_t - \eta\, g_t$, where $g_t$ is the empirical gradient on a random minibatch. Decompose the noise as $\eta_t = g_t - \nabla L(\theta_t)$, where $L$ is the population loss. Under bounded fourth moments and a stationary distribution of iterates, the martingale CLT gives the rescaled cumulative noise $\frac{1}{\sqrt{n}} \sum_t \eta_t \xrightarrow{d} \mathcal{N}_d(0, \Sigma(\theta^*))$, where $\Sigma(\theta^*)$ is the per-sample gradient covariance at the optimum (divided by the batch size $b$). This is the asymptotic-Gaussian noise approximation that underpins the SGD-as-SDE diffusion analysis: at stationarity, SGD is approximately a discretization of the Langevin SDE $d\theta_t = -\nabla L(\theta_t)\, dt + \sqrt{\eta\, \Sigma(\theta_t)}\, dW_t$.
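A toy illustration of the martingale structure, under an assumed one-dimensional least-squares objective $L(\theta) = \tfrac12\,\mathbb{E}[(\xi\theta - 1)^2]$ with $\xi \sim \mathrm{Exp}(1)$, so the per-step noise genuinely depends on the current iterate and is not iid; the step size, horizon, and trial counts are arbitrary:

```python
# SGD noise as a martingale difference: the per-step noise depends on theta_t (the past),
# yet the rescaled cumulative noise still looks Gaussian, as the martingale CLT predicts.
# Assumed toy problem: L(theta) = 0.5 * E[(xi*theta - 1)^2], xi ~ Exp(1), so that
# E[xi] = 1, E[xi^2] = 2 and grad L(theta) = 2*theta - 1.
import numpy as np

rng = np.random.default_rng(4)

def rescaled_cumulative_noise(n_steps=4_000, lr=0.01):
    theta, total = 0.0, 0.0
    for _ in range(n_steps):
        xi = rng.exponential(1.0)
        g = xi * (xi * theta - 1.0)            # per-sample stochastic gradient
        total += g - (2.0 * theta - 1.0)       # martingale-difference noise eta_t
        theta -= lr * g
    return total / np.sqrt(n_steps)

draws = np.array([rescaled_cumulative_noise() for _ in range(400)])
z = (draws - draws.mean()) / draws.std()
# For a Gaussian limit, skewness and excess kurtosis of the draws should be near zero.
print(f"mean={draws.mean():+.3f}  sd={draws.std():.3f}  "
      f"skew={np.mean(z**3):+.2f}  excess kurtosis={np.mean(z**4) - 3:+.2f}")
```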

The non-asymptotic version of the same statement is a Hoeffding or Bernstein-type concentration bound on the cumulative noise; the martingale CLT is the asymptotic counterpart that gives the precise covariance.

Bootstrap validity via the empirical-process CLT

The bootstrap works because resampling from $\hat F_n$ approximates sampling from $F$, and the rigorous justification is a functional CLT. Donsker's theorem states that the empirical process $\sqrt{n}(\hat F_n - F)$ converges weakly to a Brownian bridge $B \circ F$ in $\ell^\infty([-\infty, \infty])$. The Bickel-Freedman (1981) bootstrap theorem applies the same machinery in the bootstrap world conditional on the data, giving conditional weak convergence of $\sqrt{n}(\hat F_n^* - \hat F_n)$ to the same Brownian bridge with probability $1$. Smooth functionals of $\hat F_n$ inherit asymptotic normality from this, and the bootstrap distribution of a smooth functional matches the true sampling distribution up to the same $O(1/\sqrt{n})$ Edgeworth-corrected rate.
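A small sketch of the consistency statement: the bootstrap distribution of $\sqrt{n}(\bar X_n^* - \bar X_n)$, built from one observed sample, against the true sampling distribution of $\sqrt{n}(\bar X_n - \mu)$ estimated by Monte Carlo (Exp(1) data and $n = 200$ are assumed values).

```python
# Bootstrap distribution of the sample mean vs. the true sampling distribution.
# Assumed setup: Exp(1) data, n = 200, B = 2000 bootstrap resamples.
import numpy as np

rng = np.random.default_rng(5)
n, B, trials = 200, 2_000, 2_000

# True sampling distribution of sqrt(n)*(mean - mu), estimated by Monte Carlo.
true_draws = np.sqrt(n) * (rng.exponential(1.0, size=(trials, n)).mean(axis=1) - 1.0)

# Bootstrap world: one observed sample x, resample with replacement from F_n-hat.
x = rng.exponential(1.0, size=n)
boot_means = rng.choice(x, size=(B, n), replace=True).mean(axis=1)
boot_draws = np.sqrt(n) * (boot_means - x.mean())

for q in (0.05, 0.50, 0.95):
    print(f"q={q:.2f}  true quantile = {np.quantile(true_draws, q):+.3f}   "
          f"bootstrap quantile = {np.quantile(boot_draws, q):+.3f}")
```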

Sample covariance and where the CLT fails in high dimension

For iid $X_i \in \mathbb{R}^d$ with $\mathbb{E}[X_i] = 0$ and finite fourth moments, the sample covariance $\hat\Sigma_n = \frac{1}{n} \sum_i X_i X_i^\top$ satisfies an entrywise CLT: $\sqrt{n}\,(\hat\Sigma_n - \Sigma)_{jk} \xrightarrow{d} \mathcal{N}(0, V_{jk})$, where $V_{jk}$ involves the fourth moments of the corresponding components. With $d$ fixed and $n \to \infty$, the operator-norm error $\|\hat\Sigma_n - \Sigma\|_{\text{op}}$ is $O_p(\sqrt{d/n})$.

In the high-dimensional regime $d/n \to c > 0$, the operator-norm error does not vanish: empirical eigenvalues are biased in a deterministic way described by the Marchenko-Pastur distribution, and the entrywise CLT for $\hat\Sigma_n$ does not assemble into a CLT for the spectrum or for matrix-valued functionals like $\hat\Sigma_n^{-1}$. This is why covariance-based methods (PCA, GLS, Wald tests) need shrinkage or regularization in high dimensions: the multivariate CLT gives no operator-norm guarantee.
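A quick illustration of the two regimes, under an assumed identity covariance (so every population eigenvalue is exactly 1): with $d \ll n$ the empirical spectrum hugs 1, while with $d = n$ it spreads across the Marchenko-Pastur support even though each entry of $\hat\Sigma_n$ obeys its own CLT.

```python
# Sample-covariance eigenvalue spreading: fine for d << n, badly biased for d ~ n.
# Population covariance is the identity, so all true eigenvalues equal 1.
import numpy as np

rng = np.random.default_rng(6)
n = 1_000
for d in (10, 200, 1_000):
    X = rng.standard_normal((n, d))
    eig = np.linalg.eigvalsh(X.T @ X / n)
    c = d / n
    # Marchenko-Pastur support: [(1 - sqrt(c))^2, (1 + sqrt(c))^2].
    print(f"d/n={c:.2f}  empirical eig range = [{eig.min():.2f}, {eig.max():.2f}]   "
          f"MP support = [{(1 - np.sqrt(c))**2:.2f}, {(1 + np.sqrt(c))**2:.2f}]")
```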

Asymptotic normality of empirical risk

For a fixed hypothesis $h$, the empirical risk $\hat R_n(h) = \frac{1}{n} \sum_i \ell(h(X_i), Y_i)$ is the sample mean of iid losses with mean $R(h)$ and variance $\sigma^2(h) = \mathrm{Var}(\ell(h(X), Y))$. The classical CLT gives $\sqrt{n}(\hat R_n(h) - R(h)) \xrightarrow{d} \mathcal{N}(0, \sigma^2(h))$, which justifies the standard variance-based confidence interval for held-out test loss.

This is pointwise in $h$: the CLT does not give a Gaussian limit for the worst-case empirical-risk gap $\sup_h |\hat R_n(h) - R(h)|$. Uniform versions of the CLT (Donsker classes, Rademacher complexity) are what control the generalization gap.

Canonical Examples

Example

Sample mean of Exponential variables

Let $X_i \sim \mathrm{Exp}(1)$ with $\mu = 1$, $\sigma^2 = 1$, and $\rho = \mathbb{E}|X_i - 1|^3$. The classical CLT gives $\sqrt{n}(\bar X_n - 1) \xrightarrow{d} \mathcal{N}(0, 1)$, so $\bar X_{30} \approx \mathcal{N}(1, 1/30)$.

A standard 95% Wald confidence interval is $\bar X_{30} \pm 1.96/\sqrt{30} \approx \bar X_{30} \pm 0.358$.

The Exponential is skewed (skewness $= 2$), and its third absolute central moment gives $\rho/\sigma^3 \approx 2.4$, so the Berry-Esseen bound on the CDF error is $\approx 0.21$ at $n = 30$, which is large. By $n = 100$ the bound drops to $\approx 0.11$, and by $n = 1000$ to $\approx 0.04$. Symmetrizing transformations (e.g. the Wilson-Hilferty cube root for $\chi^2$, or the log for Exponential rates) reduce the constant and give better finite-sample coverage.
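The coverage consequence is easy to check by simulation (a sketch treating $\sigma = 1$ as known; trial counts are arbitrary):

```python
# Empirical coverage of the 95% Wald interval for the mean of Exp(1) data.
# sigma = 1 is treated as known; the skewness costs coverage at small n.
import numpy as np

rng = np.random.default_rng(7)
trials, z = 10_000, 1.96
for n in (30, 100, 1_000):
    xbar = rng.exponential(1.0, size=(trials, n)).mean(axis=1)
    covered = np.abs(xbar - 1.0) <= z / np.sqrt(n)
    print(f"n={n:>5}  empirical coverage = {covered.mean():.3f}  (nominal 0.950)")
```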

Example

Normal approximation to the Binomial

If $X_i \sim \mathrm{Bern}(p)$, then $S_n = \sum_i X_i \sim \mathrm{Bin}(n, p)$ and the classical CLT gives

$$\frac{S_n - np}{\sqrt{n p (1-p)}} \xrightarrow{d} \mathcal{N}(0,1).$$

The textbook rule of thumb $np \geq 5$ and $n(1-p) \geq 5$ ensures the Berry-Esseen factor $\rho/\sigma^3 = (p^2 + (1-p)^2)/\sqrt{p(1-p)}$ times $1/\sqrt{n}$ is small enough for two-sided coverage to be roughly correct. For $p = 1/2$, $n = 100$: $S_{100} \approx \mathcal{N}(50, 25)$, and $\Pr[S_{100} \geq 60] \approx \Pr[Z \geq 2] \approx 0.023$ (CLT) versus $\approx 0.028$ exactly; the continuity-corrected approximation $\Pr[Z \geq 1.9] \approx 0.029$ is closer.
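These numbers can be reproduced with scipy's exact binomial CDF (a sketch; the continuity correction is included for contrast):

```python
# Exact binomial tail vs. the CLT approximation for P[S_100 >= 60], p = 1/2.
import numpy as np
from scipy import stats

n, p, k = 100, 0.5, 60
mu, sd = n * p, np.sqrt(n * p * (1 - p))

exact = stats.binom.sf(k - 1, n, p)               # P[S_n >= 60], exact
plain_clt = stats.norm.sf((k - mu) / sd)          # P[Z >= 2.0]
corrected = stats.norm.sf((k - 0.5 - mu) / sd)    # continuity-corrected: P[Z >= 1.9]
print(f"exact = {exact:.4f}   CLT = {plain_clt:.4f}   continuity-corrected = {corrected:.4f}")
```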

[Figure: Binomial(n, p) PMF with the de Moivre–Laplace Normal(np, np(1−p)) overlay]

Bars: exact Binomial PMF. Dashed curve: the Normal(np, np(1−p)) limit predicted by de Moivre–Laplace. The overlay tightens around the bars as n grows, with the worst fit near p = 0 or p = 1 where the distribution is skewed.

Example

Lyapunov over Lindeberg in OLS asymptotics

Take $X_i \in \mathbb{R}^d$ deterministic, $\epsilon_i$ iid mean-zero with $\mathbb{E}|\epsilon_i|^{2+\delta} < \infty$, and the linear model $Y_i = X_i^\top \beta + \epsilon_i$. The score sum $\sum_i X_i \epsilon_i$ has independent but not identically distributed terms (the $X_i$ vary). The Lyapunov condition reduces to a balanced-design condition

$$\frac{\sum_i \|X_i\|^{2+\delta}}{\left(\sum_i \|X_i\|^2\right)^{1+\delta/2}} \to 0,$$

which is satisfied by any reasonable design in which no single covariate vector dominates. Lyapunov implies Lindeberg, and the Lindeberg-Feller CLT gives asymptotic normality of $\hat\beta_{\text{OLS}}$ with the sandwich variance. The Lyapunov condition is the form practitioners verify; the Lindeberg condition is the form theorists prove.
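A numeric check of the balanced-design ratio with $\delta = 1$, for two assumed designs (a benign Gaussian design and one whose first row is inflated to carry about half of the total design mass):

```python
# Lyapunov balanced-design ratio (delta = 1): should vanish for a benign design,
# but stays bounded away from zero when one covariate row dominates.
import numpy as np

rng = np.random.default_rng(8)

def lyapunov_ratio(X, delta=1.0):
    norms = np.linalg.norm(X, axis=1)
    return (norms**(2 + delta)).sum() / (norms**2).sum()**(1 + delta / 2)

for n in (100, 1_000, 10_000):
    benign = rng.standard_normal((n, 3))
    spiked = benign.copy()
    spiked[0] *= np.sqrt(n)                 # one row now carries about half the design mass
    print(f"n={n:>6}  benign ratio = {lyapunov_ratio(benign):.4f}   "
          f"spiked ratio = {lyapunov_ratio(spiked):.4f}")
```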

Common Confusions

Watch Out

The CLT does NOT say individual observations are Gaussian

The CLT says the average $\bar X_n$ is approximately Gaussian for large $n$. Individual observations $X_i$ retain their original distribution: they could be Bernoulli, Exponential, Poisson, or anything else with finite variance. The Gaussian emerges only through averaging. A single coin flip is not Gaussian; the average of a million coin flips is.

Watch Out

The CLT is asymptotic, concentration is non-asymptotic

The CLT says $Z_n \xrightarrow{d} \mathcal{N}(0,1)$ as $n \to \infty$. It does not say the approximation is good for any specific $n$. Concentration inequalities (Hoeffding, Bernstein, sub-Gaussian tails) give finite-sample bounds with explicit constants. Berry-Esseen sits between the two as a finite-$n$ correction to the CLT, but it is uniform over the CDF and can be loose deep in the tails. For tight non-asymptotic tail probabilities, use a concentration inequality, not Berry-Esseen.

Watch Out

Convergence in distribution does not imply convergence in probability

The CLT gives convergence in distribution: $Z_n \xrightarrow{d} Z$ where $Z \sim \mathcal{N}(0,1)$. This does not mean $Z_n - Z \to 0$ in any sense. The random variables $Z_n$ are determined by the data; $Z$ is a "fresh" Gaussian on a possibly different probability space. Convergence in distribution is a statement about the shape of the distribution, not about the values of the random variables.

Watch Out

Finite variance is required, not just finite mean

The LLN requires only a finite mean. The CLT requires a finite variance. For distributions with finite mean but infinite variance (e.g. stable distributions with index $1 < \alpha < 2$), the sample mean still converges (by the LLN) but the CLT does not hold in its standard form. The normalized sum $S_n/n^{1/\alpha}$ converges to a non-Gaussian stable distribution instead.

Watch Out

The Lindeberg condition is not just about iid

The Lindeberg condition was designed for triangular arrays of independent, non-identically distributed variables. For iid sequences with finite variance, Lindeberg holds automatically, so reading "by Lindeberg's CLT" in a paper on iid data is just a longer way of citing the classical CLT. The condition becomes load-bearing only when the summands have heterogeneous variances or grow with $n$, as in regression with deterministic covariates, stratified sampling, and the triangular-array bootstrap.

Watch Out

The martingale CLT does NOT say the noise is iid

SGD noise $\eta_t = g_t - \nabla L(\theta_t)$ is not iid: $\eta_t$ depends on $\theta_t$, which depends on past noise. The martingale CLT allows this dependence so long as the noise is mean-zero conditional on the past and the conditional second moments stabilize. Reading the SGD-as-SDE literature as "SGD noise is iid Gaussian" is a common oversimplification; the precise statement is "the rescaled cumulative noise has a Gaussian limit by the martingale CLT".

Summary

  • Classical (Lindeberg-Levy) CLT: $\sqrt{n}(\bar X_n - \mu)/\sigma \xrightarrow{d} \mathcal{N}(0,1)$ for iid finite-variance variables.
  • Lindeberg-Feller: triangular arrays of independent variables converge to a Gaussian iff the Lindeberg condition holds (essentially: no summand dominates the variance).
  • Lyapunov condition: a $(2+\delta)$-th moment criterion that implies Lindeberg and is easier to verify in practice.
  • Berry-Esseen: CDF error $\leq 0.4748\,\rho/(\sigma^3\sqrt{n})$; the $O(1/\sqrt{n})$ rate is tight.
  • Multivariate CLT: $\sqrt{n}(\bar X_n - \mu) \xrightarrow{d} \mathcal{N}_d(0, \Sigma)$; breaks down in high dimension.
  • Delta method: smooth $g$ pushes the Gaussian forward by the Jacobian; the second-order version handles vanishing gradients.
  • Martingale CLT: replaces independence with $\mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = 0$ plus conditional variance and conditional Lindeberg conditions; underwrites SGD asymptotics, time-series CLTs, and stochastic-approximation theory.
  • ML uses: MLE asymptotic normality, SGD-noise Gaussian approximation, bootstrap validity, sample-covariance asymptotics, asymptotic normality of held-out empirical risk.
  • Failure modes: infinite variance gives stable-law limits; high dimension breaks operator-norm CLTs; strong dependence breaks martingale CLTs.

Exercises

ExerciseCore

Problem

A factory produces bolts whose lengths have mean $\mu = 10$ cm and standard deviation $\sigma = 0.5$ cm (the distribution of lengths is unknown). You measure $n = 100$ bolts and compute $\bar X_{100}$. Using the CLT, find an interval that contains $\mu$ with approximately 95% probability. Then apply Berry-Esseen to bound the worst-case CDF error of the Gaussian approximation, assuming a third absolute moment of $\rho = 0.5$ cm$^3$.

ExerciseAdvanced

Problem

Let $X_1, \ldots, X_n \sim \mathrm{Bern}(p)$ with $p$ unknown, and let $\hat p = \bar X_n$. Use the CLT and the delta method to: (a) derive the asymptotic distribution of $\hat p$ and construct a Wald 95% confidence interval for $p$; (b) derive the asymptotic distribution of the log-odds $g(\hat p) = \log(\hat p/(1-\hat p))$ and the corresponding asymptotic variance; (c) explain why the log-odds Wald interval typically has better finite-sample coverage than the untransformed Wald interval, especially near $p = 0$ or $p = 1$.

ExerciseAdvanced

Problem

Construct a triangular array $\{X_{n,i}\}_{i=1}^n$ of independent mean-zero random variables, each with variance $1$, for which the classical CLT holds but a naive threshold check of the Lindeberg condition at $\epsilon = 0.5$ fails. Then construct a different array where the Lindeberg condition fails for some $\epsilon$ and the CLT also fails. Use these to explain why "Lindeberg fails" needs context: the question is at which $\epsilon$ and how the variance is allocated.

ExerciseResearch

Problem

The CLT assumes finite variance. For a symmetric stable distribution with index $\alpha \in (0, 2)$, the variance is infinite. State the generalized CLT for this case, give the correct normalization rate, and explain why this matters for: (i) financial-returns models with $\alpha \approx 1.5$-$1.7$, (ii) ML algorithms whose theory assumes sub-Gaussian tails (concentration inequalities, generalization bounds), and (iii) the validity of the bootstrap for the sample mean.


Frequently Asked Questions

What does the Central Limit Theorem say?
If you average many independent random variables with finite variance, the average looks Gaussian regardless of the original distribution. Formally: for iid $X_i$ with mean $\mu$ and variance $\sigma^2$, $\sqrt{n}(\bar X_n - \mu)/\sigma$ converges in distribution to $\mathcal{N}(0,1)$. The convergence is in distribution, and the rate is $O(1/\sqrt{n})$ under a finite third moment.
How fast does the CLT converge?
Berry-Esseen gives the rate: $\sup_t |\Pr[Z_n \leq t] - \Phi(t)| \leq 0.4748\,\rho/(\sigma^3\sqrt{n})$. The convergence is $O(1/\sqrt{n})$; heavier tails (a larger third absolute moment) slow it down. For symmetric distributions, higher-order Edgeworth expansions give faster rates in the bulk.
When does the CLT apply to non-iid data?
Lindeberg-Feller covers triangular arrays of independent but non-identically-distributed variables, with the Lindeberg condition replacing iid. Martingale CLTs (Brown 1971, McLeish 1974) handle dependent sequences with mean-zero conditional structure, which is the right tool for SGD asymptotics, time series, and stochastic approximation. Stationary mixing sequences have their own CLTs (Ibragimov, Rio).
When does the CLT fail?
Three regimes. (1) Infinite variance: stable-law limits at rate $n^{1/\alpha}$ instead of Gaussian limits at rate $\sqrt{n}$. (2) High dimension: operator-norm CLTs for sample covariances fail when $d/n \to c > 0$, replaced by Marchenko-Pastur asymptotics. (3) Strong dependence: long-range dependence breaks the classical CLT and its martingale variants; fractional Brownian limits at non-$\sqrt{n}$ rates can apply instead.
Why does the CLT make the MLE asymptotically normal?
Score-equation expansion: $\sqrt{n}(\hat\theta_n - \theta_0) = I(\theta_0)^{-1}\,\frac{1}{\sqrt{n}}\sum_i s(X_i;\theta_0) + o_p(1)$, where the sum is over iid mean-zero scores (Bartlett's first identity) with covariance $I(\theta_0)$. The multivariate CLT plus Slutsky give $\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal{N}_d(0, I(\theta_0)^{-1})$. The full chain (consistency + score expansion + CLT + Slutsky) is the M-estimator framework on the [asymptotic statistics page](/topics/asymptotic-statistics).
Is SGD noise actually Gaussian?
Approximately, by the martingale CLT applied to the per-step noise $\eta_t = g_t - \nabla L(\theta_t)$. The noise is a martingale difference (mean zero given the past), and under bounded second moments and a stationary iterate distribution, the rescaled cumulative noise is Gaussian in the limit. This is what licenses the SGD-as-SDE diffusion approximation; the precise statement is asymptotic, not exact, and the per-step noise distribution is generally non-Gaussian (it is determined by the per-sample-gradient distribution).

References

Canonical:

  • Billingsley, Probability and Measure (3rd ed., 1995), Sections 27-29 (classical CLT, Lindeberg-Feller, Berry-Esseen)
  • Durrett, Probability: Theory and Examples (5th ed., 2019), Section 3.4 (CLT family) and Section 4.3 (martingale CLT)
  • Feller, An Introduction to Probability Theory and Its Applications Vol. 2 (1971), Chapters VIII, XV, XVI

Asymptotic statistics:

  • van der Vaart, Asymptotic Statistics (1998), Chapters 2-3 (CLT, delta method, Slutsky) and Chapter 5 (M-estimator asymptotic normality)
  • Lehmann, Elements of Large-Sample Theory (1999), Chapters 2-3 (delta method, Lindeberg-Feller in regression)

Martingale CLT:

  • Hall and Heyde, Martingale Limit Theory and Its Application (1980), Theorem 3.2 (the standard reference)
  • Brown, "Martingale central limit theorems" (1971), Annals of Mathematical Statistics
  • McLeish, "Dependent central limit theorems and invariance principles" (1974), Annals of Probability

Modern probabilistic perspective:

  • Vershynin, High-Dimensional Probability (2018), Section 2.7 (CLT and Berry-Esseen)
  • Kallenberg, Foundations of Modern Probability (3rd ed., 2021), Chapters 5-6
  • Casella & Berger, Statistical Inference (2002), Chapters 5-10 (applied perspective)

ML applications:

  • Li, Tai, E, "Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms" (2017), arXiv:1511.06251 (SGD-as-SDE, martingale CLT setup)
  • Bickel and Freedman, "Some Asymptotic Theory for the Bootstrap" (1981), Annals of Statistics (bootstrap consistency via CLT)
