Statistical Estimation
Central Limit Theorem
The CLT family: classical Lindeberg-Levy, the Lindeberg-Feller condition for non-identical summands, Berry-Esseen rate, multivariate CLT, the delta method, and the martingale CLT that underwrites SGD asymptotics. Why these results power MLE asymptotic normality, bootstrap validity, and the Gaussian-noise approximation of stochastic optimization, and where the CLT breaks in high dimensions and under heavy tails.
Why This Matters
The central limit theorem explains why the Gaussian distribution appears everywhere. It is not because nature is inherently Gaussian. It is because many quantities of interest are averages, or behave like averages, and averages of independent finite-variance terms are approximately Gaussian regardless of the underlying distribution.
In machine learning and statistics the CLT is the engine behind:
- Asymptotic normality of MLE: $\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1})$, where $I(\theta_0)$ is the Fisher information. The score function is a sum of iid mean-zero terms, the CLT applies, and the delta method propagates the limit through the score-equation expansion.
- Confidence intervals and hypothesis tests: $\bar X_n \pm z_{\alpha/2}\,\hat\sigma/\sqrt{n}$, Wald, score, and LRT statistics all read their critical values off the standard normal because of the CLT.
- Bootstrap validity: the bootstrap works because the empirical CDF satisfies its own functional CLT (Donsker's theorem), so resampling from $\hat F_n$ approximates sampling from $F$.
- SGD noise as Gaussian: at each step the stochastic gradient is a sum over a minibatch of per-sample gradients minus the population gradient. The CLT (martingale form) gives the noise a Gaussian-like shape, which is the basis for SGD-as-SDE diffusion analyses.
- Monte Carlo error bars: estimator variance $\sigma^2/n$, confidence band width $O(1/\sqrt{n})$, and the universal $1/\sqrt{n}$ rate for simulation all come from the CLT.
The CLT also has a sharp dual: it is asymptotic, so it tells you nothing about a fixed $n$ unless you carry along a rate. Concentration inequalities give the non-asymptotic counterpart with explicit constants. Berry-Esseen sits in between.
Want to feel the convergence instead of only reading about it? The Sampling and Limit Laws Lab lets you switch the source distribution, increase $n$, and watch the sample-mean histogram move toward the Gaussian picture in real time.
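The same convergence can be checked numerically. A minimal stdlib-only sketch (the Exponential source and trial counts are illustrative choices): draw standardized sample means of an Exponential(1) source and measure the worst-case gap between their empirical CDF and the standard-normal CDF.

```python
import math
import random

random.seed(0)

def phi(t):
    # Standard-normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2)))

def standardized_means(n, trials=4000):
    # Exp(1) has mu = sigma = 1, so sqrt(n) * (xbar - 1) should approach N(0, 1).
    return [
        math.sqrt(n) * (sum(random.expovariate(1.0) for _ in range(n)) / n - 1.0)
        for _ in range(trials)
    ]

def max_cdf_gap(samples):
    # Worst-case gap between the empirical CDF of the samples and Phi.
    samples = sorted(samples)
    k = len(samples)
    return max(abs((i + 1) / k - phi(x)) for i, x in enumerate(samples))

gaps = {n: max_cdf_gap(standardized_means(n)) for n in (2, 10, 100)}
for n, g in gaps.items():
    print(n, round(g, 3))  # the gap shrinks as n grows
```

The residual gap at large $n$ mixes the true CLT error with Monte Carlo noise from the finite number of trials, so it shrinks toward a simulation floor rather than to zero.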
Quick Version
| Statement | What it says |
|---|---|
| Classical CLT (Lindeberg-Levy) | iid finite-variance: $\sqrt{n}(\bar X_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ |
| Lindeberg-Feller | Non-identical summands: triangular array works iff Lindeberg condition |
| Berry-Esseen | CDF error is $O(n^{-1/2})$ with constant $C \le 0.4748$ |
| Multivariate CLT | $\sqrt{n}(\bar X_n - \mu) \xrightarrow{d} \mathcal{N}(0, \Sigma)$ with the full covariance |
| Delta method | Smooth transformations preserve asymptotic normality with Jacobian |
| Martingale CLT | Replaces independence with conditional moment conditions on the increments |
| Hard requirement | Finite second moment (or equivalent on conditional variance) |
Sample-mean shape becomes Gaussian as n grows
Start with one bounded source. Average more copies of it. The sample mean narrows around the true center and its shape becomes smoother and more bell-like.
The source begins flat or angular. Averaging more copies makes the distribution of the sample mean tighter and more bell-shaped around the true center.
Sample-mean histogram (K=1500 trials) versus the predicted N(μ, σ²/N)
Bars: empirical histogram of the sample mean across K trials. Dashed curve: the limiting Normal N(μ, σ²/N) predicted by the CLT.
Mental Model
The law of large numbers tells you that the sample mean converges. The CLT tells you how: the fluctuations around the true mean are Gaussian with standard deviation $\sigma/\sqrt{n}$. If you zoom in by multiplying by $\sqrt{n}$, the magnified fluctuations have a specific universal shape, the bell curve.
The CLT is a universality result. Whatever distribution the $X_i$ come from, the standardized average looks the same in the limit, so long as the variance is finite. The Lindeberg condition makes precise what "summand" means in this universality: no single term can dominate the variance budget at the relevant scale.
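The $\sigma/\sqrt{n}$ scaling is easy to confirm by simulation. A small sketch (Uniform source and trial count are illustrative): multiply the Monte Carlo standard deviation of the sample mean by $\sqrt{n}$ and it should sit near $\sigma$ for every $n$.

```python
import random
import statistics

random.seed(6)

def sd_of_mean(n, trials=3000):
    # Monte Carlo standard deviation of the sample mean of n Uniform(0,1) draws.
    return statistics.stdev(
        sum(random.random() for _ in range(n)) / n for _ in range(trials)
    )

# Uniform(0,1) has sigma = 1/sqrt(12) ~ 0.2887, and the sd of the mean is
# sigma/sqrt(n), so sd_of_mean(n) * sqrt(n) should be roughly constant at 0.2887.
for n in (4, 25, 100):
    print(n, round(sd_of_mean(n) * n ** 0.5, 3))
```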
Formal Setup
Convergence in Distribution
A sequence $X_n$ converges in distribution to $X$ if for every continuous bounded function $f$:
$$\mathbb{E}[f(X_n)] \to \mathbb{E}[f(X)].$$
Equivalently, $F_{X_n}(t) \to F_X(t)$ at every continuity point $t$ of $F_X$.
Convergence in distribution is the weakest of the standard convergence modes. It is a statement about distributions, not about random variables: $X_n$ and $X$ need not live on the same probability space, and $X_n - X$ need not be small in any pathwise sense.
The rigorous proofs of every CLT in this page use characteristic functions and Levy's continuity theorem: pointwise convergence of $\varphi_{X_n}(t) = \mathbb{E}[e^{itX_n}]$ at every $t$ implies $X_n \xrightarrow{d} X$, provided the limiting function is continuous at zero. The Gaussian has $\varphi(t) = e^{-t^2/2}$, and showing the characteristic function of the standardized sum converges to $e^{-t^2/2}$ is the entire game.
Classical CLT
Classical Central Limit Theorem (Lindeberg-Levy)
Statement
If $X_1, X_2, \dots$ are iid with mean $\mu$ and variance $\sigma^2 \in (0, \infty)$, then
$$\frac{\sqrt{n}\,(\bar X_n - \mu)}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1).$$
Equivalently, $\sqrt{n}(\bar X_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$, and at the working level $\bar X_n \approx \mathcal{N}(\mu, \sigma^2/n)$ for large $n$.
Intuition
Each $X_i$ contributes a random kick to the sum. For large $n$ the aggregate of many small independent kicks is Gaussian, and the only properties of the kicks that survive the limit are the mean and the variance. The $\sqrt{n}$ scaling is the right zoom: the sum grows as $n$, the standard deviation grows as $\sqrt{n}$, and standardizing leaves a quantity of order $1$ whose distribution converges.
Proof Sketch
Let $Y_i = (X_i - \mu)/\sigma$, so $Y_i$ is iid with mean $0$ and variance $1$. Set $S_n = n^{-1/2} \sum_{i=1}^n Y_i$. The characteristic function of $Y_1$ admits the Taylor expansion
$$\varphi_Y(t) = 1 - \frac{t^2}{2} + o(t^2) \quad \text{as } t \to 0.$$
By independence, $\varphi_{S_n}(t) = \varphi_Y(t/\sqrt{n})^n$, so
$$\varphi_{S_n}(t) = \left(1 - \frac{t^2}{2n} + o(1/n)\right)^{\!n} \to e^{-t^2/2}.$$
The limit is the characteristic function of $\mathcal{N}(0, 1)$. Apply Levy's continuity theorem to conclude $S_n \xrightarrow{d} \mathcal{N}(0, 1)$.
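The characteristic-function convergence can be watched numerically. A sketch with a Rademacher summand (an illustrative choice): for $Y_i = \pm 1$ with probability $1/2$, $\varphi_Y(t) = \cos t$, so the standardized sum has characteristic function $\cos(t/\sqrt{n})^n$, which should approach $e^{-t^2/2}$.

```python
import math

def phi_Sn(t, n):
    # Standardized sum of n Rademacher (+/-1) variables:
    # phi_Y(t) = cos(t), and independence gives phi_{S_n}(t) = phi_Y(t/sqrt(n))^n.
    return math.cos(t / math.sqrt(n)) ** n

t = 1.7
target = math.exp(-t * t / 2)  # characteristic function of N(0, 1) at t
errors = {n: abs(phi_Sn(t, n) - target) for n in (10, 100, 10_000)}
for n, e in errors.items():
    print(n, e)  # the error shrinks roughly like 1/n
```

The error decays like $1/n$ here because the next term in the expansion of $\log \cos$ is $O(t^4/n)$; the $1/\sqrt{n}$ Berry-Esseen rate concerns the CDF, not the characteristic function at a fixed $t$.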
Why It Matters
The classical CLT is the single most-used theorem in applied statistics. Every Wald confidence interval, $t$-test, and plug-in standard error invokes it. In ML theory it is the asymptotic half of every generalization argument that uses a normal approximation in place of a tail bound, and it is the result behind the baseline $1/\sqrt{n}$ rate that concentration inequalities try to match non-asymptotically.
Failure Mode
Three failure modes worth naming. (1) Infinite variance: stable distributions with index $\alpha < 2$ (Cauchy, fat-tailed financial returns) have $n^{-1/\alpha} \sum_i X_i$ converging to a non-Gaussian stable limit, not a Gaussian. (2) Strong dependence: long-range-dependent processes can have CLT-like statements with non-Gaussian limits or non-$\sqrt{n}$ rates. (3) High dimension: the iid CLT is one-dimensional; operator-norm CLTs for sample covariances in growing dimensions require additional structure and the limits are not Gaussian.
Lindeberg-Feller CLT
The classical CLT assumes identical distribution. This is too restrictive for many applications: in regression with deterministic covariates, the contribution of observation $i$ to the score has variance that depends on $x_i$; in stratified sampling, summands come from different populations; in neural-network analyses, layerwise contributions to a loss may be heterogeneous. The Lindeberg-Feller theorem replaces "iid" with a condition that controls the relative size of the largest summand.
Lindeberg-Feller CLT
Statement
Let $\{X_{n,i} : 1 \le i \le k_n\}$ be a triangular array of random variables, independent within row $n$, with $\mathbb{E}[X_{n,i}] = 0$ and $\sigma_{n,i}^2 = \mathrm{Var}(X_{n,i}) < \infty$. Set $s_n^2 = \sum_{i=1}^{k_n} \sigma_{n,i}^2$. The Lindeberg condition is
$$L_n(\varepsilon) = \frac{1}{s_n^2} \sum_{i=1}^{k_n} \mathbb{E}\!\left[ X_{n,i}^2 \,\mathbf{1}\{|X_{n,i}| > \varepsilon s_n\} \right] \to 0.$$
If $L_n(\varepsilon) \to 0$ for every $\varepsilon > 0$, then
$$\frac{1}{s_n} \sum_{i=1}^{k_n} X_{n,i} \xrightarrow{d} \mathcal{N}(0, 1).$$
The condition is essentially necessary: under the uniform asymptotic negligibility condition $\max_i \sigma_{n,i}^2 / s_n^2 \to 0$, the Lindeberg condition is equivalent to the CLT plus uniform negligibility.
Intuition
$L_n(\varepsilon)$ is the fraction of the total variance carried by the "big" parts of summands (the parts above $\varepsilon s_n$ in absolute value). Lindeberg says: at every threshold $\varepsilon$, the big parts contribute a vanishing share of the variance. No single summand, and no constant fraction of summands, is allowed to dominate.
The Lindeberg condition is harder to verify directly than the simpler Lyapunov condition: there exists $\delta > 0$ such that
$$\frac{1}{s_n^{2+\delta}} \sum_{i=1}^{k_n} \mathbb{E}\,|X_{n,i}|^{2+\delta} \to 0.$$
Lyapunov implies Lindeberg by a Markov-type truncation bound, and is the form most often used in applied work because $(2+\delta)$-th moments are easier to bound than truncated second moments.
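The Lindeberg sum is easy to compute exactly for simple two-point summands, which makes the "no dominant summand" reading concrete. A sketch (the two-point construction $X_{n,i} = \pm \sigma_i$ is an illustrative choice):

```python
import math

def lindeberg_ratio(variances, eps):
    # Row of independent two-point summands X_i = +/- sigma_i (probability 1/2 each):
    # E[X_i^2 1{|X_i| > eps * s_n}] equals sigma_i^2 when sigma_i > eps * s_n, else 0.
    s2 = sum(variances)
    s = math.sqrt(s2)
    big = sum(v for v in variances if math.sqrt(v) > eps * s)
    return big / s2

n = 10_000
balanced = [1.0] * n                       # every summand is tiny relative to s_n
dominated = [float(n)] + [1e-6] * (n - 1)  # one summand carries almost all the variance

print(lindeberg_ratio(balanced, 0.01))   # 0.0: Lindeberg holds at this threshold
print(lindeberg_ratio(dominated, 0.01))  # close to 1.0: Lindeberg fails
```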
Proof Sketch
Lindeberg's replacement method. Replace the $X_{n,i}$ one at a time with independent Gaussians $Z_{n,i} \sim \mathcal{N}(0, \sigma_{n,i}^2)$ and control the change in $\mathbb{E}[f(\cdot)]$ for smooth bounded $f$ via a third-order Taylor expansion. The remainder is bounded in terms of the Lindeberg sum $L_n(\varepsilon)$ for a free $\varepsilon > 0$; let $n \to \infty$, then $\varepsilon \to 0$. After full replacement, the sum is exactly Gaussian with variance $s_n^2$, and standardizes to $\mathcal{N}(0, 1)$. See Billingsley, Probability and Measure, Section 27, or Durrett, Probability: Theory and Examples, Section 3.4.
Why It Matters
This is the right form of the CLT for theory. Whenever you read "by the CLT" in a learning-theory or asymptotic-statistics paper that does not have iid data, the unstated assumption is almost always Lindeberg or Lyapunov on a triangular array. Specific places it shows up:
- OLS asymptotics with non-random covariates: the score sum $\sum_i x_i \varepsilon_i$ is independent but not identically distributed. Lyapunov on $x_i \varepsilon_i$ plus a balanced design (no single $x_i$ dominates the design mass) gives asymptotic normality of $\hat\beta$.
- Influence-function-based asymptotics: in semiparametric efficiency theory, estimators are linearized as sample averages of an influence function, but with finite-sample plug-in errors that make the summands non-iid. Lindeberg controls the error term.
- Non-iid generalization analyses: when training data come from heterogeneous sources or active-learning streams, Lindeberg-style conditions are what permit Gaussian limits of empirical risks.
Failure Mode
The Lindeberg condition fails when one or a few summands carry a non-vanishing share of the variance. Canonical example: take $X_{n,1} = \pm s_n/\sqrt{2}$ with probability $1/2$ each and give the remaining summands tiny variances, so that $X_{n,1}$ carries a constant fraction of $s_n^2$. The standardized sum does not converge to a Gaussian.
The contrast is instructive: an array of comparable summands (each with variance of order $s_n^2/k_n$) satisfies the Lindeberg condition and the CLT holds; but once a single Bernoulli-like term dominates the variance budget, the standardized sum inherits that term's two-point shape, not a Gaussian one.
Berry-Esseen: Rate of Convergence
The CLT says $F_n(t) \to \Phi(t)$ but does not say how large $n$ must be. Berry-Esseen quantifies the gap.
Berry-Esseen Theorem
Statement
If $X_i$ are iid with mean $\mu$, variance $\sigma^2 > 0$, and third absolute moment $\rho = \mathbb{E}|X_1 - \mu|^3 < \infty$, then
$$\sup_{t \in \mathbb{R}} \left| \mathbb{P}\!\left( \frac{\sqrt{n}\,(\bar X_n - \mu)}{\sigma} \le t \right) - \Phi(t) \right| \le \frac{C \rho}{\sigma^3 \sqrt{n}},$$
where $\Phi$ is the standard-normal CDF and $C$ is a universal constant. The best known constant is $C \le 0.4748$ (Shevtsova, 2011); a lower bound of $C \ge 0.4097$ (Esseen, 1956) shows the optimal $C$ lies in this narrow range.
The CLT rate of convergence is $O(n^{-1/2})$, with the constant set by the third absolute moment of the centered summand.
Intuition
The bound has two factors. $1/\sqrt{n}$ is the universal rate. The non-Gaussianity factor $\rho/\sigma^3$ measures how skewed and heavy-tailed the centered $X_i$ is, relative to a Gaussian baseline. A Bernoulli($1/2$) summand, for instance, has $\sigma^3 = 1/8$ and $\rho = 1/8$, so the ratio is $1$. Heavier tails or larger skewness make the constant blow up and you need more samples before the Gaussian approximation lands.
Proof Sketch
The proof uses smoothing of the indicator against a kernel of bandwidth $1/T$: Esseen's smoothing inequality bounds the CDF gap through characteristic functions,
$$\sup_t |F_n(t) - \Phi(t)| \le \frac{1}{\pi} \int_{-T}^{T} \left| \frac{\varphi_n(t) - e^{-t^2/2}}{t} \right| dt + \frac{C'}{T}$$
for a universal $C'$. Bound the integrand by a third-order Taylor remainder of the per-summand characteristic function at the standardized scale. Optimize $T \asymp \sqrt{n}$ to balance the smoothing error against the characteristic-function error, which gives the $n^{-1/2}$ rate.
Why It Matters
Berry-Esseen tells you when the CLT approximation is good enough to trust. For Bernoulli($1/2$): $\sigma^3 = 1/8$ and $\rho = 1/8$, so $\rho/\sigma^3 = 1$ and the bound is $0.4748/\sqrt{n}$. At $n = 100$ the worst-case CDF error bound is $\approx 0.047$; at $n = 10{,}000$ it is $\approx 0.005$.
Practical consequence: for CLT-based critical values to be reliable to within $\pm 0.01$ uniformly across the CDF, this bound requires $n$ on the order of a few thousand in the worst case. This is the back-of-envelope arithmetic behind why CLT-based critical values are usually fine at modest $n$ for symmetric data and require much larger $n$ for skewed data.
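For the Binomial the worst-case CDF error is exactly computable, so the bound can be compared with the truth. A stdlib sketch (the choice of $n$ values is illustrative): the binomial CDF is a step function, so the supremum over $t$ is attained at or just before a jump.

```python
import math

def phi(t):
    # Standard-normal CDF.
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2)))

def worst_cdf_gap(n):
    # Exact sup_t |F_n(t) - Phi(t)| for the standardized Bin(n, 1/2).
    # F_n is a step function, so it suffices to check both sides of every jump.
    mu, sigma = n / 2, math.sqrt(n) / 2
    cdf, gap = 0.0, 0.0
    for k in range(n + 1):
        z = (k - mu) / sigma
        gap = max(gap, abs(cdf - phi(z)))  # limit from the left of the jump at k
        cdf += math.comb(n, k) * 0.5 ** n
        gap = max(gap, abs(cdf - phi(z)))  # value at the jump
    return gap

for n in (100, 400):
    be_bound = 0.4748 / math.sqrt(n)  # rho / sigma^3 = 1 for Bernoulli(1/2)
    print(n, round(worst_cdf_gap(n), 4), round(be_bound, 4))
```

For Bernoulli($1/2$) the true gap sits just under the bound, which is part of why the constant $C$ cannot be pushed much below $0.41$.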
In ML theory, Berry-Esseen is the right reference point when comparing CLT-based confidence intervals against concentration inequalities. For sub-Gaussian data, Hoeffding gives a non-asymptotic Gaussian-type tail at every $n$; Berry-Esseen is what tells you the asymptotic statement is accurate to $O(n^{-1/2})$, which can be stronger or weaker than Hoeffding's constant depending on the regime.
Failure Mode
Berry-Esseen requires a finite third moment. Without it the rate can be slower than $n^{-1/2}$; under finite $p$-th moments only, $2 < p < 3$, the rate is $O(n^{-(p-2)/2})$. The bound is uniform over $t$, so it controls the central CDF well but is loose deep in the tails: at $t$ large enough that $1 - \Phi(t)$ itself is exponentially small, the relative error of the Gaussian approximation can be much larger than the absolute Berry-Esseen budget suggests. Edgeworth expansions give sharper rates inside the bulk and the Cramér large-deviation regime gives sharper tail rates under exponential-moment conditions.
Multivariate CLT
Multivariate CLT
Statement
If $X_i \in \mathbb{R}^d$ are iid with mean $\mu$ and covariance matrix $\Sigma$, then
$$\sqrt{n}\,(\bar X_n - \mu) \xrightarrow{d} \mathcal{N}(0, \Sigma).$$
The full covariance structure is preserved in the limit: the joint distribution of the standardized sample-mean vector converges to a multivariate Gaussian with the same covariance as a single $X_i$.
Intuition
The Cramér-Wold device reduces multivariate weak convergence to all one-dimensional projections. For every $a \in \mathbb{R}^d$, the projected sample $a^\top X_i$ is a one-dimensional iid sequence with variance $a^\top \Sigma a$, and the classical CLT gives $\sqrt{n}\, a^\top (\bar X_n - \mu) \xrightarrow{d} \mathcal{N}(0, a^\top \Sigma a)$. Since the family of one-dimensional projections characterizes a multivariate Gaussian uniquely, the joint limit is $\mathcal{N}(0, \Sigma)$.
Proof Sketch
Cramér-Wold theorem: $Y_n \xrightarrow{d} Y$ in $\mathbb{R}^d$ if and only if $a^\top Y_n \xrightarrow{d} a^\top Y$ for every $a \in \mathbb{R}^d$. Each projection is a univariate iid CLT with limit $\mathcal{N}(0, a^\top \Sigma a)$, and these limits agree with the projections of $\mathcal{N}(0, \Sigma)$. Conclude joint convergence.
Why It Matters
Multivariate CLT is the workhorse for vector-valued estimators. The MLE in a $d$-dimensional regular parametric family is asymptotically $\mathcal{N}(\theta_0, I(\theta_0)^{-1}/n)$, which comes from the multivariate CLT applied to the $d$-vector score $\nabla_\theta \log p_\theta(X_i)$. OLS asymptotic normality, GMM asymptotics, and the joint distribution of estimated regression coefficients all follow the same template: write the estimator as a sum of iid random vectors plus negligible remainder, then apply the multivariate CLT.
The covariance matters: scalar tests of single coefficients use the diagonal entries of $\Sigma$, joint tests (chi-squared statistics, F-tests) use the inverse of $\Sigma$ to whiten the vector. Misreporting uncertainty by ignoring off-diagonal correlations is the most common practical error in multivariate inference.
Failure Mode
Singular $\Sigma$ collapses the limit to a degenerate Gaussian on a lower-dimensional subspace. The marginal CLT still holds in each direction, but inverting $\Sigma$ in a Wald or Mahalanobis statistic requires care. In high dimensions ($d$ growing with $n$), the multivariate CLT in operator norm fails outright: the sample covariance has eigenvalues that do not converge to those of $\Sigma$ unless $d/n \to 0$. The Marchenko-Pastur law replaces the multivariate CLT in the regime $d/n \to \gamma \in (0, \infty)$.
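A stdlib Monte Carlo check of the fixed-$d$ statement, with a non-Gaussian source of known covariance (the construction $X = (A, A+B)$ is an illustrative choice): the covariance of $\sqrt{n}\,\bar X_n$ across trials should match $\Sigma$, off-diagonals included.

```python
import random

random.seed(1)

# Source vector X = (A, A + B) with A, B iid Uniform(-1, 1):
# Var(A) = Var(B) = 1/3, so the true covariance is [[1/3, 1/3], [1/3, 2/3]].
SIGMA = [[1 / 3, 1 / 3], [1 / 3, 2 / 3]]

def scaled_mean(n):
    # One draw of sqrt(n) * (Xbar_n - mu); the mean vector here is (0, 0).
    sa = sb = 0.0
    for _ in range(n):
        a, b = random.uniform(-1, 1), random.uniform(-1, 1)
        sa += a
        sb += a + b
    return (n ** 0.5 * sa / n, n ** 0.5 * sb / n)

trials = [scaled_mean(200) for _ in range(5000)]
cov = [
    [sum(u[i] * u[j] for u in trials) / len(trials) for j in range(2)]
    for i in range(2)
]
print([[round(c, 2) for c in row] for row in cov])  # close to SIGMA
```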
Delta Method: CLT Through Smooth Functions
The delta method is what lets you propagate a CLT through a smooth nonlinear transformation. It is the second-most-used theorem in applied statistics after the CLT itself, and it is the bridge from "CLT for the sample mean" to "CLT for any smooth functional of moments". The full treatment with regularity conditions and second-order versions lives in asymptotic statistics; the result is restated here because most applications of the CLT in practice immediately chain into a delta-method step.
Delta Method
Statement
Let $T_n \in \mathbb{R}^d$ satisfy $\sqrt{n}\,(T_n - \theta) \xrightarrow{d} \mathcal{N}(0, \Sigma)$, and let $g : \mathbb{R}^d \to \mathbb{R}^m$ be differentiable at $\theta$ with Jacobian $J_g(\theta)$. Then
$$\sqrt{n}\,\big(g(T_n) - g(\theta)\big) \xrightarrow{d} \mathcal{N}\!\big(0,\; J_g(\theta)\, \Sigma\, J_g(\theta)^\top\big).$$
In the scalar case the Jacobian is $g'(\theta)$ and the asymptotic variance is $g'(\theta)^2 \sigma^2$. When $g'(\theta) = 0$ the second-order delta method gives $n\,\big(g(T_n) - g(\theta)\big) \xrightarrow{d} \tfrac{1}{2}\, g''(\theta)\, \sigma^2 \chi^2_1$.
Intuition
A first-order Taylor expansion of $g$ at $\theta$ gives $g(T_n) = g(\theta) + J_g(\theta)(T_n - \theta) + o(\|T_n - \theta\|)$. Multiplying by $\sqrt{n}$ gives $\sqrt{n}\,(g(T_n) - g(\theta)) = J_g(\theta)\,\sqrt{n}\,(T_n - \theta) + o_P(1)$, and the lead term pushes the Gaussian forward by the linear map $J_g(\theta)$.
Proof Sketch
Apply the multivariate Taylor theorem to $g$ at $\theta$. The remainder satisfies $o(\|T_n - \theta\|) = o_P(1/\sqrt{n})$, which combined with the $\sqrt{n}$-rate hypothesis on $T_n$ gives $\sqrt{n}\,(g(T_n) - g(\theta)) = J_g(\theta)\,\sqrt{n}\,(T_n - \theta) + o_P(1)$. Slutsky converts the joint limit to the conclusion.
Why It Matters
The delta method is how every asymptotic standard error in applied statistics is computed beyond simple sample means.
- Asymptotic variance of $\bar X_n^2$: with $g(x) = x^2$, $g'(\mu) = 2\mu$, so $\sqrt{n}\,(\bar X_n^2 - \mu^2) \xrightarrow{d} \mathcal{N}(0, 4\mu^2\sigma^2)$.
- Variance of the log-odds $\log\!\big(\hat p/(1-\hat p)\big)$: $g'(p) = 1/\big(p(1-p)\big)$, asymptotic variance $1/\big(p(1-p)\big)$. This is the basis of logit-scale Wald intervals, which often have better coverage than untransformed Wald intervals.
- Variance of a ratio $\bar X_n / \bar Y_n$: standard delta-method exercise giving the variance of the ratio in terms of variances and covariance of the components.
- Sandwich variance for misspecified MLE: the delta method through the score-equation expansion produces the sandwich form $A^{-1} B A^{-\top}$, with $A$ the expected score derivative and $B$ the score covariance.
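The log-odds entry above can be verified by simulation. A stdlib sketch (the values $n = 500$, $p = 0.3$ are illustrative): the Monte Carlo standard deviation of the sample log-odds should match the delta-method prediction $1/\sqrt{n\,p(1-p)}$.

```python
import math
import random
import statistics

random.seed(2)

n, p = 500, 0.3
delta_sd = 1.0 / math.sqrt(n * p * (1 - p))  # delta method: Var ~ 1 / (n p (1-p))

logits = []
for _ in range(4000):
    k = sum(random.random() < p for _ in range(n))
    logits.append(math.log(k / (n - k)))  # log-odds of the sample proportion

mc_sd = statistics.stdev(logits)
print(round(mc_sd, 4), round(delta_sd, 4))  # Monte Carlo vs delta-method prediction
```

At these values $k = 0$ or $k = n$ is astronomically unlikely, so the log-odds is well defined in every trial; near $p = 0$ or $p = 1$ that caveat becomes real.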
Failure Mode
Three failure modes. (1) $g'(\theta) = 0$: use the second-order delta method, in which the rate jumps from $\sqrt{n}$ to $n$ and the limit is a scaled $\chi^2_1$, not normal. (2) $g$ not differentiable at $\theta$: e.g. $g(x) = |x|$ at $\theta = 0$ gives a folded-normal limit. (3) $T_n$ converges at a rate other than $\sqrt{n}$ (e.g. extreme-value problems with rate $n$): the delta method must use the actual rate, not $\sqrt{n}$.
Martingale CLT: Beyond Independence
The classical CLT requires independence. Many sequences in machine learning are not independent but are nearly so in a conditional sense: in SGD, the stochastic gradient at step $t$ depends on past iterates but is mean-zero conditional on them. The right tool is the martingale CLT, which replaces "iid" with "martingale difference plus conditional Lindeberg".
Martingale CLT (Brown 1971 / McLeish 1974)
Statement
Let $\{\xi_{n,i} : 1 \le i \le n\}$ be a martingale difference array with respect to a filtration $\{\mathcal{F}_{n,i}\}$, meaning $\xi_{n,i}$ is $\mathcal{F}_{n,i}$-measurable and $\mathbb{E}[\xi_{n,i} \mid \mathcal{F}_{n,i-1}] = 0$ for every $i$. Suppose
$$\sum_{i=1}^{n} \mathbb{E}\big[\xi_{n,i}^2 \mid \mathcal{F}_{n,i-1}\big] \xrightarrow{P} \sigma^2 > 0,$$
and the conditional Lindeberg condition holds: for every $\varepsilon > 0$,
$$\sum_{i=1}^{n} \mathbb{E}\big[\xi_{n,i}^2 \,\mathbf{1}\{|\xi_{n,i}| > \varepsilon\} \,\big|\, \mathcal{F}_{n,i-1}\big] \xrightarrow{P} 0.$$
Then
$$\sum_{i=1}^{n} \xi_{n,i} \xrightarrow{d} \mathcal{N}(0, \sigma^2).$$
The conditional variance condition replaces "identically distributed", the conditional Lindeberg replaces the unconditional Lindeberg, and the martingale-difference property replaces independence.
Intuition
A martingale difference is "locally mean-zero": each term has mean zero given the past, even though it can depend on the past in arbitrary nonlinear ways. The conditional variance condition fixes a long-run average level of conditional variance; the conditional Lindeberg condition forbids any single term from dominating the variance budget at the relevant scale. Together these are exactly the right replacements for iid: predictable second moments stand in for identical distribution, and conditional negligibility stands in for genuine independence.
Proof Sketch
McLeish's proof uses the auxiliary product $\prod_i (1 + it\,\xi_{n,i})$, which has expectation $1$ by the martingale property, and shows it is close in $L^1$ to $e^{itS_n + t^2\sigma^2/2}$ under the stated conditions, which converts the problem to controlling the conditional second moments of the truncated increments. The conditional Lindeberg condition makes the truncation error vanish; the conditional variance condition pins down the deterministic limit. See Hall and Heyde, Martingale Limit Theory and Its Application (1980), Theorem 3.2, or Durrett, Probability: Theory and Examples, Section 4.3.
Lyapunov-type substitute: if the stronger condition $\sum_i \mathbb{E}\big[|\xi_{n,i}|^{2+\delta} \mid \mathcal{F}_{n,i-1}\big] \xrightarrow{P} 0$ holds for some $\delta > 0$, the conditional Lindeberg condition follows by a Markov-type bound (the same logic as the iid case).
Why It Matters
The martingale CLT is the asymptotic backbone of modern stochastic-process theory and underwrites several core ML applications.
- SGD as approximate Gaussian noise: at step $t$ the stochastic gradient $\hat g_t$ on a random batch has $\mathbb{E}[\hat g_t \mid \mathcal{F}_{t-1}] = \nabla L(\theta_t)$, where $L$ is the population loss. The noise $\xi_t = \hat g_t - \nabla L(\theta_t)$ is a martingale difference, and under bounded-variance assumptions the martingale CLT gives its rescaled cumulative sum a Gaussian limit. This is what licenses the SGD-as-SDE approximation used in stochastic-modified-equation analyses (Li, Tai, E, 2017).
- Time-series asymptotics: a stationary AR(1) process $X_t = \phi X_{t-1} + \varepsilon_t$ with $|\phi| < 1$ has scaled partial sums whose Gaussian limit comes from a martingale CLT applied to the innovation sequence $\varepsilon_t$.
- Online learning regret bounds: the regret of an online algorithm is a sum of martingale differences (current loss minus its conditional expectation given the past). Martingale CLT gives the asymptotic Gaussian distribution of regret around its mean, which is the basis of asymptotic confidence intervals for online estimators.
- Stochastic-approximation iterates: Robbins-Monro and Polyak-Ruppert averaging produce iterates whose asymptotic normality is a martingale CLT statement on the noise sequence at the stationary point.
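A toy martingale-difference construction makes the theorem tangible (the specific scales $1.0$ and $0.5$ are illustrative): the magnitude of each increment depends on the previous coin, so the sequence is dependent but conditionally mean-zero, and the conditional variances average to $\mathbb{E}[v^2] = 0.625$ by the LLN.

```python
import math
import random

random.seed(3)

def normalized_martingale_sum(n):
    # xi_i = eps_i * v_i, where eps_i = +/-1 is a fair coin and v_i depends on
    # the PREVIOUS coin (so v_i is predictable): E[xi_i | past] = 0, a martingale
    # difference, even though |xi_i| is determined by the past. The conditional
    # variances v_i^2 average to E[v^2] = (1.0**2 + 0.5**2) / 2 = 0.625.
    prev, total = 1.0, 0.0
    for _ in range(n):
        v = 1.0 if prev > 0 else 0.5
        prev = 1.0 if random.random() < 0.5 else -1.0
        total += prev * v
    return total / math.sqrt(0.625 * n)

phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2)))
samples = sorted(normalized_martingale_sum(1000) for _ in range(2000))
gap = max(abs((i + 1) / len(samples) - phi(x)) for i, x in enumerate(samples))
print(round(gap, 3))  # small: the normalized sum is approximately N(0, 1)
```

The increments here are neither independent nor identically distributed, yet the normalized sum is Gaussian: exactly the situation the martingale CLT covers.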
Failure Mode
The martingale CLT can fail when the conditional variance does not stabilize. Heteroscedastic SGD with a non-stationary step size can have a conditional variance sum $\sum_i \mathbb{E}[\xi_{n,i}^2 \mid \mathcal{F}_{n,i-1}]$ that does not converge to a deterministic limit, and the conclusion has to be modified: the normalized sum converges stably to a mixture $\eta Z$, with $Z$ standard normal and the scale $\eta$ itself random in the limit. This is the LAMN (locally asymptotically mixed normal) regime, and it shows up in unit-root time series and in some non-stationary stochastic-approximation problems.
Strong dependence breaks the conclusion: if the conditional variance has long-range dependence, the limit can be fractional-Brownian in nature, at a non-$\sqrt{n}$ rate, rather than the standard Gaussian limit.
Applications in Machine Learning
The CLT is invoked dozens of times across ML theory, sometimes silently. This section makes the key invocations explicit, with the precise form of the CLT used in each.
Asymptotic normality of MLE
For a regular parametric family with score $s(x; \theta) = \nabla_\theta \log p_\theta(x)$ and Fisher information $I(\theta)$, the MLE satisfies the score-equation expansion $\sqrt{n}\,(\hat\theta_n - \theta_0) = I(\theta_0)^{-1} \cdot n^{-1/2} \sum_{i=1}^n s(X_i; \theta_0) + o_P(1)$. The lead term is a multivariate iid CLT applied to the mean-zero score vector, giving $\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1})$. The full derivation, including the regularity conditions and the extension to misspecified models with sandwich variance, is in asymptotic statistics.
SGD noise: martingale CLT in action
Stochastic gradient descent with minibatch size $B$ at iterate $\theta_t$ takes a step $\theta_{t+1} = \theta_t - \eta\, \hat g_t$, where $\hat g_t$ is the empirical gradient on a random minibatch. Decompose the noise as $\xi_t = \hat g_t - \nabla L(\theta_t)$, where $L$ is the population loss. Under bounded fourth moments and a stationary distribution of iterates, the martingale CLT gives the rescaled cumulative noise $T^{-1/2} \sum_{t=1}^{T} \xi_t \xrightarrow{d} \mathcal{N}(0, \Sigma/B)$, where $\Sigma$ is the per-sample gradient covariance at the optimum. This is the asymptotic-Gaussian noise approximation that underpins the SGD-as-SDE diffusion analysis: at stationarity, SGD is approximately a discretization of a Langevin-type SDE whose diffusion term scales with $\eta \Sigma / B$.
The non-asymptotic version of the same statement is a Hoeffding or Bernstein-type concentration bound on the cumulative noise; the martingale CLT is the asymptotic counterpart that gives the precise covariance.
Bootstrap validity via the empirical-process CLT
The bootstrap works because resampling from $\hat F_n$ approximates sampling from $F$, but the rigorous justification is a functional CLT. Donsker's theorem states that the empirical process $\sqrt{n}\,(\hat F_n - F)$ converges weakly to a Brownian bridge in the Skorokhod space. The Bickel-Freedman (1981) bootstrap theorem applies the same machinery in the bootstrap world conditional on the data, giving conditional weak convergence of $\sqrt{n}\,(\hat F_n^* - \hat F_n)$ to the same Brownian bridge with probability $1$. Smooth functionals of $\hat F_n$ inherit asymptotic normality from this, and the bootstrap distribution of the smooth functional matches the true sampling distribution at the same Edgeworth-corrected rate.
Sample covariance and where the CLT fails in high dimension
For iid $X_i \in \mathbb{R}^d$ with covariance $\Sigma$ and finite fourth moments, the sample covariance $\hat\Sigma_n$ satisfies a multivariate CLT entrywise: $\sqrt{n}\,(\hat\Sigma_{n,jk} - \Sigma_{jk})$ converges to a mean-zero Gaussian whose variance involves the fourth moments of the corresponding components. With $d$ fixed and $n \to \infty$, the operator-norm error is $O_P(1/\sqrt{n})$.
In the high-dimensional regime $d/n \to \gamma > 0$, the operator-norm error does not vanish: empirical eigenvalues are biased in a deterministic way described by the Marchenko-Pastur distribution, and the CLT for entrywise errors does not assemble into a CLT for the spectrum or for matrix-valued functionals like $\hat\Sigma_n^{-1}$. This is why covariance-based methods (PCA, GLS, Wald tests) need shrinkage or regularization in high dimensions: the multivariate CLT gives no operator-norm guarantee.
Asymptotic normality of empirical risk
For a fixed hypothesis $h$, the empirical risk $\hat R_n(h) = n^{-1} \sum_{i=1}^n \ell(h; Z_i)$ is the sample mean of iid losses with mean $R(h)$ and variance $\sigma^2(h)$. The classical CLT gives $\sqrt{n}\,(\hat R_n(h) - R(h)) \xrightarrow{d} \mathcal{N}(0, \sigma^2(h))$, which justifies the standard variance-based confidence interval for held-out test loss.
This is pointwise in $h$: the CLT does not give a Gaussian limit for the worst-case empirical-risk gap $\sup_{h \in \mathcal{H}} |\hat R_n(h) - R(h)|$. Uniform versions of the CLT (Donsker classes, Rademacher complexity) are what control the generalization gap.
Canonical Examples
Sample mean of Exponential variables
Let $X_i \sim \mathrm{Exp}(\lambda)$ with $\mu = 1/\lambda$, $\sigma^2 = 1/\lambda^2$, and $\rho = \mathbb{E}|X_1 - \mu|^3 \approx 2.415/\lambda^3$. The classical CLT gives $\sqrt{n}\,(\bar X_n - 1/\lambda) \xrightarrow{d} \mathcal{N}(0, 1/\lambda^2)$, so $\bar X_n \approx \mathcal{N}\!\big(1/\lambda,\, 1/(n\lambda^2)\big)$.
A standard 95% Wald confidence interval is $\bar X_n \pm 1.96\, \bar X_n/\sqrt{n}$ (plugging in $\hat\sigma = \bar X_n$, since $\sigma = \mu$ for the Exponential).
The Exponential is skewed (skewness $2$), so the Berry-Esseen bound with $\rho/\sigma^3 \approx 2.4$ gives a CDF error of $\approx 0.11$ at $n = 100$, which is large. By $n = 1{,}000$ the bound drops to $\approx 0.036$, and by $n = 10{,}000$ to $\approx 0.011$. Symmetrizing transformations (e.g. the Wilson-Hilferty cube root for $\chi^2$, or the log for Exponential rates) reduce the constant and give better finite-sample coverage.
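The coverage consequence of the skewness can be measured directly. A stdlib sketch (with $\lambda = 1$ and illustrative sample sizes): estimate the actual coverage of the nominal 95% Wald interval and watch it climb toward 0.95 as $n$ grows.

```python
import math
import random

random.seed(4)

def wald_coverage(n, trials=3000):
    # X_i ~ Exp(1): mu = sigma = 1, so the 95% Wald interval for mu is
    # xbar +/- 1.96 * xbar / sqrt(n) (plugging in sigma-hat = xbar).
    hits = 0
    for _ in range(trials):
        xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
        half = 1.96 * xbar / math.sqrt(n)
        hits += xbar - half <= 1.0 <= xbar + half
    return hits / trials

for n in (10, 50, 400):
    print(n, wald_coverage(n))  # coverage climbs toward the nominal 0.95
```

The undercoverage at small $n$ is exactly the finite-sample price of the skewed summand that Berry-Esseen budgets for.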
Normal approximation to the Binomial
If $S_n \sim \mathrm{Bin}(n, p)$, then $S_n$ is a sum of $n$ iid $\mathrm{Bernoulli}(p)$ variables and the classical CLT (de Moivre-Laplace) gives
$$\frac{S_n - np}{\sqrt{n p (1-p)}} \xrightarrow{d} \mathcal{N}(0, 1).$$
The textbook rule of thumb $np \ge 10$ and $n(1-p) \ge 10$ ensures the Berry-Esseen budget $C\rho/(\sigma^3\sqrt{n})$ is small enough for two-sided coverage to be roughly correct. For $n = 100$, $p = 0.5$: $\mathbb{P}(S_n \le 55) \approx \Phi(1.0) = 0.841$ by the plain CLT vs exact $\approx 0.864$; the continuity-corrected $\Phi(1.1) \approx 0.864$ closes the gap.
Binomial(n, p) PMF with the de Moivre–Laplace Normal(np, np(1−p)) overlay
Bars: exact Binomial PMF. Dashed curve: the Normal(np, np(1−p)) limit predicted by de Moivre–Laplace. The overlay tightens around the bars as n grows, with the worst fit near p = 0 or p = 1 where the distribution is skewed.
Lyapunov over Lindeberg in OLS asymptotics
Take $x_i \in \mathbb{R}^d$ deterministic, $\varepsilon_i$ iid mean-zero with $\mathbb{E}|\varepsilon_1|^{2+\delta} < \infty$, and the linear model $y_i = x_i^\top \beta + \varepsilon_i$. The score vector $\sum_i x_i \varepsilon_i$ is independent but not identically distributed (the $x_i$ vary). The Lyapunov condition reduces to a balanced-design condition
$$\frac{\max_{i \le n} \|x_i\|^2}{\sum_{i=1}^n \|x_i\|^2} \to 0,$$
which is satisfied for any reasonable design where no covariate vector dominates. Lyapunov implies Lindeberg, and the Lindeberg-Feller CLT gives asymptotic normality with the sandwich variance. The Lyapunov condition is the form practitioners verify; the Lindeberg condition is the form theorists prove.
Common Confusions
The CLT does NOT say individual observations are Gaussian
The CLT says the average is approximately Gaussian for large $n$. Individual observations retain their original distribution. They could be Bernoulli, Exponential, Poisson, or anything else with finite variance. The Gaussian emerges only through averaging. A single coin flip is not Gaussian; the average of a million coin flips is.
The CLT is asymptotic, concentration is non-asymptotic
The CLT says $F_n(t) \to \Phi(t)$ as $n \to \infty$. It does not say the approximation is good for any specific $n$. Concentration inequalities (Hoeffding, Bernstein, sub-Gaussian tails) give finite-sample bounds with explicit constants. Berry-Esseen sits between the two as a finite-$n$ correction to the CLT, but it is uniform over the CDF and can be loose deep in the tails. For tight non-asymptotic tail probabilities, use a concentration inequality, not Berry-Esseen.
Convergence in distribution does not imply convergence in probability
The CLT gives convergence in distribution: $\sqrt{n}\,(\bar X_n - \mu)/\sigma \xrightarrow{d} Z$ where $Z \sim \mathcal{N}(0, 1)$. This does not mean $\sqrt{n}\,(\bar X_n - \mu)/\sigma - Z \to 0$ in any sense. The random variables $\bar X_n$ are determined by the data; $Z$ is a "fresh" Gaussian on a possibly different probability space. Convergence in distribution is a statement about the shape of the distribution, not about the values of the random variables.
Finite variance is required, not just finite mean
The LLN requires only a finite mean. The CLT requires a finite variance. For distributions with finite mean but infinite variance (e.g. stable distributions with index $\alpha \in (1, 2)$), the sample mean still converges (by the LLN) but the CLT does not hold in its standard form. The normalized sum converges to a non-Gaussian stable distribution instead.
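The infinite-variance failure is easy to see by simulation. A stdlib sketch (the Cauchy, generated by inverse transform, is the canonical $\alpha = 1$ example, sitting just outside the finite-mean regime above): the spread of the sample mean shrinks like $1/\sqrt{n}$ for a finite-variance source but does not shrink at all for the Cauchy.

```python
import math
import random

random.seed(5)

def mean_iqr(dist, n, trials=800):
    # Interquartile range of the sample mean across trials.
    means = sorted(sum(dist() for _ in range(n)) / n for _ in range(trials))
    return means[3 * trials // 4] - means[trials // 4]

gauss = lambda: random.gauss(0.0, 1.0)                        # finite variance
cauchy = lambda: math.tan(math.pi * (random.random() - 0.5))  # infinite variance

for n in (10, 1000):
    # Gaussian spread shrinks like 1/sqrt(n); the Cauchy spread stays put,
    # because the mean of n standard Cauchy variables is again standard Cauchy.
    print(n, round(mean_iqr(gauss, n), 3), round(mean_iqr(cauchy, n), 3))
```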
The Lindeberg condition is not just about iid
The Lindeberg condition was designed for triangular arrays of non-iid independent variables. For iid sequences, Lindeberg holds automatically under finite variance, so reading "by Lindeberg's CLT" in a paper on iid data is just a longer way of citing the classical CLT. The condition becomes load-bearing only when the summands have heterogeneous variances or change with $n$, as in regression with deterministic covariates, stratified sampling, and the triangular-array bootstrap.
The martingale CLT does NOT say the noise is iid
SGD noise is not iid: $\xi_t$ depends on $\theta_t$, which depends on past noise. The martingale CLT allows this dependence so long as the noise is mean-zero conditional on the past and the conditional second moments stabilize. Reading the SGD-as-SDE literature as "SGD noise is iid Gaussian" is a common oversimplification; the precise statement is "the rescaled cumulative noise has a Gaussian limit by the martingale CLT".
Summary
- Classical (Lindeberg-Levy) CLT: $\sqrt{n}\,(\bar X_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ for iid finite-variance variables.
- Lindeberg-Feller: triangular arrays of independent variables converge to a Gaussian iff the Lindeberg condition holds (essentially: no summand dominates the variance).
- Lyapunov condition: a $(2+\delta)$-th moment criterion that implies Lindeberg and is easier to verify in practice.
- Berry-Esseen: CDF error at most $C\rho/(\sigma^3\sqrt{n})$; the $n^{-1/2}$ rate is tight.
- Multivariate CLT: $\sqrt{n}\,(\bar X_n - \mu) \xrightarrow{d} \mathcal{N}(0, \Sigma)$, breaks in high dimension.
- Delta method: smooth $g$ pushes the Gaussian forward by the Jacobian; the second-order version handles vanishing gradients.
- Martingale CLT: replaces independence with the martingale-difference property plus conditional variance and conditional Lindeberg conditions; underwrites SGD asymptotics, time-series CLTs, and stochastic-approximation theory.
- ML uses: MLE asymptotic normality, SGD-noise Gaussian approximation, bootstrap validity, sample-covariance asymptotics, asymptotic normality of held-out empirical risk.
- Failure modes: infinite variance gives stable-law limits; high dimension breaks operator-norm CLTs; strong dependence breaks martingale CLTs.
Exercises
Problem
A factory produces bolts whose lengths have mean $\mu$ cm and standard deviation $\sigma$ cm (the distribution of lengths is unknown). You measure $n$ bolts and compute $\bar X_n$. Using the CLT, find an interval that contains $\mu$ with approximately 95% probability. Then apply Berry-Esseen to bound the worst-case CDF error of the Gaussian approximation, assuming a third absolute moment of $\rho$ cm$^3$.
Problem
Let $X_1, \dots, X_n \sim \mathrm{Bernoulli}(p)$ with $p$ unknown, and let $\hat p = \bar X_n$. Use the CLT and the delta method to: (a) derive the asymptotic distribution of $\hat p$ and construct a Wald 95% confidence interval for $p$; (b) derive the asymptotic distribution of the log-odds $\log\!\big(\hat p/(1-\hat p)\big)$ and the corresponding asymptotic variance; (c) explain why the log-odds Wald interval typically has better finite-sample coverage than the untransformed Wald interval, especially near $p = 0$ or $p = 1$.
Problem
Construct a triangular array of independent mean-zero random variables, each with variance $\sigma_{n,i}^2$, for which the CLT conclusion holds even though a naive check of the Lindeberg sum at a single fixed threshold looks alarming. Then construct a different array where the Lindeberg condition fails for some $\varepsilon > 0$ and the CLT also fails. Use these to explain why "Lindeberg fails" needs context: the question is at what $\varepsilon$ and how the variance is allocated.
Problem
The CLT assumes finite variance. For a symmetric stable distribution with index $\alpha < 2$, the variance is infinite. State the generalized CLT for this case, give the correct normalization rate $n^{1/\alpha}$, and explain why this matters for: (i) financial-returns models with heavy tails, (ii) ML algorithms whose theory assumes sub-Gaussian tails (concentration inequalities, generalization bounds), and (iii) the validity of the bootstrap for the sample mean.
Frequently Asked Questions
- What does the Central Limit Theorem say?
- If you average many independent random variables with finite variance, the average looks Gaussian regardless of the original distribution. Formally: for iid $X_i$ with mean $\mu$ and variance $\sigma^2$, $\sqrt{n}\,(\bar X_n - \mu)$ converges in distribution to $\mathcal{N}(0, \sigma^2)$. The convergence is in distribution, and the rate is $O(n^{-1/2})$ under a finite third moment.
- How fast does the CLT converge?
- Berry-Esseen gives the rate: $\sup_t |F_n(t) - \Phi(t)| \le C\rho/(\sigma^3\sqrt{n})$. The convergence is $O(n^{-1/2})$; heavier tails (larger third absolute moment) slow it. For symmetric distributions, higher-order Edgeworth expansions give faster rates in the bulk.
- When does the CLT apply to non-iid data?
- Lindeberg-Feller covers triangular arrays of independent but non-identically-distributed variables, with the Lindeberg condition replacing iid. Martingale CLTs (Brown 1971, McLeish 1974) handle dependent sequences with mean-zero conditional structure, which is the right tool for SGD asymptotics, time series, and stochastic approximation. Stationary mixing sequences have their own CLTs (Ibragimov, Rio).
- When does the CLT fail?
- Three regimes. (1) Infinite variance: stable-law limits at rate $n^{1/\alpha}$ instead of Gaussian at rate $\sqrt{n}$. (2) High dimension: operator-norm CLTs for sample covariances fail when $d/n \not\to 0$, replaced by Marchenko-Pastur asymptotics. (3) Strong dependence: long-range dependence breaks the classical CLT and martingale variants; fractional-Brownian limits at non-$\sqrt{n}$ rates can apply instead.
- Why does the CLT make the MLE asymptotically normal?
- Score-equation expansion: $\sqrt{n}\,(\hat\theta_n - \theta_0) = I(\theta_0)^{-1} \cdot n^{-1/2} \sum_i s(X_i; \theta_0) + o_P(1)$, where the sum is over iid mean-zero scores (Bartlett's first identity) with covariance $I(\theta_0)$. Multivariate CLT plus Slutsky give $\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1})$. The full chain (consistency + score expansion + CLT + Slutsky) is the M-estimator framework on the [asymptotic statistics page](/topics/asymptotic-statistics).
- Is SGD noise actually Gaussian?
- Approximately, by the martingale CLT applied to the per-step noise $\xi_t = \hat g_t - \nabla L(\theta_t)$. The noise is a martingale difference (mean zero given the past), and under bounded second moments and a stationary iterate distribution, the rescaled cumulative noise is Gaussian in the limit. This is what licenses the SGD-as-SDE diffusion approximation; the precise statement is asymptotic, not exact, and the per-step noise distribution is generally non-Gaussian (it is the per-sample-gradient distribution).
References
Canonical:
- Billingsley, Probability and Measure (3rd ed., 1995), Sections 27-29 (classical CLT, Lindeberg-Feller, Berry-Esseen)
- Durrett, Probability: Theory and Examples (5th ed., 2019), Section 3.4 (CLT family) and Section 4.3 (martingale CLT)
- Feller, An Introduction to Probability Theory and Its Applications Vol. 2 (1971), Chapters VIII, XV, XVI
Asymptotic statistics:
- van der Vaart, Asymptotic Statistics (1998), Chapters 2-3 (CLT, delta method, Slutsky) and Chapter 5 (M-estimator asymptotic normality)
- Lehmann, Elements of Large-Sample Theory (1999), Chapters 2-3 (delta method, Lindeberg-Feller in regression)
Martingale CLT:
- Hall and Heyde, Martingale Limit Theory and Its Application (1980), Theorem 3.2 (the standard reference)
- Brown, "Martingale central limit theorems" (1971), Annals of Mathematical Statistics
- McLeish, "Dependent central limit theorems and invariance principles" (1974), Annals of Probability
Modern probabilistic perspective:
- Vershynin, High-Dimensional Probability (2018), Section 2.7 (CLT and Berry-Esseen)
- Kallenberg, Foundations of Modern Probability (3rd ed., 2021), Chapters 5-6
- Casella & Berger, Statistical Inference (2002), Chapters 5-10 (applied perspective)
ML applications:
- Li, Tai, E, "Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms" (2017), arXiv:1511.06251 (SGD-as-SDE, martingale CLT setup)
- Bickel and Freedman, "Some Asymptotic Theory for the Bootstrap" (1981), Annals of Statistics (bootstrap consistency via CLT)
Next Topics
Building on the CLT family:
- Asymptotic statistics: full delta method, M-estimator framework, Le Cam's local asymptotic normality, Wald/score/LRT equivalence
- Maximum likelihood estimation: asymptotic normality of MLE as a CLT consequence
- Bootstrap methods: resampling-based inference and bootstrap consistency
- Concentration inequalities: non-asymptotic, finite-sample alternatives to the CLT
Last reviewed: May 6, 2026