
Statistical Estimation

Goodness-of-Fit Tests

KS test, Shapiro-Wilk, Anderson-Darling, and the probability integral transform: testing whether data follows a specified distribution, with exact power comparisons and when each test fails.

Core · Tier 2 · Stable · Supporting · ~50 min

Why This Matters

Before you apply any parametric method, you need to know whether your distributional assumptions hold. Maximum likelihood estimation assumes a parametric family. The central limit theorem assumes finite variance. Confidence intervals assume approximate normality of the estimator. All of these depend on distributional structure that can be tested.

Goodness-of-fit tests ask: does this data come from the distribution I think it does? The answer determines whether your downstream analysis is trustworthy or built on sand. Applications range from checking normality assumptions to testing whether financial data follows Benford's law for fraud detection.

The Probability Integral Transform

Theorem

Probability Integral Transform

Statement

If $X$ is a continuous random variable with CDF $F$, then:

$$U = F(X) \sim \text{Uniform}(0, 1)$$

Conversely, if $U \sim \text{Uniform}(0, 1)$, then $F^{-1}(U) \sim F$.

Intuition

The CDF maps every distribution to the uniform. If you push data through its own CDF, the result should look uniform. If it does not look uniform, the assumed CDF is wrong. This is the conceptual engine behind all CDF-based goodness-of-fit tests.

Proof Sketch

$\Pr[F(X) \leq u] = \Pr[X \leq F^{-1}(u)] = F(F^{-1}(u)) = u$ for $u \in [0, 1]$. This is the CDF of $\text{Uniform}(0, 1)$.

Why It Matters

The PIT is the theoretical foundation of calibration checks, quantile residuals, and CDF-based test statistics. When you check whether a model's predicted probabilities are calibrated, you are implicitly using the PIT: well-calibrated predictions, pushed through the data CDF, should be uniform.

Failure Mode

Requires continuity of $F$. For discrete distributions, $F(X)$ is not uniform but has a more complex distribution. Modified versions exist (the randomized PIT) but are less clean.
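The theorem is easy to check numerically. Below is a sketch assuming NumPy and SciPy are available; the exponential distribution, sample size, and seed are arbitrary illustrative choices.

```python
# Probability integral transform check (sketch; NumPy/SciPy assumed).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)   # X ~ Exponential with mean 2

# Push the data through its OWN CDF: U = F(X) should be Uniform(0, 1).
u = stats.expon(scale=2.0).cdf(x)
print(u.mean(), u.var())       # near 1/2 and 1/12, the uniform moments

# Pushing through the WRONG CDF does not give a uniform.
u_wrong = stats.expon(scale=1.0).cdf(x)
print(u_wrong.mean())          # noticeably different from 1/2
```

The wrong-CDF branch is the conceptual engine of the tests below: non-uniform output flags a misspecified $F_0$.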

The Kolmogorov-Smirnov Test

Definition

Empirical CDF

The empirical CDF of a sample $x_1, \ldots, x_n$ is:

$$\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}[x_i \leq t]$$

This is a step function that jumps by $1/n$ at each data point. By the law of large numbers, $\hat{F}_n(t) \to F(t)$ pointwise. By the Glivenko-Cantelli theorem, the convergence is uniform.
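The definition translates directly into code. A minimal sketch (the `ecdf` helper name is my own):

```python
# Minimal empirical CDF as a right-continuous step function.
import numpy as np

def ecdf(sample):
    """Return a function t -> (1/n) * #{x_i <= t}."""
    xs = np.sort(np.asarray(sample))
    n = len(xs)
    # searchsorted with side="right" counts the x_i <= t
    return lambda t: np.searchsorted(xs, t, side="right") / n

F_hat = ecdf([3.0, 1.0, 2.0, 2.0])
print(F_hat(0.5), F_hat(2.0), F_hat(10.0))   # 0.0, 0.75, 1.0
```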

Theorem

Kolmogorov-Smirnov Test

Statement

The KS test statistic is:

$$D_n = \sup_t |\hat{F}_n(t) - F_0(t)|$$

Under $H_0: F = F_0$, the distribution of $D_n$ is distribution-free (it does not depend on $F_0$). Asymptotically:

$$\sqrt{n}\, D_n \xrightarrow{d} K$$

where $K$ is the Kolmogorov distribution, with CDF $\Pr[K \leq t] = 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 t^2}$.

Reject $H_0$ when $D_n$ exceeds the critical value from the Kolmogorov distribution.

Intuition

The KS test measures the maximum vertical distance between the empirical CDF and the hypothesized CDF. If the data really came from $F_0$, this distance should be small (of order $1/\sqrt{n}$). A large distance indicates the data does not follow $F_0$.

Proof Sketch

The distribution-free property follows from the probability integral transform: if $F = F_0$, then $F_0(X_i) \sim \text{Uniform}(0, 1)$, so $D_n$ has the same distribution as the KS statistic for testing uniformity, regardless of what $F_0$ is.

Why It Matters

The KS test is nonparametric and distribution-free under $H_0$. It requires no binning (unlike chi-squared tests) and works for any continuous distribution. It is the default first test when checking distributional assumptions.

Failure Mode

The KS test has maximum power in the center of the distribution and low power in the tails. It detects location and scale shifts well but is insensitive to differences concentrated in the tails; for tail-sensitive testing, Anderson-Darling is better. The KS test also requires $F_0$ to be fully specified. If you estimate parameters from the data and then test using those estimates (the Lilliefors problem), the standard KS critical values are wrong and the test is too conservative: the empirical CDF is closer on average to the estimated $\hat{F}_0$ than to a fully prespecified $F_0$, so $D_n$ is systematically smaller than under the unrestricted null, and a fixed-table critical value under-rejects. Use Lilliefors-adjusted critical values instead.
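In practice the one-sample test is a one-liner. A sketch assuming SciPy; note that the null here is fully specified in advance, which is the valid use of KS:

```python
# One-sample KS test with a fully prespecified null (sketch; SciPy assumed).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=500)

# Null specified BEFORE seeing the data: N(0, 1). No Lilliefors problem.
res = stats.kstest(x, stats.norm(loc=0.0, scale=1.0).cdf)
print(res.statistic, res.pvalue)

# Compare D_n to the asymptotic 5% critical value 1.36 / sqrt(n).
crit = 1.36 / np.sqrt(len(x))
print(res.statistic > crit)   # reject at the 5% level?
```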

Theorem

Glivenko-Cantelli Theorem

Statement

$$\sup_t |\hat{F}_n(t) - F(t)| \xrightarrow{a.s.} 0$$

The empirical CDF converges uniformly to the true CDF, almost surely. This is the uniform law of large numbers for CDFs.

Intuition

At each point $t$, $\hat{F}_n(t)$ is a sample mean of Bernoulli variables, so it converges by the law of large numbers. Glivenko-Cantelli upgrades this from pointwise to uniform convergence, which is much stronger.

Why It Matters

This theorem justifies the empirical CDF as a consistent estimator of the true CDF. It is the CDF analog of uniform convergence in learning theory, and the reason why CDF-based tests like KS have power against all alternatives.

Failure Mode

The rate of convergence is $O(1/\sqrt{n})$ by the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality. For heavy-tailed distributions, convergence in the tails can be slow in practice even though the uniform rate is the same.
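The uniform convergence can be watched directly. A sketch assuming NumPy and SciPy; for a continuous $F$ the supremum is attained at a data point, which the code exploits by checking both one-sided gaps at each order statistic:

```python
# Sup-distance between ECDF and true CDF shrinking with n
# (illustrates Glivenko-Cantelli / DKW; sample sizes are arbitrary).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
d = {}
for n in (100, 10_000):
    x = np.sort(rng.normal(size=n))
    F = stats.norm.cdf(x)               # true CDF at the order statistics
    i = np.arange(1, n + 1)
    # sup_t |F_hat - F| is attained at a jump: compare i/n and (i-1)/n to F
    d[n] = float(np.max(np.maximum(i / n - F, F - (i - 1) / n)))
print(d)   # the n = 10,000 distance is roughly 10x smaller
```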

Normality Tests

When the specific question is "is this data Gaussian?", specialized tests outperform the generic KS test.

Shapiro-Wilk test. Computes the ratio of two variance estimates: one based on the expected values of the order statistics under normality, the other the usual sample variance. The test statistic is:

$$W = \frac{\left(\sum_{i=1}^n a_i x_{(i)}\right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$$

where $x_{(i)}$ are the order statistics and $a_i$ are coefficients derived from the expected values and covariance matrix of the standard normal order statistics.

Shapiro-Wilk has the highest power among normality tests for small to moderate sample sizes ($n < 5000$). It detects asymmetry, heavy tails, and light tails effectively. Its main limitation is computational: the exact critical values require specific tables, and the test was originally designed for $n \leq 50$ (though extensions exist for larger samples).
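Usage is a one-liner in SciPy. A sketch; the heavy-tailed $t_2$ alternative, sample size, and seed are arbitrary illustrative choices:

```python
# Shapiro-Wilk on normal data vs. a heavy-tailed alternative (sketch).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
gauss = rng.normal(size=100)
heavy = rng.standard_t(df=2, size=100)   # heavy-tailed alternative

w_g, p_g = stats.shapiro(gauss)          # W near 1 suggests normality
w_h, p_h = stats.shapiro(heavy)
print(f"gaussian: W={w_g:.3f} p={p_g:.3f}")
print(f"t(2):     W={w_h:.3f} p={p_h:.3f}")
```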

Anderson-Darling test. A modification of the KS test that puts more weight on the tails:

$$A^2 = -n - \frac{1}{n} \sum_{i=1}^n (2i - 1)\left[\log F_0(x_{(i)}) + \log\left(1 - F_0(x_{(n+1-i)})\right)\right]$$

The weighting function $1/[F_0(t)(1 - F_0(t))]$ amplifies discrepancies where $F_0(t)$ is near 0 or 1, making Anderson-Darling more sensitive to tail departures than KS.
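A sketch using SciPy's implementation, which reports critical values at fixed significance levels rather than a p-value (the $t_3$ sample is an arbitrary heavy-tailed example):

```python
# Anderson-Darling normality test via SciPy (sketch; SciPy assumed).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.standard_t(df=3, size=300)   # heavy tails, normal-looking center

res = stats.anderson(x, dist='norm')
print(res.statistic)
# SciPy returns critical values at fixed levels instead of a p-value.
for cv, sl in zip(res.critical_values, res.significance_level):
    print(f"  {sl}%: critical value {cv}")
```

Reject at a given level when the statistic exceeds the corresponding critical value.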

Jarque-Bera test. A moment-based normality test that uses the sample skewness $S$ and excess kurtosis $K$:

$$JB = \frac{n}{6}\left(S^2 + \frac{K^2}{4}\right)$$

Under $H_0$ of normality, $JB$ converges to $\chi^2_2$. Easy to compute and standard output in econometrics packages. Low power against alternatives that preserve the first four moments but differ elsewhere. Convergence to the chi-squared limit is slow, so small-sample p-values should come from simulation rather than the asymptotic approximation.
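The statistic is simple enough to compute from raw moments. The sketch below does so and cross-checks against SciPy (assuming `scipy.stats.jarque_bera` is available):

```python
# Jarque-Bera from raw central moments, checked against SciPy (sketch).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(size=2_000)

z = x - x.mean()
m2, m3, m4 = (z**2).mean(), (z**3).mean(), (z**4).mean()
S = m3 / m2**1.5          # sample skewness
K = m4 / m2**2 - 3.0      # sample excess kurtosis
jb = len(x) / 6.0 * (S**2 + K**2 / 4.0)

jb_scipy, p = stats.jarque_bera(x)
print(jb, jb_scipy, p)    # the two statistics agree
```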

Lilliefors test. Vanilla KS requires that $F_0$ be fully specified. If you estimate parameters from the data (for example, testing normality with $\hat{\mu}$ and $\hat{\sigma}^2$ plugged in for $\mu$ and $\sigma^2$), the standard KS critical values are too conservative: the data is closer on average to the estimated distribution than to a randomly chosen member of the same family, so $D_n$ is systematically smaller than under its unrestricted null. Lilliefors (1967) fixed this for the composite normality null by tabulating adjusted critical values via Monte Carlo simulation. A parallel procedure exists for the composite exponentiality null (Lilliefors 1969). Use Lilliefors rather than raw KS whenever parameters are estimated from the same sample.
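The Lilliefors construction can be reproduced in a few lines of Monte Carlo. A sketch: `ks_stat_estimated` is my own helper, and the sample size and rep count are arbitrary. The simulated 5% cutoff comes out well below the raw KS table value, which is exactly why the uncorrected test under-rejects:

```python
# Monte Carlo version of the Lilliefors idea: simulate the null distribution
# of D_n when mu and sigma are estimated from the same sample (sketch).
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, reps = 50, 2_000

def ks_stat_estimated(x):
    """KS distance to N(mu_hat, sigma_hat^2), parameters fit on x itself."""
    z = np.sort((x - x.mean()) / x.std(ddof=1))
    F = stats.norm.cdf(z)
    i = np.arange(1, len(z) + 1)
    return np.max(np.maximum(i / len(z) - F, F - (i - 1) / len(z)))

sims = np.array([ks_stat_estimated(rng.normal(size=n)) for _ in range(reps)])
lilliefors_crit = float(np.quantile(sims, 0.95))
raw_ks_crit = 1.36 / np.sqrt(n)      # standard KS 5% critical value
print(lilliefors_crit, raw_ks_crit)  # the Lilliefors cutoff is much smaller
```

`statsmodels.stats.diagnostic.lilliefors` provides a packaged version of this test.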

Related Tests

Cramér-von Mises. An integrated squared-error alternative to the supremum-based KS:

$$W^2 = n \int_{-\infty}^{\infty} [\hat{F}_n(x) - F_0(x)]^2 \, dF_0(x)$$

Under $H_0$, the limiting distribution of $W^2$ is distribution-free and does not depend on $F_0$ (same argument as KS, via the probability integral transform). Because $W^2$ integrates rather than takes a supremum, it is more sensitive to discrepancies in the middle of the distribution and less driven by a single extreme deviation. Anderson-Darling is the weighted variant with weight $1/[F_0(t)(1 - F_0(t))]$, pushing sensitivity toward the tails.
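A sketch comparing the two statistics on the same data (assuming SciPy, which exposes `cramervonmises` alongside `kstest`; the fully specified standard normal null keeps both tests valid):

```python
# Cramér-von Mises vs. KS on the same sample (sketch; SciPy assumed).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=400)

cvm = stats.cramervonmises(x, 'norm')   # null: standard normal, prespecified
ks = stats.kstest(x, 'norm')
print(cvm.statistic, cvm.pvalue)
print(ks.statistic, ks.pvalue)
```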

Two-sample Kolmogorov-Smirnov. For samples of sizes $m$ and $n$ with empirical CDFs $\hat{F}_m$ and $\hat{G}_n$:

$$D_{m,n} = \sup_x |\hat{F}_m(x) - \hat{G}_n(x)|$$

Tests whether two samples come from the same continuous distribution. Under $H_0$, $\sqrt{mn/(m+n)}\, D_{m,n}$ converges to the Kolmogorov distribution, so the same tables that calibrate the one-sample test apply with effective sample size $mn/(m+n)$. Unlike the one-sample KS, this test does not require specifying either distribution, which makes it a standard distribution-free comparison tool (covariate shift detection, A/B outcome checks, cross-lab sample comparison).
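A sketch of the two-sample test (assuming SciPy; the half-standard-deviation location shift and the sample sizes are arbitrary examples):

```python
# Two-sample KS: neither distribution needs to be specified (sketch).
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
a = rng.normal(loc=0.0, size=300)
b = rng.normal(loc=0.5, size=400)   # location shift between the samples

res = stats.ks_2samp(a, b)
print(res.statistic, res.pvalue)
```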

Q-Q Plots

A Q-Q (quantile-quantile) plot is the visual companion to formal GOF tests. Plot the sorted sample values $x_{(i)}$ against the theoretical quantiles $F_0^{-1}((i - 0.5)/n)$. If the data follow $F_0$, the points fall on the line $y = x$ (up to location and scale if the plot is unstandardized). Departures have recognizable shapes:

  • Heavy tails: S-shape, with the lower-left below the line and the upper-right above it.
  • Light tails: inverse S-shape.
  • Right skew: upward curvature.
  • Left skew: downward curvature.
  • Discrete structure: horizontal steps.

Q-Q plots complement formal tests. At large $n$, p-values become cheap (any tiny deviation from $F_0$ rejects), so a Q-Q plot tells you whether the departure is practically meaningful and where it sits in the distribution. At small $n$, formal tests lack power, and the Q-Q plot can still reveal qualitative misfit that will matter downstream. Report both.
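The plotting coordinates are trivial to compute by hand, and a best-fit line through them recovers location and scale. A sketch assuming NumPy and SciPy; the plotting call itself is omitted on purpose:

```python
# Q-Q plot coordinates for a normal reference (sketch; no plotting).
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
x = np.sort(rng.normal(loc=5.0, scale=2.0, size=200))   # sorted sample

n = len(x)
theo = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)  # F0^{-1}((i-0.5)/n)

# For normal data the points lie near a line; its slope and intercept
# estimate the scale and location of the sample.
slope, intercept = np.polyfit(theo, x, 1)
print(slope, intercept)   # near 2.0 and 5.0
```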

Power Comparison

| Test | Best for | Weakness |
| --- | --- | --- |
| Kolmogorov-Smirnov | General-purpose, any continuous CDF | Low tail sensitivity; requires fully specified $F_0$ |
| Shapiro-Wilk | Normality testing, small samples | Only tests normality, not general distributions |
| Anderson-Darling | Tail-sensitive testing | Requires computing $\log F_0(x_{(i)})$, slower |
| Chi-squared | Discrete distributions, large samples | Requires binning; low power for continuous data |
| Lilliefors | Normality/exponentiality with estimated parameters | Specific to certain families; limited critical value tables |
| Cramér-von Mises | General continuous CDF, mid-distribution sensitivity | Less tail-sensitive than Anderson-Darling |
| Jarque-Bera | Normality via skewness and kurtosis, econometrics | Misses alternatives matching first four moments; slow $\chi^2$ convergence |
| Two-sample KS | Comparing two samples without specifying $F_0$ | Same low tail sensitivity as one-sample KS |

Common Confusions

Watch Out

The KS test requires a fully specified null distribution

If you estimate $\mu$ and $\sigma$ from the data and then test normality using KS with $F_0 = \mathcal{N}(\hat{\mu}, \hat{\sigma}^2)$, the test is invalid. The parameter estimation changes the null distribution of $D_n$, making the standard critical values too conservative (you reject too infrequently). The Lilliefors correction provides the correct critical values for this setting. Alternatively, use Shapiro-Wilk, which is designed for composite hypotheses.

Watch Out

Failing to reject does not confirm the distribution

A goodness-of-fit test can only reject the null. Not rejecting means the data is consistent with $F_0$, not that $F_0$ is true. With small $n$, you lack power to detect even substantial departures. With large $n$, you will reject almost any parametric model because no real data exactly follows a theoretical distribution. The useful question is not "is my data exactly Gaussian?" but "is the Gaussian approximation good enough for my purpose?"

Watch Out

p-values from goodness-of-fit tests are not effect sizes

A small p-value from a KS test says the departure from $F_0$ is statistically significant, but it says nothing about how large or practically important the departure is. Report $D_n$ (the maximum CDF discrepancy) alongside the p-value. A $D_n$ of 0.02 with $n = 50{,}000$ is statistically significant but practically negligible for most applications.
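A sketch of the recommended reporting style (assuming SciPy; the near-normal $t_{30}$ sample is an arbitrary example of a small but real departure from a prespecified standard normal null):

```python
# Report the KS effect size D_n alongside the p-value (sketch).
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
x = rng.standard_t(df=30, size=50_000)   # close to, but not exactly, N(0,1)

res = stats.kstest(x, 'norm')            # null prespecified, so KS is valid
print(f"D_n = {res.statistic:.4f}, p = {res.pvalue:.2g}")
# A rejection here reflects a real but tiny discrepancy; D_n quantifies it.
```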

Exercises

Exercise (Core)

Problem

You have 200 observations and want to test whether they come from an Exponential(1) distribution. You compute $D_n = 0.08$. The KS critical value at $\alpha = 0.05$ for $n = 200$ is approximately $1.36/\sqrt{200} \approx 0.096$. Do you reject?

Exercise (Advanced)

Problem

Explain why the Anderson-Darling test is more powerful than KS for detecting heavy-tailed alternatives to normality. What is the weighting function, and how does it change the test's sensitivity?

References

Canonical:

  • Lehmann & Romano, Testing Statistical Hypotheses (3rd ed., 2005), Chapter 14
  • D'Agostino, R.B., & Stephens, M.A. (eds.), Goodness-of-Fit Techniques (Marcel Dekker, 1986). The comprehensive reference for GOF testing.
  • Thas, O., Comparing Distributions (Springer, 2010), Ch. 3-7. Modern textbook treatment of EDF-based and smooth tests.
  • Casella, G., & Berger, R.L., Statistical Inference (2nd ed., Duxbury, 2001), Ch. 10.3-10.4. Chi-squared and likelihood ratio tests for GOF.
  • Shapiro, S.S., & Wilk, M.B., "An Analysis of Variance Test for Normality" (Biometrika, 1965)
  • Stephens, M.A., "EDF Statistics for Goodness of Fit and Some Comparisons" (JASA 69(347):730-737, 1974). Original tables for KS, Cramér-von Mises, and Anderson-Darling.
  • Lilliefors, H.W., "On the Kolmogorov-Smirnov Test for Normality with Mean and Variance Unknown" (JASA 62(318):399-402, 1967).

Current:

  • Razali, N.M., & Wah, Y.B., "Power Comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling Tests" (Journal of Statistical Modeling and Analytics, 2011)
  • Jarque, C.M., & Bera, A.K., "A Test for Normality of Observations and Regression Residuals" (International Statistical Review 55(2):163-172, 1987)

Next Topics

  • Neyman-Pearson theory: the theoretical framework underlying all hypothesis tests
  • Calibration: when you need to test whether a model's predicted probabilities match observed frequencies

Last reviewed: April 26, 2026
