
Statistical Estimation

Goodness-of-Fit Tests

KS test, Shapiro-Wilk, Anderson-Darling, and the probability integral transform: testing whether data follows a specified distribution, with exact power comparisons and when each test fails.

Core · Tier 2 · Stable · Supporting · ~50 min

Why This Matters

Before you apply any parametric method, you need to know whether your distributional assumptions hold. Maximum likelihood estimation assumes a parametric family. The central limit theorem assumes finite variance. Confidence intervals assume approximate normality of the estimator. All of these depend on distributional structure that can be tested.

Goodness-of-fit tests ask: does this data come from the distribution I think it does? The answer determines whether your downstream analysis is trustworthy or built on sand. Applications range from checking normality assumptions to testing whether financial data follows Benford's law for fraud detection.

The Probability Integral Transform

Theorem

Probability Integral Transform

Statement

If $X$ is a continuous random variable with CDF $F$, then:

$$U = F(X) \sim \text{Uniform}(0, 1)$$

Conversely, if $U \sim \text{Uniform}(0, 1)$, then $F^{-1}(U) \sim F$.

Intuition

The CDF maps every distribution to the uniform. If you push data through its own CDF, the result should look uniform. If it does not look uniform, the assumed CDF is wrong. This is the conceptual engine behind all CDF-based goodness-of-fit tests.

Proof Sketch

$\Pr[F(X) \leq u] = \Pr[X \leq F^{-1}(u)] = F(F^{-1}(u)) = u$ for $u \in [0, 1]$. This is the CDF of $\text{Uniform}(0, 1)$.

Why It Matters

The PIT is the theoretical foundation of calibration checks, quantile residuals, and CDF-based test statistics. When you check whether a model's predicted probabilities are calibrated, you are implicitly using the PIT: well-calibrated predictions, pushed through the data CDF, should be uniform.

Failure Mode

Requires continuity of $F$. For discrete distributions, $F(X)$ is not uniform but has a more complex distribution. Modified versions exist (the randomized PIT) but are less clean.
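The theorem is easy to check numerically. Below is a sketch assuming NumPy and SciPy are available; the exponential distribution, sample size, and seed are arbitrary illustrative choices.

```python
# Probability integral transform check (sketch; NumPy/SciPy assumed).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)   # X ~ Exponential with mean 2

# Push the data through its OWN CDF: U = F(X) should be Uniform(0, 1).
u = stats.expon(scale=2.0).cdf(x)
print(u.mean(), u.var())       # near 1/2 and 1/12, the uniform moments

# Pushing through the WRONG CDF does not give a uniform.
u_wrong = stats.expon(scale=1.0).cdf(x)
print(u_wrong.mean())          # noticeably different from 1/2
```

The wrong-CDF branch is the conceptual engine of the tests below: non-uniform output flags a misspecified $F_0$.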

The Kolmogorov-Smirnov Test

Definition

Empirical CDF

The empirical CDF of a sample $x_1, \ldots, x_n$ is:

$$\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}[x_i \leq t]$$

This is a step function that jumps by $1/n$ at each data point. By the law of large numbers, $\hat{F}_n(t) \to F(t)$ pointwise. By the Glivenko-Cantelli theorem, the convergence is uniform.
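The definition translates directly into code. A minimal sketch (the `ecdf` helper name is my own):

```python
# Minimal empirical CDF as a right-continuous step function.
import numpy as np

def ecdf(sample):
    """Return a function t -> (1/n) * #{x_i <= t}."""
    xs = np.sort(np.asarray(sample))
    n = len(xs)
    # searchsorted with side="right" counts the x_i <= t
    return lambda t: np.searchsorted(xs, t, side="right") / n

F_hat = ecdf([3.0, 1.0, 2.0, 2.0])
print(F_hat(0.5), F_hat(2.0), F_hat(10.0))   # 0.0, 0.75, 1.0
```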

Theorem

Kolmogorov-Smirnov Test

Statement

The KS test statistic is:

$$D_n = \sup_t |\hat{F}_n(t) - F_0(t)|$$

Under $H_0: F = F_0$, the distribution of $D_n$ is distribution-free (it does not depend on $F_0$). Asymptotically:

$$\sqrt{n}\, D_n \xrightarrow{d} K$$

where $K$ is the Kolmogorov distribution, with CDF $\Pr[K \leq t] = 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 t^2}$.

Reject $H_0$ when $D_n$ exceeds the critical value from the Kolmogorov distribution.

Intuition

The KS test measures the maximum vertical distance between the empirical CDF and the hypothesized CDF. If the data really came from $F_0$, this distance should be small (of order $1/\sqrt{n}$). A large distance indicates the data does not follow $F_0$.

Proof Sketch

The distribution-free property follows from the probability integral transform: if $F = F_0$, then $F_0(X_i) \sim \text{Uniform}(0, 1)$, so $D_n$ has the same distribution as the KS statistic for testing uniformity, regardless of what $F_0$ is.

Why It Matters

The KS test is nonparametric and distribution-free under $H_0$. It requires no binning (unlike chi-squared tests) and works for any continuous distribution. It is the default first test when checking distributional assumptions.

Failure Mode

The KS test has maximum power in the center of the distribution and low power in the tails. It detects location and scale shifts well but is insensitive to differences concentrated in the tails; for tail-sensitive testing, Anderson-Darling is better. The KS test also requires $F_0$ to be fully specified. If you estimate parameters from the data and then test using those estimates (the Lilliefors problem), the standard KS critical values are wrong and the test is too conservative: the empirical CDF is closer on average to the estimated $\hat{F}_0$ than to a fully prespecified $F_0$, so $D_n$ is systematically smaller than under the unrestricted null, and a fixed-table critical value under-rejects. Use Lilliefors-adjusted critical values instead.
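In practice the one-sample test is a one-liner. A sketch assuming SciPy; note that the null here is fully specified in advance, which is the valid use of KS:

```python
# One-sample KS test with a fully prespecified null (sketch; SciPy assumed).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=500)

# Null specified BEFORE seeing the data: N(0, 1). No Lilliefors problem.
res = stats.kstest(x, stats.norm(loc=0.0, scale=1.0).cdf)
print(res.statistic, res.pvalue)

# Compare D_n to the asymptotic 5% critical value 1.36 / sqrt(n).
crit = 1.36 / np.sqrt(len(x))
print(res.statistic > crit)   # reject at the 5% level?
```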

Theorem

Glivenko-Cantelli Theorem

Statement

$$\sup_t |\hat{F}_n(t) - F(t)| \xrightarrow{a.s.} 0$$

The empirical CDF converges uniformly to the true CDF, almost surely. This is the uniform law of large numbers for CDFs.

Intuition

At each point $t$, $\hat{F}_n(t)$ is a sample mean of Bernoulli variables, so it converges by the law of large numbers. Glivenko-Cantelli upgrades this from pointwise to uniform convergence, which is much stronger.

Why It Matters

This theorem justifies the empirical CDF as a consistent estimator of the true CDF. It is the CDF analog of uniform convergence in learning theory, and the reason why CDF-based tests like KS have power against all alternatives.

Failure Mode

The rate of convergence is $O(1/\sqrt{n})$ by the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality. For heavy-tailed distributions, convergence in the tails can be slow in practice even though the uniform rate is the same.
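The uniform convergence can be watched directly. A sketch assuming NumPy and SciPy; for a continuous $F$ the supremum is attained at a data point, which the code exploits by checking both one-sided gaps at each order statistic:

```python
# Sup-distance between ECDF and true CDF shrinking with n
# (illustrates Glivenko-Cantelli / DKW; sample sizes are arbitrary).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
d = {}
for n in (100, 10_000):
    x = np.sort(rng.normal(size=n))
    F = stats.norm.cdf(x)               # true CDF at the order statistics
    i = np.arange(1, n + 1)
    # sup_t |F_hat - F| is attained at a jump: compare i/n and (i-1)/n to F
    d[n] = float(np.max(np.maximum(i / n - F, F - (i - 1) / n)))
print(d)   # the n = 10,000 distance is roughly 10x smaller
```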

Normality Tests

When the specific question is "is this data Gaussian?", specialized tests outperform the generic KS test.

Shapiro-Wilk test. Computes the ratio of two variance estimates: one based on the expected values of the order statistics under normality, the other the usual sample variance. The test statistic is:

$$W = \frac{\left(\sum_{i=1}^n a_i x_{(i)}\right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$$

where $x_{(i)}$ are the order statistics and $a_i$ are coefficients derived from the expected values and covariance matrix of the standard normal order statistics.

Shapiro-Wilk has the highest power among normality tests for small to moderate sample sizes ($n < 5000$). It detects asymmetry, heavy tails, and light tails effectively. Its main limitation is computational: the exact critical values require specific tables, and the test was originally designed for $n \leq 50$ (though extensions exist for larger samples).
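Usage is a one-liner in SciPy. A sketch; the heavy-tailed $t_2$ alternative, sample size, and seed are arbitrary illustrative choices:

```python
# Shapiro-Wilk on normal data vs. a heavy-tailed alternative (sketch).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
gauss = rng.normal(size=100)
heavy = rng.standard_t(df=2, size=100)   # heavy-tailed alternative

w_g, p_g = stats.shapiro(gauss)          # W near 1 suggests normality
w_h, p_h = stats.shapiro(heavy)
print(f"gaussian: W={w_g:.3f} p={p_g:.3f}")
print(f"t(2):     W={w_h:.3f} p={p_h:.3f}")
```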

Anderson-Darling test. A modification of the KS test that puts more weight on the tails:

$$A^2 = -n - \frac{1}{n} \sum_{i=1}^n (2i - 1)\left[\log F_0(x_{(i)}) + \log\left(1 - F_0(x_{(n+1-i)})\right)\right]$$

The weighting function $1/[F_0(t)(1 - F_0(t))]$ amplifies discrepancies where $F_0(t)$ is near 0 or 1, making Anderson-Darling more sensitive to tail departures than KS.
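A sketch using SciPy's implementation, which reports critical values at fixed significance levels rather than a p-value (the $t_3$ sample is an arbitrary heavy-tailed example):

```python
# Anderson-Darling normality test via SciPy (sketch; SciPy assumed).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.standard_t(df=3, size=300)   # heavy tails, normal-looking center

res = stats.anderson(x, dist='norm')
print(res.statistic)
# SciPy returns critical values at fixed levels instead of a p-value.
for cv, sl in zip(res.critical_values, res.significance_level):
    print(f"  {sl}%: critical value {cv}")
```

Reject at a given level when the statistic exceeds the corresponding critical value.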

Jarque-Bera test. A moment-based normality test that uses the sample skewness $S$ and excess kurtosis $K$:

$$JB = \frac{n}{6}\left(S^2 + \frac{K^2}{4}\right)$$

Under $H_0$ of normality, $JB$ converges to $\chi^2_2$. Easy to compute and standard output in econometrics packages. Low power against alternatives that preserve the first four moments but differ elsewhere. Convergence to the chi-squared limit is slow, so small-sample p-values should come from simulation rather than the asymptotic approximation.
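The statistic is simple enough to compute from raw moments. The sketch below does so and cross-checks against SciPy (assuming `scipy.stats.jarque_bera` is available):

```python
# Jarque-Bera from raw central moments, checked against SciPy (sketch).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(size=2_000)

z = x - x.mean()
m2, m3, m4 = (z**2).mean(), (z**3).mean(), (z**4).mean()
S = m3 / m2**1.5          # sample skewness
K = m4 / m2**2 - 3.0      # sample excess kurtosis
jb = len(x) / 6.0 * (S**2 + K**2 / 4.0)

jb_scipy, p = stats.jarque_bera(x)
print(jb, jb_scipy, p)    # the two statistics agree
```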

Lilliefors test. Vanilla KS requires that $F_0$ be fully specified. If you estimate parameters from the data (for example, testing normality with $\hat{\mu}$ and $\hat{\sigma}^2$ plugged in for $\mu$ and $\sigma^2$), the standard KS critical values are too conservative: the data is closer on average to the estimated distribution than to a randomly chosen member of the same family, so $D_n$ is systematically smaller than under its unrestricted null. Lilliefors (1967) fixed this for the composite normality null by tabulating adjusted critical values via Monte Carlo simulation. A parallel procedure exists for the composite exponentiality null (Lilliefors 1969). Use Lilliefors rather than raw KS whenever parameters are estimated from the same sample.
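The Lilliefors construction can be reproduced in a few lines of Monte Carlo. A sketch: `ks_stat_estimated` is my own helper, and the sample size and rep count are arbitrary. The simulated 5% cutoff comes out well below the raw KS table value, which is exactly why the uncorrected test under-rejects:

```python
# Monte Carlo version of the Lilliefors idea: simulate the null distribution
# of D_n when mu and sigma are estimated from the same sample (sketch).
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, reps = 50, 2_000

def ks_stat_estimated(x):
    """KS distance to N(mu_hat, sigma_hat^2), parameters fit on x itself."""
    z = np.sort((x - x.mean()) / x.std(ddof=1))
    F = stats.norm.cdf(z)
    i = np.arange(1, len(z) + 1)
    return np.max(np.maximum(i / len(z) - F, F - (i - 1) / len(z)))

sims = np.array([ks_stat_estimated(rng.normal(size=n)) for _ in range(reps)])
lilliefors_crit = float(np.quantile(sims, 0.95))
raw_ks_crit = 1.36 / np.sqrt(n)      # standard KS 5% critical value
print(lilliefors_crit, raw_ks_crit)  # the Lilliefors cutoff is much smaller
```

`statsmodels.stats.diagnostic.lilliefors` provides a packaged version of this test.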

Related Tests

Cramér-von Mises. An integrated squared-error alternative to the supremum-based KS:

$$W^2 = n \int_{-\infty}^{\infty} [\hat{F}_n(x) - F_0(x)]^2 \, dF_0(x)$$

Under $H_0$, the limiting distribution of $W^2$ is distribution-free and does not depend on $F_0$ (same argument as KS, via the probability integral transform). Because $W^2$ integrates rather than takes a supremum, it is more sensitive to discrepancies in the middle of the distribution and less driven by a single extreme deviation. Anderson-Darling is the weighted variant with weight $1/[F_0(t)(1 - F_0(t))]$, pushing sensitivity toward the tails.
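A sketch comparing the two statistics on the same data (assuming SciPy, which exposes `cramervonmises` alongside `kstest`; the fully specified standard normal null keeps both tests valid):

```python
# Cramér-von Mises vs. KS on the same sample (sketch; SciPy assumed).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=400)

cvm = stats.cramervonmises(x, 'norm')   # null: standard normal, prespecified
ks = stats.kstest(x, 'norm')
print(cvm.statistic, cvm.pvalue)
print(ks.statistic, ks.pvalue)
```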

Two-sample Kolmogorov-Smirnov. For samples of sizes $m$ and $n$ with empirical CDFs $\hat{F}_m$ and $\hat{G}_n$:

$$D_{m,n} = \sup_x |\hat{F}_m(x) - \hat{G}_n(x)|$$

Tests whether two samples come from the same continuous distribution. Under $H_0$, $\sqrt{mn/(m+n)}\, D_{m,n}$ converges to the Kolmogorov distribution, so the same tables that calibrate the one-sample test apply with effective sample size $mn/(m+n)$. Unlike the one-sample KS, this test does not require specifying either distribution, which makes it a standard distribution-free comparison tool (covariate shift detection, A/B outcome checks, cross-lab sample comparison).
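A sketch of the two-sample test (assuming SciPy; the half-standard-deviation location shift and the sample sizes are arbitrary examples):

```python
# Two-sample KS: neither distribution needs to be specified (sketch).
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
a = rng.normal(loc=0.0, size=300)
b = rng.normal(loc=0.5, size=400)   # location shift between the samples

res = stats.ks_2samp(a, b)
print(res.statistic, res.pvalue)
```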

Q-Q Plots

A Q-Q (quantile-quantile) plot is the visual companion to formal GOF tests. Plot the sorted sample values $x_{(i)}$ against the theoretical quantiles $F_0^{-1}((i - 0.5)/n)$. If the data follow $F_0$, the points fall on the line $y = x$ (up to location and scale if the plot is unstandardized). Departures have recognizable shapes:

  • Heavy tails: S-shape, with the lower-left below the line and the upper-right above it.
  • Light tails: inverse S-shape.
  • Right skew: upward curvature.
  • Left skew: downward curvature.
  • Discrete structure: horizontal steps.

Q-Q plots complement formal tests. At large $n$, p-values become cheap (any tiny deviation from $F_0$ rejects), so a Q-Q plot tells you whether the departure is practically meaningful and where it sits in the distribution. At small $n$, formal tests lack power, and the Q-Q plot can still reveal qualitative misfit that will matter downstream. Report both.
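The plotting coordinates are trivial to compute by hand, and a best-fit line through them recovers location and scale. A sketch assuming NumPy and SciPy; the plotting call itself is omitted on purpose:

```python
# Q-Q plot coordinates for a normal reference (sketch; no plotting).
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
x = np.sort(rng.normal(loc=5.0, scale=2.0, size=200))   # sorted sample

n = len(x)
theo = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)  # F0^{-1}((i-0.5)/n)

# For normal data the points lie near a line; its slope and intercept
# estimate the scale and location of the sample.
slope, intercept = np.polyfit(theo, x, 1)
print(slope, intercept)   # near 2.0 and 5.0
```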

Power Comparison

| Test | Best for | Weakness |
| --- | --- | --- |
| Kolmogorov-Smirnov | General-purpose, any continuous CDF | Low tail sensitivity; requires fully specified $F_0$ |
| Shapiro-Wilk | Normality testing, small samples | Only tests normality, not general distributions |
| Anderson-Darling | Tail-sensitive testing | Requires computing $\log F_0(x_{(i)})$, slower |
| Chi-squared | Discrete distributions, large samples | Requires binning; low power for continuous data |
| Lilliefors | Normality/exponentiality with estimated parameters | Specific to certain families; limited critical value tables |
| Cramér-von Mises | General continuous CDF, mid-distribution sensitivity | Less tail-sensitive than Anderson-Darling |
| Jarque-Bera | Normality via skewness and kurtosis, econometrics | Misses alternatives matching first four moments; slow $\chi^2$ convergence |
| Two-sample KS | Comparing two samples without specifying $F_0$ | Same low tail sensitivity as one-sample KS |

Common Confusions

Watch Out

The KS test requires a fully specified null distribution

If you estimate $\mu$ and $\sigma$ from the data and then test normality using KS with $F_0 = \mathcal{N}(\hat{\mu}, \hat{\sigma}^2)$, the test is invalid. The parameter estimation changes the null distribution of $D_n$, making the standard critical values too conservative (you reject too infrequently). The Lilliefors correction provides the correct critical values for this setting. Alternatively, use Shapiro-Wilk, which is designed for composite hypotheses.

Watch Out

Failing to reject does not confirm the distribution

A goodness-of-fit test can only reject the null. Not rejecting means the data is consistent with $F_0$, not that $F_0$ is true. With small $n$, you lack power to detect even substantial departures. With large $n$, you will reject almost any parametric model because no real data exactly follows a theoretical distribution. The useful question is not "is my data exactly Gaussian?" but "is the Gaussian approximation good enough for my purpose?"

Watch Out

p-values from goodness-of-fit tests are not effect sizes

A small p-value from a KS test says the departure from $F_0$ is statistically significant, but it says nothing about how large or practically important the departure is. Report $D_n$ (the maximum CDF discrepancy) alongside the p-value. A $D_n$ of 0.02 with $n = 50{,}000$ is statistically significant but practically negligible for most applications.
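A sketch of the recommended reporting style (assuming SciPy; the near-normal $t_{30}$ sample is an arbitrary example of a small but real departure from a prespecified standard normal null):

```python
# Report the KS effect size D_n alongside the p-value (sketch).
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
x = rng.standard_t(df=30, size=50_000)   # close to, but not exactly, N(0,1)

res = stats.kstest(x, 'norm')            # null prespecified, so KS is valid
print(f"D_n = {res.statistic:.4f}, p = {res.pvalue:.2g}")
# A rejection here reflects a real but tiny discrepancy; D_n quantifies it.
```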

Exercises

Exercise (Core)

Problem

You have 200 observations and want to test whether they come from an Exponential(1) distribution. You compute $D_n = 0.08$. The KS critical value at $\alpha = 0.05$ for $n = 200$ is approximately $1.36/\sqrt{200} \approx 0.096$. Do you reject?

Exercise (Advanced)

Problem

Explain why the Anderson-Darling test is more powerful than KS for detecting heavy-tailed alternatives to normality. What is the weighting function, and how does it change the test's sensitivity?

References

Canonical:

  • Lehmann & Romano, Testing Statistical Hypotheses (3rd ed., 2005), Chapter 14
  • D'Agostino, R.B., & Stephens, M.A. (eds.), Goodness-of-Fit Techniques (Marcel Dekker, 1986). The comprehensive reference for GOF testing.
  • Thas, O., Comparing Distributions (Springer, 2010), Ch. 3-7. Modern textbook treatment of EDF-based and smooth tests.
  • Casella, G., & Berger, R.L., Statistical Inference (2nd ed., Duxbury, 2001), Ch. 10.3-10.4. Chi-squared and likelihood ratio tests for GOF.
  • Shapiro, S.S., & Wilk, M.B., "An Analysis of Variance Test for Normality" (Biometrika, 1965)
  • Stephens, M.A., "EDF Statistics for Goodness of Fit and Some Comparisons" (JASA 69(347):730-737, 1974). Original tables for KS, Cramér-von Mises, and Anderson-Darling.
  • Lilliefors, H.W., "On the Kolmogorov-Smirnov Test for Normality with Mean and Variance Unknown" (JASA 62(318):399-402, 1967).

Current:

  • Razali, N.M., & Wah, Y.B., "Power Comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling Tests" (Journal of Statistical Modeling and Analytics, 2011)
  • Jarque, C.M., & Bera, A.K., "A Test for Normality of Observations and Regression Residuals" (International Statistical Review 55(2):163-172, 1987)

Next Topics

  • Neyman-Pearson theory: the theoretical framework underlying all hypothesis tests
  • Calibration: when you need to test whether a model's predicted probabilities match observed frequencies

Last reviewed: April 26, 2026
