
Methodology

Hypothesis Testing for ML

Null and alternative hypotheses, p-values, confidence intervals, error types, multiple comparisons, and proper statistical tests for comparing ML models.


Why This Matters

Machine learning papers routinely claim that model A outperforms model B. But how do you know the difference is real and not an artifact of the random train/test split, initialization, or data ordering? Hypothesis testing provides the formal framework for answering this question. Without it, you cannot distinguish signal from noise in experimental comparisons. The multiple comparisons problem is especially critical in ML, where researchers often compare many models, hyperparameter settings, or metrics simultaneously.

Mental Model

You have two models. You run them on a test set and model A scores 0.85 while model B scores 0.83. Is A better than B, or did you just get lucky? Hypothesis testing formalizes this question: assume the models are equally good (null hypothesis), compute how unlikely the observed difference is under that assumption (p-value), and reject the null if the data is sufficiently surprising.

Figure: null distribution and rejection region

A test is a rule fixed before the data: choose a statistic, choose a rejection region, then control how often the rule falsely rejects a true null. (The figure shows the H_0 and H_1 densities, the critical value, \alpha, and power.)

Type I error

The red area is \alpha: rejecting H_0 even though H_0 is true.

Power

The green area is 1 - \beta: rejecting H_0 when the alternative H_1 is true.

p-value

A p-value is a tail area under H_0, not under H_1.

The Neyman-Pearson Framework

Definition

Null Hypothesis

The null hypothesis H_0 is the default assumption you are trying to disprove. In ML model comparison:

H_0: \mu_A = \mu_B

where \mu_A and \mu_B are the true expected performances of models A and B. The null says there is no real difference between the models.

Definition

Alternative Hypothesis

The alternative hypothesis H_1 is the claim tested against H_0; rejecting H_0 is evidence in favor of H_1.

  • Two-sided: H_1: \mu_A \neq \mu_B (the models differ)
  • One-sided: H_1: \mu_A > \mu_B (model A is better)

In ML, two-sided tests are more conservative and generally preferred unless you have a strong prior reason to test only one direction.

Definition

p-Value

The p-value is the probability of observing a test statistic at least as extreme as the one computed from your data, assuming H_0 is true. The exact form depends on the alternative:

  • One-sided (H_1: \theta > \theta_0): p = P(T \geq t_{obs} \mid H_0)
  • Two-sided (H_1: \theta \neq \theta_0): p = P(|T| \geq |t_{obs}| \mid H_0)

Since two-sided tests are the default in ML model comparison, use the two-sided form unless you committed in advance to a directional alternative. A small p-value means the observed data is unlikely under H_0, which is evidence against H_0.

The p-value is not the probability that H_0 is true. It is the probability of the data given H_0, not the probability of H_0 given the data.
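The one- and two-sided definitions above can be checked numerically. A minimal sketch, assuming the test statistic is standard normal under H_0 (the function names are illustrative, not from any particular library):

```python
import math

def normal_sf(x: float) -> float:
    """Survival function P(Z >= x) of a standard normal, via the error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def z_test_p_values(t_obs: float):
    """One- and two-sided p-values for an observed z-statistic under H0: T ~ N(0, 1)."""
    one_sided = normal_sf(t_obs)              # P(T >= t_obs | H0)
    two_sided = 2.0 * normal_sf(abs(t_obs))   # P(|T| >= |t_obs| | H0)
    return one_sided, two_sided

one, two = z_test_p_values(1.96)
print(f"one-sided p = {one:.4f}, two-sided p = {two:.4f}")
```

Note that the two-sided p-value is exactly twice the one-sided tail for a symmetric null distribution, which is why committing to a direction in advance halves the p-value.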

Definition

Significance Level

The significance level \alpha is the threshold for rejecting H_0: if p \leq \alpha, we reject H_0 and declare the result "statistically significant." The standard choice is \alpha = 0.05, meaning we accept a 5% chance of falsely rejecting a true null hypothesis.

Watch Out

p = 0.05 is a convention, not a theorem

The 0.05 threshold comes from Fisher (1925), who offered it as a rough guide, not a principled cutoff. The ASA's 2016 and 2019 statements on p-values warn against dichotomous threshold-based decisions. The replication crisis in psychology and biomedicine (Open Science Collaboration 2015; Ioannidis 2005) showed that a large fraction of "significant" findings at p < 0.05 fail to replicate. In ML practice: test across multiple seeds, report effect sizes and confidence intervals, and treat p < 0.05 on a single run as weak evidence.

Error Types

Proposition

Type I and Type II Errors

Statement

There are two types of errors in hypothesis testing:

                     | H_0 true              | H_0 false
  Reject H_0         | Type I error (\alpha) | Correct (power)
  Do not reject H_0  | Correct               | Type II error (\beta)

  • Type I error (false positive): rejecting H_0 when it is true. Probability = \alpha.
  • Type II error (false negative): failing to reject H_0 when it is false. Probability = \beta.
  • Power = 1 - \beta: probability of correctly rejecting a false H_0.

Intuition

Type I error is a false alarm: you claim the models are different when they are not. Type II error is a missed detection: the models truly differ but your test fails to detect it. For a fixed sample size, decreasing \alpha (being more conservative) increases \beta (more missed detections).

Proof Sketch

By definition, \alpha = P(\text{reject } H_0 \mid H_0 \text{ true}), which is controlled directly by the choice of significance level. The power depends on the true effect size \delta = |\mu_A - \mu_B|, the variance \sigma^2, and the sample size n:

\text{power} \approx \Phi\left(\frac{\delta\sqrt{n}}{\sigma} - z_{\alpha/2}\right)
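The power approximation can be evaluated directly. A minimal sketch using only the standard library (\Phi via the error function; the effect size, standard deviation, and seed counts below are made-up numbers for illustration):

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def approx_power(delta: float, sigma: float, n: int, z_crit: float = 1.96) -> float:
    """Normal approximation to two-sided test power:
    power ~ Phi(delta * sqrt(n) / sigma - z_{alpha/2})."""
    return normal_cdf(delta * math.sqrt(n) / sigma - z_crit)

# A 1-point accuracy gap (delta = 0.01) with run-to-run std 0.02:
for n in (3, 10, 30):
    print(f"n = {n:2d} seeds -> power ~ {approx_power(0.01, 0.02, n):.2f}")
```

With 3 seeds the power is well below 0.5: a real 1-point improvement is more likely to be missed than detected, which is exactly the underpowered-experiment failure mode described below.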

Why It Matters

In ML, Type I errors lead to published claims of improvement that do not replicate. Type II errors lead to dismissing genuinely better methods. Both are costly. Understanding the tradeoff is essential for designing experiments with adequate power.

Failure Mode

Most ML experiments are underpowered: they use too few random seeds or train/test splits to detect real but small improvements. An underpowered test has high \beta, meaning it frequently fails to detect true differences.

Confidence Intervals

Definition

Confidence Interval

A (1 - \alpha) confidence interval for a parameter \theta is a random interval [L, U] whose coverage probability is at least 1 - \alpha:

P_\theta(\theta \in [L, U]) \geq 1 - \alpha

for every parameter value \theta in the model. When equality holds, the interval has exact coverage.

For the difference in model performance \delta = \mu_A - \mu_B, a 95% CI provides a range of plausible values for the true difference. If the CI excludes zero, the difference is significant at \alpha = 0.05.

Confidence intervals are more informative than p-values alone because they communicate both statistical significance and effect size.
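A minimal sketch of the CI-excludes-zero check, using a normal approximation on paired differences (the per-seed accuracy gaps below are hypothetical; for small samples, replace 1.96 with the t critical value for n - 1 degrees of freedom):

```python
import math
import statistics

def diff_ci(d, z: float = 1.96):
    """Approximate 95% CI for the mean paired difference:
    mean(d) +/- z * s_d / sqrt(n), with s_d the (n-1)-denominator std."""
    n = len(d)
    mean = statistics.fmean(d)
    half = z * statistics.stdev(d) / math.sqrt(n)
    return mean - half, mean + half

# Hypothetical per-seed accuracy gaps between models A and B:
d = [0.021, 0.013, 0.008, 0.024, 0.015, 0.011, 0.019, 0.009, 0.017, 0.012]
lo, hi = diff_ci(d)
print(f"95% CI for the true gap: [{lo:.4f}, {hi:.4f}]")
print("significant at alpha = 0.05:", not (lo <= 0.0 <= hi))
```

The interval communicates both the decision (does it exclude zero?) and the effect size (how large is the plausible gap?), which a bare p-value does not.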

The Multiple Comparisons Problem

When you test many hypotheses simultaneously, the probability of at least one false positive grows rapidly.

Theorem

Bonferroni Correction

Statement

If you perform m simultaneous hypothesis tests and want to control the family-wise error rate (FWER) at level \alpha, reject each individual test i only if p_i \leq \alpha/m. Then:

\text{FWER} = P(\text{at least one false rejection}) \leq \alpha

This holds regardless of the dependence structure among the tests.

Intuition

If you flip a coin 20 times looking for "evidence of bias," you expect at least one long streak just by chance. Bonferroni compensates by making each individual test more stringent: if you test 20 hypotheses, you need p \leq 0.05/20 = 0.0025 for any single one.
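The inflation and the correction are both a few lines to check (the p-values below are hypothetical):

```python
def fwer_uncorrected(m: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across m independent true nulls:
    1 - (1 - alpha)^m."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_reject(p_values, alpha: float = 0.05):
    """Reject H_{0,i} iff p_i <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

print(f"20 tests, no correction: FWER = {fwer_uncorrected(20):.2f}")
p = [0.001, 0.004, 0.012, 0.030, 0.200]
print("Bonferroni rejections at alpha = 0.05:", bonferroni_reject(p))
```

With 20 independent true nulls the uncorrected chance of at least one false positive is about 0.64, far above the nominal 0.05.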

Proof Sketch

By the union bound: P(\exists i: \text{reject } H_{0,i} \mid \text{all } H_{0,i} \text{ true}) \leq \sum_{i=1}^m P(p_i \leq \alpha/m) = m \cdot (\alpha/m) = \alpha.

Why It Matters

In ML, the multiple comparisons problem arises constantly: comparing many models, testing on multiple datasets, evaluating multiple metrics, running experiments with different hyperparameters. Without correction, the false positive rate can be far higher than the nominal \alpha.

Failure Mode

Bonferroni is very conservative: it controls the probability of any false positive, which can make it too strict when m is large, so many true effects are missed. When you care about controlling the proportion of false positives rather than their existence, use FDR control instead.

Holm-Bonferroni (1979) is a uniformly more powerful alternative that still controls FWER at \alpha without additional assumptions. Sort the p-values p_{(1)} \leq \cdots \leq p_{(m)} and reject H_{(i)} iff p_{(j)} \leq \alpha/(m - j + 1) for all j \leq i. The smallest p-value is tested against \alpha/m (same as Bonferroni), but later ones use a larger threshold. Holm dominates Bonferroni: any rejection Bonferroni makes, Holm also makes. Prefer Holm whenever you would have used Bonferroni.
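The step-down procedure is short to implement. A sketch, with hypothetical p-values (statsmodels' multipletests with method='holm' provides a tested implementation):

```python
def holm_reject(p_values, alpha: float = 0.05):
    """Holm step-down: test sorted p-values against alpha/(m - j + 1) and
    stop at the first failure. Returns reject flags in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):            # rank = j - 1 in the 1-based formula
        if p_values[i] <= alpha / (m - rank):   # alpha/m, alpha/(m-1), ...
            reject[i] = True
        else:
            break                               # step-down: all later tests also fail
    return reject

p = [0.012, 0.030, 0.001, 0.004, 0.200]
print("Holm rejections:      ", holm_reject(p))
print("Bonferroni rejections:", [q <= 0.05 / len(p) for q in p])
```

On these p-values Holm rejects 0.012 and 0.004 while plain Bonferroni (threshold 0.01) keeps 0.012, illustrating that Holm is uniformly at least as powerful.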

FDR Control: Benjamini-Hochberg

Definition

False Discovery Rate

The False Discovery Rate is the expected proportion of false positives among all rejected hypotheses:

\text{FDR} = \mathbb{E}\left[\frac{\text{false rejections}}{\text{total rejections}}\right]

The Benjamini-Hochberg (BH) procedure controls FDR at level q:

  1. Order the m p-values: p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}
  2. Find the largest k such that p_{(k)} \leq \frac{k}{m} \cdot q
  3. Reject all hypotheses with p_{(i)} \leq p_{(k)}

BH is less conservative than Bonferroni and is preferred when testing many hypotheses (e.g., comparing models on 50 datasets).

BH controls FDR at level q under independence or positive regression dependence of the test statistics (Benjamini-Hochberg 1995; Benjamini-Yekutieli 2001). Under arbitrary dependence, use the Benjamini-Yekutieli correction, which replaces the threshold with \frac{k}{m \cdot c(m)} q where c(m) = \sum_{j=1}^m 1/j \approx \ln m. This is strictly more conservative but valid without structural assumptions on the joint distribution of p-values.
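The three-step BH procedure translates directly into code. A sketch with hypothetical p-values (statsmodels' multipletests with method='fdr_bh' is a tested alternative):

```python
def bh_reject(p_values, q: float = 0.05):
    """Benjamini-Hochberg: find the largest k with p_(k) <= (k/m) * q and
    reject every hypothesis with p_i <= p_(k)."""
    m = len(p_values)
    ranked = sorted(p_values)
    k = 0
    for i, p in enumerate(ranked, start=1):
        if p <= (i / m) * q:
            k = i                       # keep the LARGEST passing index
    if k == 0:
        return [False] * m
    cutoff = ranked[k - 1]
    return [p <= cutoff for p in p_values]

p = [0.001, 0.008, 0.012, 0.030, 0.041, 0.200]
print("BH rejections:        ", bh_reject(p))
print("Bonferroni rejections:", [x <= 0.05 / len(p) for x in p])
```

Note the step-up structure: 0.041 is rejected because a later ordered p-value clears its threshold, even though 0.030 alone would not clear a Bonferroni bar. Bonferroni rejects only the two smallest here.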

Paired Tests for Model Comparison

When comparing two models on the same data, you should use paired tests that account for the correlation between the two models' errors on each example.

Definition

Paired t-Test for Model Comparison

Given n test examples, let d_i = \ell(A, x_i) - \ell(B, x_i) be the difference in loss between models A and B on example i. The paired t-test statistic is:

t = \frac{\bar{d}}{s_d / \sqrt{n}}, \quad \bar{d} = \frac{1}{n}\sum_i d_i, \quad s_d^2 = \frac{1}{n-1}\sum_i (d_i - \bar{d})^2

Under H_0: \mathbb{E}[d_i] = 0, t follows a Student t-distribution with n - 1 degrees of freedom exactly when the d_i are i.i.d. Gaussian, and approximately for non-Gaussian d_i by the CLT when n is large. The t-distribution is robust to moderate departures from normality but breaks down for heavy-tailed or highly skewed d_i; in those cases use the Wilcoxon signed-rank test or a bootstrap.
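The statistic is a one-liner with the standard library. A sketch with made-up per-example loss differences (in practice scipy.stats.ttest_rel computes both the statistic and the p-value from paired samples):

```python
import math
import statistics

def paired_t(d):
    """Paired t statistic t = mean(d) / (s_d / sqrt(n)), df = n - 1."""
    n = len(d)
    d_bar = statistics.fmean(d)
    s_d = statistics.stdev(d)            # (n-1)-denominator estimate
    return d_bar / (s_d / math.sqrt(n)), n - 1

# Hypothetical loss differences l(A, x_i) - l(B, x_i); negative favors model A:
d = [-0.020, -0.011, -0.015, -0.002, -0.019, -0.008, -0.013, -0.021, -0.005, -0.016]
t, df = paired_t(d)
print(f"t = {t:.2f} on {df} degrees of freedom")
# Two-sided critical value t_{0.975, 9} is about 2.262, so |t| > 2.262
# rejects H_0: E[d_i] = 0 at alpha = 0.05.
print("reject H0 at alpha = 0.05:", abs(t) > 2.262)
```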

Watch Out

Do not reuse a single held-out test set with a standard paired t-test across resamples

The standard paired t-test assumes independent d_i. If you generate differences by repeated random train/test splits of a single dataset, the d_i are positively correlated (the same examples appear in multiple test folds), the variance estimate is too small, and the nominal Type I error is inflated. Nadeau and Bengio (2003) proposed a corrected resampled t-test that inflates the variance term from s_d^2/n to (1/n + n_{test}/n_{train}) \cdot s_d^2 to account for the overlap. Dietterich (1998) proposed an earlier alternative, the 5x2 cv paired t-test, which runs five 2-fold cross-validations and uses only one of the two folds per repetition to keep variance comparisons independent.
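A sketch of the corrected statistic, assuming my reading of the variance inflation above; the per-split accuracy gaps and the 90/10 split sizes are hypothetical:

```python
import math
import statistics

def nadeau_bengio_t(d, n_test: int, n_train: int) -> float:
    """Corrected resampled t statistic (after Nadeau & Bengio, 2003):
    replace the naive variance s_d^2 / n with (1/n + n_test/n_train) * s_d^2
    to account for overlap between random train/test splits."""
    n = len(d)
    d_bar = statistics.fmean(d)
    var_d = statistics.variance(d)
    return d_bar / math.sqrt((1.0 / n + n_test / n_train) * var_d)

# Hypothetical accuracy gaps over 15 random 90/10 splits of 1000 examples:
d = [0.012, 0.018, 0.009, 0.021, 0.014, 0.011, 0.016, 0.008,
     0.019, 0.013, 0.010, 0.017, 0.015, 0.020, 0.007]
naive = statistics.fmean(d) / math.sqrt(statistics.variance(d) / len(d))
corrected = nadeau_bengio_t(d, n_test=100, n_train=900)
print(f"naive t = {naive:.2f}, corrected t = {corrected:.2f}")
```

The corrected statistic is noticeably smaller than the naive one, which is the point: the naive test overstates its confidence because the splits share examples.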

Definition

Wilcoxon Signed-Rank Test

A non-parametric alternative to the paired t-test. It does not assume normality of the differences d_i. Instead, it ranks the absolute differences |d_i| and compares the sum of ranks for positive vs. negative differences. Use it when the differences are not approximately normal (e.g., heavy-tailed or skewed distributions).

Definition

McNemar's Test

The appropriate paired test when the per-example outcome is binary (correct/incorrect). Build a 2 \times 2 contingency table of model A vs. model B outcomes on each test example and let b be the count of examples where A is correct and B is wrong, and c the count where A is wrong and B is correct. Under the null H_0 that both models have the same error rate, the discordant pairs b and c are exchangeable, and

\chi^2 = \frac{(b - c)^2}{b + c}

is asymptotically \chi^2_1. For small counts (b + c < 25) use the exact binomial form: P(X \geq \max(b, c)) where X \sim \text{Binomial}(b + c, 1/2). McNemar's test uses only discordant pairs; concordant pairs (both right or both wrong) carry no signal about which model is better.
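Both forms fit in a few lines. A sketch with hypothetical discordant counts (the exact tail is doubled here for a two-sided test; statsmodels also ships a mcnemar function):

```python
import math

def mcnemar(b: int, c: int):
    """McNemar's test from discordant counts b (A right, B wrong) and
    c (A wrong, B right). Returns the chi-square statistic and an exact
    two-sided binomial p-value for small b + c."""
    chi2 = (b - c) ** 2 / (b + c)
    n, k = b + c, max(b, c)
    # Exact: P(X >= k) for X ~ Binomial(n, 1/2), doubled for two sides.
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    p_exact = min(1.0, 2.0 * tail)
    return chi2, p_exact

# Hypothetical: of 1000 examples, 15 where only A is right, 5 where only B is.
chi2, p = mcnemar(b=15, c=5)
print(f"chi2 = {chi2:.2f}, exact two-sided p = {p:.4f}")
```

Note that the other 980 concordant examples never enter the computation; only the 20 discordant ones matter.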

Resampling Tests: Permutation and Bootstrap

Two distinct resampling schemes are both used for hypothesis testing and are often confused. They answer different questions.

Definition

Permutation Test

Use when H_0 is exchangeability of observations across groups (e.g., model A and model B perform identically on each example).

  1. Compute the observed test statistic T_{obs} (e.g., difference in accuracy).
  2. For each of B iterations, permute the group labels (swap which model produced each prediction) and recompute T_b^* on the relabeled data.
  3. The p-value is the fraction of permutation statistics at least as extreme: p = \frac{1}{B}\sum_{b=1}^B \mathbf{1}[|T_b^*| \geq |T_{obs}|].

Permutation tests are exact under the exchangeability null and require no distributional assumptions.
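For paired data, "swapping which model produced each prediction" on example i flips the sign of d_i, so the label permutation becomes a sign-flip test. A sketch with hypothetical per-example differences, using the mean difference as the statistic:

```python
import random
import statistics

def paired_permutation_test(d, n_perm: int = 10_000, seed: int = 0) -> float:
    """Sign-flip permutation test of H0: models A and B are exchangeable on
    each example (each d_i symmetric about 0). Statistic: |mean difference|."""
    rng = random.Random(seed)
    t_obs = abs(statistics.fmean(d))
    hits = 0
    for _ in range(n_perm):
        # Randomly swap A/B labels per example, i.e. flip the sign of d_i.
        flipped = [x if rng.random() < 0.5 else -x for x in d]
        if abs(statistics.fmean(flipped)) >= t_obs:
            hits += 1
    return hits / n_perm

# Hypothetical per-example metric differences, mostly favoring model A:
d = [0.021, 0.013, -0.004, 0.024, 0.015, 0.011, 0.019, -0.002, 0.017, 0.012]
print(f"permutation p-value: {paired_permutation_test(d):.4f}")
```

With only 10 examples the full permutation set has 2^10 = 1024 sign patterns, so the Monte Carlo loop could be replaced by exact enumeration.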

Definition

Bootstrap Test

Use to approximate the sampling distribution of a statistic when you cannot derive it analytically and the sample is too small for asymptotic approximations.

  1. Compute the observed test statistic T_{obs}.
  2. For each of B iterations, resample the data with replacement (not permute) and compute T_b^*, typically after recentering under H_0.
  3. Construct confidence intervals from the empirical distribution of T_b^*, or compute a p-value from the recentered statistics.

Bootstrap tests estimate a sampling distribution; permutation tests construct a null distribution by exchange. They coincide asymptotically under some conditions but are not interchangeable in small samples.
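The CI route is the most common in ML practice. A minimal percentile-bootstrap sketch on the same kind of hypothetical paired differences (BCa or studentized intervals are better calibrated for small samples):

```python
import random
import statistics

def bootstrap_ci(d, n_boot: int = 10_000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI for the mean difference: resample d with
    replacement, recompute the mean, take empirical quantiles."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(d, k=len(d))) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-example metric differences between models A and B:
d = [0.021, 0.013, -0.004, 0.024, 0.015, 0.011, 0.019, -0.002, 0.017, 0.012]
lo, hi = bootstrap_ci(d)
print(f"95% bootstrap CI: [{lo:.4f}, {hi:.4f}]")
print("excludes zero:", lo > 0 or hi < 0)
```

Contrast with the permutation test: here we resample the observed data to estimate the sampling distribution of the mean, rather than relabeling under an exchangeability null.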

Common Confusions

Watch Out

p less than 0.05 does NOT mean 95% probability that H1 is true

This is the most common misinterpretation in statistics. The p-value is P(\text{data} \mid H_0), not P(H_0 \mid \text{data}). To get the latter, you need Bayes' theorem and a prior on H_0. A p-value of 0.03 means: "if the models were truly equal, there is a 3% chance of seeing a difference this large." It does not mean "there is a 97% chance that model A is better."

Watch Out

Statistical significance is not practical significance

A difference can be statistically significant (small p-value) but practically meaningless (tiny effect size). With enough data, you can detect arbitrarily small differences. Always report effect sizes and confidence intervals alongside p-values. A 0.1% accuracy improvement that is "statistically significant" may not matter in practice.

Watch Out

Do not use unpaired tests for paired data

When comparing models on the same test set, the predictions are correlated (both models see the same examples). An unpaired t-test ignores this correlation and has lower power. Always use a paired test (paired t-test, Wilcoxon signed-rank, or bootstrap with pairing) when the data is paired.

Summary

  • The p-value is P(\text{data} \mid H_0), not P(H_0 \mid \text{data})
  • Type I error (false positive) rate is controlled by \alpha; Type II (false negative) depends on power
  • Multiple comparisons inflate false positives: use Bonferroni (strict), Holm (uniformly better), or Benjamini-Hochberg (less strict, controls FDR)
  • Use paired tests (paired t-test, Wilcoxon, or McNemar for binary outcomes) when comparing models on the same data. For resampling-based estimates of generalization error, use the Nadeau-Bengio correction or Dietterich 5x2 cv t-test, not the naive paired t-test.
  • Confidence intervals are more informative than p-values alone
  • Bootstrap tests work when you cannot derive the test statistic distribution

Exercises

ExerciseCore

Problem

You compare 10 models on a test set and report p-values for each pairwise comparison. How many comparisons are there? If you use \alpha = 0.05 without correction, what is the approximate probability of at least one false positive (assuming all nulls are true)?

ExerciseCore

Problem

Design a proper statistical test to compare two classifiers (a fine-tuned BERT model and a logistic regression baseline) on a binary classification task with 1000 test examples. Specify: the null hypothesis, the test statistic, the type of test, and how you would compute the p-value.

References

Canonical:

  • Fisher, Statistical Methods for Research Workers (1925), Ch. 4-5 — classical significance testing and the 0.05 convention
  • Neyman and Pearson, "On the Problem of the Most Efficient Tests of Statistical Hypotheses" (1933)
  • Demsar, "Statistical Comparisons of Classifiers over Multiple Datasets" (JMLR 2006) — Friedman + Nemenyi workflow for multi-dataset comparisons
  • Benjamini and Hochberg, "Controlling the False Discovery Rate" (1995)
  • Benjamini and Yekutieli, "The Control of the False Discovery Rate in Multiple Testing under Dependency" (Annals of Statistics 2001)
  • Holm, "A Simple Sequentially Rejective Multiple Test Procedure" (Scandinavian J. Statistics 1979)
  • Dietterich, "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms" (Neural Computation 1998) — 5x2 cv and McNemar's for ML
  • Nadeau and Bengio, "Inference for the Generalization Error" (Machine Learning 2003) — corrected resampled t-test for cross-validation

Current:

  • Wasserstein and Lazar, "The ASA Statement on p-Values" (The American Statistician, 2016)
  • Wasserstein, Schirm, and Lazar, "Moving to a World Beyond p < 0.05" (The American Statistician, 2019)
  • Bouthillier et al., "Accounting for Variance in Machine Learning Benchmarks" (MLSys 2021)
  • Benavoli et al., "Time for a Change: A Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis" (JMLR 2017)
  • Dror et al., "Deep Dominance: How to Properly Compare Deep Neural Models" (ACL 2019) — almost-stochastic dominance

Next Topics

  • Statistical significance and multiple comparisons: deeper treatment of FDR, permutation tests, and replication
  • Bootstrap methods: the general bootstrap framework for inference

Last reviewed: April 22, 2026
