
Methodology

Hypothesis Testing for ML

Null and alternative hypotheses, p-values, confidence intervals, error types, multiple comparisons, and proper statistical tests for comparing ML models.


Why This Matters

Machine learning papers routinely claim that model A outperforms model B. But how do you know the difference is real and not an artifact of the random train/test split, initialization, or data ordering? Hypothesis testing provides the formal framework for answering this question. Without it, you cannot distinguish signal from noise in experimental comparisons. The multiple comparisons problem is especially critical in ML, where researchers often compare many models, hyperparameter settings, or metrics simultaneously.

Mental Model

You have two models. You run them on a test set and model A scores 0.85 while model B scores 0.83. Is A better than B, or did you just get lucky? Hypothesis testing formalizes this question: assume the models are equally good (null hypothesis), compute how unlikely the observed difference is under that assumption (p-value), and reject the null if the data is sufficiently surprising.

Figure: null distribution and rejection region

A test is a rule fixed before the data: choose a statistic, choose a rejection region, then control how often the rule falsely rejects a true null. (The figure shows the H_0 and H_1 densities, the critical value, \alpha, and power.)

Type I error

The red area is \alpha: rejecting H_0 even though H_0 is true.

Power

The green area is 1 - \beta: rejecting H_0 when the alternative H_1 is true.

p-value

A p-value is a tail area under H_0, not under H_1.

The Neyman-Pearson Framework

Definition

Null Hypothesis

The null hypothesis H_0 is the default assumption you are trying to disprove. In ML model comparison:

H_0: \mu_A = \mu_B

where \mu_A and \mu_B are the true expected performances of models A and B. The null says there is no real difference between the models.

Definition

Alternative Hypothesis

The alternative hypothesis H_1 is the claim tested against H_0; rejecting H_0 is evidence in favor of H_1.

  • Two-sided: H_1: \mu_A \neq \mu_B (the models differ)
  • One-sided: H_1: \mu_A > \mu_B (model A is better)

In ML, two-sided tests are more conservative and generally preferred unless you have a strong prior reason to test only one direction.

Definition

p-Value

The p-value is the probability of observing a test statistic at least as extreme as the one computed from your data, assuming H_0 is true. The exact form depends on the alternative:

  • One-sided (H_1: \theta > \theta_0): p = P(T \geq t_{obs} \mid H_0)
  • Two-sided (H_1: \theta \neq \theta_0): p = P(|T| \geq |t_{obs}| \mid H_0)

Since two-sided tests are the default in ML model comparison, use the two-sided form unless you committed in advance to a directional alternative. A small p-value means the observed data is unlikely under H_0, which is evidence against H_0.

The p-value is not the probability that H_0 is true. It is the probability of the data given H_0, not the probability of H_0 given the data.
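The one- and two-sided definitions above can be checked numerically. A minimal sketch, assuming the test statistic is standard normal under H_0 (the function names are illustrative, not from any particular library):

```python
import math

def normal_sf(x: float) -> float:
    """Survival function P(Z >= x) of a standard normal, via the error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def z_test_p_values(t_obs: float):
    """One- and two-sided p-values for an observed z-statistic under H0: T ~ N(0, 1)."""
    one_sided = normal_sf(t_obs)              # P(T >= t_obs | H0)
    two_sided = 2.0 * normal_sf(abs(t_obs))   # P(|T| >= |t_obs| | H0)
    return one_sided, two_sided

one, two = z_test_p_values(1.96)
print(f"one-sided p = {one:.4f}, two-sided p = {two:.4f}")
```

Note that the two-sided p-value is exactly twice the one-sided tail for a symmetric null distribution, which is why committing to a direction in advance halves the p-value.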

Definition

Significance Level

The significance level \alpha is the threshold for rejecting H_0: if p \leq \alpha, we reject H_0 and declare the result "statistically significant." The standard choice is \alpha = 0.05, meaning we accept a 5% chance of falsely rejecting a true null hypothesis.

Watch Out

p = 0.05 is a convention, not a theorem

The 0.05 threshold comes from Fisher (1925), who offered it as a rough guide, not a principled cutoff. The ASA's 2016 and 2019 statements on p-values warn against dichotomous threshold-based decisions. The replication crisis in psychology and biomedicine (Open Science Collaboration 2015; Ioannidis 2005) showed that a large fraction of "significant" findings at p < 0.05 fail to replicate. In ML practice: test across multiple seeds, report effect sizes and confidence intervals, and treat p < 0.05 on a single run as weak evidence.

Error Types

Proposition

Type I and Type II Errors

Statement

There are two types of errors in hypothesis testing:

                     | H_0 true              | H_0 false
  Reject H_0         | Type I error (\alpha) | Correct (power)
  Do not reject H_0  | Correct               | Type II error (\beta)

  • Type I error (false positive): rejecting H_0 when it is true. Probability = \alpha.
  • Type II error (false negative): failing to reject H_0 when it is false. Probability = \beta.
  • Power = 1 - \beta: probability of correctly rejecting a false H_0.

Intuition

Type I error is a false alarm: you claim the models are different when they are not. Type II error is a missed detection: the models truly differ but your test fails to detect it. For a fixed sample size, decreasing \alpha (being more conservative) increases \beta (more missed detections).

Proof Sketch

By definition, \alpha = P(\text{reject } H_0 \mid H_0 \text{ true}), which is controlled directly by the choice of significance level. The power depends on the true effect size \delta = |\mu_A - \mu_B|, the variance \sigma^2, and the sample size n:

\text{power} \approx \Phi\left(\frac{\delta\sqrt{n}}{\sigma} - z_{\alpha/2}\right)
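The power approximation can be evaluated directly. A minimal sketch using only the standard library (\Phi via the error function; the effect size, standard deviation, and seed counts below are made-up numbers for illustration):

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def approx_power(delta: float, sigma: float, n: int, z_crit: float = 1.96) -> float:
    """Normal approximation to two-sided test power:
    power ~ Phi(delta * sqrt(n) / sigma - z_{alpha/2})."""
    return normal_cdf(delta * math.sqrt(n) / sigma - z_crit)

# A 1-point accuracy gap (delta = 0.01) with run-to-run std 0.02:
for n in (3, 10, 30):
    print(f"n = {n:2d} seeds -> power ~ {approx_power(0.01, 0.02, n):.2f}")
```

With 3 seeds the power is well below 0.5: a real 1-point improvement is more likely to be missed than detected, which is exactly the underpowered-experiment failure mode described below.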

Why It Matters

In ML, Type I errors lead to published claims of improvement that do not replicate. Type II errors lead to dismissing genuinely better methods. Both are costly. Understanding the tradeoff is essential for designing experiments with adequate power.

Failure Mode

Most ML experiments are underpowered: they use too few random seeds or train/test splits to detect real but small improvements. An underpowered test has high \beta, meaning it frequently fails to detect true differences.

Confidence Intervals

Definition

Confidence Interval

A (1 - \alpha) confidence interval for a parameter \theta is a random interval [L, U] whose coverage probability is at least 1 - \alpha:

P_\theta(\theta \in [L, U]) \geq 1 - \alpha

for every parameter value \theta in the model. When equality holds, the interval has exact coverage.

For the difference in model performance \delta = \mu_A - \mu_B, a 95% CI provides a range of plausible values for the true difference. If the CI excludes zero, the difference is significant at \alpha = 0.05.

Confidence intervals are more informative than p-values alone because they communicate both statistical significance and effect size.
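A minimal sketch of the CI-excludes-zero check, using a normal approximation on paired differences (the per-seed accuracy gaps below are hypothetical; for small samples, replace 1.96 with the t critical value for n - 1 degrees of freedom):

```python
import math
import statistics

def diff_ci(d, z: float = 1.96):
    """Approximate 95% CI for the mean paired difference:
    mean(d) +/- z * s_d / sqrt(n), with s_d the (n-1)-denominator std."""
    n = len(d)
    mean = statistics.fmean(d)
    half = z * statistics.stdev(d) / math.sqrt(n)
    return mean - half, mean + half

# Hypothetical per-seed accuracy gaps between models A and B:
d = [0.021, 0.013, 0.008, 0.024, 0.015, 0.011, 0.019, 0.009, 0.017, 0.012]
lo, hi = diff_ci(d)
print(f"95% CI for the true gap: [{lo:.4f}, {hi:.4f}]")
print("significant at alpha = 0.05:", not (lo <= 0.0 <= hi))
```

The interval communicates both the decision (does it exclude zero?) and the effect size (how large is the plausible gap?), which a bare p-value does not.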

The Multiple Comparisons Problem

When you test many hypotheses simultaneously, the probability of at least one false positive grows rapidly.

Theorem

Bonferroni Correction

Statement

If you perform m simultaneous hypothesis tests and want to control the family-wise error rate (FWER) at level \alpha, reject each individual test i only if p_i \leq \alpha/m. Then:

\text{FWER} = P(\text{at least one false rejection}) \leq \alpha

This holds regardless of the dependence structure among the tests.

Intuition

If you flip a coin 20 times looking for "evidence of bias," you expect at least one long streak just by chance. Bonferroni compensates by making each individual test more stringent: if you test 20 hypotheses, you need p \leq 0.05/20 = 0.0025 for any single one.
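The inflation and the correction are both a few lines to check (the p-values below are hypothetical):

```python
def fwer_uncorrected(m: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across m independent true nulls:
    1 - (1 - alpha)^m."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_reject(p_values, alpha: float = 0.05):
    """Reject H_{0,i} iff p_i <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

print(f"20 tests, no correction: FWER = {fwer_uncorrected(20):.2f}")
p = [0.001, 0.004, 0.012, 0.030, 0.200]
print("Bonferroni rejections at alpha = 0.05:", bonferroni_reject(p))
```

With 20 independent true nulls the uncorrected chance of at least one false positive is about 0.64, far above the nominal 0.05.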

Proof Sketch

By the union bound: P(\exists i: \text{reject } H_{0,i} \mid \text{all } H_{0,i} \text{ true}) \leq \sum_{i=1}^m P(p_i \leq \alpha/m) = m \cdot (\alpha/m) = \alpha.

Why It Matters

In ML, the multiple comparisons problem arises constantly: comparing many models, testing on multiple datasets, evaluating multiple metrics, running experiments with different hyperparameters. Without correction, the false positive rate can be far higher than the nominal \alpha.

Failure Mode

Bonferroni is very conservative: it controls the probability of any false positive, which can make it too strict when m is large, so many true effects are missed. When you care about controlling the proportion of false positives rather than their existence, use FDR control instead.

Holm-Bonferroni (1979) is a uniformly more powerful alternative that still controls FWER at \alpha without additional assumptions. Sort the p-values p_{(1)} \leq \cdots \leq p_{(m)} and reject H_{(i)} iff p_{(j)} \leq \alpha/(m - j + 1) for all j \leq i. The smallest p-value is tested against \alpha/m (same as Bonferroni), but later ones use a larger threshold. Holm dominates Bonferroni: any rejection Bonferroni makes, Holm also makes. Prefer Holm whenever you would have used Bonferroni.
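The step-down procedure is short to implement. A sketch, with hypothetical p-values (statsmodels' multipletests with method='holm' provides a tested implementation):

```python
def holm_reject(p_values, alpha: float = 0.05):
    """Holm step-down: test sorted p-values against alpha/(m - j + 1) and
    stop at the first failure. Returns reject flags in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):            # rank = j - 1 in the 1-based formula
        if p_values[i] <= alpha / (m - rank):   # alpha/m, alpha/(m-1), ...
            reject[i] = True
        else:
            break                               # step-down: all later tests also fail
    return reject

p = [0.012, 0.030, 0.001, 0.004, 0.200]
print("Holm rejections:      ", holm_reject(p))
print("Bonferroni rejections:", [q <= 0.05 / len(p) for q in p])
```

On these p-values Holm rejects 0.012 and 0.004 while plain Bonferroni (threshold 0.01) keeps 0.012, illustrating that Holm is uniformly at least as powerful.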

FDR Control: Benjamini-Hochberg

Definition

False Discovery Rate

The False Discovery Rate is the expected proportion of false positives among all rejected hypotheses:

\text{FDR} = \mathbb{E}\left[\frac{\text{false rejections}}{\text{total rejections}}\right]

The Benjamini-Hochberg (BH) procedure controls FDR at level q:

  1. Order the m p-values: p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}
  2. Find the largest k such that p_{(k)} \leq \frac{k}{m} \cdot q
  3. Reject all hypotheses with p_{(i)} \leq p_{(k)}

BH is less conservative than Bonferroni and is preferred when testing many hypotheses (e.g., comparing models on 50 datasets).

BH controls FDR at level q under independence or positive regression dependence of the test statistics (Benjamini-Hochberg 1995; Benjamini-Yekutieli 2001). Under arbitrary dependence, use the Benjamini-Yekutieli correction, which replaces the threshold with \frac{k}{m \cdot c(m)} q where c(m) = \sum_{j=1}^m 1/j \approx \ln m. This is strictly more conservative but valid without structural assumptions on the joint distribution of p-values.
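The three-step BH procedure translates directly into code. A sketch with hypothetical p-values (statsmodels' multipletests with method='fdr_bh' is a tested alternative):

```python
def bh_reject(p_values, q: float = 0.05):
    """Benjamini-Hochberg: find the largest k with p_(k) <= (k/m) * q and
    reject every hypothesis with p_i <= p_(k)."""
    m = len(p_values)
    ranked = sorted(p_values)
    k = 0
    for i, p in enumerate(ranked, start=1):
        if p <= (i / m) * q:
            k = i                       # keep the LARGEST passing index
    if k == 0:
        return [False] * m
    cutoff = ranked[k - 1]
    return [p <= cutoff for p in p_values]

p = [0.001, 0.008, 0.012, 0.030, 0.041, 0.200]
print("BH rejections:        ", bh_reject(p))
print("Bonferroni rejections:", [x <= 0.05 / len(p) for x in p])
```

Note the step-up structure: 0.041 is rejected because a later ordered p-value clears its threshold, even though 0.030 alone would not clear a Bonferroni bar. Bonferroni rejects only the two smallest here.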

Paired Tests for Model Comparison

When comparing two models on the same data, you should use paired tests that account for the correlation between the two models' errors on each example.

Definition

Paired t-Test for Model Comparison

Given n test examples, let d_i = \ell(A, x_i) - \ell(B, x_i) be the difference in loss between models A and B on example i. The paired t-test statistic is:

t = \frac{\bar{d}}{s_d / \sqrt{n}}, \quad \bar{d} = \frac{1}{n}\sum_i d_i, \quad s_d^2 = \frac{1}{n-1}\sum_i (d_i - \bar{d})^2

Under H_0: \mathbb{E}[d_i] = 0, t follows a Student t-distribution with n - 1 degrees of freedom exactly when the d_i are i.i.d. Gaussian, and approximately for non-Gaussian d_i by the CLT when n is large. The t-distribution is robust to moderate departures from normality but breaks down for heavy-tailed or highly skewed d_i; in those cases use the Wilcoxon signed-rank test or a bootstrap.
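The statistic is a one-liner with the standard library. A sketch with made-up per-example loss differences (in practice scipy.stats.ttest_rel computes both the statistic and the p-value from paired samples):

```python
import math
import statistics

def paired_t(d):
    """Paired t statistic t = mean(d) / (s_d / sqrt(n)), df = n - 1."""
    n = len(d)
    d_bar = statistics.fmean(d)
    s_d = statistics.stdev(d)            # (n-1)-denominator estimate
    return d_bar / (s_d / math.sqrt(n)), n - 1

# Hypothetical loss differences l(A, x_i) - l(B, x_i); negative favors model A:
d = [-0.020, -0.011, -0.015, -0.002, -0.019, -0.008, -0.013, -0.021, -0.005, -0.016]
t, df = paired_t(d)
print(f"t = {t:.2f} on {df} degrees of freedom")
# Two-sided critical value t_{0.975, 9} is about 2.262, so |t| > 2.262
# rejects H_0: E[d_i] = 0 at alpha = 0.05.
print("reject H0 at alpha = 0.05:", abs(t) > 2.262)
```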

Watch Out

Do not reuse a single held-out test set with a standard paired t-test across resamples

The standard paired t-test assumes independent d_i. If you generate differences by repeated random train/test splits of a single dataset, the d_i are positively correlated (the same examples appear in multiple test folds), the variance estimate is too small, and the nominal Type I error is inflated. Nadeau and Bengio (2003) proposed a corrected resampled t-test that inflates the variance term from s_d^2/n to (1/n + n_{test}/n_{train}) \cdot s_d^2 to account for the overlap. Dietterich (1998) proposed an earlier alternative, the 5x2 cv paired t-test, which runs five 2-fold cross-validations and uses only one of the two folds per repetition to keep variance comparisons independent.
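A sketch of the corrected statistic, assuming my reading of the variance inflation above; the per-split accuracy gaps and the 90/10 split sizes are hypothetical:

```python
import math
import statistics

def nadeau_bengio_t(d, n_test: int, n_train: int) -> float:
    """Corrected resampled t statistic (after Nadeau & Bengio, 2003):
    replace the naive variance s_d^2 / n with (1/n + n_test/n_train) * s_d^2
    to account for overlap between random train/test splits."""
    n = len(d)
    d_bar = statistics.fmean(d)
    var_d = statistics.variance(d)
    return d_bar / math.sqrt((1.0 / n + n_test / n_train) * var_d)

# Hypothetical accuracy gaps over 15 random 90/10 splits of 1000 examples:
d = [0.012, 0.018, 0.009, 0.021, 0.014, 0.011, 0.016, 0.008,
     0.019, 0.013, 0.010, 0.017, 0.015, 0.020, 0.007]
naive = statistics.fmean(d) / math.sqrt(statistics.variance(d) / len(d))
corrected = nadeau_bengio_t(d, n_test=100, n_train=900)
print(f"naive t = {naive:.2f}, corrected t = {corrected:.2f}")
```

The corrected statistic is noticeably smaller than the naive one, which is the point: the naive test overstates its confidence because the splits share examples.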

Definition

Wilcoxon Signed-Rank Test

A non-parametric alternative to the paired t-test. It does not assume normality of the differences d_i. Instead, it ranks the absolute differences |d_i| and compares the sum of ranks for positive vs. negative differences. Use it when the differences are not approximately normal (e.g., heavy-tailed or skewed distributions).

Definition

McNemar's Test

The appropriate paired test when the per-example outcome is binary (correct/incorrect). Build a 2 \times 2 contingency table of model A vs. model B outcomes on each test example and let b be the count of examples where A is correct and B is wrong, and c the count where A is wrong and B is correct. Under the null H_0 that both models have the same error rate, the discordant pairs b and c are exchangeable, and

\chi^2 = \frac{(b - c)^2}{b + c}

is asymptotically \chi^2_1. For small counts (b + c < 25) use the exact binomial form: P(X \geq \max(b, c)) where X \sim \text{Binomial}(b + c, 1/2). McNemar's test uses only discordant pairs; concordant pairs (both right or both wrong) carry no signal about which model is better.
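Both forms fit in a few lines. A sketch with hypothetical discordant counts (the exact tail is doubled here for a two-sided test; statsmodels also ships a mcnemar function):

```python
import math

def mcnemar(b: int, c: int):
    """McNemar's test from discordant counts b (A right, B wrong) and
    c (A wrong, B right). Returns the chi-square statistic and an exact
    two-sided binomial p-value for small b + c."""
    chi2 = (b - c) ** 2 / (b + c)
    n, k = b + c, max(b, c)
    # Exact: P(X >= k) for X ~ Binomial(n, 1/2), doubled for two sides.
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    p_exact = min(1.0, 2.0 * tail)
    return chi2, p_exact

# Hypothetical: of 1000 examples, 15 where only A is right, 5 where only B is.
chi2, p = mcnemar(b=15, c=5)
print(f"chi2 = {chi2:.2f}, exact two-sided p = {p:.4f}")
```

Note that the other 980 concordant examples never enter the computation; only the 20 discordant ones matter.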

Resampling Tests: Permutation and Bootstrap

Two distinct resampling schemes are both used for hypothesis testing and are often confused. They answer different questions.

Definition

Permutation Test

Use when H_0 is exchangeability of observations across groups (e.g., model A and model B perform identically on each example).

  1. Compute the observed test statistic T_{obs} (e.g., difference in accuracy).
  2. For each of B iterations, permute the group labels (swap which model produced each prediction) and recompute T_b^* on the relabeled data.
  3. The p-value is the fraction of permutation statistics at least as extreme: p = \frac{1}{B}\sum_{b=1}^B \mathbf{1}[|T_b^*| \geq |T_{obs}|].

Permutation tests are exact under the exchangeability null and require no distributional assumptions.
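For paired data, "swapping which model produced each prediction" on example i flips the sign of d_i, so the label permutation becomes a sign-flip test. A sketch with hypothetical per-example differences, using the mean difference as the statistic:

```python
import random
import statistics

def paired_permutation_test(d, n_perm: int = 10_000, seed: int = 0) -> float:
    """Sign-flip permutation test of H0: models A and B are exchangeable on
    each example (each d_i symmetric about 0). Statistic: |mean difference|."""
    rng = random.Random(seed)
    t_obs = abs(statistics.fmean(d))
    hits = 0
    for _ in range(n_perm):
        # Randomly swap A/B labels per example, i.e. flip the sign of d_i.
        flipped = [x if rng.random() < 0.5 else -x for x in d]
        if abs(statistics.fmean(flipped)) >= t_obs:
            hits += 1
    return hits / n_perm

# Hypothetical per-example metric differences, mostly favoring model A:
d = [0.021, 0.013, -0.004, 0.024, 0.015, 0.011, 0.019, -0.002, 0.017, 0.012]
print(f"permutation p-value: {paired_permutation_test(d):.4f}")
```

With only 10 examples the full permutation set has 2^10 = 1024 sign patterns, so the Monte Carlo loop could be replaced by exact enumeration.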

Definition

Bootstrap Test

Use to approximate the sampling distribution of a statistic when you cannot derive it analytically and the sample is too small for asymptotic approximations.

  1. Compute the observed test statistic T_{obs}.
  2. For each of B iterations, resample the data with replacement (not permute) and compute T_b^*, typically after recentering under H_0.
  3. Construct confidence intervals from the empirical distribution of T_b^*, or compute a p-value from the recentered statistics.

Bootstrap tests estimate a sampling distribution; permutation tests construct a null distribution by exchange. They coincide asymptotically under some conditions but are not interchangeable in small samples.
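The CI route is the most common in ML practice. A minimal percentile-bootstrap sketch on the same kind of hypothetical paired differences (BCa or studentized intervals are better calibrated for small samples):

```python
import random
import statistics

def bootstrap_ci(d, n_boot: int = 10_000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI for the mean difference: resample d with
    replacement, recompute the mean, take empirical quantiles."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(d, k=len(d))) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-example metric differences between models A and B:
d = [0.021, 0.013, -0.004, 0.024, 0.015, 0.011, 0.019, -0.002, 0.017, 0.012]
lo, hi = bootstrap_ci(d)
print(f"95% bootstrap CI: [{lo:.4f}, {hi:.4f}]")
print("excludes zero:", lo > 0 or hi < 0)
```

Contrast with the permutation test: here we resample the observed data to estimate the sampling distribution of the mean, rather than relabeling under an exchangeability null.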

Common Confusions

Watch Out

p less than 0.05 does NOT mean 95% probability that H1 is true

This is the most common misinterpretation in statistics. The p-value is P(\text{data} \mid H_0), not P(H_0 \mid \text{data}). To get the latter, you need Bayes' theorem and a prior on H_0. A p-value of 0.03 means: "if the models were truly equal, there is a 3% chance of seeing a difference this large." It does not mean "there is a 97% chance that model A is better."

Watch Out

Statistical significance is not practical significance

A difference can be statistically significant (small p-value) but practically meaningless (tiny effect size). With enough data, you can detect arbitrarily small differences. Always report effect sizes and confidence intervals alongside p-values. A 0.1% accuracy improvement that is "statistically significant" may not matter in practice.

Watch Out

Do not use unpaired tests for paired data

When comparing models on the same test set, the predictions are correlated (both models see the same examples). An unpaired t-test ignores this correlation and has lower power. Always use a paired test (paired t-test, Wilcoxon signed-rank, or bootstrap with pairing) when the data is paired.

Summary

  • The p-value is P(\text{data} \mid H_0), not P(H_0 \mid \text{data})
  • Type I error (false positive) rate is controlled by \alpha; Type II (false negative) depends on power
  • Multiple comparisons inflate false positives: use Bonferroni (strict), Holm (uniformly better), or Benjamini-Hochberg (less strict, controls FDR)
  • Use paired tests (paired t-test, Wilcoxon, or McNemar for binary outcomes) when comparing models on the same data. For resampling-based estimates of generalization error, use the Nadeau-Bengio correction or Dietterich 5x2 cv t-test, not the naive paired t-test.
  • Confidence intervals are more informative than p-values alone
  • Bootstrap tests work when you cannot derive the test statistic distribution

Exercises

ExerciseCore

Problem

You compare 10 models on a test set and report p-values for each pairwise comparison. How many comparisons are there? If you use \alpha = 0.05 without correction, what is the approximate probability of at least one false positive (assuming all nulls are true)?

ExerciseCore

Problem

Design a proper statistical test to compare two classifiers (a fine-tuned BERT model and a logistic regression baseline) on a binary classification task with 1000 test examples. Specify: the null hypothesis, the test statistic, the type of test, and how you would compute the p-value.

References

Canonical:

  • Fisher, Statistical Methods for Research Workers (1925), Ch. 4-5 — classical significance testing and the 0.05 convention
  • Neyman and Pearson, "On the Problem of the Most Efficient Tests of Statistical Hypotheses" (1933)
  • Demsar, "Statistical Comparisons of Classifiers over Multiple Datasets" (JMLR 2006) — Friedman + Nemenyi workflow for multi-dataset comparisons
  • Benjamini and Hochberg, "Controlling the False Discovery Rate" (1995)
  • Benjamini and Yekutieli, "The Control of the False Discovery Rate in Multiple Testing under Dependency" (Annals of Statistics 2001)
  • Holm, "A Simple Sequentially Rejective Multiple Test Procedure" (Scandinavian J. Statistics 1979)
  • Dietterich, "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms" (Neural Computation 1998) — 5x2 cv and McNemar's for ML
  • Nadeau and Bengio, "Inference for the Generalization Error" (Machine Learning 2003) — corrected resampled t-test for cross-validation

Current:

  • Wasserstein and Lazar, "The ASA Statement on p-Values" (The American Statistician, 2016)
  • Wasserstein, Schirm, and Lazar, "Moving to a World Beyond p < 0.05" (The American Statistician, 2019)
  • Bouthillier et al., "Accounting for Variance in Machine Learning Benchmarks" (MLSys 2021)
  • Benavoli et al., "Time for a Change: A Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis" (JMLR 2017)
  • Dror et al., "Deep Dominance: How to Properly Compare Deep Neural Models" (ACL 2019) — almost-stochastic dominance

Next Topics

  • Statistical significance and multiple comparisons: deeper treatment of FDR, permutation tests, and replication
  • Bootstrap methods: the general bootstrap framework for inference

Last reviewed: April 22, 2026
