Methodology
Statistical Significance and Multiple Comparisons
p-values, significance levels, confidence intervals, and the multiple comparisons problem. Bonferroni correction, Benjamini-Hochberg FDR control, and why these matter for model selection and benchmark evaluation.
Prerequisites
- Hypothesis Testing for ML
Why This Matters
You train five models, tune three hyperparameters each, evaluate on four metrics, and report the best combination. How many implicit hypothesis tests did you just run? Roughly $5 \times 3 \times 4 = 60$. Without correction, the probability of at least one spurious "significant" result exceeds 95% ($1 - 0.95^{60} \approx 0.954$). This is the multiple comparisons problem, and it is the single largest source of false claims in ML evaluation.
Formal Setup
p-Value
The p-value for a test statistic $T$ is the probability, under the null hypothesis $H_0$, of observing a value at least as extreme as the observed $t$. For a one-sided alternative ($T$ large):

$$p = P_{H_0}(T \ge t)$$

For a two-sided alternative:

$$p = P_{H_0}(|T| \ge |t|)$$
A small $p$ means the data is unlikely under $H_0$. The p-value is not the probability that $H_0$ is true, and $1 - p$ is not the probability that the alternative is true.
Significance Level
The significance level $\alpha$ is the threshold for rejecting $H_0$: reject if and only if $p \le \alpha$. The standard choice $\alpha = 0.05$ means you accept a 5% false positive rate for any single test.
Confidence Interval
A $1 - \alpha$ confidence interval for a parameter $\theta$ is a random interval $[L, U]$ such that $P(L \le \theta \le U) \ge 1 - \alpha$ over repeated sampling. For the difference in model performance $\Delta = \mathrm{acc}_A - \mathrm{acc}_B$:

$$\hat{\Delta} \pm z_{1-\alpha/2} \cdot \widehat{\mathrm{SE}}(\hat{\Delta})$$

where $\widehat{\mathrm{SE}}$ is the estimated standard error. The interval excludes zero if and only if the difference is significant at level $\alpha$.
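As a concrete sketch of the interval above, assuming two models evaluated independently on the same number of test examples and a normal approximation for the accuracy difference (the function name and all numbers are illustrative):

```python
import math

def diff_ci(acc_a, acc_b, n, z=1.96):
    """Normal-approximation CI for the difference of two accuracies,
    each estimated from n independent test examples (z=1.96 <=> alpha=0.05)."""
    delta = acc_a - acc_b
    # Standard error of a difference of two independent proportions.
    se = math.sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)
    return delta - z * se, delta + z * se

lo, hi = diff_ci(0.91, 0.89, n=2000)
significant = not (lo <= 0.0 <= hi)  # interval excludes zero <=> significant
print(f"[{lo:.4f}, {hi:.4f}]", significant)
```

A paired test (e.g., McNemar's test on the same test examples) is usually more powerful than this unpaired sketch, because it removes shared example-level variance.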
The Multiple Comparisons Problem
When testing $m$ hypotheses simultaneously at level $\alpha$ each, the probability of at least one false positive grows rapidly with $m$.
Family-Wise Error Rate
The FWER is the probability of making at least one Type I error across all $m$ tests:

$$\mathrm{FWER} = P(\text{at least one false rejection}) = 1 - (1 - \alpha)^m$$

when the tests are independent. For $m = 20$ and $\alpha = 0.05$: $1 - 0.95^{20} \approx 0.64$.
Bonferroni Correction
Bonferroni Correction
Statement
To control the family-wise error rate at level $\alpha$ across $m$ tests, reject hypothesis $H_i$ only if $p_i \le \alpha/m$. Then:

$$\mathrm{FWER} \le m_0 \cdot \frac{\alpha}{m} \le \alpha$$

where $m_0 \le m$ is the number of true null hypotheses.
Intuition
If you test 20 hypotheses at $\alpha = 0.05$, Bonferroni requires $p_i \le 0.05/20 = 0.0025$ for each individual test. You divide your error budget equally among all tests. The union bound guarantees this works regardless of whether the tests are correlated.
Proof Sketch
By the union bound: $P\big(\bigcup_{i \in I_0} \{p_i \le \alpha/m\}\big) \le \sum_{i \in I_0} P(p_i \le \alpha/m) = m_0 \cdot \alpha/m \le \alpha$, where $I_0$ is the set of true nulls. The key step is that for a true null, $p_i$ is uniform on $[0, 1]$, so $P(p_i \le \alpha/m) = \alpha/m$.
Why It Matters
Bonferroni is the simplest multiple testing correction. It requires no assumptions about the dependence structure among tests. In ML, use it when comparing a small number of models (say, 5 or fewer) and you need a strong guarantee that no comparison is a false positive.
Failure Mode
Bonferroni is very conservative when $m$ is large. With $m$ tests you need $p_i \le \alpha/m$ per test; at $m = 1000$ and $\alpha = 0.05$, that is $p_i \le 5 \times 10^{-5}$. Many true effects will be missed. When you care about controlling the proportion of false discoveries rather than their existence, use Benjamini-Hochberg instead. Even when you insist on FWER control, Holm's step-down procedure (Holm 1979) uniformly dominates Bonferroni: order the p-values $p_{(1)} \le \cdots \le p_{(m)}$ and reject $H_{(i)}$ while $p_{(i)} \le \alpha/(m - i + 1)$, stopping at the first failure. Holm still controls FWER at level $\alpha$ and always rejects a (weak) superset of what Bonferroni rejects, so use it by default.
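Both FWER procedures can be sketched in a few lines (function names are illustrative; `statsmodels.stats.multitest.multipletests` provides tested implementations):

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i iff p_i <= alpha / m. Controls FWER under any dependence."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Holm step-down: compare the i-th smallest p-value to alpha/(m - i + 1)
    and stop at the first failure. Controls FWER, rejects at least as much
    as Bonferroni."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):        # rank 0 is the smallest p-value
        if pvals[i] <= alpha / (m - rank):  # alpha/(m - i + 1) with 1-based i
            reject[i] = True
        else:
            break                           # step-down: stop at first failure
    return reject
```

On `[0.001, 0.011, 0.02, 0.04]` at $\alpha = 0.05$, Bonferroni (threshold 0.0125) rejects two hypotheses while Holm rejects all four, illustrating the dominance.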
Benjamini-Hochberg FDR Control
False Discovery Rate
The False Discovery Rate is the expected proportion of false positives among all rejected hypotheses:

$$\mathrm{FDR} = E\!\left[\frac{V}{\max(R, 1)}\right]$$

where $V$ is the number of false rejections, $R$ is the total number of rejections, and $\max(R, 1)$ avoids division by zero.
Benjamini-Hochberg Procedure
Statement
Order the p-values $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$. Find the largest $k$ such that $p_{(k)} \le \frac{k}{m}\alpha$. Reject all hypotheses $H_{(1)}, \ldots, H_{(k)}$. Then $\mathrm{FDR} \le \frac{m_0}{m}\alpha \le \alpha$.
Intuition
BH draws a line with slope $\alpha/m$ through the ordered p-values. The largest p-value below this line determines the cutoff; all hypotheses at or below it are rejected. The procedure is adaptive: if many p-values are small (many true effects), the threshold is more lenient than Bonferroni.
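The step-up scan can be sketched as follows (an illustrative implementation; note that `k` keeps updating through failures rather than stopping at the first one):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up: find the LARGEST k with p_(k) <= (k/m) * alpha and
    reject the k smallest p-values. Controls FDR under independence/PRDS."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):  # rank is 1-based
        if pvals[i] <= rank / m * alpha:
            k = rank                           # keep scanning: step-up
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject
```

The second test case below shows the step-up behavior: four p-values of 0.04 each fail the first three thresholds but pass at $k = 4$, so all four are rejected.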
Proof Sketch
The original proof by Benjamini and Hochberg (1995) proceeds by induction on the number of true nulls $m_0$. The key insight: under independence, each true null's p-value falls below the $k$-th threshold $\frac{k}{m}\alpha$ with probability exactly $\frac{k}{m}\alpha$, so the expected number of false rejections at that threshold is $m_0 \cdot \frac{k}{m}\alpha$. The proof extends to positive regression dependence (PRDS) by the work of Benjamini and Yekutieli (2001).
Why It Matters
BH is the standard correction for large-scale testing in ML. When comparing models across many datasets, metrics, or hyperparameter settings, BH controls the fraction of false discoveries rather than the probability of any false discovery. This is almost always the right notion for ML practitioners.
Failure Mode
BH requires independence or positive regression dependence (PRDS) among p-values (Benjamini-Yekutieli 2001). If tests are negatively correlated (rare in practice but possible) or the dependence structure is unknown, FDR may exceed $\alpha$. Under arbitrary dependence, use the Benjamini-Yekutieli correction, which replaces $\alpha$ with $\alpha/H_m$, where $H_m = \sum_{i=1}^m 1/i$ is the $m$-th harmonic number. For large $m$, $H_m \approx \ln m + \gamma$ (Euler-Mascheroni constant $\gamma \approx 0.577$), so the BY threshold is roughly $\alpha/\ln m$, which costs a factor of about $\ln m$ in power. Do not apply the approximation for small $m$: at $m = 10$, $H_{10} \approx 2.93$ while $\ln 10 \approx 2.30$.
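A quick numerical check of the harmonic-number factor and its logarithmic approximation (illustrative):

```python
import math

def by_factor(m):
    """Harmonic number H_m = sum_{i=1}^m 1/i; BY replaces alpha with alpha/H_m."""
    return sum(1 / i for i in range(1, m + 1))

# The ln(m) + gamma approximation is good for large m, poor for small m.
print(by_factor(10), math.log(10) + 0.5772)       # exact vs approximation
print(by_factor(10_000), math.log(10_000) + 0.5772)
```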
Why This Matters for ML
Model Selection
When you compare $k$ models, you are implicitly running $\binom{k}{2}$ pairwise tests. With 10 models, that is $\binom{10}{2} = 45$ tests. Reporting the "best" model without correction inflates the false positive rate.
Hyperparameter Tuning
Each hyperparameter configuration evaluated on a validation set is an implicit hypothesis test. Random search over 100 configurations at $\alpha = 0.05$ yields an expected $100 \times 0.05 = 5$ false positives. Cross-validation helps but does not eliminate this problem.
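A small simulation makes the selection bias concrete: even when every configuration has identical true accuracy, the best validation score comes out ahead of the truth purely by chance. All numbers below are illustrative.

```python
import random

random.seed(0)

n_configs, n_val = 200, 1000
true_acc = 0.80  # every configuration has the SAME true accuracy

# Validation accuracy of each config is a noisy binomial estimate.
scores = [sum(random.random() < true_acc for _ in range(n_val)) / n_val
          for _ in range(n_configs)]

best, second = sorted(scores, reverse=True)[:2]
# The winner's margin over the true accuracy is selection bias, not signal.
print(best, best - true_acc)
```

Reporting `best` as the model's accuracy overstates it; an unbiased estimate requires a held-out test set that played no role in the selection.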
Benchmark Comparisons
Papers comparing models on multiple benchmarks (GLUE has 9 tasks, for example) should correct for multiple comparisons when claiming improvements "across the board."
Bonferroni vs BH on benchmark evaluation
You compare your model against a baseline on 20 benchmark tasks. The sorted p-values are: 0.001, 0.003, 0.008, 0.012, 0.025, 0.04, 0.06, ... (remaining above 0.05).
Bonferroni ($\alpha = 0.05$): threshold is $0.05/20 = 0.0025$. Only the first p-value (0.001) passes. You claim significance on 1 task.
BH ($\alpha = 0.05$): check each $p_{(i)} \le \frac{i}{20} \times 0.05$. Thresholds are 0.0025, 0.005, 0.0075, 0.01, 0.0125, ... Walk through in order: $i = 1$ passes ($0.001 \le 0.0025$), $i = 2$ passes ($0.003 \le 0.005$), $i = 3$ fails ($0.008 > 0.0075$). BH is a step-up procedure: find the largest $k$ such that $p_{(k)} \le \frac{k}{m}\alpha$, then reject $H_{(1)}, \ldots, H_{(k)}$. Here $k = 2$, so BH rejects the first 2 hypotheses. You claim significance on 2 tasks.
Note: BH does not stop at the first failure. It scans all $i$ and takes the largest passing index. In this example the only passing indices are $i = 1, 2$, so the answer coincides with the first-failure index minus one. For other p-value sequences (e.g., a late-arriving small p-value), the largest-$k$ rule can reject more than the first-failure rule would.
BH discovers more true effects than Bonferroni while controlling the false discovery proportion.
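The worked example above can be verified in a few lines (the 13 unspecified p-values above 0.05 are filled in with 0.10 as a stand-in, since only their order matters here):

```python
# First seven sorted p-values from the example; the remaining 13 are only
# said to be above 0.05, so 0.10 is used as a placeholder.
pvals = sorted([0.001, 0.003, 0.008, 0.012, 0.025, 0.04, 0.06] + [0.10] * 13)
m, alpha = len(pvals), 0.05

bonf = sum(p <= alpha / m for p in pvals)  # Bonferroni threshold: 0.0025
bh_k = max((i for i in range(1, m + 1)
            if pvals[i - 1] <= i / m * alpha), default=0)
print(bonf, bh_k)  # 1 task under Bonferroni, 2 under BH
```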
Common Confusions
p-hacking is implicit multiple testing
If you try many analysis strategies (different features, different preprocessing, different splits) and report only the one that gives , you have performed many implicit tests without correction. This is p-hacking. The fix is to preregister your analysis plan or apply multiple testing correction to all analyses you tried.
Garden of forking paths: you do not have to run multiple tests to multiply test
Gelman and Loken (2013) make the sharper point: even researchers who run only a single analysis can commit de facto multiple testing, because the choice of analysis was contingent on the data. A covariate was dropped because it looked noisy. A split was chosen after a peek. An outlier rule was applied only when results were marginal. Each data-dependent choice defines a branch in the garden of forking paths, and the reported p-value ignores all the branches that did not get taken. Preregistration of the full analysis pipeline is the only clean fix.
Bonferroni and BH control different error rates
Bonferroni controls the probability of any false positive (FWER). BH controls the proportion of false positives (FDR). These are different quantities. FWER is appropriate when any false positive is costly (e.g., clinical trials). FDR is appropriate when you expect some false positives and want to control their rate (e.g., screening many benchmark tasks).
Post-hoc correction does not fix bad experimental design
If your experiment has data leakage or other methodological flaws, no multiple testing correction will save you. Corrections adjust p-value thresholds; they do not fix biased test statistics. Always ensure the individual tests are valid before applying corrections.
Summary
- The p-value is $P(\text{data at least this extreme} \mid H_0)$, not $P(H_0 \mid \text{data})$
- Testing $m$ hypotheses at level $\alpha$ each gives $\mathrm{FWER} = 1 - (1 - \alpha)^m$ under independence
- Bonferroni: reject if $p_i \le \alpha/m$. Controls FWER. Conservative for large $m$
- BH: order p-values, find the largest $k$ with $p_{(k)} \le \frac{k}{m}\alpha$. Controls FDR
- ML model selection, hyperparameter tuning, and benchmark evaluation all involve implicit multiple testing
- Confidence intervals convey both significance and effect size
Exercises
Problem
You evaluate a model on 8 benchmark datasets and obtain p-values (vs. baseline) of 0.004, 0.01, 0.02, 0.03, 0.06, 0.08, 0.12, 0.25. Apply both Bonferroni and BH at level 0.05. How many datasets show significant improvement under each?
Problem
You run a random hyperparameter search with 200 configurations. The best configuration has validation accuracy 0.5% higher than the second best. You report this as your final result. Explain why this is problematic from a multiple comparisons perspective, and propose a fix.
References
Canonical:
- Benjamini and Hochberg, "Controlling the False Discovery Rate" (1995)
- Demsar, "Statistical Comparisons of Classifiers over Multiple Datasets" (JMLR 2006)
Current:
- Bouthillier et al., "Accounting for Variance in Machine Learning Benchmarks" (MLSys 2021)
- Recht et al., "Do ImageNet Classifiers Generalize to ImageNet?" (ICML 2019)
- Holm, "A Simple Sequentially Rejective Multiple Test Procedure" (Scand. J. Stat. 1979)
- Benjamini and Yekutieli, "The Control of the False Discovery Rate Under Dependency" (Annals of Statistics 2001)
- Storey, "A Direct Approach to False Discovery Rates" (JRSS-B 2002, q-values)
- Gelman and Loken, "The Garden of Forking Paths" (2013 working paper, later in Am. Scientist 2014)
- Dror, Baumer, Shlomov, Reichart, "The Hitchhiker's Guide to Testing Statistical Significance in NLP" (ACL 2018)
Next Topics
- Bootstrap methods: nonparametric inference for any test statistic
- Cross-validation theory: proper evaluation under limited data
Last reviewed: April 18, 2026