Methodology
Meta-Analysis
Combining results from multiple studies: fixed-effect and random-effects models, heterogeneity measures, publication bias, and why this matters for ML benchmarking.
Why This Matters
ML papers report results on benchmarks. Different papers use different datasets, different hyperparameters, different compute budgets. When five papers compare method A to method B and get different effect sizes, how do you draw a conclusion? Meta-analysis provides the formal framework for combining evidence across studies.
The same problems that plague medical meta-analysis (publication bias, heterogeneity, p-hacking) plague ML. Papers that show improvement get published; papers that show no improvement do not. Understanding meta-analysis helps you read the ML literature critically.
Setup and Notation
Suppose we have $k$ studies, each estimating some effect size $\theta$, with study $i$ reporting estimate $\hat{\theta}_i$ and variance $\sigma_i^2$ (assumed known or well-estimated).
Effect Size
The effect size is the quantity each study estimates. In medical research, this might be a treatment effect. In ML, it is typically the performance difference between two methods (e.g., accuracy of method A minus accuracy of method B) on a particular benchmark.
Inverse-Variance Weight
Studies with smaller variance (more precise estimates) get more weight. The inverse-variance weight for study $i$ is $w_i = 1/\sigma_i^2$.
Fixed-Effect Model
The fixed-effect model assumes all studies estimate the same true effect $\theta$. Differences across studies are due to sampling variability only.
Fixed-Effect Meta-Analytic Estimator
Statement
The optimal combined estimator under the fixed-effect model is:
$$\hat{\theta}_{FE} = \frac{\sum_{i=1}^{k} w_i \hat{\theta}_i}{\sum_{i=1}^{k} w_i}$$
with variance $\operatorname{Var}(\hat{\theta}_{FE}) = 1 / \sum_{i=1}^{k} w_i$, where $w_i = 1/\sigma_i^2$.
Intuition
This is a weighted average of the study estimates, weighting each by its precision. More precise studies contribute more. If all studies had equal precision, it reduces to the simple average.
Proof Sketch
This is the minimum-variance unbiased estimator for the common mean given independent estimates with known variances. It follows from the general theory of weighted least squares, or equivalently, from maximum likelihood under Gaussian sampling distributions.
Why It Matters
The fixed-effect estimator is the simplest way to combine results. Its variance is $1/\sum_i w_i$, which shrinks as studies accumulate, so adding more studies always increases precision. But the model is only valid when the true effect is the same across all studies.
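A minimal sketch of the computation, assuming each study reports an estimate and a standard error (the function and example numbers are illustrative, not taken from any particular library or paper):

```python
import numpy as np

def fixed_effect(estimates, std_errors):
    """Inverse-variance-weighted (fixed-effect) pooled estimate.

    estimates, std_errors: 1-D arrays, one entry per study.
    Returns the pooled estimate and its standard error.
    """
    y = np.asarray(estimates, dtype=float)
    variances = np.asarray(std_errors, dtype=float) ** 2
    weights = 1.0 / variances                      # w_i = 1 / sigma_i^2
    pooled = np.sum(weights * y) / np.sum(weights)
    pooled_se = np.sqrt(1.0 / np.sum(weights))
    return pooled, pooled_se

# Hypothetical example: three studies reporting an accuracy difference A - B.
theta_hat, se = fixed_effect([0.021, 0.034, 0.010], [0.010, 0.015, 0.008])
print(f"pooled effect = {theta_hat:.4f} +/- {1.96 * se:.4f}")
```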
Failure Mode
If the true effect varies across studies (heterogeneity), the fixed-effect model underestimates the uncertainty. The confidence interval is too narrow, and you get false precision. In ML, different datasets, architectures, and compute budgets almost certainly introduce heterogeneity.
Random-Effects Model
The random-effects model assumes each study has its own true effect $\theta_i$, drawn from a distribution with mean $\mu$ and between-study variance $\tau^2$ (commonly $\theta_i \sim N(\mu, \tau^2)$).
The goal is to estimate $\mu$ (the average effect across the population of studies).
DerSimonian-Laird Random-Effects Estimator
Statement
The random-effects combined estimator is:
$$\hat{\mu}_{RE} = \frac{\sum_{i=1}^{k} w_i^* \hat{\theta}_i}{\sum_{i=1}^{k} w_i^*}$$
where $w_i^* = 1/(\sigma_i^2 + \hat{\tau}^2)$ and $\hat{\tau}^2$ is estimated from the data.
Intuition
The weights now account for both within-study variance $\sigma_i^2$ and between-study variance $\tau^2$. When $\tau^2$ is large, all weights become more equal because between-study variability dominates within-study precision. When $\hat{\tau}^2 = 0$, the random-effects model reduces to the fixed-effect model.
Proof Sketch
The marginal distribution of $\hat{\theta}_i$ is $N(\mu, \sigma_i^2 + \tau^2)$. Apply inverse-variance weighting with the total variance $\sigma_i^2 + \tau^2$. The DerSimonian-Laird method estimates $\tau^2$ using the method of moments from Cochran's $Q$ statistic.
Why It Matters
The random-effects model is more realistic for ML. Different benchmarks, architectures, and experimental setups introduce genuine variability. The random-effects estimator gives wider (more honest) confidence intervals that account for this heterogeneity.
Failure Mode
The DerSimonian-Laird (DL) estimator is known to be anticonservative when $k$ is small or heterogeneity is large: the standard Wald-type confidence interval for $\mu$ has coverage below nominal, sometimes well below (Brockwell-Gordon 2001, IntHout et al. 2014). Two fixes are now standard:
- Better estimators. Restricted maximum likelihood (REML) and Paule-Mandel generally estimate $\tau^2$ more reliably than DL, especially when the DL estimate truncates to zero.
- Hartung-Knapp-Sidik-Jonkman (HKSJ) adjustment. Instead of a Wald interval, use a $t$-distributed test statistic (with $k-1$ degrees of freedom) together with a model-based variance estimator for $\hat{\mu}$. IntHout, Ioannidis, and Borm (2014) show that HKSJ restores coverage close to nominal across a wide range of scenarios. The Cochrane Handbook (2019) now recommends HKSJ + REML as the default random-effects setup.
Separately, the random-effects model assumes normality of the study-level effects, which is hard to verify when $k$ is small. With very few studies, any random-effects analysis is fragile and should be reported as exploratory.
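A bare-bones sketch of the DerSimonian-Laird moment estimator of $\tau^2$, the resulting random-effects combination, and a Hartung-Knapp-style $t$ interval. This is an illustration of the formulas above, not a substitute for a vetted package such as metafor:

```python
import numpy as np
from scipy import stats

def dersimonian_laird(estimates, std_errors, alpha=0.05):
    """Random-effects meta-analysis: DL estimate of tau^2 plus a
    Hartung-Knapp-style t interval (k - 1 degrees of freedom) for mu."""
    y = np.asarray(estimates, dtype=float)
    v = np.asarray(std_errors, dtype=float) ** 2
    k = len(y)
    w = 1.0 / v                                    # fixed-effect weights
    theta_fe = np.sum(w * y) / np.sum(w)
    Q = np.sum(w * (y - theta_fe) ** 2)            # Cochran's Q
    # DL moment estimator of tau^2, truncated at zero.
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - (k - 1)) / c)
    w_star = 1.0 / (v + tau2)                      # random-effects weights
    mu = np.sum(w_star * y) / np.sum(w_star)
    # Hartung-Knapp model-based variance estimator for mu.
    var_hk = np.sum(w_star * (y - mu) ** 2) / ((k - 1) * np.sum(w_star))
    half_width = stats.t.ppf(1 - alpha / 2, k - 1) * np.sqrt(var_hk)
    return mu, tau2, (mu - half_width, mu + half_width)

# Hypothetical example with four studies.
mu, tau2, ci = dersimonian_laird([0.021, 0.034, 0.010, -0.004],
                                 [0.010, 0.015, 0.008, 0.012])
print(f"mu = {mu:.4f}, tau^2 = {tau2:.6f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```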
Measuring Heterogeneity
Cochran's Q Statistic
Cochran's $Q$ statistic is
$$Q = \sum_{i=1}^{k} w_i (\hat{\theta}_i - \hat{\theta}_{FE})^2, \qquad w_i = 1/\sigma_i^2.$$
Under the null of no heterogeneity, $Q \sim \chi^2_{k-1}$ asymptotically and assuming within-study variances are known. For small $k$ the chi-squared approximation is poor, and $Q$ has low power to detect heterogeneity. Do not read a non-significant $Q$ as "no heterogeneity"; read it as "not enough evidence from $Q$ alone."
I-squared Statistic
Higgins and Thompson (2002) define
$$I^2 = \max\!\left(0, \frac{Q - (k-1)}{Q}\right) \times 100\%.$$
$I^2$ estimates the percentage of total variability due to between-study heterogeneity rather than sampling error. The original Higgins-Thompson thresholds ("low" 25%, "moderate" 50%, "high" 75%) are now deprecated. The Cochrane Handbook (2019, Chapter 10) recommends interpreting $I^2$ alongside the point estimate of $\tau^2$, a confidence interval for $I^2$, and an explicit prediction interval, because $I^2$ depends on within-study precision and can be inflated by very large studies even when substantive heterogeneity is small.
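A short self-contained sketch of the $Q$ and $I^2$ computations (inputs are per-study estimates and standard errors, as in the examples above):

```python
import numpy as np

def heterogeneity(estimates, std_errors):
    """Cochran's Q and Higgins-Thompson I^2 (in percent)."""
    y = np.asarray(estimates, dtype=float)
    w = 1.0 / np.asarray(std_errors, dtype=float) ** 2
    theta_fe = np.sum(w * y) / np.sum(w)           # fixed-effect estimate
    Q = np.sum(w * (y - theta_fe) ** 2)
    k = len(y)
    i2 = max(0.0, (Q - (k - 1)) / Q) * 100.0 if Q > 0 else 0.0
    return Q, i2
```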
Prediction Interval
A prediction interval for a random-effects meta-analysis gives the range within which the true effect in a new study is expected to fall with a given probability (typically 95%). Under the random-effects model $\theta_{\text{new}} \sim N(\mu, \tau^2)$, the 95% prediction interval is approximately $\hat{\mu} \pm t_{k-2}^{0.975} \sqrt{\hat{\tau}^2 + \widehat{\operatorname{SE}}(\hat{\mu})^2}$ (IntHout, Ioannidis, Rovers, Goeman 2016, "Plea for routinely presenting prediction intervals in meta-analysis"). The prediction interval is always wider than the confidence interval for $\mu$, because it adds $\hat{\tau}^2$ to the sampling variance. It is the right summary when the question is "what effect should we expect in the next study?" rather than "what is the average effect across studies we have seen?"
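A sketch of that formula, assuming the pooled estimate $\hat{\mu}$, its standard error, and $\hat{\tau}^2$ have already been computed (e.g., by the random-effects sketch above); it needs at least three studies:

```python
import numpy as np
from scipy import stats

def prediction_interval(mu, se_mu, tau2, k, level=0.95):
    """Approximate prediction interval for the true effect in a new study:
    mu +/- t_{k-2} * sqrt(tau^2 + SE(mu)^2)  (IntHout et al. 2016)."""
    t_crit = stats.t.ppf(0.5 + level / 2, df=k - 2)
    half_width = t_crit * np.sqrt(tau2 + se_mu ** 2)
    return mu - half_width, mu + half_width
```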
Combining p-Values
Fisher's method combines independent p-values when the effect sizes are not directly comparable (e.g., mixed metrics, different sample scales):
$$X^2 = -2 \sum_{i=1}^{k} \ln p_i \sim \chi^2_{2k}$$
under the null that all effects are zero. This tests whether there is any signal, but does not estimate the effect size. Stouffer's method (inverse-normal combination) gives similar power and allows weighting. Use these only when inverse-variance meta-analysis is unavailable; when effect sizes and their variances are available, prefer the fixed-effect or random-effects estimator above, which yields an effect size, not only a combined p-value.
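A minimal sketch of both combinations, assuming independent p-values (SciPy also ships scipy.stats.combine_pvalues for the same purpose):

```python
import numpy as np
from scipy import stats

def fishers_method(p_values):
    """Fisher's combination: X^2 = -2 * sum(log p_i) ~ chi^2 with 2k df."""
    p = np.asarray(p_values, dtype=float)
    x2 = -2.0 * np.sum(np.log(p))
    return stats.chi2.sf(x2, df=2 * len(p))       # combined p-value

def stouffers_method(p_values, weights=None):
    """Stouffer's inverse-normal combination of one-sided p-values."""
    p = np.asarray(p_values, dtype=float)
    z = stats.norm.isf(p)                          # per-study z-scores
    w = np.ones_like(z) if weights is None else np.asarray(weights, float)
    z_comb = np.sum(w * z) / np.sqrt(np.sum(w ** 2))
    return stats.norm.sf(z_comb)
```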
Publication Bias
Studies with statistically significant results are more likely to be published. This creates a biased sample of the literature.
Funnel plot: Plot effect size against study precision (or sample size). Under no bias, the plot should be symmetric around the combined estimate. Asymmetry suggests smaller, less precise studies with null results are missing.
Egger's regression test (Egger, Davey Smith, Schneider, Minder 1997) formalizes funnel-plot asymmetry: regress the standardized effect size on its precision and test whether the intercept differs from zero. A non-zero intercept is evidence of small-study effects. Begg and Mazumdar's rank test is an alternative with lower power but fewer assumptions. Sterne et al. (2011, BMJ) give the standard diagnostic recommendations.
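A sketch of Egger's test in the form described above; it uses scipy.stats.linregress (the intercept_stderr field requires SciPy 1.7 or later):

```python
import numpy as np
from scipy import stats

def eggers_test(estimates, std_errors):
    """Egger's regression test for funnel-plot asymmetry:
    regress theta_i / se_i on 1 / se_i and test whether the intercept is zero."""
    y = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    res = stats.linregress(1.0 / se, y / se)
    t_stat = res.intercept / res.intercept_stderr
    p_value = 2 * stats.t.sf(abs(t_stat), df=len(y) - 2)
    return res.intercept, p_value
```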
Trim and fill (Duval and Tweedie 2000) imputes missing studies to make the funnel plot symmetric, then recomputes the combined estimate. Useful for sensitivity analysis, not definitive proof of bias.
PET-PEESE (precision-effect test and precision-effect estimate with standard error) uses a meta-regression on study standard errors (PET) or variances (PEESE) to extrapolate to an "infinite-precision" effect size. It is controversial (Stanley-Doucouliagos 2014 vs. critics) but widely used in economics and psychology as a bias-adjusted estimator.
Leave-one-out (influence) analysis: refit the meta-analysis $k$ times, each time excluding one study, and report how $\hat{\mu}$ moves. Studies whose removal shifts the estimate materially are "influential"; if one such study is also the most precise, the aggregate is driven by a single trial and should be flagged.
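A sketch of the leave-one-out loop; it takes any pooling function (e.g., the fixed_effect or dersimonian_laird sketches above) whose first return value is the pooled estimate:

```python
import numpy as np

def leave_one_out(estimates, std_errors, combine):
    """Refit the meta-analysis k times, each time dropping one study.
    Returns one pooled estimate per excluded study."""
    y = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    results = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i              # drop study i
        results.append(combine(y[mask], se[mask])[0])
    return np.array(results)
```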
Effect-Size Metrics
For continuous outcomes, the standard effect sizes are Cohen's $d = (\bar{x}_1 - \bar{x}_2)/s_p$ (using a pooled standard deviation $s_p$) and Hedges's $g = J \cdot d$, the small-sample-bias-corrected version with Hedges's correction factor $J \approx 1 - 3/(4(n_1 + n_2) - 9)$ (Hedges 1981). Use $g$ when any individual study has fewer than roughly 20 samples per group. For binary outcomes, the standard choices are the log odds ratio, log risk ratio, and risk difference. ML papers can usually treat accuracy differences with their standard errors directly and compute effect sizes on demand.
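A small sketch of the Hedges's $g$ computation for two independent samples (variable names are illustrative):

```python
import numpy as np

def hedges_g(x1, x2):
    """Hedges's g: small-sample-bias-corrected Cohen's d."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = len(x1), len(x2)
    # Pooled standard deviation from the two unbiased sample variances.
    s_p = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1))
                  / (n1 + n2 - 2))
    d = (x1.mean() - x2.mean()) / s_p              # Cohen's d
    J = 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)        # Hedges correction factor
    return J * d
```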
Meta-Regression
When heterogeneity exists, meta-regression models the effect size as a function of study-level covariates. For example: effect = baseline + coefficient * dataset_size. This is analogous to a weighted regression where each study contributes one data point.
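A sketch of the fixed-effect version of that weighted regression (inverse-variance weights, one study-level covariate); a random-effects meta-regression would additionally add $\hat{\tau}^2$ to each study's variance before weighting:

```python
import numpy as np

def meta_regression(estimates, std_errors, covariate):
    """Weighted least squares of study effects on one study-level covariate,
    with weights w_i = 1 / sigma_i^2. Returns [baseline, coefficient]."""
    y = np.asarray(estimates, dtype=float)
    w = 1.0 / np.asarray(std_errors, dtype=float) ** 2
    X = np.column_stack([np.ones_like(y), np.asarray(covariate, float)])
    WX = X * w[:, None]
    # Closed-form WLS: beta = (X' W X)^{-1} X' W y
    beta = np.linalg.solve(X.T @ WX, WX.T @ y)
    return beta
```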
Network Meta-Analysis
Pairwise meta-analysis combines studies of two interventions (A vs B). Network meta-analysis (NMA) combines evidence across a connected graph of interventions (A vs B, A vs C, B vs C, ...) to estimate all pairwise contrasts, including those never directly compared, under a consistency assumption between direct and indirect evidence. In ML it applies when a benchmark table reports many methods evaluated on overlapping subsets of tasks: NMA lets you recover a full ranking from incomplete head-to-heads. See Salanti (2012) for a readable introduction.
Connection to ML
In ML, meta-analysis appears in several guises:
- Benchmark aggregation: Comparing methods across multiple datasets. A table of accuracy numbers across 10 datasets is raw material for meta-analysis.
- Systematic reviews: Papers like "A Survey of X" often implicitly perform meta-analysis when summarizing results.
- Hyperparameter sensitivity: How much does performance vary across random seeds, datasets, or hyperparameter choices? This is heterogeneity analysis.
- Reproducibility studies: When a result fails to replicate, formal meta-analysis of the original and replication attempts is the correct way to update your belief.
Common Confusions
Fixed-effect does not mean fixed effects in the regression sense
In meta-analysis, "fixed-effect" means assuming one true effect across all studies. This is different from "fixed effects" in panel data regression (which means study-specific intercepts). The terminology collision is unfortunate.
I-squared does not measure the magnitude of heterogeneity
$I^2 = 75\%$ means 75% of the observed variability is due to between-study differences, not that the effect sizes vary by 75%. A set of studies could have $I^2 = 75\%$ with all effects between 0.50 and 0.55. The prediction interval (not the confidence interval) gives the range of plausible effects.
Summary
- Fixed-effect model: one true effect, weight by inverse variance
- Random-effects model: effects vary across studies, accounts for between-study variance
- Heterogeneity: measured by $Q$ and $I^2$. High $I^2$ means the effect is not consistent
- Publication bias: funnel plots and trim-and-fill for diagnosis
- For ML: treat benchmark tables as raw meta-analytic data and think critically about missing results
Exercises
Problem
Three studies estimate an effect with estimates $\hat{\theta}_1, \hat{\theta}_2, \hat{\theta}_3$ and standard errors $\sigma_1, \sigma_2, \sigma_3$. Compute the fixed-effect combined estimate $\hat{\theta}_{FE}$ and its standard error.
Problem
Explain why the random-effects combined estimate moves toward the unweighted average of study estimates as the between-study variance $\tau^2$ grows. Under what conditions are the fixed-effect and random-effects combined estimates equal?
References
Canonical:
- Borenstein, Hedges, Higgins, Rothstein, Introduction to Meta-Analysis (2009), Chapters 1-16
- DerSimonian & Laird, "Meta-Analysis in Clinical Trials" (1986), Controlled Clinical Trials
Current:
- Higgins et al., Cochrane Handbook for Systematic Reviews of Interventions (2019), Chapter 10
- Bouthillier et al., "Accounting for Variance in Machine Learning Benchmarks" (MLSys 2021)
- Higgins and Thompson, "Quantifying heterogeneity in a meta-analysis" (Statistics in Medicine 2002; definition of $I^2$)
- IntHout, Ioannidis, Borm, "The Hartung-Knapp-Sidik-Jonkman method for random effects meta-analysis is straightforward and considerably outperforms the standard DerSimonian-Laird method" (BMC Med. Res. Methodol. 2014)
- IntHout, Ioannidis, Rovers, Goeman, "Plea for routinely presenting prediction intervals in meta-analysis" (BMJ Open 2016)
- Viechtbauer, "Conducting Meta-Analyses in R with the metafor Package" (J. Stat. Soft. 2010)
- Sterne et al., "Recommendations for examining and interpreting funnel plot asymmetry in meta-analyses of randomised controlled trials" (BMJ 2011)
- Egger, Davey Smith, Schneider, Minder, "Bias in meta-analysis detected by a simple, graphical test" (BMJ 1997)
- Hedges, "Distribution theory for Glass's estimator of effect size and related estimators" (J. Educational Statistics 1981)
Next Topics
- Reproducibility and experimental rigor: why replication matters and how to do it
- P-hacking and multiple testing: the statistical sins that meta-analysis must contend with
Last reviewed: April 18, 2026