Methodology
Meta-Analysis
Combining results from multiple studies: fixed-effect and random-effects models, heterogeneity measures, publication bias, and why this matters for ML benchmarking.
Why This Matters
ML papers report results on benchmarks. Different papers use different datasets, different hyperparameters, different compute budgets. When five papers compare method A to method B and get different effect sizes, how do you draw a conclusion? Meta-analysis provides the formal framework for combining evidence across studies.
The same problems that plague medical meta-analysis (publication bias, heterogeneity, p-hacking) plague ML. Papers that show improvement get published; papers that show no improvement do not. Understanding meta-analysis helps you read the ML literature critically.
Setup and Notation
Suppose we have $k$ studies, each estimating some effect size $\theta$, with study $i$ reporting estimate $\hat{\theta}_i$ and variance $\sigma_i^2$ (assumed known or well-estimated).
Effect Size
The effect size is the quantity each study estimates. In medical research, this might be a treatment effect. In ML, it is typically the performance difference between two methods (e.g., accuracy of method A minus accuracy of method B) on a particular benchmark.
Inverse-Variance Weight
Studies with smaller variance (more precise estimates) get more weight. The inverse-variance weight for study $i$ is $w_i = 1/\sigma_i^2$.
Fixed-Effect Model
The fixed-effect model assumes all studies estimate the same true effect $\theta$. Differences across studies are due to sampling variability only.
Fixed-Effect Meta-Analytic Estimator
Statement
The optimal combined estimator under the fixed-effect model is:
$$\hat{\theta}_{FE} = \frac{\sum_{i=1}^{k} w_i \hat{\theta}_i}{\sum_{i=1}^{k} w_i}$$
with variance $\operatorname{Var}(\hat{\theta}_{FE}) = 1 / \sum_{i=1}^{k} w_i$, where $w_i = 1/\sigma_i^2$.
Intuition
This is a weighted average of the study estimates, weighting each by its precision. More precise studies contribute more. If all studies had equal precision, it reduces to the simple average.
Proof Sketch
This is the minimum-variance unbiased estimator for the common mean given independent estimates with known variances. It follows from the general theory of weighted least squares, or equivalently, from maximum likelihood under Gaussian sampling distributions.
Why It Matters
The fixed-effect estimator is the simplest way to combine results. Its variance is $1/\sum_i w_i$, which shrinks as studies accumulate, so adding more studies always increases precision. But the model is only valid when the true effect is the same across all studies.
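A minimal sketch of the computation, assuming each study reports an estimate and a standard error (the function and example numbers are illustrative, not taken from any particular library or paper):

```python
import numpy as np

def fixed_effect(estimates, std_errors):
    """Inverse-variance-weighted (fixed-effect) pooled estimate.

    estimates, std_errors: 1-D arrays, one entry per study.
    Returns the pooled estimate and its standard error.
    """
    y = np.asarray(estimates, dtype=float)
    variances = np.asarray(std_errors, dtype=float) ** 2
    weights = 1.0 / variances                      # w_i = 1 / sigma_i^2
    pooled = np.sum(weights * y) / np.sum(weights)
    pooled_se = np.sqrt(1.0 / np.sum(weights))
    return pooled, pooled_se

# Hypothetical example: three studies reporting an accuracy difference A - B.
theta_hat, se = fixed_effect([0.021, 0.034, 0.010], [0.010, 0.015, 0.008])
print(f"pooled effect = {theta_hat:.4f} +/- {1.96 * se:.4f}")
```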
Failure Mode
If the true effect varies across studies (heterogeneity), the fixed-effect model underestimates the uncertainty. The confidence interval is too narrow, and you get false precision. In ML, different datasets, architectures, and compute budgets almost certainly introduce heterogeneity.
Random-Effects Model
The random-effects model assumes each study has its own true effect $\theta_i$, drawn from a distribution with mean $\mu$ and between-study variance $\tau^2$ (commonly $\theta_i \sim N(\mu, \tau^2)$).
The goal is to estimate $\mu$ (the average effect across the population of studies).
DerSimonian-Laird Random-Effects Estimator
Statement
The random-effects combined estimator is:
$$\hat{\mu}_{RE} = \frac{\sum_{i=1}^{k} w_i^* \hat{\theta}_i}{\sum_{i=1}^{k} w_i^*}$$
where $w_i^* = 1/(\sigma_i^2 + \hat{\tau}^2)$ and $\hat{\tau}^2$ is estimated from the data.
Intuition
The weights now account for both within-study variance $\sigma_i^2$ and between-study variance $\tau^2$. When $\tau^2$ is large, all weights become more equal because between-study variability dominates within-study precision. When $\hat{\tau}^2 = 0$, the random-effects model reduces to the fixed-effect model.
Proof Sketch
The marginal distribution of $\hat{\theta}_i$ is $N(\mu, \sigma_i^2 + \tau^2)$. Apply inverse-variance weighting with the total variance $\sigma_i^2 + \tau^2$. The DerSimonian-Laird method estimates $\tau^2$ using the method of moments from Cochran's $Q$ statistic.
Why It Matters
The random-effects model is more realistic for ML. Different benchmarks, architectures, and experimental setups introduce genuine variability. The random-effects estimator gives wider (more honest) confidence intervals that account for this heterogeneity.
Failure Mode
The DerSimonian-Laird (DL) estimator is known to be anticonservative when $k$ is small or heterogeneity is large: the standard Wald-type confidence interval for $\mu$ has coverage below nominal, sometimes well below (Brockwell-Gordon 2001, IntHout et al. 2014). Two fixes are now standard:
- Better estimators. Restricted maximum likelihood (REML) and Paule-Mandel generally estimate $\tau^2$ more reliably than DL, especially when the DL estimate truncates to zero.
- Hartung-Knapp-Sidik-Jonkman (HKSJ) adjustment. Instead of a Wald interval, use a $t$-distributed test statistic (with $k-1$ degrees of freedom) together with a model-based variance estimator for $\hat{\mu}$. IntHout, Ioannidis, and Borm (2014) show that HKSJ restores coverage close to nominal across a wide range of scenarios. The Cochrane Handbook (2019) now recommends HKSJ + REML as the default random-effects setup.
Separately, the random-effects model assumes normality of the study-level effects, which is hard to verify when $k$ is small. With very few studies, any random-effects analysis is fragile and should be reported as exploratory.
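A bare-bones sketch of the DerSimonian-Laird moment estimator of $\tau^2$, the resulting random-effects combination, and a Hartung-Knapp-style $t$ interval. This is an illustration of the formulas above, not a substitute for a vetted package such as metafor:

```python
import numpy as np
from scipy import stats

def dersimonian_laird(estimates, std_errors, alpha=0.05):
    """Random-effects meta-analysis: DL estimate of tau^2 plus a
    Hartung-Knapp-style t interval (k - 1 degrees of freedom) for mu."""
    y = np.asarray(estimates, dtype=float)
    v = np.asarray(std_errors, dtype=float) ** 2
    k = len(y)
    w = 1.0 / v                                    # fixed-effect weights
    theta_fe = np.sum(w * y) / np.sum(w)
    Q = np.sum(w * (y - theta_fe) ** 2)            # Cochran's Q
    # DL moment estimator of tau^2, truncated at zero.
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - (k - 1)) / c)
    w_star = 1.0 / (v + tau2)                      # random-effects weights
    mu = np.sum(w_star * y) / np.sum(w_star)
    # Hartung-Knapp model-based variance estimator for mu.
    var_hk = np.sum(w_star * (y - mu) ** 2) / ((k - 1) * np.sum(w_star))
    half_width = stats.t.ppf(1 - alpha / 2, k - 1) * np.sqrt(var_hk)
    return mu, tau2, (mu - half_width, mu + half_width)

# Hypothetical example with four studies.
mu, tau2, ci = dersimonian_laird([0.021, 0.034, 0.010, -0.004],
                                 [0.010, 0.015, 0.008, 0.012])
print(f"mu = {mu:.4f}, tau^2 = {tau2:.6f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```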
Measuring Heterogeneity
Cochran's Q Statistic
Cochran's $Q$ statistic is
$$Q = \sum_{i=1}^{k} w_i (\hat{\theta}_i - \hat{\theta}_{FE})^2, \qquad w_i = 1/\sigma_i^2.$$
Under the null of no heterogeneity, $Q \sim \chi^2_{k-1}$ asymptotically and assuming within-study variances are known. For small $k$ the chi-squared approximation is poor, and $Q$ has low power to detect heterogeneity. Do not read a non-significant $Q$ as "no heterogeneity"; read it as "not enough evidence from $Q$ alone."
I-squared Statistic
Higgins and Thompson (2002) define
$$I^2 = \max\!\left(0, \frac{Q - (k-1)}{Q}\right) \times 100\%.$$
$I^2$ estimates the percentage of total variability due to between-study heterogeneity rather than sampling error. The original Higgins-Thompson thresholds ("low" 25%, "moderate" 50%, "high" 75%) are now deprecated. The Cochrane Handbook (2019, Chapter 10) recommends interpreting $I^2$ alongside the point estimate of $\tau^2$, a confidence interval for $I^2$, and an explicit prediction interval, because $I^2$ depends on within-study precision and can be inflated by very large studies even when substantive heterogeneity is small.
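A short self-contained sketch of the $Q$ and $I^2$ computations (inputs are per-study estimates and standard errors, as in the examples above):

```python
import numpy as np

def heterogeneity(estimates, std_errors):
    """Cochran's Q and Higgins-Thompson I^2 (in percent)."""
    y = np.asarray(estimates, dtype=float)
    w = 1.0 / np.asarray(std_errors, dtype=float) ** 2
    theta_fe = np.sum(w * y) / np.sum(w)           # fixed-effect estimate
    Q = np.sum(w * (y - theta_fe) ** 2)
    k = len(y)
    i2 = max(0.0, (Q - (k - 1)) / Q) * 100.0 if Q > 0 else 0.0
    return Q, i2
```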
Prediction Interval
A prediction interval for a random-effects meta-analysis gives the range within which the true effect in a new study is expected to fall with a given probability (typically 95%). Under the random-effects model $\theta_{\text{new}} \sim N(\mu, \tau^2)$, the 95% prediction interval is approximately $\hat{\mu} \pm t_{k-2}^{0.975} \sqrt{\hat{\tau}^2 + \widehat{\operatorname{SE}}(\hat{\mu})^2}$ (IntHout, Ioannidis, Rovers, Goeman 2016, "Plea for routinely presenting prediction intervals in meta-analysis"). The prediction interval is always wider than the confidence interval for $\mu$, because it adds $\hat{\tau}^2$ to the sampling variance. It is the right summary when the question is "what effect should we expect in the next study?" rather than "what is the average effect across studies we have seen?"
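A sketch of that formula, assuming the pooled estimate $\hat{\mu}$, its standard error, and $\hat{\tau}^2$ have already been computed (e.g., by the random-effects sketch above); it needs at least three studies:

```python
import numpy as np
from scipy import stats

def prediction_interval(mu, se_mu, tau2, k, level=0.95):
    """Approximate prediction interval for the true effect in a new study:
    mu +/- t_{k-2} * sqrt(tau^2 + SE(mu)^2)  (IntHout et al. 2016)."""
    t_crit = stats.t.ppf(0.5 + level / 2, df=k - 2)
    half_width = t_crit * np.sqrt(tau2 + se_mu ** 2)
    return mu - half_width, mu + half_width
```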
Combining p-Values
Fisher's method combines independent p-values when the effect sizes are not directly comparable (e.g., mixed metrics, different sample scales):
$$X^2 = -2 \sum_{i=1}^{k} \ln p_i \sim \chi^2_{2k}$$
under the null that all effects are zero. This tests whether there is any signal, but does not estimate the effect size. Stouffer's method (inverse-normal combination) gives similar power and allows weighting. Use these only when inverse-variance meta-analysis is unavailable; when effect sizes and their variances are available, prefer the fixed-effect or random-effects estimator above, which yields an effect size, not only a combined p-value.
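A minimal sketch of both combinations, assuming independent p-values (SciPy also ships scipy.stats.combine_pvalues for the same purpose):

```python
import numpy as np
from scipy import stats

def fishers_method(p_values):
    """Fisher's combination: X^2 = -2 * sum(log p_i) ~ chi^2 with 2k df."""
    p = np.asarray(p_values, dtype=float)
    x2 = -2.0 * np.sum(np.log(p))
    return stats.chi2.sf(x2, df=2 * len(p))       # combined p-value

def stouffers_method(p_values, weights=None):
    """Stouffer's inverse-normal combination of one-sided p-values."""
    p = np.asarray(p_values, dtype=float)
    z = stats.norm.isf(p)                          # per-study z-scores
    w = np.ones_like(z) if weights is None else np.asarray(weights, float)
    z_comb = np.sum(w * z) / np.sqrt(np.sum(w ** 2))
    return stats.norm.sf(z_comb)
```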
Publication Bias
Studies with statistically significant results are more likely to be published. This creates a biased sample of the literature.
Funnel plot: Plot effect size against study precision (or sample size). Under no bias, the plot should be symmetric around the combined estimate. Asymmetry suggests smaller, less precise studies with null results are missing.
Egger's regression test (Egger, Davey Smith, Schneider, Minder 1997) formalizes funnel-plot asymmetry: regress the standardized effect size on its precision and test whether the intercept differs from zero. A non-zero intercept is evidence of small-study effects. Begg and Mazumdar's rank test is an alternative with lower power but fewer assumptions. Sterne et al. (2011, BMJ) give the standard diagnostic recommendations.
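A sketch of Egger's test in the form described above; it uses scipy.stats.linregress (the intercept_stderr field requires SciPy 1.7 or later):

```python
import numpy as np
from scipy import stats

def eggers_test(estimates, std_errors):
    """Egger's regression test for funnel-plot asymmetry:
    regress theta_i / se_i on 1 / se_i and test whether the intercept is zero."""
    y = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    res = stats.linregress(1.0 / se, y / se)
    t_stat = res.intercept / res.intercept_stderr
    p_value = 2 * stats.t.sf(abs(t_stat), df=len(y) - 2)
    return res.intercept, p_value
```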
Trim and fill (Duval and Tweedie 2000) imputes missing studies to make the funnel plot symmetric, then recomputes the combined estimate. Useful for sensitivity analysis, not definitive proof of bias.
PET-PEESE (precision-effect test and precision-effect estimate with standard error) uses a meta-regression on study standard errors (PET) or variances (PEESE) to extrapolate to an "infinite-precision" effect size. It is controversial (Stanley-Doucouliagos 2014 vs. critics) but widely used in economics and psychology as a bias-adjusted estimator.
Leave-one-out (influence) analysis: refit the meta-analysis $k$ times, each time excluding one study, and report how $\hat{\mu}$ moves. Studies whose removal shifts the estimate materially are "influential"; if one such study is also the most precise, the aggregate is driven by a single trial and should be flagged.
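A sketch of the leave-one-out loop; it takes any pooling function (e.g., the fixed_effect or dersimonian_laird sketches above) whose first return value is the pooled estimate:

```python
import numpy as np

def leave_one_out(estimates, std_errors, combine):
    """Refit the meta-analysis k times, each time dropping one study.
    Returns one pooled estimate per excluded study."""
    y = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    results = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i              # drop study i
        results.append(combine(y[mask], se[mask])[0])
    return np.array(results)
```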
Effect-Size Metrics
For continuous outcomes, the standard effect sizes are Cohen's $d = (\bar{x}_1 - \bar{x}_2)/s_p$ (using a pooled standard deviation $s_p$) and Hedges's $g = J \cdot d$, the small-sample-bias-corrected version with Hedges's correction factor $J \approx 1 - 3/(4(n_1 + n_2) - 9)$ (Hedges 1981). Use $g$ when any individual study has fewer than roughly 20 samples per group. For binary outcomes, the standard choices are the log odds ratio, log risk ratio, and risk difference. ML papers can usually treat accuracy differences with their standard errors directly and compute effect sizes on demand.
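A small sketch of the Hedges's $g$ computation for two independent samples (variable names are illustrative):

```python
import numpy as np

def hedges_g(x1, x2):
    """Hedges's g: small-sample-bias-corrected Cohen's d."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = len(x1), len(x2)
    # Pooled standard deviation from the two unbiased sample variances.
    s_p = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1))
                  / (n1 + n2 - 2))
    d = (x1.mean() - x2.mean()) / s_p              # Cohen's d
    J = 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)        # Hedges correction factor
    return J * d
```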
Meta-Regression
When heterogeneity exists, meta-regression models the effect size as a function of study-level covariates. For example: effect = baseline + coefficient * dataset_size. This is analogous to a weighted regression where each study contributes one data point.
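A sketch of the fixed-effect version of that weighted regression (inverse-variance weights, one study-level covariate); a random-effects meta-regression would additionally add $\hat{\tau}^2$ to each study's variance before weighting:

```python
import numpy as np

def meta_regression(estimates, std_errors, covariate):
    """Weighted least squares of study effects on one study-level covariate,
    with weights w_i = 1 / sigma_i^2. Returns [baseline, coefficient]."""
    y = np.asarray(estimates, dtype=float)
    w = 1.0 / np.asarray(std_errors, dtype=float) ** 2
    X = np.column_stack([np.ones_like(y), np.asarray(covariate, float)])
    WX = X * w[:, None]
    # Closed-form WLS: beta = (X' W X)^{-1} X' W y
    beta = np.linalg.solve(X.T @ WX, WX.T @ y)
    return beta
```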
Network Meta-Analysis
Pairwise meta-analysis combines studies of two interventions (A vs B). Network meta-analysis (NMA) combines evidence across a connected graph of interventions (A vs B, A vs C, B vs C, ...) to estimate all pairwise contrasts, including those never directly compared, under a consistency assumption between direct and indirect evidence. In ML it applies when a benchmark table reports many methods evaluated on overlapping subsets of tasks: NMA lets you recover a full ranking from incomplete head-to-heads. See Salanti (2012) for a readable introduction.
Connection to ML
In ML, meta-analysis appears in several guises:
- Benchmark aggregation: Comparing methods across multiple datasets. A table of accuracy numbers across 10 datasets is raw material for meta-analysis.
- Systematic reviews: Papers like "A Survey of X" often implicitly perform meta-analysis when summarizing results.
- Hyperparameter sensitivity: How much does performance vary across random seeds, datasets, or hyperparameter choices? This is heterogeneity analysis.
- Reproducibility studies: When a result fails to replicate, formal meta-analysis of the original and replication attempts is the correct way to update your belief.
Common Confusions
Fixed-effect does not mean fixed effects in the regression sense
In meta-analysis, "fixed-effect" means assuming one true effect across all studies. This is different from "fixed effects" in panel data regression (which means study-specific intercepts). The terminology collision is unfortunate.
I-squared does not measure the magnitude of heterogeneity
$I^2 = 75\%$ means 75% of the observed variability is due to between-study differences, not that the effect sizes vary by 75%. A set of studies could have $I^2 = 75\%$ with all effects between 0.50 and 0.55. The prediction interval (not the confidence interval) gives the range of plausible effects.
Summary
- Fixed-effect model: one true effect, weight by inverse variance
- Random-effects model: effects vary across studies, accounts for between-study variance
- Heterogeneity: measured by $Q$ and $I^2$. High $I^2$ means the effect is not consistent
- Publication bias: funnel plots and trim-and-fill for diagnosis
- For ML: treat benchmark tables as raw meta-analytic data and think critically about missing results
Exercises
Problem
Three studies estimate an effect with estimates $\hat{\theta}_1, \hat{\theta}_2, \hat{\theta}_3$ and standard errors $\sigma_1, \sigma_2, \sigma_3$. Compute the fixed-effect combined estimate $\hat{\theta}_{FE}$ and its standard error.
Problem
Explain why the random-effects combined estimate moves toward the unweighted average of study estimates as the between-study variance $\tau^2$ grows. Under what conditions are the fixed-effect and random-effects combined estimates equal?
References
Canonical:
- Borenstein, Hedges, Higgins, Rothstein, Introduction to Meta-Analysis (2009), Chapters 1-16
- DerSimonian & Laird, "Meta-Analysis in Clinical Trials" (1986), Controlled Clinical Trials
Current:
- Higgins et al., Cochrane Handbook for Systematic Reviews of Interventions (2019), Chapter 10
- Bouthillier et al., "Accounting for Variance in Machine Learning Benchmarks" (MLSys 2021)
- Higgins and Thompson, "Quantifying heterogeneity in a meta-analysis" (Statistics in Medicine 2002; definition of $I^2$)
- IntHout, Ioannidis, Borm, "The Hartung-Knapp-Sidik-Jonkman method for random effects meta-analysis is straightforward and considerably outperforms the standard DerSimonian-Laird method" (BMC Med. Res. Methodol. 2014)
- IntHout, Ioannidis, Rovers, Goeman, "Plea for routinely presenting prediction intervals in meta-analysis" (BMJ Open 2016)
- Viechtbauer, "Conducting Meta-Analyses in R with the metafor Package" (J. Stat. Soft. 2010)
- Sterne et al., "Recommendations for examining and interpreting funnel plot asymmetry in meta-analyses of randomised controlled trials" (BMJ 2011)
- Egger, Davey Smith, Schneider, Minder, "Bias in meta-analysis detected by a simple, graphical test" (BMJ 1997)
- Hedges, "Distribution theory for Glass's estimator of effect size and related estimators" (J. Educational Statistics 1981)
Next Topics
- Reproducibility and experimental rigor: why replication matters and how to do it
- P-hacking and multiple testing: the statistical sins that meta-analysis must contend with
Last reviewed: April 18, 2026