
Bagging

Bootstrap Aggregation: train B models on bootstrap samples and average their predictions. Why averaging reduces variance, how correlation limits the gain, and out-of-bag error estimation.


Why This Matters

Bagging: bootstrap samples with replacement, train independent models, average predictions

(Figure: an original sample of n = 8 points, labeled 1-8, is drawn with replacement into three bootstrap samples. Bootstrap 1 = {1, 1, 3, 4, 4, 6, 7, 8} repeats 1 and 4 and misses 2 and 5; bootstrap 2 = {2, 2, 3, 5, 5, 6, 7, 7} repeats 2, 5, and 7 and misses 1, 4, and 8; bootstrap 3 = {1, 3, 3, 4, 5, 6, 8, 8} repeats 3 and 8 and misses 2 and 7. The three models predict 2.3, 2.8, and 2.1; the bagged prediction is their average, (2.3 + 2.8 + 2.1)/3 = 2.4.)

Bagging is the simplest ensemble method and the foundation for random forests. The idea is so clean it can be stated in one sentence: train many copies of an unstable learner on different bootstrap samples and average the results. Yet this simple trick can dramatically improve prediction accuracy for high-variance models like decision trees.

Understanding bagging teaches you the fundamental principle behind all averaging ensembles: variance reduction through diversity.

Mental Model

Imagine you ask 100 people to independently estimate the number of jellybeans in a jar. Each individual estimate is noisy, but their average is remarkably close to the truth. This is the wisdom-of-crowds effect, and it works because individual errors are partially independent and cancel when averaged.

Bagging does the same thing with models. Each bootstrap sample creates a slightly different training set, producing a slightly different model. The models make different errors, and averaging smooths these errors out.

The Bagging Algorithm

Definition

Bagging (Bootstrap Aggregation)

Given training data \{(x_i, y_i)\}_{i=1}^n and a base learner \hat{f}:

  1. For b = 1, \ldots, B:
    • Draw a bootstrap sample S_b of size n with replacement from the training data
    • Train a base learner \hat{f}_b on S_b
  2. The bagged prediction is:

Regression: \hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x)

Classification: \hat{f}_{\text{bag}}(x) = \text{majority vote of } \hat{f}_1(x), \ldots, \hat{f}_B(x)

The base learner is typically a decision tree grown without pruning (high variance, low bias). But bagging can be applied to any learner.
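The two steps above are short enough to implement directly. The following is an illustrative NumPy sketch, not any library's API; `fit_stump` is a hypothetical one-split base learner included only to make the example self-contained, and a classifier would replace the final mean with a majority vote.

```python
import numpy as np

def bag_predict(X, y, X_test, fit_fn, B=100, seed=0):
    """Bagging for regression: train B models on bootstrap samples
    and average their predictions at X_test."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.empty((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)            # step 1: sample n indices with replacement
        preds[b] = fit_fn(X[idx], y[idx])(X_test)   # train a base learner on the bootstrap sample
    return preds.mean(axis=0)                       # step 2: average the B predictions

def fit_stump(Xb, yb):
    """Hypothetical base learner: a one-split regression stump on 1-D inputs.
    Returns a predict(X) callable."""
    best_sse, best = np.inf, None
    for s in np.unique(Xb)[:-1]:                    # candidate split points
        left, right = yb[Xb <= s], yb[Xb > s]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if sse < best_sse:
            best_sse, best = sse, (s, left.mean(), right.mean())
    s, m_left, m_right = best
    return lambda X: np.where(X <= s, m_left, m_right)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, 200)
y = np.sin(X) + rng.normal(0, 0.3, 200)
X_test = np.linspace(-3, 3, 50)
y_bag = bag_predict(X, y, X_test, fit_stump, B=50)
```

Note that `bag_predict` is agnostic to the base learner: any `fit_fn` returning a predict callable can be plugged in.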

Why Bagging Reduces Variance

Theorem

Variance of Averaged Estimators

Statement

Let \hat{f}_1, \ldots, \hat{f}_B be B estimators, each with variance \sigma^2 and pairwise correlation \rho. The variance of their average is:

\text{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b\right) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2

Special cases:

  • If \rho = 0 (independent estimators): variance = \sigma^2/B, a perfect 1/B reduction
  • If \rho = 1 (identical estimators): variance = \sigma^2, no reduction
  • As B \to \infty: variance \to \rho\sigma^2, the irreducible limit
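The formula and its limits are easy to check numerically. The sketch below also verifies it by Monte Carlo, constructing correlated estimators as \sqrt{\rho}\,Z + \sqrt{1-\rho}\,E_b with a shared component Z; that construction is an assumption made purely to induce pairwise correlation \rho.

```python
import numpy as np

def ensemble_variance(sigma2, rho, B):
    # variance of the average of B estimators, each with variance sigma2
    # and pairwise correlation rho
    return rho * sigma2 + (1 - rho) * sigma2 / B

sigma2, rho, B = 9.0, 0.3, 100
print(ensemble_variance(sigma2, rho, B))   # 2.763, already close to the rho*sigma2 floor of 2.7

# Monte Carlo check: build B estimators with the stated variance and correlation
rng = np.random.default_rng(0)
trials = 50_000
Z = rng.normal(0, np.sqrt(sigma2), size=(trials, 1))   # shared component -> correlation rho
E = rng.normal(0, np.sqrt(sigma2), size=(trials, B))   # independent components
F = np.sqrt(rho) * Z + np.sqrt(1 - rho) * E            # each column: Var = sigma2, Corr = rho
print(F.mean(axis=1).var())                            # ≈ 2.763
```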

Intuition

Averaging independent noisy estimates reduces noise by a factor of 1/B. But bagged estimators are not independent: they are trained on overlapping bootstrap samples. For two independent bootstrap samples S_1, S_2 of size n and a fixed training point x_i, in the large-n limit:

  • the probability x_i appears in S_1 is 1 - (1-1/n)^n \to 1 - 1/e \approx 0.632
  • the probability it appears in both (the natural pairwise-overlap quantity) is (1 - 1/e)^2 \approx 0.400
  • the probability it appears in the union is 1 - 1/e^2 \approx 0.865
  • the probability it appears in exactly one is 2(1/e)(1 - 1/e) \approx 0.465

So roughly 40% of training points are shared between any two bootstrap samples, and this overlap induces correlation \rho > 0 in the fitted models. This creates a floor below which variance cannot fall, no matter how many models you average.
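These limiting probabilities are easy to confirm by simulation; a sketch with n = 1000 and repeated pairs of bootstrap draws:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 5000
in1 = np.empty(trials, dtype=bool)
in2 = np.empty(trials, dtype=bool)
for t in range(trials):
    # does training point 0 appear in each of two independent bootstrap samples?
    in1[t] = bool((rng.integers(0, n, size=n) == 0).any())
    in2[t] = bool((rng.integers(0, n, size=n) == 0).any())

print(in1.mean())           # ≈ 1 - 1/e      ≈ 0.632
print((in1 & in2).mean())   # ≈ (1 - 1/e)^2  ≈ 0.400
print((in1 | in2).mean())   # ≈ 1 - 1/e^2    ≈ 0.865
print((in1 ^ in2).mean())   # ≈ 2(1/e)(1-1/e) ≈ 0.465
```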

Proof Sketch

\text{Var}\left(\frac{1}{B}\sum_b \hat{f}_b\right) = \frac{1}{B^2}\left[\sum_b \text{Var}(\hat{f}_b) + \sum_{b \neq b'}\text{Cov}(\hat{f}_b, \hat{f}_{b'})\right]

There are B variance terms each equal to \sigma^2 and B(B-1) covariance terms each equal to \rho\sigma^2:

= \frac{1}{B^2}\left[B\sigma^2 + B(B-1)\rho\sigma^2\right] = \frac{\sigma^2}{B} + \frac{B-1}{B}\rho\sigma^2 = \rho\sigma^2 + \frac{(1-\rho)\sigma^2}{B}

Why It Matters

This formula explains when bagging works and when it does not. Bagging is most effective for high-variance learners (large \sigma^2) with moderate correlation (moderate \rho). For stable learners like k-nearest neighbors with large k, \sigma^2 is already small and bagging helps little. For decision trees, \sigma^2 is large and \rho is moderate, so bagging gives substantial gains. This formula also motivates random forests: feature subsampling further reduces \rho.

Failure Mode

The formula assumes equal variances and uniform pairwise correlation. In practice, some bootstrap samples produce better models than others. Bagging also does not reliably reduce bias: for a linear base learner the bagged expectation equals the expectation of a single estimator, and for a nonlinear learner it can shift bias slightly (sometimes reducing it in high-variance regions, sometimes not). If the base learner is strongly biased (e.g., a shallow tree), bagging generally cannot fix this.

Bagging Does Not Reliably Reduce Bias

This is the most important theoretical limitation of bagging. The relationship between the bias of the bagged estimator and the bias of the base learner depends on whether the base learner is linear in the training data.

Linear base learners. If \hat{f}(x; S) is a linear functional of the empirical distribution \hat{F}_n of S, then \mathbb{E}_{S^*}[\hat{f}(x; S^*) \mid S] = \hat{f}(x; S) for bootstrap samples S^* drawn from \hat{F}_n, so in expectation:

\text{Bias}(\hat{f}_{\text{bag}}) = \mathbb{E}[\hat{f}_{\text{bag}}(x)] - f(x) = \mathbb{E}[\hat{f}(x)] - f(x) = \text{Bias}(\hat{f})

Bagging exactly preserves bias for such functionals. Sample means and fixed-design (residual-bootstrap) regressions evaluated at a fixed query x are the standard examples. Ordinary least squares under the pairs bootstrap is not such a linear functional: pairs bootstrap resamples rows of the design, so X^* varies across bootstrap draws and (X^{*\top}X^*)^{-1}X^{*\top}y^* is a nonlinear function of the empirical distribution. The OLS-as-linear-in-y identity holds only with X held fixed (e.g., residual bootstrap with a fixed design), in which case the bagged OLS prediction at a query equals the OLS prediction up to bootstrap Monte Carlo noise.

Nonlinear base learners. For nonlinear learners like decision trees or neural nets, the bagged expectation \mathbb{E}_{S^*}[\hat{f}(x; S^*)] (over bootstrap resamples S^* from \hat{F}_n) is not in general equal to \mathbb{E}_{S}[\hat{f}(x; S)] (over i.i.d. draws S from the true distribution F). Bagging can therefore shift the bias, typically slightly, sometimes reducing it in high-variance regions of the input space. Breiman (1996, §5) notes this explicitly: the approximation \text{Bias}(\hat{f}_{\text{bag}}) \approx \text{Bias}(\hat{f}) is a useful rule of thumb, not an identity.

The practical upshot is unchanged: bagging uses deep trees (low bias, high variance) rather than shallow trees (high bias, low variance). The variance reduction from averaging compensates for the high individual tree variance, while the low bias is approximately preserved.

Out-of-Bag Error Estimation

Definition

Out-of-Bag (OOB) Error

Each bootstrap sample omits about 36.8% of the training points. For training point (x_i, y_i), let \mathcal{B}_i = \{b : (x_i, y_i) \notin S_b\} be the set of trees that did not train on x_i. The OOB prediction is:

\hat{y}_i^{\text{OOB}} = \frac{1}{|\mathcal{B}_i|} \sum_{b \in \mathcal{B}_i} \hat{f}_b(x_i)

The OOB error is the average loss over all training points using only their OOB predictions.

OOB error behaves similarly to leave-one-out cross-validation asymptotically: each held-out point is predicted by trees trained without it, on training sets of expected size \approx 0.632n. It is not identical to LOOCV, however. The held-out prediction is an ensemble over a random subset of the B trees (the roughly B/e that never saw x_i), not a single model trained on the remaining n-1 points. As B grows, the OOB error becomes an increasingly accurate estimate of the true test error.

The practical appeal is that OOB error comes nearly for free during training: bagging provides a built-in validation signal without requiring a held-out set or a separate cross-validation pass.
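The bookkeeping is just a per-point sum and count, updated only by models that did not see the point. A minimal sketch, assuming a hypothetical `fit_1nn` base learner (1-nearest-neighbour regression on 1-D inputs, used here purely as a stand-in for an unstable learner):

```python
import numpy as np

def fit_1nn(Xb, yb):
    """Hypothetical unstable base learner: 1-nearest-neighbour regression (1-D)."""
    return lambda X: yb[np.abs(X[:, None] - Xb[None, :]).argmin(axis=1)]

def oob_mse(X, y, fit_fn, B=200, seed=0):
    """Out-of-bag mean squared error: each training point is predicted
    only by the models whose bootstrap sample omitted it."""
    rng = np.random.default_rng(seed)
    n = len(y)
    pred_sum, pred_cnt = np.zeros(n), np.zeros(n)
    for b in range(B):
        idx = rng.integers(0, n, size=n)
        oob = np.setdiff1d(np.arange(n), idx)   # ≈ 36.8% of points, on average
        model = fit_fn(X[idx], y[idx])
        pred_sum[oob] += model(X[oob])
        pred_cnt[oob] += 1
    seen = pred_cnt > 0                         # with large B, effectively all points
    return np.mean((pred_sum[seen] / pred_cnt[seen] - y[seen]) ** 2)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, 200)
y = np.sin(X) + rng.normal(0, 0.3, 200)
err = oob_mse(X, y, fit_1nn, B=200)
print(err)
```

No held-out set was needed: the same 200 bootstrap loops that build the ensemble also yield the error estimate.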

When Bagging Helps and When It Hurts

Bagging helps when the base learner is:

  • Unstable: small changes in the training data cause large changes in the model (e.g., deep decision trees, neural networks)
  • Low bias: the base learner can approximate the target function well, but with high variance

Bagging helps little when the base learner is:

  • Stable: predictions barely change with different training samples (e.g., linear regression, k-NN with large k)
  • High bias: the base learner systematically misses the target, and no amount of averaging can fix this (e.g., decision stumps)

Canonical Examples

Example

Bagged decision trees on a noisy regression

Consider predicting y = \sin(x) + \epsilon where \epsilon \sim N(0, 0.3), with n = 100 training points.

A single unpruned tree fits the noise closely, producing a jagged step function with high variance across different training sets. With B = 100 bagged trees, the averaged prediction is a smooth approximation of \sin(x). The individual trees still overfit, but their overfitting patterns differ and cancel in the average.
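To keep a sketch of this example dependency-free, the snippet below substitutes 1-nearest-neighbour regression for the unpruned tree; the substitution is an assumption, justified only in that 1-NN likewise interpolates the training noise (low bias, high variance). Test error is measured against the noiseless target \sin(x).

```python
import numpy as np

def fit_1nn(Xb, yb):
    # stand-in for an unpruned tree: fits the training noise exactly
    return lambda X: yb[np.abs(X[:, None] - Xb[None, :]).argmin(axis=1)]

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(-3, 3, n)
y = np.sin(X) + rng.normal(0, 0.3, n)
X_test = np.linspace(-3, 3, 400)
truth = np.sin(X_test)

# single high-variance learner: its error includes the memorized noise
single_mse = np.mean((fit_1nn(X, y)(X_test) - truth) ** 2)

# bagged ensemble: B = 100 bootstrap fits, averaged
B = 100
preds = np.zeros_like(X_test)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    preds += fit_1nn(X[idx], y[idx])(X_test)
bag_mse = np.mean((preds / B - truth) ** 2)

print(single_mse, bag_mse)   # bagging should cut the error substantially
```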

Example

Why bagging a linear model does nothing

Linear regression is a stable learner. Different bootstrap samples produce nearly identical regression coefficients (perturbed by sampling noise). Averaging nearly identical predictions gives nearly the same prediction as a single model. The variance reduction is negligible because \rho \approx 1.
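A quick numerical check of this claim, sketched with OLS via `numpy.linalg.lstsq` on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 1, n)
X = np.column_stack([np.ones(n), x])           # intercept + slope design
y = 1.0 + 2.0 * x + rng.normal(0, 0.2, n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]    # single OLS fit on the full data
x_query = np.array([1.0, 0.5])                 # query point x = 0.5

# pairs-bootstrap OLS: refit on 500 bootstrap samples
preds = []
for b in range(500):
    idx = rng.integers(0, n, size=n)
    beta_b = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    preds.append(x_query @ beta_b)

print(x_query @ beta)    # single-model prediction
print(np.mean(preds))    # bagged prediction: nearly identical
print(np.std(preds))     # per-bootstrap spread is already tiny (stable learner)
```

The per-bootstrap predictions barely move, so their average recovers the single fit: exactly the \rho \approx 1 regime where bagging buys nothing.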

Common Confusions

Watch Out

Bagging is NOT the same as training on more data

Bagging reuses the same n data points in different combinations. It does not increase the effective sample size: each bootstrap sample contains only about 0.632n unique observations on average. The benefit comes from averaging diverse models, not from seeing more data.

Watch Out

Bagging does not solve underfitting

If your base learner is too simple (high bias), bagging will average many biased predictions and produce a biased ensemble. Bagging a depth-1 decision stump on a complex nonlinear problem will fail. You need a sufficiently expressive base learner for bagging to work.

Summary

  • Bagging = train B models on bootstrap samples, average predictions
  • Variance of the average: \rho\sigma^2 + (1-\rho)\sigma^2/B
  • Perfect variance reduction (1/B) only if models are independent (\rho = 0)
  • Bagging exactly preserves bias for linear base learners and approximately preserves it for nonlinear ones; either way, use low-bias base learners (deep trees)
  • OOB error gives a free estimate of test error
  • Bagging helps unstable, low-bias learners; it does little for stable learners
  • Random forests improve on bagging by adding feature subsampling to reduce \rho

Exercises

ExerciseCore

Problem

You train B = 200 bagged decision trees. Each tree has prediction variance \sigma^2 = 4.0 and the average pairwise correlation is \rho = 0.4. Compute the ensemble variance. How does this compare to a single tree?

ExerciseAdvanced

Problem

Prove that bagging preserves bias exactly when the base learner \hat{f}(x; S) is a linear functional of the training data S. That is, show that \mathbb{E}[\hat{f}_{\text{bag}}(x)] = \mathbb{E}[\hat{f}(x)] under this assumption, where the expectation is over the training data and bootstrap randomness. Then give a concrete counterexample showing this identity can fail for a nonlinear base learner.

References

Canonical:

  • Breiman, "Bagging Predictors" (Machine Learning, 1996). The original paper
  • Hastie, Tibshirani, Friedman, Elements of Statistical Learning, Chapter 8.7
  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 18

Current:

  • Biau, Devroye, Lugosi, "Consistency of Random Forests and Other Averaging Classifiers" (JMLR, 2008)
  • Grandvalet, "Bagging Equalizes Influence" (Machine Learning, 2004)
  • Buhlmann & Yu, "Analyzing Bagging" (Annals of Statistics, 2002)
  • Breiman, "Random Forests" (Machine Learning, 2001)
  • Wager, Hastie, Efron, "Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife" (JMLR, 2014)

Next Topics

The natural next steps from bagging:

  • Random forests: bagging plus feature subsampling for further decorrelation
  • Gradient boosting: the complementary ensemble strategy that reduces bias

Last reviewed: April 26, 2026
