ML Methods
Bagging
Bootstrap Aggregation: train B models on bootstrap samples and average their predictions. Why averaging reduces variance, how correlation limits the gain, and out-of-bag error estimation.
Prerequisites
- Bootstrap Methods
- Decision Trees and Ensembles
Why This Matters
Bagging: bootstrap samples with replacement, train independent models, average predictions
Bagging is the simplest ensemble method and the foundation for random forests. The idea is so clean it can be stated in one sentence: train many copies of an unstable learner on different bootstrap samples and average the results. Yet this simple trick can dramatically improve prediction accuracy for high-variance models like decision trees.
Understanding bagging teaches you the fundamental principle behind all averaging ensembles: variance reduction through diversity.
Mental Model
Imagine you ask 100 people to independently estimate the number of jellybeans in a jar. Each individual estimate is noisy, but their average is remarkably close to the truth. This is the wisdom-of-crowds effect, and it works because individual errors are partially independent and cancel when averaged.
Bagging does the same thing with models. Each bootstrap sample creates a slightly different training set, producing a slightly different model. The models make different errors, and averaging smooths these errors out.
The Bagging Algorithm
Bagging (Bootstrap Aggregation)
Given training data $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$ and a base learner $\hat{f}$:
- For $b = 1, \dots, B$:
  - Draw a bootstrap sample $D_b^*$ of size $n$ with replacement from the training data
  - Train a base learner $\hat{f}_b$ on $D_b^*$
- The bagged prediction is:
  - Regression: $\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x)$
  - Classification: $\hat{f}_{\text{bag}}(x)$ is the majority vote of $\hat{f}_1(x), \dots, \hat{f}_B(x)$ (or the class with the largest averaged probability)
The base learner is typically a decision tree grown without pruning (high variance, low bias). But bagging can be applied to any learner.
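A minimal sketch of this loop in Python, using NumPy for the bootstrap draws and scikit-learn's DecisionTreeRegressor as the unpruned base learner (the function name and defaults are illustrative choices, not part of the definition above):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_fit_predict(X, y, X_test, B=100, seed=0):
    """Train B unpruned trees on bootstrap samples and average their predictions."""
    rng = np.random.default_rng(seed)
    n = len(X)
    preds = np.zeros((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)      # bootstrap sample: n draws with replacement
        tree = DecisionTreeRegressor()        # grown without pruning: high variance, low bias
        tree.fit(X[idx], y[idx])
        preds[b] = tree.predict(X_test)
    return preds.mean(axis=0)                 # regression: average the B predictions
```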
Why Bagging Reduces Variance
Variance of Averaged Estimators
Statement
Let $\hat{f}_1, \dots, \hat{f}_B$ be estimators, each with variance $\sigma^2$ and pairwise correlation $\rho$. The variance of their average is:
$$\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}_b\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$$
Special cases:
- If $\rho = 0$ (independent estimators): variance $= \sigma^2 / B$, perfect reduction
- If $\rho = 1$ (identical estimators): variance $= \sigma^2$, no reduction
- As $B \to \infty$: variance $\to \rho\,\sigma^2$, the irreducible limit
Intuition
Averaging $B$ independent noisy estimates reduces noise by a factor of $B$. But bagged estimators are not independent: they are trained on overlapping bootstrap samples. For two independent bootstrap samples of size $n$ and a fixed training point, in the large-$n$ limit: the probability the point appears in a given sample is $1 - (1 - 1/n)^n \to 1 - e^{-1} \approx 0.632$; the probability it appears in both (the natural pairwise-overlap quantity) is $(1 - e^{-1})^2 \approx 0.40$; the probability it appears in the union is $1 - e^{-2} \approx 0.865$; and the probability it appears in exactly one is $2\,e^{-1}(1 - e^{-1}) \approx 0.465$. So roughly 40% of training points are shared between any two bootstrap samples, and this overlap induces correlation in the fitted models. This creates a floor below which variance cannot fall, no matter how many models you average.
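A quick simulation (a sketch; the values of $n$ and the number of trials below are arbitrary) confirms these limiting probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 20_000
in_first = in_both = in_union = 0
for _ in range(trials):
    s1 = set(rng.integers(0, n, size=n))   # indices drawn into bootstrap sample 1
    s2 = set(rng.integers(0, n, size=n))   # indices drawn into bootstrap sample 2
    in_first += 0 in s1                    # track one fixed training point, index 0
    in_both += (0 in s1) and (0 in s2)
    in_union += (0 in s1) or (0 in s2)
print(in_first / trials)  # ~0.632: appears in a given sample
print(in_both / trials)   # ~0.400: appears in both samples
print(in_union / trials)  # ~0.865: appears in at least one
```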
Proof Sketch
There are $B$ variance terms, each equal to $\sigma^2$, and $B(B-1)$ covariance terms, each equal to $\rho\,\sigma^2$:
$$\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}_b\right) = \frac{1}{B^2}\Bigl(B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2\Bigr) = \frac{\sigma^2}{B} + \frac{B-1}{B}\,\rho\,\sigma^2 = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$$
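The formula can also be checked numerically by simulating equicorrelated Gaussian "estimators" and measuring the variance of their average (a sketch; the choices of $B$, $\sigma$, and $\rho$ are arbitrary):

```python
import numpy as np

B, sigma, rho, trials = 25, 2.0, 0.3, 200_000
rng = np.random.default_rng(1)
# Build B estimators with variance sigma^2 and pairwise correlation rho:
# a shared component (variance rho*sigma^2) plus independent components.
shared = rng.normal(size=trials) * sigma * np.sqrt(rho)
indep = rng.normal(size=(trials, B)) * sigma * np.sqrt(1 - rho)
estimators = shared[:, None] + indep
empirical = estimators.mean(axis=1).var()
theoretical = rho * sigma**2 + (1 - rho) / B * sigma**2
print(empirical, theoretical)   # the two values should agree closely
```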
Why It Matters
This formula explains when bagging works and when it does not. Bagging is most effective for high-variance learners (large $\sigma^2$) with moderate correlation (moderate $\rho$). For stable learners like $k$-nearest neighbors with large $k$, $\sigma^2$ is already small and bagging helps little. For decision trees, $\sigma^2$ is large and $\rho$ is moderate, so bagging gives substantial gains. The formula also motivates random forests: feature subsampling further reduces $\rho$.
Failure Mode
The formula assumes equal variances and uniform pairwise correlation. In practice, some bootstrap samples produce better models than others. Bagging also does not reliably reduce bias: for a linear base learner the bagged expectation equals the expectation of a single estimator, and for a nonlinear learner it can shift bias slightly (sometimes reducing it in high-variance regions, sometimes not). If the base learner is strongly biased (e.g., a shallow tree), bagging generally cannot fix this.
Bagging Does Not Reliably Reduce Bias
This is the most important theoretical limitation of bagging. The relationship between the bias of the bagged estimator and the bias of the base learner depends on whether the base learner is linear in the training data.
Linear base learners. If $\hat{f}$ is a linear functional of the empirical distribution of the training data $D$, then $\mathbb{E}^*[\hat{f}_b(x)] = \hat{f}(x)$ for bootstrap samples drawn from $D$, so in expectation:
$$\mathbb{E}\bigl[\hat{f}_{\text{bag}}(x)\bigr] = \mathbb{E}\bigl[\hat{f}(x)\bigr]$$
Bagging exactly preserves bias for such functionals. Sample means and fixed-design (residual-bootstrap) regressions evaluated at a fixed query point $x$ are the standard examples. Ordinary least squares under the pairs bootstrap is not such a linear functional: the pairs bootstrap resamples rows of the design matrix, so $X^\top X$ varies across bootstrap draws and $\hat{\beta} = (X^\top X)^{-1}X^\top y$ is a nonlinear function of the empirical distribution. The OLS-as-linear-in-$y$ identity holds only with $X$ held fixed (e.g., residual bootstrap with a fixed design), in which case the bagged OLS prediction at a query point equals the single OLS prediction up to bootstrap Monte Carlo noise.
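For the sample mean (the first standard example above), the verification is a one-line computation. Writing $w_i$ for the number of times observation $i$ is drawn into a bootstrap sample, so that $\sum_i w_i = n$ and $\mathbb{E}^*[w_i] = 1$, a short derivation:

```latex
% Bootstrap mean with multinomial weights w_i, where E^*[w_i] = 1.
\[
\bar{y}^{*} = \frac{1}{n}\sum_{i=1}^{n} w_i\, y_i
\quad\Longrightarrow\quad
\mathbb{E}^{*}\!\bigl[\bar{y}^{*}\bigr]
  = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}^{*}[w_i]\, y_i
  = \frac{1}{n}\sum_{i=1}^{n} y_i
  = \bar{y}.
\]
% Averaging B such bootstrap replicates leaves the expectation, and hence
% the bias, of the sample mean unchanged.
```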
Nonlinear base learners. For nonlinear learners like decision trees or neural nets, the bagged expectation $\mathbb{E}\bigl[\hat{f}_{\text{bag}}(x)\bigr]$ (which averages over bootstrap resamples from the empirical distribution $\hat{P}_n$) is not in general equal to $\mathbb{E}\bigl[\hat{f}(x)\bigr]$ (the expectation over i.i.d. draws from the true distribution $P$). Bagging can therefore shift the bias, typically slightly, sometimes reducing it in high-variance regions of the input space. Breiman (1996, §5) notes this explicitly: the approximation $\operatorname{Bias}(\hat{f}_{\text{bag}}) \approx \operatorname{Bias}(\hat{f})$ is a useful rule of thumb, not an identity.
The practical upshot is unchanged: bagging uses deep trees (low bias, high variance) rather than shallow trees (high bias, low variance). The variance reduction from averaging compensates for the high individual tree variance, while the low bias is approximately preserved.
Out-of-Bag Error Estimation
Out-of-Bag (OOB) Error
Each bootstrap sample omits about 36.8% of the training points. For training point $(x_i, y_i)$, let $O_i$ be the set of trees whose bootstrap samples did not include $(x_i, y_i)$. The OOB prediction is:
$$\hat{f}^{\text{OOB}}(x_i) = \frac{1}{|O_i|}\sum_{b \in O_i} \hat{f}_b(x_i)$$
The OOB error is the average loss over all training points using only their OOB predictions.
OOB error behaves similarly to leave-one-out cross-validation asymptotically: each held-out point is predicted by trees trained without it, on bootstrap samples of size $n$ containing about $0.632\,n$ unique points. It is not identical to LOOCV, however. The held-out prediction is an ensemble over a random subset of the trees (the roughly $0.368\,B$ that never saw the point), not a single model trained on the remaining $n - 1$ points. As $B$ grows, the OOB error becomes an increasingly accurate estimate of the true test error.
The practical appeal is that OOB error comes nearly for free during training: bagging provides a built-in validation signal without requiring a held-out set or a separate cross-validation pass.
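A sketch of the OOB computation for bagged regression trees (the helper name and defaults below are illustrative; it records, for each tree, which points its bootstrap sample missed):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def oob_mse(X, y, B=200, seed=0):
    """Bag B unpruned trees and estimate test MSE from out-of-bag predictions."""
    rng = np.random.default_rng(seed)
    n = len(X)
    oob_sum = np.zeros(n)      # running sum of OOB predictions for each point
    oob_count = np.zeros(n)    # number of trees for which each point was out-of-bag
    for _ in range(B):
        idx = rng.integers(0, n, size=n)
        oob = ~np.isin(np.arange(n), idx)          # points this bootstrap sample missed
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        if oob.any():
            oob_sum[oob] += tree.predict(X[oob])
            oob_count[oob] += 1
    seen = oob_count > 0                            # with large B, essentially every point
    oob_pred = oob_sum[seen] / oob_count[seen]
    return np.mean((y[seen] - oob_pred) ** 2)
```

scikit-learn's BaggingRegressor and RandomForestRegressor expose the same quantity through their oob_score_ attribute when fitted with oob_score=True.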
When Bagging Helps and When It Hurts
Bagging helps when the base learner is:
- Unstable: small changes in the training data cause large changes in the model (e.g., deep decision trees, neural networks)
- Low bias: the base learner can approximate the target function well, but with high variance
Bagging helps little when the base learner is:
- Stable: predictions barely change with different training samples (e.g., linear regression, $k$-NN with large $k$)
- High bias: the base learner systematically misses the target, and no amount of averaging can fix this (e.g., decision stumps)
Canonical Examples
Bagged decision trees on a noisy regression
Consider predicting $y = f(x) + \varepsilon$, where $f$ is a smooth target function and $\varepsilon$ is independent noise, from $n$ training points.
A single unpruned tree fits the noise closely, producing a jagged step function with high variance across different training sets. With a bagged ensemble of such trees, the averaged prediction is a smooth approximation of $f$. The individual trees still overfit, but their overfitting patterns differ and cancel in the average.
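A sketch of this comparison; the target function $\sin(x)$, the noise level, and the sample sizes below are illustrative choices, not taken from the text:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, B = 200, 200
X = rng.uniform(0, 6, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)   # smooth target plus noise
X_test = np.linspace(0, 6, 500).reshape(-1, 1)
truth = np.sin(X_test[:, 0])

# Single unpruned tree: a jagged step function that chases the noise.
single = DecisionTreeRegressor().fit(X, y).predict(X_test)

# Bagged trees: the average of B trees fit to bootstrap samples is much smoother.
bagged = np.zeros(len(X_test))
for _ in range(B):
    idx = rng.integers(0, n, size=n)
    bagged += DecisionTreeRegressor().fit(X[idx], y[idx]).predict(X_test)
bagged /= B

print("single-tree MSE vs truth:", np.mean((single - truth) ** 2))
print("bagged-tree MSE vs truth:", np.mean((bagged - truth) ** 2))
```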
Why bagging a linear model does nothing
Linear regression is a stable learner. Different bootstrap samples produce nearly identical regression coefficients (perturbed only by sampling noise). Averaging nearly identical predictions gives nearly the same prediction as a single model. The variance reduction is negligible because $\rho \approx 1$.
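To see this numerically, one can bag ordinary least squares and compare it to a single fit (a sketch on synthetic data; all constants are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, B = 500, 200
X = rng.normal(size=(n, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=n)
X_test = rng.normal(size=(20, 3))

single = LinearRegression().fit(X, y).predict(X_test)

bagged = np.zeros(len(X_test))
for _ in range(B):
    idx = rng.integers(0, n, size=n)
    bagged += LinearRegression().fit(X[idx], y[idx]).predict(X_test)
bagged /= B

# The bagged and single predictions are nearly identical: OLS is stable, so rho is close to 1.
print(np.max(np.abs(bagged - single)))
```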
Common Confusions
Bagging is NOT the same as training on more data
Bagging reuses the same data points in different combinations. It does not increase the effective sample size. Each bootstrap sample has the same size $n$ as the original data but contains only about $0.632\,n$ unique observations on average. The benefit comes from averaging diverse models, not from seeing more data.
Bagging does not solve underfitting
If your base learner is too simple (high bias), bagging will average many biased predictions and produce a biased ensemble. Bagging a depth-1 decision stump on a complex nonlinear problem will fail. You need a sufficiently expressive base learner for bagging to work.
Summary
- Bagging = train $B$ models on bootstrap samples, average predictions
- Variance of the average: $\rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$
- Perfect variance reduction ($\sigma^2/B$) only if models are independent ($\rho = 0$)
- Bagging exactly preserves bias for linear base learners and approximately preserves it for nonlinear ones; either way, use low-bias base learners (deep trees)
- OOB error gives a free estimate of test error
- Bagging helps unstable, low-bias learners; it does little for stable learners
- Random forests improve on bagging by adding feature subsampling to reduce $\rho$
Exercises
Problem
You train $B$ bagged decision trees. Each tree has prediction variance $\sigma^2$ and the average pairwise correlation between trees is $\rho$. Derive the ensemble variance in terms of $B$, $\sigma^2$, and $\rho$. How does it compare to the variance of a single tree?
Problem
Prove that bagging preserves bias exactly when the base learner $\hat{f}$ is a linear functional of the empirical distribution of the training data $D$. That is, show that $\mathbb{E}\bigl[\hat{f}_{\text{bag}}(x)\bigr] = \mathbb{E}\bigl[\hat{f}(x)\bigr]$ under this assumption, where the expectation is over the training data and the bootstrap randomness. Then give a concrete counterexample showing this identity can fail for a nonlinear base learner.
References
Canonical:
- Breiman, "Bagging Predictors" (Machine Learning, 1996). The original paper
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning, Section 8.7
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 18
Current:
- Biau, Devroye, Lugosi, "Consistency of Random Forests and Other Averaging Classifiers" (JMLR, 2008)
- Grandvalet, "Bagging Equalizes Influence" (Machine Learning, 2004)
- Bühlmann & Yu, "Analyzing Bagging" (Annals of Statistics, 2002)
- Breiman, "Random Forests" (Machine Learning, 2001)
- Wager, Hastie, Efron, "Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife" (JMLR, 2014)
Next Topics
The natural next steps from bagging:
- Random forests: bagging plus feature subsampling for further decorrelation
- Gradient boosting: the complementary ensemble strategy that reduces bias
Last reviewed: April 26, 2026
Canonical graph
Required prerequisites
- Bootstrap Methods
- Decision Trees and Ensembles
Derived topics
- Random Forests
- Ensemble Methods Theory