ML Methods
AIC and BIC
Akaike and Bayesian information criteria for model selection: how they trade off fit versus complexity, when to use each, and their connection to cross-validation.
Prerequisites
Why This Matters
You have fit several models to the same data: a linear model, a quadratic model, a model with 20 features. Which one should you use? You cannot just pick the one with the highest likelihood, because a more complex model always fits the training data at least as well. This is the core problem of overfitting.
AIC and BIC are principled ways to balance goodness-of-fit against model complexity. They are fast to compute (no retraining needed), widely used, and have deep theoretical justifications.
Mental Model
Both AIC and BIC start with the log-likelihood (how well the model fits) and add a penalty for complexity (number of parameters). Both criteria require models estimated by maximum likelihood. The key difference: AIC penalizes complexity less aggressively, favoring more complex models. BIC penalizes complexity more, favoring simpler models.
The penalty term addresses a real phenomenon: overfitting. A model with more parameters always achieves at least as high a log-likelihood on the training data; the penalty corrects for the optimism. AIC and BIC differ in how much correction they apply and why.
This is not arbitrary. AIC and BIC are optimizing for different goals.
Definitions
Akaike Information Criterion
For a model with $k$ parameters and maximized log-likelihood $\hat{\ell}$:

$$\mathrm{AIC} = -2\hat{\ell} + 2k$$

Lower AIC is better. The first term rewards fit; the second penalizes complexity.
Bayesian Information Criterion
For a model with $k$ parameters, $n$ observations, and maximized log-likelihood $\hat{\ell}$:

$$\mathrm{BIC} = -2\hat{\ell} + k \ln n$$

Lower BIC is better. The penalty grows with sample size, making BIC increasingly hostile to complex models as $n$ increases.

Note: when $n \ge 8$, we have $\ln n > 2$, so BIC penalizes each parameter more than AIC. In practice, $n$ is almost always much larger than 8.
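These two definitions translate directly into code. A minimal sketch (the helper names and the example numbers are mine, for illustration only):

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: AIC = -2*logL + 2k (lower is better)."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """Bayesian information criterion: BIC = -2*logL + k*ln(n) (lower is better)."""
    return -2.0 * log_lik + k * math.log(n)

# Hypothetical fitted model: k = 3 parameters, n = 100 observations.
print(aic(-52.5, 3))       # 111.0
print(bic(-52.5, 3, 100))  # 105.0 + 3*ln(100) ≈ 118.8
```

Note that both functions take the *maximized* log-likelihood; the criteria are only meaningful for models estimated by maximum likelihood on the same data.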
Where the Formulas Come From
AIC from KL divergence
AIC is grounded in information theory. The true data-generating distribution is $p$; the fitted model is $q_{\hat{\theta}}$. The KL divergence measures the information lost when using the model instead of the truth:

$$D_{\mathrm{KL}}(p \,\|\, q_{\hat{\theta}}) = \int p(x) \log \frac{p(x)}{q_{\hat{\theta}}(x)} \, dx$$

We want to minimize this, but $p$ is unknown. Expanding the expectation:

$$D_{\mathrm{KL}}(p \,\|\, q_{\hat{\theta}}) = \mathbb{E}_p[\log p(X)] - \mathbb{E}_p[\log q_{\hat{\theta}}(X)]$$

The first term does not depend on the model, so minimizing KL divergence is equivalent to maximizing $\mathbb{E}_p[\log q_{\hat{\theta}}(X)]$, the expected log-likelihood under the true distribution.

The natural estimator is the in-sample log-likelihood $\hat{\ell}$. But this is optimistic: the parameters were fit to the same data, so the in-sample likelihood overestimates out-of-sample performance. Akaike (1974) showed via a second-order Taylor expansion of the likelihood that the bias is approximately $k$ (where $k$ is the number of parameters), giving:

$$\mathrm{AIC} = -2(\hat{\ell} - k) = -2\hat{\ell} + 2k$$
That is AIC. The factor of 2 is the natural scale for likelihood-based inference: under regularity conditions, Wilks' theorem (1938) gives that twice the log-likelihood ratio between nested models is asymptotically $\chi^2$-distributed with degrees of freedom equal to the difference in parameter counts, so AIC differences land on the same scale as classical goodness-of-fit and likelihood-ratio tests.
BIC as a Laplace approximation to the log marginal likelihood
BIC approximates the log marginal likelihood $\log p(D \mid M)$ for a model $M$ with $k$ parameters. By Bayes' theorem, the marginal likelihood is the correct Bayesian criterion for comparing two models.

A Laplace approximation to the marginal likelihood gives:

$$\log p(D \mid M) \approx \hat{\ell} - \frac{k}{2} \ln n + O(1)$$

For a flat or regular prior, the remaining terms are $O(1)$ and are dropped. The dominant part is the log-likelihood minus $\frac{k}{2} \ln n$. Multiplying by $-2$ gives:

$$\mathrm{BIC} = -2\hat{\ell} + k \ln n$$
This derivation makes explicit that BIC is a large-$n$ approximation. The true marginal likelihood can differ substantially from BIC for small samples, non-regular models, or strong priors.
Asymptotic Properties
AIC and BIC have structurally different theoretical goals:
Asymptotic Efficiency (AIC)
AIC is asymptotically efficient: among all models under consideration, the model selected by AIC minimizes the expected Kullback-Leibler divergence to the true data-generating process. In other words, AIC selects the model that predicts best.
Consistency (BIC)
BIC is consistent: if the true model is among the candidates, BIC selects the true model with probability approaching 1 as $n \to \infty$. AIC is not consistent. It tends to overfit by selecting models that are too complex, even asymptotically.
The tradeoff is clear:
- AIC: best for prediction. May select an overly complex model, but that model will predict well.
- BIC: best for identifying the true model. May select an overly simple model if $n$ is small, but will eventually find the truth.
When to Use Each
Use AIC when:
- Your goal is prediction accuracy
- You believe the true model is not in your candidate set (it almost never is)
- You prefer to err on the side of including relevant variables
Use BIC when:
- Your goal is identifying the correct model (inference, scientific discovery)
- You believe the true model might be in your candidate set
- You prefer to err on the side of parsimony
In most ML applications, prediction is the goal, so AIC (or cross-validation) is more appropriate. In scientific modeling and causal inference, BIC is often preferred.
Connection to Cross-Validation
AIC Approximates Leave-One-Out CV
Statement
Under regularity conditions, AIC is asymptotically equivalent to leave-one-out cross-validation (LOO-CV). Specifically, for models estimated by maximum likelihood, the AIC score and the LOO-CV estimate of prediction error select the same model as $n \to \infty$.
Intuition
The $2k$ penalty in AIC can be derived as an estimate of the optimism: the amount by which training error underestimates test error. LOO-CV also estimates test error. Both are estimating the same quantity, so they agree asymptotically.
Proof Sketch
The key result is due to Stone (1977). The proof shows that the AIC correction $k$ equals the expected difference between training log-likelihood and test log-likelihood, up to lower-order terms. LOO-CV directly estimates the test log-likelihood. Both therefore estimate the same quantity with the same leading-order behavior.
Why It Matters
This connection means AIC gives you an approximate LOO-CV estimate for free, no retraining required. For models where cross-validation is expensive (e.g., large datasets or slow training), AIC is a practical alternative.
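The equivalence can be checked numerically. The sketch below (my own construction, on synthetic data) fits polynomial models by least squares, computes a Gaussian AIC from the ML noise-variance estimate, and computes exact LOO residuals via the hat-matrix shortcut available for linear smoothers. The two criteria typically rank the candidate degrees the same way:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = np.linspace(-2, 2, n)
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(0.0, 0.3, n)  # true model: quadratic

def aic_and_loo(degree):
    X = np.vander(x, degree + 1)            # polynomial design matrix
    H = X @ np.linalg.pinv(X)               # hat matrix: y_hat = H @ y
    resid = y - H @ y
    sigma2 = np.mean(resid**2)              # ML estimate of the noise variance
    k = degree + 2                          # polynomial coefficients + noise variance
    aic = n * np.log(sigma2) + 2 * k        # Gaussian AIC, up to an additive constant
    loo_mse = np.mean((resid / (1.0 - np.diag(H)))**2)  # exact LOO for linear smoothers
    return aic, loo_mse

degrees = list(range(1, 6))
aics = [aic_and_loo(d)[0] for d in degrees]
loos = [aic_and_loo(d)[1] for d in degrees]
print("AIC picks degree", degrees[int(np.argmin(aics))])
print("LOO picks degree", degrees[int(np.argmin(loos))])
```

The hat-matrix identity is what makes the comparison cheap here; for models without a closed-form LOO, AIC's "free" approximation is exactly the practical advantage described above.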
Failure Mode
The equivalence breaks down when: (1) the model is severely misspecified, (2) $k$ is large relative to $n$ (use the corrected AICc instead), or (3) the model is not estimated by maximum likelihood.
Corrected AIC (AICc)
When the number of parameters $k$ is not negligible compared to $n$, standard AIC underpenalizes. The corrected version is:

$$\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}$$

Use AICc when $n/k < 40$ as a rough rule. AICc converges to AIC as $n \to \infty$.
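A sketch of the correction, with a hypothetical helper name; the example values are illustrative only:

```python
def aicc(log_lik, k, n):
    """Small-sample corrected AIC; valid only when n > k + 1."""
    if n <= k + 1:
        raise ValueError("AICc requires n > k + 1")
    aic = -2.0 * log_lik + 2.0 * k
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)

# The correction shrinks as n grows. For k = 5:
print(aicc(-100.0, 5, 20))    # 210 + 60/14 ≈ 214.29
print(aicc(-100.0, 5, 2000))  # 210 + 60/1994 ≈ 210.03
```

Note the guard clause: as $n$ approaches $k + 1$ the correction blows up, which is the formula's way of saying the model has too many parameters for the data.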
When AIC and BIC Disagree
The two criteria point in different directions when the data provide moderate evidence for a more complex model: strong enough that AIC's lighter penalty accepts it, but not strong enough to overcome BIC's heavier penalty.
Structural reason: AIC's penalty is $2k$, regardless of $n$. BIC's penalty is $k \ln n$, which grows without bound. For any fixed model comparison, BIC eventually penalizes more aggressively than AIC as $n$ increases.
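A quick tabulation of the per-parameter penalties makes the crossover concrete: BIC's penalty overtakes AIC's once $\ln n > 2$, i.e. from $n = 8$ onward.

```python
import math

# Per-parameter penalty: AIC charges a flat 2; BIC charges ln(n).
for n in (5, 7, 8, 100, 10_000):
    print(f"n={n:>6}: AIC penalty/param = 2.00, BIC penalty/param = {math.log(n):.2f}")
```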
Practical interpretation of disagreement:
- AIC selects a larger model; BIC selects a smaller one. This is the most common pattern. AIC is betting the extra parameters help predict; BIC says the likelihood gain does not justify the complexity at this sample size. Check whether the simpler model's residuals show systematic structure (if yes, AIC may be right). If residuals look random, BIC is likely correct.
- Both select the same model. Agreement is strong evidence for that model. The log-likelihood gain was either convincingly large (both choose complex) or convincingly small (both choose simple).
- BIC selects a larger model; AIC selects a smaller one. This is unusual and almost always signals a problem: very small $n$ (for $n \le 7$, $\ln n < 2$, so BIC's per-parameter penalty is below AIC's), a coding error in one criterion, or a comparison across different datasets.
The practical heuristic: When AIC and BIC disagree and you must pick one model, use domain knowledge. If you are building a prediction system, prefer AIC. If you are testing a scientific hypothesis about structure, prefer BIC. If neither applies, report both and flag the disagreement; it signals genuine uncertainty about model complexity that the data cannot resolve.
AIC vs BIC in linear regression with collinear predictors
Suppose you are fitting models to predict house prices with $n = 80$ observations. You have a baseline model with $k = 4$ parameters (intercept, size, bedrooms, bathrooms) and an extended model adding 5 parameters, $k = 9$ (size squared, location interactions, and neighborhood fixed effects).

| Model | $k$ | $-2\hat{\ell}$ | AIC | BIC ($\ln 80 \approx 4.38$) |
|---|---|---|---|---|
| Baseline | 4 | 510.0 | 518.0 | 527.5 |
| Extended | 9 | 498.0 | 516.0 | 537.4 |
AIC: 516.0 vs 518.0. AIC prefers the extended model by 2 points; the 12-point gain in $-2\hat{\ell}$ more than offsets the 10-point penalty for the 5 extra parameters.

BIC: 537.4 vs 527.5. BIC prefers the baseline; the same 12-point gain cannot offset the $5 \ln 80 \approx 21.9$-point penalty.
The five extra predictors are correlated with size and location, each adding modest individual predictive value. AIC rewards the marginal gain; BIC says the parameters are likely to be noise at this sample size. Cross-validation on a held-out set would resolve which is correct.
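The comparison can be reproduced in a few lines. This sketch takes the example's deviance values as given and assumes $n = 80$, the sample size consistent with the BIC column ($4 \ln 80 \approx 17.5$):

```python
import math

n = 80  # assumed sample size, implied by the BIC values in the example
models = {"baseline": (4, 510.0), "extended": (9, 498.0)}  # name -> (k, -2*logL)

for name, (k, dev) in models.items():
    aic = dev + 2 * k
    bic = dev + k * math.log(n)
    print(f"{name}: AIC = {aic:.1f}, BIC = {bic:.1f}")
```

AIC ranks the extended model first while BIC ranks the baseline first, which is exactly the disagreement pattern discussed above.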
Related Criteria
AIC and BIC are part of a wider family of information criteria. Mallows' $C_p$ applies to linear regression with known noise variance $\sigma^2$: $C_p = \mathrm{RSS}/\sigma^2 - n + 2k$. Under Gaussian errors it is equivalent to AIC up to an additive constant. DIC (Deviance Information Criterion) is the Bayesian analogue, replacing the maximized likelihood with the posterior mean deviance and $k$ with an effective number of parameters computed from the posterior; it is common in hierarchical Bayesian models fit by MCMC. HQIC (Hannan-Quinn) uses the penalty $2k \ln \ln n$, which sits between AIC and BIC asymptotically. HQIC is consistent like BIC but penalizes less aggressively, making it useful when BIC is too conservative but AIC overfits.
Worked Example
Polynomial regression model selection
You fit polynomials of degree 1 through 5 to $n = 50$ data points. A degree-$d$ model has $k = d + 1$ parameters (including the intercept). Suppose the fitted $-2\hat{\ell}$ values are:

| Degree | $k$ | $-2\hat{\ell}$ | AIC | BIC ($\ln 50 \approx 3.91$) |
|---|---|---|---|---|
| 1 | 2 | 120.0 | 124.0 | 120.0 + 2(3.91) = 127.8 |
| 2 | 3 | 105.0 | 111.0 | 116.7 |
| 3 | 4 | 103.5 | 111.5 | 119.1 |
| 4 | 5 | 103.0 | 113.0 | 122.6 |
| 5 | 6 | 102.8 | 114.8 | 126.3 |
AIC selects degree 2 (AIC = 111.0). BIC also selects degree 2 (BIC = 116.7). Both agree that the degree-2 model best trades off fit and complexity. Note how higher-degree models barely improve the log-likelihood but accumulate penalty.
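A short sketch reproducing the selection from the table's $-2\hat{\ell}$ values, assuming $n = 50$ (the sample size consistent with $\ln 50 \approx 3.91$ in the BIC column):

```python
import math

n = 50
fits = {1: (2, 120.0), 2: (3, 105.0), 3: (4, 103.5),
        4: (5, 103.0), 5: (6, 102.8)}  # degree -> (k, -2*logL)

aic = {d: dev + 2 * k for d, (k, dev) in fits.items()}
bic = {d: dev + k * math.log(n) for d, (k, dev) in fits.items()}

print("AIC selects degree", min(aic, key=aic.get))  # degree 2
print("BIC selects degree", min(bic, key=bic.get))  # degree 2
```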
Common Confusions
AIC and BIC are not comparable across different datasets
AIC and BIC values depend on the data. You can compare AIC values across models fit to the same data, but comparing AIC values from models fit to different datasets is meaningless. The log-likelihood scale depends on $n$ and the data distribution.
Lower is better, not higher
Both AIC and BIC are formulated so that lower is better, because they are built on $-2\hat{\ell}$ (twice the negative log-likelihood), which decreases as fit improves. Some implementations negate this convention; always check.
BIC is not full Bayesian model comparison
Despite the name, BIC is only an approximation to the log marginal likelihood, and it rests on strong simplifying assumptions (large $n$, flat prior, regular model). True Bayesian model comparison uses the actual marginal likelihood, which can give very different answers.
Summary
- AIC $= -2\hat{\ell} + 2k$: penalizes lightly, good for prediction
- BIC $= -2\hat{\ell} + k \ln n$: penalizes more, good for model identification
- AIC is asymptotically efficient (best prediction); BIC is consistent (finds true model)
- AIC approximates leave-one-out cross-validation
- Use AICc when $n/k < 40$
- Lower is better for both criteria
- They can disagree, and that disagreement is informative
Exercises
Problem
For $n$ observations, how much does BIC penalize each additional parameter compared to AIC? At what sample size does BIC penalize exactly twice as much as AIC per parameter?
Problem
You have two models: Model A with $k_A$ parameters and maximized log-likelihood $\hat{\ell}_A$, and Model B with $k_B > k_A$ parameters and $\hat{\ell}_B > \hat{\ell}_A$. With $n$ observations, under what condition on $\hat{\ell}_B - \hat{\ell}_A$ does AIC prefer Model B? Under what condition does BIC? Interpret a case where the two criteria disagree.
References
Canonical:
- Akaike, "A new look at the statistical model identification" (1974): original AIC paper; derives the correction via KL divergence and expected log-likelihood bias
- Schwarz, "Estimating the dimension of a model" (1978): one-page paper introducing BIC from the Laplace approximation to marginal likelihood
- Stone, "An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion" (1977): proves the asymptotic equivalence of AIC and LOO-CV
- Wilks, "The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses," Annals of Mathematical Statistics 9(1):60-62 (1938): the theorem that aligns AIC's factor of 2 with $\chi^2$ tests
Current:
- Burnham & Anderson, Model Selection and Multimodel Inference (2002), Chapters 2-4: comprehensive treatment of AIC, AICc, model averaging, and evidence ratios
- Hastie, Tibshirani & Friedman, Elements of Statistical Learning (2009), Section 7.5-7.7: AIC, BIC, and their relationship to cross-validation and degrees of freedom
- Gelman et al., Bayesian Data Analysis (2013), Chapter 7: LOO-CV, WAIC, and why BIC can misfire on hierarchical models
Last reviewed: April 26, 2026
Canonical graph
Required prerequisites
- Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency
- Gauss-Markov Theorem
- Decision Trees and Ensembles
- Kolmogorov Complexity and MDL
Derived topics
- Cross-Validation Theory