
AIC and BIC

Akaike and Bayesian information criteria for model selection: how they trade off fit versus complexity, when to use each, and their connection to cross-validation.


Why This Matters

You have fit several models to the same data: a linear model, a quadratic model, a model with 20 features. Which one should you use? You cannot just pick the one with the highest likelihood, because more complex models always fit the training data better. This is the core problem of overfitting.

AIC and BIC are principled ways to balance goodness-of-fit against model complexity. They are fast to compute (no retraining needed), widely used, and have deep theoretical justifications.

Fit-Complexity Ledger (interactive visual)

AIC keeps a flat optimism correction, while BIC raises the complexity tax as the sample size grows. At $n = 60$, a baseline model with 4 parameters competes against an extended model with 7:

| Model    | Parameters | AIC   | BIC   |
|----------|------------|-------|-------|
| baseline | 4          | 128.0 | 136.4 |
| extended | 7          | 122.0 | 136.7 |

AIC's winner is the extended model; BIC's is the baseline. The extended model buys a 12-point fit improvement but pays for three extra parameters. AIC's complexity tax stays flat, so a modest likelihood gain can justify extra parameters; under BIC, as $n$ grows, the same three extra parameters become harder to justify.

AIC asks whether predictive optimism was corrected; BIC asks whether the larger model still looks plausible after the sample-size-scaled penalty.

Mental Model

Both AIC and BIC start with the log-likelihood (how well the model fits) and add a penalty for complexity (number of parameters). Both criteria require models estimated by maximum likelihood. The key difference: AIC penalizes complexity less aggressively, favoring more complex models. BIC penalizes complexity more, favoring simpler models.

The penalty term addresses a real phenomenon: overfitting. Among nested models, the one with more parameters always achieves at least as high a maximized log-likelihood on the training data; the penalty corrects for that optimism. AIC and BIC differ in how much correction they apply and why.

This is not arbitrary. AIC and BIC are optimizing for different goals.

Definitions

Definition

Akaike Information Criterion

For a model with $k$ parameters and maximized log-likelihood $\hat{L}$:

$$\text{AIC} = -2 \log \hat{L} + 2k$$

Lower AIC is better. The first term rewards fit; the second penalizes complexity.

Definition

Bayesian Information Criterion

For a model with $k$ parameters, $n$ observations, and maximized log-likelihood $\hat{L}$:

$$\text{BIC} = -2 \log \hat{L} + k \log n$$

Lower BIC is better. The penalty $k \log n$ grows with sample size, making BIC increasingly hostile to complex models as $n$ increases.

Note: when $n \geq 8$, we have $\log n > 2$, so BIC penalizes each parameter more than AIC. In practice, $n$ is almost always much larger than 8.
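
Both formulas are one-liners given the fitted log-likelihood. A minimal sketch (the helper names `aic` and `bic` are ours, not from any particular library), checked against the ledger example above:

```python
import numpy as np

def aic(log_lik: float, k: int) -> float:
    """AIC = -2 log L-hat + 2k; lower is better."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik: float, k: int, n: int) -> float:
    """BIC = -2 log L-hat + k log n; lower is better."""
    return -2.0 * log_lik + k * np.log(n)

# The ledger example: n = 60, baseline (k=4) vs extended (k=7).
# Log-likelihoods of -60 and -54 reproduce the numbers in the table.
print(aic(-60.0, 4), bic(-60.0, 4, 60))  # 128.0, 136.4
print(aic(-54.0, 7), bic(-54.0, 7, 60))  # 122.0, 136.7
```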

Where the Formulas Come From

AIC from KL divergence

AIC is grounded in information theory. The true data-generating distribution is $p_{\text{true}}$; the fitted model is $p_{\hat{\theta}}$. The KL divergence measures the information lost when using the model instead of the truth:

$$D_{\text{KL}}(p_{\text{true}} \,\|\, p_{\hat{\theta}}) = \mathbb{E}_{p_{\text{true}}}\!\left[\log \frac{p_{\text{true}}(x)}{p_{\hat{\theta}}(x)}\right]$$

We want to minimize this, but $p_{\text{true}}$ is unknown. Expanding the expectation:

$$D_{\text{KL}}(p_{\text{true}} \,\|\, p_{\hat{\theta}}) = \mathbb{E}[\log p_{\text{true}}(x)] - \mathbb{E}[\log p_{\hat{\theta}}(x)]$$

The first term does not depend on the model, so minimizing KL divergence is equivalent to maximizing $\mathbb{E}[\log p_{\hat{\theta}}(x)]$, the expected log-likelihood under the true distribution.

The natural estimator is the in-sample log-likelihood $\frac{1}{n}\sum_i \log p_{\hat{\theta}}(x_i)$. But this is optimistic: the parameters $\hat{\theta}$ were fit to the same data, so the in-sample likelihood overestimates out-of-sample performance. Akaike (1974) showed via a second-order Taylor expansion of the likelihood that the bias is approximately $k/n$ (where $k$ is the number of parameters), giving:

$$-2\,\mathbb{E}[\text{out-of-sample log-likelihood}] \approx -2\log\hat{L} + 2k$$

That is AIC. The factor of 2 is the natural scale for likelihood-based inference: under regularity, Wilks' theorem (1938) gives that twice the log-likelihood ratio between nested models is asymptotically $\chi^2$-distributed with degrees of freedom equal to the difference in parameter counts, so AIC differences land on the same scale as classical $\chi^2$ goodness-of-fit and likelihood-ratio tests.
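
The optimism is easy to see numerically. A Monte Carlo sketch under assumptions we chose for illustration (a Gaussian linear model with known noise variance, so the $k$ coefficients are the only parameters): the gap between in-sample and out-of-sample log-likelihood averages close to $k$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma = 100, 5, 1.0  # sample size, number of coefficients, known noise sd
beta = rng.normal(size=k)

def gauss_loglik(y, mu):
    return (-0.5 * np.sum((y - mu) ** 2) / sigma**2
            - 0.5 * len(y) * np.log(2 * np.pi * sigma**2))

gaps = []
for _ in range(2000):
    X = rng.normal(size=(n, k))
    y_train = X @ beta + sigma * rng.normal(size=n)
    y_test = X @ beta + sigma * rng.normal(size=n)         # same design, fresh noise
    beta_hat = np.linalg.lstsq(X, y_train, rcond=None)[0]  # MLE when sigma is known
    gaps.append(gauss_loglik(y_train, X @ beta_hat) - gauss_loglik(y_test, X @ beta_hat))

# Close to k = 5; on the -2 log-likelihood scale the optimism is 2k, AIC's penalty.
print(np.mean(gaps))
```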

BIC as a Laplace approximation to the log marginal likelihood

BIC approximates the log marginal likelihood $\log p(\text{data} \mid M)$ for a model $M$ with $k$ parameters. By Bayes' theorem, the marginal likelihood for comparing two models is the correct Bayesian criterion.

A Laplace approximation to the marginal likelihood gives:

$$\log p(\text{data} \mid M) \approx \log p(\text{data} \mid \hat{\theta}, M) + \log p(\hat{\theta} \mid M) - \frac{k}{2}\log n + \text{const}$$

For a flat or regular prior, the $\log p(\hat{\theta} \mid M)$ term is $O(1)$ and is dropped. The dominant term is the log-likelihood minus $\frac{k}{2}\log n$. Multiplying by $-2$ gives:

$$\text{BIC} = -2\log\hat{L} + k\log n$$

This derivation makes explicit that BIC is a large-$n$ approximation. The true marginal likelihood can differ substantially from BIC for small samples, non-regular models, or strong priors.
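
One way to see the quality of the approximation is a model whose marginal likelihood is available in closed form. A sketch under conjugate-Gaussian assumptions of our choosing (known noise sd $\sigma$, prior $\mu \sim N(0, \tau^2)$ on the single free parameter, so $k = 1$): $-2$ times the exact log marginal likelihood and BIC grow together, with a gap that stays $O(1)$ as $n$ increases.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, tau = 1.0, 2.0  # known noise sd and prior sd -- both our illustrative choices

def log_marginal(y):
    """Exact log p(y | M) for y_i ~ N(mu, sigma^2) iid with prior mu ~ N(0, tau^2)."""
    n = len(y)
    A = n / sigma**2 + 1.0 / tau**2  # posterior precision of mu
    b = y.sum() / sigma**2
    return (-0.5 * n * np.log(2 * np.pi * sigma**2)
            - 0.5 * np.log(2 * np.pi * tau**2)
            + 0.5 * np.log(2 * np.pi / A)
            - y @ y / (2 * sigma**2)
            + b**2 / (2 * A))

for n in (20, 200, 2000):
    y = rng.normal(loc=0.7, scale=sigma, size=n)
    mu_hat = y.mean()  # MLE of the single free parameter (k = 1)
    log_lik = (-0.5 * n * np.log(2 * np.pi * sigma**2)
               - ((y - mu_hat) ** 2).sum() / (2 * sigma**2))
    bic = -2 * log_lik + 1 * np.log(n)
    print(n, round(-2 * log_marginal(y), 1), round(bic, 1))
# Both columns grow with n, but their difference stays O(1):
# BIC tracks the marginal likelihood up to a constant.
```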

Asymptotic Properties

AIC and BIC have structurally different theoretical goals:

Definition

Asymptotic Efficiency (AIC)

AIC is asymptotically efficient: among all models under consideration, the model selected by AIC minimizes the expected Kullback-Leibler divergence to the true data-generating process. In other words, AIC selects the model that predicts best.

Definition

Consistency (BIC)

BIC is consistent: if the true model is among the candidates, BIC selects the true model with probability approaching 1 as nn \to \infty. AIC is not consistent. It tends to overfit by selecting models that are too complex, even asymptotically.

The tradeoff is clear:

  • AIC: best for prediction. May select an overly complex model, but that model will predict well.
  • BIC: best for identifying the true model. May select an overly simple model if $n$ is small, but will eventually find the truth.
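
A simulation sketch of both properties under an illustrative setup of our own (quadratic truth, polynomial candidates of degree 1 through 5, Gaussian noise): BIC's rate of recovering the true degree climbs toward 1 as $n$ grows, while AIC keeps selecting overly high degrees a fixed fraction of the time.

```python
import numpy as np

rng = np.random.default_rng(2)
true_beta = [0.5, -2.0, 1.0]  # np.polyval coefficients: 0.5 x^2 - 2 x + 1 (true degree 2)

def neg2_loglik(y, fitted):
    """-2 * maximized Gaussian log-likelihood with the variance profiled out."""
    n = len(y)
    rss = np.sum((y - fitted) ** 2)
    return n * np.log(rss / n) + n * (1 + np.log(2 * np.pi))

def chosen_degree(penalty, x, y, max_deg=5):
    scores = []
    for d in range(1, max_deg + 1):
        coeffs = np.polyfit(x, y, d)
        k = d + 1  # mean parameters only; the shared variance term cancels in comparisons
        scores.append(neg2_loglik(y, np.polyval(coeffs, x)) + penalty(k))
    return int(np.argmin(scores)) + 1

for n in (50, 500, 5000):
    hit = {"AIC": 0, "BIC": 0}
    for _ in range(200):
        x = rng.uniform(-2, 2, size=n)
        y = np.polyval(true_beta, x) + rng.normal(size=n)
        hit["AIC"] += chosen_degree(lambda k: 2 * k, x, y) == 2
        hit["BIC"] += chosen_degree(lambda k, n=n: k * np.log(n), x, y) == 2
    print(n, {c: v / 200 for c, v in hit.items()})
# Expect BIC's recovery rate to approach 1 while AIC's stays visibly below it.
```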

When to Use Each

Use AIC when:

  • Your goal is prediction accuracy
  • You believe the true model is not in your candidate set (it almost never is)
  • You prefer to err on the side of including relevant variables

Use BIC when:

  • Your goal is identifying the correct model (inference, scientific discovery)
  • You believe the true model might be in your candidate set
  • You prefer to err on the side of parsimony

In most ML applications, prediction is the goal, so AIC (or cross-validation) is more appropriate. In scientific modeling and causal inference, BIC is often preferred.

Connection to Cross-Validation

Proposition

AIC Approximates Leave-One-Out CV

Statement

Under regularity conditions, AIC is asymptotically equivalent to leave-one-out cross-validation (LOO-CV). Specifically, for models estimated by maximum likelihood, the AIC score and the LOO-CV estimate of prediction error select the same model as $n \to \infty$.

Intuition

The $2k$ penalty in AIC can be derived as an estimate of the optimism: the amount by which training error underestimates test error. LOO-CV also estimates test error. Both are estimating the same quantity, so they agree asymptotically.

Proof Sketch

The key result is due to Stone (1977). The proof shows that the AIC correction $2k$ equals the expected difference between training log-likelihood and test log-likelihood, up to $O(1/n)$ terms. LOO-CV directly estimates test log-likelihood. Both therefore estimate the same quantity with the same leading-order behavior.

Why It Matters

This connection means AIC gives you an approximate LOO-CV estimate for free, no retraining required. For models where cross-validation is expensive (e.g., large datasets or slow training), AIC is a practical alternative.

Failure Mode

The equivalence breaks down when: (1) the model is severely misspecified, (2) $k$ is large relative to $n$ (use corrected AICc instead), or (3) the model is not estimated by maximum likelihood.
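
The agreement is easy to check in a linear-Gaussian setting, where LOO residuals have the closed form $e_i / (1 - h_{ii})$ and no refitting is needed. A sketch with known noise variance and polynomial candidates (both our simplifying choices): the AIC column and the $-2\times$LOO log-likelihood column track each other and pick the same degree.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 200, 1.0  # known noise sd -- a simplifying assumption
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 0.8 * x - 0.4 * x**2 + sigma * rng.normal(size=n)  # quadratic truth

def n2ll(resid):
    """-2 Gaussian log-likelihood with known sigma, given residuals."""
    return np.sum(resid**2) / sigma**2 + n * np.log(2 * np.pi * sigma**2)

for d in (1, 2, 3, 4):
    X = np.vander(x, d + 1)                # polynomial design matrix, k = d + 1
    H = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix of the linear smoother
    resid = y - H @ y
    loo_resid = resid / (1 - np.diag(H))   # closed-form leave-one-out residuals
    print(d, round(n2ll(resid) + 2 * (d + 1), 1), round(n2ll(loo_resid), 1))
# Both columns bottom out at the same degree: AIC mimics LOO-CV for free.
```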

Corrected AIC (AICc)

When the number of parameters $k$ is not negligible compared to $n$, standard AIC underpenalizes. The corrected version is:

$$\text{AICc} = \text{AIC} + \frac{2k(k+1)}{n - k - 1}$$

Use AICc when $n/k < 40$ as a rough rule. AICc converges to AIC as $n \to \infty$.
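
As a one-line extension of the earlier `aic` helper (again our own function name, not a library API):

```python
def aicc(log_lik: float, k: int, n: int) -> float:
    """Small-sample corrected AIC; requires n > k + 1."""
    return aic(log_lik, k) + 2.0 * k * (k + 1) / (n - k - 1)

# With n = 30 and k = 10 the correction adds 2*10*11/19 ~= 11.6 points,
# far from negligible; it vanishes as n grows with k fixed.
```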

When AIC and BIC Disagree

The two criteria point in different directions when the data provide moderate evidence for a more complex model: strong enough that AIC's lighter penalty accepts it, but not strong enough to overcome BIC's heavier penalty.

Structural reason: AIC's penalty is $2k$ regardless of $n$. BIC's is $k \log n$, which grows without bound. For any fixed model comparison, BIC eventually penalizes more aggressively than AIC as $n$ increases.
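
For a concrete pair of models the flip point is explicit: the larger model wins under BIC only while its fit gain $\Delta = \Delta(-2\log\hat{L})$ exceeds $\Delta k \cdot \log n$, i.e., while $n < e^{\Delta/\Delta k}$. Plugging in the ledger example from the top of the page (a sanity check, not new data):

```python
import numpy as np

delta_fit, delta_k = 12.0, 3        # fit gain on the -2 log L scale; extra parameters
print(np.exp(delta_fit / delta_k))  # ~54.6: above this n, BIC flips to the baseline (hence n = 60 above)
print(delta_fit > 2 * delta_k)      # True: AIC prefers the extended model at every n
```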

Practical interpretation of disagreement:

  1. AIC selects a larger model; BIC selects a smaller one. The most common pattern. AIC is betting the extra parameters help predict; BIC says the likelihood gain does not justify the complexity at this sample size. Check whether the simpler model's residuals show systematic structure (if yes, AIC may be right). If residuals look random, BIC is likely correct.

  2. Both select the same model. Agreement is strong evidence for that model. The log-likelihood gain was either convincingly large (both choose complex) or convincingly small (both choose simple).

  3. BIC selects a larger model; AIC selects a smaller one. This is unusual and almost always signals a problem: very small $n$ (BIC's $\log n < 2$, so $n < e^2 \approx 7.4$), a coding error in one criterion, or a comparison across different datasets.

The practical heuristic: When AIC and BIC disagree and you must pick one model, use domain knowledge. If you are building a prediction system, prefer AIC. If you are testing a scientific hypothesis about structure, prefer BIC. If neither applies, report both and flag the disagreement; it signals genuine uncertainty about model complexity that the data cannot resolve.

Example

AIC vs BIC in linear regression with collinear predictors

Suppose you are fitting models to predict house prices with $n = 80$ observations. You have a baseline model with $k = 4$ parameters (intercept, size, bedrooms, bathrooms) and an extended model with $k = 9$ parameters, adding size squared, location interactions, and neighborhood fixed effects.

| Model    | $k$ | $-2\log\hat{L}$ | AIC   | BIC ($\log 80 \approx 4.38$) |
|----------|-----|-----------------|-------|------------------------------|
| Baseline | 4   | 510.0           | 518.0 | 527.5                        |
| Extended | 9   | 498.0           | 516.0 | 537.4                        |

AIC: 516.0 vs 518.0. AIC prefers the extended model by 2 points; the 12-point log-likelihood gain more than offsets the $2 \times 5 = 10$ point penalty.

BIC: 537.4 vs 527.5. BIC prefers the baseline; the same 12-point gain cannot offset the $5 \times 4.38 = 21.9$ point penalty.

The five extra predictors are correlated with size and location, each adding modest individual predictive value. AIC rewards the marginal gain; BIC says the parameters are likely to be noise at this sample size. Cross-validation on a held-out set would resolve which is correct.

Related Criteria

AIC and BIC are part of a wider family of information criteria. Mallows' $C_p$ applies to linear regression with an estimate $\hat\sigma^2$ of the noise variance: $C_p = \text{RSS}/\hat\sigma^2 + 2k - n$. Under Gaussian errors it is equivalent to AIC up to an additive constant. DIC (Deviance Information Criterion) is the Bayesian analogue, replacing maximum likelihood with posterior mean deviance and $k$ with an effective number of parameters $p_D$ computed from the posterior; it is common in hierarchical Bayesian models fit by MCMC. HQIC (Hannan-Quinn) uses penalty $2k \log \log n$, which sits between AIC and BIC asymptotically. HQIC is consistent like BIC but penalizes less aggressively, making it useful when BIC is too conservative but AIC overfits.
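
Sketches of the two closed-form members of this family (function names are ours; DIC needs posterior samples, so it is omitted):

```python
import numpy as np

def mallows_cp(rss: float, sigma2: float, k: int, n: int) -> float:
    """Mallows' Cp: RSS/sigma2 + 2k - n, where sigma2 is a noise-variance estimate
    (conventionally taken from the largest candidate model)."""
    return rss / sigma2 + 2 * k - n

def hqic(log_lik: float, k: int, n: int) -> float:
    """Hannan-Quinn: penalty 2k log log n, between AIC's 2k and BIC's k log n."""
    return -2.0 * log_lik + 2.0 * k * np.log(np.log(n))
```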

Worked Example

Example

Polynomial regression model selection

You fit polynomials of degree 1 through 5 to $n = 50$ data points. Each model has $k = d + 1$ parameters (including intercept). Suppose the log-likelihoods are:

| Degree | $k$ | $-2\log\hat{L}$ | AIC   | BIC ($\log 50 \approx 3.91$) |
|--------|-----|-----------------|-------|------------------------------|
| 1      | 2   | 120.0           | 124.0 | 127.8                        |
| 2      | 3   | 105.0           | 111.0 | 116.7                        |
| 3      | 4   | 103.5           | 111.5 | 119.1                        |
| 4      | 5   | 103.0           | 113.0 | 122.6                        |
| 5      | 6   | 102.8           | 114.8 | 126.3                        |

AIC selects degree 2 (AIC = 111.0). BIC also selects degree 2 (BIC = 116.7). Both agree that the degree-2 model best trades off fit and complexity. Note how higher-degree models barely improve the log-likelihood but accumulate penalty.
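
The table is quick to reproduce from the $-2\log\hat{L}$ column alone:

```python
import numpy as np

n = 50
neg2_loglik = {1: 120.0, 2: 105.0, 3: 103.5, 4: 103.0, 5: 102.8}  # from the table above

for d, nll in neg2_loglik.items():
    k = d + 1
    print(f"degree {d}: AIC {nll + 2 * k:.1f}, BIC {nll + k * np.log(n):.1f}")
# Degree 2 minimizes both: AIC 111.0, BIC 116.7.
```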

Common Confusions

Watch Out

AIC and BIC are not comparable across different datasets

AIC and BIC values depend on the data. You can compare AIC values across models fit to the same data, but comparing AIC values from models fit to different datasets is meaningless. The log-likelihood scale depends on $n$ and the data distribution.

Watch Out

Lower is better, not higher

Both AIC and BIC are formulated so that lower is better. This is because they use $-2\log\hat{L}$ (negative log-likelihood), which decreases as fit improves. Some implementations negate this convention; always check.

Watch Out

BIC is not exact Bayesian model comparison

Despite the name, BIC is only a large-$n$ approximation to the log marginal likelihood: it assumes a flat or regular prior and a regular model. True Bayesian model comparison uses the actual marginal likelihood, which can give very different answers.

Summary

  • AIC $= -2\log\hat{L} + 2k$: penalizes lightly, good for prediction
  • BIC $= -2\log\hat{L} + k\log n$: penalizes more, good for model identification
  • AIC is asymptotically efficient (best prediction); BIC is consistent (finds the true model)
  • AIC approximates leave-one-out cross-validation
  • Use AICc when $n/k < 40$
  • Lower is better for both criteria
  • They can disagree, and that disagreement is informative

Exercises

ExerciseCore

Problem

For $n = 100$ observations, how much does BIC penalize each additional parameter compared to AIC? At what sample size does BIC penalize exactly twice as much as AIC per parameter?

ExerciseAdvanced

Problem

You have two models: Model A with $k = 3$ parameters and $\log\hat{L}_A = -50$, and Model B with $k = 10$ parameters and $\log\hat{L}_B = -38$. With $n = 200$ observations, which model does AIC prefer? Which does BIC prefer? Interpret the disagreement.

References

Canonical:

  • Akaike, "A new look at the statistical model identification" (1974): original AIC paper; derives the $2k$ correction via KL divergence and expected log-likelihood bias
  • Schwarz, "Estimating the dimension of a model" (1978): one-page paper introducing BIC from the Laplace approximation to marginal likelihood
  • Stone, "An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion" (1977): proves AIC $\approx$ LOO-CV asymptotically
  • Wilks, "The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses," Annals of Mathematical Statistics 9(1):60-62 (1938): the theorem that aligns AIC's factor of 2 with $\chi^2$ tests

Current:

  • Burnham & Anderson, Model Selection and Multimodel Inference (2002), Chapters 2-4: comprehensive treatment of AIC, AICc, model averaging, and evidence ratios
  • Hastie, Tibshirani & Friedman, Elements of Statistical Learning (2009), Sections 7.5-7.7: AIC, BIC, and their relationship to cross-validation and degrees of freedom
  • Gelman et al., Bayesian Data Analysis (2013), Chapter 7: LOO-CV, WAIC, and why BIC can misfire on hierarchical models

Last reviewed: April 26, 2026
