
AIC and BIC

Akaike and Bayesian information criteria for model selection: how they trade off fit versus complexity, when to use each, and their connection to cross-validation.


Why This Matters

You have fit several models to the same data: a linear model, a quadratic model, a model with 20 features. Which one should you use? You cannot just pick the one with the highest likelihood, because more complex models always fit the training data better. This is the core problem of overfitting.

AIC and BIC are principled ways to balance goodness-of-fit against model complexity. They are fast to compute (no retraining needed), widely used, and have deep theoretical justifications.

Fit-Complexity Ledger (interactive visual)

AIC keeps a flat optimism correction, while BIC raises the complexity tax as the sample size grows. At $n = 60$, a baseline model with 4 parameters competes against an extended model with 7:

| Model    | Parameters | AIC   | BIC   |
|----------|------------|-------|-------|
| baseline | 4          | 128.0 | 136.4 |
| extended | 7          | 122.0 | 136.7 |

AIC's winner is the extended model; BIC's is the baseline. The extended model buys a 12-point fit improvement but pays for three extra parameters. AIC's complexity tax stays flat, so a modest likelihood gain can justify extra parameters; under BIC, as $n$ grows, the same three extra parameters become harder to justify.

AIC asks whether predictive optimism was corrected; BIC asks whether the larger model still looks plausible after the sample-size-scaled penalty.

Mental Model

Both AIC and BIC start with the log-likelihood (how well the model fits) and add a penalty for complexity (number of parameters). Both criteria require models estimated by maximum likelihood. The key difference: AIC penalizes complexity less aggressively, favoring more complex models. BIC penalizes complexity more, favoring simpler models.

The penalty term addresses a real phenomenon: overfitting. Among nested models, the one with more parameters always achieves at least as high a maximized log-likelihood on the training data; the penalty corrects for that optimism. AIC and BIC differ in how much correction they apply and why.

This is not arbitrary. AIC and BIC are optimizing for different goals.

Definitions

Definition

Akaike Information Criterion

For a model with $k$ parameters and maximized log-likelihood $\hat{L}$:

$$\text{AIC} = -2 \log \hat{L} + 2k$$

Lower AIC is better. The first term rewards fit; the second penalizes complexity.

Definition

Bayesian Information Criterion

For a model with $k$ parameters, $n$ observations, and maximized log-likelihood $\hat{L}$:

$$\text{BIC} = -2 \log \hat{L} + k \log n$$

Lower BIC is better. The penalty $k \log n$ grows with sample size, making BIC increasingly hostile to complex models as $n$ increases.

Note: when $n \geq 8$, we have $\log n > 2$, so BIC penalizes each parameter more than AIC. In practice, $n$ is almost always much larger than 8.
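
Both formulas are one-liners given the fitted log-likelihood. A minimal sketch (the helper names `aic` and `bic` are ours, not from any particular library), checked against the ledger example above:

```python
import numpy as np

def aic(log_lik: float, k: int) -> float:
    """AIC = -2 log L-hat + 2k; lower is better."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik: float, k: int, n: int) -> float:
    """BIC = -2 log L-hat + k log n; lower is better."""
    return -2.0 * log_lik + k * np.log(n)

# The ledger example: n = 60, baseline (k=4) vs extended (k=7).
# Log-likelihoods of -60 and -54 reproduce the numbers in the table.
print(aic(-60.0, 4), bic(-60.0, 4, 60))  # 128.0, 136.4
print(aic(-54.0, 7), bic(-54.0, 7, 60))  # 122.0, 136.7
```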

Where the Formulas Come From

AIC from KL divergence

AIC is grounded in information theory. The true data-generating distribution is $p_{\text{true}}$; the fitted model is $p_{\hat{\theta}}$. The KL divergence measures the information lost when using the model instead of the truth:

$$D_{\text{KL}}(p_{\text{true}} \,\|\, p_{\hat{\theta}}) = \mathbb{E}_{p_{\text{true}}}\!\left[\log \frac{p_{\text{true}}(x)}{p_{\hat{\theta}}(x)}\right]$$

We want to minimize this, but $p_{\text{true}}$ is unknown. Expanding the expectation:

$$D_{\text{KL}}(p_{\text{true}} \,\|\, p_{\hat{\theta}}) = \mathbb{E}[\log p_{\text{true}}(x)] - \mathbb{E}[\log p_{\hat{\theta}}(x)]$$

The first term does not depend on the model, so minimizing KL divergence is equivalent to maximizing $\mathbb{E}[\log p_{\hat{\theta}}(x)]$, the expected log-likelihood under the true distribution.

The natural estimator is the in-sample log-likelihood $\frac{1}{n}\sum_i \log p_{\hat{\theta}}(x_i)$. But this is optimistic: the parameters $\hat{\theta}$ were fit to the same data, so the in-sample likelihood overestimates out-of-sample performance. Akaike (1974) showed via a second-order Taylor expansion of the likelihood that the bias is approximately $k/n$ (where $k$ is the number of parameters), giving:

$$-2\,\mathbb{E}[\text{out-of-sample log-likelihood}] \approx -2\log\hat{L} + 2k$$

That is AIC. The factor of 2 is the natural scale for likelihood-based inference: under regularity, Wilks' theorem (1938) gives that twice the log-likelihood ratio between nested models is asymptotically $\chi^2$-distributed with degrees of freedom equal to the difference in parameter counts, so AIC differences land on the same scale as classical $\chi^2$ goodness-of-fit and likelihood-ratio tests.
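
The optimism is easy to see numerically. A Monte Carlo sketch under assumptions we chose for illustration (a Gaussian linear model with known noise variance, so the $k$ coefficients are the only parameters): the gap between in-sample and out-of-sample log-likelihood averages close to $k$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma = 100, 5, 1.0  # sample size, number of coefficients, known noise sd
beta = rng.normal(size=k)

def gauss_loglik(y, mu):
    return (-0.5 * np.sum((y - mu) ** 2) / sigma**2
            - 0.5 * len(y) * np.log(2 * np.pi * sigma**2))

gaps = []
for _ in range(2000):
    X = rng.normal(size=(n, k))
    y_train = X @ beta + sigma * rng.normal(size=n)
    y_test = X @ beta + sigma * rng.normal(size=n)         # same design, fresh noise
    beta_hat = np.linalg.lstsq(X, y_train, rcond=None)[0]  # MLE when sigma is known
    gaps.append(gauss_loglik(y_train, X @ beta_hat) - gauss_loglik(y_test, X @ beta_hat))

# Close to k = 5; on the -2 log-likelihood scale the optimism is 2k, AIC's penalty.
print(np.mean(gaps))
```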

BIC as a Laplace approximation to the log marginal likelihood

BIC approximates the log marginal likelihood $\log p(\text{data} \mid M)$ for a model $M$ with $k$ parameters. By Bayes' theorem, the marginal likelihood for comparing two models is the correct Bayesian criterion.

A Laplace approximation to the marginal likelihood gives:

$$\log p(\text{data} \mid M) \approx \log p(\text{data} \mid \hat{\theta}, M) + \log p(\hat{\theta} \mid M) - \frac{k}{2}\log n + \text{const}$$

For a flat or regular prior, the $\log p(\hat{\theta} \mid M)$ term is $O(1)$ and is dropped. The dominant term is the log-likelihood minus $\frac{k}{2}\log n$. Multiplying by $-2$ gives:

$$\text{BIC} = -2\log\hat{L} + k\log n$$

This derivation makes explicit that BIC is a large-$n$ approximation. The true marginal likelihood can differ substantially from BIC for small samples, non-regular models, or strong priors.
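
One way to see the quality of the approximation is a model whose marginal likelihood is available in closed form. A sketch under conjugate-Gaussian assumptions of our choosing (known noise sd $\sigma$, prior $\mu \sim N(0, \tau^2)$ on the single free parameter, so $k = 1$): $-2$ times the exact log marginal likelihood and BIC grow together, with a gap that stays $O(1)$ as $n$ increases.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, tau = 1.0, 2.0  # known noise sd and prior sd -- both our illustrative choices

def log_marginal(y):
    """Exact log p(y | M) for y_i ~ N(mu, sigma^2) iid with prior mu ~ N(0, tau^2)."""
    n = len(y)
    A = n / sigma**2 + 1.0 / tau**2  # posterior precision of mu
    b = y.sum() / sigma**2
    return (-0.5 * n * np.log(2 * np.pi * sigma**2)
            - 0.5 * np.log(2 * np.pi * tau**2)
            + 0.5 * np.log(2 * np.pi / A)
            - y @ y / (2 * sigma**2)
            + b**2 / (2 * A))

for n in (20, 200, 2000):
    y = rng.normal(loc=0.7, scale=sigma, size=n)
    mu_hat = y.mean()  # MLE of the single free parameter (k = 1)
    log_lik = (-0.5 * n * np.log(2 * np.pi * sigma**2)
               - ((y - mu_hat) ** 2).sum() / (2 * sigma**2))
    bic = -2 * log_lik + 1 * np.log(n)
    print(n, round(-2 * log_marginal(y), 1), round(bic, 1))
# Both columns grow with n, but their difference stays O(1):
# BIC tracks the marginal likelihood up to a constant.
```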

Asymptotic Properties

AIC and BIC have structurally different theoretical goals:

Definition

Asymptotic Efficiency (AIC)

AIC is asymptotically efficient: among all models under consideration, the model selected by AIC minimizes the expected Kullback-Leibler divergence to the true data-generating process. In other words, AIC selects the model that predicts best.

Definition

Consistency (BIC)

BIC is consistent: if the true model is among the candidates, BIC selects the true model with probability approaching 1 as nn \to \infty. AIC is not consistent. It tends to overfit by selecting models that are too complex, even asymptotically.

The tradeoff is clear:

  • AIC: best for prediction. May select an overly complex model, but that model will predict well.
  • BIC: best for identifying the true model. May select an overly simple model if $n$ is small, but will eventually find the truth.
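
A simulation sketch of both properties under an illustrative setup of our own (quadratic truth, polynomial candidates of degree 1 through 5, Gaussian noise): BIC's rate of recovering the true degree climbs toward 1 as $n$ grows, while AIC keeps selecting overly high degrees a fixed fraction of the time.

```python
import numpy as np

rng = np.random.default_rng(2)
true_beta = [0.5, -2.0, 1.0]  # np.polyval coefficients: 0.5 x^2 - 2 x + 1 (true degree 2)

def neg2_loglik(y, fitted):
    """-2 * maximized Gaussian log-likelihood with the variance profiled out."""
    n = len(y)
    rss = np.sum((y - fitted) ** 2)
    return n * np.log(rss / n) + n * (1 + np.log(2 * np.pi))

def chosen_degree(penalty, x, y, max_deg=5):
    scores = []
    for d in range(1, max_deg + 1):
        coeffs = np.polyfit(x, y, d)
        k = d + 1  # mean parameters only; the shared variance term cancels in comparisons
        scores.append(neg2_loglik(y, np.polyval(coeffs, x)) + penalty(k))
    return int(np.argmin(scores)) + 1

for n in (50, 500, 5000):
    hit = {"AIC": 0, "BIC": 0}
    for _ in range(200):
        x = rng.uniform(-2, 2, size=n)
        y = np.polyval(true_beta, x) + rng.normal(size=n)
        hit["AIC"] += chosen_degree(lambda k: 2 * k, x, y) == 2
        hit["BIC"] += chosen_degree(lambda k, n=n: k * np.log(n), x, y) == 2
    print(n, {c: v / 200 for c, v in hit.items()})
# Expect BIC's recovery rate to approach 1 while AIC's stays visibly below it.
```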

When to Use Each

Use AIC when:

  • Your goal is prediction accuracy
  • You believe the true model is not in your candidate set (it almost never is)
  • You prefer to err on the side of including relevant variables

Use BIC when:

  • Your goal is identifying the correct model (inference, scientific discovery)
  • You believe the true model might be in your candidate set
  • You prefer to err on the side of parsimony

In most ML applications, prediction is the goal, so AIC (or cross-validation) is more appropriate. In scientific modeling and causal inference, BIC is often preferred.

Connection to Cross-Validation

Proposition

AIC Approximates Leave-One-Out CV

Statement

Under regularity conditions, AIC is asymptotically equivalent to leave-one-out cross-validation (LOO-CV). Specifically, for models estimated by maximum likelihood, the AIC score and the LOO-CV estimate of prediction error select the same model as $n \to \infty$.

Intuition

The $2k$ penalty in AIC can be derived as an estimate of the optimism: the amount by which training error underestimates test error. LOO-CV also estimates test error. Both are estimating the same quantity, so they agree asymptotically.

Proof Sketch

The key result is due to Stone (1977). The proof shows that the AIC correction $2k$ equals the expected difference between training log-likelihood and test log-likelihood, up to $O(1/n)$ terms. LOO-CV directly estimates test log-likelihood. Both therefore estimate the same quantity with the same leading-order behavior.

Why It Matters

This connection means AIC gives you an approximate LOO-CV estimate for free, no retraining required. For models where cross-validation is expensive (e.g., large datasets or slow training), AIC is a practical alternative.

Failure Mode

The equivalence breaks down when: (1) the model is severely misspecified, (2) $k$ is large relative to $n$ (use corrected AICc instead), or (3) the model is not estimated by maximum likelihood.
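
The agreement is easy to check in a linear-Gaussian setting, where LOO residuals have the closed form $e_i / (1 - h_{ii})$ and no refitting is needed. A sketch with known noise variance and polynomial candidates (both our simplifying choices): the AIC column and the $-2\times$LOO log-likelihood column track each other and pick the same degree.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 200, 1.0  # known noise sd -- a simplifying assumption
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 0.8 * x - 0.4 * x**2 + sigma * rng.normal(size=n)  # quadratic truth

def n2ll(resid):
    """-2 Gaussian log-likelihood with known sigma, given residuals."""
    return np.sum(resid**2) / sigma**2 + n * np.log(2 * np.pi * sigma**2)

for d in (1, 2, 3, 4):
    X = np.vander(x, d + 1)                # polynomial design matrix, k = d + 1
    H = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix of the linear smoother
    resid = y - H @ y
    loo_resid = resid / (1 - np.diag(H))   # closed-form leave-one-out residuals
    print(d, round(n2ll(resid) + 2 * (d + 1), 1), round(n2ll(loo_resid), 1))
# Both columns bottom out at the same degree: AIC mimics LOO-CV for free.
```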

Corrected AIC (AICc)

When the number of parameters $k$ is not negligible compared to $n$, standard AIC underpenalizes. The corrected version is:

$$\text{AICc} = \text{AIC} + \frac{2k(k+1)}{n - k - 1}$$

Use AICc when $n/k < 40$ as a rough rule. AICc converges to AIC as $n \to \infty$.
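
As a one-line extension of the earlier `aic` helper (again our own function name, not a library API):

```python
def aicc(log_lik: float, k: int, n: int) -> float:
    """Small-sample corrected AIC; requires n > k + 1."""
    return aic(log_lik, k) + 2.0 * k * (k + 1) / (n - k - 1)

# With n = 30 and k = 10 the correction adds 2*10*11/19 ~= 11.6 points,
# far from negligible; it vanishes as n grows with k fixed.
```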

When AIC and BIC Disagree

The two criteria point in different directions when the data provide moderate evidence for a more complex model: strong enough that AIC's lighter penalty accepts it, but not strong enough to overcome BIC's heavier penalty.

Structural reason: AIC's penalty is $2k$ regardless of $n$. BIC's is $k \log n$, which grows without bound. For any fixed model comparison, BIC eventually penalizes more aggressively than AIC as $n$ increases.
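
For a concrete pair of models the flip point is explicit: the larger model wins under BIC only while its fit gain $\Delta = \Delta(-2\log\hat{L})$ exceeds $\Delta k \cdot \log n$, i.e., while $n < e^{\Delta/\Delta k}$. Plugging in the ledger example from the top of the page (a sanity check, not new data):

```python
import numpy as np

delta_fit, delta_k = 12.0, 3        # fit gain on the -2 log L scale; extra parameters
print(np.exp(delta_fit / delta_k))  # ~54.6: above this n, BIC flips to the baseline (hence n = 60 above)
print(delta_fit > 2 * delta_k)      # True: AIC prefers the extended model at every n
```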

Practical interpretation of disagreement:

  1. AIC selects a larger model; BIC selects a smaller one. The most common pattern. AIC is betting the extra parameters help predict; BIC says the likelihood gain does not justify the complexity at this sample size. Check whether the simpler model's residuals show systematic structure (if yes, AIC may be right). If residuals look random, BIC is likely correct.

  2. Both select the same model. Agreement is strong evidence for that model. The log-likelihood gain was either convincingly large (both choose complex) or convincingly small (both choose simple).

  3. BIC selects a larger model; AIC selects a smaller one. This is unusual and almost always signals a problem: very small $n$ (BIC's $\log n < 2$, so $n < e^2 \approx 7.4$), a coding error in one criterion, or a comparison across different datasets.

The practical heuristic: When AIC and BIC disagree and you must pick one model, use domain knowledge. If you are building a prediction system, prefer AIC. If you are testing a scientific hypothesis about structure, prefer BIC. If neither applies, report both and flag the disagreement; it signals genuine uncertainty about model complexity that the data cannot resolve.

Example

AIC vs BIC in linear regression with collinear predictors

Suppose you are fitting models to predict house prices with $n = 80$ observations. You have a baseline model with $k = 4$ parameters (intercept, size, bedrooms, bathrooms) and an extended model with $k = 9$ parameters, adding size squared, location interactions, and neighborhood fixed effects.

| Model    | $k$ | $-2\log\hat{L}$ | AIC   | BIC ($\log 80 \approx 4.38$) |
|----------|-----|-----------------|-------|------------------------------|
| Baseline | 4   | 510.0           | 518.0 | 527.5                        |
| Extended | 9   | 498.0           | 516.0 | 537.4                        |

AIC: 516.0 vs 518.0. AIC prefers the extended model by 2 points; the 12-point log-likelihood gain more than offsets the $2 \times 5 = 10$ point penalty.

BIC: 537.4 vs 527.5. BIC prefers the baseline; the same 12-point gain cannot offset the $5 \times 4.38 = 21.9$ point penalty.

The five extra predictors are correlated with size and location, each adding modest individual predictive value. AIC rewards the marginal gain; BIC says the parameters are likely to be noise at this sample size. Cross-validation on a held-out set would resolve which is correct.

Related Criteria

AIC and BIC are part of a wider family of information criteria. Mallows' $C_p$ applies to linear regression with an estimate $\hat\sigma^2$ of the noise variance: $C_p = \text{RSS}/\hat\sigma^2 + 2k - n$. Under Gaussian errors it is equivalent to AIC up to an additive constant. DIC (Deviance Information Criterion) is the Bayesian analogue, replacing maximum likelihood with posterior mean deviance and $k$ with an effective number of parameters $p_D$ computed from the posterior; it is common in hierarchical Bayesian models fit by MCMC. HQIC (Hannan-Quinn) uses penalty $2k \log \log n$, which sits between AIC and BIC asymptotically. HQIC is consistent like BIC but penalizes less aggressively, making it useful when BIC is too conservative but AIC overfits.
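
Sketches of the two closed-form members of this family (function names are ours; DIC needs posterior samples, so it is omitted):

```python
import numpy as np

def mallows_cp(rss: float, sigma2: float, k: int, n: int) -> float:
    """Mallows' Cp: RSS/sigma2 + 2k - n, where sigma2 is a noise-variance estimate
    (conventionally taken from the largest candidate model)."""
    return rss / sigma2 + 2 * k - n

def hqic(log_lik: float, k: int, n: int) -> float:
    """Hannan-Quinn: penalty 2k log log n, between AIC's 2k and BIC's k log n."""
    return -2.0 * log_lik + 2.0 * k * np.log(np.log(n))
```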

Worked Example

Example

Polynomial regression model selection

You fit polynomials of degree 1 through 5 to $n = 50$ data points. Each model has $k = d + 1$ parameters (including intercept). Suppose the log-likelihoods are:

| Degree | $k$ | $-2\log\hat{L}$ | AIC   | BIC ($\log 50 \approx 3.91$) |
|--------|-----|-----------------|-------|------------------------------|
| 1      | 2   | 120.0           | 124.0 | 127.8                        |
| 2      | 3   | 105.0           | 111.0 | 116.7                        |
| 3      | 4   | 103.5           | 111.5 | 119.1                        |
| 4      | 5   | 103.0           | 113.0 | 122.6                        |
| 5      | 6   | 102.8           | 114.8 | 126.3                        |

AIC selects degree 2 (AIC = 111.0). BIC also selects degree 2 (BIC = 116.7). Both agree that the degree-2 model best trades off fit and complexity. Note how higher-degree models barely improve the log-likelihood but accumulate penalty.
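
The table is quick to reproduce from the $-2\log\hat{L}$ column alone:

```python
import numpy as np

n = 50
neg2_loglik = {1: 120.0, 2: 105.0, 3: 103.5, 4: 103.0, 5: 102.8}  # from the table above

for d, nll in neg2_loglik.items():
    k = d + 1
    print(f"degree {d}: AIC {nll + 2 * k:.1f}, BIC {nll + k * np.log(n):.1f}")
# Degree 2 minimizes both: AIC 111.0, BIC 116.7.
```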

Common Confusions

Watch Out

AIC and BIC are not comparable across different datasets

AIC and BIC values depend on the data. You can compare AIC values across models fit to the same data, but comparing AIC values from models fit to different datasets is meaningless. The log-likelihood scale depends on $n$ and the data distribution.

Watch Out

Lower is better, not higher

Both AIC and BIC are formulated so that lower is better. This is because they use $-2\log\hat{L}$ (negative log-likelihood), which decreases as fit improves. Some implementations negate this convention; always check.

Watch Out

BIC is not exact Bayesian model comparison

Despite the name, BIC is only a large-$n$ approximation to the log marginal likelihood: it assumes a flat or regular prior and a regular model. True Bayesian model comparison uses the actual marginal likelihood, which can give very different answers.

Summary

  • AIC $= -2\log\hat{L} + 2k$: penalizes lightly, good for prediction
  • BIC $= -2\log\hat{L} + k\log n$: penalizes more, good for model identification
  • AIC is asymptotically efficient (best prediction); BIC is consistent (finds the true model)
  • AIC approximates leave-one-out cross-validation
  • Use AICc when $n/k < 40$
  • Lower is better for both criteria
  • They can disagree, and that disagreement is informative

Exercises

ExerciseCore

Problem

For $n = 100$ observations, how much does BIC penalize each additional parameter compared to AIC? At what sample size does BIC penalize exactly twice as much as AIC per parameter?

ExerciseAdvanced

Problem

You have two models: Model A with $k = 3$ parameters and $\log\hat{L}_A = -50$, and Model B with $k = 10$ parameters and $\log\hat{L}_B = -38$. With $n = 200$ observations, which model does AIC prefer? Which does BIC prefer? Interpret the disagreement.

References

Canonical:

  • Akaike, "A new look at the statistical model identification" (1974): original AIC paper; derives the $2k$ correction via KL divergence and expected log-likelihood bias
  • Schwarz, "Estimating the dimension of a model" (1978): one-page paper introducing BIC from the Laplace approximation to marginal likelihood
  • Stone, "An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion" (1977): proves AIC $\approx$ LOO-CV asymptotically
  • Wilks, "The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses," Annals of Mathematical Statistics 9(1):60-62 (1938): the theorem that aligns AIC's factor of 2 with $\chi^2$ tests

Current:

  • Burnham & Anderson, Model Selection and Multimodel Inference (2002), Chapters 2-4: comprehensive treatment of AIC, AICc, model averaging, and evidence ratios
  • Hastie, Tibshirani & Friedman, Elements of Statistical Learning (2009), Sections 7.5-7.7: AIC, BIC, and their relationship to cross-validation and degrees of freedom
  • Gelman et al., Bayesian Data Analysis (2013), Chapter 7: LOO-CV, WAIC, and why BIC can misfire on hierarchical models

Last reviewed: April 26, 2026
