

Kolmogorov Complexity and MDL

Kolmogorov complexity measures the shortest program that produces a string. The Minimum Description Length principle selects models that compress data best, providing a computable approximation to an incomputable ideal.



Why This Matters

The connection between compression and learning is deep: a model that compresses data well must have captured its regularities, and those regularities are exactly what you need for prediction. Kolmogorov complexity gives the theoretical ideal for compression. MDL (Minimum Description Length) gives a practical principle for model selection based on the same idea.

This perspective offers an alternative to the classical bias-variance framework. Instead of asking "how complex is the hypothesis class?", you ask "how short is the total description of the data using this model?"

Kolmogorov Complexity

Definition

Kolmogorov Complexity

The Kolmogorov complexity of a string x (with respect to a fixed universal Turing machine U) is:

K(x) = \min\{|p| : U(p) = x\}

where |p| is the length of program p in bits. It is the length of the shortest program that outputs x and halts.

Key properties:

  1. Subadditivity: K(x, y) \leq K(x) + K(y) + O(\log(\min(K(x), K(y)))).
  2. Invariance: changing the universal Turing machine changes K(x) by at most an additive constant (independent of x).
  3. Most strings are incompressible: for any length n, fewer than 2^{n-c} strings of length n have K(x) < n - c, so less than a fraction 2^{-c} of all length-n strings can be compressed by more than c bits (a numeric sketch follows below).
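To make the counting in property 3 concrete, here is a minimal sketch (plain Python, standard library only) that evaluates the bound: there are at most 2^{n-c} - 1 binary programs shorter than n - c bits, so fewer than a 2^{-c} fraction of the 2^n strings of length n can be that compressible.

```python
# Counting argument behind "most strings are incompressible":
# only 2^0 + 2^1 + ... + 2^(n-c-1) = 2^(n-c) - 1 programs are shorter
# than n - c bits, but there are 2^n strings of length n.

def incompressible_fraction_bound(n: int, c: int) -> float:
    """Upper bound on the fraction of length-n strings with K(x) < n - c."""
    short_programs = 2 ** (n - c) - 1
    return short_programs / 2 ** n

for c in (1, 10, 20):
    bound = incompressible_fraction_bound(1000, c)
    print(f"at most {bound:.2e} of all 1000-bit strings have K(x) < {1000 - c}")
```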
Theorem

Incomputability of Kolmogorov Complexity

Statement

There is no algorithm that, given an arbitrary string x, computes K(x).

Intuition

A computable K would be self-defeating. Given any threshold n, you could write a short program that searches through strings in lexicographic order and outputs the first one with K(x) > n; call its output x_n. By construction K(x_n) > n, but the search program itself has length only O(\log n) + c (it just stores the constant n and a fixed search routine), which for large n is far below n. That contradicts K(x_n) > n. This is the Berry paradox made formal, and it is the same self-reference that powers the undecidability of halting, although the argument here is direct rather than a reduction from halting.

Proof Sketch

Suppose K is computable. Define a program P that enumerates all strings x with K(x) > n and outputs the first such string found. Program P has length O(\log n) (it just encodes the constant n). But P outputs a string with K(x) > n, while P itself is a program of length O(\log n) that produces x. For large enough n, this gives K(x) \leq O(\log n) < n < K(x), a contradiction.
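The contradiction can also be written down as a program. The sketch below is hypothetical by construction: it assumes a function K that computes Kolmogorov complexity, which this theorem rules out; if such a function existed, the short search routine would output a string more complex than the routine's own length.

```python
from itertools import count, product

def K(x: str) -> int:
    """Hypothetical oracle for Kolmogorov complexity; assumed computable
    only for the sake of the contradiction. No such algorithm exists."""
    raise NotImplementedError

def first_complex_string(n: int) -> str:
    # Enumerate binary strings by length, then lexicographically, and return
    # the first one whose (hypothetically computable) complexity exceeds n.
    for length in count(1):
        for bits in product("01", repeat=length):
            x = "".join(bits)
            if K(x) > n:
                return x

# This routine needs only O(log n) bits to store n plus a fixed loop, yet its
# output x_n would satisfy K(x_n) > n, a contradiction for large n.
```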

Why It Matters

Incomputability is not a technicality. It means there is no algorithm that can determine the true complexity of data. Any practical approach to compression-based learning must use an approximation. MDL, BIC, and practical compression algorithms are all such approximations.

Failure Mode

You can compute upper bounds on K(x) (any specific compression of x gives one), but you can never certify that your bound is tight. There is no algorithm that, given x and k, decides whether K(x) \leq k.
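As a concrete illustration, the sketch below uses zlib as an off-the-shelf compressor: the compressed length (plus the fixed cost of the decompressor) is a computable upper bound on K, but nothing certifies that the bound is anywhere near tight. The comparison of a repetitive ASCII string against random bytes is only illustrative.

```python
import os
import zlib

def upper_bound_bits(data: bytes) -> int:
    """Bits in a zlib encoding of `data`: an upper bound on K(data),
    up to the constant cost of the decompressor itself."""
    return 8 * len(zlib.compress(data, 9))

patterned = b"01" * 500          # the ASCII pattern "01" repeated 500 times
random_bytes = os.urandom(1000)  # incompressible with overwhelming probability

print("patterned :", upper_bound_bits(patterned), "bits vs", 8 * len(patterned), "raw")
print("random    :", upper_bound_bits(random_bytes), "bits vs", 8 * len(random_bytes), "raw")
# The patterned string compresses drastically; the random one does not. Either
# way, zlib only certifies an upper bound on K, never that the bound is tight.
```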

Minimum Description Length

Definition

Minimum Description Length Principle

The MDL principle selects the model M that minimizes the total description length:

L(M) + L(D \mid M)

where L(M) is the length (in bits) of encoding the model and L(D \mid M) is the length of encoding the data D given the model. This is a two-part code: first describe the model, then describe the data using the model.

The first term penalizes model complexity (more parameters require more bits). The second term penalizes poor fit (if the model does not explain the data well, the residuals require many bits). MDL balances these automatically.

Connection to coding theory. By the Kraft inequality, code lengths correspond to probability distributions: L(x) = -\log_2 P(x). So minimizing description length is equivalent to maximizing a penalized likelihood. MDL with a universal code for the model class recovers BIC asymptotically.
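A minimal numeric sketch of the two-part code, under assumed (not canonical) encoding choices: model 0 is a fair coin with no parameters, model 1 is Bernoulli(θ) with θ quantized to 8 bits, one bit encodes the model index, and L(D | M) is the negative base-2 log-likelihood.

```python
import math

def data_bits(heads: int, n: int, theta: float) -> float:
    """L(D | M): negative base-2 log-likelihood of the observed coin flips."""
    return -(heads * math.log2(theta) + (n - heads) * math.log2(1 - theta))

def two_part_lengths(heads: int, n: int, quant_bits: int = 8) -> dict:
    # Quantize the MLE of theta to quant_bits bits, staying away from 0 and 1.
    theta_hat = min(max(heads / n, 1 / n), 1 - 1 / n)
    grid = 2 ** quant_bits
    theta_q = min(max(round(theta_hat * grid), 1), grid - 1) / grid
    return {
        "fair coin":      1 + data_bits(heads, n, 0.5),                 # L(M) = 1 bit (model index)
        "Bernoulli(est)": 1 + quant_bits + data_bits(heads, n, theta_q),
    }

print(two_part_lengths(70, 100))  # biased data: the 8 parameter bits pay for themselves
print(two_part_lengths(52, 100))  # near-fair data: the fair-coin code is shorter overall
```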

Theorem

MDL Consistency

Statement

Assume the true data-generating distribution P^* lies in some model M^* in the (countable, identifiable) model class, the model code satisfies the Kraft inequality \sum_M 2^{-L(M)} \le 1, and L(M^*) < \infty. Under a two-part MDL code, as the sample size n \to \infty, the MDL-selected model converges (in probability) to the simplest model in the class that contains P^*.

Intuition

Kraft alone is not enough: it only ensures the model code is decodable. The work is done by realizability and identifiability. A wrong model pays an excess code length that grows linearly with n (a constant cross-entropy gap per sample), while the correct model pays only the fixed cost L(M^*) plus an O(\log n) parameter-encoding term. For large n the linear penalty dominates the logarithmic one, so wrong models lose. Among models that contain P^*, the Kraft-constrained code length L(M) breaks ties in favor of the simplest.

Proof Sketch

For any model M that does not contain the true distribution P^*, the average code length \frac{1}{n} L(D \mid M) converges to the cross-entropy H(P^*, P_M) > H(P^*). The true model achieves H(P^*) plus an O(\frac{d}{2n} \log n) per-sample penalty term. For large n, the excess cross-entropy of the wrong model exceeds the complexity penalty of the right one.
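Numerically, the two rates look like this (the excess of 0.01 bits per sample and the parameter count d = 5 are assumed purely for illustration):

```python
import math

EXCESS_PER_SAMPLE = 0.01  # assumed cross-entropy gap H(P*, P_M) - H(P*), in bits
D_PARAMS = 5              # assumed parameter count of the correct model

for n in (100, 1_000, 10_000, 100_000):
    wrong_total = EXCESS_PER_SAMPLE * n            # grows linearly with n
    right_total = 0.5 * D_PARAMS * math.log2(n)    # grows only logarithmically
    print(f"n = {n:7d}: wrong-model excess {wrong_total:8.1f} bits, "
          f"correct-model penalty {right_total:5.1f} bits")
# The wrong model looks competitive at n = 100 but is hopeless by n = 100,000.
```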

Why It Matters

This justifies MDL as a model selection principle: it is consistent (picks the right model eventually) and implements Occam's razor automatically. You do not need to specify a separate regularization parameter.

Failure Mode

Consistency is an asymptotic property. For finite samples, MDL can select a model that is too simple (if the complexity penalty overwhelms the data-fit term) or too complex (if the model class encoding is poorly chosen). The result also requires the true distribution to lie within the model class, which is never exactly true in practice.

Connection to Learning Theory

The compression view of learning: if a learning algorithm can compress the training data (describe it in fewer bits than the raw data), then the patterns it found are likely real and will generalize. More precisely, if a hypothesis h allows encoding the training labels in L(h) + L(y^n \mid h, x^n) bits, and this is less than n bits (the cost of encoding n binary labels with no model), then h has captured structure.

This intuition is made precise by Occam-style PAC bounds. For a finite or prefix-coded hypothesis class with code length L(h) in bits, with probability at least 1 - \delta over the sample,

\text{err}(h) - \widehat{\text{err}}(h) \le O\!\left(\sqrt{\frac{L(h)\ln 2 + \ln(1/\delta)}{n}}\right)

in the agnostic setting (or the analogous (L(h)\ln 2 + \ln(1/\delta))/n rate in the realizable case). What "compresses by c bits" buys is a smaller L(h) in the numerator, hence a tighter excess-risk bound. There is no general \sqrt{(n-c)/n} rule: the right form is the Shawe-Taylor / Langford-style PAC bound above, with hypothesis cost measured by code length.
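To see how the code length enters numerically, here is a small sketch that instantiates the agnostic bound with the explicit Hoeffding-plus-union-bound constant (a factor of 2 in the denominator) that the O(\cdot) above suppresses; the specific values of L(h) are illustrative only.

```python
import math

def occam_gap(code_length_bits: float, n: int, delta: float) -> float:
    """err(h) - err_hat(h) <= sqrt((L(h) ln 2 + ln(1/delta)) / (2n)),
    simultaneously for all h, with probability at least 1 - delta."""
    return math.sqrt((code_length_bits * math.log(2) + math.log(1 / delta)) / (2 * n))

n, delta = 10_000, 0.05
for bits in (10, 100, 1000):
    print(f"L(h) = {bits:4d} bits  ->  generalization gap <= {occam_gap(bits, n, delta):.3f}")
# A hypothesis that can be written down in fewer bits gets a tighter guarantee.
```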

Why MDL Is Hard to Operationalize

The elegance of MDL hides practical difficulties:

  1. Code design: the encoding of models and data given models is not unique. Different encodings lead to different model selections.
  2. Continuous parameters: encoding real-valued parameters requires discretization, and the grid resolution affects the result.
  3. Model class: MDL selects among a given collection of models. It cannot search over all possible models (that would require computing K(x)).

In practice, BIC and cross-validation are used more often than explicit MDL, partly because they avoid the coding construction.

Common Confusions

Watch Out

Kolmogorov complexity is not Shannon entropy

Shannon entropy H(X) is a property of a distribution. Kolmogorov complexity K(x) is a property of a specific string. For a random string drawn from a computable distribution P, \mathbb{E}[K(X)] \approx H(X) (up to an additive constant depending on P), but individual strings can have K(x) far from H(X).

Watch Out

MDL is not just penalized likelihood

While MDL can be viewed as penalized likelihood (since code lengths correspond to log-probabilities), the MDL philosophy is different. Penalized likelihood assumes a true model exists and tries to find it. MDL makes no such assumption: it seeks the best compression regardless of whether any model is "true." In the misspecified setting, MDL selects the model that compresses best, which may not be the most "likely" in any generative sense.

Canonical Examples

Example

MDL for polynomial regression

Suppose you fit polynomials of degree 1, 2, 5, and 20 to 50 data points. The degree-20 polynomial drives the training residuals to nearly zero (so L(D \mid M) is small), but encoding 21 coefficients to high precision costs many bits (L(M) is large). A degree-2 polynomial has small L(M) (3 coefficients) but nonzero residuals that require some bits to encode. MDL selects the degree that minimizes the total: if the true relationship is quadratic, degree 2 wins for large enough n.
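The example can be sketched in code under a crude, explicitly assumed coding scheme: each coefficient costs a fixed 16 bits (a simplification, not an optimized code), and residuals are coded with the Gaussian cost (n/2) \log_2(2\pi e \cdot \text{RSS}/n); the data-discretization constant is the same for every degree, so it drops out of the comparison. numpy is assumed available.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(-1.0, 1.0, n)
y = 2.0 - x + 3.0 * x**2 + rng.normal(scale=0.3, size=n)   # ground truth is quadratic

COEF_BITS = 16   # assumed fixed cost per coefficient

def description_length(degree: int) -> float:
    coeffs = np.polyfit(x, y, degree)     # numpy may warn that degree 20 is poorly conditioned
    rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
    model_bits = (degree + 1) * COEF_BITS                        # L(M)
    data_bits = 0.5 * n * np.log2(2 * np.pi * np.e * rss / n)    # ~ L(D | M) under a Gaussian code
    return model_bits + data_bits

for d in (1, 2, 5, 20):
    print(f"degree {d:2d}: total description length ~ {description_length(d):7.1f} bits")
# Degree 2 typically wins: higher degrees barely shrink the residual code but
# pay 16 extra bits for every additional coefficient.
```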

Exercises

Exercise (Core)

Problem

The string x = 01010101\ldots01 (the pattern "01" repeated 500 times, total length 1000 bits) has K(x) \ll 1000. Give an upper bound on K(x) and explain why. A random 1000-bit string has K(x) \approx 1000. Why?

Exercise (Advanced)

Problem

In two-part MDL, you encode a polynomial regression model by specifying the degree d and the d+1 coefficients. Suppose each coefficient is quantized to b bits. Write the total description length as a function of d, b, n (sample size), and the residual sum of squares \text{RSS}(d). Under what conditions does MDL prefer degree 2 over degree 10?

References

Canonical:

  • Li & Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications (2008), Chapters 1-2
  • Grunwald, The Minimum Description Length Principle (2007), Chapters 1-3, 17

Current:

  • Rissanen, "Modeling by Shortest Data Description" (1978)

  • Barron, Rissanen, Yu, "The Minimum Description Length Principle in Coding and Modeling" (1998)

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 2-6

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapters 2-4
