Foundations

KL Divergence

Kullback-Leibler divergence measures how one probability distribution differs from another. Asymmetric, always non-negative, and central to variational inference, MLE, and RLHF.


Why This Matters

KL asymmetry is about which mistakes the expectation notices. Forward KL D_{\text{KL}}(P \| Q) penalizes missing target mass, so the approximation q spreads to cover every mode of the target p.

Objective: the expectation is under the target distribution p, so places where p has mass are expensive for q to miss.

Behavior: a unimodal approximation spreads out to cover both target modes. This is why forward KL is often called mode-covering or mean-seeking.

ML connection: MLE and cross-entropy training minimize forward KL up to the fixed entropy of the data distribution.

KL divergence appears everywhere in machine learning. Maximum likelihood estimation minimizes KL divergence from the data distribution to the model. Variational inference minimizes KL divergence from the approximate posterior to the true posterior. RLHF uses a KL penalty to keep the fine-tuned policy close to the base model. Understanding the asymmetry of KL divergence explains why these different objectives produce different behaviors.

Mental Model

D_{\text{KL}}(P \| Q) measures the expected number of extra bits needed to encode samples from P using a code optimized for Q. It is zero only when P = Q. It is not symmetric: the cost of encoding P-samples with a Q-code differs from encoding Q-samples with a P-code.

Core Definitions

Definition

KL Divergence

For discrete distributions P and Q on the same sample space:

D_{\text{KL}}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}

For continuous distributions with densities p and q:

D_{\text{KL}}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx

Convention: 0 \log(0/q) = 0 and p \log(p/0) = +\infty when p > 0. If Q does not dominate P (there exists x with P(x) > 0 but Q(x) = 0), then D_{\text{KL}}(P \| Q) = +\infty.
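
The definition and its conventions translate directly into code. A minimal sketch (function name and example numbers are illustrative, not from the source):

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) in nats.

    Uses the convention 0 * log(0/q) = 0, and returns infinity
    when P has mass where Q does not (Q fails to dominate P).
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # 0 * log(0/q) = 0 by convention
        if qi == 0:
            return math.inf   # p > 0 but q = 0: divergence is infinite
        total += pi * math.log(pi / qi)
    return total

print(kl_divergence([0.4, 0.6], [0.5, 0.5]))       # small positive value
print(kl_divergence([0.4, 0.6], [0.4, 0.6]))       # 0.0: identical distributions
print(kl_divergence([1.0, 0.0], [0.5, 0.5]))       # finite: log 2
print(kl_divergence([0.5, 0.5], [1.0, 0.0]))       # inf: Q misses P's support
```

Note the asymmetry of the two zero conventions: a zero in P silently drops a term, while a zero in Q under P's support blows the divergence up to infinity.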

Definition

Cross-Entropy

The cross-entropy between P and Q is:

H(P, Q) = -\sum_{x} P(x) \log Q(x) = H(P) + D_{\text{KL}}(P \| Q)

where H(P) = -\sum_x P(x) \log P(x) is the Shannon entropy. Since H(P) is fixed with respect to Q, minimizing cross-entropy over Q is equivalent to minimizing KL divergence.
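
The decomposition H(P, Q) = H(P) + D_KL(P || Q) can be checked numerically. A quick sketch with made-up distributions:

```python
import math

def entropy(p):
    # Shannon entropy H(P) in nats
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # H(P, Q) = -sum_x P(x) log Q(x)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.2, 0.3, 0.5]   # illustrative distributions
q = [0.4, 0.4, 0.2]

# The two sides of H(P, Q) = H(P) + D_KL(P || Q) agree term for term
print(cross_entropy(p, q), entropy(p) + kl(p, q))
```

Since H(P) does not depend on q, any optimizer that lowers cross-entropy is lowering KL by exactly the same amount.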

Main Theorems

Theorem

Gibbs Inequality (Non-negativity of KL Divergence)

Statement

For any two probability distributions P and Q:

D_{\text{KL}}(P \| Q) \geq 0

with equality if and only if P = Q almost everywhere.

Intuition

The logarithm is concave. Applying Jensen's inequality to \log(Q/P) under expectation with respect to P gives \mathbb{E}_P[\log(Q/P)] \leq \log(\mathbb{E}_P[Q/P]) = \log(1) = 0. Negating both sides yields D_{\text{KL}} \geq 0.

Proof Sketch

By Jensen's inequality applied to the concave function \log:

-D_{\text{KL}}(P \| Q) = \mathbb{E}_P\left[\log \frac{Q(X)}{P(X)}\right] \leq \log \mathbb{E}_P\left[\frac{Q(X)}{P(X)}\right] = \log \sum_x Q(x) = 0

Equality in Jensen's holds iff Q(X)/P(X) is constant P-a.s., which forces P = Q.
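
Both sides of the Jensen step can be evaluated on concrete numbers. A small sketch (the distributions are illustrative):

```python
import math

p = [0.1, 0.2, 0.3, 0.4]
q = [0.25, 0.25, 0.25, 0.25]

# The two sides of Jensen's inequality from the proof sketch:
lhs = sum(pi * math.log(qi / pi) for pi, qi in zip(p, q))    # E_P[log(Q/P)] = -D_KL
rhs = math.log(sum(pi * (qi / pi) for pi, qi in zip(p, q)))  # log E_P[Q/P] = log 1 = 0

print(lhs, rhs)  # lhs is strictly negative, rhs is exactly 0
```

The right-hand side collapses to log 1 because the P(x) factors cancel, leaving the sum of Q over its whole sample space; the gap between the two sides is exactly the (positive) KL divergence.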

Why It Matters

Non-negativity of KL divergence is the foundation for many results. It implies that cross-entropy is always at least the entropy: H(P, Q) \geq H(P). It also implies that the population minimizer of D_{\text{KL}}(P \| Q_\theta) over a well-specified family is exactly P, which is the identifiability fact behind MLE; full asymptotic consistency of the sample MLE additionally requires standard M-estimation regularity conditions (compactness, continuity, dominated convergence, identifiability). Finally, it ensures the ELBO is a valid lower bound on the log-evidence.

Failure Mode

KL divergence is not a metric. It is not symmetric: D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P) in general, and it does not satisfy the triangle inequality. This means you cannot use it as a distance in metric-space arguments. If you need a true metric, consider the total variation distance or the Wasserstein distance.

Forward vs. Reverse KL

The two directions of KL divergence produce qualitatively different behavior when fitting an approximate distribution Q to a target P.

Forward KL D_{\text{KL}}(P \| Q) vs. reverse KL D_{\text{KL}}(Q \| P):

  • Behavior: forward is mode-covering (M-projection); reverse is mode-seeking (I-projection).
  • Penalty heaviest when: forward, Q(x) \to 0 where P(x) > 0 (zero-avoiding); reverse, Q(x) > 0 where P(x) \to 0 (zero-forcing).
  • What Q learns: forward spreads mass to cover every mode of P, even small ones; reverse concentrates on a single mode of P and ignores the others.
  • Used in: forward in MLE, cross-entropy training, and normalizing flows; reverse in variational inference (ELBO), the RLHF KL penalty, and mean-field VI.
  • Failure mode: forward, Q underfits multimodal P by averaging; reverse, Q collapses, missing modes that matter.

The choice of direction is the choice of failure mode, not a free parameter. Pick the direction whose pathology is the one you can tolerate.

Forward KL: D_{\text{KL}}(P \| Q)

Minimizing D_{\text{KL}}(P \| Q) over Q is mode-covering (also called "M-projection" or mean-seeking). The objective penalizes Q(x) \approx 0 wherever P(x) > 0, because P(x) \log(P(x)/Q(x)) \to \infty as Q(x) \to 0. So Q must spread mass to cover all modes of P, even if this means placing mass where P is small.

This is exactly what MLE does: given data from P, minimize H(P_{\text{data}}, Q_\theta) = H(P_{\text{data}}) + D_{\text{KL}}(P_{\text{data}} \| Q_\theta).
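
The mode-covering behavior shows up in a small numeric sketch (the bimodal target, grid, and integration settings are all illustrative choices, not from the source): fitting a single Gaussian to a two-mode mixture by forward KL produces a wide fit centered between the modes.

```python
import math

SQRT2PI = math.sqrt(2 * math.pi)

def mix_pdf(x):
    # Bimodal target P: equal mixture of N(-3, 1) and N(3, 1)
    g = lambda x, m: math.exp(-(x - m) ** 2 / 2) / SQRT2PI
    return 0.5 * g(x, -3) + 0.5 * g(x, 3)

def gauss_logpdf(x, mu, sigma):
    return -(x - mu) ** 2 / (2 * sigma ** 2) - math.log(sigma * SQRT2PI)

def forward_kl(mu, sigma, lo=-12.0, hi=12.0, n=2000):
    # D_KL(P || Q) ~= sum_x p(x) [log p(x) - log q(x)] dx  (midpoint rule)
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        p = mix_pdf(x)
        total += p * (math.log(p) - gauss_logpdf(x, mu, sigma)) * dx
    return total

# Grid search over unimodal Gaussians Q: the forward-KL winner matches
# P's moments (mean 0, variance 1 + 3^2 = 10), spreading over both modes
# rather than committing to one.
best = min(((mu, s) for mu in (-4, -2, 0, 2, 4) for s in (1, 2, 3, 4, 5)),
           key=lambda ms: forward_kl(*ms))
print(best)  # mean between the modes, large sigma
```

Within the Gaussian family the exact forward-KL minimizer matches the target's mean and variance, which is why the grid settles on the moment-matched, mode-straddling fit.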

Reverse KL: D_{\text{KL}}(Q \| P)

Minimizing D_{\text{KL}}(Q \| P) over Q is mode-seeking (also called "I-projection"). The objective penalizes Q(x) > 0 wherever P(x) \approx 0, because Q(x) \log(Q(x)/P(x)) \to \infty as P(x) \to 0. So Q avoids placing mass where P is small, and tends to concentrate on a single mode of P.

This is what variational inference does: minimize D_{\text{KL}}(q_\phi(z) \| p(z \mid x)) over the variational family q_\phi.
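
The contrast with the forward direction can be seen by minimizing reverse KL against the same kind of bimodal target (again, the target, grid, and integration settings are illustrative): the winning unimodal Gaussian locks onto a single mode instead of spreading.

```python
import math

SQRT2PI = math.sqrt(2 * math.pi)

def mix_pdf(x):
    # Bimodal target P: equal mixture of N(-3, 1) and N(3, 1)
    g = lambda x, m: math.exp(-(x - m) ** 2 / 2) / SQRT2PI
    return 0.5 * g(x, -3) + 0.5 * g(x, 3)

def gauss_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * SQRT2PI)

def reverse_kl(mu, sigma, lo=-12.0, hi=12.0, n=2000):
    # D_KL(Q || P) ~= sum_x q(x) [log q(x) - log p(x)] dx  (midpoint rule)
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        q = gauss_pdf(x, mu, sigma)
        total += q * (math.log(q) - math.log(mix_pdf(x))) * dx
    return total

# Grid search: the reverse-KL winner sits on ONE mode (mu = +/-3, sigma = 1)
# because mass in the low-probability valley between modes is heavily penalized.
best = min(((mu, s) for mu in (-3, 0, 3) for s in (1, 2, 3, 4)),
           key=lambda ms: reverse_kl(*ms))
print(best)
```

The two modes are symmetric, so which one the search picks is an arbitrary tie-break; the point is that it picks one, not both.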

Connection to Variational Inference

The evidence lower bound (ELBO) arises from the decomposition:

\log p(x) = \mathcal{L}(q) + D_{\text{KL}}(q(z) \| p(z \mid x))

where \mathcal{L}(q) = \mathbb{E}_{q}[\log p(x, z) - \log q(z)] is the ELBO. Since D_{\text{KL}} \geq 0, we have \mathcal{L}(q) \leq \log p(x). Maximizing the ELBO over q is equivalent to minimizing D_{\text{KL}}(q \| p(\cdot \mid x)).
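
The decomposition is an exact identity, which a toy discrete model makes easy to verify (the joint probabilities and variational distribution below are made-up numbers):

```python
import math

# Toy model with latent z in {0, 1} and one fixed observation x:
# hypothetical joint values p(x, z) for the observed x.
joint = {0: 0.12, 1: 0.28}
evidence = sum(joint.values())                        # p(x) = sum_z p(x, z)
posterior = {z: joint[z] / evidence for z in joint}   # p(z | x)

q = {0: 0.5, 1: 0.5}  # any variational distribution over z

# ELBO = E_q[log p(x, z) - log q(z)]
elbo = sum(q[z] * (math.log(joint[z]) - math.log(q[z])) for z in q)
# D_KL(q(z) || p(z | x))
kl = sum(q[z] * math.log(q[z] / posterior[z]) for z in q)

print(math.log(evidence), elbo + kl)  # the identity: log p(x) = ELBO + KL
print(elbo <= math.log(evidence))     # ELBO is a valid lower bound
```

Because the gap between log p(x) and the ELBO is exactly the KL term, pushing the ELBO up with p fixed is the same move as pulling q toward the posterior.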

Connection to RLHF

In RLHF, the optimization objective includes a KL penalty:

\max_\pi \mathbb{E}_{x \sim \pi}[r(x)] - \beta \, D_{\text{KL}}(\pi \| \pi_{\text{ref}})

where \pi_{\text{ref}} is the supervised fine-tuned (base) policy and r is the reward model. The KL term prevents the fine-tuned policy from deviating too far from the base model, which would lead to reward hacking (exploiting errors in r rather than genuinely improving quality). The coefficient \beta controls the tradeoff.
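
On a discrete toy problem the objective can be computed exactly. The sketch below uses made-up reference probabilities and rewards, and relies on the standard closed-form optimum of this objective, \pi^*(x) \propto \pi_{\text{ref}}(x) \exp(r(x)/\beta), a Boltzmann reweighting of the base policy:

```python
import math

# Hypothetical "policy" over 4 candidate responses, with made-up rewards.
pi_ref = [0.4, 0.3, 0.2, 0.1]
reward = [1.0, 0.0, 2.0, -1.0]
beta = 0.5

def objective(pi):
    # E_{x ~ pi}[r(x)] - beta * D_KL(pi || pi_ref)
    expected_reward = sum(p * r for p, r in zip(pi, reward))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl

# Closed-form optimum: pi*(x) proportional to pi_ref(x) * exp(r(x) / beta)
w = [q * math.exp(r / beta) for q, r in zip(pi_ref, reward)]
pi_star = [wi / sum(w) for wi in w]

# pi* beats both the reference policy (zero KL, mediocre reward) and a
# greedy policy that puts all mass on the top reward (large KL cost).
greedy = [0.0, 0.0, 1.0, 0.0]
print(objective(pi_star), objective(greedy), objective(pi_ref))
```

Raising beta pulls pi* back toward pi_ref; lowering it lets the policy chase reward more aggressively, which is exactly the reward-hacking tradeoff the penalty is there to control.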

Canonical Examples

Example

KL between two Gaussians

For P = \mathcal{N}(\mu_1, \sigma_1^2) and Q = \mathcal{N}(\mu_2, \sigma_2^2):

D_{\text{KL}}(P \| Q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}

Note: this is zero only when \mu_1 = \mu_2 and \sigma_1 = \sigma_2. The asymmetry is visible: swapping P and Q gives a different expression.
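
The closed form can be cross-checked against direct numerical integration of the definition (parameter values are illustrative):

```python
import math

def kl_gauss(mu1, s1, mu2, s2):
    # Closed form for D_KL(N(mu1, s1^2) || N(mu2, s2^2)) in nats
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def kl_gauss_numeric(mu1, s1, mu2, s2, n=100_000):
    # Midpoint-rule integration of the definition over +/-10 std devs of P
    lo, hi = mu1 - 10 * s1, mu1 + 10 * s1
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        lp = -(x - mu1)**2 / (2 * s1**2) - math.log(s1)   # log p up to const
        lq = -(x - mu2)**2 / (2 * s2**2) - math.log(s2)   # log q up to const
        p = math.exp(lp) / math.sqrt(2 * math.pi)
        total += p * (lp - lq) * dx                       # constants cancel
    return total

print(kl_gauss(0, 1, 1, 2))          # forward direction
print(kl_gauss(1, 2, 0, 1))          # swapped: a different number
print(kl_gauss_numeric(0, 1, 1, 2))  # agrees with the closed form
```

The second print is the asymmetry made concrete: swapping the roles of P and Q changes the value, not just the notation.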

Example

KL between Bernoulli distributions

For P = \text{Bern}(p) and Q = \text{Bern}(q):

D_{\text{KL}}(P \| Q) = p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}

If p = 0.9 and q = 0.5: D_{\text{KL}} = 0.9 \log(1.8) + 0.1 \log(0.2) \approx 0.368 nats. If p = 0.5 and q = 0.9: D_{\text{KL}} = 0.5 \log(5/9) + 0.5 \log(5) \approx 0.511 nats. The different values confirm the asymmetry.
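
The two numbers in the example are quick to reproduce (function name is illustrative):

```python
import math

def kl_bern(p, q):
    # D_KL(Bern(p) || Bern(q)) in nats, with the 0 log 0 = 0 convention
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)

print(round(kl_bern(0.9, 0.5), 3))  # 0.368
print(round(kl_bern(0.5, 0.9), 3))  # 0.511
```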

Common Confusions

Watch Out

KL divergence is not a distance

D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P), and the triangle inequality fails. Calling it a "distance" is common but misleading. It is a divergence: a measure of discrepancy, not a metric.

Watch Out

Small KL is not the same as pointwise closeness

D_{\text{KL}}(P \| Q) = 0 means P = Q a.e., which is a strong statement. Small forward KL does imply small total variation via Pinsker's inequality,

\text{TV}(P, Q) \leq \sqrt{D_{\text{KL}}(P \| Q)/2},

but it does not mean the densities are close pointwise, and it does not control the reverse direction D_{\text{KL}}(Q \| P). A model can have small average extra code length while still assigning poor density in rare but important regions.
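
Pinsker's inequality is easy to stress-test on random distribution pairs (the sample-space size, seed, and trial count are arbitrary choices):

```python
import math
import random

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def tv(p, q):
    # Total variation distance: half the L1 distance between the vectors
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def rand_dist(n, rng):
    w = [rng.random() + 1e-9 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(42)
ok = all(
    tv(p, q) <= math.sqrt(kl(p, q) / 2) + 1e-12   # Pinsker, with float slack
    for p, q in ((rand_dist(6, rng), rand_dist(6, rng)) for _ in range(5000))
)
print(ok)
```

The bound only runs one way: small forward KL forces small TV, but small TV does not bound KL, which can even be infinite when supports mismatch.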

Watch Out

Minimizing forward KL is not the same as minimizing reverse KL

MLE minimizes forward KL D_{\text{KL}}(P_{\text{data}} \| Q_\theta), producing mode-covering fits. Variational inference minimizes reverse KL D_{\text{KL}}(q_\phi \| p(\cdot \mid x)), producing mode-seeking fits. When P is multimodal and Q is unimodal, forward KL places Q between modes; reverse KL concentrates Q on one mode. These are genuinely different objectives with different solutions.

Summary

  • D_{\text{KL}}(P \| Q) = \mathbb{E}_P[\log(P/Q)] \geq 0, with equality iff P = Q
  • Not symmetric, not a metric
  • Forward KL (D(P \| Q) minimized over Q): mode-covering, equivalent to MLE
  • Reverse KL (D(Q \| P) minimized over Q): mode-seeking, used in variational inference
  • Cross-entropy = entropy + KL divergence
  • The KL penalty in RLHF constrains policy deviation from the base model

Exercises

Exercise (Core)

Problem

Compute D_{\text{KL}}(P \| Q) and D_{\text{KL}}(Q \| P) for P = \text{Bern}(0.5) and Q = \text{Bern}(0.1). Verify that they differ.

Exercise (Advanced)

Problem

Derive the closed-form KL divergence between two multivariate Gaussians P = \mathcal{N}(\mu_1, \Sigma_1) and Q = \mathcal{N}(\mu_2, \Sigma_2) in \mathbb{R}^d.

References

Canonical:

  • Cover & Thomas, Elements of Information Theory (2006), Chapter 2
  • Kullback & Leibler, "On Information and Sufficiency" (1951)

Current:

  • Murphy, Probabilistic Machine Learning: Advanced Topics (2023), Chapter 6
  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1.6 and 10.1
  • MacKay, Information Theory, Inference, and Learning Algorithms (2003), Chapter 2
  • Csiszár & Körner, Information Theory (2nd ed., 2011), Chapter 2

Next Topics

  • Variational inference: using reverse KL to approximate intractable posteriors
  • Mutual information: symmetric quantity built from KL divergence

Last reviewed: April 26, 2026

Canonical graph

Required before and derived from this topic

These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.