Foundations

KL Divergence

Kullback-Leibler divergence measures how one probability distribution differs from another. Asymmetric, always non-negative, and central to variational inference, MLE, and RLHF.


Why This Matters

KL asymmetry is about which mistakes the expectation notices. Forward KL D_{\text{KL}}(P \| Q) penalizes missing target mass, so the approximation q spreads to cover every mode of the target p.

Objective: the expectation is under the target distribution p, so places where p has mass are expensive for q to miss.

Behavior: a unimodal approximation spreads out to cover both target modes. This is why forward KL is often called mode-covering or mean-seeking.

ML connection: MLE and cross-entropy training minimize forward KL up to the fixed entropy of the data distribution.

KL divergence appears everywhere in machine learning. Maximum likelihood estimation minimizes KL divergence from the data distribution to the model. Variational inference minimizes KL divergence from the approximate posterior to the true posterior. RLHF uses a KL penalty to keep the fine-tuned policy close to the base model. Understanding the asymmetry of KL divergence explains why these different objectives produce different behaviors.

Mental Model

D_{\text{KL}}(P \| Q) measures the expected number of extra bits needed to encode samples from P using a code optimized for Q. It is zero only when P = Q. It is not symmetric: the cost of encoding P-samples with a Q-code differs from encoding Q-samples with a P-code.

Core Definitions

Definition

KL Divergence

For discrete distributions P and Q on the same sample space:

D_{\text{KL}}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}

For continuous distributions with densities p and q:

D_{\text{KL}}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx

Convention: 0 \log(0/q) = 0 and p \log(p/0) = +\infty when p > 0. If Q does not dominate P (there exists x with P(x) > 0 but Q(x) = 0), then D_{\text{KL}}(P \| Q) = +\infty.
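
The definition and its conventions translate directly into code. A minimal sketch (function name and example numbers are illustrative, not from the source):

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) in nats.

    Uses the convention 0 * log(0/q) = 0, and returns infinity
    when P has mass where Q does not (Q fails to dominate P).
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # 0 * log(0/q) = 0 by convention
        if qi == 0:
            return math.inf   # p > 0 but q = 0: divergence is infinite
        total += pi * math.log(pi / qi)
    return total

print(kl_divergence([0.4, 0.6], [0.5, 0.5]))       # small positive value
print(kl_divergence([0.4, 0.6], [0.4, 0.6]))       # 0.0: identical distributions
print(kl_divergence([1.0, 0.0], [0.5, 0.5]))       # finite: log 2
print(kl_divergence([0.5, 0.5], [1.0, 0.0]))       # inf: Q misses P's support
```

Note the asymmetry of the two zero conventions: a zero in P silently drops a term, while a zero in Q under P's support blows the divergence up to infinity.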

Definition

Cross-Entropy

The cross-entropy between P and Q is:

H(P, Q) = -\sum_{x} P(x) \log Q(x) = H(P) + D_{\text{KL}}(P \| Q)

where H(P) = -\sum_x P(x) \log P(x) is the Shannon entropy. Since H(P) is fixed with respect to Q, minimizing cross-entropy over Q is equivalent to minimizing KL divergence.
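
The decomposition H(P, Q) = H(P) + D_KL(P || Q) can be checked numerically. A quick sketch with made-up distributions:

```python
import math

def entropy(p):
    # Shannon entropy H(P) in nats
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # H(P, Q) = -sum_x P(x) log Q(x)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.2, 0.3, 0.5]   # illustrative distributions
q = [0.4, 0.4, 0.2]

# The two sides of H(P, Q) = H(P) + D_KL(P || Q) agree term for term
print(cross_entropy(p, q), entropy(p) + kl(p, q))
```

Since H(P) does not depend on q, any optimizer that lowers cross-entropy is lowering KL by exactly the same amount.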

Main Theorems

Theorem

Gibbs Inequality (Non-negativity of KL Divergence)

Statement

For any two probability distributions P and Q:

D_{\text{KL}}(P \| Q) \geq 0

with equality if and only if P = Q almost everywhere.

Intuition

The logarithm is concave. Applying Jensen's inequality to \log(Q/P) under expectation with respect to P gives \mathbb{E}_P[\log(Q/P)] \leq \log(\mathbb{E}_P[Q/P]) = \log(1) = 0. Negating both sides yields D_{\text{KL}} \geq 0.

Proof Sketch

By Jensen's inequality applied to the concave function \log:

-D_{\text{KL}}(P \| Q) = \mathbb{E}_P\left[\log \frac{Q(X)}{P(X)}\right] \leq \log \mathbb{E}_P\left[\frac{Q(X)}{P(X)}\right] = \log \sum_x Q(x) = 0

Equality in Jensen's holds iff Q(X)/P(X) is constant P-a.s., which forces P = Q.
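
Both sides of the Jensen step can be evaluated on concrete numbers. A small sketch (the distributions are illustrative):

```python
import math

p = [0.1, 0.2, 0.3, 0.4]
q = [0.25, 0.25, 0.25, 0.25]

# The two sides of Jensen's inequality from the proof sketch:
lhs = sum(pi * math.log(qi / pi) for pi, qi in zip(p, q))    # E_P[log(Q/P)] = -D_KL
rhs = math.log(sum(pi * (qi / pi) for pi, qi in zip(p, q)))  # log E_P[Q/P] = log 1 = 0

print(lhs, rhs)  # lhs is strictly negative, rhs is exactly 0
```

The right-hand side collapses to log 1 because the P(x) factors cancel, leaving the sum of Q over its whole sample space; the gap between the two sides is exactly the (positive) KL divergence.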

Why It Matters

Non-negativity of KL divergence is the foundation for many results. It implies that cross-entropy is always at least the entropy: H(P, Q) \geq H(P). It also implies that the population minimizer of D_{\text{KL}}(P \| Q_\theta) over a well-specified family is exactly P, which is the identifiability fact behind MLE; full asymptotic consistency of the sample MLE additionally requires standard M-estimation regularity conditions (compactness, continuity, dominated convergence, identifiability). Finally, it ensures the ELBO is a valid lower bound on the log-evidence.

Failure Mode

KL divergence is not a metric. It is not symmetric: D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P) in general, and it does not satisfy the triangle inequality. This means you cannot use it as a distance in metric-space arguments. If you need a true metric, consider the total variation distance or the Wasserstein distance.

Forward vs. Reverse KL

The two directions of KL divergence produce qualitatively different behavior when fitting an approximate distribution Q to a target P.

Forward KL D_{\text{KL}}(P \| Q) vs. reverse KL D_{\text{KL}}(Q \| P):

  • Behavior: forward is mode-covering (M-projection); reverse is mode-seeking (I-projection).
  • Penalty heaviest when: forward, Q(x) \to 0 where P(x) > 0 (zero-avoiding); reverse, Q(x) > 0 where P(x) \to 0 (zero-forcing).
  • What Q learns: forward spreads mass to cover every mode of P, even small ones; reverse concentrates on a single mode of P and ignores the others.
  • Used in: forward in MLE, cross-entropy training, and normalizing flows; reverse in variational inference (ELBO), the RLHF KL penalty, and mean-field VI.
  • Failure mode: forward, Q underfits multimodal P by averaging; reverse, Q collapses, missing modes that matter.

The choice of direction is the choice of failure mode, not a free parameter. Pick the direction whose pathology is the one you can tolerate.

Forward KL: D_{\text{KL}}(P \| Q)

Minimizing D_{\text{KL}}(P \| Q) over Q is mode-covering (also called "M-projection" or mean-seeking). The objective penalizes Q(x) \approx 0 wherever P(x) > 0, because P(x) \log(P(x)/Q(x)) \to \infty as Q(x) \to 0. So Q must spread mass to cover all modes of P, even if this means placing mass where P is small.

This is exactly what MLE does: given data from P, minimize H(P_{\text{data}}, Q_\theta) = H(P_{\text{data}}) + D_{\text{KL}}(P_{\text{data}} \| Q_\theta).
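
The mode-covering behavior shows up in a small numeric sketch (the bimodal target, grid, and integration settings are all illustrative choices, not from the source): fitting a single Gaussian to a two-mode mixture by forward KL produces a wide fit centered between the modes.

```python
import math

SQRT2PI = math.sqrt(2 * math.pi)

def mix_pdf(x):
    # Bimodal target P: equal mixture of N(-3, 1) and N(3, 1)
    g = lambda x, m: math.exp(-(x - m) ** 2 / 2) / SQRT2PI
    return 0.5 * g(x, -3) + 0.5 * g(x, 3)

def gauss_logpdf(x, mu, sigma):
    return -(x - mu) ** 2 / (2 * sigma ** 2) - math.log(sigma * SQRT2PI)

def forward_kl(mu, sigma, lo=-12.0, hi=12.0, n=2000):
    # D_KL(P || Q) ~= sum_x p(x) [log p(x) - log q(x)] dx  (midpoint rule)
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        p = mix_pdf(x)
        total += p * (math.log(p) - gauss_logpdf(x, mu, sigma)) * dx
    return total

# Grid search over unimodal Gaussians Q: the forward-KL winner matches
# P's moments (mean 0, variance 1 + 3^2 = 10), spreading over both modes
# rather than committing to one.
best = min(((mu, s) for mu in (-4, -2, 0, 2, 4) for s in (1, 2, 3, 4, 5)),
           key=lambda ms: forward_kl(*ms))
print(best)  # mean between the modes, large sigma
```

Within the Gaussian family the exact forward-KL minimizer matches the target's mean and variance, which is why the grid settles on the moment-matched, mode-straddling fit.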

Reverse KL: D_{\text{KL}}(Q \| P)

Minimizing D_{\text{KL}}(Q \| P) over Q is mode-seeking (also called "I-projection"). The objective penalizes Q(x) > 0 wherever P(x) \approx 0, because Q(x) \log(Q(x)/P(x)) \to \infty as P(x) \to 0. So Q avoids placing mass where P is small, and tends to concentrate on a single mode of P.

This is what variational inference does: minimize D_{\text{KL}}(q_\phi(z) \| p(z \mid x)) over the variational family q_\phi.
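
The contrast with the forward direction can be seen by minimizing reverse KL against the same kind of bimodal target (again, the target, grid, and integration settings are illustrative): the winning unimodal Gaussian locks onto a single mode instead of spreading.

```python
import math

SQRT2PI = math.sqrt(2 * math.pi)

def mix_pdf(x):
    # Bimodal target P: equal mixture of N(-3, 1) and N(3, 1)
    g = lambda x, m: math.exp(-(x - m) ** 2 / 2) / SQRT2PI
    return 0.5 * g(x, -3) + 0.5 * g(x, 3)

def gauss_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * SQRT2PI)

def reverse_kl(mu, sigma, lo=-12.0, hi=12.0, n=2000):
    # D_KL(Q || P) ~= sum_x q(x) [log q(x) - log p(x)] dx  (midpoint rule)
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        q = gauss_pdf(x, mu, sigma)
        total += q * (math.log(q) - math.log(mix_pdf(x))) * dx
    return total

# Grid search: the reverse-KL winner sits on ONE mode (mu = +/-3, sigma = 1)
# because mass in the low-probability valley between modes is heavily penalized.
best = min(((mu, s) for mu in (-3, 0, 3) for s in (1, 2, 3, 4)),
           key=lambda ms: reverse_kl(*ms))
print(best)
```

The two modes are symmetric, so which one the search picks is an arbitrary tie-break; the point is that it picks one, not both.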

Connection to Variational Inference

The evidence lower bound (ELBO) arises from the decomposition:

\log p(x) = \mathcal{L}(q) + D_{\text{KL}}(q(z) \| p(z \mid x))

where \mathcal{L}(q) = \mathbb{E}_{q}[\log p(x, z) - \log q(z)] is the ELBO. Since D_{\text{KL}} \geq 0, we have \mathcal{L}(q) \leq \log p(x). Maximizing the ELBO over q is equivalent to minimizing D_{\text{KL}}(q \| p(\cdot \mid x)).
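
The decomposition is an exact identity, which a toy discrete model makes easy to verify (the joint probabilities and variational distribution below are made-up numbers):

```python
import math

# Toy model with latent z in {0, 1} and one fixed observation x:
# hypothetical joint values p(x, z) for the observed x.
joint = {0: 0.12, 1: 0.28}
evidence = sum(joint.values())                        # p(x) = sum_z p(x, z)
posterior = {z: joint[z] / evidence for z in joint}   # p(z | x)

q = {0: 0.5, 1: 0.5}  # any variational distribution over z

# ELBO = E_q[log p(x, z) - log q(z)]
elbo = sum(q[z] * (math.log(joint[z]) - math.log(q[z])) for z in q)
# D_KL(q(z) || p(z | x))
kl = sum(q[z] * math.log(q[z] / posterior[z]) for z in q)

print(math.log(evidence), elbo + kl)  # the identity: log p(x) = ELBO + KL
print(elbo <= math.log(evidence))     # ELBO is a valid lower bound
```

Because the gap between log p(x) and the ELBO is exactly the KL term, pushing the ELBO up with p fixed is the same move as pulling q toward the posterior.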

Connection to RLHF

In RLHF, the optimization objective includes a KL penalty:

\max_\pi \mathbb{E}_{x \sim \pi}[r(x)] - \beta \, D_{\text{KL}}(\pi \| \pi_{\text{ref}})

where \pi_{\text{ref}} is the supervised fine-tuned (base) policy and r is the reward model. The KL term prevents the fine-tuned policy from deviating too far from the base model, which would lead to reward hacking (exploiting errors in r rather than genuinely improving quality). The coefficient \beta controls the tradeoff.
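
On a discrete toy problem the objective can be computed exactly. The sketch below uses made-up reference probabilities and rewards, and relies on the standard closed-form optimum of this objective, \pi^*(x) \propto \pi_{\text{ref}}(x) \exp(r(x)/\beta), a Boltzmann reweighting of the base policy:

```python
import math

# Hypothetical "policy" over 4 candidate responses, with made-up rewards.
pi_ref = [0.4, 0.3, 0.2, 0.1]
reward = [1.0, 0.0, 2.0, -1.0]
beta = 0.5

def objective(pi):
    # E_{x ~ pi}[r(x)] - beta * D_KL(pi || pi_ref)
    expected_reward = sum(p * r for p, r in zip(pi, reward))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl

# Closed-form optimum: pi*(x) proportional to pi_ref(x) * exp(r(x) / beta)
w = [q * math.exp(r / beta) for q, r in zip(pi_ref, reward)]
pi_star = [wi / sum(w) for wi in w]

# pi* beats both the reference policy (zero KL, mediocre reward) and a
# greedy policy that puts all mass on the top reward (large KL cost).
greedy = [0.0, 0.0, 1.0, 0.0]
print(objective(pi_star), objective(greedy), objective(pi_ref))
```

Raising beta pulls pi* back toward pi_ref; lowering it lets the policy chase reward more aggressively, which is exactly the reward-hacking tradeoff the penalty is there to control.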

Canonical Examples

Example

KL between two Gaussians

For P = \mathcal{N}(\mu_1, \sigma_1^2) and Q = \mathcal{N}(\mu_2, \sigma_2^2):

D_{\text{KL}}(P \| Q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}

Note: this is zero only when \mu_1 = \mu_2 and \sigma_1 = \sigma_2. The asymmetry is visible: swapping P and Q gives a different expression.
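
The closed form can be cross-checked against direct numerical integration of the definition (parameter values are illustrative):

```python
import math

def kl_gauss(mu1, s1, mu2, s2):
    # Closed form for D_KL(N(mu1, s1^2) || N(mu2, s2^2)) in nats
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def kl_gauss_numeric(mu1, s1, mu2, s2, n=100_000):
    # Midpoint-rule integration of the definition over +/-10 std devs of P
    lo, hi = mu1 - 10 * s1, mu1 + 10 * s1
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        lp = -(x - mu1)**2 / (2 * s1**2) - math.log(s1)   # log p up to const
        lq = -(x - mu2)**2 / (2 * s2**2) - math.log(s2)   # log q up to const
        p = math.exp(lp) / math.sqrt(2 * math.pi)
        total += p * (lp - lq) * dx                       # constants cancel
    return total

print(kl_gauss(0, 1, 1, 2))          # forward direction
print(kl_gauss(1, 2, 0, 1))          # swapped: a different number
print(kl_gauss_numeric(0, 1, 1, 2))  # agrees with the closed form
```

The second print is the asymmetry made concrete: swapping the roles of P and Q changes the value, not just the notation.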

Example

KL between Bernoulli distributions

For P = \text{Bern}(p) and Q = \text{Bern}(q):

D_{\text{KL}}(P \| Q) = p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}

If p = 0.9 and q = 0.5: D_{\text{KL}} = 0.9 \log(1.8) + 0.1 \log(0.2) \approx 0.368 nats. If p = 0.5 and q = 0.9: D_{\text{KL}} = 0.5 \log(5/9) + 0.5 \log(5) \approx 0.511 nats. The different values confirm the asymmetry.
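
The two numbers in the example are quick to reproduce (function name is illustrative):

```python
import math

def kl_bern(p, q):
    # D_KL(Bern(p) || Bern(q)) in nats, with the 0 log 0 = 0 convention
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)

print(round(kl_bern(0.9, 0.5), 3))  # 0.368
print(round(kl_bern(0.5, 0.9), 3))  # 0.511
```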

Common Confusions

Watch Out

KL divergence is not a distance

D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P), and the triangle inequality fails. Calling it a "distance" is common but misleading. It is a divergence: a measure of discrepancy, not a metric.

Watch Out

Small KL is not the same as pointwise closeness

D_{\text{KL}}(P \| Q) = 0 means P = Q a.e., which is a strong statement. Small forward KL does imply small total variation via Pinsker's inequality,

\text{TV}(P, Q) \leq \sqrt{D_{\text{KL}}(P \| Q)/2},

but it does not mean the densities are close pointwise, and it does not control the reverse direction D_{\text{KL}}(Q \| P). A model can have small average extra code length while still assigning poor density in rare but important regions.
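
Pinsker's inequality is easy to stress-test on random distribution pairs (the sample-space size, seed, and trial count are arbitrary choices):

```python
import math
import random

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def tv(p, q):
    # Total variation distance: half the L1 distance between the vectors
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def rand_dist(n, rng):
    w = [rng.random() + 1e-9 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(42)
ok = all(
    tv(p, q) <= math.sqrt(kl(p, q) / 2) + 1e-12   # Pinsker, with float slack
    for p, q in ((rand_dist(6, rng), rand_dist(6, rng)) for _ in range(5000))
)
print(ok)
```

The bound only runs one way: small forward KL forces small TV, but small TV does not bound KL, which can even be infinite when supports mismatch.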

Watch Out

Minimizing forward KL is not the same as minimizing reverse KL

MLE minimizes forward KL D_{\text{KL}}(P_{\text{data}} \| Q_\theta), producing mode-covering fits. Variational inference minimizes reverse KL D_{\text{KL}}(q_\phi \| p(\cdot \mid x)), producing mode-seeking fits. When P is multimodal and Q is unimodal, forward KL places Q between modes; reverse KL concentrates Q on one mode. These are genuinely different objectives with different solutions.

Summary

  • D_{\text{KL}}(P \| Q) = \mathbb{E}_P[\log(P/Q)] \geq 0, with equality iff P = Q
  • Not symmetric, not a metric
  • Forward KL (D(P \| Q) minimized over Q): mode-covering, equivalent to MLE
  • Reverse KL (D(Q \| P) minimized over Q): mode-seeking, used in variational inference
  • Cross-entropy = entropy + KL divergence
  • The KL penalty in RLHF constrains policy deviation from the base model

Exercises

Exercise (Core)

Problem

Compute D_{\text{KL}}(P \| Q) and D_{\text{KL}}(Q \| P) for P = \text{Bern}(0.5) and Q = \text{Bern}(0.1). Verify that they differ.

Exercise (Advanced)

Problem

Derive the closed-form KL divergence between two multivariate Gaussians P = \mathcal{N}(\mu_1, \Sigma_1) and Q = \mathcal{N}(\mu_2, \Sigma_2) in \mathbb{R}^d.

References

Canonical:

  • Cover & Thomas, Elements of Information Theory (2006), Chapter 2
  • Kullback & Leibler, "On Information and Sufficiency" (1951)

Current:

  • Murphy, Probabilistic Machine Learning: Advanced Topics (2023), Chapter 6
  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1.6 and 10.1
  • MacKay, Information Theory, Inference, and Learning Algorithms (2003), Chapter 2
  • Csiszár & Körner, Information Theory (2nd ed., 2011), Chapter 2

Next Topics

  • Variational inference: using reverse KL to approximate intractable posteriors
  • Mutual information: symmetric quantity built from KL divergence

Last reviewed: April 26, 2026

Canonical graph

Required before and derived from this topic

These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.