Foundations
KL Divergence
Kullback-Leibler divergence measures how one probability distribution differs from another. Asymmetric, always non-negative, and central to variational inference, MLE, and RLHF.
Why This Matters
KL asymmetry is about which mistakes the expectation notices
Penalizes missing target mass, so $q$ spreads to cover every mode.
Objective
$$D_{\mathrm{KL}}(p \parallel q) = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right]$$

The expectation is under the target distribution $p$, so places where $p$ has mass are expensive to miss.
Behavior
A unimodal approximation spreads out to cover both target modes. This is why forward KL is often called mode-covering or mean-seeking.
ML connection
MLE and cross-entropy training minimize forward KL up to the fixed entropy of the data distribution.
KL divergence appears everywhere in machine learning. Maximum likelihood estimation minimizes KL divergence from the data distribution to the model. Variational inference minimizes KL divergence from the approximate posterior to the true posterior. RLHF uses a KL penalty to keep the fine-tuned policy close to the base model. Understanding the asymmetry of KL divergence explains why these different objectives produce different behaviors.
Mental Model
$D_{\mathrm{KL}}(p \parallel q)$ measures the expected number of extra bits needed to encode samples from $p$ using a code optimized for $q$. It is zero only when $p = q$. It is not symmetric: the cost of encoding $p$-samples with a $q$-code differs from the cost of encoding $q$-samples with a $p$-code.
Core Definitions
KL Divergence
For discrete distributions $p$ and $q$ on the same sample space $\mathcal{X}$:

$$D_{\mathrm{KL}}(p \parallel q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$$

For continuous distributions with densities $p$ and $q$:

$$D_{\mathrm{KL}}(p \parallel q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$

Convention: $0 \log \frac{0}{q(x)} = 0$ and $p(x) \log \frac{p(x)}{0} = \infty$ when $p(x) > 0$. If $q$ does not dominate $p$ (there exists $x$ with $p(x) > 0$ but $q(x) = 0$), then $D_{\mathrm{KL}}(p \parallel q) = \infty$.
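These conventions can be implemented directly. A minimal sketch of discrete KL divergence using NumPy (the function name and example vectors are illustrative):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability vectors.

    Applies the conventions: terms with p(x) = 0 contribute 0, and the
    result is infinite when q(x) = 0 somewhere that p(x) > 0.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # 0 * log(0/q) = 0 terms are dropped
    if np.any(q[mask] == 0):          # q does not dominate p
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5, 0.0]
q = [0.5, 0.25, 0.25]
print(kl_divergence(p, q))           # finite: q covers the support of p
print(kl_divergence(q, p))           # inf: p puts zero mass where q > 0
```

Note the asymmetry of the domination condition: $q$ covering the support of $p$ says nothing about the reverse direction.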
Cross-Entropy
The cross-entropy between $p$ and $q$ is:

$$H(p, q) = -\sum_x p(x) \log q(x) = H(p) + D_{\mathrm{KL}}(p \parallel q)$$

where $H(p) = -\sum_x p(x) \log p(x)$ is the Shannon entropy. Since $H(p)$ is fixed with respect to $q$, minimizing cross-entropy over $q$ is equivalent to minimizing KL divergence.
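The identity is easy to check numerically; a quick sketch with two random distributions (using NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random distributions on 5 outcomes (strictly positive, so logs are finite).
p = rng.dirichlet(np.ones(5))
q = rng.dirichlet(np.ones(5))

entropy = -np.sum(p * np.log(p))            # H(p)
cross_entropy = -np.sum(p * np.log(q))      # H(p, q)
kl = np.sum(p * np.log(p / q))              # D_KL(p || q)

# H(p, q) = H(p) + D_KL(p || q) holds exactly, term by term.
assert np.isclose(cross_entropy, entropy + kl)
```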
Main Theorems
Gibbs Inequality (Non-negativity of KL Divergence)
Statement
For any two probability distributions $p$ and $q$:

$$D_{\mathrm{KL}}(p \parallel q) \geq 0,$$

with equality if and only if $p = q$ almost everywhere.
Intuition
The logarithm is concave. Applying Jensen's inequality to $\log \frac{q(x)}{p(x)}$ under expectation with respect to $p$ gives $\mathbb{E}_p\left[\log \frac{q}{p}\right] \leq \log \mathbb{E}_p\left[\frac{q}{p}\right] = \log 1 = 0$. Negating both sides yields $D_{\mathrm{KL}}(p \parallel q) \geq 0$.
Proof Sketch
By Jensen's inequality applied to the concave function $\log$:

$$-D_{\mathrm{KL}}(p \parallel q) = \mathbb{E}_p\left[\log \frac{q(x)}{p(x)}\right] \leq \log \mathbb{E}_p\left[\frac{q(x)}{p(x)}\right] = \log \sum_{x:\, p(x) > 0} q(x) \leq \log 1 = 0.$$

Equality in Jensen's holds iff $\frac{q}{p}$ is constant $p$-a.s., which forces $p = q$.
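Non-negativity is easy to sanity-check numerically; a sketch over random distribution pairs (using NumPy; the dimension and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

def kl(p, q):
    return np.sum(p * np.log(p / q))

# Gibbs inequality: D_KL(p || q) >= 0 for every pair, 0 exactly when p == q.
min_kl = min(kl(rng.dirichlet(np.ones(10)), rng.dirichlet(np.ones(10)))
             for _ in range(1000))
print(min_kl >= 0.0)                            # True
print(kl(np.full(4, 0.25), np.full(4, 0.25)))   # 0.0 when p == q
```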
Why It Matters
Non-negativity of KL divergence is the foundation for many results. It implies that cross-entropy is always at least the entropy: $H(p, q) \geq H(p)$. It implies that the population minimizer of $D_{\mathrm{KL}}(p_{\text{data}} \parallel p_\theta)$ over a well-specified family is exactly $p_{\text{data}}$, which is the identifiability fact behind MLE; full asymptotic consistency of the sample MLE additionally requires standard M-estimation regularity conditions (compactness, continuity, dominated convergence, identifiability). Finally, it ensures the ELBO is a valid lower bound on the log-evidence.
Failure Mode
KL divergence is not a metric. It is not symmetric: $D_{\mathrm{KL}}(p \parallel q) \neq D_{\mathrm{KL}}(q \parallel p)$ in general. It does not satisfy the triangle inequality. This means you cannot use it as a distance in metric-space arguments. If you need a true metric, consider the total variation distance or Wasserstein distance.
Forward vs. Reverse KL
The two directions of KL divergence produce qualitatively different behavior when fitting an approximate distribution $q$ to a target $p$.
| | Forward $D_{\mathrm{KL}}(p \parallel q)$ | Reverse $D_{\mathrm{KL}}(q \parallel p)$ |
|---|---|---|
| Behavior | Mode-covering (M-projection) | Mode-seeking (I-projection) |
| Penalty heaviest when | $q \approx 0$ where $p > 0$ (zero-avoiding) | $q > 0$ where $p \approx 0$ (zero-forcing) |
| What $q$ learns | Spreads mass to cover every mode of $p$, even small ones | Concentrates on a single mode of $p$, ignores the others |
| Used in | MLE, cross-entropy training, normalizing flows | Variational inference (ELBO), RLHF KL penalty, mean-field VI |
| Failure mode | $q$ underfits multimodal $p$ by averaging | $q$ collapses, missing modes that matter |
The choice of direction is the choice of failure mode, not a free parameter. Pick the direction whose pathology is the one you can tolerate.
Forward KL:

$$D_{\mathrm{KL}}(p \parallel q) = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right]$$

Minimizing over $q$ is mode-covering (also called "M-projection" or mean-seeking). The objective penalizes $q(x) \approx 0$ wherever $p(x) > 0$, because $\log \frac{p(x)}{q(x)} \to \infty$ as $q(x) \to 0$. So $q$ must spread mass to cover all modes of $p$, even if this means placing mass where $p$ is small.

This is exactly what MLE does: given data from $p_{\text{data}}$, minimizing the negative log-likelihood minimizes $D_{\mathrm{KL}}(p_{\text{data}} \parallel q_\theta)$ up to the constant $H(p_{\text{data}})$.
Reverse KL:

$$D_{\mathrm{KL}}(q \parallel p) = \mathbb{E}_{x \sim q}\left[\log \frac{q(x)}{p(x)}\right]$$

Minimizing over $q$ is mode-seeking (also called "I-projection"). The objective penalizes $q(x) > 0$ wherever $p(x) \approx 0$, because $\log \frac{q(x)}{p(x)} \to \infty$ as $p(x) \to 0$. So $q$ avoids placing mass where $p$ is small, and tends to concentrate on a single mode of $p$.

This is what variational inference does: minimize $D_{\mathrm{KL}}(q(z) \parallel p(z \mid x))$ over the variational family $\mathcal{Q}$.
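The mode-covering vs. mode-seeking contrast can be seen directly by fitting a single Gaussian to a bimodal target under each direction. A small sketch using NumPy and a brute-force grid search (the target mixture, grids, and floor constant are illustrative choices):

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Bimodal target: an equal mixture of N(-3, 1) and N(3, 1).
p = 0.5 * gauss(x, -3, 1) + 0.5 * gauss(x, 3, 1)

def kl(a, b):
    """Numerical D_KL(a || b) on the grid; terms with a(x) = 0 contribute 0."""
    mask = a > 0
    b = np.maximum(b[mask], 1e-300)  # floor avoids log(0); acts as a huge penalty
    return np.sum(a[mask] * np.log(a[mask] / b)) * dx

# Brute-force search over unimodal Gaussian approximations q = N(mu, sigma^2).
candidates = [(mu, s) for mu in np.arange(-5, 5.01, 0.25)
              for s in np.arange(0.5, 5.01, 0.25)]
fwd_mu, fwd_sigma = min(candidates, key=lambda c: kl(p, gauss(x, *c)))
rev_mu, rev_sigma = min(candidates, key=lambda c: kl(gauss(x, *c), p))

# Forward KL spreads q across both modes; reverse KL locks onto one of them.
print(f"forward fit: mu={fwd_mu:+.2f}, sigma={fwd_sigma:.2f}")
print(f"reverse fit: mu={rev_mu:+.2f}, sigma={rev_sigma:.2f}")
```

The forward fit lands between the modes with a large variance (moment matching), while the reverse fit sits on one mode with roughly that mode's width: the two failure modes from the table, made concrete.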
Connection to Variational Inference
The evidence lower bound (ELBO) arises from the decomposition:

$$\log p(x) = \underbrace{\mathbb{E}_{q(z)}\left[\log \frac{p(x, z)}{q(z)}\right]}_{\mathrm{ELBO}(q)} + D_{\mathrm{KL}}\big(q(z) \parallel p(z \mid x)\big)$$

where the first term is the ELBO. Since $D_{\mathrm{KL}}(q \parallel p) \geq 0$, we have $\log p(x) \geq \mathrm{ELBO}(q)$. Maximizing the ELBO over $q$ is equivalent to minimizing $D_{\mathrm{KL}}(q(z) \parallel p(z \mid x))$.
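The decomposition can be verified exactly on a toy discrete latent-variable model, where the evidence and posterior are computable in closed form (the joint table and observed index below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy model: joint p(z, x) over 4 latent states and 3 observations,
# plus an arbitrary variational distribution q(z).
joint = rng.dirichlet(np.ones(12)).reshape(4, 3)   # p(z, x)
x_obs = 1                                          # the observed x

evidence = joint[:, x_obs].sum()                   # p(x)
posterior = joint[:, x_obs] / evidence             # p(z | x)
q = rng.dirichlet(np.ones(4))                      # any q(z) on the latent space

elbo = np.sum(q * (np.log(joint[:, x_obs]) - np.log(q)))  # E_q[log p(x,z) - log q(z)]
kl_q_post = np.sum(q * np.log(q / posterior))

# log p(x) = ELBO(q) + KL(q || p(z|x)), so the ELBO never exceeds the evidence.
assert np.isclose(np.log(evidence), elbo + kl_q_post)
assert elbo <= np.log(evidence)
```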
Connection to RLHF
In RLHF, the optimization objective includes a KL penalty:

$$\max_\pi \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\left[r(x, y)\right] - \beta \, D_{\mathrm{KL}}\big(\pi(\cdot \mid x) \parallel \pi_{\mathrm{ref}}(\cdot \mid x)\big)$$

where $\pi_{\mathrm{ref}}$ is the supervised fine-tuned (base) policy and $r$ is the reward model. The KL term prevents the fine-tuned policy $\pi$ from deviating too far from the base model, which would lead to reward hacking (exploiting errors in $r$ rather than genuinely improving quality). The coefficient $\beta$ controls the tradeoff.
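A minimal numeric illustration of the penalized objective at a single decoding step; the next-token distributions, reward score, and $\beta$ value are all hypothetical:

```python
import numpy as np

def kl(p, q):
    """Discrete D_KL(p || q), both strictly positive here."""
    return float(np.sum(p * np.log(p / q)))

beta = 0.1   # KL coefficient (illustrative value)

# Hypothetical next-token distributions over a 4-token vocabulary at one step.
pi_ref = np.array([0.40, 0.30, 0.20, 0.10])   # base (SFT) policy
pi_new = np.array([0.70, 0.15, 0.10, 0.05])   # fine-tuned policy, sharpened by reward

reward = 1.0                                   # reward-model score (illustrative)
penalized = reward - beta * kl(pi_new, pi_ref)

print(f"KL(pi_new || pi_ref) = {kl(pi_new, pi_ref):.4f}")
print(f"penalized objective  = {penalized:.4f}")
```

The sharper the fine-tuned policy relative to the base, the larger the reverse-KL penalty subtracted from the reward.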
Canonical Examples
KL between two Gaussians
For $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$:

$$D_{\mathrm{KL}}(p \parallel q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

Note: this is zero only when $\mu_1 = \mu_2$ and $\sigma_1 = \sigma_2$. The asymmetry is visible: swapping $p$ and $q$ gives a different expression.
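The closed form can be cross-checked against a Monte Carlo estimate of $\mathbb{E}_{x \sim p}[\log p(x) - \log q(x)]$; the parameter values are illustrative:

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form D_KL(N(mu1, s1^2) || N(mu2, s2^2))."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0

# Monte Carlo estimate: average log-density ratio under samples from p.
rng = np.random.default_rng(0)
x = rng.normal(mu1, s1, size=1_000_000)
log_p = -0.5 * ((x - mu1) / s1) ** 2 - np.log(s1) - 0.5 * np.log(2 * np.pi)
log_q = -0.5 * ((x - mu2) / s2) ** 2 - np.log(s2) - 0.5 * np.log(2 * np.pi)
mc_estimate = np.mean(log_p - log_q)

print(kl_gauss(mu1, s1, mu2, s2))   # closed form
print(mc_estimate)                  # should agree to a few decimal places
```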
KL between Bernoulli distributions
For $p = \mathrm{Bernoulli}(\alpha)$ and $q = \mathrm{Bernoulli}(\beta)$:

$$D_{\mathrm{KL}}(p \parallel q) = \alpha \log \frac{\alpha}{\beta} + (1 - \alpha) \log \frac{1 - \alpha}{1 - \beta}$$

For example, with $\alpha = 0.9$ and $\beta = 0.5$: $D_{\mathrm{KL}}(p \parallel q) \approx 0.368$ nats, while $D_{\mathrm{KL}}(q \parallel p) \approx 0.511$ nats. Different values confirm asymmetry.
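A quick numerical check of the formula and its asymmetry, taking $\alpha = 0.9$ and $\beta = 0.5$ as example parameters:

```python
import numpy as np

def kl_bernoulli(a, b):
    """D_KL(Bernoulli(a) || Bernoulli(b)) in nats."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

a, b = 0.9, 0.5   # example parameters
print(f"KL(Ber({a}) || Ber({b})) = {kl_bernoulli(a, b):.4f} nats")
print(f"KL(Ber({b}) || Ber({a})) = {kl_bernoulli(b, a):.4f} nats")
```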
Common Confusions
KL divergence is not a distance
$D_{\mathrm{KL}}(p \parallel q) \neq D_{\mathrm{KL}}(q \parallel p)$ and the triangle inequality fails. Calling it a "distance" is common but misleading. It is a divergence: a measure of discrepancy, not a metric.
Small KL is not the same as pointwise closeness
$D_{\mathrm{KL}}(p \parallel q) = 0$ means $p = q$ a.e., which is a strong statement. Small forward KL does imply small total variation via Pinsker's inequality,

$$\mathrm{TV}(p, q) \leq \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(p \parallel q)},$$

but it does not mean the densities are close pointwise, and it does not control the reverse direction $D_{\mathrm{KL}}(q \parallel p)$. A model can have small average extra code length while still assigning poor density in rare but important regions.
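Pinsker's inequality can be spot-checked numerically on random distribution pairs; a sketch using NumPy (dimension and sample count arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)

def kl(p, q):
    return np.sum(p * np.log(p / q))

def tv(p, q):
    return 0.5 * np.sum(np.abs(p - q))

# Pinsker: TV(p, q) <= sqrt(KL(p || q) / 2) holds on every random pair.
for _ in range(1000):
    p = rng.dirichlet(np.ones(6))
    q = rng.dirichlet(np.ones(6))
    assert tv(p, q) <= np.sqrt(kl(p, q) / 2)
```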
Minimizing forward KL is not the same as minimizing reverse KL
MLE minimizes forward KL $D_{\mathrm{KL}}(p_{\text{data}} \parallel q_\theta)$, producing mode-covering fits. Variational inference minimizes reverse KL $D_{\mathrm{KL}}(q \parallel p)$, producing mode-seeking fits. When $p$ is multimodal and $q$ is unimodal, forward KL places $q$ between modes; reverse KL concentrates $q$ on one mode. These are genuinely different objectives with different solutions.
Summary
- $D_{\mathrm{KL}}(p \parallel q) \geq 0$, with equality iff $p = q$
- Not symmetric, not a metric
- Forward KL ($D_{\mathrm{KL}}(p \parallel q)$ over $q$): mode-covering, equivalent to MLE
- Reverse KL ($D_{\mathrm{KL}}(q \parallel p)$ over $q$): mode-seeking, used in variational inference
- Cross-entropy = entropy + KL divergence: $H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q)$
- KL penalty in RLHF constrains policy deviation from the base model
Exercises
Problem
Compute $D_{\mathrm{KL}}(p \parallel q)$ and $D_{\mathrm{KL}}(q \parallel p)$ for a specific pair of distributions, for example two Bernoullis with different parameters. Verify that they differ.
Problem
Derive the closed-form KL divergence between two multivariate Gaussians $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$ in $\mathbb{R}^d$.
References
Canonical:
- Cover & Thomas, Elements of Information Theory (2006), Chapter 2
- Kullback & Leibler, "On Information and Sufficiency" (1951)
Current:
- Murphy, Probabilistic Machine Learning: Advanced Topics (2023), Chapter 6
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1.6 and 10.1
- MacKay, Information Theory, Inference, and Learning Algorithms (2003), Chapter 2
- Csiszár & Körner, Information Theory (2nd ed., 2011), Chapter 2
Next Topics
- Variational inference: using reverse KL to approximate intractable posteriors
- Mutual information: symmetric quantity built from KL divergence
Last reviewed: April 26, 2026