

Total Variation Distance

Total variation distance measures the largest possible discrepancy between two probability distributions: equivalently half the L1 gap, one minus the overlap mass, or the minimum disagreement probability under a coupling.

Core · Tier 1 · Stable · Core spine · ~35 min

Why This Matters

Total variation distance is one of the standard notions of worst-case distributional discrepancy. If you give two models the freedom to disagree on any measurable event, total variation is the biggest probability gap they can produce.

Figure: six-panel infographic on total variation distance, showing (1) the overlap view relating TV to the L1 gap between two densities, (2) the event view as the maximum probability gap over events, (3) the coupling view via matched and excess mass, (4) the closed-form formulas, (5) why TV matters in ML (MCMC mixing, divergence mismatch, the KL upper bound, TV vs. Wasserstein), and (6) the mental shortcut: "remove the shared mass; what remains, divided by 2, is total variation distance."
Three equivalent views of total variation distance, plus the formulas and where the metric matters in ML.

That single quantity shows up everywhere:

  • in coupling arguments, where TV is the disagreement probability under the best joint construction
  • in KL divergence, via Pinsker's inequality
  • in Wasserstein distance, as the contrast case where geometry is ignored
  • in MCMC and mixing-time theory, where convergence is often stated directly in TV

Mental Model

Three equivalent readings are worth holding at once:

  1. Overlap view. Remove the shared mass $\min(p,q)$; what remains is TV.
  2. Event view. Let an adversary pick the measurable set $A$ where the two distributions disagree most. The biggest gap $|P(A)-Q(A)|$ is TV.
  3. Coupling view. Draw $X \sim P$ and $Y \sim Q$ on the same probability space in the smartest possible way. The smallest possible probability that $X \neq Y$ is TV.

Each view is mathematically equivalent, but each teaches a different instinct.
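
As a sanity check, here is a minimal sketch on a made-up three-outcome distribution (the values are illustrative only) showing that the overlap, L1, and event readings return the same number; the coupling view gets its own explicit construction later on this page.

```python
from itertools import combinations

# A hypothetical three-outcome example (values chosen for illustration only).
outcomes = ["a", "b", "c"]
p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.3, "c": 0.5}

# Overlap view: one minus the shared mass.
tv_overlap = 1.0 - sum(min(p[x], q[x]) for x in outcomes)

# L1 view: half the total absolute difference.
tv_l1 = 0.5 * sum(abs(p[x] - q[x]) for x in outcomes)

# Event view: brute-force the supremum over all subsets A.
tv_event = max(
    abs(sum(p[x] - q[x] for x in A))
    for r in range(len(outcomes) + 1)
    for A in combinations(outcomes, r)
)

print(tv_overlap, tv_l1, tv_event)  # all three print 0.3 (up to float rounding)
```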

Core Definitions

Definition

Total Variation Distance

For two probability measures $P$ and $Q$ on the same measurable space,

$$\mathrm{TV}(P,Q) := \sup_{A} |P(A)-Q(A)|.$$

The supremum runs over all measurable sets $A$.

When densities $p$ and $q$ exist with respect to a common dominating measure, this becomes

$$\mathrm{TV}(P,Q) = \frac12 \int |p(x)-q(x)|\,dx = 1-\int \min(p,q)\,dx.$$

So TV is literally half the $L^1$ distance, or equivalently one minus the overlap mass.
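
A minimal numeric sketch of both density formulas, assuming two unit-variance Gaussians evaluated on a finite grid (the grid and the mean shift are arbitrary choices; the closed form for this pair appears in the worked example below):

```python
import numpy as np

# Hypothetical setup: N(0,1) vs N(mu,1), integrated on a fine grid.
mu = 1.0
x = np.linspace(-10.0, 10.0 + mu, 200_001)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
q = np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi)

tv_half_l1 = 0.5 * np.sum(np.abs(p - q)) * dx      # half the L1 distance
tv_overlap = 1.0 - np.sum(np.minimum(p, q)) * dx   # one minus the overlap mass

print(tv_half_l1, tv_overlap)  # both ~0.3829 for mu = 1
```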

Main Theorems

Theorem

Equivalent Forms of Total Variation

Statement

The following are equivalent:

$$\mathrm{TV}(P,Q)=\sup_A |P(A)-Q(A)|=\frac12\int |p-q|\,dx=1-\int \min(p,q)\,dx.$$

If $p$ and $q$ are densities, the extremizing event is

$$A^*=\{x: p(x)>q(x)\}.$$

Intuition

The signed difference $p-q$ has positive and negative regions. If you want the biggest possible probability gap, you keep exactly the region where $p$ exceeds $q$ and ignore the region where it falls below. That converts the absolute-value integral into the measurable-event supremum.

Proof Sketch

Split the space into $A^*=\{p>q\}$ and its complement. On $A^*$, the difference $p-q$ is positive; on the complement, it is negative. Therefore

$$\int |p-q|\,dx = \int_{A^*}(p-q)\,dx + \int_{(A^*)^c}(q-p)\,dx = 2(P(A^*)-Q(A^*)).$$

This gives $\mathrm{TV}(P,Q)=P(A^*)-Q(A^*)=\tfrac12\int|p-q|\,dx$. The overlap identity follows from the pointwise equality $|p-q|=p+q-2\min(p,q)$ and the fact that $\int p=\int q=1$.
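
A quick brute-force check of the extremizing event, using randomly generated six-point distributions (a sketch; the distributions here are invented for illustration):

```python
import random

random.seed(0)

# Two random distributions on six points (purely illustrative).
k = 6
pw = [random.random() for _ in range(k)]
qw = [random.random() for _ in range(k)]
p = [w / sum(pw) for w in pw]
q = [w / sum(qw) for w in qw]

# Half-L1 form.
half_l1 = 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Gap achieved by the candidate extremizer A* = {i : p_i > q_i}.
gap_at_star = sum(pi - qi for pi, qi in zip(p, q) if pi > qi)

# Brute-force supremum over all 2^k events.
best_gap = max(
    abs(sum(p[i] - q[i] for i in range(k) if (mask >> i) & 1))
    for mask in range(1 << k)
)

print(half_l1, gap_at_star, best_gap)  # three numerically equal values
```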

Why It Matters

This theorem is why TV is so interpretable. It is simultaneously an adversarial event gap, a density-overlap deficit, and an $L^1$ discrepancy. Different subfields pick different forms, but they are all the same object.

Failure Mode

TV ignores geometry. Two point masses at nearby locations can have $\mathrm{TV}=1$ even if the ambient metric says they are extremely close. That is exactly why Wasserstein distance exists.

Theorem

Coupling Characterization of Total Variation

Statement

Among all couplings $(X,Y)$ with marginals $P$ and $Q$,

$$\mathrm{TV}(P,Q)=\inf_{(X,Y)} \mathbb{P}[X\neq Y].$$

Moreover, there exists a maximal coupling attaining equality.

Intuition

The common overlap mass can be coupled to agree exactly. Only the leftover unmatched mass must disagree. So the best possible disagreement probability is precisely the amount of unmatched mass, which is TV.

Proof Sketch

Write the shared mass as $\mu(dx)=\min(p(x),q(x))\,dx$. Sample from the normalized overlap $\mu/\mu(\Omega)$ and set $X=Y$ there. The remaining residual masses of $P$ and $Q$ are then sampled independently on the non-overlap part. Agreement happens exactly on the shared-mass component, whose total weight is $\int \min(p,q)\,dx = 1-\mathrm{TV}(P,Q)$. Hence the disagreement probability is $\mathrm{TV}(P,Q)$.

Why It Matters

This is the bridge from abstract probability metrics to concrete stochastic processes. In mixing-time proofs, you often build a coupling and show the two chains have met by time $t$ with high probability; the theorem converts that meeting event directly into a TV bound.

Failure Mode

Not every naive coupling is maximal. A bad coupling can make the disagreement probability much larger than TV. The theorem says TV is the best possible disagreement rate, not the rate produced by an arbitrary coupling.

Pinsker's Inequality

TV and KL are linked by the classical bound

$$\mathrm{TV}(P,Q)\le \sqrt{\tfrac12 D_{\mathrm{KL}}(P\|Q)}.$$

The constant $\tfrac12$ assumes KL is measured in nats (natural log). If KL is measured in bits ($\log_2$), the bound becomes $\mathrm{TV}\le\sqrt{(\ln 2)/2 \cdot D_{\mathrm{KL}}^{\mathrm{bits}}}$. Always check the log base before plugging in numbers.

This is useful in lower bounds, concentration arguments, and asymptotic statistics because KL often tensorizes more easily than TV. But it is only a one-way control: small KL implies small TV, while the reverse is false without extra assumptions.
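
A minimal sketch of the inequality in action for a Bernoulli pair (the parameter values 0.6 and 0.45 are arbitrary choices, not from the text), including the nats-versus-bits check:

```python
import math

# Hypothetical Bernoulli pair (parameter values are arbitrary).
a, b = 0.6, 0.45

tv = abs(a - b)  # TV between Bernoulli(a) and Bernoulli(b)

kl_nats = a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))
kl_bits = kl_nats / math.log(2)

bound_nats = math.sqrt(kl_nats / 2)                # Pinsker, KL in nats
bound_bits = math.sqrt(math.log(2) / 2 * kl_bits)  # same bound, KL in bits

print(tv, bound_nats, bound_bits)  # 0.15 <= ~0.1504, and both bounds agree
```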

Watch Out

There is no universal reverse Pinsker inequality

Small TV does not imply small KL in general. KL is unbounded: it can be infinite while TV stays finite (and at most one). The standard counterexample: give $P$ a sliver of mass $\epsilon$ on a region where $Q$ assigns zero mass; then $D_{\mathrm{KL}}(P\|Q) = \infty$ while $\mathrm{TV}(P,Q)$ can be as small as $\epsilon$. Reverse Pinsker bounds exist only under extra conditions, e.g., a uniform bound on the likelihood ratio $\mathrm{d}P/\mathrm{d}Q$ or a bounded chi-squared divergence. When you need both directions, state the regularity assumption that controls KL from above.
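
The counterexample takes two lines to set up; this sketch uses an invented two-point space with a sliver of P-mass where Q has none:

```python
import math

# Hypothetical two-point distributions: P puts a sliver eps on an outcome
# that Q misses entirely.
eps = 1e-6
p = {"ok": 1 - eps, "bad": eps}
q = {"ok": 1.0, "bad": 0.0}

tv = 0.5 * sum(abs(p[k] - q[k]) for k in p)  # = eps, tiny

def kl_term(pk, qk):
    if pk == 0.0:
        return 0.0
    if qk == 0.0:
        return math.inf  # P charges a set of Q-measure zero
    return pk * math.log(pk / qk)

kl = sum(kl_term(p[k], q[k]) for k in p)

print(tv)  # ~1e-06
print(kl)  # inf: small TV, infinite KL, so no reverse Pinsker
```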

Worked Example: TV Between Two Gaussians

For $P = \mathcal{N}(0, 1)$ and $Q = \mathcal{N}(\mu, 1)$ with $\mu > 0$, the densities cross at $x = \mu/2$, and the optimal event is $A^* = \{x \le \mu/2\}$. The closed form is

$$\mathrm{TV}(P, Q) = 2\Phi(\mu/2) - 1,$$

where $\Phi$ is the standard normal CDF. Numeric values:

| Mean shift $\mu$ | $\mathrm{TV}(\mathcal{N}(0,1), \mathcal{N}(\mu, 1))$ | $\sqrt{D_{\mathrm{KL}}/2}$ (Pinsker) | Slack: Pinsker minus TV |
|---|---|---|---|
| 0.1 | 0.0398 | 0.0500 | 0.0102 |
| 0.5 | 0.1974 | 0.2500 | 0.0526 |
| 1.0 | 0.3829 | 0.5000 | 0.1171 |
| 2.0 | 0.6827 | 1.0000 | 0.3173 |

The KL between two unit-variance Gaussians with mean shift $\mu$ is $D_{\mathrm{KL}} = \mu^2/2$, so Pinsker's bound says $\mathrm{TV} \le |\mu|/2$. The bound is reasonably tight as $\mu \to 0$ (TV vanishes at rate $\mu/\sqrt{2\pi} \approx 0.399\,\mu$ against Pinsker's $\mu/2$) but loose for moderate-to-large $\mu$. By $\mu = 2$, Pinsker overstates TV by 47%.

This is the canonical illustration of the slack in Pinsker. Tighter inequalities (Bretagnolle-Huber, refined Pinsker bounds) close some of this gap when needed.
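
The table can be reproduced in a few lines. The sketch below also prints the Bretagnolle-Huber bound $\sqrt{1 - e^{-D_{\mathrm{KL}}}}$ cited in the references, using `math.erf` for the normal CDF:

```python
import math

def Phi(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for mu in (0.1, 0.5, 1.0, 2.0):
    tv = 2.0 * Phi(mu / 2.0) - 1.0       # closed form for N(0,1) vs N(mu,1)
    kl = mu**2 / 2.0                      # KL in nats
    pinsker = math.sqrt(kl / 2.0)         # = mu / 2
    bh = math.sqrt(1.0 - math.exp(-kl))   # Bretagnolle-Huber bound on TV
    print(f"mu={mu:3.1f}  TV={tv:.4f}  Pinsker={pinsker:.4f}  BH={bh:.4f}")
```

In the printout, Bretagnolle-Huber beats Pinsker at $\mu = 2$ but is looser at $\mu = 0.1$, matching the usual caveat that it helps when KL is large.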

Le Cam's Two-Point Method

TV is the central object in minimax lower bounds for hypothesis testing and parameter estimation. The mechanism is Le Cam's two-point inequality.

Theorem

Le Cam's Two-Point Inequality

Statement

For any binary test $\psi$ distinguishing $P_0$ from $P_1$,

$$\inf_\psi \bigl[P_0(\psi = 1) + P_1(\psi = 0)\bigr] = 1 - \mathrm{TV}(P_0, P_1).$$

The minimum total error of any test is exactly $1 - \mathrm{TV}$, and the optimal test is the likelihood-ratio test thresholded at 1.

Intuition

Tests can only do as well as the distributions allow them to. The probability of correct discrimination is bounded by the overlap deficit. If $\mathrm{TV} = 0$, the two distributions are identical and no test can beat random guessing; if $\mathrm{TV} = 1$, the two distributions have disjoint support and a perfect test exists.

Why It Matters

Le Cam's method converts every minimax lower-bound problem into a TV computation. Pick two parameter values $\theta_0, \theta_1$ separated by $2\delta$. The minimax risk for any estimator is at least $\delta\,(1 - \mathrm{TV}(P_{\theta_0}^n, P_{\theta_1}^n))$. To produce a non-trivial lower bound, pick $\theta_1 - \theta_0$ small enough that the $n$-fold TV stays bounded away from 1.

Failure Mode

The two-point method gives only a lower bound, often loose. For tighter rates, use Fano's inequality (multiple hypotheses) or Assouad's lemma (parameter-cube hypotheses). The two-point method is the simplest case and the foundation for all of them.

The TV-tensorization rule for product measures, $1 - \mathrm{TV}(P^n, Q^n) \ge (1 - \mathrm{TV}(P, Q))^n$ (with equality only in the trivial cases), is what controls how much $n$ samples shrink the error rate. For $\mathrm{TV}(P, Q) = c$, the minimax test error is at least $(1 - c)^n$, so it can shrink at most exponentially in the number of samples.
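
A small sketch of the two bounds side by side, for an arbitrary single-sample TV of $c = 0.1$ (the value is illustrative, not from the text):

```python
# Tensorization floor vs. the linear union-style bound from the summary below.
c = 0.1
for n in (1, 5, 10, 30, 100):
    error_floor = (1.0 - c) ** n    # 1 - TV(P^n, Q^n) >= (1 - c)^n
    linear_bound = min(1.0, n * c)  # TV(P^n, Q^n) <= n*c, capped at 1
    print(f"n={n:4d}  minimax-error floor >= {error_floor:.4f}  "
          f"linear TV bound: {linear_bound:.2f}")
```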

Maximal Coupling: An Explicit Construction

The coupling theorem above guarantees a maximal coupling exists. Here is the explicit construction for densities $p, q$ on $\mathbb{R}$ with respect to Lebesgue measure.

Let $m(x) = \min(p(x), q(x))$ and $\mu = \int m(x)\,dx = 1 - \mathrm{TV}(P, Q)$. The procedure:

  1. Toss a coin with $P[\text{heads}] = \mu$.
  2. If heads (probability $\mu$): sample $Z \sim m/\mu$ and set $X = Y = Z$.
  3. If tails (probability $1 - \mu = \mathrm{TV}$): sample $X \sim (p - m)/\mathrm{TV}$ and $Y \sim (q - m)/\mathrm{TV}$ independently.

Verify the marginals: with probability $\mu$, $X$ has density $m/\mu$, and with probability $1-\mu$, $X$ has density $(p-m)/\mathrm{TV}$. The mixture density is $\mu \cdot m/\mu + \mathrm{TV} \cdot (p-m)/\mathrm{TV} = m + (p - m) = p$. Same for $Y$ with marginal $q$. The residual densities $(p-m)$ and $(q-m)$ have disjoint supports, so on tails $X \ne Y$ almost surely; the disagreement event $\{X \ne Y\}$ therefore occurs exactly on tails, with probability $\mathrm{TV}$. Done.

For specific examples (two Gaussians, two Bernoullis, two Poissons), this construction is explicit and computable. It is the workhorse of mixing-time arguments for Markov chains: build a coupling, bound the meeting time, convert to TV.
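
Here is a minimal sketch of the three-step recipe for two small discrete distributions (the support and weights are invented for illustration), with a Monte Carlo check that the disagreement frequency matches TV:

```python
import random

random.seed(1)

# Hypothetical discrete distributions on three points (values are illustrative).
support = [0, 1, 2]
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

m = [min(pi, qi) for pi, qi in zip(p, q)]  # shared mass min(p, q)
mu = sum(m)                                 # overlap mass = 1 - TV
tv = 1.0 - mu

def draw(weights):
    # random.choices accepts unnormalized relative weights.
    return random.choices(support, weights=weights, k=1)[0]

def maximal_coupling():
    if random.random() < mu:                        # heads: agree on the overlap
        z = draw(m)
        return z, z
    x = draw([pi - mi for pi, mi in zip(p, m)])     # residual of P
    y = draw([qi - mi for qi, mi in zip(q, m)])     # residual of Q
    return x, y

trials = 200_000
disagreements = sum(x != y for x, y in (maximal_coupling() for _ in range(trials)))
print(tv, disagreements / trials)  # 0.3 vs. a Monte Carlo estimate near 0.30
```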

Why TV and Wasserstein Feel Different

TV is a support-sensitive metric: if two distributions place mass on disjoint sets, TV is already maximal. Wasserstein is a geometry-sensitive metric: if those disjoint sets are close in the ambient space, Wasserstein can still be small.

This difference explains the common dichotomy:

  • TV is natural for hypothesis testing, coupling, and mixing.
  • Wasserstein is natural for transport, generative modeling, and robustness with geometric structure.
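
To make the dichotomy concrete, here is a toy sketch with two point masses a distance $\epsilon$ apart (Wasserstein-1 on the line is the area between the two CDFs, which here is just $\epsilon$; the setup is invented for illustration):

```python
# Toy case: P = point mass at 0, Q = point mass at eps, for small eps.
eps = 1e-3
atoms = [0.0, eps]
p = [1.0, 0.0]
q = [0.0, 1.0]

# TV: the supports are disjoint, so no mass overlaps.
tv = 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Wasserstein-1 on the line: the CDFs differ only on [0, eps), so the area is eps.
w1 = eps

print(tv, w1)  # 1.0 vs 0.001: TV is maximal while Wasserstein is tiny
```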

Common Confusions

Watch Out

TV is a metric, but it is not geometric

TV satisfies symmetry and the triangle inequality, so it is a genuine metric. But it does not know about ambient distances between outcomes. It only sees how much mass fails to overlap.

Watch Out

The factor 1/2 is convention, not substance

Some ML papers define TV as $\|P-Q\|_1$ without the $\frac12$. Probability texts almost always include the $\frac12$, giving range $[0,1]$. The two conventions differ by exactly a factor of two. Always check which one a paper uses before comparing constants.

Summary

  • $\mathrm{TV}(P,Q)=\sup_A |P(A)-Q(A)|$: the adversarial event gap
  • with densities, $\mathrm{TV}(P,Q)=\frac12\int |p-q|\,dx = 1-\int \min(p,q)\,dx$
  • TV is the smallest disagreement probability over all couplings; the maximal coupling attaining it is explicitly constructible
  • TV is sensitive to support mismatch but blind to geometry; Wasserstein is the geometric counterpart
  • Pinsker links TV to KL: $\mathrm{TV}\le \sqrt{D_{\mathrm{KL}}/2}$, tight as $\mathrm{TV} \to 0$, loose for moderate-to-large $\mathrm{TV}$
  • For $\mathcal{N}(0,1)$ vs $\mathcal{N}(\mu, 1)$: $\mathrm{TV} = 2\Phi(\mu/2) - 1$
  • Le Cam's two-point method: minimax test error $= 1 - \mathrm{TV}(P_0, P_1)$
  • TV product-tensorization: $\mathrm{TV}(P^n, Q^n) \le n \cdot \mathrm{TV}(P, Q)$ (linear, useless once $n \cdot \mathrm{TV} \ge 1$); Bretagnolle-Huber gives exponential control instead

Exercises

ExerciseCore

Problem

Let $P=\mathrm{Bernoulli}(0.7)$ and $Q=\mathrm{Bernoulli}(0.4)$. Compute $\mathrm{TV}(P,Q)$ using both the event-gap definition and the $\frac12\|p-q\|_1$ formula.

ExerciseCore

Problem

Verify Pinsker's inequality for $P = \mathrm{Bernoulli}(0.5)$ and $Q = \mathrm{Bernoulli}(0.5 + \epsilon)$ for small $\epsilon > 0$. Compute $\mathrm{TV}(P, Q)$ exactly and compare to $\sqrt{D_{\mathrm{KL}}(P \| Q)/2}$.

ExerciseAdvanced

Problem

Let $P^n$ and $Q^n$ be the $n$-fold product measures of $P$ and $Q$. Show that $\mathrm{TV}(P^n, Q^n) \le n \cdot \mathrm{TV}(P, Q)$. Explain why this bound, while linear in $n$, becomes useless as soon as $n \cdot \mathrm{TV}(P, Q) \ge 1$.

References

Canonical:

  • Levin, D. A., Peres, Y., & Wilmer, E. L. (2009). Markov Chains and Mixing Times. AMS. Chapter 4 is the standard reference for TV in mixing-time arguments and the coupling construction.
  • Villani, C. (2009). Optimal Transport: Old and New. Springer. Chapter 6 covers the contrast between TV and Wasserstein and explains why the geometric/non-geometric distinction matters.
  • Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer. The original treatment of the two-point method and TV's role in minimax theory.
  • Lindvall, T. (1992). Lectures on the Coupling Method. Wiley. The classical treatment of coupling theory; chapters 1-2 cover TV and the maximal coupling.

Current / standard texts:

  • Durrett, R. (2019). Probability: Theory and Examples (5th ed.). Cambridge University Press. Sections on coupling and total variation in the discrete-time chain chapters.
  • van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press. Appendix and Chapter 7 cover TV-KL relations in the M-estimation context.
  • Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. Chapter 3 covers TV, KL, and chi-squared divergences in the modern non-asymptotic setting.
  • Polyanskiy, Y., & Wu, Y. (2025). Information Theory: From Coding to Learning. Cambridge University Press. Chapter 7 covers $f$-divergences (TV, KL, chi-squared) with refined Pinsker-style inequalities and a unified Le Cam-type framework.

Refined inequalities:

  • Bretagnolle, J., & Huber, C. (1979). Estimation des densités: risque minimax. Zeitschrift für Wahrscheinlichkeitstheorie, 47(2), 119-137. The Bretagnolle-Huber inequality $\mathrm{TV}(P, Q)^2 \le 1 - e^{-D_{\mathrm{KL}}(P \| Q)}$, which is tighter than Pinsker for large KL.

Last reviewed: May 4, 2026

3