

Total Variation Distance

Total variation distance measures the largest possible discrepancy between two probability distributions: equivalently half the L1 gap, one minus the overlap mass, or the minimum disagreement probability under a coupling.

Core · Tier 1 · Stable · Core spine · ~35 min

Why This Matters

Total variation distance is one of the standard notions of worst-case distributional discrepancy. If you give two models the freedom to disagree on any measurable event, total variation is the biggest probability gap they can produce.

Figure: six-panel infographic on total variation distance, showing (1) the overlap view relating TV to the L1 gap between two densities, (2) the event view as the maximum probability gap over events, (3) the coupling view via matched and excess mass, (4) the closed-form formulas, (5) why TV matters in ML (MCMC mixing, divergence mismatch, the KL upper bound, TV vs. Wasserstein), and (6) the mental shortcut: "remove the shared mass; what remains, divided by 2, is total variation distance."
Three equivalent views of total variation distance, plus the formulas and where the metric matters in ML.

That single quantity shows up everywhere:

  • in coupling arguments, where TV is the disagreement probability under the best joint construction
  • in KL divergence, via Pinsker's inequality
  • in Wasserstein distance, as the contrast case where geometry is ignored
  • in MCMC and mixing-time theory, where convergence is often stated directly in TV

Mental Model

Three equivalent readings are worth holding at once:

  1. Overlap view. Remove the shared mass $\min(p,q)$; what remains is TV.
  2. Event view. Let an adversary pick the measurable set $A$ where the two distributions disagree most. The biggest gap $|P(A)-Q(A)|$ is TV.
  3. Coupling view. Draw $X \sim P$ and $Y \sim Q$ on the same probability space in the smartest possible way. The smallest possible probability that $X \neq Y$ is TV.

Each view is mathematically equivalent, but each teaches a different instinct.
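
As a sanity check, here is a minimal sketch on a made-up three-outcome distribution (the values are illustrative only) showing that the overlap, L1, and event readings return the same number; the coupling view gets its own explicit construction later on this page.

```python
from itertools import combinations

# A hypothetical three-outcome example (values chosen for illustration only).
outcomes = ["a", "b", "c"]
p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.3, "c": 0.5}

# Overlap view: one minus the shared mass.
tv_overlap = 1.0 - sum(min(p[x], q[x]) for x in outcomes)

# L1 view: half the total absolute difference.
tv_l1 = 0.5 * sum(abs(p[x] - q[x]) for x in outcomes)

# Event view: brute-force the supremum over all subsets A.
tv_event = max(
    abs(sum(p[x] - q[x] for x in A))
    for r in range(len(outcomes) + 1)
    for A in combinations(outcomes, r)
)

print(tv_overlap, tv_l1, tv_event)  # all three print 0.3 (up to float rounding)
```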

Core Definitions

Definition

Total Variation Distance

For two probability measures $P$ and $Q$ on the same measurable space,

$$\mathrm{TV}(P,Q) := \sup_{A} |P(A)-Q(A)|.$$

The supremum runs over all measurable sets $A$.

When densities $p$ and $q$ exist with respect to a common dominating measure, this becomes

$$\mathrm{TV}(P,Q) = \frac12 \int |p(x)-q(x)|\,dx = 1-\int \min(p,q)\,dx.$$

So TV is literally half the $L^1$ distance, or equivalently one minus the overlap mass.
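
A minimal numeric sketch of both density formulas, assuming two unit-variance Gaussians evaluated on a finite grid (the grid and the mean shift are arbitrary choices; the closed form for this pair appears in the worked example below):

```python
import numpy as np

# Hypothetical setup: N(0,1) vs N(mu,1), integrated on a fine grid.
mu = 1.0
x = np.linspace(-10.0, 10.0 + mu, 200_001)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
q = np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi)

tv_half_l1 = 0.5 * np.sum(np.abs(p - q)) * dx      # half the L1 distance
tv_overlap = 1.0 - np.sum(np.minimum(p, q)) * dx   # one minus the overlap mass

print(tv_half_l1, tv_overlap)  # both ~0.3829 for mu = 1
```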

Main Theorems

Theorem

Equivalent Forms of Total Variation

Statement

The following are equivalent:

$$\mathrm{TV}(P,Q)=\sup_A |P(A)-Q(A)|=\frac12\int |p-q|\,dx=1-\int \min(p,q)\,dx.$$

If $p$ and $q$ are densities, the extremizing event is

$$A^*=\{x: p(x)>q(x)\}.$$

Intuition

The signed difference $p-q$ has positive and negative regions. If you want the biggest possible probability gap, you keep exactly the region where $p$ exceeds $q$ and ignore the region where it falls below. That converts the absolute-value integral into the measurable-event supremum.

Proof Sketch

Split the space into $A^*=\{p>q\}$ and its complement. On $A^*$, the difference $p-q$ is positive; on the complement, it is negative. Therefore

$$\int |p-q|\,dx = \int_{A^*}(p-q)\,dx + \int_{(A^*)^c}(q-p)\,dx = 2(P(A^*)-Q(A^*)).$$

This gives $\mathrm{TV}(P,Q)=P(A^*)-Q(A^*)=\tfrac12\int|p-q|\,dx$. The overlap identity follows from the pointwise equality $|p-q|=p+q-2\min(p,q)$ and the fact that $\int p=\int q=1$.
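
A quick brute-force check of the extremizing event, using randomly generated six-point distributions (a sketch; the distributions here are invented for illustration):

```python
import random

random.seed(0)

# Two random distributions on six points (purely illustrative).
k = 6
pw = [random.random() for _ in range(k)]
qw = [random.random() for _ in range(k)]
p = [w / sum(pw) for w in pw]
q = [w / sum(qw) for w in qw]

# Half-L1 form.
half_l1 = 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Gap achieved by the candidate extremizer A* = {i : p_i > q_i}.
gap_at_star = sum(pi - qi for pi, qi in zip(p, q) if pi > qi)

# Brute-force supremum over all 2^k events.
best_gap = max(
    abs(sum(p[i] - q[i] for i in range(k) if (mask >> i) & 1))
    for mask in range(1 << k)
)

print(half_l1, gap_at_star, best_gap)  # three numerically equal values
```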

Why It Matters

This theorem is why TV is so interpretable. It is simultaneously an adversarial event gap, a density-overlap deficit, and an $L^1$ discrepancy. Different subfields pick different forms, but they are all the same object.

Failure Mode

TV ignores geometry. Two point masses at nearby locations can have $\mathrm{TV}=1$ even if the ambient metric says they are extremely close. That is exactly why Wasserstein distance exists.

Theorem

Coupling Characterization of Total Variation

Statement

Among all couplings $(X,Y)$ with marginals $P$ and $Q$,

$$\mathrm{TV}(P,Q)=\inf_{(X,Y)} \mathbb{P}[X\neq Y].$$

Moreover, there exists a maximal coupling attaining equality.

Intuition

The common overlap mass can be coupled to agree exactly. Only the leftover unmatched mass must disagree. So the best possible disagreement probability is precisely the amount of unmatched mass, which is TV.

Proof Sketch

Write the shared mass as $\mu(dx)=\min(p(x),q(x))\,dx$. Sample from the normalized overlap $\mu/\mu(\Omega)$ and set $X=Y$ there. The remaining residual masses of $P$ and $Q$ are then sampled independently on the non-overlap part. Agreement happens exactly on the shared-mass component, whose total weight is $\int \min(p,q)\,dx = 1-\mathrm{TV}(P,Q)$. Hence the disagreement probability is $\mathrm{TV}(P,Q)$.

Why It Matters

This is the bridge from abstract probability metrics to concrete stochastic processes. In mixing-time proofs, you often build a coupling and show the two chains have met by time $t$ with high probability; the theorem converts that meeting event directly into a TV bound.

Failure Mode

Not every naive coupling is maximal. A bad coupling can make the disagreement probability much larger than TV. The theorem says TV is the best possible disagreement rate, not the rate produced by an arbitrary coupling.

Pinsker's Inequality

TV and KL are linked by the classical bound

$$\mathrm{TV}(P,Q)\le \sqrt{\tfrac12 D_{\mathrm{KL}}(P\|Q)}.$$

The constant $\tfrac12$ assumes KL is measured in nats (natural log). If KL is measured in bits ($\log_2$), the bound becomes $\mathrm{TV}\le\sqrt{(\ln 2)/2 \cdot D_{\mathrm{KL}}^{\mathrm{bits}}}$. Always check the log base before plugging in numbers.

This is useful in lower bounds, concentration arguments, and asymptotic statistics because KL often tensorizes more easily than TV. But it is only a one-way control: small KL implies small TV, while the reverse is false without extra assumptions.
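
A minimal sketch of the inequality in action for a Bernoulli pair (the parameter values 0.6 and 0.45 are arbitrary choices, not from the text), including the nats-versus-bits check:

```python
import math

# Hypothetical Bernoulli pair (parameter values are arbitrary).
a, b = 0.6, 0.45

tv = abs(a - b)  # TV between Bernoulli(a) and Bernoulli(b)

kl_nats = a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))
kl_bits = kl_nats / math.log(2)

bound_nats = math.sqrt(kl_nats / 2)                # Pinsker, KL in nats
bound_bits = math.sqrt(math.log(2) / 2 * kl_bits)  # same bound, KL in bits

print(tv, bound_nats, bound_bits)  # 0.15 <= ~0.1504, and both bounds agree
```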

Watch Out

There is no universal reverse Pinsker inequality

Small TV does not imply small KL in general. KL is unbounded: it can be infinite while TV stays finite (and at most one). The standard counterexample: give $P$ a sliver of mass $\epsilon$ on a region where $Q$ assigns zero mass; then $D_{\mathrm{KL}}(P\|Q) = \infty$ while $\mathrm{TV}(P,Q)$ can be as small as $\epsilon$. Reverse Pinsker bounds exist only under extra conditions, e.g., a uniform bound on the likelihood ratio $\mathrm{d}P/\mathrm{d}Q$ or a bounded chi-squared divergence. When you need both directions, state the regularity assumption that controls KL from above.
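
The counterexample takes two lines to set up; this sketch uses an invented two-point space with a sliver of P-mass where Q has none:

```python
import math

# Hypothetical two-point distributions: P puts a sliver eps on an outcome
# that Q misses entirely.
eps = 1e-6
p = {"ok": 1 - eps, "bad": eps}
q = {"ok": 1.0, "bad": 0.0}

tv = 0.5 * sum(abs(p[k] - q[k]) for k in p)  # = eps, tiny

def kl_term(pk, qk):
    if pk == 0.0:
        return 0.0
    if qk == 0.0:
        return math.inf  # P charges a set of Q-measure zero
    return pk * math.log(pk / qk)

kl = sum(kl_term(p[k], q[k]) for k in p)

print(tv)  # ~1e-06
print(kl)  # inf: small TV, infinite KL, so no reverse Pinsker
```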

Worked Example: TV Between Two Gaussians

For $P = \mathcal{N}(0, 1)$ and $Q = \mathcal{N}(\mu, 1)$ with $\mu > 0$, the densities cross at $x = \mu/2$, and the optimal event is $A^* = \{x \le \mu/2\}$. The closed form is

$$\mathrm{TV}(P, Q) = 2\Phi(\mu/2) - 1,$$

where $\Phi$ is the standard normal CDF. Numeric values:

| Mean shift $\mu$ | $\mathrm{TV}(\mathcal{N}(0,1), \mathcal{N}(\mu, 1))$ | $\sqrt{D_{\mathrm{KL}}/2}$ (Pinsker) | Slack: Pinsker minus TV |
|---|---|---|---|
| 0.1 | 0.0398 | 0.0500 | 0.0102 |
| 0.5 | 0.1974 | 0.2500 | 0.0526 |
| 1.0 | 0.3829 | 0.5000 | 0.1171 |
| 2.0 | 0.6827 | 1.0000 | 0.3173 |

The KL between two unit-variance Gaussians with mean shift $\mu$ is $D_{\mathrm{KL}} = \mu^2/2$, so Pinsker's bound says $\mathrm{TV} \le |\mu|/2$. The bound is reasonably tight as $\mu \to 0$ (TV vanishes at rate $\mu/\sqrt{2\pi} \approx 0.399\,\mu$ against Pinsker's $\mu/2$) but loose for moderate-to-large $\mu$. By $\mu = 2$, Pinsker overstates TV by 47%.

This is the canonical illustration of the slack in Pinsker. Tighter inequalities (Bretagnolle-Huber, refined Pinsker bounds) close some of this gap when needed.
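
The table can be reproduced in a few lines. The sketch below also prints the Bretagnolle-Huber bound $\sqrt{1 - e^{-D_{\mathrm{KL}}}}$ cited in the references, using `math.erf` for the normal CDF:

```python
import math

def Phi(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for mu in (0.1, 0.5, 1.0, 2.0):
    tv = 2.0 * Phi(mu / 2.0) - 1.0       # closed form for N(0,1) vs N(mu,1)
    kl = mu**2 / 2.0                      # KL in nats
    pinsker = math.sqrt(kl / 2.0)         # = mu / 2
    bh = math.sqrt(1.0 - math.exp(-kl))   # Bretagnolle-Huber bound on TV
    print(f"mu={mu:3.1f}  TV={tv:.4f}  Pinsker={pinsker:.4f}  BH={bh:.4f}")
```

In the printout, Bretagnolle-Huber beats Pinsker at $\mu = 2$ but is looser at $\mu = 0.1$, matching the usual caveat that it helps when KL is large.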

Le Cam's Two-Point Method

TV is the central object in minimax lower bounds for hypothesis testing and parameter estimation. The mechanism is Le Cam's two-point inequality.

Theorem

Le Cam's Two-Point Inequality

Statement

For any binary test $\psi$ distinguishing $P_0$ from $P_1$,

$$\inf_\psi \bigl[P_0(\psi = 1) + P_1(\psi = 0)\bigr] = 1 - \mathrm{TV}(P_0, P_1).$$

The minimum total error of any test is exactly $1 - \mathrm{TV}$, and the optimal test is the likelihood-ratio test thresholded at 1.

Intuition

Tests can only do as well as the distributions allow them to. The probability of correct discrimination is bounded by the overlap deficit. If $\mathrm{TV} = 0$, the two distributions are identical and no test can beat random guessing; if $\mathrm{TV} = 1$, the two distributions have disjoint support and a perfect test exists.

Why It Matters

Le Cam's method converts every minimax lower-bound problem into a TV computation. Pick two parameter values $\theta_0, \theta_1$ separated by $2\delta$. The minimax risk for any estimator is at least $\delta\,(1 - \mathrm{TV}(P_{\theta_0}^n, P_{\theta_1}^n))$. To produce a non-trivial lower bound, pick $\theta_1 - \theta_0$ small enough that the $n$-fold TV stays bounded away from 1.

Failure Mode

The two-point method gives only a lower bound, often loose. For tighter rates, use Fano's inequality (multiple hypotheses) or Assouad's lemma (parameter-cube hypotheses). The two-point method is the simplest case and the foundation for all of them.

The TV-tensorization rule for product measures, $1 - \mathrm{TV}(P^n, Q^n) \ge (1 - \mathrm{TV}(P, Q))^n$ (with equality only in the trivial cases), is what controls how much $n$ samples shrink the error rate. For $\mathrm{TV}(P, Q) = c$, the minimax test error is at least $(1 - c)^n$, so it can shrink at most exponentially in the number of samples.
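
A small sketch of the two bounds side by side, for an arbitrary single-sample TV of $c = 0.1$ (the value is illustrative, not from the text):

```python
# Tensorization floor vs. the linear union-style bound from the summary below.
c = 0.1
for n in (1, 5, 10, 30, 100):
    error_floor = (1.0 - c) ** n    # 1 - TV(P^n, Q^n) >= (1 - c)^n
    linear_bound = min(1.0, n * c)  # TV(P^n, Q^n) <= n*c, capped at 1
    print(f"n={n:4d}  minimax-error floor >= {error_floor:.4f}  "
          f"linear TV bound: {linear_bound:.2f}")
```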

Maximal Coupling: An Explicit Construction

The coupling theorem above guarantees a maximal coupling exists. Here is the explicit construction for densities $p, q$ on $\mathbb{R}$ with respect to Lebesgue measure.

Let $m(x) = \min(p(x), q(x))$ and $\mu = \int m(x)\,dx = 1 - \mathrm{TV}(P, Q)$. The procedure:

  1. Toss a coin with $P[\text{heads}] = \mu$.
  2. If heads (probability $\mu$): sample $Z \sim m/\mu$ and set $X = Y = Z$.
  3. If tails (probability $1 - \mu = \mathrm{TV}$): sample $X \sim (p - m)/\mathrm{TV}$ and $Y \sim (q - m)/\mathrm{TV}$ independently.

Verify the marginals: with probability $\mu$, $X$ has density $m/\mu$, and with probability $1-\mu$, $X$ has density $(p-m)/\mathrm{TV}$. The mixture density is $\mu \cdot m/\mu + \mathrm{TV} \cdot (p-m)/\mathrm{TV} = m + (p - m) = p$. Same for $Y$ with marginal $q$. The residual densities $(p-m)$ and $(q-m)$ have disjoint supports, so on tails $X \ne Y$ almost surely; the disagreement event $\{X \ne Y\}$ therefore occurs exactly on tails, with probability $\mathrm{TV}$. Done.

For specific examples (two Gaussians, two Bernoullis, two Poissons), this construction is explicit and computable. It is the workhorse of mixing-time arguments for Markov chains: build a coupling, bound the meeting time, convert to TV.
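
Here is a minimal sketch of the three-step recipe for two small discrete distributions (the support and weights are invented for illustration), with a Monte Carlo check that the disagreement frequency matches TV:

```python
import random

random.seed(1)

# Hypothetical discrete distributions on three points (values are illustrative).
support = [0, 1, 2]
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

m = [min(pi, qi) for pi, qi in zip(p, q)]  # shared mass min(p, q)
mu = sum(m)                                 # overlap mass = 1 - TV
tv = 1.0 - mu

def draw(weights):
    # random.choices accepts unnormalized relative weights.
    return random.choices(support, weights=weights, k=1)[0]

def maximal_coupling():
    if random.random() < mu:                        # heads: agree on the overlap
        z = draw(m)
        return z, z
    x = draw([pi - mi for pi, mi in zip(p, m)])     # residual of P
    y = draw([qi - mi for qi, mi in zip(q, m)])     # residual of Q
    return x, y

trials = 200_000
disagreements = sum(x != y for x, y in (maximal_coupling() for _ in range(trials)))
print(tv, disagreements / trials)  # 0.3 vs. a Monte Carlo estimate near 0.30
```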

Why TV and Wasserstein Feel Different

TV is a support-sensitive metric: if two distributions place mass on disjoint sets, TV is already maximal. Wasserstein is a geometry-sensitive metric: if those disjoint sets are close in the ambient space, Wasserstein can still be small.

This difference explains the common dichotomy:

  • TV is natural for hypothesis testing, coupling, and mixing.
  • Wasserstein is natural for transport, generative modeling, and robustness with geometric structure.
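
To make the dichotomy concrete, here is a toy sketch with two point masses a distance $\epsilon$ apart (Wasserstein-1 on the line is the area between the two CDFs, which here is just $\epsilon$; the setup is invented for illustration):

```python
# Toy case: P = point mass at 0, Q = point mass at eps, for small eps.
eps = 1e-3
atoms = [0.0, eps]
p = [1.0, 0.0]
q = [0.0, 1.0]

# TV: the supports are disjoint, so no mass overlaps.
tv = 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Wasserstein-1 on the line: the CDFs differ only on [0, eps), so the area is eps.
w1 = eps

print(tv, w1)  # 1.0 vs 0.001: TV is maximal while Wasserstein is tiny
```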

Common Confusions

Watch Out

TV is a metric, but it is not geometric

TV satisfies symmetry and the triangle inequality, so it is a genuine metric. But it does not know about ambient distances between outcomes. It only sees how much mass fails to overlap.

Watch Out

The factor 1/2 is convention, not substance

Some ML papers define TV as $\|P-Q\|_1$ without the $\frac12$. Probability texts almost always include the $\frac12$, giving range $[0,1]$. The two conventions differ by exactly a factor of two. Always check which one a paper uses before comparing constants.

Summary

  • $\mathrm{TV}(P,Q)=\sup_A |P(A)-Q(A)|$: the adversarial event gap
  • with densities, $\mathrm{TV}(P,Q)=\frac12\int |p-q|\,dx = 1-\int \min(p,q)\,dx$
  • TV is the smallest disagreement probability over all couplings; the maximal coupling attaining it is explicitly constructible
  • TV is sensitive to support mismatch but blind to geometry; Wasserstein is the geometric counterpart
  • Pinsker links TV to KL: $\mathrm{TV}\le \sqrt{D_{\mathrm{KL}}/2}$, tight as $\mathrm{TV} \to 0$, loose for moderate-to-large $\mathrm{TV}$
  • For $\mathcal{N}(0,1)$ vs $\mathcal{N}(\mu, 1)$: $\mathrm{TV} = 2\Phi(\mu/2) - 1$
  • Le Cam's two-point method: minimax test error $= 1 - \mathrm{TV}(P_0, P_1)$
  • TV product-tensorization: $\mathrm{TV}(P^n, Q^n) \le n \cdot \mathrm{TV}(P, Q)$ (linear, useless once $n \cdot \mathrm{TV} \ge 1$); Bretagnolle-Huber gives exponential control instead

Exercises

ExerciseCore

Problem

Let $P=\mathrm{Bernoulli}(0.7)$ and $Q=\mathrm{Bernoulli}(0.4)$. Compute $\mathrm{TV}(P,Q)$ using both the event-gap definition and the $\frac12\|p-q\|_1$ formula.

ExerciseCore

Problem

Verify Pinsker's inequality for $P = \mathrm{Bernoulli}(0.5)$ and $Q = \mathrm{Bernoulli}(0.5 + \epsilon)$ for small $\epsilon > 0$. Compute $\mathrm{TV}(P, Q)$ exactly and compare to $\sqrt{D_{\mathrm{KL}}(P \| Q)/2}$.

ExerciseAdvanced

Problem

Let $P^n$ and $Q^n$ be the $n$-fold product measures of $P$ and $Q$. Show that $\mathrm{TV}(P^n, Q^n) \le n \cdot \mathrm{TV}(P, Q)$. Explain why this bound, while linear in $n$, becomes useless as soon as $n \cdot \mathrm{TV}(P, Q) \ge 1$.

References

Canonical:

  • Levin, D. A., Peres, Y., & Wilmer, E. L. (2009). Markov Chains and Mixing Times. AMS. Chapter 4 is the standard reference for TV in mixing-time arguments and the coupling construction.
  • Villani, C. (2009). Optimal Transport: Old and New. Springer. Chapter 6 covers the contrast between TV and Wasserstein and explains why the geometric/non-geometric distinction matters.
  • Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer. The original treatment of the two-point method and TV's role in minimax theory.
  • Lindvall, T. (1992). Lectures on the Coupling Method. Wiley. The classical treatment of coupling theory; chapters 1-2 cover TV and the maximal coupling.

Current / standard texts:

  • Durrett, R. (2019). Probability: Theory and Examples (5th ed.). Cambridge University Press. Sections on coupling and total variation in the discrete-time chain chapters.
  • van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press. Appendix and Chapter 7 cover TV-KL relations in the M-estimation context.
  • Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. Chapter 3 covers TV, KL, and chi-squared divergences in the modern non-asymptotic setting.
  • Polyanskiy, Y., & Wu, Y. (2025). Information Theory: From Coding to Learning. Cambridge University Press. Chapter 7 covers $f$-divergences (TV, KL, chi-squared) with refined Pinsker-style inequalities and a unified Le Cam-type framework.

Refined inequalities:

  • Bretagnolle, J., & Huber, C. (1979). Estimation des densités: risque minimax. Zeitschrift für Wahrscheinlichkeitstheorie, 47(2), 119-137. The Bretagnolle-Huber inequality $\mathrm{TV}(P, Q)^2 \le 1 - e^{-D_{\mathrm{KL}}(P \| Q)}$, which is tighter than Pinsker for large KL.

Last reviewed: May 4, 2026

3