Mathematical Infrastructure
Distance Metrics Compared
Side-by-side reference for the distance and divergence functions that appear in ML: Lp norms, cosine, Mahalanobis, Hamming, Jaccard, edit distance, KL, Jensen-Shannon, Wasserstein, MMD. When each is appropriate and what assumptions it implicitly makes.
Why This Matters
Picking the wrong distance silently bakes a wrong assumption into the model. Cosine similarity discards magnitude, which is correct for normalized text embeddings and wrong for raw counts. Euclidean distance assumes isotropy, which is wrong for features on different scales. KL divergence is asymmetric and infinite when supports disagree, which is fatal for sample-based GAN training. Wasserstein fixes the support problem but requires solving an optimal-transport program. Each choice has a regime where it is the right tool and a regime where it ruins the experiment.
This page is a comparison reference, not a derivation. For mathematical foundations see metric-spaces-convergence-completeness, wasserstein-distances, kl-divergence, and non-euclidean-and-hyperbolic-geometry.
Vector-on-Vector
| Name | Formula | Metric? | Use it for | Avoid when |
|---|---|---|---|---|
| Euclidean ($\ell_2$) | $\sqrt{\sum_i (x_i - y_i)^2}$ | yes | Geometry-respecting features, after standardization | Features on heterogeneous scales |
| Manhattan ($\ell_1$) | $\sum_i \lvert x_i - y_i \rvert$ | yes | Sparse features, robust regression, Lasso geometry | Smooth-gradient optimization |
| Chebyshev ($\ell_\infty$) | $\max_i \lvert x_i - y_i \rvert$ | yes | Worst-case error, grid-step distance | Average-case behaviour matters |
| Mahalanobis | $\sqrt{(x - y)^\top \Sigma^{-1} (x - y)}$ | yes if $\Sigma \succ 0$ | Correlated features with known covariance | $\Sigma$ ill-conditioned or unknown |
| Cosine distance | $1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$ | not a metric | L2-normalized embeddings, text similarity | Magnitude carries signal |
| Angular distance | $\frac{1}{\pi} \arccos \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$ | yes | A true metric on the sphere | Comparing magnitudes |
| Minkowski-$p$ | $\left( \sum_i \lvert x_i - y_i \rvert^p \right)^{1/p}$ | yes for $p \ge 1$ | Tuning robustness vs smoothness | $p < 1$ (not a metric) |
Notes:
- $\ell_p$ for $p < 1$ is not a metric (the triangle inequality fails).
- Cosine "distance" violates the triangle inequality; use angular distance when you need a true metric.
- Mahalanobis equals Euclidean after whitening by $\Sigma^{-1/2}$. The two are the same operation in different bases.
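The whitening identity above can be checked numerically. This is a minimal sketch with an assumed diagonal covariance $\Sigma = \mathrm{diag}(4, 1)$ and made-up points; for a diagonal $\Sigma$, whitening is just dividing each coordinate by its standard deviation.

```python
import math

def mahalanobis_diag(x, y, var):
    """sqrt((x-y)^T Sigma^{-1} (x-y)) for a diagonal Sigma with variances `var`."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, var)))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

var = [4.0, 1.0]                      # assumed feature variances
x, y = [2.0, 1.0], [0.0, 0.0]         # assumed example points

d_maha = mahalanobis_diag(x, y, var)  # sqrt(4/4 + 1/1) = sqrt(2)

# Whitening: rescale each coordinate by 1/sqrt(Sigma_ii), then use plain Euclidean.
xw = [a / math.sqrt(v) for a, v in zip(x, var)]
yw = [a / math.sqrt(v) for a, v in zip(y, var)]
d_eucl = euclidean(xw, yw)            # same value: sqrt(2)

print(d_maha, d_eucl)
```

The same identity holds for a full covariance with $\Sigma^{-1/2}$ as the whitening matrix; the diagonal case just keeps the sketch dependency-free.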
Set, Sequence, and String
| Name | Formula | Metric? | Use it for |
|---|---|---|---|
| Hamming | $\sum_i \mathbf{1}[x_i \neq y_i]$ | yes | Equal-length binary or categorical strings |
| Jaccard | $1 - \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}$ | yes | Set similarity, document shingles |
| Tanimoto (binary fingerprints) | same as Jaccard on bit sets | yes | Cheminformatics fingerprint matching |
| Levenshtein (edit) | min insert+delete+substitute | yes | Sequence comparison: spell-check, DNA |
| Longest common subsequence | $\lvert x \rvert + \lvert y \rvert - 2\,\lvert \mathrm{LCS}(x, y) \rvert$ | yes | Diff tools, weak edit alignment |
| Tree edit distance | min tree-rewriting operations | yes | Parse tree comparison, structured data |
The string and tree metrics all satisfy the triangle inequality but are not embeddable into any low-dimensional Euclidean space without distortion. For high-throughput nearest-neighbour search on these, locality-sensitive hashing or MinHash are the standard tools.
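For reference, a minimal dynamic-programming implementation of Levenshtein distance with unit costs for insert, delete, and substitute, as in the table above:

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance between s and t with unit insert/delete/substitute costs."""
    prev = list(range(len(t) + 1))          # row 0: distance from "" to t[:j]
    for i, cs in enumerate(s, 1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # delete from s
                            curr[j - 1] + 1,             # insert into s
                            prev[j - 1] + (cs != ct)))   # substitute (free if equal)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

The rolling single-row variant shown here runs in $O(\lvert s \rvert \cdot \lvert t \rvert)$ time and $O(\lvert t \rvert)$ memory, which is the standard trade-off for long sequences.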
Distribution-on-Distribution
| Name | Formula | Symmetric? | Triangle? | Use it for | Trap |
|---|---|---|---|---|---|
| KL divergence | $D_{\mathrm{KL}}(p \Vert q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$ | no | no | MLE, variational inference, info-theoretic bounds | Infinite if $\operatorname{supp}(p) \not\subseteq \operatorname{supp}(q)$ |
| Reverse KL | $D_{\mathrm{KL}}(q \Vert p)$ | no | no | Variational lower bounds, mode-seeking objectives | Different mode behaviour than forward KL |
| Jensen-Shannon | $\tfrac{1}{2} D_{\mathrm{KL}}(p \Vert m) + \tfrac{1}{2} D_{\mathrm{KL}}(q \Vert m)$, $m = \tfrac{1}{2}(p + q)$ | yes | only $\sqrt{\mathrm{JS}}$ is a metric | Symmetric KL surrogate, GAN-style losses | Bounded by $\log 2$: saturates for disjoint supports, gives no gradient |
| Total variation | $\sup_A \lvert P(A) - Q(A) \rvert$ | yes | yes | Worst-case event-probability gap | Often very loose vs Wasserstein |
| Wasserstein-$p$ | $\left( \inf_{\gamma \in \Pi(P, Q)} \mathbb{E}_{(x, y) \sim \gamma}\, d(x, y)^p \right)^{1/p}$ | yes | yes | Distribution alignment with disjoint supports, OT, GAN training | Exact solvers scale super-quadratically; use Sinkhorn for scale |
| Maximum Mean Discrepancy (MMD) | $\lVert \mu_P - \mu_Q \rVert_{\mathcal{H}}$ (kernel mean embeddings) | yes | yes | Kernel-based two-sample tests, MMD GANs | Metric only if the kernel is characteristic (e.g. Gaussian); kernel and bandwidth choice matter a lot |
| Hellinger | $\frac{1}{\sqrt{2}} \lVert \sqrt{p} - \sqrt{q} \rVert_2$ | yes | yes | Robust two-sample comparison, L2 on square-root densities | Less common in ML papers |
| Rényi-$\alpha$ | $\frac{1}{\alpha - 1} \log \sum_x p(x)^\alpha q(x)^{1 - \alpha}$ | no in general | no | Privacy accounting (differential privacy), info bounds | Requires shared support |
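The MMD entry in the table has a simple closed form from samples alone. Below is a sketch of the unbiased estimator of $\mathrm{MMD}^2$ from Gretton et al. (2012) with a Gaussian kernel on scalars; the bandwidth and the toy samples are assumptions for illustration.

```python
import math
import random

def gauss_k(a, b, bw=1.0):
    """Gaussian (characteristic) kernel; bandwidth bw is an arbitrary choice here."""
    return math.exp(-((a - b) ** 2) / (2 * bw ** 2))

def mmd2_unbiased(xs, ys, bw=1.0):
    """Unbiased MMD^2: off-diagonal within-sample kernel means minus cross term."""
    m, n = len(xs), len(ys)
    kxx = sum(gauss_k(xs[i], xs[j], bw) for i in range(m) for j in range(m) if i != j)
    kyy = sum(gauss_k(ys[i], ys[j], bw) for i in range(n) for j in range(n) if i != j)
    kxy = sum(gauss_k(x, y, bw) for x in xs for y in ys)
    return kxx / (m * (m - 1)) + kyy / (n * (n - 1)) - 2.0 * kxy / (m * n)

random.seed(0)
same      = [random.gauss(0, 1) for _ in range(200)]   # assumed toy data
also_same = [random.gauss(0, 1) for _ in range(200)]
shifted   = [random.gauss(3, 1) for _ in range(200)]

print(mmd2_unbiased(same, also_same))  # near 0: same distribution
print(mmd2_unbiased(same, shifted))    # clearly positive: distributions differ
```

Note the estimator can be slightly negative under the null (it is unbiased, not nonnegative); two-sample tests threshold it against a permutation distribution rather than zero.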
KL is the dominant choice when you have parametric models and access to log-densities; Wasserstein dominates when distributions live on different parts of the space (real vs generated samples in early GAN training); MMD dominates when you only have samples and want a closed-form unbiased estimator with kernels.
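A minimal numerical illustration of the disjoint-support regime, using an assumed two-point sample space: KL is infinite, JS saturates at $\log 2$, and $W_1$ equals the (small) distance between the supports.

```python
import math

# Assumed toy setup: P puts all mass on 0.0, Q puts all mass on 0.1,
# so the supports are disjoint but geometrically close.
points = [0.0, 0.1]
p = [1.0, 0.0]
q = [0.0, 1.0]

def kl(p, q):
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf        # support mismatch: KL blows up
            total += pi * math.log(pi / qi)
    return total

def js(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1_1d(points, p, q):
    """W1 on the line = integral of |CDF_P - CDF_Q| over the gaps between points."""
    total, cp, cq = 0.0, 0.0, 0.0
    for (a, b), pi, qi in zip(zip(points, points[1:]), p, q):
        cp += pi
        cq += qi
        total += abs(cp - cq) * (b - a)
    return total

print(kl(p, q))             # inf
print(js(p, q))             # log 2: saturated, no gradient signal
print(w1_1d(points, p, q))  # 0.1: the actual geometric gap
```

Shrinking the gap from 0.1 toward 0 changes $W_1$ continuously while KL and JS stay pinned at $\infty$ and $\log 2$, which is the stability argument in one picture.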
Worked Example: Same Vectors, Different Verdicts
Let $u = (1, 1)$ and $v = (10, 10)$. Then:
- Euclidean: $\lVert u - v \rVert_2 = 9\sqrt{2} \approx 12.73$.
- Manhattan: $\lVert u - v \rVert_1 = 18$.
- Cosine distance: $1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = 0$.
By cosine the vectors are identical (same direction); by Euclidean they are far apart. Both are correct given their assumptions: cosine treats $u$ and $10u$ as the same "concept" at different magnitudes; Euclidean treats them as different positions. The choice encodes whether magnitude carries signal.
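The contrast can be checked numerically. The vectors here, $u = (1, 1)$ and $v = (10, 10)$, are assumed for illustration; any pair where one is a positive scalar multiple of the other shows the same split verdict.

```python
import math

u, v = (1.0, 1.0), (10.0, 10.0)       # assumed: v = 10 * u

euclidean = math.dist(u, v)           # 9 * sqrt(2), about 12.73
manhattan = sum(abs(a - b) for a, b in zip(u, v))  # 18
cos_sim = (sum(a * b for a, b in zip(u, v))
           / (math.hypot(*u) * math.hypot(*v)))
cosine_dist = 1.0 - cos_sim           # ~0: identical direction

print(euclidean, manhattan, cosine_dist)
```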
Common Confusions
KL is not a distance
$D_{\mathrm{KL}}(p \Vert q) \neq D_{\mathrm{KL}}(q \Vert p)$ in general, and neither direction satisfies the triangle inequality. Forward KL ($D_{\mathrm{KL}}(p \Vert q_\theta)$, with $p$ the data and $q_\theta$ the model) is mode-covering; reverse KL is mode-seeking. Using "KL distance" colloquially is fine; using it as a metric in proofs is not.
Cosine similarity is not invariant to centering
Cosine treats $x$ and $cx$ ($c > 0$) as identical, but treats $x$ and $x + c\mathbf{1}$ as different. Many practitioners "center then cosine," which is equivalent to Pearson correlation, not cosine. The two have different invariances and pick different nearest neighbours.
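The centering point can be demonstrated directly; the data below is made up. Pearson correlation is the cosine of the centered vectors, so it is invariant to adding a constant, while raw cosine is not.

```python
import math

def cosine_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

def pearson(x, y):
    """Pearson r = cosine similarity of the mean-centered vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine_sim([a - mx for a in x], [b - my for b in y])

x = [1.0, 2.0, 3.0, 5.0]          # assumed toy data
y = [2.0, 2.5, 4.0, 4.5]
y_shift = [b + 100.0 for b in y]  # same signal, shifted baseline

print(cosine_sim(x, y), cosine_sim(x, y_shift))  # raw cosine changes
print(pearson(x, y), pearson(x, y_shift))        # Pearson does not
```

A nearest-neighbour search under raw cosine and one under Pearson can therefore return different neighbours whenever features carry nonzero baselines.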
Wasserstein is not just a smoothed KL
Wasserstein incorporates the geometry of the underlying space (it depends on the ground cost $d(x, y)$); KL does not. Two distributions that are identical up to a translation have $D_{\mathrm{KL}} = \infty$ if the translation moves them off each other's support, but $W_1$ proportional to the translation distance. This is exactly why WGAN trains stably where a vanilla GAN does not.
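A sample-based sketch of the translation behaviour, with assumed toy samples: in 1-D, the empirical $W_1$ between two equal-size samples is the mean gap between their sorted values, and for a pure translation it recovers the shift exactly.

```python
import random

def w1_samples(xs, ys):
    """Empirical 1-D W1 for equal-size samples: mean |sorted gap| (optimal pairing)."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

random.seed(1)
base = [random.gauss(0, 0.01) for _ in range(1000)]  # narrow assumed distribution

for t in [0.5, 1.0, 2.0]:
    shifted = [x + t for x in base]   # same shape, translated by t
    # W1 tracks t linearly; KL between these near-disjoint samples would be infinite.
    print(t, w1_samples(base, shifted))
```

Sorting gives the optimal coupling in 1-D, which is why this estimator is exact; in higher dimensions one needs an OT solver or the Sinkhorn approximation.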
Exercises
Problem
Verify that cosine "distance" violates the triangle inequality by exhibiting three vectors $a, b, c$ on the unit circle in $\mathbb{R}^2$ with $d_{\cos}(a, c) > d_{\cos}(a, b) + d_{\cos}(b, c)$.
Problem
Construct two probability distributions on $\mathbb{R}$ with disjoint supports such that $D_{\mathrm{KL}}(P \Vert Q) = \infty$ but $W_1(P, Q)$ is arbitrarily small. Compute $W_1(P, Q)$ explicitly.
References
- Michel Marie Deza, Elena Deza. Encyclopedia of Distances (4th ed.). Springer, 2016. Comprehensive catalog of distance functions across mathematics, physics, computer science, and biology. The reference for this page.
- Thomas M. Cover, Joy A. Thomas. Elements of Information Theory (2nd ed.). Wiley-Interscience, 2006. Chapters 2 and 4: KL divergence, Jensen-Shannon, total variation, Pinsker's inequality.
- Cédric Villani. Optimal Transport: Old and New. Springer, 2008. Chapters 1-6: Wasserstein distances, the dual Kantorovich formulation, convergence properties.
- Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander Smola. A Kernel Two-Sample Test. JMLR 13, 2012. The foundational reference for Maximum Mean Discrepancy and its statistical properties.
- Martin Arjovsky, Soumith Chintala, Léon Bottou. Wasserstein GAN. ICML 2017. Section 2 contains the canonical comparison of TV, KL, JS, Wasserstein on training stability. arXiv:1701.07875
- Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning (2nd ed.). Springer, 2009. Section 14.3: distance choices in clustering, including the practical impact of standardization and Mahalanobis weighting.
Related Topics
- Metric Spaces, Convergence, and Completeness
- Non-Euclidean and Hyperbolic Geometry
- Gram Matrices and Kernel Matrices
- KL Divergence
- Wasserstein Distances
Last reviewed: April 26, 2026