
Common Inequalities

The algebraic and probabilistic inequalities that appear everywhere in ML theory: Cauchy-Schwarz, Jensen, AM-GM, Holder, Minkowski, Young, Markov, and Chebyshev.

Core · Tier 1 · Stable · Supporting · ~50 min

Why This Matters

Inequalities are the grammar of mathematical proofs. In ML theory, you cannot write a single generalization bound, convergence proof, or complexity argument without reaching for at least one inequality from this page. Cauchy-Schwarz bounds inner products. Jensen relates a function of an expectation to the expectation of the function. Markov and Chebyshev give tail bounds.

Two notations recur: \mathbb{E}[\cdot] for expectation and \leq as the verb of every line.

[Infographic] The inequality toolbox: what each bound says and when to reach for it. Each inequality (Cauchy-Schwarz, Jensen, Markov, Chebyshev, Hoeffding, Bernstein, AM-GM, triangle inequality) gets a one-line statement and its most common application context (concentration, expectation manipulation, geometric bounds).

These are not abstract exercises. They are tools you will use on every page of this site. Know them so well that applying them is automatic.

Mental Model

The inequalities on this page fall into three groups:

  1. Inner product / norm inequalities: Cauchy-Schwarz, Hölder, Minkowski. These relate different norms and inner products. Used to bound terms that involve products or sums.

  2. Convexity inequalities: Jensen, AM-GM, Young. These exploit the curvature of convex (or concave) functions. Jensen is the most important single inequality in ML theory.

  3. Probabilistic inequalities: Markov, Chebyshev. These bound tail probabilities using moment information. They are the foundation for all concentration inequalities.

Inner Product and Norm Inequalities

Theorem

Cauchy-Schwarz Inequality

Statement

For any vectors u, v in an inner product space:

|\langle u, v \rangle| \leq \|u\| \cdot \|v\|

Intuition

The inner product of two vectors cannot exceed the product of their lengths. This generalizes the fact that \cos\theta \leq 1.

Proof Sketch

For any t \in \mathbb{R}, \|u + tv\|^2 \geq 0. Expanding:

\|u\|^2 + 2t\langle u, v\rangle + t^2\|v\|^2 \geq 0

This is a non-negative quadratic in t, so its discriminant is non-positive:

4\langle u, v\rangle^2 - 4\|u\|^2\|v\|^2 \leq 0

which gives |\langle u, v\rangle| \leq \|u\| \cdot \|v\|.

Why It Matters

Cauchy-Schwarz is the most frequently used inequality in all of mathematics. In ML theory, it appears when: bounding inner products in kernel methods, analyzing gradient descent geometry, studying Rademacher complexity, and controlling cross-terms in bias-variance decompositions.

Failure Mode

Cauchy-Schwarz gives the tightest possible bound for inner products in terms of norms alone, but it can still be loose if u and v are nearly orthogonal (the bound gives \|u\|\|v\| while the true value is near zero). When you have structural information about the angle between u and v, tighter bounds are possible.

Common equivalent forms include:

  • Equality case: equality occurs exactly when the two vectors are linearly dependent.
  • For random variables: |\mathbb{E}[XY]| \leq \sqrt{\mathbb{E}[X^2]\,\mathbb{E}[Y^2]} when the second moments exist.
  • For finite sums: \left(\sum_i a_i b_i\right)^2 \leq \left(\sum_i a_i^2\right)\left(\sum_i b_i^2\right).
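
To make this concrete, here is a minimal numerical sketch (Python with NumPy; the vectors and seed are arbitrary) checking both the bound and the linear-dependence equality case:

```python
# Minimal numerical check of Cauchy-Schwarz; vectors and seed are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=5)
v = rng.normal(size=5)

lhs = abs(u @ v)                              # |<u, v>|
rhs = np.linalg.norm(u) * np.linalg.norm(v)   # ||u|| ||v||
assert lhs <= rhs + 1e-12

# Equality case: linearly dependent vectors saturate the bound.
w = 3.0 * u
print(abs(u @ w), np.linalg.norm(u) * np.linalg.norm(w))  # equal up to rounding
```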
Definition

Hölder Inequality

For p, q \geq 1 with \frac{1}{p} + \frac{1}{q} = 1 (conjugate exponents):

\sum_i |a_i b_i| \leq \left(\sum_i |a_i|^p\right)^{1/p} \left(\sum_i |b_i|^q\right)^{1/q}

Or in integral form: \int |fg| \leq \|f\|_p \|g\|_q.

Special case: p = q = 2 gives Cauchy-Schwarz.

When it arises: Bounding mixed-norm expressions, dual norm computations (\|x\|_p^* = \|x\|_q), and proving interpolation inequalities. The duality between \ell_p and \ell_q norms underpins convex analysis.
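
As a sanity check, the sketch below (illustrative Python; the conjugate pair p = 3, q = 3/2 is an arbitrary choice) verifies Hölder on random vectors and shows the dual-norm identity it implies: the vector proportional to \mathrm{sign}(a)|a|^{p-1}, normalized to unit \ell_q norm, attains exactly \|a\|_p.

```python
# Sketch: Hölder for an arbitrary conjugate pair, plus the dual-norm identity.
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.normal(size=8), rng.normal(size=8)
p, q = 3.0, 1.5                      # 1/3 + 1/1.5 = 1: conjugate exponents

lhs = np.sum(np.abs(a * b))
rhs = np.sum(np.abs(a)**p)**(1/p) * np.sum(np.abs(b)**q)**(1/q)
assert lhs <= rhs + 1e-12

# Dual norm: the maximizer of <a, b> over the unit l_q ball is proportional
# to sign(a)|a|^(p-1), and it attains exactly ||a||_p.
b_star = np.sign(a) * np.abs(a)**(p - 1)
b_star /= np.sum(np.abs(b_star)**q)**(1/q)      # normalize: ||b_star||_q = 1
print(a @ b_star, np.sum(np.abs(a)**p)**(1/p))  # the two values coincide
```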

Definition

Minkowski Inequality (Triangle Inequality for \ell_p)

For p \geq 1:

\left(\sum_i |a_i + b_i|^p\right)^{1/p} \leq \left(\sum_i |a_i|^p\right)^{1/p} + \left(\sum_i |b_i|^p\right)^{1/p}

In short: \|a + b\|_p \leq \|a\|_p + \|b\|_p.

This is the triangle inequality for \ell_p norms, proven using Hölder. It confirms that \|\cdot\|_p is a valid norm for p \geq 1. (For p < 1, the triangle inequality fails and \|\cdot\|_p is not a norm.)
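
The failure for p < 1 is easy to see numerically. A minimal counterexample (illustrative Python) with two standard basis vectors and p = 1/2:

```python
# For p < 1 the triangle inequality fails, so ||.||_p is not a norm.
import numpy as np

def lp(x, p):
    return np.sum(np.abs(x)**p)**(1/p)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(lp(a + b, 0.5), lp(a, 0.5) + lp(b, 0.5))  # 4.0 > 2.0: inequality fails
```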

Convexity Inequalities

Theorem

Jensen Inequality

Statement

If f is a convex function and X is a random variable:

f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]

If f is concave, the inequality reverses: f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)].

Finite form: For weights w_i \geq 0 with \sum_i w_i = 1:

f\!\left(\sum_i w_i x_i\right) \leq \sum_i w_i f(x_i)

Intuition

A convex function "bows upward." The value of ff at the average point is at most the average of ff at the individual points. Geometrically: the chord connecting two points on a convex curve lies above the curve.

Proof Sketch

The cleanest sketch uses a subgradient. If f is a proper convex function and \mathbb{E}[X] lies in the relative interior of the domain of f, then f admits a subgradient a at \mathbb{E}[X], i.e. f(x) \geq f(\mathbb{E}[X]) + a^\top (x - \mathbb{E}[X]) for all x. Taking expectations:

\mathbb{E}[f(X)] \geq f(\mathbb{E}[X]) + a^\top \mathbb{E}[X - \mathbb{E}[X]] = f(\mathbb{E}[X]).

For convex f on \mathbb{R}, a subgradient (left/right derivative) always exists at interior points, so this argument covers the standard real-valued case. For boundary points, infinite-dimensional settings, or extended-real-valued f, one typically falls back on the finite convex-combination version f(\sum_i w_i x_i) \leq \sum_i w_i f(x_i) with \sum_i w_i = 1, w_i \geq 0 (provable by induction on the number of points), and then extends to expectations by an approximation/measure-theoretic argument.

Why It Matters

Jensen is arguably the single most important inequality in ML theory. Applications include:

  • EM algorithm: The ELBO is a lower bound on the log-likelihood because \log is concave: \log \mathbb{E}[X] \geq \mathbb{E}[\log X].
  • KL divergence is non-negative: follows from Jensen applied to -\log.
  • Convex loss functions: f(\mathbb{E}[\hat{w}]) \leq \mathbb{E}[f(\hat{w})] relates the risk of the average model to the average risk.
  • Information theory: entropy bounds, data processing inequality.

Failure Mode

Jensen gives a direction of inequality but not its magnitude. The gap \mathbb{E}[f(X)] - f(\mathbb{E}[X]) can be anywhere from zero (when X is constant or f is linear) to very large (when X has high variance and f is strongly convex). Tighter bounds require knowledge of the variance of X and the curvature of f.
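
One case where the gap is exactly computable: for f(x) = x^2, the Jensen gap is \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \mathrm{Var}(X). A quick Monte Carlo check (illustrative Python; the distribution and sample size are arbitrary):

```python
# For f(x) = x^2 the Jensen gap E[f(X)] - f(E[X]) equals Var(X) exactly,
# so it grows with the variance of X, as the text describes.
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: x**2
for sigma in [0.1, 1.0, 3.0]:
    x = rng.normal(loc=1.0, scale=sigma, size=200_000)
    gap = f(x).mean() - f(x.mean())
    print(f"sigma={sigma}: empirical gap {gap:.3f} vs Var(X) = {sigma**2:.2f}")
```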

Definition

AM-GM Inequality (Arithmetic Mean - Geometric Mean)

For non-negative reals a_1, \ldots, a_n:

\frac{a_1 + a_2 + \cdots + a_n}{n} \geq (a_1 a_2 \cdots a_n)^{1/n}

Equality holds if and only if all a_i are equal.

Two-variable form: \frac{a + b}{2} \geq \sqrt{ab}, or equivalently ab \leq \frac{(a + b)^2}{4}.

Proof from Jensen: Apply Jensen to the concave function f(x) = \log x: \log\!\left(\frac{1}{n}\sum a_i\right) \geq \frac{1}{n}\sum \log a_i = \log\!\left(\prod a_i\right)^{1/n}.

When it arises: Bounding products by sums (which are easier to handle), optimizing expressions involving products, proving convergence rates.
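
A two-line numerical check (illustrative values) of the bound and its equality case:

```python
# AM-GM on arbitrary non-negative values, plus the all-equal equality case.
import numpy as np

a = np.array([1.0, 4.0, 9.0])
print(a.mean(), np.prod(a)**(1/len(a)))     # 4.667 >= 3.302

b = np.full(3, 5.0)
print(b.mean(), np.prod(b)**(1/len(b)))     # 5.0 == 5.0: equality iff all equal
```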

Definition

Young Inequality

For a, b \geq 0 and conjugate exponents p, q > 1 with \frac{1}{p} + \frac{1}{q} = 1:

ab \leq \frac{a^p}{p} + \frac{b^q}{q}

Special case (p = q = 2): ab \leq \frac{a^2}{2} + \frac{b^2}{2}, which is AM-GM for squares.

Parametric form: For any \epsilon > 0: ab \leq \frac{\epsilon a^2}{2} + \frac{b^2}{2\epsilon}. This is extremely useful in proofs where you need to "split" a product into two terms with different scaling.

When it arises: Splitting cross-terms in energy estimates, proving Hölder from Young, absorbing error terms in convergence proofs. The parametric form lets you trade off how much of the bound goes into each term.
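
The trade-off is visible numerically: every \epsilon > 0 gives a valid bound, and minimizing the right-hand side over \epsilon (at \epsilon = b/a) recovers ab exactly. A sketch with arbitrary values:

```python
# Parametric Young: ab <= eps*a^2/2 + b^2/(2*eps) for every eps > 0,
# with the minimum over eps attained at eps = b/a, where the bound equals ab.
a, b = 2.0, 5.0
for eps in (0.1, 1.0, b / a, 10.0):
    bound = eps * a**2 / 2 + b**2 / (2 * eps)
    print(f"eps={eps:.2f}: bound {bound:.2f} >= ab = {a * b:.2f}")
```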

Probabilistic Inequalities

Definition

Markov Inequality

If X \geq 0, then for any t > 0:

\mathbb{P}(X \geq t) \leq \frac{\mathbb{E}[X]}{t}

Proof: \mathbb{E}[X] \geq \mathbb{E}[X \cdot \mathbf{1}_{X \geq t}] \geq t \cdot \mathbb{P}(X \geq t).

When it arises: Markov is the foundation of all tail bounds. Chebyshev applies Markov to (X - \mu)^2. The Chernoff method applies Markov to e^{\lambda X}. Every concentration inequality starts with Markov.

Limitation: The bound decays as 1/t, far too slowly for most applications. You almost always want Hoeffding or Bernstein instead.
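
To see how loose 1/t is, compare it against an exponential random variable, whose true tail decays exponentially (a minimal check; the distribution is illustrative):

```python
# X ~ Exp(1) has E[X] = 1 and true tail P(X >= t) = e^{-t};
# Markov only guarantees 1/t, which is exponentially looser.
import numpy as np

for t in (2.0, 5.0, 10.0):
    print(f"t={t}: true tail {np.exp(-t):.2e}, Markov bound {1/t:.2e}")
```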

Definition

Chebyshev Inequality

For any random variable X with mean \mu and finite variance \sigma^2:

\mathbb{P}(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2}

Proof: Apply Markov to (X - \mu)^2: \mathbb{P}(|X - \mu| \geq t) = \mathbb{P}((X - \mu)^2 \geq t^2) \leq \sigma^2/t^2.

When it arises: Proving weak laws of large numbers, giving initial sample complexity bounds (n = O(1/(\epsilon^2 \delta)) to estimate a mean within \epsilon at confidence 1 - \delta). The polynomial tail 1/t^2 is worse than the exponential tails from Hoeffding, but Chebyshev requires only finite variance, not boundedness.
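
The sample complexity claim is easy to check empirically. A sketch (illustrative Python; the Gaussian and the trial count are arbitrary choices) estimating a mean to within \epsilon with failure probability at most \delta:

```python
# Chebyshev on the sample mean: Var(X_bar) = sigma^2/n, so
# P(|X_bar - mu| >= eps) <= sigma^2/(n eps^2), which is <= delta
# once n >= sigma^2/(eps^2 delta).
import numpy as np

rng = np.random.default_rng(3)
sigma, eps, delta = 1.0, 0.1, 0.05
n = int(np.ceil(sigma**2 / (eps**2 * delta)))          # n = 2000 here
fails = [abs(rng.normal(0.0, sigma, n).mean()) >= eps for _ in range(2000)]
print(f"n={n}: empirical failure rate {np.mean(fails):.4f} (bound: {delta})")
```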

Applications in ML Theory

Here is how each inequality appears in practice:

Inequality | Typical ML Use
Cauchy-Schwarz | Bounding inner products in kernel methods, RKHS norms
Hölder | Dual norm computations, mixed-norm regularization
Jensen (\log concave) | EM algorithm ELBO, KL divergence non-negativity
Jensen (convex loss) | Relating ensemble risk to average individual risk
AM-GM | Bounding products in convergence rate proofs
Young (parametric) | Splitting cross-terms with tunable \epsilon
Markov | Starting point for all exponential tail bounds
Chebyshev | Quick variance-based concentration when boundedness is unavailable

Canonical Examples

Example

Jensen in the EM algorithm

The EM algorithm maximizes a lower bound on the log-likelihood. For any distribution q(z) over latent variables:

\log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta) = \log \sum_z q(z) \frac{p(x, z \mid \theta)}{q(z)} \geq \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)}

The inequality is Jensen applied to the concave function \log: \log(\mathbb{E}_q[f(Z)]) \geq \mathbb{E}_q[\log f(Z)], here with f(z) = p(x, z \mid \theta)/q(z). The right-hand side is the ELBO (Evidence Lower BOund).
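
A toy numerical check (illustrative Python; the two-component mixture values are made up) that the ELBO really sits below the exact log-likelihood for an arbitrary q:

```python
# ELBO <= log p(x) on a toy 2-component latent variable model.
import numpy as np

pz = np.array([0.3, 0.7])            # prior p(z)
px_given_z = np.array([0.05, 0.2])   # likelihood p(x|z) at one observation x
pxz = pz * px_given_z                # joint p(x, z)

log_px = np.log(pxz.sum())           # exact log-likelihood

q = np.array([0.5, 0.5])             # any q(z) gives a valid lower bound
elbo = np.sum(q * np.log(pxz / q))
print(log_px, elbo)                  # elbo <= log_px; equality iff q = p(z|x)
```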

Example

Cauchy-Schwarz in bounding gradient norms

In gradient descent analysis, you often need to bound \langle \nabla f(x), x - x^* \rangle. By Cauchy-Schwarz:

\langle \nabla f(x), x - x^* \rangle \leq \|\nabla f(x)\| \cdot \|x - x^*\|

This converts an inner product (hard to control) into a product of norms (each of which can be bounded separately).
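
As a concrete instance, take the hypothetical toy objective f(x) = \frac{1}{2}\|x\|^2, where \nabla f(x) = x and x^* = 0. Gradient and displacement are then parallel, so Cauchy-Schwarz holds with equality:

```python
# Toy check on f(x) = ||x||^2 / 2: grad f(x) = x and x* = 0, so the
# inner product <grad f(x), x - x*> saturates the Cauchy-Schwarz bound.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=4)
grad, diff = x, x - 0.0
print(grad @ diff, np.linalg.norm(grad) * np.linalg.norm(diff))  # equal
```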

Example

Young inequality to split cross-terms

In proving convergence of gradient descent, terms like 2\langle \nabla f(x), e\rangle arise (where e is an error term). Using the parametric Young inequality:

2|\langle \nabla f(x), e\rangle| \leq 2\|\nabla f(x)\|\|e\| \leq \epsilon\|\nabla f(x)\|^2 + \frac{1}{\epsilon}\|e\|^2

By choosing Ο΅\epsilon appropriately, the gradient term can be absorbed into the main descent lemma while the error term is controlled separately.

Common Confusions

Watch Out

Jensen goes the wrong way for concave functions

For convex f: f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]. For concave f: f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)].

Students frequently apply Jensen to \log (concave) and get the direction wrong. Remember: \log(\mathbb{E}[X]) \geq \mathbb{E}[\log X] (concave Jensen), which gives \text{KL} \geq 0 and the EM lower bound.

Watch Out

Cauchy-Schwarz is a special case of Hölder, not the other way around

Hölder with p = q = 2 gives Cauchy-Schwarz. Hölder is strictly more general (it works for any conjugate pair p, q). But Cauchy-Schwarz is by far the most commonly used special case.

Summary

  • Cauchy-Schwarz: |\langle u, v\rangle| \leq \|u\|\|v\|. The universal inner product bound.
  • Hölder: \sum_i |a_i b_i| \leq \|a\|_p \|b\|_q. Generalizes Cauchy-Schwarz to \ell_p spaces.
  • Jensen: f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)] for convex f. The most important inequality in ML theory.
  • AM-GM: arithmetic mean \geq geometric mean. Bounds products by sums.
  • Young: ab \leq a^p/p + b^q/q. Splits products with tunable balance.
  • Markov: \mathbb{P}(X \geq t) \leq \mathbb{E}[X]/t. The foundation of all tail bounds.
  • Chebyshev: \mathbb{P}(|X - \mu| \geq t) \leq \sigma^2/t^2. Variance-based tails.

Exercises

ExerciseCore

Problem

Assume P \ll Q (i.e., q(x) > 0 wherever p(x) > 0, so p/q is P-a.s. finite). Use Jensen's inequality to show that the KL divergence is non-negative: D_{\text{KL}}(P \| Q) = \mathbb{E}_p[\log(p(X)/q(X))] \geq 0. What happens to D_{\text{KL}}(P \| Q) if P is not absolutely continuous with respect to Q?

ExerciseCore

Problem

Starting from 2xy \leq x^2 + y^2 (the p = q = 2 case of Young), prove the parametric form: for any a, b \geq 0 and \epsilon > 0, 2ab \leq \epsilon a^2 + b^2/\epsilon.

ExerciseAdvanced

Problem

Prove Hölder's inequality for sums: if \frac{1}{p} + \frac{1}{q} = 1 with p, q > 1, then \sum_i |a_i b_i| \leq \|a\|_p \|b\|_q. (Hint: apply Young's inequality to |a_i|/\|a\|_p and |b_i|/\|b\|_q, then sum over i.)

References

Canonical:

  • Axler, Linear Algebra Done Right (2024), Chapter 6 for inner-product spaces.
  • Hardy, Littlewood, Pólya, Inequalities (1934). The classic reference.
  • Steele, The Cauchy-Schwarz Master Class (2004). An excellent problem-driven treatment.

Current:

  • Boucheron, Lugosi, Massart, Concentration Inequalities (2013), Chapter 1.
  • Cover & Thomas, Elements of Information Theory (2006), Chapter 2 for Jensen applications.

Next Topics

Building on these foundational inequalities:

Last reviewed: April 26, 2026
