
Common Inequalities

The algebraic and probabilistic inequalities that appear everywhere in ML theory: Cauchy-Schwarz, Jensen, AM-GM, Holder, Minkowski, Young, Markov, and Chebyshev.

Core · Tier 1 · Stable · Supporting · ~50 min

Why This Matters

Inequalities are the grammar of mathematical proofs. In ML theory, you cannot write a single generalization bound, convergence proof, or complexity argument without reaching for at least one inequality from this page. Cauchy-Schwarz bounds inner products. Jensen relates a function of an expectation to the expectation of the function. Markov and Chebyshev give tail bounds.

Two notations recur: \mathbb{E}[\cdot] for expectation and \leq as the verb of every line.

[Infographic] The inequality toolbox: what each bound says and when to reach for it. Each inequality (Cauchy-Schwarz, Jensen, Markov, Chebyshev, Hoeffding, Bernstein, AM-GM, triangle inequality) gets a one-line statement and its most common application context (concentration, expectation manipulation, geometric bounds).

These are not abstract exercises. They are tools you will use on every page of this site. Know them so well that applying them is automatic.

Mental Model

The inequalities on this page fall into three groups:

  1. Inner product / norm inequalities: Cauchy-Schwarz, Hölder, Minkowski. These relate different norms and inner products. Used to bound terms that involve products or sums.

  2. Convexity inequalities: Jensen, AM-GM, Young. These exploit the curvature of convex (or concave) functions. Jensen is the most important single inequality in ML theory.

  3. Probabilistic inequalities: Markov, Chebyshev. These bound tail probabilities using moment information. They are the foundation for all concentration inequalities.

Inner Product and Norm Inequalities

Theorem

Cauchy-Schwarz Inequality

Statement

For any vectors u, v in an inner product space:

|\langle u, v \rangle| \leq \|u\| \cdot \|v\|

Intuition

The inner product of two vectors cannot exceed the product of their lengths. This generalizes the fact that \cos\theta \leq 1.

Proof Sketch

For any t \in \mathbb{R}, \|u + tv\|^2 \geq 0. Expanding:

\|u\|^2 + 2t\langle u, v\rangle + t^2\|v\|^2 \geq 0

This is a non-negative quadratic in t, so its discriminant is non-positive:

4\langle u, v\rangle^2 - 4\|u\|^2\|v\|^2 \leq 0

which gives |\langle u, v\rangle| \leq \|u\| \cdot \|v\|.

Why It Matters

Cauchy-Schwarz is the most frequently used inequality in all of mathematics. In ML theory, it appears when: bounding inner products in kernel methods, analyzing gradient descent geometry, studying Rademacher complexity, and controlling cross-terms in bias-variance decompositions.

Failure Mode

Cauchy-Schwarz gives the tightest possible bound for inner products in terms of norms alone, but it can still be loose if u and v are nearly orthogonal (the bound gives \|u\|\|v\| while the true value is near zero). When you have structural information about the angle between u and v, tighter bounds are possible.

Common equivalent forms include:

  • Equality case: equality occurs exactly when the two vectors are linearly dependent.
  • For random variables: |\mathbb{E}[XY]| \leq \sqrt{\mathbb{E}[X^2]\,\mathbb{E}[Y^2]} when the second moments exist.
  • For finite sums: \left(\sum_i a_i b_i\right)^2 \leq \left(\sum_i a_i^2\right)\left(\sum_i b_i^2\right).
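
To make this concrete, here is a minimal numerical sketch (Python with NumPy; the vectors and seed are arbitrary) checking both the bound and the linear-dependence equality case:

```python
# Minimal numerical check of Cauchy-Schwarz; vectors and seed are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=5)
v = rng.normal(size=5)

lhs = abs(u @ v)                              # |<u, v>|
rhs = np.linalg.norm(u) * np.linalg.norm(v)   # ||u|| ||v||
assert lhs <= rhs + 1e-12

# Equality case: linearly dependent vectors saturate the bound.
w = 3.0 * u
print(abs(u @ w), np.linalg.norm(u) * np.linalg.norm(w))  # equal up to rounding
```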
Definition

Hölder Inequality

For p, q \geq 1 with \frac{1}{p} + \frac{1}{q} = 1 (conjugate exponents):

\sum_i |a_i b_i| \leq \left(\sum_i |a_i|^p\right)^{1/p} \left(\sum_i |b_i|^q\right)^{1/q}

Or in integral form: \int |fg| \leq \|f\|_p \|g\|_q.

Special case: p = q = 2 gives Cauchy-Schwarz.

When it arises: Bounding mixed-norm expressions, dual norm computations (\|x\|_p^* = \|x\|_q), and proving interpolation inequalities. The duality between \ell_p and \ell_q norms underpins convex analysis.
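
As a sanity check, the sketch below (illustrative Python; the conjugate pair p = 3, q = 3/2 is an arbitrary choice) verifies Hölder on random vectors and shows the dual-norm identity it implies: the vector proportional to \mathrm{sign}(a)|a|^{p-1}, normalized to unit \ell_q norm, attains exactly \|a\|_p.

```python
# Sketch: Hölder for an arbitrary conjugate pair, plus the dual-norm identity.
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.normal(size=8), rng.normal(size=8)
p, q = 3.0, 1.5                      # 1/3 + 1/1.5 = 1: conjugate exponents

lhs = np.sum(np.abs(a * b))
rhs = np.sum(np.abs(a)**p)**(1/p) * np.sum(np.abs(b)**q)**(1/q)
assert lhs <= rhs + 1e-12

# Dual norm: the maximizer of <a, b> over the unit l_q ball is proportional
# to sign(a)|a|^(p-1), and it attains exactly ||a||_p.
b_star = np.sign(a) * np.abs(a)**(p - 1)
b_star /= np.sum(np.abs(b_star)**q)**(1/q)      # normalize: ||b_star||_q = 1
print(a @ b_star, np.sum(np.abs(a)**p)**(1/p))  # the two values coincide
```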

Definition

Minkowski Inequality (Triangle Inequality for \ell_p)

For p \geq 1:

\left(\sum_i |a_i + b_i|^p\right)^{1/p} \leq \left(\sum_i |a_i|^p\right)^{1/p} + \left(\sum_i |b_i|^p\right)^{1/p}

In short: \|a + b\|_p \leq \|a\|_p + \|b\|_p.

This is the triangle inequality for \ell_p norms, proven using Hölder. It confirms that \|\cdot\|_p is a valid norm for p \geq 1. (For p < 1, the triangle inequality fails and \|\cdot\|_p is not a norm.)
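
The failure for p < 1 is easy to see numerically. A minimal counterexample (illustrative Python) with two standard basis vectors and p = 1/2:

```python
# For p < 1 the triangle inequality fails, so ||.||_p is not a norm.
import numpy as np

def lp(x, p):
    return np.sum(np.abs(x)**p)**(1/p)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(lp(a + b, 0.5), lp(a, 0.5) + lp(b, 0.5))  # 4.0 > 2.0: inequality fails
```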

Convexity Inequalities

Theorem

Jensen Inequality

Statement

If f is a convex function and X is a random variable:

f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]

If f is concave, the inequality reverses: f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)].

Finite form: For weights w_i \geq 0 with \sum_i w_i = 1:

f\!\left(\sum_i w_i x_i\right) \leq \sum_i w_i f(x_i)

Intuition

A convex function "bows upward." The value of ff at the average point is at most the average of ff at the individual points. Geometrically: the chord connecting two points on a convex curve lies above the curve.

Proof Sketch

The cleanest sketch uses a subgradient. If f is a proper convex function and \mathbb{E}[X] lies in the relative interior of the domain of f, then f admits a subgradient a at \mathbb{E}[X], i.e. f(x) \geq f(\mathbb{E}[X]) + a^\top (x - \mathbb{E}[X]) for all x. Taking expectations:

\mathbb{E}[f(X)] \geq f(\mathbb{E}[X]) + a^\top \mathbb{E}[X - \mathbb{E}[X]] = f(\mathbb{E}[X]).

For convex f on \mathbb{R}, a subgradient (left/right derivative) always exists at interior points, so this argument covers the standard real-valued case. For boundary points, infinite-dimensional settings, or extended-real-valued f, one typically falls back on the finite convex-combination version f(\sum_i w_i x_i) \leq \sum_i w_i f(x_i) with \sum_i w_i = 1, w_i \geq 0 (provable by induction on the number of points), and then extends to expectations by an approximation/measure-theoretic argument.

Why It Matters

Jensen is arguably the single most important inequality in ML theory. Applications include:

  • EM algorithm: The ELBO is a lower bound on the log-likelihood because \log is concave: \log \mathbb{E}[X] \geq \mathbb{E}[\log X].
  • KL divergence is non-negative: follows from Jensen applied to -\log.
  • Convex loss functions: f(\mathbb{E}[\hat{w}]) \leq \mathbb{E}[f(\hat{w})] relates the risk of the average model to the average risk.
  • Information theory: entropy bounds, data processing inequality.

Failure Mode

Jensen gives a direction of inequality but not its magnitude. The gap \mathbb{E}[f(X)] - f(\mathbb{E}[X]) can be anywhere from zero (when X is constant or f is linear) to very large (when X has high variance and f is strongly convex). Tighter bounds require knowledge of the variance of X and the curvature of f.
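
One case where the gap is exactly computable: for f(x) = x^2, the Jensen gap is \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \mathrm{Var}(X). A quick Monte Carlo check (illustrative Python; the distribution and sample size are arbitrary):

```python
# For f(x) = x^2 the Jensen gap E[f(X)] - f(E[X]) equals Var(X) exactly,
# so it grows with the variance of X, as the text describes.
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: x**2
for sigma in [0.1, 1.0, 3.0]:
    x = rng.normal(loc=1.0, scale=sigma, size=200_000)
    gap = f(x).mean() - f(x.mean())
    print(f"sigma={sigma}: empirical gap {gap:.3f} vs Var(X) = {sigma**2:.2f}")
```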

Definition

AM-GM Inequality (Arithmetic Mean - Geometric Mean)

For non-negative reals a_1, \ldots, a_n:

\frac{a_1 + a_2 + \cdots + a_n}{n} \geq (a_1 a_2 \cdots a_n)^{1/n}

Equality holds if and only if all a_i are equal.

Two-variable form: \frac{a + b}{2} \geq \sqrt{ab}, or equivalently ab \leq \frac{(a + b)^2}{4}.

Proof from Jensen: Apply Jensen to the concave function f(x) = \log x: \log\!\left(\frac{1}{n}\sum a_i\right) \geq \frac{1}{n}\sum \log a_i = \log\!\left(\prod a_i\right)^{1/n}.

When it arises: Bounding products by sums (which are easier to handle), optimizing expressions involving products, proving convergence rates.
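
A two-line numerical check (illustrative values) of the bound and its equality case:

```python
# AM-GM on arbitrary non-negative values, plus the all-equal equality case.
import numpy as np

a = np.array([1.0, 4.0, 9.0])
print(a.mean(), np.prod(a)**(1/len(a)))     # 4.667 >= 3.302

b = np.full(3, 5.0)
print(b.mean(), np.prod(b)**(1/len(b)))     # 5.0 == 5.0: equality iff all equal
```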

Definition

Young Inequality

For a, b \geq 0 and conjugate exponents p, q > 1 with \frac{1}{p} + \frac{1}{q} = 1:

ab \leq \frac{a^p}{p} + \frac{b^q}{q}

Special case (p = q = 2): ab \leq \frac{a^2}{2} + \frac{b^2}{2}, which is AM-GM for squares.

Parametric form: For any \epsilon > 0: ab \leq \frac{\epsilon a^2}{2} + \frac{b^2}{2\epsilon}. This is extremely useful in proofs where you need to "split" a product into two terms with different scaling.

When it arises: Splitting cross-terms in energy estimates, proving Hölder from Young, absorbing error terms in convergence proofs. The parametric form lets you trade off how much of the bound goes into each term.
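
The trade-off is visible numerically: every \epsilon > 0 gives a valid bound, and minimizing the right-hand side over \epsilon (at \epsilon = b/a) recovers ab exactly. A sketch with arbitrary values:

```python
# Parametric Young: ab <= eps*a^2/2 + b^2/(2*eps) for every eps > 0,
# with the minimum over eps attained at eps = b/a, where the bound equals ab.
a, b = 2.0, 5.0
for eps in (0.1, 1.0, b / a, 10.0):
    bound = eps * a**2 / 2 + b**2 / (2 * eps)
    print(f"eps={eps:.2f}: bound {bound:.2f} >= ab = {a * b:.2f}")
```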

Probabilistic Inequalities

Definition

Markov Inequality

If X \geq 0, then for any t > 0:

\mathbb{P}(X \geq t) \leq \frac{\mathbb{E}[X]}{t}

Proof: \mathbb{E}[X] \geq \mathbb{E}[X \cdot \mathbf{1}_{X \geq t}] \geq t \cdot \mathbb{P}(X \geq t).

When it arises: Markov is the foundation of all tail bounds. Chebyshev applies Markov to (X - \mu)^2. The Chernoff method applies Markov to e^{\lambda X}. Every concentration inequality starts with Markov.

Limitation: The bound decays as 1/t, far too slowly for most applications. You almost always want Hoeffding or Bernstein instead.
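
To see how loose 1/t is, compare it against an exponential random variable, whose true tail decays exponentially (a minimal check; the distribution is illustrative):

```python
# X ~ Exp(1) has E[X] = 1 and true tail P(X >= t) = e^{-t};
# Markov only guarantees 1/t, which is exponentially looser.
import numpy as np

for t in (2.0, 5.0, 10.0):
    print(f"t={t}: true tail {np.exp(-t):.2e}, Markov bound {1/t:.2e}")
```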

Definition

Chebyshev Inequality

For any random variable X with mean \mu and finite variance \sigma^2:

\mathbb{P}(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2}

Proof: Apply Markov to (X - \mu)^2: \mathbb{P}(|X - \mu| \geq t) = \mathbb{P}((X - \mu)^2 \geq t^2) \leq \sigma^2/t^2.

When it arises: Proving weak laws of large numbers, giving initial sample complexity bounds (n = O(1/(\epsilon^2 \delta)) to estimate a mean within \epsilon at confidence 1 - \delta). The polynomial tail 1/t^2 is worse than the exponential tails from Hoeffding, but Chebyshev requires only finite variance, not boundedness.
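
The sample complexity claim is easy to check empirically. A sketch (illustrative Python; the Gaussian and the trial count are arbitrary choices) estimating a mean to within \epsilon with failure probability at most \delta:

```python
# Chebyshev on the sample mean: Var(X_bar) = sigma^2/n, so
# P(|X_bar - mu| >= eps) <= sigma^2/(n eps^2), which is <= delta
# once n >= sigma^2/(eps^2 delta).
import numpy as np

rng = np.random.default_rng(3)
sigma, eps, delta = 1.0, 0.1, 0.05
n = int(np.ceil(sigma**2 / (eps**2 * delta)))          # n = 2000 here
fails = [abs(rng.normal(0.0, sigma, n).mean()) >= eps for _ in range(2000)]
print(f"n={n}: empirical failure rate {np.mean(fails):.4f} (bound: {delta})")
```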

Applications in ML Theory

Here is how each inequality appears in practice:

Inequality | Typical ML Use
Cauchy-Schwarz | Bounding inner products in kernel methods, RKHS norms
Hölder | Dual norm computations, mixed-norm regularization
Jensen (\log concave) | EM algorithm ELBO, KL divergence non-negativity
Jensen (convex loss) | Relating ensemble risk to average individual risk
AM-GM | Bounding products in convergence rate proofs
Young (parametric) | Splitting cross-terms with tunable \epsilon
Markov | Starting point for all exponential tail bounds
Chebyshev | Quick variance-based concentration when boundedness is unavailable

Canonical Examples

Example

Jensen in the EM algorithm

The EM algorithm maximizes a lower bound on the log-likelihood. For any distribution q(z) over latent variables:

\log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta) = \log \sum_z q(z) \frac{p(x, z \mid \theta)}{q(z)} \geq \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)}

The inequality is Jensen applied to the concave function \log: \log(\mathbb{E}_q[f(Z)]) \geq \mathbb{E}_q[\log f(Z)], here with f(z) = p(x, z \mid \theta)/q(z). The right-hand side is the ELBO (Evidence Lower BOund).
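
A toy numerical check (illustrative Python; the two-component mixture values are made up) that the ELBO really sits below the exact log-likelihood for an arbitrary q:

```python
# ELBO <= log p(x) on a toy 2-component latent variable model.
import numpy as np

pz = np.array([0.3, 0.7])            # prior p(z)
px_given_z = np.array([0.05, 0.2])   # likelihood p(x|z) at one observation x
pxz = pz * px_given_z                # joint p(x, z)

log_px = np.log(pxz.sum())           # exact log-likelihood

q = np.array([0.5, 0.5])             # any q(z) gives a valid lower bound
elbo = np.sum(q * np.log(pxz / q))
print(log_px, elbo)                  # elbo <= log_px; equality iff q = p(z|x)
```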

Example

Cauchy-Schwarz in bounding gradient norms

In gradient descent analysis, you often need to bound \langle \nabla f(x), x - x^* \rangle. By Cauchy-Schwarz:

\langle \nabla f(x), x - x^* \rangle \leq \|\nabla f(x)\| \cdot \|x - x^*\|

This converts an inner product (hard to control) into a product of norms (each of which can be bounded separately).
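
As a concrete instance, take the hypothetical toy objective f(x) = \frac{1}{2}\|x\|^2, where \nabla f(x) = x and x^* = 0. Gradient and displacement are then parallel, so Cauchy-Schwarz holds with equality:

```python
# Toy check on f(x) = ||x||^2 / 2: grad f(x) = x and x* = 0, so the
# inner product <grad f(x), x - x*> saturates the Cauchy-Schwarz bound.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=4)
grad, diff = x, x - 0.0
print(grad @ diff, np.linalg.norm(grad) * np.linalg.norm(diff))  # equal
```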

Example

Young inequality to split cross-terms

In proving convergence of gradient descent, terms like 2\langle \nabla f(x), e\rangle arise (where e is an error term). Using the parametric Young inequality:

2|\langle \nabla f(x), e\rangle| \leq 2\|\nabla f(x)\|\|e\| \leq \epsilon\|\nabla f(x)\|^2 + \frac{1}{\epsilon}\|e\|^2

By choosing Ο΅\epsilon appropriately, the gradient term can be absorbed into the main descent lemma while the error term is controlled separately.

Common Confusions

Watch Out

Jensen goes the wrong way for concave functions

For convex f: f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]. For concave f: f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)].

Students frequently apply Jensen to \log (concave) and get the direction wrong. Remember: \log(\mathbb{E}[X]) \geq \mathbb{E}[\log X] (concave Jensen), which gives \text{KL} \geq 0 and the EM lower bound.

Watch Out

Cauchy-Schwarz is a special case of Hölder, not the other way around

Hölder with p = q = 2 gives Cauchy-Schwarz. Hölder is strictly more general (it works for any conjugate pair p, q). But Cauchy-Schwarz is by far the most commonly used special case.

Summary

  • Cauchy-Schwarz: |\langle u, v\rangle| \leq \|u\|\|v\|. The universal inner product bound.
  • Hölder: \sum_i |a_i b_i| \leq \|a\|_p \|b\|_q. Generalizes Cauchy-Schwarz to \ell_p spaces.
  • Jensen: f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)] for convex f. The most important inequality in ML theory.
  • AM-GM: arithmetic mean \geq geometric mean. Bounds products by sums.
  • Young: ab \leq a^p/p + b^q/q. Splits products with tunable balance.
  • Markov: \mathbb{P}(X \geq t) \leq \mathbb{E}[X]/t. The foundation of all tail bounds.
  • Chebyshev: \mathbb{P}(|X - \mu| \geq t) \leq \sigma^2/t^2. Variance-based tails.

Exercises

ExerciseCore

Problem

Assume P \ll Q (i.e., q(x) > 0 wherever p(x) > 0, so p/q is P-a.s. finite). Use Jensen's inequality to show that the KL divergence is non-negative: D_{\text{KL}}(P \| Q) = \mathbb{E}_p[\log(p(X)/q(X))] \geq 0. What happens to D_{\text{KL}}(P \| Q) if P is not absolutely continuous with respect to Q?

ExerciseCore

Problem

Starting from 2xy \leq x^2 + y^2 (the p = q = 2 case of Young), prove the parametric form: for any a, b \geq 0 and \epsilon > 0, 2ab \leq \epsilon a^2 + b^2/\epsilon.

ExerciseAdvanced

Problem

Prove Hölder's inequality for sums: if \frac{1}{p} + \frac{1}{q} = 1 with p, q > 1, then \sum_i |a_i b_i| \leq \|a\|_p \|b\|_q. (Hint: apply Young's inequality to |a_i|/\|a\|_p and |b_i|/\|b\|_q, then sum over i.)

References

Canonical:

  • Axler, Linear Algebra Done Right (2024), Chapter 6 for inner-product spaces.
  • Hardy, Littlewood, Pólya, Inequalities (1934). The classic reference.
  • Steele, The Cauchy-Schwarz Master Class (2004). An excellent problem-driven treatment.

Current:

  • Boucheron, Lugosi, Massart, Concentration Inequalities (2013), Chapter 1.
  • Cover & Thomas, Elements of Information Theory (2006), Chapter 2 for Jensen applications.

Next Topics

Building on these foundational inequalities:

Last reviewed: April 26, 2026
