
Concentration Probability

Contraction Inequality

The Ledoux-Talagrand contraction principle: applying a Lipschitz function anchored at zero to a function class can only contract Rademacher complexity, letting you bound the complexity of the loss class from the hypothesis class.

Advanced · Tier 2 · Stable · Supporting · ~50 min

Why This Matters

In learning theory, the generalization bound involves the Rademacher complexity of the loss class \ell \circ \mathcal{H}, not the hypothesis class \mathcal{H} itself. But computing the Rademacher complexity of the composed class directly is often hard. The contraction inequality solves this problem: if the loss function \ell is L-Lipschitz, then \mathfrak{R}_n(\ell \circ \mathcal{H}) \leq L \cdot \mathfrak{R}_n(\mathcal{H}).

This is the bridge between the quantity you need (loss class complexity) and the quantity you can compute (hypothesis class complexity). Without it, every new loss function would require a separate Rademacher analysis from scratch.

theorem visual

A Lipschitz loss cannot make the class noisier

Contraction lets you bound the loss-class complexity through the simpler hypothesis-class complexity.


what changes

Replace f by \phi \circ f, usually a loss applied to predictions.

what stays controlled

If \phi is L-Lipschitz and anchored at zero, complexity grows by at most a factor of L.

why it matters

You analyze \mathcal{F} once, then transfer the bound to \phi \circ \mathcal{F}.

Mental Model

Imagine the outputs of your hypothesis class as a cloud of points in \mathbb{R}^n (one coordinate per data point). The Rademacher complexity measures how "spread out" this cloud is in terms of correlating with random signs. Now apply a Lipschitz function \phi to each coordinate independently. The Lipschitz condition means \phi cannot stretch distances: it can only shrink them or leave them unchanged. The cloud after applying \phi is no more spread out than the original, so the Rademacher complexity can only decrease.

The condition \phi(0) = 0 ensures that the contraction is anchored: \phi does not shift the cloud, only reshapes it.
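
This picture is easy to check numerically. The sketch below is illustrative and not from the source; it uses NumPy, with tanh standing in for a 1-Lipschitz map anchored at zero, and confirms that applying the map coordinate-wise never grows a pairwise distance in the cloud:

```python
import numpy as np

rng = np.random.default_rng(0)

# A "cloud" of hypothesis-output vectors in R^n (one coordinate per data point).
n, m = 10, 50                      # n data points, m hypotheses
outputs = rng.normal(size=(m, n))  # rows: (f(z_1), ..., f(z_n)) for each f

# A 1-Lipschitz map with phi(0) = 0, applied coordinate-wise (tanh as example).
phi = np.tanh
mapped = phi(outputs)

# Pairwise distances can only shrink or stay equal under phi.
for i in range(m):
    for j in range(m):
        d_before = np.linalg.norm(outputs[i] - outputs[j])
        d_after = np.linalg.norm(mapped[i] - mapped[j])
        assert d_after <= d_before + 1e-12
print("all pairwise distances contracted")
```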

Formal Setup

Let \mathcal{F} be a class of real-valued functions f: \mathcal{Z} \to \mathbb{R}. Let S = (z_1, \ldots, z_n) be a sample and \sigma_1, \ldots, \sigma_n be i.i.d. Rademacher random variables. Let \phi: \mathbb{R} \to \mathbb{R} be a function applied pointwise.

Definition

Lipschitz Function

A function \phi: \mathbb{R} \to \mathbb{R} is L-Lipschitz if and only if for all a, b \in \mathbb{R}:

|\phi(a) - \phi(b)| \leq L|a - b|

The smallest such L is the Lipschitz constant of \phi. If \phi is differentiable, then L = \sup_t |\phi'(t)|.
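
For a differentiable \phi, this characterization can be checked on a grid. A small illustrative sketch (not from the source), with tanh as the example, whose derivative 1 - \tanh^2 t is maximized at t = 0:

```python
import numpy as np

# Estimate the Lipschitz constant of a differentiable phi as sup_t |phi'(t)|,
# here for phi = tanh, whose derivative 1 - tanh(t)^2 peaks at t = 0.
t = np.linspace(-10, 10, 200001)       # grid includes t = 0 exactly
deriv = 1 - np.tanh(t) ** 2
L_est = np.abs(deriv).max()
print(f"estimated L for tanh: {L_est:.6f}")   # -> 1.000000
assert abs(L_est - 1.0) < 1e-9
```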

Definition

Composed Function Class

Given a function \phi: \mathbb{R} \to \mathbb{R} and a class \mathcal{F}, the composed class is:

\phi \circ \mathcal{F} = \{\phi \circ f : f \in \mathcal{F}\}

where (\phi \circ f)(z) = \phi(f(z)). In learning theory, \phi is typically the loss function viewed as a function of the hypothesis output, and \mathcal{F} = \mathcal{H} is the hypothesis class.

Main Theorems

Theorem

Contraction Inequality (Ledoux-Talagrand)

Statement

Let \phi: \mathbb{R} \to \mathbb{R} be L-Lipschitz with \phi(0) = 0. For any class \mathcal{F} of real-valued functions and any sample S = (z_1, \ldots, z_n):

\hat{\mathfrak{R}}_S(\phi \circ \mathcal{F}) \leq L \cdot \hat{\mathfrak{R}}_S(\mathcal{F})

where \hat{\mathfrak{R}}_S denotes the empirical Rademacher complexity. Taking expectations over S:

\mathfrak{R}_n(\phi \circ \mathcal{F}) \leq L \cdot \mathfrak{R}_n(\mathcal{F})

Intuition

Applying \phi pointwise to the values f(z_1), \ldots, f(z_n) cannot increase how well the class correlates with random signs. The Lipschitz condition bounds how much \phi can separate two function values, and \phi(0) = 0 ensures \phi does not add a constant offset. The result is that post-composition with \phi can only reduce the "effective spread" of the function class as seen by the Rademacher variables.
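
The inequality can be sanity-checked by Monte Carlo on a small finite class. This is an illustrative sketch (not from the source): the class is a random matrix of outputs and \phi = \tanh, which is 1-Lipschitz with \phi(0) = 0.

```python
import numpy as np

rng = np.random.default_rng(1)

def emp_rademacher(values, n_mc=20000):
    """Monte Carlo estimate of E_sigma sup_f (1/n) sum_i sigma_i f(z_i).
    `values` has shape (num_functions, n): one row of outputs per f."""
    m, n = values.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))
    # correlations[t, f] = (1/n) sum_i sigma[t, i] * values[f, i]
    correlations = sigma @ values.T / n
    return correlations.max(axis=1).mean()

F = rng.normal(size=(20, 8))       # a small finite class: 20 functions, n = 8
phi = np.tanh                      # 1-Lipschitz, phi(0) = 0
r_F = emp_rademacher(F)
r_phiF = emp_rademacher(phi(F))
print(f"{r_phiF:.3f} <= {r_F:.3f}")
assert r_phiF <= r_F + 1e-3        # contraction, up to Monte Carlo error
```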

Proof Sketch

We need to show:

\mathbb{E}_\sigma\!\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i \phi(f(z_i))\right] \leq L \cdot \mathbb{E}_\sigma\!\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(z_i)\right]

The proof is by induction: peel off one Rademacher variable at a time. Condition on \sigma_1, \ldots, \sigma_{n-1} and let A(f) collect the fixed terms. The inner conditional expectation over \sigma_n is:

\frac{1}{2}\sup_{f \in \mathcal{F}}\bigl[A(f) + \phi(f(z_n))\bigr] + \frac{1}{2}\sup_{g \in \mathcal{F}}\bigl[A(g) - \phi(g(z_n))\bigr]

This is the two-point step. The goal is to bound this expression by the same expression with L \cdot f(z_n) in place of \phi(f(z_n)). The key lemma is: for any two values u = f(z_n) and v = g(z_n) and any L-Lipschitz \phi,

\phi(u) - \phi(v) \leq L|u - v| = L \cdot s(u - v), \quad s \in \{+1, -1\}

with s chosen to match the sign of u - v. Splitting the two suprema by the relative ordering of f(z_n) and g(z_n) and applying this pointwise replaces \phi(u) - \phi(v) with L(u - v) inside the sum of suprema, which is exactly the target inequality after using the symmetry of the Rademacher distribution on \{+1, -1\}. The condition \phi(0) = 0 anchors the comparison when only one of u, v appears (see Ledoux and Talagrand 1991, Theorem 4.12 for the full argument).

Iterating the two-point step across all n coordinates gives the stated inequality. The same argument yields the vector (coordinate-wise) contraction, in which each \sigma_i may see a different L-Lipschitz \phi_i with \phi_i(0) = 0.

The naive argument using |\phi(t)| \leq L|t| does not prove contraction, since that only controls \phi at zero, not pairwise differences \phi(u) - \phi(v). The two-point argument is essential.

Why It Matters

This is the key modularity result in Rademacher complexity theory. It means you can analyze the Rademacher complexity of common hypothesis classes (linear functions, kernel classes, neural networks) once, and then compose with any Lipschitz loss to get a generalization bound.

For example, if you know \mathfrak{R}_n(\mathcal{H}) \leq BR/\sqrt{n} for norm-bounded linear classifiers, and the loss is 1-Lipschitz (like the ramp loss), then immediately \mathfrak{R}_n(\ell \circ \mathcal{H}) \leq BR/\sqrt{n}.

Without contraction, you would need to compute the Rademacher complexity of each loss-hypothesis combination separately.

Failure Mode

The condition \phi(0) = 0 is essential. If \phi(0) = c \neq 0, the result fails because \phi shifts all function values by a constant, which can increase the ability to correlate with noise. In practice, you can often center the loss: define \tilde{\phi}(t) = \phi(t) - \phi(0), apply contraction to \tilde{\phi}, and add back the constant (which does not affect the supremum over \mathcal{F} since it is independent of f).

Also, the Lipschitz constant is a global worst case. If \phi is much smoother on the range actually taken by functions in \mathcal{F}, the bound can be loose.

Proposition

Vector-Valued Contraction

Statement

Let \phi_1, \ldots, \phi_n: \mathbb{R} \to \mathbb{R} each be L-Lipschitz with \phi_i(0) = 0 (the \phi_i may differ). Then:

\mathbb{E}_\sigma\!\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^n \sigma_i \phi_i(f(z_i))\right] \leq L \cdot \mathbb{E}_\sigma\!\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^n \sigma_i f(z_i)\right]

Intuition

The contraction works even when different coordinates undergo different Lipschitz maps, as long as they all share the same Lipschitz constant L. This is useful because in supervised learning, the loss at data point i is \ell(h(x_i), y_i), viewed as a function of h(x_i) alone. Since y_i differs across data points, the effective \phi_i(t) = \ell(t, y_i) differs per coordinate, but the Lipschitz constant is the same.
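
A quick Monte Carlo sketch of this per-coordinate setup (illustrative, not from the source): a finite random class, one random label per coordinate, and centered hinge maps \phi_i(t) = \max(0, 1 - y_i t) - 1, each 1-Lipschitz with \phi_i(0) = 0.

```python
import numpy as np

rng = np.random.default_rng(5)

n, m = 8, 25
F = rng.normal(size=(m, n))              # outputs f(z_i) for a finite class
y = rng.choice([-1.0, 1.0], size=n)      # one label per coordinate i

# Centered hinge per coordinate: phi_i(t) = max(0, 1 - y_i t) - 1, phi_i(0) = 0.
phiF = np.maximum(0.0, 1.0 - y * F) - 1.0

sigma = rng.choice([-1.0, 1.0], size=(50000, n))
lhs = (sigma @ phiF.T).max(axis=1).mean()   # E sup_f sum_i sigma_i phi_i(f(z_i))
rhs = (sigma @ F.T).max(axis=1).mean()      # E sup_f sum_i sigma_i f(z_i)
print(f"{lhs:.3f} <= {rhs:.3f}")
assert lhs <= rhs + 0.01                    # vector contraction with L = 1
```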

Proof Sketch

The proof is the same as for the scalar case. The comparison inequality for Rademacher processes handles the coordinate-wise case directly. Each \phi_i is individually L-Lipschitz, and the key property, that contracting each coordinate individually cannot increase the expected supremum of \sum_i \sigma_i \phi_i(t_i), follows from the independence of the Rademacher variables and the inductive peeling argument.

Why It Matters

This vector-valued version is what you actually use in practice. When computing \mathfrak{R}_n(\ell \circ \mathcal{H}), the loss at each data point depends on the label y_i, giving a different function \phi_i per data point. The vector-valued contraction handles this seamlessly.

Failure Mode

If the Lipschitz constants differ across coordinates (\phi_i is L_i-Lipschitz), the bound uses \max_i L_i. This can be loose if a few coordinates have much larger Lipschitz constants than the rest.

Proof Ideas and Templates Used

The contraction inequality uses one main technique:

  1. Rademacher comparison inequality: the core tool from the theory of Gaussian and Rademacher processes. It says that if you apply Lipschitz maps coordinate-wise to the index set of a Rademacher process, the expected supremum can only decrease. The proof works by induction on the number of coordinates, conditioning on all but one Rademacher variable at each step.

This is conceptually different from the symmetrization technique (which connects generalization gaps to Rademacher complexity). Contraction is about composing function classes, not about connecting population and empirical quantities.

Canonical Examples

Example

Contraction for the hinge loss

The hinge loss is \ell(t, y) = \max(0, 1 - yt) for y \in \{-1, +1\}. As a function of t for fixed y, \phi_y(t) = \max(0, 1 - yt).

This is 1-Lipschitz: |\phi_y(a) - \phi_y(b)| \leq |a - b|. But \phi_y(0) = 1 \neq 0. So center it: \tilde{\phi}_y(t) = \phi_y(t) - 1. Then \tilde{\phi}_y(0) = 0 and \tilde{\phi}_y is still 1-Lipschitz.

By contraction: \hat{\mathfrak{R}}_S(\tilde{\phi}_y \circ \mathcal{H}) \leq 1 \cdot \hat{\mathfrak{R}}_S(\mathcal{H}). Since the constant shift by 1 does not affect the supremum over \mathcal{H} (it cancels):

\hat{\mathfrak{R}}_S(\ell \circ \mathcal{H}) \leq \hat{\mathfrak{R}}_S(\mathcal{H})

For linear classifiers with \|w\| \leq B and \|x_i\| \leq R: \hat{\mathfrak{R}}_S(\ell \circ \mathcal{H}) \leq BR/\sqrt{n}.
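
The linear-class bound can be checked numerically. For \{x \mapsto \langle w, x \rangle : \|w\| \leq B\}, the supremum over w has a closed form for each sign draw, so only the expectation needs Monte Carlo. A sketch with arbitrary sizes, not from the source:

```python
import numpy as np

rng = np.random.default_rng(2)

n, d, B = 40, 5, 2.0
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalize so ||x_i|| = 1
R = 1.0

# For {x -> <w, x> : ||w|| <= B}, the supremum over w is attained along
# sum_i sigma_i x_i, giving sup_w (1/n) sum_i sigma_i <w, x_i> = (B/n) ||sum_i sigma_i x_i||.
sigma = rng.choice([-1.0, 1.0], size=(20000, n))
rad_H = (B / n) * np.linalg.norm(sigma @ X, axis=1).mean()

bound = B * R / np.sqrt(n)
print(f"empirical Rademacher ~ {rad_H:.4f}, BR/sqrt(n) = {bound:.4f}")
assert rad_H <= bound + 1e-3
```

The gap between the estimate and BR/\sqrt{n} is the Jensen gap \mathbb{E}\|Z\| \leq \sqrt{\mathbb{E}\|Z\|^2} used in the standard proof.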

Example

Contraction for the squared loss (bounded case)

The squared loss \phi_y(t) = (t - y)^2 is only locally Lipschitz. To apply contraction, assume |f(x)| \leq M for all f \in \mathcal{F} and |y| \leq Y_{\max}. Then \phi_y restricted to [-M, M] is Lipschitz with constant

L = \sup_{|t| \leq M} |\phi_y'(t)| = \sup_{|t| \leq M} |2(t - y)| \leq 2(M + Y_{\max}).

Next, handle \phi_y(0) = y^2 \neq 0 by constant-shift invariance. For any constants c_i (here c_i = y_i^2, which depend on i but not on f),

\mathbb{E}_\sigma \sup_f \sum_i \sigma_i (g_i(f) + c_i) = \mathbb{E}_\sigma \sup_f \sum_i \sigma_i g_i(f) + \mathbb{E}_\sigma \sum_i \sigma_i c_i = \mathbb{E}_\sigma \sup_f \sum_i \sigma_i g_i(f),

since \mathbb{E}\sigma_i = 0 and the c_i do not depend on f. So subtracting the constants y_i^2 from \phi_{y_i}(f(x_i)) leaves the Rademacher average unchanged. Define \tilde{\phi}_y(t) = (t - y)^2 - y^2 = t^2 - 2yt; then \tilde{\phi}_y(0) = 0 and the Lipschitz constant on [-M, M] is still bounded by 2(M + Y_{\max}).

Applying the vector contraction gives

\hat{\mathfrak{R}}_S(\ell \circ \mathcal{F}) \leq 2(M + Y_{\max}) \cdot \hat{\mathfrak{R}}_S(\mathcal{F}).

The Lipschitz constant grows with the output range. This is expected: the squared loss is more sensitive to large predictions. Without the bounded-range assumption, contraction cannot be applied directly to the squared loss at all.
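
The claimed local Lipschitz constant can be verified by sampling difference quotients of \phi_y on [-M, M]. An illustrative sketch (not from the source) with M = Y_{\max} = 1, so L = 4:

```python
import numpy as np

M, Ymax = 1.0, 1.0
L = 2 * (M + Ymax)                 # claimed Lipschitz constant on [-M, M]

rng = np.random.default_rng(3)
t = rng.uniform(-M, M, size=(100000, 2))   # pairs (a, b) in [-M, M]
y = rng.uniform(-Ymax, Ymax, size=100000)

# Difference quotients |phi_y(a) - phi_y(b)| / |a - b| for phi_y(t) = (t - y)^2.
phi_a = (t[:, 0] - y) ** 2
phi_b = (t[:, 1] - y) ** 2
ratios = np.abs(phi_a - phi_b) / (np.abs(t[:, 0] - t[:, 1]) + 1e-12)
print(f"max observed ratio {ratios.max():.3f} <= L = {L}")
assert ratios.max() <= L
```

Algebraically, |(a - y)^2 - (b - y)^2| = |a - b| \cdot |a + b - 2y| \leq |a - b| \cdot 2(M + Y_{\max}), which is what the sampled ratios reflect.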

Common Confusions

Watch Out

The phi(0) = 0 condition is not optional

Students often forget this condition and apply contraction directly to losses like \ell(t) = (t - y)^2, which has \ell(0) = y^2 \neq 0. The fix is always to center: \tilde{\ell}(t) = \ell(t) - \ell(0). This works because shifting all function values by a constant does not change the Rademacher complexity (the \sigma_i have mean zero, so \sum_i \sigma_i \cdot c = c \sum_i \sigma_i has mean zero and does not affect the supremum in expectation).
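
The mean-zero argument is easy to see numerically: using the same sign draws with and without per-coordinate constants c_i, the two Rademacher averages agree up to Monte Carlo error. A sketch, not from the source:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 10
values = rng.normal(size=(15, n))        # finite class: 15 functions, n points
c = rng.normal(size=n)                   # constants c_i, independent of f
sigma = rng.choice([-1.0, 1.0], size=(200000, n))

# sup_f sum_i sigma_i (f(z_i) + c_i) = sup_f sum_i sigma_i f(z_i) + sum_i sigma_i c_i,
# and the second term has mean zero over sigma, so the averages agree.
base = (sigma @ values.T).max(axis=1).mean() / n
shifted = (sigma @ (values + c).T).max(axis=1).mean() / n
print(f"{base:.4f} vs {shifted:.4f}")
assert abs(base - shifted) < 0.01
```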

Watch Out

Contraction gives an upper bound, not equality

\hat{\mathfrak{R}}_S(\phi \circ \mathcal{F}) \leq L \cdot \hat{\mathfrak{R}}_S(\mathcal{F}), and the inequality can be strict. A loss function might dramatically reduce the effective complexity (e.g., a loss that is constant on most of the range). Contraction does not tell you the Rademacher complexity of the composed class. It only gives you an upper bound. For tighter results, you may need direct computation.

Watch Out

Which constant applies: L, 2L, or sqrt(2) L?

The contraction inequality has several forms with different constants, and the conditions matter.

  • Factor L, centered case (Ledoux-Talagrand 1991, Theorem 4.12): if \phi is L-Lipschitz with \phi(0) = 0 and the supremum is over the signed form \sum_i \sigma_i \phi(f(z_i)), then \mathbb{E}\sup_f \sum_i \sigma_i \phi(f(z_i)) \leq L \cdot \mathbb{E}\sup_f \sum_i \sigma_i f(z_i).

  • Factor 2L, absolute-value supremum without \phi(0) = 0 (Boucheron-Lugosi-Massart 2013, Theorem 11.6): dropping the centering assumption and taking \sup_f |\sum_i \sigma_i \phi(f(z_i))| costs an extra factor of 2.

  • Factor \sqrt{2} L, vector case (Maurer 2016): when \mathcal{F} is a class of \mathbb{R}^k-valued functions and \phi: \mathbb{R}^k \to \mathbb{R} is L-Lipschitz in the Euclidean norm, the vector contraction inequality carries a constant \sqrt{2} in place of 1.

Always state explicitly which version you are using, and check whether your \phi is centered.

Watch Out

Contraction is about pointwise composition, not class composition

The contraction inequality applies \phi to the output of each function f in the class. It does not compose two function classes (e.g., \mathcal{F}_1 \circ \mathcal{F}_2). Composing two function classes can increase Rademacher complexity and requires different tools (e.g., covering-number arguments for deep networks).

Summary

  • Contraction: \hat{\mathfrak{R}}_S(\phi \circ \mathcal{F}) \leq L \cdot \hat{\mathfrak{R}}_S(\mathcal{F}) for L-Lipschitz \phi with \phi(0) = 0
  • This lets you bound loss-class complexity by hypothesis-class complexity
  • Center the loss (\tilde{\phi} = \phi - \phi(0)) if \phi(0) \neq 0
  • The vector-valued version allows a different \phi_i per coordinate (same L)
  • Hinge loss: L = 1, so the loss-class Rademacher complexity is bounded by the hypothesis-class one
  • Squared loss with |t|, |y| \leq B: L = 4B
  • Contraction is why Rademacher bounds are modular: compute \mathfrak{R}_n(\mathcal{H}) once, compose with any Lipschitz loss

Exercises

ExerciseCore

Problem

The ramp loss is \ell_\gamma(t) = \min(1, \max(0, 1 - t/\gamma)) for margin parameter \gamma > 0. Show that the ramp loss is (1/\gamma)-Lipschitz and apply the contraction inequality to bound \hat{\mathfrak{R}}_S(\ell_\gamma \circ \mathcal{H}) in terms of \hat{\mathfrak{R}}_S(\mathcal{H}).

ExerciseAdvanced

Problem

Consider the logistic loss \ell(t, y) = \log(1 + e^{-yt}) for y \in \{-1, +1\}. Compute the Lipschitz constant of \phi_y(t) = \log(1 + e^{-yt}) as a function of t, and use the contraction inequality to bound the Rademacher complexity of the logistic loss class in terms of \hat{\mathfrak{R}}_S(\mathcal{H}).

ExerciseResearch

Problem

The contraction inequality gives a global Lipschitz constant L. In practice, if f \in \mathcal{F} takes values in [a, b], only the Lipschitz constant of \phi restricted to [a, b] matters. State and prove this "local contraction" refinement. How does this improve the bound for the squared loss when the hypothesis class has bounded output?

References

Canonical:

  • Ledoux & Talagrand, Probability in Banach Spaces (1991), Theorem 4.12. The canonical statement and proof of the scalar contraction principle via the two-point comparison argument.
  • Boucheron, Lugosi, Massart, Concentration Inequalities (2013), Theorem 11.6. Textbook form of the contraction corollary, with the factor-2 variant discussed.
  • Bartlett & Mendelson, "Rademacher and Gaussian Complexities" (JMLR 2002). Introduces the modern data-dependent Rademacher framework in which contraction is used.

Vector-valued and applications:

  • Maurer, "A Vector-Contraction Inequality for Rademacher Complexities" (ALT 2016). Extends contraction to \mathbb{R}^k-valued function classes with the \sqrt{2} constant.
  • Golowich, Rakhlin, Shamir, "Size-Independent Sample Complexity of Neural Networks" (COLT 2018). Uses contraction (and peeling) to bound Rademacher complexity of deep networks.
  • Bartlett, Foster, Telgarsky, "Spectrally-normalized margin bounds for neural networks" (NeurIPS 2017). Applies contraction to margin-based surrogate losses for deep networks.

Textbook treatments:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Lemma 26.9. Contraction lemma in standard learning-theory form.
  • Wainwright, High-Dimensional Statistics (2019), Chapter 4.2. Rademacher and Gaussian complexities, contraction discussion.
  • Vershynin, High-Dimensional Probability (2018), Chapter 8. Concentration of measure in Banach-space form, related to the contraction argument.

Next Topics

From contraction, the natural next steps are:

  • Algorithmic stability: moving beyond complexity-based bounds to algorithm-dependent generalization
  • Covering numbers and epsilon-nets: alternative complexity measures that interact with contraction via chaining arguments

Last reviewed: April 22, 2026
