
Empirical Processes and Chaining

Bounding the supremum of empirical processes via covering numbers and chaining: Dudley's entropy integral and Talagrand's generic chaining, the sharpest tools in classical learning theory.

Advanced · Tier 2 · Stable · Supporting · ~65 min

Why This Matters

The central question of learning theory is: how fast does the empirical risk converge to the population risk, uniformly over a function class? This question reduces to bounding the supremum of an empirical process. Dudley's entropy integral and Talagrand's generic chaining are the two main tools for this task.

Dudley's bound converts covering number estimates (geometric information about the function class) into bounds on the expected supremum. Generic chaining improves on Dudley by a logarithmic factor in many cases. Together, they represent the culmination of the covering number approach to learning theory.

theorem visual

Chaining pays for complexity one scale at a time

Instead of one huge union bound, chaining follows each function through coarse-to-fine nets and sums the small increments (coarse, middle, fine: the same target, refined by smaller and smaller increments).

  • Single scale: one \epsilon-net pays one large union-bound cost plus an approximation error.
  • Chaining: use nets at many scales and bound each increment separately.
  • Dudley: the total cost becomes \int_0^\sigma \sqrt{\log N(\epsilon)}\, d\epsilon.

Formal Setup

Definition

Empirical Process

Given i.i.d. samples X_1, \ldots, X_n from a distribution P and a function class \mathcal{F}, the empirical process is:

G_n(f) = \sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E}[f(X)]\right) = \sqrt{n}(P_n f - Pf)

where P_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i} is the empirical measure.

Definition

Supremum of the Empirical Process

The quantity of interest for uniform convergence is:

\sup_{f \in \mathcal{F}} |G_n(f)| = \sqrt{n} \sup_{f \in \mathcal{F}} |P_n f - Pf|

Bounding \mathbb{E}[\sup_{f \in \mathcal{F}} |G_n(f)|] controls how well the empirical risk approximates the population risk uniformly over \mathcal{F}.
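For concreteness, here is a minimal simulation (a sketch, assuming P = \mathrm{Uniform}[0,1] and the class of threshold functions f_t(x) = \mathbf{1}\{x \leq t\}, for which \sup_f |P_n f - Pf| is exactly the Kolmogorov-Smirnov statistic; function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_sup_Gn(n: int, reps: int = 2000) -> float:
    """Monte Carlo estimate of E[sup_f |G_n(f)|] for threshold functions
    f_t(x) = 1{x <= t} under P = Uniform[0,1]; the sup over t is the
    Kolmogorov-Smirnov statistic, attained at an order statistic."""
    sups = np.empty(reps)
    for r in range(reps):
        x = np.sort(rng.uniform(size=n))
        ecdf_hi = np.arange(1, n + 1) / n  # F_n at each order statistic
        ecdf_lo = np.arange(0, n) / n      # F_n just to the left of it
        d = max((ecdf_hi - x).max(), (x - ecdf_lo).max())
        sups[r] = np.sqrt(n) * d
    return sups.mean()

for n in [100, 1000, 10000]:
    print(n, expected_sup_Gn(n))
# The estimates stabilize near E[sup |Brownian bridge|], about 0.87:
# the supremum does not grow with n, i.e. uniform convergence at rate 1/sqrt(n).
```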

Definition

Covering Number

The covering number N(\epsilon, \mathcal{F}, d) is the minimum number of balls of radius \epsilon (in metric d) needed to cover \mathcal{F}. The metric entropy is \log N(\epsilon, \mathcal{F}, d).

Definition

Packing Number

The packing number D(\epsilon, \mathcal{F}, d) is the maximum number of points in \mathcal{F} that are pairwise at distance strictly greater than \epsilon.

Covering and packing numbers are equivalent up to a factor of 2 in the radius:

D(2\epsilon, \mathcal{F}, d) \leq N(\epsilon, \mathcal{F}, d) \leq D(\epsilon, \mathcal{F}, d)

This follows because a maximal \epsilon-packing is automatically an \epsilon-covering (no point can be at distance > \epsilon from all packing points without contradicting maximality). For chaining bounds, packing and covering numbers can be used interchangeably up to constants. See van Handel (2016), Lemma 4.1.

Metric Entropy: Catalog of Standard Classes

Before chaining can give a numerical bound, you need a covering number estimate for the function class. The following entries are the workhorses:

  • \ell_2 ball in \mathbb{R}^d of radius R, metric \ell_2: \log N(\epsilon) = d \log(3R/\epsilon). Source: Vershynin HDP, Cor. 4.2.13.
  • Linear class \{\langle w, \cdot\rangle : \|w\|_2 \leq 1\} on \|x\|_2 \leq 1, metric L_2(P_n): \log N(\epsilon) \leq d \log(3/\epsilon). Source: Wainwright HDS, Ex. 5.4.
  • VC class with \mathrm{VC}(\mathcal{F}) = d, envelope 1, metric L_2(P_n): \log N(\epsilon) \lesssim d \log(1/\epsilon). Source: van der Vaart-Wellner, Thm. 2.6.7.
  • L-Lipschitz functions on [0,1]^d with \|f\|_\infty \leq B, sup-norm: \log N(\epsilon) \asymp (L/\epsilon)^d. Source: Kolmogorov-Tikhomirov (1959).
  • C^k-smooth functions on [0,1]^d with \|f\|_{C^k} \leq B, sup-norm: \log N(\epsilon) \asymp (B/\epsilon)^{d/k}. Source: Kolmogorov-Tikhomirov (1959).
  • RKHS ball with kernel eigenvalue decay \mu_j \asymp j^{-2s}, metric L_2: \log N(\epsilon) \asymp \epsilon^{-1/s}. Source: Steinwart-Christmann, Thm. 6.27.

The qualitative regimes for chaining:

  • Polynomial entropy \log N(\epsilon) \asymp d \log(1/\epsilon): parametric regime, Dudley integral = O(\sqrt{d}), generalization O(\sqrt{d/n}).
  • Power entropy \log N(\epsilon) \asymp \epsilon^{-\alpha} with \alpha < 2: Dudley integral converges, nonparametric rate n^{-1/(2+\alpha)}.
  • Critical entropy \alpha = 2: Dudley integral diverges; chaining alone is insufficient and you need localization.
  • Heavy entropy \alpha > 2: the minimax-rate problem requires entirely different machinery (Le Cam's two-point method, Fano).

The Chaining Idea

The naive approach is to discretize \mathcal{F} at a single resolution \epsilon and take a union bound over the \epsilon-net. This gives:

\mathbb{E}\left[\sup_{f \in \mathcal{F}} |G_n(f)|\right] \lesssim \epsilon\sqrt{n} + \sqrt{\log N(\epsilon, \mathcal{F}, L_2(P_n))}

The problem: choosing \epsilon involves a tradeoff. A small \epsilon controls the approximation error but inflates the covering number; a large \epsilon does the reverse.

Chaining resolves this by discretizing at all scales simultaneously. Build a sequence of increasingly fine \epsilon-nets \mathcal{F}_0 \subset \mathcal{F}_1 \subset \cdots with \epsilon_k = 2^{-k}. For each f \in \mathcal{F}, write f = f_0 + (f_1 - f_0) + (f_2 - f_1) + \cdots, where f_k is the nearest point to f in the k-th net. Bound each increment separately. The sketch below compares the two bounds numerically.
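A minimal numerical comparison (a sketch, assuming the parametric entropy \log N(\epsilon) = d \log(3/\epsilon) from the catalog above; all constants are set to 1, so only the scaling is meaningful):

```python
import numpy as np
from scipy.integrate import quad

d, n = 20, 10_000

def log_N(eps):
    # assumed parametric entropy: log N(eps) = d * log(3/eps)
    return d * np.log(3.0 / eps)

# Single-scale bound eps*sqrt(n) + sqrt(log N(eps)), optimized over eps.
eps_grid = np.logspace(-8, 0, 2000)
single_scale = (eps_grid * np.sqrt(n) + np.sqrt(log_N(eps_grid))).min()

# Dudley's bound: integrate sqrt(log N(eps)) over all scales at once.
dudley, _ = quad(lambda e: np.sqrt(log_N(e)), 0, 1)

print(f"single-scale (optimized): {single_scale:.1f}")  # order sqrt(d log n)
print(f"Dudley integral:          {dudley:.1f}")        # order sqrt(d), smaller
```

The optimized single-scale bound carries the extra \sqrt{\log n} factor discussed in the exercises; the integral does not.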

Main Theorems

Theorem

Dudley's Entropy Integral

Statement

Let \mathcal{F} be a function class with envelope F (meaning |f| \leq F for all f \in \mathcal{F}) and let \sigma^2 = \sup_{f \in \mathcal{F}} \mathrm{Var}(f(X)). Then:

\mathbb{E}\left[\sup_{f \in \mathcal{F}} |G_n(f)|\right] \leq C \int_0^{\sigma} \sqrt{\log N(\epsilon, \mathcal{F}, L_2(P))}\, d\epsilon

where C is a universal constant. The integral converges whenever the covering numbers grow at most polynomially in 1/\epsilon as \epsilon \to 0.

Intuition

At each scale \epsilon, the contribution to the supremum is controlled by \sqrt{\log N(\epsilon)} (a union bound over the net at that scale). Chaining integrates these contributions across all scales. Coarse scales (large \epsilon) contribute little because the net is small. Fine scales (small \epsilon) contribute little because the increments are small. The integral balances these.

Proof Sketch

Fix a sequence of scales \epsilon_k = 2^{-k} \sigma and corresponding \epsilon_k-nets \mathcal{N}_k. For each f, let \pi_k(f) be the closest point in \mathcal{N}_k. Write:

G_n(f) = G_n(\pi_0(f)) + \sum_{k=1}^K [G_n(\pi_k(f)) - G_n(\pi_{k-1}(f))] + G_n(f) - G_n(\pi_K(f))

The first term is bounded by \sigma\sqrt{\log |\mathcal{N}_0|} via a union bound. Each increment G_n(\pi_k(f)) - G_n(\pi_{k-1}(f)) is sub-Gaussian with parameter O(\epsilon_{k-1}), and there are at most |\mathcal{N}_k| \cdot |\mathcal{N}_{k-1}| distinct pairs, so a union bound contributes O(\epsilon_k \sqrt{\log |\mathcal{N}_k|}) per scale. Summing over scales gives the integral.
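The sum-to-integral step is worth making explicit (a standard computation; the factor 2 comes from \epsilon_k - \epsilon_{k+1} = \epsilon_k/2). Since N(\epsilon) is non-increasing, \sqrt{\log N(\epsilon)} \geq \sqrt{\log N(\epsilon_k)} for \epsilon \leq \epsilon_k, hence:

\sum_{k \geq 0} \epsilon_k \sqrt{\log N(\epsilon_k)} = 2 \sum_{k \geq 0} (\epsilon_k - \epsilon_{k+1}) \sqrt{\log N(\epsilon_k)} \leq 2 \sum_{k \geq 0} \int_{\epsilon_{k+1}}^{\epsilon_k} \sqrt{\log N(\epsilon)}\, d\epsilon = 2 \int_0^{\sigma} \sqrt{\log N(\epsilon)}\, d\epsilon

which is the entropy integral up to a factor of 2.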

Why It Matters

This is the standard tool for converting covering number estimates into generalization bounds. For a class with \log N(\epsilon) = O(d \log(1/\epsilon)) (like VC classes or linear predictors), the integral evaluates to O(\sqrt{d}), recovering the standard O(\sqrt{d/n}) generalization bound without going through VC theory.

Failure Mode

Dudley's bound is not tight in general. It can be loose by a \sqrt{\log n} factor compared to the true expected supremum. For classes where the covering numbers have complex structure at different scales, generic chaining gives tighter results.

Theorem

Talagrand's Generic Chaining (Majorizing Measures)

Statement

For a centered Gaussian process (X_t)_{t \in T}, define the \gamma_2 functional:

\gamma_2(T, d) = \inf_{\{A_k\}} \sup_{t \in T} \sum_{k=0}^{\infty} 2^{k/2} d(t, A_k)

where the infimum is over all admissible sequences of sets A_k \subset T with |A_0| = 1 and |A_k| \leq 2^{2^k}. Then:

\frac{1}{L} \gamma_2(T, d) \leq \mathbb{E}\left[\sup_{t \in T} X_t\right] \leq L \cdot \gamma_2(T, d)

for a universal constant L. The \gamma_2 functional characterizes the expected supremum up to universal constants.

Intuition

Generic chaining replaces the uniform covering at each scale (Dudley) with an optimized covering that can vary across the set T. Some points in T may need fine resolution while others do not. The \gamma_2 functional captures this by allowing different "chains" for different points.
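To see that the entropy integral dominates \gamma_2 (a standard comparison, stated here only up to constants), take A_k to be an optimal net at the entropy-number scale e_k = \inf\{\epsilon : N(\epsilon, T, d) \leq 2^{2^k}\}, so that d(t, A_k) \leq e_k for every t. Then:

\gamma_2(T, d) \leq \sum_{k \geq 0} 2^{k/2} e_k \lesssim \int_0^\infty \sqrt{\log N(\epsilon, T, d)}\, d\epsilon

so any bound proved via \gamma_2 is at least as good as Dudley's; the converse fails in general, which is exactly the \sqrt{\log n} slack noted in the failure mode above.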

Proof Sketch

The upper bound follows the same chaining idea as Dudley, but with optimized nets AkA_k instead of minimal coverings. The lower bound (which makes this a characterization, not just an upper bound) uses the Sudakov minoration argument and a sophisticated construction of the admissible sequence.

Why It Matters

This is the definitive result on suprema of Gaussian processes. Since empirical processes can be compared to Gaussian processes via symmetrization and comparison inequalities, this provides the tightest possible bounds on learning theory quantities. It resolves a long-standing question in probability theory about when chaining gives the right answer.

Failure Mode

The \gamma_2 functional is hard to compute in general. For most applications, Dudley's entropy integral is sufficient and much more practical. Generic chaining is most useful for proving theoretical optimality results rather than computing explicit bounds.

Lower Bounds: Sudakov Minoration

Dudley's integral is an upper bound. Without a matching lower bound there is no way to know whether the integral overstates the true expected supremum. Sudakov's minoration provides a single-scale lower bound for Gaussian processes.

Theorem

Sudakov's Minoration

Statement

Let (X_t)_{t \in T} be a centered Gaussian process with canonical metric d(s, t) = \sqrt{\mathbb{E}[(X_s - X_t)^2]}. Then for every \epsilon > 0:

\mathbb{E}\left[\sup_{t \in T} X_t\right] \geq c \cdot \epsilon \sqrt{\log D(\epsilon, T, d)}

for a universal constant c. Equivalently, since N(\epsilon, T, d) \leq D(\epsilon, T, d) by the packing-covering comparison above:

\mathbb{E}\left[\sup_{t \in T} X_t\right] \geq c \cdot \epsilon \sqrt{\log N(\epsilon, T, d)}

Intuition

A packing of size D(\epsilon) in T produces D(\epsilon) Gaussian variables whose pairwise increments have standard deviation at least \epsilon. The maximum of N independent or weakly correlated standard Gaussians grows like \sqrt{2 \log N}, so taking the maximum over the packing gives a contribution of order \epsilon \sqrt{\log D(\epsilon)}. The minoration extracts this single-scale lower bound; taking the supremum over \epsilon gives the sharpest version.

Proof Sketch

Fix an \epsilon-packing \{t_1, \ldots, t_N\} with N = D(\epsilon). Compare the Gaussian process (X_{t_i}) to an independent collection of \mathcal{N}(0, \epsilon^2/2) random variables using the Sudakov-Fernique inequality (stated below): since \mathbb{E}[(X_{t_i} - X_{t_j})^2] \geq \epsilon^2, the original process has larger increments and hence at least as large an expected supremum. The expected supremum of N i.i.d. \mathcal{N}(0, \sigma^2) variables is \geq c \sigma \sqrt{\log N} for N \geq 2. See van Handel (2016), Theorem 5.30.
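A quick simulation of the mechanism (a sketch; the only fact being checked is the \sqrt{2 \log N} growth of the maximum of independent standard Gaussians, the engine behind the minoration):

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_max(N: int, reps: int = 2000, chunk: int = 100) -> float:
    """E[max of N iid N(0,1)] by Monte Carlo, in memory-friendly chunks."""
    total, done = 0.0, 0
    while done < reps:
        m = min(chunk, reps - done)
        total += rng.standard_normal((m, N)).max(axis=1).sum()
        done += m
    return total / reps

for N in [10, 100, 1000, 10000]:
    print(f"N={N:6d}  E[max]={mean_max(N):.3f}  sqrt(2 log N)={np.sqrt(2*np.log(N)):.3f}")
# E[max] approaches sqrt(2 log N) from below as N grows; a packing of size N
# therefore contributes on the order of eps * sqrt(log N) to the supremum.
```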

Why It Matters

Combined with Dudley's upper bound, Sudakov gives:

csupϵ>0ϵlogN(ϵ)E[suptXt]C0logN(ϵ)dϵc \sup_{\epsilon > 0} \epsilon \sqrt{\log N(\epsilon)} \leq \mathbb{E}\left[\sup_t X_t\right] \leq C \int_0^\infty \sqrt{\log N(\epsilon)}\, d\epsilon

The two bounds match up to a logT\sqrt{\log |T|} factor in the worst case, and exactly when logN(ϵ)\log N(\epsilon) is regularly varying. This is why Talagrand's γ2\gamma_2 functional was needed: it closes the gap and gives a characterization up to universal constants.

Failure Mode

Sudakov is sharp only at the single optimizing scale and only for Gaussian processes (or sub-Gaussian processes via a comparison theorem). For sub-exponential or heavy-tailed processes, neither Sudakov nor Dudley applies directly; you need separate tools (Bernstein-type concentration, generic chaining with \gamma_1 + \gamma_2).

Gaussian Comparison: Slepian-Sudakov-Fernique

Sudakov's proof uses a Gaussian comparison inequality. The same tool lets you compare the expected supremum of a complicated Gaussian process to a simpler reference process whose supremum is known.

Theorem

Sudakov-Fernique Inequality

Statement

Let (X_t)_{t \in T} and (Y_t)_{t \in T} be centered Gaussian processes on a finite index set T. If

\mathbb{E}[(X_s - X_t)^2] \geq \mathbb{E}[(Y_s - Y_t)^2] \quad \text{for all } s, t \in T,

then

\mathbb{E}\left[\sup_{t \in T} X_t\right] \geq \mathbb{E}\left[\sup_{t \in T} Y_t\right]

Intuition

Larger increments push points apart more, so the maximum spreads farther from zero in expectation. The comparison is on increments, not on marginal variances; only the geometry of the process matters.

Slepian's inequality is the stronger statement: under the same increment condition plus equal marginal variances (\mathbb{E}[X_t^2] = \mathbb{E}[Y_t^2]), every tail probability of \sup X_t exceeds the corresponding probability for \sup Y_t. Sudakov-Fernique drops the equal-variance condition at the cost of giving only an expectation comparison.

Proof Sketch

Interpolate between the two processes via Z_t(\theta) = \cos(\theta) X_t + \sin(\theta) Y_t, with X and Y realized independently. Differentiate \mathbb{E}[\max_t Z_t(\theta)] in \theta using Gaussian integration by parts (Stein's lemma). The increment-domination condition makes the derivative non-positive, so the expected max decreases as \theta moves from 0 (X) to \pi/2 (Y). See Vershynin HDP, Theorem 7.2.11.
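A minimal simulation of the comparison (a sketch with an equicorrelated toy process; correlating the coordinates shrinks the increments, so Sudakov-Fernique predicts the expected supremum must drop):

```python
import numpy as np

rng = np.random.default_rng(2)
N, reps, rho = 50, 20_000, 0.8

# X: independent N(0,1) coordinates, so increments E(X_s - X_t)^2 = 2.
X = rng.standard_normal((reps, N))

# Y: equicorrelated N(0,1) coordinates with correlation rho,
#    so increments E(Y_s - Y_t)^2 = 2*(1 - rho) <= 2.
shared = rng.standard_normal((reps, 1))
Y = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.standard_normal((reps, N))

# Larger increments => larger expected supremum.
print("E sup X:", X.max(axis=1).mean())  # bigger
print("E sup Y:", Y.max(axis=1).mean())  # smaller
```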

Why It Matters

This is the key tool to pass from a Gaussian process indexed by a function class to a canonical Gaussian process whose suprema are explicit. Examples:

  • Random matrix operator norm: the operator norm of an n \times n Gaussian matrix can be sandwiched between two reference processes, yielding \mathbb{E}\|G\|_{\mathrm{op}} \leq 2\sqrt{n} (Gordon's theorem).
  • Comparison to white noise: bounds on suprema of empirical processes often pass through comparison with a Gaussian process built from Brownian motion.
  • Chevet-type inequalities: bilinear bounds for products of two function classes use Sudakov-Fernique on the bilinear Gaussian process X_{u, v} = \langle u, G v \rangle.

Failure Mode

Both processes must be Gaussian. The inequality fails for sub-Gaussian processes in general; the Talagrand-Borell-Tsirelson comparison theorems are the sub-Gaussian replacements but require additional structure.

Contraction (Ledoux-Talagrand)

In learning theory you often need to compare the Rademacher complexity of a loss class \ell \circ \mathcal{H} to that of the hypothesis class \mathcal{H}. The Ledoux-Talagrand contraction inequality makes this exchange explicit when the loss is Lipschitz.

Theorem

Ledoux-Talagrand Contraction Inequality

Statement

Let \mathcal{F} be a class of real-valued functions on \mathcal{X}. Let \varphi_1, \ldots, \varphi_n : \mathbb{R} \to \mathbb{R} be L-Lipschitz with \varphi_i(0) = 0. Let \varepsilon_1, \ldots, \varepsilon_n be i.i.d. Rademacher signs and x_1, \ldots, x_n \in \mathcal{X} fixed. Then:

\mathbb{E}_\varepsilon\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^n \varepsilon_i \varphi_i(f(x_i))\right] \leq 2L \cdot \mathbb{E}_\varepsilon\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^n \varepsilon_i f(x_i)\right]

Dividing both sides by n gives the Rademacher-complexity form:

\mathfrak{R}_n(\varphi \circ \mathcal{F}) \leq 2L \cdot \mathfrak{R}_n(\mathcal{F})

The constant 2L is the standard Ledoux-Talagrand constant; under the additional assumption that \mathcal{F} is symmetric (-\mathcal{F} = \mathcal{F}), the constant tightens to L (Boucheron-Lugosi-Massart, Theorem 11.6).
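The inequality can be checked exactly on a tiny instance by enumerating all sign patterns (a sketch; the class of slopes and the choice \varphi = \tanh, which is 1-Lipschitz with \tanh(0) = 0, are illustrative):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
n = 8
x = rng.uniform(-1, 1, size=n)
slopes = np.linspace(-2, 2, 9)   # finite class F = {x -> w*x}
vals = np.outer(slopes, x)       # vals[f, i] = f(x_i), shape (|F|, n)

def rademacher_sup(values: np.ndarray) -> float:
    """Exact E_eps[sup_f sum_i eps_i * values[f, i]] over all 2^n sign patterns."""
    total = 0.0
    for signs in product([-1.0, 1.0], repeat=n):
        total += (values * np.array(signs)).sum(axis=1).max()
    return total / 2 ** n

lhs = rademacher_sup(np.tanh(vals))   # composed class phi o F
rhs = rademacher_sup(vals)            # base class F
print(f"{lhs:.4f} <= {2 * rhs:.4f}")  # contraction: lhs <= 2 * L * rhs, L = 1
```

Since this slope class is symmetric (-\mathcal{F} = \mathcal{F}), the tightened constant L = 1 also applies, and the printed left side indeed sits below rhs itself.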

Intuition

Pointwise application of an L-Lipschitz map cannot stretch distances by more than a factor of L. The Rademacher complexity measures how much the function class can correlate with random signs \pm 1 on the data, and composing with a Lipschitz map changes this correlation by at most the Lipschitz factor. The anchoring \varphi_i(0) = 0 removes a deterministic shift that would otherwise add a constant to the supremum without increasing the complexity.

Proof Sketch

Peel off one coordinate at a time. Conditional on \varepsilon_1, \ldots, \varepsilon_{n-1}, the remaining randomness is in \varepsilon_n. Show that for any values g, g' determined by the first n-1 coordinates and any pair f, f' \in \mathcal{F}:

\mathbb{E}_{\varepsilon_n}\left[\max(g + \varepsilon_n \varphi_n(f(x_n)),\ g' + \varepsilon_n \varphi_n(f'(x_n)))\right] \leq \mathbb{E}_{\varepsilon_n}\left[\max(g + L\varepsilon_n f(x_n),\ g' + L\varepsilon_n f'(x_n))\right]

This is a direct calculation using |\varphi_n(a) - \varphi_n(b)| \leq L|a - b| and the symmetry of \varepsilon_n. Iterate over coordinates. The factor of 2 enters when reducing to a symmetric class via \mathcal{F} \cup (-\mathcal{F}). See Ledoux-Talagrand (1991), Theorem 4.12; a cleaner exposition is Mohri-Rostamizadeh-Talwalkar (2018), Lemma 5.7.

Why It Matters

This is the bridge from \mathfrak{R}_n(\mathcal{H}) (the quantity you can compute or bound via covering numbers) to \mathfrak{R}_n(\ell \circ \mathcal{H}) (the quantity that appears in the generalization bound).

  • Margin loss: \varphi(t) = \min(1, \max(0, 1 - t)) is 1-Lipschitz, so the margin-loss complexity is at most 2\mathfrak{R}_n(\mathcal{H}).
  • Logistic loss: \varphi(t) = \log(1 + e^{-t}) is 1-Lipschitz; after subtracting the constant \varphi(0) = \log 2, the same bound applies.
  • Hinge loss: \varphi(t) = \max(0, 1 - t) is 1-Lipschitz with \varphi(0) = 1 \neq 0; use a centering trick or absorb the constant into the bias.

For neural networks where \mathcal{H} is bounded in \ell_2 but the post-activation is squashed by a Lipschitz nonlinearity, this gives layer-by-layer Rademacher bounds.

For the full statement and proof discussion, see the contraction inequality topic.

Failure Mode

Two failure modes appear in practice:

  1. The anchoring condition \varphi(0) = 0 matters. Without it, the inequality picks up an \mathbb{E}[|\varphi(0)|] term that does not vanish even when the function class is centered. Most ML losses can be shifted to satisfy this without changing the learned predictor.
  2. Vector contraction is more subtle. For a class of vector-valued functions f : \mathcal{X} \to \mathbb{R}^k with a Lipschitz scalarization, the Maurer (2016) vector contraction inequality gives a bound of order \sqrt{2}\, L per coordinate but incurs a k-dependence in the worst case. This matters for multi-class classification, where the effective k is the number of classes.

Connection to Learning Theory

The Rademacher complexity of a function class \mathcal{F} is:

\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i)\right]

By symmetrization, \mathbb{E}[\sup_f |P_n f - Pf|] \leq 2\mathfrak{R}_n(\mathcal{F}). Dudley's bound then gives:

\mathfrak{R}_n(\mathcal{F}) \leq \frac{C}{\sqrt{n}} \int_0^\infty \sqrt{\log N(\epsilon, \mathcal{F}, L_2(P_n))}\, d\epsilon

This is how covering numbers and chaining enter generalization bounds.

Canonical Examples

Example

Linear classifiers in R^d

For \mathcal{F} = \{x \mapsto \langle w, x \rangle : \|w\|_2 \leq 1\} on the domain \{\|x\|_2 \leq 1\}, the covering number satisfies \log N(\epsilon, \mathcal{F}, L_2) \leq d \log(3/\epsilon). Dudley's integral gives \int_0^1 \sqrt{d \log(3/\epsilon)}\, d\epsilon = O(\sqrt{d}), yielding \mathfrak{R}_n(\mathcal{F}) \leq O(\sqrt{d/n}) via the chaining route.

This bound is loose for the \ell_2-bounded class. A direct Cauchy-Schwarz argument gives a dimension-free bound: \mathfrak{R}_n(\mathcal{F}) = \mathbb{E}[\sup_w \langle w, \frac{1}{n} \sum_i \sigma_i x_i \rangle] \leq \mathbb{E}\|\frac{1}{n} \sum_i \sigma_i x_i\|_2 \leq 1/\sqrt{n}, independent of d. Chaining recovers the O(\sqrt{d/n}) rate appropriate for \ell_\infty- or \ell_1-bounded weight classes; for the \ell_2 ball, the direct bound is tighter and is the one used in standard analyses of margin-based classifiers and kernel methods.
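A quick check of the dimension-free claim (a sketch; the points are placed on the unit sphere so that \|x_i\|_2 = 1 exactly):

```python
import numpy as np

rng = np.random.default_rng(4)

def rademacher_linear(n: int, d: int, reps: int = 3000) -> float:
    """Monte Carlo estimate of R_n(F) = E|| (1/n) sum_i sigma_i x_i ||_2
    for the l2-bounded linear class, on n fixed unit-norm points."""
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)  # ||x_i||_2 = 1
    sigma = rng.choice([-1.0, 1.0], size=(reps, n))
    return np.linalg.norm(sigma @ x / n, axis=1).mean()

n = 100
for d in [2, 20, 200, 2000]:
    print(f"d={d:5d}  R_n ~ {rademacher_linear(n, d):.4f}  "
          f"bound 1/sqrt(n) = {1/np.sqrt(n):.4f}")
# The estimate stays below 0.1 for every d: no dimension dependence.
```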

Example

Lipschitz functions on the unit cube

Let \mathcal{F}_L = \{f : [0,1]^d \to [0, B] \mid f \text{ is } L\text{-Lipschitz}\}. Kolmogorov-Tikhomirov (1959) gives the sup-norm entropy:

\log N(\epsilon, \mathcal{F}_L, \|\cdot\|_\infty) \asymp \left(\frac{L}{\epsilon}\right)^d

Substituting into Dudley's integral:

\int_0^B \sqrt{(L/\epsilon)^d}\, d\epsilon = L^{d/2} \int_0^B \epsilon^{-d/2}\, d\epsilon

The integral converges only when d < 2. For d = 1, it gives O(L^{1/2} B^{1/2}) and the Rademacher complexity is O(\sqrt{LB/n}), the parametric rate. For d \geq 2, Dudley's integral diverges and you need localization: only small balls around the empirical risk minimizer matter, and the entropy of those small balls grows more slowly with \epsilon. Localized Rademacher complexity (Bartlett-Bousquet-Mendelson, 2005) gives the correct rate n^{-1/d} for d \geq 2.

This is why high-dimensional nonparametric regression suffers the curse of dimensionality even with Lipschitz smoothness: the metric entropy is just too large for naive chaining.

Example

Function class with VC dimension d

For a binary VC class \mathcal{F} with \mathrm{VC}(\mathcal{F}) = d, Haussler's theorem gives:

\log N(\epsilon, \mathcal{F}, L_2(P_n)) \leq C d \log(1/\epsilon)

Dudley's integral over [0, 1] evaluates to:

C \int_0^1 \sqrt{d \log(1/\epsilon)}\, d\epsilon = O(\sqrt{d})

(The substitution u = \log(1/\epsilon) gives \int_0^\infty \sqrt{d u}\, e^{-u}\, du = \Gamma(3/2)\sqrt{d} = O(\sqrt{d}).)

This recovers the classical \mathfrak{R}_n(\mathcal{F}) = O(\sqrt{d/n}) generalization bound through chaining, without any combinatorial Sauer-Shelah argument. The chaining route generalizes to non-binary classes; the Sauer-Shelah route does not.

Common Confusions

Watch Out

Chaining is not a single-scale argument

The power of chaining comes from simultaneously using approximations at all scales. A single-scale argument (discretize at one \epsilon) always involves a tradeoff between approximation error and the size of the net. Chaining eliminates this tradeoff by summing contributions across scales.

Watch Out

Dudley vs Talagrand: when does the difference matter?

For most function classes encountered in ML (VC classes, kernel classes, neural network classes with bounded norms), Dudley's integral gives bounds that are tight up to logarithmic factors. Generic chaining matters when the geometry of the function class has different complexity at different scales, which is rare in standard applications but important in probability theory.

Watch Out

The Lipschitz constant in contraction is not always 1

Many ML practitioners assume the contraction inequality is "free" because common losses are 1-Lipschitz. But the relevant Lipschitz constant is the one for \varphi as a function of the hypothesis output, not as a function of the loss inputs. For squared loss \ell(y, \hat{y}) = (y - \hat{y})^2 on bounded predictions |\hat{y}| \leq B and labels |y| \leq B, the Lipschitz constant in \hat{y} is 4B, not 1. Forgetting the B factor is a common source of vacuous bounds.

Watch Out

Sudakov is for Gaussian, Dudley is for sub-Gaussian

Dudley's entropy integral applies to any process with sub-Gaussian increments (i.e., \|X_s - X_t\|_{\psi_2} \leq d(s, t)). Sudakov's minoration applies only to Gaussian processes; the lower bound can fail for sub-Gaussian processes. For empirical processes, you typically use Dudley directly (since, conditionally on the data, the symmetrized empirical process is sub-Gaussian) and prove lower bounds via Le Cam's method or Fano's inequality rather than Sudakov.

Watch Out

Empirical process supremum vs uniform deviation

The supremum of the empirical process \sup_f |G_n(f)| is the natural quantity for two-sided bounds. Some papers work with the one-sided \sup_f (P_n f - Pf), which is never larger in expectation and admits slightly tighter constants. Both lead to the same generalization rate; only constants differ. Pay attention to which version a paper uses before plugging numbers in.

Summary

  • Empirical process theory reduces generalization to bounding \sup_f |P_n f - Pf|
  • Dudley's entropy integral: \mathbb{E}[\sup |G_n|] \leq C \int \sqrt{\log N(\epsilon)}\, d\epsilon
  • Chaining works by discretizing at all scales simultaneously
  • Generic chaining (\gamma_2) is tight for Gaussian processes but hard to compute
  • Sudakov-Fernique compares Gaussian processes via increment domination
  • Ledoux-Talagrand contraction passes Rademacher bounds through Lipschitz losses with constant 2L
  • For practical ML bounds, Dudley's integral with covering number estimates suffices

Exercises

Exercise (Core)

Problem

For a function class with \log N(\epsilon, \mathcal{F}, L_2) = d \log(1/\epsilon), evaluate Dudley's entropy integral up to constants and state the resulting generalization bound.

Exercise (Advanced)

Problem

Explain why a single-scale \epsilon-net argument gives a bound of O(\epsilon + \sqrt{d \log(1/\epsilon)/n}) and why optimizing over \epsilon gives O(\sqrt{d \log n / n}), which is worse than Dudley's O(\sqrt{d/n}) by a \sqrt{\log n} factor.

Exercise (Advanced)

Problem

A neural network with one hidden layer of width m, sigmoid activation, and weights bounded by \|W\|_2 \leq B has covering number satisfying \log N(\epsilon, \mathcal{H}, L_2) \leq m d \log(B/\epsilon), where d is the input dimension. Apply Dudley's integral plus the Ledoux-Talagrand contraction inequality to bound the Rademacher complexity of the loss class \ell_{0\text{-}1} \circ \mathcal{H} for the margin-loss surrogate \varphi_\rho(t) = \min(1, \max(0, 1 - t/\rho)) with margin \rho > 0.

Exercise (Research)

Problem

State Sudakov's minoration: a lower bound on the expected supremum of a Gaussian process in terms of a single-scale packing number. Explain why this shows Dudley's integral cannot be improved by more than a logarithmic factor.

References

Canonical:

  • van der Vaart & Wellner (1996), Weak Convergence and Empirical Processes, Chapters 2.5-2.6 (Dudley) and 2.7 (chaining with bracketing)
  • Vershynin (2018), High-Dimensional Probability, Chapter 8 (chaining), Theorem 7.2.11 (Sudakov-Fernique)
  • van Handel (2016), Probability in High Dimension, Chapters 4-5 (Dudley, Sudakov, generic chaining)
  • Ledoux & Talagrand (1991), Probability in Banach Spaces, Theorem 4.12 (contraction inequality)

Current:

  • Talagrand (2014), Upper and Lower Bounds for Stochastic Processes (2nd ed.) — definitive treatment of generic chaining and \gamma_2
  • Boucheron, Lugosi, Massart (2013), Concentration Inequalities, Chapters 11-13 (Rademacher contraction, chaining)
  • Wainwright (2019), High-Dimensional Statistics, Chapter 5 (uniform laws via chaining)
  • Mohri, Rostamizadeh, Talwalkar (2018), Foundations of Machine Learning (2nd ed.), Lemma 5.7 (contraction with constant 2L2L)
  • Maurer (2016), "A vector-contraction inequality for Rademacher complexities," ALT

Frontier:

  • Bartlett, Bousquet, Mendelson (2005), "Local Rademacher complexities," Ann. Stat. 33(4), 1497-1537 — localization for fast rates beyond Dudley
  • Bartlett, Foster, Telgarsky (2017), "Spectrally-normalized margin bounds for neural networks," NeurIPS — chaining-based norm capacity bounds for deep nets
  • Koltchinskii (2011), Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems, Saint-Flour XXXVIII

Last reviewed: May 5, 2026