
Empirical Processes and Chaining

Bounding the supremum of empirical processes via covering numbers and chaining: Dudley's entropy integral and Talagrand's generic chaining, the sharpest tools in classical learning theory.

Advanced · Tier 2 · Stable · Supporting · ~65 min

Why This Matters

The central question of learning theory is: how fast does the empirical risk converge to the population risk, uniformly over a function class? This question reduces to bounding the supremum of an empirical process. Dudley's entropy integral and Talagrand's generic chaining are the two main tools for this task.

Dudley's bound converts covering number estimates (geometric information about the function class) into bounds on the expected supremum. Generic chaining improves on Dudley by a logarithmic factor in many cases. Together, they represent the culmination of the covering number approach to learning theory.

theorem visual

Chaining pays for complexity one scale at a time

Instead of one huge union bound, chaining follows each function through coarse-to-fine nets and sums the small increments (coarse, middle, fine: the same target, refined by smaller and smaller increments).

  • Single scale: one \epsilon-net pays one large union-bound cost plus an approximation error.
  • Chaining: use nets at many scales and bound each increment separately.
  • Dudley: the total cost becomes \int_0^\sigma \sqrt{\log N(\epsilon)}\, d\epsilon.

Formal Setup

Definition

Empirical Process

Given i.i.d. samples X_1, \ldots, X_n from a distribution P and a function class \mathcal{F}, the empirical process is:

G_n(f) = \sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E}[f(X)]\right) = \sqrt{n}(P_n f - Pf)

where P_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i} is the empirical measure.

Definition

Supremum of the Empirical Process

The quantity of interest for uniform convergence is:

\sup_{f \in \mathcal{F}} |G_n(f)| = \sqrt{n} \sup_{f \in \mathcal{F}} |P_n f - Pf|

Bounding \mathbb{E}[\sup_{f \in \mathcal{F}} |G_n(f)|] controls how well the empirical risk approximates the population risk uniformly over \mathcal{F}.
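For concreteness, here is a minimal simulation (a sketch, assuming P = \mathrm{Uniform}[0,1] and the class of threshold functions f_t(x) = \mathbf{1}\{x \leq t\}, for which \sup_f |P_n f - Pf| is exactly the Kolmogorov-Smirnov statistic; function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_sup_Gn(n: int, reps: int = 2000) -> float:
    """Monte Carlo estimate of E[sup_f |G_n(f)|] for threshold functions
    f_t(x) = 1{x <= t} under P = Uniform[0,1]; the sup over t is the
    Kolmogorov-Smirnov statistic, attained at an order statistic."""
    sups = np.empty(reps)
    for r in range(reps):
        x = np.sort(rng.uniform(size=n))
        ecdf_hi = np.arange(1, n + 1) / n  # F_n at each order statistic
        ecdf_lo = np.arange(0, n) / n      # F_n just to the left of it
        d = max((ecdf_hi - x).max(), (x - ecdf_lo).max())
        sups[r] = np.sqrt(n) * d
    return sups.mean()

for n in [100, 1000, 10000]:
    print(n, expected_sup_Gn(n))
# The estimates stabilize near E[sup |Brownian bridge|], about 0.87:
# the supremum does not grow with n, i.e. uniform convergence at rate 1/sqrt(n).
```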

Definition

Covering Number

The covering number N(\epsilon, \mathcal{F}, d) is the minimum number of balls of radius \epsilon (in metric d) needed to cover \mathcal{F}. The metric entropy is \log N(\epsilon, \mathcal{F}, d).

Definition

Packing Number

The packing number D(\epsilon, \mathcal{F}, d) is the maximum number of points in \mathcal{F} that are pairwise at distance strictly greater than \epsilon.

Covering and packing numbers are equivalent up to a factor of 2 in the radius:

D(2\epsilon, \mathcal{F}, d) \leq N(\epsilon, \mathcal{F}, d) \leq D(\epsilon, \mathcal{F}, d)

This follows because a maximal \epsilon-packing is automatically an \epsilon-covering (no point can be at distance > \epsilon from all packing points without contradicting maximality). For chaining bounds, packing and covering numbers can be used interchangeably up to constants. See van Handel (2016), Lemma 4.1.

Metric Entropy: Catalog of Standard Classes

Before chaining can give a numerical bound, you need a covering number estimate for the function class. The following entries are the workhorses:

  • \ell_2 ball in \mathbb{R}^d of radius R, metric \ell_2: \log N(\epsilon) = d \log(3R/\epsilon). Source: Vershynin HDP, Cor. 4.2.13.
  • Linear class \{\langle w, \cdot\rangle : \|w\|_2 \leq 1\} on \|x\|_2 \leq 1, metric L_2(P_n): \log N(\epsilon) \leq d \log(3/\epsilon). Source: Wainwright HDS, Ex. 5.4.
  • VC class with \mathrm{VC}(\mathcal{F}) = d, envelope 1, metric L_2(P_n): \log N(\epsilon) \lesssim d \log(1/\epsilon). Source: van der Vaart-Wellner, Thm. 2.6.7.
  • L-Lipschitz functions on [0,1]^d with \|f\|_\infty \leq B, sup-norm: \log N(\epsilon) \asymp (L/\epsilon)^d. Source: Kolmogorov-Tikhomirov (1959).
  • C^k-smooth functions on [0,1]^d with \|f\|_{C^k} \leq B, sup-norm: \log N(\epsilon) \asymp (B/\epsilon)^{d/k}. Source: Kolmogorov-Tikhomirov (1959).
  • RKHS ball with kernel eigenvalue decay \mu_j \asymp j^{-2s}, metric L_2: \log N(\epsilon) \asymp \epsilon^{-1/s}. Source: Steinwart-Christmann, Thm. 6.27.

The qualitative regimes for chaining:

  • Polynomial entropy \log N(\epsilon) \asymp d \log(1/\epsilon): parametric regime, Dudley integral = O(\sqrt{d}), generalization O(\sqrt{d/n}).
  • Power entropy \log N(\epsilon) \asymp \epsilon^{-\alpha} with \alpha < 2: Dudley integral converges, nonparametric rate n^{-1/(2+\alpha)}.
  • Critical entropy \alpha = 2: Dudley integral diverges; chaining alone is insufficient and you need localization.
  • Heavy entropy \alpha > 2: the minimax-rate problem requires entirely different machinery (Le Cam's two-point method, Fano).

The Chaining Idea

The naive approach is to discretize \mathcal{F} at a single resolution \epsilon and take a union bound over the \epsilon-net. This gives:

\mathbb{E}\left[\sup_{f \in \mathcal{F}} |G_n(f)|\right] \lesssim \epsilon\sqrt{n} + \sqrt{\log N(\epsilon, \mathcal{F}, L_2(P_n))}

The problem: choosing \epsilon involves a tradeoff. A small \epsilon controls the approximation error but inflates the covering number; a large \epsilon does the reverse.

Chaining resolves this by discretizing at all scales simultaneously. Build a sequence of increasingly fine \epsilon-nets \mathcal{F}_0 \subset \mathcal{F}_1 \subset \cdots with \epsilon_k = 2^{-k}. For each f \in \mathcal{F}, write f = f_0 + (f_1 - f_0) + (f_2 - f_1) + \cdots, where f_k is the nearest point to f in the k-th net. Bound each increment separately. The sketch below compares the two bounds numerically.
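A minimal numerical comparison (a sketch, assuming the parametric entropy \log N(\epsilon) = d \log(3/\epsilon) from the catalog above; all constants are set to 1, so only the scaling is meaningful):

```python
import numpy as np
from scipy.integrate import quad

d, n = 20, 10_000

def log_N(eps):
    # assumed parametric entropy: log N(eps) = d * log(3/eps)
    return d * np.log(3.0 / eps)

# Single-scale bound eps*sqrt(n) + sqrt(log N(eps)), optimized over eps.
eps_grid = np.logspace(-8, 0, 2000)
single_scale = (eps_grid * np.sqrt(n) + np.sqrt(log_N(eps_grid))).min()

# Dudley's bound: integrate sqrt(log N(eps)) over all scales at once.
dudley, _ = quad(lambda e: np.sqrt(log_N(e)), 0, 1)

print(f"single-scale (optimized): {single_scale:.1f}")  # order sqrt(d log n)
print(f"Dudley integral:          {dudley:.1f}")        # order sqrt(d), smaller
```

The optimized single-scale bound carries the extra \sqrt{\log n} factor discussed in the exercises; the integral does not.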

Main Theorems

Theorem

Dudley's Entropy Integral

Statement

Let \mathcal{F} be a function class with envelope F (meaning |f| \leq F for all f \in \mathcal{F}) and let \sigma^2 = \sup_{f \in \mathcal{F}} \mathrm{Var}(f(X)). Then:

\mathbb{E}\left[\sup_{f \in \mathcal{F}} |G_n(f)|\right] \leq C \int_0^{\sigma} \sqrt{\log N(\epsilon, \mathcal{F}, L_2(P))}\, d\epsilon

where C is a universal constant. The integral converges whenever the covering numbers grow at most polynomially in 1/\epsilon as \epsilon \to 0.

Intuition

At each scale \epsilon, the contribution to the supremum is controlled by \sqrt{\log N(\epsilon)} (a union bound over the net at that scale). Chaining integrates these contributions across all scales. Coarse scales (large \epsilon) contribute little because the net is small. Fine scales (small \epsilon) contribute little because the increments are small. The integral balances these.

Proof Sketch

Fix a sequence of scales \epsilon_k = 2^{-k} \sigma and corresponding \epsilon_k-nets \mathcal{N}_k. For each f, let \pi_k(f) be the closest point in \mathcal{N}_k. Write:

G_n(f) = G_n(\pi_0(f)) + \sum_{k=1}^K [G_n(\pi_k(f)) - G_n(\pi_{k-1}(f))] + G_n(f) - G_n(\pi_K(f))

The first term is bounded by \sigma\sqrt{\log |\mathcal{N}_0|} via a union bound. Each increment G_n(\pi_k(f)) - G_n(\pi_{k-1}(f)) is sub-Gaussian with parameter O(\epsilon_{k-1}), and there are at most |\mathcal{N}_k| \cdot |\mathcal{N}_{k-1}| distinct pairs, so a union bound contributes O(\epsilon_k \sqrt{\log |\mathcal{N}_k|}) per scale. Summing over scales gives the integral.
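The sum-to-integral step is worth making explicit (a standard computation; the factor 2 comes from \epsilon_k - \epsilon_{k+1} = \epsilon_k/2). Since N(\epsilon) is non-increasing, \sqrt{\log N(\epsilon)} \geq \sqrt{\log N(\epsilon_k)} for \epsilon \leq \epsilon_k, hence:

\sum_{k \geq 0} \epsilon_k \sqrt{\log N(\epsilon_k)} = 2 \sum_{k \geq 0} (\epsilon_k - \epsilon_{k+1}) \sqrt{\log N(\epsilon_k)} \leq 2 \sum_{k \geq 0} \int_{\epsilon_{k+1}}^{\epsilon_k} \sqrt{\log N(\epsilon)}\, d\epsilon = 2 \int_0^{\sigma} \sqrt{\log N(\epsilon)}\, d\epsilon

which is the entropy integral up to a factor of 2.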

Why It Matters

This is the standard tool for converting covering number estimates into generalization bounds. For a class with \log N(\epsilon) = O(d \log(1/\epsilon)) (like VC classes or linear predictors), the integral evaluates to O(\sqrt{d}), recovering the standard O(\sqrt{d/n}) generalization bound without going through VC theory.

Failure Mode

Dudley's bound is not tight in general. It can be loose by a \sqrt{\log n} factor compared to the true expected supremum. For classes where the covering numbers have complex structure at different scales, generic chaining gives tighter results.

Theorem

Talagrand's Generic Chaining (Majorizing Measures)

Statement

For a centered Gaussian process (X_t)_{t \in T}, define the \gamma_2 functional:

\gamma_2(T, d) = \inf_{\{A_k\}} \sup_{t \in T} \sum_{k=0}^{\infty} 2^{k/2} d(t, A_k)

where the infimum is over all admissible sequences of sets A_k \subset T with |A_0| = 1 and |A_k| \leq 2^{2^k}. Then:

\frac{1}{L} \gamma_2(T, d) \leq \mathbb{E}\left[\sup_{t \in T} X_t\right] \leq L \cdot \gamma_2(T, d)

for a universal constant L. The \gamma_2 functional characterizes the expected supremum up to universal constants.

Intuition

Generic chaining replaces the uniform covering at each scale (Dudley) with an optimized covering that can vary across the set T. Some points in T may need fine resolution while others do not. The \gamma_2 functional captures this by allowing different "chains" for different points.
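To see that the entropy integral dominates \gamma_2 (a standard comparison, stated here only up to constants), take A_k to be an optimal net at the entropy-number scale e_k = \inf\{\epsilon : N(\epsilon, T, d) \leq 2^{2^k}\}, so that d(t, A_k) \leq e_k for every t. Then:

\gamma_2(T, d) \leq \sum_{k \geq 0} 2^{k/2} e_k \lesssim \int_0^\infty \sqrt{\log N(\epsilon, T, d)}\, d\epsilon

so any bound proved via \gamma_2 is at least as good as Dudley's; the converse fails in general, which is exactly the \sqrt{\log n} slack noted in the failure mode above.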

Proof Sketch

The upper bound follows the same chaining idea as Dudley, but with optimized nets AkA_k instead of minimal coverings. The lower bound (which makes this a characterization, not just an upper bound) uses the Sudakov minoration argument and a sophisticated construction of the admissible sequence.

Why It Matters

This is the definitive result on suprema of Gaussian processes. Since empirical processes can be compared to Gaussian processes via symmetrization and comparison inequalities, this provides the tightest possible bounds on learning theory quantities. It resolves a long-standing question in probability theory about when chaining gives the right answer.

Failure Mode

The \gamma_2 functional is hard to compute in general. For most applications, Dudley's entropy integral is sufficient and much more practical. Generic chaining is most useful for proving theoretical optimality results rather than computing explicit bounds.

Lower Bounds: Sudakov Minoration

Dudley's integral is an upper bound. Without a matching lower bound there is no way to know whether the integral overstates the true expected supremum. Sudakov's minoration provides a single-scale lower bound for Gaussian processes.

Theorem

Sudakov's Minoration

Statement

Let (X_t)_{t \in T} be a centered Gaussian process with canonical metric d(s, t) = \sqrt{\mathbb{E}[(X_s - X_t)^2]}. Then for every \epsilon > 0:

\mathbb{E}\left[\sup_{t \in T} X_t\right] \geq c \cdot \epsilon \sqrt{\log D(\epsilon, T, d)}

for a universal constant c. Equivalently, since N(\epsilon, T, d) \leq D(\epsilon, T, d) by the packing-covering comparison above:

\mathbb{E}\left[\sup_{t \in T} X_t\right] \geq c \cdot \epsilon \sqrt{\log N(\epsilon, T, d)}

Intuition

A packing of size D(\epsilon) in T produces D(\epsilon) Gaussian variables whose pairwise increments have standard deviation at least \epsilon. The maximum of N independent or weakly correlated standard Gaussians grows like \sqrt{2 \log N}, so taking the maximum over the packing gives a contribution of order \epsilon \sqrt{\log D(\epsilon)}. The minoration extracts this single-scale lower bound; taking the supremum over \epsilon gives the sharpest version.

Proof Sketch

Fix an \epsilon-packing \{t_1, \ldots, t_N\} with N = D(\epsilon). Compare the Gaussian process (X_{t_i}) to an independent collection of \mathcal{N}(0, \epsilon^2/2) random variables using the Sudakov-Fernique inequality (stated below): since \mathbb{E}[(X_{t_i} - X_{t_j})^2] \geq \epsilon^2, the original process has larger increments and hence at least as large an expected supremum. The expected supremum of N i.i.d. \mathcal{N}(0, \sigma^2) variables is \geq c \sigma \sqrt{\log N} for N \geq 2. See van Handel (2016), Theorem 5.30.
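A quick simulation of the mechanism (a sketch; the only fact being checked is the \sqrt{2 \log N} growth of the maximum of independent standard Gaussians, the engine behind the minoration):

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_max(N: int, reps: int = 2000, chunk: int = 100) -> float:
    """E[max of N iid N(0,1)] by Monte Carlo, in memory-friendly chunks."""
    total, done = 0.0, 0
    while done < reps:
        m = min(chunk, reps - done)
        total += rng.standard_normal((m, N)).max(axis=1).sum()
        done += m
    return total / reps

for N in [10, 100, 1000, 10000]:
    print(f"N={N:6d}  E[max]={mean_max(N):.3f}  sqrt(2 log N)={np.sqrt(2*np.log(N)):.3f}")
# E[max] approaches sqrt(2 log N) from below as N grows; a packing of size N
# therefore contributes on the order of eps * sqrt(log N) to the supremum.
```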

Why It Matters

Combined with Dudley's upper bound, Sudakov gives:

csupϵ>0ϵlogN(ϵ)E[suptXt]C0logN(ϵ)dϵc \sup_{\epsilon > 0} \epsilon \sqrt{\log N(\epsilon)} \leq \mathbb{E}\left[\sup_t X_t\right] \leq C \int_0^\infty \sqrt{\log N(\epsilon)}\, d\epsilon

The two bounds match up to a logT\sqrt{\log |T|} factor in the worst case, and exactly when logN(ϵ)\log N(\epsilon) is regularly varying. This is why Talagrand's γ2\gamma_2 functional was needed: it closes the gap and gives a characterization up to universal constants.

Failure Mode

Sudakov is sharp only at the single optimizing scale and only for Gaussian processes (or sub-Gaussian processes via a comparison theorem). For sub-exponential or heavy-tailed processes, neither Sudakov nor Dudley applies directly; you need separate tools (Bernstein-type concentration, generic chaining with \gamma_1 + \gamma_2).

Gaussian Comparison: Slepian-Sudakov-Fernique

Sudakov's proof uses a Gaussian comparison inequality. The same tool lets you compare the expected supremum of a complicated Gaussian process to a simpler reference process whose supremum is known.

Theorem

Sudakov-Fernique Inequality

Statement

Let (X_t)_{t \in T} and (Y_t)_{t \in T} be centered Gaussian processes on a finite index set T. If

\mathbb{E}[(X_s - X_t)^2] \geq \mathbb{E}[(Y_s - Y_t)^2] \quad \text{for all } s, t \in T,

then

\mathbb{E}\left[\sup_{t \in T} X_t\right] \geq \mathbb{E}\left[\sup_{t \in T} Y_t\right]

Intuition

Larger increments push points apart more, so the maximum spreads farther from zero in expectation. The comparison is on increments, not on marginal variances; only the geometry of the process matters.

Slepian's inequality is the stronger statement: under the same increment condition plus equal marginal variances (\mathbb{E}[X_t^2] = \mathbb{E}[Y_t^2]), every tail probability of \sup X_t exceeds the corresponding probability for \sup Y_t. Sudakov-Fernique drops the equal-variance condition at the cost of giving only an expectation comparison.

Proof Sketch

Interpolate between the two processes via Z_t(\theta) = \cos(\theta) X_t + \sin(\theta) Y_t, with X and Y realized independently. Differentiate \mathbb{E}[\max_t Z_t(\theta)] in \theta using Gaussian integration by parts (Stein's lemma). The increment-domination condition makes the derivative non-positive, so the expected max decreases as \theta moves from 0 (X) to \pi/2 (Y). See Vershynin HDP, Theorem 7.2.11.
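A minimal simulation of the comparison (a sketch with an equicorrelated toy process; correlating the coordinates shrinks the increments, so Sudakov-Fernique predicts the expected supremum must drop):

```python
import numpy as np

rng = np.random.default_rng(2)
N, reps, rho = 50, 20_000, 0.8

# X: independent N(0,1) coordinates, so increments E(X_s - X_t)^2 = 2.
X = rng.standard_normal((reps, N))

# Y: equicorrelated N(0,1) coordinates with correlation rho,
#    so increments E(Y_s - Y_t)^2 = 2*(1 - rho) <= 2.
shared = rng.standard_normal((reps, 1))
Y = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.standard_normal((reps, N))

# Larger increments => larger expected supremum.
print("E sup X:", X.max(axis=1).mean())  # bigger
print("E sup Y:", Y.max(axis=1).mean())  # smaller
```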

Why It Matters

This is the key tool to pass from a Gaussian process indexed by a function class to a canonical Gaussian process whose suprema are explicit. Examples:

  • Random matrix operator norm: the operator norm of an n \times n Gaussian matrix can be sandwiched between two reference processes, yielding \mathbb{E}\|G\|_{\mathrm{op}} \leq 2\sqrt{n} (Gordon's theorem).
  • Comparison to white noise: bounds on suprema of empirical processes often pass through comparison with a Gaussian process built from Brownian motion.
  • Chevet-type inequalities: bilinear bounds for products of two function classes use Sudakov-Fernique on the bilinear Gaussian process X_{u, v} = \langle u, G v \rangle.

Failure Mode

Both processes must be Gaussian. The inequality fails for sub-Gaussian processes in general; the Talagrand-Borell-Tsirelson comparison theorems are the sub-Gaussian replacements but require additional structure.

Contraction (Ledoux-Talagrand)

In learning theory you often need to compare the Rademacher complexity of a loss class \ell \circ \mathcal{H} to that of the hypothesis class \mathcal{H}. The Ledoux-Talagrand contraction inequality makes this exchange explicit when the loss is Lipschitz.

Theorem

Ledoux-Talagrand Contraction Inequality

Statement

Let \mathcal{F} be a class of real-valued functions on \mathcal{X}. Let \varphi_1, \ldots, \varphi_n : \mathbb{R} \to \mathbb{R} be L-Lipschitz with \varphi_i(0) = 0. Let \varepsilon_1, \ldots, \varepsilon_n be i.i.d. Rademacher signs and x_1, \ldots, x_n \in \mathcal{X} fixed. Then:

\mathbb{E}_\varepsilon\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^n \varepsilon_i \varphi_i(f(x_i))\right] \leq 2L \cdot \mathbb{E}_\varepsilon\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^n \varepsilon_i f(x_i)\right]

Dividing both sides by n gives the Rademacher-complexity form:

\mathfrak{R}_n(\varphi \circ \mathcal{F}) \leq 2L \cdot \mathfrak{R}_n(\mathcal{F})

The constant 2L is the standard Ledoux-Talagrand constant; under the additional assumption that \mathcal{F} is symmetric (-\mathcal{F} = \mathcal{F}), the constant tightens to L (Boucheron-Lugosi-Massart, Theorem 11.6).
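The inequality can be checked exactly on a tiny instance by enumerating all sign patterns (a sketch; the class of slopes and the choice \varphi = \tanh, which is 1-Lipschitz with \tanh(0) = 0, are illustrative):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
n = 8
x = rng.uniform(-1, 1, size=n)
slopes = np.linspace(-2, 2, 9)   # finite class F = {x -> w*x}
vals = np.outer(slopes, x)       # vals[f, i] = f(x_i), shape (|F|, n)

def rademacher_sup(values: np.ndarray) -> float:
    """Exact E_eps[sup_f sum_i eps_i * values[f, i]] over all 2^n sign patterns."""
    total = 0.0
    for signs in product([-1.0, 1.0], repeat=n):
        total += (values * np.array(signs)).sum(axis=1).max()
    return total / 2 ** n

lhs = rademacher_sup(np.tanh(vals))   # composed class phi o F
rhs = rademacher_sup(vals)            # base class F
print(f"{lhs:.4f} <= {2 * rhs:.4f}")  # contraction: lhs <= 2 * L * rhs, L = 1
```

Since this slope class is symmetric (-\mathcal{F} = \mathcal{F}), the tightened constant L = 1 also applies, and the printed left side indeed sits below rhs itself.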

Intuition

Pointwise application of an L-Lipschitz map cannot stretch distances by more than a factor of L. The Rademacher complexity measures how much the function class can correlate with random signs \pm 1 on the data, and composing with a Lipschitz map changes this correlation by at most the Lipschitz factor. The anchoring \varphi_i(0) = 0 removes a deterministic shift that would otherwise add a constant to the supremum without increasing the complexity.

Proof Sketch

Peel off one coordinate at a time. Conditional on \varepsilon_1, \ldots, \varepsilon_{n-1}, the remaining randomness is in \varepsilon_n. Show that for any values g, g' determined by the first n-1 coordinates and any pair f, f' \in \mathcal{F}:

\mathbb{E}_{\varepsilon_n}\left[\max(g + \varepsilon_n \varphi_n(f(x_n)),\ g' + \varepsilon_n \varphi_n(f'(x_n)))\right] \leq \mathbb{E}_{\varepsilon_n}\left[\max(g + L\varepsilon_n f(x_n),\ g' + L\varepsilon_n f'(x_n))\right]

This is a direct calculation using |\varphi_n(a) - \varphi_n(b)| \leq L|a - b| and the symmetry of \varepsilon_n. Iterate over coordinates. The factor of 2 enters when reducing to a symmetric class via \mathcal{F} \cup (-\mathcal{F}). See Ledoux-Talagrand (1991), Theorem 4.12; a cleaner exposition is Mohri-Rostamizadeh-Talwalkar (2018), Lemma 5.7.

Why It Matters

This is the bridge from \mathfrak{R}_n(\mathcal{H}) (the quantity you can compute or bound via covering numbers) to \mathfrak{R}_n(\ell \circ \mathcal{H}) (the quantity that appears in the generalization bound).

  • Margin loss: \varphi(t) = \min(1, \max(0, 1 - t)) is 1-Lipschitz, so the margin-loss complexity is at most 2\mathfrak{R}_n(\mathcal{H}).
  • Logistic loss: \varphi(t) = \log(1 + e^{-t}) is 1-Lipschitz; after subtracting the constant \varphi(0) = \log 2, the same bound applies.
  • Hinge loss: \varphi(t) = \max(0, 1 - t) is 1-Lipschitz with \varphi(0) = 1 \neq 0; use a centering trick or absorb the constant into the bias.

For neural networks where \mathcal{H} is bounded in \ell_2 but the post-activation is squashed by a Lipschitz nonlinearity, this gives layer-by-layer Rademacher bounds.

For the full statement and proof discussion, see the contraction inequality topic.

Failure Mode

Two failure modes appear in practice:

  1. The anchoring condition \varphi(0) = 0 matters. Without it, the inequality picks up an \mathbb{E}[|\varphi(0)|] term that does not vanish even when the function class is centered. Most ML losses can be shifted to satisfy this without changing the learned predictor.
  2. Vector contraction is more subtle. For a class of vector-valued functions f : \mathcal{X} \to \mathbb{R}^k with a Lipschitz scalarization, the Maurer (2016) vector contraction inequality gives a bound of order \sqrt{2}\, L per coordinate but incurs a k-dependence in the worst case. This matters for multi-class classification, where the effective k is the number of classes.

Connection to Learning Theory

The Rademacher complexity of a function class \mathcal{F} is:

\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i)\right]

By symmetrization, \mathbb{E}[\sup_f |P_n f - Pf|] \leq 2\mathfrak{R}_n(\mathcal{F}). Dudley's bound then gives:

\mathfrak{R}_n(\mathcal{F}) \leq \frac{C}{\sqrt{n}} \int_0^\infty \sqrt{\log N(\epsilon, \mathcal{F}, L_2(P_n))}\, d\epsilon

This is how covering numbers and chaining enter generalization bounds.

Canonical Examples

Example

Linear classifiers in R^d

For \mathcal{F} = \{x \mapsto \langle w, x \rangle : \|w\|_2 \leq 1\} on the domain \{\|x\|_2 \leq 1\}, the covering number satisfies \log N(\epsilon, \mathcal{F}, L_2) \leq d \log(3/\epsilon). Dudley's integral gives \int_0^1 \sqrt{d \log(3/\epsilon)}\, d\epsilon = O(\sqrt{d}), yielding \mathfrak{R}_n(\mathcal{F}) \leq O(\sqrt{d/n}) via the chaining route.

This bound is loose for the \ell_2-bounded class. A direct Cauchy-Schwarz argument gives a dimension-free bound: \mathfrak{R}_n(\mathcal{F}) = \mathbb{E}[\sup_w \langle w, \frac{1}{n} \sum_i \sigma_i x_i \rangle] \leq \mathbb{E}\|\frac{1}{n} \sum_i \sigma_i x_i\|_2 \leq 1/\sqrt{n}, independent of d. Chaining recovers the O(\sqrt{d/n}) rate appropriate for \ell_\infty- or \ell_1-bounded weight classes; for the \ell_2 ball, the direct bound is tighter and is the one used in standard analyses of margin-based classifiers and kernel methods.
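A quick check of the dimension-free claim (a sketch; the points are placed on the unit sphere so that \|x_i\|_2 = 1 exactly):

```python
import numpy as np

rng = np.random.default_rng(4)

def rademacher_linear(n: int, d: int, reps: int = 3000) -> float:
    """Monte Carlo estimate of R_n(F) = E|| (1/n) sum_i sigma_i x_i ||_2
    for the l2-bounded linear class, on n fixed unit-norm points."""
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)  # ||x_i||_2 = 1
    sigma = rng.choice([-1.0, 1.0], size=(reps, n))
    return np.linalg.norm(sigma @ x / n, axis=1).mean()

n = 100
for d in [2, 20, 200, 2000]:
    print(f"d={d:5d}  R_n ~ {rademacher_linear(n, d):.4f}  "
          f"bound 1/sqrt(n) = {1/np.sqrt(n):.4f}")
# The estimate stays below 0.1 for every d: no dimension dependence.
```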

Example

Lipschitz functions on the unit cube

Let \mathcal{F}_L = \{f : [0,1]^d \to [0, B] \mid f \text{ is } L\text{-Lipschitz}\}. Kolmogorov-Tikhomirov (1959) gives the sup-norm entropy:

\log N(\epsilon, \mathcal{F}_L, \|\cdot\|_\infty) \asymp \left(\frac{L}{\epsilon}\right)^d

Substituting into Dudley's integral:

\int_0^B \sqrt{(L/\epsilon)^d}\, d\epsilon = L^{d/2} \int_0^B \epsilon^{-d/2}\, d\epsilon

The integral converges only when d < 2. For d = 1, it gives O(L^{1/2} B^{1/2}) and the Rademacher complexity is O(\sqrt{LB/n}), the parametric rate. For d \geq 2, Dudley's integral diverges and you need localization: only small balls around the empirical risk minimizer matter, and the entropy of those small balls grows more slowly with \epsilon. Localized Rademacher complexity (Bartlett-Bousquet-Mendelson, 2005) gives the correct rate n^{-1/d} for d \geq 2.

This is why high-dimensional nonparametric regression suffers the curse of dimensionality even with Lipschitz smoothness: the metric entropy is just too large for naive chaining.

Example

Function class with VC dimension d

For a binary VC class \mathcal{F} with \mathrm{VC}(\mathcal{F}) = d, Haussler's theorem gives:

\log N(\epsilon, \mathcal{F}, L_2(P_n)) \leq C d \log(1/\epsilon)

Dudley's integral over [0, 1] evaluates to:

C \int_0^1 \sqrt{d \log(1/\epsilon)}\, d\epsilon = O(\sqrt{d})

(The substitution u = \log(1/\epsilon) gives \int_0^\infty \sqrt{d u}\, e^{-u}\, du = \Gamma(3/2)\sqrt{d} = O(\sqrt{d}).)

This recovers the classical \mathfrak{R}_n(\mathcal{F}) = O(\sqrt{d/n}) generalization bound through chaining, without any combinatorial Sauer-Shelah argument. The chaining route generalizes to non-binary classes; the Sauer-Shelah route does not.

Common Confusions

Watch Out

Chaining is not a single-scale argument

The power of chaining comes from simultaneously using approximations at all scales. A single-scale argument (discretize at one \epsilon) always involves a tradeoff between approximation error and the size of the net. Chaining eliminates this tradeoff by summing contributions across scales.

Watch Out

Dudley vs Talagrand: when does the difference matter?

For most function classes encountered in ML (VC classes, kernel classes, neural network classes with bounded norms), Dudley's integral gives bounds that are tight up to logarithmic factors. Generic chaining matters when the geometry of the function class has different complexity at different scales, which is rare in standard applications but important in probability theory.

Watch Out

The Lipschitz constant in contraction is not always 1

Many ML practitioners assume the contraction inequality is "free" because common losses are 1-Lipschitz. But the relevant Lipschitz constant is the one for \varphi as a function of the hypothesis output, not as a function of the loss inputs. For squared loss \ell(y, \hat{y}) = (y - \hat{y})^2 on bounded predictions |\hat{y}| \leq B and labels |y| \leq B, the Lipschitz constant in \hat{y} is 4B, not 1. Forgetting the B factor is a common source of vacuous bounds.

Watch Out

Sudakov is for Gaussian, Dudley is for sub-Gaussian

Dudley's entropy integral applies to any process with sub-Gaussian increments (i.e., \|X_s - X_t\|_{\psi_2} \leq d(s, t)). Sudakov's minoration applies only to Gaussian processes; the lower bound can fail for sub-Gaussian processes. For empirical processes, you typically use Dudley directly (since, conditionally on the data, the symmetrized empirical process is sub-Gaussian) and prove lower bounds via Le Cam's method or Fano's inequality rather than Sudakov.

Watch Out

Empirical process supremum vs uniform deviation

The supremum of the empirical process \sup_f |G_n(f)| is the natural quantity for two-sided bounds. Some papers work with the one-sided \sup_f (P_n f - Pf), which is never larger in expectation and admits slightly tighter constants. Both lead to the same generalization rate; only constants differ. Pay attention to which version a paper uses before plugging numbers in.

Summary

  • Empirical process theory reduces generalization to bounding \sup_f |P_n f - Pf|
  • Dudley's entropy integral: \mathbb{E}[\sup |G_n|] \leq C \int \sqrt{\log N(\epsilon)}\, d\epsilon
  • Chaining works by discretizing at all scales simultaneously
  • Generic chaining (\gamma_2) is tight for Gaussian processes but hard to compute
  • Sudakov-Fernique compares Gaussian processes via increment domination
  • Ledoux-Talagrand contraction passes Rademacher bounds through Lipschitz losses with constant 2L
  • For practical ML bounds, Dudley's integral with covering number estimates suffices

Exercises

Exercise (Core)

Problem

For a function class with \log N(\epsilon, \mathcal{F}, L_2) = d \log(1/\epsilon), evaluate Dudley's entropy integral up to constants and state the resulting generalization bound.

Exercise (Advanced)

Problem

Explain why a single-scale \epsilon-net argument gives a bound of O(\epsilon + \sqrt{d \log(1/\epsilon)/n}) and why optimizing over \epsilon gives O(\sqrt{d \log n / n}), which is worse than Dudley's O(\sqrt{d/n}) by a \sqrt{\log n} factor.

Exercise (Advanced)

Problem

A neural network with one hidden layer of width m, sigmoid activation, and weights bounded by \|W\|_2 \leq B has covering number satisfying \log N(\epsilon, \mathcal{H}, L_2) \leq m d \log(B/\epsilon), where d is the input dimension. Apply Dudley's integral plus the Ledoux-Talagrand contraction inequality to bound the Rademacher complexity of the loss class \ell_{0\text{-}1} \circ \mathcal{H} for the margin-loss surrogate \varphi_\rho(t) = \min(1, \max(0, 1 - t/\rho)) with margin \rho > 0.

Exercise (Research)

Problem

State Sudakov's minoration: a lower bound on the expected supremum of a Gaussian process in terms of a single-scale packing number. Explain why this shows Dudley's integral cannot be improved by more than a logarithmic factor.

References

Canonical:

  • van der Vaart & Wellner (1996), Weak Convergence and Empirical Processes, Chapters 2.5-2.6 (Dudley) and 2.7 (chaining with bracketing)
  • Vershynin (2018), High-Dimensional Probability, Chapter 8 (chaining), Theorem 7.2.11 (Sudakov-Fernique)
  • van Handel (2016), Probability in High Dimension, Chapters 4-5 (Dudley, Sudakov, generic chaining)
  • Ledoux & Talagrand (1991), Probability in Banach Spaces, Theorem 4.12 (contraction inequality)

Current:

  • Talagrand (2014), Upper and Lower Bounds for Stochastic Processes (2nd ed.) — definitive treatment of generic chaining and \gamma_2
  • Boucheron, Lugosi, Massart (2013), Concentration Inequalities, Chapters 11-13 (Rademacher contraction, chaining)
  • Wainwright (2019), High-Dimensional Statistics, Chapter 5 (uniform laws via chaining)
  • Mohri, Rostamizadeh, Talwalkar (2018), Foundations of Machine Learning (2nd ed.), Lemma 5.7 (contraction with constant 2L2L)
  • Maurer (2016), "A vector-contraction inequality for Rademacher complexities," ALT

Frontier:

  • Bartlett, Bousquet, Mendelson (2005), "Local Rademacher complexities," Ann. Stat. 33(4), 1497-1537 — localization for fast rates beyond Dudley
  • Bartlett, Foster, Telgarsky (2017), "Spectrally-normalized margin bounds for neural networks," NeurIPS — chaining-based norm capacity bounds for deep nets
  • Koltchinskii (2011), Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems, Saint-Flour XXXVIII

Last reviewed: May 5, 2026