
Learning Theory Core

Sample Complexity Bounds

How many samples do you need to learn? Tight answers for finite hypothesis classes, VC classes, and Rademacher-bounded classes, plus matching lower bounds via Fano and Le Cam.


Why This Matters

Sample complexity is the right currency for comparing learning algorithms. Asking "how accurate is this algorithm?" without specifying the sample size is meaningless. The question that matters is: how many samples $n$ does an algorithm need to achieve excess risk at most $\epsilon$ with probability at least $1 - \delta$?

Every upper bound on generalization error from ERM, VC theory, or Rademacher complexity can be inverted to give a sample complexity bound. Lower bounds tell you no algorithm can do better, regardless of computational power.

[Interactive theorem visual: Sample Complexity Budget. Sliders for complexity, confidence, and accuracy; shown regime: finite class.]

Upper bounds say what sample size is sufficient. Lower bounds say what no learner can beat. A mature theory makes those two numbers nearly meet. In the finite-class regime, counting hypotheses gives a sufficient budget; the price is logarithmic in the class size. If the information-theoretic floor sits far below the sufficient budget, the proof is leaving slack. If they nearly touch, the rate is close to sharp.

Mental Model

Think of sample complexity as a budget. You want to buy $\epsilon$-accurate generalization. The price depends on how complex your hypothesis class is (measured by $\log|\mathcal{H}|$, VC dimension $d$, or Rademacher complexity $\mathfrak{R}_n$). Upper bounds tell you a price that is always sufficient. Lower bounds tell you the cheapest possible price.

The gap between upper and lower bounds measures how well we understand a learning problem.

Formal Setup

Fix an input space $\mathcal{X}$, output space $\mathcal{Y}$, a loss function $\ell$ bounded in $[0, 1]$, a hypothesis class $\mathcal{H}$, and accuracy parameters $\epsilon, \delta > 0$.

Definition

Sample Complexity

The sample complexity of learning $\mathcal{H}$ to accuracy $\epsilon$ with confidence $1 - \delta$ is the smallest $n$ such that there exists an algorithm $A$ satisfying:

$$\sup_{\mathcal{D}} \Pr_{S \sim \mathcal{D}^n}\left[R(A(S)) - \min_{h \in \mathcal{H}} R(h) > \epsilon\right] \leq \delta$$

The supremum is over all distributions $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$.

This is a worst-case definition. The distribution $\mathcal{D}$ is adversarial.

Main Theorems

Theorem

Sample Complexity of Finite Hypothesis Classes

Statement

For a finite hypothesis class $\mathcal{H}$ with loss in $[0,1]$, ERM achieves excess risk at most $\epsilon$ with probability at least $1 - \delta$ whenever:

$$n \geq \frac{2(\log|\mathcal{H}| + \log(2/\delta))}{\epsilon^2}$$

Equivalently, the sample complexity satisfies $n(\epsilon, \delta, \mathcal{H}) = O\!\left(\frac{\log|\mathcal{H}| + \log(1/\delta)}{\epsilon^2}\right)$.

Intuition

Each hypothesis needs roughly $1/\epsilon^2$ samples to pin down its population risk (by Hoeffding). The union bound over $|\mathcal{H}|$ hypotheses costs a factor of $\log|\mathcal{H}|$, not $|\mathcal{H}|$, because the bound enters through an exponential inequality.

Proof Sketch

From the ERM generalization bound for finite classes, with probability $\geq 1 - \delta$:

$$\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| \leq \sqrt{\frac{\log|\mathcal{H}| + \log(2/\delta)}{2n}}$$

Set this $\leq \epsilon/2$ and solve for $n$. Since the ERM hypothesis $\hat{h}$ satisfies $R(\hat{h}) - \min_h R(h) \leq 2 \sup_h |R(h) - \hat{R}_n(h)|$, the excess risk is at most $\epsilon$.

Why It Matters

This is the baseline sample complexity result. It shows that finite classes are always learnable, and the cost is logarithmic in the class size. A class with $2^d$ hypotheses needs only $O(d/\epsilon^2)$ samples, not $O(2^d/\epsilon^2)$.
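A minimal numeric sketch of how to invert the bound in practice. The helper name and the example numbers are illustrative, not from the text; the constant 2 is the one in the statement above.

```python
import math

def finite_class_sample_size(num_hypotheses: int, eps: float, delta: float) -> int:
    """Samples sufficient for ERM on a finite class with loss in [0, 1]:
    n >= 2 * (log|H| + log(2/delta)) / eps**2."""
    return math.ceil(2 * (math.log(num_hypotheses) + math.log(2 / delta)) / eps**2)

# |H| = 2**20 hypotheses, eps = 0.05, delta = 0.05  ->  roughly 14,000 samples;
# the cost grows with log|H|, not with |H| itself.
print(finite_class_sample_size(2**20, eps=0.05, delta=0.05))
```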

Failure Mode

The bound is vacuous for infinite classes. It also gives no help when $|\mathcal{H}|$ is finite but astronomically large (e.g., all possible decision trees of a given depth), because the $\log|\mathcal{H}|$ term may be too large to be useful.

Theorem

Realizable Finite-Class Sample Complexity

Statement

Under realizability (some $h^* \in \mathcal{H}$ has $R(h^*) = 0$) with 0-1 loss and finite $\mathcal{H}$, any ERM hypothesis $\hat{h}$ (i.e., any $h \in \mathcal{H}$ consistent with the training set) satisfies $R(\hat{h}) \leq \epsilon$ with probability $\geq 1 - \delta$ whenever:

$$n \geq \frac{1}{\epsilon}\left(\log|\mathcal{H}| + \log\frac{1}{\delta}\right)$$

The dependence on $\epsilon$ is $1/\epsilon$, not $1/\epsilon^2$. Realizability buys a full factor of $1/\epsilon$ over the agnostic case.

Intuition

When some $h^*$ achieves zero error, ERM can refuse to return any hypothesis with positive empirical error. The only way ERM fails is if a bad hypothesis (one with $R(h) > \epsilon$) happens to look perfect on the training set. The probability of this shrinks like $(1 - \epsilon)^n \leq e^{-n\epsilon}$ per bad hypothesis, giving the $1/\epsilon$ rate.

Proof Sketch

We derive the bound step by step; the same template appears in PAC learning and is the workhorse of finite-class learning theory.

Step 1: Single bad hypothesis. Fix $h \in \mathcal{H}$ with $R(h) > \epsilon$, and write $c^*$ for the target labeling function (realizability means $y_i = c^*(x_i)$). Each training point $x_i$ has probability at least $\epsilon$ of exposing $h$ (i.e., $h(x_i) \ne c^*(x_i)$). By independence, the probability $h$ survives all $n$ checks is

$$\Pr_S[\hat{R}_n(h) = 0] = \prod_{i=1}^n \Pr[h(x_i) = c^*(x_i)] \leq (1 - \epsilon)^n \leq e^{-n\epsilon}$$

The final inequality uses $1 - x \leq e^{-x}$ (the tangent line to $e^{-x}$ at $x = 0$, which lies below the curve by convexity).

Step 2: Union bound. Let $\mathcal{H}_{\text{bad}} = \{h : R(h) > \epsilon\}$. The failure event is "some bad hypothesis is consistent with $S$", which has probability

$$\Pr_S\!\left[\exists h \in \mathcal{H}_{\text{bad}} : \hat{R}_n(h) = 0\right] \leq |\mathcal{H}_{\text{bad}}| \cdot e^{-n\epsilon} \leq |\mathcal{H}| \cdot e^{-n\epsilon}$$

Step 3: Solve for $n$. Setting $|\mathcal{H}| e^{-n\epsilon} \leq \delta$, taking logs, and dividing by $\epsilon$:

$$n \geq \frac{1}{\epsilon}\left(\log|\mathcal{H}| + \log\frac{1}{\delta}\right)$$

That is the bound. On the high-probability event, no bad hypothesis is consistent with $S$; since $h^*$ is consistent with $S$, ERM returns a consistent hypothesis, and any consistent hypothesis then has $R \leq \epsilon$.

Why It Matters

The $\log|\mathcal{H}|$ factor is the key. You pay logarithmically in the size of the hypothesis class, not linearly. For infinite $\mathcal{H}$, this argument fails because $\log|\mathcal{H}| = \infty$, and you need VC theory to replace the union bound with a polynomial growth-function argument.

The realizable rate $1/\epsilon$ versus the agnostic $1/\epsilon^2$ is a real distinction: at $\epsilon = 0.01$, the realizable bound asks for $\sim 100 \log|\mathcal{H}|$ samples while the agnostic bound asks for $\sim 10\,000 \log|\mathcal{H}|$, two orders of magnitude more. This gap is the formal cost of not knowing whether the target sits in your class.
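A small comparison sketch of the two formulas at $\epsilon = 0.01$ (helper names and numbers are illustrative, assuming the constants from the statements above):

```python
import math

def realizable_n(H_size, eps, delta):
    # realizable, 0-1 loss: n >= (log|H| + log(1/delta)) / eps
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def agnostic_n(H_size, eps, delta):
    # agnostic, bounded loss: n >= 2 * (log|H| + log(2/delta)) / eps**2
    return math.ceil(2 * (math.log(H_size) + math.log(2 / delta)) / eps**2)

H_size, eps, delta = 10**6, 0.01, 0.05
print(realizable_n(H_size, eps, delta))  # ~1,700
print(agnostic_n(H_size, eps, delta))    # ~350,000: the 1/eps vs 1/eps^2 gap in action
```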

Failure Mode

This bound requires realizability. Without it, no hypothesis achieves zero risk, the $(1 - \epsilon)^n$ consistency argument no longer applies, and the rate degrades to the agnostic $1/\epsilon^2$ from Hoeffding-style concentration. The bound is also vacuous for infinite hypothesis classes (linear classifiers, neural networks); use VC dimension or Rademacher complexity instead.

Theorem

Agnostic Sample Complexity via VC Dimension

Statement

For agnostic binary classification with 0-1 loss, if $\mathcal{H}$ has VC dimension $d < \infty$:

Upper bound: There exists a constant $C > 0$ such that ERM achieves excess risk $\leq \epsilon$ with probability $\geq 1 - \delta$ whenever:

$$n \geq C \cdot \frac{d + \log(1/\delta)}{\epsilon^2}$$

Lower bound: There exist constants $c_1, c_2 > 0$ such that for any algorithm:

$$n(\epsilon, \delta, \mathcal{H}) \geq c_1 \cdot \frac{d}{\epsilon^2} + c_2 \cdot \frac{\log(1/\delta)}{\epsilon^2}$$

Intuition

VC dimension $d$ replaces $\log|\mathcal{H}|$ for infinite classes. The Sauer-Shelah lemma shows that the effective number of behaviors of $\mathcal{H}$ on $n$ points is at most $(en/d)^d$. Taking logs recovers $d \log(n/d)$, which leads to a sample complexity of $O(d/\epsilon^2)$ up to log factors. The lower bound shows this is tight.

Proof Sketch

Upper bound: Apply the VC generalization bound: with probability $\geq 1 - \delta$, the uniform deviation is $O(\sqrt{(d\log(n/d) + \log(1/\delta))/n})$. Setting this $\leq \epsilon$ and solving for $n$ (which appears on both sides) gives $n = O\!\left(\frac{d}{\epsilon^2}\log\frac{d}{\epsilon^2}\right)$; removing the residual log factor to match the stated $O(d/\epsilon^2)$ requires sharper tools such as chaining.

Lower bound: Construct a packing of roughly $2^d$ distributions that are pairwise hard to distinguish. Apply Fano's inequality.

Why It Matters

This is the noisy, distribution-free version of the VC sample-complexity story. Finite VC dimension characterizes learnability for binary classification, but the rate depends on the setting: realizable problems can have $1/\epsilon$ scaling, while agnostic excess-risk learning has the slower $1/\epsilon^2$ scaling.

Failure Mode

The VC bound applies only to binary classification with 0-1 loss. For regression, multi-class classification, or other loss functions, you need different complexity measures (pseudo-dimension, Natarajan dimension, or Rademacher complexity). The constants $C, c_1, c_2$ are often not known tightly, making the bound hard to use for choosing $n$ in practice.
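Because the classical uniform-convergence form has $n$ on both sides (through the $\log(n/d)$ term), inverting it numerically is a short fixed-point iteration. A sketch under assumed values of the constant $C$, which the theory does not pin down:

```python
import math

def vc_sample_size(d, eps, delta, C=1.0):
    """Iterate n <- C * (d * log(max(n/d, 2)) + log(1/delta)) / eps**2 to a fixed point.
    C is a placeholder constant; different proofs give different values."""
    n = 1.0
    for _ in range(100):
        n_next = C * (d * math.log(max(n / d, 2.0)) + math.log(1 / delta)) / eps**2
        if abs(n_next - n) < 1e-6 * max(n, 1.0):
            break
        n = n_next
    return math.ceil(n)

# The unknown constant dominates the answer: the theory gives the scaling, not the number.
for C in (0.5, 1.0, 8.0):
    print(C, vc_sample_size(d=10, eps=0.05, delta=0.05, C=C))
```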

Theorem

Sample Complexity via Rademacher Complexity

Statement

If the Rademacher complexity of the loss class satisfies $\mathfrak{R}_n(\ell \circ \mathcal{H}) \leq R(n)$, then ERM achieves excess risk $\leq \epsilon$ with probability $\geq 1 - \delta$ whenever $n$ is chosen so that:

$$2R(n) + \sqrt{\frac{\log(1/\delta)}{2n}} \leq \epsilon$$

For many classes, $R(n) = O(1/\sqrt{n})$, giving sample complexity $O(1/\epsilon^2)$.

Intuition

Rademacher complexity directly measures how well the class can fit random noise. If the class cannot fit noise well (small $\mathfrak{R}_n$), it cannot overfit real data either. The sample complexity is determined by how fast $\mathfrak{R}_n$ decays with $n$.

Proof Sketch

The Rademacher generalization bound states that with probability $\geq 1 - \delta$:

$$\sup_{h \in \mathcal{H}} \left(R(h) - \hat{R}_n(h)\right) \leq 2\mathfrak{R}_n(\ell \circ \mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2n}}$$

Set the right side $\leq \epsilon$ and solve for $n$.
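For the common case $R(n) = c/\sqrt{n}$, solving for $n$ has a closed form. A small sketch (helper name and inputs are illustrative):

```python
import math

def rademacher_sample_size(c, eps, delta):
    """If R(n) = c / sqrt(n), then 2 R(n) + sqrt(log(1/delta) / (2 n)) <= eps
    reads (2 c + sqrt(log(1/delta) / 2)) / sqrt(n) <= eps; solve for n."""
    return math.ceil(((2 * c + math.sqrt(math.log(1 / delta) / 2)) / eps) ** 2)

print(rademacher_sample_size(c=1.0, eps=0.05, delta=0.05))  # ~4,200: the 1/eps^2 scaling again
```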

Why It Matters

Rademacher bounds are data-dependent: you can estimate $\mathfrak{R}_n$ from the training sample itself. This gives tighter sample complexity for specific distributions, not just worst-case over all distributions. For $\ell_\infty$- or $\ell_1$-bounded linear classes in $\mathbb{R}^d$, the chaining route gives $\mathfrak{R}_n = O(\sqrt{d/n})$, recovering the VC result without going through combinatorics. For the $\ell_2$-bounded class $\{x \mapsto \langle w, x \rangle : \|w\|_2 \leq B,\ \|x\|_2 \leq R\}$, a direct Cauchy-Schwarz / Khintchine argument gives a sharper dimension-free bound $\mathfrak{R}_n \leq BR/\sqrt{n}$. This is the bound used for SVM, ridge, and kernel analyses; the $O(\sqrt{d/n})$ rate is not the right rate for $\ell_2$-bounded classes.
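The empirical Rademacher complexity of the $\ell_2$-bounded linear class has a closed form per noise draw, so it can be estimated by Monte Carlo and compared with the $BR/\sqrt{n}$ bound. A sketch on synthetic data (all names and numbers are illustrative):

```python
import numpy as np

def empirical_rademacher_linear_l2(X, B, trials=200, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> <w, x> : ||w||_2 <= B} on the rows of X (shape n x d). For a fixed sign
    vector sigma, sup_w (1/n) <w, sum_i sigma_i x_i> = (B/n) ||sum_i sigma_i x_i||_2."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    draws = []
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=n)
        draws.append(B / n * np.linalg.norm(sigma @ X))
    return float(np.mean(draws))

rng = np.random.default_rng(1)
n, d, B, R = 2000, 50, 1.0, 1.0
X = rng.normal(size=(n, d))
X = R * X / np.linalg.norm(X, axis=1, keepdims=True)   # enforce ||x_i||_2 = R

print(empirical_rademacher_linear_l2(X, B))  # ~0.022
print(B * R / np.sqrt(n))                    # 0.0224: the dimension-free B R / sqrt(n) bound
```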

Failure Mode

Computing or bounding $\mathfrak{R}_n$ can be difficult for complex hypothesis classes (e.g., deep neural networks). The resulting bounds are often loose in practice and do not explain the generalization of overparameterized models.

Lower Bound Methods

Upper bounds say "this many samples suffice." Lower bounds say "no algorithm can do better." Two classical methods:

Fano's method. Construct $M$ distributions $P_1, \ldots, P_M$ such that: (1) any two $P_i, P_j$ require different hypotheses (i.e., $\|h_i - h_j\| > \epsilon$), and (2) the distributions are hard to tell apart (measured by KL divergence). Fano's inequality then forces, for any reliable learner,

$$n \gtrsim \frac{\log M - \log 2}{\max_{i \neq j} \mathrm{KL}(P_i \| P_j)}$$

where the KL divergence is between single-sample distributions (the $n$-fold product multiplies it by $n$).

Le Cam's method. A two-hypothesis version. Construct $P_0$ and $P_1$ with different optimal hypotheses. The minimax risk is at least $(\epsilon/2)(1 - \mathrm{TV}(P_0^n, P_1^n))$. Simpler than Fano but gives weaker results (no $\log M$ factor).
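A tiny numeric sketch of Le Cam's two-point bound, using Pinsker's inequality to control the total variation of the $n$-fold products. The per-sample KL of order $\epsilon^2$ is an assumed, typical feature of such constructions; the exact constant is illustrative:

```python
import math

def le_cam_lower_bound(eps, kl_per_sample, n):
    """Le Cam two-point bound: minimax excess risk >= (eps/2) * (1 - TV(P0^n, P1^n)).
    Pinsker gives TV(P0^n, P1^n) <= sqrt(n * KL(P0 || P1) / 2)."""
    tv_upper = min(1.0, math.sqrt(n * kl_per_sample / 2))
    return (eps / 2) * (1 - tv_upper)

# Two hypotheses separated by eps with per-sample KL ~ eps^2: the bound stays
# nontrivial until n is of order 1/eps^2, which is the point of the method.
eps = 0.1
for n in (10, 50, 200):
    print(n, round(le_cam_lower_bound(eps, kl_per_sample=2 * eps**2, n=n), 4))
```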

Theorem

Fano Method Lower Bound

Statement

Let $P_1, \ldots, P_M$ be distributions such that for all $i \neq j$, the associated optimal hypotheses satisfy $d(h_i, h_j) > 2\epsilon$ and the single-sample divergences satisfy $\mathrm{KL}(P_i \| P_j) \leq \beta$. Then for any estimator $\hat{h}$:

$$\max_{1 \leq i \leq M} \Pr_{S \sim P_i^n}\left[d(\hat{h}(S), h_i) > \epsilon\right] \geq 1 - \frac{n\beta + \log 2}{\log M}$$

Intuition

If you have $M$ hypotheses that all look similar (low KL divergence) but require different answers, then no algorithm can reliably pick the right one unless $n$ is large enough for the total information $n\beta$ to exceed $\log M$.

Proof Sketch

Apply Fano's inequality to the $M$-ary hypothesis testing problem. The mutual information between the index $i$ and the sample $S$ is at most $n\beta$ (by the data processing inequality and the KL bound). Fano's inequality bounds the probability of error in terms of $I(i; S)/\log M$.
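Reading the theorem in reverse gives the smallest $n$ compatible with error probability $\delta$. A sketch with an assumed packing of size $2^d$ and per-sample KL of order $\epsilon^2$, both illustrative choices standing in for a real construction:

```python
import math

def fano_min_samples(M, beta, delta):
    """Smallest n allowed by 1 - (n * beta + log 2) / log M <= delta,
    i.e. n >= ((1 - delta) * log M - log 2) / beta."""
    return max(0, math.ceil(((1 - delta) * math.log(M) - math.log(2)) / beta))

d, eps, delta = 20, 0.05, 0.05
print(fano_min_samples(M=2**d, beta=8 * eps**2, delta=delta))  # scales like d / eps^2 up to constants
```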

Why It Matters

This is the workhorse of minimax lower bounds. Nearly every known sample complexity lower bound in nonparametric statistics and learning theory uses some form of Fano's method or its generalizations.

Failure Mode

Constructing the right packing (the $M$ distributions) is the hard part. A poor choice of packing gives a weak lower bound. The bound is also limited to worst-case analysis; it says nothing about specific distributions that might be easier.

The Gap Between Upper and Lower Bounds

For agnostic binary classification with VC dimension $d$:

  • Classical upper bound: $O(d \log(1/\epsilon) / \epsilon^2)$ from standard VC uniform-convergence arguments
  • Information-theoretic lower bound: $\Omega(d/\epsilon^2)$ from noisy packing/Fano constructions

The clean mental model is: realizable/noiseless learning can scale like $1/\epsilon$, while agnostic/noisy excess-risk learning usually pays the slower $1/\epsilon^2$ concentration price. Some classical upper bounds carry extra logarithmic factors; sharper theory studies when those logs are artifacts of the proof rather than a real sample requirement.

Canonical Examples

Example

Learning threshold classifiers on the line

Let $\mathcal{X} = [0, 1]$ and $\mathcal{H} = \{x \mapsto \mathbf{1}[x \geq t] : t \in [0,1]\}$. This class has VC dimension $d = 1$. In the agnostic/noisy setting, the worst-case excess-risk sample complexity is $\Theta(1/\epsilon^2)$. In the realizable/noiseless threshold setting, the dependence improves to roughly $1/\epsilon$ up to confidence factors.

With $n = 100$ samples, ERM achieves excess risk $\approx 0.1$ with high probability. With $n = 10{,}000$ samples, excess risk $\approx 0.01$.
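A quick simulation sketch of ERM for thresholds under label noise (the function names, noise level, and target threshold are illustrative, not from the text). Note that a fixed benign noise rate like this one can decay faster than the worst-case behavior, which is driven by harder distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

def erm_threshold(x, y):
    """ERM over h_t(x) = 1[x >= t], computed with one sort and prefix sums."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    # errors(k): empirical errors if the first k sorted points are predicted 0, the rest 1
    ones_left = np.concatenate(([0], np.cumsum(ys)))                      # y = 1 predicted as 0
    zeros_right = np.concatenate((np.cumsum((1 - ys)[::-1])[::-1], [0]))  # y = 0 predicted as 1
    k = int(np.argmin(ones_left + zeros_right))
    if k == 0:
        return 0.0
    if k == len(xs):
        return 1.0
    return 0.5 * (xs[k - 1] + xs[k])

def mean_excess_risk(n, t_star=0.3, noise=0.1, trials=50):
    """Average excess 0-1 risk of ERM when labels 1[x >= t_star] are flipped w.p. `noise`.
    For this model, R(t) - R(t_star) = (1 - 2 * noise) * |t - t_star|."""
    vals = []
    for _ in range(trials):
        x = rng.uniform(size=n)
        y = (x >= t_star).astype(int) ^ (rng.uniform(size=n) < noise).astype(int)
        vals.append((1 - 2 * noise) * abs(erm_threshold(x, y) - t_star))
    return float(np.mean(vals))

for n in (100, 1000, 10000):
    print(n, round(mean_excess_risk(n), 4))
```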

Example

Linear classifiers in d dimensions

Halfspaces in $\mathbb{R}^d$ have VC dimension on the order of $d$. In agnostic distribution-free classification, sample complexity is $\Theta(d/\epsilon^2)$ up to logarithmic and confidence factors. In $d = 100$ dimensions, you need roughly $100/\epsilon^2$ samples. At $\epsilon = 0.01$, that is $10^6$ samples. This explains why high-dimensional linear classification needs large datasets.

Common Confusions

Watch Out

Sample complexity is not the same as convergence rate

Sample complexity asks: how many samples for $\epsilon$-accuracy? Convergence rate asks: how fast does the excess risk decrease with $n$? They are related by inversion, but people often confuse the two. A rate of $O(1/\sqrt{n})$ corresponds to a sample complexity of $O(1/\epsilon^2)$. A rate of $O(1/n)$ corresponds to $O(1/\epsilon)$, which is called a "fast rate."
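The inversion is mechanical; a two-line sketch (hypothetical helper, constant $c$ assumed known):

```python
def n_from_rate(c, eps, fast=False):
    """Slow rate c / sqrt(n) <= eps gives n >= (c / eps)**2; fast rate c / n <= eps gives n >= c / eps."""
    return c / eps if fast else (c / eps) ** 2

print(n_from_rate(1.0, 0.01))             # 10,000 for a 1/sqrt(n) rate
print(n_from_rate(1.0, 0.01, fast=True))  # 100 for a 1/n "fast rate"
```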

Watch Out

Computational complexity is separate from sample complexity

Sample complexity ignores computation. An algorithm might need only $O(d/\epsilon^2)$ samples but require exponential time to process them. Cryptographic hardness results show that for some problems, computationally efficient algorithms provably need more samples than information-theoretically optimal (but computationally unbounded) ones.

Watch Out

Bounds with unspecified constants are hard to use in practice

The statement $n = O(d/\epsilon^2)$ hides a constant. That constant might be 1 or 1000. For practical sample size planning, you need either explicit constants (which are rarely tight) or empirical methods like learning curves. Sample complexity theory tells you the right scaling, not the right number.

Summary

  • For finite $\mathcal{H}$: $n = O(\log|\mathcal{H}|/\epsilon^2)$
  • For agnostic VC classes: $n = \Theta(d/\epsilon^2)$ up to log factors
  • For realizable VC classes: the dependence on $\epsilon$ can improve to roughly $1/\epsilon$
  • For Rademacher complexity $R(n)$: solve $R(n) \leq \epsilon/2$
  • Lower bounds via Fano require constructing hard packing sets
  • Sample complexity is the right way to compare learning algorithms across different regimes
  • The gap between upper and lower bounds measures our ignorance about a problem

Exercises

ExerciseCore

Problem

A hypothesis class has 500 elements. How many samples does ERM need to achieve excess risk $\leq 0.05$ with probability $\geq 0.95$?

ExerciseAdvanced

Problem

The class of halfspaces in $\mathbb{R}^{10}$ has VC dimension 11. Use the VC sample complexity bound to determine the order of magnitude of samples needed for $\epsilon = 0.01$ accuracy. Then explain why this is still useful despite the loose constants.

ExerciseResearch

Problem

Explain why the deterministic threshold construction does not prove an agnostic $\Omega(1/\epsilon^2)$ lower bound by itself. Then state the right lower-bound lesson for VC classes.

References


Current:

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapter 3
  • Wainwright, High-Dimensional Statistics (2019), Chapter 15 (minimax lower bounds)
  • Bartlett & Mendelson, "Rademacher and Gaussian Complexities: Risk Bounds and Structural Results," JMLR 3 (2002), 463-482 (jmlr.org). Rademacher complexity tightens the constants and removes the worst-case dependence on the function class structure that VC bounds carry.


Last reviewed: April 26, 2026
