
Learning Theory Core

Sample Complexity Bounds

How many samples do you need to learn? Tight answers for finite hypothesis classes, VC classes, and Rademacher-bounded classes, plus matching lower bounds via Fano and Le Cam.


Why This Matters

Sample complexity is the right currency for comparing learning algorithms. Asking "how accurate is this algorithm?" without specifying the sample size is meaningless. The question that matters is: how many samples $n$ does an algorithm need to achieve excess risk at most $\epsilon$ with probability at least $1 - \delta$?

Every upper bound on generalization error from ERM, VC theory, or Rademacher complexity can be inverted to give a sample complexity bound. Lower bounds tell you no algorithm can do better, regardless of computational power.

[Interactive theorem visual: Sample Complexity Budget. Sliders for complexity, confidence, and accuracy; shown regime: finite class.]

Upper bounds say what sample size is sufficient. Lower bounds say what no learner can beat. A mature theory makes those two numbers nearly meet. In the finite-class regime, counting hypotheses gives a sufficient budget; the price is logarithmic in the class size. If the information-theoretic floor sits far below the sufficient budget, the proof is leaving slack. If they nearly touch, the rate is close to sharp.

Mental Model

Think of sample complexity as a budget. You want to buy $\epsilon$-accurate generalization. The price depends on how complex your hypothesis class is (measured by $\log|\mathcal{H}|$, VC dimension $d$, or Rademacher complexity $\mathfrak{R}_n$). Upper bounds tell you a price that is always sufficient. Lower bounds tell you the cheapest possible price.

The gap between upper and lower bounds measures how well we understand a learning problem.

Formal Setup

Fix an input space $\mathcal{X}$, output space $\mathcal{Y}$, a loss function $\ell$ bounded in $[0, 1]$, a hypothesis class $\mathcal{H}$, and accuracy parameters $\epsilon, \delta > 0$.

Definition

Sample Complexity

The sample complexity of learning $\mathcal{H}$ to accuracy $\epsilon$ with confidence $1 - \delta$ is the smallest $n$ such that there exists an algorithm $A$ satisfying:

$$\sup_{\mathcal{D}} \Pr_{S \sim \mathcal{D}^n}\left[R(A(S)) - \min_{h \in \mathcal{H}} R(h) > \epsilon\right] \leq \delta$$

The supremum is over all distributions $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$.

This is a worst-case definition. The distribution $\mathcal{D}$ is adversarial.

Main Theorems

Theorem

Sample Complexity of Finite Hypothesis Classes

Statement

For a finite hypothesis class $\mathcal{H}$ with loss in $[0,1]$, ERM achieves excess risk at most $\epsilon$ with probability at least $1 - \delta$ whenever:

$$n \geq \frac{2(\log|\mathcal{H}| + \log(2/\delta))}{\epsilon^2}$$

Equivalently, the sample complexity satisfies $n(\epsilon, \delta, \mathcal{H}) = O\!\left(\frac{\log|\mathcal{H}| + \log(1/\delta)}{\epsilon^2}\right)$.

Intuition

Each hypothesis needs roughly $1/\epsilon^2$ samples to pin down its population risk (by Hoeffding). The union bound over $|\mathcal{H}|$ hypotheses costs a factor of $\log|\mathcal{H}|$, not $|\mathcal{H}|$, because the bound enters through an exponential inequality.

Proof Sketch

From the ERM generalization bound for finite classes, with probability $\geq 1 - \delta$:

$$\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| \leq \sqrt{\frac{\log|\mathcal{H}| + \log(2/\delta)}{2n}}$$

Set this $\leq \epsilon/2$ and solve for $n$. Since the ERM hypothesis $\hat{h}$ satisfies $R(\hat{h}) - \min_h R(h) \leq 2 \sup_h |R(h) - \hat{R}_n(h)|$, the excess risk is at most $\epsilon$.

Why It Matters

This is the baseline sample complexity result. It shows that finite classes are always learnable, and the cost is logarithmic in the class size. A class with $2^d$ hypotheses needs only $O(d/\epsilon^2)$ samples, not $O(2^d/\epsilon^2)$.
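A minimal numeric sketch of how to invert the bound in practice. The helper name and the example numbers are illustrative, not from the text; the constant 2 is the one in the statement above.

```python
import math

def finite_class_sample_size(num_hypotheses: int, eps: float, delta: float) -> int:
    """Samples sufficient for ERM on a finite class with loss in [0, 1]:
    n >= 2 * (log|H| + log(2/delta)) / eps**2."""
    return math.ceil(2 * (math.log(num_hypotheses) + math.log(2 / delta)) / eps**2)

# |H| = 2**20 hypotheses, eps = 0.05, delta = 0.05  ->  roughly 14,000 samples;
# the cost grows with log|H|, not with |H| itself.
print(finite_class_sample_size(2**20, eps=0.05, delta=0.05))
```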

Failure Mode

The bound is vacuous for infinite classes. It also gives no help when $|\mathcal{H}|$ is finite but astronomically large (e.g., all possible decision trees of a given depth), because the $\log|\mathcal{H}|$ term may be too large to be useful.

Theorem

Realizable Finite-Class Sample Complexity

Statement

Under realizability (some $h^* \in \mathcal{H}$ has $R(h^*) = 0$) with 0-1 loss and finite $\mathcal{H}$, any ERM hypothesis $\hat{h}$ (i.e., any $h \in \mathcal{H}$ consistent with the training set) satisfies $R(\hat{h}) \leq \epsilon$ with probability $\geq 1 - \delta$ whenever:

$$n \geq \frac{1}{\epsilon}\left(\log|\mathcal{H}| + \log\frac{1}{\delta}\right)$$

The dependence on $\epsilon$ is $1/\epsilon$, not $1/\epsilon^2$. Realizability buys a full factor of $1/\epsilon$ over the agnostic case.

Intuition

When some $h^*$ achieves zero error, ERM can refuse to return any hypothesis with positive empirical error. The only way ERM fails is if a bad hypothesis (one with $R(h) > \epsilon$) happens to look perfect on the training set. The probability of this shrinks like $(1 - \epsilon)^n \leq e^{-n\epsilon}$ per bad hypothesis, giving the $1/\epsilon$ rate.

Proof Sketch

We derive the bound step by step; the same template appears in PAC learning and is the workhorse of finite-class learning theory.

Step 1: Single bad hypothesis. Fix $h \in \mathcal{H}$ with $R(h) > \epsilon$, and write $c^*$ for the target labeling function (realizability means $y_i = c^*(x_i)$). Each training point $x_i$ has probability at least $\epsilon$ of exposing $h$ (i.e., $h(x_i) \ne c^*(x_i)$). By independence, the probability $h$ survives all $n$ checks is

$$\Pr_S[\hat{R}_n(h) = 0] = \prod_{i=1}^n \Pr[h(x_i) = c^*(x_i)] \leq (1 - \epsilon)^n \leq e^{-n\epsilon}$$

The final inequality uses $1 - x \leq e^{-x}$ (the tangent line to $e^{-x}$ at $x = 0$, which lies below the curve by convexity).

Step 2: Union bound. Let $\mathcal{H}_{\text{bad}} = \{h : R(h) > \epsilon\}$. The failure event is "some bad hypothesis is consistent with $S$", which has probability

$$\Pr_S\!\left[\exists h \in \mathcal{H}_{\text{bad}} : \hat{R}_n(h) = 0\right] \leq |\mathcal{H}_{\text{bad}}| \cdot e^{-n\epsilon} \leq |\mathcal{H}| \cdot e^{-n\epsilon}$$

Step 3: Solve for $n$. Setting $|\mathcal{H}| e^{-n\epsilon} \leq \delta$, taking logs, and dividing by $\epsilon$:

$$n \geq \frac{1}{\epsilon}\left(\log|\mathcal{H}| + \log\frac{1}{\delta}\right)$$

That is the bound. On the high-probability event, no bad hypothesis is consistent with $S$; since $h^*$ is consistent with $S$, ERM returns a consistent hypothesis, and any consistent hypothesis then has $R \leq \epsilon$.

Why It Matters

The $\log|\mathcal{H}|$ factor is the key. You pay logarithmically in the size of the hypothesis class, not linearly. For infinite $\mathcal{H}$, this argument fails because $\log|\mathcal{H}| = \infty$, and you need VC theory to replace the union bound with a polynomial growth-function argument.

The realizable rate $1/\epsilon$ versus the agnostic $1/\epsilon^2$ is a real distinction: at $\epsilon = 0.01$, the realizable bound asks for $\sim 100 \log|\mathcal{H}|$ samples while the agnostic bound asks for $\sim 10\,000 \log|\mathcal{H}|$, two orders of magnitude more. This gap is the formal cost of not knowing whether the target sits in your class.
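A small comparison sketch of the two formulas at $\epsilon = 0.01$ (helper names and numbers are illustrative, assuming the constants from the statements above):

```python
import math

def realizable_n(H_size, eps, delta):
    # realizable, 0-1 loss: n >= (log|H| + log(1/delta)) / eps
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def agnostic_n(H_size, eps, delta):
    # agnostic, bounded loss: n >= 2 * (log|H| + log(2/delta)) / eps**2
    return math.ceil(2 * (math.log(H_size) + math.log(2 / delta)) / eps**2)

H_size, eps, delta = 10**6, 0.01, 0.05
print(realizable_n(H_size, eps, delta))  # ~1,700
print(agnostic_n(H_size, eps, delta))    # ~350,000: the 1/eps vs 1/eps^2 gap in action
```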

Failure Mode

This bound requires realizability. Without it, no hypothesis achieves zero risk, the $(1 - \epsilon)^n$ consistency argument no longer applies, and the rate degrades to the agnostic $1/\epsilon^2$ from Hoeffding-style concentration. The bound is also vacuous for infinite hypothesis classes (linear classifiers, neural networks); use VC dimension or Rademacher complexity instead.

Theorem

Agnostic Sample Complexity via VC Dimension

Statement

For agnostic binary classification with 0-1 loss, if $\mathcal{H}$ has VC dimension $d < \infty$:

Upper bound: There exists a constant $C > 0$ such that ERM achieves excess risk $\leq \epsilon$ with probability $\geq 1 - \delta$ whenever:

$$n \geq C \cdot \frac{d + \log(1/\delta)}{\epsilon^2}$$

Lower bound: There exist constants $c_1, c_2 > 0$ such that for any algorithm:

$$n(\epsilon, \delta, \mathcal{H}) \geq c_1 \cdot \frac{d}{\epsilon^2} + c_2 \cdot \frac{\log(1/\delta)}{\epsilon^2}$$

Intuition

VC dimension $d$ replaces $\log|\mathcal{H}|$ for infinite classes. The Sauer-Shelah lemma shows that the effective number of behaviors of $\mathcal{H}$ on $n$ points is at most $(en/d)^d$. Taking logs recovers $d \log(n/d)$, which leads to a sample complexity of $O(d/\epsilon^2)$ up to log factors. The lower bound shows this is tight.

Proof Sketch

Upper bound: Apply the VC generalization bound: with probability $\geq 1 - \delta$, the uniform deviation is $O(\sqrt{(d\log(n/d) + \log(1/\delta))/n})$. Setting this $\leq \epsilon$ and solving for $n$ (which appears on both sides) gives $n = O\!\left(\frac{d}{\epsilon^2}\log\frac{d}{\epsilon^2}\right)$; removing the residual log factor to match the stated $O(d/\epsilon^2)$ requires sharper tools such as chaining.

Lower bound: Construct a packing of roughly $2^d$ distributions that are pairwise hard to distinguish. Apply Fano's inequality.

Why It Matters

This is the noisy, distribution-free version of the VC sample-complexity story. Finite VC dimension characterizes learnability for binary classification, but the rate depends on the setting: realizable problems can have $1/\epsilon$ scaling, while agnostic excess-risk learning has the slower $1/\epsilon^2$ scaling.

Failure Mode

The VC bound applies only to binary classification with 0-1 loss. For regression, multi-class classification, or other loss functions, you need different complexity measures (pseudo-dimension, Natarajan dimension, or Rademacher complexity). The constants $C, c_1, c_2$ are often not known tightly, making the bound hard to use for choosing $n$ in practice.
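Because the classical uniform-convergence form has $n$ on both sides (through the $\log(n/d)$ term), inverting it numerically is a short fixed-point iteration. A sketch under assumed values of the constant $C$, which the theory does not pin down:

```python
import math

def vc_sample_size(d, eps, delta, C=1.0):
    """Iterate n <- C * (d * log(max(n/d, 2)) + log(1/delta)) / eps**2 to a fixed point.
    C is a placeholder constant; different proofs give different values."""
    n = 1.0
    for _ in range(100):
        n_next = C * (d * math.log(max(n / d, 2.0)) + math.log(1 / delta)) / eps**2
        if abs(n_next - n) < 1e-6 * max(n, 1.0):
            break
        n = n_next
    return math.ceil(n)

# The unknown constant dominates the answer: the theory gives the scaling, not the number.
for C in (0.5, 1.0, 8.0):
    print(C, vc_sample_size(d=10, eps=0.05, delta=0.05, C=C))
```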

Theorem

Sample Complexity via Rademacher Complexity

Statement

If the Rademacher complexity of the loss class satisfies $\mathfrak{R}_n(\ell \circ \mathcal{H}) \leq R(n)$, then ERM achieves excess risk $\leq \epsilon$ with probability $\geq 1 - \delta$ whenever $n$ is chosen so that:

$$2R(n) + \sqrt{\frac{\log(1/\delta)}{2n}} \leq \epsilon$$

For many classes, $R(n) = O(1/\sqrt{n})$, giving sample complexity $O(1/\epsilon^2)$.

Intuition

Rademacher complexity directly measures how well the class can fit random noise. If the class cannot fit noise well (small $\mathfrak{R}_n$), it cannot overfit real data either. The sample complexity is determined by how fast $\mathfrak{R}_n$ decays with $n$.

Proof Sketch

The Rademacher generalization bound states that with probability $\geq 1 - \delta$:

$$\sup_{h \in \mathcal{H}} \left(R(h) - \hat{R}_n(h)\right) \leq 2\mathfrak{R}_n(\ell \circ \mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2n}}$$

Set the right side $\leq \epsilon$ and solve for $n$.
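For the common case $R(n) = c/\sqrt{n}$, solving for $n$ has a closed form. A small sketch (helper name and inputs are illustrative):

```python
import math

def rademacher_sample_size(c, eps, delta):
    """If R(n) = c / sqrt(n), then 2 R(n) + sqrt(log(1/delta) / (2 n)) <= eps
    reads (2 c + sqrt(log(1/delta) / 2)) / sqrt(n) <= eps; solve for n."""
    return math.ceil(((2 * c + math.sqrt(math.log(1 / delta) / 2)) / eps) ** 2)

print(rademacher_sample_size(c=1.0, eps=0.05, delta=0.05))  # ~4,200: the 1/eps^2 scaling again
```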

Why It Matters

Rademacher bounds are data-dependent: you can estimate $\mathfrak{R}_n$ from the training sample itself. This gives tighter sample complexity for specific distributions, not just worst-case over all distributions. For $\ell_\infty$- or $\ell_1$-bounded linear classes in $\mathbb{R}^d$, the chaining route gives $\mathfrak{R}_n = O(\sqrt{d/n})$, recovering the VC result without going through combinatorics. For the $\ell_2$-bounded class $\{x \mapsto \langle w, x \rangle : \|w\|_2 \leq B,\ \|x\|_2 \leq R\}$, a direct Cauchy-Schwarz / Khintchine argument gives a sharper dimension-free bound $\mathfrak{R}_n \leq BR/\sqrt{n}$. This is the bound used for SVM, ridge, and kernel analyses; the $O(\sqrt{d/n})$ rate is not the right rate for $\ell_2$-bounded classes.
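The empirical Rademacher complexity of the $\ell_2$-bounded linear class has a closed form per noise draw, so it can be estimated by Monte Carlo and compared with the $BR/\sqrt{n}$ bound. A sketch on synthetic data (all names and numbers are illustrative):

```python
import numpy as np

def empirical_rademacher_linear_l2(X, B, trials=200, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> <w, x> : ||w||_2 <= B} on the rows of X (shape n x d). For a fixed sign
    vector sigma, sup_w (1/n) <w, sum_i sigma_i x_i> = (B/n) ||sum_i sigma_i x_i||_2."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    draws = []
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=n)
        draws.append(B / n * np.linalg.norm(sigma @ X))
    return float(np.mean(draws))

rng = np.random.default_rng(1)
n, d, B, R = 2000, 50, 1.0, 1.0
X = rng.normal(size=(n, d))
X = R * X / np.linalg.norm(X, axis=1, keepdims=True)   # enforce ||x_i||_2 = R

print(empirical_rademacher_linear_l2(X, B))  # ~0.022
print(B * R / np.sqrt(n))                    # 0.0224: the dimension-free B R / sqrt(n) bound
```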

Failure Mode

Computing or bounding $\mathfrak{R}_n$ can be difficult for complex hypothesis classes (e.g., deep neural networks). The resulting bounds are often loose in practice and do not explain the generalization of overparameterized models.

Lower Bound Methods

Upper bounds say "this many samples suffice." Lower bounds say "no algorithm can do better." Two classical methods:

Fano's method. Construct $M$ distributions $P_1, \ldots, P_M$ such that: (1) any two $P_i, P_j$ require different hypotheses (i.e., $\|h_i - h_j\| > \epsilon$), and (2) the distributions are hard to tell apart (measured by KL divergence). Fano's inequality then forces, for any reliable learner,

$$n \gtrsim \frac{\log M - \log 2}{\max_{i \neq j} \mathrm{KL}(P_i \| P_j)}$$

where the KL divergence is between single-sample distributions (the $n$-fold product multiplies it by $n$).

Le Cam's method. A two-hypothesis version. Construct $P_0$ and $P_1$ with different optimal hypotheses. The minimax risk is at least $(\epsilon/2)(1 - \mathrm{TV}(P_0^n, P_1^n))$. Simpler than Fano but gives weaker results (no $\log M$ factor).
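A tiny numeric sketch of Le Cam's two-point bound, using Pinsker's inequality to control the total variation of the $n$-fold products. The per-sample KL of order $\epsilon^2$ is an assumed, typical feature of such constructions; the exact constant is illustrative:

```python
import math

def le_cam_lower_bound(eps, kl_per_sample, n):
    """Le Cam two-point bound: minimax excess risk >= (eps/2) * (1 - TV(P0^n, P1^n)).
    Pinsker gives TV(P0^n, P1^n) <= sqrt(n * KL(P0 || P1) / 2)."""
    tv_upper = min(1.0, math.sqrt(n * kl_per_sample / 2))
    return (eps / 2) * (1 - tv_upper)

# Two hypotheses separated by eps with per-sample KL ~ eps^2: the bound stays
# nontrivial until n is of order 1/eps^2, which is the point of the method.
eps = 0.1
for n in (10, 50, 200):
    print(n, round(le_cam_lower_bound(eps, kl_per_sample=2 * eps**2, n=n), 4))
```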

Theorem

Fano Method Lower Bound

Statement

Let $P_1, \ldots, P_M$ be distributions such that for all $i \neq j$, the associated optimal hypotheses satisfy $d(h_i, h_j) > 2\epsilon$ and the single-sample divergences satisfy $\mathrm{KL}(P_i \| P_j) \leq \beta$. Then for any estimator $\hat{h}$:

$$\max_{1 \leq i \leq M} \Pr_{S \sim P_i^n}\left[d(\hat{h}(S), h_i) > \epsilon\right] \geq 1 - \frac{n\beta + \log 2}{\log M}$$

Intuition

If you have $M$ hypotheses that all look similar (low KL divergence) but require different answers, then no algorithm can reliably pick the right one unless $n$ is large enough for the total information $n\beta$ to exceed $\log M$.

Proof Sketch

Apply Fano's inequality to the $M$-ary hypothesis testing problem. The mutual information between the index $i$ and the sample $S$ is at most $n\beta$ (by the data processing inequality and the KL bound). Fano's inequality bounds the probability of error in terms of $I(i; S)/\log M$.
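Reading the theorem in reverse gives the smallest $n$ compatible with error probability $\delta$. A sketch with an assumed packing of size $2^d$ and per-sample KL of order $\epsilon^2$, both illustrative choices standing in for a real construction:

```python
import math

def fano_min_samples(M, beta, delta):
    """Smallest n allowed by 1 - (n * beta + log 2) / log M <= delta,
    i.e. n >= ((1 - delta) * log M - log 2) / beta."""
    return max(0, math.ceil(((1 - delta) * math.log(M) - math.log(2)) / beta))

d, eps, delta = 20, 0.05, 0.05
print(fano_min_samples(M=2**d, beta=8 * eps**2, delta=delta))  # scales like d / eps^2 up to constants
```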

Why It Matters

This is the workhorse of minimax lower bounds. Nearly every known sample complexity lower bound in nonparametric statistics and learning theory uses some form of Fano's method or its generalizations.

Failure Mode

Constructing the right packing (the $M$ distributions) is the hard part. A poor choice of packing gives a weak lower bound. The bound is also limited to worst-case analysis; it says nothing about specific distributions that might be easier.

The Gap Between Upper and Lower Bounds

For agnostic binary classification with VC dimension $d$:

  • Classical upper bound: $O(d \log(1/\epsilon) / \epsilon^2)$ from standard VC uniform-convergence arguments
  • Information-theoretic lower bound: $\Omega(d/\epsilon^2)$ from noisy packing/Fano constructions

The clean mental model is: realizable/noiseless learning can scale like $1/\epsilon$, while agnostic/noisy excess-risk learning usually pays the slower $1/\epsilon^2$ concentration price. Some classical upper bounds carry extra logarithmic factors; sharper theory studies when those logs are artifacts of the proof rather than a real sample requirement.

Canonical Examples

Example

Learning threshold classifiers on the line

Let $\mathcal{X} = [0, 1]$ and $\mathcal{H} = \{x \mapsto \mathbf{1}[x \geq t] : t \in [0,1]\}$. This class has VC dimension $d = 1$. In the agnostic/noisy setting, the worst-case excess-risk sample complexity is $\Theta(1/\epsilon^2)$. In the realizable/noiseless threshold setting, the dependence improves to roughly $1/\epsilon$ up to confidence factors.

With $n = 100$ samples, ERM achieves excess risk $\approx 0.1$ with high probability. With $n = 10{,}000$ samples, excess risk $\approx 0.01$.
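A quick simulation sketch of ERM for thresholds under label noise (the function names, noise level, and target threshold are illustrative, not from the text). Note that a fixed benign noise rate like this one can decay faster than the worst-case behavior, which is driven by harder distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

def erm_threshold(x, y):
    """ERM over h_t(x) = 1[x >= t], computed with one sort and prefix sums."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    # errors(k): empirical errors if the first k sorted points are predicted 0, the rest 1
    ones_left = np.concatenate(([0], np.cumsum(ys)))                      # y = 1 predicted as 0
    zeros_right = np.concatenate((np.cumsum((1 - ys)[::-1])[::-1], [0]))  # y = 0 predicted as 1
    k = int(np.argmin(ones_left + zeros_right))
    if k == 0:
        return 0.0
    if k == len(xs):
        return 1.0
    return 0.5 * (xs[k - 1] + xs[k])

def mean_excess_risk(n, t_star=0.3, noise=0.1, trials=50):
    """Average excess 0-1 risk of ERM when labels 1[x >= t_star] are flipped w.p. `noise`.
    For this model, R(t) - R(t_star) = (1 - 2 * noise) * |t - t_star|."""
    vals = []
    for _ in range(trials):
        x = rng.uniform(size=n)
        y = (x >= t_star).astype(int) ^ (rng.uniform(size=n) < noise).astype(int)
        vals.append((1 - 2 * noise) * abs(erm_threshold(x, y) - t_star))
    return float(np.mean(vals))

for n in (100, 1000, 10000):
    print(n, round(mean_excess_risk(n), 4))
```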

Example

Linear classifiers in d dimensions

Halfspaces in $\mathbb{R}^d$ have VC dimension on the order of $d$. In agnostic distribution-free classification, sample complexity is $\Theta(d/\epsilon^2)$ up to logarithmic and confidence factors. In $d = 100$ dimensions, you need roughly $100/\epsilon^2$ samples. At $\epsilon = 0.01$, that is $10^6$ samples. This explains why high-dimensional linear classification needs large datasets.

Common Confusions

Watch Out

Sample complexity is not the same as convergence rate

Sample complexity asks: how many samples for $\epsilon$-accuracy? Convergence rate asks: how fast does the excess risk decrease with $n$? They are related by inversion, but people often confuse the two. A rate of $O(1/\sqrt{n})$ corresponds to a sample complexity of $O(1/\epsilon^2)$. A rate of $O(1/n)$ corresponds to $O(1/\epsilon)$, which is called a "fast rate."
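The inversion is mechanical; a two-line sketch (hypothetical helper, constant $c$ assumed known):

```python
def n_from_rate(c, eps, fast=False):
    """Slow rate c / sqrt(n) <= eps gives n >= (c / eps)**2; fast rate c / n <= eps gives n >= c / eps."""
    return c / eps if fast else (c / eps) ** 2

print(n_from_rate(1.0, 0.01))             # 10,000 for a 1/sqrt(n) rate
print(n_from_rate(1.0, 0.01, fast=True))  # 100 for a 1/n "fast rate"
```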

Watch Out

Computational complexity is separate from sample complexity

Sample complexity ignores computation. An algorithm might need only $O(d/\epsilon^2)$ samples but require exponential time to process them. Cryptographic hardness results show that for some problems, computationally efficient algorithms provably need more samples than information-theoretically optimal (but computationally unbounded) ones.

Watch Out

Bounds with unspecified constants are hard to use in practice

The statement $n = O(d/\epsilon^2)$ hides a constant. That constant might be 1 or 1000. For practical sample size planning, you need either explicit constants (which are rarely tight) or empirical methods like learning curves. Sample complexity theory tells you the right scaling, not the right number.

Summary

  • For finite $\mathcal{H}$: $n = O(\log|\mathcal{H}|/\epsilon^2)$
  • For agnostic VC classes: $n = \Theta(d/\epsilon^2)$ up to log factors
  • For realizable VC classes: the dependence on $\epsilon$ can improve to roughly $1/\epsilon$
  • For Rademacher complexity $R(n)$: solve $R(n) \leq \epsilon/2$
  • Lower bounds via Fano require constructing hard packing sets
  • Sample complexity is the right way to compare learning algorithms across different regimes
  • The gap between upper and lower bounds measures our ignorance about a problem

Exercises

ExerciseCore

Problem

A hypothesis class has 500 elements. How many samples does ERM need to achieve excess risk $\leq 0.05$ with probability $\geq 0.95$?

ExerciseAdvanced

Problem

The class of halfspaces in $\mathbb{R}^{10}$ has VC dimension 11. Use the VC sample complexity bound to determine the order of magnitude of samples needed for $\epsilon = 0.01$ accuracy. Then explain why this is still useful despite the loose constants.

ExerciseResearch

Problem

Explain why the deterministic threshold construction does not prove an agnostic $\Omega(1/\epsilon^2)$ lower bound by itself. Then state the right lower-bound lesson for VC classes.

References


Current:

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapter 3
  • Wainwright, High-Dimensional Statistics (2019), Chapter 15 (minimax lower bounds)
  • Bartlett & Mendelson, "Rademacher and Gaussian Complexities: Risk Bounds and Structural Results," JMLR 3 (2002), 463-482 (jmlr.org). Rademacher complexity tightens the constants and removes the worst-case dependence on the function class structure that VC bounds carry.


Last reviewed: April 26, 2026
