Learning Theory Core
Sample Complexity Bounds
How many samples do you need to learn? Tight answers for finite hypothesis classes, VC classes, and Rademacher-bounded classes, plus matching lower bounds via Fano and Le Cam.
Prerequisites
Why This Matters
Sample complexity is the right currency for comparing learning algorithms. Asking "how accurate is this algorithm?" without specifying the sample size is meaningless. The question that matters is: how many samples $n$ does an algorithm need to achieve excess risk at most $\epsilon$ with probability at least $1-\delta$?
Every upper bound on generalization error from ERM, VC theory, or Rademacher complexity can be inverted to give a sample complexity bound. Lower bounds tell you no algorithm can do better, regardless of computational power.
Mental Model
Think of sample complexity as a budget. You want to buy $\epsilon$-accurate generalization. The price depends on how complex your hypothesis class is (measured by $\log|\mathcal{H}|$, VC dimension $d$, or Rademacher complexity $\mathfrak{R}_n$). Upper bounds tell you a price that is always sufficient. Lower bounds tell you the cheapest possible price.
The gap between upper and lower bounds measures how well we understand a learning problem.
Formal Setup
Fix an input space $\mathcal{X}$, an output space $\mathcal{Y}$, a loss function $\ell$ bounded in $[0,1]$, a hypothesis class $\mathcal{H}$, and accuracy parameters $\epsilon, \delta \in (0,1)$.
Sample Complexity
The sample complexity $n(\epsilon, \delta)$ of learning $\mathcal{H}$ to accuracy $\epsilon$ with confidence $1-\delta$ is the smallest $n$ such that there exists an algorithm $A$ satisfying:

$$\sup_{\mathcal{D}} \; \Pr_{S \sim \mathcal{D}^n}\!\left[ R(A(S)) - \inf_{h \in \mathcal{H}} R(h) > \epsilon \right] \le \delta.$$

The supremum is over all distributions $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$.
This is a worst-case definition. The distribution is adversarial.
Main Theorems
Sample Complexity of Finite Hypothesis Classes
Statement
For a finite hypothesis class $\mathcal{H}$ with loss in $[0,1]$, ERM achieves excess risk at most $\epsilon$ with probability at least $1-\delta$ whenever:

$$n \ge \frac{2\left(\log|\mathcal{H}| + \log(2/\delta)\right)}{\epsilon^2}.$$

Equivalently, the sample complexity satisfies $n(\epsilon,\delta) = O\!\left(\frac{\log|\mathcal{H}| + \log(1/\delta)}{\epsilon^2}\right)$.
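To make the inversion concrete, here is a small calculator (a sketch: the constant comes from the Hoeffding-plus-union-bound form above, and tighter constants exist):

```python
import math

def finite_class_sample_complexity(H_size: int, eps: float, delta: float) -> int:
    """Samples sufficient for ERM over a finite class with loss in [0, 1]:
    n >= 2 * (ln|H| + ln(2/delta)) / eps^2  (Hoeffding + union bound)."""
    return math.ceil(2 * (math.log(H_size) + math.log(2 / delta)) / eps**2)
```

Doubling the class size barely moves the answer (the cost is logarithmic), while halving $\epsilon$ quadruples it.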
Intuition
Each hypothesis needs roughly $1/\epsilon^2$ samples to pin down its population risk (by Hoeffding). The union bound over $|\mathcal{H}|$ hypotheses costs a factor of $\log|\mathcal{H}|$, not $|\mathcal{H}|$, because the bound enters through an exponential inequality.
Proof Sketch
From the ERM generalization bound for finite classes, with probability $1-\delta$:

$$\sup_{h \in \mathcal{H}} \left| R(h) - \hat{R}(h) \right| \le \sqrt{\frac{\log(2|\mathcal{H}|/\delta)}{2n}}.$$

Set this $\le \epsilon/2$ and solve for $n$. Since the ERM hypothesis $\hat{h}$ satisfies $\hat{R}(\hat{h}) \le \hat{R}(h)$ for every $h \in \mathcal{H}$, the excess risk is at most twice the uniform deviation, hence at most $\epsilon$.
Why It Matters
This is the baseline sample complexity result. It shows that finite classes are always learnable, and the cost is logarithmic in the class size. A class with $2^k$ hypotheses needs only $O(k/\epsilon^2)$ samples, not $O(2^k)$.
Failure Mode
The bound is vacuous for infinite classes. It also gives no help when $\mathcal{H}$ is finite but astronomically large (e.g., all possible decision trees of a given depth), because the $\log|\mathcal{H}|$ term may be too large to be useful.
Realizable Finite-Class Sample Complexity
Statement
Under realizability (some $h^\star \in \mathcal{H}$ has $R(h^\star) = 0$) with 0-1 loss and finite $\mathcal{H}$, any ERM hypothesis (i.e., any $h$ consistent with the training set) satisfies $R(h) \le \epsilon$ with probability $1-\delta$ whenever:

$$n \ge \frac{\log|\mathcal{H}| + \log(1/\delta)}{\epsilon}.$$

The dependence on $1/\epsilon$ is linear, not quadratic. Realizability buys a full factor of $1/\epsilon$ over the agnostic case.
Intuition
When some $h^\star$ achieves zero error, ERM can refuse to return any hypothesis with positive empirical error. The only way ERM fails is if a bad hypothesis (one with $R(h) > \epsilon$) happens to look perfect on the training set. The probability of this shrinks like $(1-\epsilon)^n \le e^{-\epsilon n}$ per bad hypothesis, giving the $1/\epsilon$ rate.
Proof Sketch
We derive the bound step by step; the same template appears in PAC learning and is the workhorse of finite-class learning theory.
Step 1: Single bad hypothesis. Fix $h$ with $R(h) > \epsilon$. Each training point has probability at least $\epsilon$ of exposing $h$ (i.e., $h(x_i) \ne y_i$). By independence, the probability $h$ survives all $n$ checks is

$$\Pr[h \text{ consistent with } S] \le (1-\epsilon)^n \le e^{-\epsilon n}.$$

The final inequality uses $1 - x \le e^{-x}$: the line $1-x$ is the tangent to $e^{-x}$ at $x = 0$, and the convex curve sits above its tangent.
Step 2: Union bound. Let $B = \{h \in \mathcal{H} : R(h) > \epsilon\}$. The failure event is "some bad hypothesis is consistent with $S$", which has probability

$$\Pr[\exists h \in B \text{ consistent with } S] \le |B|\, e^{-\epsilon n} \le |\mathcal{H}|\, e^{-\epsilon n}.$$

Step 3: Solve for $n$. Setting $|\mathcal{H}|\, e^{-\epsilon n} \le \delta$, taking logs, and dividing by $\epsilon$:

$$n \ge \frac{\log|\mathcal{H}| + \log(1/\delta)}{\epsilon}.$$
That is the bound. ERM cannot return a bad hypothesis on the high-probability event because $h^\star$ is consistent with $S$, so any consistent ERM output has $R(\hat{h}) \le \epsilon$.
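The realizable and agnostic bounds can be compared numerically (a sketch; the constants follow the derivations above and are illustrative, not tight):

```python
import math

def realizable_n(H_size: int, eps: float, delta: float) -> int:
    """n >= (ln|H| + ln(1/delta)) / eps, from the (1-eps)^n tail + union bound."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def agnostic_n(H_size: int, eps: float, delta: float) -> int:
    """n >= 2 * (ln|H| + ln(2/delta)) / eps^2, from Hoeffding + union bound."""
    return math.ceil(2 * (math.log(H_size) + math.log(2 / delta)) / eps**2)
```

At $\epsilon = 0.01$ with $|\mathcal{H}| = 1000$ and $\delta = 0.05$, the realizable bound asks for about a thousand samples, the agnostic bound for over two hundred thousand.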
Why It Matters
The $\log|\mathcal{H}|$ factor is the key. You pay logarithmically in the size of the hypothesis class, not linearly. For infinite $\mathcal{H}$, this argument fails because $\log|\mathcal{H}| = \infty$, and you need VC theory to replace the union bound with a polynomial growth-function argument.
The realizable $1/\epsilon$ rate versus the agnostic $1/\epsilon^2$ rate is a real distinction: at $\epsilon = 0.01$, the realizable bound asks for on the order of $10^2$ times the log terms while the agnostic bound asks for on the order of $10^4$ times them, two orders of magnitude more. This gap is the formal cost of not knowing whether the target sits in your class.
Failure Mode
This bound requires realizability. Without it, no $h \in \mathcal{H}$ achieves zero training error, the $(1-\epsilon)^n$ tail no longer applies, and the rate degrades to the agnostic $1/\epsilon^2$ from Hoeffding-style concentration. The bound is also vacuous for infinite hypothesis classes (linear classifiers, neural networks); use VC dimension or Rademacher complexity instead.
Agnostic Sample Complexity via VC Dimension
Statement
For agnostic binary classification with 0-1 loss, if $\mathcal{H}$ has VC dimension $d$:

Upper bound: There exists a constant $C$ such that ERM achieves excess risk $\le \epsilon$ with probability $\ge 1-\delta$ whenever:

$$n \ge C \cdot \frac{d + \log(1/\delta)}{\epsilon^2}.$$

Lower bound: There exist constants $c, \epsilon_0, \delta_0 > 0$ such that for any algorithm, for all $\epsilon \le \epsilon_0$ and $\delta \le \delta_0$:

$$n(\epsilon, \delta) \ge c \cdot \frac{d + \log(1/\delta)}{\epsilon^2}.$$
Intuition
VC dimension replaces $\log|\mathcal{H}|$ for infinite classes. The Sauer-Shelah lemma shows that the effective number of behaviors of $\mathcal{H}$ on $n$ points is at most $(en/d)^d$. Taking logs recovers $d \log(n/d)$, which leads to a sample complexity of $d/\epsilon^2$ up to log factors. The lower bound shows this is tight.
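The Sauer-Shelah count is easy to check numerically; a sketch comparing the exact binomial sum against the $(en/d)^d$ closed form and the trivial $2^n$ count:

```python
import math

def growth_bound(n: int, d: int) -> int:
    """Sauer-Shelah: a class of VC dimension d realizes at most
    sum_{i=0}^{d} C(n, i) distinct labelings on n points."""
    return sum(math.comb(n, i) for i in range(d + 1))

n, d = 100, 3
exact = growth_bound(n, d)        # polynomial in n
loose = (math.e * n / d) ** d     # the (en/d)^d closed form
```

Both counts are vastly smaller than $2^{100}$, which is why uniform convergence survives over an infinite class.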
Proof Sketch
Upper bound: Apply the VC generalization bound: with probability $1-\delta$, the uniform deviation is $O\!\left(\sqrt{\frac{d \log(n/d) + \log(1/\delta)}{n}}\right)$. Setting this $\le \epsilon/2$ and solving for $n$ gives $n = O\!\left(\frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon^2}\right)$ (the log factor is absorbed because $n$ appears on both sides, and the solution satisfies $\log(n/d) = O(\log(1/\epsilon))$, which simplifies).

Lower bound: Construct a set of $2^{\Omega(d)}$ distributions that are pairwise hard to distinguish. Apply Fano's inequality.
Why It Matters
This is the noisy, distribution-free version of the VC sample-complexity story. Finite VC dimension characterizes learnability for binary classification, but the rate depends on the setting: realizable problems can have $d/\epsilon$ scaling, while agnostic excess-risk learning has the slower $d/\epsilon^2$ scaling.
Failure Mode
The VC bound applies only to binary classification with 0-1 loss. For regression, multi-class classification, or other loss functions, you need different complexity measures (pseudo-dimension, Natarajan dimension, or Rademacher complexity). The constants $C$ and $c$ are often not known tightly, making the bound hard to use for choosing $n$ in practice.
Sample Complexity via Rademacher Complexity
Statement
If the Rademacher complexity of the loss class satisfies $\mathfrak{R}_n(\ell \circ \mathcal{H}) \le \phi(n)$, then ERM achieves excess risk $\le \epsilon$ with probability $\ge 1-\delta$ whenever:

$$4\,\phi(n) + 2\sqrt{\frac{\log(2/\delta)}{2n}} \le \epsilon.$$

For many classes, $\phi(n) = O(\sqrt{d/n})$, giving sample complexity $n = O\!\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)$.
Intuition
Rademacher complexity directly measures how well the class can fit random noise. If the class cannot fit noise well (small $\mathfrak{R}_n$), it cannot overfit real data either. The sample complexity is determined by how fast $\mathfrak{R}_n$ decays with $n$.
Proof Sketch
The Rademacher generalization bound states that with probability $1-\delta$, for all $h \in \mathcal{H}$:

$$R(h) \le \hat{R}(h) + 2\,\mathfrak{R}_n(\ell \circ \mathcal{H}) + \sqrt{\frac{\log(2/\delta)}{2n}}.$$

Set the deviation terms $\le \epsilon/2$, apply the usual two-sided ERM argument, and solve for $n$.
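As a sanity check on how $\mathfrak{R}_n$ decays with $n$, here is a Monte Carlo estimate of the empirical Rademacher complexity of the threshold class on the line (a sketch: after sorting, each threshold selects a suffix of the sample, so the supremum is the best suffix sum of the signs, or zero for the empty suffix):

```python
import random

def rademacher_thresholds(n: int, trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of E_sigma sup_t (1/n) sum_i sigma_i * 1[x_i >= t]
    for n points on the line. Only the order of the points matters, so no
    actual sample values are needed."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        best = suffix = 0
        for s in reversed(sigma):   # scan suffix sums of the sorted sample
            suffix += s
            best = max(best, suffix)
        total += best / n
    return total / trials
```

The estimate shrinks as $n$ grows, consistent with an $O(1/\sqrt{n})$-type decay for this VC-dimension-1 class.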
Why It Matters
Rademacher bounds are data-dependent: you can estimate $\hat{\mathfrak{R}}_S(\mathcal{H})$ from the training sample itself. This gives tighter sample complexity for specific distributions, not just worst-case over all distributions. For $\ell_1$- or $\ell_2$-bounded linear classes in $\mathbb{R}^d$, the chaining route gives $\mathfrak{R}_n = O(\sqrt{d/n})$, recovering the VC result without going through combinatorics. For the $\ell_2$-bounded class $\{x \mapsto \langle w, x \rangle : \|w\|_2 \le B\}$ with $\|x\|_2 \le R$, a direct Cauchy-Schwarz / Khintchine argument gives a sharper dimension-free bound $BR/\sqrt{n}$. This is the bound used for SVM, ridge, and kernel analyses; the $\sqrt{d/n}$ rate is not the right rate for $\ell_2$-bounded classes.
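The dimension-free claim is easy to verify: the Cauchy-Schwarz step yields an explicit, computable bound that never mentions the ambient dimension (a sketch, with $B$ and $R$ as in the text):

```python
import math

def linear_rademacher_bound(xs: list, B: float) -> float:
    """Cauchy-Schwarz bound for the class {x -> <w, x> : ||w||_2 <= B}:
    R_hat_n <= (B/n) * sqrt(sum_i ||x_i||^2) <= B * R / sqrt(n)
    whenever ||x_i||_2 <= R. No dependence on the dimension of x."""
    n = len(xs)
    sq = sum(sum(c * c for c in x) for x in xs)
    return B * math.sqrt(sq) / n
```

For 100 unit-norm inputs and $B = 1$, the bound is $1/\sqrt{100} = 0.1$, whether the vectors live in 3 or 3 million dimensions.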
Failure Mode
Computing or bounding can be difficult for complex hypothesis classes (e.g., deep neural networks). The resulting bounds are often loose in practice and do not explain the generalization of overparameterized models.
Lower Bound Methods
Upper bounds say "this many samples suffice." Lower bounds say "no algorithm can do better." Two classical methods:
Fano's method. Construct $M$ distributions $P_1, \dots, P_M$ such that: (1) any two require different hypotheses (i.e., their optimal hypotheses are $2\epsilon$-separated), and (2) the distributions are hard to tell apart (measured by KL divergence, say $\mathrm{KL}(P_i \,\|\, P_j) \le \beta$). Fano's inequality then gives:

$$\inf_{\hat{h}} \max_i \Pr_{P_i}[\text{error} \ge \epsilon] \ge 1 - \frac{n\beta + \log 2}{\log M}.$$

Le Cam's method. A two-hypothesis version. Construct $P_1$ and $P_2$ with different optimal hypotheses. The minimax risk is at least $\frac{\epsilon}{2}\left(1 - \|P_1^n - P_2^n\|_{\mathrm{TV}}\right)$. Simpler than Fano but gives weaker results (no $\log M$ factor).
Fano Method Lower Bound
Statement
Let $P_1, \dots, P_M$ be distributions such that for all $i \ne j$, the associated optimal hypotheses satisfy $d(h_i^\star, h_j^\star) \ge 2\epsilon$ and $\mathrm{KL}(P_i \,\|\, P_j) \le \beta$. Then for any estimator $\hat{h}$ based on $n$ i.i.d. samples:

$$\max_i \Pr_{P_i}\!\left[ d(\hat{h}, h_i^\star) \ge \epsilon \right] \ge 1 - \frac{n\beta + \log 2}{\log M}.$$
Intuition
If you have $M$ hypotheses that all look similar (low KL divergence) but require different answers, then no algorithm can reliably pick the right one unless $n$ is large enough for the total information $n\beta$ to exceed $\log M$.
Proof Sketch
Apply Fano's inequality to the $M$-ary hypothesis testing problem. The mutual information between the index $J$ and the sample $S$ is at most $n\beta$ (by the data processing inequality and the KL bound). Fano's inequality bounds the probability of error in terms of $I(J; S)$.
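The Fano bound is a one-line computation once the packing parameters $M$ and $\beta$ are fixed; a sketch calculator:

```python
import math

def fano_error_prob(M: int, n: int, beta: float) -> float:
    """Fano lower bound on worst-case identification error:
    M hypotheses, per-sample pairwise KL at most beta, n i.i.d. samples."""
    return max(0.0, 1 - (n * beta + math.log(2)) / math.log(M))

def fano_min_samples(M: int, beta: float) -> float:
    """Samples needed before the Fano bound allows error below 1/2."""
    return (0.5 * math.log(M) - math.log(2)) / beta
```

With $M = 1024$ well-separated hypotheses and $\beta = 0.01$, any algorithm is forced to err with probability at least $0.9$ when given no data, and needs hundreds of samples before the bound permits error below $1/2$.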
Why It Matters
This is the workhorse of minimax lower bounds. Nearly every known sample complexity lower bound in nonparametric statistics and learning theory uses some form of Fano's method or its generalizations.
Failure Mode
Constructing the right packing (the distributions) is the hard part. A poor choice of packing gives a weak lower bound. The bound is also limited to worst-case analysis; it says nothing about specific distributions that might be easier.
The Gap Between Upper and Lower Bounds
For agnostic binary classification with VC dimension :
- Classical upper bound: $O\!\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)$ from standard VC uniform-convergence arguments
- Information-theoretic lower bound: $\Omega\!\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)$ from noisy packing/Fano constructions
The clean mental model is: realizable/noiseless learning can scale like $1/\epsilon$, while agnostic/noisy excess-risk learning usually pays the slower $1/\epsilon^2$ concentration price. Some classical upper bounds carry extra logarithmic factors; sharper theory studies when those logs are artifacts of the proof rather than a real sample requirement.
Canonical Examples
Learning threshold classifiers on the line
Let $\mathcal{X} = \mathbb{R}$ and $\mathcal{H} = \{h_t(x) = \mathbb{1}[x \ge t] : t \in \mathbb{R}\}$. This class has VC dimension $1$. In the agnostic/noisy setting, the worst-case excess-risk sample complexity is $\Theta\!\left(\frac{1 + \log(1/\delta)}{\epsilon^2}\right)$. In the realizable/noiseless threshold setting, the dependence improves to roughly $1/\epsilon$ up to confidence factors.
With $n$ samples, ERM achieves excess risk $O(1/\sqrt{n})$ with high probability. With $n = 10{,}000$ samples, excess risk $\approx 0.01$.
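A quick simulation of ERM over thresholds in the realizable setting (a sketch; the true threshold $0.5$ and uniform inputs are illustrative choices, and under Uniform$[0,1]$ inputs the excess risk of $h_{\hat t}$ is exactly $|\hat t - 0.5|$):

```python
import random

def erm_threshold(sample):
    """Return the sample-point threshold with the fewest training mistakes."""
    candidates = [float("-inf")] + [x for x, _ in sample]
    def errors(t):
        return sum((x >= t) != y for x, y in sample)
    return min(candidates, key=errors)

rng = random.Random(0)
n = 500
sample = [(x, x >= 0.5) for x in (rng.random() for _ in range(n))]
t_hat = erm_threshold(sample)
excess_risk = abs(t_hat - 0.5)   # risk of 1[x >= t_hat] under Uniform[0, 1]
```

With 500 realizable samples the gap between $\hat t$ and the true threshold is on the order of $1/n$, far below the agnostic $1/\sqrt{n}$ prediction, illustrating the fast realizable rate.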
Linear classifiers in d dimensions
Halfspaces in $\mathbb{R}^d$ have VC dimension on the order of $d$ (exactly $d+1$). In agnostic distribution-free classification, sample complexity is $\Theta(d/\epsilon^2)$ up to logarithmic and confidence factors. In $d$ dimensions, you need roughly $d/\epsilon^2$ samples. At $d = 100$ and $\epsilon = 0.01$, that is on the order of $10^6$ samples. This explains why high-dimensional linear classification needs large datasets.
Common Confusions
Sample complexity is not the same as convergence rate
Sample complexity asks: how many samples for $\epsilon$-accuracy? Convergence rate asks: how fast does the excess risk decrease with $n$? They are related by inversion, but people often confuse the two. A rate of $O(1/\sqrt{n})$ corresponds to a sample complexity of $O(1/\epsilon^2)$. A rate of $O(1/n)$ corresponds to $O(1/\epsilon)$, which is called a "fast rate."
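The inversion is mechanical; a sketch with the leading constant $C$ left as a parameter:

```python
import math

def n_for_slow_rate(eps: float, C: float = 1.0) -> int:
    """Invert the slow rate: excess risk C / sqrt(n) <= eps."""
    return math.ceil((C / eps) ** 2)

def n_for_fast_rate(eps: float, C: float = 1.0) -> int:
    """Invert the fast rate: excess risk C / n <= eps."""
    return math.ceil(C / eps)
```

At $\epsilon = 0.01$ the two prescriptions differ by a factor of 100, which is exactly the realizable-versus-agnostic gap discussed above.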
Computational complexity is separate from sample complexity
Sample complexity ignores computation. An algorithm might need only polynomially many samples but require exponential time to process them. Cryptographic hardness results show that for some problems, computationally efficient algorithms provably need more samples than information-theoretically optimal (but computationally unbounded) ones.
Bounds with unspecified constants are hard to use in practice
The statement $n = O(d/\epsilon^2)$ hides a constant. That constant might be 1 or 1000. For practical sample size planning, you need either explicit constants (which are rarely tight) or empirical methods like learning curves. Sample complexity theory tells you the right scaling, not the right number.
Summary
- For finite $\mathcal{H}$: $n = O\!\left(\frac{\log|\mathcal{H}| + \log(1/\delta)}{\epsilon^2}\right)$
- For agnostic VC classes: $n = \Theta\!\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)$ up to log factors
- For realizable VC classes: the dependence on $\epsilon$ can improve to roughly $d/\epsilon$
- For Rademacher complexity $\mathfrak{R}_n$: solve $\mathfrak{R}_n(\ell \circ \mathcal{H}) \approx \epsilon$ for $n$, up to constants and the confidence term
- Lower bounds via Fano require constructing hard packing sets
- Sample complexity is the right way to compare learning algorithms across different regimes
- The gap between upper and lower bounds measures our ignorance about a problem
Exercises
Problem
A hypothesis class has 500 elements. How many samples does ERM need to achieve excess risk at most $\epsilon$ with probability at least $1-\delta$?
Problem
The class of halfspaces in $\mathbb{R}^{10}$ has VC dimension 11. Use the VC sample complexity bound to determine the order of magnitude of samples needed for accuracy $\epsilon$. Then explain why this is still useful despite the loose constants.
Problem
Explain why the deterministic threshold construction does not prove an agnostic lower bound by itself. Then state the right lower-bound lesson for VC classes.
References
Canonical:
- Shalev-Shwartz & Ben-David, Understanding Machine Learning, Chapters 3-6
- Vapnik, Statistical Learning Theory (1998), Chapters 3-4
Current:
- Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapter 3
- Wainwright, High-Dimensional Statistics (2019), Chapter 15 (minimax lower bounds)
- Bartlett & Mendelson, "Rademacher and Gaussian Complexities: Risk Bounds and Structural Results," JMLR 3 (2002), 463-482. Rademacher complexity tightens the constants and removes the worst-case dependence on the function class structure that VC bounds carry.
Next Topics
The natural extensions from sample complexity:
- Algorithmic stability: an alternative route to generalization that bypasses uniform convergence
- PAC-Bayes: bounds that depend on a prior over hypotheses, often tighter for overparameterized models
Last reviewed: April 26, 2026