
Learning Theory Core

Hypothesis Classes and Function Spaces

What is a hypothesis class, why the choice of hypothesis class determines what ERM can learn, and the approximation-estimation tradeoff: bigger classes reduce approximation error but increase estimation error.

Core · Tier 1 · Stable · Supporting · ~45 min

Why This Matters

[Figure: The class sets the learning problem. A hypothesis class controls what the learner can express and how much data it needs: bigger classes reduce approximation error but raise the statistical price. The diagram runs from linear (too small, underfit) through polynomial (balanced) to all measurable functions (too large, overfit risk), marking the approximation gap between the best hypothesis in $\mathcal{H}$ and the truth.]

Excess risk ledger:

  • Target: $R^*$ is the Bayes risk, attained by a predictor that usually lies outside the chosen class.
  • Approximation error: shrinks when the class gets richer.
  • Estimation error: usually grows when the class gets richer, for fixed sample size.

When you train a model, you do not search over all possible functions. You search over a specific set: linear functions, decision trees, neural networks with a given architecture. This set is the hypothesis class. The choice of hypothesis class is the most consequential decision in learning, because it determines both what you can learn and what you cannot.

A hypothesis class that is too small cannot approximate the true function (high approximation error). A class that is too large needs too many samples to learn reliably (high estimation error). Much of learning theory is about understanding this tradeoff.
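The tradeoff is easy to see numerically. Below is a minimal sketch (assuming NumPy and scikit-learn are available; the target function, noise level, sample size, and degrees are illustrative choices, not part of any theorem): polynomial classes of increasing degree are fit to the same small sample, and the training error falls monotonically while the test error typically traces the classic U-shape.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 40
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(n)   # truth sin(x) lies in no polynomial class
X_test = np.linspace(-3, 3, 500).reshape(-1, 1)
y_test = np.sin(X_test[:, 0])                        # noiseless targets approximate the risk

for degree in [1, 3, 6, 12]:                          # nested classes of growing capacity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}  test MSE {test_mse:.3f}")
```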

Mental Model

The hypothesis class fixes both what the algorithm can express and what it has to choose among. A class that is too restrictive cannot fit the target function: large approximation error. A class that is too rich requires more data to identify the right function from the rest: large estimation error. The size and structure of $\mathcal{H}$ control where the bias-variance balance sits at sample size $n$.

Formal Setup

Definition

Hypothesis Class

A hypothesis class is a set of functions $\mathcal{H} \subseteq \{h : \mathcal{X} \to \mathcal{Y}\}$ that a learning algorithm considers as candidate predictors. The algorithm selects $\hat{h} \in \mathcal{H}$ based on training data.

Definition

Realizable Setting

The learning problem is realizable if and only if the true data-generating function $f^*$ belongs to $\mathcal{H}$: there exists $h^* \in \mathcal{H}$ with $R(h^*) = R^*$, where $R^*$ is the Bayes risk. In the realizable setting, the approximation error is zero.

Definition

Agnostic Setting

The learning problem is agnostic if and only if we make no assumption about whether $f^* \in \mathcal{H}$. The best we can hope for is to find $\hat{h}$ whose risk is close to $\min_{h \in \mathcal{H}} R(h)$, which may be strictly larger than $R^*$.

Examples of Hypothesis Classes

Finite hypothesis classes. $\mathcal{H} = \{h_1, \ldots, h_M\}$. The ERM generalization bound from the ERM topic gives a gap of $O(\sqrt{\log M / n})$.
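As a quick numerical sense check, here is a sketch using the Hoeffding-plus-union-bound form of that result for a loss bounded in $[0, 1]$ (exact constants differ by convention): the gap grows only logarithmically in $M$.

```python
import math

def finite_class_gap(M: int, n: int, delta: float = 0.05) -> float:
    """Uniform gap sqrt(log(2M/delta) / (2n)) from Hoeffding plus a union bound."""
    return math.sqrt(math.log(2 * M / delta) / (2 * n))

for M in (10, 10**3, 10**6):
    print(f"M = {M:>7}, n = 1000: gap <= {finite_class_gap(M, n=1000):.3f}")
```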

Linear classifiers. $\mathcal{H} = \{x \mapsto \text{sign}(w^T x + b) : w \in \mathbb{R}^d, b \in \mathbb{R}\}$. This is an infinite class parameterized by $d + 1$ real numbers. The VC dimension is $d + 1$.

Neural networks with fixed architecture. $\mathcal{H} = \{x \mapsto f_\theta(x) : \theta \in \mathbb{R}^p\}$ for a network $f_\theta$ with $p$ parameters. The class is parameterized by $p$ real numbers, but its effective complexity depends on the architecture, not just $p$.

Nonparametric classes. $\mathcal{H} = \{f : \|f\|_{\text{RKHS}} \leq B\}$ for a reproducing kernel Hilbert space. The "number of parameters" is not fixed; it can grow with $n$. Examples: kernel SVM, Gaussian process regression.

All measurable functions. $\mathcal{H} = \{h : \mathcal{X} \to \mathcal{Y}\}$. This class can memorize any dataset, giving zero training error. But it provides no generalization: the estimation error is maximal.
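A small demonstration of why memorization alone buys nothing (a sketch, assuming scikit-learn; a 1-nearest-neighbor rule stands in for a class rich enough to fit anything): trained on labels that are pure noise, it reaches zero training error but stays at chance accuracy on fresh data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 5))
y_train = rng.integers(0, 2, size=200)      # labels independent of X: nothing to learn
X_test = rng.standard_normal((200, 5))
y_test = rng.integers(0, 2, size=200)

memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("train accuracy:", memorizer.score(X_train, y_train))   # 1.0: zero empirical risk
print("test accuracy: ", memorizer.score(X_test, y_test))     # about 0.5: no generalization
```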

Main Theorems

Theorem

Excess Risk Decomposition

Statement

The excess risk of the ERM hypothesis $\hat{h}$ over the Bayes-optimal predictor $h^*_{\text{Bayes}}$ decomposes as:

$$
R(\hat{h}) - R^* = \underbrace{\left(\min_{h \in \mathcal{H}} R(h) - R^*\right)}_{\text{approximation error}} + \underbrace{\left(R(\hat{h}) - \min_{h \in \mathcal{H}} R(h)\right)}_{\text{estimation error}}
$$

The approximation error depends only on $\mathcal{H}$ (not on the sample). The estimation error depends on both $\mathcal{H}$ and the sample size $n$.

Intuition

The approximation error asks: how close can the best function in $\mathcal{H}$ get to the truth? The estimation error asks: how close can ERM get to the best function in $\mathcal{H}$, given finite data? These are two different sources of error, and they pull in opposite directions as you vary the size of $\mathcal{H}$.

Proof Sketch

This is an identity: add and subtract $\min_{h \in \mathcal{H}} R(h)$. The content is in recognizing that the two terms have different dependencies and require different tools to bound.

Why It Matters

This decomposition is the backbone of model selection. Regularization, cross-validation, and structural risk minimization are all about balancing these two terms. When someone says a model is "too complex" or "too simple," they are implicitly referring to this tradeoff.

Failure Mode

This decomposition does not account for optimization error. In practice, ERM does not find the exact minimizer of empirical risk (especially for non-convex problems like neural networks). The true decomposition has three terms: approximation + estimation + optimization. The optimization error can dominate for large models trained with SGD.
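To make the two terms of the decomposition concrete, here is a toy numerical check (a sketch; the threshold class, the target $y = \mathbf{1}[\sin(4x) > 0]$, the grid of thresholds, and the Monte Carlo approximation of the population risk are all illustrative choices). The target is noiseless, so the Bayes risk is zero and the best-in-class risk equals the approximation error; ERM on a small sample adds estimation error on top.

```python
import numpy as np

rng = np.random.default_rng(0)

def label(x):
    return (np.sin(4 * x) > 0).astype(int)            # noiseless target, so Bayes risk R* = 0

def risk(t, X, y):
    return np.mean((X > t).astype(int) != y)          # 0-1 risk of the threshold rule 1[x > t]

thresholds = np.linspace(-1, 1, 401)                   # a fine grid standing in for the class
X_pop = rng.uniform(-1, 1, 100_000)                    # large sample approximates the population
y_pop = label(X_pop)

pop_risks = np.array([risk(t, X_pop, y_pop) for t in thresholds])
approx_error = pop_risks.min()                         # min_H R(h) - R*, with R* = 0

n = 50
X_tr = rng.uniform(-1, 1, n)
y_tr = label(X_tr)
emp_risks = np.array([risk(t, X_tr, y_tr) for t in thresholds])
t_hat = thresholds[emp_risks.argmin()]                 # ERM over the grid
est_error = risk(t_hat, X_pop, y_pop) - approx_error   # R(h_hat) - min_H R(h)

print(f"approximation error ~ {approx_error:.3f}, estimation error ~ {est_error:.3f}")
```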

Proposition

Nested Classes and Monotonicity

Statement

For nested hypothesis classes $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \mathcal{H}_3$:

  1. Approximation error is non-increasing: $\min_{h \in \mathcal{H}_3} R(h) \leq \min_{h \in \mathcal{H}_2} R(h) \leq \min_{h \in \mathcal{H}_1} R(h)$

  2. Estimation error is non-decreasing (in expectation, for fixed $n$): the uniform convergence gap grows with the complexity of $\mathcal{H}$.

The optimal class $\mathcal{H}^*$ minimizes the sum. In general, it is neither the smallest nor the largest class.

Intuition

A bigger class always contains the best functions from a smaller class, so the approximation error can only improve. But a bigger class also gives ERM more room to overfit, so the estimation error grows. The sweet spot balances these opposing forces.

Proof Sketch

Part 1: $\mathcal{H}_1 \subset \mathcal{H}_2$ implies $\min_{\mathcal{H}_2} R(h) \leq \min_{\mathcal{H}_1} R(h)$ by minimizing over a larger set. Part 2: the uniform convergence bound for $\mathcal{H}_2$ depends on the complexity of $\mathcal{H}_2$ (e.g., VC dimension), which is at least that of $\mathcal{H}_1$.

Why It Matters

This is the formal version of the bias-variance tradeoff. It explains why increasing model capacity helps up to a point and then hurts (in classical statistics). It also frames the puzzle of modern overparameterized models: why does estimation error not blow up when $\mathcal{H}$ is massive? The answer involves implicit regularization and is covered in the double descent and benign overfitting topics.

Failure Mode

The monotonicity of estimation error relies on uniform convergence bounds. For overparameterized models, uniform convergence bounds are vacuous, and the actual estimation error may decrease with increasing class size. This is the "interpolation regime" studied in double descent.
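In the classical regime, though, the two monotonicities of the proposition are easy to exhibit numerically. The sketch below uses an illustrative setup (squared loss, target $\sin(4x)$ plus Gaussian noise so the Bayes risk is the noise variance, nested polynomial classes $\mathcal{H}_d$, and a large sample standing in for the population): the approximation error falls with degree, while the estimation error at a fixed small $n$ tends to grow.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3
f_star = lambda x: np.sin(4 * x)

X_pop = rng.uniform(-1, 1, 50_000)                   # large sample stands in for the population
X_tr = rng.uniform(-1, 1, 30)                        # fixed, small training sample
y_tr = f_star(X_tr) + sigma * rng.standard_normal(30)

for d in (1, 3, 5, 7, 9, 12):                         # nested classes: H_1 within H_3 within ...
    c_best = np.polyfit(X_pop, f_star(X_pop), d)      # best-in-class fit to the population
    approx = np.mean((np.polyval(c_best, X_pop) - f_star(X_pop)) ** 2)
    c_hat = np.polyfit(X_tr, y_tr, d)                 # ERM (least squares) on the sample
    excess = np.mean((np.polyval(c_hat, X_pop) - f_star(X_pop)) ** 2)
    print(f"degree {d:2d}: approx error {approx:.4f}  estimation error {excess - approx:.4f}")
```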

Parameterized vs Nonparametric

A parametric hypothesis class has a fixed-dimensional parameter space $\Theta \subseteq \mathbb{R}^p$. The number of parameters $p$ does not change with the sample size $n$. Examples: linear regression ($p = d + 1$), logistic regression, fixed-architecture neural networks.

A nonparametric hypothesis class allows the effective number of parameters to grow with $n$. Examples: $k$-nearest neighbors (stores all training points), kernel methods (one coefficient per data point), random forests (depth grows with $n$).

The distinction matters for generalization bounds: parametric classes have fixed complexity, while nonparametric classes require bounds that account for growing complexity.
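A minimal illustration of the distinction (a sketch assuming scikit-learn; the models and sizes are arbitrary): a linear model keeps $d + 1$ numbers no matter how much data it sees, while a 1-nearest-neighbor regressor must retain the entire training set to make predictions, so its memory footprint grows linearly with $n$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
d = 5
for n in (100, 1_000, 10_000):
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
    lin = LinearRegression().fit(X, y)                 # parametric: d weights + 1 intercept
    knn = KNeighborsRegressor(n_neighbors=1).fit(X, y) # nonparametric: keeps all n points
    lin_params = lin.coef_.size + 1                    # independent of n
    knn_stored = X.size + y.size                       # grows linearly with n
    print(f"n={n:6d}: linear stores {lin_params} numbers, 1-NN stores {knn_stored}")
```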

Common Confusions

Watch Out

More parameters does not always mean bigger hypothesis class

A neural network with $10^9$ parameters might have effective complexity much less than $10^9$ due to symmetries, implicit regularization from SGD, and architectural constraints. Parameter count is a coarse proxy for hypothesis-class capacity, not a tight measure of it: for ReLU networks the VC dimension is $\Theta(W L \log W)$ in the number of weights $W$ and depth $L$ (Bartlett, Harvey, Liaw, Mehrabian 2019), and norm-based, Rademacher, and PAC-Bayes bounds attempt to measure effective complexity more directly than the raw parameter count.
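To get a feel for how the $\Theta(W L \log W)$ scaling trades off weight count against depth, here is a small sketch; it evaluates only the scaling term for a hypothetical fully connected ReLU network and ignores the unspecified constants in the theorem, so the numbers are orders of magnitude, not VC dimensions.

```python
import math

def relu_vc_scaling(widths):
    """W * L * log(W) for a fully connected ReLU net with the given layer widths.
    W = number of weights, L = number of layers; theorem constants are ignored."""
    W = sum(a * b for a, b in zip(widths[:-1], widths[1:]))
    L = len(widths) - 1
    return W * L * math.log(W)

for widths in ([784, 100, 10], [784, 1000, 1000, 10]):
    W = sum(a * b for a, b in zip(widths[:-1], widths[1:]))
    print(f"layers {widths}: weights {W:,}, VC scaling ~ {relu_vc_scaling(widths):,.0f}")
```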

Watch Out

The hypothesis class is chosen before seeing data

In the ERM framework, $\mathcal{H}$ is fixed before observing the training set. Using the training data to select $\mathcal{H}$ (e.g., choosing the network architecture via cross-validation) introduces a dependence that invalidates simple ERM bounds. Structural risk minimization (SRM) and model selection theory handle this case.

Watch Out

Approximation error zero does not mean the problem is easy

In the realizable setting, $f^* \in \mathcal{H}$ and the approximation error is zero. But the estimation error can still be large if $\mathcal{H}$ is complex. Realizability makes the problem easier (you only fight estimation error), but not trivial.

Exercises

ExerciseCore

Problem

Consider two hypothesis classes: $\mathcal{H}_1$ is all linear functions from $\mathbb{R}^{10}$ to $\mathbb{R}$ and $\mathcal{H}_2$ is all polynomials of degree at most 3 from $\mathbb{R}^{10}$ to $\mathbb{R}$. Which has smaller approximation error? Which has smaller estimation error (for fixed $n$)? What is the relationship between $\mathcal{H}_1$ and $\mathcal{H}_2$?

ExerciseAdvanced

Problem

Suppose the true function is $f^*(x) = \sin(x_1)$ and we use ERM with 0-1 loss. Compare the excess risk decomposition for: (a) $\mathcal{H}_{\text{lin}}$ = linear classifiers, (b) $\mathcal{H}_{\text{nn}}$ = two-layer neural networks with 100 hidden units. Which term dominates in each case, and why?

References

Canonical:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 2, 5, and 6
  • Vapnik, Statistical Learning Theory (1998), Chapter 3
  • Devroye, Gyorfi, Lugosi, A Probabilistic Theory of Pattern Recognition (1996), Chapters 4-6 (hypothesis classes and risk)

Current:

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapter 2
  • Wainwright, High-Dimensional Statistics (2019), Chapter 4.1 (function classes and approximation)
  • Bartlett, Bousquet, Mendelson, "Local Rademacher complexities" (2005), Annals of Statistics. Refined analysis of estimation error for localized classes.

Next Topics

  • VC dimension: measuring the complexity of infinite hypothesis classes
  • Rademacher complexity: data-dependent complexity that adapts to the data distribution
  • PAC learning framework: formalization of when learning is possible

Last reviewed: April 26, 2026
