
Learning Theory Core

Hypothesis Classes and Function Spaces

What is a hypothesis class, why the choice of hypothesis class determines what ERM can learn, and the approximation-estimation tradeoff: bigger classes reduce approximation error but increase estimation error.

Core · Tier 1 · Stable · Supporting · ~45 min

Why This Matters

[Figure: The class sets the learning problem. A hypothesis class controls what the learner can express and how much data it needs: bigger classes reduce approximation error but raise the statistical price. The diagram runs from linear (too small, underfit) through polynomial (balanced) to all measurable functions (too large, overfit risk), marking the approximation gap between the best hypothesis in $\mathcal{H}$ and the truth.]

Excess risk ledger:

  • Target: $R^*$ is the Bayes risk, attained by a predictor that usually lies outside the chosen class.
  • Approximation error: shrinks when the class gets richer.
  • Estimation error: usually grows when the class gets richer, for fixed sample size.

When you train a model, you do not search over all possible functions. You search over a specific set: linear functions, decision trees, neural networks with a given architecture. This set is the hypothesis class. The choice of hypothesis class is the most consequential decision in learning, because it determines both what you can learn and what you cannot.

A hypothesis class that is too small cannot approximate the true function (high approximation error). A class that is too large needs too many samples to learn reliably (high estimation error). Much of learning theory is about understanding this tradeoff.
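The tradeoff is easy to see numerically. Below is a minimal sketch (assuming NumPy and scikit-learn are available; the target function, noise level, sample size, and degrees are illustrative choices, not part of any theorem): polynomial classes of increasing degree are fit to the same small sample, and the training error falls monotonically while the test error typically traces the classic U-shape.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 40
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(n)   # truth sin(x) lies in no polynomial class
X_test = np.linspace(-3, 3, 500).reshape(-1, 1)
y_test = np.sin(X_test[:, 0])                        # noiseless targets approximate the risk

for degree in [1, 3, 6, 12]:                          # nested classes of growing capacity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}  test MSE {test_mse:.3f}")
```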

Mental Model

The hypothesis class fixes both what the algorithm can express and what it has to choose among. A class that is too restrictive cannot fit the target function: large approximation error. A class that is too rich requires more data to identify the right function from the rest: large estimation error. The size and structure of $\mathcal{H}$ control where the bias-variance balance sits at sample size $n$.

Formal Setup

Definition

Hypothesis Class

A hypothesis class is a set of functions $\mathcal{H} \subseteq \{h : \mathcal{X} \to \mathcal{Y}\}$ that a learning algorithm considers as candidate predictors. The algorithm selects $\hat{h} \in \mathcal{H}$ based on training data.

Definition

Realizable Setting

The learning problem is realizable if and only if the true data-generating function $f^*$ belongs to $\mathcal{H}$: there exists $h^* \in \mathcal{H}$ with $R(h^*) = R^*$, where $R^*$ is the Bayes risk. In the realizable setting, the approximation error is zero.

Definition

Agnostic Setting

The learning problem is agnostic if and only if we make no assumption about whether $f^* \in \mathcal{H}$. The best we can hope for is to find $\hat{h}$ whose risk is close to $\min_{h \in \mathcal{H}} R(h)$, which may be strictly larger than $R^*$.

Examples of Hypothesis Classes

Finite hypothesis classes. $\mathcal{H} = \{h_1, \ldots, h_M\}$. The ERM generalization bound from the ERM topic gives a gap of $O(\sqrt{\log M / n})$.
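As a quick numerical sense check, here is a sketch using the Hoeffding-plus-union-bound form of that result for a loss bounded in $[0, 1]$ (exact constants differ by convention): the gap grows only logarithmically in $M$.

```python
import math

def finite_class_gap(M: int, n: int, delta: float = 0.05) -> float:
    """Uniform gap sqrt(log(2M/delta) / (2n)) from Hoeffding plus a union bound."""
    return math.sqrt(math.log(2 * M / delta) / (2 * n))

for M in (10, 10**3, 10**6):
    print(f"M = {M:>7}, n = 1000: gap <= {finite_class_gap(M, n=1000):.3f}")
```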

Linear classifiers. $\mathcal{H} = \{x \mapsto \text{sign}(w^T x + b) : w \in \mathbb{R}^d, b \in \mathbb{R}\}$. This is an infinite class parameterized by $d + 1$ real numbers. The VC dimension is $d + 1$.

Neural networks with fixed architecture. $\mathcal{H} = \{x \mapsto f_\theta(x) : \theta \in \mathbb{R}^p\}$ for a network $f_\theta$ with $p$ parameters. The class is parameterized by $p$ real numbers, but its effective complexity depends on the architecture, not just $p$.

Nonparametric classes. $\mathcal{H} = \{f : \|f\|_{\text{RKHS}} \leq B\}$ for a reproducing kernel Hilbert space. The "number of parameters" is not fixed; it can grow with $n$. Examples: kernel SVM, Gaussian process regression.

All measurable functions. $\mathcal{H} = \{h : \mathcal{X} \to \mathcal{Y}\}$. This class can memorize any dataset, giving zero training error. But it provides no generalization: the estimation error is maximal.
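A small demonstration of why memorization alone buys nothing (a sketch, assuming scikit-learn; a 1-nearest-neighbor rule stands in for a class rich enough to fit anything): trained on labels that are pure noise, it reaches zero training error but stays at chance accuracy on fresh data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 5))
y_train = rng.integers(0, 2, size=200)      # labels independent of X: nothing to learn
X_test = rng.standard_normal((200, 5))
y_test = rng.integers(0, 2, size=200)

memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("train accuracy:", memorizer.score(X_train, y_train))   # 1.0: zero empirical risk
print("test accuracy: ", memorizer.score(X_test, y_test))     # about 0.5: no generalization
```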

Main Theorems

Theorem

Excess Risk Decomposition

Statement

The excess risk of the ERM hypothesis $\hat{h}$ over the Bayes-optimal predictor $h^*_{\text{Bayes}}$ decomposes as:

$$
R(\hat{h}) - R^* = \underbrace{\left(\min_{h \in \mathcal{H}} R(h) - R^*\right)}_{\text{approximation error}} + \underbrace{\left(R(\hat{h}) - \min_{h \in \mathcal{H}} R(h)\right)}_{\text{estimation error}}
$$

The approximation error depends only on $\mathcal{H}$ (not on the sample). The estimation error depends on both $\mathcal{H}$ and the sample size $n$.

Intuition

The approximation error asks: how close can the best function in $\mathcal{H}$ get to the truth? The estimation error asks: how close can ERM get to the best function in $\mathcal{H}$, given finite data? These are two different sources of error, and they pull in opposite directions as you vary the size of $\mathcal{H}$.

Proof Sketch

This is an identity: add and subtract $\min_{h \in \mathcal{H}} R(h)$. The content is in recognizing that the two terms have different dependencies and require different tools to bound.

Why It Matters

This decomposition is the backbone of model selection. Regularization, cross-validation, and structural risk minimization are all about balancing these two terms. When someone says a model is "too complex" or "too simple," they are implicitly referring to this tradeoff.

Failure Mode

This decomposition does not account for optimization error. In practice, ERM does not find the exact minimizer of empirical risk (especially for non-convex problems like neural networks). The true decomposition has three terms: approximation + estimation + optimization. The optimization error can dominate for large models trained with SGD.
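To make the two terms of the decomposition concrete, here is a toy numerical check (a sketch; the threshold class, the target $y = \mathbf{1}[\sin(4x) > 0]$, the grid of thresholds, and the Monte Carlo approximation of the population risk are all illustrative choices). The target is noiseless, so the Bayes risk is zero and the best-in-class risk equals the approximation error; ERM on a small sample adds estimation error on top.

```python
import numpy as np

rng = np.random.default_rng(0)

def label(x):
    return (np.sin(4 * x) > 0).astype(int)            # noiseless target, so Bayes risk R* = 0

def risk(t, X, y):
    return np.mean((X > t).astype(int) != y)          # 0-1 risk of the threshold rule 1[x > t]

thresholds = np.linspace(-1, 1, 401)                   # a fine grid standing in for the class
X_pop = rng.uniform(-1, 1, 100_000)                    # large sample approximates the population
y_pop = label(X_pop)

pop_risks = np.array([risk(t, X_pop, y_pop) for t in thresholds])
approx_error = pop_risks.min()                         # min_H R(h) - R*, with R* = 0

n = 50
X_tr = rng.uniform(-1, 1, n)
y_tr = label(X_tr)
emp_risks = np.array([risk(t, X_tr, y_tr) for t in thresholds])
t_hat = thresholds[emp_risks.argmin()]                 # ERM over the grid
est_error = risk(t_hat, X_pop, y_pop) - approx_error   # R(h_hat) - min_H R(h)

print(f"approximation error ~ {approx_error:.3f}, estimation error ~ {est_error:.3f}")
```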

Proposition

Nested Classes and Monotonicity

Statement

For nested hypothesis classes $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \mathcal{H}_3$:

  1. Approximation error is non-increasing: $\min_{h \in \mathcal{H}_3} R(h) \leq \min_{h \in \mathcal{H}_2} R(h) \leq \min_{h \in \mathcal{H}_1} R(h)$

  2. Estimation error is non-decreasing (in expectation, for fixed $n$): the uniform convergence gap grows with the complexity of $\mathcal{H}$.

The optimal class $\mathcal{H}^*$ minimizes the sum. In general, it is neither the smallest nor the largest class.

Intuition

A bigger class always contains the best functions from a smaller class, so the approximation error can only improve. But a bigger class also gives ERM more room to overfit, so the estimation error grows. The sweet spot balances these opposing forces.

Proof Sketch

Part 1: $\mathcal{H}_1 \subset \mathcal{H}_2$ implies $\min_{\mathcal{H}_2} R(h) \leq \min_{\mathcal{H}_1} R(h)$ by minimizing over a larger set. Part 2: the uniform convergence bound for $\mathcal{H}_2$ depends on the complexity of $\mathcal{H}_2$ (e.g., VC dimension), which is at least that of $\mathcal{H}_1$.

Why It Matters

This is the formal version of the bias-variance tradeoff. It explains why increasing model capacity helps up to a point and then hurts (in classical statistics). It also frames the puzzle of modern overparameterized models: why does estimation error not blow up when $\mathcal{H}$ is massive? The answer involves implicit regularization and is covered in the double descent and benign overfitting topics.

Failure Mode

The monotonicity of estimation error relies on uniform convergence bounds. For overparameterized models, uniform convergence bounds are vacuous, and the actual estimation error may decrease with increasing class size. This is the "interpolation regime" studied in double descent.
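In the classical regime, though, the two monotonicities of the proposition are easy to exhibit numerically. The sketch below uses an illustrative setup (squared loss, target $\sin(4x)$ plus Gaussian noise so the Bayes risk is the noise variance, nested polynomial classes $\mathcal{H}_d$, and a large sample standing in for the population): the approximation error falls with degree, while the estimation error at a fixed small $n$ tends to grow.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3
f_star = lambda x: np.sin(4 * x)

X_pop = rng.uniform(-1, 1, 50_000)                   # large sample stands in for the population
X_tr = rng.uniform(-1, 1, 30)                        # fixed, small training sample
y_tr = f_star(X_tr) + sigma * rng.standard_normal(30)

for d in (1, 3, 5, 7, 9, 12):                         # nested classes: H_1 within H_3 within ...
    c_best = np.polyfit(X_pop, f_star(X_pop), d)      # best-in-class fit to the population
    approx = np.mean((np.polyval(c_best, X_pop) - f_star(X_pop)) ** 2)
    c_hat = np.polyfit(X_tr, y_tr, d)                 # ERM (least squares) on the sample
    excess = np.mean((np.polyval(c_hat, X_pop) - f_star(X_pop)) ** 2)
    print(f"degree {d:2d}: approx error {approx:.4f}  estimation error {excess - approx:.4f}")
```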

Parameterized vs Nonparametric

A parametric hypothesis class has a fixed-dimensional parameter space $\Theta \subseteq \mathbb{R}^p$. The number of parameters $p$ does not change with the sample size $n$. Examples: linear regression ($p = d + 1$), logistic regression, fixed-architecture neural networks.

A nonparametric hypothesis class allows the effective number of parameters to grow with $n$. Examples: $k$-nearest neighbors (stores all training points), kernel methods (one coefficient per data point), random forests (depth grows with $n$).

The distinction matters for generalization bounds: parametric classes have fixed complexity, while nonparametric classes require bounds that account for growing complexity.
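A minimal illustration of the distinction (a sketch assuming scikit-learn; the models and sizes are arbitrary): a linear model keeps $d + 1$ numbers no matter how much data it sees, while a 1-nearest-neighbor regressor must retain the entire training set to make predictions, so its memory footprint grows linearly with $n$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
d = 5
for n in (100, 1_000, 10_000):
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
    lin = LinearRegression().fit(X, y)                 # parametric: d weights + 1 intercept
    knn = KNeighborsRegressor(n_neighbors=1).fit(X, y) # nonparametric: keeps all n points
    lin_params = lin.coef_.size + 1                    # independent of n
    knn_stored = X.size + y.size                       # grows linearly with n
    print(f"n={n:6d}: linear stores {lin_params} numbers, 1-NN stores {knn_stored}")
```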

Common Confusions

Watch Out

More parameters does not always mean bigger hypothesis class

A neural network with $10^9$ parameters might have effective complexity much less than $10^9$ due to symmetries, implicit regularization from SGD, and architectural constraints. Parameter count is a coarse proxy for hypothesis-class capacity, not a tight measure of it: for ReLU networks the VC dimension is $\Theta(W L \log W)$ in the number of weights $W$ and depth $L$ (Bartlett, Harvey, Liaw, Mehrabian 2019), and norm-based, Rademacher, and PAC-Bayes bounds attempt to measure effective complexity more directly than the raw parameter count.
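To get a feel for how the $\Theta(W L \log W)$ scaling trades off weight count against depth, here is a small sketch; it evaluates only the scaling term for a hypothetical fully connected ReLU network and ignores the unspecified constants in the theorem, so the numbers are orders of magnitude, not VC dimensions.

```python
import math

def relu_vc_scaling(widths):
    """W * L * log(W) for a fully connected ReLU net with the given layer widths.
    W = number of weights, L = number of layers; theorem constants are ignored."""
    W = sum(a * b for a, b in zip(widths[:-1], widths[1:]))
    L = len(widths) - 1
    return W * L * math.log(W)

for widths in ([784, 100, 10], [784, 1000, 1000, 10]):
    W = sum(a * b for a, b in zip(widths[:-1], widths[1:]))
    print(f"layers {widths}: weights {W:,}, VC scaling ~ {relu_vc_scaling(widths):,.0f}")
```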

Watch Out

The hypothesis class is chosen before seeing data

In the ERM framework, $\mathcal{H}$ is fixed before observing the training set. Using the training data to select $\mathcal{H}$ (e.g., choosing the network architecture via cross-validation) introduces a dependence that invalidates simple ERM bounds. Structural risk minimization (SRM) and model selection theory handle this case.

Watch Out

Approximation error zero does not mean the problem is easy

In the realizable setting, $f^* \in \mathcal{H}$ and the approximation error is zero. But the estimation error can still be large if $\mathcal{H}$ is complex. Realizability makes the problem easier (you only fight estimation error), but not trivial.

Exercises

ExerciseCore

Problem

Consider two hypothesis classes: $\mathcal{H}_1$ is all linear functions from $\mathbb{R}^{10}$ to $\mathbb{R}$ and $\mathcal{H}_2$ is all polynomials of degree at most 3 from $\mathbb{R}^{10}$ to $\mathbb{R}$. Which has smaller approximation error? Which has smaller estimation error (for fixed $n$)? What is the relationship between $\mathcal{H}_1$ and $\mathcal{H}_2$?

ExerciseAdvanced

Problem

Suppose the true function is $f^*(x) = \sin(x_1)$ and we use ERM with 0-1 loss. Compare the excess risk decomposition for: (a) $\mathcal{H}_{\text{lin}}$ = linear classifiers, (b) $\mathcal{H}_{\text{nn}}$ = two-layer neural networks with 100 hidden units. Which term dominates in each case, and why?

References

Canonical:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 2, 5, and 6
  • Vapnik, Statistical Learning Theory (1998), Chapter 3
  • Devroye, Gyorfi, Lugosi, A Probabilistic Theory of Pattern Recognition (1996), Chapters 4-6 (hypothesis classes and risk)

Current:

  • Mohri, Rostamizadeh, Talwalkar, Foundations of Machine Learning (2018), Chapter 2
  • Wainwright, High-Dimensional Statistics (2019), Chapter 4.1 (function classes and approximation)
  • Bartlett, Bousquet, Mendelson, "Local Rademacher complexities" (2005), Annals of Statistics. Refined analysis of estimation error for localized classes.

Next Topics

  • VC dimension: measuring the complexity of infinite hypothesis classes
  • Rademacher complexity: data-dependent complexity that adapts to the data distribution
  • PAC learning framework: formalization of when learning is possible

Last reviewed: April 26, 2026
