Foundations
Counting and Combinatorics
Counting principles, binomial and multinomial coefficients, inclusion-exclusion, and Stirling's approximation. These tools appear whenever you count hypotheses, bound shattering coefficients, or analyze combinatorial arguments in learning theory.
Prerequisites
Why This Matters
Combinatorics appears throughout learning theory in unexpected places.
VC dimension theory counts the number of labelings a hypothesis class can produce on $n$ points. The Sauer-Shelah lemma bounds this by $\sum_{i=0}^{d} \binom{n}{i}$, where $d$ is the VC dimension. Bounding this sum requires knowing how binomial coefficients behave.
Union bounds over hypothesis classes require counting. Covering number arguments require counting. The PAC learning sample complexity formula has binomial and factorial terms that you simplify using Stirling's approximation.
Counting Principles
Product rule: if task A has $m$ outcomes and task B has $n$ outcomes, the pair (A, B) has $mn$ outcomes. This extends to $k$ tasks: $m_1 m_2 \cdots m_k$.
Sum rule: if tasks A and B are mutually exclusive, the total outcomes are $m + n$.
Permutations: the number of ordered arrangements of $k$ items from $n$ is $\frac{n!}{(n-k)!}$.
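These rules can be checked directly with the standard library; a minimal sketch using `math.perm` and brute-force enumeration (the specific values 5 and 3 are illustrative):

```python
import math
from itertools import permutations

# Product rule: 3 choices for one task, 4 for another -> 3 * 4 = 12 pairs.
outfits = 3 * 4

# Permutations: ordered arrangements of k items from n is n! / (n-k)!.
n, k = 5, 3
perm_count = math.perm(n, k)  # 5! / 2! = 60

# Cross-check by enumerating every ordered arrangement.
brute = sum(1 for _ in permutations(range(n), k))
assert perm_count == brute == 60
```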
Core Definitions
Binomial Coefficient
The binomial coefficient $\binom{n}{k}$ counts the number of ways to choose $k$ items from $n$ without regard to order:

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

for $0 \le k \le n$. By convention, $\binom{n}{k} = 0$ if and only if $k < 0$ or $k > n$.
Key properties:
- Symmetry: $\binom{n}{k} = \binom{n}{n-k}$
- Pascal's rule: $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$
- Sum: $\sum_{k=0}^{n} \binom{n}{k} = 2^n$
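All three properties can be verified numerically with `math.comb`; a quick sketch for one value of $n$:

```python
import math

n = 10

# Symmetry: C(n, k) == C(n, n-k).
assert all(math.comb(n, k) == math.comb(n, n - k) for k in range(n + 1))

# Pascal's rule: C(n, k) == C(n-1, k-1) + C(n-1, k).
assert all(math.comb(n, k) == math.comb(n - 1, k - 1) + math.comb(n - 1, k)
           for k in range(1, n))

# Sum: the binomial coefficients across a row add to 2^n.
total = sum(math.comb(n, k) for k in range(n + 1))
assert total == 2 ** n  # 1024
```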
Multinomial Coefficient
The number of ways to partition $n$ objects into $m$ labeled groups of sizes $k_1, \dots, k_m$ (where $k_1 + \cdots + k_m = n$) is:

$$\binom{n}{k_1, \dots, k_m} = \frac{n!}{k_1!\,k_2!\cdots k_m!}$$
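The standard library has no multinomial function, so a small helper (the name `multinomial` is our own) computes it from factorials:

```python
import math

def multinomial(*ks):
    """Ways to partition sum(ks) objects into labeled groups of sizes ks."""
    coeff = math.factorial(sum(ks))
    for k in ks:
        coeff //= math.factorial(k)  # exact integer division: k! divides evenly
    return coeff

# Partition 6 objects into groups of sizes 3, 2, 1: 6!/(3! 2! 1!) = 60.
m = multinomial(3, 2, 1)
```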
Main Theorems
Binomial Theorem
Statement
For any $x, y$ and non-negative integer $n$:

$$(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}$$
Intuition
Expand $(x+y)^n = (x+y)(x+y)\cdots(x+y)$. Each term in the expansion picks either $x$ or $y$ from each of the $n$ factors. The term $x^k y^{n-k}$ appears once for each way to choose $k$ factors to contribute $x$, which is $\binom{n}{k}$.
Proof Sketch
Induction on $n$. Base case $n = 0$: both sides equal 1. Inductive step: $(x+y)^{n+1} = (x+y)(x+y)^n$. Distribute and reindex using Pascal's rule.
Why It Matters
Setting $x = y = 1$ gives $\sum_{k=0}^{n} \binom{n}{k} = 2^n$: a set of size $n$ has $2^n$ subsets. Setting $x = -1, y = 1$ gives $\sum_{k=0}^{n} (-1)^k \binom{n}{k} = 0$: the number of even-size subsets equals the number of odd-size subsets. These identities appear in VC theory and randomized arguments.
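Both special evaluations fall out of a direct numeric check of the theorem; a minimal sketch:

```python
import math

def binomial_expand(x, y, n):
    """Right-hand side of the binomial theorem: sum_k C(n,k) x^k y^(n-k)."""
    return sum(math.comb(n, k) * x**k * y**(n - k) for k in range(n + 1))

# Exact for non-negative integer n.
assert binomial_expand(2, 3, 5) == (2 + 3) ** 5   # both sides are 3125

subsets = binomial_expand(1, 1, 7)     # x = y = 1: all subsets, 2^7 = 128
even_odd = binomial_expand(-1, 1, 7)   # x = -1, y = 1: alternating sum = 0
```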
Failure Mode
The formula is exact for non-negative integer $n$. For non-integer or negative $n$, use the generalized binomial series, which is an infinite series and requires convergence conditions.
Inclusion-Exclusion
Inclusion-Exclusion Principle
For finite sets $A_1, \dots, A_n$:

$$\left|\bigcup_{i=1}^{n} A_i\right| = \sum_{i} |A_i| - \sum_{i<j} |A_i \cap A_j| + \sum_{i<j<k} |A_i \cap A_j \cap A_k| - \cdots + (-1)^{n+1} |A_1 \cap \cdots \cap A_n|$$
This alternating sum corrects for overcounting at each level.
In probability, replace $|\cdot|$ with $\Pr(\cdot)$:

$$\Pr\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i} \Pr(A_i) - \sum_{i<j} \Pr(A_i \cap A_j) + \cdots$$
The union bound is the first term of inclusion-exclusion, which overcounts. Bonferroni inequalities give alternating upper and lower bounds by truncating at different levels.
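The alternating sum can be checked against a direct union computation; a brute-force sketch over all non-empty index subsets:

```python
from itertools import combinations

def union_size_ie(sets):
    """|A1 U ... U An| computed via inclusion-exclusion."""
    sets = list(sets)
    total = 0
    for r in range(1, len(sets) + 1):
        sign = (-1) ** (r + 1)  # + for odd-size intersections, - for even
        for idxs in combinations(range(len(sets)), r):
            inter = set.intersection(*(sets[i] for i in idxs))
            total += sign * len(inter)
    return total

A, B, C = {1, 2, 3, 4}, {3, 4, 5}, {4, 5, 6, 7}
size = union_size_ie([A, B, C])
assert size == len(A | B | C)  # both equal 7
```

The first term alone (r = 1) gives $4 + 3 + 4 = 11 \ge 7$: the union bound's overcount, corrected by the later terms.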
Other Combinatorial Tools
Three identities appear often enough in ML theory and adjacent areas to flag by name.
Vandermonde's identity: $\binom{m+n}{k} = \sum_{j=0}^{k} \binom{m}{j} \binom{n}{k-j}$. Choose $k$ items from a pool of $m + n$ by first deciding how many come from each side. It shows up when splitting a binomial coefficient along a partition of the underlying set, for example in convolution arguments over hypergeometric-type quantities.
Stars and bars: the number of ways to write a non-negative integer $n$ as an ordered sum of $k$ non-negative integers is $\binom{n+k-1}{k-1}$. This counts compositions of an integer, lattice points in a simplex, and (up to reindexing) the number of monomials of total degree $d$ in $m$ variables, $\binom{d+m-1}{m-1}$. The second use appears in polynomial kernel dimension counts and in bounding the number of degree-$d$ features in $m$ inputs.
Catalan numbers: $C_n = \frac{1}{n+1}\binom{2n}{n}$ counts balanced parenthesizations of $n$ pairs, binary trees with $n+1$ leaves, and several other structures in bijection with those. They show up in parsing (binary-tree enumerations over context-free grammars), in some combinatorial proofs about random walks, and in the analysis of tree-structured models.
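Both formulas are one-liners with `math.comb`; a sketch (the helper names are our own):

```python
import math

def catalan(n):
    """C_n = C(2n, n) / (n + 1); the division is always exact."""
    return math.comb(2 * n, n) // (n + 1)

def stars_and_bars(n, k):
    """Ordered ways to write n as a sum of k non-negative integers."""
    return math.comb(n + k - 1, k - 1)

# First six Catalan numbers: 1, 1, 2, 5, 14, 42.
cats = [catalan(i) for i in range(6)]

# Writing 4 as an ordered sum of 3 non-negative integers: C(6, 2) = 15.
sb = stars_and_bars(4, 3)
```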
Stirling's Approximation
Stirling Approximation
Statement
$$n! = \sqrt{2\pi n}\,\left(\frac{n}{e}\right)^n \bigl(1 + O(1/n)\bigr)$$

The cruder but often sufficient form is:

$$\ln n! = n \ln n - n + O(\ln n)$$
Intuition
The factorial grows faster than any exponential but slower than $n^n$. Stirling's formula pins down the growth rate: $n!$ is approximately $(n/e)^n$ times a polynomial correction, $\sqrt{2\pi n}$.
Proof Sketch
Write $\ln n! = \sum_{k=1}^{n} \ln k$ and approximate this sum by the integral $\int_1^n \ln x \, dx = n \ln n - n + 1$. The factor $\sqrt{2\pi n}$ comes from a more refined analysis using the Euler-Maclaurin formula or the Laplace method applied to the Gamma function integral.
Why It Matters
Stirling lets you simplify binomial coefficients. For example: $\ln \binom{n}{k} \approx n\, H(k/n)$ where $H(p) = -p \ln p - (1-p) \ln(1-p)$ is the binary entropy. This connection between counting and entropy appears throughout information theory and learning theory.
Failure Mode
The approximation is asymptotic; for small $n$ the relative error can be noticeable. For $n = 1$, Stirling gives $\sqrt{2\pi}/e \approx 0.922$ vs the true value of 1. For $n \ge 10$, the relative error is below 1%.
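The error figures quoted above can be reproduced directly; a sketch comparing the approximation to exact factorials:

```python
import math

def stirling(n):
    """Leading-order Stirling approximation: sqrt(2*pi*n) * (n/e)^n."""
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

# At n = 1 the relative error is about 8%; by n = 10 it is under 1%.
err1 = abs(stirling(1) - 1) / 1
err10 = abs(stirling(10) - math.factorial(10)) / math.factorial(10)
assert err1 > 0.07
assert err10 < 0.01
```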
Applications in Machine Learning Theory
Hypothesis Counting and VC Theory
The Sauer-Shelah lemma bounds the growth function $\Pi_{\mathcal{H}}(n)$ of a hypothesis class with VC dimension $d$:

$$\Pi_{\mathcal{H}}(n) \le \sum_{i=0}^{d} \binom{n}{i}$$

For fixed $d$ and growing $n$, this sum is $O(n^d)$, a polynomial bound. Without this combinatorial bound, the growth function could be as large as $2^n$ (all possible labelings), making uniform convergence arguments impossible. The transition from exponential to polynomial growth is the reason finite VC dimension implies learnability.
To see why: for $n \ge d \ge 1$, each term satisfies $\binom{n}{i} \le n^i \le n^d$, so summing $d + 1$ terms gives $\sum_{i=0}^{d} \binom{n}{i} \le (d+1)\,n^d$. A tighter standard bound is $\left(\frac{en}{d}\right)^d$ for $n \ge d$. Either way, the growth is polynomial in $n$ for fixed $d$. The PAC learning sample complexity bound then becomes $O\!\left(\frac{1}{\epsilon}\left(d \ln \frac{1}{\epsilon} + \ln \frac{1}{\delta}\right)\right)$, which is polynomial in $1/\epsilon$, $1/\delta$, and $d$.
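Both bounds can be sanity-checked numerically; a sketch for one illustrative choice of $n$ and $d$:

```python
import math

def sauer_bound(n, d):
    """Sauer-Shelah bound on the growth function: sum_{i=0}^{d} C(n, i)."""
    return sum(math.comb(n, i) for i in range(d + 1))

n, d = 100, 5
exact = sauer_bound(n, d)
loose = (d + 1) * n ** d          # (d + 1) n^d
tight = (math.e * n / d) ** d     # (en/d)^d, valid for n >= d

assert exact <= tight <= loose    # both dominate the exact sum here
assert exact < 2 ** n             # far below all 2^n labelings
```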
Model Selection and Counting Arguments
When comparing $k$ models via a held-out validation set of size $n$, the probability that the best model on the validation set is also the best model on unseen data depends on $k$ and $n$. A union bound over $k$ candidates gives:

$$\Pr\left[\max_{j \le k} \left|\hat{R}(h_j) - R(h_j)\right| > \epsilon\right] \le 2k\, e^{-2n\epsilon^2}$$

Solving for $n$ gives the agnostic finite-class PAC bound:

$$n \ge \frac{\ln(2k/\delta)}{2\epsilon^2}$$
The sample size scales additively in $\ln k$, not multiplicatively. Going from $k$ candidates to $2k$ increases the required sample size by $\frac{\ln 2}{2\epsilon^2}$, an additive offset whose size depends on $\epsilon$ and whatever baseline term is present. Whether that offset is a 10% bump or a doubling depends entirely on the other terms. Phrasing it as a "2x more data" rule of thumb is misleading. This bound assumes the candidate class is finite. For infinite hypothesis classes, the counting is replaced by VC dimension, and the bound on the pac-learning-framework page applies.
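The additive scaling is easy to see numerically; a sketch assuming the union-plus-Hoeffding bound above (the helper name is our own):

```python
import math

def sample_size(k, eps, delta):
    """Smallest integer n with n >= ln(2k/delta) / (2 eps^2)."""
    return math.ceil(math.log(2 * k / delta) / (2 * eps ** 2))

eps, delta = 0.05, 0.05
n_10 = sample_size(10, eps, delta)   # 10 candidate models
n_20 = sample_size(20, eps, delta)   # twice as many candidates

# Doubling the candidates adds about ln(2)/(2 eps^2) samples, an additive
# offset -- nowhere near doubling the data requirement.
offset = math.log(2) / (2 * eps ** 2)
assert n_20 - n_10 <= math.ceil(offset) + 1
assert n_20 < 2 * n_10
```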
Multinomial Coefficients in Text Modeling
In the multinomial naive Bayes model, the probability of a document with word counts $n_1, \dots, n_V$ over a vocabulary of size $V$, given class $c$, is:

$$P(n_1, \dots, n_V \mid c) = \binom{n}{n_1, \dots, n_V} \prod_{w=1}^{V} p_{w|c}^{\,n_w}, \qquad n = \sum_{w} n_w$$
The multinomial coefficient counts the number of orderings that produce the same bag of words. In practice, this coefficient cancels when comparing class posteriors (it does not depend on the class label), but understanding its origin clarifies why naive Bayes treats word order as irrelevant.
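The cancellation can be demonstrated directly; a sketch with made-up word counts and per-class word probabilities:

```python
import math

def multinomial_coef(counts):
    c = math.factorial(sum(counts))
    for k in counts:
        c //= math.factorial(k)
    return c

def log_lik(counts, probs):
    """log P(counts | class) under the multinomial model."""
    return (math.log(multinomial_coef(counts))
            + sum(n * math.log(p) for n, p in zip(counts, probs)))

counts = [3, 1, 2]            # word counts for one document (illustrative)
p_a = [0.5, 0.2, 0.3]         # word probabilities under class A (assumed)
p_b = [0.2, 0.5, 0.3]         # word probabilities under class B (assumed)

# The multinomial coefficient is identical in both terms, so it cancels
# in the log-likelihood difference used to compare classes.
diff = log_lik(counts, p_a) - log_lik(counts, p_b)
no_coef = sum(n * (math.log(pa) - math.log(pb))
              for n, pa, pb in zip(counts, p_a, p_b))
assert abs(diff - no_coef) < 1e-9
```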
Counting binary classifiers on n points
A binary classifier on $n$ points assigns each point a label in $\{0, 1\}$. The total number of possible labelings is $2^n$. For 20 data points, $2^{20} = 1{,}048{,}576$. A hypothesis class that can produce all of these labelings has VC dimension at least 20 and will overfit with fewer than roughly 20 training points.
An affine linear classifier in $\mathbb{R}^d$ (i.e. a halfspace with a free bias term) shatters at most $d + 1$ points, so its VC dimension is $d + 1$. The homogeneous version through the origin ($b = 0$) loses one degree of freedom and has VC dimension $d$. In $\mathbb{R}^{10}$ with bias, the VC dimension is 11, and the number of achievable labelings on 20 points is at most $\sum_{i=0}^{11} \binom{20}{i} = 784{,}626$, about 75% of all $2^{20}$ possible labelings. The restriction tightens dramatically as $n$ grows: on $n$ points the Sauer-Shelah bound is polynomial, $O(n^{11})$, rather than the exponential $2^n$, and this polynomial-vs-exponential gap is what makes PAC learning possible.
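The 75% figure is a direct computation; a sketch:

```python
import math

n_points, d = 20, 11                  # 20 points, VC dimension 11 (R^10 + bias)
all_labelings = 2 ** n_points         # 1,048,576 possible labelings
sauer = sum(math.comb(n_points, i) for i in range(d + 1))

frac = sauer / all_labelings
assert sauer == 784626
assert 0.74 < frac < 0.76             # about 75% of all labelings achievable
```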
Common Confusions
Binomial coefficients grow polynomially for fixed k
$\binom{n}{k} = O(n^k)$ when $k$ is fixed and $n$ grows. But when $k$ grows with $n$ (say $k = n/2$), $\binom{n}{k}$ is exponential: $\binom{n}{n/2} = \Theta(2^n / \sqrt{n})$. The Sauer-Shelah bound is polynomial in $n$ for fixed $d$, which is the whole point.
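The two growth regimes are visible in small examples; a sketch:

```python
import math

# Fixed k: C(n, 3) grows polynomially, bounded by n^3.
fixed_k = [math.comb(n, 3) for n in (10, 100, 1000)]
assert fixed_k[2] < 1000 ** 3

# k growing with n: the central coefficient C(n, n/2) is exponential.
# The row sum 2^n has n + 1 terms, and the central one is the largest,
# so C(n, n/2) >= 2^n / (n + 1).
central = math.comb(40, 20)
assert central > 2 ** 40 / 41
assert central < 2 ** 40
```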
Exercises
Problem
Prove Pascal's rule, $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$, using a combinatorial argument.
Problem
Use Stirling's approximation to show that $\ln \binom{n}{k} = n\,H(k/n) + O(\ln n)$ for $k = \alpha n$ with fixed $\alpha \in (0, 1)$, where $H(p) = -p \ln p - (1-p) \ln(1-p)$ is the binary entropy (using natural log).
References
Canonical:
- Flajolet & Sedgewick, Analytic Combinatorics (2009), Chapter I
- Graham, Knuth, Patashnik, Concrete Mathematics (1994), Chapters 5-6
Current:
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Section 6.5 (Sauer-Shelah and combinatorial arguments)
- Stanley, Enumerative Combinatorics, Vol. 1 (2nd ed., 2011), Chapter 1 (counting principles and generating functions)
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Basic Logic and Proof Techniques (layer 0A · tier 2)
Derived topics
- PAC Learning Framework (layer 1 · tier 1)
- VC Dimension (layer 2 · tier 1)
Graph-backed continuations