
Foundations

Counting and Combinatorics

Counting principles, binomial and multinomial coefficients, inclusion-exclusion, and Stirling's approximation. These tools appear whenever you count hypotheses, bound shattering coefficients, or analyze combinatorial arguments in learning theory.


Why This Matters

Combinatorics appears throughout learning theory in unexpected places.

VC dimension theory counts the number of labelings a hypothesis class can produce on $n$ points. The Sauer-Shelah lemma bounds this by $\sum_{i=0}^{d} \binom{n}{i}$. Bounding this sum requires knowing how binomial coefficients behave.

Union bounds over hypothesis classes require counting. Covering number arguments require counting. The PAC learning sample complexity formula has $\log \binom{n}{k}$ terms that you simplify using Stirling's approximation.

Counting Principles

Product rule: if task A has $m$ outcomes and task B has $n$ outcomes, the pair (A, B) has $mn$ outcomes. This extends to $k$ tasks.

Sum rule: if tasks A and B are mutually exclusive, the total outcomes are $m + n$.

Permutations: the number of ordered arrangements of $k$ items from $n$ is $n!/(n-k)! = n(n-1)\cdots(n-k+1)$.
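These three principles can be checked with Python's standard-library counting helpers; a small illustrative sketch (the specific numbers are hypothetical):

```python
import math

# Product rule: 3 outcomes for task A, 4 for task B -> 12 ordered pairs.
pairs = 3 * 4

# Permutations: ordered arrangements of k = 3 items out of n = 5,
# i.e. n!/(n-k)! = 5 * 4 * 3.
p = math.perm(5, 3)

# For comparison, unordered selections: C(5, 3) = 10.
c = math.comb(5, 3)

print(pairs, p, c)  # -> 12 60 10
```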

Core Definitions

Definition

Binomial Coefficient

The binomial coefficient counts the number of ways to choose $k$ items from $n$ without regard to order:

$$\binom{n}{k} = \frac{n!}{k!(n-k)!}$$

for $0 \leq k \leq n$. By convention, $\binom{n}{k} = 0$ if $k < 0$ or $k > n$.

Key properties:

  • Symmetry: $\binom{n}{k} = \binom{n}{n-k}$
  • Pascal's rule: $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$
  • Sum: $\sum_{k=0}^n \binom{n}{k} = 2^n$

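All three properties are one-liners to verify numerically with `math.comb`; a minimal sanity check (the choice $n = 10$ is arbitrary):

```python
import math

n = 10

# Symmetry: C(n, k) == C(n, n-k).
assert all(math.comb(n, k) == math.comb(n, n - k) for k in range(n + 1))

# Pascal's rule: C(n, k) == C(n-1, k-1) + C(n-1, k) for 1 <= k <= n-1.
assert all(math.comb(n, k) == math.comb(n - 1, k - 1) + math.comb(n - 1, k)
           for k in range(1, n))

# Row sum: sum_k C(n, k) == 2^n.
assert sum(math.comb(n, k) for k in range(n + 1)) == 2 ** n
```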
Definition

Multinomial Coefficient

The number of ways to partition $n$ objects into groups of sizes $k_1, \ldots, k_r$ (where $k_1 + \cdots + k_r = n$) is:

$$\binom{n}{k_1, k_2, \ldots, k_r} = \frac{n!}{k_1! k_2! \cdots k_r!}$$
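A multinomial coefficient can be computed as a telescoping product of binomial coefficients, which keeps everything in exact integer arithmetic. The `multinomial` helper below is a hypothetical sketch, not part of the original text:

```python
import math

def multinomial(*ks):
    """n! / (k1! k2! ... kr!) via the identity
    C(k1, k1) * C(k1+k2, k2) * ... * C(k1+...+kr, kr)."""
    n, out = 0, 1
    for k in ks:
        n += k
        out *= math.comb(n, k)
    return out

# Partition 10 objects into groups of sizes 3, 3, 4.
direct = math.factorial(10) // (math.factorial(3) ** 2 * math.factorial(4))
assert multinomial(3, 3, 4) == direct == 4200
```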

Main Theorems

Theorem

Binomial Theorem

Statement

For any $a, b \in \mathbb{R}$ and non-negative integer $n$:

$$(a+b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}$$

Intuition

Expand $(a+b)^n = (a+b)(a+b)\cdots(a+b)$. Each term in the expansion picks either $a$ or $b$ from each of the $n$ factors. The term $a^k b^{n-k}$ appears once for each way to choose $k$ factors to contribute $a$, which is $\binom{n}{k}$.

Proof Sketch

Induction on $n$. Base case $n=0$: both sides equal 1. Inductive step: $(a+b)^{n+1} = (a+b)(a+b)^n = (a+b)\sum_k \binom{n}{k} a^k b^{n-k}$. Distribute and reindex using Pascal's rule.

Why It Matters

Setting $a = b = 1$ gives $2^n = \sum_k \binom{n}{k}$: a set of size $n$ has $2^n$ subsets. Setting $a = 1, b = -1$ gives $0 = \sum_k (-1)^k \binom{n}{k}$: the number of even-size subsets equals the number of odd-size subsets. These identities appear in VC theory and randomized arguments.
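Both specializations, plus a generic instance of the expansion, can be checked numerically; a quick sketch with arbitrary values:

```python
import math

n = 7

# a = b = 1: the 2^n subsets of an n-element set.
assert sum(math.comb(n, k) for k in range(n + 1)) == 2 ** n

# a = 1, b = -1: even-size and odd-size subsets balance out.
assert sum((-1) ** k * math.comb(n, k) for k in range(n + 1)) == 0

# A generic instance: (2 + 3)^n expanded term by term.
assert sum(math.comb(n, k) * 2 ** k * 3 ** (n - k) for k in range(n + 1)) == 5 ** n
```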

Failure Mode

The formula is exact for integer $n \geq 0$. For non-integer or negative $n$, use the generalized binomial series, which is an infinite series and requires convergence conditions.

Inclusion-Exclusion

Definition

Inclusion-Exclusion Principle

For finite sets $A_1, \ldots, A_n$:

$$\left|\bigcup_{i=1}^n A_i\right| = \sum_{i} |A_i| - \sum_{i < j} |A_i \cap A_j| + \sum_{i < j < k} |A_i \cap A_j \cap A_k| - \cdots + (-1)^{n+1} |A_1 \cap \cdots \cap A_n|$$

This alternating sum corrects for overcounting at each level.

In probability, replace $|A|$ with $P(A)$:

$$P\left(\bigcup_{i=1}^n A_i\right) = \sum_i P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \cdots$$

The union bound $P(\bigcup_i A_i) \leq \sum_i P(A_i)$ keeps only the first term of inclusion-exclusion, which overcounts. Bonferroni inequalities give alternating upper and lower bounds by truncating at different levels.
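Inclusion-exclusion is easy to test against a direct union computation; a small sketch with three hypothetical sets of integers:

```python
from itertools import combinations

sets = [set(range(0, 10)), set(range(5, 15)), set(range(8, 20))]

# Ground truth: size of the union computed directly.
direct = len(set().union(*sets))

# Inclusion-exclusion: alternate signs over intersections of every size.
ie = 0
for r in range(1, len(sets) + 1):
    for combo in combinations(sets, r):
        ie += (-1) ** (r + 1) * len(set.intersection(*combo))

assert ie == direct == 20
```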

Other Combinatorial Tools

Three identities appear often enough in ML theory and adjacent areas to flag by name.

Vandermonde's identity: $\binom{m+n}{r} = \sum_{k=0}^{r} \binom{m}{k}\binom{n}{r-k}$. Choose $r$ items from a pool of $m+n$ by first deciding how many come from each side. It shows up when splitting a binomial coefficient along a partition of the underlying set, for example in convolution arguments over hypergeometric-type quantities.

Stars and bars: the number of ways to write a non-negative integer $n$ as an ordered sum of $k$ non-negative integers is $\binom{n+k-1}{k-1}$. This counts compositions of an integer, lattice points in a simplex, and (up to reindexing) the number of monomials of total degree $n$ in $k$ variables. The second use appears in polynomial kernel dimension counts and in bounding the number of degree-$n$ features in $k$ inputs.

Catalan numbers: $C_n = \frac{1}{n+1}\binom{2n}{n}$ counts balanced parenthesizations of $n$ pairs, binary trees with $n+1$ leaves, and several other structures in bijection with those. They show up in parsing (binary-tree enumerations over context-free grammars), in some combinatorial proofs about random walks, and in the analysis of tree-structured models.
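Each of these three identities can be checked against brute force or known values; a short sketch (the parameter choices are arbitrary):

```python
import math
from itertools import product

# Vandermonde: C(m+n, r) == sum_k C(m, k) * C(n, r-k).
m, n, r = 5, 7, 4
assert math.comb(m + n, r) == sum(math.comb(m, k) * math.comb(n, r - k)
                                  for k in range(r + 1))

# Stars and bars: ordered k-tuples of non-negative integers summing to n.
total, k = 6, 3
brute = sum(1 for t in product(range(total + 1), repeat=k) if sum(t) == total)
assert brute == math.comb(total + k - 1, k - 1) == 28

# Catalan numbers: C_n = C(2n, n) / (n + 1).
catalan = [math.comb(2 * i, i) // (i + 1) for i in range(6)]
assert catalan == [1, 1, 2, 5, 14, 42]
```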

Stirling's Approximation

Theorem

Stirling Approximation

Statement

$$n! = \sqrt{2\pi n} \left(\frac{n}{e}\right)^n \left(1 + O\left(\frac{1}{n}\right)\right)$$

The cruder but often sufficient form is:

$$\log(n!) = n \log n - n + O(\log n)$$

Intuition

The factorial grows faster than any exponential but slower than $n^n$. Stirling's formula pins down the growth rate: $n!$ is approximately $(n/e)^n$ times the slowly growing correction $\sqrt{2\pi n}$.

Proof Sketch

Write $\log(n!) = \sum_{k=1}^n \log k$ and approximate this sum by the integral $\int_1^n \log x \, dx = n \log n - n + 1$. The $\sqrt{2\pi n}$ factor comes from a more refined analysis using the Euler-Maclaurin formula or the Laplace method applied to the Gamma function integral.

Why It Matters

Stirling lets you simplify binomial coefficients. For example: $\log \binom{n}{k} \approx n H(k/n)$ where $H(p) = -p \log p - (1-p) \log(1-p)$ is the binary entropy. This connection between counting and entropy appears throughout information theory and learning theory.

Failure Mode

The approximation is asymptotic; for small $n$ the relative error can be noticeable. For $n = 1$, Stirling gives $\sqrt{2\pi} \cdot (1/e) \approx 0.922$ vs the true value of 1. For $n \geq 10$, the relative error is below 1%.
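The error behavior is easy to see numerically; a sketch comparing the leading term of Stirling's formula to the exact factorial:

```python
import math

def stirling(n):
    # Leading term of Stirling's formula, without the 1 + O(1/n) correction.
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

for n in [1, 2, 5, 10, 20]:
    rel_err = abs(stirling(n) - math.factorial(n)) / math.factorial(n)
    print(n, rel_err)  # the relative error shrinks roughly like 1/(12n)

assert abs(stirling(1) - 1) > 0.07                        # ~8% off at n = 1
assert abs(stirling(10) / math.factorial(10) - 1) < 0.01  # under 1% by n = 10
```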

Applications in Machine Learning Theory

Hypothesis Counting and VC Theory

The Sauer-Shelah lemma bounds the growth function of a hypothesis class $\mathcal{H}$ with VC dimension $d$:

$$\Pi_{\mathcal{H}}(n) \leq \sum_{i=0}^{d} \binom{n}{i}$$

For fixed $d$ and growing $n$, this sum is $O(n^d)$, a polynomial bound. Without this combinatorial bound, the growth function could be as large as $2^n$ (all possible labelings), making uniform convergence arguments impossible. The transition from exponential to polynomial growth is the reason finite VC dimension implies learnability.

To see why: for $n \geq d$, each term satisfies $\binom{n}{i} \leq n^i \leq n^d$, so summing $d+1$ terms gives $\sum_{i=0}^{d} \binom{n}{i} \leq (d+1) n^d$. A tighter standard bound is $\sum_{i=0}^{d} \binom{n}{i} \leq (en/d)^d$ for $n \geq d \geq 1$. Either way, the growth is polynomial in $n$ for fixed $d$. The PAC learning sample complexity bound then becomes $n = O\left((d \log(d/\epsilon) + \log(1/\delta))/\epsilon\right)$, which is polynomial in $d$, $1/\epsilon$, and $\log(1/\delta)$.
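The two bounds are straightforward to compare against the exact sum; a sketch with arbitrary $n$ and $d$:

```python
import math

def growth_bound(n, d):
    # Sauer-Shelah upper bound on the growth function: sum_{i<=d} C(n, i).
    return sum(math.comb(n, i) for i in range(d + 1))

n, d = 100, 5
exact = growth_bound(n, d)
loose = (d + 1) * n ** d           # (d+1) n^d
tight = (math.e * n / d) ** d      # (en/d)^d

assert exact <= tight <= loose     # both are valid, polynomial-in-n bounds
print(exact, int(tight), loose)
```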

Model Selection and Counting Arguments

When comparing $M$ models via a held-out validation set, the probability that the best model on the validation set is also the best model on unseen data depends on $M$. A union bound over $M$ candidates gives:

$$P(\text{any model overfits by } > \epsilon) \leq M \cdot e^{-2n\epsilon^2}$$

Solving for $n$ gives the agnostic finite-class PAC bound:

$$n \geq \frac{\log M + \log(1/\delta)}{2 \epsilon^2}$$

The sample size scales additively in $\log M$, not multiplicatively. Going from $M = 2$ candidates to $M = 100$ increases the required sample size by $(\log 100 - \log 2)/(2\epsilon^2)$, an additive offset whose size depends on $\epsilon$ and whatever baseline $\log(1/\delta)$ term is present. Whether that offset is a 10% bump or a doubling depends entirely on the other terms, so phrasing it as a "2x more data" rule of thumb is misleading. This bound assumes $|\mathcal{H}| = M$ is finite. For infinite hypothesis classes, the counting is replaced by VC dimension, and the bound on the pac-learning-framework page applies.
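Plugging numbers into the bound makes the additive scaling concrete; a sketch with the hypothetical choices $\epsilon = \delta = 0.05$:

```python
import math

def finite_class_sample_size(M, eps, delta):
    # Agnostic finite-class PAC bound: n >= (log M + log(1/delta)) / (2 eps^2).
    return math.ceil((math.log(M) + math.log(1 / delta)) / (2 * eps ** 2))

eps, delta = 0.05, 0.05
n2 = finite_class_sample_size(2, eps, delta)
n100 = finite_class_sample_size(100, eps, delta)

# Going from 2 to 100 candidates adds an additive offset of about
# (log 100 - log 2) / (2 eps^2) samples, not a multiplicative factor.
offset = (math.log(100) - math.log(2)) / (2 * eps ** 2)
assert n100 - n2 <= math.ceil(offset) + 1
print(n2, n100)  # -> 738 1521
```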

Multinomial Coefficients in Text Modeling

In the multinomial naive Bayes model, the probability of a document with word counts $(k_1, \ldots, k_V)$ is:

$$P(\text{doc}) = \frac{n!}{k_1! \cdots k_V!} \prod_{j=1}^{V} p_j^{k_j}$$

The multinomial coefficient $n!/(k_1! \cdots k_V!)$ counts the number of orderings that produce the same bag of words. In practice, this coefficient cancels when comparing class posteriors (it does not depend on the class label), but understanding its origin clarifies why naive Bayes treats word order as irrelevant.
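The cancellation is easy to demonstrate: the likelihood ratio between two classes is the same whether or not the coefficient is included. A toy sketch with a hypothetical 3-word vocabulary and made-up class probabilities:

```python
import math

counts = [3, 1, 2]                      # word counts (k1, k2, k3), n = 6
n = sum(counts)
coef = math.factorial(n) // math.prod(math.factorial(k) for k in counts)

p_class1 = [0.5, 0.3, 0.2]              # per-class word probabilities
p_class2 = [0.2, 0.3, 0.5]

def doc_prob(ps):
    return coef * math.prod(p ** k for p, k in zip(ps, counts))

# Likelihood ratio with the coefficient vs the coefficient dropped entirely:
with_coef = doc_prob(p_class1) / doc_prob(p_class2)
without_coef = math.prod((p1 / p2) ** k
                         for p1, p2, k in zip(p_class1, p_class2, counts))
assert abs(with_coef - without_coef) < 1e-12
```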

Example

Counting binary classifiers on n points

A binary classifier on $n$ points assigns each point a label in $\{0, 1\}$. The total number of possible labelings is $2^n$. For 20 data points, $2^{20} = 1{,}048{,}576$. A hypothesis class that can produce all of these labelings has VC dimension at least 20, and will overfit unless it sees on the order of $d/\epsilon$ training points.

An affine linear classifier in $\mathbb{R}^d$ (i.e. a halfspace $\{x : w^T x + b \geq 0\}$ with a free bias term) shatters at most $d+1$ points, so its VC dimension is $d+1$. The homogeneous version through the origin ($b = 0$) loses one degree of freedom and has VC dimension $d$. In $\mathbb{R}^{10}$ with bias, the VC dimension is 11, and the number of achievable labelings on 20 points is at most $\sum_{i=0}^{11} \binom{20}{i} = 784{,}626$, about 75% of all $2^{20}$ possible labelings. The restriction tightens dramatically as $n$ grows: on $n$ points the Sauer-Shelah bound is polynomial $O(n^{11})$ rather than the exponential $2^n$, and this polynomial-vs-exponential gap is what makes PAC learning possible.
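The numbers in this example can be reproduced in a couple of lines:

```python
import math

n_points, vc_dim = 20, 11   # halfspaces with bias in R^10 have VC dimension 11

# Sauer-Shelah cap on achievable labelings vs all possible labelings.
achievable = sum(math.comb(n_points, i) for i in range(vc_dim + 1))
total = 2 ** n_points

print(achievable, total, achievable / total)  # -> 784626 1048576 (~0.748)
```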

Common Confusions

Watch Out

Binomial coefficients grow polynomially for fixed k

$\binom{n}{k} = \Theta(n^k / k!)$ when $k$ is fixed and $n$ grows. But when $k$ grows with $n$ (say $k = n/2$), $\binom{n}{k}$ is exponential: $\binom{n}{n/2} \approx 2^n / \sqrt{\pi n / 2}$. The Sauer-Shelah bound $\sum_{i=0}^d \binom{n}{i}$ is polynomial in $n$ for fixed $d$, which is the whole point.
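The two regimes can be seen numerically; a short sketch:

```python
import math

# Fixed k = 3: the ratio C(n, 3) / (n^3 / 3!) tends to 1, i.e. polynomial growth.
for n in [10, 100, 1000]:
    print(n, math.comb(n, 3) / (n ** 3 / math.factorial(3)))

# k = n/2: C(n, n/2) is exponentially large, close to 2^n / sqrt(pi * n / 2).
n = 50
approx = 2 ** n / math.sqrt(math.pi * n / 2)
assert abs(math.comb(n, n // 2) / approx - 1) < 0.01
```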

Exercises

ExerciseCore

Problem

Prove that $\sum_{k=0}^{n} \binom{n}{k}^2 = \binom{2n}{n}$ using a combinatorial argument.

ExerciseAdvanced

Problem

Use Stirling's approximation to show that $\log \binom{n}{\lfloor pn \rfloor} \approx nH(p)$ for $p \in (0,1)$, where $H(p) = -p\log p - (1-p)\log(1-p)$ is the binary entropy (using natural log).

References

Canonical:

  • Flajolet & Sedgewick, Analytic Combinatorics (2009), Chapter I
  • Graham, Knuth, Patashnik, Concrete Mathematics (1994), Chapters 5-6

Current:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Section 6.5 (Sauer-Shelah and combinatorial arguments)
  • Stanley, Enumerative Combinatorics, Vol. 1 (2nd ed., 2011), Chapter 1 (counting principles and generating functions)

Last reviewed: April 26, 2026
