
Foundations

Counting and Combinatorics

Counting principles, binomial and multinomial coefficients, inclusion-exclusion, and Stirling's approximation. These tools appear whenever you count hypotheses, bound shattering coefficients, or analyze combinatorial arguments in learning theory.


Why This Matters

Combinatorics appears throughout learning theory in unexpected places.

VC dimension theory counts the number of labelings a hypothesis class can produce on $n$ points. The Sauer-Shelah lemma bounds this by $\sum_{i=0}^{d} \binom{n}{i}$. Bounding this sum requires knowing how binomial coefficients behave.

Union bounds over hypothesis classes require counting. Covering number arguments require counting. The PAC learning sample complexity formula has $\log \binom{n}{k}$ terms that you simplify using Stirling's approximation.

Counting Principles

Product rule: if task A has $m$ outcomes and task B has $n$ outcomes, the pair (A, B) has $mn$ outcomes. This extends to $k$ tasks.

Sum rule: if tasks A and B are mutually exclusive, the total outcomes are $m + n$.

Permutations: the number of ordered arrangements of $k$ items from $n$ is $n!/(n-k)! = n(n-1)\cdots(n-k+1)$.
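These three principles can be checked with Python's standard-library counting helpers; a small illustrative sketch (the specific numbers are hypothetical):

```python
import math

# Product rule: 3 outcomes for task A, 4 for task B -> 12 ordered pairs.
pairs = 3 * 4

# Permutations: ordered arrangements of k = 3 items out of n = 5,
# i.e. n!/(n-k)! = 5 * 4 * 3.
p = math.perm(5, 3)

# For comparison, unordered selections: C(5, 3) = 10.
c = math.comb(5, 3)

print(pairs, p, c)  # -> 12 60 10
```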

Core Definitions

Definition

Binomial Coefficient

The binomial coefficient counts the number of ways to choose $k$ items from $n$ without regard to order:

$$\binom{n}{k} = \frac{n!}{k!(n-k)!}$$

for $0 \leq k \leq n$. By convention, $\binom{n}{k} = 0$ if $k < 0$ or $k > n$.

Key properties:

  • Symmetry: $\binom{n}{k} = \binom{n}{n-k}$
  • Pascal's rule: $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$
  • Sum: $\sum_{k=0}^n \binom{n}{k} = 2^n$

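All three properties are one-liners to verify numerically with `math.comb`; a minimal sanity check (the choice $n = 10$ is arbitrary):

```python
import math

n = 10

# Symmetry: C(n, k) == C(n, n-k).
assert all(math.comb(n, k) == math.comb(n, n - k) for k in range(n + 1))

# Pascal's rule: C(n, k) == C(n-1, k-1) + C(n-1, k) for 1 <= k <= n-1.
assert all(math.comb(n, k) == math.comb(n - 1, k - 1) + math.comb(n - 1, k)
           for k in range(1, n))

# Row sum: sum_k C(n, k) == 2^n.
assert sum(math.comb(n, k) for k in range(n + 1)) == 2 ** n
```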
Definition

Multinomial Coefficient

The number of ways to partition $n$ objects into groups of sizes $k_1, \ldots, k_r$ (where $k_1 + \cdots + k_r = n$) is:

$$\binom{n}{k_1, k_2, \ldots, k_r} = \frac{n!}{k_1! k_2! \cdots k_r!}$$
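A multinomial coefficient can be computed as a telescoping product of binomial coefficients, which keeps everything in exact integer arithmetic. The `multinomial` helper below is a hypothetical sketch, not part of the original text:

```python
import math

def multinomial(*ks):
    """n! / (k1! k2! ... kr!) via the identity
    C(k1, k1) * C(k1+k2, k2) * ... * C(k1+...+kr, kr)."""
    n, out = 0, 1
    for k in ks:
        n += k
        out *= math.comb(n, k)
    return out

# Partition 10 objects into groups of sizes 3, 3, 4.
direct = math.factorial(10) // (math.factorial(3) ** 2 * math.factorial(4))
assert multinomial(3, 3, 4) == direct == 4200
```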

Main Theorems

Theorem

Binomial Theorem

Statement

For any $a, b \in \mathbb{R}$ and non-negative integer $n$:

$$(a+b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}$$

Intuition

Expand $(a+b)^n = (a+b)(a+b)\cdots(a+b)$. Each term in the expansion picks either $a$ or $b$ from each of the $n$ factors. The term $a^k b^{n-k}$ appears once for each way to choose $k$ factors to contribute $a$, which is $\binom{n}{k}$.

Proof Sketch

Induction on $n$. Base case $n=0$: both sides equal 1. Inductive step: $(a+b)^{n+1} = (a+b)(a+b)^n = (a+b)\sum_k \binom{n}{k} a^k b^{n-k}$. Distribute and reindex using Pascal's rule.

Why It Matters

Setting $a = b = 1$ gives $2^n = \sum_k \binom{n}{k}$: a set of size $n$ has $2^n$ subsets. Setting $a = 1, b = -1$ gives $0 = \sum_k (-1)^k \binom{n}{k}$: the number of even-size subsets equals the number of odd-size subsets. These identities appear in VC theory and randomized arguments.
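Both specializations, plus a generic instance of the expansion, can be checked numerically; a quick sketch with arbitrary values:

```python
import math

n = 7

# a = b = 1: the 2^n subsets of an n-element set.
assert sum(math.comb(n, k) for k in range(n + 1)) == 2 ** n

# a = 1, b = -1: even-size and odd-size subsets balance out.
assert sum((-1) ** k * math.comb(n, k) for k in range(n + 1)) == 0

# A generic instance: (2 + 3)^n expanded term by term.
assert sum(math.comb(n, k) * 2 ** k * 3 ** (n - k) for k in range(n + 1)) == 5 ** n
```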

Failure Mode

The formula is exact for integer $n \geq 0$. For non-integer or negative $n$, use the generalized binomial series, which is an infinite series and requires convergence conditions.

Inclusion-Exclusion

Definition

Inclusion-Exclusion Principle

For finite sets $A_1, \ldots, A_n$:

$$\left|\bigcup_{i=1}^n A_i\right| = \sum_{i} |A_i| - \sum_{i < j} |A_i \cap A_j| + \sum_{i < j < k} |A_i \cap A_j \cap A_k| - \cdots + (-1)^{n+1} |A_1 \cap \cdots \cap A_n|$$

This alternating sum corrects for overcounting at each level.

In probability, replace $|A|$ with $P(A)$:

$$P\left(\bigcup_{i=1}^n A_i\right) = \sum_i P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \cdots$$

The union bound $P(\bigcup_i A_i) \leq \sum_i P(A_i)$ keeps only the first term of inclusion-exclusion, which overcounts. Bonferroni inequalities give alternating upper and lower bounds by truncating at different levels.
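Inclusion-exclusion is easy to test against a direct union computation; a small sketch with three hypothetical sets of integers:

```python
from itertools import combinations

sets = [set(range(0, 10)), set(range(5, 15)), set(range(8, 20))]

# Ground truth: size of the union computed directly.
direct = len(set().union(*sets))

# Inclusion-exclusion: alternate signs over intersections of every size.
ie = 0
for r in range(1, len(sets) + 1):
    for combo in combinations(sets, r):
        ie += (-1) ** (r + 1) * len(set.intersection(*combo))

assert ie == direct == 20
```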

Other Combinatorial Tools

Three identities appear often enough in ML theory and adjacent areas to flag by name.

Vandermonde's identity: $\binom{m+n}{r} = \sum_{k=0}^{r} \binom{m}{k}\binom{n}{r-k}$. Choose $r$ items from a pool of $m+n$ by first deciding how many come from each side. It shows up when splitting a binomial coefficient along a partition of the underlying set, for example in convolution arguments over hypergeometric-type quantities.

Stars and bars: the number of ways to write a non-negative integer $n$ as an ordered sum of $k$ non-negative integers is $\binom{n+k-1}{k-1}$. This counts compositions of an integer, lattice points in a simplex, and (up to reindexing) the number of monomials of total degree $n$ in $k$ variables. The second use appears in polynomial kernel dimension counts and in bounding the number of degree-$n$ features in $k$ inputs.

Catalan numbers: $C_n = \frac{1}{n+1}\binom{2n}{n}$ counts balanced parenthesizations of $n$ pairs, binary trees with $n+1$ leaves, and several other structures in bijection with those. They show up in parsing (binary-tree enumerations over context-free grammars), in some combinatorial proofs about random walks, and in the analysis of tree-structured models.
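Each of these three identities can be checked against brute force or known values; a short sketch (the parameter choices are arbitrary):

```python
import math
from itertools import product

# Vandermonde: C(m+n, r) == sum_k C(m, k) * C(n, r-k).
m, n, r = 5, 7, 4
assert math.comb(m + n, r) == sum(math.comb(m, k) * math.comb(n, r - k)
                                  for k in range(r + 1))

# Stars and bars: ordered k-tuples of non-negative integers summing to n.
total, k = 6, 3
brute = sum(1 for t in product(range(total + 1), repeat=k) if sum(t) == total)
assert brute == math.comb(total + k - 1, k - 1) == 28

# Catalan numbers: C_n = C(2n, n) / (n + 1).
catalan = [math.comb(2 * i, i) // (i + 1) for i in range(6)]
assert catalan == [1, 1, 2, 5, 14, 42]
```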

Stirling's Approximation

Theorem

Stirling Approximation

Statement

$$n! = \sqrt{2\pi n} \left(\frac{n}{e}\right)^n \left(1 + O\left(\frac{1}{n}\right)\right)$$

The cruder but often sufficient form is:

$$\log(n!) = n \log n - n + O(\log n)$$

Intuition

The factorial grows faster than any exponential but slower than $n^n$. Stirling's formula pins down the growth rate: $n!$ is approximately $(n/e)^n$ times the slowly growing correction $\sqrt{2\pi n}$.

Proof Sketch

Write $\log(n!) = \sum_{k=1}^n \log k$ and approximate this sum by the integral $\int_1^n \log x \, dx = n \log n - n + 1$. The $\sqrt{2\pi n}$ factor comes from a more refined analysis using the Euler-Maclaurin formula or the Laplace method applied to the Gamma function integral.

Why It Matters

Stirling lets you simplify binomial coefficients. For example: $\log \binom{n}{k} \approx n H(k/n)$ where $H(p) = -p \log p - (1-p) \log(1-p)$ is the binary entropy. This connection between counting and entropy appears throughout information theory and learning theory.

Failure Mode

The approximation is asymptotic; for small $n$ the relative error can be noticeable. For $n = 1$, Stirling gives $\sqrt{2\pi} \cdot (1/e) \approx 0.922$ vs the true value of 1. For $n \geq 10$, the relative error is below 1%.
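The error behavior is easy to see numerically; a sketch comparing the leading term of Stirling's formula to the exact factorial:

```python
import math

def stirling(n):
    # Leading term of Stirling's formula, without the 1 + O(1/n) correction.
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

for n in [1, 2, 5, 10, 20]:
    rel_err = abs(stirling(n) - math.factorial(n)) / math.factorial(n)
    print(n, rel_err)  # the relative error shrinks roughly like 1/(12n)

assert abs(stirling(1) - 1) > 0.07                        # ~8% off at n = 1
assert abs(stirling(10) / math.factorial(10) - 1) < 0.01  # under 1% by n = 10
```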

Applications in Machine Learning Theory

Hypothesis Counting and VC Theory

The Sauer-Shelah lemma bounds the growth function of a hypothesis class $\mathcal{H}$ with VC dimension $d$:

$$\Pi_{\mathcal{H}}(n) \leq \sum_{i=0}^{d} \binom{n}{i}$$

For fixed $d$ and growing $n$, this sum is $O(n^d)$, a polynomial bound. Without this combinatorial bound, the growth function could be as large as $2^n$ (all possible labelings), making uniform convergence arguments impossible. The transition from exponential to polynomial growth is the reason finite VC dimension implies learnability.

To see why: for $n \geq d$, each term satisfies $\binom{n}{i} \leq n^i \leq n^d$, so summing $d+1$ terms gives $\sum_{i=0}^{d} \binom{n}{i} \leq (d+1) n^d$. A tighter standard bound is $\sum_{i=0}^{d} \binom{n}{i} \leq (en/d)^d$ for $n \geq d \geq 1$. Either way, the growth is polynomial in $n$ for fixed $d$. The PAC learning sample complexity bound then becomes $n = O\left((d \log(d/\epsilon) + \log(1/\delta))/\epsilon\right)$, which is polynomial in $d$, $1/\epsilon$, and $\log(1/\delta)$.
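The two bounds are straightforward to compare against the exact sum; a sketch with arbitrary $n$ and $d$:

```python
import math

def growth_bound(n, d):
    # Sauer-Shelah upper bound on the growth function: sum_{i<=d} C(n, i).
    return sum(math.comb(n, i) for i in range(d + 1))

n, d = 100, 5
exact = growth_bound(n, d)
loose = (d + 1) * n ** d           # (d+1) n^d
tight = (math.e * n / d) ** d      # (en/d)^d

assert exact <= tight <= loose     # both are valid, polynomial-in-n bounds
print(exact, int(tight), loose)
```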

Model Selection and Counting Arguments

When comparing $M$ models via a held-out validation set, the probability that the best model on the validation set is also the best model on unseen data depends on $M$. A union bound over $M$ candidates gives:

$$P(\text{any model overfits by } > \epsilon) \leq M \cdot e^{-2n\epsilon^2}$$

Solving for $n$ gives the agnostic finite-class PAC bound:

$$n \geq \frac{\log M + \log(1/\delta)}{2 \epsilon^2}$$

The sample size scales additively in $\log M$, not multiplicatively. Going from $M = 2$ candidates to $M = 100$ increases the required sample size by $(\log 100 - \log 2)/(2\epsilon^2)$, an additive offset whose size depends on $\epsilon$ and whatever baseline $\log(1/\delta)$ term is present. Whether that offset is a 10% bump or a doubling depends entirely on the other terms, so phrasing it as a "2x more data" rule of thumb is misleading. This bound assumes $|\mathcal{H}| = M$ is finite. For infinite hypothesis classes, the counting is replaced by VC dimension, and the bound on the pac-learning-framework page applies.
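Plugging numbers into the bound makes the additive scaling concrete; a sketch with the hypothetical choices $\epsilon = \delta = 0.05$:

```python
import math

def finite_class_sample_size(M, eps, delta):
    # Agnostic finite-class PAC bound: n >= (log M + log(1/delta)) / (2 eps^2).
    return math.ceil((math.log(M) + math.log(1 / delta)) / (2 * eps ** 2))

eps, delta = 0.05, 0.05
n2 = finite_class_sample_size(2, eps, delta)
n100 = finite_class_sample_size(100, eps, delta)

# Going from 2 to 100 candidates adds an additive offset of about
# (log 100 - log 2) / (2 eps^2) samples, not a multiplicative factor.
offset = (math.log(100) - math.log(2)) / (2 * eps ** 2)
assert n100 - n2 <= math.ceil(offset) + 1
print(n2, n100)  # -> 738 1521
```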

Multinomial Coefficients in Text Modeling

In the multinomial naive Bayes model, the probability of a document with word counts $(k_1, \ldots, k_V)$ is:

$$P(\text{doc}) = \frac{n!}{k_1! \cdots k_V!} \prod_{j=1}^{V} p_j^{k_j}$$

The multinomial coefficient $n!/(k_1! \cdots k_V!)$ counts the number of orderings that produce the same bag of words. In practice, this coefficient cancels when comparing class posteriors (it does not depend on the class label), but understanding its origin clarifies why naive Bayes treats word order as irrelevant.
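The cancellation is easy to demonstrate: the likelihood ratio between two classes is the same whether or not the coefficient is included. A toy sketch with a hypothetical 3-word vocabulary and made-up class probabilities:

```python
import math

counts = [3, 1, 2]                      # word counts (k1, k2, k3), n = 6
n = sum(counts)
coef = math.factorial(n) // math.prod(math.factorial(k) for k in counts)

p_class1 = [0.5, 0.3, 0.2]              # per-class word probabilities
p_class2 = [0.2, 0.3, 0.5]

def doc_prob(ps):
    return coef * math.prod(p ** k for p, k in zip(ps, counts))

# Likelihood ratio with the coefficient vs the coefficient dropped entirely:
with_coef = doc_prob(p_class1) / doc_prob(p_class2)
without_coef = math.prod((p1 / p2) ** k
                         for p1, p2, k in zip(p_class1, p_class2, counts))
assert abs(with_coef - without_coef) < 1e-12
```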

Example

Counting binary classifiers on n points

A binary classifier on $n$ points assigns each point a label in $\{0, 1\}$. The total number of possible labelings is $2^n$. For 20 data points, $2^{20} = 1{,}048{,}576$. A hypothesis class that can produce all of these labelings has VC dimension at least 20, and will overfit unless it sees on the order of $d/\epsilon$ training points.

An affine linear classifier in $\mathbb{R}^d$ (i.e. a halfspace $\{x : w^T x + b \geq 0\}$ with a free bias term) shatters at most $d+1$ points, so its VC dimension is $d+1$. The homogeneous version through the origin ($b = 0$) loses one degree of freedom and has VC dimension $d$. In $\mathbb{R}^{10}$ with bias, the VC dimension is 11, and the number of achievable labelings on 20 points is at most $\sum_{i=0}^{11} \binom{20}{i} = 784{,}626$, about 75% of all $2^{20}$ possible labelings. The restriction tightens dramatically as $n$ grows: on $n$ points the Sauer-Shelah bound is polynomial $O(n^{11})$ rather than the exponential $2^n$, and this polynomial-vs-exponential gap is what makes PAC learning possible.
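The numbers in this example can be reproduced in a couple of lines:

```python
import math

n_points, vc_dim = 20, 11   # halfspaces with bias in R^10 have VC dimension 11

# Sauer-Shelah cap on achievable labelings vs all possible labelings.
achievable = sum(math.comb(n_points, i) for i in range(vc_dim + 1))
total = 2 ** n_points

print(achievable, total, achievable / total)  # -> 784626 1048576 (~0.748)
```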

Common Confusions

Watch Out

Binomial coefficients grow polynomially for fixed k

$\binom{n}{k} = \Theta(n^k / k!)$ when $k$ is fixed and $n$ grows. But when $k$ grows with $n$ (say $k = n/2$), $\binom{n}{k}$ is exponential: $\binom{n}{n/2} \approx 2^n / \sqrt{\pi n / 2}$. The Sauer-Shelah bound $\sum_{i=0}^d \binom{n}{i}$ is polynomial in $n$ for fixed $d$, which is the whole point.
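The two regimes can be seen numerically; a short sketch:

```python
import math

# Fixed k = 3: the ratio C(n, 3) / (n^3 / 3!) tends to 1, i.e. polynomial growth.
for n in [10, 100, 1000]:
    print(n, math.comb(n, 3) / (n ** 3 / math.factorial(3)))

# k = n/2: C(n, n/2) is exponentially large, close to 2^n / sqrt(pi * n / 2).
n = 50
approx = 2 ** n / math.sqrt(math.pi * n / 2)
assert abs(math.comb(n, n // 2) / approx - 1) < 0.01
```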

Exercises

ExerciseCore

Problem

Prove that $\sum_{k=0}^{n} \binom{n}{k}^2 = \binom{2n}{n}$ using a combinatorial argument.

ExerciseAdvanced

Problem

Use Stirling's approximation to show that $\log \binom{n}{\lfloor pn \rfloor} \approx nH(p)$ for $p \in (0,1)$, where $H(p) = -p\log p - (1-p)\log(1-p)$ is the binary entropy (using natural log).

References

Canonical:

  • Flajolet & Sedgewick, Analytic Combinatorics (2009), Chapter I
  • Graham, Knuth, Patashnik, Concrete Mathematics (1994), Chapters 5-6

Current:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Section 6.5 (Sauer-Shelah and combinatorial arguments)
  • Stanley, Enumerative Combinatorics, Vol. 1 (2nd ed., 2011), Chapter 1 (counting principles and generating functions)

Last reviewed: April 26, 2026
