
Mathematical Infrastructure

Martingale Theory

Martingales and their convergence properties: Doob martingale, optional stopping theorem, martingale convergence, Azuma-Hoeffding inequality, and Freedman inequality. The tools behind McDiarmid's inequality, online learning regret bounds, and stochastic approximation.

Advanced · Tier 2 · Stable · Core spine · ~70 min

Why This Matters

Martingales are the mathematical framework for "fair games" --- sequences of random variables where the conditional expectation of the next value, given the past, equals the current value. This simple definition forces strong concentration and convergence properties (Doob, Azuma, optional stopping) that the iid case alone does not deliver.

In machine learning theory, martingales are everywhere:

  • McDiarmid's inequality (the key concentration tool for generalization bounds) is proved using the Azuma-Hoeffding inequality for martingales
  • Online learning regret bounds use martingale arguments to control the cumulative loss of adaptive algorithms
  • Stochastic approximation (SGD convergence theory) uses martingale convergence theorems to show that noisy gradient estimates converge
  • Sequential hypothesis testing uses optional stopping to determine when to stop collecting data
  • PAC-Bayes bounds with data-dependent priors use martingale constructions to handle adaptivity

Martingales are the probabilistic structure that lets you extend concentration-of-measure arguments from independent sequences to adapted sequences with conditional centering. They generalize sums of independent variables while retaining enough structure for tail bounds.

theorem visual

Filtration Reveal

As information is revealed step by step ($\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \cdots \subseteq \mathcal{F}_4$ in the interactive figure), the conditional expectation updates fairly; bounded increments then control how far the process can drift.

Doob interpretation

Each step is the best estimate of the final outcome after revealing one more piece of information, so the path satisfies the martingale identity $\mathbb{E}[Z_{i+1} \mid \mathcal{F}_i] = Z_i$.

Mental Model

Think of a gambler playing a sequence of fair games. After each round, the expected value of their fortune equals their current fortune --- no strategy can create an expected gain from a fair game. This is the martingale property.

The power comes from what you can prove about such sequences: they cannot stray too far from their starting point (concentration), they converge to a limit (convergence theorem), and stopping at a clever time cannot create an expected profit (optional stopping).

Formal Setup

Definition

Filtration

A filtration is an increasing sequence of sigma-algebras:

$$\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots$$

$\mathcal{F}_n$ represents the "information available at time $n$." A random variable $X$ is $\mathcal{F}_n$-measurable if and only if its value is determined by the information in $\mathcal{F}_n$. The natural filtration of a sequence $X_0, X_1, \ldots$ is $\mathcal{F}_n = \sigma(X_0, X_1, \ldots, X_n)$ --- the sigma-algebra generated by the first $n+1$ observations.

Definition

Martingale

A sequence of random variables $(M_n)_{n \geq 0}$ is a martingale with respect to a filtration $(\mathcal{F}_n)_{n \geq 0}$ if and only if:

  1. $M_n$ is $\mathcal{F}_n$-measurable (adapted) for all $n$
  2. $\mathbb{E}[|M_n|] < \infty$ for all $n$ (integrability)
  3. $\mathbb{E}[M_{n+1} \mid \mathcal{F}_n] = M_n$ for all $n$ (martingale property)

Submartingale: condition 3 is replaced by $\mathbb{E}[M_{n+1} \mid \mathcal{F}_n] \geq M_n$ (expected upward drift).

Supermartingale: condition 3 is replaced by $\mathbb{E}[M_{n+1} \mid \mathcal{F}_n] \leq M_n$ (expected downward drift).

The martingale property says: the best prediction of the future value, given all information up to now, is the current value. No trend, no drift --- a "fair game."

Definition

Martingale Difference Sequence

A sequence $(D_n)_{n \geq 1}$ is a martingale difference sequence (MDS) with respect to $(\mathcal{F}_n)$ if and only if:

$$\mathbb{E}[D_n \mid \mathcal{F}_{n-1}] = 0 \quad \text{for all } n \geq 1$$

If $(M_n)$ is a martingale, then $D_n = M_n - M_{n-1}$ is an MDS. The martingale is the cumulative sum: $M_n = M_0 + \sum_{k=1}^n D_k$.

MDSs are the martingale analogue of i.i.d. mean-zero random variables, but with two key differences: the $D_n$ are allowed to be dependent, and their conditional variance can change over time.
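Since the defining condition is only conditional centering, an MDS can have strongly past-dependent increments. A minimal numerical sketch (the volatility rule here is an illustrative assumption, not from the source): the conditional variance depends on the sign of the running sum, yet the sum remains a martingale.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps = 100_000, 5

# MDS: D_k = sigma_k * eps_k, where sigma_k depends on the PAST (the sign of
# the running sum) and eps_k is a fresh fair +-1 flip. Conditional centering:
# E[D_k | F_{k-1}] = sigma_k * E[eps_k] = 0, so M_n = sum_k D_k is a martingale
# even though the increments are dependent and heteroskedastic.
M = np.zeros(n_paths)
for _ in range(n_steps):
    sigma = np.where(M >= 0, 0.5, 2.0)            # predictable volatility rule
    eps = rng.integers(0, 2, size=n_paths) * 2.0 - 1.0
    M += sigma * eps

print(abs(M.mean()))    # near 0: E[M_n] = E[M_0] = 0
```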

The Doob Martingale

Definition

Doob Martingale

Let $X = (X_1, \ldots, X_n)$ be a random vector and $f: \mathcal{X}^n \to \mathbb{R}$ a function with $\mathbb{E}[|f(X)|] < \infty$. The Doob martingale (or exposure martingale) is:

$$Z_i = \mathbb{E}[f(X_1, \ldots, X_n) \mid X_1, \ldots, X_i], \quad i = 0, 1, \ldots, n$$

with $Z_0 = \mathbb{E}[f(X)]$ and $Z_n = f(X)$.

This is a martingale: $\mathbb{E}[Z_{i+1} \mid X_1, \ldots, X_i] = Z_i$ by the tower property of conditional expectation.

The Doob martingale is the fundamental construction for proving concentration inequalities. It "reveals" the random variables one at a time: $Z_0$ is the unconditional expectation (no information), $Z_n$ is the actual value (full information), and each $Z_i$ adds one variable's worth of information.

The martingale differences are $D_i = Z_i - Z_{i-1} = \mathbb{E}[f(X) \mid X_1, \ldots, X_i] - \mathbb{E}[f(X) \mid X_1, \ldots, X_{i-1}]$. These measure how much the conditional expectation of $f(X)$ changes when $X_i$ is revealed. If $f$ has bounded differences ($|D_i| \leq c_i$), the Azuma-Hoeffding inequality gives exponential concentration.

Main Theorems

Theorem

Azuma-Hoeffding Inequality

Statement

Let $(M_n)$ be a martingale with bounded increments: $|M_k - M_{k-1}| \leq c_k$ almost surely for $k = 1, \ldots, n$. Then for any $t > 0$:

$$\Pr[M_n - M_0 \geq t] \leq \exp\!\left(-\frac{t^2}{2\sum_{k=1}^n c_k^2}\right)$$

$$\Pr[|M_n - M_0| \geq t] \leq 2\exp\!\left(-\frac{t^2}{2\sum_{k=1}^n c_k^2}\right)$$

If all $c_k = c$ (equal bounds), this simplifies to $\Pr[|M_n - M_0| \geq t] \leq 2\exp(-t^2/(2nc^2))$.

Intuition

A martingale with bounded increments (each step moves by at most $c_k$) cannot deviate far from its starting point with high probability. The deviation is controlled by $\sqrt{\sum_k c_k^2}$ --- the "diffusion scale" of the random walk. This is the martingale analogue of Hoeffding's inequality for sums of independent bounded random variables.
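The bound is easy to sanity-check numerically against the simplest bounded-increment martingale, a $\pm 1$ random walk with $c_k = 1$. A sketch with illustrative parameters:

```python
import math
import numpy as np

def azuma_bound(t, c):
    """Two-sided Azuma-Hoeffding bound: 2 * exp(-t^2 / (2 * sum c_k^2))."""
    return 2 * math.exp(-t**2 / (2 * sum(ck**2 for ck in c)))

rng = np.random.default_rng(1)
n, t, n_paths = 100, 25, 100_000

# Simple +-1 random walk: a martingale with increments bounded by c_k = 1.
steps = rng.integers(0, 2, size=(n_paths, n), dtype=np.int8) * 2 - 1
walks = steps.sum(axis=1, dtype=np.int64)
empirical = np.mean(np.abs(walks) >= t)

bound = azuma_bound(t, [1.0] * n)
print(f"empirical tail {empirical:.4f} <= Azuma bound {bound:.4f}")
```

The empirical tail sits well below the bound, as expected: Azuma holds for every bounded-increment martingale, so it is conservative for this particular one.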

Proof Sketch

Use the exponential supermartingale technique:

Step 1: For any $\lambda > 0$, define $Y_n = \exp(\lambda M_n - \psi(\lambda, n))$ where $\psi$ is chosen so that $(Y_n)$ is a supermartingale.

Step 2: Since $|D_k| \leq c_k$ and $\mathbb{E}[D_k \mid \mathcal{F}_{k-1}] = 0$, Hoeffding's lemma gives $\mathbb{E}[e^{\lambda D_k} \mid \mathcal{F}_{k-1}] \leq e^{\lambda^2 c_k^2/2}$.

Therefore $\mathbb{E}[Y_n \mid \mathcal{F}_{n-1}] \leq Y_{n-1}$ with $\psi(\lambda, n) = \frac{\lambda^2}{2}\sum_{k=1}^n c_k^2$.

Step 3: Apply Markov's inequality: $\Pr[M_n - M_0 \geq t] = \Pr[e^{\lambda(M_n - M_0)} \geq e^{\lambda t}] \leq e^{-\lambda t}\,\mathbb{E}[e^{\lambda(M_n - M_0)}] \leq e^{-\lambda t + \lambda^2 \sum_k c_k^2/2}$.

Step 4: Optimize over $\lambda$: set $\lambda = t/\sum_k c_k^2$ to get $\exp(-t^2/(2\sum_k c_k^2))$.

Why It Matters

Azuma-Hoeffding is the engine behind McDiarmid's inequality, but the two constants are not obtained by naive chaining. McDiarmid's inequality is derived from Azuma-Hoeffding applied to the Doob martingale $M_i = \mathbb{E}[f(X) \mid X_1, \ldots, X_i]$. Under the bounded-differences condition, the Doob increment $D_i = M_i - M_{i-1}$ satisfies a conditional range condition: given $X_1, \ldots, X_{i-1}$, it takes values in an interval of length at most $c_i$, which is stronger than $|D_i| \leq c_i$. Hoeffding's lemma for a variable confined to an interval of length $c_i$ gives the MGF bound $e^{\lambda^2 c_i^2/8}$, a factor-of-4 improvement over the $e^{\lambda^2 c_i^2/2}$ obtained from the naive absolute-value bound, yielding the McDiarmid constant:

$$\Pr[f(X) - \mathbb{E}[f(X)] \geq t] \leq \exp\!\left(-\frac{2t^2}{\sum_i c_i^2}\right)$$

This is NOT the same as the Azuma constant $\exp(-t^2/(2\sum_k c_k^2))$; the two differ by a factor of 4 inside the exponent. A student who naively plugs $c_k = c_i$ into the Azuma bound above will get a constant that is 4x too loose. This Doob-plus-sharpened-Hoeffding route is how generalization bounds, Rademacher complexity concentration, and many other ML results are proved.
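The factor-of-4 gap between the two exponents is pure arithmetic and can be checked directly (illustrative numbers; the helper names are mine):

```python
def azuma_exponent(t, c):
    # naive Azuma: Pr <= exp(-t^2 / (2 * sum c_i^2))
    return t**2 / (2 * sum(ci**2 for ci in c))

def mcdiarmid_exponent(t, c):
    # McDiarmid: Pr <= exp(-2 * t^2 / sum c_i^2)
    return 2 * t**2 / sum(ci**2 for ci in c)

c = [0.1] * 100          # illustrative bounded differences, n = 100
t = 0.5
ratio = mcdiarmid_exponent(t, c) / azuma_exponent(t, c)
print(ratio)             # 4.0: McDiarmid's exponent is 4x larger, hence tighter
```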

Failure Mode

Azuma-Hoeffding uses only the worst-case increment bound $c_k$. If the increments are typically much smaller than $c_k$ (small variance but bounded range), the bound is loose. The Freedman inequality below is tighter in this case because it uses the variance of the increments.

Theorem

Doob's Martingale Convergence Theorem

Statement

Let $(M_n)_{n \geq 0}$ be a supermartingale with $\sup_n \mathbb{E}[M_n^-] < \infty$ (in particular, a non-negative supermartingale qualifies, since then $M_n^- = 0$). Then there exists a random variable $M_\infty$ with $\mathbb{E}[|M_\infty|] < \infty$ such that:

$$M_n \to M_\infty \quad \text{almost surely}$$

For a martingale bounded in $L^2$ ($\sup_n \mathbb{E}[M_n^2] < \infty$), the convergence also holds in $L^1$: $\mathbb{E}[|M_n - M_\infty|] \to 0$.

Intuition

A non-negative supermartingale is a "wealth process" for a gambler playing unfavorable games: expected wealth decreases over time, and the wealth cannot go below zero. Such a process must eventually settle down --- it cannot keep fluctuating forever because it has no expected upward drift and it is bounded below. The almost-sure limit captures the gambler's long-run fortune.

For martingales bounded in $L^2$, the convergence is stronger: the sequence not only converges pointwise but the expected deviation from the limit goes to zero.

Proof Sketch

The proof uses Doob's upcrossing inequality: let $U_n(a, b)$ be the number of times the sequence $M_0, M_1, \ldots, M_n$ crosses upward from below $a$ to above $b$ (with $a < b$). Then:

$$\mathbb{E}[U_n(a, b)] \leq \frac{\mathbb{E}[(M_n - a)^-]}{b - a}$$

If $\sup_n \mathbb{E}[M_n^-] < \infty$, the right side is bounded uniformly in $n$, so $U_\infty(a, b) < \infty$ a.s. for all rational $a < b$.

A sequence with finitely many upcrossings of every interval $(a, b)$ must converge (it cannot oscillate between any two values infinitely often). Therefore $M_n \to M_\infty$ a.s.

Why It Matters

In ML theory, martingale convergence appears in:

  • Stochastic approximation: in the Robbins-Monro framework, SGD with step sizes satisfying $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$ has iterates that, under standard regularity, generate a non-negative almost-supermartingale to which the Robbins-Siegmund (1971) lemma applies, yielding almost-sure convergence to a stationary point
  • Online learning: the regret of certain algorithms, properly normalized, is a supermartingale that converges, giving per-round regret bounds
  • Bayesian consistency: posterior beliefs, viewed as a martingale indexed by the amount of data, converge to the truth under regularity conditions

Failure Mode

Almost-sure convergence does not imply $L^1$ convergence in general. The classic counterexample: a martingale $M_n$ that equals $0$ with probability $1 - 1/n$ and $n$ with probability $1/n$ (such a martingale exists: from the state $M_n = n$, move to $n+1$ with probability $n/(n+1)$ and to $0$ otherwise), so $\mathbb{E}[M_n] = 1$ for all $n$. Then $M_n \to 0$ a.s., but $\mathbb{E}[M_n] = 1 \not\to 0$. For $L^1$ convergence, you need uniform integrability.
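The "leaking expectation" phenomenon can be computed exactly for a close cousin of this counterexample, a double-or-nothing martingale (the construction below is an illustrative variant, not from the source):

```python
from fractions import Fraction

# Double-or-nothing martingale: M_0 = 1; given M_n, set M_{n+1} = 2*M_n with
# probability 1/2, and M_{n+1} = 0 otherwise. Fair at every step:
# E[M_{n+1} | M_n] = (1/2)*2*M_n = M_n. Exact law: M_n = 2^n w.p. 2^{-n}, else 0.
def mean_and_mass_at_zero(n):
    p_alive = Fraction(1, 2**n)      # P(M_n = 2^n), computed exactly
    mean = p_alive * 2**n            # E[M_n] = 1 for every n
    return mean, 1 - p_alive         # (E[M_n], P(M_n = 0))

for n in [1, 5, 20]:
    mean, p_zero = mean_and_mass_at_zero(n)
    print(n, mean, float(p_zero))
# E[M_n] stays 1 while P(M_n = 0) -> 1: M_n -> 0 a.s. but not in L^1.
```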

Theorem

Doob's Weak Maximal Inequality

Statement

Let $(X_n)$ be a non-negative submartingale. For any $\lambda > 0$:

$$\Pr\!\left(\max_{0 \leq k \leq n} X_k \geq \lambda\right) \leq \frac{\mathbb{E}[X_n]}{\lambda}$$

Intuition

This is the martingale analogue of Markov's inequality. Markov's inequality bounds the probability that a single non-negative random variable exceeds a threshold by its mean divided by the threshold. Doob's weak maximal inequality gives the same kind of tail control for the running maximum of a submartingale, using only the terminal expectation.
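One concrete instance, under the assumption of a simple $\pm 1$ walk $S_k$: $X_k = S_k^2$ is a non-negative submartingale, and the weak maximal inequality specializes to a Kolmogorov-type bound $\Pr(\max_k S_k^2 \geq \lambda) \leq n/\lambda$. A simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_paths, lam = 50, 100_000, 200.0

# X_k = S_k^2 for a +-1 walk: E[S_{k+1}^2 | F_k] = S_k^2 + 1 >= S_k^2,
# so (X_k) is a non-negative submartingale with terminal mean E[X_n] = n.
steps = rng.integers(0, 2, size=(n_paths, n), dtype=np.int8) * 2 - 1
S = steps.cumsum(axis=1, dtype=np.int32)
X_max = (S.astype(np.int64) ** 2).max(axis=1)

empirical = np.mean(X_max >= lam)
bound = n / lam                  # Doob: P(max X_k >= lam) <= E[X_n]/lam
print(f"empirical {empirical:.4f} <= bound {bound:.4f}")
```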

Why It Matters

The weak maximal form is the version used when a proof needs a uniform-in-time tail bound rather than a fixed-time tail bound. It is a standard bridge from terminal martingale control to anytime-valid arguments, stopping-time estimates, and concentration proofs where the path's supremum matters.

Failure Mode

The non-negativity and submartingale hypotheses are doing real work. Without them, the running maximum can be large even when the terminal value has small expectation, so the Markov-style ratio no longer controls the event. This weak inequality controls a tail probability; it does not by itself give an LpL^p norm bound for the maximum.

Theorem

Doob's L^p Maximal Inequality

Statement

Let $(X_n)$ be a non-negative submartingale. For $p > 1$, the $L^p$ maximal inequality gives:

$$\mathbb{E}\!\left[\left(\max_{0 \leq k \leq n} X_k\right)^p\right] \leq \left(\frac{p}{p-1}\right)^p\,\mathbb{E}[X_n^p]$$

Intuition

The weak maximal inequality controls the probability of crossing one threshold. The $L^p$ maximal inequality upgrades that into moment control: if the terminal variable has a finite $p$th moment, then the whole path's maximum has a finite $p$th moment with an explicit constant.

Why It Matters

The $L^p$ form is the stronger norm estimate used in convergence theorems, empirical-process arguments, and stochastic analysis. It turns terminal $L^p$ control into pathwise supremum control, which is exactly what many uniform-in-time estimates need.

Failure Mode

The constant $(p/(p-1))^p$ diverges as $p \downarrow 1$, which is why this is not an $L^1$ maximal theorem. At $p = 1$, the right replacement is the weak maximal inequality above, not a strong $L^1$ bound of the same shape.

Theorem

Optional Stopping Theorem

Statement

Let $(M_n)$ be a martingale and $\tau$ a stopping time with respect to $(\mathcal{F}_n)$. If $\tau \leq N$ almost surely for some constant $N$ (bounded stopping time), then:

$$\mathbb{E}[M_\tau] = \mathbb{E}[M_0]$$

A useful sufficient condition for unbounded stopping times: if (a) $\mathbb{E}[\tau] < \infty$ and (b) there exists $c$ such that $\mathbb{E}[|M_{n+1} - M_n| \mid \mathcal{F}_n] \leq c$, then $\mathbb{E}[M_\tau] = \mathbb{E}[M_0]$ (see Williams, Probability with Martingales, §10.10). Other non-equivalent sufficient conditions include: (i) uniform integrability of the stopped sequence $(M_{\tau \wedge n})$, or (ii) a dominating bound $|M_{\tau \wedge n}| \leq Y$ with $\mathbb{E}[Y] < \infty$ (dominated convergence). None of these is strictly "more general" than the others.

Intuition

You cannot beat a fair game by choosing when to stop. If the game is fair at every step ($\mathbb{E}[M_{n+1} \mid \mathcal{F}_n] = M_n$), then your expected payoff when you stop equals your starting capital, no matter how cleverly you choose when to stop (as long as you must stop in bounded time).

The caveat "bounded time" is critical. With unbounded stopping times, you can have $\mathbb{E}[M_\tau] \neq \mathbb{E}[M_0]$ --- the classic example is the doubling strategy in gambling, which requires infinite credit.

Proof Sketch

For a bounded stopping time $\tau \leq N$:

$$\mathbb{E}[M_\tau] = \mathbb{E}[M_N] - \mathbb{E}\!\left[\sum_{k=\tau+1}^{N}(M_k - M_{k-1})\right]$$

Since $\tau$ is a stopping time, the event $\{\tau < k\}$ is $\mathcal{F}_{k-1}$-measurable, and:

$$\mathbb{E}[(M_k - M_{k-1})\mathbf{1}_{\tau < k} \mid \mathcal{F}_{k-1}] = \mathbf{1}_{\tau < k}\,\mathbb{E}[M_k - M_{k-1} \mid \mathcal{F}_{k-1}] = 0$$

Taking expectations kills the correction term, so $\mathbb{E}[M_\tau] = \mathbb{E}[M_N] = \mathbb{E}[M_0]$.

Why It Matters

Optional stopping is used in sequential analysis and online learning where you must decide when to stop based on observed data. In sequential hypothesis testing (Wald's SPRT), the likelihood ratio process is a martingale, and optional stopping gives error probability guarantees. In anytime-valid confidence sequences, martingale constructions provide confidence bounds that hold at arbitrary data-dependent stopping times.

Failure Mode

The boundedness condition is not a technicality. The simple random walk $M_n = \sum_{i=1}^n X_i$ (with $X_i = \pm 1$ equally likely) is a martingale with $\mathbb{E}[M_0] = 0$. Let $\tau = \inf\{n : M_n = 1\}$. Then $\tau < \infty$ a.s. and $M_\tau = 1$ always, so $\mathbb{E}[M_\tau] = 1 \neq 0 = \mathbb{E}[M_0]$. The problem: $\tau$ is not bounded (and $\mathbb{E}[\tau] = \infty$).
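This failure mode is easy to see in simulation (illustrative parameters): the theorem does apply to the bounded time $\tau \wedge N$, so the truncated mean stays near $0$, even though every path that reaches $+1$ stops at exactly $1$.

```python
import numpy as np

rng = np.random.default_rng(3)
n_paths, N = 20_000, 1_000      # cap the unbounded stopping time at N

# Simple +-1 random walk; tau = first hitting time of +1, truncated at N.
steps = rng.integers(0, 2, size=(n_paths, N), dtype=np.int8) * 2 - 1
S = steps.cumsum(axis=1, dtype=np.int32)
hit = (S == 1)
tau = np.where(hit.any(axis=1), hit.argmax(axis=1), N - 1)
M_stopped = S[np.arange(n_paths), tau]       # value at time tau ∧ N

# Optional stopping applies to the bounded time tau ∧ N: E[M_{tau ∧ N}] = 0.
# Paths that hit +1 stop at exactly 1; the overall mean stays near 0 because
# the rare unstopped paths sit far below 0 at time N.
frac_hit = hit.any(axis=1).mean()
print(f"hit +1: {frac_hit:.3f}, mean stopped value: {M_stopped.mean():.3f}")
```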

The Freedman Inequality (Variance-Sensitive)

Theorem

Freedman Inequality

Statement

Let $(M_n)$ be a martingale whose increments $D_k = M_k - M_{k-1}$ satisfy $|D_k| \leq R$. Define the predictable quadratic variation:

$$W_n = \sum_{k=1}^n \mathbb{E}[D_k^2 \mid \mathcal{F}_{k-1}] = \sum_{k=1}^n \mathrm{Var}(D_k \mid \mathcal{F}_{k-1})$$

Then for any $t > 0$ and $v > 0$:

$$\Pr[M_n - M_0 \geq t \text{ and } W_n \leq v] \leq \exp\!\left(-\frac{t^2/2}{v + Rt/3}\right)$$

Intuition

Azuma-Hoeffding uses only the worst-case increment bound $R$. But if the increments are typically much smaller (low conditional variance), the martingale concentrates much better than Azuma-Hoeffding predicts. Freedman's inequality captures this by depending on the actual cumulative conditional variance $W_n$ rather than the worst-case bound $nR^2$.

When $v \gg Rt$: the bound becomes $\exp(-t^2/(2v))$, a Gaussian-type tail (sub-Gaussian with variance proxy $v$). When $Rt \gg v$: the bound becomes $\exp(-3t/(2R))$, a Poisson-type tail (sub-exponential).

Proof Sketch

The proof follows the same exponential supermartingale strategy as Azuma-Hoeffding, but with a tighter bound on the moment generating function of the increments. Instead of Hoeffding's lemma (which only uses the range), use the Bennett-type bound on the MGF:

$$\mathbb{E}[e^{\lambda D_k} \mid \mathcal{F}_{k-1}] \leq \exp\!\left(\frac{\mathrm{Var}(D_k \mid \mathcal{F}_{k-1})\,\phi(\lambda R)}{R^2}\right)$$

where $\phi(u) = e^u - u - 1$. Summing the exponents over $k$ gives a bound in terms of $W_n$ instead of $nR^2$. The optimizing choice $\lambda^* = \log(1 + Rt/v)/R$, together with the inequality $\phi(u) \leq u^2/(2(1 - u/3))$ for $0 < u < 3$, yields the final form $\exp(-t^2/(2(v + Rt/3)))$.

Why It Matters

Freedman's inequality is the variance-sensitive analogue of Azuma-Hoeffding. It is tighter whenever the conditional variances are much smaller than the worst-case increment bounds --- which is common in practice.

In ML theory: when proving regret bounds for online learning algorithms, the per-round loss differences are bounded but often have low variance (the algorithm makes mostly good predictions). Freedman gives tighter regret bounds than Azuma in these cases.

Failure Mode

The bound requires knowing (or bounding) the predictable quadratic variation $W_n$, which may itself be random. In practice, you often bound $W_n$ by a deterministic quantity $v$, and the Freedman bound then reduces to a variance-sensitive version of Azuma.
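To see the gap concretely, here is a back-of-the-envelope comparison with illustrative numbers of my own choosing (deliberately different from the exercise settings): many steps, worst-case increment $R = 1$, but small per-step conditional variance.

```python
import math

# Tail bounds for P(M_n - M_0 >= t) under the two inequalities (helper names mine).
def azuma(t, n, R):
    # uses only the worst-case increment bound R: exp(-t^2 / (2 n R^2))
    return math.exp(-t**2 / (2 * n * R**2))

def freedman(t, v, R):
    # uses the predictable variance v >= W_n: exp(-(t^2/2) / (v + R*t/3))
    return math.exp(-(t**2 / 2) / (v + R * t / 3))

# Illustrative setting: 500 steps, |D_k| <= R = 1, conditional variance
# 0.02 per step, so W_n <= v = 10.
n, R, t = 500, 1.0, 8.0
v = n * 0.02
print(f"Azuma:    {azuma(t, n, R):.3f}")     # ~0.94, nearly vacuous
print(f"Freedman: {freedman(t, v, R):.3f}")  # ~0.08, far tighter
```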

Martingale Representation Theorem

In continuous time, martingales adapted to a Brownian filtration have a remarkably clean structure. If $(M_t)_{0 \leq t \leq T}$ is a square-integrable martingale adapted to the filtration generated by a Brownian motion $(B_t)$, then there exists an (essentially unique) adapted process $\phi$ with $\mathbb{E}\int_0^T \phi_s^2\, ds < \infty$ such that

$$M_t = M_0 + \int_0^t \phi_s\, dB_s$$

Every Brownian-filtration martingale is a stochastic integral against Brownian motion. This is foundational for mathematical finance (every contingent claim in a complete market can be replicated by a dynamic trading strategy) and for optimal stopping and filtering theory. The theorem connects martingale theory to stochastic calculus: the integrand $\phi_s$ plays the role of an instantaneous hedging ratio, and the stochastic integral provides the mechanism for reconstructing the martingale from Brownian increments.

Proof Ideas and Templates Used

Martingale proofs in ML theory follow several standard patterns:

  1. Doob martingale + Azuma-Hoeffding: to prove concentration of a function $f(X_1, \ldots, X_n)$, construct the Doob martingale, bound the increments, and apply Azuma. This is how McDiarmid's inequality is proved.

  2. Exponential supermartingale: define $Y_n = \exp(\lambda M_n - \psi_n)$ and show it is a supermartingale. This is the universal technique for deriving concentration inequalities from martingale bounds.

  3. Stopping time arguments: to prove properties of adaptive procedures (online learning, sequential testing), formulate the quantity of interest as a martingale and apply optional stopping or convergence theorems.

Canonical Examples

Example

Doob martingale for the empirical mean

Let $X_1, \ldots, X_n$ be i.i.d. with $\mathbb{E}[X_i] = \mu$ and $X_i \in [a, b]$. Let $f(X_1, \ldots, X_n) = \frac{1}{n}\sum_i X_i$.

The Doob martingale is: $Z_k = \mathbb{E}[f \mid X_1, \ldots, X_k] = \frac{1}{n}\left(\sum_{i=1}^k X_i + (n-k)\mu\right)$.

The increment $D_k = Z_k - Z_{k-1} = \frac{1}{n}(X_k - \mu)$ ranges over an interval of length $(b-a)/n$. Applying the martingale argument with the sharpened range-based Hoeffding lemma (as in McDiarmid's inequality):

$$\Pr[|\bar{X} - \mu| \geq t] \leq 2\exp\!\left(-\frac{2nt^2}{(b-a)^2}\right)$$

This recovers Hoeffding's inequality via the martingale route; the naive Azuma constant would give the weaker exponent $nt^2/(2(b-a)^2)$.
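A quick Monte Carlo check of the recovered Hoeffding bound for Bernoulli(1/2) variables, so $[a, b] = [0, 1]$ (parameters are illustrative):

```python
import math
import numpy as np

rng = np.random.default_rng(4)
n, n_trials, t = 100, 100_000, 0.12

# X_i ~ Bernoulli(1/2) in [0, 1]; f is the empirical mean, mu = 0.5.
X = rng.integers(0, 2, size=(n_trials, n), dtype=np.int8)
dev = np.abs(X.mean(axis=1) - 0.5)
empirical = np.mean(dev >= t)

bound = 2 * math.exp(-2 * n * t**2)    # Hoeffding via the Doob-martingale route
print(f"empirical {empirical:.4f} <= Hoeffding bound {bound:.4f}")
```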

Common Confusions

Watch Out

Martingale is not the same as i.i.d. sum

A sum of i.i.d. mean-zero random variables is a martingale, but not all martingales are i.i.d. sums. The increments $D_k = M_k - M_{k-1}$ can depend on the past through $\mathcal{F}_{k-1}$. What is required is only that $\mathbb{E}[D_k \mid \mathcal{F}_{k-1}] = 0$ --- the conditional mean is zero. The conditional variance, distribution, and higher moments can all depend on the past.

Watch Out

Optional stopping requires conditions

The conclusion $\mathbb{E}[M_\tau] = \mathbb{E}[M_0]$ does not hold for all stopping times. The stopping time must be bounded, or more general conditions (e.g., finite expected stopping time plus conditionally bounded increments) must hold. Ignoring this leads to paradoxes (doubling strategy, St. Petersburg paradox).

Watch Out

Convergence in L1 is stronger than a.s. convergence for martingales

Doob's convergence theorem gives a.s. convergence for supermartingales. But the limit $M_\infty$ may satisfy $\mathbb{E}[M_\infty] < \mathbb{E}[M_0]$ (the expectation can "leak" to infinity). For $\mathbb{E}[M_n] \to \mathbb{E}[M_\infty]$, you need uniform integrability, which is a strictly stronger condition.

Summary

  • Martingale: $\mathbb{E}[M_{n+1} \mid \mathcal{F}_n] = M_n$ (fair game)
  • Doob martingale: $Z_i = \mathbb{E}[f(X) \mid X_1, \ldots, X_i]$ exposes variables one at a time
  • Azuma-Hoeffding: a bounded-increment martingale concentrates: $\Pr[|M_n - M_0| \geq t] \leq 2\exp(-t^2/(2\sum_k c_k^2))$
  • Freedman: variance-sensitive refinement, uses $W_n = \sum_k \mathrm{Var}(D_k \mid \mathcal{F}_{k-1})$
  • Optional stopping: $\mathbb{E}[M_\tau] = \mathbb{E}[M_0]$ for bounded stopping times
  • Doob convergence: supermartingales with $\sup_n \mathbb{E}[M_n^-] < \infty$ converge a.s.
  • McDiarmid's inequality = Doob martingale + Azuma-Hoeffding (with the sharpened range-based constant)
  • Martingales handle dependent sequences, not just i.i.d. sums

Exercises

ExerciseCore

Problem

Let $X_1, X_2, \ldots$ be i.i.d. with $\mathbb{E}[X_i] = 0$ and $|X_i| \leq 1$. Define $S_n = \sum_{i=1}^n X_i$. Show that $(S_n)$ is a martingale with respect to the natural filtration and apply Azuma-Hoeffding to bound $\Pr[S_n \geq t]$.

ExerciseAdvanced

Problem

Use the Doob martingale and Azuma-Hoeffding to prove McDiarmid's inequality: if $f$ satisfies the bounded differences condition $\sup_{x_i'} |f(x_1, \ldots, x_i, \ldots, x_n) - f(x_1, \ldots, x_i', \ldots, x_n)| \leq c_i$ for each $i$, then for independent $X_1, \ldots, X_n$:

$$\Pr[f(X) - \mathbb{E}[f(X)] \geq t] \leq \exp\!\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right)$$

ExerciseResearch

Problem

Compare Azuma-Hoeffding and Freedman for the following setting: a martingale with $n = 1000$ steps, bounded increments $|D_k| \leq 1$, but conditional variance $\mathrm{Var}(D_k \mid \mathcal{F}_{k-1}) = 0.01$ for all $k$ (almost all of the increment distribution is concentrated near zero). Compute the Azuma-Hoeffding and Freedman bounds for $\Pr[M_n - M_0 \geq 10]$ and comment on the difference.


References

Canonical:

  • Durrett, Probability: Theory and Examples (5th ed., 2019), Chapter 4
  • Williams, Probability with Martingales (1991) --- the standard introductory text

Current:

  • Wainwright, High-Dimensional Statistics (2019), Chapter 2.5
  • Freedman, "On Tail Probabilities for Martingales" (Annals of Probability, 1975)
  • de la Peña, "A General Class of Exponential Inequalities for Martingales and Ratios" (Annals of Probability, 1999)

Next Topics

From martingale theory, the natural next steps are:

  • McDiarmid's inequality: the direct application of Doob + Azuma-Hoeffding to bounded-difference functions
  • Concentration inequalities: the broader toolkit that martingales support
  • Stochastic calculus for ML: the continuous-time extension of martingale theory, with applications to diffusion models and SGD analysis

Last reviewed: April 26, 2026
