
Concentration Probability

Chernoff Bounds

The Chernoff method: the universal technique for deriving exponential tail bounds by optimizing over the moment generating function, yielding the tightest possible exponential concentration.


Why This Matters

[Interactive figure: "Chernoff Tilt." The upper-tail bound becomes tight by choosing the slope that best tilts the log-MGF: the best exponential upper-tail bound comes from the optimizer that minimizes the tilted intercept. Blue: log-MGF; amber: tilted line; green: optimizer and gap.]

The Chernoff method is not a single inequality. It is a technique, and it is the single most important proof technique in concentration inequalities. Every exponential tail bound you have seen (Hoeffding, Bernstein, sub-Gaussian bounds, Bennett) is derived by the same recipe:

  1. Exponentiate: convert $\Pr[X \geq t]$ into $\Pr[e^{sX} \geq e^{st}]$
  2. Apply Markov: bound this by $e^{-st}\,\mathbb{E}[e^{sX}]$
  3. Optimize: choose $s > 0$ to minimize the bound

This three-step recipe, the Chernoff method, produces the tightest exponential bound achievable from the moment generating function. If you master this one idea, you can derive Hoeffding, Bernstein, and sub-Gaussian bounds from scratch.

Read this page as the operational sequel to Moment Generating Functions: that page explains what the MGF is, and this page shows how the MGF becomes a tail bound after one Markov step and one optimization step.

Quick Version

  • Exponentiate. Replace $X \ge t$ by $e^{sX} \ge e^{st}$. This turns a tail event into a statement about a non-negative random variable.
  • Apply Markov. Bound the probability by $e^{-st}\,\mathbb{E}[e^{sX}]$. This converts the tail event into an MGF expression.
  • Optimize over $s$. Choose the best tilt parameter. This gives the sharpest exponential certificate available from the MGF.

If the MGF is well controlled, Chernoff gives exponential decay. If the MGF does not exist, the method has nothing to optimize and the certificate disappears.
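The three bullets above can be run numerically. A minimal sketch, assuming a known MGF and substituting a grid search for the calculus:

```python
import math

def chernoff_bound(mgf, t, s_grid):
    """Generic Chernoff recipe: minimize e^{-s t} * M_X(s) over s > 0."""
    return min(math.exp(-s * t) * mgf(s) for s in s_grid)

# Standard Gaussian: M_X(s) = e^{s^2/2}; the optimum is s* = t,
# recovering the known bound e^{-t^2/2}.
gauss_mgf = lambda s: math.exp(s * s / 2.0)
grid = [k / 1000.0 for k in range(1, 5001)]
bound = chernoff_bound(gauss_mgf, 2.0, grid)
print(bound)  # e^{-2} ≈ 0.1353
```

The same `chernoff_bound` helper works for any MGF you can evaluate; only step 2.5 (bounding the MGF) changes from inequality to inequality.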

Mental Model

Think of the Chernoff method as applying Markov's inequality to the "best possible" monotone function of $X$. Markov says $\Pr[g(X) \geq g(t)] \leq \mathbb{E}[g(X)]/g(t)$ for any non-negative increasing $g$. The exponential $g(x) = e^{sx}$ is the optimal choice because it grows the fastest (among functions whose expectation is finite), penalizing large values of $X$ most aggressively. The free parameter $s > 0$ lets you tune the exponential to the specific threshold $t$.

Why is this tighter than Chebyshev? Chebyshev uses $g(x) = x^2$, which gives polynomial tails ($1/t^2$). The exponential $g(x) = e^{sx}$ gives exponential tails ($e^{-ct}$ or $e^{-ct^2}$), which are dramatically better for large $t$.
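The gap is easy to see with numbers. An illustrative comparison for a unit-variance sub-Gaussian bound:

```python
import math

# Chebyshev decays polynomially, a sub-Gaussian Chernoff bound exponentially
# (sigma = 1 assumed for illustration).
for t in (2.0, 4.0, 8.0):
    chebyshev = 1.0 / t**2
    chernoff = math.exp(-t**2 / 2)
    print(t, chebyshev, chernoff)  # the gap widens rapidly with t
```

At $t = 8$ the polynomial bound is still around $1.6\%$ while the exponential bound is astronomically small.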

The MGF Technique Step-by-Step

The Chernoff method is the proof pattern, not a specific bound. Read each step alongside the prerequisite it depends on; the recall boxes name those prerequisites explicitly so the derivation is auditable end-to-end.

Step 1 — Exponentiate. For $s > 0$, monotonicity of $u \mapsto e^{su}$ gives

$$\Pr[X \geq t] = \Pr[e^{sX} \geq e^{st}].$$

The event has not changed; only its algebraic form has. This step is an exact equality; no inequality has been used yet.

Recall: Markov's inequality. For a nonnegative random variable $Y$ and any $a > 0$, $\Pr[Y \geq a] \leq \mathbb{E}[Y]/a$. The next step applies this to $Y = e^{sX}$, which is nonnegative because the exponential is. See Markov's inequality.

Step 2 — Apply Markov. Markov on the nonnegative random variable $e^{sX}$:

$$\Pr[e^{sX} \geq e^{st}] \leq \frac{\mathbb{E}[e^{sX}]}{e^{st}} = e^{-st}\, M_X(s).$$

The tail probability is now an MGF expression: the bound depends on the distribution of $X$ only through $M_X$.

Step 3 — Use independence to factor. When $X = \sum_{i=1}^n X_i$ with the $X_i$ independent,

$$M_X(s) = \mathbb{E}\!\left[e^{s\sum_i X_i}\right] = \prod_{i=1}^n \mathbb{E}[e^{sX_i}].$$

Each factor is a one-variable MGF that admits a per-summand bound (Hoeffding's lemma for bounded variables, Bernstein's MGF lemma when variance information is available, and so on).

Recall: $1 + x \leq e^x$, valid for every $x \in \mathbb{R}$. The Bernoulli MGF for $X_i \sim \mathrm{Bern}(p_i)$ is $\mathbb{E}[e^{sX_i}] = p_i e^s + (1 - p_i) = 1 + p_i(e^s - 1)$, which the inequality bounds by $\exp(p_i(e^s - 1))$. The same inequality, with different per-summand MGFs, drives every Chernoff-style proof on this site.
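A quick numerical check of this Recall box, with illustrative values of $p$ and $s$:

```python
import math

p, s = 0.2, 0.7  # illustrative Bernoulli parameter and tilt

mgf = p * math.exp(s) + (1 - p)          # exact Bernoulli MGF: 1 + p(e^s - 1)
bound = math.exp(p * (math.exp(s) - 1))  # after applying 1 + x <= e^x
print(mgf, bound)
assert mgf <= bound  # the bound dominates, as 1 + x <= e^x guarantees
```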

Step 4 — Optimize over $s$. The bound $e^{-st} M_X(s)$ depends on the free parameter $s$. Taking the infimum gives the Legendre transform of the log-MGF:

$$\Pr[X \geq t] \leq \inf_{s > 0} e^{-st}\, M_X(s) = \exp\!\left(-\sup_{s > 0}\bigl[st - \Lambda_X(s)\bigr]\right) = e^{-\Lambda_X^*(t)}.$$

The optimum $s^* > 0$ is characterized by $\Lambda_X'(s^*) = t$, i.e., the exponentially tilted distribution has mean exactly $t$. This is the geometric content of large-deviations theory.
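The tilted-mean characterization can be checked numerically. A sketch for a Bernoulli sum, with all parameters illustrative:

```python
import math

# S = sum of n Bernoulli(p); Lambda(s) = n*log(1 - p + p*e^s).
n, p, t = 100, 0.3, 50.0

def lam_prime(s):
    # Lambda'(s) = mean of the exponentially tilted distribution
    es = math.exp(s)
    return n * p * es / (1 - p + p * es)

# Bisection for Lambda'(s*) = t on (0, 10)
lo, hi = 0.0, 10.0
for _ in range(80):
    mid = (lo + hi) / 2
    if lam_prime(mid) < t:
        lo = mid
    else:
        hi = mid
s_star = (lo + hi) / 2

# Closed form for this model: e^{s*} = q(1-p)/(p(1-q)) with q = t/n
q = t / n
closed = math.log(q * (1 - p) / (p * (1 - q)))
print(s_star, closed)  # the two agree
```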

Formal Setup and Notation

Let $X$ be a real-valued random variable with moment generating function (MGF) $M_X(s) = \mathbb{E}[e^{sX}]$.

Definition

Moment Generating Function

The moment generating function of $X$ is $M_X(s) = \mathbb{E}[e^{sX}]$ for those $s \in \mathbb{R}$ where the expectation is finite. The MGF "generates" moments: $\mathbb{E}[X^k] = M_X^{(k)}(0)$. Existence of the MGF in a neighborhood of zero is equivalent to the distribution having exponentially decaying tails.

Definition

Log-Moment Generating Function (Cumulant Generating Function)

The log-MGF or cumulant generating function is $\Lambda_X(s) = \log \mathbb{E}[e^{sX}]$. It is convex in $s$. In this notation the Chernoff bound reads $\Pr[X \geq t] \leq \exp(\inf_{s > 0}[\Lambda_X(s) - st])$. The Legendre transform $\Lambda_X^*(t) = \sup_{s > 0}[st - \Lambda_X(s)]$ is the rate function of large deviations theory.
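The Legendre transform can be computed by brute force and compared against a known closed form: for a Bernoulli variable the rate function is the KL divergence $\mathrm{KL}(q \,\|\, p)$. Parameters here are illustrative:

```python
import math

p = 0.3  # Bernoulli parameter

def lam(s):  # log-MGF of Bern(p)
    return math.log(1 - p + p * math.exp(s))

def lam_star(q, s_grid):  # numeric Legendre transform: sup_s [s*q - Lambda(s)]
    return max(s * q - lam(s) for s in s_grid)

def kl(q, p):  # closed form: the rate function of Bern(p)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

grid = [k / 100.0 for k in range(1, 1000)]
print(lam_star(0.5, grid), kl(0.5, p))  # numerically equal
```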

Core Method

Theorem

The Chernoff Method (General Upper Tail)

Statement

For any random variable $X$ whose MGF $M_X(s) = \mathbb{E}[e^{sX}]$ exists for some $s > 0$, and any $t \in \mathbb{R}$:

$$\Pr[X \geq t] \leq \inf_{s > 0} e^{-st} M_X(s) = \inf_{s > 0} \exp\!\bigl(\Lambda_X(s) - st\bigr) = \exp\!\bigl(-\Lambda_X^*(t)\bigr)$$

where $\Lambda_X^*(t) = \sup_{s > 0}[st - \Lambda_X(s)]$ is the Fenchel-Legendre dual of the log-MGF.

Intuition

The Chernoff method converts a tail probability into an optimization problem. For each $s > 0$, the bound $e^{-st}M_X(s)$ is valid. Different values of $s$ give different bounds, and you pick the best one. The optimal $s^*$ satisfies $\Lambda_X'(s^*) = t$: the tilted distribution has mean exactly $t$.

Geometrically: $\Lambda_X^*(t)$, the Legendre transform of $\Lambda_X(s)$, measures how "surprising" the event $\{X \geq t\}$ is relative to the distribution of $X$. Larger $\Lambda_X^*(t)$ means more surprising, hence less probable.

Proof Sketch

For any $s > 0$:

$$\Pr[X \geq t] = \Pr[e^{sX} \geq e^{st}] \leq \frac{\mathbb{E}[e^{sX}]}{e^{st}} = e^{-st} M_X(s)$$

The first equality holds because $x \mapsto e^{sx}$ is strictly increasing. The inequality is Markov applied to the non-negative random variable $e^{sX}$. Since this holds for all $s > 0$, take the infimum.

Why It Matters

The Chernoff method is a meta-theorem: it transforms the problem of bounding tails into the problem of bounding the MGF, which is often easier.

Different MGF bounds lead to different named inequalities:

  • If $M_X(s) \leq e^{s^2\sigma^2/2}$: sub-Gaussian bound (Hoeffding-type)
  • If $M_X(s) \leq e^{s^2\sigma^2/(2(1-sM/3))}$: Bernstein-type bound
  • If $M_X(s) = (1-p+pe^s)^n$: multiplicative Chernoff for Binomials
  • If $M_X(s) = (1-s)^{-1}$ (the Exp(1) MGF, valid for $s < 1$): exponential tail bound

Every exponential concentration inequality is a Chernoff bound with a specific MGF estimate plugged in.

Failure Mode

The Chernoff method requires the MGF to exist in a neighborhood of $s = 0$. For heavy-tailed distributions (Pareto, Cauchy), the MGF is infinite for all $s > 0$, and the Chernoff method gives nothing. For these, you need moment-based methods (Chebyshev for $1/t^2$ decay, or higher-moment bounds for $1/t^p$).

Multiplicative Chernoff Bounds

The most important special case: sums of independent Bernoulli (or more generally, $[0,1]$-valued) random variables. The "multiplicative" form expresses deviations as a fraction of the mean, which gives tighter bounds than additive Hoeffding when the mean is small.

Theorem

Multiplicative Chernoff Bound

Statement

Let $X_1, \ldots, X_n$ be independent random variables with $X_i \in [0, 1]$. Let $S = \sum_{i=1}^n X_i$ and $\mu = \mathbb{E}[S]$. Then:

Upper tail: For $\delta > 0$: $\Pr[S \geq (1+\delta)\mu] \leq \left(\frac{e^\delta}{(1+\delta)^{(1+\delta)}}\right)^\mu$

Simplified upper tail: For $\delta \in (0, 1]$: $\Pr[S \geq (1+\delta)\mu] \leq \exp\!\left(-\frac{\mu\delta^2}{3}\right)$

Lower tail: For $\delta \in (0, 1)$: $\Pr[S \leq (1-\delta)\mu] \leq \exp\!\left(-\frac{\mu\delta^2}{2}\right)$

Intuition

The multiplicative form measures deviations relative to the mean. If $\mu = 10$, a deviation to $S = 15$ is a 50% overshoot ($\delta = 0.5$). The bound $e^{-\mu\delta^2/3}$ depends on $\mu\delta^2$: the product of the mean and the squared relative deviation.

Why is this better than Hoeffding for small means? Hoeffding gives $\Pr[S \geq \mu + t] \leq e^{-2t^2/n}$. With $t = \delta\mu$ and small $\mu$, Hoeffding gives $e^{-2\delta^2\mu^2/n}$ while multiplicative Chernoff gives $e^{-\delta^2\mu/3}$. When $\mu \ll n$ (rare events), Chernoff is dramatically tighter.
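Plugging in illustrative numbers makes the regime difference concrete:

```python
import math

# Rare-event setting (illustrative): n trials, small success probability p,
# relative deviation delta.
n, p, delta = 1000, 0.01, 0.5
mu = n * p  # expected count = 10

chernoff = math.exp(-mu * delta**2 / 3)  # exp(-mu*delta^2/3)
t = delta * mu
hoeffding = math.exp(-2 * t**2 / n)      # exp(-2t^2/n), with t = delta*mu

print(chernoff, hoeffding)  # Chernoff is far smaller when mu << n
```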

Proof Sketch

Step 1: Apply the Chernoff method to $S$ with parameter $s > 0$: $\Pr[S \geq (1+\delta)\mu] \leq e^{-s(1+\delta)\mu} \prod_{i=1}^n \mathbb{E}[e^{sX_i}]$.

Step 2: Bound the MGF of each $X_i$: since $X_i \in [0,1]$, $\mathbb{E}[e^{sX_i}] \leq 1 + \mathbb{E}[X_i](e^s - 1) \leq \exp(\mathbb{E}[X_i](e^s - 1))$ using $1 + x \leq e^x$.

Step 3: Multiply: $\prod_i \mathbb{E}[e^{sX_i}] \leq \exp(\mu(e^s - 1))$.

Step 4: Total bound: $\exp(\mu(e^s - 1) - s(1+\delta)\mu)$.

Step 5: Optimize: set $s = \ln(1+\delta)$ to get $\exp(\mu(\delta - (1+\delta)\ln(1+\delta)))$. The simplified form follows from the inequality $(1+\delta)\ln(1+\delta) \geq \delta + \delta^2/3$ for $\delta \in (0, 1]$.
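Steps 4 and 5 can be verified numerically (with illustrative $\mu$ and $\delta$):

```python
import math

mu, delta = 10.0, 0.5  # illustrative values

def objective(s):  # the Step 4 bound as a function of s
    return math.exp(mu * (math.exp(s) - 1) - s * (1 + delta) * mu)

s_star = math.log(1 + delta)  # Step 5's choice of s
exact_form = (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu

print(objective(s_star), exact_form)  # equal, as Step 5 claims
# s_star is a minimizer: nearby s values do no better
assert all(objective(s_star) <= objective(s_star + eps) for eps in (-0.1, 0.1))
```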

Why It Matters

The multiplicative Chernoff bound is the standard tool for:

  • Randomized algorithms: bounding the probability that a random hash function has too many collisions
  • Sampling: how many samples to estimate a proportion within relative error $\delta$
  • Network analysis: bounding node degrees in random graphs
  • Binomial concentration: any setting where you sum independent 0/1 indicators and the expected sum may be small

The multiplicative form is especially important when $\mu = o(n)$: for rare events, Hoeffding wastes a factor of $n/\mu$ in the exponent.

Failure Mode

The simplified form $e^{-\mu\delta^2/3}$ is valid only for $\delta \in (0,1]$ (at most doubling the mean). For $\delta > 1$, you must use the exact form or the weaker bound $\Pr[S \geq t] \leq e^{-\mu} (e\mu/t)^t$ for $t \geq 2e\mu$. Also, the bound requires independence; for dependent indicators, you need Azuma-Hoeffding or other martingale methods.

Bernstein Inequality

Hoeffding uses only the support $[a_i, b_i]$ of each summand. When the variance $\sigma_i^2 = \operatorname{Var}(X_i)$ is much smaller than $(b_i - a_i)^2$, Hoeffding is loose. Bernstein's inequality keeps the variance information and produces a tighter bound in exactly this regime.

Theorem

Bernstein's Inequality

Statement

Let $X_1, \ldots, X_n$ be independent random variables with $\mathbb{E}[X_i] = 0$, $|X_i| \leq M$ almost surely, and $\operatorname{Var}(X_i) = \sigma_i^2$. Let $v = \sum_{i=1}^n \sigma_i^2$. Then for all $t \geq 0$:

$$\Pr\!\left[\sum_{i=1}^n X_i \geq t\right] \leq \exp\!\left(-\frac{t^2}{2v + \tfrac{2}{3}Mt}\right)$$

Intuition

The bound interpolates between two regimes via the denominator $2v + \tfrac{2}{3}Mt$:

  • Small $t$ (variance regime): when $Mt \ll v$, the denominator is approximately $2v$ and the bound reduces to $\exp(-t^2/(2v))$, the Gaussian tail controlled by the total variance. This matches what the CLT predicts for sums whose individual summands are small compared to the aggregate spread.
  • Large $t$ (Poisson regime): when $Mt \gg v$, the denominator is dominated by $\tfrac{2}{3}Mt$ and the bound becomes $\exp(-3t/(2M))$, an exponential tail linear in $t$. This is the regime where a single bounded summand can contribute most of the deviation.

Bernstein strictly improves on Hoeffding when $v \ll nM^2$, which is the common case for rare-event indicators, heavily imbalanced Bernoullis, or any setting where most summands have tiny variance.
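A numeric comparison in the rare-event regime (parameters illustrative):

```python
import math

# n centered rare-event indicators: X_i = 1{event} - p, so |X_i| <= 1
# and Var(X_i) = p(1-p). Total variance v is much smaller than n*M^2.
n, p, M = 1000, 0.01, 1.0
v = n * p * (1 - p)  # = 9.9
t = 15.0

bernstein = math.exp(-t**2 / (2 * v + (2 / 3) * M * t))
hoeffding = math.exp(-2 * t**2 / n)  # each X_i has range 1

print(bernstein, hoeffding)  # Bernstein exploits the tiny variance
```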

Proof Sketch

Apply the Chernoff method to $S = \sum_i X_i$. Control each MGF by the Bernstein condition: for $|s| < 3/M$,

$$\mathbb{E}[e^{sX_i}] \leq \exp\!\left(\frac{s^2 \sigma_i^2/2}{1 - s M/3}\right)$$

Multiply across $i$ and take logs: $\Lambda_S(s) \leq s^2 v / (2(1 - sM/3))$. Optimize $\inf_{s > 0}(\Lambda_S(s) - st)$ with the substitution $s = t/(v + Mt/3)$ to obtain the stated bound.
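The final substitution is pure algebra and can be checked numerically (illustrative $v$, $M$, $t$):

```python
import math

v, M, t = 9.9, 1.0, 15.0  # illustrative values
s = t / (v + M * t / 3)   # the substitution from the proof sketch

# Exponent of the Chernoff objective under the MGF bound...
exponent = s**2 * v / (2 * (1 - s * M / 3)) - s * t
# ...matches the exponent of the stated Bernstein bound.
stated = -t**2 / (2 * v + (2 / 3) * M * t)
print(exponent, stated)
```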

Why It Matters

Bernstein is the right tool whenever you know both a uniform bound $M$ and a variance proxy. It is the backbone of empirical process bounds, matrix Bernstein, and sample-complexity arguments in statistical learning where the effective variance can be much smaller than the worst-case range.

Failure Mode

Requires an almost-sure bound $|X_i| \leq M$. For sub-exponential tails without an almost-sure bound, use the Bernstein-type inequality for sub-exponential variables (which replaces $M$ by the Orlicz parameter). Also, the variance must be finite and correctly estimated; plugging in a loose upper bound on $\sigma_i^2$ weakens the inequality.

Azuma-Hoeffding (Martingales)

Hoeffding requires independence. Many natural sums are not independent but do form a martingale with bounded differences. Azuma-Hoeffding gives a Chernoff-style bound in exactly this setting.

Theorem

Azuma-Hoeffding Inequality

Statement

Let $\{X_k\}_{k=0}^n$ be a martingale (equivalently, let the increments $X_k - X_{k-1}$ form a martingale difference sequence) with $|X_k - X_{k-1}| \leq c_k$ almost surely for each $k$. Then for all $t \geq 0$:

$$\Pr\bigl[\,|X_n - X_0| \geq t\,\bigr] \leq 2 \exp\!\left(-\frac{t^2}{2 \sum_{k=1}^n c_k^2}\right)$$

Intuition

Define $D_k = X_k - X_{k-1}$, so $\mathbb{E}[D_k \mid \mathcal{F}_{k-1}] = 0$ and $|D_k| \leq c_k$. Conditional on the past, $D_k$ behaves like a bounded zero-mean variable, so Hoeffding's lemma applied conditionally gives $\mathbb{E}[e^{sD_k} \mid \mathcal{F}_{k-1}] \leq e^{s^2 c_k^2/2}$. Tower over $k$ to multiply these bounds, then run the Chernoff method on $X_n - X_0 = \sum_k D_k$. The exponent $t^2/(2\sum_k c_k^2)$ is identical to Hoeffding's; only the assumption is weakened from independence to the martingale property.

Proof Sketch

For $s > 0$, use the tower property:

$$\mathbb{E}[e^{s(X_n - X_0)}] = \mathbb{E}\!\left[\mathbb{E}[e^{sD_n} \mid \mathcal{F}_{n-1}] \cdot e^{s(X_{n-1} - X_0)}\right] \leq e^{s^2 c_n^2/2}\, \mathbb{E}[e^{s(X_{n-1} - X_0)}]$$

Iterate to get $\mathbb{E}[e^{s(X_n - X_0)}] \leq \exp(s^2 \sum_k c_k^2 / 2)$. Apply the Chernoff method and optimize with $s = t / \sum_k c_k^2$. A symmetric argument with $s < 0$ handles the lower tail; the factor of 2 comes from the union bound.
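A Monte Carlo sanity check on the simplest case, a $\pm 1$ random walk (a martingale with every $c_k = 1$); the numbers are illustrative:

```python
import math
import random

random.seed(0)
n, t, trials = 100, 25.0, 20000

hits = 0
for _ in range(trials):
    walk = sum(random.choice((-1, 1)) for _ in range(n))  # X_n - X_0
    if abs(walk) >= t:
        hits += 1

empirical = hits / trials
azuma = 2 * math.exp(-t**2 / (2 * n))  # c_k = 1 for every step
print(empirical, azuma)  # the empirical frequency sits below the bound
```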

Why It Matters

Azuma-Hoeffding generalizes Hoeffding beyond independence to any martingale with bounded differences. Typical applications:

  • McDiarmid's inequality (bounded differences for functions of independent variables) is a direct corollary via the Doob martingale.
  • Stochastic optimization: convergence rates for online algorithms where iterates depend on all past samples.
  • Graph and combinatorial concentration: chromatic number, longest increasing subsequence, edge exposure in random graphs.

Failure Mode

Requires an almost-sure bound on each increment. If the differences are only bounded in $L^2$ or in probability, Azuma-Hoeffding does not apply; use Freedman's inequality (a martingale Bernstein) or sub-Gaussian martingale concentration. The bound is also loose when the conditional variances $\operatorname{Var}(D_k \mid \mathcal{F}_{k-1})$ are much smaller than $c_k^2$; in that case, again prefer Freedman or a Bernstein-type martingale bound.

Why Chernoff Gives the Tightest Exponential Bounds

The Chernoff bound $\Pr[X \geq t] \leq e^{-\Lambda^*(t)}$ is the best possible exponential bound derivable from the MGF. More precisely:

Claim. For any random variable $X$, among all bounds of the form $\Pr[X \geq t] \leq e^{-r(t)}$ that hold for every distribution with the same MGF, the Chernoff bound with $r(t) = \Lambda_X^*(t)$ is optimal.

Why. Each $s > 0$ certifies the rate $st - \Lambda_X(s)$ via the Markov step, and the MGF alone certifies no rate beyond the best of these. Taking the supremum over $s$ gives exactly $\Lambda_X^*(t) = \sup_{s}[st - \Lambda_X(s)]$.

Connection to large deviations. For i.i.d. sums $S_n = X_1 + \cdots + X_n$, the Chernoff bound gives $\Pr[S_n/n \geq t] \leq e^{-n\Lambda^*(t)}$. The remarkable fact (Cramér's theorem) is that this is asymptotically exact:

$$\lim_{n \to \infty} \frac{1}{n}\log \Pr[S_n/n \geq t] = -\Lambda^*(t)$$

So the Chernoff method not only gives an upper bound; it finds the correct exponential rate. This is the starting point of large deviations theory.
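Cramér's theorem can be watched converging on a fair coin, where $\Lambda^*(q) = \mathrm{KL}(q \,\|\, 1/2)$; the tail probabilities below are exact binomial sums (choice of $q$ illustrative):

```python
import math

def log_tail(n, q):  # log P[Binomial(n, 1/2) >= q*n], computed exactly
    k0 = math.ceil(q * n)
    total = sum(math.comb(n, k) for k in range(k0, n + 1))
    return math.log(total) - n * math.log(2)

def kl(q, p=0.5):  # the rate function for coin flips
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

q = 0.7
for n in (100, 400, 1600):
    print(n, -log_tail(n, q) / n, kl(q))  # empirical rate approaches KL
```

The empirical rate always sits above $\mathrm{KL}(q \,\|\, 1/2)$, as the Chernoff upper bound requires, and closes the gap as $n$ grows.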

Proof Ideas and Templates Used

The Chernoff method is a proof template consisting of three steps:

  1. Exponentiate: introduce the free parameter $s$
  2. Apply Markov: convert the tail probability to an MGF expression
  3. Optimize: choose $s$ to minimize the bound

This template is universal. The art lies in step 2.5: bounding the MGF. Different MGF bounds produce different named inequalities:

  • Hoeffding's lemma for $X \in [a,b]$ gives Hoeffding's inequality with Gaussian-style decay $e^{-ct^2}$.
  • Bernstein's condition with variance and a uniform bound gives Bernstein's inequality with tail shape $e^{-ct^2/(v+Mt)}$.
  • A sub-Gaussian MGF bound gives a sub-Gaussian tail $e^{-ct^2}$.
  • The exact Binomial MGF gives the multiplicative Chernoff bound with tail shape $e^{-c\mu\delta^2}$.

Canonical Examples

Example

Chernoff for a standard Gaussian

Let $X \sim \mathcal{N}(0, 1)$. The MGF is $M_X(s) = e^{s^2/2}$. The Chernoff bound gives:

$$\Pr[X \geq t] \leq \inf_{s > 0} e^{-st + s^2/2}$$

Optimize: $\frac{d}{ds}(-st + s^2/2) = 0$ gives $s^* = t$. So $\Pr[X \geq t] \leq e^{-t^2/2}$.

This is the standard Gaussian tail bound. The exact Gaussian tail is $\Pr[X \geq t] = \frac{1}{\sqrt{2\pi}} \int_t^\infty e^{-x^2/2}\,dx \approx \frac{e^{-t^2/2}}{t\sqrt{2\pi}}$, so Chernoff is tight up to the polynomial factor $1/(t\sqrt{2\pi})$.
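The comparison in the last paragraph, computed with the standard library's `erfc` for the exact tail:

```python
import math

for t in (1.0, 2.0, 3.0):
    chernoff = math.exp(-t**2 / 2)             # Chernoff bound
    exact = 0.5 * math.erfc(t / math.sqrt(2))  # exact Gaussian tail
    print(t, chernoff, exact, exact / chernoff)  # ratio ~ 1/(t*sqrt(2*pi))
```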

Example

Estimating a rare event probability

Flip a coin with heads probability $p = 0.01$ a total of $n = 1000$ times. Let $S$ be the number of heads, so $\mu = \mathbb{E}[S] = 10$.

What is $\Pr[S \geq 20]$? This is $\Pr[S \geq 2\mu]$, i.e., $\delta = 1$.

Multiplicative Chernoff: $\Pr[S \geq 20] \leq e^{-\mu/3} = e^{-10/3} \approx 0.036$.

Hoeffding: $\Pr[S \geq 20] = \Pr[\bar{X} - p \geq 0.01] \leq e^{-2n(0.01)^2} = e^{-0.2} \approx 0.82$.

Chernoff gives $3.6\%$ vs. Hoeffding's $82\%$. The multiplicative Chernoff is dramatically tighter because it uses the fact that $\mu$ is small relative to $n$.
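The same example, checked against the exact binomial tail:

```python
import math

n, p = 1000, 0.01
mu = n * p   # 10
delta = 1.0  # the event S >= 20 is S >= (1+delta)*mu

chernoff = math.exp(-mu * delta**2 / 3)       # ≈ 0.036
hoeffding = math.exp(-2 * n * (mu / n) ** 2)  # ≈ 0.82

# Exact binomial tail P[S >= 20] for reference
exact = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
            for k in range(20, n + 1))
print(exact, chernoff, hoeffding)  # exact < Chernoff < Hoeffding
```

Both inequalities are valid; Chernoff is simply the far better certificate here.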

Common Confusions

Watch Out

Chernoff is a method, not a single inequality

When people say "Chernoff bound," they sometimes mean the general method (optimize over $s$ in the MGF bound) and sometimes the specific multiplicative bound for sums of independent $[0,1]$-valued variables. Be aware of which one is meant. Hoeffding's inequality is a Chernoff bound, derived by plugging Hoeffding's lemma into the Chernoff method. The method is more general than any single application.

Watch Out

The Chernoff bound is one-sided

The basic Chernoff method with $s > 0$ bounds the upper tail $\Pr[X \geq t]$. For the lower tail $\Pr[X \leq t]$, use $s < 0$ (equivalently, apply the upper-tail bound to $-X$). For two-sided bounds, combine the two one-sided bounds via a union bound, picking up a factor of 2.

Watch Out

MGF must exist for Chernoff to work

The Chernoff method requires $M_X(s) < \infty$ for some $s > 0$. This fails for heavy-tailed distributions: the Cauchy distribution has no finite moments at all, and the exponential distribution has $M_X(s) = \infty$ for $s \geq 1$. For exponential random variables, you can still apply Chernoff, but only with $s < 1$, which limits the exponential decay rate you can certify.

Connection to Large Deviations: Sanov's Theorem Preview

The Chernoff bound for i.i.d. sums is the starting point of large deviations theory. Cramér's theorem says $\Pr[S_n/n \geq t] \approx e^{-n\Lambda^*(t)}$, and this extends to a much more general setting.

Sanov's theorem generalizes this to empirical distributions: if $\hat{P}_n$ is the empirical distribution of $n$ i.i.d. draws from $P$, then the probability that $\hat{P}_n$ lands in a set $\mathcal{E}$ of distributions decays as $e^{-n \inf_{Q \in \mathcal{E}} D_{\mathrm{KL}}(Q \| P)}$, where $D_{\mathrm{KL}}$ is the KL divergence.

This connects Chernoff bounds to information theory: the "rate" at which empirical observations deviate from the truth is measured by KL divergence.

Summary

  • The Chernoff method: $\Pr[X \geq t] \leq e^{-st} M_X(s)$; optimize over $s > 0$
  • This gives the tightest exponential bound derivable from the MGF
  • Multiplicative Chernoff for Binomials: $\Pr[S \geq (1+\delta)\mu] \leq e^{-\mu\delta^2/3}$ for $\delta \in (0,1]$
  • The multiplicative form is tighter than Hoeffding when $\mu \ll n$
  • Every exponential concentration inequality (Hoeffding, Bernstein, etc.) is a Chernoff bound with a specific MGF estimate
  • Bernstein interpolates: variance-driven Gaussian tail for small $t$, linear-in-$t$ exponential tail when $Mt$ dominates the variance
  • Azuma-Hoeffding extends Hoeffding from independent sums to martingales with bounded differences $|X_k - X_{k-1}| \leq c_k$
  • The optimal rate $\Lambda^*(t)$ equals the large deviations rate function (Cramér's theorem)
  • Requires the MGF to exist; fails for heavy-tailed distributions

Exercises

Exercise (Core)

Problem

Let $X \sim \mathrm{Exp}(1)$ (exponential with rate 1). Use the Chernoff method to derive a tail bound for $\Pr[X \geq t]$ for $t > 1$.

Exercise (Core)

Problem

A randomized algorithm succeeds with probability $p = 0.7$ on each independent trial. You run it $n = 100$ times and take a majority vote. Use the multiplicative Chernoff bound to bound the probability that fewer than 50 trials succeed (i.e., the majority vote fails).

Exercise (Advanced)

Problem

Derive Hoeffding's inequality from the Chernoff method. Specifically, let $X_1, \ldots, X_n$ be independent with $X_i \in [a_i, b_i]$. Starting from the Chernoff bound, use Hoeffding's lemma ($\mathbb{E}[e^{s(X_i - \mu_i)}] \leq e^{s^2(b_i - a_i)^2/8}$) to derive:

$$\Pr[\bar{X}_n - \mu \geq t] \leq \exp\!\left(-\frac{2n^2t^2}{\sum_i(b_i-a_i)^2}\right)$$


References

Canonical:

  • Mitzenmacher & Upfal, Probability and Computing (2017), Chapter 4
  • Boucheron, Lugosi, Massart, Concentration Inequalities (2013), Chapters 2-3 (Chernoff and Bernstein), Chapter 6 (martingale methods)
  • Vershynin, High-Dimensional Probability (2018), Chapter 2

Current:

  • Motwani & Raghavan, Randomized Algorithms (1995), Chapter 4
  • Wainwright, High-Dimensional Statistics (2019), Chapter 2 (sub-Gaussian, Bernstein, Azuma)
  • van Handel, Probability in High Dimension (2016), Chapters 1-3


Last reviewed: April 23, 2026
