
Concentration Probability

Sub-Gaussian Random Variables

Sub-Gaussian random variables: the precise characterization of 'light-tailed' behavior that underpins exponential-tail concentration inequalities and the Gaussian-rate generalization bounds in learning theory.


Why This Matters

Concentration inequalities are the engine of learning theory. The simplest concentration bounds --- Markov and Chebyshev --- require only finite first or second moments and give polynomial tail decay. The strongest non-asymptotic bounds --- Hoeffding, Chernoff, Bernstein --- require exponential-moment conditions and give exponential tail decay. Sub-Gaussian random variables are the precise class for which exponential-moment conditions are as tight as if the variable were Gaussian, even when it is not. Most Hoeffding-type generalization bounds in VC and Rademacher theory are sub-Gaussian bounds in disguise; the familiar \sqrt{\log(1/\delta)/n} rates come from this class.

If you understand sub-Gaussianity, you understand why exponential-tail concentration works. Heavier tails (sub-exponential, polynomial) need their own machinery.

Tail Class Board

Why the exponent shape matters

Move the threshold and watch three tail assumptions separate. The concentration story is not just small probability; it is whether the exponent is quadratic, linear, or only polynomial.

[Interactive plot: tail probability on a log scale versus the deviation threshold t, for quadratic, linear, and polynomial tail assumptions.]

Larger thresholds expose the difference between quadratic, linear, and polynomial decay. A bigger scale parameter makes the linear tail regime arrive earlier and decay more slowly. For the sub-Gaussian assumption the exponent is quadratic: large deviations disappear at Gaussian speed.

Formula shape

The exponent shape is what determines whether sample complexity pays a log, square-root-log, or worse penalty.

ML translation

Sub-Gaussian assumptions power clean Hoeffding-style bounds. Squares, products, and chi-squared terms usually require sub-exponential Bernstein bounds instead.

The sub-Gaussian class includes Gaussian tails: a Gaussian variable Z \sim \mathcal{N}(0,1) is itself sub-Gaussian with parameter \sigma = 1, and a standard Gaussian saturates the sub-Gaussian tail bound. Whenever a "sub-Gaussian bound" curve appears distinct from a "Gaussian: \exp(-t^2/2)" curve in tail plots, read the sub-Gaussian curve as \exp(-ct^2) with an explicit generic constant c < 1/2 corresponding to some \sigma > 1.

Mental Model

A sub-Gaussian random variable has tails that decay at least as fast as those of a Gaussian. Concretely, the probability of being far from the mean drops like e^{-ct^2}, not merely e^{-ct} (sub-exponential) or 1/t^p (polynomial). This exponential-squared decay is what gives you the \sqrt{\log(1/\delta)/n} rates in learning theory.

The key insight: many non-Gaussian distributions - bounded random variables, Rademacher random variables, and projections of high-dimensional uniform distributions - all satisfy this same tail behavior. Sub-Gaussianity captures exactly what these distributions have in common.

Formal Setup and Notation

Let X be a real-valued random variable with \mathbb{E}[X] = 0 (we center without loss of generality).

Definition

Sub-Gaussian Random Variable (MGF characterization)

A centered random variable X is sub-Gaussian with parameter \sigma if and only if for all \lambda \in \mathbb{R}:

\mathbb{E}[e^{\lambda X}] \leq e^{\sigma^2 \lambda^2 / 2}

The smallest such \sigma is called the sub-Gaussian parameter (or sub-Gaussian proxy variance) of X.

This is the workhorse definition. It says the moment generating function of X is dominated by that of a \mathcal{N}(0, \sigma^2) random variable.

Definition

Sub-Gaussian Norm (Orlicz norm)

The sub-Gaussian norm (or \psi_2-norm) of X is:

\|X\|_{\psi_2} = \inf\{t > 0 : \mathbb{E}[e^{X^2/t^2}] \leq 2\}

A random variable is sub-Gaussian if and only if \|X\|_{\psi_2} < \infty. The sub-Gaussian parameter \sigma and the \psi_2-norm are related by \sigma \asymp \|X\|_{\psi_2} (up to universal constants).

Definition

Tail Characterization

An equivalent characterization: X is sub-Gaussian with parameter \sigma if and only if for all t > 0:

\mathbb{P}(|X| \geq t) \leq 2 \exp\!\bigl(-\tfrac{t^2}{2\sigma^2}\bigr)

This is the form you will use most often in practice.

Equivalent Characterizations

The following conditions are equivalent (up to universal constants in the parameters):

  1. MGF condition: \mathbb{E}[e^{\lambda X}] \leq e^{C_1^2 \lambda^2 / 2} for all \lambda
  2. Tail condition: \mathbb{P}(|X| \geq t) \leq 2e^{-t^2/(2C_2^2)} for all t > 0
  3. Moment condition: (\mathbb{E}[|X|^p])^{1/p} \leq C \|X\|_{\psi_2} \sqrt{p} for all p \geq 1
  4. Orlicz norm: \|X\|_{\psi_2} < \infty

The constants across these four statements differ by at most universal multiplicative factors. The equivalence is what makes the sub-Gaussian class robust: you can verify membership via whichever characterization is most convenient.

Proof sketches for the harder directions

Tail \Rightarrow MGF (layer cake). Suppose \mathbb{P}(|X| \geq t) \leq 2e^{-t^2/(2K^2)}. For \lambda > 0, by the layer-cake formula applied to the non-negative random variable e^{\lambda X} \mathbf{1}\{X \geq 0\}:

\mathbb{E}[e^{\lambda X} \mathbf{1}\{X \geq 0\}] \leq 1 + \int_0^\infty \lambda e^{\lambda t}\,\mathbb{P}(X \geq t)\,dt \leq 1 + 2\int_0^\infty \lambda e^{\lambda t - t^2/(2K^2)}\,dt.

Complete the square in the exponent: \lambda t - t^2/(2K^2) = K^2\lambda^2/2 - (t - K^2\lambda)^2/(2K^2). The Gaussian integral yields \mathbb{E}[e^{\lambda X}] \leq e^{cK^2\lambda^2} for a universal c. The same argument applied to -X handles \lambda < 0.

Moment \Rightarrow MGF (Taylor expansion). Suppose (\mathbb{E}[|X|^p])^{1/p} \leq K\sqrt{p} for all p \geq 1. Expand the MGF term by term:

\mathbb{E}[e^{\lambda X}] = 1 + \sum_{k=1}^\infty \frac{\lambda^k \mathbb{E}[X^k]}{k!} \leq 1 + \sum_{k=1}^\infty \frac{|\lambda|^k K^k k^{k/2}}{k!}.

Stirling's bound k! \geq (k/e)^k gives k^{k/2}/k! \leq (e/\sqrt{k})^k. For |\lambda| \leq c/K the series converges to something bounded by e^{cK^2\lambda^2}. Extend to all \lambda via a separate bound in the large-\lambda regime using the tail equivalence.

Interactive: three views of one definition

Same distribution, three separate tests. Switch the tab to compare the tail, the log-MGF, and the \psi_2 expectation side by side. Gaussian, Rademacher, and Uniform[-1,1] pass all three. Laplace and the centered Exponential have only exponential (not Gaussian) tails, so they pass none of the three sub-Gaussian conditions globally. The interactive panel inspects each condition on a finite range, so a distribution may appear to satisfy the tail check inside the visible window while still violating the MGF and \psi_2 conditions at large enough \lambda or p; that finite-window artifact is what the tab labels are showing. Watching a distribution fail one panel while another holds makes the equivalence concrete: each characterization picks up the same sub-Gaussian tail behavior from a different direction, and any genuinely sub-Gaussian variable passes or fails them together.

Main Theorems

Theorem

Sub-Gaussian MGF Implies Tail Bound

Statement

If X is a centered random variable satisfying \mathbb{E}[e^{\lambda X}] \leq e^{\sigma^2 \lambda^2/2} for all \lambda \in \mathbb{R}, then for all t > 0:

\mathbb{P}(X \geq t) \leq \exp\!\bigl(-\tfrac{t^2}{2\sigma^2}\bigr)


Intuition

The MGF condition is a generating condition. It controls all moments simultaneously. The tail bound is a consequence obtained by the Chernoff method: exponentiate, apply Markov, and optimize over \lambda. The quadratic exponent in the MGF translates directly to a quadratic exponent in the tail.

Proof Sketch

For any \lambda > 0, by Markov's inequality:

\mathbb{P}(X \geq t) = \mathbb{P}(e^{\lambda X} \geq e^{\lambda t}) \leq e^{-\lambda t}\,\mathbb{E}[e^{\lambda X}] \leq e^{-\lambda t + \sigma^2 \lambda^2/2}

Minimize over \lambda: set \lambda^* = t/\sigma^2 to get \mathbb{P}(X \geq t) \leq e^{-t^2/(2\sigma^2)}.
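The optimization step can be checked numerically. The following Python sketch (ours, not part of the original derivation) minimizes the Chernoff exponent over a grid of \lambda and confirms that the grid optimum matches the closed form e^{-t^2/(2\sigma^2)}:

```python
import math

def chernoff_bound(t, sigma, num=200000):
    """Minimize exp(-lam*t + sigma^2*lam^2/2) over a grid of lam in (0, 10]."""
    best = float("inf")
    for i in range(1, num + 1):
        lam = 10.0 * i / num
        best = min(best, math.exp(-lam * t + 0.5 * (sigma * lam) ** 2))
    return best

t, sigma = 2.0, 1.0
grid_opt = chernoff_bound(t, sigma)
# Closed form from lam* = t / sigma^2, which lies on the grid here.
closed_form = math.exp(-t ** 2 / (2 * sigma ** 2))
print(grid_opt, closed_form)
```

With t = 2 and \sigma = 1 the optimizer \lambda^* = 2 lies exactly on the grid, so the two numbers agree to machine precision.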

Why It Matters

This is the Chernoff method: the single most important proof technique in concentration. Every time you see an exponential tail bound, this is the engine underneath. Mastering this one argument gives you access to all of sub-Gaussian concentration theory.

Failure Mode

The bound is only tight up to constants. For a true Gaussian, the bound is off by a polynomial factor in t (the exact Gaussian tail has a 1/t prefactor). For non-asymptotic purposes this rarely matters.

Lemma

Hoeffding's Lemma

Statement

If X is a centered random variable with X \in [a, b] almost surely, then for all \lambda \in \mathbb{R}:

\mathbb{E}[e^{\lambda X}] \leq \exp\!\bigl(\tfrac{\lambda^2(b-a)^2}{8}\bigr)

That is, X is sub-Gaussian with parameter \sigma = (b-a)/2.


Intuition

Bounded random variables cannot have heavy tails. They literally cannot take values beyond [a,b]. Hoeffding's lemma quantifies this: the MGF of any bounded centered variable is dominated by a Gaussian MGF. The factor of 1/8 comes from a convexity argument applied to e^{\lambda x} on the interval [a,b].

Proof Sketch

Since X \in [a, b], write X = \alpha b + (1-\alpha) a with \alpha = (X - a)/(b - a) \in [0,1]. By convexity of e^{\lambda x}:

\mathbb{E}[e^{\lambda X}] \leq \mathbb{E}[\alpha e^{\lambda b} + (1-\alpha) e^{\lambda a}]

Use \mathbb{E}[X] = 0 to get p := \mathbb{E}[\alpha] = -a/(b-a). With s = \lambda(b-a), the log of the right-hand side equals L(s) = -ps + \log(1 - p + pe^s); one checks L(0) = L'(0) = 0 and L''(s) \leq 1/4, so Taylor's theorem gives L(s) \leq s^2/8, which is the result.
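A small numeric sanity check of the lemma; the two-point distribution here is our own illustration, not from the source:

```python
import math

# Centered two-point variable: X = 0.7 w.p. 0.3, X = -0.3 w.p. 0.7 (mean 0).
a, b = -0.3, 0.7
p = 0.3  # P(X = b)

def mgf(lam):
    return p * math.exp(lam * b) + (1 - p) * math.exp(lam * a)

# Hoeffding's lemma: mgf(lam) <= exp(lam^2 (b-a)^2 / 8) for every lam.
ok = all(mgf(lam) <= math.exp(lam ** 2 * (b - a) ** 2 / 8) + 1e-12
         for lam in [x / 10 for x in range(-50, 51)])
print(ok)
```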

Why It Matters

Hoeffding's lemma is the bridge between "bounded" and "sub-Gaussian." It explains why bounded losses (e.g., 0-1 loss) always yield O(1/\sqrt{n}) concentration, which is exactly what appears in ERM generalization bounds for finite hypothesis classes.

Failure Mode

The bound treats all bounded distributions the same: a constant taking value 0 and a uniform on [-1, 1] get the same sub-Gaussian parameter. For distributions concentrated near 0, the variance-based Bernstein inequality gives tighter bounds. Hoeffding's lemma is worst-case over the bounded class.

Canonical Examples

Example

Gaussian random variable

If X \sim \mathcal{N}(0, \sigma^2), then \mathbb{E}[e^{\lambda X}] = e^{\sigma^2 \lambda^2/2} exactly. Gaussians are sub-Gaussian with parameter exactly \sigma. This is the "prototype". The definition is designed so that Gaussians saturate the bound.

Example

Rademacher random variable

Let \varepsilon be a Rademacher variable, so \mathbb{P}(\varepsilon = +1) = \mathbb{P}(\varepsilon = -1) = 1/2. Then \varepsilon \in [-1, 1], so by Hoeffding's lemma, \varepsilon is sub-Gaussian with parameter \sigma = 1. In fact, the exact MGF is \mathbb{E}[e^{\lambda \varepsilon}] = \cosh(\lambda) \leq e^{\lambda^2/2}, confirming sub-Gaussianity directly.

Rademacher variables appear everywhere in symmetrization arguments and Rademacher complexity.
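The inequality \cosh(\lambda) \leq e^{\lambda^2/2} also follows term by term, since the Taylor coefficients satisfy (2k)! \geq 2^k k!. A quick grid check (illustrative):

```python
import math

# Check cosh(lam) <= exp(lam^2/2) on a grid; equality holds only at lam = 0.
lams = [x / 20 for x in range(-100, 101)]  # lam in [-5, 5]
ok = all(math.cosh(l) <= math.exp(l * l / 2) for l in lams)
print(ok)
```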

Example

Bounded random variable

If X \in [a, b] almost surely with \mathbb{E}[X] = 0, then X is sub-Gaussian with parameter (b-a)/2 by Hoeffding's lemma. This covers:

  • 0-1 loss: \sigma = 1/2
  • Any loss bounded in [0, M] (after centering): \sigma = M/2
  • Features bounded in [-B, B]: \sigma = B

Example

Sum of independent sub-Gaussians

If X_1, \ldots, X_n are independent, centered, sub-Gaussian with parameters \sigma_1, \ldots, \sigma_n, then S = \sum_{i=1}^n X_i is sub-Gaussian with parameter \sigma = \sqrt{\sigma_1^2 + \cdots + \sigma_n^2}.

This follows because MGFs multiply under independence:

\mathbb{E}[e^{\lambda S}] = \prod_i \mathbb{E}[e^{\lambda X_i}] \leq \prod_i e^{\sigma_i^2 \lambda^2/2} = e^{\lambda^2 \sum_i \sigma_i^2/2}.

For the sample mean \bar{X} = S/n with equal \sigma_i = \sigma: the sub-Gaussian parameter is \sigma/\sqrt{n}, giving the familiar O(1/\sqrt{n}) concentration.
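A Monte Carlo sketch of this O(1/\sqrt{n}) behavior for Rademacher means; the parameters (n, t, trial count) are chosen for illustration only:

```python
import math
import random

random.seed(0)
n, trials, t = 100, 2000, 0.3
exceed = 0
for _ in range(trials):
    mean = sum(random.choice((-1, 1)) for _ in range(n)) / n
    if abs(mean) >= t:
        exceed += 1
empirical = exceed / trials
# Two-sided Hoeffding bound for the mean of n Rademachers (sigma = 1):
hoeffding = 2 * math.exp(-n * t ** 2 / 2)
print(empirical, hoeffding)
```

The empirical tail frequency sits comfortably below the bound, as it must.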

Common Confusions

Watch Out

Sub-Gaussian parameter is NOT the standard deviation

The sub-Gaussian parameter \sigma is an upper bound on the effective spread, but it is not equal to \mathrm{Var}(X)^{1/2} in general. For a variable supported on [-1, 1] with variance 0.01, Hoeffding gives \sigma = 1, far larger than \mathrm{sd}(X) = 0.1. This is why Bernstein inequalities (which use variance information) are sometimes much tighter than Hoeffding-type bounds.

Watch Out

Not all 'nice' distributions are sub-Gaussian

Exponential, Poisson, and chi-squared random variables are not sub-Gaussian. Their tails decay like e^{-ct} rather than e^{-ct^2}. These are sub-exponential. The distinction matters: sub-exponential concentration matches the Gaussian rate only in the small-deviation regime, decays like e^{-ct} for large deviations, and its Chernoff argument requires separate treatment of the "small \lambda" and "large \lambda" regimes.

Watch Out

The centering assumption is essential

The MGF condition \mathbb{E}[e^{\lambda X}] \leq e^{\sigma^2\lambda^2/2} implicitly requires \mathbb{E}[X] = 0. If \mathbb{E}[X] = \mu \neq 0, you apply the definition to X - \mu, not to X directly. Failing to center will give you wrong tail bounds.

Centered vs Non-Centered Sub-Gaussian

The canonical MGF definition assumes \mathbb{E}[X] = 0. For non-centered X with \mathbb{E}[X] = \mu and X - \mu sub-Gaussian with parameter \sigma:

\mathbb{E}[e^{\lambda X}] \leq e^{\lambda \mu + \sigma^2 \lambda^2/2} \quad \text{for all } \lambda \in \mathbb{R}.

This is the natural non-centered analog. The linear term in the exponent carries the mean; the quadratic term controls the tails. All downstream closure and concentration results go through after the mean shift.

The \psi_2-norm, unlike the sub-Gaussian parameter, does not require centering: it is a norm on X itself, since \psi_2(x) = e^{x^2} - 1 depends only on |X|. This is one reason the \psi_2 formulation is often cleaner to work with.

Why Sub-Gaussian is the Right Abstraction

Three reasons sub-Gaussianity appears everywhere in ML theory:

  1. Closure under summation: sums of independent sub-Gaussians are sub-Gaussian. This is exactly what you need for sample averages.

  2. Captures all bounded losses: every bounded loss function produces sub-Gaussian empirical averages. Since most learning theory starts with bounded losses, sub-Gaussianity is the natural first assumption.

  3. Tight enough for optimal rates: sub-Gaussian concentration gives O(1/\sqrt{n}) deviation bounds for sample means, which matches the CLT rate. You cannot do better in general.

The sub-Gaussian class is the largest class of distributions for which the Chernoff method gives Gaussian-quality tail bounds. Going beyond sub-Gaussian (to sub-exponential, or heavier tails) requires different tools and yields weaker concentration.

The Orlicz Norm Hierarchy

The \psi_2 norm makes the set of sub-Gaussian random variables a Banach space L_{\psi_2}, sitting in a strict inclusion chain on a probability space:

L^\infty \subset L_{\psi_2} \subset L_{\psi_1} \subset \bigcap_{p < \infty} L^p

Bounded \subset Sub-Gaussian \subset Sub-exponential \subset All-moments-finite. Each inclusion is strict: a standard Gaussian is sub-Gaussian but not bounded; an \mathrm{Exp}(1) variable is sub-exponential but not sub-Gaussian.

Recall that on a probability space L^p decreases with p, so L^\infty \subset L^q \subset L^p whenever p \leq q. Sub-exponential variables have all moments finite: they sit inside every L^p for p < \infty.

The \psi_1 norm is defined analogously with \psi_1(x) = e^x - 1: \|X\|_{\psi_1} = \inf\{t > 0 : \mathbb{E}[e^{|X|/t}] \leq 2\}.

Exact ψ2\psi_2 Norms for Common Distributions

  • Rademacher \varepsilon = \pm 1: \|X\|_{\psi_2} = 1/\sqrt{\ln 2} \approx 1.201, \sigma = 1. Key step: \varepsilon^2 = 1 always, so solve e^{1/t^2} = 2.
  • Gaussian Z \sim \mathcal{N}(0,1): \|X\|_{\psi_2} = \sqrt{8/3} \approx 1.633, \sigma = 1. Key step: \mathbb{E}[e^{Z^2/t^2}] = (1 - 2/t^2)^{-1/2} for t > \sqrt{2}; set equal to 2 to get t^2 = 8/3.
  • Bounded X \in [a,b], centered: \|X\|_{\psi_2} \leq C(b-a), \sigma = (b-a)/2. Key step: Hoeffding's lemma gives \sigma = (b-a)/2.
  • Gaussian Z \sim \mathcal{N}(0, \sigma^2): \|X\|_{\psi_2} = C\sigma, parameter \sigma. Key step: homogeneity, \|\sigma Z\|_{\psi_2} = \sigma \|Z\|_{\psi_2}.
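The two closed-form entries can be verified in a few lines; this is a check of the algebra, nothing more:

```python
import math

# Gaussian: E[exp(Z^2/t^2)] = (1 - 2/t^2)^(-1/2) for t^2 > 2; equals 2 at t^2 = 8/3.
t2 = 8 / 3
gauss_val = (1 - 2 / t2) ** -0.5

# Rademacher: E[exp(eps^2/t^2)] = exp(1/t^2); equals 2 at t = 1/sqrt(ln 2).
rad_t = 1 / math.sqrt(math.log(2))
rad_val = math.exp(1 / rad_t ** 2)

print(gauss_val, rad_val)
```

Both expressions evaluate to 2, confirming that each t is exactly the \psi_2 norm threshold.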

Closure Properties

Sub-Gaussianity is useful precisely because it is preserved under the operations that appear in proofs.

Theorem

Closure Under Independent Sums

Statement

Suppose X_1, \ldots, X_n are independent, centered, and \|X_i\|_{\psi_2} \leq K for all i. Then for any a = (a_1, \ldots, a_n) \in \mathbb{R}^n:

\left\|\sum_i a_i X_i\right\|_{\psi_2} \leq CK \|a\|_2

where C is a universal constant.


Intuition

By independence, MGFs multiply: \mathbb{E}[e^{\lambda \sum_i a_i X_i}] = \prod_i \mathbb{E}[e^{\lambda a_i X_i}] \leq \prod_i e^{C^2 K^2 \lambda^2 a_i^2 / 2} = e^{C^2 K^2 \lambda^2 \|a\|_2^2 / 2}. The \ell_2 norm of a controls the sub-Gaussian parameter of the sum.

Failure Mode

Without independence, the bound fails. If X_1 = X_2 = \cdots = X_n = X, then \sum_i X_i = nX with \|nX\|_{\psi_2} = n\|X\|_{\psi_2}, but the theorem would predict CK\sqrt{n}. The gap between \sqrt{n} and n is exactly the cost of perfect correlation.

Other closure properties:

  • Scaling: \|aX\|_{\psi_2} = |a| \cdot \|X\|_{\psi_2} (immediate from the definition).
  • Lipschitz maps: if f is L-Lipschitz with f(0) = 0, then \|f(X)\|_{\psi_2} \leq CL \cdot \|X\|_{\psi_2}.
  • Products break sub-Gaussianity: if X, Y are independent sub-Gaussian, then XY is sub-exponential, not sub-Gaussian. In particular, X^2 is sub-exponential whenever X is sub-Gaussian. This is exactly why Bernstein's inequality has two regimes.

Sub-Gaussian Maxima

Theorem

Maximum of Sub-Gaussians

Statement

For X_1, \ldots, X_n sub-Gaussian with \|X_i\|_{\psi_2} \leq K (independence is not required):

\mathbb{E}\!\left[\max_{i \leq n} X_i\right] \leq CK\sqrt{\log n}

Moreover, for any t \geq 0:

\mathbb{P}\!\left(\max_{i \leq n} X_i \geq CK\sqrt{\log n} + t\right) \leq e^{-t^2/(2K^2)}

The maximum is a one-sided event, so there is no factor of 2 on the right-hand side.


Intuition

For any \lambda > 0: \mathbb{E}[\max_i X_i] \leq \frac{1}{\lambda}\log\sum_i \mathbb{E}[e^{\lambda X_i}] \leq \frac{1}{\lambda}(\log n + C^2 K^2 \lambda^2/2). Setting \lambda = \sqrt{2\log n}/(CK) gives CK\sqrt{2\log n}.

Why It Matters

This is where the \log|H| term in PAC bounds comes from. Bounding \sup_{h \in H}|R(h) - \hat{R}(h)| over a finite hypothesis class with |H| = n is bounding the maximum of n sub-Gaussian variables. The \sqrt{\log n} overhead is the price of uniformity.

Failure Mode

For sub-exponential variables, the maximum grows as \log n (not \sqrt{\log n}). The heavier tails make the worst case worse. For heavy-tailed variables (polynomial tails), the maximum can grow polynomially in n.
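A Monte Carlo sketch of the \sqrt{2\log n} growth for standard Gaussian maxima; the sample sizes below are arbitrary:

```python
import math
import random

random.seed(1)
n, trials = 1000, 300
# Average of max over n standard Gaussians, across independent trials.
avg_max = sum(max(random.gauss(0, 1) for _ in range(n))
              for _ in range(trials)) / trials
bound = math.sqrt(2 * math.log(n))  # ~3.72 for n = 1000
print(avg_max, bound)
```

The empirical average (around 3.2 for n = 1000) sits below the \sqrt{2\log n} bound, as the theorem predicts.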

The Sub-Exponential Bridge

The relationship between sub-Gaussian and sub-exponential is precise:

  • If X is sub-Gaussian, then X^2 is sub-exponential with \|X^2\|_{\psi_1} = \|X\|_{\psi_2}^2 (an identity under the standard Orlicz norm convention \psi_1(x) = e^x - 1, \psi_2(x) = e^{x^2} - 1).
  • Conversely, if X^2 is sub-exponential, then X is sub-Gaussian.

This explains why Bernstein's inequality for centered sub-exponential variables \sum_i X_i has two regimes:

\mathbb{P}\!\left(\left|\sum_i X_i\right| \geq t\right) \leq 2\exp\!\left(-c \cdot \min\!\left(\frac{t^2}{\sum_i \|X_i\|_{\psi_1}^2},\; \frac{t}{\max_i \|X_i\|_{\psi_1}}\right)\right)

For small t, the t^2 term dominates (sub-Gaussian regime). For large t, the t term dominates (sub-exponential regime, heavier tail). Setting t^2/\sum_i \|X_i\|_{\psi_1}^2 = t/\max_i \|X_i\|_{\psi_1} gives the crossover t^\star = \sum_i \|X_i\|_{\psi_1}^2 / \max_i \|X_i\|_{\psi_1} (Vershynin, Theorem 2.9.1).

Proof Checklist

When writing or reading a sub-Gaussian argument:

  1. Identify the random variable. What quantity do you want to concentrate? Write it as Z = g(X_1, \ldots, X_n).
  2. Center it. Work with Z - \mathbb{E}[Z].
  3. Decompose if needed. Express Z - \mathbb{E}[Z] as a sum or martingale difference sequence.
  4. Check sub-Gaussianity of each piece. Bounded? Hoeffding gives \sigma_i. Lipschitz function of a Gaussian? Gaussian concentration applies.
  5. Propagate. Independent sum: \sigma^2 = \sum_i \sigma_i^2. Martingale: Azuma-Hoeffding. Linear combination: \|a\|_2 scaling.
  6. Apply the tail bound. \mathbb{P}(|Z - \mathbb{E}[Z]| \geq t) \leq 2e^{-t^2/(2\sigma^2)}. Invert to get confidence intervals.
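The inversion step in code: turning the two-sided tail bound into a confidence radius for a sample mean (all the numbers below are placeholders):

```python
import math

# Sample mean of n sub-Gaussian(sigma) variables is sub-Gaussian(sigma/sqrt(n)).
sigma, n, delta = 0.5, 400, 0.05
sigma_n = sigma / math.sqrt(n)

# Solve 2*exp(-t^2/(2*sigma_n^2)) = delta for the confidence radius t.
t = sigma_n * math.sqrt(2 * math.log(2 / delta))

# Round trip: plugging t back into the tail bound recovers delta.
recovered = 2 * math.exp(-t ** 2 / (2 * sigma_n ** 2))
print(t, recovered)
```

With these placeholder numbers the radius is about 0.068, shrinking at the O(1/\sqrt{n}) rate as n grows.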

Exercises

ExerciseCore

Problem

Let X be a Rademacher random variable (\pm 1 with equal probability). Verify directly that \mathbb{E}[e^{\lambda X}] = \cosh(\lambda) \leq e^{\lambda^2/2} for all \lambda.

ExerciseCore

Problem

Let X_1, \ldots, X_n be i.i.d. with X_i \in [0, 1] and \mathbb{E}[X_i] = \mu. Using Hoeffding's lemma and the Chernoff method, prove that:

\mathbb{P}\!\bigl(\bar{X}_n - \mu \geq t\bigr) \leq e^{-2nt^2}

ExerciseAdvanced

Problem

Show that the exponential distribution X \sim \mathrm{Exp}(1) is not sub-Gaussian. Specifically, show that \mathbb{E}[e^{\lambda(X-1)}] cannot be bounded by e^{C\lambda^2} for any constant C when \lambda is large.

ExerciseAdvanced

Problem

Verify the table entry for Z \sim \mathcal{N}(0,1): \|Z\|_{\psi_2} = \sqrt{8/3}. Start from \mathbb{E}[e^{Z^2/t^2}] = (1 - 2/t^2)^{-1/2} for t > \sqrt{2}, and solve for the smallest t with \mathbb{E}[e^{Z^2/t^2}] \leq 2.

ExerciseAdvanced

Problem

Let X_1, \ldots, X_n be independent sub-Gaussian with \|X_i\|_{\psi_2} \leq K. Show that \mathbb{E}[\max_{i \leq n} X_i] \leq CK\sqrt{\log n} using the argument: exponentiate, bound by the sum, apply the MGF bound, optimize \lambda.


References

  • Vershynin, High-Dimensional Probability (2018), Chapters 2-3. Definitive modern treatment using the Orlicz norm framework. Proposition 2.5.2 for the equivalence theorem.
  • Boucheron, Lugosi, Massart, Concentration Inequalities (2013), Chapter 2. Uses the entropy method rather than Orlicz norms.
  • Wainwright, High-Dimensional Statistics (2019), Chapter 2. Clear exposition connecting sub-Gaussian theory to statistical applications.
  • van Handel, Probability in High Dimension (2016), Chapters 1-2. Excellent treatment of sub-Gaussian maxima and chaining.
  • Shalev-Shwartz and Ben-David, Understanding Machine Learning (2014), Chapters 26-28. How sub-Gaussian bounds enter learning theory.
  • Rigollet and Hutter, High-Dimensional Statistics (MIT lecture notes, 2023), Chapter 1. Concise derivation of all five characterizations with explicit constants.

Next Topics

Natural next steps from sub-Gaussian random variables:

  • Sub-exponential random variables: the next distributional class, for heavier tails.
  • Hanson-Wright inequality: sub-Gaussian concentration for quadratic forms X^\top A X.
  • Epsilon-nets and covering numbers: combining sub-Gaussian concentration with geometric arguments. Dudley's chaining integral, developed in empirical processes and chaining, sharpens the \sqrt{\log n} maxima bound when the X_i are indexed by a metric space.
  • Azuma-Hoeffding inequality via martingale theory: sub-Gaussian concentration for martingale difference sequences, the natural generalization beyond independent sums.
  • McDiarmid's inequality: concentration for functions of independent variables with bounded differences. Direct application of sub-Gaussian ideas via a martingale expansion.
  • Tsirelson-Ibragimov-Sudakov (Gaussian Lipschitz concentration): for f : \mathbb{R}^n \to \mathbb{R} that is L-Lipschitz and Z \sim \mathcal{N}(0, I_n), f(Z) - \mathbb{E}[f(Z)] is sub-Gaussian with parameter L. The cleanest statement of sub-Gaussian behavior in high dimension.

Last reviewed: April 26, 2026
