
Concentration Probability

Sub-Exponential Random Variables

The distributional class between sub-Gaussian and heavy-tailed: heavier tails than Gaussian, the Orlicz norm characterization, Bernstein condition, and the two-regime concentration bound.


Why This Matters

Sub-Gaussian random variables have tails decaying like e^{-ct^2} and give the cleanest concentration bounds. But many quantities in ML are not sub-Gaussian:

  • Chi-squared random variables (sums of squared Gaussians)
  • Products of sub-Gaussian random variables
  • Exponential random variables (waiting times)
  • Squared losses in regression with Gaussian noise

These have tails decaying like e^{-ct}: still exponential, but with a linear rather than quadratic exponent. The sub-exponential class captures exactly this level of tail behavior, and the Bernstein inequality for sub-exponential sums provides the right concentration bound: sub-Gaussian for small deviations, exponential for large deviations.

Tail Class Board

Why the exponent shape matters

Move the threshold and watch three tail assumptions separate. The concentration story is not just small probability; it is whether the exponent is quadratic, linear, or only polynomial.

[Interactive plot: tail probability vs. deviation threshold t, log scale. Larger thresholds expose the difference between quadratic, linear, and polynomial decay; a larger scale makes the linear tail regime arrive earlier and decay more slowly. The sub-exponential curve is quadratic near zero, then linear past the kink at the MGF boundary.]

Formula shape

The exponent shape is what determines whether sample complexity pays a log, square-root-log, or worse penalty.

ML translation

Sub-Gaussian assumptions power clean Hoeffding-style bounds. Squares, products, and chi-squared terms usually require sub-exponential Bernstein bounds instead.

Mental Model

The tail decay hierarchy is:

| Class | Tail decay | MGF | Example |
| --- | --- | --- | --- |
| Sub-Gaussian | e^{-ct^2} | Finite for all \lambda \in \mathbb{R} | Bounded, Gaussian |
| Sub-exponential | e^{-ct} | Finite for \vert\lambda\vert < \lambda_0 | \chi^2, Exponential |
| Heavy-tailed | 1/t^p | Diverges for any \lambda \neq 0 | Cauchy, Pareto |

Sub-exponential is the intermediate class. The MGF exists in a neighborhood of zero (not everywhere), which means the Chernoff method works for small \lambda but breaks for large \lambda. This produces a concentration bound with two regimes.
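The hierarchy can be seen numerically. The sketch below compares -log P(tail) at a few thresholds for one representative of each class: a standard Gaussian, an Exp(1), and a Pareto-style 1/t^2 tail (the specific distributions are illustrative stand-ins, not the only members of each class).

```python
# Illustrative sketch: how fast -log P(|X| >= t) grows for the three tail classes.
import math

def gaussian_tail(t):       # P(|Z| >= t) for Z ~ N(0, 1), exact via erfc
    return math.erfc(t / math.sqrt(2))

def exponential_tail(t):    # P(X >= t) for X ~ Exp(1)
    return math.exp(-t)

def pareto_tail(t, p=2.0):  # P(X >= t) = 1/t^p for t >= 1 (Pareto-type tail)
    return min(1.0, t ** (-p))

for t in [2.0, 4.0, 6.0]:
    print(t,
          -math.log(gaussian_tail(t)),     # grows ~ t^2/2 (quadratic exponent)
          -math.log(exponential_tail(t)),  # grows ~ t     (linear exponent)
          -math.log(pareto_tail(t)))       # grows ~ p*log(t) (only logarithmic)
```

At t = 6 the three exponents are already separated by orders of magnitude, which is exactly the separation the interactive plot above is demonstrating.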

Formal Setup and Definitions

Let X be a centered random variable (\mathbb{E}[X] = 0).

Definition

Sub-Exponential Random Variable (MGF characterization)

A centered random variable X is sub-exponential with parameters (\nu^2, b) if for all |\lambda| < 1/b:

\mathbb{E}[e^{\lambda X}] \leq \exp\!\left(\frac{\nu^2 \lambda^2}{2}\right)

The parameter \nu^2 plays the role of a "variance proxy" (controlling the sub-Gaussian regime), and b controls the radius where the MGF bound holds.

Definition

Sub-Exponential Norm (Orlicz ψ₁ norm)

The sub-exponential norm (or \psi_1-norm) of X is:

\|X\|_{\psi_1} = \inf\{t > 0 : \mathbb{E}[e^{|X|/t}] \leq 2\}

A random variable is sub-exponential if and only if \|X\|_{\psi_1} < \infty.

Compare to the sub-Gaussian norm: \|X\|_{\psi_2} = \inf\{t > 0 : \mathbb{E}[e^{X^2/t^2}] \leq 2\}. The \psi_1 norm uses e^{|X|/t} (linear in |X|) while \psi_2 uses e^{X^2/t^2} (quadratic). The linear growth in the exponent is what makes sub-exponential tails heavier than sub-Gaussian.
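The \psi_1 norm can be computed exactly in simple cases. For X \sim \text{Exp}(1) the defining expectation has the closed form \mathbb{E}[e^{X/t}] = t/(t-1) for t > 1, so the infimum solves t/(t-1) = 2, giving \|X\|_{\psi_1} = 2. A minimal sketch that recovers this by bisection:

```python
# Sketch: psi_1 norm of X ~ Exp(1) via bisection on E[exp(|X|/t)] <= 2.
# Uses the closed-form MGF of Exp(1); exact answer is t = 2.

def mgf_abs_exp1(t):
    """E[exp(X/t)] for X ~ Exp(1); finite only for t > 1."""
    return t / (t - 1.0)

def psi1_norm_exp1(lo=1.0 + 1e-12, hi=100.0, iters=100):
    """Bisection for the smallest t with E[exp(X/t)] <= 2."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mgf_abs_exp1(mid) <= 2.0:
            hi = mid
        else:
            lo = mid
    return hi

print(psi1_norm_exp1())  # converges to the exact value 2
```

The same bisection template works for any variable whose MGF of |X| is computable, analytically or by Monte Carlo.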

Definition

Bernstein Condition

A centered random variable X satisfies the Bernstein condition with parameters (\sigma^2, b) if for all integers k \geq 2:

\mathbb{E}[|X|^k] \leq \frac{k!}{2} \sigma^2 b^{k-2}

This is equivalent (up to constants) to being sub-exponential with parameters (\sigma^2, b). The condition controls all moments: the k-th moment grows at most like k! \cdot b^k, compared to sub-Gaussian where moments grow like (C\sqrt{k})^k.

Equivalent Characterizations

The following are equivalent (up to constants in parameters):

  1. MGF condition: \mathbb{E}[e^{\lambda X}] \leq e^{\nu^2\lambda^2/2} for |\lambda| < 1/b
  2. Tail condition: \mathbb{P}(|X| \geq t) \leq 2\exp(-t/C_1) for all t > 0
  3. Moment condition: (\mathbb{E}[|X|^k])^{1/k} \leq C_2 k for all k \geq 1
  4. Orlicz norm: \|X\|_{\psi_1} \leq C_3
  5. Bernstein condition: \mathbb{E}[|X|^k] \leq \frac{k!}{2}\sigma^2 b^{k-2} for k \geq 2

Compare to sub-Gaussian characterization 3: (\mathbb{E}[|X|^k])^{1/k} \leq C\sqrt{k} for all k \geq 1. Sub-exponential moments grow like k; sub-Gaussian moments grow like \sqrt{k}. Both forms use k \geq 1. The extra factor of \sqrt{k} is the precise difference between the two classes.
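The moment characterizations can be checked numerically against closed forms: for X \sim \text{Exp}(1), \mathbb{E}[X^k] = k!, so (\mathbb{E}[X^k])^{1/k} = (k!)^{1/k} \approx k/e grows linearly in k; for Z \sim \mathcal{N}(0,1), \mathbb{E}[|Z|^k] = 2^{k/2}\Gamma((k+1)/2)/\sqrt{\pi} grows like \sqrt{k}. A sketch:

```python
# Sketch: moment growth rates distinguishing sub-exponential from sub-Gaussian.
import math

def exp_moment_root(k):
    # (E[X^k])^(1/k) for X ~ Exp(1), using E[X^k] = k!
    return math.factorial(k) ** (1.0 / k)

def gauss_abs_moment_root(k):
    # (E[|Z|^k])^(1/k) for Z ~ N(0,1), closed-form absolute moment
    m = 2 ** (k / 2) * math.gamma((k + 1) / 2) / math.sqrt(math.pi)
    return m ** (1.0 / k)

for k in [4, 8, 16, 32]:
    # these ratios stabilize: ~constant for Exp(1)/k, ~constant for N(0,1)/sqrt(k)
    print(k, exp_moment_root(k) / k, gauss_abs_moment_root(k) / math.sqrt(k))
```

The k-th moment root of the exponential pulls away from the Gaussian one by a factor of order \sqrt{k}, exactly as characterization 3 predicts.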

Main Theorems

Theorem

Products of Sub-Gaussians are Sub-Exponential

Statement

If X and Y are sub-Gaussian random variables (not necessarily independent), then the product XY is sub-exponential, with:

\|XY\|_{\psi_1} \leq \|X\|_{\psi_2} \cdot \|Y\|_{\psi_2}

Intuition

Multiplying two "Gaussian-like" variables produces something "exponential-like." A concrete example: if Z \sim \mathcal{N}(0, 1), then Z is sub-Gaussian, but Z^2 = Z \cdot Z is chi-squared (with 1 degree of freedom) and sub-exponential. The tails get heavier because multiplying two variables that can be large makes the product even larger.

Proof Sketch

Tail bound (qualitative). If |X| \cdot |Y| \geq t, then at least one of |X| \geq \sqrt{t} or |Y| \geq \sqrt{t} must hold. By union bound:

\mathbb{P}(|XY| \geq t) \leq \mathbb{P}(|X| \geq \sqrt{t}) + \mathbb{P}(|Y| \geq \sqrt{t}) \leq 2e^{-ct} + 2e^{-ct} = 4e^{-ct}

where each sub-Gaussian tail is evaluated at \sqrt{t}, yielding 2e^{-c(\sqrt{t})^2} = 2e^{-ct}. The resulting tail e^{-ct} is sub-exponential.

Norm bound via AM-GM. The sharper argument uses the pointwise bound |XY| \leq (X^2 + Y^2)/2. Combined with the key identity relating the \psi_1 and \psi_2 norms:

\|X^2\|_{\psi_1} = \|X\|_{\psi_2}^2

(which follows directly from the definitions: \mathbb{E}[e^{X^2/t}] \leq 2 means \mathbb{E}[e^{X^2/(\sqrt{t})^2}] \leq 2), we get:

\|XY\|_{\psi_1} \leq \tfrac{1}{2}\|X^2\|_{\psi_1} + \tfrac{1}{2}\|Y^2\|_{\psi_1} = \tfrac{1}{2}\|X\|_{\psi_2}^2 + \tfrac{1}{2}\|Y\|_{\psi_2}^2

Applying the same argument to the rescaled variables X/\|X\|_{\psi_2} and Y/\|Y\|_{\psi_2} gives a \psi_1 norm at most 1 for their product; multiplying back through by the norms yields the product form \|XY\|_{\psi_1} \leq \|X\|_{\psi_2}\|Y\|_{\psi_2}.

Why It Matters

This explains why chi-squared variables, quadratic forms, and many statistics in learning theory are sub-exponential: they involve products or squares of sub-Gaussian quantities. When you see X^\top X or \|Xw\|^2 in a bound, expect sub-exponential concentration, not sub-Gaussian.

Failure Mode

The product of two sub-exponential variables is generally not sub-exponential: its tails may decay only like e^{-c\sqrt{t}} (sub-Weibull), or slower. Sub-exponential is closed under addition but not under multiplication, which limits the composability of sub-exponential bounds.
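A Monte Carlo sanity check of the theorem: for independent Z_1, Z_2 \sim \mathcal{N}(0,1), the empirical -\log \mathbb{P}(|Z_1 Z_2| \geq t) should grow roughly linearly in t, far slower than the quadratic t^2/2 growth a Gaussian tail would show. A sketch (sample size and thresholds are arbitrary choices):

```python
# Sketch: empirical tail of a product of two independent standard normals.
import numpy as np

rng = np.random.default_rng(0)
n = 2_000_000
prod = rng.standard_normal(n) * rng.standard_normal(n)

for t in [1.0, 2.0, 3.0, 4.0]:
    p = np.mean(np.abs(prod) >= t)   # empirical P(|Z1 Z2| >= t)
    print(t, p, -np.log(p))          # -log p increases by ~1 per unit of t
```

Successive increments of -\log p are close to 1 per unit of t, the signature of a linear-exponent (sub-exponential) tail; a sub-Gaussian tail would show increments growing with t.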

Theorem

Bernstein Inequality for Sub-Exponential Sums

Statement

Let X_1, \ldots, X_n be independent centered sub-exponential random variables with parameters (\nu_i^2, b). Then for any t > 0:

\mathbb{P}\!\left(\left|\sum_{i=1}^n X_i\right| \geq t\right) \leq 2\exp\!\left(-c \min\!\left(\frac{t^2}{\sum_i \nu_i^2},\, \frac{t}{b}\right)\right)

where c > 0 is a universal constant. For i.i.d. variables with common parameter \nu^2 and sample mean \bar{X}_n = \frac{1}{n}\sum_i X_i:

\mathbb{P}\!\left(|\bar{X}_n| \geq t\right) \leq 2\exp\!\left(-cn\min\!\left(\frac{t^2}{\nu^2},\, \frac{t}{b}\right)\right)

Intuition

The bound has two regimes separated by the threshold t^* = \nu^2/b:

Small deviations (t \leq t^*): The t^2/\nu^2 term dominates the minimum. The bound is \exp(-cnt^2/\nu^2), a sub-Gaussian tail. For moderate deviations, the sub-exponential variable behaves as if it were sub-Gaussian.

Large deviations (t > t^*): The t/b term dominates. The bound is \exp(-cnt/b), an exponential (not Gaussian) tail. For large deviations, the heavier tails kick in and concentration is weaker.

The transition between regimes is smooth. The sub-Gaussian regime gives the familiar \sqrt{\log(1/\delta)/n} rates for moderate confidence levels.

Proof Sketch

Use the Chernoff method. For 0 < \lambda < 1/b:

\mathbb{P}\!\left(\sum_i X_i \geq t\right) \leq e^{-\lambda t} \prod_i \mathbb{E}[e^{\lambda X_i}] \leq e^{-\lambda t} \exp\!\left(\frac{\lambda^2 \sum_i \nu_i^2}{2}\right)

Optimize over \lambda \in (0, 1/b):

  • If the unconstrained optimum \lambda^* = t/\sum_i\nu_i^2 satisfies \lambda^* < 1/b, we are in the sub-Gaussian regime: the bound is \exp(-t^2/(2\sum_i\nu_i^2)).
  • If \lambda^* \geq 1/b, set \lambda = 1/(2b) (safely inside the boundary): the bound is \exp(-t/(2b) + \sum_i\nu_i^2/(8b^2)). For large t, this gives \exp(-ct/b).

Taking the minimum of the two cases gives the stated bound.

Why It Matters

This is the correct concentration bound for quantities like \frac{1}{n}\sum_i (X_i^2 - \mathbb{E}[X_i^2]) where each X_i is Gaussian. The sub-Gaussian bound does not apply (these are not sub-Gaussian), but the Bernstein bound does. The two-regime behavior is what you actually observe in practice: moderate deviations look Gaussian, but extreme deviations reveal the heavier tails.

Failure Mode

The bound requires knowing (or bounding) both \nu^2 and b. For the sample variance of Gaussian data, \nu^2 = O(\sigma^4) and b = O(\sigma^2). If these parameters are poorly estimated, the bound may be loose in one or both regimes.
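The two-regime structure is easy to make concrete in code. Below is a minimal sketch of the i.i.d. Bernstein bound as a function; the constant c = 0.25 is an arbitrary illustrative stand-in for the unspecified universal constant, and the parameters \nu^2 = 4, b = 4 match the per-summand centered chi-squared example discussed later.

```python
# Sketch: the two-regime Bernstein bound with the regime switch made explicit.
import math

def bernstein_bound(t, n, nu2, b, c=0.25):
    """2 * exp(-c * n * min(t^2 / nu2, t / b)) for the sample mean."""
    return 2.0 * math.exp(-c * n * min(t * t / nu2, t / b))

nu2, b, n = 4.0, 4.0, 100
t_star = nu2 / b  # regime boundary: here t* = 1

# Below t*, the t^2 branch of the min is active (sub-Gaussian regime);
# above it, the t branch takes over (exponential regime).
for t in [0.25 * t_star, 0.5 * t_star, 2 * t_star, 4 * t_star]:
    branch = "t^2/nu2" if t <= t_star else "t/b"
    print(t, bernstein_bound(t, n, nu2, b), branch)
```

Doubling t below t^* multiplies the exponent by 4 (quadratic regime); doubling t above t^* multiplies it only by 2 (linear regime).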

Key Examples of Sub-Exponential Variables

Example

Chi-squared with k degrees of freedom

If Z_1, \ldots, Z_k \sim \mathcal{N}(0, 1) independently, then W = \sum_{j=1}^k Z_j^2 \sim \chi^2(k) with \mathbb{E}[W] = k and \text{Var}(W) = 2k.

The centered variable W - k is sub-exponential with parameters \nu^2 = 4k and b = 4 (Wainwright 2019, Example 2.4, using the normalization where each Z_j^2 - 1 has sub-exponential parameters (4, 4)). It is not sub-Gaussian because each Z_j^2 has MGF \mathbb{E}[e^{\lambda Z_j^2}] = (1 - 2\lambda)^{-1/2}, which diverges at \lambda = 1/2. A sub-Gaussian MGF would be finite for all \lambda.

The Bernstein bound gives: \mathbb{P}(|W - k| \geq t) \leq 2\exp(-c\min(t^2/k, t)).

Laurent-Massart bound (canonical form). The sharpest explicit constants come from Laurent and Massart (2000), Lemma 1: for all t > 0,

\mathbb{P}(W - k \geq 2\sqrt{kt} + 2t) \leq e^{-t}, \qquad \mathbb{P}(k - W \geq 2\sqrt{kt}) \leq e^{-t}

The two-term form on the upper tail exposes both regimes directly: the 2\sqrt{kt} piece is the sub-Gaussian contribution (dominant when t \lesssim k), the 2t piece is the exponential contribution (dominant when t \gtrsim k). This is the standard reference for chi-squared tails in high-dimensional statistics.
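The Laurent-Massart upper tail is simple to verify by simulation. A sketch (k = 10 and the sample size are arbitrary choices; the empirical tail should sit below e^{-t} at every t):

```python
# Sketch: Monte Carlo check of the Laurent-Massart upper-tail bound
# P(W - k >= 2*sqrt(k*t) + 2*t) <= exp(-t) for W ~ chi^2(k).
import numpy as np

rng = np.random.default_rng(1)
k, n = 10, 500_000
W = rng.chisquare(k, size=n)

for t in [0.5, 1.0, 2.0, 3.0]:
    thresh = k + 2.0 * np.sqrt(k * t) + 2.0 * t
    p_hat = np.mean(W >= thresh)
    print(t, p_hat, np.exp(-t))  # empirical tail vs. the e^{-t} guarantee
```

The gap between the empirical tail and e^{-t} is substantial at small t and narrows at large t, reflecting that the bound is tuned for the large-deviation regime.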

Example

Exponential distribution

If X \sim \text{Exp}(1), then X - 1 is centered with \mathbb{E}[e^{\lambda(X-1)}] = e^{-\lambda}/(1 - \lambda) for \lambda < 1.

This is finite only for \lambda < 1, confirming sub-exponential (not sub-Gaussian) behavior. The \psi_1 norm is \|X - 1\|_{\psi_1} = O(1).
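The closed-form MGF is easy to cross-check by Monte Carlo. The sketch below evaluates it at \lambda = 0.3, chosen safely inside the radius of convergence: the naive estimator needs 2\lambda < 1 for finite variance, so values near \lambda = 1 would be numerically treacherous.

```python
# Sketch: Monte Carlo check of E[exp(lambda*(X-1))] = exp(-lambda)/(1-lambda)
# for X ~ Exp(1), at a lambda well inside the MGF's radius of convergence.
import math
import numpy as np

rng = np.random.default_rng(2)
X = rng.exponential(1.0, size=1_000_000)

lam = 0.3
mc = np.mean(np.exp(lam * (X - 1.0)))   # Monte Carlo estimate
exact = math.exp(-lam) / (1.0 - lam)    # closed form, valid for lam < 1
print(mc, exact)                        # should be close (MC error ~1e-3)
```

Repeating this with lam approaching 1 shows the estimate and its variance blowing up, a hands-on view of the MGF boundary.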

Example

Product of two independent standard normals

Let Z_1, Z_2 \sim \mathcal{N}(0, 1) independently. The product Z_1 Z_2 has \mathbb{E}[Z_1 Z_2] = 0, and its density (a variance-gamma law with parameter 0) has Laplace-like tails:

\mathbb{P}(|Z_1 Z_2| \geq t) \leq C e^{-c t} \qquad \text{for all } t > 0

with universal constants C, c > 0. This is a genuine sub-exponential (linear-exponent) tail, not an e^{-c\sqrt{t}} tail. The naive union bound \mathbb{P}(|Z_1| \geq \sqrt{t}) + \mathbb{P}(|Z_2| \geq \sqrt{t}) \leq 4e^{-ct} from the proof sketch above already delivers this rate, because substituting \sqrt{t} into the Gaussian tail e^{-u^2/2} yields e^{-t/2}.

The norm bound \|Z_1 Z_2\|_{\psi_1} \leq \|Z_1\|_{\psi_2} \|Z_2\|_{\psi_2} = O(1) confirms sub-exponential membership.

Sub-Exponential Closure Properties

  1. Sum closure (worst case). For any sub-exponential X_1, \ldots, X_n (not necessarily independent), \sum_i X_i is sub-exponential with \|\sum_i X_i\|_{\psi_1} \leq C \sum_i \|X_i\|_{\psi_1}. Unlike sub-Gaussian, the norm adds linearly, not in quadrature. This is tight under arbitrary correlation.

    Independent sums (Bernstein regime). When the X_i are independent, the linear bound is pessimistic: the Bernstein inequality above gives a two-regime concentration bound for \sum_i X_i with a sub-Gaussian small-deviation regime controlled by \sum_i \nu_i^2. This is a concentration statement about the sum, not a norm identity, and it is what actually drives rates in ML theory.

  2. Scalar multiplication: \|aX\|_{\psi_1} = |a| \cdot \|X\|_{\psi_1}.

  3. Product of sub-Gaussians: \|XY\|_{\psi_1} \leq \|X\|_{\psi_2} \|Y\|_{\psi_2}.

  4. Not closed under products: The product of two sub-exponential variables may be heavier than sub-exponential.

Connection to Bernstein's Classical Inequality

The classical Bernstein inequality (from the concentration-inequalities page) for bounded random variables:

\mathbb{P}\!\left(\sum_i (X_i - \mu_i) \geq t\right) \leq \exp\!\left(-\frac{t^2/2}{\sum_i \sigma_i^2 + Mt/3}\right)

is a special case of the sub-exponential Bernstein inequality. Bounded random variables are sub-exponential (since they are even sub-Gaussian), and the denominator \sum_i\sigma_i^2 + Mt/3 captures both regimes: sub-Gaussian when t \ll \sum_i\sigma_i^2/M and exponential when t \gg \sum_i\sigma_i^2/M.

Equivalence of classical and modern forms. The modern min-form \exp(-c\min(t^2/\sum_i \nu_i^2, t/b)) and the classical quotient form \exp(-t^2/(2\sum_i\nu_i^2 + 2bt)) are equivalent up to universal constants. This follows from the elementary comparison \min(a, a') \asymp aa'/(a + a'): writing a = t^2/\sum_i\nu_i^2 and a' = t/b, both expressions interpolate between the sub-Gaussian regime (small t) and the exponential regime (large t) at the same threshold t^* \asymp \sum_i\nu_i^2/b.
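The elementary comparison driving this equivalence, \tfrac{1}{2}\min(a, a') \leq aa'/(a + a') \leq \min(a, a') for positive a, a', can be checked exhaustively on a grid spanning both regimes:

```python
# Sketch: numeric check of min(a, a') ~ a*a'/(a + a'), the comparison behind
# the equivalence of the min-form and quotient-form Bernstein bounds.
import itertools

vals = [10.0 ** e for e in range(-3, 4)]  # spans a << a' and a >> a'
for a, b in itertools.product(vals, vals):
    harmonic = a * b / (a + b)
    assert 0.5 * min(a, b) <= harmonic <= min(a, b)
print("comparison verified on", len(vals) ** 2, "pairs")
```

The upper bound holds because a + b \geq \max(a, b), and the lower bound because a + b \leq 2\max(a, b); the loop is only a sanity check of that two-line argument.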

Common Confusions

Watch Out

Sub-exponential is BETWEEN sub-Gaussian and heavy-tailed

Every sub-Gaussian variable is sub-exponential, but not vice versa. Sub-exponential is a weaker condition (allows heavier tails). The hierarchy is: bounded \subset sub-Gaussian \subset sub-exponential \subset finite variance \subset all distributions.

Watch Out

The MGF bound holds only for small lambda

For sub-Gaussian variables, \mathbb{E}[e^{\lambda X}] \leq e^{C\lambda^2} for all \lambda. For sub-exponential, the bound holds only for |\lambda| < 1/b. Beyond this, the MGF may diverge. This is why the Chernoff method produces two regimes: you can only optimize \lambda up to the boundary 1/b.

Watch Out

ψ₁ norm vs ψ₂ norm

\psi_2 (sub-Gaussian) uses e^{X^2/t^2} in the definition: the square of X in the exponent. \psi_1 (sub-exponential) uses e^{|X|/t}: the absolute value of X. The square in \psi_2 is what forces the tails to be Gaussian-like. Every sub-Gaussian variable is sub-exponential because e^{|X|/t} \leq 1 + e^{X^2/t^2} for appropriate t.

Summary

  • Sub-exponential tails: \mathbb{P}(|X| \geq t) \leq 2e^{-ct} (linear exponent)
  • Sub-Gaussian tails: \mathbb{P}(|X| \geq t) \leq 2e^{-ct^2} (quadratic exponent)
  • Products of sub-Gaussians are sub-exponential (\psi_2 \cdot \psi_2 \to \psi_1)
  • Chi-squared and squared losses are sub-exponential, not sub-Gaussian
  • Bernstein bound has two regimes: sub-Gaussian for small t, exponential for large t
  • Threshold between regimes: t^* = \nu^2/b
  • \psi_1 norms add linearly for arbitrary sums; for independent sub-Gaussian sums, variance proxies add in quadrature

Exercises

ExerciseCore

Problem

Show that X \sim \text{Exp}(1) is sub-exponential but not sub-Gaussian. Compute the MGF \mathbb{E}[e^{\lambda(X-1)}] and show it is finite only for \lambda < 1.

ExerciseCore

Problem

Let Z \sim \mathcal{N}(0, 1). Show that Z^2 - 1 (centered chi-squared with 1 degree of freedom) is sub-exponential by verifying the tail bound \mathbb{P}(Z^2 - 1 \geq t) \leq e^{-ct} for large t.

ExerciseAdvanced

Problem

Let X_1, \ldots, X_n \sim \mathcal{N}(0, \sigma^2) independently. Using the Bernstein inequality for sub-exponential variables, bound \mathbb{P}\!\left(\left|\frac{1}{n}\sum_i X_i^2 - \sigma^2\right| \geq t\right). Identify the two regimes.


References

Canonical:

  • Vershynin, High-Dimensional Probability (2018), Sections 2.7-2.8
  • Boucheron, Lugosi, Massart, Concentration Inequalities (2013), Chapter 2
  • Laurent and Massart, "Adaptive Estimation of a Quadratic Functional by Model Selection," Annals of Statistics, 28(5):1302-1338 (2000). Canonical chi-squared tail constants (Lemma 1).

Current:

  • Wainwright, High-Dimensional Statistics (2019), Chapter 2
  • Rigollet and Hutter, High-Dimensional Statistics (MIT lecture notes, 2023)
  • van Handel, Probability in High Dimension (2016), Chapters 1-3
  • Götze, Sambale, Sinulis, "Concentration inequalities for polynomials in \alpha-sub-exponential random variables," Electron. J. Probab. (2021). Extends the framework to sub-Weibull tails and polynomial functionals.
  • Catoni, PAC-Bayesian Supervised Classification (IMS Lecture Notes, 2007). The Bernstein condition plays the central role in PAC-Bayes excess-risk bounds.

Related results:

  • See Hanson-Wright inequality for quadratic forms in sub-Gaussian vectors, which sharpens the products-of-sub-Gaussians theorem from this page.


Last reviewed: April 18, 2026
