
Concentration Probability

Sub-Exponential Random Variables

The distributional class between sub-Gaussian and heavy-tailed: heavier tails than Gaussian, the Orlicz norm characterization, Bernstein condition, and the two-regime concentration bound.


Why This Matters

Sub-Gaussian random variables have tails decaying like e^{-ct^2} and give the cleanest concentration bounds. But many quantities in ML are not sub-Gaussian:

  • Chi-squared random variables (sums of squared Gaussians)
  • Products of sub-Gaussian random variables
  • Exponential random variables (waiting times)
  • Squared losses in regression with Gaussian noise

These have tails decaying like e^{-ct}: still exponential, but with a linear rather than quadratic exponent. The sub-exponential class captures exactly this level of tail behavior, and the Bernstein inequality for sub-exponential sums provides the right concentration bound: sub-Gaussian for small deviations, exponential for large deviations.

Tail Class Board

Why the exponent shape matters

Move the threshold and watch three tail assumptions separate. The concentration story is not just small probability; it is whether the exponent is quadratic, linear, or only polynomial.

[Interactive plot: tail probability vs. deviation threshold t, log scale. Larger thresholds expose the difference between quadratic, linear, and polynomial decay; a larger scale makes the linear tail regime arrive earlier and decay more slowly. The sub-exponential curve is quadratic near zero, then linear past the kink at the MGF boundary.]

Formula shape

The exponent shape is what determines whether sample complexity pays a log, square-root-log, or worse penalty.

ML translation

Sub-Gaussian assumptions power clean Hoeffding-style bounds. Squares, products, and chi-squared terms usually require sub-exponential Bernstein bounds instead.

Mental Model

The tail decay hierarchy is:

| Class | Tail decay | MGF | Example |
| --- | --- | --- | --- |
| Sub-Gaussian | e^{-ct^2} | Finite for all \lambda \in \mathbb{R} | Bounded, Gaussian |
| Sub-exponential | e^{-ct} | Finite for \vert\lambda\vert < \lambda_0 | \chi^2, Exponential |
| Heavy-tailed | 1/t^p | Diverges for any \lambda \neq 0 | Cauchy, Pareto |

Sub-exponential is the intermediate class. The MGF exists in a neighborhood of zero (not everywhere), which means the Chernoff method works for small \lambda but breaks for large \lambda. This produces a concentration bound with two regimes.
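The hierarchy can be seen numerically. The sketch below compares -log P(tail) at a few thresholds for one representative of each class: a standard Gaussian, an Exp(1), and a Pareto-style 1/t^2 tail (the specific distributions are illustrative stand-ins, not the only members of each class).

```python
# Illustrative sketch: how fast -log P(|X| >= t) grows for the three tail classes.
import math

def gaussian_tail(t):       # P(|Z| >= t) for Z ~ N(0, 1), exact via erfc
    return math.erfc(t / math.sqrt(2))

def exponential_tail(t):    # P(X >= t) for X ~ Exp(1)
    return math.exp(-t)

def pareto_tail(t, p=2.0):  # P(X >= t) = 1/t^p for t >= 1 (Pareto-type tail)
    return min(1.0, t ** (-p))

for t in [2.0, 4.0, 6.0]:
    print(t,
          -math.log(gaussian_tail(t)),     # grows ~ t^2/2 (quadratic exponent)
          -math.log(exponential_tail(t)),  # grows ~ t     (linear exponent)
          -math.log(pareto_tail(t)))       # grows ~ p*log(t) (only logarithmic)
```

At t = 6 the three exponents are already separated by orders of magnitude, which is exactly the separation the interactive plot above is demonstrating.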

Formal Setup and Definitions

Let X be a centered random variable (\mathbb{E}[X] = 0).

Definition

Sub-Exponential Random Variable (MGF characterization)

A centered random variable X is sub-exponential with parameters (\nu^2, b) if for all |\lambda| < 1/b:

\mathbb{E}[e^{\lambda X}] \leq \exp\!\left(\frac{\nu^2 \lambda^2}{2}\right)

The parameter \nu^2 plays the role of a "variance proxy" (controlling the sub-Gaussian regime), and b controls the radius where the MGF bound holds.

Definition

Sub-Exponential Norm (Orlicz ψ₁ norm)

The sub-exponential norm (or \psi_1-norm) of X is:

\|X\|_{\psi_1} = \inf\{t > 0 : \mathbb{E}[e^{|X|/t}] \leq 2\}

A random variable is sub-exponential if and only if \|X\|_{\psi_1} < \infty.

Compare to the sub-Gaussian norm: \|X\|_{\psi_2} = \inf\{t > 0 : \mathbb{E}[e^{X^2/t^2}] \leq 2\}. The \psi_1 norm uses e^{|X|/t} (linear in |X|) while \psi_2 uses e^{X^2/t^2} (quadratic). The linear growth in the exponent is what makes sub-exponential tails heavier than sub-Gaussian.
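The \psi_1 norm can be computed exactly in simple cases. For X \sim \text{Exp}(1) the defining expectation has the closed form \mathbb{E}[e^{X/t}] = t/(t-1) for t > 1, so the infimum solves t/(t-1) = 2, giving \|X\|_{\psi_1} = 2. A minimal sketch that recovers this by bisection:

```python
# Sketch: psi_1 norm of X ~ Exp(1) via bisection on E[exp(|X|/t)] <= 2.
# Uses the closed-form MGF of Exp(1); exact answer is t = 2.

def mgf_abs_exp1(t):
    """E[exp(X/t)] for X ~ Exp(1); finite only for t > 1."""
    return t / (t - 1.0)

def psi1_norm_exp1(lo=1.0 + 1e-12, hi=100.0, iters=100):
    """Bisection for the smallest t with E[exp(X/t)] <= 2."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mgf_abs_exp1(mid) <= 2.0:
            hi = mid
        else:
            lo = mid
    return hi

print(psi1_norm_exp1())  # converges to the exact value 2
```

The same bisection template works for any variable whose MGF of |X| is computable, analytically or by Monte Carlo.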

Definition

Bernstein Condition

A centered random variable X satisfies the Bernstein condition with parameters (\sigma^2, b) if for all integers k \geq 2:

\mathbb{E}[|X|^k] \leq \frac{k!}{2} \sigma^2 b^{k-2}

This is equivalent (up to constants) to being sub-exponential with parameters (\sigma^2, b). The condition controls all moments: the k-th moment grows at most like k! \cdot b^k, compared to sub-Gaussian where moments grow like (C\sqrt{k})^k.

Equivalent Characterizations

The following are equivalent (up to constants in parameters):

  1. MGF condition: \mathbb{E}[e^{\lambda X}] \leq e^{\nu^2\lambda^2/2} for |\lambda| < 1/b
  2. Tail condition: \mathbb{P}(|X| \geq t) \leq 2\exp(-t/C_1) for all t > 0
  3. Moment condition: (\mathbb{E}[|X|^k])^{1/k} \leq C_2 k for all k \geq 1
  4. Orlicz norm: \|X\|_{\psi_1} \leq C_3
  5. Bernstein condition: \mathbb{E}[|X|^k] \leq \frac{k!}{2}\sigma^2 b^{k-2} for k \geq 2

Compare to sub-Gaussian characterization 3: (\mathbb{E}[|X|^k])^{1/k} \leq C\sqrt{k} for all k \geq 1. Sub-exponential moments grow like k; sub-Gaussian moments grow like \sqrt{k}. Both forms use k \geq 1. The extra factor of \sqrt{k} is the precise difference between the two classes.
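The moment characterizations can be checked numerically against closed forms: for X \sim \text{Exp}(1), \mathbb{E}[X^k] = k!, so (\mathbb{E}[X^k])^{1/k} = (k!)^{1/k} \approx k/e grows linearly in k; for Z \sim \mathcal{N}(0,1), \mathbb{E}[|Z|^k] = 2^{k/2}\Gamma((k+1)/2)/\sqrt{\pi} grows like \sqrt{k}. A sketch:

```python
# Sketch: moment growth rates distinguishing sub-exponential from sub-Gaussian.
import math

def exp_moment_root(k):
    # (E[X^k])^(1/k) for X ~ Exp(1), using E[X^k] = k!
    return math.factorial(k) ** (1.0 / k)

def gauss_abs_moment_root(k):
    # (E[|Z|^k])^(1/k) for Z ~ N(0,1), closed-form absolute moment
    m = 2 ** (k / 2) * math.gamma((k + 1) / 2) / math.sqrt(math.pi)
    return m ** (1.0 / k)

for k in [4, 8, 16, 32]:
    # these ratios stabilize: ~constant for Exp(1)/k, ~constant for N(0,1)/sqrt(k)
    print(k, exp_moment_root(k) / k, gauss_abs_moment_root(k) / math.sqrt(k))
```

The k-th moment root of the exponential pulls away from the Gaussian one by a factor of order \sqrt{k}, exactly as characterization 3 predicts.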

Main Theorems

Theorem

Products of Sub-Gaussians are Sub-Exponential

Statement

If X and Y are sub-Gaussian random variables (not necessarily independent), then the product XY is sub-exponential, with:

\|XY\|_{\psi_1} \leq \|X\|_{\psi_2} \cdot \|Y\|_{\psi_2}

Intuition

Multiplying two "Gaussian-like" variables produces something "exponential-like." A concrete example: if Z \sim \mathcal{N}(0, 1), then Z is sub-Gaussian, but Z^2 = Z \cdot Z is chi-squared (with 1 degree of freedom) and sub-exponential. The tails get heavier because multiplying two variables that can be large makes the product even larger.

Proof Sketch

Tail bound (qualitative). If |X| \cdot |Y| \geq t, then at least one of |X| \geq \sqrt{t} or |Y| \geq \sqrt{t} must hold. By union bound:

\mathbb{P}(|XY| \geq t) \leq \mathbb{P}(|X| \geq \sqrt{t}) + \mathbb{P}(|Y| \geq \sqrt{t}) \leq 2e^{-ct} + 2e^{-ct} = 4e^{-ct}

where each sub-Gaussian tail is evaluated at \sqrt{t}, yielding 2e^{-c(\sqrt{t})^2} = 2e^{-ct}. The resulting tail e^{-ct} is sub-exponential.

Norm bound via AM-GM. The sharper argument uses the pointwise bound |XY| \leq (X^2 + Y^2)/2. Combined with the key identity relating the \psi_1 and \psi_2 norms:

\|X^2\|_{\psi_1} = \|X\|_{\psi_2}^2

(which follows directly from the definitions: \mathbb{E}[e^{X^2/t}] \leq 2 means \mathbb{E}[e^{X^2/(\sqrt{t})^2}] \leq 2), we get:

\|XY\|_{\psi_1} \leq \tfrac{1}{2}\|X^2\|_{\psi_1} + \tfrac{1}{2}\|Y^2\|_{\psi_1} = \tfrac{1}{2}\|X\|_{\psi_2}^2 + \tfrac{1}{2}\|Y\|_{\psi_2}^2

Applying the same argument to the rescaled variables X/\|X\|_{\psi_2} and Y/\|Y\|_{\psi_2} gives a \psi_1 norm at most 1 for their product; multiplying back through by the norms yields the product form \|XY\|_{\psi_1} \leq \|X\|_{\psi_2}\|Y\|_{\psi_2}.

Why It Matters

This explains why chi-squared variables, quadratic forms, and many statistics in learning theory are sub-exponential: they involve products or squares of sub-Gaussian quantities. When you see X^\top X or \|Xw\|^2 in a bound, expect sub-exponential concentration, not sub-Gaussian.

Failure Mode

The product of two sub-exponential variables is generally not sub-exponential: its tails may decay only like e^{-c\sqrt{t}} (sub-Weibull), or slower. Sub-exponential is closed under addition but not under multiplication, which limits the composability of sub-exponential bounds.
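A Monte Carlo sanity check of the theorem: for independent Z_1, Z_2 \sim \mathcal{N}(0,1), the empirical -\log \mathbb{P}(|Z_1 Z_2| \geq t) should grow roughly linearly in t, far slower than the quadratic t^2/2 growth a Gaussian tail would show. A sketch (sample size and thresholds are arbitrary choices):

```python
# Sketch: empirical tail of a product of two independent standard normals.
import numpy as np

rng = np.random.default_rng(0)
n = 2_000_000
prod = rng.standard_normal(n) * rng.standard_normal(n)

for t in [1.0, 2.0, 3.0, 4.0]:
    p = np.mean(np.abs(prod) >= t)   # empirical P(|Z1 Z2| >= t)
    print(t, p, -np.log(p))          # -log p increases by ~1 per unit of t
```

Successive increments of -\log p are close to 1 per unit of t, the signature of a linear-exponent (sub-exponential) tail; a sub-Gaussian tail would show increments growing with t.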

Theorem

Bernstein Inequality for Sub-Exponential Sums

Statement

Let X_1, \ldots, X_n be independent centered sub-exponential random variables with parameters (\nu_i^2, b). Then for any t > 0:

\mathbb{P}\!\left(\left|\sum_{i=1}^n X_i\right| \geq t\right) \leq 2\exp\!\left(-c \min\!\left(\frac{t^2}{\sum_i \nu_i^2},\, \frac{t}{b}\right)\right)

where c > 0 is a universal constant. For i.i.d. variables with common parameter \nu^2 and sample mean \bar{X}_n = \frac{1}{n}\sum_i X_i:

\mathbb{P}\!\left(|\bar{X}_n| \geq t\right) \leq 2\exp\!\left(-cn\min\!\left(\frac{t^2}{\nu^2},\, \frac{t}{b}\right)\right)

Intuition

The bound has two regimes separated by the threshold t^* = \nu^2/b:

Small deviations (t \leq t^*): The t^2/\nu^2 term dominates the minimum. The bound is \exp(-cnt^2/\nu^2), a sub-Gaussian tail. For moderate deviations, the sub-exponential variable behaves as if it were sub-Gaussian.

Large deviations (t > t^*): The t/b term dominates. The bound is \exp(-cnt/b), an exponential (not Gaussian) tail. For large deviations, the heavier tails kick in and concentration is weaker.

The transition between regimes is smooth. The sub-Gaussian regime gives the familiar \sqrt{\log(1/\delta)/n} rates for moderate confidence levels.

Proof Sketch

Use the Chernoff method. For 0 < \lambda < 1/b:

\mathbb{P}\!\left(\sum_i X_i \geq t\right) \leq e^{-\lambda t} \prod_i \mathbb{E}[e^{\lambda X_i}] \leq e^{-\lambda t} \exp\!\left(\frac{\lambda^2 \sum_i \nu_i^2}{2}\right)

Optimize over \lambda \in (0, 1/b):

  • If the unconstrained optimum \lambda^* = t/\sum_i\nu_i^2 satisfies \lambda^* < 1/b, we are in the sub-Gaussian regime: the bound is \exp(-t^2/(2\sum_i\nu_i^2)).
  • If \lambda^* \geq 1/b, set \lambda = 1/(2b) (safely inside the boundary): the bound is \exp(-t/(2b) + \sum_i\nu_i^2/(8b^2)). For large t, this gives \exp(-ct/b).

Taking the minimum of the two cases gives the stated bound.

Why It Matters

This is the correct concentration bound for quantities like \frac{1}{n}\sum_i (X_i^2 - \mathbb{E}[X_i^2]) where each X_i is Gaussian. The sub-Gaussian bound does not apply (these are not sub-Gaussian), but the Bernstein bound does. The two-regime behavior is what you actually observe in practice: moderate deviations look Gaussian, but extreme deviations reveal the heavier tails.

Failure Mode

The bound requires knowing (or bounding) both \nu^2 and b. For the sample variance of Gaussian data, \nu^2 = O(\sigma^4) and b = O(\sigma^2). If these parameters are poorly estimated, the bound may be loose in one or both regimes.
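The two-regime structure is easy to make concrete in code. Below is a minimal sketch of the i.i.d. Bernstein bound as a function; the constant c = 0.25 is an arbitrary illustrative stand-in for the unspecified universal constant, and the parameters \nu^2 = 4, b = 4 match the per-summand centered chi-squared example discussed later.

```python
# Sketch: the two-regime Bernstein bound with the regime switch made explicit.
import math

def bernstein_bound(t, n, nu2, b, c=0.25):
    """2 * exp(-c * n * min(t^2 / nu2, t / b)) for the sample mean."""
    return 2.0 * math.exp(-c * n * min(t * t / nu2, t / b))

nu2, b, n = 4.0, 4.0, 100
t_star = nu2 / b  # regime boundary: here t* = 1

# Below t*, the t^2 branch of the min is active (sub-Gaussian regime);
# above it, the t branch takes over (exponential regime).
for t in [0.25 * t_star, 0.5 * t_star, 2 * t_star, 4 * t_star]:
    branch = "t^2/nu2" if t <= t_star else "t/b"
    print(t, bernstein_bound(t, n, nu2, b), branch)
```

Doubling t below t^* multiplies the exponent by 4 (quadratic regime); doubling t above t^* multiplies it only by 2 (linear regime).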

Key Examples of Sub-Exponential Variables

Example

Chi-squared with k degrees of freedom

If Z_1, \ldots, Z_k \sim \mathcal{N}(0, 1) independently, then W = \sum_{j=1}^k Z_j^2 \sim \chi^2(k) with \mathbb{E}[W] = k and \text{Var}(W) = 2k.

The centered variable W - k is sub-exponential with parameters \nu^2 = 4k and b = 4 (Wainwright 2019, Example 2.4, using the normalization where each Z_j^2 - 1 has sub-exponential parameters (4, 4)). It is not sub-Gaussian because each Z_j^2 has MGF \mathbb{E}[e^{\lambda Z_j^2}] = (1 - 2\lambda)^{-1/2}, which diverges at \lambda = 1/2. A sub-Gaussian MGF would be finite for all \lambda.

The Bernstein bound gives: \mathbb{P}(|W - k| \geq t) \leq 2\exp(-c\min(t^2/k, t)).

Laurent-Massart bound (canonical form). The sharpest explicit constants come from Laurent and Massart (2000), Lemma 1: for all t > 0,

\mathbb{P}(W - k \geq 2\sqrt{kt} + 2t) \leq e^{-t}, \qquad \mathbb{P}(k - W \geq 2\sqrt{kt}) \leq e^{-t}

The two-term form on the upper tail exposes both regimes directly: the 2\sqrt{kt} piece is the sub-Gaussian contribution (dominant when t \lesssim k), the 2t piece is the exponential contribution (dominant when t \gtrsim k). This is the standard reference for chi-squared tails in high-dimensional statistics.
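The Laurent-Massart upper tail is simple to verify by simulation. A sketch (k = 10 and the sample size are arbitrary choices; the empirical tail should sit below e^{-t} at every t):

```python
# Sketch: Monte Carlo check of the Laurent-Massart upper-tail bound
# P(W - k >= 2*sqrt(k*t) + 2*t) <= exp(-t) for W ~ chi^2(k).
import numpy as np

rng = np.random.default_rng(1)
k, n = 10, 500_000
W = rng.chisquare(k, size=n)

for t in [0.5, 1.0, 2.0, 3.0]:
    thresh = k + 2.0 * np.sqrt(k * t) + 2.0 * t
    p_hat = np.mean(W >= thresh)
    print(t, p_hat, np.exp(-t))  # empirical tail vs. the e^{-t} guarantee
```

The gap between the empirical tail and e^{-t} is substantial at small t and narrows at large t, reflecting that the bound is tuned for the large-deviation regime.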

Example

Exponential distribution

If X \sim \text{Exp}(1), then X - 1 is centered with \mathbb{E}[e^{\lambda(X-1)}] = e^{-\lambda}/(1 - \lambda) for \lambda < 1.

This is finite only for \lambda < 1, confirming sub-exponential (not sub-Gaussian) behavior. The \psi_1 norm is \|X - 1\|_{\psi_1} = O(1).
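The closed-form MGF is easy to cross-check by Monte Carlo. The sketch below evaluates it at \lambda = 0.3, chosen safely inside the radius of convergence: the naive estimator needs 2\lambda < 1 for finite variance, so values near \lambda = 1 would be numerically treacherous.

```python
# Sketch: Monte Carlo check of E[exp(lambda*(X-1))] = exp(-lambda)/(1-lambda)
# for X ~ Exp(1), at a lambda well inside the MGF's radius of convergence.
import math
import numpy as np

rng = np.random.default_rng(2)
X = rng.exponential(1.0, size=1_000_000)

lam = 0.3
mc = np.mean(np.exp(lam * (X - 1.0)))   # Monte Carlo estimate
exact = math.exp(-lam) / (1.0 - lam)    # closed form, valid for lam < 1
print(mc, exact)                        # should be close (MC error ~1e-3)
```

Repeating this with lam approaching 1 shows the estimate and its variance blowing up, a hands-on view of the MGF boundary.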

Example

Product of two independent standard normals

Let Z_1, Z_2 \sim \mathcal{N}(0, 1) independently. The product Z_1 Z_2 has \mathbb{E}[Z_1 Z_2] = 0, and its density (a variance-gamma law with parameter 0) has Laplace-like tails:

\mathbb{P}(|Z_1 Z_2| \geq t) \leq C e^{-c t} \qquad \text{for all } t > 0

with universal constants C, c > 0. This is a genuine sub-exponential (linear-exponent) tail, not an e^{-c\sqrt{t}} tail. The naive union bound \mathbb{P}(|Z_1| \geq \sqrt{t}) + \mathbb{P}(|Z_2| \geq \sqrt{t}) \leq 4e^{-ct} from the proof sketch above already delivers this rate, because substituting \sqrt{t} into the Gaussian tail e^{-u^2/2} yields e^{-t/2}.

The norm bound \|Z_1 Z_2\|_{\psi_1} \leq \|Z_1\|_{\psi_2} \|Z_2\|_{\psi_2} = O(1) confirms sub-exponential membership.

Sub-Exponential Closure Properties

  1. Sum closure (worst case). For any sub-exponential X_1, \ldots, X_n (not necessarily independent), \sum_i X_i is sub-exponential with \|\sum_i X_i\|_{\psi_1} \leq C \sum_i \|X_i\|_{\psi_1}. Unlike sub-Gaussian, the norm adds linearly, not in quadrature. This is tight under arbitrary correlation.

    Independent sums (Bernstein regime). When the X_i are independent, the linear bound is pessimistic: the Bernstein inequality above gives a two-regime concentration bound for \sum_i X_i with a sub-Gaussian small-deviation regime controlled by \sum_i \nu_i^2. This is a concentration statement about the sum, not a norm identity, and it is what actually drives rates in ML theory.

  2. Scalar multiplication: \|aX\|_{\psi_1} = |a| \cdot \|X\|_{\psi_1}.

  3. Product of sub-Gaussians: \|XY\|_{\psi_1} \leq \|X\|_{\psi_2} \|Y\|_{\psi_2}.

  4. Not closed under products: The product of two sub-exponential variables may be heavier than sub-exponential.

Connection to Bernstein's Classical Inequality

The classical Bernstein inequality (from the concentration-inequalities page) for bounded random variables:

\mathbb{P}\!\left(\sum_i (X_i - \mu_i) \geq t\right) \leq \exp\!\left(-\frac{t^2/2}{\sum_i \sigma_i^2 + Mt/3}\right)

is a special case of the sub-exponential Bernstein inequality. Bounded random variables are sub-exponential (since they are even sub-Gaussian), and the denominator \sum_i\sigma_i^2 + Mt/3 captures both regimes: sub-Gaussian when t \ll \sum_i\sigma_i^2/M and exponential when t \gg \sum_i\sigma_i^2/M.

Equivalence of classical and modern forms. The modern min-form \exp(-c\min(t^2/\sum_i \nu_i^2, t/b)) and the classical quotient form \exp(-t^2/(2\sum_i\nu_i^2 + 2bt)) are equivalent up to universal constants. This follows from the elementary comparison \min(a, a') \asymp aa'/(a + a'): writing a = t^2/\sum_i\nu_i^2 and a' = t/b, both expressions interpolate between the sub-Gaussian regime (small t) and the exponential regime (large t) at the same threshold t^* \asymp \sum_i\nu_i^2/b.
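The elementary comparison driving this equivalence, \tfrac{1}{2}\min(a, a') \leq aa'/(a + a') \leq \min(a, a') for positive a, a', can be checked exhaustively on a grid spanning both regimes:

```python
# Sketch: numeric check of min(a, a') ~ a*a'/(a + a'), the comparison behind
# the equivalence of the min-form and quotient-form Bernstein bounds.
import itertools

vals = [10.0 ** e for e in range(-3, 4)]  # spans a << a' and a >> a'
for a, b in itertools.product(vals, vals):
    harmonic = a * b / (a + b)
    assert 0.5 * min(a, b) <= harmonic <= min(a, b)
print("comparison verified on", len(vals) ** 2, "pairs")
```

The upper bound holds because a + b \geq \max(a, b), and the lower bound because a + b \leq 2\max(a, b); the loop is only a sanity check of that two-line argument.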

Common Confusions

Watch Out

Sub-exponential is BETWEEN sub-Gaussian and heavy-tailed

Every sub-Gaussian variable is sub-exponential, but not vice versa. Sub-exponential is a weaker condition (allows heavier tails). The hierarchy is: bounded \subset sub-Gaussian \subset sub-exponential \subset finite variance \subset all distributions.

Watch Out

The MGF bound holds only for small lambda

For sub-Gaussian variables, \mathbb{E}[e^{\lambda X}] \leq e^{C\lambda^2} for all \lambda. For sub-exponential, the bound holds only for |\lambda| < 1/b. Beyond this, the MGF may diverge. This is why the Chernoff method produces two regimes: you can only optimize \lambda up to the boundary 1/b.

Watch Out

ψ₁ norm vs ψ₂ norm

\psi_2 (sub-Gaussian) uses e^{X^2/t^2} in the definition: the square of X in the exponent. \psi_1 (sub-exponential) uses e^{|X|/t}: the absolute value of X. The square in \psi_2 is what forces the tails to be Gaussian-like. Every sub-Gaussian variable is sub-exponential because e^{|X|/t} \leq 1 + e^{X^2/t^2} for appropriate t.

Summary

  • Sub-exponential tails: \mathbb{P}(|X| \geq t) \leq 2e^{-ct} (linear exponent)
  • Sub-Gaussian tails: \mathbb{P}(|X| \geq t) \leq 2e^{-ct^2} (quadratic exponent)
  • Products of sub-Gaussians are sub-exponential (\psi_2 \cdot \psi_2 \to \psi_1)
  • Chi-squared and squared losses are sub-exponential, not sub-Gaussian
  • Bernstein bound has two regimes: sub-Gaussian for small t, exponential for large t
  • Threshold between regimes: t^* = \nu^2/b
  • \psi_1 norms add linearly for arbitrary sums; for independent sub-Gaussian sums, variance proxies add in quadrature

Exercises

ExerciseCore

Problem

Show that X \sim \text{Exp}(1) is sub-exponential but not sub-Gaussian. Compute the MGF \mathbb{E}[e^{\lambda(X-1)}] and show it is finite only for \lambda < 1.

ExerciseCore

Problem

Let Z \sim \mathcal{N}(0, 1). Show that Z^2 - 1 (centered chi-squared with 1 degree of freedom) is sub-exponential by verifying the tail bound \mathbb{P}(Z^2 - 1 \geq t) \leq e^{-ct} for large t.

ExerciseAdvanced

Problem

Let X_1, \ldots, X_n \sim \mathcal{N}(0, \sigma^2) independently. Using the Bernstein inequality for sub-exponential variables, bound \mathbb{P}\!\left(\left|\frac{1}{n}\sum_i X_i^2 - \sigma^2\right| \geq t\right). Identify the two regimes.


References

Canonical:

  • Vershynin, High-Dimensional Probability (2018), Sections 2.7-2.8
  • Boucheron, Lugosi, Massart, Concentration Inequalities (2013), Chapter 2
  • Laurent and Massart, "Adaptive Estimation of a Quadratic Functional by Model Selection," Annals of Statistics, 28(5):1302-1338 (2000). Canonical chi-squared tail constants (Lemma 1).

Current:

  • Wainwright, High-Dimensional Statistics (2019), Chapter 2
  • Rigollet and Hutter, High-Dimensional Statistics (MIT lecture notes, 2023)
  • van Handel, Probability in High Dimension (2016), Chapters 1-3
  • Götze, Sambale, Sinulis, "Concentration inequalities for polynomials in \alpha-sub-exponential random variables," Electron. J. Probab. (2021). Extends the framework to sub-Weibull tails and polynomial functionals.
  • Catoni, PAC-Bayesian Supervised Classification (IMS Lecture Notes, 2007). The Bernstein condition plays the central role in PAC-Bayes excess-risk bounds.

Related results:

  • See Hanson-Wright inequality for quadratic forms in sub-Gaussian vectors, which sharpens the products-of-sub-Gaussians theorem from this page.


Last reviewed: April 18, 2026
