Concentration Probability
Sub-Gaussian Random Variables
Sub-Gaussian random variables: the precise characterization of 'light-tailed' behavior that underpins exponential-tail concentration inequalities and the Gaussian-rate generalization bounds in learning theory.
Why This Matters
Concentration inequalities are the engine of learning theory. The simplest concentration bounds --- Markov and Chebyshev --- require only finite first or second moments and give polynomial tail decay. The strongest non-asymptotic bounds --- Hoeffding, Chernoff, Bernstein --- require exponential-moment conditions and give exponential tail decay. Sub-Gaussian random variables are the precise class for which exponential-moment conditions are as tight as if the variable were Gaussian, even when it is not. Most Hoeffding-type generalization bounds in VC and Rademacher theory are sub-Gaussian bounds in disguise; the familiar $O(\sqrt{\log(1/\delta)/n})$ rates come from this class.
If you understand sub-Gaussianity, you understand why exponential-tail concentration works. Heavier tails (sub-exponential, polynomial) need their own machinery.
Tail Class Board
Why the exponent shape matters
Move the threshold and watch three tail assumptions separate. The concentration story is not just small probability; it is whether the exponent is quadratic ($e^{-t^2}$), linear ($e^{-t}$), or only polynomial ($t^{-k}$). Larger thresholds expose the difference between quadratic, linear, and polynomial decay; a bigger scale parameter makes the linear-tail regime arrive earlier and decay more slowly. The exponent shape is what determines whether sample complexity pays a log, square-root-log, or worse penalty.
ML translation
Sub-Gaussian assumptions power clean Hoeffding-style bounds. Squares, products, and chi-squared terms usually require sub-exponential Bernstein bounds instead.
A standard Gaussian saturates the sub-Gaussian tail bound. Whenever a "sub-Gaussian bound" curve appears distinct from a "Gaussian: $N(0,1)$" curve in tail plots, read the sub-Gaussian curve as $c\exp(-t^2/(2\sigma^2))$ with an explicit generic constant corresponding to some $\sigma$. The sub-Gaussian class includes Gaussian tails: a Gaussian $N(0,\sigma^2)$ variable is itself sub-Gaussian with parameter $\sigma$.
Mental Model
A sub-Gaussian random variable has tails that decay at least as fast as those of a Gaussian. Concretely, the probability of being far from the mean drops like $e^{-t^2/(2\sigma^2)}$, not merely $e^{-ct}$ (sub-exponential) or $t^{-k}$ (polynomial). This exponential-squared decay is what gives you the $O(\sqrt{\log(1/\delta)/n})$ rates in learning theory.
The key insight: many non-Gaussian distributions - bounded random variables, Rademacher random variables, and projections of high-dimensional uniform distributions - all satisfy this same tail behavior. Sub-Gaussianity captures exactly what these distributions have in common.
Formal Setup and Notation
Let $X$ be a real-valued random variable with $\mathbb{E}[X] = 0$ (we center without loss of generality).
Sub-Gaussian Random Variable (MGF characterization)
A centered random variable $X$ is sub-Gaussian with parameter $\sigma > 0$ if and only if for all $\lambda \in \mathbb{R}$:
$\mathbb{E}[e^{\lambda X}] \leq \exp\!\left(\frac{\sigma^2\lambda^2}{2}\right)$
The smallest such $\sigma$ is called the sub-Gaussian parameter (or sub-Gaussian proxy variance) of $X$.
This is the workhorse definition. It says the moment generating function of $X$ is dominated by that of a $N(0, \sigma^2)$ random variable.
Sub-Gaussian Norm (Orlicz norm)
The sub-Gaussian norm (or $\psi_2$-norm) of $X$ is:
$\|X\|_{\psi_2} = \inf\left\{t > 0 : \mathbb{E}\!\left[\exp(X^2/t^2)\right] \leq 2\right\}$
A random variable is sub-Gaussian if and only if $\|X\|_{\psi_2} < \infty$. The sub-Gaussian parameter $\sigma$ and the $\psi_2$-norm are related by $\sigma \asymp \|X\|_{\psi_2}$ (up to universal constants).
Tail Characterization
An equivalent characterization: $X$ is sub-Gaussian with parameter $\sigma$ (up to universal constants) if and only if for all $t \geq 0$:
$\mathbb{P}(|X| \geq t) \leq 2\exp\!\left(-\frac{t^2}{2\sigma^2}\right)$
This is the form you will use most often in practice.
Equivalent Characterizations
The following conditions are equivalent (up to universal constants in the parameters):
- MGF condition (for centered $X$): $\mathbb{E}[e^{\lambda X}] \leq e^{K_1^2\lambda^2}$ for all $\lambda \in \mathbb{R}$
- Tail condition: $\mathbb{P}(|X| \geq t) \leq 2e^{-t^2/K_2^2}$ for all $t \geq 0$
- Moment condition: $\|X\|_{L^p} = (\mathbb{E}|X|^p)^{1/p} \leq K_3\sqrt{p}$ for all $p \geq 1$
- Orlicz norm: $\|X\|_{\psi_2} \leq K_4$
The constants across these four statements differ by at most universal multiplicative factors. The equivalence is what makes the sub-Gaussian class robust: you can verify membership via whichever characterization is most convenient.
Proof sketches for the harder directions
Tail $\Rightarrow$ MGF (layer cake). Suppose $\mathbb{P}(|X| \geq t) \leq 2e^{-t^2/K^2}$. For $\lambda > 0$, by the layer-cake formula applied to the non-negative random variable $e^{\lambda X}$:
$\mathbb{E}[e^{\lambda X}] \leq 1 + \int_0^\infty \lambda e^{\lambda t}\,\mathbb{P}(X \geq t)\,dt \leq 1 + 2\lambda\int_0^\infty e^{\lambda t - t^2/K^2}\,dt$
Complete the square in the exponent: $\lambda t - t^2/K^2 = -(t/K - \lambda K/2)^2 + \lambda^2 K^2/4$. The Gaussian integral yields $\mathbb{E}[e^{\lambda X}] \leq e^{C\lambda^2 K^2}$ for a universal $C$. The same argument applied to $-X$ handles $\lambda < 0$.
Moment $\Rightarrow$ MGF (Taylor expansion). Suppose $\|X\|_{L^p} \leq K\sqrt{p}$ for all $p \geq 1$. Expand the MGF term by term:
$\mathbb{E}[e^{\lambda X}] = \sum_{p=0}^\infty \frac{\lambda^p\,\mathbb{E}[X^p]}{p!} \leq 1 + \sum_{p=1}^\infty \frac{|\lambda|^p K^p p^{p/2}}{p!}$
Stirling's bound gives $p! \geq (p/e)^p$. For $|\lambda| \leq c/K$ the series converges to something bounded by $e^{C\lambda^2 K^2}$. Extend to all $\lambda$ via a separate bound in the large-$\lambda$ regime using the tail equivalence.
Interactive: three views of one definition
Same distribution, three separate tests. Switch the tab to compare the tail bound, the log-MGF, and the $\psi_2$ expectation $\mathbb{E}[\exp(X^2/t^2)]$ side by side. Gaussian, Rademacher, and Uniform pass all three. Laplace and the centered Exponential have only exponential (not Gaussian) tails, so they pass none of the three sub-Gaussian conditions globally. The interactive panel inspects each condition on a finite range, so a distribution may appear to satisfy the tail check inside the visible window while still violating the MGF and $\psi_2$ conditions at large enough $t$ or $\lambda$; that finite-window artifact is what the tab labels are showing. Watching a distribution fail in one panel while comparing against another makes the equivalence concrete: each characterization picks up the same sub-Gaussian tail behavior from a different direction, and any genuinely sub-Gaussian variable passes or fails all of them together.
Main Theorems
Sub-Gaussian MGF Implies Tail Bound
Statement
If $X$ is a centered random variable satisfying $\mathbb{E}[e^{\lambda X}] \leq e^{\sigma^2\lambda^2/2}$ for all $\lambda \in \mathbb{R}$, then for all $t \geq 0$:
$\mathbb{P}(X \geq t) \leq \exp\!\left(-\frac{t^2}{2\sigma^2}\right)$
Exact statement
LaTeX source for copy/export
\mathbb{E}[e^{\lambda X}] \leq e^{\sigma^2\lambda^2/2} \Rightarrow \mathbb{P}(X \geq t) \leq \exp\!\left(-\frac{t^2}{2\sigma^2}\right)
Intuition
The MGF condition is a generating condition. It controls all moments simultaneously. The tail bound is a consequence obtained by the Chernoff method: exponentiate, apply Markov, and optimize over $\lambda$. The quadratic exponent in the MGF translates directly to a quadratic exponent in the tail.
Proof Sketch
For any $\lambda > 0$, by Markov's inequality:
$\mathbb{P}(X \geq t) = \mathbb{P}\big(e^{\lambda X} \geq e^{\lambda t}\big) \leq e^{-\lambda t}\,\mathbb{E}[e^{\lambda X}] \leq e^{-\lambda t + \sigma^2\lambda^2/2}$
Minimize over $\lambda$: set $\lambda = t/\sigma^2$ to get $\mathbb{P}(X \geq t) \leq e^{-t^2/(2\sigma^2)}$.
Why It Matters
This is the Chernoff method: the single most important proof technique in concentration. Every time you see an exponential tail bound, this is the engine underneath. Mastering this one argument gives you access to all of sub-Gaussian concentration theory.
Failure Mode
The bound is only tight up to constants. For a true Gaussian, the bound is off by a polynomial factor in $t$: the exact Gaussian tail has a $\frac{1}{t\sqrt{2\pi}}$ prefactor. For non-asymptotic purposes this rarely matters.
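A quick Monte Carlo sketch of the Chernoff-derived bound for a standard Gaussian (sample size and seed are arbitrary choices here), showing the empirical tail sits below $e^{-t^2/2}$ at every threshold:

```python
import numpy as np

# Chernoff/sub-Gaussian bound for N(0,1), sigma = 1:  P(X >= t) <= exp(-t^2/2).
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

for t in [1.0, 2.0, 3.0]:
    empirical = (x >= t).mean()
    bound = np.exp(-t ** 2 / 2)
    # The empirical tail is always below the bound (by a polynomial-in-t factor).
    print(f"t={t}: empirical={empirical:.5f}  bound={bound:.5f}")
```

The gap between the two columns is exactly the $\frac{1}{t\sqrt{2\pi}}$ prefactor discussed above.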
Hoeffding's Lemma
Statement
If $X$ is a centered random variable with $a \leq X \leq b$ almost surely, then for all $\lambda \in \mathbb{R}$:
$\mathbb{E}[e^{\lambda X}] \leq \exp\!\left(\frac{\lambda^2(b-a)^2}{8}\right)$
That is, $X$ is sub-Gaussian with parameter $\sigma = (b-a)/2$.
Exact statement
LaTeX source for copy/export
\mathbb{E}[e^{\lambda X}] \leq \exp\!\left(\frac{\lambda^2(b-a)^2}{8}\right)
Intuition
Bounded random variables cannot have heavy tails. They literally cannot take values beyond $[a, b]$. Hoeffding's lemma quantifies this: the MGF of any bounded centered variable is dominated by a Gaussian MGF. The factor of $(b-a)^2/8$ comes from a convexity argument applied to $e^{\lambda x}$ on the interval $[a, b]$.
Proof Sketch
Since $a \leq X \leq b$, write $X = \alpha a + (1-\alpha)b$ for some random $\alpha \in [0, 1]$. By convexity of $x \mapsto e^{\lambda x}$:
$e^{\lambda X} \leq \alpha e^{\lambda a} + (1-\alpha)e^{\lambda b}$
Use $\mathbb{E}[X] = 0$ to express $\mathbb{E}[\alpha]$ in terms of $a$ and $b$. Then apply the inequality $\log(1 - p + pe^h) \leq ph + h^2/8$ for $p \in [0,1]$ (which follows from Taylor expansion and bounding the remainder). Setting $h = \lambda(b-a)$ gives the result.
Why It Matters
Hoeffding's lemma is the bridge between "bounded" and "sub-Gaussian." It explains why bounded losses (e.g., 0-1 loss) always yield $O(\sqrt{\log(1/\delta)/n})$ concentration, which is exactly what appears in ERM generalization bounds for finite hypothesis classes.
Failure Mode
The bound treats all bounded distributions the same: a two-point variable on $\{a, b\}$ and a uniform on $[a, b]$ get the same sub-Gaussian parameter $(b-a)/2$. For distributions concentrated near their mean, the variance-based Bernstein inequality gives tighter bounds. Hoeffding's lemma is worst-case over the bounded class.
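Hoeffding's lemma can be checked numerically. A sketch (the distribution and grid are arbitrary choices) comparing the exact MGF of a centered Bernoulli(0.2), supported on $[-0.2, 0.8]$ so that $b - a = 1$, against the lemma's bound $e^{\lambda^2(b-a)^2/8}$:

```python
import numpy as np

# Centered Bernoulli(p): value 1-p with prob. p, value -p with prob. 1-p.
# Support is [a, b] = [-p, 1-p], so b - a = 1.
p = 0.2
lams = np.linspace(-10, 10, 2001)

mgf = (1 - p) * np.exp(-p * lams) + p * np.exp((1 - p) * lams)  # exact MGF
hoeffding = np.exp(lams ** 2 / 8)  # exp(lambda^2 (b-a)^2 / 8) with b - a = 1

print(bool(np.all(mgf <= hoeffding)))  # True
```

Near $\lambda = 0$ the exact MGF grows like $1 + \lambda^2\mathrm{Var}(X)/2$ with $\mathrm{Var}(X) = 0.16$, visibly slacker than the worst-case $\lambda^2/8$ curvature; that slack is what Bernstein exploits.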
Canonical Examples
Gaussian random variable
If $X \sim N(0, \sigma^2)$, then $\mathbb{E}[e^{\lambda X}] = e^{\sigma^2\lambda^2/2}$ exactly. Gaussians are sub-Gaussian with parameter exactly $\sigma$. This is the "prototype". The definition is designed so that Gaussians saturate the bound.
Rademacher random variable
Let $\varepsilon$ be a Rademacher variable, so $\mathbb{P}(\varepsilon = +1) = \mathbb{P}(\varepsilon = -1) = 1/2$. Then $-1 \leq \varepsilon \leq 1$, so by Hoeffding's lemma, $\varepsilon$ is sub-Gaussian with parameter $1$. In fact, the exact MGF is $\mathbb{E}[e^{\lambda\varepsilon}] = \cosh\lambda \leq e^{\lambda^2/2}$, confirming sub-Gaussianity directly.
Rademacher variables appear everywhere in symmetrization arguments and Rademacher complexity.
Bounded random variable
If $a \leq X \leq b$ almost surely with $\mathbb{E}[X] = 0$, then $X$ is sub-Gaussian with parameter $(b-a)/2$ by Hoeffding's lemma. This covers:
- 0-1 loss: $\sigma = 1/2$
- Any loss bounded in $[0, B]$ (after centering): $\sigma = B/2$
- Features bounded in $[-B, B]$: $\sigma = B$
Sum of independent sub-Gaussians
If $X_1, \dots, X_n$ are independent, centered, sub-Gaussian with parameters $\sigma_1, \dots, \sigma_n$, then $S_n = \sum_{i=1}^n X_i$ is sub-Gaussian with parameter $\sqrt{\sum_i \sigma_i^2}$.
This follows because MGFs multiply under independence:
$\mathbb{E}[e^{\lambda S_n}] = \prod_{i=1}^n \mathbb{E}[e^{\lambda X_i}] \leq \prod_{i=1}^n e^{\sigma_i^2\lambda^2/2} = e^{\left(\sum_i \sigma_i^2\right)\lambda^2/2}$
For the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ with equal $\sigma_i = \sigma$: the sub-Gaussian parameter is $\sigma/\sqrt{n}$, giving the familiar $1/\sqrt{n}$ concentration.
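The sum rule is easy to see in simulation. A sketch (trial count and threshold are arbitrary choices) comparing the empirical two-sided tail of a Rademacher sample mean, sub-Gaussian with parameter $1/\sqrt{n}$, against the bound $2e^{-nt^2/2}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials, t = 100, 50_000, 0.2

# Rows of n Rademacher variables; each has sigma_i = 1, so the sample
# mean of a row is sub-Gaussian with parameter 1/sqrt(n).
signs = rng.integers(0, 2, size=(trials, n)) * 2 - 1
means = signs.mean(axis=1)

empirical = (np.abs(means) >= t).mean()
bound = 2 * np.exp(-n * t ** 2 / 2)  # two-sided sub-Gaussian tail at t
print(empirical <= bound)  # True
```

Here the bound is $2e^{-2} \approx 0.27$ while the empirical frequency is near the exact Gaussian value $\mathbb{P}(|Z| \geq 2) \approx 0.046$, illustrating again that the Chernoff bound is loose by a polynomial factor but of the right exponential order.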
Common Confusions
Sub-Gaussian parameter is NOT the standard deviation
The sub-Gaussian parameter $\sigma$ is an upper bound on the effective spread, but it is not equal to the standard deviation in general. For a variable supported on $[-1, 1]$ with tiny variance, Hoeffding gives $\sigma = 1$, far larger than the true standard deviation. This is why Bernstein inequalities (which use variance information) are sometimes much tighter than Hoeffding-type bounds.
Not all 'nice' distributions are sub-Gaussian
Exponential, Poisson, and chi-squared random variables are not sub-Gaussian. Their tails decay like $e^{-ct}$ rather than $e^{-ct^2}$. These are sub-exponential. The distinction matters: sub-exponential concentration gives Gaussian-rate bounds only in the small-deviation regime, and requires separate treatment of the small-$t$ and large-$t$ regimes.
The centering assumption is essential
The MGF condition implicitly requires $\mathbb{E}[X] = 0$. If $\mathbb{E}[X] = \mu \neq 0$, you apply the definition to $X - \mu$, not to $X$ directly. Failing to center will give you wrong tail bounds.
Centered vs Non-Centered Sub-Gaussian
The canonical MGF definition assumes $\mathbb{E}[X] = 0$. For non-centered $X$ with $\mathbb{E}[X] = \mu$ and $X - \mu$ sub-Gaussian with parameter $\sigma$:
$\mathbb{E}[e^{\lambda X}] \leq \exp\!\left(\lambda\mu + \frac{\sigma^2\lambda^2}{2}\right) \quad \text{for all } \lambda \in \mathbb{R}$
This is the natural non-centered analog. The linear term in the exponent carries the mean; the quadratic term controls the tails. All downstream closure and concentration results go through after the mean shift.
The $\psi_2$-norm, unlike the sub-Gaussian parameter, does not require centering: it is a norm on $X$ itself, since $\mathbb{E}[\exp(X^2/t^2)]$ depends only on $|X|$. This is one reason the $\psi_2$ formulation is often cleaner to work with.
Why Sub-Gaussian is the Right Abstraction
Three reasons sub-Gaussianity appears everywhere in ML theory:
- Closure under summation: sums of independent sub-Gaussians are sub-Gaussian. This is exactly what you need for sample averages.
- Captures all bounded losses: every bounded loss function produces sub-Gaussian empirical averages. Since most learning theory starts with bounded losses, sub-Gaussianity is the natural first assumption.
- Tight enough for optimal rates: sub-Gaussian concentration gives $O(\sigma/\sqrt{n})$ deviation bounds for sample means, which matches the CLT rate. You cannot do better in general.
The sub-Gaussian class is the largest class of distributions for which the Chernoff method gives Gaussian-quality tail bounds. Going beyond sub-Gaussian (to sub-exponential, or heavier tails) requires different tools and yields weaker concentration.
The Orlicz Norm Hierarchy
The $\psi_2$ norm makes the set of sub-Gaussian random variables a Banach space $L_{\psi_2}$, sitting in a strict inclusion chain on a probability space:
Bounded $\subsetneq$ Sub-Gaussian $\subsetneq$ Sub-exponential $\subsetneq$ All-moments-finite. Each inclusion is strict: a standard Gaussian is sub-Gaussian but not bounded; an $\mathrm{Exp}(1)$ variable is sub-exponential but not sub-Gaussian.
Recall that the $L^p$ spaces on a probability space decrease as $p$ grows, so $L^p \subseteq L^q$ whenever $p \geq q$. Sub-exponential variables have all moments finite: they sit inside every $L^p$ for $p < \infty$.
The $\psi_1$ norm is defined analogously with $\psi_1(x) = e^x - 1$: $\|X\|_{\psi_1} = \inf\{t > 0 : \mathbb{E}[\exp(|X|/t)] \leq 2\}$.
Exact Norms for Common Distributions
| Distribution | Sub-Gaussian parameter | $\psi_2$-norm | Key step |
|---|---|---|---|
| Rademacher | $1$ | $1/\sqrt{\ln 2} \approx 1.201$ | $X^2 = 1$ always, so solve $e^{1/t^2} = 2$ |
| Gaussian $N(0,1)$ | $1$ | $\sqrt{8/3} \approx 1.633$ | $\mathbb{E}[e^{X^2/t^2}] = (1 - 2/t^2)^{-1/2}$ for $t^2 > 2$; set equal to $2$ to get $t^2 = 8/3$ |
| Bounded $[a, b]$, centered | $(b-a)/2$ | $\leq (b-a)/\sqrt{\ln 2}$ | Hoeffding's lemma gives $\sigma = (b-a)/2$ |
| Gaussian $N(0,\sigma^2)$ | $\sigma$ | $\sigma\sqrt{8/3}$ | Homogeneity: $\|cX\|_{\psi_2} = \lvert c\rvert\,\|X\|_{\psi_2}$ |
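The Rademacher row can be verified in a couple of lines (a sketch of the computation, not library code):

```python
import math

# Rademacher: X^2 = 1 always, so E[exp(X^2/t^2)] = exp(1/t^2).
# The psi_2 norm is the smallest t with exp(1/t^2) <= 2, i.e. t = 1/sqrt(ln 2).
t = 1 / math.sqrt(math.log(2))
print(round(t, 4))  # 1.2011
print(abs(math.exp(1 / t ** 2) - 2) < 1e-9)  # True: the constraint is tight
```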
Closure Properties
Sub-Gaussianity is useful precisely because it is preserved under the operations that appear in proofs.
Closure Under Independent Sums
Statement
Suppose $X_1, \dots, X_n$ are independent, centered, and $\|X_i\|_{\psi_2} \leq K$ for all $i$. Then for any $a = (a_1, \dots, a_n) \in \mathbb{R}^n$:
$\left\|\sum_i a_i X_i\right\|_{\psi_2} \leq CK\|a\|_2$
where $C$ is a universal constant.
Exact statement
LaTeX source for copy/export
\left\|\sum_i a_i X_i\right\|_{\psi_2} \leq CK\|a\|_2
Intuition
By independence, MGFs multiply: $\mathbb{E}\big[e^{\lambda\sum_i a_i X_i}\big] = \prod_i \mathbb{E}[e^{\lambda a_i X_i}]$. The $\ell_2$ norm $\|a\|_2$ controls the sub-Gaussian parameter of the sum.
Failure Mode
Without independence, the bound fails. If $X_1 = X_2 = \cdots = X_n = X$, then $\sum_i X_i = nX$ with $\|nX\|_{\psi_2} = n\|X\|_{\psi_2}$, but the theorem would predict $C\sqrt{n}\,\|X\|_{\psi_2}$. The gap between $n$ and $\sqrt{n}$ is exactly the cost of perfect correlation.
Other closure properties:
- Scaling: $\|cX\|_{\psi_2} = |c|\,\|X\|_{\psi_2}$ (immediate from the definition).
- Lipschitz maps: if $f$ is $L$-Lipschitz with $f(0) = 0$, then $\|f(X)\|_{\psi_2} \leq L\|X\|_{\psi_2}$.
- Products break sub-Gaussianity: if $X, Y$ are independent sub-Gaussian, then $XY$ is sub-exponential, not sub-Gaussian. In particular, $X^2$ is sub-exponential whenever $X$ is sub-Gaussian. This is exactly why Bernstein's inequality has two regimes.
Sub-Gaussian Maxima
Maximum of Sub-Gaussians
Statement
If $X_1, \dots, X_n$ are sub-Gaussian with $\|X_i\|_{\psi_2} \leq K$ (independence not required), then $\mathbb{E}[\max_{i \leq n} X_i] \leq CK\sqrt{\log n}$ for a universal constant $C$; in the parameter-$\sigma$ normalization, $\mathbb{E}[\max_{i \leq n} X_i] \leq \sigma\sqrt{2\log n}$. Moreover, for any $t > 0$:
$\mathbb{P}\!\left(\max_{i \leq n} X_i \geq \sigma\sqrt{2\log n} + t\right) \leq e^{-t^2/(2\sigma^2)}$
The maximum is a one-sided event, so there is no factor of $2$ on the right-hand side.
Exact statement
LaTeX source for copy/export
\mathbb{E}\!\left[\max_{i \leq n} X_i\right] \leq CK\sqrt{\log n}Intuition
Intuition
For any $\lambda > 0$: $\mathbb{E}\!\left[\max_i X_i\right] \leq \frac{1}{\lambda}\log\mathbb{E}\!\left[\sum_i e^{\lambda X_i}\right] \leq \frac{1}{\lambda}\log\!\left(n e^{\lambda^2\sigma^2/2}\right) = \frac{\log n}{\lambda} + \frac{\lambda\sigma^2}{2}$. Setting $\lambda = \sqrt{2\log n}/\sigma$ gives $\mathbb{E}[\max_i X_i] \leq \sigma\sqrt{2\log n}$.
Why It Matters
This is where the $\sqrt{\log|\mathcal{H}|}$ term in PAC bounds comes from. Bounding the worst-case deviation over a finite hypothesis class is bounding the maximum of $|\mathcal{H}|$ sub-Gaussian variables. The $\sqrt{\log n}$ overhead is the price of uniformity.
Failure Mode
For sub-exponential variables, the maximum grows as $\log n$ (not $\sqrt{\log n}$). The heavier tails make the worst case worse. For heavy-tailed variables (polynomial tails), the maximum can grow polynomially in $n$.
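The $\sqrt{\log n}$ growth, and its gap to the $\sqrt{2\log n}$ envelope, is visible in simulation; the trial count below is an arbitrary sketch choice:

```python
import numpy as np

rng = np.random.default_rng(2)
results = []
for n in [10, 100, 1000, 10000]:
    # Empirical E[max of n iid N(0,1)], estimated from 1000 trials per n.
    maxima = rng.standard_normal((1000, n)).max(axis=1)
    results.append((n, maxima.mean(), np.sqrt(2 * np.log(n))))

for n, emp, env in results:
    # The empirical mean of the max sits below the sqrt(2 log n) envelope
    # and grows at the same sqrt(log n) rate.
    print(f"n={n:>5}  E[max]~{emp:.3f}  envelope={env:.3f}")
```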
The Sub-Exponential Bridge
The relationship between sub-Gaussian and sub-exponential is precise:
- If $X$ is sub-Gaussian, then $X^2$ is sub-exponential with $\|X^2\|_{\psi_1} = \|X\|_{\psi_2}^2$ (an identity, under the standard Orlicz norm convention $\psi_1(x) = e^x - 1$, $\psi_2(x) = e^{x^2} - 1$).
- Conversely, if $X^2$ is sub-exponential, then $X$ is sub-Gaussian.
This explains why Bernstein's inequality for centered sub-exponential variables has two regimes:
$\mathbb{P}\!\left(\Big|\sum_{i=1}^n X_i\Big| \geq t\right) \leq 2\exp\!\left(-c\min\!\left(\frac{t^2}{\nu^2}, \frac{t}{b}\right)\right), \quad \nu^2 = \sum_i \|X_i\|_{\psi_1}^2, \quad b = \max_i \|X_i\|_{\psi_1}$
For small $t$, the $t^2/\nu^2$ term dominates (sub-Gaussian regime). For large $t$, the $t/b$ term dominates (sub-exponential regime, heavier tail). Setting $t^2/\nu^2 = t/b$ gives the crossover $t = \nu^2/b$ (Vershynin Thm 2.9.1).
Proof Checklist
When writing or reading a sub-Gaussian argument:
- Identify the random variable. What quantity do you want to concentrate? Write it as a single random variable $Z$.
- Center it. Work with $Z - \mathbb{E}[Z]$.
- Decompose if needed. Express $Z$ as a sum or martingale difference sequence.
- Check sub-Gaussianity of each piece. Bounded? Hoeffding gives $\sigma = (b-a)/2$. Lipschitz function of a Gaussian? Gaussian concentration applies.
- Propagate. Independent sum: $\sigma^2 = \sum_i \sigma_i^2$. Martingale: Azuma-Hoeffding. Linear combination: scaling.
- Apply tail bound. $\mathbb{P}(|Z - \mathbb{E}[Z]| \geq t) \leq 2e^{-t^2/(2\sigma^2)}$. Invert to get confidence intervals.
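The last checklist step, inverting the tail bound into a confidence radius, is a one-liner. A sketch for a sample mean of $n$ observations (`subgaussian_radius` is a hypothetical helper name introduced here):

```python
import math

# Invert the two-sided tail P(|mean - mu| >= t) <= 2 exp(-n t^2 / (2 sigma^2))
# at level delta to get the confidence radius t(delta).
def subgaussian_radius(sigma, n, delta):
    return sigma * math.sqrt(2 * math.log(2 / delta) / n)

# sigma = 1, n = 1000 samples, 95% confidence:
r = subgaussian_radius(1.0, 1000, 0.05)
print(round(r, 4))  # 0.0859

# Sanity check: plugging the radius back into the tail bound recovers delta.
print(abs(2 * math.exp(-1000 * r ** 2 / 2) - 0.05) < 1e-12)  # True
```

Note the $\sqrt{\log(1/\delta)/n}$ shape of the radius: confidence is cheap (logarithmic in $1/\delta$) while accuracy is expensive ($1/\sqrt{n}$).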
Exercises
Problem
Let $\varepsilon$ be a Rademacher random variable ($\pm 1$ with equal probability). Verify directly that $\mathbb{E}[e^{\lambda\varepsilon}] = \cosh\lambda \leq e^{\lambda^2/2}$ for all $\lambda \in \mathbb{R}$.
Problem
Let $X_1, \dots, X_n$ be i.i.d. with $\mathbb{E}[X_i] = \mu$ and $a \leq X_i \leq b$ almost surely. Using Hoeffding's lemma and the Chernoff method, prove that:
$\mathbb{P}\!\left(\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right| \geq t\right) \leq 2\exp\!\left(-\frac{2nt^2}{(b-a)^2}\right)$
Problem
Show that the exponential distribution is not sub-Gaussian. Specifically, show that $\mathbb{P}(X \geq t) = e^{-t}$ for $X \sim \mathrm{Exp}(1)$ cannot be bounded by $e^{-ct^2}$ for any constant $c > 0$ when $t$ is large.
Problem
Verify the table entry for $N(0,1)$: $\|X\|_{\psi_2} = \sqrt{8/3}$. Start from $\mathbb{E}[e^{X^2/t^2}] = (1 - 2/t^2)^{-1/2}$ for $t^2 > 2$, and solve for the smallest $t$ with $\mathbb{E}[e^{X^2/t^2}] \leq 2$.
Problem
Let $X_1, \dots, X_n$ be independent sub-Gaussian with parameter $\sigma$. Show that $\mathbb{E}[\max_{i \leq n} X_i] \leq \sigma\sqrt{2\log n}$ using the argument: exponentiate, bound the max by a sum, apply the MGF bound, optimize $\lambda$.
References
- Vershynin, High-Dimensional Probability (2018), Chapters 2-3. Definitive modern treatment using the Orlicz norm framework. Proposition 2.5.2 for the equivalence theorem.
- Boucheron, Lugosi, Massart, Concentration Inequalities (2013), Chapter 2. Uses the entropy method rather than Orlicz norms.
- Wainwright, High-Dimensional Statistics (2019), Chapter 2. Clear exposition connecting sub-Gaussian theory to statistical applications.
- van Handel, Probability in High Dimension (2016), Chapters 1-2. Excellent treatment of sub-Gaussian maxima and chaining.
- Shalev-Shwartz and Ben-David, Understanding Machine Learning (2014), Chapters 26-28. How sub-Gaussian bounds enter learning theory.
- Rigollet and Hutter, High-Dimensional Statistics (MIT lecture notes, 2023), Chapter 1. Concise derivation of all five characterizations with explicit constants.
Next Topics
Natural next steps from sub-Gaussian random variables:
- Sub-exponential random variables: the next distributional class, for heavier tails.
- Hanson-Wright inequality: sub-Gaussian concentration for quadratic forms $X^\top A X$.
- Epsilon-nets and covering numbers: combining sub-Gaussian concentration with geometric arguments. Dudley's chaining integral, developed in empirical processes and chaining, sharpens the maxima bound when the $X_i$ are indexed by a metric space.
- Azuma-Hoeffding inequality via martingale theory: sub-Gaussian concentration for martingale difference sequences, the natural generalization beyond independent sums.
- McDiarmid's inequality: concentration for functions of independent variables with bounded differences. Direct application of sub-Gaussian ideas via a martingale expansion.
- Tsirelson-Ibragimov-Sudakov (Gaussian Lipschitz concentration): for $f$ that is $L$-Lipschitz and $X \sim N(0, I_n)$, $f(X) - \mathbb{E}[f(X)]$ is sub-Gaussian with parameter $L$. The cleanest statement of sub-Gaussian behavior in high dimension.
Last reviewed: April 26, 2026