Foundations

Expectation, Variance, Covariance, and Moments

Expectation as average value, variance as squared spread, covariance as joint movement, and the moment identities used throughout statistics and ML.

Core · Tier 1 · Stable · Supporting · ~55 min

Why This Matters

Expectation is the average value of a random variable over repeated draws. Variance measures how far that variable tends to move away from its mean. Covariance measures whether two variables move together.

Those three ideas show up constantly in ML. Expected loss is population risk. Variance of stochastic gradients controls how noisy SGD updates are. Covariance matrices describe the geometry of data clouds and appear in PCA, Gaussian processes, whitening, natural gradients, and Kalman filters.

Quick Version

| Object | Plain meaning | What it answers |
| --- | --- | --- |
| \mathbb E[X] | long-run average value | Where is the center? |
| \operatorname{Var}(X) | average squared distance from the mean | How spread out is X? |
| \operatorname{Cov}(X,Y) | average product of centered deviations | Do X and Y move together? |
| \rho(X,Y) | scale-free covariance | How strong is the linear relationship? |
| \mathbb E[X^k] | k-th moment | What shape information survives averaging? |

Do not treat these as formulas to memorize separately. They are all averages of transformed random variables. Once expectation is understood, variance and covariance are just expectation applied to centered products.

Example

A tiny distribution shows all three ideas

Let X take values 0 and 2 with probabilities 1/4 and 3/4.

\mathbb E[X]=0\cdot\frac14+2\cdot\frac34=\frac32.

Variance averages squared deviations from that mean:

\operatorname{Var}(X)=\left(0-\frac32\right)^2\frac14+\left(2-\frac32\right)^2\frac34=\frac34.

The standard deviation is \sqrt{3}/2. The mean tells you where the distribution balances; the variance tells you how far the mass sits from that balance point.
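The tiny distribution above can be checked directly from the definitions; this is a minimal sketch in plain Python with no libraries assumed:

```python
# X takes value 0 with probability 1/4 and value 2 with probability 3/4.
values = [0.0, 2.0]
probs = [0.25, 0.75]

# E[X] = sum of x * P(X = x).
mean = sum(x * p for x, p in zip(values, probs))

# Var(X) = E[(X - E[X])^2], the average squared deviation from the mean.
var = sum((x - mean) ** 2 * p for x, p in zip(values, probs))

print(mean)  # 1.5
print(var)   # 0.75
```

The same two sums work for any finite distribution; only the value/probability lists change.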

Visual: Covariance Geometry

[Interactive visual: a joint Gaussian contour (covariance ellipse) separates marginal spread from directional tilt, so covariance becomes visible instead of purely algebraic. The variances set the axis lengths and stay fixed; only the off-diagonal covariance changes. With zero covariance the axes align with the coordinate axes, and the major axis rotates as \rho changes sign: negative \rho tilts the ellipse one way, positive \rho the other.]

Core Definitions

Definition

Expectation

For a discrete random variable: \mathbb{E}[X] = \sum_x x \, P(X = x). For a continuous random variable with density f: \mathbb{E}[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx. The expectation exists when the sum or integral is absolutely convergent.

More generally, for a measurable function g: \mathbb{E}[g(X)] = \int g(x) \, dP_X(x).

Definition

Variance

The variance measures spread around the mean:

\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2

The second form (the computational formula) is often easier to use. The standard deviation is \sigma = \sqrt{\text{Var}(X)}.

Definition

Covariance

The covariance measures linear association between two random variables:

\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]

\text{Cov}(X, X) = \text{Var}(X). For a random vector X \in \mathbb{R}^d, the covariance matrix has entries \Sigma_{ij} = \text{Cov}(X_i, X_j).

Definition

Correlation

The Pearson correlation normalizes covariance to [-1, 1]:

\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \, \text{Var}(Y)}}

|\rho| = 1 iff Y is an affine function of X almost surely. \rho = 0 means uncorrelated (not necessarily independent).

Definition

Moments

The k-th moment of X is \mathbb{E}[X^k]. The k-th central moment is \mathbb{E}[(X - \mathbb{E}[X])^k]. The third central moment (normalized) is skewness. The fourth (normalized) is kurtosis. Heavy-tailed distributions have large kurtosis.

Covariance vs Correlation

Covariance and correlation both measure linear association, but they serve different purposes. Covariance is a bilinear form that participates in algebraic computations: the variance of a sum formula, the covariance matrix in PCA, the Kalman gain equation. Its magnitude depends on the scale of the variables, so \text{Cov}(X, Y) = 500 is meaningless without knowing the units.

Correlation normalizes away scale: \rho(X, Y) \in [-1, 1] regardless of units. Use correlation for interpretation (how strong is the linear relationship?) and covariance for computation (what is the variance of a portfolio return?).

Two critical points. First, \rho = 0 does not imply independence. It only rules out linear dependence. Second, correlation measures linear association only. Variables with strong nonlinear dependence can have \rho = 0. For broader dependence measures, see mutual information or rank correlations (Spearman, Kendall).
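The scale dependence of covariance versus the scale invariance of correlation can be seen numerically. This sketch uses NumPy with made-up height/weight data (the variable names, seed, and coefficients are illustrative assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linearly related data: heights in meters, weights in kg.
height_m = rng.normal(1.7, 0.1, size=100_000)
weight_kg = 50 + 30 * height_m + rng.normal(0, 5, size=100_000)

def cov(a, b):
    # E[(X - E[X])(Y - E[Y])] as a sample average.
    return float(np.mean((a - a.mean()) * (b - b.mean())))

def corr(a, b):
    # Covariance normalized by the two standard deviations.
    return cov(a, b) / (a.std() * b.std())

# Same quantity, different units: covariance scales by 100, correlation doesn't.
height_cm = 100 * height_m

print(cov(height_m, weight_kg), cov(height_cm, weight_kg))
print(corr(height_m, weight_kg), corr(height_cm, weight_kg))
```

The first line prints two covariances differing by a factor of 100; the second prints the same correlation twice, which is exactly the "interpretation vs computation" split above.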

Key Properties

Theorem

Linearity of Expectation

Statement

\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]

This extends to any finite sum: \mathbb{E}[\sum_{i=1}^n a_i X_i] = \sum_{i=1}^n a_i \mathbb{E}[X_i].


Intuition

No independence or uncorrelatedness is required. This holds for arbitrary dependence structure. It is the single most-used property in probabilistic analysis.

Proof Sketch

For continuous random variables with joint density f_{X,Y}:

\mathbb{E}[aX + bY] = \iint (ax + by) f_{X,Y}(x,y) \, dx \, dy

Split the integral: a \iint x \, f_{X,Y}(x,y) \, dx \, dy + b \iint y \, f_{X,Y}(x,y) \, dx \, dy. The first integral is a\mathbb{E}[X] (integrate out y to get the marginal), and the second is b\mathbb{E}[Y].

Why It Matters

Linearity makes expected value tractable even for complex random variables. To compute \mathbb{E}[X] where X counts something complicated, decompose X = \sum X_i into indicator random variables and sum \mathbb{E}[X_i]. This trick solves problems in combinatorics, algorithm analysis, and randomized methods where computing the joint distribution would be intractable.
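The indicator trick can be checked by brute force on a classic instance: the expected number of fixed points of a uniform random permutation. Each indicator 1{perm(i) = i} has expectation 1/n, so by linearity the expected count is n · (1/n) = 1, regardless of the heavy dependence between indicators. A small exhaustive check (standard library only):

```python
import itertools

n = 5
total_fixed = 0
count = 0

# Enumerate all n! permutations, so the average below is the exact expectation.
for perm in itertools.permutations(range(n)):
    total_fixed += sum(1 for i, p in enumerate(perm) if p == i)
    count += 1

print(total_fixed / count)  # 1.0 for every n
```

Computing this via the joint distribution of fixed points would require inclusion-exclusion; linearity bypasses it entirely.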

Failure Mode

Linearity does not hold for variance, entropy, or other nonlinear functionals of distributions. \text{Var}(X + Y) \neq \text{Var}(X) + \text{Var}(Y) in general.

Variance scaling: \text{Var}(aX + b) = a^2 \text{Var}(X). Adding a constant shifts the mean but does not change the spread. Scaling by a scales the variance by a^2.

Covariance bilinearity: \text{Cov}(aX + bY, Z) = a \, \text{Cov}(X, Z) + b \, \text{Cov}(Y, Z). Covariance is bilinear, making it an inner product on the space of zero-mean, finite-variance random variables.

Law of Total Variance

Theorem

Law of Total Variance (Eve's Law)

Statement

\text{Var}(X) = \mathbb{E}[\text{Var}(X \mid Y)] + \text{Var}(\mathbb{E}[X \mid Y])


Intuition

Total variance decomposes into two sources. \mathbb{E}[\text{Var}(X \mid Y)] is the average variance within each level of Y (unexplained variance). \text{Var}(\mathbb{E}[X \mid Y]) is the variance of the conditional mean across levels of Y (explained variance). If knowing Y perfectly predicts X, the first term is zero. If Y is useless, the second term is zero.

Proof Sketch

Start from \text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2. Apply the law of total expectation: \mathbb{E}[X^2] = \mathbb{E}[\mathbb{E}[X^2 \mid Y]]. Write \mathbb{E}[X^2 \mid Y] = \text{Var}(X \mid Y) + (\mathbb{E}[X \mid Y])^2. Substitute and regroup to get \mathbb{E}[\text{Var}(X \mid Y)] + \mathbb{E}[(\mathbb{E}[X \mid Y])^2] - (\mathbb{E}[\mathbb{E}[X \mid Y]])^2. The last two terms equal \text{Var}(\mathbb{E}[X \mid Y]).

Why It Matters

This decomposition is the theoretical basis for ANOVA, the bias-variance decomposition, and hierarchical models. In random effects models, it separates within-group and between-group variation.
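The within/between split can be verified exactly on a small finite mixture, where every expectation is a finite sum. The group probabilities and conditional distributions below are arbitrary illustrative choices:

```python
# Y picks a group; X | Y = g has a two-point conditional distribution.
p_group = {0: 0.3, 1: 0.7}
cond = {
    0: {-1.0: 0.5, 1.0: 0.5},   # conditional mean 0, conditional variance 1
    1: {2.0: 0.25, 6.0: 0.75},  # conditional mean 5, conditional variance 3
}

def mean(dist):
    return sum(x * p for x, p in dist.items())

def var(dist):
    m = mean(dist)
    return sum((x - m) ** 2 * p for x, p in dist.items())

# Marginal distribution of X: P(X = x) = sum_g P(Y = g) P(X = x | Y = g).
marginal = {}
for g, pg in p_group.items():
    for x, px in cond[g].items():
        marginal[x] = marginal.get(x, 0.0) + pg * px

within = sum(p_group[g] * var(cond[g]) for g in p_group)   # E[Var(X | Y)]
cond_means = {g: mean(cond[g]) for g in p_group}
overall = sum(p_group[g] * cond_means[g] for g in p_group)
between = sum(p_group[g] * (cond_means[g] - overall) ** 2
              for g in p_group)                            # Var(E[X | Y])

print(var(marginal), within + between)  # the two sides agree
```

Here within = 0.3·1 + 0.7·3 = 2.4 and between = 5.25, matching the marginal variance 7.65: the decomposition separates noise inside each group from spread of the group means.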

Failure Mode

Requires \text{Var}(X) < \infty. The conditional variance \text{Var}(X \mid Y) is itself a random variable (a function of Y), not a number.

Chebyshev's Inequality

Theorem

Chebyshev's Inequality

Statement

P(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2}

Equivalently, P(|X - \mu| \geq k\sigma) \leq 1/k^2 for k > 0.


Intuition

A random variable with small variance cannot deviate far from its mean with high probability. The bound is distribution-free: it holds for any distribution with finite variance.

Proof Sketch

Apply Markov's inequality to the nonnegative random variable (X - \mu)^2:

P((X - \mu)^2 \geq t^2) \leq \frac{\mathbb{E}[(X - \mu)^2]}{t^2} = \frac{\sigma^2}{t^2}

Since (X - \mu)^2 \geq t^2 iff |X - \mu| \geq t, the result follows.

Why It Matters

Chebyshev is the simplest concentration inequality. It proves the weak law of large numbers in two lines: apply Chebyshev to \bar{X}_n with \text{Var}(\bar{X}_n) = \sigma^2/n, giving P(|\bar{X}_n - \mu| \geq \epsilon) \leq \sigma^2/(n\epsilon^2) \to 0.

Failure Mode

The bound is loose for specific distributions. For Gaussians, P(|X - \mu| \geq 2\sigma) \approx 0.046, while Chebyshev gives \leq 0.25. Tighter bounds require distributional assumptions; see concentration inequalities for Hoeffding, Bernstein, and sub-Gaussian bounds.
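The Gaussian-vs-Chebyshev gap is easy to see by simulation; a rough NumPy sketch (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=1_000_000)  # standard Gaussian, mu=0, sigma=1

# Empirical P(|X - mu| >= 2 sigma) versus the distribution-free bound 1/k^2.
empirical = float(np.mean(np.abs(x) >= 2.0))
chebyshev = 1.0 / 2.0 ** 2

print(empirical)   # close to 0.046
print(chebyshev)   # 0.25
```

The Chebyshev bound is valid (0.046 is indeed below 0.25) but off by roughly a factor of five here, which is the price of making no distributional assumption.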

Higher Moments and Moment Generating Functions

The k-th moment \mathbb{E}[X^k] and the k-th central moment \mathbb{E}[(X - \mu)^k] capture progressively finer distributional information.

Skewness (third standardized central moment): \gamma_1 = \mathbb{E}[(X - \mu)^3] / \sigma^3. Positive skewness indicates a right tail heavier than the left. It is zero for any symmetric distribution.

Kurtosis (fourth standardized central moment): \gamma_2 = \mathbb{E}[(X - \mu)^4] / \sigma^4. The Gaussian has \gamma_2 = 3. Excess kurtosis \gamma_2 - 3 measures tail heaviness relative to the Gaussian. Heavy-tailed distributions (relevant to financial returns and gradient noise) have large excess kurtosis.

Moment generating function (MGF): M_X(t) = \mathbb{E}[e^{tX}], defined for t in a neighborhood of zero. When it exists, the MGF uniquely determines the distribution. Its utility: M_X^{(k)}(0) = \mathbb{E}[X^k], so all moments are encoded in one function. The MGF of a sum of independent random variables is the product of their MGFs, which is the standard tool for proving the central limit theorem. For distributions where the MGF does not exist (e.g., Cauchy, log-normal), use the characteristic function \phi_X(t) = \mathbb{E}[e^{itX}] instead, which always exists.
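The standardized moments can be estimated from samples with nothing but NumPy. The exponential distribution is a convenient test case because its skewness (2) and excess kurtosis (6) are known in closed form; the seed and sample size below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=1_000_000)  # Exp(1) draws

mu = x.mean()
sigma = x.std()
z = (x - mu) / sigma  # standardized samples

skew = float(np.mean(z ** 3))               # gamma_1, near 2 for Exp(1)
excess_kurt = float(np.mean(z ** 4) - 3.0)  # gamma_2 - 3, near 6 for Exp(1)

print(skew, excess_kurt)
```

The positive skew reflects the right tail of the exponential; the large excess kurtosis reflects how much more mass sits in the tails than for a Gaussian of the same variance. Note that higher-moment estimates converge slowly: the kurtosis estimate is visibly noisier than the mean or variance at the same sample size.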

Main Theorems

Theorem

Variance of a Sum

Statement

\text{Var}\!\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \text{Var}(X_i) + 2 \sum_{i < j} \text{Cov}(X_i, X_j)

If X_1, \ldots, X_n are pairwise uncorrelated, this reduces to \text{Var}(\sum X_i) = \sum \text{Var}(X_i).


Intuition

Variance of a sum depends on both individual variances and how the variables co-vary. Positive correlations inflate the total variance; negative correlations reduce it.

Proof Sketch

Let S = \sum X_i. Then \text{Var}(S) = \mathbb{E}[(S - \mathbb{E}[S])^2] = \mathbb{E}[(\sum_i (X_i - \mu_i))^2]. Expanding the square and applying linearity of expectation gives \sum_i \text{Var}(X_i) + 2\sum_{i<j} \text{Cov}(X_i, X_j).

Why It Matters

For i.i.d. random variables, \text{Var}(\bar{X}) = \text{Var}(X)/n. This is why averaging reduces noise, and it underlies the 1/\sqrt{n} convergence rate in the central limit theorem. In SGD, minibatch averaging reduces gradient variance by a factor of the batch size.

Failure Mode

Requires finite second moments. For heavy-tailed distributions (e.g., Cauchy; see common probability distributions), the variance is infinite and the formula is meaningless. Pairwise uncorrelated does not imply independent: the simplification \text{Var}(\sum X_i) = \sum \text{Var}(X_i) holds under the weaker pairwise-uncorrelated condition, but other properties (e.g., concentration inequalities) may need full independence.
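The n = 2 case of the theorem, \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y), is an algebraic identity, so the sample (population-style, ddof = 0) versions satisfy it exactly on any data. A quick NumPy check with deliberately correlated variables (the construction of y is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)
y = 0.6 * x + rng.normal(size=100_000)  # positively correlated with x

# Left side: variance of the sum. Right side: the decomposition,
# including the cross-term that positive correlation inflates.
lhs = float(np.var(x + y))
rhs = float(np.var(x) + np.var(y) + 2 * np.cov(x, y, ddof=0)[0, 1])

print(lhs, rhs)  # equal up to floating-point error
```

Dropping the covariance term here would understate the variance of the sum by about a third, which is exactly the mistake the "Variance is not linear" warning below describes.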

Common Confusions

Watch Out

Uncorrelated does not imply independent

Let X \sim \text{Uniform}(\{-1, 0, 1\}) and Y = X^2. Then \text{Cov}(X, Y) = \mathbb{E}[X^3] - \mathbb{E}[X]\mathbb{E}[X^2] = 0 - 0 = 0, so X and Y are uncorrelated. But Y is a deterministic function of X, so they are maximally dependent.
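Because the distribution is supported on three points, the covariance in this counterexample can be computed as an exact finite sum:

```python
# X uniform on {-1, 0, 1}; Y = X^2 is a deterministic function of X.
support = [-1.0, 0.0, 1.0]
p = 1 / 3

e_x = sum(x * p for x in support)            # E[X]   = 0
e_y = sum(x ** 2 * p for x in support)       # E[X^2] = 2/3
e_xy = sum(x * x ** 2 * p for x in support)  # E[X^3] = 0 (odd symmetry)

cov_xy = e_xy - e_x * e_y
print(cov_xy)  # 0.0 -- uncorrelated, yet Y is fully determined by X
```

The cancellation in e_x and e_xy is exact (the ±1 terms are equal and opposite), so the covariance is exactly zero while the dependence is total.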

Watch Out

E[XY] = E[X]E[Y] requires independence (or uncorrelatedness)

The factorization \mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y] holds when X, Y are uncorrelated (equivalently, \text{Cov}(X,Y) = 0). Independence implies uncorrelatedness, but not vice versa. For nonlinear functions: \mathbb{E}[g(X)h(Y)] = \mathbb{E}[g(X)]\mathbb{E}[h(Y)] requires independence, not just uncorrelatedness.

Watch Out

Variance is not linear

\text{Var}(X + Y) \neq \text{Var}(X) + \text{Var}(Y) unless X and Y are uncorrelated. The cross-term 2\,\text{Cov}(X, Y) is often forgotten.

Exercises

Exercise (Core)

Problem

Let X_1, \ldots, X_n be i.i.d. with mean \mu and variance \sigma^2. Compute \mathbb{E}[\bar{X}] and \text{Var}(\bar{X}) where \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i.

Exercise (Advanced)

Problem

Let X and Y have finite second moments. Prove that |\text{Cov}(X, Y)| \leq \sqrt{\text{Var}(X)\text{Var}(Y)}, i.e., |\rho(X,Y)| \leq 1.

References

Canonical:

  • Grimmett & Stirzaker, Probability and Random Processes (2020), Chapter 3
  • Casella & Berger, Statistical Inference (2002), Chapter 2
  • Billingsley, Probability and Measure (1995), Chapters 5 and 21 (expectation, moments, MGFs)
  • Feller, An Introduction to Probability Theory and Its Applications, Vol. 2 (1971), Chapter XV (moments, characteristic functions)

For ML context:

  • Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Chapter 6
  • Blitzstein & Hwang, Introduction to Probability (2019), Chapters 4 and 7 (expectation, joint distributions, covariance)

Last reviewed: April 14, 2026
