Foundations

Expectation, Variance, Covariance, and Moments

Expectation as average value, variance as squared spread, covariance as joint movement, and the moment identities used throughout statistics and ML.

Core · Tier 1 · Stable · Supporting · ~55 min

Why This Matters

Expectation is the average value of a random variable over repeated draws. Variance measures how far that variable tends to move away from its mean. Covariance measures whether two variables move together.

Those three ideas show up constantly in ML. Expected loss is population risk. Variance of stochastic gradients controls how noisy SGD updates are. Covariance matrices describe the geometry of data clouds and appear in PCA, Gaussian processes, whitening, natural gradients, and Kalman filters.

Quick Version

| Object | Plain meaning | What it answers |
| --- | --- | --- |
| \mathbb E[X] | long-run average value | Where is the center? |
| \operatorname{Var}(X) | average squared distance from the mean | How spread out is X? |
| \operatorname{Cov}(X,Y) | average product of centered deviations | Do X and Y move together? |
| \rho(X,Y) | scale-free covariance | How strong is the linear relationship? |
| \mathbb E[X^k] | k-th moment | What shape information survives averaging? |

Do not treat these as formulas to memorize separately. They are all averages of transformed random variables. Once expectation is understood, variance and covariance are just expectation applied to centered products.

Example

A tiny distribution shows all three ideas

Let X take values 0 and 2 with probabilities 1/4 and 3/4.

\mathbb E[X]=0\cdot\frac14+2\cdot\frac34=\frac32.

Variance averages squared deviations from that mean:

\operatorname{Var}(X)=\left(0-\frac32\right)^2\frac14+\left(2-\frac32\right)^2\frac34=\frac34.

The standard deviation is \sqrt{3}/2. The mean tells you where the distribution balances; the variance tells you how far the mass sits from that balance point.
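The tiny distribution above can be checked directly from the definitions; this is a minimal sketch in plain Python with no libraries assumed:

```python
# X takes value 0 with probability 1/4 and value 2 with probability 3/4.
values = [0.0, 2.0]
probs = [0.25, 0.75]

# E[X] = sum of x * P(X = x).
mean = sum(x * p for x, p in zip(values, probs))

# Var(X) = E[(X - E[X])^2], the average squared deviation from the mean.
var = sum((x - mean) ** 2 * p for x, p in zip(values, probs))

print(mean)  # 1.5
print(var)   # 0.75
```

The same two sums work for any finite distribution; only the value/probability lists change.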

Visual: Covariance Geometry

[Interactive visual: a joint Gaussian contour (covariance ellipse) separates marginal spread from directional tilt, so covariance becomes visible instead of purely algebraic. The variances set the axis lengths and stay fixed; only the off-diagonal covariance changes. With zero covariance the axes align with the coordinate axes, and the major axis rotates as \rho changes sign: negative \rho tilts the ellipse one way, positive \rho the other.]

Core Definitions

Definition

Expectation

For a discrete random variable: \mathbb{E}[X] = \sum_x x \, P(X = x). For a continuous random variable with density f: \mathbb{E}[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx. The expectation exists when the sum or integral is absolutely convergent.

More generally, for a measurable function g: \mathbb{E}[g(X)] = \int g(x) \, dP_X(x).

Definition

Variance

The variance measures spread around the mean:

\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2

The second form (the computational formula) is often easier to use. The standard deviation is \sigma = \sqrt{\text{Var}(X)}.

Definition

Covariance

The covariance measures linear association between two random variables:

\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]

\text{Cov}(X, X) = \text{Var}(X). For a random vector X \in \mathbb{R}^d, the covariance matrix has entries \Sigma_{ij} = \text{Cov}(X_i, X_j).

Definition

Correlation

The Pearson correlation normalizes covariance to [-1, 1]:

\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \, \text{Var}(Y)}}

|\rho| = 1 iff Y is an affine function of X almost surely. \rho = 0 means uncorrelated (not necessarily independent).

Definition

Moments

The k-th moment of X is \mathbb{E}[X^k]. The k-th central moment is \mathbb{E}[(X - \mathbb{E}[X])^k]. The third central moment (normalized) is skewness. The fourth (normalized) is kurtosis. Heavy-tailed distributions have large kurtosis.

Covariance vs Correlation

Covariance and correlation both measure linear association, but they serve different purposes. Covariance is a bilinear form that participates in algebraic computations: the variance of a sum formula, the covariance matrix in PCA, the Kalman gain equation. Its magnitude depends on the scale of the variables, so \text{Cov}(X, Y) = 500 is meaningless without knowing the units.

Correlation normalizes away scale: \rho(X, Y) \in [-1, 1] regardless of units. Use correlation for interpretation (how strong is the linear relationship?) and covariance for computation (what is the variance of a portfolio return?).

Two critical points. First, \rho = 0 does not imply independence. It only rules out linear dependence. Second, correlation measures linear association only. Variables with strong nonlinear dependence can have \rho = 0. For broader dependence measures, see mutual information or rank correlations (Spearman, Kendall).
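The scale dependence of covariance versus the scale invariance of correlation can be seen numerically. This sketch uses NumPy with made-up height/weight data (the variable names, seed, and coefficients are illustrative assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linearly related data: heights in meters, weights in kg.
height_m = rng.normal(1.7, 0.1, size=100_000)
weight_kg = 50 + 30 * height_m + rng.normal(0, 5, size=100_000)

def cov(a, b):
    # E[(X - E[X])(Y - E[Y])] as a sample average.
    return float(np.mean((a - a.mean()) * (b - b.mean())))

def corr(a, b):
    # Covariance normalized by the two standard deviations.
    return cov(a, b) / (a.std() * b.std())

# Same quantity, different units: covariance scales by 100, correlation doesn't.
height_cm = 100 * height_m

print(cov(height_m, weight_kg), cov(height_cm, weight_kg))
print(corr(height_m, weight_kg), corr(height_cm, weight_kg))
```

The first line prints two covariances differing by a factor of 100; the second prints the same correlation twice, which is exactly the "interpretation vs computation" split above.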

Key Properties

Theorem

Linearity of Expectation

Statement

\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]

This extends to any finite sum: \mathbb{E}[\sum_{i=1}^n a_i X_i] = \sum_{i=1}^n a_i \mathbb{E}[X_i].


Intuition

No independence or uncorrelatedness is required. This holds for arbitrary dependence structure. It is the single most-used property in probabilistic analysis.

Proof Sketch

For continuous random variables with joint density f_{X,Y}:

\mathbb{E}[aX + bY] = \iint (ax + by) f_{X,Y}(x,y) \, dx \, dy

Split the integral: a \iint x \, f_{X,Y}(x,y) \, dx \, dy + b \iint y \, f_{X,Y}(x,y) \, dx \, dy. The first integral is a\mathbb{E}[X] (integrate out y to get the marginal), and the second is b\mathbb{E}[Y].

Why It Matters

Linearity makes expected value tractable even for complex random variables. To compute \mathbb{E}[X] where X counts something complicated, decompose X = \sum X_i into indicator random variables and sum \mathbb{E}[X_i]. This trick solves problems in combinatorics, algorithm analysis, and randomized methods where computing the joint distribution would be intractable.
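The indicator trick can be checked by brute force on a classic instance: the expected number of fixed points of a uniform random permutation. Each indicator 1{perm(i) = i} has expectation 1/n, so by linearity the expected count is n · (1/n) = 1, regardless of the heavy dependence between indicators. A small exhaustive check (standard library only):

```python
import itertools

n = 5
total_fixed = 0
count = 0

# Enumerate all n! permutations, so the average below is the exact expectation.
for perm in itertools.permutations(range(n)):
    total_fixed += sum(1 for i, p in enumerate(perm) if p == i)
    count += 1

print(total_fixed / count)  # 1.0 for every n
```

Computing this via the joint distribution of fixed points would require inclusion-exclusion; linearity bypasses it entirely.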

Failure Mode

Linearity does not hold for variance, entropy, or other nonlinear functionals of distributions. \text{Var}(X + Y) \neq \text{Var}(X) + \text{Var}(Y) in general.

Variance scaling: \text{Var}(aX + b) = a^2 \text{Var}(X). Adding a constant shifts the mean but does not change the spread. Scaling by a scales the variance by a^2.

Covariance bilinearity: \text{Cov}(aX + bY, Z) = a \, \text{Cov}(X, Z) + b \, \text{Cov}(Y, Z). Covariance is bilinear, making it an inner product on the space of zero-mean, finite-variance random variables.

Law of Total Variance

Theorem

Law of Total Variance (Eve's Law)

Statement

\text{Var}(X) = \mathbb{E}[\text{Var}(X \mid Y)] + \text{Var}(\mathbb{E}[X \mid Y])


Intuition

Total variance decomposes into two sources. \mathbb{E}[\text{Var}(X \mid Y)] is the average variance within each level of Y (unexplained variance). \text{Var}(\mathbb{E}[X \mid Y]) is the variance of the conditional mean across levels of Y (explained variance). If knowing Y perfectly predicts X, the first term is zero. If Y is useless, the second term is zero.

Proof Sketch

Start from \text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2. Apply the law of total expectation: \mathbb{E}[X^2] = \mathbb{E}[\mathbb{E}[X^2 \mid Y]]. Write \mathbb{E}[X^2 \mid Y] = \text{Var}(X \mid Y) + (\mathbb{E}[X \mid Y])^2. Substitute and regroup to get \mathbb{E}[\text{Var}(X \mid Y)] + \mathbb{E}[(\mathbb{E}[X \mid Y])^2] - (\mathbb{E}[\mathbb{E}[X \mid Y]])^2. The last two terms equal \text{Var}(\mathbb{E}[X \mid Y]).

Why It Matters

This decomposition is the theoretical basis for ANOVA, the bias-variance decomposition, and hierarchical models. In random effects models, it separates within-group and between-group variation.
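The within/between split can be verified exactly on a small finite mixture, where every expectation is a finite sum. The group probabilities and conditional distributions below are arbitrary illustrative choices:

```python
# Y picks a group; X | Y = g has a two-point conditional distribution.
p_group = {0: 0.3, 1: 0.7}
cond = {
    0: {-1.0: 0.5, 1.0: 0.5},   # conditional mean 0, conditional variance 1
    1: {2.0: 0.25, 6.0: 0.75},  # conditional mean 5, conditional variance 3
}

def mean(dist):
    return sum(x * p for x, p in dist.items())

def var(dist):
    m = mean(dist)
    return sum((x - m) ** 2 * p for x, p in dist.items())

# Marginal distribution of X: P(X = x) = sum_g P(Y = g) P(X = x | Y = g).
marginal = {}
for g, pg in p_group.items():
    for x, px in cond[g].items():
        marginal[x] = marginal.get(x, 0.0) + pg * px

within = sum(p_group[g] * var(cond[g]) for g in p_group)   # E[Var(X | Y)]
cond_means = {g: mean(cond[g]) for g in p_group}
overall = sum(p_group[g] * cond_means[g] for g in p_group)
between = sum(p_group[g] * (cond_means[g] - overall) ** 2
              for g in p_group)                            # Var(E[X | Y])

print(var(marginal), within + between)  # the two sides agree
```

Here within = 0.3·1 + 0.7·3 = 2.4 and between = 5.25, matching the marginal variance 7.65: the decomposition separates noise inside each group from spread of the group means.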

Failure Mode

Requires \text{Var}(X) < \infty. The conditional variance \text{Var}(X \mid Y) is itself a random variable (a function of Y), not a number.

Chebyshev's Inequality

Theorem

Chebyshev's Inequality

Statement

P(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2}

Equivalently, P(|X - \mu| \geq k\sigma) \leq 1/k^2 for k > 0.


Intuition

A random variable with small variance cannot deviate far from its mean with high probability. The bound is distribution-free: it holds for any distribution with finite variance.

Proof Sketch

Apply Markov's inequality to the nonnegative random variable (X - \mu)^2:

P((X - \mu)^2 \geq t^2) \leq \frac{\mathbb{E}[(X - \mu)^2]}{t^2} = \frac{\sigma^2}{t^2}

Since (X - \mu)^2 \geq t^2 iff |X - \mu| \geq t, the result follows.

Why It Matters

Chebyshev is the simplest concentration inequality. It proves the weak law of large numbers in two lines: apply Chebyshev to \bar{X}_n with \text{Var}(\bar{X}_n) = \sigma^2/n, giving P(|\bar{X}_n - \mu| \geq \epsilon) \leq \sigma^2/(n\epsilon^2) \to 0.

Failure Mode

The bound is loose for specific distributions. For Gaussians, P(|X - \mu| \geq 2\sigma) \approx 0.046, while Chebyshev gives \leq 0.25. Tighter bounds require distributional assumptions; see concentration inequalities for Hoeffding, Bernstein, and sub-Gaussian bounds.
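The Gaussian-vs-Chebyshev gap is easy to see by simulation; a rough NumPy sketch (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=1_000_000)  # standard Gaussian, mu=0, sigma=1

# Empirical P(|X - mu| >= 2 sigma) versus the distribution-free bound 1/k^2.
empirical = float(np.mean(np.abs(x) >= 2.0))
chebyshev = 1.0 / 2.0 ** 2

print(empirical)   # close to 0.046
print(chebyshev)   # 0.25
```

The Chebyshev bound is valid (0.046 is indeed below 0.25) but off by roughly a factor of five here, which is the price of making no distributional assumption.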

Higher Moments and Moment Generating Functions

The k-th moment \mathbb{E}[X^k] and the k-th central moment \mathbb{E}[(X - \mu)^k] capture progressively finer distributional information.

Skewness (third standardized central moment): \gamma_1 = \mathbb{E}[(X - \mu)^3] / \sigma^3. Positive skewness indicates a right tail heavier than the left. It is zero for any symmetric distribution.

Kurtosis (fourth standardized central moment): \gamma_2 = \mathbb{E}[(X - \mu)^4] / \sigma^4. The Gaussian has \gamma_2 = 3. Excess kurtosis \gamma_2 - 3 measures tail heaviness relative to the Gaussian. Heavy-tailed distributions (relevant to financial returns and gradient noise) have large excess kurtosis.

Moment generating function (MGF): M_X(t) = \mathbb{E}[e^{tX}], defined for t in a neighborhood of zero. When it exists, the MGF uniquely determines the distribution. Its utility: M_X^{(k)}(0) = \mathbb{E}[X^k], so all moments are encoded in one function. The MGF of a sum of independent random variables is the product of their MGFs, which is the standard tool for proving the central limit theorem. For distributions where the MGF does not exist (e.g., Cauchy, log-normal), use the characteristic function \phi_X(t) = \mathbb{E}[e^{itX}] instead, which always exists.
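The standardized moments can be estimated from samples with nothing but NumPy. The exponential distribution is a convenient test case because its skewness (2) and excess kurtosis (6) are known in closed form; the seed and sample size below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=1_000_000)  # Exp(1) draws

mu = x.mean()
sigma = x.std()
z = (x - mu) / sigma  # standardized samples

skew = float(np.mean(z ** 3))               # gamma_1, near 2 for Exp(1)
excess_kurt = float(np.mean(z ** 4) - 3.0)  # gamma_2 - 3, near 6 for Exp(1)

print(skew, excess_kurt)
```

The positive skew reflects the right tail of the exponential; the large excess kurtosis reflects how much more mass sits in the tails than for a Gaussian of the same variance. Note that higher-moment estimates converge slowly: the kurtosis estimate is visibly noisier than the mean or variance at the same sample size.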

Main Theorems

Theorem

Variance of a Sum

Statement

\text{Var}\!\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \text{Var}(X_i) + 2 \sum_{i < j} \text{Cov}(X_i, X_j)

If X_1, \ldots, X_n are pairwise uncorrelated, this reduces to \text{Var}(\sum X_i) = \sum \text{Var}(X_i).


Intuition

Variance of a sum depends on both individual variances and how the variables co-vary. Positive correlations inflate the total variance; negative correlations reduce it.

Proof Sketch

Let S = \sum X_i. Then \text{Var}(S) = \mathbb{E}[(S - \mathbb{E}[S])^2] = \mathbb{E}[(\sum_i (X_i - \mu_i))^2]. Expanding the square and applying linearity of expectation gives \sum_i \text{Var}(X_i) + 2\sum_{i<j} \text{Cov}(X_i, X_j).

Why It Matters

For i.i.d. random variables, \text{Var}(\bar{X}) = \text{Var}(X)/n. This is why averaging reduces noise, and it underlies the 1/\sqrt{n} convergence rate in the central limit theorem. In SGD, minibatch averaging reduces gradient variance by a factor of the batch size.

Failure Mode

Requires finite second moments. For heavy-tailed distributions (e.g., Cauchy; see common probability distributions), the variance is infinite and the formula is meaningless. Pairwise uncorrelated does not imply independent: the simplification \text{Var}(\sum X_i) = \sum \text{Var}(X_i) holds under the weaker pairwise-uncorrelated condition, but other properties (e.g., concentration inequalities) may need full independence.
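The n = 2 case of the theorem, \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y), is an algebraic identity, so the sample (population-style, ddof = 0) versions satisfy it exactly on any data. A quick NumPy check with deliberately correlated variables (the construction of y is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)
y = 0.6 * x + rng.normal(size=100_000)  # positively correlated with x

# Left side: variance of the sum. Right side: the decomposition,
# including the cross-term that positive correlation inflates.
lhs = float(np.var(x + y))
rhs = float(np.var(x) + np.var(y) + 2 * np.cov(x, y, ddof=0)[0, 1])

print(lhs, rhs)  # equal up to floating-point error
```

Dropping the covariance term here would understate the variance of the sum by about a third, which is exactly the mistake the "Variance is not linear" warning below describes.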

Common Confusions

Watch Out

Uncorrelated does not imply independent

Let X \sim \text{Uniform}(\{-1, 0, 1\}) and Y = X^2. Then \text{Cov}(X, Y) = \mathbb{E}[X^3] - \mathbb{E}[X]\mathbb{E}[X^2] = 0 - 0 = 0, so X and Y are uncorrelated. But Y is a deterministic function of X, so they are maximally dependent.
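Because the distribution is supported on three points, the covariance in this counterexample can be computed as an exact finite sum:

```python
# X uniform on {-1, 0, 1}; Y = X^2 is a deterministic function of X.
support = [-1.0, 0.0, 1.0]
p = 1 / 3

e_x = sum(x * p for x in support)            # E[X]   = 0
e_y = sum(x ** 2 * p for x in support)       # E[X^2] = 2/3
e_xy = sum(x * x ** 2 * p for x in support)  # E[X^3] = 0 (odd symmetry)

cov_xy = e_xy - e_x * e_y
print(cov_xy)  # 0.0 -- uncorrelated, yet Y is fully determined by X
```

The cancellation in e_x and e_xy is exact (the ±1 terms are equal and opposite), so the covariance is exactly zero while the dependence is total.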

Watch Out

E[XY] = E[X]E[Y] requires independence (or uncorrelatedness)

The factorization \mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y] holds when X, Y are uncorrelated (equivalently, \text{Cov}(X,Y) = 0). Independence implies uncorrelatedness, but not vice versa. For nonlinear functions: \mathbb{E}[g(X)h(Y)] = \mathbb{E}[g(X)]\mathbb{E}[h(Y)] requires independence, not just uncorrelatedness.

Watch Out

Variance is not linear

\text{Var}(X + Y) \neq \text{Var}(X) + \text{Var}(Y) unless X and Y are uncorrelated. The cross-term 2\,\text{Cov}(X, Y) is often forgotten.

Exercises

Exercise (Core)

Problem

Let X_1, \ldots, X_n be i.i.d. with mean \mu and variance \sigma^2. Compute \mathbb{E}[\bar{X}] and \text{Var}(\bar{X}) where \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i.

Exercise (Advanced)

Problem

Let X and Y have finite second moments. Prove that |\text{Cov}(X, Y)| \leq \sqrt{\text{Var}(X)\text{Var}(Y)}, i.e., |\rho(X,Y)| \leq 1.

References

Canonical:

  • Grimmett & Stirzaker, Probability and Random Processes (2020), Chapter 3
  • Casella & Berger, Statistical Inference (2002), Chapter 2
  • Billingsley, Probability and Measure (1995), Chapters 5 and 21 (expectation, moments, MGFs)
  • Feller, An Introduction to Probability Theory and Its Applications, Vol. 2 (1971), Chapter XV (moments, characteristic functions)

For ML context:

  • Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Chapter 6
  • Blitzstein & Hwang, Introduction to Probability (2019), Chapters 4 and 7 (expectation, joint distributions, covariance)

Last reviewed: April 14, 2026
