Foundations
Common Probability Distributions
The essential probability distributions for ML: Bernoulli through Dirichlet, with PMFs/PDFs, moments, relationships, and when each arises in practice.
Why This Matters
Every model you will ever train makes distributional assumptions, whether explicitly or implicitly. Logistic regression assumes Bernoulli responses. Linear regression with squared loss assumes Gaussian noise. Bayesian methods require choosing priors: Beta, Gamma, Dirichlet. If you cannot write down the PMF or PDF, compute the mean and variance, and explain when a distribution arises, you will be guessing instead of reasoning.
The Greek letters $\mu$ and $\sigma$ name the canonical mean and standard deviation, and the symbol $\sim$ reads "is distributed as".
This page is a reference you will return to repeatedly. Know these cold.
Mental Model
Distributions fall into two families by support:
- Discrete (PMF): Bernoulli, Binomial, Categorical, Multinomial, Geometric, Negative Binomial, Poisson
- Continuous (PDF): Uniform, Gaussian, Exponential, Gamma, Beta, Chi-squared, Student-t, F, Cauchy, Dirichlet
Many are related: the Binomial is a sum of Bernoullis, the Exponential is a special Gamma, the Chi-squared is a special Gamma, the Beta is conjugate to the Bernoulli, and the Dirichlet generalizes the Beta.
Quick Reference
| Family | Support | Parameters | Reach for it when | Core relationship |
|---|---|---|---|---|
| Bernoulli / Binomial / Categorical / Multinomial | finite counts | probability $p$ or probability vector $p_{1:K}$, trial count $n$ | labels, counts, categorical outcomes, softmax outputs | Beta and Dirichlet conjugacy |
| Geometric / Negative Binomial | positive integers | success probability $p$, success count $r$ | waiting times to first or $r$-th success, overdispersed counts | NB is a sum of Geometrics |
| Poisson | nonnegative integers | rate $\lambda$ | rare-event counts in a fixed window | Binomial rare-event limit |
| Gaussian / Multivariate Gaussian | $\mathbb{R}$ or $\mathbb{R}^d$ | mean and covariance | additive noise, linear models, CLT approximations | affine transforms stay Gaussian |
| Exponential / Gamma | positive reals | rate $\lambda$ / shape-rate $(\alpha, \beta)$ | waiting times, positive skew, arrival processes | Exponential is Gamma with shape 1 |
| Beta / Dirichlet | $[0,1]$ or simplex | positive shape parameters | priors on probabilities | conjugate to Bernoulli/Binomial and Multinomial |
| Chi-squared / Student-t / F / Cauchy | positive or all reals | degrees of freedom, scale | variance inference, robust tails, ratio constructions | built from Gaussian and Gamma / chi-squared pieces |
Discrete Distributions
Bernoulli Distribution
A single binary trial with success probability $p \in [0, 1]$.
PMF: $P(X = x) = p^x (1-p)^{1-x}$ for $x \in \{0, 1\}$
Mean: $\mathbb{E}[X] = p$
Variance: $\mathrm{Var}(X) = p(1-p)$, maximized at $p = 1/2$ with value $1/4$.
MGF: $M_X(t) = 1 - p + p e^t$ for all $t \in \mathbb{R}$.
When it arises: Binary classification labels, coin flips, any yes/no outcome. The log-likelihood of Bernoulli data gives rise to the cross-entropy loss used in logistic regression. The Bernoulli MGF is the starting point of the Chernoff bound: applying Markov's inequality to $e^{t S_n}$ for a sum $S_n$ of independent Bernoullis gives the canonical exponential tail bound on a binomial sum.
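A quick sketch of the Bernoulli NLL/cross-entropy correspondence, assuming NumPy; `bernoulli_nll` is an illustrative helper name, not a library function:

```python
import numpy as np

def bernoulli_nll(y, p):
    """Average negative log-likelihood of labels y under Bernoulli(p) predictions.
    This is exactly the binary cross-entropy loss."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy check: predictions closer to the labels give lower loss.
y = np.array([1, 0, 1, 1])
print(bernoulli_nll(y, np.array([0.9, 0.1, 0.8, 0.7])))  # small
print(bernoulli_nll(y, np.array([0.5, 0.5, 0.5, 0.5])))  # log 2 ~ 0.693
```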
Binomial Distribution
The number of successes in $n$ independent Bernoulli trials.
PMF: $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$ for $k \in \{0, 1, \dots, n\}$
Mean: $\mathbb{E}[X] = np$
Variance: $\mathrm{Var}(X) = np(1-p)$
MGF: $M_X(t) = (1 - p + p e^t)^n$ for all $t \in \mathbb{R}$ (the $n$-th power of the Bernoulli MGF).
Key facts:
- If $X \sim \mathrm{Bin}(n, p)$ and $Y \sim \mathrm{Bin}(m, p)$ independently, then $X + Y \sim \mathrm{Bin}(n + m, p)$.
- As $n \to \infty$ with $np \to \lambda$, the Binomial converges to $\mathrm{Poisson}(\lambda)$.
- For fixed $p$, the standardized sum $(X - np)/\sqrt{np(1-p)}$ converges to $\mathcal{N}(0, 1)$ by the central limit theorem. The error is $O(1/\sqrt{n})$ by Berry-Esseen.
When it arises: Empirical risk on $n$ samples with 0-1 loss is $\frac{1}{n}\mathrm{Bin}(n, \epsilon)$ where $\epsilon$ is the true error rate. Slud's inequality lower-bounds binomial tails and is used in VC sample-complexity lower bounds.
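A short SciPy sketch comparing the exact binomial tail of the empirical risk with the Hoeffding bound $e^{-2nt^2}$; the values of $n$, $\epsilon$, and $t$ are arbitrary illustrations:

```python
import numpy as np
from scipy import stats

# Exact tail of the empirical risk hat{R} = Bin(n, eps)/n vs. Hoeffding.
n, eps, t = 1000, 0.1, 0.03
exact = stats.binom.sf(np.ceil(n * (eps + t)) - 1, n, eps)  # P(hat{R} >= eps + t)
hoeffding = np.exp(-2 * n * t**2)
print(f"exact tail:      {exact:.4e}")
print(f"Hoeffding bound: {hoeffding:.4e}")  # valid, but much looser
```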
Categorical Distribution
A single trial with $K$ possible outcomes, where outcome $k$ has probability $p_k$ and $\sum_{k=1}^{K} p_k = 1$. Generalizes the Bernoulli (recovered when $K = 2$).
PMF: $P(X = k) = p_k$ for $k \in \{1, \dots, K\}$.
Mean of the one-hot encoding $e_X$: $\mathbb{E}[e_X] = (p_1, \dots, p_K)$.
When it arises: Multi-class classification labels. The softmax output of a neural network parameterizes a Categorical: $p_k = e^{z_k} / \sum_j e^{z_j}$ where $z_1, \dots, z_K$ are the logits. Cross-entropy loss is $-\log p_y$ where $y$ is the true class — this is the Categorical negative log-likelihood. Language models predict a Categorical over vocabulary tokens at every position. A Multinomial is the sum of $n$ independent Categoricals with the same parameter vector.
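A minimal sketch of the softmax-Categorical correspondence in NumPy; `softmax` and `cross_entropy` are illustrative helpers, not library functions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, y):
    """Categorical NLL: -log p_y where p = softmax(z) and y is the true class."""
    return -np.log(softmax(z)[y])

logits = np.array([2.0, 0.5, -1.0])
print(softmax(logits))           # a Categorical parameter vector, sums to 1
print(cross_entropy(logits, 0))  # low loss: class 0 has the largest logit
```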
Poisson Distribution
Models the count of events in a fixed interval when events occur independently at a constant rate $\lambda > 0$.
PMF: $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k \in \{0, 1, 2, \dots\}$
Mean: $\mathbb{E}[X] = \lambda$
Variance: $\mathrm{Var}(X) = \lambda$
MGF: $M_X(t) = \exp\left(\lambda(e^t - 1)\right)$ for all $t \in \mathbb{R}$.
When it arises: Count data: number of clicks, arrivals, mutations. The mean equals the variance; if your count data has variance much larger than the mean, the Poisson model is wrong (use negative binomial instead). The Poisson MGF gives sub-exponential tail bounds; for large deviations, use the Bennett or Bernstein inequality rather than a Gaussian approximation.
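A small simulation of the mean-equals-variance diagnostic, assuming NumPy; the rate and negative-binomial parameters are arbitrary, and NumPy's `negative_binomial` counts failures before the $n$-th success:

```python
import numpy as np

rng = np.random.default_rng(0)

pois = rng.poisson(lam=4.0, size=100_000)
nb = rng.negative_binomial(n=2, p=0.33, size=100_000)  # counts failures

print(f"Poisson: mean={pois.mean():.2f}, var={pois.var():.2f}")  # roughly equal
print(f"NegBin:  mean={nb.mean():.2f},  var={nb.var():.2f}")     # var >> mean
```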
Geometric Distribution
The number of trials until the first success in independent Bernoulli($p$) trials. (Convention: includes the successful trial.)
PMF: $P(X = k) = (1-p)^{k-1} p$ for $k \in \{1, 2, 3, \dots\}$
Mean: $\mathbb{E}[X] = 1/p$
Variance: $\mathrm{Var}(X) = (1-p)/p^2$
Key property: The Geometric distribution is the only discrete distribution with the memoryless property: $P(X > m + n \mid X > m) = P(X > n)$.
When it arises in ML: Coupon-collector arguments (the expected time to see all $K$ classes from a uniform stream is $K H_K \approx K \ln K$); PAC sample-complexity lower bounds; the geometric series bounds in TD-learning convergence.
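A simulation of the coupon-collector claim, assuming NumPy; `time_to_collect` and the choice $K = 20$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
K, trials = 20, 2000

def time_to_collect(K):
    """Draw uniform classes until all K have appeared; return the draw count."""
    seen, t = set(), 0
    while len(seen) < K:
        seen.add(int(rng.integers(K)))
        t += 1
    return t

empirical = np.mean([time_to_collect(K) for _ in range(trials)])
harmonic = K * sum(1 / i for i in range(1, K + 1))  # K * H_K
print(f"empirical: {empirical:.1f}, K*H_K: {harmonic:.1f}")  # closely matched
```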
Negative Binomial Distribution
The number of trials until the $r$-th success in independent Bernoulli($p$) trials. Generalizes the Geometric (recovered when $r = 1$).
PMF: $P(X = k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}$ for $k \in \{r, r+1, r+2, \dots\}$
Mean: $\mathbb{E}[X] = r/p$
Variance: $\mathrm{Var}(X) = r(1-p)/p^2$
Key fact: In the count-of-failures parameterization ($Y = X - r$, with mean $r(1-p)/p$), the variance exceeds the mean by a factor of $1/p > 1$, so the negative binomial is a natural overdispersed alternative to the Poisson. Sums of independent $\mathrm{NB}(r_i, p)$ variables are $\mathrm{NB}\left(\sum_i r_i, p\right)$.
When it arises: Count data with overdispersion (variance $>$ mean), where the Poisson assumption fails. Common in NLP word-frequency models and biological sequencing data.
Multinomial Distribution
The multivariate generalization of the Binomial. In $n$ trials, each of $K$ categories occurs with probability $p_k$, and $X_k$ counts occurrences of category $k$.
PMF: $P(X_1 = x_1, \dots, X_K = x_K) = \frac{n!}{x_1! \cdots x_K!} \prod_{k=1}^{K} p_k^{x_k}$
where $\sum_k x_k = n$ and $\sum_k p_k = 1$.
Marginals: Each $X_k \sim \mathrm{Bin}(n, p_k)$.
Relationship: A Multinomial is the sum of $n$ i.i.d. Categorical trials. The single-trial case ($n = 1$) is the Categorical distribution defined above; this is what a softmax layer parameterizes.
When it arises: Bag-of-words models, topic models, document language models with $n$-token contexts. Empirical class frequencies in a multi-class classifier with $n$ examples are $\frac{1}{n}\mathrm{Multinomial}(n, p)$.
Continuous Distributions
Uniform Distribution
Constant density on the interval $[a, b]$.
PDF: $f(x) = \frac{1}{b - a}$ for $x \in [a, b]$, zero otherwise.
Mean: $\mathbb{E}[X] = \frac{a + b}{2}$
Variance: $\mathrm{Var}(X) = \frac{(b - a)^2}{12}$
When it arises: Maximum entropy distribution on a bounded interval with no other constraints. Used in random initialization, random search, and as the base distribution for inverse transform sampling.
Gaussian (Normal) Distribution
The most important continuous distribution. Parameterized by mean $\mu$ and variance $\sigma^2$.
PDF: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
Mean: $\mathbb{E}[X] = \mu$
Variance: $\mathrm{Var}(X) = \sigma^2$
MGF: $M_X(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right)$ for all $t \in \mathbb{R}$.
Standard normal moments (for $Z \sim \mathcal{N}(0, 1)$): $\mathbb{E}[Z] = 0$, $\mathbb{E}[Z^2] = 1$, $\mathbb{E}[Z^4] = 3$, and in general $\mathbb{E}[Z^{2k}] = (2k - 1)!!$ (odd moments vanish by symmetry). The fact that $\mathbb{E}[Z^4] = 3$ is what makes $\mathrm{Var}(Z^2) = 2$, the variance of a $\chi^2_1$ summand.
Tail bound (Mills ratio): For $Z \sim \mathcal{N}(0, 1)$ and $t > 0$, $\left(\frac{1}{t} - \frac{1}{t^3}\right)\frac{e^{-t^2/2}}{\sqrt{2\pi}} \leq P(Z > t) \leq \frac{1}{t}\frac{e^{-t^2/2}}{\sqrt{2\pi}}.$ The simpler bound $P(Z > t) \leq e^{-t^2/2}$ for $t > 0$ follows from the Chernoff method on the Gaussian MGF and is what underlies sub-Gaussian concentration. The polynomial factor makes the Gaussian tail tighter than the bare sub-Gaussian bound by a factor of about $\frac{1}{t\sqrt{2\pi}}$ — relevant in high-dimensional probability and union-bound calculations.
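A numeric check of the Mills-ratio sandwich against SciPy's exact tail, with arbitrary thresholds $t$:

```python
import numpy as np
from scipy import stats

# Compare the Mills-ratio bounds and the Chernoff bound exp(-t^2/2)
# against the exact Gaussian tail P(Z > t).
for t in [1.0, 2.0, 4.0]:
    phi = np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)  # standard normal density at t
    lower = (1 / t - 1 / t**3) * phi
    upper = phi / t
    exact = stats.norm.sf(t)
    chernoff = np.exp(-t**2 / 2)
    print(f"t={t}: {lower:.2e} <= {exact:.2e} <= {upper:.2e}; Chernoff {chernoff:.2e}")
```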
The normalizing constant ensures $\int_{-\infty}^{\infty} f(x)\,dx = 1$. This requires the Gaussian integral $\int_{-\infty}^{\infty} e^{-x^2/2}\,dx = \sqrt{2\pi}$.
Multivariate Gaussian
For $x \in \mathbb{R}^d$ with mean vector $\mu$ and positive definite covariance matrix $\Sigma$:
PDF: $f(x) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$
Key properties:
- Marginals are Gaussian: if $X \sim \mathcal{N}(\mu, \Sigma)$, any subset of coordinates is also multivariate Gaussian.
- For jointly Gaussian distributions, uncorrelated implies independent. Marginal Gaussianity alone does not suffice. A standard counterexample: $X \sim \mathcal{N}(0, 1)$ and $Y = X$ if $|X| \leq c$, $Y = -X$ otherwise. Both marginals are Gaussian and the pair is uncorrelated for an appropriate $c$, yet $X$ and $Y$ are dependent (since $|Y| = |X|$).
- Affine transformations preserve Gaussianity: $X \sim \mathcal{N}(\mu, \Sigma) \Rightarrow AX + b \sim \mathcal{N}(A\mu + b, A \Sigma A^\top)$.
When it arises: Central limit theorem, Bayesian linear regression posterior, Gaussian processes, VAE latent spaces.
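A minimal sketch of the affine-closure property in use: sampling $\mathcal{N}(\mu, \Sigma)$ as $x = \mu + Lz$ with $LL^\top = \Sigma$ (Cholesky), assuming NumPy; the particular $\mu$ and $\Sigma$ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
L = np.linalg.cholesky(Sigma)          # lower-triangular, L @ L.T == Sigma

z = rng.standard_normal((100_000, 2))  # rows are standard normal vectors
x = mu + z @ L.T                       # affine map: each row is mu + L z_i

print(np.mean(x, axis=0))              # ~ mu
print(np.cov(x, rowvar=False))         # ~ Sigma
```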
Exponential Distribution
Models waiting times between events in a Poisson process with rate $\lambda > 0$.
PDF: $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$
Mean: $\mathbb{E}[X] = 1/\lambda$
Variance: $\mathrm{Var}(X) = 1/\lambda^2$
Key property: Memoryless: $P(X > s + t \mid X > s) = P(X > t)$. This is the only continuous distribution with this property.
Relationship: $\mathrm{Exponential}(\lambda) = \mathrm{Gamma}(1, \lambda)$.
Gamma Distribution
Generalizes the Exponential. Shape $\alpha > 0$, rate $\beta > 0$.
PDF: $f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$ for $x > 0$
where $\Gamma(\alpha) = \int_0^\infty t^{\alpha - 1} e^{-t}\,dt$ is the Gamma function.
Mean: $\mathbb{E}[X] = \alpha/\beta$
Variance: $\mathrm{Var}(X) = \alpha/\beta^2$
Special cases:
- $\mathrm{Gamma}(1, \lambda) = \mathrm{Exponential}(\lambda)$ and $\mathrm{Gamma}(k/2, 1/2) = \chi^2_k$
- Sum of $n$ independent $\mathrm{Exponential}(\lambda)$ variables is $\mathrm{Gamma}(n, \lambda)$
When it arises: Conjugate prior for the Poisson rate and the precision (inverse variance) of a Gaussian. Models positive quantities with skew.
Beta Distribution
Defined on $[0, 1]$ with shape parameters $\alpha, \beta > 0$.
PDF: $f(x) = \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)}$ for $x \in [0, 1]$
where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$.
Mean: $\mathbb{E}[X] = \frac{\alpha}{\alpha + \beta}$
Variance: $\mathrm{Var}(X) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$
Key fact: The Beta is the conjugate prior for the Bernoulli/Binomial parameter $p$. If $p \sim \mathrm{Beta}(\alpha, \beta)$ and $X_1, \dots, X_n \sim \mathrm{Bernoulli}(p)$ with $k$ successes, then $p \mid X \sim \mathrm{Beta}(\alpha + k, \beta + n - k)$.
Special cases: $\mathrm{Beta}(1, 1) = \mathrm{Uniform}(0, 1)$.
Chi-Squared Distribution
The sum of independent squared standard normals. Degrees of freedom $k$.
PDF: $f(x) = \frac{1}{2^{k/2}\Gamma(k/2)} x^{k/2 - 1} e^{-x/2}$ for $x > 0$
Mean: $\mathbb{E}[X] = k$
Variance: $\mathrm{Var}(X) = 2k$
MGF: $M_X(t) = (1 - 2t)^{-k/2}$ for $t < 1/2$, with $M_X(t) = \infty$ for $t \geq 1/2$. This is the half-line MGF that drives every tail bound. Applying the Chernoff method and optimizing over $t$ gives the standard concentration result: for $X \sim \chi^2_k$ and $\epsilon \in (0, 1)$, $P(|X - k| \geq \epsilon k) \leq 2 e^{-k\epsilon^2/8}$, which sharpens to the Laurent-Massart bound $P(X - k \geq 2\sqrt{kt} + 2t) \leq e^{-t}$. These bounds are what appear in the random-projection lemma (Lemma B.12 in Shalev-Shwartz & Ben-David), the Johnson-Lindenstrauss lemma, and PCA perturbation analyses.
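A quick SciPy check that the Laurent-Massart bound dominates the exact $\chi^2_k$ tail; $k$ and the values of $t$ are arbitrary illustrations:

```python
import numpy as np
from scipy import stats

# Laurent-Massart: P(X - k >= 2*sqrt(k*t) + 2*t) <= exp(-t) for X ~ chi^2_k.
k = 100
for t in [1.0, 5.0, 10.0]:
    threshold = k + 2 * np.sqrt(k * t) + 2 * t
    exact = stats.chi2.sf(threshold, df=k)
    print(f"t={t}: exact tail {exact:.3e} <= bound {np.exp(-t):.3e}")
```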
Relationship: $\chi^2_k = \mathrm{Gamma}(k/2, 1/2)$.
If $X \sim \chi^2_m$ and $Y \sim \chi^2_n$ independently, then $X + Y \sim \chi^2_{m + n}$.
When it arises: Hypothesis testing (goodness-of-fit, likelihood ratio tests), analysis of variance. The sample variance of Gaussian data is proportional to a $\chi^2_{n-1}$. Important in concentration: the $\chi^2$ is sub-exponential but not sub-Gaussian, because the MGF blows up at $t = 1/2$ rather than being defined on all of $\mathbb{R}$.
Student-t Distribution
Arises when estimating the mean of a Gaussian population with unknown variance. Degrees of freedom $\nu$.
PDF: $f(x) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu + 1}{2}}$
Mean: $0$ for $\nu > 1$ (undefined for $\nu \leq 1$)
Variance: $\frac{\nu}{\nu - 2}$ for $\nu > 2$
Construction: If $Z \sim \mathcal{N}(0, 1)$ and $V \sim \chi^2_\nu$ are independent, then $T = \frac{Z}{\sqrt{V/\nu}} \sim t_\nu$.
Key fact: Heavier tails than the Gaussian. As $\nu \to \infty$, $t_\nu \to \mathcal{N}(0, 1)$. For small $\nu$, the heavy tails accommodate outliers. This is why the Student-t is used in robust statistics.
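A short SciPy comparison of $t_\nu$ and Gaussian tail probabilities at a fixed threshold (the threshold 4 is arbitrary):

```python
from scipy import stats

# Heavier tails: P(|T| > 4) for t_nu vs. the Gaussian, across degrees of freedom.
for nu in [1, 3, 10, 100]:
    print(f"nu={nu:3d}: P(|T|>4) = {2 * stats.t.sf(4, df=nu):.2e}")
print(f"normal: P(|Z|>4) = {2 * stats.norm.sf(4):.2e}")  # far smaller
```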
F Distribution
The ratio of two independent scaled chi-squared variables.
Construction: If $U \sim \chi^2_{d_1}$ and $V \sim \chi^2_{d_2}$ are independent, then $\frac{U/d_1}{V/d_2} \sim F_{d_1, d_2}$.
Mean: $\frac{d_2}{d_2 - 2}$ for $d_2 > 2$
When it arises: ANOVA, comparing two model fits (F-test for nested models), testing whether a group of regression coefficients is jointly zero.
Relationship: If $T \sim t_\nu$, then $T^2 \sim F_{1, \nu}$.
Cauchy Distribution
Location $x_0$, scale $\gamma > 0$.
PDF: $f(x) = \frac{1}{\pi\gamma\left[1 + \left(\frac{x - x_0}{\gamma}\right)^2\right]}$
Mean: Does not exist (the integral diverges).
Variance: Does not exist.
Key fact: The Cauchy is $t_1$, the Student-t with one degree of freedom. It is the standard example of a distribution with no mean. The sample mean of $n$ i.i.d. Cauchy variables has the same distribution as a single observation. The law of large numbers does not apply. This is why finite moments matter for concentration inequalities.
Dirichlet Distribution
The multivariate generalization of the Beta, defined on the $(K-1)$-simplex $\{p \in \mathbb{R}^K : p_k \geq 0, \sum_k p_k = 1\}$.
PDF: $f(p) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^{K} p_k^{\alpha_k - 1}$
Marginals: Each $p_k \sim \mathrm{Beta}\left(\alpha_k, \sum_{j \neq k} \alpha_j\right)$.
Mean: $\mathbb{E}[p_k] = \frac{\alpha_k}{\sum_j \alpha_j}$
Key fact: Conjugate prior for the Multinomial/Categorical. If $p \sim \mathrm{Dir}(\alpha)$ and $X \sim \mathrm{Multinomial}(n, p)$ with counts $(x_1, \dots, x_K)$, then $p \mid X \sim \mathrm{Dir}(\alpha_1 + x_1, \dots, \alpha_K + x_K)$.
When it arises: Latent Dirichlet Allocation (LDA), Bayesian multi-class models, any model requiring a prior over probability vectors.
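A minimal sketch of the conjugate update, assuming NumPy; `dirichlet_posterior` and the example counts are illustrative:

```python
import numpy as np

def dirichlet_posterior(alpha, counts):
    """Conjugate update: Dir(alpha) prior + Multinomial counts -> Dir(alpha + counts)."""
    return np.asarray(alpha, float) + np.asarray(counts, float)

alpha = np.ones(3)               # symmetric Dir(1,1,1): uniform on the simplex
counts = np.array([50, 30, 20])  # observed category counts
post = dirichlet_posterior(alpha, counts)
print(post)                      # [51. 31. 21.]
print(post / post.sum())         # posterior mean, close to empirical frequencies
```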
Relationships Between Distributions
The major distributions form a web of connections:
- Bernoulli → Binomial: sum of $n$ i.i.d. Bernoullis
- Bernoulli → Categorical: $K$-outcome generalization of single-trial Bernoulli (softmax outputs)
- Categorical → Multinomial: sum of $n$ i.i.d. Categoricals
- Geometric → Negative Binomial: sum of $r$ i.i.d. Geometrics (waiting time for $r$-th success)
- Binomial → Poisson: limit as $n \to \infty$, $p \to 0$, $np \to \lambda$
- Exponential → Gamma: $\mathrm{Exponential}(\lambda) = \mathrm{Gamma}(1, \lambda)$; sum of $n$ Exponentials is $\mathrm{Gamma}(n, \lambda)$
- Gamma → Chi-squared: $\chi^2_k = \mathrm{Gamma}(k/2, 1/2)$
- Gaussian → Chi-squared: sum of $k$ squared standard normals
- Gaussian + Chi-squared → Student-t: ratio construction
- Chi-squared + Chi-squared → F: ratio of two scaled chi-squareds
- Beta → Bernoulli: conjugate prior relationship
- Beta → Dirichlet: multivariate generalization
- Dirichlet → Multinomial: conjugate prior relationship
- Gaussian + Gaussian prior → Gaussian posterior: self-conjugacy
The distributions of sorted samples from these families (the $k$-th smallest value from $n$ draws) are studied in order statistics, which connects to extreme value theory and nonparametric inference.
Main Theorems
The Gaussian Normalizing Constant
Statement
The Gaussian integral evaluates to:
$\int_{-\infty}^{\infty} e^{-x^2/2}\,dx = \sqrt{2\pi}$
Consequently, $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ integrates to 1 and is a valid PDF.
Intuition
The $\sqrt{2\pi}$ in the Gaussian PDF is not arbitrary. It is the only constant that makes the total probability equal to 1. Without it, the "bell curve" would not be a proper probability distribution.
Proof Sketch
Let $I = \int_{-\infty}^{\infty} e^{-x^2/2}\,dx$. Then:
$I^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2 + y^2)/2}\,dx\,dy$
Switch to polar coordinates: $x = r\cos\theta$, $y = r\sin\theta$, $dx\,dy = r\,dr\,d\theta$:
$I^2 = \int_0^{2\pi}\int_0^{\infty} e^{-r^2/2}\, r\,dr\,d\theta = 2\pi\left[-e^{-r^2/2}\right]_0^{\infty} = 2\pi$
Therefore $I = \sqrt{2\pi}$.
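A numeric confirmation of the integral, assuming SciPy:

```python
import numpy as np
from scipy import integrate

# Numerically confirm the Gaussian integral: int exp(-x^2/2) dx = sqrt(2*pi).
value, _ = integrate.quad(lambda x: np.exp(-x**2 / 2), -np.inf, np.inf)
print(value, np.sqrt(2 * np.pi))  # both ~ 2.5066282746
```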
Why It Matters
This is one of the most important calculations in all of probability. The polar coordinates trick is a canonical proof technique. Understanding where $\sqrt{2\pi}$ comes from demystifies the Gaussian and explains why it appears in entropy formulas, the CLT, and information-theoretic quantities.
Failure Mode
Students sometimes assume the Gaussian integrates to 1 "by definition" without understanding the calculation. This leads to confusion when computing marginals of multivariate Gaussians or when normalizing constants matter (e.g., in Bayesian inference and partition functions).
Canonical Examples
Beta-Bernoulli conjugacy in action
Suppose you observe $k$ heads in $n$ coin flips. With a $\mathrm{Beta}(1, 1)$ prior (uniform on $[0, 1]$), the posterior is $\mathrm{Beta}(k + 1, n - k + 1)$.
The posterior mean is $\frac{k + 1}{n + 2}$, slightly shrunk from the MLE of $k/n$ toward $1/2$. The posterior mode is $k/n$, which equals the MLE. With a stronger prior $\mathrm{Beta}(a, a)$ with $a > 1$, the posterior would be $\mathrm{Beta}(k + a, n - k + a)$ with mean $\frac{k + a}{n + 2a}$, with more shrinkage toward the prior mean of $1/2$.
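A sketch of this computation with concrete numbers, assuming SciPy; 7 heads in 10 flips is an arbitrary illustration, and `beta_posterior` is an illustrative helper:

```python
from scipy import stats

def beta_posterior(a, b, k, n):
    """Conjugate update: Beta(a, b) prior + k successes in n trials."""
    return a + k, b + n - k

k, n = 7, 10                            # illustrative data: 7 heads in 10 flips
a, b = beta_posterior(1, 1, k, n)       # uniform prior -> Beta(8, 4)
print(a / (a + b))                      # posterior mean (k+1)/(n+2) ~ 0.667
print((a - 1) / (a + b - 2))            # posterior mode k/n = 0.7, the MLE
print(stats.beta.interval(0.95, a, b))  # a 95% credible interval for p
```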
Chi-squared as sum of squared normals
If $Z_1, \dots, Z_k \sim \mathcal{N}(0, 1)$ independently, then $X = \sum_{i=1}^{k} Z_i^2 \sim \chi^2_k$ with $\mathbb{E}[X] = k$ and $\mathrm{Var}(X) = 2k$.
Note: each $Z_i^2$ has mean 1 and variance 2. The $\chi^2$ is sub-exponential but not sub-Gaussian because $Z_i^2$ is a product of two sub-Gaussian variables (and products of sub-Gaussians are sub-exponential).
Common Confusions
The normalizing constant matters
Students sometimes write the Gaussian PDF without $\frac{1}{\sigma\sqrt{2\pi}}$ or confuse it with $\frac{1}{\sqrt{2\pi}}$ (forgetting the $\sigma$). The full constant is $\frac{1}{\sigma\sqrt{2\pi}}$. Getting this wrong means your PDF does not integrate to 1, your log-likelihoods are wrong, and your MLE derivations break. In the multivariate case, the constant involves $(2\pi)^{d/2}$ and $|\Sigma|^{1/2}$.
Gamma parameterization varies between sources
Some sources use the rate parameterization $\mathrm{Gamma}(\alpha, \beta)$ with PDF proportional to $x^{\alpha - 1} e^{-\beta x}$, while others use the scale parameterization $\mathrm{Gamma}(\alpha, \theta)$ with PDF proportional to $x^{\alpha - 1} e^{-x/\theta}$ where $\theta = 1/\beta$. Always check which convention a textbook or library uses. NumPy and SciPy use the scale parameterization. This page uses rate.
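A sketch of translating between the conventions, assuming SciPy and NumPy; the particular $\alpha$ and $\beta$ are arbitrary:

```python
import numpy as np
from scipy import stats

# SciPy's gamma takes shape a and *scale* theta; a rate-beta Gamma(alpha, beta)
# is stats.gamma(a=alpha, scale=1/beta). Check via the mean alpha/beta.
alpha, beta = 3.0, 2.0
dist = stats.gamma(a=alpha, scale=1 / beta)
print(dist.mean(), alpha / beta)     # both 1.5
print(dist.var(), alpha / beta**2)   # both 0.75

# Same scale convention in NumPy's sampler:
rng = np.random.default_rng(0)
samples = rng.gamma(shape=alpha, scale=1 / beta, size=100_000)
print(samples.mean())                # ~ 1.5
```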
The Cauchy has no mean, not a mean of zero
The standard Cauchy distribution is symmetric about zero, so you might think its mean is zero. It is not. The mean does not exist because $\int_{-\infty}^{\infty} \frac{|x|}{\pi(1 + x^2)}\,dx = \infty$. The "center" of the Cauchy is the median (and mode), which is $x_0$, but the expectation is undefined.
Summary
- Bernoulli/Binomial/Categorical/Multinomial: discrete counts; Bernoulli MGF underlies Chernoff bounds; Categorical is what softmax parameterizes; cross-entropy is the Categorical NLL
- Geometric/Negative Binomial: waiting times for first or $r$-th success; coupon collector and overdispersed counts
- Poisson: counts with mean = variance; approximates Binomial for rare events
- Gaussian: the universal limit (CLT); maximizes entropy for given mean and variance
- Exponential/Gamma: positive continuous; memoryless property for Exponential
- Beta/Dirichlet: priors on probabilities; conjugate to Bernoulli/Multinomial
- Chi-squared: sum of squared normals; MGF for $t < 1/2$ drives concentration via Laurent-Massart; appears in random projections (Johnson-Lindenstrauss), variance estimates, and likelihood-ratio tests
- Student-t: Gaussian with unknown variance; heavier tails accommodate outliers
- Cauchy: the pathological case: no moments, LLN fails
Exercises
Problem
If $X \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$ independently and $a, b \in \mathbb{R}$, what is the distribution of $X + Y$? What about $aX + b$?
Problem
Show that $\mathrm{Beta}(1, 1) = \mathrm{Uniform}(0, 1)$ by writing out the Beta PDF with $\alpha = \beta = 1$.
Problem
Derive the chi-squared MGF $M_X(t) = (1 - 2t)^{-k/2}$ for $X \sim \chi^2_k$ and $t < 1/2$ from the standard normal distribution, and explain why it diverges at $t = 1/2$.
Problem
Show that if $Z \sim \mathcal{N}(0, 1)$ and $V \sim \chi^2_\nu$ are independent, then $\frac{Z}{\sqrt{V/\nu}} \sim t_\nu$.
References
Canonical:
- Casella & Berger, Statistical Inference (2002), Chapters 3-5
- DeGroot & Schervish, Probability and Statistics (2012), Chapters 5-6
Current:
- Murphy, Probabilistic Machine Learning: An Introduction (2022), Sections 2.2-2.7
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 2
- Durrett, Probability: Theory and Examples 5th ed (2019), Chapters 2-3
- Billingsley, Probability and Measure 3rd ed (1995), Chapters 20-21
Next Topics
Building on these distribution foundations:
- Concentration inequalities: bounding tail probabilities of sums
- Common inequalities: the algebraic tools that connect distributions to bounds
- Empirical risk minimization: where distributional assumptions meet learning
Last reviewed: April 23, 2026