

Common Probability Distributions

The essential probability distributions for ML: Bernoulli through Dirichlet, with PMFs/PDFs, moments, relationships, and when each arises in practice.


Why This Matters

Every model you will ever train makes distributional assumptions, whether explicitly or implicitly. Logistic regression assumes Bernoulli responses. Linear regression with squared loss assumes Gaussian noise. Bayesian methods require choosing priors: Beta, Gamma, Dirichlet. If you cannot write down the PMF or PDF, compute the mean and variance, and explain when a distribution arises, you will be guessing instead of reasoning.

The Greek letters \mu and \sigma name the canonical mean and standard deviation, and the symbol \sim reads "is distributed as".

This page is a reference you will return to repeatedly. Know these cold.

[Interactive plot: the standard normal density f(x), with adjustable mean and variance.]

Mental Model

Distributions fall into two families by support:

  • Discrete (PMF): Bernoulli, Binomial, Categorical, Multinomial, Geometric, Negative Binomial, Poisson
  • Continuous (PDF): Uniform, Gaussian, Exponential, Gamma, Beta, Chi-squared, Student-t, F, Cauchy, Dirichlet

Many are related: the Binomial is a sum of Bernoullis, the Exponential is a special Gamma, the Chi-squared is a special Gamma, the Beta is conjugate to the Bernoulli, and the Dirichlet generalizes the Beta.

Quick Reference

Family | Support | Parameters | Reach for it when | Core relationship
Bernoulli / Binomial / Categorical / Multinomial | finite counts | probability or probability vector, trial count | labels, counts, categorical outcomes, softmax outputs | Beta and Dirichlet conjugacy
Geometric / Negative Binomial | positive integers | success probability p, count r | waiting times to first or r-th success, overdispersed counts | NB is a sum of r Geometrics
Poisson | nonnegative integers | rate \lambda | rare-event counts in a fixed window | Binomial rare-event limit
Gaussian / Multivariate Gaussian | \mathbb{R} or \mathbb{R}^d | mean and covariance | additive noise, linear models, CLT approximations | affine transforms stay Gaussian
Exponential / Gamma | positive reals | rate / shape-rate | waiting times, positive skew, arrival processes | Exponential is Gamma with shape 1
Beta / Dirichlet | [0,1] or simplex | positive shape parameters | priors on probabilities | conjugate to Bernoulli/Binomial and Multinomial
Chi-squared / Student-t / F / Cauchy | positive or all reals | degrees of freedom, scale | variance inference, robust tails, ratio constructions | built from Gaussian and Gamma / chi-squared pieces

Discrete Distributions

Definition

Bernoulli Distribution

A single binary trial with success probability p \in [0, 1].

PMF: P(X = k) = p^k (1-p)^{1-k} for k \in \{0, 1\}

Mean: \mathbb{E}[X] = p

Variance: \text{Var}(X) = p(1 - p), maximized at p = 1/2 with value 1/4.

MGF: \mathbb{E}[e^{tX}] = 1 - p + p e^t for all t \in \mathbb{R}.

When it arises: Binary classification labels, coin flips, any yes/no outcome. The log-likelihood of Bernoulli data gives rise to the cross-entropy loss used in logistic regression. The Bernoulli MGF is the starting point of the Chernoff bound: applying Markov's inequality to e^{tS_n} for a sum of independent Bernoullis gives the canonical exponential tail bound on a binomial sum.
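The moment formulas above can be checked directly from the PMF; a minimal sketch (the helper name `bernoulli_nll` is ad hoc, not a library function) also makes the cross-entropy connection concrete:

```python
import math

def bernoulli_nll(y, p):
    # Negative log-likelihood of label y in {0, 1} under Bernoulli(p);
    # term for term, this is the binary cross-entropy loss.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

p = 0.3
pmf = lambda k: p**k * (1 - p) ** (1 - k)
mean = sum(k * pmf(k) for k in (0, 1))               # p
var = sum((k - mean) ** 2 * pmf(k) for k in (0, 1))  # p(1 - p)
```

For p = 0.3 this gives mean 0.3 and variance 0.21, matching p and p(1 - p).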

Definition

Binomial Distribution

The number of successes in n independent Bernoulli(p) trials.

PMF: P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} for k = 0, 1, \ldots, n

Mean: \mathbb{E}[X] = np

Variance: \text{Var}(X) = np(1 - p)

MGF: \mathbb{E}[e^{tX}] = (1 - p + p e^t)^n for all t \in \mathbb{R} (the n-th power of the Bernoulli MGF).

Key facts:

  • If X_1, \ldots, X_n \sim \text{Bern}(p) independently, then \sum_i X_i \sim \text{Bin}(n, p).
  • As n \to \infty with np \to \lambda, the Binomial converges to \text{Poisson}(\lambda).
  • For fixed p, the standardized sum (X - np)/\sqrt{np(1-p)} \Rightarrow \mathcal{N}(0, 1) by the central limit theorem. The error is O(1/\sqrt{np(1-p)}) by Berry-Esseen.

When it arises: Empirical risk on n samples with 0-1 loss is \text{Bin}(n, p)/n where p is the true error rate. Slud's inequality lower-bounds binomial tails and is used in VC sample-complexity lower bounds.
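As a sanity check on the PMF and moments, a short sketch evaluates the formula with `math.comb` and confirms it normalizes and has mean np and variance np(1-p):

```python
import math

def binom_pmf(k, n, p):
    # P(X = k) for X ~ Bin(n, p)
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3
probs = [binom_pmf(k, n, p) for k in range(n + 1)]
total = sum(probs)                                          # 1
mean = sum(k * q for k, q in enumerate(probs))              # np = 3
var = sum((k - mean) ** 2 * q for k, q in enumerate(probs)) # np(1-p) = 2.1
```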

Definition

Categorical Distribution

A single trial with K possible outcomes, where outcome k has probability p_k \geq 0 and \sum_k p_k = 1. Generalizes the Bernoulli (recovered when K = 2).

PMF: P(X = k) = p_k for k \in \{1, \ldots, K\}.

Mean of the one-hot encoding \mathbf{e}_X \in \{0,1\}^K: \mathbb{E}[\mathbf{e}_X] = (p_1, \ldots, p_K).

When it arises: Multi-class classification labels. The softmax output of a neural network parameterizes a Categorical: p_k = e^{z_k}/\sum_j e^{z_j} where z = Wx + b are the logits. Cross-entropy loss is -\log p_y where y is the true class; this is the Categorical negative log-likelihood. Language models predict a Categorical over vocabulary tokens at every position. A Multinomial is the sum of n independent Categoricals with the same parameter vector.
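The softmax-to-Categorical link can be made concrete in a few lines (a sketch with ad hoc function names; the max subtraction is the standard numerical-stability trick):

```python
import math

def softmax(z):
    m = max(z)                        # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(z, y):
    # Categorical negative log-likelihood of true class y given logits z
    return -math.log(softmax(z)[y])

z = [2.0, 0.5, -1.0]
p = softmax(z)                # valid Categorical parameters: nonnegative, sum to 1
loss = cross_entropy(z, 0)    # equals -log p[0]
```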

Definition

Poisson Distribution

Models the count of events in a fixed interval when events occur independently at a constant rate \lambda > 0.

PMF: P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} for k = 0, 1, 2, \ldots

Mean: \mathbb{E}[X] = \lambda

Variance: \text{Var}(X) = \lambda

MGF: \mathbb{E}[e^{tX}] = \exp(\lambda(e^t - 1)) for all t \in \mathbb{R}.

When it arises: Count data: number of clicks, arrivals, mutations. The mean equals the variance; if your count data has variance much larger than the mean, the Poisson model is wrong (use the negative binomial instead). The Poisson MGF gives sub-exponential tail bounds; for large deviations, use the Bennett or Bernstein inequality rather than a Gaussian approximation.
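The Binomial-to-Poisson limit is easy to see numerically: hold np = \lambda fixed, grow n, and watch the PMFs converge (illustrative sketch):

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 2.0
gaps = []
for n in (10, 100, 10000):
    p = lam / n   # rare events: p shrinks as n grows, np stays at lam
    gaps.append(max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam))
                    for k in range(9)))
# gaps shrink as n grows: the rare-event limit in action
```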

Definition

Geometric Distribution

The number of trials until the first success in independent Bernoulli(p) trials. (Convention: X includes the successful trial.)

PMF: P(X = k) = (1-p)^{k-1} p for k = 1, 2, 3, \ldots

Mean: \mathbb{E}[X] = 1/p

Variance: \text{Var}(X) = (1-p)/p^2

Key property: The Geometric distribution is the only discrete distribution with the memoryless property: P(X > s + t \mid X > s) = P(X > t).

When it arises in ML: Coupon-collector arguments (the expected time to see all K classes from a uniform stream is K H_K \approx K \log K); PAC sample-complexity lower bounds; the geometric series bounds in TD-learning convergence.
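Both the memoryless property and the mean follow from the tail P(X > t) = (1-p)^t, which a short sketch can verify numerically:

```python
p = 0.3
tail = lambda t: (1 - p) ** t        # P(X > t) for Geometric(p) on {1, 2, ...}

s, t = 4, 6
lhs = tail(s + t) / tail(s)          # P(X > s + t | X > s)
rhs = tail(t)                        # equal: the memoryless property

# Mean from the PMF (series truncated where terms are negligible): ≈ 1/p
mean = sum(k * (1 - p) ** (k - 1) * p for k in range(1, 2000))
```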

Definition

Negative Binomial Distribution

The number of trials until the r-th success in independent Bernoulli(p) trials. Generalizes the Geometric (recovered when r = 1).

PMF: P(X = k) = \binom{k-1}{r-1} p^r (1-p)^{k-r} for k = r, r+1, \ldots

Mean: \mathbb{E}[X] = r/p

Variance: \text{Var}(X) = r(1-p)/p^2

Key fact: The variance-to-mean ratio is (1-p)/p under this trial-counting convention (it is 1/p if you count only failures), so for small p the negative binomial is a natural overdispersed alternative to the Poisson. Sums of r independent \text{Geom}(p) variables are \text{NB}(r, p).

When it arises: Count data with overdispersion (variance exceeding the mean), where the Poisson assumption fails. Common in NLP word-frequency models and biological sequencing data.
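Computing the moments directly from the PMF confirms the overdispersion (a sketch; the series is truncated where the terms are negligible):

```python
import math

def nb_pmf(k, r, p):
    # P(X = k): trial k delivers the r-th success
    return math.comb(k - 1, r - 1) * p**r * (1 - p) ** (k - r)

r, p = 3, 0.4
ks = range(r, 500)
mean = sum(k * nb_pmf(k, r, p) for k in ks)              # r/p = 7.5
var = sum((k - mean) ** 2 * nb_pmf(k, r, p) for k in ks) # r(1-p)/p^2 = 11.25
ratio = var / mean                                       # (1-p)/p = 1.5 > 1
```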

Definition

Multinomial Distribution

The multivariate generalization of the Binomial. In n trials, each of K categories occurs with probability p_k, and X_k counts category k.

PMF: P(X_1 = x_1, \ldots, X_K = x_K) = \frac{n!}{x_1! \cdots x_K!} p_1^{x_1} \cdots p_K^{x_K}

where \sum_k x_k = n and \sum_k p_k = 1.

Marginals: Each X_k \sim \text{Bin}(n, p_k).

Relationship: A Multinomial is the sum of n i.i.d. Categorical(p_1, \ldots, p_K) trials. The single-trial case (n = 1) is the Categorical distribution defined above; this is what a softmax layer parameterizes.

When it arises: Bag-of-words models, topic models, document language models with n-token contexts. Empirical class frequencies in a multi-class classifier with n examples are \text{Mult}(n, p)/n.
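Summing the joint PMF over all but one coordinate recovers the Binomial marginal stated above; for small n and K = 3 this is checkable by brute force (illustrative sketch):

```python
import math

def mult_pmf(xs, n, ps):
    # Multinomial PMF for counts xs summing to n
    coef = math.factorial(n)
    for x in xs:
        coef //= math.factorial(x)
    out = float(coef)
    for x, p in zip(xs, ps):
        out *= p**x
    return out

n, ps = 4, (0.5, 0.3, 0.2)
# Marginal of X_1: sum the joint PMF over the other counts
marg = [sum(mult_pmf((x1, x2, n - x1 - x2), n, ps) for x2 in range(n - x1 + 1))
        for x1 in range(n + 1)]
binom = [math.comb(n, k) * 0.5**k * 0.5 ** (n - k) for k in range(n + 1)]
# marg and binom agree entrywise: X_1 ~ Bin(4, 0.5)
```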

Continuous Distributions

Definition

Uniform Distribution

Constant density on the interval [a, b].

PDF: f(x) = \frac{1}{b - a} for x \in [a, b], zero otherwise.

Mean: \mathbb{E}[X] = \frac{a + b}{2}

Variance: \text{Var}(X) = \frac{(b - a)^2}{12}

When it arises: Maximum entropy distribution on a bounded interval with no other constraints. Used in random initialization, random search, and as the base distribution for inverse transform sampling.

Definition

Gaussian (Normal) Distribution

The most important continuous distribution. Parameterized by mean \mu and variance \sigma^2 > 0.

PDF: f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

Mean: \mathbb{E}[X] = \mu

Variance: \text{Var}(X) = \sigma^2

MGF: \mathbb{E}[e^{tX}] = \exp(\mu t + \sigma^2 t^2 / 2) for all t \in \mathbb{R}.

Standard normal moments (for Z \sim \mathcal{N}(0,1)): \mathbb{E}[Z^2] = 1, \mathbb{E}[Z^4] = 3, and in general \mathbb{E}[Z^{2k}] = (2k-1)!! = (2k)!/(2^k k!) for k \geq 0 (odd moments vanish by symmetry). The fact that \mathbb{E}[Z^4] = 3 is what makes \text{Var}(Z^2) = \mathbb{E}[Z^4] - 1 = 2, the variance of a \chi^2(1) summand.

Tail bound (Mills ratio): For Z \sim \mathcal{N}(0, 1) and t > 0,

\frac{1}{t + 1/t} \cdot \frac{e^{-t^2/2}}{\sqrt{2\pi}} \leq \mathbb{P}(Z > t) \leq \frac{1}{t} \cdot \frac{e^{-t^2/2}}{\sqrt{2\pi}}.

The simpler bound \mathbb{P}(Z > t) \leq e^{-t^2/2} for t > 0 follows from the Chernoff method on the Gaussian MGF and is what underlies sub-Gaussian concentration. The polynomial factor 1/t makes the Gaussian tail tighter than the bare sub-Gaussian bound by a factor of 1/t, which is relevant in high-dimensional probability and union-bound calculations.

The normalizing constant 1/\sqrt{2\pi\sigma^2} ensures \int_{-\infty}^{\infty} f(x)\,dx = 1. This requires the Gaussian integral \int_{-\infty}^{\infty} e^{-x^2/2}\,dx = \sqrt{2\pi}.
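The Mills ratio sandwich can be checked against the exact Gaussian tail, which the standard library exposes through the complementary error function (a short sketch):

```python
import math

def normal_tail(t):
    # Exact P(Z > t) for Z ~ N(0, 1), via the complementary error function
    return 0.5 * math.erfc(t / math.sqrt(2))

for t in (0.5, 1.0, 2.0, 4.0):
    dens = math.exp(-t * t / 2) / math.sqrt(2 * math.pi)
    lower = dens / (t + 1 / t)
    upper = dens / t
    assert lower <= normal_tail(t) <= upper   # Mills ratio sandwich
```

At t = 4 the two sides already differ by only a few percent, showing how tight the sandwich becomes in the tail.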

Definition

Multivariate Gaussian

For X \in \mathbb{R}^d with mean vector \mu \in \mathbb{R}^d and positive definite covariance matrix \Sigma \in \mathbb{R}^{d \times d}:

PDF: f(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)

Key properties:

  • Marginals are Gaussian: if X \sim \mathcal{N}(\mu, \Sigma), any subset of coordinates is also multivariate Gaussian.
  • For jointly Gaussian distributions, uncorrelated implies independent. Marginal Gaussianity alone does not suffice. A standard counterexample: X \sim \mathcal{N}(0, 1) and Y = X \cdot \mathbb{1}\{|X| \leq c\} - X \cdot \mathbb{1}\{|X| > c\}. Both marginals are Gaussian and the pair is uncorrelated for an appropriate c, yet X and Y are dependent (since |X| = |Y|).
  • Affine transformations preserve Gaussianity: AX + b \sim \mathcal{N}(A\mu + b, A\Sigma A^\top).

When it arises: Central limit theorem, Bayesian linear regression posterior, Gaussian processes, VAE latent spaces.
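The affine-transformation property can be verified by simulation: push samples of X through an arbitrary affine map and compare the empirical mean and covariance of Y = AX + b against A\mu + b and A\Sigma A^\top (a NumPy sketch with arbitrary example matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
A = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])   # maps R^2 -> R^3
b = np.array([0.5, 0.0, -1.0])

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T + b                                       # affine image of X

mean_err = np.abs(Y.mean(axis=0) - (A @ mu + b)).max()
cov_err = np.abs(np.cov(Y.T) - A @ Sigma @ A.T).max()
# both errors are small: Y ~ N(A mu + b, A Sigma A^T)
```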

Definition

Exponential Distribution

Models waiting times between events in a Poisson process with rate \lambda > 0.

PDF: f(x) = \lambda e^{-\lambda x} for x \geq 0

Mean: \mathbb{E}[X] = 1/\lambda

Variance: \text{Var}(X) = 1/\lambda^2

Key property: Memoryless: P(X > s + t \mid X > s) = P(X > t). This is the only continuous distribution with this property.

Relationship: \text{Exp}(\lambda) = \text{Gamma}(1, \lambda).
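The memoryless property falls straight out of the survival function P(X > x) = e^{-\lambda x} (a two-line sketch):

```python
import math

lam = 1.5
surv = lambda x: math.exp(-lam * x)   # P(X > x) for Exp(lam)

s, t = 0.7, 2.0
# P(X > s + t | X > s) equals P(X > t): the memoryless property
assert abs(surv(s + t) / surv(s) - surv(t)) < 1e-12
```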

Definition

Gamma Distribution

Generalizes the Exponential. Shape \alpha > 0, rate \beta > 0.

PDF: f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x} for x > 0

where \Gamma(\alpha) = \int_0^\infty t^{\alpha-1} e^{-t}\,dt is the Gamma function.

Mean: \mathbb{E}[X] = \alpha/\beta

Variance: \text{Var}(X) = \alpha/\beta^2

Special cases:

  • \text{Gamma}(1, \lambda) = \text{Exp}(\lambda)
  • \text{Gamma}(n/2, 1/2) = \chi^2(n)
  • Sum of n independent \text{Exp}(\beta) variables is \text{Gamma}(n, \beta)

When it arises: Conjugate prior for the Poisson rate and the precision (inverse variance) of a Gaussian. Models positive quantities with skew.
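The sum-of-Exponentials relationship can be checked by simulation: sums of \alpha independent Exp(\beta) draws should have the Gamma mean \alpha/\beta and variance \alpha/\beta^2 (a sketch; `random.expovariate` takes the rate):

```python
import random

random.seed(0)
alpha, beta = 5, 2.0                 # shape = number of Exp summands, rate
N = 100_000
draws = [sum(random.expovariate(beta) for _ in range(alpha)) for _ in range(N)]
m = sum(draws) / N                           # ≈ alpha/beta = 2.5
v = sum((x - m) ** 2 for x in draws) / N     # ≈ alpha/beta^2 = 1.25
```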

Definition

Beta Distribution

Defined on [0, 1] with shape parameters \alpha, \beta > 0.

PDF: f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)} for x \in [0, 1]

where B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}.

Mean: \mathbb{E}[X] = \frac{\alpha}{\alpha + \beta}

Variance: \text{Var}(X) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}

Key fact: The Beta is the conjugate prior for the Bernoulli/Binomial parameter p. If p \sim \text{Beta}(\alpha, \beta) and X \mid p \sim \text{Bin}(n, p) with k successes, then p \mid X \sim \text{Beta}(\alpha + k, \beta + n - k).

Special cases: \text{Beta}(1, 1) = \text{Unif}(0, 1).

Definition

Chi-Squared Distribution

The sum of k independent squared standard normals. Degrees of freedom k.

PDF: f(x) = \frac{x^{k/2 - 1} e^{-x/2}}{2^{k/2}\Gamma(k/2)} for x > 0

Mean: \mathbb{E}[X] = k

Variance: \text{Var}(X) = 2k

MGF: \mathbb{E}[e^{\lambda X}] = (1 - 2\lambda)^{-k/2} for \lambda < 1/2, with \mathbb{E}[e^{\lambda X}] = +\infty for \lambda \geq 1/2. This is the half-line MGF that drives every \chi^2 tail bound. Applying the Chernoff method and optimizing over \lambda \in (0, 1/2) gives the standard concentration result: for X \sim \chi^2(k) and t \geq 0,

\mathbb{P}(X \geq k + t) \leq \exp\!\left(-\tfrac{t^2}{4(k + t)}\right),

which sharpens to the Laurent-Massart bound \mathbb{P}(X - k \geq 2\sqrt{kt} + 2t) \leq e^{-t}. These bounds are what appear in the random-projection lemma (B.12 in Shalev-Shwartz and Ben-David), the Johnson-Lindenstrauss lemma, and PCA perturbation analyses.

Relationship: \chi^2(k) = \text{Gamma}(k/2, 1/2).

If Z_1, \ldots, Z_k \sim \mathcal{N}(0,1) independently, then \sum_i Z_i^2 \sim \chi^2(k).

When it arises: Hypothesis testing (goodness-of-fit, likelihood ratio tests), analysis of variance. The sample variance of Gaussian data is proportional to a \chi^2. Important in concentration: the \chi^2 is sub-exponential but not sub-Gaussian, because the MGF blows up at \lambda = 1/2 rather than being defined on all of \mathbb{R}.
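For even degrees of freedom the \chi^2 tail has a closed form (the Erlang survival series), so the Laurent-Massart bound can be checked exactly (a sketch):

```python
import math

def chi2_tail(x, k):
    # Exact P(X > x) for X ~ chi^2(k) with k even (Erlang survival series)
    return math.exp(-x / 2) * sum((x / 2) ** i / math.factorial(i)
                                  for i in range(k // 2))

k = 10
for t in (0.5, 1.0, 3.0, 10.0):
    x = k + 2 * math.sqrt(k * t) + 2 * t
    assert chi2_tail(x, k) <= math.exp(-t)   # Laurent-Massart upper bound
```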

Definition

Student-t Distribution

Arises when estimating the mean of a Gaussian population with unknown variance. Degrees of freedom \nu > 0.

PDF: f(x) = \frac{\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2}

Mean: \mathbb{E}[X] = 0 for \nu > 1 (undefined for \nu \leq 1)

Variance: \text{Var}(X) = \frac{\nu}{\nu - 2} for \nu > 2

Construction: If Z \sim \mathcal{N}(0,1) and V \sim \chi^2(\nu) are independent, then T = Z / \sqrt{V/\nu} \sim t(\nu).

Key fact: Heavier tails than the Gaussian. As \nu \to \infty, t(\nu) \to \mathcal{N}(0, 1). For small \nu, the heavy tails accommodate outliers. This is why the Student-t is used in robust statistics.
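The Z/\sqrt{V/\nu} construction is easy to simulate, and the sample second moment of the resulting draws recovers the \nu/(\nu-2) variance (a sketch using only the standard library):

```python
import math, random

random.seed(1)
nu, N = 6, 100_000

def chi2(k):
    # chi^2(k) draw as a sum of k squared standard normals
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))

ts = [random.gauss(0.0, 1.0) / math.sqrt(chi2(nu) / nu) for _ in range(N)]
v = sum(t * t for t in ts) / N      # ≈ nu/(nu - 2) = 1.5 for nu = 6
```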

Definition

F Distribution

The ratio of two independent scaled chi-squared variables.

Construction: If U \sim \chi^2(d_1) and V \sim \chi^2(d_2) are independent, then F = \frac{U/d_1}{V/d_2} \sim F(d_1, d_2).

Mean: \mathbb{E}[X] = \frac{d_2}{d_2 - 2} for d_2 > 2

When it arises: ANOVA, comparing two model fits (F-test for nested models), testing whether a group of regression coefficients is jointly zero.

Relationship: If T \sim t(\nu), then T^2 \sim F(1, \nu).

Definition

Cauchy Distribution

Location x_0, scale \gamma > 0.

PDF: f(x) = \frac{1}{\pi\gamma\left[1 + \left(\frac{x - x_0}{\gamma}\right)^2\right]}

Mean: Does not exist (the integral diverges).

Variance: Does not exist.

Key fact: The Cauchy is t(1), the Student-t with one degree of freedom. It is the standard example of a distribution with no mean. The sample mean of n i.i.d. Cauchy variables has the same distribution as a single observation, so the law of large numbers does not apply. This is why finite moments matter for concentration inequalities.
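The failure of averaging is easy to witness: simulate Cauchy draws via the tangent of a uniform angle and note that the median settles near zero while the sample mean of a fresh run can land anywhere (a sketch):

```python
import math, random

random.seed(0)
def cauchy():
    # tan of a uniform angle on (-pi/2, pi/2) is standard Cauchy
    return math.tan(math.pi * (random.random() - 0.5))

xs = sorted(cauchy() for _ in range(100_000))
median = xs[len(xs) // 2]        # concentrates near 0 as n grows
mean = sum(xs) / len(xs)         # does NOT converge: rerunning gives wildly different values
```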

Definition

Dirichlet Distribution

The multivariate generalization of the Beta, defined on the (K-1)-simplex \{x \in \mathbb{R}^K : x_k \geq 0, \sum_k x_k = 1\}.

PDF: f(x_1, \ldots, x_K) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^K x_k^{\alpha_k - 1}

Marginals: Each X_k \sim \text{Beta}(\alpha_k, \sum_{j \neq k} \alpha_j).

Mean: \mathbb{E}[X_k] = \alpha_k / \sum_j \alpha_j

Key fact: Conjugate prior for the Multinomial/Categorical. If \theta \sim \text{Dir}(\alpha) and X \mid \theta \sim \text{Mult}(n, \theta) with counts c_1, \ldots, c_K, then \theta \mid X \sim \text{Dir}(\alpha_1 + c_1, \ldots, \alpha_K + c_K).

When it arises: Latent Dirichlet Allocation (LDA), Bayesian multi-class models, any model requiring a prior over probability vectors.
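A standard way to sample a Dirichlet is to normalize independent Gamma draws; simulating this and checking the component means against \alpha_k / \sum_j \alpha_j is a quick sanity test (a sketch; `random.gammavariate` takes shape and scale):

```python
import random

random.seed(0)
alpha = [2.0, 3.0, 5.0]

def dirichlet(alpha):
    # Normalize independent Gamma(a_k, 1) draws: a standard Dirichlet sampler
    gs = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(gs)
    return [g / s for g in gs]

N = 50_000
means = [0.0, 0.0, 0.0]
for _ in range(N):
    for k, xk in enumerate(dirichlet(alpha)):
        means[k] += xk / N
# means ≈ alpha_k / sum(alpha) = [0.2, 0.3, 0.5]
```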

Relationships Between Distributions

The major distributions form a web of connections:

  • Bernoulli \to Binomial: sum of n i.i.d. Bernoullis
  • Bernoulli \to Categorical: K-outcome generalization of single-trial Bernoulli (softmax outputs)
  • Categorical \to Multinomial: sum of n i.i.d. Categoricals
  • Geometric \to Negative Binomial: sum of r i.i.d. Geometrics (waiting time for the r-th success)
  • Binomial \to Poisson: limit as n \to \infty, p \to 0, np \to \lambda
  • Exponential \to Gamma: \text{Gamma}(1, \lambda) = \text{Exp}(\lambda); sum of n Exponentials is \text{Gamma}(n, \lambda)
  • Gamma \to Chi-squared: \chi^2(k) = \text{Gamma}(k/2, 1/2)
  • Gaussian \to Chi-squared: sum of squared standard normals
  • Gaussian + Chi-squared \to Student-t: ratio construction
  • Chi-squared + Chi-squared \to F: ratio of two scaled chi-squareds
  • Beta \leftrightarrow Bernoulli: conjugate prior relationship
  • Beta \to Dirichlet: multivariate generalization
  • Dirichlet \leftrightarrow Multinomial: conjugate prior relationship
  • Gaussian + Gaussian prior \to Gaussian posterior: self-conjugacy

The distributions of sorted samples from these families (the k-th smallest value from n draws) are studied in order statistics, which connects to extreme value theory and nonparametric inference.

Main Theorems

Proposition

The Gaussian Normalizing Constant

Statement

The Gaussian integral evaluates to:

\int_{-\infty}^{\infty} e^{-x^2/2}\,dx = \sqrt{2\pi}

Consequently, f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/(2\sigma^2)} integrates to 1 and is a valid PDF.

Intuition

The 1/\sqrt{2\pi} in the Gaussian PDF is not arbitrary. It is the only constant that makes the total probability equal to 1. Without it, the "bell curve" would not be a proper probability distribution.

Proof Sketch

Let I = \int_{-\infty}^{\infty} e^{-x^2/2}\,dx. Then:

I^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2+y^2)/2}\,dx\,dy

Switch to polar coordinates: x = r\cos\theta, y = r\sin\theta, dx\,dy = r\,dr\,d\theta:

I^2 = \int_0^{2\pi}\int_0^{\infty} r\,e^{-r^2/2}\,dr\,d\theta = 2\pi \cdot \left[-e^{-r^2/2}\right]_0^{\infty} = 2\pi

Therefore I = \sqrt{2\pi}.
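The value can also be confirmed by direct numerical integration; the trapezoid rule on a wide symmetric interval is extremely accurate here because the integrand and all its derivatives vanish at the endpoints (a sketch):

```python
import math

# Trapezoid rule on [-12, 12]; the truncated tails contribute less than e^{-72}
a, b, steps = -12.0, 12.0, 100_000
h = (b - a) / steps
total = sum(math.exp(-((a + i * h) ** 2) / 2) for i in range(steps + 1))
total -= 0.5 * (math.exp(-a * a / 2) + math.exp(-b * b / 2))
integral = total * h                 # ≈ sqrt(2 pi) ≈ 2.5066
```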

Why It Matters

This is one of the most important calculations in all of probability. The polar coordinates trick is a canonical proof technique. Understanding where \sqrt{2\pi} comes from demystifies the Gaussian and explains why it appears in entropy formulas, the CLT, and information-theoretic quantities.

Failure Mode

Students sometimes assume the Gaussian integrates to 1 "by definition" without understanding the calculation. This leads to confusion when computing marginals of multivariate Gaussians or when normalizing constants matter (e.g., in Bayesian inference and partition functions).

Canonical Examples

Example

Beta-Bernoulli conjugacy in action

Suppose you observe k = 7 heads in n = 10 coin flips. With a \text{Beta}(1, 1) prior (uniform on [0, 1]), the posterior is \text{Beta}(1 + 7, 1 + 3) = \text{Beta}(8, 4).

The posterior mean is 8/12 \approx 0.667, slightly shrunk from the MLE of 7/10 = 0.7 toward 0.5. The posterior mode is (8-1)/(8+4-2) = 0.7, which equals the MLE. With a stronger prior \text{Beta}(10, 10), the posterior would be \text{Beta}(17, 13) with mean 17/30 \approx 0.567, with more shrinkage toward the prior mean of 0.5.
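Conjugacy is verifiable by brute force: multiply the uniform prior by the Binomial likelihood on a grid, normalize, and compare against the closed-form Beta(8, 4) density (an illustrative sketch):

```python
import math

def beta_pdf(x, a, b):
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / B

k, n = 7, 10
grid = [i / 10000 for i in range(1, 10000)]
# Beta(1,1) prior (constant 1) times Binomial likelihood, normalized on the grid
unnorm = [1.0 * p**k * (1 - p) ** (n - k) for p in grid]
Z = sum(unnorm) / 10000                      # crude Riemann normalizer
post = [u / Z for u in unnorm]
closed = [beta_pdf(p, 8, 4) for p in grid]
gap = max(abs(u - v) for u, v in zip(post, closed))   # tiny: posterior is Beta(8, 4)
```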

Example

Chi-squared as sum of squared normals

If Z_1, Z_2, Z_3 \sim \mathcal{N}(0, 1) independently, then W = Z_1^2 + Z_2^2 + Z_3^2 \sim \chi^2(3) with \mathbb{E}[W] = 3 and \text{Var}(W) = 6.

Note: each Z_i^2 has mean 1 and variance 2. The \chi^2 is sub-exponential but not sub-Gaussian because Z_i^2 is a product of two sub-Gaussian variables (and products of sub-Gaussians are sub-exponential).

Common Confusions

Watch Out

The normalizing constant matters

Students sometimes write the Gaussian PDF without 1/\sqrt{2\pi\sigma^2} or confuse it with 1/\sqrt{2\pi} (forgetting the \sigma). The full constant is 1/\sqrt{2\pi\sigma^2} = 1/(\sigma\sqrt{2\pi}). Getting this wrong means your PDF does not integrate to 1, your log-likelihoods are wrong, and your MLE derivations break. In the multivariate case, the constant involves |\Sigma|^{1/2} and (2\pi)^{d/2}.

Watch Out

Gamma parameterization varies between sources

Some sources use the rate parameterization \text{Gamma}(\alpha, \beta) with PDF proportional to x^{\alpha-1}e^{-\beta x}, while others use the scale parameterization \text{Gamma}(\alpha, \theta) with PDF proportional to x^{\alpha-1}e^{-x/\theta} where \theta = 1/\beta. Always check which convention a textbook or library uses. NumPy and SciPy use the scale parameterization. This page uses rate.

Watch Out

The Cauchy has no mean, not a mean of zero

The standard Cauchy distribution is symmetric about zero, so you might think its mean is zero. It is not. The mean does not exist because \int |x| f(x)\,dx = \infty. The "center" of the Cauchy is the median (and mode), which is x_0, but the expectation is undefined.

Summary

  • Bernoulli/Binomial/Categorical/Multinomial: discrete counts; the Bernoulli MGF 1 - p + pe^t underlies Chernoff bounds; Categorical is what softmax parameterizes; cross-entropy is the Categorical NLL
  • Geometric/Negative Binomial: waiting times for the first or r-th success; coupon collector and overdispersed counts
  • Poisson: counts with mean = variance; approximates Binomial for rare events
  • Gaussian: the universal limit (CLT); maximizes entropy for given mean and variance
  • Exponential/Gamma: positive continuous; memoryless property for Exponential
  • Beta/Dirichlet: priors on probabilities; conjugate to Bernoulli/Multinomial
  • Chi-squared: sum of squared normals; MGF (1-2t)^{-k/2} for t < 1/2 drives concentration via Laurent-Massart; appears in random projections (Johnson-Lindenstrauss), variance estimates, and likelihood-ratio tests
  • Student-t: Gaussian with unknown variance; heavier tails accommodate outliers
  • Cauchy: the pathological case: no moments, and the LLN fails

Exercises

ExerciseCore

Problem

If X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2) independently and \bar{X} = \frac{1}{n}\sum_i X_i, what is the distribution of \bar{X}? What about \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}?

ExerciseCore

Problem

Show that \text{Beta}(1, 1) = \text{Unif}(0, 1) by writing out the Beta PDF with \alpha = \beta = 1.

ExerciseAdvanced

Problem

Derive the chi-squared MGF \mathbb{E}[e^{\lambda X}] = (1 - 2\lambda)^{-k/2} for X \sim \chi^2(k) and \lambda < 1/2 from the standard normal MGF, and explain why it diverges at \lambda = 1/2.

ExerciseAdvanced

Problem

Show that if X \sim \text{Gamma}(\alpha_1, \beta) and Y \sim \text{Gamma}(\alpha_2, \beta) are independent, then X + Y \sim \text{Gamma}(\alpha_1 + \alpha_2, \beta).


References

Canonical:

  • Casella & Berger, Statistical Inference (2002), Chapters 3-5
  • DeGroot & Schervish, Probability and Statistics (2012), Chapters 5-6

Current:

  • Murphy, Probabilistic Machine Learning: An Introduction (2022), Sections 2.2-2.7
  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 2
  • Durrett, Probability: Theory and Examples 5th ed (2019), Chapters 2-3
  • Billingsley, Probability and Measure 3rd ed (1995), Chapters 20-21


Last reviewed: April 23, 2026

