Foundations
Common Probability Distributions
The essential probability distributions for ML: Bernoulli through Dirichlet, with PMFs/PDFs, moments, relationships, and when each arises in practice.
Why This Matters
Every model you will ever train makes distributional assumptions, whether explicitly or implicitly. Logistic regression assumes Bernoulli responses. Linear regression with squared loss assumes Gaussian noise. Bayesian methods require choosing priors: Beta, Gamma, Dirichlet. If you cannot write down the PMF or PDF, compute the mean and variance, and explain when a distribution arises, you will be guessing instead of reasoning.
The Greek letters $\mu$ and $\sigma$ name the canonical mean and standard deviation, and the symbol $\sim$ reads "is distributed as".
This page is a reference you will return to repeatedly. Know these cold.
Mental Model
Distributions fall into two families by support:
- Discrete (PMF): Bernoulli, Binomial, Categorical, Multinomial, Geometric, Negative Binomial, Poisson
- Continuous (PDF): Uniform, Gaussian, Exponential, Gamma, Beta, Chi-squared, Student-t, F, Cauchy, Dirichlet
Many are related: the Binomial is a sum of Bernoullis, the Exponential is a special Gamma, the Chi-squared is a special Gamma, the Beta is conjugate to the Bernoulli, and the Dirichlet generalizes the Beta.
Quick Reference
| Family | Support | Parameters | Reach for it when | Core relationship |
|---|---|---|---|---|
| Bernoulli / Binomial / Categorical / Multinomial | finite counts | probability $p$ or probability vector $p_{1:K}$, trial count $n$ | labels, counts, categorical outcomes, softmax outputs | Beta and Dirichlet conjugacy |
| Geometric / Negative Binomial | positive integers | success probability $p$, success count $r$ | waiting times to first or $r$-th success, overdispersed counts | NB is a sum of Geometrics |
| Poisson | nonnegative integers | rate $\lambda$ | rare-event counts in a fixed window | Binomial rare-event limit |
| Gaussian / Multivariate Gaussian | $\mathbb{R}$ or $\mathbb{R}^d$ | mean and covariance | additive noise, linear models, CLT approximations | affine transforms stay Gaussian |
| Exponential / Gamma | positive reals | rate $\lambda$ / shape-rate $(\alpha, \beta)$ | waiting times, positive skew, arrival processes | Exponential is Gamma with shape 1 |
| Beta / Dirichlet | $[0,1]$ or simplex | positive shape parameters | priors on probabilities | conjugate to Bernoulli/Binomial and Multinomial |
| Chi-squared / Student-t / F / Cauchy | positive or all reals | degrees of freedom, scale | variance inference, robust tails, ratio constructions | built from Gaussian and Gamma / chi-squared pieces |
Discrete Distributions
Bernoulli Distribution
A single binary trial with success probability $p \in [0, 1]$.
PMF: $P(X = x) = p^x (1-p)^{1-x}$ for $x \in \{0, 1\}$
Mean: $\mathbb{E}[X] = p$
Variance: $\mathrm{Var}(X) = p(1-p)$, maximized at $p = 1/2$ with value $1/4$.
MGF: $M_X(t) = 1 - p + p e^t$ for all $t \in \mathbb{R}$.
When it arises: Binary classification labels, coin flips, any yes/no outcome. The log-likelihood of Bernoulli data gives rise to the cross-entropy loss used in logistic regression. The Bernoulli MGF is the starting point of the Chernoff bound: applying Markov's inequality to $e^{t S_n}$ for a sum $S_n$ of independent Bernoullis gives the canonical exponential tail bound on a binomial sum.
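A quick sketch of the Bernoulli NLL/cross-entropy correspondence, assuming NumPy; `bernoulli_nll` is an illustrative helper name, not a library function:

```python
import numpy as np

def bernoulli_nll(y, p):
    """Average negative log-likelihood of labels y under Bernoulli(p) predictions.
    This is exactly the binary cross-entropy loss."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy check: predictions closer to the labels give lower loss.
y = np.array([1, 0, 1, 1])
print(bernoulli_nll(y, np.array([0.9, 0.1, 0.8, 0.7])))  # small
print(bernoulli_nll(y, np.array([0.5, 0.5, 0.5, 0.5])))  # log 2 ~ 0.693
```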
Binomial Distribution
The number of successes in $n$ independent Bernoulli trials.
PMF: $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$ for $k \in \{0, 1, \dots, n\}$
Mean: $\mathbb{E}[X] = np$
Variance: $\mathrm{Var}(X) = np(1-p)$
MGF: $M_X(t) = (1 - p + p e^t)^n$ for all $t \in \mathbb{R}$ (the $n$-th power of the Bernoulli MGF).
Key facts:
- If $X \sim \mathrm{Bin}(n, p)$ and $Y \sim \mathrm{Bin}(m, p)$ independently, then $X + Y \sim \mathrm{Bin}(n + m, p)$.
- As $n \to \infty$ with $np \to \lambda$, the Binomial converges to $\mathrm{Poisson}(\lambda)$.
- For fixed $p$, the standardized sum $(X - np)/\sqrt{np(1-p)}$ converges to $\mathcal{N}(0, 1)$ by the central limit theorem. The error is $O(1/\sqrt{n})$ by Berry-Esseen.
When it arises: Empirical risk on $n$ samples with 0-1 loss is $\frac{1}{n}\mathrm{Bin}(n, \epsilon)$ where $\epsilon$ is the true error rate. Slud's inequality lower-bounds binomial tails and is used in VC sample-complexity lower bounds.
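A short SciPy sketch comparing the exact binomial tail of the empirical risk with the Hoeffding bound $e^{-2nt^2}$; the values of $n$, $\epsilon$, and $t$ are arbitrary illustrations:

```python
import numpy as np
from scipy import stats

# Exact tail of the empirical risk hat{R} = Bin(n, eps)/n vs. Hoeffding.
n, eps, t = 1000, 0.1, 0.03
exact = stats.binom.sf(np.ceil(n * (eps + t)) - 1, n, eps)  # P(hat{R} >= eps + t)
hoeffding = np.exp(-2 * n * t**2)
print(f"exact tail:      {exact:.4e}")
print(f"Hoeffding bound: {hoeffding:.4e}")  # valid, but much looser
```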
Categorical Distribution
A single trial with $K$ possible outcomes, where outcome $k$ has probability $p_k$ and $\sum_{k=1}^{K} p_k = 1$. Generalizes the Bernoulli (recovered when $K = 2$).
PMF: $P(X = k) = p_k$ for $k \in \{1, \dots, K\}$.
Mean of the one-hot encoding $e_X$: $\mathbb{E}[e_X] = (p_1, \dots, p_K)$.
When it arises: Multi-class classification labels. The softmax output of a neural network parameterizes a Categorical: $p_k = e^{z_k} / \sum_j e^{z_j}$ where $z_1, \dots, z_K$ are the logits. Cross-entropy loss is $-\log p_y$ where $y$ is the true class — this is the Categorical negative log-likelihood. Language models predict a Categorical over vocabulary tokens at every position. A Multinomial is the sum of $n$ independent Categoricals with the same parameter vector.
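A minimal sketch of the softmax-Categorical correspondence in NumPy; `softmax` and `cross_entropy` are illustrative helpers, not library functions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, y):
    """Categorical NLL: -log p_y where p = softmax(z) and y is the true class."""
    return -np.log(softmax(z)[y])

logits = np.array([2.0, 0.5, -1.0])
print(softmax(logits))           # a Categorical parameter vector, sums to 1
print(cross_entropy(logits, 0))  # low loss: class 0 has the largest logit
```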
Poisson Distribution
Models the count of events in a fixed interval when events occur independently at a constant rate $\lambda > 0$.
PMF: $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k \in \{0, 1, 2, \dots\}$
Mean: $\mathbb{E}[X] = \lambda$
Variance: $\mathrm{Var}(X) = \lambda$
MGF: $M_X(t) = \exp\left(\lambda(e^t - 1)\right)$ for all $t \in \mathbb{R}$.
When it arises: Count data: number of clicks, arrivals, mutations. The mean equals the variance; if your count data has variance much larger than the mean, the Poisson model is wrong (use negative binomial instead). The Poisson MGF gives sub-exponential tail bounds; for large deviations, use the Bennett or Bernstein inequality rather than a Gaussian approximation.
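A small simulation of the mean-equals-variance diagnostic, assuming NumPy; the rate and negative-binomial parameters are arbitrary, and NumPy's `negative_binomial` counts failures before the $n$-th success:

```python
import numpy as np

rng = np.random.default_rng(0)

pois = rng.poisson(lam=4.0, size=100_000)
nb = rng.negative_binomial(n=2, p=0.33, size=100_000)  # counts failures

print(f"Poisson: mean={pois.mean():.2f}, var={pois.var():.2f}")  # roughly equal
print(f"NegBin:  mean={nb.mean():.2f},  var={nb.var():.2f}")     # var >> mean
```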
Geometric Distribution
The number of trials until the first success in independent Bernoulli($p$) trials. (Convention: includes the successful trial.)
PMF: $P(X = k) = (1-p)^{k-1} p$ for $k \in \{1, 2, 3, \dots\}$
Mean: $\mathbb{E}[X] = 1/p$
Variance: $\mathrm{Var}(X) = (1-p)/p^2$
Key property: The Geometric distribution is the only discrete distribution with the memoryless property: $P(X > m + n \mid X > m) = P(X > n)$.
When it arises in ML: Coupon-collector arguments (the expected time to see all $K$ classes from a uniform stream is $K H_K \approx K \ln K$); PAC sample-complexity lower bounds; the geometric series bounds in TD-learning convergence.
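A simulation of the coupon-collector claim, assuming NumPy; `time_to_collect` and the choice $K = 20$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
K, trials = 20, 2000

def time_to_collect(K):
    """Draw uniform classes until all K have appeared; return the draw count."""
    seen, t = set(), 0
    while len(seen) < K:
        seen.add(int(rng.integers(K)))
        t += 1
    return t

empirical = np.mean([time_to_collect(K) for _ in range(trials)])
harmonic = K * sum(1 / i for i in range(1, K + 1))  # K * H_K
print(f"empirical: {empirical:.1f}, K*H_K: {harmonic:.1f}")  # closely matched
```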
Negative Binomial Distribution
The number of trials until the $r$-th success in independent Bernoulli($p$) trials. Generalizes the Geometric (recovered when $r = 1$).
PMF: $P(X = k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}$ for $k \in \{r, r+1, r+2, \dots\}$
Mean: $\mathbb{E}[X] = r/p$
Variance: $\mathrm{Var}(X) = r(1-p)/p^2$
Key fact: In the count-of-failures parameterization ($Y = X - r$, with mean $r(1-p)/p$), the variance exceeds the mean by a factor of $1/p > 1$, so the negative binomial is a natural overdispersed alternative to the Poisson. Sums of independent $\mathrm{NB}(r_i, p)$ variables are $\mathrm{NB}\left(\sum_i r_i, p\right)$.
When it arises: Count data with overdispersion (variance $>$ mean), where the Poisson assumption fails. Common in NLP word-frequency models and biological sequencing data.
Multinomial Distribution
The multivariate generalization of the Binomial. In $n$ trials, each of $K$ categories occurs with probability $p_k$, and $X_k$ counts occurrences of category $k$.
PMF: $P(X_1 = x_1, \dots, X_K = x_K) = \frac{n!}{x_1! \cdots x_K!} \prod_{k=1}^{K} p_k^{x_k}$
where $\sum_k x_k = n$ and $\sum_k p_k = 1$.
Marginals: Each $X_k \sim \mathrm{Bin}(n, p_k)$.
Relationship: A Multinomial is the sum of $n$ i.i.d. Categorical trials. The single-trial case ($n = 1$) is the Categorical distribution defined above; this is what a softmax layer parameterizes.
When it arises: Bag-of-words models, topic models, document language models with $n$-token contexts. Empirical class frequencies in a multi-class classifier with $n$ examples are $\frac{1}{n}\mathrm{Multinomial}(n, p)$.
Continuous Distributions
Uniform Distribution
Constant density on the interval $[a, b]$.
PDF: $f(x) = \frac{1}{b - a}$ for $x \in [a, b]$, zero otherwise.
Mean: $\mathbb{E}[X] = \frac{a + b}{2}$
Variance: $\mathrm{Var}(X) = \frac{(b - a)^2}{12}$
When it arises: Maximum entropy distribution on a bounded interval with no other constraints. Used in random initialization, random search, and as the base distribution for inverse transform sampling.
Gaussian (Normal) Distribution
The most important continuous distribution. Parameterized by mean $\mu$ and variance $\sigma^2$.
PDF: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
Mean: $\mathbb{E}[X] = \mu$
Variance: $\mathrm{Var}(X) = \sigma^2$
MGF: $M_X(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right)$ for all $t \in \mathbb{R}$.
Standard normal moments (for $Z \sim \mathcal{N}(0, 1)$): $\mathbb{E}[Z] = 0$, $\mathbb{E}[Z^2] = 1$, $\mathbb{E}[Z^4] = 3$, and in general $\mathbb{E}[Z^{2k}] = (2k - 1)!!$ (odd moments vanish by symmetry). The fact that $\mathbb{E}[Z^4] = 3$ is what makes $\mathrm{Var}(Z^2) = 2$, the variance of a $\chi^2_1$ summand.
Tail bound (Mills ratio): For $Z \sim \mathcal{N}(0, 1)$ and $t > 0$, $\left(\frac{1}{t} - \frac{1}{t^3}\right)\frac{e^{-t^2/2}}{\sqrt{2\pi}} \leq P(Z > t) \leq \frac{1}{t}\frac{e^{-t^2/2}}{\sqrt{2\pi}}.$ The simpler bound $P(Z > t) \leq e^{-t^2/2}$ for $t > 0$ follows from the Chernoff method on the Gaussian MGF and is what underlies sub-Gaussian concentration. The polynomial factor makes the Gaussian tail tighter than the bare sub-Gaussian bound by a factor of about $\frac{1}{t\sqrt{2\pi}}$ — relevant in high-dimensional probability and union-bound calculations.
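A numeric check of the Mills-ratio sandwich against SciPy's exact tail, with arbitrary thresholds $t$:

```python
import numpy as np
from scipy import stats

# Compare the Mills-ratio bounds and the Chernoff bound exp(-t^2/2)
# against the exact Gaussian tail P(Z > t).
for t in [1.0, 2.0, 4.0]:
    phi = np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)  # standard normal density at t
    lower = (1 / t - 1 / t**3) * phi
    upper = phi / t
    exact = stats.norm.sf(t)
    chernoff = np.exp(-t**2 / 2)
    print(f"t={t}: {lower:.2e} <= {exact:.2e} <= {upper:.2e}; Chernoff {chernoff:.2e}")
```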
The normalizing constant ensures $\int_{-\infty}^{\infty} f(x)\,dx = 1$. This requires the Gaussian integral $\int_{-\infty}^{\infty} e^{-x^2/2}\,dx = \sqrt{2\pi}$.
Multivariate Gaussian
For $x \in \mathbb{R}^d$ with mean vector $\mu$ and positive definite covariance matrix $\Sigma$:
PDF: $f(x) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$
Key properties:
- Marginals are Gaussian: if $X \sim \mathcal{N}(\mu, \Sigma)$, any subset of coordinates is also multivariate Gaussian.
- For jointly Gaussian distributions, uncorrelated implies independent. Marginal Gaussianity alone does not suffice. A standard counterexample: $X \sim \mathcal{N}(0, 1)$ and $Y = X$ if $|X| \leq c$, $Y = -X$ otherwise. Both marginals are Gaussian and the pair is uncorrelated for an appropriate $c$, yet $X$ and $Y$ are dependent (since $|Y| = |X|$).
- Affine transformations preserve Gaussianity: $X \sim \mathcal{N}(\mu, \Sigma) \Rightarrow AX + b \sim \mathcal{N}(A\mu + b, A \Sigma A^\top)$.
When it arises: Central limit theorem, Bayesian linear regression posterior, Gaussian processes, VAE latent spaces.
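A minimal sketch of the affine-closure property in use: sampling $\mathcal{N}(\mu, \Sigma)$ as $x = \mu + Lz$ with $LL^\top = \Sigma$ (Cholesky), assuming NumPy; the particular $\mu$ and $\Sigma$ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
L = np.linalg.cholesky(Sigma)          # lower-triangular, L @ L.T == Sigma

z = rng.standard_normal((100_000, 2))  # rows are standard normal vectors
x = mu + z @ L.T                       # affine map: each row is mu + L z_i

print(np.mean(x, axis=0))              # ~ mu
print(np.cov(x, rowvar=False))         # ~ Sigma
```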
Exponential Distribution
Models waiting times between events in a Poisson process with rate $\lambda > 0$.
PDF: $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$
Mean: $\mathbb{E}[X] = 1/\lambda$
Variance: $\mathrm{Var}(X) = 1/\lambda^2$
Key property: Memoryless: $P(X > s + t \mid X > s) = P(X > t)$. This is the only continuous distribution with this property.
Relationship: $\mathrm{Exponential}(\lambda) = \mathrm{Gamma}(1, \lambda)$.
Gamma Distribution
Generalizes the Exponential. Shape $\alpha > 0$, rate $\beta > 0$.
PDF: $f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$ for $x > 0$
where $\Gamma(\alpha) = \int_0^\infty t^{\alpha - 1} e^{-t}\,dt$ is the Gamma function.
Mean: $\mathbb{E}[X] = \alpha/\beta$
Variance: $\mathrm{Var}(X) = \alpha/\beta^2$
Special cases:
- $\mathrm{Gamma}(1, \lambda) = \mathrm{Exponential}(\lambda)$ and $\mathrm{Gamma}(k/2, 1/2) = \chi^2_k$
- Sum of $n$ independent $\mathrm{Exponential}(\lambda)$ variables is $\mathrm{Gamma}(n, \lambda)$
When it arises: Conjugate prior for the Poisson rate and the precision (inverse variance) of a Gaussian. Models positive quantities with skew.
Beta Distribution
Defined on $[0, 1]$ with shape parameters $\alpha, \beta > 0$.
PDF: $f(x) = \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)}$ for $x \in [0, 1]$
where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$.
Mean: $\mathbb{E}[X] = \frac{\alpha}{\alpha + \beta}$
Variance: $\mathrm{Var}(X) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$
Key fact: The Beta is the conjugate prior for the Bernoulli/Binomial parameter $p$. If $p \sim \mathrm{Beta}(\alpha, \beta)$ and $X_1, \dots, X_n \sim \mathrm{Bernoulli}(p)$ with $k$ successes, then $p \mid X \sim \mathrm{Beta}(\alpha + k, \beta + n - k)$.
Special cases: $\mathrm{Beta}(1, 1) = \mathrm{Uniform}(0, 1)$.
Chi-Squared Distribution
The sum of independent squared standard normals. Degrees of freedom $k$.
PDF: $f(x) = \frac{1}{2^{k/2}\Gamma(k/2)} x^{k/2 - 1} e^{-x/2}$ for $x > 0$
Mean: $\mathbb{E}[X] = k$
Variance: $\mathrm{Var}(X) = 2k$
MGF: $M_X(t) = (1 - 2t)^{-k/2}$ for $t < 1/2$, with $M_X(t) = \infty$ for $t \geq 1/2$. This is the half-line MGF that drives every tail bound. Applying the Chernoff method and optimizing over $t$ gives the standard concentration result: for $X \sim \chi^2_k$ and $\epsilon \in (0, 1)$, $P(|X - k| \geq \epsilon k) \leq 2 e^{-k\epsilon^2/8}$, which sharpens to the Laurent-Massart bound $P(X - k \geq 2\sqrt{kt} + 2t) \leq e^{-t}$. These bounds are what appear in the random-projection lemma (Lemma B.12 in Shalev-Shwartz & Ben-David), the Johnson-Lindenstrauss lemma, and PCA perturbation analyses.
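A quick SciPy check that the Laurent-Massart bound dominates the exact $\chi^2_k$ tail; $k$ and the values of $t$ are arbitrary illustrations:

```python
import numpy as np
from scipy import stats

# Laurent-Massart: P(X - k >= 2*sqrt(k*t) + 2*t) <= exp(-t) for X ~ chi^2_k.
k = 100
for t in [1.0, 5.0, 10.0]:
    threshold = k + 2 * np.sqrt(k * t) + 2 * t
    exact = stats.chi2.sf(threshold, df=k)
    print(f"t={t}: exact tail {exact:.3e} <= bound {np.exp(-t):.3e}")
```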
Relationship: $\chi^2_k = \mathrm{Gamma}(k/2, 1/2)$.
If $X \sim \chi^2_m$ and $Y \sim \chi^2_n$ independently, then $X + Y \sim \chi^2_{m + n}$.
When it arises: Hypothesis testing (goodness-of-fit, likelihood ratio tests), analysis of variance. The sample variance of Gaussian data is proportional to a $\chi^2_{n-1}$. Important in concentration: the $\chi^2$ is sub-exponential but not sub-Gaussian, because the MGF blows up at $t = 1/2$ rather than being defined on all of $\mathbb{R}$.
Student-t Distribution
Arises when estimating the mean of a Gaussian population with unknown variance. Degrees of freedom $\nu$.
PDF: $f(x) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu + 1}{2}}$
Mean: $0$ for $\nu > 1$ (undefined for $\nu \leq 1$)
Variance: $\frac{\nu}{\nu - 2}$ for $\nu > 2$
Construction: If $Z \sim \mathcal{N}(0, 1)$ and $V \sim \chi^2_\nu$ are independent, then $T = \frac{Z}{\sqrt{V/\nu}} \sim t_\nu$.
Key fact: Heavier tails than the Gaussian. As $\nu \to \infty$, $t_\nu \to \mathcal{N}(0, 1)$. For small $\nu$, the heavy tails accommodate outliers. This is why the Student-t is used in robust statistics.
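A short SciPy comparison of $t_\nu$ and Gaussian tail probabilities at a fixed threshold (the threshold 4 is arbitrary):

```python
from scipy import stats

# Heavier tails: P(|T| > 4) for t_nu vs. the Gaussian, across degrees of freedom.
for nu in [1, 3, 10, 100]:
    print(f"nu={nu:3d}: P(|T|>4) = {2 * stats.t.sf(4, df=nu):.2e}")
print(f"normal: P(|Z|>4) = {2 * stats.norm.sf(4):.2e}")  # far smaller
```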
F Distribution
The ratio of two independent scaled chi-squared variables.
Construction: If $U \sim \chi^2_{d_1}$ and $V \sim \chi^2_{d_2}$ are independent, then $\frac{U/d_1}{V/d_2} \sim F_{d_1, d_2}$.
Mean: $\frac{d_2}{d_2 - 2}$ for $d_2 > 2$
When it arises: ANOVA, comparing two model fits (F-test for nested models), testing whether a group of regression coefficients is jointly zero.
Relationship: If $T \sim t_\nu$, then $T^2 \sim F_{1, \nu}$.
Cauchy Distribution
Location $x_0$, scale $\gamma > 0$.
PDF: $f(x) = \frac{1}{\pi\gamma\left[1 + \left(\frac{x - x_0}{\gamma}\right)^2\right]}$
Mean: Does not exist (the integral diverges).
Variance: Does not exist.
Key fact: The Cauchy is $t_1$, the Student-t with one degree of freedom. It is the standard example of a distribution with no mean. The sample mean of $n$ i.i.d. Cauchy variables has the same distribution as a single observation. The law of large numbers does not apply. This is why finite moments matter for concentration inequalities.
Dirichlet Distribution
The multivariate generalization of the Beta, defined on the $(K-1)$-simplex $\{p \in \mathbb{R}^K : p_k \geq 0, \sum_k p_k = 1\}$.
PDF: $f(p) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^{K} p_k^{\alpha_k - 1}$
Marginals: Each $p_k \sim \mathrm{Beta}\left(\alpha_k, \sum_{j \neq k} \alpha_j\right)$.
Mean: $\mathbb{E}[p_k] = \frac{\alpha_k}{\sum_j \alpha_j}$
Key fact: Conjugate prior for the Multinomial/Categorical. If $p \sim \mathrm{Dir}(\alpha)$ and $X \sim \mathrm{Multinomial}(n, p)$ with counts $(x_1, \dots, x_K)$, then $p \mid X \sim \mathrm{Dir}(\alpha_1 + x_1, \dots, \alpha_K + x_K)$.
When it arises: Latent Dirichlet Allocation (LDA), Bayesian multi-class models, any model requiring a prior over probability vectors.
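A minimal sketch of the conjugate update, assuming NumPy; `dirichlet_posterior` and the example counts are illustrative:

```python
import numpy as np

def dirichlet_posterior(alpha, counts):
    """Conjugate update: Dir(alpha) prior + Multinomial counts -> Dir(alpha + counts)."""
    return np.asarray(alpha, float) + np.asarray(counts, float)

alpha = np.ones(3)               # symmetric Dir(1,1,1): uniform on the simplex
counts = np.array([50, 30, 20])  # observed category counts
post = dirichlet_posterior(alpha, counts)
print(post)                      # [51. 31. 21.]
print(post / post.sum())         # posterior mean, close to empirical frequencies
```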
Relationships Between Distributions
The major distributions form a web of connections:
- Bernoulli → Binomial: sum of $n$ i.i.d. Bernoullis
- Bernoulli → Categorical: $K$-outcome generalization of single-trial Bernoulli (softmax outputs)
- Categorical → Multinomial: sum of $n$ i.i.d. Categoricals
- Geometric → Negative Binomial: sum of $r$ i.i.d. Geometrics (waiting time for $r$-th success)
- Binomial → Poisson: limit as $n \to \infty$, $p \to 0$, $np \to \lambda$
- Exponential → Gamma: $\mathrm{Exponential}(\lambda) = \mathrm{Gamma}(1, \lambda)$; sum of $n$ Exponentials is $\mathrm{Gamma}(n, \lambda)$
- Gamma → Chi-squared: $\chi^2_k = \mathrm{Gamma}(k/2, 1/2)$
- Gaussian → Chi-squared: sum of $k$ squared standard normals
- Gaussian + Chi-squared → Student-t: ratio construction
- Chi-squared + Chi-squared → F: ratio of two scaled chi-squareds
- Beta → Bernoulli: conjugate prior relationship
- Beta → Dirichlet: multivariate generalization
- Dirichlet → Multinomial: conjugate prior relationship
- Gaussian + Gaussian prior → Gaussian posterior: self-conjugacy
The distributions of sorted samples from these families (the $k$-th smallest value from $n$ draws) are studied in order statistics, which connects to extreme value theory and nonparametric inference.
Main Theorems
The Gaussian Normalizing Constant
Statement
The Gaussian integral evaluates to:
$\int_{-\infty}^{\infty} e^{-x^2/2}\,dx = \sqrt{2\pi}$
Consequently, $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ integrates to 1 and is a valid PDF.
Intuition
The $\sqrt{2\pi}$ in the Gaussian PDF is not arbitrary. It is the only constant that makes the total probability equal to 1. Without it, the "bell curve" would not be a proper probability distribution.
Proof Sketch
Let $I = \int_{-\infty}^{\infty} e^{-x^2/2}\,dx$. Then:
$I^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2 + y^2)/2}\,dx\,dy$
Switch to polar coordinates: $x = r\cos\theta$, $y = r\sin\theta$, $dx\,dy = r\,dr\,d\theta$:
$I^2 = \int_0^{2\pi}\int_0^{\infty} e^{-r^2/2}\, r\,dr\,d\theta = 2\pi\left[-e^{-r^2/2}\right]_0^{\infty} = 2\pi$
Therefore $I = \sqrt{2\pi}$.
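A numeric confirmation of the integral, assuming SciPy:

```python
import numpy as np
from scipy import integrate

# Numerically confirm the Gaussian integral: int exp(-x^2/2) dx = sqrt(2*pi).
value, _ = integrate.quad(lambda x: np.exp(-x**2 / 2), -np.inf, np.inf)
print(value, np.sqrt(2 * np.pi))  # both ~ 2.5066282746
```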
Why It Matters
This is one of the most important calculations in all of probability. The polar coordinates trick is a canonical proof technique. Understanding where $\sqrt{2\pi}$ comes from demystifies the Gaussian and explains why it appears in entropy formulas, the CLT, and information-theoretic quantities.
Failure Mode
Students sometimes assume the Gaussian integrates to 1 "by definition" without understanding the calculation. This leads to confusion when computing marginals of multivariate Gaussians or when normalizing constants matter (e.g., in Bayesian inference and partition functions).
Canonical Examples
Beta-Bernoulli conjugacy in action
Suppose you observe $k$ heads in $n$ coin flips. With a $\mathrm{Beta}(1, 1)$ prior (uniform on $[0, 1]$), the posterior is $\mathrm{Beta}(k + 1, n - k + 1)$.
The posterior mean is $\frac{k + 1}{n + 2}$, slightly shrunk from the MLE of $k/n$ toward $1/2$. The posterior mode is $k/n$, which equals the MLE. With a stronger prior $\mathrm{Beta}(a, a)$ with $a > 1$, the posterior would be $\mathrm{Beta}(k + a, n - k + a)$ with mean $\frac{k + a}{n + 2a}$, with more shrinkage toward the prior mean of $1/2$.
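A sketch of this computation with concrete numbers, assuming SciPy; 7 heads in 10 flips is an arbitrary illustration, and `beta_posterior` is an illustrative helper:

```python
from scipy import stats

def beta_posterior(a, b, k, n):
    """Conjugate update: Beta(a, b) prior + k successes in n trials."""
    return a + k, b + n - k

k, n = 7, 10                            # illustrative data: 7 heads in 10 flips
a, b = beta_posterior(1, 1, k, n)       # uniform prior -> Beta(8, 4)
print(a / (a + b))                      # posterior mean (k+1)/(n+2) ~ 0.667
print((a - 1) / (a + b - 2))            # posterior mode k/n = 0.7, the MLE
print(stats.beta.interval(0.95, a, b))  # a 95% credible interval for p
```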
Chi-squared as sum of squared normals
If $Z_1, \dots, Z_k \sim \mathcal{N}(0, 1)$ independently, then $X = \sum_{i=1}^{k} Z_i^2 \sim \chi^2_k$ with $\mathbb{E}[X] = k$ and $\mathrm{Var}(X) = 2k$.
Note: each $Z_i^2$ has mean 1 and variance 2. The $\chi^2$ is sub-exponential but not sub-Gaussian because $Z_i^2$ is a product of two sub-Gaussian variables (and products of sub-Gaussians are sub-exponential).
Common Confusions
The normalizing constant matters
Students sometimes write the Gaussian PDF without $\frac{1}{\sigma\sqrt{2\pi}}$ or confuse it with $\frac{1}{\sqrt{2\pi}}$ (forgetting the $\sigma$). The full constant is $\frac{1}{\sigma\sqrt{2\pi}}$. Getting this wrong means your PDF does not integrate to 1, your log-likelihoods are wrong, and your MLE derivations break. In the multivariate case, the constant involves $(2\pi)^{d/2}$ and $|\Sigma|^{1/2}$.
Gamma parameterization varies between sources
Some sources use the rate parameterization $\mathrm{Gamma}(\alpha, \beta)$ with PDF proportional to $x^{\alpha - 1} e^{-\beta x}$, while others use the scale parameterization $\mathrm{Gamma}(\alpha, \theta)$ with PDF proportional to $x^{\alpha - 1} e^{-x/\theta}$ where $\theta = 1/\beta$. Always check which convention a textbook or library uses. NumPy and SciPy use the scale parameterization. This page uses rate.
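A sketch of translating between the conventions, assuming SciPy and NumPy; the particular $\alpha$ and $\beta$ are arbitrary:

```python
import numpy as np
from scipy import stats

# SciPy's gamma takes shape a and *scale* theta; a rate-beta Gamma(alpha, beta)
# is stats.gamma(a=alpha, scale=1/beta). Check via the mean alpha/beta.
alpha, beta = 3.0, 2.0
dist = stats.gamma(a=alpha, scale=1 / beta)
print(dist.mean(), alpha / beta)     # both 1.5
print(dist.var(), alpha / beta**2)   # both 0.75

# Same scale convention in NumPy's sampler:
rng = np.random.default_rng(0)
samples = rng.gamma(shape=alpha, scale=1 / beta, size=100_000)
print(samples.mean())                # ~ 1.5
```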
The Cauchy has no mean, not a mean of zero
The standard Cauchy distribution is symmetric about zero, so you might think its mean is zero. It is not. The mean does not exist because $\int_{-\infty}^{\infty} \frac{|x|}{\pi(1 + x^2)}\,dx = \infty$. The "center" of the Cauchy is the median (and mode), which is $x_0$, but the expectation is undefined.
Summary
- Bernoulli/Binomial/Categorical/Multinomial: discrete counts; Bernoulli MGF underlies Chernoff bounds; Categorical is what softmax parameterizes; cross-entropy is the Categorical NLL
- Geometric/Negative Binomial: waiting times for first or $r$-th success; coupon collector and overdispersed counts
- Poisson: counts with mean = variance; approximates Binomial for rare events
- Gaussian: the universal limit (CLT); maximizes entropy for given mean and variance
- Exponential/Gamma: positive continuous; memoryless property for Exponential
- Beta/Dirichlet: priors on probabilities; conjugate to Bernoulli/Multinomial
- Chi-squared: sum of squared normals; MGF for $t < 1/2$ drives concentration via Laurent-Massart; appears in random projections (Johnson-Lindenstrauss), variance estimates, and likelihood-ratio tests
- Student-t: Gaussian with unknown variance; heavier tails accommodate outliers
- Cauchy: the pathological case: no moments, LLN fails
Exercises
Problem
If $X \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$ independently and $a, b \in \mathbb{R}$, what is the distribution of $X + Y$? What about $aX + b$?
Problem
Show that $\mathrm{Beta}(1, 1) = \mathrm{Uniform}(0, 1)$ by writing out the Beta PDF with $\alpha = \beta = 1$.
Problem
Derive the chi-squared MGF $M_X(t) = (1 - 2t)^{-k/2}$ for $X \sim \chi^2_k$ and $t < 1/2$ from the standard normal distribution, and explain why it diverges at $t = 1/2$.
Problem
Show that if $Z \sim \mathcal{N}(0, 1)$ and $V \sim \chi^2_\nu$ are independent, then $\frac{Z}{\sqrt{V/\nu}} \sim t_\nu$.
References
Canonical:
- Casella & Berger, Statistical Inference (2002), Chapters 3-5
- DeGroot & Schervish, Probability and Statistics (2012), Chapters 5-6
Current:
- Murphy, Probabilistic Machine Learning: An Introduction (2022), Sections 2.2-2.7
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 2
- Durrett, Probability: Theory and Examples 5th ed (2019), Chapters 2-3
- Billingsley, Probability and Measure 3rd ed (1995), Chapters 20-21
Next Topics
Building on these distribution foundations:
- Concentration inequalities: bounding tail probabilities of sums
- Common inequalities: the algebraic tools that connect distributions to bounds
- Empirical risk minimization: where distributional assumptions meet learning
Last reviewed: April 23, 2026