Foundations
Joint, Marginal, and Conditional Distributions
Joint distributions, marginalization, conditional distributions, independence, Bayes theorem, and the chain rule of probability.
Why This Matters
Probability gets interesting the moment one variable is no longer enough. You do not just care about a label or a feature vector by itself. You care about how they move together, what remains if you forget one variable, and what changes once one variable is revealed.
That is the whole joint / marginal / conditional story:
- the joint tells you the full relationship
- the marginal forgets one coordinate
- the conditional freezes one coordinate and looks at the other
In ML, these are everywhere. A classifier models $p(y \mid x)$. A generative model learns $p(x)$ or $p(x, y)$. Bayesian inference starts from a joint model and turns it into a posterior with Bayes theorem.
Quick Version
| Object | Plain meaning | Typical ML question |
|---|---|---|
| Joint distribution | The full map of how two variables co-occur | What feature-label pairs appear together? |
| Marginal | What remains after forgetting one variable | How often do these inputs appear overall? |
| Conditional | Distribution of $Y$ once $X$ is fixed | Given this input, what label is likely? |
| Independence | Knowing $X$ changes nothing about $Y$ | Is one variable informative about the other at all? |
Want to build those mechanics by hand before reading the formal rules? Try the Probability Mechanics Lab, then come back here and read the definitions as compressed versions of the same picture.
Visual Map
One table contains the full probabilistic story
Start with the full co-occurrence table. Then forget one coordinate to get a marginal, or freeze one coordinate to get a conditional slice.
The full table is the joint law: $p_{X,Y}(x, y)$.
Sum over the forgotten variable to get $p_X(x) = \sum_y p_{X,Y}(x, y)$.
Keep one column and renormalize it to get $p_{Y \mid X}(y \mid x)$.
The visual below should be read left to right: the joint object holds all the information, marginals are what remains after summing out a variable, and the conditional slice is what remains after fixing one coordinate and renormalizing.
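The same left-to-right reading works in code. A minimal NumPy sketch with a hypothetical 2×3 joint table: marginals come from summing out an axis, conditionals from freezing a slice and renormalizing.

```python
import numpy as np

# Hypothetical joint PMF over X (rows: x = 0, 1) and Y (columns: y = 0, 1, 2).
# Entries are P(X = x, Y = y); they must be non-negative and sum to 1.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])
assert np.isclose(joint.sum(), 1.0)

# Marginal: sum out the coordinate you want to forget.
p_x = joint.sum(axis=1)   # P(X = x) = sum_y P(X = x, Y = y)
p_y = joint.sum(axis=0)   # P(Y = y) = sum_x P(X = x, Y = y)

# Conditional: freeze X = 0, then renormalize that row.
cond_y_given_x0 = joint[0] / p_x[0]   # P(Y = y | X = 0)
```

Each conditional slice sums to 1 after renormalization, which is exactly the "valid distribution for each fixed $x$" property stated below.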
Core Definitions
Joint Distribution
For discrete random variables $X$ and $Y$: the joint PMF is $p_{X,Y}(x, y) = P(X = x, Y = y)$. For continuous random variables (see common probability distributions for standard families): the joint PDF $f_{X,Y}$ satisfies $P((X, Y) \in A) = \iint_A f_{X,Y}(x, y)\,dx\,dy$. Both must be non-negative and sum/integrate to 1.
Marginal Distribution
The marginal of $X$ is obtained by summing or integrating out $Y$:
Discrete: $p_X(x) = \sum_y p_{X,Y}(x, y)$
Continuous: $f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy$
Marginalization discards all information about $Y$.
Conditional Distribution
The conditional distribution of $Y$ given $X = x$ is:
Discrete: $p_{Y \mid X}(y \mid x) = \frac{p_{X,Y}(x, y)}{p_X(x)}$ for $p_X(x) > 0$
Continuous: $f_{Y \mid X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)}$ for $f_X(x) > 0$
This defines a valid distribution over $y$ for each fixed $x$.
Independence
Random variables $X$ and $Y$ are independent (written $X \perp Y$) if and only if:
$$p_{X,Y}(x, y) = p_X(x)\,p_Y(y) \quad \text{for all } x, y$$
(and analogously for densities). Equivalently, $p_{Y \mid X}(y \mid x) = p_Y(y)$ for all $x$ with $p_X(x) > 0$: knowing $X$ tells you nothing about $Y$.
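The factorization test is easy to check numerically. A sketch using a hypothetical joint built as an outer product, so $X$ and $Y$ are independent by construction:

```python
import numpy as np

# Hypothetical marginals; their outer product is a joint in which
# P(X = x, Y = y) = P(X = x) * P(Y = y) holds exactly.
p_x = np.array([0.3, 0.7])
p_y = np.array([0.5, 0.2, 0.3])
joint = np.outer(p_x, p_y)

# Independence test: the joint equals the product of its own marginals.
independent = np.allclose(joint,
                          np.outer(joint.sum(axis=1), joint.sum(axis=0)))
```

Any joint table that fails this `allclose` check encodes some dependence between the two variables.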
Conditional Independence
$X$ and $Y$ are conditionally independent given $Z$ (written $X \perp Y \mid Z$) if and only if:
$$p_{X,Y \mid Z}(x, y \mid z) = p_{X \mid Z}(x \mid z)\,p_{Y \mid Z}(y \mid z)$$
In the discrete / strictly-positive density setting this reduces to "$p(x, y \mid z) = p(x \mid z)\,p(y \mid z)$ for all $x, y, z$ with $p_Z(z) > 0$." For general (continuous, mixed, or singular) joint laws, conditional distributions are only defined up to $P_Z$-null sets, so the factorization is an almost-sure statement in $Z$: equivalently, $\mathbb{E}[f(X)\,g(Y) \mid Z] = \mathbb{E}[f(X) \mid Z]\,\mathbb{E}[g(Y) \mid Z]$ a.s. for all bounded measurable $f, g$, or factorization through regular conditional probabilities. Conditional independence is the foundation of graphical models and the naive Bayes assumption. For modeling complex dependence structures beyond independence, copulas separate the marginal behavior from the joint dependence.
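The discrete, strictly-positive case can be verified on a small table. A sketch of a hypothetical common-cause model ($Z \to X$, $Z \to Y$) in which conditional independence given $Z$ holds by construction, while marginal independence fails:

```python
import numpy as np

# Hypothetical common-cause model: Z influences both X and Y.
p_z = np.array([0.4, 0.6])              # P(Z = z)
p_x_given_z = np.array([[0.9, 0.1],     # row z holds P(X = x | Z = z)
                        [0.2, 0.8]])
p_y_given_z = np.array([[0.7, 0.3],     # row z holds P(Y = y | Z = z)
                        [0.1, 0.9]])

# Joint P(X = x, Y = y, Z = z) = P(Z = z) P(X = x | Z = z) P(Y = y | Z = z)
joint = np.einsum('z,zx,zy->xyz', p_z, p_x_given_z, p_y_given_z)

# Conditional independence: for each z, P(X, Y | Z = z) factorizes.
for z in range(2):
    pxy_given_z = joint[:, :, z] / p_z[z]
    assert np.allclose(pxy_given_z,
                       np.outer(p_x_given_z[z], p_y_given_z[z]))

# But the marginal joint P(X, Y) does NOT factorize: Z couples X and Y.
pxy = joint.sum(axis=2)
marginally_independent = np.allclose(
    pxy, np.outer(pxy.sum(axis=1), pxy.sum(axis=0)))
```

This is the structure exploited by graphical models: the three-way joint needs only the local factors $p(z)$, $p(x \mid z)$, $p(y \mid z)$.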
Main Theorems
Bayes Theorem
Statement
In density form:
$$p(\theta \mid x) = \frac{p(x \mid \theta)\,p(\theta)}{p(x)}$$
where $p(x) = \int p(x \mid \theta)\,p(\theta)\,d\theta$ is the marginal (the "evidence").
Intuition
Bayes theorem inverts a conditional: it converts $p(x \mid \theta)$ (the likelihood) into $p(\theta \mid x)$ (the posterior) by incorporating the prior $p(\theta)$. The denominator $p(x)$ normalizes the result.
Proof Sketch
Start from the definition: $p(\theta \mid x) = p(\theta, x) / p(x)$ and $p(x \mid \theta) = p(\theta, x) / p(\theta)$. From the second equation, $p(\theta, x) = p(x \mid \theta)\,p(\theta)$. Substituting into the first gives Bayes theorem.
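The two-line proof above can be exercised numerically. A minimal sketch with hypothetical numbers: a uniform prior over two hypotheses and made-up likelihoods for one observed data point.

```python
# Hypothetical discrete Bayes update over two hypotheses.
prior = [0.5, 0.5]        # p(theta): beliefs before seeing data
likelihood = [0.9, 0.3]   # p(x | theta) for the observed x

# Evidence p(x): average the likelihood over the prior.
evidence = sum(l * p for l, p in zip(likelihood, prior))

# Posterior p(theta | x) = p(x | theta) p(theta) / p(x).
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]
```

The posterior sums to 1 by construction, which is exactly the role of the denominator.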
Why It Matters
Bayes theorem is the foundation of Bayesian inference. The prior encodes beliefs before seeing data. The likelihood connects data to parameters. The posterior combines both. In ML: Bayesian neural networks, Gaussian processes, and MAP estimation all rest on this formula.
Failure Mode
The denominator $p(x) = \int p(x \mid \theta)\,p(\theta)\,d\theta$ is often intractable to compute because it requires integrating (or summing) over all possible $\theta$. This is why approximate inference (MCMC, variational inference) is needed in practice.
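When the evidence integral has no closed form, the simplest approximation averages the likelihood over samples from the prior. A Monte Carlo sketch for a hypothetical model: $\theta \sim \mathrm{Uniform}(0, 1)$, $x \mid \theta \sim \mathrm{Bernoulli}(\theta)$, observed $x = 1$, so $p(x{=}1 \mid \theta) = \theta$ and the true evidence is $\int_0^1 \theta\,d\theta = 0.5$.

```python
import random

random.seed(0)

# Draw theta from the Uniform(0, 1) prior.
samples = [random.random() for _ in range(100_000)]

# Monte Carlo evidence: p(x=1) ~= mean of p(x=1 | theta) = mean of theta.
evidence_mc = sum(samples) / len(samples)
```

Real models replace the one-liner likelihood with an expensive density evaluation; the averaging structure stays the same, and more sophisticated estimators (importance sampling, MCMC) exist to reduce its variance.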
Chain Rule and Law of Total Probability
Chain rule of probability for random variables:
$$p(x_1, x_2, \dots, x_n) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_1, x_2) \cdots p(x_n \mid x_1, \dots, x_{n-1})$$
This is not an assumption; it follows directly from the definition of conditional probability applied repeatedly.
Law of total probability (discrete version):
$$p_X(x) = \sum_y p_{X \mid Y}(x \mid y)\,p_Y(y)$$
This "averages out" the conditioning variable. It is used to compute marginals from conditionals and appears throughout mixture model derivations. These operations connect directly to computing expectations and moments.
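Both identities are easy to sanity-check on a table. A sketch verifying the law of total probability on a hypothetical 2×2 joint (rows index $x$, columns index $y$):

```python
import numpy as np

# Hypothetical joint PMF: rows are x, columns are y.
joint = np.array([[0.1, 0.3],
                  [0.4, 0.2]])

p_y = joint.sum(axis=0)            # P(Y = y)
p_x_given_y = joint / p_y          # column y holds P(X = x | Y = y)

# Law of total probability: average the conditionals over P(Y).
p_x_total = (p_x_given_y * p_y).sum(axis=1)

# It recovers the marginal computed directly from the joint.
assert np.allclose(p_x_total, joint.sum(axis=1))
```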
Common Confusions
Marginal independence does not imply conditional independence
$X$ and $Y$ can be marginally independent ($X \perp Y$) but conditionally dependent ($X \not\perp Y \mid Z$). The classic example: two independent causes of a common effect $Z$. Once you observe $Z$, learning about $X$ tells you about $Y$ (explaining away).
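Explaining away shows up in even a tiny table. A sketch with a hypothetical XOR structure: $X$ and $Y$ are independent fair coins and $Z = X \oplus Y$, so observing $Z$ couples them completely.

```python
import numpy as np

# Joint P(X = x, Y = y, Z = z): X, Y independent fair coins, Z = X XOR Y.
joint_xyz = np.zeros((2, 2, 2))
for x in range(2):
    for y in range(2):
        joint_xyz[x, y, x ^ y] = 0.25   # P(X = x) P(Y = y), Z determined

# Marginally independent: P(X, Y) = 0.25 everywhere = P(X) P(Y).
pxy = joint_xyz.sum(axis=2)
assert np.allclose(pxy, 0.25)

# Conditionally dependent: given Z = 0, X determines Y exactly.
p_z0 = joint_xyz[:, :, 0].sum()
pxy_given_z0 = joint_xyz[:, :, 0] / p_z0   # [[0.5, 0], [0, 0.5]]
```

Conditioned on $Z = 0$, all the mass sits on the diagonal $x = y$, which no product of marginals can produce.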
Conditional independence does not imply marginal independence
The converse also fails. $X \perp Y \mid Z$ does not imply $X \perp Y$. Example: $Z \sim \mathrm{Bernoulli}(1/2)$, $X = Z$, $Y = Z$. Then $X \perp Y \mid Z$ (both are deterministic given $Z$), but $X$ and $Y$ are perfectly dependent marginally.
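A sketch of one such construction, assuming a fair coin $Z$ copied into both $X$ and $Y$:

```python
import numpy as np

# Joint P(X = x, Y = y, Z = z) for Z ~ Bernoulli(1/2), X = Z, Y = Z:
# all mass sits on (0, 0, 0) and (1, 1, 1).
joint_xyz = np.zeros((2, 2, 2))
joint_xyz[0, 0, 0] = 0.5
joint_xyz[1, 1, 1] = 0.5

# Marginal joint P(X, Y) is diagonal; the product of its marginals is flat.
pxy = joint_xyz.sum(axis=2)                               # [[0.5, 0], [0, 0.5]]
product = np.outer(pxy.sum(axis=1), pxy.sum(axis=0))      # all 0.25
marginally_independent = np.allclose(pxy, product)
```

Given $Z$, both variables are constants (trivially independent), yet marginally $X = Y$ with probability 1.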
Conditioning on zero-probability events
For continuous random variables, $P(X = x) = 0$ for any specific $x$. The conditional density $f_{Y \mid X}(y \mid x)$ is defined as a ratio of densities, not by dividing by the probability of the zero-probability event $\{X = x\}$. Rigorous treatment uses the Radon-Nikodym derivative or disintegration of measures.
Canonical Examples
Naive Bayes classifier
Naive Bayes assumes features are conditionally independent given the class: $p(x_1, \dots, x_d \mid y) = \prod_{j=1}^{d} p(x_j \mid y)$. By Bayes theorem, $p(y \mid x_1, \dots, x_d) \propto p(y) \prod_{j=1}^{d} p(x_j \mid y)$. This reduces the number of parameters from exponential in $d$ to linear in $d$. The independence assumption is almost always false, but the classifier often works well because the decision boundary can still be correct even when the probability estimates are wrong.
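The factorization fits in a few lines of code. A minimal Bernoulli naive Bayes sketch on hypothetical toy data, with Laplace smoothing and log-space scoring for numerical stability:

```python
import numpy as np

# Hypothetical toy data: 4 examples, 3 binary features, binary labels.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 0],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])

classes = np.array([0, 1])
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# Laplace-smoothed Bernoulli parameters: theta[c, j] = p(x_j = 1 | y = c).
theta = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2)
                  for c in classes])

def predict(x):
    # Score each class with log p(y) + sum_j log p(x_j | y); argmax wins.
    log_lik = (np.log(theta) * x + np.log(1 - theta) * (1 - x)).sum(axis=1)
    return classes[np.argmax(log_prior + log_lik)]
```

The per-feature product in the exponent is exactly the conditional-independence assumption; everything else is bookkeeping.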
Exercises
Problem
A medical test has sensitivity $s = P(+ \mid D)$ and specificity $t = P(- \mid D^c)$, and the disease prevalence is $p = P(D)$. Express $P(D \mid +)$ in terms of $s$, $t$, and $p$ using Bayes theorem.
Problem
Let $(X, Y)$ have a joint density $f_{X,Y}(x, y)$ that is positive on a region $S \subseteq \mathbb{R}^2$ and $0$ otherwise. Write the marginal densities $f_X$ and $f_Y$, and the conditional density $f_{Y \mid X}(y \mid x)$, in terms of $f_{X,Y}$.
References
Canonical:
- Blitzstein & Hwang, Introduction to Probability (2019), Chapter 7.
- Billingsley, Probability and Measure (1995), Chapter 1.
- Durrett, Probability: Theory and Examples (2019), Chapter 1.
- Grimmett & Stirzaker, Probability and Random Processes (2020), Chapters 1-3.
- Casella & Berger, Statistical Inference (2002), Chapters 1-2.
For ML context:
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 2, 8.
- Murphy, Machine Learning: A Probabilistic Perspective (2012), Chapters 2, 10.
- Koller & Friedman, Probabilistic Graphical Models (2009), Chapters 2-4.
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Common Probability Distributions (layer 0A · tier 1)
- Kolmogorov Probability Axioms (layer 0A · tier 1)
- Random Variables (layer 0A · tier 1)
Derived topics
- Expectation, Variance, Covariance, and Moments (layer 0A · tier 1)
- The Multivariate Normal Distribution (layer 0B · tier 1)
- Bayesian Estimation (layer 0B · tier 2)
- Gaussian Process Regression (layer 3 · tier 2)
- Gaussian Processes for Machine Learning (layer 4 · tier 3)