
Methodology

Causal Inference Basics

Correlation is not causation. The potential outcomes framework, average treatment effects, confounders, and the methods that let you estimate causal effects from data.

Advanced · Tier 3 · Stable · Supporting · ~55 min

Why This Matters

Machine learning excels at prediction: given features $X$, predict outcome $Y$. But many important questions are causal: if we change $X$, what happens to $Y$? Will this drug reduce mortality? Will showing a different ad increase clicks? Will a policy intervention reduce crime?

Prediction and causation are structurally different. A linear regression model can predict hospital mortality perfectly using "is on a ventilator" as a feature, but putting healthy people on ventilators does not help. Confounders, variables that cause both treatment and outcome, are the central obstacle. Causal inference provides the framework for identifying when and how you can estimate causal effects from data, and it matters for A/B testing, fairness, treatment effect estimation, and counterfactual reasoning in ML.

Embedded Causal Diagram

The key move is to block the backdoor path

The same data can support two different quantities: an observed association and a causal effect. What changes is whether the confounder still opens the path from treatment to outcome.

[Diagram: two panels. Left, "Observed association": the confounder still changes treatment assignment ($X \to W$, $X \to Y$, $W \to Y$). Right, "Intervention or valid adjustment": the confounding path from $X$ to $W$ is blocked or cut.]
  • $X$ (Severity): the confounder that changes both treatment assignment and outcome.
  • $W$ (Treatment): the intervention or decision whose effect you want to estimate.
  • $Y$ (Outcome): the result after treatment, policy, or exposure.

Observed quantity

$\mathbb{E}[Y \mid W=1] - \mathbb{E}[Y \mid W=0]$ answers: how do outcomes among treated units differ from outcomes among untreated units?

If severity affects both who gets treated and how they do, this mixes the treatment effect with selection.

Causal target

$\mathbb{E}[Y(1) - Y(0)]$ asks what would change if treatment were actively set.

Randomization cuts the confounder-to-treatment arrow by design. In observational work, you try to block the same path with the right adjustment set.

Mental Model

Imagine you want to know if a new drug works. Ideally, you would give each patient both the drug and a placebo, then compare outcomes. But each patient can only live one life. This is the fundamental problem of causal inference: you never observe both potential outcomes for the same unit.

Randomized experiments solve this by making treated and control groups comparable on average. Without randomization, you must argue carefully that you have accounted for all confounders. Observational causal inference is the art of making that argument rigorous.

Two causal languages fit together here. The potential outcomes notation defines the estimand you want: ATE, ATT, or some design-specific quantity. The graphical / DAG view makes the assumptions visible: which paths are causal, which paths are spurious, and which variables you must adjust for to block the backdoor path. This page uses the Rubin-style notation as the formal core and uses DAG intuition to explain why the assumptions matter.

Formal Setup and Notation

Consider $n$ units (e.g., patients) indexed by $i$. Each unit has a binary treatment $W_i \in \{0, 1\}$, covariates $X_i$, and an observed outcome $Y_i$.

Definition

Potential Outcomes

For each unit $i$, the potential outcomes are:

  • $Y_i(1)$: the outcome if unit $i$ receives treatment ($W_i = 1$)
  • $Y_i(0)$: the outcome if unit $i$ receives control ($W_i = 0$)

The observed outcome is $Y_i = W_i Y_i(1) + (1 - W_i) Y_i(0)$. The individual treatment effect is $\tau_i = Y_i(1) - Y_i(0)$, which is never observed because we only see one potential outcome per unit.
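A toy simulation makes the consistency equation and the fundamental problem concrete. This is a sketch under assumed numbers (a constant individual effect of 2.0); in a simulation, unlike reality, both potential outcomes can be generated:

```python
import random

random.seed(0)
rows = []
for i in range(5):
    y0 = random.gauss(0.0, 1.0)      # Y_i(0): outcome under control
    y1 = y0 + 2.0                    # Y_i(1): outcome under treatment (toy effect = 2.0)
    w = random.randint(0, 1)         # realized treatment W_i
    y_obs = w * y1 + (1 - w) * y0    # consistency: Y_i = W_i Y_i(1) + (1 - W_i) Y_i(0)
    rows.append({"w": w, "y0": y0, "y1": y1, "y_obs": y_obs})

# Each unit's individual effect y1 - y0 is exactly 2.0, but outside the
# simulation a row would reveal only y1 (if w == 1) or y0 (if w == 0), never both.
```

The observed column `y_obs` always matches whichever potential outcome the realized treatment selects, which is exactly what the consistency equation encodes.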

Definition

Average Treatment Effect (ATE)

The average treatment effect is:

$$\tau = \mathbb{E}[Y_i(1) - Y_i(0)]$$

This is the expected causal effect of treatment across the population.

Definition

Average Treatment Effect on the Treated (ATT)

The ATT conditions on the treated group:

$$\tau_{\text{ATT}} = \mathbb{E}[Y_i(1) - Y_i(0) \mid W_i = 1]$$

This answers: for those who actually received treatment, what was the average effect?

Definition

Confounder

A confounder is a variable $X$ that causally affects both the treatment assignment $W$ and the outcome $Y$. When confounders are present, the naive comparison $\mathbb{E}[Y \mid W=1] - \mathbb{E}[Y \mid W=0]$ does not equal the ATE because treated and control groups differ systematically.
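A minimal simulation of this bias (all numbers are illustrative assumptions: a true effect of 1.0, a binary severity confounder that raises both treatment probability and the outcome). The naive contrast overstates the effect; stratifying on the confounder recovers it:

```python
import random

random.seed(1)
n = 200_000
tau = 1.0                                   # true effect, chosen for the demo
data = []
for _ in range(n):
    x = 1 if random.random() < 0.5 else 0   # binary confounder, e.g., severity
    p = 0.8 if x else 0.2                   # severe cases are treated more often
    w = 1 if random.random() < p else 0
    y = tau * w + 2.0 * x + random.gauss(0.0, 1.0)  # severity also raises the outcome
    data.append((x, w, y))

def mean(vals):
    return sum(vals) / len(vals)

# Naive contrast E[Y | W=1] - E[Y | W=0]: mixes the effect with selection on x.
naive = mean([y for x, w, y in data if w == 1]) - mean([y for x, w, y in data if w == 0])

# Adjustment: average the within-stratum contrasts, weighted by P(X = x).
adjusted = 0.0
for xv in (0, 1):
    stratum = [(w, y) for x, w, y in data if x == xv]
    diff = mean([y for w, y in stratum if w == 1]) - mean([y for w, y in stratum if w == 0])
    adjusted += diff * len(stratum) / n
```

With these numbers the naive contrast lands well above 1.0 while the stratified estimate sits near the true effect, because within each severity stratum treatment is effectively random.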

Core Definitions

The identification problem: to estimate the ATE from observational data, we need assumptions that link the observed data distribution to the causal quantity. The two standard assumptions are:

Definition

Ignorability / Conditional Exchangeability

Ignorability (also called unconfoundedness or conditional exchangeability) says:

$$(Y(0), Y(1)) \perp W \mid X$$

Given covariates $X$, treatment assignment is independent of potential outcomes. This means there are no unmeasured confounders after conditioning on the measured covariates. If severity, wealth, or baseline risk still affect both treatment and outcome after conditioning on $X$, the causal estimate is biased.

Definition

Overlap / Positivity

Overlap (also called positivity) says:

$$0 < P(W = 1 \mid X = x) < 1 \quad \text{for all } x$$

Every unit has a nonzero chance of receiving either treatment or control. Without overlap, you cannot compare treated and control units with similar covariates. Extreme propensity scores near $0$ or $1$ mean the design is trying to answer a question the data hardly support.
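The variance cost of weak overlap can be seen directly: shrinking one stratum's propensity toward 0 inflates the IPW weight $1/e(x)$ and blows up the estimator's spread. A sketch with assumed numbers (true effect 1.0; the IPW formula itself is stated in the identification theorem below):

```python
import random

def ipw_estimates(e_low, reps=200, n=2000, tau=1.0, seed=6):
    """Monte Carlo draws of the IPW estimate when stratum x=0 has propensity
    e_low (overlap worsens as e_low -> 0) and stratum x=1 has propensity 0.5."""
    rng = random.Random(seed)
    ests = []
    for _ in range(reps):
        terms = []
        for _ in range(n):
            x = 1 if rng.random() < 0.5 else 0
            e = 0.5 if x else e_low
            w = 1 if rng.random() < e else 0
            y = tau * w + 2.0 * x + rng.gauss(0.0, 1.0)
            terms.append(w * y / e - (1 - w) * y / (1 - e))
        ests.append(sum(terms) / n)
    return ests

def spread(vals):
    m = sum(vals) / len(vals)
    return (sum((v - m) ** 2 for v in vals) / len(vals)) ** 0.5

sd_good = spread(ipw_estimates(0.4))    # decent overlap
sd_poor = spread(ipw_estimates(0.02))   # near-violation of positivity
```

Both settings target the same effect, but the near-positivity-violation setting produces a markedly noisier estimator, which is the practical symptom of extreme propensity scores.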

Definition

SUTVA

The stable unit treatment value assumption (SUTVA) bundles two ideas:

  • No interference: one unit's treatment does not change another unit's outcome
  • Well-defined treatment: "treatment" is a single intervention, not a vague bundle of different versions

SUTVA is easy to forget because it sounds technical, but it is load-bearing. If different hospitals deliver "the same" treatment in materially different ways, or if one person's treatment spills over onto another person, then the potential-outcomes notation hides heterogeneity that must be modeled explicitly.

Definition

Propensity Score

The propensity score is the conditional probability of treatment given covariates:

$$e(x) = P(W = 1 \mid X = x)$$

The propensity score is a scalar summary of the covariates that is sufficient for confounding adjustment.
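In practice $e(x)$ is estimated, typically with logistic regression on the covariates. With a single discrete covariate the maximum-likelihood estimate reduces to the treated fraction in each stratum, which keeps this sketch dependency-free (the propensities 0.2 and 0.8 are assumptions of the toy setup):

```python
import random

random.seed(2)
n = 100_000
xs, ws = [], []
for _ in range(n):
    x = 1 if random.random() < 0.5 else 0
    e_true = 0.8 if x else 0.2              # true (in practice unknown) e(x)
    xs.append(x)
    ws.append(1 if random.random() < e_true else 0)

# With a discrete covariate, the MLE of e(x) is the treated fraction per stratum.
e_hat = {}
for xv in (0, 1):
    treated = [w for x, w in zip(xs, ws) if x == xv]
    e_hat[xv] = sum(treated) / len(treated)
```

With continuous or high-dimensional $X$, the same role is played by a fitted model such as logistic regression, and the estimated scores feed into matching, stratification, or weighting.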

Main Theorems

Theorem

Identification of the ATE Under Ignorability

Statement

Under ignorability and overlap, the average treatment effect is identified from observed data:

$$\tau = \mathbb{E}\left[\mathbb{E}[Y \mid X, W=1] - \mathbb{E}[Y \mid X, W=0]\right]$$

Equivalently, using inverse propensity weighting (IPW):

$$\tau = \mathbb{E}\left[\frac{WY}{e(X)} - \frac{(1-W)Y}{1-e(X)}\right]$$

Both expressions equal $\mathbb{E}[Y(1) - Y(0)]$.

Intuition

Ignorability says that within groups of units with the same $X$, treatment is effectively randomized. So the difference in outcomes within each $X$-stratum is a valid causal estimate. Averaging over strata gives the ATE. The IPW formula reweights each observation to create a pseudo-population where treatment is independent of $X$, mimicking a randomized experiment.

Proof Sketch

For the regression form: by ignorability, $\mathbb{E}[Y \mid X, W=1] = \mathbb{E}[Y(1) \mid X, W=1] = \mathbb{E}[Y(1) \mid X]$. Similarly for $W = 0$. So the conditional difference is $\mathbb{E}[Y(1) - Y(0) \mid X]$, and taking the outer expectation gives $\tau$.

For IPW: $\mathbb{E}[WY/e(X)] = \mathbb{E}[\mathbb{E}[WY/e(X) \mid X]]$. By ignorability, $\mathbb{E}[WY \mid X] = \mathbb{E}[W \mid X]\,\mathbb{E}[Y(1) \mid X] = e(X)\,\mathbb{E}[Y(1) \mid X]$. Dividing by $e(X)$ and taking the outer expectation gives $\mathbb{E}[Y(1)]$. The same argument with $(1-W)/(1-e(X))$ gives $\mathbb{E}[Y(0)]$.

Why It Matters

This is the foundational identification result of causal inference. It says exactly when and how you can estimate causal effects from observational data. Without ignorability, no amount of statistical sophistication gives you a causal effect. The theorem also motivates practical methods: outcome regression, IPW, and their doubly-robust combination.

Failure Mode

The theorem relies entirely on the ignorability assumption, which is untestable. If there are unmeasured confounders (e.g., disease severity that is not recorded in the data), the estimate is biased. Sensitivity analysis can assess how large an unmeasured confounder would need to be to change your conclusions.
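The IPW identity can be checked numerically. This is a sketch with an assumed data-generating process (true effect 1.0, binary confounder, propensities 0.2 and 0.8), plugging the true $e(X)$ into the weighting formula:

```python
import random

random.seed(3)
n = 200_000
tau = 1.0                                   # true effect, chosen for the demo
terms = []
for _ in range(n):
    x = 1 if random.random() < 0.5 else 0   # binary confounder
    e = 0.8 if x else 0.2                   # known propensity score e(X)
    w = 1 if random.random() < e else 0
    y = tau * w + 2.0 * x + random.gauss(0.0, 1.0)
    # IPW term: WY/e(X) - (1 - W)Y/(1 - e(X))
    terms.append(w * y / e - (1 - w) * y / (1 - e))

tau_ipw = sum(terms) / n                    # should land near tau
```

Despite heavy confounding by $x$, the reweighted average recovers the true effect, because the weights rebuild the pseudo-population in which treatment is independent of $X$.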

Theorem

Rosenbaum-Rubin Propensity Score Theorem

Statement

If $(Y(0), Y(1)) \perp W \mid X$, then:

$$(Y(0), Y(1)) \perp W \mid e(X)$$

That is, adjusting for the scalar propensity score $e(X)$ is sufficient for confounding control, even when $X$ is high-dimensional.

Intuition

The propensity score is a "summary score" that captures everything about $X$ that matters for treatment assignment. Two units with the same propensity score have the same probability of treatment, so comparing them is like a randomized experiment, even if their covariates $X$ differ in other ways.

Proof Sketch

We need to show $P(W=1 \mid Y(0), Y(1), e(X)) = P(W=1 \mid e(X))$. By the law of iterated expectations:

$$P(W=1 \mid Y(0), Y(1), e(X)) = \mathbb{E}\left[P(W=1 \mid Y(0), Y(1), X) \mid Y(0), Y(1), e(X)\right]$$

By ignorability, $P(W=1 \mid Y(0), Y(1), X) = P(W=1 \mid X) = e(X)$. Since $e(X)$ itself appears in the conditioning set, the inner expectation equals $e(X)$ almost surely, which depends on $e(X)$ alone.

Why It Matters

The propensity score reduces a high-dimensional adjustment problem to a one-dimensional one. Instead of matching or stratifying on all covariates $X$ (which is impractical in high dimensions due to the curse of dimensionality), you can match, stratify, or weight by the single propensity score. This is why propensity score methods are ubiquitous in observational studies.

Failure Mode

The propensity score does not solve the fundamental problem of unmeasured confounders. If ignorability fails given $X$, it also fails given $e(X)$. Also, propensity score matching discards unmatched units, which can change the target estimand. Extreme propensity scores near 0 or 1 (lack of overlap) cause high variance in IPW estimates.

Canonical Examples

Example

Observational study with confounding

You want to estimate the effect of a job training program on earnings. People who enroll are different from those who do not: they may be more motivated or have different baseline skills. Motivation is a confounder (it affects both enrollment and earnings).

A naive comparison of enrollees vs. non-enrollees gives a biased estimate. With measured covariates (age, education, prior earnings), you can use propensity score matching: match each enrollee with a non-enrollee who had similar probability of enrolling. The difference in outcomes within matched pairs estimates the ATT, assuming all confounders are measured.

Common Confusions

Watch Out

A/B tests are randomized experiments, not observational studies

When you run an A/B test, you randomly assign users to treatment and control. This guarantees ignorability by design: no confounders can bias the comparison. You do not need propensity scores or instrumental variables. The simple difference in means is an unbiased estimate of the ATE. The machinery of observational causal inference is for when you cannot randomize.
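A short simulation of this point (the lift of 0.3 and the noise scales are arbitrary assumptions): because $W$ is randomized, the raw difference in means lands on the true ATE with no adjustment at all:

```python
import random

random.seed(4)
n = 100_000
tau = 0.3                                   # true lift, chosen for the demo
treat, ctrl = [], []
for _ in range(n):
    baseline = random.gauss(0.0, 1.0)       # user-level heterogeneity
    w = random.randint(0, 1)                # randomization: W independent of everything else
    y = tau * w + baseline + random.gauss(0.0, 0.2)
    (treat if w == 1 else ctrl).append(y)

# Under randomization, the simple difference in means is unbiased for the ATE.
diff = sum(treat) / len(treat) - sum(ctrl) / len(ctrl)
```

No covariate is used anywhere in the estimate; randomization alone guarantees the treated and control groups are comparable on average.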

Watch Out

Adjusting for more variables is not always safer

Not every variable should be conditioned on. Adjusting for a confounder can remove bias, but adjusting for a collider can create bias and adjusting for a mediator changes the estimand by blocking part of the causal effect. This is why causal inference is not "throw every column into a regression." You need an argument about the causal graph, not just a longer feature list.

Watch Out

Prediction models do not give causal effects

A model that predicts $Y$ well from $X$ and $W$ tells you about associations. The coefficient on $W$ in a regression is not the causal effect of $W$ unless the regression satisfies ignorability. This distinction between predictive and causal models is critical for decision-making in applied settings. A doctor who treats sicker patients will have worse patient outcomes, but this does not mean the doctor is harmful. The association between treatment and outcome is confounded by severity.

Watch Out

Doubly robust is not doubly safe

The doubly robust estimator combines outcome regression and IPW, and it is consistent if either model is correctly specified (not necessarily both). But it is not robust to unmeasured confounders. The "double robustness" is about modeling flexibility, not about the causal assumptions.
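A sketch of the AIPW (doubly robust) estimator on the same kind of toy data used above (true effect 1.0). Here the outcome model is correct (stratum means) while the propensity model is deliberately misspecified as a constant 0.5, and the estimator still recovers the effect; no amount of double robustness would survive an unmeasured confounder:

```python
import random

random.seed(5)
n = 200_000
tau = 1.0                                   # true effect, chosen for the demo
data = []
for _ in range(n):
    x = 1 if random.random() < 0.5 else 0
    e = 0.8 if x else 0.2                   # true propensity (not given to the estimator)
    w = 1 if random.random() < e else 0
    y = tau * w + 2.0 * x + random.gauss(0.0, 1.0)
    data.append((x, w, y))

def mean(vals):
    return sum(vals) / len(vals)

# Correct outcome model: stratum means mu_w(x) = E[Y | X = x, W = w].
mu = {(xv, wv): mean([y for x, w, y in data if x == xv and w == wv])
      for xv in (0, 1) for wv in (0, 1)}

e_wrong = 0.5                               # deliberately misspecified propensity model
terms = []
for x, w, y in data:
    m1, m0 = mu[(x, 1)], mu[(x, 0)]
    terms.append(m1 - m0
                 + w * (y - m1) / e_wrong
                 - (1 - w) * (y - m0) / (1 - e_wrong))
tau_dr = mean(terms)                        # consistent despite the wrong propensity
```

The correction terms have mean zero whenever the outcome model is right, which is exactly the "either model" protection: the symmetric experiment (wrong outcome model, correct propensity) also recovers the effect.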

Summary

  • The fundamental problem: you only observe one potential outcome per unit, so individual causal effects are never directly observable
  • ATE $= \mathbb{E}[Y(1) - Y(0)]$ is the target; the naive difference $\mathbb{E}[Y \mid W=1] - \mathbb{E}[Y \mid W=0]$ equals it only when treatment is unconfounded (e.g., under randomization)
  • Ignorability (no unmeasured confounders) is untestable and is the critical assumption for all observational methods
  • Propensity scores (often estimated via logistic regression) reduce high-dimensional confounding adjustment to a scalar
  • RCTs guarantee ignorability by design; observational methods try to approximate this with measured covariates
  • Instrumental variables and difference-in-differences work when ignorability fails, under different assumptions

Exercises

ExerciseCore

Problem

In an observational study, older people are more likely to receive treatment and also have worse outcomes. Is age a confounder? Draw the causal graph and explain why the naive treatment-outcome comparison is biased.

ExerciseAdvanced

Problem

Prove that the IPW estimator $\hat{\tau} = \frac{1}{n}\sum_{i=1}^n \left[\frac{W_i Y_i}{e(X_i)} - \frac{(1-W_i) Y_i}{1-e(X_i)}\right]$ is unbiased for the ATE under ignorability and overlap when the propensity score is known.

ExerciseResearch

Problem

Instrumental variables (IV) estimation does not require ignorability. Instead, it uses an instrument ZZ that affects treatment WW but has no direct effect on outcome YY. State the IV assumptions precisely and derive the Wald estimator for the case where both ZZ and WW are binary.

References

Canonical:

  • Rubin, Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies (1974)
  • Rosenbaum & Rubin, The Central Role of the Propensity Score in Observational Studies (1983)
  • Imbens & Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences (2015), Chapters 1-3
  • Pearl, Glymour, and Jewell, Causal Inference in Statistics: A Primer (2016), Chapters 1-4

Bridges and design-focused treatments:

  • Hernán & Robins, Causal Inference: What If (online ed., 2024). Strong target-trial, exchangeability, and positivity treatment.
  • Pearl, Causality: Models, Reasoning, and Inference (2nd ed., 2009), Chapters 1-4, 7

Modern ML-facing extensions:

  • Chernozhukov et al., Double/Debiased Machine Learning for Treatment and Structural Parameters (Econometrica, 2018)
  • Peters, Janzing, and Schölkopf, Elements of Causal Inference (2017), Chapters 1-4, 6


Last reviewed: April 23, 2026
