
Statistical Foundations

Design-Based vs. Model-Based Inference

Two philosophies of statistical inference from survey data: design-based inference, where randomness comes from the sampling design, and model-based inference, where randomness comes from a statistical model, together with the model-assisted hybrid that combines them.


Why This Matters

Survey statisticians and ML researchers think about inference differently, and this difference leads to real mistakes when the two communities interact. Survey statisticians typically use design-based inference: the randomness comes from the sampling design, not from a model. ML researchers typically use model-based inference: they assume a data-generating process and fit a model.

Neither approach is universally correct. Design-based inference is robust (it requires no model assumptions) but can be inefficient. Model-based inference is efficient but depends on the model being correct. Understanding both is necessary for anyone who works with survey data or builds models using survey-derived datasets.

Mental Model

Imagine a finite population of $N$ people. Each person has a fixed, non-random value $y_i$ (their income, their height, their vote). There is no randomness in the population itself.

Design-based view: the only randomness is in which people we select for the sample. The sampling design defines a probability distribution over all possible samples. Inference is about the properties of estimators under this distribution.

Model-based view: the population values $y_1, \ldots, y_N$ are themselves random draws from a probability model $y_i \sim f(y \mid x_i, \theta)$. Inference is about the parameters $\theta$ or about predictions from the model.

These are structurally different conceptions of what "random" means in the problem.

Design-Based Inference

Definition

Design-Based Inference

In design-based (also called randomization-based) inference:

  1. The population values $y_1, \ldots, y_N$ are fixed, unknown constants
  2. The only source of randomness is the sampling mechanism $p(S)$
  3. Expectations, variances, and probabilities are computed over repeated sampling from the same fixed population
  4. The target of inference is a finite population quantity: $\bar{Y} = \frac{1}{N}\sum_{i=1}^N y_i$ or $T = \sum_{i=1}^N y_i$

No distributional assumptions are needed. The validity of inference depends entirely on the sampling design.

The workhorse estimator is the Horvitz-Thompson estimator:

$$\hat{T}_{\text{HT}} = \sum_{i \in S} \frac{y_i}{\pi_i}$$

Its unbiasedness $\mathbb{E}_p[\hat{T}_{\text{HT}}] = T$ holds for any set of population values $y_1, \ldots, y_N$, because the expectation is over the sampling design $p$, not over any model for $y$.
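The design-unbiasedness claim can be checked exhaustively for a small population. A minimal sketch (the population values and sample size are illustrative, not from the text): enumerate every SRS of size $n$, compute the HT estimate for each possible sample, and average over the sampling distribution.

```python
# Verify design-unbiasedness of the Horvitz-Thompson estimator by
# enumerating every SRS of size n from a small fixed population.
from itertools import combinations

y = [3.0, 7.0, 12.0, 18.0, 25.0]    # fixed population values (illustrative)
N, n = len(y), 2
pi = n / N                          # SRS: every unit has inclusion prob n/N

T = sum(y)                          # the fixed, non-random target
samples = list(combinations(range(N), n))           # all C(5, 2) = 10 samples
ht = [sum(y[i] / pi for i in s) for s in samples]   # HT estimate per sample

# Design expectation = average over the equally likely samples
print(sum(ht) / len(ht), "vs true total", T)        # 65.0 vs true total 65.0
```

No property of the $y$ values is used anywhere: the average equals $T$ because each unit appears in exactly a $\pi_i$ fraction of samples.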

Strengths of Design-Based Inference

  • Robustness: no model assumptions. If the design is a probability sample, the HT estimator is unbiased regardless of the true relationship between $y$ and $x$.
  • Transparency: the properties of estimators depend only on the known sampling design.
  • Legal standing: official statistics agencies use design-based methods because their validity does not depend on contestable modeling choices.

Weaknesses of Design-Based Inference

  • Inefficiency: the HT estimator ignores auxiliary information (covariates) that could improve precision.
  • Requires probability sampling: if the sample is not a probability sample, design-based inference is undefined.
  • Small domains: design-based estimates for small subpopulations have large variance because they use only the data from that subpopulation.

Model-Based Inference

Definition

Model-Based (Superpopulation) Inference

In model-based inference:

  1. The population values $y_1, \ldots, y_N$ are realizations of random variables from a model (the "superpopulation"): $y_i \sim f(y \mid x_i, \theta)$
  2. Inference is about the model parameters $\theta$ or about predictions $\hat{y}_i$ for unobserved units
  3. Expectations and variances are computed under the model, not under the sampling design
  4. The sampling design is relevant only insofar as it creates selection bias under the model

Under the model, the population mean $\bar{Y}$ is itself a random variable. The target might be $\mathbb{E}_\xi[\bar{Y}]$ (its expectation under the model $\xi$) or $\theta$ itself.

Strengths of Model-Based Inference

  • Efficiency: uses auxiliary information (covariates) to improve estimates.
  • Small domains: can produce estimates for any subpopulation, even those with zero sample size, by extrapolating from the model.
  • Non-probability samples: can (in principle) be used with any sample, as long as the model correctly specifies the relationship between $y$ and $x$.

Weaknesses of Model-Based Inference

  • Model dependence: if the model is wrong, the estimator can be biased. There is no "free lunch" guarantee like the HT estimator's design-unbiasedness.
  • Difficult to verify: model diagnostics help but cannot definitively confirm the model is correct.
  • Controversial in official statistics: agencies are reluctant to publish estimates that depend on modeling choices that could be disputed.

Informative Sampling

Sampling is informative if the sample distribution of the outcome differs from the population distribution given the covariates. Formally, if $f_s(y \mid x)$ denotes the density of $y$ in the sample and $f(y \mid x)$ the population density, sampling is informative when $f_s(y \mid x) \neq f(y \mid x)$. Equivalently, the inclusion probability $\pi_i$ depends on $y_i$ even after conditioning on $x_i$.

Under informative sampling, unweighted model fits are biased for the population parameter $\theta$. The model learns $f_s(y \mid x)$ rather than $f(y \mid x)$. Two remedies restore consistency:

  1. Design-based weighting: use pseudo-maximum likelihood with survey weights $w_i = 1/\pi_i$, targeting the population score equation rather than the sample one.
  2. Full-information likelihood: model $f_s(y \mid x)$ explicitly as a function of $f(y \mid x)$ and the conditional inclusion mechanism, then recover $\theta$ by inverting the relationship.

Pfeffermann and Sverchkov (2007) formalize the sample distribution under informative sampling and derive its relationship to the population distribution through the conditional expectation $\mathbb{E}[\pi_i \mid y_i, x_i]$. When this conditional expectation depends on $y_i$, sampling is informative and naive likelihood methods fail.

This is the main reason ML practitioners cannot discard survey weights when training on survey microdata: if the design is informative, the unweighted sample yields a biased estimate of the population conditional distribution.
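The bias from discarding weights can be seen in a small simulation. The sketch below is illustrative (the population size, superpopulation model, and inclusion rule are all assumptions made for the example): it draws a Poisson sample whose inclusion probabilities depend on $y$, then compares unweighted OLS with the pseudo-maximum-likelihood fit weighted by $1/\pi_i$.

```python
# Weighted vs. unweighted regression under an informative Poisson design.
import numpy as np

rng = np.random.default_rng(0)
N = 50_000
x = rng.normal(size=N)
y = 2.0 + 3.0 * x + rng.normal(size=N)   # superpopulation: y | x ~ N(2 + 3x, 1)

# Informative design: inclusion probability depends on y itself,
# not only on x, so f_s(y | x) differs from f(y | x).
pi = np.clip(1.0 / (1.0 + np.exp(-(y - 2.0))), 0.02, 0.98)
S = rng.random(N) < pi                   # Poisson sampling
xs, ys, w = x[S], y[S], 1.0 / pi[S]

X = np.column_stack([np.ones(xs.size), xs])

# Unweighted OLS learns the sample distribution f_s(y | x): biased for (2, 3)
b_unw = np.linalg.lstsq(X, ys, rcond=None)[0]

# Pseudo-maximum likelihood: weight each unit's contribution by 1/pi_i
sw = np.sqrt(w)
b_wt = np.linalg.lstsq(X * sw[:, None], ys * sw, rcond=None)[0]

print("unweighted:", b_unw)
print("weighted:  ", b_wt)   # close to the population (2, 3)
```

The weighted fit targets the population score equation, so it recovers the superpopulation coefficients; the unweighted fit does not, no matter how large the sample.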

Model-Assisted Inference

Definition

Model-Assisted Inference

Model-assisted inference uses a model to improve efficiency while retaining the design-based validity guarantee. The idea: use the model to construct a good estimator, but evaluate its properties (bias, variance) under the sampling design, not the model.

The generalized regression estimator (GREG) is the standard model-assisted estimator:

$$\hat{T}_{\text{GREG}} = \sum_{i \in S} \frac{y_i}{\pi_i} + \left(\sum_{i=1}^N x_i - \sum_{i \in S} \frac{x_i}{\pi_i}\right)^T \hat{\beta}$$

where $\hat{\beta}$ is a regression coefficient vector estimated from the sample. This is the HT estimator plus a correction term that exploits the known population totals of the auxiliary variables $x$.

The GREG estimator is approximately design-unbiased regardless of whether the regression model is correct. If the model fits well, its variance shrinks because it depends on the residuals $y_i - x_i^T \hat{\beta}$ rather than on the raw $y_i$. If the model fits poorly, little efficiency is gained, but the design-based validity guarantee still holds.

Calibration estimators generalize GREG. Calibration adjusts the design weights $1/\pi_i$ to new weights $w_i$ that (a) stay close to $1/\pi_i$ under a chosen distance and (b) reproduce known population totals of auxiliary variables exactly: $\sum_{i \in S} w_i x_i = \sum_{i=1}^N x_i$. GREG corresponds to calibration under a quadratic distance. Deville and Särndal (1992) and Särndal, Swensson, and Wretman (1992, Chapter 6) give the general framework.
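A Monte Carlo sketch of the efficiency gain (all population parameters are illustrative assumptions): repeatedly draw an SRS from one fixed population in which $y$ tracks an auxiliary $x$, and compare the spread of the HT and GREG estimates of the total.

```python
# HT vs. GREG over repeated SRS draws from a single fixed population.
import numpy as np

rng = np.random.default_rng(1)
N, n = 2_000, 100
x = rng.uniform(1, 10, size=N)
y = 5.0 * x + rng.normal(scale=2.0, size=N)  # fixed population, y related to x

T = y.sum()          # true population total (a fixed number)
Tx = x.sum()         # known population total of the auxiliary variable
pi = n / N           # SRS inclusion probability

ht, greg = [], []
for _ in range(2_000):
    s = rng.choice(N, size=n, replace=False)
    xs, ys = x[s], y[s]
    t_ht = ys.sum() / pi                      # Horvitz-Thompson
    X = np.column_stack([np.ones(n), xs])
    beta = np.linalg.lstsq(X, ys, rcond=None)[0]
    # GREG = HT + (known totals - HT-estimated totals of auxiliaries) . beta
    corr = (np.array([N, Tx]) - np.array([n, xs.sum()]) / pi) @ beta
    ht.append(t_ht)
    greg.append(t_ht + corr)

print("HT   sd:", np.std(ht))
print("GREG sd:", np.std(greg))   # much smaller when y tracks x
```

Both estimators center on the same fixed $T$; GREG simply replaces the spread of $y$ with the spread of the regression residuals.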

Main Theorems

Theorem

Design Consistency of the Horvitz-Thompson Estimator

Statement

Let $\hat{\bar{Y}}_{\text{HT}} = \frac{1}{N}\sum_{i \in S} y_i/\pi_i$ be the HT estimator of the population mean $\bar{Y}$. Under a sequence of finite populations and sampling designs in which $n/N \to f \in (0,1)$ and regularity conditions on the inclusion probabilities hold:

$$\hat{\bar{Y}}_{\text{HT}} \xrightarrow{p_{\text{design}}} \bar{Y}$$

and

$$\frac{\hat{\bar{Y}}_{\text{HT}} - \bar{Y}}{\sqrt{\text{Var}_p(\hat{\bar{Y}}_{\text{HT}})}} \xrightarrow{d} N(0,1)$$

The convergence is in probability and in distribution under the sampling design $p$, for any fixed set of population values $y_1, \ldots, y_N$.

Intuition

As the sample gets large relative to the population, the HT estimator converges to the population mean and is approximately normal. This is a law of large numbers and central limit theorem under the randomization distribution, not under any model. The result holds for any population values because the randomness is in the sampling, not the data.

Proof Sketch

Write $\hat{\bar{Y}}_{\text{HT}} - \bar{Y} = \frac{1}{N}\sum_{i=1}^N (Z_i/\pi_i - 1)y_i$ where $Z_i = \mathbf{1}(i \in S)$. Under regularity conditions on the design (Hájek conditions), this is a sum of weakly dependent mean-zero random variables. A CLT for finite-population sampling (Hájek, 1960) gives asymptotic normality.

Why It Matters

This theorem is the theoretical backbone of design-based survey inference. It guarantees that HT-based confidence intervals have correct coverage in large samples, with no model assumptions. This is why statistical agencies trust design-based methods: the theory requires only a probability sampling design and regularity conditions.

Failure Mode

The convergence requires the sample size to grow. For small samples (small areas, rare domains), the normal approximation is poor and confidence intervals may have incorrect coverage. The regularity conditions on inclusion probabilities can fail in highly unequal probability designs where some $\pi_i$ are very small; there the HT estimator has high variance and the normal approximation degrades.

The Key Philosophical Difference

In design-based inference, the population quantity $\bar{Y}$ is a fixed number. There is nothing random about it. The probability statement "95% confidence interval" means: if we repeated the sampling procedure many times from the same population, 95% of the resulting intervals would contain $\bar{Y}$.

In model-based inference, $\bar{Y}$ is a random variable. A "95% credible interval" (Bayesian) or "95% prediction interval" means something different: it reflects uncertainty about the data-generating process, not just the sampling.

These are not just philosophical niceties. They determine what your confidence intervals actually mean and when they have correct coverage.
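The repeated-sampling meaning of a design-based interval can be demonstrated directly. In the sketch below (population and sample sizes are illustrative), the population is generated once and then treated as fixed; only the sampling is repeated, and the empirical coverage of the standard SRS interval is counted.

```python
# Design-based coverage: repeat the SAMPLING from one fixed population
# and count how often the 95% interval contains the fixed mean.
import numpy as np

rng = np.random.default_rng(2)
N, n = 5_000, 200
y = rng.gamma(shape=2.0, scale=10.0, size=N)  # generated once, then fixed
Ybar = y.mean()                               # fixed, non-random target

reps, cover = 2_000, 0
for _ in range(reps):
    s = rng.choice(N, size=n, replace=False)
    m, sd = y[s].mean(), y[s].std(ddof=1)
    se = sd / np.sqrt(n) * np.sqrt(1 - n / N)  # SRS standard error with fpc
    if m - 1.96 * se <= Ybar <= m + 1.96 * se:
        cover += 1

print("empirical coverage:", cover / reps)     # should be near 0.95
```

No model for $y$ enters the interval; the 95% is a property of the sampling procedure alone.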

When to Use Which

| Situation | Recommended Approach |
| --- | --- |
| Official statistics, legal requirements | Design-based or model-assisted |
| Probability sample, large domains | Design-based (possibly model-assisted) |
| Small areas, rare subpopulations | Model-based (SAE) |
| Non-probability sample (web panel, convenience) | Model-based (with caution) |
| ML model training on survey data | Use survey weights (design-based perspective) |

Common Confusions

Watch Out

Design-based inference does not mean you cannot use models

Model-assisted inference uses regression models to improve efficiency while retaining design-based validity. The distinction is about what justifies the inference (the design vs. the model), not about whether models appear in the estimation procedure.

Watch Out

ML prediction is not the same as survey estimation

In ML, you fit a model to predict $y$ from $x$ and evaluate on held-out data. The goal is prediction. In survey estimation, the goal is to estimate a population quantity (a mean, a total). Prediction error and estimation error are different things. A good predictive model can still produce biased population estimates if used without survey weights.

Watch Out

Ignoring survey weights in ML does not always help prediction

Some ML practitioners discard survey weights because they think weights only matter for inference, not prediction. This is wrong when the sampling design is informative: when the probability of inclusion depends on $y$. In that case, the unweighted sample overrepresents certain $y$ values, and a model trained without weights learns a distorted relationship.

Summary

  • Design-based: randomness from sampling, population values are fixed constants
  • Model-based: randomness from the data-generating model, population values are random
  • Design-based is robust but can be inefficient; model-based is efficient but model-dependent
  • Model-assisted = use models for efficiency, evaluate properties under the design
  • The GREG estimator is the standard model-assisted estimator
  • HT estimator is design-consistent for any population under regularity conditions
  • Use survey weights when training ML models on survey data

Exercises

ExerciseCore

Problem

A population of $N = 4$ has values $y_1 = 10, y_2 = 20, y_3 = 30, y_4 = 40$. You take an SRS of $n = 2$. List all possible samples, compute $\hat{\bar{Y}}_{\text{HT}}$ for each, and verify that the average over all possible samples equals $\bar{Y}$.

ExerciseAdvanced

Problem

A researcher trains a logistic regression on CPS microdata to predict labor force participation. They do not use the survey weights. A survey statistician says the model is "wrong." The ML researcher says prediction accuracy on held-out CPS data is 89%. Who is right, and why might they both have a point?

References

Canonical:

  • Särndal, Swensson, and Wretman, Model Assisted Survey Sampling (1992), Springer. Chapters 1-2 (design-based foundations), 6-7 (GREG and calibration), 12 (model-dependent inference).
  • Cochran, Sampling Techniques (1977), 3rd ed., Wiley. Chapters 1-2 (design-based basics), 5 (ratio and regression estimators).
  • Lohr, Sampling: Design and Analysis (2019), 3rd ed., Chapman and Hall. Chapters 2-3 (SRS and stratified), 11 (model-assisted and model-based).
  • Thompson, Sampling (2012), 3rd ed., Wiley. Chapters 2-6 (design-based inference), 18 (model-based inference).

Current:

  • Lumley, Complex Surveys: A Guide to Analysis Using R (2010), Chapters 1-2.
  • Little and Rubin, Statistical Analysis with Missing Data (2019), 3rd ed., Wiley. Chapters 1-3, 12 (model-based framing of nonresponse and imputation).
  • Pfeffermann, "Modelling of Complex Survey Data: Why Model? How Model?" (2011), Survey Methodology. Canonical treatment of when and how to use models with complex survey designs.
  • Pfeffermann, "The Role of Sampling Weights when Modeling Survey Data" (1993), International Statistical Review 61(2), 317-337. Foundational on weights in model-based analysis.
  • Pfeffermann and Sverchkov, "Small-Area Estimation Under Informative Probability Sampling of Areas and Within the Selected Areas" (2007), JASA. Formalizes the sample distribution under informative sampling.
  • Buelens et al., "Comparing Inference Methods for Non-Probability Samples" (2018), International Statistical Review.


Last reviewed: April 18, 2026
