
Statistical Foundations

Design-Based vs. Model-Based Inference

Two philosophies of statistical inference from survey data: design-based inference, where randomness comes from the sampling design, and model-based inference, where randomness comes from a statistical model, together with the model-assisted hybrid that combines them.


Why This Matters

Survey statisticians and ML researchers think about inference differently, and this difference leads to real mistakes when the two communities interact. Survey statisticians typically use design-based inference: the randomness comes from the sampling design, not from a model. ML researchers typically use model-based inference: they assume a data-generating process and fit a model.

Neither approach is universally correct. Design-based inference is robust (it requires no model assumptions) but can be inefficient. Model-based inference is efficient but depends on the model being correct. Understanding both is necessary for anyone who works with survey data or builds models using survey-derived datasets.

Mental Model

Imagine a finite population of $N$ people. Each person has a fixed, non-random value $y_i$ (their income, their height, their vote). There is no randomness in the population itself.

Design-based view: the only randomness is in which people we select for the sample. The sampling design defines a probability distribution over all possible samples. Inference is about the properties of estimators under this distribution.

Model-based view: the population values $y_1, \ldots, y_N$ are themselves random draws from a probability model $y_i \sim f(y \mid x_i, \theta)$. Inference is about the parameters $\theta$ or about predictions from the model.

These are structurally different conceptions of what "random" means in the problem.

Design-Based Inference

Definition

Design-Based Inference

In design-based (also called randomization-based) inference:

  1. The population values $y_1, \ldots, y_N$ are fixed, unknown constants
  2. The only source of randomness is the sampling mechanism $p(S)$
  3. Expectations, variances, and probabilities are computed over repeated sampling from the same fixed population
  4. The target of inference is a finite population quantity: $\bar{Y} = \frac{1}{N}\sum_{i=1}^N y_i$ or $T = \sum_{i=1}^N y_i$

No distributional assumptions are needed. The validity of inference depends entirely on the sampling design.

The workhorse estimator is the Horvitz-Thompson estimator:

$$\hat{T}_{\text{HT}} = \sum_{i \in S} \frac{y_i}{\pi_i}$$

Its unbiasedness $\mathbb{E}_p[\hat{T}_{\text{HT}}] = T$ holds for any set of population values $y_1, \ldots, y_N$, because the expectation is over the sampling design $p$, not over any model for $y$.
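The design-unbiasedness claim can be checked exhaustively for a small population. A minimal sketch (the population values and sample size are illustrative, not from the text): enumerate every SRS of size $n$, compute the HT estimate for each possible sample, and average over the sampling distribution.

```python
# Verify design-unbiasedness of the Horvitz-Thompson estimator by
# enumerating every SRS of size n from a small fixed population.
from itertools import combinations

y = [3.0, 7.0, 12.0, 18.0, 25.0]    # fixed population values (illustrative)
N, n = len(y), 2
pi = n / N                          # SRS: every unit has inclusion prob n/N

T = sum(y)                          # the fixed, non-random target
samples = list(combinations(range(N), n))           # all C(5, 2) = 10 samples
ht = [sum(y[i] / pi for i in s) for s in samples]   # HT estimate per sample

# Design expectation = average over the equally likely samples
print(sum(ht) / len(ht), "vs true total", T)        # 65.0 vs true total 65.0
```

No property of the $y$ values is used anywhere: the average equals $T$ because each unit appears in exactly a $\pi_i$ fraction of samples.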

Strengths of Design-Based Inference

  • Robustness: no model assumptions. If the design is a probability sample, the HT estimator is unbiased regardless of the true relationship between $y$ and $x$.
  • Transparency: the properties of estimators depend only on the known sampling design.
  • Legal standing: official statistics agencies use design-based methods because their validity does not depend on contestable modeling choices.

Weaknesses of Design-Based Inference

  • Inefficiency: the HT estimator ignores auxiliary information (covariates) that could improve precision.
  • Requires probability sampling: if the sample is not a probability sample, design-based inference is undefined.
  • Small domains: design-based estimates for small subpopulations have large variance because they use only the data from that subpopulation.

Model-Based Inference

Definition

Model-Based (Superpopulation) Inference

In model-based inference:

  1. The population values $y_1, \ldots, y_N$ are realizations of random variables from a model (the "superpopulation"): $y_i \sim f(y \mid x_i, \theta)$
  2. Inference is about the model parameters $\theta$ or about predictions $\hat{y}_i$ for unobserved units
  3. Expectations and variances are computed under the model, not under the sampling design
  4. The sampling design is relevant only insofar as it creates selection bias under the model

Under the model, the population mean $\bar{Y}$ is itself a random variable. The target might be $\mathbb{E}_\xi[\bar{Y}]$ (its expectation under the model $\xi$) or $\theta$ itself.

Strengths of Model-Based Inference

  • Efficiency: uses auxiliary information (covariates) to improve estimates.
  • Small domains: can produce estimates for any subpopulation, even those with zero sample size, by extrapolating from the model.
  • Non-probability samples: can (in principle) be used with any sample, as long as the model correctly specifies the relationship between $y$ and $x$.

Weaknesses of Model-Based Inference

  • Model dependence: if the model is wrong, the estimator can be biased. There is no "free lunch" guarantee like the HT estimator's design-unbiasedness.
  • Difficult to verify: model diagnostics help but cannot definitively confirm the model is correct.
  • Controversial in official statistics: agencies are reluctant to publish estimates that depend on modeling choices that could be disputed.

Informative Sampling

Sampling is informative if the sample distribution of the outcome differs from the population distribution given the covariates. Formally, if $f_s(y \mid x)$ denotes the density of $y$ in the sample and $f(y \mid x)$ the population density, sampling is informative when $f_s(y \mid x) \neq f(y \mid x)$. Equivalently, the inclusion probability $\pi_i$ depends on $y_i$ even after conditioning on $x_i$.

Under informative sampling, unweighted model fits are biased for the population parameter $\theta$. The model learns $f_s(y \mid x)$ rather than $f(y \mid x)$. Two remedies restore consistency:

  1. Design-based weighting: use pseudo-maximum likelihood with survey weights $w_i = 1/\pi_i$, targeting the population score equation rather than the sample one.
  2. Full-information likelihood: model $f_s(y \mid x)$ explicitly as a function of $f(y \mid x)$ and the conditional inclusion mechanism, then recover $\theta$ by inverting the relationship.

Pfeffermann and Sverchkov (2007) formalize the sample distribution under informative sampling and derive its relationship to the population distribution through the conditional expectation $\mathbb{E}[\pi_i \mid y_i, x_i]$. When this conditional expectation depends on $y_i$, sampling is informative and naive likelihood methods fail.

This is the main reason ML practitioners cannot discard survey weights when training on survey microdata: if the design is informative, the unweighted sample yields a biased estimate of the population conditional distribution.
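The bias from discarding weights can be seen in a small simulation. The sketch below is illustrative (the population size, superpopulation model, and inclusion rule are all assumptions made for the example): it draws a Poisson sample whose inclusion probabilities depend on $y$, then compares unweighted OLS with the pseudo-maximum-likelihood fit weighted by $1/\pi_i$.

```python
# Weighted vs. unweighted regression under an informative Poisson design.
import numpy as np

rng = np.random.default_rng(0)
N = 50_000
x = rng.normal(size=N)
y = 2.0 + 3.0 * x + rng.normal(size=N)   # superpopulation: y | x ~ N(2 + 3x, 1)

# Informative design: inclusion probability depends on y itself,
# not only on x, so f_s(y | x) differs from f(y | x).
pi = np.clip(1.0 / (1.0 + np.exp(-(y - 2.0))), 0.02, 0.98)
S = rng.random(N) < pi                   # Poisson sampling
xs, ys, w = x[S], y[S], 1.0 / pi[S]

X = np.column_stack([np.ones(xs.size), xs])

# Unweighted OLS learns the sample distribution f_s(y | x): biased for (2, 3)
b_unw = np.linalg.lstsq(X, ys, rcond=None)[0]

# Pseudo-maximum likelihood: weight each unit's contribution by 1/pi_i
sw = np.sqrt(w)
b_wt = np.linalg.lstsq(X * sw[:, None], ys * sw, rcond=None)[0]

print("unweighted:", b_unw)
print("weighted:  ", b_wt)   # close to the population (2, 3)
```

The weighted fit targets the population score equation, so it recovers the superpopulation coefficients; the unweighted fit does not, no matter how large the sample.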

Model-Assisted Inference

Definition

Model-Assisted Inference

Model-assisted inference uses a model to improve efficiency while retaining the design-based validity guarantee. The idea: use the model to construct a good estimator, but evaluate its properties (bias, variance) under the sampling design, not the model.

The generalized regression estimator (GREG) is the standard model-assisted estimator:

$$\hat{T}_{\text{GREG}} = \sum_{i \in S} \frac{y_i}{\pi_i} + \left(\sum_{i=1}^N x_i - \sum_{i \in S} \frac{x_i}{\pi_i}\right)^T \hat{\beta}$$

where $\hat{\beta}$ is a regression coefficient vector estimated from the sample. This is the HT estimator plus a correction term that exploits the known population totals of the auxiliary variables $x$.

The GREG estimator is approximately design-unbiased regardless of whether the regression model is correct. If the model fits well, its variance shrinks because it depends on the residuals $y_i - x_i^T \hat{\beta}$ rather than on the raw $y_i$. If the model fits poorly, little efficiency is gained, but the design-based validity guarantee still holds.

Calibration estimators generalize GREG. Calibration adjusts the design weights $1/\pi_i$ to new weights $w_i$ that (a) stay close to $1/\pi_i$ under a chosen distance and (b) reproduce known population totals of auxiliary variables exactly: $\sum_{i \in S} w_i x_i = \sum_{i=1}^N x_i$. GREG corresponds to calibration under a quadratic distance. Deville and Särndal (1992) and Särndal, Swensson, and Wretman (1992, Chapter 6) give the general framework.
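A Monte Carlo sketch of the efficiency gain (all population parameters are illustrative assumptions): repeatedly draw an SRS from one fixed population in which $y$ tracks an auxiliary $x$, and compare the spread of the HT and GREG estimates of the total.

```python
# HT vs. GREG over repeated SRS draws from a single fixed population.
import numpy as np

rng = np.random.default_rng(1)
N, n = 2_000, 100
x = rng.uniform(1, 10, size=N)
y = 5.0 * x + rng.normal(scale=2.0, size=N)  # fixed population, y related to x

T = y.sum()          # true population total (a fixed number)
Tx = x.sum()         # known population total of the auxiliary variable
pi = n / N           # SRS inclusion probability

ht, greg = [], []
for _ in range(2_000):
    s = rng.choice(N, size=n, replace=False)
    xs, ys = x[s], y[s]
    t_ht = ys.sum() / pi                      # Horvitz-Thompson
    X = np.column_stack([np.ones(n), xs])
    beta = np.linalg.lstsq(X, ys, rcond=None)[0]
    # GREG = HT + (known totals - HT-estimated totals of auxiliaries) . beta
    corr = (np.array([N, Tx]) - np.array([n, xs.sum()]) / pi) @ beta
    ht.append(t_ht)
    greg.append(t_ht + corr)

print("HT   sd:", np.std(ht))
print("GREG sd:", np.std(greg))   # much smaller when y tracks x
```

Both estimators center on the same fixed $T$; GREG simply replaces the spread of $y$ with the spread of the regression residuals.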

Main Theorems

Theorem

Design Consistency of the Horvitz-Thompson Estimator

Statement

Let $\hat{\bar{Y}}_{\text{HT}} = \frac{1}{N}\sum_{i \in S} y_i/\pi_i$ be the HT estimator of the population mean $\bar{Y}$. Under a sequence of finite populations and sampling designs in which $n/N \to f \in (0,1)$ and regularity conditions on the inclusion probabilities hold:

$$\hat{\bar{Y}}_{\text{HT}} \xrightarrow{p_{\text{design}}} \bar{Y}$$

and

$$\frac{\hat{\bar{Y}}_{\text{HT}} - \bar{Y}}{\sqrt{\text{Var}_p(\hat{\bar{Y}}_{\text{HT}})}} \xrightarrow{d} N(0,1)$$

The convergence is in probability and in distribution under the sampling design $p$, for any fixed set of population values $y_1, \ldots, y_N$.

Intuition

As the sample gets large relative to the population, the HT estimator converges to the population mean and is approximately normal. This is a law of large numbers and central limit theorem under the randomization distribution, not under any model. The result holds for any population values because the randomness is in the sampling, not the data.

Proof Sketch

Write $\hat{\bar{Y}}_{\text{HT}} - \bar{Y} = \frac{1}{N}\sum_{i=1}^N (Z_i/\pi_i - 1)y_i$ where $Z_i = \mathbf{1}(i \in S)$. Under regularity conditions on the design (Hájek conditions), this is a sum of weakly dependent mean-zero random variables. A CLT for finite-population sampling (Hájek, 1960) gives asymptotic normality.

Why It Matters

This theorem is the theoretical backbone of design-based survey inference. It guarantees that HT-based confidence intervals have correct coverage in large samples, with no model assumptions. This is why statistical agencies trust design-based methods: the theory requires only a probability sampling design and regularity conditions.

Failure Mode

The convergence requires the sample size to grow. For small samples (small areas, rare domains), the normal approximation is poor and confidence intervals may have incorrect coverage. The regularity conditions on inclusion probabilities can fail in highly unequal probability designs where some $\pi_i$ are very small; there the HT estimator has high variance and the normal approximation degrades.

The Key Philosophical Difference

In design-based inference, the population quantity $\bar{Y}$ is a fixed number. There is nothing random about it. The probability statement "95% confidence interval" means: if we repeated the sampling procedure many times from the same population, 95% of the resulting intervals would contain $\bar{Y}$.

In model-based inference, $\bar{Y}$ is a random variable. A "95% credible interval" (Bayesian) or "95% prediction interval" means something different: it reflects uncertainty about the data-generating process, not just the sampling.

These are not just philosophical niceties. They determine what your confidence intervals actually mean and when they have correct coverage.
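The repeated-sampling meaning of a design-based interval can be demonstrated directly. In the sketch below (population and sample sizes are illustrative), the population is generated once and then treated as fixed; only the sampling is repeated, and the empirical coverage of the standard SRS interval is counted.

```python
# Design-based coverage: repeat the SAMPLING from one fixed population
# and count how often the 95% interval contains the fixed mean.
import numpy as np

rng = np.random.default_rng(2)
N, n = 5_000, 200
y = rng.gamma(shape=2.0, scale=10.0, size=N)  # generated once, then fixed
Ybar = y.mean()                               # fixed, non-random target

reps, cover = 2_000, 0
for _ in range(reps):
    s = rng.choice(N, size=n, replace=False)
    m, sd = y[s].mean(), y[s].std(ddof=1)
    se = sd / np.sqrt(n) * np.sqrt(1 - n / N)  # SRS standard error with fpc
    if m - 1.96 * se <= Ybar <= m + 1.96 * se:
        cover += 1

print("empirical coverage:", cover / reps)     # should be near 0.95
```

No model for $y$ enters the interval; the 95% is a property of the sampling procedure alone.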

When to Use Which

| Situation | Recommended Approach |
| --- | --- |
| Official statistics, legal requirements | Design-based or model-assisted |
| Probability sample, large domains | Design-based (possibly model-assisted) |
| Small areas, rare subpopulations | Model-based (SAE) |
| Non-probability sample (web panel, convenience) | Model-based (with caution) |
| ML model training on survey data | Use survey weights (design-based perspective) |

Common Confusions

Watch Out

Design-based inference does not mean you cannot use models

Model-assisted inference uses regression models to improve efficiency while retaining design-based validity. The distinction is about what justifies the inference (the design vs. the model), not about whether models appear in the estimation procedure.

Watch Out

ML prediction is not the same as survey estimation

In ML, you fit a model to predict $y$ from $x$ and evaluate on held-out data. The goal is prediction. In survey estimation, the goal is to estimate a population quantity (a mean, a total). Prediction error and estimation error are different things. A good predictive model can still produce biased population estimates if used without survey weights.

Watch Out

Ignoring survey weights in ML does not always help prediction

Some ML practitioners discard survey weights because they think weights only matter for inference, not prediction. This is wrong when the sampling design is informative: when the probability of inclusion depends on $y$. In that case, the unweighted sample overrepresents certain $y$ values, and a model trained without weights learns a distorted relationship.

Summary

  • Design-based: randomness from sampling, population values are fixed constants
  • Model-based: randomness from the data-generating model, population values are random
  • Design-based is robust but can be inefficient; model-based is efficient but model-dependent
  • Model-assisted = use models for efficiency, evaluate properties under the design
  • The GREG estimator is the standard model-assisted estimator
  • HT estimator is design-consistent for any population under regularity conditions
  • Use survey weights when training ML models on survey data

Exercises

ExerciseCore

Problem

A population of $N = 4$ has values $y_1 = 10, y_2 = 20, y_3 = 30, y_4 = 40$. You take an SRS of $n = 2$. List all possible samples, compute $\hat{\bar{Y}}_{\text{HT}}$ for each, and verify that the average over all possible samples equals $\bar{Y}$.

ExerciseAdvanced

Problem

A researcher trains a logistic regression on CPS microdata to predict labor force participation. They do not use the survey weights. A survey statistician says the model is "wrong." The ML researcher says prediction accuracy on held-out CPS data is 89%. Who is right, and why might they both have a point?

References

Canonical:

  • Särndal, Swensson, and Wretman, Model Assisted Survey Sampling (1992), Springer. Chapters 1-2 (design-based foundations), 6-7 (GREG and calibration), 12 (model-dependent inference).
  • Cochran, Sampling Techniques (1977), 3rd ed., Wiley. Chapters 1-2 (design-based basics), 5 (ratio and regression estimators).
  • Lohr, Sampling: Design and Analysis (2019), 3rd ed., Chapman and Hall. Chapters 2-3 (SRS and stratified), 11 (model-assisted and model-based).
  • Thompson, Sampling (2012), 3rd ed., Wiley. Chapters 2-6 (design-based inference), 18 (model-based inference).

Current:

  • Lumley, Complex Surveys: A Guide to Analysis Using R (2010), Chapters 1-2.
  • Little and Rubin, Statistical Analysis with Missing Data (2019), 3rd ed., Wiley. Chapters 1-3, 12 (model-based framing of nonresponse and imputation).
  • Pfeffermann, "Modelling of Complex Survey Data: Why Model? How Model?" (2011), Survey Methodology. Canonical treatment of when and how to use models with complex survey designs.
  • Pfeffermann, "The Role of Sampling Weights when Modeling Survey Data" (1993), International Statistical Review 61(2), 317-337. Foundational on weights in model-based analysis.
  • Pfeffermann and Sverchkov, "Small-Area Estimation Under Informative Probability Sampling of Areas and Within the Selected Areas" (2007), JASA. Formalizes the sample distribution under informative sampling.
  • Buelens et al., "Comparing Inference Methods for Non-Probability Samples" (2018), International Statistical Review.


Last reviewed: April 18, 2026
