
Weighted Conformal Prediction Under Covariate Shift

The extension of split conformal prediction that restores finite-sample coverage when the test distribution differs from the training distribution, at the cost of knowing or estimating the likelihood ratio between them.


Why This Matters

Split conformal prediction rests on exchangeability. Exchangeability fails the moment the test distribution differs from the training distribution, which happens automatically when models are trained six months before deployment or evaluated on a different population than they were fitted on. Under covariate shift the nominal $1 - \alpha$ coverage silently degrades, often substantially, and the usual diagnostic (hold out a clean test set) is unavailable because the clean test set is itself drawn from the training distribution.

Weighted conformal prediction recovers validity by reweighting the calibration points using the likelihood ratio between the two distributions. The construction is due to Tibshirani, Barber, Candès, Ramdas (2019) and is the single most-cited extension of split conformal. The same machinery turns out to give prediction intervals for individual treatment effects, connecting conformal prediction to the causal-inference literature through propensity scores.

Which Conformal Variant Fits Which Setting

Conformal prediction is not a single recipe; it is a family of recipes keyed to which assumption you can defend. This page is the covariate shift branch. The neighbours and what each can and cannot do:

| Setting | What changes between train and test | Which method gives valid coverage | What you must know |
| --- | --- | --- | --- |
| Exchangeable / IID | Nothing (the joint $P_{X,Y}$ is fixed) | Split conformal | Nothing beyond the data itself |
| Covariate shift | Feature marginal $P_X \to Q_X$; conditional $P_{Y\mid X}$ unchanged | Weighted conformal (this page) | The likelihood ratio $w(x) = \mathrm{d}Q_X/\mathrm{d}P_X$, exactly or via density-ratio estimation |
| Label shift | $P(Y) \to Q(Y)$; class-conditional $P(X\mid Y)$ unchanged | Label-shifted conformal (Podkopaev–Ramdas 2021) | Class proportion shift estimated on unlabeled test data |
| Streaming / time series | Distribution drifts over time | Adaptive Conformal Inference (ACI; Gibbs–Candès 2021) | A target miscoverage rate; the algorithm tracks the empirical rate online |
| Anytime-valid | Stopping time chosen after seeing the data | E-values and anytime-valid inference | A scoring rule that produces a non-negative martingale |
| Adversarial drift / concept drift | Both $P_X$ and $P_{Y\mid X}$ change, possibly adversarially | No distribution-free coverage guarantee in full generality | Use ACI + a domain-specific shift detector, or accept that intervals lose meaning |

The ordering is roughly increasing assumption strength on the operator side: split conformal asks for nothing, weighted conformal asks for the likelihood ratio, label-shifted conformal asks for class proportions, ACI asks for an online learning rate, anytime-valid asks for a martingale construction. If the assumption you can defend is weaker than the assumption your method needs, the coverage guarantee silently fails and the dashboard does not warn you.

Formal Setup

Training data $(X_i, Y_i)_{i=1}^{n}$ are drawn from a joint distribution with feature marginal $P_X$. The test point $(X_{n+1}, Y_{n+1})$ has feature marginal $Q_X$, possibly different from $P_X$, but the conditional law $Y \mid X$ is assumed the same across both distributions. This is the covariate shift setting; label shift and concept drift are handled separately and are strictly harder.

Define the likelihood ratio

$$w(x) = \frac{\mathrm{d} Q_X}{\mathrm{d} P_X}(x),$$

the Radon–Nikodym derivative of $Q_X$ with respect to $P_X$. We assume $Q_X \ll P_X$ so the ratio exists; where this fails the problem is ill-posed and no distribution-free correction is possible.

Why Split Conformal Fails

Under exchangeability the test score's rank among the calibration scores is uniform on $\{1, \ldots, n+1\}$. Under covariate shift this uniformity breaks: test points land disproportionately in regions that are more likely under $Q_X$ than under $P_X$, so the test score tends to fall at higher ranks than a calibration score would. Split conformal, which assumes a uniform rank, undercovers in the tails where the shift is largest.
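The degradation is easy to see in simulation. The sketch below uses an assumed setup, not anything from the paper: heteroscedastic scores whose scale grows with $|x|$, calibration features from $N(0,1)$, test features from the shifted marginal $N(2,1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n, trials = 0.1, 500, 2000

covered = 0
for _ in range(trials):
    # Calibration: X ~ P_X = N(0,1); score = |residual| with noise scale
    # 1 + |x|, so scores are stochastically larger where |x| is larger.
    x_cal = rng.normal(0.0, 1.0, n)
    s_cal = np.abs(rng.normal(0.0, 1.0 + np.abs(x_cal)))
    # Split conformal cutoff: the ceil((1 - alpha)(n + 1))-th smallest score.
    k = int(np.ceil((1 - alpha) * (n + 1)))
    q = np.sort(s_cal)[k - 1]
    # Test point: X ~ Q_X = N(2,1), same conditional law of the score given x.
    x_test = rng.normal(2.0, 1.0)
    s_test = abs(rng.normal(0.0, 1.0 + abs(x_test)))
    covered += s_test <= q
print(f"empirical coverage {covered / trials:.3f} vs nominal {1 - alpha:.2f}")
```

Coverage lands well below the nominal 0.90, because the shifted test features concentrate exactly where scores are large.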

Weighted Exchangeability

Definition

Weighted Exchangeability

Random variables $Z_1, \ldots, Z_{n+1}$ are weighted exchangeable with weight function $w$ if the joint density factors as

$$f(z_1, \ldots, z_{n+1}) = \frac{1}{Z_n} \prod_{i=1}^{n} g(z_i) \cdot w(z_{n+1}) g(z_{n+1}),$$

for some base density $g$ and normalizing constant $Z_n$. Equivalently, the first $n$ points are i.i.d. from $P$ and the last point is drawn from $Q$ with density $w \cdot g$.

The key observation is that weighted exchangeability preserves a weighted version of rank uniformity, which is exactly what we need for a coverage argument.

The Weighted Quantile Construction

Compute calibration scores $s_i = s(X_i, Y_i)$ as before. Define normalized weights on the calibration and test points:

$$p_i(X_{n+1}) = \frac{w(X_i)}{\sum_{j=1}^{n} w(X_j) + w(X_{n+1})}, \qquad p_{n+1}(X_{n+1}) = \frac{w(X_{n+1})}{\sum_{j=1}^{n} w(X_j) + w(X_{n+1})}.$$

The test-point weight appearing in the denominator is the subtle point most implementations get wrong. The prediction set is

$$\hat{C}(X_{n+1}) = \bigl\{y : s(X_{n+1}, y) \leq \hat{q}_w(X_{n+1})\bigr\},$$

where $\hat{q}_w(X_{n+1})$ is the level-$(1-\alpha)$ quantile of the distribution that puts mass $p_i(X_{n+1})$ on score $s_i$ and mass $p_{n+1}(X_{n+1})$ on $+\infty$ (so the test point always has a chance to fail).
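The whole construction fits in a dozen lines of NumPy. A minimal sketch, assuming an absolute-residual score; the function names are illustrative, not from any library:

```python
import numpy as np

def weighted_quantile(scores, weights, w_test, alpha):
    """Level-(1 - alpha) quantile of the distribution putting mass
    p_i on each calibration score s_i and mass p_{n+1} on +infinity.

    scores  : (n,) calibration scores s_i
    weights : (n,) likelihood ratios w(X_i) at the calibration points
    w_test  : scalar likelihood ratio w(X_{n+1}) at the test point
    """
    total = weights.sum() + w_test           # test weight in the denominator
    p = weights / total                      # p_i(X_{n+1}); sums to < 1
    order = np.argsort(scores)
    cum = np.cumsum(p[order])
    idx = np.searchsorted(cum, 1 - alpha)
    if idx >= len(scores):
        # Calibration mass never reaches 1 - alpha: the +infinity atom
        # (mass p_{n+1}) is hit and the prediction set is everything.
        return np.inf
    return scores[order][idx]

def prediction_interval(mu, scores, weights, w_test, alpha):
    """Interval for the score s(x, y) = |y - mu(x)|, given mu = mu(x_test)."""
    q = weighted_quantile(scores, weights, w_test, alpha)
    return mu - q, mu + q
```

With $w \equiv 1$ this reduces to split conformal: both the $n+1$ denominator and the ceiling in the usual quantile level come from the test point's own atom.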

Main Theorem

Theorem

Weighted Conformal Coverage

Statement

Let $\hat{C}(X_{n+1})$ be the weighted split conformal prediction set constructed above. Then under the joint law where calibration points are drawn from $P$ and the test point from $Q$,

$$\mathbb{P}_{Q}\bigl(Y_{n+1} \in \hat{C}(X_{n+1})\bigr) \geq 1 - \alpha.$$

Intuition

Reweighting by $w$ converts the non-exchangeable $(P, P, \ldots, P, Q)$ setup into a weighted-exchangeable one. The weighted-exchangeable rank of the test score is uniform on a probability-weighted index set, and the weighted quantile catches it at the usual level. The proof is a reduction to ordinary exchangeability on an enlarged probability space.

Proof Sketch

The argument from Tibshirani, Barber, Candès, Ramdas (2019) reduces the covariate-shift coverage problem to a weighted analogue of the ordinary-exchangeability rank argument used in split conformal.

Step 1: write the joint density. The calibration data $(X_1, Y_1), \ldots, (X_n, Y_n)$ are i.i.d. from $P$ (the training distribution); the test point $(X_{n+1}, Y_{n+1})$ is drawn from $Q$ (the deployment distribution), with the same conditional $Y \mid X$ as $P$. Under the assumption $Q_X \ll P_X$ with likelihood ratio $w(x) = \mathrm{d}Q_X / \mathrm{d}P_X(x)$, the joint density of $(Z_1, \ldots, Z_n, Z_{n+1}) = ((X_1, Y_1), \ldots, (X_{n+1}, Y_{n+1}))$ relative to the all-from-$P$ measure $P^{n+1}$ is

$$\frac{d\bigl(P^n \times Q\bigr)}{d P^{n+1}}(z_1, \ldots, z_{n+1}) = w(x_{n+1}).$$

That is: under the mixed law, the joint distribution of $(Z_1, \ldots, Z_{n+1})$ differs from the all-from-$P$ distribution by exactly the factor $w$ applied to the test point.

Step 2: weighted exchangeability of the augmented sample. For any permutation $\sigma$ of $\{1, \ldots, n+1\}$, the $\sigma$-permuted joint distribution puts the factor $w$ on position $\sigma^{-1}(n+1)$ instead of position $n+1$. By the factorization above, conditional on the unordered multiset $\{Z_1, \ldots, Z_{n+1}\}$ being any specific collection $\{a_1, \ldots, a_{n+1}\}$, the conditional probability that position $i$ holds the test-distribution point $(X_{n+1}, Y_{n+1})$ is proportional to the weight at the $X$-coordinate of position $i$:

$$\mathbb{P}(\text{test index} = i \mid \{Z_1, \ldots, Z_{n+1}\} = \{a_1, \ldots, a_{n+1}\}) = \frac{w(x_i)}{\sum_{j=1}^{n+1} w(x_j)}.$$

This is the weighted exchangeability property: the conditional distribution of which index is the test point is determined by the normalized weights, not by the ordering of the augmented sample.

Step 3: apply to the rank of the test score. Define $s_i = s(X_i, Y_i)$ for $i = 1, \ldots, n+1$. Conditional on the multiset $\{s_1, \ldots, s_{n+1}\}$, the rank of $s_{n+1}$ (call it $R$) is determined by which index in the augmented sample is the test index. By Step 2, for each $k \in \{1, \ldots, n+1\}$,

$$\mathbb{P}\bigl(R = k \;\big|\; \{s_1, \ldots, s_{n+1}\}\bigr) = \frac{w(X_{(k)})}{\sum_{j=1}^{n+1} w(X_{(j)})},$$

where $X_{(k)}$ denotes the $X$-coordinate of the $k$-th-smallest score. The rank $R$ is therefore "weighted-uniform" over $\{1, \ldots, n+1\}$ with weights proportional to the likelihood ratios.
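The weighted-uniform rank claim can be checked numerically. A Monte Carlo sketch under an assumed shift $P = N(0,1)$, $Q = N(1,1)$, where the exact ratio is $w(x) = e^{x - 1/2}$, using the feature itself as the score; it compares the empirical rank frequencies to the averaged normalized weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 4, 200_000
w = lambda x: np.exp(x - 0.5)        # exact dN(1,1)/dN(0,1)

emp = np.zeros(n + 1)                # empirical P(R = k), k = 1..n+1
pred = np.zeros(n + 1)               # E[ w(X_(k)) / sum_j w(X_(j)) ]
for _ in range(reps):
    x = np.append(rng.normal(0.0, 1.0, n), rng.normal(1.0, 1.0))
    order = np.argsort(x)            # scores are the features themselves
    emp[np.argmax(order == n)] += 1  # rank position of the test point
    pred += w(x[order]) / w(x).sum()
emp /= reps
pred /= reps
print(np.round(emp, 3))
print(np.round(pred, 3))
```

The two printed vectors agree to Monte Carlo error, and both put most mass on the highest ranks, which is exactly why the unweighted quantile undercovers.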

Step 4: weighted-quantile cutoff catches the test rank. The weighted conformal prediction set is built so that

$$\hat{q}_w(X_{n+1}) = \inf\Bigl\{q : \sum_{i: s_i \leq q} p_i(X_{n+1}) + p_{n+1}(X_{n+1}) \cdot \mathbf{1}_{\{q = +\infty\}} \geq 1 - \alpha\Bigr\},$$

with the $+\infty$ atom carrying the test point's own weight $p_{n+1}$. Conditional on the unordered multiset and on $X_{n+1}$ (which determines the weights), the probability that $s_{n+1}$ falls at or below $\hat{q}_w$ is the cumulative weighted probability up to $\hat{q}_w$, which by construction is at least $1 - \alpha$.

Marginalizing over the multiset and over $X_{n+1}$:

$$\mathbb{P}_Q\bigl(Y_{n+1} \in \hat{C}(X_{n+1})\bigr) = \mathbb{P}_Q\bigl(s_{n+1} \leq \hat{q}_w(X_{n+1})\bigr) \geq 1 - \alpha.$$

The full proof, including the careful measure-theoretic treatment of conditioning on the unordered multiset and the handling of ties, is in Tibshirani, Barber, Candès, Ramdas (2019), Conformal Prediction Under Covariate Shift, NeurIPS 2019, Section 3 and Lemma 1.

The argument is a clean analogue of the rank-uniformity argument that drives split conformal: in the unweighted case, the conditional distribution of the test rank is uniform on $\{1, \ldots, n+1\}$; in the weighted case, it is the obvious generalization, with probabilities proportional to the likelihood-ratio weights. The $+\infty$ atom in the weighted quantile is the analogue of the $+\infty$ atom in the split-conformal quantile: it makes the construction graceful when the desired weighted quantile lies beyond the largest observed score.

The proof is formalizable in Lean / Mathlib in principle (the required machinery is permutation symmetry + conditional probability + supremum-of-events arguments, all in scope for MeasureTheory); the formalization would be a substantial project comparable to Mathlib's existing entries on the exchangeability of i.i.d. sequences. No such wrapper exists at the time of writing.

Why It Matters

The guarantee is finite-sample, distribution-free over the conditional $Y \mid X$, and requires no assumption on the function class of the predictor or the dimension of the features. The only assumption beyond split conformal is access to the exact likelihood ratio $w$. This is plausible in some causal settings (randomized assignment with a known propensity model) and in some carefully controlled deployment settings; it is essentially never literally true in observational covariate-shift applications, where $w$ must be estimated.

Failure Mode

If $w$ is unknown and replaced by an estimate $\hat{w}$, the exact finite-sample guarantee no longer holds. Coverage bias is first-order in $\|\hat{w} - w\|$ for a naive plug-in; the actual coverage now depends on density-ratio estimation error, sample-splitting design, and any robustification scheme. Doubly robust variants (next section) reduce the bias to a product of two estimation errors, but the resulting guarantee becomes asymptotic / approximate, not the exact finite-sample bound stated above.
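In practice $\hat{w}$ usually comes from a discriminative classifier: label training features 0 and test features 1, fit a probabilistic classifier, and convert the odds into a density ratio via Bayes' rule. A minimal one-dimensional sketch with hand-rolled logistic regression; the function name and the quadratic feature map are illustrative assumptions, not a standard API:

```python
import numpy as np

def estimate_density_ratio(x_train, x_test, steps=2000, lr=0.1):
    """Classifier-based density-ratio estimation (one sketch among many).

    Logistic regression separating test (label 1) from train (label 0);
    Bayes' rule gives w(x) = dQ/dP(x) = (n0 / n1) * p(1|x) / p(0|x).
    """
    x = np.concatenate([x_train, x_test])
    y = np.concatenate([np.zeros(len(x_train)), np.ones(len(x_test))])
    feats = lambda v: np.column_stack([np.ones_like(v), v, v**2])
    A, beta = feats(x), np.zeros(3)
    for _ in range(steps):            # gradient ascent on the log-likelihood
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        beta += lr * A.T @ (y - p) / len(y)
    ratio = len(x_train) / len(x_test)   # corrects for class imbalance
    return lambda v: ratio * np.exp(feats(v) @ beta)
```

For $P = N(0,1)$ and $Q = N(1,1)$ the true ratio is $e^{x - 1/2}$; the quadratic feature also accommodates variance shifts. Estimation error in $\hat{w}$ feeds directly into the coverage bias discussed above.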

Doubly Robust Weighted Conformal

The finite-sample guarantee in the main theorem assumes $w$ is known exactly. Once $w$ must be estimated, the distribution-free finite-sample guarantee no longer holds at face value: coverage depends on density-ratio estimation error, sample-splitting design, and which robustification scheme you use. A plug-in procedure inherits first-order bias from the weight estimator.

Treating covariate shift as a missing-data problem gives a cleaner path when $w$ is unknown: fit a quantile regression $\hat{q}_Y(x)$ for $Y \mid X$ on the training distribution, fit a weight model $\hat{w}(x)$, and combine them in an AIPW-style construction of the conformal score. Coverage bias then becomes the product of the two estimation errors, analogous to the double-robustness property in double/debiased machine learning; the resulting guarantee, however, is asymptotic / approximate, not the same exact finite-sample guarantee given for known $w$. Yang, Kuchibhotla, and Tchetgen Tchetgen (JRSSB 2024) develop this construction using efficient-influence-function machinery; the missing-data framing makes the double robustness feel inevitable rather than a trick.

Connection to Causal Inference

Individual treatment effect (ITE) prediction is covariate shift in disguise. Under unconfoundedness, the conditional distribution of $Y(1) \mid X$ in the treated subpopulation equals the conditional distribution under the counterfactual "treat everyone" population, with covariate likelihood ratio proportional to $1/e(X)$ (the inverse propensity); symmetrically, the control conditional uses $1/(1 - e(X))$. Weighted conformal prediction with the corresponding weight produces valid prediction intervals for the potential outcomes $Y(1)$ and $Y(0)$ at $X_{n+1}$ separately; ITE intervals combine the two and inherit the finite-sample guarantee cleanly under randomized assignment / perfect compliance. In purely observational settings the same construction works under strong ignorability, but the guarantee usually becomes approximate or doubly robust depending on how the propensity is estimated; Lei and Candès (2021) make this randomized-versus-observational distinction explicit. This links the conformal literature directly to semiparametric causal inference.
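Concretely, the interval for $Y(1)$ is just weighted conformal with $w(x) \propto 1/e(x)$, calibrated on the treated units. A sketch assuming the propensity $e(\cdot)$ is known (the randomized-assignment case); `treated_outcome_interval` is an illustrative name, not an API from the paper:

```python
import numpy as np

def treated_outcome_interval(x_cal, y_cal, mu_hat, e, x_test, alpha=0.1):
    """Weighted conformal interval for Y(1) at x_test.

    x_cal, y_cal : calibration units that actually received treatment
    mu_hat       : callable point prediction for Y(1)
    e            : callable known propensity e(x) = P(T = 1 | X = x)
    The treated covariate density is proportional to p(x) e(x), the target
    density is p(x), so the likelihood ratio is proportional to 1 / e(x).
    """
    s = np.abs(y_cal - mu_hat(x_cal))        # absolute-residual scores
    w_cal, w_test = 1.0 / e(x_cal), 1.0 / e(np.atleast_1d(x_test))[0]
    total = w_cal.sum() + w_test             # test atom in the denominator
    order = np.argsort(s)
    cum = np.cumsum(w_cal[order]) / total
    idx = np.searchsorted(cum, 1 - alpha)
    q = np.inf if idx >= len(s) else s[order][idx]
    m = float(mu_hat(np.atleast_1d(x_test))[0])
    return m - q, m + q
```

With a constant propensity (a completely randomized trial) the weights cancel and this is exactly split conformal; an estimated $\hat{e}$ reinstates the approximation issues from the previous section.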

Sensitivity Analysis Under Unmeasured Confounding

Unconfoundedness is a strong assumption. Rosenbaum's marginal sensitivity model parameterizes hidden confounding by $\Gamma \geq 1$, interpreted as the worst-case ratio between the true and observed propensity odds. Under a $\Gamma$-bounded violation, Jin, Ren, Candès (2023) give robust weighted conformal intervals with coverage guarantees that degrade gracefully with $\Gamma$. This provides an honest answer to "what if unconfoundedness fails by a factor of $\Gamma$?"

Beyond Covariate Shift

The nonexchangeable framework of Barber, Candès, Ramdas, Tibshirani (2023) generalizes weighted conformal to arbitrary user-chosen weights with coverage guarantees stated as a function of the total variation distance between the assumed and true weighting. This opens the door to time series, spatial data, and label shift, each at the cost of a coverage gap that reflects the modelling mismatch.

Exercises

ExerciseCore

Problem

Three calibration points have features $X_i \in \mathbb{R}$, scores $s_1 = 1, s_2 = 2, s_3 = 3$, and known likelihood ratios $w(X_1) = 1, w(X_2) = 2, w(X_3) = 1$. A test point has $w(X_4) = 2$. Compute the weighted $0.75$-quantile used by weighted conformal at $\alpha = 0.25$.

ExerciseAdvanced

Problem

Show that when $w \equiv 1$ (no shift) the weighted conformal construction reduces exactly to the split conformal construction of the previous page, including the ceiling quantile level. Identify where the $+1$ in the denominator of split conformal comes from in the weighted derivation.

ExerciseResearch

Problem

State the Lei–Candès (2021) construction for ITE prediction intervals using weighted conformal with estimated propensity. Identify what happens to coverage when the propensity estimator converges at rate slower than $n^{-1/4}$.

Open Problems and Frontier

Weighted conformal under label shift rather than covariate shift is harder because $Y$ is unobserved at test time; current work uses label-conditional scores or EM-style re-estimation.

Combining weighted conformal with anytime-valid inference (in the style of Koning and van Meer, 2025) would give prediction sets that remain valid under sequential peeking and distribution drift simultaneously; the distribution-shift version is not yet fully written down.

Tight lower bounds on coverage loss as a function of density-ratio estimation error. Current guarantees are loose and there is room for sharp finite-sample rates.

Conditional coverage under covariate shift inherits the impossibility result from the split conformal page (Barber, Candès, Ramdas, Tibshirani 2021). Partial guarantees under smoothness or localizability are the current line of attack.

High-dimensional density-ratio estimation is the practical bottleneck. Methods based on discriminative classifiers, score matching, and kernel-based estimators trade off sample complexity against bias in ways that matter for the coverage product-rate.

References

Canonical:

  • Tibshirani, Barber, Candès, Ramdas, "Conformal Prediction Under Covariate Shift." NeurIPS 2019. The founding paper.
  • Lei, Candès, "Conformal Inference of Counterfactuals and Individual Treatment Effects." Journal of the Royal Statistical Society B 83(5) (2021), 911-938.
  • Barber, Candès, Ramdas, Tibshirani, "Conformal Prediction Beyond Exchangeability." Annals of Statistics 51(2) (2023), 816-845.

Robustness and sensitivity:

  • Jin, Ren, Candès, "Sensitivity Analysis of Individual Treatment Effects: A Robust Conformal Inference Approach." Proceedings of the National Academy of Sciences 120(6) (2023).
  • Yang, Kuchibhotla, Tchetgen Tchetgen, "Doubly Robust Calibration of Prediction Sets Under Covariate Shift." Journal of the Royal Statistical Society Series B (2024; arXiv:2203.01761). Authors are Yachong Yang, Arun Kumar Kuchibhotla, and Eric Tchetgen Tchetgen.

E-values and anytime-valid conformal:

  • Gauthier, Bach, Jordan, "E-Values Expand the Scope of Conformal Prediction" (2025; arXiv:2503.13050). Batch anytime-valid conformal constructions using e-values.
  • Koning and van Meer (2025), on fuzzy e-values and decision-theoretic conformal sets — connects e-values to fuzzy prediction sets and downstream decisions. Complementary to the Gauthier-Bach-Jordan construction. Locate via the authors' personal pages or Google Scholar; the arXiv ID is intentionally not asserted here.

Density-ratio estimation:

  • Sugiyama, Suzuki, Kanamori, Density Ratio Estimation in Machine Learning (Cambridge University Press, 2012). Chapters 3-5.

Background:

  • Angelopoulos, Bates, "A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification." Foundations and Trends in Machine Learning 16(4) (2023). Chapter 5.


Last reviewed: April 24, 2026
