
Sequential Inference

E-Values and Anytime-Valid Inference

A framework for hypothesis testing whose validity survives optional stopping, optional continuation, and data-dependent peeking. E-values are nonnegative random variables with expectation at most one under the null; Ville's inequality makes the guarantee anytime-valid.

Research · Tier 1 · Current · Core spine · ~60 min

Why This Matters

Every classical hypothesis test fixes the sample size in advance. The $p$-value guarantee, $\mathbb{P}_{H_0}(p \leq \alpha) \leq \alpha$, breaks the moment you peek at the data and stop early when the result looks good. This is the reason online A/B tests, adaptive clinical trials, and rolling LLM evaluations produce elevated false-positive rates in practice. The analyst follows the mathematics of a fixed-$n$ test but behaves as if they were running a sequential test, and the gap is filled with Type I errors.
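The inflation is easy to reproduce. The following Monte Carlo sketch (function name and look schedule are illustrative choices, not drawn from any named study) runs a two-sided z-test at five interim looks and stops at the first nominal $p \leq 0.05$:

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def peeking_type1(n_sims=2000, looks=(100, 200, 300, 400, 500), alpha=0.05):
    """Fraction of null datasets rejected when the analyst runs a two-sided
    z-test at several interim looks and stops at the first p <= alpha."""
    rejections = 0
    for _ in range(n_sims):
        x = rng.standard_normal(max(looks))      # H0 true: mean 0, variance 1
        for n in looks:
            z = x[:n].sum() / math.sqrt(n)
            p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
            if p <= alpha:
                rejections += 1
                break
    return rejections / n_sims
```

With five looks, the realized Type I error rate lands well above the nominal 0.05, even though every individual test is valid at its own fixed $n$.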

E-values and anytime-valid inference close the gap. An e-value is a nonnegative statistic whose expectation under the null is at most one. By Markov's inequality, $1/e$ acts like a $p$-value in the sense that $\mathbb{P}_{H_0}(e \geq 1/\alpha) \leq \alpha$. The key constructive fact is that a sequentially built product of conditional e-values (equivalently, the running product of likelihood ratios under a filtration $(\mathcal{F}_t)$) is a nonnegative martingale or supermartingale under the null, hence an e-process. This gives a test that is valid at every stopping time simultaneously, not just at the sample size you had originally in mind. Peeking becomes a feature, not a failure mode. Arbitrary products of marginal (unconditional) e-values across experiments are not in general e-values, because $\mathbb{E}[E_1 E_2] \neq \mathbb{E}[E_1]\,\mathbb{E}[E_2]$ without independence; combining evidence across experiments without conditional structure relies on arithmetic averaging or weighted means, which do preserve the e-value property.

The regulatory case is live, though it should not be overstated. The FDA's January 2026 draft guidance on Bayesian adaptive medical-device trials is not an e-value guidance document; it concerns prior elicitation, simulation-calibrated operating characteristics, and decision rules for adaptive Bayesian designs. But it is a useful regulatory signal: adaptive evidence accumulation, simulation-based calibration, and transparent stopping rules are becoming central enough that the FDA is writing formal guidance on adjacent ideas. Anytime-valid methods sit naturally in that family. The tech case has been live for a decade and is finally being taught.

How e-values differ from p-values: structural comparison

| Property | Classical $p$-value | E-value |
| --- | --- | --- |
| Definition | $p$ such that $\mathbb{P}_{H_0}(p \leq \alpha) \leq \alpha$ | $E \geq 0$ with $\mathbb{E}_{H_0}[E] \leq 1$ |
| Validity guarantee | At the fixed sample size committed in advance | At every stopping time simultaneously (Ville) |
| Behavior under peeking | Type I error inflates toward 1 with continuous monitoring | Type I error stays at $\alpha$ regardless of monitoring rule |
| Multiplication across studies | Not valid (Fisher combination requires independence + scaling) | Conditional products are e-values; arithmetic / weighted means combine across studies |
| Calibration gap | None (sharp at fixed $n$) | $1/E$ is a (conservative) p-value via Markov |
| Behavior under composite null | Standard via uniform validity over the null set | Universal inference (Wasserman et al. 2020) gives a generic construction |
| Multiple testing | Bonferroni / BH on p-values | e-BH (Wang–Ramdas 2022) provides FDR control on e-values |
| Confidence interval | Inverted at fixed $n$ (CI is for one $n$) | Inverted into a confidence sequence (valid at every $n$) |
| Software | Standard everywhere | rivertest, safetestR, confseq, ridgesafe (R / Python) |

The trade-off is sharp: e-values give up the worst-case sharpness of fixed-time p-values in exchange for anytime validity. For a user who is going to peek (online A/B testing, sequential clinical trials, rolling LLM evaluation), the sharpness loss is more than recovered because the alternative — using p-values under peeking — has no validity guarantee at all.

Watch Out

An e-value is not a sequential p-value

A p-value is a probability statement: $\mathbb{P}_{H_0}(p \le \alpha) \le \alpha$ at the fixed sample size you committed to in advance. An e-value is a nonnegative evidence measure with $\mathbb{E}_{H_0}[E] \le 1$. The expectation bound, not a probability bound, is what makes multiplication of e-values under conditional validity a clean operation, and what lets $1/E$ act as a calibrated p-value via Markov. Treating an e-value as "the new p-value" loses the structural difference: e-values combine across studies and dependent observations in ways p-values cannot, and they trade the worst-case sharpness of a fixed-time p-value for sequential validity.

Watch Out

What anytime-valid inference does not solve

Anytime-valid methods control Type I error under optional stopping provided the evidence process is valid under the planned filtration. They do not automatically rescue:

  • model-selection bias (you still cannot search across models and report the winner without accounting for the search);
  • adaptive-endpoint hacking (changing what you measure mid-study breaks the filtration);
  • post-hoc subgroup discovery (the same problem in disguise);
  • invalid nuisance estimation in semiparametric settings.

Each of these requires the selection or adaptation mechanism to be included in the validity argument. "Anytime-valid" is not a license to do anything adaptively; it is a guarantee about a specific kind of adaptation (when to stop) under a specified filtration.

Formal Setup

Let $H_0$ be a null hypothesis specifying a set of probability measures on a data stream $X_1, X_2, \ldots$ Fix a significance level $\alpha \in (0, 1)$.

Definition

E-Value

A nonnegative random variable $E$ is an e-value for $H_0$ if $\mathbb{E}_{P}[E] \leq 1$ for every $P \in H_0$. By Markov,

$$\mathbb{P}_P(E \geq 1/\alpha) \leq \alpha \quad \text{for all } P \in H_0.$$

The test rejects when $E \geq 1/\alpha$.

Definition

E-Process

A sequence $(E_t)_{t \geq 0}$ adapted to a filtration $(\mathcal{F}_t)$ is an e-process for $H_0$ if for every stopping time $\tau$ (including $\tau = \infty$) and every $P \in H_0$,

$$\mathbb{E}_P[E_\tau] \leq 1.$$

Equivalently, $E_t$ is upper-bounded by a nonnegative supermartingale under every $P \in H_0$. When equality holds, $(E_t)$ is a test martingale.

Ville's Inequality

Theorem

Ville's Inequality

Statement

Let $(E_t)_{t \geq 0}$ be a nonnegative supermartingale with $\mathbb{E}_P[E_0] \leq 1$ under every $P \in H_0$. For any $\lambda > 0$,

$$\mathbb{P}_P\Bigl(\sup_{t \geq 0} E_t \geq \lambda\Bigr) \leq \frac{1}{\lambda}.$$

Consequently, the test "reject when $E_t \geq 1/\alpha$ at any $t$" has Type I error at most $\alpha$ uniformly over all stopping times.

Intuition

A nonnegative supermartingale with expected starting value at most one cannot consistently grow large, because the expected value is bounded forever. Markov's inequality applied at a maximum (rather than at a fixed time) gives a maximal inequality. This single bound handles every continuation strategy: no matter how the analyst decides when to stop, the supremum control holds.
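Ville's guarantee can be checked empirically. A minimal simulation, assuming a Bernoulli stream with simple null $p_0 = 0.5$ and the likelihood-ratio test martingale against a fixed alternative $p_1 = 0.6$ (both values chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n_paths, horizon = 0.05, 2000, 1000
p0, p1 = 0.5, 0.6    # simple null vs a fixed alternative

# Data generated under the null; E_t is the running likelihood ratio,
# a nonnegative test martingale with E[E_0] = 1 under H0.
x = rng.binomial(1, p0, size=(n_paths, horizon))
log_lr = x * np.log(p1 / p0) + (1 - x) * np.log((1 - p1) / (1 - p0))
log_E = np.cumsum(log_lr, axis=1)

# Fraction of paths whose running maximum ever reaches 1/alpha = 20.
crossed = (log_E.max(axis=1) >= np.log(1 / alpha)).mean()
# Ville's inequality says this is at most alpha, at every horizon.
```

The crossing frequency stays below $\alpha$ no matter how long the horizon runs, which is exactly the sup-over-time control the theorem promises.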

Proof Sketch

The proof is a clean application of optional stopping to a hitting time, plus monotone convergence to remove the time horizon.

Step 1: define the hitting time. Let $\tau_\lambda = \inf\{t \geq 0 : E_t \geq \lambda\}$, with the convention $\tau_\lambda = \infty$ if the supremum never reaches $\lambda$. The event of interest is $\{\sup_{t \geq 0} E_t \geq \lambda\} = \{\tau_\lambda < \infty\}$.

Step 2: stop the supermartingale. For any deterministic horizon $T \geq 0$, the stopped process $(E_{t \wedge \tau_\lambda})_{0 \leq t \leq T}$ is itself a nonnegative supermartingale. Doob's optional-stopping theorem applies to bounded stopping times for a supermartingale, and $t \wedge \tau_\lambda$ is bounded by $T$, so

$$\mathbb{E}[E_{T \wedge \tau_\lambda}] \leq \mathbb{E}[E_0] \leq 1.$$

Step 3: lower-bound the expectation by the rejection probability. On the event $\{\tau_\lambda \leq T\}$ we have $E_{T \wedge \tau_\lambda} = E_{\tau_\lambda} \geq \lambda$ by definition of $\tau_\lambda$ (right-continuity in continuous time; discrete time is immediate). Since $E_t \geq 0$ everywhere,

$$\mathbb{E}[E_{T \wedge \tau_\lambda}] \geq \mathbb{E}\bigl[E_{T \wedge \tau_\lambda} \cdot \mathbf{1}_{\{\tau_\lambda \leq T\}}\bigr] \geq \lambda \cdot \mathbb{P}(\tau_\lambda \leq T).$$

Step 4: combine. Steps 2 and 3 together give

$$\lambda \cdot \mathbb{P}(\tau_\lambda \leq T) \leq 1, \qquad \text{so} \qquad \mathbb{P}(\tau_\lambda \leq T) \leq \frac{1}{\lambda}.$$

Step 5: send $T \to \infty$. The events $\{\tau_\lambda \leq T\}$ are nested and increase to $\{\tau_\lambda < \infty\} = \{\sup_{t \geq 0} E_t \geq \lambda\}$ as $T \to \infty$. Continuity of probability from below gives

$$\mathbb{P}\Bigl(\sup_{t \geq 0} E_t \geq \lambda\Bigr) = \lim_{T \to \infty} \mathbb{P}(\tau_\lambda \leq T) \leq \frac{1}{\lambda}.$$

The original formulation is Ville (1939), Étude critique de la notion de collectif. The modern textbook treatment is Williams (1991), Probability with Martingales, §A14, or Doob (1953), Stochastic Processes, Ch. VII §3. The inequality is sometimes called Doob's maximal inequality for nonnegative supermartingales; the e-value literature uses "Ville's inequality" because Ville's 1939 formulation explicitly motivated the gambling-against-the-null interpretation, and because the bound in the form $\mathbb{P}(\sup_t E_t \geq 1/\alpha) \leq \alpha$ is exactly the anytime-valid Type I error guarantee.

This proof is formalizable in Lean / Mathlib today; the relevant library entries are MeasureTheory.Martingale.OptionalStopping and the supermartingale-stopping lemmas under MeasureTheory.Filtration. A Lean wrapper for this argument is a natural follow-up to the existing verification/lean/TheoremPath/Probability/ modules.

Why It Matters

This is the mathematical engine underneath every anytime-valid procedure. It converts the fixed-sample Markov inequality into a sample-size-free maximal bound, at the cost of requiring the statistic to be a supermartingale under the null. Building such a supermartingale is the constructive content of the theory.

Failure Mode

The supermartingale property is with respect to a specific filtration; test statistics that peek at future data, or that are not adapted, violate it silently. Composite nulls require the supermartingale property to hold simultaneously under every null distribution, which is where universal inference and reverse information projection (RIPr) come in.

Construction Methods

Likelihood ratios. When $H_0 = \{P_0\}$ is simple and $P_1$ is an alternative, the likelihood ratio $L_t = \mathrm{d} P_1^{(t)} / \mathrm{d} P_0^{(t)}$ is a $P_0$-test martingale. This recovers Wald's sequential probability ratio test as a special case.
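As a sketch of how the running likelihood ratio recovers the SPRT (the thresholds are Wald's classical approximations; the Bernoulli parameters and error rates are illustrative choices):

```python
import numpy as np

def sprt_bernoulli(xs, p0=0.5, p1=0.6, alpha=0.05, beta=0.2):
    """Wald's SPRT as a running log-likelihood-ratio test martingale.
    Returns a decision string and the stopping index."""
    A = (1 - beta) / alpha   # upper threshold: reject H0
    B = beta / (1 - alpha)   # lower threshold: accept H0
    log_lr = 0.0
    for t, x in enumerate(xs, start=1):
        log_lr += np.log(p1 / p0) if x else np.log((1 - p1) / (1 - p0))
        if log_lr >= np.log(A):
            return "reject H0", t
        if log_lr <= np.log(B):
            return "accept H0", t
    return "continue", len(xs)
```

A stream of successes drives the martingale across the upper boundary; a stream of failures drives it across the lower one, and either crossing is a valid stopping decision.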

Universal inference (Wasserman, Ramdas, Balakrishnan 2020). For any composite null $H_0$, split the data into two halves. Fit a maximum likelihood estimator $\hat{\theta}_1$ on the first half and use

$$E = \frac{\mathrm{d} P_{\hat{\theta}_1}}{\sup_{\theta \in H_0} \mathrm{d} P_\theta}$$

evaluated on the second half. This is an e-value with no regularity conditions, trading efficiency for universal validity.
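A minimal sketch of the split-likelihood-ratio e-value for a composite Bernoulli null $H_0 : \theta \le 1/2$ (the choice of null and the even split are illustrative assumptions):

```python
import numpy as np

def bern_loglik(theta, x):
    """Bernoulli log-likelihood, clipped away from 0 and 1 for stability."""
    theta = np.clip(theta, 1e-12, 1 - 1e-12)
    s = x.sum()
    return s * np.log(theta) + (len(x) - s) * np.log(1 - theta)

def universal_e_value(x, null_upper=0.5):
    """Split-LR e-value for H0: Bernoulli mean <= null_upper.
    Numerator: MLE fit on the first half, evaluated on the second half.
    Denominator: null-restricted MLE fit and evaluated on the second half."""
    d1, d2 = x[: len(x) // 2], x[len(x) // 2 :]
    theta_hat = d1.mean()                     # fit on first half only
    theta_null = min(d2.mean(), null_upper)   # sup over the null, on second half
    return np.exp(bern_loglik(theta_hat, d2) - bern_loglik(theta_null, d2))
```

Under the null the denominator dominates, so the e-value stays small; under a clear alternative the first-half fit transfers to the second half and the e-value explodes.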

Reverse information projection (RIPr) (Grünwald, de Heide, Koolen 2024). Take the reverse KL projection of the alternative onto the null, giving a minimax-optimal e-value within a specified class.

Method of mixtures and betting martingales (Shafer 2021). Interpret the e-process as the wealth of a player in a sequential betting game against the null, with the null's best strategy being to offer fair odds. The analyst bets; wealth accumulates under misspecification.
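A minimal version of the betting construction for a bounded mean, with a fixed bet size rather than a data-adaptive one (the constant $\lambda = 0.5$ is an arbitrary illustrative choice; practical methods tune $\lambda_t$ predictably from past data):

```python
import numpy as np

def betting_wealth(x, m, lam=0.5):
    """Bettor's wealth against H0: E[X] = m, for observations in [0, 1].
    Each round multiplies wealth by 1 + lam*(x_t - m), which has
    expectation exactly 1 under the null, so the running product is a
    test martingale.  lam in (-1/(1-m), 1/m) keeps wealth nonnegative."""
    assert -1 / (1 - m) < lam < 1 / m
    return np.cumprod(1 + lam * (np.asarray(x, dtype=float) - m))
```

Under the null, wealth hovers around its starting value of one; under misspecification it grows exponentially, and crossing $1/\alpha$ is the anytime-valid rejection.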

p-to-e and e-to-p Calibration

Vovk and Wang (2021) characterize the admissible calibrators converting between the two. A function $f : [0, 1] \to [0, \infty]$ is a valid p-to-e calibrator if $\mathbb{E}[f(P)] \leq 1$ for every $P$ that is super-uniform under $H_0$; equivalently, if $f$ is non-increasing with $\int_0^1 f(u) \, du \leq 1$.

  • p-to-e: Two valid examples are $f(p) = -\log p$ (which has $\mathbb{E}_{P \sim \mathrm{Unif}}[f(P)] = 1$) and the family $f(p) = (1 - \kappa) p^{-\kappa}$ for $\kappa \in (0, 1)$. The proposal $f(p) = -1 + 1/p + \log p$ does not integrate to a finite value over $(0, 1]$ because $\mathbb{E}_{P \sim \mathrm{Unif}}[1/P]$ diverges, so it is not a valid calibrator without additional truncation or restriction.
  • e-to-p: $p = \min(1, 1/e)$ is a valid p-value from any valid e-value (Markov's inequality applied to $E$).

The conversions are lossy in an expected-value sense; the standalone strength of e-values is for sequential combination via running products of conditional e-values, and for arithmetic averaging of e-values across arbitrarily dependent experiments. The analogous operations on p-values do not preserve the validity guarantee.
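A quick numeric sanity check of the calibrator conditions, using uniform draws to stand in for exactly-uniform null p-values (sample size and $\kappa$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.uniform(size=200_000)   # under H0 the p-value is (super-)uniform

# Valid calibrators: non-increasing with integral at most 1 over (0, 1].
e_log = np.mean(-np.log(u))                      # integral is exactly 1
kappa = 0.25
e_pow = np.mean((1 - kappa) * u ** (-kappa))     # integral is exactly 1

# By contrast, 1/p has a divergent integral over (0, 1], so its sample
# mean never settles near any finite value: not a valid calibrator.
```

Both sample means sit near one, which is the defining property $\mathbb{E}_{H_0}[E] \leq 1$ with equality for admissible calibrators.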

Confidence Sequences

Inverting an e-process for $H_0 : \theta = \theta_0$ over all $\theta_0$ gives a confidence sequence: a random sequence of sets $C_t$ such that $\mathbb{P}(\theta_0 \in C_t \text{ for all } t) \geq 1 - \alpha$. The guarantee is uniform in time. Howard, Ramdas, McAuliffe, Sekhon (2021) give time-uniform nonparametric confidence sequences for bounded means, quantiles, and Bernoulli rates.
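A sketch of the inversion for a Bernoulli mean, using the closed-form Beta(1,1) mixture martingale (the grid resolution is arbitrary, and reporting only the time-$t$ set is a simplification: a full confidence sequence intersects these sets over time):

```python
import numpy as np
from math import lgamma, log

def log_mixture_martingale(s, f, p0):
    """Log of the Beta(1,1)-mixture test martingale for H0: Bernoulli(p0),
    after s successes and f failures:  B(s+1, f+1) / (p0^s (1-p0)^f)."""
    log_beta = lgamma(s + 1) + lgamma(f + 1) - lgamma(s + f + 2)
    return log_beta - s * log(p0) - f * log(1 - p0)

def confidence_sequence(s, f, alpha=0.05):
    """Grid inversion: keep every p0 whose martingale stays below 1/alpha."""
    grid = np.linspace(1e-4, 1 - 1e-4, 9999)
    kept = [p for p in grid if log_mixture_martingale(s, f, p) < log(1 / alpha)]
    return min(kept), max(kept)
```

After 70 successes in 100 trials the set is an interval around 0.7 that excludes 0.5, and by Ville the true parameter stays inside the running sets at every time simultaneously with probability at least $1 - \alpha$.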

Multiple Testing: The e-BH Procedure

Theorem

e-BH Controls FDR Under Arbitrary Dependence

Statement

For e-values $E_1, \ldots, E_m$ corresponding to $m$ null hypotheses, the e-BH procedure rejects the hypotheses with the $k^*$ largest e-values, where $k^*$ is the largest $k$ such that $E_{(k)} \geq m/(k \alpha)$ and $E_{(k)}$ denotes the $k$-th largest e-value. This procedure controls the false discovery rate at $\alpha$ under arbitrary dependence among the e-values.

Intuition

The BH procedure for p-values requires positive regression dependence (PRDS) to control FDR. e-values replace that assumption with a structural property: a weighted average of e-values is an e-value, regardless of dependence. The FDR bound follows from a single application of Markov to the ratio of false discoveries to total discoveries.

Why It Matters

In LLM evaluation and multi-arm A/B testing, dependence among hypotheses is the rule, not the exception, and PRDS is rarely verifiable. e-BH gives FDR control with no dependence assumption, at the cost of slightly wider rejection regions than BH under favourable dependence.
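The procedure is short enough to state as code; this is a direct transcription of the rule above (zero-based index convention and tie handling are implementation choices):

```python
import numpy as np

def e_bh(e_values, alpha=0.05):
    """e-BH: reject the hypotheses with the k* largest e-values, where
    k* = max{k : k-th largest e-value >= m / (k * alpha)}.
    Controls FDR at alpha under arbitrary dependence."""
    e = np.asarray(e_values, dtype=float)
    m = len(e)
    order = np.argsort(-e)          # indices, largest e-value first
    k_star = 0
    for k in range(1, m + 1):
        if e[order[k - 1]] >= m / (k * alpha):
            k_star = k
    return sorted(order[:k_star].tolist())
```

Note there is no dependence assumption anywhere in the procedure: the validity comes from the e-values themselves, not from how they covary.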

Applications

Adaptive clinical trials. The R package evalinger and its successors implement e-process monitoring with futility stops and platform-trial multiplicity control. Regulatory adoption is the live frontier (FDA draft guidance, January 2026).

Online A/B testing. Confidence sequences replace fixed-$n$ tests at Netflix, Adobe, Optimizely, and similar shops. Any decision made "when the result looks stable" is an implicit sequential test, and a confidence sequence is the honest version.

LLM evaluation. Leaderboard comparisons are sequential by design: new models arrive, benchmarks are rerun, the top-$k$ ordering shifts. Selective inference on the final ranking requires e-processes, not $p$-values.

Relationship to Bayes and Classical Sequential Analysis

Bayes factors are e-values from the null's perspective: the expectation of the Bayes factor under the null is one. Wald's sequential probability ratio test is a likelihood-ratio test martingale. The modern e-value framework unifies these threads and adds explicit universal-inference and RIPr constructions for composite nulls that the classical theories handle only clumsily.

Exercises

ExerciseCore

Problem

A classical $p$-value satisfies $\mathbb{P}_{H_0}(p \leq \alpha) \leq \alpha$. An analyst peeks at every $n = 100, 200, \ldots, 1000$ and stops at the first $p \leq 0.05$. Under $H_0$, roughly what is the Type I error rate, assuming the $p$-values at different stopping points are independent?

ExerciseAdvanced

Problem

Build a test martingale for $H_0: p = p_0$ against an unspecified alternative on an i.i.d. Bernoulli stream, using a $\mathrm{Beta}(1, 1)$ mixture prior on the alternative parameter. Explain why the mixture construction gives a universally valid e-process.

ExerciseResearch

Problem

State the safe-testing characterization of Grünwald, de Heide, Koolen (2024): under what conditions does the reverse information projection give a GROW-optimal e-value, and what goes wrong for composite nulls where RIPr does not exist in closed form?

Open Problems and Frontier

Tight e-values for composite nulls without closed-form RIPr remain a central open problem, with partial progress via numerical RIPr and GROW approximations.

Time-uniform CLT (Waudby-Smith 2024-26) extends e-values into settings that previously required fixed-$n$ asymptotics, including conditional independence testing without Model-X, observational causal inference, and semiparametric inference more broadly.

Anytime-valid conformal prediction via e-value-based prediction sets is an active frontier; the split-conformal guarantees extend to sequential prediction when the calibration scores generate a test martingale.

Regulatory adoption. The FDA January 2026 draft Bayesian guidance is the policy opening. e-value methods map cleanly onto the guidance's flexibility for non-traditional success criteria. Practical adoption in platform trials, futility monitoring, and adaptive design is the next 2-3 years.

Hypothesis testing with e-values (Ramdas, Wang 2025, Waterloo book draft) is the first textbook treatment; field-wide adoption will track the book's completion.

References

Foundational:

  • Ville, Étude Critique de la Notion de Collectif (Gauthier-Villars, 1939). The original maximal inequality.
  • Shafer, Vovk, Probability and Finance: It's Only a Game! (Wiley, 2001). Chapters 3-5.

Modern e-value theory:

  • Ramdas, Grünwald, Vovk, Shafer, "Game-Theoretic Statistics and Safe Anytime-Valid Inference." Statistical Science 38(4) (2023), 576-601. The canonical survey.
  • Vovk, Wang, "E-Values: Calibration, Combination and Applications." Annals of Statistics 49(3) (2021), 1736-1754.
  • Grünwald, de Heide, Koolen, "Safe Testing." Journal of the Royal Statistical Society B 86(4) (2024), 1091-1128.
  • Shafer, "Testing by Betting: A Strategy for Statistical and Scientific Communication." Journal of the Royal Statistical Society A 184(2) (2021), 407-431.

Confidence sequences and time-uniform bounds:

  • Howard, Ramdas, McAuliffe, Sekhon, "Time-Uniform, Nonparametric, Nonasymptotic Confidence Sequences." Annals of Statistics 49(2) (2021), 1055-1080.
  • Waudby-Smith, Ramdas, "Estimating Means of Bounded Random Variables by Betting." Journal of the Royal Statistical Society B 86(1) (2024), 1-27.

Multiple testing:

  • Wang, Ramdas, "False Discovery Rate Control with E-Values." Journal of the Royal Statistical Society B 84(3) (2022), 822-852.
  • Wasserman, Ramdas, Balakrishnan, "Universal Inference." Proceedings of the National Academy of Sciences 117(29) (2020), 16880-16890.

Textbook (in preparation):

  • Ramdas, Wang, Hypothesis Testing with E-Values. Book draft, University of Waterloo, 2025.


Last reviewed: April 26, 2026
