
Statistical Foundations

Minimax Lower Bounds: Le Cam, Fano, Assouad, and the Reduction to Testing

Why upper bounds are not enough. The reduction-to-testing principle, Le Cam's two-point method, Fano's inequality, the Assouad hypercube lemma, packing and metric-entropy bridges, and applications to nonparametric, sparse, and locally-private estimation.

Advanced · Tier 1 · Stable · Core spine · ~90 min

Why This Matters

An upper bound says: here is an estimator that works this well. A lower bound says: no estimator can do better. Together they characterize the minimax rate of a statistical problem, which is the right notion of fundamental difficulty: a rate-optimal estimator is one whose worst-case risk matches the lower bound up to constants.

Lower bounds tell you when to stop trying to improve. If your estimator achieves rate n^{-2s/(2s+d)} for s-smooth densities in \mathbb{R}^d and you can prove a matching Assouad lower bound, you know the rate is tight: no amount of cleverness can beat n^{-2s/(2s+d)}, and the curse of dimensionality (the d in the exponent) is statistical, not algorithmic.

In modern theory, lower bounds separate what is statistically possible from what is a failure of current algorithms. They are also the right tool for arguing that a model class is genuinely hard (sparse high-dimensional regression, nonparametric density estimation, locally private estimation, classification under adversarial perturbations, off-policy reinforcement learning).

Mental Model

Lower-bound arguments share one structural idea: estimation reduces to testing. If two parameter values \theta_0, \theta_1 are far apart in the loss metric but their distributions P_{\theta_0}^n, P_{\theta_1}^n are close in some statistical divergence, no estimator can reliably tell them apart, so no estimator can recover \theta with small loss on both.

The art is the reduction. Different methods package this idea differently:

  1. Le Cam (two hypotheses): reduce to a single binary test.
  2. Fano (M hypotheses): reduce to an M-ary test, controlled by mutual information.
  3. Assouad (hypercube): reduce to d simultaneous binary tests on the cube \{-1,+1\}^d.
  4. Packing / metric entropy (Yang-Barron, Birge): pick M via a covering argument and apply Fano.

Each method trades constructive complexity for tightness. Le Cam is the fastest to apply and gives parametric 1/n rates; Assouad and Fano are needed for high-dimensional and nonparametric rates.

Think of a two-panel template: the top panel is a loss-geometry construction, the bottom panel is the testing problem induced by the same construction, and the bridge between them is the reduction itself. Le Cam, Fano, and Assouad differ mostly in how many hypotheses they encode and how the testing-error lower bound is packaged.

Formal Setup

Definition

Minimax Risk

The minimax risk over a parameter space \Theta with loss \ell and sample size n is:

R_n^*(\Theta, \ell) = \inf_{\hat{\theta}} \sup_{\theta \in \Theta}\, \mathbb{E}_\theta\big[\ell(\hat{\theta}, \theta)\big].

The infimum is over all estimators \hat{\theta} = \hat{\theta}(X_1, \ldots, X_n) based on n samples; the supremum is the worst case over the parameter space. A minimax lower bound is any inequality R_n^* \geq r_n.

Definition

Total Variation Distance

For distributions P, Q on a common space,

\|P - Q\|_{TV} = \sup_A |P(A) - Q(A)| = \tfrac{1}{2}\int |dP - dQ|.

TV is bounded via Pinsker's inequality \|P - Q\|_{TV} \leq \sqrt{\tfrac{1}{2} D_{KL}(P\|Q)} and via Hellinger \|P - Q\|_{TV} \leq H(P,Q)\sqrt{1 - H^2(P,Q)/4} \leq H(P,Q). KL tensorizes additively and Hellinger via H^2(P^n, Q^n) = 2(1 - (1 - H^2(P,Q)/2)^n); TV does not tensorize, which is why Le Cam bounds always go through KL or Hellinger to get to the product distribution.
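A quick numerical sanity check of these inequalities for two unit-variance Gaussians, using the standard closed forms for TV, KL, and Hellinger in this model; the helper name and the grid of separations are illustrative choices, not from the text.

```python
# Check Pinsker and the Hellinger bound on TV for N(0,1) vs N(delta,1).
# Closed forms: TV = erf(|delta| / (2*sqrt(2))), KL = delta^2/2,
# Hellinger affinity = exp(-delta^2/8), so H^2 = 2*(1 - affinity) in the [0,2] convention.
import math

def gaussian_divergences(delta: float):
    tv = math.erf(abs(delta) / (2 * math.sqrt(2)))
    kl = delta ** 2 / 2
    h2 = 2 * (1 - math.exp(-delta ** 2 / 8))
    return tv, kl, h2

for delta in (0.1, 0.5, 1.0, 2.0):
    tv, kl, h2 = gaussian_divergences(delta)
    pinsker = math.sqrt(kl / 2)                    # TV <= sqrt(KL/2)
    hellinger = math.sqrt(h2) * math.sqrt(1 - h2 / 4)  # TV <= H * sqrt(1 - H^2/4)
    assert tv <= pinsker + 1e-12 and tv <= hellinger + 1e-12
    print(f"delta={delta:3.1f}  TV={tv:.4f}  Pinsker bound={pinsker:.4f}  Hellinger bound={hellinger:.4f}")
```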

The minimax framework is a two-player game: the statistician chooses an estimator (the inf), nature chooses the worst parameter (the sup), and the value of the game is R_n^*. A lower bound certifies that no matter what the statistician plays, nature can force the stated risk on some \theta.

The Reduction to Testing

The fundamental observation: if you cannot test, you cannot estimate.

Pick two parameter values \theta_0, \theta_1 with \ell(\theta_0, \theta_1) \geq 2s. Any estimator \hat\theta induces a binary test \psi = \mathbf{1}[\ell(\hat\theta, \theta_0) \geq s]. By the triangle property of \ell, the test correctly distinguishes \theta_0 from \theta_1 whenever \hat\theta is closer than s to the truth. So estimation is at least as hard as binary testing, and testing accuracy is governed by the statistical divergence between P_{\theta_0}^n and P_{\theta_1}^n. This is the entire engine.
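To make the reduction concrete, here is a small simulation sketch (the Gaussian location model and all the specific numbers are illustrative, not from the text): the sample mean stands in for an arbitrary estimator, the induced test \psi is the one defined above, and its average error is compared with the (1 - TV)/2 limit that drives Le Cam's bound below.

```python
# The estimator-induced test can never beat the (1 - TV)/2 limit on average error.
import math
import numpy as np

rng = np.random.default_rng(0)
n, theta0, theta1 = 50, 0.0, 0.3
s = abs(theta1 - theta0) / 2                       # half-separation in |.| loss

def induced_test(x):
    theta_hat = x.mean()                           # the estimator (sample mean)
    return 1 if abs(theta_hat - theta0) >= s else 0  # guess theta1 iff far from theta0

trials = 5000
err0 = np.mean([induced_test(rng.normal(theta0, 1, n)) == 1 for _ in range(trials)])
err1 = np.mean([induced_test(rng.normal(theta1, 1, n)) == 0 for _ in range(trials)])

# Exact TV between the two n-sample Gaussian product laws (sufficient statistic: sample mean).
tv_n = math.erf(abs(theta1 - theta0) * math.sqrt(n) / (2 * math.sqrt(2)))
print(f"avg test error = {(err0 + err1) / 2:.3f}  >=  Le Cam limit {(1 - tv_n) / 2:.3f}")
```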

The choice of which divergence (TV, KL, Hellinger, \chi^2, mutual information) governs which lower-bound method is most natural.

Le Cam Two-Point Method

Theorem

Le Cam Two-Point Method

Statement

Let \theta_0, \theta_1 \in \Theta with \ell(\theta_0, \theta_1) \geq 2s > 0. Then:

R_n^*(\Theta, \ell) \;\geq\; \frac{s}{2}\Big(1 - \|P_{\theta_0}^n - P_{\theta_1}^n\|_{TV}\Big).

Equivalent (via Pinsker) form: if D_{KL}(P_{\theta_0}\|P_{\theta_1}) \leq \alpha/n for some constant \alpha < \infty, then R_n^* \geq \tfrac{s}{2}(1 - \sqrt{\alpha/2}). Convention follows Yu (1997, Lemma 1) and Tsybakov (2009, Theorem 2.2): s is the half-separation in the loss; TV is normalized to [0,1]; the 1/2 comes from the average-error bound.

Intuition

If the two product distributions P_{\theta_0}^n and P_{\theta_1}^n are nearly indistinguishable (small TV), any estimator must err on at least one of them. Since the parameters are separated by 2s in loss, the error is of order s. The trick is to pick \theta_0, \theta_1 as far apart in loss as possible while keeping TV bounded away from 1; tuning the gap makes the lower bound rate-optimal.

Proof Sketch

For any \hat\theta: \sup_\theta \mathbb{E}_\theta[\ell(\hat\theta, \theta)] \geq \tfrac{1}{2}\big(\mathbb{E}_{\theta_0}[\ell(\hat\theta, \theta_0)] + \mathbb{E}_{\theta_1}[\ell(\hat\theta, \theta_1)]\big). Define \psi = \mathbf{1}[\ell(\hat\theta, \theta_0) < s]. By the triangle property \ell(\hat\theta, \theta_0) + \ell(\hat\theta, \theta_1) \geq 2s, the indicator \psi is a valid binary test for \theta_0 vs \theta_1, and each expected loss dominates s times the corresponding testing error. The average error of any binary test is bounded below by \tfrac{1}{2}(1 - \|P_{\theta_0}^n - P_{\theta_1}^n\|_{TV}), since by Neyman-Pearson the minimum sum of type I and type II errors equals 1 - \|P_{\theta_0}^n - P_{\theta_1}^n\|_{TV}. Substituting gives the displayed bound.

Why It Matters

Le Cam is the simplest lower-bound technique and almost always the first to try. For one-dimensional parametric problems it gives the right 1/n rate in one line, matching the Cramer-Rao bound up to constants. The Le Cam construction is the template every richer method extends.

Failure Mode

Le Cam uses only two hypotheses. When \Theta is rich, two points cannot capture the full difficulty: for the d-dimensional Gaussian mean problem Le Cam yields R_n^* \geq c/n, while the true rate is d/n. To capture dimension or smoothness, switch to Fano or Assouad.

Fano's Method

Fano extends Le Cam from 2 to M hypotheses, with the divergence replaced by mutual information.

Theorem

Fano's Inequality for Minimax Risk

Statement

Let \theta_1, \ldots, \theta_M be 2s-separated in \ell. Let T be uniform on \{1, \ldots, M\} and X^n \mid T = i \sim P_{\theta_i}^n. Then:

R_n^*(\Theta, \ell) \;\geq\; s \cdot \left(1 - \frac{I(X^n; T) + \log 2}{\log M}\right),

where I(X^n; T) is the mutual information between data and hypothesis index. The bound is non-trivial as soon as I(X^n; T) \leq \tfrac{1}{2}\log M (once M \geq 5). A standard upper bound is I(X^n; T) \leq \tfrac{1}{M^2}\sum_{i,j} D_{KL}(P_{\theta_i}^n \| P_{\theta_j}^n), which lets you control mutual information by pairwise KL.

Intuition

Fano is information-theoretic: from n samples you can extract at most I(X^n; T) bits about the hypothesis T. Recovering T requires at least \log M - O(1) bits, so when I(X^n; T) \ll \log M the testing problem is genuinely hard. Packing many hypotheses into \Theta raises the bit requirement; keeping their distributions close keeps mutual information small. The art is in the packing.

Proof Sketch

Reduce estimation to M-ary testing as in Le Cam: define \hat T = \arg\min_i \ell(\hat\theta, \theta_i). By 2s-separation, \hat T \neq T forces \ell(\hat\theta, \theta_T) \geq s. The classical Fano inequality states H(T \mid X^n) \leq h_2(P_e) + P_e \log(M-1), where P_e is the testing error and h_2 is the binary entropy. Since H(T \mid X^n) = \log M - I(X^n; T), solving for P_e and substituting yields the displayed risk bound.

Why It Matters

Fano is the workhorse for minimax rates in nonparametric estimation, high-dimensional sparse problems, and learning theory. The proof template is universal: pick a packing of \Theta at scale s, bound the mutual information from above, and read off the rate. Yang and Barron (1999) turn this template into a theorem that translates metric-entropy estimates into minimax rates without case-by-case bookkeeping.
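As a concrete instance of the template, here is a sketch of the Fano calculation for the d-dimensional Gaussian mean on a Varshamov-Gilbert packing of the hypercube (the standard fact: a subset of \{0,1\}^d with \log M \geq (d/8)\log 2 and pairwise Hamming distance at least d/8). The helper name, the tuning of \delta, and the specific numbers are illustrative, and the constants are deliberately loose; only the d/n rate is meaningful.

```python
# Generic Fano calculator plus a worked Gaussian-mean instance.
import math

def fano_lower_bound(s: float, log_M: float, mutual_info_bound: float) -> float:
    """Risk >= s * (1 - (I + log 2) / log M), clipped at 0 when vacuous."""
    return max(0.0, s * (1 - (mutual_info_bound + math.log(2)) / log_M))

d, n = 128, 1000
delta = math.sqrt(math.log(2) / (8 * n))   # tuned so n*d*delta^2/2 <= (1/2) log M
log_M = (d / 8) * math.log(2)              # Varshamov-Gilbert packing size
min_sep_sq = (d / 8) * delta ** 2          # minimum squared l2 separation of the packing
I_bound = n * d * delta ** 2 / 2           # mutual information bound via max pairwise KL
bound = fano_lower_bound(s=min_sep_sq / 4, log_M=log_M, mutual_info_bound=I_bound)
print(f"Fano bound {bound:.2e}  vs  d/n = {d/n:.2e}  (same d/n rate, loose constant)")
```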

Failure Mode

Fano requires constructing a packing whose pairwise KL is small. For problems with fast-mixing chains, complex dependencies between coordinates, or non-iid data, the KL bound can be loose and produce sub-optimal rates. Refined versions (Birge's local Fano, Yang-Barron with adaptive packing) and the chi-squared variant of Tsybakov often recover tight rates where vanilla Fano fails.

The Assouad Hypercube Lemma

When the parameter is a d-dimensional vector and the loss decomposes coordinate-wise, Assouad reduces estimation to d parallel Le Cam tests.

Lemma

Assouad Lemma

Statement

Let \Theta be indexed by \tau \in \{-1, +1\}^d via \theta_\tau, with a loss that separates across coordinates: for every estimator \hat\theta there is a label \hat\tau = \hat\tau(\hat\theta) \in \{-1,+1\}^d with

\ell(\hat\theta, \theta_\tau) \;\geq\; \sum_{j=1}^d s_j \cdot \mathbf{1}[\hat\tau_j \neq \tau_j].

For each coordinate j, let \bar P_{j,\pm}^n denote the mixture of P_{\theta_\tau}^n over \tau with \tau_j = \pm 1. Then:

R_n^*(\Theta, \ell) \;\geq\; \frac{1}{2}\sum_{j=1}^d s_j \cdot \Big(1 - \big\|\bar P_{j,+}^n - \bar P_{j,-}^n\big\|_{TV}\Big).

Convention: Yu (1997, Lemma 2), Tsybakov (2009, Theorem 2.12), Wainwright (2019, Proposition 15.5).

Intuition

The decomposable loss splits the d-dimensional estimation problem into d parallel coordinate-wise binary tests. Each coordinate contributes a Le Cam-style s_j (1 - \mathrm{TV}) term; the total bound is the sum. Dimension enters multiplicatively, which is exactly what gives d/n rates for high-dimensional Gaussian mean estimation.

Proof Sketch

Decompose the worst-case loss across coordinates. For coordinate j, the sub-problem is testing whether \tau_j = +1 or \tau_j = -1 given X^n when the other coordinates are drawn uniformly. The mixture distribution collapses the nuisance coordinates and reduces the problem to a Le Cam binary test between \bar P_{j,+}^n and \bar P_{j,-}^n. Apply Le Cam to each coordinate and sum over j.

Why It Matters

Assouad is the tool for high-dimensional and nonparametric lower bounds where the loss is naturally additive (squared \ell_2, \ell_1, Hamming, sup-norm of a smooth-function approximation). The famous d/n rate for the dense Gaussian-mean problem, the n^{-2s/(2s+d)} rate for s-smooth density estimation in d dimensions, and most adversarial-perturbation lower bounds use Assouad.

Failure Mode

Assouad needs the loss to decompose. For matrix-valued losses (operator norm in covariance estimation), regularized losses, or losses with coupling across coordinates, Assouad does not apply directly. The Birge-Massart approach replaces the hypercube with a packing argument that handles such losses with a Fano-style information bound; Cai-Zhou (2012) for sparse covariance estimation is a representative application.

Packing, Metric Entropy, and the Yang-Barron Bridge

The Fano method requires a packing of \Theta at scale s. The metric-entropy literature gives a systematic way to construct such packings and to translate them into minimax rates.

Definition

Packing Number

Given a metric space (\Theta, \rho), the packing number M(\Theta, s) is the largest M such that there exist \theta_1, \ldots, \theta_M \in \Theta with \rho(\theta_i, \theta_j) \geq s for all i \neq j. The metric entropy at scale s is \log M(\Theta, s).
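A minimal sketch of the definition in code (the helper is hypothetical, not from the literature cited here): a single greedy pass produces an s-separated set, so its size lower-bounds the packing number M(\Theta, s).

```python
# Greedy s-packing of a finite point cloud in Euclidean distance.
import math
import numpy as np

def greedy_packing(points: np.ndarray, s: float) -> list:
    """Return indices of points kept by the greedy rule: keep a point iff it is >= s from all kept points."""
    kept = []
    for i, p in enumerate(points):
        if all(np.linalg.norm(p - points[j]) >= s for j in kept):
            kept.append(i)
    return kept

# Example: pack the unit interval [0, 1] at scale s = 0.1; metric entropy ~ log(1/s).
grid = np.linspace(0, 1, 1001).reshape(-1, 1)
packing = greedy_packing(grid, s=0.1)
print(f"M = {len(packing)},  log M = {math.log(len(packing)):.2f},  log(1/s) = {math.log(10):.2f}")
```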

Theorem

Yang-Barron Bridge: Metric Entropy to Minimax Rate

Statement

Suppose D_{KL}(P_\theta \| P_{\theta'}) \leq L \rho^2(\theta, \theta') for \theta, \theta' in a class \Theta. Define the critical scale s_n as the solution of \log M(\Theta, s_n) \asymp n L s_n^2. Then the minimax risk satisfies R_n^*(\Theta, \rho^2) \asymp s_n^2. The upper bound is achieved by an \varepsilon-net plus likelihood-test estimator (Yang-Barron 1999); the lower bound follows from Fano applied to a packing at scale s_n.

Intuition

The critical scale s_n balances two competing forces. Large s_n makes the packing sparse (low metric entropy, few hypotheses) and the per-pair KL large (easy to test). Small s_n makes the packing dense (high metric entropy, many hypotheses) and the per-pair KL small (hard to test). The Fano denominator is \log M(\Theta, s_n), the numerator is proportional to n L s_n^2, and the optimal trade-off equates the two.

Proof Sketch

The lower bound: pick a packing of size M(\Theta, s_n) at scale s_n. The pairwise per-sample KL is at most L s_n^2, so the mutual information satisfies I(X^n; T) \leq nL s_n^2. By the critical-scale equation nL s_n^2 \asymp \log M(\Theta, s_n), the Fano bound stays bounded away from zero, giving R_n^* \gtrsim s_n^2. The upper bound: a likelihood test over an \varepsilon = s_n/2 net of \Theta selects a net point within O(s_n) of the truth with high probability; the resulting estimator incurs squared error O(s_n^2).

Why It Matters

Yang-Barron promotes metric-entropy estimates from nonparametric statistics (Kolmogorov-Tikhomirov, Birge, van der Vaart-Wellner) into minimax rates without bespoke argumentation per problem. Most modern minimax lower-bound papers either invoke this bridge directly or follow its structure: estimate the metric entropy of \Theta, plug into Fano. The same template gives sharp rates for Holder classes, Sobolev balls, RKHS unit balls, sparse vectors, low-rank matrices, monotone functions, convex functions, and shape-constrained estimation.

Failure Mode

The bridge requires a clean KL-vs-metric coupling D_{KL}(P_\theta\|P_{\theta'}) \asymp \rho^2(\theta, \theta'). Heavy-tailed observation models, models with non-regular likelihoods (e.g., the uniform with unknown endpoint), and high-dimensional models with non-product structure can violate this coupling, in which case bespoke arguments using \chi^2 divergence (Tsybakov 2009, Section 2.7) or hybrid two-stage Fano-Assouad bounds are needed.

How to Pick a Method

Three diagnostic questions:

  1. Is the parameter space one-dimensional or low-dimensional? Use Le Cam.
  2. Does the loss decompose coordinate-wise? Use Assouad.
  3. Is the parameter space a smooth class with known metric entropy? Use Yang-Barron / Fano with a packing.

Beyond the diagnostic, the construction is the art: choosing two hypotheses (Le Cam), the right hypercube embedding (Assouad), or the right packing (Fano) so that the separation is large and the divergences are small. Tsybakov's Introduction to Nonparametric Estimation (2009) is the canonical guide; Wainwright's High-Dimensional Statistics (2019, Ch. 15) covers the modern high-dimensional and sparse cases.

Canonical Examples

Example

Le Cam for Gaussian mean estimation

X_1, \ldots, X_n \sim \mathcal{N}(\theta, 1), \theta \in \mathbb{R}. Take \theta_0 = -\delta/2, \theta_1 = +\delta/2. The KL between the product distributions is D_{KL}(P_{\theta_0}^n \| P_{\theta_1}^n) = n\delta^2/2, so by Pinsker \|P_{\theta_0}^n - P_{\theta_1}^n\|_{TV} \leq \delta\sqrt{n}/2. Setting \delta = c/\sqrt{n} keeps TV bounded by a constant < 1, and Le Cam gives R_n^* \geq c'/n for squared error. This matches the sample-mean upper bound \mathrm{Var}(\bar X) = 1/n.
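A numeric rendering of this example (the constant c = 1 is an arbitrary illustrative choice): compute the exact TV between the two n-sample product laws and evaluate the Le Cam bound next to the sample-mean risk 1/n.

```python
# Le Cam two-point bound for the Gaussian mean with delta = c / sqrt(n).
import math

def le_cam_gaussian_bound(n: int, c: float = 1.0) -> float:
    delta = c / math.sqrt(n)
    tv = math.erf(delta * math.sqrt(n) / (2 * math.sqrt(2)))  # exact TV between the product laws
    s = delta ** 2 / 2                                        # half-separation in squared-error loss
    return (s / 2) * (1 - tv)

for n in (10, 100, 1000, 10000):
    print(f"n={n:6d}  Le Cam bound={le_cam_gaussian_bound(n):.2e}  sample-mean risk={1/n:.2e}")
```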

Example

Assouad for d-dimensional Gaussian mean

X_i \sim \mathcal{N}(\theta, I_d), \theta \in \mathbb{R}^d. Take \theta_\tau = (\delta/\sqrt{n})\tau for \tau \in \{-1,+1\}^d. Squared \ell_2 loss decomposes coordinate-wise with s_j = \delta^2/n. The mixture-pair TV at coordinate j is bounded using the joint KL D_{KL}(\bar P_{j,+}^n \| \bar P_{j,-}^n) \leq 2\delta^2 via Jensen, giving TV \leq c < 1. Assouad sums to R_n^* \gtrsim d\delta^2/n = c''d/n. This matches the upper bound from the sample mean (\mathbb{E}\|\bar X - \theta\|_2^2 = d/n). The dimension dependence is statistical, not algorithmic.
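The same calculation in code (\delta = 1/2 is an arbitrary illustrative choice; the mixture KL is replaced by the 2\delta^2 upper bound quoted above, and the mixture TV by Pinsker):

```python
# Assouad bound for the d-dimensional Gaussian mean vs the sample-mean risk d/n.
import math

def assouad_gaussian_bound(d: int, n: int, delta: float = 0.5) -> float:
    s_j = delta ** 2 / n                 # per-coordinate squared-error separation
    kl_mix = 2 * delta ** 2              # upper bound on the coordinate-j mixture KL over n samples
    tv_mix = math.sqrt(kl_mix / 2)       # Pinsker
    return 0.5 * d * s_j * max(0.0, 1 - tv_mix)

d, n = 64, 1000
print(f"Assouad bound={assouad_gaussian_bound(d, n):.2e}  sample-mean risk d/n={d/n:.2e}")
```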

Example

Fano + Yang-Barron for Holder density estimation

For estimating a density on [0,1]^d in the Holder ball \mathcal{H}^s(L) of smoothness s, the Kolmogorov metric entropy at scale \varepsilon is \log M(\mathcal{H}^s(L), \varepsilon) \asymp \varepsilon^{-d/s}. Coupling KL to the squared \mathcal{L}^2 metric and solving the Yang-Barron critical-scale equation \varepsilon^{-d/s} \asymp n \varepsilon^2 gives \varepsilon_n \asymp n^{-s/(2s+d)} and R_n^* \asymp n^{-2s/(2s+d)}. This is the canonical curse-of-dimensionality rate; the matching upper bound is achieved by kernel density estimators with bandwidth h \asymp n^{-1/(2s+d)}.
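A sketch of the critical-scale balance with constants dropped (so only the rate is meaningful): solve \varepsilon^{-d/s} = n\varepsilon^2 numerically and compare with the closed form n^{-s/(2s+d)}.

```python
# Bisection on the Yang-Barron balance for the Holder ball.
def critical_scale(n: float, s: float, d: float, iters: int = 200) -> float:
    lo, hi = 1e-12, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mid ** (-d / s) > n * mid ** 2:   # entropy still exceeds the information budget
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n, s, d = 10_000, 2.0, 3.0
eps_n = critical_scale(n, s, d)
print(f"numerical eps_n={eps_n:.4f}   closed form n^(-s/(2s+d))={n ** (-s / (2*s + d)):.4f}")
print(f"rate eps_n^2={eps_n**2:.2e}   n^(-2s/(2s+d))={n ** (-2*s / (2*s + d)):.2e}")
```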

Example

Frontier: Locally private mean estimation (Duchi-Jordan-Wainwright 2018)

Under \alpha-local differential privacy, each user submits a noisy view Z_i = Q(X_i) with Q satisfying the local-DP constraint \sup_{x,x',S} P(Q(x) \in S)/P(Q(x') \in S) \leq e^\alpha. Duchi, Jordan, and Wainwright (JASA 2018, 113:182-201) show that for mean estimation on [-1,1]^d, the local-DP minimax rate is R_n^*(\alpha) \asymp d/(n \min(\alpha^2, \alpha)), strictly slower than the non-private rate d/n when \alpha \to 0. The proof uses an Assouad construction with a strong-data-processing inequality that converts local DP into a contractive bound on the per-coordinate KL, giving the \alpha^2 price of privacy in the small-\alpha regime.
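A one-dimensional simulation sketch of the privacy price. It uses a simple binary randomized-response mechanism that is \alpha-LDP and unbiased for values in [-1,1]; this is an illustration with arbitrary parameter choices, not the estimator or construction from the paper. The private mean's MSE inflates by roughly c_\alpha^2 \approx 4/\alpha^2 relative to the non-private sample mean, matching the 1/(n\alpha^2) scaling.

```python
# Randomized-response style alpha-LDP release of x in [-1,1], unbiased by construction.
import math
import numpy as np

def private_mean(x: np.ndarray, alpha: float, rng) -> float:
    c = (math.exp(alpha) + 1) / (math.exp(alpha) - 1)               # output magnitude
    p = 0.5 + x * (math.exp(alpha) - 1) / (2 * (math.exp(alpha) + 1))
    z = c * np.where(rng.random(x.shape) < p, 1.0, -1.0)            # E[z | x] = x, likelihood ratio <= e^alpha
    return z.mean()

rng = np.random.default_rng(1)
n, alpha, theta, reps = 5000, 0.5, 0.3, 2000
mse_priv, mse_plain = [], []
for _ in range(reps):
    x = np.where(rng.random(n) < (1 + theta) / 2, 1.0, -1.0)        # X_i in {-1,+1} with mean theta
    mse_priv.append((private_mean(x, alpha, rng) - theta) ** 2)
    mse_plain.append((x.mean() - theta) ** 2)

c_alpha = (math.exp(alpha) + 1) / (math.exp(alpha) - 1)
print(f"private MSE ~ {np.mean(mse_priv):.2e}   non-private MSE ~ {np.mean(mse_plain):.2e}")
print(f"ratio ~ {np.mean(mse_priv)/np.mean(mse_plain):.1f}   (roughly c_alpha^2 = {c_alpha**2:.1f})")
```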

Common Confusions

Watch Out

A lower bound is about the best estimator, not all estimators

The minimax lower bound says: the best worst-case risk is at least this large. It does not say every estimator achieves this rate. Most estimators are much worse than minimax-optimal. Stein's paradox makes this concrete: the MLE for a Gaussian mean in dimension d \geq 3 is dominated by the James-Stein shrinkage estimator under squared-error loss, even though both are minimax rate-optimal at d/n.
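A small simulation sketch of the Stein point (the dimension, sample size, and chosen \theta are arbitrary): the positive-part James-Stein estimator beats the sample mean in risk at this \theta, yet both have worst-case risk of order d/n.

```python
# Risk of the sample mean vs positive-part James-Stein at a fixed theta.
import numpy as np

rng = np.random.default_rng(2)
d, n, reps = 20, 50, 5000
theta = np.full(d, 0.2)                                          # a fixed, moderately small mean

risk_mle, risk_js = 0.0, 0.0
for _ in range(reps):
    xbar = theta + rng.normal(0, 1, d) / np.sqrt(n)              # sample mean ~ N(theta, I/n)
    shrink = max(0.0, 1 - (d - 2) / (n * np.sum(xbar ** 2)))     # positive-part James-Stein factor
    risk_mle += np.sum((xbar - theta) ** 2) / reps
    risk_js += np.sum((shrink * xbar - theta) ** 2) / reps
print(f"MLE risk ~ {risk_mle:.3f} (d/n = {d/n:.3f})   James-Stein risk ~ {risk_js:.3f}")
```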

Watch Out

Minimax optimality is worst-case, not problem-specific

A minimax-optimal estimator may be conservative on instances that are easier than the worst case. Adaptive estimators (e.g., Lepski's method, Birge-Massart model selection) try to achieve near-minimax rates on every instance simultaneously, paying only logarithmic factors for adaptation. The right estimator for your problem may not be the minimax estimator for the surrounding class.

Watch Out

TV, KL, Hellinger, and chi-squared are not interchangeable

Le Cam uses TV, Fano uses mutual information (and pairwise KL), Yang-Barron uses KL paired with a metric, and Tsybakov's chi-squared method uses \chi^2 for tighter bounds when KL diverges. The relationships \|P-Q\|_{TV} \leq \sqrt{D_{KL}/2} (Pinsker), H^2 \leq D_{KL}, D_{KL} \leq \chi^2 are one-way; substituting the wrong divergence into the wrong method gives bounds that are correct but loose, sometimes by a polynomial factor.

Watch Out

Le Cam can give tight parametric rates but not high-dimensional rates

For one-dimensional Gaussian or Bernoulli mean estimation, Le Cam with two points already gives the tight 1/n rate. For the d-dimensional Gaussian mean, Le Cam with two points still only gives 1/n, missing the factor of d. The cure is Assouad (or Fano with a packing), not a cleverer two-point construction.

Watch Out

Lower bounds need explicit constants for adaptive matching

Many published lower bounds prove R_n^* \geq c \cdot \text{rate}_n for some unspecified constant c. For practical purposes (deciding whether a new estimator is worth using), the constant matters. The chi-squared method (Tsybakov 2009, Sec. 2.7) and the localized Fano of Birge-Massart give sharper constants and are needed when matching the leading constant of an upper bound.

Watch Out

Lower bounds care about what you observe, not what you can compute

Minimax lower bounds are information-theoretic: they bound what any estimator can extract from the data, ignoring computation. A computationally tractable estimator can still be statistically optimal (Lasso for sparse linear regression). Computational lower bounds (planted clique, statistical query model) are a separate research program: see Brennan-Bresler-Huleihel 2018 and follow-ups.

Summary

  • Minimax risk R_n^* = \inf_{\hat\theta} \sup_\theta \mathbb{E}[\ell(\hat\theta, \theta)] is the right notion of fundamental difficulty.
  • All lower-bound methods reduce estimation to testing: if you cannot test, you cannot estimate.
  • Le Cam (two hypotheses, TV) gives parametric 1/n rates in one line.
  • Fano (M hypotheses, mutual information) handles rich parameter spaces; Yang-Barron turns metric-entropy estimates into rates.
  • Assouad (hypercube, sum of binary tests) handles decomposable losses and high-dimensional rates.
  • The art is in the construction: pick hypotheses far in loss but close in divergence.
  • Lower bounds are information-theoretic; matching them is a separate computational question.

Exercises

ExerciseCore

Problem

Use Le Cam to prove R_n^* \geq c/n for estimating the mean of X_i \sim \text{Bernoulli}(p), p \in [0,1], under squared-error loss. Give an explicit constant c.

ExerciseCore

Problem

For d-dimensional Gaussian mean estimation under squared \ell_2 loss, explain in one paragraph why Le Cam fails to recover the d/n rate and how Assouad recovers it. State which line of the Assouad proof picks up the factor of d.

ExerciseAdvanced

Problem

Prove a Fano lower bound for sparse linear regression: observations y = X\beta^* + \varepsilon with X \in \mathbb{R}^{n \times d} a Gaussian design, \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n), and \beta^* k-sparse with \|\beta^*\|_2 \leq 1. Show that the minimax rate for squared \ell_2 loss is at least \Omega(k \log(d/k)/n).

ExerciseAdvanced

Problem

Show that the Yang-Barron critical-scale equation \log M(\Theta, s_n) \asymp n L s_n^2 produces the rate R_n^* \asymp n^{-2s/(2s+d)} when \Theta is the Holder ball \mathcal{H}^s(L) on [0,1]^d with metric entropy \log M(\mathcal{H}^s, \varepsilon) \asymp \varepsilon^{-d/s}.

ExerciseResearch

Problem

The Duchi-Jordan-Wainwright 2018 local-DP lower bound shows that under \alpha-LDP the minimax rate for d-dimensional mean estimation degrades from d/n to d/(n \min(\alpha^2, \alpha)). State precisely which step of the Assouad argument changes when local DP is imposed, and show how the strong-data-processing inequality for \alpha-LDP channels recovers the \alpha^2 contraction.


References

Canonical:

  • Tsybakov, A. (2009). Introduction to Nonparametric Estimation. Springer. Chapter 2 is the canonical exposition; Section 2.6-2.7 cover Fano, chi-squared, and the Yang-Barron bridge.
  • Le Cam, L. (1973). "Convergence of Estimates Under Dimensionality Restrictions." Annals of Statistics 1, 38-53. The original Le Cam two-point method.
  • Yu, B. (1997). "Assouad, Fano, and Le Cam." In Festschrift for Lucien Le Cam, 423-435. Springer. Unified exposition of the three core methods.
  • Fano, R. M. (1961). Transmission of Information: A Statistical Theory of Communications. MIT Press. Original Fano inequality.
  • Assouad, P. (1983). "Deux remarques sur l'estimation." Comptes Rendus Acad. Sci. Paris 296, 1021-1024. Original Assouad lemma.
  • Yang, Y., Barron, A. (1999). "Information-theoretic determination of minimax rates of convergence." Annals of Statistics 27(5), 1564-1599. The metric-entropy-to-minimax-rate bridge.

Current and applications:

  • Wainwright, M. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge. Chapter 15 covers minimax lower bounds with modern high-dimensional applications (sparse regression, low-rank matrix estimation).
  • Duchi, J. (lecture notes). Information-Theoretic Methods in Statistics. Stanford. Modern treatment with private and distributed estimation.
  • Duchi, J., Jordan, M. I., Wainwright, M. J. (2018). "Minimax Optimal Procedures for Locally Private Estimation." JASA 113, 182-201. Local-DP minimax rates via strong-data-processing Assouad.
  • Cai, T. T., Zhou, H. H. (2012). "Optimal rates of convergence for sparse covariance matrix estimation." Annals of Statistics 40(5), 2389-2420. Block-Assouad construction for matrix estimation.
  • Birge, L. (1983). "Approximation dans les espaces metriques et theorie de l'estimation." Z. Wahrsch. Verw. Gebiete 65, 181-237. Local Fano with refined constants.
  • Birge, L., Massart, P. (1998). "Minimum contrast estimators on sieves: exponential bounds and rates of convergence." Bernoulli 4, 329-375. Adaptive minimax rates via model selection.
  • van der Vaart, A., Wellner, J. (1996). Weak Convergence and Empirical Processes. Springer. Metric-entropy estimates for empirical-process classes.
  • Brennan, M., Bresler, G., Huleihel, W. (2018). "Reducibility and Computational Lower Bounds for Problems with Limited Dependence on Sparsity." COLT 2018. Computational vs statistical lower bounds.

Critique and refinements:

  • Polyanskiy, Y., Wu, Y. (2024 draft). Information Theory: From Coding to Learning. Cambridge. Chapter 32 unifies classical and modern minimax via ff-divergences and gives sharper Fano variants.
  • Verdu, S., Han, T. S. (1994). "A general formula for channel capacity." IEEE Trans. Inf. Theory 40(4), 1147-1157. Information-theoretic foundations beyond iid.
  • Khas'minskii, R. Z. (1979). "A Lower Bound on the Risks of Nonparametric Estimates of Densities in the Uniform Metric." Theory Probab. Appl. 23, 794-798. Lower bound for sup-norm density estimation; the source of the Stone rate.
  • Stone, C. J. (1980). "Optimal rates of convergence for nonparametric estimators." Annals of Statistics 8(6), 1348-1360. Companion to Khas'minskii on optimal rates.
  • Audibert, J.-Y., Tsybakov, A. (2007). "Fast learning rates for plug-in classifiers." Annals of Statistics 35, 608-633. Margin-condition-driven faster-than-1/\sqrt{n} rates that the standard Fano construction misses.
  • Cai, T. T., Liu, W., Luo, X. (2011). "A constrained 1\ell_1-minimization approach to sparse precision matrix estimation." JASA 106, 594-607. Minimax lower bound with explicit leading constant.


Last reviewed: April 26, 2026
