
Statistical Foundations

Minimax Lower Bounds: Le Cam, Fano, Assouad, and the Reduction to Testing

Why upper bounds are not enough. The reduction-to-testing principle, Le Cam's two-point method, Fano's inequality, the Assouad hypercube lemma, packing and metric-entropy bridges, and applications to nonparametric, sparse, and locally-private estimation.

Advanced · Tier 1 · Stable · Core spine · ~90 min

Why This Matters

An upper bound says: here is an estimator that works this well. A lower bound says: no estimator can do better. Together they characterize the minimax rate of a statistical problem, which is the right notion of fundamental difficulty: a rate-optimal estimator is one whose worst-case risk matches the lower bound up to constants.

Lower bounds tell you when to stop trying to improve. If your estimator achieves rate n^{-2s/(2s+d)} for s-smooth densities in \mathbb{R}^d and you can prove a matching Assouad lower bound, you know the rate is tight: no amount of cleverness can beat n^{-2s/(2s+d)}, and the curse of dimensionality (the d in the exponent) is statistical, not algorithmic.

In modern theory, lower bounds separate what is statistically possible from what is a failure of current algorithms. They are also the right tool for arguing that a model class is genuinely hard (sparse high-dimensional regression, nonparametric density estimation, locally private estimation, classification under adversarial perturbations, off-policy reinforcement learning).

Mental Model

Lower-bound arguments share one structural idea: estimation reduces to testing. If two parameter values \theta_0, \theta_1 are far apart in the loss metric but their distributions P_{\theta_0}^n, P_{\theta_1}^n are close in some statistical divergence, no estimator can reliably tell them apart, so no estimator can recover \theta with small loss on both.

The art is the reduction. Different methods package this idea differently:

  1. Le Cam (two hypotheses): reduce to a single binary test.
  2. Fano (M hypotheses): reduce to an M-ary test, controlled by mutual information.
  3. Assouad (hypercube): reduce to d simultaneous binary tests on the cube \{-1,+1\}^d.
  4. Packing / metric entropy (Yang-Barron, Birge): pick M via a covering argument and apply Fano.

Each method trades constructive complexity for tightness. Le Cam is the fastest to apply and gives parametric 1/n rates; Assouad and Fano are needed for high-dimensional and nonparametric rates.

Think of a two-panel template: the top panel is a loss-geometry construction, the bottom panel is the testing problem induced by the same construction, and the bridge between them is the reduction itself. Le Cam, Fano, and Assouad differ mostly in how many hypotheses they encode and how the testing-error lower bound is packaged.

Formal Setup

Definition

Minimax Risk

The minimax risk over a parameter space \Theta with loss \ell and sample size n is:

R_n^*(\Theta, \ell) = \inf_{\hat{\theta}} \sup_{\theta \in \Theta}\, \mathbb{E}_\theta\big[\ell(\hat{\theta}, \theta)\big].

The infimum is over all estimators \hat{\theta} = \hat{\theta}(X_1, \ldots, X_n) based on n samples; the supremum is the worst case over the parameter space. A minimax lower bound is any inequality R_n^* \geq r_n.

Definition

Total Variation Distance

For distributions P, Q on a common space,

\|P - Q\|_{TV} = \sup_A |P(A) - Q(A)| = \tfrac{1}{2}\int |dP - dQ|.

TV is bounded via Pinsker's inequality \|P - Q\|_{TV} \leq \sqrt{\tfrac{1}{2} D_{KL}(P\|Q)} and via Hellinger \|P - Q\|_{TV} \leq H(P,Q)\sqrt{1 - H^2(P,Q)/4} \leq H(P,Q). KL tensorizes additively and Hellinger via H^2(P^n, Q^n) = 2(1 - (1 - H^2(P,Q)/2)^n); TV does not tensorize, which is why Le Cam bounds always go through KL or Hellinger to get to the product distribution.
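A quick numerical sanity check of these inequalities for two unit-variance Gaussians, using the standard closed forms for TV, KL, and Hellinger in this model; the helper name and the grid of separations are illustrative choices, not from the text.

```python
# Check Pinsker and the Hellinger bound on TV for N(0,1) vs N(delta,1).
# Closed forms: TV = erf(|delta| / (2*sqrt(2))), KL = delta^2/2,
# Hellinger affinity = exp(-delta^2/8), so H^2 = 2*(1 - affinity) in the [0,2] convention.
import math

def gaussian_divergences(delta: float):
    tv = math.erf(abs(delta) / (2 * math.sqrt(2)))
    kl = delta ** 2 / 2
    h2 = 2 * (1 - math.exp(-delta ** 2 / 8))
    return tv, kl, h2

for delta in (0.1, 0.5, 1.0, 2.0):
    tv, kl, h2 = gaussian_divergences(delta)
    pinsker = math.sqrt(kl / 2)                    # TV <= sqrt(KL/2)
    hellinger = math.sqrt(h2) * math.sqrt(1 - h2 / 4)  # TV <= H * sqrt(1 - H^2/4)
    assert tv <= pinsker + 1e-12 and tv <= hellinger + 1e-12
    print(f"delta={delta:3.1f}  TV={tv:.4f}  Pinsker bound={pinsker:.4f}  Hellinger bound={hellinger:.4f}")
```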

The minimax framework is a two-player game: the statistician chooses an estimator (the inf), nature chooses the worst parameter (the sup), and the value of the game is R_n^*. A lower bound certifies that no matter what the statistician plays, nature can force the stated risk on some \theta.

The Reduction to Testing

The fundamental observation: if you cannot test, you cannot estimate.

Pick two parameter values \theta_0, \theta_1 with \ell(\theta_0, \theta_1) \geq 2s. Any estimator \hat\theta induces a binary test \psi = \mathbf{1}[\ell(\hat\theta, \theta_0) \geq s]. By the triangle property of \ell, the test correctly distinguishes \theta_0 from \theta_1 whenever \hat\theta is closer than s to the truth. So estimation is at least as hard as binary testing, and testing accuracy is governed by the statistical divergence between P_{\theta_0}^n and P_{\theta_1}^n. This is the entire engine.
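To make the reduction concrete, here is a small simulation sketch (the Gaussian location model and all the specific numbers are illustrative, not from the text): the sample mean stands in for an arbitrary estimator, the induced test \psi is the one defined above, and its average error is compared with the (1 - TV)/2 limit that drives Le Cam's bound below.

```python
# The estimator-induced test can never beat the (1 - TV)/2 limit on average error.
import math
import numpy as np

rng = np.random.default_rng(0)
n, theta0, theta1 = 50, 0.0, 0.3
s = abs(theta1 - theta0) / 2                       # half-separation in |.| loss

def induced_test(x):
    theta_hat = x.mean()                           # the estimator (sample mean)
    return 1 if abs(theta_hat - theta0) >= s else 0  # guess theta1 iff far from theta0

trials = 5000
err0 = np.mean([induced_test(rng.normal(theta0, 1, n)) == 1 for _ in range(trials)])
err1 = np.mean([induced_test(rng.normal(theta1, 1, n)) == 0 for _ in range(trials)])

# Exact TV between the two n-sample Gaussian product laws (sufficient statistic: sample mean).
tv_n = math.erf(abs(theta1 - theta0) * math.sqrt(n) / (2 * math.sqrt(2)))
print(f"avg test error = {(err0 + err1) / 2:.3f}  >=  Le Cam limit {(1 - tv_n) / 2:.3f}")
```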

The choice of which divergence (TV, KL, Hellinger, \chi^2, mutual information) governs which lower-bound method is most natural.

Le Cam Two-Point Method

Theorem

Le Cam Two-Point Method

Statement

Let \theta_0, \theta_1 \in \Theta with \ell(\theta_0, \theta_1) \geq 2s > 0. Then:

R_n^*(\Theta, \ell) \;\geq\; \frac{s}{2}\Big(1 - \|P_{\theta_0}^n - P_{\theta_1}^n\|_{TV}\Big).

Equivalent (via Pinsker) form: if D_{KL}(P_{\theta_0}\|P_{\theta_1}) \leq \alpha/n for some constant \alpha < \infty, then R_n^* \geq \tfrac{s}{2}(1 - \sqrt{\alpha/2}). Convention follows Yu (1997, Lemma 1) and Tsybakov (2009, Theorem 2.2): s is the half-separation in the loss; TV is normalized to [0,1]; the 1/2 comes from the average-error bound.

Intuition

If the two product distributions P_{\theta_0}^n and P_{\theta_1}^n are nearly indistinguishable (small TV), any estimator must err on at least one of them. Since the parameters are separated by 2s in loss, the error is of order s. The trick is to pick \theta_0, \theta_1 as far apart in loss as possible while keeping TV bounded away from 1; tuning the gap makes the lower bound rate-optimal.

Proof Sketch

For any \hat\theta: \sup_\theta \mathbb{E}_\theta[\ell(\hat\theta, \theta)] \geq \tfrac{1}{2}\big(\mathbb{E}_{\theta_0}[\ell(\hat\theta, \theta_0)] + \mathbb{E}_{\theta_1}[\ell(\hat\theta, \theta_1)]\big). Define \psi = \mathbf{1}[\ell(\hat\theta, \theta_0) < s]. By the triangle property \ell(\hat\theta, \theta_0) + \ell(\hat\theta, \theta_1) \geq 2s, the indicator \psi is a valid binary test for \theta_0 vs \theta_1, and each expected loss dominates s times the corresponding testing error. The average error of any binary test is bounded below by \tfrac{1}{2}(1 - \|P_{\theta_0}^n - P_{\theta_1}^n\|_{TV}), since by Neyman-Pearson the minimum sum of type I and type II errors equals 1 - \|P_{\theta_0}^n - P_{\theta_1}^n\|_{TV}. Substituting gives the displayed bound.

Why It Matters

Le Cam is the simplest lower-bound technique and almost always the first to try. For one-dimensional parametric problems it gives the right 1/n rate in one line, matching the Cramer-Rao bound up to constants. The Le Cam construction is the template every richer method extends.

Failure Mode

Le Cam uses only two hypotheses. When \Theta is rich, two points cannot capture the full difficulty: for the d-dimensional Gaussian mean problem Le Cam yields R_n^* \geq c/n, while the true rate is d/n. To capture dimension or smoothness, switch to Fano or Assouad.

Fano's Method

Fano extends Le Cam from 2 to M hypotheses, with the divergence replaced by mutual information.

Theorem

Fano's Inequality for Minimax Risk

Statement

Let \theta_1, \ldots, \theta_M be 2s-separated in \ell. Let T be uniform on \{1, \ldots, M\} and X^n \mid T = i \sim P_{\theta_i}^n. Then:

R_n^*(\Theta, \ell) \;\geq\; s \cdot \left(1 - \frac{I(X^n; T) + \log 2}{\log M}\right),

where I(X^n; T) is the mutual information between data and hypothesis index. The bound is non-trivial as soon as I(X^n; T) \leq \tfrac{1}{2}\log M (once M \geq 5). A standard upper bound is I(X^n; T) \leq \tfrac{1}{M^2}\sum_{i,j} D_{KL}(P_{\theta_i}^n \| P_{\theta_j}^n), which lets you control mutual information by pairwise KL.

Intuition

Fano is information-theoretic: from n samples you can extract at most I(X^n; T) bits about the hypothesis T. Recovering T requires at least \log M - O(1) bits, so when I(X^n; T) \ll \log M the testing problem is genuinely hard. Packing many hypotheses into \Theta raises the bit requirement; keeping their distributions close keeps mutual information small. The art is in the packing.

Proof Sketch

Reduce estimation to M-ary testing as in Le Cam: define \hat T = \arg\min_i \ell(\hat\theta, \theta_i). By 2s-separation, \hat T \neq T forces \ell(\hat\theta, \theta_T) \geq s. The classical Fano inequality states H(T \mid X^n) \leq h_2(P_e) + P_e \log(M-1), where P_e is the testing error and h_2 is the binary entropy. Since H(T \mid X^n) = \log M - I(X^n; T), solving for P_e and substituting yields the displayed risk bound.

Why It Matters

Fano is the workhorse for minimax rates in nonparametric estimation, high-dimensional sparse problems, and learning theory. The proof template is universal: pick a packing of \Theta at scale s, bound the mutual information from above, and read off the rate. Yang and Barron (1999) turn this template into a theorem that translates metric-entropy estimates into minimax rates without case-by-case bookkeeping.
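As a concrete instance of the template, here is a sketch of the Fano calculation for the d-dimensional Gaussian mean on a Varshamov-Gilbert packing of the hypercube (the standard fact: a subset of \{0,1\}^d with \log M \geq (d/8)\log 2 and pairwise Hamming distance at least d/8). The helper name, the tuning of \delta, and the specific numbers are illustrative, and the constants are deliberately loose; only the d/n rate is meaningful.

```python
# Generic Fano calculator plus a worked Gaussian-mean instance.
import math

def fano_lower_bound(s: float, log_M: float, mutual_info_bound: float) -> float:
    """Risk >= s * (1 - (I + log 2) / log M), clipped at 0 when vacuous."""
    return max(0.0, s * (1 - (mutual_info_bound + math.log(2)) / log_M))

d, n = 128, 1000
delta = math.sqrt(math.log(2) / (8 * n))   # tuned so n*d*delta^2/2 <= (1/2) log M
log_M = (d / 8) * math.log(2)              # Varshamov-Gilbert packing size
min_sep_sq = (d / 8) * delta ** 2          # minimum squared l2 separation of the packing
I_bound = n * d * delta ** 2 / 2           # mutual information bound via max pairwise KL
bound = fano_lower_bound(s=min_sep_sq / 4, log_M=log_M, mutual_info_bound=I_bound)
print(f"Fano bound {bound:.2e}  vs  d/n = {d/n:.2e}  (same d/n rate, loose constant)")
```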

Failure Mode

Fano requires constructing a packing whose pairwise KL is small. For problems with fast-mixing chains, complex dependencies between coordinates, or non-iid data, the KL bound can be loose and produce sub-optimal rates. Refined versions (Birge's local Fano, Yang-Barron with adaptive packing) and the chi-squared variant of Tsybakov often recover tight rates where vanilla Fano fails.

The Assouad Hypercube Lemma

When the parameter is a d-dimensional vector and the loss decomposes coordinate-wise, Assouad reduces estimation to d parallel Le Cam tests.

Lemma

Assouad Lemma

Statement

Let \Theta be indexed by \tau \in \{-1, +1\}^d via \theta_\tau, with a loss that separates across coordinates: for every estimator \hat\theta there is a label \hat\tau = \hat\tau(\hat\theta) \in \{-1,+1\}^d with

\ell(\hat\theta, \theta_\tau) \;\geq\; \sum_{j=1}^d s_j \cdot \mathbf{1}[\hat\tau_j \neq \tau_j].

For each coordinate j, let \bar P_{j,\pm}^n denote the mixture of P_{\theta_\tau}^n over \tau with \tau_j = \pm 1. Then:

R_n^*(\Theta, \ell) \;\geq\; \frac{1}{2}\sum_{j=1}^d s_j \cdot \Big(1 - \big\|\bar P_{j,+}^n - \bar P_{j,-}^n\big\|_{TV}\Big).

Convention: Yu (1997, Lemma 2), Tsybakov (2009, Theorem 2.12), Wainwright (2019, Proposition 15.5).

Intuition

The decomposable loss splits the d-dimensional estimation problem into d parallel coordinate-wise binary tests. Each coordinate contributes a Le Cam-style s_j (1 - \mathrm{TV}) term; the total bound is the sum. Dimension enters multiplicatively, which is exactly what gives d/n rates for high-dimensional Gaussian mean estimation.

Proof Sketch

Decompose the worst-case loss across coordinates. For coordinate j, the sub-problem is testing whether \tau_j = +1 or \tau_j = -1 given X^n when the other coordinates are drawn uniformly. The mixture distribution collapses the nuisance coordinates and reduces the problem to a Le Cam binary test between \bar P_{j,+}^n and \bar P_{j,-}^n. Apply Le Cam to each coordinate and sum over j.

Why It Matters

Assouad is the tool for high-dimensional and nonparametric lower bounds where the loss is naturally additive (squared \ell_2, \ell_1, Hamming, sup-norm of a smooth-function approximation). The famous d/n rate for the dense Gaussian-mean problem, the n^{-2s/(2s+d)} rate for s-smooth density estimation in d dimensions, and most adversarial-perturbation lower bounds use Assouad.

Failure Mode

Assouad needs the loss to decompose. For matrix-valued losses (operator norm in covariance estimation), regularized losses, or losses with coupling across coordinates, Assouad does not apply directly. The Birge-Massart approach replaces the hypercube with a packing argument that handles such losses with a Fano-style information bound; Cai-Zhou (2012) for sparse covariance estimation is a representative application.

Packing, Metric Entropy, and the Yang-Barron Bridge

The Fano method requires a packing of \Theta at scale s. The metric-entropy literature gives a systematic way to construct such packings and to translate them into minimax rates.

Definition

Packing Number

Given a metric space (\Theta, \rho), the packing number M(\Theta, s) is the largest M such that there exist \theta_1, \ldots, \theta_M \in \Theta with \rho(\theta_i, \theta_j) \geq s for all i \neq j. The metric entropy at scale s is \log M(\Theta, s).
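A minimal sketch of the definition in code (the helper is hypothetical, not from the literature cited here): a single greedy pass produces an s-separated set, so its size lower-bounds the packing number M(\Theta, s).

```python
# Greedy s-packing of a finite point cloud in Euclidean distance.
import math
import numpy as np

def greedy_packing(points: np.ndarray, s: float) -> list:
    """Return indices of points kept by the greedy rule: keep a point iff it is >= s from all kept points."""
    kept = []
    for i, p in enumerate(points):
        if all(np.linalg.norm(p - points[j]) >= s for j in kept):
            kept.append(i)
    return kept

# Example: pack the unit interval [0, 1] at scale s = 0.1; metric entropy ~ log(1/s).
grid = np.linspace(0, 1, 1001).reshape(-1, 1)
packing = greedy_packing(grid, s=0.1)
print(f"M = {len(packing)},  log M = {math.log(len(packing)):.2f},  log(1/s) = {math.log(10):.2f}")
```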

Theorem

Yang-Barron Bridge: Metric Entropy to Minimax Rate

Statement

Suppose D_{KL}(P_\theta \| P_{\theta'}) \leq L \rho^2(\theta, \theta') for \theta, \theta' in a class \Theta. Define the critical scale s_n as the solution of \log M(\Theta, s_n) \asymp n L s_n^2. Then the minimax risk satisfies R_n^*(\Theta, \rho^2) \asymp s_n^2. The upper bound is achieved by an \varepsilon-net plus likelihood-test estimator (Yang-Barron 1999); the lower bound follows from Fano applied to a packing at scale s_n.

Intuition

The critical scale s_n balances two competing forces. Large s_n makes the packing sparse (low metric entropy, few hypotheses) and the per-pair KL large (easy to test). Small s_n makes the packing dense (high metric entropy, many hypotheses) and the per-pair KL small (hard to test). The Fano denominator is \log M(\Theta, s_n), the numerator is proportional to n L s_n^2, and the optimal trade-off equates the two.

Proof Sketch

The lower bound: pick a packing of size M(\Theta, s_n) at scale s_n. The pairwise per-sample KL is at most L s_n^2, so the mutual information satisfies I(X^n; T) \leq nL s_n^2. By the critical-scale equation nL s_n^2 \asymp \log M(\Theta, s_n), the Fano bound stays bounded away from zero, giving R_n^* \gtrsim s_n^2. The upper bound: a likelihood test over an \varepsilon = s_n/2 net of \Theta selects a net point within O(s_n) of the truth with high probability; the resulting estimator incurs squared error O(s_n^2).

Why It Matters

Yang-Barron promotes metric-entropy estimates from nonparametric statistics (Kolmogorov-Tikhomirov, Birge, van der Vaart-Wellner) into minimax rates without bespoke argumentation per problem. Most modern minimax lower-bound papers either invoke this bridge directly or follow its structure: estimate the metric entropy of \Theta, plug into Fano. The same template gives sharp rates for Holder classes, Sobolev balls, RKHS unit balls, sparse vectors, low-rank matrices, monotone functions, convex functions, and shape-constrained estimation.

Failure Mode

The bridge requires a clean KL-vs-metric coupling D_{KL}(P_\theta\|P_{\theta'}) \asymp \rho^2(\theta, \theta'). Heavy-tailed observation models, models with non-regular likelihoods (e.g., the uniform with unknown endpoint), and high-dimensional models with non-product structure can violate this coupling, in which case bespoke arguments using \chi^2 divergence (Tsybakov 2009, Section 2.7) or hybrid two-stage Fano-Assouad bounds are needed.

How to Pick a Method

Three diagnostic questions:

  1. Is the parameter space one-dimensional or low-dimensional? Use Le Cam.
  2. Does the loss decompose coordinate-wise? Use Assouad.
  3. Is the parameter space a smooth class with known metric entropy? Use Yang-Barron / Fano with a packing.

Beyond the diagnostic, the construction is the art: choosing two hypotheses (Le Cam), the right hypercube embedding (Assouad), or the right packing (Fano) so that the separation is large and the divergences are small. Tsybakov's Introduction to Nonparametric Estimation (2009) is the canonical guide; Wainwright's High-Dimensional Statistics (2019, Ch. 15) covers the modern high-dimensional and sparse cases.

Canonical Examples

Example

Le Cam for Gaussian mean estimation

X_1, \ldots, X_n \sim \mathcal{N}(\theta, 1), \theta \in \mathbb{R}. Take \theta_0 = -\delta/2, \theta_1 = +\delta/2. The KL between the product distributions is D_{KL}(P_{\theta_0}^n \| P_{\theta_1}^n) = n\delta^2/2, so by Pinsker \|P_{\theta_0}^n - P_{\theta_1}^n\|_{TV} \leq \delta\sqrt{n}/2. Setting \delta = c/\sqrt{n} keeps TV bounded by a constant < 1, and Le Cam gives R_n^* \geq c'/n for squared error. This matches the sample-mean upper bound \mathrm{Var}(\bar X) = 1/n.
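A numeric rendering of this example (the constant c = 1 is an arbitrary illustrative choice): compute the exact TV between the two n-sample product laws and evaluate the Le Cam bound next to the sample-mean risk 1/n.

```python
# Le Cam two-point bound for the Gaussian mean with delta = c / sqrt(n).
import math

def le_cam_gaussian_bound(n: int, c: float = 1.0) -> float:
    delta = c / math.sqrt(n)
    tv = math.erf(delta * math.sqrt(n) / (2 * math.sqrt(2)))  # exact TV between the product laws
    s = delta ** 2 / 2                                        # half-separation in squared-error loss
    return (s / 2) * (1 - tv)

for n in (10, 100, 1000, 10000):
    print(f"n={n:6d}  Le Cam bound={le_cam_gaussian_bound(n):.2e}  sample-mean risk={1/n:.2e}")
```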

Example

Assouad for d-dimensional Gaussian mean

X_i \sim \mathcal{N}(\theta, I_d), \theta \in \mathbb{R}^d. Take \theta_\tau = (\delta/\sqrt{n})\tau for \tau \in \{-1,+1\}^d. Squared \ell_2 loss decomposes coordinate-wise with s_j = \delta^2/n. The mixture-pair TV at coordinate j is bounded using the joint KL D_{KL}(\bar P_{j,+}^n \| \bar P_{j,-}^n) \leq 2\delta^2 via Jensen, giving TV \leq c < 1. Assouad sums to R_n^* \gtrsim d\delta^2/n = c''d/n. This matches the upper bound from the sample mean (\mathbb{E}\|\bar X - \theta\|_2^2 = d/n). The dimension dependence is statistical, not algorithmic.
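The same calculation in code (\delta = 1/2 is an arbitrary illustrative choice; the mixture KL is replaced by the 2\delta^2 upper bound quoted above, and the mixture TV by Pinsker):

```python
# Assouad bound for the d-dimensional Gaussian mean vs the sample-mean risk d/n.
import math

def assouad_gaussian_bound(d: int, n: int, delta: float = 0.5) -> float:
    s_j = delta ** 2 / n                 # per-coordinate squared-error separation
    kl_mix = 2 * delta ** 2              # upper bound on the coordinate-j mixture KL over n samples
    tv_mix = math.sqrt(kl_mix / 2)       # Pinsker
    return 0.5 * d * s_j * max(0.0, 1 - tv_mix)

d, n = 64, 1000
print(f"Assouad bound={assouad_gaussian_bound(d, n):.2e}  sample-mean risk d/n={d/n:.2e}")
```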

Example

Fano + Yang-Barron for Holder density estimation

For estimating a density on [0,1]^d in the Holder ball \mathcal{H}^s(L) of smoothness s, the Kolmogorov metric entropy at scale \varepsilon is \log M(\mathcal{H}^s(L), \varepsilon) \asymp \varepsilon^{-d/s}. Coupling KL to the squared \mathcal{L}^2 metric and solving the Yang-Barron critical-scale equation \varepsilon^{-d/s} \asymp n \varepsilon^2 gives \varepsilon_n \asymp n^{-s/(2s+d)} and R_n^* \asymp n^{-2s/(2s+d)}. This is the canonical curse-of-dimensionality rate; the matching upper bound is achieved by kernel density estimators with bandwidth h \asymp n^{-1/(2s+d)}.
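A sketch of the critical-scale balance with constants dropped (so only the rate is meaningful): solve \varepsilon^{-d/s} = n\varepsilon^2 numerically and compare with the closed form n^{-s/(2s+d)}.

```python
# Bisection on the Yang-Barron balance for the Holder ball.
def critical_scale(n: float, s: float, d: float, iters: int = 200) -> float:
    lo, hi = 1e-12, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mid ** (-d / s) > n * mid ** 2:   # entropy still exceeds the information budget
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n, s, d = 10_000, 2.0, 3.0
eps_n = critical_scale(n, s, d)
print(f"numerical eps_n={eps_n:.4f}   closed form n^(-s/(2s+d))={n ** (-s / (2*s + d)):.4f}")
print(f"rate eps_n^2={eps_n**2:.2e}   n^(-2s/(2s+d))={n ** (-2*s / (2*s + d)):.2e}")
```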

Example

Frontier: Locally private mean estimation (Duchi-Jordan-Wainwright 2018)

Under \alpha-local differential privacy, each user submits a noisy view Z_i = Q(X_i) with Q satisfying the local-DP constraint \sup_{x,x',S} P(Q(x) \in S)/P(Q(x') \in S) \leq e^\alpha. Duchi, Jordan, and Wainwright (JASA 2018, 113:182-201) show that for mean estimation on [-1,1]^d, the local-DP minimax rate is R_n^*(\alpha) \asymp d/(n \min(\alpha^2, \alpha)), strictly slower than the non-private rate d/n when \alpha \to 0. The proof uses an Assouad construction with a strong-data-processing inequality that converts local DP into a contractive bound on the per-coordinate KL, giving the \alpha^2 price of privacy in the small-\alpha regime.
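A one-dimensional simulation sketch of the privacy price. It uses a simple binary randomized-response mechanism that is \alpha-LDP and unbiased for values in [-1,1]; this is an illustration with arbitrary parameter choices, not the estimator or construction from the paper. The private mean's MSE inflates by roughly c_\alpha^2 \approx 4/\alpha^2 relative to the non-private sample mean, matching the 1/(n\alpha^2) scaling.

```python
# Randomized-response style alpha-LDP release of x in [-1,1], unbiased by construction.
import math
import numpy as np

def private_mean(x: np.ndarray, alpha: float, rng) -> float:
    c = (math.exp(alpha) + 1) / (math.exp(alpha) - 1)               # output magnitude
    p = 0.5 + x * (math.exp(alpha) - 1) / (2 * (math.exp(alpha) + 1))
    z = c * np.where(rng.random(x.shape) < p, 1.0, -1.0)            # E[z | x] = x, likelihood ratio <= e^alpha
    return z.mean()

rng = np.random.default_rng(1)
n, alpha, theta, reps = 5000, 0.5, 0.3, 2000
mse_priv, mse_plain = [], []
for _ in range(reps):
    x = np.where(rng.random(n) < (1 + theta) / 2, 1.0, -1.0)        # X_i in {-1,+1} with mean theta
    mse_priv.append((private_mean(x, alpha, rng) - theta) ** 2)
    mse_plain.append((x.mean() - theta) ** 2)

c_alpha = (math.exp(alpha) + 1) / (math.exp(alpha) - 1)
print(f"private MSE ~ {np.mean(mse_priv):.2e}   non-private MSE ~ {np.mean(mse_plain):.2e}")
print(f"ratio ~ {np.mean(mse_priv)/np.mean(mse_plain):.1f}   (roughly c_alpha^2 = {c_alpha**2:.1f})")
```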

Common Confusions

Watch Out

A lower bound is about the best estimator, not all estimators

The minimax lower bound says: the best worst-case risk is at least this large. It does not say every estimator achieves this rate. Most estimators are much worse than minimax-optimal. Stein's paradox makes this concrete: the MLE for a Gaussian mean in dimension d \geq 3 is dominated by the James-Stein shrinkage estimator under squared-error loss, even though both are minimax rate-optimal at d/n.
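A small simulation sketch of the Stein point (the dimension, sample size, and chosen \theta are arbitrary): the positive-part James-Stein estimator beats the sample mean in risk at this \theta, yet both have worst-case risk of order d/n.

```python
# Risk of the sample mean vs positive-part James-Stein at a fixed theta.
import numpy as np

rng = np.random.default_rng(2)
d, n, reps = 20, 50, 5000
theta = np.full(d, 0.2)                                          # a fixed, moderately small mean

risk_mle, risk_js = 0.0, 0.0
for _ in range(reps):
    xbar = theta + rng.normal(0, 1, d) / np.sqrt(n)              # sample mean ~ N(theta, I/n)
    shrink = max(0.0, 1 - (d - 2) / (n * np.sum(xbar ** 2)))     # positive-part James-Stein factor
    risk_mle += np.sum((xbar - theta) ** 2) / reps
    risk_js += np.sum((shrink * xbar - theta) ** 2) / reps
print(f"MLE risk ~ {risk_mle:.3f} (d/n = {d/n:.3f})   James-Stein risk ~ {risk_js:.3f}")
```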

Watch Out

Minimax optimality is worst-case, not problem-specific

A minimax-optimal estimator may be conservative on instances that are easier than the worst case. Adaptive estimators (e.g., Lepski's method, Birge-Massart model selection) try to achieve near-minimax rates on every instance simultaneously, paying only logarithmic factors for adaptation. The right estimator for your problem may not be the minimax estimator for the surrounding class.

Watch Out

TV, KL, Hellinger, and chi-squared are not interchangeable

Le Cam uses TV, Fano uses mutual information (and pairwise KL), Yang-Barron uses KL paired with a metric, and Tsybakov's chi-squared method uses \chi^2 for tighter bounds when KL diverges. The relationships \|P-Q\|_{TV} \leq \sqrt{D_{KL}/2} (Pinsker), H^2 \leq D_{KL}, D_{KL} \leq \chi^2 are one-way; substituting the wrong divergence into the wrong method gives bounds that are correct but loose, sometimes by a polynomial factor.

Watch Out

Le Cam can give tight parametric rates but not high-dimensional rates

For one-dimensional Gaussian or Bernoulli mean estimation, Le Cam with two points already gives the tight 1/n rate. For the d-dimensional Gaussian mean, Le Cam with two points still only gives 1/n, missing the factor of d. The cure is Assouad (or Fano with a packing), not a cleverer two-point construction.

Watch Out

Lower bounds need explicit constants for adaptive matching

Many published lower bounds prove R_n^* \geq c \cdot \text{rate}_n for some unspecified constant c. For practical purposes (deciding whether a new estimator is worth using), the constant matters. The chi-squared method (Tsybakov 2009, Sec. 2.7) and the localized Fano of Birge-Massart give sharper constants and are needed when matching the leading constant of an upper bound.

Watch Out

Lower bounds care about what you observe, not what you can compute

Minimax lower bounds are information-theoretic: they bound what any estimator can extract from the data, ignoring computation. A computationally tractable estimator can still be statistically optimal (Lasso for sparse linear regression). Computational lower bounds (planted clique, statistical query model) are a separate research program: see Brennan-Bresler-Huleihel 2018 and follow-ups.

Summary

  • Minimax risk R_n^* = \inf_{\hat\theta} \sup_\theta \mathbb{E}[\ell(\hat\theta, \theta)] is the right notion of fundamental difficulty.
  • All lower-bound methods reduce estimation to testing: if you cannot test, you cannot estimate.
  • Le Cam (two hypotheses, TV) gives parametric 1/n rates in one line.
  • Fano (M hypotheses, mutual information) handles rich parameter spaces; Yang-Barron turns metric-entropy estimates into rates.
  • Assouad (hypercube, sum of binary tests) handles decomposable losses and high-dimensional rates.
  • The art is in the construction: pick hypotheses far in loss but close in divergence.
  • Lower bounds are information-theoretic; matching them is a separate computational question.

Exercises

ExerciseCore

Problem

Use Le Cam to prove R_n^* \geq c/n for estimating the mean of X_i \sim \text{Bernoulli}(p), p \in [0,1], under squared-error loss. Give an explicit constant c.

ExerciseCore

Problem

For d-dimensional Gaussian mean estimation under squared \ell_2 loss, explain in one paragraph why Le Cam fails to recover the d/n rate and how Assouad recovers it. State which line of the Assouad proof picks up the factor of d.

ExerciseAdvanced

Problem

Prove a Fano lower bound for sparse linear regression: observations y = X\beta^* + \varepsilon with X \in \mathbb{R}^{n \times d} a Gaussian design, \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n), and \beta^* k-sparse with \|\beta^*\|_2 \leq 1. Show that the minimax rate for squared \ell_2 loss is at least \Omega(k \log(d/k)/n).

ExerciseAdvanced

Problem

Show that the Yang-Barron critical-scale equation \log M(\Theta, s_n) \asymp n L s_n^2 produces the rate R_n^* \asymp n^{-2s/(2s+d)} when \Theta is the Holder ball \mathcal{H}^s(L) on [0,1]^d with metric entropy \log M(\mathcal{H}^s, \varepsilon) \asymp \varepsilon^{-d/s}.

ExerciseResearch

Problem

The Duchi-Jordan-Wainwright 2018 local-DP lower bound shows that under \alpha-LDP the minimax rate for d-dimensional mean estimation degrades from d/n to d/(n \min(\alpha^2, \alpha)). State precisely which step of the Assouad argument changes when local DP is imposed, and show how the strong-data-processing inequality for \alpha-LDP channels recovers the \alpha^2 contraction.


References

Canonical:

  • Tsybakov, A. (2009). Introduction to Nonparametric Estimation. Springer. Chapter 2 is the canonical exposition; Section 2.6-2.7 cover Fano, chi-squared, and the Yang-Barron bridge.
  • Le Cam, L. (1973). "Convergence of Estimates Under Dimensionality Restrictions." Annals of Statistics 1, 38-53. The original Le Cam two-point method.
  • Yu, B. (1997). "Assouad, Fano, and Le Cam." In Festschrift for Lucien Le Cam, 423-435. Springer. Unified exposition of the three core methods.
  • Fano, R. M. (1961). Transmission of Information: A Statistical Theory of Communications. MIT Press. Original Fano inequality.
  • Assouad, P. (1983). "Deux remarques sur l'estimation." Comptes Rendus Acad. Sci. Paris 296, 1021-1024. Original Assouad lemma.
  • Yang, Y., Barron, A. (1999). "Information-theoretic determination of minimax rates of convergence." Annals of Statistics 27(5), 1564-1599. The metric-entropy-to-minimax-rate bridge.

Current and applications:

  • Wainwright, M. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge. Chapter 15 covers minimax lower bounds with modern high-dimensional applications (sparse regression, low-rank matrix estimation).
  • Duchi, J. (lecture notes). Information-Theoretic Methods in Statistics. Stanford. Modern treatment with private and distributed estimation.
  • Duchi, J., Jordan, M. I., Wainwright, M. J. (2018). "Minimax Optimal Procedures for Locally Private Estimation." JASA 113, 182-201. Local-DP minimax rates via strong-data-processing Assouad.
  • Cai, T. T., Zhou, H. H. (2012). "Optimal rates of convergence for sparse covariance matrix estimation." Annals of Statistics 40(5), 2389-2420. Block-Assouad construction for matrix estimation.
  • Birge, L. (1983). "Approximation dans les espaces metriques et theorie de l'estimation." Z. Wahrsch. Verw. Gebiete 65, 181-237. Local Fano with refined constants.
  • Birge, L., Massart, P. (1998). "Minimum contrast estimators on sieves: exponential bounds and rates of convergence." Bernoulli 4, 329-375. Adaptive minimax rates via model selection.
  • van der Vaart, A., Wellner, J. (1996). Weak Convergence and Empirical Processes. Springer. Metric-entropy estimates for empirical-process classes.
  • Brennan, M., Bresler, G., Huleihel, W. (2018). "Reducibility and Computational Lower Bounds for Problems with Limited Dependence on Sparsity." COLT 2018. Computational vs statistical lower bounds.

Critique and refinements:

  • Polyanskiy, Y., Wu, Y. (2024 draft). Information Theory: From Coding to Learning. Cambridge. Chapter 32 unifies classical and modern minimax via ff-divergences and gives sharper Fano variants.
  • Verdu, S., Han, T. S. (1994). "A general formula for channel capacity." IEEE Trans. Inf. Theory 40(4), 1147-1157. Information-theoretic foundations beyond iid.
  • Khas'minskii, R. Z. (1979). "A Lower Bound on the Risks of Nonparametric Estimates of Densities in the Uniform Metric." Theory Probab. Appl. 23, 794-798. Lower bound for sup-norm density estimation; the source of the Stone rate.
  • Stone, C. J. (1980). "Optimal rates of convergence for nonparametric estimators." Annals of Statistics 8(6), 1348-1360. Companion to Khas'minskii on optimal rates.
  • Audibert, J.-Y., Tsybakov, A. (2007). "Fast learning rates for plug-in classifiers." Annals of Statistics 35, 608-633. Margin-condition-driven faster-than-1/\sqrt{n} rates that the standard Fano construction misses.
  • Cai, T. T., Liu, W., Luo, X. (2011). "A constrained 1\ell_1-minimization approach to sparse precision matrix estimation." JASA 106, 594-607. Minimax lower bound with explicit leading constant.


Last reviewed: April 26, 2026
