

Fisher Information: Curvature, KL Geometry, and the Natural Gradient

The Fisher information measures how much a sample tells you about an unknown parameter. Coverage of the score function and Bartlett identities, the equivalence of variance-of-score and expected-negative-Hessian forms, the KL-divergence Hessian identity, the Fisher information matrix and Loewner ordering, the Cramér-Rao link, the natural gradient (parameterization invariance), K-FAC and empirical Fisher in deep learning, and applications to MLE asymptotic efficiency, EWC, and information geometry.


Why This Matters

The Fisher information $I(\theta)$ is the local geometric object that controls how well a parametric model can be estimated. It plays three roles, all derived from the same definition. First, it is the variance of the score function, which sets the Cramér-Rao lower bound on the variance of any unbiased estimator: $\mathrm{Var}(\hat\theta) \geq 1/(nI(\theta))$. Second, it is the expected curvature of the log-likelihood at the truth, which is what the maximum likelihood estimator sees as it converges: $\sqrt n (\hat\theta_n - \theta_0) \Rightarrow \mathcal N(0, I(\theta_0)^{-1})$. Third, it is the Hessian of the KL divergence at the diagonal: $I(\theta) = \nabla^2_\eta \mathrm{KL}(p_\theta \,\|\, p_\eta) \big|_{\eta = \theta}$, which makes it the natural Riemannian metric on the statistical manifold and the source of the natural gradient.

In modern deep learning, the Fisher information matrix appears in three central tools: K-FAC (Kronecker-Factored Approximate Curvature) for second-order optimization, the natural policy gradient (TRPO, PPO antecedents), and Elastic Weight Consolidation for continual learning. The empirical Fisher (computed on training data, not on samples drawn from the model) is a cheap surrogate that has its own correctness caveats; understanding the gap matters for debugging modern training pipelines.

Mental Model

The log-likelihood $\ell(\theta) = \log p(X \mid \theta)$ is a random function whose shape near the true $\theta_0$ encodes how well the data identifies $\theta$. Three equivalent perspectives:

  • Variance of the score ($I = \mathbb E[s^2]$): the score $s = \partial_\theta \ell$ is a mean-zero random variable; large variance means individual samples push the estimate around in different directions, but with high amplitude (informative).
  • Expected curvature ($I = -\mathbb E[\partial_\theta^2 \ell]$): the log-likelihood landscape is sharply peaked around $\theta_0$, so the optimum is well-localized.
  • KL Hessian ($I = \nabla^2 \mathrm{KL}(p_{\theta_0} \,\|\, p_\eta) \big|_{\eta = \theta_0}$): nearby parameter values produce distributions that diverge sharply (in KL) as you move away from $\theta_0$, so they are statistically distinguishable.

The equivalence of the first two forms is the second Bartlett identity (the first identity, $\mathbb E[s] = 0$, supplies the mean-zero step); the equivalence with the third is the KL Hessian identity. All three say the same thing: high Fisher information means the data discriminates parameter values precisely. Low Fisher information means a flat likelihood landscape, ambiguous parameter values, and slow estimation.

Form, definition, and what it tells you:

  • Variance of score: $I(\theta) = \mathbb E_\theta[s(\theta;X)^2]$. How much one observation pushes the estimate around; large variance means each datum carries strong signal about $\theta$.
  • Negative expected Hessian: $I(\theta) = -\mathbb E_\theta[\nabla^2_\theta \log p(X;\theta)]$. How sharply peaked the log-likelihood is at the truth; large curvature means the optimum is well-localized.
  • Hessian of KL divergence: $I(\theta) = \nabla^2_\eta \mathrm{KL}(p_\theta \,\|\, p_\eta)\big|_{\eta = \theta}$. How fast nearby distributions diverge in KL; this is the Riemannian metric on the statistical manifold and the source of the natural gradient.

All three are equal under Cramér regularity. Pick the form that matches the question you are asking: variance-of-score for plug-in estimates from data, negative-Hessian for second-order optimization, KL-Hessian for parameterization-invariant geometry.
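
A minimal numerical check, assuming NumPy and arbitrary constants, that the three forms agree for a Bernoulli($\theta$) model, where the closed form is $I(\theta) = 1/[\theta(1-\theta)]$:

```python
import numpy as np

# Check that variance-of-score, negative expected Hessian, and the KL Hessian
# all give I(theta) = 1 / (theta * (1 - theta)) for a Bernoulli model.
theta = 0.3
rng = np.random.default_rng(0)
x = rng.binomial(1, theta, size=1_000_000).astype(float)

# Form 1: variance of the score s(theta; x) = x/theta - (1-x)/(1-theta), Monte Carlo.
score = x / theta - (1 - x) / (1 - theta)
I_score = np.var(score)

# Form 2: negative expected Hessian of the log-likelihood, Monte Carlo.
hessian = -x / theta**2 - (1 - x) / (1 - theta) ** 2
I_hess = -np.mean(hessian)

# Form 3: second derivative of KL(p_theta || p_eta) at eta = theta, central differences.
def kl(t, e):
    return t * np.log(t / e) + (1 - t) * np.log((1 - t) / (1 - e))

h = 1e-4
I_kl = (kl(theta, theta + h) - 2 * kl(theta, theta) + kl(theta, theta - h)) / h**2

print(I_score, I_hess, I_kl, 1 / (theta * (1 - theta)))  # all ~ 4.76
```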

The Fisher information matrix $I(\theta) \in \mathbb R^{d \times d}$ is the multivariate generalization. Its eigenstructure tells you which parameter directions are well-identified (large eigenvalues, sharp curvature) and which are weakly identified (small eigenvalues, flat directions, ill-conditioned for ordinary gradient descent). Preconditioning the gradient by $I^{-1}$ gives the natural gradient, which descends in distribution space rather than parameter space.

Formal Setup

Let $X$ be a random variable with density $p(x \mid \theta)$ for $\theta \in \Theta \subseteq \mathbb R$ (scalar case; we then generalize to $\theta \in \mathbb R^d$). Assume Cramér regularity: the support of $p$ does not depend on $\theta$, $\log p(x \mid \theta)$ is twice continuously differentiable in $\theta$ for almost every $x$, and differentiation under the integral is valid.

Definition

Score function

The score function is the gradient of the log-density:

$$s(\theta; x) = \nabla_\theta \log p(x \mid \theta).$$

Under Cramér regularity, $\mathbb E_\theta[s(\theta; X)] = 0$ for all $\theta$ (the first Bartlett identity). The score is a mean-zero random variable in $x$ that quantifies the local sensitivity of the likelihood to $\theta$.

Definition

Fisher information (scalar)

The Fisher information is the variance of the score:

$$I(\theta) = \mathrm{Var}_\theta[s(\theta; X)] = \mathbb E_\theta[s(\theta; X)^2].$$

For an i.i.d. sample $X_1, \ldots, X_n$, the total Fisher information of the sample is $n I(\theta)$ (additivity, see below).

Definition

Fisher information matrix

For $\theta \in \mathbb R^d$, the Fisher information matrix is the covariance of the score vector:

$$[I(\theta)]_{jk} = \mathbb E_\theta\!\left[\frac{\partial \log p(X \mid \theta)}{\partial \theta_j} \cdot \frac{\partial \log p(X \mid \theta)}{\partial \theta_k}\right].$$

The matrix is positive semidefinite (it is a covariance). It is positive definite under local identifiability at $\theta$, i.e., the score has full-rank covariance, so no nontrivial direction in parameter space leaves the log-likelihood locally flat. Global identifiability (different $\theta$ give different distributions) does not guarantee $I(\theta) \succ 0$: the parameterization $\mathcal{N}(\theta^3, 1)$ is globally identifiable, yet the score with respect to $\theta$ vanishes at $\theta = 0$ and $I(0) = 0$.

Definition

Observed information

The observed information is the negative Hessian of the log-likelihood evaluated on the actual data:

$$J(\theta; x) = -\nabla^2_\theta \log p(x \mid \theta).$$

For an i.i.d. sample, $J_n(\theta; X_{1:n}) = -\sum_{i=1}^n \nabla^2_\theta \log p(X_i \mid \theta)$. The observed information is random (a function of the data), unlike the Fisher information $I(\theta) = \mathbb E[J_1(\theta; X)]$, which is deterministic. By the law of large numbers, $\frac{1}{n} J_n(\theta; X_{1:n}) \to I(\theta)$ a.s., so the two agree asymptotically.
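
A quick sketch (NumPy, arbitrary constants) of that convergence for the Poisson family, where $J(\lambda; x) = x/\lambda^2$ and $I(\lambda) = 1/\lambda$:

```python
import numpy as np

# Observed information is random; its per-sample average converges to the
# deterministic Fisher information. Poisson example: I(lambda) = 1 / lambda.
lam = 2.5
rng = np.random.default_rng(1)

for n in (10, 1_000, 100_000):
    x = rng.poisson(lam, size=n)
    J_n = np.sum(x / lam**2)        # observed information of the whole sample
    print(n, J_n / n, 1 / lam)      # (1/n) J_n -> I(lambda) = 0.4
```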

Definition

Empirical Fisher

The empirical Fisher is the average outer product of the score over the training data (an uncentered sample second moment):

$$\hat I_{\mathrm{emp}}(\theta) = \frac{1}{n} \sum_{i=1}^n s(\theta; X_i)\, s(\theta; X_i)^\top.$$

In deep learning, $X_i$ are training examples and $s(\theta; X_i) = \nabla_\theta \log p_\theta(y_i \mid x_i)$ are per-example gradients. The empirical Fisher is not the same as the true Fisher unless the model is correctly specified at $\theta$ (see Kunstner et al. 2019 and the Confusions section). The true Fisher uses model samples $\tilde Y_i \sim p_\theta(\cdot \mid x_i)$, not training labels.

The Bartlett Identities

Theorem

Bartlett Identities (First and Second)

Statement

Under Cramér regularity:

First Bartlett identity: $\mathbb E_\theta[s(\theta; X)] = 0$.

Second Bartlett identity: $\mathrm{Var}_\theta[s(\theta; X)] = -\mathbb E_\theta[\nabla^2_\theta \log p(X \mid \theta)]$. Equivalently,

$$I(\theta) = \mathbb E_\theta[s s^\top] = -\mathbb E_\theta[\nabla^2_\theta \log p(X \mid \theta)].$$

Intuition

Both identities follow from differentiating $\int p(x \mid \theta) \, dx = 1$. The first comes from the first derivative; the second from the second derivative. Geometrically, the first says the log-likelihood gradient at the true $\theta$ has zero mean. The second says the variance of that gradient equals the expected curvature, an information equality that makes Fisher information computable in two equivalent ways.

Proof Sketch

First identity. Differentiate $\int p(x \mid \theta) \, dx = 1$ under the integral:

$$0 = \frac{d}{d\theta} \int p(x \mid \theta) \, dx = \int \frac{\partial p(x \mid \theta)}{\partial \theta} \, dx = \int s(\theta; x)\, p(x \mid \theta) \, dx = \mathbb E_\theta[s].$$

Second identity. Differentiate again:

$$0 = \frac{d^2}{d\theta^2} \int p(x \mid \theta) \, dx = \int \left[\frac{\partial^2 \log p}{\partial \theta^2} + \left(\frac{\partial \log p}{\partial \theta}\right)^2\right] p(x \mid \theta) \, dx = \mathbb E[\partial_\theta^2 \log p] + \mathbb E[s^2].$$

Rearrange: $\mathbb E[s^2] = -\mathbb E[\partial_\theta^2 \log p]$. Combined with the first identity ($\mathbb E[s] = 0$), the variance of the score equals its second moment, which equals the expected negative Hessian.

Why It Matters

The Bartlett identities are the foundation for everything Fisher information does. They give two computable forms for $I(\theta)$ (variance of score, or expected negative Hessian); they validate the Cramér-Rao bound's Cauchy-Schwarz step (which uses $\mathbb E[s] = 0$); and the second identity is what makes the Fisher information equal to local curvature (linking variance and geometry).

Failure Mode

Both identities require Cramér regularity. For $X \sim U(0, \theta)$, the support depends on $\theta$, so differentiation under the integral fails: the boundary contribution does not vanish. The score is $-1/\theta$ on $(0, \theta)$, so $\mathbb E[s] = -1/\theta \neq 0$, breaking the first identity. The second identity also breaks: $\mathbb E[s^2] = 1/\theta^2$ while $-\mathbb E[\partial_\theta^2 \log p] = -1/\theta^2$, so the two forms disagree, and neither yields a meaningful $1/(nI)$ rate: the MLE $\max_i X_i$ has variance $\Theta(1/n^2)$, faster than any regular rate. The right tool for non-regular families is the Hammersley-Chapman-Robbins bound.

In quasi-MLE settings (the model is misspecified), the second Bartlett identity fails: $\mathbb E_{p_*}[s s^\top] \neq -\mathbb E_{p_*}[\nabla^2 \log p_\theta]$, where $p_*$ is the true distribution. This is why misspecified MLE has the sandwich asymptotic variance $V^{-1} A V^{-1}$ instead of $I^{-1}$ (White's QMLE theorem on the MLE page).
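
A small sketch (NumPy, arbitrary constants) of the support-dependence failure described above: the first Bartlett identity holds for a regular family but fails for $U(0, \theta)$.

```python
import numpy as np

# Sanity check of E[score] = 0: holds for a Gaussian location family,
# fails for U(0, theta) because the support depends on theta.
rng = np.random.default_rng(2)
n = 1_000_000

# Regular case: X ~ N(mu, 1), score = x - mu, Monte Carlo mean ~ 0.
mu = 1.0
x = rng.normal(mu, 1.0, size=n)
print("Gaussian  E[score] ~", np.mean(x - mu))

# Non-regular case: X ~ U(0, theta), log p = -log(theta) on (0, theta),
# so the score is identically -1/theta and its mean is -1/theta != 0.
theta = 2.0
x = rng.uniform(0.0, theta, size=n)
score = np.full_like(x, -1.0 / theta)
print("Uniform   E[score] =", np.mean(score), "(should be", -1 / theta, ")")
```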

Corollary

Hessian Form (corollary of Bartlett II)

Statement

Under Cramér regularity, the Fisher information equals the expected negative Hessian of the log-likelihood:

$$I(\theta) = -\mathbb E_\theta[\nabla^2_\theta \log p(X \mid \theta)].$$

Intuition

Direct restatement of the second Bartlett identity. This is usually the easier form for computation: take the Hessian of the log-density (the model's curvature) and take the expectation under $p_\theta$.

Proof Sketch

This is the second Bartlett identity rearranged: $I = \mathbb E[s^2] = -\mathbb E[\partial^2 \log p]$.

Why It Matters

Computing $-\mathbb E[\partial^2 \log p]$ is usually a one-line calculation in standard families: the Gaussian gives $1/\sigma^2$ for $\mu$, $1/(2\sigma^4)$ for $\sigma^2$, and zero off-diagonals; the Bernoulli gives $1/[\theta(1-\theta)]$; the Poisson gives $1/\lambda$. The variance-of-score form $\mathbb E[s^2]$ is sometimes algebraically messier (you have to expand the squared score and integrate); the Hessian form is usually cleaner.

Failure Mode

Same as the second Bartlett identity: fails when the support depends on $\theta$, when the log-density is not twice differentiable, when expectations are infinite (heavy-tailed scores), or when the model is misspecified.

Fisher Information as the KL-Divergence Hessian

Theorem

Fisher Information is the KL Hessian

Statement

For any $\theta \in \Theta$,

$$\mathrm{KL}(p_\theta \,\|\, p_{\theta + \delta}) = \frac{1}{2} \delta^\top I(\theta)\, \delta + O(\|\delta\|^3) \quad \text{as } \delta \to 0.$$

Equivalently, the Fisher information matrix is the Hessian of the KL divergence at the diagonal:

$$I(\theta) = \nabla^2_\eta \mathrm{KL}(p_\theta \,\|\, p_\eta) \big|_{\eta = \theta}.$$

Intuition

The KL divergence is asymmetric and zero on the diagonal; locally, it must be a positive semidefinite quadratic form. The coefficient of that quadratic form is the Fisher information. So the Fisher information is the metric tensor of a (mathematically natural) Riemannian geometry on parameter space, in which the "distance" between $p_\theta$ and $p_\eta$ is measured by KL.

Proof Sketch

Expand $\log p(x \mid \theta + \delta) = \log p(x \mid \theta) + s(\theta; x)^\top \delta + \frac{1}{2} \delta^\top \nabla^2_\theta \log p(x \mid \theta)\, \delta + O(\|\delta\|^3)$. Take the expectation under $p_\theta$:

$$\mathbb E_{p_\theta}[\log p(X \mid \theta + \delta) - \log p(X \mid \theta)] = 0 + \frac{1}{2} \delta^\top \mathbb E_{p_\theta}[\nabla^2_\theta \log p(X \mid \theta)]\, \delta + O(\|\delta\|^3).$$

The first-order term vanishes by the first Bartlett identity. The KL divergence is $\mathrm{KL}(p_\theta \,\|\, p_{\theta + \delta}) = -\mathbb E_{p_\theta}[\log p(X \mid \theta + \delta) - \log p(X \mid \theta)]$, so

$$\mathrm{KL}(p_\theta \,\|\, p_{\theta + \delta}) = -\frac{1}{2} \delta^\top \mathbb E_{p_\theta}[\nabla^2 \log p]\, \delta + O(\|\delta\|^3) = \frac{1}{2} \delta^\top I(\theta)\, \delta + O(\|\delta\|^3),$$

using the second Bartlett identity to replace $-\mathbb E[\nabla^2 \log p]$ by $I(\theta)$.

Why It Matters

This identity is the single most important fact about Fisher information. It explains why $I(\theta)$ is the right Riemannian metric on parameter space (it is the second-order Taylor coefficient of KL); why preconditioning by $I^{-1}$ is parameterization invariant (KL is parameterization-free); and why the natural gradient is the right notion of steepest descent in distribution space. It also explains why the MLE is asymptotically efficient: the score $\sum_i s(\theta; X_i)$ is a sum of i.i.d. mean-zero terms with variance $I$ per observation, so by the CLT it is approximately $\mathcal N(0, nI)$, and the inverse Hessian of the log-likelihood (which is approximately $(nI)^{-1}$ by the second Bartlett identity and the LLN) maps this to the approximate $\mathcal N(0, I^{-1}/n)$ distribution of the MLE.

Failure Mode

The KL Hessian identity uses Cramér regularity (Bartlett identities). It also requires the KL divergence to be finite in a neighborhood of $\theta$, which can fail at the boundary of parameter space or for distributions with disjoint supports. For exponential families in canonical parameterization, the KL has the exact Bregman form $\mathrm{KL}(p_\eta \,\|\, p_{\eta'}) = A(\eta') - A(\eta) - \nabla A(\eta)^\top (\eta' - \eta)$, the Bregman divergence of the log-partition function $A$, whose Hessian in $\eta'$ is $\nabla^2 A(\eta') = I(\eta')$ at every point, not just on the diagonal; the quadratic expansion itself is exact whenever $A$ is quadratic, as for the Gaussian location family.
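
A short numerical sketch (NumPy, arbitrary constants) of the quadratic approximation $\mathrm{KL}(p_\theta \,\|\, p_{\theta+\delta}) \approx \tfrac{1}{2} I(\theta)\,\delta^2$: exact for the Gaussian location family, second-order accurate for the Bernoulli.

```python
import numpy as np

# Compare KL(p_theta || p_{theta+delta}) with (1/2) * I(theta) * delta^2.
def kl_gauss(mu0, mu1, sigma=1.0):     # exact: KL is quadratic in the mean
    return (mu1 - mu0) ** 2 / (2 * sigma**2)

def kl_bern(t0, t1):                   # exact only to second order in delta
    return t0 * np.log(t0 / t1) + (1 - t0) * np.log((1 - t0) / (1 - t1))

mu, sigma, theta = 0.0, 2.0, 0.3
I_gauss = 1 / sigma**2
I_bern = 1 / (theta * (1 - theta))

for delta in (0.1, 0.01, 0.001):
    print(delta,
          kl_gauss(mu, mu + delta, sigma), 0.5 * I_gauss * delta**2,   # equal
          kl_bern(theta, theta + delta), 0.5 * I_bern * delta**2)      # agree to O(delta^3)
```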

Fisher Information for n Samples

Proposition

Additivity for Independent Samples

Statement

If $X_1, \ldots, X_n$ are independent observations with score functions $s_i(\theta; X_i)$, the Fisher information of the joint sample is the sum:

$$I_n(\theta) = \sum_{i=1}^n I_i(\theta), \quad \text{where } I_i(\theta) = \mathrm{Var}[s_i(\theta; X_i)].$$

In the i.i.d. case, $I_n(\theta) = n I(\theta)$.

Intuition

The score of the joint log-likelihood is the sum of per-sample scores; variance of a sum of independent terms is the sum of variances. More data means proportionally more information.

Proof Sketch

$\log p(X_{1:n} \mid \theta) = \sum_{i=1}^n \log p_i(X_i \mid \theta)$. Take the gradient: $S_n = \sum_i s_i(\theta; X_i)$. By independence and the mean-zero score, $\mathrm{Var}(S_n) = \sum_i \mathrm{Var}(s_i) = \sum_i I_i(\theta)$.

Why It Matters

Additivity is what makes the rate $1/(nI)$ in the Cramér-Rao bound possible: each independent observation contributes a fixed quantum $I(\theta)$ of information. For dependent samples (time series, spatial data), additivity fails and the effective sample size is smaller: the Fisher information for the full block is typically less than $n$ times the single-observation Fisher information, with the gap captured by the autocorrelation function.

Failure Mode

Fails for dependent observations. Also fails when the model is misspecified jointly (e.g., the per-observation models are correct but the joint independence assumption is wrong); in that case the sandwich variance corrects the asymptotic distribution, but the simple multiplication by $n$ overstates the available information.

The Natural Gradient

Theorem

Parameterization Invariance of the Natural Gradient

Statement

The natural-gradient vector field $\theta \mapsto -I(\theta)^{-1} \nabla_\theta L(\theta)$ is coordinate-invariant: under a smooth invertible reparameterization $\theta = g(\phi)$ with Jacobian $J$, it transforms as a tangent vector, so the continuous-time flow $\dot\theta = -I(\theta)^{-1} \nabla_\theta L(\theta)$ describes the same trajectory in distribution space regardless of parameterization.

The discrete update $\theta \leftarrow \theta - \eta\, I(\theta)^{-1} \nabla_\theta L(\theta)$ is only first-order (infinitesimally) invariant: it is the Euler discretization of the invariant flow, and the two parameterizations agree to leading order in $\eta$ but disagree at $O(\eta^2)$. Exact finite-step invariance requires the Riemannian exponential map or a retraction along the Fisher metric; vanilla Euler steps in different coordinate systems trace different finite-step paths even when the underlying flow is the same.

Intuition

Ordinary gradient descent depends on which parameterization you use: the direction of steepest descent in Euclidean parameter space changes when you change coordinates. The natural gradient instead descends in the geometry induced by KL divergence, which is parameterization-free at the level of the vector field. Finite Euler steps approximate the invariant trajectory, with the approximation gap shrinking as the step size shrinks.

Proof Sketch

Under reparameterization $\theta = g(\phi)$:

  • The gradient transforms as $\nabla_\phi L = J^\top \nabla_\theta L$ (chain rule).
  • The Fisher matrix transforms as $I_\phi = J^\top I_\theta J$ (covariant tensor of rank 2).

The natural gradient in $\phi$-coordinates is $I_\phi^{-1} \nabla_\phi L = (J^\top I_\theta J)^{-1} J^\top \nabla_\theta L = J^{-1} I_\theta^{-1} J^{-\top} J^\top \nabla_\theta L = J^{-1} I_\theta^{-1} \nabla_\theta L$, which is the natural gradient in $\theta$-coordinates pulled back through $J^{-1}$. So the update direction in distribution space is the same.
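
A minimal numerical illustration of this algebra (plain Python, arbitrary constants): a Bernoulli model in the $\theta$ parameterization versus the logit parameterization $\theta = \mathrm{sigmoid}(\phi)$. The natural-gradient direction maps through the Jacobian to the same $\theta$-space direction; the ordinary gradient does not.

```python
# Model: Bernoulli(theta); loss: expected negative log-likelihood with data mean m.
m, theta = 0.7, 0.2                      # data mean and current parameter
J = theta * (1 - theta)                  # d theta / d phi for theta = sigmoid(phi)

grad_theta = (theta - m) / (theta * (1 - theta))   # dL/dtheta
I_theta = 1 / (theta * (1 - theta))                # Fisher in theta

grad_phi = J * grad_theta                # chain rule: dL/dphi
I_phi = J * I_theta * J                  # Fisher transforms as J^T I J

nat_theta = grad_theta / I_theta         # natural gradient, theta-coordinates
nat_phi = grad_phi / I_phi               # natural gradient, phi-coordinates

print("natural grad, phi direction mapped to theta:", J * nat_phi)   # == nat_theta
print("natural grad, computed directly in theta:   ", nat_theta)
print("vanilla grad, phi direction mapped to theta: ", J * grad_phi)  # != grad_theta
print("vanilla grad, computed directly in theta:    ", grad_theta)
```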

Why It Matters

This is the formal statement of why the natural gradient is "the right" descent direction for probabilistic models: it descends along the geodesic in KL geometry, independent of how the model is parameterized. Practical methods (TRPO's natural policy gradient, K-FAC, EWC) all derive from this invariance. The cost is that $I(\theta)^{-1}$ is expensive to compute exactly for high-dimensional models, motivating Kronecker-factored approximations.

Failure Mode

Requires $I(\theta)$ to be positive definite (model identifiable). When $I$ is rank-deficient, the natural gradient is not uniquely defined; one can use a generalized inverse $I^+$, but parameterization invariance breaks for directions in the kernel of $I$.

For deep networks, the empirical Fisher (computed from training data with training labels) is often substituted for the true Fisher (which requires sampling labels from the model), and this substitution breaks parameterization invariance unless the model is correctly specified. K-FAC uses the true Fisher with a Kronecker approximation; some implementations cut corners and use the empirical Fisher, with results that are sometimes good and sometimes poor depending on the regime.

The natural gradient appears in three modern tools:

  • K-FAC (Martens & Grosse 2015): approximates $I(\theta)^{-1}$ as a Kronecker product of small per-layer matrices, making natural-gradient descent feasible for deep networks. The Kronecker structure exploits the approximate per-layer factorization $I_\ell \approx A_\ell \otimes G_\ell$ for fully-connected layers, where $A_\ell$ is the activation covariance and $G_\ell$ the gradient covariance.

  • Natural policy gradient / TRPO (Kakade 2002; Schulman et al. 2015): in policy-gradient methods, the "loss" is the expected return, and the natural-gradient direction $I(\theta)^{-1} \nabla J(\theta)$ corresponds to the policy update that achieves a fixed KL bound. TRPO solves a constrained optimization with the KL constraint replaced by its second-order (Fisher quadratic) approximation.

  • EWC (Kirkpatrick et al. 2017): for continual learning, Elastic Weight Consolidation regularizes new-task training with a penalty $\sum_j I_{jj}(\theta_*) (\theta_j - \theta_{*j})^2$, where $\theta_*$ is the parameter vector from the previous task and $I_{jj}$ is the diagonal of the Fisher information matrix. The diagonal Fisher tells you which parameters were "important" for the previous task and should change slowly (a minimal sketch follows below).
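
A minimal sketch of an EWC-style penalty with a diagonal Fisher estimate (NumPy). The per-example gradients, shapes, and constants here are placeholders; how the gradients at the old-task optimum are computed is model-specific and assumed given.

```python
import numpy as np

def diagonal_fisher(per_example_grads):
    # Diagonal of (1/n) sum_i s_i s_i^T == mean of squared per-example gradients.
    return np.mean(per_example_grads**2, axis=0)

def ewc_penalty(theta, theta_star, fisher_diag, lam=1.0):
    # (lam/2) * sum_j F_jj * (theta_j - theta*_j)^2
    return 0.5 * lam * np.sum(fisher_diag * (theta - theta_star) ** 2)

# Toy usage with random placeholders (128 examples, 10 parameters).
rng = np.random.default_rng(3)
grads = rng.normal(size=(128, 10))       # hypothetical per-example score gradients
theta_star = rng.normal(size=10)         # old-task optimum
theta = theta_star + 0.1 * rng.normal(size=10)
F = diagonal_fisher(grads)
print(ewc_penalty(theta, theta_star, F, lam=100.0))
```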

Connection to Cramér-Rao

The most-cited consequence of Fisher information is the Cramér-Rao bound: for any unbiased estimator $\hat\theta$ of $\theta \in \mathbb R^d$,

$$\mathrm{Cov}_\theta(\hat\theta) \succeq \frac{1}{n} I(\theta)^{-1}$$

in the Loewner ordering. The MLE attains this bound asymptotically (under regularity): $\sqrt n (\hat\theta_n - \theta_0) \Rightarrow \mathcal N(0, I(\theta_0)^{-1})$. The dedicated Cramér-Rao bound page covers the proof, the multivariate Loewner-ordering form, the chain-rule extension for biased estimators, the sharper Bhattacharyya higher-order bounds, the Hammersley-Chapman-Robbins bound that does not require regularity, and the van Trees Bayesian extension.

The Cramér-Rao link is also why "Fisher information" deserves the name information: $I(\theta)$ is the information per observation about the parameter, and $1/(nI(\theta))$ is the minimum variance (for unbiased estimators) achievable from $n$ such observations.
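
A quick simulation sketch (NumPy, arbitrary constants) of the bound being attained: for $\mathcal N(\mu, \sigma^2)$ with known $\sigma^2$, the sample mean has variance exactly $\sigma^2/n = 1/(nI(\mu))$.

```python
import numpy as np

# Monte Carlo check that Var(xbar) matches the Cramér-Rao bound 1 / (n * I(mu)).
rng = np.random.default_rng(7)
mu, sigma, n = 1.0, 2.0, 50
I = 1 / sigma**2                                   # Fisher information per observation

xbars = rng.normal(mu, sigma, size=(200_000, n)).mean(axis=1)
print("Var(xbar)       ~", np.var(xbars))
print("CR bound 1/(nI) =", 1 / (n * I))            # sigma^2 / n = 0.08
```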

Canonical Examples

Example

Bernoulli

$X \sim \mathrm{Bernoulli}(\theta)$, so $\log p(x \mid \theta) = x \log \theta + (1-x) \log(1-\theta)$. Score: $s(\theta; x) = x/\theta - (1-x)/(1-\theta)$. By the Hessian form: $\partial^2 \log p / \partial \theta^2 = -x/\theta^2 - (1-x)/(1-\theta)^2$, with expectation $-1/\theta - 1/(1-\theta) = -1/[\theta(1-\theta)]$, so $I(\theta) = 1/[\theta(1-\theta)]$. The information is largest as $\theta \to 0$ or $\theta \to 1$ (extremes are easy to identify) and smallest, $I = 4$, at $\theta = 1/2$ (the most uncertain case is the hardest to estimate).

Example

Gaussian with known variance

$X \sim \mathcal N(\mu, \sigma^2)$ with $\sigma^2$ known. Log-density: $-\frac{(x-\mu)^2}{2\sigma^2} + \mathrm{const}$. Hessian in $\mu$: $-1/\sigma^2$, independent of $x$, so $I(\mu) = 1/\sigma^2$. The Cramér-Rao bound gives $\mathrm{Var}(\hat\mu) \geq \sigma^2/n$, attained exactly by $\bar X$.

Example

Gaussian with both unknown: information-orthogonal block

$X \sim \mathcal N(\mu, \sigma^2)$ with both unknown. Score:

$$s(\mu, \sigma^2; x) = \begin{pmatrix} (x-\mu)/\sigma^2 \\ -1/(2\sigma^2) + (x-\mu)^2/(2\sigma^4) \end{pmatrix}.$$

Fisher matrix:

$$I(\mu, \sigma^2) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 1/(2\sigma^4) \end{pmatrix}.$$

The off-diagonal is zero: $\mu$ and $\sigma^2$ are information-orthogonal. Estimating one does not lose precision on the other. This is special to the location-scale Gaussian; in most parameterizations of most families, the off-diagonals are nonzero (cf. the Gamma family).

Example

Exponential family in canonical form

$p(x \mid \eta) = h(x) \exp(\eta^\top T(x) - A(\eta))$ for $\eta \in \mathbb R^d$. Score: $s(\eta; x) = T(x) - \nabla A(\eta) = T(x) - \mu(\eta)$. The Fisher information equals the variance of $T$: $I(\eta) = \mathrm{Cov}(T(X)) = \nabla^2 A(\eta)$ (the Hessian of the log-partition function). This is the cleanest case: Fisher information is exactly the cumulant matrix of the sufficient statistic. The MLE for $\mu = \mathbb E[T]$ is the sample mean $\bar T$, which attains the Cramér-Rao bound at every $n$ (one of the few finite-sample efficient cases).
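
A quick sketch (NumPy, arbitrary constants) for the Poisson family in canonical form, where $\eta = \log \lambda$, $T(x) = x$, and $A(\eta) = e^\eta$, so $I(\eta) = A''(\eta) = \lambda = \mathrm{Var}(X)$:

```python
import numpy as np

# Exponential-family check: Fisher information in the canonical parameter equals
# the Hessian of the log-partition, i.e. the variance of the sufficient statistic.
lam = 3.0
eta = np.log(lam)
rng = np.random.default_rng(4)
x = rng.poisson(lam, size=1_000_000)

I_from_A = np.exp(eta)        # A''(eta) = exp(eta) = lambda
I_from_T = np.var(x)          # Var(T(X)) = Var(X), Monte Carlo
print(I_from_A, I_from_T)     # both ~ 3.0
```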

Example

Fisher information for a softmax / multinomial logistic

For a $K$-class multinomial logistic model with logit parameters $\eta_k$ for $k = 1, \ldots, K-1$ (with $\eta_K = 0$ for identifiability), the per-observation Fisher information matrix is $I(\eta) = \mathrm{diag}(\pi) - \pi \pi^\top$, where $\pi_k = e^{\eta_k}/\sum_j e^{\eta_j}$ and $\pi$ collects the predicted probabilities of the $K-1$ free classes. This is a $(K-1) \times (K-1)$ matrix; it is positive definite as long as no $\pi_k$ equals 0 or 1 (degenerate cases). The natural gradient for cross-entropy loss preconditions by $I^{-1}$, which corresponds to descending in the KL geometry on the simplex. This is the basis for K-FAC in classification networks.
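
A minimal Monte Carlo check of this formula (NumPy, arbitrary logits): sample labels from the model and compare the score outer products with the closed form.

```python
import numpy as np

# Check I(eta) = diag(pi) - pi pi^T over the K-1 free logits (eta_K = 0).
rng = np.random.default_rng(5)
K = 4
eta = np.array([0.5, -1.0, 0.3, 0.0])         # last logit pinned to 0
pi = np.exp(eta) / np.exp(eta).sum()

p = pi[:-1]                                   # probabilities of the K-1 free classes
I_closed = np.diag(p) - np.outer(p, p)

# Monte Carlo: sample labels FROM THE MODEL, form score outer products.
n = 500_000
y = rng.choice(K, size=n, p=pi)
onehot = np.eye(K)[y][:, :-1]                 # indicators of the K-1 free classes
scores = onehot - p                           # d/d eta_k log p(y) = 1{y=k} - pi_k
I_mc = scores.T @ scores / n

print(np.round(I_closed, 4))
print(np.round(I_mc, 4))
```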

Example

Empirical Fisher vs true Fisher in a misspecified neural network

Consider a regression network trained with squared loss: $y_i = f_\theta(x_i) + \epsilon_i$ with $\epsilon_i \sim \mathcal N(0, \sigma^2)$ assumed. The model implies $p_\theta(y \mid x) = \mathcal N(f_\theta(x), \sigma^2)$ with score $s_\theta(x, y) = (y - f_\theta(x))\, \nabla_\theta f_\theta(x) / \sigma^2$.

True Fisher: $I(\theta) = \mathbb E_{(x, y) \sim p_\theta}[s s^\top] = \mathbb E_x[\nabla_\theta f_\theta\, \nabla_\theta f_\theta^\top]/\sigma^2$ (after taking the expectation over $y \sim p_\theta(\cdot \mid x) = \mathcal N(f_\theta(x), \sigma^2)$). This is the Gauss-Newton matrix divided by $\sigma^2$.

Empirical Fisher: $\hat I_{\mathrm{emp}}(\theta) = \frac{1}{n} \sum_i (y_i - f_\theta(x_i))^2\, \nabla_\theta f_\theta\, \nabla_\theta f_\theta^\top / \sigma^4$. This weights each gradient outer product by the squared training residual rather than by its model expectation $\sigma^2$.

When the network has converged and nearly interpolates the data ($y_i \approx f_\theta(x_i)$), the empirical Fisher is approximately zero, while the true Fisher is finite. K-FAC uses the true Fisher; substituting the empirical Fisher in late training produces ill-conditioned preconditioners. Kunstner, Hennig, & Balles (2019, "Limitations of the empirical Fisher approximation for natural gradient descent", arXiv:1905.12558) document this gap explicitly.
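
A small sketch of that collapse (NumPy, arbitrary constants) for a linear-Gaussian model $p(y \mid x) = \mathcal N(\theta^\top x, 1)$: near an interpolating least-squares fit, the empirical Fisher shrinks with the residuals while the true (Gauss-Newton) Fisher does not.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, sigma_true = 200, 3, 0.01              # nearly noiseless data -> near-interpolation
X = rng.normal(size=(n, d))
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star + sigma_true * rng.normal(size=n)

theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # converged (least-squares) fit
resid = y - X @ theta_hat

# True Fisher (assumed sigma^2 = 1): (1/n) sum_i x_i x_i^T.
F_true = X.T @ X / n
# Empirical Fisher: residual-weighted outer products of the same gradients.
F_emp = (X * resid[:, None] ** 2).T @ X / n

print("true Fisher trace:     ", np.trace(F_true))   # order d
print("empirical Fisher trace:", np.trace(F_emp))    # roughly sigma_true^2 smaller
```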

Common Confusions

Watch Out

Fisher information is not the same as observed information

$I(\theta) = -\mathbb E[\nabla^2 \log p(X \mid \theta)]$ is the expected negative Hessian; it is a deterministic function of $\theta$. The observed information $J(\theta; x) = -\nabla^2 \log p(x \mid \theta)$ is the actual second derivative on the realized data; it is random. Asymptotically they agree (LLN), and for valid Wald-style inference one can use either. Some statisticians prefer the observed information because it conditions on the data and tends to give better finite-sample coverage (Efron & Hinkley 1978).

Watch Out

The Cramér-Rao bound governs unbiased estimators only

$\mathrm{Var}(\hat\theta) \geq 1/(nI)$ is not a bound on MSE. Biased estimators can have lower MSE: ridge regression, James-Stein shrinkage, posterior means, and regularized MLE all deliberately introduce bias to reduce variance. The bound is a strict statement about the unbiased class. For biased estimators, use the chain-rule version $\mathrm{Var}(T) \geq [\psi'(\theta)]^2 / (nI)$ where $\psi(\theta) = \mathbb E[T]$, then add the squared bias to get an MSE lower bound.

Watch Out

Empirical Fisher is not the true Fisher unless the model is correctly specified

The empirical Fisher $\hat I_{\mathrm{emp}}(\theta) = (1/n) \sum_i s(\theta; X_i) s(\theta; X_i)^\top$ uses the observed training data $X_i$ (in supervised problems, the training labels); the true Fisher uses model samples $\tilde X_i \sim p_\theta$. These agree only when the model is correctly specified at $\theta$ (so the data are actually drawn from $p_\theta$). For misspecified models, $\hat I_{\mathrm{emp}} \to A := \mathbb E_{p_*}[s s^\top]$ where $p_*$ is the true distribution; this differs from $I = -\mathbb E_{p_*}[\nabla^2 \log p_\theta]$. The QMLE sandwich variance $V^{-1} A V^{-1}$ uses both.

Watch Out

Fisher information matrix is not the Hessian of the training loss

For a deep network trained with cross-entropy, the Hessian of the training loss is $-\sum_i \nabla^2 \log p_\theta(y_i \mid x_i)$ (up to the $1/n$ normalization), the second derivative of the negative log-likelihood evaluated on training data. The Fisher information matrix is $\mathbb E_{(x, y) \sim p_\theta}[\nabla \log p\, \nabla \log p^\top]$, an expectation under the model. The Gauss-Newton matrix is yet a third object. All three coincide when the model is correctly specified and the training data come from the model; for real deep learning none of these conditions holds exactly, and the three objects can differ substantially. K-FAC and similar methods use the true Fisher (with sampled labels) to preserve the natural-gradient invariance.

Watch Out

High Fisher information does not mean high precision in absolute terms

$I(\theta) = 100$ does not mean the estimator has a small variance in absolute units; it means $\mathrm{Var}(\hat\theta) \geq 1/(100n) = 0.01/n$, relative to the natural scale of $\theta$. The Cramér-Rao bound carries units of $[\theta]^2$. The relevant comparison is the signal-to-noise ratio $\theta / \mathrm{sd}(\hat\theta) = \theta \sqrt{nI(\theta)}$, which is the right unitless measure of estimation precision.

Watch Out

Fisher information depends on parameterization (but the natural gradient does not)

If $\theta = g(\phi)$ is a smooth reparameterization, the Fisher information matrices are related by $I_\phi(\phi) = J^\top I_\theta(g(\phi))\, J$, where $J$ is the Jacobian. So $I(\theta)$ is a coordinate-dependent quantity (a covariant rank-2 tensor). The natural gradient $I^{-1} \nabla L$, however, is parameterization-invariant: it descends in the KL geometry, which is intrinsic to the family of distributions and does not depend on the coordinate labels.

Exercises

ExerciseCore

Problem

Compute the Fisher information $I(\lambda)$ for a single observation from $\mathrm{Poisson}(\lambda)$. State the Cramér-Rao bound for an unbiased estimator of $\lambda$ from $n$ i.i.d. observations and verify that the sample mean attains it.

ExerciseCore

Problem

Compute the Fisher information matrix for $\mathcal N(\mu, \sigma^2)$ with both parameters unknown. Verify that the off-diagonal is zero (information-orthogonality of $\mu$ and $\sigma^2$). What does information-orthogonality imply for the asymptotic distribution of $(\hat\mu_{\mathrm{MLE}}, \hat\sigma^2_{\mathrm{MLE}})$?

ExerciseAdvanced

Problem

Show that the natural gradient is parameterization-invariant: under $\theta = g(\phi)$ with Jacobian $J$, the natural-gradient step in $\phi$-space transforms to the natural-gradient step in $\theta$-space. Why does ordinary gradient descent fail this invariance?

ExerciseAdvanced

Problem

Derive the Fisher information of an exponential family in canonical form $p(x \mid \eta) = h(x) \exp(\eta T(x) - A(\eta))$. Show $I(\eta) = A''(\eta)$, the Hessian of the log-partition function. Then show that the Fisher information in the mean-value parameterization $\mu = A'(\eta)$ is $I_\mu(\mu) = 1/A''(\eta) = 1/I_\eta(\eta)$ (the reciprocal). Use this to verify that the natural-gradient updates in $\eta$ and $\mu$ coordinates are equivalent.

ExerciseResearch

Problem

For a fully-connected feedforward network with cross-entropy classification loss, derive the K-FAC approximation of the Fisher information matrix. Specifically, show that the Fisher block for a single layer with input activations $a$ and pre-softmax logits $z = W a$ has the form $I_W \approx \mathbb E[a a^\top] \otimes \mathbb E[g g^\top]$, where $g = \nabla_z \log p(y \mid z)$ is the gradient of the per-example log-likelihood at the model's output. State the assumption that makes the Kronecker factorization exact and the scenarios in which it is a poor approximation.

References

Canonical:

  • Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury. Chapter 7 (Fisher information, Cramér-Rao, efficiency).
  • Lehmann, E. L., & Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer. Chapters 2-3 (information bounds, UMVUE, efficient estimation).
  • Schervish, M. J. (1995). Theory of Statistics. Springer. Section 2.3 (information inequalities, Bartlett identities).
  • Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press. Original treatment (Section 32.3).
  • Cox, D. R., & Hinkley, D. V. (1974). Theoretical Statistics. Chapman & Hall. Chapter 4 (likelihood and Fisher information).
  • Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley. Chapter 11.10 (Fisher information, KL connection, Cramér-Rao).

Current and computational:

  • Amari, S. (2016). Information Geometry and Its Applications. Springer. Chapters 1-3 (Fisher metric as Riemannian structure, dual connections, exponential families).
  • Martens, J. (2020). "New insights and perspectives on the natural gradient method." Journal of Machine Learning Research, 21(146), 1-76. Comprehensive natural-gradient reference.
  • Martens, J., & Grosse, R. (2015). "Optimizing neural networks with Kronecker-factored approximate curvature." ICML, arXiv:1503.05671. The K-FAC method.
  • Kunstner, F., Hennig, P., & Balles, L. (2019). "Limitations of the empirical Fisher approximation for natural gradient descent." NeurIPS, arXiv:1905.12558. Why empirical Fisher is not the true Fisher.
  • Kakade, S. (2002). "A natural policy gradient." NeurIPS. The natural gradient applied to policy optimization.
  • Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). "Trust region policy optimization." ICML, arXiv:1502.05477. TRPO uses Fisher information for the KL constraint.
  • Kirkpatrick, J., et al. (2017). "Overcoming catastrophic forgetting in neural networks." PNAS, 114(13), 3521-3526. Elastic Weight Consolidation uses the diagonal Fisher.
  • van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press. Chapters 2-8 (asymptotic efficient estimation, LAN, Le Cam theory).

Critique and refinements:

  • Efron, B., & Hinkley, D. V. (1978). "Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information." Biometrika, 65(3), 457-487. Argues observed information often gives better finite-sample inference than expected.
  • White, H. (1982). "Maximum likelihood estimation of misspecified models." Econometrica, 50(1), 1-25. The QMLE sandwich variance: when the model is wrong, I1I^{-1} is replaced by V1AV1V^{-1} A V^{-1}.
  • Bernacchia, A., Lengyel, M., & Hennequin, G. (2018). "Exact natural gradient in deep linear networks and application to the nonlinear case." NeurIPS, arXiv:1803.06959. Conditions under which K-FAC's Kronecker factorization is exact.
  • Karakida, R., Akaho, S., & Amari, S. (2019). "Universal statistics of Fisher information in deep neural networks: Mean field approach." AISTATS, arXiv:1806.01316. Spectral structure of the Fisher matrix in deep networks.

Next Topics

  • Cramér-Rao bound: the inequality $\mathrm{Var}(\hat\theta) \geq 1/(nI)$ and its sharper variants (Bhattacharyya, HCR, van Trees).
  • Maximum likelihood estimation: the asymptotic distribution $\sqrt n (\hat\theta_n - \theta_0) \Rightarrow \mathcal N(0, I^{-1})$.
  • Asymptotic statistics: M- and Z-estimator framework, LAN families, contiguity, the convolution theorem.
  • Hypothesis testing for ML: Wald, Score, and Likelihood-Ratio tests built from Fisher information.
  • Minimax lower bounds: van Trees and Le Cam techniques, with Fisher information as the local rate.

Last reviewed: April 19, 2026
