

Fisher Information: Curvature, KL Geometry, and the Natural Gradient

The Fisher information measures how much a sample tells you about an unknown parameter. Coverage of the score function and Bartlett identities, the equivalence of variance-of-score and expected-negative-Hessian forms, the KL-divergence Hessian identity, the Fisher information matrix and Loewner ordering, the Cramér-Rao link, the natural gradient (parameterization invariance), K-FAC and empirical Fisher in deep learning, and applications to MLE asymptotic efficiency, EWC, and information geometry.


Why This Matters

The Fisher information $I(\theta)$ is the local geometric object that controls how well a parametric model can be estimated. It plays three roles, all derived from the same definition. First, it is the variance of the score function, which sets the Cramér-Rao lower bound on the variance of any unbiased estimator: $\mathrm{Var}(\hat\theta) \geq 1/(nI(\theta))$. Second, it is the expected curvature of the log-likelihood at the truth, which is what the maximum likelihood estimator sees as it converges: $\sqrt n (\hat\theta_n - \theta_0) \Rightarrow \mathcal N(0, I(\theta_0)^{-1})$. Third, it is the Hessian of the KL divergence at the diagonal: $I(\theta) = \nabla^2_\eta \mathrm{KL}(p_\theta \,\|\, p_\eta) \big|_{\eta = \theta}$, which makes it the natural Riemannian metric on the statistical manifold and the source of the natural gradient.

In modern deep learning, the Fisher information matrix appears in three central tools: K-FAC (Kronecker-Factored Approximate Curvature) for second-order optimization, the natural policy gradient (TRPO, PPO antecedents), and Elastic Weight Consolidation for continual learning. The empirical Fisher (computed on training data, not on samples drawn from the model) is a cheap surrogate that has its own correctness caveats; understanding the gap matters for debugging modern training pipelines.

Mental Model

The log-likelihood $\ell(\theta) = \log p(X \mid \theta)$ is a random function whose shape near the true $\theta_0$ encodes how well the data identifies $\theta$. Three equivalent perspectives:

  • Variance of the score ($I = \mathbb E[s^2]$): the score $s = \partial_\theta \ell$ is a mean-zero random variable; large variance means individual samples push the estimate around in different directions, but with high amplitude (informative).
  • Expected curvature ($I = -\mathbb E[\partial_\theta^2 \ell]$): the log-likelihood landscape is sharply peaked around $\theta_0$, so the optimum is well-localized.
  • KL Hessian ($I = \nabla^2 \mathrm{KL}(p_{\theta_0} \,\|\, p_\eta) \big|_{\eta = \theta_0}$): nearby parameter values produce distributions that diverge sharply (in KL) as you move away from $\theta_0$, so they are statistically distinguishable.

The equivalence of the first two forms is the second Bartlett identity (the first identity, $\mathbb E[s] = 0$, supplies the mean-zero step); the equivalence with the third is the KL Hessian identity. All three say the same thing: high Fisher information means the data discriminates parameter values precisely. Low Fisher information means a flat likelihood landscape, ambiguous parameter values, and slow estimation.

Form, definition, and what it tells you:

  • Variance of score: $I(\theta) = \mathbb E_\theta[s(\theta;X)^2]$. How much one observation pushes the estimate around; large variance means each datum carries strong signal about $\theta$.
  • Negative expected Hessian: $I(\theta) = -\mathbb E_\theta[\nabla^2_\theta \log p(X;\theta)]$. How sharply peaked the log-likelihood is at the truth; large curvature means the optimum is well-localized.
  • Hessian of KL divergence: $I(\theta) = \nabla^2_\eta \mathrm{KL}(p_\theta \,\|\, p_\eta)\big|_{\eta = \theta}$. How fast nearby distributions diverge in KL; this is the Riemannian metric on the statistical manifold and the source of the natural gradient.

All three are equal under Cramér regularity. Pick the form that matches the question you are asking: variance-of-score for plug-in estimates from data, negative-Hessian for second-order optimization, KL-Hessian for parameterization-invariant geometry.
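
A minimal numerical check, assuming NumPy and arbitrary constants, that the three forms agree for a Bernoulli($\theta$) model, where the closed form is $I(\theta) = 1/[\theta(1-\theta)]$:

```python
import numpy as np

# Check that variance-of-score, negative expected Hessian, and the KL Hessian
# all give I(theta) = 1 / (theta * (1 - theta)) for a Bernoulli model.
theta = 0.3
rng = np.random.default_rng(0)
x = rng.binomial(1, theta, size=1_000_000).astype(float)

# Form 1: variance of the score s(theta; x) = x/theta - (1-x)/(1-theta), Monte Carlo.
score = x / theta - (1 - x) / (1 - theta)
I_score = np.var(score)

# Form 2: negative expected Hessian of the log-likelihood, Monte Carlo.
hessian = -x / theta**2 - (1 - x) / (1 - theta) ** 2
I_hess = -np.mean(hessian)

# Form 3: second derivative of KL(p_theta || p_eta) at eta = theta, central differences.
def kl(t, e):
    return t * np.log(t / e) + (1 - t) * np.log((1 - t) / (1 - e))

h = 1e-4
I_kl = (kl(theta, theta + h) - 2 * kl(theta, theta) + kl(theta, theta - h)) / h**2

print(I_score, I_hess, I_kl, 1 / (theta * (1 - theta)))  # all ~ 4.76
```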

The Fisher information matrix $I(\theta) \in \mathbb R^{d \times d}$ is the multivariate generalization. Its eigenstructure tells you which parameter directions are well-identified (large eigenvalues, sharp curvature) and which are weakly identified (small eigenvalues, flat directions, ill-conditioned for ordinary gradient descent). Preconditioning the gradient by $I^{-1}$ gives the natural gradient, which descends in distribution space rather than parameter space.

Formal Setup

Let $X$ be a random variable with density $p(x \mid \theta)$ for $\theta \in \Theta \subseteq \mathbb R$ (scalar case; we then generalize to $\theta \in \mathbb R^d$). Assume Cramér regularity: the support of $p$ does not depend on $\theta$, $\log p(x \mid \theta)$ is twice continuously differentiable in $\theta$ for almost every $x$, and differentiation under the integral is valid.

Definition

Score function

The score function is the gradient of the log-density:

$$s(\theta; x) = \nabla_\theta \log p(x \mid \theta).$$

Under Cramér regularity, $\mathbb E_\theta[s(\theta; X)] = 0$ for all $\theta$ (the first Bartlett identity). The score is a mean-zero random variable in $x$ that quantifies the local sensitivity of the likelihood to $\theta$.

Definition

Fisher information (scalar)

The Fisher information is the variance of the score:

$$I(\theta) = \mathrm{Var}_\theta[s(\theta; X)] = \mathbb E_\theta[s(\theta; X)^2].$$

For an i.i.d. sample $X_1, \ldots, X_n$, the total Fisher information of the sample is $n I(\theta)$ (additivity, see below).

Definition

Fisher information matrix

For $\theta \in \mathbb R^d$, the Fisher information matrix is the covariance of the score vector:

$$[I(\theta)]_{jk} = \mathbb E_\theta\!\left[\frac{\partial \log p(X \mid \theta)}{\partial \theta_j} \cdot \frac{\partial \log p(X \mid \theta)}{\partial \theta_k}\right].$$

The matrix is positive semidefinite (it is a covariance). It is positive definite under local identifiability at $\theta$, i.e., the score has full-rank covariance, so no nontrivial direction in parameter space leaves the log-likelihood locally flat. Global identifiability (different $\theta$ give different distributions) does not guarantee $I(\theta) \succ 0$: the parameterization $\mathcal{N}(\theta^3, 1)$ is globally identifiable, yet the score with respect to $\theta$ vanishes at $\theta = 0$ and $I(0) = 0$.

Definition

Observed information

The observed information is the negative Hessian of the log-likelihood evaluated on the actual data:

$$J(\theta; x) = -\nabla^2_\theta \log p(x \mid \theta).$$

For an i.i.d. sample, $J_n(\theta; X_{1:n}) = -\sum_{i=1}^n \nabla^2_\theta \log p(X_i \mid \theta)$. The observed information is random (a function of the data), unlike the Fisher information $I(\theta) = \mathbb E[J_1(\theta; X)]$, which is deterministic. By the law of large numbers, $\frac{1}{n} J_n(\theta; X_{1:n}) \to I(\theta)$ a.s., so the two agree asymptotically.
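
A quick sketch (NumPy, arbitrary constants) of that convergence for the Poisson family, where $J(\lambda; x) = x/\lambda^2$ and $I(\lambda) = 1/\lambda$:

```python
import numpy as np

# Observed information is random; its per-sample average converges to the
# deterministic Fisher information. Poisson example: I(lambda) = 1 / lambda.
lam = 2.5
rng = np.random.default_rng(1)

for n in (10, 1_000, 100_000):
    x = rng.poisson(lam, size=n)
    J_n = np.sum(x / lam**2)        # observed information of the whole sample
    print(n, J_n / n, 1 / lam)      # (1/n) J_n -> I(lambda) = 0.4
```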

Definition

Empirical Fisher

The empirical Fisher is the average outer product of the score over the training data (an uncentered sample second moment):

$$\hat I_{\mathrm{emp}}(\theta) = \frac{1}{n} \sum_{i=1}^n s(\theta; X_i)\, s(\theta; X_i)^\top.$$

In deep learning, $X_i$ are training examples and $s(\theta; X_i) = \nabla_\theta \log p_\theta(y_i \mid x_i)$ are per-example gradients. The empirical Fisher is not the same as the true Fisher unless the model is correctly specified at $\theta$ (see Kunstner et al. 2019 and the Confusions section). The true Fisher uses model samples $\tilde Y_i \sim p_\theta(\cdot \mid x_i)$, not training labels.

The Bartlett Identities

Theorem

Bartlett Identities (First and Second)

Statement

Under Cramér regularity:

First Bartlett identity: $\mathbb E_\theta[s(\theta; X)] = 0$.

Second Bartlett identity: $\mathrm{Var}_\theta[s(\theta; X)] = -\mathbb E_\theta[\nabla^2_\theta \log p(X \mid \theta)]$. Equivalently,

$$I(\theta) = \mathbb E_\theta[s s^\top] = -\mathbb E_\theta[\nabla^2_\theta \log p(X \mid \theta)].$$

Intuition

Both identities follow from differentiating $\int p(x \mid \theta) \, dx = 1$. The first comes from the first derivative; the second from the second derivative. Geometrically, the first says the log-likelihood gradient at the true $\theta$ has zero mean. The second says the variance of that gradient equals the expected curvature, an information equality that makes Fisher information computable in two equivalent ways.

Proof Sketch

First identity. Differentiate $\int p(x \mid \theta) \, dx = 1$ under the integral:

$$0 = \frac{d}{d\theta} \int p(x \mid \theta) \, dx = \int \frac{\partial p(x \mid \theta)}{\partial \theta} \, dx = \int s(\theta; x)\, p(x \mid \theta) \, dx = \mathbb E_\theta[s].$$

Second identity. Differentiate again:

$$0 = \frac{d^2}{d\theta^2} \int p(x \mid \theta) \, dx = \int \left[\frac{\partial^2 \log p}{\partial \theta^2} + \left(\frac{\partial \log p}{\partial \theta}\right)^2\right] p(x \mid \theta) \, dx = \mathbb E[\partial_\theta^2 \log p] + \mathbb E[s^2].$$

Rearrange: $\mathbb E[s^2] = -\mathbb E[\partial_\theta^2 \log p]$. Combined with the first identity ($\mathbb E[s] = 0$), the variance of the score equals its second moment, which equals the expected negative Hessian.

Why It Matters

The Bartlett identities are the foundation for everything Fisher information does. They give two computable forms for $I(\theta)$ (variance of score, or expected negative Hessian); they validate the Cramér-Rao bound's Cauchy-Schwarz step (which uses $\mathbb E[s] = 0$); and the second identity is what makes the Fisher information equal to local curvature (linking variance and geometry).

Failure Mode

Both identities require Cramér regularity. For $X \sim U(0, \theta)$, the support depends on $\theta$, so differentiation under the integral fails: the boundary contribution does not vanish. The score is $-1/\theta$ on $(0, \theta)$, so $\mathbb E[s] = -1/\theta \neq 0$, breaking the first identity. The second identity also breaks: $\mathbb E[s^2] = 1/\theta^2$ while $-\mathbb E[\partial_\theta^2 \log p] = -1/\theta^2$, so the two forms disagree, and neither yields a meaningful $1/(nI)$ rate: the MLE $\max_i X_i$ has variance $\Theta(1/n^2)$, faster than any regular rate. The right tool for non-regular families is the Hammersley-Chapman-Robbins bound.

In quasi-MLE settings (the model is misspecified), the second Bartlett identity fails: $\mathbb E_{p_*}[s s^\top] \neq -\mathbb E_{p_*}[\nabla^2 \log p_\theta]$, where $p_*$ is the true distribution. This is why misspecified MLE has the sandwich asymptotic variance $V^{-1} A V^{-1}$ instead of $I^{-1}$ (White's QMLE theorem on the MLE page).
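
A small sketch (NumPy, arbitrary constants) of the support-dependence failure described above: the first Bartlett identity holds for a regular family but fails for $U(0, \theta)$.

```python
import numpy as np

# Sanity check of E[score] = 0: holds for a Gaussian location family,
# fails for U(0, theta) because the support depends on theta.
rng = np.random.default_rng(2)
n = 1_000_000

# Regular case: X ~ N(mu, 1), score = x - mu, Monte Carlo mean ~ 0.
mu = 1.0
x = rng.normal(mu, 1.0, size=n)
print("Gaussian  E[score] ~", np.mean(x - mu))

# Non-regular case: X ~ U(0, theta), log p = -log(theta) on (0, theta),
# so the score is identically -1/theta and its mean is -1/theta != 0.
theta = 2.0
x = rng.uniform(0.0, theta, size=n)
score = np.full_like(x, -1.0 / theta)
print("Uniform   E[score] =", np.mean(score), "(should be", -1 / theta, ")")
```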

Corollary

Hessian Form (corollary of Bartlett II)

Statement

Under Cramér regularity, the Fisher information equals the expected negative Hessian of the log-likelihood:

$$I(\theta) = -\mathbb E_\theta[\nabla^2_\theta \log p(X \mid \theta)].$$

Intuition

Direct restatement of the second Bartlett identity. This is usually the easier form for computation: take the Hessian of the log-density (the model's curvature) and take the expectation under $p_\theta$.

Proof Sketch

This is the second Bartlett identity rearranged: $I = \mathbb E[s^2] = -\mathbb E[\partial^2 \log p]$.

Why It Matters

Computing $-\mathbb E[\partial^2 \log p]$ is usually a one-line calculation in standard families: the Gaussian gives $1/\sigma^2$ for $\mu$, $1/(2\sigma^4)$ for $\sigma^2$, and zero off-diagonals; the Bernoulli gives $1/[\theta(1-\theta)]$; the Poisson gives $1/\lambda$. The variance-of-score form $\mathbb E[s^2]$ is sometimes algebraically messier (you have to expand the squared score and integrate); the Hessian form is usually cleaner.

Failure Mode

Same as the second Bartlett identity: fails when the support depends on $\theta$, when the log-density is not twice differentiable, when expectations are infinite (heavy-tailed scores), or when the model is misspecified.

Fisher Information as the KL-Divergence Hessian

Theorem

Fisher Information is the KL Hessian

Statement

For any $\theta \in \Theta$,

$$\mathrm{KL}(p_\theta \,\|\, p_{\theta + \delta}) = \frac{1}{2} \delta^\top I(\theta)\, \delta + O(\|\delta\|^3) \quad \text{as } \delta \to 0.$$

Equivalently, the Fisher information matrix is the Hessian of the KL divergence at the diagonal:

$$I(\theta) = \nabla^2_\eta \mathrm{KL}(p_\theta \,\|\, p_\eta) \big|_{\eta = \theta}.$$

Intuition

The KL divergence is asymmetric and zero on the diagonal; locally, it must be a positive semidefinite quadratic form. The coefficient of that quadratic form is the Fisher information. So the Fisher information is the metric tensor of a (mathematically natural) Riemannian geometry on parameter space, in which the "distance" between $p_\theta$ and $p_\eta$ is measured by KL.

Proof Sketch

Expand $\log p(x \mid \theta + \delta) = \log p(x \mid \theta) + s(\theta; x)^\top \delta + \frac{1}{2} \delta^\top \nabla^2_\theta \log p(x \mid \theta)\, \delta + O(\|\delta\|^3)$. Take the expectation under $p_\theta$:

$$\mathbb E_{p_\theta}[\log p(X \mid \theta + \delta) - \log p(X \mid \theta)] = 0 + \frac{1}{2} \delta^\top \mathbb E_{p_\theta}[\nabla^2_\theta \log p(X \mid \theta)]\, \delta + O(\|\delta\|^3).$$

The first-order term vanishes by the first Bartlett identity. The KL divergence is $\mathrm{KL}(p_\theta \,\|\, p_{\theta + \delta}) = -\mathbb E_{p_\theta}[\log p(X \mid \theta + \delta) - \log p(X \mid \theta)]$, so

$$\mathrm{KL}(p_\theta \,\|\, p_{\theta + \delta}) = -\frac{1}{2} \delta^\top \mathbb E_{p_\theta}[\nabla^2 \log p]\, \delta + O(\|\delta\|^3) = \frac{1}{2} \delta^\top I(\theta)\, \delta + O(\|\delta\|^3),$$

using the second Bartlett identity to replace $-\mathbb E[\nabla^2 \log p]$ by $I(\theta)$.

Why It Matters

This identity is the single most important fact about Fisher information. It explains why $I(\theta)$ is the right Riemannian metric on parameter space (it is the second-order Taylor coefficient of KL); why preconditioning by $I^{-1}$ is parameterization invariant (KL is parameterization-free); and why the natural gradient is the right notion of steepest descent in distribution space. It also explains why the MLE is asymptotically efficient: the score $\sum_i s(\theta; X_i)$ is a sum of i.i.d. mean-zero terms with variance $I$ per observation, so by the CLT it is approximately $\mathcal N(0, nI)$, and the inverse Hessian of the log-likelihood (which is approximately $(nI)^{-1}$ by the second Bartlett identity and the LLN) maps this to the approximate $\mathcal N(0, I^{-1}/n)$ distribution of the MLE.

Failure Mode

The KL Hessian identity uses Cramér regularity (Bartlett identities). It also requires the KL divergence to be finite in a neighborhood of $\theta$, which can fail at the boundary of parameter space or for distributions with disjoint supports. For exponential families in canonical parameterization, the KL has the exact Bregman form $\mathrm{KL}(p_\eta \,\|\, p_{\eta'}) = A(\eta') - A(\eta) - \nabla A(\eta)^\top (\eta' - \eta)$, the Bregman divergence of the log-partition function $A$, whose Hessian in $\eta'$ is $\nabla^2 A(\eta') = I(\eta')$ at every point, not just on the diagonal; the quadratic expansion itself is exact whenever $A$ is quadratic, as for the Gaussian location family.
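
A short numerical sketch (NumPy, arbitrary constants) of the quadratic approximation $\mathrm{KL}(p_\theta \,\|\, p_{\theta+\delta}) \approx \tfrac{1}{2} I(\theta)\,\delta^2$: exact for the Gaussian location family, second-order accurate for the Bernoulli.

```python
import numpy as np

# Compare KL(p_theta || p_{theta+delta}) with (1/2) * I(theta) * delta^2.
def kl_gauss(mu0, mu1, sigma=1.0):     # exact: KL is quadratic in the mean
    return (mu1 - mu0) ** 2 / (2 * sigma**2)

def kl_bern(t0, t1):                   # exact only to second order in delta
    return t0 * np.log(t0 / t1) + (1 - t0) * np.log((1 - t0) / (1 - t1))

mu, sigma, theta = 0.0, 2.0, 0.3
I_gauss = 1 / sigma**2
I_bern = 1 / (theta * (1 - theta))

for delta in (0.1, 0.01, 0.001):
    print(delta,
          kl_gauss(mu, mu + delta, sigma), 0.5 * I_gauss * delta**2,   # equal
          kl_bern(theta, theta + delta), 0.5 * I_bern * delta**2)      # agree to O(delta^3)
```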

Fisher Information for n Samples

Proposition

Additivity for Independent Samples

Statement

If $X_1, \ldots, X_n$ are independent observations with score functions $s_i(\theta; X_i)$, the Fisher information of the joint sample is the sum:

$$I_n(\theta) = \sum_{i=1}^n I_i(\theta), \quad \text{where } I_i(\theta) = \mathrm{Var}[s_i(\theta; X_i)].$$

In the i.i.d. case, $I_n(\theta) = n I(\theta)$.

Intuition

The score of the joint log-likelihood is the sum of per-sample scores; variance of a sum of independent terms is the sum of variances. More data means proportionally more information.

Proof Sketch

$\log p(X_{1:n} \mid \theta) = \sum_{i=1}^n \log p_i(X_i \mid \theta)$. Take the gradient: $S_n = \sum_i s_i(\theta; X_i)$. By independence and the mean-zero score, $\mathrm{Var}(S_n) = \sum_i \mathrm{Var}(s_i) = \sum_i I_i(\theta)$.

Why It Matters

Additivity is what makes the rate $1/(nI)$ in the Cramér-Rao bound possible: each independent observation contributes a fixed quantum $I(\theta)$ of information. For dependent samples (time series, spatial data), additivity fails and the effective sample size is smaller: the Fisher information for the full block is typically less than $n$ times the single-observation Fisher information, with the gap captured by the autocorrelation function.

Failure Mode

Fails for dependent observations. Also fails when the model is misspecified jointly (e.g., the per-observation models are correct but the joint independence assumption is wrong); in that case the sandwich variance corrects the asymptotic distribution, but the simple multiplication by $n$ overstates the available information.

The Natural Gradient

Theorem

Parameterization Invariance of the Natural Gradient

Statement

The natural-gradient vector field $\theta \mapsto -I(\theta)^{-1} \nabla_\theta L(\theta)$ is coordinate-invariant: under a smooth invertible reparameterization $\theta = g(\phi)$ with Jacobian $J$, it transforms as a tangent vector, so the continuous-time flow $\dot\theta = -I(\theta)^{-1} \nabla_\theta L(\theta)$ describes the same trajectory in distribution space regardless of parameterization.

The discrete update $\theta \leftarrow \theta - \eta\, I(\theta)^{-1} \nabla_\theta L(\theta)$ is only first-order (infinitesimally) invariant: it is the Euler discretization of the invariant flow, and the two parameterizations agree to leading order in $\eta$ but disagree at $O(\eta^2)$. Exact finite-step invariance requires the Riemannian exponential map or a retraction along the Fisher metric; vanilla Euler steps in different coordinate systems trace different finite-step paths even when the underlying flow is the same.

Intuition

Ordinary gradient descent depends on which parameterization you use: the direction of steepest descent in Euclidean parameter space changes when you change coordinates. The natural gradient instead descends in the geometry induced by KL divergence, which is parameterization-free at the level of the vector field. Finite Euler steps approximate the invariant trajectory, with the approximation gap shrinking as the step size shrinks.

Proof Sketch

Under reparameterization $\theta = g(\phi)$:

  • The gradient transforms as $\nabla_\phi L = J^\top \nabla_\theta L$ (chain rule).
  • The Fisher matrix transforms as $I_\phi = J^\top I_\theta J$ (covariant tensor of rank 2).

The natural gradient in $\phi$-coordinates is $I_\phi^{-1} \nabla_\phi L = (J^\top I_\theta J)^{-1} J^\top \nabla_\theta L = J^{-1} I_\theta^{-1} J^{-\top} J^\top \nabla_\theta L = J^{-1} I_\theta^{-1} \nabla_\theta L$, which is the natural gradient in $\theta$-coordinates pulled back through $J^{-1}$. So the update direction in distribution space is the same.
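
A minimal numerical illustration of this algebra (plain Python, arbitrary constants): a Bernoulli model in the $\theta$ parameterization versus the logit parameterization $\theta = \mathrm{sigmoid}(\phi)$. The natural-gradient direction maps through the Jacobian to the same $\theta$-space direction; the ordinary gradient does not.

```python
# Model: Bernoulli(theta); loss: expected negative log-likelihood with data mean m.
m, theta = 0.7, 0.2                      # data mean and current parameter
J = theta * (1 - theta)                  # d theta / d phi for theta = sigmoid(phi)

grad_theta = (theta - m) / (theta * (1 - theta))   # dL/dtheta
I_theta = 1 / (theta * (1 - theta))                # Fisher in theta

grad_phi = J * grad_theta                # chain rule: dL/dphi
I_phi = J * I_theta * J                  # Fisher transforms as J^T I J

nat_theta = grad_theta / I_theta         # natural gradient, theta-coordinates
nat_phi = grad_phi / I_phi               # natural gradient, phi-coordinates

print("natural grad, phi direction mapped to theta:", J * nat_phi)   # == nat_theta
print("natural grad, computed directly in theta:   ", nat_theta)
print("vanilla grad, phi direction mapped to theta: ", J * grad_phi)  # != grad_theta
print("vanilla grad, computed directly in theta:    ", grad_theta)
```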

Why It Matters

This is the formal statement of why the natural gradient is "the right" descent direction for probabilistic models: it descends along the geodesic in KL geometry, independent of how the model is parameterized. Practical methods (TRPO's natural policy gradient, K-FAC, EWC) all derive from this invariance. The cost is that $I(\theta)^{-1}$ is expensive to compute exactly for high-dimensional models, motivating Kronecker-factored approximations.

Failure Mode

Requires $I(\theta)$ to be positive definite (model identifiable). When $I$ is rank-deficient, the natural gradient is not uniquely defined; one can use a generalized inverse $I^+$, but parameterization invariance breaks for directions in the kernel of $I$.

For deep networks, the empirical Fisher (computed from training data with training labels) is often substituted for the true Fisher (which requires sampling labels from the model), and this substitution breaks parameterization invariance unless the model is correctly specified. K-FAC uses the true Fisher with a Kronecker approximation; some implementations cut corners and use the empirical Fisher, with results that are sometimes good and sometimes poor depending on the regime.

The natural gradient appears in three modern tools:

  • K-FAC (Martens & Grosse 2015): approximates $I(\theta)^{-1}$ as a Kronecker product of small per-layer matrices, making natural-gradient descent feasible for deep networks. The Kronecker structure exploits the approximate per-layer factorization $I_\ell \approx A_\ell \otimes G_\ell$ for fully-connected layers, where $A_\ell$ is the activation covariance and $G_\ell$ the gradient covariance.

  • Natural policy gradient / TRPO (Kakade 2002; Schulman et al. 2015): in policy-gradient methods, the "loss" is the expected return, and the natural-gradient direction $I(\theta)^{-1} \nabla J(\theta)$ corresponds to the policy update that achieves a fixed KL bound. TRPO solves a constrained optimization with the KL constraint replaced by its second-order (Fisher quadratic) approximation.

  • EWC (Kirkpatrick et al. 2017): for continual learning, Elastic Weight Consolidation regularizes new-task training with a penalty $\sum_j I_{jj}(\theta_*) (\theta_j - \theta_{*j})^2$, where $\theta_*$ is the parameter vector from the previous task and $I_{jj}$ is the diagonal of the Fisher information matrix. The diagonal Fisher tells you which parameters were "important" for the previous task and should change slowly (a minimal sketch follows below).
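
A minimal sketch of an EWC-style penalty with a diagonal Fisher estimate (NumPy). The per-example gradients, shapes, and constants here are placeholders; how the gradients at the old-task optimum are computed is model-specific and assumed given.

```python
import numpy as np

def diagonal_fisher(per_example_grads):
    # Diagonal of (1/n) sum_i s_i s_i^T == mean of squared per-example gradients.
    return np.mean(per_example_grads**2, axis=0)

def ewc_penalty(theta, theta_star, fisher_diag, lam=1.0):
    # (lam/2) * sum_j F_jj * (theta_j - theta*_j)^2
    return 0.5 * lam * np.sum(fisher_diag * (theta - theta_star) ** 2)

# Toy usage with random placeholders (128 examples, 10 parameters).
rng = np.random.default_rng(3)
grads = rng.normal(size=(128, 10))       # hypothetical per-example score gradients
theta_star = rng.normal(size=10)         # old-task optimum
theta = theta_star + 0.1 * rng.normal(size=10)
F = diagonal_fisher(grads)
print(ewc_penalty(theta, theta_star, F, lam=100.0))
```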

Connection to Cramér-Rao

The most-cited consequence of Fisher information is the Cramér-Rao bound: for any unbiased estimator $\hat\theta$ of $\theta \in \mathbb R^d$,

$$\mathrm{Cov}_\theta(\hat\theta) \succeq \frac{1}{n} I(\theta)^{-1}$$

in the Loewner ordering. The MLE attains this bound asymptotically (under regularity): $\sqrt n (\hat\theta_n - \theta_0) \Rightarrow \mathcal N(0, I(\theta_0)^{-1})$. The dedicated Cramér-Rao bound page covers the proof, the multivariate Loewner-ordering form, the chain-rule extension for biased estimators, the sharper Bhattacharyya higher-order bounds, the Hammersley-Chapman-Robbins bound that does not require regularity, and the van Trees Bayesian extension.

The Cramér-Rao link is also why "Fisher information" deserves the name information: $I(\theta)$ is the information per observation about the parameter, and $1/(nI(\theta))$ is the minimum variance (for unbiased estimators) achievable from $n$ such observations.
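
A quick simulation sketch (NumPy, arbitrary constants) of the bound being attained: for $\mathcal N(\mu, \sigma^2)$ with known $\sigma^2$, the sample mean has variance exactly $\sigma^2/n = 1/(nI(\mu))$.

```python
import numpy as np

# Monte Carlo check that Var(xbar) matches the Cramér-Rao bound 1 / (n * I(mu)).
rng = np.random.default_rng(7)
mu, sigma, n = 1.0, 2.0, 50
I = 1 / sigma**2                                   # Fisher information per observation

xbars = rng.normal(mu, sigma, size=(200_000, n)).mean(axis=1)
print("Var(xbar)       ~", np.var(xbars))
print("CR bound 1/(nI) =", 1 / (n * I))            # sigma^2 / n = 0.08
```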

Canonical Examples

Example

Bernoulli

$X \sim \mathrm{Bernoulli}(\theta)$, so $\log p(x \mid \theta) = x \log \theta + (1-x) \log(1-\theta)$. Score: $s(\theta; x) = x/\theta - (1-x)/(1-\theta)$. By the Hessian form: $\partial^2 \log p / \partial \theta^2 = -x/\theta^2 - (1-x)/(1-\theta)^2$, with expectation $-1/\theta - 1/(1-\theta) = -1/[\theta(1-\theta)]$, so $I(\theta) = 1/[\theta(1-\theta)]$. The information is largest as $\theta \to 0$ or $\theta \to 1$ (extremes are easy to identify) and smallest, $I = 4$, at $\theta = 1/2$ (the most uncertain case is the hardest to estimate).

Example

Gaussian with known variance

$X \sim \mathcal N(\mu, \sigma^2)$ with $\sigma^2$ known. Log-density: $-\frac{(x-\mu)^2}{2\sigma^2} + \mathrm{const}$. Hessian in $\mu$: $-1/\sigma^2$, independent of $x$, so $I(\mu) = 1/\sigma^2$. The Cramér-Rao bound gives $\mathrm{Var}(\hat\mu) \geq \sigma^2/n$, attained exactly by $\bar X$.

Example

Gaussian with both unknown: information-orthogonal block

$X \sim \mathcal N(\mu, \sigma^2)$ with both unknown. Score:

$$s(\mu, \sigma^2; x) = \begin{pmatrix} (x-\mu)/\sigma^2 \\ -1/(2\sigma^2) + (x-\mu)^2/(2\sigma^4) \end{pmatrix}.$$

Fisher matrix:

$$I(\mu, \sigma^2) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 1/(2\sigma^4) \end{pmatrix}.$$

The off-diagonal is zero: $\mu$ and $\sigma^2$ are information-orthogonal. Estimating one does not lose precision on the other. This is special to the location-scale Gaussian; in most parameterizations of most families, the off-diagonals are nonzero (cf. the Gamma family).

Example

Exponential family in canonical form

$p(x \mid \eta) = h(x) \exp(\eta^\top T(x) - A(\eta))$ for $\eta \in \mathbb R^d$. Score: $s(\eta; x) = T(x) - \nabla A(\eta) = T(x) - \mu(\eta)$. The Fisher information equals the variance of $T$: $I(\eta) = \mathrm{Cov}(T(X)) = \nabla^2 A(\eta)$ (the Hessian of the log-partition function). This is the cleanest case: Fisher information is exactly the cumulant matrix of the sufficient statistic. The MLE for $\mu = \mathbb E[T]$ is the sample mean $\bar T$, which attains the Cramér-Rao bound at every $n$ (one of the few finite-sample efficient cases).
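
A quick sketch (NumPy, arbitrary constants) for the Poisson family in canonical form, where $\eta = \log \lambda$, $T(x) = x$, and $A(\eta) = e^\eta$, so $I(\eta) = A''(\eta) = \lambda = \mathrm{Var}(X)$:

```python
import numpy as np

# Exponential-family check: Fisher information in the canonical parameter equals
# the Hessian of the log-partition, i.e. the variance of the sufficient statistic.
lam = 3.0
eta = np.log(lam)
rng = np.random.default_rng(4)
x = rng.poisson(lam, size=1_000_000)

I_from_A = np.exp(eta)        # A''(eta) = exp(eta) = lambda
I_from_T = np.var(x)          # Var(T(X)) = Var(X), Monte Carlo
print(I_from_A, I_from_T)     # both ~ 3.0
```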

Example

Fisher information for a softmax / multinomial logistic

For a $K$-class multinomial logistic model with logit parameters $\eta_k$ for $k = 1, \ldots, K-1$ (with $\eta_K = 0$ for identifiability), the per-observation Fisher information matrix is $I(\eta) = \mathrm{diag}(\pi) - \pi \pi^\top$, where $\pi_k = e^{\eta_k}/\sum_j e^{\eta_j}$ and $\pi$ collects the predicted probabilities of the $K-1$ free classes. This is a $(K-1) \times (K-1)$ matrix; it is positive definite as long as no $\pi_k$ equals 0 or 1 (degenerate cases). The natural gradient for cross-entropy loss preconditions by $I^{-1}$, which corresponds to descending in the KL geometry on the simplex. This is the basis for K-FAC in classification networks.
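
A minimal Monte Carlo check of this formula (NumPy, arbitrary logits): sample labels from the model and compare the score outer products with the closed form.

```python
import numpy as np

# Check I(eta) = diag(pi) - pi pi^T over the K-1 free logits (eta_K = 0).
rng = np.random.default_rng(5)
K = 4
eta = np.array([0.5, -1.0, 0.3, 0.0])         # last logit pinned to 0
pi = np.exp(eta) / np.exp(eta).sum()

p = pi[:-1]                                   # probabilities of the K-1 free classes
I_closed = np.diag(p) - np.outer(p, p)

# Monte Carlo: sample labels FROM THE MODEL, form score outer products.
n = 500_000
y = rng.choice(K, size=n, p=pi)
onehot = np.eye(K)[y][:, :-1]                 # indicators of the K-1 free classes
scores = onehot - p                           # d/d eta_k log p(y) = 1{y=k} - pi_k
I_mc = scores.T @ scores / n

print(np.round(I_closed, 4))
print(np.round(I_mc, 4))
```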

Example

Empirical Fisher vs true Fisher in a misspecified neural network

Consider a regression network trained with squared loss: $y_i = f_\theta(x_i) + \epsilon_i$ with $\epsilon_i \sim \mathcal N(0, \sigma^2)$ assumed. The model implies $p_\theta(y \mid x) = \mathcal N(f_\theta(x), \sigma^2)$ with score $s_\theta(x, y) = (y - f_\theta(x))\, \nabla_\theta f_\theta(x) / \sigma^2$.

True Fisher: $I(\theta) = \mathbb E_{(x, y) \sim p_\theta}[s s^\top] = \mathbb E_x[\nabla_\theta f_\theta\, \nabla_\theta f_\theta^\top]/\sigma^2$ (after taking the expectation over $y \sim p_\theta(\cdot \mid x) = \mathcal N(f_\theta(x), \sigma^2)$). This is the Gauss-Newton matrix divided by $\sigma^2$.

Empirical Fisher: $\hat I_{\mathrm{emp}}(\theta) = \frac{1}{n} \sum_i (y_i - f_\theta(x_i))^2\, \nabla_\theta f_\theta\, \nabla_\theta f_\theta^\top / \sigma^4$. This weights each gradient outer product by the squared training residual rather than by its model expectation $\sigma^2$.

When the network has converged and nearly interpolates the data ($y_i \approx f_\theta(x_i)$), the empirical Fisher is approximately zero, while the true Fisher is finite. K-FAC uses the true Fisher; substituting the empirical Fisher in late training produces ill-conditioned preconditioners. Kunstner, Hennig, & Balles (2019, "Limitations of the empirical Fisher approximation for natural gradient descent", arXiv:1905.12558) document this gap explicitly.
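
A small sketch of that collapse (NumPy, arbitrary constants) for a linear-Gaussian model $p(y \mid x) = \mathcal N(\theta^\top x, 1)$: near an interpolating least-squares fit, the empirical Fisher shrinks with the residuals while the true (Gauss-Newton) Fisher does not.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, sigma_true = 200, 3, 0.01              # nearly noiseless data -> near-interpolation
X = rng.normal(size=(n, d))
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star + sigma_true * rng.normal(size=n)

theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # converged (least-squares) fit
resid = y - X @ theta_hat

# True Fisher (assumed sigma^2 = 1): (1/n) sum_i x_i x_i^T.
F_true = X.T @ X / n
# Empirical Fisher: residual-weighted outer products of the same gradients.
F_emp = (X * resid[:, None] ** 2).T @ X / n

print("true Fisher trace:     ", np.trace(F_true))   # order d
print("empirical Fisher trace:", np.trace(F_emp))    # roughly sigma_true^2 smaller
```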

Common Confusions

Watch Out

Fisher information is not the same as observed information

$I(\theta) = -\mathbb E[\nabla^2 \log p(X \mid \theta)]$ is the expected negative Hessian; it is a deterministic function of $\theta$. The observed information $J(\theta; x) = -\nabla^2 \log p(x \mid \theta)$ is the actual second derivative on the realized data; it is random. Asymptotically they agree (LLN), and for valid Wald-style inference one can use either. Some statisticians prefer the observed information because it conditions on the data and tends to give better finite-sample coverage (Efron & Hinkley 1978).

Watch Out

The Cramér-Rao bound governs unbiased estimators only

$\mathrm{Var}(\hat\theta) \geq 1/(nI)$ is not a bound on MSE. Biased estimators can have lower MSE: ridge regression, James-Stein shrinkage, posterior means, and regularized MLE all deliberately introduce bias to reduce variance. The bound is a strict statement about the unbiased class. For biased estimators, use the chain-rule version $\mathrm{Var}(T) \geq [\psi'(\theta)]^2 / (nI)$ where $\psi(\theta) = \mathbb E[T]$, then add the squared bias to get an MSE lower bound.

Watch Out

Empirical Fisher is not the true Fisher unless the model is correctly specified

The empirical Fisher $\hat I_{\mathrm{emp}}(\theta) = (1/n) \sum_i s(\theta; X_i) s(\theta; X_i)^\top$ uses the observed training data $X_i$ (in supervised problems, the training labels); the true Fisher uses model samples $\tilde X_i \sim p_\theta$. These agree only when the model is correctly specified at $\theta$ (so the data are actually drawn from $p_\theta$). For misspecified models, $\hat I_{\mathrm{emp}} \to A := \mathbb E_{p_*}[s s^\top]$ where $p_*$ is the true distribution; this differs from $I = -\mathbb E_{p_*}[\nabla^2 \log p_\theta]$. The QMLE sandwich variance $V^{-1} A V^{-1}$ uses both.

Watch Out

Fisher information matrix is not the Hessian of the training loss

For a deep network trained with cross-entropy, the Hessian of the training loss is $-\sum_i \nabla^2 \log p_\theta(y_i \mid x_i)$ (up to the $1/n$ normalization), the second derivative of the negative log-likelihood evaluated on training data. The Fisher information matrix is $\mathbb E_{(x, y) \sim p_\theta}[\nabla \log p\, \nabla \log p^\top]$, an expectation under the model. The Gauss-Newton matrix is yet a third object. All three coincide when the model is correctly specified and the training data come from the model; for real deep learning none of these conditions holds exactly, and the three objects can differ substantially. K-FAC and similar methods use the true Fisher (with sampled labels) to preserve the natural-gradient invariance.

Watch Out

High Fisher information does not mean high precision in absolute terms

$I(\theta) = 100$ does not mean the estimator has a small variance in absolute units; it means $\mathrm{Var}(\hat\theta) \geq 1/(100n) = 0.01/n$, relative to the natural scale of $\theta$. The Cramér-Rao bound carries units of $[\theta]^2$. The relevant comparison is the signal-to-noise ratio $\theta / \mathrm{sd}(\hat\theta) = \theta \sqrt{nI(\theta)}$, which is the right unitless measure of estimation precision.

Watch Out

Fisher information depends on parameterization (but the natural gradient does not)

If $\theta = g(\phi)$ is a smooth reparameterization, the Fisher information matrices are related by $I_\phi(\phi) = J^\top I_\theta(g(\phi))\, J$, where $J$ is the Jacobian. So $I(\theta)$ is a coordinate-dependent quantity (a covariant rank-2 tensor). The natural gradient $I^{-1} \nabla L$, however, is parameterization-invariant: it descends in the KL geometry, which is intrinsic to the family of distributions and does not depend on the coordinate labels.

Exercises

ExerciseCore

Problem

Compute the Fisher information $I(\lambda)$ for a single observation from $\mathrm{Poisson}(\lambda)$. State the Cramér-Rao bound for an unbiased estimator of $\lambda$ from $n$ i.i.d. observations and verify that the sample mean attains it.

ExerciseCore

Problem

Compute the Fisher information matrix for $\mathcal N(\mu, \sigma^2)$ with both parameters unknown. Verify that the off-diagonal is zero (information-orthogonality of $\mu$ and $\sigma^2$). What does information-orthogonality imply for the asymptotic distribution of $(\hat\mu_{\mathrm{MLE}}, \hat\sigma^2_{\mathrm{MLE}})$?

ExerciseAdvanced

Problem

Show that the natural gradient is parameterization-invariant: under $\theta = g(\phi)$ with Jacobian $J$, the natural-gradient step in $\phi$-space transforms to the natural-gradient step in $\theta$-space. Why does ordinary gradient descent fail this invariance?

ExerciseAdvanced

Problem

Derive the Fisher information of an exponential family in canonical form $p(x \mid \eta) = h(x) \exp(\eta T(x) - A(\eta))$. Show $I(\eta) = A''(\eta)$, the Hessian of the log-partition function. Then show that the Fisher information in the mean-value parameterization $\mu = A'(\eta)$ is $I_\mu(\mu) = 1/A''(\eta) = 1/I_\eta(\eta)$ (the reciprocal). Use this to verify that the natural-gradient updates in $\eta$ and $\mu$ coordinates are equivalent.

ExerciseResearch

Problem

For a fully-connected feedforward network with cross-entropy classification loss, derive the K-FAC approximation of the Fisher information matrix. Specifically, show that the Fisher block for a single layer with input activations $a$ and pre-softmax logits $z = W a$ has the form $I_W \approx \mathbb E[a a^\top] \otimes \mathbb E[g g^\top]$, where $g = \nabla_z \log p(y \mid z)$ is the gradient of the per-example log-likelihood at the model's output. State the assumption that makes the Kronecker factorization exact and the scenarios in which it is a poor approximation.

References

Canonical:

  • Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury. Chapter 7 (Fisher information, Cramér-Rao, efficiency).
  • Lehmann, E. L., & Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer. Chapters 2-3 (information bounds, UMVUE, efficient estimation).
  • Schervish, M. J. (1995). Theory of Statistics. Springer. Section 2.3 (information inequalities, Bartlett identities).
  • Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press. Original treatment (Section 32.3).
  • Cox, D. R., & Hinkley, D. V. (1974). Theoretical Statistics. Chapman & Hall. Chapter 4 (likelihood and Fisher information).
  • Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley. Chapter 11.10 (Fisher information, KL connection, Cramér-Rao).

Current and computational:

  • Amari, S. (2016). Information Geometry and Its Applications. Springer. Chapters 1-3 (Fisher metric as Riemannian structure, dual connections, exponential families).
  • Martens, J. (2020). "New insights and perspectives on the natural gradient method." Journal of Machine Learning Research, 21(146), 1-76. Comprehensive natural-gradient reference.
  • Martens, J., & Grosse, R. (2015). "Optimizing neural networks with Kronecker-factored approximate curvature." ICML, arXiv:1503.05671. The K-FAC method.
  • Kunstner, F., Hennig, P., & Balles, L. (2019). "Limitations of the empirical Fisher approximation for natural gradient descent." NeurIPS, arXiv:1905.12558. Why empirical Fisher is not the true Fisher.
  • Kakade, S. (2002). "A natural policy gradient." NeurIPS. The natural gradient applied to policy optimization.
  • Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). "Trust region policy optimization." ICML, arXiv:1502.05477. TRPO uses Fisher information for the KL constraint.
  • Kirkpatrick, J., et al. (2017). "Overcoming catastrophic forgetting in neural networks." PNAS, 114(13), 3521-3526. Elastic Weight Consolidation uses the diagonal Fisher.
  • van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press. Chapters 2-8 (asymptotic efficient estimation, LAN, Le Cam theory).

Critique and refinements:

  • Efron, B., & Hinkley, D. V. (1978). "Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information." Biometrika, 65(3), 457-487. Argues observed information often gives better finite-sample inference than expected.
  • White, H. (1982). "Maximum likelihood estimation of misspecified models." Econometrica, 50(1), 1-25. The QMLE sandwich variance: when the model is wrong, I1I^{-1} is replaced by V1AV1V^{-1} A V^{-1}.
  • Bernacchia, A., Lengyel, M., & Hennequin, G. (2018). "Exact natural gradient in deep linear networks and application to the nonlinear case." NeurIPS, arXiv:1803.06959. Conditions under which K-FAC's Kronecker factorization is exact.
  • Karakida, R., Akaho, S., & Amari, S. (2019). "Universal statistics of Fisher information in deep neural networks: Mean field approach." AISTATS, arXiv:1806.01316. Spectral structure of the Fisher matrix in deep networks.

Next Topics

  • Cramér-Rao bound: the inequality $\mathrm{Var}(\hat\theta) \geq 1/(nI)$ and its sharper variants (Bhattacharyya, HCR, van Trees).
  • Maximum likelihood estimation: the asymptotic distribution $\sqrt n (\hat\theta_n - \theta_0) \Rightarrow \mathcal N(0, I^{-1})$.
  • Asymptotic statistics: M- and Z-estimator framework, LAN families, contiguity, the convolution theorem.
  • Hypothesis testing for ML: Wald, Score, and Likelihood-Ratio tests built from Fisher information.
  • Minimax lower bounds: van Trees and Le Cam techniques, with Fisher information as the local rate.

Last reviewed: April 19, 2026
