Statistical Estimation
Fisher Information: Curvature, KL Geometry, and the Natural Gradient
The Fisher information measures how much a sample tells you about an unknown parameter. Coverage of the score function and Bartlett identities, the equivalence of variance-of-score and expected-negative-Hessian forms, the KL-divergence Hessian identity, the Fisher information matrix and Loewner ordering, the Cramér-Rao link, the natural gradient (parameterization invariance), K-FAC and empirical Fisher in deep learning, and applications to MLE asymptotic efficiency, EWC, and information geometry.
Why This Matters
The Fisher information is the local geometric object that controls how well a parametric model can be estimated. It plays three roles, all derived from the same definition. First, it is the variance of the score function, which sets the Cramér-Rao lower bound on the variance of any unbiased estimator: $\mathrm{Var}(\hat\theta) \ge 1/(n\,I(\theta))$. Second, it is the expected curvature of the log-likelihood at the truth, which is what the maximum likelihood estimator sees as it converges: $I(\theta) = -\mathbb{E}_\theta[\partial_\theta^2 \log p(X;\theta)]$. Third, it is the Hessian of the KL divergence at the diagonal: $D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta+\delta}) = \tfrac12\,\delta^\top I(\theta)\,\delta + o(\|\delta\|^2)$, which makes it the natural Riemannian metric on the statistical manifold and the source of the natural gradient.
In modern deep learning, the Fisher information matrix appears in three central tools: K-FAC (Kronecker-Factored Approximate Curvature) for second-order optimization, the natural policy gradient (TRPO, PPO antecedents), and Elastic Weight Consolidation for continual learning. The empirical Fisher (computed on training data, not on samples drawn from the model) is a cheap surrogate that has its own correctness caveats; understanding the gap matters for debugging modern training pipelines.
Mental Model
The log-likelihood $\log p(X;\theta)$ is a random function whose shape near the true $\theta$ encodes how well the data identifies $\theta$. Three equivalent perspectives:
- Variance of the score ($I(\theta) = \mathrm{Var}_\theta[s(X;\theta)]$): the score is a mean-zero random variable; large variance means individual samples push the estimate around with high amplitude, i.e., each sample is informative.
- Expected curvature ($I(\theta) = -\mathbb{E}_\theta[\partial_\theta^2 \log p(X;\theta)]$): the log-likelihood landscape is sharply peaked around $\theta$, so the optimum is well-localized.
- KL Hessian ($I(\theta) = \nabla_{\theta'}^2\, D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta'})\big|_{\theta'=\theta}$): nearby parameter values produce distributions that diverge sharply (in KL) as you move away from $\theta$, so they are statistically distinguishable.
The first two equivalences are the Bartlett identities; the third is the KL Hessian identity. All three say the same thing: high Fisher information means the data discriminates parameter values precisely. Low Fisher information means a flat likelihood landscape, ambiguous parameter values, and slow estimation.
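The three forms can be checked numerically for a Bernoulli model. A minimal sketch (assuming NumPy; `fisher_three_ways` is an illustrative helper, not a name from the text) computes the variance of the score, the negative expected Hessian, and a finite-difference KL Hessian, and all three should agree with the closed form $1/(p(1-p))$:

```python
import numpy as np

def score(x, p):
    # d/dp log Bern(x; p) = x/p - (1-x)/(1-p)
    return x / p - (1 - x) / (1 - p)

def fisher_three_ways(p, eps=1e-5):
    xs = np.array([0.0, 1.0])
    probs = np.array([1 - p, p])
    # 1) variance of the score (the score is mean-zero, so this is E[s^2])
    var_score = np.sum(probs * score(xs, p) ** 2)
    # 2) negative expected Hessian: -d^2/dp^2 log Bern = x/p^2 + (1-x)/(1-p)^2
    neg_exp_hess = np.sum(probs * (xs / p**2 + (1 - xs) / (1 - p) ** 2))
    # 3) second derivative of q -> KL(Bern(p) || Bern(q)) at q = p
    def kl(q):
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
    kl_hess = (kl(p + eps) - 2 * kl(p) + kl(p - eps)) / eps**2
    return var_score, neg_exp_hess, kl_hess

print(fisher_three_ways(0.3))   # all three ~ 1/(0.3 * 0.7) = 4.7619...
```

The first two forms are exact sums over the two outcomes; the third carries only finite-difference error.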
| Form | Definition | What it tells you |
|---|---|---|
| Variance of score | $I(\theta) = \mathrm{Var}_\theta[s(X;\theta)]$ | How much one observation pushes the estimate around; large variance means each datum carries strong signal about $\theta$. |
| Negative expected Hessian | $I(\theta) = -\mathbb{E}_\theta[\partial_\theta^2 \log p(X;\theta)]$ | How sharply peaked the log-likelihood is at the truth; large curvature means the optimum is well-localized. |
| Hessian of KL divergence | $I(\theta)$ is the Hessian of $D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta'})$ in $\theta'$, evaluated at $\theta' = \theta$ | How fast nearby distributions diverge in KL; the Riemannian metric on the statistical manifold and the source of the natural gradient. |
All three are equal under Cramér regularity. Pick the form that matches the question you are asking: variance-of-score for plug-in estimates from data, negative-Hessian for second-order optimization, KL-Hessian for parameterization-invariant geometry.
The Fisher information matrix is the multivariate generalization. Its eigenstructure tells you which parameter directions are well-identified (large eigenvalues, sharp curvature) and which are weakly identified (small eigenvalues, flat directions, ill-conditioned for ordinary gradient descent). Preconditioning the gradient by $I(\theta)^{-1}$ gives the natural gradient, which descends in distribution-space rather than parameter-space.
Formal Setup
Let $X$ be a random variable with density $p(x;\theta)$ for $\theta \in \Theta \subseteq \mathbb{R}$ (scalar case; we then generalize to $\theta \in \mathbb{R}^d$). Assume Cramér regularity: the support of $p(\cdot;\theta)$ does not depend on $\theta$, $\log p(x;\theta)$ is twice continuously differentiable in $\theta$ for almost every $x$, and differentiation under the integral is valid.
Score function
The score function is the gradient of the log-density:
$$s(x;\theta) = \nabla_\theta \log p(x;\theta).$$
Under Cramér regularity, $\mathbb{E}_\theta[s(X;\theta)] = 0$ for all $\theta$ (the first Bartlett identity). The score is a mean-zero random variable that quantifies the local sensitivity of the likelihood to $\theta$.
Fisher information (scalar)
The Fisher information is the variance of the score:
$$I(\theta) = \mathrm{Var}_\theta\big[s(X;\theta)\big] = \mathbb{E}_\theta\big[s(X;\theta)^2\big].$$
For an i.i.d. sample $X_1,\dots,X_n$, the total Fisher information of the sample is $n\,I(\theta)$ (additivity, see below).
Fisher information matrix
For $\theta \in \mathbb{R}^d$, the Fisher information matrix is the covariance of the score vector:
$$I(\theta) = \mathbb{E}_\theta\big[\nabla_\theta \log p(X;\theta)\,\nabla_\theta \log p(X;\theta)^\top\big].$$
The matrix is positive semidefinite (it is a covariance). It is positive definite under local identifiability at $\theta$, i.e., the score has full-rank covariance, so no nontrivial direction in parameter space leaves the log-likelihood locally flat. Global identifiability (different $\theta$ give different distributions) is not sufficient: the parameterization $\theta \mapsto \mathcal{N}(\theta^3, 1)$ is globally identifiable, yet the score with respect to $\theta$ vanishes at $\theta = 0$ and $I(0) = 0$.
Observed information
The observed information is the negative Hessian of the log-likelihood evaluated on the actual data:
$$J(x;\theta) = -\nabla_\theta^2 \log p(x;\theta).$$
For an i.i.d. sample, $J_n(\theta) = -\sum_{i=1}^n \nabla_\theta^2 \log p(X_i;\theta)$. The observed information is random (a function of the data), unlike the Fisher information, which is deterministic. By the law of large numbers, $\frac{1}{n} J_n(\theta) \to I(\theta)$ a.s., so the two agree asymptotically.
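The LLN convergence of the observed information can be seen directly. A small simulation sketch (assuming NumPy), using the Bernoulli model where the per-observation negative Hessian is $x/p^2 + (1-x)/(1-p)^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.3, 200_000
x = rng.binomial(1, p, size=n).astype(float)

# observed information at the true p: negative Hessian of the log-likelihood,
# summed over the sample
J_n = np.sum(x / p**2 + (1 - x) / (1 - p) ** 2)

I = 1 / (p * (1 - p))        # expected (Fisher) information per observation
print(J_n / n, I)            # J_n / n -> I(p) by the LLN
```

Note that $J_n$ depends on the realized sample while $n\,I(p)$ does not; the gap shrinks at the usual $n^{-1/2}$ rate.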
Empirical Fisher
The empirical Fisher is the sample covariance of the score evaluated on training data:
$$\hat I(\theta) = \frac{1}{n}\sum_{i=1}^n \nabla_\theta \log p(x_i;\theta)\,\nabla_\theta \log p(x_i;\theta)^\top.$$
In deep learning, $x_i$ are training examples and $\nabla_\theta \log p(x_i;\theta)$ are per-example gradients. The empirical Fisher is not the same as the true Fisher unless the model is correctly specified at $\theta$ (see Kunstner et al. 2019 and the Confusions section). The true Fisher uses model samples $x \sim p(\cdot;\theta)$, not training labels.
The Bartlett Identities
Bartlett Identities (First and Second)
Statement
Under Cramér regularity:
First Bartlett identity: $\mathbb{E}_\theta[s(X;\theta)] = 0$.
Second Bartlett identity: $\mathbb{E}_\theta\big[s(X;\theta)\,s(X;\theta)^\top\big] = -\mathbb{E}_\theta\big[\nabla_\theta^2 \log p(X;\theta)\big]$. Equivalently,
$$I(\theta) = \mathrm{Var}_\theta\big[s(X;\theta)\big] = -\mathbb{E}_\theta\big[\nabla_\theta^2 \log p(X;\theta)\big].$$
Intuition
Both identities follow from differentiating $\int p(x;\theta)\,dx = 1$. The first is the first derivative; the second is the second derivative. Geometrically, the first says the log-likelihood gradient at the true $\theta$ has zero mean. The second says the variance of that gradient equals the expected curvature, an information equality that makes the Fisher information computable in two equivalent ways.
Proof Sketch
First identity. Differentiate $\int p(x;\theta)\,dx = 1$ under the integral:
$$0 = \int \nabla_\theta\, p(x;\theta)\,dx = \int \nabla_\theta \log p(x;\theta)\; p(x;\theta)\,dx = \mathbb{E}_\theta\big[s(X;\theta)\big].$$
Second identity. Differentiate again:
$$0 = \int \nabla_\theta^2 \log p(x;\theta)\; p(x;\theta)\,dx + \int \nabla_\theta \log p(x;\theta)\,\nabla_\theta \log p(x;\theta)^\top\, p(x;\theta)\,dx.$$
Rearrange: $\mathbb{E}_\theta[s\,s^\top] = -\mathbb{E}_\theta[\nabla_\theta^2 \log p(X;\theta)]$. Combined with the first identity ($\mathbb{E}_\theta[s] = 0$), the variance of the score equals its second moment, which equals the expected negative Hessian.
Why It Matters
The Bartlett identities are the foundation for everything Fisher information does. They give two computable forms for $I(\theta)$ (variance of score, or expected negative Hessian); they validate the Cramér-Rao bound's Cauchy-Schwarz step (which uses $\mathbb{E}_\theta[s(X;\theta)] = 0$); and the second identity is what makes the Fisher information equal to local curvature (linking variance and geometry).
Failure Mode
Both identities require Cramér regularity. For $X \sim \mathrm{Unif}(0,\theta)$, the support $(0,\theta)$ depends on $\theta$, so differentiation under the integral fails: the boundary contribution does not vanish. The score is the constant $-1/\theta$ on $(0,\theta)$, so $\mathbb{E}_\theta[s(X;\theta)] = -1/\theta \neq 0$, breaking the first identity. The second identity also breaks: $\mathbb{E}_\theta[s^2] = 1/\theta^2$ but $-\mathbb{E}_\theta[\partial_\theta^2 \log p] = -1/\theta^2$, and neither describes the MLE $\max_i X_i$, whose variance is of order $\theta^2/n^2$ (faster than the regular $1/n$ rate). The right tool for non-regular families is the Hammersley-Chapman-Robbins bound.
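Both failure symptoms are easy to exhibit. A small sketch (assuming NumPy): the uniform score has nonzero mean by direct calculation, and simulation shows the MLE bias shrinking like $1/n$ rather than $1/\sqrt{n}$:

```python
import numpy as np

theta = 2.0
# For X ~ Unif(0, theta): log p(x; theta) = -log(theta) on (0, theta),
# so the score is the constant -1/theta and E[score] = -1/theta != 0:
# the first Bartlett identity fails because the support moves with theta.
mean_score = -1.0 / theta

# The MLE max_i X_i converges at the non-regular 1/n rate:
rng = np.random.default_rng(1)
biases = {}
for n in (100, 1000, 10000):
    maxes = rng.uniform(0, theta, size=(500, n)).max(axis=1)
    biases[n] = theta - maxes.mean()    # E[theta - max] = theta/(n+1)

print(mean_score)   # -0.5, not 0
print(biases)       # shrinks ~ 1/n
```

The $\theta/(n+1)$ bias formula comes from $\mathbb{E}[\max_i X_i] = \theta\, n/(n+1)$ for the uniform order statistic.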
In quasi-MLE settings (the model is misspecified), the second Bartlett identity fails: $G = \mathbb{E}_{P_0}[s\,s^\top] \neq -\mathbb{E}_{P_0}[\nabla_\theta^2 \log p] = H$, where $P_0$ is the true distribution. This is why misspecified MLE has the sandwich asymptotic variance $H^{-1} G\, H^{-1}$ instead of $I(\theta)^{-1}$ (White's QMLE theorem on the MLE page).
Hessian Form (corollary of Bartlett II)
Statement
Under Cramér regularity, the Fisher information equals the expected negative Hessian of the log-likelihood:
$$I(\theta) = -\mathbb{E}_\theta\big[\nabla_\theta^2 \log p(X;\theta)\big].$$
Intuition
Direct restatement of the second Bartlett identity. This is usually the easier form for computation: take the log-density's Hessian (the model's curvature) and take the expectation under $p(\cdot;\theta)$.
Proof Sketch
This is the second Bartlett identity rearranged: $\mathbb{E}_\theta[s\,s^\top] = -\mathbb{E}_\theta[\nabla_\theta^2 \log p(X;\theta)]$.
Why It Matters
Computing $-\mathbb{E}_\theta[\nabla_\theta^2 \log p]$ is usually a one-line calculation in standard families: the Gaussian $\mathcal{N}(\mu,\sigma^2)$ gives $1/\sigma^2$ for $\mu$, $1/(2\sigma^4)$ for $\sigma^2$, and zero off-diagonals; the Bernoulli gives $1/(p(1-p))$; the Poisson gives $1/\lambda$. The variance-of-score form is sometimes algebraically messier (you have to expand the squared score and integrate); the Hessian form is usually cleaner.
Failure Mode
Same as the second Bartlett identity: fails when the support depends on $\theta$, when the log-density is not twice differentiable, when expectations are infinite (heavy-tailed scores), or when the model is misspecified.
Fisher Information as the KL-Divergence Hessian
Fisher Information is the KL Hessian
Statement
For any $\theta$ in the interior of the parameter space,
$$D_{\mathrm{KL}}\big(p_\theta \,\|\, p_{\theta+\delta}\big) = \tfrac{1}{2}\,\delta^\top I(\theta)\,\delta + o(\|\delta\|^2).$$
Equivalently, the Fisher information matrix is the Hessian of the KL divergence at the diagonal:
$$I(\theta) = \nabla_{\theta'}^2\, D_{\mathrm{KL}}\big(p_\theta \,\|\, p_{\theta'}\big)\Big|_{\theta'=\theta}.$$
Intuition
The KL divergence is asymmetric and zero on the diagonal; its gradient also vanishes there, so locally it must be a positive semidefinite quadratic form. The coefficient of that quadratic form is the Fisher information. So the Fisher information is the metric tensor of a (mathematically natural) Riemannian geometry on parameter space, in which the "distance" between $\theta$ and $\theta + \delta$ is measured by KL.
Proof Sketch
Expand $\log p(x;\theta+\delta) = \log p(x;\theta) + \delta^\top s(x;\theta) + \tfrac12\,\delta^\top \nabla_\theta^2 \log p(x;\theta)\,\delta + o(\|\delta\|^2)$. Take the expectation under $p(\cdot;\theta)$:
$$\mathbb{E}_\theta\big[\log p(X;\theta+\delta) - \log p(X;\theta)\big] = \tfrac12\,\delta^\top\, \mathbb{E}_\theta\big[\nabla_\theta^2 \log p(X;\theta)\big]\,\delta + o(\|\delta\|^2).$$
The first-order term vanishes by the first Bartlett identity. The KL divergence is $D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta+\delta}) = \mathbb{E}_\theta[\log p(X;\theta) - \log p(X;\theta+\delta)]$, so
$$D_{\mathrm{KL}}\big(p_\theta \,\|\, p_{\theta+\delta}\big) = \tfrac12\,\delta^\top I(\theta)\,\delta + o(\|\delta\|^2),$$
using the second Bartlett identity to replace $-\mathbb{E}_\theta[\nabla_\theta^2 \log p(X;\theta)]$ by $I(\theta)$.
Why It Matters
This identity is the single most important fact about Fisher information. It explains why $I(\theta)$ is the right Riemannian metric on parameter space (it is the second-order Taylor coefficient of KL); why preconditioning by $I(\theta)^{-1}$ is parameterization invariant (KL is parameterization-free); and why the natural gradient is the right notion of steepest descent in distribution space. It also explains why MLE is asymptotically efficient: the sample score is a sum of i.i.d. mean-zero random variables with variance $I(\theta)$ per observation, so by the CLT $n^{-1/2}\sum_i s(X_i;\theta) \Rightarrow \mathcal{N}(0, I(\theta))$, and the inverse Hessian (which is $I(\theta)^{-1}$ by the second Bartlett identity and the LLN) maps this to $\sqrt{n}\,(\hat\theta_n - \theta) \Rightarrow \mathcal{N}(0, I(\theta)^{-1})$ for the MLE.
Failure Mode
The KL Hessian identity uses Cramér regularity (the Bartlett identities). It also requires the KL divergence to be finite in a neighborhood of $\theta$, which can fail at the boundary of parameter space or for distributions with disjoint supports. For exponential families in canonical parameterization, the identity is exact (not just up to $o(\|\delta\|^2)$): $D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta'}) = A(\theta') - A(\theta) - \nabla A(\theta)^\top(\theta' - \theta)$, which is the Bregman divergence of the log-partition function $A$ and has Hessian $\nabla^2 A(\theta') = I(\theta')$ everywhere.
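The identity can be verified by finite differences. A minimal sketch (assuming NumPy), using the closed-form Gaussian KL and the $(\mu, \sigma)$ parameterization, whose Fisher matrix is $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$:

```python
import numpy as np

def kl_gauss(mu0, s0, mu1, s1):
    # KL( N(mu0, s0^2) || N(mu1, s1^2) ), closed form
    return np.log(s1 / s0) + (s0**2 + (mu0 - mu1) ** 2) / (2 * s1**2) - 0.5

mu, sigma = 1.0, 2.0
eps = 1e-4

def kl(d1, d2):
    # KL from p_(mu, sigma) to the perturbed p_(mu + d1, sigma + d2)
    return kl_gauss(mu, sigma, mu + d1, sigma + d2)

# forward-difference Hessian of the KL at the diagonal
H = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        e_i = np.eye(2)[i] * eps
        e_j = np.eye(2)[j] * eps
        H[i, j] = (kl(*(e_i + e_j)) - kl(*e_i) - kl(*e_j) + kl(0.0, 0.0)) / eps**2

I_fisher = np.array([[1 / sigma**2, 0.0], [0.0, 2 / sigma**2]])  # (mu, sigma) parameterization
print(H)    # matches I_fisher up to finite-difference error
```

The off-diagonals come out zero, previewing the information-orthogonality of the Gaussian location and scale parameters discussed below.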
Fisher Information for n Samples
Additivity for Independent Samples
Statement
If $X_1,\dots,X_n$ are independent observations with score functions $s_i(\theta)$ and per-observation informations $I_i(\theta)$, the Fisher information of the joint sample is the sum:
$$I_{1:n}(\theta) = \sum_{i=1}^n I_i(\theta).$$
In the i.i.d. case, $I_{1:n}(\theta) = n\,I(\theta)$.
Intuition
The score of the joint log-likelihood is the sum of per-sample scores; variance of a sum of independent terms is the sum of variances. More data means proportionally more information.
Proof Sketch
The joint log-likelihood is $\ell_n(\theta) = \sum_{i=1}^n \log p(X_i;\theta)$. Take the gradient: $s_{1:n}(\theta) = \sum_{i=1}^n s_i(\theta)$. By independence and mean-zero scores, $\mathrm{Var}\big[s_{1:n}(\theta)\big] = \sum_{i=1}^n \mathrm{Var}\big[s_i(\theta)\big] = \sum_{i=1}^n I_i(\theta)$.
Why It Matters
Additivity is what makes the $1/n$ rate in the Cramér-Rao bound possible: each independent observation contributes a fixed quantum of information. For dependent samples (time series, spatial data), additivity fails and the effective sample size is smaller (the Fisher information for the full block is less than $n$ times the single-observation Fisher information; the gap is captured by the autocorrelation function).
Failure Mode
Fails for dependent observations. Also fails when the model is misspecified jointly (e.g., the per-observation models are correct but the joint independence assumption is wrong); in that case the sandwich variance corrects the asymptotic distribution but the simple multiplication by $n$ overstates the available information.
The Natural Gradient
Parameterization Invariance of the Natural Gradient
Statement
The natural-gradient vector field $v(\theta) = -I(\theta)^{-1}\nabla_\theta L(\theta)$ is coordinate-invariant: under a smooth invertible reparameterization $\phi = g(\theta)$ with Jacobian $J = \partial g / \partial\theta$, it transforms as a tangent vector ($v_\phi = J\,v_\theta$), so the continuous-time flow $\dot\theta = -I(\theta)^{-1}\nabla_\theta L(\theta)$ describes the same trajectory in distribution space regardless of parameterization.
The discrete update $\theta_{t+1} = \theta_t - \eta\, I(\theta_t)^{-1}\nabla_\theta L(\theta_t)$ is only first-order / infinitesimally invariant: it is the Euler discretization of the invariant flow, and the two parameterizations agree to leading order in $\eta$ but disagree at $O(\eta^2)$. Exact finite-step invariance requires using the Riemannian exponential map / a retraction along the Fisher metric; vanilla Euler steps in different coordinate systems trace different finite-step paths even when the underlying flow is the same.
Intuition
Ordinary gradient descent depends on which parameterization you use: the direction of steepest descent in Euclidean parameter space changes when you change coordinates. The natural gradient instead descends in the geometry induced by KL divergence, which is parameterization-free at the level of the vector field. Finite Euler steps approximate the invariant trajectory, with the approximation gap shrinking as the step size shrinks.
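A one-parameter Bernoulli model makes this concrete. A minimal sketch (assuming NumPy; the target $q$ and step size are illustrative) takes one natural-gradient step in the probability parameterization and one in the logit parameterization, and the resulting probabilities agree to $O(\eta^2)$:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

q = 0.9          # target probability; loss L(p) = -q log p - (1-q) log(1-p)
p0 = 0.2         # current model probability
eta = 1e-3

# natural gradient step in the p-parameterization: I(p) = 1/(p(1-p))
grad_p = (p0 - q) / (p0 * (1 - p0))                 # dL/dp
p_after_p_step = p0 - eta * (p0 * (1 - p0)) * grad_p

# natural gradient step in the logit parameterization t = log(p/(1-p)):
# I(t) = p(1-p), and dL/dt = p - q by the chain rule
t0 = np.log(p0 / (1 - p0))
t_after = t0 - eta * (p0 - q) / (p0 * (1 - p0))
p_after_t_step = sigmoid(t_after)

print(p_after_p_step, p_after_t_step)   # agree to O(eta^2)
```

By contrast, ordinary gradient steps in the two coordinate systems move $p$ by very different amounts, since the raw gradients differ by a factor of $p(1-p)$ rather than transforming as tangent vectors.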
Proof Sketch
Under the reparameterization $\phi = g(\theta)$ with Jacobian $J = \partial g/\partial\theta$:
- The gradient transforms as $\nabla_\phi L = J^{-\top}\,\nabla_\theta L$ (chain rule).
- The Fisher matrix transforms as $I_\phi(\phi) = J^{-\top} I_\theta(\theta)\, J^{-1}$ (covariant tensor of rank 2).
The natural gradient in $\phi$-coordinates is $I_\phi^{-1}\nabla_\phi L = J\, I_\theta^{-1}\nabla_\theta L$, which is the natural gradient in $\theta$-coordinates mapped forward through the Jacobian, exactly how a tangent vector transforms. So the update direction in distribution space is the same.
Why It Matters
This is the formal statement of why natural gradient is "the right" descent direction for probabilistic models: it descends along the geodesic in KL geometry, independent of how the model is parameterized. Practical methods (TRPO's natural policy gradient, K-FAC, EWC) all derive from this invariance. The cost is that $I(\theta)$ is expensive to compute and invert exactly for high-dimensional models, motivating Kronecker-factored approximations.
Failure Mode
Requires $I(\theta)$ to be positive definite (model identifiable). When $I(\theta)$ is rank-deficient, the natural gradient is not uniquely defined; use a generalized inverse $I(\theta)^{+}$, but the parameterization invariance breaks for the directions in the kernel of $I(\theta)$.
For deep networks, the empirical Fisher (computed from training data with training labels) is often substituted for the true Fisher (which requires sampling labels from the model), and this substitution breaks parameterization invariance unless the model is correctly specified. K-FAC uses the true Fisher with a Kronecker approximation; some implementations cut corners and use the empirical Fisher, with results that are sometimes good and sometimes poor depending on the regime.
The natural gradient appears in three modern tools:
- K-FAC (Martens & Grosse 2015): approximates $I(\theta)$ as a Kronecker product of small per-layer matrices, making natural-gradient descent feasible for deep networks. The Kronecker structure exploits the fact that for fully-connected layers, the per-layer Fisher block has the approximate form $A \otimes G$, where $A = \mathbb{E}[a\,a^\top]$ is the activation covariance and $G = \mathbb{E}[g\,g^\top]$ the pre-activation gradient covariance.
- Natural policy gradient / TRPO (Kakade 2002; Schulman et al. 2015): in policy gradient methods, the "loss" is the expected return and the natural gradient direction corresponds to the policy update that achieves a fixed KL bound. TRPO solves a constrained optimization with the KL constraint linearized to a Fisher quadratic form.
- EWC (Kirkpatrick et al. 2017): for continual learning, Elastic Weight Consolidation regularizes new-task training with a penalty $\frac{\lambda}{2}\sum_j F_j\,(\theta_j - \theta^*_j)^2$, where $\theta^*$ is the parameter from the previous task and $F_j$ is the diagonal of the Fisher information matrix there. The diagonal Fisher tells you which parameters were "important" for the previous task and should change slowly.
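The EWC penalty is a few lines once the diagonal Fisher is in hand. A minimal sketch (assuming NumPy; function names are illustrative, and the per-example gradients should come from model-sampled labels for the true Fisher, per the caveats above):

```python
import numpy as np

def diagonal_fisher(per_example_grads):
    # per_example_grads: (n, d) array of per-example log-likelihood gradients;
    # the diagonal Fisher is the mean of squared gradients, E[grad^2]
    return np.mean(per_example_grads**2, axis=0)

def ewc_penalty(theta, theta_star, F_diag, lam):
    # (lam/2) * sum_j F_j (theta_j - theta*_j)^2
    return 0.5 * lam * np.sum(F_diag * (theta - theta_star) ** 2)

def ewc_grad(theta, theta_star, F_diag, lam):
    # gradient of the penalty, added to the new task's loss gradient
    return lam * F_diag * (theta - theta_star)
```

Parameters with large $F_j$ are pinned near $\theta^*$; parameters in flat directions of the old task's likelihood are free to move.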
Connection to Cramér-Rao
The most-cited consequence of Fisher information is the Cramér-Rao bound: for any unbiased estimator $\hat\theta$ of $\theta$,
$$\mathrm{Cov}_\theta(\hat\theta) \succeq \frac{1}{n}\, I(\theta)^{-1}$$
in the Loewner ordering. The MLE attains this bound asymptotically (under regularity): $\sqrt{n}\,(\hat\theta_n - \theta) \Rightarrow \mathcal{N}\big(0,\, I(\theta)^{-1}\big)$. The dedicated Cramér-Rao bound page covers the proof, the multivariate Loewner ordering form, the chain-rule extension for biased estimators, the sharper Bhattacharyya higher-order bounds, the Hammersley-Chapman-Robbins bound that does not require regularity, and the van Trees Bayesian extension.
The Cramér-Rao link is also why "Fisher information" deserves the name information: $I(\theta)$ is the information per observation about the parameter, and $1/(n\,I(\theta))$ is the minimum variance (for unbiased estimators) achievable from $n$ such observations.
Canonical Examples
Bernoulli
$X \sim \mathrm{Bern}(p)$, so $\log p(x;p) = x\log p + (1-x)\log(1-p)$. Score: $s(x;p) = x/p - (1-x)/(1-p)$. By the Hessian form: $-\partial_p^2 \log p(x;p) = x/p^2 + (1-x)/(1-p)^2$, expectation $1/p + 1/(1-p)$, so $I(p) = \frac{1}{p(1-p)}$. Maximum as $p \to 0$ or $p \to 1$ (extremes are easy to identify); minimum at $p = 1/2$ (the most uncertain case is the hardest to estimate).
Gaussian with known variance
$X \sim \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known. Log-density $-\frac{(x-\mu)^2}{2\sigma^2} + \mathrm{const}$. Hessian in $\mu$: $\partial_\mu^2 \log p = -1/\sigma^2$, constant in $x$ and $\mu$, so $I(\mu) = 1/\sigma^2$. The Cramér-Rao bound gives $\mathrm{Var}(\hat\mu) \ge \sigma^2/n$, attained exactly by $\bar X_n$.
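Attainment is easy to see by simulation. A minimal sketch (assuming NumPy) compares the replicated variance of the sample mean with $\sigma^2/n$:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 1.0, 2.0, 50, 50_000

# variance of the sample mean over many replications vs the Cramer-Rao bound
xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
crb = sigma**2 / n            # 1 / (n I(mu)) with I(mu) = 1/sigma^2
print(xbar.var(), crb)        # the sample mean attains the bound
```

This is one of the rare finite-sample-exact cases; most estimators only approach the bound asymptotically.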
Gaussian with both unknown: information-orthogonal block
$X \sim \mathcal{N}(\mu, \sigma^2)$ with both unknown. Score:
$$s_\mu = \frac{x-\mu}{\sigma^2}, \qquad s_{\sigma^2} = -\frac{1}{2\sigma^2} + \frac{(x-\mu)^2}{2\sigma^4}.$$
Fisher matrix:
$$I(\mu, \sigma^2) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 1/(2\sigma^4) \end{pmatrix}.$$
The off-diagonal is zero: $\mu$ and $\sigma^2$ are information-orthogonal. Estimating one does not lose precision on the other. This is special to the location-scale Gaussian; in most parameterizations of most families, the off-diagonals are nonzero (cf. the Gamma family).
Exponential family in canonical form
$p(x;\theta) = h(x)\exp\big(\theta^\top T(x) - A(\theta)\big)$ for $\theta$ in the natural parameter space. Score: $s(x;\theta) = T(x) - \nabla A(\theta)$. Fisher information equals the variance of $T(X)$: $I(\theta) = \mathrm{Cov}_\theta[T(X)] = \nabla^2 A(\theta)$ (the Hessian of the log-partition function). This is the cleanest case: Fisher information is exactly the cumulant matrix of the sufficient statistic. The MLE for the mean parameter $\mu = \mathbb{E}_\theta[T(X)]$ is the sample mean $\bar T_n = \frac{1}{n}\sum_i T(X_i)$, which attains the Cramér-Rao bound at every $\theta$ (one of the few finite-sample efficient cases).
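The Poisson family makes this concrete. A minimal sketch (assuming NumPy): in canonical form $\theta = \log\lambda$, $T(x) = x$, and $A(\theta) = e^\theta$, so $I(\theta) = A''(\theta) = \lambda$, which should match the Monte Carlo variance of the sufficient statistic:

```python
import numpy as np

# Poisson in canonical form: theta = log(lam), T(x) = x, A(theta) = exp(theta),
# so I(theta) = A''(theta) = lam, which equals Var(T(X)) = Var(X)
lam = 3.0
theta = np.log(lam)
rng = np.random.default_rng(3)
x = rng.poisson(lam, size=200_000)

print(np.exp(theta))   # A''(theta) = lam
print(x.var())         # Monte Carlo Var(T(X)), also ~ lam
```
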
Fisher information for a softmax / multinomial logistic
For a $K$-class multinomial logistic with logit parameters $\theta_k$ for $k = 1,\dots,K-1$ (with $\theta_K = 0$ for identifiability), the per-observation Fisher information matrix is $I(\theta) = \mathrm{diag}(\pi) - \pi\pi^\top$, where $\pi \in \mathbb{R}^{K-1}$ is the predicted probability vector for the free classes. This is a $(K-1)\times(K-1)$ matrix; it is positive definite as long as no class probability equals 0 or 1 (degenerate cases). The natural gradient for cross-entropy loss preconditions by $I(\theta)^{-1}$, which corresponds to descending in the KL geometry on the simplex. This is the basis for K-FAC in classification networks.
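The formula can be cross-checked against the definition of the Fisher matrix as the score covariance. A minimal sketch (assuming NumPy; the logit values are illustrative) for $K = 3$, where the score for $\theta_k$ is $\mathbf{1}\{y=k\} - \pi_k$:

```python
import numpy as np

rng = np.random.default_rng(4)
theta = np.array([0.5, -0.2])            # free logits; the last class's logit pinned at 0
logits = np.append(theta, 0.0)
pi_full = np.exp(logits) / np.exp(logits).sum()
pi = pi_full[:-1]                        # probabilities of the K-1 free classes

# analytic per-observation Fisher for the free logits
I_analytic = np.diag(pi) - np.outer(pi, pi)

# Monte Carlo check: the score for theta_k is 1{y=k} - pi_k, mean zero,
# so the Fisher matrix is the second moment of the score vector
y = rng.choice(3, size=400_000, p=pi_full)
scores = (y[:, None] == np.arange(2)[None, :]).astype(float) - pi
I_mc = scores.T @ scores / len(y)

print(I_analytic)
print(I_mc)      # agree up to Monte Carlo error
```
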
Empirical Fisher vs true Fisher in a misspecified neural network
Consider a regression network trained with squared loss: $y = f(x;\theta) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ assumed. The model implies $p(y \mid x;\theta) = \mathcal{N}\big(y;\, f(x;\theta),\, 1\big)$ with score $\nabla_\theta \log p(y \mid x;\theta) = \big(y - f(x;\theta)\big)\,\nabla_\theta f(x;\theta)$.
True Fisher: $I(\theta) = \mathbb{E}_x\big[\nabla_\theta f(x;\theta)\,\nabla_\theta f(x;\theta)^\top\big]$ (after taking the expectation over $y \sim p(\cdot \mid x;\theta)$, which replaces the squared residual by its model variance). This is the Gauss-Newton matrix (divided by $\sigma^2$ in the general noise-variance case).
Empirical Fisher: $\hat I(\theta) = \frac{1}{n}\sum_i \big(y_i - f(x_i;\theta)\big)^2\, \nabla_\theta f(x_i;\theta)\,\nabla_\theta f(x_i;\theta)^\top$. This weights each gradient outer product by the squared training residual, not by 1.
When the network has converged ($y_i - f(x_i;\theta) \approx 0$ on the training set), the empirical Fisher is approximately zero, while the true Fisher is finite. K-FAC uses the true Fisher; substituting the empirical Fisher in late training produces ill-conditioned preconditioners. Kunstner, Hennig, & Balles (2019, "Limitations of the empirical Fisher approximation for natural gradient descent", arXiv:1905.12558) document this gap explicitly.
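A minimal linear-model instance of this (assuming NumPy; the linear model $f(x;w) = w^\top x$ stands in for the network, so the per-example gradient is just $x$) shows the empirical Fisher collapsing at convergence while the true Fisher stays well-conditioned:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 1000, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                   # noiseless labels, so the fit can converge exactly

w = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ w                # ~ 0 at convergence

# per-example gradient of f(x; w) = w . x is just x, so:
F_true = X.T @ X / n                          # true Fisher (Gauss-Newton)
F_emp = (X * resid[:, None] ** 2).T @ X / n   # empirical Fisher, residual-weighted

print(np.linalg.norm(F_true))   # order 1: usable curvature
print(np.linalg.norm(F_emp))    # ~ 0: degenerate preconditioner at convergence
```

Inverting the near-zero empirical Fisher (even with damping) is exactly the ill-conditioning failure described above.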
Common Confusions
Fisher information is not the same as observed information
$I(\theta)$ is the expected negative Hessian; it is a deterministic function of $\theta$. The observed information $J_n(\theta)$ is the actual second derivative on the realized data; it is random. Asymptotically they agree (LLN), and for valid Wald-style inference one can use either. Some statisticians prefer the observed information because it conditions on the data and tends to give better finite-sample coverage (Efron-Hinkley 1978).
The Cramér-Rao bound governs unbiased estimators only
The bound $1/(n\,I(\theta))$ is not a bound on MSE. Biased estimators can have lower MSE: ridge regression, James-Stein shrinkage, posterior means, and regularized MLE all systematically introduce bias to reduce variance. The bound is a strict statement about the unbiased class. For biased estimators, use the chain-rule version $\mathrm{Var}_\theta(\hat\theta) \ge \big(\psi'(\theta)\big)^2 / \big(n\,I(\theta)\big)$ where $\psi(\theta) = \mathbb{E}_\theta[\hat\theta]$, then add the squared bias to get an MSE lower bound.
Empirical Fisher is not the true Fisher unless the model is correctly specified
The empirical Fisher uses training labels $y_i$; the true Fisher uses model samples $y \sim p(\cdot \mid x;\theta)$. These agree only when the model is correctly specified at $\theta$ (so the information equality holds in expectation). For misspecified models, $\hat I(\theta) \to G = \mathbb{E}_{P_0}[s\,s^\top]$ where $P_0$ is the true distribution; this differs from $H = -\mathbb{E}_{P_0}[\nabla_\theta^2 \log p]$. The QMLE sandwich variance $H^{-1} G\, H^{-1}$ uses both.
Fisher information matrix is not the Hessian of the training loss
For a deep network trained with cross-entropy, the Hessian of the training loss is approximately $-\frac{1}{n}\sum_i \nabla_\theta^2 \log p(y_i \mid x_i;\theta)$, the second derivative of the log-likelihood evaluated on training data. The Fisher information matrix is $\mathbb{E}_{y \sim p(\cdot \mid x;\theta)}\big[\nabla_\theta \log p\,\nabla_\theta \log p^\top\big]$, an expectation under the model. The Gauss-Newton matrix is yet a third object. All three coincide when the model is correctly specified and the training data is from the model; for real deep learning none of these conditions hold exactly, and the three objects can differ substantially. K-FAC and similar methods use the true Fisher (with sampled labels) to preserve the natural-gradient invariance.
High Fisher information does not mean high precision in absolute terms
Large $I(\theta)$ does not mean the estimator has a small variance in absolute units; it means the lower bound $1/(n\,I(\theta))$ is small relative to the natural scale of $\theta$. The Cramér-Rao bound is dimensionally $[\theta]^2$. The relevant comparison is the signal-to-noise ratio $\theta\sqrt{I(\theta)}$, which is the right unitless measure of estimation precision.
Fisher information depends on parameterization (but the natural gradient does not)
If $\phi = g(\theta)$ is a smooth reparameterization, the Fisher information matrices are related by $I_\phi(\phi) = J^{-\top} I_\theta(\theta)\, J^{-1}$, where $J = \partial g/\partial\theta$ is the Jacobian. So $I(\theta)$ is a coordinate-dependent quantity (a covariant rank-2 tensor). The natural gradient $I(\theta)^{-1}\nabla_\theta L$, however, is parameterization-invariant: it descends in the KL geometry, which is intrinsic to the family of distributions and does not depend on coordinate labels.
Exercises
Problem
Compute the Fisher information for a single observation from $\mathrm{Poisson}(\lambda)$. State the Cramér-Rao bound for an unbiased estimator of $\lambda$ from $n$ i.i.d. observations and verify the sample mean attains it.
Problem
Compute the Fisher information matrix for $\mathcal{N}(\mu, \sigma^2)$ with both parameters unknown. Verify the off-diagonal is zero (information-orthogonality of $\mu$ and $\sigma^2$). What does information-orthogonality imply for the asymptotic joint distribution of $(\hat\mu_n, \hat\sigma^2_n)$?
Problem
Show that the natural gradient is parameterization-invariant: under $\phi = g(\theta)$ with Jacobian $J$, the natural gradient step in $\phi$-space transforms to the natural gradient step in $\theta$-space. Why does ordinary gradient descent fail this invariance?
Problem
Derive the Fisher information of an exponential family in canonical form $p(x;\theta) = h(x)\exp\big(\theta^\top T(x) - A(\theta)\big)$. Show $I(\theta) = \nabla^2 A(\theta)$, the Hessian of the log-partition. Then show the Fisher information in the mean-value parameterization $\mu = \nabla A(\theta)$ is $I(\mu) = \big(\nabla^2 A(\theta)\big)^{-1}$ (the reciprocal, in the scalar case). Use this to verify the natural gradient updates in $\theta$ and $\mu$ coordinates are equivalent.
Problem
For a fully-connected feedforward network with cross-entropy classification loss, derive the K-FAC approximation of the Fisher information matrix. Specifically, show that the Fisher block for a single layer with input activations $a$ and pre-activation outputs $s$ has the form $F \approx \mathbb{E}[a\,a^\top] \otimes \mathbb{E}[g\,g^\top]$, where $g = \nabla_s \log p$ is the gradient of the per-example log-likelihood with respect to the layer's pre-activations. State the assumption that makes the Kronecker factorization exact and the scenarios in which it is a poor approximation.
References
Canonical:
- Casella, G., & Berger, R. L. (2002). Statistical Inference (2nd ed.). Duxbury. Chapter 7 (Fisher information, Cramér-Rao, efficiency).
- Lehmann, E. L., & Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer. Chapters 2-3 (information bounds, UMVUE, efficient estimation).
- Schervish, M. J. (1995). Theory of Statistics. Springer. Section 2.3 (information inequalities, Bartlett identities).
- Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press. Original treatment (Section 32.3).
- Cox, D. R., & Hinkley, D. V. (1974). Theoretical Statistics. Chapman & Hall. Chapter 4 (likelihood and Fisher information).
- Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley. Chapter 11.10 (Fisher information, KL connection, Cramér-Rao).
Current and computational:
- Amari, S. (2016). Information Geometry and Its Applications. Springer. Chapters 1-3 (Fisher metric as Riemannian structure, dual connections, exponential families).
- Martens, J. (2020). "New insights and perspectives on the natural gradient method." Journal of Machine Learning Research, 21(146), 1-76. Comprehensive natural-gradient reference.
- Martens, J., & Grosse, R. (2015). "Optimizing neural networks with Kronecker-factored approximate curvature." ICML, arXiv:1503.05671. The K-FAC method.
- Kunstner, F., Hennig, P., & Balles, L. (2019). "Limitations of the empirical Fisher approximation for natural gradient descent." NeurIPS, arXiv:1905.12558. Why empirical Fisher is not the true Fisher.
- Kakade, S. (2002). "A natural policy gradient." NeurIPS. The natural gradient applied to policy optimization.
- Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). "Trust region policy optimization." ICML, arXiv:1502.05477. TRPO uses Fisher information for the KL constraint.
- Kirkpatrick, J., et al. (2017). "Overcoming catastrophic forgetting in neural networks." PNAS, 114(13), 3521-3526. Elastic Weight Consolidation uses the diagonal Fisher.
- van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press. Chapters 2-8 (asymptotic efficient estimation, LAN, Le Cam theory).
Critique and refinements:
- Efron, B., & Hinkley, D. V. (1978). "Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information." Biometrika, 65(3), 457-487. Argues observed information often gives better finite-sample inference than expected.
- White, H. (1982). "Maximum likelihood estimation of misspecified models." Econometrica, 50(1), 1-25. The QMLE sandwich variance: when the model is wrong, $I(\theta)^{-1}$ is replaced by $H^{-1} G\, H^{-1}$ with $H = -\mathbb{E}_{P_0}[\nabla_\theta^2 \log p]$ and $G = \mathbb{E}_{P_0}[s\,s^\top]$.
- Bernacchia, A., Lengyel, M., & Hennequin, G. (2018). "Exact natural gradient in deep linear networks and application to the nonlinear case." NeurIPS, arXiv:1803.06959. Conditions under which K-FAC's Kronecker factorization is exact.
- Karakida, R., Akaho, S., & Amari, S. (2019). "Universal statistics of Fisher information in deep neural networks: Mean field approach." AISTATS, arXiv:1806.01316. Spectral structure of the Fisher matrix in deep networks.
Next Topics
- Cramér-Rao bound: the inequality and its sharper variants (Bhattacharyya, HCR, van Trees).
- Maximum likelihood estimation: the asymptotic distribution $\sqrt{n}\,(\hat\theta_n - \theta_0) \Rightarrow \mathcal{N}\big(0,\, I(\theta_0)^{-1}\big)$.
- Asymptotic statistics: M- and Z-estimator framework, LAN families, contiguity, the convolution theorem.
- Hypothesis testing for ML: Wald, Score, and Likelihood-Ratio tests built from Fisher information.
- Minimax lower bounds: van Trees and Le Cam techniques, with Fisher information as the local rate.
Last reviewed: April 19, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Positive Semidefinite Matrices (layer 0A · tier 1)
- Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency (layer 0B · tier 1)
- KL Divergence (layer 1 · tier 1)
- Sufficient Statistics and Exponential Families (layer 0B · tier 2)
- Basu's Theorem (layer 0B · tier 3)
Derived topics
- Asymptotic Statistics: M-Estimators, Delta Method, LAN (layer 0B · tier 1)
- Cramér-Rao Bound: Information Inequality, Achievability, and Sharper Variants (layer 0B · tier 1)
- Likelihood-Ratio, Wald, and Score Tests (layer 2 · tier 1)
- Minimax Lower Bounds: Le Cam, Fano, Assouad, and the Reduction to Testing (layer 3 · tier 1)
- Score Matching (layer 3 · tier 1)
+3 more on the derived-topics page.