Mathematical Infrastructure
The Jacobian Matrix
The matrix of all first partial derivatives of a vector-valued function: encodes the best linear approximation, connects to the chain rule in matrix form, and is the backbone of backpropagation.
Prerequisites
- Differentiation in ℝⁿ
Why This Matters
The Jacobian is the derivative of a vector-valued function. It is the matrix that makes the linear approximation $f(x + h) \approx f(x) + J_f(x)\, h$ work for functions from $\mathbb{R}^n$ to $\mathbb{R}^m$. Backpropagation is a chain of Jacobian multiplications. When you compose layers in a neural network, the chain rule multiplies Jacobians. Forward mode automatic differentiation computes Jacobian-vector products (JVPs). Reverse mode (backprop) computes vector-Jacobian products (VJPs).
The notation $\partial$ ("partial") replaces the ordinary $d$ when more than one input varies.
Beyond deep learning, the Jacobian appears in coordinate transformations (the Jacobian determinant is the volume change factor), in the inverse function theorem, and in sensitivity analysis throughout science and engineering.
Mental Model
The Jacobian is one object with three complementary readings: the table of all first partial derivatives, the best linear approximation to $f$ near a point (the matrix that sends a tiny input step $h$ to the first-order output change $J_f(x)\, h$), and the factor that multiplies in the chain rule, which is what backpropagation exploits.
For a scalar function $f: \mathbb{R} \to \mathbb{R}$, the derivative gives the best linear approximation: $f(x + h) \approx f(x) + f'(x)\, h$.
For a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$ (a map between spaces studied in differentiation in $\mathbb{R}^n$), the Jacobian matrix plays exactly the same role. It is the $m \times n$ matrix $J_f(x)$ such that:
$$f(x + h) \approx f(x) + J_f(x)\, h.$$
The Jacobian generalizes the derivative to maps between multi-dimensional spaces. Each row is the gradient of one output component; each column shows how one input affects all outputs.
Core Definitions
Jacobian Matrix
For a differentiable function $f: \mathbb{R}^n \to \mathbb{R}^m$ with component functions $f_1, \dots, f_m$, the Jacobian matrix at a point $x$ is the $m \times n$ matrix:
$$\big(J_f(x)\big)_{ij} = \frac{\partial f_i}{\partial x_j}(x).$$
Explicitly:
$$J_f(x) = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix}.$$
Row $i$ is the gradient of the $i$-th component $f_i$ (transposed). Column $j$ shows how all outputs change when input $x_j$ is perturbed.
Special Cases by Dimension
The Jacobian unifies several familiar derivative objects:
- $m = 1$ (scalar-valued, $f: \mathbb{R}^n \to \mathbb{R}$): the Jacobian is a $1 \times n$ row vector, the transpose of the gradient: $J_f(x) = \nabla f(x)^\top$.
- $n = 1$ (curve in space, $f: \mathbb{R} \to \mathbb{R}^m$): the Jacobian is an $m \times 1$ column vector, the tangent vector $f'(t)$.
- $m = n$ (square case): the Jacobian is $n \times n$ and has a well-defined determinant.
The Linear Approximation
The defining property of the Jacobian is that it gives the best linear approximation:
$$f(x + h) = f(x) + J_f(x)\, h + o(\|h\|),$$
where $o(\|h\|)$ means the error goes to zero faster than $\|h\|$. This is the multi-dimensional version of $f(x + h) = f(x) + f'(x)\, h + o(|h|)$. The Jacobian is the unique matrix with this property (when $f$ is differentiable).
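As a quick numerical check of this property, the sketch below picks an arbitrary map $f: \mathbb{R}^2 \to \mathbb{R}^3$ (the function, base point, and step sizes are illustrative choices, not from the text) and confirms that the approximation error shrinks faster than $\|h\|$:

```python
import numpy as np

def f(p):
    """Illustrative map f: R^2 -> R^3."""
    x, y = p
    return np.array([x * y, np.sin(x), x + y**2])

def jacobian_f(p):
    """Analytic Jacobian of f: rows = output components, columns = inputs."""
    x, y = p
    return np.array([[y,         x      ],
                     [np.cos(x), 0.0    ],
                     [1.0,       2.0 * y]])

x0 = np.array([0.7, -1.2])
J = jacobian_f(x0)

direction = np.array([0.6, -0.8])          # fixed direction for the step h
for scale in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = scale * direction
    err = np.linalg.norm(f(x0 + h) - f(x0) - J @ h)
    # err / ||h|| shrinks roughly linearly in ||h||, i.e. the error is o(||h||)
    print(f"||h|| = {np.linalg.norm(h):.1e}   error/||h|| = {err / np.linalg.norm(h):.2e}")
```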
The Chain Rule in Matrix Form
Chain Rule for Jacobians
Statement
Let $f: \mathbb{R}^n \to \mathbb{R}^m$ and $g: \mathbb{R}^m \to \mathbb{R}^p$. The Jacobian of the composition $g \circ f$ is the product of Jacobians:
$$J_{g \circ f}(x) = J_g\big(f(x)\big)\, J_f(x).$$
This is a $p \times n$ matrix, computed as the product of a $p \times m$ matrix and an $m \times n$ matrix.
Intuition
The chain rule says: the linear approximation of a composition is the composition of the linear approximations. Since linear maps compose by matrix multiplication, Jacobians multiply. Each Jacobian represents "how much things change locally," and the chain rule says these local changes multiply.
Proof Sketch
Using the linear approximation:
$$g\big(f(x + h)\big) \approx g\big(f(x) + J_f(x)\, h\big) \approx g\big(f(x)\big) + J_g\big(f(x)\big)\, J_f(x)\, h.$$
The first step uses the linear approximation of $f$, and the second uses the linear approximation of $g$. The composite linear map is $J_g(f(x))\, J_f(x)$, which is the Jacobian of $g \circ f$ by uniqueness of the derivative.
Making this rigorous requires controlling the error terms, which is done via the standard multivariable chain rule proof.
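The statement can also be checked numerically. The sketch below uses two small hand-picked maps (illustrative choices, not from the text) and compares the product of their analytic Jacobians with a finite-difference Jacobian of the composition:

```python
import numpy as np

def f(p):                      # f: R^2 -> R^2
    x, y = p
    return np.array([x + y**2, x * y])

def Jf(p):
    x, y = p
    return np.array([[1.0, 2 * y], [y, x]])

def g(q):                      # g: R^2 -> R^3
    u, v = q
    return np.array([np.sin(u), u * v, v**2])

def Jg(q):
    u, v = q
    return np.array([[np.cos(u), 0.0], [v, u], [0.0, 2 * v]])

def numerical_jacobian(h, x, eps=1e-6):
    """Central-difference Jacobian of h at x, built column by column."""
    cols = []
    for j in range(len(x)):
        e = np.zeros_like(x); e[j] = eps
        cols.append((h(x + e) - h(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

x0 = np.array([0.3, -0.8])
chain = Jg(f(x0)) @ Jf(x0)                            # product of Jacobians
numeric = numerical_jacobian(lambda p: g(f(p)), x0)   # Jacobian of the composition
print(np.max(np.abs(chain - numeric)))                # tiny: the two agree
```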
Why It Matters
This is the theoretical foundation of backpropagation. A neural network is a composition of layers: $f = f_L \circ \cdots \circ f_2 \circ f_1$. The Jacobian of the entire network is:
$$J_f = J_{f_L}\, J_{f_{L-1}} \cdots J_{f_1},$$
with each factor evaluated at the output of the previous layer.
Backprop computes this product efficiently as a sequence of vector-Jacobian products, starting from the output (loss) side and working back toward the input; this is reverse mode. The chain rule is the reason backpropagation works.
Failure Mode
The chain rule requires differentiability. Functions with kinks (like ReLU at zero) are not classically differentiable there. The often-quoted defense — "non-differentiable points have measure zero, so the chain rule works almost everywhere" — needs context:
- For piecewise-smooth networks (ReLU, leaky ReLU, max-pooling), under mild assumptions (e.g. absolutely continuous input distribution and generic weights), the preactivation hyperplanes form a Lebesgue-null set, and classical derivatives + chain rule are valid almost surely.
- For discrete inputs, constrained or low-dimensional manifolds, degenerate parameter settings, or optimization paths that intentionally hit kinks, the a.e. story is not enough — the optimizer can spend non-trivial time at non-differentiable points.
- In practice, autodiff frameworks do not apply the classical chain rule at kinks; they pick a particular subgradient or one-sided derivative by convention (e.g. PyTorch and JAX take $\mathrm{ReLU}'(0) = 0$). The right mathematical framework is Clarke generalized derivatives or subdifferential calculus; gradient-descent convergence on such piecewise-smooth functions is then governed by results from non-smooth optimization (Davis et al. 2020 on stochastic subgradient methods for tame functions).
The Jacobian Determinant
When $m = n$ (same input and output dimension), the Jacobian is a square $n \times n$ matrix and has a determinant.
Jacobian Determinant
The Jacobian determinant $\det J_f(x)$ measures how $f$ locally scales volumes. If $R$ is a small region near $x$:
$$\operatorname{vol}\big(f(R)\big) \approx \big|\det J_f(x)\big| \cdot \operatorname{vol}(R).$$
This is the basis for the change of variables formula in integration:
$$\int_{f(R)} \phi(y)\, dy = \int_{R} \phi\big(f(x)\big)\, \big|\det J_f(x)\big|\, dx.$$
The sign of the Jacobian determinant indicates whether preserves or reverses orientation. A positive determinant means the orientation is preserved; negative means it is reversed.
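A minimal numerical illustration of the volume-scaling reading: map a tiny coordinate square through an arbitrary nonlinear map (all specific choices below are illustrative) and compare its image area with $|\det J_f|$ times the original area:

```python
import numpy as np

def f(p):                       # an illustrative nonlinear map R^2 -> R^2
    x, y = p
    return np.array([x + 0.5 * y**2, x * y + y])

def Jf(p):
    x, y = p
    return np.array([[1.0, y], [y, x + 1.0]])

x0 = np.array([0.4, 0.9])
eps = 1e-4

# The image of the tiny square with corner x0 and side eps is, to first order,
# the parallelogram spanned by the images of its two edge vectors.
e1 = f(x0 + np.array([eps, 0.0])) - f(x0)
e2 = f(x0 + np.array([0.0, eps])) - f(x0)
image_area = abs(np.linalg.det(np.column_stack([e1, e2])))

predicted = abs(np.linalg.det(Jf(x0))) * eps**2     # |det J| * original area
print(image_area, predicted)                        # agree to leading order; gap ~ eps**3
```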
The Inverse Function Theorem
Inverse Function Theorem
Statement
Let $f: \mathbb{R}^n \to \mathbb{R}^n$ be continuously differentiable and suppose $\det J_f(a) \neq 0$ at a point $a$. Then there exists a neighborhood of $a$ on which $f$ is a diffeomorphism (smooth bijection with smooth inverse), and the Jacobian of the inverse at $b = f(a)$ is:
$$J_{f^{-1}}(b) = \big(J_f(a)\big)^{-1}.$$
The derivative of the inverse is the inverse of the derivative.
Intuition
If the Jacobian at a point is invertible (nonzero determinant), then $f$ is locally one-to-one and onto. It does not collapse any directions. The linear approximation is invertible, so the nonlinear function is locally invertible too. And the derivative of the inverse is what you would guess: invert the derivative.
Proof Sketch
The proof uses the contraction mapping theorem. Define $\varphi_y(x) = x + J_f(a)^{-1}\big(y - f(x)\big)$ for a target value $y$ near $f(a)$. Since $J_f(a)$ is invertible and $f$ is $C^1$, $\varphi_y$ is a contraction near $a$, so it has a unique fixed point $x$ with $f(x) = y$. This gives local invertibility. The formula for the Jacobian of the inverse follows from the chain rule applied to $f^{-1} \circ f = \mathrm{id}$.
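As a concrete check of the conclusion, the sketch below uses the polar-to-Cartesian map and its explicit inverse (an illustrative choice; any locally invertible map would do) and verifies that the numerically estimated Jacobian of the inverse is the matrix inverse of $J_f$:

```python
import numpy as np

def f(p):                          # polar -> Cartesian, invertible for r > 0
    r, th = p
    return np.array([r * np.cos(th), r * np.sin(th)])

def f_inv(q):                      # Cartesian -> polar (a local inverse, away from the origin)
    x, y = q
    return np.array([np.hypot(x, y), np.arctan2(y, x)])

def Jf(p):
    r, th = p
    return np.array([[np.cos(th), -r * np.sin(th)],
                     [np.sin(th),  r * np.cos(th)]])

def numerical_jacobian(h, x, eps=1e-6):
    cols = []
    for j in range(len(x)):
        e = np.zeros_like(x); e[j] = eps
        cols.append((h(x + e) - h(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

p0 = np.array([2.0, 0.6])                  # (r, theta) with r > 0, so det Jf = r != 0
J = Jf(p0)
J_inv_at_image = numerical_jacobian(f_inv, f(p0))
print(J_inv_at_image @ J)                  # ~ identity: J_{f^-1}(f(p0)) = Jf(p0)^{-1}
```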
Why It Matters
The inverse function theorem is foundational in differential geometry, in implicit methods for ODEs, and, in machine learning, in normalizing flows. Normalizing flows require computing $\log \big|\det J_f\big|$ to evaluate the density of transformed distributions. The inverse function theorem guarantees that the flow is locally invertible whenever the Jacobian is nonsingular.
Failure Mode
The theorem is local: $f$ may be locally invertible near every point but not globally invertible. Example: $f(x, y) = (e^x \cos y,\; e^x \sin y)$ on $\mathbb{R}^2$ has $\det J_f = e^{2x} > 0$ everywhere, so $f$ is locally invertible near every point. But $f$ is not globally invertible ($f(x, y + 2\pi) = f(x, y)$). Also, the Jacobian must be nonsingular at the point in question. If $\det J_f(a) = 0$, the function may fold or collapse directions near $a$.
The Jacobian in Backpropagation
The connection between the Jacobian and automatic differentiation is fundamental:
Jacobian-vector product (JVP), forward mode:
Given a vector $v \in \mathbb{R}^n$, compute $J_f(x)\, v$. This tells you how a perturbation $v$ in the input propagates to the output. Cost: one forward pass. Efficient when $n \ll m$ (few inputs, many outputs).
Vector-Jacobian product (VJP), reverse mode (backpropagation):
Given a vector $u \in \mathbb{R}^m$, compute $u^\top J_f(x)$. This tells you how each input component affects the output along direction $u$. Cost: one backward pass. Efficient when $m \ll n$ (few outputs, many inputs).
For a loss function $L: \mathbb{R}^n \to \mathbb{R}$ (one output, $n$ parameters), reverse mode computes the gradient $\nabla L$ in one backward pass. This is why backprop is efficient for neural network training.
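A minimal sketch of the two products using JAX's `jax.jvp` and `jax.vjp` (the function, point, and vectors below are arbitrary illustrations, not fixed by the text):

```python
import jax
import jax.numpy as jnp

def f(x):                                   # illustrative f: R^3 -> R^2
    return jnp.stack([x[0] * x[1], jnp.sin(x[2]) + x[0]])

x = jnp.array([1.0, 2.0, 0.5])

# Forward mode: Jacobian-vector product J_f(x) @ v, one forward pass.
v = jnp.array([1.0, 0.0, -1.0])
y, jvp_out = jax.jvp(f, (x,), (v,))

# Reverse mode: vector-Jacobian product u^T J_f(x), one backward pass.
u = jnp.array([1.0, -2.0])
y2, vjp_fn = jax.vjp(f, x)
(vjp_out,) = vjp_fn(u)

# Cross-check against the full Jacobian (fine here because the dimensions are tiny).
J = jax.jacfwd(f)(x)
print(jnp.allclose(jvp_out, J @ v), jnp.allclose(vjp_out, u @ J))
```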
Canonical Examples
Jacobian of polar-to-Cartesian transformation
The transformation from polar coordinates $(r, \theta)$ to Cartesian coordinates $(x, y)$ is:
$$x = r \cos\theta, \qquad y = r \sin\theta.$$
The Jacobian is:
$$J = \begin{bmatrix} \dfrac{\partial x}{\partial r} & \dfrac{\partial x}{\partial \theta} \\[4pt] \dfrac{\partial y}{\partial r} & \dfrac{\partial y}{\partial \theta} \end{bmatrix} = \begin{bmatrix} \cos\theta & -r \sin\theta \\ \sin\theta & r \cos\theta \end{bmatrix}.$$
The Jacobian determinant is:
$$\det J = r \cos^2\theta + r \sin^2\theta = r.$$
This is the familiar factor $r$ in the polar coordinate area element $dA = r\, dr\, d\theta$. The Jacobian determinant directly gives you the area scaling factor for coordinate transformations.
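A short numeric spot-check of this determinant at a few arbitrary points (the sample values are illustrative):

```python
import numpy as np

def jacobian_polar(r, th):
    """Jacobian of (r, theta) -> (r cos(theta), r sin(theta))."""
    return np.array([[np.cos(th), -r * np.sin(th)],
                     [np.sin(th),  r * np.cos(th)]])

for r, th in [(0.5, 0.1), (2.0, 1.2), (3.7, -2.5)]:
    det = np.linalg.det(jacobian_polar(r, th))
    print(np.isclose(det, r))        # det J = r, independent of theta
```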
Jacobian of a two-layer neural network
Consider a simple network $f(x) = W_2\, \sigma(W_1 x)$, where $\sigma$ is an elementwise activation. By the chain rule:
$$J_f(x) = W_2 \,\operatorname{diag}\!\big(\sigma'(W_1 x)\big)\, W_1,$$
where $W_1 \in \mathbb{R}^{h \times n}$ and $W_2 \in \mathbb{R}^{m \times h}$.
Each layer contributes its Jacobian as a factor in the product. The $\operatorname{diag}$ term comes from the elementwise nonlinearity. Backpropagation computes VJPs through this product starting from the output side: first $u^\top W_2$, then the diagonal factor, then $W_1$.
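The sketch below checks the hand-derived product against JAX's autodiff Jacobian and runs one reverse-mode VJP through the same network; the layer sizes and the choice of $\sigma = \tanh$ are illustrative assumptions, not fixed by the text:

```python
import jax
import jax.numpy as jnp

n, h, m = 4, 5, 3                                   # input, hidden, output sizes
key = jax.random.PRNGKey(0)
k1, k2, kx = jax.random.split(key, 3)
W1 = jax.random.normal(k1, (h, n))
W2 = jax.random.normal(k2, (m, h))
x = jax.random.normal(kx, (n,))

def net(x):
    return W2 @ jnp.tanh(W1 @ x)

# Chain rule by hand: J_f(x) = W2 · diag(sigma'(W1 x)) · W1, with sigma = tanh.
z = W1 @ x
J_manual = W2 @ jnp.diag(1.0 - jnp.tanh(z) ** 2) @ W1

# Autodiff Jacobian and a reverse-mode VJP through the same composition.
J_auto = jax.jacrev(net)(x)
u = jnp.ones(m)
_, vjp_fn = jax.vjp(net, x)
(vjp_out,) = vjp_fn(u)

print(jnp.allclose(J_manual, J_auto, atol=1e-5))    # the two Jacobians match
print(jnp.allclose(vjp_out, u @ J_manual, atol=1e-5))
```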
Common Confusions
The Jacobian is a matrix, the gradient is a vector
For $f: \mathbb{R}^n \to \mathbb{R}$ (scalar output), the gradient $\nabla f(x)$ is a column vector, while the Jacobian is a row vector. They carry the same information: $J_f(x) = \nabla f(x)^\top$. But for $f: \mathbb{R}^n \to \mathbb{R}^m$ with $m > 1$ (vector output), the "gradient" is not well-defined as a single vector; the Jacobian is the correct generalization. Each row of the Jacobian is the gradient of one component.
JVP vs VJP is not just a transpose
A Jacobian-vector product and a vector-Jacobian product involve the same matrix but are computed differently by autodiff. They are not interchangeable: JVPs propagate perturbations forward (tangent mode), while VJPs propagate sensitivities backward (adjoint/cotangent mode). For a composition $f = f_L \circ \cdots \circ f_1$, the JVP accumulates $J_{f_L} \cdots J_{f_1} v$ in the order of the forward evaluation, starting from $J_{f_1} v$; the VJP accumulates $u^\top J_{f_L} \cdots J_{f_1}$ in the reverse order, starting from $u^\top J_{f_L}$. This choice of accumulation order is the difference between forward mode and reverse mode automatic differentiation.
Jacobian determinant only makes sense for square Jacobians
If $f: \mathbb{R}^n \to \mathbb{R}^m$ with $m \neq n$, the Jacobian is rectangular and has no determinant. Volume scaling in this case is described by $\sqrt{\det\big(J_f^\top J_f\big)}$ (for $m \geq n$, measuring how $f$ scales $n$-dimensional volumes in $\mathbb{R}^m$). The change-of-variables formula with $|\det J_f|$ requires $m = n$.
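A small sketch of the rectangular case, using an illustrative surface parametrization $f: \mathbb{R}^2 \to \mathbb{R}^3$ (a paraboloid, chosen only for this example) and comparing $\sqrt{\det(J^\top J)}$ with its closed-form area factor:

```python
import numpy as np

def f(p):                          # surface parametrization R^2 -> R^3
    u, v = p
    return np.array([u, v, u**2 + v**2])

def Jf(p):
    u, v = p
    return np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [2 * u, 2 * v]])

p0 = np.array([0.5, -1.0])
J = Jf(p0)                         # 3 x 2: rectangular, so det(J) is undefined
area_factor = np.sqrt(np.linalg.det(J.T @ J))
closed_form = np.sqrt(1 + 4 * p0[0]**2 + 4 * p0[1]**2)
print(area_factor, closed_form)    # both give the same local area-scaling factor
```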
Summary
- The Jacobian contains all first partial derivatives: $\big(J_f(x)\big)_{ij} = \partial f_i / \partial x_j$
- Linear approximation: $f(x + h) \approx f(x) + J_f(x)\, h$
- Chain rule: $J_{g \circ f}(x) = J_g(f(x))\, J_f(x)$; Jacobians multiply
- Jacobian determinant: measures local volume scaling; gives the $|\det J_f(x)|$ factor in the change of variables formula
- Inverse function theorem: if $\det J_f(a) \neq 0$, $f$ is locally invertible, with $J_{f^{-1}}(f(a)) = J_f(a)^{-1}$
- Forward mode autodiff computes JVPs ($J_f(x)\, v$); reverse mode computes VJPs ($u^\top J_f(x)$)
- For a scalar loss ($m = 1$), a single VJP is exactly the gradient computation, which is why backprop works
Exercises
Problem
Compute the Jacobian of the polar-to-Cartesian transformation at a general point $(r, \theta)$. Verify that the Jacobian determinant equals $r$.
Problem
Let $f: \mathbb{R}^2 \to \mathbb{R}^3$ be defined by $f(x, y) = \big(x y,\; x^2,\; \sin y\big)$.
(a) Compute the Jacobian $J_f(x, y)$. (b) At the point $(x, y) = (1, 0)$, compute the JVP $J_f(1, 0)\, v$ for $v = (1, 2)^\top$. (c) At the same point, compute the VJP $u^\top J_f(1, 0)$ for $u = (1, 1, 1)^\top$.
References
Canonical:
- Spivak, Calculus on Manifolds (1965), Chapter 2
- Rudin, Principles of Mathematical Analysis (1976), Chapter 9
- Lee, Introduction to Smooth Manifolds (2013), Chapters 3 and 4
- Hubbard & Hubbard, Vector Calculus, Linear Algebra, and Differential Forms (2015), Chapters 1-3
Current:
- Griewank & Walther, Evaluating Derivatives (2008), Chapters 3 and 6
- Baydin et al., "Automatic Differentiation in Machine Learning: a Survey" (2018)
- Giles, "Collected Matrix Derivative Results for Forward and Reverse Mode Algorithmic Differentiation" (2008)
- Clarke, Optimization and Nonsmooth Analysis (1990), Chapter 2
Next Topics
The natural next steps from the Jacobian:
- Automatic differentiation: how JVPs and VJPs are computed efficiently, forward mode vs. reverse mode, and the computational graph perspective
- The Hessian matrix: the Jacobian of the gradient, encoding second-order curvature information
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Differentiation in ℝⁿ (layer 0A, tier 1)
Derived topics
- The Hessian Matrix (layer 0A, tier 1)
- Vector Calculus Chain Rule (layer 0A, tier 1)
- The Multivariate Normal Distribution (layer 0B, tier 1)
- Automatic Differentiation (layer 1, tier 1)
- Classical ODEs: Existence, Stability, and Numerical Methods (layer 1, tier 1)
+8 more on the derived-topics page.