Mathematical Infrastructure
The Jacobian Matrix
The matrix of all first partial derivatives of a vector-valued function: encodes the best linear approximation, connects to the chain rule in matrix form, and is the backbone of backpropagation.
Prerequisites
- Differentiation in ℝⁿ
Why This Matters
The Jacobian is the derivative of a vector-valued function. It is the matrix that makes the linear approximation $f(x + h) \approx f(x) + J_f(x)\, h$ work for functions from $\mathbb{R}^n$ to $\mathbb{R}^m$. Backpropagation is a chain of Jacobian multiplications. When you compose layers in a neural network, the chain rule multiplies Jacobians. Forward mode automatic differentiation computes Jacobian-vector products (JVPs). Reverse mode (backprop) computes vector-Jacobian products (VJPs).
The notation $\partial$ ("partial") replaces the ordinary $d$ when more than one input varies.
Beyond deep learning, the Jacobian appears in coordinate transformations (the Jacobian determinant is the volume change factor), in the inverse function theorem, and in sensitivity analysis throughout science and engineering.
Mental Model
The Jacobian is one object with three complementary readings: the table of all first partial derivatives, the best linear approximation to $f$ near a point (the matrix that sends a tiny input step $h$ to the first-order output change $J_f(x)\, h$), and the factor that multiplies in the chain rule, which is what backpropagation exploits.
For a scalar function $f: \mathbb{R} \to \mathbb{R}$, the derivative gives the best linear approximation: $f(x + h) \approx f(x) + f'(x)\, h$.
For a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$ (a map between spaces studied in differentiation in $\mathbb{R}^n$), the Jacobian matrix plays exactly the same role. It is the $m \times n$ matrix $J_f(x)$ such that:
$$f(x + h) \approx f(x) + J_f(x)\, h.$$
The Jacobian generalizes the derivative to maps between multi-dimensional spaces. Each row is the gradient of one output component; each column shows how one input affects all outputs.
Core Definitions
Jacobian Matrix
For a differentiable function $f: \mathbb{R}^n \to \mathbb{R}^m$ with component functions $f_1, \dots, f_m$, the Jacobian matrix at a point $x$ is the $m \times n$ matrix:
$$\big(J_f(x)\big)_{ij} = \frac{\partial f_i}{\partial x_j}(x).$$
Explicitly:
$$J_f(x) = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix}.$$
Row $i$ is the gradient of the $i$-th component $f_i$ (transposed). Column $j$ shows how all outputs change when input $x_j$ is perturbed.
Special Cases by Dimension
The Jacobian unifies several familiar derivative objects:
- $m = 1$ (scalar-valued, $f: \mathbb{R}^n \to \mathbb{R}$): the Jacobian is a $1 \times n$ row vector, the transpose of the gradient: $J_f(x) = \nabla f(x)^\top$.
- $n = 1$ (curve in space, $f: \mathbb{R} \to \mathbb{R}^m$): the Jacobian is an $m \times 1$ column vector, the tangent vector $f'(t)$.
- $m = n$ (square case): the Jacobian is $n \times n$ and has a well-defined determinant.
The Linear Approximation
The defining property of the Jacobian is that it gives the best linear approximation:
$$f(x + h) = f(x) + J_f(x)\, h + o(\|h\|),$$
where $o(\|h\|)$ means the error goes to zero faster than $\|h\|$. This is the multi-dimensional version of $f(x + h) = f(x) + f'(x)\, h + o(|h|)$. The Jacobian is the unique matrix with this property (when $f$ is differentiable).
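As a quick numerical check of this property, the sketch below picks an arbitrary map $f: \mathbb{R}^2 \to \mathbb{R}^3$ (the function, base point, and step sizes are illustrative choices, not from the text) and confirms that the approximation error shrinks faster than $\|h\|$:

```python
import numpy as np

def f(p):
    """Illustrative map f: R^2 -> R^3."""
    x, y = p
    return np.array([x * y, np.sin(x), x + y**2])

def jacobian_f(p):
    """Analytic Jacobian of f: rows = output components, columns = inputs."""
    x, y = p
    return np.array([[y,         x      ],
                     [np.cos(x), 0.0    ],
                     [1.0,       2.0 * y]])

x0 = np.array([0.7, -1.2])
J = jacobian_f(x0)

direction = np.array([0.6, -0.8])          # fixed direction for the step h
for scale in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = scale * direction
    err = np.linalg.norm(f(x0 + h) - f(x0) - J @ h)
    # err / ||h|| shrinks roughly linearly in ||h||, i.e. the error is o(||h||)
    print(f"||h|| = {np.linalg.norm(h):.1e}   error/||h|| = {err / np.linalg.norm(h):.2e}")
```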
The Chain Rule in Matrix Form
Chain Rule for Jacobians
Statement
Let $f: \mathbb{R}^n \to \mathbb{R}^m$ and $g: \mathbb{R}^m \to \mathbb{R}^p$. The Jacobian of the composition $g \circ f$ is the product of Jacobians:
$$J_{g \circ f}(x) = J_g\big(f(x)\big)\, J_f(x).$$
This is a $p \times n$ matrix, computed as the product of a $p \times m$ matrix and an $m \times n$ matrix.
Intuition
The chain rule says: the linear approximation of a composition is the composition of the linear approximations. Since linear maps compose by matrix multiplication, Jacobians multiply. Each Jacobian represents "how much things change locally," and the chain rule says these local changes multiply.
Proof Sketch
Using the linear approximation:
$$g\big(f(x + h)\big) \approx g\big(f(x) + J_f(x)\, h\big) \approx g\big(f(x)\big) + J_g\big(f(x)\big)\, J_f(x)\, h.$$
The first step uses the linear approximation of $f$, and the second uses the linear approximation of $g$. The composite linear map is $J_g(f(x))\, J_f(x)$, which is the Jacobian of $g \circ f$ by uniqueness of the derivative.
Making this rigorous requires controlling the error terms, which is done via the standard multivariable chain rule proof.
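The statement can also be checked numerically. The sketch below uses two small hand-picked maps (illustrative choices, not from the text) and compares the product of their analytic Jacobians with a finite-difference Jacobian of the composition:

```python
import numpy as np

def f(p):                      # f: R^2 -> R^2
    x, y = p
    return np.array([x + y**2, x * y])

def Jf(p):
    x, y = p
    return np.array([[1.0, 2 * y], [y, x]])

def g(q):                      # g: R^2 -> R^3
    u, v = q
    return np.array([np.sin(u), u * v, v**2])

def Jg(q):
    u, v = q
    return np.array([[np.cos(u), 0.0], [v, u], [0.0, 2 * v]])

def numerical_jacobian(h, x, eps=1e-6):
    """Central-difference Jacobian of h at x, built column by column."""
    cols = []
    for j in range(len(x)):
        e = np.zeros_like(x); e[j] = eps
        cols.append((h(x + e) - h(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

x0 = np.array([0.3, -0.8])
chain = Jg(f(x0)) @ Jf(x0)                            # product of Jacobians
numeric = numerical_jacobian(lambda p: g(f(p)), x0)   # Jacobian of the composition
print(np.max(np.abs(chain - numeric)))                # tiny: the two agree
```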
Why It Matters
This is the theoretical foundation of backpropagation. A neural network is a composition of layers: $f = f_L \circ \cdots \circ f_2 \circ f_1$. The Jacobian of the entire network is:
$$J_f = J_{f_L}\, J_{f_{L-1}} \cdots J_{f_1},$$
with each factor evaluated at the output of the previous layer.
Backprop computes this product efficiently as a sequence of vector-Jacobian products, starting from the output (loss) side and working back toward the input; this is reverse mode. The chain rule is the reason backpropagation works.
Failure Mode
The chain rule requires differentiability. Functions with kinks (like ReLU at zero) are not classically differentiable there. The often-quoted defense — "non-differentiable points have measure zero, so the chain rule works almost everywhere" — needs context:
- For piecewise-smooth networks (ReLU, leaky ReLU, max-pooling), under mild assumptions (e.g. absolutely continuous input distribution and generic weights), the preactivation hyperplanes form a Lebesgue-null set, and classical derivatives + chain rule are valid almost surely.
- For discrete inputs, constrained or low-dimensional manifolds, degenerate parameter settings, or optimization paths that intentionally hit kinks, the a.e. story is not enough — the optimizer can spend non-trivial time at non-differentiable points.
- In practice, autodiff frameworks do not apply the classical chain rule at kinks; they pick a particular subgradient or one-sided derivative by convention (e.g. PyTorch and JAX take $\mathrm{ReLU}'(0) = 0$). The right mathematical framework is Clarke generalized derivatives or subdifferential calculus; gradient-descent convergence on such piecewise-smooth functions is then governed by results from non-smooth optimization (Davis et al. 2020 on stochastic subgradient methods for tame functions).
The Jacobian Determinant
When $m = n$ (same input and output dimension), the Jacobian is a square $n \times n$ matrix and has a determinant.
Jacobian Determinant
The Jacobian determinant $\det J_f(x)$ measures how $f$ locally scales volumes. If $R$ is a small region near $x$:
$$\operatorname{vol}\big(f(R)\big) \approx \big|\det J_f(x)\big| \cdot \operatorname{vol}(R).$$
This is the basis for the change of variables formula in integration:
$$\int_{f(R)} \phi(y)\, dy = \int_{R} \phi\big(f(x)\big)\, \big|\det J_f(x)\big|\, dx.$$
The sign of the Jacobian determinant indicates whether preserves or reverses orientation. A positive determinant means the orientation is preserved; negative means it is reversed.
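A minimal numerical illustration of the volume-scaling reading: map a tiny coordinate square through an arbitrary nonlinear map (all specific choices below are illustrative) and compare its image area with $|\det J_f|$ times the original area:

```python
import numpy as np

def f(p):                       # an illustrative nonlinear map R^2 -> R^2
    x, y = p
    return np.array([x + 0.5 * y**2, x * y + y])

def Jf(p):
    x, y = p
    return np.array([[1.0, y], [y, x + 1.0]])

x0 = np.array([0.4, 0.9])
eps = 1e-4

# The image of the tiny square with corner x0 and side eps is, to first order,
# the parallelogram spanned by the images of its two edge vectors.
e1 = f(x0 + np.array([eps, 0.0])) - f(x0)
e2 = f(x0 + np.array([0.0, eps])) - f(x0)
image_area = abs(np.linalg.det(np.column_stack([e1, e2])))

predicted = abs(np.linalg.det(Jf(x0))) * eps**2     # |det J| * original area
print(image_area, predicted)                        # agree to leading order; gap ~ eps**3
```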
The Inverse Function Theorem
Inverse Function Theorem
Statement
Let $f: \mathbb{R}^n \to \mathbb{R}^n$ be continuously differentiable and suppose $\det J_f(a) \neq 0$ at a point $a$. Then there exists a neighborhood of $a$ on which $f$ is a diffeomorphism (smooth bijection with smooth inverse), and the Jacobian of the inverse at $b = f(a)$ is:
$$J_{f^{-1}}(b) = \big(J_f(a)\big)^{-1}.$$
The derivative of the inverse is the inverse of the derivative.
Intuition
If the Jacobian at a point is invertible (nonzero determinant), then $f$ is locally one-to-one and onto. It does not collapse any directions. The linear approximation is invertible, so the nonlinear function is locally invertible too. And the derivative of the inverse is what you would guess: invert the derivative.
Proof Sketch
The proof uses the contraction mapping theorem. Define $\varphi_y(x) = x + J_f(a)^{-1}\big(y - f(x)\big)$ for a target value $y$ near $f(a)$. Since $J_f(a)$ is invertible and $f$ is $C^1$, $\varphi_y$ is a contraction near $a$, so it has a unique fixed point $x$ with $f(x) = y$. This gives local invertibility. The formula for the Jacobian of the inverse follows from the chain rule applied to $f^{-1} \circ f = \mathrm{id}$.
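As a concrete check of the conclusion, the sketch below uses the polar-to-Cartesian map and its explicit inverse (an illustrative choice; any locally invertible map would do) and verifies that the numerically estimated Jacobian of the inverse is the matrix inverse of $J_f$:

```python
import numpy as np

def f(p):                          # polar -> Cartesian, invertible for r > 0
    r, th = p
    return np.array([r * np.cos(th), r * np.sin(th)])

def f_inv(q):                      # Cartesian -> polar (a local inverse, away from the origin)
    x, y = q
    return np.array([np.hypot(x, y), np.arctan2(y, x)])

def Jf(p):
    r, th = p
    return np.array([[np.cos(th), -r * np.sin(th)],
                     [np.sin(th),  r * np.cos(th)]])

def numerical_jacobian(h, x, eps=1e-6):
    cols = []
    for j in range(len(x)):
        e = np.zeros_like(x); e[j] = eps
        cols.append((h(x + e) - h(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

p0 = np.array([2.0, 0.6])                  # (r, theta) with r > 0, so det Jf = r != 0
J = Jf(p0)
J_inv_at_image = numerical_jacobian(f_inv, f(p0))
print(J_inv_at_image @ J)                  # ~ identity: J_{f^-1}(f(p0)) = Jf(p0)^{-1}
```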
Why It Matters
The inverse function theorem is foundational in differential geometry, in implicit methods for ODEs, and, in machine learning, in normalizing flows. Normalizing flows require computing $\log \big|\det J_f\big|$ to evaluate the density of transformed distributions. The inverse function theorem guarantees that the flow is locally invertible whenever the Jacobian is nonsingular.
Failure Mode
The theorem is local: $f$ may be locally invertible near every point but not globally invertible. Example: $f(x, y) = (e^x \cos y,\; e^x \sin y)$ on $\mathbb{R}^2$ has $\det J_f = e^{2x} > 0$ everywhere, so $f$ is locally invertible near every point. But $f$ is not globally invertible ($f(x, y + 2\pi) = f(x, y)$). Also, the Jacobian must be nonsingular at the point in question. If $\det J_f(a) = 0$, the function may fold or collapse directions near $a$.
The Jacobian in Backpropagation
The connection between the Jacobian and automatic differentiation is fundamental:
Jacobian-vector product (JVP), forward mode:
Given a vector $v \in \mathbb{R}^n$, compute $J_f(x)\, v$. This tells you how a perturbation $v$ in the input propagates to the output. Cost: one forward pass. Efficient when $n \ll m$ (few inputs, many outputs).
Vector-Jacobian product (VJP), reverse mode (backpropagation):
Given a vector $u \in \mathbb{R}^m$, compute $u^\top J_f(x)$. This tells you how each input component affects the output along direction $u$. Cost: one backward pass. Efficient when $m \ll n$ (few outputs, many inputs).
For a loss function $L: \mathbb{R}^n \to \mathbb{R}$ (one output, $n$ parameters), reverse mode computes the gradient $\nabla L$ in one backward pass. This is why backprop is efficient for neural network training.
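A minimal sketch of the two products using JAX's `jax.jvp` and `jax.vjp` (the function, point, and vectors below are arbitrary illustrations, not fixed by the text):

```python
import jax
import jax.numpy as jnp

def f(x):                                   # illustrative f: R^3 -> R^2
    return jnp.stack([x[0] * x[1], jnp.sin(x[2]) + x[0]])

x = jnp.array([1.0, 2.0, 0.5])

# Forward mode: Jacobian-vector product J_f(x) @ v, one forward pass.
v = jnp.array([1.0, 0.0, -1.0])
y, jvp_out = jax.jvp(f, (x,), (v,))

# Reverse mode: vector-Jacobian product u^T J_f(x), one backward pass.
u = jnp.array([1.0, -2.0])
y2, vjp_fn = jax.vjp(f, x)
(vjp_out,) = vjp_fn(u)

# Cross-check against the full Jacobian (fine here because the dimensions are tiny).
J = jax.jacfwd(f)(x)
print(jnp.allclose(jvp_out, J @ v), jnp.allclose(vjp_out, u @ J))
```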
Canonical Examples
Jacobian of polar-to-Cartesian transformation
The transformation from polar coordinates $(r, \theta)$ to Cartesian coordinates $(x, y)$ is:
$$x = r \cos\theta, \qquad y = r \sin\theta.$$
The Jacobian is:
$$J = \begin{bmatrix} \dfrac{\partial x}{\partial r} & \dfrac{\partial x}{\partial \theta} \\[4pt] \dfrac{\partial y}{\partial r} & \dfrac{\partial y}{\partial \theta} \end{bmatrix} = \begin{bmatrix} \cos\theta & -r \sin\theta \\ \sin\theta & r \cos\theta \end{bmatrix}.$$
The Jacobian determinant is:
$$\det J = r \cos^2\theta + r \sin^2\theta = r.$$
This is the familiar factor $r$ in the polar coordinate area element $dA = r\, dr\, d\theta$. The Jacobian determinant directly gives you the area scaling factor for coordinate transformations.
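A short numeric spot-check of this determinant at a few arbitrary points (the sample values are illustrative):

```python
import numpy as np

def jacobian_polar(r, th):
    """Jacobian of (r, theta) -> (r cos(theta), r sin(theta))."""
    return np.array([[np.cos(th), -r * np.sin(th)],
                     [np.sin(th),  r * np.cos(th)]])

for r, th in [(0.5, 0.1), (2.0, 1.2), (3.7, -2.5)]:
    det = np.linalg.det(jacobian_polar(r, th))
    print(np.isclose(det, r))        # det J = r, independent of theta
```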
Jacobian of a two-layer neural network
Consider a simple network $f(x) = W_2\, \sigma(W_1 x)$, where $\sigma$ is an elementwise activation. By the chain rule:
$$J_f(x) = W_2 \,\operatorname{diag}\!\big(\sigma'(W_1 x)\big)\, W_1,$$
where $W_1 \in \mathbb{R}^{h \times n}$ and $W_2 \in \mathbb{R}^{m \times h}$.
Each layer contributes its Jacobian as a factor in the product. The $\operatorname{diag}$ term comes from the elementwise nonlinearity. Backpropagation computes VJPs through this product starting from the output side: first $u^\top W_2$, then the diagonal factor, then $W_1$.
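The sketch below checks the hand-derived product against JAX's autodiff Jacobian and runs one reverse-mode VJP through the same network; the layer sizes and the choice of $\sigma = \tanh$ are illustrative assumptions, not fixed by the text:

```python
import jax
import jax.numpy as jnp

n, h, m = 4, 5, 3                                   # input, hidden, output sizes
key = jax.random.PRNGKey(0)
k1, k2, kx = jax.random.split(key, 3)
W1 = jax.random.normal(k1, (h, n))
W2 = jax.random.normal(k2, (m, h))
x = jax.random.normal(kx, (n,))

def net(x):
    return W2 @ jnp.tanh(W1 @ x)

# Chain rule by hand: J_f(x) = W2 · diag(sigma'(W1 x)) · W1, with sigma = tanh.
z = W1 @ x
J_manual = W2 @ jnp.diag(1.0 - jnp.tanh(z) ** 2) @ W1

# Autodiff Jacobian and a reverse-mode VJP through the same composition.
J_auto = jax.jacrev(net)(x)
u = jnp.ones(m)
_, vjp_fn = jax.vjp(net, x)
(vjp_out,) = vjp_fn(u)

print(jnp.allclose(J_manual, J_auto, atol=1e-5))    # the two Jacobians match
print(jnp.allclose(vjp_out, u @ J_manual, atol=1e-5))
```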
Common Confusions
The Jacobian is a matrix, the gradient is a vector
For $f: \mathbb{R}^n \to \mathbb{R}$ (scalar output), the gradient $\nabla f(x)$ is a column vector, while the Jacobian is a row vector. They carry the same information: $J_f(x) = \nabla f(x)^\top$. But for $f: \mathbb{R}^n \to \mathbb{R}^m$ with $m > 1$ (vector output), the "gradient" is not well-defined as a single vector; the Jacobian is the correct generalization. Each row of the Jacobian is the gradient of one component.
JVP vs VJP is not just a transpose
A Jacobian-vector product and a vector-Jacobian product involve the same matrix but are computed differently by autodiff. They are not interchangeable: JVPs propagate perturbations forward (tangent mode), while VJPs propagate sensitivities backward (adjoint/cotangent mode). For a composition $f = f_L \circ \cdots \circ f_1$, the JVP accumulates $J_{f_L} \cdots J_{f_1} v$ in the order of the forward evaluation, starting from $J_{f_1} v$; the VJP accumulates $u^\top J_{f_L} \cdots J_{f_1}$ in the reverse order, starting from $u^\top J_{f_L}$. This choice of accumulation order is the difference between forward mode and reverse mode automatic differentiation.
Jacobian determinant only makes sense for square Jacobians
If $f: \mathbb{R}^n \to \mathbb{R}^m$ with $m \neq n$, the Jacobian is rectangular and has no determinant. Volume scaling in this case is described by $\sqrt{\det\big(J_f^\top J_f\big)}$ (for $m \geq n$, measuring how $f$ scales $n$-dimensional volumes in $\mathbb{R}^m$). The change-of-variables formula with $|\det J_f|$ requires $m = n$.
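A small sketch of the rectangular case, using an illustrative surface parametrization $f: \mathbb{R}^2 \to \mathbb{R}^3$ (a paraboloid, chosen only for this example) and comparing $\sqrt{\det(J^\top J)}$ with its closed-form area factor:

```python
import numpy as np

def f(p):                          # surface parametrization R^2 -> R^3
    u, v = p
    return np.array([u, v, u**2 + v**2])

def Jf(p):
    u, v = p
    return np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [2 * u, 2 * v]])

p0 = np.array([0.5, -1.0])
J = Jf(p0)                         # 3 x 2: rectangular, so det(J) is undefined
area_factor = np.sqrt(np.linalg.det(J.T @ J))
closed_form = np.sqrt(1 + 4 * p0[0]**2 + 4 * p0[1]**2)
print(area_factor, closed_form)    # both give the same local area-scaling factor
```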
Summary
- The Jacobian contains all first partial derivatives: $\big(J_f(x)\big)_{ij} = \partial f_i / \partial x_j$
- Linear approximation: $f(x + h) \approx f(x) + J_f(x)\, h$
- Chain rule: $J_{g \circ f}(x) = J_g(f(x))\, J_f(x)$; Jacobians multiply
- Jacobian determinant: measures local volume scaling; gives the $|\det J_f(x)|$ factor in the change of variables formula
- Inverse function theorem: if $\det J_f(a) \neq 0$, $f$ is locally invertible, with $J_{f^{-1}}(f(a)) = J_f(a)^{-1}$
- Forward mode autodiff computes JVPs ($J_f(x)\, v$); reverse mode computes VJPs ($u^\top J_f(x)$)
- For a scalar loss ($m = 1$), a single VJP is exactly the gradient computation, which is why backprop works
Exercises
Problem
Compute the Jacobian of the polar-to-Cartesian transformation at a general point $(r, \theta)$. Verify that the Jacobian determinant equals $r$.
Problem
Let $f: \mathbb{R}^2 \to \mathbb{R}^3$ be defined by $f(x, y) = \big(x y,\; x^2,\; \sin y\big)$.
(a) Compute the Jacobian $J_f(x, y)$. (b) At the point $(x, y) = (1, 0)$, compute the JVP $J_f(1, 0)\, v$ for $v = (1, 2)^\top$. (c) At the same point, compute the VJP $u^\top J_f(1, 0)$ for $u = (1, 1, 1)^\top$.
References
Canonical:
- Spivak, Calculus on Manifolds (1965), Chapter 2
- Rudin, Principles of Mathematical Analysis (1976), Chapter 9
- Lee, Introduction to Smooth Manifolds (2013), Chapters 3 and 4
- Hubbard & Hubbard, Vector Calculus, Linear Algebra, and Differential Forms (2015), Chapters 1-3
Current:
- Griewank & Walther, Evaluating Derivatives (2008), Chapters 3 and 6
- Baydin et al., "Automatic Differentiation in Machine Learning: a Survey" (2018)
- Giles, "Collected Matrix Derivative Results for Forward and Reverse Mode Algorithmic Differentiation" (2008)
- Clarke, Optimization and Nonsmooth Analysis (1990), Chapter 2
Next Topics
The natural next steps from the Jacobian:
- Automatic differentiation: how JVPs and VJPs are computed efficiently, forward mode vs. reverse mode, and the computational graph perspective
- The Hessian matrix: the Jacobian of the gradient, encoding second-order curvature information
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Differentiation in ℝⁿ (layer 0A, tier 1)
Derived topics
- The Hessian Matrix (layer 0A, tier 1)
- Vector Calculus Chain Rule (layer 0A, tier 1)
- The Multivariate Normal Distribution (layer 0B, tier 1)
- Automatic Differentiation (layer 1, tier 1)
- Classical ODEs: Existence, Stability, and Numerical Methods (layer 1, tier 1)
+8 more on the derived-topics page.