Mathematical Infrastructure
Vector Calculus Chain Rule
The chain rule for compositions of multivariable maps. Jacobians multiply when functions compose; gradients of scalar-valued compositions become vector-Jacobian products. The result that makes backpropagation a one-line theorem.
Why This Matters
The single-variable chain rule says $(f \circ g)'(x) = f'(g(x))\,g'(x)$. The multivariable version replaces those scalar derivatives with Jacobians, and the product becomes a matrix product. That one upgrade is enough to derive backpropagation in one line: a deep network is a composition $f_L \circ \cdots \circ f_1$, its Jacobian is the product $J_{f_L} \cdots J_{f_1}$, and the gradient of a scalar loss is a vector-Jacobian product accumulated from the loss at the output back toward the input.
This page states the rule with assumptions, gives the matrix form, derives the gradient corollary, and shows the implicit chain rule that handles constraints. It does not re-derive backprop end-to-end; that lives on its own page.
The Rule
Multivariable Chain Rule
Statement
Let $g : \mathbb{R}^n \to \mathbb{R}^m$ be differentiable at $x$, with Jacobian $J_g(x)$, and let $f : \mathbb{R}^m \to \mathbb{R}^p$ be differentiable at $g(x)$. Then the composition $f \circ g$ is differentiable at $x$ and its Jacobian satisfies
$$J_{f \circ g}(x) = J_f(g(x))\,J_g(x),$$
where the right-hand side is matrix multiplication of a $p \times m$ matrix with an $m \times n$ matrix.
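The statement is easy to sanity-check numerically. The sketch below (in NumPy, with arbitrary illustrative choices of $g : \mathbb{R}^3 \to \mathbb{R}^2$ and $f : \mathbb{R}^2 \to \mathbb{R}^2$) builds each Jacobian by finite differences and compares the Jacobian of the composition to the matrix product.

```python
import numpy as np

def jacobian_fd(func, x, eps=1e-6):
    """Finite-difference Jacobian of func at x (rows: outputs, cols: inputs)."""
    x = np.asarray(x, dtype=float)
    f0 = func(x)
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (func(x + dx) - f0) / eps
    return J

# Illustrative maps: g: R^3 -> R^2, f: R^2 -> R^2
g = lambda x: np.array([np.sin(x[0]) + x[1] * x[2], x[0] * x[1]])
f = lambda y: np.array([np.exp(y[0]), y[0] * y[1]])

x = np.array([0.3, -1.2, 0.7])
lhs = jacobian_fd(lambda x: f(g(x)), x)          # Jacobian of the composition
rhs = jacobian_fd(f, g(x)) @ jacobian_fd(g, x)   # J_f(g(x)) @ J_g(x)
print(np.max(np.abs(lhs - rhs)))                 # small, up to finite-difference error
```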
Intuition
Differentiability of $g$ at $x$ means $g$ admits a best linear approximation: $g(x + \Delta x) = g(x) + J_g(x)\,\Delta x + o(\lVert\Delta x\rVert)$. Apply $f$ and use its own linear approximation at $g(x)$. The composed approximation is linear with matrix $J_f(g(x))\,J_g(x)$. Because the linear approximation of a differentiable map is unique, this matrix is the Jacobian of $f \circ g$.
Proof Sketch
By assumption $g(x + \Delta x) = g(x) + J_g(x)\,\Delta x + r_g(\Delta x)$ with $r_g(\Delta x) = o(\lVert\Delta x\rVert)$, and $f(y + \Delta y) = f(y) + J_f(y)\,\Delta y + r_f(\Delta y)$ with $r_f(\Delta y) = o(\lVert\Delta y\rVert)$. Set $y = g(x)$, $\Delta y = J_g(x)\,\Delta x + r_g(\Delta x)$ and substitute:
$$f(g(x + \Delta x)) = f(g(x)) + J_f(g(x))\,J_g(x)\,\Delta x + J_f(g(x))\,r_g(\Delta x) + r_f(\Delta y).$$
The term $J_f(g(x))\,r_g(\Delta x)$ is $o(\lVert\Delta x\rVert)$ because $J_f(g(x))$ is a fixed linear map. The term $r_f(\Delta y)$ is $o(\lVert\Delta y\rVert)$ and $\lVert\Delta y\rVert = O(\lVert\Delta x\rVert)$, so it is $o(\lVert\Delta x\rVert)$ as well. The total remainder is $o(\lVert\Delta x\rVert)$, identifying $J_f(g(x))\,J_g(x)$ as the Jacobian of $f \circ g$ at $x$.
Why It Matters
Every layer in a neural network is a function $f_\ell : \mathbb{R}^{n_{\ell-1}} \to \mathbb{R}^{n_\ell}$. A depth-$L$ network is the composition $f_L \circ \cdots \circ f_1$. The chain rule gives $J = J_{f_L} \cdots J_{f_1}$, a product of $L$ matrices. Forward-mode autodiff accumulates this product starting from $J_{f_1}$, alongside the forward computation; reverse-mode (backprop) accumulates it starting from the loss gradient at the output end. The asymmetry is not flop-equal-with-different-memory: forward mode computes one directional derivative (one Jacobian column) per pass and so requires $n_0$ passes to recover a full Jacobian (cheap when the input dimension is small), while reverse mode computes one gradient (one Jacobian row) per pass and so requires $n_L$ passes (cheap when the output dimension is small). For a scalar loss, $n_L = 1$ and a single reverse pass yields the full gradient at cost comparable to one forward evaluation (the cheap-gradient principle). This input-vs-output-dimension scaling, not memory layout, is the actual reason backprop wins for deep nets with many parameters.
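A minimal sketch of the pass-counting argument, with made-up layer sizes and dense random matrices standing in for the layer Jacobians: each forward pass delivers one Jacobian column ($Jv$), each reverse pass one Jacobian row ($u^\top J$), so a scalar output needs exactly one reverse pass.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [1000, 500, 200, 1]          # n_0 = 1000 inputs, scalar output
Js = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]

def jvp(v):
    # Forward mode: one directional derivative J @ v per pass.
    for J in Js:
        v = J @ v
    return v

def vjp(u):
    # Reverse mode: one row u^T @ J per pass, starting from the output end.
    for J in reversed(Js):
        u = u @ J
    return u

# Full Jacobian via forward mode: n_0 = 1000 passes (one per input direction).
# Full Jacobian via reverse mode: n_L = 1 pass, because the output is scalar.
grad = vjp(np.ones(1))               # the whole gradient in a single reverse pass
print(grad.shape)                    # (1000,)
```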
Failure Mode
Differentiability at the inner point is required, not just continuity. Composing a non-differentiable inner map with a differentiable outer one breaks the rule even when the composition happens to be differentiable. ReLU activations are not differentiable at zero, so neural-network practice uses subgradients there; modern autodiff frameworks pick a convention for the value at zero (typically $0$, occasionally $1$) and document it.
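For illustration only, a minimal sketch of one such convention (not any particular framework's code), assigning the value $0$ to the derivative at the kink:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_subgrad(z):
    # Convention: derivative 1 for z > 0 and 0 for z <= 0. The value at z == 0
    # is a choice; any number in [0, 1] is a valid subgradient there.
    return (z > 0).astype(float)

z = np.array([-1.0, 0.0, 2.0])
print(relu_subgrad(z))   # [0. 0. 1.]
```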
Gradient Corollary
For scalar-valued compositions the chain rule has a familiar special case. Let $g : \mathbb{R}^n \to \mathbb{R}^m$ and $\phi : \mathbb{R}^m \to \mathbb{R}$ be differentiable. Define $h(x) = \phi(g(x))$ for $x \in \mathbb{R}^n$. By the theorem, $J_h(x) = J_\phi(g(x))\,J_g(x)$. Since $\phi$ is scalar-valued, $J_\phi(g(x)) = \nabla\phi(g(x))^\top$ is a row vector of size $1 \times m$, and $J_g(x)$ is $m \times n$. The product is a row vector of size $1 \times n$, so
$$\nabla h(x) = J_g(x)^\top\,\nabla\phi(g(x)).$$
This is a vector-Jacobian product (VJP): the upstream gradient $\nabla\phi(g(x))$ is propagated backward by multiplying by $J_g(x)^\top$, equivalently by left-multiplying $J_g(x)$ by the upstream row vector. Backprop is exactly this identity applied recursively layer by layer.
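A short numeric check of the corollary, with arbitrary illustrative choices of $g$ and $\phi$: form $\nabla h(x) = J_g(x)^\top \nabla\phi(g(x))$ by hand and compare against a finite-difference gradient.

```python
import numpy as np

# Illustrative inner map g: R^3 -> R^2 and scalar head phi: R^2 -> R
g = lambda x: np.array([x[0] * x[1], np.sin(x[2])])
phi = lambda y: y[0] ** 2 + 3.0 * y[1]
h = lambda x: phi(g(x))

x = np.array([0.5, -1.0, 0.2])

# Hand-coded Jacobian of g and gradient of phi at g(x)
J_g = np.array([[x[1], x[0], 0.0],
                [0.0, 0.0, np.cos(x[2])]])            # 2 x 3
grad_phi = np.array([2.0 * g(x)[0], 3.0])

grad_h = J_g.T @ grad_phi                             # VJP: J_g^T times upstream gradient

# Finite-difference check of grad_h
eps = 1e-6
fd = np.array([(h(x + eps * e) - h(x)) / eps for e in np.eye(3)])
print(np.max(np.abs(grad_h - fd)))                    # small, up to finite-difference error
```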
Implicit Function Chain Rule
Sometimes a variable is defined implicitly: $F(x, y) = 0$ specifies $y$ as a function of $x$ near a point where $\partial F/\partial y$ is invertible. The implicit function theorem guarantees a smooth $y(x)$ exists locally, and differentiating $F(x, y(x)) = 0$ via the chain rule gives
$$\frac{\partial y}{\partial x} = -\left(\frac{\partial F}{\partial y}\right)^{-1}\frac{\partial F}{\partial x}.$$
Implicit differentiation underpins meta-learning by implicit gradients, deep-equilibrium models (DEQs), and differentiating the solution of an optimization problem with respect to its parameters.
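A minimal sketch for a scalar constraint, assuming an illustrative $F(x, y) = y^3 + y - x$: solve the constraint for $y$ at nearby values of $x$ and compare the finite-difference slope with $-(\partial F/\partial y)^{-1}\,\partial F/\partial x$.

```python
def solve_y(x, y=0.0, iters=50):
    # Newton's method on y -> F(x, y) = y**3 + y - x for fixed x
    for _ in range(iters):
        y -= (y**3 + y - x) / (3 * y**2 + 1)
    return y

x0 = 2.0
y0 = solve_y(x0)                      # root of y^3 + y = 2, i.e. y0 = 1

# Implicit-function formula: dy/dx = -(dF/dy)^(-1) * dF/dx
dF_dy = 3 * y0**2 + 1                 # partial of F w.r.t. y at (x0, y0)
dF_dx = -1.0                          # partial of F w.r.t. x
dydx_implicit = -dF_dx / dF_dy        # = 0.25

# Finite-difference check: re-solve the constraint at a nearby x
eps = 1e-6
dydx_fd = (solve_y(x0 + eps) - y0) / eps
print(dydx_implicit, dydx_fd)         # agree up to finite-difference error
```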
Worked Example: Two-Layer Network
Let $h = \sigma(W_1 x + b_1)$ and $y = W_2 h + b_2$ for an activation $\sigma$ applied componentwise, with $x \in \mathbb{R}^n$, $W_1 \in \mathbb{R}^{m \times n}$, $W_2 \in \mathbb{R}^{p \times m}$. The Jacobian of the inner map is $J_h(x) = \operatorname{diag}\!\big(\sigma'(W_1 x + b_1)\big)\,W_1$, an $m \times n$ matrix. The Jacobian of the outer map at $h$ is $W_2$. By the chain rule the composed Jacobian is
$$J_y(x) = W_2\,\operatorname{diag}\!\big(\sigma'(W_1 x + b_1)\big)\,W_1.$$
For a scalar loss $\ell(y)$ the gradient with respect to $x$ is the VJP $\nabla_x \ell = W_1^\top \operatorname{diag}\!\big(\sigma'(W_1 x + b_1)\big)\,W_2^\top\,\nabla_y \ell$, accumulated from the loss backward through the layers, which is precisely what backprop computes.
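A sketch of the worked example in NumPy, with illustrative shapes and $\sigma = \tanh$: build $W_2\,\operatorname{diag}(\sigma'(z))\,W_1$ explicitly, form the VJP for the loss $\ell(y) = \tfrac12\lVert y\rVert^2$, and check both against finite differences.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p = 4, 3, 2
W1, b1 = rng.standard_normal((m, n)), rng.standard_normal(m)
W2, b2 = rng.standard_normal((p, m)), rng.standard_normal(p)
sigma, dsigma = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2   # activation and its derivative

def net(x):
    return W2 @ sigma(W1 @ x + b1) + b2

x = rng.standard_normal(n)
z = W1 @ x + b1

# Composed Jacobian from the chain rule: W2 diag(sigma'(z)) W1, a p x n matrix
J = W2 @ np.diag(dsigma(z)) @ W1

# Scalar loss ell(y) = 0.5 ||y||^2, so grad_y ell = y; gradient w.r.t. x is the VJP
grad_x = W1.T @ (dsigma(z) * (W2.T @ net(x)))

# Finite-difference checks of both quantities
eps = 1e-6
J_fd = np.array([(net(x + eps * e) - net(x)) / eps for e in np.eye(n)]).T
loss = lambda x: 0.5 * np.sum(net(x) ** 2)
g_fd = np.array([(loss(x + eps * e) - loss(x)) / eps for e in np.eye(n)])
print(np.max(np.abs(J - J_fd)), np.max(np.abs(grad_x - g_fd)))
```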
Chain rule on a single logistic neuron with numbers
Take a single logistic neuron with prediction $\hat y = \sigma(z)$, $z = w^\top x + b$, $\sigma(z) = 1/(1 + e^{-z})$; for concreteness take weights $w = (0.1, -0.2)$, bias $b = 0.3$, and a training point $x = (1, 2)$ with target $t = 1$. The squared loss is $L = \tfrac{1}{2}(\hat y - t)^2$. Walk the chain rule one node at a time.
Forward pass. $z = w^\top x + b = 0.1 - 0.4 + 0.3 = 0$, so $\hat y = \sigma(0) = 0.5$. The loss is $L = \tfrac{1}{2}(0.5 - 1)^2 = 0.125$.
Backward pass. The chain rule gives
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat y}\,\frac{\partial \hat y}{\partial z}\,\frac{\partial z}{\partial w}, \qquad \frac{\partial L}{\partial b} = \frac{\partial L}{\partial \hat y}\,\frac{\partial \hat y}{\partial z}\,\frac{\partial z}{\partial b}.$$
Each factor has a clean form on this example:
- $\partial L / \partial \hat y = \hat y - t = -0.5$.
- $\partial \hat y / \partial z = \sigma(z)\,(1 - \sigma(z)) = \hat y\,(1 - \hat y) = 0.25$.
- $z = w^\top x + b$, so $\partial z / \partial w = x = (1, 2)$ and $\partial z / \partial b = 1$.
Multiplying: $\partial L / \partial z = -0.125$, $\partial L / \partial w = (-0.125, -0.25)$, $\partial L / \partial b = -0.125$.
A single SGD step with learning rate $\eta = 1$ gives $w \leftarrow w - \eta\,\partial L/\partial w = (0.225, 0.05)$ and $b \leftarrow b - \eta\,\partial L/\partial b = 0.425$. Re-evaluating, $\hat y = \sigma(0.75) \approx 0.68$ and $L \approx 0.051$, closer to the target $t = 1$.
The intermediate cache $(z, \hat y)$ is exactly the activation state backprop stores during the forward pass to reuse on the backward pass: $\hat y$ feeds both $\partial L/\partial \hat y$ and $\partial \hat y/\partial z$. This sigmoid-of-affine composition is the building block for backpropagation in feedforward networks. Goodfellow, Bengio, and Courville, Deep Learning (2016), §6.5.2 walks the same two-step composition with general activations.
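The same walk-through as a runnable sketch, using the concrete numbers chosen above:

```python
import numpy as np

x, t = np.array([1.0, 2.0]), 1.0        # training point and target
w, b = np.array([0.1, -0.2]), 0.3       # initial weights and bias
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Forward pass: cache z and y_hat for reuse on the backward pass
z = w @ x + b                           # 0.0
y_hat = sigmoid(z)                      # 0.5
L = 0.5 * (y_hat - t) ** 2              # 0.125

# Backward pass: chain rule, one node at a time
dL_dy = y_hat - t                       # -0.5
dy_dz = y_hat * (1.0 - y_hat)           # 0.25
dL_dz = dL_dy * dy_dz                   # -0.125
dL_dw = dL_dz * x                       # [-0.125, -0.25]
dL_db = dL_dz                           # -0.125

# One SGD step with learning rate 1
w, b = w - dL_dw, b - dL_db             # w = [0.225, 0.05], b = 0.425
y_new = sigmoid(w @ x + b)
print(y_new, 0.5 * (y_new - t) ** 2)    # approx 0.679, 0.051
```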
Common Confusions
Order of multiplication matters
The product $J_f(g(x))\,J_g(x)$ is not the same as $J_g(x)\,J_f(g(x))$ in general, because matrix multiplication is non-commutative and the shapes typically do not even match. The outer Jacobian is evaluated at the inner output $g(x)$ and goes on the left. Reversing the order is the most common chain-rule mistake on multivariable calculus exams.
Gradients are not Jacobians
For a scalar-valued function $h : \mathbb{R}^n \to \mathbb{R}$ the gradient $\nabla h(x)$ is a column vector and the Jacobian $J_h(x) = \nabla h(x)^\top$ is the corresponding row vector. Many sources collapse the distinction and write $\nabla h = J_h$, hiding the transpose inside the notation. When you implement autodiff or read papers, track the orientation explicitly: the chain rule gives row vectors via $J_h(x) = J_\phi(g(x))\,J_g(x)$ and gradients via $\nabla h(x) = J_g(x)^\top\,\nabla\phi(g(x))$.
Exercises
Problem
Let $f(x, y) = x^2 + y^2$ and let $\gamma(t) = (t\cos t,\ t\sin t)$ trace out a spiral. Compute $\frac{d}{dt}\,f(\gamma(t))$ via the chain rule, then verify by composing first and differentiating directly.
Problem
Use the chain rule to derive the gradient of the squared loss $L(w) = \tfrac{1}{2}\lVert Xw - y\rVert^2$ with respect to $w$ for fixed $X$, $y$. Show that $\nabla_w L = X^\top(Xw - y)$.
References
- Walter Rudin. Principles of Mathematical Analysis (3rd ed.). McGraw-Hill, 1976. Theorem 9.15: the chain rule for differentiable maps between Banach spaces. The cleanest statement and proof in the literature.
- Michael Spivak. Calculus on Manifolds. W. A. Benjamin, 1965. Chapter 2.5: chain rule via best linear approximation. Short, modern, coordinate-free presentation.
- Tom M. Apostol. Calculus, Volume II (2nd ed.). Wiley, 1969. Sections 8.10-8.12: multivariable chain rule with worked examples and the implicit-function variant.
- Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, Jeffrey Mark Siskind. Automatic Differentiation in Machine Learning: A Survey. JMLR 18, 2018. Section 3 develops forward and reverse mode as the two associativity orderings of chained Jacobians. arXiv:1502.05767
- Andreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2nd ed.). SIAM, 2008. Chapter 3: matrix-form chain rule and the cheap-gradient principle.
- Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics (3rd ed.). Wiley, 2019. Chapter 5: matrix chain rule with the trace inner product, the working tool for matrix gradients in ML.
Related Topics
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Differentiation in Rⁿ (layer 0A · tier 1)
- The Jacobian Matrix (layer 0A · tier 1)
- Vectors, Matrices, and Linear Maps (layer 0A · tier 1)
Derived topics
- The Hessian Matrix (layer 0A · tier 1)
- Automatic Differentiation (layer 1 · tier 1)
- Feedforward Networks and Backpropagation (layer 2 · tier 1)
- Divergence, Curl, and Line Integrals (layer 0A · tier 2)