Foundations
Taylor Expansion
Taylor approximation in one and many variables. Every optimization algorithm is a Taylor approximation: gradient descent uses first order, Newton's method uses second order.
Prerequisites
Why This Matters
Every optimization algorithm you use in ML is a Taylor approximation in disguise.
Gradient descent: approximate $f(x) \approx f(x_k) + \nabla f(x_k)^\top (x - x_k)$ (first order). The symbol $\nabla f(x_k)$ here is the gradient. The linear model is unbounded below, so you cannot literally minimize it; instead you minimize it under a step-size constraint (a trust region $\|x - x_k\| \le \Delta$, a proximal/quadratic regularizer $\frac{1}{2\eta}\|x - x_k\|^2$, or equivalently a fixed step size $\eta$). The proximal form recovers the gradient descent step $x_{k+1} = x_k - \eta\,\nabla f(x_k)$.
Newton's method: approximate $f(x) \approx f(x_k) + \nabla f(x_k)^\top (x - x_k) + \frac{1}{2}(x - x_k)^\top \nabla^2 f(x_k)\,(x - x_k)$ (second order), then minimize this quadratic exactly.
Adam, L-BFGS, natural gradient: all variations on which Taylor terms to keep and how to estimate them. Understanding Taylor expansion means understanding the foundation of all gradient-based optimization.
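As a concrete illustration (not from the source text), the sketch below minimizes both local models numerically for a toy one-dimensional function and recovers the gradient descent and Newton steps. The function, the step size eta, and the grid are arbitrary choices, and NumPy is assumed available.

```python
# Illustrative sketch (not from the source): recover the gradient descent and
# Newton steps by minimizing local Taylor models of a toy function
# f(x) = x^4 + x^2. The function, step size eta, and grid are arbitrary choices.
import numpy as np

def f(x):    return x**4 + x**2
def grad(x): return 4 * x**3 + 2 * x
def hess(x): return 12 * x**2 + 2

x_k, eta = 1.0, 0.1

# First-order model plus proximal term 1/(2*eta) * (x - x_k)^2:
# its minimizer is x_k - eta * grad(x_k), the gradient descent step.
gd_step = x_k - eta * grad(x_k)

# Second-order model: its minimizer is x_k - grad(x_k) / hess(x_k), the Newton step.
newton_step = x_k - grad(x_k) / hess(x_k)

# Verify by brute-force minimization of each model on a fine grid.
xs = np.linspace(-2.0, 2.0, 400001)
prox_model = f(x_k) + grad(x_k) * (xs - x_k) + (xs - x_k) ** 2 / (2 * eta)
quad_model = f(x_k) + grad(x_k) * (xs - x_k) + 0.5 * hess(x_k) * (xs - x_k) ** 2
print(gd_step, xs[np.argmin(prox_model)])      # ~0.4 and ~0.4
print(newton_step, xs[np.argmin(quad_model)])  # ~0.571 and ~0.571
```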
Single Variable Taylor Expansion
Taylor Polynomial
The $n$-th order Taylor polynomial of $f$ centered at $a$ is:

$$P_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}\,(x - a)^k$$

This is the unique polynomial of degree at most $n$ that matches $f$ and its first $n$ derivatives at $a$.
The cases that matter most:
First order (linear approximation): $f(x) \approx f(a) + f'(a)(x - a)$
Second order (quadratic approximation): $f(x) \approx f(a) + f'(a)(x - a) + \frac{1}{2}f''(a)(x - a)^2$
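A minimal numerical sketch (assuming NumPy; the choice of sin and the expansion point a = 0.5 are illustrative, not from the text) showing how the errors of the two approximations behave near the expansion point:

```python
# Illustrative sketch (not from the source): first- vs second-order Taylor
# approximation of sin(x) around a = 0.5; the function and points are arbitrary.
import numpy as np

a = 0.5
f, f1, f2 = np.sin, np.cos, lambda t: -np.sin(t)   # f, f', f''

def taylor1(x):
    # linear approximation: f(a) + f'(a)(x - a)
    return f(a) + f1(a) * (x - a)

def taylor2(x):
    # quadratic approximation adds (1/2) f''(a) (x - a)^2
    return taylor1(x) + 0.5 * f2(a) * (x - a) ** 2

for x in [0.6, 0.8, 1.0]:
    print(f"x={x}: true={f(x):.6f}  "
          f"1st-order err={abs(f(x) - taylor1(x)):.2e}  "
          f"2nd-order err={abs(f(x) - taylor2(x)):.2e}")
```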
Remainder Terms
The approximation is only useful if you can bound the error.
Lagrange Remainder
If $f$ is $(n+1)$ times continuously differentiable on an interval containing $a$ and $x$, the remainder after the $n$-th order Taylor polynomial is:

$$R_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\,(x - a)^{n+1}$$

for some $\xi$ between $a$ and $x$.
Integral Form of Remainder
Under the same conditions:

$$R_n(x) = \int_a^x \frac{f^{(n+1)}(t)}{n!}\,(x - t)^n \, dt$$

This form is often more useful for bounding because you can estimate the integral directly without locating the unknown point $\xi$.
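A quick sanity check (not from the source): the integral form, evaluated with a simple trapezoid rule, reproduces the true error of the second-order polynomial for $e^x$. The values a = 0, x = 0.5, n = 2 are illustrative, and NumPy is assumed.

```python
# Illustrative sketch (not from the source): the integral form of the remainder,
# evaluated with a simple trapezoid rule, matches the true error of the
# second-order Taylor polynomial of e^x at a = 0, evaluated at x = 0.5.
import numpy as np
from math import factorial

a, x, n = 0.0, 0.5, 2
p_n = sum(x**k / factorial(k) for k in range(n + 1))   # P_2(x) = 1 + x + x^2/2
true_error = np.exp(x) - p_n

# R_n(x) = integral from a to x of f^(n+1)(t) (x - t)^n / n! dt, with f^(n+1)(t) = e^t.
t = np.linspace(a, x, 200001)
integrand = np.exp(t) * (x - t) ** n / factorial(n)
remainder = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t))

print(true_error, remainder)   # both ~0.02372, agreeing to quadrature accuracy
```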
Main Theorems
Taylor's Theorem with Lagrange Remainder
Statement
If $f$ is $(n+1)$ times continuously differentiable on an interval containing $a$ and $x$, then:

$$f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}\,(x - a)^k + \frac{f^{(n+1)}(\xi)}{(n+1)!}\,(x - a)^{n+1}$$

for some $\xi$ between $a$ and $x$.
Intuition
The Taylor polynomial matches $f$ perfectly at $a$. The error depends on how much the $(n+1)$-th derivative varies. If $|f^{(n+1)}|$ is bounded by $M$ on the interval between $a$ and $x$, then $|R_n(x)| \le \frac{M}{(n+1)!}\,|x - a|^{n+1}$.
Proof Sketch
Define $h(t) = f(t) - P_n(t) - C\,(t - a)^{n+1}$, where the constant $C$ is chosen so that $h(x) = 0$. Note $h(a) = h'(a) = \dots = h^{(n)}(a) = 0$ trivially. By Rolle's theorem applied repeatedly, find $\xi$ between $a$ and $x$ where $h^{(n+1)}(\xi) = 0$. Since $h^{(n+1)}(t) = f^{(n+1)}(t) - (n+1)!\,C$, solving for $C$ gives the Lagrange form.
Why It Matters
This is what lets you bound the error of a gradient descent step. If you use the first-order approximation and $f$ has bounded second derivative ($|f''| \le L$), then the approximation error over a step of size $\eta$ is at most $\frac{L\eta^2}{2}$. This is exactly why gradient descent with step size $\eta \le 1/L$ converges for $L$-smooth functions.
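A small sketch (illustrative, assuming NumPy) that spot-checks this bound for f = sin, whose second derivative is bounded by L = 1:

```python
# Illustrative sketch (not from the source): spot-check the first-order error
# bound |f(x + d) - f(x) - f'(x) d| <= L d^2 / 2 for f = sin, where |f''| <= L = 1.
import numpy as np

L = 1.0
rng = np.random.default_rng(0)
worst = 0.0
for _ in range(10_000):
    x = rng.uniform(-5.0, 5.0)
    d = rng.uniform(0.05, 1.0) * rng.choice([-1.0, 1.0])     # step of size |d|
    err = abs(np.sin(x + d) - (np.sin(x) + np.cos(x) * d))   # first-order model error
    worst = max(worst, err / (0.5 * L * d**2))
print(worst)   # stays below 1, consistent with the Lagrange remainder bound
```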
Failure Mode
The theorem requires sufficient differentiability. If $f$ is only once differentiable, you cannot write a second-order expansion with a Lagrange remainder. The bound is also only useful when $|x - a|$ is small; for large deviations the remainder can dominate.
Multivariate Taylor Expansion
For $f : \mathbb{R}^n \to \mathbb{R}$, the first-order expansion at $x_0$ is:

$$f(x) \approx f(x_0) + \nabla f(x_0)^\top (x - x_0)$$

where $\nabla f(x_0)$ is the gradient.
The second-order expansion is:

$$f(x) \approx f(x_0) + \nabla f(x_0)^\top (x - x_0) + \frac{1}{2}(x - x_0)^\top \nabla^2 f(x_0)\,(x - x_0)$$

where $\nabla^2 f(x_0)$ is the Hessian matrix with entries $[\nabla^2 f(x_0)]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(x_0)$.
If $f$ is twice continuously differentiable, the Hessian is symmetric. If additionally $f$ is convex, the Hessian is positive semidefinite at every point.
The full $k$-th order expansion generalizes using multi-index notation. For a multi-index $\alpha = (\alpha_1, \dots, \alpha_n)$ with $|\alpha| = \alpha_1 + \dots + \alpha_n$:

$$f(x) \approx \sum_{|\alpha| \le k} \frac{D^\alpha f(x_0)}{\alpha!}\,(x - x_0)^\alpha$$

where $D^\alpha f = \frac{\partial^{|\alpha|} f}{\partial x_1^{\alpha_1} \cdots \partial x_n^{\alpha_n}}$, $\alpha! = \alpha_1! \cdots \alpha_n!$, and $(x - x_0)^\alpha = \prod_i (x_i - x_{0,i})^{\alpha_i}$. In practice, the second-order form is what appears in Newton's method and quasi-Newton approximations; the Hessian quadratic captures all second-order directional information.
Why convex optimization works: When $f$ is convex and differentiable, the first-order Taylor expansion becomes a global lower bound: $f(y) \ge f(x) + \nabla f(x)^\top (y - x)$ for all $x, y$. This supporting hyperplane property is equivalent to convexity and underpins every convergence proof for gradient-based methods.
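The sketch below (not from the source; the convex test function, points, and names are illustrative) evaluates the first- and second-order models at random nearby points and checks the convex lower-bound property:

```python
# Illustrative sketch (not from the source): second-order Taylor model of the
# convex function f(x) = exp(x1 + x2) + x1^2, plus the first-order global lower
# bound guaranteed by convexity. The test function and points are arbitrary.
import numpy as np

def f(x):
    return np.exp(x[0] + x[1]) + x[0] ** 2

def grad(x):
    e = np.exp(x[0] + x[1])
    return np.array([e + 2 * x[0], e])

def hess(x):
    e = np.exp(x[0] + x[1])
    return np.array([[e + 2.0, e], [e, e]])

x0 = np.array([0.2, -0.1])
rng = np.random.default_rng(0)
for _ in range(5):
    x = x0 + rng.normal(scale=0.3, size=2)
    d = x - x0
    first = f(x0) + grad(x0) @ d             # linear model
    second = first + 0.5 * d @ hess(x0) @ d  # quadratic model
    print(f"f={f(x):+.5f}  1st={first:+.5f}  2nd={second:+.5f}  "
          f"lower bound holds: {f(x) >= first}")
```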
Worked Examples
Taylor approximation of e^x around zero
Let $f(x) = e^x$ expanded at $a = 0$. The derivatives are all $e^x$, so $f^{(k)}(0) = 1$ for every $k$.
The $n$-th order Taylor polynomial is:

$$P_n(x) = \sum_{k=0}^{n} \frac{x^k}{k!} = 1 + x + \frac{x^2}{2!} + \dots + \frac{x^n}{n!}$$

The Lagrange remainder after $n$ terms is:

$$R_n(x) = \frac{e^{\xi}}{(n+1)!}\,x^{n+1}$$

for some $\xi$ between $0$ and $x$. Because $e^{\xi} \le e^{|x|}$, we get the bound $|R_n(x)| \le \frac{e^{|x|}\,|x|^{n+1}}{(n+1)!}$.
Concretely, take $x = 0.1$ and $n = 2$, so $P_2(0.1) = 1 + 0.1 + 0.005 = 1.105$:
- True value: $e^{0.1} \approx 1.1051709$
- Error: $\approx 1.71 \times 10^{-4}$
- Bound: $\frac{e^{0.1}(0.1)^3}{3!} \approx 1.84 \times 10^{-4}$
The bound slightly overestimates the true error and confirms the approximation is accurate to three decimal places.
As $n$ increases, the remainder $R_n(x) \to 0$ for any fixed $x$ because factorials grow faster than any exponential; the Taylor series for $e^x$ converges everywhere.
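The numbers above can be reproduced in a few lines (illustrative sketch, assuming NumPy):

```python
# Illustrative sketch (not from the source): reproduce the e^x numbers above.
import numpy as np
from math import factorial

x, n = 0.1, 2
p_n = sum(x**k / factorial(k) for k in range(n + 1))          # Taylor polynomial
bound = np.exp(abs(x)) * abs(x) ** (n + 1) / factorial(n + 1)

print(f"P_{n}({x}) = {p_n:.7f}")             # 1.1050000
print(f"true      = {np.exp(x):.7f}")        # 1.1051709
print(f"error     = {np.exp(x) - p_n:.2e}")  # ~1.71e-04
print(f"bound     = {bound:.2e}")            # ~1.84e-04
```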
Remainder bound: log(1+x) at x=0.3
Let $f(x) = \log(1+x)$, $a = 0$, $x = 0.3$. Derivatives:
- $f'(x) = \frac{1}{1+x}$, so $f'(0) = 1$
- $f''(x) = -\frac{1}{(1+x)^2}$, so $f''(0) = -1$
Taylor polynomial: $P_2(x) = x - \frac{x^2}{2}$.
Lagrange remainder: $R_2(x) = \frac{f'''(\xi)}{3!}\,x^3 = \frac{x^3}{3(1+\xi)^3}$ for some $\xi$ between $0$ and $x$, using $f'''(x) = \frac{2}{(1+x)^3}$.
For $x = 0.3$: the worst case is $\xi = 0$, giving $|R_2(0.3)| \le \frac{0.3^3}{3} = 0.009$.
Actual: $\log(1.3) \approx 0.262364$. Polynomial: $P_2(0.3) = 0.3 - 0.045 = 0.255$. Error: $\approx 0.00736$, within the bound.
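A short check of the same numbers (illustrative sketch, assuming NumPy):

```python
# Illustrative sketch (not from the source): check the log(1 + x) remainder
# bound at x = 0.3 against the true error.
import numpy as np

x = 0.3
p2 = x - x**2 / 2          # second-order Taylor polynomial at a = 0
true = np.log1p(x)
bound = x**3 / 3           # worst case of x^3 / (3 (1 + xi)^3), attained at xi = 0

print(f"P_2(0.3) = {p2:.6f}")        # 0.255000
print(f"log(1.3) = {true:.6f}")      # 0.262364
print(f"error    = {true - p2:.6f}") # 0.007364
print(f"bound    = {bound:.6f}")     # 0.009000
```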
Gradient descent step size from Taylor
Let $f$ be $L$-smooth, meaning $\|\nabla^2 f(x)\| \le L$ everywhere. By second-order Taylor:

$$f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\,\|y - x\|^2$$

Minimizing the right side over $y$ gives $y = x - \frac{1}{L}\nabla f(x)$, yielding:

$$f\!\left(x - \tfrac{1}{L}\nabla f(x)\right) \le f(x) - \frac{1}{2L}\,\|\nabla f(x)\|^2$$

This is the descent lemma, the workhorse of gradient descent convergence proofs. The step size $\eta = 1/L$ is precisely the reciprocal of the Hessian's spectral norm bound.
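A minimal numerical check of the descent lemma (not from the source): for a random positive definite quadratic, one gradient step with $\eta = 1/L$ achieves at least the guaranteed decrease. The matrix, dimension, and names are illustrative.

```python
# Illustrative sketch (not from the source): verify the descent lemma
# f(x - grad/L) <= f(x) - ||grad||^2 / (2L) for a random positive definite
# quadratic f(x) = 0.5 x^T A x - b^T x, which is L-smooth with L = ||A||_2.
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)          # symmetric positive definite Hessian
b = rng.normal(size=5)
L = np.linalg.norm(A, 2)         # spectral norm = smoothness constant

def f(x):    return 0.5 * x @ A @ x - b @ x
def grad(x): return A @ x - b

for _ in range(5):
    x = rng.normal(size=5)
    g = grad(x)
    lhs = f(x - g / L)            # value after one gradient step with eta = 1/L
    rhs = f(x) - g @ g / (2 * L)  # guaranteed decrease from the descent lemma
    print(bool(lhs <= rhs + 1e-12))   # True
```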
Newton's Method as Second-Order Taylor Minimization
Newton's method minimizes a function by iteratively fitting and minimizing the local second-order Taylor model. At iterate $x_k$, the model is:

$$m_k(x) = f(x_k) + \nabla f(x_k)^\top (x - x_k) + \frac{1}{2}(x - x_k)^\top \nabla^2 f(x_k)\,(x - x_k)$$

Setting $\nabla m_k(x) = 0$ gives $\nabla^2 f(x_k)\,(x - x_k) = -\nabla f(x_k)$, so the Newton step is $x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1}\nabla f(x_k)$.
If $f$ is $\mu$-strongly convex and $L$-smooth with condition number $\kappa = L/\mu$ (where $\mu I \preceq \nabla^2 f(x) \preceq L I$), then:
- Gradient descent requires $O(\kappa \log(1/\epsilon))$ iterations to reach precision $\epsilon$.
- Newton's method achieves quadratic convergence in the local region: $\|x_{k+1} - x^*\| \le C\,\|x_k - x^*\|^2$, provided the Hessian is Lipschitz continuous (i.e. $\|\nabla^2 f(x) - \nabla^2 f(y)\| \le M\,\|x - y\|$). Strong convexity alone gives only superlinear convergence; the quadratic rate comes from the third-order Taylor remainder, which is bounded only when the Hessian is Lipschitz. Without Hessian Lipschitz continuity, Newton can stall, oscillate, or overshoot far from the optimum, which is why practical implementations combine Newton with line search, trust regions, or cubic regularization (Nesterov-Polyak).
The gap is dramatic for ill-conditioned problems. For $\kappa = 10^4$, gradient descent needs roughly $10^4$ steps just to reduce the error by a constant factor; Newton's method reaches machine precision in fewer than 10 steps from a good starting point.
The cost: each Newton step requires solving an $n \times n$ linear system, $O(n^3)$ in general. This is why quasi-Newton methods (BFGS, L-BFGS) approximate the inverse Hessian without explicit computation.
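An illustrative comparison (not from the source) on a two-dimensional quadratic with $\kappa = 10^4$: gradient descent with step $1/L$ needs on the order of $10^5$ iterations to reach a tight tolerance, while the Newton step solves it immediately. The problem, tolerance, and names are arbitrary choices, and NumPy is assumed.

```python
# Illustrative sketch (not from the source): gradient descent vs Newton on an
# ill-conditioned quadratic f(x) = 0.5 * (x1^2 + 1e4 * x2^2), kappa = 1e4.
import numpy as np

d = np.array([1.0, 1e4])         # Hessian diagonal; condition number kappa = 1e4
L = d.max()                      # smoothness constant

def f(x):    return 0.5 * np.sum(d * x**2)
def grad(x): return d * x

x = np.array([1.0, 1.0])
gd_iters = 0
while f(x) > 1e-10:
    x = x - grad(x) / L          # gradient descent with step size 1/L
    gd_iters += 1

x = np.array([1.0, 1.0])
newton_iters = 0
while f(x) > 1e-10:
    x = x - grad(x) / d          # Newton step: the Hessian is diag(d)
    newton_iters += 1

print(gd_iters, newton_iters)    # roughly 1e5 iterations vs 1
```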
Common Confusions
Taylor expansion is local, not global
The Taylor polynomial centered at $a$ approximates $f$ well near $a$. It says nothing about the function far from $a$. The classic example $f(x) = e^{-1/x^2}$ (with $f(0) = 0$) has a Taylor series at 0 that converges everywhere, because every derivative at 0 vanishes, yet the series is identically zero and agrees with the function only at 0; no polynomial centered at 0 captures its behavior away from the origin. In optimization, this locality is why step sizes must be small enough.
Second-order methods are not always better
Newton's method uses the Hessian and converges faster per step. But computing and inverting an $n \times n$ Hessian costs $O(n^2)$ storage and $O(n^3)$ time. For neural networks with millions of parameters, this is infeasible. First-order methods win by being cheap per step, even if they need more steps.
Exercises
Problem
Compute the second-order Taylor expansion of at . Use the Lagrange remainder to bound the error for .
Problem
Let $f$ be twice continuously differentiable with $|f''(x)| \le L$ for all $x$. Prove that $f(y) \le f(x) + f'(x)(y - x) + \frac{L}{2}(y - x)^2$ for all $x, y$.
References
Canonical:
- Rudin, Principles of Mathematical Analysis (1976), Chapter 5: Taylor's theorem, remainder forms, and uniform convergence of series
- Apostol, Mathematical Analysis (1974), Chapter 5: single-variable Taylor theorem with Lagrange and integral remainder
- Spivak, Calculus on Manifolds (1965), Chapter 2: multivariate Taylor expansion and the Hessian
Current:
- Boyd & Vandenberghe, Convex Optimization (2004), Section 9.1: Taylor approximation in optimization, descent lemma derivation
- Nesterov, Introductory Lectures on Convex Optimization (2004), Section 1.2: smoothness conditions and gradient descent bounds via Taylor
- Nocedal & Wright, Numerical Optimization (2006), Chapter 2: Taylor models as the basis for line search and trust region methods
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
Required prerequisites
- Continuity in Rⁿ (layer 0A · tier 1)
- Differentiation in Rⁿ (layer 0A · tier 1)
Derived topics
- Automatic Differentiation (layer 1 · tier 1)
- Convex Optimization Basics (layer 1 · tier 1)
- Newton's Method (layer 1 · tier 1)