
Foundations

Taylor Expansion

Taylor approximation in one and many variables. Every optimization algorithm is a Taylor approximation: gradient descent uses first order, Newton's method uses second order.


Why This Matters

Every optimization algorithm you use in ML is a Taylor approximation in disguise.

Gradient descent: approximate $f(x + h) \approx f(x) + \nabla f(x)^T h$ (first order). The symbol $\nabla f(x)$ here is the gradient. The linear model is unbounded below, so you cannot literally minimize it; instead you minimize it under a step-size constraint (a trust region $\|h\| \leq \Delta$, a proximal/quadratic regularizer $\frac{1}{2\eta}\|h\|^2$, or equivalently a fixed step size $\eta$). The proximal form $\arg\min_h \nabla f(x)^T h + \frac{1}{2\eta}\|h\|^2$ recovers $h = -\eta \nabla f(x)$, the gradient descent step.
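That closed-form minimization is easy to verify numerically. The sketch below (my example, assuming NumPy; `prox_linear_step` is a hypothetical helper name, not from the source) checks that $h = -\eta \nabla f(x)$ beats random candidate steps on the regularized model:

```python
import numpy as np

def prox_linear_step(grad, eta):
    """Closed-form minimizer of the regularized linear model
    m(h) = grad @ h + (1 / (2 * eta)) * h @ h.
    Setting dm/dh = grad + h / eta = 0 gives h = -eta * grad."""
    return -eta * grad

grad = np.array([3.0, -4.0])   # gradient at the current point (made-up numbers)
eta = 0.1
h_star = prox_linear_step(grad, eta)

# The closed form should beat any other candidate step on the model m.
m = lambda h: grad @ h + (h @ h) / (2 * eta)
rng = np.random.default_rng(0)
for _ in range(100):
    assert m(h_star) <= m(rng.normal(size=2))
```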

Newton's method: approximate $f(x + h) \approx f(x) + \nabla f(x)^T h + \frac{1}{2} h^T \nabla^2 f(x) h$ (second order), then minimize this quadratic exactly.

Adam, L-BFGS, natural gradient: all variations on which Taylor terms to keep and how to estimate them. Understanding Taylor expansion means understanding the foundation of all gradient-based optimization.

Single Variable Taylor Expansion

Definition

Taylor Polynomial

The $k$-th order Taylor polynomial of $f$ centered at $a$ is:

$$T_k(x; a) = \sum_{j=0}^{k} \frac{f^{(j)}(a)}{j!}(x - a)^j$$

This is the unique polynomial of degree $\leq k$ that matches $f$ and its first $k$ derivatives at $a$.

The cases that matter most:

First order (linear approximation):

$$f(x + h) \approx f(x) + f'(x) h$$

Second order (quadratic approximation):

$$f(x + h) \approx f(x) + f'(x) h + \frac{1}{2} f''(x) h^2$$
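Both models can be probed numerically. The sketch below (my example, using $\sin$ because its derivatives are known exactly) confirms the error orders: halving $h$ cuts the first-order error by about $4$ ($O(h^2)$) and the second-order error by about $8$ ($O(h^3)$):

```python
import math

# Demo function: f(x) = sin(x), with exact first and second derivatives.
f, df, d2f = math.sin, math.cos, lambda x: -math.sin(x)

x = 0.7
errs = {}
for h in (0.1, 0.05):
    lin = f(x) + df(x) * h                # first-order (linear) model
    quad = lin + 0.5 * d2f(x) * h * h     # second-order (quadratic) model
    errs[h] = (abs(f(x + h) - lin), abs(f(x + h) - quad))

lin_ratio = errs[0.1][0] / errs[0.05][0]    # expect ~4 (error is O(h^2))
quad_ratio = errs[0.1][1] / errs[0.05][1]   # expect ~8 (error is O(h^3))
```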

Remainder Terms

The approximation is only useful if you can bound the error.

Definition

Lagrange Remainder

If $f$ is $(k+1)$ times continuously differentiable on an interval containing $a$ and $x$, the remainder after the $k$-th order Taylor polynomial is:

$$R_k(x; a) = \frac{f^{(k+1)}(c)}{(k+1)!}(x - a)^{k+1}$$

for some $c$ between $a$ and $x$.

Definition

Integral Form of Remainder

Under the same conditions:

$$R_k(x; a) = \int_a^x \frac{f^{(k+1)}(t)}{k!}(x - t)^k \, dt$$

This form is often more useful for bounding because you can estimate the integral directly without locating the unknown point $c$.
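For instance (a sketch using only the standard library), the integral form can be evaluated with the trapezoid rule for $f(t) = e^t$, $a = 0$, $x = 0.5$, $k = 3$, and compared against the actual error $e^x - T_3(x)$:

```python
import math

# R_3 = integral from 0 to x of e^t / 3! * (x - t)^3 dt, which should
# equal the actual approximation error e^x - T_3(x).
x, k, n = 0.5, 3, 20000
dt = x / n
ts = [i * dt for i in range(n + 1)]
g = [math.exp(t) / math.factorial(k) * (x - t) ** k for t in ts]
remainder = dt * (0.5 * (g[0] + g[-1]) + sum(g[1:-1]))   # trapezoid rule

T3 = sum(x**j / math.factorial(j) for j in range(k + 1))
assert abs(remainder - (math.exp(x) - T3)) < 1e-8
```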

Main Theorems

Theorem

Taylor Theorem with Lagrange Remainder

Statement

If $f: \mathbb{R} \to \mathbb{R}$ is $(k+1)$ times continuously differentiable on an interval $I$ containing $a$ and $x$, then:

$$f(x) = \sum_{j=0}^{k} \frac{f^{(j)}(a)}{j!}(x-a)^j + \frac{f^{(k+1)}(c)}{(k+1)!}(x-a)^{k+1}$$

for some $c$ between $a$ and $x$.

Intuition

The Taylor polynomial matches $f$ perfectly at $a$. The error depends on how much the $(k+1)$-th derivative varies. If $|f^{(k+1)}|$ is bounded by $M$ on $I$, then $|R_k| \leq M|x-a|^{k+1}/(k+1)!$.

Proof Sketch

Define $g(t) = f(t) - T_k(t; a) - C(t-a)^{k+1}$, with $C$ chosen so that $g(x) = 0$. By construction, $g(a) = g'(a) = \cdots = g^{(k)}(a) = 0$. Applying Rolle's theorem repeatedly yields a point $c$ between $a$ and $x$ with $g^{(k+1)}(c) = 0$; since $T_k$ has degree $k$, this reads $f^{(k+1)}(c) = C(k+1)!$, and solving for $C$ gives the Lagrange form.

Why It Matters

This is what lets you bound the error of gradient descent. If you use the first-order approximation and $f$ has bounded second derivative ($|f''| \leq L$), then the approximation error over a step of size $h$ is at most $\frac{L}{2}h^2$. This is exactly why gradient descent with step size $1/L$ converges for $L$-smooth functions.

Failure Mode

The theorem requires sufficient differentiability. If $f$ is only once differentiable, you cannot write a second-order expansion with Lagrange remainder. The bound is also only useful when $|x - a|$ is small; for large deviations the remainder can dominate.

Multivariate Taylor Expansion

For $f: \mathbb{R}^n \to \mathbb{R}$, the first-order expansion at $x$ is:

$$f(x + h) = f(x) + \nabla f(x)^T h + O(\|h\|^2)$$

where $\nabla f(x) \in \mathbb{R}^n$ is the gradient.

The second-order expansion is:

$$f(x + h) = f(x) + \nabla f(x)^T h + \frac{1}{2} h^T \nabla^2 f(x) \, h + O(\|h\|^3)$$

where $\nabla^2 f(x) \in \mathbb{R}^{n \times n}$ is the Hessian matrix with entries $[\nabla^2 f(x)]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$.

If $f$ is twice continuously differentiable, the Hessian is symmetric. If additionally $f$ is convex, the Hessian is positive semidefinite at every point.

The full $k$-th order expansion generalizes using multi-index notation. For a multi-index $\alpha = (\alpha_1, \ldots, \alpha_n)$ with $|\alpha| = \alpha_1 + \cdots + \alpha_n$:

$$f(x + h) = \sum_{|\alpha| \leq k} \frac{\partial^\alpha f(x)}{\alpha!} h^\alpha + R_k(x; h)$$

where $\alpha! = \alpha_1! \cdots \alpha_n!$ and $h^\alpha = h_1^{\alpha_1} \cdots h_n^{\alpha_n}$. In practice, the second-order form is what appears in Newton's method and quasi-Newton approximations; the Hessian quadratic $h^T H h$ captures all second-order directional information.
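As a quick numerical sanity check (a sketch, not from the source), the first- and second-order models can be compared on a function whose gradient and Hessian are known in closed form; $f(x) = e^{c^T x}$ is an arbitrary choice:

```python
import numpy as np

# f(x) = exp(c @ x) has gradient c * f(x) and Hessian (c c^T) * f(x).
c = np.array([0.3, -0.2, 0.5])            # arbitrary made-up coefficients
f = lambda x: np.exp(c @ x)
grad = lambda x: c * f(x)
hess = lambda x: np.outer(c, c) * f(x)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                    # expansion point
h = 0.01 * rng.normal(size=3)             # small step

first = f(x) + grad(x) @ h                # first-order model
second = first + 0.5 * h @ hess(x) @ h    # second-order model

# The quadratic term shrinks the model error from O(||h||^2) to O(||h||^3).
assert abs(f(x + h) - second) < abs(f(x + h) - first)
```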

Why convex optimization works: When $f$ is convex and differentiable, the first-order Taylor expansion becomes a global lower bound: $f(y) \geq f(x) + \nabla f(x)^T (y - x)$ for all $x, y$. This supporting hyperplane property is equivalent to convexity (for differentiable $f$) and underpins the convergence proofs for gradient-based methods.
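This lower-bound property is easy to spot-check on a one-dimensional convex function; softplus $f(x) = \log(1 + e^x)$ is an arbitrary choice (a sketch, not from the source):

```python
import math

# Convex example: softplus f(x) = log(1 + e^x), with f'(x) = sigmoid(x).
f = lambda x: math.log1p(math.exp(x))
df = lambda x: 1.0 / (1.0 + math.exp(-x))

# Every tangent line lies below the function, at every pair of points.
xs = [i * 0.5 for i in range(-8, 9)]
for x in xs:
    for y in xs:
        assert f(y) >= f(x) + df(x) * (y - x) - 1e-12
```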

Worked Examples

Example

Taylor approximation of e^x around zero

Let $f(x) = e^x$ expanded at $a = 0$. The derivatives are all $f^{(j)}(x) = e^x$, so $f^{(j)}(0) = 1$ for every $j$.

The $k$-th order Taylor polynomial is:

$$T_k(x; 0) = \sum_{j=0}^{k} \frac{x^j}{j!} = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \cdots + \frac{x^k}{k!}$$

The Lagrange remainder after $k$ terms is:

$$R_k(x; 0) = \frac{e^c}{(k+1)!} x^{k+1}$$

for some $c$ between $0$ and $x$. Because $e^c < e^{|x|}$, we get the bound $|R_k(x; 0)| \leq e^{|x|} \cdot |x|^{k+1} / (k+1)!$.

Concretely, for $x = 0.5$ and $k = 3$:

  • $T_3(0.5) = 1 + 0.5 + 0.125 + 0.02083 = 1.64583$
  • True value: $e^{0.5} = 1.64872$
  • Error: $0.00289$
  • Bound: $e^{0.5} \cdot (0.5)^4 / 24 = 1.649 \cdot 0.002604 = 0.00429$

The bound is loose but confirms the approximation is accurate to about two decimal places.

As $k$ increases, the remainder $|x|^{k+1}/(k+1)! \to 0$ for any fixed $x$ because factorials grow faster than any exponential; the Taylor series for $e^x$ converges everywhere.
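The arithmetic in this example can be reproduced directly (standard library only):

```python
import math

# Reproduce the x = 0.5, k = 3 numbers from the example above.
x, k = 0.5, 3
T3 = sum(x**j / math.factorial(j) for j in range(k + 1))
err = math.exp(x) - T3
bound = math.exp(x) * x**(k + 1) / math.factorial(k + 1)

assert round(T3, 5) == 1.64583
assert round(err, 5) == 0.00289
assert round(bound, 5) == 0.00429
assert 0 < err <= bound          # Lagrange bound holds, with some slack
```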

Example

Remainder bound: log(1+x) at x=0.3

Let $f(x) = \log(1 + x)$, $a = 0$, $k = 2$. Derivatives:

  • $f'(x) = 1/(1+x)$, so $f'(0) = 1$
  • $f''(x) = -1/(1+x)^2$, so $f''(0) = -1$
  • $f'''(x) = 2/(1+x)^3$

Taylor polynomial: $T_2(x; 0) = x - x^2/2$.

Lagrange remainder: $R_2(x; 0) = f'''(c) \cdot x^3/6 = \frac{2}{(1+c)^3} \cdot \frac{x^3}{6}$ for some $c \in (0, x)$.

For $x = 0.3$: the worst case is $c \to 0$, giving $|R_2| \leq 2 \cdot (0.3)^3 / 6 = 0.009$.

Actual: $\log(1.3) = 0.26236$. Polynomial: $0.3 - 0.045 = 0.255$. Error: $0.00736$, within the bound.
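Again reproducible in a few lines (standard library only):

```python
import math

# Reproduce the log(1 + x) numbers: x = 0.3, second-order polynomial.
x = 0.3
T2 = x - x**2 / 2                 # 0.255
err = abs(math.log1p(x) - T2)     # actual error, about 0.00736
bound = 2 * x**3 / 6              # worst case c -> 0 in 2/(1+c)^3 * x^3/6

assert math.isclose(T2, 0.255)
assert round(err, 5) == 0.00736
assert math.isclose(bound, 0.009)
assert err <= bound
```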

Example

Gradient descent step size from Taylor

Let $f$ be $L$-smooth, meaning $\|\nabla^2 f(x)\| \leq L$ everywhere. By second-order Taylor:

$$f(x - \eta \nabla f(x)) \leq f(x) - \eta \|\nabla f(x)\|^2 + \frac{L \eta^2}{2} \|\nabla f(x)\|^2$$

Minimizing the right side over $\eta$ gives $\eta^* = 1/L$, yielding:

$$f\!\left(x - \frac{1}{L} \nabla f(x)\right) \leq f(x) - \frac{1}{2L} \|\nabla f(x)\|^2$$

This is the descent lemma, the workhorse of gradient descent convergence proofs. The step size $1/L$ is precisely the reciprocal of the Hessian's spectral norm bound.
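The inequality can be verified numerically on a toy quadratic (a sketch; the matrix $A$ is a made-up example, and $L$ is its largest eigenvalue, which here equals the spectral norm of the Hessian):

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T A x with A symmetric positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
L = np.linalg.eigvalsh(A).max()     # smoothness constant = largest eigenvalue

x = np.array([1.0, -2.0])
g = grad(x)
x_new = x - g / L                   # gradient step with eta = 1/L

# Descent lemma: f(x - (1/L) grad) <= f(x) - ||grad||^2 / (2L).
assert f(x_new) <= f(x) - (g @ g) / (2 * L) + 1e-12
```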

Newton's Method as Second-Order Taylor Minimization

Newton's method minimizes a function by iteratively fitting and minimizing the local second-order Taylor model. At iterate $x_t$, the model is:

$$m(h) = f(x_t) + \nabla f(x_t)^T h + \frac{1}{2} h^T \nabla^2 f(x_t) \, h$$

Setting $\nabla_h m(h) = 0$ gives $\nabla^2 f(x_t) h = -\nabla f(x_t)$, so the Newton step is $h^* = -[\nabla^2 f(x_t)]^{-1} \nabla f(x_t)$.

If $f$ is strongly convex with condition number $\kappa = L/\mu$ (where $\mu I \preceq \nabla^2 f \preceq L I$), then:

  • Gradient descent requires $O(\kappa \log(1/\varepsilon))$ iterations to reach precision $\varepsilon$.
  • Newton's method achieves quadratic convergence in a local region: $\|x_{t+1} - x^*\| \leq C \|x_t - x^*\|^2$, provided the Hessian is Lipschitz continuous (i.e. $\|\nabla^2 f(x) - \nabla^2 f(y)\| \leq M \|x - y\|$). Strong convexity alone gives only superlinear convergence; the quadratic rate comes from the remainder of the second-order Taylor model, which is $O(\|x_t - x^*\|^3)$ only when the Hessian is Lipschitz. Far from the optimum, Newton can stall, oscillate, or overshoot, which is why practical implementations combine Newton with line search, trust regions, or cubic regularization (Nesterov-Polyak).

The gap is dramatic for ill-conditioned problems. For $\kappa = 10^4$, gradient descent needs roughly $10^4$ steps just to reduce the error by a factor of $e$; Newton's method reaches machine precision in fewer than 10 steps from a good starting point.

The cost: each Newton step requires solving an $n \times n$ linear system, $O(n^3)$ in general. This is why quasi-Newton methods (BFGS, L-BFGS) approximate the Hessian inverse without explicit computation.
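A minimal illustration of the gap (a sketch; the ill-conditioned quadratic is a hypothetical test problem, and a quadratic is the special case where the second-order model is the function itself, so a single Newton step is exact):

```python
import numpy as np

# Ill-conditioned quadratic: f(x) = 0.5 * sum(d * x**2), Hessian diag(d), kappa = 100.
d = np.array([1.0, 100.0])
grad = lambda x: d * x

# 100 gradient descent steps with the safe step size 1/L.
x_gd = np.array([1.0, 1.0])
eta = 1.0 / d.max()
for _ in range(100):
    x_gd = x_gd - eta * grad(x_gd)    # slow coordinate contracts by 1 - 1/kappa

# One Newton step: solve diag(d) h = -grad, i.e. h = -x, landing at the minimum.
x_nt = np.array([1.0, 1.0])
x_nt = x_nt - grad(x_nt) / d

assert np.allclose(x_nt, 0.0)         # Newton: done in one step
assert np.linalg.norm(x_gd) > 0.3     # GD: still ~0.99^100 ≈ 0.37 away in x[0]
```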

Common Confusions

Watch Out

Taylor expansion is local, not global

The Taylor polynomial centered at $a$ approximates $f$ well near $a$. It says nothing about the function far from $a$. The Taylor series of $e^x$ happens to converge everywhere, but that is the exception: the series for $\log(1+x)$ at $0$ diverges for $x > 1$, no matter how many terms you keep. In optimization, this locality is why step sizes must be small enough.

Watch Out

Second-order methods are not always better

Newton's method uses the Hessian and converges faster per step. But computing and inverting an $n \times n$ Hessian costs $O(n^2)$ storage and $O(n^3)$ time. For neural networks with millions of parameters, this is infeasible. First-order methods win by being cheap per step, even if they need more steps.

Exercises

ExerciseCore

Problem

Compute the second-order Taylor expansion of $f(x) = \log(1 + x)$ at $x = 0$. Use the Lagrange remainder to bound the error for $|x| \leq 0.1$.

ExerciseAdvanced

Problem

Let $f: \mathbb{R}^n \to \mathbb{R}$ be twice continuously differentiable with $\|\nabla^2 f(x)\|_{\text{op}} \leq L$ for all $x$. Prove that $|f(y) - f(x) - \nabla f(x)^T(y-x)| \leq \frac{L}{2}\|y-x\|^2$.

References

Canonical:

  • Rudin, Principles of Mathematical Analysis (1976), Chapter 5: Taylor's theorem, remainder forms, and uniform convergence of series
  • Apostol, Mathematical Analysis (1974), Chapter 5: single-variable Taylor theorem with Lagrange and integral remainder
  • Spivak, Calculus on Manifolds (1965), Chapter 2: multivariate Taylor expansion and the Hessian

Current:

  • Boyd & Vandenberghe, Convex Optimization (2004), Section 9.1: Taylor approximation in optimization, descent lemma derivation
  • Nesterov, Introductory Lectures on Convex Optimization (2004), Section 1.2: smoothness conditions and gradient descent bounds via Taylor
  • Nocedal & Wright, Numerical Optimization (2006), Chapter 2: Taylor models as the basis for line search and trust region methods

Last reviewed: April 26, 2026
