
Numerical Optimization

Line Search Methods

Line search chooses a step size along a descent direction. Armijo guarantees enough decrease, Wolfe prevents uselessly tiny steps, and backtracking gives a practical step-size controller without knowing the Lipschitz constant.


Why This Matters

Gradient descent gives a direction. It does not give a safe distance.

For a current point $x_k$ and a descent direction $p_k$, the update is

$$x_{k+1} = x_k + \alpha_k p_k.$$

The line search problem is to choose $\alpha_k$. Too large: the objective rises or curvature invalidates the local model. Too small: the method technically descends but wastes iterations. Line search is the step-size controller behind gradient descent, Newton's method, quasi-Newton methods, nonlinear conjugate gradient, and many constrained solvers.

Ray View

Reduce the multivariable problem to a one-dimensional function along a ray:

$$\phi(\alpha) = f(x_k + \alpha p_k).$$

The starting slope is

$$\phi'(0) = \nabla f(x_k)^T p_k.$$

A direction is a descent direction when $\phi'(0) < 0$. Line search does not need to solve the original optimization problem; it only asks whether a candidate $\alpha$ is acceptable on this ray.
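To make the ray view concrete, here is a sketch on a made-up quadratic (the matrix $A$, the point, and the direction below are illustrative, not from the text): the slope of $\phi$ at zero equals $\nabla f(x_k)^T p_k$ and is negative for a descent direction.

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 x^T A x (an assumed example).
A = np.array([[3.0, 0.0], [0.0, 1.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x_k = np.array([1.0, 2.0])
p_k = -grad(x_k)                     # steepest-descent direction

phi = lambda a: f(x_k + a * p_k)     # one-dimensional restriction along the ray

slope0 = grad(x_k) @ p_k             # phi'(0) = grad f(x_k)^T p_k
finite_diff = (phi(1e-6) - phi(0.0)) / 1e-6

print(slope0)                        # negative: p_k is a descent direction
print(abs(slope0 - finite_diff))     # finite difference agrees to first order
```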

Armijo Sufficient Decrease

Definition

Armijo Condition

A step size $\alpha > 0$ satisfies the Armijo condition with $c_1 \in (0,1)$ when

$$f(x_k + \alpha p_k) \leq f(x_k) + c_1 \alpha \nabla f(x_k)^T p_k.$$

Because $\nabla f(x_k)^T p_k < 0$, the right-hand side is below $f(x_k)$. The candidate step must decrease the objective by at least a fixed fraction of the first-order prediction.

Typical values use $c_1 = 10^{-4}$. The number is small because Armijo is not trying to demand the exact one-dimensional minimum; it is trying to reject steps that are obviously too aggressive.
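The condition is a one-line check in code. A minimal sketch, with the hypothetical helper name `armijo_ok` and illustrative numbers on $f(x) = x^2$:

```python
import numpy as np

def armijo_ok(f, grad_fk, x_k, p_k, alpha, c1=1e-4):
    """Check the Armijo sufficient-decrease condition (a minimal sketch)."""
    return f(x_k + alpha * p_k) <= f(x_k) + c1 * alpha * (grad_fk @ p_k)

# Toy check on f(x) = x^2 in 1D (illustrative, not from the text).
f = lambda x: float(x @ x)
x_k = np.array([1.0])
g = 2.0 * x_k                     # gradient of x^2 at x_k
p_k = -g                          # steepest-descent direction

print(armijo_ok(f, g, x_k, p_k, alpha=0.5))   # steps to the minimum: accepted
print(armijo_ok(f, g, x_k, p_k, alpha=2.0))   # overshoots badly: rejected
```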

Lemma

Armijo Steps Exist For Descent Directions

Statement

If $\nabla f(x_k)^T p_k < 0$, then every sufficiently small positive $\alpha$ satisfies the Armijo condition.

Intuition

For tiny $\alpha$, the function behaves like its first-order Taylor approximation. Since $p_k$ points downhill at $\alpha = 0$, a small enough step must decrease the objective.

Proof Sketch

Taylor expansion gives $f(x_k + \alpha p_k) = f(x_k) + \alpha \nabla f(x_k)^T p_k + o(\alpha)$. Subtract the Armijo right-hand side: the difference is $\alpha(1 - c_1)\nabla f(x_k)^T p_k + o(\alpha)$, whose leading term is negative because $c_1 < 1$ and $\nabla f(x_k)^T p_k < 0$. For sufficiently small $\alpha$, the $o(\alpha)$ remainder cannot overturn that sign.

Backtracking

Backtracking line search is the practical Armijo algorithm:

  1. Start with $\alpha = \alpha_0$, often $\alpha_0 = 1$ for Newton-like methods.
  2. If Armijo fails, set $\alpha \leftarrow \rho\alpha$ with $\rho \in (0,1)$.
  3. Repeat until Armijo holds.

The algorithm terminates because of the Armijo existence lemma.
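The three steps can be sketched as a small routine. Parameter names and defaults such as `rho=0.5` are illustrative, not prescribed by the text; the example objective is an assumed quadratic:

```python
import numpy as np

def backtracking(f, g_k, x_k, p_k, alpha0=1.0, rho=0.5, c1=1e-4, max_iter=50):
    """Shrink alpha geometrically until the Armijo condition holds (a sketch)."""
    f_k, slope = f(x_k), g_k @ p_k
    assert slope < 0, "p_k must be a descent direction"
    alpha = alpha0
    for _ in range(max_iter):
        if f(x_k + alpha * p_k) <= f_k + c1 * alpha * slope:
            return alpha
        alpha *= rho                       # geometric shrinking
    return alpha                           # fallback after max_iter reductions

# Illustrative quadratic f(x) = 0.5 x^T A x with a mildly ill-conditioned A.
A = np.diag([10.0, 1.0])
f = lambda x: 0.5 * x @ A @ x
x_k = np.array([1.0, 1.0])
g_k = A @ x_k
p_k = -g_k

alpha = backtracking(f, g_k, x_k, p_k)
print(alpha)                               # first accepted power of rho
print(f(x_k + alpha * p_k) < f(x_k))       # accepted step strictly decreases f
```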

Theorem

Gradient Descent With Backtracking Reaches Stationary Points

Statement

For gradient descent with Armijo backtracking, the gradients satisfy

$$\min_{0\leq k<K}\|\nabla f(x_k)\|^2 \leq \frac{C\,(f(x_0)-f^*)}{K}$$

for a constant $C$ depending on the line-search parameters and the smoothness constant. In particular, some iterate has a small gradient as $K$ grows, and under standard assumptions the gradient norm converges to zero.

Intuition

Backtracking prevents steps from being too large, while stopping at the first acceptable step prevents systematic microscopic steps. Each accepted step pays for a decrease proportional to the squared gradient.

Proof Sketch

Smoothness gives a quadratic upper bound on $f(x_k - \alpha\nabla f(x_k))$. This bound implies Armijo holds once $\alpha$ is below a constant multiple of the inverse smoothness constant. Therefore accepted steps have a uniform lower bound unless the initial step is smaller. Armijo then gives a telescoping sum of decreases proportional to $\|\nabla f(x_k)\|^2$.

Failure Mode

This proves first-order stationarity, not global optimality. Non-convex objectives can still converge to saddles or poor local minima.
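The telescoping decrease can be watched numerically. A sketch, assuming a simple quadratic objective and the standard backtracking loop; the matrix and starting point are illustrative:

```python
import numpy as np

# Gradient descent with Armijo backtracking on an illustrative quadratic
# f(x) = 0.5 x^T A x; the gradient norm is driven toward zero.
A = np.diag([10.0, 1.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
c1, rho = 1e-4, 0.5
for k in range(200):
    g = grad(x)
    alpha = 1.0
    while f(x - alpha * g) > f(x) - c1 * alpha * (g @ g):
        alpha *= rho                       # backtrack until Armijo holds
    x = x - alpha * g

print(np.linalg.norm(grad(x)))             # tiny after 200 accepted steps
```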

Wolfe Conditions

Armijo alone rejects steps that are too large, but it can accept steps that are too small. Wolfe adds a curvature condition:

Definition

Wolfe Conditions

A step satisfies the Wolfe conditions when

$$f(x_k + \alpha p_k) \leq f(x_k) + c_1 \alpha \nabla f(x_k)^T p_k$$

and

$$\nabla f(x_k + \alpha p_k)^T p_k \geq c_2 \nabla f(x_k)^T p_k,$$

where $0 < c_1 < c_2 < 1$.

The curvature condition says the directional derivative at the new point should be less negative than $c_2$ times the starting slope. In plain language: do not stop while the slope is still strongly downhill.

The strong Wolfe condition replaces the second inequality with

$$|\nabla f(x_k + \alpha p_k)^T p_k| \leq c_2\,|\nabla f(x_k)^T p_k|.$$

Strong Wolfe also avoids stepping so far that the slope becomes too positive.
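All three conditions can be evaluated for a single candidate step. A sketch with the hypothetical helper `wolfe_flags`, illustrative numbers on $f(x) = x^2$, and the common quasi-Newton choice $c_2 = 0.9$:

```python
import numpy as np

def wolfe_flags(f, grad, x_k, p_k, alpha, c1=1e-4, c2=0.9):
    """Evaluate Armijo, Wolfe curvature, and strong Wolfe for one alpha (a sketch)."""
    g0 = grad(x_k) @ p_k                      # phi'(0), negative for descent
    g1 = grad(x_k + alpha * p_k) @ p_k        # phi'(alpha)
    armijo = f(x_k + alpha * p_k) <= f(x_k) + c1 * alpha * g0
    curvature = g1 >= c2 * g0
    strong = abs(g1) <= c2 * abs(g0)
    return armijo, curvature, strong

# On f(x) = x^2 with x_k = 1 and p_k = -2, phi(a) = (1 - 2a)^2.
f = lambda x: float(x @ x)
grad = lambda x: 2.0 * x
x_k = np.array([1.0])
p_k = -grad(x_k)

print(wolfe_flags(f, grad, x_k, p_k, alpha=0.01))  # tiny step: Armijo holds, curvature fails
print(wolfe_flags(f, grad, x_k, p_k, alpha=0.5))   # step to the minimum: all three hold
```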

Zoutendijk's Theorem

Theorem

Zoutendijk Condition

Statement

Suppose $f$ is bounded below with Lipschitz continuous gradient, each $p_k$ is a descent direction, and every step satisfies the Wolfe conditions. Let $\theta_k$ be the angle between $p_k$ and the negative gradient. Then

$$\sum_{k=0}^{\infty}\cos^2(\theta_k)\,\|\nabla f(x_k)\|^2 < \infty.$$

If the directions stay gradient-related, meaning $\cos(\theta_k)$ is bounded away from zero, then $\|\nabla f(x_k)\| \to 0$.

Intuition

Wolfe line search plus descent directions force enough cumulative progress. If the search directions do not become nearly orthogonal to the gradient, the gradients must vanish.

Why It Matters

This is why Wolfe conditions are standard for nonlinear conjugate gradient and quasi-Newton methods: they support convergence for directions more complex than $-\nabla f$.

Practical Defaults

  • Gradient descent: Armijo backtracking (cheap and sufficient).
  • Newton: Armijo or Wolfe (the full step often works near the optimum).
  • BFGS / L-BFGS: strong Wolfe (preserves curvature-update quality).
  • Nonlinear CG: strong Wolfe (controls conjugacy degradation).
  • Stochastic training: usually no line search (noisy minibatch objectives make exact accept/reject unreliable).

Common Confusions

Watch Out

Exact line search is not automatically better

Exact line search minimizes along the current ray, but that ray may be a poor direction. On ill-conditioned quadratics, steepest descent with exact line search can still zigzag.
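The zigzag can be reproduced on a quadratic, where the exact one-dimensional minimizer along $-g$ has the closed form $\alpha = (g^T g)/(g^T A g)$. A sketch with an illustrative worst-case start whose gradient components are balanced:

```python
import numpy as np

# Steepest descent with EXACT line search on an ill-conditioned quadratic
# f(x) = 0.5 x^T A x. From this balanced start the error contracts only by
# (kappa - 1)/(kappa + 1) = 99/101 per iteration: the classic zigzag.
A = np.diag([100.0, 1.0])
grad = lambda x: A @ x

x = np.array([0.01, 1.0])              # gradient components equal in magnitude
for _ in range(50):
    g = grad(x)
    alpha = (g @ g) / (g @ A @ g)      # exact minimizer along the ray
    x = x - alpha * g

# After 50 exact line searches the gradient norm has only roughly halved.
print(np.linalg.norm(grad(x)))
```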

Watch Out

Armijo is not a learning-rate schedule

Armijo chooses a step from function evaluations at the current point. A schedule chooses steps from iteration count or validation heuristics. They solve different problems.

Watch Out

A tiny accepted step can still be bad

Armijo permits very small accepted steps. Wolfe or strong Wolfe conditions are used when small steps would destroy curvature information or make nonlinear CG stall.

Watch Out

Line search assumes the direction is worth taking

Line search fixes step length. It does not repair a non-descent direction. If $\nabla f(x_k)^T p_k \geq 0$, first fix the direction.

Q&A For Mastery

Why is $\alpha = 1$ special for Newton? Near a minimizer, Newton's quadratic model becomes accurate, so the full Newton step often satisfies the line search and gives fast local convergence.

Why do deep nets rarely use classical line search? Minibatch losses are noisy and expensive to evaluate repeatedly. SGD and Adam instead use scheduled or adaptive learning rates. Full-batch or deterministic subproblems still use line search.

What should I log? Track accepted $\alpha$, number of backtracking reductions, objective decrease, and directional derivative. Many reductions usually mean the proposed direction or initial step is too aggressive.

What To Remember

  • Line search chooses $\alpha$ along a direction $p_k$.
  • Armijo guarantees enough decrease.
  • Wolfe prevents stopping too early along the ray.
  • Backtracking works because Armijo holds for small enough steps.
  • For quasi-Newton and nonlinear CG, strong Wolfe is usually the safe default.

Exercises

ExerciseCore

Problem

For $f(x) = x^4$, $x_0 = 2$, $p = -f'(x_0) = -32$, and $c_1 = 10^{-4}$, does $\alpha = 1$ satisfy the Armijo condition?

ExerciseAdvanced

Problem

Why does the Wolfe curvature condition prevent uselessly small steps?

ExerciseResearch

Problem

Explain why line search becomes delicate for stochastic minibatch objectives.

References

Canonical:

  • Nocedal and Wright, Numerical Optimization, 2nd ed., Chapter 3.
  • Bertsekas, Nonlinear Programming, 3rd ed., Sections 1.2 and 1.3.
  • Armijo, "Minimization of Functions Having Lipschitz Continuous First Partial Derivatives" (1966).
  • Wolfe, "Convergence Conditions for Ascent Methods" (1969).

Further:

  • Boyd and Vandenberghe, Convex Optimization, Section 9.2.
  • Fletcher, Practical Methods of Optimization, line-search chapters.

Last reviewed: April 25, 2026
