Numerical Optimization
Line Search Methods
Line search chooses a step size along a descent direction. Armijo guarantees enough decrease, Wolfe prevents uselessly tiny steps, and backtracking gives a practical step-size controller without knowing the Lipschitz constant.
Why This Matters
Gradient descent gives a direction. It does not give a safe distance.
For a current point $x_k$ and a descent direction $p_k$, the update is

$$x_{k+1} = x_k + \alpha_k p_k, \qquad \alpha_k > 0.$$

The line search problem is to choose $\alpha_k$. Too large: the objective rises or curvature invalidates the local model. Too small: the method technically descends but wastes iterations. Line search is the step-size controller behind gradient descent, Newton's method, quasi-Newton methods, nonlinear conjugate gradient, and many constrained solvers.
Ray View
Reduce the multivariable problem to a one-dimensional function along a ray:

$$\phi(\alpha) = f(x_k + \alpha p_k), \qquad \alpha \ge 0.$$

The starting slope is

$$\phi'(0) = \nabla f(x_k)^\top p_k.$$

A direction $p_k$ is a descent direction when $\nabla f(x_k)^\top p_k < 0$. Line search does not need to solve the original optimization problem; it only asks whether a candidate $\alpha$ is acceptable on this ray.
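A minimal sketch of the ray view in code, assuming the objective and gradient are available as Python callables; all names (`ray`, `f`, `grad_f`) are illustrative:

```python
import numpy as np

def ray(f, grad_f, x_k, p_k):
    """Restrict f to the ray x_k + alpha * p_k: return phi and the slope phi'(0)."""
    phi = lambda alpha: f(x_k + alpha * p_k)
    slope0 = grad_f(x_k) @ p_k          # phi'(0) = grad f(x_k)^T p_k
    return phi, slope0

# Example on f(x) = 0.5 * ||x||^2 with the steepest descent direction.
f = lambda x: 0.5 * float(x @ x)
grad_f = lambda x: x
x_k = np.array([1.0, 2.0])
phi, slope0 = ray(f, grad_f, x_k, -grad_f(x_k))
print(phi(0.0), phi(0.5), slope0)       # slope0 < 0, so this is a descent direction
```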
Armijo Sufficient Decrease
Armijo Condition
A step size $\alpha > 0$ satisfies the Armijo condition with parameter $c_1 \in (0, 1)$ when

$$f(x_k + \alpha p_k) \le f(x_k) + c_1 \alpha \nabla f(x_k)^\top p_k.$$

Because $\nabla f(x_k)^\top p_k < 0$, the right-hand side is below $f(x_k)$. The candidate step must decrease the objective by a fixed fraction of the first-order prediction.
Typical values use $c_1 = 10^{-4}$. The number is small because Armijo is not trying to demand the exact one-dimensional minimum; it is trying to reject steps that are obviously too aggressive.
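A minimal Armijo check under this definition; a sketch assuming the objective and gradient are supplied as Python callables (`f`, `grad_f`, and the parameter names are illustrative, not from any particular library):

```python
def armijo_ok(f, grad_f, x_k, p_k, alpha, c1=1e-4):
    """Sufficient decrease: f(x + a*p) <= f(x) + c1 * a * grad_f(x)^T p."""
    slope = grad_f(x_k) @ p_k          # phi'(0); negative for a descent direction
    return f(x_k + alpha * p_k) <= f(x_k) + c1 * alpha * slope
```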
Armijo Steps Exist For Descent Directions
Statement
If $\nabla f(x_k)^\top p_k < 0$, then every sufficiently small positive $\alpha$ satisfies the Armijo condition.
Intuition
For tiny $\alpha$, the function behaves like its first-order Taylor approximation. Since $p_k$ points downhill at $x_k$, a small enough step must decrease the objective.
Proof Sketch
Taylor expansion gives $f(x_k + \alpha p_k) = f(x_k) + \alpha \nabla f(x_k)^\top p_k + o(\alpha)$. Subtract the Armijo right-hand side. The leading term is $(1 - c_1)\,\alpha\,\nabla f(x_k)^\top p_k$, which is negative because $c_1 < 1$ and $p_k$ is a descent direction. For sufficiently small $\alpha$, the $o(\alpha)$ remainder cannot overturn that sign.
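Written as a single identity, assuming only differentiability of $f$ at $x_k$:

$$f(x_k + \alpha p_k) - \bigl[f(x_k) + c_1 \alpha \nabla f(x_k)^\top p_k\bigr] = (1 - c_1)\,\alpha\,\nabla f(x_k)^\top p_k + o(\alpha),$$

and the right-hand side is negative for all sufficiently small $\alpha > 0$ because $c_1 < 1$ and $\nabla f(x_k)^\top p_k < 0$.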
Backtracking
Backtracking line search is the practical Armijo algorithm:
- Start with an initial trial step, often $\alpha = 1$ for Newton-like methods.
- If Armijo fails, set $\alpha \leftarrow \rho \alpha$ with a contraction factor $\rho \in (0, 1)$, commonly $\rho = \tfrac{1}{2}$.
- Repeat until Armijo holds.
The algorithm terminates because of the Armijo existence lemma.
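A minimal sketch of the loop above; the defaults ($\rho = 0.5$, $c_1 = 10^{-4}$, a cap on reductions) are illustrative choices, not requirements:

```python
def backtracking(f, grad_f, x_k, p_k, alpha0=1.0, rho=0.5, c1=1e-4, max_reductions=50):
    """Shrink alpha geometrically until the Armijo condition holds."""
    fx = f(x_k)
    slope = grad_f(x_k) @ p_k
    assert slope < 0, "p_k must be a descent direction"
    alpha = alpha0
    for _ in range(max_reductions):
        if f(x_k + alpha * p_k) <= fx + c1 * alpha * slope:
            return alpha                # accept the first step that passes Armijo
        alpha *= rho                    # backtrack
    return alpha                        # safeguard cap; signals a suspect direction or scale
```

Accepting the first passing step, rather than searching for the best one, keeps the accepted $\alpha$ bounded below, which the convergence argument in the next subsection uses.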
Gradient Descent With Backtracking Reaches Stationary Points
Statement
For gradient descent with Armijo backtracking, the gradients satisfy

$$\min_{0 \le k \le K} \|\nabla f(x_k)\|^2 \le \frac{f(x_0) - f^{\star}}{c\,(K + 1)},$$

where $f^{\star}$ is a lower bound on $f$ and $c > 0$ is a constant depending on the line-search parameters and the smoothness constant. In particular, some iterate has a small gradient as $K$ grows, and under standard assumptions the gradient norm converges to zero.
Intuition
Backtracking prevents steps from being too large, while stopping at the first acceptable step prevents systematic microscopic steps. Each accepted step pays for a decrease proportional to the squared gradient.
Proof Sketch
Smoothness gives a quadratic upper bound along the ray: if $\nabla f$ is $L$-Lipschitz, then $f(x_k + \alpha p_k) \le f(x_k) + \alpha \nabla f(x_k)^\top p_k + \tfrac{L}{2} \alpha^2 \|p_k\|^2$. This bound implies Armijo holds once $\alpha$ is below a constant multiple of $1/L$. Therefore accepted steps have a uniform lower bound unless the initial step is smaller. Armijo then gives a telescoping sum of decreases proportional to $\|\nabla f(x_k)\|^2$.
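A small self-contained experiment consistent with the statement: steepest descent with Armijo backtracking on an ill-conditioned quadratic (all names and constants are illustrative):

```python
import numpy as np

# f(x) = 0.5 * x^T A x with A = diag(1, 100): smooth, convex, ill-conditioned.
A = np.diag([1.0, 100.0])
f = lambda x: 0.5 * float(x @ A @ x)
grad_f = lambda x: A @ x

def backtrack(x, p, alpha=1.0, rho=0.5, c1=1e-4):
    slope = grad_f(x) @ p
    while f(x + alpha * p) > f(x) + c1 * alpha * slope:
        alpha *= rho
    return alpha

x = np.array([1.0, 1.0])
for _ in range(200):
    g = grad_f(x)
    x = x - backtrack(x, -g) * g        # steepest descent step with Armijo backtracking
print(np.linalg.norm(grad_f(x)))        # the gradient norm is driven toward zero
```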
Failure Mode
This proves first-order stationarity, not global optimality. Non-convex objectives can still converge to saddles or poor local minima.
Wolfe Conditions
Armijo alone rejects steps that are too large, but it can accept steps that are too small. Wolfe adds a curvature condition:
Wolfe Conditions
A step $\alpha$ satisfies the Wolfe conditions when

$$f(x_k + \alpha p_k) \le f(x_k) + c_1 \alpha \nabla f(x_k)^\top p_k$$

and

$$\nabla f(x_k + \alpha p_k)^\top p_k \ge c_2 \nabla f(x_k)^\top p_k,$$

where $0 < c_1 < c_2 < 1$.
The curvature condition says the directional derivative at the new point should be less negative. In plain language: do not stop while the slope is still strongly downhill.
The strong Wolfe condition replaces the second inequality with

$$\bigl|\nabla f(x_k + \alpha p_k)^\top p_k\bigr| \le c_2 \bigl|\nabla f(x_k)^\top p_k\bigr|.$$
Strong Wolfe also avoids stepping so far that the slope becomes too positive.
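A direct check of both variants; a sketch with illustrative defaults ($c_1 = 10^{-4}$, $c_2 = 0.9$, a common choice for Newton and quasi-Newton directions):

```python
def wolfe_ok(f, grad_f, x_k, p_k, alpha, c1=1e-4, c2=0.9, strong=False):
    """Armijo plus the (strong) curvature condition, with 0 < c1 < c2 < 1."""
    slope0 = grad_f(x_k) @ p_k                      # phi'(0)
    armijo = f(x_k + alpha * p_k) <= f(x_k) + c1 * alpha * slope0
    slope_new = grad_f(x_k + alpha * p_k) @ p_k     # phi'(alpha)
    if strong:
        curvature = abs(slope_new) <= c2 * abs(slope0)
    else:
        curvature = slope_new >= c2 * slope0
    return armijo and curvature
```

For nonlinear conjugate gradient a much smaller $c_2$ (around $0.1$) is the usual choice, since the method relies on a more accurate line search.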
Zoutendijk's Theorem
Zoutendijk Condition
Statement
Assume $f$ is bounded below with a Lipschitz continuous gradient, each $p_k$ is a descent direction, and each step satisfies the Wolfe conditions. Let $\theta_k$ be the angle between $p_k$ and the negative gradient $-\nabla f(x_k)$. Then

$$\sum_{k \ge 0} \cos^2\theta_k \, \|\nabla f(x_k)\|^2 < \infty.$$

If the directions stay gradient-related, meaning $\cos\theta_k$ is bounded away from zero, then $\|\nabla f(x_k)\| \to 0$.
Intuition
Wolfe line search plus descent directions force enough cumulative progress. If the search directions do not become nearly orthogonal to the gradient, the gradients must vanish.
Why It Matters
This is why Wolfe conditions are standard for nonlinear conjugate gradient and quasi-Newton methods: they support convergence for directions more complex than $-\nabla f(x_k)$.
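A small diagnostic consistent with the theorem: monitor $\cos\theta_k$ to confirm the directions stay gradient-related (function name illustrative):

```python
import numpy as np

def cos_theta(grad, p):
    """cos(theta_k) = -grad^T p / (||grad|| ||p||); values near zero mean the
    search direction is nearly orthogonal to steepest descent."""
    return float(-(grad @ p) / (np.linalg.norm(grad) * np.linalg.norm(p)))
```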
Practical Defaults
| Method | Typical line search | Reason |
|---|---|---|
| gradient descent | Armijo backtracking | cheap and sufficient |
| Newton | Armijo or Wolfe | full step often works near optimum |
| BFGS / L-BFGS | strong Wolfe | preserves curvature update quality |
| nonlinear CG | strong Wolfe | controls conjugacy degradation |
| stochastic training | usually no line search | noisy minibatch objectives make exact accept/reject unreliable |
Common Confusions
Exact line search is not automatically better
Exact line search minimizes along the current ray, but that ray may be a poor direction. On ill-conditioned quadratics, steepest descent with exact line search can still zigzag.
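A sketch of that zigzag: on a quadratic the exact one-dimensional minimizer is available in closed form, so the effect is easy to reproduce (constants illustrative):

```python
import numpy as np

A = np.diag([1.0, 100.0])               # condition number 100
x = np.array([1.0, 0.1])
for k in range(10):
    g = A @ x
    alpha = (g @ g) / (g @ A @ g)       # exact line search for f(x) = 0.5 * x^T A x
    x = x - alpha * g
    print(k, x)                          # the second coordinate flips sign each step: zigzag
```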
Armijo is not a learning-rate schedule
Armijo chooses a step from function evaluations at the current point. A schedule chooses steps from iteration count or validation heuristics. They solve different problems.
A tiny accepted step can still be bad
Armijo permits very small accepted steps. Wolfe or strong Wolfe conditions are used when small steps would destroy curvature information or make nonlinear CG stall.
Line search assumes the direction is worth taking
Line search fixes step length. It does not repair a non-descent direction. If $\nabla f(x_k)^\top p_k \ge 0$, first fix the direction.
Q&A For Mastery
Why is $\alpha = 1$ special for Newton? Near a minimizer, Newton's quadratic model becomes accurate, so the full Newton step $\alpha = 1$ often satisfies the line search and gives fast local convergence.
Why do deep nets rarely use classical line search? Minibatch losses are noisy and expensive to evaluate repeatedly. SGD and Adam instead use scheduled or adaptive learning rates. Full-batch or deterministic subproblems still use line search.
What should I log? Track the accepted $\alpha$, the number of backtracking reductions, the objective decrease, and the directional derivative $\nabla f(x_k)^\top p_k$. Many reductions usually mean the proposed direction or initial step is too aggressive.
What To Remember
- Line search chooses a step size $\alpha_k$ along a direction $p_k$.
- Armijo guarantees enough decrease.
- Wolfe prevents stopping too early along the ray.
- Backtracking works because Armijo holds for small enough steps.
- For quasi-Newton and nonlinear CG, strong Wolfe is usually the safe default.
Exercises
Problem
Pick a quadratic $f$, a point $x_k$, a descent direction $p_k$, and a trial step $\alpha$. Does $\alpha$ satisfy Armijo with $c_1 = 10^{-4}$? Check by comparing $f(x_k + \alpha p_k)$ against $f(x_k) + c_1 \alpha \nabla f(x_k)^\top p_k$.
Problem
Why does the Wolfe curvature condition prevent uselessly small steps?
Problem
Explain why line search becomes delicate for stochastic minibatch objectives.
References
Canonical:
- Nocedal and Wright, Numerical Optimization, 2nd ed., Chapter 3.
- Bertsekas, Nonlinear Programming, 3rd ed., Sections 1.2 and 1.3.
- Armijo, "Minimization of Functions Having Lipschitz Continuous First Partial Derivatives" (1966).
- Wolfe, "Convergence Conditions for Ascent Methods" (1969).
Further:
- Boyd and Vandenberghe, Convex Optimization, Section 9.2.
- Fletcher, Practical Methods of Optimization, line-search chapters.
Next Topics
- Conjugate gradient methods: line search plus conjugate directions for quadratics and nonlinear objectives.
- Quasi-Newton methods: why strong Wolfe protects BFGS curvature updates.
- Trust region methods: the alternative strategy that controls a step radius instead of searching along a fixed direction.
Last reviewed: April 25, 2026