Numerical Optimization
Quasi-Newton Methods
Approximate the Hessian instead of computing it: BFGS builds a dense approximation, L-BFGS stores only a few vectors. Full BFGS has classical local superlinear convergence; L-BFGS is excellent in practice but its fixed-memory theory is weaker and not generally superlinear.
Prerequisites
Newton's method, the secant method, and line search methods.
Why This Matters
Newton's method converges quadratically but requires computing and inverting the Hessian at each step: $O(n^2)$ storage and $O(n^3)$ computation per iteration. For problems with $n$ in the tens of thousands or more, this is prohibitive.
Quasi-Newton methods keep the speed advantage of curvature information while avoiding the Hessian entirely. They build an approximation to the Hessian (or its inverse) using only gradient information, updated cheaply at each step. The result: superlinear convergence at the cost of first-order methods.
L-BFGS in particular is the default optimizer for medium-scale smooth optimization problems across scientific computing, and it remains competitive for many ML tasks where full-batch gradients are available.
Mental Model
At each step, you have a matrix $B_k$ (a Hessian approximation) or $H_k \approx B_k^{-1}$ (an inverse Hessian approximation). You use it to compute a Newton-like step: $p_k = -H_k \nabla f(x_k)$, or equivalently the solution of $B_k p_k = -\nabla f(x_k)$.
After taking the step, you observe how the gradient changed. You use that observation to update $B_k$ (or $H_k$), making it a better Hessian approximation. The crucial constraint: the updated matrix must satisfy the secant condition, which ensures it is consistent with the curvature you just observed.
Two conventions coexist in the literature:
- Direct form: maintain $B_k$ and solve the linear system $B_k p_k = -\nabla f(x_k)$ each iteration. Cost per step: $O(n^3)$ for the solve via Cholesky, plus $O(n^2)$ for the rank-2 update.
- Inverse form: maintain $H_k$ and compute $p_k = -H_k \nabla f(x_k)$ by a single matrix-vector multiply. Cost per step: $O(n^2)$ total. No linear system solve needed.
The inverse form is more common in practice because it avoids the Cholesky factorization. L-BFGS operates entirely in the inverse form, computing $H_k \nabla f(x_k)$ implicitly.
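To make the distinction concrete, here is a minimal sketch of the two ways to compute the step from a gradient g (the helper names are our own, not from any library):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def step_direct(B, g):
    # Direct form: solve B p = -g via a Cholesky factorization (O(n^3) to factor).
    c, low = cho_factor(B)
    return cho_solve((c, low), -g)

def step_inverse(H, g):
    # Inverse form: a single matrix-vector product (O(n^2)).
    return -H @ g
```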
The Secant Condition
Secant Condition
Define the step and gradient difference:
$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k).$$
The secant condition (or quasi-Newton equation) requires the Hessian approximation to satisfy:
$$B_{k+1} s_k = y_k,$$
or equivalently, for the inverse approximation:
$$H_{k+1} y_k = s_k.$$
The secant condition says: the approximate Hessian, when applied to the step direction, must reproduce the observed gradient change. This is a first-order consistency requirement derived from Taylor expansion. It ensures the approximation is exact along the direction you just moved.
Note that $s_k \in \mathbb{R}^n$ and $y_k \in \mathbb{R}^n$, so the secant condition imposes $n$ scalar constraints on $B_{k+1}$, which has $n(n+1)/2$ free parameters (since it is symmetric). For $n > 1$, the system is underdetermined: many symmetric matrices satisfy $B s_k = y_k$. The various quasi-Newton methods (BFGS, DFP, SR1, Broyden family) correspond to different ways of resolving this freedom.
Derivation of the Secant Condition from Taylor Expansion
Statement
If $\nabla^2 f$ is approximately constant along the segment from $x_k$ to $x_{k+1}$, then $y_k \approx \nabla^2 f(x_{k+1})\, s_k$. The secant condition enforces this relationship exactly for the approximation $B_{k+1}$.
Intuition
By the mean value theorem (in integral form) for vector-valued functions:
$$y_k = \nabla f(x_{k+1}) - \nabla f(x_k) = \left( \int_0^1 \nabla^2 f(x_k + t\, s_k)\, dt \right) s_k.$$
If the Hessian is roughly constant, the integral is approximately $\nabla^2 f(x_{k+1})$, giving $y_k \approx \nabla^2 f(x_{k+1})\, s_k$. The secant condition forces the approximation to match this.
Proof Sketch
Taylor expand $\nabla f$ around $x_{k+1}$:
$$\nabla f(x_k) = \nabla f(x_{k+1}) - \nabla^2 f(x_{k+1})\, s_k + O(\|s_k\|^2).$$
Rearranging: $y_k = \nabla^2 f(x_{k+1})\, s_k + O(\|s_k\|^2)$.
When $f$ is quadratic, the $O(\|s_k\|^2)$ term vanishes: the gradient of a quadratic is affine, so $y_k = \nabla^2 f\, s_k$ exactly. For general smooth $f$, the approximation improves as the step size shrinks near convergence.
Replacing the true Hessian with the approximation $B_{k+1}$ and dropping the remainder gives $B_{k+1} s_k = y_k$.
Why It Matters
The secant condition is the minimal consistency requirement for a Hessian approximation. It constrains $B_{k+1}$ along one direction ($s_k$) but leaves freedom in all other directions. Different quasi-Newton methods (BFGS, DFP, SR1) differ in how they use this freedom.
Failure Mode
The secant condition requires $s_k^\top y_k > 0$ (the curvature condition) for the update to produce a positive definite $B_{k+1}$. This is guaranteed if $f$ is strongly convex but can fail for non-convex functions. When it fails, the BFGS update must be skipped or a safeguard applied (e.g., Powell damping).
The Curvature Condition in Detail
The requirement $s_k^\top y_k > 0$ deserves further attention. Expanding:
$$s_k^\top y_k = (x_{k+1} - x_k)^\top \left( \nabla f(x_{k+1}) - \nabla f(x_k) \right).$$
For a strongly convex function with parameter $\mu > 0$, the bound $s_k^\top y_k \ge \mu \|s_k\|^2 > 0$ holds, so the curvature condition is automatic.
For general smooth functions, the Wolfe conditions on the line search guarantee $s_k^\top y_k > 0$. Specifically, the sufficient decrease (Armijo) condition and the curvature condition together imply:
$$y_k^\top s_k \ge (c_2 - 1)\, \alpha_k\, \nabla f(x_k)^\top p_k > 0,$$
where $0 < c_1 < c_2 < 1$ are the Wolfe parameters (typical values: $c_1 = 10^{-4}$, $c_2 = 0.9$). This is why the Wolfe line search is not optional for BFGS; it is a structural requirement.
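A small sketch of the two Wolfe tests, assuming f and grad are callables for the objective and its gradient (the function name and defaults here are illustrative, not from a specific library):

```python
import numpy as np

def wolfe_conditions(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Return (armijo_ok, curvature_ok) for a step of length alpha along direction p."""
    g0 = grad(x)
    slope0 = g0 @ p                      # directional derivative at x; negative for a descent direction
    x_new = x + alpha * p
    armijo_ok = f(x_new) <= f(x) + c1 * alpha * slope0
    curvature_ok = grad(x_new) @ p >= c2 * slope0
    return armijo_ok, curvature_ok
```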
The BFGS Update
The BFGS (Broyden-Fletcher-Goldfarb-Shanno) update is the most successful quasi-Newton method. It was discovered independently by four researchers in 1970. The update modifies the inverse Hessian approximation to satisfy the secant condition while remaining positive definite, using a rank-2 correction:
BFGS Inverse Hessian Update
$$H_{k+1} = \left( I - \rho_k s_k y_k^\top \right) H_k \left( I - \rho_k y_k s_k^\top \right) + \rho_k s_k s_k^\top,$$
where $\rho_k = \dfrac{1}{y_k^\top s_k}$.
Equivalently, the direct Hessian approximation update is:
$$B_{k+1} = B_k - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k} + \frac{y_k y_k^\top}{y_k^\top s_k}.$$
Key properties of the BFGS update:
- Rank-2: $H_{k+1} - H_k$ has rank at most 2, so the update costs $O(n^2)$
- Positive definiteness preserved: if $H_k \succ 0$ and $y_k^\top s_k > 0$, then $H_{k+1} \succ 0$
- Secant condition satisfied: $H_{k+1} y_k = s_k$ by construction
- Minimal change: BFGS is the unique update that satisfies the secant condition, preserves symmetry and positive definiteness, and is closest to $H_k$ in a weighted Frobenius norm
Verifying the Secant Condition
A direct check that $H_{k+1} y_k = s_k$: apply the update formula to $y_k$.
The inner product $s_k^\top y_k = 1/\rho_k$, so the second factor gives $\left( I - \rho_k y_k s_k^\top \right) y_k = y_k - \rho_k y_k (s_k^\top y_k) = 0$. The first term vanishes, and the last term gives $\rho_k s_k (s_k^\top y_k) = s_k$.
Verifying Positive Definiteness Preservation
To see why $H_{k+1} \succ 0$ when $H_k \succ 0$ and $y_k^\top s_k > 0$, write $V_k = I - \rho_k y_k s_k^\top$. Then $H_{k+1} = V_k^\top H_k V_k + \rho_k s_k s_k^\top$. For any $z \neq 0$:
$$z^\top H_{k+1} z = (V_k z)^\top H_k (V_k z) + \rho_k (s_k^\top z)^2 \ge 0.$$
Equality requires both $(V_k z)^\top H_k (V_k z) = 0$ and $s_k^\top z = 0$. If $s_k^\top z = 0$ then $V_k z = z - \rho_k y_k (s_k^\top z) = z$, so the first condition means $z^\top H_k z = 0$. Since $H_k \succ 0$, this forces $z = 0$, a contradiction. So $H_{k+1} \succ 0$.
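Both properties are easy to check numerically. A sketch (our own variable names; the quadratic model below is an assumption chosen so that the curvature condition holds):

```python
import numpy as np

def bfgs_update_inverse(H, s, y):
    rho = 1.0 / (y @ s)                          # requires the curvature condition y.s > 0
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)    # (I - rho s y') H (I - rho y s') + rho s s'

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)   # SPD "true Hessian"
s = rng.standard_normal(n)
y = A @ s                                        # on a quadratic, y = A s, so y.s > 0
H_new = bfgs_update_inverse(np.eye(n), s, y)
print(np.allclose(H_new @ y, s))                 # secant condition: True
print(np.all(np.linalg.eigvalsh(H_new) > 0))     # positive definiteness preserved: True
```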
The Minimality Property
BFGS solves the following optimization problem:
$$H_{k+1} = \arg\min_{H} \; \|H - H_k\|_W \quad \text{subject to} \quad H = H^\top, \; H y_k = s_k,$$
where $\|A\|_W = \|W^{1/2} A\, W^{1/2}\|_F$ and $W$ is any symmetric positive definite matrix satisfying $W s_k = y_k$; the standard choice in the analysis is the average Hessian $\bar G_k = \int_0^1 \nabla^2 f(x_k + \tau s_k)\, d\tau$. The weight never needs to be formed in practice. The weighted Frobenius norm measures distance scale-invariantly: it penalizes changes in well-conditioned directions as much as in ill-conditioned ones.
This variational characterization explains BFGS's success. It preserves as much of the old curvature information as possible while incorporating the new observation, in a scale-adapted sense.
DFP: The Other Rank-2 Update
The DFP (Davidon-Fletcher-Powell) method predates BFGS by a decade. It solves the dual minimality problem:
$$B_{k+1} = \arg\min_{B} \; \|B - B_k\|_{W'} \quad \text{subject to} \quad B = B^\top, \; B s_k = y_k,$$
with the analogous weighted norm (weight $W'$ satisfying $W' y_k = s_k$). The DFP update for the direct Hessian approximation is:
$$B_{k+1} = \left( I - \rho_k y_k s_k^\top \right) B_k \left( I - \rho_k s_k y_k^\top \right) + \rho_k y_k y_k^\top.$$
DFP and BFGS are duals of each other: the BFGS update on has the same algebraic form as the DFP update on , and vice versa. In practice, BFGS outperforms DFP because DFP's inverse Hessian approximation tends to become increasingly ill-conditioned, amplifying errors in the line search. Nocedal and Wright (2006, Section 6.2) give empirical comparisons showing BFGS is more forgiving of inexact line searches.
SR1: The Rank-1 Alternative
The symmetric rank-1 (SR1) update takes the simplest possible form:
$$B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^\top}{(y_k - B_k s_k)^\top s_k}.$$
SR1 has one advantage over BFGS: it does not require $s_k^\top y_k > 0$. The approximation can capture negative curvature, making SR1 useful within trust-region methods for non-convex problems. The drawback: SR1 does not preserve positive definiteness, and the denominator can be zero or near-zero, requiring a skip condition.
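A sketch of SR1 with a skip rule (the relative threshold r follows the common heuristic in Nocedal & Wright; the function name and default value are our own):

```python
import numpy as np

def sr1_update(B, s, y, r=1e-8):
    v = y - B @ s
    denom = v @ s
    if abs(denom) < r * np.linalg.norm(s) * np.linalg.norm(v):
        return B                                  # skip: denominator too small, update unstable
    return B + np.outer(v, v) / denom
```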
The Full BFGS Algorithm
- Initialize $H_0 = I$ (or a scaled identity) and $k = 0$.
- Compute search direction: $p_k = -H_k \nabla f(x_k)$.
- Line search: find $\alpha_k$ satisfying the Wolfe conditions.
- Update: $x_{k+1} = x_k + \alpha_k p_k$.
- Compute $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$.
- Check: if $y_k^\top s_k \le \varepsilon$ (some small threshold), skip the BFGS update.
- Update $H_{k+1}$ using the BFGS formula.
- Go to step 2.
The Wolfe conditions in step 3 enforce two requirements. First, the Armijo (sufficient decrease) condition: $f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^\top p_k$. Second, the curvature condition: $\nabla f(x_k + \alpha_k p_k)^\top p_k \ge c_2 \nabla f(x_k)^\top p_k$. Together, they guarantee $y_k^\top s_k > 0$, which preserves positive definiteness.
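The steps above translate directly into code. A minimal sketch using scipy.optimize.line_search for the Wolfe step; the tolerances and the curvature-check threshold are our own choices:

```python
import numpy as np
from scipy.optimize import line_search

def bfgs(f, grad, x0, tol=1e-8, max_iter=200):
    n = len(x0)
    x, H = np.asarray(x0, dtype=float), np.eye(n)
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                                     # step 2: quasi-Newton direction
        alpha = line_search(f, grad, x, p, gfk=g)[0]   # step 3: Wolfe line search
        if alpha is None:                              # line search failed
            break
        x_new = x + alpha * p                          # step 4
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g                    # step 5
        if y @ s > 1e-10 * np.linalg.norm(s) * np.linalg.norm(y):   # step 6: curvature check
            rho = 1.0 / (y @ s)
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)     # step 7: BFGS inverse update
        x, g = x_new, g_new
    return x

# Example: the Rosenbrock function; the minimizer is (1, 1).
rosen = lambda z: (1 - z[0])**2 + 100 * (z[1] - z[0]**2)**2
rosen_grad = lambda z: np.array([-2 * (1 - z[0]) - 400 * z[0] * (z[1] - z[0]**2),
                                 200 * (z[1] - z[0]**2)])
print(bfgs(rosen, rosen_grad, np.array([-1.2, 1.0])))
```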
Initialization Strategy
The choice of $H_0$ affects early convergence. Three common strategies:
- Identity: $H_0 = I$. The first step is steepest descent. Simple but can be far from the true scale.
- Scaled identity: $H_0 = \gamma I$ where $\gamma = \dfrac{s_0^\top y_0}{y_0^\top y_0}$. After the first step, rescale to match observed curvature. This is the standard choice for L-BFGS.
- Diagonal: $H_0$ built from a rough estimate of the diagonal Hessian.
In L-BFGS, the scaling $\gamma_k = \dfrac{s_{k-1}^\top y_{k-1}}{y_{k-1}^\top y_{k-1}}$ is recomputed at every iteration, adapting the base matrix to the local curvature. This small detail has a large practical effect: without it, the first loop of the two-loop recursion starts from a poorly scaled direction.
Superlinear Convergence
BFGS Superlinear Convergence
Statement
Suppose $f$ is twice continuously differentiable, $\nabla^2 f$ is Lipschitz continuous near a minimizer $x^*$, and $\nabla^2 f(x^*) \succ 0$. Then BFGS with a Wolfe line search converges superlinearly to $x^*$:
$$\lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = 0.$$
That is, the convergence rate eventually beats any fixed linear rate, though it does not achieve the quadratic rate of exact Newton.
Intuition
As the iterates approach $x^*$, the BFGS approximation $B_k$ converges to $\nabla^2 f(x^*)$ along the directions that matter: the directions the algorithm actually moves. The approximation becomes increasingly accurate where it counts, so the steps approach true Newton steps, and the convergence rate transitions from linear to superlinear.
Proof Sketch
The proof relies on the Dennis-More characterization (1974). Superlinear convergence of a quasi-Newton method holds if and only if:
$$\lim_{k \to \infty} \frac{\left\| \left( B_k - \nabla^2 f(x^*) \right) p_k \right\|}{\|p_k\|} = 0.$$
This condition says that $B_k$ need not converge to the true Hessian in every direction; only along the search directions $p_k$. Call this directional convergence.
The proof that BFGS achieves directional convergence proceeds through a trace-determinant analysis. Define the rescaled matrix $\tilde B_k = \nabla^2 f(x^*)^{-1/2}\, B_k\, \nabla^2 f(x^*)^{-1/2}$, which measures the error of the approximation ($\tilde B_k = I$ means exact). The key identities for the BFGS update are:
$$\operatorname{tr}(B_{k+1}) = \operatorname{tr}(B_k) - \frac{\|B_k s_k\|^2}{s_k^\top B_k s_k} + \frac{\|y_k\|^2}{y_k^\top s_k}, \qquad \det(B_{k+1}) = \det(B_k)\, \frac{y_k^\top s_k}{s_k^\top B_k s_k},$$
applied to the rescaled quantities, together with $\cos \theta_k = \dfrac{s_k^\top B_k s_k}{\|s_k\|\, \|B_k s_k\|}$, where $\theta_k$ is the angle between the step $s_k$ and $B_k s_k$ (the steepest-descent direction). The trace grows slowly while the determinant is bounded below, which forces the eigenvalues of $\tilde B_k$ to cluster near 1 along the directions explored by the algorithm. Summing the trace bound over all iterations shows that the correction terms are summable, which gives the Dennis-More condition.
See Nocedal and Wright (2006), Theorem 6.6, for the full argument.
Why It Matters
Superlinear convergence means BFGS is much faster than gradient descent (which is only linearly convergent) and approaches the speed of Newton without computing second derivatives. For smooth, strongly convex problems, BFGS is the practical optimum: each iteration costs $O(n^2)$ versus $O(n^3)$ for Newton, and the number of iterations is comparable.
Failure Mode
The superlinear convergence requires strong convexity and sufficiently accurate line searches. On non-convex problems, BFGS may converge to saddle points, and the Hessian approximation can become increasingly inaccurate. The curvature condition may fail, requiring skipped updates or damping. Even on convex problems, if the line search is too crude (e.g., fixed step size without Wolfe conditions), the approximation can degrade and convergence reverts to linear or worse.
Convergence Rate Comparison
| Method | Rate | Cost per iteration | Storage |
|---|---|---|---|
| Gradient descent | Linear: $1 - O(1/\kappa)$ dependence on the condition number $\kappa$ | $O(n)$ | $O(n)$ |
| Conjugate gradient | Superlinear (on quadratics, finite termination in $n$ steps) | $O(n)$ | $O(n)$ |
| BFGS | Superlinear | $O(n^2)$ | $O(n^2)$ |
| L-BFGS ($m$ pairs) | Superlinear (in practice; theory weaker) | $O(mn)$ | $O(mn)$ |
| Newton | Quadratic | $O(n^3)$ | $O(n^2)$ |
For gradient descent on a strongly convex function with condition number $\kappa$, the linear rate is $(\kappa - 1)/(\kappa + 1) \approx 1 - 2/\kappa$. When $\kappa = 10^3$, gradient descent needs roughly 3500 iterations to reduce the error by a factor of $10^3$. BFGS typically needs 50-100 iterations for the same reduction, independent of $\kappa$, because the Hessian approximation adapts to the curvature.
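A quick check of that arithmetic (assuming the rate and reduction factor quoted above):

```python
import numpy as np

kappa = 1e3
rate = (kappa - 1) / (kappa + 1)      # optimal-step linear rate on a quadratic
print(np.log(1e-3) / np.log(rate))    # ~3450 iterations to shrink the error by 10^3
```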
L-BFGS: Limited-Memory BFGS
BFGS stores an $n \times n$ matrix: $O(n^2)$ memory. For $n = 10^6$, this is 8 TB in double precision. L-BFGS avoids this by never forming $H_k$ explicitly.
L-BFGS
L-BFGS stores only the last $m$ pairs $(s_i, y_i)$ (typically $m = 5$ to $20$). The matrix-vector product $H_k \nabla f(x_k)$ is computed using a two-loop recursion that requires only $O(mn)$ storage and computation.
The Two-Loop Recursion
The two-loop recursion (Nocedal, 1980) computes $H_k \nabla f(x_k)$ without forming $H_k$:
Input: gradient $g = \nabla f(x_k)$, stored pairs $(s_i, y_i)$ for $i = k-m, \dots, k-1$ with $\rho_i = 1/(y_i^\top s_i)$, scaling $\gamma_k$.
Loop 1 (backward, $i = k-1$ down to $k-m$):
Set $q = g$. For each $i$: compute $\alpha_i = \rho_i\, s_i^\top q$; set $q \leftarrow q - \alpha_i y_i$.
Scale: $r = \gamma_k q$, where $\gamma_k = \dfrac{s_{k-1}^\top y_{k-1}}{y_{k-1}^\top y_{k-1}}$.
Loop 2 (forward, $i = k-m$ up to $k-1$):
For each $i$: compute $\beta_i = \rho_i\, y_i^\top r$; set $r \leftarrow r + (\alpha_i - \beta_i)\, s_i$.
Output: $r = H_k \nabla f(x_k)$; the search direction is $p_k = -r$.
Cost: roughly $4mn$ multiplications and $2m$ inner products per call. Total: $O(mn)$ time and $O(mn)$ storage, with $m \ll n$.
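A direct transcription of the recursion (a sketch; s_list and y_list hold the m most recent pairs in chronological order):

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """Return the search direction p = -H_k g via the two-loop recursion."""
    q = np.array(g, dtype=float)
    rho = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alpha = [0.0] * len(s_list)
    for i in reversed(range(len(s_list))):      # loop 1: newest to oldest
        alpha[i] = rho[i] * (s_list[i] @ q)
        q -= alpha[i] * y_list[i]
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    r = gamma * q                               # scale by gamma_k
    for i in range(len(s_list)):                # loop 2: oldest to newest
        beta = rho[i] * (y_list[i] @ r)
        r += (alpha[i] - beta) * s_list[i]
    return -r                                   # p_k = -H_k g
```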
Why It Works
The two-loop recursion computes the product of $m$ rank-2 updates applied to the scaled identity $H_k^0 = \gamma_k I$. Unrolling the BFGS recurrence, $H_k$ has the form:
$$H_k = \left( V_{k-1}^\top \cdots V_{k-m}^\top \right) \gamma_k I \left( V_{k-m} \cdots V_{k-1} \right) + \rho_{k-m} \left( V_{k-1}^\top \cdots V_{k-m+1}^\top \right) s_{k-m} s_{k-m}^\top \left( V_{k-m+1} \cdots V_{k-1} \right) + \cdots + \rho_{k-1} s_{k-1} s_{k-1}^\top,$$
where $V_i = I - \rho_i y_i s_i^\top$. The two-loop recursion evaluates $H_k \nabla f(x_k)$ by propagating the gradient through these factors without ever forming the matrix. Old pairs beyond the window of $m$ are discarded, so the effective approximation "forgets" curvature information from more than $m$ steps ago.
Choosing
The memory parameter $m$ controls the trade-off between approximation quality and cost:
- Small $m$ (around 3-5): fast, low memory, often sufficient for well-conditioned problems
- $m = 10$ to $20$: standard for most applications. Diminishing returns beyond this range
- $m$ equal to the iteration count (keep every pair): recovers full BFGS (but defeats the purpose)
Empirically, $m$ between 5 and 20 gives performance within 10-20% of full BFGS on most smooth problems. The sensitivity to $m$ decreases as the problem becomes better conditioned.
Why L-BFGS Is the Default
For medium-scale smooth optimization ($n$ in the thousands to low millions, full-batch gradients available), L-BFGS dominates because:
- $O(mn)$ memory vs. $O(n^2)$ for BFGS
- No learning rate schedule to tune (unlike SGD, which needs step size decay and momentum tuning)
- Strong empirical convergence on smooth strongly convex problems
- Built into every serious optimization library (scipy.optimize.minimize, PyTorch's torch.optim.LBFGS)
Caveat — superlinear convergence is a full-BFGS result. The Dennis-More / Broyden-Dennis-More-Powell theory establishes local superlinear convergence for full-memory BFGS under the assumptions in the theorem above. With fixed memory (i.e., true L-BFGS), the classical superlinear-convergence proof breaks down because $m$ rank-2 updates cannot capture all of $\nabla^2 f(x^*)$. Liu-Nocedal (1989, the original L-BFGS paper) only proves R-linear convergence for L-BFGS on strongly convex problems; later analyses (Yuan 1991; Byrd-Liu-Nocedal 1992) sharpen the constants but do not recover superlinear rates in general. In practice, L-BFGS with $m = 5$-$20$ converges almost as fast as full BFGS on most smooth problems, so the "L-BFGS = practical superlinear" intuition is empirically warranted — but it should not be quoted as a theorem.
L-BFGS is the standard for logistic regression, CRFs, maximum entropy models, and any smooth ML problem where you can compute full gradients. It is also the default solver (lbfgs) for scikit-learn's LogisticRegression with L2 regularization.
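A usage sketch with SciPy's L-BFGS-B solver on an L2-regularized logistic loss (the synthetic data and the regularization strength lam are our own choices; maxcor is the number of stored pairs):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
yb = np.sign(X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(200))
lam = 1e-2

def loss_and_grad(w):
    z = yb * (X @ w)
    loss = np.mean(np.logaddexp(0.0, -z)) + 0.5 * lam * (w @ w)
    grad = -X.T @ (yb / (1.0 + np.exp(z))) / len(yb) + lam * w
    return loss, grad

res = minimize(loss_and_grad, np.zeros(10), jac=True, method="L-BFGS-B",
               options={"maxcor": 10})
print(res.success, res.nit)
```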
When L-BFGS Loses
L-BFGS has clear failure modes:
- Stochastic gradients: L-BFGS builds its approximation from gradient differences $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$. If the gradient has noise from mini-batch sampling, $y_k$ is dominated by noise rather than curvature. Stochastic quasi-Newton methods exist (e.g., online L-BFGS, SQN) but are not yet standard.
- Non-smooth objectives: if $f$ has kinks (as with $\ell_1$ regularization), the gradient is discontinuous and the secant condition loses meaning. Use proximal methods instead.
- Very large $n$: for $n$ in the hundreds of millions or more, even $O(mn)$ storage and work per iteration can be expensive when $m \approx 10$-$20$. First-order methods (Adam, SGD with momentum) become the only option.
- Non-convex deep learning: the loss surface of neural networks has saddle points, plateaus, and sharp minima. BFGS/L-BFGS applied to full-batch deep learning objectives tends to converge to sharp minima that generalize poorly, and the curvature condition fails frequently.
Connection to Preconditioning
The inverse Hessian approximation $H_k$ acts as a preconditioner for the gradient. Instead of descending along $-\nabla f(x_k)$ (steepest descent in the Euclidean metric), you descend along $-H_k \nabla f(x_k)$, which is steepest descent in the metric defined by $H_k^{-1}$.
In the eigenspace of the Hessian, this rescaling approximately equalizes the step sizes across all directions: directions with high curvature (large eigenvalues) get smaller steps, and directions with low curvature (small eigenvalues) get larger steps. This is why quasi-Newton methods handle ill-conditioned problems far better than gradient descent: the condition number of the preconditioned system is close to 1 when $H_k \approx \nabla^2 f(x_k)^{-1}$.
Preconditioning on an ill-conditioned quadratic
Consider $f(x) = \tfrac{1}{2} x^\top A x$ with $A = \mathrm{diag}(1000, 1)$. The condition number is $\kappa = 10^3$.
Gradient descent with optimal step size converges at rate $(\kappa - 1)/(\kappa + 1) \approx 0.998$ per iteration. To reduce the error by a factor of $10^6$ requires about 6900 iterations.
BFGS, after accumulating curvature information from the first few steps, builds $H_k \approx A^{-1}$. The preconditioned gradient $-H_k \nabla f(x_k)$ points directly at the minimum. On this quadratic, BFGS terminates in exactly 2 steps (with exact line search).
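A numerical check of that claim, with A = diag(1000, 1) as above (the starting point is our own choice):

```python
import numpy as np

A = np.diag([1000.0, 1.0])
x = np.array([1.0, 1.0])
H = np.eye(2)
for k in range(2):
    g = A @ x
    p = -H @ g
    alpha = -(g @ p) / (p @ A @ p)        # exact line search on a quadratic
    s = alpha * p
    x_new = x + s
    y = A @ x_new - g                     # gradient difference: y = A s
    rho = 1.0 / (y @ s)
    V = np.eye(2) - rho * np.outer(s, y)
    H = V @ H @ V.T + rho * np.outer(s, s)
    x = x_new
print(np.linalg.norm(x))                  # ~0 (up to rounding): at the minimizer after 2 steps
```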
This is the same principle behind adaptive gradient methods in deep learning: Adam and AdaGrad use diagonal preconditioning (diagonal Hessian approximation), while L-BFGS uses a low-rank-plus-diagonal approximation. The optimizer theory page discusses this connection.
Damped BFGS and Practical Safeguards
On non-convex problems, the curvature condition can fail. Powell (1978) proposed a damped BFGS update that replaces $y_k$ by an interpolation between $y_k$ and $B_k s_k$:
$$\tilde y_k = \theta_k y_k + (1 - \theta_k)\, B_k s_k,$$
where $\theta_k = 1$ if $s_k^\top y_k \ge 0.2\, s_k^\top B_k s_k$, and otherwise:
$$\theta_k = \frac{0.8\, s_k^\top B_k s_k}{s_k^\top B_k s_k - s_k^\top y_k}.$$
The modified pair $(s_k, \tilde y_k)$ always satisfies $s_k^\top \tilde y_k \ge 0.2\, s_k^\top B_k s_k > 0$, so the update preserves positive definiteness. The cost: the approximation is biased toward the current $B_k$, slowing convergence slightly. Powell damping is used in most production BFGS implementations, including MATLAB's fminunc.
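A sketch of the damped pair computation following the formulas above (the function name is our own; delta = 0.2 is Powell's standard threshold; B is the current direct Hessian approximation):

```python
import numpy as np

def powell_damped_y(B, s, y, delta=0.2):
    """Return the modified y-vector to use in place of y in the BFGS update."""
    sBs = s @ B @ s
    sy = s @ y
    theta = 1.0 if sy >= delta * sBs else (1.0 - delta) * sBs / (sBs - sy)
    return theta * y + (1.0 - theta) * (B @ s)
```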
Common Confusions
BFGS approximates the Hessian, not the gradient
BFGS does not modify the gradient direction in an ad hoc way. It builds a principled approximation to the Hessian (or its inverse) that satisfies a consistency condition (the secant equation). The resulting search direction is a quasi-Newton direction: close to the true Newton direction when the approximation is accurate. Contrast with Adam, which preconditions with a diagonal running average of squared gradients; this is a diagonal Hessian approximation, not a full quasi-Newton update.
L-BFGS is not just BFGS with less memory
L-BFGS does not simply truncate the BFGS matrix. It implicitly represents $H_k$ as a product of $m$ rank-2 updates applied to a scaled identity, and computes the matrix-vector product $H_k \nabla f(x_k)$ without ever forming $H_k$. The two-loop recursion is the key algorithmic insight. The resulting approximation is different from full BFGS: it effectively forgets old curvature information beyond $m$ steps. In practice this limited memory is sufficient; empirically, convergence speed plateaus around $m \approx 20$.
Quasi-Newton is not the same as conjugate gradient
Both methods achieve superlinear convergence on quadratics and terminate in at most $n$ steps (in exact arithmetic). The difference: conjugate gradient uses $O(n)$ storage and generates search directions via a scalar recurrence, while BFGS maintains an $n \times n$ matrix and produces search directions via a matrix-vector product. On non-quadratic problems, BFGS adapts its curvature model more aggressively. Conjugate gradient requires exact line searches for its conjugacy property; BFGS is more tolerant of inexact line searches.
The Wolfe line search is not optional for BFGS
Some implementations default to a simple backtracking (Armijo) line search. For steepest descent, this suffices. For BFGS, it does not: the curvature condition $y_k^\top s_k > 0$ is only guaranteed when the line search also enforces the Wolfe curvature condition. Without it, the BFGS matrix can lose positive definiteness and the algorithm can diverge. The Wolfe line search adds cost (typically 2-5 extra gradient evaluations per iteration) but is required for correctness.
Summary
- Quasi-Newton methods approximate the Hessian using only gradient differences
- The secant condition is the fundamental constraint, derived from Taylor expansion of the gradient
- BFGS uses a rank-2 update preserving positive definiteness: $O(n^2)$ per step
- The BFGS update is the unique minimum-change update in a weighted Frobenius norm
- L-BFGS stores only $m$ vector pairs: $O(mn)$ per step with the two-loop recursion
- Superlinear convergence for strongly convex problems with Wolfe line search
- L-BFGS is the default for medium-scale smooth optimization; it fails on stochastic, non-smooth, or very high-dimensional problems
- The Hessian approximation acts as a preconditioner, equalizing curvature across the condition number spectrum
Exercises
Problem
Derive the secant condition from a first-order Taylor expansion of $\nabla f$ around $x_{k+1}$ evaluated at $x_k$. Specifically, show that $y_k \approx \nabla^2 f(x_{k+1})\, s_k$ under the assumption that the Hessian is approximately constant.
Problem
Verify directly that the BFGS inverse Hessian update satisfies the secant condition $H_{k+1} y_k = s_k$. That is, substitute $y_k$ into the BFGS formula and show the result is $s_k$.
Problem
Consider BFGS initialized with $H_0 = I$ on a strictly convex quadratic $f(x) = \tfrac{1}{2} x^\top A x - b^\top x$ with $A \succ 0$. Show that after $n$ steps with exact line searches, $H_n = A^{-1}$ (i.e., BFGS recovers the exact inverse Hessian in at most $n$ steps on a quadratic). This is called the finite termination property.
Problem
Show that the damped BFGS update preserves positive definiteness. That is, if $B_k \succ 0$ and $\tilde y_k = \theta_k y_k + (1 - \theta_k) B_k s_k$ with the Powell choice of $\theta_k$, then $B_{k+1} \succ 0$.
References
Canonical:
- Nocedal & Wright, Numerical Optimization (2006), Chapters 6-7. The standard reference; Chapter 6 covers the quasi-Newton framework, Chapter 7 covers L-BFGS.
- Dennis & Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations (1996), Chapters 8-9
- Dennis & More, "Characterization of Superlinear Convergence and Its Application to Quasi-Newton Methods" (1974), Mathematics of Computation 28(126)
Current:
- Nocedal, "Updating Quasi-Newton Matrices with Limited Storage" (1980), Mathematics of Computation 35(151). The L-BFGS paper.
- Liu & Nocedal, "On the Limited Memory BFGS Method for Large Scale Optimization" (1989), Mathematical Programming 45(1-3)
- Powell, "Algorithms for Nonlinear Constraints That Use Lagrangian Functions" (1978), Mathematical Programming 14(1). Introduces damped BFGS.
- Byrd, Lu, Nocedal & Zhu, "A Limited Memory Algorithm for Bound Constrained Optimization" (1995), SIAM Journal on Scientific Computing 16(5). The L-BFGS-B variant for box constraints.
- Berahas, Nocedal & Takac, "A Multi-Batch L-BFGS Method for Machine Learning" (2016), NeurIPS. Extends L-BFGS to stochastic settings.
Next Topics
The natural next steps from quasi-Newton methods:
- Proximal gradient methods: when the objective has a non-smooth regularizer (e.g., $\ell_1$), quasi-Newton cannot be applied directly; proximal operators handle the non-smooth part
- Trust-region methods: an alternative globalization strategy to line search that constrains the step length directly; often paired with SR1 updates
- Optimizer theory (SGD, Adam, Muon): how modern deep learning optimizers relate to the preconditioning ideas from quasi-Newton methods
Last reviewed: April 26, 2026