
Numerical Optimization

Quasi-Newton Methods

Approximate the Hessian instead of computing it: BFGS builds a dense approximation, L-BFGS stores only a few vectors. Full BFGS has classical local superlinear convergence; L-BFGS is excellent in practice but its fixed-memory theory is weaker and not generally superlinear.

Core · Tier 1 · Stable · Supporting · ~60 min

Why This Matters

Newton's method converges quadratically but requires computing and inverting the $d \times d$ Hessian at each step: $O(d^2)$ storage and $O(d^3)$ computation per iteration. For problems with $d > 10^4$, this is prohibitive.

Quasi-Newton methods keep the speed advantage of curvature information while avoiding the Hessian entirely. They build an approximation to the Hessian (or its inverse) using only gradient information, updated cheaply at each step. The result: superlinear convergence at the cost of first-order methods.

L-BFGS in particular is the default optimizer for medium-scale smooth optimization problems across scientific computing, and it remains competitive for many ML tasks where full-batch gradients are available.

Mental Model

At each step, you have a matrix $B_t \approx \nabla^2 f(x^{(t)})$ (Hessian approximation) or $H_t \approx [\nabla^2 f(x^{(t)})]^{-1}$ (inverse Hessian approximation). You use $H_t$ to compute a Newton-like step:

$$x^{(t+1)} = x^{(t)} - \alpha_t H_t \nabla f(x^{(t)})$$

After taking the step, you observe how the gradient changed. You use that observation to update $H_t \to H_{t+1}$, making it a better Hessian approximation. The crucial constraint: $H_{t+1}$ must satisfy the secant condition, which ensures it is consistent with the curvature you just observed.

Two conventions coexist in the literature:

  • Direct form: maintain $B_t \approx \nabla^2 f(x^{(t)})$ and solve a linear system $B_t d_t = -\nabla f(x^{(t)})$ each iteration. Cost per step: $O(d^2)$ for the solve (given a maintained Cholesky factor), plus $O(d^2)$ for the rank-2 update.
  • Inverse form: maintain $H_t \approx [\nabla^2 f(x^{(t)})]^{-1}$ and compute $d_t = -H_t \nabla f(x^{(t)})$ by a single matrix-vector multiply. Cost per step: $O(d^2)$ total. No linear system solve needed.

The inverse form is more common in practice because it avoids the Cholesky factorization. L-BFGS operates entirely in the inverse form, computing $H_t g$ implicitly.
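A minimal sketch of the two conventions, assuming SciPy and NumPy are available (the matrices here are random stand-ins for $B_t$ and $H_t$, not from the text):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
B = A @ A.T + d * np.eye(d)        # stand-in for B_t (symmetric positive definite)
H = np.linalg.inv(B)               # stand-in for H_t
g = rng.standard_normal(d)         # stand-in for the gradient

# Direct form: solve B d = -g using a Cholesky factor of B.
c, low = cho_factor(B)
d_direct = -cho_solve((c, low), g)

# Inverse form: a single matrix-vector product, O(d^2) per step.
d_inverse = -H @ g

print(np.allclose(d_direct, d_inverse))  # True: same quasi-Newton direction
```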

The Secant Condition

Definition

Secant Condition

Define the step and gradient difference:

$$s_t = x^{(t+1)} - x^{(t)}, \qquad y_t = \nabla f(x^{(t+1)}) - \nabla f(x^{(t)})$$

The secant condition (or quasi-Newton equation) requires the Hessian approximation to satisfy:

$$B_{t+1} s_t = y_t$$

or equivalently, for the inverse approximation:

$$H_{t+1} y_t = s_t$$

The secant condition says: the approximate Hessian, when applied to the step direction, must reproduce the observed gradient change. This is a first-order consistency requirement derived from Taylor expansion. It ensures the approximation is exact along the direction you just moved.

Note that $s_t \in \mathbb{R}^d$ and $y_t \in \mathbb{R}^d$, so the secant condition imposes $d$ scalar constraints on $B_{t+1}$, which has $d(d+1)/2$ free parameters (since it is symmetric). For $d > 1$, the system is underdetermined: many symmetric matrices satisfy $B_{t+1} s_t = y_t$. The various quasi-Newton methods (BFGS, DFP, SR1, Broyden family) correspond to different ways of resolving this freedom.
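A small numeric check of both points, on an assumption of my own choosing (a random quadratic $f(x) = \tfrac{1}{2} x^\top A x$, where the gradient is $Ax$):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)          # symmetric positive definite Hessian

x0, x1 = rng.standard_normal(d), rng.standard_normal(d)
s = x1 - x0
y = A @ x1 - A @ x0                  # gradient difference for this quadratic
print(np.allclose(A @ s, y))         # True: the true Hessian satisfies B s = y exactly

# Underdetermined: adding a symmetric rank-1 term z z^T with z orthogonal to s
# gives a different symmetric matrix that also satisfies the secant condition.
z = rng.standard_normal(d)
z -= (z @ s) / (s @ s) * s
B_alt = A + np.outer(z, z)
print(np.allclose(B_alt @ s, y))     # True as well
```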

Proposition

Derivation of the Secant Condition from Taylor Expansion

Statement

If $\nabla^2 f$ is approximately constant along the segment from $x^{(t)}$ to $x^{(t+1)}$, then $\nabla^2 f \cdot s_t \approx y_t$. The secant condition $B_{t+1} s_t = y_t$ enforces this relationship exactly for the approximation $B_{t+1}$.

Intuition

By the fundamental theorem of calculus applied to $\nabla f$ along the segment:

$$y_t = \nabla f(x^{(t+1)}) - \nabla f(x^{(t)}) = \left(\int_0^1 \nabla^2 f(x^{(t)} + \tau s_t) \, d\tau\right) s_t$$

If the Hessian is roughly constant, the integral is approximately $\nabla^2 f(x^{(t)})$, giving $y_t \approx \nabla^2 f(x^{(t)}) s_t$. The secant condition forces the approximation to match this.

Proof Sketch

Taylor expand $\nabla f(x^{(t+1)})$ around $x^{(t)}$:

$$\nabla f(x^{(t+1)}) = \nabla f(x^{(t)}) + \nabla^2 f(x^{(t)}) s_t + O(\|s_t\|^2)$$

Rearranging: $\nabla^2 f(x^{(t)}) s_t = y_t - O(\|s_t\|^2)$.

When $f$ is quadratic, the $O(\|s_t\|^2)$ term vanishes: the gradient of a quadratic is affine, so $y_t = \nabla^2 f \cdot s_t$ exactly. For general smooth $f$, the approximation improves as the step size $\|s_t\|$ shrinks near convergence.

Replacing the true Hessian with the approximation $B_{t+1}$ and dropping the remainder gives $B_{t+1} s_t = y_t$.

Why It Matters

The secant condition is the minimal consistency requirement for a Hessian approximation. It constrains $B_{t+1}$ along one direction ($s_t$) but leaves freedom in all other directions. Different quasi-Newton methods (BFGS, DFP, SR1) differ in how they use this freedom.

Failure Mode

The BFGS update requires $s_t^\top y_t > 0$ (the curvature condition) to produce a positive definite $B_{t+1}$. This is guaranteed if $f$ is strongly convex but can fail for non-convex functions. When it fails, the BFGS update must be skipped or a safeguard applied (e.g., Powell damping).

The Curvature Condition in Detail

The requirement $s_t^\top y_t > 0$ deserves further attention. Expanding:

$$s_t^\top y_t = s_t^\top \left(\nabla f(x^{(t+1)}) - \nabla f(x^{(t)})\right)$$

For a strongly convex function with parameter $\mu > 0$, the bound $s_t^\top y_t \geq \mu \|s_t\|^2$ holds, so the curvature condition is automatic.

For general smooth functions, the Wolfe conditions on the line search guarantee $s_t^\top y_t > 0$. Specifically, the curvature (second Wolfe) condition $\nabla f(x^{(t+1)})^\top d_t \geq c_2 \nabla f(x^{(t)})^\top d_t$ implies:

$$s_t^\top y_t \geq (c_2 - 1) \, \alpha_t \, \nabla f(x^{(t)})^\top d_t > 0$$

which is strictly positive because $d_t$ is a descent direction ($\nabla f(x^{(t)})^\top d_t < 0$) and $c_2 < 1$. Here $0 < c_1 < c_2 < 1$ are the Wolfe parameters (typical values: $c_1 = 10^{-4}$, $c_2 = 0.9$). This is why the Wolfe line search is not optional for BFGS; it is a structural requirement.
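A quick empirical check (a sketch, assuming SciPy; the Rosenbrock function is my choice of a smooth non-convex test, not from the text): a Wolfe line search still delivers $s_t^\top y_t > 0$.

```python
import numpy as np
from scipy.optimize import line_search, rosen, rosen_der

x = np.array([-1.2, 1.0])
g = rosen_der(x)
p = -g                                     # any descent direction works for the demo

# scipy's line_search enforces the strong Wolfe conditions (c1, c2 as above).
alpha, *_ = line_search(rosen, rosen_der, x, p, c1=1e-4, c2=0.9)
assert alpha is not None

x_new = x + alpha * p
s = x_new - x
y = rosen_der(x_new) - g
print(float(s @ y) > 0)                    # True: curvature condition holds
```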

The BFGS Update

The BFGS (Broyden-Fletcher-Goldfarb-Shanno) update is the most successful quasi-Newton method. It was discovered independently by four researchers in 1970. The update modifies the inverse Hessian approximation $H_t$ to satisfy the secant condition while remaining positive definite, using a rank-2 correction:

Definition

BFGS Inverse Hessian Update

$$H_{t+1} = \left(I - \rho_t s_t y_t^\top\right) H_t \left(I - \rho_t y_t s_t^\top\right) + \rho_t s_t s_t^\top$$

where $\rho_t = \dfrac{1}{y_t^\top s_t}$.

Equivalently, the direct Hessian approximation update is:

$$B_{t+1} = B_t - \frac{B_t s_t s_t^\top B_t}{s_t^\top B_t s_t} + \frac{y_t y_t^\top}{y_t^\top s_t}$$

Key properties of the BFGS update:

  • Rank-2: $H_{t+1} - H_t$ has rank at most 2, so the update is $O(d^2)$
  • Positive definiteness preserved: if $H_t \succ 0$ and $y_t^\top s_t > 0$, then $H_{t+1} \succ 0$
  • Secant condition satisfied: $H_{t+1} y_t = s_t$ by construction
  • Minimal change: BFGS is the unique update that satisfies the secant condition, preserves symmetry and positive definiteness, and is closest to $H_t$ in a weighted Frobenius norm
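The update formula is short enough to verify numerically. A sketch (function and variable names are mine, not a library API), checking the secant condition, positive definiteness, and the rank-2 property on random data:

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """Return H_{t+1} from H_t and a pair (s_t, y_t) with s^T y > 0."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(y, s)
    return V.T @ H @ V + rho * np.outer(s, s)

rng = np.random.default_rng(2)
d = 6
M = rng.standard_normal((d, d))
H = M @ M.T + d * np.eye(d)                 # H_t > 0
s = rng.standard_normal(d)
y = rng.standard_normal(d)
if y @ s <= 0:                              # enforce the curvature condition
    y = -y

H_new = bfgs_inverse_update(H, s, y)
print(np.allclose(H_new @ y, s))            # secant condition
print(np.all(np.linalg.eigvalsh(H_new) > 0))  # positive definiteness preserved
print(np.linalg.matrix_rank(H_new - H) <= 2)  # rank-2 change
```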

Verifying the Secant Condition

A direct check that $H_{t+1} y_t = s_t$:

$$H_{t+1} y_t = \left(I - \rho_t s_t y_t^\top\right) H_t \left(I - \rho_t y_t s_t^\top\right) y_t + \rho_t s_t s_t^\top y_t$$

The inner product $s_t^\top y_t = 1/\rho_t$, so the second factor gives $(I - \rho_t y_t s_t^\top) y_t = y_t - y_t = 0$. The first term vanishes, and the last term gives $\rho_t s_t (s_t^\top y_t) = \rho_t s_t / \rho_t = s_t$.

Verifying Positive Definiteness Preservation

To see why $H_{t+1} \succ 0$ when $H_t \succ 0$ and $\rho_t > 0$, write $V_t = I - \rho_t y_t s_t^\top$. Then $H_{t+1} = V_t^\top H_t V_t + \rho_t s_t s_t^\top$. For any $z \neq 0$:

$$z^\top H_{t+1} z = \|H_t^{1/2} V_t z\|^2 + \rho_t (s_t^\top z)^2 \geq 0$$

Equality requires both $H_t^{1/2} V_t z = 0$, hence $V_t z = 0$ since $H_t \succ 0$, and $s_t^\top z = 0$. But $V_t z = z - \rho_t y_t (s_t^\top z) = z$ whenever $s_t^\top z = 0$, so the two requirements together force $z = 0$. Therefore $z^\top H_{t+1} z > 0$ for every $z \neq 0$, i.e. $H_{t+1} \succ 0$.

The Minimality Property

BFGS solves the following optimization problem:

$$\min_{H} \|H - H_t\|_W \quad \text{subject to} \quad H = H^\top, \quad H y_t = s_t$$

where $\|M\|_W^2 = \|W^{1/2} M W^{1/2}\|_F^2$ with $W = \bar{G}_t$, the average Hessian $\int_0^1 \nabla^2 f(x^{(t)} + \tau s_t) \, d\tau$. In practice $W$ is approximated by $B_t$ or by $B_{t+1}$ itself (the direct BFGS update). The weighted Frobenius norm measures distance scale-invariantly: it penalizes changes in well-conditioned directions as much as ill-conditioned ones.

This variational characterization explains BFGS's success. It preserves as much of the old curvature information as possible while incorporating the new observation, in a scale-adapted sense.

DFP: The Other Rank-2 Update

The DFP (Davidon-Fletcher-Powell) method predates BFGS by a decade. It solves the dual minimality problem:

$$\min_{B} \|B - B_t\|_W \quad \text{subject to} \quad B = B^\top, \quad B s_t = y_t$$

The DFP update for the direct Hessian approximation is:

$$B_{t+1}^{\text{DFP}} = \left(I - \frac{y_t s_t^\top}{y_t^\top s_t}\right) B_t \left(I - \frac{s_t y_t^\top}{y_t^\top s_t}\right) + \frac{y_t y_t^\top}{y_t^\top s_t}$$

DFP and BFGS are duals of each other: the BFGS update on $H$ has the same algebraic form as the DFP update on $B$, and vice versa. In practice, BFGS outperforms DFP because DFP's inverse Hessian approximation tends to become increasingly ill-conditioned, amplifying errors in the line search. Nocedal and Wright (2006, Section 6.2) give empirical comparisons showing BFGS is more forgiving of inexact line searches.

SR1: The Rank-1 Alternative

The symmetric rank-1 (SR1) update takes the simplest possible form:

$$B_{t+1} = B_t + \frac{(y_t - B_t s_t)(y_t - B_t s_t)^\top}{(y_t - B_t s_t)^\top s_t}$$

SR1 has one advantage over BFGS: it does not require $s_t^\top y_t > 0$. The approximation can capture negative curvature, making SR1 useful within trust-region methods for non-convex problems. The drawback: SR1 does not preserve positive definiteness, and the denominator $(y_t - B_t s_t)^\top s_t$ can be zero or near-zero, requiring a skip condition.
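A sketch of the SR1 update with a typical skip safeguard (the names and the threshold value are my own choices, not from the text):

```python
import numpy as np

def sr1_update(B, s, y, r=1e-8):
    """SR1 update of the direct Hessian approximation B, with a skip condition."""
    v = y - B @ s
    denom = v @ s
    # Skip when the denominator is tiny relative to ||s|| ||v||; otherwise the
    # rank-1 term blows up numerically.
    if abs(denom) < r * np.linalg.norm(s) * np.linalg.norm(v):
        return B
    return B + np.outer(v, v) / denom
```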

The Full BFGS Algorithm

  1. Initialize $H_0 = I$ (or a scaled identity) and $x^{(0)}$.
  2. Compute search direction: $d_t = -H_t \nabla f(x^{(t)})$.
  3. Line search: find $\alpha_t$ satisfying the Wolfe conditions.
  4. Update: $x^{(t+1)} = x^{(t)} + \alpha_t d_t$.
  5. Compute $s_t = x^{(t+1)} - x^{(t)}$ and $y_t = \nabla f(x^{(t+1)}) - \nabla f(x^{(t)})$.
  6. Check: if $s_t^\top y_t < \epsilon$ (some small threshold), skip the BFGS update.
  7. Update $H_t \to H_{t+1}$ using the BFGS formula.
  8. Go to step 2.

The Wolfe conditions in step 3 enforce two requirements. First, the Armijo (sufficient decrease) condition: $f(x^{(t)} + \alpha d_t) \leq f(x^{(t)}) + c_1 \alpha \nabla f(x^{(t)})^\top d_t$. Second, the curvature condition: $\nabla f(x^{(t)} + \alpha d_t)^\top d_t \geq c_2 \nabla f(x^{(t)})^\top d_t$. Together, they guarantee $y_t^\top s_t > 0$, which preserves positive definiteness.
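A compact sketch of the loop above, assuming SciPy's Wolfe line search and the Rosenbrock function as a test problem (no damping or restarts, so this is illustrative rather than production code):

```python
import numpy as np
from scipy.optimize import line_search, rosen, rosen_der

def bfgs(f, grad, x0, tol=1e-8, max_iter=200):
    d = len(x0)
    x, H = np.asarray(x0, dtype=float), np.eye(d)
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                                    # step 2: search direction
        alpha, *_ = line_search(f, grad, x, p, gfk=g) # step 3: Wolfe line search
        if alpha is None:                             # line search failed; crude fallback
            alpha = 1e-3
        x_new = x + alpha * p                         # step 4
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g                   # step 5
        sy = s @ y
        if sy > 1e-10:                                # step 6: curvature check
            rho = 1.0 / sy
            V = np.eye(d) - rho * np.outer(y, s)
            H = V.T @ H @ V + rho * np.outer(s, s)    # step 7: BFGS update
        x, g = x_new, g_new
    return x

print(bfgs(rosen, rosen_der, [-1.2, 1.0]))            # approximately [1, 1]
```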

Initialization Strategy

The choice of H0H_0 affects early convergence. Three common strategies:

  1. Identity: $H_0 = I$. The first step is steepest descent. Simple but can be far from the true scale.
  2. Scaled identity: $H_0 = \gamma I$ where $\gamma = y_0^\top s_0 / y_0^\top y_0$. After the first step, rescale to match observed curvature. This is the standard choice for L-BFGS.
  3. Diagonal: $H_0 = \text{diag}(h_1, \ldots, h_d)$ from a rough estimate of the diagonal Hessian.

In L-BFGS, the scaling $\gamma_t = y_{t-1}^\top s_{t-1} / y_{t-1}^\top y_{t-1}$ is recomputed at every iteration, adapting the base matrix to the local curvature. This small detail has a large practical effect: without it, the two-loop recursion builds on a poorly scaled base matrix and the resulting search directions force the line search to work much harder.

Superlinear Convergence

Theorem

BFGS Superlinear Convergence

Statement

Assume $f$ is twice continuously differentiable, the iterates converge to a minimizer $x^*$ at which $\nabla^2 f(x^*) \succ 0$, and the Hessian is Lipschitz continuous near $x^*$. Then BFGS with a Wolfe line search converges superlinearly to $x^*$:

$$\lim_{t \to \infty} \frac{\|x^{(t+1)} - x^*\|}{\|x^{(t)} - x^*\|} = 0$$

That is, the convergence rate eventually beats any fixed linear rate, though it does not achieve the quadratic rate of exact Newton.

Intuition

As the iterates approach $x^*$, the BFGS approximation $H_t$ converges to $[\nabla^2 f(x^*)]^{-1}$ along the directions that matter: the directions the algorithm actually moves. The approximation becomes increasingly accurate where it counts, so the steps approach true Newton steps, and the convergence rate transitions from linear to superlinear.

Proof Sketch

The proof relies on the Dennis-More characterization (1974). Superlinear convergence of a quasi-Newton method holds if and only if:

$$\lim_{t \to \infty} \frac{\left\|\left(H_t - [\nabla^2 f(x^*)]^{-1}\right) d_t\right\|}{\|d_t\|} = 0$$

This condition says that $H_t$ need not converge to the true inverse Hessian in every direction; only along the search directions $d_t$. Call this directional convergence.

The proof that BFGS achieves directional convergence proceeds through a trace-determinant analysis. Define the error $E_t = \nabla^2 f(x^*)^{1/2} H_t \nabla^2 f(x^*)^{1/2} - I$. The key estimates are:

$$\text{tr}(E_{t+1}) - \text{tr}(E_t) \leq C_1 \|s_t\|^2$$

$$\det(I + E_{t+1}) / \det(I + E_t) \geq C_2 \cos^2 \theta_t$$

where $\theta_t$ is the angle between $d_t$ and the Newton direction. The trace grows slowly while the determinant is bounded below, which forces the eigenvalues of $I + E_t$ to cluster near 1 along the directions explored by the algorithm. Summing the trace bound over all iterations shows $\sum_t \|E_t d_t\|^2 / \|d_t\|^2 < \infty$, which gives the Dennis-More condition.

See Nocedal and Wright (2006), Theorem 6.6, for the full argument.

Why It Matters

Superlinear convergence means BFGS is much faster than gradient descent (which is only linearly convergent) and approaches the speed of Newton without computing second derivatives. For smooth, strongly convex problems, BFGS is the practical optimum: each iteration costs $O(d^2)$ versus $O(d^3)$ for Newton, and the number of iterations is comparable.

Failure Mode

The superlinear convergence requires strong convexity and sufficiently accurate line searches. On non-convex problems, BFGS may converge to saddle points, and the Hessian approximation can become increasingly inaccurate. The curvature condition $y_t^\top s_t > 0$ may fail, requiring skipped updates or damping. Even on convex problems, if the line search is too crude (e.g., fixed step size without Wolfe conditions), the approximation can degrade and convergence reverts to linear or worse.

Convergence Rate Comparison

| Method | Rate | Cost per iteration | Storage |
| --- | --- | --- | --- |
| Gradient descent | Linear, with $\kappa$ dependence | $O(d)$ | $O(d)$ |
| Conjugate gradient | Superlinear (on quadratics, finite termination in $d$ steps) | $O(d)$ | $O(d)$ |
| BFGS | Superlinear | $O(d^2)$ | $O(d^2)$ |
| L-BFGS ($m$ pairs) | Superlinear in practice; theory weaker | $O(md)$ | $O(md)$ |
| Newton | Quadratic | $O(d^3)$ | $O(d^2)$ |

For gradient descent on a strongly convex function with condition number $\kappa$, the linear rate is $1 - 2/(\kappa + 1)$. When $\kappa = 10^3$, gradient descent needs roughly 6900 iterations to reduce the error by $10^{-6}$. BFGS typically needs 50-100 iterations for the same reduction, largely independent of $\kappa$, because the Hessian approximation adapts to the curvature.

L-BFGS: Limited-Memory BFGS

BFGS stores a $d \times d$ matrix $H_t$: $O(d^2)$ memory. For $d = 10^6$, this is 8 TB in double precision. L-BFGS avoids this by never forming $H_t$ explicitly.

Definition

L-BFGS

L-BFGS stores only the last $m$ pairs $\{(s_i, y_i)\}_{i=t-m}^{t-1}$ (typically $m = 5$ to $20$). The matrix-vector product $H_t \nabla f$ is computed using a two-loop recursion that requires only $O(md)$ storage and $O(md)$ computation.

The Two-Loop Recursion

The two-loop recursion (Nocedal, 1980) computes $r = H_t g$ without forming $H_t$:

Input: gradient $g = \nabla f(x^{(t)})$, stored pairs $\{(s_i, y_i, \rho_i)\}_{i=t-m}^{t-1}$, scaling $\gamma_t$.

Loop 1 (backward, $i = t-1$ down to $t-m$):

Set $q \leftarrow g$. For each $i$: compute $\alpha_i = \rho_i s_i^\top q$; set $q \leftarrow q - \alpha_i y_i$.

Scale: $r = \gamma_t q$ where $\gamma_t = s_{t-1}^\top y_{t-1} / y_{t-1}^\top y_{t-1}$.

Loop 2 (forward, $i = t-m$ up to $t-1$):

For each $i$: compute $\beta = \rho_i y_i^\top r$; set $r \leftarrow r + (\alpha_i - \beta) s_i$.

Output: $r = H_t g$.

Cost: $4md$ multiplications and $2m$ inner products per call. Total: $O(md)$ time and $O(md)$ storage, where $m \ll d$.
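The recursion translates almost line for line into code. A sketch (function and variable names are mine; `pairs` holds the last $m$ $(s_i, y_i)$ tuples, oldest first):

```python
import numpy as np

def lbfgs_direction(g, pairs):
    """Return r = H_t g via the two-loop recursion, without forming H_t."""
    q = g.astype(float).copy()
    alphas = []
    # Loop 1: backward through the stored pairs (newest to oldest).
    for s, y in reversed(pairs):
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q -= a * y
        alphas.append(a)
    # Initial matrix: scaled identity gamma_t * I built from the newest pair.
    s_last, y_last = pairs[-1]
    gamma = (s_last @ y_last) / (y_last @ y_last)
    r = gamma * q
    # Loop 2: forward through the stored pairs (oldest to newest).
    for (s, y), a in zip(pairs, reversed(alphas)):
        rho = 1.0 / (y @ s)
        b = rho * (y @ r)
        r += (a - b) * s
    return r
```

The search direction is then $-r$; the caller appends the newest $(s_t, y_t)$ pair after each step and drops the oldest once more than $m$ are stored.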

Why It Works

The two-loop recursion computes the product of $m$ rank-2 updates applied to the scaled identity $\gamma_t I$. Unrolling the BFGS recurrence, $H_t$ has the form:

$$H_t = V_{t-1}^\top \cdots V_{t-m}^\top (\gamma_t I) V_{t-m} \cdots V_{t-1} + \text{rank-2 corrections}$$

where $V_i = I - \rho_i y_i s_i^\top$. The two-loop recursion evaluates $H_t g$ by propagating $g$ through these factors without ever forming the $d \times d$ matrix. Old pairs beyond the window of $m$ are discarded, so the effective approximation "forgets" curvature information from more than $m$ steps ago.

Choosing mm

The memory parameter $m$ controls the trade-off between approximation quality and cost:

  • $m = 3$-$5$: fast, low memory, often sufficient for well-conditioned problems
  • $m = 10$-$20$: standard for most applications. Diminishing returns beyond $m = 20$
  • $m = d$: recovers full BFGS (but defeats the purpose)

Empirically, $m$ between 5 and 20 gives performance within 10-20% of full BFGS on most smooth problems. The sensitivity to $m$ decreases as the problem becomes better conditioned.

Why L-BFGS Is the Default

For medium-scale smooth optimization ($d$ in the thousands to low millions, full-batch gradients available), L-BFGS dominates because:

  1. $O(md)$ memory vs. $O(d^2)$ for BFGS
  2. No learning rate schedule to tune (unlike SGD, which needs step size decay and momentum tuning)
  3. Strong empirical convergence on smooth strongly convex problems
  4. Built into every serious optimization library (scipy.optimize.minimize, PyTorch torch.optim.LBFGS)

Caveat: superlinear convergence is a full-BFGS result. The Dennis-More / Broyden-Dennis-More-Powell theory establishes local superlinear convergence for full-memory BFGS under the assumptions in the theorem above. With fixed memory $m \ll d$ (i.e. true L-BFGS), the classical superlinear-convergence proof breaks down because the approximation cannot capture all of $[\nabla^2 f(x^*)]^{-1}$ in $m$ rank-2 updates. Liu and Nocedal (1989) prove only $R$-linear convergence for L-BFGS on strongly convex problems; later analyses (Yuan 1991; Byrd-Liu-Nocedal 1992) sharpen the constants but do not recover superlinear rates in general. In practice, L-BFGS with $m = 10$-$20$ converges almost as fast as full BFGS on most smooth problems, so the "L-BFGS = practical superlinear" intuition is empirically warranted, but it should not be quoted as a theorem.

L-BFGS is the standard for logistic regression, CRFs, maximum entropy models, and any smooth ML problem where you can compute full gradients. It is also the default solver ('lbfgs') for scikit-learn's LogisticRegression with L2 regularization.
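Typical library usage, as a sketch: SciPy exposes limited-memory BFGS through `scipy.optimize.minimize` with `method="L-BFGS-B"`, where the `maxcor` option sets the number of stored pairs $m$ (the Rosenbrock test function is my choice, not from the text):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

res = minimize(rosen, x0=np.array([-1.2, 1.0]), jac=rosen_der,
               method="L-BFGS-B", options={"maxcor": 10, "gtol": 1e-8})
print(res.x, res.nit)   # approximately [1, 1], in a few dozen iterations
```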

When L-BFGS Loses

L-BFGS has clear failure modes:

  • Stochastic gradients: L-BFGS builds its approximation from gradient differences $y_t = g_{t+1} - g_t$. If $g_t$ has noise from mini-batch sampling, $y_t$ is dominated by noise rather than curvature. Stochastic quasi-Newton methods exist (e.g., online L-BFGS, SQN) but are not yet standard.
  • Non-smooth objectives: if $f$ has kinks (as in L1 regularization), the gradient is discontinuous and the secant condition loses meaning. Use proximal methods instead.
  • Very large $d$: for $d > 10^7$, even $O(md)$ per iteration can be expensive when $m = 20$. First-order methods (Adam, SGD with momentum) become the only option.
  • Non-convex deep learning: the loss surface of neural networks has saddle points, plateaus, and sharp minima. BFGS/L-BFGS applied to full-batch deep learning objectives tends to converge to sharp minima that generalize poorly, and the curvature condition $s_t^\top y_t > 0$ fails frequently.

Connection to Preconditioning

The inverse Hessian approximation $H_t$ acts as a preconditioner for the gradient. Instead of descending along $-\nabla f$ (steepest descent in the Euclidean metric), you descend along $-H_t \nabla f$, which is steepest descent in the metric defined by $H_t^{-1}$.

In the eigenspace of the Hessian, this rescaling approximately equalizes the step sizes across all directions: directions with high curvature (large eigenvalues) get smaller steps, and directions with low curvature (small eigenvalues) get larger steps. This is why quasi-Newton methods handle ill-conditioned problems far better than gradient descent: the condition number of the preconditioned system $H_t \nabla^2 f$ is close to 1 when $H_t \approx [\nabla^2 f]^{-1}$.

Example

Preconditioning on an ill-conditioned quadratic

Consider $f(x) = \frac{1}{2} x^\top A x$ where $A = \text{diag}(1, 1000)$. The condition number is $\kappa = 1000$.

Gradient descent with optimal step size converges at rate $(\kappa - 1)/(\kappa + 1) \approx 0.998$ per iteration. To reduce the error by $10^{-6}$ requires about 6900 iterations.

BFGS, after accumulating curvature information from the first few steps, builds $H_t \approx A^{-1} = \text{diag}(1, 0.001)$. The preconditioned gradient $H_t \nabla f = H_t A x \approx x$ points directly at the minimum. On this quadratic, BFGS terminates in exactly 2 steps (with exact line search).
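A sketch of the comparison on this exact quadratic (assuming SciPy; iteration counts will vary slightly with tolerances and with SciPy's inexact line search):

```python
import numpy as np
from scipy.optimize import minimize

A = np.diag([1.0, 1000.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x0 = np.array([1.0, 1.0])

# Gradient descent with the optimal fixed step 2 / (mu + L).
x, step = x0.copy(), 2.0 / (1.0 + 1000.0)
gd_iters = 0
while np.linalg.norm(x) > 1e-6 * np.linalg.norm(x0):
    x -= step * grad(x)
    gd_iters += 1

res = minimize(f, x0, jac=grad, method="BFGS")
print(gd_iters, res.nit)   # thousands of GD iterations vs. a handful for BFGS
```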

This is the same principle behind adaptive gradient methods in deep learning: Adam and AdaGrad use diagonal preconditioning (diagonal Hessian approximation), while L-BFGS uses a low-rank-plus-diagonal approximation. The optimizer theory page discusses this connection.

Damped BFGS and Practical Safeguards

On non-convex problems, the curvature condition $s_t^\top y_t > 0$ can fail. Powell (1978) proposed a damped BFGS update that interpolates between $y_t$ and $B_t s_t$:

$$r_t = \theta_t y_t + (1 - \theta_t) B_t s_t$$

where $\theta_t = 1$ if $s_t^\top y_t \geq 0.2 \, s_t^\top B_t s_t$, and otherwise:

$$\theta_t = \frac{0.8 \, s_t^\top B_t s_t}{s_t^\top B_t s_t - s_t^\top y_t}$$

The modified pair $(s_t, r_t)$ always satisfies $s_t^\top r_t > 0$, so the update preserves positive definiteness. The cost: the approximation is biased toward the current $B_t$, slowing convergence slightly. Powell damping is used in most production BFGS implementations, including MATLAB's fminunc.
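A sketch of the damping rule (names are illustrative; the constants 0.2 and 0.8 are the ones from the formulas above). With Powell's choice of $\theta_t$, the branch below yields $s_t^\top r_t = 0.2 \, s_t^\top B_t s_t > 0$:

```python
import numpy as np

def powell_damped_pair(B, s, y):
    """Return (s, r): the damped pair to use in place of (s, y) in the BFGS update."""
    sBs = s @ B @ s
    sy = s @ y
    if sy >= 0.2 * sBs:
        theta = 1.0                          # curvature is fine; no damping
    else:
        theta = 0.8 * sBs / (sBs - sy)       # well defined: sy < 0.2 sBs <= sBs
    r = theta * y + (1.0 - theta) * (B @ s)
    return s, r
```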

Common Confusions

Watch Out

BFGS approximates the Hessian, not the gradient

BFGS does not modify the gradient direction in an ad hoc way. It builds a principled approximation to the Hessian (or its inverse) that satisfies a consistency condition (the secant equation). The resulting search direction is a quasi-Newton direction: close to the true Newton direction when the approximation is accurate. Contrast with Adam, which preconditions with a diagonal running average of squared gradients; this is a diagonal Hessian approximation, not a full quasi-Newton update.

Watch Out

L-BFGS is not just BFGS with less memory

L-BFGS does not simply truncate the BFGS matrix. It implicitly represents $H_t$ as a product of rank-2 updates applied to a scaled identity, and computes the matrix-vector product $H_t g$ without ever forming $H_t$. The two-loop recursion is the key algorithmic insight. The resulting approximation is different from full BFGS: it effectively forgets old curvature information beyond $m$ steps. In practice this limited memory is sufficient; empirically, convergence speed plateaus around $m = 10$-$20$.

Watch Out

Quasi-Newton is not the same as conjugate gradient

Both methods terminate in at most $d$ steps on a strictly convex quadratic (with exact line searches). The difference: conjugate gradient uses $O(d)$ storage and generates search directions via a scalar recurrence, while BFGS maintains an $O(d^2)$ matrix and produces search directions via a matrix-vector product. On non-quadratic problems, BFGS adapts its curvature model more aggressively. Conjugate gradient requires exact line searches for its conjugacy property; BFGS is more tolerant of inexact line searches.

Watch Out

The Wolfe line search is not optional for BFGS

Some implementations default to a simple backtracking (Armijo) line search. For steepest descent, this suffices. For BFGS, it does not: the curvature condition $s_t^\top y_t > 0$ is only guaranteed when the line search also enforces the Wolfe curvature condition. Without it, the BFGS matrix can lose positive definiteness and the algorithm can diverge. The Wolfe line search adds cost (typically 2-5 extra gradient evaluations per iteration) but is required for correctness.

Summary

  • Quasi-Newton methods approximate the Hessian using only gradient differences
  • The secant condition $B_{t+1} s_t = y_t$ is the fundamental constraint, derived from Taylor expansion of the gradient
  • BFGS uses a rank-2 update preserving positive definiteness: $O(d^2)$ per step
  • The BFGS update is the unique minimum-change update in a weighted Frobenius norm
  • L-BFGS stores only $m$ vector pairs: $O(md)$ per step with two-loop recursion
  • Full BFGS converges superlinearly on strongly convex problems with a Wolfe line search
  • L-BFGS is the default for medium-scale smooth optimization; it fails on stochastic, non-smooth, or very high-dimensional problems
  • The Hessian approximation acts as a preconditioner, equalizing curvature across the condition number spectrum

Exercises

ExerciseCore

Problem

Derive the secant condition from a first-order Taylor expansion of $\nabla f$ around $x^{(t)}$ evaluated at $x^{(t+1)}$. Specifically, show that $\nabla^2 f(x^{(t)}) s_t \approx y_t$ under the assumption that the Hessian is approximately constant.

ExerciseCore

Problem

Verify directly that the BFGS inverse Hessian update satisfies the secant condition $H_{t+1} y_t = s_t$. That is, substitute $y_t$ into the BFGS formula and show the result is $s_t$.

ExerciseAdvanced

Problem

Consider BFGS initialized with $H_0 = I$ on a strictly convex quadratic $f(x) = \frac{1}{2} x^\top A x - b^\top x$ with $A \succ 0$. Show that after $d$ steps with exact line searches, $H_d = A^{-1}$ (i.e., BFGS recovers the exact inverse Hessian in at most $d$ steps on a quadratic). This is called the finite termination property.

ExerciseAdvanced

Problem

Show that the damped BFGS update preserves positive definiteness. That is, if $B_t \succ 0$ and $r_t = \theta_t y_t + (1 - \theta_t) B_t s_t$ with the Powell choice of $\theta_t$, then $s_t^\top r_t > 0$.

References

Canonical:

  • Nocedal & Wright, Numerical Optimization (2006), Chapters 6-7. The standard reference; Chapter 6 covers the quasi-Newton framework, Chapter 7 covers L-BFGS.
  • Dennis & Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations (1996), Chapters 8-9
  • Dennis & More, "Characterization of Superlinear Convergence and Its Application to Quasi-Newton Methods" (1974), Mathematics of Computation 28(126)

Current:

  • Nocedal, "Updating Quasi-Newton Matrices with Limited Storage" (1980), Mathematics of Computation 35(151). The L-BFGS paper.
  • Liu & Nocedal, "On the Limited Memory BFGS Method for Large Scale Optimization" (1989), Mathematical Programming 45(1-3)
  • Powell, "Algorithms for Nonlinear Constraints That Use Lagrangian Functions" (1978), Mathematical Programming 14(1). Introduces damped BFGS.
  • Byrd, Lu, Nocedal & Zhu, "A Limited Memory Algorithm for Bound Constrained Optimization" (1995), SIAM Journal on Scientific Computing 16(5). The L-BFGS-B variant for box constraints.
  • Berahas, Nocedal & Takac, "A Multi-Batch L-BFGS Method for Machine Learning" (2016), NeurIPS. Extends L-BFGS to stochastic settings.

Next Topics

The natural next steps from quasi-Newton methods:

  • Proximal gradient methods: when the objective has a non-smooth regularizer (e.g., $\ell_1$), quasi-Newton cannot be applied directly; proximal operators handle the non-smooth part
  • Trust-region methods: an alternative globalization strategy to line search that constrains the step length directly; often paired with SR1 updates
  • Optimizer theory (SGD, Adam, Muon): how modern deep learning optimizers relate to the preconditioning ideas from quasi-Newton methods

Last reviewed: April 26, 2026
