
Numerical Optimization

Trust Region Methods

Trust region methods minimize a local model only inside a region where that model is trusted, then grow or shrink the region using actual versus predicted decrease. They are robust Newton-style methods for difficult non-convex landscapes.

Advanced · Tier 2 · Stable · Supporting · ~55 min

Why This Matters

Newton's method is fast when its quadratic model is accurate and dangerous when it is not. A bad Newton step can be huge, point uphill, or chase a saddle because the Hessian is indefinite.

Trust region methods fix this failure mode by asking a more honest question:

How far from the current point do I actually trust this local model?

Instead of choosing a direction and searching along it, a trust region method chooses the step by minimizing the model inside a ball. Then it compares predicted improvement to actual improvement and updates the radius.

Mental Model

A line search method says: "I picked a direction. How far should I go?"

A trust region method says: "I trust my model inside this radius. What is the best step inside it?"

If the model predicts reality well, the region expands. If the model lies, the region shrinks. This is why trust regions are robust for non-convex second-order optimization: the radius becomes a feedback signal about local model quality.

Formal Setup

Definition

Quadratic Model

At iterate x_k, define

m_k(p) = f(x_k) + g_k^T p + (1/2) p^T B_k p,

where g_k = \nabla f(x_k) and B_k is either the Hessian \nabla^2 f(x_k) or an approximation to it.
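
As a small illustration in NumPy (the helper name and calling convention here are ours, not from the text):

```python
import numpy as np

def quadratic_model(f_xk, g, B, p):
    """m_k(p) = f(x_k) + g^T p + (1/2) p^T B p for a trial step p."""
    p = np.asarray(p, dtype=float)
    return f_xk + g @ p + 0.5 * p @ (B @ p)
```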

Definition

Trust Region Subproblem

The trust region step solves

p_k \in \arg\min_p m_k(p) \quad \text{subject to} \quad \|p\| \leq \Delta_k.

The radius \Delta_k limits how far the algorithm trusts the model.

Definition

Actual-To-Predicted Reduction Ratio

After proposing p_k, compute

\rho_k = \frac{f(x_k) - f(x_k + p_k)}{m_k(0) - m_k(p_k)}.

If \rho_k \approx 1, the model predicted the objective well. If \rho_k is small or negative, the model was unreliable at that radius.
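
As a minimal sketch, the ratio is a one-liner; the guard against a non-positive predicted decrease is an implementation choice we add here, not part of the definition:

```python
def reduction_ratio(f_old, f_new, predicted_decrease):
    """rho_k = (f(x_k) - f(x_k + p_k)) / (m_k(0) - m_k(p_k))."""
    if predicted_decrease <= 0.0:
        # The subproblem solver should always predict a decrease; if it does
        # not, treat the step as untrustworthy so it will be rejected.
        return float("-inf")
    return (f_old - f_new) / predicted_decrease
```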

The Algorithm

  1. Build m_k(p) from f(x_k), g_k, and B_k.
  2. Approximately solve the trust region subproblem.
  3. Compute \rho_k.
  4. Accept the step if \rho_k is large enough.
  5. Shrink \Delta_k after poor prediction; grow it after accurate prediction on boundary-hitting steps.

A typical policy:

Condition                              | Step decision     | Radius decision
\rho_k < 0.1                           | reject            | shrink
0.1 \leq \rho_k \leq 0.75              | accept cautiously | keep similar
\rho_k > 0.75 and \|p_k\| = \Delta_k   | accept            | expand
\rho_k < 0                             | reject            | shrink aggressively

The exact constants are implementation choices. The logic is the important part.
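
A compact sketch of the whole loop in NumPy, with constants taken from the table above and the Cauchy point (discussed in the next sections) as the cheapest admissible subproblem solver; the names and defaults are illustrative, not a reference implementation:

```python
import numpy as np

def cauchy_point(g, B, delta):
    """Minimize the model along -g, clipped to the trust region boundary."""
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return np.zeros_like(g)
    gBg = g @ (B @ g)
    tau = 1.0 if gBg <= 0 else min(gnorm**3 / (delta * gBg), 1.0)
    return -tau * (delta / gnorm) * g

def trust_region_minimize(f, grad, hess, x0, delta0=1.0, delta_max=10.0,
                          tol=1e-6, max_iter=200, subproblem=cauchy_point):
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        p = subproblem(g, B, delta)                  # approximate subproblem solve
        predicted = -(g @ p + 0.5 * p @ (B @ p))     # m_k(0) - m_k(p)
        rho = (f(x) - f(x + p)) / predicted if predicted > 0 else -np.inf
        if rho < 0.1:
            delta *= 0.25                            # poor prediction: shrink
        elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
            delta = min(2.0 * delta, delta_max)      # accurate and boundary-hitting: grow
        if rho >= 0.1:
            x = x + p                                # accept; otherwise keep x
    return x
```

With the Cauchy point this behaves like a safeguarded steepest descent; passing a dogleg or Steihaug-CG routine as `subproblem` recovers Newton-like steps whenever the model is trusted.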

Solving The Subproblem

Theorem

Trust Region Subproblem Optimality Conditions

Statement

A global solution p^* to the quadratic trust region subproblem is characterized by the existence of a multiplier \lambda \geq 0 such that

(B + \lambda I) p^* = -g, \qquad B + \lambda I \succeq 0,

and

\lambda (\Delta - \|p^*\|) = 0, \qquad \|p^*\| \leq \Delta.

Intuition

If the unconstrained Newton step is inside the ball and B is positive definite, \lambda = 0. Otherwise the constraint is active and \lambda shifts the Hessian until the constrained quadratic is well behaved.

Why It Matters

This is the exact mathematical bridge between Newton, regularized Newton, Levenberg-Marquardt, and trust region methods.
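
The conditions also suggest a direct numerical procedure for the boundary case: raise \lambda until the shifted Newton step has length exactly \Delta. A minimal bisection sketch (our illustration; it assumes the constraint is active and ignores the degenerate "hard case"):

```python
import numpy as np

def boundary_step_via_lambda(g, B, delta, tol=1e-10, max_iter=200):
    """Solve (B + lam*I) p = -g with ||p|| = delta by bisection on lam."""
    n = len(g)
    lam_lo = max(0.0, -np.linalg.eigvalsh(B)[0])   # need B + lam*I positive definite

    def step(lam):
        return np.linalg.solve(B + lam * np.eye(n), -g)

    lam_hi = lam_lo + 1.0
    while np.linalg.norm(step(lam_hi)) > delta:    # grow the bracket until ||p|| < delta
        lam_hi *= 2.0
    for _ in range(max_iter):
        lam = 0.5 * (lam_lo + lam_hi)
        p = step(lam)
        if abs(np.linalg.norm(p) - delta) < tol:
            break
        if np.linalg.norm(p) > delta:
            lam_lo = lam                           # step too long: increase lam
        else:
            lam_hi = lam                           # step too short: decrease lam
    return p, lam
```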

Exact subproblem solves are often unnecessary. Common approximations:

Solver                 | Best for                                       | Idea
Cauchy point           | cheap guaranteed progress                      | steepest descent clipped to the region
Dogleg                 | positive definite Hessian, small problems      | path from the gradient step to the Newton step
Steihaug-CG            | large-scale problems with Hessian-vector products | conjugate gradient until the boundary or negative curvature
Exact secular equation | small/medium high-accuracy solves              | solve for the multiplier \lambda
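
As one concrete example, a dogleg step for positive definite B might look like the following sketch (dense linear algebra for clarity; the names are illustrative):

```python
import numpy as np

def dogleg_step(g, B, delta):
    """Dogleg path: steepest-descent minimizer, then toward the Newton step."""
    p_newton = -np.linalg.solve(B, g)               # full Newton step (B assumed SPD)
    if np.linalg.norm(p_newton) <= delta:
        return p_newton                             # Newton step already inside the ball
    p_u = -(g @ g) / (g @ (B @ g)) * g              # unconstrained minimizer along -g
    if np.linalg.norm(p_u) >= delta:
        return -(delta / np.linalg.norm(g)) * g     # clip the gradient step to the boundary
    # Otherwise walk from p_u toward the Newton step and stop at the boundary.
    d = p_newton - p_u
    a, b, c = d @ d, 2 * (p_u @ d), p_u @ p_u - delta**2
    tau = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
    return p_u + tau * d
```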

Cauchy Decrease

Lemma

Cauchy Point Sufficient Decrease

Statement

The Cauchy point p_k^C satisfies

m_k(0) - m_k(p_k^C) \geq (1/2) \|g_k\| \min\left(\Delta_k, \frac{\|g_k\|}{\|B_k\|}\right)

up to standard constant conventions.

Intuition

Even the cheapest trust region step must make meaningful predicted progress: either move to the boundary in the negative-gradient direction or take the best one-dimensional step along that direction.

Why It Matters

Most global convergence proofs only require that the approximate subproblem solver achieves at least a fixed fraction of Cauchy decrease. You do not need to solve the subproblem exactly to get robust theory.
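
A quick numeric sanity check of the bound on an arbitrary small indefinite example (our own illustration; the spectral norm stands in for \|B_k\|):

```python
import numpy as np

# Illustrative values, not from the text.
g = np.array([3.0, -1.0])
B = np.array([[2.0, 0.0], [0.0, -4.0]])   # one negative eigenvalue
delta = 0.5

gnorm = np.linalg.norm(g)
gBg = g @ (B @ g)
tau = 1.0 if gBg <= 0 else min(gnorm**3 / (delta * gBg), 1.0)
p_c = -tau * (delta / gnorm) * g          # Cauchy point

decrease = -(g @ p_c + 0.5 * p_c @ (B @ p_c))        # m(0) - m(p_c)
bound = 0.5 * gnorm * min(delta, gnorm / np.linalg.norm(B, 2))
assert decrease >= bound - 1e-12          # Cauchy-decrease floor holds
print(decrease, bound)
```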

Global Convergence

Theorem

Global First-Order Convergence

Statement

Under standard assumptions (f bounded below and smooth with a Lipschitz-continuous gradient, \|B_k\| uniformly bounded, and each step achieving a fixed fraction of the Cauchy decrease), a trust region method satisfies

\liminf_{k \to \infty} \|\nabla f(x_k)\| = 0.

Under stronger assumptions and standard acceptance rules, every accumulation point is first-order stationary.

Intuition

If the gradient stayed large forever, accepted steps would have to decrease f by a nontrivial amount each time. That cannot continue indefinitely because f is bounded below. If steps are rejected repeatedly, the radius shrinks until the Taylor model becomes accurate enough to accept a step.

Proof Sketch

Combine Cauchy decrease with model accuracy for small \Delta_k. Rejected steps force \Delta_k smaller; small radii make \rho_k approach 1 under smoothness; then steps must be accepted. Summing the accepted decreases contradicts a gradient bounded away from zero.

Failure Mode

First-order stationarity is not the same as a good minimum. A method can still approach saddle-like stationary points unless the subproblem solver and acceptance logic exploit negative curvature.

Negative Curvature

If B_k has a negative eigenvalue, the quadratic model decreases along the corresponding eigenvector. A line search Newton method may need special safeguards to avoid unstable or ascent directions. Trust region methods naturally cap the step while still allowing movement along negative curvature to the boundary.

This is one reason trust region methods are important in non-convex optimization: the ball prevents catastrophic step lengths, while the subproblem can still use curvature information that gradient descent ignores.
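
As a hedged illustration of that point: when B_k is indefinite, a solver can step to the boundary along the most negative eigenvector. The dense eigendecomposition below is for exposition only; large-scale methods extract this information from Lanczos or CG iterations instead:

```python
import numpy as np

def negative_curvature_step(g, B, delta):
    """If B is indefinite, move to the trust region boundary along the most
    negative eigenvector, oriented so the linear term does not increase."""
    eigvals, eigvecs = np.linalg.eigh(B)
    if eigvals[0] >= 0:
        return None                       # no negative curvature to exploit
    v = eigvecs[:, 0]                     # eigenvector of the most negative eigenvalue
    if g @ v > 0:
        v = -v                            # flip so that g^T v <= 0
    return delta * v                      # step all the way to the boundary
```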

Trust Region Vs Line Search

Question                  | Line search                              | Trust region
What is chosen first?     | direction                                | local region and model
What is adjusted?         | step length \alpha                       | radius \Delta
How is quality measured?  | sufficient decrease / Wolfe conditions   | actual versus predicted decrease
Negative curvature        | needs safeguards                         | handled inside the subproblem
Cost per iteration        | cheaper direction plus function checks   | subproblem solve
Strong use case           | scalable first/second-order hybrids      | robust Newton-like methods

RL Connection: TRPO And PPO

Trust Region Policy Optimization uses a KL constraint instead of a Euclidean radius:

E\left[D_{\mathrm{KL}}(\pi_{\theta_{\mathrm{old}}} \,\|\, \pi_\theta)\right] \leq \delta.

The idea is the same: do not let the new policy move outside the region where the local surrogate objective is trustworthy. PPO is best understood as a simpler clipped-surrogate approximation to this trust-region idea, not as the exact same constrained optimization method.
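
A minimal sketch of the clipped surrogate in NumPy (a simplified illustration of the idea, not the reference PPO implementation; the 0.2 clip range is a common but arbitrary choice):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate: discourage probability ratios far from 1, which
    plays the role of a soft trust region around the old policy."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))  # quantity to be maximized
```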

Common Confusions

Watch Out

The trust region is not a constraint on the final answer

The region constrains the current step p_k, not the optimization problem's final solution. The radius is a local reliability control.

Watch Out

Rejected steps are useful information

A rejected step says the model was not predictive at that radius. Shrinking the radius is not failure; it is the mechanism that restores model validity.

Watch Out

The Cauchy point is not the best step

The Cauchy point is a cheap lower bar for progress. Dogleg, truncated CG, and exact solvers can produce better steps, but they should at least clear the Cauchy-decrease floor.

Watch Out

Trust region does not make non-convex optimization globally solved

The method improves robustness and convergence to stationary points. It does not remove bad local minima, saddle geometry, noise, or modeling error.

Q&A For Mastery

Why not always use a line search? Line search methods choose a direction first. If the direction is bad because the Hessian is indefinite or poorly conditioned, searching along it can still be poor. Trust regions choose direction and length together.

Why does \rho_k matter more than raw loss decrease? Raw decrease tells you the step helped. \rho_k tells you whether the model is reliable enough to justify larger future steps.

What should I log? Log \Delta_k, \rho_k, the accepted/rejected status, \|\nabla f(x_k)\|, and whether the subproblem hit the boundary or encountered negative curvature.

What To Remember

  • Trust region methods optimize a model inside a radius.
  • \rho_k compares actual reduction to predicted reduction.
  • Cauchy decrease is the minimal progress certificate.
  • Dogleg and Steihaug-CG are practical subproblem solvers.
  • Trust regions are robust for Newton-like non-convex optimization, but they guarantee only stationarity, not global optimality.

Exercises

Exercise (Core)

Problem

Suppose g = [3, 4]^T, B = 0, and \Delta = 2. What are the Cauchy point and the predicted reduction?

Exercise (Advanced)

Problem

If the Newton step is inside the trust region and B is positive definite, what is the trust region step?

Exercise (Research)

Problem

Why is a KL trust region more meaningful than a Euclidean parameter trust region for policy optimization?

References

Canonical:

  • Nocedal and Wright, Numerical Optimization, 2nd ed., Chapter 4.
  • Conn, Gould, and Toint, Trust Region Methods (2000).
  • Moré and Sorensen, "Computing a Trust Region Step" (1983).

Further:

  • Steihaug, "The Conjugate Gradient Method and Trust Regions in Large Scale Optimization" (1983).
  • Schulman et al., "Trust Region Policy Optimization" (2015).
  • Curtis and Robinson, "Exploiting Negative Curvature in Deterministic and Stochastic Optimization" (2019).

Last reviewed: April 26, 2026
