
Numerical Optimization

Trust Region Methods

Trust region methods minimize a local model only inside a region where that model is trusted, then grow or shrink the region using actual versus predicted decrease. They are robust Newton-style methods for difficult non-convex landscapes.

Advanced · Tier 2 · Stable · Supporting · ~55 min

Why This Matters

Newton's method is fast when its quadratic model is accurate and dangerous when it is not. A bad Newton step can be huge, point uphill, or chase a saddle because the Hessian is indefinite.

Trust region methods fix this failure mode by asking a more honest question:

How far from the current point do I actually trust this local model?

Instead of choosing a direction and searching along it, a trust region method chooses the step by minimizing the model inside a ball. Then it compares predicted improvement to actual improvement and updates the radius.

Mental Model

A line search method says: "I picked a direction. How far should I go?"

A trust region method says: "I trust my model inside this radius. What is the best step inside it?"

If the model predicts reality well, the region expands. If the model lies, the region shrinks. This is why trust regions are robust for non-convex second-order optimization: the radius becomes a feedback signal about local model quality.

Formal Setup

Definition

Quadratic Model

At iterate x_k, define

m_k(p) = f(x_k) + g_k^T p + (1/2) p^T B_k p,

where g_k = \nabla f(x_k) and B_k is either the Hessian \nabla^2 f(x_k) or an approximation to it.
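
As a small illustration in NumPy (the helper name and calling convention here are ours, not from the text):

```python
import numpy as np

def quadratic_model(f_xk, g, B, p):
    """m_k(p) = f(x_k) + g^T p + (1/2) p^T B p for a trial step p."""
    p = np.asarray(p, dtype=float)
    return f_xk + g @ p + 0.5 * p @ (B @ p)
```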

Definition

Trust Region Subproblem

The trust region step solves

p_k \in \arg\min_p m_k(p) \quad \text{subject to} \quad \|p\| \leq \Delta_k.

The radius \Delta_k limits how far the algorithm trusts the model.

Definition

Actual-To-Predicted Reduction Ratio

After proposing p_k, compute

\rho_k = \frac{f(x_k) - f(x_k + p_k)}{m_k(0) - m_k(p_k)}.

If \rho_k \approx 1, the model predicted the objective well. If \rho_k is small or negative, the model was unreliable at that radius.
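
As a minimal sketch, the ratio is a one-liner; the guard against a non-positive predicted decrease is an implementation choice we add here, not part of the definition:

```python
def reduction_ratio(f_old, f_new, predicted_decrease):
    """rho_k = (f(x_k) - f(x_k + p_k)) / (m_k(0) - m_k(p_k))."""
    if predicted_decrease <= 0.0:
        # The subproblem solver should always predict a decrease; if it does
        # not, treat the step as untrustworthy so it will be rejected.
        return float("-inf")
    return (f_old - f_new) / predicted_decrease
```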

The Algorithm

  1. Build m_k(p) from f(x_k), g_k, and B_k.
  2. Approximately solve the trust region subproblem.
  3. Compute \rho_k.
  4. Accept the step if \rho_k is large enough.
  5. Shrink \Delta_k after poor prediction; grow it after accurate prediction on boundary-hitting steps.

A typical policy:

Condition                              | Step decision     | Radius decision
\rho_k < 0.1                           | reject            | shrink
0.1 \leq \rho_k \leq 0.75              | accept cautiously | keep similar
\rho_k > 0.75 and \|p_k\| = \Delta_k   | accept            | expand
\rho_k < 0                             | reject            | shrink aggressively

The exact constants are implementation choices. The logic is the important part.
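
A compact sketch of the whole loop in NumPy, with constants taken from the table above and the Cauchy point (discussed in the next sections) as the cheapest admissible subproblem solver; the names and defaults are illustrative, not a reference implementation:

```python
import numpy as np

def cauchy_point(g, B, delta):
    """Minimize the model along -g, clipped to the trust region boundary."""
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return np.zeros_like(g)
    gBg = g @ (B @ g)
    tau = 1.0 if gBg <= 0 else min(gnorm**3 / (delta * gBg), 1.0)
    return -tau * (delta / gnorm) * g

def trust_region_minimize(f, grad, hess, x0, delta0=1.0, delta_max=10.0,
                          tol=1e-6, max_iter=200, subproblem=cauchy_point):
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        p = subproblem(g, B, delta)                  # approximate subproblem solve
        predicted = -(g @ p + 0.5 * p @ (B @ p))     # m_k(0) - m_k(p)
        rho = (f(x) - f(x + p)) / predicted if predicted > 0 else -np.inf
        if rho < 0.1:
            delta *= 0.25                            # poor prediction: shrink
        elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
            delta = min(2.0 * delta, delta_max)      # accurate and boundary-hitting: grow
        if rho >= 0.1:
            x = x + p                                # accept; otherwise keep x
    return x
```

With the Cauchy point this behaves like a safeguarded steepest descent; passing a dogleg or Steihaug-CG routine as `subproblem` recovers Newton-like steps whenever the model is trusted.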

Solving The Subproblem

Theorem

Trust Region Subproblem Optimality Conditions

Statement

A global solution p^* to the quadratic trust region subproblem is characterized by the existence of a multiplier \lambda \geq 0 such that

(B + \lambda I) p^* = -g, \qquad B + \lambda I \succeq 0,

and

\lambda (\Delta - \|p^*\|) = 0, \qquad \|p^*\| \leq \Delta.

Intuition

If the unconstrained Newton step is inside the ball and B is positive definite, \lambda = 0. Otherwise the constraint is active and \lambda shifts the Hessian until the constrained quadratic is well behaved.

Why It Matters

This is the exact mathematical bridge between Newton, regularized Newton, Levenberg-Marquardt, and trust region methods.
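
The conditions also suggest a direct numerical procedure for the boundary case: raise \lambda until the shifted Newton step has length exactly \Delta. A minimal bisection sketch (our illustration; it assumes the constraint is active and ignores the degenerate "hard case"):

```python
import numpy as np

def boundary_step_via_lambda(g, B, delta, tol=1e-10, max_iter=200):
    """Solve (B + lam*I) p = -g with ||p|| = delta by bisection on lam."""
    n = len(g)
    lam_lo = max(0.0, -np.linalg.eigvalsh(B)[0])   # need B + lam*I positive definite

    def step(lam):
        return np.linalg.solve(B + lam * np.eye(n), -g)

    lam_hi = lam_lo + 1.0
    while np.linalg.norm(step(lam_hi)) > delta:    # grow the bracket until ||p|| < delta
        lam_hi *= 2.0
    for _ in range(max_iter):
        lam = 0.5 * (lam_lo + lam_hi)
        p = step(lam)
        if abs(np.linalg.norm(p) - delta) < tol:
            break
        if np.linalg.norm(p) > delta:
            lam_lo = lam                           # step too long: increase lam
        else:
            lam_hi = lam                           # step too short: decrease lam
    return p, lam
```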

Exact subproblem solves are often unnecessary. Common approximations:

Solver                 | Best for                                       | Idea
Cauchy point           | cheap guaranteed progress                      | steepest descent clipped to the region
Dogleg                 | positive definite Hessian, small problems      | path from the gradient step to the Newton step
Steihaug-CG            | large-scale problems with Hessian-vector products | conjugate gradient until the boundary or negative curvature
Exact secular equation | small/medium high-accuracy solves              | solve for the multiplier \lambda
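
As one concrete example, a dogleg step for positive definite B might look like the following sketch (dense linear algebra for clarity; the names are illustrative):

```python
import numpy as np

def dogleg_step(g, B, delta):
    """Dogleg path: steepest-descent minimizer, then toward the Newton step."""
    p_newton = -np.linalg.solve(B, g)               # full Newton step (B assumed SPD)
    if np.linalg.norm(p_newton) <= delta:
        return p_newton                             # Newton step already inside the ball
    p_u = -(g @ g) / (g @ (B @ g)) * g              # unconstrained minimizer along -g
    if np.linalg.norm(p_u) >= delta:
        return -(delta / np.linalg.norm(g)) * g     # clip the gradient step to the boundary
    # Otherwise walk from p_u toward the Newton step and stop at the boundary.
    d = p_newton - p_u
    a, b, c = d @ d, 2 * (p_u @ d), p_u @ p_u - delta**2
    tau = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
    return p_u + tau * d
```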

Cauchy Decrease

Lemma

Cauchy Point Sufficient Decrease

Statement

The Cauchy point p_k^C satisfies

m_k(0) - m_k(p_k^C) \geq (1/2) \|g_k\| \min\left(\Delta_k, \frac{\|g_k\|}{\|B_k\|}\right)

up to standard constant conventions.

Intuition

Even the cheapest trust region step must make meaningful predicted progress: either move to the boundary in the negative-gradient direction or take the best one-dimensional step along that direction.

Why It Matters

Most global convergence proofs only require that the approximate subproblem solver achieves at least a fixed fraction of Cauchy decrease. You do not need to solve the subproblem exactly to get robust theory.
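
A quick numeric sanity check of the bound on an arbitrary small indefinite example (our own illustration; the spectral norm stands in for \|B_k\|):

```python
import numpy as np

# Illustrative values, not from the text.
g = np.array([3.0, -1.0])
B = np.array([[2.0, 0.0], [0.0, -4.0]])   # one negative eigenvalue
delta = 0.5

gnorm = np.linalg.norm(g)
gBg = g @ (B @ g)
tau = 1.0 if gBg <= 0 else min(gnorm**3 / (delta * gBg), 1.0)
p_c = -tau * (delta / gnorm) * g          # Cauchy point

decrease = -(g @ p_c + 0.5 * p_c @ (B @ p_c))        # m(0) - m(p_c)
bound = 0.5 * gnorm * min(delta, gnorm / np.linalg.norm(B, 2))
assert decrease >= bound - 1e-12          # Cauchy-decrease floor holds
print(decrease, bound)
```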

Global Convergence

Theorem

Global First-Order Convergence

Statement

Under standard assumptions (f bounded below and smooth with a Lipschitz-continuous gradient, \|B_k\| uniformly bounded, and each step achieving a fixed fraction of the Cauchy decrease), a trust region method satisfies

\liminf_{k \to \infty} \|\nabla f(x_k)\| = 0.

Under stronger assumptions and standard acceptance rules, every accumulation point is first-order stationary.

Intuition

If the gradient stayed large forever, accepted steps would have to decrease f by a nontrivial amount each time. That cannot continue indefinitely because f is bounded below. If steps are rejected repeatedly, the radius shrinks until the Taylor model becomes accurate enough to accept a step.

Proof Sketch

Combine Cauchy decrease with model accuracy for small \Delta_k. Rejected steps force \Delta_k smaller; small radii make \rho_k approach 1 under smoothness; then steps must be accepted. Summing the accepted decreases contradicts a gradient bounded away from zero.

Failure Mode

First-order stationarity is not the same as a good minimum. A method can still approach saddle-like stationary points unless the subproblem solver and acceptance logic exploit negative curvature.

Negative Curvature

If B_k has a negative eigenvalue, the quadratic model decreases along the corresponding eigenvector. A line search Newton method may need special safeguards to avoid unstable or ascent directions. Trust region methods naturally cap the step while still allowing movement along negative curvature to the boundary.

This is one reason trust region methods are important in non-convex optimization: the ball prevents catastrophic step lengths, while the subproblem can still use curvature information that gradient descent ignores.
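
As a hedged illustration of that point: when B_k is indefinite, a solver can step to the boundary along the most negative eigenvector. The dense eigendecomposition below is for exposition only; large-scale methods extract this information from Lanczos or CG iterations instead:

```python
import numpy as np

def negative_curvature_step(g, B, delta):
    """If B is indefinite, move to the trust region boundary along the most
    negative eigenvector, oriented so the linear term does not increase."""
    eigvals, eigvecs = np.linalg.eigh(B)
    if eigvals[0] >= 0:
        return None                       # no negative curvature to exploit
    v = eigvecs[:, 0]                     # eigenvector of the most negative eigenvalue
    if g @ v > 0:
        v = -v                            # flip so that g^T v <= 0
    return delta * v                      # step all the way to the boundary
```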

Trust Region Vs Line Search

Question                  | Line search                              | Trust region
What is chosen first?     | direction                                | local region and model
What is adjusted?         | step length \alpha                       | radius \Delta
How is quality measured?  | sufficient decrease / Wolfe conditions   | actual versus predicted decrease
Negative curvature        | needs safeguards                         | handled inside the subproblem
Cost per iteration        | cheaper direction plus function checks   | subproblem solve
Strong use case           | scalable first/second-order hybrids      | robust Newton-like methods

RL Connection: TRPO And PPO

Trust Region Policy Optimization uses a KL constraint instead of a Euclidean radius:

E\left[D_{\mathrm{KL}}(\pi_{\theta_{\mathrm{old}}} \,\|\, \pi_\theta)\right] \leq \delta.

The idea is the same: do not let the new policy move outside the region where the local surrogate objective is trustworthy. PPO is best understood as a simpler clipped-surrogate approximation to this trust-region idea, not as the exact same constrained optimization method.
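
A minimal sketch of the clipped surrogate in NumPy (a simplified illustration of the idea, not the reference PPO implementation; the 0.2 clip range is a common but arbitrary choice):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate: discourage probability ratios far from 1, which
    plays the role of a soft trust region around the old policy."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))  # quantity to be maximized
```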

Common Confusions

Watch Out

The trust region is not a constraint on the final answer

The region constrains the current step p_k, not the optimization problem's final solution. The radius is a local reliability control.

Watch Out

Rejected steps are useful information

A rejected step says the model was not predictive at that radius. Shrinking the radius is not failure; it is the mechanism that restores model validity.

Watch Out

The Cauchy point is not the best step

The Cauchy point is a cheap lower bar for progress. Dogleg, truncated CG, and exact solvers can produce better steps, but they should at least clear the Cauchy-decrease floor.

Watch Out

Trust region does not make non-convex optimization globally solved

The method improves robustness and convergence to stationary points. It does not remove bad local minima, saddle geometry, noise, or modeling error.

Q&A For Mastery

Why not always use a line search? Line search methods choose a direction first. If the direction is bad because the Hessian is indefinite or poorly conditioned, searching along it can still be poor. Trust regions choose direction and length together.

Why does \rho_k matter more than raw loss decrease? Raw decrease tells you the step helped. \rho_k tells you whether the model is reliable enough to justify larger future steps.

What should I log? Log \Delta_k, \rho_k, the accepted/rejected status, \|\nabla f(x_k)\|, and whether the subproblem hit the boundary or encountered negative curvature.

What To Remember

  • Trust region methods optimize a model inside a radius.
  • \rho_k compares actual reduction to predicted reduction.
  • Cauchy decrease is the minimal progress certificate.
  • Dogleg and Steihaug-CG are practical subproblem solvers.
  • Trust regions are robust for Newton-like non-convex optimization, but they guarantee only stationarity, not global optimality.

Exercises

Exercise (Core)

Problem

Suppose g = [3, 4]^T, B = 0, and \Delta = 2. What are the Cauchy point and the predicted reduction?

Exercise (Advanced)

Problem

If the Newton step is inside the trust region and B is positive definite, what is the trust region step?

Exercise (Research)

Problem

Why is a KL trust region more meaningful than a Euclidean parameter trust region for policy optimization?

References

Canonical:

  • Nocedal and Wright, Numerical Optimization, 2nd ed., Chapter 4.
  • Conn, Gould, and Toint, Trust Region Methods (2000).
  • Moré and Sorensen, "Computing a Trust Region Step" (1983).

Further:

  • Steihaug, "The Conjugate Gradient Method and Trust Regions in Large Scale Optimization" (1983).
  • Schulman et al., "Trust Region Policy Optimization" (2015).
  • Curtis and Robinson, "Exploiting Negative Curvature in Deterministic and Stochastic Optimization" (2019).

Last reviewed: April 26, 2026
