Numerical Optimization
Proximal Gradient Methods
Optimize composite objectives by alternating gradient steps on smooth terms with proximal operators on nonsmooth terms. ISTA and its accelerated variant FISTA.
Why This Matters
Many ML objectives have the form "smooth loss plus nonsmooth regularizer." Lasso regression is the canonical example: squared error (smooth) plus an $\ell_1$ penalty (nonsmooth). You cannot apply standard gradient descent to the full objective because the $\ell_1$ norm is not differentiable at zero. Subgradient methods apply, but converge slowly ($O(1/\sqrt{k})$ for convex problems).
Proximal gradient methods solve this cleanly. They split the smooth and nonsmooth parts: a gradient step on the smooth part, then a proximal step on the nonsmooth part. ISTA recovers the $O(1/k)$ rate of gradient descent on smooth functions; FISTA reaches the optimal $O(1/k^2)$ rate via Nesterov acceleration. This splitting strategy is the algorithmic backbone of sparse optimization, constrained optimization, and structured regularization across ML.
Mental Model
Think of two operations in sequence. First, slide downhill along the smooth loss surface (a gradient step on $f$). This produces an intermediate point that ignores the nonsmooth regularizer entirely. Second, apply the proximal operator of $g$ to that intermediate point: find the nearest point (in a squared-distance sense) that also keeps $g$ small. For the $\ell_1$ norm, this second step is soft-thresholding, which pushes small coordinates exactly to zero and shrinks large ones toward zero. The two-step rhythm (gradient, then proximal) repeats until convergence.
When $g$ is the indicator function of a convex constraint set, the proximal step reduces to Euclidean projection, and the algorithm becomes projected gradient descent.
Formal Setup and Notation
We want to solve the composite minimization problem:

$$\min_{x \in \mathbb{R}^n} \; F(x) = f(x) + g(x),$$

where $f$ is convex with $L$-Lipschitz continuous gradient, meaning $\lVert \nabla f(x) - \nabla f(y)\rVert_2 \le L\lVert x - y\rVert_2$ for all $x, y$. The function $g$ is convex, lower semicontinuous, and proper (its domain is nonempty), but possibly nonsmooth.
The Lipschitz constant $L$ of $\nabla f$ controls the step size: we set $t = 1/L$ to ensure the gradient step does not overshoot. For quadratic losses like $f(x) = \tfrac{1}{2}\lVert Ax - b\rVert_2^2$, the constant is $L = \lambda_{\max}(A^\top A)$, the largest eigenvalue of $A^\top A$.
The composite structure appears throughout ML:
| Problem | $f(x)$ | $g(x)$ |
|---|---|---|
| Lasso | $\tfrac{1}{2}\lVert Ax - b\rVert_2^2$ | $\lambda \lVert x\rVert_1$ |
| Ridge | $\tfrac{1}{2}\lVert Ax - b\rVert_2^2$ | $\tfrac{\lambda}{2}\lVert x\rVert_2^2$ |
| Elastic net | $\tfrac{1}{2}\lVert Ax - b\rVert_2^2$ | $\lambda_1 \lVert x\rVert_1 + \tfrac{\lambda_2}{2}\lVert x\rVert_2^2$ |
| Constrained minimization | any smooth loss | $\iota_C(x)$ (indicator of set $C$) |
| Group lasso | $\tfrac{1}{2}\lVert Ax - b\rVert_2^2$ | $\lambda \sum_g \lVert x_g\rVert_2$ |
| Nuclear norm regularization | smooth loss on matrix $X$ | $\lambda \lVert X\rVert_*$ |
Proximal Operator
The proximal operator of a convex function $g$ at a point $v$ is:

$$\operatorname{prox}_g(v) = \arg\min_x \left( g(x) + \tfrac{1}{2}\lVert x - v\rVert_2^2 \right).$$

It finds the point that balances being close to $v$ (the quadratic term) with making $g$ small. The proximal operator always exists and is unique when $g$ is convex, lower semicontinuous, and proper.
When $g$ has a scaling parameter $\lambda > 0$, write $\operatorname{prox}_{\lambda g}(v) = \arg\min_x \left( \lambda g(x) + \tfrac{1}{2}\lVert x - v\rVert_2^2 \right)$. The parameter $\lambda$ controls the trade-off: large $\lambda$ penalizes $g$ more heavily; small $\lambda$ keeps the output closer to $v$.
Key properties of the proximal operator
The proximal operator has several properties that make it well-suited to iterative algorithms:
- Firm nonexpansiveness. For any $u, v$: $\lVert\operatorname{prox}_g(u) - \operatorname{prox}_g(v)\rVert_2^2 \le (u - v)^\top\left(\operatorname{prox}_g(u) - \operatorname{prox}_g(v)\right)$. This is stronger than nonexpansiveness ($\lVert\operatorname{prox}_g(u) - \operatorname{prox}_g(v)\rVert_2 \le \lVert u - v\rVert_2$) and guarantees that iterating the proximal operator does not amplify errors.
- Fixed-point characterization. $x^\star$ minimizes $g$ if and only if $x^\star = \operatorname{prox}_{\lambda g}(x^\star)$ for any $\lambda > 0$. So minimizers of $g$ are exactly the fixed points of its proximal operator.
- Moreau decomposition. $v = \operatorname{prox}_g(v) + \operatorname{prox}_{g^*}(v)$, where $g^*$ is the convex conjugate (Fenchel conjugate) of $g$. This duality between the proximal operators of $g$ and $g^*$ underlies many algorithmic decompositions.
- Composition with affine maps. If $h(x) = g(Qx)$ and $Q$ is orthogonal, then $\operatorname{prox}_h(v) = Q^\top \operatorname{prox}_g(Qv)$. For a general matrix $A$ in $h(x) = g(Ax)$, no simple composition rule exists; this is why ADMM introduces auxiliary variables to handle affine compositions.
Moreau Envelope
The Moreau envelope of $g$ with parameter $\mu > 0$ is:

$$M_{\mu g}(v) = \min_x \left( g(x) + \tfrac{1}{2\mu}\lVert x - v\rVert_2^2 \right).$$

It is a smooth approximation of $g$. Even when $g$ is nonsmooth, the Moreau envelope is differentiable, with gradient $\nabla M_{\mu g}(v) = \tfrac{1}{\mu}\left(v - \operatorname{prox}_{\mu g}(v)\right)$. As $\mu \to 0$, the Moreau envelope converges pointwise to $g$.
The Moreau envelope preserves minimizers: $\arg\min_v M_{\mu g}(v) = \arg\min_x g(x)$ for any $\mu > 0$. It also preserves the minimum value. This means we can study the smooth function $M_{\mu g}$ instead of the nonsmooth $g$ without changing the optimization problem's solution set.
The gradient $\nabla M_{\mu g}$ is $(1/\mu)$-Lipschitz continuous. This makes $M_{\mu g}$ a function with bounded curvature, even when $g$ itself is piecewise linear (like the $\ell_1$ norm). The Moreau envelope therefore provides a principled way to "smooth" nonsmooth objectives for analysis.
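To make this concrete, here is a small numerical sketch (assuming NumPy; the function names are illustrative). It evaluates the Moreau envelope of $g(x) = |x|$ through its proximal operator (the soft-thresholding formula from the next section) and checks that it matches the Huber function, which is the known closed form of this envelope.

```python
import numpy as np

def prox_abs(v, mu):
    # Proximal operator of mu * |x|: soft-thresholding at level mu.
    return np.sign(v) * np.maximum(np.abs(v) - mu, 0.0)

def moreau_envelope_abs(v, mu):
    # M_{mu g}(v) = g(prox_{mu g}(v)) + (1 / (2 mu)) * (v - prox_{mu g}(v))^2
    p = prox_abs(v, mu)
    return np.abs(p) + (v - p) ** 2 / (2 * mu)

def huber(v, mu):
    # Known closed form of the Moreau envelope of |x|.
    return np.where(np.abs(v) <= mu, v ** 2 / (2 * mu), np.abs(v) - mu / 2)

v = np.linspace(-3, 3, 101)
mu = 0.5
assert np.allclose(moreau_envelope_abs(v, mu), huber(v, mu))
```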
Closed-Form Proximal Operators
The practical usefulness of proximal methods depends on whether $\operatorname{prox}_g$ has a closed form. For many regularizers used in ML, it does.
Soft-thresholding ($\ell_1$ norm). For $g(x) = \lambda\lVert x\rVert_1$:

$$\left[\operatorname{prox}_{\lambda\lVert\cdot\rVert_1}(v)\right]_i = \operatorname{sign}(v_i)\,\max(|v_i| - \lambda,\, 0).$$

This shrinks each coordinate toward zero and sets coordinates with $|v_i| \le \lambda$ exactly to zero. This proximal operator is the mechanism behind sparsity in lasso and sparse recovery.
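A minimal NumPy sketch of the soft-thresholding operator (the function name is an illustrative choice, not a standard API):

```python
import numpy as np

def soft_threshold(v, thresh):
    """Proximal operator of thresh * ||.||_1, applied coordinate-wise."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

# Coordinates with |v_i| <= thresh are set exactly to zero.
print(soft_threshold(np.array([3.0, -0.2, 0.5, -1.5]), 1.0))  # -> [ 2. -0.  0. -0.5]
```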
Indicator function. For $g(x) = \iota_C(x)$ (zero if $x \in C$, $+\infty$ otherwise), the proximal operator is the Euclidean projection onto $C$: $\operatorname{prox}_g(v) = \Pi_C(v) = \arg\min_{x \in C}\lVert x - v\rVert_2$.
$\ell_2$ norm (group soft-thresholding). For $g(x) = \lambda\lVert x\rVert_2$ (not squared):

$$\operatorname{prox}_{\lambda\lVert\cdot\rVert_2}(v) = \left(1 - \frac{\lambda}{\lVert v\rVert_2}\right)_{+} v,$$

where $(\alpha)_+ = \max(\alpha, 0)$. This shrinks the entire vector toward zero and sets $\operatorname{prox}_{\lambda\lVert\cdot\rVert_2}(v) = 0$ when $\lVert v\rVert_2 \le \lambda$. It is the building block of group lasso, where each group is shrunk as a unit.
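A corresponding sketch of block (group) soft-thresholding, again assuming NumPy:

```python
import numpy as np

def block_soft_threshold(v, thresh):
    """Proximal operator of thresh * ||.||_2 (the norm itself, not squared)."""
    norm = np.linalg.norm(v)
    if norm <= thresh:
        return np.zeros_like(v)
    return (1.0 - thresh / norm) * v

# The whole vector is zeroed when its norm falls below the threshold.
print(block_soft_threshold(np.array([0.3, -0.4]), 1.0))  # [0. 0.]  (norm 0.5 <= 1)
print(block_soft_threshold(np.array([3.0, -4.0]), 1.0))  # shrinks the norm from 5 to 4
```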
Squared $\ell_2$ norm. For $g(x) = \tfrac{\lambda}{2}\lVert x\rVert_2^2$ (ridge penalty):

$$\operatorname{prox}_g(v) = \frac{1}{1 + \lambda}\, v.$$
This scales all coordinates uniformly. No coordinate is set to zero, which is why ridge shrinks but does not select variables.
Nuclear norm. For $g(X) = \lambda\lVert X\rVert_*$, the proximal operator applies soft-thresholding to the singular values: compute the SVD $X = U\Sigma V^\top$, threshold the diagonal entries of $\Sigma$ at level $\lambda$, and reconstruct. This costs one SVD per iteration.
Elastic net. For $g(x) = \lambda_1\lVert x\rVert_1 + \tfrac{\lambda_2}{2}\lVert x\rVert_2^2$, the proximal operator composes scaling and soft-thresholding: first soft-threshold with parameter $\lambda_1$, then scale by $\tfrac{1}{1 + \lambda_2}$.
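A sketch of this singular value thresholding step, assuming a dense SVD is affordable (NumPy; names illustrative):

```python
import numpy as np

def svt(X, thresh):
    """Proximal operator of thresh * ||.||_* : soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - thresh, 0.0)
    return (U * s_shrunk) @ Vt

X = np.random.default_rng(0).normal(size=(6, 4))
X_shrunk = svt(X, thresh=1.0)
print(np.linalg.matrix_rank(X_shrunk))  # typically smaller than rank(X)
```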
The ISTA Algorithm
The Iterative Shrinkage-Thresholding Algorithm (ISTA) repeats:

$$x_{k+1} = \operatorname{prox}_{t g}\!\left(x_k - t\,\nabla f(x_k)\right),$$

where $t = 1/L$ is the step size and $L$ is the Lipschitz constant of $\nabla f$.
Step by step: (1) compute the gradient $\nabla f(x_k)$; (2) take a gradient step on the smooth part, producing an intermediate point $y_k = x_k - t\,\nabla f(x_k)$; (3) apply the proximal operator of $t g$ to $y_k$, handling the nonsmooth part. The gradient step minimizes a local quadratic model of $f$ around $x_k$; the proximal step incorporates $g$ at minimal additional cost.
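A minimal sketch of the generic iteration, assuming the smooth gradient and the proximal operator are supplied as callables (`grad_f`, `prox_g`, and the interface below are illustrative choices, not a standard API):

```python
import numpy as np

def ista(grad_f, prox_g, x0, step, n_iters=500):
    """Proximal gradient (ISTA): x_{k+1} = prox_{t g}(x_k - t * grad_f(x_k)).

    grad_f(x)      returns the gradient of the smooth part f at x.
    prox_g(v, t)   returns prox_{t g}(v).
    step           is the step size t, typically 1 / L.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = prox_g(x - step * grad_f(x), step)
    return x
```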
Why the step size is 1/L
The step size $t = 1/L$ comes from the quadratic upper bound on smooth convex functions. Because $\nabla f$ is $L$-Lipschitz, we have the descent lemma:

$$f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\lVert y - x\rVert_2^2.$$

Minimizing the right side plus $g(y)$ over $y$ gives exactly the proximal gradient step with $t = 1/L$. The quadratic upper bound acts as a majorizer: ISTA minimizes this majorizer at each step, guaranteeing that $F(x_{k+1}) \le F(x_k)$ (monotone decrease).
Backtracking line search
When $L$ is unknown or expensive to compute (e.g., computing $\lambda_{\max}(A^\top A)$ for a large matrix), a backtracking line search finds a suitable step size. Start with some estimate $L_0 > 0$, then increase $L \leftarrow \eta L$ (with $\eta > 1$, typically $\eta = 2$) until the sufficient decrease condition holds:

$$f(x^+) \le f(x_k) + \nabla f(x_k)^\top (x^+ - x_k) + \frac{L}{2}\lVert x^+ - x_k\rVert_2^2,$$

where $x^+ = \operatorname{prox}_{g/L}\!\left(x_k - \tfrac{1}{L}\nabla f(x_k)\right)$ is the proximal step output. This avoids computing $L$ exactly while preserving the convergence guarantee.
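A sketch of ISTA with this backtracking rule, under the same illustrative callable interface; the smooth objective `f` must also be available to test the sufficient decrease condition:

```python
import numpy as np

def ista_backtracking(f, grad_f, prox_g, x0, L0=1.0, eta=2.0, n_iters=200):
    """ISTA with backtracking: grow the Lipschitz estimate L until the
    quadratic upper bound holds at the candidate proximal step."""
    x, L = np.asarray(x0, dtype=float), L0
    for _ in range(n_iters):
        g = grad_f(x)
        while True:
            x_plus = prox_g(x - g / L, 1.0 / L)
            diff = x_plus - x
            # Sufficient decrease: f(x+) <= f(x) + <grad f(x), diff> + (L/2)||diff||^2
            if f(x_plus) <= f(x) + g @ diff + 0.5 * L * diff @ diff:
                break
            L *= eta
        x = x_plus
    return x
```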
The FISTA Algorithm
Fast ISTA (FISTA), introduced by Beck and Teboulle (2009), adds Nesterov momentum to accelerate convergence from $O(1/k)$ to $O(1/k^2)$.
Initialize $y_1 = x_0$, $t_1 = 1$. Then repeat:

$$x_k = \operatorname{prox}_{g/L}\!\left(y_k - \tfrac{1}{L}\nabla f(y_k)\right), \qquad t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}, \qquad y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}\,(x_k - x_{k-1}).$$

The momentum coefficient $\tfrac{t_k - 1}{t_{k+1}}$ starts near zero and approaches $1$ as $k$ grows. At large $k$, $t_k \approx (k+1)/2$, so the coefficient is approximately $\tfrac{k-1}{k+2}$, similar to the Nesterov acceleration schedule for smooth gradient descent.
The extrapolation step overshoots past $x_k$ in the direction of recent movement $x_k - x_{k-1}$. This look-ahead is what distinguishes acceleration from plain gradient descent; the algorithm uses trajectory information, not just the current gradient.
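A sketch of FISTA under the same illustrative callable interface as the ISTA sketch above, assuming the step size $1/L$ is known:

```python
import numpy as np

def fista(grad_f, prox_g, x0, step, n_iters=500):
    """FISTA: proximal gradient step at the extrapolated point y_k,
    with the Beck-Teboulle momentum schedule."""
    x_prev = np.asarray(x0, dtype=float)
    y, t = x_prev.copy(), 1.0
    for _ in range(n_iters):
        x = prox_g(y - step * grad_f(y), step)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)  # extrapolation (momentum)
        x_prev, t = x, t_next
    return x_prev
```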
Monotone FISTA
Because FISTA's momentum can cause the objective to increase between iterations, a monotone variant (Beck and Teboulle, 2009; O'Donoghue and Candes, 2015) modifies the update: the new iterate $x_k$ is chosen as whichever of the proximal step output and the previous iterate $x_{k-1}$ has the smaller objective value. This sacrifices some theoretical elegance but avoids the oscillations that slow FISTA on ill-conditioned problems.
Restarting for strongly convex problems
When $f$ is strongly convex with parameter $\mu > 0$, the optimal rate for first-order methods improves from $O(1/k^2)$ to linear convergence $O\big((1 - \sqrt{\mu/L})^k\big)$. FISTA does not automatically exploit strong convexity, but a restarting scheme does: restart the momentum sequence ($t_k \leftarrow 1$) whenever the objective increases, or at fixed intervals on the order of $\sqrt{L/\mu}$ iterations. This achieves the optimal linear rate for strongly convex composite problems.
The condition number $\kappa = L/\mu$ controls the restart interval. Ill-conditioned problems ($\kappa \gg 1$) restart less frequently; well-conditioned problems restart often.
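A sketch of FISTA with objective-based (adaptive) restart, under the same illustrative interface; the composite objective `F` is evaluated once per iteration to detect an increase:

```python
import numpy as np

def fista_restart(F, grad_f, prox_g, x0, step, n_iters=500):
    """FISTA with adaptive restart: reset momentum (t <- 1) whenever the
    composite objective F = f + g increases between iterations."""
    x_prev = np.asarray(x0, dtype=float)
    y, t, F_prev = x_prev.copy(), 1.0, F(x_prev)
    for _ in range(n_iters):
        x = prox_g(y - step * grad_f(y), step)
        F_curr = F(x)
        if F_curr > F_prev:
            y, t = x.copy(), 1.0            # objective went up: kill the momentum
        else:
            t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
            y = x + ((t - 1.0) / t_next) * (x - x_prev)
            t = t_next
        x_prev, F_prev = x, F_curr
    return x_prev
```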
Main Theorems
ISTA Convergence Rate
Statement
Let $F = f + g$. ISTA with constant step size $t = 1/L$ satisfies:

$$F(x_k) - F(x^\star) \le \frac{L\lVert x_0 - x^\star\rVert_2^2}{2k},$$

where $x^\star$ is a minimizer of $F$.
Intuition
The convergence rate is $O(1/k)$: after $k$ iterations, suboptimality shrinks proportionally to $1/k$. This matches gradient descent on smooth convex functions. The proximal step handles nonsmoothness without degrading the rate. Compare to subgradient methods, which achieve only $O(1/\sqrt{k})$ on the same problem class.
Proof Sketch
The proof uses the descent lemma for smooth functions combined with the optimality condition for the proximal step. Since $\nabla f$ is $L$-Lipschitz:

$$f(x_{k+1}) \le f(x_k) + \nabla f(x_k)^\top (x_{k+1} - x_k) + \frac{L}{2}\lVert x_{k+1} - x_k\rVert_2^2.$$

The proximal optimality condition gives: $x_k - \tfrac{1}{L}\nabla f(x_k) - x_{k+1} \in \tfrac{1}{L}\,\partial g(x_{k+1})$. Rearranging, using convexity of $f$ and $g$, and combining with the descent inequality yields, for any $z$:

$$F(x_{k+1}) \le F(z) + \frac{L}{2}\left(\lVert z - x_k\rVert_2^2 - \lVert z - x_{k+1}\rVert_2^2\right).$$

Setting $z = x^\star$:

$$F(x_{k+1}) - F(x^\star) \le \frac{L}{2}\left(\lVert x^\star - x_k\rVert_2^2 - \lVert x^\star - x_{k+1}\rVert_2^2\right).$$

Telescope from $k = 0$ to $k - 1$ and use the fact that $F(x_k)$ is nonincreasing to obtain $F(x_k) - F(x^\star) \le \frac{L\lVert x_0 - x^\star\rVert_2^2}{2k}$.
Why It Matters
This rate shows that proximal gradient descent matches gradient descent even though part of the objective is nonsmooth. The proximal operator absorbs the nonsmoothness at no cost to the convergence rate; the price is only computational (evaluating $\operatorname{prox}_g$ per iteration).
Failure Mode
The $O(1/k)$ rate is tight for ISTA: there exist convex instances where the bound is achieved. The rate also depends on $L$; if $L$ is large (ill-conditioned smooth part), convergence is slow per iteration. If $L$ is unknown and overestimated, the step size becomes needlessly small. If underestimated, the algorithm may diverge. Backtracking resolves this at the cost of extra function and proximal evaluations.
FISTA Accelerated Convergence Rate
Statement
FISTA satisfies:

$$F(x_k) - F(x^\star) \le \frac{2L\lVert x_0 - x^\star\rVert_2^2}{(k+1)^2}.$$
Intuition
Acceleration improves the rate from $O(1/k)$ to $O(1/k^2)$. After 100 iterations, ISTA reduces the gap by a factor on the order of $100$; FISTA reduces it by a factor on the order of $10{,}000$. The momentum term lets the algorithm use trajectory information (not just the current gradient) to take longer effective steps.
Proof Sketch
Define a Lyapunov function:

$$E_k = \frac{2}{L}\,t_k^2\left(F(x_k) - F(x^\star)\right) + \lVert u_k - x^\star\rVert_2^2,$$

where $u_k = t_k x_k - (t_k - 1)x_{k-1}$ is an auxiliary sequence. The proof shows $E_{k+1} \le E_k$ by substituting the FISTA update rules and using the descent property of the proximal step. The key algebraic identity is $t_{k+1}^2 - t_{k+1} = t_k^2$, which follows from the specific recurrence $t_{k+1} = \tfrac{1 + \sqrt{1 + 4t_k^2}}{2}$. Since $E_k$ is nonincreasing and the distance term is nonnegative:

$$\frac{2}{L}\,t_k^2\left(F(x_k) - F(x^\star)\right) \le E_k \le E_1 \le \lVert x_0 - x^\star\rVert_2^2.$$

Since $t_k \ge (k+1)/2$, dividing both sides by $\tfrac{2 t_k^2}{L}$ yields the bound.
Why It Matters
$O(1/k^2)$ is the optimal rate for first-order methods on this problem class (Nesterov, 1983). No method using only gradient evaluations and proximal operators can do better in the worst case. FISTA achieves this optimality bound for composite objectives with the same per-iteration cost as ISTA.
Failure Mode
FISTA's iterates can oscillate: $F(x_{k+1})$ may exceed $F(x_k)$. This non-monotonicity makes FISTA harder to use as a subroutine (e.g., stopping criteria based on objective decrease may trigger prematurely). Monotone variants and adaptive restart address this. For strongly convex $f$, plain FISTA wastes the strong convexity; use restarting or a modified momentum sequence to get linear convergence.
Optimality and Lower Bounds
The $O(1/k^2)$ rate of FISTA is optimal among first-order methods for convex optimization with Lipschitz gradients. Nesterov (1983, 2004) proved a matching lower bound: for any first-order method that queries the gradient at $k$ points, there exists a convex function $f$ with $L$-Lipschitz gradient such that:

$$f(x_k) - f(x^\star) \ge \frac{3L\lVert x_0 - x^\star\rVert_2^2}{32(k+1)^2}.$$

So FISTA's upper bound matches the lower bound up to a constant factor. For ISTA, the $O(1/k)$ rate is also tight: there exist instances achieving it.
Under strong convexity ($f$ has parameter $\mu > 0$), the optimal rate improves to $O\big((1 - \sqrt{\mu/L})^k\big)$ (linear convergence). Plain ISTA achieves $O\big((1 - \mu/L)^k\big)$; FISTA with restart achieves the optimal $O\big((1 - \sqrt{\mu/L})^k\big)$. The gap between $\mu/L$ and $\sqrt{\mu/L}$ is significant when the condition number $\kappa = L/\mu$ is large: ISTA needs $O(\kappa\log(1/\epsilon))$ iterations for $\epsilon$-accuracy; FISTA with restart needs $O(\sqrt{\kappa}\log(1/\epsilon))$.
Canonical Examples
Lasso via ISTA
The Lasso problem is $\min_x \tfrac{1}{2}\lVert Ax - b\rVert_2^2 + \lambda\lVert x\rVert_1$. Here $f(x) = \tfrac{1}{2}\lVert Ax - b\rVert_2^2$ with $\nabla f(x) = A^\top(Ax - b)$ and $L = \lambda_{\max}(A^\top A) = \sigma_{\max}(A)^2$, the squared largest singular value of $A$. The proximal step is coordinate-wise soft-thresholding. Each ISTA iteration: compute $y_k = x_k - \tfrac{1}{L}A^\top(Ax_k - b)$, then $x_{k+1} = \operatorname{sign}(y_k)\max(|y_k| - \lambda/L,\, 0)$ coordinate-wise.
For a matrix $A \in \mathbb{R}^{m \times n}$, each iteration costs $O(mn)$ for the matrix-vector products $Ax_k$ and $A^\top(Ax_k - b)$, plus $O(n)$ for the soft-thresholding. The total per-iteration cost is dominated by the two matrix-vector multiplies.
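A self-contained sketch on a small synthetic lasso instance (random data; all names and problem sizes are illustrative). It follows the iteration above: a gradient step through $A^\top(Ax - b)$, then soft-thresholding at level $\lambda/L$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 50, 200, 0.1
A = rng.normal(size=(m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, size=10, replace=False)] = rng.normal(size=10)
b = A @ x_true + 0.01 * rng.normal(size=m)

L = np.linalg.norm(A, 2) ** 2               # squared largest singular value
x = np.zeros(n)
for _ in range(1000):
    y = x - (A.T @ (A @ x - b)) / L          # gradient step on the smooth part
    x = np.sign(y) * np.maximum(np.abs(y) - lam / L, 0.0)   # soft-threshold

objective = 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
print("nonzeros:", np.count_nonzero(x), "objective:", objective)
```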
Projected gradient descent as a special case
When $g$ is the indicator of a convex set $C$, the proximal operator is projection onto $C$. So ISTA becomes projected gradient descent: $x_{k+1} = \Pi_C\big(x_k - \tfrac{1}{L}\nabla f(x_k)\big)$. Constrained smooth optimization is a special case of proximal gradient methods. Common constraint sets with cheap projections include the probability simplex (a sorting-based projection), the $\ell_2$ ball (rescaling onto the ball), and box constraints (coordinate-wise clipping), as in the sketch below.
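A sketch of projected gradient descent for box-constrained least squares, where the proximal step is just coordinate-wise clipping (synthetic data; names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 10))
b = rng.normal(size=30)
lo, hi = 0.0, 1.0                       # box constraints: 0 <= x_i <= 1

L = np.linalg.norm(A, 2) ** 2
x = np.zeros(10)
for _ in range(500):
    y = x - (A.T @ (A @ x - b)) / L     # gradient step on the smooth loss
    x = np.clip(y, lo, hi)              # prox of the box indicator = projection
print(x)
```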
Elastic net
The elastic net combines $\ell_1$ and $\ell_2$ penalties: $g(x) = \lambda_1\lVert x\rVert_1 + \tfrac{\lambda_2}{2}\lVert x\rVert_2^2$. Its proximal operator is:

$$\left[\operatorname{prox}_g(v)\right]_i = \frac{\operatorname{sign}(v_i)\max(|v_i| - \lambda_1,\, 0)}{1 + \lambda_2}.$$

First soft-threshold with parameter $\lambda_1$, then scale by $\tfrac{1}{1+\lambda_2}$. This produces solutions that are sparse (from the $\ell_1$ part) and have controlled $\ell_2$ norm (from the $\ell_2$ part). The elastic net proximal operator is separable across coordinates, so it costs $O(n)$.
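A one-function sketch of this proximal operator, under the $\lambda_1\lVert x\rVert_1 + \tfrac{\lambda_2}{2}\lVert x\rVert_2^2$ parameterization used above (NumPy; names illustrative):

```python
import numpy as np

def prox_elastic_net(v, lam1, lam2):
    """Prox of lam1 * ||x||_1 + (lam2 / 2) * ||x||_2^2:
    soft-threshold at lam1, then scale by 1 / (1 + lam2)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam1, 0.0) / (1.0 + lam2)

print(prox_elastic_net(np.array([2.0, 0.3, -1.2]), lam1=0.5, lam2=1.0))
# -> [ 0.75  0.   -0.35]
```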
Low-rank matrix completion
For matrix completion with nuclear norm regularization: $\min_X \tfrac{1}{2}\sum_{(i,j)\in\Omega}(X_{ij} - M_{ij})^2 + \lambda\lVert X\rVert_*$, where $\Omega$ is the set of observed entries. Here $f$ is a smooth quadratic loss on observed entries, and $g(X) = \lambda\lVert X\rVert_*$. The proximal operator requires a singular value decomposition: compute $X = U\Sigma V^\top$, then soft-threshold the singular values $\sigma_i \mapsto \max(\sigma_i - \lambda, 0)$. Each iteration costs $O(mn\min(m, n))$ for the SVD, which is expensive for large matrices. Randomized SVD approximations reduce this in practice.
Connection to Other Methods
Proximal gradient methods sit at a crossroads in optimization.
Gradient descent. When $g \equiv 0$, ISTA reduces to gradient descent with step size $1/L$, and FISTA reduces to Nesterov's accelerated gradient method.
Subgradient methods. When both $f$ and $g$ are nonsmooth, the composite splitting is unavailable, and one resorts to subgradient methods. These converge at $O(1/\sqrt{k})$, strictly slower than ISTA's $O(1/k)$. The proximal framework exploits the partial smoothness of $f$ to gain a factor of $\sqrt{k}$.
Coordinate descent. For separable $g$ (like the $\ell_1$ norm), coordinate descent updates one coordinate at a time. Each coordinate update is cheap ($O(m)$ for lasso) and the proximal step on one coordinate is trivial. Coordinate descent can outperform ISTA when $n$ is large and $A$ is dense, but ISTA parallelizes better.
ADMM. The alternating direction method of multipliers handles more general structured problems, including those where $g$ involves a linear operator (e.g., total variation $g(x) = \lambda\lVert Dx\rVert_1$, where $D$ is a difference matrix). ADMM introduces auxiliary variables to decouple the linear operator from the proximal step. It converges at $O(1/k)$ for convex problems.
Mirror descent. Mirror descent uses Bregman divergences instead of the Euclidean distance in the proximal step (Frank-Wolfe methods instead avoid projections and proximal steps altogether via linear minimization over the constraint set). This is useful when the geometry of the constraint set is non-Euclidean (e.g., the simplex with the KL divergence). Proximal gradient methods with Bregman divergences are called Bregman proximal gradient methods.
Second-order methods. Newton's method and quasi-Newton methods can handle the smooth part more aggressively by using curvature information, but incorporating a nonsmooth $g$ requires proximal-Newton or proximal-quasi-Newton variants that solve a scaled proximal subproblem at each iteration.
Common Confusions
Proximal operator is not projection
The proximal operator for a general $g$ is not a projection onto a set. It is a generalized projection that balances proximity to the input with minimizing $g$. Projection is the special case where $g$ is an indicator function. For the $\ell_1$ norm, the proximal operator is soft-thresholding, which shrinks coordinates rather than projecting them onto a set.
FISTA does not always beat ISTA in practice
FISTA has a better worst-case rate ($O(1/k^2)$ vs $O(1/k)$), but its oscillatory behavior can make it slower on some problems, particularly ill-conditioned or strongly convex ones where the momentum overshoots. Restarting FISTA (resetting momentum when the objective increases) often works better in practice. The theoretical guarantee is about worst-case complexity, not every instance.
Step size 1/L does not require computing L exactly
Many practitioners believe ISTA requires knowing $L$ exactly. In practice, backtracking line search estimates $L$ adaptively and sometimes finds a local Lipschitz constant much smaller than the global $L$. This can make each step larger and convergence faster than the worst-case bound predicts.
Proximal gradient is not the same as subgradient descent
Subgradient methods apply to general nonsmooth convex problems by replacing gradients with subgradients and using diminishing step sizes. Proximal gradient methods exploit the composite structure where $f$ is smooth. When this structure exists, the proximal approach is strictly better: $O(1/k)$ vs $O(1/\sqrt{k})$. Using subgradient descent on a composite problem wastes the smoothness of $f$.
Summary
- Proximal operator: $\operatorname{prox}_g(v) = \arg\min_x\big(g(x) + \tfrac{1}{2}\lVert x - v\rVert_2^2\big)$
- ISTA: gradient step on smooth $f$, then proximal step on nonsmooth $g$
- ISTA converges at $O(1/k)$; FISTA with momentum at $O(1/k^2)$
- $O(1/k^2)$ is optimal for first-order methods on this problem class
- For $\ell_1$ regularization, the proximal operator is soft-thresholding
- The Moreau envelope smooths a nonsmooth function while preserving minimizers
- Restarting FISTA with strong convexity gives linear convergence
- Key special cases: lasso, projected gradient descent, elastic net, nuclear norm minimization
Exercises
Problem
Compute $\operatorname{prox}_{\lambda\lVert\cdot\rVert_1}(v)$ for a specific vector $v$ and threshold $\lambda$. Which coordinates become zero?
Problem
Compute the proximal operator of $g(x) = \tfrac{\lambda}{2}\lVert x\rVert_2^2$ (the squared $\ell_2$ penalty used in ridge regression). Show that it never sets any coordinate to zero.
Problem
Show that the proximal operator of the indicator function of a closed convex set $C$ is the Euclidean projection $\Pi_C$.
Problem
Prove that the proximal operator is firmly nonexpansive: for any convex $g$ and any $u, v$, $\lVert\operatorname{prox}_g(u) - \operatorname{prox}_g(v)\rVert_2^2 \le (u - v)^\top\big(\operatorname{prox}_g(u) - \operatorname{prox}_g(v)\big)$.
Problem
Prove that the Moreau envelope $M_{\mu g}$ is differentiable even when $g$ is not, and show $\nabla M_{\mu g}(v) = \tfrac{1}{\mu}\big(v - \operatorname{prox}_{\mu g}(v)\big)$.
References
Canonical:
- Beck & Teboulle, "A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems," SIAM J. Imaging Sci. (2009), Sections 2-4
- Nesterov, Introductory Lectures on Convex Optimization (2004), Chapter 2 (smooth minimization) and Chapter 4 (nonsmooth problems)
- Rockafellar, Convex Analysis (1970), Chapter 31 (Moreau-Yosida regularization and proximal mappings)
Textbook treatments:
- Boyd & Vandenberghe, Convex Optimization (2004), Chapter 6 (proximal and projection operators in the context of decomposition)
- Parikh & Boyd, "Proximal Algorithms," Foundations and Trends in Optimization 1(3), 2014, Sections 1-6 (comprehensive survey with closed-form proximal operators for 30+ functions)
- Beck, First-Order Methods in Optimization (2017), Chapters 6 (proximal gradient) and 10 (accelerated methods)
Convergence analysis and lower bounds:
- Nesterov, "A method for solving a convex programming problem with convergence rate $O(1/k^2)$," Dokl. Akad. Nauk SSSR 269 (1983)
- Tseng, "On accelerated proximal gradient methods for convex-concave optimization," SIAM preprint (2008)
Next Topics
The natural next steps from proximal gradient methods:
- Coordinate descent: an alternative for separable penalties that updates one variable at a time
- Stochastic gradient descent: scaling proximal methods to large datasets via stochastic proximal gradient (prox-SGD)
- ADMM: handles composite objectives where $g$ involves a linear operator
- Mirror descent: replaces the Euclidean proximal step with Bregman divergences for non-Euclidean geometry
Last reviewed: April 13, 2026