Numerical Optimization
Coordinate Descent
Optimize by updating one coordinate (or block) at a time while holding others fixed. The default solver for Lasso because each coordinate update has a closed-form solution.
Why This Matters
Coordinate descent is the workhorse behind the most widely used Lasso solvers (glmnet, scikit-learn's Lasso). When the nonsmooth penalty is separable across coordinates, each coordinate update has a closed-form solution. This makes coordinate descent extremely fast for high-dimensional sparse problems, often faster than proximal gradient methods.
Mental Model
Instead of moving in all directions at once (full gradient step), pick one coordinate axis and slide along it to the optimal point. Then pick another coordinate and repeat. Each step is cheap (a one-dimensional optimization) and for many problems it has an exact solution.
Think of it like tuning a guitar: you adjust one string at a time, cycling through all strings, and eventually the whole instrument is in tune.
Formal Setup and Notation
We want to minimize:

$$F(x) = f(x) + g(x), \qquad g(x) = \sum_{i=1}^{n} g_i(x_i),$$

where $f$ is convex and smooth, and each $g_i$ is convex and acts on a single coordinate $x_i$. The separability of $g$ is what makes coordinate descent efficient.
Coordinate Update Rule
At iteration $k$, pick coordinate $i_k$ and update:

$$x_{i_k}^{k+1} \in \operatorname*{arg\,min}_{z}\; f\big(x_1^k, \dots, x_{i_k-1}^k,\; z,\; x_{i_k+1}^k, \dots, x_n^k\big) + g_{i_k}(z).$$

All other coordinates stay fixed. This is a one-dimensional optimization problem in the single variable $z$.
Cyclic Coordinate Descent
In cyclic (or Gauss-Seidel) coordinate descent, you cycle through coordinates in a fixed order $1, 2, \dots, n, 1, 2, \dots$. One full pass through all coordinates is called an epoch.
Randomized Coordinate Descent
In randomized coordinate descent, at each step you pick coordinate $i_k$ uniformly at random from $\{1, \dots, n\}$. This version has cleaner convergence theory because each step is an unbiased estimator of progress.
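Both selection rules can be sketched on a small quadratic, where the one-dimensional subproblem has a closed-form minimizer (a minimal illustration; the function and data below are made up for this example):

```python
import numpy as np

def cd_epochs(A, b, n_epochs, rule="cyclic", seed=0):
    """Exact coordinate minimization of f(x) = 0.5 x^T A x - b^T x.
    Along axis i the 1-D minimizer is x_i = (b_i - sum_{j!=i} A_ij x_j) / A_ii."""
    n = len(b)
    x = np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        # cyclic: fixed order 0..n-1; randomized: n uniform draws per epoch
        idx = range(n) if rule == "cyclic" else rng.integers(0, n, size=n)
        for i in idx:
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite
b = np.array([1.0, 1.0])
x_cyc = cd_epochs(A, b, n_epochs=50, rule="cyclic")
x_rnd = cd_epochs(A, b, n_epochs=50, rule="random")
# both approach the exact minimizer A^{-1} b = [0.2, 0.4]
```

Note that repeating the same coordinate twice in a row is a no-op for the randomized rule, which is one informal reason cyclic sweeps can make more uniform progress per epoch.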
Core Definitions
For the Lasso problem $\min_\beta \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$, the coordinate update for coordinate $j$ is:

$$\beta_j \leftarrow \frac{S_\lambda\big(x_j^\top r_{-j}\big)}{\|x_j\|_2^2}, \qquad r_{-j} = y - X_{-j}\beta_{-j},$$

where $x_j$ is the $j$-th column of $X$, $X_{-j}$ and $\beta_{-j}$ denote all other columns and coordinates, and

$$S_\lambda(z) = \operatorname{sign}(z)\,\max(|z| - \lambda,\, 0)$$

is the soft-thresholding operator. The threshold $\lambda$ is applied to $x_j^\top r_{-j}$ before the column-norm normalization; equivalently, one can write $\beta_j \leftarrow S_{\lambda/\|x_j\|_2^2}\big(x_j^\top r_{-j}/\|x_j\|_2^2\big)$. When the columns of $X$ are normalized so that $\|x_j\|_2^2 = 1$ (the default preprocessing in glmnet and scikit-learn), both forms collapse to $\beta_j \leftarrow S_\lambda(x_j^\top r_{-j})$.
This closed-form update is why coordinate descent is the default for Lasso. No step size tuning is needed.
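The update can be implemented with a cached residual, so each coordinate costs one column read. A minimal sketch (an illustration, not the glmnet or scikit-learn implementation):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_epochs=100):
    """Cyclic coordinate descent for (1/2)||y - X beta||^2 + lam ||beta||_1.
    Minimal sketch; assumes every column of X is nonzero."""
    m, n = X.shape
    beta = np.zeros(n)
    r = y - X @ beta                  # residual, kept up to date
    col_sq = (X ** 2).sum(axis=0)     # ||x_j||^2 per column
    for _ in range(n_epochs):
        for j in range(n):
            r = r + X[:, j] * beta[j]                       # partial residual r_{-j}
            beta[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
            r = r - X[:, j] * beta[j]                       # restore full residual
    return beta

# Orthonormal design: the update reduces to soft-thresholding y once
beta = lasso_cd(np.eye(3), np.array([3.0, -0.5, 1.0]), lam=1.0)
# → [2.0, 0.0, 0.0]
```

Each inner step is $O(m)$: two vector updates and one dot product against a single column.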
Block coordinate descent generalizes this: instead of updating a single coordinate, update a block of coordinates at a time. This is useful when the objective has natural block structure (e.g., group lasso).
Gauss-Seidel for solving $Ax = b$ is cyclic coordinate descent applied to $f(x) = \tfrac{1}{2}x^\top A x - b^\top x$ (when $A$ is positive definite). Each coordinate update is $x_i \leftarrow \big(b_i - \sum_{j \neq i} a_{ij} x_j\big)/a_{ii}$.
Main Theorems
Convergence of Cyclic Coordinate Descent
Statement
Let $F(x) = f(x) + \sum_i g_i(x_i)$ where $f$ is convex with coordinate-wise Lipschitz gradients $L_i$, i.e. $|\nabla_i f(x + t e_i) - \nabla_i f(x)| \le L_i |t|$ for all $x$ and $t$. Cyclic coordinate descent with step size $1/L_i$ for coordinate $i$ produces iterates satisfying:

$$F(x^k) - F^\star = O\!\left(\frac{n\, L_{\max}\, \|x^0 - x^\star\|_2^2}{k}\right)$$

(see Beck & Tetruashvili 2013 for the precise constant). For randomized coordinate descent with step size $1/L_{i_k}$ for coordinate $i_k$ (Nesterov 2012):

$$\mathbb{E}\big[F(x^k)\big] - F^\star = O\!\left(\frac{n\, L_{\max}\, \|x^0 - x^\star\|_2^2}{k}\right).$$
Intuition
Each coordinate update can only decrease the objective (or leave it unchanged). The separability of $g$ ensures no coupling between coordinates in the nonsmooth part. The smooth part $f$ provides enough curvature that progress on individual coordinates translates to global progress. The factor $n$ in the rate reflects that each epoch touches $n$ coordinates.
Proof Sketch
For the randomized case: at each step, the expected decrease from updating a random coordinate is $1/n$ of the decrease you would get from a full gradient step. Apply the standard descent lemma coordinate-wise and take expectations. Telescope over $k$ steps to get the $O(n/k)$ rate per iteration, which is $O(1/k)$ per epoch.
Why It Matters
The $O(1/k)$ per-epoch rate matches gradient descent, but each epoch of coordinate descent can be cheaper if the coordinate updates are fast. For the Lasso with $X \in \mathbb{R}^{m \times n}$, each coordinate update is $O(m)$ and an epoch is $O(mn)$, the same as one gradient step but with a much smaller constant because it avoids full matrix multiplications.
Failure Mode
Cyclic coordinate descent can fail to converge when the nonsmooth part is not separable across coordinates. A clean two-dimensional counterexample is $F(x_1, x_2) = \max(|x_1|, |x_2|)$ at the point $(1, 1)$: with $x_2 = 1$ fixed, the function is constant on $x_1 \in [-1, 1]$, so $x_1 = 1$ is a one-dimensional minimizer; symmetrically for $x_2$. Cyclic CD is stuck at $(1, 1)$ with $F = 1$, while the global minimum is $(0, 0)$ with $F = 0$. (Tseng 2001 shows that block separability of the nonsmooth term is what underwrites convergence; Powell 1973 gave an analogous smooth counterexample for the case where exact minimizers along each axis are non-unique.) For non-separable penalties, use proximal gradient or ADMM instead.
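The sticking behavior can be checked numerically. The sketch below runs exact (grid-based) coordinate minimization on the non-separable function $F(x_1, x_2) = \max(|x_1|, |x_2|)$ starting from $(1, 1)$, with ties broken by keeping the current point:

```python
import numpy as np

F = lambda x1, x2: max(abs(x1), abs(x2))   # nonsmooth and NOT separable

def exact_1d_min(f, current, grid):
    """Grid-based exact coordinate minimization; keeps the current
    point unless some grid point is strictly better."""
    best, best_val = current, f(current)
    for z in grid:
        if f(z) < best_val - 1e-12:
            best, best_val = z, f(z)
    return best

x1, x2 = 1.0, 1.0
grid = np.linspace(-2.0, 2.0, 401)
for _ in range(10):                        # 10 cyclic passes
    x1 = exact_1d_min(lambda z: F(z, x2), x1, grid)
    x2 = exact_1d_min(lambda z: F(x1, z), x2, grid)
# stuck: (x1, x2) stays at (1, 1) with F = 1, though F(0, 0) = 0
```

No axis-aligned move from $(1, 1)$ strictly improves $F$, so every cyclic pass is a no-op even though the point is not a minimizer.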
Canonical Examples
Lasso coordinate update
For $X \in \mathbb{R}^{m \times 1000}$, $y \in \mathbb{R}^m$, and a penalty level $\lambda$: precompute the Gram matrix $X^\top X$ and the vector $X^\top y$. Each coordinate update reads one row of $X^\top X$ and one entry of $X^\top y$. A full epoch scans all 1000 coordinates. In practice, active-set strategies skip coordinates already at zero, making each epoch even cheaper.
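This is essentially the "covariance update" organization described by Friedman, Hastie & Tibshirani (2010): maintain $G\beta$ incrementally so that each coordinate touches the precomputed quantities only. A sketch, with hypothetical problem sizes ($m = 200$ samples chosen here for illustration):

```python
import numpy as np

def lasso_cd_gram(G, c, lam, n_epochs=200):
    """Covariance-update coordinate descent sketch: uses only the
    precomputed G = X^T X and c = X^T y, never the raw data matrix."""
    n = G.shape[0]
    beta = np.zeros(n)
    Gb = np.zeros(n)                                  # running G @ beta
    for _ in range(n_epochs):
        for j in range(n):
            rho = c[j] - Gb[j] + G[j, j] * beta[j]    # equals x_j^T r_{-j}
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / G[j, j]
            if new != beta[j]:
                Gb += G[:, j] * (new - beta[j])       # O(n), only when j moves
                beta[j] = new
    return beta

# Synthetic sparse problem (all numbers hypothetical)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1000))
y = X[:, :5] @ np.ones(5) + 0.1 * rng.standard_normal(200)
beta = lasso_cd_gram(X.T @ X, X.T @ y, lam=50.0)
```

Coordinates that stay at zero cost $O(1)$ per visit, which is why active-set style skipping combines so well with this layout.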
Gauss-Seidel for linear systems
To solve $Ax = b$ with $A$ positive definite, iterate: $x_i \leftarrow \big(b_i - \sum_{j < i} a_{ij} x_j - \sum_{j > i} a_{ij} x_j\big)/a_{ii}$ for $i = 1, \dots, n$, using the most recently updated values of the other coordinates. This converges if $A$ is strictly diagonally dominant or symmetric positive definite.
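A minimal sketch of the sweep, with an in-place update so each coordinate sees the freshest values of the others (matrix and right-hand side invented for the example):

```python
import numpy as np

def gauss_seidel(A, b, n_sweeps=100):
    """Gauss-Seidel sweeps for A x = b; guaranteed to converge when A is
    strictly diagonally dominant or symmetric positive definite."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(n_sweeps):
        for i in range(n):
            # x[:i] already holds this sweep's updated values (in-place)
            x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 2.0, 3.0])
x = gauss_seidel(A, b)
# x solves A x = b to high precision after 100 sweeps
```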
Common Confusions
Coordinate descent is not gradient descent with a random mask
Coordinate descent solves the one-dimensional subproblem exactly (or near exactly). Simply zeroing out all but one component of the gradient and taking a step is a different algorithm with potentially worse behavior. The exact minimization along each coordinate is what gives coordinate descent its good properties.
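A toy quadratic makes the gap concrete (the function and step size here are constructed for illustration): exact coordinate minimization finishes a separable quadratic in one epoch, while masking the gradient and using a conservative global step size $1/L$ crawls along the low-curvature direction.

```python
import numpy as np

# f(x) = 0.5*(x0 - 1)^2 + 50*(x1 - 1)^2: coordinate curvatures 1 and 100
grad = lambda x: np.array([x[0] - 1.0, 100.0 * (x[1] - 1.0)])
curv = np.array([1.0, 100.0])

# Exact coordinate minimization: for a quadratic, one Newton step per
# coordinate is the exact 1-D minimizer, so one epoch lands on (1, 1).
x_cd = np.zeros(2)
for i in (0, 1):
    x_cd[i] -= grad(x_cd)[i] / curv[i]

# Masked gradient: zero all but one gradient entry, global step 1/L = 1/100.
x_mg = np.zeros(2)
for i in (0, 1):
    x_mg[i] -= 0.01 * grad(x_mg)[i]
# x_mg = [0.01, 1.0]: the low-curvature coordinate has barely moved
```

Exact minimization adapts the effective step to each coordinate's curvature automatically; a single global step size cannot.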
Cyclic vs randomized: theory vs practice
Randomized coordinate descent has cleaner theory (an $O(n/k)$ rate with explicit constants). But cyclic coordinate descent often works better in practice because it guarantees every coordinate is updated each epoch. The theory for cyclic CD is harder and was settled only recently (Beck & Tetruashvili 2013; Saha & Tewari 2013).
Summary
- Update one coordinate at a time; each step is a one-dimensional problem
- Separable penalties (like $\ell_1$) give closed-form coordinate updates
- Cyclic CD is the default Lasso solver; no step size to tune
- Randomized CD has cleaner theory: $O(n/k)$ per iteration, $O(1/k)$ per epoch
- Block CD generalizes to groups of coordinates
- Gauss-Seidel is CD applied to quadratic objectives
Exercises
Problem
Write out the coordinate descent update for coordinate $j$ of the Lasso problem in terms of the columns of $X$ and the current residual $r = y - X\beta$.
Problem
Give a two-dimensional example where cyclic coordinate descent on a nonsmooth, non-separable function fails to find the minimum. What goes wrong geometrically?
References
Canonical:
- Tseng, "Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization," Journal of Optimization Theory and Applications 109(3):475-494 (2001)
- Wright, "Coordinate Descent Algorithms," Mathematical Programming 151(1):3-34 (2015)
- Powell, "On Search Directions for Minimization Algorithms," Mathematical Programming 4(1):193-201 (1973)
Current:
- Friedman, Hastie, Tibshirani, "Regularization Paths for Generalized Linear Models via Coordinate Descent," Journal of Statistical Software 33(1):1-22 (2010)
- Nesterov, "Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems," SIAM Journal on Optimization 22(2):341-362 (2012)
- Beck & Tetruashvili, "On the Convergence of Block Coordinate Descent Type Methods," SIAM Journal on Optimization 23(4):2037-2060 (2013)
- Saha & Tewari, "On the Nonasymptotic Convergence of Cyclic Coordinate Descent Methods," SIAM Journal on Optimization 23(1):576-601 (2013)
Next Topics
The natural next steps from coordinate descent:
- Proximal gradient methods: for non-separable penalties
- Stochastic gradient descent: randomness in samples rather than coordinates
Last reviewed: April 27, 2026