Numerical Optimization
Nonlinear Gauss-Seidel
Block coordinate optimization with latest-value updates. This page covers alternating minimization, EM-style sweeps, and why block coupling decides whether the iterates cycle, stall, or converge.
Prerequisites
- Newton's Method
- Coordinate Descent
Why This Matters
Many ML objectives are not one smooth blob. They have blocks: latent assignments and parameters, encoder and decoder variables, matrix factors $W$ and $H$, primal and dual variables, or local worker models and a global consensus variable.
Nonlinear Gauss-Seidel is the clean mathematical pattern behind many alternating algorithms:
- update block 1 while the other blocks are fixed
- immediately use that new value when updating block 2
- continue through the blocks
- repeat until the sweep stops changing the objective
This page is not about a trendy optimizer. It is about recognizing when an algorithm is a block fixed-point method and asking the right questions: Are blocks weakly coupled? Are subproblems solved accurately enough? Is the objective decreasing? Can the method cycle?
Mental Model
Coordinate descent changes one coordinate or one block at a time. Jacobi-style methods update blocks in parallel from stale values. Gauss-Seidel-style methods update blocks sequentially and reuse the newest values immediately.
That one detail matters. If the first block update dramatically changes the best response of the second block, the next block may undo much of the previous work. If blocks are weakly coupled, a sweep behaves like a stable contraction. The practical question is not "is this block algorithm elegant?" The question is "does each block update make the other blocks easier or harder?"
Formal Setup
Block Decomposition
Write $x \in \mathbb{R}^n$ as $x = (x_1, x_2, \dots, x_m)$ with $x_i \in \mathbb{R}^{n_i}$ and $n_1 + \dots + n_m = n$. For a function $f(x_1, \dots, x_m)$, block $i$ means one group of variables is allowed to move while the other groups are held fixed.
Nonlinear Gauss-Seidel Sweep
Given $x^k = (x_1^k, \dots, x_m^k)$, one exact nonlinear Gauss-Seidel sweep computes
$$x_i^{k+1} \in \arg\min_{x_i} \; f\big(x_1^{k+1}, \dots, x_{i-1}^{k+1}, x_i, x_{i+1}^k, \dots, x_m^k\big)$$
for $i = 1, \dots, m$. Earlier blocks use the current-sweep values $x_1^{k+1}, \dots, x_{i-1}^{k+1}$; later blocks still use the old values $x_{i+1}^k, \dots, x_m^k$.
In practice, the block subproblem may be solved approximately by Newton, quasi-Newton, proximal, closed-form least squares, dynamic programming, or a few inner gradient steps. That makes the method an inexact nonlinear Gauss-Seidel method.
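A minimal sketch of one inexact sweep, with each block subproblem handled by a few inner gradient steps. The quadratic objective, block split, and step size below are illustrative assumptions, not part of the original text:

```python
import numpy as np

def gauss_seidel_sweep(grad_blocks, blocks, x, inner_steps=5, lr=0.1):
    """One inexact nonlinear Gauss-Seidel sweep.

    blocks: list of index arrays, one per block.
    grad_blocks: function (x, i) -> gradient of f w.r.t. block i only.
    Each block takes a few gradient steps while the other blocks are held
    at their latest values, so already-updated blocks feed the later ones.
    """
    x = x.copy()
    for i, idx in enumerate(blocks):
        for _ in range(inner_steps):          # approximate block solve
            x[idx] -= lr * grad_blocks(x, i)  # uses the freshest block values
    return x

# Example: f(x) = 0.5 * x^T A x - b^T x split into two blocks.
A = np.array([[4.0, 1.0, 0.0, 0.5],
              [1.0, 3.0, 0.5, 0.0],
              [0.0, 0.5, 5.0, 1.0],
              [0.5, 0.0, 1.0, 4.0]])
b = np.ones(4)
f = lambda x: 0.5 * x @ A @ x - b @ x
blocks = [np.array([0, 1]), np.array([2, 3])]
grad_blocks = lambda x, i: (A @ x - b)[blocks[i]]

x = np.zeros(4)
for k in range(20):
    x = gauss_seidel_sweep(grad_blocks, blocks, x)
    # exact block minimization guarantees monotone decrease;
    # inexact inner steps should still be checked against f
print(f(x))
```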
Main Guarantees
Exact Block Minimization Gives Monotone Descent
Statement
For an exact minimization sweep,
$$f\big(x_1^{k+1}, \dots, x_i^{k+1}, x_{i+1}^k, \dots, x_m^k\big) \le f\big(x_1^{k+1}, \dots, x_{i-1}^{k+1}, x_i^k, x_{i+1}^k, \dots, x_m^k\big)$$
for every block $i$. Therefore $f(x^{k+1}) \le f(x^k)$ after the full sweep. If the level set $\{x : f(x) \le f(x^0)\}$ is compact and the block-update limit points satisfy the needed first-order regularity conditions, every accumulation point is stationary. Global optimality needs additional convexity.
Intuition
Each block is allowed to pick the best value given the current values of the other blocks. That cannot increase the objective. But monotone decrease alone does not tell you which stationary point you reach, or whether the blocks stop fighting each other.
Proof Sketch
Apply the definition of the block argmin one block at a time and telescope the inequalities through the sweep. Stationarity of limit points comes from the first-order optimality conditions for each block subproblem plus continuity/closedness assumptions that let those conditions pass to a limit.
Why It Matters
This is the safety check for alternating minimization. If a proposed block update does not decrease the objective or a surrogate objective, you should not assume Gauss-Seidel theory is protecting it.
Failure Mode
In non-convex problems the method can converge to bad stationary points. If the subproblem minimizer is not unique, different tie-breaking rules can produce different trajectories. If block updates are only approximate, the approximation error must be controlled.
Contraction Gives Linear Convergence
Statement
Let $T$ be the full sweep map, so $x^{k+1} = T(x^k)$. If there exists $\gamma \in [0, 1)$ such that
$$\|T(x) - T(y)\| \le \gamma\,\|x - y\|$$
on a closed region containing the iterates, then $T$ has a unique fixed point $x^\star$ in that region and
$$\|x^k - x^\star\| \le \gamma^k\,\|x^0 - x^\star\|.$$
Intuition
The sweep is a function from "old blocks" to "new blocks." If every sweep compresses distances, the iterates cannot wander, cycle, or split into multiple possible limits inside that region.
Proof Sketch
This is the Banach fixed-point theorem applied to the complete metric space containing the iterates. The fixed point of the sweep satisfies the block optimality conditions.
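One way to probe contraction numerically is to track the ratio of successive sweep displacements; ratios that settle below 1 are consistent with (though not proof of) local contraction. A hedged sketch, where `sweep` stands for any function applying one full Gauss-Seidel sweep:

```python
import numpy as np

def estimate_contraction(sweep, x0, num_sweeps=30):
    """Estimate an empirical contraction factor from iterate displacements.

    Returns the ratios ||x^{k+1} - x^k|| / ||x^k - x^{k-1}||; values that
    stay persistently below 1 suggest the sweep map is locally contractive.
    """
    xs = [x0, sweep(x0)]
    ratios = []
    for _ in range(num_sweeps):
        xs.append(sweep(xs[-1]))
        d_new = np.linalg.norm(xs[-1] - xs[-2])
        d_old = np.linalg.norm(xs[-2] - xs[-3])
        if d_old > 0:
            ratios.append(d_new / d_old)
    return ratios
```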
Why It Matters
Contraction is usually too strong to prove globally, but it gives the right engineering diagnostic: strong cross-block curvature makes alternating methods fragile; weak cross-block curvature makes them stable.
How To Diagnose Coupling
Near a smooth local minimizer, partition the Hessian $\nabla^2 f$ into blocks $H_{ij}$. A rough coupling lens is the ratio of cross-block to within-block curvature, for example
$$\frac{\|H_{ij}\|}{\lambda_{\min}(H_{ii})}, \qquad i \ne j.$$
Small off-diagonal influence relative to within-block curvature suggests block $j$ does not wildly move the best response of block $i$. Large values suggest a sweep may zigzag or need damping.
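A numerical version of this lens, assuming you can evaluate or approximate the Hessian at a candidate solution (the ratio used here is one reasonable proxy, not a canonical definition):

```python
import numpy as np

def block_coupling(H, blocks):
    """Cross-block vs. within-block curvature for each block pair.

    H: (n, n) Hessian (or finite-difference approximation) near x*.
    blocks: list of index arrays. Returns a dict mapping (i, j) to
    ||H_ij|| / lambda_min(H_ii); large values flag pairs likely to zigzag.
    Assumes the within-block curvature is positive near a local minimizer.
    """
    out = {}
    for i, bi in enumerate(blocks):
        lam_min = np.linalg.eigvalsh(H[np.ix_(bi, bi)]).min()
        for j, bj in enumerate(blocks):
            if i != j:
                out[(i, j)] = np.linalg.norm(H[np.ix_(bi, bj)], 2) / lam_min
    return out
```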
| Signal | What it suggests | Practical response |
|---|---|---|
| objective decreases smoothly | blocks are cooperating | keep exact or high-quality block solves |
| objective decreases but iterates oscillate | strong cross-block coupling | change block order, add damping, or merge blocks |
| subproblems are expensive | exact Gauss-Seidel is too costly | use inexact solves with descent checks |
| many local optima | non-convex alternating landscape | use restarts and monitor validation behavior |
Connections In ML
Alternating Matrix Factorization
For nonnegative matrix factorization or low-rank factorization $X \approx WH$, update $W$ with $H$ fixed, then update $H$ with $W$ fixed. Each subproblem may be easier than the joint problem, but the joint problem is still non-convex.
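A minimal alternating least squares sketch for the unconstrained low-rank case (no nonnegativity); the ridge regularization and dimensions are illustrative assumptions:

```python
import numpy as np

def als(X, rank, sweeps=50, lam=1e-3, seed=0):
    """Alternating least squares for X ~ W @ H with closed-form block solves."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.standard_normal((m, rank))
    H = rng.standard_normal((rank, n))
    reg = lam * np.eye(rank)
    for _ in range(sweeps):
        # Block 1: solve for W with H fixed (ridge-regularized least squares).
        W = X @ H.T @ np.linalg.inv(H @ H.T + reg)
        # Block 2: solve for H with the *new* W fixed (Gauss-Seidel ordering).
        H = np.linalg.inv(W.T @ W + reg) @ W.T @ X
    return W, H
```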
EM Is Related, But Not Identical
Expectation-maximization alternates between a distribution $q(z)$ over latent variables and model parameters $\theta$ by maximizing an evidence lower bound:
$$\mathcal{L}(q, \theta) = \mathbb{E}_{q(z)}\big[\log p(x, z \mid \theta)\big] - \mathbb{E}_{q(z)}\big[\log q(z)\big] \le \log p(x \mid \theta).$$
The E-step optimizes $q$ with $\theta$ fixed. The M-step optimizes $\theta$ with $q$ fixed. That is a block-coordinate ascent pattern on a surrogate objective. EM guarantees monotone likelihood improvement under the standard exact-step setup, but EM is not generally a contraction and is not guaranteed to find the global maximum.
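A compact sketch of that two-block pattern for a one-dimensional Gaussian mixture, with exact E- and M-steps; the model size, initialization, and iteration count are illustrative assumptions:

```python
import numpy as np

def em_gmm_1d(x, k=2, sweeps=100, seed=0):
    """EM for a 1-D Gaussian mixture: alternate responsibilities and parameters."""
    rng = np.random.default_rng(seed)
    n = x.size
    pi = np.full(k, 1.0 / k)                   # mixing weights
    mu = rng.choice(x, size=k, replace=False)  # component means
    var = np.full(k, x.var())                  # component variances
    for _ in range(sweeps):
        # E-step (block 1): responsibilities q(z) with parameters fixed.
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step (block 2): parameters with responsibilities fixed.
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var
```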
ADMM Adds A Dual Memory
ADMM also alternates over blocks, but it is not merely block coordinate descent on the original objective. The augmented Lagrangian and dual update carry constraint violation forward. That residual memory is the key extra mechanism.
Common Confusions
Monotone decrease is not convergence to the best solution
The objective can decrease every sweep and still converge to a poor local stationary point. For non-convex matrix factorization, topic models, and mixture models, initialization and block design matter.
Gauss-Seidel is not Jacobi
Gauss-Seidel uses the newest block values inside the same sweep. Jacobi updates all blocks from old values. Jacobi parallelizes better; Gauss-Seidel often stabilizes better because it uses fresher information.
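A small illustration of the difference on a coupled two-variable quadratic; the specific objective and starting point are made up for the example:

```python
# f(x1, x2) = 0.5*(x1**2 + x2**2) + c*x1*x2 with coupling strength c.
# Exact block minimizers: x1 <- -c*x2 and x2 <- -c*x1.
c = 0.5
x1, x2 = 1.0, 1.0

# Gauss-Seidel: x2's update sees the *new* x1.
gs1 = -c * x2          # -0.5
gs2 = -c * gs1         # +0.25

# Jacobi: both updates use the *old* values.
j1 = -c * x2           # -0.5
j2 = -c * x1           # -0.5

print((gs1, gs2), (j1, j2))  # (-0.5, 0.25) vs (-0.5, -0.5)
```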
Exact block minimization is not always better in wall-clock time
An exact block solve may be expensive. If approximate solves give enough descent, inexact Gauss-Seidel can beat exact Gauss-Seidel in real compute time.
EM is not a generic convergence certificate for every alternating method
EM has a special variational identity that ensures likelihood monotonicity. A random alternating heuristic does not inherit that identity just because it has two blocks.
Q&A For Mastery
If every block update minimizes its subproblem, why can the method still fail? Because minimizing one block can make another block worse. The objective may still descend, but the iterates can zigzag, stall at a saddle-like stationary point, or converge to a local solution determined by initialization.
What should I check first in code? Log the objective after every block update, not only after a full epoch. If a block update increases the objective without an intentional stochastic or approximate reason, the implementation is not matching the theory.
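A hedged sketch of that check, assuming access to the objective and a per-block update routine; `f` and `update_block` are placeholder names, not from any specific library:

```python
def checked_sweep(f, update_block, x, num_blocks, tol=1e-10):
    """Run one sweep and log the objective after every block update.

    update_block(x, i) should return the iterate with block i updated.
    With exact block minimization the objective must never increase.
    """
    val = f(x)
    for i in range(num_blocks):
        x = update_block(x, i)
        new_val = f(x)
        if new_val > val + tol:
            print(f"warning: block {i} increased the objective by {new_val - val:.3e}")
        val = new_val
    return x, val
```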
When should I merge blocks? When two blocks are so strongly coupled that each one repeatedly undoes the other. Merging increases subproblem cost but can remove the unstable feedback loop.
What To Remember
- Nonlinear Gauss-Seidel is sequential block optimization with newest-value updates.
- Exact block minimization gives monotone objective decrease, not automatic global optimality.
- Contraction gives linear convergence, but it is a strong local condition.
- Cross-block coupling is the central diagnostic.
- EM and ADMM look like alternating methods, but each has extra structure that must be respected.
Exercises
Problem
For , compute one Gauss-Seidel sweep from .
Problem
Explain why the same example would give a different first point under Jacobi updating.
Problem
In a Gaussian mixture EM run, what are the two blocks? Why does monotone likelihood improvement not imply the EM map is a contraction?
References
Canonical:
- Bertsekas, Nonlinear Programming, 3rd ed., Sections 2.7 and 2.8.
- Ortega and Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Chapter 10.
- Dempster, Laird, and Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm" (1977).
Further:
- Wright, "Coordinate Descent Algorithms" (2015).
- Xu and Yin, "A Block Coordinate Descent Method for Regularized Multi-Convex Optimization" (2013).
- Wu, "On the Convergence Properties of the EM Algorithm" (1983).
Next Topics
- Augmented Lagrangian and ADMM: alternating block updates with a dual residual memory.
- Convex optimization basics: when block stationary points become globally meaningful.
Last reviewed: April 26, 2026