Numerical Optimization

Augmented Lagrangian and ADMM

The augmented Lagrangian fixes the conditioning problem of pure penalties by adding dual feedback. ADMM uses that idea to split structured convex problems into alternating subproblems with primal and dual residual diagnostics.

Advanced · Tier 2 · Stable · Supporting · ~60 min

Why This Matters

Many optimization problems in ML are structured:

  • Lasso has a smooth least-squares loss plus a non-smooth ℓ₁ penalty.
  • Distributed learning has many local objectives that must agree on one model.
  • Matrix completion has a data-fit term and a low-rank or nuclear-norm term.
  • Constrained estimation has a likelihood objective plus feasibility conditions.

A single black-box optimizer can ignore that structure. ADMM splits the problem so each part gets the solver it deserves, while a dual variable remembers how badly the split variables disagree.

The Problem With Pure Penalties

For

\min_x f(x) \quad \text{subject to} \quad Ax = b,

a pure quadratic penalty minimizes

f(x) + (\rho/2)\,\|Ax - b\|^2.

As ρ grows, feasibility improves, but conditioning gets worse. The penalty method is asking one scalar ρ to do two jobs: enforce the constraint and keep the subproblem numerically healthy. That tension is why pure penalties can be painfully slow.
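
A quick way to see the tension numerically is to build the penalty Hessian for a toy problem and watch its condition number track ρ. The sketch below is a minimal illustration with synthetic data and the arbitrary choice f(x) = ½‖x‖²; the specific numbers are assumptions, not part of the text.

```python
import numpy as np

# Toy illustration (assumed data): f(x) = 0.5 * ||x||^2 subject to Ax = b.
# The pure-penalty objective 0.5*||x||^2 + (rho/2)*||Ax - b||^2
# has Hessian H = I + rho * A^T A, whose conditioning degrades with rho.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 20))            # 5 constraints, 20 variables

for rho in [1e0, 1e2, 1e4, 1e6]:
    H = np.eye(20) + rho * (A.T @ A)
    print(f"rho = {rho:.0e}   cond(H) = {np.linalg.cond(H):.2e}")
# Expected pattern: cond(H) grows roughly in proportion to rho, so the
# subproblem becomes harder exactly when feasibility pressure increases.
```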

Augmented Lagrangian

Definition

Augmented Lagrangian

For the equality-constrained problem of minimizing f(x) subject to Ax = b, the augmented Lagrangian is

L_\rho(x,\lambda) = f(x) + \lambda^T(Ax - b) + (\rho/2)\,\|Ax - b\|^2,

where λ is the Lagrange multiplier and ρ > 0 is a penalty parameter.

Definition

Method of Multipliers

The method of multipliers alternates

x^{k+1} \in \arg\min_x\, L_\rho(x, \lambda^k),

then

\lambda^{k+1} = \lambda^k + \rho\,(Ax^{k+1} - b).

The quadratic term discourages infeasibility. The multiplier update learns the right linear price for violating the constraint. This is the central improvement over a pure penalty method: a finite, moderate ρ can work because λ adapts.
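
As a concrete sketch, the loop below runs the method of multipliers on a synthetic equality-constrained quadratic, where the x-step has a closed-form linear solve. All of the problem data and the choice ρ = 10 are illustrative assumptions, not part of the definition above.

```python
import numpy as np

# Method of multipliers on a synthetic equality-constrained quadratic:
#   minimize 0.5 * x'Px + q'x   subject to   Ax = b.
rng = np.random.default_rng(1)
n, m = 10, 3
M = rng.standard_normal((n, n))
P = M.T @ M + np.eye(n)                     # symmetric positive definite
q = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

rho, lam = 10.0, np.zeros(m)
for k in range(50):
    # x-step: exact minimizer of L_rho(x, lam), a linear solve here.
    x = np.linalg.solve(P + rho * (A.T @ A), -q - A.T @ lam + rho * (A.T @ b))
    # dual step: the feedback signal is the constraint violation itself.
    lam = lam + rho * (A @ x - b)

print("||Ax - b|| =", np.linalg.norm(A @ x - b))   # should be near zero
```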

Theorem

Method Of Multipliers Convergence

Statement

For the equality-constrained convex problem, the method of multipliers drives the primal residual Ax^k − b to zero and converges to the optimal value under standard saddle-point assumptions. When the optimal multiplier is not unique, the multiplier sequence may converge to one valid multiplier rather than a uniquely determined one.

Intuition

The dual update is a feedback controller. If the constraint is violated, the next subproblem receives a stronger linear signal against that violation. The method does not need ρ → ∞ to become exact.

Why It Matters

This is the conceptual ancestor of ADMM. If you understand why dual feedback fixes pure penalties, ADMM stops looking like a bag of mysterious alternating formulas.

Failure Mode

If the subproblem is solved poorly or the problem is non-convex without extra assumptions, the dual feedback can amplify bad iterates instead of stabilizing them.

ADMM As Splitting

ADMM targets the split problem

\min_{x,z}\; f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c.

The scaled ADMM updates are

x^{k+1} \in \arg\min_x\; f(x) + (\rho/2)\,\|Ax + Bz^k - c + u^k\|^2,
z^{k+1} \in \arg\min_z\; g(z) + (\rho/2)\,\|Ax^{k+1} + Bz - c + u^k\|^2,
u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c.

Here u = λ/ρ is the scaled dual variable. ADMM is not just "alternate x and z." The dual update is what makes the split remember disagreement.
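
For the common special case A = I, B = −I, c = 0 (the x = z split used later for the Lasso), the updates reduce to proximal steps. The skeleton below is a minimal sketch under that assumption; prox_f and prox_g are hypothetical callables the caller must supply.

```python
import numpy as np

def admm_x_equals_z(prox_f, prox_g, n, rho=1.0, iters=200):
    """Scaled ADMM for min f(x) + g(z) s.t. x = z (A = I, B = -I, c = 0).

    prox_f(v, rho) and prox_g(v, rho) are assumed to return
    argmin_w f(w) + (rho/2)*||w - v||^2 (respectively for g)."""
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)
    for _ in range(iters):
        x = prox_f(z - u, rho)      # x-update: proximal step on f
        z = prox_g(x + u, rho)      # z-update: proximal step on g
        u = u + x - z               # dual update: accumulate the primal residual
    return x, z, u
```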

Residuals: The Dashboard That Matters

The primal residual is

r^{k+1} = Ax^{k+1} + Bz^{k+1} - c.

It measures feasibility. The dual residual, in the common two-block scaled form, is

s^{k+1} = \rho\, A^T B\,(z^{k+1} - z^k).

It measures how much the z-update is changing the dual stationarity conditions.
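
In code, the two residuals and a typical absolute-plus-relative stopping test look like the sketch below. It assumes the x = z split, and the tolerance recipe follows the style popularized by Boyd et al. (2011) rather than anything mandated by the theory.

```python
import numpy as np

def residuals(x, z, z_old, rho):
    """Primal and dual residuals for the x = z split (A = I, B = -I, c = 0)."""
    r = x - z                       # primal residual: feasibility gap
    s = rho * (z_old - z)           # dual residual: rho * A^T B (z_new - z_old)
    return r, s

def converged(r, s, x, z, u, rho, eps_abs=1e-4, eps_rel=1e-3):
    """Absolute-plus-relative stopping test; tolerances are illustrative defaults."""
    n = x.size
    eps_pri = np.sqrt(n) * eps_abs + eps_rel * max(np.linalg.norm(x), np.linalg.norm(z))
    eps_dual = np.sqrt(n) * eps_abs + eps_rel * np.linalg.norm(rho * u)
    return np.linalg.norm(r) <= eps_pri and np.linalg.norm(s) <= eps_dual
```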

| If this happens | Meaning | Response |
| --- | --- | --- |
| primal residual high, dual residual low | split variables still disagree | increase ρ cautiously |
| dual residual high, primal residual low | penalty is forcing agreement too aggressively | decrease ρ cautiously |
| both decrease slowly | subproblem or scaling issue | rescale variables or precondition |
| objective improves but residuals stall | not solving the constrained problem yet | do not trust the objective alone |
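
The first two rows of the table are often automated with a residual-balancing heuristic. The sketch below is one common variant; the parameters μ = 10 and τ = 2 are conventional illustrative defaults, not requirements.

```python
def rebalance_rho(rho, u, r_norm, s_norm, mu=10.0, tau=2.0):
    """Residual-balancing heuristic matching the first two table rows.

    Because u = lambda / rho, the scaled dual variable must be rescaled
    whenever rho changes, or the dual state is silently corrupted."""
    if r_norm > mu * s_norm:        # feasibility lagging: tighten the penalty
        return tau * rho, u / tau
    if s_norm > mu * r_norm:        # agreement forced too hard: relax the penalty
        return rho / tau, u * tau
    return rho, u                   # residuals balanced: leave rho alone
```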

Convergence

Theorem

Two-Block Convex ADMM Convergence

Statement

For the two-block convex problem, ADMM produces primal residuals satisfying r^k → 0, dual residuals satisfying s^k → 0, and objective values converging to the optimal value. Standard ergodic convergence certificates are O(1/k) under the classical assumptions; stronger structure can yield faster rates.

Intuition

The x- and z-updates improve the augmented Lagrangian in directions that respect each function's structure, while the dual update prevents the split variables from drifting apart.

Proof Sketch

The standard proof compares each iterate to a primal-dual saddle point and constructs a Lyapunov quantity involving the dual error and constraint residual. The optimality conditions for the x- and z-subproblems make successive differences telescope. Summing those inequalities forces residuals to vanish and yields sublinear averaged convergence bounds.
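
For concreteness, one standard choice of Lyapunov quantity in the two-block scaled analysis (the form used in Boyd, Parikh, Chu, Peleato, and Eckstein, 2011; quoted here as a guide rather than re-derived) is

V^k = \frac{1}{\rho}\,\|\lambda^k - \lambda^\star\|^2 + \rho\,\|B(z^k - z^\star)\|^2,

and the telescoping step gives the per-iteration decrease

V^{k+1} \le V^k - \rho\,\|r^{k+1}\|^2 - \rho\,\|B(z^{k+1} - z^k)\|^2,

so summing over k forces both the primal residual and the successive z-differences to vanish.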

Why It Matters

The theorem explains why ADMM is reliable for convex splitting even when g is non-smooth. It also explains what to log: primal residual, dual residual, objective, and penalty parameter.

Failure Mode

Naive multi-block ADMM with three or more primal blocks can diverge without extra assumptions or modifications. Non-convex ADMM has problem-specific stationary-point guarantees, not generic global-optimality guarantees.

Core Example: Lasso

The Lasso problem is

\min_x\; (1/2)\,\|Ax - b\|^2 + \lambda\,\|x\|_1.

Split it as x = z:

f(x) = (1/2)\,\|Ax - b\|^2, \qquad g(z) = \lambda\,\|z\|_1.

Then the x-update is ridge regression:

x^{k+1} = (A^T A + \rho I)^{-1}\left(A^T b + \rho\,(z^k - u^k)\right),

and the z-update is soft-thresholding:

z_i^{k+1} = \operatorname{sign}(x_i^{k+1} + u_i^k)\,\max\{\,|x_i^{k+1} + u_i^k| - \lambda/\rho,\ 0\,\}.

This is why ADMM is loved in structured statistics: one step is linear algebra, one step is a proximal operator, and the dual variable enforces consistency.
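
Putting the two updates together gives a compact solver. The sketch below is a minimal reference implementation of the updates above on synthetic data; the data, ρ = 1, and the fixed iteration count are illustrative assumptions, and a production version would factor AᵀA + ρI once and reuse it.

```python
import numpy as np

def lasso_admm(A, b, lam, rho=1.0, iters=500):
    """ADMM for min 0.5*||Ax - b||^2 + lam*||x||_1 via the x = z split above."""
    m, n = A.shape
    AtA, Atb = A.T @ A, A.T @ b
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    for _ in range(iters):
        # x-update: ridge-regression solve (a real solver would factor once).
        x = np.linalg.solve(AtA + rho * np.eye(n), Atb + rho * (z - u))
        # z-update: soft-thresholding, the proximal operator of (lam/rho)*||.||_1.
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        # dual update: accumulate the primal residual x - z.
        u = u + x - z
    return z

# Illustrative synthetic usage (assumed data, not from the text).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100)
x_true[:5] = 1.0
b = A @ x_true + 0.01 * rng.standard_normal(50)
x_hat = lasso_admm(A, b, lam=0.1)
print("estimated nonzeros:", int(np.count_nonzero(np.abs(x_hat) > 1e-6)))
```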

Where ADMM Shines

  • Moderate-accuracy structured convex problems: Lasso, graphical lasso variants, robust PCA, trend filtering, signal reconstruction.
  • Distributed consensus: local workers solve local subproblems and agree through a global variable (a minimal sketch follows this list).
  • Proximal splitting: smooth and non-smooth terms can use different specialized solvers.
  • Interpretability of logs: residual plots expose feasibility and stationarity separately.
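
In the distributed-consensus setting mentioned above, the same three-step pattern appears per worker: local proximal solves, a global averaging step, and local dual updates. The sketch below assumes each worker exposes a hypothetical prox_list[i] callable that solves its local subproblem.

```python
import numpy as np

def consensus_admm(prox_list, n, rho=1.0, iters=100):
    """Global-variable consensus ADMM: min sum_i f_i(x_i) s.t. x_i = z for all i.

    prox_list[i](v, rho) is assumed to solve the i-th local subproblem
    argmin_w f_i(w) + (rho/2)*||w - v||^2 (one worker each)."""
    N = len(prox_list)
    x = [np.zeros(n) for _ in range(N)]
    u = [np.zeros(n) for _ in range(N)]
    z = np.zeros(n)
    for _ in range(iters):
        # Local solves: embarrassingly parallel across workers.
        x = [prox(z - ui, rho) for prox, ui in zip(prox_list, u)]
        # Global step: average the locally proposed models plus dual offsets.
        z = np.mean([xi + ui for xi, ui in zip(x, u)], axis=0)
        # Local dual updates: each worker tracks its own disagreement with z.
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return z
```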

Where ADMM Is Not Magic

ADMM is often not the best choice for high-precision small convex problems; an interior point method may win. It is also not a replacement for SGD or Adam in most neural-network training. Non-convex ADMM can be useful, but it should be treated as a specialized splitting heuristic unless you have assumptions proving stationarity.

Common Confusions

Watch Out

Rho is not a model regularization parameter

The penalty parameter ρ changes the optimization path, not the target constrained solution in the exact convex setup. The Lasso regularization strength is λ; ADMM's ρ is a solver parameter.

Watch Out

Small objective value is not enough

For constrained splitting problems, a low objective with a large primal residual may be infeasible. Always inspect primal and dual residuals.

Watch Out

ADMM is not just nonlinear Gauss-Seidel

ADMM uses Gauss-Seidel-style alternating primal updates, but the augmented Lagrangian and dual update are essential. Removing the dual update turns it into a penalty-style alternating method with different behavior.

Watch Out

Two-block theory does not automatically extend to many blocks

The classical convergence theorem is for two primal blocks. More blocks require grouping, proximal regularization, strong convexity assumptions, or other safeguards.

Q&A For Mastery

What is the first ADMM debugging plot? Plot ‖r^k‖, ‖s^k‖, the objective, and ρ on the same run. If only the objective is logged, the main ADMM story is missing.

Why does the scaled form use u instead of λ? Because u = λ/ρ makes the quadratic terms cleaner and reveals the residual-accumulation pattern u^{k+1} = u^k + r^{k+1}.

When should I adapt ρ? When primal and dual residuals are badly imbalanced for many iterations. Adaptation should be conservative and should rescale u consistently if the implementation changes ρ.

What To Remember

  • Pure penalties need large ρ and become ill-conditioned.
  • The augmented Lagrangian adds dual feedback, so finite ρ can work.
  • ADMM splits f(x) + g(z) into alternating subproblems plus a dual update driven by the residual.
  • Primal residual means feasibility; dual residual means stationarity movement.
  • Classical ADMM convergence is a two-block convex result. Be cautious with many blocks and non-convex objectives.

Exercises

ExerciseCore

Problem

For the split problem min f(x) + g(z) subject to x = z, write the scaled ADMM updates.

ExerciseAdvanced

Problem

In Lasso ADMM, why is the x-update a ridge regression and the z-update a soft-thresholding step?

ExerciseResearch

Problem

Why can a naive three-block ADMM fail even if every subproblem is convex?

References

Canonical:

  • Hestenes, "Multiplier and Gradient Methods" (1969).
  • Powell, "A Method for Nonlinear Constraints in Minimization Problems" (1969).
  • Gabay and Mercier, "A Dual Algorithm for the Solution of Nonlinear Variational Problems via Finite Element Approximation" (1976).
  • Boyd, Parikh, Chu, Peleato, and Eckstein, "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers" (2011).

Further:

  • Bertsekas, Constrained Optimization and Lagrange Multiplier Methods (1982).
  • Eckstein and Bertsekas, "On the Douglas-Rachford Splitting Method and the Proximal Point Algorithm for Maximal Monotone Operators" (1992).
  • Parikh and Boyd, "Proximal Algorithms" (2014).
  • Hong, Luo, and Razaviyayn, "Convergence Analysis of Alternating Direction Method of Multipliers for a Family of Nonconvex Problems" (2016).

Last reviewed: April 25, 2026
