Numerical Optimization

Augmented Lagrangian and ADMM

The augmented Lagrangian fixes the conditioning problem of pure penalties by adding dual feedback. ADMM uses that idea to split structured convex problems into alternating subproblems with primal and dual residual diagnostics.

Advanced · Tier 2 · Stable · Supporting · ~60 min

Why This Matters

Many optimization problems in ML are structured:

  • Lasso has a smooth least-squares loss plus a non-smooth ℓ₁ penalty.
  • Distributed learning has many local objectives that must agree on one model.
  • Matrix completion has a data-fit term and a low-rank or nuclear-norm term.
  • Constrained estimation has a likelihood objective plus feasibility conditions.

A single black-box optimizer can ignore that structure. ADMM splits the problem so each part gets the solver it deserves, while a dual variable remembers how badly the split variables disagree.

The Problem With Pure Penalties

For

\min_x f(x) \quad \text{subject to} \quad Ax = b,

a pure quadratic penalty minimizes

f(x) + (\rho/2)\,\|Ax - b\|^2.

As ρ grows, feasibility improves, but conditioning gets worse. The penalty method is asking one scalar ρ to do two jobs: enforce the constraint and keep the subproblem numerically healthy. That tension is why pure penalties can be painfully slow.
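
A quick way to see the tension numerically is to build the penalty Hessian for a toy problem and watch its condition number track ρ. The sketch below is a minimal illustration with synthetic data and the arbitrary choice f(x) = ½‖x‖²; the specific numbers are assumptions, not part of the text.

```python
import numpy as np

# Toy illustration (assumed data): f(x) = 0.5 * ||x||^2 subject to Ax = b.
# The pure-penalty objective 0.5*||x||^2 + (rho/2)*||Ax - b||^2
# has Hessian H = I + rho * A^T A, whose conditioning degrades with rho.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 20))            # 5 constraints, 20 variables

for rho in [1e0, 1e2, 1e4, 1e6]:
    H = np.eye(20) + rho * (A.T @ A)
    print(f"rho = {rho:.0e}   cond(H) = {np.linalg.cond(H):.2e}")
# Expected pattern: cond(H) grows roughly in proportion to rho, so the
# subproblem becomes harder exactly when feasibility pressure increases.
```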

Augmented Lagrangian

Definition

Augmented Lagrangian

For the equality-constrained problem of minimizing f(x) subject to Ax = b, the augmented Lagrangian is

L_\rho(x,\lambda) = f(x) + \lambda^T(Ax - b) + (\rho/2)\,\|Ax - b\|^2,

where λ is the Lagrange multiplier and ρ > 0 is a penalty parameter.

Definition

Method of Multipliers

The method of multipliers alternates

x^{k+1} \in \arg\min_x\, L_\rho(x, \lambda^k),

then

\lambda^{k+1} = \lambda^k + \rho\,(Ax^{k+1} - b).

The quadratic term discourages infeasibility. The multiplier update learns the right linear price for violating the constraint. This is the central improvement over a pure penalty method: a finite, moderate ρ can work because λ adapts.
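
As a concrete sketch, the loop below runs the method of multipliers on a synthetic equality-constrained quadratic, where the x-step has a closed-form linear solve. All of the problem data and the choice ρ = 10 are illustrative assumptions, not part of the definition above.

```python
import numpy as np

# Method of multipliers on a synthetic equality-constrained quadratic:
#   minimize 0.5 * x'Px + q'x   subject to   Ax = b.
rng = np.random.default_rng(1)
n, m = 10, 3
M = rng.standard_normal((n, n))
P = M.T @ M + np.eye(n)                     # symmetric positive definite
q = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

rho, lam = 10.0, np.zeros(m)
for k in range(50):
    # x-step: exact minimizer of L_rho(x, lam), a linear solve here.
    x = np.linalg.solve(P + rho * (A.T @ A), -q - A.T @ lam + rho * (A.T @ b))
    # dual step: the feedback signal is the constraint violation itself.
    lam = lam + rho * (A @ x - b)

print("||Ax - b|| =", np.linalg.norm(A @ x - b))   # should be near zero
```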

Theorem

Method Of Multipliers Convergence

Statement

For the equality-constrained convex problem, the method of multipliers drives the primal residual Ax^k − b to zero and converges to the optimal value under standard saddle-point assumptions. When the optimal multiplier is not unique, the multiplier sequence may converge to one valid multiplier rather than a uniquely determined one.

Intuition

The dual update is a feedback controller. If the constraint is violated, the next subproblem receives a stronger linear signal against that violation. The method does not need ρ → ∞ to become exact.

Why It Matters

This is the conceptual ancestor of ADMM. If you understand why dual feedback fixes pure penalties, ADMM stops looking like a bag of mysterious alternating formulas.

Failure Mode

If the subproblem is solved poorly or the problem is non-convex without extra assumptions, the dual feedback can amplify bad iterates instead of stabilizing them.

ADMM As Splitting

ADMM targets the split problem

\min_{x,z}\; f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c.

The scaled ADMM updates are

x^{k+1} \in \arg\min_x\; f(x) + (\rho/2)\,\|Ax + Bz^k - c + u^k\|^2,
z^{k+1} \in \arg\min_z\; g(z) + (\rho/2)\,\|Ax^{k+1} + Bz - c + u^k\|^2,
u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c.

Here u = λ/ρ is the scaled dual variable. ADMM is not just "alternate x and z." The dual update is what makes the split remember disagreement.
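
For the common special case A = I, B = −I, c = 0 (the x = z split used later for the Lasso), the updates reduce to proximal steps. The skeleton below is a minimal sketch under that assumption; prox_f and prox_g are hypothetical callables the caller must supply.

```python
import numpy as np

def admm_x_equals_z(prox_f, prox_g, n, rho=1.0, iters=200):
    """Scaled ADMM for min f(x) + g(z) s.t. x = z (A = I, B = -I, c = 0).

    prox_f(v, rho) and prox_g(v, rho) are assumed to return
    argmin_w f(w) + (rho/2)*||w - v||^2 (respectively for g)."""
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)
    for _ in range(iters):
        x = prox_f(z - u, rho)      # x-update: proximal step on f
        z = prox_g(x + u, rho)      # z-update: proximal step on g
        u = u + x - z               # dual update: accumulate the primal residual
    return x, z, u
```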

Residuals: The Dashboard That Matters

The primal residual is

r^{k+1} = Ax^{k+1} + Bz^{k+1} - c.

It measures feasibility. The dual residual, in the common two-block scaled form, is

s^{k+1} = \rho\, A^T B\,(z^{k+1} - z^k).

It measures how much the z-update is changing the dual stationarity conditions.
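
In code, the two residuals and a typical absolute-plus-relative stopping test look like the sketch below. It assumes the x = z split, and the tolerance recipe follows the style popularized by Boyd et al. (2011) rather than anything mandated by the theory.

```python
import numpy as np

def residuals(x, z, z_old, rho):
    """Primal and dual residuals for the x = z split (A = I, B = -I, c = 0)."""
    r = x - z                       # primal residual: feasibility gap
    s = rho * (z_old - z)           # dual residual: rho * A^T B (z_new - z_old)
    return r, s

def converged(r, s, x, z, u, rho, eps_abs=1e-4, eps_rel=1e-3):
    """Absolute-plus-relative stopping test; tolerances are illustrative defaults."""
    n = x.size
    eps_pri = np.sqrt(n) * eps_abs + eps_rel * max(np.linalg.norm(x), np.linalg.norm(z))
    eps_dual = np.sqrt(n) * eps_abs + eps_rel * np.linalg.norm(rho * u)
    return np.linalg.norm(r) <= eps_pri and np.linalg.norm(s) <= eps_dual
```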

| If this happens | Meaning | Response |
| --- | --- | --- |
| primal residual high, dual residual low | split variables still disagree | increase ρ cautiously |
| dual residual high, primal residual low | penalty is forcing agreement too aggressively | decrease ρ cautiously |
| both decrease slowly | subproblem or scaling issue | rescale variables or precondition |
| objective improves but residuals stall | not solving the constrained problem yet | do not trust the objective alone |
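
The first two rows of the table are often automated with a residual-balancing heuristic. The sketch below is one common variant; the parameters μ = 10 and τ = 2 are conventional illustrative defaults, not requirements.

```python
def rebalance_rho(rho, u, r_norm, s_norm, mu=10.0, tau=2.0):
    """Residual-balancing heuristic matching the first two table rows.

    Because u = lambda / rho, the scaled dual variable must be rescaled
    whenever rho changes, or the dual state is silently corrupted."""
    if r_norm > mu * s_norm:        # feasibility lagging: tighten the penalty
        return tau * rho, u / tau
    if s_norm > mu * r_norm:        # agreement forced too hard: relax the penalty
        return rho / tau, u * tau
    return rho, u                   # residuals balanced: leave rho alone
```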

Convergence

Theorem

Two-Block Convex ADMM Convergence

Statement

For the two-block convex problem, ADMM produces primal residuals satisfying r^k → 0, dual residuals satisfying s^k → 0, and objective values converging to the optimal value. Standard ergodic convergence certificates are O(1/k) under the classical assumptions; stronger structure can yield faster rates.

Intuition

The x- and z-updates improve the augmented Lagrangian in directions that respect each function's structure, while the dual update prevents the split variables from drifting apart.

Proof Sketch

The standard proof compares each iterate to a primal-dual saddle point and constructs a Lyapunov quantity involving the dual error and constraint residual. The optimality conditions for the x- and z-subproblems make successive differences telescope. Summing those inequalities forces residuals to vanish and yields sublinear averaged convergence bounds.
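
For concreteness, one standard choice of Lyapunov quantity in the two-block scaled analysis (the form used in Boyd, Parikh, Chu, Peleato, and Eckstein, 2011; quoted here as a guide rather than re-derived) is

V^k = \frac{1}{\rho}\,\|\lambda^k - \lambda^\star\|^2 + \rho\,\|B(z^k - z^\star)\|^2,

and the telescoping step gives the per-iteration decrease

V^{k+1} \le V^k - \rho\,\|r^{k+1}\|^2 - \rho\,\|B(z^{k+1} - z^k)\|^2,

so summing over k forces both the primal residual and the successive z-differences to vanish.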

Why It Matters

The theorem explains why ADMM is reliable for convex splitting even when g is non-smooth. It also explains what to log: primal residual, dual residual, objective, and penalty parameter.

Failure Mode

Naive multi-block ADMM with three or more primal blocks can diverge without extra assumptions or modifications. Non-convex ADMM has problem-specific stationary-point guarantees, not generic global-optimality guarantees.

Core Example: Lasso

The Lasso problem is

\min_x\; (1/2)\,\|Ax - b\|^2 + \lambda\,\|x\|_1.

Split it as x = z:

f(x) = (1/2)\,\|Ax - b\|^2, \qquad g(z) = \lambda\,\|z\|_1.

Then the x-update is ridge regression:

x^{k+1} = (A^T A + \rho I)^{-1}\left(A^T b + \rho\,(z^k - u^k)\right),

and the z-update is soft-thresholding:

z_i^{k+1} = \operatorname{sign}(x_i^{k+1} + u_i^k)\,\max\{\,|x_i^{k+1} + u_i^k| - \lambda/\rho,\ 0\,\}.

This is why ADMM is loved in structured statistics: one step is linear algebra, one step is a proximal operator, and the dual variable enforces consistency.
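
Putting the two updates together gives a compact solver. The sketch below is a minimal reference implementation of the updates above on synthetic data; the data, ρ = 1, and the fixed iteration count are illustrative assumptions, and a production version would factor AᵀA + ρI once and reuse it.

```python
import numpy as np

def lasso_admm(A, b, lam, rho=1.0, iters=500):
    """ADMM for min 0.5*||Ax - b||^2 + lam*||x||_1 via the x = z split above."""
    m, n = A.shape
    AtA, Atb = A.T @ A, A.T @ b
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    for _ in range(iters):
        # x-update: ridge-regression solve (a real solver would factor once).
        x = np.linalg.solve(AtA + rho * np.eye(n), Atb + rho * (z - u))
        # z-update: soft-thresholding, the proximal operator of (lam/rho)*||.||_1.
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        # dual update: accumulate the primal residual x - z.
        u = u + x - z
    return z

# Illustrative synthetic usage (assumed data, not from the text).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100)
x_true[:5] = 1.0
b = A @ x_true + 0.01 * rng.standard_normal(50)
x_hat = lasso_admm(A, b, lam=0.1)
print("estimated nonzeros:", int(np.count_nonzero(np.abs(x_hat) > 1e-6)))
```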

Where ADMM Shines

  • Moderate-accuracy structured convex problems: Lasso, graphical lasso variants, robust PCA, trend filtering, signal reconstruction.
  • Distributed consensus: local workers solve local subproblems and agree through a global variable (a minimal sketch follows this list).
  • Proximal splitting: smooth and non-smooth terms can use different specialized solvers.
  • Interpretability of logs: residual plots expose feasibility and stationarity separately.
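
In the distributed-consensus setting mentioned above, the same three-step pattern appears per worker: local proximal solves, a global averaging step, and local dual updates. The sketch below assumes each worker exposes a hypothetical prox_list[i] callable that solves its local subproblem.

```python
import numpy as np

def consensus_admm(prox_list, n, rho=1.0, iters=100):
    """Global-variable consensus ADMM: min sum_i f_i(x_i) s.t. x_i = z for all i.

    prox_list[i](v, rho) is assumed to solve the i-th local subproblem
    argmin_w f_i(w) + (rho/2)*||w - v||^2 (one worker each)."""
    N = len(prox_list)
    x = [np.zeros(n) for _ in range(N)]
    u = [np.zeros(n) for _ in range(N)]
    z = np.zeros(n)
    for _ in range(iters):
        # Local solves: embarrassingly parallel across workers.
        x = [prox(z - ui, rho) for prox, ui in zip(prox_list, u)]
        # Global step: average the locally proposed models plus dual offsets.
        z = np.mean([xi + ui for xi, ui in zip(x, u)], axis=0)
        # Local dual updates: each worker tracks its own disagreement with z.
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return z
```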

Where ADMM Is Not Magic

ADMM is often not the best choice for high-precision small convex problems; an interior point method may win. It is also not a replacement for SGD or Adam in most neural-network training. Non-convex ADMM can be useful, but it should be treated as a specialized splitting heuristic unless you have assumptions proving stationarity.

Common Confusions

Watch Out

Rho is not a model regularization parameter

The penalty parameter ρ changes the optimization path, not the target constrained solution in the exact convex setup. The Lasso regularization strength is λ; ADMM's ρ is a solver parameter.

Watch Out

Small objective value is not enough

For constrained splitting problems, a low objective with a large primal residual may be infeasible. Always inspect primal and dual residuals.

Watch Out

ADMM is not just nonlinear Gauss-Seidel

ADMM uses Gauss-Seidel-style alternating primal updates, but the augmented Lagrangian and dual update are essential. Removing the dual update turns it into a penalty-style alternating method with different behavior.

Watch Out

Two-block theory does not automatically extend to many blocks

The classical convergence theorem is for two primal blocks. More blocks require grouping, proximal regularization, strong convexity assumptions, or other safeguards.

Q&A For Mastery

What is the first ADMM debugging plot? Plot ‖r^k‖, ‖s^k‖, the objective, and ρ on the same run. If only the objective is logged, the main ADMM story is missing.

Why does the scaled form use u instead of λ? Because u = λ/ρ makes the quadratic terms cleaner and reveals the residual-accumulation pattern u^{k+1} = u^k + r^{k+1}.

When should I adapt ρ? When primal and dual residuals are badly imbalanced for many iterations. Adaptation should be conservative and should rescale u consistently if the implementation changes ρ.

What To Remember

  • Pure penalties need large ρ and become ill-conditioned.
  • The augmented Lagrangian adds dual feedback, so finite ρ can work.
  • ADMM splits f(x) + g(z) into alternating subproblems plus a dual update driven by the residual.
  • Primal residual means feasibility; dual residual means stationarity movement.
  • Classical ADMM convergence is a two-block convex result. Be cautious with many blocks and non-convex objectives.

Exercises

ExerciseCore

Problem

For the split problem min f(x) + g(z) subject to x = z, write the scaled ADMM updates.

ExerciseAdvanced

Problem

In Lasso ADMM, why is the x-update a ridge regression and the z-update a soft-thresholding step?

ExerciseResearch

Problem

Why can a naive three-block ADMM fail even if every subproblem is convex?

References

Canonical:

  • Hestenes, "Multiplier and Gradient Methods" (1969).
  • Powell, "A Method for Nonlinear Constraints in Minimization Problems" (1969).
  • Gabay and Mercier, "A Dual Algorithm for the Solution of Nonlinear Variational Problems via Finite Element Approximation" (1976).
  • Boyd, Parikh, Chu, Peleato, and Eckstein, "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers" (2011).

Further:

  • Bertsekas, Constrained Optimization and Lagrange Multiplier Methods (1982).
  • Eckstein and Bertsekas, "On the Douglas-Rachford Splitting Method and the Proximal Point Algorithm for Maximal Monotone Operators" (1992).
  • Parikh and Boyd, "Proximal Algorithms" (2014).
  • Hong, Luo, and Razaviyayn, "Convergence Analysis of Alternating Direction Method of Multipliers for a Family of Nonconvex Problems" (2016).

Last reviewed: April 25, 2026
