
Mathematical Infrastructure

Convex Duality

Fenchel conjugates, the Fenchel-Moreau theorem, weak and strong duality, KKT conditions, and why duality gives the kernel trick for SVMs, connects regularization to constraints, and enables adversarial formulations in DRO.


Why This Matters

KKT optimum: ∇f(x*) + λ*∇g(x*) = 0 with λ* > 0 — gradients anti-parallel on the active constraint.

[Figure: KKT feasible region with an active constraint at the optimum. Shown: the unconstrained minimum, the constrained minimum x*, the vectors −∇f(x*) and λ*∇g(x*), the constraint boundary g(x) = 0, and the feasible region g(x) ≤ 0, in the (x₁, x₂) plane.]

Blue dashed circles: level sets of the objective f. Green line: active constraint g(x) = 0. At x*, −∇f and λ*∇g are parallel — the KKT stationarity condition. λ* > 0 because the constraint is active.

Duality is the tool that converts hard optimization problems into equivalent (or approximately equivalent) problems that are easier to solve, analyze, or interpret. In machine learning, duality is not an abstract luxury --- it is the engine behind some of the most important algorithms and insights:

  • SVMs: the dual of the SVM problem reveals the kernel trick, transforming an optimization in parameter space into one in the space of inner products between data points
  • Regularization: constrained ERM (minimize loss subject to a norm constraint) and regularized ERM (minimize loss plus a penalty) are dual to each other. The Lagrange multiplier is the regularization parameter
  • DRO: distributionally robust optimization uses duality to convert a minimax problem (worst-case over distributions) into a tractable regularized problem

If you do not understand duality, you cannot understand why the kernel trick works, why regularization is equivalent to constraining, or how adversarial formulations lead to tractable algorithms.


Duality Ledger

Dual variables act like certificates: they can never overstate the primal optimum, and under strong duality they match it exactly.

The remaining duality gap marks how much room still separates the certificate from the primal optimum.

Certificate

The dual stays a valid lower bound, but the primal optimum can still sit strictly above it.

primal problem: constrained or regularized ERM
dual problem: lower-bound certificate in multiplier space
bridge: Fenchel conjugates and Lagrange multipliers

Mental Model

Every convex minimization problem has a "shadow" problem --- its dual --- which is a maximization problem. The dual always provides a lower bound on the primal optimum (weak duality). Under mild conditions (strong duality), the two optima are equal. At the optimal point, the primal and dual variables satisfy complementary conditions (KKT).

The Fenchel conjugate f^* is the fundamental building block: it converts a function into its "dual representation." For background on convex sets and functions, see convex optimization basics. The conjugate of the conjugate gives back the original function (for closed convex functions), which is the Fenchel-Moreau theorem.

Formal Setup

Definition

Fenchel Conjugate (Convex Conjugate)

The Fenchel conjugate (or convex conjugate) of a function f: \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\} is:

f^*(y) = \sup_{x \in \mathbb{R}^d} \left\{ y^\top x - f(x) \right\}

The conjugate f^*: \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\} is always convex (as a supremum of functions affine in y), even if f is not convex.

Interpretation: f^*(y) measures the maximum "gap" between the linear function x \mapsto y^\top x and f(x). Geometrically, x \mapsto y^\top x - f^*(y) is the tightest affine minorant of f with slope y: f^*(y) fixes the supporting hyperplane of the epigraph of f with slope y.
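The definition can be sanity-checked numerically by brute-forcing the sup over a grid. A minimal NumPy sketch (not from the text; the helper name and grid bounds are illustrative, and the sup is assumed to be attained inside the grid):

```python
import numpy as np

def conjugate_on_grid(f, xs, y):
    """Approximate f*(y) = sup_x { y*x - f(x) } by maximizing over a grid."""
    return np.max(y * xs - f(xs))

xs = np.linspace(-10.0, 10.0, 20001)   # the sup is taken over this 1-D grid
f = lambda x: 0.5 * x**2

# For f(x) = x^2/2, the maximizer is x = y, so f*(y) = y^2/2.
for y in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    assert abs(conjugate_on_grid(f, xs, y) - 0.5 * y**2) < 1e-6
```

The same brute-force helper works for any one-dimensional f whose maximizer lies inside the grid, which makes it a convenient way to test conjugate pairs before using them in a derivation.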

Definition

Young-Fenchel Inequality

For any f and its conjugate f^*, and for all x, y:

x^\top y \leq f(x) + f^*(y)

This is immediate from the definition: f^*(y) \geq y^\top x - f(x). Equality holds if and only if y \in \partial f(x) (the subdifferential of f at x).

Key conjugate pairs (you should know these):

f(x) → f^*(y)

  • f(x) = \frac{1}{2}\|x\|^2 → f^*(y) = \frac{1}{2}\|y\|^2
  • f(x) = \|x\| (any norm) → f^*(y) = \delta_{\|y\|_* \leq 1}(y) (indicator of the dual-norm ball)
  • f(x) = \frac{1}{2}x^\top Ax with A \succ 0 → f^*(y) = \frac{1}{2}y^\top A^{-1}y
  • f(x) = e^x → f^*(y) = y\log y - y for y > 0, 0 for y = 0, +\infty for y < 0
  • f(x) = \delta_C(x) (indicator of a convex set C) → f^*(y) = \sup_{x \in C} y^\top x (support function of C)
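Each pair in the table can be verified the same way as the definition: brute-force the sup over a grid. A NumPy check of the e^x row (grid bounds are illustrative; the sup over x is truncated to a finite interval):

```python
import numpy as np

xs = np.linspace(-20.0, 5.0, 250001)            # truncated domain for the sup
conj = lambda y: np.max(y * xs - np.exp(xs))    # f*(y) for f(x) = e^x

# Table row: f*(y) = y log y - y for y > 0, and f*(0) = 0.
for y in [0.5, 1.0, 2.0, 4.0]:
    assert abs(conj(y) - (y * np.log(y) - y)) < 1e-6
assert abs(conj(0.0)) < 1e-6   # sup_x (-e^x) = 0, approached as x -> -infinity
```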

Main Theorems

Theorem

Fenchel-Moreau Biconjugation Theorem

Statement

If f: \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\} is proper (not identically +\infty and never -\infty), lower-semicontinuous, and convex, then:

f^{**} = f

That is, the conjugate of the conjugate recovers the original function.

Intuition

A closed convex function is completely determined by its supporting hyperplanes. The Fenchel conjugate encodes these hyperplanes. Conjugating twice reconstructs the function from its hyperplane representation. This is loosely analogous to the Fourier transform, where applying the transform twice recovers the original function up to reflection and normalization.

If f is not convex or not lower-semicontinuous, then f^{**} is the closed convex envelope of f --- the largest closed convex function that is pointwise \leq f.

Proof Sketch

One direction is easy: f^{**}(x) = \sup_y \{x^\top y - f^*(y)\} \leq f(x) follows from the Young-Fenchel inequality (for each y, x^\top y - f^*(y) \leq f(x)).

The other direction uses the supporting hyperplane theorem: for every x_0 and every \alpha < f(x_0), the point (x_0, \alpha) lies below the epigraph of f. Since \text{epi}(f) is closed and convex, a separating hyperplane gives a y with y^\top x_0 - f^*(y) \geq \alpha. Since \alpha < f(x_0) is arbitrary, f^{**}(x_0) \geq f(x_0).

Why It Matters

Fenchel-Moreau is the theoretical foundation of all convex duality. It says that every closed convex function has a perfect "dual representation" via its conjugate. This enables:

  1. Converting between constrained and penalized formulations: the conjugate of an indicator function is a support function, and vice versa. This is exactly the duality between norm-constrained and norm-penalized optimization.

  2. Deriving dual problems: the Lagrangian dual of a convex problem can be written in terms of Fenchel conjugates. Biconjugation (f^{**} = f) is what makes the dual representation exact, leaving no duality gap.

  3. Variational representations: many quantities in information theory (KL divergence, entropy) have dual representations via Fenchel conjugates. The Donsker-Varadhan variational formula for KL divergence is a consequence.

Failure Mode

If f is not closed (lower-semicontinuous) or not convex, then f^{**} \neq f in general. The biconjugate f^{**} is the closed convex envelope, which can differ significantly. For example, if f(x) = -|x|, then f is concave, f^* \equiv +\infty, and so f^{**}(x) = -\infty for all x. Always verify closedness and convexity before applying Fenchel-Moreau.
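The envelope behavior is easy to see numerically on a double-well function: the biconjugate flattens the nonconvex region and agrees with f where f is already convex. A grid-based NumPy sketch (the function and grids are illustrative, not from the text):

```python
import numpy as np

xs = np.linspace(-4.0, 4.0, 1601)    # x-grid
ys = np.linspace(-10.0, 10.0, 4001)  # slope grid for both conjugations
f = np.minimum((xs - 1.0)**2, (xs + 1.0)**2)   # nonconvex double well

# f*(y) = sup_x { yx - f(x) } and f**(x) = sup_y { xy - f*(y) }, on grids.
f_star = np.max(ys[:, None] * xs[None, :] - f[None, :], axis=1)
f_bi = np.max(xs[:, None] * ys[None, :] - f_star[None, :], axis=1)

i0 = np.argmin(np.abs(xs))        # x = 0: between the wells
i2 = np.argmin(np.abs(xs - 2.0))  # x = 2: f locally convex there
assert abs(f[i0] - 1.0) < 1e-9    # f(0) = 1
assert abs(f_bi[i0]) < 1e-6       # f**(0) = 0: the convex envelope is flat on [-1, 1]
assert abs(f_bi[i2] - 1.0) < 1e-4 # f**(2) = f(2) = 1 where f is already convex
```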

Lagrangian Duality

Definition

Lagrangian and Dual Problem

Consider the primal problem:

\min_x f(x) \quad \text{subject to } g_i(x) \leq 0, \; i = 1, \ldots, m

The Lagrangian is:

L(x, \lambda) = f(x) + \sum_{i=1}^m \lambda_i g_i(x), \quad \lambda_i \geq 0

The dual function is q(\lambda) = \inf_x L(x, \lambda). The dual problem is:

\max_{\lambda \geq 0} q(\lambda)

The dual function q is always concave (as an infimum of functions affine in \lambda), regardless of whether the primal is convex.
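For a one-dimensional quadratic program the dual function comes out in closed form, which makes weak duality, concavity, and (here) strong duality directly checkable. A NumPy sketch for min x^2 subject to 1 - x <= 0 (an illustrative problem, not from the text):

```python
import numpy as np

# L(x, lam) = x^2 + lam*(1 - x); the inf over x is attained at x = lam/2,
# giving the dual function q(lam) = lam - lam^2/4 in closed form.
q = lambda lam: lam - lam**2 / 4.0
p_star = 1.0                         # primal optimum: x* = 1, f(x*) = 1

lams = np.linspace(0.0, 6.0, 601)
assert np.all(q(lams) <= p_star + 1e-12)       # weak duality: q(lam) <= p*
assert abs(np.max(q(lams)) - p_star) < 1e-9    # strong duality here: d* = p* = 1

# Concavity of q: the midpoint value dominates the chord midpoint.
a, b = 0.5, 5.0
assert q((a + b) / 2.0) >= (q(a) + q(b)) / 2.0 - 1e-12
```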

Definition

Weak and Strong Duality

Let p^* be the primal optimal value and d^* the dual optimal value.

Weak duality (always holds): d^* \leq p^*. The dual provides a lower bound on the primal.

Strong duality: d^* = p^*. The duality gap is zero.

Strong duality holds for convex problems under Slater's condition: there exists a strictly feasible point \bar{x} with g_i(\bar{x}) < 0 for all i.

Theorem

Strong Duality via Slater's Condition

Statement

If f and g_1, \ldots, g_m are convex and there exists a strictly feasible point \bar{x} with g_i(\bar{x}) < 0 for all i, then strong duality holds: the primal and dual optimal values are equal, and the dual optimum is attained.

Intuition

Slater's condition says the feasible region has a non-empty interior (the constraints are not "barely" satisfied). This prevents pathological cases where the primal feasible set is "too thin" for duality to work. In practice, Slater's condition holds for almost every convex optimization problem you encounter in ML.

Proof Sketch

Consider the set \mathcal{V} = \{(u, t) : \exists x, \; g_i(x) \leq u_i, \; f(x) \leq t\} \subseteq \mathbb{R}^{m+1}. This set is convex (because f and the g_i are convex). The point (0, p^*) is on the boundary of \mathcal{V}. By the supporting hyperplane theorem, there exists a hyperplane separating (0, p^*) from the interior of \mathcal{V}. This hyperplane defines the optimal dual variables \lambda^*. Slater's condition ensures the hyperplane has the right orientation (the \lambda components are non-negative), which gives q(\lambda^*) = p^*.

Why It Matters

Strong duality is what makes dual methods actually useful:

  • SVM dual: the primal SVM minimizes over w (potentially high-dimensional). The dual maximizes over \alpha_i (one per data point). Under strong duality, both give the same answer. The dual depends on the data only through inner products x_i^\top x_j, enabling the kernel trick: replace inner products with k(x_i, x_j).

  • Regularization duality: minimizing f(x) subject to \|x\| \leq r is equivalent (by strong duality) to minimizing f(x) + \lambda\|x\| for some \lambda \geq 0. The Lagrange multiplier \lambda is the regularization parameter. This is why constraint-based and penalty-based regularization are interchangeable.

  • DRO: the worst-case expected loss over a ball of distributions can be dualized into a regularized empirical risk problem. This converts an intractable minimax problem into a tractable convex program.

The Legendre transform that underlies conjugacy also appears in information geometry: the natural-gradient and mirror-descent equivalence is a duality between natural parameters \theta and expectation parameters \eta, mediated by the Legendre transform of the log-partition function.

Failure Mode

Without Slater's condition, strong duality can fail with a strictly positive gap. Consider the convex problem:

\min_{x, y} \; e^{-x} \quad \text{s.t.} \quad \frac{x^2}{y} \leq 0, \quad y > 0

The constraint function g(x, y) = x^2/y is convex on the open half-space y > 0, and the objective e^{-x} is convex, so this is a convex program.

Primal. Feasibility requires x^2/y \leq 0 with y > 0, which forces x = 0. Every feasible point has the form (0, y) with y > 0, giving objective value e^0 = 1. So p^* = 1.

Slater fails. Slater's condition requires a strictly feasible point: some (x, y) with x^2/y < 0 and y > 0. But x^2/y \geq 0 whenever y > 0, so no such point exists.

Dual. The Lagrangian is L(x, y, \lambda) = e^{-x} + \lambda x^2/y on y > 0, with \lambda \geq 0. Both terms are non-negative, so L \geq 0. For any \lambda \geq 0, take y = x^3 and let x \to +\infty: then e^{-x} \to 0 and \lambda x^2/y = \lambda/x \to 0, so \inf_{x, y > 0} L(x, y, \lambda) = 0. Thus q(\lambda) = 0 for all \lambda \geq 0, giving d^* = 0.

Conclusion. d^* = 0 < 1 = p^*, a positive duality gap of 1. Slater's condition is sufficient (not necessary) for strong duality under convexity; when it fails, the gap can be strictly positive.
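The gap can be confirmed numerically: every feasible point has objective value 1, while the Lagrangian can be driven toward 0 along a curve such as y = x^3. A small NumPy sketch (the sample points are illustrative):

```python
import numpy as np

obj = lambda x: np.exp(-x)

# Primal: feasibility (x^2/y <= 0 with y > 0) forces x = 0, so p* = 1.
p_star = obj(0.0)
assert p_star == 1.0

# Dual: along y = x^3 with x -> +infinity, both Lagrangian terms vanish,
# so q(lam) = inf L = 0 for every lam >= 0 (note L >= 0, both terms nonnegative).
for lam in [0.0, 0.5, 3.0]:
    x = np.array([1.0, 10.0, 100.0, 1000.0])
    L = obj(x) + lam * x**2 / x**3     # = e^{-x} + lam/x
    assert L[-1] < 1e-2                # already close to the infimum 0
```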

KKT Conditions

Definition

Karush-Kuhn-Tucker (KKT) Conditions

For the convex problem \min_x f(x) subject to g_i(x) \leq 0, under strong duality, the primal-dual pair (x^*, \lambda^*) is optimal if and only if:

  1. Primal feasibility: g_i(x^*) \leq 0 for all i
  2. Dual feasibility: \lambda_i^* \geq 0 for all i
  3. Stationarity: 0 \in \partial f(x^*) + \sum_i \lambda_i^* \partial g_i(x^*)
  4. Complementary slackness: \lambda_i^* g_i(x^*) = 0 for all i

Complementary slackness says: either a constraint is tight (g_i(x^*) = 0) or its multiplier is zero (\lambda_i^* = 0). Active constraints "matter"; inactive constraints are irrelevant. In SVMs, the data points with \lambda_i^* > 0 are the support vectors.
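The four conditions can be checked mechanically on a toy problem. Consider min (x - 2)^2 subject to x - 1 <= 0 (an illustrative example, not from the text): the unconstrained minimum x = 2 is infeasible, so the constraint is active and stationarity pins down the multiplier.

```python
# Problem: min (x - 2)^2  s.t.  g(x) = x - 1 <= 0.
# Active constraint: x* = 1; stationarity 2*(x* - 2) + lam* = 0 gives lam* = 2.
x_star, lam_star = 1.0, 2.0
g = x_star - 1.0

assert g <= 0.0                                      # 1. primal feasibility
assert lam_star >= 0.0                               # 2. dual feasibility
assert abs(2.0 * (x_star - 2.0) + lam_star) < 1e-12  # 3. stationarity
assert abs(lam_star * g) < 1e-12                     # 4. complementary slackness

# Inactive case: with the constraint x - 5 <= 0 instead, x* = 2 and lam* = 0,
# so complementary slackness holds because the multiplier vanishes.
assert 0.0 * (2.0 - 5.0) == 0.0
```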

Sion Minimax Theorem

Theorem

Sion Minimax Theorem

If \mathcal{X} is convex and compact, \mathcal{Y} is convex, and \phi(x, y) is convex-concave (convex in x for each y, concave in y for each x), lower-semicontinuous in x and upper-semicontinuous in y, then:

\min_{x \in \mathcal{X}} \sup_{y \in \mathcal{Y}} \phi(x, y) = \sup_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} \phi(x, y)

The min and sup can be interchanged. This is a generalization of von Neumann's minimax theorem for zero-sum games.

Why it matters for ML: The minimax theorem underlies adversarial formulations. In GANs, DRO, and robust optimization, you want to swap a min over model parameters with a max over adversarial perturbations. The Sion theorem tells you when this swap is valid.
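The theorem can be sanity-checked on the smallest zero-sum game. For matching pennies, \phi is bilinear (hence convex-concave) over compact mixed-strategy sets, and min-sup equals sup-min at the game value 0. A NumPy grid sketch (illustrative, not from the text):

```python
import numpy as np

# Matching pennies: phi(p, q) = [p, 1-p] A [q, 1-q]^T with payoff matrix A.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
grid = np.linspace(0.0, 1.0, 201)
X = np.stack([grid, 1.0 - grid], axis=1)   # rows are mixed strategies
phi = X @ A @ X.T                          # phi[i, j] = phi(p_i, q_j)

minimax = phi.max(axis=1).min()   # min over p of max over q
maximin = phi.min(axis=0).max()   # max over q of min over p

assert maximin <= minimax + 1e-12     # the easy direction, always true
assert abs(minimax - maximin) < 1e-9  # Sion: the two coincide
assert abs(minimax) < 1e-9            # game value 0, at p = q = 1/2
```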

Fenchel-Rockafellar Duality

Many problems in machine learning take the composite form \inf_x f(x) + g(Ax), where f and g are proper convex lower-semicontinuous functions on \mathbb{R}^d and \mathbb{R}^n respectively, and A \in \mathbb{R}^{n \times d} is a linear operator. The Fenchel-Rockafellar dual is:

\sup_y \; -f^*(-A^\top y) - g^*(y)

Weak duality always holds. Strong duality holds under mild closedness and constraint-qualification conditions, for instance 0 \in \text{int}(\text{dom}(g) - A\, \text{dom}(f)) (Rockafellar 1970, Theorem 31.1). This pattern is the backbone of composite optimization: the lasso dual takes f(x) = \lambda\|x\|_1, g(z) = \frac{1}{2}\|z - b\|^2, and A the design matrix; TV denoising takes f(x) = \frac{1}{2}\|x - b\|^2 as the data-fit term and g(z) = \lambda\|z\|_1 with A a discrete gradient operator. Writing both sides simultaneously exposes primal-dual algorithms (Chambolle-Pock, ADMM) that exploit the conjugate structure of each component.
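For the lasso instance of this pattern, f^*(u) is the indicator of the \ell_\infty ball of radius \lambda and g^*(y) = \frac{1}{2}\|y\|^2 + b^\top y, so the dual maximizes -\frac{1}{2}\|y\|^2 - b^\top y over \|A^\top y\|_\infty \leq \lambda. A NumPy sketch verifying a vanishing duality gap on a random instance (solver choice, sizes, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 8))
b = rng.standard_normal(20)
lam = 0.5

soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Primal min_x lam*||x||_1 + 0.5*||Ax - b||^2 via proximal gradient (ISTA).
t = 1.0 / np.linalg.norm(A, 2) ** 2        # step size 1 / ||A||_2^2
x = np.zeros(8)
for _ in range(5000):
    x = soft(x - t * (A.T @ (A @ x - b)), t * lam)
primal = lam * np.abs(x).sum() + 0.5 * np.sum((A @ x - b) ** 2)

# Dual candidate y = Ax - b, the optimal dual variable at the solution.
y = A @ x - b
dual = -0.5 * np.sum(y**2) - b @ y
assert np.max(np.abs(A.T @ y)) <= lam + 1e-8   # dual feasibility
assert primal >= dual - 1e-10                  # weak duality
assert abs(primal - dual) < 1e-6               # strong duality: zero gap
```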

Moreau-Yosida Regularization and Proximal Operators

Definition

Moreau Envelope and Proximal Operator

For a proper convex lsc function f and \lambda > 0, the Moreau envelope (or Moreau-Yosida regularization) is:

f_\lambda(x) = \inf_y \left\{ f(y) + \frac{1}{2\lambda}\|y - x\|^2 \right\}

The proximal operator is the unique minimizer:

\text{prox}_{\lambda f}(x) = \arg\min_y \left\{ f(y) + \frac{1}{2\lambda}\|y - x\|^2 \right\}

The envelope f_\lambda is convex, continuously differentiable with \nabla f_\lambda(x) = \frac{1}{\lambda}(x - \text{prox}_{\lambda f}(x)), and satisfies f_\lambda \uparrow f pointwise as \lambda \downarrow 0. The key duality is the Moreau decomposition:

x = \text{prox}_{\lambda f}(x) + \lambda \, \text{prox}_{f^*/\lambda}(x/\lambda)

Proximal operators are to non-smooth convex functions what gradients are to smooth ones. Proximal-gradient methods (ISTA, FISTA) and ADMM use them as the main iteration building block, turning non-smooth regularized problems into tractable updates. See Rockafellar 1970, Chapter 12, and Parikh and Boyd, Proximal Algorithms, Foundations and Trends in Optimization 1(3):127-239, 2014.
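For f = \|\cdot\|_1 these objects are explicit: \text{prox}_{\lambda f} is soft-thresholding, f^* is the indicator of the \ell_\infty unit ball (so its prox at any scaling is the projection \text{clip}(\cdot, -1, 1)), and the Moreau envelope is the Huber function. A NumPy sketch checking the decomposition and the gradient formula (test points are illustrative):

```python
import numpy as np

lam = 0.7
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.array([-3.0, -0.4, 0.0, 0.2, 1.5])
p = soft(x, lam)                      # prox_{lam f}(x) for f = ||.||_1
d = np.clip(x / lam, -1.0, 1.0)       # prox_{f*/lam}(x/lam): projection

# Moreau decomposition: x = prox_{lam f}(x) + lam * prox_{f*/lam}(x/lam).
assert np.allclose(x, p + lam * d)

# Envelope gradient: grad f_lam(x) = (x - prox_{lam f}(x)) / lam, which for
# the l1 norm is the Huber gradient: x/lam on [-lam, lam], sign(x) outside.
g = (x - p) / lam
expected = np.where(np.abs(x) <= lam, x / lam, np.sign(x))
assert np.allclose(g, expected)
```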

Duality in Distributionally Robust Optimization

Wasserstein distributionally robust optimization replaces the empirical risk \mathbb{E}_{\hat P_n}[\ell(\theta, \xi)] by the worst case over a Wasserstein ball \{P : W_p(P, \hat P_n) \leq \varepsilon\}. Strong duality gives a tractable single-variable dual:

\sup_{P : W_p(P, \hat P_n) \leq \varepsilon} \mathbb{E}_P[\ell(\theta, \xi)] = \inf_{\gamma \geq 0} \left\{ \gamma \varepsilon^p + \mathbb{E}_{\hat P_n}\left[ \sup_\xi \{\ell(\theta, \xi) - \gamma \|\xi - \hat\xi\|^p\} \right] \right\}

Under mild regularity this collapses to a variance or gradient-norm penalty on the ERM objective, linking robustness to standard regularization. See Esfahani and Kuhn, Math. Prog. 171:115-166, 2018 (arXiv:1505.05116), and Blanchet and Murthy, Math. Oper. Res. 44(2):565-600, 2019 (arXiv:1604.01446).

Canonical Examples

Example

Conjugate of the squared norm

Let f(x) = \frac{1}{2}\|x\|^2. Then:

f^*(y) = \sup_x \left\{ y^\top x - \frac{1}{2}\|x\|^2 \right\}

Taking the gradient in x and setting it to zero: y - x = 0, so x^* = y. Substituting: f^*(y) = y^\top y - \frac{1}{2}\|y\|^2 = \frac{1}{2}\|y\|^2.

The squared norm is its own conjugate. This self-duality is why \ell_2 regularization leads to particularly clean dual problems (e.g., ridge regression).

Example

Duality between constrained and penalized optimization

Primal (constrained): \min_w f(w) subject to \|w\| \leq r

Lagrangian: L(w, \lambda) = f(w) + \lambda(\|w\| - r)

Dual: \max_{\lambda \geq 0} \left\{ \inf_w [f(w) + \lambda\|w\|] - \lambda r \right\}

The inner minimization \inf_w [f(w) + \lambda\|w\|] is exactly the penalized problem with regularization parameter \lambda.

By strong duality, there exists \lambda^* such that the constrained problem with radius r and the penalized problem with parameter \lambda^* have the same solution. This is the rigorous justification for the equivalence between norm-constrained and norm-penalized regularization.
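The equivalence is concrete even in one dimension. For f(w) = (w - 3)^2 with constraint |w| \leq 1 (an illustrative example, not from the text), the constrained solution is w^* = 1, and KKT stationarity 2(w^* - 3) + \lambda^* = 0 gives \lambda^* = 4; the penalized problem with that \lambda^* recovers the same solution. A NumPy grid check:

```python
import numpy as np

ws = np.linspace(-2.0, 4.0, 60001)
f = (ws - 3.0) ** 2

# Constrained: min (w - 3)^2  s.t. |w| <= 1   ->  w* = 1 (on the boundary).
feasible = np.abs(ws) <= 1.0
w_con = ws[feasible][np.argmin(f[feasible])]

# Penalized with the matching multiplier lam* = 4: min (w - 3)^2 + 4|w|.
lam_star = 4.0
w_pen = ws[np.argmin(f + lam_star * np.abs(ws))]

assert abs(w_con - 1.0) < 1e-3
assert abs(w_pen - 1.0) < 1e-3   # same minimizer: constraint <-> penalty
```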

Common Confusions

Watch Out

Weak duality always holds; strong duality requires conditions

Students sometimes assume that the primal and dual always have the same optimal value. Weak duality (d^* \leq p^*) is trivially true, but strong duality (d^* = p^*) requires assumptions like convexity plus Slater's condition. For non-convex problems, the duality gap can be arbitrarily large. Always check the conditions before claiming strong duality.

Watch Out

The dual problem is always concave, even if the primal is non-convex

The dual function q(\lambda) = \inf_x L(x, \lambda) is concave in \lambda regardless of whether f or the g_i are convex, because q is a pointwise infimum of functions affine in \lambda. The dual is always a concave maximization problem, which is computationally tractable. The catch: for non-convex primal problems the gap is typically strictly positive (d^* < p^*), and the dual only gives a lower bound.

Watch Out

KKT conditions are necessary and sufficient only under strong duality

For general non-convex problems, KKT is only necessary (assuming constraint qualification). For convex problems with strong duality, KKT is both necessary and sufficient. In ML, most problems of interest (SVMs, regularized linear models, DRO) are convex and satisfy Slater's condition, so KKT gives exact optimality conditions.

Summary

  • Fenchel conjugate: f^*(y) = \sup_x \{y^\top x - f(x)\}
  • Young-Fenchel inequality: x^\top y \leq f(x) + f^*(y), with equality iff y \in \partial f(x)
  • Fenchel-Moreau: f^{**} = f for closed convex functions
  • Weak duality always holds (d^* \leq p^*); strong duality requires convexity plus Slater's condition
  • KKT conditions: primal feasibility, dual feasibility, stationarity, complementary slackness
  • Duality converts constrained problems into penalized problems (regularization = Lagrangian duality)
  • SVM dual reveals the kernel trick; DRO dual yields tractable regularization

Exercises

ExerciseCore

Problem

Compute the Fenchel conjugate of f(x) = \|x\|_1 (the \ell_1 norm on \mathbb{R}^d). What is the dual norm of \ell_1?

ExerciseAdvanced

Problem

Derive the dual of the soft-margin SVM problem:

\min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{s.t. } y_i(w^\top x_i + b) \geq 1 - \xi_i, \; \xi_i \geq 0

Show that the dual depends on the data only through inner products x_i^\top x_j, which enables the kernel trick.

Further directions

  • LP duality as a special case (primal-dual simplex)
  • Lasso dual and ridge dual in compact form
  • Subdifferential calculus: sum rule, chain rule, composition rule
  • Fenchel duality in Banach spaces (infinite-dimensional generalization)
  • Rockafellar's closedness conditions for strong duality without Slater

References

Canonical:

  • Rockafellar, Convex Analysis (1970), Chapters 12, 26-28, 31
  • Boyd & Vandenberghe, Convex Optimization (2004), Chapter 5
  • Borwein & Lewis, Convex Analysis and Nonlinear Optimization (2nd ed., 2006)

Current:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 15 (SVM duality)
  • Ben-Tal, El Ghaoui, Nemirovski, Robust Optimization (2009), Chapter 4


Last reviewed: April 18, 2026
