
Optimization Function Classes

Subgradients and Subdifferentials

The non-smooth generalization of the gradient for convex functions. Subgradients enable optimality conditions, calculus rules, and convergence guarantees for L1-regularized problems, hinge loss SVMs, and proximal algorithms where the objective is not differentiable.


Why This Matters

[Interactive figure: Subgradient Support. At a smooth point there is one supporting slope; at a kink there is a whole interval of valid subgradients — e.g. $\partial f(0) = [-0.8, 1.2]$, drawn as a fan of supporting lines at the kink.]

Most modern ML objectives are not smooth. The L1 penalty in lasso regression is not differentiable at zero. The hinge loss in support vector machines kinks at the margin. The ReLU activation has no derivative at zero. The total-variation regularizer is non-differentiable wherever the signal is locally constant.

Standard gradient-based optimization assumes the gradient exists, which is exactly what fails on these problems. The right replacement is the subgradient, which generalizes the gradient to convex but non-smooth functions. Every convex function has a non-empty subdifferential at every interior point of its domain, even where the gradient does not exist. This is what makes proximal gradient methods, the lasso, hinge-loss SVMs, and ADMM work as theory rather than heuristics.

This page is the foundational reference. Convex-optimization-basics covers the smooth case; this page covers the non-smooth case that nearly every sparsity-inducing or regularized ML method actually uses.

Definition

Definition

Subgradient

Let $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ be convex. A vector $g \in \mathbb{R}^d$ is a subgradient of $f$ at $x$ if

$$f(y) \geq f(x) + \langle g, y - x \rangle \quad \text{for all } y \in \mathbb{R}^d.$$

The set of all subgradients of $f$ at $x$ is the subdifferential:

$$\partial f(x) = \{g \in \mathbb{R}^d : f(y) \geq f(x) + \langle g, y - x\rangle \ \forall y\}.$$

A subgradient is any vector $g$ such that the affine function $y \mapsto f(x) + \langle g, y - x\rangle$ stays below $f$ everywhere and touches it at $x$. This is the convex-analytic generalization of the tangent line: for a smooth convex function, only one such affine support exists at each point (the tangent), and $g = \nabla f(x)$. At a kink, an entire family of supporting affine functions exists, and $\partial f(x)$ is the set of their slopes.

Examples

Absolute value $f(x) = |x|$ on $\mathbb{R}$.

$$\partial f(x) = \begin{cases} \{1\} & x > 0 \\ [-1, 1] & x = 0 \\ \{-1\} & x < 0 \end{cases}$$

Away from zero, $f$ is differentiable and the subdifferential is the singleton containing the derivative. At zero, every slope in $[-1, 1]$ gives a supporting line that stays below $|x|$. The subdifferential at the kink is the closed interval of all such slopes.
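The supporting-line picture is easy to check numerically. A minimal sketch (the helper `is_subgradient` is ad hoc, written for this page) tests the subgradient inequality on a grid of points:

```python
import numpy as np

def is_subgradient(f, g, x, ys, tol=1e-12):
    """Check the subgradient inequality f(y) >= f(x) + g*(y - x) on a grid of test points."""
    return all(f(y) >= f(x) + g * (y - x) - tol for y in ys)

ys = np.linspace(-3.0, 3.0, 601)

# At the kink x = 0, every slope in [-1, 1] gives a supporting line under |x| ...
assert all(is_subgradient(abs, g, 0.0, ys) for g in (-1.0, -0.3, 0.0, 0.7, 1.0))
# ... while slopes outside [-1, 1] cut through the graph somewhere.
assert not is_subgradient(abs, 1.1, 0.0, ys)
# Away from the kink the subdifferential is a singleton: at x = 2, only g = 1 works.
assert is_subgradient(abs, 1.0, 2.0, ys)
assert not is_subgradient(abs, 0.9, 2.0, ys)
```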

$\ell_1$ norm $f(x) = \|x\|_1 = \sum_i |x_i|$ on $\mathbb{R}^d$.

$$\partial f(x) = \{g \in \mathbb{R}^d : g_i = \text{sign}(x_i) \text{ if } x_i \neq 0, \ g_i \in [-1, 1] \text{ if } x_i = 0\}.$$

Coordinate-wise: each $g_i$ is the sign of $x_i$ at non-zero coordinates and any value in $[-1, 1]$ at zero coordinates. This is the subdifferential that drives soft-thresholding in proximal methods.
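The coordinate-wise description translates directly into a membership test; a short sketch (the helper name is our own):

```python
import numpy as np

def in_l1_subdifferential(g, x, tol=1e-12):
    """Coordinate-wise test for membership of g in the subdifferential of ||.||_1 at x."""
    nz = np.abs(x) > tol
    return bool(np.allclose(g[nz], np.sign(x[nz])) and np.all(np.abs(g[~nz]) <= 1 + tol))

x = np.array([2.0, 0.0, -3.0, 0.0])

# sign(x_i) at nonzero coordinates, anything in [-1, 1] at zero coordinates:
assert in_l1_subdifferential(np.array([1.0, 0.0, -1.0, 0.0]), x)
assert in_l1_subdifferential(np.array([1.0, 0.7, -1.0, -0.4]), x)
# Wrong sign at a nonzero coordinate, or magnitude > 1 at a zero coordinate, fails:
assert not in_l1_subdifferential(np.array([-1.0, 0.0, -1.0, 0.0]), x)
assert not in_l1_subdifferential(np.array([1.0, 1.5, -1.0, 0.0]), x)
```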

Hinge loss $f(x) = \max(0, 1 - x)$.

$$\partial f(x) = \begin{cases} \{-1\} & x < 1 \\ [-1, 0] & x = 1 \\ \{0\} & x > 1 \end{cases}$$

The slope is $-1$ in the loss region, $0$ in the no-loss region, and the entire interval $[-1, 0]$ at the kink. The kink at $x = 1$ is exactly the SVM margin.
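The three cases translate directly into code. A small sketch, with the interval representation as a convenience rather than any standard API:

```python
def hinge_subdifferential(x):
    """Subdifferential of the hinge loss max(0, 1 - x), returned as an interval (lo, hi)."""
    if x < 1:
        return (-1.0, -1.0)   # loss region: unique slope -1
    if x > 1:
        return (0.0, 0.0)     # no-loss region: unique slope 0
    return (-1.0, 0.0)        # kink at the margin: the whole interval [-1, 0]

assert hinge_subdifferential(0.5) == (-1.0, -1.0)
assert hinge_subdifferential(2.0) == (0.0, 0.0)
assert hinge_subdifferential(1.0) == (-1.0, 0.0)
```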

Indicator function $\delta_C$ of a closed convex set $C$.

$$\partial \delta_C(x) = \{g : \langle g, y - x\rangle \leq 0 \ \forall y \in C\} = N_C(x)$$

This is the normal cone to $C$ at $x$: the set of outward-pointing directions that do not enter $C$. At interior points, $N_C(x) = \{0\}$; on the boundary, the normal cone is non-trivial. This is how convex constraints enter optimality conditions.

Smooth convex $f$. $\partial f(x) = \{\nabla f(x)\}$ at every point of differentiability.

Existence

Theorem

Existence of Subgradients

Statement

For any proper convex $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ and any point $x$ in the relative interior of $\text{dom}\, f$, the subdifferential $\partial f(x)$ is non-empty, closed, and convex; at interior points of $\text{dom}\, f$ it is additionally bounded.

If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.

If $x$ is on the boundary of $\text{dom}\, f$, or $f$ is not lower semicontinuous at $x$, the subdifferential may be empty.

Intuition

Geometrically: a convex function has supporting hyperplanes at every interior point of its domain (this is the supporting hyperplane theorem applied to the epigraph). Each supporting hyperplane has a normal vector whose first $d$ coordinates form a subgradient. The set of such normals forms a closed convex set, which is exactly the subdifferential.

Proof Sketch

Apply the supporting hyperplane theorem to the epigraph $\text{epi}\, f = \{(x, t) : t \geq f(x)\}$ at the boundary point $(x, f(x))$. The supporting hyperplane has normal $(g, -1)$ for some $g \in \mathbb{R}^d$ (the $-1$ is forced because the epigraph extends upward in $t$). The supporting condition becomes $f(y) - f(x) \geq \langle g, y - x\rangle$ for all $y$, which is exactly the subgradient inequality. Closedness and convexity of $\partial f(x)$ follow because it is an intersection of half-spaces. Boundedness on the interior follows from $f$ being locally Lipschitz on the interior of its domain.

Why It Matters

Existence on the interior is what licenses every algorithm that picks "any subgradient" at each step. Without existence you could not even define a subgradient method, let alone analyze it. The boundary caveat matters for indicator functions of constraints, where the subdifferential at constraint boundaries is the normal cone (which can be unbounded).

Failure Mode

The subdifferential is empty at points outside $\text{dom}\, f$, and can be empty at boundary points where $f$ jumps from finite to $+\infty$. Example: $f(x) = -\sqrt{x}$ for $x \geq 0$, $f(x) = +\infty$ for $x < 0$. At $x = 0$, every candidate subgradient $g$ would need $-\sqrt{y} \geq g \cdot y$ for all $y \geq 0$, which forces $g \leq -\sqrt{y}/y = -1/\sqrt{y} \to -\infty$ as $y \to 0^+$. No finite $g$ works, so $\partial f(0) = \emptyset$.
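This failure is easy to confirm numerically. The sketch below (helper name ours) checks that every finite candidate slope violates the subgradient inequality at some $y$ near $0$:

```python
import numpy as np

# f(x) = -sqrt(x) on [0, inf). A subgradient at 0 would need -sqrt(y) >= g*y for
# all y >= 0, i.e. g <= -1/sqrt(y), which tends to -infinity as y -> 0+.
def violates_at_zero(g, ys):
    """True if the subgradient inequality for f at x = 0 fails at some test point y."""
    return any(-np.sqrt(y) < g * y - 1e-12 for y in ys)

ys = np.logspace(-14, 0, 200)          # test points approaching 0 from the right
for g in (0.0, -10.0, -1e3, -1e6):    # ever more negative candidate slopes all fail
    assert violates_at_zero(g, ys)
```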

Optimality Condition

The single most important consequence of the subdifferential framework: it gives a clean optimality condition for non-smooth convex minimization.

Theorem

Subdifferential Optimality Condition

Statement

$x^*$ is a global minimizer of $f$ if and only if

$$0 \in \partial f(x^*).$$

This is the non-smooth analogue of $\nabla f(x^*) = 0$ for smooth convex functions.

Intuition

The subgradient inequality $f(y) \geq f(x^*) + \langle g, y - x^*\rangle$ with $g = 0$ becomes $f(y) \geq f(x^*)$ for all $y$, which is exactly the definition of $x^*$ being a global minimizer. Conversely, if $x^*$ is a global minimizer, the zero vector $g = 0$ satisfies the defining inequality.

Proof Sketch

If $0 \in \partial f(x^*)$: by the subgradient inequality with $g = 0$, $f(y) \geq f(x^*) + 0 = f(x^*)$ for all $y$. So $x^*$ is a global minimizer.

If $x^*$ is a global minimizer: take $g = 0$. Then $f(y) \geq f(x^*) = f(x^*) + \langle 0, y - x^*\rangle$ holds for all $y$, so $0 \in \partial f(x^*)$.

Why It Matters

This is the optimality condition used to derive the soft-thresholding operator (the proximal operator of the $\ell_1$ norm), to prove KKT conditions for constrained problems, and to characterize lasso solutions. For the lasso $\min_w \tfrac{1}{2} \|y - Xw\|_2^2 + \lambda \|w\|_1$, the optimality condition $0 \in \partial f(w^*)$ becomes $X^\top(y - Xw^*) \in \lambda \, \partial \|w^*\|_1$, which gives the explicit characterization that $|X_j^\top(y - Xw^*)| \leq \lambda$ for zero coordinates of $w^*$, and equality with the right sign at non-zero coordinates.
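As a sanity check, a short proximal-gradient (ISTA) loop can solve a small lasso instance and verify exactly this characterization. The code is a generic sketch, not tied to any particular library:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*|.|: shrink toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, steps=5000):
    """Proximal gradient (ISTA) for min_w 0.5*||y - Xw||^2 + lam*||w||_1."""
    L = np.linalg.norm(X, 2) ** 2     # Lipschitz constant of the smooth part's gradient
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = soft_threshold(w + X.T @ (y - X @ w) / L, lam / L)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
w_true = np.zeros(10)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.01 * rng.standard_normal(50)

lam = 5.0
w = lasso_ista(X, y, lam)
corr = X.T @ (y - X @ w)              # negative gradient of the smooth part

# The optimality condition 0 in the subdifferential, coordinate by coordinate:
zero = np.abs(w) < 1e-10
assert np.all(np.abs(corr[zero]) <= lam + 1e-3)                      # |X_j^T r| <= lam at zeros
assert np.allclose(corr[~zero], lam * np.sign(w[~zero]), atol=1e-3)  # equality, matching sign
```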

Failure Mode

This optimality condition characterizes global minimizers; for non-convex $f$, $0 \in \partial f(x^*)$ (with $\partial$ generalized to Clarke subgradients) is necessary but not sufficient for local minimality, and saddle points satisfy it. The full strength of "iff global minimum" requires convexity.

Subgradient Calculus

The subdifferential satisfies calculus rules analogous to gradient rules, with a few caveats around equality versus containment.

Sum rule. For convex $f, g$: $\partial f(x) + \partial g(x) \subseteq \partial(f + g)(x)$, with equality if $\text{relint}(\text{dom}\, f) \cap \text{relint}(\text{dom}\, g) \neq \emptyset$.

Scalar multiplication. For $\alpha > 0$: $\partial(\alpha f)(x) = \alpha \, \partial f(x)$.

Affine pre-composition. For $f$ convex, $A \in \mathbb{R}^{m \times d}$, and $b \in \mathbb{R}^m$: $\partial(f(Ax + b))(x) = A^\top \partial f(Ax + b)$.

Pointwise maximum. For convex $f_1, \ldots, f_m$ and $f(x) = \max_i f_i(x)$:

$$\partial f(x) = \text{conv}\!\left(\bigcup_{i \in I(x)} \partial f_i(x)\right),$$

where $I(x) = \{i : f_i(x) = f(x)\}$ is the set of "active" functions at $x$.

The maximum rule is what gives the subdifferential of the hinge loss and of the dual norm, and it is the source of the convex hull appearing in many non-smooth optimality conditions.
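The max rule is concrete enough to compute. For a pointwise maximum of univariate affine pieces $f(x) = \max_i (a_i x + b_i)$, the subdifferential at $x$ is the interval between the smallest and largest active slopes; a small sketch (helper name ours):

```python
import numpy as np

# f(x) = max_i (a_i * x + b_i): three affine pieces.
a = np.array([-1.0, 0.0, 2.0])
b = np.array([0.0, 1.0, -1.0])

def max_affine_subdifferential(x, tol=1e-12):
    """Interval [min active slope, max active slope] given by the pointwise-max rule."""
    vals = a * x + b
    active = np.abs(vals - vals.max()) < tol   # the "active" index set I(x)
    return a[active].min(), a[active].max()

# Pieces 2x - 1 and the constant 1 cross at x = 1, where both are active:
assert max_affine_subdifferential(1.0) == (0.0, 2.0)
# Away from crossings only one piece is active; the subdifferential is a singleton.
assert max_affine_subdifferential(5.0) == (2.0, 2.0)
```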

The Subgradient Method

The simplest non-smooth optimization algorithm replaces the gradient in gradient descent with any subgradient.

Algorithm. Choose step sizes $\eta_k > 0$. At iteration $k$:

  1. Pick any $g_k \in \partial f(x_k)$.
  2. Update $x_{k+1} = x_k - \eta_k \, g_k$.

Unlike gradient descent on smooth functions, the subgradient method is not a descent method: $f(x_{k+1})$ can exceed $f(x_k)$ even with optimal step sizes. The standard guarantee is on the best iterate.
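A minimal sketch of the method on a toy objective $f(x) = \|x - t\|_1$, whose minimizer $t$ and optimal value $f^* = 0$ are known, with the best iterate tracked as the theory suggests:

```python
import numpy as np

def subgradient_method(f, subgrad, x0, steps, c=1.0):
    """Subgradient method with eta_k = c/sqrt(k+1), tracking the best iterate seen."""
    x = x0.astype(float)
    best_f, best_x = f(x), x.copy()
    for k in range(steps):
        g = subgrad(x)                          # any subgradient works
        x = x - (c / np.sqrt(k + 1)) * g
        fx = f(x)
        if fx < best_f:
            best_f, best_x = fx, x.copy()
    return best_x, best_f

# Toy problem: minimize f(x) = ||x - t||_1, minimized at x = t with f* = 0.
t = np.array([1.5, -2.0, 0.5])
f = lambda x: np.abs(x - t).sum()
subgrad = lambda x: np.sign(x - t)              # a valid subgradient everywhere

_, best = subgradient_method(f, subgrad, np.zeros(3), steps=100000)
assert best < 0.02                              # O(1/sqrt(k)): slow but converging
```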

Theorem

Subgradient Method Convergence Rate

Statement

For $f$ convex with $\|g\| \leq G$ for all subgradients $g \in \partial f(x)$ at every $x$, and step sizes $\eta_k = c/\sqrt{k}$:

$$\min_{0 \leq j \leq k} f(x_j) - f^* = O\!\left(\frac{G \|x_0 - x^*\|}{\sqrt{k}}\right).$$

With the optimal constant step size $\eta = \|x_0 - x^*\|/(G\sqrt{k})$ for a fixed budget of $k$ steps, the bound becomes $f(\bar{x}_k) - f^* \leq G\|x_0 - x^*\|/\sqrt{k}$, where $\bar{x}_k$ is the average iterate.

Intuition

Each subgradient step reduces the squared distance to $x^*$ by an amount proportional to $\eta_k(f(x_k) - f^*)$ but increases it by $\eta_k^2 G^2$ from the noise-like subgradient variation. Balancing these with $\eta_k \sim 1/\sqrt{k}$ gives the $O(1/\sqrt{k})$ rate.

Proof Sketch

Expand $\|x_{k+1} - x^*\|^2 = \|x_k - x^*\|^2 - 2\eta_k g_k^\top(x_k - x^*) + \eta_k^2\|g_k\|^2$. By convexity, $g_k^\top(x_k - x^*) \geq f(x_k) - f^*$. Summing over $K$ iterations and rearranging:

$$\sum_{k=0}^{K-1} \eta_k(f(x_k) - f^*) \leq \frac{1}{2}\|x_0 - x^*\|^2 + \frac{1}{2}\sum_{k=0}^{K-1} \eta_k^2 G^2.$$

With $\eta_k = c/\sqrt{K}$, the right-hand side is $O(1)$ while $\sum_k \eta_k$ grows like $\sqrt{K}$, so dividing through gives the $O(1/\sqrt{K})$ bound on the best iterate.

Why It Matters

The $O(1/\sqrt{k})$ rate is optimal among first-order methods for non-smooth convex optimization (Nemirovski and Yudin, 1983). This gap from the $O(1/k)$ smooth rate is what motivates proximal methods: when the non-smooth term has a closed-form proximal operator (like the $\ell_1$ norm), proximal-gradient methods recover the smooth rate by handling non-smoothness exactly.

Failure Mode

The rate degrades with problem conditioning but cannot be improved without structural assumptions. For $\mu$-strongly convex $f$, step sizes $\eta_k = 1/(\mu k)$ give $O(1/k)$, but for general non-smooth convex problems, $O(1/\sqrt{k})$ is tight. The method also requires knowing $G$ and $\|x_0 - x^*\|$ for optimal step-size selection; poor choices can slow convergence dramatically or stall progress far from the optimum.

Common Confusions

Watch Out

Subgradient method is not a descent method

For smooth convex functions, gradient descent with a suitable step size reduces the objective at every iteration. For non-smooth convex functions, the subgradient method can increase the objective between iterations. The convergence guarantee is on the best iterate seen so far, $\min_{j \leq k} f(x_j)$, not on $f(x_k)$. This is why averaging schemes (Polyak averaging, Nemirovski-Yudin) are common in subgradient analyses.
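A short experiment on $f(x) = |x|$ makes the non-monotonicity visible; the step-size choice $\eta_k = 1/\sqrt{k+1}$ follows the convergence theorem:

```python
import numpy as np

# Minimize f(x) = |x| from x0 = 0.7: the iterates overshoot the kink at 0,
# so the objective rises on some iterations even as the best iterate improves.
f = abs
x, prev, increases, best = 0.7, 0.7, 0, 0.7
for k in range(200):
    x = x - (1.0 / np.sqrt(k + 1)) * np.sign(x)   # sign(x) is a subgradient of |x|
    increases += f(x) > prev + 1e-15              # count non-descent steps
    prev = f(x)
    best = min(best, f(x))

assert increases > 0    # not a descent method ...
assert best < 0.1       # ... but the best iterate still converges
```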

Watch Out

Subgradients are only defined for convex functions

For non-convex functions, the convex-analytic subgradient does not exist in general. The Clarke subdifferential and the limiting subdifferential extend the concept to locally Lipschitz non-convex functions and provide necessary conditions for minimality, but the clean "iff global min" of the convex case is lost. ReLU networks are not convex in their weights, so the "subgradients" used in their training are Clarke subgradients (any element of the convex hull of limits of gradient sequences), not the subgradients defined here.

Watch Out

The subdifferential of a sum can be larger than the sum of subdifferentials

$\partial f + \partial g \subseteq \partial(f + g)$, with equality guaranteed only under a constraint qualification (the relative interiors of the domains intersect). Without that qualification, strict containment is possible.

A concrete example reuses the failure-mode function from the existence theorem. On $\mathbb{R}$, let $f(x) = -\sqrt{x}$ for $x \geq 0$ ($+\infty$ otherwise) and let $g = \delta_{(-\infty, 0]}$, the indicator of the non-positive half-line. Then $\text{dom}\, f = [0, \infty)$ and $\text{dom}\, g = (-\infty, 0]$ intersect only at $\{0\}$, so the relative interiors of the domains do not intersect and the qualification fails. We showed above that $\partial f(0) = \emptyset$, hence $\partial f(0) + \partial g(0) = \emptyset$. But $f + g = \delta_{\{0\}}$, the indicator of the single point $0$, and $\partial(f + g)(0) = \mathbb{R}$ (the normal cone to a point is everything). So $\emptyset = \partial f(0) + \partial g(0) \subsetneq \partial(f + g)(0) = \mathbb{R}$: the containment is as strict as it gets. The qualification "relative interiors of the domains intersect" (Rockafellar 1970, Theorem 23.8) is exactly what rules out such pathologies.

Summary

  • A subgradient of a convex $f$ at $x$ is any $g$ such that $f(y) \geq f(x) + \langle g, y - x\rangle$ for all $y$. The set of subgradients is the subdifferential $\partial f(x)$.
  • Subdifferentials are non-empty, closed, and convex at every point in the relative interior of $\text{dom}\, f$, and also bounded at interior points.
  • $x^*$ is a global minimizer of convex $f$ iff $0 \in \partial f(x^*)$. This is the non-smooth analogue of $\nabla f(x^*) = 0$.
  • Subgradient calculus rules (sum, scaling, affine composition, pointwise max) follow gradient rules with constraint-qualification caveats.
  • The subgradient method converges at $O(1/\sqrt{k})$, slower than gradient descent on smooth functions; proximal methods recover $O(1/k)$ when the non-smooth term has a tractable proximal operator.

Exercises

ExerciseCore

Problem

Compute the subdifferential of $f(x) = \max(0, x)$ (the ReLU viewed as a univariate convex function) at every point. Then verify the optimality condition $0 \in \partial f(x^*)$ at the global minimizers.

ExerciseAdvanced

Problem

Derive the soft-thresholding operator from the subdifferential optimality condition. Specifically, show that the unique solution to $\min_x \tfrac{1}{2}(x - z)^2 + \lambda |x|$ for $\lambda > 0$ is

$$x^* = \text{sign}(z) \max(|z| - \lambda, 0).$$

This is the proximal operator of $\lambda |\cdot|$ evaluated at the point $z$.

References

Canonical:

  • Rockafellar, "Convex Analysis" (Princeton, 1970), Sections 23-25
  • Hiriart-Urruty and Lemaréchal, "Fundamentals of Convex Analysis" (Springer, 2001), Chapter D
  • Boyd and Vandenberghe, "Convex Optimization" (Cambridge, 2004), Section 3.1.5 (gradient and subgradient overlap), Appendix C

Convergence theory:

  • Nesterov, "Introductory Lectures on Convex Optimization" (Springer, 2004), Section 3.2 (subgradient method)
  • Bubeck, "Convex Optimization: Algorithms and Complexity" (Foundations and Trends in ML, 2015), Sections 3.1-3.2

ML applications:

  • Beck, "First-Order Methods in Optimization" (SIAM, 2017), Chapters 3, 6, 10 (proximal operators, subgradient computations for ML losses)
  • Parikh and Boyd, "Proximal Algorithms" (Foundations and Trends in Optimization, 2014)


Last reviewed: April 26, 2026
