ML Methods
Support Vector Machines
Maximum-margin classifiers via convex optimization: hard margin, soft margin with slack variables, hinge loss, the dual formulation, and the kernel trick.
Why This Matters
Maximum-margin hyperplane $w^\top x + b = 0$ separating two classes. Filled markers are support vectors (points on the margin gutters with $\alpha_i > 0$). One soft-margin violator is shown.
Support vector machines were the dominant classification method from the mid-1990s through the early 2010s, and for good reason: they have a clean mathematical formulation rooted in convex optimization, come with strong generalization guarantees via margin theory, and extend to nonlinear classification through the kernel trick. Even in the deep learning era, SVMs remain the clearest example of how optimization geometry connects to statistical generalization.
Mental Model
You have two classes of points in $\mathbb{R}^d$. You want to find a separating hyperplane that is as far as possible from both classes. The "margin" is the width of the gap between the classes. Maximizing the margin is equivalent to minimizing $\|w\|^2$ subject to all points being correctly classified. This is a convex quadratic program, and its dual reveals that the solution depends only on dot products between data points — opening the door to kernels.
Hard Margin SVM
Separating Hyperplane
A hyperplane in $\mathbb{R}^d$ is defined by a weight vector $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$ as the set $\{x : w^\top x + b = 0\}$. A point $x_i$ with label $y_i \in \{-1, +1\}$ is correctly classified if and only if $y_i(w^\top x_i + b) > 0$.
Functional Margin
The functional margin of a point $(x_i, y_i)$ is $\hat\gamma_i = y_i(w^\top x_i + b)$. Correct classification means positive functional margin. The geometric margin is the functional margin divided by $\|w\|$, giving the actual Euclidean distance from the point to the hyperplane.
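Both margins can be computed in a few lines. A sketch assuming NumPy; the hyperplane and points below are illustrative, not from the text:

```python
import numpy as np

# Illustrative hyperplane w^T x + b = 0 and a few labeled points.
w = np.array([0.5, 0.5])
b = 0.0
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])

# Functional margin: y_i (w^T x_i + b). Positive means correctly classified.
functional_margin = y * (X @ w + b)

# Geometric margin: divide by ||w|| to get Euclidean distance to the hyperplane.
geometric_margin = functional_margin / np.linalg.norm(w)

print(functional_margin)  # [1. 2. 1.]
print(geometric_margin)   # [1.414..., 2.828..., 1.414...]
```

Note that scaling $(w, b)$ by a constant changes the functional margin but not the geometric one, which is why the SVM fixes the scale by requiring functional margin $\ge 1$.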
Hard Margin SVM
For linearly separable data $\{(x_i, y_i)\}_{i=1}^n$ with $y_i \in \{-1, +1\}$, the hard margin SVM solves:
$$\min_{w, b} \ \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1 \ \text{ for all } i.$$
The margin (distance between the two class boundaries) is $2/\|w\|$. Maximizing this margin is equivalent to minimizing $\|w\|^2$.
The Dual Formulation
Hard Margin SVM Dual Problem
Statement
The Lagrangian dual of the hard margin SVM is:
$$\max_\alpha \ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j \, x_i^\top x_j$$
subject to $\alpha_i \ge 0$ for all $i$ and $\sum_{i=1}^n \alpha_i y_i = 0$.
At optimality, $w = \sum_{i=1}^n \alpha_i y_i x_i$, and $\alpha_i > 0$ only for support vectors: points lying exactly on the margin boundary.
Intuition
The dual says: find weights $\alpha_i$ on training points such that the resulting classifier maximizes margin. Most $\alpha_i$ are zero. Only the points closest to the decision boundary (the support vectors) have nonzero weight, so the classifier depends on only a few critical training examples.
Proof Sketch
Form the Lagrangian $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i\,[y_i(w^\top x_i + b) - 1]$. Set $\nabla_w L = 0$ to get $w = \sum_i \alpha_i y_i x_i$. Set $\partial L / \partial b = 0$ to get $\sum_i \alpha_i y_i = 0$. Substitute back into $L$ to eliminate $w$ and $b$, yielding the dual. Strong duality holds because the primal is convex and Slater's condition is satisfied (any strictly feasible point suffices).
Why It Matters
The dual formulation is essential for two reasons: (1) the data appears only through dot products $x_i^\top x_j$, enabling the kernel trick, and (2) the dual variables $\alpha_i$ directly identify the support vectors.
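The dual can be solved numerically on a toy problem. A sketch assuming NumPy and SciPy are available; the dataset is illustrative, not from the text. It maximizes the dual objective with `scipy.optimize.minimize` and recovers $w$ from the stationarity condition:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable dataset (illustrative).
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Q_ij = y_i y_j x_i^T x_j; the dual is: maximize sum(a) - 0.5 a^T Q a.
Q = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(a):
    return -(a.sum() - 0.5 * a @ Q @ a)

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0.0, None)] * n,                            # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alpha = res.x

w = (alpha * y) @ X                 # stationarity: w = sum_i alpha_i y_i x_i
sv = np.where(alpha > 1e-4)[0]      # support vectors: alpha_i > 0
b = y[sv[0]] - w @ X[sv[0]]         # from a tight constraint y_i (w^T x_i + b) = 1
```

Only the two points nearest the boundary, $(1, 1)$ and $(-1, -1)$, receive nonzero $\alpha_i$; the farther points play no role in the classifier, which is the sparsity the text describes.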
Failure Mode
The hard margin SVM has no solution if the data is not linearly separable. Even a single mislabeled point or outlier makes the primal infeasible. This motivates the soft margin formulation.
Lagrange Multipliers and KKT — Quick Recap
SVM is the canonical setting for seeing Lagrange multipliers and the KKT conditions in action. A short self-contained refresher, since these concepts do not have standalone topic pages yet.
A constrained convex problem $\min_x f(x)$ subject to $g_i(x) \le 0$ has Lagrangian $L(x, \lambda) = f(x) + \sum_i \lambda_i g_i(x)$ with $\lambda_i \ge 0$. The dual function $d(\lambda) = \inf_x L(x, \lambda)$ is concave; its supremum equals $f(x^\star)$ at the primal optimum under regularity (Slater's condition: a strictly feasible point exists). The four KKT conditions characterize that joint optimum:
- Stationarity (gradient of Lagrangian vanishes): $\nabla_x L(x, \lambda) = 0$
- Primal feasibility: $g_i(x) \le 0$ for every $i$
- Dual feasibility: $\lambda_i \ge 0$ for every $i$
- Complementary slackness: $\lambda_i g_i(x) = 0$ for every $i$ — each constraint is either tight ($g_i(x) = 0$) or its multiplier is zero ($\lambda_i = 0$).
For the hard-margin SVM with constraints $g_i(w, b) = 1 - y_i(w^\top x_i + b) \le 0$, the KKT conditions specialize to:
- Stationarity: $w = \sum_i \alpha_i y_i x_i$ (and $\sum_i \alpha_i y_i = 0$ from $\partial L / \partial b = 0$).
- Primal feasibility: $y_i(w^\top x_i + b) \ge 1$ for every training point.
- Dual feasibility: $\alpha_i \ge 0$.
- Complementary slackness: $\alpha_i\,[y_i(w^\top x_i + b) - 1] = 0$ — either $\alpha_i = 0$ (the point is not a support vector) or $y_i(w^\top x_i + b) = 1$ (the point sits exactly on the margin gutter). No middle ground.
The dual-variable interpretation is what makes SVM both interpretable and sparse: only the support vectors contribute to $w = \sum_i \alpha_i y_i x_i$. The complementary-slackness identity is the operational rule that picks them out.
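The four specialized conditions can be verified numerically on a toy problem whose optimum is known. In this sketch (assuming NumPy) the multipliers are taken as given — they are assumed to solve the dual for this particular dataset — and the code only confirms the KKT identities:

```python
import numpy as np

# Toy separable dataset and its (assumed) optimal dual variables.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25, 0.0])

w = (alpha * y) @ X               # stationarity: w = sum_i alpha_i y_i x_i
b = 0.0
margins = y * (X @ w + b)         # y_i (w^T x_i + b)

print("stationarity (sum alpha_i y_i = 0):", np.isclose(alpha @ y, 0.0))
print("primal feasibility (margins >= 1):", bool(np.all(margins >= 1 - 1e-9)))
print("dual feasibility (alpha >= 0):", bool(np.all(alpha >= 0)))
print("complementary slackness:", np.allclose(alpha * (margins - 1), 0.0))
```

Points with $\alpha_i = 0$ have margin strictly greater than 1 here, and points with $\alpha_i > 0$ sit exactly on the gutter — the "no middle ground" dichotomy in action.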
Soft Margin SVM and Hinge Loss
Soft Margin SVM
When data is not linearly separable, introduce slack variables $\xi_i \ge 0$:
$$\min_{w, b, \xi} \ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0.$$
The parameter $C > 0$ trades off margin width against margin violations. Large $C$ penalizes violations heavily (close to hard margin); small $C$ allows more violations for a wider margin.
Hinge Loss
The soft margin SVM is equivalent to minimizing the hinge loss:
$$\min_{w, b} \ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^n \max(0,\, 1 - y_i(w^\top x_i + b)),$$
where $\ell_{\text{hinge}}(z) = \max(0, 1 - z)$. The hinge loss is convex, non-differentiable at $z = 1$, and zero for correctly classified points with margin $\ge 1$. This is a regularized ERM problem with hinge loss.
The Kernel Trick
Since the dual depends on data only through dot products $x_i^\top x_j$, we can replace these with a kernel function $K(x, x') = \phi(x)^\top \phi(x')$, where $\phi$ maps to a (potentially infinite-dimensional) feature space. The classifier becomes:
$$f(x) = \operatorname{sign}\!\left(\sum_{i=1}^n \alpha_i y_i K(x_i, x) + b\right).$$
We never compute $\phi$ explicitly. We only evaluate the kernel. Four canonical kernels cover the bulk of practice:
| Kernel | Formula | Feature-space dimension | Where it earns its keep |
|---|---|---|---|
| Linear | $K(x, x') = x^\top x'$ | $d$ (no lift) | High-dimensional sparse data (text, bag-of-words); when interpretability of $w$ matters. |
| Polynomial (degree $p$) | $K(x, x') = (x^\top x' + c)^p$ | $\binom{d+p}{p}$ | Explicit interaction terms up to degree $p$; small $d$ with structured nonlinearity. |
| Gaussian RBF | $K(x, x') = \exp\!\left(-\|x - x'\|^2 / 2\sigma^2\right)$ | infinite | Default for most problems; smooth boundaries; bandwidth $\sigma$ is the dominant tuning knob. |
| Sigmoid (tanh) | $K(x, x') = \tanh(\kappa\, x^\top x' + c)$ | not p.s.d. for all $\kappa, c$ | Historical neural-network analogy; rarely best in practice and not always Mercer. |
The RBF kernel maps to an infinite-dimensional feature space yet computes in $O(d)$ time per pair, which is what made the kernel trick a practical revolution. The kernels and RKHS page gives the full Mercer / reproducing-kernel theory underneath.
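The identity $K(x, x') = \phi(x)^\top \phi(x')$ can be checked directly for the degree-2 polynomial kernel $K(x, x') = (x^\top x')^2$ in $d = 2$, where the explicit feature map is small enough to write down. A sketch assuming NumPy:

```python
import numpy as np

# Explicit feature map for K(x, x') = (x^T x')^2 in two dimensions:
# phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(1)
x, xp = rng.normal(size=2), rng.normal(size=2)

k_implicit = (x @ xp) ** 2      # O(d) kernel evaluation, no lift
k_explicit = phi(x) @ phi(xp)   # dot product after the explicit lift

assert np.isclose(k_implicit, k_explicit)

# The RBF kernel has no finite phi at all, yet evaluates just as cheaply:
k_rbf = np.exp(-np.sum((x - xp) ** 2) / (2 * 1.0 ** 2))
```

For degree $p$ and dimension $d$ the explicit map has $\binom{d+p}{p}$ coordinates, while the implicit evaluation stays $O(d)$ — the computational content of the kernel trick.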
Sensitivity to $C$ and $\sigma$
SVM performance is famously parameter-sensitive. Two knobs dominate, and miscalibrating either can flip the model from underfit to overfit:
- $C$ (soft-margin penalty). Large $C$ heavily penalizes margin violations; the model approaches the hard-margin SVM, narrowing the margin and overfitting outliers. Small $C$ tolerates violations, widening the margin and behaving like a heavily regularized classifier (potentially underfitting). The mapping $\lambda \propto 1/C$ ties $C$ to the regularization strength of the equivalent hinge-loss-plus-$\ell_2$ ERM.
- $\sigma$ (RBF bandwidth). Large $\sigma$ makes $K(x, x')$ vary slowly with distance, producing smooth, low-capacity decision functions (underfit). Small $\sigma$ makes $K(x, x')$ near-zero for any pair of distinct points, producing a near-nearest-neighbor classifier with high capacity (overfit; the algorithmic stability bound becomes loose).
The standard discipline is grid search over $C$ and $\sigma$ (typically on logarithmic grids) with cross-validation, evaluated on a hold-out set. A classical starting point for RBF SVMs on a normalized dataset is $C = 1$ with $\sigma$ set to the median pairwise distance between training points (the "median heuristic"). When the median heuristic disagrees sharply with the cross-validated optimum, the data has multi-scale structure that a single-bandwidth RBF cannot capture; that is when polynomial or composite kernels earn their place.
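A minimal implementation of the median heuristic, assuming NumPy (the toy points are illustrative):

```python
import numpy as np

def median_heuristic(X):
    """Median pairwise Euclidean distance between rows of X (distinct pairs only)."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    iu = np.triu_indices(len(X), k=1)   # upper triangle: each pair counted once
    return np.median(dists[iu])

# Unit square: four side-length-1 distances and two sqrt(2) diagonals.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sigma = median_heuristic(X)
print(sigma)  # 1.0
```

The $O(n^2)$ pairwise computation is fine for moderate $n$; for large datasets the median is usually estimated on a random subsample.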
Bousquet and Elisseeff (2002, Theorem 22) gave SVM an explicit stability bound of order $L^2 / (\lambda n)$, where $\lambda$ is the strong-convexity constant of the regularizer (here $\lambda \|w\|^2$) and $L$ is the Lipschitz constant of the loss ($L = 1$ for hinge loss). This is one of the cleanest examples of a stability-based generalization bound in the canonical-algorithm setting, and the rate scales linearly with $1/n$.
Representer Theorem for SVMs
Statement
For any regularized risk minimization problem of the form $\min_{f \in \mathcal{H}} \ \frac{1}{n}\sum_{i=1}^n \ell(f(x_i), y_i) + \lambda \|f\|_{\mathcal{H}}^2$ over an RKHS $\mathcal{H}$ with kernel $K$, the minimizer has the representation $f^\star = \sum_{i=1}^n \alpha_i K(x_i, \cdot)$.
Intuition
You never need to search over the full (infinite-dimensional) RKHS. The optimal function is always a finite linear combination of kernel evaluations at the training points. This is why the kernel trick is computationally feasible.
Proof Sketch
Decompose $f = f_\parallel + f_\perp$, where $f_\parallel$ lies in $\operatorname{span}\{K(x_1, \cdot), \dots, K(x_n, \cdot)\}$ and $f_\perp$ is orthogonal. Then $f_\perp(x_i) = \langle f_\perp, K(x_i, \cdot)\rangle = 0$ for all training points (by the reproducing property), but $\|f\|_{\mathcal{H}}^2 = \|f_\parallel\|_{\mathcal{H}}^2 + \|f_\perp\|_{\mathcal{H}}^2$. So $f_\perp$ only increases the regularizer without affecting the loss.
Why It Matters
The representer theorem reduces an infinite-dimensional optimization to a finite-dimensional one (finding coefficients $\alpha_1, \dots, \alpha_n$), making kernel methods computationally tractable.
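A compact illustration of the theorem, assuming NumPy: kernel ridge regression, where the squared loss gives the coefficients in closed form, $\alpha = (K + n\lambda I)^{-1} y$, and the fitted function is exactly a finite combination $\sum_i \alpha_i K(x_i, \cdot)$. The data and hyperparameters are illustrative:

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """RBF Gram matrix between rows of A and rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

# Noisy samples of sin(x) on [-2, 2] (illustrative).
rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, (30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

lam = 1e-3
K = rbf(X, X)
alpha = np.linalg.solve(K + len(X) * lam * np.eye(len(X)), y)

# The optimal f is a finite combination of kernel evaluations at training points.
f = lambda Xnew: rbf(Xnew, X) @ alpha
residual = np.abs(f(X) - y).mean()
```

Despite the RKHS of the RBF kernel being infinite-dimensional, training reduced to one $n \times n$ linear solve — which is also where the $O(n^2)$ memory / $O(n^3)$ time scaling limit discussed below comes from.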
Failure Mode
The kernel matrix $K$ is $n \times n$. For large $n$, storing and inverting this matrix becomes prohibitive ($O(n^2)$ memory, $O(n^3)$ solve time). This is why SVMs do not scale as well as SGD-based methods to very large datasets.
Connection to Regularization
The soft margin SVM is equivalent to regularized ERM with hinge loss:
$$\min_{w, b} \ \frac{1}{n}\sum_{i=1}^n \max(0,\, 1 - y_i(w^\top x_i + b)) + \lambda \|w\|^2.$$
This is the same framework as ridge regression (squared loss + L2 regularization) or regularized logistic regression (logistic loss + L2). The only difference is the loss function. The hinge loss is special because it creates sparsity in the dual: most $\alpha_i = 0$, giving the "support vector" property.
Canonical Examples
Hard margin in 2D
Consider two classes in $\mathbb{R}^2$, for example $\{(1, 1), (2, 2)\}$ with label $+1$ and $\{(-1, -1), (-2, -2)\}$ with label $-1$. The optimal separating hyperplane passes through the origin with $w = (\tfrac{1}{2}, \tfrac{1}{2})$ and $b = 0$. The margin is $2/\|w\| = 2\sqrt{2}$, and the support vectors are $(1, 1)$ and $(-1, -1)$, the closest points to the boundary from each class.
Common Confusions
SVMs do not maximize accuracy; they maximize margin
An SVM does not directly maximize classification accuracy on the training set. Many hyperplanes may achieve 100% training accuracy on separable data. The SVM picks the one with the largest margin. The theoretical justification is that large margin implies small generalization error (via VC dimension bounds on the margin classifier class).
The kernel trick is not specific to SVMs
Any algorithm whose computation depends only on dot products between data points can be kernelized. This includes kernel PCA, kernel ridge regression, and kernel k-means. SVMs popularized the kernel trick, but it is a general principle.
Why SVMs Were Dominant Pre-Deep-Learning
From roughly 1995 to 2012, SVMs were the go-to method for classification:
- Convex optimization: a single global optimum, no local minima
- Strong theory: margin-based generalization bounds
- Kernels: nonlinear classification without manual feature engineering
- Sparsity: the solution depends on few support vectors
Deep learning overtook SVMs because neural networks scale better to massive datasets (SGD is $O(n)$ per epoch vs. $O(n^2)$ to $O(n^3)$ for kernel SVMs) and learn hierarchical features automatically.
Summary
- Hard margin SVM: minimize $\frac{1}{2}\|w\|^2$ subject to $y_i(w^\top x_i + b) \ge 1$
- Margin $= 2/\|w\|$; maximizing margin $=$ minimizing $\|w\|^2$
- Dual depends on data only through dot products $x_i^\top x_j$ $\Rightarrow$ kernel trick
- Support vectors: points with $\alpha_i > 0$, lying on the margin
- Soft margin: slack variables $\xi_i$, parameter $C$ controls tradeoff
- Hinge loss $\max(0, 1 - y_i(w^\top x_i + b))$: convex surrogate for 0-1 loss
Exercises
Problem
Derive the dual of the hard margin SVM. Start from the primal $\min_{w, b} \frac{1}{2}\|w\|^2$ subject to $y_i(w^\top x_i + b) \ge 1$, introduce Lagrange multipliers $\alpha_i \ge 0$, and eliminate $w$ and $b$.
Problem
Show that for the soft margin SVM, the dual variables satisfy $0 \le \alpha_i \le C$. What does $\alpha_i = C$ mean geometrically?
Related Comparisons
References
Canonical:
- Vapnik, The Nature of Statistical Learning Theory (1995), Chapters 5 and 10
- Cristianini & Shawe-Taylor, An Introduction to Support Vector Machines (2000)
- Cortes & Vapnik, "Support-Vector Networks" (Machine Learning, 1995)
- Boser, Guyon, Vapnik, "A Training Algorithm for Optimal Margin Classifiers" (COLT, 1992)
- Schoelkopf & Smola, Learning with Kernels (2002), Chapters 2 and 7
Current:
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapters 15-16
- Boyd & Vandenberghe, Convex Optimization (2004), Chapter 5
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapter 12
- Steinwart & Christmann, Support Vector Machines (2008), Chapters 1 and 2
Next Topics
The natural next step from SVMs:
- Kernels and RKHS: the full theory of reproducing kernel Hilbert spaces
Last reviewed: May 3, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Convex Optimization Basics (layer 1 · tier 1)
- Logistic Regression (layer 1 · tier 1)
- Convex Duality (layer 2 · tier 1)
- Loss Functions (layer 1 · tier 2)
- Perceptron (layer 1 · tier 2)
Derived topics
- The Kernel Trick (layer 2 · tier 1)
- Kernels and Reproducing Kernel Hilbert Spaces (layer 3 · tier 2)
- SVM for RF Classification (layer 4 · tier 3)
Graph-backed continuations