ML Methods

Ridge Regression

L2-regularized least squares: closed-form shrinkage, SVD geometry, MSE dominance over OLS, the Bayesian Gaussian-prior connection, effective degrees of freedom, kernel ridge, and the modern overparameterized regime where ridgeless OLS can still generalize.

Why This Matters

Ordinary least squares is the canonical unbiased linear estimator. The Gauss-Markov theorem certifies it as best-linear-unbiased, and that result is older than statistical learning theory itself. But unbiasedness is a constraint, not a virtue. Once we accept that bias and variance trade against each other, dropping unbiasedness opens a strictly larger admissible region, and ridge regression is the simplest and most-studied estimator in that larger region.

Ridge does three things at once. It stabilizes the solution when $X^\top X$ is ill-conditioned or singular. It shrinks toward the origin in the singular-value basis, with shrinkage strength tied to the inverse signal-to-noise ratio in each direction. And it has a closed form, a Bayesian interpretation, a representer theorem in kernel space, and a clean SVD geometry, so it acts as a Rosetta Stone connecting penalized regression, MAP estimation, kernel methods, and the modern overparameterized linear-model theory of benign overfitting.

This page is the bridge between classical estimation theory (Gauss-Markov, shrinkage and James-Stein, the Cramer-Rao bound) and the modern theory of high-dimensional and overparameterized models (NTK, double descent, benign overfitting, kernels and RKHS). It is the smallest example where a closed-form penalized estimator beats the unbiased estimator on mean squared error, and where the SVD lets you read off the entire bias-variance trade by inspection.

[Figure: constraint-set picture. Ridge (L2): $\|w\|_2^2 \le t$ gives a shrunk solution; lasso (L1): $\|w\|_1 \le t$ gives a sparse solution ($w_2 = 0$). The L1 diamond has corners on the axes, so the constrained optimum often lands on an axis (sparsity); the L2 circle has no corners.]

Mental Model

OLS finds the minimum-norm vector $w$ that explains $y$ as a linear combination of the columns of $X$. Ridge adds a quadratic budget on $\|w\|_2$. Among all $w$ that fit the data well, ridge picks the one with smallest squared norm.

Three pictures, all equivalent:

  1. Constrained optimization. Minimize $\|y - Xw\|_2^2$ subject to $\|w\|_2 \le t$ for some $t$. The Lagrangian gives the penalty form with multiplier $\lambda$.
  2. Shrunk projection. OLS projects $y$ onto the column space of $X$. Ridge does a softer projection: it pulls each component of $\hat y$ in the singular-value basis toward zero by a factor $\sigma_j^2 / (\sigma_j^2 + \lambda)$.
  3. MAP under a Gaussian prior. With $w \sim \mathcal N(0, \tau^2 I)$ and $y \mid w \sim \mathcal N(Xw, \sigma^2 I)$, the posterior mode equals $\hat w_{\text{ridge}}$ at $\lambda = \sigma^2 / \tau^2$.

The same penalty plays three roles: convex regularizer, shrinkage operator on the SVD components, and prior covariance. Pick the picture that fits the question.

Formal Setup

Take the linear model $y = Xw^* + \varepsilon$ with $X \in \mathbb R^{n \times d}$, $y \in \mathbb R^n$, and noise $\varepsilon$ with $\mathbb E[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2 I_n$. No Gaussian assumption is needed for the MSE statements; Gaussianity is needed only for the Bayesian / MAP equivalence and for distributional inference.

Definition

Ridge Regression Estimator

For penalty parameter $\lambda > 0$, the ridge estimator is

$$\hat{w}_{\lambda} \;=\; \arg\min_{w \in \mathbb R^d} \; \|y - Xw\|_2^2 + \lambda \|w\|_2^2.$$

The objective is strictly convex (Hessian $2(X^\top X + \lambda I) \succ 0$ for $\lambda > 0$), so the solution is unique. First-order optimality gives the normal equations $(X^\top X + \lambda I)\, w = X^\top y$ and the closed form

$$\hat{w}_{\lambda} \;=\; (X^\top X + \lambda I)^{-1} X^\top y.$$

The matrix $X^\top X + \lambda I$ is positive definite for any $\lambda > 0$, even when $X^\top X$ is singular ($n < d$ or a rank-deficient design). This is why ridge is well-defined where OLS is not.

The case $\lambda = 0$ recovers OLS when $X^\top X$ is invertible. The case $\lambda \to \infty$ gives $\hat w \to 0$. The interesting regime is intermediate.
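A minimal numerical sketch of the closed form, assuming a random design and NumPy (names are illustrative; `np.linalg.solve` avoids forming an explicit inverse):

```python
import numpy as np

def ridge(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Solve the ridge normal equations (X^T X + lam*I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = X @ rng.standard_normal(10) + 0.5 * rng.standard_normal(50)

w_hat = ridge(X, y, lam=1.0)  # unique for any lam > 0, even if X^T X is singular
```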

SVD Geometry

The singular value decomposition is the cleanest way to see what ridge does. Write $X = U \Sigma V^\top$ where $U \in \mathbb R^{n \times r}$ and $V \in \mathbb R^{d \times r}$ have orthonormal columns and $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_r)$ with $\sigma_1 \ge \cdots \ge \sigma_r > 0$ and $r = \operatorname{rank}(X)$.

Theorem

Ridge Acts as Component-wise Shrinkage in the SVD Basis

Statement

The ridge fitted values are

$$\hat y_{\lambda} \;=\; X \hat w_{\lambda} \;=\; \sum_{j=1}^{r} \frac{\sigma_j^2}{\sigma_j^2 + \lambda} \, u_j u_j^\top y,$$

and the ridge coefficients in the right-singular basis are

$$V^\top \hat w_{\lambda} \;=\; \operatorname{diag}\!\left(\frac{\sigma_j}{\sigma_j^2 + \lambda}\right) U^\top y.$$

The shrinkage factor $d_j(\lambda) = \sigma_j^2 / (\sigma_j^2 + \lambda) \in (0, 1)$ governs each direction independently.

Intuition

OLS fitted values are $\hat y_{\text{OLS}} = \sum_j u_j u_j^\top y$, the full projection onto each left singular direction. Ridge multiplies the $j$-th coefficient by $\sigma_j^2 / (\sigma_j^2 + \lambda)$. For directions with $\sigma_j \gg \sqrt\lambda$ the factor is essentially 1 and ridge looks like OLS. For directions with $\sigma_j \ll \sqrt\lambda$ the factor is essentially 0 and ridge silences that direction. Ridge is a low-pass filter on the singular-value spectrum, with cutoff at $\sigma_j \approx \sqrt\lambda$.

This is also why ridge stabilizes ill-conditioned designs. The OLS coefficients $V^\top \hat w_{\text{OLS}} = \operatorname{diag}(1/\sigma_j)\, U^\top y$ blow up for small $\sigma_j$. Ridge replaces $1/\sigma_j$ by $\sigma_j / (\sigma_j^2 + \lambda)$, which is bounded above by $1 / (2 \sqrt\lambda)$.

Proof Sketch

Substitute $X = U \Sigma V^\top$ into $(X^\top X + \lambda I)^{-1} X^\top$. The Gram matrix is $V \Sigma^2 V^\top$. Padding with the orthogonal complement of $V$ (eigenvalue 0 for $X^\top X$, eigenvalue $\lambda$ for $X^\top X + \lambda I$), the inverse is $V (\Sigma^2 + \lambda I)^{-1} V^\top + \tfrac{1}{\lambda} V_\perp V_\perp^\top$. Multiplying by $X^\top = V \Sigma U^\top$ kills the orthogonal piece (since $V_\perp^\top V = 0$) and yields $V (\Sigma^2 + \lambda I)^{-1} \Sigma U^\top$. Multiplying once more by $X = U \Sigma V^\top$ from the left gives the fitted values.
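A quick numerical check of the identity, under an assumed random design:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 8))
y = rng.standard_normal(30)
lam = 2.0

# Closed form
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(8), X.T @ y)

# SVD shrinkage: y_hat = sum_j [s_j^2 / (s_j^2 + lam)] u_j u_j^T y
U, s, Vt = np.linalg.svd(X, full_matrices=False)
y_svd = U @ ((s**2 / (s**2 + lam)) * (U.T @ y))

assert np.allclose(X @ w_hat, y_svd)  # the two routes agree
```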

Why It Matters

Every other property of ridge follows from this one identity. Bias, variance, MSE, effective degrees of freedom, leave-one-out cross-validation, Bayesian posterior covariance, and kernel-ridge predictions are all weighted sums over $j$ with weights of the form $\sigma_j^2 / (\sigma_j^2 + \lambda)^k$ for small $k$. If you see a quantity attached to ridge and it does not factor through the SVD, suspect an algebra error.

Failure Mode

The shrinkage is anisotropic and depends on the design. Rotating the feature space changes the singular vectors, and hence the coefficient vector, and ridge is not invariant under general linear reparameterizations, in particular per-feature rescaling. Standardize features (zero mean, unit variance) before applying ridge; otherwise the penalty depends on units and on which features happen to have larger raw scale.

Bias, Variance, MSE Dominance

Under $y = Xw^* + \varepsilon$ with $\mathbb E[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2 I$:

  • Bias. $\mathbb E[\hat w_{\lambda}] - w^* = -\lambda (X^\top X + \lambda I)^{-1} w^*$. Nonzero whenever $\lambda > 0$ and $w^* \ne 0$. Grows in magnitude with $\lambda$.
  • Variance. $\operatorname{Var}(\hat w_{\lambda}) = \sigma^2 (X^\top X + \lambda I)^{-1} X^\top X (X^\top X + \lambda I)^{-1}$. Strictly smaller (in PSD order) than the OLS variance $\sigma^2 (X^\top X)^{-1}$ for $\lambda > 0$.
  • MSE. $\operatorname{MSE}(\hat w_{\lambda}) = \|\text{Bias}\|_2^2 + \operatorname{tr}(\operatorname{Var}(\hat w_{\lambda}))$.

Diagonalizing in the SVD basis,

$$\operatorname{MSE}(\hat w_{\lambda}) \;=\; \sum_{j=1}^{d} \left[ \frac{\lambda^2 \theta_j^{*\,2}}{(\sigma_j^2 + \lambda)^2} \;+\; \frac{\sigma^2 \sigma_j^2}{(\sigma_j^2 + \lambda)^2} \right]$$

where $\theta_j^* = v_j^\top w^*$ is the true coefficient projected onto the $j$-th right singular vector. The first term is squared bias per direction, the second is variance per direction.
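To see the trade numerically, the per-direction formula can be evaluated on a grid, with $w^*$ and $\sigma^2$ simulated (a sketch, not a practical recipe for choosing $\lambda$, since both are unknown in practice):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 20))
w_star = rng.standard_normal(20)  # assumed ground truth
sigma2 = 4.0                      # assumed noise variance

U, s, Vt = np.linalg.svd(X, full_matrices=False)
theta = Vt @ w_star               # true coefficients in the right-singular basis

def mse(lam: float) -> float:
    bias2 = (lam**2 * theta**2) / (s**2 + lam)**2  # squared bias per direction
    var = sigma2 * s**2 / (s**2 + lam)**2          # variance per direction
    return float(np.sum(bias2 + var))

lams = np.concatenate([[0.0], np.logspace(-3, 3, 200)])
best = lams[np.argmin([mse(l) for l in lams])]
print(best > 0, mse(best) < mse(0.0))  # True True: some lam > 0 beats OLS
```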

Theorem

Ridge Strictly Improves on OLS for Some Lambda

Statement

There exists $\lambda^* > 0$ such that

$$\operatorname{MSE}(\hat w_{\lambda^*}) \;<\; \operatorname{MSE}(\hat w_{\text{OLS}}).$$

The optimal $\lambda$ depends on unknown quantities ($w^*$ and $\sigma^2$); the existence statement does not.

Intuition

Squared bias starts at zero when $\lambda = 0$ and rises quadratically. Variance starts at the OLS value and decreases. Near $\lambda = 0$ the bias term contributes nothing to the derivative while the variance term contributes a strictly negative slope. So the MSE is locally decreasing at $\lambda = 0$ and a small positive $\lambda$ is always an improvement.

Ridge does not violate the Cramer-Rao bound. The CRB lower-bounds the variance of unbiased estimators; ridge is biased and pays for its variance reduction with bias. The total MSE is what matters, and the trade is favorable in a neighborhood of $\lambda = 0$.

Proof Sketch

Differentiate the per-direction terms. Variance term: $\partial_\lambda \big[\sigma^2 \sigma_j^2 / (\sigma_j^2 + \lambda)^2\big] = -2 \sigma^2 \sigma_j^2 / (\sigma_j^2 + \lambda)^3$, which equals $-2 \sigma^2 / \sigma_j^4$ at $\lambda = 0$. Bias term: $\partial_\lambda \big[\lambda^2 \theta_j^{*\,2} / (\sigma_j^2 + \lambda)^2\big]$ vanishes at $\lambda = 0$ because of the leading $\lambda$. Sum over $j$: $\partial_\lambda \operatorname{MSE} \big|_{\lambda = 0} = -2 \sigma^2 \sum_j 1 / \sigma_j^4 < 0$. So the MSE strictly decreases for small positive $\lambda$.

Why It Matters

This is the cleanest theoretical justification for any kind of regularization. Gauss-Markov says OLS is best among linear unbiased estimators. The class of all linear estimators is strictly larger and contains estimators with strictly lower MSE. Ridge is not a hack and not a heuristic: its MSE is strictly smaller than the OLS value for a non-empty range of $\lambda$. The same conclusion underlies penalized GLMs, weight decay in deep nets, posterior contraction in Bayesian models, and the entire shrinkage program from James-Stein to modern hierarchical Bayes.

Failure Mode

The theorem guarantees existence, not a recipe. The MSE-optimal $\lambda$ depends on $w^*$ and $\sigma^2$, which are unknown. In practice, choose $\lambda$ by leave-one-out cross-validation (closed form for ridge), $K$-fold CV, generalized cross-validation, or marginal-likelihood maximization. Naively setting $\lambda$ by rule of thumb is the most common way to under- or over-regularize.

Effective Degrees of Freedom

Cross-validation answers "what $\lambda$"; effective degrees of freedom answers "how complex is the resulting fit." For a linear smoother $\hat y = H_\lambda y$, the standard definition is $\operatorname{df}(\lambda) = \operatorname{tr}(H_\lambda)$. Ridge is a linear smoother with hat matrix $H_\lambda = X(X^\top X + \lambda I)^{-1} X^\top$.

Proposition

Effective Degrees of Freedom for Ridge

Statement

The effective degrees of freedom of ridge regression at penalty $\lambda$ is

$$\operatorname{df}(\lambda) \;=\; \operatorname{tr}(H_\lambda) \;=\; \sum_{j=1}^{r} \frac{\sigma_j^2}{\sigma_j^2 + \lambda}.$$

This is monotone decreasing in $\lambda$, with $\operatorname{df}(0) = r$ (OLS uses $r$ degrees of freedom on the column space of $X$) and $\operatorname{df}(\infty) = 0$ (a constant predictor uses none).

Intuition

Each shrinkage factor $\sigma_j^2 / (\sigma_j^2 + \lambda)$ is the fraction of "freedom" the model retains in direction $j$. High-$\sigma_j$ directions count almost as a full degree; directions with $\sigma_j^2 \approx \lambda$ count as half; directions with $\sigma_j^2 \ll \lambda$ are essentially frozen and contribute nearly nothing to model complexity. The sum measures the model's effective dimension at the chosen penalty.

Proof Sketch

$H_\lambda = X (X^\top X + \lambda I)^{-1} X^\top = U \Sigma V^\top V (\Sigma^2 + \lambda I)^{-1} V^\top V \Sigma U^\top = U \Sigma (\Sigma^2 + \lambda I)^{-1} \Sigma U^\top = U \operatorname{diag}\big(\sigma_j^2 / (\sigma_j^2 + \lambda)\big) U^\top$. The trace of a matrix in this form is the sum of the entries of the inner diagonal matrix.
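A small sketch of $\operatorname{df}(\lambda)$ computed from singular values alone (the spectrum below is assumed):

```python
import numpy as np

s = np.array([5.0, 3.0, 1.0, 0.5, 0.1])  # assumed singular values of X

def eff_df(lam: float) -> float:
    return float(np.sum(s**2 / (s**2 + lam)))

print(eff_df(0.0))    # r = 5: OLS on the column space
print(eff_df(0.25))   # the direction with s_j^2 = 0.25 contributes exactly 1/2
print(eff_df(1e6))    # ~0: essentially no freedom left
```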

Why It Matters

Effective DF makes ridge comparable to other model-selection scores. AIC, BIC, $C_p$, and Mallows-style criteria all need a model-complexity term, and $\operatorname{df}(\lambda)$ is the right substitute when the penalty is continuous rather than discrete. It also shows up in Stein's unbiased risk estimate (SURE) for ridge: with Gaussian noise, $\operatorname{SURE}(\lambda) = \|y - \hat y_\lambda\|_2^2 + 2 \sigma^2 \operatorname{df}(\lambda) - n \sigma^2$, an unbiased risk estimate computable without resampling.

Failure Mode

The trace definition $\operatorname{tr}(H_\lambda)$ depends on the linear-smoother structure. For nonlinear regularizers (lasso, elastic net) the analogous quantity is more delicate; the lasso's effective degrees of freedom equal the expected number of nonzero coefficients (Zou-Hastie-Tibshirani 2007), not the trace of a hat matrix.

Bayesian Interpretation: Ridge as MAP under a Gaussian Prior

Place a Gaussian prior on $w$ and a Gaussian likelihood on $y$:

$$w \;\sim\; \mathcal N(0, \tau^2 I), \qquad y \mid X, w \;\sim\; \mathcal N(Xw, \sigma^2 I).$$

The posterior is Gaussian (Gaussian-Gaussian conjugacy):

$$w \mid X, y \;\sim\; \mathcal N\!\left( \big(X^\top X + \tfrac{\sigma^2}{\tau^2} I\big)^{-1} X^\top y, \;\; \sigma^2 \big(X^\top X + \tfrac{\sigma^2}{\tau^2} I\big)^{-1} \right).$$

The posterior mean equals the MAP estimate, which equals $\hat w_\lambda$ with $\lambda = \sigma^2 / \tau^2$. Three consequences:

  • Direction. A strong prior (small $\tau^2$) maps to large $\lambda$ and heavy shrinkage; a weak prior (large $\tau^2$) maps to small $\lambda$ and near-OLS behavior. The relationship is reciprocal, not direct.
  • Posterior covariance. Ridge gives a point estimate; the Bayesian view also gives uncertainty, with posterior covariance $\sigma^2 (X^\top X + \lambda I)^{-1}$. This is the natural starting point for credible intervals and predictive variance.
  • Marginal likelihood. Maximizing $p(y \mid X, \tau^2, \sigma^2)$ over $\tau^2$ and $\sigma^2$ gives an empirical-Bayes choice of $\lambda$, an alternative to cross-validation that is fully principled when the Gaussian model is correct.

The connection generalizes. Any quadratic penalty $w^\top \Lambda w$ corresponds to a Gaussian prior with covariance $\Lambda^{-1}$. Penalizing only some coefficients corresponds to flat priors on the others. Group penalties, fused penalties, and Tikhonov penalties of the form $\| L w \|_2^2$ are all Gaussian priors in disguise.
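A sketch verifying the MAP-ridge correspondence numerically (all model quantities below are assumed):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((40, 6))
y = rng.standard_normal(40)
sigma2, tau2 = 1.0, 0.25
lam = sigma2 / tau2  # = 4.0: strong prior (small tau2) -> large lam

A = X.T @ X + lam * np.eye(6)
posterior_mean = np.linalg.solve(A, X.T @ y)  # Gaussian posterior mean = MAP
posterior_cov = sigma2 * np.linalg.inv(A)     # uncertainty comes for free

w_ridge = np.linalg.solve(A, X.T @ y)         # ridge closed form at this lam
assert np.allclose(posterior_mean, w_ridge)
```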

Tikhonov Regularization and Inverse Problems

Ridge's older name in inverse-problem theory is Tikhonov regularization (Tikhonov 1963). The setup is solving $Ax = b$ where $A$ is ill-conditioned or non-invertible, by minimizing $\| A x - b \|_2^2 + \lambda \| L x \|_2^2$ for some operator $L$. Setting $L = I$ recovers ridge with the regression matrix $A = X$ and target $b = y$.

The full Tikhonov form lets you penalize derivatives ($L$ a discrete differentiation operator), encouraging smooth reconstructions; differences from a target ($L x - x_0$), encouraging proximity to a known reference; or specific subspaces ($L$ a projector), penalizing spurious modes. All of these inherit ridge's SVD analysis after a change of variables. The connection makes ridge a special case of a much broader regularization framework used in tomography, deconvolution, geophysical inversion, and learning theory.
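A sketch of the general Tikhonov form with $L$ a second-difference operator, which penalizes curvature and favors smooth reconstructions (the forward map here is an illustrative assumption):

```python
import numpy as np

def tikhonov(A, b, L, lam):
    # Normal equations: (A^T A + lam L^T L) x = A^T b
    return np.linalg.solve(A.T @ A + lam * (L.T @ L), A.T @ b)

d = 50
A = np.tril(np.ones((d, d)))         # assumed forward map (cumulative sum)
rng = np.random.default_rng(5)
x_true = np.sin(np.linspace(0, 3, d))
b = A @ x_true + 0.01 * rng.standard_normal(d)

L = np.diff(np.eye(d), n=2, axis=0)  # rows (1, -2, 1): discrete second derivative
x_hat = tikhonov(A, b, L, lam=1.0)   # smooth reconstruction of x_true
```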

Kernel Ridge Regression

Replace $x_i \in \mathbb R^d$ with a feature map $\phi(x_i)$ in a reproducing kernel Hilbert space (RKHS) and use the kernel trick $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$. The ridge problem becomes

$$\min_{f \in \mathcal H} \; \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \| f \|_{\mathcal H}^2.$$

Theorem

Representer Theorem for Kernel Ridge Regression

Statement

The minimizer of $\sum_i (y_i - f(x_i))^2 + \lambda \| f \|_{\mathcal H}^2$ over $f \in \mathcal H$ has the form

$$\hat f(x) \;=\; \sum_{i=1}^{n} \alpha_i \, K(x_i, x), \qquad \alpha = (K + \lambda I)^{-1} y,$$

where $K$ is the $n \times n$ Gram matrix with $K_{ij} = K(x_i, x_j)$.

Intuition

Even when $\mathcal H$ is infinite-dimensional, the optimal predictor lives in the $n$-dimensional span of the kernel evaluations $\{K(x_i, \cdot)\}_{i=1}^n$. Ridge in an infinite-dimensional RKHS reduces to a finite linear system in $n$ variables. The kernel matrix $K$ plays the role of $X X^\top$, and the kernel trick lets you compute its entries without ever forming the feature map.

Proof Sketch

Decompose any $f \in \mathcal H$ as $f = f_\parallel + f_\perp$ where $f_\parallel \in \operatorname{span}\{K(x_i, \cdot)\}$ and $f_\perp$ is orthogonal. The reproducing property gives $f(x_i) = \langle f, K(x_i, \cdot) \rangle = f_\parallel(x_i)$, so the loss depends only on $f_\parallel$. The regularizer is $\| f \|_{\mathcal H}^2 = \| f_\parallel \|_{\mathcal H}^2 + \| f_\perp \|_{\mathcal H}^2$, strictly increasing in $\| f_\perp \|_{\mathcal H}$. So $f_\perp = 0$ at the optimum and $\hat f$ is a linear combination of $\{K(x_i, \cdot)\}$. Plug in to get a finite system in the coefficients $\alpha$.
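A minimal kernel-ridge sketch following the theorem; the RBF kernel and its bandwidth are assumptions, not prescribed by the theorem:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(80)

lam = 0.1
K = rbf_kernel(X, X)                        # n x n Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(80), y)

X_test = np.linspace(-3, 3, 200)[:, None]
f_hat = rbf_kernel(X_test, X) @ alpha       # f(x) = sum_i alpha_i K(x_i, x)
```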

Why It Matters

The representer theorem turns infinite-dimensional regression into linear algebra on an $n \times n$ Gram matrix. It underlies kernel ridge regression, Gaussian process regression (its Bayesian twin), support vector regression, and the entire neural tangent kernel (NTK) connection: in the NTK regime, training a wide neural network with squared loss and a small learning rate is equivalent to kernel ridge regression with kernel $K_{\text{NTK}}$. The representer theorem is what makes that equivalence computable.

Failure Mode

Two failure modes. First, the $n \times n$ kernel matrix costs $O(n^3)$ to invert and $O(n^2)$ to store; kernel ridge does not scale to large $n$ without approximations (Nystrom, random features, conjugate gradients). Second, the representer theorem assumes the regularizer depends only on $\| f \|_{\mathcal H}$. For penalties like total variation or sparsity-inducing norms in function space, the optimal $f$ need not lie in the kernel span and the theorem does not apply.

Modern Frontier: Ridgeless OLS and Benign Overfitting

In the overparameterized regime ($d \gg n$), the OLS problem has infinitely many minimizers, and the minimum-$\ell_2$-norm solution $\hat w_{\text{OLS}}^{\min} = X^+ y = X^\top (X X^\top)^{-1} y$ is what gradient descent on the unregularized least-squares loss converges to from zero initialization. This is "ridgeless OLS" and corresponds to $\lambda \downarrow 0$ in the ridge problem.

Classical theory predicts disaster: zero training error plus arbitrary norm should give arbitrarily bad generalization. The empirical surprise is that ridgeless OLS often generalizes well in the heavily overparameterized regime, sometimes better than at the interpolation threshold ($d = n$). This is benign overfitting (Bartlett, Long, Lugosi, Tsigler 2020; Hastie, Montanari, Rosset, Tibshirani 2022), and it gives a tractable linear-model analogue of the deep-learning generalization puzzle.

Theorem

Ridgeless OLS Can Generalize in High Dimension

Statement

For a sequence of overparameterized linear regression problems with $d/n \to \gamma > 1$ and population covariance $\Sigma_x$ whose effective rank $r_k(\Sigma_x) = (\sum_{j > k} \lambda_j) / \lambda_{k+1}$ grows fast enough, the ridgeless minimum-norm interpolator $\hat w_{\min}$ has population MSE that converges to a finite limit, and in some regimes that limit is smaller than the interpolation-threshold MSE at $d = n$.

The precise form (Bartlett-Long-Lugosi-Tsigler) bounds the excess risk by a sum of two terms: a bias term scaling with $\| w^* \|_{\Sigma_x}^2 / r_0(\Sigma_x)$ and a variance term scaling with $\sigma^2 [k^*/n + n / R_{k^*}(\Sigma_x)]$, where $k^*$ is an effective-rank cutoff. Both terms can be small simultaneously when the spectrum decays sufficiently slowly.

Intuition

Two ideas. First, the minimum-$\ell_2$-norm interpolator is implicit ridge: gradient descent from zero on the unregularized loss, with sufficiently small step size, approaches the smallest-norm interpolating solution rather than an arbitrary one. The implicit bias of optimization fills in for the missing explicit penalty.

Second, in high dimension with mild spectral decay, almost all directions of $\Sigma_x$ have small variance and absorb the noise harmlessly, while the few high-variance directions carry the signal. The absorption is benign because the noise-absorbing directions barely affect predictions on a fresh test point. Classical low-dimensional theory missed this because it assumed $d/n \to 0$, where the absorption mechanism does not exist.

Proof Sketch

Decompose the excess risk as bias plus variance using the SVD of the design (or the eigendecomposition of $\Sigma_x$). The bias term is bounded by the projection of $w^*$ onto the span of the bottom eigendirections of the empirical Gram matrix; control it via concentration of the spectrum. The variance term factors as a sum over eigendirections of $\Sigma_x$, each contributing $\sigma^2 \lambda_j^2 / (\lambda_j n + \text{tail})^2$ (analogous to the ridge formula but with no explicit penalty); the effective-rank conditions ensure the sum is small. Full details in Bartlett-Long-Lugosi-Tsigler (PNAS 2020) and the companion analysis by Tsigler-Bartlett (2023).

Why It Matters

Ridgeless overfitting is the smallest model where modern overparameterized generalization can be analyzed in closed form. It explains why explicit regularization is sometimes unnecessary in deep learning (the implicit bias of SGD plus high effective dimension can suffice), it connects ridge to NTK theory (training a wide net is kernel ridge with $\lambda \to 0$), and it reframes the bias-variance tradeoff in the regime where classical asymptotics fail.

The companion phenomenon is double descent: as $d/n$ crosses 1, test error often peaks at the interpolation threshold and then descends again as $d/n$ grows further (Belkin-Hsu-Ma-Mandal 2019; Nakkiran-Kaplun-Bansal-Yang-Barak-Sutskever 2021). Ridge with optimally chosen $\lambda$ smooths out the interpolation peak, recovering a monotone decrease in test error.

Failure Mode

The benign-overfitting result is fragile to spectral assumptions. With isotropic covariance $\Sigma_x = I$, ridgeless OLS is asymptotically as bad as classical theory predicts; benign behavior requires a long tail of small eigenvalues that absorb noise. It is also sensitive to the relationship between $w^*$ and the eigenstructure: if $w^*$ aligns with low-variance directions, the bias term dominates. Empirically, real datasets do tend to have decaying spectra, but the analysis is not a free lunch, and explicit ridge often still outperforms ridgeless.

Choosing Lambda

Definition

Leave-One-Out Cross-Validation Closed Form for Ridge

For ridge regression, leave-one-out (LOO) cross-validation has a closed form, eliminating the need to refit $n$ times:

$$\operatorname{CV}_{\text{LOO}}(\lambda) \;=\; \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat y_i(\lambda)}{1 - h_{ii}(\lambda)} \right)^2,$$

where $h_{ii}(\lambda)$ is the $i$-th diagonal entry of the ridge hat matrix $H_\lambda = X (X^\top X + \lambda I)^{-1} X^\top$. Generalized cross-validation (GCV) replaces the per-row hat diagonal $h_{ii}(\lambda)$ by the average $\operatorname{tr}(H_\lambda) / n = \operatorname{df}(\lambda) / n$, giving a rotation-invariant variant that is often more stable.

Standard practice: grid $\lambda$ on a log scale spanning many decades, compute $\operatorname{CV}_{\text{LOO}}(\lambda)$ or GCV, and pick the minimizer (or the "one-standard-error rule" minimizer for robustness against CV noise). Three alternatives: SURE (closed form for Gaussian noise), marginal-likelihood maximization (Bayesian), and AIC/BIC with effective DF.
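A sketch of the closed-form LOO and GCV curves; forming $H_\lambda$ explicitly is fine at small $n$ (names are illustrative):

```python
import numpy as np

def hat_matrix(X, lam):
    d = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)

def loo_cv(X, y, lam):
    H = hat_matrix(X, lam)
    resid = y - H @ y
    return np.mean((resid / (1.0 - np.diag(H))) ** 2)

def gcv(X, y, lam):
    H = hat_matrix(X, lam)
    resid = y - H @ y
    return np.mean(resid**2) / (1.0 - np.trace(H) / len(y)) ** 2

rng = np.random.default_rng(7)
X = rng.standard_normal((60, 10))
y = X @ rng.standard_normal(10) + rng.standard_normal(60)

lams = np.logspace(-4, 4, 100)            # log-scale grid, many decades
lam_star = lams[np.argmin([loo_cv(X, y, l) for l in lams])]
```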

Canonical Examples

Example

Orthonormal design: pure scalar shrinkage

When $X^\top X = I_d$ (orthonormal columns), the SVD is $X = U V^\top$ with $\sigma_1 = \cdots = \sigma_d = 1$, and ridge reduces to a scalar shrinkage of the OLS estimate:

$$\hat w_\lambda \;=\; \frac{1}{1 + \lambda}\, \hat w_{\text{OLS}}.$$

Each coefficient is multiplied by the same factor $1/(1 + \lambda) \in (0, 1)$. Compare to the lasso, which under the same orthonormal design soft-thresholds each coefficient (at level $\lambda/2$ under this page's penalty convention), sending small ones to exactly zero. Ridge shrinks; lasso shrinks and selects.
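A numerical check of both formulas under a random orthonormal design (the $\lambda/2$ threshold matches the penalty convention above):

```python
import numpy as np

rng = np.random.default_rng(8)
X, _ = np.linalg.qr(rng.standard_normal((100, 5)))  # orthonormal columns: X^T X = I
y = rng.standard_normal(100)
lam = 0.8

w_ols = X.T @ y                                      # OLS, since X^T X = I
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
assert np.allclose(w_ridge, w_ols / (1 + lam))       # pure scalar shrinkage

# Lasso under the same design: soft-threshold each coefficient at lam/2
w_lasso = np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam / 2, 0.0)
```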

Example

Ridge fixes near-collinear features

Suppose $x_1 \approx x_2$. The OLS coefficients $(\hat w_1, \hat w_2)$ can be wildly large with opposite signs because the data does not separate the contributions of $x_1$ and $x_2$, only their sum. Adding $\lambda I$ to the Gram matrix shifts both small singular values away from zero, making the inversion well-conditioned and producing moderate, similar-magnitude coefficients on $x_1$ and $x_2$. The ridge solution honestly reflects "the data cannot tell these apart," whereas the OLS solution amplifies design noise into spurious large coefficients.

Example

Ridge on an underdetermined system

With $n < d$, OLS has infinitely many solutions; the normal equations are singular. Ridge has a unique solution for any $\lambda > 0$, and as $\lambda \downarrow 0$ the ridge solution converges to the minimum-$\ell_2$-norm interpolator $\hat w_{\min} = X^\top (X X^\top)^{-1} y$. This is the "ridgeless" limit and the starting point for the benign-overfitting story.
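A sketch of the ridgeless limit in the underdetermined case (dimensions are assumed):

```python
import numpy as np

rng = np.random.default_rng(9)
n, d = 20, 100                             # overparameterized: d > n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_min = X.T @ np.linalg.solve(X @ X.T, y)  # X^T (X X^T)^{-1} y, min-norm interpolator
assert np.allclose(X @ w_min, y)           # interpolates exactly

for lam in [1e-2, 1e-4, 1e-6]:
    w_lam = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    print(np.linalg.norm(w_lam - w_min))   # -> 0 as lam -> 0
```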

Connection to James-Stein and Empirical Bayes

Ridge and James-Stein are two faces of the same shrinkage idea applied to the Gaussian sequence model.

  • James-Stein: estimate $\theta \in \mathbb R^d$ from $X = \theta + \varepsilon$ with $\varepsilon \sim \mathcal N(0, I_d)$ and dimension $d \ge 3$. The JS estimator $\hat \theta^{JS} = (1 - (d - 2) / \|X\|^2)\, X$ shrinks toward zero by a data-driven factor and dominates the MLE in total MSE.
  • Ridge in the Gaussian sequence model: take $X = U \Sigma V^\top$ with orthonormal design ($\Sigma = I$, $n = d$), so the design contributes nothing and the regression problem becomes the sequence model. Ridge with penalty $\lambda$ gives $\hat \theta_\lambda = (1 / (1 + \lambda))\, X$, a deterministic shrinkage by $1 / (1 + \lambda)$.

The two estimators differ in how the shrinkage factor is chosen: James-Stein adapts to the data via $\|X\|^2$; ridge fixes $\lambda$ in advance (or chooses it by cross-validation). Empirical-Bayes ridge, where $\lambda$ is set by maximizing the marginal likelihood, recovers an estimator very close to James-Stein in this setting and has the same MSE-dominance property in the right regime. See shrinkage estimation and James-Stein for the parallel theory.
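A sketch contrasting the two shrinkage rules in the sequence model (dimension and $\lambda$ are assumed):

```python
import numpy as np

rng = np.random.default_rng(10)
d = 50
theta = rng.standard_normal(d)      # assumed true means
X = theta + rng.standard_normal(d)  # one noisy observation per coordinate

# James-Stein: data-driven shrinkage factor
theta_js = (1.0 - (d - 2) / np.sum(X**2)) * X

# Ridge at fixed lam: deterministic shrinkage by 1/(1 + lam)
lam = 1.0
theta_ridge = X / (1.0 + lam)
```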

Common Confusions

Watch Out

Ridge does not select variables

Ridge shrinks every coefficient toward zero but, except in the degenerate $\lambda = \infty$ limit, sets none of them exactly to zero. For variable selection use the lasso (L1 penalty), the elastic net (L1 + L2), or hard-thresholding methods. Ridge is the right tool when you believe all features contribute something, when you want a stable estimator under collinearity, or when you want a closed form. Ridge is the wrong tool when interpretability requires a sparse model.

Watch Out

Lambda direction in the Bayesian view

The Bayesian equivalence is $\lambda = \sigma^2 / \tau^2$. Large prior variance $\tau^2$ means a weak prior, which corresponds to small $\lambda$ and light shrinkage. Small $\tau^2$ means a strong prior, which corresponds to large $\lambda$ and heavy shrinkage. Students routinely flip the direction. The ratio is an inverse signal-to-noise ratio: noise variance in the likelihood over signal variance in the prior.

Watch Out

Gauss-Markov is not violated by ridge

Gauss-Markov asserts that OLS has minimum variance among linear unbiased estimators. Ridge is biased and falls outside the Gauss-Markov class. There is no contradiction in ridge having lower MSE than OLS. The theorem is sharp on the unbiased side; nothing in it prevents biased estimators from beating OLS on total error.

Watch Out

Standardize before applying ridge

The ridge penalty $\lambda \| w \|_2^2$ depends on the units of the features. A feature measured in millimeters versus meters carries a thousand-times-different penalty. Standard practice is to center each column to mean zero and scale to unit variance before fitting, then unscale the coefficients afterward if you need to report them in the original units. This is not optional; without standardization ridge effectively over-penalizes small-scale features and under-penalizes large-scale ones.

Watch Out

The intercept is usually not penalized

Most software fits ridge with an unpenalized intercept: minimize $\| y - X w - b \mathbf 1 \|_2^2 + \lambda \| w \|_2^2$. Penalizing the intercept would shrink the predicted mean toward zero, which is rarely desired and breaks translation invariance. The standard trick is to center $y$ and each column of $X$, fit ridge without an intercept on the centered data, and recover the intercept as $\hat b = \bar y - \bar x^\top \hat w$.
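A sketch of the centering trick (names are illustrative):

```python
import numpy as np

def ridge_with_intercept(X, y, lam):
    x_bar, y_bar = X.mean(axis=0), y.mean()
    Xc, yc = X - x_bar, y - y_bar        # center features and target
    d = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ yc)
    b = y_bar - x_bar @ w                # recover the unpenalized intercept
    return w, b
```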

Watch Out

Ridgeless OLS is not always disastrous

The classical worry, that overparameterized OLS interpolates noise and generalizes badly, is correct in low effective dimension and isotropic settings. But in high dimension with realistic spectral decay, the minimum-norm interpolator absorbs noise into low-variance directions that barely affect predictions; this is benign overfitting (Bartlett-Long-Lugosi-Tsigler 2020). Whether ridge with $\lambda > 0$ improves on $\lambda = 0$ depends on the spectrum and the SNR; cross-validate rather than assuming.

Summary

  • Ridge objective: $\min_w \| y - Xw \|_2^2 + \lambda \| w \|_2^2$, with closed form $\hat w_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$.
  • SVD geometry: each direction is shrunk by the factor $\sigma_j^2 / (\sigma_j^2 + \lambda)$. Effective DF is the sum of these factors.
  • MSE dominance over OLS: there exists $\lambda^* > 0$ with strictly smaller MSE; finding the optimum requires CV.
  • Bayesian view: ridge equals MAP under a Gaussian prior with $\lambda = \sigma^2 / \tau^2$.
  • Tikhonov regularization: ridge is the simplest case of the $\| L w \|_2^2$ penalties used throughout inverse-problem theory.
  • Kernel ridge: the representer theorem reduces RKHS regression to an $n \times n$ linear system with $\alpha = (K + \lambda I)^{-1} y$.
  • Modern: ridgeless OLS plus the implicit bias of GD gives benign overfitting under spectral decay; double descent is the failure mode at the interpolation threshold, and ridge with optimal $\lambda$ smooths it out.
  • Always standardize features. Do not penalize the intercept. Do not expect variable selection.

Exercises

Exercise (Core)

Problem

Show that for any $\lambda > 0$, the matrix $X^\top X + \lambda I$ is positive definite, even when $X$ is rank-deficient. Conclude that the ridge solution is uniquely defined for every $\lambda > 0$ and every design $X$.

Exercise (Core)

Problem

Derive the bias of the ridge estimator and show that

$$\mathbb E[\hat w_\lambda] - w^* \;=\; -\lambda (X^\top X + \lambda I)^{-1} w^*.$$

In particular, the bias vanishes when $w^* = 0$ (and only then for nontrivial $X$).

Exercise (Advanced)

Problem

Prove the local-improvement statement carefully: if $w^*$ is finite and $\sigma^2 > 0$, then $\partial_\lambda \operatorname{MSE}(\hat w_\lambda) \big|_{\lambda = 0} < 0$. Conclude that there exists $\lambda^* > 0$ with $\operatorname{MSE}(\hat w_{\lambda^*}) < \operatorname{MSE}(\hat w_{\text{OLS}})$.

Exercise (Advanced)

Problem

Derive Stein's unbiased risk estimate for ridge under Gaussian noise: assume $y = X w^* + \varepsilon$ with $\varepsilon \sim \mathcal N(0, \sigma^2 I)$ and known $\sigma^2$. Show that

$$\operatorname{SURE}(\lambda) \;=\; \| y - \hat y_\lambda \|_2^2 + 2 \sigma^2 \operatorname{df}(\lambda) - n \sigma^2$$

is an unbiased estimate of the in-sample prediction risk $\mathbb E \| \hat y_\lambda - X w^* \|_2^2$.

Exercise (Research)

Problem

Benign overfitting in the isotropic case. Consider the linear model $y = X w^* + \varepsilon$ with $X \in \mathbb R^{n \times d}$, $X_{ij} \overset{\text{iid}}{\sim} \mathcal N(0, 1/d)$, $\varepsilon \sim \mathcal N(0, \sigma^2 I_n)$, and $d > n$. Let $\hat w_{\min}$ be the minimum-$\ell_2$-norm interpolator. In the limit $d/n \to \gamma > 1$ with $\| w^* \|_2^2 \to \rho^2$, derive the asymptotic out-of-sample prediction risk $\mathbb E_{x \sim \mathcal N(0, I_d / d)} (x^\top \hat w_{\min} - x^\top w^*)^2$. Show that with isotropic features and a generic $w^*$, ridgeless interpolation is not benign: the variance term diverges as $\gamma \to 1^+$, and for every fixed $\gamma > 1$ the excess risk stays bounded away from zero. Then explain qualitatively which assumption must change to recover benign overfitting.

References

Canonical:

  • Hoerl, A. E., and Kennard, R. W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12(1), 55-67. The original ridge paper.
  • Tikhonov, A. N. (1963). Solution of incorrectly formulated problems and the regularization method. Soviet Mathematics Doklady 4, 1035-1038. The Tikhonov regularization framework.
  • Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapter 3.4 (ridge), Chapter 5.8 (effective DF), Chapter 18.3 (high-dimensional regression).
  • Wahba, G. (1990). Spline Models for Observational Data. SIAM. Reproducing-kernel and smoothing-spline foundations of the representer theorem.

Current:

  • Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. Chapter 7 (regularized linear regression), Chapter 13 (kernel ridge).
  • Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 11.3 (ridge), Chapter 17 (kernel methods).
  • Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2020). Benign overfitting in linear regression. PNAS 117(48), 30063-30070.
  • Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2022). Surprises in high-dimensional ridgeless least squares interpolation. Annals of Statistics 50(2), 949-986.
  • Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off. PNAS 116(32), 15849-15854. The double-descent paper.
  • Tsigler, A., and Bartlett, P. L. (2023). Benign overfitting in ridge regression. JMLR 24(123), 1-76.
  • Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. (2021). Deep double descent: where bigger models and more data hurt. JSTAT 124003. Empirical double-descent across architectures.

Critique and Frontier:

  • Engl, H. W., Hanke, M., and Neubauer, A. (1996). Regularization of Inverse Problems. Kluwer. The ill-posed-problem perspective on Tikhonov regularization, with convergence rates and parameter-choice rules.
  • Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. JRSS-B 58, 267-288. The lasso paper, framing what ridge cannot do (variable selection).
  • Zou, H., Hastie, T., and Tibshirani, R. (2007). On the degrees of freedom of the lasso. Annals of Statistics 35(5), 2173-2192. Effective DF beyond the linear-smoother case.
  • Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. NeurIPS. The NTK / kernel-ridge connection.
  • Schoelkopf, B., Herbrich, R., and Smola, A. J. (2001). A generalized representer theorem. COLT. Modern statement and proof.
