ML Methods

Ridge Regression

L2-regularized least squares: closed-form shrinkage, SVD geometry, MSE dominance over OLS, the Bayesian Gaussian-prior connection, effective degrees of freedom, kernel ridge, and the modern overparameterized regime where ridgeless OLS can still generalize.

Why This Matters

Ordinary least squares is the canonical unbiased linear estimator. The Gauss-Markov theorem certifies it as best-linear-unbiased, and that result is older than statistical learning theory itself. But unbiasedness is a constraint, not a virtue. Once we accept that bias and variance trade against each other, dropping unbiasedness opens a strictly larger admissible region, and ridge regression is the simplest and most-studied estimator in that larger region.

Ridge does three things at once. It stabilizes the solution when $X^\top X$ is ill-conditioned or singular. It shrinks toward the origin in the singular-value basis, with shrinkage strength tied to the inverse signal-to-noise ratio in each direction. And it has a closed form, a Bayesian interpretation, a representer theorem in kernel space, and a clean SVD geometry, so it acts as a Rosetta Stone connecting penalized regression, MAP estimation, kernel methods, and the modern overparameterized linear-model theory of benign overfitting.

This page is the bridge between classical estimation theory (Gauss-Markov, shrinkage and James-Stein, the Cramer-Rao bound) and the modern theory of high-dimensional and overparameterized models (NTK, double descent, benign overfitting, kernels and RKHS). It is the smallest example where a closed-form penalized estimator beats the unbiased estimator on mean squared error, and where the SVD lets you read off the entire bias-variance trade by inspection.

[Figure: constraint-set picture. Ridge (L2): $\|w\|_2^2 \le t$ gives a shrunk solution; lasso (L1): $\|w\|_1 \le t$ gives a sparse solution ($w_2 = 0$). The L1 diamond has corners on the axes, so the constrained optimum often lands on an axis (sparsity); the L2 circle has no corners.]

Mental Model

OLS finds the minimum-norm vector $w$ that explains $y$ as a linear combination of the columns of $X$. Ridge adds a quadratic budget on $\|w\|_2$. Among all $w$ that fit the data well, ridge picks the one with smallest squared norm.

Three pictures, all equivalent:

  1. Constrained optimization. Minimize $\|y - Xw\|_2^2$ subject to $\|w\|_2 \le t$ for some $t$. The Lagrangian gives the penalty form with multiplier $\lambda$.
  2. Shrunk projection. OLS projects $y$ onto the column space of $X$. Ridge does a softer projection: it pulls each component of $\hat y$ in the singular-value basis toward zero by a factor $\sigma_j^2 / (\sigma_j^2 + \lambda)$.
  3. MAP under a Gaussian prior. With $w \sim \mathcal N(0, \tau^2 I)$ and $y \mid w \sim \mathcal N(Xw, \sigma^2 I)$, the posterior mode equals $\hat w_{\text{ridge}}$ at $\lambda = \sigma^2 / \tau^2$.

The same penalty plays three roles: convex regularizer, shrinkage operator on the SVD components, and prior covariance. Pick the picture that fits the question.

Formal Setup

Take the linear model $y = Xw^* + \varepsilon$ with $X \in \mathbb R^{n \times d}$, $y \in \mathbb R^n$, and noise $\varepsilon$ with $\mathbb E[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2 I_n$. No Gaussian assumption is needed for the MSE statements; Gaussianity is needed only for the Bayesian / MAP equivalence and for distributional inference.

Definition

Ridge Regression Estimator

For penalty parameter $\lambda > 0$, the ridge estimator is

$$\hat{w}_{\lambda} \;=\; \arg\min_{w \in \mathbb R^d} \; \|y - Xw\|_2^2 + \lambda \|w\|_2^2.$$

The objective is strictly convex (Hessian $2(X^\top X + \lambda I) \succ 0$ for $\lambda > 0$), so the solution is unique. First-order optimality gives the normal equations $(X^\top X + \lambda I)\, w = X^\top y$ and the closed form

$$\hat{w}_{\lambda} \;=\; (X^\top X + \lambda I)^{-1} X^\top y.$$

The matrix $X^\top X + \lambda I$ is positive definite for any $\lambda > 0$, even when $X^\top X$ is singular ($n < d$ or a rank-deficient design). This is why ridge is well-defined where OLS is not.

The case $\lambda = 0$ recovers OLS when $X^\top X$ is invertible. The case $\lambda \to \infty$ gives $\hat w \to 0$. The interesting regime is intermediate.
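A minimal numerical sketch of the closed form, assuming a random design and NumPy (names are illustrative; `np.linalg.solve` avoids forming an explicit inverse):

```python
import numpy as np

def ridge(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Solve the ridge normal equations (X^T X + lam*I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = X @ rng.standard_normal(10) + 0.5 * rng.standard_normal(50)

w_hat = ridge(X, y, lam=1.0)  # unique for any lam > 0, even if X^T X is singular
```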

SVD Geometry

The singular value decomposition is the cleanest way to see what ridge does. Write $X = U \Sigma V^\top$ where $U \in \mathbb R^{n \times r}$ and $V \in \mathbb R^{d \times r}$ have orthonormal columns and $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_r)$ with $\sigma_1 \ge \cdots \ge \sigma_r > 0$ and $r = \operatorname{rank}(X)$.

Theorem

Ridge Acts as Component-wise Shrinkage in the SVD Basis

Statement

The ridge fitted values are

$$\hat y_{\lambda} \;=\; X \hat w_{\lambda} \;=\; \sum_{j=1}^{r} \frac{\sigma_j^2}{\sigma_j^2 + \lambda} \, u_j u_j^\top y,$$

and the ridge coefficients in the right-singular basis are

$$V^\top \hat w_{\lambda} \;=\; \operatorname{diag}\!\left(\frac{\sigma_j}{\sigma_j^2 + \lambda}\right) U^\top y.$$

The shrinkage factor $d_j(\lambda) = \sigma_j^2 / (\sigma_j^2 + \lambda) \in (0, 1)$ governs each direction independently.

Intuition

OLS fitted values are $\hat y_{\text{OLS}} = \sum_j u_j u_j^\top y$, the full projection onto each left singular direction. Ridge multiplies the $j$-th coefficient by $\sigma_j^2 / (\sigma_j^2 + \lambda)$. For directions with $\sigma_j \gg \sqrt\lambda$ the factor is essentially 1 and ridge looks like OLS. For directions with $\sigma_j \ll \sqrt\lambda$ the factor is essentially 0 and ridge silences that direction. Ridge is a low-pass filter on the singular-value spectrum, with cutoff at $\sigma_j \approx \sqrt\lambda$.

This is also why ridge stabilizes ill-conditioned designs. The OLS coefficients $V^\top \hat w_{\text{OLS}} = \operatorname{diag}(1/\sigma_j)\, U^\top y$ blow up for small $\sigma_j$. Ridge replaces $1/\sigma_j$ by $\sigma_j / (\sigma_j^2 + \lambda)$, which is bounded above by $1 / (2 \sqrt\lambda)$.

Proof Sketch

Substitute $X = U \Sigma V^\top$ into $(X^\top X + \lambda I)^{-1} X^\top$. The Gram matrix is $V \Sigma^2 V^\top$. Padding with the orthogonal complement of $V$ (eigenvalue 0 for $X^\top X$, eigenvalue $\lambda$ for $X^\top X + \lambda I$), the inverse is $V (\Sigma^2 + \lambda I)^{-1} V^\top + \tfrac{1}{\lambda} V_\perp V_\perp^\top$. Multiplying by $X^\top = V \Sigma U^\top$ kills the orthogonal piece (since $V_\perp^\top V = 0$) and yields $V (\Sigma^2 + \lambda I)^{-1} \Sigma U^\top$. Multiplying once more by $X = U \Sigma V^\top$ from the left gives the fitted values.
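A quick numerical check of the identity, under an assumed random design:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 8))
y = rng.standard_normal(30)
lam = 2.0

# Closed form
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(8), X.T @ y)

# SVD shrinkage: y_hat = sum_j [s_j^2 / (s_j^2 + lam)] u_j u_j^T y
U, s, Vt = np.linalg.svd(X, full_matrices=False)
y_svd = U @ ((s**2 / (s**2 + lam)) * (U.T @ y))

assert np.allclose(X @ w_hat, y_svd)  # the two routes agree
```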

Why It Matters

Every other property of ridge follows from this one identity. Bias, variance, MSE, effective degrees of freedom, leave-one-out cross-validation, Bayesian posterior covariance, and kernel-ridge predictions are all weighted sums over $j$ with weights of the form $\sigma_j^2 / (\sigma_j^2 + \lambda)^k$ for small $k$. If you see a quantity attached to ridge and it does not factor through the SVD, suspect an algebra error.

Failure Mode

The shrinkage is anisotropic and depends on the design. Rotating the feature space changes the singular vectors, and hence the coefficient vector, and ridge is not invariant under general linear reparameterizations, in particular per-feature rescaling. Standardize features (zero mean, unit variance) before applying ridge; otherwise the penalty depends on units and on which features happen to have larger raw scale.

Bias, Variance, MSE Dominance

Under $y = Xw^* + \varepsilon$ with $\mathbb E[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2 I$:

  • Bias. $\mathbb E[\hat w_{\lambda}] - w^* = -\lambda (X^\top X + \lambda I)^{-1} w^*$. Nonzero whenever $\lambda > 0$ and $w^* \ne 0$. Grows in magnitude with $\lambda$.
  • Variance. $\operatorname{Var}(\hat w_{\lambda}) = \sigma^2 (X^\top X + \lambda I)^{-1} X^\top X (X^\top X + \lambda I)^{-1}$. Strictly smaller (in PSD order) than the OLS variance $\sigma^2 (X^\top X)^{-1}$ for $\lambda > 0$.
  • MSE. $\operatorname{MSE}(\hat w_{\lambda}) = \|\text{Bias}\|_2^2 + \operatorname{tr}(\operatorname{Var}(\hat w_{\lambda}))$.

Diagonalizing in the SVD basis,

$$\operatorname{MSE}(\hat w_{\lambda}) \;=\; \sum_{j=1}^{d} \left[ \frac{\lambda^2 \theta_j^{*\,2}}{(\sigma_j^2 + \lambda)^2} \;+\; \frac{\sigma^2 \sigma_j^2}{(\sigma_j^2 + \lambda)^2} \right]$$

where $\theta_j^* = v_j^\top w^*$ is the true coefficient projected onto the $j$-th right singular vector. The first term is squared bias per direction, the second is variance per direction.
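To see the trade numerically, the per-direction formula can be evaluated on a grid, with $w^*$ and $\sigma^2$ simulated (a sketch, not a practical recipe for choosing $\lambda$, since both are unknown in practice):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 20))
w_star = rng.standard_normal(20)  # assumed ground truth
sigma2 = 4.0                      # assumed noise variance

U, s, Vt = np.linalg.svd(X, full_matrices=False)
theta = Vt @ w_star               # true coefficients in the right-singular basis

def mse(lam: float) -> float:
    bias2 = (lam**2 * theta**2) / (s**2 + lam)**2  # squared bias per direction
    var = sigma2 * s**2 / (s**2 + lam)**2          # variance per direction
    return float(np.sum(bias2 + var))

lams = np.concatenate([[0.0], np.logspace(-3, 3, 200)])
best = lams[np.argmin([mse(l) for l in lams])]
print(best > 0, mse(best) < mse(0.0))  # True True: some lam > 0 beats OLS
```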

Theorem

Ridge Strictly Improves on OLS for Some Lambda

Statement

There exists $\lambda^* > 0$ such that

$$\operatorname{MSE}(\hat w_{\lambda^*}) \;<\; \operatorname{MSE}(\hat w_{\text{OLS}}).$$

The optimal $\lambda$ depends on unknown quantities ($w^*$ and $\sigma^2$); the existence statement does not.

Intuition

Squared bias starts at zero when $\lambda = 0$ and rises quadratically. Variance starts at the OLS value and decreases. Near $\lambda = 0$ the bias term contributes nothing to the derivative while the variance term contributes a strictly negative slope. So the MSE is locally decreasing at $\lambda = 0$ and a small positive $\lambda$ is always an improvement.

Ridge does not violate the Cramer-Rao bound. The CRB lower-bounds the variance of unbiased estimators; ridge is biased and pays for its variance reduction with bias. The total MSE is what matters, and the trade is favorable in a neighborhood of $\lambda = 0$.

Proof Sketch

Differentiate the per-direction terms. Variance term: $\partial_\lambda \big[\sigma^2 \sigma_j^2 / (\sigma_j^2 + \lambda)^2\big] = -2 \sigma^2 \sigma_j^2 / (\sigma_j^2 + \lambda)^3$, which equals $-2 \sigma^2 / \sigma_j^4$ at $\lambda = 0$. Bias term: $\partial_\lambda \big[\lambda^2 \theta_j^{*\,2} / (\sigma_j^2 + \lambda)^2\big]$ vanishes at $\lambda = 0$ because of the leading $\lambda$. Sum over $j$: $\partial_\lambda \operatorname{MSE} \big|_{\lambda = 0} = -2 \sigma^2 \sum_j 1 / \sigma_j^4 < 0$. So the MSE strictly decreases for small positive $\lambda$.

Why It Matters

This is the cleanest theoretical justification for any kind of regularization. Gauss-Markov says OLS is best among linear unbiased estimators. The class of all linear estimators is strictly larger and contains estimators with strictly lower MSE. Ridge is not a hack and not a heuristic: its MSE is strictly smaller than the OLS value for a non-empty range of $\lambda$. The same conclusion underlies penalized GLMs, weight decay in deep nets, posterior contraction in Bayesian models, and the entire shrinkage program from James-Stein to modern hierarchical Bayes.

Failure Mode

The theorem guarantees existence, not a recipe. The MSE-optimal $\lambda$ depends on $w^*$ and $\sigma^2$, which are unknown. In practice, choose $\lambda$ by leave-one-out cross-validation (closed form for ridge), $K$-fold CV, generalized cross-validation, or marginal-likelihood maximization. Naively setting $\lambda$ by rule of thumb is the most common way to under- or over-regularize.

Effective Degrees of Freedom

Cross-validation answers "what $\lambda$"; effective degrees of freedom answers "how complex is the resulting fit." For a linear smoother $\hat y = H_\lambda y$, the standard definition is $\operatorname{df}(\lambda) = \operatorname{tr}(H_\lambda)$. Ridge is a linear smoother with hat matrix $H_\lambda = X(X^\top X + \lambda I)^{-1} X^\top$.

Proposition

Effective Degrees of Freedom for Ridge

Statement

The effective degrees of freedom of ridge regression at penalty $\lambda$ is

$$\operatorname{df}(\lambda) \;=\; \operatorname{tr}(H_\lambda) \;=\; \sum_{j=1}^{r} \frac{\sigma_j^2}{\sigma_j^2 + \lambda}.$$

This is monotone decreasing in $\lambda$, with $\operatorname{df}(0) = r$ (OLS uses $r$ degrees of freedom on the column space of $X$) and $\operatorname{df}(\infty) = 0$ (a constant predictor uses none).

Intuition

Each shrinkage factor $\sigma_j^2 / (\sigma_j^2 + \lambda)$ is the fraction of "freedom" the model retains in direction $j$. High-$\sigma_j$ directions count almost as a full degree; directions with $\sigma_j^2 \approx \lambda$ count as half; directions with $\sigma_j^2 \ll \lambda$ are essentially frozen and contribute nearly nothing to model complexity. The sum measures the model's effective dimension at the chosen penalty.

Proof Sketch

$H_\lambda = X (X^\top X + \lambda I)^{-1} X^\top = U \Sigma V^\top V (\Sigma^2 + \lambda I)^{-1} V^\top V \Sigma U^\top = U \Sigma (\Sigma^2 + \lambda I)^{-1} \Sigma U^\top = U \operatorname{diag}\big(\sigma_j^2 / (\sigma_j^2 + \lambda)\big) U^\top$. The trace of a matrix in this form is the sum of the entries of the inner diagonal matrix.
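A small sketch of $\operatorname{df}(\lambda)$ computed from singular values alone (the spectrum below is assumed):

```python
import numpy as np

s = np.array([5.0, 3.0, 1.0, 0.5, 0.1])  # assumed singular values of X

def eff_df(lam: float) -> float:
    return float(np.sum(s**2 / (s**2 + lam)))

print(eff_df(0.0))    # r = 5: OLS on the column space
print(eff_df(0.25))   # the direction with s_j^2 = 0.25 contributes exactly 1/2
print(eff_df(1e6))    # ~0: essentially no freedom left
```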

Why It Matters

Effective DF makes ridge comparable to other model-selection scores. AIC, BIC, $C_p$, and Mallows-style criteria all need a model-complexity term, and $\operatorname{df}(\lambda)$ is the right substitute when the penalty is continuous rather than discrete. It also shows up in Stein's unbiased risk estimate (SURE) for ridge: with Gaussian noise, $\operatorname{SURE}(\lambda) = \|y - \hat y_\lambda\|_2^2 + 2 \sigma^2 \operatorname{df}(\lambda) - n \sigma^2$, an unbiased risk estimate computable without resampling.

Failure Mode

The trace definition $\operatorname{tr}(H_\lambda)$ depends on the linear-smoother structure. For nonlinear regularizers (lasso, elastic net) the analogous quantity is more delicate; the lasso's effective degrees of freedom equal the expected number of nonzero coefficients (Zou-Hastie-Tibshirani 2007), not the trace of a hat matrix.

Bayesian Interpretation: Ridge as MAP under a Gaussian Prior

Place a Gaussian prior on $w$ and a Gaussian likelihood on $y$:

$$w \;\sim\; \mathcal N(0, \tau^2 I), \qquad y \mid X, w \;\sim\; \mathcal N(Xw, \sigma^2 I).$$

The posterior is Gaussian (Gaussian-Gaussian conjugacy):

$$w \mid X, y \;\sim\; \mathcal N\!\left( \big(X^\top X + \tfrac{\sigma^2}{\tau^2} I\big)^{-1} X^\top y, \;\; \sigma^2 \big(X^\top X + \tfrac{\sigma^2}{\tau^2} I\big)^{-1} \right).$$

The posterior mean equals the MAP estimate, which equals $\hat w_\lambda$ with $\lambda = \sigma^2 / \tau^2$. Three consequences:

  • Direction. A strong prior (small $\tau^2$) maps to large $\lambda$ and heavy shrinkage; a weak prior (large $\tau^2$) maps to small $\lambda$ and near-OLS behavior. The relationship is reciprocal, not direct.
  • Posterior covariance. Ridge gives a point estimate; the Bayesian view also gives uncertainty, with posterior covariance $\sigma^2 (X^\top X + \lambda I)^{-1}$. This is the natural starting point for credible intervals and predictive variance.
  • Marginal likelihood. Maximizing $p(y \mid X, \tau^2, \sigma^2)$ over $\tau^2$ and $\sigma^2$ gives an empirical-Bayes choice of $\lambda$, an alternative to cross-validation that is fully principled when the Gaussian model is correct.

The connection generalizes. Any quadratic penalty $w^\top \Lambda w$ corresponds to a Gaussian prior with covariance $\Lambda^{-1}$. Penalizing only some coefficients corresponds to flat priors on the others. Group penalties, fused penalties, and Tikhonov penalties of the form $\| L w \|_2^2$ are all Gaussian priors in disguise.
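A sketch verifying the MAP-ridge correspondence numerically (all model quantities below are assumed):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((40, 6))
y = rng.standard_normal(40)
sigma2, tau2 = 1.0, 0.25
lam = sigma2 / tau2  # = 4.0: strong prior (small tau2) -> large lam

A = X.T @ X + lam * np.eye(6)
posterior_mean = np.linalg.solve(A, X.T @ y)  # Gaussian posterior mean = MAP
posterior_cov = sigma2 * np.linalg.inv(A)     # uncertainty comes for free

w_ridge = np.linalg.solve(A, X.T @ y)         # ridge closed form at this lam
assert np.allclose(posterior_mean, w_ridge)
```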

Tikhonov Regularization and Inverse Problems

Ridge's older name in inverse-problem theory is Tikhonov regularization (Tikhonov 1963). The setup is solving $Ax = b$ where $A$ is ill-conditioned or non-invertible, by minimizing $\| A x - b \|_2^2 + \lambda \| L x \|_2^2$ for some operator $L$. Setting $L = I$ recovers ridge with the regression matrix $A = X$ and target $b = y$.

The full Tikhonov form lets you penalize derivatives ($L$ a discrete differentiation operator), encouraging smooth reconstructions; differences from a target ($L x - x_0$), encouraging proximity to a known reference; or specific subspaces ($L$ a projector), penalizing spurious modes. All of these inherit ridge's SVD analysis after a change of variables. The connection makes ridge a special case of a much broader regularization framework used in tomography, deconvolution, geophysical inversion, and learning theory.
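A sketch of the general Tikhonov form with $L$ a second-difference operator, which penalizes curvature and favors smooth reconstructions (the forward map here is an illustrative assumption):

```python
import numpy as np

def tikhonov(A, b, L, lam):
    # Normal equations: (A^T A + lam L^T L) x = A^T b
    return np.linalg.solve(A.T @ A + lam * (L.T @ L), A.T @ b)

d = 50
A = np.tril(np.ones((d, d)))         # assumed forward map (cumulative sum)
rng = np.random.default_rng(5)
x_true = np.sin(np.linspace(0, 3, d))
b = A @ x_true + 0.01 * rng.standard_normal(d)

L = np.diff(np.eye(d), n=2, axis=0)  # rows (1, -2, 1): discrete second derivative
x_hat = tikhonov(A, b, L, lam=1.0)   # smooth reconstruction of x_true
```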

Kernel Ridge Regression

Replace $x_i \in \mathbb R^d$ with a feature map $\phi(x_i)$ in a reproducing kernel Hilbert space (RKHS) and use the kernel trick $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$. The ridge problem becomes

$$\min_{f \in \mathcal H} \; \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \| f \|_{\mathcal H}^2.$$

Theorem

Representer Theorem for Kernel Ridge Regression

Statement

The minimizer of $\sum_i (y_i - f(x_i))^2 + \lambda \| f \|_{\mathcal H}^2$ over $f \in \mathcal H$ has the form

$$\hat f(x) \;=\; \sum_{i=1}^{n} \alpha_i \, K(x_i, x), \qquad \alpha = (K + \lambda I)^{-1} y,$$

where $K$ is the $n \times n$ Gram matrix with $K_{ij} = K(x_i, x_j)$.

Intuition

Even when $\mathcal H$ is infinite-dimensional, the optimal predictor lives in the $n$-dimensional span of the kernel evaluations $\{K(x_i, \cdot)\}_{i=1}^n$. Ridge in an infinite-dimensional RKHS reduces to a finite linear system in $n$ variables. The kernel matrix $K$ plays the role of $X X^\top$, and the kernel trick lets you compute its entries without ever forming the feature map.

Proof Sketch

Decompose any $f \in \mathcal H$ as $f = f_\parallel + f_\perp$ where $f_\parallel \in \operatorname{span}\{K(x_i, \cdot)\}$ and $f_\perp$ is orthogonal. The reproducing property gives $f(x_i) = \langle f, K(x_i, \cdot) \rangle = f_\parallel(x_i)$, so the loss depends only on $f_\parallel$. The regularizer is $\| f \|_{\mathcal H}^2 = \| f_\parallel \|_{\mathcal H}^2 + \| f_\perp \|_{\mathcal H}^2$, strictly increasing in $\| f_\perp \|_{\mathcal H}$. So $f_\perp = 0$ at the optimum and $\hat f$ is a linear combination of $\{K(x_i, \cdot)\}$. Plug in to get a finite system in the coefficients $\alpha$.
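A minimal kernel-ridge sketch following the theorem; the RBF kernel and its bandwidth are assumptions, not prescribed by the theorem:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(80)

lam = 0.1
K = rbf_kernel(X, X)                        # n x n Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(80), y)

X_test = np.linspace(-3, 3, 200)[:, None]
f_hat = rbf_kernel(X_test, X) @ alpha       # f(x) = sum_i alpha_i K(x_i, x)
```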

Why It Matters

The representer theorem turns infinite-dimensional regression into linear algebra on an $n \times n$ Gram matrix. It underlies kernel ridge regression, Gaussian process regression (its Bayesian twin), support vector regression, and the entire neural tangent kernel (NTK) connection: in the NTK regime, training a wide neural network with squared loss and a small learning rate is equivalent to kernel ridge regression with kernel $K_{\text{NTK}}$. The representer theorem is what makes that equivalence computable.

Failure Mode

Two failure modes. First, the $n \times n$ kernel matrix costs $O(n^3)$ to invert and $O(n^2)$ to store; kernel ridge does not scale to large $n$ without approximations (Nystrom, random features, conjugate gradients). Second, the representer theorem assumes the regularizer depends only on $\| f \|_{\mathcal H}$. For penalties like total variation or sparsity-inducing norms in function space, the optimal $f$ need not lie in the kernel span and the theorem does not apply.

Modern Frontier: Ridgeless OLS and Benign Overfitting

In the overparameterized regime ($d \gg n$), the OLS problem has infinitely many minimizers, and the minimum-$\ell_2$-norm solution $\hat w_{\text{OLS}}^{\min} = X^+ y = X^\top (X X^\top)^{-1} y$ is what gradient descent on the unregularized least-squares loss converges to from zero initialization. This is "ridgeless OLS" and corresponds to $\lambda \downarrow 0$ in the ridge problem.

Classical theory predicts disaster: zero training error plus arbitrary norm should give arbitrarily bad generalization. The empirical surprise is that ridgeless OLS often generalizes well in the heavily overparameterized regime, sometimes better than at the interpolation threshold ($d = n$). This is benign overfitting (Bartlett, Long, Lugosi, Tsigler 2020; Hastie, Montanari, Rosset, Tibshirani 2022), and it gives a tractable linear-model analogue of the deep-learning generalization puzzle.

Theorem

Ridgeless OLS Can Generalize in High Dimension

Statement

For a sequence of overparameterized linear regression problems with $d/n \to \gamma > 1$ and population covariance $\Sigma_x$ whose effective rank $r_k(\Sigma_x) = (\sum_{j > k} \lambda_j) / \lambda_{k+1}$ grows fast enough, the ridgeless minimum-norm interpolator $\hat w_{\min}$ has population MSE that converges to a finite limit, and in some regimes that limit is smaller than the interpolation-threshold MSE at $d = n$.

The precise form (Bartlett-Long-Lugosi-Tsigler) bounds the excess risk by a sum of two terms: a bias term scaling with $\| w^* \|_{\Sigma_x}^2 / r_0(\Sigma_x)$ and a variance term scaling with $\sigma^2 [k^*/n + n / R_{k^*}(\Sigma_x)]$, where $k^*$ is an effective-rank cutoff. Both terms can be small simultaneously when the spectrum decays sufficiently slowly.

Intuition

Two ideas. First, the minimum-$\ell_2$-norm interpolator is implicit ridge: gradient descent from zero on the unregularized loss, with sufficiently small step size, approaches the smallest-norm interpolating solution rather than an arbitrary one. The implicit bias of optimization fills in for the missing explicit penalty.

Second, in high dimension with mild spectral decay, almost all directions of $\Sigma_x$ have small variance and absorb the noise harmlessly, while the few high-variance directions carry the signal. The absorption is benign because the noise-absorbing directions barely affect predictions on a fresh test point. Classical low-dimensional theory missed this because it assumed $d/n \to 0$, where the absorption mechanism does not exist.

Proof Sketch

Decompose the excess risk as bias plus variance using the SVD of the design (or the eigendecomposition of $\Sigma_x$). The bias term is bounded by the projection of $w^*$ onto the span of the bottom eigendirections of the empirical Gram matrix; control it via concentration of the spectrum. The variance term factors as a sum over eigendirections of $\Sigma_x$, each contributing $\sigma^2 \lambda_j^2 / (\lambda_j n + \text{tail})^2$ (analogous to the ridge formula but with no explicit penalty); the effective-rank conditions ensure the sum is small. Full details in Bartlett-Long-Lugosi-Tsigler (PNAS 2020) and the companion analysis by Tsigler-Bartlett (2023).

Why It Matters

Ridgeless overfitting is the smallest model where modern overparameterized generalization can be analyzed in closed form. It explains why explicit regularization is sometimes unnecessary in deep learning (the implicit bias of SGD plus high effective dimension can suffice), it connects ridge to NTK theory (training a wide net is kernel ridge with $\lambda \to 0$), and it reframes the bias-variance tradeoff in the regime where classical asymptotics fail.

The companion phenomenon is double descent: as $d/n$ crosses 1, test error often peaks at the interpolation threshold and then descends again as $d/n$ grows further (Belkin-Hsu-Ma-Mandal 2019; Nakkiran-Kaplun-Bansal-Yang-Barak-Sutskever 2021). Ridge with optimally chosen $\lambda$ smooths out the interpolation peak, recovering a monotone decrease in test error.

Failure Mode

The benign-overfitting result is fragile to spectral assumptions. With isotropic covariance $\Sigma_x = I$, ridgeless OLS is asymptotically as bad as classical theory predicts; benign behavior requires a long tail of small eigenvalues that absorb noise. It is also sensitive to the relationship between $w^*$ and the eigenstructure: if $w^*$ aligns with low-variance directions, the bias term dominates. Empirically, real datasets do tend to have decaying spectra, but the analysis is not a free lunch, and explicit ridge often still outperforms ridgeless.

Choosing Lambda

Definition

Leave-One-Out Cross-Validation Closed Form for Ridge

For ridge regression, leave-one-out (LOO) cross-validation has a closed form, eliminating the need to refit $n$ times:

$$\operatorname{CV}_{\text{LOO}}(\lambda) \;=\; \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat y_i(\lambda)}{1 - h_{ii}(\lambda)} \right)^2,$$

where $h_{ii}(\lambda)$ is the $i$-th diagonal entry of the ridge hat matrix $H_\lambda = X (X^\top X + \lambda I)^{-1} X^\top$. Generalized cross-validation (GCV) replaces the per-row hat diagonal $h_{ii}(\lambda)$ by the average $\operatorname{tr}(H_\lambda) / n = \operatorname{df}(\lambda) / n$, giving a rotation-invariant variant that is often more stable.

Standard practice: grid $\lambda$ on a log scale spanning many decades, compute $\operatorname{CV}_{\text{LOO}}(\lambda)$ or GCV, and pick the minimizer (or the "one-standard-error rule" minimizer for robustness against CV noise). Three alternatives: SURE (closed form for Gaussian noise), marginal-likelihood maximization (Bayesian), and AIC/BIC with effective DF.
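A sketch of the closed-form LOO and GCV curves; forming $H_\lambda$ explicitly is fine at small $n$ (names are illustrative):

```python
import numpy as np

def hat_matrix(X, lam):
    d = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)

def loo_cv(X, y, lam):
    H = hat_matrix(X, lam)
    resid = y - H @ y
    return np.mean((resid / (1.0 - np.diag(H))) ** 2)

def gcv(X, y, lam):
    H = hat_matrix(X, lam)
    resid = y - H @ y
    return np.mean(resid**2) / (1.0 - np.trace(H) / len(y)) ** 2

rng = np.random.default_rng(7)
X = rng.standard_normal((60, 10))
y = X @ rng.standard_normal(10) + rng.standard_normal(60)

lams = np.logspace(-4, 4, 100)            # log-scale grid, many decades
lam_star = lams[np.argmin([loo_cv(X, y, l) for l in lams])]
```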

Canonical Examples

Example

Orthonormal design: pure scalar shrinkage

When $X^\top X = I_d$ (orthonormal columns), the SVD is $X = U V^\top$ with $\sigma_1 = \cdots = \sigma_d = 1$, and ridge reduces to a scalar shrinkage of the OLS estimate:

$$\hat w_\lambda \;=\; \frac{1}{1 + \lambda}\, \hat w_{\text{OLS}}.$$

Each coefficient is multiplied by the same factor $1/(1 + \lambda) \in (0, 1)$. Compare to the lasso, which under the same orthonormal design soft-thresholds each coefficient (at level $\lambda/2$ under this page's penalty convention), sending small ones to exactly zero. Ridge shrinks; lasso shrinks and selects.
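A numerical check of both formulas under a random orthonormal design (the $\lambda/2$ threshold matches the penalty convention above):

```python
import numpy as np

rng = np.random.default_rng(8)
X, _ = np.linalg.qr(rng.standard_normal((100, 5)))  # orthonormal columns: X^T X = I
y = rng.standard_normal(100)
lam = 0.8

w_ols = X.T @ y                                      # OLS, since X^T X = I
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
assert np.allclose(w_ridge, w_ols / (1 + lam))       # pure scalar shrinkage

# Lasso under the same design: soft-threshold each coefficient at lam/2
w_lasso = np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam / 2, 0.0)
```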

Example

Ridge fixes near-collinear features

Suppose $x_1 \approx x_2$. The OLS coefficients $(\hat w_1, \hat w_2)$ can be wildly large with opposite signs because the data does not separate the contributions of $x_1$ and $x_2$, only their sum. Adding $\lambda I$ to the Gram matrix shifts both small singular values away from zero, making the inversion well-conditioned and producing moderate, similar-magnitude coefficients on $x_1$ and $x_2$. The ridge solution honestly reflects "the data cannot tell these apart," whereas the OLS solution amplifies design noise into spurious large coefficients.

Example

Ridge on an underdetermined system

With $n < d$, OLS has infinitely many solutions; the normal equations are singular. Ridge has a unique solution for any $\lambda > 0$, and as $\lambda \downarrow 0$ the ridge solution converges to the minimum-$\ell_2$-norm interpolator $\hat w_{\min} = X^\top (X X^\top)^{-1} y$. This is the "ridgeless" limit and the starting point for the benign-overfitting story.
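A sketch of the ridgeless limit in the underdetermined case (dimensions are assumed):

```python
import numpy as np

rng = np.random.default_rng(9)
n, d = 20, 100                             # overparameterized: d > n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_min = X.T @ np.linalg.solve(X @ X.T, y)  # X^T (X X^T)^{-1} y, min-norm interpolator
assert np.allclose(X @ w_min, y)           # interpolates exactly

for lam in [1e-2, 1e-4, 1e-6]:
    w_lam = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    print(np.linalg.norm(w_lam - w_min))   # -> 0 as lam -> 0
```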

Connection to James-Stein and Empirical Bayes

Ridge and James-Stein are two faces of the same shrinkage idea applied to the Gaussian sequence model.

  • James-Stein: estimate $\theta \in \mathbb R^d$ from $X = \theta + \varepsilon$ with $\varepsilon \sim \mathcal N(0, I_d)$ and dimension $d \ge 3$. The JS estimator $\hat \theta^{JS} = (1 - (d - 2) / \|X\|^2)\, X$ shrinks toward zero by a data-driven factor and dominates the MLE in total MSE.
  • Ridge in the Gaussian sequence model: take $X = U \Sigma V^\top$ with orthonormal design ($\Sigma = I$, $n = d$), so the design contributes nothing and the regression problem becomes the sequence model. Ridge with penalty $\lambda$ gives $\hat \theta_\lambda = (1 / (1 + \lambda))\, X$, a deterministic shrinkage by $1 / (1 + \lambda)$.

The two estimators differ in how the shrinkage factor is chosen: James-Stein adapts to the data via $\|X\|^2$; ridge fixes $\lambda$ in advance (or chooses it by cross-validation). Empirical-Bayes ridge, where $\lambda$ is set by maximizing the marginal likelihood, recovers an estimator very close to James-Stein in this setting and has the same MSE-dominance property in the right regime. See shrinkage estimation and James-Stein for the parallel theory.
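A sketch contrasting the two shrinkage rules in the sequence model (dimension and $\lambda$ are assumed):

```python
import numpy as np

rng = np.random.default_rng(10)
d = 50
theta = rng.standard_normal(d)      # assumed true means
X = theta + rng.standard_normal(d)  # one noisy observation per coordinate

# James-Stein: data-driven shrinkage factor
theta_js = (1.0 - (d - 2) / np.sum(X**2)) * X

# Ridge at fixed lam: deterministic shrinkage by 1/(1 + lam)
lam = 1.0
theta_ridge = X / (1.0 + lam)
```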

Common Confusions

Watch Out

Ridge does not select variables

Ridge shrinks every coefficient toward zero but, except in the degenerate $\lambda = \infty$ limit, sets none of them exactly to zero. For variable selection use the lasso (L1 penalty), the elastic net (L1 + L2), or hard-thresholding methods. Ridge is the right tool when you believe all features contribute something, when you want a stable estimator under collinearity, or when you want a closed form. Ridge is the wrong tool when interpretability requires a sparse model.

Watch Out

Lambda direction in the Bayesian view

The Bayesian equivalence is $\lambda = \sigma^2 / \tau^2$. Large prior variance $\tau^2$ means a weak prior, which corresponds to small $\lambda$ and light shrinkage. Small $\tau^2$ means a strong prior, which corresponds to large $\lambda$ and heavy shrinkage. Students routinely flip the direction. The ratio is an inverse signal-to-noise ratio: noise variance in the likelihood over signal variance in the prior.

Watch Out

Gauss-Markov is not violated by ridge

Gauss-Markov asserts that OLS has minimum variance among linear unbiased estimators. Ridge is biased and falls outside the Gauss-Markov class. There is no contradiction in ridge having lower MSE than OLS. The theorem is sharp on the unbiased side; nothing in it prevents biased estimators from beating OLS on total error.

Watch Out

Standardize before applying ridge

The ridge penalty $\lambda \| w \|_2^2$ depends on the units of the features. A feature measured in millimeters versus meters carries a thousand-times-different penalty. Standard practice is to center each column to mean zero and scale to unit variance before fitting, then unscale the coefficients afterward if you need to report them in the original units. This is not optional; without standardization ridge effectively over-penalizes small-scale features and under-penalizes large-scale ones.

Watch Out

The intercept is usually not penalized

Most software fits ridge with an unpenalized intercept: minimize $\| y - X w - b \mathbf 1 \|_2^2 + \lambda \| w \|_2^2$. Penalizing the intercept would shrink the predicted mean toward zero, which is rarely desired and breaks translation invariance. The standard trick is to center $y$ and each column of $X$, fit ridge without an intercept on the centered data, and recover the intercept as $\hat b = \bar y - \bar x^\top \hat w$.
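A sketch of the centering trick (names are illustrative):

```python
import numpy as np

def ridge_with_intercept(X, y, lam):
    x_bar, y_bar = X.mean(axis=0), y.mean()
    Xc, yc = X - x_bar, y - y_bar        # center features and target
    d = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ yc)
    b = y_bar - x_bar @ w                # recover the unpenalized intercept
    return w, b
```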

Watch Out

Ridgeless OLS is not always disastrous

The classical worry, that overparameterized OLS interpolates noise and generalizes badly, is correct in low effective dimension and isotropic settings. But in high dimension with realistic spectral decay, the minimum-norm interpolator absorbs noise into low-variance directions that barely affect predictions; this is benign overfitting (Bartlett-Long-Lugosi-Tsigler 2020). Whether ridge with $\lambda > 0$ improves on $\lambda = 0$ depends on the spectrum and the SNR; cross-validate rather than assuming.

Summary

  • Ridge objective: $\min_w \| y - Xw \|_2^2 + \lambda \| w \|_2^2$, with closed form $\hat w_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$.
  • SVD geometry: each direction is shrunk by the factor $\sigma_j^2 / (\sigma_j^2 + \lambda)$. Effective DF is the sum of these factors.
  • MSE dominance over OLS: there exists $\lambda^* > 0$ with strictly smaller MSE; finding the optimum requires CV.
  • Bayesian view: ridge equals MAP under a Gaussian prior with $\lambda = \sigma^2 / \tau^2$.
  • Tikhonov regularization: ridge is the simplest case of the $\| L w \|_2^2$ penalties used throughout inverse-problem theory.
  • Kernel ridge: the representer theorem reduces RKHS regression to an $n \times n$ linear system with $\alpha = (K + \lambda I)^{-1} y$.
  • Modern: ridgeless OLS plus the implicit bias of GD gives benign overfitting under spectral decay; double descent is the failure mode at the interpolation threshold, and ridge with optimal $\lambda$ smooths it out.
  • Always standardize features. Do not penalize the intercept. Do not expect variable selection.

Exercises

Exercise (Core)

Problem

Show that for any $\lambda > 0$, the matrix $X^\top X + \lambda I$ is positive definite, even when $X$ is rank-deficient. Conclude that the ridge solution is uniquely defined for every $\lambda > 0$ and every design $X$.

Exercise (Core)

Problem

Derive the bias of the ridge estimator and show that

$$\mathbb E[\hat w_\lambda] - w^* \;=\; -\lambda (X^\top X + \lambda I)^{-1} w^*.$$

In particular, the bias vanishes when $w^* = 0$ (and only then for nontrivial $X$).

Exercise (Advanced)

Problem

Prove the local-improvement statement carefully: if $w^*$ is finite and $\sigma^2 > 0$, then $\partial_\lambda \operatorname{MSE}(\hat w_\lambda) \big|_{\lambda = 0} < 0$. Conclude that there exists $\lambda^* > 0$ with $\operatorname{MSE}(\hat w_{\lambda^*}) < \operatorname{MSE}(\hat w_{\text{OLS}})$.

Exercise (Advanced)

Problem

Derive Stein's unbiased risk estimate for ridge under Gaussian noise: assume $y = X w^* + \varepsilon$ with $\varepsilon \sim \mathcal N(0, \sigma^2 I)$ and known $\sigma^2$. Show that

$$\operatorname{SURE}(\lambda) \;=\; \| y - \hat y_\lambda \|_2^2 + 2 \sigma^2 \operatorname{df}(\lambda) - n \sigma^2$$

is an unbiased estimate of the in-sample prediction risk $\mathbb E \| \hat y_\lambda - X w^* \|_2^2$.

Exercise (Research)

Problem

Benign overfitting in the isotropic case. Consider the linear model $y = X w^* + \varepsilon$ with $X \in \mathbb R^{n \times d}$, $X_{ij} \overset{\text{iid}}{\sim} \mathcal N(0, 1/d)$, $\varepsilon \sim \mathcal N(0, \sigma^2 I_n)$, and $d > n$. Let $\hat w_{\min}$ be the minimum-$\ell_2$-norm interpolator. In the limit $d/n \to \gamma > 1$ with $\| w^* \|_2^2 \to \rho^2$, derive the asymptotic out-of-sample prediction risk $\mathbb E_{x \sim \mathcal N(0, I_d / d)} (x^\top \hat w_{\min} - x^\top w^*)^2$. Show that with isotropic features and a generic $w^*$, ridgeless interpolation is not benign: the variance term diverges as $\gamma \to 1^+$, and for every fixed $\gamma > 1$ the excess risk stays bounded away from zero. Then explain qualitatively which assumption must change to recover benign overfitting.

References

Canonical:

  • Hoerl, A. E., and Kennard, R. W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12(1), 55-67. The original ridge paper.
  • Tikhonov, A. N. (1963). Solution of incorrectly formulated problems and the regularization method. Soviet Mathematics Doklady 4, 1035-1038. The Tikhonov regularization framework.
  • Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapter 3.4 (ridge), Chapter 5.8 (effective DF), Chapter 18.3 (high-dimensional regression).
  • Wahba, G. (1990). Spline Models for Observational Data. SIAM. Reproducing-kernel and smoothing-spline foundations of the representer theorem.

Current:

  • Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. Chapter 7 (regularized linear regression), Chapter 13 (kernel ridge).
  • Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 11.3 (ridge), Chapter 17 (kernel methods).
  • Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2020). Benign overfitting in linear regression. PNAS 117(48), 30063-30070.
  • Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2022). Surprises in high-dimensional ridgeless least squares interpolation. Annals of Statistics 50(2), 949-986.
  • Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off. PNAS 116(32), 15849-15854. The double-descent paper.
  • Tsigler, A., and Bartlett, P. L. (2023). Benign overfitting in ridge regression. JMLR 24(123), 1-76.
  • Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. (2021). Deep double descent: where bigger models and more data hurt. JSTAT 124003. Empirical double-descent across architectures.

Critique and Frontier:

  • Engl, H. W., Hanke, M., and Neubauer, A. (1996). Regularization of Inverse Problems. Kluwer. The ill-posed-problem perspective on Tikhonov regularization, with convergence rates and parameter-choice rules.
  • Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. JRSS-B 58, 267-288. The lasso paper, framing what ridge cannot do (variable selection).
  • Zou, H., Hastie, T., and Tibshirani, R. (2007). On the degrees of freedom of the lasso. Annals of Statistics 35(5), 2173-2192. Effective DF beyond the linear-smoother case.
  • Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. NeurIPS. The NTK / kernel-ridge connection.
  • Schoelkopf, B., Herbrich, R., and Smola, A. J. (2001). A generalized representer theorem. COLT. Modern statement and proof.
