ML Methods
Ridge Regression
L2-regularized least squares: closed-form shrinkage, SVD geometry, MSE dominance over OLS, the Bayesian Gaussian-prior connection, effective degrees of freedom, kernel ridge, and the modern overparameterized regime where ridgeless OLS can still generalize.
Why This Matters
Ordinary least squares is the canonical unbiased linear estimator. The Gauss-Markov theorem certifies it as best-linear-unbiased, and that result is older than statistical learning theory itself. But unbiasedness is a constraint, not a virtue. Once we accept that bias and variance trade against each other, dropping unbiasedness opens a strictly larger admissible region, and ridge regression is the simplest and most-studied estimator in that larger region.
Ridge does three things at once. It stabilizes the solution when $X^\top X$ is ill-conditioned or singular. It shrinks $\hat\beta$ toward the origin in the singular-value basis, with shrinkage strength tied to the inverse signal-to-noise ratio in each direction. And it has a closed form, a Bayesian interpretation, a representer theorem in kernel space, and a clean SVD geometry, so it acts as a Rosetta Stone connecting penalized regression, MAP estimation, kernel methods, and the modern overparameterized linear-model theory of benign overfitting.
The page is the bridge between classical estimation theory (Gauss-Markov, shrinkage and James-Stein, Cramer-Rao bound) and the modern theory of high-dimensional and overparameterized models (NTK, double descent, benign overfitting, kernels and RKHS). It is the smallest example where a closed-form penalized estimator beats the unbiased estimator on mean squared error, and where the SVD lets you read off the entire bias-variance trade by inspection.
Mental Model
OLS finds a coefficient vector $\beta$ that explains $y$ as a linear combination of the columns of $X$. Ridge adds a quadratic budget on $\|\beta\|_2$. Among all $\beta$ that fit the data well, ridge picks the one with smallest squared norm.
Three pictures, all equivalent:
- Constrained optimization. Minimize $\|y - X\beta\|_2^2$ subject to $\|\beta\|_2^2 \le t$ for some $t > 0$. The Lagrangian gives the penalty form with multiplier $\lambda$.
- Shrunk projection. OLS projects $y$ onto the column space of $X$. Ridge does a softer projection: it pulls each component of $y$ in the singular-value basis toward zero by a factor $s_i^2/(s_i^2 + \lambda)$.
- MAP under a Gaussian prior. With prior $\beta \sim \mathcal{N}(0, \tau^2 I)$ and likelihood $y \mid \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$, the posterior mode equals $\hat\beta_\lambda$ at $\lambda = \sigma^2/\tau^2$.
The same penalty plays three roles: convex regularizer, shrinkage operator on the SVD components, and prior covariance. Pick the picture that fits the question.
Formal Setup
Take the linear model $y = X\beta + \varepsilon$ with $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, and noise $\varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Cov}(\varepsilon) = \sigma^2 I$. No Gaussian assumption is needed for the MSE statements; Gaussianity is needed only for the Bayesian / MAP equivalence and for distributional inference.
Ridge Regression Estimator
For penalty parameter $\lambda > 0$, the ridge estimator is
$$\hat\beta_\lambda = \arg\min_{\beta \in \mathbb{R}^p} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2.$$
The objective is strictly convex (Hessian $2(X^\top X + \lambda I) \succ 0$ for $\lambda > 0$), so the solution is unique. First-order optimality gives the normal equations $(X^\top X + \lambda I)\,\hat\beta_\lambda = X^\top y$ and the closed form
$$\hat\beta_\lambda = (X^\top X + \lambda I)^{-1} X^\top y.$$
The matrix $X^\top X + \lambda I$ is positive definite for any $\lambda > 0$, even when $X^\top X$ is singular ($p > n$ or a rank-deficient design). This is why ridge is well-defined where OLS is not.
The case $\lambda = 0$ recovers OLS when $X^\top X$ is invertible. The case $\lambda \to \infty$ gives $\hat\beta_\lambda \to 0$. The interesting regime is intermediate.
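A minimal NumPy sketch of the closed form; the synthetic data, dimensions, and $\lambda$ value are illustrative assumptions, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 10, 1.0

X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# Closed form: (X^T X + lambda I)^{-1} X^T y.
# Solve the linear system rather than forming the inverse explicitly.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# lambda -> 0 recovers OLS when X^T X is invertible.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.linalg.norm(beta_ridge), np.linalg.norm(beta_ols))
```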
SVD Geometry
The singular value decomposition is the cleanest way to see what ridge does. Write $X = U \Sigma V^\top$ where $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{p \times r}$ have orthonormal columns and $\Sigma = \mathrm{diag}(s_1, \dots, s_r)$ with $s_1 \ge \dots \ge s_r > 0$ and $r = \mathrm{rank}(X)$.
Ridge Acts as Component-wise Shrinkage in the SVD Basis
Statement
The ridge fitted values are
$$\hat y_\lambda = X\hat\beta_\lambda = \sum_{i=1}^{r} \frac{s_i^2}{s_i^2 + \lambda}\,(u_i^\top y)\,u_i,$$
and the ridge coefficients in the right-singular basis are
$$v_i^\top \hat\beta_\lambda = \frac{s_i}{s_i^2 + \lambda}\, u_i^\top y, \qquad i = 1, \dots, r.$$
The shrinkage factor $s_i^2/(s_i^2 + \lambda)$ governs each direction independently.
Intuition
OLS fitted values are $\hat y_{\mathrm{OLS}} = \sum_i u_i u_i^\top y$, the full projection onto each left singular direction. Ridge multiplies the $i$-th coefficient by $s_i^2/(s_i^2 + \lambda)$. For directions with $s_i^2 \gg \lambda$ the factor is essentially 1 and ridge looks like OLS. For directions with $s_i^2 \ll \lambda$ the factor is essentially 0 and ridge silences that direction. Ridge is a low-pass filter on the singular-value spectrum, with cutoff around $s_i^2 \approx \lambda$.
This is also why ridge stabilizes ill-conditioned designs. The OLS coefficient $u_i^\top y / s_i$ blows up for small $s_i$. Ridge replaces $1/s_i$ by $s_i/(s_i^2 + \lambda)$, which is bounded above by $1/(2\sqrt{\lambda})$.
Proof Sketch
Substitute $X = U \Sigma V^\top$ into $(X^\top X + \lambda I)^{-1} X^\top y$. The Gram matrix is $X^\top X = V \Sigma^2 V^\top$. Padding with the orthogonal complement of $V$ (eigenvalue 0 for $X^\top X$, eigenvalue $\lambda$ for $X^\top X + \lambda I$), the inverse is $V(\Sigma^2 + \lambda I)^{-1}V^\top + \lambda^{-1}(I - VV^\top)$. Multiplying by $X^\top y = V \Sigma U^\top y$ kills the orthogonal piece (since $(I - VV^\top)V = 0$) and yields $\hat\beta_\lambda = V(\Sigma^2 + \lambda I)^{-1}\Sigma\, U^\top y$. Multiplying once more by $X = U \Sigma V^\top$ from the left gives the fitted values.
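A quick numerical check of the identity, under the same kind of synthetic-design assumptions as the earlier sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 40, 8, 2.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Direct closed form.
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# SVD route: scale each component u_i^T y by s_i / (s_i^2 + lambda) in the v_i basis.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))
assert np.allclose(beta_direct, beta_svd)

# Fitted values shrink each projection u_i^T y by s_i^2 / (s_i^2 + lambda).
yhat_svd = U @ ((s**2 / (s**2 + lam)) * (U.T @ y))
assert np.allclose(X @ beta_direct, yhat_svd)
```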
Why It Matters
Every other property of ridge follows from this one identity. Bias, variance, MSE, effective degrees of freedom, leave-one-out cross-validation, Bayesian posterior covariance, and kernel-ridge predictions are all weighted sums over the singular directions with weights built from the shrinkage factors $s_i^2/(s_i^2 + \lambda)$. If you see a quantity attached to ridge and it does not factor through the SVD, suspect an algebra error.
Failure Mode
The shrinkage is anisotropic and depends on the design. Rotating the feature space changes the singular vectors and therefore the ridge solution; ridge is not invariant under arbitrary linear reparameterization. Standardize features (zero mean, unit variance) before applying ridge, otherwise the penalty depends on units and on which features happen to have larger raw scale.
Bias, Variance, MSE Dominance
Under the model $y = X\beta + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Cov}(\varepsilon) = \sigma^2 I$:
- Bias. $\mathbb{E}[\hat\beta_\lambda] - \beta = -\lambda (X^\top X + \lambda I)^{-1}\beta$. Nonzero whenever $\lambda > 0$ and $\beta \neq 0$. Grows in magnitude with $\lambda$.
- Variance. $\mathrm{Cov}(\hat\beta_\lambda) = \sigma^2 (X^\top X + \lambda I)^{-1} X^\top X\,(X^\top X + \lambda I)^{-1}$. Strictly smaller (in PSD order) than the OLS variance $\sigma^2 (X^\top X)^{-1}$ for $\lambda > 0$.
- MSE. $\mathrm{MSE}(\lambda) = \mathbb{E}\,\|\hat\beta_\lambda - \beta\|_2^2 = \|\mathbb{E}[\hat\beta_\lambda] - \beta\|_2^2 + \mathrm{tr}\,\mathrm{Cov}(\hat\beta_\lambda)$.
Diagonalizing in the SVD basis,
$$\mathrm{MSE}(\lambda) = \sum_{i=1}^{r} \frac{\lambda^2 \theta_i^2}{(s_i^2 + \lambda)^2} + \sigma^2 \sum_{i=1}^{r} \frac{s_i^2}{(s_i^2 + \lambda)^2},$$
where $\theta_i = v_i^\top \beta$ is the true coefficient projected onto the $i$-th right singular vector. The first term is squared bias per direction, the second is variance per direction.
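The per-direction decomposition is easy to evaluate numerically. A sketch, with a placeholder true $\beta$, noise level, and design:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 60, 12, 1.0
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
theta = Vt @ beta                      # true coefficients in the right-singular basis

def mse(lam):
    bias2 = np.sum((lam * theta / (s**2 + lam))**2)
    var = sigma**2 * np.sum(s**2 / (s**2 + lam)**2)
    return bias2 + var

lams = np.logspace(-4, 3, 50)
risks = [mse(l) for l in lams]
print("OLS MSE (lambda = 0):", mse(0.0))
print("best lambda on grid:", lams[int(np.argmin(risks))], "MSE:", min(risks))
```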
Ridge Strictly Improves on OLS for Some Lambda
Statement
There exists $\lambda > 0$ such that
$$\mathbb{E}\,\|\hat\beta_\lambda - \beta\|_2^2 < \mathbb{E}\,\|\hat\beta_{\mathrm{OLS}} - \beta\|_2^2.$$
The optimal $\lambda$ depends on unknown quantities ($\beta$ and $\sigma^2$); the existence statement does not.
Intuition
Squared bias starts at zero when $\lambda = 0$ and rises quadratically in $\lambda$. Variance starts at the OLS value and decreases. Near $\lambda = 0$ the bias term contributes nothing to the derivative while the variance term contributes a strictly negative slope. So MSE is locally decreasing at $\lambda = 0$ and a small positive $\lambda$ is always an improvement.
Ridge does not violate the Cramer-Rao bound. The CRB lower-bounds variance for unbiased estimators; ridge is biased and pays for its variance reduction with bias. The total MSE is what matters, and the trade is favorable in a neighborhood of .
Proof Sketch
Differentiate the per-direction terms. Variance term: $\frac{d}{d\lambda}\,\frac{\sigma^2 s_i^2}{(s_i^2 + \lambda)^2} = \frac{-2\sigma^2 s_i^2}{(s_i^2 + \lambda)^3}$, which equals $-2\sigma^2/s_i^4$ at $\lambda = 0$. Bias term: $\frac{d}{d\lambda}\,\frac{\lambda^2 \theta_i^2}{(s_i^2 + \lambda)^2}$ vanishes at $\lambda = 0$ because of the leading $\lambda^2$. Sum over $i$: $\mathrm{MSE}'(0) = -2\sigma^2 \sum_i s_i^{-4} < 0$. So MSE strictly decreases for small positive $\lambda$.
Why It Matters
This is the cleanest theoretical justification for any kind of regularization. Gauss-Markov says OLS is best among linear unbiased estimators. The class of all linear estimators is strictly larger and contains estimators with strictly lower MSE. Ridge is not a hack and not a heuristic; its mean squared error is strictly smaller than the OLS value for a non-empty range of $\lambda$. The same conclusion underlies penalized GLMs, weight decay in deep nets, posterior contraction in Bayesian models, and the entire shrinkage program from James-Stein to modern hierarchical Bayes.
Failure Mode
The theorem guarantees existence, not a recipe. The MSE-optimal $\lambda$ depends on $\beta$ and $\sigma^2$, which are unknown. In practice, choose $\lambda$ by leave-one-out cross-validation (closed form for ridge), $K$-fold CV, generalized cross-validation, or marginal-likelihood maximization. Naively setting $\lambda$ by rule of thumb is the most common way to under- or over-regularize.
Effective Degrees of Freedom
Cross-validation answers "what $\lambda$"; effective degrees of freedom answers "how complex is the resulting fit." For a linear smoother $\hat y = H y$, the standard definition is $\mathrm{df} = \mathrm{tr}(H)$. Ridge is a linear smoother with hat matrix $H_\lambda = X(X^\top X + \lambda I)^{-1} X^\top$.
Effective Degrees of Freedom for Ridge
Statement
The effective degrees of freedom of ridge regression at penalty $\lambda$ is
$$\mathrm{df}(\lambda) = \mathrm{tr}(H_\lambda) = \sum_{i=1}^{r} \frac{s_i^2}{s_i^2 + \lambda}.$$
This is monotone decreasing in $\lambda$, with $\mathrm{df}(\lambda) \to r = \mathrm{rank}(X)$ as $\lambda \to 0$ (OLS uses $r$ degrees of freedom on the column space of $X$) and $\mathrm{df}(\lambda) \to 0$ as $\lambda \to \infty$ (a constant predictor uses none).
Intuition
Each shrinkage factor $s_i^2/(s_i^2 + \lambda)$ is the fraction of "freedom" the model retains in direction $i$. High-$s_i$ directions count almost as a full degree; directions with $s_i^2 \approx \lambda$ count as roughly half; directions with $s_i^2 \ll \lambda$ are essentially frozen and contribute nearly nothing to model complexity. The sum measures the model's effective dimension at the chosen penalty.
Proof Sketch
$H_\lambda = X(X^\top X + \lambda I)^{-1} X^\top = U\,\mathrm{diag}\!\big(s_i^2/(s_i^2 + \lambda)\big)\,U^\top$. The trace of a matrix in this form is the sum of the diagonal of the inner diagonal matrix, since $\mathrm{tr}(U D U^\top) = \mathrm{tr}(D\,U^\top U) = \mathrm{tr}(D)$.
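A small check, on an arbitrary synthetic design, that the trace of the hat matrix matches the sum of shrinkage factors:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 6))
lam = 1.5

H = X @ np.linalg.solve(X.T @ X + lam * np.eye(6), X.T)   # ridge hat matrix
s = np.linalg.svd(X, compute_uv=False)

df_trace = np.trace(H)
df_svd = np.sum(s**2 / (s**2 + lam))
assert np.isclose(df_trace, df_svd)
print(df_trace)   # strictly between 0 and rank(X) = 6
```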
Why It Matters
Effective DF makes ridge comparable to other model-selection scores. AIC, BIC, $C_p$, and Mallows-style criteria all need a model-complexity term, and $\mathrm{df}(\lambda)$ is the right substitute when the penalty is continuous rather than discrete. It also shows up in Stein's unbiased risk estimate (SURE) for ridge: with Gaussian noise, $\|y - \hat y_\lambda\|_2^2 - n\sigma^2 + 2\sigma^2\,\mathrm{df}(\lambda)$ is an unbiased risk estimate computable without resampling.
Failure Mode
The trace definition depends on the linear-smoother structure. For nonlinear regularizers (lasso, elastic net) the analogous quantity is more delicate; lasso's effective degrees of freedom equal the expected number of nonzero coefficients (Zou-Hastie-Tibshirani 2007), not the trace of a hat matrix.
Bayesian Interpretation: Ridge as MAP under a Gaussian Prior
Place a Gaussian prior on $\beta$ and a Gaussian likelihood on $y$:
$$\beta \sim \mathcal{N}(0, \tau^2 I_p), \qquad y \mid \beta \sim \mathcal{N}(X\beta, \sigma^2 I_n).$$
The posterior is Gaussian (Gaussian-Gaussian conjugacy):
$$\beta \mid y \sim \mathcal{N}\!\big((X^\top X + \lambda I)^{-1} X^\top y,\; \sigma^2 (X^\top X + \lambda I)^{-1}\big), \qquad \lambda = \sigma^2/\tau^2.$$
The posterior mean equals the MAP estimate, which equals $\hat\beta_\lambda$ with $\lambda = \sigma^2/\tau^2$. Three consequences:
- Direction. A strong prior (small $\tau^2$) maps to large $\lambda$ and heavy shrinkage. A weak prior (large $\tau^2$) maps to small $\lambda$ and near-OLS behavior. The relationship is reciprocal, not direct.
- Posterior covariance. Ridge gives a point estimate; the Bayesian view also gives uncertainty, with posterior covariance $\sigma^2 (X^\top X + \lambda I)^{-1}$. This is the natural starting point for credible intervals and predictive variance.
- Marginal likelihood. Maximizing $p(y \mid \tau^2, \sigma^2)$ over $\tau^2$ and $\sigma^2$ gives an empirical-Bayes choice of $\lambda$, an alternative to cross-validation that is fully principled when the Gaussian model is correct.
The connection generalizes. Any quadratic penalty $\beta^\top A \beta$ with $A \succ 0$ corresponds to a Gaussian prior with covariance proportional to $A^{-1}$. Penalizing only some coefficients corresponds to flat priors on the others. Group penalties, fused penalties, and Tikhonov penalties of the form $\|\Gamma\beta\|_2^2$ are all Gaussian priors in disguise.
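A sketch of the conjugate-posterior computation; the prior and noise variances are assumed known here, whereas in practice they would come from empirical Bayes or CV:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 5
sigma2, tau2 = 0.25, 4.0           # noise variance, prior variance
lam = sigma2 / tau2                 # induced ridge penalty

X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + np.sqrt(sigma2) * rng.standard_normal(n)

A = X.T @ X + lam * np.eye(p)
post_mean = np.linalg.solve(A, X.T @ y)   # equals the ridge estimate at lambda = sigma2/tau2
post_cov = sigma2 * np.linalg.inv(A)      # posterior covariance, source of credible intervals

# Predictive variance at a new point: noise plus parameter uncertainty.
x_new = rng.standard_normal(p)
pred_var = sigma2 + x_new @ post_cov @ x_new
print(post_mean, pred_var)
```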
Tikhonov Regularization and Inverse Problems
Ridge's older name in inverse-problem theory is Tikhonov regularization (Tikhonov 1963). The setup is solving $Ax = b$ where $A$ is ill-conditioned or non-invertible, by minimizing $\|Ax - b\|_2^2 + \lambda\|\Gamma x\|_2^2$ for some operator $\Gamma$. Setting $\Gamma = I$ recovers ridge, with the regression matrix $X$ playing the role of $A$ and the target $y$ playing the role of $b$.
The full Tikhonov form lets you penalize derivatives ($\Gamma$ a discrete differentiation operator), encouraging smooth reconstructions; differences from a known reference $x_0$ (penalty $\|\Gamma(x - x_0)\|_2^2$), encouraging proximity to that reference; or specific subspaces ($\Gamma$ a projector), penalizing spurious modes. All of these inherit ridge's SVD analysis after a change of variables. The connection makes ridge a special case of a much broader regularization framework used in tomography, deconvolution, geophysical inversion, and learning theory.
Kernel Ridge Regression
Replace the raw features with a feature map $\phi : \mathcal{X} \to \mathcal{H}$ into a reproducing kernel Hilbert space (RKHS) and use the kernel trick $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$. The ridge problem becomes
$$\hat f = \arg\min_{f \in \mathcal{H}} \; \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 + \lambda \|f\|_{\mathcal{H}}^2.$$
Representer Theorem for Kernel Ridge Regression
Statement
The minimizer of the kernel ridge objective over $f \in \mathcal{H}$ has the form
$$\hat f(x) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x), \qquad \alpha = (K + \lambda I)^{-1} y,$$
where $K \in \mathbb{R}^{n \times n}$ is the Gram matrix with $K_{ij} = k(x_i, x_j)$.
Intuition
Even when $\mathcal{H}$ is infinite-dimensional, the optimal predictor lives in the $n$-dimensional span of the kernel evaluations $k(x_i, \cdot)$. Ridge in an infinite-dimensional RKHS reduces to a finite linear system in $n$ variables. The kernel matrix $K$ plays the role of $XX^\top$, and the kernel trick lets you compute the entries without ever forming the feature map.
Proof Sketch
Decompose any $f \in \mathcal{H}$ as $f = f_\parallel + f_\perp$, where $f_\parallel \in \mathrm{span}\{k(x_i, \cdot)\}$ and $f_\perp$ is orthogonal to that span. The reproducing property gives $f(x_i) = \langle f, k(x_i, \cdot)\rangle_{\mathcal{H}} = f_\parallel(x_i)$, so the loss depends only on $f_\parallel$. The regularizer is $\|f\|_{\mathcal{H}}^2 = \|f_\parallel\|_{\mathcal{H}}^2 + \|f_\perp\|_{\mathcal{H}}^2$, strictly increasing in $\|f_\perp\|_{\mathcal{H}}$. So $f_\perp = 0$ at the optimum and $\hat f$ is a linear combination of the $k(x_i, \cdot)$. Plug in to get a finite system in the coefficients $\alpha$.
Why It Matters
The representer theorem turns infinite-dimensional regression into linear algebra on an $n \times n$ Gram matrix. It underlies kernel ridge regression, Gaussian process regression (which is its Bayesian twin), support vector regression, and the entire neural tangent kernel (NTK) connection: in the NTK regime, training a wide neural network with squared loss and small learning rate is equivalent to kernel ridge regression with the neural tangent kernel. The representer theorem is what makes that equivalence computable.
Failure Mode
Two failure modes. First, the $n \times n$ kernel matrix costs $O(n^3)$ to invert and $O(n^2)$ to store; kernel ridge does not scale to large $n$ without approximations (Nystrom, random features, conjugate gradient). Second, the representer theorem assumes the regularizer depends only on $\|f\|_{\mathcal{H}}$. For penalties like total variation or sparsity-inducing norms in function space, the optimal $f$ need not lie in the kernel span and the theorem does not apply.
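A minimal kernel ridge sketch with an RBF kernel; the kernel choice, bandwidth, and data are placeholder assumptions:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2), computed for all pairs of rows.
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(5)
n, lam = 80, 0.1
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

K = rbf_kernel(X, X)                             # n x n Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(n), y)  # representer-theorem coefficients

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = rbf_kernel(X_test, X) @ alpha           # f(x) = sum_i alpha_i k(x_i, x)
print(y_pred)
```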
Modern Frontier: Ridgeless OLS and Benign Overfitting
In the overparameterized regime ($p > n$), the OLS problem has infinitely many minimizers, and the minimum-$\ell_2$-norm solution is what gradient descent on the unregularized least-squares loss converges to from a zero initialization. This is "ridgeless OLS" and corresponds to the limit $\lambda \to 0^+$ in the ridge problem.
Classical theory predicts disaster: zero training error plus arbitrary norm should give arbitrarily bad generalization. The empirical surprise is that ridgeless OLS often generalizes well in the heavily overparameterized regime, sometimes better than at the interpolation threshold ($p \approx n$). This is benign overfitting (Bartlett, Long, Lugosi, Tsigler 2020; Hastie, Montanari, Rosset, Tibshirani 2022), and it gives a tractable linear-model analogue of the deep-learning generalization puzzle.
Ridgeless OLS Can Generalize in High Dimension
Statement
For a sequence of overparameterized linear regression problems with $p \gg n$ and population covariance $\Sigma$ whose effective rank grows fast enough, the ridgeless minimum-norm interpolator has population MSE that converges to a finite limit, and in some regimes that limit is smaller than the interpolation-threshold MSE at $p \approx n$.
The precise form (Bartlett-Long-Lugosi-Tsigler) bounds excess risk by a sum of two terms: a bias term controlled by how the true $\beta$ projects onto the spectrum of $\Sigma$, and a variance term scaling as $k^*/n + n/R_{k^*}(\Sigma)$, where $k^*$ is an effective-rank cutoff and $R_{k^*}$ measures the effective rank of the tail of the spectrum. Both terms can be small simultaneously when the spectrum decays sufficiently slowly.
Intuition
Two ideas. First, the minimum-$\ell_2$-norm interpolator is implicit ridge: gradient descent from zero on the unregularized loss, with sufficiently small step size, approaches the smallest-norm interpolating solution rather than an arbitrary one. The implicit bias of optimization fills in for the missing explicit penalty.
Second, in high dimension with mild spectral decay, almost all directions of $\Sigma$ have small variance and absorb the noise harmlessly while the few high-variance directions carry the signal. The "absorption" is benign because the noise-absorbing directions barely affect predictions on a fresh test point. Classical low-dimensional theory missed this because it assumed $p \ll n$, where the absorption mechanism does not exist.
Proof Sketch
Decompose excess risk as bias plus variance using the SVD of the design (or the eigendecomposition of $\Sigma$). The bias term is bounded by the projection of $\beta$ onto the span of the bottom eigendirections of the empirical Gram matrix; control it via concentration of the spectrum. The variance term factors as a sum over eigendirections of $\Sigma$, each contributing a ridge-like shrinkage weight in which the tail of the spectrum plays the role of the missing explicit penalty; the effective-rank conditions ensure the sum is small. Full details are in Bartlett-Long-Lugosi-Tsigler (PNAS 2020) and the companion analysis by Tsigler-Bartlett (2023).
Why It Matters
Ridgeless overfitting is the smallest model where modern overparameterized generalization can be analyzed in closed form. It explains why explicit regularization is sometimes unnecessary in deep learning (the implicit bias of SGD plus high effective dimension can suffice), it connects ridge to NTK theory (training a wide net is kernel ridge regression with the neural tangent kernel), and it reframes the bias-variance tradeoff in the regime where classical asymptotics fail.
The companion phenomenon is double descent: as $p/n$ crosses 1, test error often peaks at the interpolation threshold and then descends again as $p$ grows further (Belkin-Hsu-Ma-Mandal 2019; Nakkiran-Kaplun-Bansal-Yang-Barak-Sutskever 2020). Ridge with optimally chosen $\lambda$ smooths out the interpolation peak, recovering a monotone decrease in test error.
Failure Mode
The benign-overfitting result is fragile to spectral assumptions. With isotropic covariance $\Sigma = I$, ridgeless OLS is asymptotically as bad as classical theory predicts; benign behavior requires a long tail of small eigenvalues that absorb noise. It is also sensitive to the relationship between $\beta$ and the eigenstructure: if $\beta$ aligns with low-variance directions, the bias term dominates. Empirically, real datasets do tend to have decaying spectra, but the analysis is not a free lunch and explicit ridge often still outperforms ridgeless.
Choosing Lambda
Leave-One-Out Cross-Validation Closed Form for Ridge
For ridge regression, leave-one-out (LOO) cross-validation has a closed form, eliminating the need to refit $n$ times:
$$\mathrm{LOO}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat y_{\lambda, i}}{1 - (H_\lambda)_{ii}} \right)^{2},$$
where $(H_\lambda)_{ii}$ is the $i$-th diagonal of the ridge hat matrix $H_\lambda = X(X^\top X + \lambda I)^{-1} X^\top$. Generalized cross-validation (GCV) replaces the per-row hat diagonal by the average $\mathrm{tr}(H_\lambda)/n$, giving a rotation-invariant variant that is often more stable.
Standard practice: grid $\lambda$ on a log scale spanning many decades, compute $\mathrm{LOO}(\lambda)$ or GCV, pick the minimizer (or the "one-standard-error rule" minimizer for robustness against CV noise). Three alternatives: SURE (closed form for Gaussian noise), marginal-likelihood maximization (Bayesian), and AIC/BIC with effective DF.
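A sketch of the closed-form LOO and GCV scores over a log-spaced grid (synthetic data; the grid range is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 60, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

def loo_and_gcv(lam):
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - H @ y
    h = np.diag(H)
    loo = np.mean((resid / (1 - h))**2)                  # closed-form leave-one-out MSE
    gcv = np.mean((resid / (1 - np.trace(H) / n))**2)    # GCV: replace h_ii by its average
    return loo, gcv

lams = np.logspace(-4, 4, 60)
scores = np.array([loo_and_gcv(l) for l in lams])
print("best lambda (LOO):", lams[scores[:, 0].argmin()])
print("best lambda (GCV):", lams[scores[:, 1].argmin()])
```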
Canonical Examples
Orthonormal design: pure scalar shrinkage
When $X^\top X = I_p$ (orthonormal columns), every singular value equals 1 and ridge reduces to a scalar shrinkage of the OLS estimate:
$$\hat\beta_\lambda = \frac{1}{1 + \lambda}\,\hat\beta_{\mathrm{OLS}} = \frac{1}{1 + \lambda}\, X^\top y.$$
Each coefficient is multiplied by the same factor $1/(1 + \lambda)$. Compare to lasso, which under the same orthonormal design soft-thresholds each coefficient at $\lambda/2$ (with this page's scaling of the penalty), sending small ones to exactly zero. Ridge shrinks; lasso shrinks and selects.
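A tiny illustration of the two shrinkage rules under an orthonormal design; the coefficient values are arbitrary and the soft-threshold level $\lambda/2$ follows the scaling above:

```python
import numpy as np

beta_ols = np.array([3.0, 0.8, 0.1, -2.0])
lam = 1.0

ridge = beta_ols / (1 + lam)                                              # uniform shrinkage
lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2, 0.0)   # soft-thresholding

print(ridge)   # every coefficient shrunk, none exactly zero
print(lasso)   # small coefficients set exactly to zero
```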
Ridge fixes near-collinear features
Suppose two columns are nearly identical, $x_1 \approx x_2$. The OLS coefficients can be wildly large with opposite signs because the data does not separate the contributions of $x_1$ and $x_2$, only their sum. Adding $\lambda I$ to the Gram matrix shifts the small eigenvalues away from zero, making the inversion well-conditioned and producing moderate, similar-magnitude coefficients on $x_1$ and $x_2$. The ridge solution honestly reflects "the data cannot tell these apart," whereas the OLS solution amplifies design noise into spurious large coefficients.
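A sketch of the near-collinear case; the correlation level, noise scale, and penalty are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x1 = rng.standard_normal(n)
x2 = x1 + 1e-3 * rng.standard_normal(n)       # nearly identical column
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.standard_normal(n)    # truth depends only on the sum

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_ridge = np.linalg.solve(X.T @ X + 1.0 * np.eye(2), X.T @ y)

print("OLS:  ", beta_ols)    # often large, opposite-signed coefficients
print("ridge:", beta_ridge)  # moderate, similar-magnitude coefficients near (1, 1)
```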
Ridge on an underdetermined system
With $p > n$, OLS has infinitely many solutions; the normal equations are singular. Ridge has a unique solution for any $\lambda > 0$, and as $\lambda \to 0^+$ the ridge solution converges to the minimum-$\ell_2$-norm interpolator: $\hat\beta_{0^+} = X^\top (X X^\top)^{-1} y = X^{+} y$. This is the "ridgeless" limit and the starting point for the benign-overfitting story.
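A check that the small-$\lambda$ ridge solution approaches the pseudoinverse (minimum-norm) interpolator in an underdetermined synthetic problem:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 20, 50                       # p > n: underdetermined
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

beta_minnorm = np.linalg.pinv(X) @ y   # X^T (X X^T)^{-1} y for full row rank X

# Ridge with tiny lambda, via the identity (X^T X + lam I)^{-1} X^T = X^T (X X^T + lam I)^{-1}.
beta_ridge_small = X.T @ np.linalg.solve(X @ X.T + 1e-8 * np.eye(n), y)

print(np.linalg.norm(beta_ridge_small - beta_minnorm))  # tiny
print(np.allclose(X @ beta_minnorm, y))                 # interpolates the training data
```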
Connection to James-Stein and Empirical Bayes
Ridge and James-Stein are two faces of the same shrinkage idea applied to the Gaussian sequence model.
- James-Stein: estimate $\mu$ from $z \sim \mathcal{N}(\mu, \sigma^2 I_p)$ with known $\sigma^2$, dimension $p \ge 3$. The JS estimator shrinks $z$ toward zero by the data-driven factor $1 - (p-2)\sigma^2/\|z\|^2$ and dominates the MLE in total MSE.
- Ridge in the Gaussian sequence model: take an orthonormal design ($n = p$, $X^\top X = I$), so the design contributes nothing and the regression problem becomes the sequence model $z = X^\top y \sim \mathcal{N}(\beta, \sigma^2 I)$. Ridge with penalty $\lambda$ gives $\hat\beta_\lambda = z/(1 + \lambda)$, a deterministic shrinkage by $1/(1 + \lambda)$.
The two estimators differ in how the shrinkage factor is chosen: James-Stein adapts to the data via $\|z\|^2$; ridge fixes $\lambda$ in advance (or chooses it by cross-validation). Empirical-Bayes ridge, where $\lambda$ is set by maximizing the marginal likelihood, recovers an estimator very close to James-Stein in this setting and has the same MSE-dominance property in the right regime. See shrinkage estimation and James-Stein for the parallel theory.
Common Confusions
Ridge does not select variables
Ridge shrinks every coefficient toward zero but, except in the degenerate limit $\lambda \to \infty$, sets none of them exactly to zero. For variable selection use the lasso (L1 penalty), elastic net (L1 + L2), or hard-thresholding methods. Ridge is the right tool when you believe all features contribute something, when you want a stable estimator under collinearity, or when you want a closed form. Ridge is the wrong tool when interpretability requires a sparse model.
Lambda direction in the Bayesian view
The Bayesian equivalence is $\lambda = \sigma^2/\tau^2$. Large prior variance $\tau^2$ means a weak prior, which corresponds to small $\lambda$ and light shrinkage. Small $\tau^2$ means a strong prior, which corresponds to large $\lambda$ and heavy shrinkage. Students routinely flip the direction. The ratio $\sigma^2/\tau^2$ is the inverse signal-to-noise ratio of the prior and likelihood variances.
Gauss-Markov is not violated by ridge
Gauss-Markov asserts that OLS has minimum variance among linear unbiased estimators. Ridge is biased and falls outside the Gauss-Markov class. There is no contradiction in ridge having lower MSE than OLS. The theorem is sharp on the unbiased side; nothing in it prevents biased estimators from beating OLS on total error.
Standardize before applying ridge
The ridge penalty depends on the units of the features. A feature measured in millimeters versus meters carries a thousand-times-different penalty. Standard practice is to center each column to mean zero and scale to unit variance before fitting, then unscale the coefficients afterward if you need to report them in the original units. This is not optional; without standardization ridge effectively over-penalizes small-scale features and under-penalizes large-scale ones.
The intercept is usually not penalized
Most software fits ridge with an unpenalized intercept: minimize $\|y - \beta_0 \mathbf{1} - X\beta\|_2^2 + \lambda\|\beta\|_2^2$ over $(\beta_0, \beta)$. Penalizing the intercept would shrink the predicted mean toward zero, which is rarely desired and breaks translation invariance. The standard trick is to center $y$ and each column of $X$, fit ridge without an intercept on the centered data, and recover the intercept as $\hat\beta_0 = \bar y - \bar x^\top \hat\beta$.
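A sketch of the centering trick with an unpenalized intercept (synthetic data, arbitrary coefficients):

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, lam = 80, 4, 2.0
X = rng.standard_normal((n, p)) + 5.0          # deliberately uncentered features
y = 3.0 + X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.3 * rng.standard_normal(n)

x_bar, y_bar = X.mean(axis=0), y.mean()
Xc, yc = X - x_bar, y - y_bar                  # center; the intercept drops out of the fit

beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
intercept = y_bar - x_bar @ beta               # recover the unpenalized intercept

print(intercept, beta)
```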
Ridgeless OLS is not always disastrous
The classical worry, that overparameterized OLS interpolates noise and generalizes badly, is correct in low effective dimension and isotropic settings. But in high dimension with realistic spectral decay, the minimum-norm interpolator absorbs noise into low-variance directions that barely affect predictions; this is benign overfitting (Bartlett-Long-Lugosi-Tsigler 2020). Whether ridge with $\lambda > 0$ improves on the ridgeless limit $\lambda \to 0^+$ depends on the spectrum and the SNR; cross-validate rather than assuming.
Summary
- Ridge objective: $\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2$, closed form $\hat\beta_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$.
- SVD geometry: each direction $i$ is shrunk by the factor $s_i^2/(s_i^2 + \lambda)$. Effective DF is the sum of these factors.
- MSE dominance over OLS: there exists $\lambda > 0$ with strictly smaller MSE; finding the optimum requires CV.
- Bayesian view: ridge equals MAP under a Gaussian prior with $\lambda = \sigma^2/\tau^2$.
- Tikhonov regularization: ridge is the simplest case ($\Gamma = I$) of the $\|\Gamma\beta\|_2^2$ penalties used throughout inverse-problem theory.
- Kernel ridge: the representer theorem reduces RKHS regression to an $n \times n$ linear system in $\alpha$.
- Modern: ridgeless OLS plus the implicit bias of GD gives benign overfitting under spectral decay; double descent is the failure mode at the interpolation threshold and ridge with optimal $\lambda$ smooths it out.
- Always standardize features. Do not penalize the intercept. Do not expect variable selection.
Exercises
Problem
Show that for any $\lambda > 0$, the matrix $X^\top X + \lambda I$ is positive definite, even when $X$ is rank-deficient. Conclude that the ridge solution is uniquely defined for every $\lambda > 0$ and every design $X$.
Problem
Derive the bias of the ridge estimator and show that
$$\mathbb{E}[\hat\beta_\lambda] - \beta = -\lambda\,(X^\top X + \lambda I)^{-1}\beta.$$
In particular, the bias vanishes when $\lambda = 0$ (and only then for nontrivial $\beta$).
Problem
Prove the local-improvement statement carefully: if $\sigma^2 > 0$ is finite and $X$ has full column rank, then $\frac{d}{d\lambda}\,\mathrm{MSE}(\lambda)\big|_{\lambda = 0^+} < 0$. Conclude that there exists $\lambda > 0$ with $\mathrm{MSE}(\lambda) < \mathrm{MSE}(0)$.
Problem
Derive Stein's unbiased risk estimate for ridge under Gaussian noise: assume $y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ and known $\sigma^2$. Show that
$$\widehat{R}(\lambda) = \|y - \hat y_\lambda\|_2^2 - n\sigma^2 + 2\sigma^2\,\mathrm{df}(\lambda)$$
is an unbiased estimate of the in-sample prediction risk $\mathbb{E}\,\|\hat y_\lambda - X\beta\|_2^2$.
Problem
Benign overfitting in the isotropic case. Consider the linear model $y = X\beta + \varepsilon$ with rows $x_i \sim \mathcal{N}(0, I_p)$, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$, and $p > n$. Let $\hat\beta_{0^+} = X^{+} y$ be the minimum-$\ell_2$-norm interpolator. In the limit $n, p \to \infty$ with $p/n \to \gamma > 1$, derive the asymptotic out-of-sample prediction risk $\mathbb{E}\big[(x_{\mathrm{new}}^\top \hat\beta_{0^+} - x_{\mathrm{new}}^\top \beta)^2\big]$. Show that with isotropic features and a generic $\beta$, ridgeless interpolation does not generalize: the variance term diverges as $\gamma \to 1^+$. Then explain qualitatively which assumption must change to recover benign overfitting.
References
Canonical:
- Hoerl, A. E., and Kennard, R. W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12(1), 55-67. The original ridge paper.
- Tikhonov, A. N. (1963). Solution of incorrectly formulated problems and the regularization method. Soviet Mathematics Doklady 4, 1035-1038. The Tikhonov regularization framework.
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapter 3.4 (ridge), Chapter 5.8 (effective DF), Chapter 18.3 (high-dimensional regression).
- Wahba, G. (1990). Spline Models for Observational Data. SIAM. Reproducing-kernel and smoothing-spline foundations of the representer theorem.
Current:
- Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. Chapter 7 (regularized linear regression), Chapter 13 (kernel ridge).
- Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 11.3 (ridge), Chapter 17 (kernel methods).
- Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2020). Benign overfitting in linear regression. PNAS 117(48), 30063-30070.
- Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2022). Surprises in high-dimensional ridgeless least squares interpolation. Annals of Statistics 50(2), 949-986.
- Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off. PNAS 116(32), 15849-15854. The double-descent paper.
- Tsigler, A., and Bartlett, P. L. (2023). Benign overfitting in ridge regression. JMLR 24(123), 1-76.
- Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. (2021). Deep double descent: where bigger models and more data hurt. JSTAT 124003. Empirical double-descent across architectures.
Critique and Frontier:
- Engl, H. W., Hanke, M., and Neubauer, A. (1996). Regularization of Inverse Problems. Kluwer. The ill-posed-problem perspective on Tikhonov regularization, with convergence rates and parameter-choice rules.
- Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. JRSS-B 58, 267-288. The lasso paper, framing what ridge cannot do (variable selection).
- Zou, H., Hastie, T., and Tibshirani, R. (2007). On the degrees of freedom of the lasso. Annals of Statistics 35(5), 2173-2192. Effective DF beyond the linear-smoother case.
- Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. NeurIPS. The NTK / kernel-ridge connection.
- Schoelkopf, B., Herbrich, R., and Smola, A. J. (2001). A generalized representer theorem. COLT. Modern statement and proof.
Next Topics
Building on ridge regression:
- Lasso regression: L1 penalty for sparse variable selection.
- Elastic net: combining L1 and L2 penalties for grouped variable selection.
- Bias-variance tradeoff: the general principle behind the MSE-dominance result.
- Kernels and RKHS: how the representer theorem extends ridge to RKHS.
- Neural tangent kernel: training a wide net is kernel ridge regression with the NTK.
- Double descent and benign overfitting: the modern overparameterized story.
- Shrinkage and James-Stein: the general theory of admissibility-violating estimators that ridge instantiates in the regression setting.
Last reviewed: April 19, 2026