Modern Generalization
Neural Tangent Kernel: Lazy Training, Kernel Equivalence, μP, and the Limits of Width
In the infinite-width NTK parameterization, training a neural network with gradient descent is mathematically equivalent to kernel regression. The same limit suppresses feature learning, which is why μP and mean-field parameterizations exist. NTK is the precise boundary between the kernel and feature-learning regimes.
Why This Matters
The Neural Tangent Kernel is one of the most cleanly stated theoretical results in modern deep learning. It shows that in a specific limit (infinite width with NTK parameterization), training a neural network with gradient descent is mathematically equivalent to kernel regression with a particular kernel. This gave the field a rigorous connection between deep networks and classical kernel theory, with explicit convergence rates and generalization bounds.
The NTK result is also, in large part, a negative result about its own regime. The lazy-training limit it describes is precisely the regime where networks do not learn features: the representation stays at initialization. Real finite-width networks operate outside the NTK regime when feature learning matters, which is why mean-field and maximal-update (μP) parameterizations are needed to describe practical training. NTK is best understood as a clean boundary case, not an explanation of why finite networks generalize.
The picture matters for three concrete reasons. First, kernel-ridge equivalence (proven below) imports the entire theory of ridge regression into deep learning: the infinite-width network with weight decay behaves like RKHS-regularized regression, with explicit bias-variance and double-descent predictions. Second, the NTK spectrum controls both convergence speed and generalization: small eigenvalues mean slow learning, and the eigenvalue decay rate determines the test-error rate. Third, the μP parameterization (Yang-Hu Tensor Programs IV/V) lets practitioners tune hyperparameters at small width and transfer them losslessly to large width, an industrially important consequence of the NTK / mean-field dichotomy.
Mental Model
Consider a neural network $f(x;\theta)$ with parameters $\theta \in \mathbb{R}^P$. At initialization $\theta_0$, linearize the network around $\theta_0$:

$$f(x;\theta) \approx f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^\top (\theta - \theta_0)$$

This is just a first-order Taylor expansion. The key insight of NTK theory: when the network is sufficiently wide, the parameters barely move during training (relative to their scale), so this linearization is accurate throughout training. The linearized model is a kernel method with kernel determined by the features $\nabla_\theta f(\cdot;\theta_0)$.
Three parameterizations give three qualitatively different infinite-width limits, and the choice is not cosmetic.
| Parameterization | Per-weight scale | Output scale | Limit |
|---|---|---|---|
| Standard (PyTorch default) | $W_{ij} \sim \mathcal{N}(0, 1/\text{fan-in})$ | $O(1)$ | Output diverges under a width-independent learning rate; not a clean limit |
| NTK | $W_{ij} \sim \mathcal{N}(0,1)$ with explicit $1/\sqrt{n}$ layer factors | $O(1)$ at init, output moves $O(1)$ | Lazy: kernel regression with $\Theta_\infty$ |
| Mean-field / μP | $O(1)$ entries with $1/n$ output scaling | $O(1/\sqrt{n})$ at init, moves $O(1)$ | Rich: features evolve, hyperparameters transfer across widths |
Under NTK parameterization the per-coordinate weight movement is $O(1/\sqrt{n})$, so each feature stays near its initialization. Under mean-field scaling the per-coordinate movement is $\Theta(1)$, so features actively evolve. Yang's Tensor Programs IV classified all parameterizations of feedforward networks and showed μP is the unique parameterization in which every layer, including the embedding layer and the output layer, is in the maximal feature-learning limit.
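The lazy-vs-rich distinction is already visible after a single gradient step. The sketch below is a hypothetical two-layer ReLU net (not from the references): "ntk" uses $1/\sqrt{m}$ output scaling with an $O(1)$ learning rate, "mf" uses $1/m$ output scaling with a learning rate proportional to $m$, and we compare the mean per-weight movement across widths:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_step_movement(m, param):
    """Mean per-weight movement |delta W_ij| after one gradient step on a
    single sample, for a two-layer ReLU net of width m."""
    d = 10
    x = rng.standard_normal(d) / np.sqrt(d)      # roughly unit-norm input
    y = 1.0
    W = rng.standard_normal((m, d))              # O(1) weight entries
    a = rng.standard_normal(m)
    pre = W @ x
    act = np.maximum(pre, 0.0)
    if param == "ntk":   # 1/sqrt(m) output scaling, lr = O(1)
        f, lr, scale = a @ act / np.sqrt(m), 1.0, 1.0 / np.sqrt(m)
    else:                # "mf": 1/m output scaling, lr proportional to m
        f, lr, scale = a @ act / m, float(m), 1.0 / m
    dW = lr * (f - y) * scale * np.outer(a * (pre > 0), x)
    return np.abs(dW).mean()

def avg(m, param, trials=30):
    return np.mean([one_step_movement(m, param) for _ in range(trials)])

for m in (100, 10000):
    print(f"width {m}: ntk {avg(m, 'ntk'):.4f}  mean-field {avg(m, 'mf'):.4f}")
# NTK movement shrinks like 1/sqrt(m); mean-field movement is width-independent
```

Under "ntk" the movement shrinks as width grows (lazy), while under "mf" it stays at the same scale (rich), which is the whole content of the table above in four lines of arithmetic.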
The Neural Tangent Kernel
Neural Tangent Kernel
For a neural network $f(x;\theta)$ with parameters $\theta \in \mathbb{R}^P$, the Neural Tangent Kernel is:

$$\Theta_\theta(x, x') = \big\langle \nabla_\theta f(x;\theta),\ \nabla_\theta f(x';\theta) \big\rangle = \sum_{p=1}^{P} \frac{\partial f(x;\theta)}{\partial \theta_p}\,\frac{\partial f(x';\theta)}{\partial \theta_p}$$

This is the inner product of the gradients of the network output with respect to all parameters, evaluated at inputs $x$ and $x'$.

At initialization $\theta_0$, this defines $\Theta_0(x, x') = \Theta_{\theta_0}(x, x')$. The kernel depends on the architecture (depth, width, activation function) and the initialization distribution.
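The definition is directly computable at finite width. A minimal numpy sketch (a hypothetical two-layer ReLU net in NTK parameterization, with the gradients written analytically) builds the NTK Gram matrix on a handful of points; it is symmetric and positive semi-definite by construction:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 5, 512
W = rng.standard_normal((m, d))   # hidden weights, O(1) entries
a = rng.standard_normal(m)        # readout weights

def grad_f(x):
    """Gradient of f(x) = a^T relu(Wx) / sqrt(m) w.r.t. all parameters (a, W)."""
    pre = W @ x
    act = np.maximum(pre, 0.0)
    g_a = act / np.sqrt(m)                          # df/da_j
    g_W = np.outer(a * (pre > 0), x) / np.sqrt(m)   # df/dW_jk
    return np.concatenate([g_a, g_W.ravel()])

X = rng.standard_normal((8, d))
G = np.stack([grad_f(x) for x in X])   # rows: gradient feature vectors
Theta = G @ G.T                        # empirical NTK Gram matrix
eigs = np.linalg.eigvalsh(Theta)
print(eigs.min())                      # >= 0 up to floating-point error
```

Because the Gram matrix is an outer product of gradient features, positive semi-definiteness is automatic; this is the sense in which the tangent features define a kernel.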
Lazy Training Regime
A neural network is in the lazy training regime if and only if the parameters stay close to their initialization throughout training:

$$\sup_{t \ge 0}\ \|\theta(t) - \theta_0\| \ \ll\ \|\theta_0\|$$

In this regime, the NTK remains approximately constant, and the network dynamics are well-approximated by the linearized model. The name "lazy" reflects that the features (the gradients $\nabla_\theta f(\cdot;\theta)$) do not change; only the linear combination of features is learned.
NTK Parameterization
A two-layer network of width $m$ has NTK parameterization if and only if it is written as

$$f(x; a, W) = \frac{1}{\sqrt{m}} \sum_{j=1}^{m} a_j\, \sigma(w_j^\top x)$$

with $a_j \sim \mathcal{N}(0,1)$ and $w_j \sim \mathcal{N}(0, I_d)$ at initialization. The crucial features are the per-layer $1/\sqrt{\text{fan-in}}$ scaling (so each pre-activation has $O(1)$ variance) and the explicit $1/\sqrt{m}$ output normalization (so the output has $O(1)$ variance). For deeper networks the same scaling is applied layer by layer.
The standard PyTorch default initialization uses different scaling: the $1/\sqrt{\text{fan-in}}$ factor is folded into the weights themselves ($W_{ij} \sim \mathcal{N}(0, 1/\text{fan-in})$, up to a gain constant). This standard parameterization gives the same forward computation at initialization but a different per-weight learning-rate sensitivity, which produces a different infinite-width limit (the output moves divergently unless the learning rate is rescaled with width). Yang and Hu (2021, Tensor Programs IV) classified all per-layer scaling choices into three classes: NTK (lazy), mean-field (feature-learning at the readout only), and μP (feature-learning at every layer including the embedding and readout).
Core Theoretical Results
NTK Convergence at Infinite Width
Statement
Consider a fully connected network with $L$ hidden layers of width $n$, with NTK parameterization. As $n \to \infty$:
- The random kernel $\Theta_0$ converges in probability to a deterministic kernel $\Theta_\infty$
- The limiting kernel depends only on the architecture (depth, activation function) and is independent of the random initialization
The limiting kernel $\Theta_\infty$ can be computed recursively layer by layer.
Intuition
At infinite width, the law of large numbers kicks in: each layer computes a sum of many independent random terms, which concentrates around its expectation. The randomness of initialization washes out, leaving a deterministic kernel that depends only on the architecture.
Proof Sketch
Proceed by induction on depth. At each layer, the pre-activations are sums of independent terms (one per neuron in the previous layer). By the CLT, these converge to a Gaussian process as $n \to \infty$. The kernel of this GP at layer $l$ is determined recursively by the kernel at layer $l-1$ and the activation function, via the formula

$$\Sigma^{(l)}(x, x') = \mathbb{E}_{(u,v) \sim \mathcal{N}(0,\, \Lambda^{(l-1)})}\big[\sigma(u)\,\sigma(v)\big]$$

where the covariance $\Lambda^{(l-1)}$ of $(u, v)$ is determined by $\Sigma^{(l-1)}$ evaluated at $(x,x)$, $(x,x')$, and $(x',x')$.
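The concentration step is easy to check numerically: averaging $\sigma(w^\top x)\,\sigma(w^\top x')$ over many random neurons converges to the Cho-Saul closed form for the ReLU expectation. A sketch, assuming $w \sim \mathcal{N}(0, I)$, ReLU activation, and unit-norm inputs:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
x = np.array([1.0, 0.0, 0.0])
xp = np.array([0.6, 0.8, 0.0])   # unit vectors with cosine u = 0.6

def arccos_k1(u):
    """Cho-Saul closed form for E[relu(w.x) relu(w.x')] on unit inputs."""
    theta = np.arccos(np.clip(u, -1, 1))
    return (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

# Law of large numbers: the empirical average over many random "neurons"
# concentrates on the deterministic expectation.
w = rng.standard_normal((500_000, d))
mc = np.mean(np.maximum(w @ x, 0) * np.maximum(w @ xp, 0))
print(mc, arccos_k1(0.6))   # the two values agree to a couple of decimals
```

This is exactly the "randomness washes out" statement: the layer kernel is a deterministic function of the previous layer's kernel, not of the particular draw of weights.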
Why It Matters
This shows that infinitely wide neural networks at initialization are Gaussian processes with a specific kernel. Combined with the constancy result below, this means training such networks is equivalent to kernel regression, a fully solved problem.
Failure Mode
The finite-width NTK deviates from the infinite-width limit at rate $O(1/\sqrt{n})$ (Arora et al. 2019, arXiv:1904.11955, Cor. 6.2; Lee et al. 2019, Thm. 2.2). Quantitatively, for standard fully connected architectures, the relative error is on the order of a few percent (2 to 5 percent in Arora et al. 2019, Table 1) at widths in the thousands and exceeds 10 percent at widths in the hundreds. For practically sized networks, the infinite-width approximation has non-negligible error.
NTK Stays Constant During Training
Statement
For a network of width $n$ trained by gradient flow on the squared loss, as $n \to \infty$:
- The NTK stays equal to its initial value throughout training: $\Theta_t = \Theta_0 = \Theta_\infty$ for all $t \ge 0$
- The training loss converges to zero exponentially: $L(t) \le e^{-2\lambda_{\min} t}\, L(0)$, where $\lambda_{\min}$ is the smallest eigenvalue of the Gram matrix of $\Theta_\infty$ on the training data
- The trained network is equivalent to kernel regression with kernel $\Theta_\infty$
Intuition
When the width is enormous, each parameter contributes a tiny amount to the output. Training changes each parameter by a tiny amount. The gradient features barely change, so the kernel stays constant. The dynamics become linear in the function space, and the solution is exactly kernel regression.
Proof Sketch
Under NTK parameterization, each individual weight is $O(1)$ at initialization (with the output scaled by $1/\sqrt{n}$), giving $\|\theta_0\| = \Theta(\sqrt{P})$ for $P$ parameters. During gradient flow, each parameter moves by $O(1/\sqrt{n})$ per coordinate, giving an aggregate displacement that is $O(1)$ in Frobenius norm but $O(1/\sqrt{n})$ relative to $\|\theta_0\|$, so the relative movement vanishes as $n \to \infty$ (Chizat, Oyallon, Bach 2019).

The change in the NTK is second-order in the parameter displacement: $\|\Theta_t - \Theta_0\| \lesssim \|\nabla^2_\theta f\| \cdot \|\theta_t - \theta_0\|$, where $\nabla^2_\theta f$ is the Hessian of $f$ in parameter space. Under NTK parameterization, $\|\nabla^2_\theta f\| = O(1/\sqrt{n})$ and $\|\theta_t - \theta_0\| = O(1)$, so the product is $O(1/\sqrt{n})$ (Lee et al. 2019, "Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent", arXiv:1902.06720).

With a constant kernel, the training dynamics in function space are linear: $\dot f_t = -\Theta_0 (f_t - y)$ on the training set, which converges exponentially at rate $\lambda_{\min}(\Theta_0)$.
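The linear function-space dynamics can be simulated directly. A small sketch, using a stand-in PSD Gram matrix in place of an actual NTK, Euler-integrates $\dot f_t = -\Theta_0 (f_t - y)$ and checks the $e^{-\lambda_{\min} t}$ decay bound:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n))
Theta = A @ A.T + 0.5 * np.eye(n)   # stand-in PSD kernel Gram matrix
y = rng.standard_normal(n)
f = np.zeros(n)                      # network outputs at t = 0 (centered init)

dt, steps = 1e-3, 20000
for _ in range(steps):               # Euler-integrate df/dt = -Theta (f - y)
    f -= dt * Theta @ (f - y)

t = steps * dt
lam_min = np.linalg.eigvalsh(Theta).min()
resid = np.linalg.norm(f - y)
bound = np.exp(-lam_min * t) * np.linalg.norm(y)
print(resid, bound)                  # residual decays at least as fast as the bound
```

Each eigenmode of the residual decays as $(1 - \mathrm{d}t\,\lambda_i)^{\text{steps}} \le e^{-\lambda_i t}$, so the slowest mode, with eigenvalue $\lambda_{\min}$, sets the convergence rate, exactly as the statement above claims.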
Why It Matters
This is the central result of NTK theory. It says: infinitely wide networks trained with gradient descent are exactly kernel methods. This immediately imports decades of kernel theory: RKHS norm bounds, generalization guarantees, spectral analysis of convergence rates.
Failure Mode
The constancy of the NTK is precisely the lazy training limitation. If the kernel does not change, the network does not learn new features. It only learns a linear combination of the initial random features. This is why NTK theory cannot explain the success of feature learning in practice.
Equivalence to Kernel Ridge Regression
The constancy theorem says infinite-width gradient flow on the squared loss solves the noiseless kernel-regression problem with kernel $\Theta_\infty$. Adding weight decay (or, equivalently, early stopping) makes the equivalence even sharper: the trained network solves a kernel ridge regression problem.
NTK as Kernel Ridge Regression
Statement
Let $f_\lambda$ be the infinite-width network trained with $\ell_2$ regularization centered at the initialization, that is, the penalized loss

$$L(\theta) = \sum_{i=1}^{n} \big(f(x_i;\theta) - y_i\big)^2 + \lambda\, \|\theta - \theta_0\|^2.$$

At convergence,

$$f_\lambda(x) = \Theta_\infty(x, X)\, \big(\Theta_\infty(X, X) + \lambda I\big)^{-1} y,$$

which is exactly kernel ridge regression with kernel $\Theta_\infty$ and ridge parameter $\lambda$. As $\lambda \to 0$ this reduces to the minimum-RKHS-norm interpolant of the training data; as $\lambda \to \infty$ it shrinks to zero.
Intuition
With a constant kernel and a quadratic regularizer in parameter space, the optimization in function space is a ridge regression in the RKHS defined by $\Theta_\infty$. Weight decay on $\theta - \theta_0$ becomes an RKHS-norm penalty on $f$, and the infinite-width network delivers the minimizer in closed form. Without explicit regularization, gradient flow itself imposes an implicit bias toward the minimum-RKHS-norm interpolant.
Proof Sketch
Linearize around $\theta_0$: $f(x;\theta) \approx f(x;\theta_0) + \phi(x)^\top(\theta - \theta_0)$. Define features $\phi(x) = \nabla_\theta f(x;\theta_0) \in \mathbb{R}^P$ where $P$ is the parameter count, so the linearized network is $f_{\text{lin}}(x) = \phi(x)^\top \omega$ with $\omega = \theta - \theta_0$. The penalized squared loss is solved by $\omega^* = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y$. Apply the kernel trick (push-through identity $(\Phi^\top \Phi + \lambda I)^{-1}\Phi^\top = \Phi^\top(\Phi \Phi^\top + \lambda I)^{-1}$) to rewrite the predictor at a new $x$ as $f(x) = \Theta_0(x, X)(\Theta_0(X, X) + \lambda I)^{-1} y$. At infinite width $\Theta_0 \to \Theta_\infty$, and by the centered initialization the $f(\cdot;\theta_0)$ offset vanishes (Lee et al. 2019, Sec. 2.4; Hu, Lu, Bauer 2021, Thm. 1).
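The push-through step is easy to verify numerically with stand-in features (a random matrix $\Phi$ in place of actual tangent features): the parameter-space ridge solution and the kernel-space ridge solution give the same prediction.

```python
import numpy as np

rng = np.random.default_rng(4)
n, P = 20, 300                      # n samples, P "parameters" (features)
Phi = rng.standard_normal((n, P))   # rows phi(x_i)^T; stand-in tangent features
y = rng.standard_normal(n)
lam = 0.1

# Parameter-space ridge: omega* = (Phi^T Phi + lam I)^(-1) Phi^T y   (P x P solve)
omega = np.linalg.solve(Phi.T @ Phi + lam * np.eye(P), Phi.T @ y)

# Kernel-space ridge via push-through: alpha = (K + lam I)^(-1) y    (n x n solve)
K = Phi @ Phi.T
alpha = np.linalg.solve(K + lam * np.eye(n), y)

phi_new = rng.standard_normal(P)    # features of a test point
pred_param = phi_new @ omega
pred_kernel = (Phi @ phi_new) @ alpha   # k(x)^T alpha with k(x) = Phi phi(x)
print(pred_param, pred_kernel)          # identical up to floating-point error
```

The kernel form needs only an $n \times n$ solve, which is why the NTK (an inner product of tangent features) suffices: the $P$-dimensional parameter space never has to be touched explicitly.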
Why It Matters
This is the bridge that imports the entire ridge-regression theory into deep learning. The bias-variance tradeoff, the effective degrees of freedom $\mathrm{df}(\lambda) = \mathrm{tr}\big[K(K + \lambda I)^{-1}\big]$, the SURE estimator for the risk, and the cross-validation theory for selecting $\lambda$ all transfer directly. It also explains why early stopping and weight decay behave so similarly in practice: under the lazy linearization, both move the solution along the same RKHS-norm-vs-fit tradeoff curve.
Failure Mode
The equivalence holds only in the lazy linearization. Real networks at finite width and finite training time deviate (Fort et al. 2020 measure the deviation via empirical NTK drift). For ReLU networks the NTK has a fast-decaying spectrum that makes the ridgeless interpolant unstable on high-frequency target functions; this is the spectral side of double descent. Under standard parameterization (not NTK parameterization), the ridge equivalence does not hold even at infinite width.
NTK Spectrum and Generalization
Once the equivalence to kernel ridge is established, the entire generalization theory of NTK reduces to spectral analysis of $\Theta_\infty$. The key object is the eigenvalue decay rate of the integral operator of $\Theta_\infty$ on the data distribution $\rho$.
Generalization Rate from NTK Eigenvalue Decay (Bordelon-Pehlevan)
Statement
For a target function $f^*$ in the RKHS, the excess test error of the ridgeless NTK interpolant on $n$ samples scales as

$$\mathbb{E}\big[\text{excess risk}\big] \asymp n^{-\alpha/(\alpha+1)}, \qquad \lambda_k(\Theta_\infty) \asymp k^{-\alpha},$$

up to logarithmic factors. The rate is determined entirely by the spectral decay exponent $\alpha$ of $\Theta_\infty$, not by the network architecture beyond what affects $\alpha$. For the two-layer ReLU NTK on the sphere, $\alpha = d/(d-1)$ where $d$ is the input dimension, giving a rate of roughly $n^{-d/(2d-1)}$.
Intuition
Kernel methods can only fit modes of the target along eigendirections of $\Theta_\infty$. Modes along eigendirections with small eigenvalues require proportionally more samples to learn. Polynomial decay of eigenvalues means high-frequency components of the target are increasingly costly to estimate, and the test error is dominated by the lowest-eigenvalue mode that still carries meaningful signal at sample size $n$.
Proof Sketch
Use the bias-variance decomposition for kernel ridge with optimally tuned $\lambda$. The bias term is $O(\lambda \|f^*\|_{\mathcal{H}}^2)$ and the variance term is $O(\mathrm{df}(\lambda)/n)$, with $\mathrm{df}(\lambda) \asymp \lambda^{-1/\alpha}$ under the polynomial decay. Substituting the polynomial eigenvalue decay and balancing the two terms gives the optimal $\lambda^* \asymp n^{-\alpha/(\alpha+1)}$ and the resulting risk rate $n^{-\alpha/(\alpha+1)}$ (Caponnetto, De Vito 2007 for the classical kernel result; Bordelon, Canatar, Pehlevan 2020 for the NTK-spectrum specialization).
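The balancing argument can be checked numerically under the idealized power-law spectrum $\lambda_k = k^{-\alpha}$. This is a proxy risk (bias term $\lambda$ plus variance term $\mathrm{df}(\lambda)/n$), not a real network; the point is only that minimizing it over $\lambda$ reproduces the $n^{-\alpha/(\alpha+1)}$ rate:

```python
import numpy as np

def optimal_risk(n, alpha, n_modes=100_000):
    """Bias-variance proxy for kernel ridge with lambda_k = k^(-alpha):
    risk(lam) ~ lam * ||f||_H^2 + df(lam)/n, minimized over a lam grid."""
    k = np.arange(1, n_modes + 1)
    lams_k = k ** (-alpha)
    grid = np.logspace(-6, 0, 120)
    risks = [lam + (lams_k / (lams_k + lam)).sum() / n for lam in grid]
    return min(risks)

alpha = 2.0
r1, r2 = optimal_risk(1000, alpha), optimal_risk(8000, alpha)
rate = np.log(r1 / r2) / np.log(8)   # empirical exponent between the two n
print(rate)                          # close to alpha/(alpha+1) = 2/3
```

With $\alpha = 2$ the fitted exponent comes out near $2/3$, matching the theoretical $\alpha/(\alpha+1)$; changing `alpha` moves the exponent accordingly.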
Why It Matters
This is one of the few NTK-side predictions that matches measurable behavior of real networks: tasks that are well-aligned with low-frequency NTK eigenfunctions (smooth targets, image classification on natural images) generalize well, while tasks dominated by high-frequency modes (adversarial examples, parity functions) require sample sizes that scale exponentially. The "spectral bias" toward low-frequency learning observed empirically by Rahaman et al. 2019 is exactly this eigenvalue ordering.
Failure Mode
Polynomial decay is an idealization; for some architectures the spectrum has heavy-tailed or non-polynomial behavior that breaks the clean rate. Real targets are not in the RKHS in general; misspecification adds an -dependent floor to the test error. Most of all, this entire analysis assumes the lazy regime, so it cannot describe the rate improvement that comes from feature learning in finite-width networks (Cui, Loureiro, Krzakala 2021).
μP and Hyperparameter Transfer
The Tensor Programs framework (Yang 2019-2022) is a calculus for infinite-width limits that subsumes NTK, mean-field, and every intermediate parameterization. Its most operational consequence is the maximal-update parameterization (μP), which gives a width-stable training recipe.
μP Zero-Shot Hyperparameter Transfer (Yang et al.)
Statement
Under μP, the optimal learning rate $\eta^*$, optimizer hyperparameters, and weight-decay schedule for a target loss are independent of width. A practitioner can sweep hyperparameters on a small width (say 256) and transfer them losslessly to a much larger width (say 8192) without re-tuning, with optimal-loss agreement to within a few percent across widths spanning two orders of magnitude.
Intuition
Per-layer learning rates in μP are chosen so that every layer's features change by $\Theta(1)$ per update, regardless of width. Because every layer is in the same dynamical regime across widths, the loss landscape, when reparameterized in the natural μP coordinates, is width-invariant. NTK parameterization fails this property: under NTK scaling the per-layer feature movement shrinks with width, so the effective dynamics change with width and the hyperparameter sweep must be redone for every architecture size.
Proof Sketch
The Tensor Programs IV calculus assigns each weight matrix a stability exponent in the infinite-width limit. μP is the unique assignment such that (i) every layer has $\Theta(1)$ feature updates and (ii) the output of the network has $O(1)$ variance throughout training. Yang and Hu (2021) prove these two conditions force the per-layer learning-rate scaling. The hyperparameter-transfer theorem then follows because the infinite-width training dynamics are identical across widths, and the finite-width discrepancy between two widths vanishes as both grow.
Why It Matters
μP transfer is one of the few NTK-adjacent results that is industrially deployed. GPT-4-class training runs use μP-style scaling so that hyperparameter sweeps at 100M parameters transfer to 100B+ parameters without re-tuning. The cost saving is the difference between a tractable sweep at small scale and an infeasible sweep at frontier scale.
Failure Mode
μP transfer has been verified for fully connected nets and standard transformers, but breaks for some architectural choices (in particular, non-standard normalization layers that interact non-trivially with the per-layer scaling). Architectural changes such as MoE routing or sparse attention require their own width-stability analysis. The transfer is also loss-landscape-dependent: extremely sharp local minima can break the infinite-width approximation and require small-batch dynamics correction.
Connection to NNGP and Bayesian Deep Learning
The NTK and the NNGP are two infinite-width objects from the same underlying Gaussian-process structure. The distinction matters: the NNGP governs the prior over functions induced by initialization, while the NTK governs the training dynamics. Bayesian inference uses the NNGP; gradient-descent training uses the NTK.
| Object | What it describes | Reduces to |
|---|---|---|
| NNGP | Distribution of $f(\cdot;\theta_0)$ at random init: $f \sim \mathcal{GP}(0, K)$ | Bayesian posterior with NNGP kernel $K$ |
| NTK | Training dynamics under gradient descent: $\dot f_t = -\Theta_\infty (f_t - y)$ | Kernel ridge regression with $\Theta_\infty$ |
For a fully connected net, the two kernels are linked layer by layer by the recursion $\Theta^{(l)} = K^{(l)} + \dot K^{(l)}\, \Theta^{(l-1)}$, where $\dot K^{(l)}$ is the derivative kernel (Lee et al. 2019, Sec. 2.3). On standard image data the two kernels give similar predictive accuracy, but they differ on out-of-distribution inputs and on uncertainty estimates: NNGP gives calibrated posterior variance, NTK gives a point estimate.
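The layerwise recursion is short enough to write out. The sketch below is an illustration under one common normalization, assuming ReLU, unit-norm inputs, variance-preserving init, and no biases, so that $K^{(l)}(x,x) = 1$ at every layer: it iterates $K^{(l)} = \kappa_1(K^{(l-1)})$ and $\Theta^{(l)} = K^{(l)} + \kappa_0(K^{(l-1)})\, \Theta^{(l-1)}$.

```python
import numpy as np

def relu_ntk_recursion(u, depth):
    """NNGP kernel K and NTK Theta for a depth-L ReLU net on unit-norm
    inputs with cosine similarity u, normalized so K(x, x) = 1 per layer."""
    def k0(c):   # derivative kernel: E[relu'(a) relu'(b)] for corr-c normals
        return (np.pi - np.arccos(np.clip(c, -1, 1))) / np.pi
    def k1(c):   # normalized E[relu(a) relu(b)] (arccosine kernel, order 1)
        theta = np.arccos(np.clip(c, -1, 1))
        return (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi
    K, Theta = u, u            # input layer: K^(0) = Theta^(0) = x.x' = u
    for _ in range(depth):
        K, Theta = k1(K), k1(K) + Theta * k0(K)
    return K, Theta

for L in (1, 2, 4):
    print(L, relu_ntk_recursion(0.5, L))
# At u = 1 the recursion gives K = 1 and Theta = depth + 1 exactly.
```

Running it shows the characteristic depth behavior: the NNGP correlation $K$ drifts toward 1 with depth while the NTK value $\Theta$ grows, which is why deep NTKs have increasingly ill-conditioned Gram matrices.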
The NNGP itself was introduced for shallow networks by Neal (1996) and extended to deep nets by Lee et al. (ICLR 2018). The unification under Tensor Programs (Yang 2019, "Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes") established the GP limit for essentially every sane architecture, including transformers (Hron et al. ICML 2020).
Empirical NTK Drift at Finite Width
The lazy-training assumption is testable. Fort et al. (NeurIPS 2020) measured the empirical NTK during real training of ResNets on CIFAR-10 and ImageNet. They report:
- The empirical NTK at standard widths drifts by 30 to 80 percent in Frobenius norm during the first several epochs, then stabilizes
- Drift correlates with feature learning: high drift = high test accuracy improvement over the linearized baseline
- After the early phase the network enters a "linearized phase" where the NTK is approximately constant, suggesting a hybrid picture: rich early dynamics, lazy late dynamics
This is the cleanest empirical refutation of pure-NTK theory for practical-width networks, and the origin of the modern view that NTK predictions are most useful as a baseline that finite-width feature learning beats by a measurable margin.
The Lazy Regime vs. The Rich Regime
This is the critical conceptual distinction:
| Property | Lazy Regime (NTK) | Rich Regime (Feature Learning) |
|---|---|---|
| Parameters move | Very little | Substantially |
| Features | Fixed at initialization | Learned during training |
| Equivalent to | Kernel regression | Multiple infinite-width limits (mean-field, μP); less clean than NTK |
| Width | Very large | Practical |
| Theory | Well-understood | Active research |
| Performance | Often worse on benchmarks | Better, sometimes by large margins |
Real neural networks that achieve top benchmark performance are typically in the rich regime: they learn hierarchical features very different from their random initialization. NTK theory describes a regime where this feature learning is suppressed.
The NTK for Specific Architectures
NTK for a two-layer ReLU network
For a two-layer network $f(x) = \frac{1}{\sqrt{m}}\sum_j a_j \sigma(w_j^\top x)$ with ReLU activation $\sigma(z) = \max(z, 0)$, the infinite-width NTK admits a closed form via the Cho-Saul arccosine kernels. Let $u = \frac{x^\top x'}{\|x\|\|x'\|}$ and $\theta = \arccos(u)$. Define

$$\kappa_0(u) = \frac{\pi - \theta}{\pi}, \qquad \kappa_1(u) = \frac{\sin\theta + (\pi - \theta)\cos\theta}{\pi}.$$

The first-layer NNGP kernel is $K(x, x') = \frac{\|x\|\|x'\|}{2}\,\kappa_1(u)$, and the NTK is

$$\Theta_\infty(x, x') = \frac{\|x\|\|x'\|}{2}\,\big(\kappa_1(u) + u\,\kappa_0(u)\big).$$

Sanity check: for unit vectors with $x = x'$ we have $u = 1$ and $\theta = 0$, giving $\kappa_0(1) = 1$ and $\kappa_1(1) = 1$, so $\Theta_\infty(x, x) = 1$, matching Jacot-Gabriel-Hongler 2018 Prop. 2 and Arora et al. 2019 Thm. 3.1 up to convention-dependent constants. This kernel is universal (dense in continuous functions on the sphere) and positive definite on distinct points.
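The closed form can be checked against a finite-width Monte Carlo estimate. Under one convention, assuming $a_j \sim \mathcal{N}(0,1)$, $w_j \sim \mathcal{N}(0, I)$, and no biases, the empirical NTK of a wide random net concentrates on the formula above:

```python
import numpy as np

rng = np.random.default_rng(5)

def ntk_closed_form(x, xp):
    """Two-layer ReLU NTK, infinite-width limit (no biases)."""
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    u = np.clip(x @ xp / (nx * nxp), -1, 1)
    theta = np.arccos(u)
    k0 = (np.pi - theta) / np.pi
    k1 = (np.sin(theta) + (np.pi - theta) * u) / np.pi
    return 0.5 * nx * nxp * (k1 + u * k0)

def ntk_empirical(x, xp, m=200_000):
    """Empirical NTK of f(x) = a^T relu(Wx)/sqrt(m) at a random init."""
    d = len(x)
    W = rng.standard_normal((m, d))
    a = rng.standard_normal(m)
    def grads(z):  # gradients w.r.t. a and W
        pre = W @ z
        return (np.maximum(pre, 0) / np.sqrt(m),
                (a * (pre > 0))[:, None] * z / np.sqrt(m))
    ga, gW = grads(x)
    gap, gWp = grads(xp)
    return ga @ gap + (gW * gWp).sum()

x, xp = np.array([1.0, 0.0]), np.array([0.6, 0.8])
cf, em = ntk_closed_form(x, xp), ntk_empirical(x, xp)
print(cf, em)   # the Monte Carlo value matches the closed form to ~2 decimals
```

At width $2 \times 10^5$ the deviation is at the expected $O(1/\sqrt{m})$ scale; shrinking `m` makes the finite-width error of the previous section directly visible.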
Common Confusions
NTK does not describe practical neural networks
The NTK regime requires width to be extremely large, often unrealistically so. Practical networks (GPT, ResNets, etc.) are not in the lazy regime: they learn features. NTK is a theoretical tool for understanding one extreme of neural network behavior, not a description of how practical networks work.
Constant NTK means no feature learning
A common point of confusion: "NTK theory proves neural networks are kernel methods." More precisely, NTK theory proves that infinitely wide networks in the lazy regime are kernel methods. The interesting behavior of practical networks (feature learning, representation learning, transfer) happens precisely when the NTK changes during training.
NTK parameterization vs standard parameterization
The NTK result requires a specific parameterization (an explicit $1/\sqrt{n}$ output scaling) that differs from both the standard (PyTorch default) and the mean-field parameterizations. Different parameterizations lead to qualitatively different infinite-width limits; the mean-field parameterization leads to feature learning even at infinite width. PyTorch default initialization is not NTK parameterization. Practitioners applying NTK results to standard-init models are making an unstated approximation.
Finite width matters quantitatively
For standard fully connected architectures, the finite-width NTK typically differs from the infinite-width limit by a few percent at widths in the thousands, and by more than 10 percent at widths in the hundreds. Practical transformer MLP widths run 1024 to 8192, so the infinite-width approximation carries non-negligible error. "Infinite-width predictions" for real models are first-order approximations, not exact characterizations. (Arora et al. 2019, Table 1; Lee et al. 2019.)
Summary
- The NTK is $\Theta(x, x') = \langle \nabla_\theta f(x;\theta_0),\ \nabla_\theta f(x';\theta_0) \rangle$
- At infinite width, the NTK converges to a deterministic kernel and stays constant during training
- Infinite-width networks trained with GD are equivalent to kernel regression with the NTK
- This is the lazy regime: features are fixed, only the linear readout is learned
- Real networks learn features (rich regime), which NTK theory does not capture
- NTK was a major theoretical advance but is incomplete as a theory of deep learning
Exercises
Problem
Warm-up: For a linear model $f(x;\theta) = \theta^\top x$, compute the NTK $\Theta(x, x')$. Then, for a two-layer linear network $f(x; a, W) = \frac{1}{\sqrt{m}}\, a^\top W x$ with $a \in \mathbb{R}^m$ and $W \in \mathbb{R}^{m \times d}$, compute the NTK at a fixed initialization $(a_0, W_0)$. Show that depth changes the NTK even for linear networks, and explain what this implies about the role of depth in NTK theory.
Problem
Suppose you have a two-layer network of width $m$ and the NTK Gram matrix on the training points has minimum eigenvalue $\lambda_{\min} > 0$. The training loss at time $t$ under gradient flow satisfies $L(t) \le e^{-2\lambda_{\min} t} L(0)$. In terms of $\lambda_{\min}$ and $L(0)$, how many time units until the loss is guaranteed to reach a target value $\varepsilon$?
Problem
Explain why the NTK framework cannot account for the empirical observation that deeper networks learn increasingly abstract features at higher layers. What property of the NTK regime prevents this?
Frequently Asked Questions
- What is the Neural Tangent Kernel?
- The kernel $\Theta(x, x') = \langle \nabla_\theta f(x;\theta_0), \nabla_\theta f(x';\theta_0) \rangle$ at initialization, where $f$ is a neural network. Jacot et al. (2018) showed that in the infinite-width NTK parameterization, $\Theta$ stays constant during training, and gradient descent on the network is equivalent to kernel ridge regression with $\Theta_\infty$.
- How is NTK different from NNGP?
- NNGP is the prior: at infinite width, $f(\cdot;\theta_0)$ at initialization is a Gaussian process with kernel $K$. NTK governs the trained predictor: after gradient descent, the network is a Gaussian process whose mean is the kernel-regression solution with $\Theta_\infty$. NNGP gives the Bayesian view, NTK gives the GD-trained view, both at infinite width.
- Why does the NTK regime kill feature learning?
- In the lazy regime, weights barely change during training. The features (intermediate activations) stay essentially fixed at their initialization values; the network learns a linear function in this fixed feature space. Real deep learning gets much of its power from learned features, so NTK theory describes a degenerate limit, not the practical regime.
- What is μP and why does it matter?
- Maximal update parameterization (Yang et al. 2022): a different scaling of weights, learning rates, and biases that keeps feature learning ON at infinite width. Practical payoff: hyperparameters (learning rate, init scale) tuned at small scale transfer to large scale, eliminating expensive sweeps for billion-parameter models.
- Can NTK theory predict real deep learning?
- Partially. NTK eigenvalue-decay rates predict the order in which functions are learned (low-frequency first), and NTK regression bounds give generalization rates that match practice in the lazy regime. Outside the lazy regime, where real networks live, NTK theory breaks down. The gap is what motivates μP and feature-learning theory.
References
Canonical (NTK):
- Jacot, Gabriel, Hongler, "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" (NeurIPS 2018, arXiv:1806.07572), Prop. 2, Cor. 2
- Lee et al., "Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent" (NeurIPS 2019, arXiv:1902.06720), Thm. 2.2
- Arora et al., "On Exact Computation with an Infinitely Wide Neural Net" (NeurIPS 2019, arXiv:1904.11955), Thm. 3.1, Cor. 6.2
- Du et al., "Gradient Descent Finds Global Minima of Deep Neural Networks" (ICML 2019, arXiv:1811.03804)
- Allen-Zhu, Li, Song, "A Convergence Theory for Deep Learning via Over-Parameterization" (ICML 2019, arXiv:1811.03962)
- Cho, Saul, "Kernel Methods for Deep Learning" (NeurIPS 2009). Closed-form arccosine kernels used in the two-layer ReLU NTK derivation.
- Lee et al., "Deep Neural Networks as Gaussian Processes" (ICLR 2018, arXiv:1711.00165). The NNGP kernel that the NTK recursion builds on.
Mean-field primary sources:
- Mei, Montanari, Nguyen, "A Mean Field View of the Landscape of Two-Layer Neural Networks" (PNAS 2018, arXiv:1804.06561)
- Rotskoff, Vanden-Eijnden, "Parameters as Interacting Particles: Long Time Convergence and Asymptotic Error Scaling of Neural Networks" (arXiv:1805.00915)
- Chizat, Bach, "On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport" (NeurIPS 2018, arXiv:1805.09545)
Lazy vs. rich and muP:
- Chizat, Oyallon, Bach, "On Lazy Training in Differentiable Programming" (NeurIPS 2019, arXiv:1812.07956)
- Yang, "Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes" (NeurIPS 2019, arXiv:1910.12478). Tensor Programs II.
- Yang, Hu, "Tensor Programs IV: Feature Learning in Infinite-Width Neural Networks" (ICML 2021, arXiv:2011.14522). Classification of parameterizations into NTK / mean-field / μP.
- Yang et al., "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer" (NeurIPS 2021, arXiv:2203.03466). μP parameterization and hyperparameter transfer across widths.
- Fort et al., "Deep Learning versus Kernel Learning: an Empirical Study of Loss Landscape Geometry and the Time Evolution of the Neural Tangent Kernel" (NeurIPS 2020, arXiv:2010.15110). Empirical NTK drift at finite width and its correlation with feature learning.
NTK spectrum and generalization:
- Bordelon, Canatar, Pehlevan, "Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks" (ICML 2020, arXiv:2002.02561). Eigenvalue-decay rate as the determinant of NTK generalization.
- Cui, Loureiro, Krzakala, "Generalization Error Rates in Kernel Regression: The Crossover from the Noiseless to Noisy Regime" (NeurIPS 2021, arXiv:2105.15004).
- Caponnetto, De Vito, "Optimal Rates for the Regularized Least-Squares Algorithm" (Foundations of Computational Mathematics, 2007). The classical kernel-ridge rate that the NTK specialization inherits.
- Hu, Lu, Bauer, "Regularization Matters: A Nonparametric Perspective on Overparametrized Neural Network" (Annals of Statistics, 2021). Explicit ridge equivalence proof.
- Rahaman et al., "On the Spectral Bias of Neural Networks" (ICML 2019, arXiv:1806.08734).
NNGP and Bayesian deep learning:
- Neal, "Priors for Infinite Networks" (Tech Report 1994; expanded in Bayesian Learning for Neural Networks, Springer 1996).
- Lee et al., "Deep Neural Networks as Gaussian Processes" (ICLR 2018, arXiv:1711.00165).
- Hron, Bahri, Sohl-Dickstein, Novak, "Infinite Attention: NNGP and NTK for Deep Attention Networks" (ICML 2020, arXiv:2006.10540). NNGP and NTK for transformer architectures.
Next Topics
- Ridge regression: the classical theory the NTK kernel-ridge equivalence imports into deep learning
- Kernels and RKHS: the function-space frame that makes NTK an RKHS method
- Implicit bias and modern generalization: why gradient descent finds the minimum-RKHS-norm interpolant in the ridgeless limit
- Double descent: the spectral side of interpolating kernel regression
- Benign overfitting: when interpolation generalizes despite zero training error
Last reviewed: April 19, 2026