
Modern Generalization

Neural Tangent Kernel: Lazy Training, Kernel Equivalence, μP, and the Limits of Width

In the infinite-width NTK parameterization, training a neural network with gradient descent is mathematically equivalent to kernel regression. The same limit suppresses feature learning, which is why μP and mean-field parameterizations exist. NTK is the precise boundary between the kernel and feature-learning regimes.

Advanced · Tier 1 · Core spine · ~95 min

Why This Matters

The Neural Tangent Kernel is one of the most cleanly stated theoretical results in modern deep learning. It shows that in a specific limit (infinite width with NTK parameterization), training a neural network with gradient descent is mathematically equivalent to kernel regression with a particular kernel. This gave the field a rigorous connection between deep networks and classical kernel theory, with explicit convergence rates and generalization bounds.

The NTK result is also, in large part, a negative result about its own regime. The lazy-training limit it describes is precisely the regime where networks do not learn features: the representation stays at initialization. Real finite-width networks operate outside the NTK regime when feature learning matters, which is why mean-field and maximal-update (μP) parameterizations are needed to describe practical training. NTK is best understood as a clean boundary case, not an explanation of why finite networks generalize.

The picture matters for three concrete reasons. First, kernel-ridge equivalence (proven below) imports the entire theory of ridge regression into deep learning: the infinite-width network with weight decay \lambda behaves like RKHS-regularized regression, with explicit bias-variance and double-descent predictions. Second, the NTK spectrum controls both convergence speed and generalization: small eigenvalues mean slow learning, and the eigenvalue decay rate determines the test-error rate. Third, the μP parameterization (Yang-Hu Tensor Programs IV/V) lets practitioners tune hyperparameters at small width and transfer them losslessly to large width, an industrially important consequence of the NTK / mean-field dichotomy.

Mental Model

Consider a neural network f(x; \theta) with parameters \theta. At initialization \theta_0, linearize the network around \theta_0:

f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0)

This is just a first-order Taylor expansion. The key insight of NTK theory: when the network is sufficiently wide, the parameters \theta barely move during training (relative to their scale), so this linearization is accurate throughout training. The linearized model is a kernel method with kernel determined by \nabla_\theta f.

Three parameterizations give three qualitatively different infinite-width limits, and the choice is not cosmetic.

Parameterization | Per-weight scale | Output scale | Limit
Standard (PyTorch default) | 1/\sqrt{m} | \Theta(1) | Output diverges; not a clean limit
NTK | \Theta(1) at init, output \propto 1/\sqrt{m} | \Theta(1) | Lazy: kernel regression with \Theta^*
Mean-field / μP | \Theta(1/m) | \Theta(1) | Rich: features evolve, transfer across widths

Under NTK parameterization the per-coordinate weight movement is O(1/\sqrt{m}), so each feature \partial f / \partial \theta_k stays near its initialization. Under mean-field the per-coordinate movement is \Theta(1), so features actively evolve. Yang's Tensor Programs IV classified all parameterizations of feedforward networks and showed μP is the unique parameterization in which every layer, including the embedding layer and the output layer, is in the maximal feature-learning limit.
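
A minimal numerical sketch of the lazy effect (my own illustration, not taken from the references): train an NTK-parameterized two-layer ReLU net by full-batch gradient descent and report the relative parameter movement \|\theta_T - \theta_0\| / \|\theta_0\| at several widths. The widths, step count, and learning rate are arbitrary choices; the point is only that the ratio shrinks roughly like 1/\sqrt{m}.

```python
import numpy as np

def train_relative_movement(m, n=64, d=16, steps=200, lr=1.0, seed=0):
    """Train f(x) = (1/sqrt(m)) a . relu(W x / sqrt(d)) by full-batch gradient
    descent on the squared loss and return ||theta_T - theta_0|| / ||theta_0||."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)              # arbitrary regression targets
    W = rng.standard_normal((m, d))         # NTK parameterization: N(0, 1) entries
    a = rng.standard_normal(m)
    W0, a0 = W.copy(), a.copy()

    for _ in range(steps):
        pre = X @ W.T / np.sqrt(d)          # (n, m) pre-activations
        h = np.maximum(pre, 0.0)            # ReLU features
        f = h @ a / np.sqrt(m)              # network outputs
        r = 2.0 * (f - y) / n               # dL/df for L = (1/n) sum (f - y)^2
        grad_a = h.T @ r / np.sqrt(m)
        grad_W = ((r[:, None] * (pre > 0) * a[None, :]).T @ X) / np.sqrt(m * d)
        a -= lr * grad_a
        W -= lr * grad_W

    moved = np.sqrt(np.sum((W - W0) ** 2) + np.sum((a - a0) ** 2))
    scale = np.sqrt(np.sum(W0 ** 2) + np.sum(a0 ** 2))
    return moved / scale

for m in (100, 1000, 10000):
    print(f"width {m:6d}: relative parameter movement {train_relative_movement(m):.4f}")
```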

The Neural Tangent Kernel

Definition

Neural Tangent Kernel

For a neural network f(x; \theta) with parameters \theta, the Neural Tangent Kernel is:

\Theta(x, x') = \left\langle \nabla_\theta f(x; \theta), \; \nabla_\theta f(x'; \theta) \right\rangle = \sum_{k} \frac{\partial f(x; \theta)}{\partial \theta_k} \frac{\partial f(x'; \theta)}{\partial \theta_k}

This is the inner product of the gradients of the network output with respect to all parameters, evaluated at inputs x and x'.

At initialization \theta_0, this defines \Theta_0(x, x'). The kernel depends on the architecture (depth, width, activation function) and the initialization distribution.
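
The definition translates directly into code. Below is a dependency-free sketch (my own illustration, assuming the two-layer ReLU architecture used throughout this page, with the two gradient blocks written out by hand) that builds the empirical NTK Gram matrix on a batch of inputs:

```python
import numpy as np

def two_layer_ntk_gram(X, W, a):
    """Empirical NTK Gram matrix of f(x) = (1/sqrt(m)) a . relu(W x / sqrt(d)),
    i.e. the pairwise inner products of per-example parameter gradients."""
    m, d = W.shape
    pre = X @ W.T / np.sqrt(d)               # (n, m)
    act = np.maximum(pre, 0.0)               # relu(pre):  d f / d a_j  = act_j / sqrt(m)
    gate = (pre > 0).astype(X.dtype)         # relu'(pre): d f / d W_jk = a_j gate_j x_k / sqrt(m d)
    K_a = act @ act.T / m                                  # output-layer gradient contribution
    K_W = (X @ X.T / d) * ((gate * a**2) @ gate.T) / m     # hidden-layer gradient contribution
    return K_a + K_W

rng = np.random.default_rng(0)
n, d, m = 8, 16, 4096
X = rng.standard_normal((n, d))
W = rng.standard_normal((m, d))              # NTK parameterization at initialization
a = rng.standard_normal(m)
Theta0 = two_layer_ntk_gram(X, W, a)
print(Theta0.shape)
print(Theta0[:3, :3])                        # entries concentrate as m grows (rerun with larger m)
```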

Definition

Lazy Training Regime

A neural network is in the lazy training regime if and only if the parameters stay close to their initialization throughout training:

\|\theta_t - \theta_0\| = o(\|\theta_0\|)

In this regime, the NTK \Theta_t(x, x') \approx \Theta_0(x, x') remains approximately constant, and the network dynamics are well-approximated by the linearized model. The name "lazy" reflects that the features (the gradients \nabla_\theta f) do not change; only the linear combination of features is learned.

Definition

NTK Parameterization

A two-layer network has NTK parameterization if and only if it is written as

f(x; W, a) = \frac{1}{\sqrt{m}} \sum_{j=1}^m a_j \, \sigma\!\left(\frac{1}{\sqrt{d}} w_j^\top x\right),

with a_j \sim \mathcal{N}(0, 1) and w_j \sim \mathcal{N}(0, I_d) at initialization. The crucial features are the per-layer 1/\sqrt{m_\text{in}} input scaling (so each pre-activation has \Theta(1) variance) and the 1/\sqrt{m} output normalization (so the output has \Theta(1) variance). For deeper networks the same scaling is applied layer by layer.

The standard PyTorch default initialization uses different scaling: 1/\sqrt{m_\text{in}} folded into the weights themselves (w_j \sim \mathcal{N}(0, 1/m_\text{in})). This standard parameterization gives the same forward computation but a different per-weight learning-rate sensitivity, which produces a different infinite-width limit (the output diverges unless the learning rate is rescaled by 1/m). Yang and Hu (2021, Tensor Programs IV) classified all per-layer scaling choices into three classes: NTK (lazy), mean-field (feature learning at the readout only), and μP (feature learning at every layer including the embedding and readout).
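
A small sketch (my own, with arbitrary sizes) showing that the NTK and standard parameterizations produce the same \Theta(1) output variance at initialization; the difference between them only shows up once gradient descent starts moving the weights:

```python
import numpy as np

def ntk_param_output(X, m, rng):
    """NTK parameterization: N(0, 1) weights, the 1/sqrt(d) and 1/sqrt(m)
    scalings carried explicitly in the forward pass."""
    d = X.shape[1]
    W = rng.standard_normal((m, d))
    a = rng.standard_normal(m)
    return np.maximum(X @ W.T / np.sqrt(d), 0.0) @ a / np.sqrt(m)

def standard_param_output(X, m, rng):
    """Standard (PyTorch-default-style) parameterization: the 1/sqrt(fan_in)
    factors folded into the weights themselves."""
    d = X.shape[1]
    W = rng.standard_normal((m, d)) / np.sqrt(d)
    a = rng.standard_normal(m) / np.sqrt(m)
    return np.maximum(X @ W.T, 0.0) @ a

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 32))
for m in (64, 1024, 16384):
    f_ntk = ntk_param_output(X, m, rng)
    f_std = standard_param_output(X, m, rng)
    print(f"m={m:6d}  var(f_ntk)={f_ntk.var():.3f}  var(f_std)={f_std.var():.3f}")
```

Both columns stay \Theta(1) as the width grows; the parameterizations differ in how gradient descent moves the weights, which is what separates the lazy and feature-learning limits.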

Core Theoretical Results

Theorem

NTK Convergence at Infinite Width

Statement

Consider a fully connected network with L hidden layers of width m, with NTK parameterization. As m \to \infty:

  1. The random kernel \Theta_0(x, x') converges in probability to a deterministic kernel \Theta^*(x, x')
  2. The limiting kernel \Theta^* depends only on the architecture (depth, activation function) and is independent of the random initialization

The limiting kernel can be computed recursively layer by layer.

Intuition

At infinite width, the law of large numbers kicks in: each layer computes a sum of many independent random terms, which concentrates around its expectation. The randomness of initialization washes out, leaving a deterministic kernel that depends only on the architecture.

Proof Sketch

Proceed by induction on depth. At each layer, the pre-activations are sums of m independent terms (one per neuron in the previous layer). By the CLT, these converge to a Gaussian process as m \to \infty. The kernel of this GP at layer l is determined recursively by the kernel at layer l-1 and the activation function, via the formula K^{(l)}(x,x') = \mathbb{E}_{(u,v) \sim \mathcal{N}}[\sigma(u)\sigma(v)], where the covariance of (u,v) is determined by K^{(l-1)}.
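
A hedged sketch of this layer-by-layer recursion for the ReLU case, using the \kappa_0 / \kappa_1 normalization of the two-layer example further down this page; the depth and the test inputs are arbitrary:

```python
import numpy as np

def kappa0(u):
    u = np.clip(u, -1.0, 1.0)
    return (np.pi - np.arccos(u)) / np.pi

def kappa1(u):
    u = np.clip(u, -1.0, 1.0)
    return (np.sqrt(1.0 - u**2) + (np.pi - np.arccos(u)) * u) / np.pi

def relu_nngp_ntk(x, xp, depth):
    """Limiting NNGP kernel K and NTK Theta for a ReLU net with `depth` hidden
    layers: K^(l) = sqrt(K^(l-1)(x,x) K^(l-1)(x',x')) kappa1(u) and
    Theta^(l) = K^(l) + Theta^(l-1) kappa0(u), starting from the inner product."""
    Kxx, Kxpxp, Kxxp = x @ x, xp @ xp, x @ xp
    theta = Kxxp
    for _ in range(depth):
        u = Kxxp / np.sqrt(Kxx * Kxpxp)
        Kxxp = np.sqrt(Kxx * Kxpxp) * kappa1(u)   # NNGP recursion
        theta = Kxxp + theta * kappa0(u)          # NTK recursion
        # the diagonal is preserved under this normalization (kappa1(1) = 1),
        # so Kxx and Kxpxp do not need updating
    return Kxxp, theta

x = np.array([1.0, 0.0])
xp = np.array([0.6, 0.8])
print(relu_nngp_ntk(x, x, depth=1))    # Theta(x, x) = 2 for a unit vector (two-layer closed form)
print(relu_nngp_ntk(x, xp, depth=3))   # deeper kernels from the same recursion
```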

Why It Matters

This shows that infinitely wide neural networks at initialization are Gaussian processes with a specific kernel. Combined with the constancy result below, this means training such networks is equivalent to kernel regression, a fully solved problem.

Failure Mode

The finite-width NTK deviates from the infinite-width limit at rate O(1/\sqrt{m}) (Arora et al. 2019, arXiv:1904.11955, Cor. 6.2; Lee et al. 2019, Thm. 2.2). Quantitatively, for standard fully connected architectures, at m = 1024 the typical relative error is 2 to 5 percent (Arora et al. 2019, Table 1); at m = 128 it exceeds 10 percent. For practically sized networks, the infinite-width approximation has non-negligible error.

Theorem

NTK Stays Constant During Training

Statement

For a network with width m trained by gradient flow on the squared loss, as m \to \infty:

  1. The NTK \Theta_t stays equal to \Theta_0 throughout training: \sup_t \|\Theta_t - \Theta_0\| \to 0
  2. The training loss converges to zero exponentially: L(t) \leq L(0) \cdot e^{-2\lambda_{\min} t}, where \lambda_{\min} is the smallest eigenvalue of \Theta^* on the training data
  3. The trained network is equivalent to kernel regression with kernel \Theta^*

Intuition

When the width is enormous, each parameter contributes a tiny amount to the output. Training changes each parameter by a tiny amount. The gradient features \nabla_\theta f(x; \theta) barely change, so the kernel stays constant. The dynamics become linear in function space, and the solution is exactly kernel regression.

Proof Sketch

Under NTK parameterization, each individual weight is \Theta(1) at initialization (with the output scaled by 1/\sqrt{m}), giving \|\theta_0\| = \Theta(\sqrt{m}). During gradient flow, each parameter moves by O(1/\sqrt{m}) per coordinate, giving an aggregate Frobenius-norm displacement \|\theta_t - \theta_0\| that is O(1) in total but O(1/\sqrt{m}) relative to \|\theta_0\|, so the relative movement vanishes as m \to \infty (Chizat, Oyallon, Bach 2019).

The change in the NTK is second-order in the parameter displacement: \|\Theta_t - \Theta_0\| \leq \|H_f\| \cdot \|\theta_t - \theta_0\|^2, where H_f is the Hessian of f in parameter space. Under NTK parameterization, \|H_f\|_{\mathrm{op}} = O(1/\sqrt{m}) and \|\theta_t - \theta_0\|^2 = O(1), so the product is O(1/\sqrt{m}) \to 0 (Lee et al. 2019, "Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent", arXiv:1902.06720).

With a constant kernel, the training dynamics in function space are linear: \dot{f} = -\Theta^* (f - y), which converges exponentially at rate 2\lambda_{\min}(\Theta^*).
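
These linear dynamics can be checked numerically. A toy sketch of my own (a random positive-definite matrix standing in for the NTK Gram, zero initial output, and a step size small enough that discrete gradient descent tracks the flow) comparing the measured loss against the L(0) e^{-2\lambda_{\min} t} bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
A = rng.standard_normal((n, n))
Theta = A @ A.T / n + 0.05 * np.eye(n)   # stand-in for the NTK Gram on the training set
y = rng.standard_normal(n)

lam_min = np.linalg.eigvalsh(Theta).min()
f = np.zeros(n)                          # output at initialization taken as 0
eta, steps = 1e-3, 5000                  # small step so GD tracks the gradient flow
for step in range(steps + 1):
    t = step * eta
    loss = np.sum((f - y) ** 2)
    if step % 1000 == 0:
        bound = np.sum(y ** 2) * np.exp(-2.0 * lam_min * t)
        print(f"t={t:5.1f}  loss={loss:.4e}  bound L(0)exp(-2*lam_min*t)={bound:.4e}")
    f = f - eta * Theta @ (f - y)        # discretized  f' = -Theta (f - y)
```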

Why It Matters

This is the central result of NTK theory. It says: infinitely wide networks trained with gradient descent are exactly kernel methods. This immediately imports decades of kernel theory: RKHS norm bounds, generalization guarantees, spectral analysis of convergence rates.

Failure Mode

The constancy of the NTK is precisely the lazy training limitation. If the kernel does not change, the network does not learn new features. It only learns a linear combination of the initial random features. This is why NTK theory cannot explain the success of feature learning in practice.

Equivalence to Kernel Ridge Regression

The constancy theorem says infinite-width gradient flow on the squared loss solves the noiseless kernel-regression problem with kernel \Theta^*. Adding weight decay (or, equivalently, early stopping) makes the equivalence even sharper: the trained network solves a kernel ridge regression problem.

Theorem

NTK as Kernel Ridge Regression

Statement

Let \hat f_{\text{NN}}^\lambda be the infinite-width network trained with \ell_2 regularization centered at the initialization, that is, the penalized loss L(\theta) = \frac{1}{n} \sum_{i=1}^n (f(x_i; \theta) - y_i)^2 + \lambda \|\theta - \theta_0\|_2^2. At convergence,

\hat f_{\text{NN}}^\lambda(x) \;=\; \Theta^*(x, X) \big[\Theta^*(X, X) + n \lambda I\big]^{-1} y,

which is exactly kernel ridge regression with kernel \Theta^* and ridge parameter \lambda. As \lambda \to 0^+ this reduces to the minimum-RKHS-norm interpolant of the training data; as \lambda \to \infty it shrinks to zero.

Intuition

With a constant kernel and a quadratic regularizer in parameter space, the optimization in function space is a ridge regression in the RKHS defined by \Theta^*. Weight decay on \theta becomes an RKHS-norm penalty on f, and the infinite-width network delivers the minimizer in closed form. Without explicit regularization, gradient flow itself imposes an implicit bias toward the minimum-RKHS-norm interpolant.

Proof Sketch

Linearize around \theta_0: f_\theta(x) \approx f_0(x) + \nabla_\theta f_0(x)^\top (\theta - \theta_0). Define features \phi(x) = \nabla_\theta f_0(x) \in \mathbb{R}^P, where P is the parameter count, so the linearized network is f(x) = f_0(x) + \phi(x)^\top \beta with \beta = \theta - \theta_0. The penalized squared loss \frac{1}{n} \|y - f_0 - \Phi \beta\|_2^2 + \lambda \|\beta\|_2^2 is solved by \hat\beta = (\Phi^\top \Phi + n\lambda I)^{-1} \Phi^\top (y - f_0). Apply the kernel trick (push-through identity) to rewrite the predictor at a new x as \phi(x)^\top \hat\beta = \Theta_0(x, X) (\Theta_0(X,X) + n\lambda I)^{-1} (y - f_0). At infinite width \Theta_0 \to \Theta^* and f_0 \to 0 by the centered initialization (Lee et al. 2019, Sec. 2.4; Hu, Lu, Bauer 2021, Thm. 1).
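
A compact sketch of the resulting predictor (my own illustration), using the closed-form two-layer ReLU NTK given later on this page as the kernel; the data, ridge values, and test split are arbitrary:

```python
import numpy as np

def relu_ntk_gram(X1, X2):
    """Closed-form two-layer ReLU NTK (the kappa_0 / kappa_1 expression in the
    example further down this page)."""
    n1 = np.linalg.norm(X1, axis=1)
    n2 = np.linalg.norm(X2, axis=1)
    u = np.clip((X1 @ X2.T) / np.outer(n1, n2), -1.0, 1.0)
    k0 = (np.pi - np.arccos(u)) / np.pi
    k1 = (np.sqrt(1.0 - u**2) + (np.pi - np.arccos(u)) * u) / np.pi
    return np.outer(n1, n2) * (k1 + u * k0)

def ntk_ridge_predict(X_train, y_train, X_test, lam):
    """Kernel ridge regression: Theta(x, X) [Theta(X, X) + n lam I]^{-1} y."""
    n = len(X_train)
    K = relu_ntk_gram(X_train, X_train)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y_train)
    return relu_ntk_gram(X_test, X_train) @ alpha

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)   # toy target plus noise
X_test = rng.standard_normal((50, 5))
y_test = np.sin(X_test[:, 0])
for lam in (1e-4, 1e-2, 1.0):
    mse = np.mean((ntk_ridge_predict(X, y, X_test, lam) - y_test) ** 2)
    print(f"lambda={lam:g}  test MSE={mse:.4f}")
```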

Why It Matters

This is the bridge that imports the entire ridge-regression theory into deep learning. The bias-variance tradeoff, the effective degrees of freedom \mathrm{tr}\big(\Theta^*(\Theta^* + n\lambda I)^{-1}\big), the SURE estimator for \lambda, and the cross-validation theory for selecting \lambda all transfer directly. It also explains why early stopping and weight decay behave so similarly in practice: under the lazy linearization, both move the solution along the same RKHS-norm-vs-fit tradeoff curve.

Failure Mode

The equivalence holds only in the lazy linearization. Real networks at finite width and finite training time deviate (Fort et al. 2020 measure the deviation via empirical NTK drift). For ReLU networks the NTK has a fast-decaying spectrum that makes the ridgeless interpolant unstable on high-frequency target functions; this is the spectral side of double descent. Under standard parameterization (not NTK parameterization), the ridge equivalence does not hold even at infinite width.

NTK Spectrum and Generalization

Once the equivalence to kernel ridge is established, the entire generalization theory of NTK reduces to spectral analysis of \Theta^*. The key object is the eigenvalue decay rate of the operator \mathcal{T}_{\Theta^*} f(x) = \int \Theta^*(x, x') f(x') \, d\rho(x') on the data distribution \rho.

Theorem

Generalization Rate from NTK Eigenvalue Decay (Bordelon-Pehlevan)

Statement

For a target function f^* in the RKHS, the excess test error of the ridgeless NTK interpolant on n samples scales as

\mathbb{E}\big[\|\hat f - f^*\|_{L^2(\rho)}^2\big] \;\asymp\; n^{-\frac{\alpha - 1}{\alpha}}

up to logarithmic factors. The rate is determined entirely by the spectral decay exponent \alpha of \Theta^*, not by the network architecture beyond what affects \alpha. For the two-layer ReLU NTK on the sphere, \alpha = d (the input dimension), giving a rate n^{-(d-1)/d}.

Intuition

Kernel methods can only fit modes of the target along eigendirections of \Theta^*. Modes along eigendirections with small eigenvalues require proportionally more samples to learn. Polynomial decay of eigenvalues means high-frequency components of the target are increasingly costly to estimate, and the test error is dominated by the lowest-eigenvalue mode that still carries meaningful signal at sample size n.

Proof Sketch

Use the bias-variance decomposition for kernel ridge with the optimal \lambda(n) \asymp n^{-\alpha/(\alpha + 1)}. The bias term is \sum_k \lambda^2 / (\lambda_k + \lambda)^2 (f^*_k)^2 and the variance term is (\sigma^2 / n) \sum_k \lambda_k / (\lambda_k + \lambda)^2. Substituting the polynomial eigenvalue decay and balancing the two terms gives the optimal \lambda(n) and the resulting risk rate (Caponnetto, De Vito 2007 for the classical kernel result; Bordelon, Canatar, Pehlevan 2020 for the NTK-spectrum specialization).
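
A small sketch of the two terms in this decomposition under an assumed power-law spectrum \lambda_k = k^{-\alpha} and an arbitrary in-RKHS choice of target coefficients (both are my own assumptions for illustration); it only shows the trade-off in \lambda, not the asymptotic rate itself:

```python
import numpy as np

def bias_variance(lam, n, alpha, sigma2=0.1, K=100_000):
    """Evaluate the two terms of the decomposition quoted above for an assumed
    power-law spectrum lambda_k = k^{-alpha} and an (arbitrary) in-RKHS target
    with coefficients (f*_k)^2 = k^{-alpha - 1}."""
    k = np.arange(1, K + 1, dtype=float)
    lam_k = k ** (-alpha)
    f2_k = k ** (-alpha - 1.0)
    bias = np.sum(lam**2 / (lam_k + lam) ** 2 * f2_k)
    var = (sigma2 / n) * np.sum(lam_k / (lam_k + lam) ** 2)
    return bias, var

n, alpha = 10_000, 3.0
for lam in np.logspace(-6, 0, 7):
    b, v = bias_variance(lam, n, alpha)
    print(f"lambda={lam:.1e}  bias={b:.3e}  variance={v:.3e}  total={b + v:.3e}")
# Larger lambda inflates the bias term, smaller lambda inflates the variance term;
# the optimal lambda(n) sits where the two are balanced.
```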

Why It Matters

This is the only NTK-side prediction that matches measurable behavior of real networks: tasks that are well-aligned with low-frequency NTK eigenfunctions (smooth targets, image-classification on natural images) generalize well, while tasks dominated by high-frequency modes (adversarial examples, parity functions) require sample sizes that scale exponentially. The "spectral bias" toward low-frequency learning observed empirically by Rahaman et al. 2019 is exactly this eigenvalue ordering.

Failure Mode

Polynomial decay is an idealization; for some architectures the spectrum has heavy-tailed or non-polynomial behavior that breaks the clean rate. Real targets are not in the RKHS in general; misspecification adds an f^*-dependent floor to the test error. Most of all, this entire analysis assumes the lazy regime, so it cannot describe the rate improvement that comes from feature learning in finite-width networks (Cui, Loureiro, Krzakala 2021).

μP and Hyperparameter Transfer

The Tensor Programs framework (Yang 2019-2022) is a calculus for infinite-width limits that subsumes NTK, mean-field, and every intermediate parameterization. Its most operational consequence is the maximal-update parameterization (μP), which gives a width-stable training recipe.

Theorem

μP Zero-Shot Hyperparameter Transfer (Yang et al.)

Statement

Under μP, the optimal learning rate \eta^*, optimizer hyperparameters, and weight-decay schedule for a target loss are independent of width. A practitioner can sweep hyperparameters on a small width (say 256) and transfer them losslessly to a much larger width (say 8192) without re-tuning, with optimal-loss agreement to within a few percent across widths spanning two orders of magnitude.

Intuition

Per-layer learning rates in μP are chosen so that every weight matrix moves by \Theta(1) in spectral norm during a single update, regardless of width. Because every layer is in the same dynamical regime across widths, the loss landscape, when reparameterized in the natural μP coordinates, is width-invariant. NTK parameterization fails this property: under NTK the per-layer movement scales with width, so the optimal learning rate scales with width and the hyperparameter sweep must be redone for every architecture size.

Proof Sketch

The Tensor Programs IV calculus assigns each weight matrix a stability exponent in the infinite-width limit. μP is the unique assignment such that (i) every layer has \Theta(1) feature updates and (ii) the output of the network has \Theta(1) variance throughout training. Yang and Hu (2021) prove these two conditions force the per-layer learning-rate scaling. The hyperparameter-transfer theorem then follows because the infinite-width SDE is identical across widths, and the per-step error between widths m_1 and m_2 is O(1/\sqrt{\min(m_1, m_2)}).

Why It Matters

μP transfer is one of the few NTK-adjacent results that is industrially deployed. GPT-4-class training runs use μP-style scaling so that hyperparameter sweeps at 100M parameters transfer to 100B+ parameters without re-tuning. The cost saving is the difference between a tractable sweep at small scale and an infeasible sweep at frontier scale.

Failure Mode

μP transfer has been verified for fully connected nets and standard transformers, but breaks for some architectural choices (in particular, non-standard normalization layers that interact non-trivially with the per-layer scaling). Architectural changes such as MoE routing or sparse attention require their own width-stability analysis. The transfer is also loss-landscape-dependent: extremely sharp local minima can break the infinite-width approximation and require small-batch dynamics correction.

Connection to NNGP and Bayesian Deep Learning

The NTK and the NNGP are two infinite-width objects from the same underlying Gaussian-process structure. The distinction matters: the NNGP governs the prior over functions induced by initialization, while the NTK governs the training dynamics. Bayesian inference uses the NNGP; gradient-descent training uses the NTK.

Object | What it describes | Reduces to
NNGP K^*(x, x') | Distribution of f(x; \theta_0) at random init, m \to \infty | Bayesian posterior with NNGP kernel
NTK \Theta^*(x, x') | Training dynamics under gradient descent, m \to \infty | Kernel ridge regression with \Theta^*

For a fully connected net, \Theta^* is built from the NNGP kernel and its derivative kernel via the layerwise recursion \Theta^{(l+1)} = K^{(l+1)} + \dot K^{(l+1)} \cdot \Theta^{(l)} (Lee et al. 2019, Sec. 2.3). On standard image data the two kernels give similar predictive accuracy, but they differ on out-of-distribution inputs and on uncertainty estimates: NNGP gives calibrated posterior variance, NTK gives a point estimate.

The NNGP itself was introduced for shallow networks by Neal (1996) and extended to deep nets by Lee et al. (ICLR 2018). The unification under Tensor Programs (Yang 2019, "Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes") established the GP limit for essentially every sane architecture, including transformers (Hron et al. ICML 2020).

Empirical NTK Drift at Finite Width

The lazy-training assumption is testable. Fort et al. (NeurIPS 2020) measured the empirical NTK \Theta_t(x, x') = \langle \nabla_\theta f_t(x), \nabla_\theta f_t(x') \rangle during real training of CIFAR-10 and ImageNet ResNets. They report:

  • The empirical NTK at standard widths drifts by 30 to 80 percent in Frobenius norm during the first several epochs, then stabilizes
  • Drift correlates with feature learning: high drift = high test accuracy improvement over the linearized baseline
  • After the early phase the network enters a "linearized phase" where the NTK is approximately constant, suggesting a hybrid picture: rich early dynamics, lazy late dynamics

This is the cleanest empirical refutation of pure-NTK theory for practical-width networks, and the origin of the modern view that NTK predictions are most useful as a baseline that finite-width feature learning beats by a measurable margin.
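
The same measurement protocol can be run on the toy two-layer net from the earlier sketches (Fort et al.'s numbers are for ResNets; the widths, step count, and learning rate below are my own arbitrary choices). The relative Frobenius drift of the empirical NTK Gram shrinks as the width grows:

```python
import numpy as np

def ntk_gram(X, W, a):
    """Empirical NTK Gram of f(x) = (1/sqrt(m)) a . relu(W x / sqrt(d))."""
    m, d = W.shape
    pre = X @ W.T / np.sqrt(d)
    act = np.maximum(pre, 0.0)
    gate = (pre > 0).astype(X.dtype)
    return act @ act.T / m + (X @ X.T / d) * ((gate * a**2) @ gate.T) / m

def ntk_drift(m, n=64, d=16, steps=300, lr=0.5, seed=0):
    """Full-batch GD on the squared loss; return ||Theta_T - Theta_0||_F / ||Theta_0||_F."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)
    W = rng.standard_normal((m, d))
    a = rng.standard_normal(m)
    Theta0 = ntk_gram(X, W, a)
    for _ in range(steps):
        pre = X @ W.T / np.sqrt(d)
        h = np.maximum(pre, 0.0)
        f = h @ a / np.sqrt(m)
        r = 2.0 * (f - y) / n
        grad_a = h.T @ r / np.sqrt(m)
        grad_W = ((r[:, None] * (pre > 0) * a[None, :]).T @ X) / np.sqrt(m * d)
        a -= lr * grad_a
        W -= lr * grad_W
    ThetaT = ntk_gram(X, W, a)
    return np.linalg.norm(ThetaT - Theta0) / np.linalg.norm(Theta0)

for m in (64, 512, 4096):
    print(f"width {m:5d}: relative NTK drift {ntk_drift(m):.4f}")
```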

The Lazy Regime vs. The Rich Regime

This is the critical conceptual distinction:

Property | Lazy Regime (NTK) | Rich Regime (Feature Learning)
Parameters move | Very little | Substantially
Features | Fixed at initialization | Learned during training
Equivalent to | Kernel regression | Multiple infinite-width limits (mean-field, μP); less clean than NTK
Width | Very large | Practical
Theory | Well understood | Active research
Performance | Often worse on benchmarks | Better, sometimes by large margins

Real neural networks that achieve top benchmark performance are typically in the rich regime: they learn hierarchical features very different from their random initialization. NTK theory describes a regime where this feature learning is suppressed.

The NTK for Specific Architectures

Example

NTK for a two-layer ReLU network

For a two-layer network f(x) = \frac{1}{\sqrt{m}}\sum_{j=1}^m a_j \sigma(w_j^\top x) with ReLU activation \sigma(z) = \max(0, z), the infinite-width NTK admits a closed form via the Cho-Saul arccosine kernels. Let u = x^\top x' / (\|x\|\|x'\|) and \alpha = \arccos(u). Define

\kappa_0(u) = \frac{1}{\pi}(\pi - \arccos(u)), \qquad \kappa_1(u) = \frac{1}{\pi}\left(\sqrt{1 - u^2} + (\pi - \arccos(u))\, u\right).

The first-layer NNGP kernel is K^{(1)}(x, x') = \|x\|\|x'\| \cdot \kappa_1(u), and the NTK is

\Theta^*(x, x') = \|x\|\|x'\| \left[\kappa_1(u) + u \cdot \kappa_0(u)\right] = \frac{\|x\|\|x'\|}{\pi}\left[\sin\alpha + (\pi - \alpha)\cos\alpha\right] + \frac{\|x\|\|x'\|\cos\alpha}{\pi}(\pi - \alpha).

Sanity check: for unit vectors with x = x', \alpha = 0 and u = 1, giving \kappa_1(1) = 1 and \kappa_0(1) = 1, so \Theta^*(x, x) = 2, matching Jacot-Gabriel-Hongler 2018 Prop. 2 and Arora et al. 2019 Thm. 3.1. This kernel is universal (dense in continuous functions) and positive definite on distinct points.
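
A few lines (my own check, not from the cited papers) verifying numerically that the \kappa-form and the angle-form above agree, and that \Theta^*(x, x) = 2 on unit vectors:

```python
import numpy as np

def theta_kappa(x, xp):
    """kappa-form of the two-layer ReLU NTK above."""
    u = np.clip(x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp)), -1.0, 1.0)
    k0 = (np.pi - np.arccos(u)) / np.pi
    k1 = (np.sqrt(1.0 - u**2) + (np.pi - np.arccos(u)) * u) / np.pi
    return np.linalg.norm(x) * np.linalg.norm(xp) * (k1 + u * k0)

def theta_angle(x, xp):
    """angle-form of the same kernel."""
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    alpha = np.arccos(np.clip(x @ xp / (nx * nxp), -1.0, 1.0))
    term1 = (nx * nxp / np.pi) * (np.sin(alpha) + (np.pi - alpha) * np.cos(alpha))
    term2 = (nx * nxp * np.cos(alpha) / np.pi) * (np.pi - alpha)
    return term1 + term2

x = np.array([1.0, 0.0, 0.0])
xp = np.array([0.0, 0.6, 0.8])
print(theta_kappa(x, x), theta_angle(x, x))     # both equal 2.0 on a unit vector
print(theta_kappa(x, xp), theta_angle(x, xp))   # the two forms agree off the diagonal
```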

Common Confusions

Watch Out

NTK does not describe practical neural networks

The NTK regime requires width to be extremely large, often unrealistically so. Practical networks (GPT, ResNets, etc.) are not in the lazy regime: they learn features. NTK is a theoretical tool for understanding one extreme of neural network behavior, not a description of how practical networks work.

Watch Out

Constant NTK means no feature learning

A common point of confusion: "NTK theory proves neural networks are kernel methods." More precisely, NTK theory proves that infinitely wide networks in the lazy regime are kernel methods. The interesting behavior of practical networks (feature learning, representation learning, transfer) happens precisely when the NTK changes during training.

Watch Out

NTK parameterization vs standard parameterization

The NTK result requires a specific parameterization (output scaling by 1/\sqrt{m}) that differs from both the standard parameterization and the mean-field parameterization. Different parameterizations lead to qualitatively different infinite-width limits: the mean-field parameterization retains feature learning even at infinite width. PyTorch default initialization is not NTK parameterization. Practitioners applying NTK results to standard-init models are making an unstated approximation.

Watch Out

Finite width matters quantitatively

For standard fully connected architectures, at m = 1024 the finite-width NTK typically differs from the infinite-width limit by 2 to 5 percent. At m = 128, the error exceeds 10 percent. Practical transformer MLP widths run 1024 to 8192, so the infinite-width approximation carries non-negligible error. "Infinite-width predictions" for real models are first-order approximations, not exact characterizations. (Arora et al. 2019, Table 1; Lee et al. 2019.)

Summary

  • The NTK is \Theta(x, x') = \langle \nabla_\theta f(x), \nabla_\theta f(x') \rangle
  • At infinite width, the NTK converges to a deterministic kernel and stays constant during training
  • Infinite-width networks trained with GD are equivalent to kernel regression with the NTK
  • This is the lazy regime: features are fixed, only the linear readout is learned
  • Real networks learn features (rich regime), which NTK theory does not capture
  • NTK was a major theoretical advance but is incomplete as a theory of deep learning

Exercises

ExerciseCore

Problem

Warm-up: For a linear model f(x; w) = w^\top x with w \in \mathbb{R}^d, compute the NTK \Theta(x, x'). Then, for a two-layer linear network f(x; W, v) = v^\top W x with W \in \mathbb{R}^{m \times d} and v \in \mathbb{R}^m, compute the NTK at a fixed (W, v). Show that depth changes the NTK even for linear networks, and explain what this implies about the role of depth in NTK theory.

ExerciseAdvanced

Problem

Suppose you have a two-layer network of width m and the NTK on the n training points has minimum eigenvalue \lambda_{\min}. The training loss at time t under gradient flow satisfies L(t) \leq L(0) e^{-2\lambda_{\min} t}. If \lambda_{\min} = 0.01 and L(0) = 1.0, how many time units until the loss reaches 10^{-6}?

ExerciseAdvanced

Problem

Explain why the NTK framework cannot account for the empirical observation that deeper networks learn increasingly abstract features at higher layers. What property of the NTK regime prevents this?


Frequently Asked Questions

What is the Neural Tangent Kernel?
The kernel \Theta(x, x') = \langle \nabla_\theta f(x; \theta_0), \nabla_\theta f(x'; \theta_0) \rangle at initialization, where f(x; \theta) is a neural network. Jacot et al. (2018) showed that in the infinite-width NTK parameterization, \Theta stays constant during training, and gradient descent on the network is equivalent to kernel ridge regression with \Theta^*.
How is NTK different from NNGP?
NNGP is the prior: at infinite width, f(x; \theta_0) at initialization is a Gaussian process with kernel K^*. NTK describes the gradient-descent-trained network, whose prediction is kernel regression with \Theta^*. NNGP gives the Bayesian view, NTK gives the GD-trained view, both at infinite width.
Why does the NTK regime kill feature learning?
In the lazy regime, weights barely change during training. The features (intermediate activations) stay essentially fixed at their initialization values; the network learns a linear function in this fixed feature space. Real deep learning gets much of its power from learned features, so NTK theory describes a degenerate limit, not the practical regime.
What is μP and why does it matter?
Maximal update parameterization (Yang et al. 2022): a different scaling of weights, learning rates, and biases that keeps feature learning ON at infinite width. Practical payoff: hyperparameters (learning rate, init scale) tuned at small scale transfer to large scale, eliminating expensive sweeps for billion-parameter models.
Can NTK theory predict real deep learning?
Partially. NTK eigenvalue-decay rates predict the order in which functions are learned (low-frequency first), and NTK regression bounds give generalization rates that match practice in the lazy regime. Outside the lazy regime, where real networks live, NTK theory breaks down. The gap is what motivates μP and feature-learning theory.

References

Canonical (NTK):

  • Jacot, Gabriel, Hongler, "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" (NeurIPS 2018, arXiv:1806.07572), Prop. 2, Cor. 2
  • Lee et al., "Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent" (NeurIPS 2019, arXiv:1902.06720), Thm. 2.2
  • Arora et al., "On Exact Computation with an Infinitely Wide Neural Net" (NeurIPS 2019, arXiv:1904.11955), Thm. 3.1, Cor. 6.2
  • Du et al., "Gradient Descent Finds Global Minima of Deep Neural Networks" (ICML 2019, arXiv:1811.03804)
  • Allen-Zhu, Li, Song, "A Convergence Theory for Deep Learning via Over-Parameterization" (ICML 2019, arXiv:1811.03962)
  • Cho, Saul, "Kernel Methods for Deep Learning" (NeurIPS 2009). Closed-form arccosine kernels \kappa_0, \kappa_1 used in the two-layer ReLU NTK derivation.
  • Lee et al., "Deep Neural Networks as Gaussian Processes" (ICLR 2018, arXiv:1711.00165). The NNGP kernel that the NTK recursion builds on.

Mean-field primary sources:

  • Mei, Montanari, Nguyen, "A Mean Field View of the Landscape of Two-Layer Neural Networks" (PNAS 2018, arXiv:1804.06561)
  • Rotskoff, Vanden-Eijnden, "Parameters as Interacting Particles: Long Time Convergence and Asymptotic Error Scaling of Neural Networks" (arXiv:1805.00915)
  • Chizat, Bach, "On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport" (NeurIPS 2018, arXiv:1805.09545)

Lazy vs. rich and μP:

  • Chizat, Oyallon, Bach, "On Lazy Training in Differentiable Programming" (NeurIPS 2019, arXiv:1812.07956)
  • Yang, "Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes" (NeurIPS 2019, arXiv:1910.12478). Tensor Programs II.
  • Yang, Hu, "Tensor Programs IV: Feature Learning in Infinite-Width Neural Networks" (ICML 2021, arXiv:2011.14522). Classification of parameterizations into NTK / mean-field / μP.
  • Yang et al., "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer" (NeurIPS 2021, arXiv:2203.03466). μP parameterization and hyperparameter transfer across widths.
  • Fort et al., "Deep Learning versus Kernel Learning: an Empirical Study of Loss Landscape Geometry and the Time Evolution of the Neural Tangent Kernel" (NeurIPS 2020, arXiv:2010.15110). Empirical NTK drift at finite width and its correlation with feature learning.

NTK spectrum and generalization:

  • Bordelon, Canatar, Pehlevan, "Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks" (ICML 2020, arXiv:2002.02561). Eigenvalue-decay rate as the determinant of NTK generalization.
  • Cui, Loureiro, Krzakala, "Generalization Error Rates in Kernel Regression: The Crossover from the Noiseless to Noisy Regime" (NeurIPS 2021, arXiv:2105.15004).
  • Caponnetto, De Vito, "Optimal Rates for the Regularized Least-Squares Algorithm" (Foundations of Computational Mathematics, 2007). The classical kernel-ridge rate that the NTK specialization inherits.
  • Hu, Lu, Bauer, "Regularization Matters: A Nonparametric Perspective on Overparametrized Neural Network" (Annals of Statistics, 2021). Explicit ridge equivalence proof.
  • Rahaman et al., "On the Spectral Bias of Neural Networks" (ICML 2019, arXiv:1806.08734).

NNGP and Bayesian deep learning:

  • Neal, "Priors for Infinite Networks" (Tech Report 1994; expanded in Bayesian Learning for Neural Networks, Springer 1996).
  • Lee et al., "Deep Neural Networks as Gaussian Processes" (ICLR 2018, arXiv:1711.00165).
  • Hron, Bahri, Sohl-Dickstein, Novak, "Infinite Attention: NNGP and NTK for Deep Attention Networks" (ICML 2020, arXiv:2006.10540). NNGP and NTK for transformer architectures.


Last reviewed: April 19, 2026
