
Beyond LLMs

Physics-Informed Neural Networks

Embedding PDE constraints directly into the neural network loss function via automatic differentiation. When physics-informed learning works, when it fails, and what alternatives exist.

Advanced · Tier 2 · Current · Supporting · ~55 min

Why This Matters

Interactive module

The same live PDE explorer appears here as optional context. Hide it if you want a pure reading path through PINNs.

[Interactive PDE explorer]

The explorer above is the setup a PINN tries to replicate. The spectral solver on the left is the exact closed-form ground truth for four PDE archetypes (heat, advection, Schrödinger, Poisson): one Fourier multiplier per archetype, no time-stepping. The "Neural-net function fit" panel at the bottom trains a small SIREN MLP to memorize that field via supervised MSE. A true PINN keeps the same kind of network but replaces the supervised target with the PDE residual itself; the net never sees ground-truth values and instead learns to satisfy the equation pointwise. The sections below describe exactly what that substitution buys you and where it breaks.

Most of science is governed by partial differential equations (PDEs). Classical numerical solvers (finite elements, finite differences, spectral methods) work well but struggle with high-dimensional problems, complex geometries, and inverse problems. PINNs propose a different approach: use a neural network as a function approximator and enforce the PDE through the loss function.

The idea is seductive. The reality is more nuanced. PINNs work well in certain regimes and fail badly in others. Understanding where the boundary lies is critical for anyone applying ML to scientific problems.

Mental Model

A standard neural network learns from data alone. A PINN adds a second source of supervision: the governing equations. You do not need as much data because the physics constrains the space of acceptable solutions.

Think of it as regularization by physical law. Instead of an L2 penalty that pushes weights toward zero, you have a PDE residual penalty that pushes the solution toward physical consistency.

Core Definitions

Definition

Physics-Informed Neural Network (PINN)

A neural network u_\theta(x, t) trained to approximate the solution of a PDE by minimizing a composite loss that includes both data fidelity and PDE residual terms. The PDE residual is computed via automatic differentiation of the network output with respect to its inputs.

Definition

PDE Residual

For a PDE of the form \mathcal{N}[u] = 0, where \mathcal{N} is a differential operator, the PDE residual at a collocation point x is:

\mathcal{R}[u_\theta](x) = \mathcal{N}[u_\theta](x)

This is computed by differentiating u_\theta with respect to its inputs using automatic differentiation. A perfect solution has zero residual everywhere.

The PINN Loss Function

The central construct of PINNs is the composite loss:

Proposition

PINN Loss Decomposition

Statement

The PINN loss is:

\mathcal{L}(\theta) = \lambda_{\text{data}} \mathcal{L}_{\text{data}} + \lambda_{\text{pde}} \mathcal{L}_{\text{pde}} + \lambda_{\text{bc}} \mathcal{L}_{\text{bc}} + \lambda_{\text{ic}} \mathcal{L}_{\text{ic}}

where:

  • \mathcal{L}_{\text{data}} = \frac{1}{N_d} \sum_{i=1}^{N_d} |u_\theta(x_i) - u_i^{\text{obs}}|^2 (data fidelity)
  • \mathcal{L}_{\text{pde}} = \frac{1}{N_r} \sum_{j=1}^{N_r} |\mathcal{N}[u_\theta](x_j)|^2 (PDE residual at collocation points)
  • \mathcal{L}_{\text{bc}} and \mathcal{L}_{\text{ic}} enforce boundary and initial conditions

The weights \lambda balance the different loss terms.

Intuition

The network is pulled in multiple directions: fit the observed data, satisfy the PDE everywhere in the domain, and respect boundary/initial conditions. The physics term acts as an infinite-dimensional regularizer, constraining the solution to the manifold of physically plausible functions even where no data exists.

Proof Sketch

There is no convergence "proof" in the classical sense for general PINNs. Theoretical results (Shin, Darbon, Karniadakis 2020) show that as the number of collocation points and network capacity grow, minimizers of the PINN loss converge to the PDE solution under regularity assumptions. The rate of convergence is generally worse than that of classical solvers for smooth problems.

Why It Matters

This decomposition is the entire PINN methodology. The key engineering decisions are: (1) the architecture of u_\theta, (2) the placement of collocation points, (3) the relative weights \lambda, and (4) the optimizer and training schedule. Getting these wrong leads to solutions that satisfy neither the data nor the physics.

Failure Mode

The multi-objective nature of the loss creates optimization difficulties. The PDE residual and data terms can have vastly different scales and gradients, leading to one dominating the other. Adaptive weighting schemes (e.g., learning rate annealing, neural tangent kernel-based weighting) partially address this but do not fully solve it.
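For concreteness, here is a minimal sketch of how the composite loss above is typically assembled in code. It assumes a `pde_residual(u_net, x_coll)` helper that evaluates \mathcal{N}[u_\theta] at the collocation points (building such a helper with autodiff is the subject of the next section); the tensor names and `batch` layout are illustrative placeholders, not any particular library's API.

```python
import torch
import torch.nn.functional as F

def pinn_loss(u_net, pde_residual, batch, lam):
    """Composite PINN loss: data fidelity + PDE residual + boundary + initial terms.

    `batch` bundles observations, interior collocation points, and boundary /
    initial points with their targets; `lam` holds the weights lambda_data,
    lambda_pde, lambda_bc, lambda_ic. All names are placeholders.
    """
    l_data = F.mse_loss(u_net(batch["x_obs"]), batch["u_obs"])       # fit observed data
    l_pde = (pde_residual(u_net, batch["x_coll"]) ** 2).mean()       # residual at collocation points
    l_bc = F.mse_loss(u_net(batch["x_bc"]), batch["u_bc"])           # boundary conditions
    l_ic = F.mse_loss(u_net(batch["x_ic"]), batch["u_ic"])           # initial condition
    return (lam["data"] * l_data + lam["pde"] * l_pde
            + lam["bc"] * l_bc + lam["ic"] * l_ic)
```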

How PINNs Use Automatic Differentiation

The key enabling technology is automatic differentiation (autodiff). To compute \mathcal{N}[u_\theta], you differentiate the neural network output with respect to its inputs (not its parameters). Modern frameworks (PyTorch, JAX) compute these derivatives exactly and efficiently.

For example, if the PDE is the heat equation \partial_t u = \alpha \nabla^2 u, the residual at a point (x, t) is:

\mathcal{R} = \frac{\partial u_\theta}{\partial t} - \alpha \frac{\partial^2 u_\theta}{\partial x^2}

Both derivatives are computed by autodiff through the network graph.
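A minimal PyTorch sketch of this residual computation, assuming `u_net` is any network mapping stacked (x, t) coordinates to scalar values (the name and the network itself are placeholders):

```python
import torch

def heat_residual(u_net, x, t, alpha=0.1):
    """Residual of the 1D heat equation, u_t - alpha * u_xx, evaluated at points (x, t)."""
    x = x.clone().requires_grad_(True)
    t = t.clone().requires_grad_(True)
    u = u_net(torch.stack([x, t], dim=-1))                      # u_theta(x, t), shape (N, 1)
    ones = torch.ones_like(u)
    u_t = torch.autograd.grad(u, t, grad_outputs=ones, create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, grad_outputs=ones, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, grad_outputs=torch.ones_like(u_x),
                               create_graph=True)[0]
    return u_t - alpha * u_xx                                   # zero everywhere for an exact solution
```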

When PINNs Work

Example

Smooth Solutions to Known PDEs

PINNs perform well on problems with smooth solutions where the governing PDE is known exactly. Classic demonstrations include the Burgers equation (before shock formation), the Schrödinger equation, and steady-state heat conduction. In these settings, the physics loss provides strong regularization, and the network can represent the solution accurately with moderate capacity.

Example

Inverse Problems

PINNs are particularly attractive for inverse problems: given sparse noisy observations, infer unknown PDE parameters. For example, estimating the diffusion coefficient from temperature measurements. The physics constraint regularizes the inverse problem and shrinks the feasible solution set; it does not, on its own, make the problem well-posed. Identifiability and stability still depend on the observation operator, the boundary and initial conditions, the parameterization of the unknowns, the noise model, and whether distinct parameters can produce indistinguishable observations. Many inverse PDE problems remain genuinely ill-posed even with the correct governing equation, and a successful PINN run on such a problem is fitting one solution from a non-trivial null space.
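A sketch of how the unknown coefficient is commonly folded into training: \alpha becomes a trainable parameter optimized jointly with the network weights. This reuses the `heat_residual` helper sketched earlier; `u_net` and the remaining names are placeholders, and the equal weighting of the two terms is a deliberate simplification (see the loss-balancing discussion below).

```python
import torch

# The unknown diffusion coefficient is a trainable parameter; optimizing log(alpha)
# keeps alpha positive. All names are illustrative, not a specific library API.
log_alpha = torch.nn.Parameter(torch.tensor(0.0))

def inverse_loss(u_net, x_obs, t_obs, u_obs, x_coll, t_coll):
    alpha = log_alpha.exp()
    pred = u_net(torch.stack([x_obs, t_obs], dim=-1)).squeeze(-1)
    data = torch.nn.functional.mse_loss(pred, u_obs)                  # fit sparse, noisy observations
    phys = (heat_residual(u_net, x_coll, t_coll, alpha) ** 2).mean()  # physics term constrains alpha
    return data + phys

# optimizer = torch.optim.Adam(list(u_net.parameters()) + [log_alpha], lr=1e-3)
```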

When PINNs Fail

The honest-survey failure modes (discontinuities, forward-solve cost, stiffness) are the surface story. The deeper, structural failures have names attached to them now and are where the interesting work is happening.

[Interactive PINN failure zoo]

The three tabs above are the live, in-browser versions of the three failure modes described below. The code is tiny, runs as you watch, and cites its source paper on each tab.

Watch Out

Spectral bias: plain MLPs cannot represent high frequencies

Rahaman et al. (2019) and Tancik et al. (arXiv:2006.10739) showed that an MLP with standard activations is biased toward low-frequency components and provably slow to fit oscillatory targets. A vanilla PINN on a solution containing \sin(100x) will converge to the low-frequency bulk and never resolve the detail. Two practical fixes: Fourier feature encodings (Tancik et al. 2020; Mildenhall et al. 2020 for NeRF) preprocess inputs through \gamma(x) = [\cos(2\pi B x), \sin(2\pi B x)], and sinusoidal representations like SIREN (Sitzmann et al. 2020, arXiv:2006.09661) replace activations with \sin(\omega_0 (Wx + b)). Either transforms the NTK into one that is no longer frequency-biased. The "Neural-net function fit" panel in the explorer above uses SIREN for exactly this reason.
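For concreteness, a minimal Fourier-feature encoder along the lines of Tancik et al. (2020); the feature count and bandwidth \sigma below are illustrative choices, not values from the paper:

```python
import torch

class FourierFeatures(torch.nn.Module):
    """Random Fourier feature encoding gamma(x) = [cos(2 pi B x), sin(2 pi B x)].
    B is a fixed Gaussian matrix; sigma controls which frequencies the downstream
    MLP can fit easily."""
    def __init__(self, in_dim, n_features=64, sigma=10.0):
        super().__init__()
        self.register_buffer("B", sigma * torch.randn(in_dim, n_features))

    def forward(self, x):                           # x: (N, in_dim)
        proj = 2.0 * torch.pi * (x @ self.B)        # (N, n_features)
        return torch.cat([torch.cos(proj), torch.sin(proj)], dim=-1)
```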

Watch Out

NTK imbalance: the data and PDE losses train at wildly different rates

Wang, Teng, Perdikaris (arXiv:2001.04536, 2021) gave the foundational analysis: the neural tangent kernels of the data-fitting loss and the PDE-residual loss have eigenvalue spectra separated by orders of magnitude. In effect, one loss dominates every gradient step and the other stagnates, regardless of the manually chosen \lambda weights. The paper proposes an NTK-balanced weighting scheme that measures the kernels online and rescales. Calling it "weighting" understates the result: it is the first mechanistic explanation of why PINNs are so brittle to loss coefficients.

[Interactive NTK spectrum lab]

The demo above recomputes both kernels on a fixed 10-point sample every 30 training steps and plots their sorted eigenvalues side by side. Watch how the PDE kernel's spectrum widens across decades as training progresses while the data kernel stays bounded. Toggle NTK-balanced weighting and the gap closes.
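A back-of-the-envelope version of what the demo computes, workable only for small point samples and small networks because it materializes per-point parameter gradients explicitly; the function name is illustrative, and this is a sketch of the idea rather than the paper's implementation:

```python
import torch

def empirical_ntk_eigvals(per_point_outputs, params):
    """Eigenvalues of the empirical NTK K_ij = <d out_i / d theta, d out_j / d theta>.

    `per_point_outputs` is a 1-D tensor of per-point scalars that still carries the
    autograd graph back to `params`: network values u_theta(x_i) for the data kernel,
    PDE residuals (built with create_graph=True) for the physics kernel.
    """
    rows = []
    for out in per_point_outputs:
        grads = torch.autograd.grad(out, params, retain_graph=True, allow_unused=True)
        rows.append(torch.cat([torch.zeros_like(p).flatten() if g is None else g.flatten()
                               for p, g in zip(params, grads)]))
    J = torch.stack(rows)                  # (N, n_params) per-point parameter gradients
    K = J @ J.T                            # empirical NTK Gram matrix
    return torch.linalg.eigvalsh(K)        # eigenvalues in ascending order
```

The NTK-balanced weights in the paper are, roughly, set from the traces of these per-loss Gram matrices so that neither kernel's spectrum dominates the gradient, which is the "ratio of kernel traces" rule referenced later in this page.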

Watch Out

Curriculum and causality: naive simultaneous training fails on time-dependent PDEs

Krishnapriyan et al. (arXiv:2109.01050, NeurIPS 2021) built an explicit failure-mode taxonomy for PINNs and showed that simultaneous training on all collocation points (the standard setup) is wrong for evolution equations: information from t = 0 must propagate causally through time, but vanilla training mixes early- and late-time residuals as if they were independent. A sequence-to-sequence curriculum (train to some cutoff time, then extend) recovers much of the accuracy gap. Wang, Sankaran, Perdikaris (arXiv:2203.07404) formalized this as causal time-weighting: weight each collocation point by the product of earlier residuals so the network must solve t_0 before it is asked to solve t_1. The improvement is often orders of magnitude, which is why this result is structurally surprising.
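A small sketch of the causal weighting rule in the spirit of that paper, assuming the collocation points have been grouped into time slices ordered by increasing t; the function name and the per-slice loss bookkeeping are illustrative:

```python
import torch

def causal_weights(slice_losses, eps=1.0):
    """Causal weights w_i = exp(-eps * sum_{k < i} L_k), where L_k is the mean PDE
    residual loss on time slice t_k and slices are ordered by increasing time.
    A slice only receives meaningful gradient once all earlier slices have small residuals."""
    L = torch.stack(list(slice_losses))               # (n_slices,), ordered in time
    earlier = torch.cumsum(L, dim=0) - L              # sum over strictly earlier slices
    return torch.exp(-eps * earlier).detach()         # weights are treated as constants

# weighted_pde_loss = (causal_weights(slice_losses) * torch.stack(slice_losses)).mean()
```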

Watch Out

Discontinuities and sharp gradients

Shock waves, contact discontinuities, and thin boundary layers cause PINNs to fail or converge extremely slowly. Neural networks with smooth activations represent discontinuities poorly; the PDE residual near a shock is large and noisy, destabilizing training. Classical shock-capturing schemes (WENO, Godunov) handle these cases far better. Variational PINNs (Kharazmi et al., arXiv:2003.00596) sidestep some of this by minimizing a weak form of the residual against test functions, reducing the required derivative order.

Watch Out

PINNs can be slower than classical solvers for forward problems

For a well-posed forward PDE with known coefficients, a finite element solver with adaptive meshing is typically faster and more accurate than a PINN. PINNs pay the cost of neural network training (thousands of gradient descent steps) for a single PDE instance. Classical solvers amortize their cost better. PINNs become competitive when you need to solve many related PDEs (amortized inference), when the problem is inverse (unknown coefficients to infer from sparse data), or when classical meshing is impractical.

[Interactive demo: classical solver vs PINN on an inverse problem]

The demo above shows the honest crossover. At 3 observations the classical FD grid search is under-constrained and its α estimate varies wildly with sample luck; the PINN's physics residual stabilizes the estimate. Drag n upward — above roughly 20 observations the classical estimate sharpens and matches or beats the PINN, because the data now constrains u directly.

Watch Out

Provable impossibility: some PDE classes cannot be PINN-approximated

de Ryck and Mishra (2024) gave explicit PDE classes where no PINN training trajectory can converge, no matter the architecture or weighting. This is a healthy counterweight to the "universal approximation" optimism of the early literature: PINNs are a method, not a solver, and the method has provable limits.

Watch Out

Stiff systems

PDEs with widely separated time scales create loss landscapes with pathological curvature. Fast dynamics produce large PDE residuals early in training; slow dynamics require long training to resolve. Multiscale architectures, hard-constraint formulations (Lu, Pestourie et al., arXiv:2102.04626) that enforce boundary conditions by architectural construction rather than loss penalty, and gradient-enhanced residual losses (gPINN, Yu et al., arXiv:2111.02801) mitigate but do not eliminate the problem.
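As an illustration of the hard-constraint idea mentioned above (a generic output-transform construction, not the specific parameterization of Lu et al.): compose the raw network with factors that vanish on the boundary so the Dirichlet and initial conditions hold identically, assuming the initial profile f is compatible with the boundary values.

```python
import torch

def hard_constrained_u(raw_net, x, t, f):
    """Output transform enforcing u(0, t) = u(1, t) = 0 and u(x, 0) = f(x) by
    construction (assumes f(0) = f(1) = 0 so the data are compatible).
    With this transform, the loss needs only the PDE residual term."""
    nn_out = raw_net(torch.stack([x, t], dim=-1)).squeeze(-1)
    return f(x) + t * x * (1.0 - x) * nn_out
```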

Extensions

Biologically-Informed Neural Networks (BINNs)

BINNs apply the PINN framework to biological systems where the governing equations are partially known. Instead of enforcing a fully specified PDE, BINNs learn unknown terms in the equations from data while enforcing known structural constraints (conservation laws, positivity, symmetries).

Neural Operators: A Different Paradigm

Instead of learning a single PDE solution, neural operators learn the solution operator: a mapping from PDE parameters, initial conditions, or forcing terms to solutions.

Fourier Neural Operator (FNO): learns in the frequency domain, applying learned filters to the Fourier coefficients of the input function. Resolution invariant and fast at inference (one forward pass per new PDE instance).

DeepONet: uses a branch-trunk architecture. The branch network encodes the input function, the trunk network encodes the evaluation point, and their dot product gives the solution value. Theoretically grounded in the universal approximation theorem for operators (Chen & Chen, 1995).

Neural operators amortize the cost of training over many PDE instances. Once trained, solving a new PDE instance requires only a forward pass, not retraining. This makes them far more practical than PINNs for applications requiring repeated solves (design optimization, uncertainty quantification).
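A schematic of the branch-trunk factorization described above, with illustrative layer sizes rather than anything taken from the original DeepONet paper:

```python
import torch

class TinyDeepONet(torch.nn.Module):
    """Branch-trunk factorization: the branch encodes the input function sampled at
    m sensor locations, the trunk encodes the query coordinate, and their dot
    product is the predicted solution value."""
    def __init__(self, m_sensors, coord_dim, width=64):
        super().__init__()
        def mlp(d_in):
            return torch.nn.Sequential(
                torch.nn.Linear(d_in, width), torch.nn.Tanh(),
                torch.nn.Linear(width, width))
        self.branch, self.trunk = mlp(m_sensors), mlp(coord_dim)

    def forward(self, f_samples, y):
        # f_samples: (batch, m_sensors) input-function values; y: (batch, coord_dim) query points
        return (self.branch(f_samples) * self.trunk(y)).sum(dim=-1)
```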

Bayesian PINNs

Yang, Meng, Karniadakis (arXiv:2003.06097, 2021) replace the point-estimate network with a posterior over weights, giving calibrated uncertainty quantification on both the solution and any inferred coefficients. The resulting B-PINN is more honest about its extrapolations than a vanilla PINN and connects directly to standard Bayesian deep learning.

Structural Connection: PINNs and RLHF

PINN training and RLHF are the same abstract problem with a cosmetic change of clothes, and noticing this is the single most useful way to transfer intuition between the two.

In RLHF, the reward model r_\phi is a learned proxy for human judgment. Optimizing a policy against it aggressively (small KL coefficient \beta) produces reward hacking: the policy finds degenerate behaviors that score highly under r_\phi without actually being good. The fix is a KL regularizer against the SFT policy \pi_{\text{sft}}:

\max_{\pi} \; \mathbb{E}_{x, y \sim \pi}\left[r_\phi(x, y)\right] - \beta \, D_{\text{KL}}\!\left(\pi \,\|\, \pi_{\text{sft}}\right).

In a PINN, the PDE residual \mathcal{R}(u_\theta) is a computed proxy for PDE satisfaction. Optimizing it aggressively relative to the data loss produces the mirror pathology: the network finds functions with small residual that ignore the observational data (or, vice versa, data-dominated fits that violate the physics away from measurement points). The effective multi-objective is:

\min_{\theta} \; \mathcal{L}_{\text{data}}(u_\theta) + \lambda \, \mathcal{L}_{\text{PDE}}(u_\theta),

and Wang, Teng, Perdikaris's NTK analysis shows that the "correct" \lambda is not a hyperparameter to tune; it is determined by the ratio of kernel traces, which is the PINN analog of selecting \beta from a target KL budget in RLHF.

The parallels line up term by term:

Structural role | RLHF | PINN
Ground-truth objective | Human judgment (unobserved) | PDE satisfaction (infeasible to compute exactly)
Learned proxy | Reward model r_\phi | PDE residual \mathcal{R}(u_\theta)
What "hacking" looks like | Policy scores high, humans disagree | Residual small, data fit poor (or vice versa)
Implicit trust region | KL against \pi_{\text{sft}} | Data-loss weight fixes u_\theta near observations
Ratio-balancing rule | Target KL budget | NTK eigenvalue ratio (Wang-Teng-Perdikaris 2021)
Non-stationary proxy | Reward drifts across distribution | PDE residual landscape stiffens during training

Both are proxy optimization problems where the proxy is differentiable but the true objective is not; both need a regularizer to prevent the optimizer from exploiting the gap; and in both, the optimal regularizer strength is structural rather than tuned.

This is not merely an analogy. The algorithmic fixes transfer: causal time-weighting (Wang-Sankaran-Perdikaris 2022) is a curriculum on collocation points, which is the PINN analog of process-reward shaping in RLHF. Hard-constraint PINNs (Lu-Pestourie 2021) enforce boundary conditions architecturally instead of through a loss term — the PINN version of constrained decoding. The open question in both fields is whether proxy-hacking is fundamental or an artifact of current architectures.

Open Questions

Two questions where the existing TheoremPath machinery meets the PINN literature at a productive boundary. Framed deliberately as open so the reader can enter the research trajectory rather than just consume a survey.

Watch Out

Finite-sample generalization bounds for PINNs

Current PINN theory (Shin-Darbon-Karniadakis 2020) gives asymptotic consistency: as the number of collocation points and training iterations tends to infinity, the PINN converges to a PDE solution. This is the wrong kind of result for practice: real PINNs are trained on 10^3 to 10^5 collocation points, not infinitely many.

A natural program: write the PINN loss as \mathcal{L}(f_\theta) = \mathcal{L}_{\text{data}}(f_\theta) + \lambda \, \mathcal{L}_{\text{PDE}}(f_\theta), where \mathcal{L}_{\text{PDE}} depends on f_\theta through its autodiff derivatives. Under Lipschitz-derivative regularity, apply the vector contraction inequality (Maurer 2016, \sqrt{2}L form) to bound the empirical process terms. For a Sobolev or Barron-norm hypothesis class H, a Rademacher complexity bound R_n(H) \leq B/\sqrt{n} would give the PINN analog of a VC-style generalization bound.

The hard technical question is whether \partial_x f_\theta has a Rademacher complexity controllable by R_n(H), i.e., whether the derivative operator acts on the hypothesis class benignly under the parameter norm. This is open and genuinely research-grade.

Watch Out

PINN scaling laws

Kaplan-style scaling laws have reshaped language modeling: given a compute budget, there is a compute-optimal trade-off between parameter count, dataset size, and training steps. For PINNs, no equivalent study exists. Ad-hoc empirical surveys report that "bigger networks help until they do not," but there is no Chinchilla-style joint fit across problem classes.

A specific formulation: for a PDE whose solution lies in a Barron-norm class of complexity \kappa, is the PINN training loss \mathcal{L} \sim N_{\text{params}}^{-\alpha(\kappa)} \, N_{\text{coll}}^{-\beta(\kappa)} \, N_{\text{steps}}^{-\gamma(\kappa)} for some exponents determined by the PDE's conditioning rather than by generic architecture scaling? If yes, compute-optimal allocation should trade parameters against collocation-point density the way language models trade parameters against tokens. If no, PINN scaling is limited by PDE conditioning and the comparison to LMs is misleading.

This is a natural target for a workshop note or a short empirical paper. The TheoremPath curriculum has the pieces (scaling laws, Barron norms, Rademacher bounds) already assembled elsewhere; the PINN literature does not.

Connections to Mechanistic Interpretability

A possibly-naive observation, flagged here as a working hypothesis rather than a result. Mechanistic interpretability looks inside a trained network to recover human-readable circuits. The NTK spectrum view embedded above is doing something adjacent: looking inside an actively training network to see which directions in parameter space are responsible for which components of the loss.

Three specific ways the two fields touch, none of them fully worked out:

  • The eigenvectors of the data NTK and the PDE NTK are literal directions in parameter space. When the gap between their spectra closes under NTK-balanced weighting, it means the same parameter directions start absorbing gradient signal from both losses. That is a mechanistic statement about the training dynamics, not just a convergence-rate statement, and it suggests a way to ask "which layer learned which part of the equation" by projecting gradients onto layer-restricted Gram matrices.
  • SIREN and Fourier-feature networks (Tancik 2020, Sitzmann 2020) fix spectral bias by construction. The mechanistic question — which parts of the network represent high-frequency components and which represent low-frequency ones — is open. The superposition work on sparse features in LLMs has a natural analog for PINNs: are high-frequency Fourier modes "in superposition" with low-frequency ones in the hidden layer, and does spectral bias correspond to high-frequency directions being the ones most contested by superposition?
  • The PINN–RLHF isomorphism in the section above implies that proxy hacking in RLHF and loss imbalance in PINNs share a mechanistic signature. A reward model has an NTK. Measuring its eigenvalue spread vs the KL regularizer's NTK spread during PPO updates is directly analogous to what the demo above does for PINN training. I have not seen this connection in either literature. That is either because it is wrong or because no one has written it yet; I think it is the second and the two-page version would be worth writing.

None of the above is a claim that any of this is solved. It is a list of the questions I would actually work on if I had the time, and it is the honest research trajectory TheoremPath is meant to support rather than decorate. If any of this is already established in a paper I have not read, I would like to hear about it.

Summary

  • PINN loss = data fidelity + PDE residual + boundary/initial conditions
  • Autodiff computes exact spatial/temporal derivatives of the network
  • PINNs work best for smooth solutions, inverse problems, and data-sparse regimes
  • PINNs fail on discontinuities, stiff systems, and problems where classical solvers already excel
  • Loss balancing between physics and data terms is a critical engineering challenge
  • Neural operators (FNO, DeepONet) learn solution operators and amortize cost across PDE instances

Exercises

ExerciseCore

Problem

Write the PINN loss for the 1D heat equation \partial_t u = \alpha \partial_{xx} u on x \in [0, 1], t \in [0, T], with initial condition u(x, 0) = f(x) and boundary conditions u(0, t) = u(1, t) = 0. Identify all collocation point sets needed.

ExerciseAdvanced

Problem

Why do PINNs struggle with the Burgers equation \partial_t u + u \partial_x u = \nu \partial_{xx} u at small viscosity \nu? What happens to the PDE residual near a shock?

References

Foundational:

  • Raissi, Perdikaris, Karniadakis, "Physics-informed neural networks," Journal of Computational Physics 378 (2019), pp. 686–707
  • Karniadakis, Kevrekidis, Lu, Perdikaris, Wang, Yang, "Physics-informed machine learning," Nature Reviews Physics 3 (2021)

Failure-mode analysis (read these before applying PINNs):

  • Wang, Teng, Perdikaris, "When and why PINNs fail to train: a neural tangent kernel perspective," arXiv:2001.04536 (2021). The foundational NTK-imbalance analysis.
  • Krishnapriyan, Gholami, Zhe, Kirby, Mahoney, "Characterizing possible failure modes in physics-informed neural networks," NeurIPS 2021, arXiv:2109.01050
  • Wang, Sankaran, Perdikaris, "Respecting causality is all you need for training PINNs," arXiv:2203.07404 (2022)
  • de Ryck, Mishra, "Error analysis for PINNs and related models approximating PDEs," 2024 survey

Spectral bias / implicit neural representations:

  • Rahaman et al., "On the spectral bias of neural networks," ICML 2019
  • Tancik et al., "Fourier features let networks learn high frequency functions in low dimensional domains," NeurIPS 2020, arXiv:2006.10739
  • Mildenhall, Srinivasan, Tancik, Barron, Ramamoorthi, Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," ECCV 2020
  • Sitzmann, Martel, Bergman, Lindell, Wetzstein, "Implicit neural representations with periodic activation functions" (SIREN), NeurIPS 2020, arXiv:2006.09661

Theory:

  • Shin, Darbon, Karniadakis, "On the convergence of PINNs" (2020). Asymptotic consistency; compare with the finite-sample program sketched in Open Questions above.
  • Maurer, "A vector-contraction inequality for Rademacher complexities," ALT 2016

Architectural extensions:

  • Lu, Pestourie, Yao, Wang, Verdugo, Johnson, "Physics-informed neural networks with hard constraints for inverse design," arXiv:2102.04626 (2021). Hard-constraint PINNs.
  • Kharazmi, Zhang, Karniadakis, "Variational physics-informed neural networks (VPINN)," arXiv:2003.00596 (2020)
  • Yu, Lu, Meng, Karniadakis, "Gradient-enhanced physics-informed neural networks (gPINN)," arXiv:2111.02801 (2021)
  • Yang, Meng, Karniadakis, "B-PINNs: Bayesian physics-informed neural networks for forward and inverse PDE problems with noisy data," arXiv:2003.06097 (2021)
  • Jagtap, Karniadakis, "Extended PINNs (XPINNs): a generalized space-time domain decomposition based deep learning framework" (2020)

Neural operators (alternative paradigm):

  • Li et al., "Fourier Neural Operator for parametric PDEs," ICLR 2021, arXiv:2010.08895
  • Lu, Jin, Pang, Zhang, Karniadakis, "Learning nonlinear operators via DeepONet," Nature Machine Intelligence (2021)

Next Topics

Explore neural operators and scientific ML applications as this rapidly evolving field develops.

Last reviewed: April 26, 2026
