Scientific ML
Neural ODEs and Continuous-Depth Networks
Treating neural network depth as a continuous variable: the ODE formulation of residual networks, the adjoint method for memory-efficient backpropagation, the duality with PINNs, the SDE bridge to diffusion models, and the open research frontier.
Prerequisites
Why This Matters
A ResNet block has the form $h_{t+1} = h_t + f(h_t, \theta_t)$. This matches an Euler update only when the residual is scaled by a step size $\Delta t$ that shrinks with network depth (Haber and Ruthotto 2017; Chen et al. 2018). Standard unscaled ResNets are not formal Euler discretizations of an ODE; the continuous limit requires this rescaling. With that rescaling, the limit gives a continuous dynamical system:

$$\frac{dh(t)}{dt} = f(h(t), t, \theta), \qquad h(0) = x.$$
This reframing is not just mathematical elegance. It gives you: constant memory backpropagation (via the adjoint method), adaptive computation depth (the ODE solver decides how many steps to take), and a bridge between deep learning and dynamical systems theory. The tradeoffs are real: training is slower, and the expressiveness is constrained by ODE theory (no crossing trajectories). Neural ODEs connect to DEQ models (which solve for the ODE's fixed point directly) and continuous thought machines (which use ODE dynamics for adaptive-depth reasoning).
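The rescaled update can be checked numerically. A minimal sketch, assuming a toy vector field (a one-layer tanh map with a made-up weight matrix): stacking residual blocks with step size $\Delta t = T/N$ is exactly Euler integration of $dh/dt = f$, so refining the step barely changes the output.

```python
import numpy as np

def f(h, t, W):
    # Toy vector field: one tanh layer (a hypothetical example, not from the text)
    return np.tanh(W @ h)

def resnet_forward(h, W, n_layers, T=1.0):
    # Each "layer" performs h <- h + dt * f(h, t): one Euler step of dh/dt = f
    dt = T / n_layers
    for k in range(n_layers):
        h = h + dt * f(h, k * dt, W)
    return h

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)) / 8    # small weights keep the dynamics mild
h0 = rng.normal(size=4)

coarse = resnet_forward(h0, W, n_layers=10)    # a 10-layer rescaled "ResNet"
fine   = resnet_forward(h0, W, n_layers=1000)  # near the continuous limit
print(np.linalg.norm(coarse - fine))           # small: both approximate h(1)
```

Without the $\Delta t$ rescaling, adding more layers would keep integrating further in time rather than refining the same trajectory.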
The ResNet-ODE Connection
ResNet as Discretized ODE
Statement
A residual network with update $h_{k+1} = h_k + \Delta t\, f(h_k, t_k, \theta)$ for $k = 0, \dots, N-1$ is the Euler discretization of the initial value problem:

$$\frac{dh(t)}{dt} = f(h(t), t, \theta), \qquad h(0) = x,$$

with step size $\Delta t = T/N$. In the limit $\Delta t \to 0$, the discrete trajectory converges to the continuous solution (under Lipschitz conditions on $f$).
The output of the network is $h(T)$.
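The convergence claim is easy to verify on an ODE with a known solution. A sketch using the assumed toy problem $dh/dt = -h$, $h(0) = 1$, whose exact solution is $h(T) = e^{-T}$: Euler is a first-order method, so halving the step size halves the error.

```python
import math

def euler(f, h0, T, n):
    # Explicit Euler: h_{k+1} = h_k + dt * f(h_k, t_k)
    dt, h = T / n, h0
    for k in range(n):
        h = h + dt * f(h, k * dt)
    return h

f = lambda h, t: -h                 # dh/dt = -h, exact solution h(T) = h0*exp(-T)
h0, T = 1.0, 1.0
exact = h0 * math.exp(-T)

err_coarse = abs(euler(f, h0, T, 100) - exact)
err_fine   = abs(euler(f, h0, T, 200) - exact)
print(err_coarse / err_fine)  # close to 2: halving dt halves the error (order 1)
```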
Intuition
Each ResNet layer adds a small correction to the hidden state. In the continuous limit, these corrections become a vector field that flows the input through a smooth trajectory. The network's "depth" becomes a continuous time variable. Deeper networks correspond to longer integration times, and the network learns the vector field that transforms inputs into useful representations.
Why It Matters
This perspective explains why ResNets work: the skip connection is not just a gradient-flow trick. It makes each layer an incremental transformation, and in the continuous limit these incremental transformations compose into a smooth flow that is invertible (under standard well-posedness conditions on $f$, by uniqueness of ODE solutions). The discrete finite-step ResNet map is not invertible in general; invertibility of a single block requires extra conditions like a Lipschitz constraint (the basis of invertible ResNets, Behrmann et al. 2019). The takeaway is that small-step composition is the regime where forward dynamics behave benignly, which is why deep ResNets train well.
It also enables replacing the fixed $N$-layer architecture with an adaptive ODE solver that chooses its own step size based on local numerical error and stiffness of the dynamics, not on semantic input difficulty. Inputs whose trajectories pass through stiff regions of $f$ get more steps; smooth trajectories get fewer. This is data-dependent compute, but the criterion is numerical (truncation error tolerance), not a learned "is this input hard" signal.
Failure Mode
The continuous limit requires the dynamics to be Lipschitz continuous. If $f$ is not Lipschitz (e.g., if it has sharp discontinuities or unbounded gradients), the ODE may not have a unique solution, and the convergence of Euler's method is not guaranteed. ReLU is globally 1-Lipschitz despite being non-differentiable at zero (Lipschitz continuity does not require differentiability), so standard ReLU networks with bounded weights satisfy this condition. The places where Lipschitz fails in practice are unbounded activations like exp without clipping, or architectures whose Jacobian norm grows without bound during training.
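A sketch of what goes wrong, contrasting a globally Lipschitz field with the non-Lipschitz $f(h) = h^2$ (both toy choices for illustration). For $dh/dt = h^2$ with $h(0) = 1$, the true solution $h(t) = 1/(1-t)$ blows up at $t = 1$, and the Euler trajectory diverges along with it.

```python
import math

def euler(f, h0, T, n):
    dt, h = T / n, h0
    for _ in range(n):
        h = h + dt * f(h)
    return h

# Globally Lipschitz field: tanh is 1-Lipschitz, so the flow is well behaved
# for all time and Euler tracks it.
safe = euler(math.tanh, 1.0, T=5.0, n=1000)

# Non-Lipschitz field f(h) = h^2: the true solution 1/(1 - t) blows up at
# t = 1, and the Euler iterates escape to floating-point infinity.
blown = euler(lambda h: h * h, 1.0, T=5.0, n=1000)
print(safe, blown)
```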
The Adjoint Method
Adjoint Sensitivity Method
Statement
To compute $\partial L / \partial \theta$ for a neural ODE, define the adjoint state $a(t) = \partial L / \partial h(t)$. The adjoint satisfies a backward ODE:

$$\frac{da(t)}{dt} = -a(t)^\top \frac{\partial f(h(t), t, \theta)}{\partial h},$$

integrated backwards from $t = T$ to $t = 0$ with initial condition $a(T) = \partial L / \partial h(T)$. This follows the convention of Chen et al. (2018). The sign depends on the time-direction convention: reversing time or redefining $a(t)$ flips the sign, so check the convention before comparing references.
The parameter gradient is:

$$\frac{\partial L}{\partial \theta} = \int_0^T a(t)^\top \frac{\partial f(h(t), t, \theta)}{\partial \theta}\, dt.$$
Memory cost: $O(1)$ in depth (constant, regardless of the number of ODE solver steps), compared to $O(N)$ for standard backpropagation through $N$ layers. The memory cost is asymptotic; in practice adjoint solvers may require more function evaluations than discrete backprop, and reconstructed gradients can drift for stiff dynamics (Gholaminejad et al. 2019).
Intuition
Standard backprop stores all intermediate activations to compute gradients, costing $O(N)$ memory. The adjoint method avoids this by solving the adjoint ODE backwards in time, recomputing $h(t)$ on the fly by integrating the forward ODE backwards. This trades memory for compute: you solve two ODEs (forward and backward) instead of storing all activations.
This is the continuous analog of activation checkpointing, taken to its logical extreme.
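The adjoint recipe can be checked end to end on a problem with a closed-form answer. A minimal sketch on the assumed toy ODE $dh/dt = \theta h$ with loss $L = h(T)$, where the analytic gradient is $\partial L/\partial\theta = T h_0 e^{\theta T}$. For clarity the forward trajectory is stored rather than recomputed backward, but the adjoint ODE and the gradient integral are the ones stated above.

```python
import math

# Toy problem: dh/dt = theta * h, loss L = h(T).
# Analytic gradient: dL/dtheta = T * h0 * exp(theta * T).
theta, h0, T, n = 0.5, 1.0, 1.0, 10_000
dt = T / n

# Forward Euler pass; the trajectory is kept only so we can evaluate
# df/dtheta = h below (a true adjoint solver would recompute it backward).
h = [h0]
for _ in range(n):
    h.append(h[-1] + dt * theta * h[-1])

# Backward pass: da/dt = -a * (df/dh) = -a * theta, from a(T) = dL/dh(T) = 1,
# accumulating the integral of a(t) * (df/dtheta) = a(t) * h(t).
a, grad = 1.0, 0.0
for k in range(n, 0, -1):
    grad += dt * a * h[k]
    a += dt * a * theta   # stepping backward in time flips the ODE's sign

analytic = T * h0 * math.exp(theta * T)
print(grad, analytic)  # the two agree closely
```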
Why It Matters
Constant-memory backpropagation enables training of very deep (or continuous-depth) networks without running out of GPU memory. For standard ResNets with $N$ layers, the memory saving is a factor of $N$ ($O(N) \to O(1)$). For neural ODEs where the solver may take thousands of steps, the saving is even larger.
The adjoint method is not new. It was developed in optimal control theory in the 1960s (Pontryagin's maximum principle). Neural ODEs brought it to deep learning.
Failure Mode
The backward ODE recomputation of $h(t)$ introduces numerical error. If the forward and backward solvers use different discretizations or if the dynamics are chaotic, the recomputed $h(t)$ can diverge from the original, causing gradient inaccuracy. In practice, this is mitigated by using the same adaptive solver in both directions, but it remains a source of subtle bugs. Checkpointed approaches (solving forward, saving a few checkpoints, recomputing between them) offer a middle ground.
Why ODEs Constrain Expressiveness
ODE trajectories cannot cross. If $h_1(0) \neq h_2(0)$, then $h_1(t) \neq h_2(t)$ for all $t$ (by uniqueness of ODE solutions under Lipschitz conditions). This means the map $h(0) \mapsto h(T)$ is a homeomorphism: it is continuous, invertible, and its inverse is continuous.
This is a limitation. A homeomorphism cannot change the topology of the data. If the input data has two intertwined spirals, a neural ODE cannot "untangle" them into linearly separable clusters through a continuous flow. A standard ResNet (with finite step size) can, because discrete maps are not constrained by ODE uniqueness.
In practice, this limitation is addressed by:
- Augmented neural ODEs (Dupont et al. 2019): concatenate $p$ extra zero-initialized dimensions to the state, so the flow acts on $\mathbb{R}^{d+p}$ instead of $\mathbb{R}^d$. The homeomorphism constraint still applies in the lifted space, but in higher dimensions it can unknot data that is topologically stuck in $\mathbb{R}^d$. Dupont et al. show that this both fixes the topology obstruction for tasks like nested spheres and reduces the solver's NFE (number of function evaluations) at training time, because the learned vector field can be simpler.
- Using neural ODEs as components in a larger architecture, not as the entire model.
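The non-crossing constraint is easiest to see in one dimension, where it forces the flow map to be monotone. A sketch with an arbitrary smooth Lipschitz 1D field (assumed purely for illustration): three ordered starting points stay ordered under the flow, so no 1D neural ODE can rearrange labels like $(-1 \mapsto A,\ 0 \mapsto B,\ 1 \mapsto A)$ into a linearly separable layout.

```python
import math

def flow(x0, f, T=1.0, n=1000):
    # Euler-integrate the 1D ODE dx/dt = f(x, t) from the starting point x0
    dt, x = T / n, x0
    for k in range(n):
        x = x + dt * f(x, k * dt)
    return x

# Arbitrary smooth (Lipschitz) 1D vector field, chosen for illustration
f = lambda x, t: math.sin(3.0 * x) - 0.5 * x

xs = [-1.0, 0.0, 1.0]
ys = [flow(x, f) for x in xs]
print(ys)  # the order of the three points is preserved by the flow
```

With small enough steps the discrete Euler map inherits this monotonicity ($1 + \Delta t\, f' > 0$), which is exactly what a large-step ResNet is free to violate.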
Common Confusions
Neural ODEs are not just deep ResNets
The continuous formulation gives qualitatively different properties: adaptive depth, constant-memory training, invertibility constraints, and connections to physics. A 100-layer ResNet with ReLU and batch norm does not behave like a neural ODE. The ODE perspective is most useful when you actually use an ODE solver (adaptive step size, error control), not when you just view a discrete ResNet "as if" it were continuous.
Constant memory does not mean free computation
The adjoint method saves memory by recomputing activations during the backward pass. This doubles the computational cost (two ODE solves instead of one forward pass). For some applications, this tradeoff is worth it (when memory is the bottleneck). For others, standard backprop with gradient checkpointing is more practical.
Neural ODEs are not universally better than discrete networks
Neural ODEs are slower to train (ODE solver overhead), harder to parallelize (sequential integration), and more constrained in expressiveness (no trajectory crossing). They are most useful when continuous dynamics are a natural fit: time series, physical systems, normalizing flows, and problems where adaptive computation depth matters.
Exercises
Problem
A ResNet has 50 layers with hidden dimension 256. Standard backprop stores all 50 intermediate activations ($50 \times 256$ floats per example). The neural ODE adjoint method stores only the initial and final states. Compute the memory ratio. Under what conditions is the ODE approach worth the extra compute?
Problem
Explain why the non-crossing property of ODE trajectories limits the expressiveness of neural ODEs. Give a specific example of a classification problem in $\mathbb{R}^2$ that a neural ODE (without augmentation) cannot solve but a standard 2-layer network can.
Duality with Physics-Informed Neural Networks
Neural ODEs and PINNs are dual approaches to the same class of dynamical-system problems. The duality is sharp:
| Object | What is parameterized | What is solved | What the loss penalizes |
|---|---|---|---|
| PINN | The solution $u_\phi(x, t)$ | The PDE residual at collocation points (no integrator) | $\lVert \mathcal{N}[u_\phi](x_i) \rVert^2$ + boundary/data terms |
| Neural ODE | The vector field $f_\theta(h, t)$ | The IVP $\dot h = f_\theta(h, t)$ via a numerical solver | $\lVert h(T) - y \rVert^2$ at observed endpoints |
PINNs are mesh-free function approximators of the solution; they evaluate $u_\phi$ at collocation points and demand the differential equation be satisfied there. Neural ODEs are learned right-hand sides; they specify the dynamics $f_\theta$ and let an ODE solver produce the trajectory.
Each has the failure mode the other avoids. PINNs struggle with sharp gradients and stiff multi-scale problems because the solution itself has structure that smooth networks resolve poorly. Neural ODEs struggle when the dynamics themselves have intrinsic complexity that requires fine-grained vector fields, but they handle stiff regions automatically through adaptive solvers. PINNs amortize across initial conditions only by retraining; neural ODEs handle new initial conditions via a fresh forward solve at no retraining cost.
In scientific ML the choice is usually dictated by what is known: if you trust the PDE and want to solve it once well, PINNs (or classical FEM) are appropriate; if you have observations of trajectories and need to learn the unknown dynamics from them, neural ODEs are appropriate. Hybrid approaches like neural operators (FNO, DeepONet) split the difference by amortizing the solution map across PDE instances.
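The duality can be made concrete on the toy problem $du/dt = -u$, $u(0) = 1$. As stand-ins for networks, assume one-parameter families: a PINN-style candidate solution $u_\phi(t) = e^{\phi t}$ scored by its ODE residual at collocation points, and a neural-ODE-style vector field $f_\theta(h) = \theta h$ scored by endpoint mismatch after an Euler solve. Both losses are minimized near the true value $-1$.

```python
import math

# Target IVP: du/dt = -u, u(0) = 1, with exact solution u(t) = exp(-t).

def pinn_loss(phi):
    # PINN: parameterize the SOLUTION u_phi(t) = exp(phi * t) and penalize
    # the residual u' + u = (phi + 1) * exp(phi * t) at collocation points.
    pts = [0.1 * i for i in range(11)]
    return sum(((phi + 1.0) * math.exp(phi * t)) ** 2 for t in pts)

def node_loss(theta, T=1.0, n=1000):
    # Neural ODE: parameterize the VECTOR FIELD f_theta(h) = theta * h,
    # solve the IVP with Euler, and penalize the endpoint mismatch.
    dt, h = T / n, 1.0
    for _ in range(n):
        h = h + dt * theta * h
    return (h - math.exp(-1.0)) ** 2

grid = [-2.0 + 0.01 * i for i in range(201)]
best_pinn = min(grid, key=pinn_loss)
best_node = min(grid, key=node_loss)
print(best_pinn, best_node)  # both near -1: dual routes to the same dynamics
```

The PINN loss never calls an integrator; the neural ODE loss never differentiates its candidate. That is the duality in miniature.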
Connection to Diffusion Models: The SDE Bridge
Diffusion models are not neural ODEs; they are neural SDEs (stochastic differential equations). The forward noising process

$$dx = f(x, t)\, dt + g(t)\, dW_t$$

is an SDE with drift $f(x, t)$ and diffusion coefficient $g(t)$. The reverse-time SDE (Anderson 1982),

$$dx = \left[ f(x, t) - g(t)^2\, \nabla_x \log p_t(x) \right] dt + g(t)\, d\bar{W}_t,$$

is the generative direction. The neural network learns the only unknown, the score $\nabla_x \log p_t(x)$. The full theory requires stochastic calculus: Ito's lemma, the existence and uniqueness theorem for SDEs (built on the Ito isometry), and the Fokker-Planck equation that links the SDE to the time evolution of $p_t(x)$.
The deterministic counterpart of the diffusion reverse SDE is the probability flow ODE:

$$\frac{dx}{dt} = f(x, t) - \frac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x).$$
This ODE has the same marginals $p_t$ as the SDE at every $t$ (Song et al. 2021, Theorem 2). Sampling from the diffusion model by integrating this ODE backward (the standard trick used by DPM-Solver, DDIM, and EDM samplers) is literally neural ODE inference, with the score $\nabla_x \log p_t$ given by the score model. The bridge between Neural ODEs and diffusion is not metaphorical; the modern fast samplers are neural ODE solvers operating on a learned vector field.
This connects all the way back to energy-based models: the score is the negative gradient of the EBM energy, $\nabla_x \log p_t(x) = -\nabla_x E_t(x)$ (the partition function drops out because it does not depend on $x$). The probability-flow ODE includes a term that pushes $x$ toward higher density (lower energy), but it also carries the full SDE drift $f(x, t)$, so it is not pure gradient flow on a single energy. It is the marginal-preserving deterministic dynamics that share marginals with the reverse SDE; the score appears as one term, not as the entire vector field. Calling PF-ODE inference "energy descent" is correct only in the special case where $f \equiv 0$ (e.g., variance-exploding schedules at the SDE level). The neural-ODE / neural-SDE / EBM trio is one mathematical object viewed three ways, with the score being the bridge term rather than the whole story.
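The marginal-preservation claim can be checked on a 1D variance-exploding toy diffusion where the score is known in closed form (all schedule choices here are assumptions for illustration): take $p_t = \mathcal{N}(0, 1+t)$, so $f \equiv 0$, $g(t)^2 = d\sigma_t^2/dt = 1$, and the score is $-x/(1+t)$. Integrating the probability flow ODE backward from samples of $p_T$ should land on $p_0 = \mathcal{N}(0, 1)$.

```python
import math, random

def pf_ode_sample(x_T, T=3.0, n=200):
    # PF-ODE for this VE toy: dx/dt = -1/2 * g^2 * score = x / (2 * (1 + t)).
    # Integrate backward in time from t = T to t = 0 with Euler steps.
    dt, x, t = T / n, x_T, T
    for _ in range(n):
        x = x - dt * x / (2 * (1 + t))
        t -= dt
    return x

random.seed(0)
T = 3.0
# Start from the wide t = T marginal N(0, 1 + T) and flow back to t = 0.
samples = [pf_ode_sample(random.gauss(0, math.sqrt(1 + T)), T)
           for _ in range(5000)]
var = sum(s * s for s in samples) / len(samples)
print(var)  # close to 1: the ODE transports the t=T marginal onto N(0, 1)
```

In this Gaussian case the flow is exactly $x(0) = x(T)/\sqrt{1+T}$, so the transport is deterministic while the marginals still match, which is the whole point of the probability flow ODE.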
For the SDE-specific machinery — adjoint method extended to SDEs, generative neural SDEs as infinite-dimensional GANs, Latent SDEs for time series — see the dedicated neural SDEs page.
Open Questions
Neural ODEs occupy an unusual position in modern ML. The framework is mathematically clean and connects to control theory, dynamical systems, and statistical physics, but it has not displaced discrete networks for most applications. The interesting questions are about why.
- Adjoint accuracy under stiff dynamics. The vanilla adjoint method recomputes forward states by integrating backward. For stiff or chaotic dynamics, recomputed states diverge from the true forward path, corrupting gradients (Gholaminejad et al. 2019). The "discretize-then-optimize" alternative (backprop through the solver) gives exact gradients but loses the constant-memory property. The right tradeoff between memory, compute, and gradient accuracy is unsettled.
- The expressiveness gap. The non-crossing property of ODE flows means a Neural ODE in $\mathbb{R}^d$ is a homeomorphism. Augmented Neural ODEs lift to $\mathbb{R}^{d+p}$, which provably fixes the topology obstruction (Dupont et al. 2019), but the expressiveness gap with discrete ResNets in finite-data regimes remains poorly characterized. When does continuity actually help, and when does it just slow you down?
- Why diffusion samplers work despite high-dimensional non-Lipschitz score networks. Probability-flow ODE samplers integrate over hundreds of NFE on score networks that are not Lipschitz-regularized in any explicit way. Standard ODE theory says this should be unstable; in practice EDM and DPM-Solver work robustly. The implicit regularization of the training objective is suspected to play a role, but no clean theorem exists.
- Neural ODEs for sequence modeling against transformers. Mamba and other state-space models are continuous-time linear systems with selective parameterization, conceptually adjacent to Neural ODEs but with hand-crafted dynamics. A nonlinear Neural-ODE-style sequence model that competes with Mamba or transformers on language has not been demonstrated. Whether the limitation is fundamental or engineering is open.
- Symplectic and structure-preserving Neural ODEs. For physical systems with conservation laws (Hamiltonian dynamics, energy conservation), generic Neural ODEs do not preserve the conserved quantity. Hamiltonian Neural Networks (Greydanus et al. 2019) and Lagrangian Neural Networks (Cranmer et al. 2020) constrain the architecture to enforce structure, but at a cost in flexibility. The right balance for general scientific ML applications is unsettled.
References
Canonical:
- Chen, Rubanova, Bettencourt, Duvenaud, "Neural Ordinary Differential Equations" (NeurIPS 2018, Best Paper; arXiv:1806.07366). The foundational paper.
- Haber and Ruthotto, "Stable Architectures for Deep Neural Networks" (Inverse Problems 2017; arXiv:1705.03341). ResNet as discretized ODE, with the step-size rescaling.
- Pontryagin, Boltyanskii, Gamkrelidze, Mishchenko, The Mathematical Theory of Optimal Processes (1962). Original adjoint method, predating ML by 56 years.
Current:
- Dupont, Doucet, Teh, "Augmented Neural ODEs" (NeurIPS 2019; arXiv:1904.01681). Fixes the topology-preserving expressiveness limitation.
- Gholaminejad, Keutzer, Biros, "ANODE: Unconditionally Accurate Memory-Efficient Gradients for Neural ODEs" (IJCAI 2019; arXiv:1902.10298). Documents gradient-accuracy and NFE issues in the vanilla adjoint method and proposes a checkpointed alternative.
- Onken, Ruthotto, "Discretize-Optimize vs. Optimize-Discretize for Time-Series Regression and Continuous Normalizing Flows" (arXiv:2005.13420, 2020). Empirical comparison of the two adjoint paradigms.
- Grathwohl, Chen, Bettencourt, Sutskever, Duvenaud, "FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models" (ICLR 2019; arXiv:1810.01367). Neural ODEs for normalizing flows via Hutchinson trace estimator.
- Li, Wong, Chen, Duvenaud, "Scalable Gradients for Stochastic Differential Equations" (AISTATS 2020; arXiv:2001.01328). Adjoint method extended to SDEs.
- Kidger, Foster, Li, Lyons, "Neural SDEs as Infinite-Dimensional GANs" (ICML 2021; arXiv:2102.03657). Generative modeling with neural SDEs.
- Kidger, Morrill, Foster, Lyons, "Neural Controlled Differential Equations for Irregular Time Series" (NeurIPS 2020; arXiv:2005.08926). CDEs for irregularly sampled sequences.
- Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021; arXiv:2011.13456). The probability flow ODE (Section 4.3 / Appendix D.1), the explicit Neural-ODE / diffusion bridge.
- Bai, Kolter, Koltun, "Deep Equilibrium Models" (NeurIPS 2019; arXiv:1909.01377). Related infinite-depth approach that solves for a fixed point instead of integrating.
Reference / Survey:
- Kidger, "On Neural Differential Equations" (PhD thesis, Oxford, 2022; arXiv:2202.02435). The standard modern reference covering ODEs, SDEs, and CDEs in one volume.
- Greydanus, Dzamba, Yosinski, "Hamiltonian Neural Networks" (NeurIPS 2019; arXiv:1906.01563). Structure-preserving architecture for physical systems.
Next Topics
- Adjoint sensitivity method: deeper treatment of the constant-memory backprop machinery, including the Pontryagin origins and the discretize-vs-optimize debate
- Neural SDEs: the stochastic generalization, including the explicit bridge to diffusion models and score-based generative modeling
- Continuous normalizing flows: density estimation via Neural ODE flows with the FFJORD trace trick
- Physics-informed neural networks: the dual approach that parameterizes the solution rather than the dynamics
- Stochastic calculus for ML: the SDE prerequisites for the diffusion bridge
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Automatic Differentiation (layer 1 · tier 1)
- Classical ODEs: Existence, Stability, and Numerical Methods (layer 1 · tier 1)
- Gradient Flow and Vanishing Gradients (layer 2 · tier 1)
- Skip Connections and ResNets (layer 2 · tier 1)
- PDE Fundamentals for Machine Learning (layer 1 · tier 2)
Derived topics
- Adjoint Sensitivity Method (layer 3 · tier 2)
- Continuous Normalizing Flows (layer 3 · tier 3)
- Neural SDEs and the Diffusion Bridge (layer 4 · tier 3)
- Continuous Thought Machines (layer 5 · tier 3)