

Open Problems in ML Theory

A curated list of genuinely open problems in machine learning theory: why overparameterized networks generalize, the right complexity measure for deep learning, feature learning beyond NTK, why scaling laws hold, emergent abilities, transformer-specific theory, and post-training theory.

Research · Tier 3 · Frontier · Supporting · ~70 min

Why This Matters

Classical statistical-learning theory (VC dimension, Rademacher complexity, PAC-Bayes) predicts that overparameterized neural networks should not generalize. They do. The bounds are vacuous in the regime where modern deep learning operates: the hypothesis class is rich enough to fit random labels (Zhang et al. 2017), yet generalization on real data is strong. No widely accepted theory closes this gap.

This page surveys the open problems that descend from that gap, plus a few that have emerged from frontier-model behavior since 2022. Each entry states what is known empirically, what is partially understood theoretically, and what is unresolved.

Mental Model

Classical learning theory says: complexity of the hypothesis class determines generalization. More parameters means worse generalization, unless you have proportionally more data. Modern deep learning violates this: models with millions or billions of parameters generalize well with relatively little data.

Something is controlling generalization that our current theory does not capture. Finding that something is the meta-problem that unifies all the open problems below.

Problem 1: Why Do Overparameterized Networks Generalize?

Proposition

Rethinking Generalization (Zhang et al. 2017)

Statement

Standard neural network architectures (CNNs, ResNets) can fit a training set of $n$ images with completely random labels, achieving zero training error. The same architectures, with the same hyperparameters, generalize well on real labels.

This means: for any uniform convergence bound $\epsilon(n, \mathcal{H})$ that depends only on the hypothesis class $\mathcal{H}$ and sample size $n$, the bound $\epsilon(n, \mathcal{H})$ must be vacuous (close to 1), because $\mathcal{H}$ contains both the good real-label function and the random-label function.

Intuition

The network architecture can represent garbage (random labels) just as easily as signal (real labels). So the architecture alone cannot explain generalization. Something about the training procedure (SGD, the initialization, the learning-rate schedule) must be selecting good solutions from the vast space of interpolating solutions. This is the implicit bias of SGD.

Why It Matters

This result forced the field to abandon hypothesis-class-based generalization theory for deep learning and look instead at algorithm-dependent explanations. Every open problem on this page traces back to this observation.

Failure Mode

The proposition is an experimental observation, not a mathematical theorem. It does not tell you what the right explanation is, only what it is not (hypothesis-class complexity alone).

What is known. The Zhang et al. (2017) random labels experiment showed that neural networks can memorize random labels (zero training error on noise) while also generalizing well on real data. This means the hypothesis class itself (all functions representable by the network) is too rich. Generalization must come from the training algorithm, not the architecture alone.
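The observation is easy to reproduce at toy scale. The sketch below is a scaled-down stand-in, assuming scikit-learn is available and substituting the small digits dataset and an MLP for the paper's CIFAR-10 images and CNNs:

```python
# Minimal sketch of the Zhang et al. (2017) observation, scaled down:
# the same overparameterized model fits real labels AND random labels
# to (near-)zero training error, but only the former generalizes.
# digits + MLPClassifier stand in for the paper's CIFAR-10 + CNN setup.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
y_random = rng.integers(0, 10, size=y_train.shape)  # labels carry no signal

for name, labels in [("real labels", y_train), ("random labels", y_random)]:
    # Wide network relative to the dataset: enough capacity to interpolate.
    net = MLPClassifier(hidden_layer_sizes=(512, 512), max_iter=2000,
                        alpha=0.0, random_state=0)
    net.fit(X_train, labels)
    print(f"{name}: train acc = {net.score(X_train, labels):.2f}, "
          f"test acc = {net.score(X_test, y_test):.2f}")
# Expected pattern: train accuracy near 1.0 in both cases; test accuracy
# is high for real labels and near chance (~0.1) for random labels.
```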

What is partially understood. Several partial explanations exist:

  • Implicit bias of gradient descent: for linear models, gradient descent converges to the minimum-norm solution (demonstrated in the sketch after this list). For deep networks, gradient descent seems to prefer "simple" functions, but formalizing "simple" is hard.
  • Double descent: test error first decreases, then increases (classical), then decreases again as model size grows past the interpolation threshold. This is well documented empirically but the mechanism is not fully understood for deep networks.
  • Benign overfitting: in some settings, interpolating the training data (including noise) is compatible with good generalization. Proven for linear regression and some kernel methods, but not for deep networks in generality.
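The minimum-norm claim for linear models (first bullet above) can be checked directly. The sketch below is a toy numpy demonstration on an underdetermined least-squares problem; the specific dimensions and learning rate are arbitrary choices:

```python
# Minimal sketch of the implicit bias of gradient descent on an
# underdetermined linear regression problem: starting from zero,
# gradient descent on squared loss converges to the minimum-norm
# interpolating solution (the pseudo-inverse solution).
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                       # fewer samples than parameters
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)                      # zero initialization matters here
lr = 1e-2
for _ in range(50_000):
    w -= lr * X.T @ (X @ w - y) / n  # gradient of 0.5 * mean squared error

w_min_norm = np.linalg.pinv(X) @ y   # minimum-norm interpolator
print("train residual:", np.linalg.norm(X @ w - y))
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))
# Both numbers should be close to zero: among the infinitely many
# interpolating solutions, GD from zero picks the smallest-norm one.
```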

What is not known. There is no generalization bound for deep neural networks trained with SGD on realistic data that is both (a) non-vacuous and (b) does not require assumptions that are impossible to verify in practice. PAC-Bayes bounds come closest, but they require careful construction of a posterior distribution and often give loose numerical values.

Who is working on it. Most active ML-theory groups, including those at Princeton (Arora, Ma), MIT (Moitra, Bresler), Stanford (Liang, Zou), and Hebrew University (Shalev-Shwartz).

Problem 2: What Is the Right Complexity Measure?

What is known. Classical measures fail for deep learning:

  • Parameter count: networks with $10^9$ parameters generalize, so raw parameter count is not the right measure
  • VC dimension: scales with parameter count, gives vacuous bounds
  • Rademacher complexity: same problem for overparameterized networks
  • PAC-Bayes: gives non-vacuous bounds in some cases, but requires choosing a prior and posterior carefully

Candidate measures that partially work:

  • Norm-based bounds: controlling the product of layer norms $\|W_1\| \cdot \|W_2\| \cdots \|W_L\|$ gives tighter bounds, but these are still loose and depend on depth in ways that do not match empirical generalization (a rough version of this quantity is computed in the sketch after this list)
  • Compression-based bounds: if a network can be compressed to a smaller representation without losing accuracy, the compressed size determines generalization. This is theoretically clean but the right notion of compression is debated
  • Fisher-Rao norm: a Riemannian geometry approach that accounts for the parameterization of the function, not just the function itself
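To make the norm-based idea concrete, the sketch below computes a product-of-spectral-norms proxy for a randomly initialized PyTorch MLP. This is a simplified stand-in for the quantities in published norm-based bounds (e.g., Bartlett et al. 2017), which add margin and perturbation terms; the architecture here is arbitrary:

```python
# Illustrative computation of a norm-based capacity proxy: the product
# of per-layer spectral norms of an MLP's weight matrices. A simplified
# stand-in for the quantities in norm-based generalization bounds,
# not the exact bound itself.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

def spectral_norm_product(net: nn.Sequential) -> float:
    """Product of the largest singular values of all Linear layers' weights."""
    prod = 1.0
    for layer in net:
        if isinstance(layer, nn.Linear):
            prod *= torch.linalg.matrix_norm(layer.weight, ord=2).item()
    return prod

print("spectral-norm product at init:", spectral_norm_product(model))
# In practice this proxy is tracked over training; norm-based bounds use
# it (plus correction terms) in place of a raw parameter count.
```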

What is not known. No single complexity measure has been shown to (a) predict generalization across architectures, datasets, and training procedures, and (b) have a rigorous theoretical justification. The field suspects the answer involves both the model and the training algorithm jointly, not the model alone.

Who is working on it. Neyshabur, Bartlett, Dziugaite, Roy, Arora, among others. The Predicting Generalization in Deep Learning (PGDL) competition benchmarked many proposed measures and found no clear winner.

Problem 3: Feature Learning Beyond NTK

What is known. The Neural Tangent Kernel (NTK) regime provides a mathematically tractable theory of neural network training: in the infinite-width limit, network training is equivalent to kernel regression with a fixed kernel (the NTK). This gives convergence guarantees and generalization bounds.

The problem with NTK. The NTK regime describes networks that do not learn features. The kernel is fixed at initialization. But feature learning, the ability of the network to learn useful intermediate representations, is widely believed to be the key advantage of deep learning over kernel methods.
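One way to see this gap at finite width is to measure how much the empirical tangent kernel moves during training. The sketch below does this for a tiny PyTorch network; the architecture, task, and hyperparameters are illustrative choices, not a canonical experiment:

```python
# Illustrative check of the NTK idealization at finite width: compute the
# empirical tangent kernel K_ij = <grad_theta f(x_i), grad_theta f(x_j)>
# before and after training. NTK theory assumes K stays fixed; in a small
# finite-width network trained on a nonlinear task it typically drifts,
# which is one way to see the feature learning that the NTK limit ignores.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(64, 4)
y = torch.sin(X.sum(dim=1, keepdim=True))
probe = X[:8]                          # points on which to evaluate the kernel

net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))

def empirical_ntk(model, points):
    grads = []
    for x in points:
        model.zero_grad()
        model(x.unsqueeze(0)).squeeze().backward()
        grads.append(torch.cat([p.grad.flatten().clone()
                                for p in model.parameters()]))
    G = torch.stack(grads)             # (num_points, num_params)
    return G @ G.T                     # Gram matrix = empirical tangent kernel

K0 = empirical_ntk(net, probe)

opt = torch.optim.SGD(net.parameters(), lr=0.02)
for _ in range(5000):
    opt.zero_grad()
    ((net(X) - y) ** 2).mean().backward()
    opt.step()

KT = empirical_ntk(net, probe)
print("relative kernel change:",
      (torch.norm(KT - K0) / torch.norm(K0)).item())
# A value far from 0 indicates the tangent kernel moved during training,
# i.e. the network left the lazy/NTK description and learned features.
```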

Partial progress:

  • Mean-field theory: treats neurons as particles in a probability distribution and studies the evolution of this distribution during training. Allows feature learning but is mathematically harder than NTK.
  • Tensor programs (Yang, 2020-2024): a framework that tracks how random tensors evolve during training, unifying NTK, mean-field, and other limits
  • Rich regime vs lazy regime: networks can be in a regime where features change substantially (rich) or barely change (lazy/NTK). The rich regime is where feature learning happens, but it is much harder to analyze

What is not known. There is no rigorous theory of what features deep networks learn on natural data, why certain architectures learn better features than others, or how feature quality relates to generalization. The fundamental question: what makes a learned feature "good"?

Who is working on it. Yang (tensor programs), Mei and Montanari (mean field), Bach and Chizat (lazy vs rich regimes), Allen-Zhu and Li (feature learning provably helps).

Problem 4: Why Do Scaling Laws Hold?

What is known. Empirical scaling laws (Kaplan et al. 2020, Hoffmann et al. 2022) show that test loss follows a power law in model size, dataset size, and compute:

$$L(N) \approx L_\infty + \frac{A}{N^\alpha}$$

where $N$ is model size and $\alpha \approx 0.07$ for language models. Similar power laws hold for dataset size and compute budget. These laws hold over many orders of magnitude with remarkable precision.
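The fitting procedure behind these laws is itself simple. The sketch below fits the power-law form to synthetic loss-versus-size data with scipy; all the numbers are invented, only the method is representative:

```python
# A sketch of how scaling-law exponents are estimated in practice: fit
# L(N) = L_inf + A / N**alpha to (model size, loss) pairs. The data below
# is synthetic (generated from the same functional form plus noise),
# purely to illustrate the fitting procedure.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(N, L_inf, A, alpha):
    return L_inf + A / N**alpha

rng = np.random.default_rng(0)
N = np.logspace(6, 10, 12)                      # model sizes, 1M .. 10B params
true = dict(L_inf=1.7, A=8.0, alpha=0.07)
loss = scaling_law(N, **true) * (1 + 0.01 * rng.standard_normal(N.size))

params, _ = curve_fit(scaling_law, N, loss, p0=[1.0, 1.0, 0.1],
                      bounds=([0, 0, 0], [10, 100, 1]))
L_inf_hat, A_hat, alpha_hat = params
print(f"fitted L_inf={L_inf_hat:.2f}, A={A_hat:.2f}, alpha={alpha_hat:.3f}")
# With real runs the hard part is not the fit but deciding which points lie
# in the power-law regime, and whether extrapolation beyond them is valid.
```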

What is partially understood:

  • Power laws in test loss as a function of dataset size are predicted by classical learning theory for certain model classes (bias-variance decomposition gives polynomial rates)
  • For simple models (linear regression, kernel regression), scaling exponents can be computed exactly and depend on the spectrum of the data covariance
  • Hutter (2021) derived power-law learning curves from a data model with Zipf-distributed feature frequencies; the "quantization model" of Michaud et al. (2023) similarly produces power laws from a discrete set of skills learned in order of frequency

What is not known. No theory predicts the specific exponents observed for large language models from first principles. We do not know: (a) why the exponents are what they are, (b) whether the laws will continue to hold at larger scales, (c) what determines the irreducible loss $L_\infty$, (d) why the relationship is so clean (a simple power law rather than something more complex).

Who is working on it. Scaling laws are studied by Kaplan, McCandlish, Henighan (OpenAI), Hoffmann et al. (DeepMind/Chinchilla), Sharma and Kaplan, Bahri et al. (Google), and many academic groups.

Problem 5: Are Emergent Abilities Real?

What is known. Wei et al. (2022) documented "emergent abilities": capabilities that are absent in small models but appear suddenly as models scale past a critical size. Examples included multi-step arithmetic, chain-of-thought reasoning, and certain benchmark tasks that jump from near-zero to high accuracy at a specific model scale.

The controversy. Schaeffer et al. (2023) argued that emergence is a mirage caused by nonlinear evaluation metrics. When you use linear or continuous metrics instead of discontinuous ones (like exact-match accuracy), the smooth improvement is visible at all scales; there is no sharp phase transition.
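The metric argument can be reproduced in a few lines. The sketch below uses a made-up, smoothly improving per-token accuracy curve and shows how an exact-match metric over a multi-token answer can manufacture an apparent jump:

```python
# A sketch of the Schaeffer et al. (2023) metric argument: if per-token
# accuracy p improves smoothly with scale, then exact-match accuracy on a
# k-token answer, p**k, looks like a sharp "emergent" jump even though the
# underlying improvement is gradual. All numbers are made up for illustration.
import numpy as np

model_scale = np.logspace(7, 11, 9)                      # parameters, 10M..100B
per_token_acc = 1 - 0.6 * (model_scale / 1e7) ** -0.35   # smooth power-law gain
k = 20                                                   # answer length in tokens
exact_match = per_token_acc ** k                         # all-or-nothing metric

for N, p, em in zip(model_scale, per_token_acc, exact_match):
    print(f"N={N:9.1e}  per-token acc={p:.3f}  exact match (k={k})={em:.3f}")
# Per-token accuracy improves steadily across all scales, while exact match
# sits near zero and then rises quickly: whether you see "emergence" can
# depend on the metric, which is the crux of the mirage debate.
```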

What is genuinely unresolved:

  • Even if the metric argument explains some cases, do genuinely qualitative capability jumps exist? Can a sufficiently large model do something structurally different from what smaller models do, or is it always gradual improvement?
  • What is the right definition of "emergence" for ML systems? Is it a phase transition in a statistical mechanics sense, or something else?
  • If emergence is real, can we predict when it will happen? This matters enormously for AI safety: unpredictable capability jumps make it hard to assess risks before deployment.

Who is working on it. Wei et al. (Google), Schaeffer, Miranda, Koyejo (Stanford), Arora et al. (Princeton), and the AI safety community broadly.

Problem 6: Transformer-Specific Theory

What is known. Transformers dominate modern deep learning, but most theoretical results apply to generic neural networks (MLPs, ResNets) or to simplified models. Transformer-specific theory is sparse.

Partial results:

  • Expressive power: transformers can simulate Turing machines (given enough layers and precision), but this does not explain what they learn in practice
  • In-context learning: transformers can learn to implement learning algorithms (like gradient descent) in their forward pass. Garg et al. (2022) showed this empirically; Akyurek et al. (2023) and Von Oswald et al. (2023) analyzed the mechanism (a minimal version appears in the sketch after this list)
  • Attention patterns: some work characterizes what attention heads learn (induction heads, positional heads), but a complete taxonomy is missing
  • Context length generalization: why transformers struggle to generalize to sequence lengths longer than those seen in training, and whether positional encoding design can fix this
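The simplest version of the gradient-descent-in-context construction can be checked numerically. The sketch below (with arbitrary dimensions and step size) compares one explicit gradient step on the in-context examples against a single linear-attention readout, in the spirit of the Von Oswald et al. construction:

```python
# A sketch of the gradient-descent-in-context idea in its simplest form:
# on in-context linear regression, a single LINEAR attention operation
# over the context reproduces one step of gradient descent on the
# in-context examples. Dimensions and step size are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx = 8, 32
w_true = rng.standard_normal(d)
X_ctx = rng.standard_normal((n_ctx, d))        # in-context inputs
y_ctx = X_ctx @ w_true                         # in-context targets
x_query = rng.standard_normal(d)
eta = 0.1

# One step of gradient descent on 0.5 * sum_i (w . x_i - y_i)^2, from w = 0:
w_one_step = eta * X_ctx.T @ y_ctx
pred_gd = w_one_step @ x_query

# Linear attention over the context: scores are x_query . x_i (no softmax),
# values are eta * y_i; the attended output is the same quantity.
scores = X_ctx @ x_query
pred_attention = scores @ (eta * y_ctx)

print("one GD step :", pred_gd)
print("linear attn :", pred_attention)          # identical up to float error
```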

What is not known:

  • Why does the transformer architecture work so much better than alternatives for language? Is it the attention mechanism, the residual stream, the layer normalization, or the combination?
  • What is the right theoretical model of a transformer? Is it a kernel machine, a dynamical system, a program executor, or something new?
  • Can we characterize the functions that transformers can learn efficiently (not just represent)?
  • Under what conditions, if any, do transformers approximate optimization procedures in-context, and how should claims of mesa-optimization be operationalized and tested?

Who is working on it. Elhage, Olsson et al. (Anthropic, mechanistic interpretability), Garg, Brown (in-context learning), Edelman, Goel, Kakade (expressivity), Jelassi, Li (optimization), and many others.

Problem 7: Theory of Post-Training

What is known. Modern LLMs go through multiple training stages: pretraining (next-token prediction on large corpora), supervised fine-tuning (SFT on curated instruction-response pairs), reinforcement learning from human feedback (RLHF or variants), and sometimes additional stages like DPO, constitutional AI, or iterative refinement.

What is barely understood:

  • Why does RLHF work? The reward model is a noisy proxy for human preferences, trained on limited comparison data, yet it dramatically improves model behavior. Why does optimizing a proxy reward not cause catastrophic reward hacking more often?
  • What does fine-tuning change? Does SFT teach new knowledge or just change the format/style of outputs? Evidence suggests the latter (knowledge is mostly from pretraining), but this is not proven.
  • Alignment tax: how much capability (if any) is lost through safety training? Is there a fundamental tradeoff between helpfulness and safety, or can both be improved simultaneously?
  • Why does DPO work? Direct Preference Optimization bypasses the reward model entirely and optimizes the policy directly on preference data. It works surprisingly well despite being a much simpler procedure than RLHF. The theoretical justification exists (it optimizes the same objective as RLHF under certain assumptions), but why it works as well as it does in practice is not fully understood.
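For reference, the DPO objective itself is only a few lines. The sketch below uses placeholder log-probabilities rather than a real model and is meant only to show what bypassing the reward model looks like in code:

```python
# A minimal sketch of the DPO objective (Rafailov et al. 2023): given
# sequence log-probabilities of a chosen and a rejected response under the
# policy and a frozen reference model, the loss pushes the policy to
# increase the chosen response's likelihood relative to the reference.
# The log-probability values below are placeholders, not from a real model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (chosen policy/ref margin - rejected policy/ref margin))"""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Placeholder sequence log-probs for a batch of 3 preference pairs:
pol_c = torch.tensor([-12.0, -30.5, -8.2])
pol_r = torch.tensor([-14.1, -28.0, -9.0])
ref_c = torch.tensor([-13.0, -29.0, -8.5])
ref_r = torch.tensor([-13.5, -28.5, -8.8])
print("DPO loss:", dpo_loss(pol_c, pol_r, ref_c, ref_r).item())
# No reward model and no RL loop appear anywhere: the preference data is
# optimized directly, which is why DPO is so much simpler than RLHF.
```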

What is not known:

  • Is there a theory of what post-training should look like? Are the current stages (pre-train, SFT, RLHF) the right decomposition, or is there a more principled pipeline?
  • How do we think about the interaction between pretraining data and post-training objectives? The model's capabilities are bounded by pretraining, but post-training can elicit or suppress them.
  • Can we formalize what it means for a model to be "aligned" in a way that admits theoretical analysis?

Who is working on it. Anthropic (constitutional AI, alignment theory), OpenAI (RLHF theory), DeepMind (reward modeling), Rafailov et al. (DPO theory at Stanford), and the broader alignment research community.

Problem 8: Theory of Long-Horizon Agentic Behavior

Frontier models in 2025-2026 are deployed in agentic loops: tool use, multi-step browsing, multi-turn code execution, and chains of subgoals that last hundreds of turns. Empirical capability is improving on benchmarks like SWE-bench, GAIA, and tau-bench. Theory has barely caught up.

What is partially understood:

  • The single-shot policy gradient story (REINFORCE, PPO, GRPO) extends formally to multi-step settings, but variance and credit assignment grow rapidly with horizon. Standard variance bounds become loose past a few dozen steps.
  • For tasks with verifiable rewards at the end of a long trajectory, sparse signal makes naive RL intractable (a toy illustration follows this list). Process reward models (Lightman et al. 2023, arXiv:2305.20050) and step-level supervision are the main empirical workarounds, both of which reintroduce learned-proxy errors.
  • Subgoal decomposition in classical hierarchical RL (options framework, Sutton, Precup, Singh 1999) gives a vocabulary for talking about temporal abstraction, but the link to what transformers actually do during agentic loops is informal.
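The sparse-reward difficulty noted above can be quantified with a toy simulation. The environment below is invented purely to show how quickly the learning signal vanishes with horizon under naive exploration:

```python
# A toy illustration of why sparse terminal rewards make naive policy-
# gradient RL hard at long horizons: under an exploratory policy that must
# pick the correct action at every step, the chance of ever seeing the
# reward (and hence a nonzero REINFORCE gradient) decays exponentially
# with horizon. The environment is invented for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4          # the correct action must be chosen at every step
n_rollouts = 100_000   # rollouts per horizon

for horizon in [5, 10, 20, 40]:
    # A uniform-random policy succeeds only if all `horizon` choices are right.
    actions = rng.integers(0, n_actions, size=(n_rollouts, horizon))
    success = (actions == 0).all(axis=1)         # action 0 is "correct"
    hit_rate = success.mean()
    expected = n_actions ** -horizon
    print(f"horizon {horizon:3d}: observed reward rate {hit_rate:.2e} "
          f"(theory {expected:.2e})")
# At horizon 20 the success probability is ~1e-12: essentially no rollout
# carries any learning signal, which is why process rewards and step-level
# supervision are used as workarounds despite reintroducing proxy error.
```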

What is not known:

  • There is no sample-complexity or convergence theory for tool-use or agentic RL that matches the empirical regimes (sparse rewards, partial observability, stochastic tool outputs, hundreds of steps).
  • It is unclear what the right theoretical object is for a "subgoal" or "plan" inside a transformer. In-context-learning analyses describe one-shot regression at most; multi-step planning has no analogue.
  • There are no rigorous results characterizing when agentic loops compose (a model that can do X and Y separately can do X-then-Y) versus fail to compose, despite this being the central empirical question.
  • Reward hacking in agentic settings (e.g., a coding agent that satisfies unit tests by overwriting them) has no formal characterization comparable to the single-shot Goodhart taxonomy.

Who is working on it. This is one of the freshest areas. Empirical agentic-RL groups at Anthropic, OpenAI, DeepMind, and academic groups (Stanford, CMU, Berkeley) are producing benchmarks and methods faster than theorists are producing analyses. The gap between practice and theory here is wider than for any other problem on this page.

Common Confusions

Watch Out

These are not just hard problems; they are genuinely open

An open problem in ML theory is not a hard engineering challenge. It is a question where the research community does not have a satisfactory answer and where leading researchers actively disagree about what the answer might be. Progress on any of these would be a significant contribution.

Watch Out

Empirical evidence is not a theoretical explanation

Observing that large models generalize, that scaling laws hold, or that RLHF improves model behavior is not a theoretical understanding. Theory requires identifying the mechanism, stating precise assumptions, and proving that the mechanism produces the observed behavior under those assumptions. We have strong empirical evidence for many phenomena but theoretical explanations for very few.

Watch Out

The NTK is not wrong. It is incomplete

The Neural Tangent Kernel regime provides valid and rigorous results for a specific limit (infinite width, small learning rate). The limitation is that this regime does not capture feature learning, which is the practically important regime. NTK theory is a correct description of a setting that does not fully match practice.

Summary

  • The central mystery: overparameterized networks generalize, and we do not fully know why
  • No existing complexity measure reliably predicts generalization for deep networks
  • NTK theory is rigorous but does not capture feature learning
  • Scaling laws are empirically robust but theoretically unexplained
  • Whether emergent abilities are real or a measurement artifact is debated
  • Transformer-specific theory is in its infancy
  • Post-training (RLHF, DPO, SFT) works but lacks theoretical foundations
  • Long-horizon agentic behavior is the freshest gap: empirical capability is outpacing analysis on every dimension (sample complexity, credit assignment, plan composition, agentic reward hacking)
  • Each of these problems is an active area of research with many groups contributing

Exercises

Exercise · Research

Problem

The Zhang et al. (2017) random labels experiment shows that a neural network can fit random labels perfectly. Explain why this breaks classical generalization theory, and describe two different theoretical frameworks that attempt to resolve the contradiction.

Exercise · Research

Problem

Explain the difference between the "lazy" (NTK) and "rich" (feature learning) regimes of neural network training. What determines which regime a network is in, and why does it matter for theory?

Exercise · Research

Problem

Scaling laws show $L(N) \approx L_\infty + A/N^\alpha$ with $\alpha \approx 0.07$ for language models. What would it mean if $\alpha$ were much larger or much smaller? What determines $L_\infty$?

References

Canonical:

  • Zhang et al., "Understanding Deep Learning Requires Rethinking Generalization" (ICLR 2017)
  • Neyshabur et al., "Exploring Generalization in Deep Learning" (NeurIPS 2017)

Current:

  • Kaplan et al., "Scaling Laws for Neural Language Models" (2020), arXiv:2001.08361.
  • Wei et al., "Emergent Abilities of Large Language Models" (2022), arXiv:2206.07682.
  • Schaeffer, Miranda, Koyejo, "Are Emergent Abilities of Large Language Models a Mirage?" (NeurIPS 2023), arXiv:2304.15004.
  • Yang, "Tensor Programs" series (2020-2024).
  • Nanda, Chan, Lieberum, Smith, Steinhardt, "Progress Measures for Grokking via Mechanistic Interpretability" (ICLR 2023).

Long-horizon agentic theory (Problem 8):

  • Sutton, Precup, Singh, "Between MDPs and semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning" (Artificial Intelligence 1999). The options framework; the classical vocabulary for temporal abstraction.
  • Lightman, Kosaraju, Burda et al., "Let's Verify Step by Step" (OpenAI 2023), arXiv:2305.20050. PRM800K and step-level rewards on MATH.
  • Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" (ICLR 2024), arXiv:2310.06770. The benchmark that anchors most agentic-coding capability claims.
  • Mialon, Fourrier et al., "GAIA: A Benchmark for General AI Assistants" (2023), arXiv:2311.12983. Long-horizon tool-use benchmark for assistants.
  • Snell, Klein, Zhong et al., "Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters" (2024), arXiv:2408.03314. Empirical scaling laws for inference-time search; central to the question of when extra compute should be spent on training versus on planning at deployment.

Next Topics

This is the frontier. Every problem listed above is an active research direction with open positions for new researchers.

Last reviewed: April 26, 2026
