Paper Breakdowns
Landmark machine-learning and statistics papers, broken down into their actual mathematical contributions. Each breakdown links the equations to the topic pages on TheoremPath that develop the machinery in detail.
1992 · COLT 1992
A Training Algorithm for Optimal Margin Classifiers
Bernhard E. Boser, Isabelle M. Guyon, Vladimir N. Vapnik
The first paper to combine the maximum-margin hyperplane with the kernel trick. Replaces explicit feature maps with pairwise kernel evaluations and turns the resulting problem into a convex quadratic program with a clean dual.
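For orientation, the dual quadratic program the training algorithm targets (hard-margin form, our notation: kernel $k$, labels $y_i \in \{\pm 1\}$):

$$
\max_{\alpha \geq 0}\ \sum_{i=1}^{n} \alpha_i \;-\; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, k(x_i, x_j)
\qquad \text{s.t.}\quad \sum_{i=1}^{n} \alpha_i y_i = 0,
$$

with decision rule $f(x) = \operatorname{sign}\!\big(\sum_i \alpha_i y_i\, k(x_i, x) + b\big)$; only the support vectors keep $\alpha_i > 0$.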
2013 · ICLR 2014
Auto-Encoding Variational Bayes
Diederik P. Kingma, Max Welling
Introduces the variational autoencoder. Combines amortised inference with the reparameterisation trick to give a tractable, gradient-based estimator of the evidence lower bound for deep latent-variable models.
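A minimal numpy sketch of the single-sample reparameterised ELBO estimate, with toy linear stand-ins for the encoder and decoder (illustrative only, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z = 4, 2
W_mu = rng.standard_normal((d_z, d_x)) * 0.1   # toy "encoder" weights
W_lv = rng.standard_normal((d_z, d_x)) * 0.1
W_dec = rng.standard_normal((d_x, d_z)) * 0.1  # toy "decoder" weights

def elbo_estimate(x):
    mu, log_var = W_mu @ x, W_lv @ x            # parameters of q(z|x) = N(mu, diag(exp(log_var)))
    eps = rng.standard_normal(d_z)              # noise drawn independently of the parameters
    z = mu + np.exp(0.5 * log_var) * eps        # reparameterisation: z is differentiable in (mu, log_var)
    log_px_given_z = -0.5 * np.sum((x - W_dec @ z) ** 2)        # toy unit-variance Gaussian decoder
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)  # closed-form KL(q(z|x) || N(0, I))
    return log_px_given_z - kl                  # single-sample estimate of the ELBO

print(elbo_estimate(rng.standard_normal(d_x)))
```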
2014 · ICLR 2015
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba
Combines exponentially decayed estimates of the first and second gradient moments to produce a per-coordinate adaptive step size. The default optimizer for almost every modern deep network from 2015 onward.
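A bare-bones sketch of the update with the paper's default hyperparameters; `grad` stands in for any stochastic-gradient oracle:

```python
import numpy as np

def adam(grad, theta, steps, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = np.zeros_like(theta)                      # first-moment (mean) estimate
    v = np.zeros_like(theta)                      # second-moment (uncentred variance) estimate
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-coordinate adaptive step
    return theta

# toy usage: minimise ||theta - 3||^2
print(adam(lambda th: 2 * (th - 3.0), np.zeros(5), steps=500, lr=0.1))   # approaches [3 3 3 3 3]
```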
2014 · NeurIPS 2014
Generative Adversarial Nets
Ian J. Goodfellow et al.
Reframes generative modelling as a two-player minimax game between a generator and a discriminator. Establishes the equivalence between the optimal-discriminator game and Jensen-Shannon divergence minimisation.
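The value function and its optimal-discriminator consequence, in the paper's notation:

$$
\min_G \max_D\ V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big],
$$

and at the optimal discriminator $D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$ the generator's objective equals $2\,\mathrm{JSD}(p_{\text{data}} \,\|\, p_g) - \log 4$.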
2015 · ICML 2015
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe, Christian Szegedy
Inserts a normalization layer that standardizes each channel to zero mean and unit variance across the batch dimension, then learns a per-channel affine rescaling. Enables higher learning rates, reduces sensitivity to initialization, and was the standard normalization layer in deep CNNs for half a decade.
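A training-mode sketch over a (batch, channels) activation matrix; at inference the batch statistics are replaced by running averages, not shown here:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                        # per-channel mean over the batch
    var = x.var(axis=0)                          # per-channel variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)      # standardise each channel
    return gamma * x_hat + beta                  # learned affine rescaling restores expressivity

x = np.random.default_rng(0).normal(5.0, 3.0, size=(32, 8))
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))   # ~0 and ~1 per channel
```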
2015 · CVPR 2016
Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Introduces residual connections — adding the input of a block to its output — to enable training of networks more than an order of magnitude deeper than was previously stable. Wins ImageNet 2015 with a 152-layer model.
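The core identity is $y = x + F(x)$; a toy numpy block with a two-layer $F$ standing in for the paper's convolutional layers:

```python
import numpy as np

def residual_block(x, W1, W2):
    relu = lambda a: np.maximum(a, 0.0)
    f = W2 @ relu(W1 @ x)         # F(x): the block only has to learn the residual, not the whole mapping
    return relu(x + f)            # skip connection adds the input back before the final activation

rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal(d)
y = residual_block(x, rng.standard_normal((d, d)) * 0.1, rng.standard_normal((d, d)) * 0.1)
print(np.linalg.norm(y - np.maximum(x, 0.0)))   # small: with small weights the block starts near the identity
```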
2017 · NeurIPS 2017
Attention Is All You Need
Ashish Vaswani et al.
Replaces recurrence and convolution in sequence transduction with stacked self-attention. Establishes the transformer block — multi-head scaled dot-product attention plus position-wise feed-forward layers — that every modern large language model still uses.
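A single-head numpy sketch of scaled dot-product attention (no masking or multi-head projections):

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise similarities, scaled to keep softmax gradients healthy
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row is a distribution over positions
    return weights @ V                               # outputs are attention-weighted mixes of the value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
print(attention(Q, K, V).shape)                      # (5, 8)
```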
2018 · NAACL 2019
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Combines a transformer encoder with a masked-language-model objective to learn deep bidirectional context, then fine-tunes on downstream NLP tasks. Establishes pre-train-then-fine-tune as the dominant paradigm for two years and the technical scaffolding for everything after.
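A sketch of the bookkeeping behind the masked-LM objective, following the paper's 80/10/10 recipe (the -100 ignore-index is a common convention, not the paper's notation):

```python
import numpy as np

def mask_tokens(token_ids, mask_id, vocab_size, p_mask=0.15, seed=0):
    rng = np.random.default_rng(seed)
    ids = token_ids.copy()
    selected = rng.random(ids.shape) < p_mask               # ~15% of positions carry the loss
    labels = np.where(selected, token_ids, -100)            # predict originals only at selected positions
    roll = rng.random(ids.shape)
    ids[selected & (roll < 0.8)] = mask_id                  # 80%: replace with [MASK]
    rand = selected & (roll >= 0.8) & (roll < 0.9)
    ids[rand] = rng.integers(0, vocab_size, rand.sum())     # 10%: replace with a random token
    return ids, labels                                      # remaining 10%: keep the original token

ids, labels = mask_tokens(np.arange(10), mask_id=103, vocab_size=30000)
print(ids, labels)
```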
2018 · PNAS 2019
Reconciling modern machine learning practice and the bias-variance trade-off
Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal
Documents the double-descent risk curve: as model capacity grows past the interpolation threshold, test error decreases again instead of monotonically rising. Reframes the bias-variance picture for over-parameterized models.
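A toy illustration of the curve with random ReLU features and minimum-norm least squares; the exact shape depends on the noise level and seed, and this is not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 10                                     # n training points in d dimensions
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.5 * rng.standard_normal(n)
X_test = rng.standard_normal((1000, d))
y_test = X_test @ w_true

for p in [5, 20, 40, 80, 400, 2000]:              # p = number of random features (capacity)
    W = rng.standard_normal((d, p)) / np.sqrt(d)  # fixed random first layer
    phi = lambda Z: np.maximum(Z @ W, 0.0)        # ReLU features
    beta = np.linalg.pinv(phi(X)) @ y             # minimum-norm least-squares fit
    mse = np.mean((phi(X_test) @ beta - y_test) ** 2)
    print(f"p = {p:5d}   test MSE = {mse:8.2f}")  # typically spikes near p = n, then falls again
```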
2018 · NeurIPS 2018
Neural Tangent Kernel: Convergence and Generalization in Neural Networks
Arthur Jacot, Franck Gabriel, Clément Hongler
Proves that, in the infinite-width limit and at the right initialisation scale, gradient-flow training of a neural network is equivalent to kernel regression with a fixed deterministic kernel. In that limit the training loss provably reaches a global minimum whenever the limiting kernel is positive definite, one of the first global-convergence guarantees for a non-convex deep-network training procedure.
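The kernel in question is the Gram matrix of parameter gradients at initialisation:

$$
\Theta(x, x') \;=\; \big\langle \nabla_\theta f(x; \theta_0),\ \nabla_\theta f(x'; \theta_0) \big\rangle .
$$

In the infinite-width limit $\Theta$ becomes deterministic and stays constant during training, so gradient flow on the squared loss reduces to the linear dynamics of kernel regression, $\partial_t f_t = -\,\Theta\,(f_t - y)$ on the training set.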
2019 · NeurIPS 2019
Uniform Convergence May Be Unable to Explain Generalization in Deep Learning
Vaishnavh Nagarajan, J. Zico Kolter
Constructs an overparameterised setting where every uniform-convergence-based generalisation bound — including margin and norm-based bounds — must be vacuous, while SGD generalises well. Establishes a theoretical limit on what classical statistical-learning theory can explain about deep networks.
2020 · NeurIPS 2020
Denoising Diffusion Probabilistic Models
Jonathan Ho, Ajay Jain, Pieter Abbeel
Trains a single network to predict the Gaussian noise added at a random timestep, with a reweighted variational bound that reduces to a simple mean-squared-error loss. Establishes the noise-prediction parameterisation and U-Net backbone underlying every modern image and video diffusion system.
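The simplified objective, with noise schedule $\beta_1, \dots, \beta_T$ and $\bar\alpha_t = \prod_{s \le t}(1 - \beta_s)$:

$$
L_{\text{simple}} \;=\; \mathbb{E}_{t,\, x_0,\, \epsilon}\Big[\big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon,\ t\big) \big\|^2\Big], \qquad \epsilon \sim \mathcal{N}(0, I),
$$

i.e. corrupt a clean sample to a random timestep in one shot and regress the network's output onto the noise that was added.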
2020 · arXiv preprint
Scaling Laws for Neural Language Models
Jared Kaplan et al.
Empirical power-law fits relating cross-entropy test loss to non-embedding parameter count, dataset size, and compute. The fitted constants and the prescription for compute-optimal training guided the design of GPT-3 and served as the operating manual for foundation models for two years.
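Each fit takes a single-variable power-law form, holding the other resources effectively unbounded; the scale constants and exponents are estimated empirically:

$$
L(N) = \Big(\frac{N_c}{N}\Big)^{\alpha_N}, \qquad
L(D) = \Big(\frac{D_c}{D}\Big)^{\alpha_D}, \qquad
L(C_{\min}) = \Big(\frac{C_c}{C_{\min}}\Big)^{\alpha_C}.
$$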
2020 · ICLR 2021
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy et al.
Splits an image into a sequence of fixed-size patches, embeds each patch linearly, and feeds them into a standard transformer encoder. Matches or exceeds CNN accuracy on ImageNet given enough pretraining data, ending the assumption that vision needs convolution.
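A numpy sketch of the patch-to-token step, with ViT-Base-like sizes assumed (16x16 patches, 768-dimensional embeddings); position embeddings and the class token are omitted:

```python
import numpy as np

def patchify(image, patch=16):
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)   # (num_patches, patch*patch*C)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
E = rng.standard_normal((16 * 16 * 3, 768)) * 0.02   # learned linear patch embedding
tokens = patchify(img) @ E
print(tokens.shape)                                  # (196, 768): a 196-token "sentence" for the encoder
```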
2021 · ICLR 2022
LoRA: Low-Rank Adaptation of Large Language Models
Edward J. Hu et al.
Freezes the pretrained weight matrix W and learns a low-rank update BA at fine-tune time, where the rank of BA is typically 4-32 versus weight dimensions in the thousands. Cuts trainable parameters by three to four orders of magnitude with no inference-time penalty.
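A numpy sketch of the adapter, using the paper's zero initialisation of B and the alpha/r scaling (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 1024, 1024, 8, 16

W = rng.standard_normal((d_out, d_in)) * 0.02   # pretrained weight, frozen
A = rng.standard_normal((r, d_in)) * 0.01       # trainable, r x d_in
B = np.zeros((d_out, r))                        # trainable, zero-initialised so the update starts at 0

def forward(x):
    # Only A and B receive gradients; for serving, (alpha / r) * B @ A can be merged into W,
    # which is why there is no inference-time penalty.
    return W @ x + (alpha / r) * (B @ (A @ x))

print(forward(rng.standard_normal(d_in)).shape)
print(f"trainable: {A.size + B.size:,} params vs {W.size:,} in the frozen matrix")
```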
2022 · NeurIPS 2022
Training Compute-Optimal Large Language Models
Jordan Hoffmann et al.
Refits the language-model scaling laws across 400+ runs at fixed compute and finds that parameters and tokens should scale roughly equally, not 0.73 vs 0.27 as Kaplan et al. claimed. The 70B Chinchilla beats the 280B Gopher on the same FLOP budget. The corrected ratio (~20 training tokens per parameter) became the Chinchilla rule for compute-optimal training.
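A back-of-the-envelope version of the rule, using the common C ≈ 6ND approximation for training FLOPs rather than the paper's full parametric fit:

```python
def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget between parameters N and tokens D under C ~ 6*N*D and D ~ 20*N."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

# A Chinchilla-sized budget of ~5.8e23 FLOPs lands near 70B parameters and 1.4T tokens.
n, d = chinchilla_allocation(5.8e23)
print(f"N = {n / 1e9:.0f}B params, D = {d / 1e12:.1f}T tokens")
```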
2022 · NeurIPS 2022
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao et al.
Tiles the attention computation so that softmax(QK^T)V is computed block-by-block in fast on-chip SRAM, never materializing the full n-by-n attention matrix in slow HBM. Exact, not approximate. 2-4x training speedup on long sequences with no loss change.
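A pure-numpy sketch of the online-softmax bookkeeping that makes the tiling exact; the paper's contribution is running this inside SRAM with an IO-aware backward pass, which the sketch does not capture:

```python
import numpy as np

def blockwise_attention(Q, K, V, block=64):
    n, d = Q.shape
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)                   # running max of scores per query
    row_sum = np.zeros(n)                           # running softmax denominator per query
    for s in range(0, n, block):
        Kb, Vb = K[s:s + block], V[s:s + block]
        S = Q @ Kb.T / np.sqrt(d)                   # scores for this key block only
        new_max = np.maximum(row_max, S.max(axis=1))
        scale = np.exp(row_max - new_max)           # rescale what has been accumulated so far
        P = np.exp(S - new_max[:, None])
        out = out * scale[:, None] + P @ Vb
        row_sum = row_sum * scale + P.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = Q @ K.T / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
print(np.allclose(blockwise_attention(Q, K, V), (P / P.sum(axis=1, keepdims=True)) @ V))   # True: exact
```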
2022 · NeurIPS 2022
Training Language Models to Follow Instructions with Human Feedback
Long Ouyang et al.
Three-stage recipe — supervised demonstrations, reward model fit on pairwise human preferences, then PPO against the reward with a KL penalty to the supervised model — that turned a 1.3B model into one human raters preferred to the 100x-larger raw 175B GPT-3. Standardised RLHF as the post-training pipeline used by ChatGPT and the open-weight model ecosystem.
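The RL stage maximises the learned reward under a KL penalty to the supervised policy (the paper also mixes in pretraining-data gradients, omitted here):

$$
\max_{\pi_\theta}\ \mathbb{E}_{x \sim \mathcal{D},\ y \sim \pi_\theta(\cdot \mid x)}\Big[ r_\phi(x, y) \;-\; \beta \log \frac{\pi_\theta(y \mid x)}{\pi^{\text{SFT}}(y \mid x)} \Big],
$$

with the reward model $r_\phi$ fit to pairwise comparisons via the Bradley-Terry loss $-\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$.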
2023 · NeurIPS 2023
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov et al.
Writes the KL-regularised RLHF optimal policy in closed form as a function of the reward, then inverts the relation: the policy is the reward model, up to a partition function that cancels in pairwise comparisons. Reduces preference fine-tuning to a single supervised log-likelihood objective with no reward model, no rollouts, and no PPO.
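Concretely, with $\pi_{\text{ref}}$ the reference policy and $(y_w, y_l)$ the preferred and rejected responses:

$$
\mathcal{L}_{\text{DPO}} \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[ \log \sigma\Big( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \Big) \Big],
$$

which follows from the closed-form optimum $\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\, e^{r(x,y)/\beta}$ once the partition function cancels inside the Bradley-Terry comparison.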
2023 · COLM 2024
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu, Tri Dao
Makes the state-space sequence model's transition matrices input-dependent, restoring the content-based selectivity that linear-time-invariant SSMs lack. Introduces a hardware-aware parallel scan that keeps training fast on GPUs even though the recurrence is now non-LTI. Reaches transformer-quality language modelling at $O(n)$ inference, on the same scaling-law slope as a Pythia-style transformer up to ~3B parameters.
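A sequential, deliberately simplified reference for the selective recurrence; Mamba evaluates the same computation with a fused parallel scan and adds gating and a short convolution, all omitted here:

```python
import numpy as np

def selective_scan(x, A, W_B, W_C, W_dt):
    """Per-channel diagonal SSM in which B_t, C_t and the step size dt_t depend on the input x_t,
    so the recurrence is no longer linear time-invariant."""
    L, D = x.shape                                  # sequence length, channels
    N = A.shape[1]                                  # state size per channel; A is (D, N)
    h = np.zeros((D, N))
    y = np.zeros((L, D))
    for t in range(L):
        dt = np.logaddexp(0.0, x[t] @ W_dt)         # softplus: input-dependent step size, (D,)
        B = x[t] @ W_B                              # input-dependent input projection, (N,)
        C = x[t] @ W_C                              # input-dependent output projection, (N,)
        A_bar = np.exp(dt[:, None] * A)             # discretised diagonal transition, (D, N)
        h = A_bar * h + (dt[:, None] * B[None, :]) * x[t][:, None]
        y[t] = h @ C                                # per-channel read-out
    return y

rng = np.random.default_rng(0)
L, D, N = 64, 8, 16
A = -np.exp(rng.standard_normal((D, N)))            # negative: keeps the recurrence stable
out = selective_scan(rng.standard_normal((L, D)), A,
                     W_B=rng.standard_normal((D, N)) * 0.1,
                     W_C=rng.standard_normal((D, N)) * 0.1,
                     W_dt=rng.standard_normal((D, D)) * 0.1)
print(out.shape)                                    # (64, 8)
```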
2024 · NeurIPS 2024
QTIP: Quantization with Trellises and Incoherence Processing
Albert Tseng, Qingyao Sun, David Hou, Christopher De Sa
Post-training quantization that decouples codebook size from bitrate by replacing vector quantization with trellis-coded quantization. Uses random Hadamard incoherence processing to make LLM weights approximately i.i.d. Gaussian, then quantizes 256-dimensional Gaussian sequences via a Viterbi-decoded bitshift trellis. Two-bit Llama-2-70B reaches within 0.7 perplexity of FP16.
2025 · arXiv preprint (technical report)
Kimi K2: Open Agentic Intelligence
Kimi Team
1.04 trillion total / 32 billion activated parameter Mixture-of-Experts LLM trained on 15.5T tokens with the MuonClip optimizer (Muon plus per-head QK-Clip post-update weight rescaling). Pairs ultra-sparse MoE with multi-head latent attention from DeepSeek-V3 and a synthetic agentic data pipeline. Top open-weight model on Tau2-bench, ACEBench, and SWE-bench Verified at release.
2025 · arXiv preprint
SAM 3: Segment Anything with Concepts
Nicolas Carion et al.
Generalizes the Segment Anything line from per-prompt single-object segmentation to Promptable Concept Segmentation: given a noun phrase or image exemplar, return masks for every matching instance across an image or short video. Adds a presence token that decouples recognition from localization, doubling open-vocabulary segmentation accuracy on the new SA-Co benchmark.