
Curriculum

The Full Theory Library

Every concept organized by depth layer and module. Layer 0 is foundations. Layer 5 is applied systems. Every topic links down to its prerequisites until you hit axioms.

529 curriculum topics published. Layered prerequisite map. Reference pages included where they clarify the path.

Layer measures prerequisite depth. Tier measures effort within that layer: Tier 1 is core, Tier 2 is recommended, and Tier 3 is optional or advanced. Time is an authored reading estimate, not a mastery guarantee.

Foundations

49 topics / L0A, L1 / Tiers 1-2

Axioms, definitions, and notation. The base layer everything else depends on.

L0ATier 1~50m

Beta Distribution

The Beta distribution as the conjugate prior for Bernoulli and Binomial likelihoods, as the distribution of order statistics of i.i.d. Uniforms, and as a flexible density on the unit interval: density, moments, conjugacy derivation, and MLE without closed form.

Prereqs
Common Probability DistributionsDistributions AtlasGamma Distribution
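
A minimal sketch of the Beta-Bernoulli conjugacy described above, using scipy (prior, data, and seed are assumed for illustration): a Beta(a, b) prior updated with k successes out of n trials gives a Beta(a + k, b + n - k) posterior.

```python
# Illustrative sketch, not site code: Beta-Bernoulli conjugate update.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a, b = 2.0, 2.0                       # assumed prior pseudo-counts
data = rng.binomial(1, 0.7, size=50)  # Bernoulli(0.7) observations
k, n = int(data.sum()), data.size

posterior = stats.beta(a + k, b + n - k)      # conjugacy: posterior stays Beta
print("posterior mean:", posterior.mean())    # concentrates near 0.7 as n grows
print("95% credible interval:", posterior.interval(0.95))
```
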
L0ATier 1~50m

Common Inequalities

The algebraic and probabilistic inequalities that appear everywhere in ML theory: Cauchy-Schwarz, Jensen, AM-GM, Hölder, Minkowski, Young, Markov, and Chebyshev.

Prereqs
Common Probability DistributionsKolmogorov Probability Axioms
L0ATier 1~65m

Common Probability Distributions

The essential probability distributions for ML: Bernoulli through Dirichlet, with PMFs/PDFs, moments, relationships, and when each arises in practice.

Prereqs
Sets, Functions, and RelationsExponential Function PropertiesIntegration and Change of Variables+2 more
L0ATier 1~40m

Compactness and Heine-Borel

Sequential compactness, the Heine-Borel theorem in finite dimensions, the extreme value theorem, and why compactness is the key assumption in optimization.

Prereqs
Metric Spaces, Convergence, and Completeness
L0ATier 1~35m

Continuity in Rⁿ

Epsilon-delta continuity, uniform continuity, and Lipschitz continuity in Euclidean space. Lipschitz constants control how fast function values change and appear throughout optimization and generalization theory.

Prereqs
Metric Spaces, Convergence, and Completeness
L0ATier 1~55m

Differentiation in Rⁿ

Partial derivatives, directional derivatives, gradients, total derivatives, and the multivariable chain rule. The point is not notation: differentiability means one linear map predicts all small directions.

Prereqs
Sets, Functions, and RelationsVectors, Matrices, and Linear MapsContinuity in Rⁿ
L0ATier 1~35m

Distributions Atlas

A connection map for the parametric families used in statistical inference and ML. Lists each family with its support, parameterization, and the transformations that move between families: sum, mixture, conjugacy, limiting case, and ratio constructions.

Prereqs
Common Probability DistributionsRandom VariablesMoment Generating Functions
L0ATier 1~55m

Eigenvalues and Eigenvectors

Eigenvalues and eigenvectors: the directions a matrix scales without rotating. Characteristic polynomial, diagonalization, the spectral theorem for symmetric matrices, and the direct connection to PCA.

Prereqs
Matrix Operations and PropertiesInner Product Spaces and OrthogonalityLinear Independence+2 more
L0ATier 1~55m

Expectation, Variance, Covariance, and Moments

Expectation as average value, variance as squared spread, covariance as joint movement, and the moment identities used throughout statistics and ML.

Prereqs
Random VariablesCommon Probability DistributionsJoint, Marginal, and Conditional Distributions+1 more
L0ATier 1~50m

Exponential Distribution

The Exponential distribution as the memoryless waiting time: density, CDF, MGF, the memoryless property as a characterization, the Poisson-process inter-arrival construction, the minimum of independent exponentials, MLE for the rate, and the bridge to the Gamma distribution.

Prereqs
Common Probability DistributionsDistributions AtlasExponential Function Properties
L0ATier 1~30m

Exponential Function Properties

The exponential function exp(x): series definition, algebraic properties, and why it appears everywhere in ML. Softmax, MGFs, the Chernoff method, Boltzmann distributions, and exponential families all reduce to properties of exp.

L0ATier 1~55m

Gamma Distribution

The Gamma distribution as the sum of independent Exponentials and as a flexible nonnegative density: shape and rate, density and MGF, conjugacy for Poisson and Exponential likelihoods, Chi-squared as a special case, MLE without closed form.

Prereqs
Common Probability DistributionsDistributions AtlasExponential Distribution+1 more
L1Tier 1~40m

Gram Matrices and Kernel Matrices

The Gram matrix encodes pairwise inner products of a dataset. Always PSD. Appears in kernel methods, PCA, SVD, and attention. Connects linear algebra to ML.

Prereqs
Inner Product Spaces and OrthogonalityEigenvalues and EigenvectorsDistance Metrics Compared+2 more
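
A small assumed example of the PSD claim above: for any data matrix X, the Gram matrix G = XXᵀ of pairwise inner products has only non-negative eigenvalues (numpy usage is illustrative).

```python
# Hedged sketch: Gram matrices are symmetric positive semidefinite.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))     # six points in R^3 (assumed toy data)
G = X @ X.T                         # pairwise inner products
eigvals = np.linalg.eigvalsh(G)
print("smallest eigenvalue (>= 0 up to rounding):", eigvals.min())
```
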
L0ATier 1~40m

Inner Product Spaces and Orthogonality

Inner product axioms, Cauchy-Schwarz inequality, orthogonality, Gram-Schmidt, projections, and the bridge to Hilbert spaces.

Prereqs
Vectors, Matrices, and Linear Maps
L0ATier 1~40m

Joint, Marginal, and Conditional Distributions

Joint distributions, marginalization, conditional distributions, independence, Bayes theorem, and the chain rule of probability.

Prereqs
Random VariablesCommon Probability DistributionsKolmogorov Probability Axioms
L1Tier 1~45m

KL Divergence

Kullback-Leibler divergence measures how one probability distribution differs from another. Asymmetric, always non-negative, and central to variational inference, MLE, and RLHF.

Prereqs
Common Probability DistributionsInformation Theory FoundationsDistance Metrics Compared+1 more
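
A toy check of the asymmetry and non-negativity stated above, with two assumed discrete distributions:

```python
# Illustrative only: KL(p || q) differs from KL(q || p), and both are >= 0.
import numpy as np

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.1, 0.4, 0.5])

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

print("KL(p||q):", kl(p, q))
print("KL(q||p):", kl(q, p))   # a different value: KL is not symmetric
```
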
L0ATier 1~30m

Kolmogorov Probability Axioms

The three axioms (non-negativity, normalization, countable additivity) that every probability claim on this site implicitly invokes. Sample space, event sigma-algebra, probability measure, and the immediate consequences.

Prereqs
Sets, Functions, and Relations
L0ATier 1~35m

Linear Independence

A set of vectors is linearly independent if no vector is a linear combination of the others. The concept underpins basis, dimension, rank, the column-space test for Ax = b, and every overdetermined-versus-underdetermined diagnosis in linear regression and gradient-based optimization.

Prereqs
Vectors, Matrices, and Linear Maps
L0ATier 1~35m

Matrix Norms

Frobenius, spectral, and nuclear norms for matrices. Submultiplicativity. When and why each norm appears in ML theory.

Prereqs
Vectors, Matrices, and Linear Maps
L0ATier 1~50m

Matrix Operations and Properties

Essential matrix operations for ML: trace, determinant, inverse, pseudoinverse, Schur complement, and the Sherman-Morrison-Woodbury formula. When and why each matters.

Prereqs
Sets, Functions, and RelationsLinear Independence
L0ATier 1~55m

Metric Spaces, Convergence, and Completeness

Metric space axioms, convergence of sequences, Cauchy sequences, completeness, and the Banach fixed-point theorem.

Prereqs
Sets, Functions, and Relations
L0ATier 1~60m

Normal Distribution

The Normal distribution as a parametric family: density, moment generating function, closure under affine transformations and sums, MLE for mean and variance, Fisher information, and the bridge to the Chi-squared, Student-t, and F sampling distributions.

Prereqs
Common Probability DistributionsDistributions AtlasExponential Function Properties+2 more
L1Tier 1~45m

Numerical Stability and Conditioning

Continuous math becomes real only through finite-precision approximation. Condition numbers, backward stability, catastrophic cancellation, and why theorems about reals do not transfer cleanly to floating-point.

Prereqs
Floating-Point ArithmeticMatrix Operations and PropertiesMatrix Norms
L0ATier 1~50m

Poisson Distribution

The Poisson distribution as the rare-event limit of the Binomial and as the count law of a Poisson process: PMF, MGF, mean equals variance, additivity, thinning, superposition, MLE, and the connection to the Exponential and Gamma.

Prereqs
Common Probability DistributionsDistributions AtlasExponential Distribution
L0ATier 1~25m

Positive Semidefinite Matrices

PSD matrices: equivalent characterizations, Cholesky decomposition, Schur complement, and Loewner ordering. Covariance matrices are PSD. Hessians of convex functions are PSD. These facts connect linear algebra to optimization and statistics.

Prereqs
Eigenvalues and Eigenvectors
L0ATier 1~70m

Random Variables

Random variables as measurement rules, their distributions, expectation and variance, and the rigorous measurable-map definition used in probability theory.

Prereqs
Kolmogorov Probability AxiomsSets, Functions, and Relations
L0ATier 1~40m

Sets, Functions, and Relations

The language underneath all of mathematics: sets, Cartesian products, functions, injectivity, surjectivity, equivalence relations, and quotient sets.

Prereqs
Basic Logic and Proof Techniques
L0ATier 1~60m

Singular Value Decomposition

The SVD A = UΣVᵀ: the most important matrix factorization in applied mathematics. Geometric interpretation, relationship to eigendecomposition, low-rank approximation via Eckart-Young, and applications from PCA to pseudoinverses.

Prereqs
Eigenvalues and EigenvectorsLinear IndependenceMatrix Norms+1 more
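
A minimal numpy sketch (random matrix and rank chosen for illustration) of the Eckart-Young fact mentioned above: truncating the SVD to the top k singular triplets gives the best rank-k approximation, with Frobenius error equal to the root-sum-of-squares of the discarded singular values.

```python
# Hedged sketch of Eckart-Young low-rank approximation via the SVD.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]                 # best rank-2 approximation
print("Frobenius error:", np.linalg.norm(A - A_k, ord="fro"))
print("discarded singular values:", np.sqrt((s[k:] ** 2).sum()))   # same number
```
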
L1Tier 1~55m

Skewness, Kurtosis, and Higher Moments

Distribution shape beyond mean and variance: skewness measures tail asymmetry, kurtosis measures tail heaviness, cumulants are the cleaner language, and heavy-tailed distributions break all of these.

Prereqs
Common Probability DistributionsExpectation, Variance, Covariance, and Moments
L0ATier 1~40m

Taylor Expansion

Taylor approximation in one and many variables. Every optimization algorithm is a Taylor approximation: gradient descent uses first order, Newton's method uses second order.

Prereqs
Continuity in RⁿDifferentiation in Rⁿ
L0ATier 1~55m

Tensors and Tensor Operations

What a tensor actually is: a multilinear map with specific transformation rules, how tensor contraction generalizes matrix multiplication, Einstein summation, tensor decompositions, and how ML frameworks use the word tensor to mean multidimensional array.

Prereqs
Eigenvalues and EigenvectorsPandas and NumPy Fundamentals
L1Tier 1~35m

Total Variation Distance

Total variation distance measures the largest possible discrepancy between two probability distributions: equivalently half the L1 gap, one minus the overlap mass, or the minimum disagreement probability under a coupling.

Prereqs
Common Probability DistributionsMeasure-Theoretic Probability
L0ATier 1~45m

Vectors, Matrices, and Linear Maps

Vector spaces, linear maps, matrix representation, rank, nullity, and the rank-nullity theorem. The algebraic backbone of ML.

Prereqs
Sets, Functions, and Relations
L0ATier 2~50m

Basic Logic and Proof Techniques

The proof vocabulary behind serious mathematics and ML theory: implication, quantifiers, contradiction, contrapositive, induction, construction, cases, and counterexamples.

L0ATier 2~45m

Cantor's Theorem and Uncountability

Cantor's diagonal argument proves the reals are uncountable. The power set of any set has strictly greater cardinality. These results are the origin of the distinction between countable and uncountable infinity.

Prereqs
Zermelo-Fraenkel Set Theory
L0ATier 2~35m

Cardinality and Countability

Two sets have the same cardinality when a bijection exists between them. The naturals, integers, and rationals are countable. The reals are uncountable, proved by Cantor's diagonal argument.

Prereqs
Sets, Functions, and Relations
L0ATier 2~35m

Counting and Combinatorics

Counting principles, binomial and multinomial coefficients, inclusion-exclusion, and Stirling's approximation. These tools appear whenever you count hypotheses, bound shattering coefficients, or analyze combinatorial arguments in learning theory.

Prereqs
Basic Logic and Proof Techniques
L1Tier 2~30m

Cramér-Wold Theorem

A multivariate distribution is uniquely determined by all of its one-dimensional projections. This reduces multivariate convergence in distribution to checking univariate projections, and is the standard tool for proving the multivariate CLT.

Prereqs
Central Limit TheoremMeasure-Theoretic Probability
L0ATier 2~35m

Hypergeometric Distribution

The Hypergeometric distribution is the law of the success count in n draws without replacement from a finite population of N items, K of which are successes. The PMF is the ratio of three binomial coefficients. The mean is nK/N, identical to the Binomial with p = K/N, but the variance carries the finite-population correction (N-n)/(N-1). The Binomial is the n << N limit. Fisher's exact test, capture-recapture, and quality-control acceptance sampling read off the Hypergeometric directly.

Prereqs
Common Probability DistributionsDistributions Atlas
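
A small scipy check of the mean and finite-population correction stated above (N, K, n are assumed; note that scipy's hypergeom takes the arguments as total population, number of success states, number of draws):

```python
# Illustrative sketch: Hypergeometric vs. Binomial(n, K/N) mean and variance.
from scipy import stats

N, K, n = 50, 20, 10
hg = stats.hypergeom(N, K, n)      # scipy order: (population, successes, draws)
bi = stats.binom(n, K / N)

print("means:", hg.mean(), bi.mean())             # both equal n*K/N = 4.0
fpc = (N - n) / (N - 1)                           # finite-population correction
print("variances:", hg.var(), bi.var() * fpc)     # equal
```
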
L0ATier 2~40m

Integration and Change of Variables

Riemann integration, improper integrals, the substitution rule, multivariate change of variables via the Jacobian determinant, and Fubini's theorem. The computational backbone of probability and ML.

L0ATier 2~40m

Inverse and Implicit Function Theorem

The inverse function theorem guarantees local invertibility when the Jacobian is nonsingular. The implicit function theorem guarantees that constraint surfaces are locally graphs. Both are essential for constrained optimization and implicit layers.

Prereqs
The Jacobian Matrix
L0ATier 2~40m

Lognormal Distribution

A random variable is Lognormal if its logarithm is Normal. The density, mean, variance, median, and mode all have closed forms in the two underlying Normal parameters. The Lognormal is the multiplicative analogue of the Normal: a product of many independent positive factors is approximately Lognormal in the same way a sum is approximately Normal. Applications cover financial returns (with the heavy-tail caveat that real returns are heavier than Lognormal), particle sizes, lifetimes, and insurance severity.

Prereqs
Common Probability DistributionsNormal DistributionCentral Limit Theorem+1 more
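
A quick simulation of the multiplicative-analogue claim above (factor distribution and sizes are assumed): the log of a product of many independent positive factors is a sum, so the product is approximately Lognormal.

```python
# Hedged sketch: products of i.i.d. positive factors look Lognormal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
factors = rng.uniform(0.9, 1.1, size=(50_000, 200))   # assumed positive factors
products = factors.prod(axis=1)

print("skew of log(products):", stats.skew(np.log(products)))   # near 0: roughly Normal
print("skew of products:", stats.skew(products))                # positive right tail
```
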
L1Tier 2~50m

Markov Chains and Steady State

Markov chains: the Markov property, transition matrices, stationary distributions, irreducibility, aperiodicity, the ergodic theorem, and mixing time. The backbone of PageRank, MCMC, and reinforcement learning.

Prereqs
Common Probability DistributionsEigenvalues and EigenvectorsPageRank Algorithm
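
A minimal numpy sketch (transition matrix assumed) of the stationary-distribution idea above: the stationary distribution is the left eigenvector of the transition matrix with eigenvalue 1, normalized to sum to one.

```python
# Illustrative sketch: stationary distribution of a small Markov chain.
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])          # rows sum to 1 (assumed chain)

vals, vecs = np.linalg.eig(P.T)          # left eigenvectors of P
i = int(np.argmin(np.abs(vals - 1.0)))   # pick the eigenvalue-1 eigenvector
pi = np.real(vecs[:, i])
pi = pi / pi.sum()
print("stationary distribution:", pi)
print("pi P == pi:", np.allclose(pi @ P, pi))
```
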
L0ATier 2~35m

Moment Generating Functions

Moment generating functions encode moments, control light-tailed behavior, and power Chernoff bounds, sub-Gaussian estimates, and exponential-family theory.

Prereqs
Expectation, Variance, Covariance, and MomentsCommon Probability DistributionsExponential Function Properties
L1Tier 2~45m

Pareto Distribution

The Pareto distribution is the canonical power-law on a half-line. The Type I parameterization has survival function (x_m/x)^alpha for x at least x_m. The shape parameter alpha is the tail index. Three regimes of alpha matter for the law of large numbers and the central limit theorem: alpha at most 1 has no finite mean and breaks the LLN; 1 < alpha at most 2 has finite mean but infinite variance so the standard CLT fails (generalized CLT to a stable law); alpha greater than 2 admits both LLN and CLT in the usual form. Applications: wealth, city sizes, file sizes, network degree, insurance severity. The 80/20 'Pareto principle' is a specific case requiring alpha approximately 1.16.

Prereqs
Common Probability DistributionsCentral Limit TheoremLaw of Large Numbers+1 more
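
A rough simulation of the regime claim above (shape values and sample size assumed): for alpha below 1 the running sample mean never settles, while for alpha above 2 it converges as the LLN predicts.

```python
# Hedged sketch: running means of Pareto (Type I, x_m = 1) samples across alpha regimes.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
for alpha in (0.8, 3.0):
    x = rng.pareto(alpha, size=n) + 1.0             # classical Pareto with x_m = 1
    running_mean = np.cumsum(x) / np.arange(1, n + 1)
    print(f"alpha={alpha}: mean after 10k = {running_mean[9_999]:.2f}, "
          f"after 200k = {running_mean[-1]:.2f}")
```
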
L0ATier 2~35m

Sequences and Series of Functions

Pointwise vs uniform convergence of function sequences, the Weierstrass M-test, and why uniform convergence preserves continuity. The concept that makes learning theory work.

Prereqs
Metric Spaces, Convergence, and Completeness
L1Tier 2~60m

Signals and Systems for ML

Linear time-invariant systems, convolution, Fourier transform, and the sampling theorem. The signal processing foundations that underpin CNNs, efficient attention, audio ML, and frequency-domain analysis of training dynamics.

L0ATier 2~35m

Triangular Distribution

A bounded continuous distribution defined by a lower endpoint, upper endpoint, and mode. Includes the PDF, CDF, moments, sum-of-uniforms construction, and why the shape is a useful bridge between geometry and probability.

Prereqs
Common Probability DistributionsRandom Variables
L1Tier 2~45m

Weibull Distribution

The Weibull distribution is the standard parametric model for failure-time data in reliability and survival analysis. Two parameters: a shape k and a scale lambda. The hazard rate is monotone in time, increasing when k > 1 (wear-out), constant when k = 1 (Exponential, memoryless), decreasing when k < 1 (early-failure / infant-mortality). Mean is lambda Gamma(1 + 1/k); variance is lambda squared times (Gamma(1 + 2/k) minus Gamma(1 + 1/k) squared). MLE for k has no closed form and is solved by Newton-Raphson on the profile score equation. Applications: component lifetimes, time to event in clinical trials, wind speeds, extreme-value Type III.

Prereqs
Common Probability DistributionsExponential DistributionGamma Distribution+1 more
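
A Monte Carlo check of the mean formula quoted above (shape and scale are assumed): the sample mean of lambda-scaled Weibull draws should match lambda times Gamma(1 + 1/k).

```python
# Illustrative sketch: Weibull mean vs. the closed form lam * Gamma(1 + 1/k).
import numpy as np
from math import gamma

rng = np.random.default_rng(0)
k, lam = 1.5, 2.0                                   # assumed shape and scale
samples = lam * rng.weibull(k, size=1_000_000)      # numpy's weibull uses scale 1

print("Monte Carlo mean:     ", samples.mean())
print("lam * Gamma(1 + 1/k): ", lam * gamma(1 + 1 / k))
```
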

Mathematical Infrastructure

29 topics / L0A, L0B, L1, L2, L3 / Tiers 1-3

Serious math machinery: measure theory, functional analysis, convex duality.

L1Tier 1~50m

Automatic Differentiation

Forward mode computes Jacobian-vector products, reverse mode computes vector-Jacobian products: backpropagation is reverse-mode autodiff, and the asymmetry between the two modes explains why training neural networks is efficient.

Prereqs
The Jacobian MatrixDifferentiation in RⁿMatrix Calculus+3 more
L0BTier 1~30m

Borel-Cantelli Lemmas

The two lemmas that bridge in-probability convergence to almost-sure convergence. First lemma: summable probabilities imply the events occur only finitely often; second lemma: independence plus a divergent sum of probabilities implies they occur infinitely often.

Prereqs
Measure-Theoretic ProbabilityModes of Convergence of Random Variables
L1Tier 1~40m

Characteristic Functions

The Fourier transform of a probability distribution. Always exists (unlike MGFs), uniquely determines the distribution, multiplies under independent sums, and powers the rigorous proof of the central limit theorem via Lévy's continuity theorem.

Prereqs
Measure-Theoretic ProbabilityMoment Generating Functions
L2Tier 1~75m

Convex Duality

Fenchel conjugates, the Fenchel-Moreau theorem, weak and strong duality, KKT conditions, and why duality gives the kernel trick for SVMs, connects regularization to constraints, and enables adversarial formulations in DRO.

Prereqs
Convex Optimization BasicsInverse and Implicit Function TheoremSubgradients and Subdifferentials
L1Tier 1~50m

Matrix Calculus

The differentiation identities you actually use in ML: derivatives of traces, log-determinants, and quadratic forms with respect to matrices and vectors.

Prereqs
The Jacobian MatrixThe Hessian Matrix
L0BTier 1~80m

Measure-Theoretic Probability

The rigorous foundations of probability: sigma-algebras, measures, measurable functions as random variables, Lebesgue integration, and the convergence theorems that make modern probability and statistics possible.

Prereqs
Cardinality and CountabilityIntegration and Change of VariablesKolmogorov Probability Axioms+3 more
L0BTier 1~40m

Modes of Convergence of Random Variables

The four standard senses in which a sequence of random variables can converge: almost surely, in probability, in Lp, and in distribution. Their hierarchy, the strict counterexamples that separate them, and the supporting tools (Slutsky, continuous mapping, Skorokhod).

Prereqs
Measure-Theoretic ProbabilityMetric Spaces, Convergence, and Completeness
L0BTier 1~80m

Radon-Nikodym and Conditional Expectation

The Radon-Nikodym theorem: what 'density' really means. Absolute continuity, the Radon-Nikodym derivative, conditional expectation as a projection, tower property, and why this undergirds likelihood ratios, importance sampling, and KL divergence.

Prereqs
Measure-Theoretic Probability
L0ATier 1~40m

The Hessian Matrix

The matrix of second partial derivatives: encodes curvature, classifies nondegenerate critical points (and is inconclusive at degenerate ones), and is the central object in second-order optimization.

Prereqs
Matrix Operations and PropertiesEigenvalues and EigenvectorsDifferentiation in Rⁿ+2 more
L0ATier 1~35m

The Jacobian Matrix

The matrix of all first partial derivatives of a vector-valued function: encodes the best linear approximation, connects to the chain rule in matrix form, and is the backbone of backpropagation.

Prereqs
Differentiation in Rⁿ
L0ATier 1~25m

Vector Calculus Chain Rule

The chain rule for compositions of multivariable maps. Jacobians multiply when functions compose; gradients of scalar-valued compositions become vector-Jacobian products. The result that makes backpropagation a one-line theorem.

Prereqs
The Jacobian MatrixVectors, Matrices, and Linear MapsDifferentiation in Rⁿ
L3Tier 2~50m

Backward Stochastic Differential Equations

The Pardoux-Peng framework: an SDE with a terminal condition and an adapted solution pair. Linear BSDEs reduce to Feynman-Kac; nonlinear BSDEs are dual to Hamilton-Jacobi-Bellman PDEs and are the mathematical object that the deep BSDE method approximates.

Prereqs
Stochastic Differential EquationsIto's LemmaFeynman–Kac Formula+1 more
L0ATier 2~18m

Complex Numbers for Fourier

The minimum complex-number machinery you need to read Fourier-domain pages: the field C, the imaginary unit i, modulus and argument, polar form, Euler's formula, and roots of unity. Not complex analysis, just the grounders.

Prereqs
Vectors, Matrices, and Linear Maps
L0ATier 2~28m

Divergence, Curl, and Line Integrals

The vector-calculus operators that appear in Fokker-Planck, score-based diffusion, and PINN papers. Definitions of divergence, curl, gradient, line integrals, and Green's theorem in the plane. Compact, not encyclopedic.

Prereqs
The Jacobian MatrixVector Calculus Chain Rule
L3Tier 2~45m

Feynman–Kac Formula

The probabilistic representation of solutions to linear parabolic PDEs as expectations over SDE trajectories. The bridge that lets you Monte Carlo a PDE, the reason high-dimensional Black-Scholes is tractable, and the foundation under every backward-SDE method including deep BSDE.

Prereqs
Stochastic Differential EquationsIto's Lemma
L3Tier 2~50m

Fokker–Planck Equation

The deterministic PDE for the time-evolving density of an SDE. The bridge that lets you reason about Langevin samplers, diffusion models, and stochastic optimization in PDE language: stationary distributions become null spaces, mixing times become spectral gaps, score functions become drift-density couplings.

Prereqs
Stochastic Differential EquationsPDE Fundamentals for Machine LearningDivergence, Curl, and Line Integrals+1 more
L0BTier 2~75m

Functional Analysis Core

The four pillars of functional analysis: Hahn-Banach (extending functionals), Uniform Boundedness (pointwise bounded implies uniformly bounded), Open Mapping (surjective bounded operators have open images), and Banach-Alaoglu (dual unit ball is weak-* compact). These underpin RKHS theory, optimization in function spaces, and duality.

Prereqs
Metric Spaces, Convergence, and CompletenessInner Product Spaces and OrthogonalityMeasure-Theoretic Probability
L3Tier 2~50m

Hamilton–Jacobi–Bellman Equation

The PDE characterizing the value function of a continuous-time stochastic optimal control problem. The continuous-time analog of the discrete Bellman equation, the fully nonlinear PDE that nonlinear Feynman–Kac inverts via BSDEs, and the equation Deep BSDE solves numerically in high dimensions.

Prereqs
Stochastic Differential EquationsFeynman–Kac Formula
L2Tier 2~45m

Implicit Differentiation

Differentiating through implicit equations and optimization problems: the implicit function theorem gives dy/dx without solving for y explicitly. Applications to bilevel optimization, deep equilibrium models, hyperparameter optimization, and meta-learning.

Prereqs
The Jacobian MatrixAutomatic Differentiation
L0BTier 2~70m

Information Theory Foundations

The core of information theory for ML: entropy, cross-entropy, KL divergence, mutual information, data processing inequality, and the chain rules that connect them. The language of variational inference, generalization bounds, and representation learning.

L3Tier 2~50m

Ito's Lemma

The chain rule of stochastic calculus: if a process follows an SDE, applying a smooth function to it yields a modified SDE with an extra second-order correction term that has no analogue in ordinary calculus.

Prereqs
Stochastic Calculus for ML
L0BTier 2~70m

Martingale Theory

Martingales and their convergence properties: the Doob martingale, the optional stopping theorem, martingale convergence, the Azuma-Hoeffding inequality, and Freedman's inequality. The tools behind McDiarmid's inequality, online learning regret bounds, and stochastic approximation.

Prereqs
Measure-Theoretic Probability
L1Tier 2~30m

Non-Euclidean and Hyperbolic Geometry

The geometry that drops the parallel postulate. Hyperbolic and spherical models, sectional curvature, the Poincaré disk, and why hyperbolic spaces embed tree-structured data with low distortion. The grounding for graph embeddings on curved spaces.

Prereqs
Metric Spaces, Convergence, and CompletenessVectors, Matrices, and Linear Maps
L1Tier 2~70m

PDE Fundamentals for Machine Learning

The partial differential equations that appear in modern machine learning: heat and Fokker-Planck for diffusion, continuity for flow matching, Hamilton-Jacobi-Bellman for reinforcement learning, Poisson for score matching. Classification, solution concepts, and where ML actually needs PDE theory versus where it just uses the vocabulary.

Prereqs
Fast Fourier TransformEigenvalues and EigenvectorsStochastic Differential Equations+3 more
L3Tier 2~55m

Stochastic Differential Equations

SDEs of the form dX = b dt + sigma dB: strong and weak solutions, existence and uniqueness under Lipschitz conditions, Euler-Maruyama discretization, and the canonical examples that appear throughout ML (Ornstein-Uhlenbeck, geometric Brownian motion, Langevin dynamics).

Prereqs
Brownian MotionIto's LemmaStochastic Calculus for ML
L3Tier 2~50m

Time Reversal of SDEs

Anderson 1982: any forward Ito SDE has an explicit time-reversed SDE whose drift involves the score function (gradient of log density). The single result that turns a forward noising process into a generative sampler and underlies every score-based diffusion model.

Prereqs
Stochastic Differential EquationsFokker–Planck Equation
L3Tier 3~60m

Information Geometry

Riemannian geometry on the space of probability distributions: the Fisher information metric, natural gradient descent, exponential families as dually flat manifolds, and the connection to mirror descent.

Prereqs
Fisher Information: Curvature, KL Geometry, and the Natural GradientConvex DualityNon-Euclidean and Hyperbolic Geometry+1 more
L0BTier 3~65m

Spectral Theory of Operators

Spectral theorem for compact self-adjoint operators on Hilbert spaces: every such operator has a countable orthonormal eigenbasis with real eigenvalues accumulating only at zero. This is the infinite-dimensional backbone of PCA, kernel methods, and neural tangent kernel theory.

Prereqs
Eigenvalues and EigenvectorsComplex Numbers for FourierFunctional Analysis Core
L3Tier 3~35m

Stochastic Calculus for ML

Brownian motion, Ito integrals, Ito's lemma, and stochastic differential equations: the mathematical machinery behind diffusion models, score-based generative models, and Langevin dynamics.

Prereqs
Martingale TheoryMeasure-Theoretic ProbabilityClassical ODEs: Existence, Stability, and Numerical Methods

Statistical Estimation

27 topics / L0B, L1, L2 / Tiers 1-3

MLE, Fisher information, Cramér-Rao, LLN, CLT — the estimation core.

L0BTier 1~90m

Asymptotic Statistics: M-Estimators, Delta Method, LAN

The large-sample toolbox for statistical inference: continuous mapping theorem, Slutsky, the delta method, M- and Z-estimator consistency and asymptotic normality, MLE as a special M-estimator, local asymptotic normality (Le Cam), the asymptotic equivalence of Wald / score / likelihood-ratio tests, and influence-function representations. These results justify essentially every confidence interval, standard error, and p-value in applied statistics, and they are the language of modern semiparametric theory.

Prereqs
Central Limit TheoremMaximum Likelihood Estimation: Theory, Information Identity, and Asymptotic EfficiencyModes of Convergence of Random Variables+3 more
L2Tier 1~110m

Bayesian Linear Regression

Gaussian prior, Gaussian likelihood, Gaussian posterior. Full posterior derivation by completing the square in the exponent: the posterior mean equals the ridge estimator, the predictive distribution has irreducible plus epistemic variance, and the marginal likelihood gives a closed-form hyperparameter selection criterion. Worked numeric example with three data points carries the algebra end to end.

Prereqs
Linear RegressionRidge RegressionThe Multivariate Normal Distribution+4 more
L2Tier 1~60m

Bootstrap Methods

The nonparametric bootstrap: resample with replacement to approximate sampling distributions, construct confidence intervals, and quantify uncertainty without distributional assumptions.

Prereqs
Common Probability DistributionsAsymptotic Statistics: M-Estimators, Delta Method, LANCentral Limit Theorem+4 more
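
A minimal sketch of the resampling recipe above (toy data and the percentile interval are assumed choices): resample with replacement, recompute the statistic, and read confidence limits off the bootstrap quantiles.

```python
# Hedged sketch: nonparametric bootstrap percentile interval for a median.
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=80)          # assumed sample
B = 5_000
boot_medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True)) for _ in range(B)
])
lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print("sample median:", np.median(data))
print("95% percentile bootstrap CI:", (lo, hi))
```
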
L0BTier 1~75m

Central Limit Theorem

The CLT family: classical Lindeberg-Levy, the Lindeberg-Feller condition for non-identical summands, Berry-Esseen rate, multivariate CLT, the delta method, and the martingale CLT that underwrites SGD asymptotics. Why these results power MLE asymptotic normality, bootstrap validity, and the Gaussian-noise approximation of stochastic optimization, and where the CLT breaks in high dimensions and under heavy tails.

Prereqs
Law of Large NumbersExpectation, Variance, Covariance, and MomentsCommon Probability Distributions+2 more
L1Tier 1~60m

Chi-Squared Distribution and Tests

The Chi-squared distribution as sum of squared standard Normals and as the sampling distribution of the scaled sample variance, plus the two Pearson Chi-squared tests: goodness of fit for cell counts and independence in contingency tables.

Prereqs
Distributions AtlasNormal DistributionGamma Distribution+1 more
L0BTier 1~80m

Conjugate Priors

When the prior and likelihood are paired so the posterior stays in the same family as the prior. Definition via exponential families, the standard table (Beta-Bernoulli, Dirichlet-multinomial, Normal-Normal, Normal-inverse-gamma, Gamma-Poisson), worked Normal-Normal updates in 1D and the multivariate case, and the pseudo-observation interpretation that makes conjugacy a feature, not a coincidence.

Prereqs
Bayesian EstimationMaximum Likelihood Estimation: Theory, Information Identity, and Asymptotic EfficiencySufficient Statistics and Exponential Families+3 more
L0BTier 1~90m

Cramér-Rao Bound: Information Inequality, Achievability, and Sharper Variants

The fundamental information inequality for unbiased estimation. Coverage of the scalar and multivariate Cramér-Rao bounds, the extension to biased estimators, achievability in exponential families, the Bhattacharyya higher-order bounds, the Hammersley-Chapman-Robbins bound (no regularity required), the van Trees Bayesian inequality, and efficient information with nuisance parameters.

Prereqs
Fisher Information: Curvature, KL Geometry, and the Natural GradientMaximum Likelihood Estimation: Theory, Information Identity, and Asymptotic EfficiencyKL Divergence
L2Tier 1~40m

The EM Algorithm

Expectation-Maximization: the principled way to do maximum likelihood when some variables are unobserved. Derives the ELBO, proves monotonic convergence, and shows why EM is the backbone of latent variable models.

Prereqs
Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic EfficiencyConvex Optimization BasicsSufficient Statistics and Exponential Families+1 more
L1Tier 1~55m

F-Distribution and ANOVA

The F-distribution as a ratio of two scaled Chi-squareds, and the one-way analysis of variance F-test built on it: between-group versus within-group variance decomposition, exact null distribution under Normality and equal variances, and the link to two-sample t-tests.

Prereqs
Distributions AtlasChi-Squared Distribution and TestsHypothesis Testing for ML+1 more
L0BTier 1~90m

Fisher Information: Curvature, KL Geometry, and the Natural Gradient

The Fisher information measures how much a sample tells you about an unknown parameter. Coverage of the score function and Bartlett identities, the equivalence of variance-of-score and expected-negative-Hessian forms, the KL-divergence Hessian identity, the Fisher information matrix and Loewner ordering, the Cramér-Rao link, the natural gradient (parameterization invariance), K-FAC and empirical Fisher in deep learning, and applications to MLE asymptotic efficiency, EWC, and information geometry.

Prereqs
Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic EfficiencyKL DivergenceBasu's Theorem+2 more
L0BTier 1~50m

Law of Large Numbers

The weak and strong laws of large numbers: the sample mean converges to the population mean. Kolmogorov's conditions, the rate of convergence from the CLT, and why LLN justifies using empirical risk as a proxy for population risk.

Prereqs
Random VariablesExpectation, Variance, Covariance, and MomentsCommon Probability Distributions+2 more
L2Tier 1~55m

Likelihood-Ratio, Wald, and Score Tests

The three asymptotic tests built from the likelihood: LRT compares maximized likelihoods, Wald compares the MLE to the null using the inverse Fisher information, score uses the gradient of the log-likelihood at the null. All three are asymptotically Chi-squared and equivalent under regularity; they disagree in finite samples and choice depends on the problem.

Prereqs
Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic EfficiencyFisher Information: Curvature, KL Geometry, and the Natural GradientNeyman-Pearson and Hypothesis Testing Theory+2 more
L0BTier 1~70m

Maximum A Posteriori (MAP) Estimation

Maximum a posteriori estimation as the posterior mode of a Bayesian model: derivation, the flat-prior recovery of MLE, the worked L2-norm-equals-Gaussian-prior and L1-norm-equals-Laplace-prior equivalences that make ridge and lasso Bayesian, and the invariance failure under reparameterization that distinguishes MAP from MLE.

Prereqs
Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic EfficiencyBayesian EstimationCommon Probability Distributions+1 more
L0BTier 1~90m

Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency

MLE: pick the parameter that maximizes the likelihood of observed data. Score function, Bartlett identities, regularity conditions, consistency, asymptotic normality, Wilks' theorem, Cramér-Rao efficiency, exponential families, QMLE under misspecification, and the bridge to deep-learning negative log-likelihood training.

Prereqs
Common Probability DistributionsDifferentiation in RⁿCentral Limit Theorem+5 more
L0BTier 1~85m

The Multivariate Normal Distribution

The multivariate Gaussian as the joint of d correlated random variables: density derivation from standard normals via affine maps, the completing-the-square recipe, Schur-complement marginals and conditionals, the MGF and characteristic function, and the algebraic identities that power every Bayesian Gaussian derivation downstream.

Prereqs
Common Probability DistributionsJoint, Marginal, and Conditional DistributionsExpectation, Variance, Covariance, and Moments+4 more
L2Tier 1~45m

Permutation Tests

Exchangeability-based hypothesis testing: under the null of no group effect, the labels are exchangeable, so the distribution of any test statistic under random relabeling gives an exact null reference. Exact for small samples, approximated by Monte Carlo for large samples, robust under non-Normality and heavy tails.

Prereqs
Hypothesis Testing for MLBootstrap MethodsNeyman-Pearson and Hypothesis Testing Theory
L0BTier 1~90m

Shrinkage Estimation and the James-Stein Estimator: Inadmissibility, SURE, and Brown's Characterization

Why the sample mean is inadmissible in three or more dimensions. Stein's identity, the James-Stein estimator and its positive-part refinement, Stein's unbiased risk estimate, Brown's admissibility characterization, the empirical Bayes interpretation, and the link to ridge regression and modern regularization.

Prereqs
Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic EfficiencyCramér-Rao Bound: Information Inequality, Achievability, and Sharper VariantsMinimax Lower Bounds: Le Cam, Fano, Assouad, and the Reduction to Testing
L1Tier 1~60m

Student-t Distribution and t-Test

The Student-t distribution as a standard Normal divided by the square root of an independent scaled Chi-squared, and the one-sample, two-sample, and paired t-tests it powers: exact null distribution under Normality, Welch correction for unequal variances, and large-sample equivalence to the Wald z-test.

Prereqs
Distributions AtlasNormal DistributionChi-Squared Distribution and Tests+2 more
L0BTier 2~60m

Bayesian Estimation

The Bayesian approach to parameter estimation: encode prior beliefs, update with data via Bayes rule, and obtain a full posterior distribution over parameters. Conjugate priors, MAP estimation, and the Bernstein-von Mises theorem showing that the posterior concentrates around the true parameter.

Prereqs
Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic EfficiencyCommon Probability DistributionsJoint, Marginal, and Conditional Distributions+1 more
L1Tier 2~30m

De Moivre-Laplace Theorem

The first central limit theorem, historically. Bin(n, p) is approximately N(np, np(1-p)) for large n, with an explicit continuity correction. Stirling-based proof, Berry-Esseen rate, and where the approximation breaks down (small p, small n, skewed binomials).

Prereqs
Common Probability DistributionsCentral Limit TheoremMoment Generating Functions
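
A quick scipy comparison of the approximation above (n and p assumed): the Normal CDF evaluated with the continuity correction sits closer to the exact Binomial tail than the uncorrected value.

```python
# Illustrative sketch: Normal approximation to a Binomial CDF, with continuity correction.
from scipy import stats

n, p = 100, 0.3
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5

exact = stats.binom(n, p).cdf(35)
plain = stats.norm(mu, sigma).cdf(35)
corrected = stats.norm(mu, sigma).cdf(35.5)     # continuity correction
print("exact:", exact, "normal:", plain, "normal + cc:", corrected)
```
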
L2Tier 2~45m

Empirical Bayes vs Hierarchical Bayes

What changes when hyperparameters are estimated and plugged in versus assigned a prior and integrated out, and why the gap is mostly about uncertainty rather than point estimates.

Prereqs
Bayesian EstimationShrinkage Estimation and the James-Stein Estimator: Inadmissibility, SURE, and Brown's CharacterizationAdjusted Density Maximization+2 more
L1Tier 2~50m

Goodness-of-Fit Tests

KS test, Shapiro-Wilk, Anderson-Darling, and the probability integral transform: testing whether data follows a specified distribution, with exact power comparisons and when each test fails.

Prereqs
Common Probability DistributionsBenford's LawHypothesis Testing for ML
L0BTier 2~40m

Method of Moments

Match sample moments to population moments to estimate parameters. Simpler than MLE but less efficient. Covers classical MoM, generalized method of moments (GMM), and when MoM is the better choice.

Prereqs
Common Probability Distributions
L1Tier 2~30m

Poisson Limit Theorem and Le Cam's Bound

Bin(n, lambda/n) converges to Pois(lambda) as n grows. The classical product-of-PMFs proof, then Le Cam's total-variation bound that makes the approximation quantitative. When to use Poisson vs Normal approximation. Disambiguation: Le Cam published multiple famous theorems.

Prereqs
Common Probability DistributionsCharacteristic FunctionsMoment Generating Functions
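
A small numeric check of the quantitative claim above (lambda and n assumed, support truncated where the mass is negligible): the total variation distance between Bin(n, lambda/n) and Pois(lambda) sits below Le Cam's bound lambda^2 / n.

```python
# Hedged sketch: TV(Bin(n, lam/n), Pois(lam)) vs. Le Cam's bound lam^2 / n.
import numpy as np
from scipy import stats

lam, n = 3.0, 200
ks = np.arange(0, 60)                              # tail beyond 60 is negligible here
binom_pmf = stats.binom(n, lam / n).pmf(ks)
pois_pmf = stats.poisson(lam).pmf(ks)

tv = 0.5 * np.abs(binom_pmf - pois_pmf).sum()
print("TV distance:", tv)
print("Le Cam bound:", lam ** 2 / n)
```
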
L2Tier 2~45m

REML and Variance Component Estimation

Why restricted maximum likelihood estimates variance components from error contrasts rather than the full data likelihood, and why that usually behaves better than ML when fixed effects are present.

Prereqs
Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic EfficiencyLinear RegressionExpectation, Variance, Covariance, and Moments
L0BTier 2~60m

Sufficient Statistics and Exponential Families

Sufficient statistics compress data without losing information about the parameter. The Neyman-Fisher factorization theorem, exponential families, completeness, and Rao-Blackwell improvement of estimators.

Prereqs
Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency
L0BTier 3~35m

Basu's Theorem

A complete sufficient statistic is independent of every ancillary statistic. This provides the cleanest method for proving independence between statistics without computing joint distributions.

Prereqs
Sufficient Statistics and Exponential Families

Bayesian ML Frontier

1 topic / L3 / Tier 1

Bayesian and probabilistic views of tabular foundation models and statistical ML frontiers.

Learning Theory Core

14 topics / L1, L2, L3 / Tiers 1-2

ERM, uniform convergence, VC dimension, Rademacher complexity.

L3Tier 1~65m

Algorithmic Stability

Algorithmic stability provides generalization bounds by analyzing how much a learning algorithm's output changes when a single training example is replaced: a structurally different lens from complexity-based approaches.

Prereqs
Empirical Risk MinimizationVC DimensionConcentration Inequalities+9 more
L2Tier 1~60m

Empirical Risk Minimization

The foundational principle of statistical learning: minimize average loss on training data as a proxy for minimizing true population risk.

Prereqs
Concentration InequalitiesCommon Probability DistributionsCommon Inequalities+7 more
L2Tier 1~45m

Hypothesis Classes and Function Spaces

What is a hypothesis class, why the choice of hypothesis class determines what ERM can learn, and the approximation-estimation tradeoff: bigger classes reduce approximation error but increase estimation error.

Prereqs
Empirical Risk Minimization
L1Tier 1~40m

PAC Learning Framework

The foundational formalization of what it means to learn from data: a concept class is PAC-learnable if an algorithm can, with high probability, find a hypothesis that is approximately correct, using a polynomial number of samples.

Prereqs
Concentration InequalitiesUniform ConvergenceCounting and Combinatorics+7 more
L3Tier 1~80m

Rademacher Complexity

A data-dependent measure of hypothesis class complexity that gives tighter generalization bounds than VC dimension by measuring how well the class can fit random noise on the actual data.

Prereqs
Empirical Risk MinimizationVC DimensionConcentration Inequalities+6 more
L2Tier 1~35m

Realizability Assumption

The technical assumption that the target function lies inside the hypothesis class. Realizable PAC learning is the simpler half of the story; agnostic PAC drops this assumption.

Prereqs
Empirical Risk MinimizationHypothesis Classes and Function Spaces
L2Tier 1~50m

Sample Complexity Bounds

How many samples do you need to learn? Tight answers for finite hypothesis classes, VC classes, and Rademacher-bounded classes, plus matching lower bounds via Fano and Le Cam.

Prereqs
VC DimensionRealizability Assumption
L2Tier 1~65m

Uniform Convergence

Uniform convergence of empirical risk to population risk over an entire hypothesis class: the key property that makes ERM provably work.

Prereqs
Empirical Risk MinimizationAdaptive Learning Is Not IIDBernstein Inequality+4 more
L2Tier 1~75m

VC Dimension

The Vapnik-Chervonenkis dimension: a combinatorial measure of hypothesis class complexity that characterizes learnability in binary classification.

Prereqs
Empirical Risk MinimizationConcentration InequalitiesCounting and Combinatorics+9 more
L2Tier 2~30m

Bias-Complexity Tradeoff

The error of an ERM hypothesis decomposes into approximation error (the price of restricting H) and estimation error (the price of finite data). Richer classes shrink approximation but inflate estimation. The optimal class is the smallest one that still captures the truth.

Prereqs
PAC Learning FrameworkEmpirical Risk MinimizationNo-Free-Lunch Theorem
L2Tier 2~35m

Glivenko-Cantelli Theorem

Uniform almost-sure convergence of the empirical CDF to the true CDF, and the function-class generalization that ML uses for ERM. Covers the classical theorem, the DKW exponential rate (Massart 1990), the Glivenko-Cantelli class definition, and the bridge to VC dimension and Rademacher complexity.

Prereqs
Concentration InequalitiesUniform ConvergenceVC Dimension+1 more
L2Tier 2~55m

Kolmogorov Complexity and MDL

Kolmogorov complexity measures the shortest program that produces a string. The Minimum Description Length principle selects models that compress data best, providing a computable approximation to an incomputable ideal.

Prereqs
P vs NP
L1Tier 2~55m

Loss Functions

The generalized loss-function framework from Shalev-Shwartz & Ben-David Ch 3, the six standard losses (0-1, MSE, MAE, Huber, cross-entropy, hinge), and the bridge from supervised learning to ERM and PAC learnability.

Prereqs
Random VariablesExpectation, Variance, Covariance, and MomentsHypothesis Classes and Function Spaces
L2Tier 2~35m

No-Free-Lunch Theorem

For binary classification with 0-1 loss, no learning algorithm can succeed on every distribution: for any algorithm and any sample size m smaller than half the size of the domain, some realizable distribution forces error at least 1/8 with probability at least 1/7. Universal learners do not exist; prior knowledge enters through the choice of hypothesis class.

Prereqs
PAC Learning FrameworkEmpirical Risk MinimizationLoss Functions Catalog

Concentration & Probability

21 topics / L1, L2, L3 / Tiers 1-2

Hoeffding through matrix Bernstein. The workhorse inequality family.

L2Tier 1~25m

Bennett's Inequality

A variance-aware concentration inequality for independent bounded random variables. The exponent uses the function h(a) = (1+a)log(1+a) - a, the same h that controls the multiplicative Chernoff bound for Bernoullis.

Prereqs
Concentration InequalitiesChernoff BoundsMoment Generating Functions+1 more
L2Tier 1~25m

Bernstein Inequality

A variance-sensitive concentration inequality for independent bounded random variables. Bernstein sharpens Hoeffding when the variance is much smaller than the worst-case range.

Prereqs
Concentration InequalitiesExpectation, Variance, Covariance, and MomentsMoment Generating Functions+2 more
L1Tier 1~45m

Chernoff Bounds

The Chernoff method: the universal technique for deriving exponential tail bounds by optimizing over the moment generating function, yielding the tightest possible exponential concentration.

Prereqs
Concentration InequalitiesMoment Generating Functions
L2Tier 1~25m

Chi-Squared Concentration

Two-sided exponential concentration for chi-squared sums of squared standard Gaussians: P(|Z/k - 1| > t) <= 2 exp(-t^2 k / 6) for t in (0, 1/2). Drives sub-exponential tails, variance-component inference, and Lipschitz Gaussian concentration.

Prereqs
Concentration InequalitiesChernoff BoundsCommon Probability Distributions+1 more
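
A Monte Carlo sanity check of the displayed bound (k and t are assumed values inside the stated range): the empirical two-sided tail sits below 2 exp(-t^2 k / 6).

```python
# Illustrative sketch: empirical chi-squared tail vs. the stated concentration bound.
import numpy as np

rng = np.random.default_rng(0)
k, t = 50, 0.3
Z = rng.chisquare(k, size=1_000_000)

empirical = np.mean(np.abs(Z / k - 1.0) > t)
bound = 2 * np.exp(-t ** 2 * k / 6)
print("empirical tail:", empirical, "bound:", bound)
```
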
L1Tier 1~70m

Concentration Inequalities

Bounds on how far random variables deviate from their expectations: Markov, Chebyshev, Hoeffding, and Bernstein. Used throughout generalization theory, bandits, and sample complexity.

Prereqs
Common Probability DistributionsExpectation, Variance, Covariance, and MomentsCentral Limit Theorem+9 more
L3Tier 1~60m

Epsilon-Nets and Covering Numbers

Discretizing infinite sets for concentration arguments: epsilon-nets, covering numbers, packing numbers, the Dudley integral, and the connection to Rademacher complexity.

Prereqs
Sub-Gaussian Random VariablesConcentration InequalitiesContraction Inequality+2 more
L2Tier 1~50m

Fat Tails and Heavy-Tailed Distributions

When the tails dominate. Power laws, Pareto distributions, subexponential tails, why the law of large numbers converges slowly or fails, and why most of ML silently assumes thin tails.

Prereqs
Common Probability DistributionsExpectation, Variance, Covariance, and MomentsLaw of Large Numbers+4 more
L1Tier 1~30m

Hoeffding's Lemma

The MGF bound that powers Hoeffding's inequality: a centered random variable on [a,b] has sub-Gaussian moment generating function with parameter (b-a)^2/4.

Prereqs
Concentration InequalitiesMoment Generating FunctionsChernoff Bounds+3 more
L2Tier 1~45m

LLN and CLT Failures Under Heavy Tails

What breaks when finite mean or finite variance fails. Cauchy: the sample mean stays Cauchy no matter how large the sample. Pareto across the alpha regimes: when LLN still holds but CLT does not. Generalized CLT and stable-law limits. The consistency illusion in finance and reinsurance.

Prereqs
Law of Large NumbersCentral Limit TheoremCharacteristic Functions+3 more
L3Tier 1~70m

Matrix Concentration

Matrix Bernstein, Matrix Hoeffding, Weyl's inequality, and Davis-Kahan: the operator-norm concentration tools needed for covariance estimation, dimensionality reduction, and spectral analysis in high dimensions.

Prereqs
Sub-Gaussian Random VariablesSub-Exponential Random VariablesConcentration Inequalities+4 more
L3Tier 1~55m

McDiarmid's Inequality

The bounded-differences inequality: if changing any single input to a function changes the output by at most a fixed constant, the function concentrates around its mean with sub-Gaussian tails.

Prereqs
Concentration InequalitiesSub-Gaussian Random VariablesMartingale Theory+1 more
L3Tier 1~70m

Measure Concentration and Geometric Functional Analysis

High-dimensional geometry is counterintuitive: Lipschitz functions concentrate, random projections preserve distances, and most of a sphere's measure sits near the equator. Johnson-Lindenstrauss, Gaussian concentration, and Lévy's lemma.

Prereqs
Sub-Gaussian Random VariablesEpsilon-Nets and Covering Numbers
L2Tier 1~55m

Sub-Exponential Random Variables

The distributional class between sub-Gaussian and heavy-tailed: heavier tails than Gaussian, the Orlicz norm characterization, Bernstein condition, and the two-regime concentration bound.

Prereqs
Sub-Gaussian Random VariablesConcentration InequalitiesBernstein Inequality+3 more
L2Tier 1~75m

Sub-Gaussian Random Variables

Sub-Gaussian random variables: the precise characterization of 'light-tailed' behavior that underpins exponential-tail concentration inequalities and the Gaussian-rate generalization bounds in learning theory.

Prereqs
Concentration InequalitiesChernoff BoundsSkewness, Kurtosis, and Higher Moments+1 more
L3Tier 1~50m

Symmetrization Inequality

The symmetrization technique: the proof template that connects the generalization gap to Rademacher complexity by introducing a ghost sample and random signs.

Prereqs
Rademacher ComplexityConcentration Inequalities
L3Tier 2~50m

Contraction Inequality

The Ledoux-Talagrand contraction principle: applying a Lipschitz function anchored at zero to a function class can only contract Rademacher complexity, letting you bound the complexity of the loss class from the hypothesis class.

Prereqs
Rademacher Complexity
L3Tier 2~65m

Empirical Processes and Chaining

Bounding the supremum of empirical processes via covering numbers and chaining: Dudley's entropy integral and Talagrand's generic chaining, the sharpest tools in classical learning theory.

Prereqs
Rademacher ComplexityEpsilon-Nets and Covering NumbersAsymptotic Statistics: M-Estimators, Delta Method, LAN+3 more
L3Tier 2~55m

Hanson-Wright Inequality

Concentration of quadratic forms XᵀAX for sub-Gaussian random vectors: the two-term bound involving the Frobenius norm (Gaussian regime) and operator norm (extreme regime).

Prereqs
Sub-Gaussian Random VariablesMatrix ConcentrationChi-Squared Concentration
L3Tier 2~50m

Restricted Isometry Property

The restricted isometry property (RIP): when a measurement matrix approximately preserves norms of sparse vectors, enabling exact sparse recovery via L1 minimization. Random Gaussian matrices satisfy RIP with O(s log(n/s)) rows.

Prereqs
Sub-Gaussian Random Variables
L2Tier 2~12m

Slud's Inequality

A lower bound on binomial tail probabilities. For X ~ Bin(n, p) with p = (1 - epsilon)/2, P(X >= n/2) is at least a normal-tail-style quantity. Used in VC lower bounds to show learning is genuinely hard.

Prereqs
Common Probability DistributionsConcentration Inequalities
L2Tier 2~75m

Stochastic Processes for ML

Random processes as the unifying framework behind Gaussian processes, neural tangent kernels, empirical processes, and martingale concentration: definitions, sample-path regularity, and the bridges between learning theory's main probabilistic tools.

Prereqs
Measure-Theoretic ProbabilityExpectation, Variance, Covariance, and MomentsConcentration Inequalities

Optimization & Function Classes

13 topics / L1, L2, L3 / Tiers 1-2

Convex optimization, regularization, kernels, RKHS.

L1Tier 1~60m

Convex Optimization Basics

Convex sets, convex functions, gradient descent convergence, strong convexity, and duality: the optimization foundation that every learning-theoretic result silently depends on.

Prereqs
Differentiation in RⁿMatrix Operations and PropertiesCommon Inequalities+8 more
L1Tier 1~45m

Gradient Descent Variants

From full-batch to stochastic to mini-batch gradient descent, plus momentum, Nesterov acceleration, AdaGrad, RMSProp, and Adam. Why mini-batch SGD with momentum is the practical default.

Prereqs
Convex Optimization BasicsDifferentiation in Rⁿ
L2Tier 1~50m

Gradient Flow and Vanishing Gradients

Why deep networks are hard to train: gradients shrink or explode as they propagate through layers. The Jacobian product chain, sigmoid saturation, ReLU dead neurons, skip connections, normalization, and gradient clipping.

Prereqs
Feedforward Networks and BackpropagationThe Jacobian Matrix
L2Tier 1~55m

Stochastic Gradient Descent Convergence

SGD convergence rates for convex and strongly convex functions, the role of noise as both curse and blessing, mini-batch variance reduction, learning rate schedules, and the Robbins-Monro conditions.

Prereqs
Gradient Descent VariantsConcentration InequalitiesCoordinate Descent+3 more
L1Tier 1~35m

Subgradients and Subdifferentials

The non-smooth generalization of the gradient for convex functions. Subgradients enable optimality conditions, calculus rules, and convergence guarantees for L1-regularized problems, hinge loss SVMs, and proximal algorithms where the objective is not differentiable.

Prereqs
Convex Optimization Basics
L2Tier 2~55m

Bias-Variance Tradeoff

The classical decomposition of mean squared error into bias squared, variance, and irreducible noise. The U-shaped test error curve, why it breaks in modern ML (double descent), and the connection to regularization.

Prereqs
Expectation, Variance, Covariance, and MomentsEmpirical Risk MinimizationElastic Net+5 more
L2Tier 2~45m

Cross-Validation Theory

The theory behind cross-validation as a model selection tool: LOO-CV, K-fold, the bias-variance tradeoff of the CV estimator itself, and why CV estimates generalization error.

Prereqs
Empirical Risk MinimizationBias-Variance TradeoffAIC and BIC+12 more
L3Tier 2~70m

Kernels and Reproducing Kernel Hilbert Spaces

Kernel functions, Mercer's theorem, the RKHS reproducing property, and the representer theorem: the mathematical framework that enables learning in infinite-dimensional function spaces via finite-dimensional computations.

Prereqs
Convex Optimization BasicsRademacher ComplexityCharacteristic Functions+11 more
L3Tier 2~55m

Preconditioned Optimizers: Shampoo, K-FAC, and Natural Gradient

Optimizers that use curvature information to precondition gradients: the natural gradient via Fisher information, K-FAC's Kronecker approximation, and Shampoo's full-matrix preconditioning. How they connect to Riemannian optimization and why they outperform Adam on certain architectures.

Prereqs
Convex Optimization BasicsFisher Information: Curvature, KL Geometry, and the Natural GradientThe Hessian Matrix+1 more
L2Tier 2~60m

Regularization Theory

Why unconstrained ERM overfits and how regularization controls complexity: Tikhonov (L2), sparsity (L1), elastic net, early stopping, dropout, the Bayesian prior connection, and the link to algorithmic stability.

Prereqs
Convex Optimization BasicsBias-Variance TradeoffAdaBoost+7 more
L3Tier 2~55m

Riemannian Optimization and Manifold Constraints

Optimization on curved spaces: the Stiefel manifold for orthogonal matrices, symmetric positive definite matrices, Riemannian gradient descent, retractions, and why flat-space intuitions break on manifolds. The geometric backbone of Shampoo, Muon, and constrained neural network training.

Prereqs
Convex Optimization BasicsThe Hessian MatrixEigenvalues and Eigenvectors+5 more
L2Tier 2~60m

Stability and Optimization Dynamics

Convergence of gradient descent for smooth and strongly convex objectives, the descent lemma, gradient flow as a continuous-time limit, Lyapunov stability analysis, and the edge of stability phenomenon.

Prereqs
Convex Optimization Basicsinvariants and monovariants
L2Tier 2~60m

Stochastic Approximation Theory

The Robbins-Monro framework, ODE method, and Polyak-Ruppert averaging: the unified theory behind why SGD, Q-learning, and TD-learning converge.

Prereqs
Convex Optimization BasicsMartingale TheoryAdaptive Learning Is Not IID+2 more

Statistical Foundations

18 topics / L1, L2, L3, L4 / Tiers 1-3

Minimax, Fano, information-theoretic lower bounds, random matrix theory.

L3Tier 1~90m

Minimax Lower Bounds: Le Cam, Fano, Assouad, and the Reduction to Testing

Why upper bounds are not enough. The reduction-to-testing principle, Le Cam's two-point method, Fano's inequality, the Assouad hypercube lemma, packing and metric-entropy bridges, and applications to nonparametric, sparse, and locally-private estimation.

Prereqs
Concentration InequalitiesMaximum Likelihood Estimation: Theory, Information Identity, and Asymptotic EfficiencyKL Divergence+3 more
L2Tier 2~45m

Design-Based vs. Model-Based Inference

Two philosophies of statistical inference from survey data: design-based inference where randomness comes from the sampling design, and model-based inference where randomness comes from a statistical model, with the model-assisted hybrid approach.

Prereqs
Survey Sampling MethodsCausal Inference for Policy EvaluationOfficial Statistics and National Surveys
L2Tier 2~55m

Detection Theory

Binary hypothesis testing, the Neyman-Pearson lemma (likelihood ratio tests are most powerful), ROC curves, Bayesian detection, and sequential testing. Classification IS detection theory. ROC/AUC comes directly from here.

Prereqs
Hypothesis Testing for MLBayesian Estimation
L3Tier 2~60m

Fano Inequality

Fano's inequality as the standard tool for information-theoretic lower bounds: the probability of error in estimating a random variable from a noisy observation is bounded below in terms of the conditional entropy and the log of the alphabet size.

Prereqs
Minimax Lower Bounds: Le Cam, Fano, Assouad, and the Reduction to TestingInformation Theory Foundations
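
In its most common form (stated here for reference; the log base matches the entropy units), the inequality reads:

P(\hat{X} \neq X) \;\ge\; \frac{H(X \mid Y) - 1}{\log |\mathcal{X}|}
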
L3Tier 2~45m

GREG Estimator

The generalized regression estimator is the standard model-assisted survey estimator: Horvitz-Thompson plus a regression correction using known auxiliary totals.

Prereqs
Linear RegressionSurvey Sampling MethodsDesign-Based vs. Model-Based Inference
L3Tier 2~55m

High-Dimensional Covariance Estimation

When dimension d is comparable to sample size n, the sample covariance matrix fails. Shrinkage estimators (Ledoit-Wolf), banding and tapering for structured covariance, and Graphical Lasso for sparse precision matrices.

Prereqs
Matrix ConcentrationLasso Regression
L3Tier 2~50m

Kernel Two-Sample Tests

Maximum Mean Discrepancy (MMD): a kernel-based nonparametric test for whether two samples come from the same distribution, with unbiased estimation, permutation testing, and applications to GAN evaluation.

Prereqs
Kernels and Reproducing Kernel Hilbert Spaces
L2Tier 2~50m

Neyman-Pearson and Hypothesis Testing Theory

The likelihood ratio test is the most powerful test for simple hypotheses (Neyman-Pearson lemma), UMP tests extend this to one-sided composites, and the power function characterizes a test's behavior across the parameter space.

Prereqs
Common Probability DistributionsMaximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency
L2Tier 2~50m

Nonresponse and Missing Data

The taxonomy of missingness mechanisms (MCAR, MAR, MNAR), their consequences for inference, and the major correction methods: multiple imputation, inverse probability weighting, and the EM algorithm.

Prereqs
Common Probability DistributionsDesign-Based vs. Model-Based Inferencefuzzy matching and record linkage+3 more
L1Tier 2~45m

Order Statistics

Order statistics are the sorted values of a random sample. Their distributions govern quantile estimation, confidence intervals for medians, and the behavior of extremes.

Prereqs
Common Probability DistributionsTriangular Distribution
L4Tier 2~75m

Random Matrix Theory Overview

Why the spectra of random matrices matter for ML: Marchenko-Pastur law, Wigner semicircle, spiked models, and their applications to covariance estimation, PCA, and overparameterization.

Prereqs
Matrix ConcentrationEpsilon-Nets and Covering NumbersMeasure-Theoretic Probability+3 more
L3Tier 2~55m

Robust Statistics and M-Estimators

When data has outliers or model assumptions are wrong, classical estimators break. M-estimators generalize MLE to handle contamination gracefully.

Prereqs
Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic EfficiencyMinimax and Saddle PointsSkewness, Kurtosis, and Higher Moments+1 more
L2Tier 2~45m

Sample Size Determination

How to compute the number of observations needed to estimate means, proportions, and treatment effects with specified precision and power, including corrections for finite populations and complex designs.

Prereqs
Hypothesis Testing for MLCommon Probability DistributionsSurvey Sampling Methods
L2Tier 2~60m

Survey Sampling Methods

The major probability sampling designs used in survey statistics: simple random, stratified, cluster, systematic, multi-stage, and multi-phase sampling, with their variance properties and estimators.

Prereqs
Common Probability DistributionsExpectation, Variance, Covariance, and MomentsTypes of Bias in Statistics
L3Tier 2~55m

Survival Analysis

Modeling time-to-event data with censoring: Kaplan-Meier curves, hazard functions, and the Cox proportional hazards model.

Prereqs
Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency
L3Tier 3~50m

Copulas

Copulas separate the dependence structure of a multivariate distribution from its marginals. Sklar's theorem guarantees that any joint CDF can be decomposed into marginals and a copula, making dependence modeling modular.

Prereqs
Common Probability Distributions
L3Tier 3~50m

Longitudinal Surveys and Panel Data

Analysis of data where the same units are measured repeatedly over time: fixed effects, random effects, difference-in-differences, and the problems of attrition and time-varying confounding.

Prereqs
Linear RegressionCausal Inference for Policy EvaluationNonresponse and Missing Data+1 more
L3Tier 3~55m

Small Area Estimation

Methods for producing reliable estimates in domains where direct survey estimates have too few observations for useful precision, using Fay-Herriot and unit-level models that borrow strength across areas.

Prereqs
Bayesian EstimationLinear RegressionREML and Variance Component Estimation+1 more

Modern Generalization Theory

16 topics / L3, L4, L5 / Tiers 1-3

Implicit bias, double descent, NTK, mean field — where classical theory fails.

L4Tier 1~90m

Implicit Bias and Modern Generalization

Why classical generalization theory breaks for overparameterized models: the random labels experiment, the interpolation threshold, implicit bias of gradient descent, double descent, and the frontier of understanding why deep learning works.

Prereqs
Gradient Descent VariantsLinear RegressionVC Dimension+11 more
L4Tier 1~95m

Neural Tangent Kernel: Lazy Training, Kernel Equivalence, μP, and the Limits of Width

In the infinite-width NTK parameterization, training a neural network with gradient descent is mathematically equivalent to kernel regression. The same limit suppresses feature learning, which is why μP and mean-field parameterizations exist. NTK is the precise boundary between the kernel and feature-learning regimes.

Prereqs
Kernels and Reproducing Kernel Hilbert SpacesRidge RegressionImplicit Bias and Modern Generalization+3 more
L3Tier 1~30m

PAC-Bayes Bounds

PAC-Bayes bounds control the generalization gap of a stochastic predictor by the KL divergence between a learner-chosen posterior and a data-independent prior. They have produced some of the few non-vacuous generalization bounds reported for trained neural networks (Dziugaite-Roy 2017 onward); how tight the bound gets depends heavily on the choice of prior.

Prereqs
Rademacher ComplexityBayesian EstimationPAC Learning Framework+1 more
L4Tier 2~65m

Benign Overfitting

When interpolation (zero training error) does not hurt generalization: the min-norm interpolator fits noise in harmless directions while preserving signal. Bartlett et al. 2020, effective rank conditions, and why benign overfitting happens in overparameterized but not classical regimes.

Prereqs
Implicit Bias and Modern GeneralizationRandom Matrix Theory OverviewDouble Descent+3 more
L4Tier 2~65m

Double Descent

Test error follows a double-descent curve: it decreases, peaks at the interpolation threshold, then decreases again in the overparameterized regime, defying classical bias-variance intuition.

Prereqs
Implicit Bias and Modern GeneralizationRandom Matrix Theory OverviewBias-Variance Tradeoff+4 more
L4Tier 2~50m

Grokking

Models can memorize training data quickly, then generalize much later after continued training. This delayed generalization, called grokking, breaks the assumption that overfitting is a terminal state and connects to weight decay, implicit regularization, and phase transitions in learning.

Prereqs
Regularization TheoryStochastic Gradient Descent ConvergenceImplicit Bias and Modern Generalization+1 more
L4Tier 2~55m

Lazy vs Feature Learning

The fundamental dichotomy in neural network training: lazy regime (NTK, kernel-like, weights barely move) versus rich/feature learning regime (weights move substantially, representations emerge).

Prereqs
Neural Tangent Kernel: Lazy Training, Kernel Equivalence, μP, and the Limits of WidthMean Field Theory
L4Tier 2~65m

Mean Field Theory

The mean field limit of neural networks: as width goes to infinity under the right scaling, neurons become independent particles whose weight distribution evolves by Wasserstein gradient flow, capturing feature learning that the NTK regime misses.

Prereqs
Neural Tangent Kernel: Lazy Training, Kernel Equivalence, μP, and the Limits of WidthInformation Geometry
L4Tier 2~55m

Neural Network Optimization Landscape

Loss surface geometry of neural networks: saddle points dominate in high dimensions, mode connectivity, flat vs sharp minima, Sharpness-Aware Minimization, and the edge of stability phenomenon.

Prereqs
Training Dynamics and Loss LandscapesThe Hessian Matrix
L3Tier 2~55m

Optimal Transport and Earth Mover's Distance

The Monge and Kantorovich formulations of optimal transport, the linear programming dual, Sinkhorn regularization, and applications to WGANs, domain adaptation, and fairness.

Prereqs
Convex DualityWasserstein Distances
L3Tier 2~60m

Representation Learning Theory

What makes a good learned representation: the information bottleneck, contrastive learning, sufficient statistics, rate-distortion theory, and why representation learning is the central unsolved problem of deep learning.

Prereqs
Information Theory FoundationsVariational AutoencodersEquivariant Deep Learning+2 more
L4Tier 3~75m

Gaussian Processes for Machine Learning

A distribution over functions specified by a mean and kernel: closed-form posterior predictions with uncertainty, connection to kernel ridge regression, marginal likelihood for model selection, and the cubic cost bottleneck.

Prereqs
Kernels and Reproducing Kernel Hilbert SpacesJoint, Marginal, and Conditional DistributionsRidge Regression+7 more
L3Tier 3~50m

Information Bottleneck

The information bottleneck principle: compress the input X into a representation T that preserves information about the target Y. The Lagrangian formulation, connection to deep learning, Shwartz-Ziv and Tishby claims, and why the compression story may not hold for ReLU networks.

Prereqs
Information Theory Foundations
L5Tier 3~70m

Open Problems in ML Theory

A curated list of genuinely open problems in machine learning theory: why overparameterized networks generalize, the right complexity measure for deep learning, feature learning beyond NTK, why scaling laws hold, emergent abilities, transformer-specific theory, and post-training theory.

Prereqs
Implicit Bias and Modern GeneralizationScaling LawsContinuous Thought Machines+2 more
L4Tier 3~60m

Sparse Recovery and Compressed Sensing

Recover a sparse signal from far fewer measurements than its ambient dimension: the restricted isometry property, basis pursuit via L1 minimization, random measurement matrices, and applications from MRI to single-pixel cameras.

Prereqs
Lasso RegressionSub-Gaussian Random Variablessparse coding and efficient coding
L4Tier 3~55m

Wasserstein Distances

The Wasserstein (earth mover's) distance measures the minimum cost of transporting one probability distribution to another, with deep connections to optimal transport, GANs, and distributional robustness.

Prereqs
Common Probability DistributionsMeasure-Theoretic ProbabilityConvex Duality+2 more

LLM Construction

71 topics / L2, L3, L4, L5 / Tiers 1-3

Transformer math, attention, KV cache, optimizers, scaling laws, RLHF.

L5Tier 1~75m

Chain-of-Thought and Reasoning

Chain-of-thought as serial computation in token-space: the Merrill-Sabharwal expressiveness hierarchy, the Lanham-Turpin faithfulness gap, self-consistency bounds, hidden CoT, and the post-o1/R1 reframing where CoT is the artifact of RL on verifiable rewards.

Prereqs
Prompt Engineering and In-Context LearningTransformer ArchitectureDecoding Strategies+1 more
L3Tier 1~55m

Fine-Tuning and Adaptation

Methods for adapting pretrained models to new tasks: full fine-tuning, feature extraction, LoRA, QLoRA, adapter layers, and prompt tuning. When to use each, and how catastrophic forgetting constrains adaptation.

Prereqs
Feedforward Networks and Backpropagation
L4Tier 1~80m

Hallucination Theory

Why language models confabulate, framed by the Kalai-Nachum-Vempala-Zhang (2025) statistical lower bound on pretraining error and the evaluation-incentive equilibrium that prevents post-training from removing it. Calibration, conformal prediction, structural failure modes (reversal curse, snowballing), and the measurement gap (FActScore, SAFE).

Prereqs
Empirical Risk MinimizationTransformer ArchitectureReinforcement Learning from Human Feedback+8 more
L2Tier 1~35m

Linear Layer: Shapes, Bias, and Memory

A systems-first note on the linear layer: tensor shapes, the bias term, forward pass, backward pass, parameter memory, FLOPs, and finite-difference gradient tests.

Prereqs
Matrix Operations and PropertiesMatrix CalculusFeedforward Networks and Backpropagation
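
A minimal NumPy sketch of the bookkeeping this note covers: shapes, parameter count, forward FLOPs, and a finite-difference check of the analytic gradient for a simple squared-error loss. All sizes and names are illustrative.

import numpy as np

batch, d_in, d_out = 4, 8, 3
rng = np.random.default_rng(0)
x = rng.normal(size=(batch, d_in))
W = rng.normal(size=(d_in, d_out)) * 0.1
b = np.zeros(d_out)

def forward(x, W, b):
    # y = x @ W + b with shapes (batch, d_in) @ (d_in, d_out) -> (batch, d_out)
    return x @ W + b

params = W.size + b.size                  # parameter memory: d_in*d_out + d_out floats
flops = 2 * batch * d_in * d_out          # forward multiply-adds

# Backward pass for L = 0.5 * sum(y**2): dL/dy = y, so dL/dW = x.T @ y
y = forward(x, W, b)
dW = x.T @ y

# Finite-difference test on one weight entry
eps, i, j = 1e-5, 2, 1
Wp = W.copy(); Wp[i, j] += eps
Wm = W.copy(); Wm[i, j] -= eps
loss = lambda Wx: 0.5 * (forward(x, Wx, b) ** 2).sum()
num = (loss(Wp) - loss(Wm)) / (2 * eps)
print(params, flops, abs(num - dW[i, j]))   # gradient error should be ~1e-8
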
L3Tier 1~70m

Optimizer Theory: SGD, Adam, and Muon

Convergence theory of SGD (convex and strongly convex), momentum methods (Polyak and Nesterov), Adam as adaptive + momentum, why SGD can generalize better, the Muon optimizer, and learning rate schedules.

Prereqs
Convex Optimization BasicsAdam OptimizerAutomatic Differentiation+5 more
L5Tier 1~95m

Reinforcement Learning from Human Feedback

The RLHF pipeline as math, not folklore: Bradley-Terry with sycophancy and intransitivity, KL-shielded PPO with overoptimization mitigation, the DPO implicit-reward identity and its likelihood-displacement failure, online-vs-offline preference learning, Nash-LHF, and the LIMA challenge to whether you need RL at all.

Prereqs
Policy Gradient TheoremRLHF and AlignmentReinforcement Learning for Synthesis Planning+1 more
L4Tier 1~85m

Scaling Laws

Power-law scaling of LLM loss in parameters, data, and compute: Kaplan, Chinchilla, the Muennighoff data-constrained law for repetition, the Schaeffer metric-induced-emergence proposition, MoE and muP extensions, and the test-time compute axis.

Prereqs
Convex Optimization BasicsData Contamination and EvaluationDistributed Training Theory+5 more
L4Tier 1~90m

Sparse Autoencoders for Interpretability: TopK, JumpReLU, Matryoshka, and Scaling

Sparse autoencoders (SAEs) decompose neural network activations into interpretable feature directions by learning an overcomplete dictionary with a sparsity constraint. Coverage of the superposition hypothesis, the L1 SAE objective and its shrinkage failure mode, TopK SAEs (Gao et al. 2024 / OpenAI), JumpReLU SAEs (Rajamanoharan et al. 2024 / DeepMind), Matryoshka SAEs (Bushnaq / Bussmann 2024), Anthropic's million-feature Claude 3 Sonnet decomposition (Templeton et al. 2024), feature splitting, dead features, evaluation, and steering.

Prereqs
AutoencodersMechanistic Interpretability: Features, Circuits, and Causal FaithfulnessLasso Regression+3 more
L4Tier 2~60m

Attention Mechanism Theory

Mathematical formulation of attention: scaled dot-product attention as soft dictionary lookup, why scaling by the square root of key dimension prevents softmax saturation, multi-head attention, and the connection to kernel methods.

Prereqs
Matrix Operations and PropertiesSoftmax and Numerical StabilityGram Matrices and Kernel Matrices+2 more
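
A minimal NumPy sketch of the scaled dot-product attention described above (single head, no masking); shapes and values are illustrative.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # scaling keeps logits O(1), preventing softmax saturation
    weights = softmax(scores, axis=-1)         # soft dictionary lookup over the keys
    return weights @ V                         # (n_q, d_v)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(5, 16)), rng.normal(size=(7, 16)), rng.normal(size=(7, 8))
print(attention(Q, K, V).shape)                # (5, 8)
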
L3Tier 2~45m

Attention Mechanisms History

The evolution of attention from Bahdanau (2014) additive alignment to Luong dot-product attention to self-attention in transformers. How attention solved the fixed-length bottleneck of seq2seq models.

Prereqs
Recurrent Neural NetworksByte-Level Language Models
L4Tier 2~45m

Attention Sinks and Retrieval Decay

Why transformers disproportionately attend to initial tokens (attention sinks), how StreamingLLM exploits this for infinite-length inference, and how retrieval accuracy degrades with distance and position within the context window.

Prereqs
Attention Mechanism TheoryForgetting Transformer (FoX)
L4Tier 2~55m

Attention Variants and Efficiency

Multi-head, multi-query, grouped-query, linear, and sparse attention: how each variant trades expressivity for efficiency, and when to use which.

Prereqs
Attention Mechanism TheoryFast Fourier Transform
L3Tier 2~35m

Bits, Nats, Perplexity, and BPB

The four units people use to measure language model quality, how they relate to each other, when to use each one, and how mixing them up leads to wrong conclusions.

Prereqs
Information Theory FoundationsKL Divergence
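
A short worked conversion between the four quantities, assuming a made-up average loss of 2.0 nats per token and an illustrative tokenizer statistic for bytes per token.

import math

loss_nats_per_token = 2.0                             # cross-entropy in natural log
bits_per_token = loss_nats_per_token / math.log(2)    # nats -> bits
perplexity = math.exp(loss_nats_per_token)            # perplexity = exp(cross-entropy in nats)

bytes_per_token = 3.6                                  # assumed token-to-byte ratio of the eval text
bpb = bits_per_token / bytes_per_token                 # bits-per-byte

print(f"{bits_per_token:.3f} bits/token, perplexity {perplexity:.2f}, {bpb:.3f} bits/byte")
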
L5Tier 2~55m

Context Engineering

The discipline of building, routing, compressing, retrieving, and persisting context for LLMs: beyond prompt design into systems engineering for what the model sees.

Prereqs
KV CacheAttention Mechanism TheoryAgent Protocols: MCP and A2A+7 more
L3Tier 2~45m

Decoding Strategies

How language models select output tokens: greedy decoding, beam search, temperature scaling, top-k sampling, and nucleus (top-p) sampling. The tradeoffs between coherence, diversity, and quality.

Prereqs
Softmax and Numerical Stability
L5Tier 2~50m

Document Intelligence

Beyond OCR: understanding document layout, tables, figures, and structure using models that combine text, spatial position, and visual features to extract structured information from PDFs, invoices, and contracts.

Prereqs
Multimodal RAG
L5Tier 2~60m

DPO vs GRPO vs RL for Reasoning

Head-to-head comparison of preference optimization methods: DPO's implicit reward, GRPO's group-relative comparisons, and RL with verifier feedback for reasoning. When each works and when each fails.

Prereqs
RLHF and AlignmentPolicy Gradient TheoremActor-Critic Methods+6 more
L5Tier 2~45m

Edge and On-Device ML

Running models on phones, embedded devices, and edge servers: pruning, distillation, quantization, TinyML, and hardware-aware neural architecture search under memory, compute, and power constraints.

Prereqs
Speculative Decoding and Quantization
L4Tier 2~55m

Efficient Transformers Survey

Sub-quadratic attention variants (linear attention, Linformer, Performer, Longformer, BigBird) and why FlashAttention, a hardware-aware exact method, made most of them unnecessary in practice.

Prereqs
Attention Variants and Efficiency
L5Tier 2~55m

Flash Attention

IO-aware exact attention: tile QKV matrices into SRAM-sized blocks so the full N-by-N attention matrix is never materialized in HBM. Peak activation memory drops from O(N²) to O(N); HBM read/write traffic drops by a large constant factor (not asymptotic linearity); FLOP count is unchanged.

Prereqs
Attention Mechanism TheorySoftmax and Numerical StabilityAttention Is All You Need (Paper)+4 more
L4Tier 2~40m

Forgetting Transformer (FoX)

FoX adds a data-dependent forget gate to softmax attention. The gate down-weights unnormalized attention scores between past and present positions, giving the transformer a learned, recency-biased decay. FoX is FlashAttention-compatible, works without positional embeddings, and improves long-context language modeling and length extrapolation.

Prereqs
Attention Mechanism TheoryRecurrent Neural NetworksTransformer Architecture+2 more
L5Tier 2~40m

Fused Kernels

Combine multiple GPU operations into a single kernel launch to eliminate intermediate HBM reads and writes. Why kernel fusion is the primary optimization technique for memory-bound ML operations.

Prereqs
GPU Compute ModelCUDA Programming FundamentalsFlash Attention+2 more
L5Tier 2~50m

GPU Compute Model

How GPUs execute ML workloads: streaming multiprocessors, warps, memory hierarchy (registers, SRAM, L2, HBM), arithmetic intensity, the roofline model, and why most ML operations are memory-bound.

Prereqs
ASML and Chip ManufacturingDocker and Containers for MLKubernetes for ML Workloads+1 more
L4Tier 2~50m

Induction Heads

Induction heads are attention head circuits that implement a specific kind of pattern completion: given a sequence like [A][B]...[A], they predict [B]. Olsson et al. (2022) give strong causal (ablation) evidence that these heads exist in small attention-only models, and correlational co-occurrence evidence linking their formation to a sudden jump in in-context-learning ability in larger transformers. They are one mechanism among several that have been proposed for in-context learning, not the whole story.

Prereqs
Attention Mechanism TheoryTransformer ArchitectureMechanistic Interpretability: Features, Circuits, and Causal Faithfulness+2 more
L5Tier 2~60m

Inference Systems Overview

The modern LLM inference stack: batching strategies, scheduling, memory management with paged attention, model parallelism for serving, and why FLOPs do not equal latency when memory bandwidth is the bottleneck.

Prereqs
KV CacheSpeculative Decoding and QuantizationDocker and Containers for ML+4 more
L5Tier 2~50m

Inference-Time Scaling Laws

How additional compute at inference time (repeated sampling, search, verification) improves output quality, why gains are task-dependent, and why verifier quality matters more than raw sample count.

Prereqs
Scaling LawsTest-Time Compute and Search
L4Tier 2~60m

Iterative Magnitude Pruning and the Lottery Ticket Hypothesis

Iterative magnitude pruning repeatedly trains, prunes, rewinds, and retrains a network to search for sparse subnetworks that still learn well. The point is not cheap training; the point is understanding trainable sparsity, rewind stability, and when a sparse mask still preserves optimization geometry.

Prereqs
Model Compression and PruningFeedforward Networks and Backpropagation
L3Tier 2~45m

Knowledge Distillation

Training a small student model to mimic a large teacher: soft targets, temperature scaling, dark knowledge, and why the teacher's mistakes carry useful information about class structure.

Prereqs
Feedforward Networks and BackpropagationIterative Magnitude Pruning and the Lottery Ticket Hypothesis
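
A minimal sketch of the soft-target objective this topic describes: temperature-softened teacher probabilities combined with the ordinary hard-label loss. Logits, temperature, and the mixing weight are all illustrative.

import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.5):
    # Soft term: cross-entropy between temperature-softened teacher and student distributions.
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    soft = -(p_teacher * log_p_student).sum() * T * T    # T^2 keeps the gradient scale comparable
    # Hard term: standard cross-entropy against the true label.
    hard = -np.log(softmax(student_logits)[hard_label] + 1e-12)
    return alpha * soft + (1 - alpha) * hard

teacher = np.array([4.0, 1.0, 0.2])   # the "dark knowledge" lives in the non-argmax logits
student = np.array([2.0, 0.5, 0.1])
print(distillation_loss(student, teacher, hard_label=0))
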
L5Tier 2~45m

KV Cache Optimization

Advanced techniques for managing the KV cache memory bottleneck: paged attention for fragmentation-free allocation, prefix caching for shared prompts, token eviction for long sequences, and quantized KV cache for reduced footprint.

Prereqs
KV Cache
L5Tier 2~45m

KV Cache

Why autoregressive generation recomputes attention at every step, how caching past key-value pairs makes it linear, and the memory bottleneck that drives MQA, GQA, and paged attention.

Prereqs
Attention Mechanism TheoryAttention Is All You Need (Paper)Attention Variants and Efficiency+2 more
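
A toy decode loop showing the idea above: cache each step's key and value so a new token attends over all cached pairs instead of recomputing the prefix. Single head, no batching; weights and sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
d = 16

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

K_cache, V_cache = [], []

def decode_step(x_t, Wq, Wk, Wv):
    q, k, v = x_t @ Wq, x_t @ Wk, x_t @ Wv
    K_cache.append(k)                               # store this position's key and value once
    V_cache.append(v)
    K, V = np.stack(K_cache), np.stack(V_cache)     # (t, d) cached so far
    w = softmax(q @ K.T / np.sqrt(d))               # attention over cached positions only: O(t) per step
    return w @ V

Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
for t in range(5):
    out = decode_step(rng.normal(size=d), Wq, Wk, Wv)
print(out.shape, len(K_cache))                      # (16,) 5
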
L5Tier 2~50m

Latent Reasoning

Techniques that spend computation in hidden state rather than visible chain-of-thought tokens: pause tokens, continuous thought, recurrent depth, and what is empirical vs speculative.

Prereqs
Test-Time Compute and SearchMemory Systems for LLMsMulti-Token Prediction
L5Tier 2~50m

Memory Systems for LLMs

Taxonomy of LLM memory: short-term (KV cache), working (scratchpad), long-term (retrieval), and parametric (weights). Why extending context alone is insufficient and how memory consolidation works.

Prereqs
Context EngineeringKV Cache
L4Tier 2~55m

Mixture of Experts

Sparse computation via learned routing: replace dense FFN layers with multiple expert networks, activate only a subset per token, and scale capacity without proportional compute.

Prereqs
Transformer ArchitectureModel Compression and PruningSpeculative Decoding and Quantization
L3Tier 2~50m

Model Compression and Pruning

Reducing model size without proportional accuracy loss: unstructured and structured pruning, magnitude pruning, the lottery ticket hypothesis, entropy coding for compressed weights, and knowledge distillation as compression.

Prereqs
Feedforward Networks and Backpropagation
L5Tier 2~45m

Multi-Token Prediction

Predicting k future tokens simultaneously using auxiliary prediction heads: forces planning, improves code generation, and connects to speculative decoding.

Prereqs
Transformer Architecture
L5Tier 2~55m

Multimodal RAG

RAG beyond text: retrieving images, tables, charts, and PDFs alongside text. Document parsing, multimodal chunking, vision-language retrievers, agentic RAG, and reasoning RAG with chain-of-thought retrieval.

Prereqs
Context EngineeringAudio Language ModelsCLIP, OpenCLIP, and SigLIP: Contrastive Language-Image Pretraining+1 more
L5Tier 2~45m

PaddleOCR and Practical OCR

A practitioner's guide to modern OCR toolkits: PaddleOCR's three-stage pipeline, TrOCR's transformer approach, EasyOCR, and Tesseract. When to use which, and what accuracy to expect.

Prereqs
Document Intelligencehough transform and circle detection
L5Tier 2~50m

Parallel Processing Fundamentals

Data, tensor, pipeline, expert, and sequence parallelism: the five strategies for distributing model training and inference across multiple GPUs, and how frontier labs combine all of them.

Prereqs
Stochastic Gradient Descent ConvergenceBroadcast Joins in Distributed ComputeDask Parallel Python+1 more
L3Tier 2~40m

Perplexity and Language Model Evaluation

Perplexity as exp(cross-entropy): the standard intrinsic metric for language models, its information-theoretic interpretation, connection to bits-per-byte, and why low perplexity alone does not guarantee useful generation.

Prereqs
Information Theory FoundationsBits, Nats, Perplexity, and BPBLog-Probability Computation
L5Tier 2~70m

Post-Training Overview

How post-training turns a pretrained language model into a deployable assistant: SFT, preference optimization, safety tuning, verifiable rewards, evaluation gates, and the failure modes each stage introduces.

Prereqs
RLHF and AlignmentTransformer ArchitectureAgentic RL and Tool Use+3 more
L5Tier 2~35m

Prefix Caching

Share computed KV cache entries across requests with a common prefix. A radix tree over token sequences (RadixAttention) enables efficient prefix lookup, yielding significant latency savings for prefix-heavy production workloads.

Prereqs
KV CacheKV Cache Optimization
L5Tier 2~50m

Prompt Engineering and In-Context Learning

In-context learning allows LLMs to adapt to new tasks from examples in the prompt without weight updates. Theories for why it works, prompting strategies, and why prompt engineering is configuring inference-time computation.

Prereqs
Transformer Architecture
L5Tier 2~45m

Reasoning Data Curation

How to build training data for reasoning models: math with verified solutions, code with test cases, rejection sampling, external verification, self-play for problem generation, and the connection to RLVR.

Prereqs
Post-Training OverviewAlphaProof and AI-Assisted Theorem ProvingSynthetic Data Distillation
L4Tier 2~50m

Residual Stream and Transformer Internals

The residual stream as the central computational highway in transformers: attention and FFN blocks read from and write to it. Pre-norm vs post-norm, FFN as key-value memory, and the logit lens for inspecting intermediate representations.

Prereqs
Transformer ArchitectureForgetting Transformer (FoX)Gradient Flow and Vanishing Gradients
L4Tier 2~65m

RLHF and Alignment

The RLHF pipeline for aligning language models with human preferences: reward modeling, PPO fine-tuning, KL penalties, DPO, and why none of it guarantees truthfulness.

Prereqs
Policy Gradient TheoremMarkov Decision ProcessesActor-Critic Methods+2 more
L5Tier 2~55m

Scaling Compute-Optimal Training

Chinchilla scaling: how to optimally allocate a fixed compute budget between model size and training data, why many models were undertrained, and the post-Chinchilla reality of data quality and inference cost.

Prereqs
Scaling Laws
L4Tier 2~55m

Sparse Attention and Long Context

Standard attention is O(n²). Sparse patterns (Longformer, Sparse Transformer, Reformer), ring attention for distributed sequences, streaming with attention sinks, and why extending context is harder than it sounds.

Prereqs
Attention Mechanism TheoryGemini and Google Models
L5Tier 2~50m

Speculative Decoding and Quantization

Two core inference optimizations: speculative decoding for latency (draft-verify parallelism) and quantization for memory and throughput (reducing weight precision without destroying quality).

Prereqs
Transformer ArchitectureKV CacheMegakernels+1 more
L5Tier 2~45m

Structured Output and Constrained Generation

Forcing LLM output to conform to schemas via constrained decoding: logit masking for valid tokens, grammar-guided generation, and why constraining the search space can improve both reliability and quality.

Prereqs
Transformer ArchitectureTool-Augmented Reasoning
L3Tier 2~50m

Synthetic Data Distillation

Data-centric distillation: instead of matching teacher logits, train the student on a carefully synthesized corpus written by a stronger teacher. Phi and Orca style pipelines, explanation tuning, filtering, and why the corpus is the model.

Prereqs
Knowledge DistillationSynthetic Data Generation
L5Tier 2~65m

Test-Time Compute and Search

Spending more inference compute through best-of-N, verifier-guided search, tree search, process rewards, and latent/internal computation, with gains limited by verifier quality and base pass rate.

Prereqs
Scaling LawsAgentic RL and Tool Use
L3Tier 2~45m

Token Prediction and Language Modeling

Language modeling as probability assignment over sequences. Autoregressive and masked prediction objectives, perplexity evaluation, and the connection between prediction and compression.

Prereqs
Information Theory FoundationsSoftmax and Numerical StabilityFeedforward Networks and Backpropagation+1 more
L5Tier 2~55m

Tool-Augmented Reasoning

LLMs that call external tools during reasoning: Toolformer for learning when to invoke APIs, ReAct for interleaving thought and action, function calling and MCP for structured invocation, and code-as-thought for replacing verbal arithmetic with executed programs.

Prereqs
Transformer ArchitectureChain-of-Thought and ReasoningPrompt Engineering and In-Context Learning+1 more
L4Tier 2~60m

Training Dynamics and Loss Landscapes

The geometry of neural network loss surfaces: why saddle points dominate over local minima in high dimensions, how flat minima relate to generalization, and why SGD finds solutions that generalize.

Prereqs
Convex Optimization BasicsThe Hessian MatrixStability and Optimization Dynamics
L4Tier 2~70m

Transformer Architecture

The mathematical formulation of the transformer block: self-attention, multi-head attention, layer normalization, FFN blocks, positional encoding, and parameter counting.

Prereqs
Attention Mechanism TheoryFeedforward Networks and BackpropagationSoftmax and Numerical Stability+13 more
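
A rough parameter-counting sketch for a standard decoder-only block of the kind described above. It assumes the usual 4x FFN expansion and ignores biases, layer norms, and positional embeddings; the configuration is illustrative.

def transformer_params(d_model, n_layers, vocab, ffn_mult=4, tied_embeddings=True):
    attn = 4 * d_model * d_model                    # Q, K, V, and output projections
    ffn = 2 * ffn_mult * d_model * d_model          # up- and down-projection matrices
    per_layer = attn + ffn
    embed = vocab * d_model * (1 if tied_embeddings else 2)
    return n_layers * per_layer + embed

# GPT-2-small-like configuration: roughly 124M with tied embeddings
print(f"{transformer_params(d_model=768, n_layers=12, vocab=50257):,}")
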
L5Tier 3~45m

Agent Protocols: MCP and A2A

The protocol layer for AI agents: MCP (Model Context Protocol) for tool access, A2A (Agent-to-Agent) for inter-agent communication, and why standardized interfaces matter for the agent ecosystem.

Prereqs
Agentic RL and Tool UseTool-Augmented Reasoning
L5Tier 3~35m

AMD Competition Landscape

AMD's MI300X and MI325X GPUs compete with NVIDIA on memory bandwidth and capacity but lag in software ecosystem maturity. Competition matters because pricing, supply diversity, and vendor lock-in determine who can train and serve models.

Prereqs
GPU Compute Model
L5Tier 3~40m

ASML and Chip Manufacturing

ASML is the sole manufacturer of EUV lithography machines used to produce every advanced AI chip. Understanding the semiconductor supply chain reveals a critical concentration risk for AI compute.

L4Tier 3~45m

Attention as Kernel Regression

Softmax attention viewed as Nadaraya-Watson kernel regression: the output at each position is a kernel-weighted average of values. Connects attention to classical nonparametric statistics and motivates linear attention via random feature approximations.

Prereqs
Attention Mechanism TheoryKernels and Reproducing Kernel Hilbert Spaces
L4Tier 3~45m

Byte-Level Language Models

Skip the tokenizer and feed raw bytes to the model. ByT5, MegaByte, and Byte Latent Transformer: why operating on bytes is attractive, why it is expensive, and how hierarchical patching closes the compute gap.

Prereqs
Tokenization and Information Theorymorphology and subword modeling
L5Tier 3~55m

Distributed Training Theory

Training frontier models requires thousands of GPUs. Data parallelism, model parallelism, and communication-efficient methods make this possible.

Prereqs
Optimizer Theory: SGD, Adam, and MuonParallel Processing FundamentalsBatch Size and Learning Dynamics+7 more
L5Tier 3~40m

Donut and OCR-Free Document Understanding

End-to-end document understanding without OCR: Donut reads document images directly and generates structured output, bypassing the error-prone OCR pipeline. Nougat extends this to academic paper parsing.

Prereqs
Transformer ArchitectureDocument IntelligencePaddleOCR and Practical OCR
L5Tier 3~40m

Megakernels

Fuse an entire LLM forward pass or decode step into a single GPU kernel launch to eliminate kernel launch overhead and cross-kernel HBM round-trips. Why persistent kernels and CUDA Graphs dominate low-latency inference.

Prereqs
Fused KernelsGPU Compute Model
L5Tier 3~45m

Model Merging and Weight Averaging

Combining trained models by averaging or interpolating their weights: SWA, SLERP, TIES-Merging, DARE. Why it works (loss landscape mode connectivity), when it fails, and applications to combining specialized models.

Prereqs
Transformer Architecture
L4Tier 3~50m

Neural Architecture Search

Automating network architecture design: search spaces, search strategies (RL, evolutionary, differentiable), performance estimation via weight sharing, and the gap between NAS hype and practical gains.

Prereqs
Feedforward Networks and Backpropagation
L5Tier 3~45m

NVIDIA GPU Architectures

A concrete reference for the GPU hardware that determines what ML models can be trained and served: H100, H200, B200, GB200, and Rubin architectures, with emphasis on memory bandwidth as the primary bottleneck.

Prereqs
GPU Compute ModelCUDA Programming FundamentalsParallel Processing Fundamentals+1 more
L5Tier 3~45m

Plan-then-Generate

Beyond strict left-to-right token generation: plan what to say before producing tokens. Outline-then-fill, hierarchical generation, and multi-token prediction. Planning reduces incoherence in long outputs and enables structural editing.

Prereqs
Transformer Architecture
L4Tier 3~50m

Positional Encoding

Why attention needs position information, sinusoidal encoding, learned positions, RoPE (rotary position encoding via 2D rotations), ALiBi, and why RoPE became the default for modern LLMs.

Prereqs
Attention Mechanism TheoryAttention Is All You Need (Paper)Attention Mechanisms History
L5Tier 3~55m

Quantization Theory

Reduce model weight precision from FP32 to FP16, INT8, or INT4. Post-training quantization, quantization-aware training, GPTQ, AWQ, and GGUF. Quantization is how large language models actually get deployed.

Prereqs
Matrix Operations and PropertiesEigenvalues and EigenvectorsMatrix Calculus+3 more
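
A minimal sketch of symmetric post-training INT8 quantization with a single per-tensor scale, just to show the mechanics; production methods such as GPTQ and AWQ are considerably more involved.

import numpy as np

def quantize_int8(W):
    scale = np.abs(W).max() / 127.0                           # symmetric per-tensor scale
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32) * 0.05
q, scale = quantize_int8(W)
err = np.abs(W - dequantize(q, scale)).mean()
print(q.dtype, f"mean abs error {err:.6f}")                   # ~4x smaller storage than FP32
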
L5Tier 3~40m

Table Extraction and Structure Recognition

Detecting tables in documents, recognizing row and column structure, and extracting cell content. Why tables are hard: merged cells, borderless layouts, nested headers, and cascading pipeline errors.

Prereqs
Document IntelligencePaddleOCR and Practical OCR
L4Tier 3~50m

Tokenization and Information Theory

Tokenization determines an LLM's vocabulary and shapes everything from compression efficiency to multilingual ability. Information theory explains what good tokenization looks like.

Prereqs
Information Theory FoundationsCommon Probability Distributionsmorphology and subword modeling

Methodology & Experimental Design

32 topics / L1, L2, L3, L4, L5 / Tiers 1-3

Evaluation, ablations, reproducibility, research habits, and data or benchmark failure modes.

L3Tier 1~30m

The Bitter Lesson

Sutton's meta-principle: scalable general methods that exploit computation tend to beat hand-crafted domain-specific approaches in the long run. Search and learning win; brittle cleverness loses.

L3Tier 1~50m

Causal Inference and the Ladder of Causation

Pearl's three-level hierarchy: association, intervention, counterfactual. Structural causal models, do-calculus, the adjustment formula, and why prediction is not causation.

Prereqs
Common Probability DistributionsBayesian EstimationCausal Inference Basics+2 more
L1Tier 1~45m

Confusion Matrices and Classification Metrics

The complete framework for evaluating binary classifiers: confusion matrix entries, precision, recall, F1, specificity, ROC curves, AUC, and precision-recall curves. When to use each metric.

Prereqs
Common Probability DistributionsMulti-Class and Multi-Label ClassificationSignal Detection Theory
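
The core binary metrics computed from confusion-matrix counts, on made-up labels and predictions.

import numpy as np

def binary_metrics(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                     # a.k.a. sensitivity / true positive rate
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, f1

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])
print(binary_metrics(y_true, y_pred))           # (0.75, 0.75, 0.75, 0.75)
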
L1Tier 1~40m

Confusion Matrix: MCC, Kappa, and Cost-Sensitive Evaluation

Beyond binary classification: multi-class confusion matrices, micro vs macro averaging, Matthews correlation coefficient, Cohen's kappa, cost-sensitive evaluation, and how to read a confusion matrix without getting the axes wrong.

L4Tier 1~35m

The Era of Experience

Sutton and Silver's thesis: the next phase of AI moves beyond imitation from human data toward agents that learn predominantly from their own experience. Text is not enough for general intelligence.

Prereqs
The Bitter LessonMarkov Decision Processes
L1Tier 1~45m

Model Evaluation Best Practices

Train/validation/test splits, cross-validation, stratified and temporal splits, data leakage, reporting with confidence intervals, and statistical tests for model comparison. Why single-number comparisons are misleading.

Prereqs
Confusion Matrices and Classification MetricsBayesian Optimization for Hyperparameters
L1Tier 1~35m

Train-Test Split and Data Leakage

Why you need three data splits, how to construct them correctly, and the common ways information leaks from test data into training. Temporal splits, stratified splits, and leakage detection.

Prereqs
ML Project Lifecycle
L1Tier 1~50m

Types of Bias in Statistics

A taxonomy of the major biases that corrupt statistical inference: selection bias, measurement bias, survivorship bias, response bias, attrition bias, publication bias, and their consequences for ML.

Prereqs
Anthropic Bias and Observation Selection
L3Tier 2~45m

Ablation Study Design

How to properly design ablation studies: remove one component at a time, measure the effect against proper baselines, and report results with statistical rigor.

Prereqs
Hypothesis Testing for MLBenchmarking MethodologyExperiment Tracking and Tooling+1 more
L1Tier 2~40m

Class Imbalance and Resampling

When class frequencies differ dramatically, standard accuracy is misleading. Resampling, cost-sensitive learning, and threshold tuning restore meaningful evaluation and training.

Prereqs
Confusion Matrices and Classification Metrics
L2Tier 2~40m

Convex Tinkering

Taleb's concept applied to ML research: designing small experiments with bounded downside and unbounded upside, and the precise conditions (convex payoff, nonzero variance, ex-post selection) under which this can dominate scale-first approaches.

Prereqs
Common Inequalitieseditorial principlesNon-Probability Sampling
L2Tier 2~50m

Evaluation Metrics and Properties

The metrics that determine whether a model is good: accuracy, precision, recall, F1, AUC-ROC, log loss, RMSE, calibration, and proper scoring rules. Why choosing the right metric matters more than improving the wrong one.

L1Tier 2~40m

Exploratory Data Analysis

The disciplined practice of looking at data before modeling: summary statistics, distributions, correlations, missing values, outliers, and class balance. You cannot model what you do not understand.

Prereqs
ML Project LifecyclePandas and NumPy FundamentalsTrain-Test Split and Data Leakage
L2Tier 2~50m

Feature Importance and Interpretability

Methods for attributing model predictions to input features: permutation importance, SHAP values, LIME, partial dependence, and why none of these imply causality.

Prereqs
Decision Trees and EnsemblesLinear RegressionExploratory Data Analysis+2 more
L3Tier 2~50m

Federated Learning

Train a global model without centralizing data. FedAvg, communication efficiency, non-IID convergence challenges, differential privacy integration, and applications in healthcare and mobile computing.

Prereqs
Optimizer Theory: SGD, Adam, and Muon
L1Tier 2~35m

Hardware for ML Practitioners

Practical hardware guidance for ML work: GPU memory as the real bottleneck, when local GPUs make sense, cloud options compared, and why you should not spend $5000 before knowing what you need.

L2Tier 2~55m

Hypothesis Testing for ML

Null and alternative hypotheses, p-values, confidence intervals, error types, multiple comparisons, and proper statistical tests for comparing ML models.

Prereqs
Benford's LawConfusion Matrix: MCC, Kappa, and Cost-Sensitive EvaluationDifferential Privacy+9 more
L2Tier 2~50m

Meta-Analysis

Combining results from multiple studies: fixed-effect and random-effects models, heterogeneity measures, publication bias, and why this matters for ML benchmarking.

Prereqs
Hypothesis Testing for MLBayesian EstimationREML and Variance Component Estimation
L1Tier 2~45m

ML Project Lifecycle

The full ML project workflow from problem definition through deployment and monitoring. Why most projects fail at data quality, not model architecture. Cross-functional requirements and MLOps basics.

Prereqs
Hardware for ML Practitioners
L2Tier 2~45m

P-Hacking and Multiple Testing

How selective reporting and multiple comparisons inflate false positive rates, and how Bonferroni and Benjamini-Hochberg corrections control them. Why hyperparameter tuning is multiple testing and benchmark shopping is p-hacking.

Prereqs
Hypothesis Testing for MLMeta-Analysis
L4Tier 2~45m

Prasad-Rao MSE Correction

Why the naive Fay-Herriot MSE is too small once the area variance is estimated, and how the classical Prasad-Rao decomposition adds the missing second-order term.

Prereqs
Small Area EstimationExpectation, Variance, Covariance, and MomentsREML and Variance Component Estimation
L2Tier 2~40m

Proper Scoring Rules

A scoring rule is proper exactly when the expected score is maximized by reporting the true belief. Log score and Brier score are strictly proper. Accuracy is not. Why this matters for calibrated probability estimates.

Prereqs
Evaluation Metrics and PropertiesROC Curve and AUC
L2Tier 2~50m

Reproducibility and Experimental Rigor

What it takes to make ML experiments truly reproducible: seeds, variance reporting, data hygiene, configuration management, and the discipline of multi-run evaluation.

Prereqs
Git and GitLab for ML ResearchPython for ML ResearchWeights and Biases for Experiment Tracking
L2Tier 2~35m

ROC Curve and AUC

Theory of the receiver operating characteristic curve. AUC as a Wilcoxon-Mann-Whitney probability, cost-weighted operating-point selection, the ROC convex hull, the Neyman-Pearson connection, and partial AUC for restricted FPR regions.

Prereqs
Confusion Matrices and Classification MetricsCommon Probability Distributions
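
A small illustration of the Wilcoxon-Mann-Whitney reading of AUC: the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counting half. Scores and labels are made up.

import numpy as np

def auc_pairwise(scores, labels):
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]          # all positive/negative score pairs
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
labels = np.array([1,   1,   0,   1,   0,    0,   1,   0  ])
print(auc_pairwise(scores, labels))             # 0.75
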
L2Tier 2~50m

Statistical Significance and Multiple Comparisons

p-values, significance levels, confidence intervals, and the multiple comparisons problem. Bonferroni correction, Benjamini-Hochberg FDR control, and why these matter for model selection and benchmark evaluation.

Prereqs
Hypothesis Testing for ML
L3Tier 2~45m

Synthetic Data Generation

Using models to generate training data: LLM-generated instructions, diffusion-based image augmentation, code synthesis. When synthetic data helps (low-resource, privacy) and when it hurts (model collapse).

Prereqs
Common Probability Distributions
L4Tier 3~35m

Adjusted Density Maximization

Why some small-area methods adjust the likelihood or posterior-like density to estimate shrinkage factors more stably when the variance component is near zero.

Prereqs
Small Area EstimationREML and Variance Component EstimationPrasad-Rao MSE Correction
L3Tier 3~45m

Benchmarking Methodology

What makes a good benchmark, how benchmarks fail (contamination, leaderboard gaming, single-number comparisons), and how to report results honestly with variance, seeds, and proper statistical practice.

Prereqs
Evaluation Metrics and PropertiesReproducibility and Experimental Rigor
L3Tier 3~55m

Causal Inference Basics

Correlation is not causation. The potential outcomes framework, average treatment effects, confounders, and the methods that let you estimate causal effects from data.

Prereqs
Hypothesis Testing for MLFeature Importance and InterpretabilitySurvival Analysis
L5Tier 3~40m

Energy Efficiency and Green AI

The compute cost of training frontier models, carbon footprint, FLOPs vs wall-clock time vs dollars, and why reporting efficiency matters. Efficient alternatives: distillation, pruning, quantization, and scaling laws for optimal compute allocation.

L2Tier 3~35m

Experiment Tracking and Tooling

MLflow, Weights and Biases, TensorBoard, Hydra, and the discipline of recording everything. What to log, how to compare experiments, and why you cannot reproduce what you did not record.

Prereqs
Reproducibility and Experimental RigorHardware for ML PractitionersML Project Lifecycle+1 more
L3Tier 3~45m

Official Statistics and National Surveys

How government statistical agencies produce population, economic, and social data through censuses and surveys, with quality frameworks and implications for ML practitioners using these datasets.

Prereqs
Survey Sampling MethodsPrasad-Rao MSE CorrectionSmall Area Estimation

Training Techniques & Regularization

12 topics / L2, L3 / Tiers 1-3

Adam, dropout, batch norm, data augmentation, learning rate schedules.

L2Tier 1~55m

Adam Optimizer

Adaptive moment estimation: first and second moment tracking, bias correction, AdamW vs Adam+L2, learning rate warmup, and when Adam fails compared to SGD.

Prereqs
Gradient Descent VariantsStochastic Gradient Descent Convergence
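
One Adam step written out, showing the two moment estimates and the bias correction; hyperparameters are the common defaults and the toy objective is illustrative. AdamW would additionally subtract lr * weight_decay * theta as a separate, decoupled term.

import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad             # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2        # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)                   # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m = v = np.zeros_like(theta)
for t in range(1, 101):                            # minimize 0.5*||theta||^2, so grad = theta
    theta, m, v = adam_step(theta, theta, m, v, t)
print(theta)                                       # moving toward the origin
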
L2Tier 1~50m

Batch Normalization

Normalize activations within mini-batches to stabilize and accelerate training, then scale and shift with learnable parameters.

Prereqs
Feedforward Networks and BackpropagationExpectation, Variance, Covariance, and MomentsActivation Functions+5 more
L2Tier 1~45m

Dropout

Dropout as stochastic regularization: Bernoulli masking, inverted dropout scaling, implicit ensemble interpretation, Bayesian connection via MC dropout, and equivalence to L2 regularization in linear models.

Prereqs
Feedforward Networks and BackpropagationCommon Probability Distributions
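
A sketch of inverted dropout as described above: a Bernoulli mask scaled by 1/keep_prob at training time, so inference applies no rescaling at all.

import numpy as np

def dropout(x, p_drop=0.2, train=True, rng=np.random.default_rng(0)):
    if not train or p_drop == 0.0:
        return x                                   # identity at inference time
    keep = 1.0 - p_drop
    mask = rng.random(x.shape) < keep              # Bernoulli(keep) mask
    return x * mask / keep                         # inverted scaling keeps the expected activation unchanged

x = np.ones((2, 5))
print(dropout(x))                 # some entries zeroed, the survivors scaled to 1.25
print(dropout(x, train=False))    # unchanged
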
L2Tier 1~45m

Learning Rate Scheduling

Why learning rate moves the loss more than any other hyperparameter, and how warmup, cosine decay, step decay, cyclic schedules, and 1cycle policy control training dynamics.

Prereqs
Stochastic Gradient Descent ConvergenceAdam OptimizerBatch Size and Learning Dynamics+1 more
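
A sketch of one common recipe mentioned above: linear warmup followed by cosine decay to a floor. Step counts and learning rates are illustrative, not a recommendation.

import math

def lr_schedule(step, max_lr=3e-4, warmup_steps=1000, total_steps=100_000, min_lr=3e-5):
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps            # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))   # cosine decay

print(lr_schedule(0), lr_schedule(1000), lr_schedule(100_000))
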
L2Tier 1~45m

Regularization in Practice

Practical guide to regularization: L2 (weight decay), L1 (sparsity), dropout, early stopping, data augmentation, and batch normalization. How each constrains model complexity and how to choose between them.

Prereqs
Regularization TheoryCross-Entropy Loss: MLE, KL Divergence, and Classification
L2Tier 1~40m

Weight Initialization

Why initialization determines whether gradients vanish or explode, and how Xavier and He initialization preserve variance across layers.

Prereqs
Feedforward Networks and BackpropagationEigenvalues and EigenvectorsActivation Functions
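
A sketch of the variance-preservation argument: Xavier scales by fan-in plus fan-out, He scales by fan-in alone to compensate for ReLU zeroing half the pre-activations. Layer sizes and depth are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # Xavier/Glorot: variance 2 / (fan_in + fan_out), suited to tanh-like activations
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

def he_init(fan_in, fan_out):
    # He: variance 2 / fan_in, compensating for ReLU discarding half the signal
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

x = rng.normal(size=(1024, 512))
for _ in range(20):                                  # 20 ReLU layers with He init
    x = np.maximum(x @ he_init(512, 512), 0.0)
print(x.std())                                       # stays O(1) instead of vanishing or exploding
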
L3Tier 2~35m

Activation Checkpointing

Trade compute for memory by recomputing activations during the backward pass instead of storing them all. Reduces activation memory from linear to square-root in the number of layers.

Prereqs
Feedforward Networks and BackpropagationMixed Precision Training
L2Tier 2~45m

Batch Size and Learning Dynamics

How batch size affects what SGD finds: gradient noise, implicit regularization, the linear scaling rule, sharp vs flat minima, and the gradient noise scale as the key quantity governing the tradeoff.

Prereqs
Stochastic Gradient Descent ConvergenceAdam Optimizer
L2Tier 2~45m

Data Augmentation Theory

Why data augmentation works as a regularizer: invariance injection, effective sample size, MixUp, CutMix, and the connection to Vicinal Risk Minimization.

Prereqs
Contrastive LearningRegularization in PracticeSelf-Supervised Vision+1 more
L2Tier 2~35m

Label Smoothing and Regularization

Replace hard targets with soft targets to prevent overconfidence: the label smoothing formula, its connection to maximum entropy regularization, calibration effects, and when it hurts.

Prereqs
Logistic Regression
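
The label smoothing formula in code: mix the one-hot target with the uniform distribution over classes. The smoothing strength eps = 0.1 is just a common illustrative value.

import numpy as np

def smooth_labels(y, num_classes, eps=0.1):
    # (1 - eps) on the true class, eps spread uniformly over all classes
    one_hot = np.eye(num_classes)[y]
    return (1.0 - eps) * one_hot + eps / num_classes

print(smooth_labels(np.array([2, 0]), num_classes=4))
# [[0.025 0.025 0.925 0.025]
#  [0.925 0.025 0.025 0.025]]
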
L3Tier 2~45m

Mixed Precision Training

Train with FP16 or BF16 for speed while keeping FP32 master weights for accuracy. Loss scaling, overflow prevention, and when mixed precision fails.

Prereqs
Floating-Point ArithmeticAdam OptimizerDistributed Training Theory+3 more
L2Tier 3~35m

Curriculum Learning

Train on easy examples first, gradually increase difficulty. Curriculum learning can speed convergence and improve generalization, but defining difficulty is the hard part. Self-paced learning, anti-curriculum, and the connection to importance sampling.

Prereqs
Synthetic Data Distillation

AI Safety & Alignment

17 topics / L3, L4, L5 / Tiers 1-2

RLHF failure modes, hallucination theory, interpretability, reward hacking.

L4Tier 1~90m

Mechanistic Interpretability: Features, Circuits, and Causal Faithfulness

Reverse-engineering trained neural networks. Coverage of the superposition hypothesis, sparse autoencoders for feature extraction, the linear representation hypothesis and its counterexamples, induction heads and IOI as canonical circuits, sparse feature circuits (Marks et al. 2024), cross-layer transcoders (Lindsey et al. 2024), activation patching (noising vs denoising), faithfulness checks, frontier-scale evidence from Anthropic's Scaling Monosemanticity (Templeton et al. 2024) and DeepMind's Gemma Scope (Lieberum et al. 2024), and the limits of current interpretability.

Prereqs
Transformer ArchitecturePrincipal Component AnalysisKolmogorov-Arnold Networks (KANs)+2 more
L4Tier 2~55m

Adversarial Machine Learning

Attacks across the ML lifecycle: evasion (adversarial examples), poisoning (corrupt training data), model extraction (steal via queries), privacy leakage (membership inference), and LLM jailbreaks. Why defenses are hard.

Prereqs
Feedforward Networks and BackpropagationMinimax and Saddle PointsVon Neumann Minimax Theorem
L3Tier 2~50m

Calibration and Uncertainty Quantification

When a model says 70% confidence, is it right 70% of the time? Calibration measures the alignment between predicted probabilities and actual outcomes. Temperature scaling, Platt scaling, conformal prediction, and MC dropout provide practical tools for trustworthy uncertainty.

Prereqs
Logistic RegressionBits, Nats, Perplexity, and BPBDecoding Strategies+3 more
L4Tier 2~50m

Catastrophic Forgetting

Fine-tuning a neural network on new data destroys knowledge of old data. The stability-plasticity dilemma, mitigation strategies (EWC, LwF, SI, MAS, replay, progressive networks), and implications for continual learning and LLM fine-tuning.

Prereqs
Fine-Tuning and Adaptation
L5Tier 2~50m

Constitutional AI

Anthropic's approach to alignment: replace human harmlessness labels with a constitution of principles that the model uses to self-evaluate, enabling scalable AI feedback.

Prereqs
RLHF and AlignmentReinforcement Learning from Human FeedbackReward Hacking
L3Tier 2~50m

Continual Learning and Forgetting

Learning sequentially without destroying previous knowledge: Elastic Weight Consolidation, progressive networks, replay methods, and the stability-plasticity tradeoff in deployed systems.

Prereqs
Optimizer Theory: SGD, Adam, and Muon
L5Tier 2~55m

Data Contamination and Evaluation

When training data overlaps test benchmarks, model scores become meaningless. Types of contamination, detection methods, dynamic benchmarks, and how to read evaluation claims skeptically.

Prereqs
Hypothesis Testing for MLBenchmarking MethodologyModel Collapse and Data Quality+1 more
L3Tier 2~55m

Differential Privacy

Formal privacy guarantees for algorithms: epsilon-delta DP, Laplace and Gaussian mechanisms, composition theorems, DP-SGD for training neural networks, and the privacy-utility tradeoff.

Prereqs
Common Probability DistributionsFederated Learning
L3Tier 2~55m

Ethics and Fairness in ML

Fairness definitions (demographic parity, equalized odds, calibration), the impossibility theorem showing they cannot all hold simultaneously, bias sources, and mitigation strategies at each stage of the pipeline.

L5Tier 2~50m

LLM Application Security

The OWASP LLM Top 10 (2025): prompt injection, sensitive information disclosure, supply chain, data and model poisoning, improper output handling, excessive agency, system prompt leakage, vector and embedding weaknesses, misinformation, and unbounded consumption. Standard application security for the GenAI era.

Prereqs
Adversarial Machine LearningRLHF and Alignment
L5Tier 2~45m

Model Collapse and Data Quality

When models train on their own outputs, the learned distribution narrows, tails disappear, and quality degrades across generations. Why synthetic data feedback loops threaten pretraining data quality and how to mitigate them.

Prereqs
Synthetic Data Generation
L3Tier 2~50m

Out-of-Distribution Detection

Methods for detecting when test inputs differ from training data, where naive softmax confidence fails and principled alternatives based on energy, Mahalanobis distance, and typicality succeed.

Prereqs
Calibration and Uncertainty QuantificationAnomaly Detection for Gravitational WavesCNNs for Medical Imaging
L5Tier 2~45m

Red-Teaming and Adversarial Evaluation

Systematically trying to make models produce harmful or incorrect outputs: manual and automated red-teaming, jailbreaks, prompt injection, adversarial suffixes, and why adversarial evaluation is necessary before deployment.

Prereqs
RLHF and AlignmentCalibration and Uncertainty Quantification
L5Tier 2~50m

Reward Hacking

Goodhart's law for AI: when models exploit reward model weaknesses instead of being genuinely helpful, including verbosity hacking, sycophancy, and structured mitigation strategies.

Prereqs
Reward Models and VerifiersRLHF and AlignmentHallucination Theory+1 more
L5Tier 2~55m

Reward Models and Verifiers

How preference reward models, outcome verifiers, process reward models, executable checks, and ensembles provide different training signals, and where Goodhart pressure enters.

Prereqs
RLHF and AlignmentPost-Training OverviewReasoning Data Curation+1 more
L4Tier 2~70m

Truth Directions and Linear Probes

The geometry-of-truth program asks whether transformer residual streams linearly separate true from false factual statements. Linear probes can reveal such structure; activation interventions test whether the direction is merely correlational or actually behavior-changing.

Prereqs
Mechanistic Interpretability: Features, Circuits, and Causal FaithfulnessResidual Stream and Transformer Internals
L5Tier 2~55m

Verifier Design and Process Reward

Detailed treatment of verifier types, process vs outcome reward models, verifier-guided search, self-verification, and the connection to test-time compute scaling. How to design reward signals for reasoning models.

Prereqs
Reward Models and Verifiers

Reinforcement Learning Theory

26 topics / L2, L3, L4, L5 / Tiers 1-3

MDPs, Bellman, policy gradients, multi-agent, game theory.

L2Tier 1~45m

Bellman Equations

The recursive backbone of RL. State-value and action-value Bellman equations, the contraction mapping property, convergence of value iteration, and why recursive decomposition is the central idea in sequential decision-making.

Prereqs
Markov Decision ProcessesExpectation, Variance, Covariance, and Moments
L2Tier 1~70m

Markov Decision Processes

The mathematical framework for sequential decision-making under uncertainty: states, actions, transitions, rewards, and the Bellman equations that make solving them possible.

Prereqs
Convex Optimization BasicsConcentration InequalitiesBayesian State Estimation+3 more
L3Tier 1~75m

Policy Gradient Theorem

The fundamental result enabling gradient-based optimization of parameterized policies: the policy gradient theorem and the algorithms it spawns.

Prereqs
Markov Decision ProcessesConvex Optimization BasicsMulti-Armed Bandits Theory+4 more
L2Tier 1~55m

Q-Learning

Model-free, off-policy value learning: the Q-learning update rule, convergence under Robbins-Monro conditions, and the deep Q-network revolution that introduced function approximation, experience replay, and the deadly triad.

Prereqs
Value Iteration and Policy IterationBellman EquationsStochastic Approximation Theory+1 more
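
A minimal sketch of the tabular Q-learning update this entry describes, assuming a Gymnasium-style environment with discrete integer states (the `env` argument and its `reset`/`step` signature are assumptions of the sketch, not part of this page); epsilon, step size, and episode count are illustrative, not tuned.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular Q-learning: off-policy TD control with an epsilon-greedy behavior policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Q-learning update: bootstrap from the greedy action in the next state
            target = r + (0.0 if terminated else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```
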
L3Tier 1~40m

Reward Design and Reward Misspecification

The hardest problem in RL: specifying what you want. Reward shaping, potential-based shaping theorem, specification gaming, Goodhart's law in RL, and the bridge from classic RL to alignment.

Prereqs
Markov Decision ProcessesBellman EquationsReinforcement Learning for Drug Discovery
L2Tier 1~55m

Value Iteration and Policy Iteration

The two foundational algorithms for solving MDPs exactly: value iteration applies the Bellman optimality operator until convergence, while policy iteration alternates between exact evaluation and greedy improvement.

Prereqs
Markov Decision ProcessesBellman Equations
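
A minimal value-iteration sketch under stated assumptions: the MDP is supplied as a dense transition tensor `P[s, a, s']` and expected-reward array `R[s, a]` (hypothetical names), and the loop applies the Bellman optimality backup until the value change falls below a tolerance.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """P: (S, A, S) transition probabilities; R: (S, A) expected rewards."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * (P @ V)        # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)          # greedy policy w.r.t. the converged values
    return V_new, policy
```
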
L3Tier 2~55m

Actor-Critic Methods

The dominant paradigm for deep RL and LLM training: an actor (policy network) guided by a critic (value network), with advantage estimation, PPO clipping, and entropy regularization.

Prereqs
Policy Gradient TheoremQ-LearningReward Systems and Reinforcement Learning Neuroscience+1 more
L5Tier 2~60m

Agentic RL and Tool Use

LLMs as multi-step policies: observations, tool calls, environment feedback, sparse rewards, credit assignment, and why agent training differs from single-turn RLHF.

Prereqs
Markov Decision ProcessesPolicy Gradient TheoremOffline Reinforcement Learning+2 more
L3Tier 2~40m

DDPG: Deep Deterministic Policy Gradient

An off-policy actor-critic algorithm for continuous action spaces that combines a deterministic policy gradient with a DQN-style critic, using replay buffers and Polyak-averaged target networks for stability.

Prereqs
Policy Gradient TheoremQ-LearningActor-Critic Methods
L2Tier 2~45m

Exploration vs Exploitation

The fundamental tradeoff in sequential decision-making: exploit known good actions to collect reward now, or explore uncertain actions to discover potentially better strategies. Epsilon-greedy, Boltzmann exploration, UCB, count-based methods, and intrinsic motivation.

Prereqs
Multi-Armed Bandits TheoryMarkov Decision ProcessesThe Bitter Lesson+1 more
L3Tier 2~50m

Markov Games and Self-Play

Multi-agent extensions of MDPs where multiple agents with separate rewards interact. Nash equilibria, minimax values in zero-sum games, and self-play as a training method.

Prereqs
Markov Decision ProcessesReinforcement Learning for Auction Design
L2Tier 2~50m

Minimax and Saddle Points

Minimax theorems characterize when max-min equals min-max. Saddle points arise in zero-sum games, duality theory, GANs, and robust optimization.

Prereqs
Convex Optimization BasicsConvex Duality
L4Tier 2~50m

Multi-Agent Collaboration

Multiple LLM agents working together on complex tasks: debate for improving reasoning, division of labor across specialist agents, structured communication protocols, and when multi-agent outperforms single-agent systems.

Prereqs
Markov Decision ProcessesPolicy Gradient Theorem
L2Tier 2~55m

Multi-Armed Bandits Theory

The exploration-exploitation tradeoff formalized: K arms, regret as the cost of not knowing the best arm, and algorithms (UCB, Thompson sampling) that achieve near-optimal regret bounds.

Prereqs
Common Probability DistributionsBayesian Optimization for HyperparametersNo-Regret Learning+1 more
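
A minimal UCB1 sketch for the stochastic K-armed bandit setting described here, assuming Bernoulli arms with hypothetical `true_means`; it returns the cumulative pseudo-regret so sublinear growth can be checked empirically.

```python
import numpy as np

def ucb1(true_means, horizon=10_000, seed=0):
    rng = np.random.default_rng(seed)
    K = len(true_means)
    counts = np.zeros(K)      # pulls per arm
    sums = np.zeros(K)        # cumulative reward per arm
    # play each arm once to initialize the estimates
    for a in range(K):
        sums[a] += rng.random() < true_means[a]
        counts[a] += 1
    regret, best = 0.0, max(true_means)
    for t in range(K, horizon):
        # upper confidence bound: empirical mean plus exploration bonus
        ucb = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
        a = int(ucb.argmax())
        sums[a] += rng.random() < true_means[a]
        counts[a] += 1
        regret += best - true_means[a]
    return regret  # cumulative pseudo-regret over the horizon
```
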
L3Tier 2~55m

No-Regret Learning

Online learning against adversarial losses: regret as cumulative loss minus that of the best fixed action in hindsight, multiplicative weights, follow the regularized leader, and why no-regret dynamics converge to Nash equilibria in zero-sum games.

Prereqs
Common Probability DistributionsConcentration Inequalities
L3Tier 2~55m

Offline Reinforcement Learning

Learning policies from a fixed dataset without environment interaction: distributional shift as the core challenge, conservative Q-learning (CQL) as the standard fix, and Decision Transformer as an alternative sequence modeling approach.

Prereqs
Q-Learning
L3Tier 2~60m

Online Learning and Bandits

Sequential decision making with adversarial or stochastic feedback: the bandit setting, explore-exploit tradeoff, UCB, Thompson sampling, and regret bounds. Connections to RL and A/B testing.

Prereqs
No-Regret LearningAdaptive Learning Is Not IIDTest-Time Training and Adaptive Inference
L3Tier 2~55m

Policy Optimization: PPO and TRPO

Trust region and clipped surrogate methods for stable policy optimization: TRPO enforces a KL constraint for monotonic improvement, PPO replaces it with a simpler clipped objective, and PPO became the default optimizer for LLM post-training.

Prereqs
Policy Gradient TheoremActor-Critic MethodsDDPG: Deep Deterministic Policy Gradient+2 more
L3Tier 2~40m

Policy Representations

How to parameterize policies in reinforcement learning: categorical for discrete actions, Gaussian for continuous actions, and why the choice affects gradient variance and exploration.

Prereqs
Markov Decision Processes
L3Tier 2~55m

Self-Play and Multi-Agent RL

Self-play as a training paradigm for competitive games, fictitious play convergence, AlphaGo/AlphaZero, and the challenges of multi-agent reinforcement learning: non-stationarity, partial observability, and centralized training.

Prereqs
Markov Decision ProcessesAgent-Based Modeling with MLNo-Regret Learning
L2Tier 2~50m

Temporal Difference Learning

Temporal difference methods bootstrap value estimates from other value estimates, enabling online, incremental learning without waiting for episode termination. TD(0), SARSA, and TD(lambda) with eligibility traces.

Prereqs
Markov Decision ProcessesValue Iteration and Policy IterationBellman Equations+1 more
L3Tier 2~40m

TD3: Twin Delayed Deep Deterministic Policy Gradient

An off-policy actor-critic algorithm that fixes DDPG's overestimation bias with clipped double-Q learning, target policy smoothing, and delayed policy updates. The minimum-complexity robust continuous-control algorithm.

Prereqs
DDPG: Deep Deterministic Policy GradientQ-LearningPolicy Gradient Theorem
L4Tier 3~55m

Mean-Field Games

The many-agent limit of strategic interactions: as the number of agents goes to infinity, each agent solves an MDP against the population distribution, and equilibrium becomes a fixed-point condition on the mean field.

Prereqs
Markov Decision ProcessesMean Field TheoryAgent-Based Modeling with ML
L3Tier 3~50m

Options and Temporal Abstraction

The options framework for hierarchical RL: temporally extended actions with initiation sets, internal policies, and termination conditions. Semi-MDPs and learning options end-to-end.

Prereqs
Markov Decision ProcessesValue Iteration and Policy IterationPolicy Representations
L3Tier 3~40m

Reinforcement Learning Environments and Benchmarks

The standard RL evaluation stack: Gymnasium API, classic control tasks, Atari, MuJoCo, ProcGen, the sim-to-real gap, and why benchmark performance is a poor predictor of real-world RL capability.

Prereqs
Markov Decision ProcessesDeep RL for Control
L4Tier 3~45m

Robust Adversarial Policies

Robust MDPs optimize against worst-case transition dynamics within an uncertainty set. Adversarial policies formalize distribution shift in RL as a game between agent and environment.

Prereqs
Markov Decision ProcessesMinimax Lower Bounds: Le Cam, Fano, Assouad, and the Reduction to Testing

Beyond LLMs

20 topics / L4, L5 / Tiers 1-3

JEPA, world models, vision-first AI, diffusion, audio, 3D, state-space, and continuous-time models.

L4Tier 1~80m

CLIP, OpenCLIP, and SigLIP: Contrastive Language-Image Pretraining

Radford et al. 2021 (CLIP) trained two encoders, one for images and one for text, with a symmetric InfoNCE objective on 400M web pairs. The result was a shared embedding space that powers zero-shot classification, retrieval, and serves as the visual backbone of every modern vision-language model. This page covers the contrastive objective as a mutual-information bound, the OpenCLIP scaling laws (Cherti et al. 2023), the SigLIP pairwise-sigmoid alternative (Zhai et al. 2023), the modality gap (Liang et al. 2022), and the practical pipeline from training corpus to LLaVA-style VLM backbone.

Prereqs
Contrastive LearningVision Transformer Lineage: ViT, DeiT, Swin, MAE, DINOv2, SAMInformation Theory Foundations+1 more
L4Tier 1~95m

Diffusion Models

Generative models that learn to reverse a noise-adding process: forward and reverse SDEs, score matching, the DDPM ELBO, deterministic ODE sampling, classifier-free guidance, latent diffusion, and the bridge to flow matching.

Prereqs
Variational AutoencodersScore MatchingBoltzmann Machines and Hopfield Networks+16 more
L4Tier 1~80m

Vision Transformer Lineage: ViT, DeiT, Swin, MAE, DINOv2, SAM

Dosovitskiy et al. 2021 (ViT) showed that a plain transformer applied to image patches matches CNN backbones once given enough pretraining data. The follow-on lineage (DeiT, Swin, MAE, DINOv2, SAM, register tokens, NaViT, ViT-22B, SigLIP-2) traded design knobs along three axes: how much locality bias to keep, what supervision to train against, and whether tokens are spatial-uniform or content-adaptive. This page covers the patch-embedding complexity, Swin's shifted-window argument, the MAE 75% masking ratio, DINO self-distillation with centering, and the 2024 register-token finding that closes the long-standing CLIP/ViT artifact-token gap.

Prereqs
Transformer ArchitectureConvolutional Neural NetworksCNNs for Medical Imaging
L4Tier 2~45m

Equilibrium and Implicit-Layer Models

Deep Equilibrium Models (DEQ) replace explicit depth with a fixed-point equation: instead of stacking L layers, solve for the equilibrium state where one more layer would not change the output. This enables infinite-depth networks with constant memory, using implicit differentiation for backprop.

Prereqs
Skip Connections and ResNetsImplicit Differentiation
L4Tier 2~50m

Equivariant Deep Learning

Networks that respect symmetry: if the input transforms under a group action, the output transforms predictably. Equivariance generalizes translation equivariance in CNNs to rotations, permutations, and gauge symmetries, reducing sample complexity and improving generalization on structured data.

Prereqs
Convolutional Neural NetworksGraph Neural NetworksAttention for Protein Structure: AlphaFold and Successors
L5Tier 2~50m

Florence and Vision Foundation Models

Vision foundation models that unify classification, detection, segmentation, captioning, and VQA under a single pretrained backbone. Florence, Florence-2, and the path toward GPT-4V-style multimodal understanding.

Prereqs
Vision Transformer Lineage: ViT, DeiT, Swin, MAE, DINOv2, SAMSelf-Supervised Vision
L4Tier 2~55m

Flow Matching

Learn a velocity field that transports noise to data along straight-line paths. Simpler training than diffusion, faster sampling, and cleaner math.

Prereqs
Diffusion ModelsIto's LemmaPDE Fundamentals for Machine Learning
L4Tier 2~55m

JEPA and Joint Embedding

LeCun's Joint Embedding Predictive Architecture: learning by predicting abstract representations rather than pixels, energy-based formulation, I-JEPA, V-JEPA, V-JEPA 2, and the connection to contrastive learning and world models.

Prereqs
AutoencodersVariational AutoencodersHistory of Artificial Intelligence+3 more
L4Tier 2~65m

Kolmogorov-Arnold Networks (KANs)

An alternative to MLPs where learnable univariate functions (typically B-splines) sit on edges and pure summation sits on nodes. Motivated by the Kolmogorov-Arnold representation theorem, competitive on small smooth scientific tasks, unproven at frontier scale.

Prereqs
Universal Approximation TheoremFeedforward Networks and BackpropagationActivation Functions
L4Tier 2~60m

Mamba and State-Space Models

Linear-time sequence modeling via structured state spaces: S4, HiPPO initialization, selective state-space models (Mamba), and the architectural fork from transformers.

Prereqs
Recurrent Neural NetworksAttention Mechanism TheoryDeep Learning for Time Series+4 more
L4Tier 2~55m

Physics-Informed Neural Networks

Embedding PDE constraints directly into the neural network loss function via automatic differentiation. When physics-informed learning works, when it fails, and what alternatives exist.

Prereqs
The Jacobian MatrixAutomatic DifferentiationFeedforward Networks and Backpropagation+6 more
L4Tier 2~50m

Self-Supervised Vision

Learning visual representations without labels: contrastive methods (SimCLR, MoCo), self-distillation (DINO/DINOv2), and masked image modeling (MAE). Why self-supervised vision matters for transfer learning and label-scarce domains.

Prereqs
Vision Transformer Lineage: ViT, DeiT, Swin, MAE, DINOv2, SAMAttention for Protein Structure: AlphaFold and SuccessorsCNNs for Signal Feature Extraction
L5Tier 2~40m

Test-Time Training and Adaptive Inference

Updating model parameters at inference time using self-supervised objectives on the test input itself. TTT layers replace fixed linear recurrences (as in Mamba) with learned update rules, blurring the boundary between training and inference.

Prereqs
Stochastic Gradient Descent ConvergenceRecurrent Neural NetworksContinuous Thought Machines
L5Tier 2~50m

Video World Models

Turning pretrained video diffusion models into interactive world simulators: condition on actions to generate future frames, enabling RL agent training, robot planning, and game AI without physical environments.

Prereqs
World Models and PlanningDiffusion Models
L4Tier 2~60m

World Models and Planning

Learning a model of the environment and planning inside it: Dreamer's latent dynamics, MuZero's learned model with MCTS, video world models, and why planning in imagination is the path to sample-efficient and safe AI.

Prereqs
Markov Decision ProcessesThe Era of ExperienceHistory of Artificial Intelligence+1 more
L5Tier 3~50m

Audio Language Models

Models that process and generate speech alongside text: audio tokenization, Whisper for transcription, end-to-end voice models, music generation, and the audio-language frontier.

Prereqs
Speech and Audio MLTransformer Architecture
L5Tier 3~35m

Continuous Thought Machines

Neural networks that process information through continuous-time internal dynamics rather than discrete layer-by-layer computation. Inspired by neural ODEs and dynamical systems, these architectures let the network 'think' for a variable amount of internal time before producing an output.

Prereqs
Neural ODEs and Continuous-Depth NetworksEquilibrium and Implicit-Layer Models
L4Tier 3~45m

3D Gaussian Splatting

Represent a 3D scene as millions of 3D Gaussians, each with position, covariance, opacity, and color. Render by projecting to 2D and alpha-compositing. Real-time, high-quality novel view synthesis without neural networks at render time.

Prereqs
Occupancy Networks and Neural FieldsPositive Semidefinite MatricesDiffusion Models+1 more
L4Tier 3~45m

Occupancy Networks and Neural Fields

Representing 3D geometry and appearance as continuous functions parameterized by neural networks: NeRF, occupancy networks, DeepSDF, volume rendering, and the connection to Gaussian splatting.

Prereqs
Feedforward Networks and Backpropagation
L5Tier 3~40m

World Model Evaluation

How to measure whether a learned world model is useful: prediction accuracy, controllability (sim-to-real transfer), planning quality, and why long-horizon evaluation is hard.

Prereqs
World Models and Planning

ML Methods

74 topics / L1, L2, L3 / Tiers 1-3

Core supervised / unsupervised algorithms: CNNs, autoencoders, boosting, clustering, contrastive learning.

L1Tier 1~45m

Activation Functions

Nonlinear activation functions in neural networks: sigmoid, tanh, ReLU, Leaky ReLU, GELU, and SiLU. Their gradients, saturation behavior, and impact on trainability.

Prereqs
Differentiation in RⁿConvex Optimization Basics
L2Tier 1~40m

AIC and BIC

Akaike and Bayesian information criteria for model selection: how they trade off fit versus complexity, when to use each, and their connection to cross-validation.

Prereqs
Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic EfficiencyDecision Trees and EnsemblesGauss-Markov Theorem+1 more
L2Tier 1~35m

Bagging

Bootstrap Aggregation: train B models on bootstrap samples and average their predictions. Why averaging reduces variance, how correlation limits the gain, and out-of-bag error estimation.

Prereqs
Bootstrap MethodsDecision Trees and Ensembles
L1Tier 1~40m

Cross-Entropy Loss: MLE, KL Divergence, and Classification

Why cross-entropy is the correct loss for classification: its derivation as negative log-likelihood, connection to KL divergence, why MSE fails for classification, and practical variants including label smoothing and focal loss.

Prereqs
Information Theory FoundationsLogistic RegressionLog-Probability Computation+1 more
L1Tier 1~45m

Data Preprocessing and Feature Engineering

Standardization, scaling, encoding, imputation, and feature selection. Why most algorithms assume centered, scaled inputs and what breaks when you skip preprocessing.

Prereqs
Common Probability DistributionsLinear Regression
L2Tier 1~60m

Feedforward Networks and Backpropagation

Feedforward neural networks as compositions of affine transforms and nonlinearities, the universal approximation theorem, and backpropagation as reverse-mode automatic differentiation on the computational graph.

Prereqs
Differentiation in RⁿMatrix CalculusActivation Functions+8 more
L2Tier 1~45m

Gauss-Markov Theorem

Among all linear unbiased estimators, ordinary least squares has the smallest variance: the BLUE theorem. Understanding when its assumptions fail is just as important as the result itself.

Prereqs
Linear RegressionThe Multivariate Normal Distribution
L2Tier 1~60m

Gradient Boosting

Gradient boosting as functional gradient descent: fit weak learners to pseudo-residuals sequentially, reducing bias at each round. Covers AdaBoost, shrinkage, XGBoost second-order methods, and LightGBM leaf-wise growth.

Prereqs
Decision Trees and EnsemblesGradient Descent VariantsAdaBoost+2 more
L1Tier 1~40m

K-Means Clustering

Lloyd's algorithm for partitional clustering: the within-cluster sum of squares objective, convergence guarantees, k-means++ initialization, choosing k, and the connection to EM for Gaussians.

Prereqs
Common Probability DistributionsConvex Optimization BasicsNMF (Nonnegative Matrix Factorization)+2 more
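
A minimal Lloyd's-algorithm sketch assuming a NumPy data matrix `X` of shape (n, d); k-means++ initialization and multiple restarts are deliberately omitted.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # naive initialization: k distinct data points chosen uniformly at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: nearest center under squared Euclidean distance
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: each center moves to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```
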
L2Tier 1~60m

The Kernel Trick

Any algorithm whose computation accesses data only through inner products can be lifted to a feature space implicitly defined by a kernel, without ever computing the features. Worked polynomial-kernel inner-product equivalence, infinite-feature interpretation of the Gaussian RBF, Mercer's condition, and the four canonical applications: kernel SVM, kernel ridge regression, kernel PCA, kernel k-means.

Prereqs
Support Vector MachinesRidge RegressionBayesian Linear Regression+2 more
L2Tier 1~65m

Lasso Regression

OLS with L1 regularization: sparsity, the geometry of why L1 selects variables, proximal gradient descent, LARS, and elastic net.

Prereqs
Linear RegressionConvex Optimization BasicsRidge Regression+2 more
L1Tier 1~50m

Linear Regression

Ordinary least squares as projection, the normal equations, the hat matrix, Gauss-Markov optimality, and the connection to maximum likelihood under Gaussian noise.

Prereqs
Matrix Operations and PropertiesMaximum Likelihood Estimation: Theory, Information Identity, and Asymptotic EfficiencyThe Elements of Statistical Learning (Hastie, Tibshirani, Friedman)+1 more
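
A minimal OLS sketch assuming a NumPy design matrix `X` and response `y`; it solves the normal-equations problem through `lstsq` rather than an explicit matrix inverse, which is the numerically safer route.

```python
import numpy as np

def ols_fit(X, y, add_intercept=True):
    if add_intercept:
        X = np.column_stack([np.ones(len(X)), X])
    # least-squares solution of X beta ~= y, i.e. the normal equations X^T X beta = X^T y
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta                  # fitted values = projection of y onto col(X)
    return beta, y_hat
```
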
L1Tier 1~50m

Logistic Regression

The foundational linear classifier: sigmoid link function, maximum likelihood estimation, cross-entropy loss, gradient derivation, and regularized variants.

Prereqs
Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic EfficiencyConvex Optimization BasicsData Preprocessing and Feature Engineering+2 more
L1Tier 1~50m

Loss Functions Catalog

A systematic catalog of loss functions for classification, regression, and representation learning: cross-entropy, MSE, hinge, Huber, focal, KL divergence, and contrastive loss.

Prereqs
Logistic Regression
L2Tier 1~35m

Overfitting and Underfitting

The two failure modes of supervised learning: models that memorize noise versus models too simple to capture signal. Diagnosis via training-validation gaps.

Prereqs
Empirical Risk MinimizationBias-Variance Tradeoff
L1Tier 1~55m

Principal Component Analysis

Dimensionality reduction via variance maximization: PCA as eigendecomposition of the covariance matrix, PCA as truncated SVD of the centered data matrix, reconstruction error, and when sample PCA works.

Prereqs
Eigenvalues and EigenvectorsSingular Value DecompositionGram Matrices and Kernel Matrices+4 more
L2Tier 1~45m

Random Forests

Random forests combine bagging with random feature subsampling to decorrelate trees, reducing ensemble variance beyond what pure bagging achieves. Out-of-bag estimation, variable importance, consistency theory, and practical strengths and weaknesses.

Prereqs
Decision Trees and EnsemblesBootstrap MethodsBagging
L1Tier 1~85m

Ridge Regression

L2-regularized least squares: closed-form shrinkage, SVD geometry, MSE dominance over OLS, the Bayesian Gaussian-prior connection, effective degrees of freedom, kernel ridge, and the modern overparameterized regime where ridgeless OLS can still generalize.

Prereqs
Linear RegressionConvex Optimization BasicsShrinkage Estimation and the James-Stein Estimator: Inadmissibility, SURE, and Brown's Characterization+3 more
L3Tier 1~80m

Score Matching

Hyvärinen 2005: train a model to estimate the score (gradient of log density) without computing the normalization constant. Integration by parts converts the intractable density-matching loss into a tractable gradient-based objective. Sliced score matching makes the Jacobian-trace term scale, denoising score matching reparameterizes the loss as ε-regression, and Tweedie's formula identifies the score with a posterior-mean denoiser. Together these are the training half of every modern diffusion model and energy-based model.

Prereqs
Stochastic Differential EquationsFokker–Planck EquationExpectation, Variance, Covariance, and Moments+4 more
L2Tier 1~45m

Skip Connections and ResNets

Residual connections let gradients flow through identity paths, enabling training of very deep networks. ResNets learn residual functions F(x) = H(x) - x, which is easier than learning H(x) directly.

Prereqs
Feedforward Networks and Backpropagation
L2Tier 1~70m

Support Vector Machines

Maximum-margin classifiers via convex optimization: hard margin, soft margin with slack variables, hinge loss, the dual formulation, and the kernel trick.

Prereqs
Convex Optimization BasicsConvex DualityLogistic Regression+2 more
L2Tier 1~45m

Universal Approximation Theorem

A single hidden layer neural network can approximate any continuous function on a compact set to arbitrary accuracy. Why this is both important and misleading: it says nothing about width, weight-finding, or generalization.

Prereqs
Feedforward Networks and Backpropagation
L3Tier 1~65m

Variational Autoencoders

Deriving the ELBO, the reparameterization trick for backpropagation through sampling, and how VAEs turn autoencoders into principled generative models via amortized variational inference.

Prereqs
AutoencodersMaximum Likelihood Estimation: Theory, Information Identity, and Asymptotic EfficiencyAutoencoders for Low-Dimensional Dynamical Structures+4 more
L2Tier 2~50m

AdaBoost

AdaBoost as iterative reweighting of misclassified samples, exponential loss minimization, weak-to-strong learner amplification, margin theory, and the connection to coordinate descent.

Prereqs
Decision Trees and Ensembles
L2Tier 2~50m

Anomaly Detection

Methods for identifying data points that deviate from expected patterns: isolation forests, one-class SVMs, autoencoders, statistical distances, and why the absence of anomaly labels makes this problem structurally harder than classification.

Prereqs
Common Probability DistributionsAnomaly Detection for Gravitational Waves
L2Tier 2~45m

Autoencoders

Encoder-decoder architectures for unsupervised representation learning: undercomplete bottlenecks, sparse and denoising variants, and the connection between linear autoencoders and PCA.

Prereqs
Feedforward Networks and BackpropagationVectors, Matrices, and Linear MapsBoltzmann Machines and Hopfield Networks+3 more
L3Tier 2~55m

Contrastive Learning

Learning representations by pulling positive pairs together and pushing negative pairs apart, with theoretical grounding in mutual information maximization.

Prereqs
Feedforward Networks and Backpropagation
L3Tier 2~55m

Convolutional Neural Networks

How weight sharing and local connectivity exploit spatial structure: convolution as cross-correlation, translation equivariance, pooling for approximate invariance, and the conv-pool-fc architecture.

Prereqs
Feedforward Networks and BackpropagationVectors, Matrices, and Linear MapsFast Fourier Transform+3 more
L2Tier 2~60m

Decision Trees and Ensembles

Greedy recursive partitioning with splitting criteria, pruning, and why combining weak learners via bagging (random forests) and boosting (gradient boosting) yields strong predictors.

Prereqs
Empirical Risk MinimizationBias-Variance TradeoffK-Nearest Neighbors
L3Tier 2~60m

Deep Learning for Time Series

RNN/LSTM, Temporal Convolutional Networks, and Transformers (PatchTST, Informer, Autoformer) for time series forecasting, with N-BEATS basis decomposition and the Zeng et al. linear-baseline controversy.

Prereqs
Time Series FoundationsState Space ModelsRecurrent Neural Networks+1 more
L3Tier 2~45m

DeepONet

DeepONet (Lu, Karniadakis et al., 2021) approximates nonlinear operators between function spaces by splitting a network into a branch (encoding the input function at fixed sensors) and a trunk (encoding query coordinates), then taking an inner product. The architecture is the practical realization of Chen and Chen's 1995 universal approximation theorem for operators.

Prereqs
Spectral Theory of Operatorsnavier stokes for mlFourier Neural Operator
L2Tier 2~50m

Dimensionality Reduction Theory

Why and how to reduce dimensions: the curse of dimensionality, PCA, random projections (JL lemma), t-SNE, UMAP, and when each method preserves the structure you care about.

Prereqs
Principal Component AnalysisEigenvalues and EigenvectorsMeasure Concentration and Geometric Functional Analysis
L2Tier 2~40m

Elastic Net

Combining L1 and L2 penalties: elastic net gets sparsity from lasso and grouping stability from ridge, solving the failure mode where lasso arbitrarily selects among correlated features.

Prereqs
Ridge RegressionLasso Regression
L2Tier 2~55m

Ensemble Methods Theory

Why combining multiple models outperforms any single model: bias-variance decomposition for ensembles, diversity conditions, and the theoretical foundations of bagging, boosting, and stacking.

Prereqs
BaggingGradient Boosting
L3Tier 2~50m

EM Algorithm Variants

Variants of EM for when the standard algorithm is intractable: Monte Carlo EM, Variational EM, Stochastic EM, and ECM. Connection to VAEs as amortized variational EM.

Prereqs
The EM Algorithm
L3Tier 2~45m

Fourier Neural Operator

The Fourier Neural Operator (Li, Kovachki, Anandkumar et al., ICLR 2021) parameterizes the kernel of an integral operator directly in Fourier space, giving a resolution-invariant architecture for learning maps between function spaces. Canonical baseline for data-driven PDE solvers and the architectural backbone of FourCastNet weather prediction.

Prereqs
Fast Fourier Transformnavier stokes for mlSpectral Theory of Operators+1 more
L2Tier 2~60m

Gaussian Mixture Models and EM

GMMs as soft clustering with per-component Gaussians: EM derivation (E-step responsibilities, M-step parameter updates), convergence guarantees, model selection with BIC/AIC, and the connection to k-means as the hard-assignment limit.

Prereqs
K-Means ClusteringThe EM AlgorithmMaximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency
L3Tier 2~55m

Gaussian Process Regression

Inference with Gaussian processes: the prior-to-posterior update in closed form, the role of kernel choice, marginal likelihood for hyperparameter selection, sparse approximations for scalability, and the connection to Bayesian optimization.

Prereqs
Kernels and Reproducing Kernel Hilbert SpacesJoint, Marginal, and Conditional DistributionsRidge Regression+5 more
L2Tier 2~50m

Generalized Additive Models

GAMs model the response as a sum of smooth univariate functions, one per predictor. Interpretable nonlinear regression with backfitting, P-splines, and partial effect plots.

Prereqs
Linear RegressionMARS (Multivariate Adaptive Regression Splines)
L3Tier 2~60m

Generative Adversarial Networks

The minimax game between generator and discriminator: Nash equilibrium at the data distribution, mode collapse, the Wasserstein distance fix, StyleGAN, and why diffusion models have largely replaced GANs for image generation.

Prereqs
Feedforward Networks and Backpropagation
L3Tier 2~55m

Graph Neural Networks

Message passing on graphs: GCN, GAT, GraphSAGE, the WL isomorphism test as an expressivity ceiling, over-smoothing in deep GNNs, and applications to molecules, social networks, and knowledge graphs.

Prereqs
Convolutional Neural NetworksEigenvalues and EigenvectorsClustering for Gene Expression+1 more
L2Tier 2~30m

Hyperbolic Embeddings for Graphs

Negatively curved space embeds tree-like graphs with exponentially less distortion than Euclidean space at the same dimension. The Poincaré ball model, Möbius gyrovector arithmetic, and Sala et al.'s sharp distortion bound make this concrete; Ganea et al. lift the construction to neural network layers.

Prereqs
Non-Euclidean and Hyperbolic GeometryMetric Spaces, Convergence, and Completeness
L1Tier 2~40m

K-Nearest Neighbors

Classify by majority vote of the k closest training points: no training phase, universal consistency as n and k grow, and the curse of dimensionality that makes distance meaningless in high dimensions.

Prereqs
Common Probability DistributionsOrder Statistics
L3Tier 2~55m

Meta-Learning

Learning to learn: find model initializations or embedding spaces that enable fast adaptation to new tasks from few examples. MAML, prototypical networks, and the connection to few-shot learning and in-context learning in LLMs.

Prereqs
Feedforward Networks and BackpropagationTest-Time Training and Adaptive Inference
L1Tier 2~40m

Multi-Class and Multi-Label Classification

Multi-class (exactly one label, softmax + cross-entropy) vs multi-label (multiple labels, sigmoid + binary cross-entropy). One-vs-rest, one-vs-one, hierarchical classification, and evaluation metrics.

Prereqs
Logistic Regression
L1Tier 2~35m

Naive Bayes

The simplest generative classifier: assume conditional independence of features given the class, estimate class-conditional densities, and classify via MAP. Why it works despite the wrong independence assumption.

Prereqs
Common Probability Distributions
L2Tier 2~55m

Natural Language Processing Foundations

The progression from bag-of-words to transformers: tokenization, language modeling, TF-IDF, sequence-to-sequence, attention, and why the pre-train then fine-tune paradigm replaced task-specific architectures.

Prereqs
Word Embeddings
L3Tier 2~55m

Object Detection and Segmentation

Localizing and classifying objects in images: two-stage (R-CNN), one-stage (YOLO, SSD), anchor-free (CenterNet, FCOS) detectors, semantic and instance segmentation, and the IoU/mAP evaluation framework.

Prereqs
Convolutional Neural Networkshough transform and circle detection
L3Tier 2~50m

Optimal Brain Surgeon and Pruning Theory

Principled weight pruning via second-order information: Optimal Brain Damage uses the Hessian diagonal, Optimal Brain Surgeon uses the full inverse Hessian, and both reveal why magnitude pruning is a crude but popular approximation.

Prereqs
The Hessian MatrixFeedforward Networks and BackpropagationIterative Magnitude Pruning and the Lottery Ticket Hypothesis
L2Tier 2~45m

PageRank Algorithm

PageRank as the stationary distribution of a random walk on a graph: damping factor, power iteration, eigenvector interpretation, and applications beyond web search.

Prereqs
Eigenvalues and EigenvectorsGraph Algorithms Essentials
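
A minimal power-iteration sketch, assuming `A` is a column-stochastic link matrix with dangling nodes already handled and `d` is the damping factor.

```python
import numpy as np

def pagerank(A, d=0.85, tol=1e-10, max_iter=1000):
    n = A.shape[0]
    r = np.full(n, 1.0 / n)                 # start from the uniform distribution
    teleport = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # random surfer: follow a link with prob d, teleport uniformly with prob 1-d
        r_next = d * (A @ r) + (1 - d) * teleport
        if np.abs(r_next - r).sum() < tol:
            break
        r = r_next
    return r_next / r_next.sum()            # stationary distribution of the walk
```
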
L1Tier 2~35m

Perceptron

Rosenblatt's perceptron (1958): the simplest neural network, the first learning algorithm with a convergence proof, and the lesson that linear separability is both powerful and limiting.

L3Tier 2~45m

Probability Flow ODE

Song et al. 2021: every diffusion SDE has a deterministic ODE that produces the same time-marginals. The deterministic dual of Anderson's reverse SDE; the basis of DDIM, DPM-Solver, EDM samplers, exact-likelihood computation, and the conceptual bridge to flow matching.

Prereqs
Stochastic Differential EquationsFokker–Planck EquationScore Matching+1 more
L2Tier 2~55m

Recommender Systems

User-item interaction modeling via matrix factorization, collaborative filtering, and content-based methods: the math of SVD-based recommendations, cold start, implicit feedback, and why evaluation is harder than building the model.

Prereqs
Eigenvalues and Eigenvectors
L3Tier 2~55m

Recurrent Neural Networks

Sequential processing via hidden state recurrence: the simple RNN, vanishing and exploding gradients, LSTM gating mechanisms, and why transformers have largely replaced RNNs.

Prereqs
Feedforward Networks and BackpropagationConvolutional Neural NetworksMacroeconomic Time-Series Forecasting
L3Tier 2~50m

Semantic Search and Embeddings

Dense vector representations for semantic similarity: bi-encoders, cross-encoders, approximate nearest neighbor search, cosine similarity geometry, and the RAG retrieval pipeline.

Prereqs
Word EmbeddingsInner Product Spaces and OrthogonalityInformation Retrieval Foundations+2 more
L2Tier 2~50m

Spectral Clustering

Clustering via the eigenvectors of a graph Laplacian: embed data using the bottom eigenvectors, then run k-means in the embedding space. Finds non-convex clusters that k-means alone cannot.

Prereqs
Eigenvalues and EigenvectorsK-Means ClusteringPageRank Algorithm
L3Tier 2~50m

Speech and Audio ML

Machine learning for audio: mel spectrograms as 2D representations, CTC loss for sequence alignment, Whisper for speech recognition, text-to-speech synthesis, and why continuous audio signals are harder than discrete text.

Prereqs
Signals and Systems for MLRecurrent Neural NetworksCNNs for Signal Feature Extraction
L2Tier 2~55m

Time Series Forecasting Basics

Stationarity, autocorrelation, AR, MA, ARIMA, exponential smoothing, modern architectures (PatchTST, iTransformer, TSMixer), and time-series foundation models (TimesFM, Chronos, Moirai).

Prereqs
Linear RegressionTime Series Foundations
L3Tier 2~45m

Transfer Learning

Pretrain on a large dataset, fine-tune on a smaller target: why lower layers learn transferable features, feature extraction vs fine-tuning, domain adaptation, negative transfer, and the foundation model paradigm.

Prereqs
Feedforward Networks and BackpropagationRepresentation Learning in Cosmology
L2Tier 2~45m

t-SNE and UMAP

Two dominant nonlinear dimensionality reduction methods: t-SNE preserves local neighborhoods via KL divergence with a Student-t kernel, UMAP uses fuzzy simplicial sets and cross-entropy. Both excel at visualization but have important limitations.

Prereqs
Principal Component AnalysisSelf-Organizing Maps
L2Tier 2~45m

Word Embeddings

Dense vector representations of words: Word2Vec (skip-gram, CBOW), negative sampling, GloVe, the distributional hypothesis, and why embeddings transformed NLP from sparse features to learned representations.

Prereqs
Logistic RegressionSingular Value DecompositionMaximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency+1 more
L2Tier 2~50m

XGBoost

XGBoost as second-order gradient boosting: Taylor expansion of the loss, regularized objective, optimal leaf weights, split gain formula, and the system optimizations that made it dominant on tabular data.

Prereqs
Gradient Boosting
L3Tier 3~55m

Bayesian Neural Networks

Place a prior over neural network weights and compute the posterior given data. Exact inference is intractable, so we approximate: variational inference, MC dropout, Laplace approximation, SWAG. Principled uncertainty, high cost, limited scaling evidence.

Prereqs
Bayesian EstimationFeedforward Networks and BackpropagationGaussian Processes for Machine Learning+1 more
L2Tier 3~50m

Boltzmann Machines and Hopfield Networks

Energy-based models for associative memory and generative learning: Hopfield networks store patterns via energy minimization, Boltzmann machines add stochasticity and hidden units, and RBMs enable tractable learning through contrastive divergence.

Prereqs
Common Probability Distributions
L2Tier 3~40m

Cubist and Model Trees

M5 model trees put linear regression at each leaf of a decision tree. Cubist extends this with rule simplification and smoothing. A useful hybrid of interpretability and predictive power.

Prereqs
Decision Trees and EnsemblesLinear Regression
L3Tier 3~55m

Energy-Based Models

A unifying framework for generative modeling: assign low energy to likely configurations via E(x), define probability through the Boltzmann distribution, and train without computing the intractable partition function.

Prereqs
Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic EfficiencyFeedforward Networks and BackpropagationNeural SDEs and the Diffusion Bridge+2 more
L2Tier 3~35m

Logspline Density Estimation

Model the log-density as a spline, then normalize to get a smooth, positive density estimate. Connection to exponential families, knot selection by BIC, and flexible nonparametric density estimation.

Prereqs
Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency
L2Tier 3~40m

MARS (Multivariate Adaptive Regression Splines)

MARS: automatically discover nonlinear relationships using piecewise linear hinge functions, forward-backward selection, and a direct connection to ReLU networks.

Prereqs
Linear Regression
L3Tier 3~45m

Mixture Density Networks

Neural networks that output the parameters of a mixture model instead of a single point prediction: handling multi-modal conditional distributions, the negative log-likelihood loss, and applications to inverse problems.

Prereqs
Gaussian Mixture Models and EMFeedforward Networks and Backpropagation
L2Tier 3~40m

NMF (Nonnegative Matrix Factorization)

Factor V into W*H with all entries nonnegative: parts-based additive representation, multiplicative update rules, and applications to topic modeling and image decomposition.

Prereqs
Eigenvalues and Eigenvectors
L3Tier 3~50m

Normalizing Flows

Generative models that transform a simple base distribution through invertible mappings, enabling exact log-likelihood computation via the change of variables formula.

Prereqs
Common Probability DistributionsThe Jacobian MatrixVariational Autoencoders
L3Tier 3~40m

Reservoir Computing and Echo State Networks

Fixed random recurrent networks with trained linear readouts: the echo state property, why random high-dimensional projections carry computational power, extreme learning machines, and connections to state-space models.

Prereqs
Recurrent Neural NetworksAutoencoders for Low-Dimensional Dynamical StructuresLyapunov-Based Machine Learning for Chaos+2 more
L2Tier 3~40m

Self-Organizing Maps

Kohonen networks: competitive learning on a grid that produces topology-preserving mappings from high-dimensional input to low-dimensional discrete maps.

L2Tier 3~45m

Wavelet Smoothing

Wavelet transforms decompose signals into localized frequency components. Thresholding wavelet coefficients denoises signals while adapting to local smoothness, achieving minimax-optimal rates over Besov spaces.

Prereqs
Signals and Systems for ML

Numerical Optimization

26 topics / L0A, L1, L2, L3 / Tiers 1-3

Newton, conjugate gradient, ADMM, coordinate descent, interior point, line search. The algorithmic toolbox.

L1Tier 1~50m

Conditioning and Condition Number

The condition number measures how sensitive a problem is to perturbations in its input. Ill-conditioned matrices turn small errors into catastrophic ones, and understanding conditioning is essential for any computation involving linear algebra.

Prereqs
Eigenvalues and EigenvectorsMatrix Operations and PropertiesMatrix Norms+3 more
L0ATier 1~40m

Floating-Point Arithmetic

How computers represent real numbers, why they get it wrong, and why ML uses float32, float16, bfloat16, and int8. IEEE 754, machine epsilon, overflow, underflow, and catastrophic cancellation.

L1Tier 1~35m

Log-Probability Computation

Working in log space prevents underflow when multiplying many small probabilities. The log-sum-exp trick provides a numerically stable way to compute log of a sum of exponentials, and it underlies stable softmax, log-likelihoods, and the forward algorithm for HMMs.

Prereqs
Softmax and Numerical Stability
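
A minimal sketch of the log-sum-exp trick described here: shift by the maximum before exponentiating so the largest term becomes exp(0) = 1 and nothing overflows. The example values are illustrative.

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x)))."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

# the naive computation overflows float64; the shifted version does not
x = np.array([1000.0, 1001.0, 1002.0])
print(logsumexp(x))                # ~1002.41, finite
print(np.log(np.sum(np.exp(x))))   # inf: exp(1000) overflows
```
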
L1Tier 1~50m

Newton's Method

The gold standard for fast local convergence: use second-order information (the Hessian) to take optimal quadratic steps. Quadratic convergence when it works, catastrophic failure when it doesn't.

Prereqs
Convex Optimization BasicsTaylor ExpansionThe Hessian Matrix
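
A minimal one-dimensional Newton sketch for root finding; the optimization version replaces f with the gradient and f' with the Hessian. The example function and starting point are illustrative.

```python
def newton(f, f_prime, x0, tol=1e-12, max_iter=50):
    """Newton's method: linearize f at x and jump to the root of the linearization."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# example: sqrt(2) as the root of x^2 - 2; convergence is quadratic near the root
print(newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0))  # 1.41421356...
```
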
L2Tier 1~55m

Proximal Gradient Methods

Optimize composite objectives by alternating gradient steps on smooth terms with proximal operators on nonsmooth terms. ISTA and its accelerated variant FISTA.

Prereqs
Convex Optimization BasicsQuasi-Newton MethodsSubgradients and Subdifferentials
L2Tier 1~60m

Quasi-Newton Methods

Approximate the Hessian instead of computing it: BFGS builds a dense approximation, L-BFGS stores only a few vectors. Full BFGS has classical local superlinear convergence; L-BFGS is excellent in practice but its fixed-memory theory is weaker and not generally superlinear.

Prereqs
Newton's MethodLine Search MethodsSecant Method
L1Tier 1~45m

Softmax and Numerical Stability

The softmax function maps arbitrary reals to a probability distribution. Getting it right numerically, by avoiding overflow and underflow, is the first lesson in writing ML code that actually works.
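
A minimal sketch of the standard max-subtraction fix; the shift changes nothing mathematically because softmax is invariant to adding a constant to every logit. The example logits are illustrative.

```python
import numpy as np

def softmax(z):
    """Stable softmax: subtract max(z) so the largest exponent is exp(0) = 1."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([1000.0, 1001.0, 1002.0])))  # no overflow: [0.090, 0.245, 0.665]
```
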

L1Tier 2~35m

Ascent Algorithms and Hill Climbing

Gradient ascent, hill climbing, and their failure modes: local optima, plateaus, and ridges. Random restarts and simulated annealing as strategies for escaping local optima.

Prereqs
Convex Optimization Basics
L2Tier 2~60m

Augmented Lagrangian and ADMM

The augmented Lagrangian fixes the conditioning problem of pure penalties by adding dual feedback. ADMM uses that idea to split structured convex problems into alternating subproblems with primal and dual residual diagnostics.

Prereqs
Convex Optimization BasicsConvex DualityNonlinear Gauss-Seidel+1 more
L3Tier 2~50m

Bayesian Optimization for Hyperparameters

Hyperparameter tuning as black-box optimization with a Gaussian process surrogate. Acquisition functions (EI, UCB, PI), sample efficiency for expensive evaluations, TPE as an alternative, and why grid search is wasteful.

Prereqs
Gaussian Process RegressionGaussian Processes for Machine Learning
L2Tier 2~60m

Conjugate Gradient Methods

Conjugate gradient solves large symmetric positive definite linear systems using matrix-vector products, Krylov subspaces, A-conjugate directions, condition-number-dependent rates, and preconditioning.

Prereqs
Line Search MethodsMatrix Operations and PropertiesMatrix Norms+1 more
L2Tier 2~45m

Coordinate Descent

Optimize by updating one coordinate (or block) at a time while holding others fixed. The default solver for Lasso because each coordinate update has a closed-form solution.

Prereqs
Convex Optimization BasicsMirror Descent and Frank-WolfeProximal Gradient Methods
L2Tier 2~50m

Line Search Methods

Line search chooses a step size along a descent direction. Armijo guarantees enough decrease, Wolfe prevents uselessly tiny steps, and backtracking gives a practical step-size controller without knowing the Lipschitz constant.

Prereqs
Convex Optimization BasicsDifferentiation in RⁿNewton's Method
L3Tier 2~55m

Mirror Descent and Frank-Wolfe

Mirror descent generalizes gradient descent via Bregman divergences, recovering multiplicative weights and exponentiated gradient as special cases. Frank-Wolfe replaces projections with linear minimization, making it projection-free.

Prereqs
Convex Optimization BasicsConvex DualityOnline Convex Optimization+1 more
L1Tier 2~55m

Numerical Linear Algebra

Algorithms for solving linear systems, computing eigenvalues, and factoring matrices. Every linear regression, PCA, and SVD computation depends on these methods.

Prereqs
Eigenvalues and EigenvectorsMatrix Operations and Properties
L3Tier 2~55m

Online Convex Optimization

A general framework for sequential decision-making with convex losses: online gradient descent, follow the regularized leader, adaptive methods, and the square-root-T regret guarantee that unifies many algorithms.

Prereqs
Convex Optimization BasicsNo-Regret Learning
L2Tier 2~40m

Projected Gradient Descent

Constrained convex optimization by alternating gradient steps with projections onto the feasible set. Same convergence rates as unconstrained gradient descent when projections are cheap.

Prereqs
Convex Optimization Basics
L1Tier 2~30m

Secant Method

A derivative-free root-finding method that approximates Newton's method using two previous function evaluations, achieving superlinear convergence of order approximately 1.618.

Prereqs
Newton's Method
L3Tier 2~55m

Second-Order Optimization Methods

Newton's method, Gauss-Newton, natural gradient, and K-FAC: how curvature information accelerates convergence, why the Hessian is too expensive to compute at scale, and Hessian-free alternatives that use Hessian-vector products.

Prereqs
Newton's MethodThe Hessian MatrixConjugate Gradient Methods+3 more
L2Tier 2~55m

Trust Region Methods

Trust region methods minimize a local model only inside a region where that model is trusted, then grow or shrink the region using actual versus predicted decrease. They are robust Newton-style methods for difficult non-convex landscapes.

Prereqs
Newton's MethodThe Hessian MatrixLine Search Methods+2 more
L2Tier 2~35m

Whitening and Decorrelation

Transform data to have identity covariance, removing correlations and normalizing scales. ZCA and PCA whitening, why whitening helps optimization, and connections to batch normalization.

Prereqs
Eigenvalues and EigenvectorsPrincipal Component AnalysisFloating-Point Arithmetic
L3Tier 3~50m

Interior Point Methods

Barrier functions transform constrained optimization into unconstrained problems. Newton steps on the barrier objective trace the central path to the constrained optimum with polynomial convergence.

Prereqs
Convex Optimization BasicsNewton's MethodAugmented Lagrangian and ADMM+1 more
L3Tier 3~45m

Nonlinear Gauss-Seidel

Block coordinate optimization with latest-value updates. Nonlinear Gauss-Seidel explains alternating minimization, EM-style sweeps, and why block coupling decides whether cycling, stalling, or convergence happens.

Prereqs
Coordinate DescentNewton's Method
L3Tier 3~50m

Submodular Optimization

Submodular functions exhibit diminishing returns. The greedy algorithm achieves a (1-1/e) approximation for monotone submodular maximization under cardinality constraints, with applications in feature selection, sensor placement, and data summarization.

Prereqs
Greedy Algorithms
L2Tier 3~35m

Tabu Search

Local search with memory: maintain a list of recently visited solutions to prevent cycling, use aspiration criteria to override the tabu when a move leads to a new best, and balance intensification against diversification.

Prereqs
Greedy AlgorithmsAscent Algorithms and Hill Climbing
L1Tier 3~25m

Winsorization

Clip extreme values to a fixed percentile instead of removing them. Preserves sample size, reduces outlier sensitivity, and improves stability of downstream estimators.

Prereqs
Order StatisticsCommon Probability Distributions
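
A minimal winsorization sketch with NumPy (the percentile choices and data are illustrative): compute the chosen percentiles and clip to them rather than dropping the offending points.

```python
import numpy as np

def winsorize(x, lower_pct=5, upper_pct=95):
    """Clip values below/above the given percentiles instead of removing them."""
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)

x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0])   # one extreme outlier
print(winsorize(x).mean(), x.mean())   # winsorized mean is far less outlier-sensitive
```
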

Sampling & MCMC

20 topics / L1, L2, L3 / Tiers 1-3

Gibbs, Metropolis-Hastings, HMC, importance sampling, coupling arguments, perfect sampling.

L2Tier 1~60m

Gibbs Sampling

The classical conditional-update MCMC algorithm: sample each variable from its full conditional, accept every proposal, and then pay close attention to mixing, blocking, and hierarchical parameterization.

Prereqs
Metropolis-Hastings AlgorithmMarkov Chain Monte Carlo
L2Tier 1~50m

Importance Sampling

Estimate expectations under one distribution by sampling from another and reweighting: a technique that is powerful when done right and catastrophically unreliable when done wrong.

Prereqs
Common Probability DistributionsMonte Carlo MethodsNumber Theory and Machine Learning+3 more
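
A minimal importance-sampling sketch: estimate an expectation under a target p using draws from a proposal q, reweighting by p/q. The specific target, proposal, and integrand here are illustrative; the estimator degrades badly whenever q is thin in p's tails.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# target p = N(0, 1), proposal q = N(0, 2^2), integrand f(x) = x^2 (true answer: 1.0)
x = rng.normal(0.0, 2.0, size=100_000)                  # draws from q
w = stats.norm.pdf(x, 0, 1) / stats.norm.pdf(x, 0, 2)   # importance weights p/q
estimate = np.mean(w * x**2)
print(estimate)   # close to E_p[X^2] = 1
```
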
L2Tier 1~65m

Markov Chain Monte Carlo

When you can evaluate a target up to a normalizing constant but cannot sample from it, build a Markov chain whose stationary distribution is the target. Detailed balance, ergodicity, mixing time, autocorrelation, and the variance penalty for non-iid samples.

Prereqs
Markov Chains and Steady StateMonte Carlo MethodsStochastic Processes for ML
L2Tier 1~70m

Metropolis-Hastings Algorithm

The foundational MCMC algorithm: build a Markov chain with the right stationary distribution by accepting or rejecting proposals according to a carefully balanced ratio. The real story is not only correctness, but also proposal geometry, diagnostics, and where plain MH becomes painfully slow.

Prereqs
Common Probability DistributionsMarkov Chain Monte CarloMarkov Chains and Steady State+1 more
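
A minimal random-walk Metropolis sketch, assuming only an unnormalized `log_target` callable (an assumption of this sketch) and a Gaussian proposal; the step size and chain length are illustrative and untuned.

```python
import numpy as np

def random_walk_metropolis(log_target, x0, n_samples=10_000, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_samples, x.size))
    log_p = log_target(x)
    for i in range(n_samples):
        proposal = x + step * rng.standard_normal(x.size)   # symmetric proposal
        log_p_prop = log_target(proposal)
        # accept with probability min(1, target(proposal) / target(x))
        if np.log(rng.random()) < log_p_prop - log_p:
            x, log_p = proposal, log_p_prop
        samples[i] = x
    return samples

# example: sample a standard normal from its unnormalized log-density
draws = random_walk_metropolis(lambda x: -0.5 * np.sum(x * x), x0=[0.0])
```
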
L2Tier 1~55m

Monte Carlo Methods

Approximate expectations by sampling: the Monte Carlo estimator, its variance, the $1/\sqrt{N}$ convergence rate of its error, and the variance-reduction tricks that make practical Bayesian inference, REINFORCE, and ELBO estimation work.

Prereqs
Expectation, Variance, Covariance, and MomentsLaw of Large NumbersCentral Limit Theorem
L2Tier 2~45m

Burn-in and Convergence Diagnostics

Burn-in is only the first filter. Modern MCMC trust comes from split rank-normalized R-hat, bulk and tail ESS, trace behavior, and sampler-specific warnings like divergences.

Prereqs
Metropolis-Hastings AlgorithmMarkov Chains and Steady StateCoupling Arguments and Mixing Time+7 more
L3Tier 2~60m

Hamiltonian Monte Carlo

HMC uses gradients and Hamiltonian dynamics to propose large moves that still land in high-density regions. The real practical story includes leapfrog geometry, warmup, divergences, and when centered hierarchical models break the sampler.

Prereqs
Metropolis-Hastings AlgorithmMarkov Chain Monte CarloGibbs Sampling+2 more
L3Tier 2~50m

Langevin Dynamics

The overdamped Langevin SDE: gradient descent plus calibrated Gaussian noise. The mathematical object behind SGLD, ULA, and energy-based MCMC samplers, and the simplest sampler with provable polynomial-time convergence on log-concave targets.

Prereqs
Stochastic Differential EquationsFokker–Planck EquationHamiltonian Monte Carlo+4 more
L3Tier 2~60m

No-U-Turn Sampler and Neal's Funnel

NUTS removes HMC's hand-tuned trajectory length, but Neal's funnel shows why geometry still dominates. Centered hierarchical models create narrow necks, divergences, and misleading convergence unless you reparameterize.

Prereqs
Hamiltonian Monte CarloBayesian EstimationBurn-in and Convergence Diagnostics+1 more
L2Tier 2~40m

Rao-Blackwellization

The Rao-Blackwell theorem: conditioning an estimator on a sufficient statistic reduces variance without increasing bias. In MCMC, this means replacing sample averages with conditional expectations for lower-variance estimates at no extra sampling cost.

Prereqs
Sufficient Statistics and Exponential FamiliesImportance Sampling
L1Tier 2~40m

Rejection Sampling

The simplest exact sampling method: propose from an envelope distribution and accept or reject to produce exact independent draws from a target, but it is doomed to fail in high dimensions.

Prereqs
Monte Carlo Methods
L2Tier 2~50m

Variance Reduction Techniques

Get the same accuracy with fewer samples by exploiting correlation, known quantities, and stratification. Antithetic variates, control variates, stratification, and Rao-Blackwellization.

Prereqs
Importance Sampling
L2Tier 3~40m

Adaptive Rejection Sampling

For log-concave densities, build a piecewise linear upper hull of log f(x) that tightens automatically with each evaluation. Sample from the envelope and accept/reject.

Prereqs
Rejection SamplingGriddy Gibbs SamplingSqueezed Rejection Sampling
L3Tier 3~50m

Coupling Arguments and Mixing Time

Coupling constructs two Markov chains on the same probability space so they eventually meet, bounding total variation distance and mixing time. Spectral gap and coupling inequality are the main tools for proving how fast MCMC converges to stationarity.

Prereqs
Metropolis-Hastings AlgorithmMartingale TheoryTotal Variation Distance
L2Tier 3~30m

Griddy Gibbs Sampling

When Gibbs sampling encounters a non-conjugate full conditional, approximate it on a grid and sample from the piecewise constant or piecewise linear approximation. Simple, effective for smooth univariate conditionals.

Prereqs
Gibbs Sampling
L3Tier 3~45m

MCMC for Markov Random Fields

Gibbs sampling on undirected graphical models. The joint distribution factorizes over cliques, each variable is resampled from its Markov blanket, and the real practical story is local conditional updates versus long-range dependence, critical slowing, and exact-sampling alternatives.

Prereqs
Gibbs SamplingPerfect Sampling
L3Tier 3~45m

Perfect Sampling

Coupling from the past (Propp-Wilson): run Markov chains backward in time until all starting states coalesce. The result is an exact sample from the stationary distribution with no burn-in approximation, but often at a serious computational price.

Prereqs
Metropolis-Hastings AlgorithmGibbs Sampling
L3Tier 3~45m

Reversible Jump MCMC

MCMC for model selection: propose moves that change the number of parameters, maintain detailed balance across dimensions via Jacobian corrections, and sample over model space and parameter space simultaneously.

Prereqs
Metropolis-Hastings Algorithm
L2Tier 3~35m

Slice Sampling

Slice sampling draws from a target distribution by uniformly sampling from the region under its density curve. It introduces an auxiliary variable to avoid tuning proposal distributions, unlike random-walk Metropolis-Hastings.

Prereqs
Metropolis-Hastings Algorithm
L2Tier 3~30m

Squeezed Rejection Sampling

An optimization of rejection sampling that adds a cheap lower bound (squeeze function) to avoid expensive target density evaluations when the sample clearly falls in the accept or reject region.

Prereqs
Rejection Sampling

Decision Theory & Game Theory

2 topics / L2 / Tier 1

Expected utility, Nash equilibrium, minimax, auction theory, mechanism design, Arrow impossibility.

Algorithms & Computation

10 topics / L0A, L1, L2, L3 / Tiers 1-3

Sorting, graph algorithms, dynamic programming, FFT, matrix multiplication, P vs NP.

L0ATier 1~50m

Dynamic Programming

Solve complex optimization problems by decomposing them into overlapping subproblems with optimal substructure. The algorithmic backbone of sequence models, control theory, and reinforcement learning.
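An illustrative instance of both properties at once: Levenshtein edit distance, where the table entry dp[i][j] depends only on three smaller subproblems.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by bottom-up DP: dp[i][j] = distance between a[:i] and b[:j]."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                            # delete everything
    for j in range(n + 1):
        dp[0][j] = j                            # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete a[i-1]
                           dp[i][j - 1] + 1,        # insert b[j-1]
                           dp[i - 1][j - 1] + cost) # match or substitute
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```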

Prereqs
Sets, Functions, and RelationsGraph Algorithms EssentialsGreedy Algorithms
L2Tier 1~45m

Information Retrieval Foundations

Search as a first-class capability. TF-IDF, BM25, inverted indexes, precision/recall, reranking, and why retrieval is not just vector DB plus embeddings.

Prereqs
Common Probability DistributionsBasic Logic and Proof Techniquesfuzzy matching and record linkage
L1Tier 2~50m

Fast Fourier Transform

The Cooley-Tukey FFT reduces the discrete Fourier transform from O(n²) to O(n log n), enabling efficient convolution, spectral methods, and Fourier features for kernel approximation.
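A minimal recursive radix-2 Cooley-Tukey sketch (input length a power of two); production implementations are iterative and heavily optimized.

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey: split into even/odd halves, recurse, recombine with twiddle factors."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor times the odd-index term
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

print([round(abs(v), 3) for v in fft([1, 1, 1, 1, 0, 0, 0, 0])])
```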

Prereqs
Exponential Function PropertiesComplex Numbers for Fourier
L0ATier 2~45m

Graph Algorithms Essentials

The graph algorithms every ML practitioner needs: BFS, DFS, Dijkstra, MST, and topological sort. Why they matter for computational graphs, knowledge graphs, dependency resolution, and GNNs.
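One of the five in a few lines: Kahn's topological sort, which gives the order in which a computational graph's nodes can be evaluated; the toy adjacency dict is illustrative.

```python
from collections import deque

def topological_sort(adj):
    """Kahn's algorithm: repeatedly emit a node whose remaining in-degree is zero."""
    indeg = {u: 0 for u in adj}
    for u in adj:
        for v in adj[u]:
            indeg[v] += 1
    queue = deque(u for u in adj if indeg[u] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(adj):
        raise ValueError("graph has a cycle")
    return order

# toy computational graph: two inputs feed a matmul, which feeds an activation
print(topological_sort({"x": ["matmul"], "W": ["matmul"], "matmul": ["relu"], "relu": []}))
```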

Prereqs
Sets, Functions, and Relations
L0ATier 2~40m

Greedy Algorithms

The greedy paradigm: make the locally optimal choice at each step and never look back. When matroid structure or the exchange argument guarantees global optimality.

L0ATier 2~40m

Knapsack Problem

The canonical constrained optimization problem: 0/1 knapsack (NP-hard, pseudo-polynomial DP), fractional knapsack (greedy), FPTAS, and connections to Lagrangian relaxation in ML.
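The pseudo-polynomial DP in a few lines; the item values, weights, and capacity are illustrative.

```python
def knapsack_01(values, weights, capacity):
    """0/1 knapsack: best[c] = max value achievable within capacity c. O(n * capacity) time."""
    best = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        for c in range(capacity, w - 1, -1):   # iterate capacity downward so each item is used once
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

print(knapsack_01(values=[60, 100, 120], weights=[10, 20, 30], capacity=50))  # 220
```

The O(n * capacity) bound is polynomial in the numeric value of the capacity but exponential in its bit length, which is exactly what pseudo-polynomial means.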

Prereqs
Dynamic ProgrammingGreedy Algorithms
L1Tier 2~50m

Matrix Multiplication Algorithms

From naive cubic algorithms to Strassen's subcubic breakthrough to the open question of the true matrix multiplication exponent. What we know, what we do not, and why it matters for ML.

Prereqs
Vectors, Matrices, and Linear Maps
L0ATier 2~35m

Sorting Algorithms

Comparison-based sorting lower bound, quicksort, mergesort, heapsort, and non-comparison sorts. The foundational algorithms behind efficient data processing and search.

L3Tier 2~55m

Unsolved Problems in Computer Science

P vs NP, the matrix multiplication exponent, one-way functions, the unique games conjecture, BPP vs P, natural proofs, and circuit lower bounds. The problems that define the limits of computation.

Prereqs
p vs npOpen Problems in Matrix Computation
L3Tier 3~45m

Open Problems in Matrix Computation

The unsolved questions in numerical linear algebra: the true exponent of matrix multiplication, practical fast algorithms, sparse matrix multiplication, randomized methods, and why these matter for scaling ML.

Prereqs
Matrix Multiplication AlgorithmsEigenvalues and Eigenvectors

Model Timeline & Labs

12 topics / L4, L5 / Tiers 1-3

GPT, Claude, Gemini, Llama, DeepSeek. Historical lineage and the 2020s lab landscape.

L5Tier 1~50m

DeepSeek Models

DeepSeek's model family: MoE architectures with Multi-head Latent Attention, fine-grained expert routing, RL-trained reasoning in DeepSeek-R1, the V3.1/V3.2 hybrid reasoning line, and the V4 Preview 1M-context release.

Prereqs
Transformer ArchitectureMixture of Experts
L5Tier 2~60m

AI Labs Landscape

Factual reference on the major AI research labs and companies: what they build, key technical contributions, and research focus areas. Current landscape as of April 2026.

Prereqs
Model Timelinekey researchers and ideas
L5Tier 2~40m

Claude Model Family

Anthropic's Claude series from Claude 1 through Opus 4.7, Sonnet 4.6, Haiku 4.5, and Mythos Preview: Constitutional AI, extended thinking, computer use, long context, tiering, and safety governance.

Prereqs
Transformer Architecture
L4Tier 2~25m

Cohere Models

The Cohere model family: Command R and R+ for retrieval-augmented enterprise generation, Embed and Rerank for two-stage retrieval, and the open-weight Aya multilingual line from Cohere For AI. Cohere announced an all-stock merger with German lab Aleph Alpha on April 24, 2026, taking the Schwarz Group as lead investor and positioning the combined company as a sovereign-AI alternative to dominant US providers.

Prereqs
Transformer ArchitectureTokenization and Information TheoryInformation Retrieval Foundations
L5Tier 2~45m

Gemini and Google Models

Google's model lineage from PaLM through Gemini 3.1: native multimodality, long context, TPU infrastructure, Gemini API preview churn, and the Gemma open-weight branch through Gemma 4.

Prereqs
Transformer Architecture
L5Tier 2~55m

GPT Series Evolution

The progression from GPT-1 (117M) through GPT-4, the o-series, GPT-5, GPT-5.4, gpt-oss, and GPT-Rosalind: what changed through scale, post-training, reasoning-time compute, tool use, and domain specialization.

Prereqs
Transformer ArchitectureAttention Mechanism TheoryScaling Laws+5 more
L4Tier 2~18m

Ineffable Intelligence

British AI lab founded by David Silver (UCL professor, ex-DeepMind reinforcement-learning lead, AlphaGo / AlphaZero / AlphaProof). Announced a $1.1 billion seed round at $5.1 billion valuation on April 27, 2026 — co-led by Sequoia and Lightspeed, with Nvidia, Google, DST Global, Index Ventures, and the UK Sovereign AI Fund participating. Stated mission: build a 'superlearner' that acquires knowledge from its own experience rather than from human-generated data, instantiating the 'Era of Experience' research agenda.

Prereqs
Reinforcement Learning from Human FeedbackAI Labs Landscape
L5Tier 2~50m

LLaMA and Open Weight Models

The open weight movement in large language models: LLaMA 1/2/3, the ecosystem of fine-tuning and quantization tools, and why open weights changed the dynamics of AI research.

Prereqs
Transformer ArchitectureToken Prediction and Language ModelingScaling Laws+7 more
L4Tier 2~30m

Mistral Models

The Mistral AI model family: Mistral 7B with sliding-window attention, the Mixtral 8x7B and 8x22B sparse mixture-of-experts releases, the dense Mistral Large/Nemo line, and the specialist Codestral, Pixtral, and Ministral variants.

Prereqs
Transformer ArchitectureMixture of ExpertsAttention Mechanism Theory+2 more
L5Tier 2~50m

Model Comparison Table

Structured comparison of major LLM families as of April 2026: architecture, parameters, context length, open weights, and key strengths, with discussion of what comparison tables cannot tell you.

Prereqs
Transformer ArchitectureClaude Model FamilyDeepSeek Models+3 more
L5Tier 2~70m

Model Timeline

A structured factual timeline of major language and multimodal models from GPT-2 through the current frontier, with parameter counts, key innovations, and the ideas that defined each era.

Prereqs
History of Artificial Intelligencekey researchers and ideas
L5Tier 3~35m

Qwen and Chinese Models

The Chinese open-weight model ecosystem: Qwen (Alibaba), Yi (01.AI), Baichuan, GLM (Zhipu AI), and Kimi (Moonshot AI), with a focus on multilingual capability and independent scaling.

Prereqs
Transformer Architecture

Connections & Applications

9 topics / L2, L3, L4 / Tiers 1-3

Cross-domain topics: SLAM, cryptography, signal detection, Kalman filters. Not core curriculum, but worth knowing.

L2Tier 1~55m

Kalman Filter

Optimal state estimation for linear Gaussian systems via recursive prediction and update steps using the Kalman gain.
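A minimal predict/update cycle; the constant-velocity toy model below (the matrices F, Q, H, R) is an illustrative choice.

```python
import numpy as np

def kalman_step(x, P, z, F, Q, H, R):
    """One Kalman filter cycle for x_t = F x_{t-1} + noise(Q), z_t = H x_t + noise(R)."""
    x_pred = F @ x                            # predict
    P_pred = F @ P @ F.T + Q
    y = z - H @ x_pred                        # innovation
    S = H @ P_pred @ H.T + R                  # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x_pred + K @ y                    # update
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# toy 1-D constant-velocity model, observing position only
F = np.array([[1.0, 1.0], [0.0, 1.0]]); Q = 0.01 * np.eye(2)
H = np.array([[1.0, 0.0]]);             R = np.array([[0.5]])
x, P = np.zeros(2), np.eye(2)
for t in range(1, 6):
    x, P = kalman_step(x, P, z=np.array([float(t)]), F=F, Q=Q, H=H, R=R)
print(x)   # position near 5, velocity near 1
```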

Prereqs
Common Probability DistributionsEigenvalues and Eigenvectors
L2Tier 2~45m

Bayesian State Estimation

The filtering problem: recursively estimate a hidden state from noisy observations using predict-update cycles. Kalman filter for linear Gaussian systems, particle filters for the general case.

Prereqs
Bayesian EstimationCommon Probability DistributionsGaussian Processes in Astronomy+2 more
L3Tier 2~55m

GraphSLAM and Factor Graphs

SLAM as graph optimization: poses as nodes, constraints as edges, factor graph representation, MAP estimation via nonlinear least squares, and the sparsity structure that makes large-scale mapping tractable.

L2Tier 2~65m

State Space Models

Linear state space form, the Kalman filter and RTS smoother, EM for parameter learning, and the ARIMA equivalence. The unifying framework behind classical filtering and modern Mamba/S4.

Prereqs
Time Series FoundationsMarkov Chains and Steady StateKalman Filter+1 more
L2Tier 2~70m

Time Series Foundations

Rigorous treatment of stationarity, the Wold decomposition, autocorrelation, unit roots, AR/MA/ARMA/ARIMA models, and spectral representation. The classical theory that every modern sequence model rests on.

Prereqs
Kolmogorov Probability AxiomsExpectation, Variance, Covariance, and MomentsStochastic Processes for ML
L4Tier 3~50m

Active SLAM and POMDPs

Choosing robot actions to simultaneously map an environment and localize, formulated as a partially observable Markov decision process over belief states.

Prereqs
GraphSLAM and Factor GraphsMarkov Decision ProcessesVisual and Semantic SLAM
L4Tier 3~60m

Number Theory and Machine Learning

The emerging two-way street between number theory and machine learning: how number-theoretic tools improve ML systems, and how ML is discovering new mathematical structure in classical problems.

Prereqs
Common Probability DistributionsLaw of Large NumbersDifferential Privacy+1 more
L3Tier 3~50m

Particle Filters

Sequential Monte Carlo: represent the posterior over hidden states as a set of weighted particles, propagate through dynamics, reweight by likelihood, and resample to combat degeneracy.
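A bootstrap (sequential importance resampling) sketch; the transition, likelihood, and init callables and the random-walk toy model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_filter(observations, n_particles, transition, likelihood, init):
    """Propagate particles, reweight by the observation likelihood, then resample."""
    particles = init(n_particles)
    filtered_means = []
    for y in observations:
        particles = transition(particles)                      # propagate through the dynamics
        w = likelihood(y, particles)                           # reweight by p(y | x)
        w /= w.sum()
        filtered_means.append(np.sum(w * particles))           # posterior mean of the hidden state
        idx = rng.choice(n_particles, size=n_particles, p=w)   # multinomial resampling vs degeneracy
        particles = particles[idx]
    return np.array(filtered_means)

# toy model: 1-D Gaussian random walk observed with noise of scale 0.5
ys = np.cumsum(rng.normal(size=50)) + rng.normal(scale=0.5, size=50)
means = bootstrap_filter(
    ys, 2_000,
    transition=lambda x: x + rng.normal(size=x.shape),
    likelihood=lambda y, x: np.exp(-0.5 * ((y - x) / 0.5) ** 2),
    init=lambda n: rng.normal(size=n),
)
```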

Prereqs
Metropolis-Hastings AlgorithmImportance SamplingGraphSLAM and Factor Graphs+2 more
L4Tier 3~50m

Visual and Semantic SLAM

Replacing laser range finders with cameras for SLAM, and enriching maps with semantic labels to improve data association and planning.

Prereqs
GraphSLAM and Factor Graphs

Other topics

Additional topics not yet assigned to a specific module.

L4Tier 2applied-ml

Causal Inference for Policy Evaluation

Quasi-experimental methods for recovering policy effects without randomization. Difference-in-differences identifies the average treatment effect on the treated under parallel trends; regression discontinuity identifies a local average treatment effect under continuity at the cutoff; instrumental variables identifies a local average treatment effect for compliers under monotonicity (Imbens-Angrist 1994). Synthetic control and double/debiased ML extend these designs to single-unit and high-dimensional settings.

Prereqs
Causal Inference BasicsCausal Inference and the Ladder of CausationHypothesis Testing for ML
L2Tier 1applied-statistics

Non-Probability Sampling

Convenience and opt-in samples do not give probability-of-inclusion guarantees. The data-defect identity (Meng 2018) shows why a massive convenience sample can produce a confidently wrong answer. Repair methods: calibration, sampling-score weighting, mass imputation, doubly robust integration with a probability sample, and sensitivity analysis.

Prereqs
Expectation, Variance, Covariance, and MomentsLaw of Large NumbersCentral Limit Theorem+1 more
L3Tier 1causal-semiparametric

Double/Debiased Machine Learning

A general recipe for plugging flexible ML estimators into causal and structural estimands while recovering root-n rate and asymptotic normality. Cross-fitting plus Neyman-orthogonal moments converts slow nuisance rates into honest confidence intervals for a low-dimensional parameter of interest.

Prereqs
Asymptotic Statistics: M-Estimators, Delta Method, LANMaximum Likelihood Estimation: Theory, Information Identity, and Asymptotic EfficiencyCross-Validation Theory+3 more
L4Tier 1formal-verification

AlphaProof and AI-Assisted Theorem Proving

AlphaProof (DeepMind, July 2024) reached IMO silver-medal standard by autoformalizing problems into Lean 4 and running a reinforcement-learning theorem prover; together with AlphaGeometry 2, which solved the IMO 2024 geometry problem (Problem 4), it solved 4 of the 6 problems. The state of AI-assisted formal mathematics in 2026: autoformalization, retrieval-augmented theorem proving, and the LLM + Lean kernel pipeline.

Prereqs
theorem proving in leanIneffable Intelligence
L0BTier 2infrastructure

WebGPU for Machine Learning

WebGPU gives the browser an explicit GPU compute model: devices, queues, buffers, and WGSL kernels. That is the missing substrate for serious in-browser inference, custom kernels, and eventually browser-native training systems.

Prereqs
Computer Architecture for MLAutomatic Differentiation
L3Tier 2learning-theory

Adaptive Learning Is Not IID

Why diagnostic systems that choose the next question from previous answers need sequential-probability assumptions, not iid sampling assumptions.

Prereqs
Random VariablesRadon-Nikodym and Conditional ExpectationMartingale Theory+1 more
L3Tier 2optimization

SGD as a Stochastic Differential Equation

The continuous-time SDE limit of mini-batch SGD. Order-1 weak approximation (Li-Tai-E), Mandt-Hoffman-Blei stationary distribution, Bayesian interpretation, the linear scaling rule for batch size, and the modified-equation correction that exposes SGD's implicit gradient-norm regularizer.

Prereqs
Stochastic Differential EquationsStochastic Gradient Descent ConvergenceFokker–Planck Equation+1 more
L2Tier 1predictive-uncertainty

Split Conformal Prediction

A distribution-free, model-agnostic procedure that converts any point predictor into a prediction set with finite-sample marginal coverage. The only assumption is exchangeability. The proof is five lines.
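A sketch of the whole procedure; the fit interface (a training routine that returns a predict callable) and the absolute-residual score are illustrative assumptions.

```python
import numpy as np

def split_conformal(fit, X_train, y_train, X_cal, y_cal, X_test, alpha=0.1):
    """Split conformal intervals with marginal coverage >= 1 - alpha under exchangeability.
    `fit(X, y)` is assumed to return a point-prediction callable."""
    predict = fit(X_train, y_train)                   # fit once on the proper training split
    scores = np.abs(y_cal - predict(X_cal))           # conformity scores on the calibration split
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))           # conformal quantile rank
    q = np.inf if k > n else np.sort(scores)[k - 1]   # infinite interval if calibration set too small
    preds = predict(X_test)
    return preds - q, preds + q                       # per-point interval [lower, upper]
```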

Prereqs
Order StatisticsHypothesis Testing for MLCross-Validation Theory+1 more
L3Tier 1predictive-uncertainty

Weighted Conformal Prediction Under Covariate Shift

The extension of split conformal prediction that restores finite-sample coverage when the test distribution differs from the training distribution, at the cost of knowing or estimating the likelihood ratio between them.

Prereqs
Split Conformal PredictionRadon-Nikodym and Conditional ExpectationImportance Sampling+1 more
L1Tier 1scientific-ml

Classical ODEs: Existence, Stability, and Numerical Methods

The foundational theorems that justify treating an ODE as having a solution at all: Picard-Lindelof existence and uniqueness, Gronwall's inequality, linear-system theory, Lyapunov stability, and the numerical methods (Euler, Runge-Kutta, adaptive step) used in modern scientific ML.

Prereqs
Continuity in RⁿThe Jacobian Matrix
L3Tier 2scientific-ml

Adjoint Sensitivity Method

Compute gradients through an ODE solver by integrating a backward adjoint ODE, trading O(NT) activation memory for O(1) memory at the cost of a second integration.

Prereqs
Neural ODEs and Continuous-Depth NetworksClassical ODEs: Existence, Stability, and Numerical MethodsAutomatic Differentiation
L3Tier 3scientific-ml

Continuous Normalizing Flows

Generative models that replace the stack of invertible layers in a normalizing flow with a learned ODE, trading the per-layer Jacobian determinant for an O(d) trace via Hutchinson's estimator.

Prereqs
Neural ODEs and Continuous-Depth NetworksNormalizing FlowsThe Jacobian Matrix+1 more
L4Tier 3scientific-ml

Neural ODEs and Continuous-Depth Networks

Treating neural network depth as a continuous variable: the ODE formulation of residual networks, the adjoint method for memory-efficient backpropagation, the duality with PINNs, the SDE bridge to diffusion models, and the open research frontier.

Prereqs
Classical ODEs: Existence, Stability, and Numerical MethodsSkip Connections and ResNetsGradient Flow and Vanishing Gradients+5 more
L4Tier 3scientific-ml

Neural SDEs and the Diffusion Bridge

The stochastic generalization of neural ODEs: parameterizing the drift and diffusion of an SDE with neural networks, the adjoint method extended through Brownian motion, the explicit bridge to diffusion models via the probability flow ODE, and generative neural SDEs as infinite-dimensional GANs.

Prereqs
Neural ODEs and Continuous-Depth NetworksStochastic Calculus for MLAdjoint Sensitivity Method+2 more
L3Tier 1sequential-inference

E-Values and Anytime-Valid Inference

A framework for hypothesis testing whose validity survives optional stopping, optional continuation, and data-dependent peeking. E-values are nonnegative random variables with expectation at most one under the null; Ville's inequality makes the guarantee anytime-valid.

Prereqs
Measure-Theoretic ProbabilityMartingale TheoryHypothesis Testing for ML+4 more
L1Tier 1statistics

Analysis of Variance

One-way ANOVA decomposes the total sum of squares into a between-group component and a within-group component. Under iid normal data with equal group variances, the ratio of mean squares has an F distribution and gives an exact test of the equal-means null hypothesis. The Welch correction handles unequal variances. Two-way ANOVA partitions further into main effects and interaction. Post-hoc procedures (Tukey HSD, Bonferroni, Scheffe) correct for the multiple-comparison problem that naive pairwise t-tests ignore.
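The decomposition and test statistic in symbols (k groups, group sizes n_i, N total observations):

```latex
\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar y)^2
 = \underbrace{\sum_{i=1}^{k} n_i(\bar y_{i\cdot} - \bar y)^2}_{SS_{\text{between}}}
 + \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar y_{i\cdot})^2}_{SS_{\text{within}}},
\qquad
F = \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)} \;\sim\; F_{k-1,\,N-k}\ \text{under } H_0 .
```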

Prereqs
Central Limit TheoremThe Multivariate Normal DistributionLinear Regression+2 more
L1Tier 1statistics

Delta Method

Asymptotic distribution of a smooth function of an estimator. If sqrt(n)(T_n - mu) converges to N(0, sigma^2), then sqrt(n)(g(T_n) - g(mu)) converges to N(0, [g'(mu)]^2 sigma^2). The multivariate version uses the Jacobian; the second-order version handles vanishing derivatives. The page derives the result, works three canonical examples (variance of a log proportion, variance of a ratio of means, asymptotic variance of the sample correlation), and ties the construction to variance-stabilizing transformations.
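The first of the page's canonical examples in symbols: the log of a Binomial proportion.

```latex
\sqrt{n}\,(\hat p - p) \xrightarrow{d} \mathcal{N}\big(0,\ p(1-p)\big),
\qquad g(x)=\log x,\ g'(p)=\tfrac1p
\;\Longrightarrow\;
\sqrt{n}\,\big(\log\hat p - \log p\big) \xrightarrow{d} \mathcal{N}\Big(0,\ \tfrac{1-p}{p}\Big).
```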

Prereqs
Central Limit TheoremModes of Convergence of Random VariablesExpectation, Variance, Covariance, and Moments+1 more
L1Tier 1statistics

Variance-Stabilizing Transformations

Many distributions have variance that depends on the mean: Poisson variance equals the mean, binomial-proportion variance equals p(1-p)/n. The delta method gives Var(g(X)) ≈ [g'(μ)]^2 σ^2(μ)/n, so picking g to satisfy g'(μ) σ(μ) = constant makes the asymptotic variance independent of μ. Solving this ODE produces the canonical transformations: 2√X for Poisson, arcsin(√p̂) for binomial proportions, log for multiplicative scale data, and the Fisher z-transform for the sample correlation. Anscombe's small-count corrections and the Box-Cox family complete the toolkit.
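The Poisson case worked through: σ²(μ) = μ, so the defining condition g'(μ)σ(μ) = constant gives the square-root transform.

```latex
g'(\mu)\,\sqrt{\mu} = 1
\;\Longrightarrow\;
g(\mu) = \int \mu^{-1/2}\,d\mu = 2\sqrt{\mu},
\qquad
\operatorname{Var}\big(2\sqrt{X}\big) \approx \big[g'(\mu)\big]^2\,\mu = 1
\quad (X \sim \mathrm{Poisson}(\mu),\ \mu \text{ large}).
```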

Prereqs
Delta MethodCentral Limit TheoremCommon Probability Distributions+1 more
L1Tier 2statistics

Tweedie Distribution

The Tweedie distribution is the one-parameter subfamily of the exponential dispersion model (EDM) family characterized by a power variance function V(mu) = mu^p. Special cases recover the Normal (p=0), Poisson (p=1), Gamma (p=2), and Inverse Gaussian (p=3). The intermediate range 1<p<2 produces a compound Poisson-Gamma distribution with a point mass at zero and a continuous positive part, which is the canonical model for insurance loss severity. The page covers the EDM construction, the four special-case identifications, and the compound-Poisson-Gamma representation; the applied actuarial treatment lives on ActuaryPath.

Prereqs
Common Probability DistributionsSufficient Statistics and Exponential FamiliesMaximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency