Curriculum
The Full Theory Library
Every concept organized by depth layer and module. Layer 0 is foundations. Layer 5 is applied systems. Every topic links down to its prerequisites until you hit axioms.
Layer measures prerequisite depth. Tier measures effort within that layer: Tier 1 is core, Tier 2 is recommended, and Tier 3 is optional or advanced. Time is an authored reading estimate, not a mastery guarantee.
Foundations
49 topics / L0A, L1 / Tiers 1-2
Axioms, definitions, and notation. The base layer everything else depends on.
Beta Distribution
The Beta distribution as the conjugate prior for Bernoulli and Binomial likelihoods, as the order statistic of i.i.d. Uniforms, and as a flexible density on the unit interval: density, moments, conjugacy derivation, and MLE without closed form.
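A minimal sketch of the conjugate update this entry describes, with illustrative prior pseudo-counts and coin flips rather than any worked example from the page:

```python
# Beta-Bernoulli conjugate update: prior Beta(a, b) plus k heads in n flips
# gives posterior Beta(a + k, b + n - k). Hyperparameters here are illustrative.
a, b = 2.0, 2.0              # assumed prior pseudo-counts
flips = [1, 0, 1, 1, 0, 1]   # assumed observed Bernoulli draws
k, n = sum(flips), len(flips)

post_a, post_b = a + k, b + (n - k)
post_mean = post_a / (post_a + post_b)
post_mode = (post_a - 1) / (post_a + post_b - 2)   # valid when post_a, post_b > 1

print(f"posterior: Beta({post_a}, {post_b})")
print(f"posterior mean {post_mean:.3f}, mode (MAP) {post_mode:.3f}")
```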
Common Inequalities
The algebraic and probabilistic inequalities that appear everywhere in ML theory: Cauchy-Schwarz, Jensen, AM-GM, Holder, Minkowski, Young, Markov, and Chebyshev.
Common Probability Distributions
The essential probability distributions for ML: Bernoulli through Dirichlet, with PMFs/PDFs, moments, relationships, and when each arises in practice.
Compactness and Heine-Borel
Sequential compactness, the Heine-Borel theorem in finite dimensions, the extreme value theorem, and why compactness is the key assumption in optimization.
Continuity in Rⁿ
Epsilon-delta continuity, uniform continuity, and Lipschitz continuity in Euclidean space. Lipschitz constants control how fast function values change and appear throughout optimization and generalization theory.
Differentiation in Rⁿ
Partial derivatives, directional derivatives, gradients, total derivatives, and the multivariable chain rule. The point is not notation: differentiability means a single linear map predicts the function's change in every small direction at once.
Distributions Atlas
A connection map for the parametric families used in statistical inference and ML. Lists each family with its support, parameterization, and the transformations that move between families: sum, mixture, conjugacy, limiting case, and ratio constructions.
Eigenvalues and Eigenvectors
Eigenvalues and eigenvectors: the directions a matrix scales without rotating. Characteristic polynomial, diagonalization, the spectral theorem for symmetric matrices, and the direct connection to PCA.
Expectation, Variance, Covariance, and Moments
Expectation as average value, variance as squared spread, covariance as joint movement, and the moment identities used throughout statistics and ML.
Exponential Distribution
The Exponential distribution as the memoryless waiting time: density, CDF, MGF, the memoryless property as a characterization, the Poisson-process inter-arrival construction, the minimum of independent exponentials, MLE for the rate, and the bridge to the Gamma distribution.
Exponential Function Properties
The exponential function exp(x): series definition, algebraic properties, and why it appears everywhere in ML. Softmax, MGFs, the Chernoff method, Boltzmann distributions, and exponential families all reduce to properties of exp.
Gamma Distribution
The Gamma distribution as the sum of independent Exponentials and as a flexible nonnegative density: shape and rate, density and MGF, conjugacy for Poisson and Exponential likelihoods, Chi-squared as a special case, MLE without closed form.
Gram Matrices and Kernel Matrices
The Gram matrix encodes pairwise inner products of a dataset. Always PSD. Appears in kernel methods, PCA, SVD, and attention. Connects linear algebra to ML.
Inner Product Spaces and Orthogonality
Inner product axioms, Cauchy-Schwarz inequality, orthogonality, Gram-Schmidt, projections, and the bridge to Hilbert spaces.
Joint, Marginal, and Conditional Distributions
Joint distributions, marginalization, conditional distributions, independence, Bayes theorem, and the chain rule of probability.
KL Divergence
Kullback-Leibler divergence measures how one probability distribution differs from another. Asymmetric, always non-negative, and central to variational inference, MLE, and RLHF.
Kolmogorov Probability Axioms
The three axioms (non-negativity, normalization, countable additivity) that every probability claim on this site implicitly invokes. Sample space, event sigma-algebra, probability measure, and the immediate consequences.
Linear Independence
A set of vectors is linearly independent if no vector is a linear combination of the others. The concept underwrites basis, dimension, rank, the column-space test for $Ax = b$, and every overdetermined-versus-underdetermined diagnosis in linear regression and gradient-based optimization.
Matrix Norms
Frobenius, spectral, and nuclear norms for matrices. Submultiplicativity. When and why each norm appears in ML theory.
Matrix Operations and Properties
Essential matrix operations for ML: trace, determinant, inverse, pseudoinverse, Schur complement, and the Sherman-Morrison-Woodbury formula. When and why each matters.
Metric Spaces, Convergence, and Completeness
Metric space axioms, convergence of sequences, Cauchy sequences, completeness, and the Banach fixed-point theorem.
Normal Distribution
The Normal distribution as a parametric family: density, moment generating function, closure under affine transformations and sums, MLE for mean and variance, Fisher information, and the bridge to the Chi-squared, Student-t, and F sampling distributions.
Numerical Stability and Conditioning
Continuous math becomes real only through finite-precision approximation. Condition numbers, backward stability, catastrophic cancellation, and why theorems about reals do not transfer cleanly to floating-point.
Poisson Distribution
The Poisson distribution as the rare-event limit of the Binomial and as the count law of a Poisson process: PMF, MGF, mean equals variance, additivity, thinning, superposition, MLE, and the connection to the Exponential and Gamma.
Positive Semidefinite Matrices
PSD matrices: equivalent characterizations, Cholesky decomposition, Schur complement, and Loewner ordering. Covariance matrices are PSD. Hessians of convex functions are PSD. These facts connect linear algebra to optimization and statistics.
Random Variables
Random variables as measurement rules, their distributions, expectation and variance, and the rigorous measurable-map definition used in probability theory.
Sets, Functions, and Relations
The language underneath all of mathematics: sets, Cartesian products, functions, injectivity, surjectivity, equivalence relations, and quotient sets.
Singular Value Decomposition
The SVD A = UΣVᵀ: the most important matrix factorization in applied mathematics. Geometric interpretation, relationship to eigendecomposition, low-rank approximation via Eckart-Young, and applications from PCA to pseudoinverses.
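A short numerical illustration of the Eckart-Young claim, using NumPy's SVD on an arbitrary small matrix; the shapes and rank cutoff are assumptions:

```python
import numpy as np

# Rank-k approximation via truncated SVD: keeping the top-k singular triples
# minimizes the Frobenius error among all rank-k matrices (Eckart-Young).
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5)) @ rng.standard_normal((5, 6))  # rank <= 5 matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The error equals the root-sum-of-squares of the dropped singular values.
err = np.linalg.norm(A - A_k, "fro")
print(err, np.sqrt(np.sum(s[k:] ** 2)))   # the two numbers agree
```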
Skewness, Kurtosis, and Higher Moments
Distribution shape beyond mean and variance: skewness measures tail asymmetry, kurtosis measures tail extremeness, cumulants are the cleaner language, and heavy-tailed distributions break all of these.
Taylor Expansion
Taylor approximation in one and many variables. Every optimization algorithm is a Taylor approximation: gradient descent uses first order, Newton's method uses second order.
Tensors and Tensor Operations
What a tensor actually is: a multilinear map with specific transformation rules, how tensor contraction generalizes matrix multiplication, Einstein summation, tensor decompositions, and how ML frameworks use the word tensor to mean multidimensional array.
Total Variation Distance
Total variation distance measures the largest possible discrepancy between two probability distributions: equivalently half the L1 gap, one minus the overlap mass, or the minimum disagreement probability under a coupling.
Vectors, Matrices, and Linear Maps
Vector spaces, linear maps, matrix representation, rank, nullity, and the rank-nullity theorem. The algebraic backbone of ML.
Basic Logic and Proof Techniques
The proof vocabulary behind serious mathematics and ML theory: implication, quantifiers, contradiction, contrapositive, induction, construction, cases, and counterexamples.
Cantor's Theorem and Uncountability
Cantor's diagonal argument proves the reals are uncountable. The power set of any set has strictly greater cardinality. These results are the origin of the distinction between countable and uncountable infinity.
Cardinality and Countability
Two sets have the same cardinality when a bijection exists between them. The naturals, integers, and rationals are countable. The reals are uncountable, proved by Cantor's diagonal argument.
Counting and Combinatorics
Counting principles, binomial and multinomial coefficients, inclusion-exclusion, and Stirling's approximation. These tools appear whenever you count hypotheses, bound shattering coefficients, or analyze combinatorial arguments in learning theory.
Cramér-Wold Theorem
A multivariate distribution is uniquely determined by all of its one-dimensional projections. This reduces multivariate convergence in distribution to checking univariate projections and is the standard tool for proving the multivariate CLT.
Hypergeometric Distribution
The Hypergeometric distribution is the law of the success count in n draws without replacement from a finite population of N items, K of which are successes. The PMF is the ratio of three binomial coefficients. The mean is nK/N, identical to the Binomial with p = K/N, but the variance carries the finite-population correction (N-n)/(N-1). The Binomial is the n << N limit. Fisher's exact test, capture-recapture, and quality-control acceptance sampling read off the Hypergeometric directly.
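A brute-force check of the stated mean and finite-population-corrected variance, with assumed values for N, K, and n:

```python
from math import comb

# Hypergeometric PMF as a ratio of binomial coefficients; mean nK/N; variance
# equal to the Binomial variance times the finite-population correction.
N, K, n = 50, 12, 10    # population size, successes in population, draws (assumed)

pmf = {k: comb(K, k) * comb(N - K, n - k) / comb(N, n)
       for k in range(max(0, n - (N - K)), min(n, K) + 1)}

mean = sum(k * p for k, p in pmf.items())
var = sum((k - mean) ** 2 * p for k, p in pmf.items())

p = K / N
print(mean, n * p)                                 # nK/N
print(var, n * p * (1 - p) * (N - n) / (N - 1))    # Binomial variance x (N-n)/(N-1)
```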
Integration and Change of Variables
Riemann integration, improper integrals, the substitution rule, multivariate change of variables via the Jacobian determinant, and Fubini's theorem. The computational backbone of probability and ML.
Inverse and Implicit Function Theorem
The inverse function theorem guarantees local invertibility when the Jacobian is nonsingular. The implicit function theorem guarantees that constraint surfaces are locally graphs. Both are essential for constrained optimization and implicit layers.
Lognormal Distribution
A random variable is Lognormal if its logarithm is Normal. The density, mean, variance, median, and mode all have closed forms in the two underlying Normal parameters. The Lognormal is the multiplicative analogue of the Normal: a product of many independent positive factors is approximately Lognormal in the same way a sum is approximately Normal. Applications cover financial returns (with the heavy-tail caveat that real returns are heavier than Lognormal), particle sizes, lifetimes, and insurance severity.
Markov Chains and Steady State
Markov chains: the Markov property, transition matrices, stationary distributions, irreducibility, aperiodicity, the ergodic theorem, and mixing time. The backbone of PageRank, MCMC, and reinforcement learning.
Moment Generating Functions
Moment generating functions encode moments, control light-tailed behavior, and power Chernoff bounds, sub-Gaussian estimates, and exponential-family theory.
Pareto Distribution
The Pareto distribution is the canonical power law on a half-line. The Type I parameterization has survival function (x_m/x)^alpha for x >= x_m. The shape parameter alpha is the tail index, and three regimes matter for the law of large numbers and the central limit theorem: alpha <= 1 has no finite mean and breaks the LLN; 1 < alpha <= 2 has a finite mean but infinite variance, so the standard CLT fails (generalized CLT to a stable law); alpha > 2 admits both LLN and CLT in the usual form. Applications: wealth, city sizes, file sizes, network degree, insurance severity. The 80/20 'Pareto principle' is a specific case requiring alpha approximately 1.16.
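A quick simulation of the tail-index regimes, using inverse-CDF sampling with x_m = 1; the two alpha values and sample sizes are illustrative:

```python
import numpy as np

# Running sample means under two Pareto tail indices. With alpha <= 1 the
# running mean never settles; with alpha > 2 it converges at the usual rate.
rng = np.random.default_rng(0)

def pareto_sample(alpha, size, x_m=1.0):
    # Inverse-CDF sampling: survival (x_m/x)^alpha gives X = x_m * U^(-1/alpha).
    return x_m * rng.uniform(size=size) ** (-1.0 / alpha)

for alpha in (0.8, 2.5):
    x = pareto_sample(alpha, 1_000_000)
    running = np.cumsum(x) / np.arange(1, x.size + 1)
    print(f"alpha={alpha}: running mean at 1e4, 1e5, 1e6 draws:",
          running[9_999], running[99_999], running[-1])
```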
Sequences and Series of Functions
Pointwise vs uniform convergence of function sequences, the Weierstrass M-test, and why uniform convergence preserves continuity. The concept that makes learning theory work.
Signals and Systems for ML
Linear time-invariant systems, convolution, Fourier transform, and the sampling theorem. The signal processing foundations that underpin CNNs, efficient attention, audio ML, and frequency-domain analysis of training dynamics.
Triangular Distribution
A bounded continuous distribution defined by a lower endpoint, upper endpoint, and mode. Includes the PDF, CDF, moments, sum-of-uniforms construction, and why the shape is a useful bridge between geometry and probability.
Weibull Distribution
The Weibull distribution is the standard parametric model for failure-time data in reliability and survival analysis. Two parameters: a shape k and a scale lambda. The hazard rate is monotone in time: increasing when k > 1 (wear-out), constant when k = 1 (Exponential, memoryless), decreasing when k < 1 (early-failure / infant-mortality). Mean is lambda Gamma(1 + 1/k); variance is lambda^2 (Gamma(1 + 2/k) - Gamma(1 + 1/k)^2). MLE for k has no closed form and is solved by Newton-Raphson on the profile score equation. Applications: component lifetimes, time to event in clinical trials, wind speeds, extreme-value Type III.
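A Monte Carlo check of the stated moment formulas, with an assumed shape and scale:

```python
import numpy as np
from math import gamma

# Weibull mean and variance from the Gamma-function formulas, checked against
# simulation. numpy's weibull draws with scale 1, so multiply by lambda.
k, lam = 1.5, 2.0                      # assumed shape and scale
mean = lam * gamma(1 + 1 / k)
var = lam ** 2 * (gamma(1 + 2 / k) - gamma(1 + 1 / k) ** 2)

rng = np.random.default_rng(0)
x = lam * rng.weibull(k, size=1_000_000)
print(mean, x.mean())
print(var, x.var())
```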
Mathematical Infrastructure
29 topics / L0A, L0B, L1, L2, L3 / Tiers 1-3
Serious math machinery: measure theory, functional analysis, convex duality.
Automatic Differentiation
Forward mode computes Jacobian-vector products, reverse mode computes vector-Jacobian products: backpropagation is reverse-mode autodiff, and the asymmetry between the two modes explains why training neural networks is efficient.
Borel-Cantelli Lemmas
The two lemmas that bridge in-probability convergence to almost-sure convergence. The first lemma: summable probabilities imply the events occur only finitely often. The second lemma: independence plus a divergent probability sum implies they occur infinitely often.
Characteristic Functions
The Fourier transform of a probability distribution. Always exists (unlike MGFs), uniquely determines the distribution, multiplies under independent sums, and powers the rigorous proof of the central limit theorem via Levy's continuity theorem.
Convex Duality
Fenchel conjugates, the Fenchel-Moreau theorem, weak and strong duality, KKT conditions, and why duality gives the kernel trick for SVMs, connects regularization to constraints, and enables adversarial formulations in DRO.
Matrix Calculus
The differentiation identities you actually use in ML: derivatives of traces, log-determinants, and quadratic forms with respect to matrices and vectors.
Measure-Theoretic Probability
The rigorous foundations of probability: sigma-algebras, measures, measurable functions as random variables, Lebesgue integration, and the convergence theorems that make modern probability and statistics possible.
Modes of Convergence of Random Variables
The four standard senses in which a sequence of random variables can converge: almost surely, in probability, in Lp, and in distribution. Their hierarchy, the strict counterexamples that separate them, and the supporting tools (Slutsky, continuous mapping, Skorokhod).
Radon-Nikodym and Conditional Expectation
The Radon-Nikodym theorem: what 'density' really means. Absolute continuity, the Radon-Nikodym derivative, conditional expectation as a projection, tower property, and why this undergirds likelihood ratios, importance sampling, and KL divergence.
The Hessian Matrix
The matrix of second partial derivatives: encodes curvature, classifies nondegenerate critical points (and is inconclusive at degenerate ones), and is the central object in second-order optimization.
The Jacobian Matrix
The matrix of all first partial derivatives of a vector-valued function: encodes the best linear approximation, connects to the chain rule in matrix form, and is the backbone of backpropagation.
Vector Calculus Chain Rule
The chain rule for compositions of multivariable maps. Jacobians multiply when functions compose; gradients of scalar-valued compositions become vector-Jacobian products. The result that makes backpropagation a one-line theorem.
Backward Stochastic Differential Equations
The Pardoux-Peng framework: an SDE with a terminal condition and an adapted solution pair. Linear BSDEs reduce to Feynman-Kac; nonlinear BSDEs are dual to Hamilton-Jacobi-Bellman PDEs and are the mathematical object that the deep BSDE method approximates.
Complex Numbers for Fourier
The minimum complex-number machinery you need to read Fourier-domain pages: the field C, the imaginary unit i, modulus and argument, polar form, Euler's formula, and roots of unity. Not complex analysis, just the groundwork.
Divergence, Curl, and Line Integrals
The vector-calculus operators that appear in Fokker-Planck, score-based diffusion, and PINN papers. Definitions of divergence, curl, gradient, line integrals, and Green's theorem in the plane. Compact, not encyclopedic.
Feynman–Kac Formula
The probabilistic representation of solutions to linear parabolic PDEs as expectations over SDE trajectories. The bridge that lets you Monte Carlo a PDE, the reason high-dimensional Black-Scholes is tractable, and the foundation under every backward-SDE method including deep BSDE.
Fokker–Planck Equation
The deterministic PDE for the time-evolving density of an SDE. The bridge that lets you reason about Langevin samplers, diffusion models, and stochastic optimization in PDE language: stationary distributions become null spaces, mixing times become spectral gaps, score functions become drift-density couplings.
Functional Analysis Core
The four pillars of functional analysis: Hahn-Banach (extending functionals), Uniform Boundedness (pointwise bounded implies uniformly bounded), Open Mapping (surjective bounded operators have open images), and Banach-Alaoglu (dual unit ball is weak-* compact). These underpin RKHS theory, optimization in function spaces, and duality.
Hamilton–Jacobi–Bellman Equation
The PDE characterizing the value function of a continuous-time stochastic optimal control problem. The continuous-time analog of the discrete Bellman equation, the fully nonlinear PDE that nonlinear Feynman–Kac inverts via BSDEs, and the equation Deep BSDE solves numerically in high dimensions.
Implicit Differentiation
Differentiating through implicit equations and optimization problems: the implicit function theorem gives dy/dx without solving for y explicitly. Applications to bilevel optimization, deep equilibrium models, hyperparameter optimization, and meta-learning.
Information Theory Foundations
The core of information theory for ML: entropy, cross-entropy, KL divergence, mutual information, data processing inequality, and the chain rules that connect them. The language of variational inference, generalization bounds, and representation learning.
Ito's Lemma
The chain rule of stochastic calculus: if a process follows an SDE, applying a smooth function to it yields a modified SDE with an extra second-order correction term that has no analogue in ordinary calculus.
Martingale Theory
Martingales and their convergence properties: Doob martingale, optional stopping theorem, martingale convergence, Azuma-Hoeffding inequality, and Freedman inequality. The tools behind McDiarmid's inequality, online learning regret bounds, and stochastic approximation.
Non-Euclidean and Hyperbolic Geometry
The geometry that drops the parallel postulate. Hyperbolic and spherical models, sectional curvature, the Poincare disk, and why hyperbolic spaces embed tree-structured data with low distortion. The grounding for graph embeddings on curved spaces.
PDE Fundamentals for Machine Learning
The partial differential equations that appear in modern machine learning: heat and Fokker-Planck for diffusion, continuity for flow matching, Hamilton-Jacobi-Bellman for reinforcement learning, Poisson for score matching. Classification, solution concepts, and where ML actually needs PDE theory versus where it just uses the vocabulary.
Stochastic Differential Equations
SDEs of the form dX = b dt + sigma dB: strong and weak solutions, existence and uniqueness under Lipschitz conditions, Euler-Maruyama discretization, and the canonical examples that appear throughout ML (Ornstein-Uhlenbeck, geometric Brownian motion, Langevin dynamics).
Time Reversal of SDEs
Anderson 1982: any forward Ito SDE has an explicit time-reversed SDE whose drift involves the score function (gradient of log density). The single result that turns a forward noising process into a generative sampler and underlies every score-based diffusion model.
Information Geometry
Riemannian geometry on the space of probability distributions: the Fisher information metric, natural gradient descent, exponential families as dually flat manifolds, and the connection to mirror descent.
Spectral Theory of Operators
Spectral theorem for compact self-adjoint operators on Hilbert spaces: every such operator has a countable orthonormal eigenbasis with real eigenvalues accumulating only at zero. This is the infinite-dimensional backbone of PCA, kernel methods, and neural tangent kernel theory.
Stochastic Calculus for ML
Brownian motion, Ito integrals, Ito's lemma, and stochastic differential equations: the mathematical machinery behind diffusion models, score-based generative models, and Langevin dynamics.
Statistical Estimation
27 topics / L0B, L1, L2 / Tiers 1-3
MLE, Fisher information, Cramér-Rao, LLN, CLT — the estimation core.
Asymptotic Statistics: M-Estimators, Delta Method, LAN
The large-sample toolbox for statistical inference: continuous mapping theorem, Slutsky, the delta method, M- and Z-estimator consistency and asymptotic normality, MLE as a special M-estimator, local asymptotic normality (Le Cam), the asymptotic equivalence of Wald / score / likelihood-ratio tests, and influence-function representations. These results justify essentially every confidence interval, standard error, and p-value in applied statistics, and they are the language of modern semiparametric theory.
Bayesian Linear Regression
Gaussian prior, Gaussian likelihood, Gaussian posterior. Full posterior derivation by completing the square in the exponent: the posterior mean equals the ridge estimator, the predictive distribution has irreducible plus epistemic variance, and the marginal likelihood gives a closed-form hyperparameter selection criterion. A worked numeric example with three data points carries the algebra end to end.
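A minimal numeric check of the posterior-mean-equals-ridge identity, not the page's own three-point example; the data points and variances below are assumptions:

```python
import numpy as np

# Posterior mean under prior w ~ N(0, tau^2 I) and noise N(0, sigma^2) equals
# ridge regression with penalty sigma^2 / tau^2. Data and variances are illustrative.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # three points: intercept + slope
y = np.array([0.9, 2.1, 2.9])
sigma2, tau2 = 0.25, 1.0

lam = sigma2 / tau2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

# Full Gaussian posterior: Sigma_post = (X^T X / sigma^2 + I / tau^2)^{-1},
# mu_post = Sigma_post X^T y / sigma^2.
Sigma_post = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / tau2)
mu_post = Sigma_post @ X.T @ y / sigma2

print(ridge, mu_post)   # identical up to floating point
```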
Bootstrap Methods
The nonparametric bootstrap: resample with replacement to approximate sampling distributions, construct confidence intervals, and quantify uncertainty without distributional assumptions.
Central Limit Theorem
The CLT family: classical Lindeberg-Levy, the Lindeberg-Feller condition for non-identical summands, Berry-Esseen rate, multivariate CLT, the delta method, and the martingale CLT that underwrites SGD asymptotics. Why these results power MLE asymptotic normality, bootstrap validity, and the Gaussian-noise approximation of stochastic optimization, and where the CLT breaks in high dimensions and under heavy tails.
Chi-Squared Distribution and Tests
The Chi-squared distribution as sum of squared standard Normals and as the sampling distribution of the scaled sample variance, plus the two Pearson Chi-squared tests: goodness of fit for cell counts and independence in contingency tables.
Conjugate Priors
When the prior and likelihood are paired so the posterior stays in the same family as the prior. Definition via exponential families, the standard table (Beta-Bernoulli, Dirichlet-multinomial, Normal-Normal, Normal-inverse-gamma, Gamma-Poisson), worked Normal-Normal updates in 1D and the multivariate case, and the pseudo-observation interpretation that makes conjugacy a feature, not a coincidence.
Cramér-Rao Bound: Information Inequality, Achievability, and Sharper Variants
The fundamental information inequality for unbiased estimation. Coverage of the scalar and multivariate Cramér-Rao bounds, the extension to biased estimators, achievability in exponential families, the Bhattacharyya higher-order bounds, the Hammersley-Chapman-Robbins bound (no regularity required), the van Trees Bayesian inequality, and efficient information with nuisance parameters.
The EM Algorithm
Expectation-Maximization: the principled way to do maximum likelihood when some variables are unobserved. Derives the ELBO, proves monotonic convergence, and shows why EM is the backbone of latent variable models.
F-Distribution and ANOVA
The F-distribution as a ratio of two scaled Chi-squareds, and the one-way analysis of variance F-test built on it: between-group versus within-group variance decomposition, exact null distribution under Normality and equal variances, and the link to two-sample t-tests.
Fisher Information: Curvature, KL Geometry, and the Natural Gradient
The Fisher information measures how much a sample tells you about an unknown parameter. Coverage of the score function and Bartlett identities, the equivalence of variance-of-score and expected-negative-Hessian forms, the KL-divergence Hessian identity, the Fisher information matrix and Loewner ordering, the Cramér-Rao link, the natural gradient (parameterization invariance), K-FAC and empirical Fisher in deep learning, and applications to MLE asymptotic efficiency, EWC, and information geometry.
Law of Large Numbers
The weak and strong laws of large numbers: the sample mean converges to the population mean. Kolmogorov's conditions, the rate of convergence from the CLT, and why LLN justifies using empirical risk as a proxy for population risk.
Likelihood-Ratio, Wald, and Score Tests
The three asymptotic tests built from the likelihood: LRT compares maximized likelihoods, Wald compares the MLE to the null using the inverse Fisher information, score uses the gradient of the log-likelihood at the null. All three are asymptotically Chi-squared and equivalent under regularity; they disagree in finite samples and choice depends on the problem.
Maximum A Posteriori (MAP) Estimation
Maximum a posteriori estimation as the posterior mode of a Bayesian model: derivation, the flat-prior recovery of MLE, the worked L2-norm-equals-Gaussian-prior and L1-norm-equals-Laplace-prior equivalences that make ridge and lasso Bayesian, and the invariance failure under reparameterization that distinguishes MAP from MLE.
Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency
MLE: pick the parameter that maximizes the likelihood of observed data. Score function, Bartlett identities, regularity conditions, consistency, asymptotic normality, Wilks' theorem, Cramér-Rao efficiency, exponential families, QMLE under misspecification, and the bridge to deep-learning negative log-likelihood training.
The Multivariate Normal Distribution
The multivariate Gaussian as the joint of d correlated random variables: density derivation from standard normals via affine maps, the completing-the-square recipe, Schur-complement marginals and conditionals, the MGF and characteristic function, and the algebraic identities that power every Bayesian Gaussian derivation downstream.
Permutation Tests
Exchangeability-based hypothesis testing: under the null of no group effect, the labels are exchangeable, so the distribution of any test statistic under random relabeling gives an exact null reference. Exact for small samples, approximated by Monte Carlo for large samples, robust under non-Normality and heavy tails.
Shrinkage Estimation and the James-Stein Estimator: Inadmissibility, SURE, and Brown's Characterization
Why the sample mean is inadmissible in three or more dimensions. Stein's identity, the James-Stein estimator and its positive-part refinement, Stein's unbiased risk estimate, Brown's admissibility characterization, the empirical Bayes interpretation, and the link to ridge regression and modern regularization.
Student-t Distribution and t-Test
The Student-t distribution as the ratio of a standard Normal to the square root of an independent Chi-squared divided by its degrees of freedom, and the one-sample, two-sample, and paired t-tests it powers: exact null distribution under Normality, Welch correction for unequal variances, and large-sample equivalence to the Wald z-test.
Bayesian Estimation
The Bayesian approach to parameter estimation: encode prior beliefs, update with data via Bayes rule, and obtain a full posterior distribution over parameters. Conjugate priors, MAP estimation, and the Bernstein-von Mises theorem showing that the posterior concentrates around the true parameter.
De Moivre-Laplace Theorem
The first central limit theorem, historically. Bin(n,p) is approximately N(np, np(1-p)) for large n, with explicit continuity correction. Stirling-based proof, Berry-Esseen rate, and where the approximation breaks down (small p, small n, skewed binomials).
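A small comparison of the exact Binomial tail with the Normal approximation, with and without the continuity correction, on assumed values of n, p, and k:

```python
from math import comb, erf, sqrt

# Exact P(X <= k) for X ~ Bin(n, p) versus the Normal approximation.
n, p, k = 40, 0.3, 16          # illustrative values

exact = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))
phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))   # standard Normal CDF
mu, sd = n * p, sqrt(n * p * (1 - p))

print("exact            ", exact)
print("normal, corrected", phi((k + 0.5 - mu) / sd))
print("normal, raw      ", phi((k - mu) / sd))
```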
Empirical Bayes vs Hierarchical Bayes
What changes when hyperparameters are estimated and plugged in versus assigned a prior and integrated out, and why the gap is mostly about uncertainty rather than point estimates.
Goodness-of-Fit Tests
KS test, Shapiro-Wilk, Anderson-Darling, and the probability integral transform: testing whether data follows a specified distribution, with exact power comparisons and when each test fails.
Method of Moments
Match sample moments to population moments to estimate parameters. Simpler than MLE but less efficient. Covers classical MoM, generalized method of moments (GMM), and when MoM is the better choice.
Poisson Limit Theorem and Le Cam's Bound
Bin(n, lambda/n) converges to Pois(lambda) as n grows. The classical product-of-PMFs proof, then Le Cam's total-variation bound that makes the approximation quantitative. When to use Poisson vs Normal approximation. Disambiguation: Le Cam published multiple famous theorems.
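A direct computation of the total variation gap as n grows, compared against a Le Cam-style lambda^2/n bound (the exact constant quoted on the page may differ):

```python
from math import comb, exp, factorial

# TV distance between Bin(n, lam/n) and Pois(lam), truncated at kmax where the
# remaining mass is negligible, next to the quantitative lam^2 / n bound.
lam = 2.0

def tv_binomial_poisson(n, lam, kmax=60):
    p = lam / n
    tv = 0.0
    for k in range(kmax):
        b = comb(n, k) * p**k * (1 - p) ** (n - k) if k <= n else 0.0
        q = exp(-lam) * lam**k / factorial(k)
        tv += abs(b - q)
    return 0.5 * tv

for n in (10, 100, 1000):
    print(n, tv_binomial_poisson(n, lam), lam**2 / n)
```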
REML and Variance Component Estimation
Why restricted maximum likelihood estimates variance components from error contrasts rather than the full data likelihood, and why that usually behaves better than ML when fixed effects are present.
Sufficient Statistics and Exponential Families
Sufficient statistics compress data without losing information about the parameter. The Neyman-Fisher factorization theorem, exponential families, completeness, and Rao-Blackwell improvement of estimators.
Basu's Theorem
A complete sufficient statistic is independent of every ancillary statistic. This provides the cleanest method for proving independence between statistics without computing joint distributions.
Bayesian ML Frontier
1 topic / L3 / Tier 1
Bayesian and probabilistic views of tabular foundation models and statistical ML frontiers.
Learning Theory Core
14 topics / L1, L2, L3 / Tiers 1-2
ERM, uniform convergence, VC dimension, Rademacher complexity.
Algorithmic Stability
Algorithmic stability provides generalization bounds by analyzing how much a learning algorithm's output changes when a single training example is replaced: a structurally different lens from complexity-based approaches.
Empirical Risk Minimization
The foundational principle of statistical learning: minimize average loss on training data as a proxy for minimizing true population risk.
Hypothesis Classes and Function Spaces
What a hypothesis class is, why the choice of hypothesis class determines what ERM can learn, and the approximation-estimation tradeoff: bigger classes reduce approximation error but increase estimation error.
PAC Learning Framework
The foundational formalization of what it means to learn from data: a concept is PAC-learnable if and only if an algorithm can, with high probability, find a hypothesis that is approximately correct, using a polynomial number of samples.
Rademacher Complexity
A data-dependent measure of hypothesis class complexity that gives tighter generalization bounds than VC dimension by measuring how well the class can fit random noise on the actual data.
Realizability Assumption
The technical assumption that the target function lies inside the hypothesis class. Realizable PAC learning is the simpler half of the story; agnostic PAC drops this assumption.
Sample Complexity Bounds
How many samples do you need to learn? Tight answers for finite hypothesis classes, VC classes, and Rademacher-bounded classes, plus matching lower bounds via Fano and Le Cam.
Uniform Convergence
Uniform convergence of empirical risk to population risk over an entire hypothesis class: the key property that makes ERM provably work.
VC Dimension
The Vapnik-Chervonenkis dimension: a combinatorial measure of hypothesis class complexity that characterizes learnability in binary classification.
Bias-Complexity Tradeoff
The error of an ERM hypothesis decomposes into approximation error (the price of restricting H) and estimation error (the price of finite data). Richer classes shrink approximation but inflate estimation. The optimal class is the smallest one that still captures the truth.
Glivenko-Cantelli Theorem
Uniform almost-sure convergence of the empirical CDF to the true CDF, and the function-class generalization that ML uses for ERM. Covers the classical theorem, the DKW exponential rate (Massart 1990), the Glivenko-Cantelli class definition, and the bridge to VC dimension and Rademacher complexity.
Kolmogorov Complexity and MDL
Kolmogorov complexity measures the shortest program that produces a string. The Minimum Description Length principle selects models that compress data best, providing a computable approximation to an incomputable ideal.
Loss Functions
The generalized loss-function framework from Shalev-Shwartz & Ben-David Ch 3, the six standard losses (0-1, MSE, MAE, Huber, cross-entropy, hinge), and the bridge from supervised learning to ERM and PAC learnability.
No-Free-Lunch Theorem
For binary classification with 0-1 loss, no learning algorithm can succeed on every distribution: for any algorithm and any sample size m smaller than half the size of the domain, some realizable distribution forces error at least 1/8 with probability at least 1/7. Universal learners do not exist; prior knowledge enters through the choice of hypothesis class.
Concentration & Probability
21 topics / L1, L2, L3 / Tiers 1-2
Hoeffding through matrix Bernstein. The workhorse inequality family.
Bennett's Inequality
A variance-aware concentration inequality for independent bounded random variables. The exponent uses the function h(a) = (1+a)log(1+a) - a, the same h that controls the multiplicative Chernoff bound for Bernoullis.
Bernstein Inequality
A variance-sensitive concentration inequality for independent bounded random variables. Bernstein sharpens Hoeffding when the variance is much smaller than the worst-case range.
Chernoff Bounds
The Chernoff method: the universal technique for deriving exponential tail bounds by optimizing over the moment generating function, yielding the tightest possible exponential concentration.
Chi-Squared Concentration
Two-sided exponential concentration for chi-squared sums of squared standard Gaussians: P(|Z/k - 1| > t) <= 2 exp(-t^2 k / 6) for t in (0, 1/2). Drives sub-exponential tails, variance-component inference, and Lipschitz Gaussian concentration.
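A Monte Carlo sanity check of the quoted bound, with assumed k and t:

```python
import numpy as np

# Empirical frequency of |Z/k - 1| > t for Z ~ chi^2_k, against the quoted
# bound 2 exp(-t^2 k / 6) for t in (0, 1/2).
rng = np.random.default_rng(0)
k, t, trials = 50, 0.3, 200_000

z = rng.chisquare(k, size=trials)
empirical = np.mean(np.abs(z / k - 1) > t)
bound = 2 * np.exp(-t**2 * k / 6)
print(empirical, bound)   # the empirical frequency sits below the bound
```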
Concentration Inequalities
Bounds on how far random variables deviate from their expectations: Markov, Chebyshev, Hoeffding, and Bernstein. Used throughout generalization theory, bandits, and sample complexity.
Epsilon-Nets and Covering Numbers
Discretizing infinite sets for concentration arguments: epsilon-nets, covering numbers, packing numbers, the Dudley integral, and the connection to Rademacher complexity.
Fat Tails and Heavy-Tailed Distributions
When the tails dominate. Power laws, Pareto distributions, subexponential tails, why the law of large numbers converges slowly or fails, and why most of ML silently assumes thin tails.
Hoeffding's Lemma
The MGF bound that powers Hoeffding's inequality: a centered random variable on [a,b] has sub-Gaussian moment generating function with parameter (b-a)^2/4.
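A numeric check of the lemma for a centered Bernoulli on [0, 1]; the success probability and the grid of s values are illustrative:

```python
import numpy as np

# Hoeffding's lemma: for X in [a, b], E[exp(s(X - EX))] <= exp(s^2 (b - a)^2 / 8),
# i.e. sub-Gaussian with variance proxy (b - a)^2 / 4. Checked for Bernoulli(p).
p, a, b = 0.3, 0.0, 1.0
for s in np.linspace(-4, 4, 9):
    mgf = p * np.exp(s * (1 - p)) + (1 - p) * np.exp(s * (0 - p))
    bound = np.exp(s**2 * (b - a) ** 2 / 8)
    assert mgf <= bound + 1e-12
    print(f"s={s:+.1f}  E[e^(s(X-EX))]={mgf:.4f}  bound={bound:.4f}")
```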
LLN and CLT Failures Under Heavy Tails
What breaks when finite mean or finite variance fails. Cauchy: the sample mean stays Cauchy no matter how large the sample. Pareto across the alpha regimes: when LLN still holds but CLT does not. Generalized CLT and stable-law limits. The consistency illusion in finance and reinsurance.
Matrix Concentration
Matrix Bernstein, Matrix Hoeffding, Weyl's inequality, and Davis-Kahan: the operator-norm concentration tools needed for covariance estimation, dimensionality reduction, and spectral analysis in high dimensions.
McDiarmid's Inequality
The bounded-differences inequality: if changing any single input to a function changes the output by at most a fixed constant, the function concentrates around its mean with sub-Gaussian tails.
Measure Concentration and Geometric Functional Analysis
High-dimensional geometry is counterintuitive: Lipschitz functions concentrate, random projections preserve distances, and most of a sphere's measure sits near the equator. Johnson-Lindenstrauss, Gaussian concentration, and Levy's lemma.
Sub-Exponential Random Variables
The distributional class between sub-Gaussian and heavy-tailed: heavier tails than Gaussian, the Orlicz norm characterization, Bernstein condition, and the two-regime concentration bound.
Sub-Gaussian Random Variables
Sub-Gaussian random variables: the precise characterization of 'light-tailed' behavior that underpins exponential-tail concentration inequalities and the Gaussian-rate generalization bounds in learning theory.
Symmetrization Inequality
The symmetrization technique: the proof template that connects the generalization gap to Rademacher complexity by introducing a ghost sample and random signs.
Contraction Inequality
The Ledoux-Talagrand contraction principle: applying a Lipschitz function anchored at zero to a function class can only contract Rademacher complexity, letting you bound the complexity of the loss class from the hypothesis class.
Empirical Processes and Chaining
Bounding the supremum of empirical processes via covering numbers and chaining: Dudley's entropy integral and Talagrand's generic chaining, the sharpest tools in classical learning theory.
Hanson-Wright Inequality
Concentration of quadratic forms XᵀAX for sub-Gaussian random vectors: the two-term bound involving the Frobenius norm (Gaussian regime) and operator norm (extreme regime).
Restricted Isometry Property
The restricted isometry property (RIP): when a measurement matrix approximately preserves norms of sparse vectors, enabling exact sparse recovery via L1 minimization. Random Gaussian matrices satisfy RIP with O(s log(n/s)) rows.
Slud's Inequality
A lower bound on binomial tail probabilities. For X ~ Bin(n, p) with p = (1 - epsilon)/2, P(X >= n/2) is at least a normal-tail-style quantity. Used in VC lower bounds to show learning is genuinely hard.
Stochastic Processes for ML
Random processes as the unifying framework behind Gaussian processes, neural tangent kernels, empirical processes, and martingale concentration: definitions, sample-path regularity, and the bridges between learning theory's main probabilistic tools.
Optimization & Function Classes
13 topics / L1, L2, L3 / Tiers 1-2
Convex optimization, regularization, kernels, RKHS.
Convex Optimization Basics
Convex sets, convex functions, gradient descent convergence, strong convexity, and duality: the optimization foundation that every learning-theoretic result silently depends on.
Gradient Descent Variants
From full-batch to stochastic to mini-batch gradient descent, plus momentum, Nesterov acceleration, AdaGrad, RMSProp, and Adam. Why mini-batch SGD with momentum is the practical default.
Gradient Flow and Vanishing Gradients
Why deep networks are hard to train: gradients shrink or explode as they propagate through layers. The Jacobian product chain, sigmoid saturation, ReLU dead neurons, skip connections, normalization, and gradient clipping.
Stochastic Gradient Descent Convergence
SGD convergence rates for convex and strongly convex functions, the role of noise as both curse and blessing, mini-batch variance reduction, learning rate schedules, and the Robbins-Monro conditions.
Subgradients and Subdifferentials
The non-smooth generalization of the gradient for convex functions. Subgradients enable optimality conditions, calculus rules, and convergence guarantees for L1-regularized problems, hinge loss SVMs, and proximal algorithms where the objective is not differentiable.
Bias-Variance Tradeoff
The classical decomposition of mean squared error into bias squared, variance, and irreducible noise. The U-shaped test error curve, why it breaks in modern ML (double descent), and the connection to regularization.
Cross-Validation Theory
The theory behind cross-validation as a model selection tool: LOO-CV, K-fold, the bias-variance tradeoff of the CV estimator itself, and why CV estimates generalization error.
Kernels and Reproducing Kernel Hilbert Spaces
Kernel functions, Mercer's theorem, the RKHS reproducing property, and the representer theorem: the mathematical framework that enables learning in infinite-dimensional function spaces via finite-dimensional computations.
Preconditioned Optimizers: Shampoo, K-FAC, and Natural Gradient
Optimizers that use curvature information to precondition gradients: the natural gradient via Fisher information, K-FAC's Kronecker approximation, and Shampoo's full-matrix preconditioning. How they connect to Riemannian optimization and why they outperform Adam on certain architectures.
Regularization Theory
Why unconstrained ERM overfits and how regularization controls complexity: Tikhonov (L2), sparsity (L1), elastic net, early stopping, dropout, the Bayesian prior connection, and the link to algorithmic stability.
Riemannian Optimization and Manifold Constraints
Optimization on curved spaces: the Stiefel manifold for orthogonal matrices, symmetric positive definite matrices, Riemannian gradient descent, retractions, and why flat-space intuitions break on manifolds. The geometric backbone of Shampoo, Muon, and constrained neural network training.
Stability and Optimization Dynamics
Convergence of gradient descent for smooth and strongly convex objectives, the descent lemma, gradient flow as a continuous-time limit, Lyapunov stability analysis, and the edge of stability phenomenon.
Stochastic Approximation Theory
The Robbins-Monro framework, ODE method, and Polyak-Ruppert averaging: the unified theory behind why SGD, Q-learning, and TD-learning converge.
Statistical Foundations
18 topics / L1, L2, L3, L4 / Tiers 1-3
Minimax, Fano, information-theoretic lower bounds, random matrix theory.
Minimax Lower Bounds: Le Cam, Fano, Assouad, and the Reduction to Testing
Why upper bounds are not enough. The reduction-to-testing principle, Le Cam's two-point method, Fano's inequality, the Assouad hypercube lemma, packing and metric-entropy bridges, and applications to nonparametric, sparse, and locally-private estimation.
Design-Based vs. Model-Based Inference
Two philosophies of statistical inference from survey data: design-based inference where randomness comes from the sampling design, and model-based inference where randomness comes from a statistical model, with the model-assisted hybrid approach.
Detection Theory
Binary hypothesis testing, the Neyman-Pearson lemma (likelihood ratio tests are most powerful), ROC curves, Bayesian detection, and sequential testing. Classification IS detection theory. ROC/AUC comes directly from here.
Fano Inequality
Fano inequality as the standard tool for information-theoretic lower bounds: error probability in estimating a random variable from a noisy observation is bounded below by conditional entropy and alphabet size.
GREG Estimator
The generalized regression estimator is the standard model-assisted survey estimator: Horvitz-Thompson plus a regression correction using known auxiliary totals.
High-Dimensional Covariance Estimation
When dimension d is comparable to sample size n, the sample covariance matrix fails. Shrinkage estimators (Ledoit-Wolf), banding and tapering for structured covariance, and Graphical Lasso for sparse precision matrices.
Kernel Two-Sample Tests
Maximum Mean Discrepancy (MMD): a kernel-based nonparametric test for whether two samples come from the same distribution, with unbiased estimation, permutation testing, and applications to GAN evaluation.
Neyman-Pearson and Hypothesis Testing Theory
The likelihood ratio test is the most powerful test for simple hypotheses (Neyman-Pearson lemma), UMP tests extend this to one-sided composites, and the power function characterizes a test's behavior across the parameter space.
Nonresponse and Missing Data
The taxonomy of missingness mechanisms (MCAR, MAR, MNAR), their consequences for inference, and the major correction methods: multiple imputation, inverse probability weighting, and the EM algorithm.
Order Statistics
Order statistics are the sorted values of a random sample. Their distributions govern quantile estimation, confidence intervals for medians, and the behavior of extremes.
Random Matrix Theory Overview
Why the spectra of random matrices matter for ML: Marchenko-Pastur law, Wigner semicircle, spiked models, and their applications to covariance estimation, PCA, and overparameterization.
Robust Statistics and M-Estimators
When data has outliers or model assumptions are wrong, classical estimators break. M-estimators generalize MLE to handle contamination gracefully.
Sample Size Determination
How to compute the number of observations needed to estimate means, proportions, and treatment effects with specified precision and power, including corrections for finite populations and complex designs.
Survey Sampling Methods
The major probability sampling designs used in survey statistics: simple random, stratified, cluster, systematic, multi-stage, and multi-phase sampling, with their variance properties and estimators.
Survival Analysis
Modeling time-to-event data with censoring: Kaplan-Meier curves, hazard functions, and the Cox proportional hazards model.
Copulas
Copulas separate the dependence structure of a multivariate distribution from its marginals. Sklar's theorem guarantees that any joint CDF can be decomposed into marginals and a copula, making dependence modeling modular.
Longitudinal Surveys and Panel Data
Analysis of data where the same units are measured repeatedly over time: fixed effects, random effects, difference-in-differences, and the problems of attrition and time-varying confounding.
Small Area Estimation
Methods for producing reliable estimates in domains where direct survey estimates have too few observations for useful precision, using Fay-Herriot and unit-level models that borrow strength across areas.
Modern Generalization Theory
16 topics / L3, L4, L5 / Tiers 1-3
Implicit bias, double descent, NTK, mean field — where classical theory fails.
Implicit Bias and Modern Generalization
Why classical generalization theory breaks for overparameterized models: the random labels experiment, the interpolation threshold, implicit bias of gradient descent, double descent, and the frontier of understanding why deep learning works.
Neural Tangent Kernel: Lazy Training, Kernel Equivalence, μP, and the Limits of Width
In the infinite-width NTK parameterization, training a neural network with gradient descent is mathematically equivalent to kernel regression. The same limit suppresses feature learning, which is why μP and mean-field parameterizations exist. NTK is the precise boundary between the kernel and feature-learning regimes.
PAC-Bayes Bounds
PAC-Bayes bounds control the generalization gap of a stochastic predictor by the KL divergence between a learner-chosen posterior and a data-independent prior. They have produced some of the few non-vacuous generalization bounds reported for trained neural networks (Dziugaite-Roy 2017 onward); how tight the bound gets depends heavily on the choice of prior.
Benign Overfitting
When interpolation (zero training error) does not hurt generalization: the min-norm interpolator fits noise in harmless directions while preserving signal. Bartlett et al. 2020, effective rank conditions, and why benign overfitting happens in overparameterized but not classical regimes.
Double Descent
Test error follows a double-descent curve: it decreases, peaks at the interpolation threshold, then decreases again in the overparameterized regime, defying classical bias-variance intuition.
Grokking
Models can memorize training data quickly, then generalize much later after continued training. This delayed generalization, called grokking, breaks the assumption that overfitting is a terminal state and connects to weight decay, implicit regularization, and phase transitions in learning.
Lazy vs Feature Learning
The fundamental dichotomy in neural network training: lazy regime (NTK, kernel-like, weights barely move) versus rich/feature learning regime (weights move substantially, representations emerge).
Mean Field Theory
The mean field limit of neural networks: as width goes to infinity under the right scaling, neurons become independent particles whose weight distribution evolves by Wasserstein gradient flow, capturing feature learning that the NTK regime misses.
Neural Network Optimization Landscape
Loss surface geometry of neural networks: saddle points dominate in high dimensions, mode connectivity, flat vs sharp minima, Sharpness-Aware Minimization, and the edge of stability phenomenon.
Optimal Transport and Earth Mover's Distance
The Monge and Kantorovich formulations of optimal transport, the linear programming dual, Sinkhorn regularization, and applications to WGANs, domain adaptation, and fairness.
Representation Learning Theory
What makes a good learned representation: the information bottleneck, contrastive learning, sufficient statistics, rate-distortion theory, and why representation learning is the central unsolved problem of deep learning.
Gaussian Processes for Machine Learning
A distribution over functions specified by a mean and kernel: closed-form posterior predictions with uncertainty, connection to kernel ridge regression, marginal likelihood for model selection, and the cubic cost bottleneck.
Information Bottleneck
The information bottleneck principle: compress the input X into a representation T that preserves information about the target Y. The Lagrangian formulation, connection to deep learning, Shwartz-Ziv and Tishby claims, and why the compression story may not hold for ReLU networks.
Open Problems in ML Theory
A curated list of genuinely open problems in machine learning theory: why overparameterized networks generalize, the right complexity measure for deep learning, feature learning beyond NTK, why scaling laws hold, emergent abilities, transformer-specific theory, and post-training theory.
Sparse Recovery and Compressed Sensing
Recover a sparse signal from far fewer measurements than its ambient dimension: the restricted isometry property, basis pursuit via L1 minimization, random measurement matrices, and applications from MRI to single-pixel cameras.
Wasserstein Distances
The Wasserstein (earth mover's) distance measures the minimum cost of transporting one probability distribution to another, with deep connections to optimal transport, GANs, and distributional robustness.
LLM Construction
71 topics / L2, L3, L4, L5 / Tiers 1-3
Transformer math, attention, KV cache, optimizers, scaling laws, RLHF.
Chain-of-Thought and Reasoning
Chain-of-thought as serial computation in token-space: the Merrill-Sabharwal expressiveness hierarchy, the Lanham-Turpin faithfulness gap, self-consistency bounds, hidden CoT, and the post-o1/R1 reframing where CoT is the artifact of RL on verifiable rewards.
Fine-Tuning and Adaptation
Methods for adapting pretrained models to new tasks: full fine-tuning, feature extraction, LoRA, QLoRA, adapter layers, and prompt tuning. When to use each, and how catastrophic forgetting constrains adaptation.
Hallucination Theory
Why language models confabulate, framed by the Kalai-Nachum-Vempala-Zhang (2025) statistical lower bound on pretraining error and the evaluation-incentive equilibrium that prevents post-training from removing it. Calibration, conformal prediction, structural failure modes (reversal curse, snowballing), and the measurement gap (FActScore, SAFE).
Linear Layer: Shapes, Bias, and Memory
A systems-first note on the linear layer: tensor shapes, the bias term, forward pass, backward pass, parameter memory, FLOPs, and finite-difference gradient tests.
Optimizer Theory: SGD, Adam, and Muon
Convergence theory of SGD (convex and strongly convex), momentum methods (Polyak and Nesterov), Adam as adaptive + momentum, why SGD can generalize better, the Muon optimizer, and learning rate schedules.
Reinforcement Learning from Human Feedback
The RLHF pipeline as math, not folklore: Bradley-Terry with sycophancy and intransitivity, KL-shielded PPO with overoptimization mitigation, the DPO implicit-reward identity and its likelihood-displacement failure, online-vs-offline preference learning, Nash-LHF, and the LIMA challenge to whether you need RL at all.
Scaling Laws
Power-law scaling of LLM loss in parameters, data, and compute: Kaplan, Chinchilla, the Muennighoff data-constrained law for repetition, the Schaeffer metric-induced-emergence proposition, MoE and muP extensions, and the test-time compute axis.
Sparse Autoencoders for Interpretability: TopK, JumpReLU, Matryoshka, and Scaling
Sparse autoencoders (SAEs) decompose neural network activations into interpretable feature directions by learning an overcomplete dictionary with a sparsity constraint. Coverage of the superposition hypothesis, the L1 SAE objective and its shrinkage failure mode, TopK SAEs (Gao et al. 2024 / OpenAI), JumpReLU SAEs (Rajamanoharan et al. 2024 / DeepMind), Matryoshka SAEs (Bushnaq / Bussmann 2024), Anthropic's million-feature Claude 3 Sonnet decomposition (Templeton et al. 2024), feature splitting, dead features, evaluation, and steering.
Attention Mechanism Theory
Mathematical formulation of attention: scaled dot-product attention as soft dictionary lookup, why scaling by the square root of key dimension prevents softmax saturation, multi-head attention, and the connection to kernel methods.
Attention Mechanisms History
The evolution of attention from Bahdanau (2014) additive alignment to Luong dot-product attention to self-attention in transformers. How attention solved the fixed-length bottleneck of seq2seq models.
Attention Sinks and Retrieval Decay
Why transformers disproportionately attend to initial tokens (attention sinks), how StreamingLLM exploits this for infinite-length inference, and how retrieval accuracy degrades with distance and position within the context window.
Attention Variants and Efficiency
Multi-head, multi-query, grouped-query, linear, and sparse attention: how each variant trades expressivity for efficiency, and when to use which.
Bits, Nats, Perplexity, and BPB
The four units people use to measure language model quality, how they relate to each other, when to use each one, and how mixing them up leads to wrong conclusions.
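The conversions in one place, applied to an assumed loss value and an assumed tokens-to-bytes ratio:

```python
import math

# Nats per token -> bits per token -> perplexity -> bits per byte.
# The loss and the token/byte counts below are illustrative assumptions.
loss_nats_per_token = 2.30          # average cross-entropy in nats
tokens = 1_000                      # tokens in some evaluation text
utf8_bytes = 4_200                  # bytes of the same text

bits_per_token = loss_nats_per_token / math.log(2)
perplexity = math.exp(loss_nats_per_token)
bits_per_byte = bits_per_token * tokens / utf8_bytes

print(f"{bits_per_token:.3f} bits/token, perplexity {perplexity:.1f}, "
      f"{bits_per_byte:.3f} bits/byte")
```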
Context Engineering
The discipline of building, routing, compressing, retrieving, and persisting context for LLMs: beyond prompt design into systems engineering for what the model sees.
Decoding Strategies
How language models select output tokens: greedy decoding, beam search, temperature scaling, top-k sampling, and nucleus (top-p) sampling. The tradeoffs between coherence, diversity, and quality.
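A single-step sketch of the four selection rules on one illustrative logit vector; a real decoder repeats this at every generated token:

```python
import numpy as np

# Greedy, temperature, top-k, and nucleus (top-p) selection from one logit vector.
rng = np.random.default_rng(0)
logits = np.array([2.0, 1.5, 0.2, -0.5, -1.0])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

greedy = int(np.argmax(logits))

temp = 0.7
p_temp = softmax(logits / temp)                  # sharper than temp = 1

k = 3                                            # top-k: keep k best, renormalize
top_k_idx = np.argsort(logits)[-k:]
p_topk = np.zeros_like(p_temp)
p_topk[top_k_idx] = softmax(logits[top_k_idx])

p_nuc = softmax(logits)                          # top-p: smallest set with mass >= p
order = np.argsort(p_nuc)[::-1]
keep = order[np.cumsum(p_nuc[order]) <= 0.9]
keep = order[: len(keep) + 1]                    # include the token that crosses 0.9
p_top_p = np.zeros_like(p_nuc)
p_top_p[keep] = p_nuc[keep] / p_nuc[keep].sum()

print(greedy, rng.choice(5, p=p_temp), rng.choice(5, p=p_topk), rng.choice(5, p=p_top_p))
```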
Document Intelligence
Beyond OCR: understanding document layout, tables, figures, and structure using models that combine text, spatial position, and visual features to extract structured information from PDFs, invoices, and contracts.
DPO vs GRPO vs RL for Reasoning
Head-to-head comparison of preference optimization methods: DPO's implicit reward, GRPO's group-relative comparisons, and RL with verifier feedback for reasoning. When each works and when each fails.
Edge and On-Device ML
Running models on phones, embedded devices, and edge servers: pruning, distillation, quantization, TinyML, and hardware-aware neural architecture search under memory, compute, and power constraints.
Efficient Transformers Survey
Sub-quadratic attention variants (linear attention, Linformer, Performer, Longformer, BigBird) and why FlashAttention, a hardware-aware exact method, made most of them unnecessary in practice.
Flash Attention
IO-aware exact attention: tile QKV matrices into SRAM-sized blocks so the full N-by-N attention matrix is never materialized in HBM. Peak activation memory drops from O(N²) to O(N); HBM read/write traffic drops by a large constant factor (not asymptotic linearity); FLOP count is unchanged.
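A NumPy sketch of the tiling idea only: online softmax over key/value blocks, never materializing the N-by-N score matrix. It ignores the IO scheduling, SRAM sizing, and backward pass that make the real kernel fast; block size and shapes are illustrative:

```python
import numpy as np

def attention_reference(Q, K, V):
    # Standard attention: materializes the full N x N score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def attention_tiled(Q, K, V, block=64):
    # Streaming attention: only an N x block slice of scores exists at a time.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(N, -np.inf)
    row_sum = np.zeros(N)
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                    # N x block
        new_max = np.maximum(row_max, S.max(axis=1))
        correction = np.exp(row_max - new_max)    # rescale old accumulators
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * correction + P.sum(axis=1)
        out = out * correction[:, None] + P @ Vb
        row_max = new_max
    return out / row_sum[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
assert np.allclose(attention_reference(Q, K, V), attention_tiled(Q, K, V))
```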
Forgetting Transformer (FoX)
FoX adds a data-dependent forget gate to softmax attention. The gate down-weights unnormalized attention scores between past and present positions, giving the transformer a learned, recency-biased decay. FoX is FlashAttention-compatible, works without positional embeddings, and improves long-context language modeling and length extrapolation.
Fused Kernels
Combine multiple GPU operations into a single kernel launch to eliminate intermediate HBM reads and writes. Why kernel fusion is the primary optimization technique for memory-bound ML operations.
GPU Compute Model
How GPUs execute ML workloads: streaming multiprocessors, warps, memory hierarchy (registers, SRAM, L2, HBM), arithmetic intensity, the roofline model, and why most ML operations are memory-bound.
Induction Heads
Induction heads are attention head circuits that implement a specific kind of pattern completion: given a sequence like [A][B]...[A], they predict [B]. Olsson et al. (2022) give strong causal (ablation) evidence that these heads exist in small attention-only models, and correlational co-occurrence evidence linking their formation to a sudden jump in in-context-learning ability in larger transformers. They are one mechanism among several that have been proposed for in-context learning, not the whole story.
Inference Systems Overview
The modern LLM inference stack: batching strategies, scheduling, memory management with paged attention, model parallelism for serving, and why FLOPs do not equal latency when memory bandwidth is the bottleneck.
Inference-Time Scaling Laws
How additional compute at inference time (repeated sampling, search, verification) improves output quality, why gains are task-dependent, and why verifier quality matters more than raw sample count.
Iterative Magnitude Pruning and the Lottery Ticket Hypothesis
Iterative magnitude pruning repeatedly trains, prunes, rewinds, and retrains a network to search for sparse subnetworks that still learn well. The point is not cheap training; the point is understanding trainable sparsity, rewind stability, and when a sparse mask still preserves optimization geometry.
Knowledge Distillation
Training a small student model to mimic a large teacher: soft targets, temperature scaling, dark knowledge, and why the teacher's mistakes carry useful information about class structure.
KV Cache Optimization
Advanced techniques for managing the KV cache memory bottleneck: paged attention for fragmentation-free allocation, prefix caching for shared prompts, token eviction for long sequences, and quantized KV cache for reduced footprint.
KV Cache
Why autoregressive generation recomputes attention at every step, how caching past key-value pairs makes it linear, and the memory bottleneck that drives MQA, GQA, and paged attention.
Latent Reasoning
Techniques that spend computation in hidden state rather than visible chain-of-thought tokens: pause tokens, continuous thought, recurrent depth, and what is empirical vs speculative.
Memory Systems for LLMs
Taxonomy of LLM memory: short-term (KV cache), working (scratchpad), long-term (retrieval), and parametric (weights). Why extending context alone is insufficient and how memory consolidation works.
Mixture of Experts
Sparse computation via learned routing: replace dense FFN layers with multiple expert networks, activate only a subset per token, and scale capacity without proportional compute.
Model Compression and Pruning
Reducing model size without proportional accuracy loss: unstructured and structured pruning, magnitude pruning, the lottery ticket hypothesis, entropy coding for compressed weights, and knowledge distillation as compression.
Multi-Token Prediction
Predicting k future tokens simultaneously using auxiliary prediction heads: forces planning, improves code generation, and connects to speculative decoding.
Multimodal RAG
RAG beyond text: retrieving images, tables, charts, and PDFs alongside text. Document parsing, multimodal chunking, vision-language retrievers, agentic RAG, and reasoning RAG with chain-of-thought retrieval.
PaddleOCR and Practical OCR
A practitioner's guide to modern OCR toolkits: PaddleOCR's three-stage pipeline, TrOCR's transformer approach, EasyOCR, and Tesseract. When to use which, and what accuracy to expect.
Parallel Processing Fundamentals
Data, tensor, pipeline, expert, and sequence parallelism: the five strategies for distributing model training and inference across multiple GPUs, and how frontier labs combine all of them.
Perplexity and Language Model Evaluation
Perplexity as exp(cross-entropy): the standard intrinsic metric for language models, its information-theoretic interpretation, connection to bits-per-byte, and why low perplexity alone does not guarantee useful generation.
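A worked toy example of the exp(cross-entropy) relationship, including the conversion from nats to bits; the per-token log-probabilities are made up.

```python
import math

token_logprobs = [-2.1, -0.3, -1.7, -0.9, -2.4]               # hypothetical log p(token | context) values
cross_entropy = -sum(token_logprobs) / len(token_logprobs)    # average negative log-likelihood, in nats
perplexity = math.exp(cross_entropy)
bits_per_token = cross_entropy / math.log(2)
print(f"cross-entropy {cross_entropy:.3f} nats, perplexity {perplexity:.2f}, {bits_per_token:.3f} bits/token")
```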
Post-Training Overview
How post-training turns a pretrained language model into a deployable assistant: SFT, preference optimization, safety tuning, verifiable rewards, evaluation gates, and the failure modes each stage introduces.
Prefix Caching
Share computed KV cache entries across requests that share the same prefix. Radix attention trees enable efficient lookup. Significant latency savings for prefix-heavy production workloads.
Prompt Engineering and In-Context Learning
In-context learning allows LLMs to adapt to new tasks from examples in the prompt without weight updates. Theories for why it works, prompting strategies, and why prompt engineering is configuring inference-time computation.
Reasoning Data Curation
How to build training data for reasoning models: math with verified solutions, code with test cases, rejection sampling, external verification, self-play for problem generation, and the connection to RLVR.
Residual Stream and Transformer Internals
The residual stream as the central computational highway in transformers: attention and FFN blocks read from and write to it. Pre-norm vs post-norm, FFN as key-value memory, and the logit lens for inspecting intermediate representations.
RLHF and Alignment
The RLHF pipeline for aligning language models with human preferences: reward modeling, PPO fine-tuning, KL penalties, DPO, and why none of it guarantees truthfulness.
Scaling Compute-Optimal Training
Chinchilla scaling: how to optimally allocate a fixed compute budget between model size and training data, why many models were undertrained, and the post-Chinchilla reality of data quality and inference cost.
Sparse Attention and Long Context
Standard attention is O(n²). Sparse patterns (Longformer, Sparse Transformer, Reformer), ring attention for distributed sequences, streaming with attention sinks, and why extending context is harder than it sounds.
Speculative Decoding and Quantization
Two core inference optimizations: speculative decoding for latency (draft-verify parallelism) and quantization for memory and throughput (reducing weight precision without destroying quality).
Structured Output and Constrained Generation
Forcing LLM output to conform to schemas via constrained decoding: logit masking for valid tokens, grammar-guided generation, and why constraining the search space can improve both reliability and quality.
Synthetic Data Distillation
Data-centric distillation: instead of matching teacher logits, train the student on a carefully synthesized corpus written by a stronger teacher. Phi and Orca style pipelines, explanation tuning, filtering, and why the corpus is the model.
Test-Time Compute and Search
Spending more inference compute through best-of-N, verifier-guided search, tree search, process rewards, and latent/internal computation, with gains limited by verifier quality and base pass rate.
Token Prediction and Language Modeling
Language modeling as probability assignment over sequences. Autoregressive and masked prediction objectives, perplexity evaluation, and the connection between prediction and compression.
Tool-Augmented Reasoning
LLMs that call external tools during reasoning: Toolformer for learning when to invoke APIs, ReAct for interleaving thought and action, function calling and MCP for structured invocation, and code-as-thought for replacing verbal arithmetic with executed programs.
Training Dynamics and Loss Landscapes
The geometry of neural network loss surfaces: why saddle points dominate over local minima in high dimensions, how flat minima relate to generalization, and why SGD finds solutions that generalize.
Transformer Architecture
The mathematical formulation of the transformer block: self-attention, multi-head attention, layer normalization, FFN blocks, positional encoding, and parameter counting.
Agent Protocols: MCP and A2A
The protocol layer for AI agents: MCP (Model Context Protocol) for tool access, A2A (Agent-to-Agent) for inter-agent communication, and why standardized interfaces matter for the agent ecosystem.
AMD Competition Landscape
AMD's MI300X and MI325X GPUs compete with NVIDIA on memory bandwidth and capacity but lag on software ecosystem. Competition matters because pricing, supply diversity, and vendor lock-in determine who can train and serve models.
ASML and Chip Manufacturing
ASML is the sole manufacturer of EUV lithography machines used to produce every advanced AI chip. Understanding the semiconductor supply chain reveals a critical concentration risk for AI compute.
Attention as Kernel Regression
Softmax attention viewed as Nadaraya-Watson kernel regression: the output at each position is a kernel-weighted average of values. Connects attention to classical nonparametric statistics and motivates linear attention via random feature approximations.
Byte-Level Language Models
Skip the tokenizer and feed raw bytes to the model. ByT5, MegaByte, and Byte Latent Transformer: why operating on bytes is attractive, why it is expensive, and how hierarchical patching closes the compute gap.
Distributed Training Theory
Training frontier models requires thousands of GPUs. Data parallelism, model parallelism, and communication-efficient methods make this possible.
Donut and OCR-Free Document Understanding
End-to-end document understanding without OCR: Donut reads document images directly and generates structured output, bypassing the error-prone OCR pipeline. Nougat extends this to academic paper parsing.
Megakernels
Fuse an entire LLM forward pass or decode step into a single GPU kernel launch to eliminate kernel launch overhead and cross-kernel HBM round-trips. Why persistent kernels and CUDA Graphs dominate low-latency inference.
Model Merging and Weight Averaging
Combining trained models by averaging or interpolating their weights: SWA, SLERP, TIES-Merging, DARE. Why it works (loss landscape mode connectivity), when it fails, and applications to combining specialized models.
Neural Architecture Search
Automating network architecture design: search spaces, search strategies (RL, evolutionary, differentiable), performance estimation via weight sharing, and the gap between NAS hype and practical gains.
NVIDIA GPU Architectures
A concrete reference for the GPU hardware that determines what ML models can be trained and served: H100, H200, B200, GB200, and Rubin architectures, with emphasis on memory bandwidth as the primary bottleneck.
Plan-then-Generate
Beyond strict left-to-right token generation: plan what to say before producing tokens. Outline-then-fill, hierarchical generation, and multi-token prediction. Planning reduces incoherence in long outputs and enables structural editing.
Positional Encoding
Why attention needs position information, sinusoidal encoding, learned positions, RoPE (rotary position encoding via 2D rotations), ALiBi, and why RoPE became the default for modern LLMs.
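A minimal sketch of RoPE's pairwise 2D rotations and the relative-position property they provide; the dimension count and base frequency are the usual illustrative defaults, not a reference implementation.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """x: (d,) with d even. Rotates dimension pairs (0,1), (2,3), ... by position-dependent angles."""
    d = x.shape[0]
    theta = base ** (-np.arange(0, d, 2) / d)           # one frequency per dimension pair
    angles = position * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                      # 2D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

q = np.arange(8, dtype=float)
# Dot products depend only on relative offset: positions (3, 5) and (13, 15) give the same score.
print(np.allclose(rope(q, 3) @ rope(q, 5), rope(q, 13) @ rope(q, 15)))   # True
```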
Quantization Theory
Reduce model weight precision from FP32 to FP16, INT8, or INT4. Post-training quantization, quantization-aware training, GPTQ, AWQ, and GGUF. Quantization is how large language models actually get deployed.
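As a baseline for intuition, a minimal sketch of the simplest scheme, symmetric per-tensor INT8 round-to-nearest with absmax scaling; GPTQ and AWQ are considerably more sophisticated than this.

```python
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0                        # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
error = np.abs(w - dequantize(q, scale)).max()
print(f"max abs rounding error: {error:.2e} (about scale/2 = {scale / 2:.2e})")
```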
Table Extraction and Structure Recognition
Detecting tables in documents, recognizing row and column structure, and extracting cell content. Why tables are hard: merged cells, borderless layouts, nested headers, and cascading pipeline errors.
Tokenization and Information Theory
Tokenization determines an LLM's vocabulary and shapes everything from compression efficiency to multilingual ability. Information theory explains what good tokenization looks like.
Methodology & Experimental Design
32 topics / L1, L2, L3, L4, L5 / Tiers 1-3
Evaluation, ablations, reproducibility, research habits, and data or benchmark failure modes.
The Bitter Lesson
Sutton's meta-principle: scalable general methods that exploit computation tend to beat hand-crafted domain-specific approaches in the long run. Search and learning win; brittle cleverness loses.
Causal Inference and the Ladder of Causation
Pearl's three-level hierarchy: association, intervention, counterfactual. Structural causal models, do-calculus, the adjustment formula, and why prediction is not causation.
Confusion Matrices and Classification Metrics
The complete framework for evaluating binary classifiers: confusion matrix entries, precision, recall, F1, specificity, ROC curves, AUC, and precision-recall curves. When to use each metric.
Confusion Matrix: MCC, Kappa, and Cost-Sensitive Evaluation
Beyond binary classification: multi-class confusion matrices, micro vs macro averaging, Matthews correlation coefficient, Cohen's kappa, cost-sensitive evaluation, and how to read a confusion matrix without getting the axes wrong.
The Era of Experience
Sutton and Silver's thesis: the next phase of AI moves beyond imitation from human data toward agents that learn predominantly from their own experience. Text is not enough for general intelligence.
Model Evaluation Best Practices
Train/validation/test splits, cross-validation, stratified and temporal splits, data leakage, reporting with confidence intervals, and statistical tests for model comparison. Why single-number comparisons are misleading.
Train-Test Split and Data Leakage
Why you need three data splits, how to construct them correctly, and the common ways information leaks from test data into training. Temporal splits, stratified splits, and leakage detection.
Types of Bias in Statistics
A taxonomy of the major biases that corrupt statistical inference: selection bias, measurement bias, survivorship bias, response bias, attrition bias, publication bias, and their consequences for ML.
Ablation Study Design
How to properly design ablation studies: remove one component at a time, measure the effect against proper baselines, and report results with statistical rigor.
Class Imbalance and Resampling
When class frequencies differ dramatically, standard accuracy is misleading. Resampling, cost-sensitive learning, and threshold tuning restore meaningful evaluation and training.
Convex Tinkering
Taleb's concept applied to ML research: designing small experiments with bounded downside and unbounded upside, and the precise conditions (convex payoff, nonzero variance, ex-post selection) under which this can dominate scale-first approaches.
Evaluation Metrics and Properties
The metrics that determine whether a model is good: accuracy, precision, recall, F1, AUC-ROC, log loss, RMSE, calibration, and proper scoring rules. Why choosing the right metric matters more than improving the wrong one.
Exploratory Data Analysis
The disciplined practice of looking at data before modeling: summary statistics, distributions, correlations, missing values, outliers, and class balance. You cannot model what you do not understand.
Feature Importance and Interpretability
Methods for attributing model predictions to input features: permutation importance, SHAP values, LIME, partial dependence, and why none of these imply causality.
Federated Learning
Train a global model without centralizing data. FedAvg, communication efficiency, non-IID convergence challenges, differential privacy integration, and applications in healthcare and mobile computing.
Hardware for ML Practitioners
Practical hardware guidance for ML work: GPU memory as the real bottleneck, when local GPUs make sense, cloud options compared, and why you should not spend $5000 before knowing what you need.
Hypothesis Testing for ML
Null and alternative hypotheses, p-values, confidence intervals, error types, multiple comparisons, and proper statistical tests for comparing ML models.
Meta-Analysis
Combining results from multiple studies: fixed-effect and random-effects models, heterogeneity measures, publication bias, and why this matters for ML benchmarking.
ML Project Lifecycle
The full ML project workflow from problem definition through deployment and monitoring. Why most projects fail at data quality, not model architecture. Cross-functional requirements and MLOps basics.
P-Hacking and Multiple Testing
How selective reporting and multiple comparisons inflate false positive rates, and how Bonferroni and Benjamini-Hochberg corrections control them. Why hyperparameter tuning is multiple testing and benchmark shopping is p-hacking.
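A small sketch of the Benjamini-Hochberg step-up procedure over a list of p-values; the example p-values are made up.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m           # BH critical values alpha * k / m
    below = np.nonzero(p[order] <= thresholds)[0]
    rejected = np.zeros(m, dtype=bool)
    if len(below) > 0:
        k = below.max()                                    # largest rank k with p_(k) <= alpha * k / m
        rejected[order[:k + 1]] = True                     # reject every hypothesis up to that rank
    return rejected

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.36]
print(benjamini_hochberg(pvals, alpha=0.05))               # only the two smallest survive
```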
Prasad-Rao MSE Correction
Why the naive Fay-Herriot MSE is too small once the area variance is estimated, and how the classical Prasad-Rao decomposition adds the missing second-order term.
Proper Scoring Rules
A scoring rule is proper exactly when the expected score is maximized by reporting the true belief. Log score and Brier score are strictly proper. Accuracy is not. Why this matters for calibrated probability estimates.
Reproducibility and Experimental Rigor
What it takes to make ML experiments truly reproducible: seeds, variance reporting, data hygiene, configuration management, and the discipline of multi-run evaluation.
ROC Curve and AUC
Theory of the receiver operating characteristic curve. AUC as a Wilcoxon-Mann-Whitney probability, cost-weighted operating-point selection, the ROC convex hull, the Neyman-Pearson connection, and partial AUC for restricted FPR regions.
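A direct computation of AUC as the Wilcoxon-Mann-Whitney probability, counting ties as one half; the synthetic scores and labels are illustrative.

```python
import numpy as np

def auc_mann_whitney(scores, labels):
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))     # P(score of random positive > random negative)

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
scores = labels * 1.0 + rng.normal(size=200)               # informative but noisy classifier scores
print(f"AUC = {auc_mann_whitney(scores, labels):.3f}")
```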
Statistical Significance and Multiple Comparisons
p-values, significance levels, confidence intervals, and the multiple comparisons problem. Bonferroni correction, Benjamini-Hochberg FDR control, and why these matter for model selection and benchmark evaluation.
Synthetic Data Generation
Using models to generate training data: LLM-generated instructions, diffusion-based image augmentation, code synthesis. When synthetic data helps (low-resource, privacy) and when it hurts (model collapse).
Adjusted Density Maximization
Why some small-area methods adjust the likelihood or posterior-like density to estimate shrinkage factors more stably when the variance component is near zero.
Benchmarking Methodology
What makes a good benchmark, how benchmarks fail (contamination, leaderboard gaming, single-number comparisons), and how to report results honestly with variance, seeds, and proper statistical practice.
Causal Inference Basics
Correlation is not causation. The potential outcomes framework, average treatment effects, confounders, and the methods that let you estimate causal effects from data.
Energy Efficiency and Green AI
The compute cost of training frontier models, carbon footprint, FLOPs vs wall-clock time vs dollars, and why reporting efficiency matters. Efficient alternatives: distillation, pruning, quantization, and scaling laws for optimal compute allocation.
Experiment Tracking and Tooling
MLflow, Weights and Biases, TensorBoard, Hydra, and the discipline of recording everything. What to log, how to compare experiments, and why you cannot reproduce what you did not record.
Official Statistics and National Surveys
How government statistical agencies produce population, economic, and social data through censuses and surveys, with quality frameworks and implications for ML practitioners using these datasets.
Training Techniques & Regularization
12 topics / L2, L3 / Tiers 1-3
Adam, dropout, batch norm, data augmentation, learning rate schedules.
Adam Optimizer
Adaptive moment estimation: first and second moment tracking, bias correction, AdamW vs Adam+L2, learning rate warmup, and when Adam fails compared to SGD.
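A single Adam step written out, with bias correction and without weight decay; the hyperparameters are the usual defaults and the quadratic test problem is purely illustrative.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad                 # first moment (EMA of gradients)
    v = beta2 * v + (1 - beta2) * grad**2              # second moment (EMA of squared gradients)
    m_hat = m / (1 - beta1**t)                         # bias correction for zero initialization
    v_hat = v / (1 - beta2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

param = np.array([1.0, -2.0])
m = v = np.zeros_like(param)
for t in range(1, 101):                                # minimize f(x) = ||x||^2, gradient 2x
    param, m, v = adam_step(param, 2 * param, m, v, t, lr=0.05)
print(param)                                           # close to [0, 0]
```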
Batch Normalization
Normalize activations within mini-batches to stabilize and accelerate training, then scale and shift with learnable parameters.
Dropout
Dropout as stochastic regularization: Bernoulli masking, inverted dropout scaling, implicit ensemble interpretation, Bayesian connection via MC dropout, and equivalence to L2 regularization in linear models.
Learning Rate Scheduling
Why learning rate moves the loss more than any other hyperparameter, and how warmup, cosine decay, step decay, cyclic schedules, and 1cycle policy control training dynamics.
Regularization in Practice
Practical guide to regularization: L2 (weight decay), L1 (sparsity), dropout, early stopping, data augmentation, and batch normalization. How each constrains model complexity and how to choose between them.
Weight Initialization
Why initialization determines whether gradients vanish or explode, and how Xavier and He initialization preserve variance across layers.
Activation Checkpointing
Trade compute for memory by recomputing activations during the backward pass instead of storing them all. Reduces activation memory from linear to square-root in the number of layers.
Batch Size and Learning Dynamics
How batch size affects what SGD finds: gradient noise, implicit regularization, the linear scaling rule, sharp vs flat minima, and the gradient noise scale as the key quantity governing the tradeoff.
Data Augmentation Theory
Why data augmentation works as a regularizer: invariance injection, effective sample size, MixUp, CutMix, and the connection to Vicinal Risk Minimization.
Label Smoothing and Regularization
Replace hard targets with soft targets to prevent overconfidence: the label smoothing formula, its connection to maximum entropy regularization, calibration effects, and when it hurts.
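The label smoothing formula as a couple of lines of NumPy; epsilon = 0.1 is just the conventional example value.

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    k = one_hot.shape[-1]
    return (1 - epsilon) * one_hot + epsilon / k       # q_i = (1 - eps) * y_i + eps / K

y = np.array([0.0, 0.0, 1.0, 0.0])
print(smooth_labels(y, epsilon=0.1))                    # [0.025, 0.025, 0.925, 0.025]
```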
Mixed Precision Training
Train with FP16 or BF16 for speed while keeping FP32 master weights for accuracy. Loss scaling, overflow prevention, and when mixed precision fails.
Curriculum Learning
Train on easy examples first, gradually increase difficulty. Curriculum learning can speed convergence and improve generalization, but defining difficulty is the hard part. Self-paced learning, anti-curriculum, and the connection to importance sampling.
AI Safety & Alignment
17 topics / L3, L4, L5 / Tiers 1-2
RLHF failure modes, hallucination theory, interpretability, reward hacking.
Mechanistic Interpretability: Features, Circuits, and Causal Faithfulness
Reverse-engineering trained neural networks. Coverage of the superposition hypothesis, sparse autoencoders for feature extraction, the linear representation hypothesis and its counterexamples, induction heads and IOI as canonical circuits, sparse feature circuits (Marks et al. 2024), cross-layer transcoders (Lindsey et al. 2024), activation patching (noising vs denoising), faithfulness checks, frontier-scale evidence from Anthropic's Scaling Monosemanticity (Templeton et al. 2024) and DeepMind's Gemma Scope (Lieberum et al. 2024), and the limits of current interpretability.
Adversarial Machine Learning
Attacks across the ML lifecycle: evasion (adversarial examples), poisoning (corrupt training data), model extraction (steal via queries), privacy leakage (membership inference), and LLM jailbreaks. Why defenses are hard.
Calibration and Uncertainty Quantification
When a model says 70% confidence, is it right 70% of the time? Calibration measures the alignment between predicted probabilities and actual outcomes. Temperature scaling, Platt scaling, conformal prediction, and MC dropout provide practical tools for trustworthy uncertainty.
Catastrophic Forgetting
Fine-tuning a neural network on new data destroys knowledge of old data. The stability-plasticity dilemma, mitigation strategies (EWC, LwF, SI, MAS, replay, progressive networks), and implications for continual learning and LLM fine-tuning.
Constitutional AI
Anthropic's approach to alignment: replace human harmlessness labels with a constitution of principles that the model uses to self-evaluate, enabling scalable AI feedback.
Continual Learning and Forgetting
Learning sequentially without destroying previous knowledge: Elastic Weight Consolidation, progressive networks, replay methods, and the stability-plasticity tradeoff in deployed systems.
Data Contamination and Evaluation
When training data overlaps test benchmarks, model scores become meaningless. Types of contamination, detection methods, dynamic benchmarks, and how to read evaluation claims skeptically.
Differential Privacy
Formal privacy guarantees for algorithms: epsilon-delta DP, Laplace and Gaussian mechanisms, composition theorems, DP-SGD for training neural networks, and the privacy-utility tradeoff.
Ethics and Fairness in ML
Fairness definitions (demographic parity, equalized odds, calibration), the impossibility theorem showing they cannot all hold simultaneously, bias sources, and mitigation strategies at each stage of the pipeline.
LLM Application Security
The OWASP LLM Top 10 (2025): prompt injection, sensitive information disclosure, supply chain, data and model poisoning, improper output handling, excessive agency, system prompt leakage, vector and embedding weaknesses, misinformation, and unbounded consumption. Standard application security for the GenAI era.
Model Collapse and Data Quality
When models train on their own outputs, the learned distribution narrows, tails disappear, and quality degrades across generations. Why synthetic data feedback loops threaten pretraining data quality and how to mitigate them.
Out-of-Distribution Detection
Methods for detecting when test inputs differ from training data, where naive softmax confidence fails and principled alternatives based on energy, Mahalanobis distance, and typicality succeed.
Red-Teaming and Adversarial Evaluation
Systematically trying to make models produce harmful or incorrect outputs: manual and automated red-teaming, jailbreaks, prompt injection, adversarial suffixes, and why adversarial evaluation is necessary before deployment.
Reward Hacking
Goodhart's law for AI: when models exploit reward model weaknesses instead of being genuinely helpful, including verbosity hacking, sycophancy, and structured mitigation strategies.
Reward Models and Verifiers
How preference reward models, outcome verifiers, process reward models, executable checks, and ensembles provide different training signals, and where Goodhart pressure enters.
Truth Directions and Linear Probes
The geometry-of-truth program asks whether transformer residual streams linearly separate true from false factual statements. Linear probes can reveal such structure; activation interventions test whether the direction is merely correlational or actually behavior-changing.
Verifier Design and Process Reward
Detailed treatment of verifier types, process vs outcome reward models, verifier-guided search, self-verification, and the connection to test-time compute scaling. How to design reward signals for reasoning models.
Reinforcement Learning Theory
26 topics / L2, L3, L4, L5 / Tiers 1-3
MDPs, Bellman, policy gradients, multi-agent, game theory.
Bellman Equations
The recursive backbone of RL. State-value and action-value Bellman equations, the contraction mapping property, convergence of value iteration, and why recursive decomposition is the central idea in sequential decision-making.
Markov Decision Processes
The mathematical framework for sequential decision-making under uncertainty: states, actions, transitions, rewards, and the Bellman equations that make solving them possible.
Policy Gradient Theorem
The fundamental result enabling gradient-based optimization of parameterized policies: the policy gradient theorem and the algorithms it spawns.
Q-Learning
Model-free, off-policy value learning: the Q-learning update rule, convergence under Robbins-Monro conditions, and the deep Q-network revolution that introduced function approximation, experience replay, and the deadly triad.
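A tabular Q-learning sketch on a hypothetical five-state chain with epsilon-greedy exploration; the environment and hyperparameters are made up for illustration and have nothing to do with DQN-scale training.

```python
import numpy as np

n_states, n_actions = 5, 2                 # chain 0..4; action 0 = left, 1 = right; reward 1 on reaching state 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.95, 0.3
rng = np.random.default_rng(0)

for episode in range(300):
    s = 0
    while s != n_states - 1:
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # off-policy update: bootstrap from the greedy value of the next state
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q[:-1].argmax(axis=1))               # greedy action in each non-terminal state: all 1 (go right)
```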
Reward Design and Reward Misspecification
The hardest problem in RL: specifying what you want. Reward shaping, potential-based shaping theorem, specification gaming, Goodhart's law in RL, and the bridge from classic RL to alignment.
Value Iteration and Policy Iteration
The two foundational algorithms for solving MDPs exactly: value iteration applies the Bellman optimality operator until convergence, while policy iteration alternates between exact evaluation and greedy improvement.
Actor-Critic Methods
The dominant paradigm for deep RL and LLM training: an actor (policy network) guided by a critic (value network), with advantage estimation, PPO clipping, and entropy regularization.
Agentic RL and Tool Use
LLMs as multi-step policies: observations, tool calls, environment feedback, sparse rewards, credit assignment, and why agent training differs from single-turn RLHF.
DDPG: Deep Deterministic Policy Gradient
An off-policy actor-critic algorithm for continuous action spaces that combines a deterministic policy gradient with a DQN-style critic, using replay buffers and Polyak-averaged target networks for stability.
Exploration vs Exploitation
The fundamental tradeoff in sequential decision-making: exploit known good actions to collect reward now, or explore uncertain actions to discover potentially better strategies. Epsilon-greedy, Boltzmann exploration, UCB, count-based methods, and intrinsic motivation.
Markov Games and Self-Play
Multi-agent extensions of MDPs where multiple agents with separate rewards interact. Nash equilibria, minimax values in zero-sum games, and self-play as a training method.
Minimax and Saddle Points
Minimax theorems characterize when max-min equals min-max. Saddle points arise in zero-sum games, duality theory, GANs, and robust optimization.
Multi-Agent Collaboration
Multiple LLM agents working together on complex tasks: debate for improving reasoning, division of labor across specialist agents, structured communication protocols, and when multi-agent outperforms single-agent systems.
Multi-Armed Bandits Theory
The exploration-exploitation tradeoff formalized: K arms, regret as the cost of not knowing the best arm, and algorithms (UCB, Thompson sampling) that achieve near-optimal regret bounds.
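A minimal UCB1 loop on a made-up three-arm Bernoulli bandit; the exploration bonus sqrt(2 ln t / n) is the classical choice.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.7])                 # unknown to the algorithm
counts = np.zeros(3)
sums = np.zeros(3)

for arm in range(3):                                    # pull each arm once to initialize
    sums[arm] += rng.random() < true_means[arm]
    counts[arm] += 1

for t in range(4, 2001):
    ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)   # empirical mean + exploration bonus
    arm = int(ucb.argmax())
    sums[arm] += rng.random() < true_means[arm]              # Bernoulli reward
    counts[arm] += 1

print(counts)   # the 0.7 arm should dominate the pull counts
```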
No-Regret Learning
Online learning against adversarial losses: regret as cumulative loss minus the best fixed action in hindsight, multiplicative weights, follow the regularized leader, and why no-regret dynamics converge to Nash equilibria in zero-sum games.
Offline Reinforcement Learning
Learning policies from a fixed dataset without environment interaction: distributional shift as the core challenge, conservative Q-learning (CQL) as the standard fix, and Decision Transformer as an alternative sequence modeling approach.
Online Learning and Bandits
Sequential decision making with adversarial or stochastic feedback: the bandit setting, explore-exploit tradeoff, UCB, Thompson sampling, and regret bounds. Connections to RL and A/B testing.
Policy Optimization: PPO and TRPO
Trust region and clipped surrogate methods for stable policy optimization: TRPO enforces a KL constraint for monotonic improvement, PPO replaces it with a simpler clipped objective, and PPO became the default optimizer for LLM post-training.
Policy Representations
How to parameterize policies in reinforcement learning: categorical for discrete actions, Gaussian for continuous actions, and why the choice affects gradient variance and exploration.
Self-Play and Multi-Agent RL
Self-play as a training paradigm for competitive games, fictitious play convergence, AlphaGo/AlphaZero, and the challenges of multi-agent reinforcement learning: non-stationarity, partial observability, and centralized training.
Temporal Difference Learning
Temporal difference methods bootstrap value estimates from other value estimates, enabling online, incremental learning without waiting for episode termination. TD(0), SARSA, and TD(lambda) with eligibility traces.
TD3: Twin Delayed Deep Deterministic Policy Gradient
An off-policy actor-critic algorithm that fixes DDPG's overestimation bias with clipped double-Q learning, target policy smoothing, and delayed policy updates. The minimum-complexity robust continuous-control algorithm.
Mean-Field Games
The many-agent limit of strategic interactions: as the number of agents goes to infinity, each agent solves an MDP against the population distribution, and equilibrium becomes a fixed-point condition on the mean field.
Options and Temporal Abstraction
The options framework for hierarchical RL: temporally extended actions with initiation sets, internal policies, and termination conditions. Semi-MDPs and learning options end-to-end.
Reinforcement Learning Environments and Benchmarks
The standard RL evaluation stack: Gymnasium API, classic control tasks, Atari, MuJoCo, ProcGen, the sim-to-real gap, and why benchmark performance is a poor predictor of real-world RL capability.
Robust Adversarial Policies
Robust MDPs optimize against worst-case transition dynamics within an uncertainty set. Adversarial policies formalize distribution shift in RL as a game between agent and environment.
Beyond LLMs
20 topics / L4, L5 / Tiers 1-3
JEPA, world models, vision-first AI, diffusion, audio, 3D, state-space, and continuous-time models.
CLIP, OpenCLIP, and SigLIP: Contrastive Language-Image Pretraining
Radford et al. 2021 (CLIP) trained two encoders, one for images and one for text, with a symmetric InfoNCE objective on 400M web pairs. The resulting shared embedding space powers zero-shot classification and retrieval and serves as the visual backbone of every modern vision-language model. This page covers the contrastive objective as a mutual-information bound, the OpenCLIP scaling laws (Cherti et al. 2023), the SigLIP pairwise-sigmoid alternative (Zhai et al. 2023), the modality gap (Liang et al. 2022), and the practical pipeline from training corpus to LLaVA-style VLM backbone.
Diffusion Models
Generative models that learn to reverse a noise-adding process: forward and reverse SDEs, score matching, the DDPM ELBO, deterministic ODE sampling, classifier-free guidance, latent diffusion, and the bridge to flow matching.
Vision Transformer Lineage: ViT, DeiT, Swin, MAE, DINOv2, SAM
Dosovitskiy et al. 2021 (ViT) showed that a plain transformer applied to image patches matches CNN backbones once given enough pretraining data. The follow-on lineage (DeiT, Swin, MAE, DINOv2, SAM, register tokens, NaViT, ViT-22B, SigLIP-2) traded design knobs along three axes: how much locality bias to keep, what supervision to train against, and whether tokens are spatial-uniform or content-adaptive. This page covers the patch-embedding complexity, Swin's shifted-window argument, the MAE 75% masking ratio, DINO self-distillation with centering, and the 2024 register-token finding that closes the long-standing CLIP/ViT artifact-token gap.
Equilibrium and Implicit-Layer Models
Deep Equilibrium Models (DEQ) replace explicit depth with a fixed-point equation: instead of stacking L layers, solve for the equilibrium state where one more layer would not change the output. This enables infinite-depth networks with constant memory, using implicit differentiation for backprop.
Equivariant Deep Learning
Networks that respect symmetry: if the input transforms under a group action, the output transforms predictably. Equivariance generalizes translation equivariance in CNNs to rotations, permutations, and gauge symmetries, reducing sample complexity and improving generalization on structured data.
Florence and Vision Foundation Models
Vision foundation models that unify classification, detection, segmentation, captioning, and VQA under a single pretrained backbone. Florence, Florence-2, and the path toward GPT-4V-style multimodal understanding.
Flow Matching
Learn a velocity field that transports noise to data along straight-line paths. Simpler training than diffusion, faster sampling, and cleaner math.
JEPA and Joint Embedding
LeCun's Joint Embedding Predictive Architecture: learning by predicting abstract representations rather than pixels, energy-based formulation, I-JEPA, V-JEPA, V-JEPA 2, and the connection to contrastive learning and world models.
Kolmogorov-Arnold Networks (KANs)
An alternative to MLPs where learnable univariate functions (typically B-splines) sit on edges and pure summation sits on nodes. Motivated by the Kolmogorov-Arnold representation theorem, competitive on small smooth scientific tasks, unproven at frontier scale.
Mamba and State-Space Models
Linear-time sequence modeling via structured state spaces: S4, HiPPO initialization, selective state-space models (Mamba), and the architectural fork from transformers.
Physics-Informed Neural Networks
Embedding PDE constraints directly into the neural network loss function via automatic differentiation. When physics-informed learning works, when it fails, and what alternatives exist.
Self-Supervised Vision
Learning visual representations without labels: contrastive methods (SimCLR, MoCo), self-distillation (DINO/DINOv2), and masked image modeling (MAE). Why self-supervised vision matters for transfer learning and label-scarce domains.
Test-Time Training and Adaptive Inference
Updating model parameters at inference time using self-supervised objectives on the test input itself. TTT layers replace fixed linear recurrences (as in Mamba) with learned update rules, blurring the boundary between training and inference.
Video World Models
Turning pretrained video diffusion models into interactive world simulators: condition on actions to generate future frames, enabling RL agent training, robot planning, and game AI without physical environments.
World Models and Planning
Learning a model of the environment and planning inside it: Dreamer's latent dynamics, MuZero's learned model with MCTS, video world models, and why planning in imagination is the path to sample-efficient and safe AI.
Audio Language Models
Models that process and generate speech alongside text: audio tokenization, Whisper for transcription, end-to-end voice models, music generation, and the audio-language frontier.
Continuous Thought Machines
Neural networks that process information through continuous-time internal dynamics rather than discrete layer-by-layer computation. Inspired by neural ODEs and dynamical systems, these architectures let the network 'think' for a variable amount of internal time before producing an output.
3D Gaussian Splatting
Represent a 3D scene as millions of 3D Gaussians, each with position, covariance, opacity, and color. Render by projecting to 2D and alpha-compositing. Real-time, high-quality novel view synthesis without neural networks at render time.
Occupancy Networks and Neural Fields
Representing 3D geometry and appearance as continuous functions parameterized by neural networks: NeRF, occupancy networks, DeepSDF, volume rendering, and the connection to Gaussian splatting.
World Model Evaluation
How to measure whether a learned world model is useful: prediction accuracy, controllability (sim-to-real transfer), planning quality, and why long-horizon evaluation is hard.
ML Methods
74 topics / L1, L2, L3 / Tiers 1-3
Core supervised / unsupervised algorithms: CNNs, autoencoders, boosting, clustering, contrastive learning.
Activation Functions
Nonlinear activation functions in neural networks: sigmoid, tanh, ReLU, Leaky ReLU, GELU, and SiLU. Their gradients, saturation behavior, and impact on trainability.
AIC and BIC
Akaike and Bayesian information criteria for model selection: how they trade off fit versus complexity, when to use each, and their connection to cross-validation.
Bagging
Bootstrap Aggregation: train B models on bootstrap samples and average their predictions. Why averaging reduces variance, how correlation limits the gain, and out-of-bag error estimation.
Cross-Entropy Loss: MLE, KL Divergence, and Classification
Why cross-entropy is the correct loss for classification: its derivation as negative log-likelihood, connection to KL divergence, why MSE fails for classification, and practical variants including label smoothing and focal loss.
Data Preprocessing and Feature Engineering
Standardization, scaling, encoding, imputation, and feature selection. Why most algorithms assume centered, scaled inputs and what breaks when you skip preprocessing.
Feedforward Networks and Backpropagation
Feedforward neural networks as compositions of affine transforms and nonlinearities, the universal approximation theorem, and backpropagation as reverse-mode automatic differentiation on the computational graph.
Gauss-Markov Theorem
Among all linear unbiased estimators, ordinary least squares has the smallest variance: this is the BLUE result, and understanding when its assumptions fail is just as important as the theorem itself.
Gradient Boosting
Gradient boosting as functional gradient descent: fit weak learners to pseudo-residuals sequentially, reducing bias at each round. Covers AdaBoost, shrinkage, XGBoost second-order methods, and LightGBM leaf-wise growth.
K-Means Clustering
Lloyd's algorithm for partitional clustering: the within-cluster sum of squares objective, convergence guarantees, k-means++ initialization, choosing k, and the connection to EM for Gaussians.
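A bare-bones Lloyd's algorithm sketch on synthetic blobs, with naive random initialization (k-means++ would do better); all sizes and cluster locations are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]          # naive initialization
    for _ in range(n_iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)    # squared distances, shape (n, k)
        labels = d.argmin(axis=1)                                    # assignment step
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])                  # update step
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in [(0, 0), (3, 3), (0, 4)]])
centers, labels = kmeans(X, k=3)
print(np.round(centers, 2))            # roughly recovers the three blob centers
```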
The Kernel Trick
Any algorithm whose computation accesses data only through inner products can be lifted to a feature space implicitly defined by a kernel, without ever computing the features. Worked polynomial-kernel inner-product equivalence, infinite-feature interpretation of the Gaussian RBF, Mercer's condition, and the four canonical applications: kernel SVM, kernel ridge regression, kernel PCA, kernel k-means.
Lasso Regression
OLS with L1 regularization: sparsity, the geometry of why L1 selects variables, proximal gradient descent, LARS, and elastic net.
Linear Regression
Ordinary least squares as projection, the normal equations, the hat matrix, Gauss-Markov optimality, and the connection to maximum likelihood under Gaussian noise.
Logistic Regression
The foundational linear classifier: sigmoid link function, maximum likelihood estimation, cross-entropy loss, gradient derivation, and regularized variants.
Loss Functions Catalog
A systematic catalog of loss functions for classification, regression, and representation learning: cross-entropy, MSE, hinge, Huber, focal, KL divergence, and contrastive loss.
Overfitting and Underfitting
The two failure modes of supervised learning: models that memorize noise versus models too simple to capture signal. Diagnosis via training-validation gaps.
Principal Component Analysis
Dimensionality reduction via variance maximization: PCA as eigendecomposition of the covariance matrix, PCA as truncated SVD of the centered data matrix, reconstruction error, and when sample PCA works.
Random Forests
Random forests combine bagging with random feature subsampling to decorrelate trees, reducing ensemble variance beyond what pure bagging achieves. Out-of-bag estimation, variable importance, consistency theory, and practical strengths and weaknesses.
Ridge Regression
L2-regularized least squares: closed-form shrinkage, SVD geometry, MSE dominance over OLS, the Bayesian Gaussian-prior connection, effective degrees of freedom, kernel ridge, and the modern overparameterized regime where ridgeless OLS can still generalize.
Score Matching
Hyvärinen 2005: train a model to estimate the score (gradient of log density) without computing the normalization constant. Integration by parts converts the intractable density-matching loss into a tractable gradient-based objective. Sliced score matching makes the Jacobian-trace term scale, denoising score matching reparameterizes the loss as ε-regression, and Tweedie's formula identifies the score with a posterior-mean denoiser. Together these are the training half of every modern diffusion model and energy-based model.
Skip Connections and ResNets
Residual connections let gradients flow through identity paths, enabling training of very deep networks. ResNets learn residual functions F(x) = H(x) - x, which is easier than learning H(x) directly.
Support Vector Machines
Maximum-margin classifiers via convex optimization: hard margin, soft margin with slack variables, hinge loss, the dual formulation, and the kernel trick.
Universal Approximation Theorem
A single hidden layer neural network can approximate any continuous function on a compact set to arbitrary accuracy. Why this is both important and misleading: it says nothing about width, weight-finding, or generalization.
Variational Autoencoders
Deriving the ELBO, the reparameterization trick for backpropagation through sampling, and how VAEs turn autoencoders into principled generative models via amortized variational inference.
AdaBoost
AdaBoost as iterative reweighting of misclassified samples, exponential loss minimization, weak-to-strong learner amplification, margin theory, and the connection to coordinate descent.
Anomaly Detection
Methods for identifying data points that deviate from expected patterns: isolation forests, one-class SVMs, autoencoders, statistical distances, and why the absence of anomaly labels makes this problem structurally harder than classification.
Autoencoders
Encoder-decoder architectures for unsupervised representation learning: undercomplete bottlenecks, sparse and denoising variants, and the connection between linear autoencoders and PCA.
Contrastive Learning
Learning representations by pulling positive pairs together and pushing negative pairs apart, with theoretical grounding in mutual information maximization.
Convolutional Neural Networks
How weight sharing and local connectivity exploit spatial structure: convolution as cross-correlation, translation equivariance, pooling for approximate invariance, and the conv-pool-fc architecture.
Decision Trees and Ensembles
Greedy recursive partitioning with splitting criteria, pruning, and why combining weak learners via bagging (random forests) and boosting (gradient boosting) yields strong predictors.
Deep Learning for Time Series
RNN/LSTM, Temporal Convolutional Networks, and Transformers (PatchTST, Informer, Autoformer) for time series forecasting, with N-BEATS basis decomposition and the Zeng et al. linear-baseline controversy.
DeepONet
DeepONet (Lu, Karniadakis et al., 2021) approximates nonlinear operators between function spaces by splitting a network into a branch (encoding the input function at fixed sensors) and a trunk (encoding query coordinates), then taking an inner product. The architecture is the practical realization of Chen and Chen's 1995 universal approximation theorem for operators.
Dimensionality Reduction Theory
Why and how to reduce dimensions: the curse of dimensionality, PCA, random projections (JL lemma), t-SNE, UMAP, and when each method preserves the structure you care about.
Elastic Net
Combining L1 and L2 penalties: elastic net gets sparsity from lasso and grouping stability from ridge, solving the failure mode where lasso arbitrarily selects among correlated features.
Ensemble Methods Theory
Why combining multiple models outperforms any single model: bias-variance decomposition for ensembles, diversity conditions, and the theoretical foundations of bagging, boosting, and stacking.
EM Algorithm Variants
Variants of EM for when the standard algorithm is intractable: Monte Carlo EM, Variational EM, Stochastic EM, and ECM. Connection to VAEs as amortized variational EM.
Fourier Neural Operator
The Fourier Neural Operator (Li, Kovachki, Anandkumar et al., ICLR 2021) parameterizes the kernel of an integral operator directly in Fourier space, giving a resolution-invariant architecture for learning maps between function spaces. Canonical baseline for data-driven PDE solvers and the architectural backbone of FourCastNet weather prediction.
Gaussian Mixture Models and EM
GMMs as soft clustering with per-component Gaussians: EM derivation (E-step responsibilities, M-step parameter updates), convergence guarantees, model selection with BIC/AIC, and the connection to k-means as the hard-assignment limit.
Gaussian Process Regression
Inference with Gaussian processes: the prior-to-posterior update in closed form, the role of kernel choice, marginal likelihood for hyperparameter selection, sparse approximations for scalability, and the connection to Bayesian optimization.
Generalized Additive Models
GAMs model the response as a sum of smooth univariate functions, one per predictor. Interpretable nonlinear regression with backfitting, P-splines, and partial effect plots.
Generative Adversarial Networks
The minimax game between generator and discriminator: Nash equilibrium at the data distribution, mode collapse, the Wasserstein distance fix, StyleGAN, and why diffusion models have largely replaced GANs for image generation.
Graph Neural Networks
Message passing on graphs: GCN, GAT, GraphSAGE, the WL isomorphism test as an expressivity ceiling, over-smoothing in deep GNNs, and applications to molecules, social networks, and knowledge graphs.
Hyperbolic Embeddings for Graphs
Negatively curved space embeds tree-like graphs with exponentially less distortion than Euclidean space at the same dimension. The Poincaré ball model, Möbius gyrovector arithmetic, and Sala et al.'s sharp distortion bound make this concrete; Ganea et al. lift the construction to neural network layers.
K-Nearest Neighbors
Classify by majority vote of the k closest training points: no training phase, universal consistency as n and k grow, and the curse of dimensionality that makes distance meaningless in high dimensions.
Meta-Learning
Learning to learn: find model initializations or embedding spaces that enable fast adaptation to new tasks from few examples. MAML, prototypical networks, and the connection to few-shot learning and in-context learning in LLMs.
Multi-Class and Multi-Label Classification
Multi-class (exactly one label, softmax + cross-entropy) vs multi-label (multiple labels, sigmoid + binary cross-entropy). One-vs-rest, one-vs-one, hierarchical classification, and evaluation metrics.
Naive Bayes
The simplest generative classifier: assume conditional independence of features given the class, estimate class-conditional densities, and classify via MAP. Why it works despite the wrong independence assumption.
Natural Language Processing Foundations
The progression from bag-of-words to transformers: tokenization, language modeling, TF-IDF, sequence-to-sequence, attention, and why the pre-train then fine-tune paradigm replaced task-specific architectures.
Object Detection and Segmentation
Localizing and classifying objects in images: two-stage (R-CNN), one-stage (YOLO, SSD), anchor-free (CenterNet, FCOS) detectors, semantic and instance segmentation, and the IoU/mAP evaluation framework.
Optimal Brain Surgeon and Pruning Theory
Principled weight pruning via second-order information: Optimal Brain Damage uses the Hessian diagonal, Optimal Brain Surgeon uses the full inverse Hessian, and both reveal why magnitude pruning is a crude but popular approximation.
PageRank Algorithm
PageRank as the stationary distribution of a random walk on a graph: damping factor, power iteration, eigenvector interpretation, and applications beyond web search.
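Power iteration on a hypothetical four-page link graph with the standard damping factor 0.85; the adjacency structure is invented for illustration.

```python
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}             # hypothetical graph: page -> pages it links to
n = 4
M = np.zeros((n, n))                                      # column-stochastic link matrix
for j, outs in links.items():
    for i in outs:
        M[i, j] = 1.0 / len(outs)

d = 0.85
r = np.full(n, 1.0 / n)
for _ in range(100):
    r = (1 - d) / n + d * M @ r                           # random surfer: teleport or follow a link
print(np.round(r, 3))                                     # stationary distribution; page 2 ranks highest
```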
Perceptron
Rosenblatt's perceptron (1958): the simplest neural network, the first learning algorithm with a convergence proof, and the lesson that linear separability is both powerful and limiting.
Probability Flow ODE
Song et al. 2021: every diffusion SDE has a deterministic ODE that produces the same time-marginals. The deterministic dual of Anderson's reverse SDE; the basis of DDIM, DPM-Solver, EDM samplers, exact-likelihood computation, and the conceptual bridge to flow matching.
Recommender Systems
User-item interaction modeling via matrix factorization, collaborative filtering, and content-based methods: the math of SVD-based recommendations, cold start, implicit feedback, and why evaluation is harder than the model.
Recurrent Neural Networks
Sequential processing via hidden state recurrence: the simple RNN, vanishing and exploding gradients, LSTM gating mechanisms, and why transformers have largely replaced RNNs.
Semantic Search and Embeddings
Dense vector representations for semantic similarity: bi-encoders, cross-encoders, approximate nearest neighbor search, cosine similarity geometry, and the RAG retrieval pipeline.
Spectral Clustering
Clustering via the eigenvectors of a graph Laplacian: embed data using the bottom eigenvectors, then run k-means in the embedding space. Finds non-convex clusters that k-means alone cannot.
Speech and Audio ML
Machine learning for audio: mel spectrograms as 2D representations, CTC loss for sequence alignment, Whisper for speech recognition, text-to-speech synthesis, and why continuous audio signals are harder than discrete text.
Time Series Forecasting Basics
Stationarity, autocorrelation, AR, MA, ARIMA, exponential smoothing, modern architectures (PatchTST, iTransformer, TSMixer), and time-series foundation models (TimesFM, Chronos, Moirai).
Transfer Learning
Pretrain on a large dataset, fine-tune on a smaller target: why lower layers learn transferable features, feature extraction vs fine-tuning, domain adaptation, negative transfer, and the foundation model paradigm.
t-SNE and UMAP
Two dominant nonlinear dimensionality reduction methods: t-SNE preserves local neighborhoods via KL divergence with a Student-t kernel, UMAP uses fuzzy simplicial sets and cross-entropy. Both excel at visualization but have important limitations.
Word Embeddings
Dense vector representations of words: Word2Vec (skip-gram, CBOW), negative sampling, GloVe, the distributional hypothesis, and why embeddings transformed NLP from sparse features to learned representations.
XGBoost
XGBoost as second-order gradient boosting: Taylor expansion of the loss, regularized objective, optimal leaf weights, split gain formula, and the system optimizations that made it dominant on tabular data.
Bayesian Neural Networks
Place a prior over neural network weights and compute the posterior given data. Exact inference is intractable, so we approximate: variational inference, MC dropout, Laplace approximation, SWAG. Principled uncertainty, high cost, limited scaling evidence.
Boltzmann Machines and Hopfield Networks
Energy-based models for associative memory and generative learning: Hopfield networks store patterns via energy minimization, Boltzmann machines add stochasticity and hidden units, and RBMs enable tractable learning through contrastive divergence.
Cubist and Model Trees
M5 model trees put linear regression at each leaf of a decision tree. Cubist extends this with rule simplification and smoothing. A useful hybrid of interpretability and prediction power.
Energy-Based Models
A unifying framework for generative modeling: assign low energy to likely configurations via E(x), define probability through the Boltzmann distribution, and train without computing the intractable partition function.
Logspline Density Estimation
Model the log-density as a spline, then normalize to get a smooth, positive density estimate. Connection to exponential families, knot selection by BIC, and flexible nonparametric density estimation.
MARS (Multivariate Adaptive Regression Splines)
MARS: automatically discover nonlinear relationships using piecewise linear hinge functions, forward-backward selection, and a direct connection to ReLU networks.
Mixture Density Networks
Neural networks that output the parameters of a mixture model instead of a single point prediction: handling multi-modal conditional distributions, the negative log-likelihood loss, and applications to inverse problems.
NMF (Nonnegative Matrix Factorization)
Factor V into W*H with all entries nonnegative: parts-based additive representation, multiplicative update rules, and applications to topic modeling and image decomposition.
Normalizing Flows
Generative models that transform a simple base distribution through invertible mappings, enabling exact log-likelihood computation via the change of variables formula.
Reservoir Computing and Echo State Networks
Fixed random recurrent networks with trained linear readouts: the echo state property, why random high-dimensional projections carry computational power, extreme learning machines, and connections to state-space models.
Self-Organizing Maps
Kohonen networks: competitive learning on a grid that produces topology-preserving mappings from high-dimensional input to low-dimensional discrete maps.
Wavelet Smoothing
Wavelet transforms decompose signals into localized frequency components. Thresholding wavelet coefficients denoises signals while adapting to local smoothness, achieving minimax-optimal rates over Besov spaces.
Numerical Optimization
26 topics / L0A, L1, L2, L3 / Tiers 1-3
Newton, conjugate gradient, ADMM, coordinate descent, interior point, line search. The algorithmic toolbox.
Conditioning and Condition Number
The condition number measures how sensitive a problem is to perturbations in its input. Ill-conditioned matrices turn small errors into catastrophic ones, and understanding conditioning is essential for any computation involving linear algebra.
Floating-Point Arithmetic
How computers represent real numbers, why they get it wrong, and why ML uses float32, float16, bfloat16, and int8. IEEE 754, machine epsilon, overflow, underflow, and catastrophic cancellation.
Log-Probability Computation
Working in log space prevents underflow when multiplying many small probabilities. The log-sum-exp trick provides a numerically stable way to compute log of a sum of exponentials, and it underlies stable softmax, log-likelihoods, and the forward algorithm for HMMs.
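A minimal sketch of the trick, assuming NumPy (the function name is illustrative): shift by the maximum so the largest exponent is exp(0) = 1, which removes the underflow without changing the result.

```python
import numpy as np

def logsumexp(logs):
    """Compute log(sum(exp(logs))) without underflow by shifting by the max."""
    logs = np.asarray(logs, dtype=float)
    m = np.max(logs)
    return m + np.log(np.sum(np.exp(logs - m)))

# Naive evaluation underflows to log(0); the shifted version does not.
log_probs = np.array([-1000.0, -1000.5, -1001.0])
print(np.log(np.sum(np.exp(log_probs))))  # -inf
print(logsumexp(log_probs))               # approx -999.32
```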
Newton's Method
The gold standard for fast local convergence: use second-order information (the Hessian) to take optimal quadratic steps. Quadratic convergence when it works, catastrophic failure when it doesn't.
Proximal Gradient Methods
Optimize composite objectives by alternating gradient steps on smooth terms with proximal operators on nonsmooth terms. ISTA and its accelerated variant FISTA.
Quasi-Newton Methods
Approximate the Hessian instead of computing it: BFGS builds a dense approximation, L-BFGS stores only a few vectors. Full BFGS has classical local superlinear convergence; L-BFGS is excellent in practice but its fixed-memory theory is weaker and not generally superlinear.
Softmax and Numerical Stability
The softmax function maps arbitrary reals to a probability distribution. Getting it right numerically, by avoiding overflow and underflow, is the first lesson in writing ML code that actually works.
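The same shift works for softmax itself. A minimal NumPy sketch with illustrative names, shown on logits large enough that exponentiating them directly would overflow; softmax is invariant to subtracting a constant, so the answer is unchanged.

```python
import numpy as np

def stable_softmax(logits):
    """Softmax that is safe for large-magnitude logits: shift by the row max first."""
    z = logits - np.max(logits, axis=-1, keepdims=True)   # shift-invariance of softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([1000.0, 1001.0, 1002.0])
# np.exp(logits) would overflow to inf; the shifted version stays finite.
print(stable_softmax(logits))   # approx [0.090, 0.245, 0.665]
```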
Ascent Algorithms and Hill Climbing
Gradient ascent, hill climbing, and their failure modes: local optima, plateaus, and ridges. Random restarts and simulated annealing as strategies for escaping local optima.
Augmented Lagrangian and ADMM
The augmented Lagrangian fixes the conditioning problem of pure penalties by adding dual feedback. ADMM uses that idea to split structured convex problems into alternating subproblems with primal and dual residual diagnostics.
Bayesian Optimization for Hyperparameters
Hyperparameter tuning as black-box optimization with a Gaussian process surrogate. Acquisition functions (EI, UCB, PI), sample efficiency for expensive evaluations, TPE as an alternative, and why grid search is wasteful.
Conjugate Gradient Methods
Conjugate gradient solves large symmetric positive definite linear systems using matrix-vector products, Krylov subspaces, A-conjugate directions, condition-number-dependent rates, and preconditioning.
Coordinate Descent
Optimize by updating one coordinate (or block) at a time while holding others fixed. The default solver for Lasso because each coordinate update has a closed-form solution.
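A hedged sketch of that closed-form update, assuming NumPy and the objective 0.5·||y − Xβ||² + λ||β||₁ (function names and the toy data are illustrative): each coordinate update is a soft-thresholded least-squares fit against the partial residual.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for 0.5*||y - X b||^2 + lam*||b||_1."""
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = np.sum(X**2, axis=0)                 # x_j^T x_j for each column
    for _ in range(n_sweeps):
        for j in range(d):
            # Partial residual that excludes coordinate j's own contribution.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
true_beta = np.array([3.0, -2.0] + [0.0] * 8)
y = X @ true_beta + 0.1 * rng.standard_normal(200)
print(np.round(lasso_coordinate_descent(X, y, lam=5.0), 2))  # roughly [3, -2, 0, ..., 0]
```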
Line Search Methods
Line search chooses a step size along a descent direction. Armijo guarantees enough decrease, Wolfe prevents uselessly tiny steps, and backtracking gives a practical step-size controller without knowing the Lipschitz constant.
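A minimal backtracking sketch under the Armijo condition (NumPy, with illustrative names and constants): shrink the step geometrically until the sufficient-decrease test passes.

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, direction, alpha=1.0, beta=0.5, c=1e-4):
    """Shrink the step until the Armijo sufficient-decrease condition holds."""
    fx = f(x)
    slope = grad_f(x) @ direction          # directional derivative; negative for a descent direction
    while f(x + alpha * direction) > fx + c * alpha * slope:
        alpha *= beta
    return alpha

# Example: one gradient-descent step on a badly scaled quadratic.
f = lambda x: 0.5 * (100 * x[0]**2 + x[1]**2)
grad_f = lambda x: np.array([100 * x[0], x[1]])
x0 = np.array([1.0, 1.0])
d = -grad_f(x0)
step = backtracking_line_search(f, grad_f, x0, d)
print(step, f(x0 + step * d))              # accepted step and the decreased objective
```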
Mirror Descent and Frank-Wolfe
Mirror descent generalizes gradient descent via Bregman divergences, recovering multiplicative weights and exponentiated gradient as special cases. Frank-Wolfe replaces projections with linear minimization, making it projection-free.
Numerical Linear Algebra
Algorithms for solving linear systems, computing eigenvalues, and factoring matrices. Every linear regression, PCA, and SVD computation depends on these methods.
Online Convex Optimization
A general framework for sequential decision-making with convex losses: online gradient descent, follow the regularized leader, adaptive methods, and the square-root-T regret guarantee that unifies many algorithms.
Projected Gradient Descent
Constrained convex optimization by alternating gradient steps with projections onto the feasible set. Same convergence rates as unconstrained gradient descent when projections are cheap.
Secant Method
A derivative-free root-finding method that approximates Newton's method using two previous function evaluations, achieving superlinear convergence of order approximately 1.618.
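A short sketch with illustrative names and tolerances: Newton's update with the derivative replaced by the slope through the two most recent iterates.

```python
def secant(f, x0, x1, tol=1e-12, max_iter=100):
    """Root finding: Newton's step with f'(x) replaced by a finite-difference slope."""
    for _ in range(max_iter):
        f0, f1 = f(x0), f(x1)
        if abs(f1 - f0) < 1e-300:          # slope numerically zero; give up
            break
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol:
            return x2
        x0, x1 = x1, x2
    return x1

# Root of x^3 - 2x - 5 (the classic Wallis example), near x = 2.0946.
print(secant(lambda x: x**3 - 2*x - 5, 2.0, 3.0))
```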
Second-Order Optimization Methods
Newton's method, Gauss-Newton, natural gradient, and K-FAC: how curvature information accelerates convergence, why the Hessian is too expensive to compute at scale, and Hessian-free alternatives that use Hessian-vector products.
Trust Region Methods
Trust region methods minimize a local model only inside a region where that model is trusted, then grow or shrink the region using actual versus predicted decrease. They are robust Newton-style methods for difficult non-convex landscapes.
Whitening and Decorrelation
Transform data to have identity covariance, removing correlations and normalizing scales. ZCA and PCA whitening, why whitening helps optimization, and connections to batch normalization.
Interior Point Methods
Barrier functions transform constrained optimization into unconstrained problems. Newton steps on the barrier objective trace the central path to the constrained optimum with polynomial convergence.
Nonlinear Gauss-Seidel
Block coordinate optimization with latest-value updates. Nonlinear Gauss-Seidel explains alternating minimization, EM-style sweeps, and why block coupling decides whether cycling, stalling, or convergence happens.
Submodular Optimization
Submodular functions exhibit diminishing returns. The greedy algorithm achieves a (1-1/e) approximation for monotone submodular maximization under cardinality constraints, with applications in feature selection, sensor placement, and data summarization.
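A minimal sketch of the greedy rule on the coverage function, which is monotone submodular (the function name and toy sensor sets are illustrative): repeatedly add the element with the largest marginal gain.

```python
def greedy_max_coverage(sets, k):
    """Greedy (1 - 1/e) approximation for monotone submodular maximization
    under a cardinality constraint, illustrated with the coverage function."""
    chosen, covered = [], set()
    for _ in range(k):
        # Pick the set with the largest marginal gain given what is already covered.
        best = max(sets, key=lambda i: len(sets[i] - covered))
        if len(sets[best] - covered) == 0:
            break
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

sensors = {
    "a": {1, 2, 3, 4},
    "b": {3, 4, 5},
    "c": {5, 6, 7, 8},
    "d": {1, 8},
}
print(greedy_max_coverage(sensors, k=2))   # picks "a" and "c", covering {1,...,8}
```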
Tabu Search
Local search with memory: maintain a list of recently visited solutions to prevent cycling, use aspiration criteria to override the tabu when a move leads to a new best, and balance intensification against diversification.
Winsorization
Clip extreme values to a fixed percentile instead of removing them. Preserves sample size, reduces outlier sensitivity, and improves stability of downstream estimators.
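A short NumPy sketch with illustrative percentile choices.

```python
import numpy as np

def winsorize(x, lower_pct=5, upper_pct=95):
    """Clip values below/above the given percentiles instead of dropping them."""
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)

rng = np.random.default_rng(0)
x = np.append(rng.normal(0, 1, 99), 50.0)     # one gross outlier
print(x.mean(), winsorize(x).mean())           # the winsorized mean is far less distorted
```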
Sampling & MCMC
20 topics / L1, L2, L3 / Tiers 1-3
Gibbs, Metropolis-Hastings, HMC, importance sampling, coupling arguments, perfect sampling.
Gibbs Sampling
The classical conditional-update MCMC algorithm: sample each variable from its full conditional, accept every proposal, and then pay close attention to mixing, blocking, and hierarchical parameterization.
Importance Sampling
Estimate expectations under one distribution by sampling from another and reweighting: a technique that is powerful when done right and catastrophically unreliable when done wrong.
Markov Chain Monte Carlo
When you can evaluate a target up to a normalizing constant but cannot sample from it, build a Markov chain whose stationary distribution is the target. Detailed balance, ergodicity, mixing time, autocorrelation, and the variance penalty for non-iid samples.
Metropolis-Hastings Algorithm
The foundational MCMC algorithm: build a Markov chain with the right stationary distribution by accepting or rejecting proposals according to a carefully balanced ratio. The real story is not only correctness, but also proposal geometry, diagnostics, and where plain MH becomes painfully slow.
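A hedged random-walk Metropolis sketch in NumPy (the target, step size, and names are illustrative): with a symmetric Gaussian proposal the Hastings correction cancels, leaving only the ratio of target densities.

```python
import numpy as np

def random_walk_mh(log_target, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: Gaussian proposals, accept with prob min(1, pi(x')/pi(x))."""
    rng = np.random.default_rng(seed)
    x, logp = x0, log_target(x0)
    samples = np.empty(n_samples)
    for i in range(n_samples):
        prop = x + step * rng.standard_normal()
        logp_prop = log_target(prop)
        # Symmetric proposal, so the Hastings correction cancels.
        if np.log(rng.uniform()) < logp_prop - logp:
            x, logp = prop, logp_prop
        samples[i] = x
    return samples

# Target: standard normal, known only up to its normalizing constant.
draws = random_walk_mh(lambda x: -0.5 * x**2, x0=5.0, n_samples=20_000, step=1.5)
print(draws[2000:].mean(), draws[2000:].std())   # roughly 0 and 1 after burn-in
```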
Monte Carlo Methods
Approximate expectations by sampling: the Monte Carlo estimator, its variance, the $\sqrt{N}$ convergence rate, and the variance-reduction tricks that make practical Bayesian inference, REINFORCE, and ELBO estimation work.
Burn-in and Convergence Diagnostics
Burn-in is only the first filter. Modern MCMC trust comes from split rank-normalized R-hat, bulk and tail ESS, trace behavior, and sampler-specific warnings like divergences.
Hamiltonian Monte Carlo
HMC uses gradients and Hamiltonian dynamics to propose large moves that still land in high-density regions. The real practical story includes leapfrog geometry, warmup, divergences, and when centered hierarchical models break the sampler.
Langevin Dynamics
The overdamped Langevin SDE: gradient descent plus calibrated Gaussian noise. The mathematical object behind SGLD, ULA, and energy-based MCMC samplers, and the simplest sampler with provable polynomial-time convergence on log-concave targets.
No-U-Turn Sampler and Neal's Funnel
NUTS removes HMC's hand-tuned trajectory length, but Neal's funnel shows why geometry still dominates. Centered hierarchical models create narrow necks, divergences, and misleading convergence unless you reparameterize.
Rao-Blackwellization
The Rao-Blackwell theorem: conditioning an estimator on a sufficient statistic reduces variance without increasing bias. In MCMC, this means replacing sample averages with conditional expectations for lower-variance estimates at no extra sampling cost.
Rejection Sampling
The simplest exact sampling method: propose from an envelope distribution and accept or reject to produce exact independent draws from a target, but it is doomed to fail in high dimensions.
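A minimal sketch with an assumed Beta(2, 2) target and a Uniform(0, 1) envelope (names and constants are illustrative): accept a proposal x with probability f(x) / (M · g(x)).

```python
import numpy as np

def rejection_sample(n, seed=0):
    """Draw from a Beta(2, 2) target using a Uniform(0, 1) envelope scaled by M."""
    rng = np.random.default_rng(seed)
    M = 1.5                                # max of the Beta(2,2) density 6x(1-x), attained at x = 0.5
    out = []
    while len(out) < n:
        x = rng.uniform()
        if rng.uniform() < (6 * x * (1 - x)) / M:
            out.append(x)
    return np.array(out)

draws = rejection_sample(10_000)
print(draws.mean(), draws.var())           # approx 0.5 and 0.05 for Beta(2, 2)
```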
Variance Reduction Techniques
Get the same accuracy with fewer samples by exploiting correlation, known quantities, and stratification. Antithetic variates, control variates, stratification, and Rao-Blackwellization.
Adaptive Rejection Sampling
For log-concave densities, build a piecewise linear upper hull of log f(x) that tightens automatically with each evaluation. Sample from the envelope and accept/reject.
Coupling Arguments and Mixing Time
Coupling constructs two Markov chains on the same probability space so they eventually meet, bounding total variation distance and mixing time. Spectral gap and coupling inequality are the main tools for proving how fast MCMC converges to stationarity.
Griddy Gibbs Sampling
When Gibbs sampling encounters a non-conjugate full conditional, approximate it on a grid and sample from the piecewise constant or piecewise linear approximation. Simple, effective for smooth univariate conditionals.
MCMC for Markov Random Fields
Gibbs sampling on undirected graphical models. The joint distribution factorizes over cliques, each variable is resampled from its Markov blanket, and the real practical story is local conditional updates versus long-range dependence, critical slowing, and exact-sampling alternatives.
Perfect Sampling
Coupling from the past (Propp-Wilson): run Markov chains backward in time until all starting states coalesce. The result is an exact sample from the stationary distribution with no burn-in approximation, but often at a serious computational price.
Reversible Jump MCMC
MCMC for model selection: propose moves that change the number of parameters, maintain detailed balance across dimensions via Jacobian corrections, and sample over model space and parameter space simultaneously.
Slice Sampling
Slice sampling draws from a target distribution by uniformly sampling from the region under its density curve. The auxiliary height variable removes the need to tune a proposal distribution, in contrast to random-walk Metropolis-Hastings.
Squeezed Rejection Sampling
An optimization of rejection sampling that adds a cheap lower bound (squeeze function) to avoid expensive target density evaluations when the sample clearly falls in the accept or reject region.
Decision Theory & Game Theory
2 topics / L2 / Tier 1
Expected utility, Nash equilibrium, minimax, auction theory, mechanism design, Arrow impossibility.
Bounded Rationality
Real agents optimize under limited information, limited compute, and limited foresight. Simon's satisficing, heuristics, and the implications for search, planning, and agent design in ML.
Game Theory Foundations
Strategic interaction between rational agents. Normal-form games, dominant strategies, Nash equilibrium existence, mixed strategies, and connections to minimax, mechanism design, and multi-agent RL.
Algorithms & Computation
10 topics / L0A, L1, L2, L3 / Tiers 1-3
Sorting, graph algorithms, dynamic programming, FFT, matrix multiplication, P vs NP.
Dynamic Programming
Solve complex optimization problems by decomposing them into overlapping subproblems with optimal substructure. The algorithmic backbone of sequence models, control theory, and reinforcement learning.
Information Retrieval Foundations
Search as a first-class capability. TF-IDF, BM25, inverted indexes, precision/recall, reranking, and why retrieval is not just vector DB plus embeddings.
Fast Fourier Transform
The Cooley-Tukey FFT reduces the discrete Fourier transform from O(n²) to O(n log n), enabling efficient convolution, spectral methods, and Fourier features for kernel approximation.
Graph Algorithms Essentials
The graph algorithms every ML practitioner needs: BFS, DFS, Dijkstra, MST, and topological sort. Why they matter for computational graphs, knowledge graphs, dependency resolution, and GNNs.
Greedy Algorithms
The greedy paradigm: make the locally optimal choice at each step and never look back. When matroid structure or the exchange argument guarantees global optimality.
Knapsack Problem
The canonical constrained optimization problem: 0/1 knapsack (NP-hard, pseudo-polynomial DP), fractional knapsack (greedy), FPTAS, and connections to Lagrangian relaxation in ML.
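A short sketch of the pseudo-polynomial DP over capacities (names are illustrative); iterating capacities downward is what enforces the 0/1 constraint.

```python
def knapsack_01(values, weights, capacity):
    """Pseudo-polynomial DP: dp[c] = best value achievable with capacity c."""
    dp = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        # Iterate capacities downward so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            dp[c] = max(dp[c], dp[c - w] + v)
    return dp[capacity]

print(knapsack_01(values=[60, 100, 120], weights=[10, 20, 30], capacity=50))  # 220
```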
Matrix Multiplication Algorithms
From naive cubic algorithms to Strassen's subcubic breakthrough to the open question of the true matrix multiplication exponent. What we know, what we do not, and why it matters for ML.
Sorting Algorithms
Comparison-based sorting lower bound, quicksort, mergesort, heapsort, and non-comparison sorts. The foundational algorithms behind efficient data processing and search.
Unsolved Problems in Computer Science
P vs NP, the matrix multiplication exponent, one-way functions, the unique games conjecture, BPP vs P, natural proofs, and circuit lower bounds. The problems that define the limits of computation.
Open Problems in Matrix Computation
The unsolved questions in numerical linear algebra: the true exponent of matrix multiplication, practical fast algorithms, sparse matrix multiplication, randomized methods, and why these matter for scaling ML.
Model Timeline & Labs
12 topics / L4, L5 / Tiers 1-3
GPT, Claude, Gemini, Llama, DeepSeek. Historical lineage and the 2020s lab landscape.
DeepSeek Models
DeepSeek's model family: MoE architectures with Multi-head Latent Attention, fine-grained expert routing, RL-trained reasoning in DeepSeek-R1, the V3.1/V3.2 hybrid reasoning line, and the V4 Preview 1M-context release.
AI Labs Landscape
Factual reference on the major AI research labs and companies: what they build, key technical contributions, and research focus areas. Current landscape as of April 2026.
Claude Model Family
Anthropic's Claude series from Claude 1 through Opus 4.7, Sonnet 4.6, Haiku 4.5, and Mythos Preview: Constitutional AI, extended thinking, computer use, long context, tiering, and safety governance.
Cohere Models
The Cohere model family: Command R and R+ for retrieval-augmented enterprise generation, Embed and Rerank for two-stage retrieval, and the open-weight Aya multilingual line from Cohere For AI. Cohere announced an all-stock merger with German lab Aleph Alpha on April 24, 2026, taking the Schwarz Group as lead investor and positioning the combined company as a sovereign-AI alternative to dominant US providers.
Gemini and Google Models
Google's model lineage from PaLM through Gemini 3.1: native multimodality, long context, TPU infrastructure, Gemini API preview churn, and the Gemma open-weight branch through Gemma 4.
GPT Series Evolution
The progression from GPT-1 (117M) through GPT-4, the o-series, GPT-5, GPT-5.4, gpt-oss, and GPT-Rosalind: what changed through scale, post-training, reasoning-time compute, tool use, and domain specialization.
Ineffable Intelligence
British AI lab founded by David Silver (UCL professor, ex-DeepMind reinforcement-learning lead, AlphaGo / AlphaZero / AlphaProof). Announced a $1.1 billion seed round at $5.1 billion valuation on April 27, 2026 — co-led by Sequoia and Lightspeed, with Nvidia, Google, DST Global, Index Ventures, and the UK Sovereign AI Fund participating. Stated mission: build a 'superlearner' that acquires knowledge from its own experience rather than from human-generated data, instantiating the 'Era of Experience' research agenda.
LLaMA and Open Weight Models
The open weight movement in large language models: LLaMA 1/2/3, the ecosystem of fine-tuning and quantization tools, and why open weights changed the dynamics of AI research.
Mistral Models
The Mistral AI model family: Mistral 7B with sliding-window attention, the Mixtral 8x7B and 8x22B sparse mixture-of-experts releases, the dense Mistral Large/Nemo line, and the specialist Codestral, Pixtral, and Ministral variants.
Model Comparison Table
Structured comparison of major LLM families as of April 2026: architecture, parameters, context length, open weights, and key strengths, with discussion of what comparison tables cannot tell you.
Model Timeline
A structured factual timeline of major language and multimodal models from GPT-2 through the current frontier, with parameter counts, key innovations, and the ideas that defined each era.
Qwen and Chinese Models
The Chinese open-weight model ecosystem: Qwen (Alibaba), Yi (01.AI), Baichuan, GLM (Zhipu AI), and Kimi (Moonshot AI), with a focus on multilingual capability and independent scaling.
Connections & Applications
9 topics / L2, L3, L4 / Tiers 1-3
Cross-domain topics: SLAM, cryptography, signal detection, Kalman filters. Not core curriculum, but worth knowing.
Kalman Filter
Optimal state estimation for linear Gaussian systems via recursive prediction and update steps using the Kalman gain.
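A minimal predict-update sketch in NumPy on a toy constant-velocity model; the dynamics, noise covariances, and names are illustrative assumptions, not a calibrated filter.

```python
import numpy as np

def kalman_step(x, P, z, F, Q, H, R):
    """One predict-update cycle of the Kalman filter."""
    # Predict: propagate the state and its covariance through the dynamics.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: weigh the new measurement by the Kalman gain.
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# 1D constant-velocity model: state = [position, velocity], observe position only.
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.5]])
x, P = np.zeros(2), np.eye(2)
for z in [1.1, 2.0, 2.9, 4.2, 5.0]:
    x, P = kalman_step(x, P, np.array([z]), F, Q, H, R)
print(x)   # estimated position and velocity roughly track the unit-per-step trend
```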
Bayesian State Estimation
The filtering problem: recursively estimate a hidden state from noisy observations using predict-update cycles. Kalman filter for linear Gaussian systems, particle filters for the general case.
GraphSLAM and Factor Graphs
SLAM as graph optimization: poses as nodes, constraints as edges, factor graph representation, MAP estimation via nonlinear least squares, and the sparsity structure that makes large-scale mapping tractable.
State Space Models
Linear state space form, the Kalman filter and RTS smoother, EM for parameter learning, and the ARIMA equivalence. The unifying framework behind classical filtering and modern Mamba/S4.
Time Series Foundations
Rigorous treatment of stationarity, the Wold decomposition, autocorrelation, unit roots, AR/MA/ARMA/ARIMA models, and spectral representation. The classical theory that every modern sequence model rests on.
Active SLAM and POMDPs
Choosing robot actions to simultaneously map an environment and localize, formulated as a partially observable Markov decision process over belief states.
Number Theory and Machine Learning
The emerging two-way street between number theory and machine learning: how number-theoretic tools improve ML systems, and how ML is discovering new mathematical structure in classical problems.
Particle Filters
Sequential Monte Carlo: represent the posterior over hidden states as a set of weighted particles, propagate through dynamics, reweight by likelihood, and resample to combat degeneracy.
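A hedged bootstrap-filter sketch in NumPy on an assumed linear-Gaussian toy model (where the Kalman filter would be exact); it only illustrates the propagate, reweight, resample loop.

```python
import numpy as np

def bootstrap_particle_filter(observations, n_particles=1000, seed=0):
    """Bootstrap particle filter for x_t = 0.9 x_{t-1} + noise, y_t = x_t + noise."""
    rng = np.random.default_rng(seed)
    particles = rng.standard_normal(n_particles)
    means = []
    for y in observations:
        # Propagate particles through the (assumed) dynamics.
        particles = 0.9 * particles + 0.5 * rng.standard_normal(n_particles)
        # Reweight by the observation likelihood N(y | x, 0.5^2).
        log_w = -0.5 * ((y - particles) / 0.5) ** 2
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        means.append(np.sum(w * particles))
        # Resample to fight weight degeneracy.
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[idx]
    return np.array(means)

print(bootstrap_particle_filter([0.2, 0.5, 0.9, 1.0, 0.8]))   # filtered state means
```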
Visual and Semantic SLAM
Replacing laser range finders with cameras for SLAM, and enriching maps with semantic labels to improve data association and planning.
Other topics
Additional topics not yet assigned to a specific module.
Causal Inference for Policy Evaluation
Quasi-experimental methods for recovering policy effects without randomization. Difference-in-differences identifies the average treatment effect on the treated under parallel trends; regression discontinuity identifies a local average treatment effect under continuity at the cutoff; instrumental variables identify a local average treatment effect for compliers under monotonicity (Imbens-Angrist 1994). Synthetic control and double/debiased ML extend these designs to single-unit and high-dimensional settings.
Non-Probability Sampling
Convenience and opt-in samples do not give probability-of-inclusion guarantees. The data-defect identity (Meng 2018) shows why a massive convenience sample can produce a confidently wrong answer. Repair methods: calibration, sampling-score weighting, mass imputation, doubly robust integration with a probability sample, and sensitivity analysis.
Double/Debiased Machine Learning
A general recipe for plugging flexible ML estimators into causal and structural estimands while recovering root-n rate and asymptotic normality. Cross-fitting plus Neyman-orthogonal moments converts slow nuisance rates into honest confidence intervals for a low-dimensional parameter of interest.
AlphaProof and AI-Assisted Theorem Proving
AlphaProof (DeepMind, July 2024) earned an IMO silver medal solving 4 of 6 problems by autoformalizing into Lean 4 and running a reinforcement-learning theorem prover. AlphaGeometry 2 solved IMO 2024 Problem 4 in geometry. The state of AI-assisted formal mathematics in 2026: autoformalization, retrieval-augmented theorem proving, and the LLM + Lean kernel pipeline.
WebGPU for Machine Learning
WebGPU gives the browser an explicit GPU compute model: devices, queues, buffers, and WGSL kernels. That is the missing substrate for serious in-browser inference, custom kernels, and eventually browser-native training systems.
Adaptive Learning Is Not IID
Why diagnostic systems that choose the next question from previous answers need sequential-probability assumptions, not iid sampling assumptions.
SGD as a Stochastic Differential Equation
The continuous-time SDE limit of mini-batch SGD. Order-1 weak approximation (Li-Tai-E), Mandt-Hoffman-Blei stationary distribution, Bayesian interpretation, the linear scaling rule for batch size, and the modified-equation correction that exposes SGD's implicit gradient-norm regularizer.
Split Conformal Prediction
A distribution-free, model-agnostic procedure that converts any point predictor into a prediction set with finite-sample marginal coverage. The only assumption is exchangeability. The proof is five lines.
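A minimal sketch assuming NumPy 1.22+ (for the quantile `method` argument); the model, data, and names are illustrative. The finite-sample quantile level ceil((n+1)(1−α))/n is what delivers the marginal coverage guarantee.

```python
import numpy as np

def split_conformal_interval(model, X_cal, y_cal, x_new, alpha=0.1):
    """Split conformal: calibrate absolute residuals on held-out data,
    then return prediction +/- the finite-sample-corrected quantile."""
    residuals = np.abs(y_cal - model(X_cal))
    n = len(residuals)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(residuals, q_level, method="higher")
    pred = model(x_new)
    return pred - q, pred + q

# Toy example: a deliberately biased "model" still gets valid marginal coverage,
# because the calibration residuals absorb its bias.
rng = np.random.default_rng(0)
X_cal = rng.uniform(-3, 3, 500)
y_cal = 2 * X_cal + rng.standard_normal(500)
model = lambda x: 1.8 * x                      # slightly wrong slope on purpose
print(split_conformal_interval(model, X_cal, y_cal, x_new=np.array([1.0])))
```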
Weighted Conformal Prediction Under Covariate Shift
The extension of split conformal prediction that restores finite-sample coverage when the test distribution differs from the training distribution, at the cost of knowing or estimating the likelihood ratio between them.
Classical ODEs: Existence, Stability, and Numerical Methods
The foundational theorems that justify treating an ODE as having a solution at all: Picard-Lindelof existence and uniqueness, Gronwall's inequality, linear-system theory, Lyapunov stability, and the numerical methods (Euler, Runge-Kutta, adaptive step) used in modern scientific ML.
Adjoint Sensitivity Method
Compute gradients through an ODE solver by integrating a backward adjoint ODE, trading O(NT) activation memory for O(1) memory at the cost of a second integration.
Continuous Normalizing Flows
Generative models that replace the stack of invertible layers in a normalizing flow with a learned ODE, trading the per-layer Jacobian determinant for an O(d) trace via Hutchinson's estimator.
Neural ODEs and Continuous-Depth Networks
Treating neural network depth as a continuous variable: the ODE formulation of residual networks, the adjoint method for memory-efficient backpropagation, the duality with PINNs, the SDE bridge to diffusion models, and the open research frontier.
Neural SDEs and the Diffusion Bridge
The stochastic generalization of neural ODEs: parameterizing the drift and diffusion of an SDE with neural networks, the adjoint method extended through Brownian motion, the explicit bridge to diffusion models via the probability flow ODE, and generative neural SDEs as infinite-dimensional GANs.
E-Values and Anytime-Valid Inference
A framework for hypothesis testing whose validity survives optional stopping, optional continuation, and data-dependent peeking. E-values are nonnegative random variables with expectation at most one under the null; Ville's inequality makes the guarantee anytime-valid.
Analysis of Variance
One-way ANOVA decomposes the total sum of squares into a between-group component and a within-group component. Under iid normal data with equal group variances, the ratio of mean squares has an F distribution and gives an exact test of the equal-means null hypothesis. The Welch correction handles unequal variances. Two-way ANOVA partitions further into main effects and interaction. Post-hoc procedures (Tukey HSD, Bonferroni, Scheffe) correct for the multiple-comparison problem that naive pairwise t-tests ignore.
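A hedged sketch assuming NumPy and SciPy, with illustrative group means: compute the sum-of-squares decomposition by hand and check it against scipy.stats.f_oneway.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
groups = [rng.normal(loc, 1.0, size=30) for loc in (0.0, 0.0, 0.8)]

# Manual decomposition: SS_total = SS_between + SS_within.
all_y = np.concatenate(groups)
grand = all_y.mean()
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_between, df_within = len(groups) - 1, len(all_y) - len(groups)
F = (ss_between / df_between) / (ss_within / df_within)
p = stats.f.sf(F, df_between, df_within)
print(F, p)
print(stats.f_oneway(*groups))   # should match the manual computation
```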
Delta Method
Asymptotic distribution of a smooth function of an estimator. If sqrt(n)(T_n - mu) converges to N(0, sigma^2), then sqrt(n)(g(T_n) - g(mu)) converges to N(0, [g'(mu)]^2 sigma^2). The multivariate version uses the Jacobian; the second-order version handles vanishing derivatives. The page derives the result, works three canonical examples (variance of a log proportion, variance of a ratio of means, asymptotic variance of the sample correlation), and ties the construction to variance-stabilizing transformations.
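One of the canonical examples, sketched in LaTeX: the log of a sample proportion, assuming p̂ is the mean of n iid Bernoulli(p) draws.

```latex
% Delta method applied to g(t) = log t at the sample proportion \hat{p}.
\[
\sqrt{n}\,(\hat{p} - p) \xrightarrow{d} \mathcal{N}\bigl(0,\, p(1-p)\bigr),
\qquad g(t) = \log t, \qquad g'(p) = \frac{1}{p},
\]
\[
\sqrt{n}\,\bigl(\log\hat{p} - \log p\bigr) \xrightarrow{d}
\mathcal{N}\!\left(0,\, \frac{p(1-p)}{p^{2}}\right)
= \mathcal{N}\!\left(0,\, \frac{1-p}{p}\right).
\]
```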
Variance-Stabilizing Transformations
Many distributions have variance that depends on the mean: Poisson variance equals the mean, binomial-proportion variance equals p(1-p)/n. For an estimator T with mean μ and variance σ²(μ)/n, the delta method gives Var(g(T)) ≈ [g'(μ)]² σ²(μ)/n, so picking g to satisfy g'(μ) σ(μ) = constant makes the asymptotic variance independent of μ. Solving this ODE produces the canonical transformations: 2√X for Poisson counts, arcsin(√p̂) for binomial proportions, log for multiplicative-scale data, and the Fisher z-transform for the sample correlation. Anscombe's small-count corrections and the Box-Cox family complete the toolkit.
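The Poisson case, sketched in LaTeX: solving g'(μ) σ(μ) = constant with σ²(μ) = μ recovers the 2√X transform.

```latex
% Variance stabilization for Poisson counts, where sigma^2(mu) = mu.
% Require g'(mu) * sigma(mu) = c, i.e. g'(mu) = c / sqrt(mu).
\[
g(\mu) = \int \frac{c}{\sqrt{\mu}}\, d\mu = 2c\sqrt{\mu},
\qquad \text{so with } c = 1:\quad g(X) = 2\sqrt{X},
\]
\[
\operatorname{Var}\bigl(2\sqrt{X}\bigr) \approx \bigl[g'(\mu)\bigr]^{2}\,\mu
= \frac{1}{\mu}\cdot\mu = 1,
\qquad \text{independent of } \mu .
\]
```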
Tweedie Distribution
The Tweedie distribution is the one-parameter subfamily of the exponential dispersion model (EDM) family characterized by a power variance function V(mu) = mu^p. Special cases recover the Normal (p=0), Poisson (p=1), Gamma (p=2), and Inverse Gaussian (p=3). The intermediate range 1<p<2 produces a compound Poisson-Gamma distribution with a point mass at zero and a continuous positive part, which is the canonical model for insurance loss severity. The page covers the EDM construction, the four special-case identifications, and the compound-Poisson-Gamma representation; the applied actuarial treatment lives on ActuaryPath.
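A hedged simulation of the compound Poisson-Gamma representation for 1 < p < 2 (the λ, shape, and rate values are illustrative, not a calibrated Tweedie parameterization): the point mass at zero appears with probability exp(−λ), and the rest of the mass is continuous and positive.

```python
import numpy as np

rng = np.random.default_rng(0)

def compound_poisson_gamma(n, lam=1.5, shape=2.0, rate=1.0):
    """Y = sum of N iid Gamma(shape, rate) terms with N ~ Poisson(lam);
    Y = 0 exactly when N = 0, which happens with probability exp(-lam)."""
    counts = rng.poisson(lam, size=n)
    return np.array([rng.gamma(shape, 1.0 / rate, size=k).sum() for k in counts])

y = compound_poisson_gamma(100_000)
print("P(Y = 0):", np.mean(y == 0), "vs exp(-lam):", np.exp(-1.5))
print("mean:", y.mean(), "vs lam * shape / rate =", 1.5 * 2.0 / 1.0)
```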