ML Methods
Gaussian Process Regression
Inference with Gaussian processes: the prior-to-posterior update in closed form, the role of kernel choice, marginal likelihood for hyperparameter selection, sparse approximations for scalability, and the connection to Bayesian optimization.
Why This Matters
Gaussian process regression gives you two things that most regression methods do not: a posterior mean (point prediction) and a posterior variance (uncertainty estimate) at every test point, both in closed form. The uncertainty is calibrated: it is large where training data is sparse and small where data is dense. This is a Bayesian approach: place a prior over functions, condition on data, and read off the posterior.
This built-in uncertainty quantification makes GPs the standard surrogate model in Bayesian optimization, where you need to decide where to evaluate an expensive function next. The acquisition function depends on both the predicted value and the uncertainty, and GPs provide both.
Formal Setup
Assume we observe $n$ training points $\{(x_i, y_i)\}_{i=1}^{n}$ where $x_i \in \mathbb{R}^d$ and $y_i = f(x_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$ and the $\varepsilon_i$ independent.
Place a GP prior on $f$: $f \sim \mathcal{GP}(0, k(x, x'))$, where $k$ is the kernel (covariance function).
Kernel Matrix
The kernel matrix (Gram matrix) is $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = k(x_i, x_j)$. For a test point $x_*$, define the vector $k_* = (k(x_*, x_1), \ldots, k(x_*, x_n))^\top$ and the scalar $k_{**} = k(x_*, x_*)$.
Posterior and Marginal Likelihood
The posterior predictive distribution and the log marginal likelihood for GP regression are derived in the canonical Gaussian Processes for ML page; both follow from conditioning a multivariate Gaussian. The closed-form posterior mean is $\mu_* = k_*^\top (K + \sigma_n^2 I)^{-1} y$, the posterior variance is $\sigma_*^2 = k_{**} - k_*^\top (K + \sigma_n^2 I)^{-1} k_*$, and the log marginal likelihood combines a data-fit term, a log-determinant complexity term, and a normalization constant.
This page is a special case of Gaussian processes for ML focused on regression with Gaussian likelihood. The classical results are stated there; this page covers the kriging / RBF connection, sparse approximations, and the numerical pipeline.
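To make the closed-form update concrete, here is a minimal numerical sketch of the posterior mean and variance, assuming an RBF kernel and a Cholesky-based solve; the function and variable names are my own, not from any library:

```python
# Minimal GP posterior sketch (assumed RBF kernel; names are illustrative).
import numpy as np

def rbf(a, b, ell=1.0, sf=1.0):
    """Squared-exponential kernel matrix k(a_i, b_j) for 1-D inputs."""
    return sf**2 * np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

def gp_posterior(X, y, Xs, ell=1.0, sf=1.0, sn=0.1):
    """Closed-form posterior mean and variance at test points Xs."""
    K = rbf(X, X, ell, sf) + sn**2 * np.eye(len(X))   # K + sigma_n^2 I
    L = np.linalg.cholesky(K)                         # stable alternative to an explicit inverse
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = rbf(Xs, X, ell, sf)                          # rows are k_* for each test point
    mu = Ks @ alpha                                   # k_*^T (K + sn^2 I)^{-1} y
    v = np.linalg.solve(L, Ks.T)
    var = sf**2 - np.sum(v**2, axis=0)                # k_** - k_*^T (K + sn^2 I)^{-1} k_*
    return mu, var

X = np.linspace(0, 5, 8)
y = np.sin(X)
mu, var = gp_posterior(X, y, np.array([2.5, 10.0]))
# Near the data the mean tracks the sine; far away it reverts toward the prior mean 0
# and the variance reverts toward the prior variance sf^2.
```

The Cholesky factorization is the standard way to evaluate these formulas in practice; it avoids forming $(K + \sigma_n^2 I)^{-1}$ explicitly.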
Kernel Choice
The kernel encodes prior assumptions about the function $f$:
- Squared exponential (RBF): $k(x, x') = \sigma_f^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$. Assumes $f$ is infinitely differentiable. The length scale $\ell$ controls smoothness.
- Matérn-$\nu$: $k(r) = \sigma_f^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\nu} K_\nu\left(\frac{\sqrt{2\nu}\, r}{\ell}\right)$ with $r = \|x - x'\|$ and $K_\nu$ the modified Bessel function. Controls differentiability via $\nu$: $\nu = 1/2$ gives Ornstein-Uhlenbeck (rough), $\nu = 3/2$ gives once-differentiable sample paths, $\nu \to \infty$ recovers the RBF.
- Periodic: $k(x, x') = \sigma_f^2 \exp\left(-\frac{2\sin^2(\pi |x - x'| / p)}{\ell^2}\right)$ for functions with known periodicity $p$.
- Sums and products of kernels are valid kernels, enabling compositional modeling.
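A sketch of these kernels for scalar inputs, using the standard parameterizations above (the names are my own), with a numerical check that a sum/product composition still yields a positive semidefinite Gram matrix:

```python
# Kernel sketches and a PSD check for a compositional kernel.
import numpy as np

def rbf(r, ell=1.0, sf=1.0):
    return sf**2 * np.exp(-r**2 / (2 * ell**2))

def matern32(r, ell=1.0, sf=1.0):
    a = np.sqrt(3) * np.abs(r) / ell
    return sf**2 * (1 + a) * np.exp(-a)       # once-differentiable sample paths

def periodic(r, p=1.0, ell=1.0, sf=1.0):
    return sf**2 * np.exp(-2 * np.sin(np.pi * np.abs(r) / p)**2 / ell**2)

# Sums and products of kernels are kernels: the Gram matrix stays PSD.
x = np.linspace(0, 3, 20)
r = x[:, None] - x[None, :]
K = rbf(r) * periodic(r) + matern32(r)        # compositional kernel
eigs = np.linalg.eigvalsh(K)
# All eigenvalues are nonnegative up to floating-point error.
```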
Marginal Likelihood for Hyperparameter Selection
For the explicit log marginal likelihood and its proof, see the Gaussian processes for ML page. In practice the marginal likelihood is non-convex in the hyperparameters and can have multiple local optima, especially with compositional kernels — different initializations can lead to different solutions.
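The quantity being optimized can be sketched directly. This illustrative snippet (my own setup: RBF kernel with unit signal variance and fixed noise) evaluates the log marginal likelihood, as the sum of a data-fit term, a log-determinant complexity term, and a normalization constant, at three length scales:

```python
# Sketch: log p(y) = -1/2 y^T Ky^{-1} y - 1/2 log|Ky| - n/2 log(2 pi),
# with Ky = K + sn^2 I, evaluated on a small grid of length scales.
import numpy as np

def rbf(a, b, ell):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

def log_marginal_likelihood(X, y, ell, sn=0.1):
    n = len(X)
    Ky = rbf(X, X, ell) + sn**2 * np.eye(n)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha                  # data fit
            - np.sum(np.log(np.diag(L)))      # complexity: 1/2 log|Ky|
            - 0.5 * n * np.log(2 * np.pi))    # normalization

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)
lmls = [log_marginal_likelihood(X, y, ell) for ell in (0.01, 1.0, 100.0)]
# The moderate length scale wins: a tiny length scale pays in the log-determinant
# term, a huge one pays in the data-fit term.
```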
Sparse Gaussian Processes
Exact GP regression costs $O(n^3)$ time and $O(n^2)$ memory. Sparse GPs reduce this by selecting $m \ll n$ inducing points and approximating the full GP through these points.
The variational free energy (VFE, Titsias 2009) approach optimizes the inducing point locations and a variational distribution over the function values at the inducing points. The resulting cost is $O(nm^2)$ for training and $O(m^2)$ per test point for prediction.
The FITC (fully independent training conditional) approximation assumes conditional independence of training points given the inducing points. This gives a diagonal correction to the inducing point approximation.
With a few hundred to a few thousand inducing points, sparse GPs can handle training sets far beyond the exact-GP limit, up to millions of points. The choice of inducing point locations matters: placing them in regions of high data density gives better approximations.
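To show the scaling mechanics without the full VFE or FITC machinery, here is a minimal inducing-point sketch using the simpler subset-of-regressors predictive mean; the RBF kernel, hand-placed inducing inputs, and all names are my own illustrative choices:

```python
# Subset-of-regressors predictive mean with m inducing points:
# mu_* = K_{*u} (K_{uu} + sn^{-2} K_{un} K_{nu})^{-1} K_{un} y / sn^2.
# Never forms the n x n kernel matrix.
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

rng = np.random.default_rng(1)
n, m, sn = 2000, 15, 0.1
X = np.sort(rng.uniform(0, 10, n))
y = np.sin(X) + sn * rng.standard_normal(n)
Z = np.linspace(0, 10, m)                     # inducing inputs in the data region

Kuu = rbf(Z, Z) + 1e-8 * np.eye(m)            # jitter for stability
Kun = rbf(Z, X)                               # m x n: only O(nm) kernel evaluations
A = Kuu + (Kun @ Kun.T) / sn**2               # m x m system -> O(n m^2) training cost
mu_u = np.linalg.solve(A, Kun @ y) / sn**2

Xs = np.array([2.0, 7.5])
mu_s = rbf(Xs, Z) @ mu_u                      # O(m) per test-point mean
# mu_s closely tracks sin at the test points despite m << n.
```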
Connection to Bayesian Optimization
Bayesian optimization uses a GP as a surrogate for an expensive black-box function $f$. The GP posterior mean and variance define an acquisition function (e.g., expected improvement, UCB) that balances exploitation (evaluating where the mean is low, for minimization) and exploration (evaluating where the variance is high).
The GP's calibrated uncertainty is critical here. If the uncertainty is underestimated, the optimizer exploits too early. If overestimated, it explores too much. GP uncertainty, conditioned on kernel hyperparameters fit via marginal likelihood, is typically well-calibrated when the true function is smooth and well-matched by the kernel.
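A sketch of one standard acquisition function, expected improvement for minimization, computed from a GP posterior mean and standard deviation (self-contained illustration; the function names are my own):

```python
# Expected improvement for minimization:
# EI(x) = E[max(best - f(x), 0)] with f(x) ~ N(mu, sigma^2)
#       = (best - mu) * Phi(z) + sigma * phi(z),  z = (best - mu) / sigma.
import numpy as np
from math import erf

def norm_cdf(z):
    return 0.5 * (1 + erf(z / np.sqrt(2)))

def norm_pdf(z):
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def expected_improvement(mu, sigma, best):
    z = (best - mu) / sigma
    return (best - mu) * norm_cdf(z) + sigma * norm_pdf(z)

# A high-variance point can beat a slightly-better-mean confident point:
# this is the exploration/exploitation trade-off in one line.
best = 0.0
ei_exploit = expected_improvement(mu=-0.05, sigma=0.01, best=best)
ei_explore = expected_improvement(mu=0.2, sigma=1.0, best=best)
```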
Common Confusions
GP uncertainty is conditional on the kernel being correct
The posterior variance is exact given the GP prior. If the true function is not well-modeled by the chosen kernel (e.g., using an RBF kernel for a discontinuous function), the uncertainty estimates are miscalibrated. The GP does not "know" when its kernel assumption is wrong.
The posterior mean equals kernel ridge regression, but the variance does not
The GP posterior mean is identical to the kernel ridge regression prediction with regularization parameter $\lambda = \sigma_n^2 / n$ when the squared loss carries the standard $1/n$ normalization (and $\lambda = \sigma_n^2$ when it does not). Kernel ridge regression does not produce the posterior variance. The uncertainty quantification is unique to the Bayesian (GP) formulation; see the GP-for-ML page for the equivalence proof.
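The mean equivalence can be checked numerically. This sketch (my own setup) verifies that the GP weight vector satisfies the first-order optimality conditions of the unnormalized kernel ridge objective with ridge parameter equal to the noise variance:

```python
# GP mean weights alpha = (K + sn^2 I)^{-1} y are the minimizer of the
# KRR objective J(a) = ||y - K a||^2 + sn^2 * a^T K a,
# whose gradient is 2 K ((K + sn^2 I) a - y).
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

rng = np.random.default_rng(2)
X = rng.uniform(0, 5, 25)
y = np.sin(X) + 0.1 * rng.standard_normal(25)
sn2 = 0.01

K = rbf(X, X)
alpha = np.linalg.solve(K + sn2 * np.eye(25), y)      # GP posterior mean weights
grad = 2 * K @ ((K + sn2 * np.eye(25)) @ alpha - y)   # KRR gradient at alpha
# grad is ~0: the GP weights solve the KRR problem exactly.
```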
Canonical Examples
GP regression on a 1D function
Sample 20 noisy points from a sine function on a bounded interval. Fit a GP with an RBF kernel, optimizing the length scale $\ell$ and noise level $\sigma_n$ via marginal likelihood. The posterior mean tracks the true sine function. The 95% confidence band ($\pm 1.96$ posterior standard deviations) is narrow near training points and widens in gaps between points. Beyond the edges of the training range, the posterior reverts to the prior mean (zero) with prior variance, correctly reflecting that the GP has no information there.
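The band-width behavior in this example can be checked directly; note that the posterior variance does not depend on the observed targets at all. The interval, noise level, and kernel settings below are my own illustrative choices, not the example's exact numbers:

```python
# Posterior std is small between training inputs and reverts to the prior
# std (here 1) far outside the training range.
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

X = np.linspace(0, 6, 20)                 # training inputs
Ky = rbf(X, X) + 0.01 * np.eye(20)        # K + sigma_n^2 I with sigma_n = 0.1

def post_std(x_star):
    ks = rbf(np.array([x_star]), X)       # k_* as a row vector
    var = 1.0 - (ks @ np.linalg.solve(Ky, ks.T)).item()
    return np.sqrt(var)

inside, outside = post_std(3.0), post_std(20.0)
# inside is a small fraction of the prior std; outside is essentially 1.
```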
Summary
- GP regression gives a closed-form Gaussian posterior: $f(x_*) \mid \mathcal{D} \sim \mathcal{N}\left(k_*^\top (K + \sigma_n^2 I)^{-1} y,\; k_{**} - k_*^\top (K + \sigma_n^2 I)^{-1} k_*\right)$
- Posterior variance is large where data is sparse and small near training points
- Kernel choice encodes smoothness and structure assumptions about $f$
- Marginal likelihood selects hyperparameters without cross-validation via automatic Occam's razor
- Exact GP costs $O(n^3)$; sparse GPs with $m$ inducing points cost $O(nm^2)$
- GPs are the default surrogate model for Bayesian optimization
Exercises
Problem
For a GP with an RBF kernel with unit length scale and unit signal variance, and noise variance $\sigma_n^2 > 0$, suppose you observe a single training point $(x_1, y_1)$. Compute the posterior mean and variance at $x_* = x_1$ and at a test point many length scales away from $x_1$.
Problem
Show that the GP marginal likelihood penalizes both underfitting and overfitting. Specifically, consider the RBF kernel with length scale $\ell$. Explain what happens to the data-fit term and the log-determinant complexity term as $\ell \to 0$ and as $\ell \to \infty$.
Connection to Kriging and RBF Interpolation
GP regression is mathematically identical to kriging, the geostatistical interpolation method developed by Matheron (1963) building on Krige's mining work.
In kriging, the kernel is expressed through the variogram (or equivalently the covariance function), the posterior mean is the kriging predictor, and the posterior variance is the kriging variance. The equations are the same: the kriging predictor is $k_*^\top (K + \sigma_n^2 I)^{-1} y$, exactly the GP posterior mean. The geostatistical and machine learning communities developed the same theory independently.
Radial basis function (RBF) interpolation is the noise-free special case: set $\sigma_n^2 = 0$, and the GP posterior mean becomes $\mu_* = k_*^\top K^{-1} y$, which passes exactly through every training point. This is classical RBF interpolation with the kernel serving as the radial basis function. Common choices include the Gaussian ($\phi(r) = e^{-(\varepsilon r)^2}$), multiquadric ($\phi(r) = \sqrt{1 + (\varepsilon r)^2}$), and Matérn kernels.
The GP perspective adds two things that classical RBF interpolation lacks: principled uncertainty quantification (the posterior variance) and hyperparameter selection via marginal likelihood.
Noise-Free GP Equals RBF Interpolation
Statement
With $\sigma_n^2 = 0$ the GP posterior mean reduces to $\mu(x) = k(x)^\top K^{-1} y$, and this predictor satisfies $\mu(x_i) = y_i$ for every training input $x_i$. The predictor coincides with the classical RBF interpolant $s(x) = \sum_{i=1}^{n} \alpha_i\, k(x, x_i)$ with coefficients $\alpha = K^{-1} y$, i.e. with radial basis function $\phi = k$.
Intuition
Without observation noise the model has no reason to smooth past the data. The kernel inverse acts as the change of basis from data values to interpolation weights, and the linear combination of basis functions centered at training inputs reproduces the observed values exactly.
Proof Sketch
Apply the GP posterior mean formula at $x = x_j$. The vector $k(x_j)$ becomes the $j$-th row of $K$, so $\mu(x_j) = e_j^\top K K^{-1} y = y_j$. By construction the predictor is a finite linear combination of the functions $k(\cdot, x_i)$, which is the form of the RBF interpolant; uniqueness of the interpolant in $\mathrm{span}\{k(\cdot, x_i)\}$ for positive definite $k$ identifies the two predictors.
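The proof sketch is easy to confirm numerically: with no noise term, the posterior mean reproduces the training targets exactly. A small self-contained check, with training points and targets of my own choosing:

```python
# Noise-free GP mean = RBF interpolant: mu(x_i) = y_i exactly.
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

X = np.array([0.0, 1.0, 2.5, 4.0])
y = np.array([1.0, -0.5, 0.3, 2.0])

K = rbf(X, X)                          # no sigma_n^2 I term
weights = np.linalg.solve(K, y)        # interpolation weights K^{-1} y
mu_at_train = rbf(X, X) @ weights      # posterior mean at the training inputs
# mu_at_train reproduces y to machine precision.
```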
Why It Matters
The kriging / RBF correspondence makes the geostatistics and approximation-theory literatures directly importable into GPs. Convergence rates from RBF approximation theory transfer over.
Failure Mode
With $\sigma_n^2 = 0$ the matrix $K$ becomes ill-conditioned when training inputs cluster, and the explicit $K^{-1}$ amplifies numerical noise. In practice a small jitter $\epsilon I$ is added for stability; this is a numerical fix, not a model of observation noise.
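A small illustration of this failure mode and the jitter fix; the clustered inputs and the jitter magnitude below are my own illustrative choices:

```python
# Near-duplicate inputs make the noise-free K nearly singular;
# a small jitter caps the condition number.
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

X = np.array([0.0, 1e-6, 1.0, 1.0 + 1e-6])   # two near-duplicate pairs
K = rbf(X, X)

cond_raw = np.linalg.cond(K)                  # astronomically large
cond_jit = np.linalg.cond(K + 1e-6 * np.eye(4))  # bounded by the jitter floor
```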
References
Canonical:
- Rasmussen & Williams, Gaussian Processes for Machine Learning (2006), Chapters 2, 5
- Bishop, Pattern Recognition and Machine Learning (2006), Section 6.4
Kriging and RBF:
- Matheron, "Principles of Geostatistics" (Economic Geology, 1963). Founding paper of kriging theory.
- Cressie, Statistics for Spatial Data (Rev. ed., 1993), Chapters 3-4
- Buhmann, Radial Basis Functions: Theory and Implementations (2003), Chapters 1-3
- Wendland, Scattered Data Approximation (2005). Native spaces and convergence theory for RBF interpolation.
Current:
- Snelson & Ghahramani, "Sparse Gaussian Processes using Pseudo-inputs" (NeurIPS 2006). The FITC approximation named in the sparse-GP section.
- Titsias, "Variational Learning of Inducing Variables in Sparse Gaussian Processes" (AISTATS 2009)
- Hensman, Fusi, Lawrence, "Gaussian Processes for Big Data" (UAI 2013)
- Wilson & Adams, "Gaussian Process Kernels for Pattern Discovery and Extrapolation" (ICML 2013, arXiv:1302.4245). Spectral mixture kernels: structured kernel learning from data.
- Wilson, Hu, Salakhutdinov, Xing, "Deep Kernel Learning" (AISTATS 2016, arXiv:1511.02222). Composing neural-network feature extractors with GP output layers.
Next Topics
- Neural tangent kernel: the infinite-width neural network as a GP
- Cross-validation theory: alternative model selection when marginal likelihood is unavailable
Last reviewed: May 8, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Joint, Marginal, and Conditional Distributions (layer 0 · tier 1)
- Gram Matrices and Kernel Matrices (layer 1 · tier 1)
- Ridge Regression (layer 1 · tier 1)
- Bayesian Linear Regression (layer 2 · tier 1)
- The Kernel Trick (layer 2 · tier 1)
Derived topics
- Neural Tangent Kernel: Lazy Training, Kernel Equivalence, μP, and the Limits of Width (layer 4 · tier 1)
- Cross-Validation Theory (layer 2 · tier 2)
- Bayesian Optimization for Hyperparameters (layer 3 · tier 2)
- Gaussian Processes in Astronomy (layer 4 · tier 3)