
Numerical Optimization

Bayesian Optimization for Hyperparameters

Hyperparameter tuning as black-box optimization with a Gaussian process surrogate. Acquisition functions (EI, UCB, PI), sample efficiency for expensive evaluations, TPE as an alternative, and why grid search is wasteful.

Advanced · Tier 2 · Stable · Supporting · ~50 min

Why This Matters

Training a neural network takes hours or days. Evaluating a single hyperparameter configuration requires a full training run. You cannot afford thousands of evaluations. Bayesian optimization uses a probabilistic model of the objective function to choose which configurations to evaluate next, achieving good results in far fewer evaluations than grid or random search.

The key insight: after each evaluation, you have information about the objective surface. Use that information to decide where to look next, rather than searching blindly.

Problem Setup

Definition

Hyperparameter Optimization as Black-Box Optimization

Let $f: \mathcal{X} \to \mathbb{R}$ be the objective function (e.g., validation loss as a function of hyperparameters). We want to find:

$$x^* = \arg\min_{x \in \mathcal{X}} f(x)$$

We can only evaluate $f$ at chosen points. Each evaluation is expensive (a full training run). We do not have access to $\nabla f$. The function may be noisy (validation loss varies across random seeds).

The Bayesian optimization loop:

  1. Fit a surrogate model to all observations $\{(x_i, y_i)\}_{i=1}^t$
  2. Use the surrogate to compute an acquisition function $\alpha(x)$
  3. Evaluate $f$ at $x_{t+1} = \arg\max_x \alpha(x)$
  4. Add $(x_{t+1}, y_{t+1})$ to the observations. Repeat.

The most important practical distinction is that there are two optimization problems in the loop. The outer problem is the expensive one you really care about: train a model and read off validation loss. The inner problem is cheap: optimize the acquisition function to decide which expensive run is still worth paying for.
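The whole loop fits in a short script. A minimal sketch, assuming a 1-D search space, an RBF-kernel GP surrogate with a fixed length scale, Expected Improvement as the acquisition rule, and the Forrester test function standing in for a real training run; all of these choices are illustrative.

```python
import numpy as np
from math import erf, sqrt, pi

def objective(x):
    # Forrester test function: a stand-in for an expensive training run.
    return (6 * x - 2) ** 2 * np.sin(12 * x - 4)

def rbf(a, b, length=0.15):
    # Squared-exponential kernel matrix between 1-D point sets.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # GP posterior mean and std at test points Xs given observations (X, y).
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ y
    var = np.clip(1.0 - np.sum(Ks * sol, axis=0), 1e-12, None)
    return mu, np.sqrt(var)

def expected_improvement(mu, sigma, f_best):
    # Closed-form EI for minimization.
    z = (f_best - mu) / sigma
    Phi = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2)))
    phi = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)
    return (f_best - mu) * Phi + sigma * phi

rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 200)      # inner problem: cheap argmax over a grid
X = rng.uniform(0, 1, 3)           # three initial "training runs"
y = objective(X)
for _ in range(10):                # outer problem: ten more expensive runs
    mu, sigma = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X, y = np.append(X, x_next), np.append(y, objective(x_next))
print(f"best x = {X[np.argmin(y)]:.3f}, best value = {y.min():.3f}")
```

Note the asymmetry described above: `objective` is called only 13 times, while the acquisition function is re-evaluated over the whole grid at every step.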

The Surrogate Model: Gaussian Process

A Gaussian process provides a posterior distribution over $f$ given observations. After observing $t$ evaluations, the GP posterior gives a mean prediction $\mu_t(x)$ and uncertainty $\sigma_t(x)$ at any point $x$:

$$f(x) \mid \mathcal{D}_t \sim \mathcal{N}(\mu_t(x), \sigma_t^2(x))$$

The mean $\mu_t(x)$ is the best estimate of $f(x)$. The variance $\sigma_t^2(x)$ is high where we have few observations (uncertain regions) and low near observed points.
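The claim about where the variance is large can be checked directly. A minimal numpy sketch of the standard GP regression equations, assuming an RBF kernel with unit prior variance and near-noise-free observations (all values are illustrative):

```python
import numpy as np

def rbf(a, b, length=0.2):
    # Squared-exponential kernel matrix between 1-D point sets a and b.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_posterior(X, y, Xs, noise=1e-8):
    # Standard GP regression: posterior mean and std at test points Xs.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    sol = np.linalg.solve(K, Ks)             # K^{-1} k(X, Xs)
    mu = sol.T @ y
    var = 1.0 - np.sum(Ks * sol, axis=0)     # prior variance k(x, x) = 1
    return mu, np.sqrt(np.clip(var, 0.0, None))

X = np.array([0.2, 0.5, 0.8])   # observed configurations
y = np.array([1.0, 0.3, 0.7])   # observed validation losses
mu, sigma = gp_posterior(X, y, np.array([0.5, 0.05]))
# At an observed point the mean matches the data and the std collapses;
# far from the data the std approaches the prior std of 1.
print(mu, sigma)
```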

Embedded Figure

Bayesian optimization pays for one more expensive run only where it still learns something.

The surrogate model estimates validation loss across the search space. The acquisition rule then asks a second question: where is the combination of low predicted loss and unresolved uncertainty still worth another full training run? (Axes: validation loss over a 1-D hyperparameter line; legend: observed runs, posterior mean, uncertainty band, current best.)
Acquisition rules shown in the figure:

  • Expected Improvement (next x ≈ 0.61): spends budget where the mean already looks promising and the uncertainty is still wide enough to pay for another run.
  • Upper Confidence Bound (next x ≈ 0.59): moves farther into uncertain regions as the exploration weight grows, which is why it comes with regret theory but also a tuning knob.
  • Probability of Improvement (next x ≈ 0.70): asks only whether improvement is likely, so it often hugs the incumbent and can get greedy too early.

Why this beats grid search

The point is not that the GP is magically correct. The point is that the optimizer keeps carrying forward information from earlier runs instead of spending the budget on regions that already look clearly bad.

Where the method breaks

If the search space is high-dimensional, heavily categorical, or dominated by random-seed noise, the surrogate can become less informative than a strong random-search baseline.

Acquisition Functions

The acquisition function balances exploitation (evaluate where $\mu_t(x)$ is low, i.e., the current best guess) with exploration (evaluate where $\sigma_t(x)$ is high, i.e., uncertain regions that might contain the optimum).

Proposition

Expected Improvement Closed Form

Statement

The Expected Improvement acquisition function has a closed-form expression under a GP posterior:

$$\text{EI}(x) = \mathbb{E}[\max(f_{\text{best}} - f(x), 0)] = (f_{\text{best}} - \mu_t(x))\Phi(Z) + \sigma_t(x)\phi(Z)$$

where $Z = (f_{\text{best}} - \mu_t(x))/\sigma_t(x)$, and $\Phi$, $\phi$ are the standard normal CDF and PDF respectively. When $\sigma_t(x) = 0$, $\text{EI}(x) = 0$.

Intuition

EI asks: "In expectation, how much better than the current best would an evaluation at $x$ be?" The first term rewards points predicted to be good ($\mu_t(x) < f_{\text{best}}$). The second term rewards uncertainty (large $\sigma_t(x)$), because uncertain points might be much better than predicted. Both exploitation and exploration emerge from a single principled criterion.
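The closed form is easy to sanity-check against a Monte Carlo estimate of $\mathbb{E}[\max(f_{\text{best}} - f(x), 0)]$; the particular $\mu_t$, $\sigma_t$, and $f_{\text{best}}$ values below are arbitrary:

```python
import math
import numpy as np

def expected_improvement(mu, sigma, f_best):
    # Closed-form EI for minimization; defined as 0 when sigma == 0.
    if sigma == 0.0:
        return 0.0
    z = (f_best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))
    phi = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    return (f_best - mu) * Phi + sigma * phi

mu, sigma, f_best = 0.4, 0.3, 0.5
ei_closed = expected_improvement(mu, sigma, f_best)

# Monte Carlo check: sample f(x) ~ N(mu, sigma^2), average the improvement.
rng = np.random.default_rng(0)
samples = rng.normal(mu, sigma, 200_000)
ei_mc = np.maximum(f_best - samples, 0.0).mean()

print(f"closed form: {ei_closed:.4f}, Monte Carlo: {ei_mc:.4f}")
```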

Proof Sketch

$\text{EI}(x) = \int_{-\infty}^{f_{\text{best}}} (f_{\text{best}} - f) \cdot \mathcal{N}(f; \mu_t(x), \sigma_t^2(x)) \, df$. Substitute $z = (f - \mu_t(x))/\sigma_t(x)$ and split into two standard Gaussian integrals. The result follows from the identities $\int_{-\infty}^a z\phi(z)\,dz = -\phi(a)$ and $\int_{-\infty}^a \phi(z)\,dz = \Phi(a)$.
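Spelled out, with $a = Z$ as the standardized improvement threshold:

```latex
\begin{aligned}
\text{EI}(x)
&= \int_{-\infty}^{f_{\text{best}}} (f_{\text{best}} - f)\,
   \mathcal{N}(f; \mu_t(x), \sigma_t^2(x))\,df
   && f = \mu_t(x) + \sigma_t(x)\,z \\
&= \int_{-\infty}^{a} \bigl(f_{\text{best}} - \mu_t(x) - \sigma_t(x)\,z\bigr)\,\phi(z)\,dz
   && a = \tfrac{f_{\text{best}} - \mu_t(x)}{\sigma_t(x)} \\
&= (f_{\text{best}} - \mu_t(x)) \int_{-\infty}^{a} \phi(z)\,dz
   \;-\; \sigma_t(x) \int_{-\infty}^{a} z\,\phi(z)\,dz \\
&= (f_{\text{best}} - \mu_t(x))\,\Phi(a) \;+\; \sigma_t(x)\,\phi(a).
\end{aligned}
```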

Why It Matters

EI is the most widely used acquisition function in practice. Its closed form makes it cheap to evaluate and optimize. It naturally balances exploration and exploitation without requiring a tuning parameter (unlike UCB).

Failure Mode

EI can be myopic: it optimizes the expected improvement for the next single evaluation, not for a budget of $B$ remaining evaluations. The inner maximization can also get stuck in local optima, since the acquisition surface is typically multimodal.

Upper Confidence Bound (UCB): for minimization, choose $x_{t+1} = \arg\min_x (\mu_t(x) - \beta_t \sigma_t(x))$, where $\beta_t$ is a parameter controlling exploration. Larger $\beta_t$ means more exploration.

Probability of Improvement (PI): choose the point most likely to improve on $f_{\text{best}}$: $\text{PI}(x) = \Phi(Z)$, with the same $Z$ as in EI. PI ignores the magnitude of improvement and tends to over-exploit in practice.

In real hyperparameter tuning, the winner is often not the acquisition function with the prettiest formula. It is the one whose behavior matches your budget and noise level. EI is a strong default because it usually balances "likely good" against "still unresolved." UCB is easier to analyze theoretically because the exploration weight is explicit. PI is easy to compute, but it is usually the first one to get stuck hugging the current incumbent.
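The behavioral differences show up even on a single synthetic posterior. A sketch in which the mean and uncertainty curves are invented for illustration (the exploration weight $\beta = 2$ is likewise arbitrary):

```python
import numpy as np
from math import erf, sqrt, pi

def Phi(z):
    # Standard normal CDF, vectorized via math.erf.
    return 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2)))

def phi(z):
    # Standard normal PDF.
    return np.exp(-0.5 * z ** 2) / sqrt(2 * pi)

x = np.linspace(0, 1, 401)
mu = 0.5 * np.cos(6 * x) + 0.5          # invented posterior mean (loss)
sigma = 0.05 + 0.4 * np.abs(x - 0.5)    # invented uncertainty, widest at edges
f_best = mu.min()                       # incumbent best loss

z = (f_best - mu) / sigma
ei = (f_best - mu) * Phi(z) + sigma * phi(z)   # Expected Improvement
pi_acq = Phi(z)                                # Probability of Improvement
lcb = mu - 2.0 * sigma                         # UCB rule for minimization

print("EI picks :", x[np.argmax(ei)])
print("PI picks :", x[np.argmax(pi_acq)])
print("UCB picks:", x[np.argmin(lcb)])
```

On this posterior, PI sits exactly on the incumbent, while the UCB rule drifts toward the wider uncertainty band, matching the qualitative descriptions above.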

Regret Bounds

Theorem

GP-UCB Cumulative Regret Bound

Statement

Let $f$ be a function in the RKHS of kernel $k$ with RKHS norm at most $B$. Running GP-UCB with $\beta_t = B + 4\sigma\sqrt{2(\gamma_t + 1 + \ln(1/\delta))}$ for $T$ rounds, the cumulative regret satisfies, with probability at least $1 - \delta$:

$$R_T = \sum_{t=1}^T (f(x_t) - f(x^*)) \leq O(\sqrt{T \beta_T \gamma_T})$$

where $\gamma_T$ is the maximum information gain after $T$ observations, which depends on the kernel. For the squared exponential kernel in $d$ dimensions, $\gamma_T = O((\log T)^{d+1})$.

Intuition

The regret grows sublinearly in $T$: the average regret per evaluation goes to zero. The rate depends on the kernel's information gain $\gamma_T$, which measures how quickly the GP learns about $f$. Smoother functions (smaller $\gamma_T$) are easier to optimize.

Proof Sketch

At each round, the UCB acquisition rule ensures that $f(x_t) \leq \mu_{t-1}(x_t) + \beta_t \sigma_{t-1}(x_t)$ with high probability. The regret at round $t$ is therefore bounded by $2\beta_t \sigma_{t-1}(x_t)$. Summing and applying Cauchy-Schwarz gives $R_T \leq 2\beta_T \sqrt{T \sum_t \sigma_{t-1}^2(x_t)}$. The sum of predictive variances is bounded, up to a constant, by the information gain $\gamma_T$.

Why It Matters

This provides a theoretical guarantee for Bayesian optimization. It shows that GP-UCB finds near-optimal hyperparameters in a number of evaluations governed by the smoothness of the objective (via $\gamma_T$), rather than growing exponentially in the dimension of the search space, as grid search does.

Failure Mode

The bound requires $f$ to lie in the RKHS of the chosen kernel. If the kernel is misspecified (e.g., using a smooth SE kernel for a non-smooth objective), the bound does not hold. The bound also grows with the input dimension $d$ through $\gamma_T$, making Bayesian optimization less efficient for high-dimensional search spaces (typically $d > 20$).

Tree-Structured Parzen Estimator (TPE)

TPE (Bergstra et al., 2011) models $p(x \mid y)$ instead of $p(y \mid x)$. It splits the observed hyperparameters into "good" ($y < y^*$) and "bad" ($y \geq y^*$) groups, fits kernel density estimators $l(x)$ and $g(x)$ to each, and proposes the point maximizing the ratio $l(x)/g(x)$.

Advantage over GP: TPE handles categorical and conditional hyperparameters naturally (e.g., "use Adam" vs. "use SGD", where Adam has a parameter $\beta_1$ that SGD does not).

Disadvantage: TPE lacks the theoretical guarantees of GP-UCB. The density estimation can be poor with few observations.
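The split-and-ratio idea can be sketched in a few lines of numpy. Everything here is an illustrative choice: the quadratic toy loss, the Gaussian kernel bandwidth, and the 0.25 split quantile:

```python
import numpy as np

def kde(points, x, bandwidth=0.1):
    # Gaussian kernel density estimate of `points`, evaluated at `x`.
    diffs = (x[:, None] - points[None, :]) / bandwidth
    return np.exp(-0.5 * diffs ** 2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 40)                      # observed hyperparameter values
y = (X - 0.3) ** 2 + rng.normal(0, 0.01, 40)   # noisy toy losses, best near 0.3

# Split at the gamma-quantile of observed losses (gamma = 0.25 here).
thresh = np.quantile(y, 0.25)
good, bad = X[y < thresh], X[y >= thresh]

grid = np.linspace(0, 1, 201)
l, g = kde(good, grid), kde(bad, grid)
x_next = grid[np.argmax(l / (g + 1e-12))]      # maximize the density ratio
print(f"TPE proposes x = {x_next:.2f}")
```

The proposal lands where the "good" density dominates the "bad" one, i.e., near the basin of the toy loss.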

Why Grid Search is Wasteful

Grid search evaluates a regular grid over the hyperparameter space. For $d$ hyperparameters with $m$ values each, it requires $m^d$ evaluations. With $d = 5$ and $m = 10$, this is 100,000 evaluations.

Bergstra & Bengio (2012) showed that random search outperforms grid search in the same budget because: (1) in most problems, only a few hyperparameters matter, and (2) grid search wastes evaluations exploring unimportant dimensions. Random search projects uniformly onto each dimension, giving better coverage of the important ones.
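The projection argument is easy to make concrete: with the same budget, grid search gives each dimension only a handful of distinct values, while random search gives a fresh value on every draw. A small sketch (the budget and dimensionality are arbitrary):

```python
import numpy as np

d, budget = 3, 30
m = int(budget ** (1 / d))       # grid resolution per dimension: 3
grid_evals = m ** d              # 27 runs, only 3 distinct values per dimension

rng = np.random.default_rng(0)
random_points = rng.uniform(0, 1, (budget, d))
# Projecting random points onto any single axis gives `budget` distinct
# values with probability 1 for continuous draws.
distinct_per_dim = len(np.unique(random_points[:, 0]))

print(f"grid: {grid_evals} runs, {m} distinct values per dimension")
print(f"random: {budget} runs, {distinct_per_dim} distinct values per dimension")
```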

Bayesian optimization further improves on random search by using past results to guide future evaluations.

Common Confusions

Watch Out

Bayesian optimization is not Bayesian hyperparameter inference

Bayesian optimization uses a Bayesian model (the GP) as a tool for optimization. It does not compute a posterior over hyperparameters. The GP models the objective function $f$, not a hyperparameter distribution. The output is a point estimate $x^*$, not a posterior.

Watch Out

GP-UCB and EI are not equivalent

Both balance exploration and exploitation, but differently. UCB has an explicit exploration parameter $\beta_t$ that must be set. EI has no explicit parameter but can be myopic. They lead to different evaluation sequences and different theoretical guarantees.

Canonical Examples

Example

Tuning learning rate and weight decay

Objective: validation loss of a ResNet-50 as a function of learning rate $\eta \in [10^{-5}, 10^{-1}]$ (log scale) and weight decay $\lambda \in [10^{-6}, 10^{-2}]$ (log scale). Fit a GP with a Matern-5/2 kernel over the log-transformed space. After 20 evaluations (each a full ImageNet training run), Bayesian optimization typically finds a configuration within 0.5% of the best found by 200 random-search evaluations.

Summary

  • Bayesian optimization uses a GP surrogate to model the objective function
  • Acquisition functions (EI, UCB, PI) decide where to evaluate next
  • EI has a closed form and balances exploration/exploitation without tuning
  • GP-UCB has sublinear regret guarantees depending on kernel smoothness
  • TPE handles categorical hyperparameters but lacks theoretical guarantees
  • Grid search is exponential in dimension; random search is a better baseline
  • Bayesian optimization is most valuable when each evaluation is expensive

Exercises

ExerciseCore

Problem

You have budget for 30 evaluations of a function with 3 continuous hyperparameters. Compare the number of distinct values each hyperparameter gets under grid search ($\lfloor 30^{1/3} \rfloor = 3$ values each, giving 27 evaluations) versus random search (30 random points). Which gives better coverage per dimension?

ExerciseAdvanced

Problem

Derive the Expected Improvement when $\sigma_t(x) \to 0$ and when $\mu_t(x) = f_{\text{best}}$. Interpret both cases.

ExerciseResearch

Problem

You are tuning a large language model with a search space containing optimizer choice, learning-rate schedule, weight decay, warmup length, batch size, and several categorical architecture knobs. Seed-to-seed noise is large. Why can a careful random-search baseline still beat Bayesian optimization here even if the BO implementation is mathematically correct?

References

Canonical and technical:

  • Jones, Schonlau, and Welch, "Efficient Global Optimization of Expensive Black-Box Functions" (1998)
  • Srinivas et al., "Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design" (2010), Sections 2-5
  • Bergstra and Bengio, "Random Search for Hyper-Parameter Optimization" (2012)

Practice and synthesis:

  • Bergstra et al., "Algorithms for Hyper-Parameter Optimization" (TPE, 2011)
  • Snoek, Larochelle, and Adams, "Practical Bayesian Optimization of Machine Learning Algorithms" (2012)
  • Shahriari et al., "Taking the Human Out of the Loop: A Review of Bayesian Optimization" (2016), Sections 2-5
  • Frazier, "A Tutorial on Bayesian Optimization" (2018), Sections 1-3


Last reviewed: April 23, 2026
