
Numerical Optimization

Bayesian Optimization for Hyperparameters

Hyperparameter tuning as black-box optimization with a Gaussian process surrogate. Acquisition functions (EI, UCB, PI), sample efficiency for expensive evaluations, TPE as an alternative, and why grid search is wasteful.

Advanced · Tier 2 · Stable · Supporting · ~50 min

Why This Matters

Training a neural network takes hours or days. Evaluating a single hyperparameter configuration requires a full training run. You cannot afford thousands of evaluations. Bayesian optimization uses a probabilistic model of the objective function to choose which configurations to evaluate next, achieving good results in far fewer evaluations than grid or random search.

The key insight: after each evaluation, you have information about the objective surface. Use that information to decide where to look next, rather than searching blindly.

Problem Setup

Definition

Hyperparameter Optimization as Black-Box Optimization

Let $f: \mathcal{X} \to \mathbb{R}$ be the objective function (e.g., validation loss as a function of hyperparameters). We want to find:

$$x^* = \arg\min_{x \in \mathcal{X}} f(x)$$

We can only evaluate $f$ at chosen points. Each evaluation is expensive (a full training run). We do not have access to $\nabla f$. The function may be noisy (validation loss varies across random seeds).

The Bayesian optimization loop:

  1. Fit a surrogate model to all observations $\{(x_i, y_i)\}_{i=1}^t$
  2. Use the surrogate to compute an acquisition function $\alpha(x)$
  3. Evaluate $f$ at $x_{t+1} = \arg\max_x \alpha(x)$
  4. Add $(x_{t+1}, y_{t+1})$ to the observations. Repeat.

The most important practical distinction is that there are two optimization problems in the loop. The outer problem is the expensive one you really care about: train a model and read off validation loss. The inner problem is cheap: optimize the acquisition function to decide which expensive run is still worth paying for.
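The whole loop fits in a short script. A minimal sketch, assuming a 1-D search space, an RBF-kernel GP surrogate with a fixed length scale, Expected Improvement as the acquisition rule, and the Forrester test function standing in for a real training run; all of these choices are illustrative.

```python
import numpy as np
from math import erf, sqrt, pi

def objective(x):
    # Forrester test function: a stand-in for an expensive training run.
    return (6 * x - 2) ** 2 * np.sin(12 * x - 4)

def rbf(a, b, length=0.15):
    # Squared-exponential kernel matrix between 1-D point sets.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # GP posterior mean and std at test points Xs given observations (X, y).
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ y
    var = np.clip(1.0 - np.sum(Ks * sol, axis=0), 1e-12, None)
    return mu, np.sqrt(var)

def expected_improvement(mu, sigma, f_best):
    # Closed-form EI for minimization.
    z = (f_best - mu) / sigma
    Phi = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2)))
    phi = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)
    return (f_best - mu) * Phi + sigma * phi

rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 200)      # inner problem: cheap argmax over a grid
X = rng.uniform(0, 1, 3)           # three initial "training runs"
y = objective(X)
for _ in range(10):                # outer problem: ten more expensive runs
    mu, sigma = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X, y = np.append(X, x_next), np.append(y, objective(x_next))
print(f"best x = {X[np.argmin(y)]:.3f}, best value = {y.min():.3f}")
```

Note the asymmetry described above: `objective` is called only 13 times, while the acquisition function is re-evaluated over the whole grid at every step.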

The Surrogate Model: Gaussian Process

A Gaussian process provides a posterior distribution over $f$ given observations. After observing $t$ evaluations, the GP posterior gives a mean prediction $\mu_t(x)$ and uncertainty $\sigma_t(x)$ at any point $x$:

$$f(x) \mid \mathcal{D}_t \sim \mathcal{N}(\mu_t(x), \sigma_t^2(x))$$

The mean $\mu_t(x)$ is the best estimate of $f(x)$. The variance $\sigma_t^2(x)$ is high where we have few observations (uncertain regions) and low near observed points.
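The claim about where the variance is large can be checked directly. A minimal numpy sketch of the standard GP regression equations, assuming an RBF kernel with unit prior variance and near-noise-free observations (all values are illustrative):

```python
import numpy as np

def rbf(a, b, length=0.2):
    # Squared-exponential kernel matrix between 1-D point sets a and b.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_posterior(X, y, Xs, noise=1e-8):
    # Standard GP regression: posterior mean and std at test points Xs.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    sol = np.linalg.solve(K, Ks)             # K^{-1} k(X, Xs)
    mu = sol.T @ y
    var = 1.0 - np.sum(Ks * sol, axis=0)     # prior variance k(x, x) = 1
    return mu, np.sqrt(np.clip(var, 0.0, None))

X = np.array([0.2, 0.5, 0.8])   # observed configurations
y = np.array([1.0, 0.3, 0.7])   # observed validation losses
mu, sigma = gp_posterior(X, y, np.array([0.5, 0.05]))
# At an observed point the mean matches the data and the std collapses;
# far from the data the std approaches the prior std of 1.
print(mu, sigma)
```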

Embedded Figure

Bayesian optimization pays for one more expensive run only where it still learns something.

The surrogate model estimates validation loss across the search space. The acquisition rule then asks a second question: where is the combination of low predicted loss and unresolved uncertainty still worth another full training run? (Axes: validation loss over a 1-D hyperparameter line; legend: observed runs, posterior mean, uncertainty band, current best.)
Acquisition rules shown in the figure:

  • Expected Improvement (next x ≈ 0.61): spends budget where the mean already looks promising and the uncertainty is still wide enough to pay for another run.
  • Upper Confidence Bound (next x ≈ 0.59): moves farther into uncertain regions as the exploration weight grows, which is why it comes with regret theory but also a tuning knob.
  • Probability of Improvement (next x ≈ 0.70): asks only whether improvement is likely, so it often hugs the incumbent and can get greedy too early.

Why this beats grid search

The point is not that the GP is magically correct. The point is that the optimizer keeps carrying forward information from earlier runs instead of spending the budget on regions that already look clearly bad.

Where the method breaks

If the search space is high-dimensional, heavily categorical, or dominated by random-seed noise, the surrogate can become less informative than a strong random-search baseline.

Acquisition Functions

The acquisition function balances exploitation (evaluate where $\mu_t(x)$ is low, i.e., the current best guess) with exploration (evaluate where $\sigma_t(x)$ is high, i.e., uncertain regions that might contain the optimum).

Proposition

Expected Improvement Closed Form

Statement

The Expected Improvement acquisition function has a closed-form expression under a GP posterior:

$$\text{EI}(x) = \mathbb{E}[\max(f_{\text{best}} - f(x), 0)] = (f_{\text{best}} - \mu_t(x))\Phi(Z) + \sigma_t(x)\phi(Z)$$

where $Z = (f_{\text{best}} - \mu_t(x))/\sigma_t(x)$, and $\Phi$, $\phi$ are the standard normal CDF and PDF respectively. When $\sigma_t(x) = 0$, $\text{EI}(x) = 0$.

Intuition

EI asks: "In expectation, how much better than the current best would an evaluation at $x$ be?" The first term rewards points predicted to be good ($\mu_t(x) < f_{\text{best}}$). The second term rewards uncertainty (large $\sigma_t(x)$), because uncertain points might be much better than predicted. Both exploitation and exploration emerge from a single principled criterion.
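The closed form is easy to sanity-check against a Monte Carlo estimate of $\mathbb{E}[\max(f_{\text{best}} - f(x), 0)]$; the particular $\mu_t$, $\sigma_t$, and $f_{\text{best}}$ values below are arbitrary:

```python
import math
import numpy as np

def expected_improvement(mu, sigma, f_best):
    # Closed-form EI for minimization; defined as 0 when sigma == 0.
    if sigma == 0.0:
        return 0.0
    z = (f_best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))
    phi = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    return (f_best - mu) * Phi + sigma * phi

mu, sigma, f_best = 0.4, 0.3, 0.5
ei_closed = expected_improvement(mu, sigma, f_best)

# Monte Carlo check: sample f(x) ~ N(mu, sigma^2), average the improvement.
rng = np.random.default_rng(0)
samples = rng.normal(mu, sigma, 200_000)
ei_mc = np.maximum(f_best - samples, 0.0).mean()

print(f"closed form: {ei_closed:.4f}, Monte Carlo: {ei_mc:.4f}")
```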

Proof Sketch

$\text{EI}(x) = \int_{-\infty}^{f_{\text{best}}} (f_{\text{best}} - f) \cdot \mathcal{N}(f; \mu_t(x), \sigma_t^2(x)) \, df$. Substitute $z = (f - \mu_t(x))/\sigma_t(x)$ and split into two standard Gaussian integrals. The result follows from the identities $\int_{-\infty}^a z\phi(z)\,dz = -\phi(a)$ and $\int_{-\infty}^a \phi(z)\,dz = \Phi(a)$.
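Spelled out, with $a = Z$ as the standardized improvement threshold:

```latex
\begin{aligned}
\text{EI}(x)
&= \int_{-\infty}^{f_{\text{best}}} (f_{\text{best}} - f)\,
   \mathcal{N}(f; \mu_t(x), \sigma_t^2(x))\,df
   && f = \mu_t(x) + \sigma_t(x)\,z \\
&= \int_{-\infty}^{a} \bigl(f_{\text{best}} - \mu_t(x) - \sigma_t(x)\,z\bigr)\,\phi(z)\,dz
   && a = \tfrac{f_{\text{best}} - \mu_t(x)}{\sigma_t(x)} \\
&= (f_{\text{best}} - \mu_t(x)) \int_{-\infty}^{a} \phi(z)\,dz
   \;-\; \sigma_t(x) \int_{-\infty}^{a} z\,\phi(z)\,dz \\
&= (f_{\text{best}} - \mu_t(x))\,\Phi(a) \;+\; \sigma_t(x)\,\phi(a).
\end{aligned}
```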

Why It Matters

EI is the most widely used acquisition function in practice. Its closed form makes it cheap to evaluate and optimize. It naturally balances exploration and exploitation without requiring a tuning parameter (unlike UCB).

Failure Mode

EI can be myopic: it optimizes the expected improvement for the next single evaluation, not for a budget of $B$ remaining evaluations. The inner maximization can also get stuck in local optima, since the acquisition surface is typically multimodal.

Upper Confidence Bound (UCB): for minimization, choose $x_{t+1} = \arg\min_x (\mu_t(x) - \beta_t \sigma_t(x))$, where $\beta_t$ is a parameter controlling exploration. Larger $\beta_t$ means more exploration.

Probability of Improvement (PI): choose the point most likely to improve on $f_{\text{best}}$: $\text{PI}(x) = \Phi(Z)$, with the same $Z$ as in EI. PI ignores the magnitude of improvement and tends to over-exploit in practice.

In real hyperparameter tuning, the winner is often not the acquisition function with the prettiest formula. It is the one whose behavior matches your budget and noise level. EI is a strong default because it usually balances "likely good" against "still unresolved." UCB is easier to analyze theoretically because the exploration weight is explicit. PI is easy to compute, but it is usually the first one to get stuck hugging the current incumbent.
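The behavioral differences show up even on a single synthetic posterior. A sketch in which the mean and uncertainty curves are invented for illustration (the exploration weight $\beta = 2$ is likewise arbitrary):

```python
import numpy as np
from math import erf, sqrt, pi

def Phi(z):
    # Standard normal CDF, vectorized via math.erf.
    return 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2)))

def phi(z):
    # Standard normal PDF.
    return np.exp(-0.5 * z ** 2) / sqrt(2 * pi)

x = np.linspace(0, 1, 401)
mu = 0.5 * np.cos(6 * x) + 0.5          # invented posterior mean (loss)
sigma = 0.05 + 0.4 * np.abs(x - 0.5)    # invented uncertainty, widest at edges
f_best = mu.min()                       # incumbent best loss

z = (f_best - mu) / sigma
ei = (f_best - mu) * Phi(z) + sigma * phi(z)   # Expected Improvement
pi_acq = Phi(z)                                # Probability of Improvement
lcb = mu - 2.0 * sigma                         # UCB rule for minimization

print("EI picks :", x[np.argmax(ei)])
print("PI picks :", x[np.argmax(pi_acq)])
print("UCB picks:", x[np.argmin(lcb)])
```

On this posterior, PI sits exactly on the incumbent, while the UCB rule drifts toward the wider uncertainty band, matching the qualitative descriptions above.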

Regret Bounds

Theorem

GP-UCB Cumulative Regret Bound

Statement

Let $f$ be a function in the RKHS of kernel $k$ with RKHS norm at most $B$. Running GP-UCB with $\beta_t = B + 4\sigma\sqrt{2(\gamma_t + 1 + \ln(1/\delta))}$ for $T$ rounds, the cumulative regret satisfies, with probability at least $1 - \delta$:

$$R_T = \sum_{t=1}^T (f(x_t) - f(x^*)) \leq O(\sqrt{T \beta_T \gamma_T})$$

where $\gamma_T$ is the maximum information gain after $T$ observations, which depends on the kernel. For the squared exponential kernel in $d$ dimensions, $\gamma_T = O((\log T)^{d+1})$.

Intuition

The regret grows sublinearly in $T$: the average regret per evaluation goes to zero. The rate depends on the kernel's information gain $\gamma_T$, which measures how quickly the GP learns about $f$. Smoother functions (smaller $\gamma_T$) are easier to optimize.

Proof Sketch

At each round, the UCB acquisition rule ensures that $f(x_t) \leq \mu_{t-1}(x_t) + \beta_t \sigma_{t-1}(x_t)$ with high probability. The regret at round $t$ is therefore bounded by $2\beta_t \sigma_{t-1}(x_t)$. Summing and applying Cauchy-Schwarz gives $R_T \leq 2\beta_T \sqrt{T \sum_t \sigma_{t-1}^2(x_t)}$. The sum of predictive variances is bounded, up to a constant, by the information gain $\gamma_T$.

Why It Matters

This provides a theoretical guarantee for Bayesian optimization. It shows that GP-UCB finds near-optimal hyperparameters in a number of evaluations governed by the smoothness of the objective (via $\gamma_T$), rather than growing exponentially in the dimension of the search space, as grid search does.

Failure Mode

The bound requires $f$ to lie in the RKHS of the chosen kernel. If the kernel is misspecified (e.g., using a smooth SE kernel for a non-smooth objective), the bound does not hold. The bound also grows with the input dimension $d$ through $\gamma_T$, making Bayesian optimization less efficient for high-dimensional search spaces (typically $d > 20$).

Tree-Structured Parzen Estimator (TPE)

TPE (Bergstra et al., 2011) models $p(x \mid y)$ instead of $p(y \mid x)$. It splits the observed hyperparameters into "good" ($y < y^*$) and "bad" ($y \geq y^*$) groups, fits kernel density estimators $l(x)$ and $g(x)$ to each, and proposes the point maximizing the ratio $l(x)/g(x)$.

Advantage over GP: TPE handles categorical and conditional hyperparameters naturally (e.g., "use Adam" vs. "use SGD", where Adam has a parameter $\beta_1$ that SGD does not).

Disadvantage: TPE lacks the theoretical guarantees of GP-UCB. The density estimation can be poor with few observations.
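The split-and-ratio idea can be sketched in a few lines of numpy. Everything here is an illustrative choice: the quadratic toy loss, the Gaussian kernel bandwidth, and the 0.25 split quantile:

```python
import numpy as np

def kde(points, x, bandwidth=0.1):
    # Gaussian kernel density estimate of `points`, evaluated at `x`.
    diffs = (x[:, None] - points[None, :]) / bandwidth
    return np.exp(-0.5 * diffs ** 2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 40)                      # observed hyperparameter values
y = (X - 0.3) ** 2 + rng.normal(0, 0.01, 40)   # noisy toy losses, best near 0.3

# Split at the gamma-quantile of observed losses (gamma = 0.25 here).
thresh = np.quantile(y, 0.25)
good, bad = X[y < thresh], X[y >= thresh]

grid = np.linspace(0, 1, 201)
l, g = kde(good, grid), kde(bad, grid)
x_next = grid[np.argmax(l / (g + 1e-12))]      # maximize the density ratio
print(f"TPE proposes x = {x_next:.2f}")
```

The proposal lands where the "good" density dominates the "bad" one, i.e., near the basin of the toy loss.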

Why Grid Search is Wasteful

Grid search evaluates a regular grid over the hyperparameter space. For $d$ hyperparameters with $m$ values each, it requires $m^d$ evaluations. With $d = 5$ and $m = 10$, this is 100,000 evaluations.

Bergstra & Bengio (2012) showed that random search outperforms grid search in the same budget because: (1) in most problems, only a few hyperparameters matter, and (2) grid search wastes evaluations exploring unimportant dimensions. Random search projects uniformly onto each dimension, giving better coverage of the important ones.
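The projection argument is easy to make concrete: with the same budget, grid search gives each dimension only a handful of distinct values, while random search gives a fresh value on every draw. A small sketch (the budget and dimensionality are arbitrary):

```python
import numpy as np

d, budget = 3, 30
m = int(budget ** (1 / d))       # grid resolution per dimension: 3
grid_evals = m ** d              # 27 runs, only 3 distinct values per dimension

rng = np.random.default_rng(0)
random_points = rng.uniform(0, 1, (budget, d))
# Projecting random points onto any single axis gives `budget` distinct
# values with probability 1 for continuous draws.
distinct_per_dim = len(np.unique(random_points[:, 0]))

print(f"grid: {grid_evals} runs, {m} distinct values per dimension")
print(f"random: {budget} runs, {distinct_per_dim} distinct values per dimension")
```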

Bayesian optimization further improves on random search by using past results to guide future evaluations.

Common Confusions

Watch Out

Bayesian optimization is not Bayesian hyperparameter inference

Bayesian optimization uses a Bayesian model (the GP) as a tool for optimization. It does not compute a posterior over hyperparameters. The GP models the objective function $f$, not a hyperparameter distribution. The output is a point estimate $x^*$, not a posterior.

Watch Out

GP-UCB and EI are not equivalent

Both balance exploration and exploitation, but differently. UCB has an explicit exploration parameter $\beta_t$ that must be set. EI has no explicit parameter but can be myopic. They lead to different evaluation sequences and different theoretical guarantees.

Canonical Examples

Example

Tuning learning rate and weight decay

Objective: validation loss of a ResNet-50 as a function of learning rate $\eta \in [10^{-5}, 10^{-1}]$ (log scale) and weight decay $\lambda \in [10^{-6}, 10^{-2}]$ (log scale). Fit a GP with a Matern-5/2 kernel over the log-transformed space. After 20 evaluations (each a full ImageNet training run), Bayesian optimization typically finds a configuration within 0.5% of the best found by 200 random-search evaluations.

Summary

  • Bayesian optimization uses a GP surrogate to model the objective function
  • Acquisition functions (EI, UCB, PI) decide where to evaluate next
  • EI has a closed form and balances exploration/exploitation without tuning
  • GP-UCB has sublinear regret guarantees depending on kernel smoothness
  • TPE handles categorical hyperparameters but lacks theoretical guarantees
  • Grid search is exponential in dimension; random search is a better baseline
  • Bayesian optimization is most valuable when each evaluation is expensive

Exercises

ExerciseCore

Problem

You have budget for 30 evaluations of a function with 3 continuous hyperparameters. Compare the number of distinct values each hyperparameter gets under grid search ($\lfloor 30^{1/3} \rfloor = 3$ values each, giving 27 evaluations) versus random search (30 random points). Which gives better coverage per dimension?

ExerciseAdvanced

Problem

Derive the Expected Improvement when $\sigma_t(x) \to 0$ and when $\mu_t(x) = f_{\text{best}}$. Interpret both cases.

ExerciseResearch

Problem

You are tuning a large language model with a search space containing optimizer choice, learning-rate schedule, weight decay, warmup length, batch size, and several categorical architecture knobs. Seed-to-seed noise is large. Why can a careful random-search baseline still beat Bayesian optimization here even if the BO implementation is mathematically correct?

References

Canonical and technical:

  • Jones, Schonlau, and Welch, "Efficient Global Optimization of Expensive Black-Box Functions" (1998)
  • Srinivas et al., "Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design" (2010), Sections 2-5
  • Bergstra and Bengio, "Random Search for Hyper-Parameter Optimization" (2012)

Practice and synthesis:

  • Bergstra et al., "Algorithms for Hyper-Parameter Optimization" (TPE, 2011)
  • Snoek, Larochelle, and Adams, "Practical Bayesian Optimization of Machine Learning Algorithms" (2012)
  • Shahriari et al., "Taking the Human Out of the Loop: A Review of Bayesian Optimization" (2016), Sections 2-5
  • Frazier, "A Tutorial on Bayesian Optimization" (2018), Sections 1-3


Last reviewed: April 23, 2026
