# Baby introduction to Bayesian Hyperparameter Optimization for Machine Learning

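Hyperparameters, in contrast to model parameters, are set by the ML engineer before training. Hyperparameter optimization is the search for the hyperparameter values that give an algorithm its best performance, typically measured on a validation set. In equation form:

$$x^* = \arg\min_{x \in \mathcal{X}} f(x)$$

Here f(x) is the objective score to minimize (for example, validation error), x ranges over the search domain X, and x* is the set of hyperparameters that yields the lowest score.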

Evaluating the objective function is extremely expensive, especially when K-fold cross-validation is an intermediate step, because every evaluation then means training the model several times. Methods like grid search and random search are inefficient because they are completely uninformed by past evaluations and spend a significant amount of time evaluating “bad” hyperparameters. Bayesian optimization instead uses the results of past evaluations to decide which hyperparameters to try next, and thus reduces the compute time.
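To get a feel for the cost, here is a small sketch that counts how many training runs a grid search needs. The search space and the number of folds are made-up values chosen only for illustration.

```python
from itertools import product

# Hypothetical search space: three hyperparameters, a few candidate values each.
grid = {
    "learning_rate": [0.001, 0.01, 0.1, 1.0],
    "max_depth": [2, 4, 6, 8, 10],
    "n_estimators": [50, 100, 200, 400],
}
k_folds = 5  # each evaluation of the objective trains the model k_folds times

# Grid search fixes every combination up front, before seeing any result.
combinations = list(product(*grid.values()))
n_evaluations = len(combinations)       # calls to the objective function
n_model_fits = n_evaluations * k_folds  # individual training runs

print(n_evaluations, n_model_fits)
```

Note that nothing in this loop looks at earlier scores: a combination that is obviously hopeless after the first few results still gets its full share of training runs.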

Bayesian optimization keeps track of past evaluation results and uses them to form a probabilistic model that maps hyperparameters to a probability of a score on the objective function. This probabilistic model is called a “surrogate” for the objective function and is represented as p(y | x), where x is a set of hyperparameters and y is the resulting score.

This surrogate is much cheaper to evaluate, and therefore much easier to optimize, than the objective function. So, in Bayesian optimization, we:

1) Build a surrogate probability model of the objective function

2) Find the hyperparameters that perform best on the surrogate

3) Apply these hyperparameters to the true objective function

4) Update the surrogate model incorporating the new results

5) Repeat steps 2–4 until the maximum number of iterations or the time limit is reached
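The five steps above can be sketched in plain Python. This is a toy illustration under several assumptions of my own, not a reference implementation: the objective is a cheap one-dimensional function standing in for an expensive validation loss, the surrogate is a small hand-written Gaussian process with a squared-exponential kernel, and "performs best on the surrogate" is taken to mean the lowest lower confidence bound (predicted mean minus `kappa` times predicted standard deviation). All function and parameter names are hypothetical. A real project would use a dedicated library instead.

```python
import math

def objective(x):
    # Toy stand-in for an expensive validation loss; its minimum is at x = 2.
    return (x - 2.0) ** 2

def rbf(a, b, length=1.0):
    # Squared-exponential kernel: nearby hyperparameters get similar scores.
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    # Solve A z = b by Gaussian elimination with partial pivoting.
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def fit_surrogate(xs, ys, jitter=1e-6):
    # Steps 1 and 4: build p(y | x) from every evaluation seen so far.
    mu = sum(ys) / len(ys)
    K = [[rbf(a, b) + (jitter if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    alpha = solve(K, [y - mu for y in ys])

    def predict(x):
        # Posterior mean and standard deviation of the score at x.
        k = [rbf(x, a) for a in xs]
        mean = mu + sum(ki * ai for ki, ai in zip(k, alpha))
        var = 1.0 - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
        return mean, math.sqrt(max(var, 0.0))

    return predict

def bayes_opt(n_iter=8, kappa=2.0):
    grid = [i / 20 for i in range(101)]   # candidate values in [0, 5]
    xs = [grid[0], grid[50], grid[100]]   # small fixed initial design
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):               # step 5: repeat
        predict = fit_surrogate(xs, ys)   # steps 1 and 4: (re)build surrogate

        def lcb(c):
            # Lower confidence bound: low mean or high uncertainty is attractive.
            mean, std = predict(c)
            return mean - kappa * std

        candidates = [c for c in grid if c not in xs]
        x_next = min(candidates, key=lcb)  # step 2: best on the surrogate
        xs.append(x_next)
        ys.append(objective(x_next))       # step 3: true objective
    best = min(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], xs

best_x, best_y, evaluated = bayes_opt()
print(best_x, best_y, len(evaluated))
```

With the defaults, the true objective is called only 11 times (3 initial points plus 8 chosen by the surrogate), compared with 101 calls for an exhaustive sweep of the same grid. The surrogate is refit from scratch on each pass, which is wasteful but keeps the correspondence with the numbered steps easy to see.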

The goal of Bayesian reasoning is to become “less wrong” with more data: we continually update the surrogate probability model after each evaluation of the objective function.

The basic idea is: spend a little more time selecting the next hyperparameters in order to make fewer calls to the objective function. In practice, the time spent selecting the next hyperparameters is inconsequential compared to the time spent in the objective function.

We form an initial view of the world (called a prior) and then we update our model based on new experiences (the updated model is called a posterior). Bayesian hyperparameter optimization takes that framework and applies it to finding the best value of model settings!
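In symbols, this update is Bayes' rule. As a sketch, write D for the evaluations collected so far and f for the unknown objective function (both symbols are notation introduced here for illustration):

$$p(f \mid D) = \frac{p(D \mid f)\, p(f)}{p(D)}$$

Here p(f) is the prior, p(D | f) is the likelihood of the observed scores, and p(f | D) is the posterior, which in turn acts as the prior when the next evaluation arrives.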