Looking at the NIPS papers (and workshops) I noticed many references (some hidden) to Bayesian optimization as a useful subroutine, so I decided to investigate and see what it's all about. I found a great tutorial by Eric Brochu, Vlad Cora, and Nando de Freitas, and I decided to follow it and see what it's all about. Note that this is one of the techniques used in James Bergstra's paper on hyperparameter optimization, and he has optimized and tested code online to do this kind of optimization, so you should use that code rather than the one here.
Regardless, I thought it to be a useful exercise to implement it and see what happens.
Bayesian optimization is a really cool and simple way of globally optimizing a function which is very expensive to evaluate. So expensive, in fact, that you're not bothered by the fact that it involves optimizing over a gaussian process with something like simulated annealing (or really whatever gradient-free global optimization algorithm strikes your fancy). One example of this is hyperparameter optimization, where your neural network has over 10 different tuning parameters which interact in odd ways and essentially make your research irreproducible. Then, you really care about not running too many iterations of something silly such as simulated annealing over your real objective function and are willing to spend the computational budget of a few minutes to get a really good point to try out.
So how does it work? First, you assume you have evaluated your objective function on a bunch of arbitrarily chosen points of your space. Then, in each iteration, it fits a gaussian process over your function evaluations and searches over this gaussian process for the best point to try out next. This search is a global optimization of something called an acquisition function, which can be as simple as the expected value of your objective at the point plus the predicted variance at that same point (the UCB acquisition function) or the expected improvement evaluation your function at that point can have over your current best point. Then, you select your current best point, evaluate your expensive objective function in it and add it to the list. Repeat until you're satisfied that your objective value is no longer improving all that much.
Bayesian optimization also makes sense in an intuitive way: essentially you're replacing a worst-case assumption (for example, doing grid search and hoping to find optimal points assumes that your function is Lispchitz with constant related to the size of the grid) on your function with an average-case assumption expressed as any kernel you want. Effectively instead of saying that (f(x')-f(x))/||x'-x|| is always smaller than K you say that f(x')-f(x) ~ Normal(0, k||x'-x||), which then allows you to fiddle with the norm (or even replace it with something entirely different), and also generalizes nicely to discrete or mixed discrete-continuous spaces (as long as you're still able to do a reasonable job of searching over your acquisition function).
https://gist.github.com/1402892
Edit: Ruben Cantin has an efficient multithreaded C++ implementation with python bindings.