Alexandre Passos's ML blog @atpassos - Tumblr Blog

NIPS 2013 reading list

As usual, here is my nips 2013 reading list. It's incomplete, and only has papers I could find online as of a couple of weeks ago.

Accelerating Stochastic Gradient Descent using Predictive Variance Reduction; R. Johnson, T. Zhang

Another paper proposing a linear convergence rate for stochastic gradient descent. This builds upon the earlier stochastic average gradient approach, which works by storing the gradient of all training examples as well as their sum, and then in each iteration sampling a training example and updating the sum of gradients by how much the gradient with respect to this example changed in the past.

The contribution of this paper is that they prove that if instead of keeping the gradient for every example and a running sum of all gradients you keep an old weight vector around and the sum of gradients computed with respect to that weight vector you can efficiently "correct" this gradient sum with respect to one training example at a time and still get the same convergence rate.

A cool thing is that this lets you implement gradient updates for sparse feature vectors efficiently, since you can keep a multiplier on the full gradient you keep around, as the information that varyes from sample to sample is sparse.

An Approximate, Efficient LP Solver for LP Rounding; S. Sridhar, S. Wright, C. Re, J. Liu, V. Bittorf, C. Zhang

This paper has two intertwined contributions: first they show how to approximately solve many LPs that show up in machine learning with coordinate descent on a quadratically-smoothed version of the problem (which looks like ADMM), and then they show that approximate solutions to LP relataxions work almost as well as exact solutions when doing rounding to approximate ILPs.

The results are compelling, and I'm curious to try this quadratic approximation in some LPs and see what happens.

A simple example of Dirichlet process mixture inconsistency for the number of components; J. Miller, M. Harrison

I think this paper makes a very helpful point. In many places and casual conversations one hears dismissive remarks that using a Dirichlet process prior is the way to do inference in models when one doesn't know the actual number of components. While this is clearly not always the case, as the Dirichlet process makes some quite restrictive assumptions about the sizes of the produced clusters (a few big clusters and many small clusters, essentially required for exchangeable models), this is the first work I know of that shows that it actually also misrepresents the size of the clusters (which is behavior anyone who tried fitting a DP mixture model to random data has observed).

Better Approximation and Faster Algorithm Using the Proximal Average; Y. Yu

This paper shows the neat idea that if you have many nonsmooth regularizers in an optimization problem you can still do something fista-like by instead of computing the expensive prox operator of all the regularizers together simply doing an accelerated gradient step on the smooth part and averaging the prox operator of each regularizer. While this won't give you sparsity, it seems to be efficient, and it's very easy to implement.

Distributed Representations of Words and Phrases and their Compositionality; T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean

This is the second paper from the google team working on word embeddings. As with the earlier work, the results are pretty impressive. It still strikes me as counterintuitive that these local bag-of-words or skip-bigram models end up capturing semantics sometimes better than more complex models which take account of things such as word order. The contributions in this one are using phrases, exploring negative-example training instead of the tree structure, subsampling frequent words, and more nice eye candy of the form A - B + C = D, which is always fun to stare at. To be fair, though, some human selection has to be done, because intuitive semantics is not as commutative as these operations. For example, "Berlin - Germany + France = Paris" looks good, but "Berlin - Paris + France = Germany" really doesn't.

Dropout Training as Adaptive Regularization; S. Wager, S. Wang, P. Liang

This paper investigates what regularizer is actually implied by dropout on linear models. It turns out to be something suspiciously close to the regularizers from Adagrad and confidence-weighted learning algorithms, in that they depend on the covariance matrix of the data (which is intuitive, since randomly deleting features is going to have an effect proportional to how correlated the features are). They present somewhat good results on training models with the exact regularizer, as well as a variant of adagrad which is closer to dropout.

Estimation, Optimization, and Parallelism when Data is Sparse; J. Duchi, M. Jordan, B. McMahan

This paper analyses learnign in the common (for me) case of very high dimensional very sparse feature vectors. As has been empirically observed, adagrad does well, and this paper proves matching lower and upper bounds on performance which are achieved by adagrad, which is nice. Moreover, it proposes and proves good properties about a parallel implementation of dual averaging which is essentially what I've been using for the past year in factorie, with good results.

Generalized Denoising Auto-Encoders as Generative Models; Y. Bengio, L. Yao, G. Alain, P. Vincent

This is further work on the line of research interpreting autoencoders as generative models. Interestingly enough, the training algorithms (like Algorithm 1 in this paper) are looking a lot like training algorithms for RBMs, but hopefully with more stability and less issues with things like gaussian units. Looking at samples from the distributions defined by autoencoders is really interesting.

Learning word embeddings efficiently with noise-contrastive estimation; A. Mnih, k. kavukcuoglu

This is very similar to the Google paper I cited above. It shares one contribution, noting that using negative contrastive estimation is faster than a tree of classifiers (and doesn't create the problem of coming up with a good tree structure). I'm puzzled by the fact that the numbers seem a bit worse than the Google numbers.

Projecting Ising Model Parameters for Fast Mixing; J. Domke, X. Liu

The idea is that since it is known that some model structures (involving low coupling) make sampling efficient, how about picking a model which doesn't fit this template and somehow projecting its parameters into it to improve the efficiency of sampling? This paper explores many ways of doing so, and things based on the reverse KL divergence (minimizing the KL between the distributions induced approximate and true parameters), which is similar to a mean-field approximation, works well (though it barely outperforms loopy BP, as usual).

Reasoning With Neural Tensor Networks for Knowledge Base Completion; R. Socher, D. Chen, C. Manning, A. Ng

This paper is another one attacking the problem of reasoning in knowledge bases as tensor completion. Here the focus is on learning relations: each entity gets a vector, and each relation (in wordnet or freebase) gets a neural network which takes two entities and produces a score. The model is trained to minimize training loss and performs reasonably well on test data. One clever idea is to represent entities with the average representations of the words which compose them (coming from any neural language model, like Collobert & Weston's); then knowledge is automatically shared across entities which share words, which can be good..

Streaming Variational Bayes; T. Broderick, N. Boyd, A. Wibisono, A. Wilson, M. Jordan

The idea of this paper is to pick ideas from the stochastic variational inference algorithms, which approximate the posterior given the prior and all observations by taking stochastic natural gradient steps and use these techniques for a streaming setting, where you want to approximate the posterior given the previous posterior and one observation. The technique is parallelizable, and seems to work pretty well.

Variance Reduction for Stochastic Gradient Optimization; C. Wang, X. Chen, A. Smola, E. Xing

This is another paper which has a variance-reduction strategy to improve stochastic gradient descent. The idea is to use control variates: you define a known quantity, which is also an expectation of something over the data, but whose errors in each sample you can compute exactly, and whose error is correlated with the stochastic gradients of your function. Then instead of using the actual stochastic gradients you apply to each gradient the correction that would have removed the error in the control variate. This then should minimize the negative effect of having too much variance in your stochastic gradients.

Deriving these control variates seems a bit like black magic to me, but that's because I'm not familiar with them.

They test the technique on logistic regression and stochastic variational inference, and the results look ok for logistic regression but quite impressive for variational inference.

•18+ Adults Only

Watch Anya Live on Cam

Anya is live and ready to show you everything. Watch her strip, dance, and perform exclusive shows just for you. Interact in real-time and make your fantasies come true.

✓ Live Streaming✓ Interactive Chat✓ Private Shows✓ HD Quality✓ Free Actions

Free to watch • No registration required • HD streaming

My thesis: Combinatorial algorithms and linear programming for inference in natural language processing.

I recently defended my phd thesis in UNICAMP, and I've decided to revive this blog (in a new address) with a version of the text I used in my defense.

So the thesis is about combinatorial and linear programming perspectives for inference in NLP, and how to intermingle good things from both approaches to make things faster, not asymptotically (which is not that interesting, as sentences aren't getting any longer over time) but in practice, on models which are actually used.

The main tool I'll use throughout is the linear program. A linear program is an optimization problem (maximization or minimization) where you optimize a linear objective subject to linear constraints. The general form of a linear program is

Linear programs have really nice properties. The big one for our cases is duality, which is that for any linear program with a well-defined optimum you can construct a lower-bounding linear program called the dual program.

First write the Lagrangian of the LP, leaving the positivity constraints implicit,

where lambda, coming from inequality constraints, is guaranteed to be positive

We can then collect terms to get

which is also the Lagrangian of the following optimization problem

That is, each variable of the primal problem is a constraint in the dual problem, and vice versa.

Any feasible setting of this dual problem upper-bounds the objective value of the primal problem. We also have strong duality, which says that the objective values of the two problems are the same. So if we can construct a primal setting and a dual setting with the same objective value we know we've solved the problem.

Moreover, if we have primal and dual optimal variables x* and lambda*, we have complimentary slackness, which says that either

A final interesting property is that if you have more constraints than variables (and you're feasible) you are guaranteed to have optimal solutions in which most constraints are satisfied with a slack.

Then because of duality, if you have more variables than constraints you must have optimal solutions in which most variables are set to 0.

This leads to cutting planes and column generation. These techniques originated with the ellipsoid algorithm and the Dantzig-Wolfe decomposition, but they are very simple meta algorithms for solving LPs.

For cutting planes, first you initialize an LP with the primal variables and a subset of the constraints, and solve it to optimality. Then, you look for violated constraints, that is indices i such that

Then you pick one of these violated constraints, add them to the restricted problem, and solve it again. When you find no violated constraints you're done, and hopefully this will happen before you add all constraints to your problem.

The dual of cutting planes is column generation, and it can be weird at first. You again initialize a restricted problem, but this time with only a subset of the variables. You solve it to optimality, and get the dual variables from the optimality certificate. Then you search over primal variables with positive reduced cost

These are variables which, added to your solution, are expected to bring your optimization value up. Then you iteratively add them and resolve the problem until there are no such variables with negative reduced cost, which means all your dual constraints are satisfied and you have a primal dual optimal pair.

The trick for making cutting planes and column generation algorithms efficient is having procedures which can search for violated primal/dual constraints in less time than it would take to enumerate all constraints.

There are also strong connections between dynamic programming and linear programming. For example, here is a linear program for weighted longest paths on directed graphs. We have an indicator variable for each edge, and if a node is in a path and is not the source or target its number of incoming edges equals its number of outgoing edges, and the target has one incoming edge:

The dual of this problem has a variable per non-source node and constraints for each edge, and is written as

which, if the graph is acyclic, can be solved by topologically sorting the edges, starting from the source, and iteratively setting the value of an edge's endpoint to be the maximum of its current value and than the value of the edge's start point plus the weight of the edge, which is the standard dynamic programming algorithm for longest paths.

The reason linear programs are so important to us now is that the problem of MAP inference is usually specifiable as an optimization problem with a linear objective, and in many cases linear constraints as well.

The problem of inference arises in the structured prediction task. This is when you have something which looks like classification, but the set of valid labels for any given example is too large. Examples from NLP and vision.

Each label then becomes many training examples for a binary classifier, one per part.

The idea of structured linear models is to represent each possible label as being defined by the presence / absence of smaller parts. This also allows learning: each part defines a binary classification example, and learning the parameters of a structured linear model trains all these classifiers jointly.

The most popular example of a structured linear model is a graphical model, but in the thesis we also make reference to hypergraph models, and literature still has others more.

The idea of a structured linear model is to express each label with a binary indicator vector of which parts are active and a set of valid such vectors, which is always representable as a possibly exponentially large set of linear constraints.

As you saw in the thesis, we can define learningm joint inference, and all sorts of other things just by assuming this linear objective and set of valid inputs structure of structured linear models.

The main problem we will concern ourselves in this thesis is the problem of MAP inference in a structured linear model, or finding a maximizer for

The first technique we will look at to accelerate algorithms for this problem is column generation, and for linear chain graphical models (though the technique does generalize beyond that, as seen in the thesis).

First, a review. The viterbi algorithm for MAP inference in a chain model looks like this:

Note that this is the longest path algorithm where t(x_i-1,x_i) + theta_i(x_i) plays the role of the edge cost, and edges connect variables and values. We can then write the primal LP

Now that we have a linear program let's look at the main ingredients for a column-generation algorithm.

First, we need a way to define and solve many restricted versions of the original problem, with less primal variables.

Note that the Viterbi algorithm is really fast: it solves this LP in linear time, looking at each primal variable only once, and then revisiting only the primal variables which will be in the final solution. Hence feeding this LP into a solver doesn't make sense, and we should use Viterbi itself to define and solve the restricted problems.

This is easy: removing a primal variable means removing a constraint, which is equivalent to doing the maximizations in the Viterbi algorithm over less values of the previous variable in the chain.

That said, we're still not done. For this LP the reduced cost of a primal variable is

That is, the amount by which the corresponding Viterbi constraint is violated. However, this is a problem, as checking all the reduced costs then becomes as expensive as running Viterbi itself, which is what we're trying to beat.

This comes down to the earlier point: the key to a column generation algorithm is a good reduced-cost oracle.

Fortunately, we can exploit the structure of the problem to construct such an oracle. Note that the terms in its equation decouple into things which depend only on x_i-1, things which depend only on x_i, and things which depend on both. Moreover, in many models the things which depend on both are a constant throughout the chain, and even when they are not it's often easy to bound their magnitude. With such a bound, one can go and search first for all values of x_i-1 which could conceivably lead to a negative reduced cost, then conditioned on those values find the set of values of x_i which could lead to a negative reduced cost, and finally evaluate only those values.

While this is technically sufficient for an algorithm which is marginally faster than viterbi, it is possible to get a substantial further speedup by using reduced costs from a special-purpose LP which uses forward and backward information in the chain.

A sketch of its derivation is to first write the LP with two variables per edge, one going forwards and one going backwards. The dual of this LP is a dynamic programming algorithm for MAP message passing. But note that if you give me a solution in which the forward and backward variable are set differently I can give you another solution with the same scores in which they are set identically, so we can just use one set of variables in the primal problem. This leads to a dual problem that instead of having paired constraints like

has "merged" constraints like

and it's easy to see that satisfying the paired constraints is sufficient but not necessary to satisfy the merged constraints, and yet the objective value of the LP is the same.

Using the reduced cost from the merged constraints has now a very attractive property: we can use global information from both sides of the chain around an LP variable when deciding whether to include it in our restricted LP. Experimentally this leads to big wins.

Here is one table of results: it's easy to see that for the two tasks the column generation approach outperforms viterbi, is as fast as approximate inference, and yet is exact.

For more details and extensions please see the thesis.

Now we're going to talk about the second contribution in the thesis, a new LP relaxation for joint inference.

Joint inference is a term that has been used a lot over the years to mean somewhat different things in different times. Most of them fall into one of the two following camps.

In the first, you have a thing you want to analyze, like a document, and there are many things you want to predict from it, like pos tags, NER, parse trees, role labels, etc. You can then label some data and train a pipeline of structured linear models for all these tasks, each building on the labels of the ones that come before. Or, you could do them jointly, by allowing later stages in the pipeline to revise the decisions made in earlier stages, like a role labeler changing the phrase attachment on a parse tree, or a parser changing POS tags.

The second meaning of joint inference is when you have many things to process, all similar, and you want to exploit this extra data to do better. For example, when POS tagging unknown words, having many occurrences of the word helps you know its contexts, and decide on its reasonable tags better than acting independently each time. This is what I refer to as corpus-wide inference.

In both these scenarios there are two competing desires: we want to reuse the known-good models and algorithms for the individual tasks, and we want to leverage information from accross tasks to improve performance.

Simply defining a big joint structured linear model for all tasks is tricky. Even if inference is tractable in the joint model (say if both tasks are sequence segmentation tasks over the same sequence), it might be much more expensive due to the combinatorial explosion over the number of labels.

Moreover, there is the intuition that the jointness is there really only to adjust the answers of already-good models.

A popular form of doing joint inference is dual decomposition. If you have two or more structured linear models which are mostly independent, and can express jointness constraints (they should agree on the things on which they overlap) with linear equality constraints, you can then write the following primal problem

and then dualize the constraint into a Lagrangian and maximize the Lagrangian over the primal variables to get a dual objective

which is convex and can be solved with subgradient ascent.

Note that to compute the value and subgradient for this optimization problem, which is all that is necessary to solve it, all we need to do is maximize each structured linear model independently, with reweighted costs. Hence we can reuse the combinatorial structure of the known algorithms for those problems to solve the more complex joint problem.

There are also block-coordinate descent methods for solving this optimization problem if some restrictions are placed on the matrices A and B, essentially that they define "variables" which partition the solution space of each structured linear model into mutually exclusive states.

Then you can define a max-marginal as

the best possible score attainable by a label in each part of the partition. Since max marginals over A are linear in dual variables premultiplying A, a locally optimal way of setting the dual variables is

which will equalize the max marginals of the variables defined by A and B, making them agree on the maximizing label and satisfy the primal constraint.

For local joint models this is fine, but for global joint models it's a little unsatisfying. In that case you really want to just encourage different models to agree, not force them to. The standard way to encourage two models to agree with DD is to create a separate graphical model with one variable for A and one for B, and one factor connecting them, and then add dual decomposition constraints that force each copy to agree with the original variable.

Then the process of joint inference looks like this. Each model makes its independent prediction, the consensus graphical model disagrees with both, so subgradient messages are passed, inference is rerun in both models, there is still disagreement, and so subgradient messages are sent again, and finally each model finds out about the optimal values of the other model.

This is unsatisfying, because it takes two rounds of messages for the actual models to communicate, since their messages are mediated by an intermediary model.

And this brings us to the second main contribution of this thesis, which is a way of doing this style of joint inference, where factors connect different submodels, in a way that communicates between the models with every message, instead of every other message.

The key idea is to fold the factor of the auxiliary model into the optimization problem itself. First we represent the factor scores in terms of disagreement penalties instead of agreement scores, and get a primal problem like

Representing the max(0, x) with constraints and extra auxiliary variables, which we optimize out, we get a dual problem where, essentially, we're allowed to remove up to c_ij from the score of a setting x such that Ax = e_i as long as By != e_j.

Moreover, the objective is functionally identical to the dual decomposition objective, except the dual variable lambda's size is bounded by the penalties.

This means we can still reuse the known inference algorithms as black boxes, as in dual decomposition.

So it's as if we're trying to enforce a dual decomposition constraint but we give up if setting lambda sufficiently high does not enforce it.

Similarly to the dual decomposition case, it's possible to derive a block coordinate descent algorithm for this objective, for faster convergence. To do that we need to restrict ourselves to agreement factors, where c_ij is nonzero only if i == j.

The idea is similar: use max-marginals to find conditions for optimality of the lambdas, and find a lambda satisfying these conditions. The actual execution is trickier than the one for dual decomposition for many reasons. First, the boundedness constraints on lambda mean it's not possible to simply equalize the max marginals, even if this is what's specified by the penalties. Secondly, it's not obvious which value the primal maximizer will have.

Our contribution, however, finds such a value by enumeration and then finds lambdas that satisfy the necessary constraints. The algorithm is pretty hairy, and I won't put it here.

Experimentally, we compared these algorithms to an implementation of dual decomposition in a corpus-wide task from a recent paper of Sasha Rush and Michael Collins.

Here are some representative experimental results

As you can see, it does empirically outperform the approach where auxiliary models are used.

As I explain in the thesis, the approach is very general, and all forms of pairwise factors can be represented using it. It's also easy to make a symmetrized version of the objective which penalizes the absolute value distance between Ax and By, which leads to simpler algorithms.

So as we've seen there are many connections between linear programming and the combinatorial algorithms for MAP inference, and we can productively use reasoning in terms of LPs to accelerate the known inference algorithms.

So this is mainly it, and I hope the thesis is readable and doesn't have too many mistakes on it. I also hope to have more time to blog now that this is out of the way.

My #NIPS2012 reading list, part 3 of 3

MAP Inference in Chains using Column Generation

D. Belanger, A. Passos, S. Riedel, A. McCallum

(disclaimer: this is my paper!) Usually when doing exact MAP inference in chains and tree graphical models one uses the viterbi algorithm (or max-product BP in trees), which will consider all possible settings of all pairs of adjacent variables before settling in the optimal solution. At the same time, many inference problems are "easy" in that simple techniques like beam search (only considering K possible "parents" for each variable instead of going over all of them) work really well, and reach solutions which are very similar to the optimum. What we do in this paper is use column generation, a meta-algorithm for solving linear programs that looks like cutting planes in the dual LP, to efectively run exact inference in the same time it takes to do beam search by efficiently pruning most "bad" states from even being considered during inference. We also find out that the pruning efficiency is much higher when we send messages forwards and backwards over the chains, instead of only in one direction, so we're doing more work only because it will make doing less work easier.

Practical Bayesian Optimization of Machine Learning Algorithms

J. Snoek, H. Larochelle, R. Adams

One of the best methodological advances in machine learning recently is the trend towards explicitly dealing with hyperparameter optimization, which is setting things like batch sizes, regularization strengths, learning rates, pruning thresholds, etc, which has historically been dealt with by getting really smart people to build highly complex mental models of the learning algorithms, leading to brittleness and a general difficulty in reimplementing and reproducing good results. The special contribution of this paper is to extend the already-known Gaussian process-based Bayesian optimization approach to hyperparameter optimization to dealing explicitly with the covariance parameters of the GP kernel.

Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

K. Jiang, B. Kulis, M. Jordan

This is an extension (in a way) of the earlier cool paper by a subset of the authors on how one could generalize k-means to work with Dirichlet processes in a very simple way (essentially create new clusters for all points which are sufficiently far from any of the given cluster centers). This time the nonparametric k-means approach is generalized (in pretty much the same way) to work with any Bregman divergence (instead of necessarily the squared distance) and also extended to the hierarchical DP setting, giving a k-means-like algorithm for LDA, by directly clustering tokens.

Truncation-free Online Variational Inference for Bayesian Nonparametric Models

C. Wang, D. Blei

This proposes a new variational algorithm for Bayesian nonparametric models that doesn't need an explicit a priori specification of the truncation parameter. This is achieved by first modifying the variational update on the latent variables to depend on the expected likelihood instead of the exponentiated expected log-likelihood, which they claim leads to a better estimation of the variance of the approximating distribution. They then use Gibbs sampling on the untruncated space to approximately compute these expectations. This idea apparently also extends to stochastic variational inference.

Emergence of Object-Selective Features in Unsupervised Feature Learning

A. Coates, A. Karpathy, A. Ng

This achieves similar results to the ICML google paper on deep neural networks and cat detectors, except with a lot less CPU time, working with patches instead of entire images, and learning invariant-ish features. It is worth reading if only to see how to cleverly use well-known efficient algorithms to learn deep representations that capture a lot of the interesting variability in the data.

Learned Prioritization for Trading Off Accuracy and Speed

J. Jiang, A. Teichert, H. Daume III, J. Eisner

The question asked by this paper is whether it is possible to use learning not to do some task (like parsing) better but to do it faster. This is done by applying reinforcement learning to pruning and prioritization in agenda-based parsing, where the reward function is a weighted linear combination of accuracy and time, and learning is done via policy gradient methods. Except this turns out not to work, because credit assignment is really difficult in this problem, so they add reward shaping (eagerly punishing actions which take time and eagerly rewarding actions which are known to lead to good trees), and also use ideas from apprenticeship learning to add an oracle to the policy gradient search by using a mixed policy that follows an oracle at first but slowly backs off to the learned policy after it learns to correctly mimic the oracle. It is clearly a work-in-progress, but has many interesting insightful nuggets.

Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs

S. Cohen, M. Collins

Parsing with latent-variable PCFGs takes a long time mostly because of the tensor contractions necessary for the inside-outside algorithm, which are cubic in the number of states in the PCFG. The idea behind this paper is to represent these tensors with a low-rank representation, effectively computing approximate messages which is much cheaper than computing the exact ones. Interestingly they can also bound the error in the final parse as a function of the error in the individual low-rank approximations.

#JustMigrate

My #NIPS2012 reading list, part 2 of 3

FastEx: Fast Clustering with Exponential Families

A. Ahmed, S. Ravi, S. Narayanamurthy, A. Smola

Assuming you want to do MCMC for clustering models with arbitrary exponential families in the clusters, the main computational bottleneck is often evaluating the log-likelihood of each data point in each cluster and normalizing and sampling. This scales badly with the number of clusters, and sounds wasteful. What this paper does then is propose a few generic-ish tricks to speed-up sampling. The first trick is to use a taylor expansion of the likelihood to justify approximate computation of the acceptance probabilities. The second trick is to use hashing kernels to speed up the computation of the dot products between natural parameters and sufficient statistics.

Fast Variational Inference in the Conjugate Exponential Family

J. Hensman, M. Rattray, N. Lawrence

Another new variational inference paper. This one generalizes the collapsed variational algorithms in the literature and shows that in the new lower bound the older coordinate ascent updates are actually steepest descent updates. This opens up the possibility of using better updates, like conjugate gradient, which then seems to converge much more quickly. The technique looks general enough, and I'll still need to read more carefully to understand what is going on in there.

Hamming Distance Metric Learning

M. Norouzi, R. Salakhutdinov, D. Fleet

The idea is to directly learn a mapping from real-valued inputs to binary vectors (not singleton binary vectors) in a way that preserves a known ranking-based similarity. The loss function is a list of triples stating that A and B are closer than A and C, from which you can then write down a hinge-like loss with a margin and optimize. The key trick seems to be an efficient way of performing loss-augmented inference over binary codes.

Identifiability and Unmixing of Latent Parse Trees

P. Liang, S. Kakade, D. Hsu

This paper presents two cool results. The first (and more interesting, in my opinion) is a numerical test for whether a set of moment-matching constraints is enough to identify the parameters of a latent variable model that uses the jacobian of the constraints. Intuitively, the parameters are identifiable if the Jacobian of the constraint violation vectors has the same rank as the dimensionality of your parameter vector (that is, whenever there is no linear subspace in the neighborhood of a constraint-satisfying parameter vector where all vectors are also constraint-satisfying). Then they use these tests to investigate identifiability of some models that look like PCFGs and dependency parsing models, and also they show how to derive some algorithms using this "unmixing" technique I don't quite understand yet.

Interpreting prediction markets: a stochastic approach

N. Della Penna, M. Reid, R. Frongillo

This seems to be augmenting the known deep connections between prediction markets and online learning. So apparently not only (as it was known) are market makers optimally setting prices by looking at the gradient of a strongly convex loss function but the actual process by which the market works looks like mirror descent in the limit of high liquidity.

Learning as MAP Inference in Discrete Graphical Models

T. Caetano, X. Liu, J. Petterso

This is an unusual paper. It tries to learn a linear binary classifier using discrete instead of continuous optimization as its foundation, rethinking some of the tradeoffs that have been taken for granted in much of machine learning. They formulate loss functions such that the loss of making a certain prediction on a given example is a sum of loss functions that operate on possibly overlapping subsets of features (specificying cliques in a graph over the features) and the true label. This ends up looking similar to dropout for neural networks ( http://arxiv.org/pdf/1207.0580.pdf ), in that a large family of predictors that share parameters and operate on subsets of the possible features are all simultaneously trained to achieve low error. The specific way they do so, however, involves discretizing the parameter space and using the 0/1 loss directly, leading to learning being cast as inference in these large discrete graphical models.

Majorization for CRFs and Latent Likelihoods

T. Jebara, A. Choromanska

In the old days the parameters for logistic regression and conditional random fields used to be estimated using fairly specific upper-bound-based techniques, where a sequence of quadratic upper bounds on the log loss was defined in a way that each such bound could be optimized analytically while guaranteeing some improvement in the loss function. These methods, however, were eventually dropped in favor of generic gradient-based optimization techniques like L-BFGS and SGD. This paper presents a new way of optimizing CRFs using quadratic upper bounds, except these bounds are much tighter, and hence the optimization ends up being faster than generic techniques. More interestingly, this bound also applies for latent-variable CRFs and other interesting models, and there the gains are even more impressive.

#JustMigrate

My #NIPS2012 reading list, part 1 of 3

This is part 1 of my personal, biased, and incomplete list of interesting NIPS 2012 papers. It's sorted roughly alphabetically, but not quite.

Accuracy at the Top

S. Boyd, C. Cortes, M. Mohri, A. Radovanovic

The key idea is to replace optimizing precision@k with optimizing accuracy at the "topmost" fraction of a results list. The trick is to use the same kind of quantile loss that has been used before in quantile regression, which looks like an absolute value loss where overpredicting and underpredicting are penalized by different amounts, in a way that not ranking a good thing on the top of the list is "less bad" than ranking a bad thing on top of the list. The main novelty here seems to be that they add a hinge-and-margin to this loss and use this to make generalization bounds.

Active Learning of Model Evidence Using Bayesian Quadrature

M. Osborne, D. Duvenaud, R. Garnett, C. Rasmussen, S. Roberts, Z. Ghahramani

The idea here is to use Bayesian methods to solve numerical integrals of hard problems. If you approximate the function being integrated with a Gaussian process, then you can integrate it analytically in an interval (by integrating the mean function of the GP), and also use the predicted variance of the Gaussian process to choose on which points to evaluate your unknown function, much like in Bayesian optimization. These things then turn out to be useful to compute partition functions, in a way responding to concerns by Larry Wasserman and others.

A quasi-Newton proximal splitting method

S. Becker, J. Fadili

The key contribution here is a new quasi-newton (like L-BFGS) algorithm for l1-regularized (and other nonsmooth) optimization problems. This comes from their derivation of a new way of computing a proximal step when the smooth part doesn't decompose into a sum of independent terms. It seems to be more robust than state-of-the-art methods but doesn't necessarily beat all of them in their examples.

A Spectral Algorithm for Latent Dirichlet Allocation

A. Anandkumar, D. Foster, D. Hsu, S. Kakade, Y. Liu

This is an "exact" algorithm for learning LDA. It's a simple-ish extension of their work on mixture and multi-view models. Unfortunately, the actual algorithm in the paper is not particularly stable, and the power iteration from http://arxiv.org/abs/1210.7559 should be used instead. The key idea is to look at the third-order moment tensor of the words in LDA and realize that "decorrelating" its dimensions with PCA leads to a tensor that is the sum of outer products of orthogonal tensors, which then can be decomposed in many ways to get the true topics (assuming, of course, that the data was generated by the model).

A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets

N. Le Roux, M. Schmidt, F. Bach

This is really interesting. Apparently it is possible to make SGD converge much faster in later iterations if instead of approximating the gradient of the loss by the gradient in a single example you approximate it by the sum of the last time you computed the gradients on all examples (and then you compute the gradient on a new example and update this sum). The convergence rates then are as good as batch gradient descent but (ignoring the issue of sparsity) the cost per iteration is as good as the cost of a single stochastic gradient descent iteration.

Continuous Relaxations for Discrete Hamiltonian Monte Carlo

Z. Ghahramani, Y. Zhang, C. Sutton, A. Storkey

This does something I've always wanted to see done, which is usin gradient-based samplers (like Hamiltonian Monte Carlo) to sample from discrete distributions with pairwise interactions. The idea is to add an auxiliary variable that depends on the discrete variables in a way that the normalization constant of this auxiliary variable has a term that "cancels out" the interactions between the discrete variables (they call this the gaussian integral trick). Then conditioning on this "mean field" the discrete variables become independent logits, and can even be integrated out entirely. This means then that you can sample directly in the space of this "mean field" with HMC, which mixes much better than steps on the discrete variables directly. This technique looks relly promising, as it should often be possible to come up with these auxiliary continuous densities.

Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

O. Meshi, T. Jaakkola, A. Globerson

It's hard to see convergence analysis of coordinate descent algorithms because, intuitively, ill-conditioning and ill-smoothing should make coordinate descent work arbitrarily bad (as the good descent directions will not align with the coordinates). So it's interesting to see that the structure of smoothed MAP problems allows for rates of convergence that are almost as good as Fista's and other algorithms.

#JustMigrate

•18+ Adults Only

Watch Anya Live on Cam

Anya is live and ready to show you everything. Watch her strip, dance, and perform exclusive shows just for you. Interact in real-time and make your fantasies come true.

✓ Live Streaming✓ Interactive Chat✓ Private Shows✓ HD Quality✓ Free Actions

Free to watch • No registration required • HD streaming

EMNLP 2012 Reading List

This year's EMNLP has a large number of really interesting papers. Here's a small list with summaries of some that caught my eye (unfortunately I'm not going, so this is all done without the benefit of presentations or conversations).

Improved Parsing and POS Tagging Using Inter-Sentence Consistency Constraints

Sasha Rush, Roi Reichart, Michael Collins, Amir Globerson

http://aclweb.org/anthology-new/D/D12/D12-1131.pdf

This is another paper describing useful improvements from global inference, this time using dual decomposition. The idea is to constrain inference in the test set such that it agrees on "stuff" (edge pos tags in dependency parsing, word type pos tags in pos tagging) in the test set that was not seen in the training set. It should be easy to flip this around as a semi-supervised learning mechanism instead of a global inference one, and I'm curious to see how using dual decomposition during training would compare with generalized expectations or posterior regularization (both of which do constrained marginal instead of MAP inference).

Wiki-ly Supervised Part-of-Speech Tagging

Shen Li, João Graça, Ben Taskar

http://aclweb.org/anthology-new/D/D12/D12-1127.pdf

Not only the title is clever in this paper. The idea is to use dictionaries (wiktionary, in this case) as supervision for transferring knowledge of POS tagging from one language to others. The wikitionary has a definition and general-ish part-of-speech tag for each word together with links between "the same" word across languages. Exploiting these links turns out to be more useful than using parallel text or plain unsupervised learning for many languages. The best part is that this is mostly orthogonal to the state-of-the-art, so presumably even better results can be had by augmenting parallel text + label propagation with wikitionary.

Multi-Domain Learning: When Do Domains Matter?

Mahesh Joshi, Mark Dredze, William Cohen, Carolyn Rosé

http://aclweb.org/anthology-new/D/D12/D12-1119.pdf

This is one of a few very interesting empirical papers in this EMNLP (the other of note is the LDA paper I mention below). The question it tries to answer is when and why domain adaptation methods work, and it does so by finding some confounding factors and seeing their impact. I like that the empirical evaluation is done by messing with the training data (artificially distorting in-domain data to look like out-of-domain data in some particular ways without actually being so). The conclusions are really fuzzy, and cast a healthy amount of doubt on the "reality" of the results in domain adaptation papers. The most puzzling conclusion is that just randomly creating domains and using domain adaptation often helps performance more than using "real" domains, which suggests that maybe rebranding this line of research as "ensemble methods" and moving on would not be a bad idea.

Learning-based multi-sieve coreference resolution with knowledge

Lev Ratinov, Dan Roth

http://aclweb.org/anthology-new/D/D12/D12-1113.pdf

This does something I've wanted to do for a while (but never got around to doing) which is use the sieve-based structure of the stanford coreference resolution into a learning system (then the sieve idea is really inference folded into learning, which is pretty cool). The key insight (I quote) is that "different types of mention pairs may require a different co-ref model". They argue that sentence structure matters more for within-document while string match matters more for across-document coreference. All the careful text samples help understanding why and how the system works.

Fast Large-Scale Approximate Graph Construction for NLP

Amit Goyal, Hal Daumé III, Raul Guerra

http://aclweb.org/anthology-new/D/D12/D12-1098.pdf

I'm mostly bookmarking this for later reference. I've wasted far too much time implementing approximate nearest neighbor computation in large sparse datasets (there's even a metaoptimize question I asked about this). The trick is to use PLEB: permute the coordinates randomly, sort, and use that to find some high-precision near neighbors; repeating this process allows for a better recall.

Exploring topic coherence over many models and many topics

Keith Stevens, Philip Kegelmeyer, David Andrzejwski, and David Butler

http://aclweb.org/anthology-new/D/D12/D12-1087.pdf

This is the other really good empirical paper I mentioned earlier. It sets out to compare three broadly distinct classes of "topic" models: matrix factorization (LSA), non-negative matrix factorization (NMF), and matrix factorization into probabilities (LDA), as well as revalidate recently published topic coherence scores (two very closely similar papers, one from UMass and the other from UCI, proposing almost the same metric). Some conclusions are as expected: unconstrained matrix factorization (LSA) generates better features and compresses better than the other models, which in turn provide "crisper" and more "natural-looking" topics (indeed the eye-candy was probably partially responsible for the success of LDA). Also the metrics are really sensitive to hyperparameter settings, and don't seem to correlate that closely with human judgment (which was visible from their original papers), but computing them based on a larger corpus (as the UCI version does) seems to be more discriminative than computing them on the same corpus in which the topics were fit (as the UMass version does). It's worth reading, and debating the conclusions.

Parse, price, and cut---Delayed column and row generation for graph based parsers

Sebastian Riedel, David Smith, Andrew McCallum

http://aclweb.org/anthology-new/D/D12/D12-1067.pdf

This paper is another clear win for LP relaxations as a way of reasoning about inference. It essentially uses cutting planes in the primal and dual inference problems (cutting planes in the dual being referred to as column generation) with oracle strategies informed by the domain to dramatically cut inference time in grandfather-arc dependency parsing models. One sad thing is that all this clever inference is only used at test time, and the models are trained with looby bp to approximate the likelihood gradient. I wonder if something can be done about that.

Spectral dependency parsing with latent variables

Paramveer Dhillon, Jordan Rodu, Michael Collins, Dean Foster, Lyle Ungar

http://aclweb.org/anthology-new/D/D12/D12-1019.pdf

A mix of two interesting things (spectral methods and latent variable parsing) that seems to work well enough. The setup of using a generative model to rerank the output of a discriminative one seems odd, and I'd like to see how this model would fare by itself in parsing.

#JustMigrate

ICML 2012 Reading List

Again, here's my reading list for ICML 2012. I couldn't summarize papers I didn't find online as of last week, so those are missing from this list.

Online Structured Prediction via Coactive Learning

www.cs.cornell.edu/people/tj/publications/shivaswamy_joachims_12a.pdf

Pannaga Shivaswamy, Thorsten Joachims

The idea here is to work in a model that sits in the middle of online learning and bandits: instead of observing the reward of the current action, the system observes a different action, presented as an alternative. It matches really well with all sorts of implicit feedback models in the web, and has strong regret bounds. I really like this direction, and I think this is the way to go for all sorts of problems that rely on user data. The one disadvantage of this with regards to bandits is that there is no clear need for the system to explore if it assumes the user will always report what amounts to a margin violation.

Nonparametric variational inference

www.princeton.edu/~sjgershm/GershmanHoffmanBlei12.pdf

Samuel Gershman, Matt Hoffman, David Blei

The idea here is to replace the parametric approximations normally used in mean-field algorithms (like approximating random variables with others with the same functional forms but no dependence) with nonparametric approximations based on mixtures of gaussians. The cool thing is that to fit these mixtures of gaussians all you need are the first two derivatives of the log-likelihood, and not expectations, so you can apply these techniques to nonconjugate models, and also capture multi-modality (because mixtures of gaussians are by nature “bumpy”). This seems to take a lot of the pain out of deriving variational algorithms, although it still needs to be extended to a lot of settings (and maybe the pain will come back then).

Building high-level features using large scale unsupervised learning

http://arxiv.org/pdf/1112.6209v2.pdf

Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Greg Corrado, Kai Chen, Jeff Dean, Andrew Ng

When I first heard about the scale and results of this paper, it occurred to me that I would not have thought it was possible were I not seeing it. Essentially, learning a neural network from scratch on far more data and machines that anything I’ve heard of ca capture all sorts of high-level phenomena that people have been talking about for ages but of which there was no evidence so far. For example, this automatically learns face detectors from many images where faces show up in different positions and orientations. Also because it was trained in internet data it managed to learn a cat detector with no supervision.

A Graphical Model Formulation of Collaborative Filtering Neighbourhood Methods with Fast Maximum Entropy Training

www.tiberiocaetano.com/papers/2012/DefCae12.pdf

Aaron Defazio, Tiberio Caetano

In collaborative filtering, a lot of very smart machine learning has been applied to low-rank and latent-space-based models, which makes sense because matrix factorizations are one thing which ML people love. On the other hand, neighborhood methods have been treated as hacky approximations most of the time. This paper uses MRFs to directly regularize the predictions of neighborhood models, and learns these regularizers from data using something that looks a little like pseudolikelihood.

Parallelizing Exploration-Exploitation Tradeoffs with Gaussian Process Bandit Optimization

http://las.ethz.ch/files/desautels12parallelizing.pdf

Thomas Desautels, Andreas Krause, Joel Burdick

Bayesian optimization is a really interesting way of optimizing a very expensive function (you approximate it with a gaussian process and then query the gaussian process to decide where to test your function), but it doesn’t naturally lead itself to a batch setting (where you try many actions at once). Here they find surprisingly that greedily selecting actions in a batch assuming that earlier actions returned what was expected of them works well if you are careful about the variances. I can see this being useful in practice.

Bayesian Optimal Active Search and Surveying

http://www2.math.uu.se/~rmann/papers/optimal_learning.pdf

Roman Garnett, Yamuna Krishnamurthy, Xuehan Xiong, Jeff Schneider, Richard Mann

The scenario described in this paper is akin to playing battleship: what if you want to run an algorithm to explore an unknown landscape, but you only get points for finding out specific “useful” features of this landscape, instead of generally reducing your uncertainty? This also maps really well to recommendation systems (where you want to recommend items at the same time that are good and that let you recommend other good items in the future). This paper does a bayesian decision theoretical treatment of this problem, and finds useful things such that looking ahead is almost always useful, and also finds efficient algorithms for this lookahead.

Hierarchical Exploration for Accelerating Contextual Bandits

www.cs.cmu.edu/~sahong/papers/icml2012-yue-hong-guestrin_long.pdf

Yisong Yue, Sue Ann Hong, Carlos Guestrin

The idea here is as follows: you want to learn something like a recommender, but you want to avoid exploring all possible directions. So you first constrain yourself to exploring only a linear subspace of the parameters, and relax this constraint as appropriate to make your classifier more accurate. It interpolates really nicely between the case where the assumption of the low-rank subspace is verified and the case where it is not, and looks pretty good in the experiments.

Quasi-Newton Methods: A New Direction

http://probabilistic-optimization.org/KernelNewtonICML.pdf

Philipp Hennig, Martin Kiefel

Quasi-newton methods (like lbfgs) keep using gradient information across iterations to build an estimate of the Hessian. The idea in this paper is to try to do this estimation using bayesian inference. First they clarify a lot of the properties of known quasi-newton methods from an inference perspective, and then extend and propose a new nonparametric bayesian quasi-newton method which looks like it’s far more efficient in terms of function evaluations (how will it fare in terms of wall clock time is not so clear). The experiments are a little weak, in that no ML problems are used, but the idea seems really interesting.

Revisiting k-means: New Algorithms via Bayesian Nonparametrics

www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf

Brian Kulis, Michael Jordan

So it turns out that the hacky rule of “if a point is sufficiently far from any cluster when doing k-means just give it its own cluster” actually turns out to be derivable from the Dirichlet process. I really like this because it lets you turn things around and thing of other uses of the DP as essentially applying this simple rule probabilistically and in more complex settings. I’m not sure why they derived the updates from the sampler, however, as I think something like EM or the collapsed variational DP would behave similarly in the limit if you used hard instead of soft assignments in the variational distribution. Still, this is a must-read for those interested in NPbayes.

Flexible Modeling of Latent Task Structures in Multi-Task Learning.

www.cs.utah.edu/~piyush/recent/mfamtlICML12.pdf

Alexandre Passos, Piyush Rai, Jacques Wainer, Hal Daume III

My paper (actually Piyush is an equal contributor)! The idea here is to try to come up with a single model that can adapt itself at training time to the many different assumptions that have been succesfully used in transfer learning: things like clustering, low-rank parameter matrices, manifolds, etc. Just for fun we did this using nonparametric bayesian methods, using a nonparametric mixture of nonparametric factor analysers. While the final algorithm is not scalable as it is, I think it might be a basis for something actually useful in this field.

On the Equivalence between Herding and Conditional Gradient Algorithms

www.di.ens.fr/~fbach/icml2012_herding_final.pdf

Francis Bach, Simon Lacoste-Julien, Guillaume Obozinski

I really like the papers on herding, as that seems to be a neat idea that is deeply connected with all sorts of other interesting things. This paper shows that herding is actually something like Frank-Wolfe for the problem of moment matching. Things break down a little when improving it from an optimization perspective leads to a worse learner, which I think indicates that the explanation is not yet complete.

Sparse stochastic inference for latent Dirichlet allocation

www.cs.princeton.edu/~blei/papers/MimnoHoffmanBlei2012.pdf

David Mimno, Matt Hoffman, David Blei

The idea here seems to do something like Matt Hoffman’s online variational LDA paper but starting from the collapsed representation of LDA (with only z variables for the tokens) rather than the standard representation (with zs but also with phi and theta for documents and topics), and also using a joint (rather than a factorized) variational distribution. The key trick is that while evaluating expectations over this joint q is hard, sampling from it is easy, as it looks a lot like Gibbs sampling for LDA except with digammas in it. Then you can use these samples to approximate the expectations and end up with a stochastic natural gradient algorithm that is sparse and scalable.

Randomized Smoothing for (Parallel) Stochastic Optimization

http://arxiv.org/pdf/1103.4296.pdf

John Duchi, Martin Wainwright, Peter Bartlett

It is known that SGD and proximal methods converge really quickly to the optimal solution when the problem is smooth (that is, differentiable everywhere), but there are no known results for when the problem is not smooth. Here they propose a really simple algorithm (instead of using one sample of the objective function per iteration to compute the gradient, use many samples, taken not from the current value but from that value plus some added noise, and then this noise, which is annealed down over time, ensures that there are always derivatives) that has good provable bounds.

Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring

www.ics.uci.edu/~sungjia/publications/SGFS_ICML12_v12_cameraready.pdf

Sungjin Ahn, Anoop Korattikara, Max Welling

I think this won the best paper award. It is a follow up on Welling and Teh’s 2011 paper on bayesian SGD with Langevin dynamics, except it fixes some problems with that algorithm. The idea here is to find something that works really well in the limit where the posterior is gaussian (lots of data) when possible, but also degrades smoothly to the langevin approach when not possible. This allows them to use rather large learning rates, and relying on the natural gradient rather than the actual gradient also helps converge more quickly. It seems really interesting, and I haven’t finished digesting it.

#JustMigrate

Passive-aggressive perceptrons and trust-region methods

Recently I've been reading up on some optimization things, and I think I saw an interesting connection between trust-region methods for nonlinear optimization (things like Levenberg-Marquadt) and the passive-aggressive (PA) family of online learning algorithms (as well as many proximal-gradient methods). To do so I think I'll first explain trust-region methods, then PA, and then the connection.

Trust-region methods, essentially, are a way of making sure that Newton's method works. In Newton's method you pick the function f you want to optimize, approximate it locally by a truncate taylor series, and minimize the approximation. So you're actually minimizing

where H is the hessian (the matrix of second derivatives) of f. This has an analytic minimum at

(you could invert the matrix but it's faster and arguably at least as stable to just solve the linear system above with least squares)

One problem with newton's method is that you're usually not solving a quadratic (otherwise f(x) = f'(x)), so there is an error in your taylor approximation (not only that, but in really bad cases your hessian can be not positive definite, which will lead to all sorts of weird behaviors when you try to solve the system above). There are, then, two strategies you can use to make it practical: line search (that is, trying many things in the diretion of H^-1 grad f and choosing the best), which is what is most commonly done, and trust-region methods, which consist of adding a diagonal term to the hessian to make the linear system "nicer" at the expense of a poorer approximation of f. So in a mile-high view all trust region methods would do as above using

instead of H.

Now to online learning. One problem with things like the perceptron is that they update willy-nilly, often taking unnecessarily big steps in the direction of something which was almost correctly classified and taking not nearly big enough steps in the direction of the actually hard examples. You can see the perceptron as doing SGD (technically, stochastic subgradient descent) in the following odd-looking loss function

for y being either 1 or -1. The gradient of this is x, if it's misclassified, or 0, if it's correctly classified. The passive-aggressive family of online learning algorithm roughly tries to fix this eagerness in the perceptron by doing conservative updates. That is, rather than doing stochastic gradient steps to minimize the weird loss function above, take the smallest possible step needed to make it zero. Technically, it's the following optimization problem:

This problem is commonly solved by looking at its dual, which has an analytic solution. The lagrangian is

This is, of course, equivalent (in the sense that it leads to the same solution as)

and if you look at the Hessian of that, you'll see that all the first term does is add a multiple of the identity matrix to the Hessian of the original loss function (which doesn't even exist, as it's not quite differentiable). The way you end up tuning the alpha step size in PA is very similar to how you'd choose the lambda in a trust-region method, with the difference that you have a closed-form in PA which you don't in the trust-region method.

Hat tip to Michael Wick, with whom I discussed it this morning.

#JustMigrate

Algorithms versus ideas (or learning reductions versus perceptron updates)

One of the most entertaining parts of NIPS 2011 was sitting in Mark Reid's talk and the panel session in the Relations between machine learning problems workshop in Sierra Nevada. There was a long discussion on mapping the landscape of machine learning in a useful and compact way. This is clearly necessary, as there are currently hundreds of variations of even simple things such as binary classification that are only loosely related to each other. There are also grand unifying paradigms, such as structural risk minimization, neural networks, bayesian machine learning, etc, which tell incompatible stories as far as assumptions, inputs/outputs, and algorithms go, although some mapping between these approaches is possible (for example you can pick almost any neural network, tack a probability distribution on top of it and call it bayesian, or you can call your empirical risk a "likelihood" and your regularizer a "prior", and then trade days of computation time for more accuracy).

In the workshop I detected a bias towards the more formal, CS-y, kind of thinking, which is well expressed by the learning reductions formalism. In this paradigm you pick a simple, well-defined, baseline learning problem (such as binary classification with the 0-1 loss, but it could be really anything), and show how being able to build fairly accurate solutions to it allows you to build reasonably accurate solutions to more complex problems (and here SEARN works for structured learning, costing for cost-sensitive classification, etc), where "fairly" and "reasonably" are formally specified. This allows you to build a very CS-y map of the machine learning problems, the structures between them, and even how they relate to each other. In the future this might lead to interesting notions of completeness or hardness, etc.

In this spirit people at the workshop suggested that the way to go is formal specification of the interfaces between learning problems (going as far as suggesting learning-as-a-service), as this interface-specific knowledge would then lead to a better mapping of the relationships, and a better usage of reductions.

I'm not too happy with this view, however, for two reasons: first, because learning problems are social constructs, and simplifications of real-world needs, and secondly because this is not the pattern of reuse which has naturally developped, and it would be smart to understand this pattern before changing it.

For my first point I'd like to say that I've never seen a binary classification problem. What I've seen is many real-world, messy problems, mapped into the "binary classification" box so that they can be solved to a reasonable degree of accuracy. There is already a lot of heavy mapping going on between the needs of a practicioner and the abstraction of binary classification: you need to come up with some notion of instances which are reasonably independent, a labeled set of examples (and with these a labeling scheme) to train and test on, heuristic feature functions which are almost but not quite equivalent to the label, and finally a loss function that allows you to tell how right/wrong you are (which depends heavily on the labellings, which almost never occur naturally).

So all these issues are rightly swept under the rug when talking about building classifiers, because over time it was discovered that for a lot of problems there exist ways of making these mapping decisions which lead to useful results in the end, at a reasonable cost. At the same time, these mappings create binary classification problems with the same interface that are so different under the hood that entirely different abstractions are often used to attack them (think feature learning/engineering, online things, generative assumptions, etc).

My second point is, I think, even more important. If you look at the way machine learning has developed historically, researchers almost never trusted each other's results as black boxes to which you can plug things in the input and output slots. Completely opposite to the learning reductions approach is how perceptron updates came to be in the center of way too many facets of machine learning. Instead of looking for clear, well-defined, input-output black boxes, machine learning researchers have pried open problems looking for a place where you can pick your parameter vector and add it to it something easily computed from the training example and the true label in a way that your parameter vector has a higher dot product with the true label than with the wrong labels. This simple, ugly, idea, which depends on all sorts of artificial notions such as parameter vectors, true labels, training examples, etc, can actually be found lying around inside most machine learning algorithms, at least the efficient ones. It can be used from binary classification to machine translation, and pretty much every problem in between. The open-box idea of perceptron updates, or more generally gradient/mirror descent in generalized linear models (the fancy way of calling it), has managed to be more useful than any black-box abstraction as far as machine learning goes. This is, of course, not at all unique to the perceptron update, and I could very well say something like belief propagation (which will then encompass most message-passing and some non-message-passing algorithms used in ML, with a loose enough definition of algorithm).

So, while thinking in terms of nice taxonomies, clean interfaces, reductions, and complexity is very appealing, is it really the right thing to do? Can a learning reduction make any black-box algorithm as useful as the idea of a perceptron update?

Finally, should we be really looking at reusing black-box algorithms, or should we try to surface as clearly as possibly the white-box ideas which can later be reused in countless other algorithms?

I think rather than attempting to reconceive the world in our image it often turns out that the best way to move forward is to do what already works in a better way.

(of course I don't find reductions a bad idea at all. It is brilliant how some complex problems can be solved using them, and I'd like to work on them at some point. It's just that maybe a black box view of the field is not ideal)

#JustMigrate

#NIPS2011 Workshop day 1 Highlights

Today was the first day of the NIPS worksops, and as usual it was really interesting.

In the morning I mostly stayed at the Bayesian Optimization workshop, which was really interesting. Remy Munos talked about his ideas on optimizing highly nonconvex and nonsmooth functions with stochastic bandit feedback. The idea involves using a given hierarchical partition of the space to do some clever sampling, plus slight assumptions on smoothness of the function only around the optimum. Even if you don't know the actual smoothness around the optimum you can still get some pretty interesting results.

I found out about research that is very closely related to my recommendation systems work, by Roman Garnett and Katja Hofmann. I'm really looking forward to reading their papers on the subject, and possibly collaborating in these areas.

There was also Philip Hennig's poster about doing bayesian optimization where you try to keep track of explicit beliefs about where the optimum is and what is its value, rather than keeping track of beliefs about the function value everywhere. Pieter Abbeel's poster about safe exploration was really interesting. Apparently he has a generalization of minimax and expectimax objectives where you can tune one simple intuitive parameter to control how adversarial the world is, and route around that.

After this I watched the contributed talks at the Computational Tradeoffs in Learning workshop. Sameer Singh talked about his approximate parallel BP algorithm, which is really interesting. Suppose you run BP, and in doing so find out that the marginal on some nodes isn't changing all that much. Then you can fix the distribution of those nodes, condition on them, and decompose your graphical model into many independent parts, on which you can then run inference in parallel. Apparently at 95% accuracy or so you can get almost linear speedup with up to eight cores or so on some realistic models.

Then there was Hal Daumé's really interesting talk on learning time vs accuracy tradeoffs in complex inference problems. The setting is great, and I have high hopes for this line of research. Essentially, you have these nondeterministic inference problems, such as parsing, in which you can push and pop items into a chart in pretty much any order you want, and if you're willing to wait forever you will find the optimal value. His objective then is to learn a heuristic function used to sort items popped off of the agenda to minimize time at a small expense in accuracy. The problem ends up being so nondeterministic, however, that it's really hard to come up with reasonable gradients that lead you towards good solutions, so there is a lot of future work needed to make this approach feasible. Regardless, it's something I wish more inference techniques tried to do, and even if this fails some other approach should be tried.

Following that there was Jason Eisner's student Vaselin's talk on something pretty similar, but different. He is following up on his earlier work where you essentially treat your model, inference, and loss function computation as a big black box, which you optimize on your training data using gradients computed by automatic differentiation. This has all sorts of nice consequences, such as not needing to have the artificial distinction between model and inference (which gets really dodgy with approximate inference), getting a nice thing to optimize that leverages properties from your inference, and being able to use literally any structure you want. The problem, unfortunately, ends up highly non-convex (which makes sense). This talk was about extending this idea to account for explicit computational cost tradeoffs by modeling early stopping and factor selection as part of the cost function you can optimize over. It's pretty impressive, and the results are great.

In the afternoon I mostly stuck around the Relationships Between Machine Learning Problems workshop, which I generally found really interesting. I think machine learning as a field has solidified around some relatively suboptimal designs for problems and solutions, and exploring the relationships between these is a way of breaking out of this cycle. Mark Reid's talk on possible anatomies for a learning problem was really interesting, not so much for the solutions it proposed (which were arguable and hence endlessly discussed in the panel sessions) but for making a broad sweep of the usual current ways of talking about machine learning problems. Also, for exposing me to the term "nips bingo", which is writing papers by essentially stringing random buzzwords together (for an innocuous example, think of something like "A copula-based manifold optimization process for for semi-supervised identification of latent tree models"). The panel discussion was really enlightening, as was the follow-up discussion I had with Tiberio Caetano about some issues regarding current machine learning.

All in all this was really interesting, and I'm looking forward for tomorrow's sessions. I'll have a poster at the Challenges in Learning Hierarchical Models workshop, and I'm also planning on seeing some talks in the Philosphy and Machine Learning, Computational Social Science, Learning Semantics, and Domain Adaptation workshops. It'll be a lot of running around.

#JustMigrate

•18+ Adults Only

Watch Anya Live on Cam

Anya is live and ready to show you everything. Watch her strip, dance, and perform exclusive shows just for you. Interact in real-time and make your fantasies come true.

✓ Live Streaming✓ Interactive Chat✓ Private Shows✓ HD Quality✓ Free Actions

Free to watch • No registration required • HD streaming

#NIPS2011 Day 4 Highlights & Workshop plans

Today was only a half day at NIPS, so no posters and very few talks.

There were two very interesting talks on learning large-scale mixture like things with very clever subsampling of the data. The first, Scalable Training of Mixture Models via Coresets, by Feldman et al, is explicitly about something I find really cool and interesting, which is coresets. It turns out that it's surprisingly difficult to extend the known coresets for the k-means problem into coresets for mixtures of gaussians, mostly because of the covariance. Once they managed to get that, however, they got really nice results on streaming and distributed algorithms for learning these coresets. The motivation is that you want to discard points more often in dense regions and less often in sparse regions, while at the same time inversely weighting the remaining points by their probability of being discarded. This is done in a higher-dimensional space that somehow helps encode covariance information. Then they can cheaply merge these coresets into larger coresets that represent the data more accurately. It's really impressive how well their algorithm degrades when you tighten the data budget (essentially not at all).

The other talk was Fast and Accurate k-means For Large Datasets, by Schindler et al. They want to approach a much easier problem, k-means, but the mechanics of the technique is eerily similar. The algorithm has a two-phase structure: first, one streaming pass over the data discards all points sufficiently close to known points (probably shifting the weight of the known points) to get a smartly subsampled dataset; then, a second pass goes over these weighted remaining points and computed the solution to a k-means problem. The first phase caused some people to misunderstanding it as approximating k-medians or some other aglomerative clustering technique, but it's really finding a coreset structure in a streaming setting and then using it to learn k-means. The highlight was the surprisingly cheeky method they used to find the nearest neighbor of a newly arrived streaming point: random projection (note the singular). They just projected all points in the coreset into one random direction and did a binary search in this direction to find some candidate points closest to the test point to avoid computing all n*k distances, which makes their algorithm faster than computing the quality of the k-means approximation.

Tomorrow is the first day of the workshops, which are my favorite part of nips, because the talks tend to be more interesting and the posters more creative (you can clearly tell that I don't ski). I'm planning on attending my talk in the Bayesian Optimization workshop at 8h40 (at Hotel Bar in Melia Sierra Nevada), Jenn Wortman Vaughn's talk at the relationships between machine learning problems workshop (at Dilar, Melia Sierra Nevada), the contributed short talks at the computational tradeoffs workshop (at Montebajo: Basketball court), and the afternoon talks of Peter Grunwald, Mark Reid, and Dario Garcia at the relations between machine learning problems workshop, again.

#JustMigrate

#NIPS2011 Day 3 Highlights

Today was, I think, the most interesting day in NIPS by far.

It started really well with Bernard Schölkopf's talk about causality. He talked mostly about his recent research (and not other also interesting directions) on identifying causal-like relationships without interventions. The key idea is that with some somewhat defensible modeling assumptions, like additive heteroskedatic noise (that is, caused variables are generated from their causes by computing a deterministic nonlinear function and adding noise with a fixed variance) or the fact that P(cause) is independent of the function f(cause) that generates the caused variable you can decide which variables cause which, or, with some kernel-based techniques, whether there is a confounding hidden variable.

Of course, what this is really saying is that these seemingly innocuous assumptions are actually very limiting. There are, of course, many philosophical issues with looking at causality in graphical models, many obvious ones (are there causes?, as someone asked) and many non-obvious ones (most actual data is on very aggregate variables instead of individual events, and it is really arguable whether changing the rate of X can really cause the rate of Y to change, as causality is a relationship between local events and not aggregate statistics), but this whole direction is really interesting.

There was also an interesting spotlight on Object Detection with Grammar Models by Girshick et al, but I didn't quite catch enough detail to explain it.

The talk on Efficient inference in CRFs with fully connected Gaussian potentials by Koltun et al was also really interesting. It turns out that the main message-passing step of a variational approximation of this kind of CRF with gaussian random variables is a convolution with a high-dimensional Gaussian filter, and there are very efficient techniques to do that which can avoid the obvious n^2 step and make using these models feasible where otherwise they wouldn't be.

The paper on Practical Variational Inference for Neural Networks by Alex Graves also turned out to be really interesting. The model is essentially a batch version of confidence-weighted learning, and he managed to come up with enough tricks to make it really fast.

Jake Abernathy's talk on A Collaborative Mechanism for Crowdsourcing Prediction Problems was the best NIPS talk so far. Really nice motivation, and a very interesting step towards a solution to this problem. There are some really hard trade-offs here: you want to design incentives so that people who have information feel compelled to share it as soon as possible but at the same time you don't want payoffs to asymptote too quickly or there is no incentive to shave off the last couple of percent points, which are in practice much more expensive (exponentially so) than the first percentage points. There is also the issue of generalization: in practice these competitions degenerate into a race between ever-more-complex ensembles, which really hurts whoever wants to build a real system out of the results of these prediction competitions. Still, I really hope this area gets more researched, as I think it is really important.

Now it's time to get ready for the workshops (my talk is 8h40 AM on the Bayesian optimization workshop, on friday, and there are some really interesting talks in the Computational tradeoffs, Relationships between learning problems, and Learning semantics workshops).

#JustMigrate

#NIPS2011 Day 2 Highlights

Yesterday most of the talks were about reinforcement learning or neuroscience, so they didn't quite catch my attention. Thankfully the organization moved the posters a bit further apart, which helped a lot. Also, most of the posters were papers I had already read, except for one (which wasn't available when I made my reading lists):

Message-Passing for Approximate MAP Inference with Latent Variables, by Jiang, Rai, and Daume III. They present another, simple, generalization of loopy belief propagation for the case of marginal MAP inference. The algorithm is really simple: nodes which you want to marginalize away send sum-product messages and nodes you want to maximize on send max-product messages. One thing which slightly confused me is that in their experiments max-product was always better than sum-product for the case of marginal map inference, but as far as I know most people rarely use max-product BP, instead doing sum-product and looking at max-marginals (or max-consistent marginals in the case of parsing or something similar), as it tends to converge more often and find better fixed points. Maybe this is related to the Hamming loss-like evaluation function, however. Still, this is something I see myself using in the future, as in practice all cases of inference in nontrivial graphical models involve some variables you added just to manage uncertainty and some variables you really want to be as accurate as possible on.

The talk by Po-Ling Loh on guarantees even in the case of non-convexity was also pretty interesting.

One trend I noticed yesterday as well was more than one paper on learning latent tree structures. I think I'll check this area out more closely after the conference.

#JustMigrate

#NIPS2011 Day 1 Highlights

The tutorials this year struck me as rather weak, but perhaps this is observer bias. First, it's odd that two tutorials were picked on what are very closely related topics (LP relaxations for inference and dual decomposition and lagrangian relaxations, instances of LP relaxations, for inference in NLP). Of these I liked the NLP tutorial more, perhaps because of the simultaneous focus on practical things and simple explanations of the theorems behind them.

The nonparametric bayesian tutorial was interesting, but didn't have much in the way of cool practical advice or deep new insights (rather, it helped shape some ideas I already had). Parts of the presentation were shaky as well. For example, they argue that in Bayesian inference the posterior always (except for some corner cases) converges to whatever generated the data. This is called Dobb's theorem. I wonder, however, how is it possible to apply this in non-identifiable settings, or, more generally, in settings with latent variables where the likelihood biases what kind of information about the latent variables can be recovered (for example, some bits of the latent variables might never be used to generate the observable data). If this theorem applies only in scenarios where there are no latent variables then it is really uninteresting.

Apparently they've managed to cut away all the funny bits of NIPS, making a rather boring introductory talk and completely removing the workshop skits.

Of the papers, the layout of the poster session was really bad, crowding over a thousand people in an unnecessarily small space, and grouping posters in sets of three, which had the unfortunate effect that middle-of-the-set presenters always ended up standing right in front of their posters.

My reading list was posted before, but as usual some papers of the day caught my attention.

Collective Graphical Models by Sheldon and Dietterich proposes and solves a rather neat problem where observables in a graphical model are really averages over large sets of replicas of the model. This arises all the time when, to borrow their example, fairly thorough samples are made of sequential processes at given time stamps which reveal how many things are in which state at any time but crucially not which of these things are in which state. This individual information is filled in with something EM-like.

Multiview Learning of Word Embeddings via CCA by Dhillon, Foster, and Ungar tries to learn things which loosely resemble brown clusters (and neural word embeddings) but which, crucially, are (1) token-specific, rather than type-specific (although it's not clear how they manage this) and (2) very fast by a linear model which is essentially CCA, so it's easy to implement and understand. Instead of grouping the entire context of a token in a bag, or just looking at words before/after, they use CCA to find the maximally correlated projection for each word given both its before- and after-context. They manage to get new state-of-the-art results on CONLL NER, which I find surprising as I had already written off data dataset as being too overfit already, beating all other approaches I know of, even without doing severe feature engineering (which the other approaches famously do).

All in all this was really interesting, and I'm looking forward to the next days.

#JustMigrate

The value of information in recommendation systems

If you look the right way at a recommendation system, in practice, you'll see a machine learning system that selects a few unlabeled examples, shows them to a user, eventually gets labels for some of those, and retrains itself to incorporate this new information. So effectively the act of making a recommendation is the only way such systems have to get information about the users, specially as in most user interfaces search functionality is deemphasized in relation to recommender functionality. These systems are effectively selecting their own training data.

Also, one of the motivations behind recommendation systems is that people on websites will generally search for similar things, so you might be able to show them what they would search for before they do it. However, if you assume, like it's usually done, that your objective is predicting some behavioral data that you have collected, there's a problem. Suppose I can give you a perfect recommender system, no errors, perfect precision and perfect recall. It will, then, necessarily, recommend to each user those things he was going to search for anyway and nothing else. This is obviously bad from a cost perspective: you spent a lot of cash hiring grumpy machine learners and now your revenue stayed the same. This suggests that the actual quality of a recommendation system is counterfactual: it's how much your actual revenue will increase if you use it, compared to what its value will be if you don't use it.

Finally there's another tricky issue. Let's say we're building a movie recommendation system. Suppose every day I search for silly action movies (Rambo 17, or whatever) and click on "one star". In most data driven settings you'll only observe that I am not liking the rambo movies - essentialy, that my probability of liking a rambo movie given that I wanted to watch it is small. However, the probability of me wanting to watch it is pretty high. Should a recommender system show me rambo movies? Or, more generally, what can you do when the data you collect (in this case P(rating)) is actually conditioned on one of the variables you want to predict (in this case P(want to watch)?

To start to tackle all these issues I'm presenting a paper from my internship work with Jurgen van Gael, Ralf Herbrich, Ulrich Paquet, and others, at Microsoft Research Cambridge, in the Bayesian Optimization, Experiment Design, and Bandits workshop.

The idea behind the paper is to treat the value of a recommender system as the quality of the recommendations it actually makes (not some abstract notion of matching the rank of user preferences or, god forbid, mean squared error on star ratings). If you assume the users are not adversarial (i.e., a poser, or someone who clicks "like" not because he likes something but to deceive your system) and their actions can be well-modeled by a probability distribution on which you have priors (maybe computed from a logged data set), you can write down the expected change in total value of this recommendation session as a function of which recommendations the system makes to the user (because then there is a probability of the user rating those items, and the system getting more information, and making better recommendations).

This provides natural solutions for many classic problems in recommendation systems, such as diversity (you now want to make diverse recommendations to maximize the information you'll get from the user), over-personalization (as long as there is uncertainty in the system, and you should probably use a tracking model anyway, the system will keep risking making bad recommendations on the off chance that the user might like something different), and cold-start (while you can't predict the tastes of a user you have never seen it is possible to make good recommendations by hedging your bets). We even have experimental data with real users showing that this in fact works.

In one sentence, my point is that every recommendation system is in practice an active learning agent, and a pretty naive one, so it should be very easy to do better and hedge your bets.

If you're interested, or if you think I'm wrong, I'll give a talk December 16th, 8h40 AM, on the Bayesian Optimization, Experimental Design, and Bandits NIPS workshop. The paper is also online, and I'd love to get some comments. If you can't make it to the talk I'll also be around in the poster sessions in that workshop.

#JustMigrate

•18+ Adults Only

Watch Anya Live on Cam

Anya is live and ready to show you everything. Watch her strip, dance, and perform exclusive shows just for you. Interact in real-time and make your fantasies come true.

✓ Live Streaming✓ Interactive Chat✓ Private Shows✓ HD Quality✓ Free Actions

Free to watch • No registration required • HD streaming

Correlations and anticorrelations in LDA inference

This year I'll present two papers in NIPS workshops. The first is about something I think is an often overlooked issue in topic models, specially more complicated ones. This is all obviously my own personal views, and do not reflect those of my co-authors (Hanna and Andrew), although hopefully they'll not disagree too much with me.

The paper is called Correlations and Anticorrelations in LDA inference, and it is based on the following insight. In LDA the only reason you can learn crisp, semantic topics at all is that the prior has a very strong preference for using as few as possible topics in any given document. Not only that, but given a choice about whether to split a certain number of word tokens between two topics or put them all in one topic the solution where they are all in one topic always has higher likelihood. So what happens is that if each word in a document is really only explainable by a single topic nothing bad happens. Whenever a token is explainable by more than one topic, LDA inference will pick the topic for that document which more easily explain some other words in that document. Iterating this process,then, will make the topics "move apart" in space so that each topic ends up responsible for as few words as possible.

However, it's usually common in the final learned topics for some word types to appear with very high probability in more than one topic. If you try, then, to infer the document-specific distributions over topics for new documents, whenever that word happens to occur in a document the inference process will effectively pick one of the two possible explanatory topics. This means that those topics which share words, even if the prior distribution is completely indifferent to their co-occurrences, will co-occur less often than expected on the posterior, and much less often than those topics which do not share words.

So if you try to learn some structure such as correlations over the topics what happens is that you'll probably also learn a lot of spurious correlations induced not by the actual information in the topics but by which topics share how many words. A nice thing of this is that it can explain why things such as correlated LDA have a higher test-set likelihood even though they do not produce better topics: because this bias is the same in the training and test sets the model will accurately learn to predict topic covariances on the test data even though those covariances are artificial.

If you want to know more about this, or tell me what is wrong with this, I'll be happy to talk with you in the poster sessions in the Challenges in Learning Hierarchical Models: Transfer Learning and Optimization workshop, December 17th, in "Montebajo: Library" (whatever that means). These sessions will be 9h50 AM and 7h10 PM.

#JustMigrate #lda #paper

Bayesian optimization

Looking at the NIPS papers (and workshops) I noticed many references (some hidden) to Bayesian optimization as a useful subroutine, so I decided to investigate and see what it's all about. I found a great tutorial by Eric Brochu, Vlad Cora, and Nando de Freitas, and I decided to follow it and see what it's all about. Note that this is one of the techniques used in James Bergstra's paper on hyperparameter optimization, and he has optimized and tested code online to do this kind of optimization, so you should use that code rather than the one here.

Regardless, I thought it to be a useful exercise to implement it and see what happens.

Bayesian optimization is a really cool and simple way of globally optimizing a function which is very expensive to evaluate. So expensive, in fact, that you're not bothered by the fact that it involves optimizing over a gaussian process with something like simulated annealing (or really whatever gradient-free global optimization algorithm strikes your fancy). One example of this is hyperparameter optimization, where your neural network has over 10 different tuning parameters which interact in odd ways and essentially make your research irreproducible. Then, you really care about not running too many iterations of something silly such as simulated annealing over your real objective function and are willing to spend the computational budget of a few minutes to get a really good point to try out.

So how does it work? First, you assume you have evaluated your objective function on a bunch of arbitrarily chosen points of your space. Then, in each iteration, it fits a gaussian process over your function evaluations and searches over this gaussian process for the best point to try out next. This search is a global optimization of something called an acquisition function, which can be as simple as the expected value of your objective at the point plus the predicted variance at that same point (the UCB acquisition function) or the expected improvement evaluation your function at that point can have over your current best point. Then, you select your current best point, evaluate your expensive objective function in it and add it to the list. Repeat until you're satisfied that your objective value is no longer improving all that much.

Bayesian optimization also makes sense in an intuitive way: essentially you're replacing a worst-case assumption (for example, doing grid search and hoping to find optimal points assumes that your function is Lispchitz with constant related to the size of the grid) on your function with an average-case assumption expressed as any kernel you want. Effectively instead of saying that (f(x')-f(x))/||x'-x|| is always smaller than K you say that f(x')-f(x) ~ Normal(0, k||x'-x||), which then allows you to fiddle with the norm (or even replace it with something entirely different), and also generalizes nicely to discrete or mixed discrete-continuous spaces (as long as you're still able to do a reasonable job of searching over your acquisition function).

https://gist.github.com/1402892

Edit: Ruben Cantin has an efficient multithreaded C++ implementation with python bindings.

#JustMigrate #bayesian optimization #code #hyperparameters

Trending Blogs

Last Seen Blogs

Alexandre Passos's ML blog