
Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets

Abstract

Bayesian optimization has become a successful tool for hyperparameter optimization of machine learning algorithms, such as support vector machines or deep neural networks. Despite its success, for large datasets, training and validating a single configuration often takes hours, days, or even weeks, which limits the achievable performance. To accelerate hyperparameter optimization, we propose a generative model for the validation error as a function of training set size, which is learned during the optimization process and allows exploration of preliminary configurations on small subsets, by extrapolating to the full dataset. We construct a Bayesian optimization procedure, dubbed FABOLAS, which models loss and training time as a function of dataset size and automatically trades off high information gain about the global optimum against computational cost. Experiments optimizing support vector machines and deep neural networks show that FABOLAS often finds high-quality solutions 10 to 100 times faster than other state-of-the-art Bayesian optimization methods or the recently proposed bandit strategy Hyperband.

http://proceedings.mlr.press/v54/klein17a/klein17a.pdf

Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, Frank Hutter


Understanding Black-box Predictions via Influence Functions

Abstract

How can we explain the predictions of a blackbox model? In this paper, we use influence functions — a classic technique from robust statistics — to trace a model’s prediction through the learning algorithm and back to its training data, thereby identifying training points most responsible for a given prediction. To scale up influence functions to modern machine learning settings, we develop a simple, efficient implementation that requires only oracle access to gradients and Hessian-vector products. We show that even on non-convex and non-differentiable models where the theory breaks down, approximations to influence functions can still provide valuable information. On linear models and convolutional neural networks, we demonstrate that influence functions are useful for multiple purposes: understanding model behavior, debugging models, detecting dataset errors, and even creating visually indistinguishable training-set attacks.

https://arxiv.org/pdf/1703.04730.pdf

Pang Wei Koh, Percy Liang

Problem: What made our black-box model less effective? Is something wrong with the training data? Are some labels wrong?

Idea: use influence functions to trace the influence of individual training samples on the predictions for test samples.

The influence of a single training sample z on the model parameters θ is calculated as:

I_{\text{up,params}}(z) = \left.\frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\right|_{\epsilon=0} = -H_{\hat{\theta}}^{-1}\,\nabla_\theta L(z, \hat{\theta})

where ε is the weight of sample z relative to the other training samples; with n training samples, upweighting one sample corresponds to a perturbation on the order of 1/n.

Here H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n}\nabla^2_\theta L(z_i, \hat{\theta}) is the Hessian, the second-order partial-derivative matrix that aggregates curvature information from all n training samples at the parameters θ̂.

Thus, the gradient term \nabla_\theta L(z, \hat{\theta}) captures the effect of the single training sample z on the model parameters θ, where L is the loss function.

In summary, the influence I_{\text{up,params}}(z) is composed of two parts of information:

  • Curvature information from all training samples, implied by the inverse Hessian.
  • The effect of the current training sample z on the model parameters θ, given by its gradient.

The article further derives the influence of a single training sample z on the prediction for a single test sample z_test:

I_{\text{up,loss}}(z, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat{\theta})^{\top}\, H_{\hat{\theta}}^{-1}\, \nabla_\theta L(z, \hat{\theta})

This influence consists of three parts of information, corresponding to the three factors in the formula (a runnable sketch follows the list below):

  • The sensitivity of the test sample z_test to the model parameters θ, given by the gradient \nabla_\theta L(z_{\text{test}}, \hat{\theta}).
  • Curvature information from all training samples, implied by the inverse Hessian.
  • The effect of the current training sample z on the model parameters θ, given by the gradient \nabla_\theta L(z, \hat{\theta}).
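To make the formula concrete, here is a minimal NumPy sketch (my own illustration, not the paper's code) that computes I_up,loss for a small logistic-regression model by forming the Hessian explicitly. The paper's actual implementation (see the influence-release repository linked below) never forms H; it relies only on Hessian-vector products with conjugate gradients or stochastic estimation, which is what makes the method scale.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_loss(theta, x, y):
    # Gradient of the logistic loss log(1 + exp(-y * x.theta)) w.r.t. theta, y in {-1, +1}.
    return -y * x * sigmoid(-y * (x @ theta))

def hessian(theta, X, damping=1e-2):
    # H = (1/n) sum_i s_i (1 - s_i) x_i x_i^T; small damping keeps H invertible.
    s = sigmoid(X @ theta)
    w = s * (1.0 - s)
    H = (X * w[:, None]).T @ X / X.shape[0]
    return H + damping * np.eye(X.shape[1])

def influence_up_loss(theta, X, y, x_test, y_test):
    # I_up,loss(z_i, z_test) = -grad L(z_test)^T H^{-1} grad L(z_i) for every training z_i.
    v = np.linalg.solve(hessian(theta, X), grad_loss(theta, x_test, y_test))
    grads = np.array([grad_loss(theta, X[i], y[i]) for i in range(X.shape[0])])
    return -grads @ v
```

Under this sign convention, a negative I_up,loss means upweighting z would lower the test loss (a helpful point), and a positive value means it would raise it (a harmful point).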

Let’s see what happens if one of the terms is missing from the formula:

  • If the third factor, the loss gradient of the single training sample z, is missing, the estimated influence I of that sample deviates, as in the left picture of the figure above.
  • If the second factor, the Hessian matrix, is missing, no other training samples are taken into account: a training sample with the same label as the test sample (green) could then only help the prediction, and a training sample with a different label (red) could only hurt it.
    This is not true in general: some training samples share the test sample’s label and still hurt training. The right side of the figure below shows such a case: a legitimately labeled training sample that nevertheless disturbs training:

That is, if we add one more training sample like the “7” on the right, the model becomes more likely to misclassify the test sample on the left (its loss value becomes larger).

At the end of the article, several practical use cases for influence functions are given.

1. Understand the behavior of the model

The article gives a vivid example that uses influence functions to compare a support vector machine (SVM) and a deep network (Inception) on a model that distinguishes “fish” from “dog”.

The green dots are training samples labeled “Fish”, and the red dots are training samples labeled “Dog”.

 

Comparing the SVM plot with the Inception plot: the abscissa is the Euclidean distance between a training sample and the test sample (which can be read as image similarity), and the ordinate is the influence of that training sample on a single test sample.

 

What is interesting:

1. In the SVM model, training samples that differ significantly from the test sample (large Euclidean distance) contribute almost nothing to the model’s decision on that test sample (I is almost 0). This is consistent with the important role that support vectors play in classification: the more difficult the training sample, the greater its impact on the model.

2. In the Inception deep network, training samples influence the judgment on the test sample no matter how large the Euclidean distance is. This illustrates an advantage of deep networks: every training sample can have an effect on model optimization, whether positive or negative.

3. The two sample images shown on the right of each plot are the training samples with the greatest influence on the prediction. For the SVM, the two least fish-like “fish” samples play the greatest role in the decision; for Inception, by contrast, the two most influential samples are clearly “fish” images.

2. Generate adversarial training examples

3. Assess the value of the training sample set

If the training set and the test set are not from the same domain or the same distribution, then collecting more training samples will not help train the model.

If, when you compute the influence I, only a very small fraction of the training samples has any effect on the predictions for the test samples, you should be careful: perhaps the domain in which your training samples were collected is wrong, and you need to collect training samples in another way.
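As a rough diagnostic (my own sketch, reusing the influence_up_loss helper from the sketch above, and assuming a fitted theta plus X_train/y_train/X_test/y_test arrays), one can check what fraction of training points ever carries noticeable influence; the cutoff 1e-3 is purely illustrative:

```python
# Influence of every training point on every test point: shape (n_test, n_train).
infl = np.stack([influence_up_loss(theta, X_train, y_train, x, y)
                 for x, y in zip(X_test, y_test)])
# Fraction of training samples that noticeably influence at least one test prediction.
active = np.mean(np.abs(infl).max(axis=0) > 1e-3)
print(f"{active:.1%} of training samples matter; a tiny fraction may signal domain mismatch")
```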

 

4. Find training samples with wrong labels

 
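In the paper, likely label errors are flagged by ranking training points by their self-influence I_{\text{up,loss}}(z_i, z_i), i.e. how strongly each point affects its own loss, and manually checking the top of the list. A sketch reusing the helpers above; here I rank by the nonnegative quadratic form g_i^T H^{-1} g_i, the magnitude of the self-influence:

```python
def self_influence(theta, X, y):
    # g_i^T H^{-1} g_i for each training point: large values flag unusual,
    # possibly mislabeled examples that the model works hard to fit.
    H = hessian(theta, X)
    grads = np.array([grad_loss(theta, X[i], y[i]) for i in range(X.shape[0])])
    return np.einsum('ij,ij->i', grads, np.linalg.solve(H, grads.T).T)

# Inspect the labels of the most self-influential training points first.
suspects = np.argsort(-self_influence(theta, X_train, y_train))[:20]
```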

Reference:

https://github.com/kohpangwei/influence-release


Large-Scale Evolution of Image Classifiers

Abstract

Neural networks have proven effective at solving difficult problems but designing their architectures can be challenging, even for image classification problems alone. Our goal is to minimize human participation, so we employ evolutionary algorithms to discover such networks automatically. Despite significant computational requirements, we show that it is now possible to evolve models with accuracies within the range of those published in the last year. Specifically, we employ simple evolutionary techniques at unprecedented scales to discover models for the CIFAR-10 and CIFAR-100 datasets, starting from trivial initial conditions and reaching accuracies of 94.6% (95.6% for ensemble) and 77.0%, respectively. To do this, we use novel and intuitive mutation operators that navigate large search spaces; we stress that no human participation is required once evolution starts and that the output is a fully-trained model. Throughout this work, we place special emphasis on the repeatability of results, the variability in the outcomes and the computational requirements.

https://arxiv.org/pdf/1703.01041.pdf


Key points:

  1. Many controllers, so-called workers in the paper, are needed to guide the evolution process (e.g., selection, mutation).
  2. The controllers work in a distributed way; a minimal sketch of one worker's evolutionary step follows below.
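A minimal sketch of the worker loop as I read it from the paper: each worker repeatedly samples two individuals from the shared population, kills the worse one, and replaces it with a mutated copy of the better one. Everything below is a toy stand-in; in the paper an individual encodes a CNN architecture, fitness is validation accuracy after training, and mutations are structural (add/remove layers, change learning rate, etc.).

```python
import random

def train_and_evaluate(genome):
    # Toy fitness stand-in; the paper trains the encoded CNN and uses validation accuracy.
    return -sum((g - 0.5) ** 2 for g in genome)

def mutate(genome):
    # Toy mutation; the paper mutates the architecture itself (layers, filters, etc.).
    child = list(genome)
    i = random.randrange(len(child))
    child[i] += random.gauss(0.0, 0.1)
    return child

def worker_step(population):
    # Pairwise tournament: sample two individuals, kill the worse,
    # and reproduce the better one with a mutation.
    a, b = random.sample(range(len(population)), 2)
    worse, better = sorted((a, b), key=lambda i: population[i][1])
    child = mutate(population[better][0])
    population[worse] = (child, train_and_evaluate(child))

population = []
for _ in range(10):
    genome = [random.random() for _ in range(4)]
    population.append((genome, train_and_evaluate(genome)))

for _ in range(200):  # in the paper, many workers run these steps concurrently
    worker_step(population)

best = max(population, key=lambda p: p[1])
print("best genome:", best[0], "fitness:", best[1])
```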

Benefits: The proposed method is simple and it is able to generate a fully trained network requiring no post-processing.

Concerns:

The paper is interesting in that it discovers a fully automated DNN architecture for solving complex tasks without human participation. Although the authors claim their method is scalable, only companies that own large-scale compute platforms can employ it, and as long as no more economical implementation exists, it is hard to call it a scalable solution. Still, it is a good starting point for automating the architectural design of DNNs, alongside other approaches such as reinforcement learning.