
DeepMind: Neural Processes

What is it about?

More and more works have been proposed to overcome the limitations of deep learning. This work aims to:

  • improve the flexibility of a deep learning model at test time by incorporating the advantages of the Gaussian process (GP);
  • conversely, let deep learning learn a kernel function automatically from observed data, which a GP can then use directly.

Details

(refer to https://wemedia.ifeng.com/68012220/wemedia.shtml)

Function approximation is at the core of many problems in machine learning. DeepMind's latest research combines the advantages of neural networks and stochastic processes, proposing a Neural Process model that achieves good performance and high computational efficiency across multiple tasks.

Paper: https://arxiv.org/pdf/1807.01622.pdf

Function approximation is at the heart of many problems in machine learning. A very popular approach to this problem over the past decade has been deep neural networks. State-of-the-art neural networks are black-box function approximators that learn to parameterize a single function from a large number of training data points. As a result, most of the network's workload falls on the training phase, while evaluation and testing are reduced to fast forward propagation. Although high test-time performance is valuable for many practical applications, the fact that the network's output cannot be updated after training can be undesirable. Meta-learning, for example, is an increasingly popular field of research that addresses exactly this limitation.

As an alternative to neural networks, one can also perform inference over stochastic processes for function regression. The most common example of this approach is the Gaussian process (GP), a class of models with properties complementary to those of neural networks: GPs do not require an expensive training phase and can infer the underlying ground-truth function conditioned on some observations, which makes them very flexible at test time.

In addition, GPs represent infinitely many different functions at unobserved locations, so they can capture the uncertainty of their predictions given some observations. However, GPs are computationally expensive: exact GP inference scales cubically in the number of data points, and the current best approximations scale quadratically. Moreover, the available kernels are usually restricted in their functional form, and an additional optimization procedure is required to identify the most suitable kernel and its hyperparameters for any given task.
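To make the cubic cost concrete, below is a minimal NumPy sketch of exact GP regression with an RBF kernel. The kernel choice, noise level, and example data are illustrative assumptions rather than details from the paper; the point is that the Cholesky factorization of the n × n kernel matrix is the step that scales cubically in the number of observations.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(a, b) for 1-D inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

def gp_posterior(x_ctx, y_ctx, x_tgt, noise=1e-3):
    """Exact GP posterior mean/variance at x_tgt given (x_ctx, y_ctx).

    The Cholesky factorization of the n x n kernel matrix is the O(n^3)
    step that makes exact GPs expensive for large context sets.
    """
    K = rbf_kernel(x_ctx, x_ctx) + noise * np.eye(len(x_ctx))
    K_s = rbf_kernel(x_ctx, x_tgt)
    K_ss = rbf_kernel(x_tgt, x_tgt)
    L = np.linalg.cholesky(K)                                   # O(n^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_ctx))     # K^{-1} y
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v ** 2, axis=0)
    return mean, var

# Example: condition on 5 observations of sin(x), predict at 100 target points.
x_ctx = np.linspace(-2, 2, 5)
y_ctx = np.sin(x_ctx)
x_tgt = np.linspace(-3, 3, 100)
mu, var = gp_posterior(x_ctx, y_ctx, x_tgt)
```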

Therefore, combining neural networks with stochastic-process inference to compensate for the shortcomings of both methods has been attracting more and more attention as a potential solution. In this work, a team led by DeepMind research scientist Marta Garnelo proposes a neural-network-based method for learning an approximation of a stochastic process, which they call Neural Processes (NPs). NPs retain some basic properties of GPs: they learn to model distributions over functions, they can estimate the uncertainty of their predictions conditioned on context observations, and they shift some of the workload from training to test time, which gives the model its flexibility.

More importantly, NPs generate predictions in a computationally efficient way. Given n context points and m target points, inference with a trained NP corresponds to a forward pass through a deep neural network, which scales as O(n + m) rather than the O((n + m)^3) of a classic GP. In addition, the model overcomes many functional design restrictions by learning an implicit kernel directly from the data.

The main contributions of this research are:

  • Neural Processes, a model that combines the advantages of neural networks and stochastic processes.
  • A comparison of neural processes (NPs) with meta-learning, deep latent variable models, and Gaussian processes. Since NPs are related to many of these areas, they provide a bridge for comparison across many related topics.
  • A demonstration of the advantages and capabilities of NPs by applying them to a range of tasks, including 1-D regression, real-world image completion, Bayesian optimization, and contextual bandits.

Neural process model

Figure 1: Neural process model

 

(a) The graphical model of the neural process. x and y correspond to data with y = f(x), C and T denote the number of context points and target points, respectively, and z is the global latent variable. A gray background indicates that the variable is observed.

(b) Schematic diagram of the neural process implementation. The variables in circles correspond to the variables of the graphical model in (a), the variables in boxes are the intermediate representations of the NP, and the bold letters denote the computation modules: h – encoder, a – aggregator, and g – decoder. In our implementation, h and g correspond to neural networks and a corresponds to a mean function. Solid lines indicate the generative process and dashed lines indicate the inference process.

In our NP implementation, we accommodate two additional requirements: invariance to the order of the context points, and computational efficiency.

The final model can be summarized as the following three core components (see Figure 1b):

  • An encoder h from input space to representation space, which takes in pairs of context values (x_i, y_i) and produces a representation r_i = h((x_i, y_i)) for each pair. We parameterize h as a neural network.
  • An aggregator a, which summarizes the encoded inputs into a single representation (in our implementation, their mean).
  • A conditional decoder g, which takes the sampled global latent variable z and the new target locations x_T as input, and outputs the predictions ŷ_T for the corresponding values of y_T = f(x_T).
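To illustrate how these three components fit together, here is a minimal PyTorch sketch of a single NP forward pass. The layer sizes, the ReLU MLPs, the Gaussian parameterization of z, and the log-variance output are illustrative assumptions; this is a sketch of the structure described above, not the authors' reference implementation, and training (e.g. via an ELBO over context/target splits) is omitted.

```python
import torch
import torch.nn as nn

class NeuralProcess(nn.Module):
    """Minimal sketch of the three NP components (sizes are illustrative)."""

    def __init__(self, x_dim=1, y_dim=1, r_dim=64, z_dim=64, h_dim=64):
        super().__init__()
        # h: encoder from (x_i, y_i) pairs to representations r_i
        self.encoder = nn.Sequential(
            nn.Linear(x_dim + y_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, r_dim))
        # maps the aggregated representation r to the parameters of z
        self.r_to_z = nn.Linear(r_dim, 2 * z_dim)
        # g: decoder from (z, x_target) to predictive mean and log-variance
        self.decoder = nn.Sequential(
            nn.Linear(z_dim + x_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, 2 * y_dim))

    def forward(self, x_ctx, y_ctx, x_tgt):
        # Encoder h applied to each (x, y) context pair.
        r_i = self.encoder(torch.cat([x_ctx, y_ctx], dim=-1))   # (n_ctx, r_dim)
        # Aggregator a: a mean over context representations,
        # permutation-invariant and linear in the number of context points.
        r = r_i.mean(dim=0)                                     # (r_dim,)
        # Sample the global latent variable z conditioned on the context.
        z_mu, z_logvar = self.r_to_z(r).chunk(2, dim=-1)
        z = z_mu + torch.randn_like(z_mu) * torch.exp(0.5 * z_logvar)
        # Decoder g: predict y at each target location given z.
        z_rep = z.expand(x_tgt.shape[0], -1)                    # (n_tgt, z_dim)
        y_mu, y_logvar = self.decoder(
            torch.cat([z_rep, x_tgt], dim=-1)).chunk(2, dim=-1)
        return y_mu, y_logvar

# One forward pass: 10 context points, 50 target points of a 1-D function.
model = NeuralProcess()
x_ctx, y_ctx = torch.randn(10, 1), torch.randn(10, 1)
x_tgt = torch.linspace(-2, 2, 50).unsqueeze(-1)
y_mu, y_logvar = model(x_ctx, y_ctx, x_tgt)
```

Note that the mean aggregation makes the representation invariant to the order of the context points and keeps inference linear in their number, which is where the O(n + m) cost mentioned above comes from.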

 

Figure 2: Graphical models of related models (a–c) and of the neural process (d). Gray shading indicates that the variable is observed. C denotes the context variables and T the target variables, i.e., the variables to be predicted given C.

 

Results


Figure 4. Pixel-wise regression on MNIST and CelebA

The diagram on the left shows how image completion can be framed as a 2-D regression task, where f(pixel coordinates) = pixel brightness. The diagram on the right shows the image completion results on MNIST and CelebA. The images in the top row show the context points provided to the model. For clarity, the unobserved points are marked in blue for MNIST and in white for CelebA. Given a fixed set of context points, each row corresponds to a different sample. As the number of context points increases, the predicted pixels get closer to the underlying ones, and the variance across samples decreases.
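As a concrete illustration of this 2-D regression framing, the following sketch converts an image into (pixel coordinate, pixel intensity) pairs and samples a random context set, which is the format an NP would consume. The coordinate scaling to [0, 1] and the 28×28 example image are assumptions made for illustration, not details from the paper.

```python
import numpy as np

def image_to_regression_pairs(image, n_context, rng=None):
    """Frame image completion as 2-D regression: x = pixel coordinates
    (scaled to [0, 1]), y = pixel intensity in [0, 1].  Returns a random
    context set and the full target set."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel() / (h - 1), xs.ravel() / (w - 1)], axis=-1)
    values = image.reshape(h * w, -1) / 255.0
    ctx_idx = rng.choice(h * w, size=n_context, replace=False)
    return coords[ctx_idx], values[ctx_idx], coords, values

# Example with a random 28x28 "MNIST-like" image and 100 observed pixels.
fake_image = np.random.default_rng(0).integers(0, 256, size=(28, 28))
ctx_x, ctx_y, tgt_x, tgt_y = image_to_regression_pairs(fake_image, n_context=100)
```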

Figure 5. Thompson sampling with neural processes on a 1-D objective function

These figures show the optimization process over five iterations. Each predicted function (blue) is drawn by sampling a latent variable conditioned on an increasing number of context points (black). The underlying ground-truth function is shown as a black dotted line. The red triangle marks the next evaluation point, which corresponds to the minimum of the sampled NP curve. The red circle in the following iteration marks this evaluated point with its ground-truth value, which is added as a new context point for the NP.
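The loop described in this caption can be sketched as follows. The stand-in sampler below (interpolation of the context plus noise) replaces a trained NP purely so the example runs on its own; with a real NP, each sample would come from decoding the grid with a fresh draw of the latent z. The toy objective and grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_sample(ctx_x, ctx_y, x_grid):
    """Stand-in for one NP function sample: interpolate the context and add
    noise.  A trained NP would instead decode x_grid with one draw of z."""
    order = np.argsort(ctx_x)
    return (np.interp(x_grid, ctx_x[order], ctx_y[order])
            + 0.3 * rng.standard_normal(len(x_grid)))

def thompson_step(sample_fn, ctx_x, ctx_y, objective, x_grid):
    """One Thompson-sampling iteration: draw a function, evaluate its minimum,
    and add the new observation to the context set."""
    y_sample = sample_fn(ctx_x, ctx_y, x_grid)
    x_next = x_grid[np.argmin(y_sample)]   # minimum of the sampled curve
    y_next = objective(x_next)             # query the black-box objective
    return np.append(ctx_x, x_next), np.append(ctx_y, y_next), x_next

# Five iterations on a toy 1-D objective, starting from one context point.
objective = lambda x: np.sin(3.0 * x) + 0.5 * x ** 2
x_grid = np.linspace(-2.0, 2.0, 200)
ctx_x, ctx_y = np.array([0.0]), np.array([objective(0.0)])
for _ in range(5):
    ctx_x, ctx_y, x_next = thompson_step(fake_sample, ctx_x, ctx_y,
                                         objective, x_grid)
```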

Table 1. Bayesian optimization using Thompson sampling

Average number of optimization steps needed to reach the global minimum of a 1-D function generated by a Gaussian process. The values are normalized by the number of steps taken by random search. A Gaussian process with the correct kernel constitutes an upper bound on performance.

Table 2. Results on the wheel bandit problem for increasing values of delta

The results show the mean and standard error of the cumulative and simple regret over 100 trials, normalized by the performance of a uniform agent.

Discussion

We introduced a family of models, called neural processes, that combine the advantages of stochastic processes and neural networks. NPs learn to represent distributions over functions and make flexible predictions conditioned on some context input at test time. Instead of requiring a hand-crafted kernel, NPs learn an implicit measure directly from the data.

We applied NPs to a range of regression tasks to demonstrate their flexibility. The purpose of this paper is to introduce NPs and compare them to ongoing research. Therefore, the tasks we present, while varied, are of relatively low dimensionality. We expect that extending NPs to higher dimensions will highlight their benefits of lower computational complexity and data-driven representations.

Limitations of this work

  • How can we ensure that NPs achieve performance on par with pure deep learning models?
  • In which real-world applications is it better to use NPs rather than plain deep learning?

We look forward to future work that digs deeper into these questions.