
neural process

Neural Processes as distributions over functions

Original from https://kasparmartens.rbind.io/post/np/

At this year’s ICML, some interesting work was presented on Neural Processes. See the paper Conditional Neural Processes and the follow-up work by the same authors on Neural Processes, which was presented at the workshop.

Neural Processes (NPs) caught my attention as they are essentially a neural network (NN) based probabilistic model which can represent a distribution over functions, i.e. a stochastic process. So NPs combine elements from two worlds:

  • Deep Learning – neural networks are flexible non-linear functions which are straightforward to train
  • Gaussian Processes – GPs offer a probabilistic framework for learning a distribution over a wide class of non-linear functions

Both have their advantages and drawbacks. In the limited data regime, GPs are preferable due to their probabilistic nature and ability to capture uncertainty. This differs from (non-Bayesian) neural networks, which represent a single function rather than a distribution over functions. However, the latter might be preferable in the presence of large amounts of data, as training NNs is computationally much more scalable than inference for GPs. Neural Processes aim to combine the best of these two worlds.

I found the idea behind NPs interesting, but I felt I was lacking intuition and a deeper understanding of how NPs behave as a prior over functions. I believe that often the best way towards understanding something is implementing it, empirically trying it out on simple problems, and finally explaining it to someone else. So here is my attempt at reviewing and discussing NPs.

Before reading my post, I recommend that the reader take a look at both original papers. Even though I discuss NPs here, you might find it easier to start with conditional NPs, which are essentially a non-probabilistic version of NPs.

What is a Neural Process?

 

The NP is a neural network based approach to represent a distribution over functions. The broad idea behind how the NP model is set up and how it is trained is illustrated in this schema:

A set of observations (x_i, y_i) is split into two sets: “context points” and “target points”. Given the pairs (x_c, y_c) for c = 1, …, C in the context set and given unseen inputs x_t^* for t = 1, …, T in the target set, our goal is to predict the corresponding function values y_t^*. We can think of NPs as modelling the target set conditional on the context. Information flows from the context set (on the left) to making new predictions on the target set (on the right) via the latent space z. The latter is essentially a finite-dimensional embedding of mappings from x to y. The fact that z is a random variable makes the NP a probabilistic model and lets us capture uncertainty over functions. Once we have trained the model, we can use our (approximate) posterior distribution of z as a prior to make predictions at test time.

At first sight, the split into context and target sets may look like the standard train and test split of the data, but this is not the case, as the target set is directly used in training the NP model – our (probabilistic) loss function is explicitly defined over the target set. This will allow the model to avoid overfitting and achieve better out-of-sample generalisation. In practice, we would repeatedly split our training data into randomly chosen context and target sets to obtain good generalisation.
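The repeated random splitting can be sketched in a few lines of NumPy. This is only an illustration, not the code used for the experiments (which is written in R): the helper name `split_context_target`, the disjoint split and drawing the context size uniformly from 1 to N−1 are all assumptions.

```python
import numpy as np

def split_context_target(x, y, n_context, rng):
    """Randomly split (x, y) pairs into a context set and a target set."""
    perm = rng.permutation(len(x))
    ctx, tgt = perm[:n_context], perm[n_context:]
    return (x[ctx], y[ctx]), (x[tgt], y[tgt])

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 10)
y = np.sin(x)

# A new random split at every training iteration.
for step in range(3):
    n_context = int(rng.integers(1, len(x)))  # between 1 and N-1 context points
    (xc, yc), (xt, yt) = split_context_target(x, y, n_context, rng)
    print(step, len(xc), len(xt))
```

In a real training loop, the optimisation step on the target-set loss would replace the `print` call.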

Let us consider two scenarios:

  1. Inferring a distribution over functions, based on a single data set
  2. Inferring a distribution over functions, when we have access to multiple data sets which we believe to be related in some way

The first scenario corresponds to a standard (probabilistic) supervised learning setup: given a data set of N samples, i.e. given (x_i, y_i) for i = 1, …, N, and assuming there is an underlying true f which has generated the values y_i = f(x_i), our goal is to learn the posterior distribution over f and use it to get predictive densities at test points f(x^*).

The second scenario can be seen from the meta-learning viewpoint. Given D data sets d = 1, …, D, each consisting of N_d pairs (x_i^(d), y_i^(d)), if we assume that every data set has its own underlying function f_d which has generated the values y_i^(d) = f_d(x_i^(d)), we might want to learn the posterior of every f_d as well as generalise to a new data set d^*. The latter is especially useful when every data set has only a small number of observations. This information sharing is achieved by specifying that there exists a shared process which underlies all functions f_d. For example, in the context of GPs, one can assume that the functions f_d ∼ GP share kernel hyperparameters. Having learned the shared process, when given a new data set d^*, one can use the posterior over functions as a prior and carry out few-shot function regression.

The reason I wanted to highlight these two scenarios is the following. Usually, in the most standard setting, GP-regression is carried out under the first scenario. This tends to work well even when N is small. However, the motivation behind NPs seems to be mainly coming from the meta-learning setup – in this setting the latent z can be thought of as a mechanism to share information across different data sets. Nevertheless, having elements of a probabilistic model, NPs should be applicable in both scenarios. Below we will investigate how NPs behave when trained only on a single data set, as well as the second setup where we have access to a large number of function draws in order to train the NP.

How are NPs implemented?

Here is a more detailed schema of the NP generative model:

Going through this generative mechanism step-by-step:

  • First, the context points (x_c, y_c) are mapped through a NN h to obtain a latent representation r_c.
  • Then, the vectors r_c are aggregated (in practice: averaged) to obtain a single value r (which has the same dimensionality as every r_c).
  • This r is used to parametrise the distribution of z, i.e. p(z | x_{1:C}, y_{1:C}) = N(μ_z(r), σ_z^2(r)).
  • Finally, to obtain a prediction at a target x_t^*, we sample z and concatenate this with x_t^*, and map (z, x_t^*) through a NN g to obtain a sample from the predictive distribution of y_t^*.
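The four steps above can be sketched as a single forward pass in NumPy. This is a hedged illustration rather than the actual implementation: the layer sizes, the tanh activations, the helper names `init_mlp`, `mlp` and `sample_prediction`, and the choice to output log σ_z are assumptions, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes, rng):
    """Random weights for a small fully connected network (illustrative only)."""
    return [(rng.normal(size=(m, n)), rng.normal(size=n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, inputs):
    """Apply the network: tanh on hidden layers, linear output layer."""
    out = inputs
    for i, (w, b) in enumerate(params):
        out = out @ w + b
        if i < len(params) - 1:
            out = np.tanh(out)
    return out

dim_r, dim_z = 4, 2                       # assumed sizes
h = init_mlp([2, 8, dim_r], rng)          # encoder: (x_c, y_c) -> r_c
to_z = init_mlp([dim_r, 2 * dim_z], rng)  # r -> (mu_z, log sigma_z)
g = init_mlp([dim_z + 1, 8, 1], rng)      # decoder: (z, x_t^*) -> y_t^*

def sample_prediction(xc, yc, xt, rng):
    r_c = mlp(h, np.column_stack([xc, yc]))   # one r_c per context point
    r = r_c.mean(axis=0)                      # aggregate by averaging
    mu_z, log_sigma_z = np.split(mlp(to_z, r), 2)
    z = mu_z + np.exp(log_sigma_z) * rng.normal(size=dim_z)  # z ~ N(mu_z, sigma_z^2)
    z_rep = np.tile(z, (len(xt), 1))          # the same z for every target input
    return mlp(g, np.column_stack([z_rep, xt]))[:, 0]

xc = np.array([-1.0, 0.0, 2.0])
yc = np.sin(xc)
xt = np.linspace(-3.0, 3.0, 5)
y_sample = sample_prediction(xc, yc, xt, rng)
print(y_sample.shape)
```

Because the aggregation is a mean, reordering the context points leaves r (and hence the prediction for a fixed random draw) unchanged.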

Inference for the NP is carried out in the variational inference framework. Specifically, we introduce two approximate distributions:

  • q(z | context) to approximate the conditional prior p(z | context)
  • q(z | context, target) to approximate the respective p(z | context, target)

where we have denoted context := (x_{1:C}, y_{1:C}) and target := (x_{1:T}^*, y_{1:T}^*).

The approximate posterior q(z | ·) is chosen to have the specific form illustrated in the inference model diagram below. That is, we use the same h to map both the context set and the target set to obtain the aggregated r, which in turn is mapped to μ_z and σ_z. These parametrise the approximate posterior q(z | ·) = N(μ_z, σ_z^2).

The variational lower bound

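$$
\mathrm{ELBO} = \mathbb{E}_{q(z \mid \text{context}, \text{target})} \left[ \sum_{t=1}^{T} \log p(y_t^* \mid z, x_t^*) + \log \frac{q(z \mid \text{context})}{q(z \mid \text{context}, \text{target})} \right]
$$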

contains two terms. The first is the expected log-likelihood over the target set. This is evaluated by first sampling z ∼ q(z | context, target), as indicated on the left part of the inference diagram, and then using these z values for predictions on the target set, as on the right part of the diagram. The second term in the ELBO has a regularising effect: it is the negative KL divergence between q(z | context, target) and q(z | context). Note that this differs slightly from the most commonly encountered variational inference setup with KL(q || p), where p would be the prior p(z). This is because in our generative model, we have specified a conditional prior p(z | context) instead of directly specifying p(z). As this conditional prior depends on h, we do not have access to its exact value and instead need to use an approximate q(z | context).
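Since both q distributions are diagonal Gaussians, the KL term has a closed form, and the expectation can be approximated with a single sample of z. The sketch below shows one way to compute such an estimate. The function names, the Gaussian likelihood with a fixed noise standard deviation `noise_sd`, and the assumption that `pred_mean` is the decoder output for a z drawn from q(z | context, target) are illustrative choices, not details from the papers.

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for diagonal Gaussians."""
    return np.sum(
        np.log(sigma_p / sigma_q)
        + (sigma_q**2 + (mu_q - mu_p)**2) / (2.0 * sigma_p**2)
        - 0.5
    )

def gaussian_log_lik(y, mean, noise_sd):
    """Sum over target points of log N(y_t | mean_t, noise_sd^2)."""
    return np.sum(-0.5 * np.log(2.0 * np.pi * noise_sd**2)
                  - 0.5 * ((y - mean) / noise_sd)**2)

def elbo_estimate(y_target, pred_mean, noise_sd, mu_ct, sigma_ct, mu_c, sigma_c):
    """Single-sample ELBO: log-likelihood minus KL(q(z|context,target) || q(z|context))."""
    return (gaussian_log_lik(y_target, pred_mean, noise_sd)
            - kl_diag_gaussians(mu_ct, sigma_ct, mu_c, sigma_c))

# Toy numbers, chosen only to show the call signature.
value = elbo_estimate(
    y_target=np.array([0.1, -0.3]),
    pred_mean=np.array([0.0, -0.2]),
    noise_sd=0.1,
    mu_ct=np.array([0.2, 0.0]), sigma_ct=np.array([0.5, 0.8]),
    mu_c=np.array([0.0, 0.0]), sigma_c=np.array([1.0, 1.0]),
)
print(value)
```

The KL term is zero when the two distributions coincide and grows as the target set pulls q(z | context, target) away from q(z | context).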

Experiments

NP as a prior over functions

Let’s start by exploring the behaviour of NPs as a prior over functions, i.e. in the setting where we haven’t observed any data and haven’t yet trained the model. Having initialised the weights (here I initialised them independently from a standard normal), we can sample z ∼ N(0, I) and generate from the prior predictive distribution over a grid of x^* values to plot the functions.

As opposed to GPs which have interpretable kernel hyperparameters, the NP prior is much less explicit. There are various architectural choices involved (such as how many hidden layers to use, what activation functions to use etc) which all implicitly affect our prior distribution over the function space.

For example, when using sigmoid activations and varying the dimensionality of z in {1, 2, 4, 8}, typical draws from the (randomly initialised) NP prior may look as follows:

But when deciding to use the ReLU activations instead, we have placed the prior probability mass over a different set of functions:
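A rough way to reproduce this kind of comparison is to push z ∼ N(0, I) through a randomly initialised decoder and swap the activation function. The sketch below assumes a decoder with a single hidden layer of 16 units; that architecture and the function name `draw_prior_functions` are assumptions for illustration, and the resulting curves will not match the figures exactly.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    return np.maximum(a, 0.0)

def draw_prior_functions(x_grid, dim_z, activation, n_draws, rng, n_hidden=16):
    """Draw functions from a randomly initialised decoder g(z, x) with z ~ N(0, I)."""
    # Weights are drawn independently from a standard normal, as in the text.
    w1 = rng.normal(size=(dim_z + 1, n_hidden))
    b1 = rng.normal(size=n_hidden)
    w2 = rng.normal(size=(n_hidden, 1))
    b2 = rng.normal(size=1)
    draws = []
    for _ in range(n_draws):
        z = rng.normal(size=dim_z)  # one z per function draw
        inputs = np.column_stack([np.tile(z, (len(x_grid), 1)), x_grid])
        hidden = activation(inputs @ w1 + b1)
        draws.append((hidden @ w2 + b2)[:, 0])
    return np.stack(draws)  # shape (n_draws, len(x_grid))

rng = np.random.default_rng(0)
x_grid = np.linspace(-3.0, 3.0, 50)
smooth_draws = draw_prior_functions(x_grid, dim_z=2, activation=sigmoid, n_draws=5, rng=rng)
kinked_draws = draw_prior_functions(x_grid, dim_z=2, activation=relu, n_draws=5, rng=rng)
```

With sigmoid units every draw is a smooth curve, whereas ReLU units give piecewise-linear draws, which is one concrete sense in which the architecture acts as an implicit prior.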

Training NP on a small data set

Suppose all we have is the following five data points:

The training procedure for NPs will involve separating the context set and target set. One option is to use a fixed size context set, another is to cover a wider range of scenarios by training using varying context set sizes (e.g. at every iteration we could randomly draw the number of context points from the set {1,2,3,4}). Once we have trained the model on these random subsets, we use the trained model as our prior and now condition on all of our data (i.e. we take all these five points to be the context points) and plot draws from the posterior. The animation below illustrates how the predictive distribution of the NP changes as it is being trained:

So the NP seems to have successfully learned a distribution over mappings which go through all of our five points. Now let’s explore how well it generalises to other mappings, i.e. what happens if we use this trained NP for prediction on a different context set. Here is the posterior when conditioning on the red points instead:

Not very surprisingly, the flexible NP model, which was trained only on subsets of the five blue points, doesn’t generalise to a different set of context points. To get a model which would generalise better, we could consider (pre)training the NP on a larger set of functions.

Training NPs on a small class of functions

So far, we have explored the training scenario using a single (fixed) data set. To have an NP which would generalise similarly to GPs, it seems that we should train it on a much larger class of functions. But before that, let’s explore how NPs behave in a simpler setting.

That is, let’s consider a toy scenario, where instead of a single function we observe a small class of functions. Specifically, let’s consider all functions of the form a sin(x) where a ∈ [−2, 2].

It would be interesting to see:

  1. Is the NP able to capture this class of functions?
  2. Will it generalise beyond this class of functions?

Let’s use the following training procedure:

  • Draw a uniformly: a ∼ U(−2, 2)
  • Draw x_i ∼ U(−3, 3)
  • Define y_i := f(x_i), where f(x) = a sin(x)
  • Divide pairs (xi,yi) randomly into context and target sets and perform an optimisation step
  • Repeat
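The data-generating part of this procedure can be sketched as follows. The number of points per task (`n_points=20`), the function name `sample_sine_task` and drawing the context size uniformly are assumptions; the optimisation step itself is left out.

```python
import numpy as np

def sample_sine_task(rng, n_points=20):
    """Sample one training task: a function a*sin(x) observed at random inputs."""
    a = rng.uniform(-2.0, 2.0)                  # a ~ U(-2, 2)
    x = rng.uniform(-3.0, 3.0, size=n_points)   # x_i ~ U(-3, 3)
    y = a * np.sin(x)                           # y_i = a sin(x_i)
    # The x_i are i.i.d., so taking the first n_context points is a random split.
    n_context = int(rng.integers(1, n_points))
    context = (x[:n_context], y[:n_context])
    target = (x[n_context:], y[n_context:])
    return a, context, target

rng = np.random.default_rng(0)
for step in range(3):
    a, (xc, yc), (xt, yt) = sample_sine_task(rng)
    # An optimisation step on the NP loss for this (context, target) pair would go here.
    print(round(a, 2), len(xc), len(xt))
```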

Here we used a two-dimensional z so that we could visualise what the model has learned. Having trained the NP, we visualised the function draws corresponding to various (z_1, z_2) values on a grid, as shown below:

It seems that the direction from left to right essentially encodes our parameter a. Here is another visualisation of the same effect, where we vary one of the latent dimensions (either z_1 or z_2):

Note that above we did not use any context set at prediction time, but simply pre-specified various (z_1, z_2) values. Now let’s look into using this trained model for predicting.

Taking the context set to be the point (0, 0), shown on the left, will result in quite a broad posterior which looks quite nice, covering functions which resemble a sin(x) for a certain range of values of a (though not for all of the a ∈ [−2, 2] it was trained on).

Adding a second context point (1, sin(1)) will result in the posterior shown in the middle. The posterior has changed compared to the previous plot, e.g. functions with negative values of a are not included any more, but none of the functions goes through the given point. When increasing the number of context points which follow f(x) = 1.0 sin(x), the NP posterior will become reasonably close to the true underlying function, as shown on the right.

Now let’s explore how well the trained NP will generalise beyond the class of functions it was trained on. Specifically, let’s explore how it will generalise to the following functions: 2.5 sin(x) and |sin(x)|. The first requires some extrapolation from the training data. The second has a similar shape to the functions in the training set but, unlike the rest, its values are non-negative.

As seen from the plots, the NP has not been able to generalise beyond what it had seen during training. In both cases, the model behaviour is somewhat expected (e.g. on the left, a = 2 corresponds to the best fit within the class of functions the model has seen). However, note that there is not much uncertainty in the NP predictive distributions. Of course, over-confident predictions are not specific to NPs; however, their black-box nature may make them more difficult to diagnose, compared to more interpretable models.

Training NPs on functions drawn from GPs

Based on experiments so far, it seems that NPs are not out-of-the-box replacements for GPs, as the latter have more desirable properties regarding posterior uncertainty. In order to achieve a similar behaviour with an NP, we could train it using a large number of draws from a GP prior. We could do it as follows:

  • Draw f ∼ GP(0, k_θ(⋅))
  • Draw x_i ∼ U(−3, 3)
  • Define y_i := f(x_i)
  • Divide pairs (xi,yi) randomly into context and target sets and perform an optimisation step
  • Repeat

The above procedure can be carried out with fixed kernel hyperparameters θ or a mixture of different values. The RBF (or squared exponential) kernel has two parameters: one controls the variance (essentially the range of function values) and the other the “wiggliness”. The latter is called the lengthscale parameter and its effect is illustrated here, by drawing functions from the GP prior with lengthscale values in {1, 2, 3}:
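For readers who want to generate such training functions themselves, here is a minimal NumPy sketch of drawing from a zero-mean GP prior with an RBF kernel. The function names, the jitter term added for numerical stability, the unit variance and the 20 inputs per draw are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(x1, x2, variance=1.0, lengthscale=1.0):
    """RBF (squared exponential) kernel matrix between two sets of 1-D inputs."""
    sq_dist = (x1[:, None] - x2[None, :])**2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def draw_gp_prior(x, lengthscale, rng, variance=1.0, jitter=1e-6):
    """One draw f(x) from GP(0, k) evaluated at the inputs x."""
    k = rbf_kernel(x, x, variance, lengthscale) + jitter * np.eye(len(x))
    chol = np.linalg.cholesky(k)           # k = chol @ chol.T
    return chol @ rng.normal(size=len(x))  # has covariance k

rng = np.random.default_rng(0)
for step in range(3):
    lengthscale = rng.choice([1.0, 2.0, 3.0])   # vary the lengthscale across draws
    x = rng.uniform(-3.0, 3.0, size=20)         # x_i ~ U(-3, 3)
    y = draw_gp_prior(x, lengthscale, rng)      # y_i = f(x_i)
    # Split (x, y) into context and target sets and take an optimisation step here.
```

A smaller lengthscale makes the kernel decay faster with distance, so nearby function values are less correlated and the draws look more wiggly.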

To cover a variety of functions in the NP training, we could specify a prior p(θ) from which to draw samples. In this toy experiment, I varied the lengthscale, as above, uniformly in {1, 2, 3}. As previously, choosing the latent z-space to be 2D, we can visualise what the NP has learned:

This looks pretty cool! The NP has learned the two-dimensional z space where we can smoothly interpolate between different functions.

Now let’s explore the predictions we get using this NP, and let’s see how its behaviour compares to a Gaussian Process posterior. Using an increasing number of context points {3,5,11}, let’s consider two functions:

First, using a relatively smooth function f(x)=sin⁡(0.5x), the predictions look as follows:

Second, let’s consider f(x)=sin⁡(1.5x):

In the first case, the NP predictions follow the observations quite closely, whereas in the second case, with three observations it looks good, but when given more points it hasn’t been able to capture the pattern. This effect is quite likely due to our architectural choices to use quite small NNs and a low-dimensional z. In order to improve the model behaviour so that it would resemble GPs more closely, we could consider applying the following changes:

  • Using only a 2D z-space might be quite restrictive in what we are able to learn, so we could consider using a higher-dimensional z, and similarly for r.
  • We could consider using a larger number of hidden units in NNs h and g, and consider making them deep.
  • Observing a larger number of function draws as well as a larger variety of functions (i.e. more variability in GP kernel hyperparameters) during the training phase could lead to better generalisation.

Such changes can indeed lead to more desirable results. For example, having increased dim(r) to 32 and dim(z) to 4 together with a larger number of hidden units in h and g, we observe much nicer behaviour:

Conclusions

Even though Neural Processes combine elements from both NNs and GP-like Bayesian models to capture distributions over functions, on this spectrum NPs lie closer to neural models. By making careful choices regarding the neural architectures as well as the training procedure for NPs, it is possible to achieve desirable model behaviour, e.g. GP-like predictive uncertainties. However, these effects are mostly implicit, which makes NPs more challenging to interpret as a prior.

Implementation

My implementation using TensorFlow in R (together with code for the experiments above) is available at github.com/kasparmartens/NeuralProcesses.

Acknowledgements

I would like to thank Hyunjik Kim for clarifying my understanding of the NP papers, Jin Xu for sharing his results on NPs (which stimulated me to think about more complex NNs), Tanel Pärnamaa and Leon Law for their comments, and Chris Yau for being an awesome supervisor.

 


DeepMind: Neural Processes

What about

More and more work has been proposed to overcome the limitations of deep learning. This work aims to:

  • improve the flexibility of a deep learning model at test time by borrowing the advantages of Gaussian Processes (GPs);
  • conversely, let deep learning learn a kernel function automatically from observed data, which a GP can then use directly.

Details

(refer to https://wemedia.ifeng.com/68012220/wemedia.shtml)

Function approximation is at the core of many problems in machine learning. DeepMind’s latest research combines the advantages of neural networks and stochastic processes and proposes the Neural Process model, which achieves good performance and high computational efficiency across a range of tasks.

Paper: https://arxiv.org/pdf/1807.01622.pdf

Function approximation is at the heart of many problems in machine learning. A very popular approach to this problem over the past decade has been deep neural networks. At a high level, neural networks are black-box function approximators that learn to parameterize a single function from a large number of training data points. As a result, most of the workload of the network falls on the training phase, while the evaluation and testing phases are reduced to fast forward passes. While high test-time performance is valuable for many practical applications, the fact that the output of the network cannot be updated after training may be undesirable. Meta-learning, for example, is an increasingly popular area of research that addresses exactly this limitation.

As an alternative to neural networks, one can also perform inference on a stochastic process to carry out function regression. The most common example of this approach is the Gaussian process (GP), a model whose properties are complementary to those of neural networks: a GP does not require an expensive training phase, and it can infer the underlying ground truth function conditioned on some observations. This makes GPs very flexible at test time.

In addition, a GP represents infinitely many different functions at unobserved locations, so it can capture the uncertainty of its predictions given some observations. However, GPs are computationally expensive: in their original formulation they scale cubically with the number of data points, and the current state-of-the-art approximations still scale quadratically. Furthermore, the available kernels are usually restricted in their functional form, and an additional optimization procedure is required to identify the most suitable kernel and its hyperparameters for any given task.

Combining neural networks with inference on stochastic processes is therefore attracting more and more attention as a potential way to offset some of the shortcomings of both approaches. In this work, DeepMind research scientist Marta Garnelo and colleagues propose a family of neural-network-based models that learn an approximation of a stochastic process, which they call Neural Processes (NPs). NPs share some fundamental properties of GPs: they learn to model distributions over functions, they can estimate the uncertainty of their predictions conditioned on context observations, and they shift some of the workload from training to test time, which makes the model more flexible.

More importantly, NPs generate predictions in a computationally efficient way. Given n context points and m target points, inference with a trained NP corresponds to a forward pass through a deep neural network, which scales as O(n + m), unlike the O((n + m)^3) runtime of a classic GP. In addition, the model overcomes many functional design restrictions by learning an implicit kernel directly from the data.

The main contributions of this research are:

  • We introduce Neural Processes, a class of models that combine the advantages of neural networks and stochastic processes.
  • We compare neural processes (NPs) with related work in meta-learning, deep latent variable models and Gaussian processes. Because NPs relate to many of these areas, they provide a bridge for comparison between many related topics.
  • We demonstrate the advantages and capabilities of NPs by applying them to a range of tasks, including 1-D regression, real-world image completion, Bayesian optimization and contextual bandits.

Neural process model

Figure 1: Neural process model

 

(a) Graphical model of a neural process. x and y correspond to the data, where y = f(x); C and T are the number of context points and target points respectively, and z is the global latent variable. A gray background indicates that the variable is observed.

(b) Schematic diagram of the implementation of a neural process. The variables in circles correspond to the variables of the graphical model in (a), the variables in boxes are the intermediate representations of the NP, and the bold letters denote the following computation modules: h (encoder), a (aggregator) and g (decoder). In our implementation, h and g are neural networks and a is the mean function. Solid lines indicate the generative process and dashed lines the inference process.

In our implementation of NPs, we accommodate two additional desiderata: invariance to the order of the context points, and computational efficiency.

The final model can be summarized as the following three core components (see Figure 1b):

  • An encoder h from the input space into the representation space, which takes in pairs of (x, y)_i context values and produces a representation r_i = h((x, y)_i) for each pair. We parameterize h as a neural network.
  • An aggregator a, which summarizes the encoded inputs.
  • A conditional decoder g, which takes as input the sampled global latent variable z and the new target locations x_T, and outputs the predictions ŷ_T for the corresponding values f(x_T) = y_T.

 

Figure 2: Graphical models of related models (a-c) and of the neural process (d). Gray shading indicates that the variable is observed. C stands for the context variables and T for the target variables, i.e. the variables to be predicted given C.

 

Results


Figure 4. Pixel-wise regression on MNIST and CelebA

The diagram on the left shows how pixel-wise image completion can be framed as a 2-D regression task, where f(pixel coordinates) = pixel brightness. The figures on the right show the image completion results for MNIST and CelebA. The images at the top correspond to the context points provided to the model. For clarity, the unobserved points are marked in blue for MNIST and in white for CelebA. Each row corresponds to a different sample given the context points. As the number of context points increases, the predicted pixels get closer to the underlying ones and the variance across samples decreases.
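Framing an image as a regression data set can be sketched as below. The helper name `image_to_regression_pairs`, the rescaling of coordinates to [0, 1] and the random array standing in for an MNIST digit are assumptions for illustration; the paper's own preprocessing may differ.

```python
import numpy as np

def image_to_regression_pairs(image):
    """Turn an (H, W) grayscale image into inputs (row, col) and outputs brightness."""
    height, width = image.shape
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    # Rescale pixel coordinates to [0, 1] so that inputs do not depend on image size.
    coords = np.column_stack([rows.ravel() / (height - 1),
                              cols.ravel() / (width - 1)])
    brightness = image.ravel()  # same row-major order as coords
    return coords, brightness

rng = np.random.default_rng(0)
image = rng.uniform(size=(28, 28))  # stand-in for a 28x28 MNIST digit
coords, brightness = image_to_regression_pairs(image)

# Observe a random subset of pixels as context; the rest are to be predicted.
context_idx = rng.choice(len(coords), size=100, replace=False)
x_context, y_context = coords[context_idx], brightness[context_idx]
```

Each pixel becomes one (x, y) pair with a 2-D input and a 1-D output, so completing the image amounts to predicting f at the unobserved coordinates.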

Figure 5. Thompson sampling of 1-D objective functions using neural processes

These figures show 5 iterations of the optimization process. Each prediction function (blue) is drawn by sampling a latent variable conditioned on an increasing number of context points (black). The underlying ground truth function is shown as a black dotted line. The red triangle marks the next evaluation point, which corresponds to the minimum of the sampled NP curve. The red circle in the following iteration corresponds to this evaluation point together with its underlying ground truth value, which serves as a new context point for the NP.
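The loop described in this caption can be sketched as follows. Since no trained NP is available here, the sketch uses a draw from a GP posterior as a stand-in for "sample a curve from the NP given the context"; that substitution, the function names and the test objective are all assumptions for illustration.

```python
import numpy as np

def rbf(x1, x2, lengthscale=1.0):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / lengthscale**2)

def sample_posterior_curve(xc, yc, x_grid, rng, jitter=1e-6):
    """Stand-in for a trained NP: one draw from a GP posterior given the context."""
    k_cc = rbf(xc, xc) + jitter * np.eye(len(xc))
    k_gc = rbf(x_grid, xc)
    mean = k_gc @ np.linalg.solve(k_cc, yc)
    cov = rbf(x_grid, x_grid) - k_gc @ np.linalg.solve(k_cc, k_gc.T)
    cov = 0.5 * (cov + cov.T) + jitter * np.eye(len(x_grid))  # keep it symmetric PSD
    return rng.multivariate_normal(mean, cov)

def thompson_minimise(objective, x_grid, n_iter, rng):
    """Thompson sampling: evaluate the objective at the minimum of each sampled curve."""
    xc = np.array([rng.choice(x_grid)])  # start from one random context point
    yc = objective(xc)
    for _ in range(n_iter):
        curve = sample_posterior_curve(xc, yc, x_grid, rng)
        x_next = x_grid[np.argmin(curve)]  # next evaluation point
        xc = np.append(xc, x_next)         # the evaluation becomes a new context point
        yc = np.append(yc, objective(np.array([x_next])))
    return xc, yc

def objective(x):
    return np.sin(3.0 * x) + 0.1 * x**2  # toy 1-D function to minimise

x_grid = np.linspace(-3.0, 3.0, 61)
xc, yc = thompson_minimise(objective, x_grid, n_iter=5, rng=np.random.default_rng(0))
print(xc[np.argmin(yc)], yc.min())
```

With an NP, `sample_posterior_curve` would instead sample z given the context and decode it over the grid, at a cost that grows linearly with the number of context and grid points.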

Table 1. Bayesian optimization using Thompson sampling

Average number of optimization steps needed to reach the global minimum of a 1-D function generated by a Gaussian process. The values are normalized by the number of steps taken by random search. The performance of a Gaussian process with the correct kernel constitutes an upper bound on performance.

Table 2. Results on the wheel bandit problem for increasing values of δ

The results show the mean and standard error of the cumulative and simple regret over 100 trials. The results are normalized with respect to the performance of a uniform agent.

Discussion

We introduced a family of models that combine the advantages of stochastic processes and neural networks, called neural processes. NPs learn to represent distributions over functions and make flexible predictions at test time conditioned on some context input. Instead of requiring a handwritten kernel, NPs learn an implicit measure directly from the data.

We apply NPs to a range of regression tasks to demonstrate their flexibility. The purpose of this paper is to introduce NPs and compare them to ongoing research. The tasks we present are therefore diverse but relatively low-dimensional. Scaling NPs up to higher dimensions is likely to highlight the benefits of their lower computational complexity and data-driven representations.

Limitations of this work

  • How can we ensure that NPs match the performance of pure deep learning models?
  • In which real application cases are we better off using NPs rather than deep learning?

Hopefully future work will explore these questions in more depth.