
Noise2Noise: Learning Image Restoration without Clean Data

Abstract

Can a denoising model be trained without any clean images, using only noisy ones? This paper from ICML 2018, published jointly by researchers from NVIDIA, Aalto University, and MIT, makes a very interesting point: in some common cases, a network can learn to recover signals without ever “looking” at “clean” signals, and the results are close to or on par with training on “clean” samples. This conclusion comes from a simple statistical observation: the loss functions used in network training only require the targets to be “clean” in some statistical sense (for example, in expectation), not that every individual target signal is “clean”.

Paper: https://arxiv.org/pdf/1803.04189.pdf

GitHub (unofficial): https://github.com/yu4u/noise2noise

Introduction

A traditional neural-network denoiser takes a noisy picture as input and a clean picture as the target output, and the network is trained to fit the mapping between the two. In each sample pair (\hat{x}_{i}, y_{i}), \hat{x}_{i} is the noisy input picture and y_{i} is the clean picture that should be output. Training minimizes the empirical risk:

(1)   \begin{equation*}arg\underset{\theta}{min}\underset{i}{\sum} L(f_{\theta}\left ( \hat{x}_{i} \right ),y_{i})\end{equation*}

where f_{\theta} is the mapping function with parameters \theta, and L is a loss function.

Because acquiring matched pairs of noisy and clean pictures is relatively costly, Noise2Noise describes a method that uses only noisy pictures as training samples to achieve denoising. Although no clean picture is paired with any noisy picture, the network still learns the mapping from the noisy pictures to the clean pictures it never observes.

Technical Background

Suppose we have a set of unreliable room temperature measurements (y_{1}, y_{2}, …). A common strategy for estimating the true unknown temperature is to find the number z that has the smallest average deviation from the measurements according to some loss function L:

(2)   \begin{equation*} arg\underset{z}{min} \mathop{\mathbb{E}_{y}}\left \{ L(z,y) \right \} \end{equation*}

For the L_{2} loss L(z,y)=(z-y)^2, the minimum is found at the arithmetic mean of the observations:

(3)   \begin{equation*} z= \mathop{\mathbb{E}_{y}}\left \{ y \right \} \end{equation*}

For the L_{1} loss L(z,y)=|z - y|, the optimum is the median of the measurements:

(4)   \begin{equation*} z=\mathrm{median}\left \{ y \right \} \end{equation*}

For the L_{0} loss L(z,y)= \left | z-y \right |_{0}, the optimum is approximately the mode of the measurements:

(5)   \begin{equation*} z=\mathrm{mode}\left \{ y \right \} \end{equation*}

From a statistical point of view, these general loss functions can be interpreted as the negative logarithm of the likelihood function, and the optimization process for these loss functions can be regarded as the maximum likelihood estimation.
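The three point estimates above can be checked numerically. The following is a minimal sketch: the measurement values and the search grid are made up for illustration, and the L_{0} loss is approximated by counting the measurements that a candidate z does not match.

```python
import numpy as np

# Hypothetical noisy temperature readings (degrees Celsius); 21.0 occurs most often.
y = np.array([20.0, 21.0, 21.0, 21.0, 22.0, 23.0, 30.0])

# Candidate estimates z on a fine grid that covers the measurements (step 0.01).
z = np.linspace(15.0, 35.0, 2001)
diff = z[:, None] - y[None, :]

l2 = np.mean(diff ** 2, axis=1)            # mean squared deviation
l1 = np.mean(np.abs(diff), axis=1)         # mean absolute deviation
l0 = np.mean(np.abs(diff) > 1e-9, axis=1)  # fraction of measurements not matched

z_l2 = z[np.argmin(l2)]
z_l1 = z[np.argmin(l1)]
z_l0 = z[np.argmin(l0)]

print("L2 minimizer:", z_l2, "| mean:  ", y.mean())
print("L1 minimizer:", z_l1, "| median:", np.median(y))
print("L0 minimizer:", z_l0, "| mode:   21.0")
```

The single outlier (30.0) pulls the L_{2} estimate upward, while the L_{1} and L_{0} estimates stay at the bulk of the measurements.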

Training a neural network regressor is a generalization of this point estimation procedure. Consider the typical training task for a set of input-target pairs (x_{i}, y_{i}), where the network function f_{\theta}(x) is parameterized by \theta:

(6)   \begin{equation*} arg\underset{\theta}{min}\mathop{\mathbb{E}_{(x,y)}} \left \{ L(f_{\theta}(x),y) \right \} \end{equation*}

The full training task decomposes into the same minimization problem at every training sample. Conditioning on the input, i.e. writing p(x,y)=p(x)p(y|x), shows that the objective above is equivalent to

(7)   \begin{equation*}arg\underset{\theta}{min} \mathop{\mathbb{E}_x} \left \{ \mathop{\mathbb{E}_{(y|x)}} \left \{ L(f_{\theta}(x),y) \right \} \right \}\end{equation*}

In practice, the relation between inputs and targets is often not a 1:1 mapping but a multi-valued one. For example, in a super-resolution problem a low-resolution picture x corresponds to many possible high-resolution pictures y, so the distribution p(y|x) is highly complex. When the network is trained with the L_{2} loss on pairs of low-resolution inputs x and high-resolution targets y, the output of f_{\theta} is the average of all plausible high-resolution pictures. Local details therefore come out blurry and the result falls short of what is desired. One known way to counter this is to use a trained discriminator as the loss.

A seemingly trivial property of L_{2} minimization is that the estimate remains unchanged if we replace the targets with random numbers whose expectations match the targets. Therefore, if the input-conditioned target distribution p(y|x) is replaced by an arbitrary distribution with the same conditional expectation, the optimal network parameters \theta also remain unchanged. This means that zero-mean noise can be added to the training targets without changing what the network learns. The network objective then becomes

(8)   \begin{equation*}arg\underset{\theta}{min} \underset{i}{\sum} L(f_{\theta}(\hat{x}_{i}),\hat{y}_{i})\end{equation*}

where both the inputs and the targets are drawn from a corrupted distribution and satisfy \mathop{\mathbb{E}}\left \{ \hat{y}_{i}| \hat{x}_{i}\right \}={y}_{i}.
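The claim that zero-mean target noise leaves the L_{2} solution unchanged can be illustrated with a toy regression. This is only a sketch under simplifying assumptions: the "network" is a two-parameter linear model, the data are synthetic, and only the targets are corrupted so that the effect is isolated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: inputs x, clean targets y = 2x + 1.
n = 100_000
x = rng.uniform(-1.0, 1.0, size=n)
y_clean = 2.0 * x + 1.0

# Corrupted targets: zero-mean Gaussian noise, so E[y_noisy | x] = y_clean.
y_noisy = y_clean + rng.normal(0.0, 1.0, size=n)

# Linear model f_theta(x) = a*x + b, fitted with the L2 loss (least squares).
A = np.stack([x, np.ones(n)], axis=1)
theta_clean, *_ = np.linalg.lstsq(A, y_clean, rcond=None)
theta_noisy, *_ = np.linalg.lstsq(A, y_noisy, rcond=None)

print("fit on clean targets:", theta_clean)
print("fit on noisy targets:", theta_noisy)
```

Even though each noisy target is far from its clean value (the noise standard deviation is 1), the two fits should agree closely, because the noise averages out across the many samples.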

With infinite training data, the solution of this objective is the same as that of the original objective. With finite training data, the expected squared error of the estimate equals the average variance of the noise in the targets divided by the number of training examples, namely:

(9)   \begin{equation*}\mathop{\mathbb{E}_{\hat{y}}} \left [ \frac{1}{N}\underset{i}{\sum} y_{i}-\frac{1}{N}\underset{i}{\sum }\hat{y}_{i}\right ]^2=\frac{1}{N}\left [ \frac{1}{N}\underset{i}{\sum }Var(\hat{y}_{i}) \right ]\end{equation*}

Therefore, as the number of samples increases, the error approaches zero. Even with a limited amount of data, the estimate is unbiased, because it is correct in expectation.
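Equation (9) can be checked with a small Monte Carlo experiment. The sketch below assumes independent Gaussian noise with a different, randomly chosen standard deviation per target; the clean targets, N, and the number of trials are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 50                                   # number of training targets
y = rng.uniform(0.0, 1.0, size=N)        # hypothetical clean targets y_i
sigma = rng.uniform(0.1, 0.5, size=N)    # per-target noise standard deviation

# Repeat the experiment many times: draw noisy targets, compare the two means.
trials = 100_000
y_hat = y + sigma * rng.standard_normal((trials, N))
sq_err = (y.mean() - y_hat.mean(axis=1)) ** 2

empirical = sq_err.mean()                # left-hand side of (9), estimated
predicted = np.mean(sigma ** 2) / N      # right-hand side of (9)

print("empirical squared error:", empirical)
print("predicted by (9):       ", predicted)
```

Doubling N in this sketch should roughly halve both numbers, which is the 1/N behaviour the equation describes.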

In many image restoration tasks, the expectation of the corrupted input data is the “clean” target we want to recover. So as long as we can observe each corrupted image twice, using one observation as the input and the other as the target, we can train the network without ever obtaining a “clean” target.
The L_{1} loss recovers the median of the targets, which means the network can be trained to repair images with significant outlier content (up to 50%), again requiring only pairs of corrupted images.

Low-light photography is an example where the expectation of the noisy data is exactly what we want to recover: a long-exposure, noise-free picture is the average of individual short-exposure, noisy pictures. The above findings show that two noisy pictures of the same content are enough as a training sample to achieve the same denoising as before, and such pairs are much cheaper to obtain than clean pictures.

Experiments and Results

Additive Gaussian noise

Additive white Gaussian noise is zero-mean, so the article trains the network with the L_{2} loss. The article takes images from an open-source image dataset and adds noise with a randomly chosen standard deviation σ∈[0,50] to each image. The network therefore has to estimate the noise magnitude while removing it, i.e. the whole process is blind denoising.

Other Synthetic Noise

Poisson noise

Poisson noise, like Gaussian noise, is zero-mean, but it is harder to remove because it is signal-dependent. The article uses the L_{2} loss and varies the noise magnitude λ∈[0,50] during training.
Note that saturated (clipped) image regions do not satisfy the zero-mean assumption: part of the noise distribution is cut off there, so the expectation of the remaining part is no longer zero, and no good results can be obtained in these regions.
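The effect of saturation on the zero-mean assumption is easy to reproduce. The sketch below is an illustration only: it uses additive Gaussian noise rather than Poisson noise to keep the example short, and the pixel values, σ, and the 8-bit range [0, 255] are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_observations(clean_value, sigma=25.0, n=200_000):
    """Add zero-mean Gaussian noise, then clip to the 8-bit range [0, 255]."""
    noisy = clean_value + rng.normal(0.0, sigma, size=n)
    return np.clip(noisy, 0.0, 255.0)

mid_gray = noisy_observations(128.0)     # far away from saturation
near_white = noisy_observations(250.0)   # close to the upper limit

print("clean 128 -> mean of noisy observations:", mid_gray.mean())
print("clean 250 -> mean of noisy observations:", near_white.mean())
```

For the mid-gray pixel the mean of the observations stays at the clean value, whereas for the nearly saturated pixel the clipped upper tail drags the mean noticeably below 250, so an L_{2}-trained network would be pulled toward a biased value there.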

Multiplicative Bernoulli noise

This noise is equivalent to randomly sampling the image: pixels that are not sampled are set to 0. The probability that a pixel is corrupted is denoted by p. During training p varies in [0.0, 0.95], and during testing p = 0.5. Training with corrupted targets scores slightly higher than training with “clean” targets, which may be because corrupted targets effectively apply a dropout technique at the network output.

Text removal

The network is trained on independently corrupted input and target pairs. The probability p of corrupted pixels is drawn from [0.0, 0.5] during training, and p = 0.25 during testing. The L_{1} loss is used in training because its median-seeking behaviour discards the overlaid text as outliers.
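Why the L_{1} loss suits this task can be seen on a single pixel. In this minimal sketch the clean intensity, the text intensity, and the number of observations are hypothetical; only p = 0.25 is taken from the test setting above.

```python
import numpy as np

rng = np.random.default_rng(0)

clean_value = 0.3   # hypothetical clean pixel intensity
text_value = 1.0    # hypothetical intensity of the overlaid text
p = 0.25            # probability that the pixel is covered by text

# Many independent corrupted observations of the same pixel.
n = 10_001
covered = rng.random(n) < p
observations = np.where(covered, text_value, clean_value)

print("mean  :", observations.mean())      # L2 optimum, pulled toward the text
print("median:", np.median(observations))  # L1 optimum, stays at the clean value
```

As long as fewer than half of the observations are covered, the median is exactly the clean value, while the mean is a blend of background and text.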

Random-valued impulse noise

Each pixel is replaced, with probability p, by a random value drawn from [0, 1]. In this case, both the mean and the median produce poor results, and the ideal output is the mode of the pixel value distribution. To approximate the mode, the article uses an annealed version of the “L0 loss” function, defined as \left (| f_{\theta}(\hat{x})-\hat{y}|+\varepsilon \right )^{\gamma }, where ε=10^{-8} and γ is annealed linearly from 2 to 0 during training. The probability that a pixel of the input or target image is corrupted is drawn from [0, 0.95] during training.
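The annealed loss can be written down in a few lines. This is a sketch, not the authors' code: the `progress` parameter (fraction of training completed) and the averaging over pixels are assumptions made for illustration.

```python
import numpy as np

def annealed_l0_loss(prediction, target, progress, eps=1e-8):
    """Mean of (|prediction - target| + eps) ** gamma over all pixels.

    `progress` is the fraction of training completed, in [0, 1];
    gamma decreases linearly from 2 (start) to 0 (end).
    """
    gamma = 2.0 * (1.0 - progress)
    return np.mean((np.abs(prediction - target) + eps) ** gamma)

# Hypothetical pixel values: two exact matches and one mismatch.
prediction = np.array([0.2, 0.5, 0.9])
target = np.array([0.2, 0.1, 0.9])

for progress in (0.0, 0.5, 0.99):
    print(progress, annealed_l0_loss(prediction, target, progress))
```

At the start (γ = 2) this is the ordinary squared error. As γ shrinks, the size of a mismatch matters less and less compared with whether there is a mismatch at all, which is what pushes the optimum toward the mode.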

In addition, the article tested the method on Monte Carlo rendering and MRI and reports good results for both.
The significance of this article is that clean training data is often difficult to obtain in the real world, and the article provides a new way around this problem. The article also notes that there is no free lunch: the method cannot learn to recover features that are not present in the input data, but this applies equally to training with clean targets.

Resource : https://blog.csdn.net/QiangLi_strong/article/details/81541041

Download :  noise2noise_slice_for_Journal_Club

[Reference] Balancing Bias and Variance to Control Errors in Machine Learning

In the world of Machine Learning, accuracy is everything. You strive to make your model more accurate by tuning and tweaking the parameters, but are never able to make it 100% accurate. That’s the hard truth about your prediction/classification models: they can never be error-free. In this article I’ll discuss why this happens and which other forms of error can be reduced.

Suppose we are observing a response variable Y (qualitative or quantitative) and an input variable X with p features or columns (X1, X2, …, Xp), and we assume there is a relation between them. This relation can be expressed as

Y = f(X) + e

Here f is some fixed but unknown function of X1,…,Xp, and e is a random error term, which is independent of X and has mean zero. In this formulation, f represents the systematic information that X provides about Y. Estimation of this relation or f(X) is known as statistical learning.

In general, we won’t be able to make a perfect estimate of f(X), and this inaccuracy gives rise to an error known as reducible error. The accuracy of the model can be improved by making a more accurate estimate of f(X), thereby reducing the reducible error. But even if we estimated f(X) perfectly, our model would not be error-free. The error that remains is known as irreducible error (e in the above equation).

In other terms, the irreducible error can be seen as information that X cannot provide about Y. The quantity e may contain unmeasured variables that are useful in predicting Y : since we don’t measure them, f cannot use them for its prediction. The quantity e may also contain unmeasurable variation. For example, the risk of an adverse reaction might vary for a given patient on a given day, depending on manufacturing variation in the drug itself or the patient’s general feeling of well-being on that day.

Such cases are present in every problem, and the error they introduce is not reducible because they are generally not captured in the training data. There is nothing we can do about it. What we can do is reduce the other forms of error to get a near-perfect estimate of f(X). But first let’s take a look at other important concepts in machine learning, which you need to understand in order to proceed.

Model Complexity

The complexity of a relation, f(X), between input and response variables, is an important factor to consider while learning from a dataset. A simple relation is easy to interpret. For example a linear model would look like this

Y ≈ β0 + β1X1 + β2X2 + …+ βpXp

It is easy to infer information from this relation, and it clearly shows how a particular feature impacts the response variable. Such models are called restrictive models because they can take only one particular form, linear in this case. But a relation may be more complex than this: it may be quadratic, circular, etc. Models of this kind are more flexible, as they fit data points more closely and can take different forms. Generally such methods result in higher accuracy. But this flexibility comes at the cost of interpretability, as a complex relation is harder to interpret.

Choosing a flexible model does not always guarantee high accuracy. This happens when the flexible statistical learning procedure works too hard to find patterns in the training data and picks up patterns that are caused by random chance rather than by true properties of the unknown function f. This distorts our estimate of f(X), leading to a less accurate model. This phenomenon is known as overfitting.

When inference is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods. In some settings, however, we are only interested in prediction, and the interpretability of the predictive model is simply not of interest. This is when we use more flexible methods.

Quality of fit

To quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation, the most commonly used measure in the regression setting is the mean squared error (MSE):

MSE = (1/n) Σ (Yi − Ŷi)², summed over i = 1, …, n

(Formula taken from Wikipedia.) Here Yi is the observed value, Ŷi is the predicted value, and n is the number of observations.

As the name suggests, it is the mean of the squared errors, i.e. of the squared differences between predictions and observed values over all inputs. It is known as training MSE if calculated on training data, and test MSE if calculated on test data.

The expected test MSE at a given value x0 can always be decomposed into the sum of three fundamental quantities: the variance of the estimate of f(x0), the squared bias of that estimate, and the variance of the error term e. Here e is the irreducible error discussed earlier. So, let’s look more closely at bias and variance.
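The decomposition can be checked by simulation. The sketch below rests on several assumptions chosen purely for illustration: a sine function as the true f, Gaussian errors, a cubic polynomial as the model, and a fixed set of training inputs with freshly drawn responses for each training set.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Hypothetical true relation between X and Y."""
    return np.sin(np.pi * x)

sigma = 0.3                  # standard deviation of the irreducible error e
x0 = 0.3                     # test point
degree = 3                   # flexibility of the polynomial model
n_train, trials = 30, 20_000

x_train = np.linspace(-1.0, 1.0, n_train)
powers = x0 ** np.arange(degree, -1, -1)   # evaluates a polynomial at x0

# One column of Y_train per independently drawn training set.
Y_train = f(x_train)[:, None] + sigma * rng.standard_normal((n_train, trials))
coefs = np.polyfit(x_train, Y_train, degree)   # shape (degree + 1, trials)
preds = powers @ coefs                         # estimate at x0 for each training set

# Independent test responses at x0.
y0 = f(x0) + sigma * rng.standard_normal(trials)

test_mse = np.mean((y0 - preds) ** 2)
variance = preds.var()
bias_sq = (preds.mean() - f(x0)) ** 2
noise = sigma ** 2

print("expected test MSE (simulated):", test_mse)
print("variance + bias^2 + Var(e):   ", variance + bias_sq + noise)
```

The two printed numbers should agree up to simulation noise. Changing `degree` shifts weight between the bias and variance terms, while the Var(e) term stays fixed.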

Bias

Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. So, if the true relation is complex and you try to use linear regression, then it will undoubtedly result in some bias in the estimation of f(X). No matter how many observations you have, it is impossible to produce an accurate prediction if you are using a restrictive/ simple algorithm, when the true relation is highly complex.

Variance

Variance refers to the amount by which our estimate of f(X) would change if we estimated it using a different training data set. Since the training data is used to fit the statistical learning method, different training data sets will result in different estimates. Ideally the estimate of f(X) should not vary too much between training sets. However, if a method has high variance, then small changes in the training data can result in large changes in the estimate of f(X).

General Rule

When a statistical method tries to match the data points too closely, any change in the dataset produces a noticeably different estimate, even though each estimate fits its own training data very accurately. A general rule is that, as a statistical method tries to match data points more closely, or when a more flexible method is used, the bias decreases but the variance increases.

Credit : An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani

In the above image, the left-hand side shows a graph of 3 different statistical methods in the regression setting. The yellow one is linear, the blue one is slightly non-linear, and the green one is highly non-linear/flexible, as it matches the data points too closely. On the right-hand side you can see a graph of MSE versus flexibility for these three methods. The red curve represents the test MSE and the grey curve the training MSE. A method with the lowest training MSE will not necessarily have the lowest test MSE, because some methods specifically estimate coefficients so as to minimize the training MSE, which does not guarantee a low test MSE. This problem can be chalked up to overfitting. As seen in the graph, the green curve (most flexible) has the lowest training MSE but not the lowest test MSE. Let’s go a little deeper into this problem.
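The gap between training and test MSE can be reproduced on synthetic data. The sketch below is not the data behind the figure: the true function, noise level, sample sizes, and polynomial degrees are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Hypothetical true relation."""
    return np.sin(np.pi * x)

def sample(n, sigma=0.3):
    """Draw n observations Y = f(X) + e with X uniform on [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, f(x) + sigma * rng.standard_normal(n)

def mse(y, y_pred):
    return np.mean((y - y_pred) ** 2)

x_train, y_train = sample(20)
x_test, y_test = sample(2000)

results = {}
for degree in (1, 3, 12):   # increasing flexibility
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = mse(y_train, np.polyval(coefs, x_train))
    test_mse = mse(y_test, np.polyval(coefs, x_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The training MSE can only go down as the degree grows, because each higher-degree polynomial contains the lower-degree ones as special cases. The test MSE typically does not follow it: the most flexible fit tends to do much worse on new data than on the data it was fitted to.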

Credit : ISLR by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani

This is a graph showing test MSE (red curve), bias (green curve) and variance (yellow curve) with respect to the flexibility of the chosen method, for a particular dataset. The location of the lowest MSE reveals something interesting about the two error forms. Up to that point, as flexibility increases, bias decreases more rapidly than variance increases. Beyond it, there is no further decrease in bias, but variance starts increasing rapidly due to overfitting.

Bias-Variance Trade off

Credit : An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani

In the above figure, imagine that the center of the target is a model that perfectly predicts the correct values. As we move away from the bulls-eye, our predictions get worse and worse. Imagine we can repeat our entire model building process to get a number of separate hits on the target, such that each blue dot represents different realizations of our model based on different data sets for same problem. It displays four different cases representing combinations of both high and low bias and variance. High bias is when all dots are far from bulls eye and high variance is when all dots are scattered. This illustration combined with previous explanation makes the difference between bias and variance pretty clear.

As described earlier, in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias. There is always a trade-off between these values because it is easy to obtain a method with extremely low bias but high variance (for instance, by drawing a curve that passes through every single training observation) or a method with very low variance but high bias (by fitting a horizontal line to the data). The challenge lies in finding a method for which both the variance and the squared bias are low.
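The two extremes mentioned above, a horizontal line and a curve through every training observation, can be compared in a small simulation. This is a sketch under assumed settings: a sine function as the true f, eight fixed training inputs, Gaussian errors, and a degree-7 polynomial as the interpolating curve.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Hypothetical true relation."""
    return np.sin(np.pi * x)

sigma, x0 = 0.3, 0.3
x_train = np.linspace(-1.0, 1.0, 8)
trials = 5000

# One column per training set: same inputs, freshly drawn errors.
Y_train = f(x_train)[:, None] + sigma * rng.standard_normal((x_train.size, trials))

def bias_sq_and_variance(degree):
    """Squared bias and variance of the degree-`degree` polynomial fit at x0."""
    coefs = np.polyfit(x_train, Y_train, degree)
    preds = (x0 ** np.arange(degree, -1, -1)) @ coefs
    return (preds.mean() - f(x0)) ** 2, preds.var()

flat = bias_sq_and_variance(0)                           # horizontal line
interpolating = bias_sq_and_variance(x_train.size - 1)   # passes through every point

print("horizontal line : bias^2 = %.4f, variance = %.4f" % flat)
print("interpolation   : bias^2 = %.4f, variance = %.4f" % interpolating)
```

The horizontal line barely changes between training sets but is systematically wrong at x0, while the interpolating curve is right on average but swings with every new set of errors.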

Mastering the trade-off between bias and variance is necessary to become a machine learning champion.

This concept should be kept in mind while solving machine learning problems, as it helps in improving model accuracy. It also helps you decide quickly which statistical model is best suited to a given situation.

 

From :  https://towardsdatascience.com/balancing-bias-and-variance-to-control-errors-in-machine-learning-16ced95724db