Abstract
Can a denoising model really be trained without any clean images, using only noisy ones? This is a paper from ICML 2018, published jointly by researchers from NVIDIA, Aalto University, and MIT. The article proposes a very interesting point: in some common cases, a network can learn to recover signals without ever "looking at" clean signals, with results close or equal to those of training on clean samples. This conclusion comes from a simple statistical observation: the loss functions we use in network training only require the targets to be "clean" in certain statistics (for example, in expectation), not that every individual target signal be clean.
Paper: https://arxiv.org/pdf/1803.04189.pdf
Github(not official): https://github.com/yu4u/noise2noise
Introduction
A traditional neural-network denoising method generally takes a noisy picture as input and a clean picture as output. The network is trained to fit the mapping between the two to achieve denoising. In the sample pair $(\hat{x}_i, y_i)$, $\hat{x}_i$ is the noisy input picture and $y_i$ is the clean picture that should be output; training then minimizes the empirical risk:
$\operatorname*{argmin}_\theta \sum_i L\big(f_\theta(\hat{x}_i),\, y_i\big)$  (1)
where $f_\theta$ is the mapping function parameterized by $\theta$, and $L$ is a loss function.
Considering that acquiring aligned pairs of noisy and clean pictures is relatively costly, Noise2Noise describes a method that uses only noisy pictures as training samples to achieve denoising. Since no clean picture is paired with each noisy picture, the neural network must implicitly relate the noisy pictures to clean pictures it never observes.
Technical Background
Suppose we have a set of unreliable room-temperature measurements $(y_1, y_2, \dots)$. A common strategy for estimating the true unknown temperature is to find a number $z$ with the smallest average deviation from the measurements according to some loss function $L$:
$\operatorname*{argmin}_z \mathbb{E}_y\{L(z, y)\}$  (2)
For the $L_2$ loss $L(z, y) = (z - y)^2$, this minimization finds the arithmetic mean of the observations:
$z = \mathbb{E}_y\{y\}$  (3)
For the $L_1$ loss $L(z, y) = |z - y|$, the optimum of the loss function is the median of the measurements:
$z = \operatorname{median}\{y\}$  (4)
For the $L_0$ loss $L(z, y) = |z - y|_0$, which is 0 when $z = y$ and 1 otherwise, the optimum of the loss function is approximately the mode of the measurements:
$z = \operatorname{mode}\{y\}$  (5)
From a statistical point of view, these general loss functions can be interpreted as the negative logarithm of the likelihood function, and the optimization process for these loss functions can be regarded as the maximum likelihood estimation.
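These point estimates can be checked numerically. The sketch below (an illustration of the statement above, not code from the paper; the measurement values are made up) minimizes the $L_2$ and $L_1$ losses over a grid of candidate values $z$ and compares the minimizers with the sample mean and median:

```python
import numpy as np

# Unreliable "temperature" measurements (made-up values).
y = np.array([19.0, 20.5, 21.0, 21.0, 25.0])

# Candidate estimates z on a fine grid.
z = np.linspace(15.0, 30.0, 3001)

# L2 loss: sum of squared deviations -> minimized by the mean.
l2 = ((z[:, None] - y[None, :]) ** 2).sum(axis=1)
z_l2 = z[np.argmin(l2)]

# L1 loss: sum of absolute deviations -> minimized by the median.
l1 = np.abs(z[:, None] - y[None, :]).sum(axis=1)
z_l1 = z[np.argmin(l1)]

print(z_l2, np.mean(y))    # both 21.3
print(z_l1, np.median(y))  # both 21.0
```

The $L_0$ minimizer (the mode) would additionally need a density estimate over the measurements, so it is omitted here.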
Training a neural-network regressor is a generalization of this point-estimation process. A typical training task observes input-target pairs $(x, y)$, where the network function $f_\theta(x)$ is parameterized by $\theta$:
$\operatorname*{argmin}_\theta \mathbb{E}_{(x,y)}\{L(f_\theta(x), y)\}$  (6)
If the joint expectation is decomposed into an expectation over inputs and an expectation over targets conditioned on each input (by the law of total expectation), the above objective can be equivalently rewritten as:
$\operatorname*{argmin}_\theta \mathbb{E}_x\{\mathbb{E}_{y|x}\{L(f_\theta(x), y)\}\}$  (7)
In fact, the mapping between inputs and outputs is often not 1:1 but one-to-many. For example, in a super-resolution problem, a low-resolution image $x$ has multiple plausible high-resolution images $y$ corresponding to it, so the distribution $p(y|x)$ is complicated. If the $L_2$ loss is used with the low-resolution picture $x$ as input and a high-resolution picture $y$ as target, the network output is the average of all possible high-resolution explanations. The local details of the picture therefore come out blurry, and the desired result is not achieved. Of course, a trained discriminator can instead be used as the loss.
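A tiny illustration of this averaging effect (hypothetical numbers, not from the paper): if a single input has two equally likely sharp explanations, pixel value 0 and pixel value 1, the $L_2$-optimal constant prediction is their mean, which matches neither:

```python
import numpy as np

# Hypothetical one-to-many mapping: one input whose sharp explanations
# are pixel value 0.0 or 1.0 with equal probability.
y = np.array([0.0, 1.0] * 500)

# The L2-optimal prediction is the conditional mean of the targets ...
pred = y.mean()

# ... which is 0.5: a "blurry" value that matches neither sharp explanation.
print(pred)
```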
In summary, a seemingly insignificant property of $L_2$ minimization is that the estimate remains unchanged if we replace the targets with random numbers whose expectations match the targets. Therefore, if the conditional target distribution $p(y|x)$ is replaced by an arbitrary distribution with the same conditional expectation, the optimal network parameters also remain unchanged. This means zero-mean noise can be added to the training targets without changing what the network learns. The network objective function then becomes:
$\operatorname*{argmin}_\theta \sum_i L\big(f_\theta(\hat{x}_i),\, \hat{y}_i\big)$  (8)
where both the inputs $\hat{x}_i$ and the targets $\hat{y}_i$ are drawn from noisy distributions and satisfy $\mathbb{E}\{\hat{y}_i \mid \hat{x}_i\} = y_i$.
When the training data is infinite, the solution of this objective is the same as that of the original objective. When the training data is finite, the mean-squared error of the estimate equals the average variance of the noise in the targets divided by the number of training samples:
$\frac{1}{N}\left[\frac{1}{N}\sum_i \operatorname{Var}(y_i)\right]$  (9)
Therefore, as the number of samples increases, the error approaches zero. Even with a finite amount of data, the estimate is unbiased, because it is correct in expectation.
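This unbiasedness and the $1/N$ error decay can be simulated directly. The sketch below is a scalar stand-in for image regression (the clean value, noise level, and sample counts are assumed for illustration): for the $L_2$ loss, the optimal estimate from $N$ noisy targets is their sample mean, whose bias is near zero and whose mean-squared error is close to $\sigma^2/N$:

```python
import numpy as np

rng = np.random.default_rng(0)

clean = 5.0    # the unobserved clean target
sigma = 2.0    # std of the zero-mean corruption
N = 100        # noisy targets per trial
trials = 20000

# For the L2 loss, the minimizer over noisy targets is their sample mean.
targets = clean + sigma * rng.standard_normal((trials, N))
estimates = targets.mean(axis=1)

bias = estimates.mean() - clean          # close to 0: unbiased in expectation
mse = ((estimates - clean) ** 2).mean()  # close to sigma**2 / N
print(bias, mse)
```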
In many image-restoration tasks, the expectation of the contaminated data is exactly the clean target we want to recover. So as long as we observe each scene in contaminated form twice (using one noisy observation as the input and the other as the target), we can train the network without ever obtaining a clean target.
The $L_1$ loss recovers the median of the targets, which means a network can be trained to repair images with significant anomalous content (up to 50% of pixels) while still needing only pairs of contaminated images.
For example, for low-light photography, a long-exposure, noise-free picture is exactly the average of many individual short-exposure, noisy pictures. The above findings show that two noisy pictures of the same content suffice as a training sample to achieve the same denoising performance as before, at a much lower cost than obtaining clean pictures.
Experiments and Results
Additive Gaussian noise
Additive white Gaussian noise is zero-mean, so the article trains the network with the $L_2$ loss. Images from an open-source image library are used, and noise with magnitude σ drawn randomly from [0, 50] is added to each image. The network must estimate the noise magnitude itself during denoising, so the whole process is blind denoising.
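A minimal sketch of how such training pairs might be generated (the function name, image size, and 8-bit pixel range are assumptions, not from the paper): the same clean image is corrupted twice with independent Gaussian noise whose magnitude is drawn per image, so the network never sees the noise level:

```python
import numpy as np

rng = np.random.default_rng(42)

def noisy_pair(clean_img, max_sigma=50.0, rng=rng):
    """Corrupt the same clean image twice with independent AWGN.

    sigma is drawn per image from [0, max_sigma] (pixel values in [0, 255]),
    so the noise level is unknown to the network: blind denoising.
    """
    sigma = rng.uniform(0.0, max_sigma)
    inp = clean_img + sigma * rng.standard_normal(clean_img.shape)
    tgt = clean_img + sigma * rng.standard_normal(clean_img.shape)
    return inp, tgt

clean = rng.uniform(0, 255, size=(64, 64))
inp, tgt = noisy_pair(clean)
# The two corruptions differ, but both are unbiased around `clean`.
```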
Other Synthetic Noise
Poisson noise
Poisson noise, like Gaussian noise, is zero-mean, but it is harder to remove because it is signal-dependent. The article uses the L2 loss and varies the noise magnitude λ∈[0,50] during training.
It should be noted that saturated (clipped) regions of the image do not satisfy the zero-mean assumption: part of the noise distribution is clipped away there, so the expectation of the remaining part is no longer the clean value, and good results cannot be obtained in these regions.
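One common way to synthesize such signal-dependent noise, assumed here for illustration (the paper's exact data pipeline may differ), is to draw each pixel from a scaled Poisson distribution whose mean is the clean value, so the noise is zero-mean but stronger on bright pixels:

```python
import numpy as np

rng = np.random.default_rng(1)

def poisson_corrupt(clean01, lam, rng=rng):
    """Signal-dependent Poisson corruption of an image in [0, 1].

    Each pixel y is replaced by Poisson(lam * y) / lam, which has mean y
    (zero-mean noise) but variance y / lam: brighter pixels are noisier.
    """
    return rng.poisson(lam * clean01) / lam

clean = rng.uniform(0.0, 1.0, size=(128, 128))
noisy = poisson_corrupt(clean, lam=30.0)
```

Smaller λ means heavier corruption; λ near 0 degenerates, so in practice the lower end of the range would be handled carefully.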
Multiplicative Bernoulli noise
This is equivalent to randomly sampling the image: unsampled pixels have value 0. The probability that a pixel is corrupted is denoted p; during training the article varies p ∈ [0.0, 0.95], and at test time p = 0.5. The result is that using contaminated targets scores slightly higher than using clean targets, which may be because the contaminated targets effectively apply the dropout technique at the network output.
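A minimal sketch of this corruption (the function name and image size are assumptions): each pixel is zeroed independently with probability p, and the survival mask is kept so missing pixels could be excluded from the loss:

```python
import numpy as np

rng = np.random.default_rng(7)

def bernoulli_corrupt(img, p, rng=rng):
    """Multiplicative Bernoulli noise: each pixel is zeroed with probability p."""
    mask = rng.random(img.shape) >= p   # True where the pixel survives
    return img * mask, mask

img = rng.uniform(0.0, 1.0, size=(100, 100))
noisy, mask = bernoulli_corrupt(img, p=0.5)
# Roughly half the pixels are zeroed; surviving pixels are unchanged.
```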
Text removal
The network is trained on independently contaminated input-target pairs; the corruption probability p is drawn from [0.0, 0.5] during training, and p = 0.25 at test time. The L1 loss is used as the loss function in training to reject outliers.
Random value impulse noise
That is, each pixel is, with probability p, replaced by a uniform random value in [0, 1]. In this case neither the mean nor the median produces the correct result; the ideal output is the mode of the pixel-value distribution. To approximate the mode, the article uses an annealed version of the "L0 loss", defined as $(|f_\theta(\hat{x}) - \hat{y}| + \epsilon)^\gamma$, where $\epsilon = 10^{-8}$ and γ declines linearly from 2 to 0 during training. The probability that input and target pixels are contaminated is drawn from [0, 0.95] during training.
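The annealed loss described above can be sketched as follows (function and variable names are assumptions; the schedule follows the stated linear decline of γ from 2 to 0, so the loss starts out L2-like and gradually approaches a mode-seeking L0-like loss):

```python
import numpy as np

def annealed_l0_loss(pred, target, step, total_steps, eps=1e-8):
    """Annealed "L0" loss: mean of (|pred - target| + eps) ** gamma,
    with gamma annealed linearly from 2 (L2-like) to 0 over training."""
    gamma = 2.0 * (1.0 - step / total_steps)
    return np.mean((np.abs(pred - target) + eps) ** gamma)

pred = np.array([0.2, 0.8, 0.5])
target = np.array([0.0, 1.0, 0.5])

# At step 0, gamma = 2 and the loss coincides with the mean squared error
# (up to the tiny eps offset).
l0_start = annealed_l0_loss(pred, target, step=0, total_steps=100)
mse = np.mean((pred - target) ** 2)
print(l0_start, mse)
```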
In addition, the article also tested Monte Carlo rendering and MRI, and all achieved good results.
The significance of this article is that clean training data is often difficult to obtain in the real world, and this article provides a new way to address that problem. The article also notes that there is no free lunch: the method cannot learn features that are absent from the input data, but the same limitation applies to training with clean targets.
Resource: https://blog.csdn.net/QiangLi_strong/article/details/81541041