This paper observes that a major flaw in common image-classification networks is their lack of robustness to common corruptions and perturbations. The authors develop and publish two variants of the ImageNet validation dataset, one for corruptions and one for perturbations. They then propose metrics for evaluating several common networks on their new datasets and find that robustness has not improved much from AlexNet to ResNet. They do, however, find several ways to improve performance including using larger networks, using ResNeXt, and using adversarial logit pairing.
Quality: The datasets and metrics are very thoroughly treated, and are the key contribution of the paper.
Some questions: What happens if you combine ResNeXt with ALP or histogram equalization? Or any other combinations? Is ALP equally beneficial across all networks? Are there other useful adversarial defenses?
Clarity: The novel validation sets and reasoning for them are well-explained, as are the evaluation metrics. Some explanation of adversarial logit pairing would be welcome, and some intuition (or speculation) as to why it is so effective at improving robustness.
Originality: Although adversarial robustness is a relatively popular subject, I am not aware of any other work presenting datasets of corrupted/perturbed images.
Significance: The paper highlights a significant weakness in many image-classification networks, provides a benchmark, and identifies ways to improve robustness. It would be improved by more thorough testing, but that is less important than the dataset, metrics and basic benchmarking provided.
Question: Why do the authors not recommend training on the new datasets?
A Siamese neural network is a class of neural network architectures that contain two or more identical subnetworks. Identical here means they have the same configuration with the same parameters and weights. Parameter updating is mirrored across both subnetworks.
Siamese NNs are popular for tasks that involve finding similarity or a relationship between two comparable things. Some examples are paraphrase scoring, where the inputs are two sentences and the output is a score of how similar they are; or signature verification, where the task is to figure out whether two signatures are from the same person. Generally, in such tasks, two identical subnetworks are used to process the two inputs, and another module will take their outputs and produce the final output. The picture below is from Bromley et al. (1993)[1]. They proposed a Siamese architecture for the signature verification task.
Siamese architectures are good at these tasks because:
Sharing weights across subnetworks means fewer parameters to train, which in turn means less data required and less tendency to overfit.
Each subnetwork essentially produces a representation of its input (the “Signature Feature Vector” in the picture). If your inputs are of the same kind, like matching two sentences or matching two pictures, it makes sense to use a similar model to process similar inputs. This way you have representation vectors with the same semantics, making them easier to compare.
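To make the weight-sharing point concrete, here is a minimal sketch in plain Python. It is not the architecture from Bromley et al.; the single tanh layer, the Euclidean comparison, and the names make_params, encode and siamese_distance are all assumptions made up for illustration. The only thing it is meant to show is that both inputs go through one and the same set of parameters.

```python
import math
import random

def make_params(in_dim, out_dim, seed=0):
    """Create ONE set of weights; both branches reuse this same object."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]
    biases = [0.0] * out_dim
    return {"W": weights, "b": biases}

def encode(params, x):
    """Shared subnetwork (hypothetical): one tanh layer producing a feature vector."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(params["W"], params["b"])
    ]

def siamese_distance(params, x1, x2):
    """Encode both inputs with the SAME parameters, then compare the two vectors."""
    f1 = encode(params, x1)
    f2 = encode(params, x2)
    # Euclidean distance is an assumption; other comparison modules are possible.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))

params = make_params(in_dim=4, out_dim=3)
a = [0.1, 0.2, 0.3, 0.4]
b = [0.9, -0.5, 0.0, 0.7]
same = siamese_distance(params, a, a)       # identical inputs give identical vectors
different = siamese_distance(params, a, b)  # some positive distance
```

Because there is only one params object, any gradient update to it changes both branches at once, which is what "mirrored" updating amounts to in practice.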
In Question Answering, some recent studies have used Siamese architectures to score relevance between a question and an answer candidate[2]. So one input is a question sentence, the other input is an answer, and the output is how relevant the answer is to the question. Questions and answers don’t look exactly the same, but if the goal is to extract the similarity or a connection between them, a Siamese architecture can work well, too.
In my own experience, Siamese Networks may offer 3 distinct advantages over Traditional CLASSIFICATION!
These advantages are somewhat true for any kind of data, and not just for Images (where these are currently most popularly used).
CAN BE MORE ROBUST TO EXTREME CLASS IMBALANCE.
CAN BE GOOD TO ENSEMBLE WITH A CLASSIFIER.
CAN YIELD BETTER EMBEDDINGS.
Let’s say we want to learn to predict what animal is in a given image.
Case 1 : if it is just 2 animal classes to predict from (Cat vs Dogs) and given millions of images of each class, one could train a deep CNN Classifier. Easy!
Case 2 : but what if we have tens of thousands of animal classes and for most of these, we only have a few dozen image examples? Trying to learn each animal as a Class using a deep CNN seems less feasible now. Such a classifier can perform poorly on rarely seen training classes, e.g. let’s say there were only 4 training images of ‘eels’.
A Siamese Network is a Model Architecture used alongside a Distance-based Loss.
It learns what makes a pair of inputs the same (e.g. dog-dog, eel-eel).
In Comparison, Classification learns what makes an input a dog / cat / eel etc.
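As an illustration of what a distance-based loss can look like, here is a small sketch of the contrastive loss, one common choice for Siamese training. The answer above does not say which loss was actually used, so treat this choice, the margin value, and the function names as assumptions.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(emb1, emb2, is_same, margin=1.0):
    """Contrastive loss for one pair of embeddings.

    Same-class pairs are penalised for being far apart.
    Different-class pairs are penalised only if closer than `margin`.
    """
    d = euclidean(emb1, emb2)
    if is_same:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

# A same-class pair that is far apart is penalised heavily...
far_same = contrastive_loss([0.0, 0.0], [3.0, 4.0], is_same=True)
# ...while a different-class pair at that distance costs nothing.
far_diff = contrastive_loss([0.0, 0.0], [3.0, 4.0], is_same=False)
# A different-class pair sitting on top of each other is penalised.
close_diff = contrastive_loss([0.0, 0.0], [0.0, 0.0], is_same=False)
```

Note that the loss never asks "which class is this?". It only asks whether the two inputs belong together, which is the difference from classification described above.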
Advantages of such learning can be:
MORE ROBUST TO CLASS IMBALANCE. If the model has learnt well what makes any 2 animals the same, one example of a class like ‘eel’ in training may be sufficient to predict / recognize an eel in future. This is amazing! See One-Shot learning
NICE TO ENSEMBLE WITH BEST CLASSIFIER. Given that its learning mechanism is somewhat different from Classification, simple averaging of it with a Classifier can do much better than averaging 2 correlated Supervised models (e.g. GBM & RF classifier). I have experienced it personally.
BETTER EMBEDDINGS. Siamese networks focus on learning embeddings (in the deeper layers) that place the same classes / concepts close together. Hence, they can learn semantic similarity.
This is different from a Classification Loss (e.g. logistic loss), which explicitly rewards only making the classes linearly separable.
This makes the embeddings more useful in a generic sense, e.g. one can calculate distances on them. For example, one could use the last-layer embeddings to build a ‘search-by-image’ app.
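Here is a rough sketch of how such embeddings could be used for one-shot recognition or a search-by-image lookup. The embedding values below are hand-written placeholders, not outputs of a trained model, and nearest_class is a hypothetical helper name. The idea is simply to return the label whose reference embedding is closest to the query.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_class(query_emb, support):
    """Return the label whose reference embedding is closest to the query.

    `support` maps each class label to ONE reference embedding,
    i.e. a single example per class is enough to add a new class.
    """
    return min(support, key=lambda label: euclidean(query_emb, support[label]))

# Placeholder 2-D embeddings (assumed values, for illustration only).
support = {
    "dog": [0.9, 0.1],
    "cat": [0.1, 0.9],
    "eel": [-0.8, -0.7],  # a class seen only once
}

prediction = nearest_class([-0.7, -0.6], support)  # 'eel'
```

Adding a new animal means adding one entry to support; nothing has to be retrained, as long as the embedding model has learnt well what makes two inputs the same.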
The images below show the MNIST Embeddings that I got by training:
Classifier with 3 Hidden layers (size 200–100–2 ) & Softmax loss
Siamese Architecture with same network & Distance Loss.
I plot as embeddings the output of their 3rd Hidden layer on Test Images. Clearly Siamese Embeddings are not only linearly separable but also fit for distance-calculation.
Downside can be:
Training involves Pairwise Learning => a quadratic number of pairs to learn from (in order to see all information available) => slower than Classification (pointwise learning)
Prediction can add a few HyperParameters and can be slightly slower. It does not readily output Class probabilities, but distances from each Class.
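To put a number on the quadratic growth of pairs, here is a small sketch using only the Python standard library. The tiny label list is made up for illustration. With n training examples there are n(n-1)/2 unordered pairs.

```python
from itertools import combinations
from math import comb

def count_pairs(n_examples):
    """Number of unordered pairs among n examples: n * (n - 1) / 2."""
    return comb(n_examples, 2)

# Toy dataset (assumed labels) to show how pairs and their targets are built.
labels = ["dog", "dog", "cat", "eel"]
pairs = [
    (i, j, labels[i] == labels[j])  # (index 1, index 2, same class?)
    for i, j in combinations(range(len(labels)), 2)
]
n_positive = sum(1 for _, _, is_same in pairs if is_same)

small = count_pairs(1_000)    # 499500 pairs
large = count_pairs(10_000)   # 49995000 pairs: 10x the data, about 100x the pairs
```

In this toy set only one of the six pairs is a same-class pair, so in practice one usually samples pairs rather than enumerating all of them.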