Short Answer: No
Long Answer: They are a different variant of Convolutional Neural Networks (CNNs). Let’s take a more detailed look at CNNs to get a grasp of Capsule Networks and the shortcomings of CNNs they try to address.
A CNN can be considered a class of feed-forward neural networks. Normally it consists of an input layer, an output layer, and multiple hidden layers in between. Most of the hidden layers apply a convolution operation to their input and pass the result to the next layer. The reason convolutions are used instead of fully connected layers is that fully connected layers have a lot of parameters, since the whole input is considered, whereas a convolution generally has a small kernel window (often of size 5×5) which is slid over the input, with the parameters shared across all locations (so for one such window the number of parameters is only 25). Furthermore, convolution introduces a kind of locality by only taking the immediate 5×5 neighbourhood of a pixel into consideration.
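To make the difference concrete, here is a rough parameter count for both options; the layer sizes (28×28 input, 24×24 output) are made-up, illustrative numbers:

```python
# Rough parameter-count comparison for a 28x28 grayscale input
# producing a 24x24 output map (sizes are illustrative assumptions).
input_pixels = 28 * 28          # 784 inputs
output_units = 24 * 24          # 576 outputs

fc_params = input_pixels * output_units          # every input connects to every output
conv_params = 5 * 5                              # one shared 5x5 kernel, slid over the image

print(f"fully connected: {fc_params} weights")   # 451584
print(f"5x5 convolution: {conv_params} weights") # 25
```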
So what problem does Hinton have with convolution? Actually he does not have any problem with convolution per se, but with the architecture most CNNs use. To see the problem, let us consider a typical CNN architecture:
This is one of the typical ways to approach a CNN architecture: a conv layer followed by a pooling layer (and more conv and pooling layers, so that the lower layers can detect low-level features like edges, and the higher-level layers can detect abstractions like eyes, corners, and so on).
So here you can see something called a pooling layer. You might wonder what a pooling layer does. First notice that there are multiple pooling strategies; we will focus our attention on Max Pooling for now:
As you can see, Max Pooling downsampled the feature map to half of the input size. Generally Max Pooling (or any kind of pooling) is used to downsample the features to a manageable size. Besides reducing the size of the feature map, there are some other ideas behind it. By only keeping the max, we are essentially only interested in whether a feature is present in a certain window, but we do not really care about its exact location. If we have a convolution filter which detects edges, an edge will give a high response to this filter, and Max Pooling only keeps the information that an edge is present and throws away “unnecessary” information, which includes the location and the spatial relationships between certain features.
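A minimal numpy sketch of 2×2 Max Pooling on a made-up 4×4 feature map:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2: keeps only the largest value per window."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.array([
    [1, 3, 2, 1],
    [4, 6, 5, 2],
    [0, 1, 9, 7],
    [2, 3, 4, 8],
], dtype=float)

print(max_pool_2x2(feature_map))
# [[6. 5.]
#  [3. 9.]]  -- the 4x4 map is halved to 2x2; *where* the max sat in each window is lost
```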
And this kind of pooling is exactly what Hinton dislikes and addresses in Capsule Networks. His criticisms in a nutshell are:
Since Max Pooling does not care about spatial relationships (it throws that information away), the mere presence of certain features becomes the indicator of the presence of the object. But because spatial relationships are ignored, the network would say that an image in which all the facial features are present but in the wrong places (see the image on the right) is also a face. One can prevent this by adding such examples to the dataset and labelling them as “not a face”, but it is quite a crude way to deal with the problem.
Moreover, Max Pooling is, according to Hinton, a crude way to route information: it simply forwards the maximum, and there are ways to route information that are a bit more sophisticated than that (we will come back to this later).
The most important drawback of CNNs is that they are not able to model spatial relationships well: we do not have an internal representation of the geometrical constraints of the data, and the only knowledge we have comes from the data itself. If we want to be able to detect cars from many viewpoints, we need to have cars from these different viewpoints in the training set, because we did not encode any prior knowledge of the geometrical relationships into the network.
If we think of Computer Graphics, it builds images from an internal hierarchical representation of the data, combining several matrices that model the relationships between parts (say, the parts of a face) and the whole. Hinton argues that when our brain does image recognition, it performs some kind of inverse graphics: from the visual information received by the eyes, it deconstructs a hierarchical representation of the world around us and tries to match it with learned patterns and relationships stored in the brain. This is how recognition happens. And the key idea is that the representation of objects in the brain does not depend on the viewing angle. We just need to make such an internal representation happen in a neural network, and for this purpose capsules come to the rescue.
So to get an idea of what a capsule actually is, let us consider a quote from Hinton himself:
Each capsule learns to recognize an implicitly defined visual entity over a limited domain of viewing conditions and deformations and it outputs both the probability that the entity is present within its limited domain and a set of “instantiation parameters” that may include the precise pose, lighting and deformation of the visual entity relative to an implicitly defined canonical version of that entity. When the capsule is working properly, the probability of the visual entity being present is locally invariant — it does not change as the entity moves over the manifold of possible appearances within the limited domain covered by the capsule. The instantiation parameters, however, are “equivariant” — as the viewing conditions change and the entity moves over the appearance manifold, the instantiation parameters change by a corresponding amount because they are representing the intrinsic coordinates of the entity on the appearance manifold. [1]
It takes some time to digest this information. Hinton introduces an important word, namely equivariance (which is not the same as invariance), and this distinction is key to understanding capsules. Max Pooling does introduce a kind of invariance: if you translate or change the input a little, the output should not change, and with Max Pooling it does not. If you change the input a little, the maximum still stays the same (and disregards the change in input coming, for example, from a viewpoint change). Coming back to Hinton, he states that we want the probability of the presence of an entity to stay the same even if we change the input, for example by changing the viewpoint. This makes sense: the probability of the presence of a nose should not change if we just change the viewpoint.
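A toy illustration of this invariance, with made-up activations: when the feature moves within a pooling window, the max (and therefore the output) does not change, and the location information is gone.

```python
import numpy as np

window = np.array([0.1, 0.9, 0.2, 0.3])    # activations in one pooling window
shifted = np.array([0.9, 0.1, 0.3, 0.2])   # the same feature, moved inside the window

print(window.max(), shifted.max())          # 0.9 0.9 -- identical output, so the
                                            # downstream layers cannot tell where it was
```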
However, he also wants to achieve equivariance of the parameters: if we change the input, the instantiation parameters should change accordingly to encode the change in orientation. To get a better understanding, let us look at the architecture of a Capsule Network:
Remember when I said that a Capsule Network is just a variant of a CNN? Here you can see that at the beginning we have normal convolutions. Furthermore, the capsules here are abstracted, but what is inside them?
Let us quote the man himself:
Convolutional neural networks (CNNs) use translated replicas of learned feature detectors. This allows them to translate knowledge about good weight values acquired at one position in an image to other positions. This has proven extremely helpful in image interpretation. Even though we are replacing the scalar-output feature detectors of CNNs with vector-output capsules and max-pooling with routing-by-agreement, we would still like to replicate learned knowledge across space. To achieve this, we make all but the last layer of capsules be convolutional. As with CNNs, we make higher-level capsules cover larger regions of the image. [2]
So as you can see, even the capsules consist of convolutional layers; the novelty is the capsule structure, inside which you nest a few convolutional layers.
The key concept of the Capsules can be summed up in this image:
So just to make a brief summary of it: $u_1$ to $u_3$ are the vector outputs of the capsules one level below. Their lengths indicate the probability of the presence of an entity (be it a nose or whatever), and their orientations encode the pose and other properties such as deformation.
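In the routing step described below, each lower capsule output $u_i$ is first multiplied by a learned transformation matrix $W_{ij}$, producing a prediction $\hat{u}_{j|i}$ of the pose of the higher-level capsule $j$. A minimal numpy sketch of that step, with made-up dimensions (8-D lower capsules, 16-D predictions):

```python
import numpy as np

rng = np.random.default_rng(0)

# u1..u3: vector outputs of three lower-level capsules (8-D, made-up size)
u = rng.normal(size=(3, 8))

# one learned transformation matrix W_ij per (lower capsule i, upper capsule j) pair;
# here a single upper capsule, so the shape is (3, 16, 8)
W = rng.normal(size=(3, 16, 8))

# predictions u_hat_{j|i} = W_ij @ u_i: what each lower capsule predicts
# the higher-level capsule's pose should be
u_hat = np.einsum('ioh,ih->io', W, u)   # shape (3, 16)
print(u_hat.shape)
```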
Apart from encoding spatial relationships, there are two novelties which a Capsule Network introduces: namely the routing algorithm and a new non-linearity called squashing. Let us briefly talk about them:
Routing algorithm:
The routing can be thought of as a coupling between capsules in the layers below and above; intuitively it means that a capsule below sends its output to the capsules above which are “experts” in dealing with it. You can think of it as coupling coefficients $c_{ij}$ between two capsule layers. These initial coupling coefficients are refined by measuring the agreement between the current output $v_j$ of each capsule $j$ in the layer above and the prediction $\hat{u}_{j|i}$ made by capsule $i$ in the layer below. The agreement is calculated as the scalar product $\hat{u}_{j|i} \cdot v_j$, and the coupling coefficients $c_{ij}$ are thus iteratively refined with the calculated agreement.
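A minimal numpy sketch of this routing-by-agreement loop, following the formulation in [2]; the capsule counts and dimensions are made up, and the squash non-linearity used here is the one defined in the next section:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """The capsule non-linearity (defined in the next section)."""
    norm2 = (s ** 2).sum(axis=axis, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def routing_by_agreement(u_hat, n_iters=3):
    """u_hat: predictions of shape (num_lower, num_upper, dim)."""
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))                          # routing logits, start neutral
    for _ in range(n_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over upper capsules
        s = (c[..., None] * u_hat).sum(axis=0)                # coupled sum per upper capsule
        v = squash(s)                                         # output v_j
        b += (u_hat * v[None]).sum(axis=-1)                   # agreement <u_hat_{j|i}, v_j>
    return v

rng = np.random.default_rng(0)
v = routing_by_agreement(rng.normal(size=(6, 2, 16)))         # 6 lower, 2 upper capsules
print(v.shape)                                                # (2, 16)
```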
Squashing function:
So far we have multiplied the outputs of the capsules in the previous layer by weight matrices to encode the spatial relationships, then multiplied the predictions by the coupling coefficients so that each capsule only receives the information it is an “expert” in dealing with. Now we run the result through a squashing function. Essentially it is a new non-linearity introduced by Hinton, and its definition is:
$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \cdot \frac{s_j}{\|s_j\|}$$

where $s_j$ is the output after the coupling step. The concept behind it is that they wanted the length of the output vector of a capsule to represent the probability that the entity the capsule represents (in our case a face) is present in the current input. The squashing function ensures that short vectors get shrunk to almost zero length and long vectors get shrunk to a length slightly below 1, which can therefore be interpreted as a probability. Long vectors mean that there was a lot of evidence for the entity in the input, and short vectors mean there was little evidence.
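A tiny numeric check of that behaviour, with made-up vectors:

```python
import numpy as np

def squash(s, eps=1e-8):
    norm2 = (s ** 2).sum()
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

short = np.array([0.1, 0.1])     # weak evidence
long_ = np.array([10.0, 10.0])   # strong evidence

print(np.linalg.norm(squash(short)))  # ~0.02  -> probability near 0
print(np.linalg.norm(squash(long_)))  # ~0.995 -> probability near 1
```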
The described routing between capsules is usually done between the PrimaryCapsules and DigitCaps layers (“Digit” comes from the fact that a lot of the experiments were done on MNIST, I think), and the squashing is then done on the DigitCaps layer.
The last step of the network is a reconstruction step, summed up in this picture:
During training, all but the activity vector of the correct digit are masked out, and this activity vector is used to reconstruct the input image with a 3-layer fully connected decoder. This encourages the DigitCaps layer to capture information relevant for reconstruction and is used as a regularization technique.
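A minimal sketch of that masking step, assuming the paper’s setup of 10 DigitCaps with 16 dimensions each:

```python
import numpy as np

def mask_for_reconstruction(digit_caps, label):
    """digit_caps: (num_classes, dim) activity vectors; keep only the correct one."""
    masked = np.zeros_like(digit_caps)
    masked[label] = digit_caps[label]      # every other capsule is zeroed out
    return masked.ravel()                  # flattened input to the FC decoder

rng = np.random.default_rng(0)
digit_caps = rng.normal(size=(10, 16))     # 10 classes, 16-D capsules
decoder_input = mask_for_reconstruction(digit_caps, label=3)
print(decoder_input.shape)                 # (160,)
```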
This whole architecture reduces the error rate on the smallNORB dataset by 45% [3]. However, Capsule Networks still need a lot of testing on huge datasets; the idea is promising, though, and it is a potential return to Computer Vision doing inverse graphics as it did initially.
Will they replace neural networks? No, because they themselves are neural networks.
Will they replace CNNs? No, because they themselves include convolutional layers. The new thing is to nest the convolutional layers.
Despite all this, it is a very promising approach, and one that can force us to rethink existing CNNs and Max Pooling: there are already approaches that learn the pooling instead of hard-coding it, and there might be alternative ways to do routing beyond a simple Max Pooling. For that purpose of rethinking alone, the paper may be very important as a thought-provoking piece. Whether capsules will be the way to go from now on is hard to say, because we still need experiments on large datasets to learn their real capacity. One challenging task could be to achieve state of the art on ImageNet using only a fraction of the input images, since the goal of including geometrical relationships is to use far less data to learn.
Footnotes
[1] Understanding Hinton’s Capsule Networks. Part II: How Capsules Work.
[2] Sabour, Frosst, Hinton. Dynamic Routing Between Capsules. NIPS 2017.
[3] Hinton, Sabour, Frosst. Matrix Capsules with EM Routing. ICLR 2018.