Image 1: Behaviour of the spatial transformer during 10 training steps. The spatial transformer learns to remove redundant background data from the image. Image source (2).
Introduction
Current CNNs are only somewhat invariant to translation, a property they gain through max-pooling layers.^{(1)} A key requirement of a CNN is that it correctly classifies objects in real-world image data. Objects in images usually appear at arbitrary positions, at different scales and from different viewpoints. A CNN therefore has to be built such that its output is invariant to the position, size and rotation of an object in the image. The Spatial Transformer Module achieves this with remarkable results.
2D Geometric Transformations
2D geometric transformations are a set of transformations that alter properties such as the scale, rotation and position of an image. A transformation is applied by multiplying each coordinate vector of the image with one of the transformation matrices shown in Table 1.
The effect of the respective transformations can be seen on the right.
Table 1: Hierarchy of 2D geometric transformations. Image source [2]. | Image 2: Geometric effect of the transformations on an image. Image source [2]. |
The following table shows a generic matrix for each of the 2D geometric transformations. To perform each transformation in a single multiplication, homogeneous coordinates are used, which means that a 1 is appended as a third coordinate, i.e. \bar{x} = \begin{pmatrix} x \\ y\\ 1 \end{pmatrix}.
Transformation | Matrix Entries |
---|---|
Translation | x' = \begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \end{bmatrix} \bar{x} |
Rigid (Rotation + Translation) | x' = \begin{bmatrix} \cos \theta & - \sin \theta & t_x \\ \sin \theta & \cos \theta & t_y \end{bmatrix} \bar{x} |
Similarity (Scaled Rotation) | x' = \begin{bmatrix} s \cos \theta & - s \sin \theta & t_x \\ s \sin \theta & s \cos \theta & t_y \end{bmatrix} \bar{x} |
Affine | x' = \begin{bmatrix} a_{00} & a_{01} & a_{02} \\ a_{10} & a_{11} & a_{12} \end{bmatrix} \bar{x} |
Projective | \bar{x}' = \begin{bmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & h_{22} \end{bmatrix} \bar{x} |
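
The matrices in the table above can be tried out on a single point. The following is a minimal sketch in plain Python, not code from the cited sources; the helper names `to_homogeneous`, `apply_transform`, `translation` and `similarity` are hypothetical and chosen for illustration only. For the 3x3 projective case, the sketch assumes the usual convention of dividing by the third coordinate to return to 2D coordinates.

```python
import math


def to_homogeneous(x, y):
    """Append a 1 as third coordinate, as described in the text."""
    return [x, y, 1.0]


def apply_transform(matrix, point):
    """Multiply a 2x3 or 3x3 matrix with a homogeneous point [x, y, 1]."""
    result = [sum(m * p for m, p in zip(row, point)) for row in matrix]
    if len(result) == 3:
        # Projective case (assumed convention): normalise by the third coordinate.
        w = result[2]
        return [result[0] / w, result[1] / w]
    return result


def translation(tx, ty):
    """2x3 translation matrix."""
    return [[1.0, 0.0, tx],
            [0.0, 1.0, ty]]


def similarity(s, theta, tx, ty):
    """2x3 similarity matrix: rotation by theta, scaled by s, then translation."""
    c, sn = s * math.cos(theta), s * math.sin(theta)
    return [[c, -sn, tx],
            [sn, c, ty]]


point = to_homogeneous(1.0, 0.0)

# Shift the point (1, 0) by (2, 3).
moved = apply_transform(translation(2.0, 3.0), point)

# Rotate the point (1, 0) by 90 degrees and scale it by 2: lands near (0, 2).
rotated = apply_transform(similarity(2.0, math.pi / 2, 0.0, 0.0), point)

# A 3x3 matrix whose last row is (0, 0, 2) halves both coordinates.
halved = apply_transform([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0],
                          [0.0, 0.0, 2.0]], to_homogeneous(4.0, 2.0))

print(moved, rotated, halved)
```

Because every transformation is a single matrix multiplication in homogeneous coordinates, the same `apply_transform` helper covers all rows of the table.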
Translational Invariance Through the Max-Pooling Layer
Max-pooling is a form of non-linear downsampling and an essential part of standard CNNs.^{(1)} The input to the max-pooling layer is divided into non-overlapping squares of the same size. Each square is then reduced to its maximum value, and the other values are dropped. For further information on max-pooling, please refer to the basics pages. Besides reducing the computational complexity in the higher layers of a network, the max-pooling layer also provides a form of translation invariance, which is covered in the following.
Image 3: The figure shows the 8 possible translations of an image by one pixel. The 4x4 matrix represents the 16 pixel values of an example image. When applying 2x2 max-pooling, the maximum of each colored rectangle becomes the new entry of the respective bold rectangle. The output stays identical in 3 out of 8 cases, which makes the output to the next layer somewhat translation invariant. With 3x3 max-pooling, 5 out of 8 translation directions give an identical result^{(1)}, which means that the translation invariance increases with the size of the pooling window.
As described before, translation is only the simplest case of a geometric transformation. The other transformations listed in the table above can only be handled by the spatial transformer module.
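
The limited translation invariance of max-pooling described in this section can be checked with a small experiment. The following is a minimal sketch in plain Python. The example image (a single bright pixel in a 4x4 grid), the zero fill for pixels shifted in from outside, and the helper names `max_pool`, `shift` and `count_invariant_shifts` are assumptions made for illustration and are not taken from the cited sources.

```python
def max_pool(image, k):
    """Non-overlapping k x k max-pooling; image sides must be multiples of k."""
    n_rows, n_cols = len(image), len(image[0])
    return [
        [max(image[r + dr][c + dc] for dr in range(k) for dc in range(k))
         for c in range(0, n_cols, k)]
        for r in range(0, n_rows, k)
    ]


def shift(image, dy, dx, fill=0):
    """Translate the image by (dy, dx) pixels; vacated pixels get `fill`."""
    n_rows, n_cols = len(image), len(image[0])
    return [
        [image[r - dy][c - dx]
         if 0 <= r - dy < n_rows and 0 <= c - dx < n_cols else fill
         for c in range(n_cols)]
        for r in range(n_rows)
    ]


def count_invariant_shifts(image, k):
    """Count how many of the 8 one-pixel shifts leave the pooled output unchanged."""
    reference = max_pool(image, k)
    shifts = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
              if (dy, dx) != (0, 0)]
    return sum(max_pool(shift(image, dy, dx), k) == reference
               for dy, dx in shifts)


# Hypothetical example image: one bright pixel in the top-left corner
# of the bottom-right 2x2 pooling window.
image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 9, 0],
    [0, 0, 0, 0],
]

pooled = max_pool(image, 2)                    # [[0, 0], [0, 9]]
invariant = count_invariant_shifts(image, 2)   # 3 of the 8 shifts
print(pooled, invariant)
```

For this image, only the three shifts that keep the bright pixel inside its 2x2 window leave the pooled output unchanged. For other images or pooling sizes the count depends on where the maxima lie within their windows.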
Performance Compared to Standard CNNs
While providing state-of-the-art results, the Spatial-Transformer-CNN introduced by Jaderberg et al. is only 6% slower than the corresponding standard CNN.^{[1]}
The table in Image 6 [1] compares the results of different neural networks on the MNIST data set. The table distinguishes between fully connected networks (FCN) and convolutional neural networks (CNN). A spatial transformer module is additionally inserted into each of the network types (ST-FCN and ST-CNN), with \theta containing either 6 (Aff) or 8 (Proj) transformation parameters, or describing a thin plate spline (TPS). The MNIST data set was distorted by rotation (R), by rotation, translation and scaling (RTS), by a projective transform (P) or by an elastic distortion (E). The results show that each spatial transformer network outperforms its standard counterpart, as the classification error is smaller in all cases.
Image 6: The table compares the classification error of different network models on several distorted versions of the MNIST data set. Networks that include a spatial transformer module outperform the classic neural networks. The images on the right show examples of the input to the spatial transformer (a), a visualization of the transformation (b) and the output after the transformation (c). The left column uses a thin plate spline (TPS), while the transformation in the right column is affine. Image source [1].
Unsupervised Sub-Object Classification
Another important result from Jaderberg et al. is that they achieved a form of unsupervised, fine-grained classification. The presented experiments were done on the CUB-200-2011 bird data set, which contains images of 200 different bird species. The images not only show different species, but are also taken from different angles and points of view, at different scales and with individual background scenery. This gives an idea of how challenging the data set is.
Before the spatial transformer module was introduced, Simon and Rodner also performed unsupervised fine-grained classification on this bird data set. By analyzing the constellation of part detectors that fire at approximately the same location relative to each other, they achieved a state-of-the-art classification accuracy of 81.0%^{[3]}. The table in Image 7 shows that Jaderberg et al. were able to improve this result by 1.3 percentage points using a CNN model with Inception architecture.^{[1]} With the latter network as a basis, they inserted either 2 or 4 spatial transformer modules into the network architecture and achieved even higher classification accuracy.
The images next to the table show the region of interest on which each of the transformer modules focused. This shows that when spatial transformer modules are placed in parallel, each can learn a different part of an object. In this case, one of the transformer modules focused on the head and the other on the body of a bird. Other work on this data set with sub-object classification was done by Branson et al. While the latter explicitly defined parts of the bird and trained separate detectors on these parts, Jaderberg et al. achieved this separation in a completely unsupervised manner.
Image 7: The table shows the classification results of different network architectures on the CUB-200-2011 bird data set. The spatial transformer networks outperform the other networks. The images next to the table show the behaviour of the spatial transformers when 2 (upper row) or 4 (lower row) of them are placed in parallel inside the network. Interestingly, each of the spatial transformers learned to focus on either the head or the body of the bird in a completely unsupervised manner. Image source [1].
Problems and Limitations
When using a spatial transformer, it is possible to downsample or oversample a feature map.^{[1]} Using a sampling kernel of fixed width, such as the bilinear kernel, can cause aliasing effects in the output feature map.
As described in the previous section, spatial transformer modules can be used for fine-grained classification, which means that sub-parts of a class can also be detected by a neural network. While this is a very promising result, the number of objects an STN can model is limited to the number of parallel spatial transformers in the network.^{[1]}
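The aliasing effect can be illustrated with a deliberately simplified one-dimensional example. The following is a sketch in plain Python that uses linear interpolation as the 1D counterpart of the bilinear kernel. The alternating test signal and the helper name `linear_sample` are assumptions made for illustration and do not come from the cited paper.

```python
def linear_sample(signal, position):
    """Sample a 1D signal at a non-negative fractional position
    using linear interpolation between the two neighbouring values."""
    left = int(position)
    right = min(left + 1, len(signal) - 1)
    weight = position - left
    return (1.0 - weight) * signal[left] + weight * signal[right]


# High-frequency test pattern: 0, 1, 0, 1, ...
signal = [0.0, 1.0] * 4

# Downsampling by a factor of 2: the sampling positions are 0, 2, 4, 6.
# The fixed-width kernel only looks at the nearest neighbours, so every
# second input value is never seen.
downsampled = [linear_sample(signal, 2.0 * i) for i in range(4)]

mean_before = sum(signal) / len(signal)
mean_after = sum(downsampled) / len(downsampled)
print(downsampled, mean_before, mean_after)
```

In this sketch the alternating pattern collapses to a constant signal of zeros, and the mean drops from 0.5 to 0.0, because the kernel width does not grow with the downsampling factor.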
Literature
[1] Spatial Transformer Networks (2015, M. Jaderberg et al.)
[2] Computer Vision: Algorithms and Applications (2011, R. Szeliski)
[3] Neural Activation Constellations: Unsupervised Part Model Discovery with Convolutional Networks (2015, M. Simon and E. Rodner)