Overview

Spatial Transformer Networks are Convolutional Neural Networks that contain one or more Spatial Transformer Modules. These modules attempt to make the network spatially invariant to its input data in a computationally efficient manner, which leads to more accurate object classification results. Furthermore, they allow the localization of objects in an image and the sub-classification of an object, such as distinguishing between the body and the head of a bird, in an unsupervised manner.


Image 1: Behaviour of the spatial transformer during 10 training steps. As can be seen, the spatial transformer is able to remove redundant background data from the image. Image source [2].

Introduction

Current CNNs are only somewhat invariant to translation through the use of max-pooling layers.⁽¹⁾ A key requirement of a CNN is that it is able to correctly classify objects in real-world image data. Objects in images usually appear at random positions, at different scales, and from random viewpoints. Therefore, a CNN has to be implemented in such a way that the output of the network is invariant to the position, size and rotation of an object in the image. The Spatial Transformer Module is designed to achieve this, and the experiments summarized further down show that it improves on standard CNNs.

2D Geometric Transformations

2D geometric transforms are a set of transforms that alter parameters such as the scale, rotation and position of an image. The transformation is done by multiplying each coordinate vector of an image with one of the transformation matrices shown in Table 1.

The effect of the respective transformations can be seen in Image 2.

Table 1: Hierarchy of 2D geometrical transformations. Image source [2].

Image 2: Geometrical effect of the transformations on an image. Image source [2].

The following table shows a generic matrix for each of the 2D geometric transformations. In order to perform the transformation in a single multiplication, homogeneous coordinates are used, which means that a 1 is appended as a third coordinate, i.e. $\begin{array}{l}\bar{x} = \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}\end{array}$ .

Transformation	Matrix Entries
Translation	$\begin{array}{l}x' = \begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \end{bmatrix} \bar{x}\end{array}$
Rigid (Rotation + Translation)	$\begin{array}{l}x' = \begin{bmatrix} \cos \theta & -\sin \theta & t_x \\ \sin \theta & \cos \theta & t_y \end{bmatrix} \bar{x}\end{array}$
Similarity (Scaled Rotation)	$\begin{array}{l}x' = \begin{bmatrix} s \cos \theta & -s \sin \theta & t_x \\ s \sin \theta & s \cos \theta & t_y \end{bmatrix} \bar{x}\end{array}$
Affine Transformation	$\begin{array}{l}x' = \begin{bmatrix} a_{00} & a_{01} & a_{02} \\ a_{10} & a_{11} & a_{12} \end{bmatrix} \bar{x}\end{array}$
Projection	$\begin{array}{l}x' = \begin{bmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & h_{22} \end{bmatrix} \bar{x}\end{array}$
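
To make the use of homogeneous coordinates concrete, the following is a minimal sketch in plain Python. The function names (`similarity_matrix`, `apply_2x3`, `apply_projective`) are hypothetical and chosen only for this illustration. For the projective case the sketch assumes the usual convention that the result of the 3x3 multiplication is divided by its third coordinate to return to 2D.

```python
import math

def similarity_matrix(s, theta, tx, ty):
    """2x3 matrix of a similarity transform (scaled rotation plus translation)."""
    c, k = s * math.cos(theta), s * math.sin(theta)
    return [[c, -k, tx],
            [k,  c, ty]]

def apply_2x3(matrix, point):
    """Multiply a 2x3 matrix with the homogeneous version (x, y, 1) of a point."""
    x, y = point
    x_bar = (x, y, 1.0)
    return tuple(sum(m * v for m, v in zip(row, x_bar)) for row in matrix)

def apply_projective(h, point):
    """Apply a 3x3 projective matrix, then divide by the third coordinate."""
    x, y = point
    x_bar = (x, y, 1.0)
    xh, yh, w = (sum(m * v for m, v in zip(row, x_bar)) for row in h)
    return (xh / w, yh / w)

# Rotate (1, 0) by 90 degrees, scale by 2, then shift by (1, 1).
m = similarity_matrix(2.0, math.pi / 2, 1.0, 1.0)
print(apply_2x3(m, (1.0, 0.0)))  # approximately (1.0, 3.0)
```

Because the translation sits in the third column, rotation, scaling and shifting are all carried out by the same single multiplication.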

Translational Invariance Through the Max-Pooling Layer

Max-pooling is a form of non-linear downsampling and an essential part of standard CNNs.⁽¹⁾ The input to the max-pooling layer is divided into non-overlapping squares of the same size. Each square is then reduced to its maximum value, while the other values are dropped. For further information on max-pooling, please refer to the basics pages. While the max-pooling layer reduces the computational complexity in the higher layers of a network, it also provides a form of translation invariance, which is covered in the following.

Image 3: The figure shows the 8 possible translations of an image by one pixel. The 4x4 matrix represents the 16 pixel values of an example image. When applying 2x2 max-pooling, the maximum of each colored rectangle becomes the new entry of the respective bold rectangle. As one can see, the output stays identical in 3 out of 8 cases, making the output to the next layer somewhat translation invariant. When performing 3x3 max-pooling, 5 out of 8 translation directions give an identical result⁽¹⁾, which means that the translation invariance increases with the size of the pooling region.
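
The experiment described in the caption can be reproduced with a small sketch in plain Python. The 4x4 image below is a made-up example, not the matrix from the figure, and the sketch assumes that pixels shifted in from outside the image are filled with zeros. The number of unchanged cases depends on both of these choices.

```python
def max_pool(image, size):
    """Non-overlapping size x size max-pooling of a 2D list."""
    rows, cols = len(image), len(image[0])
    return [[max(image[r + dr][c + dc]
                 for dr in range(size) for dc in range(size))
             for c in range(0, cols - size + 1, size)]
            for r in range(0, rows - size + 1, size)]

def shift(image, dy, dx, fill=0):
    """Translate the image by (dy, dx) pixels; vacated pixels get `fill`."""
    rows, cols = len(image), len(image[0])
    return [[image[r - dy][c - dx]
             if 0 <= r - dy < rows and 0 <= c - dx < cols else fill
             for c in range(cols)]
            for r in range(rows)]

def unchanged_shifts(image, size):
    """List the one-pixel shifts whose pooled output equals the unshifted one."""
    reference = max_pool(image, size)
    directions = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                  if (dy, dx) != (0, 0)]
    return [d for d in directions
            if max_pool(shift(image, *d), size) == reference]

# Hypothetical image: each 2x2 block has its maximum in the top-left corner.
image = [[9, 1, 8, 1],
         [1, 1, 1, 1],
         [7, 1, 6, 1],
         [1, 1, 1, 1]]
print(max_pool(image, 2))          # [[9, 8], [7, 6]]
print(unchanged_shifts(image, 2))  # [(0, 1), (1, 0), (1, 1)]
```

For this particular image the pooled output survives a shift to the right, downwards, and diagonally down-right, because each maximum stays inside its own pooling square. Any shift that pushes a maximum across a square boundary changes the output.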

As described before, translation is only the simplest case of a geometric transformation. The other transformations listed in the table above are not covered by max-pooling. Handling them is the purpose of the spatial transformer module.

Spatial Transformer Module

Overview

Image 4: Architecture of a spatial transformer module.^[1] U and V are the input and output feature maps, respectively. The goal of the spatial transformer is to determine $\begin{array}{l}\theta\end{array}$ , i.e. the parameters of the geometric transform. Image source [1].

Image 5: Illustration of a transformation showing the sampling grid.^[1] While $\begin{array}{l}\mathcal{T}_I(G)\end{array}$ depicts the identity transformation (i.e. no change at all), $\begin{array}{l}\mathcal{T}_\theta(G)\end{array}$ shows an affine transform. One can see that the sampling grid produced by $\begin{array}{l}\mathcal{T}_\theta(G)\end{array}$ does not coincide with the pixel coordinates of $\begin{array}{l}U\end{array}$ . Therefore, some form of interpolation has to be performed. Image 5 also shows that the spatial transformer module is able to crop the image, removing (possibly) redundant information and focusing on the key elements of the image. Image source [1].

Localization Network

The localization network maps the input feature map $\begin{array}{l}U\end{array}$ to the transformation parameters $\begin{array}{l}\theta\end{array}$ . The dimension of $\begin{array}{l}\theta\end{array}$ depends on the type of transformation that the network should be able to perform. For example, since a projective transform has 8 degrees of freedom, $\begin{array}{l}\theta\end{array}$ would have to be 8-dimensional. While the localization network can be either a CNN or a fully connected neural network, it should contain a final regression layer in order to output the parameters $\begin{array}{l}\theta\end{array}$ . In general, any parameterizable transformation that is differentiable with respect to its parameters can be used in an STN.

Grid Generator

The grid generator $\begin{array}{l}\mathcal{T}_\theta(G)\end{array}$ uses the transformation parameters to create a sampling grid, which is a set of points $\begin{array}{l}(x_i^s, y_i^s)\end{array}$ . Each of these points defines where a sampling kernel has to be applied to $\begin{array}{l}U\end{array}$ in order to obtain a certain output pixel in $\begin{array}{l}V\end{array}$ .
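
As an illustration, a grid generator for the affine case might look like the sketch below. It assumes that the output pixel positions are expressed in coordinates normalized to [-1, 1], the convention used by Jaderberg et al., and that $\begin{array}{l}\theta\end{array}$ is given as a 2x3 matrix. The names `regular_grid` and `affine_grid` are hypothetical helpers for this example.

```python
def regular_grid(height, width):
    """Target coordinates (x_t, y_t) of the output pixels, normalized to [-1, 1]."""
    def axis(n):
        return [0.0] if n == 1 else [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return [(x, y) for y in axis(height) for x in axis(width)]

def affine_grid(theta, height, width):
    """Map every target point through the 2x3 matrix theta to a source point."""
    (a, b, c), (d, e, f) = theta
    return [(a * x + b * y + c, d * x + e * y + f)
            for x, y in regular_grid(height, width)]

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
zoom_in = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]  # samples only the central half of U

print(affine_grid(identity, 2, 2))
# [(-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0), (1.0, 1.0)]
print(affine_grid(zoom_in, 2, 2))
# [(-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5), (0.5, 0.5)]
```

Note that the matrix maps output positions to input positions. A matrix with entries smaller than 1 on the diagonal therefore reads from a smaller region of the input, which corresponds to zooming in or cropping.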

Sampler

The sampler combines the input feature map and the sampling grid by performing some form of interpolation, resulting in the output feature map $\begin{array}{l}V\end{array}$ . The interpolation step is necessary because the coordinates of the sampling points $\begin{array}{l}(x_i^s, y_i^s)\end{array}$ will in general not coincide with pixel coordinates of the input $\begin{array}{l}U\end{array}$ .

The sampling step can be performed with any sampling kernel. The general formula is

$\begin{array}{l}\displaystyle V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, k(x_i^s - m; \phi_x) \, k(y_i^s - n; \phi_y) \quad \forall i \in [1 \ldots H'W'] \quad \forall c \in [1 \ldots C]\end{array}$

Here $\begin{array}{l}k()\end{array}$ can be any sampling kernel, with $\begin{array}{l}\phi_x\end{array}$ and $\begin{array}{l}\phi_y\end{array}$ being its parameters. The only constraint on the sampling kernel is that it is differentiable with respect to the sampling points $\begin{array}{l}(x_i^s, y_i^s)\end{array}$ . This is necessary for the backpropagation algorithm that is used for training the spatial transformer network. In their paper, Jaderberg et al. use a bilinear sampling kernel. This changes the above equation to

$\begin{array}{l}\displaystyle V_i^c = \sum_n^H \sum_m^W U_{nm}^c \max(0, 1-|x_i^s - m|) \max(0, 1-|y_i^s - n|).\end{array}$
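
The bilinear sampling formula can be translated almost literally into code. The sketch below is a deliberately naive version for a single channel and a single sampling point. It assumes that the sampling point is given in pixel coordinates with zero-based indices, so that `m` is the column and `n` the row of the input, and the function name `bilinear_sample` is hypothetical.

```python
def bilinear_sample(u, x_s, y_s):
    """V_i = sum_n sum_m U[n][m] * max(0, 1-|x_s-m|) * max(0, 1-|y_s-n|)."""
    total = 0.0
    for n, row in enumerate(u):          # n indexes rows (y direction)
        for m, value in enumerate(row):  # m indexes columns (x direction)
            wx = max(0.0, 1.0 - abs(x_s - m))
            wy = max(0.0, 1.0 - abs(y_s - n))
            total += value * wx * wy
    return total

u = [[0.0, 10.0],
     [20.0, 30.0]]
print(bilinear_sample(u, 0.0, 0.0))   # 0.0, exactly on pixel (0, 0)
print(bilinear_sample(u, 0.5, 0.5))   # 15.0, the mean of the four neighbours
print(bilinear_sample(u, 1.0, 0.25))  # 15.0, a quarter of the way from 10 to 30
```

Although the sum runs over the whole input, only the (at most) four pixels surrounding the sampling point receive a non-zero weight. A practical implementation would visit just those pixels instead of looping over the full feature map.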

To perform backpropagation, one needs the gradients with respect to $\begin{array}{l}U\end{array}$ and the sampling grid $\begin{array}{l}G\end{array}$ , which leads to the equations

$\begin{array}{l}\displaystyle \frac{\partial V_i^c}{\partial U_{nm}^c} = \sum_n^H \sum_m^W \max(0, 1-|x_i^s - m|) \max(0, 1-|y_i^s - n|),\end{array}$

$\begin{array}{l}\displaystyle \frac{\partial V_i^c}{\partial x_i^s} = \sum_n^H \sum_m^W U_{nm}^c \max(0, 1-|y_i^s - n|) \begin{cases} 0 & \text{if } |m - x_i^s| \ge 1 \\ 1 & \text{if } m \ge x_i^s \\ -1 & \text{if } m < x_i^s \end{cases},\end{array}$

$\begin{array}{l}\displaystyle \frac{\partial V_i^c}{\partial y_i^s} = \sum_n^H \sum_m^W U_{nm}^c \max(0, 1-|x_i^s - m|) \begin{cases} 0 & \text{if } |n - y_i^s| \ge 1 \\ 1 & \text{if } n \ge y_i^s \\ -1 & \text{if } n < y_i^s \end{cases}.\end{array}$

These equations allow the loss gradients to backpropagate to the input feature map $\begin{array}{l}U\end{array}$ of the transformer module on the one hand, and to the transformation parameters $\begin{array}{l}\theta\end{array}$ of the ST-layer on the other hand, using the further derivatives $\begin{array}{l}\frac{\partial x_i^s}{\partial \theta}\end{array}$ and $\begin{array}{l}\frac{\partial y_i^s}{\partial \theta}\end{array}$ .^[1]
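
A quick way to gain confidence in the gradients with respect to the sampling point is to compare them against finite differences. The sketch below does this for the bilinear kernel max(0, 1 - |d|) on a single channel, using zero-based pixel coordinates. All function names are hypothetical, and the test point is chosen away from integer coordinates, where the kernel has a kink.

```python
def bilinear_sample(u, x_s, y_s):
    """Bilinear interpolation of u at the point (x_s, y_s)."""
    return sum(value * max(0.0, 1.0 - abs(x_s - m)) * max(0.0, 1.0 - abs(y_s - n))
               for n, row in enumerate(u) for m, value in enumerate(row))

def kernel_slope(index, coord):
    """Derivative of max(0, 1 - |coord - index|) with respect to coord."""
    if abs(index - coord) >= 1.0:
        return 0.0
    return 1.0 if index >= coord else -1.0

def grad_x(u, x_s, y_s):
    """Analytic dV/dx_s: the x kernel is replaced by its slope."""
    return sum(value * max(0.0, 1.0 - abs(y_s - n)) * kernel_slope(m, x_s)
               for n, row in enumerate(u) for m, value in enumerate(row))

def grad_y(u, x_s, y_s):
    """Analytic dV/dy_s: the y kernel is replaced by its slope."""
    return sum(value * max(0.0, 1.0 - abs(x_s - m)) * kernel_slope(n, y_s)
               for n, row in enumerate(u) for m, value in enumerate(row))

u = [[0.0, 10.0],
     [20.0, 30.0]]
x_s, y_s, eps = 0.3, 0.6, 1e-6

numeric_x = (bilinear_sample(u, x_s + eps, y_s)
             - bilinear_sample(u, x_s - eps, y_s)) / (2 * eps)
numeric_y = (bilinear_sample(u, x_s, y_s + eps)
             - bilinear_sample(u, x_s, y_s - eps)) / (2 * eps)

print(grad_x(u, x_s, y_s), numeric_x)  # both close to 10
print(grad_y(u, x_s, y_s), numeric_y)  # both close to 20
```

For this 2x2 input the interpolated value equals 10 x + 20 y inside the unit square, so the expected gradients are 10 and 20. At integer coordinates the kernel is not differentiable, and the case distinction simply picks one of the two one-sided slopes.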

Performance Compared to Standard CNNs

While providing state-of-the-art results, the Spatial-Transformer-CNN introduced by Jaderberg et al. is only 6% slower than the corresponding standard CNN.^[1]

The table in Image 6 [1] compares the results of different neural networks on the MNIST data set. The table distinguishes between fully connected networks (FCN) and convolutional neural networks (CNN). It further adds a spatial transformer module to each of the network types (ST-FCN and ST-CNN), with the transformation being either affine (Aff, $\begin{array}{l}\theta\end{array}$ containing 6 parameters), projective (Proj, 8 parameters) or a thin plate spline (TPS). The MNIST data set was distorted by rotation (R), by rotation, translation and scaling (RTS), by a projective transform (P) or by an elastic distortion (E). The results show that each spatial transformer network outperforms its standard counterpart, as the classification error is smaller in all cases.

Image 6: The table compares the classification error of different network models on several distorted versions of the MNIST data set. Networks that include a spatial transformer module outperform the classic neural networks. The images on the right show examples of the input to the spatial transformer (a), a visualization of the transformation (b) and the output after the transformation (c). While the left column uses a thin plate spline (TPS), the transformation on the right is affine. Image source [1].

Unsupervised Sub-Object Classification

Another important result from Jaderberg et al. is that they achieved a form of unsupervised, fine-grained classification. The presented experiments were done on the CUB-200-2011 bird data set, which contains images of 200 different bird species. These images not only show different species, but are also taken from different angles and points of view, at different scales and with individual background scenery, which gives an idea of how challenging this data set is.

Prior to the introduction of the spatial transformer module, Simon and Rodner also performed unsupervised fine-grained classification on this bird data set. By analyzing the constellation of part detectors that fire at approximately the same location relative to each other, they achieved a state-of-the-art classification accuracy of 81.0%^[3]. The table in Image 7 shows that Jaderberg et al. were able to improve this result by 1.3 percentage points using a CNN model with Inception architecture.^[1] Taking the latter network as a basis, they inserted either 2 or 4 spatial transformer modules into the network architecture, achieving even higher classification accuracy. The images next to the table show the region of interest on which each of the transformer modules focused. This shows that when spatial transformer modules are placed in parallel, each can learn a different part of an object. In this case, one of the transformer modules focused on the head and the other on the body of a bird. Other work on this data set with sub-object classification was done by Branson et al. While the latter explicitly defined parts of the bird and trained separate detectors on these parts, Jaderberg et al. achieved this separation in a completely unsupervised manner.

Image 7: The table shows the classification results of different network architectures on the CUB-200-2011 bird data set. The spatial transformer networks are able to outperform the other networks. The images next to the table show the behaviour of the spatial transformers in the network when 2 (upper row) or 4 (lower row) are placed in parallel inside the network. Interestingly, each of the spatial transformers learned to focus on either the head or the body of the bird, in a completely unsupervised manner. Image source [1].

Problems and Limitations

When using a spatial transformer, it is possible to downsample or oversample a feature map.^[1] Using a sampling kernel of fixed width, such as the bilinear kernel, can cause aliasing effects in the output feature map.
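
The aliasing effect can be demonstrated with a small constructed example. The sketch below downsamples a one-pixel stripe pattern by a factor of 2 with the bilinear kernel. The pattern, the scale factor and the function name `bilinear_sample` are assumptions made only for this illustration.

```python
def bilinear_sample(u, x_s, y_s):
    """Bilinear interpolation of u at the point (x_s, y_s), pixel coordinates."""
    return sum(value * max(0.0, 1.0 - abs(x_s - m)) * max(0.0, 1.0 - abs(y_s - n))
               for n, row in enumerate(u) for m, value in enumerate(row))

# Vertical stripes 0, 1, 0, 1, ... with a mean intensity of 0.5.
stripes = [[float(m % 2) for m in range(8)] for _ in range(2)]

# Downsampling by 2: output pixel i reads the input at x = 2 * i.
downsampled = [bilinear_sample(stripes, 2.0 * i, 0.0) for i in range(4)]
# The same downsampling after moving the grid by one input pixel.
shifted = [bilinear_sample(stripes, 2.0 * i + 1.0, 0.0) for i in range(4)]

print(downsampled)  # [0.0, 0.0, 0.0, 0.0]
print(shifted)      # [1.0, 1.0, 1.0, 1.0]
```

Because the kernel only covers the two nearest pixels in each direction, it cannot average over the stripes that fall between the sampling points. The output is uniformly black or uniformly white depending on a one-pixel shift, instead of the mean intensity of 0.5.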

As described in the previous section, spatial transformer modules can be used for fine-grained classification, which means that sub-parts of a class can also be detected by a neural network. While this is a very promising result, the number of objects an STN can model is limited to the number of parallel spatial transformers in the network.^[1]