Author: Michael Brenner

 

Introduction

Object detection is an important and complex task in computer vision. It is commonly approached with multi-stage pipelines, which are slow and inelegant. Detection is complex because it requires accurate localization of objects, which creates two challenges: first, numerous candidate object locations ("proposals") must be processed; second, the rough localization of the proposals must be refined to obtain a precise localization.

Fast R-CNN (Fast Region-based Convolutional Network) is a single-stage training algorithm that jointly classifies object proposals and refines their localization. (2)

 

Architecture

Figure 1. Architecture of Fast R-CNN. (1)

The input to Fast R-CNN is an image together with multiple regions of interest (RoIs). The network first applies several convolutional and max pooling layers to produce a feature map of the whole image. Typically about 2000 RoIs per image are generated by a proposal method such as Selective Search (4). An RoI pooling layer then extracts a fixed-length feature vector from the feature map for each region of interest. Each vector feeds into a sequence of fully connected (FC) layers, which produce two output vectors per RoI:

1. A probability estimate over the object classes, produced by a softmax function.

2. Per-class bounding-box regression offsets that refine the RoI window. Each RoI is itself defined by a four-tuple (r, c, h, w), where (r, c) specifies the top-left corner and (h, w) the height and width of the window. (2)(1)
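The RoI pooling step described above can be sketched in NumPy. The 7×7 output grid matches the VGG16 configuration from the paper; the function name, feature-map shapes and RoI coordinates below are illustrative, not part of the original implementation:

```python
import numpy as np

def roi_pool(feature_map, roi, out_h=7, out_w=7):
    """Max-pool one RoI of a feature map into a fixed out_h x out_w grid.

    feature_map: (C, H, W) array; roi: (r, c, h, w) in feature-map coordinates.
    """
    r, c, h, w = roi
    window = feature_map[:, r:r + h, c:c + w]
    C = window.shape[0]
    out = np.zeros((C, out_h, out_w))
    # Split the window into a roughly even out_h x out_w grid of sub-windows
    # and take the maximum activation in each cell.
    row_edges = np.linspace(0, h, out_h + 1).astype(int)
    col_edges = np.linspace(0, w, out_w + 1).astype(int)
    for i in range(out_h):
        for j in range(out_w):
            cell = window[:, row_edges[i]:max(row_edges[i + 1], row_edges[i] + 1),
                             col_edges[j]:max(col_edges[j + 1], col_edges[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))
    return out

features = np.random.rand(512, 38, 50)                # e.g. a conv feature map
vector = roi_pool(features, (4, 10, 20, 28)).ravel()  # fixed-length feature vector
print(vector.shape)  # (25088,) = 512 * 7 * 7
```

Whatever the size of the RoI window, the output is always the same fixed-length vector, which is what allows it to feed into the fully connected layers.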

 


Training

One big improvement in Fast R-CNN is that it takes advantage of feature sharing during training. During training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically: first N images are sampled, then R/N RoIs are sampled from each image. RoIs from the same image share computation and memory in the forward and backward passes, so choosing a small N decreases the computational cost of a mini-batch. Good results are achieved with N=2 and R=128 while using fewer SGD iterations than R-CNN. Fast R-CNN also uses a streamlined training process that jointly optimizes the softmax classifier and the bounding-box regressors.
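The hierarchical sampling scheme can be illustrated with a small sketch; the function name and the per-image RoI counts below are made up for illustration:

```python
import random

def sample_minibatch(num_rois_per_image, N=2, R=128):
    """Sample N images, then R/N RoIs from each, as in Fast R-CNN's SGD mini-batches."""
    images = random.sample(range(len(num_rois_per_image)), N)
    rois_per_image = R // N
    batch = []
    for img in images:
        rois = random.sample(range(num_rois_per_image[img]), rois_per_image)
        batch.extend((img, roi) for roi in rois)
    return batch

random.seed(0)
dataset = [2000] * 10              # e.g. ~2000 Selective Search proposals per image
batch = sample_minibatch(dataset)  # 128 RoIs drawn from only 2 images
print(len(batch))  # 128
```

Because all 64 RoIs from the same image share one convolutional forward pass over that image, this is far cheaper than drawing 128 RoIs from 128 different images.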

Multi-task loss

The network has two outputs. The first is a probability distribution (for each RoI), p=(p_0,...,p_K), over K+1 classes, computed by a softmax classifier. The second output is a bounding-box regression, t=(t_x, t_y, t_w, t_h), for each of the K object classes. Each training RoI is labelled with a ground-truth class u and a bounding-box regression target v. A multi-task loss L is used to jointly train for the classification and the bounding-box regression:

L(p, u, t^u , v) = L_{cls}(p, u) + λ[u ≥ 1]L_{loc} (t^u , v),

in which L_{cls}(p,u) = -log(p_u) is the log loss for the true class u. The second task loss, L_{loc}, is defined over a tuple of the true bounding-box regression targets v = (v_x, v_y, v_w, v_h) for class u and a predicted tuple t^u = (t^u_x, t^u_y, t^u_w, t^u_h), again for class u. It is defined as L_{loc}(t^u, v) = \sum_{i \in \{x,y,w,h\}} smooth_{L1}(t_i^u - v_i), where smooth_{L1}(x) = 0.5x^2 if |x| < 1 and |x| - 0.5 otherwise. The Iverson bracket [u ≥ 1] disables the localization loss for background RoIs (u = 0), and the hyper-parameter λ balances the two task losses. (2)
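A minimal NumPy sketch of the multi-task loss defined above; the probabilities, targets and λ = 1 below are illustrative inputs, not values from the paper:

```python
import numpy as np

def smooth_l1(x):
    """smooth_L1(x) = 0.5 x^2 if |x| < 1, else |x| - 0.5."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x**2, np.abs(x) - 0.5)

def multitask_loss(p, u, t_u, v, lam=1.0):
    """L = L_cls(p, u) + lam * [u >= 1] * L_loc(t^u, v); background class is u = 0."""
    l_cls = -np.log(p[u])                                    # log loss for true class u
    l_loc = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum() # summed over x, y, w, h
    return l_cls + lam * (u >= 1) * l_loc

p = np.array([0.1, 0.7, 0.2])   # softmax output over K + 1 = 3 classes
loss = multitask_loss(p, u=1, t_u=[0.1, -0.3, 2.0, 0.05], v=[0.0, 0.0, 0.0, 0.0])
```

Note how the large residual 2.0 contributes only linearly (|x| - 0.5), which is what makes smooth L1 less sensitive to outliers than the L2 loss used in R-CNN.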

Back-Propagation through RoI pooling layers

Back-propagation routes derivatives through the RoI pooling layer. The backward function of the RoI pooling layer computes the partial derivative of the loss function with respect to each input variable x_i by following the argmax switches:

\frac{\partial L}{\partial x_i} = \sum_{r} \sum_{j} [i = i^*(r,j)] \frac{\partial L}{\partial y_{rj}}

The argmax function is defined as i^*(r,j) = argmax_{i' \in R(r,j)} x_{i'}, where R(r,j) is the index set of inputs in the sub-window over which the output unit y_{rj} max pools. For each mini-batch RoI r and each pooling output unit y_{rj}, the partial derivative \partial L / \partial y_{rj} is accumulated into \partial L / \partial x_i if i is the argmax selected for y_{rj} by max pooling. (2)
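The gradient routing above can be sketched directly: each output unit passes its gradient back only to the input that won its max, and gradients accumulate when an input is selected by several RoIs. The function name and toy values are illustrative:

```python
import numpy as np

def roi_pool_backward(x, argmax_index, dL_dy):
    """Route gradients of pooled outputs back to the inputs that won the max.

    x: flat input activations; argmax_index[r, j] = i*(r, j), the input index
    selected for output unit y_{rj}; dL_dy[r, j] = dL/dy_{rj}.
    """
    dL_dx = np.zeros_like(x, dtype=float)
    R, J = argmax_index.shape
    for r in range(R):          # over mini-batch RoIs
        for j in range(J):      # over pooling output units
            # The Iverson bracket [i = i*(r, j)]: only the winning input
            # receives (and accumulates) this output's gradient.
            dL_dx[argmax_index[r, j]] += dL_dy[r, j]
    return dL_dx

x = np.array([0.2, 0.9, 0.4, 0.7])
argmax_index = np.array([[1, 3], [1, 2]])   # two RoIs, two output units each
dL_dy = np.array([[1.0, 0.5], [2.0, 0.25]])
grad = roi_pool_backward(x, argmax_index, dL_dy)
```

Input 1 was the argmax for units in both RoIs, so its gradient is the sum 1.0 + 2.0 = 3.0, while input 0 received no gradient at all.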

SGD hyper-parameters

The fully connected layers for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions; biases are initialized to 0. All layers use a per-layer learning rate of 1 for weights and 2 for biases, with a global learning rate of 0.001. (2)
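This initialization can be sketched as follows. The standard deviations 0.01 (classifier) and 0.001 (regressor) are the values reported in the paper (2); the layer shapes assume a VGG16-style 4096-dimensional FC output with K = 20 object classes and are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Softmax classifier head: zero-mean Gaussian, std 0.01; K + 1 = 21 classes.
W_cls = rng.normal(0.0, 0.01, size=(4096, 21))
b_cls = np.zeros(21)                               # biases start at 0

# Bounding-box regressor head: zero-mean Gaussian, std 0.001; 4 offsets per class.
W_bbox = rng.normal(0.0, 0.001, size=(4096, 84))
b_bbox = np.zeros(84)

# Per-layer learning rate multipliers applied to the global rate of 0.001.
base_lr = 0.001
lr_weights = 1 * base_lr   # multiplier 1 for weights
lr_biases = 2 * base_lr    # multiplier 2 for biases
```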

 

Conclusion

Results

Fast R-CNN achieves top results on the PASCAL Visual Object Classes (VOC) challenges of 2007, 2010 and 2012. Table 1 compares three object detectors, each based on the 16-layer VGG16 network. It shows that Fast R-CNN is faster to train, faster to test, and achieves higher accuracy. These results represent a big step toward real-time object detection.

                  Fast R-CNN   R-CNN (1)   SPP-net (3)
Train time (h)    9.5          84          25
Train speedup     8.8x         1x          3.4x
Test time/image   0.32s        47.0s       2.3s
Test speedup      146x         1x          20x
mAP               66.9%        66.0%       63.1%

Table 1. Timings exclude object proposal time, which is equal for all methods. All methods use VGG16 from Simonyan and Zisserman. (2)(3)(1)
 

Advantages

Fast R-CNN overcomes many disadvantages of earlier methods and improves on them in both speed and accuracy. The method has several advantages:

1. Higher detection quality (mAP) than R-CNN (1), SPPnet (3)
2. Training is single-stage, using a multi-task loss
3. Training can update all network layers
4. No disk storage is required for feature caching (2)

Literature

1. Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014. 

2. Girshick, Ross. "Fast r-cnn." Proceedings of the IEEE International Conference on Computer Vision. 2015.

3. He, Kaiming, et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." European Conference on Computer Vision. 2014.

4. Uijlings, Jasper RR, et al. "Selective search for object recognition." International journal of computer vision 104.2 (2013).
