Author: Michael Brenner

 

Introduction

Object detection is an important and complex task in computer vision. It is commonly approached with multi-stage pipelines, which are slow and inelegant. Detection is complex because it requires accurate localization of objects, which creates two challenges: first, numerous candidate object locations ("proposals") must be processed; second, the rough localization of the proposals must be refined to obtain a precise localization.

Fast R-CNN (Fast Region-based Convolutional Network) is a single-stage training algorithm that jointly classifies object proposals and refines their localization. (2)

 

Architecture

Figure 1. Architecture of Fast R-CNN. (1)

The input to Fast R-CNN is an image together with multiple regions of interest (RoIs). The network first applies several convolutional and max pooling layers to produce a feature map of the whole image. Typically about 2000 RoIs per image are generated by a proposal method such as Selective Search (4). An RoI pooling layer then extracts a fixed-length feature vector from the feature map for each region of interest. Each vector feeds into a sequence of fully connected (FC) layers, which produce two output vectors per RoI:

1. A probability estimate over the object classes, produced by a softmax function.

2. Per-class bounding-box regression offsets that refine the RoI window. Each RoI is itself defined by a four-tuple (r, c, h, w), where (r, c) specifies the top-left corner and (h, w) the height and width of the window. (2)(1)
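The RoI pooling step described above can be sketched in NumPy. The 7×7 output grid matches the VGG16 configuration from the paper; the function name, feature-map shapes and RoI coordinates below are illustrative, not part of the original implementation:

```python
import numpy as np

def roi_pool(feature_map, roi, out_h=7, out_w=7):
    """Max-pool one RoI of a feature map into a fixed out_h x out_w grid.

    feature_map: (C, H, W) array; roi: (r, c, h, w) in feature-map coordinates.
    """
    r, c, h, w = roi
    window = feature_map[:, r:r + h, c:c + w]
    C = window.shape[0]
    out = np.zeros((C, out_h, out_w))
    # Split the window into a roughly even out_h x out_w grid of sub-windows
    # and take the maximum activation in each cell.
    row_edges = np.linspace(0, h, out_h + 1).astype(int)
    col_edges = np.linspace(0, w, out_w + 1).astype(int)
    for i in range(out_h):
        for j in range(out_w):
            cell = window[:, row_edges[i]:max(row_edges[i + 1], row_edges[i] + 1),
                             col_edges[j]:max(col_edges[j + 1], col_edges[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))
    return out

features = np.random.rand(512, 38, 50)                # e.g. a conv feature map
vector = roi_pool(features, (4, 10, 20, 28)).ravel()  # fixed-length feature vector
print(vector.shape)  # (25088,) = 512 * 7 * 7
```

Whatever the size of the RoI window, the output is always the same fixed-length vector, which is what allows it to feed into the fully connected layers.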

 


Training

One big improvement in Fast R-CNN is that it takes advantage of feature sharing during training. During training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically: first N images are sampled, then R/N RoIs are sampled from each image. RoIs from the same image share computation and memory in the forward and backward passes, so choosing a small N decreases the computational cost of a mini-batch. Good results are achieved with N=2 and R=128 while using fewer SGD iterations than R-CNN. Fast R-CNN also uses a streamlined training process that jointly optimizes the softmax classifier and the bounding-box regressors.
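The hierarchical sampling scheme can be illustrated with a small sketch; the function name and the per-image RoI counts below are made up for illustration:

```python
import random

def sample_minibatch(num_rois_per_image, N=2, R=128):
    """Sample N images, then R/N RoIs from each, as in Fast R-CNN's SGD mini-batches."""
    images = random.sample(range(len(num_rois_per_image)), N)
    rois_per_image = R // N
    batch = []
    for img in images:
        rois = random.sample(range(num_rois_per_image[img]), rois_per_image)
        batch.extend((img, roi) for roi in rois)
    return batch

random.seed(0)
dataset = [2000] * 10              # e.g. ~2000 Selective Search proposals per image
batch = sample_minibatch(dataset)  # 128 RoIs drawn from only 2 images
print(len(batch))  # 128
```

Because all 64 RoIs from the same image share one convolutional forward pass over that image, this is far cheaper than drawing 128 RoIs from 128 different images.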

Multi-task loss

The network has two outputs. The first is a probability distribution (for each RoI), p=(p_0,...,p_K), over K+1 classes, computed by a softmax classifier. The second output is a bounding-box regression, t=(t_x, t_y, t_w, t_h), for each of the K object classes. Each training RoI is labelled with a ground-truth class u and a bounding-box regression target v. A multi-task loss L is used to jointly train for the classification and the bounding-box regression:

L(p, u, t^u , v) = L_{cls}(p, u) + λ[u ≥ 1]L_{loc} (t^u , v),

in which L_{cls}(p,u) = -log(p_u) is the log loss for the true class u. The second task loss, L_{loc}, is defined over a tuple of the true bounding-box regression targets v = (v_x, v_y, v_w, v_h) for class u and a predicted tuple t^u = (t^u_x, t^u_y, t^u_w, t^u_h), again for class u. It is defined as L_{loc}(t^u, v) = \sum_{i \in \{x,y,w,h\}} smooth_{L1}(t_i^u - v_i), where smooth_{L1}(x) = 0.5x^2 if |x| < 1 and |x| - 0.5 otherwise. The Iverson bracket [u ≥ 1] disables the localization loss for background RoIs (u = 0), and the hyper-parameter λ balances the two task losses. (2)
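A minimal NumPy sketch of the multi-task loss defined above; the probabilities, targets and λ = 1 below are illustrative inputs, not values from the paper:

```python
import numpy as np

def smooth_l1(x):
    """smooth_L1(x) = 0.5 x^2 if |x| < 1, else |x| - 0.5."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x**2, np.abs(x) - 0.5)

def multitask_loss(p, u, t_u, v, lam=1.0):
    """L = L_cls(p, u) + lam * [u >= 1] * L_loc(t^u, v); background class is u = 0."""
    l_cls = -np.log(p[u])                                    # log loss for true class u
    l_loc = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum() # summed over x, y, w, h
    return l_cls + lam * (u >= 1) * l_loc

p = np.array([0.1, 0.7, 0.2])   # softmax output over K + 1 = 3 classes
loss = multitask_loss(p, u=1, t_u=[0.1, -0.3, 2.0, 0.05], v=[0.0, 0.0, 0.0, 0.0])
```

Note how the large residual 2.0 contributes only linearly (|x| - 0.5), which is what makes smooth L1 less sensitive to outliers than the L2 loss used in R-CNN.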

Back-Propagation through RoI pooling layers

Back-propagation routes derivatives through the RoI pooling layer. The backward function of the RoI pooling layer computes the partial derivative of the loss function with respect to each input variable x_i by following the argmax switches:

\frac{\partial L}{\partial x_i} = \sum_{r} \sum_{j} [i = i^*(r,j)] \frac{\partial L}{\partial y_{rj}}

The argmax function is defined as i^*(r,j) = argmax_{i' \in R(r,j)} x_{i'}, where R(r,j) is the index set of inputs in the sub-window over which the output unit y_{rj} max pools. For each mini-batch RoI r and each pooling output unit y_{rj}, the partial derivative \partial L / \partial y_{rj} is accumulated into \partial L / \partial x_i if i is the argmax selected for y_{rj} by max pooling. (2)
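The gradient routing above can be sketched directly: each output unit passes its gradient back only to the input that won its max, and gradients accumulate when an input is selected by several RoIs. The function name and toy values are illustrative:

```python
import numpy as np

def roi_pool_backward(x, argmax_index, dL_dy):
    """Route gradients of pooled outputs back to the inputs that won the max.

    x: flat input activations; argmax_index[r, j] = i*(r, j), the input index
    selected for output unit y_{rj}; dL_dy[r, j] = dL/dy_{rj}.
    """
    dL_dx = np.zeros_like(x, dtype=float)
    R, J = argmax_index.shape
    for r in range(R):          # over mini-batch RoIs
        for j in range(J):      # over pooling output units
            # The Iverson bracket [i = i*(r, j)]: only the winning input
            # receives (and accumulates) this output's gradient.
            dL_dx[argmax_index[r, j]] += dL_dy[r, j]
    return dL_dx

x = np.array([0.2, 0.9, 0.4, 0.7])
argmax_index = np.array([[1, 3], [1, 2]])   # two RoIs, two output units each
dL_dy = np.array([[1.0, 0.5], [2.0, 0.25]])
grad = roi_pool_backward(x, argmax_index, dL_dy)
```

Input 1 was the argmax for units in both RoIs, so its gradient is the sum 1.0 + 2.0 = 3.0, while input 0 received no gradient at all.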

SGD hyper-parameters

The fully connected layers for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions; biases are initialized to 0. All layers use a per-layer learning rate of 1 for weights and 2 for biases, with a global learning rate of 0.001. (2)
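This initialization can be sketched as follows. The standard deviations 0.01 (classifier) and 0.001 (regressor) are the values reported in the paper (2); the layer shapes assume a VGG16-style 4096-dimensional FC output with K = 20 object classes and are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Softmax classifier head: zero-mean Gaussian, std 0.01; K + 1 = 21 classes.
W_cls = rng.normal(0.0, 0.01, size=(4096, 21))
b_cls = np.zeros(21)                               # biases start at 0

# Bounding-box regressor head: zero-mean Gaussian, std 0.001; 4 offsets per class.
W_bbox = rng.normal(0.0, 0.001, size=(4096, 84))
b_bbox = np.zeros(84)

# Per-layer learning rate multipliers applied to the global rate of 0.001.
base_lr = 0.001
lr_weights = 1 * base_lr   # multiplier 1 for weights
lr_biases = 2 * base_lr    # multiplier 2 for biases
```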

 

Conclusion

Results

Fast R-CNN achieves top results on the PASCAL Visual Object Classes (VOC) challenges of 2007, 2010 and 2012. Table 1 compares three object detectors, each based on the 16-layer VGG16 network. It shows that Fast R-CNN is faster to train, faster to test, and achieves higher accuracy. These results represent a big step toward real-time object detection.

                  Fast R-CNN   R-CNN (1)   SPP-net (3)
Train time (h)    9.5          84          25
Train speedup     8.8x         1x          3.4x
Test time/image   0.32s        47.0s       2.3s
Test speedup      146x         1x          20x
mAP               66.9%        66.0%       63.1%

Table 1. Timings exclude object proposal time, which is equal for all methods. All methods use VGG16 from Simonyan and Zisserman. (2)(3)(1)
 

Advantages

Fast R-CNN overcomes many disadvantages of earlier methods and improves on them in both speed and accuracy. The method has several advantages:

1. Higher detection quality (mAP) than R-CNN (1), SPPnet (3)
2. Training is single-stage, using a multi-task loss
3. Training can update all network layers
4. No disk storage is required for feature caching (2)

Literature

1. Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014. 

2. Girshick, Ross. "Fast r-cnn." Proceedings of the IEEE International Conference on Computer Vision. 2015.

3. He, Kaiming, et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." European Conference on Computer Vision. 2014.

4. Uijlings, Jasper RR, et al. "Selective search for object recognition." International journal of computer vision 104.2 (2013).
