Introduction

Over the past decade, the machine learning and computer vision communities have come a long way. As computer vision technology continues to mature and increasingly complex neural network models emerge, people are no longer satisfied with simply detecting and recognizing objects in images; instead, they look for a higher level of understanding of and reasoning about visual scenes. A scene graph is a structured, graphical representation of a scene (from an image) that describes the entities in the image along with the relations between these entities.


Scene Graph

Fig 1. Image to Scene Graph [Cong et al, 2022]

Motivation

Most existing methods for scene graph generation rely on two-stage detectors. In the first stage, an object detector predicts the objects in the scene. In the second stage, a relationship inference network takes these object proposals and labels the edges between them. The problem with these two-stage detectors is that, since they consist of two separate models, they have a large number of parameters to train and are therefore slow. Moreover, given N object proposals, the relationship inference network runs the risk of learning from erroneous features provided by the detection backbone and has to score O(N^2) possible relationships, which leads to slow inference.
Recently, one-stage models such as DETR (Detection Transformer) have emerged in object detection, where detection is cast as an end-to-end set prediction problem trained with a set-based loss via bipartite matching. The goal of this paper is to extend the same strategy to scene graph generation, and the authors therefore introduce a novel one-stage end-to-end framework for scene graph generation: the Relation Transformer (RelTR).

Contribution

  • Instead of classifying the dense relationships between all entity proposals (as most advanced approaches do), the Relation Transformer generates a sparse scene graph by decoding the visual appearance with subject and object queries learned from the data.
  • Since the Relation Transformer is a one-stage detector, it has fewer parameters and faster inference than comparable models. It also achieves higher accuracy than other one-stage models in the field, such as FCSGG (Fully Convolutional Scene Graph Generation).


Methodology

A scene graph G consists of entity vertices V and relationship edges E. The goal of the Relation Transformer (RelTR) is to directly predict a fixed-size set of N_t triplets <V_{sub} - E_{prd} - V_{obj}> for scene graph generation, without inferring the possible predicates between all entity pairs. The model has an encoder-decoder architecture: a feature encoder that extracts the visual feature context, an entity decoder that captures the entity representations (as in DETR), and a triplet decoder with subject and object branches. The triplet decoder combines three attention functions, namely coupled self-attention (CSA), decoupled visual attention (DVA) and decoupled entity attention (DEA).

Model Architecture (RelTR)

Fig 2. RelTR Model Architecture [Cong et al, 2022]

Brief Introduction to Attention

The input of single-head attention consists of queries Q, keys K and values V, and the output is computed as:

Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V

where d_k is the dimension of the keys K.

To benefit from information in different representation sub-spaces, the Transformer applies multi-head attention. A complete attention function is a multi-head attention followed by a normalization layer with a residual connection, and is denoted Att(.) in this paper for simplicity.
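
For reference, the following is a minimal PyTorch sketch of single-head scaled dot-product attention and of the Att(.) block described above (multi-head attention followed by a residual connection and layer normalization). The class name AttBlock, the default sizes and the explicit residual argument are illustrative choices rather than the authors' code; the residual placement follows the usual DETR-style convention of adding the attention output back to the stream being updated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v):
    """Single-head attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, len_q, len_k)
    weights = F.softmax(scores, dim=-1)
    return weights @ v                              # (batch, len_q, d_k)


class AttBlock(nn.Module):
    """Att(.): multi-head attention followed by a residual connection and layer norm.
    `residual` is the stream being updated (e.g. the queries without their added encodings)."""

    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, q, k, v, residual):
        out, _ = self.mha(q, k, v)          # multi-head attention
        return self.norm(residual + out)    # residual connection + normalization
```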

Attention and Multihead Attention

Fig 3. Attention and Multi-head Attention [Vaswani et al, 2017]

Subject and Object Queries

There are two types of embeddings: subject queries Q_s \in R^{N_t \times d} and object queries Q_o \in R^{N_t \times d} for the subject branch and the object branch, respectively. Each subject query and each object query is responsible for detecting one subject and one object in the image. These subject and object queries are not paired on their own, because the attention layers in the triplet decoder are permutation invariant. Hence, to distinguish between different triplets, learnable triplet encodings E_t \in R^{N_t \times d} are introduced.
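
As a rough sketch (not the authors' implementation), such learnable queries and encodings could be declared in PyTorch as follows; the values of N_t and d are illustrative.

```python
import torch.nn as nn

N_t, d = 200, 256  # number of triplet slots and hidden dimension (illustrative values)

# Learnable subject/object queries and the triplet encodings shared by both branches.
subject_queries  = nn.Embedding(N_t, d)   # Q_s
object_queries   = nn.Embedding(N_t, d)   # Q_o
triplet_encoding = nn.Embedding(N_t, d)   # E_t: ties the i-th subject and object query together
```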

Coupled Self-Attention (CSA)

CSA captures the context among the N_t triplets and the dependencies between all subjects and objects. Although the triplet encodings E_t are already available, subject encodings E_s and object encodings E_o of the same size as E_t are still needed to inject the semantic concepts of <subject> and <object> into coupled self-attention.

Q = K = [Q_s + E_s + E_t , Q_o + E_o + E_t] \newline [Q_s , Q_o] = Att_{CSA}(Q , K , [Q_s , Q_o])

The main intuition behind CSA is to let the model combine all the relevant information from the subject and object queries together with their encodings.
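
A rough sketch of the CSA update using the AttBlock defined earlier, under the assumption that subject and object queries are simply concatenated along the sequence dimension as the equation suggests:

```python
import torch

def coupled_self_attention(att_csa, Q_s, Q_o, E_s, E_o, E_t):
    """CSA update; att_csa is an AttBlock instance. All inputs: (batch, N_t, d)."""
    q = k = torch.cat([Q_s + E_s + E_t, Q_o + E_o + E_t], dim=1)   # (batch, 2*N_t, d)
    v = torch.cat([Q_s, Q_o], dim=1)
    out = att_csa(q, k, v, residual=v)                             # Att_CSA(Q, K, [Q_s, Q_o])
    N_t = Q_s.size(1)
    return out[:, :N_t], out[:, N_t:]                              # updated Q_s, Q_o
```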

Decoupled Visual Attention (DVA)

DVA concentrates on extracting visual features from the feature context Z produced by the feature encoder. "Decoupled" means that the computations of the subject and object embeddings are independent of each other, in contrast to CSA. In the subject branch, the subject embeddings Q_s are updated through their interaction with the feature context Z (again combined with the positional encodings E_p). The same set of operations is repeated for the object branch.

Q = Q_s + E_t , K = Z + E_p \newline Q_s = Att^{sub}_{DVA}(Q , K , Z)

The main intuition behind DVA is to associate the different subjects and objects with the image features.
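
A corresponding sketch of the subject-branch DVA update, reusing the AttBlock from above (Z is the flattened feature context, E_p its positional encoding; the residual placement is again an assumption):

```python
def decoupled_visual_attention_sub(att_dva_sub, Q_s, E_t, Z, E_p):
    """Subject-branch DVA; Q_s, E_t: (batch, N_t, d), Z, E_p: (batch, H*W, d)."""
    q = Q_s + E_t          # subject queries with triplet encodings
    k = Z + E_p            # feature context with positional encodings
    return att_dva_sub(q, k, Z, residual=Q_s)   # updated subject embeddings
```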

Decoupled Entity Attention (DEA)

DEA acts as a bridge between entity detection and triplet detection. The motivation for introducing DEA is that the subject and object embeddings should learn more accurate localization and classification information from the entity representations through the attention mechanism.

Q_s = Att^{sub}_{DEA}(Q_s + E_t , Q_e , Q_e) \newline Q_o = Att^{obj}_{DEA}(Q_o + E_t , Q_e , Q_e)

The main intuition behind DEA is to link the subjects and objects with the detected entities and their locations in the image.
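
And a sketch of DEA for both branches, where Q_e denotes the entity representations produced by the entity decoder (again reusing AttBlock; the residual placement is assumed as before):

```python
def decoupled_entity_attention(att_dea_sub, att_dea_obj, Q_s, Q_o, E_t, Q_e):
    """DEA for both branches; Q_e: (batch, N_e, d) entity representations from the entity decoder."""
    Q_s = att_dea_sub(Q_s + E_t, Q_e, Q_e, residual=Q_s)
    Q_o = att_dea_obj(Q_o + E_t, Q_e, Q_e, residual=Q_o)
    return Q_s, Q_o
```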

Feed Forward Network (FFN)

The outputs of DEA are processed by a feed-forward network followed by a normalization layer with a residual connection. Two independent feed-forward networks with the same structure are then used to predict the height, width and normalized center coordinates of the subject and object boxes.

The attention heat maps of the subjects and objects from the DVA module in the last decoder layer are concatenated and resized to 2x28x28. These concatenated heat maps are converted into spatial feature vectors by a convolutional neural network. The final predicate labels are then predicted by a two-layer perceptron from the subject representations, object representations and spatial feature vectors.
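
A sketch of what these prediction heads could look like; the layer sizes, the spatial feature dimension and the number of predicate classes (50 plus one "no relation" class) are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn


class BoxHead(nn.Module):
    """Feed-forward network predicting normalized (cx, cy, w, h) boxes.
    The 3-layer structure follows the DETR convention (an assumption here)."""

    def __init__(self, d_model=256, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, x):
        return self.mlp(x).sigmoid()   # box coordinates normalized to [0, 1]


class PredicateHead(nn.Module):
    """Two-layer perceptron over [subject repr., object repr., spatial features]."""

    def __init__(self, d_model=256, d_spatial=128, num_predicates=51):
        super().__init__()
        # d_spatial and num_predicates (50 predicate classes + 1 "no relation") are assumptions.
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model + d_spatial, d_model), nn.ReLU(),
            nn.Linear(d_model, num_predicates),
        )

    def forward(self, q_sub, q_obj, spatial):
        return self.mlp(torch.cat([q_sub, q_obj, spatial], dim=-1))  # predicate logits


subject_box_head = BoxHead()   # two independent heads with the same structure
object_box_head = BoxHead()
```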

FFN and CNN

Fig 4. MLP for BBox Regression and Converting Heatmaps to spatial Features [Cong et al, 2022]

Set Prediction Loss for Triplet Detection

The Triplet Prediction is represented as <\hat{y}_{sub} , \hat{c}_{pred} , \hat{y}_{obj}>:

  • \hat{y}_{sub} = <\hat{c}_{sub} , \hat{b}_{sub}>
  • \hat{y}_{obj} = <\hat{c}_{obj} , \hat{b}_{obj}>

where \hat{c}_x denotes the predicted class of entity x and \hat{b}_x the predicted bounding box of entity x.


The ground truths are represented as <y_{sub} , c_{pred} , y_{obj}>


The pair-wise matching cost c_{tri} between a predicted triplet and a ground truth triplet consists of:

  • Subject \: Cost = c_{m}(\hat{y}_{sub}, y_{sub})
  • Object \: Cost = c_{m}(\hat{y}_{obj}, y_{obj})
  • Predicate \: Cost = c_{m}(\hat{c}_{pred}, c_{pred})


Note that for subjects and objects we have both bounding boxes and class labels, whereas for predicates we only have class labels.

Focal loss for Classification

The authors use focal loss for the classification of subjects, objects and predicates. Focal loss has the inherent property of mitigating the foreground-background class imbalance problem that is very common in one-stage detectors.

Class Imbalance Problem

Fig 5. Foreground-Background Class Imbalance [medium.com]

The class cost function is formulated as:

c^{+}_{cls} = \alpha \cdot (1 - \hat{p}(c))^{\gamma} \cdot (-\log{\hat{p}(c)} + \epsilon) \newline c^{-}_{cls} = (1 - \alpha) \cdot (\hat{p}(c))^{\gamma} \cdot (-\log{(1 - \hat{p}(c))} + \epsilon) \newline c_{cls}(\hat{c},c) = c^{+}_{cls} + c^{-}_{cls}

For further information on focal loss : https://medium.com/analytics-vidhya/how-focal-loss-fixes-the-class-imbalance-problem-in-object-detection-3d2e1c4da8d7
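
A small sketch of this classification cost; the defaults alpha = 0.25 and gamma = 2 follow the original focal loss paper, and placing epsilon inside the logarithm for numerical stability is an assumption.

```python
import torch

def focal_class_cost(p_hat, alpha=0.25, gamma=2.0, eps=1e-8):
    """Focal-loss-style classification cost for a matrix of predicted probabilities,
    where p_hat[i, j] is the probability that prediction i has the class of ground truth j."""
    pos = alpha * (1 - p_hat) ** gamma * (-torch.log(p_hat + eps))         # c+_cls
    neg = (1 - alpha) * p_hat ** gamma * (-torch.log(1 - p_hat + eps))     # c-_cls
    return pos + neg                                                       # c_cls
```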


The Bounding Box cost function for subjects and objects is given by:

c_{box}(\hat{b},b) = 5L_1(\hat{b},b) + 2L_{GIOU}(\hat{b}, b)

Note that instead of the standard IoU, the generalized IoU (GIoU) is used here together with an L1 loss.

IOU and GIOU

Fig 6. Generalized Intersection over Union [medium.com]

For further information on GIOU: https://medium.com/analytics-vidhya/different-iou-losses-for-faster-and-accurate-object-detection-3345781e0bf
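
A sketch of this box cost using torchvision's GIoU implementation, with L_GIoU taken as 1 - GIoU (an assumption consistent with the usual GIoU loss).

```python
import torch
from torchvision.ops import generalized_box_iou

def box_cost(pred_boxes, gt_boxes):
    """Pairwise box cost c_box = 5 * L1 + 2 * L_GIoU, with L_GIoU = 1 - GIoU.
    Boxes are assumed to be in (x1, y1, x2, y2) format; DETR-style models predict
    normalized (cx, cy, w, h), so a conversion would be needed first."""
    l1 = torch.cdist(pred_boxes, gt_boxes, p=1)        # (num_pred, num_gt)
    giou = generalized_box_iou(pred_boxes, gt_boxes)   # (num_pred, num_gt)
    return 5 * l1 + 2 * (1 - giou)
```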


Having defined the classification cost and the bounding box cost, the total matching cost for a single prediction can be expressed as:

c_m(\hat{y},y) = c_{cls}(\hat{c},c) + \mathbb{1}_{b \in y}c_{box}(\hat{b},b)

The triplet matching cost is then given as:

c_{tri} = c_{m}(\hat{y}_{sub},y_{sub}) + c_{m}(\hat{y}_{obj},y_{obj}) + c_{m}(\hat{c}_{pred},c_{pred})

Finally, the Hungarian algorithm is executed for the bipartite matching and each ground truth triplet is assigned to a prediction.
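
A minimal sketch of this assignment step using SciPy's Hungarian solver; the cost matrix would be filled with the c_tri values, and the toy numbers are only for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_triplets(cost_matrix):
    """Hungarian matching on a (num_predictions, num_ground_truths) matrix of c_tri values.
    Returns paired indices assigning each ground truth triplet to one prediction."""
    pred_idx, gt_idx = linear_sum_assignment(cost_matrix)
    return pred_idx, gt_idx

# Toy example: 5 predicted triplets, 3 ground-truth triplets.
cost = np.random.rand(5, 3)
print(match_triplets(cost))
```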


About the Dataset

Visual Genome:

  • The Visual Genome dataset is a collection of dense annotations of objects (e.g. "giraffe"), attributes (e.g. "yellow and brown spotted with long neck") and relationships (e.g. "giraffe eats leaf") within each image.
  • The dataset contains over 108K images, where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects.
  • For scene graph generation, 150 subject/object categories and 50 predicate categories are used.

Visual Genome Example

Fig 7. Visual Genome [Krishna et al, 2016]

Further information related to visual genome dataset can be found on: https://visualgenome.org/

Open Images V6:

Open Images V6 is a dataset released by Google containing over 9 million images, with labels spanning various tasks such as image-level labels, object bounding boxes, visual relationships, instance segmentation masks and localized narratives. These annotations were generated through a combination of machine learning algorithms and subsequent human verification on the test and validation splits and on subsets of the training splits.

Open Images V6

Fig 8. Open Images Dataset [Kuznetsova et al, 2020]

Experiments and Results


There are three standard evaluation settings:

  • Predicate classification (PredCLS): predict predicates given ground truth categories and bounding boxes of entities.
  • Scene graph classification (SGCLS): predict predicates and entity categories given ground truth boxes.
  • Scene graph detection (SGDET): predict categories, bounding boxes of entities and predicates.

The metrics used in these evaluation settings are Recall@k (R@k) and mean Recall@k (mR@k).
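
For intuition, a simplified sketch of R@k for a single image, where triplets are compared as (subject, predicate, object) label tuples; the real evaluation additionally requires the predicted boxes to overlap the ground truth (IoU >= 0.5), which is omitted here.

```python
def recall_at_k(predicted_triplets, gt_triplets, k=50):
    """predicted_triplets: list of (subject, predicate, object) tuples sorted by confidence.
    gt_triplets: set of ground-truth (subject, predicate, object) tuples."""
    top_k = set(predicted_triplets[:k])
    hits = sum(1 for t in gt_triplets if t in top_k)
    return hits / max(len(gt_triplets), 1)

preds = [("man", "riding", "horse"), ("man", "wearing", "hat"), ("horse", "on", "grass")]
gts = {("man", "riding", "horse"), ("horse", "on", "grass"), ("man", "has", "hair")}
print(recall_at_k(preds))  # 2 of 3 ground-truth triplets recovered -> 0.67
```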

SOTA Results for Visual Genome

Fig 9. Visual Genome Comparison [Cong et al, 2022]


For Open Images V6, apart from mean Recall, the weighted mean average precision of relationship detection (wmAP_{rel}) and of phrase detection (wmAP_{phr}) are calculated. The final score is then computed as: score_{wtd} = 0.2 \cdot R@50 + 0.4 \cdot wmAP_{rel} + 0.4 \cdot wmAP_{phr}

The authors also compare the frames per second (FPS) of different models and show that, although their model is not the best in terms of R@50 or score_{wtd}, it easily surpasses the other models in speed.

Open Images V6 Comparison

Fig 10. Open Images V6 Comparison [Cong et al, 2022]

Ablation Study

The authors conduct several experiments to see how important the different layers are for the network:

  1. Varying the number of encoder layers while keeping the number of triplet decoder layers fixed.
  2. Fixing the number of encoder layers while varying the number of triplet decoder layers.

Encoder and Triplet Decoder Ablation

Fig 11. Encoder and Triplet Decoder Ablation Study [Cong et al, 2022]


The authors also ablate the decoupled entity attention (DEA) and the mask head for attention heat maps from the framework:

DEA Ablation Study

Fig 12. DEA Ablation Study [Cong et al, 2022]

Personal Review of Paper

In this paper, inspired by DETR, which excels in object detection, the authors introduce an interesting model, the Relation Transformer, that aims to solve the scene graph generation problem with a one-stage detector. They combine the extracted image features and entity representations from the transformer with their novel triplet decoder, which provides compact subject and object representations along with subject and object heat maps. All these representations are then passed to an MLP to predict <subject-predicate-object> triplets.

Since the model proposed in this paper is a one-stage detector, it has fewer parameters than the two-stage detectors that solve the same problem, and it is computationally very fast, as the results show.
The authors are clear about the fact that their model does not surpass the two-stage detectors in accuracy; their emphasis is that it is extremely fast while remaining almost as good.

Drawbacks:

  • Written aspect of the paper:
    • The paper does not clearly define what terms like 'entity representations', 'subject queries' and 'object queries' actually mean.
    • As a result, the DETR paper becomes an absolute prerequisite for reading this paper.
  • Model:
    • The authors repeatedly add the positional encodings in the CSA and DVA modules.
    • Since CSA has already encoded that information, it is not obvious why the positional encodings need to be added again in DVA.
    • It is also not shown experimentally how this repeated addition of positional encodings affects the model's accuracy.
  • Experiments:
    • In the ablation study, the authors show how removing DEA or changing the number of layers in the model affects the accuracy.
    • However, during this course I have realized that it is very important to explain, theoretically or at least experimentally, how each and every module of the network affects the final outcome.
    • Replacing a module with an MLP is the simplest baseline one can try.
    • In this regard, the authors could have simply removed DVA and DEA and instead used a simple MLP that takes [output of CSA, feature context, entity representations] as input and makes the predictions. This would provide a valid argument for why modules like DVA and DEA are needed.

Finally, I would say that since the model is extremely fast compared to the two-stage detectors in the field of scene graph generation, and since the accuracies are not far apart, it can easily replace two-stage detectors in practical applications.

References

[1] Yuren Cong, Michael Ying Yang, and Bodo Rosenhahn. "RelTR: Relation Transformer for Scene Graph Generation", 2022

[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. "Attention Is All You Need", 2017

[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko. "End-to-End Object Detection with Transformers", 2020

[4] Rajat Koner, Suprosanna Shit, Volker Tresp. "Relation Transformer Network", 2021

[5] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, Fei-Fei Li. "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations", 2016

[6] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, Vittorio Ferrari. "The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale", 2020

[7] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár. "Focal Loss for Dense Object Detection", 2018

[8] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, Silvio Savarese. "Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression", 2019

























