Introduction
A scene graph is a structured representation of a scene in which objects are encoded as nodes and the relationships between objects as directed edges. Thanks to recent advances in computer vision, especially in object detection, image-based scene graph generation has improved considerably. Scene graphs have been applied successfully to various tasks such as image retrieval, object detection, and semantic segmentation, and they can also be used to understand and reason about the objects and subjects inside a scene. Multi-modal tasks that require understanding and reasoning over images in addition to language understanding can likewise benefit from scene graphs; applications to visual question answering (Johnson et al., 2017) and image captioning (Yang et al., 2019) support this statement. However, while most recent work focuses on scene graph generation from a single image, scene graph generation on videos has not been fully explored yet. The authors of the paper aim to fill this gap.
Figure 1: Static scene graph generation vs dynamic scene graph generation.
Figure 1 illustrates the difference between static and dynamic scene graph generation. The second row shows static scene graph generation, where no temporal information is used in the graph representations, whereas in the third row the temporal information between frames is leveraged, which results in more accurate relationship predictions between the objects.
Motivation and Contributions
In this paper, the authors aim to generate dynamic scene graphs from videos. While most previous static, image-based works utilize spatial information to predict relationships, they omit temporal dependencies. The main motivation of the paper is to fill this gap with a model that leverages temporal information. To do so, the authors adopt the original encoder-decoder architecture of the Transformer (Vaswani et al., 2017): the encoder learns the spatial relationships within a frame, while the temporal decoder learns the relationships between frames and uses learned frame embeddings instead of standard positional embeddings. They train and evaluate their model on the Action Genome dataset and achieve state-of-the-art results on the predicate classification (PREDCLS), scene graph classification (SGCLS), and scene graph detection (SGDET) tasks. Their contributions are as follows:
- They propose a novel framework, the Spatial-Temporal Transformer (STTran), for dynamic scene graph generation.
- Relationship prediction is formulated as a multi-label classification task.
- A novel thresholding strategy selects additional confident relations between objects while generating dynamic scene graphs.
- Extensive experiments and ablation studies demonstrate the effectiveness of the model in using temporal information.
Related Works
The earliest work on scene graphs was developed for semantic image retrieval (Johnson et al., 2015). Scene graphs were used to generate a semantic representation of visual scenes, and to that end the authors also created the first scene graph dataset. Thanks to this seminal work, scene graphs received a lot of attention and research interest from the community. Another influential work on scene graph generation is by Xu et al. (2017), who argue that the relationships between objects inside a frame should be contextualized to make better predictions. They model the relationships between objects with an iterative message passing algorithm in which the object representations evolve and improve over time; two GRUs are used to model the evolution of edge and node features. Yang et al. (2018) introduce Graph R-CNN, which focuses on efficiently handling the quadratic number of possible relations between subject-object pairs inside a frame. To do so, they propose a Relation Proposal Network (RePN), and they additionally propose an attentional graph convolutional network to obtain better-contextualized representations of object pairs and their relations. As a final contribution, they propose a novel evaluation metric and report state-of-the-art performance on it.
One of the significant breakthroughs of the last few years of AI research is the Transformer (Vaswani et al., 2017). Its crucial contribution is self-attention, which generates contextualized representations of words while being more parallelizable than recurrent approaches. Transformer models achieve state-of-the-art results on most downstream NLP tasks, and the vision community has also adopted the architecture. The Vision Transformer (ViT) (Dosovitskiy et al., 2020) applies self-attention directly to image patches and achieves excellent results on mid-sized and small benchmarks when pre-trained on large datasets. Moreover, another influential work, DETR (Carion et al., 2020), shows that the Transformer architecture can be applied to object detection and panoptic segmentation.
Previous Transformer-based vision approaches operated on single images; however, understanding videos requires modeling the temporal domain. To fill this gap, Girdhar et al. (2019) propose the Action Transformer Network for human action localization and recognition. The Action Transformer uses spatiotemporal information to generate a contextual representation of the target person whose action is to be classified.
Another work that uses spatiotemporal information, this time to segment instances in a video, is VisTR (Wang et al., 2021). The authors extract feature representations of each frame with a standard convolutional neural network and feed the embeddings into a Transformer encoder-decoder. The output of the model is a mask for each instance in each frame.
Several works use both spatial and temporal networks for action recognition; these mostly leverage LSTMs or 3D convolutions to model the temporal domain. In one of the earliest such works, a two-stream convolutional neural network was proposed to process both the motion between frames and the visual appearance of the frames (Simonyan & Zisserman, 2014).
Dataset
Previous approaches treated action classification as a monolithic event in which a single action label depicts what happens in the video. However, from a cognitive science and neuroscience perspective, humans perceive events as hierarchically structured (Barker & Wright, 1951). For example, the event "starting the web browser on the computer" can be decomposed into pushing the power button, entering the computer's password, and clicking the browser icon. Since previous datasets do not encode the dynamic changes in the relationships between objects that make up an event, Action Genome was proposed to fill this gap.
Action Genome is a dataset of 9,848 videos annotated with action labels and spatio-temporal scene graph labels (Ji et al., 2020). Its goal is to capture the relationships between subject-object pairs across the frames of a video in order to understand action dynamics. Action Genome contains 1.7 million instances of human-object relations across 25 relationship categories and 583K bounding boxes of interacted objects across 35 object classes; 265K frames are labeled in total. Relationships in Action Genome are split into three categories:
- Attention
- Spatial
- Contact
Table 1: Relationship types in the Action Genome
Attention relationships depict whether a person is perceiving an object, which signals a possible or ongoing interaction between the person and the object.
Spatial relationships illustrate the spatial locations of the objects and the person.
Contact relationships express the type of interaction between the objects and the person.
STTran is trained and evaluated on the Action Genome.
Figure 2: Annotated sample for the action "sitting on a sofa"
Methodology
Video Representation
A video consists of a sequence of frames. The frame at timestamp t is denoted I_{t}, and a video with T frames is represented as V = [I_{1}, I_{2}, I_{3}, ..., I_{T}].
Relationship Representation
To find the objects inside a frame, an object detector is first applied to it. The authors use Faster R-CNN as the object detector and also leverage it to extract the feature vectors of the object proposals. For the frame at timestamp t, the detector produces N(t) object proposals with visual features \{ v_{t}^{1}, \dots, v_{t}^{N(t)} \}, v_{t}^{i} \in \mathbb{R}^{2048}, bounding boxes \{ b_{t}^{1}, \dots, b_{t}^{N(t)} \}, and object category distributions \{ d_{t}^{1}, \dots, d_{t}^{N(t)} \}. Some of the proposals interact with each other in the scene; the relationships in the frame at timestamp t are represented as R_t = \{ r_{t}^{1}, r_{t}^{2}, \dots, r_{t}^{K(t)} \}. Each relation between a subject and an object in the frame is represented using the following features:
- Subject visual feature
- Subject semantic vector
- Object visual feature
- Object semantic vector
- Subject-object location information
The relationship embedding x_t^{k} between subject i and object j at timestamp t is formulated as follows (a code sketch of this construction follows the symbol definitions below):
x_t^{k} = \langle W_{s} v_t^{i},\; W_{o} v_t^{j},\; W_{u}\, \varphi\big(u_{it} \oplus f_{box}(b_t^{i}, b_t^{j})\big),\; s_t^{i},\; s_t^{j} \rangle
where
- W_{s}, W_{o} \in \mathbb{R}^{2048 \times 512} are matrices representing the linear transformations of the subject and object feature vectors, respectively.
- W_{u} \in \mathbb{R}^{12544 \times 512} is a matrix for dimensionality reduction.
- v_t^{i} is the visual feature vector of the subject (the i-th proposal) of the k-th relationship at timestamp t.
- v_t^{j} is the visual feature vector of the object (the j-th proposal) of the k-th relationship at timestamp t.
- u_{it} is the feature map of the union of the two bounding boxes, computed with RoIAlign.
- \oplus is the element-wise summation operator.
- \langle \cdot, \cdot \rangle is the concatenation operator.
- f_{box} is a function that takes the bounding boxes of the subject and object and transforms them into a 256 \times 7 \times 7 feature map.
- b_t^{i}, b_t^{j} are the bounding boxes of the subject and object, respectively.
- s_t^{i}, s_t^{j} \in \mathbb{R}^{200} are semantic vectors determined by the object categories of the subject and object, taken from an embedding layer initialized with pretrained Word2Vec embeddings.
- \varphi is the flattening operator.
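The following PyTorch sketch shows one way this relation representation could be assembled. Module and argument names are illustrative, not the authors' code; the dimensions (2048-d visual features, 512-d projections, a 256x7x7 union feature map, 200-d semantic embeddings) are taken from the definitions above.

```python
import torch
import torch.nn as nn

class RelationshipRepresentation(nn.Module):
    """Sketch of the relation feature construction x_t^k (names illustrative)."""

    def __init__(self, num_classes, visual_dim=2048, proj_dim=512,
                 union_dim=256 * 7 * 7, sem_dim=200):
        super().__init__()
        self.W_s = nn.Linear(visual_dim, proj_dim)     # subject projection W_s
        self.W_o = nn.Linear(visual_dim, proj_dim)     # object projection W_o
        self.W_u = nn.Linear(union_dim, proj_dim)      # union-feature reduction W_u
        self.sem = nn.Embedding(num_classes, sem_dim)  # semantic embeddings (Word2Vec-initialized in the paper)

    def forward(self, v_subj, v_obj, u_union, box_map, subj_cls, obj_cls):
        # v_subj, v_obj:     (K, 2048) visual features of subject / object proposals
        # u_union:           (K, 256, 7, 7) RoIAlign feature of the union box (u_it)
        # box_map:           (K, 256, 7, 7) output of f_box for the two bounding boxes
        # subj_cls, obj_cls: (K,) object category indices of subject / object
        fused = (u_union + box_map).flatten(1)         # element-wise sum, then flatten (phi)
        x = torch.cat([
            self.W_s(v_subj),
            self.W_o(v_obj),
            self.W_u(fused),
            self.sem(subj_cls),
            self.sem(obj_cls),
        ], dim=-1)                                     # (K, 512 + 512 + 512 + 200 + 200) = (K, 1936)
        return x
```

The concatenated vector has 1936 dimensions, which is the relation representation fed to the spatio-temporal Transformer described next.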
Spatio-Temporal Transformer
STTran (Cong et al., 2021) follows the original Transformer (Vaswani et al., 2017) architecture, in which information is first encoded and then decoded for the final task. However, there are three differences from the original implementation. First, the encoder and decoder are designed to extract relationships within and between frames, respectively. Second, the decoder layers do not use masked self-attention, unlike the original implementation. Third, no positional embeddings are used in the encoder.
Spatial Encoder
The aim of the spatial encoder is to extract the relationships inside a single frame.
The input to the spatial encoder is X_t = \{ x_t^1, x_t^2, \dots, x_t^{K(t)} \}, and the same matrix is used for the queries, keys, and values; the original self-attention and Transformer architecture are applied. N layers are stacked on top of each other, with the output of each layer fed as input to the next. Since the relationships within a frame are treated as unordered, no positional embeddings are added to the input vectors.
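As a rough sketch (not the authors' implementation), the spatial encoder can be approximated with a standard PyTorch Transformer encoder applied per frame and without positional encodings; d_model = 1936 matches the concatenated relation representation above, while the number of heads and layers here are illustrative choices.

```python
import torch
import torch.nn as nn

# One frame's relation vectors X_t are contextualized with plain self-attention,
# with no positional encodings added before the encoder.
d_model, nhead, num_layers = 1936, 8, 1
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           dim_feedforward=2048, batch_first=True)
spatial_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

X_t = torch.randn(1, 5, d_model)           # one frame with K(t) = 5 relation vectors
X_t_contextualized = spatial_encoder(X_t)  # queries = keys = values = X_t
```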
Temporal Decoder
To capture the relationships between adjacent frames, the authors use a sliding window that runs over the output of the spatial encoder, i.e., over the spatially contextualized representations of the interactions inside each frame. The input batch to the decoder is constructed as follows:
Z_i = [ X_i, \dots, X_{i + \eta - 1} ], \quad i \in \{1, \dots, T - \eta + 1 \},
where \eta is the sliding window size, \eta \leq T, and T is the number of frames. Unlike the original Transformer, the decoder does not use masked self-attention; N layers are stacked on top of each other, learned frame encodings are added to the inputs, and the representations from the final decoder layer are used for prediction.
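A minimal sketch of this sliding-window batching, assuming each X_t is the spatial encoder output for one frame with shape (K(t), d_model); names are illustrative.

```python
import torch

def sliding_windows(frames, eta):
    """Build the decoder inputs Z_i = [X_i, ..., X_{i+eta-1}] for i = 1..T-eta+1.

    frames: list of T tensors, frames[t] has shape (K(t), d_model)
    eta:    sliding window size, eta <= T
    """
    T = len(frames)
    windows = []
    for i in range(T - eta + 1):
        # Concatenate the relation vectors of eta consecutive frames into one sequence.
        windows.append(torch.cat(frames[i:i + eta], dim=0))
    return windows

# Usage: Z = sliding_windows(encoder_outputs, eta=2)
```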
Frame Encodings
Since the goal of the temporal decoder is to leverage temporal dependencies between frames, and the Transformer architecture is by construction unaware of input order, positional information must be injected into the model. To this end, the authors design custom frame encodings with learned embedding parameters that are added to the relationship representations. Because the decoder uses a sliding window whose size is a fixed hyperparameter, the number of embedding vectors is fixed, and each has the same dimensionality as the relation representation vectors.
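A possible implementation of such learned frame encodings is sketched below, under the assumption that every relation vector in a window receives the embedding of its frame's position within the window; the details are illustrative, not the authors' exact code.

```python
import torch
import torch.nn as nn

d_model, eta = 1936, 2
# One learned embedding vector per frame position inside the window (eta in total).
frame_encoding = nn.Embedding(eta, d_model)

def add_frame_encodings(Z_i, counts):
    """Add the learned frame encoding to every relation vector of a window.

    Z_i:    (sum(counts), d_model) concatenated relation vectors of one window
    counts: list where counts[j] = K(t) of the j-th frame in the window
    """
    frame_idx = torch.repeat_interleave(
        torch.arange(len(counts)), torch.tensor(counts))  # frame position of each vector
    return Z_i + frame_encoding(frame_idx)
```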
Loss Function
Predicate classification and object classification losses are combined to train the model. To compute the predicate classification loss, a different linear transformation is first applied for each relationship type. Then a multi-label margin loss is used as the predicate classification loss:
L_p(r, P^{+}, P^{-}) = \sum_{p \in P^{+}} \sum_{q \in P^{-}} \max\big(0,\, 1 - \phi(r,p) + \phi(r,q)\big)
Here r represents a subject-object pair. Since each subject-object pair may have one or multiple annotated predicates, this set of positive predicates is denoted P^{+}; conversely, P^{-} is the set of predicates not included in the annotation. The function \phi(r,p) returns the confidence score of predicate p for pair r.
The second loss used in the paper is the object classification loss L_o: the cross-entropy between the predicted object class distribution and the ground-truth object class.
The overall loss is L_{total} = L_p + L_o.
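A minimal sketch of both loss terms, written out directly from the formulas above; it handles a single subject-object pair and uses standard cross-entropy for the object term, with all names illustrative rather than taken from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def predicate_margin_loss(scores, positive_idx):
    """Multi-label margin loss L_p for one subject-object pair r.

    scores:       (P,) confidence phi(r, p) for every predicate p
    positive_idx: indices of the annotated predicates P^+
    """
    pos_mask = torch.zeros_like(scores, dtype=torch.bool)
    pos_mask[positive_idx] = True
    pos = scores[pos_mask]    # phi(r, p), p in P^+
    neg = scores[~pos_mask]   # phi(r, q), q in P^-
    # sum over all (p, q) pairs of max(0, 1 - phi(r, p) + phi(r, q))
    return F.relu(1.0 - pos.unsqueeze(1) + neg.unsqueeze(0)).sum()

def total_loss(pred_scores, positive_idx, obj_logits, obj_labels):
    L_p = predicate_margin_loss(pred_scores, positive_idx)
    L_o = F.cross_entropy(obj_logits, obj_labels)  # object classification loss
    return L_p + L_o                               # L_total = L_p + L_o
```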
Scene Graph Generation Strategy
The authors use and evaluate three different scene graph generation strategies in this paper.
With Constraint
In the "With Constraint" strategy only one predicate can be assigned to object-subject pair. Therefore, it assesses the model's capability on predicting the most important relationship.
Without Constraint
In the "Without Constraint setting" strategy multiple predicates can be assigned to object-subject pairs. The disadvantage of the "Without Constraint" strategy is possibility of adding noise and wrong information to the graph.
Semi Constraint
"Semi Constraint" is the novel strategy that is proposed in this paper. In this strategy, multiple predicates can be assigned to the subject-object pair. For instance, the person (object), and food(subject) pair can have both "eating" and "holding" as a predicate. In this strategy, the authors threshold confidence of the predicates, and if the confidence is higher than the threshold predicate is assigned as positive.
Experiments
Experiment Metric
The authors evaluate scene graph generation under three different tasks:
- Predicate Classification (PREDCLS)
- Scene Graph Classification (SGCLS)
- Scene Graph Detection (SGDET)
In PREDCLS, the model predicts the predicates given the ground-truth bounding boxes and object classes. In SGCLS, the model predicts both the predicate labels and the classes of the given bounding boxes. In SGDET, the model additionally has to detect the bounding boxes, i.e., it predicts the boxes, their classes, and the predicates. Performance on each task is reported with Recall@K (R@K).
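As a hedged sketch, R@K is assumed here to follow the common scene-graph definition, i.e., the fraction of ground-truth triplets recovered among the top-K predicted triplets; the authors' exact matching protocol may differ in detail.

```python
def recall_at_k(predicted_triplets, gt_triplets, k):
    """Recall@K under the common definition (illustrative, not the authors' code).

    predicted_triplets: list of (subject, predicate, object), sorted by confidence
    gt_triplets:        iterable of ground-truth (subject, predicate, object) triplets
    """
    top_k = set(predicted_triplets[:k])
    gt = set(gt_triplets)
    return len(top_k & gt) / max(len(gt), 1)
```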
Table 2: Comparison with SOTA methods. Bold marks the best result.
The authors conduct several comparisons and ablation studies to show the effectiveness of STTran. First, they compare STTran with SOTA image-based scene graph generation models under the "With Constraint", "Semi Constraint", and "No Constraint" strategies, and show that STTran outperforms all other image-based SOTA methods by using the temporal relationships between frames. (Table 2)
Additionally, to understand whether image-based methods can easily leverage temporal dependencies, they add RNN/LSTM modules on top of the image-based models: before the feature vectors are sent to the final classifier, LSTMs are used to incorporate temporal information. They conclude that all methods improve their scene graph generation capability by leveraging the temporal dimension. (Table 3)
Table 3: Performance improvement when an LSTM is added on top of the image-based methods
Moreover, they conduct another experiment to examine whether STTran really exploits temporal dependencies to increase its performance. To do so, they reverse or shuffle the frame order of 1/3 of the training videos, which can be seen as adding noise to the temporal information. According to the results, this noise lowers the performance of STTran. (Table 4)
Table 4: Results when 1/3 of the training instances are shuffled or reversed
Ablation Studies
The authors conduct ablation studies to understand the contribution of each module of STTran. First, they use only the spatial encoder without frame encodings, and the performance is similar to the image-based models, as expected. Then they use only the temporal decoder without frame encodings; since temporal information is used in this setting, performance is superior to using only the spatial encoder. In the third experiment, they use both the spatial encoder and the temporal decoder and observe a further slight improvement. Finally, they assess the frame encoding module and compare "sinusoidal" and "learned" frame encodings. Learned frame encodings increase the performance the most, which shows that the temporal information is fully leveraged; sinusoidal frame encodings do not help much, and the authors do not offer an explanation for this. (Table 5)
Table 5: Results of ablation study
Qualitative Results
Qualitative results of STTran on the SGDET task under the "With Constraint", "Semi Constraint", and "No Constraint" strategies can be seen in Figure 3.
Figure 3: Qualitative results of the model on the video where the woman tries to reach the medicine while sitting on the bed.
Green boxes in the ground truth represent objects that the object detector fails to find. Gray boxes are false-positive detections, and therefore their relations are also false positives. Melon-colored boxes are true positives, and correct relationships are shown in light blue.
Discussion and Student's Review
The authors clearly define the problem statement and the proposed method. Most of the time, they support their claims with experiments; for instance, they state that the model uses temporal information, and to prove it they train on a noised version of the dataset in which the temporal information is corrupted. Additionally, their ablation studies show where the improvement comes from, which makes it possible to understand the effect of each module separately. They clearly define all the metrics and tasks used in the evaluation, except R@K. They also provide many qualitative experiments, which help to understand the process and to see the model's improvement visually. However, the paper has several drawbacks. First, in the supplementary material they compare STTran with a SOTA model, and for some relations (e.g., "holding") STTran does not outperform it; the authors do not hypothesize why. Second, they state that there are annotation problems, but they neither explain this statement clearly nor show examples of the wrong predictions. Third, their code base is somewhat disorganized, and some of the functions defined in the paper are embedded inside other functions, which makes the code harder to understand.
To sum up, I like the paper a lot; it is well written and educational.
Caghan Koksal
References
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020, August). End-to-end object detection with transformers. In European conference on computer vision (pp. 213-229). Springer, Cham.
Cong, Y., Liao, W., Ackermann, H., Rosenhahn, B., & Yang, M. Y. (2021). Spatial-temporal transformer for dynamic scene graph generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 16372-16382).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Girdhar, R., Carreira, J., Doersch, C., & Zisserman, A. (2019). Video action transformer network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 244-253).
Ji, J., Krishna, R., Fei-Fei, L., & Niebles, J. C. (2020). Action genome: Actions as compositions of spatio-temporal scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10236-10247).
Johnson, J., Krishna, R., Stark, M., Li, L. J., Shamma, D., Bernstein, M., & Fei-Fei, L. (2015). Image retrieval using scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3668-3678).
Johnson, J., Hariharan, B., Van Der Maaten, L., Hoffman, J., Fei-Fei, L., Lawrence Zitnick, C., & Girshick, R. (2017). Inferring and executing programs for visual reasoning. In Proceedings of the IEEE international conference on computer vision (pp. 2989-2998).
Barker, R. G., & Wright, H. F. (1951). One boy's day; a specimen record of behavior.
Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems, 27.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Yang, J., Lu, J., Lee, S., Batra, D., & Parikh, D. (2018). Graph r-cnn for scene graph generation. In Proceedings of the European conference on computer vision (ECCV) (pp. 670-685).
Yang, X., Tang, K., Zhang, H., & Cai, J. (2019). Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10685-10694).
Xu, D., Zhu, Y., Choy, C. B., & Fei-Fei, L. (2017). Scene graph generation by iterative message passing. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5410-5419).
Wang, Y., Xu, Z., Wang, X., Shen, C., Cheng, B., Shen, H., & Xia, H. (2021). End-to-end video instance segmentation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8741-8750).