Introduction

Human-Object Interaction (HOI) Recognition, which focuses on the actions a human performs on an object, can be regarded as a subtask of Scene Graph Generation (SGG) [3,5]. Image-level HOI Classification aims to retrieve all <verb, object> pairs present in an image, while HOI Detection additionally requires a pair of human-object bounding boxes for each HOI class [7]. As it is a multi-label learning task, an object detector is commonly used to locate the corresponding features in the image. Moreover, the datasets are severely imbalanced, since some actions are far more common in reality, e.g., sitting on a chair. In the benchmark dataset HICO (Humans Interacting with Common Objects) [1], 20.1% of the classes have ≤ 5 positive samples, and the average negative-to-positive ratio over the categories is 6000:1.

To alleviate these problems, Jin et al. [7] focus on the classifier of the HOI recognition model, which had been overlooked before, and show that their detection-free (DEFR) method can significantly surpass the state-of-the-art (SOTA) models [2,3], all of which are detection-assisted. Figure 1 shows the DEFR pipeline for HOI classification. A pre-trained image encoder extracts the visual feature, and the feature vector is fed into a classifier whose weights are initialized with language embeddings of each label. Furthermore, they propose a new loss function to reduce the negative effect of imbalanced data.
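To make the pipeline concrete, below is a minimal sketch of how such a model could be assembled in PyTorch (an illustration, not the authors' code; image_encoder and label_embeddings are placeholders that the caller must supply):

```python
import torch
import torch.nn as nn

class DEFRClassifier(nn.Module):
    """Sketch of the DEFR pipeline: a pre-trained image encoder followed by a
    single linear classifier whose weights are initialized from normalized
    language embeddings of the HOI labels."""

    def __init__(self, image_encoder: nn.Module, label_embeddings: torch.Tensor):
        super().__init__()
        num_classes, feat_dim = label_embeddings.shape
        self.encoder = image_encoder                        # e.g. a pre-trained ViT-B/32
        self.classifier = nn.Linear(feat_dim, num_classes)
        with torch.no_grad():
            # each normalized label embedding becomes the proxy w_j of class j
            proxies = label_embeddings / label_embeddings.norm(dim=-1, keepdim=True)
            self.classifier.weight.copy_(proxies)
            self.classifier.bias.zero_()                    # bias is zero-initialized

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.encoder(images)       # visual feature x, shape (batch, feat_dim)
        return self.classifier(x)      # one logit per HOI class
```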

The authors also show that DEFR can be applied to HOI Detection. Since their main focus is HOI Classification and the pipelines are nearly the same, I mainly introduce HOI Classification in the following. The only difference is that, for HOI Detection, they first use an off-the-shelf object detector to obtain the HOI region in the image.

Figure 1: The DEFR pipeline for HOI classification

Related Work

Scene Graph

A scene graph is a structural representation that captures detailed semantics by explicitly modeling objects (e.g., horse, bike, person) and their attributes (e.g., yellow, old) as nodes, and relationships between paired objects (e.g., ride, is behind) as edges [4]. Slightly different from the HOI scenario, a scene graph normally uses triplets <subject, relation, object> to describe visual relationships, since it is not human-centered. Besides actions, the relation can be a spatial relation (e.g., is behind), a descriptive verb (e.g., wear), a preposition (e.g., with), or a comparative adjective (e.g., taller than). As with HOI recognition, SGG also suffers from the long-tailed distribution of visual relationships (i.e., triplets).

Paradigms of HOI Classification

There are two main paradigms for HOI classification: the one-stage paradigm and the two-stage paradigm [2]. In the one-stage paradigm, models extract features from raw pixels and directly make predictions; DEFR is an example of this paradigm. In the two-stage paradigm, object detectors and other tools are first applied to obtain features of humans and objects, and the model then predicts the corresponding actions based on these features. Figure 2 shows an overview of the HAKE-based model [2], one of the SOTA HOI classification models, which consists of two stages: the model first decomposes the human body into several parts and predicts the action of each part; then, based on the part-level predictions and the visual feature, it predicts the HOI label. The pipeline is clearly much more complex than that of DEFR.

Figure 2: Overview of the HAKE-based HOI classification model

(HAKE: Human Activity Knowledge Engine)

Methodology

Language Embedding Initialization

Given the feature x\in\mathbb{R}^D extracted from an image, the weight matrix of the classifier W\in\mathbb{R}^{D\times C}, and the zero-initialized bias of the classifier b\in\mathbb{R}^C, the logit of class j, namely z_j, is the dot product of x and the j-th column vector w_j of the weight matrix plus the bias b_j, i.e., z_j=x^Tw_j+b_j. As a result, the logit z_j can be approximately regarded as an unnormalized projection onto the "class space", and the column vector w_j acts as the proxy of class j. The labels of HOI tasks are semantically correlated. As shown in the two left subfigures of Figure 3, the bicycle cluster is close to the motorcycle cluster, and the proxies within them are isotropic, i.e., the distance between <sit_on, bicycle> and <sit_on, motorcycle> is approximately equal to the distance between <hold, bicycle> and <hold, motorcycle>. Standard random initialization therefore might not be a good choice for HOI, especially for few-shot classes, since randomly initialized proxies carry no such structure and there are not enough samples for the model to learn these correlations. Language embeddings, in contrast, preserve the correlations between labels and can be obtained from language models such as BERT [13] and CLIP [6].

To compare the similarity between visual features and the proxies, the language embeddings first need to be normalized. They are then applied as prior knowledge to initialize the weight matrix W. As shown in Figure 3, the fine-tuned model whose classifier weight matrix is initialized with language embeddings outperforms the one with randomly initialized classification weights (mAP 60.5 vs. 36.8), and its classifier weight vectors are still clustered after fine-tuning. In contrast, the weight vectors of the randomly initialized classifier are chaotic after fine-tuning, which indicates that the classifier fails to learn the mapping.
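As an illustration of how such label embeddings could be produced with BERT (a sketch using the Hugging Face transformers library; the text template and the mean pooling are assumptions, since the paper may convert labels to sentences and pool them differently):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed_hoi_labels(hoi_labels):
    """Encode each <verb, object> pair as a short phrase and return one
    unit-normalized embedding per label."""
    texts = [f"a person {verb.replace('_', ' ')} a {obj}" for verb, obj in hoi_labels]
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, return_tensors="pt")
        hidden = bert(**batch).last_hidden_state                 # (labels, seq, 768)
        mask = batch["attention_mask"].unsqueeze(-1).float()     # ignore padding tokens
        emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)       # mean pooling
    return emb / emb.norm(dim=-1, keepdim=True)

# e.g. two correlated HOI classes; the full HICO label set has 600 such pairs
label_embeddings = embed_hoi_labels([("sit_on", "bicycle"), ("hold", "bicycle")])
```

The resulting matrix can then be copied into the classifier weights as in the pipeline sketch above.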

Figure 3: t-SNE visualization of classifier weights. Each point represents an HOI class in the HICO dataset and is colored by object.

Log-Sum-Exp Sign (LSE-Sign) Loss

Multi-label classification uses the sigmoid function to predict whether a label is present in the image and then computes the binary cross-entropy (BCE) loss. The BCE loss and its partial derivative with respect to the logit of a certain class K are as follows:

(1) L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C}\left[y_{ij}\log{\hat{y}_{ij}}+(1-y_{ij})\log{(1-\hat{y}_{ij})}\right]
(2) \frac{\partial L}{\partial z_K}=-\frac{1}{N}\sum_{i=1}^{N}\left(y_{iK}-\hat{y}_{iK}\right)

where \hat{y}_{ij} \equiv \sigma(z_{ij})=\frac{1}{1+\exp{(-z_{ij})}}, and z_{ij} is the logit of sample i for class j. Since the HICO dataset is imbalanced, most terms y_{iK}-\hat{y}_{iK} in equation (2) come from negative samples and are negative. As a result, the gradient is dominated by the negative samples, which prevents the model from learning from the positive samples.
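A quick numerical illustration of this imbalance (using the 6000:1 ratio mentioned in the introduction; zero logits are a simplification of an untrained model):

```python
import torch
import torch.nn.functional as F

# one positive and 6000 negative samples for a single class, all logits at zero
y = torch.zeros(6001); y[0] = 1.0
z = torch.zeros(6001, requires_grad=True)
F.binary_cross_entropy_with_logits(z, y, reduction="mean").backward()

# the lone positive contributes ~ -8.3e-5 to the gradient, while the negatives
# together contribute ~ +0.5, so the update is driven almost entirely by negatives
print(z.grad[0].item(), z.grad[1:].sum().item())
```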

Weighted sigmoid cross-entropy loss [3] and focal loss [8] were applied in previous work to tackle this problem, but their performance is sensitive to hyperparameters and they do not consider the dependency between classes. The main idea of the LSE-Sign loss is to encourage learning from misclassified classes, which is similar in spirit to the focal loss. The loss and its partial derivative with respect to the score of class K are as follows:

(3) L=\frac{1}{N}\sum_{i=1}^{N}\log{\left(1+\sum_{j=1}^{C}e^{-y_{ij}s_{ij}}\right)}
(4) \frac{\partial L}{\partial s_K}=\frac{1}{N}\sum_{i=1}^{N}\frac{-y_{iK}e^{-y_{iK}s_{iK}}}{1+\sum_{j=1}^{C}e^{-y_{ij}s_{ij}}}

where s_{ij}=\gamma\cdot x_i^Tw_j+b_j, and \gamma is a scaling hyperparameter, needed because the column vectors of the classifier weight matrix are normalized at initialization. Another notable difference from the BCE loss is that in the LSE-Sign loss y_{ij} is 1 when class j is present in sample i and -1 otherwise. As shown in equation (4), the gradient magnitude is significantly amplified when a class is misclassified, since the expression is approximately a softmax over the terms e^{-y_{ij}s_{ij}}. As a consequence, the gradient with respect to class K is dominated by the misclassified terms rather than by the negative samples.
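A direct implementation of equation (3) could look as follows (a sketch; it expects targets in {-1, +1} and scores that are already scaled by γ):

```python
import torch

def lse_sign_loss(scores: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """LSE-Sign loss of equation (3).

    scores:  (batch, C) values s_ij = gamma * x_i^T w_j + b_j
    targets: (batch, C) entries are +1 for positive labels and -1 otherwise
    """
    # log(1 + sum_j exp(-y_ij * s_ij)) = logsumexp over {0} and all -y_ij * s_ij
    neg = -targets * scores
    zero = torch.zeros(scores.size(0), 1, device=scores.device, dtype=scores.dtype)
    return torch.logsumexp(torch.cat([zero, neg], dim=1), dim=1).mean()

# toy usage with gamma = 100 as in the experiments below
gamma = 100.0
scores = gamma * 0.01 * torch.randn(4, 600)                 # 600 HICO classes
targets = (torch.rand(4, 600) < 0.01).float() * 2.0 - 1.0   # mostly -1, a few +1
print(lse_sign_loss(scores, targets))
```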

Experiment

Datasets

Two datasets are used in the experiments: HICO [1] and MPII Human Pose [9]. The HICO dataset is a benchmark for classifying human-object interactions (HOI) in images and contains 600 HOI categories composed of 117 unique verbs and 80 object classes. Each image may contain multiple HOI classes and multiple human-object pairs. The training set has 38,116 images, and the test set has 9,658 images. The MPII Human Pose dataset is a benchmark for evaluating articulated human pose estimation. Unlike HICO, all person instances in an image are assumed to take the same action, and each image is classified into exactly one of 393 action classes. It contains 15,205 training images and 5,708 test images. Following [3,14], 6,987 images are sampled from the training set as a validation set for evaluating DEFR, since the setting of the MPII test set differs slightly from that of the training set [14].

Pre-trained Image Encoder

The Vision Transformer ViT-B/32 [10] is mainly used as the backbone at resolution 224. Here "B" denotes the Base Transformer, and "32" refers to a 32x32 input patch size. Several strategies can be used to pre-train the backbone, such as image classification on ImageNet or image-text contrastive learning as in CLIP [6]. To make a fair comparison with other works, ResNet-101 [11] is also used in some of the experiments below.
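For illustration, an ImageNet-pre-trained ViT-B/32 feature extractor can be obtained via the timm library (a sketch and an assumption about tooling; the authors may load their pre-trained weights, including the CLIP ones, differently):

```python
import timm
import torch

# ViT-B/32 at resolution 224; num_classes=0 removes the classification head
backbone = timm.create_model("vit_base_patch32_224", pretrained=True, num_classes=0)
backbone.eval()

with torch.no_grad():
    feature = backbone(torch.randn(1, 3, 224, 224))   # pooled visual feature
print(feature.shape)                                   # (1, 768)
```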

Fine-tuning

The datasets mentioned above are used at resolution 672 for fine-tuning the respective HOI classification models. Data augmentation with random color jittering, horizontal flipping, and resized cropping is used. To reduce class imbalance, over-sampling is adopted so that each class has at least 40 samples per epoch.
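A sketch of this fine-tuning data setup (assuming torchvision; the jitter strengths, the crop scale, and the hypothetical train_labels/train_dataset are assumptions, and HICO's multi-label nature is simplified to one class per image for the sampler):

```python
from collections import Counter
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import transforms

# augmentations described above; the specific parameter values are assumptions
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(672, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])

def oversampling_weights(train_labels, min_per_class=40):
    """Per-image sampling weights so that rare classes reach roughly
    min_per_class samples per epoch (train_labels: one class id per image)."""
    counts = Counter(train_labels)
    return [max(min_per_class / counts[c], 1.0) for c in train_labels]

# weights = oversampling_weights(train_labels)
# sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
# loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
```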

Results

Verification of effectiveness of each component

Table 1 shows the effect of each component. The baseline model uses ViT-B/32 pre-trained on ImageNet-1K as the backbone, has a randomly initialized linear classification layer, and is fine-tuned on HICO. For the embedding initialization, BERT-generated embeddings are used. Both the language embedding initialization and the LSE-Sign loss improve the performance by a clear margin.


| LSE-Sign Loss | Embedding Initialization | mAP             |
|---------------|--------------------------|-----------------|
| –             | –                        | 37.8 (Baseline) |
| ✓             | –                        | 44.1 (+6.3)     |
| –             | ✓                        | 50.6 (+12.8)    |
| ✓             | ✓                        | 53.5 (+15.7)    |

Table 1: The gain of individual components

Ablation Study

Embedding Initialization and Visual Backbone

There are two variables here: the initialization strategy and the visual backbone. Besides BERT, the CLIP text encoder can also be used to generate the language embeddings; it outperforms BERT since the language model is trained jointly with the vision model. Moreover, the CLIP-pre-trained ViT can be used as the backbone of the HOI classification model. As shown in Table 2, the combination of CLIP-based image and text models achieves the best performance thanks to their better compatibility.

| Backbone (pre-training) | Random Initialization | BERT Embedding | CLIP Embedding |
|-------------------------|-----------------------|----------------|----------------|
| ImageNet-1K             | 44.1                  | 53.5 (+9.4)    | 54.7 (+10.6)   |
| ImageNet-21K            | 44.2                  | 53.9 (+9.7)    | 55.1 (+10.9)   |
| CLIP                    | 36.8                  | 51.0 (+14.2)   | 60.5 (+23.7)   |

Table 2: Ablation study on classifier initialization and pre-trained image encoder. All models use the LSE-Sign loss and are evaluated on the HICO dataset (mAP).

LSE-Sign Loss

Using the best model obtained above, the authors conduct an ablation study on the loss function; the results are presented in Table 3. Weighted BCE is the binary cross-entropy loss weighted by the per-class positive-negative ratio. Focal loss uses γ=2 and α=0.25 as recommended in [8]. The LSE-Sign loss uses \gamma=100 and surpasses all the other loss functions.

| Loss Function    | mAP  |
|------------------|------|
| Weighted BCE [3] | 54.7 |
| BCE              | 57.9 |
| Focal loss [8]   | 53.2 |
| LSE-Sign Loss    | 60.5 |

Table 3: Comparison between LSE-Sign and other loss functions

Comparison with Other HOI Classification Works

The comparison is carried out on both the HICO and MPII datasets; the results are shown in Table 4. DEFR outperforms all the other detection-assisted models regardless of the backbone. Here the vision transformers are CLIP-pre-trained.

| Method              | Backbone   | mAP (HICO) | mAP (MPII) |
|---------------------|------------|------------|------------|
| Girdhar et al. [12] | ResNet-101 | 34.6       | 30.6       |
| Pairwise-Part [3]   | ResNet-101 | 39.9       | 32.0       |
| HAKE [2]            | CNN-based  | 47.1       | –          |
| DEFR [7]            | ViT-B/32   | 60.5       | –          |
| DEFR [7]            | ViT-B/16   | 65.6       | 55.3       |
| DEFR [7]            | ResNet-101 | 53.6       | 43.6       |

Table 4: Comparison within HOI classification with other state-of-the-art models

Few-shot analysis

Another highlight of DEFR is that it considerably outperforms existing methods on the few-shot subsets, since the language embeddings provide strong prior knowledge to the network. The results are shown in Table 5. Here "Few@i" denotes the classes with no more than i training images. The numbers of HOI classes in Few@1, Few@5, and Few@10 are 49, 125, and 162, respectively.

| Method            | mAP (full) | Few@1 | Few@5 | Few@10 |
|-------------------|------------|-------|-------|--------|
| Pairwise-Part [3] | 39.9       | 13.0  | 19.8  | 22.3   |
| HAKE [2]          | 47.1       | 25.4  | 32.5  | 33.7   |
| DEFR (ViT-B/16)   | 65.6       | 52.7  | 56.9  | 57.2   |

Table 5: Few-shot performance of HOI Classification evaluated on HICO. 

Analysis

Analyzing the CLS-token attention maps of the backbone Transformer before and after fine-tuning on HICO (Figure 4) shows that the model pays more attention to the HOI-related objects after fine-tuning, i.e., it learns the proper features for HOI classification. This is probably because the LSE-Sign loss and the classifier weights initialized with language embeddings transmit more useful information to the weights of the encoder through backpropagation. When computing the derivative with respect to the weights of the encoder w_e, we use the chain rule as follows:

(5) \frac{\partial L}{\partial w_e}=\sum^C_{j=1}\frac{\partial L}{\partial s_j}\frac{\partial s_j}{\partial x}\frac{\partial x}{\partial w_e}

As interpreted above, \frac{\partial L}{\partial s_j} is mainly determined by the misclassified classes rather than the negative samples, and \frac{\partial s_j}{\partial x} is (up to the scale \gamma) the proxy of class j, which is initialized with a language embedding that captures the semantic meaning of the class. As a result, the model obtains better gradients for updating the weights of the encoder.
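Spelling out the first two factors (a worked step based on the definitions above, not taken from the paper):

\frac{\partial s_j}{\partial x}=\gamma\, w_j^T \quad\Rightarrow\quad \frac{\partial L}{\partial w_e}=\sum^C_{j=1}\frac{\partial L}{\partial s_j}\cdot\gamma\, w_j^T\cdot\frac{\partial x}{\partial w_e}

i.e., the encoder gradient is a combination of the semantically meaningful proxies w_j, weighted most heavily for the misclassified classes.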

Figure 4: Visualization of CLS attention map.

Personal Review & Discussion

  • The column vectors of the classifier weight matrix are the semantic proxies of the labels. Initializing the classifier with language embeddings preserves the correlations between labels and moves the weights closer to the optimum, which eases the learning process and improves accuracy, especially for few-shot classes. This gave me a whole new perspective on the classifier in neural networks; before this, I had not thought about the meaning of the classifier weights.
  • The methodology proposed in this paper can not only be applied to HOI recognition tasks but can also be generalized to the Scene Graph Generation tasks when predicting triplets jointly.
  • Comments on Experiment:
    • It seems DEFR can reduce the complexity of HOI classification since it significantly simplifies the pipeline, but so far there are no experiments showing the convergence speed or the per-batch inference time, which could be added in the future. If the inference speed improves as well, that would be good news for many real-time applications, e.g., medical robots.
    • Figure 4 shows that fine-tuning helps the model focus more on the related objects. It would be interesting to show whether the fine-tuned backbone can substitute for an object detector, e.g., by not using the off-the-shelf detector in HOI detection. Even if the result is not satisfactory, it could still be beneficial to use a detector for classification, since it can guide the model's attention to the object. For example, the model might focus more on the snow in the visual context when trying to distinguish the action <carry, snowboard>; with the bounding box, the network would pay more attention to the board, as intended.

Reference

1. Chao, Y., Wang, Z., He, Y., Wang, J., & Deng, J. (2015). HICO: A Benchmark for Recognizing Human-Object Interactions in Images. 2015 IEEE International Conference on Computer Vision (ICCV), 1017-1025.

2. Li, Y., Xu, L., Huang, X., Liu, X., Ma, Z., Chen, M., Wang, S., Fang, H., & Lu, C. (2019). HAKE: Human Activity Knowledge Engine. ArXiv, abs/1904.06539.

3. Fang, H., Cao, J., Tai, Y., & Lu, C. (2018). Pairwise Body-Part Attention for Recognizing Human-Object Interactions. ECCV.

4. Zhu, G., Zhang, L., Jiang, Y., Dang, Y., Hou, H., Shen, P., Feng, M., Zhao, X., Miao, Q., Shah, S.A., & Bennamoun (2022). Scene Graph Generation: A Comprehensive Survey. ArXiv, abs/2201.00443.

5. Chang, X., Ren, P., Xu, P., Li, Z., Chen, X., & Hauptmann, A.G. (2021). A Comprehensive Survey of Scene Graphs: Generation and Application. IEEE transactions on pattern analysis and machine intelligence, PP.

6. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML.

7. Jin, Y., et al. (2022). The Overlooked Classifier in Human-Object Interaction Recognition. ArXiv, abs/2203.05676.

8. Lin, T., Goyal, P., Girshick, R.B., He, K., & Dollár, P. (2020). Focal Loss for Dense Object Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42, 318-327.

9. Andriluka, M., Pishchulin, L., Gehler, P.V., & Schiele, B. (2014). 2D Human Pose Estimation: New Benchmark and State of the Art Analysis. 2014 IEEE Conference on Computer Vision and Pattern Recognition, 3686-3693.

10. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ArXiv, abs/2010.11929.

11. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770-778.

12. Girdhar, R., & Ramanan, D. (2017). Attentional Pooling for Action Recognition. ArXiv, abs/1711.01467.

13. Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv, abs/1810.04805.

14. Mallya, A., & Lazebnik, S. (2016). Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering. ECCV.


*This preprint was renamed The Overlooked Classifier in Human-Object Interaction Recognition in version 2.
