1. Introduction
This paper uses Convolutional Neural Networks (CNNs) and Graph Neural Networks (GNNs) to generate scene graphs from surgical images and videos.
1.1 Motivation
Robot-Assisted Minimally Invasive Surgery (RMIS) holds great potential in the medical domain: its accuracy and reliability can further reduce trauma during operations. However, RMIS applications still face many limitations, such as the lack of haptic feedback, a reduced field of view, and a shortage of evaluation methods.
Recently, many attempts[11, 14, 15, 16] have been made to address these issues with deep learning and computer vision. Spatial understanding enables the system to infer the implicit relationships between instruments and tissues in a surgical scene. This allows the robot to automatically execute assistive actions, such as providing haptic feedback or overlaying medical images during surgery. Fig. 1 shows an example of a scene graph describing the interactions between tissues and instruments.
Fig. 1. Scene graphs describe the interactions between tissues and surgical instruments. This graph shows that the kidney is manipulated by forceps while being cut by scissors.
1.2 Main Contributions of the Paper
- Improve model performance by applying Label Smoothing (LS) to node and edge embeddings.
- Incorporate SageConv[1] and an attention mechanism[2] into the previous graph parsing network[3], in order to improve predictions of the adjacency matrix and the feature embeddings.
- Based on an existing dataset[5], create new annotations of bounding boxes and tool-tissue interaction graphs with the help of clinical expertise.
2. Proposed Approach
This section mainly focuses on the GNN part, as many well-studied CNN-based visual feature extractors can be applied to the pipeline in a plug-in fashion.
To generate scene graphs from images/videos, one approach is a CNN-GNN framework: the CNN detects objects (i.e., tissues and instruments) and extracts visual features, while the GNN generates scene graphs based on these features. Since potential interactions are unknown, the authors propose to iteratively learn the graph structure via message passing and, at the same time, to improve the node and edge embeddings.
The scene graph is modelled as G = (V, E, Y), where the nodes v \in V are defective tissues or instruments, and the edges e \in E with e = (v, w) \in V \times V represent the interactions between nodes. The interaction states y_v \in Y are labels that indicate the interaction types among nodes.
Each frame of a surgical video is parsed into a graph g = (V_g, E_g, Y_g), where g is a subgraph of G. Given a frame, an object detection network (e.g., SSD[9]) is used to detect bounding boxes. Based on these bounding boxes, Label-Smoothed (LS) features F are extracted, consisting of node embeddings F_v and edge embeddings F_e. The model then infers the parse graph from these features by g^* = argmax_g p(Y_g | V_g, E_g, F) p(V_g, E_g | F, G).
2.1 A Quick Introduction to GPNN[3]
GPNN[3] was originally introduced to recognize Human-Object Interactions (HOI) in images and videos. It iteratively learns the graph structure during message passing. Since the paper's model is inspired by GPNN[3] with a few modifications, it is helpful to understand GPNN[3] first.
As shown in Fig. 2, the initial graph is fully connected with equal edge weights. During training, i.e., message passing, some edges receive larger weights than others (thicker edges in Fig. 2), indicating that the corresponding two objects are likely to interact with each other.
Fig.2 Illustration of GPNN[3]’s learning process
Fig. 3. Forward pass of GPNN[3]
Fig. 3 shows the forward pass of GPNN[3], which takes the node embeddings F_v and edge embeddings F_e and outputs parse graphs with a soft adjacency matrix A^s. It consists of four functions:
- Link function, which consists of convolutional layers followed by a sigmoid activation. It predicts a soft adjacency matrix A^s that represents the connectivity between nodes, i.e., the graph structure. The sigmoid limits each element of A^s to [0,1]; these elements later serve as weights during message passing.
- Message function, which consists of three groups of fully connected (FC) layers. It passes the current node embedding, the neighboring node embeddings, and the edge embeddings through separate FC groups, then concatenates these features before computing a weighted sum with A^s. The result is treated as a summary of incoming messages.
- Update function, which is a Gated Recurrent Unit (GRU). Given the history states and the summarized messages, it recurrently updates the current node states.
- Readout function, which is again a set of fully connected layers with a task-specific activation function: a sigmoid for one-class outputs and a softmax for multi-class outputs.
Based on these four functions, GPNN[3] iteratively learns the soft adjacency matrix A^s, which serves as the weights in the message function. Elements of A^s with higher values indicate higher probabilities that the corresponding two objects interact.
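To make the interplay of these four functions concrete, below is a minimal PyTorch sketch of a GPNN-style forward pass. It is an illustration only: the layer sizes, the number of message passing steps, and the use of linear layers in place of the paper's convolutional link function are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GPNNSketch(nn.Module):
    """Minimal sketch of GPNN-style message passing (layer choices are assumptions)."""

    def __init__(self, node_dim=200, edge_dim=200, num_classes=12, steps=3):
        super().__init__()
        self.steps = steps
        # Link function: predicts a soft adjacency matrix from edge features.
        self.link = nn.Sequential(nn.Linear(edge_dim, 1), nn.Sigmoid())
        # Message function: separate projections for the node itself, its neighbors, and edges.
        self.msg_self = nn.Linear(node_dim, node_dim)
        self.msg_neigh = nn.Linear(node_dim, node_dim)
        self.msg_edge = nn.Linear(edge_dim, node_dim)
        # Update function: GRU cell over node states.
        self.update = nn.GRUCell(3 * node_dim, node_dim)
        # Readout function: per-node interaction scores (sigmoid here; task dependent).
        self.readout = nn.Sequential(nn.Linear(node_dim, num_classes), nn.Sigmoid())

    def forward(self, f_v, f_e):
        # f_v: (N, node_dim) node embeddings; f_e: (N, N, edge_dim) edge embeddings.
        n = f_v.size(0)
        h = f_v                                    # initial node states
        adj = self.link(f_e).squeeze(-1)           # (N, N) soft adjacency A^s in [0, 1]
        # A fuller implementation would re-estimate A^s at every step as embeddings improve.
        for _ in range(self.steps):
            # Build messages for every ordered node pair (node i receives from node j).
            m = torch.cat([
                self.msg_self(h).unsqueeze(1).expand(n, n, -1),
                self.msg_neigh(h).unsqueeze(0).expand(n, n, -1),
                self.msg_edge(f_e),
            ], dim=-1)                             # (N, N, 3 * node_dim)
            # Weighted sum over neighbors using the soft adjacency as weights.
            agg = (adj.unsqueeze(-1) * m).sum(dim=1)
            h = self.update(agg, h)                # recurrent node-state update
        return adj, self.readout(h)                # graph structure + node interaction scores

# Example: 4 detected objects with 200-d node/edge embeddings.
adj, scores = GPNNSketch()(torch.randn(4, 200), torch.randn(4, 4, 200))
```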
2.2 Proposed Network Architecture in the Paper
Fig. 4 shows the architecture proposed by the authors. The forward pass can be summarized as follows:
- Use a CNN to detect bounding boxes of objects in the image.
- Extract label-smoothed features F based on the bounding boxes, consisting of node embeddings F_v and edge embeddings F_e.
- Feed these features into a modified GPNN[3] to obtain the parse graph. In contrast to the original four functions discussed in Section 2.1, the authors use SageConv[1] and a spatial attention module[2] to enhance the link function (see the sketch below); the other three functions remain the same.
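As a rough illustration of how an enhanced link function might look, the sketch below combines a SAGE-style neighborhood aggregation with a simple attention gate before scoring node pairs. The exact layers, dimensions, and the form of the attention module are assumptions; the paper's implementation may differ.

```python
import torch
import torch.nn as nn

class EnhancedLink(nn.Module):
    """Hypothetical link function with SAGE-style aggregation and an attention gate."""

    def __init__(self, node_dim=200, edge_dim=200):
        super().__init__()
        # SAGE-style update: combine a node's own embedding with the mean of its neighbors.
        self.sage = nn.Linear(2 * node_dim, node_dim)
        # Attention gate: re-weights each node embedding by a learned scalar in [0, 1].
        self.att = nn.Sequential(nn.Linear(node_dim, 1), nn.Sigmoid())
        # Edge scorer: maps the two endpoint embeddings and the edge embedding to a score.
        self.score = nn.Sequential(nn.Linear(2 * node_dim + edge_dim, 1), nn.Sigmoid())

    def forward(self, f_v, f_e):
        # f_v: (N, node_dim) node embeddings; f_e: (N, N, edge_dim) edge embeddings.
        n = f_v.size(0)
        # Initial graph is fully connected, so the neighborhood mean is the mean over all nodes.
        neigh_mean = f_v.mean(dim=0, keepdim=True).expand(n, -1)
        h = torch.relu(self.sage(torch.cat([f_v, neigh_mean], dim=-1)))
        h = h * self.att(h)                                 # attention-gated node embeddings
        # Pairwise scores form the soft adjacency matrix A^s.
        hi = h.unsqueeze(1).expand(n, n, -1)
        hj = h.unsqueeze(0).expand(n, n, -1)
        adj = self.score(torch.cat([hi, hj, f_e], dim=-1)).squeeze(-1)   # (N, N) in [0, 1]
        return adj
```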
In addition, the authors tried several CNNs trained for different tasks as the visual backbone, which is discussed in later sections.
Fig.4. Proposed architecture in the paper
2.3 Label-Smoothed (LS) Features
Label Smoothing (LS) is a technique that prevents the model from becoming over-confident in its predictions [4]. The intuition is that datasets may contain mislabeled samples, so directly minimizing the loss on hard targets can be unwise. With LS, the network can generalize better and extract potentially more 'useful' features. As observed in Fig. 5, features extracted from the same class are more tightly clustered with LS.
To calculate an LS target, a smoothing factor is needed. If T_k is the true one-hot target and \epsilon is the smoothing factor, then the smoothed label over K classes is T_k^{LS} = T_k(1-\epsilon) + \epsilon / K, and the LS cross-entropy loss for the model prediction P_k is CE^{LS} = \sum_{k=1}^K -T_k^{LS} \log(P_k).
In the paper, the authors apply LS to both the node embeddings F_v and the edge embeddings F_e.
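The smoothed target and LS cross-entropy above translate directly into a few lines of PyTorch. The sketch below is illustrative only; the smoothing factor \epsilon = 0.1 is an assumed value, not one reported in the paper.

```python
import torch
import torch.nn.functional as F

def label_smoothed_ce(logits, target, eps=0.1):
    """Cross-entropy with label smoothing: T_k^LS = T_k * (1 - eps) + eps / K."""
    k = logits.size(-1)
    one_hot = F.one_hot(target, num_classes=k).float()
    t_ls = one_hot * (1.0 - eps) + eps / k          # smoothed targets
    log_p = F.log_softmax(logits, dim=-1)           # log of predicted class probabilities
    return -(t_ls * log_p).sum(dim=-1).mean()       # CE^LS averaged over the batch

# Usage: logits of shape (batch, K) and integer class targets of shape (batch,).
loss = label_smoothed_ce(torch.randn(4, 12), torch.tensor([0, 3, 7, 11]))
```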
Fig. 5. Node feature extraction without and with label smoothing (LS). The authors choose five different classes and plot the t-SNE of their features. With LS, features from the same class are more compact.
3. Experiments
3.1 Dataset
Based on the existing 2018 robotic scene segmentation challenge dataset[5], the authors annotate bounding boxes for all tissues and instruments appearing in the scene. In addition, the interaction graphs are annotated with the help of clinical expertise. Fig. 6 gives two examples from the dataset with corresponding images and scene graphs.
Dataset summary:
- Interaction types: grasping, retraction, tissue manipulation, tool manipulation, cutting, cauterization, suction, looping, suturing, clipping, staple, ultrasound sensing
- Total number of interaction types: 12
- Number of video sequences: 15
- Number of frames per video: 149
- Resolution: 1280 x 1024
- Train/Validation split: 12/3 videos (1788/449 frames)
Fig. 6. Examples of the dataset. It contains bounding boxes for defective tissues and surgical instruments, as well as interaction scene graphs.
3.2 Implementation Details
To extract features from frames, different types of networks, e.g., classification (ResNet18[7]), detection (YOLOv3[8], SSD[9]), segmentation (LWL[10]), and multitask learning (AP-MTL[11]), are fine-tuned with the corresponding annotations. The authors then apply adaptive average pooling [3, 4] to these features to obtain node and edge embeddings, each of dimension 200, which are fed into the GNN to predict scene graphs. A weighted multi-class multi-label hinge loss [4, 6] and the Adam optimizer with a learning rate of 10^{-5} are employed for training. The code is implemented in PyTorch and trained on an NVIDIA RTX 2080 Ti GPU.
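A rough sketch of this setup is given below: adaptive average pooling reduces each region's backbone features to a fixed-length vector, which is projected to the 200-dimensional embedding, and training uses Adam with a learning rate of 10^{-5} and a multi-label hinge loss. The input channel count, the linear projection, and the specific PyTorch loss class are assumptions standing in for the paper's weighted loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ROIEmbedder(nn.Module):
    """Hypothetical pooling head: adaptive average pooling over cropped backbone
    features, followed by a linear projection to a 200-d node/edge embedding."""

    def __init__(self, in_channels=512, dim=200):   # 512 matches e.g. ResNet18's last stage (assumption)
        super().__init__()
        self.proj = nn.Linear(in_channels, dim)

    def forward(self, roi_feat):                     # roi_feat: (B, C, H, W) features cropped to a box
        pooled = F.adaptive_avg_pool2d(roi_feat, 1).flatten(1)   # (B, C)
        return self.proj(pooled)                                 # (B, 200)

embedder = ROIEmbedder()
optimizer = torch.optim.Adam(embedder.parameters(), lr=1e-5)  # Adam, lr = 10^-5 as reported
criterion = nn.MultiLabelMarginLoss()  # built-in multi-label hinge loss, standing in for the paper's weighted variant
```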
3.3 Results
Fig. 7 shows the qualitative performance of the SSD-LS model. Note that both GPNN[3] and the paper’s model made a wrong prediction between the kidney and the forceps. However, the authors argue that their model’s prediction is more reasonable compared to GPNN[3]’s, as “Tissue Manipulation” makes more sense than “Tool Manipulation” for the interaction between forceps and kidney.
Fig. 7. Qualitative analysis. The predictions of both the authors' model (d) and GPNN (e) are wrong, as highlighted in red, but the authors argue that their prediction is more reasonable.
Table 1 shows the quantitative results. Firstly, using LS raises the Mean Average Precision (mAP) for detection models like SSD[9] and YOLOv3[8]. Secondly, the proposed model outperforms many other graph-based models, such as GPNN[3], GraphSage[1] (using the readout function from [3]), GAT[12], and Graph Hpooling[13]. Note that GPNN[3] performs marginally better in terms of hinge loss and recall.
The ablation study of the proposed model is shown in Table 2. Every module added onto the default GPNN[3] base module has a positive effect on the model’s performance.
Note: reference numbers in the table screenshot may differ from those used in this article.
Fig. 8 reports the mAP of predicted scene graphs with different types of visual backbones: classification (ResNet18[7]), multitask learning (AP-MTL[11]), detection (SSD[9]), and segmentation (LWL[10]). ResNet18[7] yields the best performance among these methods, and using LS further boosts performance.
However, the authors do not provide any analysis of this result, nor do they offer any comparison of these models' sizes or capabilities.
Fig.8 mAP of predicted scene graphs with different types of visual backbones
4. Conclusion
This paper presents a framework to predict scene graphs from surgical scenes. It first extracts visual features with a CNN, then uses a GNN to iteratively generate parse graphs that predict the interactions between tissues and instruments. To better predict potential interactions, SageConv[1] and an attention mechanism[2] are integrated into the previous GPNN[3] model, and Label-Smoothed (LS) features are used. Such spatio-temporal scene graph generation helps future robots understand surgical scenes.
5. Student's Review
This paper introduces several valuable ideas to improve performance in scene graph generation. It also makes sense to use CNNs for visual features before parsing graphs with GNNs. The way the graph structure is learned iteratively, by starting from a fully connected graph with equal weights and then dynamically adjusting these weights, is inspiring.
However, there are a few points on which the paper is not clear enough:
- To briefly recap, GPNN[3] has four functions: link, message, update, and readout. In particular, the message function aggregates information from neighboring nodes. In this paper, the authors use SageConv[1] and an attention mechanism[2] after the link function, which are information aggregators as well. This seems redundant to me. It remains an open question which approach is better: simply running SageConv[1] and the attention mechanism[2] for more iterations, or keeping two redundant (but different) aggregation layers as in the paper.
- The paper states: "Qualitative comparison with GraphSage[1], GAT[9] and Hpooling[13] models were not performed, as these models are unable to predict the adjacent matrix between the graph nodes." However, Table 1 reports quantitative results for those models, and it is not clear from the paper how these results were generated.
- The authors could have evaluated the temporal consistency of the model in the experiments, given that GRUs are used to recurrently update the embeddings. It would be good to verify that the GRUs work as expected; for example, scene graphs within a certain time range should be consistent, without abrupt changes. A simple metric is sketched below.
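One crude consistency score (purely hypothetical, not from the paper) could measure the fraction of node-pair predictions that stay unchanged between consecutive frames:

```python
import torch

def temporal_consistency(pred_labels):
    """pred_labels: (T, N, N) predicted interaction class per node pair over T frames.
    Returns the fraction of predictions that are identical between consecutive frames."""
    same = (pred_labels[1:] == pred_labels[:-1]).float()   # (T-1, N, N)
    return same.mean().item()

# Example: 5 frames, 4 nodes, random interaction class ids in [0, 12).
print(temporal_consistency(torch.randint(0, 12, (5, 4, 4))))
```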
- It would be interesting to see some analysis of why classification models yield the best results, as well as a comparison of these CNN models' capabilities.
References
[1] Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. In: Advances in neural information processing systems. pp. 1024–1034 (2017)
[2] Roy, A.G., Navab, N., Wachinger, C.: Concurrent spatial and channel ‘squeeze & excitation’ in fully convolutional networks. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 421–429. Springer (2018)
[3] Qi, S., Wang, W., Jia, B., Shen, J., Zhu, S.C.: Learning human-object interactions by graph parsing neural networks. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 401–417 (2018)
[4] Müller, R., Kornblith, S., Hinton, G.E.: When does label smoothing help? In: Advances in Neural Information Processing Systems. pp. 4696–4705 (2019)
[5] Allan, M., Kondo, S., Bodenstedt, S., Leger, S., Kadkhodamohammadi, R., Luengo, I., Fuentes, F., Flouty, E., Mohammed, A., Pedersen, M., et al.: 2018 robotic scene segmentation challenge. arXiv preprint arXiv:2001.11190 (2020)
[6] Moore, R., DeNero, J.: L1 and l2 regularization for multiclass hinge loss models. In: Symposium on machine learning in speech and language processing (2011)
[7] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
[8] Redmon, J., Farhadi, A.: Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
[9] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: Single shot multibox detector. In: European conference on computer vision. pp. 21–37. Springer (2016)
[10] Islam, M., Li, Y., Ren, H.: Learning where to look while tracking instruments in robot-assisted surgery. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 412–420. Springer (2019)
[11] Islam, M., VS, V., Ren, H.: Ap-mtl: Attention pruned multi-task learning model for real-time instrument detection and segmentation in robot-assisted surgery. arXiv preprint arXiv:2003.04769 (2020)
[12] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)
[13] Zhang, Z., Bu, J., Ester, M., Zhang, J., Yao, C., Yu, Z., Wang, C.: Hierarchical graph pooling with structure learning. arXiv preprint arXiv:1911.05954 (2019)
[14] Allan, M., Ourselin, S., Thompson, S., Hawkes, D.J., Kelly, J., Stoyanov, D.: Toward detection and localization of instruments in minimally invasive surgery. IEEE Transactions on Biomedical Engineering 60(4), 1050–1058 (2012)
[15] Laina, I., Rieke, N., Rupprecht, C., Vizcaíno, J.P., Eslami, A., Tombari, F., Navab, N.: Concurrent segmentation and localization for tracking of surgical instruments. In: International conference on medical image computing and computer-assisted intervention. pp. 664–672. Springer (2017)
[16] Pakhomov, D., Premachandran, V., Allan, M., Azizian, M., Navab, N.: Deep residual learning for instrument segmentation in robotic surgery. In: International Workshop on Machine Learning in Medical Imaging. pp. 566–573. Springer (2019)