This is my blog post on the paper "Global-Reasoned Multi-Task Learning Model for Surgical Scene Understanding" by Lalithkumar Seenivasan, Sai Mitheran, Mobarakol Islam, and Hongliang Ren (Senior Member, IEEE).

INTRODUCTION


Fig. 1: Enhancing surgical scene understanding (tool interaction detection and instrument segmentation) through global-local relational reasoning.


Global and local relational reasoning enable scene understanding models to perform human-like scene analysis and understanding. Inspired by prior work that proposed a graph-based global reasoning network (GloRe) [4], which reasons in the latent space to efficiently capture global relations, we propose a globally-reasoned multi-task surgical scene understanding model that performs instrument segmentation and detects tool-tissue interaction. Globally-reasoned surgical scene understanding is critical for surgical skill assessment, real-time and post-surgical analysis, augmented tactile feedback, and automated surgical report generation. By combining the GloRe unit, which reasons in the latent space, with multi-scale-feature decoder aggregation, which captures local relations at multiple scales, the semantic segmentation model aims to perform better scene reasoning. To detect tool actions, we improve upon the visual-semantic graph attention network (VS-GAT) [20] and introduce the globally-reasoned VS-GAT. By embedding globally-reasoned latent features into VS-GAT, we hypothesize that the model can detect globally-reasoned node-to-node interactions. By sharing the feature encoder and the GloRe unit between the two tasks, we also reduce the computational cost compared to running two independent single-task models.


Key Contributions

Propose a globally-reasoned multi-task learning (MTL) surgical scene understanding model that performs instrument segmentation and tool-tissue interaction detection.

Improve the MTL model’s segmentation performance by incorporating latent global interaction reasoning and introducing multi-scale local reasoning.

Utilize the MTL model setup to enhance interaction detection performance by sharing a generalized feature extractor for visual feature extraction and incorporating globally-reasoned features from the segmentation module into the scene graph (tool interaction detection) model.

Study the performance of sequential and knowledge distillation (KD) based optimization techniques for achieving optimal MTL model convergence.


RELATED WORK

Surgical Instrument Segmentation

To address the spatial inconsistency problem, instance-based segmentation has been proposed [17, 15]. Current state-of-the-art (SOTA) models in instrument segmentation include MF-TAPNet [15] and ISINet [8]. MF-TAPNet [15] employs an attention mechanism and utilizes temporal optical flow. Built on top of Mask R-CNN [10], ISINet [8] employs a temporal consistency strategy to take advantage of the temporal frame sequence.

A refined attention-based network, RASNet [23], has also been proposed; it utilizes an attention mechanism for semantic segmentation to leverage the global context of high-level features and focus on key regions of the image.

As an alternative to these prior works, we propose a simple and efficient globally and locally reasoned model that achieves competitive performance against existing SOTA models.

Surgical Tool Interaction Detection

Initially, human-object interaction detection was achieved by employing Fast R-CNN [21] and Faster R-CNN [7].

The robustness issue of these CNN-based approaches was addressed by formulating the interaction detection task in non-Euclidean space and employing graph networks to detect interactions [24, 20].

The graph parsing neural network (GPNN) [24] models each scene as a sparse graph, with nodes representing the objects and edges denoting the presence of interaction. While GPNN relies mainly on visual features to detect object-to-object interaction, the visual-semantic graph attention network (VS-GAT) [20] additionally utilizes spatial and semantic features on top of visual features to detect interactions.

Here, we further improve VS-GAT's interaction detection by including globally-reasoned features.

Multi-Task Learning

A single MTL model offers a computational advantage over multiple single-task learning (STL) models. 

GradNorm [5] helps balance the learning of independent task sub-modules, thereby balancing independent task influence on the shared module and improving synchronization in model convergence. 

An attention-based MTL optimization technique [12] has also been proposed that enables sequential convergence of the model's independent tasks.

In this work, we also implement and study the performance of (i) sequential, (ii) vanilla, and (iii) KD-based MTL optimization in training our proposed MTL model.

METHODOLOGY

Fig. 2: The proposed network architecture. The proposed globally-reasoned multi-task scene understanding model consists of a shared feature extractor. The segmentation module performs latent global reasoning (GloRe [4] unit) and local reasoning (multi-scale local reasoning) to segment instruments. To detect tool interaction, the scene graph (tool interaction detection) model incorporates the global interaction space features to further improve the performance of the visual-semantic graph attention network [20].

Global And Local Reasoning For Instrument Segmentation


Fig. 3: Multi-scale global reasoning for instrument segmentation


A simple encoder-decoder pair incorporating global reasoning is employed to achieve competitive performance with the SOTA models in instrument segmentation. 

The GloRe unit [4] is employed to reason about global interactions in the latent space. Since this reasoning is limited to the latent interaction space, we also include a Multi-Scale Local Reasoning (MSLR) module.
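To make the latent-space reasoning concrete, below is a minimal PyTorch sketch of a GloRe-style unit: features are softly projected from coordinate space into a small set of interaction-space nodes, reasoned over with a lightweight graph convolution, and projected back. The node count, channel sizes, and exact graph-convolution form are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GloReSketch(nn.Module):
    """Illustrative GloRe-style unit [4]: project coordinate-space features
    into a small latent interaction space, reason over it with a graph
    convolution, then project back and fuse residually."""
    def __init__(self, in_ch: int, num_nodes: int = 64, node_ch: int = 128):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, num_nodes, kernel_size=1)   # soft node assignment
        self.reduce = nn.Conv2d(in_ch, node_ch, kernel_size=1)   # channel reduction
        # graph reasoning: one conv over the node dim, one over the channel dim
        self.gcn_node = nn.Conv1d(num_nodes, num_nodes, kernel_size=1)
        self.gcn_chan = nn.Conv1d(node_ch, node_ch, kernel_size=1)
        self.expand = nn.Conv2d(node_ch, in_ch, kernel_size=1)   # back to input channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        assign = self.proj(x).flatten(2)                   # (b, N, h*w)
        feats = self.reduce(x).flatten(2)                  # (b, C', h*w)
        nodes = torch.bmm(assign, feats.transpose(1, 2))   # coordinate -> interaction space
        nodes = nodes + self.gcn_node(nodes)               # relation reasoning across nodes
        nodes = self.gcn_chan(nodes.transpose(1, 2)).transpose(1, 2)  # channel update
        out = torch.bmm(assign.transpose(1, 2), nodes)     # back-projection to pixels
        out = out.transpose(1, 2).reshape(b, -1, h, w)
        return x + self.expand(out)                        # residual fusion
```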

Here, multi-scale decoder aggregation is performed to capture multi-scale local (neighborhood) relations in coordinate space. For the decoder block, we design a lightweight decoder with (a) a conv block (conv-BatchNorm-ReLU), (b) dropout, and (c) a final conv layer, as sketched below.
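A minimal PyTorch sketch of such a decoder block; the channel sizes and dropout rate are illustrative assumptions rather than the paper's exact values:

```python
import torch.nn as nn

def decoder_block(in_ch: int, mid_ch: int, num_classes: int, p: float = 0.1) -> nn.Sequential:
    """Lightweight decoder as described: conv block (conv-BatchNorm-ReLU),
    dropout, then a final conv layer producing per-class logits."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(mid_ch),
        nn.ReLU(inplace=True),
        nn.Dropout2d(p),
        nn.Conv2d(mid_ch, num_classes, kernel_size=1),
    )
```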


To improve instrument segmentation, three variants of global reasoning have been studied: (i) vanilla GR, (ii) multi-scale global reasoning (MSGR), and (iii) multi-scale local reasoning and GR (MSLRGR).

(i) Vanilla GR: the GloRe unit [4] is naively implemented to reason on the encoder's latent features.

(ii) Multi-scale global reasoning (MSGR): the GloRe unit is employed to reason on multi-scale interactions, as shown in Fig. 3.

(iii) Multi-scale local reasoning and GR (MSLRGR): reasoning is achieved by combining vanilla GR with multi-scale local (neighborhood) reasoning (MSLR).

Global Reasoning For Interaction Detection

The VS-GAT [20] network employs two sub-graphs: (a) a visual graph ($G_v$) and (b) a semantic graph ($G_s$), embedded with visual features ($F_{vf}$) and semantic features ($F_{semf}$), respectively. The two graphs are propagated and fused to form a combined graph ($G_c$), whose edges are embedded with spatial features ($F_{sf}$, derived from bounding-box locations). We append the global interaction space features (GISF, $F_{GISF}$) from the segmentation module's GloRe unit to the combined graph's edges. This allows the model to predict interactions based on both node-to-node and global latent interaction reasoning:

$Y = \mathcal{G}(F_{vf}, F_{semf}, F_{sf}, F_{GISF})$
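Conceptually, embedding GISF into the combined graph amounts to concatenating a pooled GloRe interaction-space feature onto every edge feature vector. A minimal sketch; the function name and tensor shapes are assumptions for illustration:

```python
import torch

def embed_gisf(edge_feats: torch.Tensor, gisf: torch.Tensor) -> torch.Tensor:
    """Append global interaction space features to the combined graph's edges.
    edge_feats: (num_edges, d_sf) spatial features F_sf on each edge.
    gisf: (d_gisf,) feature pooled from the GloRe interaction space,
    broadcast to every edge of the combined graph."""
    gisf_per_edge = gisf.unsqueeze(0).expand(edge_feats.size(0), -1)
    return torch.cat([edge_feats, gisf_per_edge], dim=1)  # (num_edges, d_sf + d_gisf)
```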

Multi-Task Optimization

To address the asynchronous convergence problem, we explore three different optimization techniques.


The first, Vanilla-MTL (V-MTL) optimization, naively combines the losses of both tasks during training.

$L_{V\text{-}MTL} = \alpha \cdot L_{sg} + (1 - \alpha) \cdot L_{seg}$


In the second variant, KD-based MTL (KD-MTL) optimization [19] is explored. 

KD-MTL favors the segmentation task when training the feature encoder. Here, the task losses are combined with a Kullback-Leibler divergence (KLD) [18] loss between the feature encoder outputs of the STL segmentation model and the MTL model. By reducing this KLD loss, we aim to improve the convergence of the MTL model's segmentation task.

$L_{KD\text{-}MTL} = \alpha \cdot L_{sg} + L_{seg} + L_{KLD\text{-}seg}$
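A minimal PyTorch sketch of the two combined losses. The default α = 0.5 and the channel-softmax construction of the KLD distributions are assumptions; the paper may weight and normalize differently:

```python
import torch.nn.functional as F

def vmtl_loss(loss_sg, loss_seg, alpha=0.5):
    """Vanilla-MTL: convex combination of the scene graph and segmentation losses."""
    return alpha * loss_sg + (1 - alpha) * loss_seg

def kdmtl_loss(loss_sg, loss_seg, mtl_enc_out, stl_enc_out, alpha=0.5):
    """KD-MTL: task losses plus a KL-divergence term that pulls the MTL
    feature encoder's output toward the STL segmentation encoder's output."""
    kld = F.kl_div(
        F.log_softmax(mtl_enc_out.flatten(1), dim=1),   # MTL encoder as log-probs
        F.softmax(stl_enc_out.flatten(1), dim=1),       # STL encoder as target probs
        reduction="batchmean",
    )
    return alpha * loss_sg + loss_seg + kld
```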


The final optimization technique involves optimizing the MTL model sequentially (S-MTL). 
As shown in Algorithm 1, the MTL model's feature encoder and segmentation module are first trained based on the segmentation loss. Upon convergence, the weights of the feature encoder and segmentation blocks are frozen. The scene graph is then trained on interaction detection until convergence.

Algorithm 1: Sequential multi-task (S-MTL) optimization

1: Initialize model weights: shared feature extractor (W_sh), scene segmentation (W_seg), scene graph (W_sg)

2: Set gradient accumulators to zero: dW_sh ← 0, dW_seg ← 0, dW_sg ← 0

3: Optimize feature extractor and segmentation network:
   while segmentation task not converged do
       // segmentor and feature extractor gradients w.r.t. segmentation loss L_seg
       dW_sh ← dW_sh + Σ_i δ_i ∇_{W_sh} L(W_sh, W_seg)
       dW_seg ← dW_seg + Σ_i δ_i ∇_{W_seg} L(W_sh, W_seg)
   end while

4: Optimize scene graph:
   while scene graph task not converged do
       // scene graph block gradients w.r.t. scene graph loss L_sg
       dW_sg ← dW_sg + Σ_i δ_i ∇_{W_sg} L(W_sh, W_seg, W_sg)
   end while
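A minimal PyTorch sketch of this two-stage schedule. The module, criterion, and dataloader names (feature_extractor, seg_head, scene_graph, seg_loader, sg_loader, the convergence tests) are assumed placeholders, not the authors' code:

```python
import itertools
import torch

# Stage 1: train the shared encoder + segmentation head on the segmentation loss.
opt_seg = torch.optim.Adam(
    itertools.chain(feature_extractor.parameters(), seg_head.parameters()))
while not seg_converged():                      # convergence test is task-specific
    for imgs, masks in seg_loader:
        loss_seg = seg_criterion(seg_head(feature_extractor(imgs)), masks)
        opt_seg.zero_grad(); loss_seg.backward(); opt_seg.step()

# Stage 2: freeze encoder and segmentation weights, then train the scene graph.
for p in itertools.chain(feature_extractor.parameters(), seg_head.parameters()):
    p.requires_grad = False

opt_sg = torch.optim.Adam(scene_graph.parameters())
while not sg_converged():
    for batch in sg_loader:
        with torch.no_grad():
            feats = feature_extractor(batch["imgs"])   # frozen shared features
        loss_sg = sg_criterion(scene_graph(feats, batch["graph"]), batch["labels"])
        opt_sg.zero_grad(); loss_sg.backward(); opt_sg.step()
```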

EXPERIMENTS

Dataset

The model is trained and evaluated for interaction detection and instrument segmentation on the MICCAI Endoscopic Vision Challenge 2018 [1] dataset.

Implementation Details

We employ cross-entropy loss for the segmentation loss and a multi-label loss for the interaction detection loss. The models are trained using the Adam optimizer [16]. The feature extractor is initialized with ImageNet pre-trained weights. The initial learning rate is set to xxx and is decayed by a factor of 0.98 every 10 epochs. Our models are trained for 130 epochs with a batch size of 4.
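In PyTorch, this schedule could look like the following; since the initial learning rate is elided above ("xxx"), the value here is only a placeholder assumption:

```python
import torch

model = torch.nn.Linear(8, 2)  # stand-in for the actual network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # 1e-3 is a placeholder
# decay the learning rate by a factor of 0.98 every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.98)

for epoch in range(130):       # 130 epochs, batch size 4 per the text
    # ... one training epoch over batches of 4 ...
    scheduler.step()
```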

Multi-Task Model Improving Single-Task Performance


Fig. 4: Variants of feature sharing between the segmentation and scene graph modules in multi-task setting to improve single-task performance


Tool interaction detection: Acc, mAP, Recall. Segmentation: mIoU, P-Acc, class-wise IoU (T0-T7).

| Model | Acc | mAP | Recall | mIoU | P-Acc | T0 | T1 | T2 | T3 | T4 | T5 | T6 | T7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SOTA (Surgical scene graph) |  |  |  |  |  |  |  |  |  |  |  |  |  |
| GPNN [24] | 0.5500 | 0.1934 | - | - | - | - | - | - | - | - | - | - | - |
| Islam et al. [13] | 0.4802 | 0.2157 | - | - | - | - | - | - | - | - | - | - | - |
| G-Hpooling [28] | 0.3321 | 0.1523 | - | - | - | - | - | - | - | - | - | - | - |
| VS-GAT [20] | 0.6537 | 0.2560 | 0.2666 | - | - | - | - | - | - | - | - | - | - |
| SOTA (Surgical scene segmentation) |  |  |  |  |  |  |  |  |  |  |  |  |  |
| LinkNet34 [2] | - | - | - | 0.2610 | 0.93 | 0.9193 | 0.3581 | 0.1481 | 0.0062 | 0.6488 | 0.0004 | 0.0071 | 0.0000 |
| AlbUNet [26] | - | - | - | 0.2471 | 0.91 | 0.9090 | 0.3610 | 0.0923 | 0.0064 | 0.6082 | 0.0000 | 0.0000 | 0.0000 |
| Ternaus-UNet11 [26] | - | - | - | 0.2406 | 0.917 | 0.8904 | 0.3267 | 0.0741 | 0.0055 | 0.6283 | 0.0000 | 0.0000 | 0.0000 |
| Ternaus-UNet16 [26] | - | - | - | 0.2329 | 0.918 | 0.8811 | 0.3069 | 0.0923 | 0.0062 | 0.5763 | 0.0000 | 0.0003 | 0.0000 |
| MF-TAPNet [15] | - | - | - | 0.2489 | 0.931 | 0.9310 | 0.2961 | 0.0225 | 0.0000 | 0.7420 | 0.0000 | 0.0000 | 0.0000 |
| MF-TAPNet11 [15] | - | - | - | 0.3568 | 0.955 | 0.9729 | 0.6142 | 0.2338 | 0.0100 | 0.8420 | 0.0030 | 0.1634 | 0.0153 |
| MF-TAPNet34 [15] | - | - | - | 0.3543 | 0.952 | 0.9767 | 0.6636 | 0.3435 | 0.0284 | 0.8222 | 0.0000 | 0.0000 | 0.0000 |
| ResNet18 [11] | - | - | - | 0.3858 | 0.9487 | 0.9533 | 0.5764 | 0.3810 | 0.0008 | 0.8353 | 0.0073 | 0.2763 | 0.0557 |
| ResNet18 + GloRe [4] | - | - | - | 0.3926 | 0.9483 | 0.9524 | 0.5764 | 0.3842 | 0.0009 | 0.8256 | 0.0720 | 0.2847 | 0.0448 |
| S-MTL (Ours) |  |  |  |  |  |  |  |  |  |  |  |  |  |
| MSLRGR | 0.7003 | 0.2885 | 0.3096 | 0.4354 | 0.9638 | 0.9714 | 0.6966 | 0.4356 | 0.0015 | 0.8716 | 0.1203 | 0.3471 | 0.0387 |
| MSLRGR-GISFSG | 0.6994 | 0.3131 | 0.3157 | 0.4354 | 0.9638 | 0.9714 | 0.6966 | 0.4356 | 0.0015 | 0.8716 | 0.1203 | 0.3471 | 0.0387 |
| KD-MTL (Ours) |  |  |  |  |  |  |  |  |  |  |  |  |  |
| MSLRGR | 0.6434 | 0.2992 | 0.2818 | 0.4105 | 0.9617 | 0.9704 | 0.6775 | 0.3650 | 0.0002 | 0.8598 | 0.0709 | 0.3330 | 0.0069 |
| MSLRGR-GISFSG | 0.6710 | 0.2808 | 0.3072 | 0.4105 | 0.9611 | 0.9713 | 0.6670 | 0.3785 | 0.0028 | 0.8603 | 0.0458 | 0.3184 | 0.0401 |

TABLE I: Comparison of our proposed globally-reasoned multi-task scene understanding model (S-MTL-MSLRGR-GISFSG) and its variants against the state-of-the-art models in segmentation and tool-tissue interaction detection. T0-T7 are the tool classes defined in Section IV-A of the paper.


Fig. 5: Qualitative analysis. Top: comparison of our proposed model and its variants against select benchmark models and the Ground Truth (GT) in instrument segmentation. Bottom: comparison of our proposed model against vanilla VS-GAT [20] and the GT in interaction detection. Here, our proposed model refers to S-MTL-MSLRGR-GISFSG (sequentially trained multi-task learning model with multi-scale local reasoning and global reasoning, and its scene graph enhanced with global interaction space features).


Feature encoder variants: GR [4], MSGR, MSLR. Shared feature (SF) variants: PF, GISFSG.

| Model | GR [4] | MSGR | MSLR | PF | GISFSG | Acc | mAP | Recall | mIoU | P-Acc |
|---|---|---|---|---|---|---|---|---|---|---|
| STL |  |  |  |  |  |  |  |  |  |  |
| VS-GAT [20] |  |  |  |  |  | 0.6537 | 0.2560 | 0.2666 | - | - |
| SEG |  |  |  |  |  | - | - | - | 0.3858 | 0.9487 |
| SEG-GR | ✓ |  |  |  |  | - | - | - | 0.3926 | 0.9483 |
| SEG-MSGR |  | ✓ |  |  |  | - | - | - | 0.4350 | 0.9628 |
| SEG-MSLRGR | ✓ |  | ✓ |  |  | - | - | - | 0.4354 | 0.9638 |
| S-MTL |  |  |  |  |  |  |  |  |  |  |
| GR | ✓ |  |  |  |  | 0.6787 | 0.2578 | 0.3042 | 0.3926 | 0.9483 |
| MSGR |  | ✓ |  |  |  | 0.6813 | 0.2906 | 0.3040 | 0.4350 | 0.9628 |
| MSLRGR | ✓ |  | ✓ |  |  | 0.7003 | 0.2885 | 0.3096 | 0.4354 | 0.9638 |
| MSLRGR-PF | ✓ |  | ✓ | ✓ |  | 0.6848 | 0.2960 | 0.3157 | 0.4354 | 0.9638 |
| MSLRGR-GISFSG | ✓ |  | ✓ |  | ✓ | 0.6994 | 0.3131 | 0.3157 | 0.4354 | 0.9638 |

TABLE II: Ablation study highlighting the importance of multi-scale local and global reasoning (MSLRGR) and the use of global interaction space features in the scene graph (GISFSG) in improving the sequentially optimized multi-task learning (S-MTL) model.


Each cell reports tool interaction detection (Acc, mAP) and segmentation (mIoU, P-Acc) at the checkpoint that is best for the respective criterion.

| Model | Best in tool interaction detection (Acc / mAP / mIoU / P-Acc) | Best in instrument segmentation (Acc / mAP / mIoU / P-Acc) | Balanced performance (Acc / mAP / mIoU / P-Acc) |
|---|---|---|---|
| V-MTL-GR | 0.6193 / 0.2303 / 0.3521 / 0.9420 | 0.5073 / 0.2580 / 0.3731 / 0.9455 | 0.6064 / 0.2327 / 0.3621 / 0.9447 |
| KD-MTL-GR | 0.6615 / 0.2531 / 0.3609 / 0.9449 | 0.6391 / 0.2472 / 0.3730 / 0.9453 | 0.6555 / 0.2522 / 0.3713 / 0.9458 |
| KD-MTL-MSLRGR | 0.6649 / 0.2644 / 0.4022 / 0.9610 | 0.6322 / 0.2724 / 0.4165 / 0.9622 | 0.6434 / 0.2992 / 0.4105 / 0.9617 |
| KD-MTL-MSLRGR-SGFSEG | 0.6589 / 0.2636 / 0.3974 / 0.9593 | 0.6184 / 0.2829 / 0.4188 / 0.9607 | 0.6503 / 0.2600 / 0.4111 / 0.9608 |
| KD-MTL-MSLRGR-GISFSG | 0.6830 / 0.2818 / 0.4034 / 0.9613 | 0.6339 / 0.2819 / 0.4169 / 0.9617 | 0.6710 / 0.2808 / 0.4105 / 0.9611 |


TABLE III: Ablation study of the multi-task learning (MTL) model optimized using the Vanilla-MTL (V-MTL) and knowledge-distillation-based MTL (KD-MTL) optimization techniques.


We experiment with three variants of feature sharing in the multi-task setup (Fig. 4) to improve the MTL model's performance over the STL models:

i) Vanilla-MTL model: the two tasks share only the feature encoder, aiming to improve the interaction detection model.

ii) Scene graph features to segmentation (SGFSEG, Fig. 4 (i)): the interaction features from the VS-GAT's combined graph (Gc) edges are appended to the latent interaction space features in the segmentation module's GloRe unit.

iii) Global interaction space features to scene graph (GISFSG, Fig. 4 (ii)), used in our final proposed model: the global interaction space features from the segmentation module's GloRe unit are appended to the scene graph to improve interaction detection.

RESULTS AND EVALUATION

Quantitatively, the model's performance in segmenting instruments and detecting interactions is benchmarked against the respective single-task SOTA models. Segmentation performance is quantified using the mean intersection over union (mIoU), class-wise IoU, and pixel accuracy (P-Acc) metrics. Interaction detection performance is quantified using accuracy (Acc), mean average precision (mAP), and recall. We observe that our globally-reasoned multi-task model (S-MTL-MSLRGR-GISFSG) performs on par with, and in most cases outperforms, the STL models in both instrument segmentation (mIoU and P-Acc) and interaction detection (Acc, mAP, and Recall).
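For reference, a minimal sketch of how mIoU and P-Acc can be computed from predicted and ground-truth label maps; the paper's exact averaging conventions (e.g. handling of classes absent from a frame) may differ:

```python
import numpy as np

def miou_and_pixel_acc(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    """Compute mean IoU and pixel accuracy for integer label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    p_acc = (pred == gt).mean()
    return float(np.mean(ious)), float(p_acc)
```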

Qualitatively, it is also observed that the model's performance with global reasoning in the latent space is further enhanced by incorporating multi-scale local reasoning.

The segmentation performance of the SOTA models differs significantly from that reported in their original works due to three main changes: (i) the train and test sets, (ii) the number and type of classes, and (iii) the resolution of the input image.

DISCUSSION AND CONCLUSION

In the paper, a globally-reasoned multi-task surgical scene understanding model that performs instrument segmentation and tool-tissue interaction detection is proposed.

The model’s performance is improved by (i) introducing multi-scale local (neighborhood) reasoning and incorporating latent global reasoning and (ii) introducing global interaction space features into the scene graph. 

The detailed study also shows that the proposed model performs on par with, and in most cases outperforms, existing SOTA single-task models on the MICCAI Endoscopic Vision Challenge 2018 dataset.

STUDENT'S REVIEW

In this paper, we improved the performance of a globally-reasoned multi-task surgical scene understanding model for instrument segmentation and interaction detection by incorporating global relational reasoning in the latent interaction space and by introducing multi-scale local (neighborhood) reasoning in the coordinate space, which improves segmentation.

The S-MTL optimization algorithm: train the MTL model's feature encoder and segmentation model based on the segmentation loss → upon convergence, freeze the weights of the feature encoder and segmentation blocks → train the scene graph for interaction detection until convergence.

The ablation study on the multi-task learning (MTL) model shows that while V-MTL and KD-MTL optimization also perform well, sequential training results in the best convergence. However, because the two tasks converge asynchronously, further improvement of the segmentation task by incorporating scene graph features remains future work.

Remaining Question: 

