Geometric or topological changes in curvilinear structures such as blood vessels or nerve fibers can indicate a wide variety of pathologies. Diseases like stroke, keratitis, or retinal hematologic disorders can be better diagnosed, studied, and treated with automated segmentation of these structures in medical images. Image analysis across modalities such as magnetic resonance angiography (MRA), corneal confocal microscopy (CCM), and optical coherence tomography angiography (OCTA) would benefit from such methods [1] [2].
Challenges
Curvilinear structure segmentation still faces several challenges before it can be deployed at scale. First, research has focused mainly on 2D images rather than 3D. There are two main approaches: traditional filter-based methods and deep learning networks. The former requires deep domain knowledge and manual tuning, while the latter has typically been developed for a single image modality. In addition, sparse manual annotations and the detection of microvasculature make the task even harder. Narrow structures are difficult to detect due to the highly unbalanced distribution of edge versus non-edge voxels (in the 3D case) [1] [2].
Previous Work
As mentioned before, there are two approaches to curvilinear structure segmentation [1]:
- Filter methods: they enhance curvilinear structures while suppressing background pixels and noise. Examples:
  - Active contour based methods
  - Hessian matrix based filters
  - Multi-oriented filters
- Deep learning networks. Examples:
  - R2U-Net: a recurrent neural network embedded into a U-shaped network, used for vessel segmentation
  - DANet: a dual attention network for segmenting natural images; it upsamples attention features in the last layer
  - Uception: a U-Net architecture combined with inception modules; it has considerable GPU memory usage
Current research
CS2-Net
The CS2-Net is based on the traditional U-shaped encoder-decoder architecture for semantic segmentation. It includes residual blocks and handles 2D and 3D inputs in a unified manner. What makes this network successful is its dual self-attention mechanism, consisting of a spatial attention module and a channel attention module. Moreover, the architecture was evaluated on 6 different image modalities [1].
Methodology 2D Network
The image below describes the architecture of the 2D network at a high level. The encoder extracts high-level features from the images and feeds them to the CSAM module. The decoder restores the dimensions of the attention-enhanced features [1].
SAB
The spatial attention block (SAB) models the spatial relationship between any two given pixels. Using the high-level feature maps from the encoder's last layer, it captures the edge information of vessel-like structures in the vertical and horizontal directions. The block Q in the picture above represents the features captured in the vertical direction using a 1x3 convolution layer, while the block K represents the features captured in the horizontal direction using a 3x1 convolution layer [1].
The correlation matrix between any two points within one feature map is passed through a softmax function to promote similarities and suppress differences between the horizontal and vertical directions. This new matrix, shown in the equation below, is the attention map of similarities between spatial positions [1].
The attention-enhanced features F' are the product of the attention map of similarities and the reshaped feature V (see the CS2-Net 2D network image). The final output of the SAB module is the channel-wise addition of F and the attention-enhanced features F'. Note that there are some reshaping operations in between [1].
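To make the flow of Q, K, and V more concrete, below is a minimal PyTorch sketch of such a spatial attention block. The channel-reduction factor of 8 for Q and K is an assumption, and the intermediate reshaping is simplified compared to the original implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialAttentionBlock(nn.Module):
    """Minimal SAB sketch: Q uses a 1x3 kernel and K a 3x1 kernel
    (following the paper's vertical/horizontal naming), V carries the
    full features, and the output is the residual addition F + F'."""

    def __init__(self, in_channels: int, reduction: int = 8):
        super().__init__()
        mid = max(in_channels // reduction, 1)  # reduction factor is an assumption
        self.query = nn.Conv2d(in_channels, mid, kernel_size=(1, 3), padding=(0, 1))
        self.key = nn.Conv2d(in_channels, mid, kernel_size=(3, 1), padding=(1, 0))
        self.value = nn.Conv2d(in_channels, in_channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.key(x).flatten(2)                     # (B, C', HW)
        attn = F.softmax(torch.bmm(q, k), dim=-1)      # (B, HW, HW) similarity map
        v = self.value(x).flatten(2)                   # (B, C, HW)
        f_prime = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)  # F'
        return x + f_prime                             # F + F'
```

A feature map of shape (B, C, H, W) passes through with its shape unchanged; the residual addition at the end implements the F + F' combination described above.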
CAB
The channel attention block (CAB) is a simpler version of the SAB, focused on the interdependency of channels. It uses the unaltered high-level features F to compute the channel-wise attention map by calculating the correlation matrix between channels and applying a softmax function to enhance contrast. Similar to the SAB, the attention-enhanced features F'' are obtained by multiplying F with the channel-wise attention map. The final output of the CAB module is again the channel-wise addition of F and the attention-enhanced features F'' [1].
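Continuing the sketch above (same imports), a minimal version of the channel attention block might look as follows; note that the correlation is computed directly on the flattened, unaltered features:

```python
class ChannelAttentionBlock(nn.Module):
    """Minimal CAB sketch: attention over channel interdependencies,
    computed directly from the unaltered features F."""

    def forward(self, x):
        b, c, h, w = x.shape
        f = x.flatten(2)                                           # (B, C, HW)
        attn = F.softmax(torch.bmm(f, f.transpose(1, 2)), dim=-1)  # (B, C, C) channel map
        f_dprime = torch.bmm(attn, f).view(b, c, h, w)             # F''
        return x + f_dprime                                        # F + F''
```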
Loss function
The loss function used by the 2D CS2-Net is a plain binary cross-entropy loss [1].
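In PyTorch this amounts to a single call; the snippet below uses BCEWithLogitsLoss, the numerically stable variant on raw logits (an implementation choice here, not necessarily the authors'), with placeholder tensors:

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()                     # stable BCE on raw logits
logits = torch.randn(2, 1, 64, 64)                     # placeholder network output
targets = torch.randint(0, 2, (2, 1, 64, 64)).float()  # placeholder binary vessel masks
loss = criterion(logits, targets)
```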
Methodology 3D Network
The main difference between the 2D and the 3D network is that all basic operations are changed to support 3D inputs. For example, convolution layers now use three-dimensional kernels instead of two-dimensional ones. Moreover, the CSAM (channel and spatial attention module) is modified to account for the depth dimension of a 3D input. The image below shows the high-level architecture of this network [1].
SAB
The spatial attention block of the 3D network has an additional J block to account for the depth dimension. It uses a 1x1x3 convolution to capture information in this direction. The new attention map of similarities between spatial positions therefore uses the product of two correlation matrices, as in the equation below [1].
CAB
The channel attention block of the 3D network also has an additional J' block to account for the depth dimension. The equation for the channel-wise attention map now uses two correlation matrices between combinations of the reshaped and transposed feature matrices Q', K', and J', as shown in the equation below [1].
Loss function
Since the labels in 3D images are sparse and the annotation quality is not high for all voxels, the loss function is more complex: it is a weighted sum of a weighted cross-entropy loss and the dice loss, as shown in the equation below. Alpha is empirically fixed to 0.6 [1].
The weighted cross-entropy loss corrects the bias towards background voxels caused by the small number of vessel-like voxels (see equation below) [1].
The weight for the vessel-like voxels is calculated as shown in the equation below [1].
The dice loss ensures micro-structure segmentation via the dice similarity coefficient, using the equation below. p_i denotes the predicted probability and g_i the ground truth of vessel-like voxels. Epsilon is used to avoid numerical instability [1].
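A hedged sketch of this combined loss is shown below. The inverse-frequency pos_weight is an assumption standing in for the paper's own weighting equation; alpha = 0.6 and the epsilon-stabilized dice term follow the text:

```python
import torch
import torch.nn.functional as F


def dice_loss(probs, targets, eps=1e-6):
    """Soft dice loss; eps avoids division by zero."""
    inter = (probs * targets).sum()
    return 1 - (2 * inter + eps) / (probs.sum() + targets.sum() + eps)


def cs2net_3d_loss(logits, targets, alpha=0.6):
    """alpha * weighted CE + (1 - alpha) * dice, with alpha = 0.6 as in the paper."""
    # Inverse-frequency weight for the sparse vessel voxels; the paper derives
    # its weight with its own equation, so treat this as an assumption.
    pos_weight = (targets.numel() - targets.sum()) / targets.sum().clamp(min=1.0)
    wce = F.binary_cross_entropy_with_logits(logits, targets, pos_weight=pos_weight)
    return alpha * wce + (1 - alpha) * dice_loss(torch.sigmoid(logits), targets)
```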
Experimental setup 2D Network
The following chart shows information about the datasets used to train the 2D CS2-Net. Data augmentation was used during the training phase [1].
Source: L. Mou et al., Medical Image Analysis (2020).
The metrics used for evaluating the results are [1]:
- accuracy
- sensitivity
- specificity
- AUC or area under the ROC curve: tradeoff between the true positive rate (sensitivity) and the false positive rate
- p values: p < 0.05 is considered statistically significant
- false discovery rate: used only for corneal nerve fiber tracing
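For reference, the pixel-level metrics above can be computed from confusion-matrix counts as sketched below; evaluation_metrics is a hypothetical helper, and scikit-learn supplies the AUC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def evaluation_metrics(pred, prob, gt):
    """pred: binarized prediction, prob: soft scores, gt: ground truth;
    all flat 0/1 numpy arrays of the same length."""
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    return {
        "accuracy": (tp + tn) / gt.size,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "auc": roc_auc_score(gt, prob),  # area under the ROC curve
    }
```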
Experimental setup 3D Network
The 3D CS2-Net was trained on both real and synthetic datasets. The real dataset, MIDAS, is composed of 50 brain MRA volumes of the Circle of Willis. All individuals are healthy people between 18 and 60 years old, and half are female. There are two synthetic datasets: the first, Synthetic, is composed of 136 volumes, while the second, VascuSynth, has artificial Gaussian noise added. For the latter, three noise levels with different variances were used, indexed 1 to 3 [1].
The metrics used for evaluating the results are [1]:
- true positive rate (sensitivity)
- false positive rate
- false negative rate
- segmentation rates
  - over-segmentation
  - under-segmentation
- dice coefficient: not used on the MRA volumes, since many voxels were unlabeled
Results 2D Network
The image below shows visual results for random images from the 3 color fundus datasets. The green circles highlight where CS2-Net outperformed the other state-of-the-art methods [1].
The quantitative results for the color fundus datasets are shown in the table below. CS2-Net was not the best in every metric, but overall it outperformed the other state-of-the-art methods [1].
Visual results for the optical coherence tomography and corneal confocal microscopy datasets are shown below. The green arrows point to vessel-like structures with under-segmentation problems in the other state-of-the-art methods, while the red arrows show over-segmentation in those methods compared to CS2-Net [1].
The quantitative results for the OCT-A, OCT RPE, and CORN-1 datasets are shown in the 3 charts below. CS2-Net again outperformed all state-of-the-art methods in all metrics [1].
To give better intuition about what the attention maps produce, the following images were taken from different decoder layers for each dataset. In addition, an enlarged image is compared against other state-of-the-art methods [1].
As an additional metric, the area under the ROC curve is shown below. Here the 2D results and some 3D results appear in the same image. CS2-Net has the best tradeoff between sensitivity and specificity [1].
Results 3D Network
Visual results for random images from the MIDAS dataset are shown below. The red arrows point to sections where 3D U-Net under- or over-segments compared to CS2-Net [1].
The quantitative results for the MRA volumes are shown below. CS2-Net outperforms all state-of-the-art methods on the pre-established metrics [1].
The image below shows visual results for random images from 3 of the 4 synthetic datasets. The enlarged sections allow a comparison of how CS2-Net does a better job at curvilinear segmentation [1].
The chart below shows the strong quantitative results of CS2-Net compared to other state-of-the-art methods on the synthetic datasets [1].
An ablation study was conducted to prove the effectiveness of the channel and spatial attention module (CSAM). The visual results are shown in the image below, and the quantitative results in the subsequent chart [1].
Conclusions
Curvilinear structure segmentation is an essential step towards automated medical diagnosis, and it remains a challenge. CS2-Net brings us closer to that goal by outperforming other state-of-the-art methods. However, more deep learning techniques need to be tried on different applications to close that gap and address other challenges in medical procedures. Furthermore, a couple of improvements could be considered. First, simplify the 3D architecture to reduce resource consumption. Second, use neighborhood or continuity constraints to discard diseased cells wrongly marked as vessel-like structures [1].
ER-Net
The edge-reinforced network (ER-Net) is characterized by giving more importance to edge voxels in order to guarantee continuity and microstructure detection [2].
Methodology
ER-Net uses an encoder-decoder architecture with three distinctive components: the reverse edge attention module (REAM), the feature selection module (FSM), and the edge-reinforced loss. The high-level architecture is shown in the image below [2].
REAM
The REAM module is embedded between adjacent layers of the encoder. As shown in the image below, it calculates the intersection between the foreground and background of different layers. Concatenating this information at the same position yields more edge information. In simple words, it increases the weight of voxels on the edges [2].
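As a rough illustration of this idea only (the paper's module works across adjacent encoder layers, not a single feature map), the overlap between a foreground probability map and its reverse can be used to boost edge voxels:

```python
import torch
import torch.nn as nn


class ReverseEdgeAttention(nn.Module):
    """Toy single-layer sketch of the REAM idea: the overlap between the
    predicted foreground and its reverse (the background) peaks near edges
    and is used to re-weight the features. The to_mask projection and the
    scaling are illustrative assumptions."""

    def __init__(self, channels):
        super().__init__()
        self.to_mask = nn.Conv3d(channels, 1, kernel_size=1)

    def forward(self, feat):
        fg = torch.sigmoid(self.to_mask(feat))  # foreground probability map
        bg = 1.0 - fg                           # reversed (background) map
        edge = 2.0 * torch.min(fg, bg)          # largest where fg ~ bg ~ 0.5, i.e. at edges
        return feat * (1.0 + edge)              # boost edge voxels
```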
FSM
Since feature fusion via simple channel stacking may lead to redundancy, the FSM module adaptively selects effective encoding features from the encoder side and effective recovery features from the decoder side. As shown in the image below, the final attention weights encode the interdependency between different feature channels [2].
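A plausible sketch of such a selection gate, assuming a squeeze-and-excitation style channel weighting over the stacked features (channels here is the sum of the encoder and decoder channel counts; the exact form in the paper may differ):

```python
import torch
import torch.nn as nn


class FeatureSelectionModule(nn.Module):
    """SE-style gate over the stacked encoder/decoder features (an assumption
    about the exact form of the FSM)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),                        # global channel descriptor
            nn.Conv3d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                   # per-channel selection weights
        )

    def forward(self, enc_feat, dec_feat):
        fused = torch.cat([enc_feat, dec_feat], dim=1)      # simple channel stacking
        return fused * self.gate(fused)                     # keep the "effective" channels
```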
ER Loss
The ER Loss uses the dice similarity coefficient to determine a threshold lambda that switches between a mask-supervised loss and an edge-refined loss [2].
The mask-supervised loss is simply the dice loss, which measures the non-overlapping ratio between the prediction and the ground truth [2].
The edge-refined loss is composed of the dice loss and an edge loss. The edge loss, shown in the equation below, measures the edge differences between prediction and ground truth. It can be divided into [2]:
- edge dice loss: aims to optimize the global structure
- edge binary cross-entropy loss: does not consider the global structure
- auxiliary term: acts as a regularizer
Kappa and tau are learnable parameters that restrict each other. The remaining parameter is empirically set to 1, and the edge probabilities are obtained using a 3x3x3 Laplacian operator [2].
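Reusing dice_loss from the CS2-Net loss sketch above, the switching scheme might be expressed as follows; lam = 0.9 and the simplified edge term are placeholders for the paper's full formulation with its learnable kappa/tau weights:

```python
def er_loss(probs, gt, edge_probs, edge_gt, lam=0.9):
    """Sketch of the ER Loss switching scheme: train with the mask (dice) loss
    until the dice score exceeds the threshold lambda, then add edge terms."""
    mask_loss = dice_loss(probs, gt)
    if 1.0 - mask_loss < lam:                  # dice score still below threshold
        return mask_loss                       # mask-supervised stage
    # Edge-refined stage, heavily simplified: the paper combines an edge dice
    # loss, an edge BCE loss, and an auxiliary regularizer with learnable weights.
    return mask_loss + dice_loss(edge_probs, edge_gt)
```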
Experimental setup
All datasets used are public, including two nerve and two cerebrovascular datasets. The cerebrovascular datasets comprise 100 MRA volumes from healthy volunteers: MIDAS I has 50 manual annotations of the Circle of Willis, while MIDAS II has 42 manual annotations of intracranial vasculature, with data augmentation. The two nerve datasets are part of DIADEM, a digital reconstruction of axonal and dendritic morphology. The first, OPF, consists of 9 separate Drosophila olfactory axonal projection image stacks taken with two-channel confocal microscopy. The second, NL1A (neocortical layer 1 axons), consists of 16 image stacks involving numerous axonal trees taken with two-photon laser scanning microscopy in vivo [2].
The metrics used for evaluating results are [2]:
- sensitivity
- specificity
- dice similarity coefficient
- average Hausdorff distance: Reflects the edge error of segmentation and takes voxel location into account
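For reference, the average Hausdorff distance between two voxel point sets can be computed as in the short numpy/scipy sketch below (the mean of the two directed average minimum distances):

```python
import numpy as np
from scipy.spatial.distance import cdist


def average_hausdorff(a, b):
    """Average Hausdorff distance between point sets a and b of shape (N, 3)."""
    d = cdist(a, b)                                        # pairwise Euclidean distances
    return (d.min(axis=1).mean() + d.min(axis=0).mean()) / 2.0
```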
Results
The image below shows the visual results of ER-Net compared to other state-of-the-art methods for random images from the MIDAS dataset [2].
In terms of quantitative results, ER-Net outperforms the other networks, as shown below [2].
For the nerve datasets, the image below shows an example where ER-Net suffers less from over- and under-segmentation [2].
The overall quantitative results once again back up the use of ER-Net, as shown in the table below [2].
Ablation studies were performed to further demonstrate the effectiveness of REAM, FSM, and the ER Loss. The quantitative results are shown below [2].
Following the nomenclature of the table above, the visual results of the ablation studies are shown in the image below [2].
Conclusions
ER-Net improves the connectivity of curvilinear segmentation by capturing microstructure with better edge-detection mechanisms. It outperforms state-of-the-art methods, although MIDAS II and NL1A yield poorer results due to incomplete manual annotations. Despite these results, the following improvements could be made. First, the threshold lambda in the ER Loss needs automated adjustment based on the dataset. Second, for practical purposes, the quantification of biomarkers such as density, length, tortuosity, and caliber should be integrated into the segmentation process. Finally, to move beyond fully supervised learning, one of the following could be adopted: semi-supervised learning, data augmentation with GANs, transfer learning, or active learning [2].
DANet
The Dual Attention Network uses a self-attention mechanism to capture the dependencies between features in both the spatial and channel dimensions. The outputs of these modules are then fused to enhance the feature representations. The method improves the discriminative ability of feature representations for scene segmentation on three popular benchmarks: the Cityscapes, PASCAL Context, and COCO Stuff datasets [3].
Related Work
Semantic segmentation builds on fully convolutional networks, and several variants have been proposed to improve contextual aggregation in the model. Deeplabv2 and Deeplabv3 use atrous spatial pyramid pooling, PSPNet uses a pyramid pooling module, and encoder-decoder structures merge mid-level and high-level semantic features. DAG-RNN models contextual dependencies with a recurrent neural network, PSANet uses a convolution layer to capture pixel-wise relations, and EncNet uses a channel attention mechanism to capture global context [3].
Methodology
The method is based on a dilated residual network with two types of attention modules added to it: a spatial attention module and a channel attention module. The outputs from these two modules are aggregated to obtain better feature representations for pixel-level prediction. The high level architecture can be seen in the image below [3].
These modules, as seen in the image below, are the same as those used in CS2-Net, which took this work as inspiration. PAM (position attention module) is this paper's name for the spatial attention module [3].
The last step is an element-wise sum of the outputs of the attention modules, followed by a convolution layer that generates the final prediction map [3].
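A minimal sketch of this fusion step is shown below; note that the actual DANet applies additional convolutions inside each branch before the summation, which are omitted here:

```python
import torch.nn as nn


class FusionHead(nn.Module):
    """Element-wise sum of the two attention branches, then a 1x1 convolution
    producing the per-pixel class scores (simplified relative to the paper)."""

    def __init__(self, channels, num_classes):
        super().__init__()
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, pam_out, cam_out):
        return self.classifier(pam_out + cam_out)  # sum fusion, then prediction map
```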
Experimental Setup
The following datasets were used to validate DANet [3]:
Dataset | training samples | validation samples | testing samples | semantic classes |
---|---|---|---|---|
Cityscapes | 2979 | 500 | 1525 | 19 |
PASCAL VOC 2012 | 10582 | 1449 | 1456 | 20 |
PASCAL Context | 4998 | not specified | 5105 | 60 |
COCO Stuff | 9000 | not specified | 1000 | 60 |
Results
In the following table, the contributions of the CAM and PAM modules, as well as the effect of increasing the depth of the network, are demonstrated with the intersection-over-union metric [3].
The next table demonstrates the contribution of different strategies, such as data augmentation and multi-scale input, to the performance of the network [3].
The following table shows how DANet outperforms other state-of-the-art methods on the Cityscapes dataset in terms of mean IoU [3].
On PASCAL VOC 2012, DANet almost achieved the highest score, as shown below [3].
On the PASCAL Context dataset, DANet scored the highest mean IoU, as shown below [3].
On the COCO Stuff dataset, DANet outperformed every state-of-the-art method, as shown below [3].
Conclusions
As seen in the ablation experiments and the mean IoU comparisons, DANet outperforms state-of-the-art methods on 4 different scene segmentation datasets. The pending challenges are to reduce the computational complexity of the network and to improve the robustness of the model [3].
Personal Review
The following chart presents a comparison between CS2-Net and ER-Net.
The following chart presents a comparison between DANet and CS2-Net. Note that ER-Net and DANet are not compared, since the two have too little in common.
After reviewing these papers on curvilinear segmentation, one might ask whether the complete integration of biomarker quantification will happen in the next 5 years. I believe it will: in 5 years we might have fully automated diagnosis of disorders related to curvilinear structures in the human body. For this to happen, people should start donating data to studies like the ones reviewed. With further advances in this technology, it might even become possible to detect morphological changes in the heart valves, which are well known to generate clots and, ultimately, strokes.
The papers reviewed have some weaknesses. For instance, none of the works quantifies the training time of the network or the complexity of the algorithm. DANet uses just one metric to compare performance between networks, while the other two works surprisingly use specificity, even though it is not a good metric for unbalanced semantic segmentation. For the medical papers, there is no semantic segmentation of other relevant curvilinear structures such as the bronchi and bronchioles. Across all papers there are hyperparameters empirically set to arbitrary values without further analysis.
Regarding the strengths of the 3 reviewed works: all of them outperformed state-of-the-art methods and demonstrated their results both quantitatively and visually.
Finally, the main takeaways of these curvilinear structure segmentation works are the following:
- Attention significantly improves semantic segmentation
- Research should focus on improving deep learning methods for 3D images
- Edges are an important part of curvilinear structure segmentation, as they guarantee continuity
- Deep learning methods should be agnostic to image modality
- Any new AI development should be tried on different problems
- Resource consumption should be addressed for 3D volume problems
- Ablation studies with different attention modules might be useful to determine the best approach
References
- [1] Mou, L., Zhao, Y., Fu, H., Liu, Y., Cheng, J., Zheng, Y., Su, P., Yang, J., Chen, L., Frangi, A.F., et al., 2020. CS2-Net: Deep learning segmentation of curvilinear structures in medical imaging. Med. Image Anal., 101874.
- [2] Xia, L., Zhang, H., Wu, Y., Song, R., Ma, Y., Mou, L., Liu, J., Xie, Y., Ma, M., Zhao, Y., 2022. 3D vessel-like structure segmentation in medical images by an edge-reinforced network. Med. Image Anal., 102581.
- [3] Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., Lu, H., 2019. Dual attention network for scene segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3146–3154.