This is a blog post for the paper ‘Cardiac Phase Detection in Echocardiograms With Densely Gated Recurrent Neural Networks and Global Extrema Loss’

written by Fatemeh Taheri Dezaki, Zhibin Liao, Christina Luong, Hany Girgis, Neeraj Dhungel, Amir H. Abdi, Delaram Behnami, Ken Gin, Robert Rohling, Purang Abolmaesumi, and Teresa Tsang

Introduction and problem statement


Cardiovascular disease is a major concern in medicine, being the leading cause of premature death worldwide, so detecting it early enough for treatment is important. Echocardiography (echo) is one of the most widely used cardiac imaging modalities [1]. Accurate detection of the End-Systolic (ES) and End-Diastolic (ED) frames is an important pre-processing step for measuring other cardiac parameters [2], which in turn are essential for detecting certain cardiac abnormalities. The ED frame corresponds to the largest left-ventricular blood volume (the ventricle is fully filled), whereas the ES frame corresponds to the smallest volume, at the end of contraction. Identifying these frames is challenging, however, because heart rate and image appearance vary with pathological conditions. Echocardiographers identify ED and ES manually, which is time-consuming and error-prone: on average, there is a disagreement of about 3 frames among 5 sonographers.

Therefore, there is a demand for an automatic and accurate method to detect these frames, especially since pathological conditions can make manual identification of the ED and ES frames unreliable.

Deep learning architectures are well suited to this problem when it is cast as a regression task. The framework combines CNN-based image feature extraction with RNNs that model temporal dependencies, followed by a regression module that localizes the ED and ES frames [3]-[7]. On top of this architecture, a novel global extrema loss is introduced to localize the required frames. CNNs are used for their ability to encode mid-level and high-level image features, whereas RNNs capture temporal dependencies in sequential data. In this work, two CNN models are considered, ResNet [8] and DenseNet [9], together with four RNN architectures: LSTM [10], bi-directional LSTM, GRU [11], and bi-directional GRU.

This paper shows that the ES and ED frames can be located accurately without left-ventricular segmentation. A global extrema loss is proposed, which further improves localization accuracy.

Methodology


The goal in this experiment is to distinguish the two critical frames, ED and ES, from the non-critical frames in between.

Figure 1: Overview of the deep learning framework architecture

Ground truth definition

Let the training set D = \{z_n\}_{n=1}^{N} be a collection of N cine series. Each cine series z_n = \{(x_{n,t}, l_{n,t})\}_{t=1}^{|T_n|} consists of 2D image frames x_{n,t} of one cardiac cycle, each with a label l_{n,t}. The label set L = \{-1, 0, 1\} represents the ED frame by 1, the ES frame by -1, and the non-critical frames by 0.

In this equation, \sigma_{ED}(T) and \sigma_{ES}(T) are selection functions returning the index t of the ED and ES frame among the labels l_t, \tau = 3 and \eta = 0.8 are constants, and T is the collection of sequence (frame) indices. Note that \eta was originally fixed to one in the earlier work by Kong et al. [12]; here it is reduced to 0.8 in order to accentuate the difference between the non-critical frames and the ED and ES frames.

Loss Function

During the test phase, the maximum predicted value corresponds to the ED frame while the minimum corresponds to the ES frame, as shown in the following equations:

\tilde{q}_{ED} = \arg\max_t \{\tilde{y}_{n,t}\}_{t=1}^{|T_n|} \quad \text{and} \quad \tilde{q}_{ES} = \arg\min_t \{\tilde{y}_{n,t}\}_{t=1}^{|T_n|}, \quad \text{where } \tilde{y}_{n,t} = f_{dnn}(x_{n,t}, \theta_{dnn})
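As a small, hedged illustration of this step, the NumPy sketch below recovers the ED and ES indices from a sequence of per-frame predictions. The function and array names are illustrative and not taken from the paper's code.

```python
import numpy as np

def locate_ed_es(y_pred):
    """Given per-frame predictions for one cine series (shape [T]),
    return the indices of the predicted ED and ES frames.
    ED corresponds to the global maximum, ES to the global minimum."""
    q_ed = int(np.argmax(y_pred))
    q_es = int(np.argmin(y_pred))
    return q_ed, q_es

# Example: a toy prediction curve that peaks early (ED) and dips later (ES).
y_pred = np.array([0.2, 0.8, 0.5, 0.1, -0.4, -0.9, -0.3, 0.0])
print(locate_ed_es(y_pred))  # -> (1, 5)
```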

The training loss is a combination of two losses. The first is the mean squared error:

L_{mse} = \sum_{n=1}^{N} \sum_{t=1}^{|T_n|} \| y_{n,t} - \tilde{y}_{n,t} \|^2


This function reduces the error between the predicted and ground-truth labels. However, it does not capture the relative ordering between consecutive frames, so during testing the predictions for frames near ED and ES can exceed those of the actual ED and ES frames, leading to inaccurate frame detection.

Therefore, another loss function is needed to enforce the monotonic structure of the labels during training. The following loss was used by Kong et al. [12]:

L_{mono} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{|T_n|} \sum_{t=2}^{|T_n|} \left[ \mathbb{1}(y_{n,t} > y_{n,t-1}) \max(0, \tilde{y}_{n,t-1} - \tilde{y}_{n,t}) + \mathbb{1}(y_{n,t} < y_{n,t-1}) \max(0, \tilde{y}_{n,t} - \tilde{y}_{n,t-1}) \right]


Here, \mathbb{1}(\cdot) is an indicator function. However, this loss does not single out the ED and ES frames relative to the other non-critical frames, so the required frames can still be misidentified.
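To make the monotonicity term concrete, here is a minimal NumPy sketch for a single cine series, assuming the corrected form above (one indicator for rising ground truth, one for falling). All names are illustrative rather than taken from the authors' implementation.

```python
import numpy as np

def mono_loss(y_true, y_pred):
    """Monotonicity loss for one cine series: penalize predictions that move
    against the direction of the ground-truth label between consecutive frames."""
    T = len(y_true)
    loss = 0.0
    for t in range(1, T):
        rising = y_true[t] > y_true[t - 1]
        falling = y_true[t] < y_true[t - 1]
        # Ground truth rises but prediction falls (or vice versa) -> penalty.
        loss += rising * max(0.0, y_pred[t - 1] - y_pred[t])
        loss += falling * max(0.0, y_pred[t] - y_pred[t - 1])
    return loss / T
```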

To overcome this problem, another loss function is proposed here, which imposes a small margin around the desired frames.

L_{ge} = \frac{1}{N} \sum_{n=1}^{N} \left[ \max\big((\max(k_n) + \gamma) - \tilde{y}_{n,\sigma_{ED}(T)},\, 0\big) + \max\big(\tilde{y}_{n,\sigma_{ES}(T)} - (\min(k_n) - \gamma),\, 0\big) \right]


Where \tilde{y}_{n,\sigma_{ES}(T)} and \tilde{y}_{n,\sigma_{ED}(T)} are the predictions at the true ES and ED frame indices, k_n is the set of predictions at the remaining (non-critical) frames of cine series n, and \gamma = 0.025 is a user-defined margin parameter.
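Under this reading of k_n, a hedged NumPy sketch of L_{ge} for one cine series could look as follows; the function name, argument layout, and masking strategy are illustrative assumptions.

```python
import numpy as np

def global_extrema_loss(y_pred, ed_idx, es_idx, gamma=0.025):
    """Global extrema loss (single cine series): the ED prediction should exceed
    every non-critical prediction by at least gamma, and the ES prediction
    should lie at least gamma below every non-critical prediction."""
    mask = np.ones(len(y_pred), dtype=bool)
    mask[[ed_idx, es_idx]] = False
    k = y_pred[mask]  # predictions at the non-critical frames
    ed_term = max((k.max() + gamma) - y_pred[ed_idx], 0.0)
    es_term = max(y_pred[es_idx] - (k.min() - gamma), 0.0)
    return ed_term + es_term
```

The loss is zero once the ED prediction is the global maximum and the ES prediction the global minimum with margin gamma, so it only acts on the few predictions that still violate the extrema constraint.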

To better understand the difference between these two loss functions, the following example illustrates their behavior.

Figure 2: Comparison of L_{mono} (panel (a)) and L_{ge} (panel (b)) in the presence of frame errors

In panel (a), L_{mono} produces 11 non-zero loss components that are averaged over the 32 frames of the sequence. Because the gradient is divided by 32 and becomes small, two problems arise: the values can become very small and the gradient effectively vanishes, and with such a small gradient it is difficult to escape a local minimum. In panel (b), L_{ge} involves only the four most relevant predictions, so no normalization is needed; the resulting gradient is relatively large and both issues are avoided.

Figure 3: Comparison of L_{mono} (panel (c)) and L_{ge} (panel (d)) with correct predictions

Panel (c) shows correct predictions under L_{mono}: this loss acts only as a regularizer and does not necessarily pull the predicted values toward the correct ED and ES frames; its gradient can even push them away from the correct values. This does not happen with the global extrema loss, as shown in panel (d), where the ED and ES frames are detected correctly.

The training loss function used in this experiment combines the two losses as follows:

L_{total} = (1-\alpha) L_{mse} + \alpha L_{struct}


Where L_{struct} is either L_{ge} or L_{mono}.
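As a short sketch of how the pieces fit together, the combined objective with the weighting \alpha = 0.3 used later in the experiments might look like this; it assumes NumPy arrays and reuses the illustrative global_extrema_loss helper from the earlier sketch, and the per-sequence MSE is averaged here purely for readability.

```python
def total_loss(y_true, y_pred, ed_idx, es_idx, alpha=0.3):
    """L_total = (1 - alpha) * L_mse + alpha * L_struct, with L_struct = L_ge here."""
    mse = float(((y_true - y_pred) ** 2).mean())          # per-sequence MSE term
    struct = global_extrema_loss(y_pred, ed_idx, es_idx)  # structured (extrema) term
    return (1.0 - alpha) * mse + alpha * struct
```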

Deep Learning framework

It is broadly similar to the framework proposed in previous papers [13], [14]. There are three components: a CNN module for image feature extraction, followed by an RNN module that models temporal dependencies, whose output is fed to a regression module that makes the predictions. The following figures show the framework used in the experiment.

Figure 4: CNN modules, RNN units and structures

  • CNN module

Two options were tested: ResNet and DenseNet. Both architectures have considerable depth, which helps improve classification accuracy, and both include by-pass (skip) connections, which help prevent the degradation problem in very deep networks.

ResNet consists of residual layers, each of which adds the input of its computation block (taken from the previous layer) to its own output through a skip connection. The computation block follows the original ResNet design [8] and is formed by two stacks of three convolutional layers followed by Batch Normalization [15] and ReLU [16]. The output image feature is given by the following equation:

x_{t,L} = x_{t,0} + \sum_{l=1}^{L} f_{res}(x_{t,l-1}, \theta_{feat}^{l})


where x_{t,l} denotes the feature at computation block l, x_{t,0} = x_t is the input image, and f_{res} is the customized computation block with parameters \theta_{feat}^{l}.
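As a rough PyTorch-style sketch of one such residual computation block (the paper used the Lasagne library, and its exact block layout, channel counts, and kernel sizes may differ from the assumptions below):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One residual computation block: output = activation(input + f_res(input)).
    Conv -> BN (-> ReLU) layers form f_res; the skip connection adds the block input."""
    def __init__(self, channels=32):
        super().__init__()
        self.f_res = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.f_res(x))  # skip connection adds the block input
```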

In DenseNet [9], the outputs of all previous layers are concatenated to form the input of the next layer. The computation block is a single stack of one convolutional layer, BN, and ReLU. The output image feature is given by:

x_{t,L} = [x_{t,0}, \ldots, x_{t,L-1}]
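A comparable PyTorch-style sketch of the dense connectivity pattern, again with assumed channel counts and layer counts: each layer receives the concatenation of all earlier feature maps, and the block output concatenates everything produced.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense connectivity: each layer's input is the concatenation of all previous
    feature maps; the block output concatenates the input and all layer outputs."""
    def __init__(self, in_channels=16, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for l in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_channels + l * growth_rate, growth_rate,
                          kernel_size=3, padding=1),
                nn.BatchNorm2d(growth_rate),
                nn.ReLU(inplace=True),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Concatenate all previous feature maps along the channel axis.
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```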


  • RNN module

The two recurrent neural networks tested are GRU and LSTM, because they capture long-term dependencies in sequential data by controlling the information flow within the unit, which mitigates the vanishing-gradient problem.

The LSTM module used here is the standard one [10]. It has a memory cell and three gates: an input gate, a forget gate, and an output gate. With \tilde{x}_t the image feature at frame t, h_{t-1} the previous hidden state, s(\cdot) the sigmoid function, W the weight matrices, and \odot the element-wise product:

Input gate: i_t = s(W_{\tilde{x}i}\,\tilde{x}_t + W_{hi}\,h_{t-1})

Forget gate: f_t = s(W_{\tilde{x}f}\,\tilde{x}_t + W_{hf}\,h_{t-1})

Output gate: o_t = s(W_{\tilde{x}o}\,\tilde{x}_t + W_{ho}\,h_{t-1})

Candidate cell state: \tilde{c}_t = \tanh(W_{\tilde{x}g}\,\tilde{x}_t + W_{hg}\,h_{t-1})

Cell state: c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t

Hidden state: h_t = o_t \odot \tanh(c_t)

Figure 5: Graphic model of LSTM
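To make the gate equations concrete, here is a minimal NumPy sketch of one LSTM step on the CNN feature \tilde{x}_t. Bias terms are omitted to mirror the equations above, and the weight-dictionary layout is an illustrative assumption.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM step. W is a dict of weight matrices keyed as in the equations
    (e.g. W['xi'] maps the input feature to the input gate)."""
    i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev)       # input gate
    f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev)       # forget gate
    o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev)       # output gate
    c_tilde = np.tanh(W['xg'] @ x_t + W['hg'] @ h_prev)   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                    # element-wise products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```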

The GRU module is simpler than the LSTM [11]:

Update gate: z_t = s(W_{\tilde{x}z}\,\tilde{x}_t + W_{hz}\,h_{t-1})

Reset gate: r_t = s(W_{\tilde{x}r}\,\tilde{x}_t + W_{hr}\,h_{t-1})

Candidate hidden state: \tilde{h}_t = \tanh(W_{\tilde{x}g}\,\tilde{x}_t + W_{hg}\,(r_t \odot h_{t-1}))

Hidden state: h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

Figure 6: Graphic model of GRU
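The corresponding one-step GRU sketch, with the same conventions as the LSTM sketch (no bias terms, illustrative weight-dictionary keys):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W):
    """One GRU step: update gate z_t, reset gate r_t, candidate state h_tilde."""
    z_t = sigmoid(W['xz'] @ x_t + W['hz'] @ h_prev)              # update gate
    r_t = sigmoid(W['xr'] @ x_t + W['hr'] @ h_prev)              # reset gate
    h_tilde = np.tanh(W['xg'] @ x_t + W['hg'] @ (r_t * h_prev))  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                  # new hidden state
```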

The bidirectional LSTM/GRU takes the past states into account as usual, but adds a second controller for the backward states, which reads the input sequence in reverse. The outputs of the two controllers are combined by element-wise summation to give the final output.
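A hedged sketch of this bidirectional wiring for the GRU variant: one recurrent pass runs forward, another over the reversed sequence, and the two per-frame outputs are combined by element-wise summation. It reuses the illustrative gru_step helper from the sketch above; all names and shapes are assumptions.

```python
import numpy as np

def bidirectional_gru(x_seq, W_fwd, W_bwd, hidden_size):
    """x_seq: list of per-frame feature vectors. Returns per-frame outputs that
    sum the forward and backward hidden states."""
    T = len(x_seq)
    h_f = np.zeros(hidden_size)
    h_b = np.zeros(hidden_size)
    fwd, bwd = [], [None] * T
    for t in range(T):                        # forward pass over the sequence
        h_f = gru_step(x_seq[t], h_f, W_fwd)
        fwd.append(h_f)
    for t in reversed(range(T)):              # backward pass reads the input in reverse
        h_b = gru_step(x_seq[t], h_b, W_bwd)
        bwd[t] = h_b
    return [f + b for f, b in zip(fwd, bwd)]  # element-wise summation of both controllers
```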

  • Regression module

The final prediction is produced by a regression module. The parameters of the whole network are updated using backpropagation through time (BPTT) [17].

Experiments


Dataset

The dataset used was collected from the PACS server of the Vancouver Hospital:

  • Data acquired between 2011 and 2015
  • Studies from 3087 patients
  • 2D echo in the AP4 (apical four-chamber) view, stored in DICOM format
  • Various pathological conditions (heart rates ranging from 47 to 104 beats/min, for example)
  • Complete cardiac cycles with a variable number of frames: minimum 29, maximum 55, average 42 frames
  • Philips iE33 ultrasound machine
  • High-quality datasets

Experimental setup

Following are the details concerning the data used:

  • Ultrasound image beam cropped and re-sized to 120x120 pixels with bi-cubic interpolation method
  • 3 mutually exclusive sets: 60% as training set, 20% as validation set, 20% as test set
  • Deployment: Intel Core i7-2600k, 3.40GHz (8 cores), 8GB of RAM, and a NVIDIA GeForce GTX 980Ti Video Card
  • Lasagne deep learning library
  • Learning rate decays by a factor of 10 at the 31st and 61st epochs, maximum of 100 training epochs (see the sketch after this list)
  • Regularization with weight decay method from 1e^{-1} to 5e^{-1}
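The step decay mentioned above could be expressed as in the sketch below; it is not the authors' training script, and the default base learning rate is only a placeholder (the actual initial rate was chosen by the search described later).

```python
def learning_rate(epoch, base_lr=1e-1):
    """Step decay: divide the learning rate by 10 at epochs 31 and 61 (max 100 epochs)."""
    if epoch >= 61:
        return base_lr / 100.0
    if epoch >= 31:
        return base_lr / 10.0
    return base_lr
```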

Evaluation Metrics

The R² score is used to evaluate the regression performance; it is defined as:

R^2 = 1 - \frac{\sum_{n,t}(y_{n,t} - \tilde{y}_{n,t})^2}{\sum_{n,t}(y_{n,t} - \bar{y})^2}


where \bar{y} is the mean of the ground-truth labels. However, R² is only an indirect indicator for the frame detection problem; the average prediction error, in frames, is used as a direct indicator:

\mu_e = \frac{1}{N}\sum_{i=1}^{N} |q_e^i - \tilde{q}_e^i|

where q_e^i is the ground-truth ED or ES frame index, \tilde{q}_e^i is the predicted one, and e \in \{ED, ES\}.
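A hedged NumPy sketch of both metrics, assuming per-frame ground-truth labels and predictions for R², and ground-truth versus predicted frame indices for the mean frame error; function names are illustrative.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination over all frames."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def mean_frame_error(q_true, q_pred):
    """Average absolute difference (in frames) between ground-truth and
    predicted ED (or ES) frame indices across cine series."""
    return float(np.mean(np.abs(np.asarray(q_true) - np.asarray(q_pred))))
```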

Results and discussion


Model and hyper-parameter selection

Model and hyper-parameter selection was performed using four initial learning rates, from 10^{-4} to 10^{-1}, spaced by powers of 10. The weighting parameter α is set to 0.3. In total, 32 models were trained from the combinations of CNN modules, RNN modules, and learning rates. The number of parameters of each model is kept within 10% of 1.1 million. The reference model is the ResNet with 2-LSTM model [18].

The following table represents the results

| CNN module | RNN module | Number of params | μ_ED | μ_ES | R² |
|---|---|---|---|---|---|
| ResNet | 2-LSTM | 1.18M | 0.78 ± 1.02 | 1.45 ± 1.28 | 0.92 |
| ResNet | Bi-LSTM | 1.10M | 0.70 ± 0.99 | 1.57 ± 1.35 | 0.93 |
| ResNet | 2-GRU | 1.12M | 0.76 ± 1.00 | 1.42 ± 1.29 | 0.92 |
| ResNet | Bi-GRU | 1.10M | 0.83 ± 1.17 | 1.47 ± 1.36 | 0.92 |
| DenseNet | 2-LSTM | 1.08M | 0.57 ± 0.88 | 1.34 ± 1.18 | 0.94 |
| DenseNet | Bi-LSTM | 1.19M | 0.66 ± 1.08 | 1.45 ± 1.31 | 0.93 |
| DenseNet | 2-GRU | 0.98M | 0.49 ± 0.78 | 1.36 ± 1.18 | 0.93 |
| DenseNet | Bi-GRU | 1.07M | 0.64 ± 1.06 | 1.33 ± 1.21 | 0.93 |

Table 1: Comparison of deep learning architectures by error measurements on the test set.


The results using DenseNet are better in all cases. The smallest ED localization error is obtained by DenseNet with 2-GRU (0.49 frames), while the smallest ES error is obtained by DenseNet with Bi-GRU (1.33 frames). Considering the average error over both frames as well as the lowest number of parameters (0.98M), the best combination turns out to be DenseNet with 2-GRU. This architecture achieves an ES error 0.11 frames lower and an ED error 0.18 frames lower than the average.

Another noteworthy observation is that the ES localization error is consistently higher than the ED localization error. There is no definitive explanation, but a suggested reason is the lack of visual information for the ES frame in the AP4 view.

LSTM and GRU produce comparable results when paired with the same CNN module [19], [20]. Likewise, the two-layer and bidirectional variants of both LSTM and GRU lead to comparable results.

Figure 7: Deep learning architecture comparison by error measurement on test set

The graphs show that the best initial learning rate is 10^{-1}. They are also in line with the results in the table, confirming that DenseNet with 2-GRU achieves the lowest error.

State-of-the-art comparison

Using the best architecture from the previous results, another experiment compares L_{ge} to L_{mono} and to prior methods.

The following table sums up the results:

| Method | Number of params | μ_ED | μ_ES |
|---|---|---|---|
| TempReg-Net | 5.37M | 0.91 ± 1.16 | 1.75 ± 1.51 |
| DMTRL-Net | 0.77M | 0.65 ± 0.88 | 1.80 ± 1.55 |
| DenseNet + 2-GRU + L_{mono} | 0.98M | 0.49 ± 0.78 | 1.34 ± 1.17 |
| DenseNet + 2-GRU + L_{ge} (proposed) | 0.98M | 0.20 ± 0.67 | 1.43 ± 1.30 |


Table 2: Comparison between the two loss functions L_{ge} and L_{mono}, and against prior methods.


The ED localization error is lowest with L_{ge}, reducing the error by 0.29 frames on average. For the ES frame, L_{ge} results in an error 0.09 frames higher than L_{mono}; however, according to a t-test at the 5% significance level, this difference fails to reject the null hypothesis and is therefore not significant. Combining the results for both ED and ES frames, L_{ge} leads to better overall results. The results were also compared with other networks, TempReg-Net [21] and DMTRL-Net [22], both of which the proposed approach outperforms.

The following graphs also illustrate the difference between the two cases.

Figure 8: Histogram showing the frame error distribution with both loss functions for ED in (a) and ES in (b) on the AP4 view

The graphs first show that errors are larger for the ES frames than for the ED frames. They also highlight the accuracy of the framework with L_{ge}, reflected in the higher count of zero-frame errors. In the second graph, for the ES frame, the framework with L_{mono} tends to predict late (in blue), while the framework with L_{ge} tends to slightly anticipate the ES frame (in yellow).


These experiments were performed on cine series with different properties, i.e., different frame rates, heart rates, and so on.

Furthermore, to test generalization, the same experiments were repeated using PLAX (parasternal long-axis) images instead of AP4, on a dataset of 1382 patients. The results are summarized in the following graphs.

Figure 9: Histogram showing the frame error distribution with both loss functions for ED in (a) and ES in (b) on the PLAX view


The PLAX experiments lead to similar conclusions, with comparable observations. For the ED frame, the error is reduced by 0.13 frames using the L_{ge} loss, and by 0.10 frames for the ES frame.

Conclusion


We have shown the performance of different architectures that combine CNN modules and RNN modules followed by a regression module. According to the results, DenseNet performed better than ResNet, while LSTM and GRU led to comparable results. A new loss function, L_{ge}, was proposed and yielded better performance. For generalization purposes, the tests were performed not only on AP4 images but also on the PLAX view, and both gave comparable results. However, other aspects of generalization were not addressed, such as other cardiac views and low-quality data.

References


[1] D. D. McManus, S. J. Shah, M. R. Fabi, A. Rosen, M. A. Whooley, and N. B. Schiller, “Prognostic value of left ventricular end-systolic volume index as a predictor of heart failure hospitalization in stable coronary artery disease: Data from the heart and soul study,” J. Amer. Soc. EchoCardiogr., vol. 22, no. 2, pp. 190–197, 2009.

[2] R. O. Mada, P. Lysyansky, A. M. Daraban, J. Duchenne, and J.-U. Voigt, “How to define end-diastole and end-systole? Impact of timing on strain measurements,” JACC, Cardiovascular Imag., vol. 8, no. 2, pp. 148–157, 2015.

[3] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 1171–1179.

[4] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Netw., vol. 18, no. 5, pp. 602–610, 2005.

[5] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proc. IEEE CVPR, Jun. 2017, vol. 1, no. 2, p. 3.

[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.

[7] G. Litjens et al., “A survey on deep learning in medical image analysis,” Med. Image Anal., vol. 42, pp. 60–88, Dec. 2017.

[8] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.

[9] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proc. IEEE CVPR, Jun. 2017, vol. 1, no. 2, p. 3.

[10] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.

[11] K. Cho et al., “Learning phrase representations using RNN encoder– decoder for statistical machine translation,” in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP). Stroudsburg, PA, USA: Association for Computational Linguistics, Oct. 2014, pp. 1724–1734.

[12] B. Kong, Y. Zhan, M. Shin, T. Denny, and S. Zhang, “Recognizing end-diastole and end-systole frames via deep temporal regression network,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI). New York, NY, USA: Springer, 2016, pp. 264–272.

[13] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 1171–1179.

[14] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3156–3164.

[15] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” 2015. [Online]. Available: https://arxiv.org/abs/1502.03167

[16] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. 27th Int. Conf. Mach. Learn. (ICML), 2010, pp. 807–814.

[17] P. J. Werbos, “Backpropagation through time: What it does and how to do it,” Proc. IEEE, vol. 78, no. 10, pp. 1550–1560, Oct. 1990.

[18] F. T. Dezaki et al., “Deep residual recurrent neural networks for characterization of cardiac cycle phase from echocardiograms,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. New York, NY, USA: Springer, 2017, pp. 100–108.

[19] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” 2014. [Online]. Available: https://arxiv.org/abs/1412.3555

[20] W. Yin, K. Kann, M. Yu, and H. Schütze. (2017). “Comparative study of CNN and RNN for natural language processing.” [Online]. Available: https://arxiv.org/abs/1702.01923

[21] B. Kong, Y. Zhan, M. Shin, T. Denny, and S. Zhang, “Recognizing end-diastole and end-systole frames via deep temporal regression network,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI). New York, NY, USA: Springer, 2016, pp. 264–272.

[22] W. Xue, G. Brahm, S. Pandey, S. Leung, and S. Li, “Full left ventricle quantification via deep multitask relationships learning,” Med. Image Anal., vol. 43, pp. 54–65, Jan. 2018.

