This is the blog post for the paper "Predicting Alzheimer's disease progression using multi-modal deep learning approach".

Written by Garam Lee, Kwangsik Nho, Byungkon Kang, Kyung-Ah Sohn, Dokyoon Kim & Alzheimer’s Disease Neuroimaging Initiative.

Introduction

Alzheimer's disease (AD) is a progressive, irreversible neurodegenerative condition characterized by the decline of cognitive functions (thinking, behavior, memory). In 2018, 5.7 million Americans were affected, and by 2050 this number is projected to grow to 14 million [1]. There is currently no cure for patients who already have AD; existing treatments can only decelerate its progression. It is therefore of fundamental importance to develop strategies for detecting AD at early stages, allowing timely treatment and delay of progression. Individuals who have mild symptoms of brain dysfunction but can still perform everyday tasks are classified as having mild cognitive impairment (MCI), a prodromal form of AD. As shown in Figure 1, some patients in the MCI stage convert to AD within a limited time window after baseline (MCI conversion), while others do not (MCI non-conversion).


Figure 1. Some patients in the MCI stage convert to AD within a limited time window after baseline (MCI conversion, MCI-C),
while others do not (MCI non-conversion, MCI-NC). During prediction, MCI patients are classified as either MCI-C or MCI-NC after time Δt.

Related Works

Various studies have used machine learning to identify biomarkers for MCI conversion prediction, as shown in Table 1. These include a support vector machine (SVM) with multi-task learning that achieved 73.9% accuracy, 68.6% sensitivity, and 73.6% specificity [2], and a domain transfer learning method that used auxiliary samples of AD patients, cognitively normal older adults (CN), and MCI subjects, achieving 79.4% accuracy, 84.5% sensitivity, and 72.7% specificity [3]. Linear discriminant analysis (LDA) has also been applied to cortical thickness data and to multi-modal data combining cerebrospinal fluid (CSF), MRI, and cognitive performance biomarkers, resulting in 68.5% accuracy, 53.4% sensitivity, and 77% specificity [4,5].

Table 1. A list of previous models that train a classifier using MCI samples.

Objectives

The objective of this paper is to improve the prediction accuracy for the conversion of MCI individuals to AD using a neural network. The paper examines the prediction of MCI-to-AD conversion with a recurrent neural network and makes 4 major contributions:

  1. Assesses the performance of the recurrent neural network in AD prediction
  2. Analyzes the effects of using multi-modal data in the network
    (cognitive performance, demographic information, CSF biomarkers, MRI neuroimaging biomarkers)
  3. Analyzes the effects of longitudinal data in the network
    (at 6, 12, 18, and 24 months after the baseline visit)
  4. Proposes a method that can take in variable-length longitudinal data

Methodology

  1. Recurrent Neural Network (RNN)

    A recurrent neural network (RNN) architecture was chosen because it can extract features along the temporal domain. Suppose we have N subjects, each of which has a sequence:

    {x_1^n, x_2^n, ..., x_t^n, ..., x_T^n}

    where x_t^n is the t-th element in the data record of the n-th subject and T is the length of the sequence. As shown in Figure 2, an RNN processes one element of an input sequence at a time and updates its memory state, which implicitly contains information about the history of all the past elements of the sequence [6]. The hidden state is represented as a Euclidean vector, i.e. a sequence of real numbers, and is updated recursively from the input at the given step and the previous value of the hidden state.

    Figure 2: Illustration of a recurrent neural network (RNN). Left: An RNN module is composed of input, memory state, and output,
    each of which has a weight parameter to be learned for a given task. The memory state (blue box) takes the input and
    computes the output based on the memory state from the previous step and the current input; as t increases along time,
    this can be represented as a feedback loop. Right: An RNN with variable-length input and output
    sequences at different timepoints, ranging from 1 to T, represented as an "unfolded" feedback loop.


    The hidden state h_t and the predicted output \hat{y}_t of this RNN can be written as:

    h_t = tanh(W_h h_{t-1} + W_x x_t)
    \hat{y}_t = softmax(W_y h_t)

    where tanh and softmax are the activation functions:

    tanh(z) = (e^z - e^{-z}) / (e^z + e^{-z})
    softmax(z)_i = e^{z_i} / \sum_j e^{z_j}

    The non-linear tanh activation endows the RNN with higher representational power, while the softmax function turns an arbitrary vector into a probability vector.
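    The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's actual implementation: the hidden size, input dimension, and random weights are arbitrary choices for the sketch, and bias terms are omitted to match the bias-free equations above.

    ```python
    import numpy as np

    def softmax(z):
        # subtract the max for numerical stability before exponentiating
        e = np.exp(z - z.max())
        return e / e.sum()

    def rnn_forward(xs, W_h, W_x, W_y):
        """Run a vanilla RNN over sequence xs, returning one probability vector per step."""
        h = np.zeros(W_h.shape[0])               # initial hidden state h_0 = 0
        outputs = []
        for x_t in xs:                           # one sequence element at a time
            h = np.tanh(W_h @ h + W_x @ x_t)     # hidden-state update
            outputs.append(softmax(W_y @ h))     # predicted output \hat{y}_t
        return outputs

    # toy example: 4 timesteps, 3 input features, hidden size 5, 2 classes
    rng = np.random.default_rng(0)
    W_h = rng.normal(size=(5, 5))
    W_x = rng.normal(size=(5, 3))
    W_y = rng.normal(size=(2, 5))
    xs = rng.normal(size=(4, 3))
    y_hat = rnn_forward(xs, W_h, W_x, W_y)
    print(y_hat[-1])   # final prediction: a probability vector that sums to 1
    ```

    Note how the same three weight matrices are reused at every step; this weight sharing is what lets the network process sequences of any length.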

    However, to accept variable-length input sequences across modalities and time points, a gated recurrent unit (GRU), a variant of the RNN, was used as the first step in the architecture. The proposed RNN network architecture therefore consists of two training steps.

  2. Proposed RNN Network

    The main idea of their model is to build a separate feature extractor for each modality and integrate the four extracted feature vectors at the end. The proposed model comprises two training steps, as shown in Figure 3.

    Figure 3. Illustration of the proposed neural network. The proposed method contains 4 GRU components that accept the 4 dataset modalities.
    Blue rectangle: In the 1st training step, each GRU component takes both time series and non-time series data to produce fixed-size feature vectors.
    Red rectangle: In the 2nd training step, the vectors are concatenated to form an input for the final prediction.


    The first training step

    The 1st RNN encodes a sequence of symbols into a fixed-length vector representation by learning a single GRU for each data modality. As shown in Figure 4, a GRU consists of a reset gate and an update gate. The update gate decides what information to throw away and what new information to add, while the reset gate decides how much past information to forget.

    Figure 4. Illustration of the gated recurrent unit (GRU), a variant of RNN [7,8].
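    A minimal NumPy sketch of the GRU gate equations from [7] may make the encoder role concrete. The dimensions and random weights here are illustrative assumptions, and biases are omitted for brevity; the key point is that the last hidden state is a fixed-size vector regardless of the input sequence length.

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step(x, h_prev, P):
        """One GRU update; P holds input weights W_* and recurrent weights U_*."""
        z = sigmoid(P["W_z"] @ x + P["U_z"] @ h_prev)              # update gate
        r = sigmoid(P["W_r"] @ x + P["U_r"] @ h_prev)              # reset gate
        h_tilde = np.tanh(P["W_h"] @ x + P["U_h"] @ (r * h_prev))  # candidate state
        return z * h_prev + (1.0 - z) * h_tilde                    # blend old and new

    # arbitrary sizes for the sketch: 3 input features, hidden size 4
    rng = np.random.default_rng(1)
    n_in, n_hid = 3, 4
    P = {}
    for name in ["W_z", "W_r", "W_h"]:
        P[name] = rng.normal(scale=0.5, size=(n_hid, n_in))    # input weights
    for name in ["U_z", "U_r", "U_h"]:
        P[name] = rng.normal(scale=0.5, size=(n_hid, n_hid))   # recurrent weights

    # encode a variable-length sequence into a fixed-size vector (its last hidden state)
    h = np.zeros(n_hid)
    for x_t in rng.normal(size=(6, n_in)):   # works for any sequence length
        h = gru_step(x_t, h, P)
    print(h)   # fixed-size encoding, independent of the number of time points
    ```

    Because the same cell is applied step after step, the encoder naturally accepts sequences of any length, which is why the paper can use one GRU per modality as a fixed-size feature extractor.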

    The second training step

    The 2nd RNN decodes the representation into another sequence of symbols, learning the integrative feature representation to make the final prediction. The last output of the RNN is a probability vector for classification, and a cross-entropy loss function was used to quantify how "far away" the n-th prediction \hat{y}_n is from the n-th ground truth label y_n. The loss and its minimization can be written as:

    L(W_h, W_x, W_y) = -\sum_n y_n \log \hat{y}_n

    W^*_h, W^*_x, W^*_y = argmin L(W_h, W_x, W_y)

    Using Backpropagation Through Time (BPTT) [9], the optimal parameters W^*_h, W^*_x, W^*_y were chosen such that the cross-entropy loss of the given data is minimized.
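    The cross-entropy computation itself can be sketched as follows. The one-hot label encoding and the two-class (MCI-C vs. MCI-NC) setup are taken from the paper; the concrete probability values are made up for illustration.

    ```python
    import numpy as np

    def cross_entropy(y_true, y_pred, eps=1e-12):
        """Mean cross-entropy between one-hot labels and predicted probability vectors."""
        y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
        return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

    # two MCI subjects: ground truth [MCI-C, MCI-NC] as one-hot rows
    y_true = np.array([[1.0, 0.0], [0.0, 1.0]])
    good = np.array([[0.9, 0.1], [0.2, 0.8]])   # confident and mostly correct
    bad  = np.array([[0.4, 0.6], [0.6, 0.4]])   # mostly wrong
    print(cross_entropy(y_true, good))  # small loss (~0.16)
    print(cross_entropy(y_true, bad))   # larger loss (~0.92)
    ```

    BPTT then computes the gradient of this loss with respect to W_h, W_x, and W_y by unrolling the network over time, so that a gradient-based optimizer can drive the loss toward its minimum.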

Experimental Settings

  1. Datasets

    1. All individuals used in the analysis were participants of the Alzheimer's Disease Neuroimaging Initiative (ADNI) [10,11]. Longitudinal data include the baseline, 6-, 12-, 18-, and 24-month visits. Multi-modal data covering the following 4 modalities are also used:

      1. Cognitive performance
        - executive functioning
        - memory
      2. Demographic Information
        - age
        - sex
        - years of education
        - APOE ε4 status
      3. CSF biomarkers
        - amyloid-β 1–42 peptide (Aβ1–42)
        - total tau (t-tau)
        - tau phosphorylated threonine 181 (p-tau)
      4. MRI neuroimaging biomarkers
        - hippocampal volume
        - entorhinal cortical thickness

      In this study, a total of 1,618 ADNI participants aged 55 to 91 were used, which include 415 cognitively normal older adult controls (CN), 865 MCI (307 MCI converter and 558 MCI non-converter), and 338 AD patients, as shown in Table 2.

      Table 2. Subject demographics of the ADNI database at baseline visit. 

  2. Models

    Table 3. Data type distribution in the three experimental schemes in this study.

    To evaluate the performance and effectiveness of the proposed longitudinal multi-modal deep learning method, this study used three models (Table 3) and compared their performances. In the "baseline" experiment, data from all 4 modalities at the baseline visit (cognitive performance, CSF, demographic information, and MRI) were used. In the "single modal" experiment, only longitudinal cognitive performance data were used, since the network using cognitive scores performed the best among all single modalities. In the proposed method, four modalities of longitudinal data were combined and used for training the classifier. Table 4 shows the summary statistics of each data modality and the hyperparameters used for training the RNN. For training the models, CN and AD subjects were used as an auxiliary dataset to pre-train the classifier, and MCI-C and MCI-NC subjects were used for training. Due to the nature of longitudinal data, the sample size available for training varies over Δt, as shown in Figure 5.

    Table 4. Data statistics and hyperparameters used in this study.

    Figure 5. The number of subjects available in demographic data, neuroimaging data, cognitive performance, and CSF biomarkers over Δt.

  3. Training

    We tested the classifier on MCI patients to predict the conversion after Δt from baseline (6, 12, 18, and 24 months), as shown in Figure 1. At each prediction time Δt, 5-fold cross-validation was repeated 10 times, with every fold having the same ratio of MCI-C and MCI-NC subjects. MCI samples were partitioned into 5 subsets; one subset was selected for testing while the remaining subsets were used for training.
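    The stratified split described above can be sketched in pure Python. This is a hypothetical re-implementation for illustration, not the paper's actual fold assignment; the label counts mimic the 307 MCI-C vs. 558 MCI-NC imbalance from Table 2.

    ```python
    import random

    def stratified_k_fold(labels, k=5, seed=0):
        """Yield (train_idx, test_idx) pairs with per-class ratios preserved per fold."""
        rng = random.Random(seed)
        by_class = {}
        for i, y in enumerate(labels):           # group sample indices by class label
            by_class.setdefault(y, []).append(i)
        folds = [[] for _ in range(k)]
        for idxs in by_class.values():
            rng.shuffle(idxs)
            for j, i in enumerate(idxs):         # deal each class round-robin into folds
                folds[j % k].append(i)
        for t in range(k):
            test = sorted(folds[t])
            train = sorted(i for f in range(k) if f != t for i in folds[f])
            yield train, test

    # toy labels mimicking the paper's MCI-C / MCI-NC counts (307 vs. 558)
    labels = ["MCI-C"] * 307 + ["MCI-NC"] * 558
    for train, test in stratified_k_fold(labels, k=5):
        n_c = sum(labels[i] == "MCI-C" for i in test)
        print(len(train), len(test), n_c)   # each test fold keeps ~307/5 ≈ 61-62 converters
    ```

    Repeating this procedure 10 times with different seeds, as the paper does, averages out the variance introduced by any single random partition.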

Results

  1. Is multi-modal data better? 

    – Comparison of MCI-to-AD conversion prediction using single-modal and multi-modal data
    The accuracies of the "proposed" model and the single-modality models are shown in Figure 6. The accuracy of the model with demographic data was excluded from the graph because its prediction performance was too low. The model using cognitive performance was the most accurate among the single-modality models. Even though the sample size for neuroimaging data was larger than those of cognitive performance and CSF biomarkers (Figure 5), the model with neuroimaging data was less accurate. The authors suggest this is because cognitive performance is longitudinal data, which has the advantage of providing data records relatively closer to the time of MCI conversion. However, the model with cognitive data shows high variance in sensitivity for the 18- and 24-month predictions. This suggests that a model with only cognitive data may not be a stable predictor over long prediction horizons, while integrating other biomarkers, as in the proposed model, can alleviate the high variance.

    Figure 6. Performance of the proposed model using multi-modal data and the models using single modality data.

  2. Is longitudinal data better?

    – Comparison of MCI-to-AD conversion prediction using cross-sectional data at baseline and longitudinal data
    The performances of the two schemes, "baseline" and "proposed", are shown in Figure 6 (left) and Figure 7. In Figure 7 (left), an ROC curve closer to the top-left corner indicates better performance. In Figure 7 (right), the prediction model based on longitudinal data shows overall better performance than the model using only cross-sectional baseline data, in terms of accuracy, sensitivity, and specificity. Intuitively, data from multiple time points carry more information than data from a single time point; the network can therefore analyze temporal changes and extract features in cognitive performance and CSF that are not available in baseline visit data, leading to better MCI conversion prediction.

Figure 7. Performance of the proposed model using longitudinal data and the model using only baseline data.

Discussion

This work used 4 separate encoding GRU components, which are capable of accepting data of any irregular length as input without preprocessing. In contrast, previous works could only use data with a fixed number of time points, collected by keeping records that fell within a certain time window, so pre-processing of the data was needed. Second, the 4 separate encoding GRU components can also handle non-overlapping samples, whereas only overlapping samples could be used in previous works.

A limitation of this work is that hidden features in multi-modal and longitudinal data were not efficiently incorporated because of the two-step training in the network architecture. Hidden features that appear irrelevant to AD progression within a single modality, but that can only be explained by a combination of multi-modal data (integrative features), would be filtered out in the 1st layer, since parameter optimization in the 2nd training step does not affect the parameters in each feature-extraction GRU.

From my point of view, this study did not explain why not all the patient data were used for training and testing. There is also a highly unbalanced ratio of MCI-C to MCI-NC subjects in the training samples compared to previous works, which might make the comparison unfair. In addition, the fairness of the network performance comparison could be questioned, as discussed in the seminar. When the single-modality experiment is compared with the proposed (multi-modality) experiment, the proposed model was also trained with longitudinal data; the effect of the longitudinal data may inflate the performance of the proposed model, so the comparison of single- and multi-modal data is not strictly controlled. It would therefore be useful to add a comparison of two models trained on single-modality and multi-modality data at the same time point, and likewise for the comparison of the longitudinal and proposed models.

Although the proposed network improves the accuracy of predicting MCI-to-AD conversion compared to previous works, future work includes modifying the network architecture so that it can learn integrative features from single and multiple modalities, for example by end-to-end training that links the GRUs to the 2nd training step.

Conclusion

This work made good use of the longitudinal and multi-modal data in the ADNI dataset to predict non-linear AD progression, which could be concluded as the following 4 take-home messages: 

  1. The two-step training method can take in variable-length longitudinal data
  2. A recurrent neural network (gated recurrent unit) can be used for AD prediction
  3. The model with multi-modal data performs better than single-modality models, and the model with longitudinal data performs better than the one using only baseline data
  4. Multi-modal and longitudinal data enabled integrative features to be extracted

References

[1] Alzheimer’s, A. 2015 Alzheimer’s disease facts and figures. Alzheimers Dement 11, 332–384 (2015).

[2] Zhang, D., Shen, D. & Alzheimer’s Disease Neuroimaging, I. Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease. Neuroimage 59, 895–907, https://doi.org/10.1016/j.neuroimage.2011.09.069 (2012).

[3] Cheng, B., Liu, M., Zhang, D., Munsell, B. C. & Shen, D. Domain Transfer Learning for MCI Conversion Prediction. IEEE Trans Biomed Eng 62, 1805–1817, https://doi.org/10.1109/TBME.2015.2404809 (2015).

[4] Kim, D. et al. A Graph-Based Integration of Multimodal Brain Imaging Data for the Detection of Early Mild Cognitive Impairment (E-MCI). Multimodal Brain Image Anal (2013) 8159, 159–169, https://doi.org/10.1007/978-3-319-02126-3_16 (2013).

[5] Ewers, M. et al. Prediction of conversion from mild cognitive impairment to Alzheimer’s disease dementia based upon biomarkers and neuropsychological test performance. Neurobiol Aging 33, 1203–1214, https://doi.org/10.1016/j.neurobiolaging.2010.10.019 (2012).

[6] LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444, https://doi.org/10.1038/nature14539 (2015).

[7] Cho, K. et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).

[8] Illustrated Guide to LSTM’s and GRU’s: A step by step explanation, Michael Nguyen, Towards Data Science (2018).

[9] Guo, J. Backpropagation through time. Unpubl. ms., Harbin Institute of Technology (2013).

[10] Saykin, A. J. et al. Genetic studies of quantitative MCI and AD phenotypes in ADNI: Progress, opportunities, and plans. Alzheimer’s & dementia: the journal of the Alzheimer’s Association 11, 792–814 (2015).

[11] Saykin, A. J. et al. Alzheimer’s Disease Neuroimaging Initiative biomarkers as quantitative phenotypes: genetics core aims, progress, and plans. Alzheimer’s & dementia: the journal of the Alzheimer’s Association 6, 265–273 (2010).
