Background and Motivation

Multimodal Learning with Functional and Structural MRI Analysis

Multimodal learning integrates data from various modalities, such as text, video, audio, and images, using deep learning techniques to provide a holistic understanding of complex systems. For the scope of this blog post, we will use the term to mean the integration of primarily imaging modalities, specifically Magnetic Resonance Imaging (MRI) modalities. Two main modalities are considered for this topic: structural and functional MRI.

Structural magnetic resonance imaging (sMRI) provides a static but detailed study of the brain by capturing high-resolution images of its anatomical structure, highlighting features such as gray and white matter.

In contrast, functional MRI (fMRI) provides insight into brain activity by monitoring changes in blood flow and oxygenation, highlighting the brain regions engaged by particular tasks or stimuli.

Multimodal Learning Strategies

As shown in Figure 1, three different approaches are frequently utilized for multimodal learning. Input-level fusion, sometimes referred to as feature-based fusion or data-based integration, is the first type: a single combined feature vector that integrates the multiple input modalities is fed into a neural network. Layer-level fusion, also referred to as joint fusion or intermediate integration, is the second technique: the features of each modality are fed to an independent neural network, and the representations learned in those networks' intermediate layers are combined and fed to the rest of the model. The third approach, referred to as late fusion, model-based integration, or decision-level fusion, combines the predictions of multiple models: a separate neural network is trained on each modality, and the outputs of the networks are then aggregated to obtain the final result [1],[2].

Each technique comes in several variants, each with its own pros and cons; the sketch below makes the three strategies concrete.
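To illustrate the three strategies, here is a minimal PyTorch sketch of each, assuming two hypothetical feature-vector modalities; all layer sizes and the averaging weights are illustrative choices, not taken from the cited surveys.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Input-level fusion: concatenate raw feature vectors, then one network."""
    def __init__(self, dim_a, dim_b, n_classes):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_a + dim_b, 64), nn.ReLU(),
                                 nn.Linear(64, n_classes))
    def forward(self, xa, xb):
        return self.net(torch.cat([xa, xb], dim=1))

class IntermediateFusion(nn.Module):
    """Layer-level fusion: one branch per modality, merge latent features."""
    def __init__(self, dim_a, dim_b, n_classes):
        super().__init__()
        self.branch_a = nn.Sequential(nn.Linear(dim_a, 32), nn.ReLU())
        self.branch_b = nn.Sequential(nn.Linear(dim_b, 32), nn.ReLU())
        self.head = nn.Linear(64, n_classes)
    def forward(self, xa, xb):
        z = torch.cat([self.branch_a(xa), self.branch_b(xb)], dim=1)
        return self.head(z)

class LateFusion(nn.Module):
    """Decision-level fusion: independent models, aggregate their predictions."""
    def __init__(self, dim_a, dim_b, n_classes):
        super().__init__()
        self.model_a = nn.Linear(dim_a, n_classes)
        self.model_b = nn.Linear(dim_b, n_classes)
    def forward(self, xa, xb):
        # Equal-weight average of the two models' outputs; the aggregation
        # rule (mean, vote, weighted sum) is itself a design choice.
        return 0.5 * self.model_a(xa) + 0.5 * self.model_b(xb)
```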

Figure 1: Fusion Strategies Using Deep Learning [3].


Motivation


In real-world clinical practice, healthcare professionals frequently have to perform a thorough analysis of different types of medical images in order to diagnose a patient’s condition. Since we aim to mimic such scenarios, adopting multimodal techniques becomes essential: one modality can compensate for the weaknesses of another, resulting in a more accurate assessment of the medical condition.

Studying functional or structural MRI in isolation has limitations: functional MRI captures moment-to-moment changes, while structural MRI provides only a static view. The use of multimodal learning is motivated by the need to bridge the gap between structural and functional insights, leading to a more comprehensive understanding of brain function. Deep learning-based multimodal medical image fusion can effectively extract and combine feature information from the various modalities, improving the clinical usability of medical imaging in the diagnosis and evaluation of diseases.

In this blog post, we will examine several articles that apply different multimodal fusion strategies to functional and structural imaging modalities. For each, we will explore the materials and methods, the models used, the results obtained, and the strengths and limitations of the approach.

Early Fusion Approach

Abnormal structural and functional network topological properties associated with left prefrontal, parietal, and occipital cortices significantly predict childhood TBI-related attention deficits: A semi-supervised deep learning study 

(Cao et al., 2023)

The motivation behind this research paper is to identify the functional and structural neurological alterations linked to attention deficits caused by traumatic brain injury (TBI) in children. A semi-supervised auto-encoder is used to predict attention deficits following TBI based on topological characteristics of both the structural and functional brain networks.

Materials and Methods

For the data acquisition, a total of 52 children with TBI and 53 controls were included in the group-level analyses. For each subject, a DTI scan, a task-based functional MRI scan, and a high-resolution T1-weighted MRI scan were collected. Diffusion tensor imaging (DTI) is a structural MRI modality that provides details about the microstructure of the brain’s white matter pathways; thus both DTI and T1-weighted MRI scans provide structural information. Task-based functional MRI scans provide functional insights by detecting changes in blood flow and oxygenation in response to neural activity.

In order to construct the structural and functional brain networks, graph theoretical technique (GTT)-based approaches were implemented to characterize the connectivity patterns present in the brain. These techniques depict the brain as a graph, with nodes representing brain regions (groups of voxels) and edges representing the connections between them.
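As an illustration of this kind of analysis, the following sketch derives a few standard graph-theoretical features from a synthetic connectivity matrix using networkx; the paper itself used the Brain Connectivity Toolbox, and the parcellation size and binarization threshold here are arbitrary assumptions.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
n_regions = 90                          # hypothetical parcellation size
conn = rng.random((n_regions, n_regions))
conn = (conn + conn.T) / 2              # symmetrize: undirected network
np.fill_diagonal(conn, 0)               # no self-connections

adj = (conn > 0.7).astype(int)          # binarize at an arbitrary threshold
G = nx.from_numpy_array(adj)

degree = dict(G.degree())               # nodal degree (number of connections)
clustering = nx.clustering(G)           # nodal clustering coefficient
global_eff = nx.global_efficiency(G)    # whole-network efficiency
```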

Multiple tools were used for the preprocessing and feature extraction or generation.

Structural MRI images were preprocessed using FreeSurfer, a neuroimaging analysis toolkit that offers an array of methods to measure the functional, anatomical, and connectivity characteristics of the human brain. DTI scans were preprocessed with the FMRIB Software Library (FSL), a software library that includes image analysis and statistical tools for functional and structural MRI brain imaging data, in order to form the structural network. Feature generation was performed using the Brain Connectivity Toolbox. A total of 685 features were extracted and integrated into a single feature vector, which is then fed to the developed model. The fusion strategy here is early fusion (an input-level multimodal learning strategy), as it integrates the multiple modalities (structural MRI, DTI, fMRI) into one feature vector before the model sees them.

For training, 60 features were selected from the original 685 brain features extracted from the structural and functional brain networks. Feature reduction was performed using a two-sample t-test, a mutual information-based method, and a Lasso-based method.
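The snippet below is a hedged sketch of these three feature-reduction methods on synthetic data, using scipy and scikit-learn; the significance threshold, the selected-feature count, and the Lasso penalty are assumptions rather than the paper's settings.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((105, 685))   # 105 subjects x 685 brain features
y = rng.integers(0, 2, size=105)      # TBI vs. control labels (synthetic)

# Two-sample t-test: keep features whose group means differ significantly.
_, pvals = ttest_ind(X[y == 1], X[y == 0], axis=0)
tt_mask = pvals < 0.05

# Mutual information: rank features by statistical dependence on the label.
mi = mutual_info_classif(X, y, random_state=0)
mi_top = np.argsort(mi)[::-1][:60]    # indices of the 60 highest-MI features

# Lasso: the L1 penalty drives coefficients of uninformative features to zero
# (here regressing the 0/1 label directly, a common selection shortcut).
lasso = Lasso(alpha=0.05).fit(X, y)
lasso_mask = lasso.coef_ != 0
```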

As shown in Figure 2, the semi-supervised auto-encoder has three main components: the encoder, the decoder, and the classifier. The encoder maps the input brain features into a latent space. To minimize the risk of overfitting, Gaussian noise was injected into 20% of the input features.

The decoder transforms the encoded features back into reconstructed features. The classifier uses a sigmoid activation function to predict whether the subject belongs to the TBI group or the control group. The Adam optimizer was used for back-propagation, and the loss of the full model was a weighted combination of the decoder (reconstruction) loss with weight 0.7 and the classifier loss with weight 0.3.
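The following is a minimal PyTorch sketch of such a semi-supervised auto-encoder; only the 60-feature input, the 20% noise fraction, and the 0.7/0.3 loss weights come from the paper, while the layer widths, latent size, and noise scale are assumptions.

```python
import torch
import torch.nn as nn

class SemiSupervisedAE(nn.Module):
    def __init__(self, n_features=60, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                     nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                     nn.Linear(32, n_features))
        self.classifier = nn.Sequential(nn.Linear(latent_dim, 1), nn.Sigmoid())

    def forward(self, x):
        # Corrupt a random 20% of the input features with Gaussian noise
        # (regularization against overfitting, as described above).
        mask = (torch.rand_like(x) < 0.2).float()
        z = self.encoder(x + mask * torch.randn_like(x))
        return self.decoder(z), self.classifier(z)

model = SemiSupervisedAE()
optimizer = torch.optim.Adam(model.parameters())
mse, bce = nn.MSELoss(), nn.BCELoss()

x = torch.randn(8, 60)                    # synthetic batch of brain features
y = torch.randint(0, 2, (8, 1)).float()   # TBI vs. control labels (synthetic)
recon, pred = model(x)
# Combined loss: reconstruction weighted 0.7, classification weighted 0.3.
loss = 0.7 * mse(recon, x) + 0.3 * bce(pred, y)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```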


Figure 2: Overall structure of the semi-supervised auto-encoder [4].


Results


The semi-supervised auto-encoder’s reconstruction was evaluated by computing the mean squared error (MSE) on the validation data, averaged over all five cross-validation folds. The classification was evaluated using classification accuracy and AUC, also averaged over the folds. For comparison, a classical machine learning model was constructed following the same training and validation procedure: a support vector machine (SVM) for classification with principal component analysis (PCA) for feature reduction. With an accuracy of 82.86% and an AUC of 0.860, the semi-supervised auto-encoder outperformed the PCA+SVM model, which achieved an accuracy of 78.09% and an AUC of 0.825.

Moreover, a permutation-based method was used to compute each feature’s importance score and thereby identify the most important brain features. As shown in Table 1, the six most predictive brain regions involved both the functional and structural networks, which emphasizes the value of combining functional and structural imaging modalities.

Table 1: Importance scores of the most important brain features in accurately differentiating children with TBI from controls [4].


Strengths and Limitations


The paper’s strength lies in its comprehensive approach to studying post-TBI attention deficits. It goes beyond simple classification, using deep learning to identify neurobiological features associated with TBI-related attention deficits: the semi-supervised auto-encoder studies topological alterations in the structural and functional brain networks and how well specific alterations in different brain regions predict attention deficits post-TBI.

The main strength of the fusion strategy used, early fusion, is its simplicity: it enables seamless integration of data from the different imaging modalities and keeps the overall architecture straightforward. Combining modalities at the input level can also be computationally efficient, as it reduces the complexity of the model’s subsequent layers. Moreover, combining the imaging modalities at an early stage preserves the features of each modality, which is advantageous when the modalities provide complementary information. However, early fusion may fail to capture complex relationships between the modalities.

One limitation noted by the authors is the risk of overfitting and limited generalization due to the small sample size, despite the attempt to mitigate it by introducing Gaussian noise. Finally, the evaluation does not contrast multimodal against unimodal techniques; it only compares the developed semi-supervised auto-encoder with a simple PCA + SVM model operating on the same fused input vector, which limits what can be concluded about multimodal versus unimodal effectiveness.


Late Fusion Approach

Combined Structural MR and Diffusion Tensor Imaging Classify the Presence of Alzheimer’s Disease With the Same Performance as MR Combined With Amyloid Positron Emission Tomography: A Data Integration Approach

(Agostinho et al., 2022)

The motivation behind this research paper is to explore whether combining different imaging modalities can provide better insights and improve early detection of Alzheimer’s disease. For Alzheimer’s disease classification, the authors compared all possible combinations of three key imaging modalities: diffusion tensor imaging (DTI), structural magnetic resonance imaging (sMRI), and positron emission tomography (PET).

Materials & Methods


For this research paper, two datasets were used: an internal dataset (Table 2) and an external one (Table 3). The internal dataset was acquired locally by gathering imaging data from participants who had received an early diagnosis of Alzheimer’s disease and had been assessed by a neurologist. For each subject, a DTI, a structural MRI (T1-weighted), and a functional PiB-PET scan were acquired. Validation was performed using an external dataset from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database.

Table 2: Demographics and Neuropsychologic Characteristics for the Study Population [5].

Table 3: Demographics and Neuropsychologic Characteristics for the ADNI External Data [5].


The preprocessing and feature generation were done using multiple tools: ExploreDTI for DTI scans, the Computational Anatomy Toolbox (CAT) for T1-weighted MRI, and Statistical Parametric Mapping (SPM12) for PiB-PET scans. Regions of interest (ROIs) were identified, and measurements were retrieved from the different imaging modalities (sMRI, PET, DTI) for each ROI. Two feature selection techniques were used: an embedded-based method (EBM) and a filter-based method (FBM).

Figure 3: Overall Scheme of the Construction and Validation of the Models [5].


The reduced feature sets were fed into support vector machine (SVM) classifiers. As shown in Figure 3, for each feature set an SVM classifier was constructed per imaging modality: an MRI-based classifier, a PiB-PET-based classifier, and a DTI-based classifier. The models were evaluated on the external validation data, and the best-performing models were then selected for the ensemble phase. To evaluate the effectiveness of the various modality combinations, the classifiers were combined using a weighted fusion technique, which applies weight values (between 0 and 1) to the predicted probabilities from each trained classifier. Four combinations were evaluated: three combining two imaging modalities (sMRI + DTI, sMRI + PiB-PET, and PiB-PET + DTI) with a weight of ½ each, and one combining all three modalities (sMRI + PiB-PET + DTI) with a weight of ⅓ each. The multimodal fusion strategy used here is decision-level, also referred to as late fusion, as it integrates the outputs of the independently processed modalities to make a final decision.
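Below is a hedged scikit-learn sketch of this weighted decision-level fusion; the per-modality SVMs and the probability-averaging step follow the description above, while the data shapes and hyperparameters are placeholders, not the paper's pipeline.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 120
X_mri, X_pet, X_dti = (rng.standard_normal((n, 30)) for _ in range(3))
y = rng.integers(0, 2, size=n)  # AD vs. control labels (synthetic)

# One SVM per modality, each trained independently on its own features.
clfs = {m: SVC(probability=True).fit(X, y)
        for m, X in [("sMRI", X_mri), ("PiB-PET", X_pet), ("DTI", X_dti)]}

def fuse(weights, feats):
    """Weighted average of per-modality predicted probabilities."""
    proba = sum(w * clfs[m].predict_proba(X)[:, 1]
                for (m, X), w in zip(feats, weights))
    return (proba >= 0.5).astype(int)

# Bimodal: weight 1/2 each; trimodal: weight 1/3 each, as in the study.
pred_bi = fuse([0.5, 0.5], [("sMRI", X_mri), ("PiB-PET", X_pet)])
pred_tri = fuse([1/3, 1/3, 1/3],
                [("sMRI", X_mri), ("PiB-PET", X_pet), ("DTI", X_dti)])
```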

Results

As shown in Table 4, combinations of modalities performed better than single-modality models. The PiB-PET + sMRI model, with an accuracy of 98.05%, demonstrated superior performance compared to the sMRI-only model. Similarly, the combination of sMRI and DTI yielded better accuracy (97.30%) than the base unimodal classifiers. However, combining all three modalities, sMRI + PiB-PET + DTI (98.11%), did not bring a significant improvement over the bimodal combination of PiB-PET and sMRI (98.05%), which may indicate that DTI and PiB-PET contribute redundant information to the classification model; both imaging modalities might reflect non-independent biological processes.

Table 4: Ensemble Classification Performance [5].


Strengths and Limitations

The paper’s strength lies in its comprehensive approach, which uses base and ensemble classifiers to investigate all possible combinations of imaging modalities, demonstrating that combining multiple imaging modalities, both structural and functional, improves classification accuracy. The use of external data for validation also makes the approach more robust, as it ensures the generalizability of the findings beyond the initial internal dataset. Nevertheless, the late fusion strategy might not fully exploit the complementarity and correlations between the different imaging modalities, and the learning process may be affected by the aggregation design choice, which in this study is a simple weighting function.


Structural and Functional MRI Data Differentially Predict Chronological Age and Behavioral Memory Performance

(Soch et al., 2022)

The motivation behind this paper is to investigate cognitive aging, the tendency of explicit memory performance to deteriorate with advancing age. Against this overall tendency, some older people exhibit "successful aging," defined by maintained cognitive function. The study intends to investigate possible explanations, such as increased brain-structural integrity, more effective resource usage, or compensatory cognitive mechanisms, in order to identify the factors behind this phenomenon. Concretely, the research explores how data from various modalities, including functional MRI, structural MRI, and behavioral measures, differentially predict chronological age and behavioral memory performance in young and older healthy adults.


Materials and Methods


The internal dataset consisted of structural and functional MRI scans from a sample of 106 young and 153 older subjects. Statistical Parametric Mapping (SPM12) was used for data preprocessing. As shown in Figure 4, multiple source variables were extracted, including structural MRI maps (gray matter volume maps from T1-weighted images), resting-state functional MRI maps, fMRI contrast images, fMRI summary statistics, and behavioral data representing response frequencies from a surprise recognition memory test performed by the participants. Three target variables were also extracted for each subject for the prediction analyses: age group (young or older), chronological age (in years), and memory performance. Following the extraction of the source and target variables, multiple analyses were carried out, each using SVMs to predict a single target variable from a feature set of source variables: support vector classification (SVC) for the age group, and support vector regression (SVR) for chronological age and memory performance.
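The sketch below illustrates these prediction analyses with scikit-learn's SVC and SVR on synthetic data, evaluated with the same kinds of metrics as the study; the feature dimensions, cross-validation setup, and hyperparameters are placeholders, not the paper's.

```python
import numpy as np
from sklearn.svm import SVC, SVR
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import balanced_accuracy_score, mean_absolute_error

rng = np.random.default_rng(0)
X = rng.standard_normal((259, 100))        # 259 subjects x (placeholder) source features
age_group = rng.integers(0, 2, size=259)   # young (0) vs. older (1)
age_years = rng.uniform(20, 80, size=259)  # chronological age in years
memory = rng.standard_normal(259)          # memory performance score

# SVC for the binary age group, SVR for the two continuous targets.
group_pred = cross_val_predict(SVC(), X, age_group, cv=5)
age_pred = cross_val_predict(SVR(), X, age_years, cv=5)
mem_pred = cross_val_predict(SVR(), X, memory, cv=5)

# Evaluation mirrors the study: balanced accuracy for classification,
# correlation and absolute error for the regressions.
print(balanced_accuracy_score(age_group, group_pred))
print(np.corrcoef(age_years, age_pred)[0, 1],
      mean_absolute_error(age_years, age_pred))
```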

Figure 4: Methodology of the Study [6].


Results


For the age-group target variable, prediction performance was assessed using balanced accuracy; for chronological age and memory performance, it was assessed using correlation coefficients and absolute errors. The study comparatively evaluated the ability of functional and structural MRI data, together with behavioral data, to predict chronological age versus memory performance in young and older subjects. According to the results, task-based fMRI provides the best prediction of dependent memory performance, while single-value fMRI scores provide the most accurate prediction of independent memory performance; conversely, structural MRI maps provide the best prediction of chronological age. It was also found that the effects of memory and age were specific to structural MRI as opposed to fMRI. The research highlights the significance of model comparison, assessing the individual performance of each model, as an essential step in choosing the best model for a particular task.

Strengths and Limitations


One of the study's strongest points is the use of several source variables, such as maps, contrast images, and summary statistics, extracted from both fMRI and sMRI. Nevertheless, it is important to note that the study performs a model comparison across several modality inputs rather than an actual fusion of features. This design choice enables a thorough evaluation of the individual predictive power of behavioral measurements and of structural and functional MRI data; however, the lack of true multimodal integration leaves possible cross-modal correlations unexplored. Although the approach is not precisely classified as early, intermediate, or late fusion, it is consistent with the idea of combining information, here in the form of model evaluations, after the models have been constructed and assessed independently. In that sense, this model comparison could be considered a form of decision-level multimodal strategy.


Intermediate Fusion Approach

Multi-modal deep learning of functional and structural neuroimaging and genomic data to predict mental illness

(Rahaman et al., 2021)


The motivation behind this paper is to address the difficulties of diagnosing and understanding neuropsychiatric illnesses such as schizophrenia, which are highly variable and are usually diagnosed based on self-reported symptoms. The existing diagnostic approach lacks predictive power and provides no insight into the underlying neural and biological mechanisms of these disorders. In response, the study constructs a multi-modal deep learning framework to predict schizophrenia.

Materials and Methods


The dataset consisted of structural magnetic resonance images (sMRI), resting-state functional magnetic resonance images (fMRI), and genome-wide single nucleotide polymorphism (SNP) data from 275 healthy controls and 162 schizophrenia subjects. Both the functional and structural MRIs were preprocessed using Statistical Parametric Mapping (SPM12). As shown in Figure 5, the multimodal architecture has two submodules: one for feature extraction and selection, using group independent component analysis (gICA) and genotyping, and a deep neural network module for learning the modality features.

In the feature extraction submodule, a fully automated independent component analysis (ICA)-based pipeline extracts functional and structural networks from the respective scans, running gICA on both with specific settings such as the expected number of independent components. For the functional MRI scans, a static functional network connectivity (sFNC) matrix is computed, representing the strength of connectivity between the different components. For the structural MRI scans, gICA constructs the ICA loading matrix, which shows how much each voxel contributes to the overall pattern represented by each independent component. SNPs are generated from the genomic sequence.

The deep neural networks (DNN) submodule consists of four subnetworks:

  1. The sFNC features from fMRI scans are fed to an encoder.
  2. ICA loadings from sMRI scans are fed to a multilayer feed-forward network (FFN).
  3. SNPs are inputs to a bi-directional long short-term memory (LSTM) unit with an attention mechanism.
  4. The fourth neural network is a fully connected network integrating the joint features from the previous networks. It consists of a sequence of fully connected layers followed by a softmax prediction layer.

The joint features fed to this final network are obtained from a weighted fusion of the latent features from the three input modalities. The approach adopted here is layer-level (intermediate) fusion, as it aggregates the latent features from sMRI, fMRI, and SNPs as input to the final subnetwork; a sketch of such an architecture is shown below.
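The following is a hedged PyTorch sketch of a joint fusion architecture with the four subnetworks described above; the layer sizes (including the sFNC input dimension), the attention form, and the learnable fusion weights are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class JointFusionNet(nn.Module):
    def __init__(self, sfnc_dim=1378, ica_dim=100, snp_vocab=4,
                 latent=64, n_classes=2):
        super().__init__()
        # 1. Encoder for the sFNC features (fMRI).
        self.fmri_enc = nn.Sequential(nn.Linear(sfnc_dim, 256), nn.ReLU(),
                                      nn.Linear(256, latent))
        # 2. Feed-forward network for the sMRI ICA loadings.
        self.smri_ffn = nn.Sequential(nn.Linear(ica_dim, 128), nn.ReLU(),
                                      nn.Linear(128, latent))
        # 3. Bi-directional LSTM with a simple additive attention over SNPs.
        self.snp_embed = nn.Embedding(snp_vocab, 16)
        self.snp_lstm = nn.LSTM(16, latent // 2, bidirectional=True,
                                batch_first=True)
        self.attn = nn.Linear(latent, 1)
        # 4. Fully connected head over the fused latent features.
        self.head = nn.Sequential(nn.Linear(3 * latent, 64), nn.ReLU(),
                                  nn.Linear(64, n_classes))
        # Learnable weights for fusing the three latent vectors.
        self.w = nn.Parameter(torch.ones(3))

    def forward(self, sfnc, ica, snps):
        z_f = self.fmri_enc(sfnc)
        z_s = self.smri_ffn(ica)
        h, _ = self.snp_lstm(self.snp_embed(snps))   # (B, L, latent)
        a = torch.softmax(self.attn(h), dim=1)       # attention over SNP positions
        z_g = (a * h).sum(dim=1)
        w = torch.softmax(self.w, dim=0)             # normalized fusion weights
        z = torch.cat([w[0] * z_f, w[1] * z_s, w[2] * z_g], dim=1)
        return torch.log_softmax(self.head(z), dim=1)

net = JointFusionNet()
out = net(torch.randn(4, 1378), torch.randn(4, 100),
          torch.randint(0, 4, (4, 500)))             # batch of 4 subjects
```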

Figure 5: Multi-modal Architecture [7].

Results


For comparison purposes, multiple baseline models were implemented: unimodal models (fMRI-based, sMRI-based, and genetic data-based) and bimodal models with different combinations (sMRI + fMRI, SNPs + sMRI, and SNPs + fMRI). Furthermore, for the models integrating all three modalities, different data fusion techniques were implemented (mid fusion and late fusion). As shown in Table 5 and Figure 6, the multimodal model integrating functional MRI, structural MRI, and SNPs using a late fusion strategy outperformed the unimodal neural networks and the other baseline models in schizophrenia classification, achieving an ROC AUC of 0.85 and an accuracy of 88%.


Table 5: Accuracy Results [7].

Figure 6: ROC and AUC for all models implemented [7].

Strengths and Limitations


The main advantage of this approach is the application of intermediate fusion, which makes it possible to investigate correlations and interactions between the various modalities more effectively. One of the strongest assets of this study is that it produced several baselines against which to compare the suggested multimodal classification, conducting a comprehensive evaluation of the proposed multi-modal joint fusion model against baselines using various fusion types and combinations of modalities. Also, compared to the previously reviewed papers, the suggested model improves its accuracy by adding genetics as a modality in addition to the imaging modalities. Nevertheless, the paper acknowledges the heterogeneity in the dataset and notes that the model's complexity could be considered a limitation.

Conclusion 

Deep learning-based multi-modal medical image fusion can efficiently extract and integrate feature information from various modalities, enhancing the clinical usability of medical images in the assessment and diagnosis of medical conditions. Through the fusion of data from multiple modalities, one modality can compensate for the shortcomings of another, supporting a more accurate evaluation of the medical condition and richer diagnostic information. Because multi-modal deep learning can produce more precise and accurate predictions, it has enormous potential to enhance medical diagnosis and treatment: multi-modal data can improve diagnostic precision, help healthcare professionals diagnose diseases such as Alzheimer’s disease, forecast illness risk, and personalize treatments.


References

[1] Fatemeh Behrad, Mohammad Saniee Abadeh, An overview of deep learning methods for multimodal medical data mining. Expert Systems with Applications, Volume 200, (2022). doi: 10.1016/j.eswa.2022.117006.

[2] Pei, X., Zuo, K., Li, Y. et al. A Review of the Application of Multi-modal Deep Learning in Medicine: Bibliometrics and Future Directions. Int J Comput Intell Syst 16, 44 (2023). doi: 10.1007/s44196-023-00225-6.

[3] Huang, SC., Pareek, A., Seyyedi, S. et al. Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. npj Digit. Med. 3, 136 (2020). doi: 10.1038/s41746-020-00341-z

[4] Cao M, Wu K, Halperin JM and Li X (2023) Abnormal structural and functional network topological properties associated with left prefrontal, parietal, and occipital cortices significantly predict childhood TBI-related attention deficits: A semi-supervised deep learning study. Front. Neurosci. 17:1128646. doi: 10.3389/fnins.2023.1128646

[5] Agostinho, D., Caramelo, F., Moreira, A., Santana, I., Abrunhosa, A., Castelo-Branco, M. Combined Structural MR and Diffusion Tensor Imaging Classify the Presence of Alzheimer’s Disease With the Same Performance as MR Combined With Amyloid Positron Emission Tomography: A Data Integration Approach. Front. Neurosci. 15 (2022). doi: 10.3389/fnins.2021.638175.

[6] Soch J, Richter A, Kizilirmak JM, Schütze H, Feldhoff H, Fischer L, Knopf L, Raschick M, Schult A, Düzel E, Schott BH. Structural and Functional MRI Data Differentially Predict Chronological Age and Behavioral Memory Performance. eNeuro. 2022 Nov 14;9(6):ENEURO.0212-22.2022. doi: 10.1523/ENEURO.0212-22.2022.

[7] Rahaman MA, Chen J, Fu Z, Lewis N, Iraji A, Calhoun VD. Multi-modal deep learning of functional and structural neuroimaging and genomic data to predict mental illness. Annu Int Conf IEEE Eng Med Biol Soc. 2021 Nov;2021:3267-3272. doi: 10.1109/EMBC46164.2021.9630693.
