Introduction

Transfer Learning is a very popular technique in Machine Learning, where a model is first pretrained on a specific dataset for a specific task, and this model is later reused as an initialization for the task of interest [1]. In Medical Imaging, a big architecture (like ResNet50) is often pretrained on a very large dataset of natural images with a high diversity of labels and samples, usually ImageNet (fig. 1), and this model is then fine-tuned on the Medical Imaging dataset. This setup can be considered the de facto method for applying Deep Learning in medical settings. It has been used for pathology classification on chest x-rays [2], eye pathology identification [3], detection of Alzheimer’s disease [4] and other use cases.


Figure 1: ImageNet samples

                       Source [5]  

   Figure 2: A retinal fundus image and chest x-ray image from our datasets


This paper examines the behavior of Transfer Learning in Medical Imaging tasks and makes four major contributions:

  1. Assesses the performance of Transfer Learning
  2. Analyzes the effects of Transfer Learning on the representations learned by the networks
  3. Identifies benefits of Transfer Learning beyond feature reuse
  4. Tries weight transfusion and explores hybrid approaches to Transfer Learning

Related work

Several papers, mostly from 2018, have questioned common beliefs about the effects of Transfer Learning on natural image datasets. In [6] the authors conclude that more pretraining data does not always result in better performance. This is mostly seen when we pretrain on a coarse-grained dataset, like ImageNet or JFT, and then fine-tune on a fine-grained dataset, like Birdsnap and FGVC Aircraft in that paper. As we see in fig. 3, pretraining on the JFT dataset and random initialization give similar results on these two specialized datasets. The authors also propose a new method for matching the distribution of the pretraining dataset to the distribution of the target dataset, which achieves better results in all cases.

We see similar results in [7] as well. Again (fig. 4), for the two datasets specialized in Cars and Aircraft, respectively, random initialization performs similarly to pretraining on ImageNet followed by fine-tuning.


Figure 3: Pretraining on the JFT dataset and random initialization give similar results on the two specialized datasets Birdsnap and FGVC Aircraft. JFT-Adaptive Transfer refers to distribution matching between the datasets.

Source [6]


Figure 4: For the two datasets specialized in Cars and Aircraft, respectively, random initialization performs similarly to pretraining on ImageNet and then fine-tuning on top of it.

Source [7]


Methodology

Performance Evaluation of Transfer Learning

The performance of Transfer Learning is evaluated for:

  • Big, standard, high-performing architectures that are very popular for ImageNet and for Transfer Learning
  • Small CNN architectures

These architectures are evaluated in terms of their performance when (1) training from random initialization and (2) performing Transfer Learning from ImageNet (pretraining the architecture on ImageNet and then fine-tuning it on the medical dataset).
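As a minimal illustration (not the authors' exact code), the two initialization schemes could be set up as follows, assuming a recent torchvision version and a hypothetical 5-class medical task:

```python
import torch.nn as nn
import torchvision.models as models

NUM_CLASSES = 5  # e.g. the 5 Diabetic Retinopathy severity grades (assumption)

# (1) Random initialization: no pretrained weights.
random_init_model = models.resnet50(weights=None)
random_init_model.fc = nn.Linear(random_init_model.fc.in_features, NUM_CLASSES)

# (2) Transfer Learning: load ImageNet-pretrained weights, replace the head,
# then fine-tune the whole network on the medical dataset.
pretrained_model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
pretrained_model.fc = nn.Linear(pretrained_model.fc.in_features, NUM_CLASSES)
```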

Analysis of the effects of Transfer Learning in the learned representations

(Singular Vector) Canonical Correlation Analysis – (SV)CCA

Figure 5a and Figure 5b


To understand the representations learned by the neural networks, the authors use a method called (Singular Vector) Canonical Correlation Analysis, (SV)CCA [8]. It is an efficient tool, invariant to affine transformations, for comparing two representations, for example: two layers of the same network, the same layer before and after training, or the same layer of networks trained with different initializations.

It works with the collection of outputs of a neuron over a sequence of inputs, called a neuron activation vector. In fig. 5a, consider the blue neuron: for an input (an image in our case) x_1, the neuron produces the scalar output z^{L_1}_1(x_1), and so on. In the end we have the neuron activation vector of the blue neuron, [z^{L_1}_1(x_1), z^{L_1}_1(x_2), ..., z^{L_1}_1(x_n)], for our specific dataset of n images. We perform the same procedure for all of the neurons in that layer and obtain a matrix L_1 \in R^{a \times n} as a representation of the red layer. Similarly, we obtain the representation L_2 \in R^{b \times n}.
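A minimal sketch of how such a representation matrix could be assembled in PyTorch (an assumption; `model_layer_fn` is a hypothetical helper that returns the activations of the layer of interest):

```python
import torch

def layer_representation(model_layer_fn, images):
    """Stack neuron activation vectors into an (a x n) layer representation."""
    with torch.no_grad():
        acts = model_layer_fn(images)      # e.g. shape (n, channels, h, w)
    acts = acts.flatten(start_dim=1)       # one column per unit -> shape (n, a)
    # Transpose so rows are neurons and columns are the n datapoints:
    # each row is one neuron activation vector.
    return acts.cpu().numpy().T            # shape (a, n)
```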

Because a neuron is represented as a vector, we can consider a layer as the subspace spanned by its neurons [8]. What CCA does is find the “best” (correlation-maximizing) linear relationship between the two representations (multidimensional variates) L_1 and L_2 [9]. Mathematically, it tries to find two vectors w \in R^a and s \in R^b such that \rho, the cosine of the angle between the vectors w^T L_1 and s^T L_2, is maximized, where:

\rho = \frac{\langle w^T L_1,\, s^T L_2 \rangle}{\lVert w^T L_1 \rVert \times \lVert s^T L_2 \rVert} [9]

Maximizing \rho reduces to solving a singular value decomposition (SVD); for a more detailed explanation please have a look at [9]. Finally, the CCA similarity score is simply the mean of the multiple \rho^{(i)} values obtained from solving the SVD.

As a pre-processing step to performing CCA, we need to make sure that the representations L_1 and L_2 are centered so we can compare them [9].

The SV in (SV)CCA means that we first perform an SVD on L_1 and L_2 to obtain subspaces that contain the most important directions of the original subspaces.
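The following is a hedged sketch of the full (SV)CCA similarity computation described above (centering, SVD reduction, then CCA via an SVD); see [8, 9] for the reference formulation:

```python
import numpy as np

def svcca_similarity(L1, L2, keep_variance=0.99):
    """L1: (a, n) and L2: (b, n) layer representations over the same n inputs."""
    # Center each neuron activation vector.
    L1 = L1 - L1.mean(axis=1, keepdims=True)
    L2 = L2 - L2.mean(axis=1, keepdims=True)

    def svd_reduce(L):
        # "SV" step: keep the top directions explaining `keep_variance` of the variance.
        U, s, Vt = np.linalg.svd(L, full_matrices=False)
        k = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), keep_variance)) + 1
        return np.diag(s[:k]) @ Vt[:k]             # reduced (k, n) representation

    def whiten(R):
        # Orthonormal basis of R's row space (whitened representation).
        _, _, Vt = np.linalg.svd(R, full_matrices=False)
        return Vt

    R1, R2 = svd_reduce(L1), svd_reduce(L2)
    # CCA correlations = singular values of the product of the whitened representations.
    rho = np.linalg.svd(whiten(R1) @ whiten(R2).T, compute_uv=False)
    return float(np.mean(np.clip(rho, 0.0, 1.0)))  # mean of the rho^(i) values
```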

Do Transfer Learning and Random Initialization learn the same features (representations)?

To answer this question, the authors compare the CCA similarity scores between the representations learned when starting from pretrained weights and the representations learned when starting from randomly initialized weights. The comparison is done for different architectures, using the representations of the top two layers for the small CNNs and the top two stages for the big architectures; the CCA scores are averaged. To gain more intuition, have a look at fig. 6. Finally, as a baseline to compare these CCA scores against, they also train the networks with two different random initializations and extract the representations of the same layers as before.

Figure 6: The same network trained with two different initializations: randomly initialized weights and pretrained weights. In both cases we extract the representations of the last layer, perform CCA between these representations, and obtain a similarity score.

Do large models change more through training?

Figure 7: Per-layer similarities before and after training. This tells us how much a layer (the red one, for example) has changed through training.


To answer this question we first choose a network of interest (ideally both a big network and a smaller one, so the results can be compared across architecture sizes). Next, for the chosen network, we compute the per-layer similarities of the network before and after training, when using Transfer Learning and/or Random Initialization. That is, for a specific layer, for example the red layer in fig. 7, we get one representation before training and one after training, and compute a CCA similarity score between them. This tells us how much the layer has changed through training. We follow this procedure for all the layers of the network (a small sketch of this loop is shown below).
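A small sketch of this per-layer loop, reusing the hypothetical `layer_representation()` and `svcca_similarity()` helpers from the sketches above:

```python
def per_layer_change(layer_fns_before, layer_fns_after, images):
    """layer_fns_before/after: dicts mapping layer name -> activation-extraction fn."""
    scores = {}
    for name in layer_fns_before:
        rep_before = layer_representation(layer_fns_before[name], images)
        rep_after = layer_representation(layer_fns_after[name], images)
        # High similarity => the layer moved little during training.
        scores[name] = svcca_similarity(rep_before, rep_after)
    return scores
```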


Feature independent properties of Transfer Learning and Weight Transfusion

Weight scaling - Does Transfer Learning without feature reuse increase the convergence speed?

The authors investigate whether Transfer Learning without the feature extraction ability inherited from pretraining still increases the convergence speed over plain random initialization. To do this, they keep only the scaling of the pretrained weights and remove the feature extraction ability. For each layer of the network, they create a normal distribution \mathcal{N}(\mu,\,\sigma^{2}), where \mu and \sigma^2 are the mean and variance of the pretrained weights of that layer. Next, they initialize the respective layer of the network by sampling i.i.d. weights from this distribution. This initialization is called the Mean Var initialization.
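A minimal PyTorch sketch of this Mean Var initialization (an assumption of how it could be implemented, not the authors' code):

```python
import torch

def mean_var_init(target_model, pretrained_model):
    with torch.no_grad():
        for p_tgt, p_src in zip(target_model.parameters(), pretrained_model.parameters()):
            mu, sigma = p_src.mean(), p_src.std()
            # Re-sample the layer i.i.d. from N(mu, sigma^2): same scale, no reused features.
            p_tgt.copy_(torch.randn_like(p_tgt) * sigma + mu)
    return target_model
```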

Which layers have the highest impact on convergence speed?

Figure 8a: First layer initialized with pretrained weights, other layers with random weights

Figure 8b: First two layers initialized with pretrained weights, the other layers with random weights

Figure 8c: All layers initialized with pretrained weights


To answer this question the authors try weight transfusion. With this method they initialize part of the network’s layers with pretrained weights and the rest of the layers with random weights. They repeat this, transfusing more and more layers, until all the layers are initialized with pretrained weights (fig. 8c).
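A hedged PyTorch sketch of weight transfusion, assuming the layer (or block) names are known in order (hypothetical helper, not the paper's implementation):

```python
import torch

def transfuse_first_k(random_model, pretrained_model, layer_names_in_order, k):
    """Copy pretrained weights into the first k layers; leave the rest randomly initialized."""
    rand_state = random_model.state_dict()
    pre_state = pretrained_model.state_dict()
    transfused = tuple(layer_names_in_order[:k])
    for name, tensor in pre_state.items():
        # A parameter belongs to a transfused layer if its name starts with one of them.
        if name.startswith(transfused):
            rand_state[name] = tensor.clone()
    random_model.load_state_dict(rand_state)
    return random_model
```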

Hybrid approach to Transfer Learning

In this approach we slim the big models: we reuse pretrained weights up to a certain layer, randomly initialize the rest of the layers, and at the same time slim the randomly initialized part of the model (reduce its number of channels). We then see how this approach performs.
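As a toy illustration of the idea (not the paper's exact slimmed architecture), one could reuse a pretrained ResNet50 trunk and attach a randomly initialized, half-width tail:

```python
import torch.nn as nn
import torchvision.models as models

def slim_resnet50(num_classes=5, width_divisor=2):
    full = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    # Reused, pretrained part: stem + layer1 + layer2 (roughly "up to Block2"); outputs 512 channels.
    trunk = nn.Sequential(full.conv1, full.bn1, full.relu, full.maxpool,
                          full.layer1, full.layer2)
    # Randomly initialized, slimmed tail standing in for layer3/layer4 (toy stand-in).
    tail = nn.Sequential(
        nn.Conv2d(512, 1024 // width_divisor, 3, stride=2, padding=1),
        nn.BatchNorm2d(1024 // width_divisor), nn.ReLU(inplace=True),
        nn.Conv2d(1024 // width_divisor, 2048 // width_divisor, 3, stride=2, padding=1),
        nn.BatchNorm2d(2048 // width_divisor), nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(2048 // width_divisor, num_classes),
    )
    return nn.Sequential(trunk, tail)
```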

Experimental Setup

Datasets

The main dataset used is the RETINA dataset, which consists of retinal fundus photographs [3]. The goal is to detect Diabetic Retinopathy, graded on a 5-point scale of increasing severity. The second dataset is the CheXpert dataset [10], which is used to detect 5 pathologies (atelectasis, cardiomegaly, consolidation, edema and pleural effusion). For both datasets the evaluation is done via AUC-ROC.
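A minimal sketch of the AUC-ROC evaluation using scikit-learn, with hypothetical label and prediction arrays:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical data: binary labels and predicted probabilities for the 5 CheXpert pathologies.
y_true = np.random.randint(0, 2, size=(100, 5))
y_score = np.random.rand(100, 5)

per_pathology_auc = [roc_auc_score(y_true[:, i], y_score[:, i]) for i in range(5)]
print("Mean AUC-ROC:", np.mean(per_pathology_auc))
```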

Models

Figure 9: An example of a CNN model in the CBR family

For the big ImageNet architectures, ResNet50 [11] and Inception-v3 [12] are used. For the small models, the authors created a small family of CNNs called CBR, where each layer consists of a 2D convolution, batch normalization and a ReLU activation function (Fig. 9). The networks in this family differ mainly in their depth and width. Some models (CBR-LargeT, CBR-LargeW) are about a third of the size of a standard ImageNet model, and some (CBR-Tiny) about a twentieth of the size.
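A hedged PyTorch sketch of a single CBR layer (the kernel size and channel counts below are placeholders, not the paper's exact configuration):

```python
import torch.nn as nn

def cbr_layer(in_channels, out_channels, kernel_size=3, stride=1):
    """One CBR block: Conv2d -> BatchNorm -> ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, stride=stride,
                  padding=kernel_size // 2, bias=False),  # BN makes the conv bias redundant
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )

# A CBR model is a stack of such layers followed by a classifier; family members
# differ in how many layers and channels they use.
```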

Training setup

For the retina dataset the settings are:

  • 587x587 images
  • Learning rate of 0.001
  • Batch size of 8
  • Adam optimizer

For the CheXpert dataset the settings are:

  • 224x224 images
  • Learning-rate schedule inherited from ImageNet training: linear warm-up from 0 to 0.1 x 32/256 over the first 5 epochs, then decay by a factor of 10 at epochs 30, 60 and 90 (see the sketch after this list)
  • Batch size of 32
  • Vanilla SGD with momentum (coefficient of 0.9)
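A hedged sketch of the CheXpert learning-rate schedule listed above (epoch-level granularity is an assumption):

```python
def chexpert_lr(epoch, batch_size=32, base_lr=0.1, warmup_epochs=5):
    """Linear warm-up to base_lr * batch_size/256, then /10 at epochs 30, 60 and 90."""
    peak = base_lr * batch_size / 256.0            # 0.1 * 32/256 = 0.0125
    if epoch < warmup_epochs:
        return peak * (epoch + 1) / warmup_epochs  # linear warm-up
    decay_steps = sum(epoch >= e for e in (30, 60, 90))
    return peak / (10 ** decay_steps)

# Example usage: warm-up, peak, then stepwise decay.
print([chexpert_lr(e) for e in (0, 4, 10, 35, 95)])
```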



Results and Discussion


For each of the experiments shown in this section, please refer to the Methodology chapter for details on how the experiment was carried out.

Performance Evaluation of Transfer Learning

Figure 10: Performance of Transfer Learning and Random Initialization in the RETINA dataset, across multiple architectures


The conclusions from this section are (fig. 10):

  1. Transfer Learning and random initialization perform comparably across all architectures.
  2. The smaller models perform comparably to the large models, across the two types of initialization (Random Initialization and Transfer Learning).
  3. Performance on ImageNet is not predictive of performance on the medical dataset. As we can see, the small models perform very poorly on ImageNet Top-5, but they perform similarly to the big models on the medical datasets. This is an indication that the large models are overparameterized for the medical datasets.

The results are the same for the CheXpert dataset as well. 


Analysis of the effects of Transfer Learning in the learned representations

Transfer Learning and Random Initialization learn different features (representations)

Figure 11: CCA similarity scores when comparing representations learned from pretrained weights with those learned from randomly initialized weights (yellow scores) are much smaller than the CCA similarity scores when comparing two different random initializations (blue scores). The experiment was done on the higher layers of the networks.


In Fig. 11 we notice that the CCA similarity scores when comparing pretrained weights with randomly initialized weights are much smaller than the CCA scores when comparing two different random initializations. This means that Transfer Learning actually gives rise to features different from those of random initialization, and this is more noticeable in the large models.

Large models move (change) less through training (especially in the lowest layers)

Figure 12: Per-layer CCA similarities before and after training for multiple models and two initialization types. This tells us how much the layers have changed through training.

The conclusion stated in the heading can be seen in Fig. 12. We notice that the per-layer CCA similarity between before and after training for the first layers of ResNet50 (a big model) is larger (i.e. those layers have changed less) than the CCA similarity of the first layers of the other two, smaller models. This holds for both Transfer Learning and Random Initialization.

We also notice that the similarity with the initialization is much higher for pretrained weights across all architectures, especially in the lowest layers, which suggests that feature reuse mostly happens in the first layers.

Feature independent properties of Transfer Learning and Weight Transfusion

Weight scaling - Transfer Learning without feature reuse increases the convergence speed

Figure 13: Mean Var initialization (weight scaling) increases the convergence speed compared to random initialization


When we use the Mean Var initialization (weight scaling) to initialize the model, we notice an increase in convergence speed (green line) compared to randomly initializing the model (blue line). Hence, Transfer Learning has benefits beyond feature reuse, namely the scaling of the weights.


First layers have the biggest impact on convergence speed

Figure 14: Weight transfusion experiments showing how the convergence speed changes

In fig. 14, when doing weight transfusion, we notice that the largest increase in convergence speed happens when we go from initializing the whole network with random weights to initializing the first convolutional layer with pretrained weights and the rest of the network randomly. Because this jump comes from the first layer alone, we again see that feature reuse happens mostly in the lower layers.


Hybrid approach to Transfer Learning

Figure 15: The slimmed model has similar convergence speed and AUC compared to Transfer Learning. Moreover, it has a much higher convergence speed than random initialization

In this experiment the authors slim the big ResNet50 model. By slimming we mean using pretrained weights up to Block2 of the network, randomly initializing the rest of the model, and halving the number of channels in the rest of the architecture. In fig. 15 we can easily see that the slimmed network performs very similarly to Transfer Learning in terms of convergence speed and AUC. Moreover, the slim network converges much faster than the randomly initialized network.


Conclusion

This paper has analyzed the behavior of Transfer Learning on Medical Imaging datasets. We have concluded that Transfer Learning and Random Initialization perform comparably, and that smaller models perform similarly to large models. Additionally, we saw that Transfer Learning and Random Initialization learn different representations, that large models move less through training in the lower layers (they are overparameterized for the medical datasets), and that feature reuse mostly happens in these layers. Moreover, we noticed that weight scaling alone increases convergence speed compared to Random Initialization and that the lower layers have the biggest impact on convergence speed, and we explored a hybrid approach to Transfer Learning.

In the future we could, for example, try domain adaptation within medical datasets, come up with a precise mathematical definition of weight scaling, and see whether weight scaling also works within the same domain.


References

[0] Raghu, M.; Zhang, C.; Kleinberg, J. M. & Bengio, S., Transfusion: Understanding Transfer Learning with Applications to Medical Imaging, CoRR, 2019

[1] http://cs231n.github.io/transfer-learning/

[2] Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M. & Summers, R. M., ChestX-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases, 2017, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017 , 3462-3471

[3] Gulshan, V.; Peng, L.; Coram, M.; Stumpe, M. C.; Wu, D.; Narayanaswamy, A.; Venugopalan, S.; Widner, K.; Madams, T.; Cuadros, J.; Kim, R.; Raman, R.; Nelson, P. C.; Mega, J. L. & Webster, D. R., Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs, JAMA, 2016 , 316 , 2402-2410

[4] Ding, Y.; Sohn, J.; Kawczynski, M.; Trivedi, H.; Harnish, R.; Jenkins, N.; Lituiev, D.; Copeland, T.; Aboian, M.; Aparici, C.; Behr, S.; Flavell, R.; Huang, S.-y.; Zalocusky, K.; Nardo, L.; Seo, Y.; Hawkins, R.; hernandez pampaloni , M.; Hadley, D. & Franc, B., A Deep Learning Model to Predict a Diagnosis of Alzheimer Disease by Using 18 F-FDG PET of the Brain, Radiology, 2018

[5] Chen, C.; Ren, Y. & Kuo, C.-C. J., Big Visual Data Analysis, Springer Singapore, 2016

[6] Ngiam, J.; Peng, D.; Vasudevan, V.; Kornblith, S.; Le, Q. V. & Pang, R., Domain Adaptive Transfer Learning with Specialist Models, ArXiv, 2018

[7] Kornblith, S.; Shlens, J. & Le, Q. V., Do better ImageNet models transfer better?, 2019

[8] Raghu, M.; Gilmer, J.; Yosinski, J. & Sohl-Dickstein, J., SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability, 2017

[9] Morcos, A. S.; Raghu, M. & Bengio, S., Insights on representational similarity in neural networks with canonical correlation, 2018

[10] Irvin, J.; Rajpurkar, P.; Ko, M.; Yu, Y.; Ciurea-Ilcus, S.; Chute, C.; Marklund, H.; Haghgoo, B.; Ball, R. L.; Shpanskaya, K. S.; Seekins, J.; Mong, D. A.; Halabi, S. S.; Sandberg, J. K.; Jones, R.; Larson, D. B.; Langlotz, C. P.; Patel, B. N.; Lungren, M. P. & Ng, A. Y., CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison, CoRR, 2019

[11] He, K.; Zhang, X.; Ren, S. & Sun, J., Deep Residual Learning for Image Recognition, CoRR, 2015

[12] Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J. & Wojna, Z., Rethinking the Inception Architecture for Computer Vision, CoRR, 2015
