Motivation

Over the last decades, huge datasets have been collected in the medical field, especially in fundus photography, which images the fundus: the part of the eyeball opposite the pupil. That is mainly because of diabetic retinopathy (DR), a very common pathology that targets the vessels of the back of the eye.
Many attempts to automate the screening process have been made, but a complete shift to automatic pathology screening never happened. As a matter of fact, while some pathologies are easily detected because of the abundance of examples in datasets, like DR and Age-related Macular Degeneration (AMD), others, like Asteroid Hyalosis and Telangiectasia, are often ignored by screening systems.



Obviously, ophthalmologists are not willing to replace their interpretations with automatic ones if other sight-threatening pathologies are ignored.

So we are looking for a multiple-pathology screening solution that addresses the data-scarcity problem at the same time. For that purpose, the authors of this paper propose a solution that uses the few-shot learning paradigm. The idea is to generalize predictions to a new category that was not seen during training, given only a few examples.



Outline

    1. Idea

    2. State-of-the-Art

    3. Proposed Framework

        1. Deep Learning for Frequent Condition Detection

        2. Probabilistic Model for Rare Condition Detection

        3. Inferring Predictions

    4. Experiments

        1. OPHDIAT Dataset

        2. Performance Assessment

        3. Parameter Selection

        4. Heatmap Generation

        5. Comparison with Other Frameworks

    5. Discussion




Idea

The genesis of the idea was when the authors tried to visualize the feature space of a CNN trained for DR detection using t-distributed stochastic neighbor embedding (t-SNE), a visualization tool.

They were trying to see what convolutional neural networks (CNNs), trained to detect DR, have learnt.

They observed that many conditions that are unrelated to DR, and not targeted by the CNN, were nevertheless clustered in feature space.

That is where the idea came up: train a deep learning classifier to detect frequent conditions, and from this deep learning model derive simple probabilistic models to detect rare conditions.




State-of-the-Art

There are several frameworks that deal with data scarcity.

Transfer learning, for example, consists of training a model on a large, possibly unrelated dataset (e.g. ImageNet) and then fine-tuning it to detect a target condition.

Multitask learning is another approach, which addresses multiple target conditions simultaneously rather than sequentially.

One-Shot or Few-Shot Learning is a framework where the classifier must generalize to a new category not seen in training, given only one or a few examples.

That can be done using Siamese networks, for example. A Siamese network is a neural network that accepts two images as input and decides whether or not these images belong to the same category, as in the sketch below.
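To make the Siamese idea concrete, here is a minimal Keras sketch (not the architecture used in the paper): a shared encoder embeds both images, and a sigmoid head on the absolute difference of the embeddings predicts whether the pair belongs to the same category. The encoder choice and the embedding size are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Hypothetical shared encoder; a real setup would reuse a CNN trained on the task.
def make_encoder(input_shape=(299, 299, 3), embedding_dim=128):
    base = tf.keras.applications.InceptionV3(include_top=False, pooling="avg",
                                             input_shape=input_shape, weights=None)
    return Model(base.input, layers.Dense(embedding_dim)(base.output))

encoder = make_encoder()

img_a = layers.Input(shape=(299, 299, 3))
img_b = layers.Input(shape=(299, 299, 3))
emb_a, emb_b = encoder(img_a), encoder(img_b)

# Absolute difference of the two embeddings -> probability of "same category".
diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([emb_a, emb_b])
same_prob = layers.Dense(1, activation="sigmoid")(diff)

siamese = Model([img_a, img_b], same_prob)
siamese.compile(optimizer="adam", loss="binary_crossentropy")
```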

Or it can also be achieved by designing a probabilistic model for the new category in the feature space derived from an initial training.

The proposed framework is based on this second alternative, with a CNN providing the initial training.




Proposed few-shot learning framework



The proposed framework can be divided into 3 main tasks.


First, a multitask detector (a CNN) is trained to detect the frequent conditions.

Then, a probabilistic detection model is derived for each rare condition.

After that, predictions for new images are inferred for both frequent and rare conditions.





Notations


Let's first define some notations that will be helpful throughout the post.


D  is the dataset of preprocessed images I∈D . The dataset is divided into 4 sets, 3 of which are mutually exclusive: the learning subset D_L, the validation subset D_V and the test subset D_T. The fourth part is the reference subset D_R.

  (c_n)_{n=1,…,N} are the N conditions. Further notations include:


  • the M\leq N most frequent conditions
  • the label y_{I,n} ∈ \{0,1\} indicating the presence or absence of condition c_n in image I
  • the frequency f_n of condition c_n in the dataset

Various spaces are then defined for each method. They are summarized in this graph, along with the operations of the learning and inference pipelines, which we will go through throughout the post.




Train multitask detector for frequent conditions

The multitask detector is a CNN defined as a multilabel classifier. The goal is to minimize the following cost function.


where:

  • M\leq N is the number of most frequent conditions,
  • x_{I,n} ∈ \mathbb{R} is the output of the model for image I and condition c_n,
  • σ converts that output into a probability p_{I,n}^M = σ(x_{I,n}) ∈ [0,1]. The sigmoid σ was chosen as activation function since patients can have multiple conditions simultaneously, so the labels are not mutually exclusive.
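The exact cost function is given in the paper; as a rough illustration only, here is a minimal sketch of a standard multilabel objective consistent with the notations above (one sigmoid output per condition, binary cross-entropy summed over the M frequent conditions). The paper's weighting scheme is not reproduced here.

```python
import tensorflow as tf

def multilabel_loss(y_true, x_logits):
    """Sketch of a multilabel objective: y_true holds the labels y_{I,n} in {0,1},
    x_logits the raw outputs x_{I,n}; sigma(x_{I,n}) = p_{I,n}^M.
    Not the paper's exact (weighted) cost function."""
    per_condition = tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true, logits=x_logits)
    return tf.reduce_mean(tf.reduce_sum(per_condition, axis=-1))
```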



Define Feature space

Since the CNN is trained to detect the M most frequent conditions, the penultimate layer of the CNN extracts all the features needed to detect them.

The output of that layer is then used to define the feature space in which the rare conditions will be detected.

Let's call this feature space S.

The number of neurons in this output is very high (e.g. 2049 for Inception-v3 or 1537 for Inception-v4), which would make later computations difficult. That is why the dimension needs to be reduced.
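As a minimal sketch of how such a feature extractor could be obtained in Keras (assuming the trained Inception-v3 detector is available; weights and data are placeholders here), the penultimate, globally pooled activations define S:

```python
import numpy as np
import tensorflow as tf

# Reuse the convolutional trunk of the frequent-condition CNN and read off the
# penultimate, globally pooled activations as feature space S. In Keras this output
# is 2048-dimensional for Inception-v3 (the paper reports 2049).
base = tf.keras.applications.InceptionV3(include_top=False, pooling="avg", weights=None)
images = np.random.rand(4, 299, 299, 3).astype("float32")  # placeholder preprocessed batch
features = base.predict(images)                             # shape: (4, 2048)
```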



Feature space dimension reduction


To reduce the dimension, the authors chose to follow van der Maaten and Hinton (2008)’s recommendation to adopt a two-step procedure:


  1. Principal Component Analysis (PCA)
  2. t-distributed stochastic neighbor embedding (t-SNE)


t-distributed stochastic neighbor embedding (t-SNE)


Let's first explain t-SNE.

It is a nonlinear, unsupervised technique for embedding high-dimensional data in a low-dimensional space.

In the example shown below, data in 2D are transformed into data in 1D.

Since it can take data in spaces with thousands of dimensions and return it in 2D or 3D, it is a good tool for visualization.

What is special about this technique is that it preserves neighborhood information: it maps similar input vectors to nearby output vectors and dissimilar input vectors to distant output vectors.

In the case of this paper, t-SNE generates a very good separation of the various conditions.

The problem is that, since the dimension of the feature space is way too high, the computations might be slow. That is why PCA is applied first.



*source: StatQuest               

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) also reduces the dimension of the feature space, although it preserves the structure of the data less faithfully than t-SNE.

Using it as a preparation step for t-SNE speeds up the computation of pairwise distances between the data points and suppresses some noise without severely distorting the interpoint distances.
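A scikit-learn sketch of the two-step reduction follows. The intermediate PCA dimension of 50 is an illustrative choice, not a value from the paper; the final dimension P'' = 2 matches the parameter selection described later.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

features = np.random.rand(1000, 2048)   # placeholder penultimate-layer features (space S)
reduced = PCA(n_components=50).fit_transform(features)                   # S -> S'
embedded = TSNE(n_components=2, perplexity=30.0).fit_transform(reduced)  # S' -> S'' (P'' = 2)
```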

        

Probability function estimation

Now that we have a feature space with a good separation between conditions, a probabilistic condition detection model is defined for the rare conditions.

For that, probability density functions f_n and \bar{f_n} for the presence and absence of each condition are defined and estimated using the Parzen-Rosenblatt method (1962).

The estimations are done on the reference subset (training images are discarded in case the CNN overfitted them).

The probability that an image contains a certain condition is then computed according to the following equation:
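The equation itself is given in the paper; here is only a hedged scikit-learn sketch of the idea: fit one kernel density estimate on the embedded reference samples that contain condition c_n and one on those that do not, then combine the two densities into a probability. The bandwidth and the exact combination rule (shown here without priors) are illustrative assumptions, not the paper's formula.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def rare_condition_probability(embedded_ref, labels_n, embedded_query, bandwidth=1.0):
    """Parzen-Rosenblatt densities for presence/absence of condition c_n, estimated on
    the reference subset and combined into a probability (illustrative, priors ignored)."""
    kde_present = KernelDensity(bandwidth=bandwidth).fit(embedded_ref[labels_n == 1])
    kde_absent = KernelDensity(bandwidth=bandwidth).fit(embedded_ref[labels_n == 0])
    f_n = np.exp(kde_present.score_samples(embedded_query))      # density for presence
    f_n_bar = np.exp(kde_absent.score_samples(embedded_query))   # density for absence
    return f_n / (f_n + f_n_bar + 1e-12)
```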


Infer predictions for new images

When dealing with new images, a new problem arises: the t-SNE projection is only defined implicitly and cannot be written in closed form.

This means that the projection from S′→S′′ cannot be computed for new samples.

To approximate the prediction, K-nearest neighbor (KNN) regression is used.


KNN regression

The prediction is computed as the weighted arithmetic mean of exact predictions (the outputs of the learning pipeline), as follows (a small code sketch follows the list):

  • (V_k)_{k=1…K} : the K nearest neighbors of I in S′
  • (⟨π_J,q_{J,n}⟩)_{J∈D_R} : the reference samples
  • \widehat{q}_{I,n}=\frac{1}{∑_{k=1}^K\frac{1}{‖\pi_I-\pi_{V_k}‖}}∑_{k=1}^K\frac{q_{V_k,n}}{‖\pi_I-\pi_{V_k}‖} : the approximate prediction
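A small numpy/scikit-learn sketch of this inverse-distance-weighted regression (K = 3 here, matching the value selected later for M_0; the array names are placeholders):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_predict(pi_ref, q_ref_n, pi_new, k=3):
    """pi_ref: (R, P') reference projections, q_ref_n: (R,) exact predictions q_{J,n},
    pi_new: (Q, P') projections of new images. Returns the approximate q_hat_{I,n}."""
    nn = NearestNeighbors(n_neighbors=k).fit(pi_ref)
    dist, idx = nn.kneighbors(pi_new)           # distances and indices of the K neighbors
    weights = 1.0 / (dist + 1e-12)              # inverse-distance weights
    return (weights * q_ref_n[idx]).sum(axis=1) / weights.sum(axis=1)
```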

This pipeline is differentiable, provided that the K nearest neighbors of I are considered constant.

It can thus be implemented as a differentiable processing graph G stacking these operations, which allows heatmap generation and fine-tuning of the CNN weights.


Summary

The probability p that a condition c_n is present in I can be computed using 2 equations, depending on the nature of the condition (frequent or rare):

Experiments in the OPHDIAT dataset


OPHDIAT is a telemedical network for DR screening in the Île-de-France region, composed of screening centers, hospitals, health-care centers and prisons.

Thousands of reports were collected from 2004 until 2017, resulting in huge amounts of data that made automation possible in this field.

Each screening exam is analyzed by certified ophthalmologists. The report contains information such as the grade of DR in each eye and the presence or suspicion of presence of other pathologies.


41 conditions were identified during this analysis; they serve as the ground-truth annotations in this study.

The problem with the annotations is that ophthalmologists might have missed some pathologies. Because of that, the "normal" images (images analyzed as free from pathologies) were reinspected to make sure that they are indeed "normal".



Image preprocessing

Before running the pipelines described above, the images were preprocessed: the size was normalized by picking a square of interest, which was then resized to 299 by 299 pixels.

The appearance was normalized by adjusting the illumination, as shown in the pictures.
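A hedged OpenCV sketch of this kind of preprocessing: the center-crop heuristic and the background-subtraction illumination normalization are common fundus-image recipes, not necessarily the exact ones used by the authors.

```python
import cv2

def preprocess(path):
    """Crop a square region of interest, resize to 299 x 299, and attenuate
    illumination variation by subtracting a heavily blurred copy of the image."""
    img = cv2.imread(path)
    h, w = img.shape[:2]
    side = min(h, w)                                   # square of interest
    y0, x0 = (h - side) // 2, (w - side) // 2
    img = cv2.resize(img[y0:y0 + side, x0:x0 + side], (299, 299))
    background = cv2.GaussianBlur(img, (0, 0), sigmaX=30)
    return cv2.addWeighted(img, 4, background, -4, 128)  # illumination-normalized image
```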

Dividing dataset

Dividing the dataset is, in the case of this paper, not an easy task. In fact, both kinds of conditions (frequent and rare) need to be represented in the subsets.

There are also some constraints, e.g. the two eyes of the same person need to be in the same subset.

To divide the dataset, a balanced portion of the dataset, B_M, had to be created. Balanced means that all frequent conditions are equally represented and all constraints are respected. This subset consists of all frequent conditions, some normal images and no rare conditions.

To make sure that validation and testing work properly, certain measures were taken (illustrated in the sketch after this list):

  • use the reference subset instead of the learning subset in case the model overfitted it,
  • perform a 10-fold cross-validation/-testing strategy to maximize the amount of data, which helps especially with rare conditions.
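A minimal scikit-learn sketch of the "both eyes in the same subset" constraint combined with a 10-fold scheme, using the patient identifier as group key. The paper's actual construction of the balanced subset B_M is more involved; this only illustrates the grouping idea with placeholder data.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

images = np.arange(20)                      # placeholder image indices
patient_ids = np.repeat(np.arange(10), 2)   # two eyes per patient

# GroupKFold guarantees that all images of a patient land in the same fold.
for train_idx, test_idx in GroupKFold(n_splits=10).split(images, groups=patient_ids):
    assert not set(patient_ids[train_idx]) & set(patient_ids[test_idx])
```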


Performance assessment

The AUC, or area under the ROC curve, was used to assess the performance of different parameter values and of the overall framework.



Parameter selection

To select the parameters used in the computations of the proposed framework, the authors first chose a value of M (the number of frequent conditions) arbitrarily: M_0 = 11 (such that f_n ≥ 1000). Then multiple values of M were investigated in steps of 6.

For each value of M, the other parameters were chosen so that the classification performance is maximized.

For example, to maximize the classification performance for M = M_0:

  • P'' (the dimension of the reduced feature space generated by t-SNE) was set to 2,
  • K (the number of neighbors used to approximate the \widehat{q}_{I,n} predictions) was set to 3.


Other dimension reduction parameters were set to commonly used values.



Another important parameter is the CNN architecture used for the frequent condition detector. For each value of M, a different  CNN architecture was selected to maximize the AUC on the validation subset.










Heatmap Generation

The goal of heatmap generation is to measure how much each pixel I_{xy} contributes to image I’s prediction.

The generation was done by differentiating the model predictions with respect to each input pixel.
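A hedged TensorFlow sketch of such gradient-based heatmaps, using a placeholder single-output CNN in place of the full differentiable graph G described above:

```python
import tensorflow as tf

# Placeholder single-output CNN standing in for the differentiable graph G.
model = tf.keras.applications.InceptionV3(weights=None, classes=1,
                                          classifier_activation="sigmoid")
image = tf.Variable(tf.random.uniform((1, 299, 299, 3)))   # placeholder preprocessed image

with tf.GradientTape() as tape:
    prediction = model(image)[:, 0]                  # p_{I,n} for the condition of interest
grad = tape.gradient(prediction, image)              # d prediction / d pixel
heatmap = tf.reduce_max(tf.abs(grad), axis=-1)[0]    # per-pixel contribution of I_{xy}
```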












Probability density functions


Using Inception-v3 and M = 17, these are the resulting probability density functions.

The red color indicates a large probability density for that condition in that area.

The blue color indicates a low probability density.

A widespread probability density function means that images with and without the condition could not be separated well, e.g. embolus, arteriosclerosis, papilledema.

A narrow function indicates a more precise location in feature space and is therefore better for detection, e.g. AMD, DR, glaucoma, cataract.

These are the 4 most frequent conditions (the 4 big pictures in the figure).

In the figure, the functions are grouped by similarity to emphasize the difficulty of the detection.





















Comparison of results with other Frameworks

The results of the current framework were compared with other popular machine learning frameworks in terms of AUC on the test subset. In the table, the best AUC for each condition is in bold.


To do the comparison, a fair comparison environment had to be set up:

  • use the same CNN architecture for all methods,
  • use the selected CNN as the basis network of the Siamese network,
  • for transfer learning, train on the 11 frequent conditions and then fine-tune for each rare condition,
  • for multi-task learning, detect all 41 conditions together.


We can see from the AUC values that the proposed framework outperforms the others for most of the conditions.

For some frequent conditions, transfer learning can compete with the proposed framework. For rare conditions, on the other hand, the knowledge collected on frequent conditions is not reused, so its AUC drops in comparison to the proposed framework.

The Siamese network also performs better for some rare conditions. Those are actually a few of the conditions that the proposed framework did not detect easily. This suggests that this approach is also interesting to explore in further studies, although the proposed framework had the highest AUC for all other conditions.









Discussion

The authors mention some of the advantages and disadvantages of the framework. In fact, the computation time is a major advantage, as it is comparable to that of standard CNNs, which is not common for more complicated procedures. Many benefits were also gained by using t-SNE as the visualization technique. Not only did it boost the performance of the framework, but it also opened the possibility of using 2D images of the feature space for decision support. Furthermore, the differentiability of the reduction pipeline offers the opportunity for weight tuning and heatmap generation, which is a very important strong point of the procedure.

Of course, the results of the experiments are the main asset of this framework, as it clearly outperformed other frameworks pursuing the same goal in the same environment.


The first limitation mentioned by the authors is the setback in the results. As a matter of fact, 8 of the 41 conditions were particularly difficult to detect automatically. Among these conditions, one is particularly difficult to detect even manually by ophthalmologists and three are associated with poor reproducibility. However, the other four are considered easy to detect. This may be due to inadequate image preprocessing or too large a difference from the frequent conditions.

Another important limitation is the fact that each image was interpreted by a single human reader, so the quality of the performance assessment could have been improved.

Another limitation, mentioned briefly, is that alternatives to building blocks like PCA or KNN were not explored.


In my opinion, this paper proposes an interesting procedure that is well ahead of the current state of the art, as in most cases approaches focus on only one or two very frequent conditions, like DR or AMD, and rare conditions are not handled at all.
Another strong point is the visuals and graphs created by the tools used. They allow the reader to get a better grasp of some very theoretical notions and hence remarkably increase the clarity of the paper.
The authors also performed a thorough self-critique and clearly explained the lacking parts and possible future research in the same field, which is another good aspect of the paper.

On the other hand, some limitations are worth mentioning. When the authors stumble upon incorrect results, they briefly mention possible solutions but do not investigate them further.
For example, they suggest that some conditions were not detected easily because of inadequate image preprocessing, but they do not mention further investigation in that area, which could have solved more issues.
Likewise, some decisions are not explained well in the paper. For instance, choosing the value of M and investigating values in steps of 6 conditions was an important decision that may seem unfounded to the reader.
There is also no information about the training time of the CNN, which can be considered a minor limitation.

The lack-of-alternatives limitation mentioned by the authors is, in my opinion, also a very important disadvantage. In fact, the results could have been significantly improved had they tried other, more advanced methods for dimension reduction and other building blocks.
