Blog post written by: Ali Wali Khan
Based on: Usman Akbar, M., Larsson, M., Blystad, I. et al. Brain tumor segmentation using synthetic MR images - A comparison of GANs and diffusion models. Sci Data 11, 259 (2024). https://doi.org/10.1038/s41597-024-03073-x


1. Introduction

1.1. Medical Image Segmentation & Artificial Intelligence

Medical imaging plays a vital role in diagnosing and treating numerous diseases, enabling healthcare professionals to visualize and understand the internal structures and functions of the human body. Among the many processes involved, medical image segmentation is crucial: it partitions images into distinct regions to isolate and analyze specific structures, such as organs or tumors.

With advancements in artificial intelligence (AI), this field has seen significant improvements in accuracy, efficiency, and cost-effectiveness. AI techniques, including machine learning and deep learning, facilitate early disease detection, enhance diagnostic accuracy, and significantly reduce the time required for labor-intensive processes. For instance, in radiotherapy treatment planning, AI-driven segmentation networks can decrease the time needed to segment tumors and organs from hours to mere minutes, streamlining the workflow and allowing healthcare professionals to focus more on patient care. 

1.2. Challenges with Data Availability

However, training deep learning models, such as Convolutional Neural Networks (CNNs) and Vision Transformers, for classification or segmentation typically requires large annotated datasets. Unlike the extensive and openly available ImageNet database used in computer vision, medical imaging datasets are much smaller and more challenging to access. Ethical concerns, anonymization requirements, and stringent data protection regulations such as the General Data Protection Regulation (GDPR) complicate the sharing and availability of medical images. Although several openly available medical imaging datasets exist, they are limited in size and often represent selective populations, focusing on healthy controls rather than diseased individuals. This limitation restricts the applicability of AI models trained on such data in clinical settings.

1.3. The Solution: Using Generative AI to Make Synthetic Images 

A potential solution to facilitate the sharing of medical images is to generate and share synthetic images or synthetic patients, as GDPR should not apply to images that do not belong to specific individuals. Generative models, such as Generative Adversarial Networks (GANs) and Diffusion Models, have recently shown promise in producing highly realistic synthetic images by learning the high-dimensional distribution of training images.
 
In this blog post, we will evaluate how effective synthetic images are compared to real images. We will examine their accuracy in diagnostic applications, their utility in training machine learning models, and their potential impact on research and clinical practices. By comparing synthetic and real images, we aim to understand the strengths and limitations of using generative models in the medical imaging field.

2. Technical Background

Before diving into the paper's findings, we will take a quick look at how GANs and Diffusion Models work. Understanding these models will provide a foundation for appreciating their capabilities and the implications of their use in generating synthetic images.

2.1. Generative Adversarial Networks (GANs)

Components:

  • Generator (G): Creates synthetic data that resembles the real data.
  • Discriminator (D): Evaluates the authenticity of the data, distinguishing between real and synthetic data.

Adversarial Process:

  • The generator and discriminator are pitted against each other in a zero-sum game.
  • The generator aims to produce realistic data to fool the discriminator.
  • The discriminator aims to accurately distinguish between real and synthetic data.

In summary, GANs involve a generator and a discriminator competing against each other to produce and distinguish synthetic data.
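
To make the adversarial process concrete, here is a minimal PyTorch-style sketch of one GAN training step. The Generator and Discriminator below are tiny placeholder networks for illustration, not the Progressive GAN or StyleGAN architectures used in the paper; only the competitive training logic is the point.

```python
import torch
import torch.nn as nn

# Placeholder networks on flattened 64x64 images; the paper's GANs are far larger.
class Generator(nn.Module):
    def __init__(self, latent_dim=128, img_dim=64 * 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                 nn.Linear(512, img_dim), nn.Tanh())
    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, img_dim=64 * 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(img_dim, 512), nn.LeakyReLU(0.2),
                                 nn.Linear(512, 1))
    def forward(self, x):
        return self.net(x)

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_images):
    batch = real_images.size(0)
    z = torch.randn(batch, 128)
    fake_images = G(z)

    # Discriminator: label real images as 1 and synthetic images as 0.
    d_loss = bce(D(real_images), torch.ones(batch, 1)) + \
             bce(D(fake_images.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: try to make the discriminator label synthetic images as real.
    g_loss = bce(D(fake_images), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```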

2.2. Diffusion Models

Diffusion Process:

    • Forward Process: Gradually adds noise to the data, effectively diffusing it into a simple distribution (usually Gaussian noise).
    • Reverse Process: Trains a model to progressively denoise the data, reconstructing the original data from the noisy version.

Diffusion models typically aim to make the predicted noise as close as possible to the actual noise added during the forward process.

In summary, Diffusion Models use a noise addition and removal process, training a network to reverse the diffusion of noise and reconstruct the original data.
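
Below is a minimal sketch, in the same PyTorch style, of the DDPM-style training objective: noise clean images in the forward process, then train a network to predict the noise that was added. The denoiser here is a placeholder (a real implementation would use a time-conditioned U-Net), and the noise schedule values are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

T = 1000                                         # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, 0)   # cumulative product, \bar{alpha}_t

def forward_diffuse(x0, t):
    """Forward process: add Gaussian noise to clean, flattened images x0 at step t."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * noise
    return x_t, noise

# Placeholder denoiser; in practice this is a U-Net conditioned on the timestep.
denoiser = nn.Sequential(nn.Linear(64 * 64 + 1, 512), nn.SiLU(),
                         nn.Linear(512, 64 * 64))

def training_loss(x0):
    """Train the model to predict the noise added in the forward process."""
    t = torch.randint(0, T, (x0.size(0),))
    x_t, noise = forward_diffuse(x0, t)
    t_embed = (t.float() / T).view(-1, 1)        # crude timestep embedding
    predicted_noise = denoiser(torch.cat([x_t, t_embed], dim=1))
    return nn.functional.mse_loss(predicted_noise, noise)
```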

3. Experimental Setup

Before we dive into the results, it is imperative to understand the experimental setup that was used. This setup includes the relevant parameters of the experiment that were employed to assess the effectiveness of synthetic images. By clearly outlining our methodology, we ensure that the findings are transparent, reproducible, and accurately interpreted in the context of the experiments conducted.

The entire experimental setup is summarized in the diagram above. The flowchart represents the overall experimental process, which can be explained as follows:

Using a dataset of real medical images, we train a synthetic image generation model to create synthetic medical images. These synthetic images are then used, either independently or alongside the real medical images, to train an image segmentation model. This approach allows us to evaluate the effectiveness of synthetic images in medical image segmentation.

To ensure the robustness of this process, we experiment with different combinations of the components highlighted in the diagram. By varying these parameters and running numerous experiments, we can thoroughly assess the performance and reliability of synthetic images compared to real ones. This comprehensive experimental setup clearly explains how synthetic images can be effectively utilized in medical imaging applications.

3.1. Experimental Parameters

For each of the highlighted components in the diagram, these are the different values they could take:

  1. Real Image Dataset:
    1. BraTS 2020 
    2. BraTS 2021

  2. Synthetic Image Generation Model:
    1. Progressive GAN
    2. StyleGAN 1
    3. StyleGAN 2
    4. StyleGAN 3
    5. Diffusion Model

  3. Image Segmentation Model
    1. U-Net
    2. Swin-Transformer

  4. Synthetic or Synthetic + Real Data:
    1. Synthetic Images in isolation
    2. Synthetic + Real Images combined

  5. Augmentation or No Augmentation:
    1. with Data Augmentation 
    2. without Data Augmentation

Thus, for instance, one experiment could look like the following: using the BraTS 2020 dataset, a Progressive GAN is trained to generate synthetic images; these synthetic images are then used in isolation (without real images), with data augmentation, to train a U-Net-based image segmentation model.

In addition, baseline performances using only real data were recorded for the final comparison.
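
A small sketch of how this experimental grid could be enumerated in Python is given below; the string labels are just the names used in this post, not identifiers from the study's code.

```python
from itertools import product

datasets = ["BraTS 2020", "BraTS 2021"]
generators = ["Progressive GAN", "StyleGAN 1", "StyleGAN 2", "StyleGAN 3", "Diffusion Model"]
segmenters = ["U-Net", "Swin Transformer"]
training_data = ["synthetic only", "synthetic + real"]
augmentation = [True, False]

# Every combination of the five components described above.
experiments = list(product(datasets, generators, segmenters, training_data, augmentation))
print(len(experiments))  # 2 * 5 * 2 * 2 * 2 = 80 combinations

# Real-data-only baselines, one per dataset / segmenter / augmentation setting.
baselines = list(product(datasets, segmenters, augmentation))
print(len(baselines))    # 8 baseline runs
```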

3.2. Evaluation Metrics

The evaluation of the results of each experiment falls into two general categories (a short code sketch of the four metrics described below is given at the end of this subsection):

  1. Generation Evaluation:  Evaluation of the quality of the images produced by the Synthetic Image Generation Models.
    The metrics used to evaluate this are:

    1. Frechet Inception Distance (FID): 
      It measures the similarity between real and generated images by comparing their feature distributions using a pre-trained Inception network. It calculates the Frechet distance between the means and covariances of the feature sets. Lower FID scores indicate higher similarity and better image quality. 

      \begin{equation} \text{FID} = \| \mu_r - \mu_g \|^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}) \end{equation}

      where $\mu_r$ and $\Sigma_r$ are the mean and covariance of the real image features, and $\mu_g$ and $\Sigma_g$ are the mean and covariance of the generated image features.

    2. Inception Score (IS):
      Inception Score (IS) evaluates the quality and diversity of generated images. It uses a pre-trained Inception network to assess how well images can be classified and how diverse the classifications are. Higher IS values indicate clearer, more distinct, and diverse generated images.

      \begin{equation} \text{IS} = \exp \left( \mathbb{E}_x \left[ D_{KL}(p(y|x) \| p(y)) \right] \right) \end{equation}

      where $p(y|x)$ is the conditional label distribution for image $x$, and $p(y)$ is the marginal label distribution over all generated images.

  2. Segmentation Evaluation:
    Evaluation of the segmentation quality of the Image Segmentation Models. 
    The metrics used to evaluate this are: 

    1. Dice Score:
      The Dice Score, or Dice Coefficient, measures the overlap between predicted and ground truth segmentations. It is calculated as twice the intersection size divided by the sum of the sizes of the prediction and ground truth sets. A score of 1 indicates perfect overlap, reflecting high segmentation accuracy.

      \begin{equation} \text{Dice Score} = \frac{2 |A \cap B|}{|A| + |B|} \end{equation}

      where $A$ is the set of predicted pixels, $B$ is the set of ground truth pixels, and $|A \cap B|$ is the intersection of the two sets.

    2. Hausdorff Distance:
      Hausdorff Distance evaluates the similarity between two sets of points, often used in segmentation tasks. It measures the maximum distance from any point in one set to the nearest point in the other set. Lower values indicate closer sets, implying better segmentation accuracy.

      \begin{equation} H(A, B) = \max \left( \sup_{a \in A} \inf_{b \in B} d(a, b), \sup_{b \in B} \inf_{a \in A} d(a, b) \right) \end{equation}

      where $d(a, b)$ is the distance between points $a$ and $b$, $A$ and $B$ are the two point sets, and $\sup$ and $\inf$ denote the supremum and infimum.
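
The sketch below illustrates, with NumPy and SciPy, how these four metrics can be computed once the required inputs are available. The Inception (or other) feature extraction and class-probability steps are assumed to have happened upstream, and the helper names are my own, not functions from the paper's code.

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.spatial.distance import directed_hausdorff

def fid(mu_r, sigma_r, mu_g, sigma_g):
    """Frechet Inception Distance from feature means/covariances
    (features are assumed to come from a pretrained Inception network)."""
    covmean = sqrtm(sigma_r @ sigma_g).real  # drop tiny imaginary numerical error
    return np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2 * covmean)

def inception_score(p_yx, eps=1e-12):
    """IS from per-image class probabilities p(y|x), shape (N, num_classes)."""
    p_y = p_yx.mean(axis=0, keepdims=True)                       # marginal p(y)
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

def dice_score(pred, target):
    """Dice coefficient between two binary segmentation masks."""
    intersection = np.logical_and(pred, target).sum()
    return 2.0 * intersection / (pred.sum() + target.sum())

def hausdorff_distance(points_a, points_b):
    """Symmetric Hausdorff distance between two point sets (e.g., mask contours)."""
    return max(directed_hausdorff(points_a, points_b)[0],
               directed_hausdorff(points_b, points_a)[0])
```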

4. Key Results & Insights

4.1. Summary of Experiments & Model Performance Overview

Given the extensive number of experiments and evaluation criteria, presenting every individual score in a table would be impractical. Instead, it is more effective to rank the performance of each image generation model within two key areas: generation performance and segmentation performance. In the tables below, the value in each cell is the relative rank of the image generation model compared to the others, with 1 being the best and 5 the worst; "Aug" denotes experiments run with data augmentation.

Generation Performance:

Metric   Metric Type   Year   Diffusion   StyleGAN 1   StyleGAN 2   StyleGAN 3   Progressive GAN
FID      Generation    2020   1           3            5            4            2
IS       Generation    2020   3           5            4            1            2
FID      Generation    2021   4           3            1            5            2
IS       Generation    2021   3           5            1            2            4

Segmentation Performance:

Metric          Metric Type    Year   Architecture       Diffusion   StyleGAN 1   StyleGAN 2   StyleGAN 3   Progressive GAN
Dice            Segmentation   2020   U-Net              1           5            3            2            4
Dice            Segmentation   2020   Swin Transformer   1           5            2            3            4
Dice Aug        Segmentation   2020   U-Net              1           5            2            3            4
Dice Aug        Segmentation   2020   Swin Transformer   1           5            2            4            3
Hausdorff       Segmentation   2020   U-Net              1           5            4            2            3
Hausdorff       Segmentation   2020   Swin Transformer   1           5            4            2            3
Hausdorff Aug   Segmentation   2020   U-Net              1           5            3            4            2
Hausdorff Aug   Segmentation   2020   Swin Transformer   1           5            2            3            4
Dice            Segmentation   2021   U-Net              1           4            3            2            5
Dice            Segmentation   2021   Swin Transformer   1           5            2            3            4
Dice Aug        Segmentation   2021   U-Net              1           4            2            3            5
Dice Aug        Segmentation   2021   Swin Transformer   1           5            2            3            4
Hausdorff       Segmentation   2021   U-Net              1           4            5            2            3
Hausdorff       Segmentation   2021   Swin Transformer   1           4            3            2            5
Hausdorff Aug   Segmentation   2021   U-Net              1           4            5            3            2
Hausdorff Aug   Segmentation   2021   Swin Transformer   1           5            2            3            4


Some key insights that could be derived from the above tables are as follows:

  • The Diffusion Model performs best for segmentation on average compared to the four GAN models: it consistently ranks first across all segmentation tasks that used its generated data.
  • U-Net and Swin Transformer show similar ranking trends, with U-Net generally being more consistent.
  • High generation performance does not correlate well with high segmentation performance.

4.2.  Segmentation Performance with Synthetic Images & The "Memorization" Problem in Diffusion Models

The table in Section 4.1 provides valuable insights, but it leaves a crucial question unanswered: how do synthetic images stack up against real ones? To address this, we compared each experiment listed in the table to the baseline results discussed at the end of Section 3.1.

We used a straightforward yet effective metric to evaluate this comparison: the relative Dice score. This score is calculated using the following equation:

\begin{equation} \text{Relative Dice Score} = \frac{\text{Avg}(\text{Dice}_{\text{Synthetic}})}{\text{Avg}(\text{Dice}_{\text{Real}})} \quad \text{or} \quad \frac{\text{Avg}(\text{Dice}_{\text{Synthetic + Real}})}{\text{Avg}(\text{Dice}_{\text{Real}})} \end{equation}
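
As a worked example of this ratio, here is a tiny snippet with hypothetical per-subject Dice values; the numbers are illustrative only, not results from the paper.

```python
import numpy as np

def relative_dice(dice_synthetic, dice_real):
    """Ratio of the average Dice obtained with synthetic (or synthetic + real)
    training data to the average Dice obtained with real data only."""
    return np.mean(dice_synthetic) / np.mean(dice_real)

# Hypothetical per-subject Dice scores, for illustration only.
print(relative_dice([0.84, 0.81, 0.86], [0.87, 0.85, 0.88]))  # ~0.965
```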

A snapshot of the results obtained with U-Net is as follows:

Looking at the table, it is evident that training segmentation networks exclusively with synthetic images can achieve remarkable performance, with relative Dice scores reaching up to 93% for StyleGAN 2 and 100% for the Diffusion model, i.e., matching the results obtained with real images. While these scores might suggest that Diffusion Models solve the problem outright, their high performance can largely be attributed to "memorization": the synthetic images they produce closely resemble the real images whose distribution they were trained on.

From a GDPR (General Data Protection Regulation) perspective, this memorization raises concerns about synthetic images that closely mimic real data. Even though such images are not derived directly from personal data, their high fidelity to real images could lead to privacy risks in contexts where re-identification or privacy breaches are possible. The problem of limited data availability would then remain unsolved, as sharing these images could itself become subject to GDPR.

4.3.  Impact of Real-to-Synthetic Data Proportion on Segmentation

Another side experiment involved examining the impact of mixing real and synthetic data on segmentation performance. In this experiment, different proportions of real and synthetic data were combined to create various datasets. The objective was to observe how these varying ratios influenced the accuracy and effectiveness of the segmentation process.

The experiment showed that adding a few real images to the dataset significantly improves segmentation performance, as these images provide crucial, authentic features for the model. However, increasing the number of real images beyond a certain point results in diminishing returns, where further additions do not substantially enhance performance. This indicates an optimal balance between real and synthetic data for the best segmentation results.
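
A sketch of how such mixed training sets could be assembled is shown below; the image arrays, set sizes, and real-data fractions are placeholders for illustration, not the exact splits used in the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder image stacks (N, H, W); in practice these are MR slices.
real = rng.random((1000, 64, 64))
synthetic = rng.random((1000, 64, 64))

def mixed_training_set(real_images, synthetic_images, real_fraction, total_size):
    """Sample a training set with the requested fraction of real images."""
    n_real = int(round(real_fraction * total_size))
    n_synth = total_size - n_real
    real_idx = rng.choice(len(real_images), size=n_real, replace=False)
    synth_idx = rng.choice(len(synthetic_images), size=n_synth, replace=False)
    return np.concatenate([real_images[real_idx], synthetic_images[synth_idx]])

# Sweep a few real-data fractions (illustrative values), then train the
# segmentation network on each mixed set and record its Dice score.
for frac in [0.0, 0.05, 0.1, 0.25, 0.5, 1.0]:
    train = mixed_training_set(real, synthetic, frac, total_size=500)
    print(frac, train.shape)
```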

5. Recommendations

5.1.  Developing New Metrics for Generation Performance Evaluation 

In Section 4.1, it is noted that high generation performance does not correlate well with high segmentation performance. This discrepancy is likely due to the limitations of the metrics used to evaluate generation performance, namely the Frechet Inception Distance (FID) and Inception Score (IS), when applied to medical images. The main reasons for this are outlined below:

  1. Training Data Bias:

    • ImageNet Pretraining: Both FID and IS rely on models pretrained on ImageNet, a dataset that primarily contains a diverse array of natural images but very few, if any, medical images. Consequently, the features extracted by these models are less relevant for medical images, which have distinct visual characteristics compared to natural images.

  2. Relevance of Features:

    • Feature Appropriateness: The features learned by ImageNet-trained models may not capture the fine-grained details and specific structures present in medical images. Medical imaging often necessitates the detection of subtle anomalies or specific anatomical structures that ImageNet-trained models are not designed to identify.

  3. Domain-Specific Details:

    • Unique Characteristics: Medical images, such as MRI, CT scans, or X-rays, have unique characteristics like varying intensity levels, noise patterns, and specific textural details crucial for medical diagnosis. FID and IS metrics may not effectively capture these nuances since they are optimized for the patterns seen in natural images.

To address these limitations, specialized metrics like Rad-FID and Rad-IS are proposed, which are better suited to medical imaging for the following reasons (a code sketch of the Rad-FID idea follows this list):

  1. Rad-FID (RadImageNet-based FID):

    • Relevance to Medical Domain: By using RadImageNet, which is specifically tailored to medical images, Rad-FID ensures that the distance computation between real and synthetic images is based on feature representations that are more relevant to the medical domain.
    • Improved Feature Extraction: The feature extractor trained on RadImageNet captures more pertinent details, leading to a more accurate assessment of the quality of synthetic medical images.

  2. Rad-IS (RadImageNet-based Inception Score):

    • Domain-Specific Class Distribution: Rad-IS calculates the score based on the entropy of predicted class distributions using RadImageNet classes. This means it evaluates how well the synthetic images match the specific categories relevant to medical imaging, which is crucial for applications in this field.
    • Meaningful Diversity and Quality: Rad-IS ensures that the generated images not only are diverse but also align with medically meaningful categories, thereby providing a more accurate and relevant measure of image quality.
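
The core idea of Rad-FID is simply to swap the ImageNet-pretrained Inception feature extractor for one pretrained on RadImageNet. The sketch below assumes such a feature extractor is passed in as a callable; loading the actual RadImageNet weights is left out, and the function names are my own, not an existing API.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Frechet distance between two Gaussians fitted to feature embeddings."""
    covmean = sqrtm(sigma_r @ sigma_g).real
    return np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2 * covmean)

def feature_statistics(images, feature_extractor):
    """Mean and covariance of feature embeddings for a stack of images."""
    feats = np.stack([feature_extractor(img) for img in images])
    return feats.mean(axis=0), np.cov(feats, rowvar=False)

def rad_fid(real_images, synthetic_images, rad_feature_extractor):
    """Same computation as FID, but in a RadImageNet feature space.
    `rad_feature_extractor` is assumed to be a backbone pretrained on
    RadImageNet (hypothetical callable; adapt it to the published weights)."""
    mu_r, sigma_r = feature_statistics(real_images, rad_feature_extractor)
    mu_g, sigma_g = feature_statistics(synthetic_images, rad_feature_extractor)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)
```

Rad-IS would follow the same pattern: reuse the Inception Score formula from Section 3.2 but feed it class probabilities from a RadImageNet-trained classifier instead of an ImageNet one.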

 

6. ChatGPT Prompts

  • Convert the following text to be more appropriate as a blog post paragraph 
  • Write the following information in terms of a hierarchical list
  • Write the equations for FID, IS, Dice Score, and Hausdorff Distance in LaTeX. Output them as individual blocks of text I can copy (do not combine them into one big output).
  • Write what the variables in each of the equations for FID, IS, Dice Score, and Hausdorff Distance mean in LaTeX. Output them as individual blocks as well.
  • Check the following paragraph for mistakes relating to coherence, grammar, and structure. Make sure to correct them. 
  • Write the equation for relative Dice score in LaTeX given that it is defined as Dice_{synthetic+real} / Dice_{real}

7. References

[1] Litjens, G. et al. A survey on deep learning in medical image analysis. Medical Image Analysis 42, 60–88 (2017).

[2] Cai, L., Chen, Y., Cai, N., Cheng, W. & Wang, H. Utilizing Amari-Alpha divergence to stabilize the training of generative adversarial networks. Entropy 22, 410 (2020). https://doi.org/10.3390/e22040410

[3] http://braintumorsegmentation.org/

[4] Karras, T., Aila, T., Laine, S. & Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. ICLR (2018).

[5] Karras, T., Laine, S. & Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4401–4410 (2019).

[6] Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020).

[7] Mei, X. et al. RadImageNet: An open radiologic deep learning research dataset for effective transfer learning. Radiology: Artificial Intelligence 4, e210315 (2022). https://pubs.rsna.org/doi/10.1148/ryai.210315











