[Image: degraded picture of the clocktower at TUM]

Introduction

People without a background in IT and computer science are often confused about what is and is not possible to achieve with computers. This leads to many humorous misunderstandings, one of which is the film and television trope of "Zoom and Enhance" [1]. It refers to a common situation in film where a character, usually solving a crime, points at pixelated or degraded footage and orders their IT technician to "Enhance that!", and after some wild keyboard presses the footage suddenly appears clear and detailed.

This is, of course, an imaginary scenario. However, it highlights how useful a technology with this exact capability would be. We could not only improve video surveillance footage and satellite imagery, helping in areas like law enforcement, military planning, and environmental protection, but, in the context of this seminar, we could also improve notoriously low-resolution medical images, potentially leading to better diagnostics.

Fortunately, with the recent rise of machine-learning-based image manipulation techniques, the field of Image Super-Resolution, which aims to recover high-resolution (HR) images from low-resolution (LR) inputs, has picked up steam as well. In particular, the category of unsupervised generative models has recently made strides here, which is why we want to look at these techniques and their application in the medical field in particular.

Image Super-Resolution Problem

Mathematically, the problem of Image Super-Resolution starts with the assumption that we have some low-resolution image y which is the result of a degradation function \mathcal{D} with parameters \delta applied to a high-resolution image x:

y = \mathcal{D}(x; \delta)

The goal is then to model and solve the inverse of the degradation function to recover the original high-resolution image x:

\mathcal{D}^{-1}(y;\delta) = x

However, this is not a trivial task: not only can an HR image be degraded to many different LR images using different types of degradations, but many different HR images can also be degraded to the same LR image with the same degradation, making this a many-to-many problem [2].

Because of the properties of generative models and their training, we will focus on so-called non-blind Image Super-Resolution, a subcategory of the overall field where we are allowed to make an assumption about the type of degradation that occurred in the image. Namely, both methods we will look at assume bicubic downsampling as the degradation type for their experiments.

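To make the assumed degradation concrete, here is a minimal sketch of how such LR inputs are typically produced from HR images using bicubic downsampling (Pillow-based; the file name and scale factor are illustrative):

```python
from PIL import Image

def degrade_bicubic(hr_path: str, scale: int = 4) -> Image.Image:
    # The assumed degradation D(x; delta): bicubic downsampling,
    # where the parameter delta is the scale factor (4x in both papers).
    hr = Image.open(hr_path)
    lr_size = (hr.width // scale, hr.height // scale)
    return hr.resize(lr_size, resample=Image.BICUBIC)

# Illustrative usage: produce the LR input y from an HR image x.
lr = degrade_bicubic("clocktower_hr.png", scale=4)
```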

Evaluation Metrics

To have some notion of how well the proposed methods work, we require quantitative evaluation metrics. Two such metrics are commonly used in the field [3]:

The Peak Signal-to-Noise Ratio (PSNR), where I is the ground-truth HR image, \hat{I} the generated image, and MSE the mean squared error, represents the ratio of the maximum possible power of a signal to the noise that affects its quality:

PSNR = 10 \cdot \log_{10}\frac{\max(I)^2}{\mathrm{MSE}(I, \hat{I})}

And the Structural Similarity Index (SSIM), where \mu is the average of an image, \sigma^2 the variance, \sigma_{I\hat{I}} the covariance of the two images, and c_1, c_2 are stabilization variables, as the name suggests, represents the similarity of the structure between the actual image and the reconstructed image:

SSIM(I, \hat{I}) = \frac{(2\mu_{I}\mu_{\hat{I}}+c_1)(2\sigma_{I\hat{I}}+c_2)}{(\mu^2_I+\mu^2_{\hat{I}}+c_1)(\sigma^2_I+\sigma^2_{\hat{I}}+c_2)}
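To make the first metric concrete, here is a minimal NumPy sketch of the PSNR formula above (our own illustration); for SSIM, a ready-made implementation such as skimage.metrics.structural_similarity can be used instead of implementing the formula by hand:

```python
import numpy as np

def psnr(hr: np.ndarray, sr: np.ndarray) -> float:
    # Mean squared error between the ground truth I and the generated I_hat.
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images: infinite PSNR
    # PSNR = 10 * log10(max(I)^2 / MSE), exactly as defined above.
    return 10 * np.log10(float(hr.max()) ** 2 / mse)
```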


Generative Models

Generative models are a category of machine learning models designed to learn the underlying distribution and patterns of a training dataset and then use this knowledge to generate new samples that fit within this distribution, "creating" new data. The following papers used two types of generative modelling:

Generative Adversarial Networks

These types of models stage the training process as a game between a generator network and a discriminator network. The generator generates image samples and the discriminator tries to classify each sample as either coming from the true data distribution or having been generated. This feedback can then be used to further train the generator [4].
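As an illustration of this game, the following PyTorch sketch shows one adversarial training step with the standard objective from [4]. The generator, discriminator, their optimizers, and the batch of real samples are assumed to exist, and the discriminator is assumed to output one probability per sample:

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, real, z_dim=100):
    batch = real.size(0)
    ones = torch.ones(batch, 1, device=real.device)
    zeros = torch.zeros(batch, 1, device=real.device)

    z = torch.randn(batch, z_dim, device=real.device)
    fake = generator(z)

    # Discriminator: classify real samples as 1, generated samples as 0.
    d_opt.zero_grad()
    d_loss = (F.binary_cross_entropy(discriminator(real), ones)
              + F.binary_cross_entropy(discriminator(fake.detach()), zeros))
    d_loss.backward()
    d_opt.step()

    # Generator: use the discriminator's feedback to make fakes look real.
    g_opt.zero_grad()
    g_loss = F.binary_cross_entropy(discriminator(fake), ones)
    g_loss.backward()
    g_opt.step()
```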

Denoising Diffusion Models

Diffusion models are deep generative models that work by iteratively adding noise to the available training data (forward diffusion) and then learning to reverse this process (denoising/reverse diffusion) to recover the data [5]. The learning process optimizes the model parameters to minimize the difference between the generated images and the training images. Once trained, the model can be used to generate new images by starting from a random noise vector, or in our case a degraded initial image, and running the diffusion process in reverse. By iteratively applying the reverse transformations, the model gradually generates a more complex image.
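For intuition, here is a minimal sketch of the forward (noising) step in its closed form, as used to train DDPM-style models [5]; the noise schedule betas is an assumed input:

```python
import torch

def forward_diffusion(x0: torch.Tensor, t: int, betas: torch.Tensor):
    # Closed-form forward diffusion [5]: jump straight to noise level t
    # by mixing the clean image x0 with Gaussian noise.
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)
    noise = torch.randn_like(x0)
    x_t = torch.sqrt(alpha_bars[t]) * x0 + torch.sqrt(1.0 - alpha_bars[t]) * noise
    # During training, the model is optimized to predict `noise` from x_t.
    return x_t, noise
```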

Paper 1: A new generative adversarial network for medical images super resolution [8]

The goal of this paper is to improve on the Generative Adversarial Network for Image Super-Resolution (SRGAN) architecture proposed in [6], specifically for application in medical imaging. To achieve this, the authors implemented specific changes to both the generator and the discriminator portions of the network.

Generator Network

Image [4 SRGAN Generator] shows the original SRGAN generator based on ResNet blocks for feature extraction from LR images, using kernel size k, number of feature maps n, and stride s. After extraction, it uses a sub-pixel convolutional layer for 4x upscaling. Noted limitations of this architecture are single-scale feature extraction, which may cause either large- or small-scale features to be missed by the extractor, and single-step upscaling, which may lead to artifacts in the output image because the network may predict wrong information multiple times within a single step.

The paper implemented two major changes to this architecture:

  1. They split the feature extraction process into three steps
  2. They split the upsampling process into two steps

Part 1: Shallow Feature Extraction

Shallow, basic features at different scales of the image contain important information about the patient. For example, in retinal vessel images the vessel structures are important but might not be accurately captured by feature extraction at only a single scale. A similar concept applies to other medical image types, for example brain MRI, where the structure and especially the edges of a tumor are important information that should be preserved.

To extract shallow features, the authors used ResNet blocks with kernels of three different sizes, namely 3, 5, and 7. For each scale they use two such blocks and concatenate the three scales' features channel-wise into a single feature vector. Image [6 Shallow Feature Extraction] shows the architecture.
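To illustrate the idea (this is our own minimal sketch, not the authors' exact configuration), the three parallel branches with 3x3, 5x5, and 7x7 kernels could look as follows in PyTorch:

```python
import torch
import torch.nn as nn

class MultiScaleExtractor(nn.Module):
    # Parallel residual branches with 3x3, 5x5 and 7x7 kernels whose
    # outputs are concatenated channel-wise; layer sizes are illustrative.
    def __init__(self, channels: int = 64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, k, padding=k // 2),
                nn.PReLU(),
                nn.Conv2d(channels, channels, k, padding=k // 2),
            )
            for k in (3, 5, 7)
        ])

    def forward(self, x):
        # Each branch keeps the spatial size; residual connection per scale.
        feats = [x + branch(x) for branch in self.branches]
        return torch.cat(feats, dim=1)  # channel-wise concatenation
```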

Part 2: Deep Feature Extraction

For the deep feature extraction they use repeating ResNet blocks with size-3 convolutional kernels. In image [7 Deep Feature Extraction], the dotted line symbolizes the 16-fold repetition of the two previous blocks. It is important to note that they included a skip connection to preserve the shallow features extracted by the previous block. At the end of this feature extraction stage they also upscale the image 2x for the first time.

Part 3: Feature Extraction of the Upscaled Image

In the third feature extraction step, they compute features of the already 2x-upscaled image passed on by the previous block. For this they use another mini ResNet comprising three residual blocks and again include a skip connection to preserve previously extracted features. They then upscale the image by another 2x, resulting in a 4x-upscaled HR image.
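The two-step upscaling could be sketched as two 2x sub-pixel convolution stages instead of a single 4x step (again an illustrative simplification, not the paper's code):

```python
import torch.nn as nn

def up2x(channels: int) -> nn.Sequential:
    # One 2x sub-pixel convolution stage: conv -> PixelShuffle -> PReLU.
    return nn.Sequential(
        nn.Conv2d(channels, channels * 4, 3, padding=1),
        nn.PixelShuffle(2),  # rearranges channels into 2x spatial size
        nn.PReLU(),
    )

class TwoStepUpsampler(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.up1 = up2x(channels)  # first 2x step, after deep features
        self.up2 = up2x(channels)  # second 2x step, after the third extractor

    def forward(self, x):
        return self.up2(self.up1(x))  # 4x total upscaling
```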

Discriminator Network

As stated previously, the authors also made changes to the discriminator network of the SRGAN model, which can be seen in image [9 SRGAN Discriminator]. The original architecture contains eight convolutional layers with an increasing number of kernels, which you might recognize as the VGG [7] architecture. The feature maps generated by the convolutional layers are followed by two dense layers with Leaky ReLU activation functions and passed through a sigmoid function to turn the logits into a probability for sample classification.

The authors did not deviate much from this architecture; however, they implemented three changes:

  1. As they increased the size of the generator they also added more convolutional layers to the discriminator
  2. They added three additional convolutional layers with residual connections to be able to manipulate the number of channels in the feature maps
  3. They added skip connections to combat vanishing gradients

The resulting discriminator architecture can be viewed in image [10 Proposed Discriminator].
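To make changes 2 and 3 concrete, here is an illustrative sketch (our own, not the authors' code) of a residual discriminator block: a 1x1 convolution on the skip path allows the block to change the channel count while the skip connection keeps gradients flowing:

```python
import torch.nn as nn

class ResidualDiscBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2),
        )
        # 1x1 convolution so the skip path matches the new channel count.
        self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=stride)

    def forward(self, x):
        return self.body(x) + self.skip(x)  # residual/skip connection
```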

Experiments and Results

The authors tested their architecture on image datasets of four different medical image types. All datasets originally contain only HR images, which are bicubically downsampled 4x to produce the LR images for the experiments.

  1. Retinal image dataset:
    1. DRIVE (40x512x512)
    2. STARE (397x512x512)
  2. Skin cancer dataset:
    1. ISIC (540x512x512)
  3. Brain tumor dataset:
    1. BraTS contains 3D volumetric images, from which they extracted 430 2D slices (430x240x240)
  4. Cardiac ultrasound dataset:
    1. CAMUS (500x1024x512)

Results on retinal images:

In this experiment they compare the SR results of their adapted SRGAN network against those of the base SRGAN and basic bicubic upsampling. In images [11 retinal image results] and [12 retinal image results 2] we can see that basic bicubic upsampling produces blurry and blocky structures, SRGAN manages to remove the blur but adds some noise to the image, and the proposed method produces a rather smooth image very close to the ground truth.

They also clearly beat out the models they compared themselves to in the two quantitative metrics they used [13 retinal images quantitative results].

Results on skin cancer and cardiac ultrasound images:

Now, while the results on the retinal image dataset were quite positive, the results on the skin cancer and cardiac ultrasound images were a little less impressive. In the images they provided, [14 skin cancer results] and [15 cardiac ultrasound results], they claim to see an improvement in the preservation of colors and in the generated smoothness; however, I personally have a hard time seeing what they mean.

Quantitatively, they do still beat both other methods (left: results on the skin cancer dataset; right: results on the cardiac ultrasound dataset).

Results on brain MRI images:

Fortunately, their model's most impressive results were on the last dataset, containing brain MRI images. Image [16 results on brain MRI images] shows that their method seems to do a better job of rendering the brain's intricate structure clearly and accurately.

Here, for the first and only time, they also evaluate their results against more contemporary models for MRI Super-Resolution than just the two baseline comparisons, and theirs is the clear favourite [17 quantitative results MRI].

Personal note on the paper

I thought the authors did a great job of clearly explaining what changes they implemented to the SRGAN architecture and why, and overall the paper is very readable. Testing the model on four publicly available datasets also makes it easy to compare results in the future. However, the experiments also showed that the model performed well on the retinal and MRI datasets but not so much on the other two. This makes me wonder whether it would have been a better idea to focus the adaptation of the model on a single medical imaging type to generate even more impressive results, instead of going for this general approach. I also think that evaluating the model only against SRGAN and bicubic interpolation for three of the four datasets was not sufficient. SRGAN is the model they base theirs on, so outperforming it should be considered the baseline. SRGAN was also already six years old at the time of writing, so newer iterations already existed that they could have compared theirs to.

Paper 2: Image Super-Resolution via Iterative Refinement [9]

In this paper the authors propose SR3 (Super-Resolution via Repeated Refinement), which tackles general image super-resolution through stochastic iterative denoising, also known as diffusion. They evaluate their model on datasets containing images of human faces and natural images, such as animals and plants.

Denoising Model

As mentioned in the introduction, denoising diffusion models take a training set of HR images, iteratively add noise to the images, and simultaneously train a model to approximate a mapping that can iteratively denoise the images again in a process called reverse diffusion. The denoising model the authors propose in this paper is based on the U-Net [10] architecture; image [18 U-Net denoising model] shows the model. It works as follows (a sketch of the resulting sampling loop follows the list):

  1. They upsample the LR input x to the target resolution using bicubic interpolation
  2. They concatenate x with y_t, where
    1. in the first iteration, y_T is simply noise sampled from a normal distribution
    2. after the first denoising iteration, y_t is the output of the previous denoising step
  3. During each iteration, the model uses convolutional blocks to encode and recover the image, extracting features and reducing noise while also leveraging skip connections to preserve higher-level features
  4. Steps 2 and 3 are repeated for T iterations, and the last iteration's output y_0 is the iteratively denoised HR output image
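Putting the steps together, here is a minimal sketch of the sampling loop (our own simplification using a DDPM-style update [5]; model is assumed to predict the noise in y_t given the channel-wise concatenated input and the timestep):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sr3_sample(model, betas: torch.Tensor, lr_img: torch.Tensor, scale: int = 4):
    # Step 1: bicubically upsample the LR input x to the target resolution.
    x = F.interpolate(lr_img, scale_factor=scale, mode="bicubic")
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    y = torch.randn_like(x)  # y_T: pure Gaussian noise
    for t in reversed(range(len(betas))):
        # Steps 2-3: concatenate the condition x with y_t and denoise once.
        eps = model(torch.cat([x, y], dim=1), t)
        y = (y - (1.0 - alphas[t]) / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:  # step 4: repeat, re-adding a little noise except at t = 0
            y = y + torch.sqrt(betas[t]) * torch.randn_like(y)
    return y  # y_0: the iteratively denoised HR output
```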

Experiments and Results

The authors tested their model on two datasets. Again, both originally contained only HR images, which were cropped, resized, and downsampled to varying sizes for the different experiments.

  1. Human face dataset:
    1. Training: Flickr-Faces-HQ (52,000x512x512)
    2. Testing: CelebA-HQ (30,000x1024x1024)
  2. Natural image dataset:
    1. Training: ImageNet1K (1,281,167 images of varying sizes)
    2. Testing: ImageNet1K dev split (150,000 images of varying sizes)

They compare their model to contemporary models: EnhanceNet [11], ESRGAN [12], FSRGAN [13], SRFlow [14] and PULSE [15]. It is worth noting that ESRGAN (Enhanced SRGAN) builds directly on SRGAN, the model the previous paper based its work on, and FSRGAN is the GAN variant of the face-specific FSRNet [13]. The authors also implemented an additional regression baseline model that shares SR3's architecture but does not use the iterative denoising process, instead upsampling the image in a single regression step. This is used specifically to show the advantage of the iterative refinement process.

Natural image dataset

In the first experiment they compare the results of their model against the regression baseline at a 4x SR task, going from a 64x64 LR input generated by bicubic downsampling to a 256x256 HR output. As you can see in [19 Natural SR3 vs Regression], even though the regression model's outputs are impressive in their own right, they are somewhat blurry and lack the fine-grained structure of SR3's outputs.

Next, they repeated the same experiment but compared the results to the previously mentioned contemporary models. Again, it seems fair to say that you can visually see an improvement in detail and smoothness in the images generated by SR3, especially in the face of the jaguar.

Human Face dataset

For the human face dataset they repeated almost the same experimental pattern; however, they increased the upsampling scale to 8x, going from 64x64 LR images to 512x512 HR outputs. Images [20 Face SR3 vs Regression] and [21 Face SR3 vs Regression 2] show that SR3 again clearly visually outperforms its regression baseline, creating much more detailed, real-looking faces.

The same holds true when comparing the results against those of the contemporary models: SR3 consistently generates detailed, clear, and realistic-looking images while the other models all struggle in some capacity.

For this experiment they also, for the first and only time, include the quantitative results [21 Metrics SR3 Faces], where it is interesting to note that while SR3 beats both PULSE and FSRGAN, it does not outperform its own regression baseline. The authors argue that conventional quantitative metrics like PSNR and SSIM do not correlate well with human perception when the input resolution is low and the magnification factor is large, because these metrics tend to penalize any synthetic detail that is not perfectly aligned with the target image (e.g. hair strands or leopard spots). These metrics therefore prefer mean-squared-error regression-based techniques, which are conservative about adding synthetic details. This is a fair criticism of these metrics when the goal is to produce realistic, good-looking HR images, which may require the addition of synthetic details. However, in the context of, for example, medical imaging, this may not apply, as there it is very important that we use stable algorithms that do not add any synthetic detail and only enhance what is actually visible in the image.

Unstable Image Generation

To further illustrate the point made in the last section, the authors present an experiment where they used SR3 to generate HR images from the same input image multiple times. It shows that, because of the stochastic noise added at the beginning of the iterative denoising procedure, SR3 is unstable at very high SR factors and adds a lot of detail to the images, resulting in diverse but still real-looking HR outputs. This experiment was performed on two natural images with a 16x SR factor, going from 16x16 to 256x256 [22 unstable image generation].

Conclusion

Image Super-Resolution using generative models is generating impressive results; there are, however, some notable issues in the field. As the second paper discussed, the existing quantitative evaluation metrics are not always suitable for evaluating the performance of these models. Conversely, not all models are suitable for all SR tasks: unstable models like SR3 might not be suitable for application in the medical field.

Another issue we did not thoroughly discuss in this blog post is that both generative models tackle the task of non-blind Single Image Super-Resolution, where we assume the type of degradation that caused the LR image. In cases where different degradations, or even multiple degradations, are present in a single LR image, these methods may not perform well at all. This also raises a question about the experiments performed in the papers: in both cases the authors used HR datasets and simply generated the LR samples themselves. It remains to be seen whether these experiments translate well into the real world, where LR images are not purposely generated.

Lastly, for the field of medical SR especially, it is curious that existing methods almost exclusively use raw image data to generate HR images. Medical imaging is a complex process that offers a lot of additional information beyond the raw image data that could be used to improve the images. It would be interesting to see a method that incorporates such additional information.

Sources:

[1] Zoom and Enhance (TV trope). Knowyourmeme.com (accessed 24/07/23)

[2] Image resource for the many-to-many problem. researchgate.net (accessed 24/07/23)

[3] Deep Learning for Image Super-Resolution: A Survey. Zhihao Wang, Jian Chen, Steven C. H. Hoi. IEEE Transactions on Pattern Analysis and Machine Intelligence 43.10 (2020): 3365-3387.

[4] Generative Adversarial Networks. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio. (2014)

[5] Denoising Diffusion Probabilistic Models. Jonathan Ho, Ajay Jain, Pieter Abbeel.

[6] Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, Wenzhe Shi.

[7] Very Deep Convolutional Networks for Large-Scale Image Recognition. Karen Simonyan, Andrew Zisserman.

[8] A New Generative Adversarial Network for Medical Images Super Resolution. Waqar Ahmad, Hazrat Ali, Zubair Shah, Shoaib Azmat.

[9] Image Super-Resolution via Iterative Refinement. Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, Mohammad Norouzi.

[10] U-Net: Convolutional Networks for Biomedical Image Segmentation. Olaf Ronneberger, Philipp Fischer, Thomas Brox.

[11] EnhanceNet: Single Image Super-Resolution Through Automated Texture Synthesis. Mehdi S. M. Sajjadi, Bernhard Schölkopf, Michael Hirsch.

[12] ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks. Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Chen Change Loy, Yu Qiao, Xiaoou Tang.

[13] FSRNet: End-to-End Learning Face Super-Resolution with Facial Priors. Yu Chen, Ying Tai, Xiaoming Liu, Chunhua Shen, Jian Yang.

[14] SRFlow: Learning the Super-Resolution Space with Normalizing Flow. Andreas Lugmayr, Martin Danelljan, Luc Van Gool, Radu Timofte.

[15] PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models. Sachit Menon, Alexandru Damian, Shijia Hu, Nikhil Ravi, Cynthia Rudin.

