In image processing, super-resolution means upsampling an image, which requires interpolating between its pixels. It can be interpreted as the inverse of downsampling. To enlarge an image, the values of the additional pixels between the original pixels have to be predicted. One of the simplest and most traditional ways to do this is bicubic interpolation. New methods have evolved in recent years, and approaches based on neural networks now outperform the earlier methods.
Visual Example for better Understanding
Figure 1: Low resolution image.
Figure 2: Interpolated low resolution image.
Figure 3: High resolution image.
Figure 1 shows a picture with 25x25 pixels, whereas figure 3 shows the same image in its original resolution of 100x100 pixels. Figure 2 shows the image from figure 1 upscaled with bicubic interpolation. The result is blurred compared to the original-resolution image (cf. figure 3). This effect is caused by inaccurate prediction of the new pixel values. The aim of a super-resolution neural network is to learn the missing pixel values of the upscaled image as well as possible.
Metrics
To describe the quality of an upscaling method, a metric is needed that measures the similarity between the predicted (upscaled) image and the ground truth (full resolution) image. This section describes some of the metrics commonly used for problems of this nature.
PSNR
Peak Signal to Noise Ratio (PSNR) is a commonly used metric to define the similarity between two images. It is calculated from the Mean-Square-Error (MSE) of the pixels and the maximum possible pixel value (MAX_I) as follows:
PSNR = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{MSE} \right)
A high PSNR value corresponds to a high similarity between two images, and a low value to a low similarity. (4)
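The PSNR formula can be sketched in a few lines of plain Python. This is only an illustration: the function names `mse` and `psnr`, the representation of an image as a list of rows, and the default maximum value of 255 (8-bit images) are assumptions made for this example.

```python
import math


def mse(img_a, img_b):
    """Mean squared error between two equally sized images (lists of rows)."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(img_a, img_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count


def psnr(img_a, img_b, max_value=255.0):
    """PSNR in dB; max_value=255 assumes 8-bit pixel values."""
    err = mse(img_a, img_b)
    if err == 0:
        # Identical images: the ratio is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)
```

Identical images give an infinite PSNR in this sketch, since the MSE in the denominator is zero.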
SSIM
The structural similarity (SSIM) index was developed to improve on traditional measures such as PSNR, which have proven to be inconsistent with human visual perception. It takes the luminance, contrast and structure of both images into account.
The SSIM index is calculated on various windows of an image. The measure between two windows x and y of common size N×N is:
\mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}
with:
- \mu_x the average of x;
- \mu_y the average of y;
- \sigma_x^2 the variance of x;
- \sigma_y^2 the variance of y;
- \sigma_{xy} the covariance of x and y;
- c_1 = (k_1 L)^2, c_2 = (k_2 L)^2 two variables to stabilize the division with weak denominator;
- L the dynamic range of the pixel-values (typically this is 2^{\#bits per pixel} - 1);
- k_1 = 0.01 and k_2 = 0.03 by default. (3)
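As a rough illustration, the SSIM of a single pair of windows can be computed as follows. The function name `ssim_window`, the flat-list representation of a window, and the use of the population variance (division by the number of pixels) are assumptions of this sketch; practical implementations often weight the pixels and average the result over many windows.

```python
def ssim_window(x, y, dynamic_range=255.0, k1=0.01, k2=0.03):
    """SSIM of two equally sized windows given as flat lists of pixel values.

    dynamic_range=255 assumes 8-bit images; k1 and k2 are the usual defaults.
    """
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Population variance and covariance (an assumption of this sketch).
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilizing constants for weak denominators.
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Comparing a window with itself yields a value of 1, while a window with reversed structure yields a clearly lower value.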
More metrics can be found in the literature:
- IFC (Information Fidelity Criterion)
- NQM (Noise Quality Measure)
- WPSNR (Weighted Peak Signal to Noise Ratio)
- MS-SSIM (Multi Scale Structural Similarity)
Classical Approaches
Three classical approaches are briefly described below:
Bicubic Interpolation
Bicubic interpolation considers the 16 surrounding pixels (a 4x4 neighborhood) to predict each new pixel value.
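The following sketch shows how the 16 neighbors can be weighted. It assumes the widely used cubic convolution kernel with parameter a = -0.5 and clamps indices at the image border; both choices, as well as the names `cubic_kernel` and `bicubic_sample`, are assumptions of this example and not the only possible variant.

```python
import math


def cubic_kernel(t, a=-0.5):
    """Cubic convolution weight for distance t (a = -0.5 is an assumption)."""
    t = abs(t)
    if t <= 1:
        return (a + 2) * t ** 3 - (a + 3) * t ** 2 + 1
    if t < 2:
        return a * t ** 3 - 5 * a * t ** 2 + 8 * a * t - 4 * a
    return 0.0


def bicubic_sample(img, y, x):
    """Interpolate img (list of rows) at fractional position (y, x)."""
    height, width = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for j in range(-1, 3):          # 4 rows
        for i in range(-1, 3):      # 4 columns -> 16 pixels in total
            # Clamp indices so border pixels are replicated.
            yy = min(max(y0 + j, 0), height - 1)
            xx = min(max(x0 + i, 0), width - 1)
            weight = cubic_kernel(y - (y0 + j)) * cubic_kernel(x - (x0 + i))
            value += img[yy][xx] * weight
    return value
```

At integer positions the weights reduce to a single 1, so the original pixel value is returned unchanged.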
Bilinear Interpolation
Bilinear interpolation considers the 4 surrounding pixels (a 2x2 neighborhood) to predict each new pixel value.
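A minimal sketch of bilinear interpolation at a fractional position is given below. The function name `bilinear_sample` and the list-of-rows image representation are assumptions of this example, and the position is assumed to lie inside the image.

```python
import math


def bilinear_sample(img, y, x):
    """Interpolate img (list of rows) at fractional position (y, x).

    Assumes 0 <= y <= height - 1 and 0 <= x <= width - 1.
    """
    height, width = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    # Clamp the lower/right neighbors at the image border.
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on both rows, then along y between the rows.
    top = (1 - dx) * img[y0][x0] + dx * img[y0][x1]
    bottom = (1 - dx) * img[y1][x0] + dx * img[y1][x1]
    return (1 - dy) * top + dy * bottom
```

For the 2x2 image `[[0, 10], [20, 30]]`, sampling at the center (0.5, 0.5) gives the mean of the four pixels.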
Nearest Neighbor
The nearest neighbor method simply copies the value of the nearest original pixel to each new pixel.
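For an integer scaling factor, nearest neighbor upscaling can be sketched as below. The function name `upscale_nearest` and the restriction to integer factors are assumptions made to keep the example short.

```python
def upscale_nearest(img, factor):
    """Upscale img (list of rows) by an integer factor.

    Every source pixel is repeated as a factor x factor block.
    """
    height, width = len(img), len(img[0])
    return [
        [img[r // factor][c // factor] for c in range(width * factor)]
        for r in range(height * factor)
    ]
```

Because values are only copied, the result shows blocky edges rather than the blur produced by bilinear or bicubic interpolation.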
Approaches with Neural Networks
SRCNN (Super Resolution Convolutional Neural Network)
Figure 4: Architecture of SRCNN. Three convolutional layers in total. (2)
SRCNN consists of 3 convolutional layers with filter sizes of 9x9 for the first layer, 1x1 for the second layer and 5x5 for the last layer. The first layer generates 64 feature maps, the second 32, and the last one generates the output. Figure 4 shows color images, but in practice only grayscale images are used to train and apply the network. The first-layer filters can be interpreted as detectors for features such as corners and lines. They are visualized in figure 5.
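To get a feeling for how small this network is, the number of convolution weights can be counted layer by layer. The sketch below assumes a single-channel (grayscale) input and output and the basic 9-1-5 filter setting reported in the SRCNN paper (2); the names `SRCNN_LAYERS`, `conv_weight_count` and `total_weights` are hypothetical, and biases are not counted.

```python
# Each entry: (kernel size, input channels, output channels).
# Assumption: grayscale input/output and the 9-1-5 setting from (2).
SRCNN_LAYERS = [(9, 1, 64), (1, 64, 32), (5, 32, 1)]


def conv_weight_count(kernel_size, in_channels, out_channels):
    """Number of weights of one convolutional layer (biases excluded)."""
    return kernel_size * kernel_size * in_channels * out_channels


def total_weights(layers):
    """Sum of the convolution weights over all layers."""
    return sum(conv_weight_count(k, c_in, c_out) for k, c_in, c_out in layers)
```

With these assumptions, `total_weights(SRCNN_LAYERS)` evaluates to 5184 + 2048 + 800 = 8032 weights.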
Figure 5: Filters of the first convolution. (2)
VDSR (Very Deep Super Resolution)
Figure 6: Network architecture of VDSR. Multiple convolutional layers followed by addition of the input image. (1)
The VDSR network consists of 20 convolutional layers. The input and output images have the same size, which is achieved by zero-padding in every convolution. The key element is residual learning: the input image is added to the output of the last convolutional layer. In this way the network only has to learn the difference between the low and the high resolution image. This makes sense because both images share the same low frequencies, which therefore do not need to be learned during training.
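The idea of residual learning can be sketched independently of the network itself. In the example below, `residual_target` computes what the network would be trained to predict and `reconstruct` adds a predicted residual back to the interpolated input; both function names and the list-of-rows image representation are assumptions of this sketch, and the convolutional layers are left out entirely.

```python
def residual_target(high_res, interpolated):
    """Training target: difference between ground truth and interpolated input."""
    return [
        [h - i for h, i in zip(row_h, row_i)]
        for row_h, row_i in zip(high_res, interpolated)
    ]


def reconstruct(interpolated, predicted_residual):
    """Final output: interpolated input plus the residual predicted by the network."""
    return [
        [i + r for i, r in zip(row_i, row_r)]
        for row_i, row_r in zip(interpolated, predicted_residual)
    ]
```

If the network predicted the residual perfectly, `reconstruct` would return the high resolution image exactly.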
The filter size of each convolution except the first one is 3x3x64. Because each of the 20 layers with 3x3 filters widens the receptive field by 2 pixels, the receptive field of the network is 41x41 pixels. Each convolution except the last one generates 64 feature maps, some of which are visualized in figure 6. Data augmentation via rotation and flipping is used for training. To gain speed and reduce the size of the network, the training data is decomposed into patches of size 41x41. This also increases the amount of training data: out of 291 images, approximately 140,000 patches can be generated.
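The receptive field size and the patch decomposition can be illustrated with a short sketch. The helper names `receptive_field` and `extract_patches` are hypothetical, and the stride used for cutting patches is a free parameter of this example, not a value taken from the text.

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field of stacked stride-1 convolutions with equal kernel size."""
    # Every layer adds (kernel_size - 1) pixels to the field of view.
    return 1 + num_layers * (kernel_size - 1)


def extract_patches(img, patch_size, stride):
    """Cut an image (list of rows) into square patches on a regular grid."""
    height, width = len(img), len(img[0])
    patches = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patch = [row[left:left + patch_size]
                     for row in img[top:top + patch_size]]
            patches.append(patch)
    return patches
```

For 20 layers, `receptive_field(20)` gives 41, which matches the patch size of 41x41. A 100x100 image cut with a stride of 41 (an assumed value) yields 4 such patches.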
To help the network converge, gradient clipping and L2 regularization are used. The learning rate is multiplied by 0.1 every 20 epochs, which improves performance.
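These training aids can be sketched generically. In the example below, the initial learning rate of 0.1, the clipping threshold and the weight decay coefficient are assumed values, and `clip_gradients` performs plain value clipping; the VDSR paper (1) describes its own clipping variant, which is not reproduced here.

```python
def learning_rate(epoch, initial_lr=0.1, drop_every=20, factor=0.1):
    """Step decay: multiply the rate by `factor` every `drop_every` epochs.

    initial_lr=0.1 is an assumption of this sketch.
    """
    return initial_lr * factor ** (epoch // drop_every)


def clip_gradients(grads, threshold):
    """Plain value clipping of each gradient to [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grads]


def l2_penalty(weights, weight_decay):
    """L2 regularization term that is added to the training loss."""
    return weight_decay * sum(w * w for w in weights)
```

With these defaults the learning rate stays at 0.1 for epochs 0 to 19, drops to about 0.01 for epochs 20 to 39, and so on.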
Comparison of different Methods and State of the Art Performance
The table in figure 7 compares several super-resolution approaches. The benchmark datasets are standard in today's literature. All networks are trained on Set291, a set of 291 natural images.
Figure 7: Benchmark table for different super-resolution approaches. (1)
Figure 8 visualizes the performance of state-of-the-art techniques. There is only a slight difference between the ground truth image on the left and the predicted image on the right.
Figure 8: Results of different super-resolution approaches on an example image. (1)
Applications
Image Super-Resolution is used in many areas such as:
- Surveillance
- Remote Sensing
- Medical Imaging (e.g., ultrasonic images, X-ray images)
- Video Standard Conversion (e.g., SD to HD)
- Photo Cameras (e.g., postprocessing of images)
- Printing (e.g., enhancing the print quality of low resolution images on paper)
- Biometrics (e.g., fingerprint/face recognition)
- Commercial (e.g., barcode reading)
- Military (e.g., tracking and detecting)
- Satellite Imaging (e.g., weather forecasting)
Literature
- (1) Jiwon Kim, Jung Kwon Lee, Kyoung Mu Lee: Accurate Image Super-Resolution Using Very Deep Convolutional Networks. https://arxiv.org/abs/1511.04587
- (2) Chao Dong, Chen Change Loy, Kaiming He, Xiaoou Tang: Learning a Deep Convolutional Network for Image Super-Resolution. http://personal.ie.cuhk.edu.hk/~ccloy/files/eccv_2014_deepresolution.pdf
- (3) Wikipedia: Structural similarity. https://en.wikipedia.org/wiki/Structural_similarity
- (4) Wikipedia: Peak signal-to-noise ratio. https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio
- (5) Wikimedia Commons: Interpolation-bicubic.svg. https://upload.wikimedia.org/wikipedia/commons/f/f5/Interpolation-bicubic.svg
- (6) Wikimedia Commons: Interpolation-bilinear.svg. https://upload.wikimedia.org/wikipedia/commons/d/dd/Interpolation-bilinear.svg
- (7) Wikimedia Commons: Interpolation-nearest.svg. https://upload.wikimedia.org/wikipedia/commons/2/27/Interpolation-nearest.svg
Weblinks
https://github.com/huangzehao/caffe-vdsr (Implementation of VDSR in Caffe)
http://live.ece.utexas.edu/publications/2004/hrs_ieeetip_2004_infofidel.pdf (More about Information Fidelity Criterion (IFC))