Author: Moritz Krüger 

Supervisor: Shahrooz Faghihroohi 

Introduction on Image Stitching

Within the field of computer vision, image stitching represents a vital research branch with profound applications across various domains, including virtual reality, autonomous driving and medical imaging. Image stitching methods attempt to align a pair of input images as accurately as possible, which has remained a challenging task for many years. Conventional methods follow a feature-based approach, which breaks the task down into three stages: keypoint extraction, feature matching and homography estimation. A significant drawback of feature-based approaches is their strong dependency on the quantity and quality of the extracted features; they often fail in challenging scenes with sparse textures or unfavorable lighting conditions. Another methodology that has seen a surge in popularity in recent years is the learning-based approach. In contrast to feature-based methods, these approaches do not rely on keypoints and can directly output the corresponding homography matrix or the final stitched image.

Relevance for the Medical Domain

Image stitching plays a crucial role in various medical fields and can enhance the capabilities of healthcare professionals in treatment and diagnosis. Applications include X-ray and MRI imaging, endoscopy and laparoscopy, as well as ophthalmology, among many others. Some of these application areas are illustrated in Fig. 1-3. Image stitching can assist doctors in various ways; a very common use case is increasing the field of view, which is illustrated in Fig. 4. This finds practical application in many scenarios, such as minimally invasive surgery.





Recent research

Large-baseline cases, in which the source and target images are captured further apart, remain a challenging task in image stitching. This section introduces three distinct learning-based approaches that aim to tackle this problem.

Depth-Aware Multi-Grid Deep Homography Estimation With Contextual Correlation 

Nie et al. [5] presents an unsupervised multi-grid deep homography estimation approach and introduces a novel contextual correlation layer (CCL) in combination with a depth-aware loss function. It addresses scenes with varying depth levels and challenging parallax configurations.

Network Architecture

The overall structure of the network can be seen in Fig. 5 and can be described as follows:

Multi-scale features are extracted using a convolutional neural network (CNN) and organized into a multi-scale feature pyramid, where the i-th pyramid layer comprises downsampled multi-scale features from scales i to N. The homography estimation is conducted in a coarse-to-fine fashion using three pyramid layers in total, where each layer processes the information provided by the previous layer. Only the last pyramid layer predicts the UxV multi-grid homography; the earlier layers predict a global homography. The homography estimation is performed using the novel CCL followed by the residual mesh regressor and is based on feature flow.

The CCL, shown in Fig. 6, constructs a correlation probability volume by taking evenly spaced patches from the target features and using them as filters in a convolution over the reference features. After applying a scaled softmax, the obtained values can be interpreted as feature matching probabilities, from which the feature flow is calculated.
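To make the patch-as-filter idea more concrete, the following is a minimal PyTorch sketch of a CCL-style matching step under simplifying assumptions (batch size 1, a single patch size, no learned components); the function name and default values are illustrative and not taken from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contextual_correlation(ref_feat, tgt_feat, patch=3, temperature=10.0):
    """Sketch of a CCL-style matching step (assumes batch size 1).

    ref_feat, tgt_feat: feature maps of shape (1, C, H, W).
    Returns a dense feature flow of shape (1, 2, H, W) on the reference grid.
    """
    _, C, H, W = tgt_feat.shape

    # 1. Cut the target features into evenly spaced patches and use them as
    #    convolution filters over the reference features.
    filters = F.unfold(tgt_feat, kernel_size=patch, padding=patch // 2)    # (1, C*p*p, H*W)
    filters = filters.transpose(1, 2).reshape(H * W, C, patch, patch)      # one filter per target location
    corr = F.conv2d(ref_feat, filters, padding=patch // 2)                 # (1, H*W, H, W)

    # 2. Scaled softmax over all target locations -> matching probabilities.
    prob = torch.softmax(temperature * corr.view(1, H * W, H * W), dim=1)  # (1, target, reference)

    # 3. Feature flow = expected target position minus the reference position.
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).view(H * W, 2)                  # (H*W, 2), row-major
    expected = torch.einsum("bts,td->bsd", prob, coords)                   # expected target coords per reference pixel
    flow = (expected - coords.unsqueeze(0)).view(1, H, W, 2).permute(0, 3, 1, 2)
    return flow
```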



The objective function is formally shown in Eq. 5 and consists of two parts.

The content loss penalizes misalignment between the images and is calculated as the L1 loss between the overlapping regions of the reference image and the warped target image. It is computed per pyramid layer and aggregated into a weighted sum with progressively increasing weights, which emphasizes the error on the third pyramid layer. The equations for the content loss are given in Eq. 1-2.

The depth-aware shape-preserved loss encourages adjacent grid cells at similar depth levels to keep a similar shape. Using a pretrained depth prediction network, the grid cells are assigned to different depth levels, and the loss is calculated accordingly. The interested reader can find more detailed information in Eq. 3-4 as well as Fig. 7.

L^k_{content} = \left \| \mathcal{W}^k(E) \odot I_r - \mathcal{W}^k(I_t) \right \|_1

Eq. 1: Content loss per layer (k). (Adopted from Nie et al. [5])

L_{content} = \omega_1 L_{content}^1 + \omega_2 L_{content}^2 + \omega_3 L_{content}^3

Eq. 2: Total content loss (Adopted from Nie et al. [5])
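As a small illustration of Eq. 1-2, the sketch below evaluates the per-layer content loss given a warping function; `warp_k` stands in for the backward warping with the homography/mesh predicted at layer k, and the weights are placeholders, not the values used by Nie et al. [5].

```python
import torch

def content_loss_layer(warp_k, I_r, I_t):
    """Eq. 1 sketch: L1 penalty on the overlapping region of pyramid layer k.

    warp_k: callable that warps a (B, C, H, W) tensor with the layer-k estimate.
    I_r, I_t: reference and target images.
    """
    E = torch.ones_like(I_t)            # all-ones mask
    overlap = warp_k(E)                 # warped mask marks the overlap in the reference view
    return torch.mean(torch.abs(overlap * I_r - warp_k(I_t)))

def content_loss_total(per_layer_losses, weights=(1.0, 2.0, 4.0)):
    """Eq. 2 sketch: weighted sum over the three pyramid layers (placeholder weights)."""
    return sum(w * l for w, l in zip(weights, per_layer_losses))
```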

l_{sp}^{A,B}=2 - \frac{ |\overrightarrow{e_1}\cdot\overrightarrow{e_2}| }{ \lVert\overrightarrow{e_1}\rVert \cdot \lVert\overrightarrow{e_2}\rVert } - \frac{ |\overrightarrow{e_3}\cdot\overrightarrow{e_4}| }{ \lVert\overrightarrow{e_3}\rVert \cdot \lVert\overrightarrow{e_4}\rVert }

Eq. 3: Similarity calculation. l^{A,B}_{sp} measures the grid similarity between adjacent patches A and B (see Fig. 7). The similarity matrices L_{sp}^{hor} and L_{sp}^{ver} are formed by evaluating this term in the horizontal and vertical direction. (Adopted from Nie et al. [5])

L_{shape} = \frac{1}{U(V-1)} \sum_{k=1}^{M}D_{hor}^kL_{sp}^{hor} + \frac{1}{(U-1)V} \sum_{k=1}^{M}D_{ver}^kL_{sp}^{ver}

Eq. 4: Depth-aware shape-preserved loss. D is a depth consistency matrix (calculated in the horizontal and vertical direction) that indicates whether adjacent grids are at the same depth level. M is the number of distinct depth levels. (Adopted from Nie et al. [5])
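For intuition, the following sketch evaluates the similarity term of Eq. 3 for one pair of adjacent mesh cells; which edge vectors form the pairs (e1, e2) and (e3, e4) follows Fig. 7, so here it is simply assumed that corresponding edges of cells A and B are passed in matching order.

```python
import torch

def grid_pair_similarity(edges_A, edges_B, eps=1e-8):
    """Eq. 3 sketch: shape similarity of two adjacent mesh cells A and B.

    edges_A, edges_B: (2, 2) tensors holding two edge vectors per cell.
    Returns a value in [0, 2]; 0 means the corresponding edges are parallel,
    i.e. the two cells have a similar shape.
    """
    def abs_cos(u, v):
        return torch.abs(torch.dot(u, v)) / (torch.norm(u) * torch.norm(v) + eps)

    return 2.0 - abs_cos(edges_A[0], edges_B[0]) - abs_cos(edges_A[1], edges_B[1])
```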

L = \lambda L_{content} + \mu L_{shape}

Eq. 5: Complete objective function (Adopted from Nie et al. [5])

Fig. 7: A detailed description of the depth-aware shape-preserved loss. (Image taken from Nie et al. [5])

Results

The results demonstrate superior performance compared to existing feature-based and unsupervised methods. As depicted in Fig. 8, the alignment is notably improved for the majority of the presented samples. This is further confirmed by quantitative evaluations on real-world and synthetic data, shown in Tab. 1 and Tab. 2: in both cases, the proposed method leads across all conducted tests in terms of RMSE, PSNR and SSIM. Lastly, ablation studies validate the chosen network components. Tab. 3 shows the results of the CCL ablation study, in which the CCL achieves better accuracy and a smaller model size than other commonly used components, such as the cost volume. The interested reader can find further results in Tab. 4-5 as well as Fig. 9.



Tab. 1: Quantitative evaluation based on the real-world UDIS-D dataset using the image similarity measures PSNR and SSIM. First and second best performance is indicated by red and blue colors. (Table adopted from Nie et al. [5])


Tab. 2: Quantitative RMSE evaluation based on the synthetic MS-COCO dataset. First and second best performance is indicated by red and blue colors. (Table adopted from Nie et al. [5])

Tab. 3: CCL ablation study results. The CCL achieves the best results in terms of accuracy and model size, while being slightly slower than the fastest components. (Table adopted from Nie et al. [5])


Tab. 4: Component ablation study results. Each component further improves the performance. (Table adopted from Nie et al. [5])

Fig. 8: Qualitative evaluation based on the real-world UDIS-D dataset. Blue- and orange-tinted colors represent alignment of the source and target image. (Image taken from Nie et al. [5])

Tab. 5: Analysis of the impact of different depth-levels and grid sizes on the image similarity metrics. Results indicate a slight peak at the highlighted parameters. (Table adopted from Nie et al. [5])


Fig. 9: Analysis of the impact of the depth-aware shape-preserved loss on the mesh. Results indicate a smoother mesh deformation. (Image taken from Nie et al. [5])

Independent Results

Köhler et al. [6] provides unique insights into the performance of the method in the medical domain, specifically in the context of laparoscopic imaging. The results are only briefly highlighted in the following; the interested reader is encouraged to consult the original study.

The authors concluded that the pretrained model achieved subpar performance, likely due to the domain gap. The model trained on medical data achieved performance comparable to the reference methods. However, single-homography methods were deemed more robust and faster than the proposed method. It was further concluded that learning-based methods are promising and relevant for the future, despite not yet achieving optimal performance.

Semi-supervised Deep Large-baseline Homography Estimation with Progressive Equivalence Constraint

Jiang et al. [7] proposes a semi-supervised deep learning approach. It addresses a common failure point of photometric-loss-based unsupervised methods, which struggle with large-baseline cases. Jiang et al. [7] introduces a progressive estimation strategy, in which the large-baseline problem is broken down into multiple small-baseline problems. A key contribution of this paper is the novel semi-supervised homography identity loss.

Training Process

The overall training process is depicted in Fig. 10. Unlike the other unsupervised methods presented here, this approach uses synthetically generated intermediary images to obtain additional training pairs. As shown in Fig. 10a, synthetic images are first generated by warping the source image with randomly sampled homographies. This is done consecutively, such that each new homography is applied to the previously warped image, essentially creating a chain of homographies. To avoid degenerate solutions, the homographies are sampled from a pre-defined set. In total, n+2 training pairs are obtained.

Specifically, for each source and target pair in the training dataset, the following training pairs are obtained and used to calculate the homography identity loss: (I_{s_0},I_{s_{t}}),\;(I_{s_0},I_{s_1}),\; \ldots,\; (I_{s_{n-1}},I_{s_n}),\;(I_{s_{n}},I_{s_{t}})
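Assuming each intermediate warp is drawn as a mild random perturbation (the paper samples from a pre-defined set of homographies), a sketch of the image-chain generation could look as follows; the helper names, n and the perturbation range are illustrative choices.

```python
import numpy as np
import cv2

def random_small_homography(h, w, max_shift=16, rng=np.random):
    """Sample a mild homography by jittering the four image corners."""
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = src + rng.uniform(-max_shift, max_shift, size=(4, 2)).astype(np.float32)
    return cv2.getPerspectiveTransform(src, dst)

def generate_chain(I_s, I_t, n=2, rng=np.random):
    """Create intermediate images I_s1..I_sn by consecutively warping the source.

    Returns the n+2 training pairs listed above together with the known
    ground-truth homographies used for the supervised term.
    """
    h, w = I_s.shape[:2]
    images = [I_s]
    gt_homographies = []
    current = I_s
    for _ in range(n):
        H = random_small_homography(h, w, rng=rng)
        current = cv2.warpPerspective(current, H, (w, h))    # each warp acts on the previous image
        images.append(current)
        gt_homographies.append(H)

    pairs = [(images[0], I_t)]                               # (I_s0, I_t)
    pairs += [(images[i], images[i + 1]) for i in range(n)]  # (I_si, I_s{i+1})
    pairs.append((images[-1], I_t))                          # (I_sn, I_t)
    return pairs, gt_homographies
```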

Fig. 10: Overall training process. (a) shows the proposed progressive estimation strategy. (b) shows the proposed unsupervised objective function.
(Image taken from Jiang et al. [7])

\prod^{0}_{i=n-1} H_{s_is_{i+1}} = H^{-1}_{s_nt} \times H_{st}

Eq. 6: Equivalence constraint. Following the accumulative multiplication from Fig. 10b, we obtain the equivalence constraint. (Adopted from Jiang et al. [7])


\mathcal{L}_{HIL} = \mathcal{L}_{unsup} + \lambda_w \mathcal{L}_{sup}

Eq. 7: Homography identity loss, where \lambda_w = \mathcal{L}_{unsup} / \mathcal{L}_{sup}. (Equation adopted from Jiang et al. [7])

\mathcal{L}_{unsup} = |H^{-1}_{s_{n}t} \times H_{st} - \prod^0_{i=n-1} H_{s_{i}s_{i+1}}|_1

Eq. 8: Unsupervised objective, based on the equivalence constraint in Eq. 6. (Equation adopted from Jiang et al. [7])

\mathcal{L}_{sup} = \sum^{n-1}_{i=0}|H_{s_{i}s_{i+1}} - H^{i+1}_{gt}|_1

Eq. 9: Supervised objective. (Equation adopted from Jiang et al. [7])

The homography identity loss consists of an unsupervised and a supervised objective. The supervised objective, given in Eq. 9, is the L1 loss between the predicted homographies and the known homographies used in the image generation process. The unsupervised objective, presented in Eq. 8, encapsulates the multiplicative properties of homography matrices. It embodies the equivalence constraint formulated in Eq. 6, which follows from the accumulative multiplication illustrated in Fig. 10b. The training pairs are chosen such that this multiplication chain can be constructed from the source image, over the intermediate images, to the target image.
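As a compact illustration, the sketch below evaluates the homography identity loss in its matrix form (Eq. 7-9), before the homography-flow reformulation of Eq. 10-11; it assumes each H_{ab} maps coordinates from image a to image b, and the detach on the adaptive weight is an assumption, not stated in the paper.

```python
import torch

def homography_identity_loss(H_st, H_snt, H_chain, H_gt):
    """Sketch of the homography identity loss in matrix form.

    H_st:    predicted homography source -> target, shape (3, 3).
    H_snt:   predicted homography from the last intermediate image to the target.
    H_chain: list of predicted homographies [H_{s0 s1}, ..., H_{s(n-1) sn}].
    H_gt:    list of ground-truth homographies used to synthesize the chain.
    """
    # Equivalence constraint (Eq. 6): composing the chain equals inv(H_snt) @ H_st.
    chain_prod = torch.eye(3, dtype=H_st.dtype)
    for H in H_chain:                       # left-multiply so the last link ends up outermost
        chain_prod = H @ chain_prod

    # Unsupervised term (Eq. 8): L1 deviation from the equivalence constraint.
    unsup = torch.abs(torch.linalg.inv(H_snt) @ H_st - chain_prod).sum()

    # Supervised term (Eq. 9): each chain link should match its known homography.
    sup = sum(torch.abs(Hp - Hg).sum() for Hp, Hg in zip(H_chain, H_gt))

    # Homography identity loss (Eq. 7) with the adaptive weight lambda_w.
    lambda_w = (unsup / sup).detach()
    return unsup + lambda_w * sup
```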

It should be noted that several optimizations are made to the loss functions, including the partial substitution of predicted homography matrices by their known ground-truth counterparts in order to stabilize the training process. For further details, the supplementary material provided by Jiang et al. [7] is recommended.

Network Architecture

Fig. 11: Overall network architecture. (Image taken from Jiang et al. [7])

The pipeline of the proposed network is shown in Fig. 11. It takes the source and target image as well as resized variants as input, which improves the applicability to high-resolution images. Multi-scale features (\mathcal{F}^k \in \mathbb{R}^{\frac{H}{2^{2+k}} \times \frac{W}{2^{2+k}} \times d^k}, k \in [0,2].) are extracted through a multi-scale CNN encoder. For the resized images, the multi-scale features (\mathcal{\hat{F}}^k \in \mathbb{R}^{\frac{H}{2^{2+k}} \times \frac{W}{2^{2+k}} \times d^k}, k \in [0,2].) are obtained. To construct the pyramid features, features \mathcal{F}^0, \mathcal{F}^1, \mathcal{\hat{F}}^1 and \mathcal{\hat{F}}^2 are selected to decrease redundancy.

Based on these features, the global and local feature correlation representations (\mathcal{C}^2_{g}, \mathcal{\hat{C}}^1_{l}, \mathcal{C}^1_{l} and \mathcal{C}^0_{l}) are calculated. The global correlation layer uses the cosine similarity of the \mathcal{\hat{F}}^2 features, which carry the most global information, and is given by \mathcal{C}_g^2(\mathrm{x}_s, \mathrm{x}_t) = \mathcal{\hat{F}^2_s}(\mathrm{x}_s)^\top \mathcal{\hat{F}^2_t}(\mathrm{x}_t). The local correlation layers use the improved GOCor [8] feature correlation module.
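A minimal sketch of the global correlation term, assuming the cosine similarity is obtained by L2-normalizing the feature vectors before taking dot products:

```python
import torch
import torch.nn.functional as F

def global_correlation(F_s, F_t):
    """Cosine-similarity correlation volume between all source/target locations.

    F_s, F_t: coarsest-level features of shape (B, C, H, W).
    Returns a volume of shape (B, H*W, H*W) whose entries correspond to
    C_g(x_s, x_t) = F_s(x_s)^T F_t(x_t) on unit-length feature vectors.
    """
    fs = F.normalize(F_s.flatten(2), dim=1)       # (B, C, H*W)
    ft = F.normalize(F_t.flatten(2), dim=1)
    return torch.einsum("bcs,bct->bst", fs, ft)   # all-pairs dot products
```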

The coarse-to-fine homography estimation module adopts two coarse motion estimators (CME) and two fine motion estimators (FME) to refine the homography progressively. Within the network, the homographies are represented as homography flow (\mathrm{F}_{st} = \mathrm{X}_{t} - \mathrm{X}_{s}) to facilitate the learning of motion information, analogous to Li et al. [9]. The loss functions are updated accordingly using this representation. The interested reader can find the updated, final loss functions in Eq. 10-11.
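To illustrate the homography-flow representation F_st = X_t - X_s, the following sketch converts a 3x3 homography into a dense flow field over the pixel grid (a standard conversion, not code from the paper):

```python
import torch

def homography_to_flow(H, height, width):
    """Turn a homography into a dense flow field F = X_t - X_s.

    H: (3, 3) homography mapping source pixel coordinates to target coordinates.
    Returns a flow tensor of shape (2, height, width) holding (dx, dy) per pixel.
    """
    ys, xs = torch.meshgrid(torch.arange(height, dtype=torch.float32),
                            torch.arange(width, dtype=torch.float32), indexing="ij")
    ones = torch.ones_like(xs)
    X_s = torch.stack([xs, ys, ones], dim=0).reshape(3, -1)   # homogeneous source coordinates
    X_t = H @ X_s
    X_t = X_t[:2] / X_t[2:3]                                  # perspective division
    return (X_t - X_s[:2]).reshape(2, height, width)
```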

\mathcal{L}_{\text{unsup}} = \sum_{i=0}^{n-1} \lambda_i \left( |\mathrm{F}_{st} - \mathrm{F}_{s_{i+1}t} - \sum_{j=i}^{0} \mathrm{F}^{j+1}_{gt} |_{1} + |\mathrm{\hat{F}}_{st} - \mathrm{\hat{F}}_{s_{i+1}t} - \sum_{j=i}^{0} \mathrm{\hat{F}}^{j+1}_{gt} |_{1} \right)

Eq. 10: Final unsupervised objective based on homography flow. (Equation adopted from Jiang et al. [7])

\mathcal{L}_{\text{sup}} = \sum_{i=0}^{n-1} | \mathrm{F}_{s_is_{i+1}} - \mathrm{F}_{gt}^{i+1} |_1 + \sum_{i=0}^{n-1} | \hat{\mathrm{F}}_{s_is_{i+1}} - \hat{\mathrm{F}}^{i+1}_{gt} |_1

Eq. 11: Final supervised objective based on homography flow. (Equation adopted from Jiang et al. [7])

Results

To evaluate the method, a novel large-baseline dataset is introduced, which includes manually labeled, uniformly distributed matching points for each source and target pair. Furthermore, the dataset is split into different categories based on the type of scene depicted, such as regular scenes, low-light scenes, etc.

The results indicate state-of-the-art performance in terms of the point matching error, outperforming all other methods on the large-baseline benchmark, as shown in Tab. 6. On the small-baseline dataset, the method achieves the best performance in all categories except low-texture (LT-S), where it places second. These results can be found in Tab. 7. The findings are further supported by a qualitative evaluation, in which the method is deemed superior to all other tested methods, as evident from Fig. 12.
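The point matching error reported in Tab. 6-7 is, in essence, the mean distance between the warped source keypoints and their labeled counterparts in the target image; a small sketch under that assumption (the exact normalization is not restated here):

```python
import numpy as np

def point_matching_error(H, src_pts, tgt_pts):
    """Mean Euclidean distance between H-warped source points and labeled target points.

    H: (3, 3) estimated homography; src_pts, tgt_pts: (N, 2) arrays of matched points.
    """
    ones = np.ones((src_pts.shape[0], 1))
    warped = (H @ np.concatenate([src_pts, ones], axis=1).T).T   # warp in homogeneous coordinates
    warped = warped[:, :2] / warped[:, 2:3]                      # back to image coordinates
    return float(np.mean(np.linalg.norm(warped - tgt_pts, axis=1)))
```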


Tab. 6: Quantitative PME (point matching error) evaluation on their novel large-baseline dataset. The method performs best in all test cases. (Table adopted from Jiang et al. [7])

Tab. 7: Quantitative PME (point matching error) evaluation on a small-baseline dataset. The method achieves the best performance in all but one category. (Table adopted from Jiang et al. [7])

Fig. 12: Qualitative results on their novel large-baseline dataset. Dark blue indicates a low error, whereas green and red indicate larger errors. (Image taken from Jiang et al. [7])


Tab. 8: Results of the ablation studies. Rows 2) and 3) change the number of inserted images, rows 4)-7) vary the loss function, and rows 8)-10) adjust the homography representation. (Table adopted from Jiang et al. [7])

Legend: Regular (RE), low-texture (LT), low-light (LL), small-foregrounds (SF), large-foregrounds (LF)

Unsupervised deep image stitching: Reconstructing stitched features to images

Nie et al. [10] adopts a notably different unsupervised approach. It focuses on reconstructing the stitched image from feature space back to pixel space, motivated by the observation that misalignment is less noticeable in feature space than in pixel space. The method consists of two steps: a coarse alignment of the source and target images, followed by unsupervised image reconstruction.

Network architecture

The network is divided into two parts, namely the unsupervised coarse image alignment and the unsupervised image reconstruction. The network architecture is presented in Fig. 13.

Fig. 13: The proposed network architecture. (Image taken from Nie et al. [10])

The unsupervised coarse image alignment uses a pre-existing large-baseline deep homography estimator network [11], which estimates the homography in a coarse-to-fine manner. The main contribution of Nie et al. [10] for the coarse alignment is a different loss function for the deep homography estimation. In contrast to losses computed on fixed image patches, which fail in low-overlap scenarios, an ablation-based loss is introduced: the loss is calculated only on the overlapping regions, while all other regions are neglected.
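A minimal sketch of such an overlap-restricted photometric term, assuming the overlap mask is obtained by warping an all-ones image with the estimated homography:

```python
import torch

def overlap_only_loss(warp, I_r, I_t):
    """L1 photometric loss evaluated only where the warped target overlaps the reference.

    warp: callable that warps a (B, C, H, W) tensor with the estimated homography.
    """
    mask = warp(torch.ones_like(I_t))                 # 1 inside the overlap, 0 elsewhere
    diff = torch.abs(mask * I_r - warp(I_t))
    return diff.sum() / mask.sum().clamp(min=1.0)     # normalize by the overlap area only
```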

The unsupervised image reconstruction is split into a low-resolution and a high-resolution branch. The low-resolution branch essentially merges the low-resolution images, whereas the high-resolution branch adds detail to the upscaled low-resolution reconstruction. The LR branch uses a UNet-shaped architecture, which is commonly used for image reconstruction tasks. The HR branch uses a series of residual convolution blocks in combination with further skip connections that ensure the retention of low-resolution information.

The loss comprises the LR and HR deformation losses as well as the LR-HR content consistency loss.

The LR and HR deformation losses treat pixels differently depending on their location. The image is divided into the content region, which covers most of the image, and the seam region, where the two images are joined. Corresponding masks are computed for each region and used in the loss calculation. For the content region, a perceptual loss based on Johnson et al. [12] is used, which encourages feature-level similarity between the input and the reconstructed image. In particular, a VGG-19 [13] trained on the ImageNet dataset [14] is used to extract the features for the perceptual loss, which can be found in Eq. 15.

To ensure continuous and natural transitions between the images, an L1 loss is used for the seam regions, as shown in Eq. 14.

To promote consistency between the outputs of the low-resolution and high-resolution branches, the LR-HR content consistency loss is used. It is given by the L1 norm of the difference between the (downscaled) high-resolution output and the low-resolution output (see Eq. 16).
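To make the three reconstruction terms concrete, here is a hedged PyTorch sketch: the perceptual term follows Johnson et al. [12] using torchvision's VGG-19 features, while the choice of VGG layer, the mask names and the weights are illustrative assumptions rather than the exact configuration of Nie et al. [10].

```python
import torch
import torch.nn.functional as F
import torchvision

# Truncated, frozen VGG-19 feature extractor for the perceptual (content) term.
_vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features[:21].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(x, y):
    """Eq. 15-style perceptual loss: MSE between VGG-19 feature maps."""
    return F.mse_loss(_vgg(x), _vgg(y))

def lr_deformation_loss(S_lr, I_aw, I_bw, M_ac, M_bc, M_as, M_bs,
                        lambda_c=1.0, lambda_s=1.0):
    """Sketch of Eq. 12-14: perceptual term on the content masks, L1 term on the seam masks."""
    content = perceptual_loss(S_lr * M_ac, I_aw) + perceptual_loss(S_lr * M_bc, I_bw)
    seam = F.l1_loss(S_lr * M_as, I_aw * M_as) + F.l1_loss(S_lr * M_bs, I_bw * M_bs)
    return lambda_c * content + lambda_s * seam

def consistency_loss(S_hr, S_lr):
    """Eq. 16 sketch: L1 distance between the downscaled HR output and the LR output."""
    S_hr_down = F.interpolate(S_hr, size=S_lr.shape[-2:], mode="bilinear", align_corners=False)
    return F.l1_loss(S_hr_down, S_lr)
```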


\mathcal{L}^l_{LR} = \lambda_c \mathcal{L}^l_{Content} + \lambda_s \mathcal{L}^l_{Seam}

Eq. 12: LR deformation loss. (Adopted from Nie et al. [10])

\mathcal{L}^l_{Content} = \mathcal{L}_P(S_{LR} \odot M^{AC}, I^{AW}) + \mathcal{L}_P(S_{LR} \odot M^{BC}, I^{BW})

Eq. 13: Content loss (LR). (Adopted from Nie et al. [10])

\mathcal{L}^l_{Seam} = \mathcal{L}_1(S_{LR} \odot M^{AS}, I^{AW} \odot M^{AS}) + \mathcal{L}_1(S_{LR} \odot M^{BS}, I^{BW} \odot M^{BS})

Eq. 14: Seam loss (LR). (Adopted from Nie et al. [10])

\ell^{\phi, j}_{feat}(\hat{y},y) = \frac{1}{C_jH_jW_j} \| \phi_j(\hat{y}) - \phi_j(y) \|_2^2

Eq. 15: Perceptual loss, where \phi is the loss network, j denotes the j-th layer of the CNN and C_j, H_j, W_j describe the C_j \times H_j \times W_j shaped feature map. (Adopted from Johnson et al. [12])

\mathcal{L}_{CS} = \| S^{256\times256}_{HR} - S_{LR} \|_1

Eq. 16: LR-HR content consistency loss. (Adopted from Nie et al. [10])

\mathcal{L}_{R} = \omega_{LR}\mathcal{L}_{LR} + \omega_{HR}\mathcal{L}_{HR} + \omega_{CS}\mathcal{L}_{CS}

Eq. 17: Complete loss function. (Adopted from Nie et al. [10])


The complete loss is a weighted sum of the aforementioned loss components and is formally given in Eq. 17.

Results

The method achieves the second-best performance on homography estimation, evaluated on the MS-COCO dataset. A comparison of commonly used image quality metrics, namely PSNR and SSIM, indicates unparalleled performance. Both results can be seen in Tab. 9. In the qualitative comparison depicted in Fig. 15, the strengths of the reconstruction network become apparent: in the shown cases, no overlapping artifacts are visible and the seams are smooth and continuous, which is a unique strength of this approach. These findings are further validated by a user study on visual quality, shown in Fig. 14. The interested reader can view additional ablation study results in Tab. 10 and is referred to the original paper for further ablation studies.

In contrast to the previously discussed papers, the authors also mention limitations of their proposed architecture. In cases where the coarse alignment by the single-homography network is unsatisfactory, duplication artifacts can occur in the final image, as shown in Fig. 16. This is because the network interprets the misalignment as multiple different objects, which it then attempts to reconstruct. Possible solutions to this problem include the use of a non-linear homography estimator, which would improve the performance of the coarse image alignment network.

Tab. 9: Quantitative RMSE evaluation based on the synthetic MS-COCO dataset. (Table adopted from Nie et al. [10])

Fig. 14: User study on visual quality averaged on 20 participants. Numbers are shown in percentages. (Adopted from Nie et al. [10])

Components | Findings
LR-branch + Content Loss | Basic stitching, seam distortions, limited resolution
LR- and HR-branch + Content Loss | Increased resolution, artifacts, seam distortions
LR- and HR-branch + Content Loss + Seam Loss | No seam distortions, still artifacts
LR- and HR-branch + Content Loss + Seam Loss + Consistency Loss | No artifacts

Tab. 10: Summary of ablation study results. 

Fig. 15: Qualitative evaluation based on their own real-world dataset. (Image taken from Nie et al. [10])


Fig. 16: Example of a failure case. Unsatisfactory coarse alignment can cause duplication artifacts. (Image taken from Nie et al. [10])


Own review

In this work, three interesting and promising architectures for semi- and unsupervised image stitching were presented. This section provides a concise summary of the three discussed papers along with a personal assessment.

Nie et al. [5] presented a depth-aware multi-grid approach, which provides a non-linear homography estimation solution. The novel components yield substantial improvements in stitching quality, which are also verified by the presented ablation studies. A notable flaw of this paper lies in the absence of self-criticism: it does not outline any failure scenario or highlight metrics on which the method performed poorly. Additionally, a recent study in the medical domain by Köhler et al. [6] was unable to replicate the claimed superiority of the proposed method over other state-of-the-art approaches. This discrepancy might be attributed to the domain gap between medical images and the typical applications of image stitching, or perhaps to an inadequate selection of comparison methods.

In their 2023 work, Jiang et al. [7] proposed a semi-supervised homography estimation approach that relies on a progressive equivalence constraint. This approach only yields a linear homography approximation, which is unable to handle large-parallax cases, but it introduces a promising architecture that avoids the issues of the commonly used photometric loss. Its results indicate superior performance in terms of the point matching error on their own novel dataset and are validated through the provided ablation studies. However, similar to the previously discussed paper, it also neglects to outline any failure cases or potential issues of the architecture, posing a potential hindrance to future research. Another noteworthy observation is that, at the time of this review, the method had not yet been widely cited due to its novelty, which leaves further validation to future work.

The last approach, by Nie et al. [10], uses a reconstruction network to merge coarsely aligned images. This method stands out as the most distinct among the presented strategies, as it focuses on improving the visual quality of the stitched images. Its results indicate sufficient performance of the adopted coarse homography estimator and state-of-the-art performance with regard to image quality metrics. A refreshing aspect of this paper is the showcasing of failure cases, providing a better understanding of the method and a broader vision for future research. Another issue, briefly pointed out by Jiang et al. [7], concerns problems of this method with dynamic objects, which could not be examined in more detail here. The paper exhibits some minor shortcomings, such as a vague description of the coarse alignment setup and of the employed perceptual loss. Nevertheless, among the three papers discussed, it stands out as the most comprehensive, offering more detailed content and more self-reflection.

The findings of Köhler et al. [6] show mixed results in the medical field, further emphasizing the domain gap. Dealing with complex non-rigid transformations, diverse image textures and different requirements on the stitched images in the medical domain will require further domain-specific research. As indicated by the study, the presented approaches are promising but not yet plug-and-play solutions for this field.

Exciting avenues for future research in image stitching include further extensive testing and fine-tuning of the presented semi- and unsupervised methods. Additionally, applying these methods to the medical domain in more concrete studies can provide valuable insights into the domain gap and contribute to addressing specific challenges within the field.


References

[1] Samsung. (2023). Digital Radiography AccE GM85 | Samsung Healthcare Global. https://www.samsunghealthcare.com/en/products/DigitalRadiography/AccE%20GM85/Radiology/benefit (Retrieved: 2024-01-22)

[2] Guy, S., Haberbusch, J. L., Promayon, E., Mancini, S., & Voros, S. (2022). Qualitative comparison of image stitching algorithms for multi-camera systems in laparoscopy. Journal of Imaging, 8(3), 52.

[3] Kim, J., Go, S., Noh, K. J., Park, S. J., & Lee, S. (2021). Fully Leveraging Deep Learning Methods for Constructing Retinal Fundus Photomontages. Applied Sciences, 11(4), 1754.

[4] Kim, D. T., Nguyen, V. T., Cheng, C. H., Liu, D. G., Liu, K. C. J., & Huang, K. C. J. (2018). Speed improvement in image stitching for panoramic dynamic images during minimally invasive surgery. Journal of Healthcare Engineering, 2018.

[5] Nie, L., Lin, C., Liao, K., Liu, S., & Zhao, Y. (2021). Depth-aware multi-grid deep homography estimation with contextual correlation. IEEE Transactions on Circuits and Systems for Video Technology, 32(7), 4460-4472. 

[6] Köhler, H., Pfahl, A., Moulla, Y., Thomaßen, M. T., Maktabi, M., Gockel, I., ... & Chalopin, C. (2022). Comparison of image registration methods for combining laparoscopic video and spectral image data. Scientific Reports, 12(1), 16459.

[7] Jiang, H., Li, H., Lu, Y., Han, S., & Liu, S. (2023, June). Semi-supervised deep large-baseline homography estimation with progressive equivalence constraint. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 37, No. 1, pp. 1024-1032).

[8] Truong, P., Danelljan, M., Gool, L. V., & Timofte, R. (2020). GOCor: Bringing globally optimized correspondence volumes into your neural network. Advances in Neural Information Processing Systems, 33, 14278-14290.

[9] Li, H., Luo, K., & Liu, S. (2021). Gyroflow: gyroscope-guided unsupervised optical flow learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 12869-12878).

[10] Nie, L., Lin, C., Liao, K., Liu, S., & Zhao, Y. (2021). Unsupervised deep image stitching: Reconstructing stitched features to images. IEEE Transactions on Image Processing, 30, 6184-6197.

[11] Nie, L., Lin, C., Liao, K., & Zhao, Y. (2020). Learning edge-preserved image stitching from large-baseline deep homography. arXiv preprint arXiv:2012.06194.

[12] Johnson, J., Alahi, A., & Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14 (pp. 694-711). Springer International Publishing.

[13] Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

[14] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). Imagenet large scale visual recognition challenge. International journal of computer vision, 115, 211-252.
