Author: Unknown user (go72tez)
Supervisor: Azade Farshad Yeganeh, Y. M.
1. Introduction
Video Object Segmentation (VOS), a significant technology in the field of computer vision, has substantial practical implications across diverse fields, particularly in medical applications. This report explores the definition, significance, challenges, and state-of-the-art methods of VOS, focusing on its impact on interpreting visual data in various domains, especially healthcare.
Object segmentation, in the context of video processing, is the partitioning of each video frame into several regions at the pixel level, with the goal of delineating object boundaries precisely.
There are several types of segmentation. As Figure 1 shows, semantic segmentation only assigns each pixel a class label, while instance segmentation additionally distinguishes different objects of the same class.
Video object segmentation (VOS), illustrated in Figure 2, is the task of separating foreground regions from the background in video sequences. It is a critical task in computer vision, and its applications span various domains, such as autonomous driving, video editing, surveillance systems, augmented reality, and medical imaging, where it enables precise identification and tracking of specific objects within video sequences.
2. Motivation
2.1. Motivation within the Medical Context
The utilization of Video Object Segmentation (VOS) technology presents transformative potential within the medical context, particularly in the domains of diagnostic accuracy, neurology, and surgical interventions. These areas are detailed below:
2.1.1. Precision in Diagnosis
The application of VOS for the accurate segmentation of echocardiogram videos marks a significant advancement in cardiology. By extracting precise and detailed information from video sequences of the heart, VOS enhances the capability to diagnose heart diseases. This precision stems from the technology's ability to isolate and analyze cardiac structures and motions frame by frame, leading to a more nuanced understanding of cardiac function and pathology.
2.1.2. Temporal Understanding in Neurology
In neurological practice, VOS contributes profoundly by capturing the temporal evolution of cerebral structures and functions in video sequences. This is especially pertinent in conditions like epilepsy, where changes over time are as critical as static images. VOS aids neurologists in tracking the progression of brain abnormalities, offering a dynamic perspective that static images cannot provide. By observing and quantifying these changes, clinicians can better understand disease progression, evaluate treatment efficacy, and potentially anticipate seizure events.
2.1.3. Interventional Support in Surgery
During surgical procedures, especially minimally invasive ones such as laparoscopic surgeries, VOS provides real-time segmentation of anatomical structures. This real-time analysis is crucial in offering visual guidance to surgeons, enabling the precise navigation of surgical instruments and the identification of target organs or lesions. By delineating anatomical boundaries and providing a clear distinction between different tissues, VOS serves as a navigational aid, contributing to safer surgical interventions with potentially reduced operative times and improved patient outcomes.
In summary, the integration of VOS into medical diagnostics, neurology, and surgical procedures exemplifies the convergence of advanced technology and healthcare, paving the way for enhanced patient care through improved diagnostic capabilities, better understanding of neurological conditions, and increased precision and safety in surgical practices.
2.2. Problem Statement
The integration of Video Object Segmentation (VOS) into medical applications offers the promise of enhancing the quality and accuracy of patient care through advanced imaging analysis. However, the adoption of this technology faces several substantial challenges that must be addressed to realize its full potential.
2.2.1. Complex Anatomical Structures
A significant impediment to the effective implementation of VOS in medical settings is the inherent complexity of human anatomy. Anatomical structures exhibit a high degree of variability in shape and size, which introduces difficulties in achieving accurate segmentation. This issue is particularly pronounced in the segmentation of organs such as the brain or lungs, where the intricate details and nuanced variations can confound even sophisticated algorithms. The ability to discern these structures accurately is crucial, as it directly impacts the quality of the diagnosis and the subsequent treatment plan.
2.2.2. Dynamic Scene Changes
Medical imaging is further complicated by the dynamic nature of the internal bodily environment. Videos capture physiological changes in real-time, including blood flow, organ motion, and the evolution of pathological abnormalities. These dynamic changes require algorithms capable of not only recognizing patterns but also rapidly adapting to physiological movements and alterations over time. The algorithms must be robust enough to track the progression of a disease or the body's response to treatment, necessitating advanced computational models that can interpret and predict complex biological behavior.
2.2.3. Limited Annotated Data
Another hurdle in the effective application of VOS in healthcare is the scarcity of annotated medical data. Annotated datasets are the bedrock upon which deep learning models learn and improve. The lack of extensive, accurately labeled data sets can severely restrict the ability of these models to learn effectively, leading to challenges in developing algorithms that are both robust and generalizable. This scarcity is due in part to the rigorous requirements for privacy and the labor-intensive process of obtaining detailed annotations from skilled professionals.
Addressing these challenges is paramount for the successful integration of VOS in medical applications. Strategies to overcome these hurdles include the development of more sophisticated models that can handle the complexity of human anatomy, improved techniques for real-time adaptation to dynamic changes, and enhanced methods for data annotation and synthesis. By tackling these issues, the medical community can leverage VOS to its fullest capacity, leading to better patient outcomes and more efficient medical processes.
2.3. Existing VOS methods
The existing VOS methods can be mainly grouped into four types: unsupervised, semi-supervised, interactive, and referring. Note that 'unsupervised' and 'semi-supervised' have different meanings in VOS than in general machine learning: in VOS, these terms indicate the level of supervision required during inference rather than during training. Unsupervised VOS (UVOS) methods segment the most salient objects in a video without any annotation of the target objects at inference time.
Semi-supervised VOS (SVOS) methods, in contrast, start from ground-truth labels available in a few frames (generally only the first frame). These manually annotated masks indicate the objects to be segmented in the remaining frames (a minimal code sketch of this protocol follows the four categories below).
Interactive VOS involves user intervention, where the user provides inputs such as scribbles or bounding boxes to guide the segmentation process in the video frames.
Referring VOS, on the other hand, combines language processing with visual information, allowing the segmentation algorithm to focus on objects described by textual queries within the video.
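To make the distinction between these settings concrete, the sketch below outlines the semi-supervised inference protocol in Python. The single-call `model.step` interface is a hypothetical simplification; real SVOS methods additionally maintain a memory of past frames and masks.

```python
def semi_supervised_vos(frames, first_frame_mask, model):
    """Propagate a first-frame annotation through an otherwise unlabeled video.

    frames:           list of H x W x 3 frames (the video)
    first_frame_mask: H x W array, 0 = background, k = object id k
    model:            hypothetical propagation model exposing
                      step(frame, previous_mask) -> mask
    """
    masks = [first_frame_mask]          # ground truth is given only here
    for frame in frames[1:]:            # every later frame is unlabeled
        masks.append(model.step(frame, masks[-1]))
    return masks
```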
2.4. Benchmarks
2.4.1. Datasets summary
Of the existing datasets for VOS, three are worth highlighting:
The DAVIS (Densely Annotated Video Segmentation) series is a high-resolution dataset series that has evolved over the years into three versions. Compared with other datasets of its time, the DAVIS datasets offer more sequences, annotations, and challenges, which makes them prevalent for training and evaluation. DAVIS-2016 is designed for single-object tasks, DAVIS-2017 for multi-object tasks, and DAVIS-2017-U is the unsupervised version.
The YouTube-VOS series is a large-scale dataset series with long video sequences. It contains three versions: the first two are designed for multi-object SVOS, while the third serves multi-object UVOS. The number of video sequences in YouTube-VOS is dozens of times that of DAVIS, which means more diverse objects and contexts are covered. Moreover, each video sequence has more frames than in other datasets, allowing VOS methods to model and exploit long-range temporal dependencies between frames.
MOSE (coMplex video Object SEgmentation) is a dataset whose most notable feature is complex scenes with crowded and occluded objects. The target objects in the videos are commonly occluded by others and disappear in some frames, which helps promote the development of more comprehensive and robust video object segmentation algorithms.
Table 1: Summary of the VOS datasets [6]
| Datasets | D. type | O. num | DA | Resolution | Videos | Annotations | Categories | Objects |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DAVIS-2016 | R | S | √ | 854 × 480 | 50 | 3,455 | - | 50 |
| DAVIS-2017 | R | M | √ | 854 × 480 | 150 | 10,459 | - | 376 |
| DAVIS-2017-U | R | M | √ | 854 × 480 | 150 | 10,731 | - | 449 |
| YouTube-VOS-2018 | R | M | | 1280 × 720 | 4,453 | 197,272 | 94 | 7,754 |
| YouTube-VOS-2019 | R | M | | 1280 × 720 | 4,519 | >190,000 | 94 | 8,614 |
| YouTube-VIS | R | M | | 1280 × 720 | 2,883 | >131,000 | 40 | 4,883 |
| MOSE | R | S, M | | 1920 × 1080 | 2,149 | 431,725 | 36 | 5,200 |
- D. type: data type of the contained video sequences; R = real data, S = synthetic data;
- O. num: number of objects annotated in each video sequence; S = single object, M = multiple objects;
- DA: dense annotation, i.e. all involved video frames are annotated;
- Videos: number of video sequences; Annotations: number of annotated frames;
- Categories: number of object categories involved; Objects: number of annotated objects.
2.4.2. Evaluation Metrics
In evaluating the performance of Video Object Segmentation (VOS), several core metrics are employed, each offering unique insights into different aspects of segmentation accuracy and quality.
Dice Coefficient
The Dice Coefficient, denoted as DSC, is defined mathematically as
DSC = \frac{2|X \cap Y|}{|X| + |Y|}
where X represents the predicted set of pixels and Y denotes the ground truth. This coefficient measures the pixel-wise agreement between a predicted segmentation and its ground truth, providing a sense of how well the two overlap. It is considered a more lenient metric, less sensitive to small discrepancies in larger objects, which makes it particularly suitable for tasks where slight variations are permissible or expected.
Jaccard Index
The Jaccard Index, or Intersection over Union (IoU), is given by
J = \frac{|M \cap G|}{|M \cup G|}
It assesses the region similarity between the segmented mask M and the ground truth mask G. This metric is stricter than the Dice Coefficient and is sensitive to variations, especially in smaller objects or at the edges of objects. The Jaccard Index is crucial for tasks demanding high precision in the detection and segmentation of small-scale objects, as it penalizes misclassification more heavily.
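For binary masks stored as NumPy arrays, both region metrics follow directly from pixel counts. The minimal sketch below mirrors the two formulas above and is not tied to any particular dataset loader; on multi-object benchmarks the scores are typically computed per object and then averaged.

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, gt: np.ndarray) -> float:
    """DSC = 2|X ∩ Y| / (|X| + |Y|) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    return 2.0 * intersection / total if total > 0 else 1.0

def jaccard_index(pred: np.ndarray, gt: np.ndarray) -> float:
    """J = |M ∩ G| / |M ∪ G| (intersection over union) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return intersection / union if union > 0 else 1.0
```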
F-measure
The F-measure, or F-score, is defined as
F = \frac{2 \times P_c \times R_c}{P_c + R_c}
where P_c and R_c are the precision and recall calculated from the contour points of the segmented mask, c(M), and of the ground truth, c(G). This metric provides a harmonic mean of precision and recall, creating a balance between the two. The F-measure is highly versatile and can be adjusted to cater to the specific requirements of different applications, making it ideal for datasets that are unbalanced or when there is a need to maintain an equilibrium between precision and recall.
Mean of J&F
The mean of Jaccard and F-measure,
J\&F = \frac{J + F}{2} |
aggregates the performance captured by both the Jaccard Index and F-measure, offering a comprehensive overview of the VOS performance by considering both region-based and contour-based similarities.
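The contour F-measure requires extracting and matching boundary pixels; the official DAVIS toolkit does this with a bipartite matching under a distance tolerance. The sketch below is a simplified approximation of that procedure (the dilation-based tolerance `tol` is an assumption) and reuses `jaccard_index` from the previous sketch to form the J&F mean.

```python
import numpy as np
from scipy import ndimage

def _boundary(mask: np.ndarray) -> np.ndarray:
    """Boundary pixels: the mask minus its one-pixel erosion."""
    mask = mask.astype(bool)
    return mask & ~ndimage.binary_erosion(mask)

def boundary_f_measure(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """Approximate contour F-score: a boundary pixel counts as matched if it
    lies within `tol` pixels of the other mask's boundary."""
    pred_b, gt_b = _boundary(pred), _boundary(gt)
    if pred_b.sum() == 0 and gt_b.sum() == 0:
        return 1.0
    gt_zone = ndimage.binary_dilation(gt_b, iterations=tol)
    pred_zone = ndimage.binary_dilation(pred_b, iterations=tol)
    precision = (pred_b & gt_zone).sum() / max(pred_b.sum(), 1)   # P_c
    recall = (gt_b & pred_zone).sum() / max(gt_b.sum(), 1)        # R_c
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def j_and_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean of region similarity J and contour accuracy F."""
    return 0.5 * (jaccard_index(pred, gt) + boundary_f_measure(pred, gt))
```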
The choice of metric for evaluating VOS systems is dictated by the specific goals of the task and the characteristics of the data being analyzed. Additionally, the Frames Per Second (FPS) metric is crucial as it indicates the number of image frames processed per second, reflecting the efficiency of video processing. However, there is often a trade-off between accuracy and efficiency, which necessitates a balanced approach when selecting the appropriate evaluation metric for a given VOS task.
3. Methods
3.1. Unsupervised VOS: Tracking Anything with Decoupled Video Segmentation (DEVA) [7]
In the current research landscape, significant strides have been made in the field of Video Object Segmentation (VOS), with several state-of-the-art studies contributing notable advancements. These studies offer insights into innovative methodologies and underscore the progressive nature of this domain.
Among these advancements is an unsupervised model known as DEVA (Tracking Anything with Decoupled Video Segmentation). DEVA proposes a decoupled video segmentation approach that utilizes external data, enhancing its ability to generalize to tasks with sparse annotations and surpassing traditional end-to-end video segmentation approaches. This method facilitates the integration of existing universal image segmentation models into the VOS framework.
A key feature of DEVA is its bi-directional propagation technique, which serves to denoise image segmentations and elegantly integrates them with temporally propagated segmentations. The model operates on a bifurcated system; initially, it identifies and outlines objects in the first frame of a video, subsequently utilizing these outlines to track the objects consistently across the video sequence.
To augment the segmentation accuracy, DEVA employs an 'in-clip consensus' mechanism, wherein it considers feedback from subsequent frames to amend any discrepancies in the segmentation. Although the model does not autonomously recognize new objects that may emerge in a video, it periodically renews its tracking by assessing new frames and amalgamating this information with the current tracking pathway. This strategy of reciprocal information gathering from both preceding and forthcoming frames is defined as 'bi-directional propagation'.
An illustrative example of the 'in-clip consensus' process is presented below, where object proposals from various frames are compared. The selection of the output shape is based on the consensus of proposals, with unsupported shapes being dismissed as noise.
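The sketch below is a schematic rendering of this decoupled design, not the authors' implementation: `image_segmenter` (returning a list of binary masks per frame) and `propagator` are hypothetical components, the consensus is reduced to a greedy IoU vote that assumes little motion within a short clip, and DEVA's spatial alignment and joint association of proposals are omitted.

```python
import numpy as np

def iou(a, b):
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def in_clip_consensus(clip_proposals, min_support=2, iou_thresh=0.5):
    """Keep a proposal from the clip's first frame only if similar proposals
    appear in enough of the following frames; the rest is treated as noise."""
    anchor, *others = clip_proposals            # one proposal list per frame
    kept = []
    for mask in anchor:
        support = 1 + sum(
            any(iou(mask, p) > iou_thresh for p in frame_proposals)
            for frame_proposals in others
        )
        if support >= min_support:
            kept.append(mask)
    return kept

def decoupled_vos(frames, image_segmenter, propagator,
                  detect_every=5, clip_size=3):
    """Schematic decoupled loop: a class-agnostic propagator carries masks
    forward frame by frame, and denoised image-level proposals are merged
    back in periodically (unmatched proposals start new object tracks)."""
    masks = in_clip_consensus([image_segmenter(f) for f in frames[:clip_size]])
    outputs = [masks]
    for t in range(1, len(frames)):
        masks = propagator.step(frames[t], masks)        # temporal propagation
        if t % detect_every == 0 and t + clip_size <= len(frames):
            proposals = in_clip_consensus(
                [image_segmenter(f) for f in frames[t:t + clip_size]])
            new_objects = [p for p in proposals
                           if all(iou(p, m) < 0.5 for m in masks)]
            masks = masks + new_objects
        outputs.append(masks)
    return outputs
```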
Performance evaluations of DEVA reveal its superiority over other methods, particularly in the DAVIS16 validation set, and demonstrate competitive results in the DAVIS17 set. These outcomes attest to the model's efficacious class-agnostic temporal propagation capability, optimized for the processing of extensive video sequences.
The robustness and versatility of DEVA across different VOS tasks are highlighted, indicating its potential as a formidable instrument for addressing future segmentation challenges. The model's adeptness in class-agnostic processing and its efficient amalgamation with segmentation models position DEVA at the vanguard of VOS research.
Demo of DEVA: https://youtu.be/Z8Gld-kbs-c?si=6HmnB6ZBhJw1GVHW
3.2. Semi-supervised VOS: Putting the Object Back into Video Object Segmentation (Cutie) [8]
This section of the report delves into the intricacies of a semi-supervised model known as Cutie, which represents a significant advancement in the identification and tracking of objects within video streams. Central to Cutie's design is the 'Object Transformer', a mechanism that preserves memory at the object level. This can be likened to an efficient archiving system that is capable of retrieving specific object summaries by synthesizing general queries with intricate details from the object's representation.
Cutie's approach is comprehensive, as it does not limit its focus to the primary object; rather, it encompasses both the foreground and background within its purview. By extracting rich feature sets from the entire visual field, Cutie ensures clarity and mitigates confusion that might arise from other dynamic elements within the scene, commonly referred to as 'distractors'.
To streamline its operation over time, Cutie unifies the gathered information into a condensed object memory bank. This consolidation enables the system to recall specific object-related details on demand, akin to remembering the distinguishing features of a person one has previously encountered. The system initiates its process by analyzing the initial frame of a video, which has been pre-marked to identify the target object. This frame serves as the foundation for the segmentation of subsequent frames. Cutie encodes these frames into two distinct memory types: a detailed pixel memory that preserves high resolution and an abstracted object memory that encapsulates a higher-level understanding of the object. Upon receiving a new frame, Cutie engages its stored memory to draft a preliminary outline of the object. This nascent outline undergoes refinement through the model's 'Object Transformer', which incrementally integrates complex object data across multiple processing stages, known as transformer blocks. The culmination of this process is a definitive object outline that the decoder uses to generate the final segmentation mask.
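The data flow described above can be summarized as a short sketch. All components here are abstract callables standing in for Cutie's published modules (their exact interfaces are assumptions), and the write-back of new features into the pixel and object memories after each frame is omitted for brevity.

```python
def cutie_style_step(frame, pixel_memory, object_memory,
                     encoder, pixel_reader, object_transformer, decoder,
                     num_blocks=3):
    """One hypothetical inference step of an object-memory VOS model:
    a pixel-level readout drafts the segmentation, then an object-level
    transformer refines it over several blocks before decoding."""
    features = encoder(frame)                        # image features
    readout = pixel_reader(features, pixel_memory)   # high-resolution draft
    object_queries = object_memory                   # compact object summary
    for _ in range(num_blocks):                      # iterative refinement
        readout, object_queries = object_transformer(readout, object_queries)
    mask = decoder(readout)                          # final segmentation mask
    return mask, readout, object_queries
```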
This methodology adeptly merges granular details with an overarching object context, establishing a novel benchmark for video object segmentation. In empirical assessments, Cutie exhibits strong performance compared with contemporary transformer-based VOS methods, showcasing superior results when additional training data are employed. For instance, on the DAVIS-2017 validation set, the Cutie-base model achieved a J&F score of 88.8%. Its efficacy is further evidenced by its robust performance on the YouTube-VOS-2019 validation set.
Importantly, Cutie maintains a practical processing speed, with Cutie-base operating at 36.4 frames per second, rendering it apt for real-time applications. This balance of speed and accuracy underscores the model's potential utility in time-sensitive and computationally demanding environments.
Demo of Cutie: https://www.youtube.com/watch?v=Vj5v7KtdYr4
3.3. Interactive/Referring VOS: Segment and Track Anything (SAM-Track) [9]
The interactive and referring VOS model, SAM-Track, presents a comprehensive framework for video segmentation. This system unifies the Segment Anything Model (SAM) with an AOT-based tracking model (DeAOT) and Grounding-DINO to facilitate text-based interactions. SAM-Track enables precise and effective segmentation and tracking of objects in video sequences via multimodal interactions or automated methods.
SAM-Track offers dual tracking modes—interactive and automatic. An additional Fusion Tracking Mode is also available, which amalgamates both interactive and automatic tracking methods to provide flexibility in object tracking and segmentation.
Within the SAM-Track framework, three primary models are integral to its VOS tasks:
- SAM is a large-scale image segmentation model that provides interactive segmentation with zero-shot capabilities, meaning it can segment objects that were not present during the training phase.
- DeAOT is a VOS model that efficiently tracks multiple objects, utilizing a Gated Propagation Module to maintain visual information consistency across frames, which has shown excellent results in tracking competitions.
- Grounding-DINO integrates language processing with object detection, interpreting textual descriptions to accurately identify and outline objects with precise bounding rectangles.
Within these modes, the SAM-Track architecture operates as follows (a schematic sketch of the text-guided pipeline is given after the list):
- The Interactive Mode enables accurate object selection through clicks or language inputs, which are processed by Grounding-DINO for improved object detection.
- The Automatic Mode commences from the second frame, tracking both pre-existing and new objects to ensure comprehensive coverage.
- Grounding-DINO processes input such as text categories or detailed object descriptions, producing the minimum bounding rectangles for target identification.
- SAM utilizes these bounding rectangles as prompts to predict the segmentation mask for each object, which DeAOT then uses to track the objects throughout the video.
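A schematic sketch of the text-guided pipeline described above, with the three models represented as abstract callables (their interfaces here are assumptions, not the released APIs):

```python
def text_guided_tracking(frames, text_prompt, grounding_dino, sam, deaot):
    """Schematic SAM-Track-style flow for text-guided tracking."""
    boxes = grounding_dino(frames[0], text_prompt)   # text -> bounding rectangles
    masks = [sam(frames[0], box) for box in boxes]   # boxes -> segmentation masks
    tracker = deaot(frames[0], masks)                # initialise the multi-object tracker
    results = [masks]
    for frame in frames[1:]:
        results.append(tracker.track(frame))         # propagate masks frame by frame
    return results
```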
Figure 15 illustrates the use of the interactive tracking mode and its efficacy in refining predictions during tracking, an advantageous feature for managing segmentation in complex settings.
In medical applications, where there is often a paucity of samples for many cell types and organs, training specific trackers for rare objects can be challenging. SAM-Track's proficiency in tracking zero-shot objects by simple interactions such as clicking is particularly valuable, enabling the tracking of rare objects without the necessity for specialized training datasets.
The experimental data show SAM-Track's strong performance on two well-known VOS benchmarks, DAVIS-2016 Val and DAVIS-2017 Test. The annotations for these benchmarks were generated through interactive mouse-clicking, highlighting the model's precision in annotation and its robustness in object segmentation and tracking. The performance metrics suggest that SAM-Track's accuracy approaches that of tracking methods initialized with ground-truth masks.
Demo of interactive tracking mode: https://youtu.be/eZrdna8JkoQ?si=zwGcK8EtffzLTd2k
Demo for text-based category-guided tracking: https://youtu.be/5oieHqFIJPc?si=_5jAbbhFOHJs3vNJ
Demo for Referring Object Tracking: https://youtu.be/nXfq17X6ohk?si=zaodAI50e9-CsFbm
Demo of refining the segmentation result in any frame of a video (medical diagnosis): https://youtu.be/hPjw28Ul4cw?si=l0s4ADyqMuv_EhQE
4. Conclusion
4.1. Comparison and Discussion
| Model | Tracking Anything with Decoupled Video Segmentation (DEVA) | Putting the Object Back into Video Object Segmentation (Cutie) | Segment and Track Anything (SAM-Track) |
| --- | --- | --- | --- |
| Advantages | The first step towards open-world, large-vocabulary video segmentation | End-to-end robust network; fast running time | Multimodal interaction; capable of tracking new objects; fast inference speed |
| Performance | Rank #1 on DAVIS-2016 (val), DAVIS-2017 (val), DAVIS-2017 (test-dev) | Rank #1 on DAVIS-2017 (val), DAVIS-2017 (test-dev), YouTube-VOS 2018 & 2019, MOSE | DAVIS-2016 val (92.0%), DAVIS-2017 test (79.2%) |
| Limitations | Cannot detect new objects by itself; works better when training data is sufficient | Fails when highly similar objects move in close proximity or occlude each other | Requires manual input |
| Evaluation metrics | J&F | J, F, J&F | J, F, J&F |
DEVA (Tracking Anything with Decoupled Video Segmentation)
DEVA emerges as a trailblazer in VOS, taking a significant leap towards open-world, large-vocabulary video segmentation. It has garnered top rankings in several DAVIS challenges, signifying its proficiency in handling complex segmentation tasks. However, DEVA's reliance on substantial training data and its inability to autonomously detect new objects present notable limitations. The model excels when training data are ample but may falter when confronted with novel objects not present in its training datasets.
Cutie (Putting the Object Back into Video Object Segmentation)
Cutie operates as a robust end-to-end network renowned for its swift processing capabilities. This model has secured impressive standings in the DAVIS challenges, emphasizing its efficiency. Despite its speed, Cutie faces challenges with object segmentation when dealing with closely situated or overlapping objects, indicating a potential area for refinement, particularly in scenes with dense object clusters.
SAM-Track (Segment and Track Anything)
SAM-Track adopts a versatile multi-model framework capable of tracking new objects swiftly, showcasing quick inference times conducive to real-time applications. Its approach allows for the integration of multiple input methods, enhancing its tracking abilities. However, a dependency on manual input for accurate tracking is a constraint that necessitates user interaction, potentially limiting its automation capabilities.
4.2. Pros and Cons in Medical Applications
The integration of these advanced VOS methods into medical applications presents both opportunities and challenges.
Advantages
- Unsupervised VOS: Addressing the shortage of annotated medical data, unsupervised VOS models like DEVA can be particularly beneficial in medical contexts where annotated data are scarce.
- Multimodal Interaction: The capacity for diverse input methods (clicks, strokes, text) aligns with the dynamic nature of medical scenarios, such as real-time surgery segmentation, where rapid and precise responses are essential.
Challenges
- Computational Demand: High processing requirements of these VOS methods may pose a challenge in medical settings where computational resources are limited.
- Domain-Specific Knowledge Gap: There may be a lack of nuanced understanding of medical imagery, which is critical for accurate analysis and interpretation.
- New Object Identification: The difficulty in identifying new objects within complex and varied medical scenes remains a significant hurdle.
By considering these strengths and limitations, medical professionals and technologists can strategize the implementation of VOS methods to improve the efficiency and quality of patient care. The key lies in leveraging the advantages such as unsupervised learning capabilities and multimodal interactions while mitigating the challenges through enhanced computational strategies, domain-specific training, and improved object identification algorithms. The ultimate goal is to facilitate better patient outcomes and streamline medical processes through the judicious application of VOS technologies.
5. References
[1] Chen, Changhao, Bing Wang, Chris Xiaoxuan Lu, Niki Trigoni, and Andrew Markham. “A Survey on Deep Learning for Localization and Mapping: Towards the Age of Spatial Machine Intelligence.” June 22, 2020.
[2] Zhao, Hengshuang, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and Jiaya Jia. “ICNet for Real-Time Semantic Segmentation on High-Resolution Images.” April 27, 2017.
[3] Zheng, Ziyang, Jiewen Yang, Xinpeng Ding, Xiaowei Xu, and Xiaomeng Li. “GL-Fusion: Global-Local Fusion Network for Multi-View Echocardiogram Video Segmentation.” In Medical Image Computing and Computer Assisted Intervention - MICCAI 2023: 26th International Conference, Vancouver, BC, Canada, October 8-12, 2023, Proceedings, Part IV, 78–88. Springer, 2023.
[4] Myronenko, Andriy, and Ali Hatamizadeh. “Robust Semantic Segmentation of Brain Tumor Regions from 3D MRIs.” In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 5th International Workshop, 2019.
[5] Sestini, Luca, Benoit Rosa, Elena de Momi, Giancarlo Ferrigno, and Nicolas Padoy. “SAF-IS: A Spatial Annotation Free Framework for Instance Segmentation of Surgical Tools.” 2023. hal-04212988.
[6] Gao, Mingqi, Feng Zheng, James J. Q. Yu, Caifeng Shan, Guiguang Ding, and Jungong Han. “Deep Learning for Video Object Segmentation: A Review.” *Artificial Intelligence Review* 56, no. 1 (2023): 457–531.
[7] Cheng, Ho Kei, Seoung Wug Oh, Brian Price, Alexander Schwing, and Joon-Young Lee. “Tracking Anything with Decoupled Video Segmentation.” September 7, 2023. (ICCV 2023)
[8] Cheng, Ho Kei, Seoung Wug Oh, Brian Price, Joon-Young Lee, and Alexander Schwing. “Putting the Object Back into Video Object Segmentation.” October 19, 2023. (arXiv:2310.12982)
[9] Cheng, Yangming, Liulei Li, Yuanyou Xu, Xiaodi Li, Zongxin Yang, Wenguan Wang, and Yi Yang. “Segment and Track Anything.” May 11, 2023. (arXiv:2305.06558)
[10] Perazzi, F., J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. “A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 724–732, 2016.
[11] Perazzi, F., J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. “A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation.” 2016.
[12] Pont-Tuset, J., F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. “The 2017 DAVIS Challenge on Video Object Segmentation.” 2017.
[13] Xu, N., L. Yang, Y. Fan, T. S. Huang, J. Yang, and H. Shi. “The 2nd Large-Scale Video Object Segmentation Challenge - Track 1: Video Object Segmentation.” 2019.
[14] Xu, N., L. Yang, Y. Fan, J. Yang, D. Yue, Y. Liang, B. Price, S. Cohen, and T. Huang. “YouTube-VOS: Sequence-to-Sequence Video Object Segmentation.” In Proceedings of the European Conference on Computer Vision, 585–601, 2018.
[15] Xu, N., L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, and T. Huang. “YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark.” 2018.
[16] Ding, Henghui, Chang Liu, Shuting He, Xudong Jiang, Philip H. S. Torr, and Song Bai. “MOSE: A New Dataset for Video Object Segmentation in Complex Scenes.” 2023.