Author: Simon Goldhofer
Supervisor: Prof. Gudrun Klinker
Advisor: Gudrun Klinker (@gi32kef)
Submission Date: [created]

Abstract

Temporal sentence grounding in long videos aims to localize the moments in a long video that correspond to a natural-language query. With the growth in available video data and compute resources, the demand for temporal video grounding capabilities is rising steadily, along with research activity in the area. The task poses the following challenges: (1) a high computational burden due to long videos, high sampling rates, and large model sizes, (2) the difficulty of capturing the relevant semantic information in embeddings, and (3) the difficulty of temporally aligning video and text embeddings as video length and content complexity increase. To address these challenges, a state-of-the-art model is adapted, trained, and evaluated on the movie audio description dataset. An end-to-end trainable model is developed that uses a transformer encoder-decoder architecture to pre-filter candidate windows and to perform temporal alignment and prediction. Experiments show promising results for a window-based sampling technique, while training and evaluating on whole movies still poses significant challenges, such as overfitting, and leaves room for improvement in end-to-end training and model architecture. The dataset is analyzed with regard to label errors and noise as well as representational limitations of the video and text embeddings. Furthermore, a moment-retrieval and localization pipeline applicable in industry is proposed, along with possible directions for future research.

Results/Implementation/Project Description
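As an illustration of the candidate pre-filtering step described in the abstract, the following minimal sketch scores sliding windows of pre-extracted frame embeddings against a query embedding and keeps the top-k windows. It is not the thesis implementation: the mean-pooling strategy, window size, stride, and top-k value are illustrative assumptions, and cosine similarity stands in for the learned transformer scoring.

import torch
import torch.nn.functional as F

def prefilter_windows(frame_emb, query_emb, window_size=64, stride=32, top_k=5):
    """Score sliding windows of frame embeddings against a query embedding
    and return the top-k candidate windows as (start, end, score) tuples.

    frame_emb: (T, D) tensor of per-frame embeddings for the whole movie.
    query_emb: (D,) tensor for the natural-language query.
    """
    T, D = frame_emb.shape
    starts = list(range(0, max(T - window_size, 0) + 1, stride))
    scores = []
    for s in starts:
        window = frame_emb[s : s + window_size]  # (window_size, D)
        # Mean-pool the window and compare it to the query by cosine similarity.
        pooled = window.mean(dim=0)
        scores.append(F.cosine_similarity(pooled, query_emb, dim=0))
    scores = torch.stack(scores)  # (num_windows,)
    k = min(top_k, len(starts))
    top = torch.topk(scores, k)
    return [(starts[i], starts[i] + window_size, scores[i].item())
            for i in top.indices.tolist()]

# Example with random embeddings standing in for a movie and a query.
frames = torch.randn(10_000, 512)  # ~10k sampled frames, 512-d embeddings
query = torch.randn(512)
for start, end, score in prefilter_windows(frames, query):
    print(f"candidate window [{start}, {end}) score={score:.3f}")

In the full pipeline, the retained windows would then be passed to the encoder-decoder model for temporal alignment and boundary prediction; only this cheap pre-filtering step is sketched here.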

Conclusion

[ PDF (optional) ] 

[ Slides Kickoff/Final (optional) ]