Abstract

Traditional action recognition models predominantly focus on short video clips, achieving high accuracy in controlled environments but often faltering in complex, real-world scenarios such as feature films, where broader contextual information is vital. This thesis investigates whether incorporating contextual information from previous video segments can enhance the performance of action recognition models. Experiments were conducted on the Hollywood2 dataset using the TimeSformer model, integrating additional contextual features including video captions generated by the BLIP-2 model and MFCC audio features. The study also examined the impact of sequential context by including features from preceding video clips to assess the influence of temporal dependencies. Results indicated that adding meaningful contextual features significantly improved the model's ability to differentiate between visually similar actions. However, incorporating sequential context from previous clips yielded mixed results, and the models did not surpass existing state-of-the-art performance on the Hollywood2 dataset. Limitations such as the loss of visual information due to frame cropping, the restricted temporal window of only eight frames per clip, and the visual variability between the training and testing splits may have influenced the results. The findings suggest that while incorporating contextual information is beneficial, additional strategies involving more sophisticated models and feature representations are necessary to effectively capture the complexities of action recognition in long-form videos.


Documentation

Introduction

Action recognition focuses on automatically identifying human actions in videos. Traditional methods often rely on analyzing short clips, which may overlook important contextual information, particularly in complex videos like films.

This project investigates...

  • how well the TimeSformer model, pre-trained on the Kinetics-400 dataset, performs on the Hollywood2 dataset.
  • whether incorporating additional context, such as video captions and MFCC audio features, can enhance action recognition in films.
  • whether incorporating context from preceding video clips can enhance action recognition in films.

Methods

The action recognition model used in this study is TimeSformer. Several configurations were tested:

  1. Action Recognition Only (AR Only): Using only the TimeSformer model, without any additional contextual information.
  2. AR with Video Captions and Audio Features (VC & AF): Incorporating video captions and audio features to provide more context.
  3. Including Previous Clips (Sequential Context): Integrating information from one or two preceding clips.

Additional setups included:

  • Finetuning: Adapting the pre-trained model specifically to the Hollywood2 dataset (a minimal sketch of the required head replacement follows this list).
  • Mocked Context: Using random noise as a substitute for real context to determine whether actual context improves results.
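The thesis text does not specify which TimeSformer implementation or training setup was used. As a minimal sketch, assuming the Hugging Face port of TimeSformer, the Kinetics-400 classification head could be swapped for a 12-class Hollywood2 head before finetuning:

  # Hypothetical sketch: adapt a Kinetics-400 TimeSformer to Hollywood2's 12 action
  # classes before finetuning. Assumes the Hugging Face port; the project's actual
  # implementation and training hyperparameters are not documented here.
  from transformers import TimesformerForVideoClassification

  model = TimesformerForVideoClassification.from_pretrained(
      "facebook/timesformer-base-finetuned-k400",
      num_labels=12,                  # 12 Hollywood2 action classes
      ignore_mismatched_sizes=True,   # discard the original 400-way classifier weights
  )
  # The model can then be finetuned on Hollywood2 clips with a standard
  # cross-entropy objective (or a multi-label variant, since Hollywood2 clips
  # can carry more than one action label).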

Architecture:

The action recognition system is based on the TimeSformer model, enhanced with additional contextual features to improve performance in recognizing actions in films. The key concept is to combine visual information with textual and auditory context.

Base Model: TimeSformer

  • TimeSformer is a transformer-based model that processes video frames to recognize actions.
  • It captures spatial and temporal information within and across frames.
  • The model is pre-trained on the Kinetics-400 dataset.
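As an illustration of the base model, the following is a minimal sketch using the Hugging Face port of TimeSformer; the checkpoint name and preprocessing shown are assumptions, since the exact implementation used in the project is not specified.

  # Minimal sketch: classify an 8-frame clip with a Kinetics-400 pre-trained
  # TimeSformer. The random frames below are placeholders standing in for
  # frames sampled from a Hollywood2 clip.
  import numpy as np
  import torch
  from transformers import AutoImageProcessor, TimesformerForVideoClassification

  processor = AutoImageProcessor.from_pretrained("facebook/timesformer-base-finetuned-k400")
  model = TimesformerForVideoClassification.from_pretrained("facebook/timesformer-base-finetuned-k400")

  video = list(np.random.randint(0, 256, (8, 224, 224, 3)))   # 8 RGB frames
  inputs = processor(images=video, return_tensors="pt")

  with torch.no_grad():
      logits = model(**inputs).logits                         # shape (1, 400): Kinetics-400 classes
  print(model.config.id2label[logits.argmax(-1).item()])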

Incorporating Contextual Features

To provide richer context, the following features were integrated:

 

Video Captioning Features (VC)

  • Descriptive captions for video content are generated using the BLIP-2 model.
  • This adds semantic understanding of the scene.
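A minimal sketch of caption generation follows, assuming the Salesforce/blip2-opt-2.7b checkpoint and captioning of a single representative frame per clip; how many frames were actually captioned, and how the resulting text was embedded for fusion, are not documented here.

  # Hypothetical sketch: caption one representative frame of a clip with BLIP-2.
  import torch
  from PIL import Image
  from transformers import Blip2Processor, Blip2ForConditionalGeneration

  processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
  model = Blip2ForConditionalGeneration.from_pretrained(
      "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
  ).to("cuda")

  frame = Image.open("frame.jpg")                       # a frame exported from the clip
  inputs = processor(images=frame, return_tensors="pt").to("cuda", torch.float16)
  out = model.generate(**inputs, max_new_tokens=30)
  caption = processor.decode(out[0], skip_special_tokens=True)
  print(caption)                                        # e.g. a short scene description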

Audio Features (AF)

  • Audio cues are extracted using Mel-frequency cepstral coefficients (MFCCs).
  • This provides auditory context that complements the visual data.
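A minimal sketch of the audio side using librosa is shown below; the sampling rate, number of coefficients, and mean-pooling are assumptions, as the project's exact settings are not documented here.

  # Sketch: extract MFCCs from a clip's audio track and pool them into a
  # fixed-length vector that can be fused with the visual features.
  import librosa

  waveform, sr = librosa.load("clip_audio.wav", sr=16000, mono=True)
  mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)   # shape (13, T)
  audio_feature = mfcc.mean(axis=1)                           # shape (13,), one vector per clip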

Model Variations

Different configurations were explored to assess the impact of contextual information:

  1. Action Recognition Only (AR Only)

    • Input: Visual features from TimeSformer.
    • Purpose: Serves as a baseline model without additional context.
  2. AR with Real Context (AR + VC & AF)

    • Input: Combined visual features with real video captions and audio features.
    • Purpose: Evaluates the effect of meaningful contextual information.
  3. AR with Mocked Context (AR + Mocked VC & AF)

    • Input: Visual features with random noise replacing captions and audio.
    • Purpose: Determines if improvements are due to actual context or merely additional data.
  4. Including Sequential Context

    • Input: Features from the current clip plus one or more preceding clips.
    • Purpose: Investigates whether temporal context from previous clips enhances action recognition.
    • Explanation: Features from one or more preceding clips are combined with the current clip's features using fixed, decreasing weights (see the sketch after this list).
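The exact fusion mechanism is not reproduced here; the sketch below only illustrates the idea behind the variations above: concatenating visual features with the contextual embeddings, swapping in random noise for the mocked-context runs, and blending in features from preceding clips with decreasing fixed weights. Feature dimensions, the fusion operator, and the weight values are illustrative assumptions, not the project's exact implementation.

  # Illustrative sketch of the input variations described above.
  import torch

  def fuse_context(visual, caption_emb, audio_emb, mocked=False):
      """AR + VC & AF: concatenate visual features with contextual embeddings.
      With mocked=True, real context is replaced by random noise of equal shape."""
      if mocked:
          caption_emb = torch.randn_like(caption_emb)
          audio_emb = torch.randn_like(audio_emb)
      return torch.cat([visual, caption_emb, audio_emb], dim=-1)

  def add_sequential_context(current, previous, weights=(0.5, 0.25)):
      """Blend the current clip's fused features with those of one or more
      preceding clips using fixed, decreasing weights (values are illustrative)."""
      blended = current.clone()
      for feat, w in zip(previous, weights):
          blended = blended + w * feat
      return blended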


Dataset:

Hollywood2 Human Actions and Scenes dataset

Path: /nas/lrz/tuei/ldv/studierende/data/HOLLYWOOD2_Human_Actions_and_Scenes_Dataset/

Publicly available for download here: https://www.di.ens.fr/~laptev/actions/hollywood2/



The Hollywood2 dataset was utilized. Some key facts:

  • 1,707 video clips from 69 films
  • 12 action classes (e.g., "answer phone," "drive car")
  • Clips pose challenges due to the diversity and complexity of the scenes 
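For reference, the following sketch shows how individual Hollywood2 clips can be read and uniformly sub-sampled to the eight frames used per clip. It is OpenCV-based; the example file name and the sampling strategy are assumptions, not necessarily the project's pipeline.

  # Hypothetical sketch: uniformly sample 8 RGB frames from a Hollywood2 .avi clip.
  import cv2
  import numpy as np

  def sample_frames(path, num_frames=8):
      cap = cv2.VideoCapture(path)
      total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
      indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
      frames = []
      for idx in indices:
          cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
          ok, frame = cap.read()
          if ok:
              frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
      cap.release()
      return frames

  frames = sample_frames("AVIClips/actioncliptrain00001.avi")  # example file name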



Below are a few sample clips from different classes, along with some example frames.

Results:

Performance Comparison

Configuration                        | Top-1 Accuracy | Top-5 Accuracy
AR Only (Not Finetuned)              | 51.47%         | 91.63%
AR Only (Finetuned on Hollywood2)    | 50.34%         | 92.08%
AR + Real VC & AF (Not Finetuned)    | 54.64%         | 92.53%
AR + Real VC & AF (Finetuned)        | 50.68%         | 91.40%
AR + Mocked VC & AF (Not Finetuned)  | 51.36%         | 92.19%
AR + Mocked VC & AF (Finetuned)      | 48.30%         | 90.72%
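Top-1 and Top-5 accuracy count a prediction as correct if the ground-truth class is the highest-scoring class, or among the five highest-scoring classes, respectively. A minimal sketch of the metric (illustrative, not the project's evaluation script):

  # Sketch: top-k accuracy from logits, as reported in the table above.
  import torch

  def topk_accuracy(logits, labels, k=1):
      """logits: (N, num_classes); labels: (N,). Returns accuracy in percent."""
      topk = logits.topk(k, dim=-1).indices                 # (N, k) highest-scoring classes
      hits = (topk == labels.unsqueeze(-1)).any(dim=-1)
      return 100.0 * hits.float().mean().item()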

Key Findings

  • Adding Real Context Improves Performance: Integrating real video captions and audio features increased top-1 accuracy from 51.47% to 54.64%.
  • Mocked Context Shows No Improvement: Utilizing random noise as context did not yield improvements, underscoring the importance of real context.
  • Sequential Context Had Limited Impact: Including information from previous clips did not consistently enhance performance.
  • Finetuning Outcomes: Finetuning on Hollywood2 occasionally resulted in overfitting, where the model performed well on training data but struggled to generalize to new data.

Conclusion

Incorporating real context, such as video captions and audio, can enhance action recognition in films. However, simply adding more data from previous clips does not necessarily lead to better results. The relevance and quality of the context play a more significant role than its quantity. Other types of contextual data points can and should be explored.

Limitations

  • Frame Cropping: The cropping of video frames into square shapes might have excluded important visual information.
  • Short Clips: The use of only 8 frames per clip may have hindered the model’s ability to capture longer actions.
  • Dataset Variability: The differences between training and testing videos made it difficult for the model to generalize effectively.

Future Directions

  • Analyzing Longer Clips: Increasing the number of frames analyzed per clip may capture more comprehensive action details.
  • Exploring Alternative Contexts: Including additional sources of context, such as scripts or scene descriptions, could yield better results.
  • Testing More Advanced Models: Utilizing more sophisticated models could lead to improved performance in action recognition tasks.

