Documentation
Introduction
Action recognition focuses on automatically identifying human actions in videos. Traditional methods often rely on analyzing short clips, which may overlook important contextual information, particularly in complex videos like films.
This project investigates...
- how well TimeSformer, pretrained on the Kinetics-400 dataset, performs on the Hollywood2 dataset.
- whether incorporating additional context such as video captioning and MFCC audio features can enhance action recognition in films.
- whether incorporating context from preceding video clips can enhance action recognition in films.
Methods
The action recognition model used in this study is TimeSformer. Several configurations were tested:
- Action Recognition Only (AR Only): Using only the TimeSformer model, without any additional contextual information.
- AR with Video Captions and Audio Features (VC & AF): Incorporating video captions and audio features to provide more context.
- Including Previous Clips (Sequential Context): Integrating information from one or two preceding clips.
Additional setups included:
- Finetuning: Adapting the pretrained model to the Hollywood2 dataset.
- Mocked Context: Using random noise as a substitute for real context to determine whether actual context improves results.
Architecture:
The action recognition system is based on the TimeSformer model, enhanced with additional contextual features to improve performance in recognizing actions in films. The key concept is to combine visual information with textual and auditory context.
Base Model: TimeSformer
- TimeSformer is a transformer-based model that processes video frames to recognize actions.
- It captures spatial and temporal information within and across frames.
- The model is pre-trained on the Kinetics-400 dataset.
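As a reference, below is a minimal sketch of loading a Kinetics-400 pretrained TimeSformer and classifying a clip of 8 frames. It assumes the Hugging Face transformers port and the `facebook/timesformer-base-finetuned-k400` checkpoint; the project's actual checkpoint and preprocessing may differ.

```python
# Minimal sketch: Kinetics-400 pretrained TimeSformer via Hugging Face transformers
# (an assumption; the project may use a different implementation or checkpoint).
import numpy as np
import torch
from transformers import AutoImageProcessor, TimesformerForVideoClassification

processor = AutoImageProcessor.from_pretrained("facebook/timesformer-base-finetuned-k400")
model = TimesformerForVideoClassification.from_pretrained("facebook/timesformer-base-finetuned-k400")

# 8 RGB frames sampled from a clip (H x W x 3, uint8); dummy data stands in for real frames here.
frames = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(8)]

inputs = processor(frames, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, 400), one score per Kinetics-400 class
print(model.config.id2label[int(logits.argmax(-1))])
```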
Incorporating Contextual Features
To provide richer context, the following features were integrated:
Video Captioning Features (VC)
- Descriptive captions for video content are generated using the BLIP-2 model.
- This adds semantic understanding of the scene.
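A hedged sketch of how a caption could be generated for a representative frame of each clip with BLIP-2; the checkpoint name (`Salesforce/blip2-opt-2.7b`) and the single-keyframe choice are assumptions, not necessarily the project's exact setup.

```python
# Sketch: caption a representative frame of a clip with BLIP-2 (illustrative setup).
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

frame = Image.open("keyframe.jpg")            # representative frame extracted from the clip (assumed file)
inputs = processor(images=frame, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(caption_ids, skip_special_tokens=True)[0].strip()
print(caption)                                # e.g. a short scene description
```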
Audio Features (AF)
- Audio cues are extracted using Mel-frequency cepstral coefficients (MFCCs).
- This provides auditory context that complements the visual data.
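A minimal sketch of MFCC extraction with librosa; the sampling rate, number of coefficients, and mean/std pooling are illustrative assumptions.

```python
# Sketch: MFCC audio features for one clip (illustrative parameters).
import librosa
import numpy as np

audio, sr = librosa.load("clip_audio.wav", sr=16000)    # mono audio track of the clip (assumed file)
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)  # shape: (13, n_frames)

# Pool over time to get a fixed-size vector that can be combined with the other features.
audio_feature = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])   # shape: (26,)
```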
Model Variations
Different configurations were explored to assess the impact of contextual information:
Action Recognition Only (AR Only)
- Input: Visual features from TimeSformer.
- Purpose: Serves as a baseline model without additional context.
AR with Real Context (AR + VC & AF)
- Input: Combined visual features with real video captions and audio features.
- Purpose: Evaluates the effect of meaningful contextual information.
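The sketch below illustrates one way the three streams could be combined, via late fusion by concatenation followed by a small classification head. The feature dimensions and the fusion design are assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class ContextFusionHead(nn.Module):
    """Fuse TimeSformer visual features with caption and MFCC features (illustrative design)."""
    def __init__(self, vis_dim=768, cap_dim=768, audio_dim=26, num_classes=12):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(vis_dim + cap_dim + audio_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, vis_feat, cap_feat, audio_feat):
        fused = torch.cat([vis_feat, cap_feat, audio_feat], dim=-1)
        return self.fc(fused)

# Example: a batch of 4 clips
head = ContextFusionHead()
logits = head(torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 26))
print(logits.shape)   # torch.Size([4, 12]) -- one score per Hollywood2 action class
```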
AR with Mocked Context (AR + Mocked VC & AF)
- Input: Visual features with random noise replacing captions and audio.
- Purpose: Determines if improvements are due to actual context or merely additional data.
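Under the assumption that mocking simply swaps the real context features for random tensors of the same shape, this variant can be sketched as follows.

```python
import torch

def mock_context(cap_feat: torch.Tensor, audio_feat: torch.Tensor):
    """Replace real context features with random noise of identical shape,
    so any accuracy gain can be attributed to the content, not the extra inputs."""
    return torch.randn_like(cap_feat), torch.randn_like(audio_feat)
```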
Including Sequential Context
- Input: Features from the current clip plus one or more preceding clips.
- Purpose: Investigates whether temporal context from previous clips enhances action recognition.
- Explanation: The model combines the current clip's features with those of one or more preceding clips, weighted by fixed, decreasing weights.
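A sketch of this weighting scheme; the weight values here are illustrative, not the ones used in the experiments.

```python
import torch

def add_sequential_context(current, previous, weights=(0.5, 0.25)):
    """Combine the current clip's feature with features of preceding clips,
    using fixed, decreasing weights (weight values are illustrative)."""
    combined = current.clone()
    for feat, w in zip(previous, weights):
        combined = combined + w * feat
    return combined

# Example: current clip plus two preceding clips (768-dim features each)
ctx_feat = add_sequential_context(torch.randn(768), [torch.randn(768), torch.randn(768)])
```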
Dataset:
Hollywood2 Human Actions and Scenes dataset
Path: /nas/lrz/tuei/ldv/studierende/data/HOLLYWOOD2_Human_Actions_and_Scenes_Dataset/
Publicly available for download here: https://www.di.ens.fr/~laptev/actions/hollywood2/
The Hollywood2 dataset was used. Key facts:
- 1,707 video clips from 69 films
- 12 action classes (e.g., "answer phone," "drive car")
- Clips pose challenges due to the diversity and complexity of the scenes
Below are a few sample clips from different classes, along with some representative frames.
Results:
Performance Comparison
| Configuration | Top-1 Accuracy | Top-5 Accuracy |
|---|---|---|
| AR Only (Not Finetuned) | 51.47% | 91.63% |
| AR Only (Finetuned on Hollywood2) | 50.34% | 92.08% |
| AR + Real VC & AF (Not Finetuned) | 54.64% | 92.53% |
| AR + Real VC & AF (Finetuned) | 50.68% | 91.40% |
| AR + Mocked VC & AF (Not Finetuned) | 51.36% | 92.19% |
| AR + Mocked VC & AF (Finetuned) | 48.30% | 90.72% |
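Top-1 and Top-5 accuracy are standard Top-k metrics; for reference, they can be computed from model logits as in the sketch below (variable names and the random inputs are illustrative).

```python
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 1) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = logits.topk(k, dim=-1).indices                  # (N, k)
    correct = (topk == labels.unsqueeze(-1)).any(dim=-1)   # (N,)
    return correct.float().mean().item()

# Example with random predictions over the 12 Hollywood2 classes
logits = torch.randn(1707, 12)
labels = torch.randint(0, 12, (1707,))
print(topk_accuracy(logits, labels, k=1), topk_accuracy(logits, labels, k=5))
```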
Key Findings
- Adding Real Context Improves Performance: Integrating real video captions and audio features increased accuracy from 51.47% to 54.64%.
- Mocked Context Shows No Improvement: Utilizing random noise as context did not yield improvements, underscoring the importance of real context.
- Sequential Context Had Limited Impact: Including information from previous clips did not consistently enhance performance.
- Finetuning Outcomes: Finetuning on Hollywood2 occasionally resulted in overfitting, where the model performed well on training data but struggled to generalize to new data.
Conclusion
Incorporating real context, such as video captions and audio, can enhance action recognition in films. However, simply adding more data from previous clips does not necessarily lead to better results: the relevance and quality of the context matter more than its quantity. Other kinds of contextual data can and should be explored.
Limitations
- Frame Cropping: The cropping of video frames into square shapes might have excluded important visual information.
- Short Clips: The use of only 8 frames per clip may have hindered the model's ability to capture longer actions (see the sampling sketch after this list).
- Dataset Variability: The differences between training and testing videos made it difficult for the model to generalize effectively.
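For reference, a sketch of the kind of preprocessing these limitations refer to: uniform sampling of 8 frames with a center square crop. The OpenCV-based pipeline here is an assumption, not the project's exact code.

```python
# Sketch: uniformly sample frames and apply a center square crop (illustrative pipeline).
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 8, size: int = 224):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        h, w = frame.shape[:2]
        side = min(h, w)                                   # center square crop
        top, left = (h - side) // 2, (w - side) // 2
        crop = frame[top:top + side, left:left + side]
        frames.append(cv2.resize(crop, (size, size)))
    cap.release()
    return frames
```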
Future Directions
- Analyzing Longer Clips: Increasing the number of frames analyzed per clip may capture more comprehensive action details.
- Exploring Alternative Contexts: Including additional sources of context, such as scripts or scene descriptions, could yield better results.
- Testing More Advanced Models: Utilizing more sophisticated models could lead to improved performance in action recognition tasks.