Description

Human Action Recognition (HAR) is gaining more and more attention in the field of Computer Vision. As it seeks to comprehend human behaviour, analyze it, and label it as an action, HAR is used in various domains such as video surveillance systems, smart homes, and hospital environments. Commonly, existing deep learning approaches consist of Convolutional Neural Networks that are capable of learning robust representations of image data by processing RGB pixels. However, some recent works propose using compressed video data in their networks as an alternative input that avoids the high redundancy between decoded frames and thereby reduces network complexity. In this work, a deep neural network is implemented that is capable of learning from the Discrete Cosine Transform (DCT) coefficients of I-frames extracted from compressed H.264 data.

Literature Review

Each paper below is summarized along the same aspects: Paper, Ultimate goal(s), Data Processing, Network/Algorithm, and Results.

Paper: COMPRESSED-DOMAIN VIDEO CLASSIFICATION WITH DEEP NEURAL NETWORKS: “THERE’S WAY TOO MUCH INFORMATION TO DECODE THE MATRIX”

Ultimate goal(s): Investigate video classification via a 3D CNN that directly ingests compressed-bitstream information.

Data Processing: Only P-type macroblock (MB) motion vectors are extracted and retained, because training on both P- and B-type motion vectors incurs a substantial increase in complexity with only marginal improvement in classification accuracy. Moreover, B-type MB motion vectors were found to be very sparse or to contain information that is mostly redundant when P-type MB motion vectors are available.

MB motion information is extracted from the compressed video bitstream using FFmpeg's libavcodec library, which supports most MPEG/ITU-T standards used in practice.

No decoding of any video to its pixel-domain representation is performed here.
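
As an illustration of this extraction step, below is a minimal sketch using PyAV (Python bindings for FFmpeg/libavcodec), assuming a PyAV version that exposes motion-vector side data; the file path and function name are placeholders. Note that this convenience route still decodes frames internally to attach the side data, whereas the paper taps the bitstream without reconstructing pixels.

    import av  # PyAV: Python bindings for FFmpeg's libavcodec

    def p_frame_motion_vectors(path):
        """Yield (frame_index, x, y, dx, dy) for P-type motion vectors."""
        container = av.open(path)
        stream = container.streams.video[0]
        # Ask the decoder to export motion vectors as per-frame side data.
        stream.codec_context.options = {"flags2": "+export_mvs"}
        for i, frame in enumerate(container.decode(stream)):
            mvs = frame.side_data.get("MOTION_VECTORS")
            if mvs is None:
                continue  # e.g., I-frames carry no motion vectors
            for mv in mvs:
                # source == -1 means prediction from a past frame (P-type);
                # this mirrors the paper's choice to keep only P-type MB MVs.
                if mv.source == -1:
                    yield (i, mv.dst_x, mv.dst_y,
                           mv.motion_x / mv.motion_scale,
                           mv.motion_y / mv.motion_scale)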

Network/Algorithm: A 3D CNN which takes as input a fixed-size motion-vector volume of spatial size N × N, K channels, and temporal extent T, with N = 24 (independent of the video resolution), K = 2 (the two motion-vector components), and T = 160 (roughly the average number of P-frames per video in the UCF-101 dataset).
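
To make the input layout concrete, here is a minimal PyTorch sketch; the layer sizes are illustrative and not the paper's architecture, and the channel-first layout (batch, K, T, N, N) is an assumption.

    import torch
    import torch.nn as nn

    N, K, T = 24, 2, 160  # spatial size, MV components, temporal extent

    # Toy 3D CNN; layer sizes are illustrative, not the paper's architecture.
    model = nn.Sequential(
        nn.Conv3d(K, 32, kernel_size=3, padding=1),  # input: (batch, K, T, N, N)
        nn.ReLU(),
        nn.MaxPool3d(2),
        nn.Conv3d(32, 64, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool3d(1),
        nn.Flatten(),
        nn.Linear(64, 101),                          # UCF-101 has 101 classes
    )

    x = torch.randn(8, K, T, N, N)  # a batch of 8 motion-vector volumes
    logits = model(x)               # shape: (8, 101)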



Results:
  • Accuracy: 77.5% (UCF-101)
  • Complexity (# of parameters): 29.4M

Paper: Compressed Video Action Recognition [2017] (CoViAR)

Ultimate goal(s): Train a deep network directly on the compressed video data.

Data Processing: To break the dependency between consecutive P-frames, all motion vectors are traced back to the reference I-frame and the residuals are accumulated along the way. In this way, each P-frame depends only on the I-frame and on no other P-frames.
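
A toy numpy sketch of this back-tracing, assuming dense per-pixel backward motion fields and simplifying frame-boundary handling by clipping (names and shapes are illustrative):

    import numpy as np

    def accumulate(mvs, residuals):
        """Trace motion vectors back to the I-frame, accumulating residuals.

        mvs:       list of (H, W, 2) int arrays, per-pixel backward MVs per P-frame
        residuals: list of (H, W, C) arrays, per-pixel residuals per P-frame
        Returns accumulated MVs/residuals so each P-frame refers only to the I-frame.
        """
        H, W = mvs[0].shape[:2]
        ys, xs = np.mgrid[0:H, 0:W]
        acc_mv = np.zeros_like(mvs[0])
        acc_res = np.zeros_like(residuals[0])
        out_mv, out_res = [], []
        for mv, res in zip(mvs, residuals):
            # Reference location in the previous frame for every pixel.
            ry = np.clip(ys - mv[..., 1], 0, H - 1)
            rx = np.clip(xs - mv[..., 0], 0, W - 1)
            acc_mv = mv + acc_mv[ry, rx]    # chain MVs back to the I-frame
            acc_res = res + acc_res[ry, rx] # accumulate residuals along the way
            out_mv.append(acc_mv)
            out_res.append(acc_res)
        return out_mv, out_res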


While the RGB I-frame features are used as they are, the P-frame features (motion vectors and residuals) need to incorporate the information from the RGB I-frames.


I-frames and residuals are decoded to the pixel domain here.


Network/Algorithm: ResNet-152 to model the I-frames and ResNet-18 to model the motion vectors and residuals; this offers a good trade-off between speed and accuracy.

All networks can be trained independently.
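
A minimal torchvision sketch of this backbone split; the per-stream channel counts and the first-layer replacement are illustrative assumptions, and the fusion of the three scores is omitted:

    import torch.nn as nn
    from torchvision import models

    def make_backbone(arch, in_channels, num_classes=101):
        net = arch(weights=None)
        # Swap the first conv so the network accepts the stream's channel count
        # (e.g. 2 channels for motion vectors instead of 3 for RGB).
        net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2,
                              padding=3, bias=False)
        net.fc = nn.Linear(net.fc.in_features, num_classes)
        return net

    iframe_net = make_backbone(models.resnet152, in_channels=3)  # RGB I-frames
    mv_net = make_backbone(models.resnet18, in_channels=2)       # motion vectors
    res_net = make_backbone(models.resnet18, in_channels=3)      # residuals
    # Each network can be trained independently on its own input stream.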

Results:
  • Accuracy: 90.4% (UCF-101)
  • Complexity (# of parameters): 83.6M

Paper: Faster and Accurate Compressed Video Action Recognition Straight from the Frequency Domain [2020] (Faster-CoViAR)


Ultimate goal(s): A deep neural network for human action recognition capable of learning straight from compressed video.


Data Processing: A compressed video is first parsed, and a set of encoded frames for each stream is obtained by uniform sampling. Then:

  • The sampled frames are entropy decoded and passed through the network, one frame at a time, generating frame scores.
  • The frame scores are averaged to give a score to each video (per stream).
  • Finally, the prediction is obtained by a simple late fusion: a weighted average of the video scores of both streams (see the sketch below).
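
A minimal numpy sketch of this scoring and late-fusion step; the fusion weight w and all shapes are illustrative assumptions:

    import numpy as np

    def video_score(frame_scores):
        """Average per-frame class scores into one video-level score per stream."""
        return np.mean(frame_scores, axis=0)

    def late_fusion(dct_frame_scores, mv_frame_scores, w=0.5):
        """Weighted average of the two streams' video scores; w is a tunable weight."""
        fused = (w * video_score(dct_frame_scores)
                 + (1 - w) * video_score(mv_frame_scores))
        return int(np.argmax(fused))  # predicted class index

    # Usage with dummy scores: 10 sampled frames, 101 classes, per stream.
    pred = late_fusion(np.random.rand(10, 101), np.random.rand(10, 101), w=0.6)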


Network/Algorithm: A two-stream CNN integrating both frequency (i.e., DCT coefficients) and temporal (i.e., motion vectors) information, which can be extracted by parsing and entropy decoding the stream of encoded video data. This avoids the high computational load of fully decoding the video stream and thus greatly speeds up processing.

Paper: THE GOOD, THE BAD, AND THE UGLY: NEURAL NETWORKS STRAIGHT FROM JPEG [2020]

Ultimate goal(s): A Frequency Band Selection (FBS) technique to select the most relevant DCT coefficients before feeding them to the network.

Data Processing: Since higher-frequency information has little visual effect on the image, only the n lowest-frequency coefficients are retained.
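
A minimal numpy sketch of such a selection on 8×8 DCT blocks, assuming the JPEG zig-zag ordering from low to high frequency (function names are illustrative):

    import numpy as np

    def zigzag_indices(block=8):
        """(row, col) pairs of a block in zig-zag (low -> high frequency) order."""
        return sorted(((r, c) for r in range(block) for c in range(block)),
                      key=lambda rc: (rc[0] + rc[1],
                                      rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

    def select_low_freq(dct_blocks, n):
        """Keep only the n lowest-frequency DCT coefficients of each 8x8 block.

        dct_blocks: array of shape (..., 8, 8); returns shape (..., n).
        """
        idx = zigzag_indices()[:n]
        rows = [r for r, _ in idx]
        cols = [c for _, c in idx]
        return dct_blocks[..., rows, cols]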

Network/Algorithm: ResNet-50

Future work: Evaluate smarter strategies for selecting DCT coefficients, such as self-attention models.

Paper: Faster Neural Networks Straight from JPEG [2018]


Data Processing: The libjpeg library is modified to decode JPEG images only partially, resulting in an image representation consisting of a triple of tensors containing the discrete cosine transform (DCT) coefficients in the YCbCr color space. Due to how the JPEG codec works, these tensors have different spatial resolutions. A network is then designed and trained to operate directly on this representation, which turns out to work reasonably well.

Different transformation methods were tried in order to bring the DCT coefficient tensors to a compatible size.
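
To illustrate one such transformation, the sketch below upsamples the chroma DCT maps to the luma resolution and concatenates all three tensors; the shapes assume a 224×224 image with 4:2:0 chroma subsampling and one flattened 8×8 DCT block (64 channels) per spatial position, all of which are illustrative assumptions:

    import torch
    import torch.nn.functional as F

    # Illustrative shapes for a 224x224 JPEG with 4:2:0 chroma subsampling:
    # one 8x8 DCT block per spatial position, flattened into 64 channels.
    y_dct  = torch.randn(1, 64, 28, 28)  # luma:   (224/8)  x (224/8)
    cb_dct = torch.randn(1, 64, 14, 14)  # chroma: (224/16) x (224/16)
    cr_dct = torch.randn(1, 64, 14, 14)

    # Upsample the chroma DCT maps by 2x so all tensors share the luma
    # resolution, then concatenate along the channel axis.
    cb_up = F.interpolate(cb_dct, scale_factor=2.0, mode="nearest")
    cr_up = F.interpolate(cr_dct, scale_factor=2.0, mode="nearest")
    x = torch.cat([y_dct, cb_up, cr_up], dim=1)  # shape: (1, 192, 28, 28)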

Network/Algorithm: A modified ResNet-50 network that accommodates the differently sized and strided input.

Results:
  • More efficient compared to Faster-CoViAR
  • Decrease in performance
  • Speeds up inference
