Literature Research
| Paper | Ultimate goal(s) | Data Processing | Network/Algorithm | Results |
| --- | --- | --- | --- | --- |
| COMPRESSED-DOMAIN VIDEO CLASSIFICATION WITH DEEP NEURAL NETWORKS: “THERE’S WAY TOO MUCH INFORMATION TO DECODE THE MATRIX” | Investigate video classification via a 3D CNN that directly ingests compressed-bitstream information | Only P-type macroblock (MB) motion vectors are extracted and retained: training on both P- and B-type motion vectors incurs a substantial increase in complexity for only a marginal improvement in classification accuracy, and B-type MB motion vectors were found to be very sparse or largely redundant once P-type MB motion vectors are available. MB motion information is extracted from the compressed video bitstream using FFmpeg’s libavcodec library, which supports most MPEG/ITU-T standards used in practice. No video is decoded to its pixel-domain representation. | A 3D CNN whose input has fixed dimensions N, K, and T, where N = 24 is a fixed spatial size (independent of the video resolution), K = 2 is the number of motion-vector components, and T = 160 is a fixed temporal extent (roughly the average number of P-frames per video in the UCF-101 dataset). | Accuracy: 77.5% (UCF-101); Complexity (# of parameters): 29.4M |
| Compressed Video Action Recognition [2017] CoViAR | Train a deep network directly on the compressed video data | To break the dependency between consecutive P-frames, all motion vectors are traced back to the reference I-frame and the residual is accumulated along the way. In this way, each P-frame depends only on the I-frame and on no other P-frames. While RGB I-frame features are used as they are, the P-frame features (motion vectors and residuals) need to incorporate the information from the RGB I-frames. I-frames and residuals are decoded to the pixel domain here. | ResNet-152 to model I-frames, and ResNet-18 to model the motion vectors and residuals; this offers a good trade-off between speed and accuracy. All networks can be trained independently. | Accuracy: 90.4% (UCF-101); Complexity (# of parameters): 83.6M |
| Faster and Accurate Compressed Video Action Recognition Straight from the Frequency Domain [2020] Faster-CoViAR | A deep neural network for human action recognition capable of learning straight from compressed video. | Initially, a compressed video is parsed and then a set of encoded frames for each stream is obtained by: | The network is a two-stream CNN integrating both frequency (i.e., DCT coefficients) and temporal (i.e., motion vectors) information, both of which can be extracted by parsing and entropy-decoding the stream of encoded video data. This avoids the high computational load of fully decoding the video stream and thus greatly speeds up processing. | |
| THE GOOD, THE BAD, AND THE UGLY: NEURAL NETWORKS STRAIGHT FROM JPEG [2020] | A Frequency Band Selection (FBS) technique to select the most relevant DCT coefficients before feeding them to the network. | Since higher-frequency information has little visual effect on the image, the n lowest-frequency coefficients are retained. | ResNet-50 | Future work: evaluate smarter strategies for selecting DCT coefficients, such as self-attention models. |
| Faster Neural Networks Straight from JPEG [2018] | Modify the libjpeg library to decode JPEG images only partially, yielding an image representation consisting of a triple of tensors of discrete cosine transform (DCT) coefficients in the YCbCr color space, and train a network to operate directly on this representation, which turns out to work reasonably well. | Due to how the JPEG codec works, the three tensors are at different spatial resolutions; different transformation methods were tried to bring the DCT coefficients to compatible sizes. | Modified ResNet-50 network | |
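The fixed-size motion-vector input of the first row (N = 24, K = 2, T = 160) can be sketched in numpy. The nearest-neighbor spatial resampling and zero temporal padding below are illustrative assumptions, not the paper's exact preprocessing, and `build_mv_tensor` is a hypothetical name:

```python
import numpy as np

def build_mv_tensor(frame_mvs, N=24, K=2, T=160):
    """Stack per-P-frame motion-vector fields into a fixed N x N x K x T tensor.

    frame_mvs: list of (H, W, 2) arrays, one per P-frame, in decode order.
    Spatial resampling to N x N is nearest-neighbor (an illustrative choice);
    the temporal axis is zero-padded or truncated to T.
    """
    out = np.zeros((N, N, K, T), dtype=np.float32)
    for t, mv in enumerate(frame_mvs[:T]):
        H, W, _ = mv.shape
        rows = (np.arange(N) * H) // N   # nearest-neighbor row indices
        cols = (np.arange(N) * W) // N   # nearest-neighbor column indices
        out[:, :, :, t] = mv[np.ix_(rows, cols)][:, :, :K]
    return out
```

Whatever the source resolution, every video maps to the same tensor shape, which is what lets a single 3D CNN ingest clips of arbitrary size.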
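CoViAR's back-tracing of motion vectors to the reference I-frame, with residuals accumulated along the way, can be sketched as a simple recursion. Integer vectors and border clipping are simplifying assumptions, and the function name is hypothetical:

```python
import numpy as np

def accumulate_to_iframe(mvs, residuals):
    """Trace motion vectors back to the reference I-frame and accumulate
    residuals along the way, so each P-frame depends only on the I-frame.

    mvs:       list of (H, W, 2) integer arrays (dy, dx), one per P-frame.
    residuals: list of (H, W) arrays, one per P-frame.
    Returns a per-frame list of accumulated (displacement, residual) pairs.
    """
    H, W, _ = mvs[0].shape
    ys, xs = np.mgrid[0:H, 0:W]
    acc_mv = np.zeros((H, W, 2), dtype=np.int64)
    acc_res = np.zeros((H, W), dtype=np.float64)
    out = []
    for mv, res in zip(mvs, residuals):
        # Location each pixel was predicted from in the previous frame.
        src_y = np.clip(ys - mv[:, :, 0], 0, H - 1)
        src_x = np.clip(xs - mv[:, :, 1], 0, W - 1)
        acc_mv = acc_mv[src_y, src_x] + mv      # chain displacements back
        acc_res = acc_res[src_y, src_x] + res   # accumulate residuals
        out.append((acc_mv.copy(), acc_res.copy()))
    return out
```

Because each frame's accumulated pair refers only to the I-frame, the P-frame models can be trained on frames sampled independently rather than on whole GOP chains.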
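The frequency stream in Faster-CoViAR consumes DCT coefficients that in practice come from parsing and entropy-decoding the bitstream; purely for illustration, this sketch computes the same 8×8 orthonormal block DCT-II from pixels:

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix, as used for 8x8 JPEG/MPEG blocks."""
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0] /= np.sqrt(2)          # scale the DC row for orthonormality
    return C * np.sqrt(2 / n)

def blockwise_dct(img):
    """2-D DCT of each non-overlapping 8x8 block: coeffs = C @ block @ C.T."""
    C = dct_matrix(8)
    H, W = img.shape
    blocks = img.reshape(H // 8, 8, W // 8, 8).transpose(0, 2, 1, 3)
    return np.einsum('ij,abjk,lk->abil', C, blocks, C)
```

Skipping the inverse of this transform (plus motion compensation) during decoding is exactly where the reported speed-up over full pixel-domain pipelines comes from.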
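The Frequency Band Selection idea of keeping only the n lowest-frequency DCT coefficients can be sketched with the standard JPEG zigzag scan (function names are illustrative):

```python
import numpy as np

def zigzag_order(n=8):
    """Indices of an n x n block in JPEG zigzag (low-to-high frequency) order."""
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda p: (p[0] + p[1],                       # diagonal
                                 p[0] if (p[0] + p[1]) % 2 else p[1]))

def select_low_frequencies(block, n_keep):
    """Frequency Band Selection: keep the n_keep lowest-frequency DCT
    coefficients of an 8x8 block (zigzag order) and zero out the rest."""
    out = np.zeros_like(block)
    for i, j in zigzag_order(8)[:n_keep]:
        out[i, j] = block[i, j]
    return out
```

Zeroing (or dropping) high-frequency coefficients shrinks the network input with little visual information lost, which is the trade-off the paper exploits.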
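For the partial-JPEG pipeline, the luma DCT tensor sits on an H/8 × W/8 block grid while 4:2:0 chroma sits on H/16 × W/16. A minimal stand-in for the size-matching transforms is plain block duplication; the paper compares more elaborate (including learned) alternatives, so this naive version is only an assumption for illustration:

```python
import numpy as np

def upsample_chroma(cbcr_dct, factor=2):
    """Bring a chroma DCT tensor (H/16, W/16, 64) onto the luma block grid
    (H/8, W/8, 64) by duplicating each block spatially."""
    return np.repeat(np.repeat(cbcr_dct, factor, axis=0), factor, axis=1)

def merge_ycbcr(y_dct, cb_dct, cr_dct):
    """Concatenate luma and upsampled chroma DCT tensors channel-wise,
    giving one (H/8, W/8, 192) input tensor for the network."""
    return np.concatenate(
        [y_dct, upsample_chroma(cb_dct), upsample_chroma(cr_dct)], axis=-1)
```

Once the three tensors share a spatial grid, a standard CNN stem (here, the modified ResNet-50) can consume them as one multi-channel input.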