Abstract
Deep neural networks have shown strong performance in video action recognition. Recently proposed architectures can learn spatiotemporal features by fusing convolutional networks spatially and temporally. Motivated by this, in this paper we transfer these end-to-end, video-level representation learning approaches to video emotion analysis. Four of them, the 3D convolutional neural network (C3D), the (2+1)D convolutional neural network (R(2+1)D), the Temporal Segment Network (TSN), and the Efficient Convolutional Network (ECO), are adapted to analyse video-level emotions on the LIRIS-ACCEDE dataset. Specifically, all networks are trained to predict video-level emotions by aggregating frame-level features that combine spatial and temporal cues; the key difference among the networks lies in their temporal pooling methods. Experimental results show that none of the four networks performs as well as expected, although the ECO model fits the emotions in the training set quite accurately.
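The shared pipeline described above, extracting frame-level features and aggregating them into a single video-level representation via temporal pooling, can be sketched as follows. This is a minimal PyTorch illustration under stated assumptions, not the code evaluated in this work: the tiny backbone, feature dimension, mean-pooling consensus (in the spirit of TSN), and two-class emotion head are all placeholders.

```python
import torch
import torch.nn as nn


class FramePoolingNet(nn.Module):
    """Toy video-level classifier: a 2D CNN encodes each frame, the
    per-frame features are averaged over time (mean-pooling consensus),
    and a linear head predicts the emotion class. Hypothetical sizes."""

    def __init__(self, num_classes: int = 2, feat_dim: int = 64):
        super().__init__()
        # Per-frame spatial encoder (placeholder for a real backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, frames, 3, H, W) -> fold frames into the batch.
        b, t = clips.shape[:2]
        frames = clips.flatten(0, 1)                 # (b*t, 3, H, W)
        feats = self.backbone(frames).view(b, t, -1)  # (b, t, feat_dim)
        video_feat = feats.mean(dim=1)               # temporal average pooling
        return self.head(video_feat)                 # (b, num_classes)


# Example: 4 videos, 8 sampled frames each, binary emotion prediction.
model = FramePoolingNet(num_classes=2)
logits = model(torch.randn(4, 8, 3, 64, 64))
print(logits.shape)  # torch.Size([4, 2])
```

Swapping the `mean` in `forward` for another aggregation operator (e.g. max pooling, or the 3D-convolutional fusion used by ECO) is exactly the axis along which the four evaluated networks differ.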