Journal Article

Emotion and Gesture Guided Action Recognition in Videos Using Supervised Deep Networks
Document Type
Periodical
Author
Source
IEEE Transactions on Computational Social Systems, 10(5):2546-2556, Oct. 2023
Subject
Computing and Processing
Communication, Networking and Broadcast Technologies
General Topics for Engineers
Videos
Feature extraction
Visualization
Spatiotemporal phenomena
Convolution
Tensors
Emotion recognition
Action recognition
deep neural networks (DNNs)
long temporal context
Visual Attention with Long-term Context (VALC) dataset
visual attention
Language
English
ISSN
2329-924X
2373-7476
Abstract
Emotions and gestures are essential elements in improving social intelligence and predicting real human action. In recent years, recognition of human visual actions using deep neural networks (DNNs) has gained wide popularity in multimedia and computer vision. However, ambiguous action classes, such as “praying” and “pleading,” are still challenging to classify because their visual cues are similar. Correct classification of such ambiguous actions requires attention to the associated features of facial expressions and gestures, together with the long-term context of the video. This article proposes an attention-aware DNN named human action attention network (HAANet) that can capture long-term temporal context to recognize actions in videos. The visual attention network extracts discriminative features of facial expressions and gestures in the spatial and temporal dimensions. We further consolidate a class-specific attention pooling mechanism to capture transitions in semantic traits over time. The efficacy of HAANet is demonstrated on five benchmark datasets. To the best of our knowledge, no publicly available dataset exists in the literature that distinguishes ambiguous human actions by focusing on the visual cues of a human in action. This motivated us to create a new dataset, known as Visual Attention with Long-term Context (VALC), which contains 32 actions with about 101 videos per class and an average length of 30 s. HAANet outperforms existing methods on the UCF101, ActivityNet, and Breakfast-Actions datasets in terms of accuracy.
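To make the class-specific attention pooling idea described in the abstract concrete, the sketch below shows one minimal way such a mechanism can pool per-frame features into per-class video scores. This is not the authors' implementation; the layer shapes, the use of PyTorch, and the names ClassSpecificAttentionPooling, feature_dim, and num_classes are all illustrative assumptions.

```python
# Illustrative sketch (not HAANet's actual code): class-specific attention pooling
# over per-frame features, assuming frame descriptors were already extracted by a
# visual backbone. Each class learns its own attention distribution over time.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClassSpecificAttentionPooling(nn.Module):
    """Pools a sequence of frame features into one score per action class."""

    def __init__(self, feature_dim: int, num_classes: int):
        super().__init__()
        # One temporal-attention scorer and one classifier head per class.
        self.attention = nn.Linear(feature_dim, num_classes)   # (D -> C) attention logits
        self.classifier = nn.Linear(feature_dim, num_classes)  # (D -> C) frame-level class scores

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, T, D) per-frame descriptors.
        attn_logits = self.attention(frame_features)        # (batch, T, C)
        attn = F.softmax(attn_logits, dim=1)                 # normalize over time, per class
        frame_scores = self.classifier(frame_features)       # (batch, T, C)
        # Each class aggregates the frames it attends to most strongly.
        video_scores = (attn * frame_scores).sum(dim=1)      # (batch, C)
        return video_scores


if __name__ == "__main__":
    # Toy usage: 2 videos, 30 frames each, 512-D frame features, 32 action classes.
    pool = ClassSpecificAttentionPooling(feature_dim=512, num_classes=32)
    feats = torch.randn(2, 30, 512)
    print(pool(feats).shape)  # torch.Size([2, 32])
```

Letting every class form its own temporal attention is what allows transitions in semantic traits (e.g., a gesture followed by a facial expression) to weigh differently for different, otherwise ambiguous, action classes.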