Journal Article

Emotion and Gesture Guided Action Recognition in Videos Using Supervised Deep Networks
Document Type
Periodical
Author
Source
IEEE Transactions on Computational Social Systems, 10(5):2546-2556, Oct. 2023
Subject
Computing and Processing
Communication, Networking and Broadcast Technologies
General Topics for Engineers
Videos
Feature extraction
Visualization
Spatiotemporal phenomena
Convolution
Tensors
Emotion recognition
Action recognition
deep neural networks (DNNs)
long temporal context
Visual Attention with Long-term Context (VALC) dataset
visual attention
Language
English
ISSN
2329-924X
2373-7476
Abstract
Emotions and gestures are essential elements in improving social intelligence and predicting real human action. In recent years, recognition of human visual actions using deep neural networks (DNNs) has gained wide popularity in multimedia and computer vision. However, ambiguous action classes, such as “praying” and “pleading,” are still challenging to classify because their visual cues are similar. Correct classification of such ambiguous actions requires attention to the associated features of facial expressions and gestures, together with the long-term context of the video. This article proposes an attention-aware DNN named human action attention network (HAANet) that can capture long-term temporal context to recognize actions in videos. The visual attention network extracts discriminative features of facial expressions and gestures in the spatial and temporal dimensions. We further consolidate a class-specific attention pooling mechanism to capture transitions in semantic traits over time. The efficacy of HAANet is demonstrated on five benchmark datasets. To the best of our knowledge, no publicly available dataset exists in the literature that distinguishes ambiguous human actions by focusing on the visual cues of a human in action. This motivated us to create a new dataset, known as Visual Attention with Long-term Context (VALC), which contains 32 actions with about 101 videos per class and an average length of 30 s. HAANet outperforms existing methods on the UCF101, ActivityNet, and Breakfast-Actions datasets in terms of accuracy.
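To make the class-specific attention pooling idea described in the abstract concrete, the sketch below shows one minimal way such a mechanism can pool per-frame features into per-class video scores. This is not the authors' implementation; the layer shapes, the use of PyTorch, and the names ClassSpecificAttentionPooling, feature_dim, and num_classes are all illustrative assumptions.

```python
# Illustrative sketch (not HAANet's actual code): class-specific attention pooling
# over per-frame features, assuming frame descriptors were already extracted by a
# visual backbone. Each class learns its own attention distribution over time.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClassSpecificAttentionPooling(nn.Module):
    """Pools a sequence of frame features into one score per action class."""

    def __init__(self, feature_dim: int, num_classes: int):
        super().__init__()
        # One temporal-attention scorer and one classifier head per class.
        self.attention = nn.Linear(feature_dim, num_classes)   # (D -> C) attention logits
        self.classifier = nn.Linear(feature_dim, num_classes)  # (D -> C) frame-level class scores

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, T, D) per-frame descriptors.
        attn_logits = self.attention(frame_features)        # (batch, T, C)
        attn = F.softmax(attn_logits, dim=1)                 # normalize over time, per class
        frame_scores = self.classifier(frame_features)       # (batch, T, C)
        # Each class aggregates the frames it attends to most strongly.
        video_scores = (attn * frame_scores).sum(dim=1)      # (batch, C)
        return video_scores


if __name__ == "__main__":
    # Toy usage: 2 videos, 30 frames each, 512-D frame features, 32 action classes.
    pool = ClassSpecificAttentionPooling(feature_dim=512, num_classes=32)
    feats = torch.randn(2, 30, 512)
    print(pool(feats).shape)  # torch.Size([2, 32])
```

Letting every class form its own temporal attention is what allows transitions in semantic traits (e.g., a gesture followed by a facial expression) to weigh differently for different, otherwise ambiguous, action classes.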