Academic Article
Context-aware Emotion Recognition Based on Vision-Language Pre-trained Model
Document Type
Conference
Source
2024 International Conference on Advanced Robotics and Mechatronics (ICARM), pp. 70-75, Jul. 2024
ISSN
2993-4990
Abstract
Given the difficulty of recognizing ambiguous emotions in facial expression recognition tasks, we propose a vision-language model named CAER-CLIP to address this challenge. CAER-CLIP stands for Context-Aware Emotion Recognition (CAER) and incorporates the structure of the Contrastive Language–Image Pre-training (CLIP) model as a promising alternative to a conventional classifier. The CAER-CLIP model consists of two parts. In the visual part, facial expressions and the contextual information of the image are extracted simultaneously to obtain the final feature embeddings, which are then used as a learnable “class” token for text-image pairing with the desired module. In the textual part, text labels for the emotion recognition classes serve as input. The outputs of the two parts are merged for contrastive comparison to learn the parameters of the model. The experiments demonstrate the effectiveness of the proposed method and show that CAER-CLIP outperforms state-of-the-art results on the CAER benchmark. An ablation study verified the effectiveness of both the classifier-based and text-based (ours, without a classifier) models, demonstrating that our method with the CAER-CLIP structure performs better and that incorporating a text encoder into the deep network architecture effectively enhances recognition accuracy.
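For readers unfamiliar with the text-image pairing step the abstract describes, the following is a minimal sketch of CLIP-style classification, assuming the fused face-plus-context embedding and the per-label text embeddings live in a shared space. The tensor shapes, the concatenation-based fusion, and the temperature value are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical dimensions: d = shared embedding size, K = number of emotion classes.
d, K = 512, 7

# Visual side: face and context features are assumed to be fused (here, by
# concatenation followed by a linear projection) into one image embedding.
face_feat = torch.randn(1, 256)   # placeholder face-branch output
ctx_feat = torch.randn(1, 256)    # placeholder context-branch output
fuse = torch.nn.Linear(512, d)
img_emb = F.normalize(fuse(torch.cat([face_feat, ctx_feat], dim=-1)), dim=-1)

# Textual side: one embedding per emotion label, e.g. from a text encoder run
# on prompts such as "a photo of a happy person" (placeholder values here).
text_emb = F.normalize(torch.randn(K, d), dim=-1)

# CLIP-style pairing: scaled cosine similarities act as class logits.
logit_scale = torch.tensor(100.0)  # temperature, in the spirit of CLIP's learned scale
logits = logit_scale * img_emb @ text_emb.t()
probs = logits.softmax(dim=-1)
print(probs)  # predicted distribution over the K emotion labels
```

In such a setup, training would minimize a cross-entropy loss over these logits against the ground-truth emotion labels, so the text encoder replaces the fixed classification head, which is the substitution the abstract credits for the accuracy gain.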