Academic Paper

Supervised Contrastive Learning for Robust and Efficient Multi-modal Emotion and Sentiment Analysis
Document Type
Conference
Source
2022 26th International Conference on Pattern Recognition (ICPR), pp. 2423-2429, Aug. 2022
Subject
Computing and Processing
Robotics and Control Systems
Signal Processing and Analysis
Training
Sentiment analysis
Computational modeling
Computer architecture
Predictive models
Transformers
Robustness
Language
ISSN
2831-7475
Abstract
Human emotion and sentiment are often expressed multi-modally, through spoken language, vision, and text. Combining multiple modalities allows learning-based models to exploit the complementary information present across modalities and produce more accurate predictions. One of the bigger challenges in multi-modal affective computing is performance consistency in non-ideal scenarios: most benchmark models fail to generalize when one of the modalities is missing or heavily corrupted due to occlusion, sensor errors, or a change of orientation. Various modality fusion approaches have been proposed in response, but most of them assume that each modality is equally useful. To address the challenge of performance consistency, in this work we propose to use supervised contrastive learning (SCL). We demonstrate through various experiments and comparisons with state-of-the-art (SOTA) methods that model robustness against corrupted and missing modalities improves when training with SCL. Next, we use the Perceiver architecture [1] to combine the representations of the different modalities efficiently: its iterative attention mechanism creates a reduced latent representation at low cost, and we observe that it accommodates a wide range of modality combinations, enabling robust information fusion. Our approach reduces model complexity and fuses the different modalities efficiently while maintaining performance consistency and model robustness. We conduct ablation experiments to study the effect of each contribution in different scenarios and show that the proposed methods outperform the state of the art while simultaneously being robust to corrupted modalities. Our method also outperforms its counterparts and SOTA methods at lower computational cost (faster inference and fewer compute operations).
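
For context, the SCL objective the abstract refers to is the supervised contrastive (SupCon) loss of Khosla et al. (2020), which pulls same-class embeddings together and pushes different-class embeddings apart. Below is a minimal PyTorch sketch of that loss; the function name, variable names, and temperature value are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of the supervised contrastive (SupCon) loss.
# Assumption: a batch of embeddings with one class label per sample.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """features: (N, D) embeddings; labels: (N,) integer class ids."""
    features = F.normalize(features, dim=1)        # project onto the unit sphere
    sim = features @ features.T / temperature      # pairwise cosine similarities
    n = features.size(0)
    # Mask out self-similarity so a sample is never its own positive.
    self_mask = torch.eye(n, dtype=torch.bool, device=features.device)
    sim = sim.masked_fill(self_mask, float('-inf'))
    # Positives: other samples in the batch sharing the same label.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # Log-softmax over all non-self pairs, averaged over each anchor's positives.
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)  # avoid divide-by-zero
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_counts
    return loss.mean()
```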
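The fusion side can be illustrated the same way. In Perceiver-style fusion [1], a small fixed-size latent array repeatedly cross-attends to the concatenated audio, vision, and text tokens, so attention cost grows linearly in the input length rather than quadratically, and a missing modality simply shortens the input sequence. The sketch below is a hedged illustration of that pattern, not the paper's exact architecture; dimensions, depth, and pooling are assumptions.

```python
# Hedged sketch of Perceiver-style iterative cross-attention fusion.
import torch
import torch.nn as nn

class PerceiverFusion(nn.Module):
    def __init__(self, input_dim=256, latent_dim=256, num_latents=32,
                 num_heads=4, depth=4):
        super().__init__()
        # Learned latent array: its size is fixed regardless of input length.
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads,
                                                kdim=input_dim, vdim=input_dim,
                                                batch_first=True)
        self.self_attn = nn.MultiheadAttention(latent_dim, num_heads,
                                               batch_first=True)
        self.depth = depth

    def forward(self, tokens, key_padding_mask=None):
        # tokens: (B, T, input_dim) -- audio, vision, and text tokens
        # concatenated along T; a missing modality just shortens T.
        b = tokens.size(0)
        z = self.latents.unsqueeze(0).expand(b, -1, -1)
        for _ in range(self.depth):
            # Latents read from the inputs (iterative cross-attention)...
            z = z + self.cross_attn(z, tokens, tokens,
                                    key_padding_mask=key_padding_mask)[0]
            # ...then refine among themselves (latent self-attention).
            z = z + self.self_attn(z, z, z)[0]
        return z.mean(dim=1)  # pooled latent fed to the classifier head
```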