Academic Paper
ERVQ: Leverage Residual Vector Quantization for Speech Emotion Recognition
Document Type
Conference
Source
2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 456-460, Nov. 2024
Abstract
Although speech pre-trained models (PTMs) have shown remarkable performance in speech emotion recognition (SER), they are built for general tasks and are limited in capturing emotion-related features. Recently, popular text-to-speech models have used residual vector quantization (RVQ) to effectively embed multi-scale speech detail, achieving high speech reconstruction quality. Inspired by this success, we explore RVQ's potential for emotional representation learning. In this paper, we present a novel perspective on SER by introducing an enhanced RVQ for emotion recognition, called Emotion RVQ (ERVQ). To strengthen ERVQ's ability to capture emotional features, we modify the RVQ architecture and apply content alignment to it. Experimental results show that our approach achieves state-of-the-art (SOTA) performance in both in-domain and out-of-domain SER scenarios, demonstrating superior emotional representation embeddings compared to other speech PTMs.
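The residual quantization idea the abstract builds on can be sketched as follows: each stage quantizes the residual left by the previous stage, so later codebooks capture progressively finer detail. Below is a minimal NumPy illustration of generic RVQ encoding, not the paper's ERVQ model; the codebook count, sizes, and dimensions are arbitrary assumptions for the example.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual vector quantization of a single vector x.

    Each codebook quantizes the residual remaining after the
    previous stages; the reconstruction is the sum of the
    selected codewords across all stages.
    """
    codes = []
    quantized = np.zeros_like(x)
    for cb in codebooks:
        residual = x - quantized
        # pick the codeword nearest to the current residual
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        quantized = quantized + cb[idx]
    return codes, quantized

rng = np.random.default_rng(0)
x = rng.normal(size=8)                                  # toy input vector
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]  # 4 stages, 16 codewords each
codes, x_hat = rvq_encode(x, codebooks)
```

With trained (rather than random) codebooks, each additional stage reduces the reconstruction error, which is why RVQ captures multi-scale detail with a small per-stage codebook.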