Academic Paper

ERVQ: Leverage Residual Vector Quantization for Speech Emotion Recognition
Document Type
Conference
Source
2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 456-460, Nov. 2024
Subject
Computing and Processing
Signal Processing and Analysis
Representation learning
Emotion recognition
Vector quantization
Speech recognition
Text to speech
Speech processing
speech emotion recognition
residual vector quantization
Language
English
Abstract
Although speech pre-trained models (PTMs) have shown remarkable performance in speech emotion recognition (SER), they are built for general tasks and are limited in capturing emotion-related features. Recently, popular text-to-speech models have used residual vector quantization (RVQ) to effectively embed multi-scale speech detail, achieving high speech reconstruction quality. Inspired by this success, we explore RVQ's potential for emotional representation learning. In this paper, we present a novel perspective on SER by introducing an enhanced RVQ for emotion recognition, called Emotion RVQ (ERVQ). To strengthen ERVQ's ability to capture emotional features, we modify it and employ content alignment. Experimental results show that our approach achieves state-of-the-art (SOTA) performance in both in-domain and out-of-domain SER scenarios, demonstrating superior emotional representation embeddings compared to other speech PTMs.
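For readers unfamiliar with residual vector quantization, the following is a minimal NumPy sketch of the generic RVQ idea the abstract refers to: each stage quantizes the residual left by the previous stage against its own codebook, so successive stages capture progressively finer detail. The codebook sizes, feature dimension, and nearest-neighbour lookup here are illustrative assumptions, not the ERVQ configuration described in the paper.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize vector x with a cascade of codebooks.

    Returns the selected code indices and the reconstruction
    (the sum of the chosen codewords across stages).
    """
    residual = x.copy()
    indices, recon = [], np.zeros_like(x)
    for cb in codebooks:                          # cb: (num_codes, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))               # nearest codeword to the current residual
        indices.append(idx)
        recon += cb[idx]
        residual = residual - cb[idx]             # next stage quantizes what is left over
    return indices, recon

# Toy usage: 3 quantization stages over an 8-dimensional feature vector.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 8)) for _ in range(3)]
x = rng.normal(size=8)
codes, x_hat = rvq_encode(x, codebooks)
print(codes, np.linalg.norm(x - x_hat))          # reconstruction error shrinks as stages are added
```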