Academic Paper
ERVQ: Leverage Residual Vector Quantization for Speech Emotion Recognition
Document Type
Conference
Source
2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 456-460, Nov. 2024
Abstract
Although speech pre-trained models (PTMs) have shown remarkable performance in speech emotion recognition (SER), they are built for general tasks and are limited in capturing emotion-related features. Recently, popular text-to-speech models have used residual vector quantization (RVQ) to effectively embed multi-scale speech detail, achieving high speech reconstruction quality. Inspired by this success, we explore RVQ's potential for emotional representation learning. In this paper, we present a novel perspective on SER by introducing an enhanced RVQ for emotion recognition, called Emotion RVQ (ERVQ). To strengthen ERVQ's ability to capture emotional features, we modify the RVQ architecture and apply content alignment to it. Experimental results show that our approach achieves state-of-the-art (SOTA) performance in both in-domain and out-of-domain SER scenarios, demonstrating superior emotional representation embeddings compared to other speech PTMs.
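The residual quantization idea the abstract builds on can be sketched as follows: each stage quantizes the residual left by the previous stage, so later codebooks capture progressively finer detail. Below is a minimal NumPy illustration of generic RVQ encoding, not the paper's ERVQ model; the codebook count, sizes, and dimensions are arbitrary assumptions for the example.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual vector quantization of a single vector x.

    Each codebook quantizes the residual remaining after the
    previous stages; the reconstruction is the sum of the
    selected codewords across all stages.
    """
    codes = []
    quantized = np.zeros_like(x)
    for cb in codebooks:
        residual = x - quantized
        # pick the codeword nearest to the current residual
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        quantized = quantized + cb[idx]
    return codes, quantized

rng = np.random.default_rng(0)
x = rng.normal(size=8)                                  # toy input vector
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]  # 4 stages, 16 codewords each
codes, x_hat = rvq_encode(x, codebooks)
```

With trained (rather than random) codebooks, each additional stage reduces the reconstruction error, which is why RVQ captures multi-scale detail with a small per-stage codebook.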