Academic Paper

Investigation of Ensemble of Self-Supervised Models for Speech Emotion Recognition
Document Type
Conference
Source
2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 988-995, Oct. 2023
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Signal Processing and Analysis
Training
Emotion recognition
Computational modeling
Evidence theory
Speech recognition
Self-supervised learning
Information processing
Language
English
ISSN
2640-0103
Abstract
Traditional deep-learning-based speech emotion recognition (SER) methods are limited by the scarcity of emotional speech data. Recently, self-supervised learning (SSL) speech representation models have emerged as the state-of-the-art approach for SER because of their ability to learn more generalized emotion representations. Some SSL models adopt similar network architectures but use different training strategies or pre-training speech data, so they learn speech representations from different perspectives and thus contain complementary information for the SER task. In this paper, we investigate ensemble learning with six SSL models for SER using three kinds of ensemble methods: decision ensemble, late feature ensemble, and early feature ensemble. The six SSL models adopted in this paper are Wav2vec 2.0 Base, HuBERT Base, WavLM Base, WavLM Base+, UniSpeech-SAT Base, and UniSpeech-SAT Base+. Our experimental results on three datasets show that: 1) all three kinds of ensemble learning methods significantly improve SER performance, indicating that a large amount of complementary information may exist between different SSL models; 2) among the decision ensemble methods, evidence theory (ET) based fusion outperforms average-based fusion because of its ability to handle the uncertainty of different models; 3) the ET-based decision ensemble is the best of the compared ensemble methods owing to its superior performance and lower computational cost.
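To make the contrast between the two decision-ensemble strategies concrete, the sketch below illustrates average-based fusion versus a simplified evidence-theory (Dempster-Shafer) fusion of per-model class posteriors. This is not the paper's implementation: it assumes each model's softmax output is treated as a basic probability assignment with only singleton focal elements, in which case Dempster's rule reduces to a normalized element-wise product. The emotion label set, the number of models, and the random posteriors are illustrative placeholders.

```python
# Minimal sketch (assumptions noted above, not the paper's exact method) of two
# decision-ensemble strategies: average-based fusion and simplified
# evidence-theory (Dempster-Shafer) fusion of per-model class posteriors.
import numpy as np

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # assumed label set


def average_fusion(posteriors: np.ndarray) -> np.ndarray:
    """Average the class posteriors of all models (shape: [n_models, n_classes])."""
    return posteriors.mean(axis=0)


def dempster_combine(m1: np.ndarray, m2: np.ndarray) -> np.ndarray:
    """Dempster's rule for two basic probability assignments restricted to
    singleton focal elements: the combined mass is the normalized element-wise
    product, where the normalizer (1 - K) discounts conflicting evidence."""
    joint = m1 * m2
    conflict = 1.0 - joint.sum()  # K: total mass assigned to conflicting pairs
    if np.isclose(conflict, 1.0):
        raise ValueError("Sources are in total conflict; Dempster's rule is undefined.")
    return joint / joint.sum()


def evidence_fusion(posteriors: np.ndarray) -> np.ndarray:
    """Fold Dempster's rule over the posteriors of all models."""
    fused = posteriors[0]
    for m in posteriors[1:]:
        fused = dempster_combine(fused, m)
    return fused


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in posteriors for six SSL models on one utterance (each row sums to 1).
    logits = rng.normal(size=(6, len(EMOTIONS)))
    posts = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    print("average fusion :", EMOTIONS[int(average_fusion(posts).argmax())])
    print("evidence fusion:", EMOTIONS[int(evidence_fusion(posts).argmax())])
```

One practical consequence visible in the sketch: the product form of Dempster's rule down-weights models whose predictions conflict with the others, which is one way the abstract's claim about handling model uncertainty can be understood; the paper's full ET formulation may assign explicit uncertainty mass rather than using plain softmax posteriors.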