Academic Paper
Speech Emotion Recognition Based on Shallow Structure of Wav2vec 2.0 and Attention Mechanism
Document Type
Conference
Source
2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 398-402, Nov. 2024
Abstract
Speech Emotion Recognition (SER) has long been an important topic in the field of human-computer interaction. Most existing methods rely on handcrafted features, which may discard emotion-related information contained in raw speech signals. In recent years, speech Self-supervised Learning (SSL) models such as Wav2vec 2.0 (W2V2) have emerged and been employed to extract general speech representations for downstream SER tasks. However, the large number of parameters introduced by full SSL models is unnecessary for SER. In this paper, an SER model is proposed on the basis of the shallow structure of W2V2 and the attention mechanism. The W2V2-based module is constructed from the first seven Conv1d blocks of W2V2 and extracts local feature representations from raw speech signals. The attention-based module then globally captures contextual emotional information from these local representations; within it, three multi-head self-attention blocks are cascaded for multilevel feature fusion. Experimental results show that the proposed model outperforms the baselines on the IEMOCAP and EMODB datasets.
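The pipeline the abstract describes (seven Conv1d blocks for local features, followed by cascaded multi-head self-attention and a classifier) can be sketched as below. This is a minimal NumPy illustration with random weights, not the authors' implementation: the kernel sizes and strides are those of the public W2V2 feature encoder, but the channel width (64 instead of 512), head count, pooling, and the 4-class output layer are assumptions made to keep the sketch small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Kernel sizes / strides of the seven Conv1d blocks in the W2V2
# feature encoder (the paper keeps only these shallow layers).
KERNELS = [10, 3, 3, 3, 3, 2, 2]
STRIDES = [5, 2, 2, 2, 2, 2, 2]
DIM = 64  # reduced width for the sketch; W2V2 itself uses 512 channels

def conv1d(x, w, stride):
    """Valid strided 1D convolution. x: (T, C_in), w: (K, C_in, C_out)."""
    K, _, C_out = w.shape
    T_out = (x.shape[0] - K) // stride + 1
    out = np.empty((T_out, C_out))
    for t in range(T_out):
        seg = x[t * stride : t * stride + K]      # (K, C_in) window
        out[t] = np.einsum('kc,kco->o', seg, w)
    return np.maximum(out, 0.0)                   # ReLU here (W2V2 uses GELU)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(x, heads=4):
    """One multi-head self-attention block with random projections."""
    T, D = x.shape
    hd = D // heads
    Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    outs = []
    for h in range(heads):
        s = slice(h * hd, (h + 1) * hd)
        att = softmax(q[:, s] @ k[:, s].T / np.sqrt(hd))
        outs.append(att @ v[:, s])
    return x + np.concatenate(outs, axis=1)       # residual connection

# One second of raw mono waveform at 16 kHz.
wave = rng.standard_normal((16000, 1))

# Local feature extraction: the seven Conv1d blocks.
x, c_in = wave, 1
for K, S in zip(KERNELS, STRIDES):
    w = rng.standard_normal((K, c_in, DIM)) * 0.05
    x = conv1d(x, w, S)
    c_in = DIM

# Global context: three cascaded multi-head self-attention blocks.
for _ in range(3):
    x = mhsa(x)

# Mean-pool over time, then classify into 4 emotions (assumed class count,
# e.g. the common 4-class IEMOCAP setup).
W_cls = rng.standard_normal((DIM, 4))
logits = x.mean(axis=0) @ W_cls
pred = int(np.argmax(logits))
print(x.shape, pred)
```

With these kernels and strides, 16 000 input samples yield 49 feature frames, matching the ~20 ms frame rate of the W2V2 encoder; the self-attention blocks preserve that sequence length while mixing information across all frames.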