Academic paper

Two-Stream Holographic Convolutional Network for Speech Emotion Recognition Using Korean Audio
Document Type
Conference
Source
2024 4th International Conference on Ubiquitous Computing and Intelligent Information Systems (ICUIS), pp. 113-119, Dec. 2024
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Emotion recognition
Accuracy
Speech recognition
Computer architecture
Medical services
Phonetics
Feature extraction
Convolutional neural networks
Optimization
Load modeling
Speech Emotion Recognition
Two-Stream Holographic Convolutional Network
Kookaburra Optimization Algorithm
Korean audio
Language
Abstract
Speech Emotion Recognition (SER) has become an important area of research in AI, helping systems understand people's feelings from speech in areas such as healthcare, automated customer support, and virtual companionship. Detecting emotion from Korean speech poses learning challenges and demands an understanding of both high-frequency spectral features and time-selective features. Although conventional convolutional networks perform effectively, they may lose phase information, which is crucial for distinguishing emotions, especially in languages with tonal and phonetic nuances. In this research, the AI-Hub Speech Emotional Database, a large-scale Korean speech dataset, is employed to build a Two-Stream Holographic Convolutional Network (TSHCN) for SER. The TSHCN architecture uses a two-path structure in which one branch encodes spectral features and the other encodes temporal dynamics, while phase information is retained through holographic convolutions for better emotion classification. To improve model efficiency, the Kookaburra Optimization Algorithm (KOA) is applied to the network weights, reducing errors and computational load while increasing accuracy. The results show that SER performance is notably enhanced by combining TSHCN with KOA; the accurate representation of Korean speakers' speech emotions paves the way for the development of SER for Korean and other complex languages. This study thus addresses the shortcomings of conventional approaches and proposes an efficient, language-independent SER model. The introduced approach attains an accuracy of 99%.
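The abstract's key claim is that convolving over complex-valued (phase-preserving) spectral input retains emotional cues that magnitude-only convolutions discard. The paper does not publish its implementation, so the sketch below is only a hypothetical NumPy illustration of that idea, not the authors' TSHCN code: a toy frame whose magnitude is flat but whose phase varies shows that a magnitude-only convolution produces a constant output, while a complex convolution still tracks the phase.

```python
import numpy as np

def complex_conv1d(signal, kernel):
    """Valid-mode 1-D convolution.

    When `signal` holds complex STFT values, the output keeps phase
    information; convolving np.abs(signal) instead discards it.
    """
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i + len(kernel)], kernel)
                     for i in range(n)])

# Toy "spectral frame": unit magnitude everywhere, phase sweeping 0..pi.
frame = np.exp(1j * np.linspace(0.0, np.pi, 8))
kernel = np.array([0.5 + 0j, 0.5 + 0j])  # simple averaging filter

out_complex = complex_conv1d(frame, kernel)          # phase retained
out_magnitude = complex_conv1d(np.abs(frame), kernel)  # phase discarded

# The magnitude-only path is constant (every |frame| value is 1),
# so all emotion-relevant phase variation is lost there, while the
# complex path still varies with the underlying phase.
```

In a full two-stream design along the lines the abstract describes, one such phase-preserving stream over the frequency axis would be fused (e.g., by feature concatenation) with a second stream over frame-to-frame temporal dynamics before classification.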