Academic Paper

BSER: A Learning Framework for Bangla Speech Emotion Recognition
Document Type
Conference
Source
2024 6th International Conference on Electrical Engineering and Information & Communication Technology (ICEEICT), pp. 410-415, May 2024
Subject
Aerospace
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Engineered Materials, Dielectrics and Plasmas
Fields, Waves and Electromagnetics
Nuclear Engineering
Photonics and Electrooptics
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Human computer interaction
Training
Emotion recognition
AWGN
Speech recognition
Market research
Convolutional neural networks
SER
LSTM
1D CNN
MFCC
LMS
ZCR
RMSE
Language
English
ISSN
2769-5700
Abstract
Human-Computer Interaction (HCI) relies on accurate identification of emotion in speech. Speech Emotion Recognition (SER) analyzes voice signals to classify the speaker's emotional state. English-language SER has been studied extensively, whereas Bangla SER has received far less attention. This study integrates a one-dimensional convolutional neural network (1D CNN) with a long short-term memory (LSTM) architecture, feeding into a fully connected network, for SER. Effective speech classification depends on informative features, which this method captures. We apply Additive White Gaussian Noise (AWGN), time stretching, and pitch shifting to augment the dataset and improve its reliability. Mel-frequency cepstral coefficients (MFCC), Mel-spectrogram, Zero Crossing Rate (ZCR), chromagram, and Root Mean Square Energy (RMSE) features are analyzed in this study. One-dimensional convolutional blocks extract local information, while LSTM layers capture global temporal patterns in our model. Training and testing loss curves, the confusion matrix, recall, precision, F1-score, and accuracy are used to evaluate the model. We evaluated on two benchmark datasets, the SUST Bangla Emotional Speech Corpus (SUBESCO) and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Experimental results show that the proposed BSER model is more resilient than baseline models on both datasets. BSER advances research in this area and demonstrates that our hybrid model can detect and classify emotions in speech inputs.
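The augmentation step the abstract describes (AWGN, time stretching, pitch shifting) can be sketched with librosa; the noise level, stretch rate, and pitch step below are illustrative assumptions, not values taken from the paper:

```python
import numpy as np
import librosa

def augment(y: np.ndarray, sr: int) -> list[np.ndarray]:
    """Return three augmented variants of a waveform y sampled at sr Hz.

    Amplitude 0.005, rate 0.9, and n_steps 2 are illustrative guesses;
    the record does not state the paper's actual augmentation parameters.
    """
    noisy = y + 0.005 * np.random.randn(len(y))                  # Additive White Gaussian Noise
    stretched = librosa.effects.time_stretch(y, rate=0.9)        # time stretching (signal elongation)
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)   # pitch shifting
    return [noisy, stretched, shifted]
```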
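All five feature types named in the abstract are available in librosa. A minimal extraction sketch, assuming per-frame means are pooled into one fixed-length vector per clip (a common SER convention; the paper's actual pooling strategy is not stated in this record):

```python
import numpy as np
import librosa

def extract_features(y: np.ndarray, sr: int) -> np.ndarray:
    """Pool each feature over time and concatenate into one vector per clip."""
    zcr = np.mean(librosa.feature.zero_crossing_rate(y), axis=1)        # (1,)
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)   # (12,) chromagram
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20), axis=1) # (20,)
    rms = np.mean(librosa.feature.rms(y=y), axis=1)                     # (1,) RMS energy
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)   # (128,) Mel-spectrogram
    return np.concatenate([zcr, chroma, mfcc, rms, mel])                # (162,) total
```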
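A minimal Keras sketch of the described 1D CNN + LSTM + fully connected stack, assuming the pooled feature vector is fed as a one-channel sequence; layer counts, widths, and kernel sizes are assumptions, not the paper's hyperparameters:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_bser(n_features: int = 162, n_classes: int = 7) -> keras.Model:
    """Hypothetical BSER-style model; n_classes depends on the dataset
    (e.g., 7 emotions in SUBESCO, 8 in RAVDESS)."""
    model = keras.Sequential([
        layers.Input(shape=(n_features, 1)),                        # feature vector as 1-channel sequence
        layers.Conv1D(64, 5, padding="same", activation="relu"),    # 1D CNN blocks: local information
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.LSTM(64),                                            # LSTM: global temporal patterns
        layers.Dense(64, activation="relu"),                        # fully connected head
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```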
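The evaluation metrics listed in the abstract map directly onto scikit-learn; a brief sketch with hypothetical labels in place of real model output:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical predictions on a held-out split; replace with actual model output.
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

print(confusion_matrix(y_true, y_pred))                 # per-class error structure
print(classification_report(y_true, y_pred, digits=3))  # precision, recall, F1-score, accuracy
```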