Academic Paper

An Approach to Recognize Speech Using Convolutional Neural Network for the Multilingual Language
Document Type
Conference
Source
2023 Global Conference on Information Technologies and Communications (GCITC), pp. 1-6, Dec. 2023
Subject
Aerospace
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Engineered Materials, Dielectrics and Plasmas
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Natural languages
Smart homes
Feature extraction
Recording
Convolutional neural networks
Mel frequency cepstral coefficient
Automatic speech recognition
Automatic Speech Recognition Systems
Artificial Intelligence
Machine Learning
Convolutional Neural Network
Mel-frequency Cepstral Coefficient
Language
Abstract
Automatic Speech Recognition Systems (ASRS) are essential for supporting natural language communication between humans and machines. ASRS gained prominence with the introduction of Artificial Intelligence (AI) and Machine Learning (ML), as they allow users to interact naturally with machines and perform hands-free operations. In addition, ASRS is a fundamental technology in many fields, such as education, smart home automation, automotive, aviation, and assistive technology for people with disabilities. In this paper, a Convolutional Neural Network (CNN)-based ASRS is built that models raw speech signals. The target speech corpus is our own database, created in four languages: Hindi, English, Punjabi, and Bengali. The recordings were made in different environments by 50 male native speakers of Hindi and Punjabi, who were also able to speak English and Bengali. The collected raw speech samples are then used to extract features with the Mel-frequency Cepstral Coefficient (MFCC) technique, the most widely used method for feature extraction. Further, a six-layer 2D CNN model was designed to recognize the speech samples for each language. The experimental results show a validation accuracy of 96.29% with a loss of 0.174. Hence, the CNN-based model demonstrates significant performance on this comprehensive tonal speech dataset.
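
The abstract outlines a two-stage pipeline: MFCC feature extraction from raw speech, followed by a six-layer 2D CNN classifier. The sketch below (Python, using librosa and TensorFlow/Keras) illustrates one plausible realization of that pipeline. The paper's exact MFCC settings, layer arrangement, and number of output classes are not given in the abstract, so every hyperparameter here (sampling rate, 13 coefficients, 100 frames, filter counts, the four-way output) is an assumption made for illustration only.

# Hedged sketch of the MFCC-to-2D-CNN pipeline described above.
# All hyperparameters are illustrative assumptions, not values
# reported by the paper.
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers

def extract_mfcc(path, sr=16000, n_mfcc=13, max_frames=100):
    """Load one raw speech sample and return a fixed-size MFCC matrix."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Pad or truncate along the time axis so every sample has the same shape.
    if mfcc.shape[1] < max_frames:
        mfcc = np.pad(mfcc, ((0, 0), (0, max_frames - mfcc.shape[1])))
    return mfcc[:, :max_frames]

def build_model(input_shape=(13, 100, 1), num_classes=4):
    """One possible six-layer 2D CNN: conv, pool, conv, pool, flatten, dense.

    num_classes=4 assumes one class per language; the actual model may
    instead classify words or commands within each language.
    """
    return tf.keras.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", padding="same",
                      input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_model()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Training would feed MFCC matrices with a trailing channel axis, e.g.
# x = extract_mfcc("sample.wav")[..., np.newaxis]   # shape (13, 100, 1)

Two convolution/pooling pairs feeding a dense softmax is a common minimal arrangement for treating MFCC matrices as single-channel images; the architecture actually used in the paper may differ.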