Academic Paper

BharatBhasaNet-A Unified Framework to Identify Indian Code Mix Languages
Document Type
Periodical
Source
IEEE Access. 12:68893-68904 2024
Subject
Aerospace
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Engineered Materials, Dielectrics and Plasmas
Engineering Profession
Fields, Waves and Electromagnetics
General Topics for Engineers
Geoscience
Nuclear Engineering
Photonics and Electrooptics
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Transportation
Codes
Testing
Training
Transformers
Data models
Text recognition
Task analysis
Linguistics
Natural language processing
Digital communication
Multilingual
Indian
native
Romanized
transformer
attention
XLM-Roberta
ISSN
2169-3536
Abstract
In the rapidly globalizing sphere of digital communication, the need for advanced multilingual text recognition and identification is increasingly evident. In contrast to previous work, which was predominantly limited to two or three languages, this paper explores the rich linguistic diversity of India and addresses challenges in automated language processing for 12 languages. BharatBhasaNet, our comprehensive Language Identification (LID) framework, integrates an extensive dataset covering these 12 Indian languages in both native-script and Romanized forms, derived from the INDICCORP, Bhasha-Abhijnaanam, and Aksharantar datasets by AI4Bharat. The framework comprises two models, Roberta-native and Roberta-Romanized, both based on the attention mechanism and the transformer architecture. With an accuracy of 99.54% on native-script text and 60.90% on Romanized text, BharatBhasaNet advances language identification and provides broader language coverage than existing LIDs. It also interprets code-mixed sentences, revealing accuracy patterns related to sentence length, word span, and complexity in multilingual contexts. The framework was further tested on a real-time dataset from the National Informatics Center (NIC), achieving an accuracy of 92.67%. By overcoming challenges such as limited training data and the difficulty of distinguishing similar languages, BharatBhasaNet marks a significant step forward in Romanized text identification within diverse linguistic landscapes.