Academic Paper

Offensive Hinglish Text Classification Using Longformer, Bigbird & HingMBert
Document Type
Conference
Source
2023 International Conference in Advances in Power, Signal, and Information Technology (APSIT), pp. 718-721, Jun. 2023
Subject
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Power, Energy and Industry Applications
Signal Processing and Analysis
Social networking (online)
Text categorization
Blogs
Transformers
Natural language processing
Libraries
Task analysis
longformer
bigbird
hingmbert
hinglish
Language
English
Abstract
With a sharp rise in the number of users fluent in “Hinglish”, a cross-lingual mix of Hindi and English, in a linguistically diverse country such as India, it has become increasingly important to analyze social media content written in this language on platforms such as Twitter, Reddit, and Facebook. This project focuses on the classification of Hinglish text into offensive and non-offensive categories. We present a novel application of the “Longformer” and “Bigbird” transformer-based language models to Hinglish text for the natural language processing (NLP) sentiment classification task, alongside mBERT pre-trained on Hinglish, referred to as the “HingMBert” model. Our dataset combines pre-existing datasets with approximately 8,200 newly added records, for a total of 24,660 records. The tweepy library was used to scrape the new tweets. In our experiments, Longformer, Bigbird, and HingMBert achieved F1 scores of 0.76, 0.77, and 0.79, and accuracies of 76%, 77%, and 80%, respectively. An F1 score of 0.78 and an accuracy of 78% were achieved using a majority-voting ensemble of the three models.
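To make the described pipeline concrete, the following is a minimal sketch, not the authors' code, of binary offensive/non-offensive classification with three Hugging Face checkpoints combined by majority voting. The checkpoint IDs (`allenai/longformer-base-4096`, `google/bigbird-roberta-base`, `l3cube-pune/hing-mbert`), the label convention, and the `predict` helper are assumptions; the classification heads would still need fine-tuning on the labeled Hinglish dataset before the reported scores could be reproduced.

```python
# Minimal sketch (assumed setup, not the authors' released code): classify Hinglish
# text as offensive (1) vs. non-offensive (0) with three transformer checkpoints
# and combine their predictions by majority voting, as described in the abstract.
# Checkpoint IDs are assumptions; the heads are untrained until fine-tuned.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINTS = {
    "longformer": "allenai/longformer-base-4096",  # assumed Longformer base model
    "bigbird":    "google/bigbird-roberta-base",   # assumed BigBird base model
    "hingmbert":  "l3cube-pune/hing-mbert",        # assumed mBERT pre-trained on Hinglish
}

def load(checkpoint):
    # Binary head: offensive vs. non-offensive
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    model.eval()
    return tokenizer, model

MODELS = [load(ckpt) for ckpt in CHECKPOINTS.values()]

@torch.no_grad()
def predict(text: str) -> int:
    """Return 1 (offensive) or 0 (non-offensive) by majority vote of the three models."""
    votes = []
    for tokenizer, model in MODELS:
        inputs = tokenizer(text, truncation=True, return_tensors="pt")
        logits = model(**inputs).logits            # shape: (1, 2)
        votes.append(int(logits.argmax(dim=-1)))   # per-model hard prediction
    return 1 if sum(votes) >= 2 else 0             # majority of three binary voters

if __name__ == "__main__":
    print(predict("yeh post ekdum bakwaas hai"))
```

With three binary voters, ties cannot occur, so the majority vote always yields a single label per tweet.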