Academic Paper

Offensive Hinglish Text Classification Using Longformer, Bigbird & HingMBert
Document Type
Conference
Source
2023 International Conference in Advances in Power, Signal, and Information Technology (APSIT), pp. 718-721, Jun. 2023
Subject
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Power, Energy and Industry Applications
Signal Processing and Analysis
Social networking (online)
Text categorization
Blogs
Transformers
Natural language processing
Libraries
Task analysis
longformer
bigbird
hingmbert
hinglish
Language
English
Abstract
With a sharp rise in the number of users fluent in “Hinglish”, a cross-lingual mix of Hindi and English, in a linguistically diverse country such as India, it has become increasingly important to analyze social media content written in this language on platforms such as Twitter, Reddit, and Facebook. This project focuses on the classification of Hinglish text into offensive and non-offensive categories. We present a novel application of the “Longformer” and “Bigbird” transformer-based language models to Hinglish text for the natural language processing (NLP) sentiment classification task, alongside mBERT pre-trained on Hinglish, referred to as the “HingMBert” model. Our dataset combines pre-existing datasets with approximately 8,200 newly added records, for a total of 24,660 records. The tweepy library was used to scrape the new tweets. In our experiments, Longformer, Bigbird, and HingMBert achieved F1 scores of 0.76, 0.77, and 0.79, and accuracies of 76%, 77%, and 80%, respectively. An F1 score of 0.78 and an accuracy of 78% were achieved using a majority-voting ensemble of the three models.
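To make the described pipeline concrete, the following is a minimal sketch, not the authors' code, of binary offensive/non-offensive classification with three Hugging Face checkpoints combined by majority voting. The checkpoint IDs (`allenai/longformer-base-4096`, `google/bigbird-roberta-base`, `l3cube-pune/hing-mbert`), the label convention, and the `predict` helper are assumptions; the classification heads would still need fine-tuning on the labeled Hinglish dataset before the reported scores could be reproduced.

```python
# Minimal sketch (assumed setup, not the authors' released code): classify Hinglish
# text as offensive (1) vs. non-offensive (0) with three transformer checkpoints
# and combine their predictions by majority voting, as described in the abstract.
# Checkpoint IDs are assumptions; the heads are untrained until fine-tuned.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINTS = {
    "longformer": "allenai/longformer-base-4096",  # assumed Longformer base model
    "bigbird":    "google/bigbird-roberta-base",   # assumed BigBird base model
    "hingmbert":  "l3cube-pune/hing-mbert",        # assumed mBERT pre-trained on Hinglish
}

def load(checkpoint):
    # Binary head: offensive vs. non-offensive
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    model.eval()
    return tokenizer, model

MODELS = [load(ckpt) for ckpt in CHECKPOINTS.values()]

@torch.no_grad()
def predict(text: str) -> int:
    """Return 1 (offensive) or 0 (non-offensive) by majority vote of the three models."""
    votes = []
    for tokenizer, model in MODELS:
        inputs = tokenizer(text, truncation=True, return_tensors="pt")
        logits = model(**inputs).logits            # shape: (1, 2)
        votes.append(int(logits.argmax(dim=-1)))   # per-model hard prediction
    return 1 if sum(votes) >= 2 else 0             # majority of three binary voters

if __name__ == "__main__":
    print(predict("yeh post ekdum bakwaas hai"))
```

With three binary voters, ties cannot occur, so the majority vote always yields a single label per tweet.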