Academic Paper

IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages

Document Type

Working Paper

Author

Adilazuarda, Muhammad Farid; Cahyawijaya, Samuel; Winata, Genta Indra; Fung, Pascale; Purwarianti, Ayu

Source

Subject

Computer Science - Computation and Language

Language

Abstract

Significant progress has been made on Indonesian NLP. Nevertheless, exploration of the code-mixing phenomenon in Indonesian is limited, despite many languages being frequently mixed with Indonesian in daily conversation. In this work, we explore code-mixing in Indonesian with four embedded languages, i.e., English, Sundanese, Javanese, and Malay; and introduce IndoRobusta, a framework to evaluate and improve the code-mixing robustness. Our analysis shows that the pre-training corpus bias affects the model's ability to better handle Indonesian-English code-mixing when compared to other local languages, despite having higher language diversity.

Online Access

Open Access (arXiv)