Academic Paper

Language Identification on Massive Datasets of Short Messages using an Attention Mechanism CNN
Document Type
Conference
Source
2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Dec. 2020, pp. 16-23
Subject
Computing and Processing
Social networking (online)
Neural networks
Blogs
Manuals
Benchmark testing
Noise measurement
Task analysis
LID
NN
Data mining
corpus
AI
Language
ISSN
2473-991X
Abstract
Language Identification (LID) is a challenging task, especially when the input texts are short and noisy, such as microblog posts on social media or chat logs on gaming forums. The task has been tackled either by designing a feature set for a traditional classifier (e.g. Naive Bayes) or by applying a deep neural network classifier (e.g. a bi-directional GRU or an encoder-decoder). These methods are usually trained and tested on a private corpus, then used as off-the-shelf packages by other researchers on their own datasets, and consequently the various published results are not directly comparable. In this paper, we first create a new massive labeled dataset based on one year of Twitter data. We use this dataset to test several existing LID systems in order to obtain a set of coherent benchmarks, and we make our dataset publicly available so that others can add to this set of benchmarks. Finally, we propose a shallow but efficient neural LID system: an n-gram-regional convolutional neural network enhanced with an attention mechanism. Experimental results show that our architecture is able to predict tens of thousands of samples per second and surpasses all state-of-the-art systems in accuracy and F1 score, including outperforming the popular langid system by 5%.
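The abstract's architecture (character n-grams fed through a regional convolution, with attention pooling over the n-gram positions before a language classifier) can be sketched roughly as below. This is a minimal illustration, not the paper's actual model: all dimensions, weight initialisations, and names (`predict`, `NGRAM`, etc.) are assumptions, and the weights are random stand-ins for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not taken from the paper):
# character vocabulary, embedding size, conv filters, n-gram width, languages.
VOCAB, EMB, FILTERS, NGRAM, LANGS = 64, 16, 32, 3, 5

# Randomly initialised parameters stand in for trained weights.
emb = rng.normal(0, 0.1, (VOCAB, EMB))              # character embeddings
conv_w = rng.normal(0, 0.1, (FILTERS, NGRAM, EMB))  # n-gram conv filters
attn_w = rng.normal(0, 0.1, FILTERS)                # attention scoring vector
out_w = rng.normal(0, 0.1, (FILTERS, LANGS))        # language classifier

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict(char_ids):
    """Return language probabilities for one short message (character ids)."""
    x = emb[char_ids]                                  # (T, EMB)
    T = len(char_ids)
    # n-gram-regional convolution: one feature vector per n-gram window.
    h = np.array([np.tanh(np.einsum('fne,ne->f', conv_w, x[t:t + NGRAM]))
                  for t in range(T - NGRAM + 1)])      # (T-n+1, FILTERS)
    # Attention pooling: score each window, then take a weighted sum,
    # so informative n-grams dominate the message representation.
    a = softmax(h @ attn_w)                            # (T-n+1,)
    doc = a @ h                                        # (FILTERS,)
    return softmax(doc @ out_w)                        # (LANGS,)

probs = predict([3, 17, 5, 9, 22, 8, 1])
```

Because the whole forward pass is a handful of matrix products per message, a model of this shape is consistent with the throughput claim of tens of thousands of predictions per second once batched.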