Academic Paper

Classification of Documents with Variable Length Using Likelihood Extracted Discriminative Attributes
Document Type
Conference
Source
2023 IEEE Smart World Congress (SWC), pp. 1-9, Aug. 2023
Subject
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Transportation
Document classification
Log-likelihood
Discriminative Attributes
Sequential Model
Language
Abstract
Document classification techniques are used to label unstructured textual data so that it can be organized. The length of a document, which may be short or long, strongly affects the performance of a classification algorithm, and long document classification has proved more challenging than short document classification. Deep learning algorithms, transformer-based algorithms, and their variants have recently been tested on documents of varying lengths. Bidirectional Encoder Representations from Transformers (BERT), a transformer-based technique, limits the number of input tokens, which in some cases has impaired classification performance, in addition to the complexity it introduces during training. Classical algorithms developed using deep learning methods, in turn, need optimization to capture the entirety of the feature space. We propose a novel Likelihood Extracted Discriminative Attributes (LEDA) algorithm, a simplified approach to short and long document classification that does not incur a complex training process for the classifier. LEDA creates a set of discriminative attributes using the likelihood ratio with an extra layer of regression filtration, and subsequently uses these attributes in a Keras concatenated sequential model to classify variable-length documents. These vocabulary terms can explain the variation within the data, thereby improving classifier performance. LEDA achieves 100% classification accuracy on the IMDB movie review dataset, one of the public datasets used for benchmarking classifier performance on long documents. Furthermore, LEDA achieves 94% classification accuracy on a sample of short documents from a Twitter data stream focused on COVID-19.
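The abstract does not give the exact likelihood formula, so the following is only a minimal sketch of the general idea of selecting discriminative vocabulary with a log-likelihood ratio. It assumes a common corpus-comparison form of the statistic (observed versus expected term counts in two classes) and whitespace tokenization. The function names `log_likelihood_ratio` and `discriminative_terms` are hypothetical, and the regression filtration step and the Keras model described above are not reproduced here.

```python
import math
from collections import Counter


def log_likelihood_ratio(a, b, total_a, total_b):
    """Score a term seen `a` times in class A and `b` times in class B.

    Assumption: a two-cell corpus-comparison log-likelihood statistic;
    the paper's exact formulation may differ.
    """
    # Expected counts if the term were equally likely in both classes.
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_a)
    if b > 0:
        score += b * math.log(b / expected_b)
    return 2.0 * score


def discriminative_terms(docs_a, docs_b, top_k=3):
    """Return the `top_k` terms whose usage differs most between two classes."""
    # Naive lowercase whitespace tokenization, for illustration only.
    counts_a = Counter(tok for doc in docs_a for tok in doc.lower().split())
    counts_b = Counter(tok for doc in docs_b for tok in doc.lower().split())
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    scores = {
        term: log_likelihood_ratio(counts_a[term], counts_b[term], total_a, total_b)
        for term in set(counts_a) | set(counts_b)
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


positive = ["great great movie", "great acting"]
negative = ["terrible movie", "terrible terrible plot"]
print(discriminative_terms(positive, negative, top_k=2))
```

In this toy example, a term such as "movie" that appears equally often in both classes scores zero, while terms concentrated in one class rank highest. Because the selected vocabulary is independent of document length, a term set of this kind could in principle serve as a fixed-size input to a downstream classifier.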