Academic Paper

A Novel Approach Using Extractive and Abstractive Summarization for the Genre Classification of Short Text
Document Type
Conference
Source
2023 Third International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT), pp. 1-7, Jan. 2023
Subject
Aerospace
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Engineered Materials, Dielectrics and Plasmas
Fields, Waves and Electromagnetics
General Topics for Engineers
Photonics and Electrooptics
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Support vector machines
Training
Deep learning
Social networking (online)
Text categorization
Neural networks
Feature extraction
Transformers
Fake news
Monitoring
abstractive summarization
Brown corpus
classification
extractive summarization
Gensim summarizer
Pegasus summarizer
Abstract
Genre classification of text documents is useful in media monitoring and in the detection of misinformation. Recent work on genre classification of text has shown that advanced algorithms such as neural networks and transformers are well suited to the task. However, for shorter text documents, such as those obtained from social media or news articles, training deep learning models becomes challenging because they require a large amount of input. Furthermore, genre classification of text summaries, such as news headlines, is an important direction that has not been widely explored. In this work, the effect of extractive and abstractive summarization on genre classification of text documents was evaluated. The Gensim summarizer was used to obtain extractive summaries and the Pegasus summarizer to obtain abstractive summaries. For classification, two genre classes, Fiction and Non-fiction, were considered, and the gold-standard Brown Corpus was used for experimentation. The features used for genre classification were frequencies of various Part-of-Speech (PoS) tags derived from five Penn TreeBank annotated tags. Logistic Regression (LR) and Support Vector Machines (SVM) were used as classifiers. Classification results were better for summaries obtained with the extractive technique, indicating that the features of extractive summaries stay closer to those of the source documents than the features of abstractive summaries do. Further, the SVM classifier performed better than the LR classifier. To cover the research goal more fully, the word count of the extractive summaries was varied to arrive at a threshold value for summary length. The resulting threshold indicated that summaries as short as 80 words can be successfully classified using this method.
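The feature-and-classifier step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not name the five Penn TreeBank tags, so the tag groups in `TAG_GROUPS` are a hypothetical choice, and the toy tagged sentences are invented for illustration. The logistic regression is a plain gradient-descent stand-in written without external libraries; the SVM classifier and the summarization step are not shown.

```python
import math
from collections import Counter

# ASSUMPTION: the abstract does not list the five tags; these groups are hypothetical.
TAG_GROUPS = {
    "noun": ("NN", "NNS", "NNP", "NNPS"),
    "verb": ("VB", "VBD", "VBG", "VBN", "VBP", "VBZ"),
    "adjective": ("JJ", "JJR", "JJS"),
    "adverb": ("RB", "RBR", "RBS"),
    "pronoun": ("PRP", "PRP$"),
}


def pos_features(tagged_tokens):
    """Return the relative frequency of each tag group in a (word, tag) list."""
    total = len(tagged_tokens)
    if total == 0:
        return [0.0] * len(TAG_GROUPS)
    counts = Counter(tag for _, tag in tagged_tokens)
    return [sum(counts[t] for t in tags) / total for tags in TAG_GROUPS.values()]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the mean log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (Fiction) or 0 (Non-fiction) for one feature vector."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Invented toy examples, already PoS-tagged (the Brown Corpus ships with tags).
fiction = [("She", "PRP"), ("ran", "VBD"), ("quickly", "RB"), ("home", "NN")]
nonfiction = [("The", "DT"), ("annual", "JJ"), ("report", "NN"),
              ("lists", "VBZ"), ("tax", "NN"), ("rates", "NNS")]

X = [pos_features(fiction), pos_features(nonfiction)]
y = [1, 0]  # 1 = Fiction, 0 = Non-fiction
w, b = train_logistic_regression(X, y)
```

With only two training examples the fitted model shows nothing about real accuracy; the sketch only makes concrete how per-document tag frequencies become a fixed-length feature vector that a linear classifier can consume.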