Academic article

Finding the Number of Latent Topics With Semantic Non-Negative Matrix Factorization
Document Type
article
Source
IEEE Access, Vol 9, Pp 117217-117231 (2021)
Subject
Machine learning
NLP
topic modeling
semantic non-negative matrix factorization
Electrical engineering. Electronics. Nuclear engineering
TK1-9971
Language
English
ISSN
2169-3536
Abstract
Topic modeling, or identifying the set of topics that occur in a collection of articles, is one of the primary objectives of text mining. One of the major challenges in topic modeling is determining the correct number of topics: underestimating the number results in a loss of information (omitted topics, i.e., underfitting), while overestimating it leads to noisy, unexplainable topics and overfitting. In this paper, we consider a semantic-assisted non-negative matrix factorization (NMF) topic model, which we call SeNMFk, based on Kullback-Leibler (KL) divergence and integrated with a method for determining the number of latent topics. SeNMFk involves (i) creating a random ensemble of pairs of matrices whose means are equal, respectively, to the initial words-by-documents matrix representing the text corpus and to the Shifted Positive Pointwise Mutual Information (SPPMI) matrix, which encodes the context information, and (ii) jointly factorizing each of these pairs with different numbers of topics to acquire sets of latent topics that are stable to noise. We demonstrate the performance of our method by identifying the number of topics in several benchmark text corpora and comparing the results with those of other state-of-the-art techniques. We also show that the number of document classes in the input text corpus may differ from the number of extracted latent topics, but that these classes can be retrieved by clustering the column vectors of one of the factor matrices. Additionally, we introduce a software package, pyDNMFk, for estimating the number of topics. We demonstrate that our unsupervised method, SeNMFk, not only determines the correct number of topics but also extracts topics with high coherence and accurately classifies the documents of the corpus.
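
The two inputs named in step (i) of the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and does not use the pyDNMFk API. It assumes the commonly used definition SPPMI(w, c) = max(PMI(w, c) - log(shift), 0), and it assumes the random ensemble is built with multiplicative uniform noise, which is one simple way to make the ensemble mean equal the original matrix in expectation. The function names `sppmi` and `perturbed_ensemble` and the parameters `shift`, `eps`, `n_members`, and `seed` are hypothetical.

```python
import numpy as np


def sppmi(cooc, shift=1.0):
    """Shifted positive PMI of a word-by-word co-occurrence count matrix.

    Assumes SPPMI(w, c) = max(PMI(w, c) - log(shift), 0), with
    PMI(w, c) = log(count(w, c) * total / (count(w) * count(c))).
    """
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)  # marginal counts per word
    col = cooc.sum(axis=0, keepdims=True)  # marginal counts per context
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
    # Zero counts give -inf or nan; force them to -inf so they clip to 0.
    pmi[~np.isfinite(pmi)] = -np.inf
    return np.maximum(pmi - np.log(shift), 0.0)


def perturbed_ensemble(X, n_members=10, eps=0.03, seed=0):
    """Stack of noisy copies of X whose expected mean equals X.

    Assumption: each entry is scaled by an independent factor drawn
    uniformly from [1 - eps, 1 + eps], so non-negativity is preserved.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    noise = rng.uniform(1.0 - eps, 1.0 + eps, size=(n_members,) + X.shape)
    return X[None, :, :] * noise


# Toy data: 3 words, symmetric co-occurrence counts.
cooc = np.array([[0, 2, 1],
                 [2, 0, 0],
                 [1, 0, 0]])
S = sppmi(cooc)

# Toy words-by-documents matrix (3 words, 4 documents).
X = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 3.0, 1.0, 0.0],
              [2.0, 1.0, 0.0, 4.0]])
X_ens = perturbed_ensemble(X, n_members=50, eps=0.03)
S_ens = perturbed_ensemble(S, n_members=50, eps=0.03, seed=1)
```

Each pair `(X_ens[i], S_ens[i])` would then be handed to a joint factorization for every candidate number of topics. That factorization and the stability analysis used to select the number of topics are beyond what the abstract specifies, so they are not sketched here.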