Academic Paper

Latent Topic Extraction as a Source of Labeling in Natural Language Processing
Document Type
Conference
Source
2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 4312-4319, Dec. 2023
Subject
Bioengineering
Computing and Processing
Engineering Profession
Robotics and Control Systems
Signal Processing and Analysis
COVID-19
Machine learning algorithms
Biological system modeling
Machine learning
Predictive models
Data models
Natural language processing
topic modeling
latent Dirichlet allocation
non-negative matrix factorization
Language
ISSN
2156-1133
Abstract
Supervised machine learning algorithms depend on accurately labeled target data to learn relationships between input data and targets. A major obstacle to building supervised models that correctly predict the labels of unseen data is the quality of the training data, which often depends on a subject matter expert (SME) creating the labeled dataset. Given the scarcity of such experts in many fields, the time needed to analyze data for labeling, and subjective differences among experts, ways to reduce the complexity of creating meaningful datasets are needed. In this work, we explore two unsupervised topic modeling algorithms, Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), as potential methods for reducing the complexity of the labeling process. Specifically, we obtained COVID patient message data labeled by an SME and compared the topics that each algorithm designated as COVID or non-COVID with the SME's labels. For both algorithms, the COVID vs. non-COVID labels overlapped strongly with those of the SME, suggesting that the methodology could support the development of labeled datasets for clinically meaningful models.
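The comparison step described in the abstract can be illustrated with a minimal sketch. It assumes each message has already been assigned a dominant topic by LDA or NMF (computed elsewhere), and it designates a topic as COVID when its top terms contain a seed keyword. The abstract does not state how topics were designated or how overlap was measured, so the seed terms, the function names `designate_covid_topics` and `label_overlap`, the simple agreement fraction, and the toy data are all hypothetical.

```python
from typing import Dict, Sequence, Set

# Hypothetical seed terms; the abstract does not say how topics were designated.
COVID_SEED_TERMS: Set[str] = {"covid", "coronavirus", "quarantine"}


def designate_covid_topics(top_terms: Dict[int, Sequence[str]],
                           seed_terms: Set[str] = COVID_SEED_TERMS) -> Set[int]:
    """Return the ids of topics whose top terms include any seed term."""
    return {topic_id for topic_id, terms in top_terms.items()
            if seed_terms & {t.lower() for t in terms}}


def label_overlap(dominant_topics: Sequence[int],
                  covid_topics: Set[int],
                  sme_labels: Sequence[bool]) -> float:
    """Fraction of messages whose topic-derived label matches the SME label."""
    if len(dominant_topics) != len(sme_labels):
        raise ValueError("expected one dominant topic and one SME label per message")
    if not sme_labels:
        raise ValueError("no messages to compare")
    matches = sum((topic in covid_topics) == bool(sme)
                  for topic, sme in zip(dominant_topics, sme_labels))
    return matches / len(sme_labels)


# Toy inputs invented for illustration; these are not data from the paper.
top_terms = {0: ["covid", "test", "result"], 1: ["refill", "pharmacy", "dose"]}
dominant_topics = [0, 0, 1, 1, 0]               # dominant topic per message
sme_labels = [True, True, False, True, True]    # SME label: COVID or not

covid_topics = designate_covid_topics(top_terms)
overlap = label_overlap(dominant_topics, covid_topics, sme_labels)
```

A plain agreement fraction is used here only because it is the simplest measure of overlap; a chance-corrected statistic could be substituted without changing the structure of the sketch.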