Academic Paper
Effect of Various Data Preprocessing in Sequence Embedding-Based Machine Learning for Human-Virus PPI Classification
Document Type
Conference
Source
2021 4th International Conference of Computer and Informatics Engineering (IC2IE), pp. 74-78, Sep 2021
Abstract
Identifying human-virus protein-protein interactions (PPI) is an important task that is increasingly studied with computational methods. Previous research shows that the doc2vec encoding scheme for features, combined with a Random Forest classifier, gives promising performance. However, human-virus PPI data are usually imbalanced, and additional preprocessing steps have not been investigated for this task. In this work, we investigated various preprocessing methods and modifications to improve classification performance. The results show that a modification to the feature formulation method, combined with random oversampling, can improve the classification AUC from 0.9414 to 0.9448.
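As an illustration of the random oversampling step named in the abstract, the following is a minimal sketch of the generic technique, not the authors' implementation. The function name `random_oversample`, the toy feature vectors, and the choice to balance every class up to the size of the largest class are all assumptions made for this example; the abstract does not specify the sampling ratio or the feature formulation used.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Hypothetical helper: duplicate randomly chosen minority-class samples
    until every class is as large as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(features), list(labels)
    for cls, n in counts.items():
        # Indices of the original samples that belong to this class.
        idx = [i for i, y in enumerate(labels) if y == cls]
        # Draw with replacement to make up the shortfall for this class.
        for _ in range(target - n):
            j = rng.choice(idx)
            out_x.append(features[j])
            out_y.append(labels[j])
    return out_x, out_y


# Toy data (assumed): 6 non-interacting pairs (label 0), 2 interacting pairs (label 1).
X = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.2], [0.4, 0.5],
     [0.0, 0.9], [0.7, 0.3], [0.9, 0.8], [0.8, 0.9]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

In this toy run both classes end up with six samples each. Oversampling is normally applied to the training split only, so that duplicated samples do not leak into the data used to compute metrics such as AUC.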