Academic Paper

Effect of Various Data Preprocessing in Sequence Embedding-Based Machine Learning for Human-Virus PPI Classification
Document Type
Conference
Source
2021 4th International Conference of Computer and Informatics Engineering (IC2IE), pp. 74-78, Sep. 2021
Subject
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Robotics and Control Systems
Signal Processing and Analysis
Proteins
Data preprocessing
Encoding
Task analysis
Informatics
Random forests
classification
human-virus PPI
sequence embedding
data preprocessing
oversampling
Language
Abstract
Identifying human-virus protein-protein interactions (PPI) is an important task that is increasingly studied with computational methods. Previous research shows that using the doc2vec encoding scheme for features, combined with a Random Forest classifier, gives promising performance. However, human-virus PPI data are usually imbalanced, and additional preprocessing steps have not been investigated for this task. In this work, we investigated various preprocessing methods and modifications to improve classification performance. The results show that a modification to the feature formulation method, combined with random oversampling, can improve the classification AUC from 0.9414 to 0.9448.
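
As an illustration of the random oversampling step named in the abstract, the following is a minimal sketch and not the authors' implementation. The function name `random_oversample`, the toy feature vectors, and the labels are all hypothetical. In the paper's setting the features would be doc2vec-derived embeddings of protein pairs. The sketch duplicates randomly chosen minority-class samples until every class matches the size of the largest class.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen minority-class samples (with replacement)
    until every class has as many samples as the largest class.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    # Start from copies of the original data so the inputs are not mutated.
    out_features = list(features)
    out_labels = list(labels)

    for cls, count in counts.items():
        # Indices of the original samples that belong to this class.
        idx = [i for i, y in enumerate(labels) if y == cls]
        # Append random duplicates until this class reaches the target size.
        for _ in range(target - count):
            j = rng.choice(idx)
            out_features.append(features[j])
            out_labels.append(cls)

    return out_features, out_labels


# Toy data (assumed): 2 interacting pairs (label 1), 6 non-interacting (label 0).
X = [[0.1, 0.2], [0.3, 0.1], [0.9, 0.8], [0.7, 0.6],
     [0.8, 0.9], [0.6, 0.7], [0.5, 0.9], [0.9, 0.5]]
y = [1, 1, 0, 0, 0, 0, 0, 0]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Oversampling is commonly applied to the training split only, so that duplicated samples do not leak into the data used to compute the AUC. Whether the paper follows this exact protocol is not stated in the abstract.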