Academic Journal Article

OCtS: an alternative of the t-Score method sensitive to outliers and correlation in feature selection.
Document Type
Article
Source
Communications in Statistics: Simulation & Computation. 2024, Vol. 53 Issue 3, p1409-1422. 14p.
Subject
*MACHINE learning
*CLASSIFICATION algorithms
*MISSING data (Statistics)
*NOSOLOGY
*BOOSTING algorithms
*PERFORMANCE standards
*FEATURE selection
Language
ISSN
0361-0918
Abstract
Missing values, class noise, class imbalance, outliers, correlation, and irrelevant variables can all degrade the performance of disease diagnosis classification algorithms. This study proposes a new technique, an alternative to the t-Score method, that aims to improve the performance of ensemble learning classification algorithms by removing irrelevant variables. To this end, three publicly available datasets from the medical domain, varying in sample size, number of variables, and data preprocessing problems, were selected and processed with our newly proposed feature selection method, Outliers and Correlation t-Score (OCtS). Afterwards, six widely used ensemble learning algorithms (Random Forest, Gradient Boosting Machine, Extreme Gradient Boosting Machine, Light Gradient Boosting Machine, CatBoost, and Bagging) were employed for disease diagnosis classification, and performance metrics were measured. Our results indicate that the classification performance of all six ensemble learning algorithms increased significantly when the OCtS method was employed, and that OCtS outperformed the standard t-score method across all datasets (p = 0.0001). We conclude that using data preprocessing methods with OCtS offers better performance when employing ensemble learning algorithms in disease diagnosis classification. [ABSTRACT FROM AUTHOR]
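
For orientation, the baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of one common form of t-score feature ranking (the absolute Welch t-statistic between two classes), not the OCtS method, whose outlier and correlation handling the abstract does not specify. The function names `t_scores` and `select_top_k` and the toy data are assumptions made for this example only.

```python
import numpy as np

def t_scores(X, y):
    """Absolute Welch t-statistic of each column of X between classes 0 and 1.

    Assumes binary labels in {0, 1} and non-constant features within each
    class (a zero variance in both classes would divide by zero).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    a, b = X[y == 0], X[y == 1]
    # Standard error from unbiased per-class variances (ddof=1).
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / se

def select_top_k(X, y, k):
    """Indices of the k features with the largest t-scores, best first."""
    return np.argsort(t_scores(X, y))[::-1][:k]

# Hypothetical toy data: 5 noise features, with feature 2 shifted in class 1.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 5))
X[:, 2] += 3.0 * y

top = select_top_k(X, y, 2)  # feature 2 should rank first
```

The selected column indices could then be used to subset the data before fitting any of the ensemble classifiers named in the abstract. Because the plain t-score relies on class means and variances, it is sensitive to outliers and ignores correlation between features, which appears to be the gap OCtS is meant to address.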