Academic Paper

RFE Based Feature Selection Improves Performance of Classifying Multiple-causes Deaths in Colorectal Cancer
Document Type
Conference
Source
2022 7th International Conference on Intelligent Informatics and Biomedical Science (ICIIBMS), 7:188-194, Nov. 2022
Subject
Aerospace
Bioengineering
Communication, Networking and Broadcast Technologies
Computing and Processing
Engineered Materials, Dielectrics and Plasmas
Engineering Profession
Fields, Waves and Electromagnetics
General Topics for Engineers
Robotics and Control Systems
Signal Processing and Analysis
Transportation
Measurement
Biological system modeling
Feature extraction
Informatics
Optimization
Classification tree analysis
Genetic algorithms
Feature Selection
Recursive Feature Elimination
eXtreme Gradient Boosting
Random Forest
Logistic Regression
Genetic Algorithm
Language
ISSN
2189-8723
Abstract
Colorectal cancer (CRC) is one of the most common malignancies globally. Although the cure rate of CRC has improved in recent years, the disease still has high mortality and low survival rates. Understanding the multiple-cause death types of CRC is therefore important for CRC prognostication and treatment. Here we use Recursive Feature Elimination (RFE) to select feature subsets from CRC data and a Genetic Algorithm (GA) to optimize the parameters of the classifiers trained on those subsets, with the aim of improving classification of the multiple-cause death types of CRC. First, feature selection was performed with RFE, using eXtreme Gradient Boosting (XGBoost), Random Forest (RF), and Logistic Regression (LR) in turn as the base estimator. To identify the optimal feature combination, the number of selected features was set to 200, 600, 1,000, 1,400, 1,800, and 2,200, and the step was set to 200, 600, 1,000, 1,400, and 1,800. Second, the XGBoost, RF, and LR classification algorithms were each used to classify the CRC subsets selected by RFE, and the GA was used to optimize the classifiers' parameters. Finally, based on the output of the best classifiers after GA optimization, the best performance metrics obtained on the CRC subsets selected under the different RFE parameter settings were compared by classifier type. The experimental results show that XGBoost was better suited than the RF and LR classifiers to the data after RFE feature selection. The XGBoost classifier achieved its highest performance metrics when the RFE base estimator was RF, the number of selected features was 600, and the step was 1,800: the accuracy was 0.85, the weighted precision 0.87, the weighted recall 0.85, and the weighted F1 value 0.83.
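The roles of the two RFE settings named in the abstract, the number of selected features and the step, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the toy data, and the `mean_gap_importance` scorer are hypothetical; the scorer is a simple stand-in for the importances that a base estimator such as XGBoost, RF, or LR would supply.

```python
from typing import Callable, List, Sequence


def mean_gap_importance(X_sub: List[List[float]], y: Sequence[int]) -> List[float]:
    """Toy importance (assumption): |mean of class 1 - mean of class 0| per column."""
    pos = [row for row, label in zip(X_sub, y) if label == 1]
    neg = [row for row, label in zip(X_sub, y) if label == 0]
    return [
        abs(sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg))
        for j in range(len(X_sub[0]))
    ]


def recursive_feature_elimination(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    importance_fn: Callable[[List[List[float]], Sequence[int]], List[float]],
    n_features_to_select: int,
    step: int,
) -> List[int]:
    """Return the original column indices that survive elimination."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_features_to_select:
        # Re-score only the columns that are still in play.
        X_sub = [[row[j] for j in remaining] for row in X]
        scores = importance_fn(X_sub, y)
        # Cap the step so the loop never drops below the target size.
        n_drop = min(step, len(remaining) - n_features_to_select)
        # Positions (within `remaining`) of the least important columns.
        weakest = set(sorted(range(len(remaining)), key=lambda k: scores[k])[:n_drop])
        remaining = [f for k, f in enumerate(remaining) if k not in weakest]
    return remaining


# Toy data: column 0 separates the classes well, column 2 slightly,
# columns 1 and 3 hardly at all.
X = [
    [0.0, 5.0, 1.0, 0.2],
    [0.1, 5.1, 0.9, 0.8],
    [1.0, 5.0, 1.1, 0.3],
    [0.9, 5.2, 1.0, 0.7],
]
y = [0, 0, 1, 1]

selected = recursive_feature_elimination(
    X, y, mean_gap_importance, n_features_to_select=2, step=1
)
print(selected)
```

In this sketch a smaller step means the importances are recomputed more often, while a step larger than the number of features still to be removed is capped, so the final round lands exactly on the target size.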