Academic Article

Machine learning-based risk factor analysis and prevalence prediction of intestinal parasitic infections using epidemiological survey data.
Document Type
Article
Source
PLoS Neglected Tropical Diseases. 6/14/2022, Vol. 16 Issue 6, p1-19. 19p.
Subject
*PARASITIC diseases
*INTESTINAL infections
*FACTOR analysis
*HOOKWORM disease
*RECEIVER operating characteristic curves
*PROTOZOAN diseases
Language
ISSN
1935-2727
Abstract
Background: Previous epidemiological studies have examined the prevalence and risk factors for a variety of parasitic illnesses, including protozoan and soil-transmitted helminth (STH, e.g., hookworms and roundworms) infections. Despite advancements in machine learning for data analysis, the majority of these studies use traditional logistic regression to identify significant risk factors.
Methods: In this study, we used data from a survey of 54 risk factors for intestinal parasitosis in 954 Ethiopian school children. We investigated whether machine learning approaches can supplement traditional logistic regression in identifying intestinal parasite infection risk factors. We used feature selection methods such as InfoGain (IG), ReliefF (ReF), Joint Mutual Information (JMI), and Minimum Redundancy Maximum Relevance (MRMR). Additionally, we predicted children's parasitic infection status using classifiers such as Logistic Regression (LR), Support Vector Machines (SVM), Random Forests (RF) and XGBoost (XGB), and compared their accuracy and area under the receiver operating characteristic curve (AUROC) scores. For optimal model training, we performed tenfold cross-validation and tuned the classifier hyperparameters. We balanced our dataset using the Synthetic Minority Oversampling (SMOTE) method. Additionally, we used association rule learning to establish a link between risk factors and parasitic infections.
Key findings: Our study demonstrated that machine learning could be used in conjunction with logistic regression. Using machine learning, we developed models that accurately predicted four parasitic infections: any parasitic infection at 79.9% accuracy, helminth infection at 84.9%, any STH infection at 95.9%, and protozoan infection at 94.2%. The Random Forests (RF) and Support Vector Machines (SVM) classifiers achieved the highest accuracy when top 20 risk factors were considered using Joint Mutual Information (JMI) or all features were used. The best predictors of infection were socioeconomic, demographic, and hematological characteristics.
Conclusions: We demonstrated that feature selection and association rule learning are useful strategies for detecting risk factors for parasite infection. Additionally, we showed that advanced classifiers might be utilized to predict children's parasitic infection status. When combined with standard logistic regression models, machine learning techniques can identify novel risk factors and predict infection risk.
Author summary: In developing countries such as Ethiopia, intestinal parasites are a significant public health problem. These parasites are detrimental to the health of schoolchildren. Numerous risk factors for parasitic infections have been identified using uni- and multi-variate logistic regression. However, logistic regression has inherent limitations when applied to data sets with a large number of risk factors. We used machine learning techniques in conjunction with logistic regression models to identify relevant risk factors for parasitic infections in a dataset of 954 Ethiopian schoolchildren with 54 different risk factors for parasitic infections. Additionally, we developed predictive models of parasitic infection. Compared to logistic regression, we discovered that machine learning techniques identified novel risk factors and had higher predictive accuracy. Furthermore, we discovered that infection prediction could be aided by combining socioeconomic, health, and hematological characteristics. As a result, we concluded that advanced machine learning methods should be used in conjunction with logistic regression to study parasitic infections. [ABSTRACT FROM AUTHOR]
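Illustrative note: two quantities named in the abstract, information-gain feature ranking and AUROC, can be sketched in a few lines of plain Python. The abstract does not say which software the authors used. The code below is therefore only a minimal sketch under the usual textbook definitions: information gain as the reduction in label entropy, and AUROC as the probability that a positive case outscores a negative one. It is not the study's implementation. All names (`information_gain`, `auroc`, `toy_features`) and the toy data are hypothetical and are not taken from the survey.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus their entropy conditioned on a categorical feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the study): 1 = infected, 0 = not infected.
infected = [1, 1, 1, 0, 0, 0, 1, 0]
toy_features = {
    "washes_hands": [0, 0, 0, 1, 1, 1, 0, 1],  # separates the two classes exactly
    "sex": [0, 1, 0, 1, 0, 1, 1, 0],           # carries no information about the label here
}

# Rank the candidate risk factors by information gain, highest first.
ranking = sorted(
    toy_features,
    key=lambda name: information_gain(toy_features[name], infected),
    reverse=True,
)

# Hypothetical classifier scores for the same eight children.
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
toy_auroc = auroc(scores, infected)
```

With a real dataset, one would typically keep the top-ranked features (the abstract mentions the top 20) and compute AUROC on held-out folds rather than on the training data.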