Academic article

Non-linear machine learning models incorporating SNPs and PRS improve polygenic prediction in diverse human populations
Document Type
article
Source
Communications Biology. 5(1)
Subject
Biological Sciences
Aetiology
2.1 Biological and endogenous factors
Genetic Predisposition to Disease
Genome-Wide Association Study
Humans
Machine Learning
Multifactorial Inheritance
Polymorphism, Single Nucleotide
NHLBI’s Trans-Omics in Precision Medicine (TOPMed) Consortium
Biological sciences
Biomedical and clinical sciences
Language
Abstract
Polygenic risk scores (PRS) are commonly used to quantify the inherited susceptibility for a trait, yet they fail to account for non-linear and interaction effects between single nucleotide polymorphisms (SNPs). We address this with a machine learning approach, validated on nine complex phenotypes in a multi-ancestry population. We use an ensemble method of SNP selection followed by gradient boosted trees (XGBoost) to allow for non-linearities and interaction effects. We compare our results to the standard linear PRS model developed using PRSice, LDpred2, and lassosum2. Including a PRS as a feature in an XGBoost model increases the percentage of variance explained, relative to the standard linear PRS model, by 22% for height, 27% for HDL cholesterol, 43% for body mass index, 50% for sleep duration, 58% for systolic blood pressure, 64% for total cholesterol, 66% for triglycerides, 77% for LDL cholesterol, and 100% for diastolic blood pressure. Models trained on the multi-ancestry population perform similarly to models trained on specific racial/ethnic groups and are consistently superior to the standard linear PRS models. This work demonstrates an effective method to account for non-linearities and interaction effects in genetics-based prediction models.
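The percentages in the abstract are relative, not absolute, gains in variance explained, which is easy to misread. The sketch below shows one common way such a figure can be computed. The formula is an assumption, since the abstract does not state it, and the function name `relative_increase_pct` and the input values are hypothetical illustrations, not results from the paper.

```python
def relative_increase_pct(pve_baseline: float, pve_model: float) -> float:
    """Relative increase, in percent, of variance explained over a baseline.

    Assumed definition: 100 * (model - baseline) / baseline.
    Both inputs are percentages of variance explained (PVE).
    """
    if pve_baseline <= 0:
        raise ValueError("baseline variance explained must be positive")
    return 100.0 * (pve_model - pve_baseline) / pve_baseline


# Hypothetical values for illustration only (not taken from the paper):
# a linear PRS explaining 4% of trait variance versus a combined
# non-linear model explaining 8%.
baseline_pve = 4.0
combined_pve = 8.0
print(relative_increase_pct(baseline_pve, combined_pve))  # 100.0
```

Under this definition, a reported 100% relative increase means the variance explained doubled, not that the model explains all of the trait variance.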