Academic Article

Non-linear machine learning models incorporating SNPs and PRS improve polygenic prediction in diverse human populations

Document Type

article

Source

Communications Biology. 5(1)

Subject

Biological Sciences
Aetiology
2.1 Biological and endogenous factors
Genetic Predisposition to Disease
Genome-Wide Association Study
Humans
Machine Learning
Multifactorial Inheritance
Polymorphism, Single Nucleotide
NHLBI’s Trans-Omics in Precision Medicine (TOPMed) Consortium
Biological sciences
Biomedical and clinical sciences

Language

Abstract

Polygenic risk scores (PRS) are commonly used to quantify the inherited susceptibility for a trait, yet they fail to account for non-linear and interaction effects between single nucleotide polymorphisms (SNPs). We address this via a machine learning approach, validated in nine complex phenotypes in a multi-ancestry population. We use an ensemble method of SNP selection followed by gradient boosted trees (XGBoost) to allow for non-linearities and interaction effects. We compare our results to the standard, linear PRS model developed using PRSice, LDpred2, and lassosum2. Combining a PRS as a feature in an XGBoost model results in a relative increase in the percentage variance explained compared to the standard linear PRS model by 22% for height, 27% for HDL cholesterol, 43% for body mass index, 50% for sleep duration, 58% for systolic blood pressure, 64% for total cholesterol, 66% for triglycerides, 77% for LDL cholesterol, and 100% for diastolic blood pressure. Multi-ancestry trained models perform similarly to specific racial/ethnic group trained models and are consistently superior to the standard linear PRS models. This work demonstrates an effective method to account for non-linearities and interaction effects in genetics-based prediction models.
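
The percentages in the abstract are relative increases in variance explained, not absolute gains. A minimal sketch of that arithmetic follows, assuming the usual definition (new minus baseline, divided by baseline). The function name `relative_increase` and the input values are hypothetical illustrations and are not taken from the article.

```python
def relative_increase(r2_baseline: float, r2_model: float) -> float:
    """Return the relative increase, in percent, of variance explained
    by a model over a baseline model."""
    if r2_baseline <= 0:
        raise ValueError("baseline variance explained must be positive")
    return 100.0 * (r2_model - r2_baseline) / r2_baseline


# Hypothetical values (not from the article): a linear PRS explaining 2%
# of the phenotypic variance and a combined model explaining 4%.
doubling = relative_increase(0.02, 0.04)
print(f"Relative increase: {doubling:.0f}%")
```

Under this definition, a relative increase of 100% means the variance explained has doubled, which can still be a small absolute gain when the baseline is low.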

Online Access

Open Access (eScholarship)
