Academic Article

Unsupervised discovery of ancestry-informative markers and genetic admixture proportions in biobank-scale datasets
Document Type
article
Source
American Journal of Human Genetics. 110(2)
Subject
Biological Sciences
Genetics
Human Genome
Generic health relevance
Humans
Genome-Wide Association Study
Likelihood Functions
Biological Specimen Banks
Population Groups
Software
Genetics, Population
AIM
OpenADMIXTURE
OpenMendel
SKFR
admixture
ancestry-informative marker
biobank scale
genetic ancestry
sparse K-means with feature ranking
sparse clustering
Medical and Health Sciences
Genetics & Heredity
Biological sciences
Biomedical and clinical sciences
Health sciences
Language
Abstract
Admixture estimation plays a crucial role in ancestry inference and genome-wide association studies (GWASs). Computer programs such as ADMIXTURE and STRUCTURE are commonly employed to estimate the admixture proportions of sample individuals. However, these programs can be overwhelmed by the computational burdens imposed by the 10^5 to 10^6 samples and millions of markers commonly found in modern biobanks. An attractive strategy is to run these programs on a set of ancestry-informative SNP markers (AIMs) that exhibit substantially different frequencies across populations. Unfortunately, existing methods for identifying AIMs require knowing ancestry labels for a subset of the sample. This supervised learning approach creates a chicken-and-egg scenario. In this paper, we present an unsupervised, scalable framework that seamlessly carries out AIM selection and likelihood-based estimation of admixture proportions. Our simulated and real data examples show that this approach is scalable to modern biobank datasets. OpenADMIXTURE, our Julia implementation of the method, is open source and freely available.