Academic Journal Article

Statistical phasing of 150,119 sequenced genomes in the UK Biobank.
Document Type
Article
Source
American Journal of Human Genetics. Jan2023, Vol. 110 Issue 1, p161-165. 5p.
Subject
*X chromosome
*CHROMOSOMES
*ERROR rates
*GENE frequency
*GENOTYPES
Language
ISSN
0002-9297
Abstract
The first release of UK Biobank whole-genome sequence data contains 150,119 genomes. We present an open-source pipeline for filtering, phasing, and indexing these genomes on the cloud-based UK Biobank Research Analysis Platform. This pipeline makes it possible to apply haplotype-based methods to UK Biobank whole-genome sequence data. The pipeline uses BCFtools for marker filtering, Beagle for genotype phasing, and Tabix for VCF indexing. We used the pipeline to phase 406 million single-nucleotide variants on chromosomes 1–22 and X at a cost of £2,309. The maximum time required to process a chromosome was 2.6 days. In order to assess phase accuracy, we modified the pipeline to exclude trio parents. We observed a switch error rate of 0.0016 on chromosome 20 in the White British trio offspring. If we exclude markers with nonmajor allele frequency < 0.1% after phasing, this switch error rate decreases by 80% to 0.00032. [ABSTRACT FROM AUTHOR]
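Note
The switch error rate quoted in the abstract can be illustrated with a minimal sketch. It assumes the commonly used definition (the fraction of consecutive heterozygous-site pairs whose inferred relative phase disagrees with the true phase), which may differ in detail from the authors' exact procedure. The function names and the toy haplotypes are hypothetical and are not taken from the article or its pipeline.

```python
from typing import List, Tuple

Genotype = Tuple[int, int]  # (allele on haplotype 1, allele on haplotype 2)


def switch_error_rate(truth: List[Genotype], inferred: List[Genotype]) -> float:
    """Fraction of consecutive heterozygous-site pairs with a phase switch.

    Assumed definition (not confirmed by the source): at each heterozygous
    site, record whether the inferred haplotype order matches the true one;
    a switch is a change in that orientation between neighbouring
    heterozygous sites.
    """
    if len(truth) != len(inferred):
        raise ValueError("truth and inferred must cover the same markers")

    orientations = []
    for (t1, t2), (i1, i2) in zip(truth, inferred):
        if t1 == t2:
            continue  # homozygous sites carry no phase information
        if {t1, t2} != {i1, i2}:
            raise ValueError("inferred genotype differs from true genotype")
        orientations.append(t1 == i1)

    if len(orientations) < 2:
        return 0.0  # no pair of heterozygous sites to compare

    switches = sum(a != b for a, b in zip(orientations, orientations[1:]))
    return switches / (len(orientations) - 1)


def relative_reduction(before: float, after: float) -> float:
    """Relative decrease from `before` to `after`, as a fraction of `before`."""
    return (before - after) / before


# Toy example with five markers, one of them homozygous.
toy_truth = [(0, 1), (0, 1), (1, 0), (0, 0), (0, 1)]
toy_inferred = [(0, 1), (1, 0), (0, 1), (0, 0), (0, 1)]
toy_rate = switch_error_rate(toy_truth, toy_inferred)

# The two rates reported in the abstract for chromosome 20.
reported_reduction = relative_reduction(0.0016, 0.00032)
```

In the toy data, four markers are heterozygous and the orientation changes twice across the three consecutive pairs, so `toy_rate` is 2/3. Applying `relative_reduction` to the two reported rates gives approximately 0.8, consistent with the 80% decrease stated in the abstract.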