Academic Article
Benchmarking challenging small variants with linked and long reads.
Document Type
article
Author
Wagner, Justin; Olson, Nathan; Harris, Lindsay; Khan, Ziad; Farek, Jesse; Mahmoud, Medhat; Stankovic, Ana; Kovacevic, Vladimir; Yoo, Byunggil; Miller, Neil; Rosenfeld, Jeffrey; Ni, Bohan; Zarate, Samantha; Kirsche, Melanie; Aganezov, Sergey; Schatz, Michael; Narzisi, Giuseppe; Byrska-Bishop, Marta; Clarke, Wayne; Evani, Uday; Markello, Charles; Shafin, Kishwar; Zhou, Xin; Sidow, Arend; Bansal, Vikas; Ebert, Peter; Marschall, Tobias; Lansdorp, Peter; Hanlon, Vincent; Mattsson, Carl-Adam; Barrio, Alvaro; Fiddes, Ian; Xiao, Chunlin; Fungtammasan, Arkarachai; Chin, Chen-Shan; Wenger, Aaron; Rowell, William; Sedlazeck, Fritz; Carroll, Andrew; Salit, Marc; Zook, Justin
Source
Cell Genomics. 2(5)
Abstract
Genome in a Bottle benchmarks are widely used to help validate clinical sequencing pipelines and to develop variant calling and sequencing methods. Here we use accurate linked and long reads to expand benchmarks in seven samples to include difficult-to-map regions and segmental duplications that are challenging for short reads. These benchmarks add more than 300,000 SNVs and 50,000 insertions or deletions (indels) and include 16% more exonic variants, many in challenging, clinically relevant genes not covered previously, such as PMS2. For HG002, we include 92% of the autosomal GRCh38 assembly, compared with the previous version, which included 85% of GRCh38. At the same time, we exclude regions that are problematic for benchmarking small variants, such as copy number variants, and that should not have been included in the previous version. The new benchmark identifies eight times more false negatives in a short-read variant call set than our previous benchmark. We demonstrate that this benchmark reliably identifies false positives and false negatives across technologies, enabling ongoing methods development.