Academic article
Gaps and complex structurally variant loci in phased genome assemblies
Document Type
article
Author
Porubsky, David; Vollger, Mitchell R; Harvey, William T; Rozanski, Allison N; Ebert, Peter; Hickey, Glenn; Hasenfeld, Patrick; Sanders, Ashley D; Stober, Catherine; Human Pangenome Reference Consortium; Korbel, Jan O; Paten, Benedict; Marschall, Tobias; Eichler, Evan E; Abel, Haley J; Antonacci-Fulton, Lucinda L; Asri, Mobin; Baid, Gunjan; Baker, Carl A; Belyaeva, Anastasiya; Billis, Konstantinos; Bourque, Guillaume; Buonaiuto, Silvia; Carroll, Andrew; Chaisson, Mark JP; Chang, Pi-Chuan; Chang, Xian H; Cheng, Haoyu; Chu, Justin; Cody, Sarah; Colonna, Vincenza; Cook, Daniel E; Cook-Deegan, Robert M; Cornejo, Omar E; Diekhans, Mark; Doerr, Daniel; Ebler, Jana; Eizenga, Jordan M; Fairley, Susan; Fedrigo, Olivier; Felsenfeld, Adam L; Feng, Xiaowen; Fischer, Christian; Flicek, Paul; Formenti, Giulio; Frankish, Adam; Fulton, Robert S; Gao, Yan; Garg, Shilpa; Garrison, Erik; Garrison, Nanibaa’ A; Giron, Carlos Garcia; Green, Richard E; Groza, Cristian; Guarracino, Andrea; Haggerty, Leanne; Hall, Ira M; Haukness, Marina; Haussler, David; Heumos, Simon; Hoekzema, Kendra; Hourlier, Thibaut; Howe, Kerstin; Jain, Miten; Jarvis, Erich D; Ji, Hanlee P; Kenny, Eimear E; Koenig, Barbara A; Kolesnikov, Alexey; Kordosky, Jennifer; Koren, Sergey; Lee, HoJoon; Lewis, Alexandra P; Li, Heng; Liao, Wen-Wei; Lu, Shuangjia; Lu, Tsung-Yu; Lucas, Julian K; Magalhães, Hugo; Marco-Sola, Santiago; Marijon, Pierre; Markello, Charles; Martin, Fergal J; McCartney, Ann; McDaniel, Jennifer; Miga, Karen H; Mitchell, Matthew W; Monlong, Jean; Mountcastle, Jacquelyn; Munson, Katherine M; Mwaniki, Moses Njagi; Nattestad, Maria; Novak, Adam M; Nurk, Sergey
Source
Genome Research. 33(4)
Abstract
There has been tremendous progress in phased genome assembly production by combining long-read data with parental information or linked-read data. Nevertheless, a typical phased genome assembly generated by trio-hifiasm still contains more than 140 gaps. We perform a detailed analysis of gaps, assembly breaks, and misorientations from 182 haploid assemblies obtained from a diversity panel of 77 unique human samples. Although trio-based approaches using HiFi are the current gold standard, chromosome-wide phasing accuracy is comparable when using Strand-seq instead of parental data. Importantly, the majority of assembly gaps cluster near the largest and most identical repeats (including segmental duplications [35.4%], satellite DNA [22.3%], or regions enriched in GA/AT-rich DNA [27.4%]). Consequently, 1513 protein-coding genes overlap assembly gaps in at least one haplotype, and 231 are recurrently disrupted or missing from five or more haplotypes. Furthermore, we estimate that 6-7 Mbp of DNA are misorientated per haplotype irrespective of whether trio-free or trio-based approaches are used. Of these misorientations, 81% correspond to bona fide large inversion polymorphisms in the human species, most of which are flanked by large segmental duplications. We also identify large-scale alignment discontinuities consistent with 11.9 Mbp of deletions and 161.4 Mbp of insertions per haploid genome. Although 99% of this variation corresponds to satellite DNA, we identify 230 regions of euchromatic DNA with frequent expansions and contractions, nearly half of which overlap with 197 protein-coding genes. Such variable and incompletely assembled regions are important targets for future algorithmic development and pangenome representation.