Academic Paper
DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer
Document Type
Article
Author
Baid, Gunjan; Cook, Daniel E.; Shafin, Kishwar; Yun, Taedong; Llinares-López, Felipe; Berthet, Quentin; Belyaeva, Anastasiya; Töpfer, Armin; Wenger, Aaron M.; Rowell, William J.; Yang, Howard; Kolesnikov, Alexey; Ammar, Waleed; Vert, Jean-Philippe; Vaswani, Ashish; McLean, Cory Y.; Nattestad, Maria; Chang, Pi-Chuan; Carroll, Andrew
Source
Nature Biotechnology; February 2023, Vol. 41, Issue 2, pp. 232-238 (7 pages)
ISSN
1087-0156; 1546-1696
Abstract
Circular consensus sequencing with Pacific Biosciences (PacBio) technology generates long (10–25 kilobases), accurate ‘HiFi’ reads by combining serial observations of a DNA molecule into a consensus sequence. The standard approach to consensus generation, pbccs, uses a hidden Markov model. We introduce DeepConsensus, which uses an alignment-based loss to train a gap-aware transformer–encoder for sequence correction. Compared to pbccs, DeepConsensus reduces read errors by 42%. This increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27% and at Q40 by 90%. With two SMRT Cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9 megabases (Mb) to 17.2 Mb), increase gene completeness (94% to 97%), reduce the false gene duplication rate (1.1% to 0.5%), improve assembly base accuracy (Q43 to Q45) and reduce variant-calling errors by 24%. DeepConsensus models could be trained for the general problem of analyzing the alignment of other types of sequences, such as unique molecular identifiers or genome assemblies.
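The Q20, Q30, Q40, Q43 and Q45 figures in the abstract are Phred-scaled quality values, where Q = -10 log10(p) and p is the per-base error probability. The sketch below is only an illustration of that standard conversion, not code from DeepConsensus or pbccs; the function names are hypothetical and chosen for this example.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality value Q to a per-base error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert a per-base error probability to a Phred quality value Q."""
    return -10 * math.log10(p)


def relative_error_reduction(q_before: float, q_after: float) -> float:
    """Fraction of errors removed when quality rises from q_before to q_after."""
    return 1 - phred_to_error_prob(q_after) / phred_to_error_prob(q_before)


if __name__ == "__main__":
    # Thresholds quoted in the abstract for read yield.
    for q in (20, 30, 40):
        print(f"Q{q}: error probability {phred_to_error_prob(q):g}")

    # Assembly base accuracy change quoted in the abstract (Q43 to Q45).
    print(f"Q43 -> Q45 removes {relative_error_reduction(43, 45):.0%} of errors")
```

Under this definition, Q20 corresponds to 1 error in 100 bases, Q30 to 1 in 1,000 and Q40 to 1 in 10,000, and a move from Q43 to Q45 corresponds to roughly a 37% reduction in per-base errors.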