학술논문

Short Read Mapping: An Algorithmic Tour
Document Type
Periodical
Source
Proceedings of the IEEE Proc. IEEE Proceedings of the IEEE. 105(3):436-458 Mar, 2017
Subject
General Topics for Engineers
Engineering Profession
Aerospace
Bioengineering
Components, Circuits, Devices and Systems
Computing and Processing
Engineered Materials, Dielectrics and Plasmas
Fields, Waves and Electromagnetics
Geoscience
Nuclear Engineering
Robotics and Control Systems
Signal Processing and Analysis
Transportation
Power, Energy and Industry Applications
Communication, Networking and Broadcast Technologies
Photonics and Electrooptics
Genomics
Bioinformatics
Sequential analysis
DNA
Throughput
Algorithm design and analysis
Burrows–Wheeler transform
DNA sequencing
sequence alignment
string matching
suffix trees
Language
ISSN
0018-9219
1558-2256
Abstract
Ultra-high-throughput next-generation sequencing (NGS) technology allows us to determine the sequence of nucleotides of many millions of DNA molecules in parallel. Accompanied by a dramatic reduction in cost since its introduction in 2004, NGS technology has provided a new way of addressing a wide range of biological and biomedical questions, from the study of human genetic disease to the analysis of gene expression, protein–DNA interactions, and patterns of DNA methylation. The data generated by NGS instruments comprise huge numbers of very short DNA sequences, or “reads,” that carry little information by themselves. These reads therefore have to be pieced together by well-engineered algorithms to reconstruct biologically meaningful measurements, such as the level of expression of a gene. To solve this complex, high-dimensional puzzle, reads must be mapped back to a reference genome to determine their origin. Due to sequencing errors and to genuine differences between the reference genome and the individual being sequenced, this mapping process must be tolerant of mismatches, insertions, and deletions. Although optimal alignment algorithms to solve this problem have long been available, the practical requirements of aligning hundreds of millions of short reads to the 3-billion-base-pair-long human genome have stimulated the development of new, more efficient methods, which today are used routinely throughout the world for the analysis of NGS data.