학술논문

Finding Overlapping Rmaps via Clustering
Document Type
Periodical
Source
IEEE/ACM Transactions on Computational Biology and Bioinformatics IEEE/ACM Trans. Comput. Biol. and Bioinf. Computational Biology and Bioinformatics, IEEE/ACM Transactions on. 19(6):3114-3123 Jan, 2022
Subject
Bioengineering
Computing and Processing
Genomics
Comets
Bioinformatics
DNA
Adaptive optics
Biomedical optical imaging
Error correction
Optical mapping
pairwise alignment
error correction
assembly
clustering
Language
ISSN
1545-5963
1557-9964
2374-0043
Abstract
Optical mapping is a method for creating high resolution restriction maps of an entire genome. Optical mapping has been largely automated, and first produces single molecule restriction maps, called Rmaps, which are assembled to generate genome wide optical maps. Since the location and orientation of each Rmap is unknown, the first problem in the analysis of this data is finding related Rmaps, i.e., pairs of Rmaps that share the same orientation and have significant overlap in their genomic location. Although heuristics for identifying related Rmaps exist, they all require quantization of the data which leads to a loss in the precision. In this paper, we propose a Gaussian mixture modelling clustering based method, which we refer to as OMclust, that finds overlapping Rmaps without quantization. Using both simulated and real datasets, we show that OMclust substantially improves the precision (from 48.3% to 73.3%) over the state-of-the art methods while also reducing CPU time and memory consumption. Further, we integrated OMclust into the error correction methods (Elmeri and cOMet) to demonstrate the increase in the performance of these methods. When OMclust was combined with cOMet to error correct Rmap data generated from human DNA, it was able to error correct close to 3x more Rmaps, and reduced the CPU time by more than 35x. Our software is written in C++ and is publicly available under GNU General Public License at https://github.com/kingufl/OMclust.