Academic Paper

Improve Spark-based Application Performance Using Minimizer
Document Type
Conference
Source
2020 IEEE 9th Data Driven Control and Learning Systems Conference (DDCLS), pp. 595-599, Nov. 2020
Subject
Computing and Processing
Robotics and Control Systems
Signal Processing and Analysis
Sequential analysis
Genomics
Bioinformatics
Standards
Task analysis
DNA
Correlation
Metagenome assembly
Minimizer
Apache Spark
SpaRC
Language
Abstract
SpaRC (Spark Reads Clustering) is a generic sequence clustering algorithm based on Spark that provides a scalable solution for billions of reads. SpaRC measures the correlation between reads using k-mers. This approach completes computing tasks effectively when the amount of data is small, but as the amount of data increases, its long running time and large memory consumption become increasingly prominent. Here we explore a sequence similarity measurement method that alleviates these problems by using minimizers, instead of k-mers, to measure sequence similarity between reads. The method combines the minimizer strategy with overlap rate information extracted from the reads to measure the similarity between different reads. Results indicate that the method offers great improvement in clustering performance. Compared with the traditional k-mer method, it makes SpaRC's use of memory resources more efficient.
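
As an illustration of the idea in the abstract, the following is a minimal Python sketch of minimizer-based similarity between two reads. It is not the paper's implementation: the function names, the parameters `k` and `w`, the lexicographic ordering of k-mers, and the use of a Jaccard ratio as a stand-in for the "overlap rate" are all assumptions made here, and reverse complements and the Spark pipeline are omitted.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizers(seq, k, w):
    """Return the set of (w, k)-minimizers of seq.

    For every window of w consecutive k-mers, keep the lexicographically
    smallest k-mer (an assumed ordering; hashes are also commonly used).
    """
    km = kmers(seq, k)
    if not km:
        return set()
    if len(km) <= w:
        # Read is shorter than one full window: keep its single smallest k-mer.
        return {min(km)}
    return {min(km[i:i + w]) for i in range(len(km) - w + 1)}


def minimizer_similarity(a, b, k=4, w=3):
    """Jaccard ratio of the two reads' minimizer sets.

    Hypothetical stand-in for the overlap rate mentioned in the abstract.
    """
    ma, mb = minimizers(a, k, w), minimizers(b, k, w)
    if not ma or not mb:
        return 0.0
    return len(ma & mb) / len(ma | mb)
```

Because each window contributes only one k-mer, a read is represented by a minimizer set that is never larger than its full k-mer set. This is the general reason minimizers can lower memory use relative to comparing all k-mers.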