Academic Paper

Improve Spark-based Application Performance Using Minimizer

Document Type

Conference

Author

Wu, Jinda; Deng, Li; Wang, Lili; Li, Kexue; Lu, Yakang; Song, Yang

Source

2020 IEEE 9th Data Driven Control and Learning Systems Conference (DDCLS), pp. 595-599, Nov. 2020

Subject

Computing and Processing
Robotics and Control Systems
Signal Processing and Analysis
Sequential analysis
Genomics
Bioinformatics
Standards
Task analysis
DNA
Correlation
Metagenome assembly
Minimizer
Apache Spark
SpaRC

Language

Abstract

SpaRC (Spark Reads Clustering) is a generic sequence clustering algorithm built on Apache Spark that provides a scalable solution for billions of reads. SpaRC measures the correlation between reads using k-mers. This approach completes computing tasks effectively when the amount of data is small, but as the data volume grows, its long running time and high memory consumption become increasingly prominent. Here we explore a sequence similarity measure that alleviates these problems by using minimizers, rather than the traditional k-mers, to compare reads. The method combines the minimizer strategy with overlap rate information extracted from the reads to measure the sequence similarity between different reads. Results indicate that the method greatly improves clustering performance and, compared with the traditional k-mer method, makes SpaRC's use of memory resources more efficient.
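
To illustrate the general idea of replacing k-mers with minimizers, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names, the lexicographic ordering used to pick a minimizer, and the use of a Jaccard index as a stand-in for the "overlap rate" are all assumptions made for illustration, and the record does not specify the values of k and w.

```python
def kmers(seq, k):
    """Return every k-mer of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizers(seq, k, w):
    """Return the set of (w, k)-minimizers of seq.

    Assumption: the minimizer of a window is the lexicographically
    smallest k-mer among w consecutive k-mers.
    """
    ks = kmers(seq, k)
    if not ks:
        return set()
    if len(ks) < w:
        # Read too short for a full window: fall back to one minimizer.
        return {min(ks)}
    return {min(ks[i:i + w]) for i in range(len(ks) - w + 1)}


def overlap_rate(read_a, read_b, k, w):
    """Jaccard index of the two reads' minimizer sets.

    Hypothetical stand-in for the overlap rate mentioned in the abstract.
    """
    ma = minimizers(read_a, k, w)
    mb = minimizers(read_b, k, w)
    union = ma | mb
    return len(ma & mb) / len(union) if union else 0.0
```

Because every minimizer is itself a k-mer of the read, the minimizer set can never be larger than the k-mer set, which is the intuition behind the reduced memory footprint described above.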
