Academic Paper

Optimization of data-intensive next generation sequencing in high performance computing
Document Type
Conference
Source
2015 IEEE 15th International Conference on Bioinformatics and Bioengineering (BIBE), pp. 1-6, Nov. 2015
Subject
Bioengineering
Computing and Processing
Bioinformatics
Genomics
Resource management
Scalability
Sequential analysis
Next generation networking
Software
Next Generation Sequencing
BWA
High Performance Computing
Human Genome Sequence
Thread Scalability
Data-Intensive Workload and Concurrent Parallelization
Abstract
Advances in Next Generation Sequencing (NGS) technology are associated with an ever-increasing volume of genomic data every year. These genomic data are processed efficiently by empirical parallelism using High Performance Computing (HPC). The processed data can be used for genome-wide association studies, genetics, personalized medicine, and many other areas. Different kinds of algorithms and implementations are used in the different phases of genome processing. In this paper, we used BWAKIT- and GATK-based software to process larger volumes of genomic data; we refer to this as the "NGS workflow at SIDRA". Within the NGS workflow, we used BWAKIT for genome alignment and GATK for variant discovery, which required heavy computation and large amounts of memory, respectively. We observed that CPU utilization does not exceed 45% during variant discovery; hence, it is necessary to understand the optimal selection of resources (in terms of number of threads or cores) during NGS workflow automation. We analyzed the performance bottlenecks and application optimization in terms of "scalability" (using the maximum available CPUs and memory) and "multiple instances of NGS workflow with different genome data within a node" (processing a larger volume of genome data concurrently with a limited set of CPUs and memory). We observed 40%, 65%, 71%, and 76% improvements in performance while processing 2, 4, 8, and 16 samples concurrently, respectively, using our own scheduling heuristics. As a result, our proposed NGS workflow automation improves performance by up to 76% compared to workflows based on application scalability.