Academic Paper

BIOPET: Towards Scalable, Maintainable, User-Friendly, Robust and Flexible NGS Data Analysis Pipelines
Document Type
Conference
Source
2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 823-829, May 2017
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Grid computing
Next Generation Sequencing
pipeline framework
reentrancy
reproducibility
scalable
Fault tolerance
HPC cluster
Language
Abstract
Because of the rapid decrease in sequencing cost, more research and clinical institutes are generating Next Generation Sequencing (NGS) data at an increasing and impressive scale. University Medical Centers in the Netherlands each sequence thousands of patients a year as part of their routine diagnosis. On the research front, the GoNL and BIOS projects, coordinated by the BBMRI-NL consortium, have sequenced 770 whole-genome DNA samples and over 4000 RNA samples collected from a number of Dutch biobanks. In 2016, the deployment of the Illumina X Ten sequencer at the Hartwig Medical Foundation provided a sequencing capacity of 18,000 whole-genome DNA samples per year. Processing these petabyte-scale datasets requires revolutionary thinking and solutions in the computing and storage infrastructure and in the data analysis pipelines. At Leiden University Medical Center, we have developed BIOPET (Bioinformatics Pipeline Execution Toolkit), an open source pipeline framework based on GATK-Queue. We implemented all our commonly used NGS tools as Queue modules in the form of Scala classes. Together with the tools already supported in GATK-Queue, such as GATK variant calling and the Picard tools, this gives us a full set of NGS tools as Scala classes that are further combined into pipeline functions. Besides meeting the standard requirements for NGS pipelines, such as reentrancy, the BIOPET framework offers a list of advanced features, such as live debugging, test and meta-analysis frameworks, and easy deployment. The BIOPET framework can run on various types of HPC infrastructure through its DRMAA support, e.g., SGE, SLURM, and PBS.