Academic Article

PyBDA: a command line tool for automated analysis of big biological data sets

Document Type

article

Author

Simon Dirmeier; Mario Emmenlauer; Christoph Dehio; Niko Beerenwinkel

Source

BMC Bioinformatics, Vol 20, Iss 1, Pp 1-6 (2019)

Subject

Big data
Data analysis
Command line
Pipeline
Computing cluster
Grid engine
Computer applications to medicine. Medical informatics
R858-859.7
Biology (General)
QH301-705.5

Language

English

ISSN

1471-2105

Abstract

Background: Analysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians due to the lack of accessible tools that scale to hundreds of millions of data points.
Results: We developed a novel machine learning command line tool called PyBDA for automated, distributed analysis of big biological data sets. By using Apache Spark in the backend, PyBDA scales to data sets beyond the size of current applications. It uses Snakemake to automatically schedule jobs on a high-performance computing cluster. We demonstrate the utility of the software by analysing image-based RNA interference data of 150 million single cells.
Conclusion: PyBDA allows automated, easy-to-use data analysis using common statistical methods and machine learning algorithms. It can be used entirely through simple command line calls, making it accessible to a broad user base. PyBDA is available at https://pybda.rtfd.io.

