Academic Article

The Bacteria Genome Pipeline (BAGEP): an automated, scalable workflow for bacteria genomes with Snakemake
Document Type
Academic Journal
Source
PeerJ. October 27, 2020, Vol. 8 e10121
Subject
Bacteria -- Genetic aspects
Visualization (Computer)
Quality control
Bacterial genetics -- Genetic aspects
Genes -- Genetic aspects
Microbial drug resistance -- Genetic aspects
Single nucleotide polymorphisms -- Genetic aspects
Workflow software -- Quality management
Phylogeny -- Genetic aspects
Genomes -- Genetic aspects
Genomics -- Genetic aspects
Biological sciences
Workflow software
Quality management
International economic relations
Genetic aspects
Language
English
ISSN
2167-8359
Abstract
Next generation sequencing technologies have become more accessible and affordable over the years, and the entire genome sequences of several pathogens can now be deciphered in a few hours. However, there is a need to analyze multiple genomes within a short time in order to provide critical information about a pathogen of interest, such as drug resistance, mutations and the genetic relationship of isolates in an outbreak setting. Many existing pipelines for this purpose are stand-alone workflows that demand substantial computational resources to analyze multiple genomes. We present an automated and scalable pipeline for monomorphic bacteria called BAGEP. It performs quality control on paired-end FASTQ files, scans reads for contaminants using a taxonomic classifier, maps reads to a reference genome of choice for variant detection, detects antimicrobial resistance (AMR) genes, constructs a phylogenetic tree from core genome alignments and provides interactive single nucleotide polymorphism (SNP) visualization across core genomes in the data set. The objective of our research was to create an easy-to-use pipeline from existing bioinformatics tools that can be deployed on a personal computer. The pipeline was built on the Snakemake framework and uses an existing tool for each processing step: fastp for quality trimming, snippy for variant calling, Centrifuge for taxonomic classification, Abricate for AMR gene detection, snippy-core for generating whole and core genome alignments, IQ-TREE for phylogenetic tree construction and vcfR for an interactive heatmap visualization that shows SNPs at specific locations across the genomes. BAGEP was successfully tested and validated with Mycobacterium tuberculosis (n=20) and Salmonella enterica serovar Typhi (n=20) genomes, which are about 4.4 million and 4.8 million base pairs long, respectively. On a laptop with 8 GB of RAM and a 2.5 GHz quad-core processor, the analysis of these test data sets took 122 and 61 minutes, respectively.
BAGEP is fast, calls SNPs accurately and is easy to run on a mid-range laptop; it is freely available at https://github.com/idolawoye/BAGEP.
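The step ordering described in the abstract can be illustrated with a small Python sketch. It is not BAGEP's actual Snakemake code: the step names and the dependency edges between steps are assumptions inferred from the order in which the abstract lists them, and only the tool names and the run times come from the abstract. The helper names `execution_order` and `minutes_per_genome` are hypothetical. The sketch uses the standard-library `graphlib` module (Python 3.9 or later) to derive one valid execution order, and computes the average run time per genome from the reported totals.

```python
from graphlib import TopologicalSorter

# step name -> (tool named in the abstract, upstream steps)
# NOTE: step names and dependency edges are assumptions inferred from the
# abstract, not taken from the BAGEP repository.
STEPS = {
    "quality_trimming": ("fastp", []),
    "taxonomic_classification": ("Centrifuge", ["quality_trimming"]),
    "variant_calling": ("snippy", ["quality_trimming"]),
    "amr_detection": ("Abricate", ["variant_calling"]),
    "core_alignment": ("snippy-core", ["variant_calling"]),
    "phylogeny": ("IQ-TREE", ["core_alignment"]),
    "snp_heatmap": ("vcfR", ["core_alignment"]),
}


def execution_order(steps):
    """Return one valid order in which the steps could run."""
    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {name: set(deps) for name, (_tool, deps) in steps.items()}
    return list(TopologicalSorter(graph).static_order())


def minutes_per_genome(total_minutes, n_genomes):
    """Average wall-clock minutes spent per genome."""
    return total_minutes / n_genomes


if __name__ == "__main__":
    for step in execution_order(STEPS):
        print(f"{step}: {STEPS[step][0]}")
    # Run times and sample sizes as reported in the abstract.
    print("M. tuberculosis:", minutes_per_genome(122, 20), "min/genome")
    print("S. Typhi:", minutes_per_genome(61, 20), "min/genome")
```

A workflow manager such as Snakemake resolves this kind of dependency graph itself from the input and output files declared by each rule, so the explicit graph here serves only to make the ordering visible.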
Author(s): Idowu B. Olawoye (1,2), Simon D.W. Frost (3,4), Christian T. Happi (1,2)
Introduction
Over the years, as next generation sequencing has rapidly become popular, molecular biology has taken a [...]