Academic paper

Characterising the source of errors for metagenomic taxonomic classification
Document Type
Electronic Thesis or Dissertation
Subject
metagenomic taxonomic classification
microbial communities
Metagenomics
taxonomic identification
metagenomic study
bioinformatics
16S rRNA metagenomic data
in silico metagenome data
Term Frequency - Inverse Document Frequency
TF-IDF
Language
English
Abstract
Characterising microbial communities enables a better understanding of their complexity and their contribution to the environment. Metagenomics has been a rapidly expanding field since the next-generation sequencing revolution began, and it has a wide range of applications, including in medicine, agriculture, forensics, archaeology and even domestic use [Sarkar et al., 2021, Holman et al., 2017, Khodakova et al., 2014, Santiago-Rodriguez et al., 2017, Vilanova et al., 2015]. Amplicon sequencing data, such as 16S rRNA, are now commonly used to characterise the microbiome in a variety of biological samples. However, the correct taxonomic identification of reads remains a challenge, and short reads are often identified, correctly or not, at ranks of the taxonomic tree other than species or subspecies level. Every metagenomic study is designed for specific needs, and it is often complicated to find a suitable bioinformatics pipeline and reference database. There is currently a lack of systematic benchmarking of in-house methods for metagenomics. The work presented in this thesis aims to establish an approach for the in silico validation of 16S rRNA metagenomic data. A method to generate realistic in silico metagenome data that resembles project-specific sequencing data is presented, including a new process to generate synthetic negative controls for amplicon data, which can be employed regularly to assess the appropriateness and optimisation of methods for specific metagenomic projects. To aid the benchmarking process, new metrics have been defined based on a measure of taxonomic distance. A k-mer based method with the lowest common ancestor approach was selected to investigate a range of factors that influence meta-taxonomic classification success. The investigation includes a comparison of database quality filtered at various levels, as well as a comparison of different taxonomic annotation methodologies.
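As an illustration of the k-mer based lowest common ancestor (LCA) idea mentioned above, the following is a minimal sketch and not the thesis's own pipeline. The toy taxonomy, the reference sequences, the value k = 4 and all function names are assumptions made for the example. The assignment rule used here (take the LCA of every k-mer hit) is a deliberately naive simplification; established k-mer classifiers use more elaborate scoring. It does show why a read that contains k-mers shared between species is pushed to a higher rank.

```python
# Hypothetical toy taxonomy (child -> parent); "root" is the top of the tree.
PARENT = {
    "Escherichia coli": "Escherichia",
    "Escherichia fergusonii": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": "root",
}

# Made-up reference sequences, for illustration only.
REFERENCES = {
    "Escherichia coli": "ACGTACGGTTCA",
    "Escherichia fergusonii": "ACGTACGGAATC",
    "Salmonella enterica": "ACGTTTGGCCAA",
}


def lineage(taxon):
    """Return the path from a taxon up to the root, inclusive."""
    path = [taxon]
    while path[-1] != "root":
        path.append(PARENT[path[-1]])
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors = set(lineage(a))
    for node in lineage(b):
        if node in ancestors:
            return node
    return "root"


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(references, k):
    """Map each k-mer to the LCA of every reference taxon containing it."""
    index = {}
    for taxon, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer] = lca(index[kmer], taxon) if kmer in index else taxon
    return index


def classify(read, index, k):
    """Naive rule: assign the read to the LCA of all its k-mer hits."""
    hits = [index[kmer] for kmer in kmers(read, k) if kmer in index]
    if not hits:
        return "unclassified"
    result = hits[0]
    for taxon in hits[1:]:
        result = lca(result, taxon)
    return result


K = 4
INDEX = build_index(REFERENCES, K)
for read in ["GGTTCA", "TACGGTT", "ACGTAC", "TTTTTT"]:
    print(read, "->", classify(read, INDEX, K))
```

In this toy setting, a read made only of species-specific k-mers resolves to species, while a read overlapping a region shared within the genus or family is assigned at that higher rank.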
The experimental findings reveal the importance of having highly curated taxonomic annotations of the genetic sequences in the database, and show that a missing fraction of the tree of life can lead to misclassification of any related or unrelated organisms. In some cases, it is shown that longer reads can help to improve assignment, with mutations and sequencing errors having a relatively low negative impact. The marker gene 16S rRNA has well-defined conserved and variable regions, which help to distinguish species. Therefore, these regions were studied and also recalculated using information theory, to investigate which parts of the sequence are discriminative for metagenomic taxonomic identification. In addition, a method from linguistics, Term Frequency - Inverse Document Frequency (TF-IDF) coupled with multinomial naive Bayes, is shown to provide an understanding of genetic signatures and is applied to generate a new method to taxonomically classify metagenomic short reads. Biological samples were taken from the respiratory tract of cattle, and DNA was extracted and sequenced to provide metagenomic data. Two sets of experiments were carried out: (i) to compare sampling and extraction methods and (ii) to characterise the microbial community observed in young cattle in the different lung lobes and nose. The data reveal that the composition of the microbial community observed is highly dependent on the sampling method.
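To make the TF-IDF plus multinomial naive Bayes idea concrete, here is a minimal pure-Python sketch in which k-mers play the role of words and sequences play the role of documents. It is an illustration under stated assumptions, not the method developed in the thesis: the training sequences, the labels `taxon_A` and `taxon_B`, the choice k = 3, the smoothed IDF formula and the Laplace smoothing value are all hypothetical. In practice a library such as scikit-learn (`TfidfVectorizer`, `MultinomialNB`) would normally be used instead.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers (the 'words')."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def fit_idf(docs):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def tfidf(doc, idf):
    """Raw term count times IDF; terms unseen in training are dropped."""
    counts = Counter(t for t in doc if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def train_nb(vectors, labels, vocab, alpha=1.0):
    """Multinomial naive Bayes fitted on TF-IDF weights, with smoothing alpha."""
    weights = defaultdict(lambda: defaultdict(float))
    n_docs = Counter(labels)
    for vec, label in zip(vectors, labels):
        for term, w in vec.items():
            weights[label][term] += w
    model = {}
    for label in n_docs:
        total = sum(weights[label].values()) + alpha * len(vocab)
        model[label] = {
            "prior": math.log(n_docs[label] / len(labels)),
            "loglik": {t: math.log((weights[label][t] + alpha) / total)
                       for t in vocab},
        }
    return model


def predict(vec, model):
    """Return the label with the highest log posterior score."""
    def score(label):
        m = model[label]
        return m["prior"] + sum(w * m["loglik"][t] for t, w in vec.items())
    return max(model, key=score)


# Made-up training sequences for two hypothetical taxa.
TRAIN = [
    ("AAAACCCCAAAA", "taxon_A"),
    ("AAACCCAAACCC", "taxon_A"),
    ("GGGGTTTTGGGG", "taxon_B"),
    ("GGGTTTGGGTTT", "taxon_B"),
]
DOCS = [kmers(seq) for seq, _ in TRAIN]
LABELS = [label for _, label in TRAIN]
IDF = fit_idf(DOCS)
MODEL = train_nb([tfidf(d, IDF) for d in DOCS], LABELS, vocab=set(IDF))


def classify_read(read):
    """Classify a short read with the toy TF-IDF + naive Bayes model."""
    return predict(tfidf(kmers(read), IDF), MODEL)


for read in ["AACCCA", "GGTTTG"]:
    print(read, "->", classify_read(read))
```

The per-class log-likelihoods in `MODEL` indicate which k-mers carry the most weight for each label, which is one simple way such a model can expose candidate genetic signatures.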
