Academic Journal Article

Predicting the animal hosts of coronaviruses from compositional biases of spike protein and whole genome sequences through machine learning.
Document Type
Article
Source
PLoS Pathogens. 4/20/2021, Vol. 17 Issue 4, p1-21. 21p.
Subject
*VIRAL genomes
*MACHINE learning
*EMERGING infectious diseases
*COVID-19 pandemic
*CORONAVIRUSES
*SARS-CoV-2
*COVID-19
Language
English
ISSN
1553-7366
Abstract
The COVID-19 pandemic has demonstrated the serious potential for novel zoonotic coronaviruses to emerge and cause major outbreaks. The immediate animal origin of the causative virus, SARS-CoV-2, remains unknown, and identifying such origins is a notoriously challenging task for emerging disease investigations. Coevolution with hosts leads to specific evolutionary signatures within viral genomes that can inform likely animal origins. We obtained a set of 650 spike protein and 511 whole genome nucleotide sequences from 222 and 185 viruses belonging to the family Coronaviridae, respectively. We then trained random forest models independently on genome composition biases of spike protein and whole genome sequences, including dinucleotide and codon usage biases, to predict animal host (one of nine possible categories, including human). In hold-one-out cross-validation, predictive accuracy on unseen coronaviruses consistently reached ~73%, indicating that the evolutionary signal in spike proteins is just as informative as that in whole genome sequences. However, different composition biases were informative in each case. Applying optimised random forest models to classify human sequences of MERS-CoV and SARS-CoV revealed evolutionary signatures consistent with their recognised intermediate hosts (camelids, carnivores), while human sequences of SARS-CoV-2 were predicted as having bat hosts (suborder Yinpterochiroptera), supporting bats as the suspected origins of the current pandemic. In addition to phylogeny, variation in genome composition can act as an informative approach to predict emerging virus traits as soon as sequences are available. More widely, this work demonstrates the potential in combining genetic resources with machine learning algorithms to address long-standing challenges in emerging infectious diseases.
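To make "dinucleotide bias" concrete, the following is a minimal sketch of one common formulation: the observed frequency of each dinucleotide divided by the frequency expected from its two component bases. The abstract does not give the exact formula used in the article, so the function name `dinucleotide_bias` and the handling of ambiguous bases are assumptions for illustration only.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def dinucleotide_bias(seq):
    """Return an observed/expected ratio for each of the 16 dinucleotides.

    Assumed formulation (not confirmed by the abstract):
        bias(XY) = f(XY) / (f(X) * f(Y))
    A ratio above 1 means XY is over-represented; below 1, under-represented.
    """
    # Treat RNA input as DNA so that U and T are counted together.
    seq = seq.upper().replace("U", "T")

    # Mononucleotide counts, ignoring ambiguous symbols such as N.
    mono = Counter(b for b in seq if b in BASES)
    n_mono = sum(mono.values())

    # Overlapping dinucleotide counts, skipping pairs with ambiguous symbols.
    pairs = (seq[i:i + 2] for i in range(len(seq) - 1))
    di = Counter(p for p in pairs if p[0] in BASES and p[1] in BASES)
    n_di = sum(di.values())

    bias = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono) if n_mono else 0.0
        observed = di[x + y] / n_di if n_di else 0.0
        # The ratio is undefined when a component base never occurs.
        bias[x + y] = observed / expected if expected > 0 else float("nan")
    return bias
```

Each sequence then contributes a 16-value vector that could serve as part of a feature set for a classifier; codon usage biases would be computed separately.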
Author summary: New zoonotic viruses remain a major threat to global health and the COVID-19 pandemic has shown the specific potential of coronaviruses to cause widespread disease burden and economic damage. Tracing the origins of these zoonotic viruses is extremely challenging and usually requires substantial effort. However, there is potential to uncover which animals may be the host origin of viruses by using 'signatures' within viral genomes generated by long-term coevolution. We investigated this by calculating 116 genomic features of spike protein sequences and whole genome sequences from approximately 200 coronaviruses. We used random forests, a machine learning approach, to train separate models that predict broad host type from genomic information in spike proteins or whole genomes. Models trained on spike proteins achieved similar performance to that of whole genomes, reiterating the importance of this protein for host-virus interactions and likelihood of cross-species transmission. When applied to SARS-CoV-2, the causative virus of COVID-19, model predictions suggested a bat origin, consistent with estimations elsewhere using more traditional phylogenetic analyses. This work demonstrates the potential of machine learning to infer the ecology of new zoonotic viruses directly from genetic sequences, giving a rapid methodology to assist in tracing the origins of outbreaks. [ABSTRACT FROM AUTHOR]
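The abstract reports accuracy "on unseen coronaviruses" under hold-one-out cross-validation. One plausible reading, assumed here and not confirmed by the record, is that all sequences of a single virus are withheld together in each fold so the model is never tested on a virus it was trained on. The sketch below generates such splits in plain Python; `hold_one_group_out` and the example virus labels are hypothetical names.

```python
from collections import defaultdict


def hold_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each group.

    `groups` gives one label per sequence (assumed here to be the virus
    each sequence belongs to). Every sequence of the held-out group goes
    to the test set; all other sequences form the training set.
    """
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)

    for group, test_idx in by_group.items():
        held_out = set(test_idx)
        train_idx = [i for i in range(len(groups)) if i not in held_out]
        yield group, train_idx, test_idx


# Hypothetical labels: five sequences drawn from three viruses.
virus_of_sequence = ["virusA", "virusA", "virusB", "virusC", "virusC"]
splits = list(hold_one_group_out(virus_of_sequence))
```

In each fold a classifier would be fitted on the training indices and scored on the test indices; averaging the per-fold scores gives an accuracy estimate for viruses the model has not seen.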