Viruses are biological entities that rely on host cells for survival. Depending on their genetic material and mode of replication, they can be grouped into double-stranded DNA (dsDNA) viruses, single-stranded DNA (ssDNA) viruses, double-stranded RNA (dsRNA) viruses, positive-sense single-stranded RNA (+ssRNA) viruses, negative-sense single-stranded RNA (−ssRNA) viruses, ssRNA reverse transcriptase viruses (ssRNA-RT) and dsDNA reverse transcriptase viruses (dsDNA-RT) (Walker et al. 2020). Viruses can infect most kinds of biological entities, including other viruses, bacteria, archaea and eukaryotes (La Scola et al. 2008; Fermin, 2018). They have a great impact on the Earth by shaping bacterial population dynamics and balancing the global ecosystem (Suttle, 2007). For humans, viruses can on the one hand cause high morbidity and mortality and serious economic loss (Baud et al. 2020); on the other hand, they can promote and maintain the healthy balance of the gut microbiome (Seo and Kweon, 2019). In addition, some phages can be used to treat bacterial infections, especially those caused by strains resistant to multiple antibiotics (Altamirano and Barr, 2019).
Viromics studies based on high-throughput sequencing technology have become increasingly popular in recent years, and novel viruses are being discovered at an unprecedented pace (Gregory et al. 2019). For example, the Tara Oceans Project recently identified 195,728 viral populations, more than 10 times as many as in the previously known global ocean DNA virome (Gregory et al. 2019). However, several challenges exist in analyzing the sequencing data from viromics studies. Firstly, it is difficult to identify all viral nucleotide sequences in data where they are mixed with sequences from other species and with possible contaminants (Roux et al. 2015a; Ren et al. 2017; Fang et al. 2019; Kieft et al. 2020); secondly, the annotation of viral nucleotide sequences is still challenging, especially for those with remote or no homology to known viruses (Roux et al. 2015b; McNair et al. 2019; Zhang et al. 2019a); thirdly, the taxonomic assignment of novel viruses is difficult because there is no unified classification system for viruses (Low et al. 2019); fourthly, the rapid functional characterization of a large number of newly discovered viruses, such as identifying their hosts, is extremely difficult to achieve using traditional experimental methods (Jofre and Muniesa, 2020). Based on the above analysis, the present study proposes computational viromics as an emerging area, defined as the use of computational methods to solve problems in viromics studies. It includes, but is not limited to, the following aspects:
The virome has a significant impact on human health. Previous studies have shown that the virome is associated with multiple diseases. However, the detailed mechanisms are still unknown because of the complex interactions between the virome and its hosts (Clooney et al. 2019). Computational methods are needed to identify the viruses involved and their roles in causing human diseases. For example, Zhu et al. developed a metagenomic data analysis pipeline, MicroPro, to analyze the association between the microbes in the human body and complex diseases (Zhu et al. 2019). The virome is also closely related to the early warning of newly emerging viruses. The Global Virome Project (GVP) has estimated that there are 631,000–827,000 unknown viruses with the potential to infect humans (Carroll et al. 2018). Recent studies have developed machine-learning methods to identify the human-infecting virome based on sequence features (Zhang et al. 2019b). More effort is needed to validate their use in practical applications.
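As an illustration of what "sequence features" can mean in such machine-learning methods, the sketch below computes normalized k-mer frequencies from a nucleotide sequence. This is a generic and commonly used featurization, shown here only as an assumption; it is not claimed to be the feature set used by Zhang et al. (2019b). The function name `kmer_frequencies` and the default `k = 3` are hypothetical choices made for this example.

```python
from collections import Counter
from itertools import product
from typing import Dict

ALPHABET = "ACGT"


def kmer_frequencies(sequence: str, k: int = 3) -> Dict[str, float]:
    """Return the relative frequency of every possible k-mer over A, C, G, T.

    Hypothetical helper for illustration only. RNA input is handled by
    mapping U to T; k-mers containing ambiguous bases (e.g. N) are skipped.
    """
    seq = sequence.upper().replace("U", "T")
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Count only k-mers made entirely of unambiguous bases.
        if all(base in ALPHABET for base in kmer):
            counts[kmer] += 1
    total = sum(counts.values())
    # Emit a fixed-length vector (4**k entries) so that sequences of
    # different lengths yield comparable feature vectors.
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product(ALPHABET, repeat=k)
    }


if __name__ == "__main__":
    features = kmer_frequencies("ACGTACGT", k=2)
    print(len(features), features["AC"])
```

A fixed-length vector of this kind could then be passed to any standard classifier; which classifier and which value of k are appropriate would need to be determined empirically.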
Taken together, this perspective provides an overall view of computational viromics, which includes the identification, annotation and taxonomic assignment of viral genomic sequences, phenotype prediction of viruses, evolution of viromes, virus-host interactions, virus culturomics, the association between the virome and human health, and so on (Table 1). Computational viromics is still at an early stage. Given the huge diversity of the global virome, many more computational methods and experimental efforts are needed to characterize the virome and its interactions with hosts and environments.
| Fields | Methods or case studies | Summary | Advantages | Limitations |
| --- | --- | --- | --- | --- |
| Identification of viral genomic sequences | VIBRANT (Kieft et al. 2020) | Recovery, annotation and curation of microbial viruses from genomic sequences | Automated; user-friendly; accurate | Only for prokaryotic virus detection |
| Identification of viral genomic sequences | ViromeScan (Rampelli et al. 2016) | Detect eukaryotic viruses based on the homologous search method | Customized virus database; taxonomic assignment | Difficult to identify novel viruses |
| Annotation of viral genomes | PHANOTATE (McNair et al. 2019) | Gene annotations in phages based on viral gene characteristics | Reference-free; predicting more genes than other gene callers | Only for phages |
| Annotation of viral genomes | Vgas (Zhang et al. 2019a) | Combining ab initio and similarity-based method for predicting viral genes | High precision and recall rate | Accuracy needs further improvement |
| Taxonomic assignment of the virome | vConTACT (Eloe-Fadrosh, 2019) | Classification of prokaryotic virome based on the gene sharing network | Universal, scalable and automated | Not for short fragments, the singleton or outlier sequences; only tested for phages |
| Taxonomic assignment of the virome | GRAViTy (Eloe-Fadrosh, 2019) | Classification of eukaryotic viruses at the family level within each Baltimore group | Concise and clear | Not for short contigs; only for eukaryotic viruses |
| Evolution of the virome | Analyzed the origins and evolution of the RNA virome (Wolf et al. 2018) | Reconstruct the RNA virus evolution using the RdRp protein, and reveal extensive gene module exchange and horizontal virus transfer among diverse viruses | A far more complete reconstruction of the evolution of RNA viruses than those in previous studies | Only for RNA viruses |
| Host prediction of phages | PHP (Lu et al. 2021) | A Gaussian model for host prediction of prokaryotic viruses in metagenomics | Accurate, fast and user-friendly | Accuracy needs further improvement |
| Virus-host interactions | P-HIPSTer (Lasso et al. 2019) | Prediction of PPIs between 1,001 human-infecting viruses and human based on structure information | Comprehensive and accurate | The codes are not available |
| Virus-host interactions | Predict the receptorome of human viruses (Zhang et al. 2020b) | Predicting the receptorome of the human-infecting virome based on the unique features of mammalian virus receptors | Fast and comprehensive | Accuracy needs further improvement; unable to predict virus-receptor interactions directly |
| Virus culturomics | KOMODO (Oberhardt et al. 2015) | A platform for recommending microbial media | Data-rich, user-friendly and relatively accurate | Only suitable for bacteria and archaea |
| Association of virome and human health | MicroPro (Zhu et al. 2019) | A data analysis pipeline for analysis of the association between the microbes in human body and the complex diseases | Combining both known and unknown microbial organisms | Did not consider the complex interactions between microbes |
| Association of virome and human health | HumanVirusFinder (Zhang et al. 2019b) | A machine learning model for identification of human-infecting viruses from viral metagenomic data | Fast, easy to use, suitable for genomes, contigs and reads | Limited data used in training the models |
Table 1. Illustration of computational methods or case studies in computational viromics.