As the reassortment underlies the important evolution for influenza viruses, increasing experimental efforts were made to study the reassortments. However, such research is challenging due to the safety and ethical issues (Butler 2011; Fouchier 2015). With the advent of the big data era, more and more influenza virus genomic data are generated. Additionally, the improvements were made on bioinfor-matics by the development of various subjects like math-ematics, physics and biology, etc., leading the computational methods to be one of the indispensable tools for studying the influenza virus reassortment. As shown in Fig. 2, the key point of the data-driven computational identification of reassortments is to recognize the hetero-geneity of multiple gene segments based on the genomic data; then the integrative analyses with related epidemio-logic information are made to infer the reassortment events. Many tools that were designed to detect virus genomic recombination have been developed, which can also identify the virus reassortment, such as Simplot (Lole et al. 1999), Recombination Detection Program (RDP) (Martin and Rybicki 2000), GENECONV (Sawyer 1989), DSS (Difference of Sums of Squares) (McGuire et al. 1997), MAXCHI (Maximum Chi-Square method) (Smith 1992), and so on. Currently, the specific bioinformatics identification methods are mainly divided into two types as summarized in Table 1, the phylogenetic tree-based methods and the phylogenetic tree-independent methods. Here, a comprehensive review of influenza virus reassortment identification is given in this section. By summarizing the existing computational approaches in reassortment identification, we believe this work can serve as a reasonable guide to identify and infer reassortments.
Figure 2. The framework of reassortment identification with bioinfor-matics methods. Reassortants can be inferred by recognizing the heterogeneity of multiple segments based on the genomic data first, as shown in the left panel. Then the heterogeneity and related epidemiologic information, such as sampled region, host and sampled date will be combined to identify the reassortment events further.
Classification Method Principle Accessibility Compiling environment Using experience Required data Limitation References Phylogenetic tree-based FluReF A bottom-up research Source code Written by C ++ on Linux system The test dataset containing 1050 strains spent 10 s in total Complete genome of influenza virus High computational complexity Yurovsky and Moret (2011) Villa et al. Core mutations Source code Written by both C++ and python on Linux system The mock test dataset containing 7477 strains with the 290 bp simulated genomic sequence took more than 5 days in total Complete HA and NA sequences Limited to the reassortment identification of the HA and NA segments Villa and Lassig (2017) FluResort Identities of predicted protein Web invalid Written with ANSI/ISO standard C ++ on both Windows or Linux systems Not available Viral protein sequences and mass spectral data of these proteins Limited to the HA, NA, NP and M1 proteins, and the mass spectral data with high-resolution was required Lun et al. (2012) Nagarajan et al. Enumerating maximal bicliques Not supported Not supported Not available Genomic segments of influenza virus High computational complexity Nagarajan and Kingsford (2008) GiRaF Graph theory Source code Written by C ++ on Linux, Mac or Windows systems The test dataset containing 35 strains took about 5 s in total Complete genome of influenza virus High computational complexity Nagarajan and Kingsford (2011) Suzuki et al. Topologies of quartet trees Not supported Not supported Not available Complete genome of influenza virus High computational complexity Suzuki (2010) Dong et al. Genotype profile IVEE soft Written by both C ++ and python on Windows system Each complete genome took 3 about seconds Complete genome and the genotype information It had limitations when inferring intra-subtype reassortments within the same host Dong et al. 2011) Phylogenetic tree independent Wan et al. Network module; MST Not supported Not supported Not available Genomic segments of influenza virus Not suitable for short sequences Wan et al.(2007a, 2007b, 2008) Rabadan et al. Hamming distance Not supported Not supported Not available Genomic segments of influenza virus The assumption of equal mutation rate among segments may not always hold Rabadan et al. (2008) Silva et al. Genetic distance Not supported Written by Ruby script using bioruby on a Debian Linux server system Not available Genomic segments of influenza virus The performance of this algorithm will be influenced by the sample bias significantly de Silva et al. (2012) HoPER Host tropism Not supported Not supported Not available The full-length amino acid sequences of all genomic segments It is difficult to identify the reassortments between different hosts Yin et al. (2020)
Table 1. A brief review of the reassortment identification methods for the influenza viruses.
Currently, the phylogenetic incongruences of the relation-ships among eight gene segments in influenza viruses were identified to infer the reassortment events manually in most previous studies (Arenas and Posada 2010; Boni et al. 2010). In this process, the phylogenetic trees for each gene segment were first constructed with different methods such as neighborhood joining (NJ) (Saitou and Nei 1987), maximum likelihood (ML) (Felsenstein 1981) and maxi-mum parsimony (MP) (Sourdis and Nei 1988) etc., or based on the molecular clock analysis (Takezaki et al. 1995); the reconstructed phylogenetic trees were then partitioned into multiple clades manually based on some criteria like the bootstrap of ancestral node and the diver-gence time for different phylogenetic clades etc.; finally, the reassortment events would be recognized with the integration of topological incongruence and related epi-demiologic information. Lots of achievements with this manual identification approach were made on different subtype influenza viruses. For example, our previous work combined the molecular clock analysis and the viruses' epidemiological data to demonstrate that at least two sequential reassortments of the novel H7N9 viruses were took place with the distinct H9N2 viruses. The computa-tional results indicated that the first reassortment likely occurred in wild birds while the second occurred in domestic birds in east China in early 2012 (Wu et al. 2013). Recently, the genetic origin and evolution of H5N6 viruses were explored also with this method by Bi et al. (2016). In their work, a comprehensive phylogenetic analysis of eight gene segments with ML method coupling with the epi-demiological data was performed, which revealed that H5N6 arose from the reassortment of H5 and H6N6 viru-ses, and that the internal genes were constantly reassorted among low-pathogenic avian influenza viruses. In addition, the reassortment events of pandemic H1N1/2009 virus were successfully identified in Vijaykrishna's (Vijaykr-ishna et al. 2010) and Smith's work (Smith et al. 2009a) with similar methods.
Despite the significant achievements, the feasibility and validity of this identification method was still limited by the manual operation. Particularly, enormous amounts of genomic data on influenza viruses made the manual reas-sortment identification not available. Hence, automatic comparison among phylogenetic trees based on different gene segments was developed to improve the algorithm feasibility. FluReF is a fully automated reassortment finder which was proposed by Yurovsky and Moret (2011). The reassortment events can be identified by a bottom-up search for candidate clades on both the whole genome-based and segment-based phylogenetic trees, which sepa-rates the phylogenetic clades containing the reassortants from the other clades. As demonstrated in this work, FluReF could find reassortments effectively even for geo-graphically and temporarily expanded datasets. Recently, Villa et al. successfully inferred the reassortment events with the self-defined core mutations in genealogical trees in the investigation of the fitness cost in human influenza virus reassortment (Villa and Lassig 2017). Apart from the epidemiologic and mutational information, the biophysical data were also used. Lun et al. proposed a set of automatic reassortment identification algorithms, FluShuffle and FluResort (Lun et al. 2012). In FluShuffle, PepGen was first employed to generate theoretical peptide monoisotopic masses based on the influenza viral protein sequences. Then a Bayesian Markov Chain Monte Carlo (MCMC) approach (van Ravenzwaaij et al. 2018) was implemented to assign a combination of protein accessions to a single mass spectrum. Next, a Gibbs sampling algorithm was employed to estimate the marginal posterior probability for each known protein accession. Finally, accessions that match more peaks or match uniquely to a peak were selected with a higher probability at each step in the Gibbs sampler. The different combinations of influenza viral protein identities had been established through FluShuffle, which were then mapped onto the phylogenetic trees using FluResort. A statistical model was developed in FluResort to calculate the likelihood of reassortments, which was quantified using Z-score, a standardized value of the weighted mean patristic distance of each identity across different trees. This set of algorithms were evaluated with both the experimental and simulated mass spectral data obtained from the whole virus digests. For the experimental data, the algorithms were first tested with mass spectral data obtained from the digestion of a H1N1 strain from the reassortment of a 2009 H1N1 pandemic strain (A/Cali-fornia/07/2009) and a lab-modified H1N1 strain (A/Puerto Rico/08/1934). The seasonal influenza A and B viruses were also analyzed with these two algorithms. In addition, FluShuffle and FluResort algorithms were tested with the simulated mass spectral data. As indicated in this paper, these two algorithms accurately identified the natural reassortment of the H1N1 vaccine strain with the identifi-cation of each viral protein. Additionally, no reassortment events were recognized in the seasonal strain analyses. Although this set of algorithms can identify the reassort-ments accurately and rapidly, the mass spectral data with high-resolution are required.
Additionally, the graph theory was employed when many efforts were made on the automatic comparison of the phylogenetic trees. A framework based on the enu-merating maximal bicliques was first proposed to detect the reassortment events by Nagarajan and Kingsford (2008). Then, a fully automatic reassortment identification algo-rithm, GiRaF (Graph-incompatibility based Reassortment Finder) (Nagarajan and Kingsford 2011), was developed on the basis of this framework. In GiRaF, large groups of Markov chain Monte Carlo (MCMC)-sampled trees are searched for incompatible splitting by a fast biclique enu-meration algorithm coupled with several statistical tests to identify the differential phylogenetic topology. Then, the reassortment events are recognized with the combination of the differential from multiple gene segments. Three influ-enza virus datasets, including 156 human influenza A (H3N2) isolates (Levin et al. 2005), 35 avian influenza A (H5N1) isolates (Salzberg et al. 2007) and 140 swine influenza isolates (Kingsford et al. 2009) were evaluated with GiRaF, which had been analyzed in previous studies relying on the manual reassortment identification method. Not only the known reassortment events in these three influenza virus populations were detected accurately, but also several unreported reassortments in H5N1 and swine influenza isolates were identified. In addition, GiRaF can identify the reassortment events with high sensitivity as well as high precision for the simulated reassortment datasets. Recently, the reassortment events within the Victoria and Yamagata lineages were recognized by GiRaF when researchers exploited the evolutionary trajectories of influenza B viruses (Virk et al. 2020). A method based on quartet trees was proposed by Suzuki to detect reassort-ments, which can be used even when the constructed phylogenetic trees were unreliable (Suzuki 2010). In this method, a quartet of strains were examined at a time, and the corresponding phylogenetic tree was constructed for each gene segment. Then, the topologies of all quartet trees supported with a statistical significance were compared among segments. The reassortment events could be rec-ognized according to the pattern of topological difference among segments. Notably, although the reassortment events can be identified accurately, the computation com-plexity of the graph theory-based algorithm is tremendous, as the traversal of the phylogenetic tree with a part of strains will cost huge computing resources and time.
Obviously, the validity of the identified reassortment events with the phylogenetic tree-based methods is dependent on the reliability of the constructed trees. However, the false phylogenetic incongruence can be caused by the inaccurate construction of phylogenetic trees, such as inappropriate selection of evolution model (Keane et al. 2006), high level of homoplasy (Goloboff and Wilkinson 2018), long branch attraction (Li et al. 2007), insufficient sampling (Graybeal 1998), unreasonable data partition (Prosperi et al. 2011) and so on. To solve this problem, Svinti et al. developed two robust approaches to detect reassortments, namely MLreassort and Breassort, which can distinguish the reassortment-caused topological inconsistency from phylogenetic errors-caused topological inconsistency (Svinti et al. 2013). MLreassort is based on a maximum likelihood framework while Breassort is a Bayesian based approach. High precision and sensitivity were achieved when these two approaches detected reas-sortment events on both the small real data of influenza A sequences and the simulated data. However, the perfor-mance of these two approaches was not satisfactory when they analyzed the large datasets.
In conclusion, phylogenetic tree-dependent methods rely on the assumption that reassortants are distributed among the different clades of phylogenetic trees. These approaches are generally feasible to identify reassortment events across inter-subtypes of the influenza virus. They are accurate and sensitive to identify reassortment events even if the reassortant has a complicated evolutionary history. However, the reliability of phylogenetic tree con-struction is usually unsatisfactory when there is extremely incomplete data, and the low bootstrap probabilities and poor topology may lead to the obscure evidences for reassortment. Although some efforts were made to solve this problem, the feasibility of these methods is still limited by the computational cost from large scale data.
Ambiguous quantified benchmark for partitioning the phylogenetic clades and the extreme dependence of phy-logenetic reconstruction led more efforts on the identifi-cation of reassortment events without the phylogenetic trees.
The sequence distance between strains was commonly used in phylogenetic tree-independent methods. The Complete Composition Vector (CCV) was first employed to recognize reassortants by Wan et al. (Wan et al. 2007a, 2007b, 2008). In these algorithms, the calculated CCVs among different virus strains are core parameters, which are then used to assign diverse genotypes for related strains by different clustering methods. In their first algo-rithm (Wan et al. 2007a), the reassortment events can be identified by the genotypes which are assigned using the network modules coupled with the CCVs. As demonstrated in the study, this algorithm could infer the reassortment events with a large number of sequences accurately and rapidly. After that, the clustering method was improved by employing the minimum spanning tree (MST) and the Hierarchical Bayesian Modeling instead of the networks (Wan et al. 2008). As indicated in the evaluations, the CCV-based algorithms could successfully identify the reassortment events of the NP and PB2 genes for the H5N1 avian influenza virus. Another two algorithms were also developed with the sequence distance. Rabadan et al. constructed a statistical framework to estimate the likeli-hood of reassortments with the hamming distance in the third codon position for all sequences (Rabadan et al. 2008). The detected reassortment events of H3N2 strains with this algorithm were similar to the previous study. A reassortment identification algorithm was developed by Silva et al. based on the r-neighbourhood which are determined only by the genetic distances among sequences (de Silva et al. 2012). For each sequence, the set of r closest strains is defined as the r-neighbourhood for that sequence. 35 candidate reassortants of high quality were found by the algorithm with the large data sets of influenza virus whole genome nucleotide sequences. In addition, Chan et al. proposed that the pervasive reassortment in influenza virus can be detected with persistent homology (Chan et al. 2013).
Apart from the sequence distance, the other features were also employed to identify the reassortment events without the phylogenetic tree. Recently, a novel compu-tational algorithm HopPER (Yin et al. 2020) was proposed by Yin et al., which inferred the reassortment events by the random forest based on the prediction of the host tropism. 147 features generated from seven physicochemical prop-erties of amino acids (i.e. polarity, net charge, hydropho-bicity, normalized van der waals volume, solvent accessibility, polarizability and secondary structure) were used to infer the host tropism. For the full length and non-redundant amino acid sequences of different proteins, 280 out of 318 candidate reassortants were successfully iden-tified regardless of the completeness of the genomes. In addition, HopPER was more robust than the alternative reassortment identification algorithms (Karasin et al. 2000, 2002, 2006; Olsen et al. 2006; Khiabanian et al. 2009; Kingsford et al. 2009; Nagarajan and Kingsford 2011; de Silva et al. 2012).
In addition to the efforts which have working on the computational identification of influenza virus reassort-ments, several database tools were also developed to facilitate the computational related analysis for influenza virus reassortments.
Due to the diversity of influenza viruses that can reflect the possible reassortment events, the appropriate assigned genotypes are essential to identify and describe the reas-sortments of influenza viruses. FluGenome (http://www.flugenome.org/) was constructed by Lu et al., which enabled users to recognize the reassortment events by their developed genotype nomenclature (Lu et al. 2007). The available sequences for eight gene segments were retrieved from NCBI Influenza Virus Resource first; then the downloaded sequences were clustered into several lineages following criteria, in order to assign the strains into the nomenclatural genotypes. FluGenome provided three levels of information that included the segments (assigned lineage, strain name, segment, serotype, host, country, year, GenBank accession number, nucleotide sequence and sequence length), genomes (assigned genotype and acces-sion numbers of individual gene segments) and genotypes (all genotypes and the genomes assigned into each geno-type). With the analysis of more than 2000 complete viral genomes, 156 unique genotypes were revealed in Flu-Genome. Based on the developed genotypes, the reassort-ment events can be further detected by combining the epidemiologic information of the corresponding strains in the database. Unfortunately, FluGenome is no longer sup-ported, which is a grievous loss to the study of the influenza virus reassortments.
As the increasing number of studies have attempted to identify the reassortment computationally, a systematic, comprehensive online repository of reassortment events for influenza viruses is needed urgently. Our previous work developed FluReassort (https://www.jianglab.tech/FluReassort) (Ding et al. 2020), the first database that included all reported and published reassortment events. To facili-tate the investigation of the reassortment preference on the gene segment or the subtype of viruses, FluReassort also supported the reconstruction of reassortment networks for different subtypes of influenza viruses, which was based on the reassortment events retrieved from the extensive liter-ature. Total 3513 research papers published before July 2018 were retrieved from the PubMed database with a keyword combination of "subtype and (reassortment or reassortant or evolution or origin)", where "subtype" denotes the specific subtype of influenza virus such as H1N1. To provide the high quality reassortment events comprehensively, the reassortment events which were compiled manually from the given retrieved literature would be recruited in FluReassort only if they had both the phylogenetic analysis and clear reassortant and reassort-ment donor strains. As a result, 204 reassortment events were compiled based on 535 strains of 56 subtypes isolated from 37 different countries, which provides the metadata about the reassortant strain and reassortment donor strain, the inferred date, geographic region and host for reassort-ment, phylogenetic analysis methods and the PubMed IDs (PMIDs) of the corresponding references. FluReassort offered the most comprehensive information about reas-sortment events for influenza viruses in a structured way. The retrieval and exposition of the compiled reassortment events are implemented on the 'Home' page, while the 'Phylogenetic Analysis' and 'Reassortment Network' pages are designed to analyze the reassortment events. FluReassort has conducted a thorough compilation of the reassortment events for influenza viruses for the first time. The information provided by FluReassort can serve as a guide to future research, and facilitate data-driven explo-ration of the reassortments.