HTML
-
To investigate the freshwater virome from the Yangtze River, approximately 5 L of water samples were collected between 2017 and 2018 from each of the six representative river ports in the Yangtze River Delta: Anqing, Wuhu, Nanjing, Zhenjiang, Changzhou and Nantong. The sampling section is around 640 km in length, accounting for about one-tenth of the total length and basically covering the whole Yangtze River Delta region. The information of sampling sites was listed in Supplementary Table S1. The water samples were collected using the five-point sampling method at 20 cm water depths in each sampling site. The center of the river was determined as the central sampling point, and then four points on the diagonal apart from the central sampling point were selected. A total of 30 L of water samples were collected with sterile disposable containers and shipped on ice immediately for further processing. As a control, 5 L sterile ddH2O (Sangon, Shanghai, China) was simultaneously prepared and further processed in the same condition.
-
Viral particles were concentrated using the virus adsorption-elution method reported by Katayama et al. (Katayama et al. 2002) and further optimized by Hamza et al. (Hamza et al. 2009) and De Keuckelaere et al. (De Keuckelaere et al. 2013). Briefly, in the initial primary concentration step, MgCl2 was added to the water samples to obtain a final concentration of 0.05 mol/L, and the pH was adjusted to 3.5 with 1 mol/L HCl. Glass fiber filters (AP15 and AP20, Millipore) were used as prefilters to delay clogging. A type HA negatively charged membrane (HAWP14250, Millipore) with 0.45 μm pore size was used in a pressure pump system for water filtration. After filtration, the membranes were rinsed with 0.5 mmol/L H2SO4 (pH 3.4) and eluted with 70 mL Tr alk elution buffer (0.05 mol/L KH2PO4, 1.0 mol/L NaCl, 0.1% (v/v) Triton X-100, pH 9.2) (Hamza et al. 2009). In the secondary concentration step, 12.5% (w/v) PEG-6000 (Sigma-Aldrich) and 0.3 mol/L NaCl (Sigma-Aldrich) were added. After overnight incubation at 4 ℃, the concentrate was centrifuged at 10,000 × g for 30 min. Then the pellet was suspended in 2 mL PBS and vigorously vortexed. The mixture was incubated for 5 min at room temperature and centrifuged at 10, 000 × g for 20 min. About 2 mL supernatant was collected and stored at –80 ℃ until use.
-
The concentrated water samples were treated at 37 ℃ with a mixture of DNases (Turbo DNase from Ambion, Baseline-ZERO from Epicentre, and benzonase from Novagen) and RNase (Fermentas) for 60 min to digest unprotected nucleic acid (Zhang et al. 2014; Zhang et al. 2016; Zhang et al. 2017). Then the remaining total nucleic acid was isolated using a QIAamp Viral RNA Mini Kit (QIAGEN) according to the manufacturer’s protocol. For library construction, dsDNA was synthesized from RNA and DNA viruses. For RNA viruses, a reverse transcription kit (SuperScript Ⅲ Reverse Transcriptase) was used for reversely transcribing RNA into cDNA, after which the product was denatured at 95 ℃ for 2 min and quickly placed on ice for about 2 min. Then the DNA polymerase I large fragment (Klenow) was added to synthesize the second strand of cDNA (dsDNA). For ssDNA viruses, ssDNA was converted to dsDNA using the Klenow reaction and the product was utilized to construct libraries. Specifically, 12 μL nucleic acid extracts were added to the reaction system for synthesizing dsDNA (total reaction system: 20 μL) and the experiments were performed in the same tube. Overall, six libraries along with a control library were constructed using a Nextera XT DNA Sample Preparation Kit (Illumina) and the quality was inspected using agarose gel electrophoresis and Agilent bioanalyzer 2100. All libraries were sequenced on an Illumina MiSeq platform (250 bp paired ends) with dual barcoding for each individual sample (Liu et al. 2016).
-
Paired-end reads of 250 bp generated by MiSeq were debarcoded using vendor software from Illumina. An in-house analysis pipeline running on a 32 nodes Linux cluster was utilized to process the data. Reads were considered duplicates if bases 5–55 were identical and only one random copy of duplicates was kept. Low sequencing quality tails were trimmed using Phred quality score ten as the threshold. Adaptors were trimmed using the default parameters of VecScreen which is NCBI BLASTn with specialized parameters designed for adapter removal. The cleaned reads were de novo assembled within each barcode, detected chimera are filtered by length using the ENSEMBLE assembler with the default parameters (Deng et al. 2015). Contigs and singlets reads are then matched against a customized viral proteome database using BLASTx with an E-value cutoff of < 10-5. The virus BLASTx database was compiled using NCBI virus reference proteome (ftp://ftp.ncbi.nih.gov/refseq/release/viral/) and viral proteins sequences from NCBI nr fasta file (based on annotation taxonomy in Virus Kingdom). Candidate viral hits are then compared to an in-house non-virus non-redundant (NVNR) protein database with an E-value cutoff of < 10-5 to remove false-positive viral hits. The NVNR database was compiled using non-viral protein sequences extracted from NCBI nrfasta file (based on annotation taxonomy excluding Virus Kingdom). Contigs without significant BLASTx similarity to viral proteome database are searched against viral protein families in vFam database (Skewes-Cox et al. 2014) using HMMER3 (Eddy 2009; Johnson et al. 2010; Finn et al. 2011) to detect remote viral protein similarities.
-
Composition similarity analysis of the six viromes were compared using MEGAN software (MEtaGenome Analyzer, v6.20.19)(Huson et al. 2007) under the compare option. The results were presented by the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) taxonomic tree, canonical correspondence analysis (CCA) under cluster analysis option, and Bray-Curtis ecological distance matrix with default parameters. The species rarefaction curve is also calculated and generated by MEGAN software v6.20.19 (Huson et al. 2007) in rarefaction window to evaluate sampling completion in each library. The Friedman rank-sum test was used for analyzing the differences of viromic structure among six viromes using SPSS (IBM SPSS 25.0, SPSS Inc)(Friedman 1937). The viral community structure and richness results were visualized in heapmap, venn diagram and bar plots which were generated using R v3.6.3 package pheatmap (v1.0.12, https://cran.r-project.org/package=pheatmap), venn (v1.9, https://cran.r-project.org/package=venn) and ggplot2 (v3.2.1, https://ggplot2.tidyverse.org), respectively.
-
Viral contigs that may be from the same genome but without overlaps were merged using the Low Sensitivity/Fastest parameter in software Geneious v11.1.2 (Kearse et al. 2012). And the individual contig was used as reference for mapping to the raw reads of its original barcode using the Low Sensitivity/Fastest parameter. Putative viral open reading frames (ORFs) were predicted by Geneious v11.1.2 with built-in parameters (Minimum size: 300; Genetic code: Standard; Start codons: ATG) (Kearse et al. 2012), further the predicted ORFs were compared against the nr database from NCBI using BLASTp. The annotations of these ORFs were based on comparisons to the Conserved Domain Database using RPS-BLAST with an E-value cutoff of < 10-5. Coding protein sequences from ORFs which had no significant similarity found in the Database were annotated as putative proteins. Those contigs annotated with virus hallmark genes of main virus groups were selected, among which complete ORFs identified were included for further phylogenetic analyses (virus hallmark genes used: MCP for Microviridae, NS1 for Parvoviridae, Rep for CRESS-DNA viruses, TerL for Caudovirales and RdRp for Ribovriia). All sequences with virus hallmark genes were presented in scatter plots drawn utilizing R package ggplot2 v3.2.1.
-
Phylogenetic analyses were performed based on the predicted protein sequences of virus hallmark genes identified in this study and protein sequences of reference strains belonging to different group of viruses downloaded from the NCBI GenBank database. Related protein sequences were aligned using MUSCLE in MEGA v10.1.8 (Kumar et al. 2018) with the default settings. Sites containing more than 50% gaps were temporarily removed from alignments. Bayesian inference trees were then constructed using MrBayes v3.2.7 (Ronquist et al. 2012). The Markov chain was run for a maximum of 1 million generations, in which every 50 generations were sampled and the first 25% of Markov chain Monte Carlo (mcmc) samples were discarded as burn-in. Maximum Likelihood trees were also constructed to confirm all the Bayesian inference trees using software MEGA v10.1.8 (Kumar et al. 2018).
-
Particular attention was given to minimizing the risk of cross contamination and nucleic acid degradation. Aerosol filter pipet tips were used for avoiding possible cross contamination among samples. All experimental materials (including microcentrifuge tubes, pipet tips, etc.) which directly contacted with nucleic acid samples were RNase and DNase free and all nucleic acid samples were dissolved in DEPC-treated water with RNase inhibitors added. All experimental processes were performed in a biological safety cabinet.
-
The raw sequence reads data analyzed in this study are available at the National Center for Biotechnology Information (NCBI) Sequence Read Archive database under the accession numbers SRR12904122, SRR12904128, SRR12904125, SRR12904131, SRR12904201, SRR12904457, and SRR14308507 (control ddH2O). All viral sequences with virus hallmark genes identified in this study were deposited in the GenBank database under the accession numbers listed in Supplementary Table S2, along with a detailed list of the viral strain names, sequence length, taxonomic classifications, etc.
Library ID Samnpling site Location Sampling Date Library total reads Average reads length GC% No. of viral contigs Average contig length Min contig length Max contig length Percentage of reads mapped to Viral Contigs SRA accession no. 1anqing Anqing 30.51 N, 117.03 E 28-Oct-2017 916558 234.1 52.8% 4216 734.8 260 12589 10.7% SRX9369005 2wuhu Wuhu 31.34 N, 118.39 E 25-Oct-2017 1227660 234.6 52.6% 6037 760.3 260 20778 14.8% SRX9369011 3nanjing Nanjing 32.05 N, 118.89 E 23-Oct-2017 557758 234.9 49.2% 1506 941.2 261 15260 24.8% SRX9369008 4zhenjiang Zhenjiang 32.13 N, 119.43 E 23-Oct-2017 952270 231.3 51.9% 4785 748.7 260 9303 13.3% SRX9369014 5changzhou Changzhou 31.96 N, 120.03 E 29-Oct-2017 1344164 239.4 51.2% 7382 712.5 260 8086 12.8% SRX9369084 6nantong Nantong 32.01 N, 120.86 E 30-Oct-2017 1456270 236.6 53.1% 6217 867.1 260 20738 26.0% SRX9369340 WaterControl — — — 11312 230.9 48.7% 0 — — — — SRX10663920 Table Supplementary Table S2. Information of viral sequences with virus hallmark genes identified in the Yangtze River.