The geographical sources of the 5,167 documented HBV sequences are listed in Table 1. Of these, 3,657 (70.8%) are from Asia, including 2,181 (42.2%) from China, and 921 (17.8%) are from Europe.
Table 1. The geographical distribution of 5,167 HBV records.
We compared four features of HBV sequences: the ratio of the predominant sequence, the ratio of unique sequences, the ratio of sequences with the predominant length, and the length variability (the number of distinct lengths) (Fig. 1B). These values characterize both the variability and the conservation of sequences or domains. The preC domain presented the highest ratio of the predominant sequence, indicating a highly conserved domain, followed by the core-CTD. The HBpol and the large HBsAg had the two highest ratios of unique sequences.
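As an illustration only, the four features can be computed from a list of amino acid sequences with a minimal Python sketch such as the one below. It is not the code used in this study. The function name `sequence_features` and the toy sequences are hypothetical, and the ratio of unique sequences is assumed here to mean the number of distinct sequences divided by the total number of sequences.

```python
from collections import Counter


def sequence_features(seqs):
    """Return the four summary features for one protein or domain.

    Assumption: "ratio of unique sequences" is taken as
    (number of distinct sequences) / (total number of sequences).
    """
    if not seqs:
        raise ValueError("need at least one sequence")
    n = len(seqs)
    seq_counts = Counter(seqs)                  # copies of each distinct sequence
    len_counts = Counter(len(s) for s in seqs)  # sequences per length
    return {
        "predominant_sequence_ratio": seq_counts.most_common(1)[0][1] / n,
        "unique_sequence_ratio": len(seq_counts) / n,
        "predominant_length_ratio": len_counts.most_common(1)[0][1] / n,
        "number_of_lengths": len(len_counts),
    }


# Toy input (hypothetical, not real HBV data).
toy = ["MQLF", "MQLF", "MQLF", "MQIF", "MQLFH"]
features = sequence_features(toy)
```

For the toy input, three of the five sequences are identical, three distinct sequences occur, four sequences share the predominant length, and two lengths are observed.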
The lengths of the amino acid sequences or domains of HBV were variable. The number of alternative lengths was lowest for the small HBsAg (Fig. 1B and Supplementary Figure S1) and highest for the HBpol. The preC domain and the core-CTD were the next lowest after the small HBsAg, which revealed that their lengths were highly conserved. The preS1/S2 domains and the HBcAg are the most conserved. However, considering the length, the small and middle HBsAg, the preS1-RBD, the preS2 domain, the core-AD and the HBx are the most conserved. It is probable that length is critical for assembly, so the structural proteins, especially the small HBsAg and the core-AD, were conserved in length, except for the preS1 region, which functions in receptor binding (Yan et al. 2012). Moreover, for the HBpol, the sequence length appears not to be critical. Some domains of the HBpol may tolerate flexibility in length without affecting function.
We analyzed different HBV proteins at the amino acid level and found that the predominant sequence of HBcAg accounts for only around 20% of total sequences (Fig. 1B). However, almost all small HBsAg sequences are of the same length. The preS2 domain is also highly conserved in length, while the preS1 domain is more variable. In comparison, the HBpol was variable both in length and in amino acid sequence.
We analyzed the conservation of amino acid residues in the four HBV ORFs by calculating, for each position, the proportion of sequences that shared the predominant amino acid. Overall, some sites were highly conserved while others were extremely variable (Fig. 2 and Supplementary Figure S2). For most amino acid residues, sequence homology levels were above 50%. Amino acids in the ORF S were more variable. Only ~20% of preS1 sites and ~50% of preS2 sites had sequence homology levels above 95%, while the 95% sequence homology level was reached in more than 60% of sites in the other sequences or domains. Nearly 70% of sites in the HBpol reached this high level of concordance. Even in the ORF C, more than 60% of sites had predominant amino acids at a ratio of ≥ 0.950. The amino acid residues in the preS2 domain and the S domain were more variable than those in the core-AD and the core-CTD. This indicates that the surface antigen could be more variable, which provides a basis for immune escape. Among all sites, L350 in the large HBsAg and H83 in the HBpol were the most conserved. No mutation was found at these two sites in any HBV strain in our study. Sites with the highest conservation have the most potential as targets for the development of new antiviral strategies.
Figure 2. The similarity of amino acid sequences. The proportion of sites in several genetic regions of HBV that share sequence similarity at various levels of conservation. All amino acid positions share sequence similarity at 20% or greater (far left). Moving from left to right, sequence similarity for each genetic region is evaluated at increasing levels. Very few amino acid positions share 100% sequence similarity (far right). The proportion of sites with a high ratio of predominant residues is lower in the preS1/S2 domains than in the S domain, which indicates higher variability of the preS1/S2 domains.
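The per-site calculation summarized in Fig. 2 can be illustrated with a minimal Python sketch. It is not the code used in this study. The function names and the toy alignment are hypothetical, the input is assumed to be already aligned to equal length, and any gap characters are assumed to be counted like ordinary residues.

```python
from collections import Counter


def site_conservation(aligned):
    """Fraction of sequences carrying the predominant residue at each column.

    Assumption: sequences are pre-aligned; gap symbols, if present,
    are treated as ordinary characters.
    """
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to equal length")
    n = len(aligned)
    # zip(*aligned) walks the alignment column by column.
    return [Counter(col).most_common(1)[0][1] / n for col in zip(*aligned)]


def proportion_of_sites_at_or_above(conservation, threshold):
    """Proportion of sites whose predominant-residue ratio is >= threshold."""
    return sum(c >= threshold for c in conservation) / len(conservation)


# Toy alignment (hypothetical, not real HBV data).
toy = ["MQLF", "MQLF", "MQIF", "MHLF"]
cons = site_conservation(toy)                        # [1.0, 0.75, 0.75, 1.0]
high = proportion_of_sites_at_or_above(cons, 0.95)   # 0.5
```

Evaluating `proportion_of_sites_at_or_above` over a range of thresholds gives one curve of the kind plotted per genetic region.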
After sequence alignment, we obtained the overall profile of pair-wise similarity. To compare the similarity of sequences or domains in different areas, we extracted subsets of data from different geographic locations, especially China, which is an important area with a high prevalence of HBV infection. Surprisingly, different areas exhibited different profiles of pair-wise similarity (Fig. 3A). Almost all pair-wise similarities were above 80%, except that some similarities of preS2 were around 60%. The profiles of both the preC domain and the core-CTD were highly similar across areas.
Figure 3. A The histogram profile of pair-wise similarity in different HBV genomic regions, grouped by the geographical source of each isolate. The list of countries and the number of sequences from each geographical region are given in Table 1 and the section "Materials and Methods". X represents the pair-wise similarity, defined as the ratio of the number of positions with identical amino acid residues to the length of the sequences. Y represents the fraction of sequence pairs at each similarity level; the fractions in each panel sum to 1. B The Pearson cross-correlation coefficient of the profiles of pair-wise similarity in Fig. 3A (P < 0.001).
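The definition of pair-wise similarity above, and a normalized histogram of the kind shown in Fig. 3A, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in this study: the function names, the choice of 10 bins and the toy alignment are hypothetical, and the sequences are assumed to be already aligned to equal length.

```python
from itertools import combinations


def count_matches(a, b):
    """Number of aligned positions with identical residues."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x == y for x, y in zip(a, b))


def pairwise_similarity(a, b):
    """Identical positions divided by the sequence length."""
    return count_matches(a, b) / len(a)


def similarity_profile(aligned, bins=10):
    """Normalized histogram of pair-wise similarity over all sequence pairs.

    Bin i covers similarities in [i/bins, (i+1)/bins); a similarity of
    exactly 1 is placed in the last bin. The fractions sum to 1.
    """
    counts = [0] * bins
    pairs = 0
    for a, b in combinations(aligned, 2):
        # Integer arithmetic avoids floating-point error at bin edges.
        idx = min(count_matches(a, b) * bins // len(a), bins - 1)
        counts[idx] += 1
        pairs += 1
    if pairs == 0:
        raise ValueError("need at least two sequences")
    return [c / pairs for c in counts]


# Toy alignment (hypothetical, not real HBV data).
toy = ["MQLF", "MQLF", "MQIF", "MHLF"]
profile = similarity_profile(toy)
```

The number of pairs grows quadratically with the number of sequences, so for subsets containing thousands of records a vectorized or sampled variant may be preferable.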
When these profiles were compared using the Pearson cross-correlation coefficient, the preS2 domain, the preC domain and the core-CTD showed high coefficients. However, the profiles of the preS1 domain, the S domain, the core-AD, the HBpol and the HBx differed among areas. The similarity profiles were significantly different (P < 0.001) for HBpol, preS1, Core-AD (America-China), S (America-China/Other area) and HBx (America-Other area) (Fig. 3B). We also noted that the profiles of China and America differed most significantly in Core-AD, preS1, S, HBpol and HBx (P < 0.001) (Fig. 3B). The biological significance of these geographical differences in sequence populations at the amino acid level needs to be further addressed; it may inform the development of vaccines or other antiviral drugs.
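As a hedged illustration of how two similarity profiles can be compared, the sketch below computes a plain Pearson correlation coefficient between two histograms binned in the same way. It is an assumption that this zero-lag form corresponds to the coefficient reported in Fig. 3B; the function name `pearson` and the two example profiles are hypothetical, and the sketch does not compute P values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


# Two hypothetical normalized histograms over the same five bins
# (illustrative values only, not data from this study).
profile_a = [0.0, 0.1, 0.2, 0.3, 0.4]
profile_b = [0.0, 0.0, 0.1, 0.3, 0.6]
r = pearson(profile_a, profile_b)
```

A coefficient close to 1 indicates that the two areas have similarly shaped profiles, while lower values indicate that the distributions of pair-wise similarity differ.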
The HBsAg components include proteins that are important in immune escape and the persistence of immune tolerance. Currently, little is known about their structure. From the prediction of secondary structure, it was surprising that alpha-helices and beta-sheets exist only in the S domain, while the preS1/S2 domains, which are found only in the large and middle HBsAg, contain only unstructured coils (Fig. 4A). When comparing the amino acid frequencies at each site in the HBsAg, we also found that cysteine residues exist only in the S domain (Supplementary Figure S3A). Considering the conserved length of the S domain, we speculate that the stability of its structure is critical for the assembly of both SVPs and the envelope of Dane particles. The 40–46 amino acid residues at the N-terminus of preS1 are critical for the receptor binding of HBV (Yan et al. 2012), so the flexibility of the preS1/S2 domains could provide for allosteric regulation during virus-host recognition.
Figure 4. A Secondary structure prediction of the three HBsAg isoforms. Alpha-helices and beta-sheets exist only in the S domain. The preS1, preS2 and S domains are colored in purple, green and yellow, respectively. Cysteine residues are colored in red. Red cylinders stand for alpha-helices, blue arrows stand for beta-sheets and dashed lines stand for unstructured coils in the cartoon of a possible unified folding of the three forms of HBsAg. B The 11 amino acid residues upstream of the preS1-RBD. Most residues were highly conserved except Lys10.
In the HBsAg sequence, cysteine exists only in the S domain (Supplementary Figure S3A). In HBcAg, another structural protein of virions, only 3 highly conserved cysteine residues exist, all in the core-CTD (Supplementary Figure S3B). By contrast, cysteine residues exist throughout the sequences of the non-structural proteins, the HBpol and the HBx (Supplementary Figure S3C, S3D). A special function of cysteine is to form intra- or intermolecular disulfide bonds, which can stabilize the structure of proteins. Thus, we postulate that these disulfide bonds maintain the stability of the S domain in the envelope, while the preS1/S2 domains, which lack cysteine residues, could adopt a highly flexible conformation that facilitates binding to the host receptor.
It has been reported that 40–46 amino acid residues in the N-terminus of the preS1 domain are critical for binding to the host receptor (Chouteau et al. 2001; Le Duff et al. 2009; Yan et al. 2012). We found that this domain is highly conserved. Surprisingly, however, over 50% of HBV genome records encode a preS1 sequence with an N-terminal extension of the RBD of up to 11 amino acid residues (Fig. 4B). This sequence has not been reported to be associated with receptor binding. Its amino acid sites were also highly conserved, except for Lys10.