The alignment of the gp120 consensus sequences for subtypes B and C to the HXB2 reference sequence is shown in Fig. 1. Key sites in the solved structure are listed in Table 1.
Figure 1. Realignment of gp120 sequence sets for subtype B, subtype C and CRF07 to reference sequence HXB2. "~" indicates sites that contain fewer than three amino acids and cannot be translated.
Table 1. Key structural elements within the gp120 sequences.
The raw MI arrays calculated for the gp120_B and gp120_C alignments are shown in Fig. 2A and 2B, respectively. The figures also show the location of the points with respect to the regions of the solved structure and the specific secondary structure features that have been identified within it. In both arrays the V1/V2 and V3 loops show the greatest variation. The V1/V2 loop appears to vary more, but this is because many of the sites within the V3 loop contained large numbers of insertions or deletions and could not be included in the analysis. For the raw data, the most notable difference occurs close to the amino terminus of the protein, where the subtype C array appears to contain a region of coevolving sites that is absent from the subtype B array.
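For readers unfamiliar with the quantity plotted in Fig. 2A and 2B, the following is a minimal sketch of how a raw mutual information array can be computed from an alignment, using MI(i, j) = H(i) + H(j) - H(i, j). It is illustrative only and is not the code used in this study: the function names, the toy alignment and the rule of skipping any column that contains a gap character are all assumptions.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())


def mutual_information(col_i, col_j):
    """MI(i, j) = H(i) + H(j) - H(i, j) for two alignment columns."""
    joint = list(zip(col_i, col_j))
    return column_entropy(col_i) + column_entropy(col_j) - column_entropy(joint)


def mi_array(alignment, gap_chars="-~"):
    """Pairwise MI for all column pairs.

    Assumption: any column containing a gap character is excluded,
    as a simple stand-in for dropping indel-rich sites.
    """
    columns = list(zip(*alignment))
    usable = [k for k, col in enumerate(columns)
              if not any(ch in gap_chars for ch in col)]
    return {(i, j): mutual_information(columns[i], columns[j])
            for a, i in enumerate(usable) for j in usable[a + 1:]}


# Hypothetical toy alignment: columns 0 and 1 covary, column 3 varies independently.
toy_alignment = ["AKLT", "AKLS", "GRLT", "GRLS"]
toy_mi = mi_array(toy_alignment)
```

In the toy alignment, the covarying pair (0, 1) receives a high MI value, whereas pairs involving the invariant column 2 or the independently varying column 3 receive values close to zero.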
Figure 2. Calculated mutual information arrays for gp120 subtype B and subtype C. (A) and (B), raw mutual information for subtype B and C, respectively. (C) and (D), mutual information arrays after removing sites that were not statistically significant at 5000 bootstraps. Circles highlight regions that appear to contain sites of interest. Letters correspond to the marked positions on the structure in Fig. 3.
The statistically significant points (P < 0.0002) that remain after bootstrapping are shown in Fig. 2C and 2D. For subtype B, two clearly defined regions remain, and these are distinct from the three regions identified in the subtype C array. The figure also shows the range of sites that define hotspots, but it seems likely that some of the signal is due to noise and that not all of the sites are coevolving. Region A occurs around aa140 and corresponds to the N terminus of the V1 loop. This loop is not present in the solved structure, but it is thought to be located at the top of the trimer, away from where the protein binds to the CD4 receptor (arrow A in Fig. 3). This prediction is supported by a fusogenicity study which found that a mutation at position aa140 had a negative effect on the fusion ability of the virus. A second region is predicted between positions aa140 and aa339 (A and B in Fig. 3). In this case, the positions are on opposite sides of the protein and are not in close proximity even in the trimer; it therefore seems unlikely that these sites are coevolving, and the signal is probably produced by random co-mutations that the bootstrapping failed to remove.
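The threshold P < 0.0002 corresponds to 1/5000, i.e. none of the 5000 resampled values reaching the observed MI. The sketch below illustrates one way such a filter can be implemented. It assumes a column-shuffling null model, which may differ from the resampling scheme actually used here; the function names, the seed and the toy columns are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a sequence of hashable symbols."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def mutual_information(col_i, col_j):
    """MI(i, j) = H(i) + H(j) - H(i, j)."""
    return entropy(col_i) + entropy(col_j) - entropy(list(zip(col_i, col_j)))


def resampling_p_value(col_i, col_j, n_resamples=5000, seed=0):
    """Fraction of shuffled replicates whose MI reaches the observed MI.

    Assumption: the null distribution is built by shuffling col_j, which
    keeps each column's composition but breaks any pairing with col_i.
    """
    rng = random.Random(seed)
    observed = mutual_information(col_i, col_j)
    shuffled = list(col_j)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point round-off.
        if mutual_information(col_i, shuffled) >= observed - 1e-12:
            hits += 1
    return hits / n_resamples


# Hypothetical columns from a 40-sequence alignment.
site_a = list("A" * 20 + "G" * 20)
site_b = list("K" * 20 + "R" * 20)   # covaries perfectly with site_a
site_c = list("TS" * 20)             # varies independently of site_a

p_covarying = resampling_p_value(site_a, site_b)
p_independent = resampling_p_value(site_a, site_c)
keep_pair = p_covarying < 1 / 5000   # retained only if no replicate matches
```

With 5000 replicates the smallest resolvable non-zero P value is 0.0002, so a pair is retained only when no shuffled replicate matches the observed MI. Such a filter controls for chance association within a single pair, not for the very large number of pairs tested, which is consistent with some spurious signals surviving.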
Figure 3. Location of predicted coevolving sites on the solved structure for the gp120 trimer (PDB structure 2NY7). The base of the trimer binds to the CD4 cell surface receptor. Letters refer to the highlighted regions in Fig. 2. A and D correspond to the N terminus of the V1 loop and the C terminus of the V2 loop, respectively; B/E corresponds to the V3 loop in the CPGR region; C corresponds to a region close to the N terminus of the gp120 protein. See text for details.
For gp120 subtype C, three regions remain after bootstrapping. The first of these occurs at ~aa30, at the N terminus of the protein. Again, this region falls outside the solved structure, but it is likely to be located at the base of the trimer, where the structure binds to the CD4 cell surface receptor (Fig. 3). Because the alignments for both subtypes are realigned according to reference sequence HXB2, it is possible to compare this region in the two alignments. Interestingly, between positions aa28 and aa35 subtype C does appear to show greater variation than subtype B. Fig. 4 shows an entropy plot for this region. Although the region as a whole is relatively well conserved, the average entropy between aa28 and aa35 for subtype C is an order of magnitude higher than the value for subtype B. This finding is further supported by visual inspection of the alignment (data not shown).
Figure 4. Entropy vs amino acid position in the area around region C in Fig. 2 and Fig. 3 for subtype B and C. Subtype B is shown as black triangles, subtype C as red circles. In this region, the mutual information array predicts a number of coevolving sites for subtype C, but not for subtype B. The entropy plot indicates that the subtype B alignment is highly conserved whereas the subtype C sequences show greater variation. The vertical dashed lines indicate the range of region C marked in Fig. 2D. The horizontal lines show the average entropy across the region for each subtype: subtype B, dashed black line; subtype C, solid red line. Outside this region the entropy is approximately the same for both subtypes.
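The per-site entropy comparison summarised in Fig. 4 can be sketched as follows. This is an illustrative example under stated assumptions, not the analysis code: the two toy alignments are invented, and the column indices are 0-based alignment columns rather than HXB2 positions, so a mapping from HXB2 numbering to alignment columns would be needed in practice.

```python
import math
from collections import Counter


def site_entropy(column):
    """Shannon entropy (in bits) of one alignment column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())


def mean_entropy(alignment, start, end):
    """Mean per-site entropy over 0-based columns start..end (inclusive)."""
    columns = list(zip(*alignment))[start:end + 1]
    return sum(site_entropy(col) for col in columns) / len(columns)


# Hypothetical alignments standing in for a conserved and a variable subtype.
conserved_set = ["MRVKGI", "MRVKGI", "MRVKGI", "MRVRGI"]
variable_set = ["MRVKGI", "MKVTEI", "MRAKGT", "MTVREI"]

mean_conserved = mean_entropy(conserved_set, 0, 5)
mean_variable = mean_entropy(variable_set, 0, 5)
ratio = mean_variable / mean_conserved
```

A large ratio between the two region averages indicates that one sequence set carries far more variation across the window. Sites with very low entropy also offer little scope for detecting covariation, which is one reason a conserved region may show no MI signal.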
Region D corresponds to the C terminus of the V2 loop at ~aa180. Again, because this region is not part of the solved structure, it is difficult to interpret the significance of these sites. However, the prediction is once again supported by the same fusogenicity study, which found that a mutation at aa180 also produced a decrease in the fusion ability of the virus. Region E in Fig. 2D corresponds to the V3 loop, close to the CPGR motif [12, 13, 20]. Many coevolving sites are predicted in this region, which is consistent with one of the earliest mutual information studies on HIV, which examined the variation in the V3 loop within a broad sequence set.