Inference of Global HIV-1 Sequence Patterns and Preliminary Feature Analysis

Yan Wang; Reda Rawi; Daniel Hoffmann; Binlian Sun; Rongge Yang

doi:10.1007/s12250-013-3348-z

August 2013

Citation: Yan Wang, Reda Rawi, Daniel Hoffmann, Binlian Sun, Rongge Yang. Inference of Global HIV-1 Sequence Patterns and Preliminary Feature Analysis. Virologica Sinica, 2013, 28(4): 228-238. http://dx.doi.org/10.1007/s12250-013-3348-z

Yan Wang¹, Reda Rawi², Daniel Hoffmann², Binlian Sun¹, Rongge Yang¹

1. AIDS and HIV Research Group, State Key Laboratory of Virology, Wuhan Institute of Virology, Chinese Academy of Sciences, Wuhan 430071, China
2. Research Group for Bioinformatics, Center for Medical Biology, University of Duisburg-Essen, Essen 45141, Germany

Corresponding authors: Binlian Sun, sunbl@wh.iov.cn; Rongge Yang, ryang@wh.iov.cn
Received Date: 04 June 2013
Accepted Date: 26 July 2013
Published Date: 27 July 2013
Available online: 01 August 2013

Abstract

The epidemiology of HIV-1 varies in different areas of the world, and this complexity may leave unique footprints in the viral genome. We therefore attempted to find significant patterns in global HIV-1 genome sequences. By applying the rule inference algorithm RIPPER (Repeated Incremental Pruning to Produce Error Reduction) to multiple sequence alignments of Env sequences from four classes of compiled datasets, we generated four sets of signature patterns. These patterns distinguished southeastern Asian from non-southeastern Asian sequences with 97.5% accuracy, Chinese from non-Chinese sequences with 98.3% accuracy, African from non-African sequences with 88.4% accuracy, and southern African from non-southern African sequences with 91.2% accuracy. The patterns showed different associations with subtypes and with amino acid positions. In addition, some signature patterns were characteristic of the geographic area from which the sample was taken. Amino acid features corresponding to the phylogenetic clustering of HIV-1 sequences were consistent with some of the deduced patterns. Using a combination of patterns inferred from subtype B, subtype C, and all subtypes chimeric with CRF01_AE worldwide, we found that signature patterns of subtype C were extremely common in some sampled countries (for example, Zambia in southern Africa), which may hint at the origin of this HIV-1 subtype and at the need to pay special attention to this area of Africa. Signature patterns of subtype B sequences were associated with different countries. Moreover, the single position 21 showed distinct patterns, with glycine, leucine, and isoleucine corresponding to subtype C, subtype B, and all possible recombinant forms chimeric with CRF01_AE, respectively, which also indicates distinct geographic features. Our method widens the scope of signature inference from geographic, genetic, and genomic viewpoints. These findings may provide a valuable reference for epidemiological research or vaccine design.
Keywords: Pattern inference; global HIV-1 sequence; Repeated Incremental Pruning to Produce Error Reduction (RIPPER)

References
1. Avenue M, Hill M, Cohen W W, Of C, and Pruning R. 1994. Fast Effective Rule Induction.
2. Bello G, Eyer-Silva W A, Couto-Fernandez J C, Guimarães M L, Chequer-Fernandez S L, Teixeira S L M, and Morgado M G. 2007. Demographic history of HIV-1 subtypes B and F in Brazil. Infection, genetics and evolution: journal of molecular epidemiology and evolutionary genetics in infectious diseases, 7:263-270.
  doi: 10.1016/j.meegid.2006.11.002
3. Blair C, and Murphy R W. 2011. Recent trends in molecular phylogenetic analysis: where to next? The Journal of heredity, 102:130-138.
  doi: 10.1093/jhered/esq092
4. Buonaguro L, Tagliamonte M, Tornesello M L, and Buonaguro F M. 2007. Genetic and phylogenetic evolution of HIV-1 in a low subtype heterogeneity epidemic: the Italian example. Retrovirology, 4:34.
  doi: 10.1186/1742-4690-4-34
5. Butler I F, Pandrea I, Marx P A, and Apetrei C. 2007. HIV genetic diversity: biological and public health consequences. Current HIV research, 5:23-45.
  doi: 10.2174/157016207779316297
6. Cai Y-D, Lu L, Chen L, and He J-F. 2010. Predicting subcellular location of proteins using integrated-algorithm method. Molecular diversity, 14:551-558.
  doi: 10.1007/s11030-009-9182-4
7. Crooks G E, Hon G, Chandonia J-M, and Brenner S E. 2004. WebLogo: A Sequence Logo Generator. Genome research, 14:1188-1190.
8. Delano W L. 2004. PyMOL User's Guide.
9. Delatorre E O, and Bello G. 2012. Phylodynamics of HIV-1 subtype C epidemic in east Africa. PloS one, 7:e41904.
  doi: 10.1371/journal.pone.0041904
10. Dybowski J N, Riemenschneider M, Hauke S, Pyka M, Verheyen J, Hoffmann D, and Heider D. 2011. Improved Bevirimat resistance prediction by combination of structural and sequence-based classifiers. BioData mining, 4:26.
  doi: 10.1186/1756-0381-4-26
11. Edgar R C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic acids research, 32:1792-1797.
  doi: 10.1093/nar/gkh340
12. Fauci A S, Johnston M I, Dieffenbach C W, Burton D R, Hammer S M, Hoxie J A, Martin M, Overbaugh J, Watkins D I, Mahmoud A, and Greene W C. 2008. HIV vaccine research: the way forward. Science (New York, N.Y.), 321:530-532.
  doi: 10.1126/science.1161000
13. Fryer H R, and McLean A R. 2011. Modelling the spread of HIV immune escape mutants in a vaccinated population. PLoS computational biology, 7:e1002289.
  doi: 10.1371/journal.pcbi.1002289
14. Gentleman R C, Carey V J, Bates D M, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, and Gentry J. 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome biology, 5:R80.
  doi: 10.1186/gb-2004-5-10-r80
15. Gilbert M T P, Rambaut A, Wlasiuk G, Spira T J, Pitchenik A E, and Worobey M. 2007. The emergence of HIV/AIDS in the Americas and beyond. Proceedings of the National Academy of Sciences of the United States of America, 104:18566-18570.
  doi: 10.1073/pnas.0705329104
16. Grant B J, Rodrigues A P C, ElSawy K M, McCammon J A, and Caves L S D. 2006. Bio3d: an R package for the comparative analysis of protein structures. Bioinformatics, 22:2695-2696.
  doi: 10.1093/bioinformatics/btl461
17. Hemelaar J. 2012. The origin and diversity of the HIV-1 pandemic. Trends in Molecular Medicine, 18:182-192.
  doi: 10.1016/j.molmed.2011.12.001
18. Hornik K, Buchta C, and Zeileis A. 2009. Open-source machine learning: R meets Weka. Computational Statistics, 24:225-232.
  doi: 10.1007/s00180-008-0119-7
19. Junqueira D M, de Medeiros R M, Matte M C C, Araújo L A L, Chies J A B, Ashton-Prolla P, and Almeida S E D M. 2011. Reviewing the history of HIV-1: spread of subtype B in the Americas. PloS one, 6:e27489.
  doi: 10.1371/journal.pone.0027489
20. Kallings L O. 2008. The first postmodern pandemic: 25 years of HIV/AIDS. Journal of internal medicine, 263:218-243.
  doi: 10.1111/jim.2008.263.issue-3
21. Karlsson Hedestam G B, Fouchier R A M, Phogat S, Burton D R, Sodroski J, and Wyatt R T. 2008. The challenges of eliciting neutralizing antibodies to HIV-1 and to influenza virus. Nature reviews. Microbiology, 6:143-155.
  doi: 10.1038/nrmicro1819
22. Li Y, Uenishi R, Hase S, Liao H, Li X-J, Tsuchiura T, Tee K K, Pybus O G, and Takebe Y. 2010. Explosive HIV-1 subtype B' epidemics in Asia driven by geographic and risk group founder events. Virology, 402:223-227.
  doi: 10.1016/j.virol.2010.03.048
23. Liao H, Tee K K, Hase S, Uenishi R, Li X-J, Kusagawa S, Thang P H, Hien N T, Pybus O G, and Takebe Y. 2009. Phylodynamic analysis of the dissemination of HIV-1 CRF01_AE in Vietnam. Virology, 391:51-56.
  doi: 10.1016/j.virol.2009.05.023
24. Lihana R W. 2012. Update on HIV-1 Diversity in Africa: A Decade in Review. 83-100.
25. Liu J, and Zhang C. 2011. Phylogeographic analyses reveal a crucial role of Xinjiang in HIV-1 CRF07_BC and HCV 3a transmissions in Asia. PloS one, 6:e23347.
  doi: 10.1371/journal.pone.0023347
26. Lundegaard C, Lamberth K, Harndahl M, Buus S, Lund O, and Nielsen M. 2008. NetMHC-3.0: accurate web accessible predictions of human, mouse and monkey MHC class I affinities for peptides of length 8-11. Nucleic Acids Research, 36:W509-W512.
27. Lynch R M, Shen T, Gnanakaran S, and Derdeyn C A. 2009. Appreciating HIV type 1 diversity: subtype differences in Env. AIDS research and human retroviruses, 25:237-248.
  doi: 10.1089/aid.2008.0219
28. Masciotra S, Livellara B, Belloso W, Clara L, Tanuri A, Ramos A C, Baggs J, Lal R, and Pieniazek D. 2000. Evidence of a high frequency of HIV-1 subtype F infections in a heterosexual population in Buenos Aires, Argentina. AIDS research and human retroviruses, 16:1007-1014.
  doi: 10.1089/08892220050058425
29. Meng Z, Xin R, Zhong P, Zhang C, Abubakar Y F, Li J, Liu W, Zhang X, and Xu J. 2012. A new migration map of HIV-1 CRF07_BC in China: analysis of sequences from 12 provinces over a decade. PloS one, 7:e52373.
  doi: 10.1371/journal.pone.0052373
30. Moran D, and Jordaan J A. 2007. HIV/AIDS in Russia: determinants of regional prevalence. International journal of health geographics, 6:22.
  doi: 10.1186/1476-072X-6-22
31. Morcos F, Pagnani A, Lunt B, Bertolino A, Marks D S, Sander C, Zecchina R, Onuchic J N, Hwa T, and Weigt M. 2011. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proceedings of the National Academy of Sciences of the United States of America, 108:E1293-E1301.
32. Morris C N, and Ferguson A G. 2006. Estimation of the sexual transmission of HIV in Kenya and Uganda on the trans-Africa highway: the continuing role for prevention in high risk groups. Sexually transmitted infections, 82:368-371.
  doi: 10.1136/sti.2006.020933
33. Njai H F, Gali Y, Vanham G, Clybergh C, Jennes W, Vidal N, Butel C, Mpoudi-Ngolle E, Peeters M, and Ariën K K. 2006. The predominance of Human Immunodeficiency Virus type 1 (HIV-1) circulating recombinant form 02 (CRF02_AG) in West Central Africa may be related to its replicative fitness. Retrovirology, 3:40.
  doi: 10.1186/1742-4690-3-40
34. Paradis E, Claude J, and Strimmer K. 2004. APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics, 20:289-290.
  doi: 10.1093/bioinformatics/btg412
35. Paraschiv S, Otelea D, Batan I, Baicus C, Magiorkinis G, and Paraskevis D. 2012. Molecular typing of the recently expanding subtype B HIV-1 epidemic in Romania: evidence for local spread among MSMs in Bucharest area. Infection, genetics and evolution: journal of molecular epidemiology and evolutionary genetics in infectious diseases, 12:1052-1057.
  doi: 10.1016/j.meegid.2012.03.003
36. Paraskevis D, Pybus O, Magiorkinis G, Hatzakis A, Wensing A M, van de Vijver D A, Albert J, Angarano G, Asjö B, Balotta C, Boeri E, Camacho R, Chaix M-L, Coughlan S, Costagliola D, De Luca A, de Mendoza C, Derdelinckx I, Grossman Z, Hamouda O, Hoepelman I, Horban A, Korn K, Kücherer C, Leitner T, Loveday C, Macrae E, Maljkovic-Berry I, Meyer L, Nielsen C, Op de Coul E L, Ormaasen V, Perrin L, Puchhammer-Stöckl E, Ruiz L, Salminen M O, Schmit J-C, Schuurman R, Soriano V, Stanczak J, Stanojevic M, Struck D, Van Laethem K, Violin M, Yerly S, Zazzi M, Boucher C A, and Vandamme A-M. 2009. Tracing the HIV-1 subtype B mobility in Europe: a phylogeographic approach. Retrovirology, 6:49.
  doi: 10.1186/1742-4690-6-49
37. Pérez L, Thomson M M, Bleda M J, Aragonés C, González Z, Pérez J, Sierra M, Casado G, Delgado E, and Nájera R. 2006. HIV Type 1 molecular epidemiology in Cuba: high genetic diversity, frequent mosaicism, and recent expansion of BG intersubtype recombinant forms. AIDS research and human retroviruses, 22:724-733.
  doi: 10.1089/aid.2006.22.724
38. Pollakis G, Abebe A, Kliphuis A, De Wit T F R, Fisseha B, Tegbaru B, Tesfaye G, Negassa H, Mengistu Y, Fontanet A L, Cornelissen M, and Goudsmit J. 2003. Recombination of HIV type 1C (C'/C") in Ethiopia: possible link of EthHIV-1C' to subtype C sequences from the high-prevalence epidemics in India and Southern Africa. AIDS research and human retroviruses, 19:999-1008.
  doi: 10.1089/088922203322588350
39. Poonpiriya V, Sungkanuparph S, Leechanachai P, Pasomsub E, Watitpun C, Chunhakan S, and Chantratita W. 2008. A study of seven rule-based algorithms for the interpretation of HIV-1 genotypic resistance data in Thailand. Journal of virological methods, 151:79-86.
  doi: 10.1016/j.jviromet.2008.03.017
40. Restif O. 2009. Evolutionary epidemiology 20 years on: challenges and prospects. Infection, genetics and evolution: journal of molecular epidemiology and evolutionary genetics in infectious diseases, 9:108-123.
  doi: 10.1016/j.meegid.2008.09.007
41. Sharp P M, and Hahn B H. 2011. Origins of HIV and the AIDS Pandemic. 1-22.
42. Sharp P M, and Hahn B H. 2011. Origins of HIV and the AIDS pandemic. Cold Spring Harbor perspectives in medicine, 1:a006841.
43. Shen C, Craigo J, Ding M, Chen Y, and Gupta P. 2011. Origin and dynamics of HIV-1 subtype C infection in India. PloS one, 6:e25956.
  doi: 10.1371/journal.pone.0025956
44. Sierra M, Thomson M M, Posada D, Pérez L, Aragonés C, González Z, Pérez J, Casado G, and Nájera R. 2007. Identification of 3 phylogenetically related HIV-1 BG intersubtype circulating recombinant forms in Cuba. Journal of acquired immune deficiency syndromes (1999), 45:151-160.
  doi: 10.1097/QAI.0b013e318046ea47
45. Silveira J, Santos A F, Martínez A M B, Góes L R, Mendoza-Sassi R, Muniz C P, Tupinambás U, Soares M A, and Greco D B. 2012. Heterosexual transmission of human immunodeficiency virus type 1 subtype C in southern Brazil. Journal of clinical virology: the official publication of the Pan American Society for Clinical Virology, 54:36-41.
  doi: 10.1016/j.jcv.2012.01.017
46. Spira S. 2003. Impact of clade diversity on HIV-1 virulence, antiretroviral drug sensitivity and drug resistance. Journal of Antimicrobial Chemotherapy, 51:229-240.
  doi: 10.1093/jac/dkg079
47. Taylor B S, and Hammer S M. 2008. The challenge of HIV-1 subtype diversity. The New England journal of medicine, 359:1965-1966.
  doi: 10.1056/NEJMc086373
48. Tebit D M, and Arts E J. 2011. Tracking a century of global expansion and evolution of HIV to drive understanding and to combat disease. The Lancet Infectious Diseases, 11:45-56.
  doi: 10.1016/S1473-3099(10)70186-9
49. Villanova F E. 2010. Diversity of HIV-1 Subtype B: Implications to the Origin of BF Recombinants. 5:1-9.
50. Walker B D, and Burton D R. 2008. Toward an AIDS vaccine. Science (New York, N.Y.), 320:760-764.
  doi: 10.1126/science.1152622
51. Walker P R, Pybus O G, Rambaut A, and Holmes E C. 2005. Comparative population dynamics of HIV-1 subtypes B and C: subtype-specific differences in patterns of epidemic growth. Infection, genetics and evolution: journal of molecular epidemiology and evolutionary genetics in infectious diseases, 5:199-208.
  doi: 10.1016/j.meegid.2004.06.011
52. Wang Y, Rawi R, Wilms C, Heider D, Yang R, and Hoffmann D. 2013. A small set of succinct signature patterns distinguishes Chinese and non-Chinese HIV-1 genomes. PloS one, 8:e58804.
  doi: 10.1371/journal.pone.0058804
53. Witten I H, Frank E, and Hall M A. 2011. Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.
54. Worobey M, Gemmel M, Teuwen D E, Haselkorn T, Kunstman K, Bunce M, Muyembe J-J, Kabongo J-M M, Kalengayi R M, Van Marck E, Gilbert M T P, and Wolinsky S M. 2008. Direct evidence of extensive diversity of HIV-1 in Kinshasa by 1960. Nature, 455:661-664.
  doi: 10.1038/nature07390
55. Yang O O. 2009. Candidate vaccine sequences to represent intra- and inter-clade HIV-1 variation. PloS one, 4:e7388.
  doi: 10.1371/journal.pone.0007388
56. Zhao Y. 2011. R and Data Mining: Examples and Case Studies.
57. Zhu T, Korber B T, Nahmias A J, Hooper E, Sharp P M, and Ho D D. 1998. An African HIV-1 sequence from 1959 and implications for the origin of the epidemic. Nature, 391:594-597.
  doi: 10.1038/35400

INTRODUCTION

Independent cross-species transmission events gave rise to different HIV lineages, including HIV-1 groups M, N, O, and P, and HIV-2. The HIV-1 M group has been further subdivided into nine subtypes (A-D, F-H, J, and K) according to genetic distance at the amino acid level. Variation is generally 8-17% and up to 30% within subtypes, whereas between subtypes it is generally 17-35% and up to 42%, depending on the genomic regions used for subtyping (Hemelaar J, 2012; Sharp P M, et al., 2011). With the increasing sensitivity and range of sequencing techniques, increasing numbers of circulating recombinant forms (CRFs) have been reported.

The globally uneven distribution of the different HIV-1 subtypes and CRFs reflects the molecular epidemiology of the virus. In southern and eastern Africa, the predominant subtype is C, which accounts for 52% of HIV-1 infections worldwide. By contrast, in West and Central Africa, the vast majority of infections are caused by CRF02_AG, while in East Africa, subtypes A and D and their CRFs are dominant (Delatorre E O, et al., 2012; Hemelaar J, 2012; Kallings L O, 2008; Morris C N, et al., 2006; Njai H F, et al., 2006; Pollakis G, et al., 2003; Shen C, et al., 2011; Tebit D M, et al., 2011; Worobey M, et al., 2008; Zhu T, et al., 1998). Within the homosexual populations in North and South America, Western and Central Europe, Australia, Asia (for example, Hong Kong, Japan, Korea, and Taiwan), North Africa, the Middle East, South Africa, and Russia, subtype B is predominant (Buonaguro L, et al., 2007; Delatorre E O, et al., 2012; Gilbert M T P, et al., 2007; Junqueira D M, et al., 2011; Moran D, et al., 2007; Paraskevis D, et al., 2009). In South America, subtypes B, C, and F coexist with recombinant viruses, and infections caused by BF recombinant viruses (including CRF12_BF, CRF17_BF, and CRF29_BF) accounted for 80% of the HIV-1 infections in Argentina. In Eastern Europe, A1 is the predominant subtype, but subtype B and CRF03_AB are also common in this region (Bello G, et al., 2007; Masciotra S, et al., 2000; Paraschiv S, et al., 2012; Pérez L, et al., 2006; Sierra M, et al., 2007; Silveira J, et al., 2012; Villanova F E, 2010; Walker P R, et al., 2005).

In contrast to Africa, the subtypes circulating in Asia, including CRF01_AE, B, and C, as well as the various CRFs derived from these three, seem to have originated from different founder events. Notably, subtype B in Asia can be divided into two types: in evolutionary terms, one is close to the subtype B found in Europe and America, while the other is genetically distant and forms a distinct cluster in the phylogenetic tree, called B' or Thai B. The coexistence of HIV-1 subtypes in East Asia has given rise to various CRFs that are dominant in particular regions, such as the BC recombinants epidemic among drug users in Northwestern and Southeastern China, and the various Thai-B and CRF01_AE recombinants found in Thailand and Myanmar (Li Y, et al., 2010; Liao H, et al., 2009; Liu J, et al., 2011; Meng Z, et al., 2012).

The central role that HIV diversity plays in HIV transmission suggests the necessity for global HIV epidemic monitoring and a reasonable sampling strategy. In addition, studies of the association of diversity with spread, viral load, and disease progression may also give crucial clues for the prevention and treatment of HIV (Butler I F, et al., 2007; Fryer H R, et al., 2011; Restif O, 2009; Spira S, 2003; Taylor B S, et al., 2008).

Exploration of the signature patterns in the HIV genome could be the first step toward studying HIV diversity. Data mining of biological sequences involves identifying rules, extracting features, and inferring models from a large but specific biological dataset in order to classify, recognize, or predict new data. This usually involves pattern mining and clustering of biological sequences, and these two techniques can usually be used interchangeably (Poonpiriya V, et al., 2008). The performance and effectiveness of the various pattern mining and clustering methods differ, depending on the characteristics of the algorithms and the datasets used (Cai Y-D, et al., 2010; Dybowski J N, et al., 2011; Zhao Y, 2011).

Although traditional phylogenetic analysis of HIV sequences supports the study of HIV origin, evolution, and dissemination, it is generally unsuitable for large samples because of its computational requirements (Blair C, et al., 2011). In the current study, we used RIPPER (Repeated Incremental Pruning to Produce Error Reduction), an efficient data mining method suitable for large-scale sample analysis (Avenue M, et al., 1994), to comprehensively analyze global HIV sequence patterns. We focused on the Env region, which is covered by most of the currently available datasets and includes the maximum amount of information (Lynch R M, et al., 2009).

In our study, we compiled four datasets from four HIV-1 pandemic hotspots with different epidemiological and evolutionary features: Southeast Asia, China, Africa, and Southern Africa, and focused in our analysis on answering the following three questions.

1) For the four epidemiological hotspots with different epidemiological features, can we identify signature patterns that are characteristic of HIV-1 sequences from the four geographic classes?

2) Is the performance of the signature pattern inference the same for all four datasets?

3) Can we understand the scope of signature pattern analysis and the application of these patterns?

MATERIALS AND METHODS

The global HIV-1 sequences and associated information were retrieved from the Los Alamos HIV sequence database (http://www.hiv.lanl.gov/).

Dataset compilation

The dataset was downloaded from LANL HIV Sequence Alignments (http://www.hiv.lanl.gov/content/sequence/NEWALIGN/align.html) by setting Alignment type as Filter alignment (all complete sequences), Year as 2011, Organism as HIV-1, DNA/Protein as PRO, Region as Env, and Subtype as ALL.

The original downloaded dataset comprised 3,261 sequences. After removing problematic sequences, especially those with ambiguous amino acids, 2,762 sequences remained. The HXB2 Env sequence was used as the reference for amino acid positions (with number 1 corresponding to the first Met residue). The combined set was realigned using MUSCLE (version 3.5) (Edgar R C, 2004) with default parameters.
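
The filtering step described above can be sketched as follows. This is a minimal Python illustration, not the authors' R scripts (Supplementary material 1). The set of characters treated as ambiguous, the helper names `read_fasta` and `drop_ambiguous`, and the toy records are assumptions made for the example; the source only states that sequences with ambiguous amino acids were removed.

```python
# Characters treated as "problematic" are an assumption for illustration;
# the gap symbol "-" is deliberately kept because alignments contain gaps.
AMBIGUOUS = set("XBZJ#$*?")

def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def drop_ambiguous(records):
    """Keep only records whose sequence contains no ambiguous character."""
    return [(h, s) for h, s in records if not (set(s.upper()) & AMBIGUOUS)]

# Toy records (hypothetical names and residues, not real database entries).
example = """>B.FR.83.toy0.ACC0
MRVKEKYQHL
>C.ZA.04.toy1.ACC1
MRVKGIXRNC
>01_AE.TH.90.toy2.ACC2
MRVKETQ-NW
"""
clean = drop_ambiguous(read_fasta(example))
print([header for header, _ in clean])
```

In this toy input the second record contains "X" and is dropped, while the gapped third record is retained.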

The following four datasets were compiled based on the alignment of the aforementioned 2,762 sequences. For inference of the signature patterns for Env sequences sampled in Southeast Asia, we obtained an aligned set of 312 Southeast Asian and 2,450 non-Southeast Asian Env sequences by extracting sequences with labels of "TH", "CN", "MY", or "VN" in the sequence headers. For inference of the signature patterns for Env sequences sampled in China, we obtained an aligned set of 162 Chinese and 2,600 non-Chinese Env sequences from the above overall alignment by extracting the sequences with the label "CN" in the sequence headers. For inference of the signature patterns for Env sequences sampled in Africa, we obtained an aligned set of African Env sequences by extracting the sequences with labels of African countries in the sequence headers. African samples were separated into five regions as follows:

1. Southern ("AO", "ZM", "ZW", "BW", "ZA", "NA");

2. Central ("CF", "CM", "CG", "GA", "AO");

3. Western ("ML", "GH", "NE", "NG", "SN", "GM", "BJ");

4. Eastern ("ET", "UG", "KE", "SO", "TZ", "RW");

5. Northern ("EG", "SD", "LY", "MA", "TN").

For the inference of signature patterns of Env sequences sampled in Africa, we obtained an overall aligned set of 1,103 African and 1,659 non-African Env sequences, whereas for the inference of signature patterns for Env sequences sampled in Southern Africa alone, we selected only the Southern African sequences described above ("AO", "ZM", "ZW", "BW", "ZA", "NA"), which gave an aligned set of 599 Southern African and 2,163 non-Southern African Env sequences.
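
The class labelling described above can be sketched in a few lines of Python. This is an illustration only, not the authors' R scripts. It assumes that each sequence header has dot-separated fields with the two-letter country code in the second field (for example `C.ZM.02.name.accession`); the helper names and the toy headers are hypothetical. The country-code sets are copied from the text.

```python
# Country-code sets as listed in the text.
SOUTHEAST_ASIA = {"TH", "CN", "MY", "VN"}
CHINA = {"CN"}
AFRICA_REGIONS = {
    "Southern": {"AO", "ZM", "ZW", "BW", "ZA", "NA"},
    "Central": {"CF", "CM", "CG", "GA", "AO"},
    "Western": {"ML", "GH", "NE", "NG", "SN", "GM", "BJ"},
    "Eastern": {"ET", "UG", "KE", "SO", "TZ", "RW"},
    "Northern": {"EG", "SD", "LY", "MA", "TN"},
}
AFRICA = set().union(*AFRICA_REGIONS.values())

def country_code(header):
    """Return the second dot-separated header field (assumed to be the country code)."""
    fields = header.lstrip(">").split(".")
    return fields[1].upper() if len(fields) > 1 else ""

def label(headers, positive_codes, pos="in", neg="out"):
    """Assign a binary class label to each header according to its country code."""
    return [pos if country_code(h) in positive_codes else neg for h in headers]

# Toy headers (hypothetical names and accessions).
headers = ["B.FR.83.toy0.ACC0", "C.ZM.02.toy1.ACC1",
           "01_AE.TH.90.toy2.ACC2", "B.CN.05.toy3.ACC3"]
print(label(headers, SOUTHEAST_ASIA))
print(label(headers, AFRICA_REGIONS["Southern"]))
```

Each of the four datasets then corresponds to one call of `label` with a different set of positive country codes on the same alignment.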

Extraction of all labels and the manipulation of characters were performed using the R scripts (Supplementary material 1). The supplementary materials and the four alignment files are available on the website of Virologica Sinica: http://www.virosin.org.

Rule inference

To deduce the signature patterns of the four datasets, we used JRip (Witten I H, et al., 2011) through RWeka (Hornik K, et al., 2009), which runs in the R environment (Gentleman R C, et al., 2004; Hornik K, et al., 2009) (http://cran.r-project.org). JRip implements RIPPER, an incremental machine learning method. Because the rule sets are inferred directly from the training datasets, the method is suitable for the fast inference of rules from large datasets. Further association studies and plotting were performed in the R environment.
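
A RIPPER rule set is an ordered list of rules, each a conjunction of attribute-value conditions, followed by a default class; a sample receives the class of the first rule it satisfies. The Python sketch below shows how such a rule list could be applied to an aligned sequence when the attributes are alignment positions. It does not call JRip or RWeka, and the rules, positions, residues, and class names are hypothetical values chosen for illustration.

```python
def classify(sequence, rules, default):
    """Apply an ordered rule list to one aligned sequence.

    Each rule is (conditions, predicted_class), where conditions maps a
    1-based alignment position to the required residue. The first rule whose
    conditions all hold decides the class; otherwise the default class is used.
    """
    for conditions, predicted in rules:
        if all(pos <= len(sequence) and sequence[pos - 1] == residue
               for pos, residue in conditions.items()):
            return predicted
    return default

# Hypothetical rule list over a toy 8-column alignment.
rules = [
    ({3: "R", 7: "S"}, "SoutheastAsian"),  # position 3 is R AND position 7 is S
    ({5: "K"}, "SoutheastAsian"),          # position 5 is K
]

for seq in ["MARVKESQ", "MAGVKETQ", "MAGVLETQ"]:
    print(seq, classify(seq, rules, default="nonSoutheastAsian"))
```

Because the list is ordered, a sequence matching several rules is assigned by the earliest one, which is why short rule sets of this kind remain easy to read as signature patterns.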

Assessment of signature pattern inference

To validate the inferred signature patterns, we assessed their classification performance. We assessed in detail the performance of the signature patterns in classifying Env sequences as Southeast Asian or non-Southeast Asian. We performed a full 'leave-one-out' classification run with the same set of 2,762 Env sequences used above: each sequence was omitted once from the training data, and a set of signature patterns was learned by RIPPER from the remaining 2,761 sequences and their class labels as described above. The omitted sequence was then classified as either Southeast Asian or non-Southeast Asian based on this set of signature patterns. Comparison of the 2,762 predicted and true class labels allowed for an assessment of the prediction performance. The same procedure was used to assess classification for the other three datasets.
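
The leave-one-out procedure can be sketched as a small harness that accepts any training function. In the Python sketch below, the learner `train_one_position` is a deliberately simple stand-in (a single-column majority rule), not RIPPER, and the toy alignment and labels are invented for illustration; in the study the learner would be JRip run from R.

```python
from collections import Counter, defaultdict

def train_one_position(seqs, labels):
    """Stand-in learner (NOT RIPPER): choose the single alignment column whose
    residue -> majority-class table makes the fewest training errors."""
    default = Counter(labels).most_common(1)[0][0]
    best = None
    for j in range(len(seqs[0])):
        table = defaultdict(Counter)
        for s, y in zip(seqs, labels):
            table[s[j]][y] += 1
        mapping = {aa: counts.most_common(1)[0][0] for aa, counts in table.items()}
        errors = sum(mapping[s[j]] != y for s, y in zip(seqs, labels))
        if best is None or errors < best[0]:
            best = (errors, j, mapping)
    _, j, mapping = best
    # Residues unseen in training fall back to the majority class.
    return lambda s: mapping.get(s[j], default)

def leave_one_out(seqs, labels, train):
    """Hold out each sequence once, train on the rest, predict the held-out one."""
    predictions = []
    for i in range(len(seqs)):
        model = train(seqs[:i] + seqs[i + 1:], labels[:i] + labels[i + 1:])
        predictions.append(model(seqs[i]))
    return predictions

def accuracy(predicted, truth):
    """Fraction of held-out predictions that match the true labels."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

# Toy alignment: column 2 (R vs G) separates the two classes.
seqs = ["MRKS", "MRKT", "MRQS", "MGKS", "MGQT", "MGQS"]
labels = ["in", "in", "in", "out", "out", "out"]
preds = leave_one_out(seqs, labels, train_one_position)
print(len(preds), accuracy(preds, labels))
```

The harness yields one prediction per sequence, so the accuracy is computed over as many predictions as there are sequences in the dataset.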

Entropy calculation for pattern positions

To interpret the positions captured by the inferred patterns from an information-theoretic viewpoint, the R package bio3d (Grant B J, et al., 2006) was used to manipulate and analyze the sequences. Using the "entropy" function, we computed the Shannon entropy S_j for each alignment position j based on a 22-letter alphabet comprising the 20 conventional amino acids, the gap symbol "-", and "X" (the latter was not used here), according to the following formula:

$$S_j = -\sum_{i=1}^{22} P_{ij} \log_2 P_{ij}$$

with the relative frequency P_ij of letter i at alignment position j.
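
The per-position Shannon entropy can be illustrated with a short Python sketch. It is not the bio3d implementation; it assumes base-2 logarithms (so that S_j is measured in bits and is at most log2(22) for a 22-letter alphabet) and treats letters that do not occur in a column as contributing zero. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy in bits of one alignment column given as a string of letters."""
    n = len(column)
    # Letters absent from the column have P_ij = 0 and contribute nothing.
    return sum(-(c / n) * math.log2(c / n) for c in Counter(column).values())

def alignment_entropy(seqs):
    """Entropy S_j for every column j of an alignment of equal-length sequences."""
    return [column_entropy("".join(s[j] for s in seqs))
            for j in range(len(seqs[0]))]

# Toy alignment with a fully conserved first column and a gap in the last one.
aln = ["MRKS", "MRQS", "MGK-", "MGQT"]
print(alignment_entropy(aln))
```

A fully conserved column has entropy 0, and a column in which all 22 letters are equally frequent reaches the maximum of log2(22), about 4.46 bits.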

Phylogenetic analysis for pattern positions

Owing to the limitations of phylogenetic analysis, such as its computational requirements, we considered only one specific pattern in this study, corresponding to subtype B and Thai-B (B') in Southeast Asian sequences. This analysis might provide important clues to the geographic origin of the Chinese HIV-1 B' pandemic and help to interpret the identified patterns from a phylogenetic viewpoint, which might exclude founder effects.

Using the maximum likelihood (ML) method to reconstruct phylogenetic trees, we analyzed a set of 954 global HIV-1 subtype B (B') Env sequences, with 10 HIV-1 subtype D sequences added as an outgroup. The substitution model was HIVb + I + Gamma; the heuristic tree searches used nearest-neighbor interchange (NNI), and branch support was estimated with the approximate likelihood ratio test (aLRT). After obtaining the phylogenetic tree, the association of the amino acid pattern with the phylogenetic clusters was plotted with the R package ape (Paradis E, et al., 2004).

Acknowledgements

We gratefully acknowledge funding by the Chinese Key National Science and Technology Program in the 12th Five-Year Period, grant 2012ZX10001006-002; the Deutsche Forschungsgemeinschaft (http://www.dfg.de), grant TRR60/A6; and the University of Duisburg-Essen (http://www.uni-due.de).

Author contributions

Yan Wang: Performed the experiments and wrote the article

Reda Rawi: Participated in a portion of the experiments

Daniel Hoffmann: Designed the project

Binlian Sun: Designed the project and revised the article

Rongge Yang: Designed the project and revised the article

Supplementary materials:

The supplementary materials and the four alignment files are available on the website of Virologica Sinica: http://www.virosin.org.
