Using genomic signatures for HIV-1 sub-typing
© Pandit and Sinha; licensee BioMed Central Ltd. 2010
Published: 18 January 2010
Human Immunodeficiency Virus type 1 (HIV-1), the causative agent of Acquired Immune Deficiency Syndrome (AIDS), exhibits very high genetic diversity with different variants or subtypes prevalent in different parts of the world. Proper classification of the HIV-1 subtypes, displaying differential infectivity, plays a major role in monitoring the epidemic and is also a critical component for effective treatment strategy. The existing methods to classify HIV-1 sequence subtypes, based on phylogenetic analysis focusing only on specific genes/regions, have shown inconsistencies as they lack the capability to analyse whole genome variations. Several isolates are left unclassified due to unresolved sub-typing. It is apparent that classification of subtypes based on complete genome sequences, rather than sub-genomic regions, is a more robust and comprehensive approach to address genome-wide heterogeneity. However, no simple methodology exists that directly computes HIV-1 subtype from the complete genome sequence.
We use Chaos Game Representation (CGR) to identify the distinctive genomic signature associated with the DNA sequence organisation in different HIV-1 subtypes. We first analysed the effect of nucleotide word lengths (k = 2 to 8) on whole genomes of the HIV-1 M group sequences, and found that a word length of k = 6 was optimal for classifying the HIV-1 subtypes in a Test sequence set. Using the optimised word length, we then showed accurate classification of the HIV-1 subtypes from both the Reference Set sequences and from all available sequences in the database. Finally, we applied the approach to cluster the five unclassified HIV-1 sequences from Africa and Europe, and to predict their possible subtypes.
We propose a genomic signature-based approach, using CGR with suitable word length optimisation, which can be applied to classify intra-species variations, and apply it to the complex problem of HIV-1 subtype classification. We demonstrate that CGR is a simple and computationally less intensive method that not only accurately segregates the HIV-1 subtypes and sub-subtypes, but also aids in the classification of the unclassified sequences. We hope that it will be useful in subtype annotation of newly sequenced HIV-1 genomes.
Human Immunodeficiency Virus (HIV) type 1, a retrovirus, is the causative agent of Acquired Immunodeficiency Syndrome (AIDS). With more than 33 million individuals living with the virus and more than 25 million deaths since its onset, HIV has led to a global pandemic. The major obstacle to curbing HIV-1 through the development of a vaccine has been its high genetic variability and evolutionary rate. This genetic heterogeneity of HIV-1 has been attributed to the lack of proofreading capability of the Reverse Transcriptase enzyme [3, 4]. A genetically diverse population of viral species ('quasispecies') dwells inside an infected individual, and HIV can exhibit up to 10% variability within a single individual. The human host's immune system, as well as the antiviral drugs used in treatment regimes, also triggers viral evolution.
Analogous to within-individual variability, HIV exhibits high heterogeneity at the population level. HIV-1 sequences are classified into three phylogenetically distinct groups - M (Major), O (Outlier), and N (non-M/non-O) - based upon their sequence diversity. The M group is globally prevalent and responsible for the pandemic. Group M is further stratified into nine genetically discrete subtypes - A to D, F to H, J, and K - showing 25% to 35% sequence-level variation between the genomes of different subtypes [2, 5, 6]. The subtypes A and F are further classified into sub-subtypes (A1, A2) and (F1, F2) based upon differential clustering. To add to the complexity, two or more HIV-1 subtypes recombine and circulate in the population to form Circulating Recombinant Forms (CRFs), and new CRFs continuously emerge over time.
Historically, the subtypes were classified based on envelope (env) gene sequence variations, and classification of subtypes A to F was done on the basis of the env gene alone. All subtypes, except E, could be consistently classified from the gag region of HIV-1. The partial genome sequences and phylogenies based on env and gag genes further led to the designation of subtypes G to J [9–11]. Phylogenetic comparisons of A and F led to the determination of sub-subtypes that form differential clusters within the corresponding subtypes. Generally, HIV-1 strains fall into the appropriate phylogenetic clusters when multiple regions of their genome are analysed. Subtype K, which was earlier proposed to be a sub-subtype of F based upon phylogenetic analysis of env and gag sequences, was later classified as a distinct subtype when whole genome sequences were analysed. Subtype I, previously classified on the basis of the C2V3 region of env sequences, was later found to be a recombinant of subtypes A and G [10, 13]. Some of the recent methods use env, gag and pol gene sequences, which together span most of the HIV genome. These studies clearly indicate that classification of subtypes based on complete genome sequences, rather than sub-genomic regions, may be a more robust and comprehensive approach. However, no simple methodology exists that directly computes an HIV-1 subtype from the complete genome sequence; existing methods instead generate gene-based phylogenies and then analyse the distance matrices [14–16].
In this article, we address this problem by identifying the variations in the genomic signatures (at various word lengths) at whole genome level in the different subtypes of HIV-1 using the Chaos Game Representation (CGR) method. CGR is a two-dimensional plot in which the primary sequence organisation of DNA is mapped using iterative functions. The use of CGR has mostly been restricted to visualising nucleotide sequences, where patterns such as over- or under-representation of nucleotides, dinucleotides, trinucleotides etc. can be discerned by eye. Goldman concluded that the patterns exhibited by CGR can be explained by word length composition up to three, i.e., the frequencies of nucleotides, dinucleotides and trinucleotides. However, it was shown later that longer oligonucleotide frequencies also influence the patterns seen in CGR. Recently, a spectrum of word lengths in CGRs, in addition to nucleotides and dinucleotides, was identified as a factor that can differentiate between genomes of different species. Several distance measures were proposed to compare two or more CGRs, and these were employed for studying phylogenetic relationships among diverse species [19, 20]. However, it is not clear if intra-species genomic variability, which is much smaller than between-species variation, can be resolved using CGRs with similar word lengths. A different class of methods, using data structures such as suffix arrays and suffix trees, has also been used to study specific genomic signatures at different word lengths.
In this study, we demonstrate the applicability of CGR to the problem of intra-species variability by considering the complex issue of HIV-1 subtype classification, as these subtypes form a set whose members exhibit subtle differences that are nonetheless sufficient for displaying differential infectivity and evolutionary dynamics. We show that CGR is an effective methodology to resolve HIV-1 subtype variations by first optimising the word length, and then applying the method to recover the known HIV-1 subtypes and predict the unknown ones by analysing all available whole genome sequences, along with the Reference Sequence set that is used by workers in the field. Our studies clearly show that this unusual approach can effectively be used for studying intra-species variability in general, and specifically offers an easy-to-use and accurate method for HIV-1 sub-typing from whole genome data.
Table 1. Accession numbers of training set sequences used for word length optimisation
The Reference Sequence set was taken from the HIV Database; its sequences were classified using traditional sequence alignment methods. The Reference set contains four sequences each for subtypes B, C, D and G; three sequences for subtype H; two each for subtypes J and K; four each for sub-subtypes A1, F1 and F2; two for sub-subtype A2; and four for SIVcpz, where SIVcpz is the SIV sequence derived from chimpanzee. The U (unclassified) sequences were also collected from the database.
The CGR of a genome is plotted in a square, with the four vertices labelled with the four nucleotide bases A, T, G and C, respectively. To initialise, we place the first point in the middle of the square. The second point is placed at the mid-point between the initial point and the vertex corresponding to the first nucleotide of the DNA sequence. The next point, corresponding to the second nucleotide, is placed at the mid-point between the previously plotted point and the vertex corresponding to that nucleotide. The process is repeated for the complete sequence, so that the entire genome is plotted in a two-dimensional plot. The frequency of words of a given length can be extracted by dividing the CGR space with a grid of appropriate size. To obtain the frequencies of all the k-letter words, the CGR must be divided into a ($2^k \times 2^k$) grid. The frequencies are obtained by counting the number of points in each box of the grid.
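The mid-point construction described above corresponds to the iteration $\mathrm{CGR}_i = \tfrac{1}{2}\,(\mathrm{CGR}_{i-1} + g_i)$, where $g_i$ denotes the coordinates of the vertex corresponding to the $i$-th nucleotide, the initial point $\mathrm{CGR}_0$ is the mid-point of the square (i.e., (400, 400)) and $i$ varies from 1 to n. In order to have greater resolution to study the effect of different word lengths, we have taken 800 divisions to construct the CGR.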
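The construction and the grid-based word counting can be sketched in Python as follows. This is a minimal illustration rather than the authors' MATLAB code: the function names `cgr_points` and `kmer_grid` are hypothetical, the assignment of bases to particular corners is an assumption (the text names the four vertices but not their positions), and skipping the first k-1 points, which do not yet encode a full k-letter word, is a choice made for this sketch.

```python
import numpy as np

# Assumed corner assignment (unit square); the source does not specify it.
VERTICES = {"A": (0.0, 0.0), "T": (1.0, 0.0), "G": (1.0, 1.0), "C": (0.0, 1.0)}

def cgr_points(seq, size=800):
    """Return one CGR point per base of seq, in a size x size square."""
    pts = np.empty((len(seq), 2))
    cur = np.array([size / 2.0, size / 2.0])  # initial point: centre of the square
    for i, base in enumerate(seq):
        vertex = np.array(VERTICES[base]) * size  # KeyError on bases other than A/T/G/C
        cur = (cur + vertex) / 2.0                # mid-point rule
        pts[i] = cur
    return pts

def kmer_grid(seq, k, size=800):
    """Count CGR points in each cell of a 2^k x 2^k grid."""
    pts = cgr_points(seq, size)[k - 1:]  # drop points that precede the first full k-mer
    n = 2 ** k
    cell = size / n
    # Clip so that a point rounded onto the outer edge stays in the last cell.
    idx = np.minimum((pts // cell).astype(int), n - 1)
    grid = np.zeros((n, n), dtype=int)
    np.add.at(grid, (idx[:, 1], idx[:, 0]), 1)  # rows index y, columns index x
    return grid
```

For a sequence of length n, `kmer_grid(seq, 6)` returns a 64 × 64 matrix whose entries sum to n - 5, one count per 6-letter word in the sequence.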
where k is the word length, and aij and bij are the frequency values corresponding to the first and second CGR, respectively. For a given set of CGRs, we computed the distance for each pair to construct a pair-wise distance matrix, which was used for further analysis. All calculations were performed using MATLAB R2007b.
Pair-wise distance matrices were used to cluster different HIV-1 subtypes using the Neighbor-Joining (NJ) method; the "Neighbor" programme of PHYLIP was used to construct the dendrograms.
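The step from frequency grids to a pair-wise distance matrix might look like the sketch below. The distance formula itself is not reproduced in this text, so the measure used here, a Euclidean distance between grids scaled to unit sum, is only an assumed stand-in and should not be read as the paper's definition. The names `cgr_distance` and `distance_matrix` are hypothetical.

```python
import numpy as np

def cgr_distance(a, b):
    """Assumed stand-in measure: Euclidean distance between two
    frequency grids after scaling each to sum to 1."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / a.sum()
    b = b / b.sum()
    return float(np.sqrt(((a - b) ** 2).sum()))

def distance_matrix(grids):
    """Symmetric matrix of pair-wise distances for a list of grids."""
    n = len(grids)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = cgr_distance(grids[i], grids[j])
    return d
```

Scaling each grid to unit sum keeps genomes of slightly different lengths comparable; any other measure can be swapped in by replacing `cgr_distance`.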
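To hand a distance matrix to PHYLIP, it has to be written in PHYLIP's distance-matrix layout: a first line with the number of sequences, then one row per sequence with a name field of 10 characters followed by the distances. The helper below is a hedged sketch of that export step; the function name `phylip_distance_text` and the example labels are hypothetical, and names longer than 10 characters are simply truncated, which can cause collisions.

```python
def phylip_distance_text(names, dist):
    """Format a square distance matrix in PHYLIP's distance-matrix layout."""
    lines = [f"{len(names):5d}"]            # first line: number of sequences
    for name, row in zip(names, dist):
        label = name[:10].ljust(10)         # fixed-width name field (10 characters)
        lines.append(label + " " + " ".join(f"{v:.6f}" for v in row))
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # Hypothetical labels and distances, for illustration only.
    text = phylip_distance_text(["seq_A1", "seq_B"], [[0.0, 0.5], [0.5, 0.0]])
    print(text, end="")
```

The resulting text can be saved to a file and supplied to the Neighbor programme as its input file.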
Results and discussion
Here we present the results of our study on the classification of HIV-1 subtypes using the CGR approach. First, we generated the CGR plots for the first HIV-1 complete genome sequence to highlight the features exhibited by a typical HIV-1 genome. Then we used the training set of HIV-1 subtype genome sequences, given in Table 1, to optimise the word length required to correctly segregate the different subtypes. We further tested the optimised word length on the Reference Sequence Set used for HIV sub-typing, and also for other subtype sequences available in the database. Finally, we analysed all the unclassified sequences for HIV-1 implementing our methodology of CGR.
CGR for HIV genome
Clustering of HIV-1 group M subtypes using different word lengths
Resolving subtypes from the reference dataset
Resolving subtypes from all available genome sequences
Predicting the subtypes of unclassified sequences
The exponentially growing collection of HIV-1 genome sequences reveals a variable geographical distribution of the subtypes. Subtype C dominates in South Africa, India, China, etc. and is the most prevalent subtype in the HIV-1 epidemic. On the other hand, HIV-1 subtype B, the most studied subtype, is predominant in the Americas, Australia, Western Europe, Japan etc. The high variability of HIV-1 has a major influence on infectivity and transmissibility of the virus, with subtypes exhibiting variable treatment response and differential selection of drug resistance mutations, as seen for subtypes B and C. Therefore, accurate classification of HIV-1 genomes is important, as it facilitates monitoring the epidemic and developing treatment strategies.
CGR has in the past been used as a visualization tool, and it was shown that it could highlight inter-species differences. Here, however, we propose that CGR is a simple and computationally less intensive method that can even identify genomic signatures marking intra-species variability, which is much smaller than inter-species variability. We demonstrate that CGR is a suitable method to correctly separate HIV-1 group M subtypes using whole genome sequences. We examine the applicability of different word lengths, and show that a word length of six is sufficient to segregate HIV-1 subtypes and sub-subtypes into distinct clusters. In HIV-1, regions of high variability have been known to exhibit non-random distribution of certain 6 base pair long nucleotide sequences, which may undergo non-synonymous mutations leading to changes in the amino acids. CGR computed from genome sequences, however, can recognize both synonymous and non-synonymous changes, highlighting both neutral and selective mutations. It remains to be studied if a similar analysis of the proteome, instead of the genome sequence, would be useful in obtaining the functional basis of the subtype classifications.
This methodology utilizes genome-wide information rather than gene- or region-specific information to classify HIV-1 subtypes. Using CGR, we could replicate the clustering of Reference Sequence set, and also all other HIV-1 group M subtype sequences. Importantly, we show that using this method we could also classify the five Unclassified sequences to subtypes, which fit with additional information available in literature. Thus, we demonstrate the applicability of this new method to solve the complex problem of HIV-1 sub-typing, and propose its use in subtype annotation of the newly sequenced HIV-1 genomes. The proposed methodology, with suitable word length optimisation, can also be applied to classify intra-species variants in other organisms.
The authors thank Mr. Satya P. Rungta for help in the initial stages of the work. AP thanks CSIR for fellowship.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 1, 2010: Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S1.
- Joint United Nations Program on HIV/AIDS (UNAIDS) and World Health Organization (WHO): AIDS epidemic update. 2007.
- Tebit DM, Nankya I, Arts EJ, Gao Y: HIV diversity, recombination and disease progression: how does fitness "fit" into the puzzle? AIDS Rev 2007, 9: 75–87.
- Preston B, Poiesz B, Loeb L: Fidelity of HIV-1 reverse transcriptase. Science 1988, 242: 1168–1171. 10.1126/science.2460924
- Domingo E, Holland J: RNA virus mutations and fitness for survival. Annu Rev Microbiol 1997, 51: 151–178. 10.1146/annurev.micro.51.1.151
- Takebe Y, Uenishi R, Li X: Global molecular epidemiology of HIV: understanding the genesis of AIDS pandemic. Adv Pharmacol 2008, 56: 1–25.
- Robertson DL, Anderson JP, Bradac JA, Carr JK, Foley B, Funkhouser RK, Gao F, Hahn BH, Kuiken C, Learn GH, et al.: HIV-1 Nomenclature Proposal. In Human Retroviruses and AIDS 1999. Edited by: Kuiken CL, Foley B, Hahn B, Korber B, McCutchan F, Marx PA, Mellors JW, Mullins JI, Sodroski J, Wolinksy S. New Mexico: Los Alamos National Laboratory; 1999:492–505.
- Myers G, MacInnes K, Korber B: The emergence of simian/human immunodeficiency viruses. AIDS Res Hum Retroviruses 1992, 8: 373–386. 10.1089/aid.1992.8.373
- Louwagie J, McCutchan FE, Peeters M, Brennan TP, Sanders-Buell E, Eddy GA, Groen G, Fransen K, Gershy-Damet GM, Deleys R, et al.: Phylogenetic analysis of gag genes from 70 international HIV-1 isolates provides evidence for multiple genotypes. AIDS 1993, 7: 769–780. 10.1097/00002030-199306000-00003
- Janssens W, Heyndrickx L, Fransen K, Motte J, Peeters M, Nkengasong JN, Ndumbe PM, Delaporte E, Perret JL, Atende C, et al.: Genetic and phylogenetic analysis of env subtypes G and H in central Africa. AIDS Res Hum Retroviruses 1994, 10: 877–879.
- Kostrikis LG, Bagdades E, Cao Y, Zhang L, Dimitriou D, Ho DD: Genetic analysis of human immunodeficiency virus type 1 strains from patients in Cyprus: identification of a new subtype designated subtype I. J Virol 1995, 69: 6122–6130.
- Leitner T, Alaeus A, Marquina S, Lilja E, Lidman K, Albert J: Yet another subtype of HIV type 1? AIDS Res Hum Retroviruses 1995, 11: 995–997. 10.1089/aid.1995.11.995
- Triques K, Bourgeois A, Vidal N, Mpoudi-Ngole E, Mulanga-Kabeya C, Nzilambi N, Torimiro N, Saman E, Delaporte E, Peeters M: Near-full-length genome sequencing of divergent African HIV-1 subtype F viruses leads to the identification of a new HIV-1 subtype designated K. AIDS Res Hum Retroviruses 2000, 16: 139–151. 10.1089/088922200309485
- Gao F, Robertson DL, Carruthers CD, Li Y, Bailes E, Kostrikis LG, Salminen MO, Bibollet-Ruche F, Peeters M, Ho DD, et al.: An isolate of human immunodeficiency virus type 1 originally classified as subtype I represents a complex mosaic comprising three different group M subtypes (A, G, and I). J Virol 1998, 72: 10234–10241.
- Rozanov M, Plikat U, Chappey C, Kochergin A, Tatusova T: A web-based genotyping resource for viral sequences. Nucleic Acids Res 2004, 32: W654–659. 10.1093/nar/gkh419
- The Stanford HIV Drug Resistance Database [http://hivdb.stanford.edu/index.html]
- Myers RE, Gale CV, Harrison A, Takeuchi Y, Kellam P: A statistical model for HIV-1 sequence classification using the subtype analyser (STAR). Bioinformatics 2005, 21: 3535–3540. 10.1093/bioinformatics/bti569
- Jeffrey HJ: Chaos game representation of gene structure. Nucleic Acids Res 1990, 18: 2163–2170. 10.1093/nar/18.8.2163
- Goldman N: Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences. Nucleic Acids Res 1993, 21: 2487–2491. 10.1093/nar/21.10.2487
- Almeida JS, Carriço JA, Maretzek A, Noble PA, Fletcher M: Analysis of genomic sequences by chaos game representation. Bioinformatics 2001, 17: 429–437. 10.1093/bioinformatics/17.5.429
- Wang Y, Hill K, Singh S, Kari L: The spectrum of genomic signatures: from di-nucleotides to chaos game representation. Gene 2005, 346: 173–185. 10.1016/j.gene.2004.10.021
- Manber U, Myers G: Suffix arrays: a new method for on-line string searches. SIAM J Computing 1993, 22(5):935–948. 10.1137/0222058
- Troyer RM, Collins KR, Abraha A, Fraundorf E, Moore DM, Krizan RW, Toossi Z, Colebunders RL, Jensen MA, Mullins JI, et al.: Changes in HIV-1 fitness and genetic diversity during disease progression. J Virol 2005, 79: 9006–9018. 10.1128/JVI.79.14.9006-9018.2005
- Leitner T, Korber B, Daniels M, Calef C, Foley B: HIV-1 subtype and circulating recombinant form (CRF) reference sequences. In HIV sequence compendium. Edited by: Leitner T, Foley B, Hahn B, Marx P, McCutchan F, Mellors J, Wolinsky S, Korber B. New Mexico: Los Alamos National Laboratory; 2005:41–48.
- HIV Sequence Database [http://www.hiv.lanl.gov]
- The MathWorks: MATLAB [http://www.mathworks.com]
- Felsenstein J: PHYLIP - Phylogeny Inference Package. Cladistics 1989, 5: 164–166.
- Huson DH, Bryant D: Application of Phylogenetic Networks in Evolutionary Studies. Mol Biol Evol 2006, 23(2):254–267. 10.1093/molbev/msj030
- Hoek L, Pollakis G, Lukashov VV, Jebbink MF, Jeeninga RE, Bakker M, Dukers N, Jurriaans S, Paxton WA, Back NKT, et al.: Characterization of an HIV-1 group M variant that is distinct from the known subtypes. AIDS Research and Human Retroviruses 2007, 23(3):466–470. 10.1089/aid.2006.0184
- Paraskevis D, Magiorkinis E, Magiorkinis G, Sypsa V, Paparizos V, Lazanas M, Gargalianos P, Antoniadou A, Panos G, Chrysos G, et al.: Increasing prevalence of HIV-1 subtype A in Greece: estimating epidemic history and origin. J Infect Dis 2007, 196: 1167–1176. 10.1086/521677
- Doi H: Importance of purine and pyrimidine content of local nucleotide sequences (six bases long) for evolution of the human immunodeficiency virus type 1. Proc Natl Acad Sci USA 1991, 88: 9282–9286. 10.1073/pnas.88.20.9282
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.