A classification approach for genotyping viral sequences based on multidimensional scaling and linear discriminant analysis
© Kim et al; licensee BioMed Central Ltd. 2010
Received: 15 December 2009
Accepted: 21 August 2010
Published: 21 August 2010
Accurate classification into genotypes is critical in understanding evolution of divergent viruses. Here we report a new approach, MuLDAS, which classifies a query sequence based on the statistical genotype models learned from the known sequences. Thus, MuLDAS utilizes full spectra of well characterized sequences as references, typically of an order of hundreds, in order to estimate the significance of each genotype assignment.
MuLDAS starts by aligning the query sequence to the reference multiple sequence alignment and calculating the subsequent distance matrix among the sequences. They are then mapped to a principal coordinate space by multidimensional scaling, and the coordinates of the reference sequences are used as features in developing linear discriminant models that partition the space by genotype. The genotype of the query is then given as the maximum a posteriori estimate. MuLDAS tests the model confidence by leave-one-out cross-validation and also provides some heuristics for the detection of 'outlier' sequences that fall far outside or in-between genotype clusters. We have tested our method by classifying HIV-1 and HCV nucleotide sequences downloaded from NCBI GenBank, achieving the overall concordance rates of 99.3% and 96.6%, respectively, with the benchmark test dataset retrieved from the respective databases of Los Alamos National Laboratory.
The highly accurate genotype assignment coupled with several measures for evaluating the results makes MuLDAS useful in analyzing the sequences of rapidly evolving viruses such as HIV-1 and HCV. A web-based genotype prediction server is available at http://www.muldas.org/MuLDAS/.
We are observing rapid growth in the number of viral sequences in the public databases: for example, HIV-1 and HCV sequence entries in NCBI GenBank have doubled almost every three years. These viruses also show great genotypic diversity and thus have been classified into groups, so-called genotypes and subtypes [2, 3]. Consequently, classifying these virus strains into genotypes or subtypes based on their sequence similarities has become one of the most basic steps in understanding their evolution and epidemiology and in developing antiviral therapies or vaccines. The conventional classification methods include the following: (1) the nearest neighbour methods that look for the best match of the query to the representatives of each genotype, so-called references; (2) the phylogenetic methods that look for the monophyletic group to which the query branches. Since the genotypes were originally defined as separately clustered groups, these intuitively sound methods have been widely used and have been quite successful in many cases.
However, with increasing numbers of sequences, we are observing outliers that cannot be clearly classified or for which these methods do not agree. A recent report that compared these different automatic methods with HIV-1 sequences showed less than 50% agreement among them except for subtypes B and C. One of the reasons for the disagreement was attributed to the increasing divergence and complexity caused by recombination. It was also noted that closely related subtypes (B and D) or subtypes sharing a common origin (A and CRF01_AE) showed poor concordance rates among those methods. We think that the root of this problem is that the number of reference sequences per subtype was too small; these methods have used two to four reference sequences. Having been carefully chosen by experts from among the high-quality whole-genome sequences, the references are meant to cover the diversity of each subtype as much as possible. However, with intrinsically small numbers of references per subtype, these methods cannot address the confidence of subtype predictions; a low E-value of a pairwise alignment or a high bootstrap value of a phylogenetic tree indicates the reliability of the unit operation, but does not necessarily guarantee a confident classification.
Recognition of this lack of a statistical confidence measure brought about the introduction of probabilistic methods based on either a position-specific scoring matrix or jumping profile Hidden Markov Models (jpHMM) [9–11] built from the multiple sequence alignment (MSA) of each genotype. By using full spectra of reference sequences, jpHMM was effective in detecting recombination breakpoints. Recently, a new classification method based on nucleotide composition strings has been introduced. It is unique in that it bypasses the multiple sequence alignment and still achieves high accuracy. However, it uses only 42 reference sequences and has been tested with 1,156 sequences. Considering the explosive increase in the numbers of these viral sequences, the test sets of these conventional methods were rather small, on the order of ten thousand sequences at most. It would be desirable to measure the performance of a new classification method over all the sequences publicly available.
Here we present a new method, MuLDAS, which develops the background classification models based on the distances among the reference sequences, re-evaluates their validity for each query, and reports the statistical significance of the genotype assignment in terms of posterior probabilities. As such, it is suited for cases where many reference sequences are available. MuLDAS achieves these goals by combining principal coordinate analysis (PCoA) with linear discriminant analysis (LDA), both of which are well established statistical tools in popular use in the biological sciences. PCoA, also known as classical multidimensional scaling (MDS), maps the sequences to a high-dimensional principal coordinate space, while trying to preserve the distance relationships among them as much as possible. It has been widely applied to the discovery of global trends in a sequence set, complementing tree-based methods in phylogenetic analysis [17, 18]. Since genotypes have been defined as distinct monophyletic groups in a phylogenetic tree, each genotype should form a well separated cluster in an MDS space if an appropriately high dimension is chosen. In such cases, we can find a set of hyperplanes that separate these clusters and classify a query relative to the hyperplanes. For this purpose, MuLDAS applies LDA, a straightforward and powerful classification method, to the MDS coordinates and assigns a query to the genotype that shows the highest posterior probability of membership. This probability can be useful in detecting ambiguous cases, for which careful examination is required. MuLDAS tests the LDA models through leave-one-out cross-validation (LOOCV), which can be used to assess the model validity by examining the misclassification rate. As the sequences are represented by coordinates, a simple measure can also be developed for detecting genotype outliers.
We have tested the algorithm with virtually all the HIV-1 and HCV sequences available from NCBI GenBank and the results are presented.
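The PCoA step described above can be illustrated with a minimal sketch. It assumes a small symmetric distance matrix is already available and uses plain power iteration with deflation as a stand-in for a proper eigen-solver; the function names `double_center` and `pcoa` are hypothetical and are not taken from the MuLDAS implementation.

```python
import math

def double_center(dist):
    """Gower double-centring: B = -1/2 * J * D^2 * J."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
             for j in range(n)] for i in range(n)]

def pcoa(dist, k, iters=500):
    """Return k principal coordinates per item (power iteration + deflation)."""
    n = len(dist)
    work = double_center(dist)
    coords = [[0.0] * k for _ in range(n)]
    for c in range(k):
        # Arbitrary deterministic start vector (an illustrative choice).
        vec = [math.sin(i + 1.0 + c) for i in range(n)]
        for _ in range(iters):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:  # nothing left to explain
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        lam = sum(vec[i] * work[i][j] * vec[j]
                  for i in range(n) for j in range(n))
        scale = math.sqrt(max(lam, 0.0))  # negative eigenvalues are dropped
        for i in range(n):
            coords[i][c] = scale * vec[i]
        # Deflate so the next pass finds the next principal coordinate.
        for i in range(n):
            for j in range(n):
                work[i][j] -= lam * vec[i] * vec[j]
    return coords
```

When the input distances are Euclidean and k is large enough, the pairwise distances among the returned coordinates reproduce the input distances, which is the property that lets the coordinates serve as classification features.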
The last step of MuLDAS is to develop the discriminant models that best classify the references according to their genotypes and to assign the genotype membership to the query according to the models. Here one can envisage applying various classification methods such as K-Nearest Neighbour (K-NN), Support Vector Machine (SVM), and linear classifiers, among others. If the references are well clustered according to their genotype membership, then the simplest methods such as linear discriminant analysis (LDA) or quadratic discriminant analysis (QDA) should work. Both of them work by fitting a Gaussian distribution function to each group centre; the difference between them is whether the global (LDA) or the group (QDA) covariance is used. Since it can be expected that the within-group divergences may differ from one group to another, QDA may be better suited. However, the sample size imbalance issue mentioned above prevents applying QDA, as it becomes unstable with a small number of references for some genotypes. On the other hand, LDA applies the global covariance commonly to all the genotypes and thus may be more robust to this issue. Although it is not as rigorous as QDA, this heuristic approach works reasonably well as long as the group divergences do not differ too much from one another. Once the linear discriminants are calculated based on the reference sequences, the posterior probability of belonging to a particular group is given as a function of the so-called Mahalanobis distance from the query to the group centre. The query is then assigned the maximum a posteriori (MAP) estimate, that is, the genotype having the maximum posterior probability. The posterior probability is scaled by the prior, which is proportional to the number of references for each genotype. This step is implemented with the lda function of the MASS package in the R statistical system (http://www.r-project.org/).
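As an illustration of the posterior calculation described above, the following is a minimal sketch of LDA with a pooled covariance and priors proportional to group sizes. It is not the R lda routine itself; the function names (`fit_lda`, `solve`, `posteriors`) are hypothetical, and the pooled covariance is assumed to be non-singular.

```python
import math
from collections import defaultdict

def fit_lda(coords, labels):
    """Estimate group centres, pooled covariance, and count-based priors."""
    groups = defaultdict(list)
    for x, g in zip(coords, labels):
        groups[g].append(x)
    d = len(coords[0])
    means = {g: [sum(p[k] for p in pts) / len(pts) for k in range(d)]
             for g, pts in groups.items()}
    pooled = [[0.0] * d for _ in range(d)]
    for g, pts in groups.items():
        for p in pts:
            diff = [p[k] - means[g][k] for k in range(d)]
            for a in range(d):
                for b in range(d):
                    pooled[a][b] += diff[a] * diff[b]
    dof = len(coords) - len(groups)  # within-group degrees of freedom
    pooled = [[v / dof for v in row] for row in pooled]
    priors = {g: len(pts) / len(coords) for g, pts in groups.items()}
    return means, pooled, priors

def solve(mat, rhs):
    """Solve mat @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [mat[i][:] + [rhs[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def posteriors(query, means, pooled, priors):
    """Posterior P(group | query) from squared Mahalanobis distances and priors."""
    scores = {}
    for g, mu in means.items():
        diff = [q - m for q, m in zip(query, mu)]
        maha2 = sum(a * b for a, b in zip(diff, solve(pooled, diff)))
        scores[g] = math.log(priors[g]) - 0.5 * maha2
    top = max(scores.values())  # subtract the maximum for numerical stability
    weights = {g: math.exp(s - top) for g, s in scores.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}
```

The MAP genotype is then simply the key with the largest posterior, for example `max(post, key=post.get)`.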
Cross-Validation of the Prediction Models
The validity of the linear discriminant models is assessed by LOOCV of the genotype membership of the reference sequences. For each of the references, its genotype is predicted by the models generated from the rest of the references. The misclassification error rate, which is the ratio of the number of misclassified references to the total number of references that participated in the validation, is a sensitive measure of the background classification power. Many viral sequences in the public databases are not of the whole genome but cover only a few genes or a part of a gene, and thus their phylogenetic signal may be variable. Consequently, we re-evaluate the classification power of each prediction using LOOCV. If the reference sequences are not well resolved in the MDS space for a given query, it would be evident in LOOCV, resulting in a high misclassification rate.
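The LOOCV procedure can be sketched as follows. For brevity the sketch plugs in a nearest-centroid classifier as a stand-in for the LDA models; this substitution, and the names `nearest_centroid` and `loocv_error_rate`, are assumptions made for illustration only.

```python
import math
from collections import defaultdict

def nearest_centroid(train_x, train_y, query):
    """Stand-in classifier: assign the query to the closest group centre."""
    groups = defaultdict(list)
    for x, g in zip(train_x, train_y):
        groups[g].append(x)
    d = len(query)
    centres = {g: [sum(p[k] for p in pts) / len(pts) for k in range(d)]
               for g, pts in groups.items()}
    return min(centres, key=lambda g: math.dist(query, centres[g]))

def loocv_error_rate(coords, labels, classify=nearest_centroid):
    """Fraction of references misclassified when each is left out in turn."""
    errors = 0
    for i in range(len(coords)):
        rest_x = coords[:i] + coords[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        if classify(rest_x, rest_y, coords[i]) != labels[i]:
            errors += 1
    return errors / len(coords)
```

Any classifier with the same three-argument signature can be passed as `classify`, so the same loop applies to the LDA models.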
where X_Q, X_R, and X_C are the MDS vectors of the query, one of the references, and the centre of the reference group, S, respectively. The group, S, contains all the reference sequences belonging to the genotype to which the query has been classified. If O is smaller than 1.0, the query is well inside the cluster, and outside otherwise. We can develop a simple heuristic filter based on this: for example, a threshold can be set at 2.0 in order to tolerate some divergence. A similar measure, the branching index, has been devised for tree-based methods to detect outlier sequences by measuring the relative distance from the node of the query to the most recent common ancestor (MRCA) of the genotype cluster [14, 15]. See Supplementary Note 1 in Additional File 2 for the comparison of 'outlierness' with the branching index. If a truly new genotype is emerging, MuLDAS may classify such sequences into one of the genotypes (a nominal genotype). Their posterior probabilities may be very high but the 'outlierness' values from the nominal group would also be very high. We simulated such a situation by leaving all the reference sequences of a given genotype out and classifying them based on the reference sequences from the other groups only. Indeed, O values were consistently large. See Supplementary Note 2 in Additional File 2 for details.
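The defining equation for O is not reproduced in the text above, so the sketch below assumes one plausible form that is consistent with the surrounding description: the Euclidean distance from the query to the group centre, divided by the largest distance from any reference in the group to that centre. Both the formula and the name `outlierness` are assumptions, not the confirmed definition.

```python
import math

def outlierness(query, group_refs):
    """Assumed form: |X_Q - X_C| / max over R in S of |X_R - X_C|."""
    d = len(query)
    # Centre of the reference group S in MDS space.
    centre = [sum(r[k] for r in group_refs) / len(group_refs) for k in range(d)]
    # Radius of the cluster: the farthest reference from the centre.
    radius = max(math.dist(r, centre) for r in group_refs)
    return math.dist(query, centre) / radius
```

Under this assumed form, a value below 1.0 places the query no farther from the centre than the most distant reference, which matches the interpretation given in the text.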
Nested Analysis for Recombinant Detection
There are a number of methods for characterizing recombinant viral strains. Similar to the tree-based bootscanning method, MuLDAS can be run along the sequence in sliding windows to locate the recombination spot. However, this is applicable to long sequences only and, for a tool such as MuLDAS that relies on large sample sizes, it takes too much time to be served practically through the web unless a cluster farm having several hundred CPUs is employed. Rather than attempting to detect de novo recombinant forms by performing sliding-window runs, we classify the query to the well defined common recombinant forms by the following approaches: (a) predicting genotypes gene by gene for a query that encompasses more than one gene; (b) re-iterating the analysis in a 'nested' fashion that includes recombinant reference sequences. HIV-1 and HCV each contain on the order of 10 genes, and thus gene-by-gene analysis of a whole genome sequence may take 10 times longer than a single gene analysis. If different genotypes are assigned with high confidence to different gene segments of a query, it may hint at a recombinant case. For some recombinants, the breakpoint may occur in the middle of a gene. In such cases, it is likely that the posterior probability of classification is not dominated by just one genotype, but that the second or so would have a non-negligible P value. We re-iterate the prediction process in a 'nested' fashion by focusing on the genotypes having a P value greater than 0.01 and the associated common recombinant genotypes. For example, the references in the 'nested' round of HIV-1 classification would include the CRF02_AG group if the P value of either the A or the G group were greater than 0.01. We have implemented this procedure for classifying HIV-1 sequences, for which some common recombinant groups known as circulating recombinant forms (CRFs) have been described. Although recombinant forms have been known for HCV, no formal definitions of common forms are available at the moment.
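The selection of reference groups for the 'nested' round can be sketched as below. The mapping from each CRF to its parental subtypes is supplied by the caller; the function name `nested_reference_groups` and the example mapping are hypothetical and only illustrate the rule stated above.

```python
def nested_reference_groups(posteriors, crf_parents, cutoff=0.01):
    """Groups to include in the 'nested' round.

    posteriors  -- {genotype: posterior P} from the 'major' analysis
    crf_parents -- {CRF name: parental subtypes}, an illustrative mapping
    """
    # Keep every genotype whose posterior exceeds the cutoff.
    kept = {g for g, p in posteriors.items() if p > cutoff}
    # Add each CRF that has at least one parental subtype among those kept.
    crfs = {crf for crf, parents in crf_parents.items() if kept & set(parents)}
    return kept | crfs
```

For example, with posteriors of 0.97 for A and 0.02 for G, the pool would contain A, G, and CRF02_AG, and the classification model would be rebuilt from the references of those groups only.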
One may argue in favour of an alternative approach in which the reference CRF sequences are included in the MSA of the major group sequences and the classification is done in a single operation. However, when multidimensional scaling maps both divergent and close sequences to the same space, the latter are not well resolved. As CRF sequences are often clustered near their ostensible non-recombinant forms, they are not resolved if they are included in the MSA with all the other major group sequences.
Web Server Development
Apache web servers that accept a nucleotide sequence as a query and predict the genotype for each gene segment of the query have been developed, one for each of HIV-1 and HCV. These are freely accessible at http://www.muldas.org/MuLDAS/. Each CGI program, written in Perl, wraps the component programs that have been downloaded from the respective distribution web sites of HMMER, EMBOSS, and R. As the calculation of the distance matrix consumes much of the run time, we split the task among several, typically four, computational nodes, each of which calculates a subset of the rows in parallel, and the results are integrated by the master node. A typical subtype prediction of a 1000-bp HIV-1 nucleotide sequence takes around 20 seconds on an Intel Xeon CPU Linux box. The web servers report the MAP genotype of the query as well as the posterior P for each genotype, the leave-one-out cross-validation result of the prediction models, and the outlier detection result (see Supplementary Figure 1 in Additional File 3 for screenshots). The 3D plot of the query and the references in the top three PCs is given in PNG format, and an XML file describing all the PCs of the query and the references can be downloaded for subsequent dynamic interactive visualization with GGobi (http://www.ggobi.org/) (Figure 3). This is particularly useful for visually examining the quality of clustering and for confirming the outlier detection result, which may lead to the discernment of potential new types or recombinants. If the number of reference sequences for a particular genotype is small, the classification by MuLDAS would be suboptimal. In such cases, interactive visualization of the clustering pattern using GGobi may also be useful. For HIV-1, the 'nested' analysis as described above is re-iterated and the result is reported as well.
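The row-wise splitting of the distance matrix can be sketched as follows. Worker threads stand in for the computational nodes, and a simple p-distance over aligned sequences stands in for the actual distance measure; both are assumptions made for illustration, as are the function names.

```python
from concurrent.futures import ThreadPoolExecutor

def p_distance(a, b):
    """Proportion of differing sites among aligned, gap-free positions."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs) if pairs else 0.0

def rows_chunk(seqs, row_indices):
    """Compute the full rows of the distance matrix for the given indices."""
    return {i: [p_distance(seqs[i], s) for s in seqs] for i in row_indices}

def distance_matrix(seqs, n_workers=4):
    """Split rows across workers, then merge the parts on the 'master'."""
    chunks = [range(w, len(seqs), n_workers) for w in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(lambda idx: rows_chunk(seqs, idx), chunks))
    merged = {}
    for part in parts:
        merged.update(part)
    return [merged[i] for i in range(len(seqs))]
```

Interleaving the row indices, as done here, is one simple way to give each worker a similar share of the rows; a production setup would distribute the chunks to separate machines rather than threads.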
Results and Discussion
The MuLDAS algorithm was tested with the sequence datasets of HIV-1 and HCV downloaded from NCBI GenBank. The genotype information of nucleotide sequences was retrieved from the LANL website for 158,578 HIV-1 (including 6,203 CRFs) and 40,378 HCV sequences (non-recombinants only) that have not been used as the reference sequences. For some of the sequences, the genotypes/subtypes were given by the original submitters and otherwise they were assigned by LANL. We considered these datasets as 'gold standards' for benchmarking the performance of MuLDAS.
Genotype/Subtype nomenclatures of the test datasets
HIV-1 sequences are grouped into M (main), N (non-main), U (unclassified) and O (outgroup) groups. Most of the sequences available belong to the M group. As the N and O groups are quite distant from the M group, the subtypes of the M group cannot be well resolved in an MDS plot that includes these remote groups. Consequently, we focused on classifying M group sequences into subtypes A-D, F-H, J, and K. Among the M group subtypes, A and F are sometimes further split into sub-subtypes, A1 and A2, and F1 and F2, respectively. However, some new sequences were still being reported at the subtype level in the LANL database. This was the case even for the sequences included in the MSA produced by LANL. Resolving sub-subtypes for relatively short sequences using MuLDAS would require a 'nested' analysis using the relevant subtype sequences only. For these reasons, we did not attempt to distinguish sub-subtypes and classified them at the subtype level. Different subtypes of the M group sequences may recombine to form a new strain. If these strains were found in more than three epidemiologically independent patients, they are called circulating recombinant forms (CRFs). Among the CRFs, CRF01_AE was formed by recombination of A and the now extinct E strains, and constitutes a large family that is distinct from subtype A. We have called the M group subtypes and CRF01_AE the 'major' subtypes and the MuLDAS run against them the 'major' analysis. Supplementary Table 3(a) in Additional File 1 lists the breakdown of the statistics by subtypes and gene segments of all the test nucleotide sequences that have been classified to the 'major' groups by LANL. The distribution was far from uniform, reflecting study biases: sequences belonging to subtypes H, J, and K were rare; especially for auxiliary proteins such as vif and vpr, non-B strains were too rare to evaluate the classification accuracy.
HCV sequences are now classified as genotypes 1 through 6 and their subtypes are suffixed by a lower case letter: for example, 1a, 2k, 6h and so on. The multiple sequence alignments downloaded from the LANL website included only a few sequences per subtype that were to be used as references by MuLDAS, making it difficult to apply MuLDAS at the subtype level. Since these genotypes were roughly equidistant from each other, MuLDAS was applied at the genotype level, and all the subtypes from a genotype were lumped together into a group. See Supplementary Table 4(a) in Additional File 1 for the breakdown of HCV nucleotide sequences.
Determination of MDS dimensionality and assessment of model validity
Table 1. Summary statistics of the benchmarking results
Rows: # of test sequences; LOOCV error rate; MAP (% of sequences higher than the cutoff): P ≥ 0.99, P ≥ 0.90, P ≥ 0.50; LANL concordance (% of sequences higher than the cutoff); O ≤ 2.0.
Having demonstrated that the MuLDAS linear models were well validated, we then surveyed the posterior probability of classification: more than 99% of the cases showed the maximum a posteriori probability values of 0.90 or higher, meaning unambiguous calls for most cases (Table 1). The overall concordance rates of the MuLDAS predictions with those retrieved from LANL were 98.9% and 96.7%, respectively for HIV-1 and HCV sequences (Table 1). See the next section for the plausible explanation for the apparently low concordance for HCV.
Table 2. The test results with HIV-1 sequences
Headings: Outlierness < 2.0; No. of reference sequences. Panels: (a) by gene segment; (b) by subtype.
Table 3. The test results with HCV nucleotide sequences
Headings: Outlierness < 2.0; No. of reference sequences. Panels: (a) by gene segment; (b) by genotype.
Assessment of the HIV-1 nested analysis results
Many HIV-1 sequences have been described as circulating recombinant forms (CRFs) by LANL. For a total of 9,000 nucleotide gene segments of 8,612 such sequence entries, subtypes were assigned by MuLDAS by the 'nested' analysis (see Methods). After the 'major' analysis of each gene segment, the subtypes having a posterior probability greater than 0.01 were identified and the corresponding reference sequences were collected into a pool. The CRF references that originated from these subtypes were also added to the pool. The MuLDAS classification model was then built based on the pool of references and was applied to the query sequence. Note that the reference pool was re-collected for each query. A total of 4,994 nucleotide gene segments (derived from 4,949 sequence entries) passed the filtering step (O ≤ 2.0) and had unambiguous calls (posterior probability ≥ 0.99), with an overall accuracy of 94.67% (Supplementary Table 6 in Additional File 1). It should be noted that the number of reference sequences per gene segment or subtype is presently not high for CRFs, and consequently the accuracy reported here should be interpreted carefully. The relatively high accuracy seen with pol sequences (Supplementary Table 6 in Additional File 1) is encouraging in that the genes in this segment are the targets of antiviral therapies, and recent resistance screenings to help guide treatment regimens frequently sequence these genes.
Even with this success, there were still many sequences that failed to pass the filtering steps. As a classification tool, MuLDAS has been developed to assign a subtype from among a set of known subtypes, and is thus not designed to detect a new subtype or recombination pattern. However, MuLDAS may offer some important clues for the analysis of these outlier sequences in terms of the outlierness value and the set of posterior probabilities as well as the complex subtype pattern along the sequence. See Supplementary Note 4 in Additional File 2 for the summary of the test runs of MuLDAS with artificial HIV-1 sequences interwoven with two subtypes. The MuLDAS runs displayed complex subtype patterns that were generally congruent with the subtype composition. For the cases where either the recombination spot or the subtype composition of the query was substantially different from the common CRFs, its performance was suboptimal. This implies that sliding-window analysis by MuLDAS along the sequence is necessary. We plan to develop MuLDAS further to implement such a feature, exploiting cluster farms with several hundred CPUs.
A proposed process for subtype decision
It is evident from the previous sections that one should accept the prediction results if and only if the reported parameters such as posterior probability (P) and outlierness (O) are reasonable. A working proposal for highly confident genotype assignment may be P greater than 0.99 and O less than 2.0. A straightforward application of such criteria to 100,654 HCV nucleotide gene segment sequences achieved a false positive rate around 2.6%, leaving about 13.9% as undecided (data not shown).
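The working proposal above amounts to a simple filter, sketched here under the stated thresholds. The function name `confident_call` and its return convention (None for an undecided query) are hypothetical.

```python
def confident_call(posteriors, outlierness, p_min=0.99, o_max=2.0):
    """Return the MAP genotype if P > p_min and O < o_max, else None."""
    genotype = max(posteriors, key=posteriors.get)  # MAP estimate
    if posteriors[genotype] > p_min and outlierness < o_max:
        return genotype
    return None  # left undecided for manual inspection
```

A query with a dominant posterior but a large O, such as a member of an emerging genotype, would be left undecided by this rule rather than forced into its nominal group.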
Table 4. Accuracy and coverage of each subtype decision step for HIV-1 nucleotide gene segments
Heading: No. of sequences. Rows: Subtypes given by LANL; [Nested analysis] Outlierness < 2.0 & Pval > 0.99 among (1); Correctly classified among (2); Outlierness < 2.0 & Subtype(major) = subtype(nested) among (3); Correctly classified among (4); [Major analysis] Outlierness < 1.0 & Pval > 0.99 among (5); Correctly classified among (6); Subtype assigned (2)+(4)+(6); Correctly classified among (8); Pval < 0.6 among (9); Outlierness > 10.0 among (9).
While an alternative strategy may maximize the prediction coverage at the cost of accuracy, our approach minimizes misclassification and leaves the 'twilight zone' to the users' discretion. The latter included some extreme cases such as those in-between multiple subtypes (P < 0.6) or far outside the nearest cluster (O > 10). These cases constitute about 0.7% of the total HIV-1 nucleotide sequences (Table 4).
Comparison with other methods
We have validated the performance of MuLDAS in genotyping HCV and subtyping HIV-1 sequences against the benchmark test dataset downloaded from the LANL databases. As MuLDAS shows excellent performance, it would be informative to compare it with other automatic genotyping (or subtyping) methods. Most published methods report concordance rates with LANL similar to those of MuLDAS, even though one of the tests showed quite discordant results among those methods. However, their test sets were quite limited, not as full scale as those of MuLDAS. It should also be emphasized that all those methods are based on well established core algorithms in the fields of sequence alignment or phylogenetics. As such, appropriate implementations of those methods should work well for the classification of the query, as long as it is well clustered with only one of the genotypes (or subtypes). Therefore, it would be more informative to understand how these methods differ in dealing with a problematic query sequence that is either divergent or recombinant. As there are no such test panels publicly available, we have devised our own panels: one panel of genome sequences (length > 9000 bp) for each of HCV and HIV-1. We downloaded 1,218 and 1,131 such genome sequences from GenBank for HCV and HIV-1, respectively. From LANL, the genotypes were retrieved for 1,116 and 1,086 of them, respectively. MuLDAS was run in the gene-by-gene mode. If all the gene segments of a genome sequence are 'confidently' genotyped by MuLDAS (O < 2.0 and P > 0.99) and agree with LANL, we count it as a concordant case, otherwise as discordant.
Table 5. Concordance between LANL and MuLDAS in genotypes for the genome sequences longer than 9,000 bp
(1) Genome sequences downloaded
(2) LANL genotypes known in (1)
(3) All confidently genotyped gene segments concordant with LANL genotypes
(4) Some confidently genotyped segments discordant to LANL genotypes
(5) Accuracy [(3)/(2)]%
(6) Recombination pattern 'inferable' from MuLDAS results among (4)
(7) Including 'partial' successes [(3)+(6)]
For HIV-1, based on the strict criteria mentioned above, 938 out of 1,086 were concordant with LANL, leaving 148 cases as discordant (86.3% accuracy). Since we classify an HIV-1 sequence into M group subtypes or CRFs (01 to 16), any sequences that do not belong to these groups are bound to be discordant in this analysis. Indeed, all but seven of the 148 discordant cases were complex recombinants. The genotype compositions predicted by MuLDAS for 103 such cases were congruent with their recombination patterns designated by LANL. For example, the LANL genotype for the sequence entry EU220698 was 'AC', a non-CRF recombination of 'A' and 'C'. Among its nine segments, six were assigned 'C' and three were assigned 'A' by MuLDAS. Including such 'partial success' cases, the success rate goes up to 95.9% (Table 5).
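The two rates quoted above follow directly from the counts in the text, as the short calculation below shows. Only the counts 1,086, 938, and 103 come from the source; the function name `rate` is a hypothetical helper.

```python
def rate(successes, total):
    """Percentage of successes among the total."""
    return 100.0 * successes / total

genomes_with_lanl_genotype = 1086   # HIV-1 genomes with a LANL genotype
strictly_concordant = 938           # all segments confident and concordant
inferable_recombinants = 103        # composition congruent with LANL pattern

strict_rate = rate(strictly_concordant, genomes_with_lanl_genotype)
partial_rate = rate(strictly_concordant + inferable_recombinants,
                    genomes_with_lanl_genotype)
```

Here `strict_rate` corresponds to the accuracy under the strict criteria and `partial_rate` to the rate that includes the 'partial success' cases.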
We validated the results for the 148 discordant HIV-1 cases with independent runs of both the NCBI Genotyping Tool and REGA. The NCBI Genotyping Tool offers an option to choose one of various reference sets. Since some of the genome sequences used in this test are included in more recent reference sets, the NCBI Genotyping Tool would immediately recognize them with perfect matches. For a fair comparison, we used the so-called "2005 pure and CRFs" set as the reference set in the test run. Since the NCBI Genotyping Tool does not summarize the sliding window result into a single genotype, we devised our own scheme as follows: for each window the best scoring genotype is reported; among them, infrequent ones (<10%) are trimmed. When multiple genotypes are finally reported for a genome sequence, the genotype composition is compared with that of LANL. If they are congruent, we label the case 'inferable' (73 cases). This scheme was successfully validated with the 938 sequences for which MuLDAS showed concordance with LANL (sequence set (3) for HIV-1 in Table 5). On the other hand, REGA summarizes the sliding window result and reports a single genotype. However, among the 148 sequences, REGA failed to report summarized genotypes for 107 sequences due to poor bootscanning support values (we label them "Failed QC"). Thus we focused on the comparison with the NCBI Genotyping Tool, which showed 97 and 102 cases (including 'inferable') that were concordant with LANL and MuLDAS, respectively. There were 79 cases on which all three agreed. The results are summarized in Supplementary Table 7 in Additional File 1, while the full listings of the test results are found in Additional File 5.
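The summarization scheme for sliding-window results can be sketched as follows, assuming the per-window best-scoring genotypes have already been extracted into a list. The function names and the exact congruence test (set equality of the genotype compositions) are assumptions made for illustration.

```python
from collections import Counter

def summarize_windows(window_calls, min_fraction=0.10):
    """Keep genotypes called in at least min_fraction of the windows."""
    counts = Counter(window_calls)
    total = len(window_calls)
    return {g for g, c in counts.items() if c / total >= min_fraction}

def is_inferable(window_calls, lanl_components):
    """True if the trimmed genotype composition matches LANL's composition."""
    return summarize_windows(window_calls) == set(lanl_components)
```

For a genome whose windows are mostly 'A' and 'C' with a single stray 'B' call below the 10% threshold, the summary is {A, C}, which would be congruent with a LANL designation of 'AC'.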
Here we have demonstrated that MuLDAS is a novel approach useful for classifying viral sequences based on a large sample population of reference sequences. As it reports several confidence measures, it is a particularly powerful tool for detecting unusual, problematic sequences that often slip through unnoticed. The explosive growth in the number of viral sequences, coupled with their complex divergence, demands classification tools such as MuLDAS. It has been a while since the previous methods were developed, and their performance has not been comprehensively re-evaluated with the sequences that have emerged since then. MuLDAS achieved remarkable accuracy in the tests that included all HIV-1 or all HCV sequences currently available. As the core of MuLDAS is MDS of the distance matrix followed by LDA, it is conceivable that other classification algorithms such as K-NN or SVM could be applied in place of LDA. However, they may not be appropriate as they focus either on a few nearest neighbours (K-NN) or solely on the decision boundary without taking the population distribution into consideration (SVM). In addition, K-NN may also suffer from the issue of sample size imbalance. The MuLDAS algorithm is straightforward enough to be applied to the classification of either nucleotide or peptide sequences. It can even be extended to classify individual subjects into population groups based on a distance matrix of polymorphic markers such as SNPs. To sum up, the approach taken by MuLDAS has far-reaching implications for sequence classification.
Note added in proof
The pre-computed genotype/subtype information is accessible through LinkOut service from NCBI.
We are grateful to Prof. Julian Lee and Kyu-Baek Hwang at Soongsil University, and Dr. Sang Chul Kim at Yonsei University for helpful discussions. We also thank Dr. Joo Shil Lee and colleagues at Korea National Institute of Health for helpful discussion and encouragement. Generous allocation of computer clusters for the benchmark tests and a web server system by Korea Bioinformation Center (KOBIC), Taejon, Korea, is greatly appreciated. This work has been supported by a grant from the Korea Science and Engineering Foundation (KOSEF) (R11-2008-062-03003-0) funded by the Korea government (MEST).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.