Data-intensive analysis of HIV mutations

Background In this study, clustering was performed using a bitmap representation of HIV reverse transcriptase and protease sequences, to produce an unsupervised classification of HIV sequences. The classification will aid our understanding of the interactions between mutations and drug resistance. 10,229 HIV genomic sequences from the protease and reverse transcriptase regions of the pol gene and antiretroviral resistant related mutations represented in an 82-dimensional binary vector space were analyzed. Results A new cluster representation was proposed using an image inspired by microarray data, such that the rows in the image represented the protein sequences from the genotype data and the columns represented presence or absence of mutations in each protein position.The visualization of the clusters showed that some mutations frequently occur together and are probably related to an epistatic phenomenon. Conclusion We described a methodology based on the application of a pattern recognition algorithm using binary data to suggest clusters of mutations that can easily be discriminated by cluster viewing schemes.


Background
The human immunodeficiency virus (HIV) shows extensive genetic variability that helps the selection of drug resistance mutations in response to antiretroviral therapy. Hence, it is important to understand the relationship between HIV genotype and phenotype (i.e., drug resistance) to increase the probability of treatment success.
To infer antiretroviral resistance, look-up tables [1,2] and rule-based systems [3,4] were developed by different groups to infer phenotypic resistance based on HIV genomic sequences from infected patients that failed on antiretroviral therapy. In Brazil, a look-up table [2] was developed and used by the Brazilian Ministry of Health AIDS program to help the decision-making process for antiretroviral salvage therapy (http://algoritmo.aids. gov.br/).
In Brazil, patients who fail on antiretroviral therapy receive genotype tests for antiretroviral resistance throughout a network of laboratories [5]. This collection of HIV genomic sequences represents the variability of the HIV population in this country. With this extensive amount of data, questions arise as to whether it is possible to classify the sequences, based on the occurrences of resistance-related mutations in the different amino acid positions, and whether it is possible to achieve a classification that can express current knowledge of the relationship between mutations and drug resistance.
One possible way to answer these questions is to apply clustering algorithms on reverse transcriptase and protease sequences, to obtain clusters containing sequences that are similar. This similarity among the sequences may reveal some of the relationships among the mutations related to antiretroviral resistance.
Nonetheless, extraction of a simple and compact representation of the dataset is complex because of the number and size of sequences. The clusters thus generated may provide a representation that contributes to the understanding of the classification and the relationships between mutations.
In the present study, a pipeline (see Figure 1) was introduced to represent clusters inspired by microarray data, in which extensive amounts of data are available. Microarray data were used as inspiration because such applications typically contain large volumes of information on gene patterns from thousands of genes at once. Thus, clusters were represented in an image corresponding to a matrix, such that the rows in the image represented each protein sequence and the columns indicated the presence or absence of resistance-related mutations. This image enabled us to summarize the dataset without losing any information about clustering, permitting the observation of important characteristics of each cluster and enabling cluster comparison, thus providing insights into the data.
Previous studies have attempted to identify common protease and reverse transcriptase mutation patterns [6][7][8][9][10][11][12][13][14][15] (as shown in Tables 1, 2 and 3). However, many previous works search only for pairs of mutations, not being able to find larger mutation patterns, which are known to exist [11,[16][17][18][19][20][21]. Furthermore, frequently, only subtype B virus sequences are used, and mutations occur with different probabilities in the different subtypes [22]. Also, in some of the previous works a small number of protein positions are used. Consequently, not all mutation patterns in the data are found and it is more difficult to compare results. Finally, small datasets used in some of the related works do not represent all of the virus population variability, also missing mutation patterns. Therefore, there is no clear consensus on which are the important mutation patterns that arise in the protein sequences.
Such studies on mutation patterns are important because the co-existence of mutations may result in different antiretroviral resistance profiles. For example, a mutation can restore the fitness decrease from another mutation that confers drug resistance. However, some of the previous studies only investigated pairs of mutations, and most of them only analyzed subtype B HIV-1 sequences. Moreover, previous studies analyzed specific mutation profiles, making it difficult to compare results between different studies. Thus, mutation patterns have not been fully characterized in the protease and reverse transcriptase sequences. Characterization of these patterns may lead to a better understanding of the interactions among these mutations and to classification of the sequences.
In the present study, a large number of codons (38 from reverse transcriptase and 44 from protease, as shown in Table 4) from subtypes B, C and F were clustered, and the sequences were classified according to the mutation patterns. These clusters were compared with clusters reported in other studies.

Look up tables and rule-based systems
Based on genotype-phenotype correlation studies on laboratory HIV-1 isolates, genotype-phenotype correlations on clinical isolates and genotype-treatment history correlations [25], some efforts have been made to try to understand the relationship between HIV genotype and phenotype. For example, look-up tables [1,2,26] have been compiled using information from the scientific literature,           which has been turned into rules in which the occurrences of mutations, or combinations of mutations, are correlated with drug resistance. In addition to look-up tables, some rule-based systems [3,4] have created scoring systems to calculate the likelihood of therapy failure, which are also based on published data. Look-up tables and rule-based systems are efforts to correlate the set of known mutations with the potential for drug resistance. Both represent current knowledge concerning the relationships between virus genotype and drug resistance and its application. Look-up tables and rule-based systems group mutations into clusters of mutations, thereby predicting the possible result of drug treatment.

Clustering
Similar to the classifications retrieved from look-up tables and rule-based systems, pattern recognition methods are designed to extract information from data to classify them. In cases where little prior information is available and the decision-maker must make as few assumptions as possible about the data, the clustering technique is useful [27]. By applying clustering algorithms to reverse transcriptase and protease sequences, clusters containing sequences that are similar to each other are obtained. The clusters may contain sequences with similar drug response patterns. Applying clustering algorithms, and comparing the clusters with the classifications from lookup tables will achieve a better understanding of the relationship between genotype and phenotype.
In addition to providing comparisons with look-up tables, clusters also allow hypothesizes regarding the occurrences of mutations to be formed. Therefore, such analysis can show which mutations have higher probability of occurring together and those that may influence each other.
One of the best-known algorithms for clustering is Kmeans, which is popular because the time complexity is O(n), where n is the number of patterns [27]. The time complexity makes this a good choice when dealing with a large volume of data, which was the case here. Figure 1 summarizes the methodology used in this work to analyze the protease and reverse transcriptase sequences. First, HIV genomic sequences from patients from 27 Brazilian states were extracted from the national database and binarized according to the presence or absence of mutations. The sequences were clustered and an image was created to represent the clusters. The clusters were characterized given the occurrence of mutations and compared with the prediction of drug resistance from the Brazilian look-up table.

Sequence representation
In the present study, 10,229 reverse transcriptase and protease sequences from HIV subtype B, 801 from subtype F and 424 from subtype C, were obtained from the national database. These samples were taken in accordance with the ethics standards of the Ethics Committee of the Federal University of São Paulo and with the Helsinki Declaration of 1975, revised in 1983. All biological samples were obtained in full accordance with signed informed consent forms (process number in research ethics committee 1433/09).
The Brazilian Guidelines for Resistance Testing allowed only one genotype testing for each patient at the time the sequences were generated; therefore, duplication of the sequences from the same patient was not expected.
To simplify the representation and comparison of the reverse transcriptase andprotease sequences, bitmap mapping was used. In this technique, if a sequence hadthe same amino acid as the wild-type sequence, it was replaced with the value zeroand when the sequence had a different amino acid, it was replaced by the value1, as previously described (Reuman et al. [8] and Melikian et al. [28]). Thus, the sequences could be interpreted as binary vectors in and 99 dimensional spaces (amino acids from reverse transcriptase and 99 from protease).
When working with patterns of high dimensionality, the "curse of high dimensionality" must be avoided. The "curse of high dimensionality" makes all distances look alike in high dimensional spaces [29] and makes it difficult to evaluate similarity. One way to avoid it is to decrease the dimensionality of the data.
To escape the "curse of high dimensionality", 38 positions from reverse transcriptase and 44 positions from protease 4 known to be related to drug resistance were analyzed [2,25].

K-means
In an attempt to classify reverse transcriptase and protease sequences using a pattern recognition algorithm, we applied K-means from the R Project for Statistical Computing [30] to the 10,229 sequences. Sequences were divided according to HIV subtype and genomic region. Thus, K-means was used to search for clusters in the protease and reverse transcriptase sequences from subtypes B, C and F, separately. The algorithm was repeated 10 times for each of the datasets, with random centroids. The value of k, i.e. the number of clusters to be retrieved, ranged from 2 to 16.

Cluster characterization
One problem that arose from generating the clusters was how to view and interpret them in the domain of HIV mutations, which was caused by the large number of sequences and amino acid positions used in our analysis. Images can be used to solve this problem because they provide an intuitive information visualization tool to support and validate the results, and to formulate and test hypotheses. When the research entails data-intensive analysis, the use of images becomes even more important, because the volume of data makes it difficult to manipulate and visualize the data directly. Thus, images can help in the analysis process and can summarize the data and results.
Therefore, to analyze the clusters, observe whether they followed any mutation patterns and to determine what these patterns might be, images of the clusters were created inspired by microarray data visualization. Binary images (i.e. black and white) represented the binary sequences featured as rows and the amino acid positions as columns: 44 columns for protease and 38 columns for reverse transcriptase. The sequences were grouped according to clusters and separated by blue lines. When a sequence had the value of 1 in an amino acid position, it would be represented by a black pixel, and when it had a value of 0, it would be represented by a white pixel. Six images were created for each value of k, combining the proteins and subtypes.
The black and white pixels were useful for distinguishing the clusters, accentuating differences among them and describing them, as well as for summarizing the information within the sequences and clusters. They also helped to view the amino acid positions that represented and characterized the clusters.
To provide more details about the clusters, histograms were plotted for each cluster, for protease and reverse transcriptase, showing the percentage of sequences in the cluster with mutations in each position. Each bar in the histogram represented an amino acid position and the percentage of sequences in the cluster with a mutation at that position.
To compare the clusters with the look-up table used to interpret the genotypic resistance from the Brazilian algorithm for resistance interpretation, another image was generated. The HIVDAG software [31] was used to create this other image. HIVDAG interprets the rules in the Brazilian look-up table in the context of the sequences and produces a prediction regarding antiretroviral resistance. The software classifies the sequences as resistant (R), intermediate resistant (I) and susceptible (S).
To represent the three possible results, red, yellow and green were used for resistance, intermediate resistance and susceptibility, respectively. Thus, as in the binary figure, the rows featured the protein sequences and the columns were the predictions for drug resistance given by the look-up table for that sequence.
In these colored images, vertical lines presenting a dominant color in each cluster indicated that the sequences in that cluster have the same drug resistance prediction. Clusters that showed red, yellow or green vertical lines in different positions indicated that there was some correspondence between the prediction of the look-up table and the K-means clusters.

Results and discussion
For distinct k values, the sequences were distributed in different clusters; black and white images were created for each combination of subtype, k value and protein. Figures 2 and 3 represent the clusters for subtype B, where k = 6 for protease and reverse transcriptase, respectively. The value of k=6 was chosen because it represents better the current knowledge of mutation occurrence and mutation relationships. For k = 6, both TAM groups and the mutation profile comprising substitutions on protease codons 30 and 88 are represented. Nonetheless, as k values progressed, the clusters were first divided into groups of sequences with many mutations and with few or no mutations. For each increase in the k value, the group with many mutations was repeatedly split, although stability and consistency were maintained.
K-medoids have been used in a previous study [14] for clustering a smaller number of subtype B sequences. In order to evaluate this alternative clustering method, it has been applied to the dataset here described. The K-medoids implementation available at [32] has been    As it can be seen, the results are similar to those shown in Figures 2 and 3, except for clusters B6.4, B6.5 and B6.1 from protease and clusters B6.2 and B6.5 from reverse transcriptase.They contain sequences that are predicted to be susceptible to most of the drugs and do not represent patterns of mutations. This difference is probably because although both algorithms are related, k-medoids represents clusters by the median of cluster points, instead of the mean [33]. But, except for these differences, both methods lead to similar results, which corrobotate our findings.
To characterize the clusters, the histograms shown in Figures 6 and 7 for subtype B and k = 6, for protease and reverse transcriptase, respectively, were produced. These histograms display the percentage occurrence of mutations at each amino acid position for each cluster. The mutations that had higher percentages defined the clusters and determined which cluster the sequences belonged to. Those that had high frequencies in one cluster and low frequencies in the others enabled differentiation between the sequences and between the clusters. Additionally, the positions with higher frequencies of mutations in a cluster were those that occurred together more frequently, and their occurrences were considered as related.
To compare the clusters with the predictions of drug resistance given by the rules in the Brazilian look-up table, colored images were created. The images from the protease clusters (see Figure 8 at k = 6) showed division of the sequences into groups that were sensitive to the majority of the drugs and other groups that were resistant to the majority of the drugs. However, the reverse transcriptase clusters showed different combinations of predictions for different clusters, with similar predictions for sequences in the same cluster and different predictions for sequences in different clusters (see Figure 9).
As seen in Figures 2 and 3, the clusters had different mutation profiles for the two proteins. K-means successfully distinguished the sequences and grouped them according to the different mutations, indicating that it is possible to obtain a classification for HIV protein sequences using clustering algorithms, according to the occurrences of the mutations.
The different occurrence patterns for the mutations are emphasized in Figures 6 and 7, which show the distinct Figure 6 Histogram showing the frequency of mutations in the protease kmeans clusters. Histograms containing the frequencies of mutations for each selected amino acid position in protease for each of the six clusters in subtype B at k = 6. Each histogram represents one cluster found by K-means for k = 6 in the protease sequences. Each bar in the histogram represents a protein position and the percentage of sequences in the cluster that contain a mutation at that position.

Figure 7
Histogram showing the frequency of mutations in reverse transcriptase kmeans clusters. Histograms containing the frequencies of mutations for each selected amino acid position in the reverse transcriptase for each of the six clusters in subtype B at k = 6. Each histogram represents one cluster found by K-means for k = 6 in the reverse transcriptase sequences. Each bar in the histogram represents a protein position and the percentage of sequences in the cluster that contain a mutation at that position. Additionally, K-means was able to produce clusters that correlated with different predictions of drug resistance, especially for the reverse transcriptase (see Figure 9). The figures show that although clusters were found for both proteins, reverse transcriptase clusters display more patterns of prediction of drug resistance. As protease gene variation is higher than for reverse transcriptase gene in non-treated patients, the pathways for a strain to become resistance are more limited in reverse transcriptase as compared to the protease. Therefore, we believe that the constrains for variation in the reverse transcriptase gene facilitate the detection of the clusters. The results for subtypes C and F are summarized in Tables 5 and 6. Tables 5 and 6 also attempt to summarize the clusters and depict the essential information that is necessary to understand and compare them. In these tables, the amino acid positions of the proteins are presented for positions where more than 50% of the sequences in the cluster had mutations. Tables 5 and 6 show that for the different subtypes, the mutations that characterized some clusters were similar. The clusters from sequences of subtypes B, C and F were similar in terms of the positions in each cluster that had higher frequencies of mutations, excluding positions that occurred more frequently in a given subtype in this data set. For example, positions 15,20,36,41,69,89 and 93 for subtype C in the protease; positions 15, 35, 36, 41 and 89 for subtype F in the protease; and position 211 for subtypes C and F in the reverse transcriptase. Moreover, the Table 5 Reverse transcriptase amino acid positions with mutations in at least 50% of the sequences by kmeans cluster datasets for subtypes C and F were much smaller than the dataset for subtype B and thus might not represent all the variability in the subtypes. Subtype C was more different compared with subtypes B and F; however, there was still correspondence among the codons defining the clusters.

Conclusion
In this work, a new approach to analyzing HIV mutation data was presented. Current classification schemes are based on rule-based systems and look-up tables that comprise data from scientific studies. The proposed framework is based on a bitmap representation that extracts information from protease and reverse transcriptase sequences and provides information on the interactions among mutations.
A new visualization scheme inspired by microarray data analysis was proposed to better understand the clusters in the HIV domains. The images produced were useful for viewing and comparing the clusters with binary vectors and large volumes of data. In our study, the black and white figures indicated the occurrence and absence of mutations in sequences in each cluster, respectively, thus highlighting the differences between the clusters.
To represent the genetic variability of the virus in a different way from previous works, a large number of sequences and protein positions were used, along with three different HIV-1 subtypes. In the analysis, sequences were clustered, and the clusters were characterized according to the mutation patterns that they represented. The clusters were compared with those clusters revealed by previously published studies, and with the current knowledge of mutation patterns.
Along with the large number of sequences and protein positions, the application of a binary representation for the sequences helped to define a simple measure of similarity. The choice of K-means as the algorithm for mutation pattern searching rendered the method suitable for larger data sets because of its time complexity. The use of the binary image also allowed the analysis of large data sets, as the information in the data is visualized more easily, as is the characterization of the clusters and the mutation patterns.
K-means obtained clusters with similar sequences representing different mutation profiles, and the clusters showed that some mutations frequently occur together, which are important for defining the clusters and that are present in a large number of the sequences. These positions need to be taken into consideration when inferring drug resistance, because they affect a large number of patients.
Some interesting insights came from this clustering result. Notably, mutations in protease codons only produced clusters among non-B strains. Furthermore, as described previously, mutations at codons 89 and 90 in the protease do not cluster together [34], suggesting that methionine at positions 89 and 90 result in a protein structure that is not stable. Mutations at codons 30 and 90 may be selected by the protease inhibitor nelfinavir, but again, these two pairs of mutations do not appear together. It makes biological sense that once you have a replacement such as D30N, you will need a mutation N88D, because these two amino acids interact with each other in the protease protein [35]. However, it has been suggested that the pathway for resistance to nelfinavir will preferentially select the F30N complex among subtype B and exclusively the L90M complex among non-B subtypes [36]. However, we observed the D30N complex among clusters for subtypes B and F (Table 6). It is also interesting that major protease inhibitor mutations, such as in codons 46, 82 and 90, frequently form clusters (Table 6).
Pathways for resistance mutations are the pathways that viruses select for resistance mutations and this is closely related to cross-resistance. TAM 1 and TAM 2 are well-defined distinct pathways for resistance, but we speculate that these are merely initiating pathways because we observed clusters for the reverse transcriptase with between three and six TAMs, thus augmenting levels of resistance and cross resistance (Table 5). Interestingly, all clusters with resistance mutations show the 3TCrelated mutation at codon 184 in the reverse transcriptase. When there is an antiretroviral treatment failure using non-nucleoside reverse transcriptase analogs, mutation at codon 103 will emerge in more than 50% of cases and 50% of these viruses will also harbor the mutation at codon 184 [37]. However, all clusters harboring 103 mutations will also be accompanied of 184 mutations, suggesting that real life virological failure is somehow different.
One interesting outcome from this cluster representation is their alleged relationships with previous exposure to specific antiretrovirals. In this sense, timing or the number of drug exposures, as well as the use of specific drugs, would suggest a specific selection of a cluster of mutations and imply possible resistance/cross resistance. The negative predictive value of a genotype result is low, meaning that the absence of a specific mutation or group of mutations does not mean that this mutation is not present in a minority population and is not present because of the selective pressure of current antiretrovirals being used. Therefore, the history of antiretroviral exposure and the projected profile of mutations can result in a more reliable future salvage therapy regimen.
Furthermore, protease inhibitors are designed according to the structure of the proteins; therefore, the clusters may help in designing future drugs for resistant strains.
In addition to antiretroviral resistance, understanding the mutation patterns is also useful in collaborative efforts to study of immune escape pathways and vaccine research. However, the HIV mutation patterns can confound the determination of the immune escape mechanisms [38] that are relevant to the vaccine research [39].
Our future work will include further validation of the clusters in the HIV domains and updating the current knowledge concerning mutations. We will also evaluate a recent approach to pattern recognition known as biclustering [40,41] for the protease and reverse transcriptase sequences. Biclustering algorithms seem to fit our purposes because they search for submatrices in the data matrix, following a determined pattern, and have been applied to large data sets, such as microarray data.