 Methodology Article
 Open Access
A high performance prediction of HPV genotypes by Chaos game representation and singular value decomposition
BMC Bioinformatics volume 16, Article number: 71 (2015)
Abstract
Background
Human Papillomavirus (HPV) genotyping is an important approach to fighting cervical cancer because it provides information for risk stratification in diagnosis and a better understanding of the relationship between HPV and carcinogenesis. This paper proposes two new feature extraction techniques, ChaosCentroid and ChaosFrequency, for predicting HPV genotypes associated with the cancer. Twelve diverse HPV genotypes, i.e. types 6, 11, 16, 18, 31, 33, 35, 45, 52, 53, 58, and 66, were studied.
In our proposed techniques, a partitioned Chaos Game Representation (CGR) is deployed to represent HPV genomes. ChaosCentroid captures the structure of sequences through the centroid of each subregion, using the Euclidean distances between the centroids and the center of the CGR as the relations among all subregions. ChaosFrequency extracts the statistical distribution of mono-, di-, or higher-order nucleotides along HPV genomes and forms a matrix of the frequency of dots in each subregion. For performance evaluation, four different types of classifiers, i.e. Multilayer Perceptron, Radial Basis Function, K-Nearest Neighbor, and Fuzzy K-Nearest Neighbor, were deployed, and the best results from each classifier were compared with the NCBI genotyping tool.
Results
The experimental results obtained by the four classifiers follow the same trend. ChaosCentroid gave considerably higher performance than ChaosFrequency when the input length was one, but moderately lower performance when the input length was two. Both proposed techniques yielded almost or exactly the best performance when the input length was more than three. There was no significant difference between the proposed techniques and the comparative alignment method.
Conclusions
Our proposed alignment-free and scale-independent method can successfully transform HPV genomes of 7,000 to 10,000 base pairs into features of 1 to 11 dimensions. This indicates that ChaosCentroid and ChaosFrequency can serve as effective feature extraction techniques for predicting HPV genotypes.
Background
Human Papillomavirus (HPV) is a small double-stranded DNA virus and the most common sexually transmitted virus. At present, more than one hundred types of HPV have been identified. They are differentiated by the genetic sequence of the outer capsid protein L1. Approximately forty types can infect the mucosal epithelium, and these are categorized according to their epidemiologic association with cervical cancer. Infection with low risk HPV types such as types 6 and 11 can cause benign or low-grade cervical cell abnormalities and genital warts. In contrast, high risk HPV types such as 16 and 18 act as carcinogens that can lead to the development of cervical cancer and other anogenital cancers.
Cervical cancer is the second most common cancer in women worldwide and a major cause of morbidity and mortality [1]. Persistent infection by high risk HPV is a necessary cause of this cancer. In particular, the most common high risk HPV types are 16 and 18, and approximately 70% of cervical cancers are due to infection by these genotypes [2]. Each HPV genotype carries a different level of risk for cervical cancer. Furthermore, the genotype distribution varies widely across regions of the world. To better understand the relationship between HPV and carcinogenesis, many countries have investigated HPV infection among women with cytological status using HPV genotyping methods, as reported in Switzerland [3], Italy [4], Cambodia [5], and Romania [6].
HPV genotyping is necessary for managing effective treatment strategies for patients with persistent infection and for evaluating prevention strategies for individual patients to be immunized with type-specific HPV vaccines [7]. Currently, various HPV genotyping tests are used in clinical laboratories to detect the genotypes of Human Papillomavirus, for example PapilloCheck®, PCR-RFLP, HPV genome sequencing, INNO-LiPA, and the Linear Array® HPV Genotyping Test. These methods detect HPV genotypes from specific regions of the genome. Even though these tests are beneficial and are employed for HPV diagnosis in patients today, they have some limitations. For example, HPV genotypes are hard to detect when samples are inadequate or when the amplification signals of some genotypes are low. Contamination with previously amplified material can lead to false positive results. Furthermore, misclassification can occur through cross-reactivity among similar types in hybridization-based tests [8].
To avoid these problems, several computational methods for identifying HPV types have been developed [9-16]. Since determining whether a patient has been infected with a high risk type of Human Papillomavirus is the most important and urgent aspect of diagnosis and treatment, multiple approaches have focused on predicting HPV risk types. For instance, Wang and Xiao [9] combined a large set of physicochemical and statistical features from protein sequences with a fuzzy K-nearest neighbor classifier to predict the risk type of Human Papillomaviruses. They later developed an improved algorithm based on geometric moments of protein distance matrix images, again using a fuzzy K-nearest neighbor classifier [10]. In addition, classification of HPV risk types has been proposed through algorithms based on decision trees [11], text mining [12], genetic mining of DNA sequence structures [13], support vector machines [14], gap-spectrum kernels [15], and ensemble support vector machines with protein secondary structures [16].
While classifying HPV into high and low risk types is the urgent aspect of cancer diagnosis, as claimed by many researchers, the prediction of specific genotypes of the virus has not received much attention. In fact, identifying the HPV genotypes infecting a patient is more essential than a rough classification of HPV risk types, because HPV genotyping provides more information for risk stratification. With persistent infection, the risk of a precancerous lesion is between 10% and 15% for HPV types 16 and 18 but below 3% for all other high risk types combined [2]. Furthermore, cost-effective diagnosis can be achieved by selecting the virus types to be tested based on epidemiological and prevalence studies, given the wide variation in genotype distribution across regions of the world. The diversity of virus types and the incidence of multiple infections have made it necessary to develop reliable methods to identify the different genotypes for epidemiological studies and medical treatment. HPV genotyping can make a great contribution to the following aspects: HPV diagnosis in cases of single and multiple infection, more information for risk stratification, a better understanding of the relationship between HPV and carcinogenesis, and prevention of the cancer through the development of type-specific vaccines. Consequently, HPV genotyping has become an important approach to fighting cervical cancer. For these reasons, this research concentrates on the prediction of HPV genotypes.
Chaos Game Representation (CGR) was proposed by Jeffrey [17] as a unique and scale-independent representation for genomic sequences. It is an iterative mapping technique that assigns each nucleotide in a DNA sequence, or each amino acid in a protein, to a unique coordinate in a 2-dimensional space. It can be viewed as a 2-dimensional image of distributed dots and captured in the form of a 0-1 square matrix, where 1 represents a dot and 0 represents an empty coordinate. The distribution of positions has two properties: uniqueness, and the possibility of inverting a coordinate back to its corresponding nucleotide or amino acid [18]. Using graphic approaches to study biological systems can provide useful intuitive insights, as indicated by many previous studies on a series of important biological topics, such as DNA [19,20], RNA [21], genomes [22-26], proteins [27-35], drug metabolism systems [36], protein-protein interactions [37], and analysis of protein sequence evolution [38]. Moreover, the cellular automaton graph has also been applied to study hepatitis B viral infections and HBV gene missense mutations, as well as to represent complicated biological sequences and help identify various protein attributes [39-41].
Singular value decomposition (SVD) is a matrix factorization technique with various applications. For instance, it can be used to solve underdetermined and overdetermined systems of linear equations, find inverse and pseudoinverse matrices, compute the matrix condition number, and calculate vector system orthogonality and the orthogonal complement [42]. SVD has also been applied to several areas in gene expression and microarray data, such as analysis [43-46], search [47], image compression [48], gene extraction [42], and classification [49,50]. In this paper, we deployed SVD as a tool to reduce the CGR to a smaller set of features without losing any knowledge from the original data. A new feature extraction approach was therefore proposed based on the combination of chaos game representation and singular value decomposition.
Given the significance of HPV genotyping, the objective of this paper is to predict HPV genotypes from their genomes, which is similar to the conventional methods of genome detection in clinical laboratories. The remaining sections of this paper are organized as follows. Section “Methods” describes the methods used in this experiment, including collection of the HPV data set, the proposed feature extraction techniques, the predicting systems, and performance evaluation. Section “Results and discussion” presents the experimental results and discussion. Section “Conclusion” concludes the paper.
Methods
As demonstrated in a series of recent publications [51-58] in response to the call from [59], the following procedures for establishing a really useful statistical predictor for a biological system were involved in our method: (i) construct or select valid benchmark data sets to train and test the predictor; (ii) formulate the biological samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the predicted target; (iii) introduce or develop a powerful predicting algorithm (or engine); (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (v) establish a user-friendly web server for the predictor that is accessible to the public. The detail of each procedure is discussed as follows.
HPV genome data from the selected genotypes were collected, and their features were extracted by our proposed feature extraction techniques, i.e. ChaosCentroid and ChaosFrequency, as inputs for classification. These features were divided into training and testing sets by a 2-fold cross-validation technique. Four different classification models were deployed to train and test the experimental data sets. The prediction performance of the obtained results was then evaluated and compared with other methods. Our proposed method thus consists of four main procedures, i.e. data collection, feature extraction, prediction, and performance evaluation.
Collection of HPV data set
To remove the homologous sequences from the benchmark data sets, a cutoff threshold of 25% was imposed in [60,61] to exclude those proteins from the benchmark data sets that are equal to or greater than 25% of sequence identity to any others in a same subset. However, in this study we did not use such a stringent criterion because the currently available data do not allow us to do so. Otherwise, the numbers of genomes for some subsets would be too few to have statistical significance.
The HPV genotypes collected in this experiment are the important genotypes detectable by the Linear Array® HPV Genotyping Test, a widely used qualitative test developed by Roche Molecular Diagnostics for detecting HPV genotypes associated with cervical cancer. The test can detect 37 high and low risk HPV genotypes, including those considered a significant risk factor for HSIL progression to cervical cancer. To make the prediction task challenging, this experiment concentrated only on HPV genotypes with genome diversity; genotypes among the 37 that had few genomes were excluded. For this reason, only HPV genotypes 6, 11, 16, 18, 31, 33, 35, 45, 52, 53, 58 and 66 were involved. The genomes of these HPV genotypes were collected from the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/). The data set contains Human Papillomavirus genomes of 12 genotypes, including high, possible high, and low risk types. For each HPV genotype, the number of genomes as well as the minimum and maximum lengths are shown in Table 1.
All viral genomes in this HPV data set were previously published and are publicly available in the GenBank or NCBI databases. In addition, the genome names, NCBI accession numbers, and HPV genotypes of all genomes in the HPV data set are cited in Additional file 1, and the HPV data set used in this experiment is available in Additional file 2.
Detail of proposed feature extraction techniques
The following techniques, i.e. ChaosCentroid and ChaosFrequency, were proposed to extract features from the chaos game representation of HPV genomes. To identify each genotype, the relations among subsets of HPV genomes must be clarified. These relations are the local features. Since the CGR captures the information of the whole genome, extracting only global features from the CGR may not be sufficient to distinguish the HPV genotypes. The local features hidden in the various subregions of the CGR need closer consideration. In this work, we therefore concentrate on extracting local features rather than global features. The difference between ChaosCentroid and ChaosFrequency is the feature representation. HPV genomes contain A, C, G, and T nucleotides. Before ChaosCentroid and ChaosFrequency are discussed, the construction of the CGR is described as follows. Let \(x_{i}\) and \(y_{i}\) be the coordinates of nucleotide \(\eta_{i}\) at the \(i^{th}\) position in the nucleotide sequence. Algorithm 1 illustrates how to construct a CGR for a given nucleotide sequence.
A CGR can be viewed as a square whose corners are at coordinates (-1,-1), (-1,1), (1,1), and (1,-1), representing nucleotides A, C, G, and T, respectively. Note that the size of the CGR according to these coordinates is 2×2 units. However, this unit size of the original CGR is not appropriate for discussing our algorithm. Therefore, the geometrical structure and the physical size of our CGR are redefined as follows. The size of the CGR square is set to n×n, where \(n \in \mathbb{R}^{+}\). Its center is still located at the coordinates (0,0). Each corner of this square represents the same nucleotide as in the original CGR. After Algorithm 1, the CGR can be viewed as an image of distributed dots. Figure 1 shows some examples of the CGR of HPV genotypes 6, 16, 18, and 31. The number of dots in a CGR is equal to the number of nucleotides in the given sequence. Although this CGR image can be used directly in the prediction step, the computational time may be too high because of the large number of dots. It is therefore necessary to extract only the relevant features from this set of dots to reduce the computational time complexity of the prediction process. In this paper, we propose two different features as representations of the CGR image. The first feature is called ChaosCentroid and the second is called ChaosFrequency. The detail of each feature is as follows.
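Algorithm 1 itself is not reproduced in this text, so the following is only a minimal sketch of a standard CGR construction on an n×n square centered at (0,0). The corner layout (A bottom-left, C top-left, G top-right, T bottom-right), the starting point at the center, the midpoint update rule, and the names `CORNERS` and `cgr_points` are assumptions made for illustration and may differ from the authors' Algorithm 1.

```python
import numpy as np

# Assumed corner layout (Jeffrey-style) on a unit-scaled square centred at (0, 0).
CORNERS = {"A": (-1.0, -1.0), "C": (-1.0, 1.0), "G": (1.0, 1.0), "T": (1.0, -1.0)}


def cgr_points(sequence, n=2.0):
    """Return an (L, 2) array of CGR dots for an n-by-n square centred at (0, 0)."""
    half = n / 2.0
    x, y = 0.0, 0.0  # assumed starting point: the centre of the square
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # skipping ambiguous symbols such as N is an assumption
        cx, cy = CORNERS[base]
        # Move halfway from the current dot toward the corner of this base.
        x = 0.5 * (x + cx * half)
        y = 0.5 * (y + cy * half)
        points.append((x, y))
    return np.array(points, dtype=float).reshape(-1, 2)
```

For example, `cgr_points("AG")` places the first dot halfway between the center and the A corner, and the second dot halfway between that dot and the G corner.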
ChaosCentroid
According to [17], the k-th dot plotted on the CGR of a sequence corresponds to the first k-long initial subsequence of the sequence. Therefore, any visible pattern in the CGR corresponds to some pattern in the nucleotide sequence. The CGR as a whole represents the global information of the nucleotide sequence. Partitioning the CGR into several subregions reveals the local information of the areas of interest. If two dots are within the same quadrant, they correspond to sequences with the same last mononucleotide; if they are in the same subquadrant, the sequences have the same last dinucleotide; and so on. This shows the structure of the sequences yielding the dots. ChaosCentroid exploits this biological significance by computing the centroid of the distributed dots in each subregion. The centroid, which can be related to a specific structure of the sequence, thus represents the local information of the subregion. For ChaosCentroid, the CGR is partitioned into \(\frac {n}{g} \times \frac {n}{g}\) equal subregions, where \(\frac {n}{g} \in \{1,2,3,\ldots,11\}\). This range covers all values that can be applied to the CGR. For instance, the CGR is not partitioned when \(\frac {n}{g} = 1\), it is partitioned into 4 equal subregions when \(\frac {n}{g} = 2\), and so on. If the value of \(\frac {n}{g}\) is greater than 11, some subregions do not contain any dots, so 11 is the maximum value of \(\frac {n}{g}\) in this experiment. For each value of \(\frac {n}{g}\), the centroid of each subregion is computed first. Then the distances between the centroids and the center of the CGR are computed and captured in the form of a matrix. This set of distances can be considered as the relation of the information embedded in all subregions. However, the number of ChaosCentroids may be too large.
Therefore, this matrix is decomposed by singular value decomposition (SVD) to reduce information complexity. Finally, the \(\frac {n}{g}\) diagonal elements of the \(\frac {n}{g}\)-by-\(\frac {n}{g}\) diagonal matrix of the SVD are taken as the features of the CGR and are subsequently used as the input vectors for the prediction process. As a result, ChaosCentroid produces 11 formats of input vectors, i.e. the first format has 1 dimension, the second format has 2 dimensions, and so on. Extracting ChaosCentroid consists of the steps illustrated in Algorithm 2. Additionally, Figure 2 shows an example of the distances between the centroid of each subregion and the center of the CGR for HPV genotype 16 after partitioning into 2×2 subregions.
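The ChaosCentroid steps described above can be sketched roughly as follows. This is not Algorithm 2 itself (which is not reproduced here): the function name `chaos_centroid`, the row and column orientation of the distance matrix, and the choice to leave the distance of an empty subregion at 0 are assumptions for illustration.

```python
import numpy as np


def chaos_centroid(points, k, n=2.0):
    """Singular values of the k-by-k matrix of centroid-to-centre distances.

    points: (L, 2) array of CGR dots in an n-by-n square centred at (0, 0).
    k:      number of subregions per side (n/g in the text).
    """
    points = np.asarray(points, dtype=float)
    half = n / 2.0
    g = n / k  # side length of one subregion
    # Row/column index of the subregion that holds each dot.
    cols = np.clip(np.floor((points[:, 0] + half) / g).astype(int), 0, k - 1)
    rows = np.clip(np.floor((points[:, 1] + half) / g).astype(int), 0, k - 1)
    dist = np.zeros((k, k))
    for r in range(k):
        for c in range(k):
            inside = points[(rows == r) & (cols == c)]
            if len(inside) > 0:
                centroid = inside.mean(axis=0)
                # The CGR centre is (0, 0), so the distance is the norm.
                dist[r, c] = np.linalg.norm(centroid)
            # Empty subregions keep distance 0 (an assumption).
    # The k singular values serve as the feature vector.
    return np.linalg.svd(dist, compute_uv=False)
```

Calling `chaos_centroid(points, k)` for k from 1 to 11 would yield the 11 input formats of increasing dimension mentioned in the text.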
ChaosFrequency
As elucidated in [20], the bias in the distribution of different mono-, di-, tri-, or higher-order nucleotides along DNA/RNA sequences can generate different patterns in the CGR. These can be used as diagnostic patterns for different HPV genotypes. The CGRs of HPV genomes of different genotypes tend to exhibit visually distinct patterns, as displayed in Figure 1. ChaosFrequency therefore concentrates on the frequencies of subsequences occurring in the HPV genomes. In particular, when \(\frac {n}{g}\) is equal to \(2^{k}\), where \(k \in \{1,2,3\}\), it represents the k-mer frequencies occurring in the HPV sequences. Accordingly, the ratio between the number of dots in a subregion and the total number of dots in the CGR is computed and used as the feature of that subregion. This ratio can be interpreted as the probability of distribution. Suppose each subregion is of size g×g. After the ChaosFrequency of each subregion is extracted, the whole CGR can be viewed as a matrix of size \(\frac {n}{g} \times \frac {n}{g}\). This matrix is decomposed by SVD to extract the \(\frac {n}{g}\) diagonal elements used as the features of the CGR. Likewise, this technique produces 11 formats of input vectors, in accordance with those of ChaosCentroid. The detail of this procedure is illustrated in Algorithm 3. Each subregion is referred to by its row and column location after the partition of the CGR. Let \(m_{i,j}\) be the number of dots in the subregion at row i and column j, and suppose there are M dots in total in the CGR. Then the probability of distribution is calculated as \(p_{i,j}=\frac {m_{i,j}}{M}\).
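A minimal sketch of the ChaosFrequency computation, assuming the dots are already available as coordinates in an n×n square centered at (0,0). The function name `chaos_frequency` and the use of `numpy.histogram2d` for counting dots per subregion are illustrative choices, not the authors' Algorithm 3.

```python
import numpy as np


def chaos_frequency(points, k, n=2.0):
    """Singular values of the k-by-k matrix p[i, j] = m[i, j] / M.

    points: (L, 2) array of CGR dots in an n-by-n square centred at (0, 0).
    k:      number of subregions per side (n/g in the text).
    """
    points = np.asarray(points, dtype=float)
    half = n / 2.0
    edges = np.linspace(-half, half, k + 1)
    # Count dots per subregion; rows follow y and columns follow x.
    counts, _, _ = np.histogram2d(points[:, 1], points[:, 0], bins=[edges, edges])
    total = counts.sum()
    if total == 0:
        raise ValueError("no dots fall inside the CGR square")
    p = counts / total  # probability of distribution per subregion
    return np.linalg.svd(p, compute_uv=False)
```

With k = 1 the matrix is the single entry 1, so the feature is the same for every genome; larger k values expose the per-subregion frequencies.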
Predicting systems
To evaluate the performance of the proposed feature extraction techniques, the testing sets were fed to four different types of predicting systems. Each system has its own principle and criteria for predicting the corresponding HPV genotypes. The predicting systems are the multilayer perceptron neural network, the radial basis function network, the k-nearest neighbor technique, and the fuzzy k-nearest neighbor technique. For each of the 400 HPV genomes, one of the 12 genotypes, i.e. types 6, 11, 16, 18, 31, 33, 35, 45, 52, 53, 58, and 66, was identified. The setup of each predicting system in our experiments is as follows.
Multilayer perceptron neural network
Each input pattern is the feature vector F obtained from Algorithms 2 and 3. Therefore, the number of input neurons ranges from 1 to 11 according to the size of the feature vector F. The number of hidden neurons was empirically varied from 1 to 24 to find the most suitable number. In the experiments, 16 hidden neurons produced the best prediction of HPV genotypes. There are 12 output neurons, one for each HPV genotype: neuron 1 is for HPV genotype 6; neuron 2 for type 11; neuron 3 for type 16; neuron 4 for type 18; neuron 5 for type 31; neuron 6 for type 33; neuron 7 for type 35; neuron 8 for type 45; neuron 9 for type 52; neuron 10 for type 53; neuron 11 for type 58; and neuron 12 for type 66. Therefore, the network deployed in our experiments consists of an input layer with \(\frac {n}{g}\) neurons, a hidden layer with 16 neurons, and an output layer with 12 neurons. The backpropagation learning rule was adopted to adjust the weights of the network during the training process. The mean squared normalized error function was used as the terminating criterion in the training process. In the testing procedure, the predicted HPV genotype is determined by Equation (1). Let \(o_{i}\) be the output value of output neuron i.
\[ \text{HPV genotype} = \operatorname{argtype}\left(\arg\max_{1 \le i \le 12} o_{i}\right) \qquad (1) \]
where argtype is the mapping from a neuron index to its corresponding HPV genotype, as defined above.
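As a small illustration of this decision step, the sketch below assumes that the predicted genotype is the one whose output neuron has the largest value, and then applies the neuron-to-genotype mapping listed above. The names `GENOTYPES` and `predict_genotype` are hypothetical.

```python
# Neuron index (0-based here) to HPV genotype, in the order given in the text.
GENOTYPES = [6, 11, 16, 18, 31, 33, 35, 45, 52, 53, 58, 66]


def predict_genotype(outputs):
    """Map the strongest of the 12 output neurons to its HPV genotype."""
    if len(outputs) != len(GENOTYPES):
        raise ValueError("expected one output value per genotype")
    # Index of the largest output value (assumed winner-takes-all rule).
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return GENOTYPES[best]
```

For instance, an output vector whose third entry is the largest would be assigned to genotype 16.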
Radial basis function network
After finding the optimal spread distances for the prediction, the spread of radial basis function (RBF) is set to 0.4 for ChaosCentroid and 0.1 for ChaosFrequency. The same network structure of multilayer perceptron was adopted for this RBF network. The determination in Equation (1) of HPV genotypes for multilayer perceptron was used in this RBF predicting system.
K-nearest neighbor technique
In this technique, the determination of HPV genotypes depends upon the value of k nearest neighbors measured by Euclidean distance. For any tested feature vector, the HPV genotype of its nearest neighbor is assigned as the HPV genotype of the tested feature vector. Empirically, it was found that k=1 gave the best performance.
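With k = 1, the rule reduces to copying the genotype of the closest training vector. A minimal sketch, with the hypothetical function name `nearest_neighbor_predict` and toy inputs rather than the study's data:

```python
import numpy as np


def nearest_neighbor_predict(train_x, train_y, query):
    """Return the label of the training vector closest to query (k = 1)."""
    train_x = np.asarray(train_x, dtype=float)
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training vector.
    distances = np.linalg.norm(train_x - query, axis=1)
    return train_y[int(np.argmin(distances))]
```

Here `train_x` would hold ChaosCentroid or ChaosFrequency feature vectors of one dimension format, and `train_y` the corresponding genotypes.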
Fuzzy K-nearest neighbor technique
The fuzzy k-nearest neighbor technique was proposed by James M. Keller, Michael R. Gray, and James A. Givens [62]. It is a special variation of the k-nearest neighbor technique family. The fuzzy k-nearest neighbor algorithm assigns class memberships to a sample vector rather than assigning the vector to a particular class. An advantage is that no arbitrary assignments are made by the algorithm. Additionally, the membership values of the vector provide a level of assurance to accompany the resultant classification. In this technique, we set k to 1.
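The membership idea can be sketched as below, using the commonly cited inverse-distance weighting with fuzzifier m and crisp training labels. The function name `fuzzy_knn_memberships`, the default m = 2, and the small `eps` guard against zero distances are assumptions; the exact variant and settings used in the experiments are not given in this text.

```python
import numpy as np


def fuzzy_knn_memberships(train_x, train_y, query, k=1, m=2.0, eps=1e-12):
    """Return {label: membership} for query from its k nearest neighbours."""
    train_x = np.asarray(train_x, dtype=float)
    query = np.asarray(query, dtype=float)
    distances = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(distances)[:k]
    # Closer neighbours get larger weights: 1 / d^(2 / (m - 1)).
    weights = 1.0 / (distances[nearest] ** (2.0 / (m - 1.0)) + eps)
    memberships = {}
    for idx, weight in zip(nearest, weights):
        label = train_y[idx]
        memberships[label] = memberships.get(label, 0.0) + weight
    total = weights.sum()
    return {label: value / total for label, value in memberships.items()}
```

Note that with k = 1 the single neighbor receives membership 1, so the decision coincides with the plain nearest neighbor rule.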
Performance evaluation
Among the independent statistical testing methods for prediction accuracy, such as the subsampling (e.g., 2-, 5- or 10-fold cross-validation) test and the jackknife test, the jackknife test is deemed the most objective because it always yields a unique result for a given benchmark data set, as elucidated in [59] and demonstrated by Equations 28, 29 and 30 in [59]. The jackknife test has therefore been increasingly used and widely recognized by investigators to test the power of various prediction methods (see, e.g., [63-73]). Although the jackknife test is widely used, its computational time is rather high. To reduce the computational time, we adopted 2-fold cross-validation in this experiment to deal with parameter optimization. The reported prediction performance was therefore obtained by combining both validation sets.
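One way to combine the two validation sets is to predict every genome exactly once, each time training on the other half. The sketch below assumes a random (non-stratified) split and a caller-supplied classifier wrapper; `two_fold_predictions` and `fit_predict` are hypothetical names, and how the folds were actually formed is not stated in this text.

```python
import random


def two_fold_predictions(features, labels, fit_predict, seed=0):
    """Predict every sample once using 2-fold cross-validation.

    fit_predict(train_x, train_y, test_x) must return one predicted
    label per element of test_x (a hypothetical classifier wrapper).
    """
    order = list(range(len(features)))
    random.Random(seed).shuffle(order)  # assumed random split
    half = len(order) // 2
    folds = [order[:half], order[half:]]
    predictions = [None] * len(features)
    # Each fold is the test set once while the other fold is used for training.
    for test_idx, train_idx in (folds, folds[::-1]):
        train_x = [features[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        test_x = [features[i] for i in test_idx]
        for i, pred in zip(test_idx, fit_predict(train_x, train_y, test_x)):
            predictions[i] = pred
    return predictions
```

The combined predictions can then be scored against the true genotypes with the metrics defined in this section.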
In this experiment, we adopted Equation 11 of [52] to formulate the set of four metrics, i.e. Sensitivity (Sen), Specificity (Spec), Accuracy (Acc), and Matthew’s Correlation Coefficient (MCC), for evaluating the prediction performance. The four metrics are defined by the following equations.
\[
\begin{aligned}
\mathrm{Sen} &= 1 - \frac{N^{+}_{-}}{N^{+}}, \\
\mathrm{Spec} &= 1 - \frac{N^{-}_{+}}{N^{-}}, \\
\mathrm{Acc} &= 1 - \frac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}}, \\
\mathrm{MCC} &= \frac{1 - \left(\frac{N^{+}_{-}}{N^{+}} + \frac{N^{-}_{+}}{N^{-}}\right)}{\sqrt{\left(1 + \frac{N^{-}_{+} - N^{+}_{-}}{N^{+}}\right)\left(1 + \frac{N^{+}_{-} - N^{-}_{+}}{N^{-}}\right)}}
\end{aligned}
\qquad (2)
\]
where \(N^{+}\) is the total number of HPV genomes of the investigated genotype, whereas \(N^{+}_{-}\) is the number of HPV genomes of the investigated genotype that are incorrectly predicted as other genotypes; \(N^{-}\) is the total number of HPV genomes of the other genotypes that are not investigated, whereas \(N^{-}_{+}\) is the number of HPV genomes of the other genotypes that are incorrectly predicted as the investigated genotype. The investigated HPV genotype is 6, 11, 16, 18, 31, 33, 35, 45, 52, 53, 58, or 66. For example, if the investigated genotype is 6, \(N^{+}\) is the total number of HPV genomes of genotype 6, while \(N^{-}\) is the total number of genomes of the other genotypes, excluding genotype 6.
According to Equation 2, the prediction performance can be interpreted as follows. The sensitivity evaluates the performance of the predicting systems in identifying the investigated genotype. When \(N^{+}_{-} = 0\), none of the HPV genomes of the investigated genotype was incorrectly predicted as another genotype, so the sensitivity is 1. In contrast, when \(N^{+}_{-} = N^{+}\), all HPV genomes of the investigated genotype were incorrectly predicted as other genotypes, so the sensitivity is 0. The specificity evaluates the performance of the systems in excluding the other genotypes. When \(N^{-}_{+} = 0\), none of the HPV genomes of the other genotypes was incorrectly predicted as the investigated genotype, so the specificity is 1; when \(N^{-}_{+} = N^{-}\), all HPV genomes of the other genotypes were incorrectly predicted as the investigated genotype, so the specificity is 0. The accuracy evaluates the performance of the systems in classifying both the investigated genotype and the other genotypes. When \(N^{+}_{-} = N^{-}_{+} = 0\), none of the HPV genomes of the investigated genotype and none of the HPV genomes of the other genotypes was incorrectly predicted, so the accuracy is 1; when \(N^{+}_{-} = N^{+}\) and \(N^{-}_{+} = N^{-}\), all HPV genomes of the investigated genotype and all HPV genomes of the other genotypes were incorrectly predicted, so the accuracy is 0. The Matthew’s Correlation Coefficient (MCC) is typically used for measuring the quality of binary classification. When \(N^{+}_{-} = N^{-}_{+} = 0\), none of the HPV genomes of the investigated genotype and none of the HPV genomes of the other genotypes was incorrectly predicted, so MCC is 1; when \(N^{+}_{-} = N^{+}/2\) and \(N^{-}_{+} = N^{-}/2\), MCC is 0, meaning no better than random prediction; when \(N^{+}_{-} = N^{+}\) and \(N^{-}_{+} = N^{-}\), MCC is -1, indicating total disagreement between prediction and observation.
However, the set of metrics in Equation 2 is valid only for single-label systems. For multi-label systems, which have become more frequent in system biology [61,74] and system medicine [67,75], a completely different set of metrics as defined in [76] is needed.
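The limiting cases discussed above can be checked numerically. The sketch below assumes the four metrics take the form commonly used with this \(N^{+}\), \(N^{+}_{-}\), \(N^{-}\), \(N^{-}_{+}\) notation; the function name `genotype_metrics` and the counts in the usage note are hypothetical and are not taken from the paper's tables.

```python
import math


def genotype_metrics(n_pos, n_pos_missed, n_neg, n_neg_false):
    """Return (Sen, Spec, Acc, MCC) for one investigated genotype.

    n_pos:        genomes of the investigated genotype (N+)
    n_pos_missed: of those, predicted as another genotype (N+ with subscript -)
    n_neg:        genomes of all other genotypes (N-)
    n_neg_false:  of those, predicted as the investigated genotype (N- with subscript +)
    """
    sen = 1 - n_pos_missed / n_pos
    spec = 1 - n_neg_false / n_neg
    acc = 1 - (n_pos_missed + n_neg_false) / (n_pos + n_neg)
    numerator = 1 - (n_pos_missed / n_pos + n_neg_false / n_neg)
    denominator = math.sqrt(
        (1 + (n_neg_false - n_pos_missed) / n_pos)
        * (1 + (n_pos_missed - n_neg_false) / n_neg)
    )
    # Returning 0 when the denominator vanishes is a convention, not from the text.
    mcc = numerator / denominator if denominator > 0 else 0.0
    return sen, spec, acc, mcc
```

With hypothetical counts of 40 genomes of the investigated genotype and 360 others, `genotype_metrics(40, 0, 360, 0)` gives all four metrics equal to 1, `genotype_metrics(40, 20, 360, 180)` gives an MCC of 0, and `genotype_metrics(40, 40, 360, 360)` gives an MCC of -1 (up to floating-point rounding), matching the three cases described in the text.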
Results and discussion
The value of variable \(\frac {n}{g}\) in Algorithms 2 and 3 was set from 1 to 11. The performance of HPV genotype prediction was separately summarized according to each predicting system and two feature extracting schemes. The obtained results are the following.
Results from multilayer perceptron neural network
The results of the HPV genotype prediction obtained with ChaosCentroid and ChaosFrequency feature extraction and the multilayer perceptron neural network are summarized in Tables 7 and 8, respectively, of Additional file 3. The results are reported for the different values of \(\frac {n}{g} \in \{1, 2, \ldots, 11\}\). The case \(\frac {n}{g} = 1\) is particularly notable.
When \(\frac {n}{g} = 1\), the number of subregions of the CGR is equal to one. Thus only one centroid is computed by ChaosCentroid, and the probability of distribution of the CGR computed by ChaosFrequency is equal to one. The overall performance of ChaosFrequency is much lower than that of ChaosCentroid. ChaosFrequency gains 0% sensitivity and 100% specificity for all genotypes except genotype 16. This implies that the features of all genomes extracted by ChaosFrequency are all predicted as genotype 16. In contrast, ChaosCentroid obtains high values of the performance metrics, i.e. accuracy, sensitivity, specificity, and Matthew’s Correlation Coefficient, for almost all genotypes. This is because the centroid is computed from the coordinates of every dot. Different HPV genotypes have different distributions of dots and hence different centroids, so predicting HPV genotypes with high performance from these centroids is possible. In the case of ChaosFrequency, however, the probability of distribution is equal for every HPV genotype, which makes the features of the HPV genotypes indistinguishable.
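The degenerate case can be written out explicitly using the definition \(p_{i,j}=\frac {m_{i,j}}{M}\). With a single subregion, every dot falls in it, so \(m_{1,1} = M\) regardless of the genome:

\[
P = \begin{bmatrix} p_{1,1} \end{bmatrix} = \begin{bmatrix} \frac{M}{M} \end{bmatrix} = \begin{bmatrix} 1 \end{bmatrix}, \qquad \sigma_{1}(P) = 1 .
\]

Here \(\sigma_{1}(P)\) denotes the only singular value of the 1×1 matrix \(P\) (notation introduced for this illustration). Every genome therefore maps to the same one-dimensional feature value, and a classifier has no information with which to separate the genotypes.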
However, when the value of \(\frac {n}{g}\) is greater than one, the local information regarding the frequency of subsequence among nucleotides in each subregion is brought out and the performance is increased in proportion to the value of \(\frac {n}{g}\). It is noticeable that there is no significant difference between the overall performance obtained from ChaosCentroid and ChaosFrequency when \(\frac {n}{g} > 3\). In addition, we can conclude that, to achieve high performance of prediction, the local information of each subregion is more relevant than global information.
Results from radial basis function network
The results of the HPV genotype prediction gained by ChaosCentroid and by ChaosFrequency feature extraction with the predicting system based on radial basis function network are summarized in Tables 9 and 10, respectively, of Additional file 3. According to the results, the performance values obtained by this predicting system are unstable among input dimensions. This is because this experiment set only one optimal spread distance, which gain the maximum average accuracy of all dimensions, for each predicting system of ChaosCentroid and ChaosFrequency, respectively. In fact, it is possible that each input dimension has its own proper spread distance, and one value of spread distance can not fit for all dimensions. In addition, it is noticeable that ChaosFrequency with RBF at 4dimensional input can achieve the best performance with minimum input dimension. The overall performance trend obtained from this predicting system is similar to those of multilayer perceptron. But the peformance from multilayer perceptron is significantly higher than the performance from radial basis function.
Results from K-nearest neighbor technique
The HPV genotype prediction results obtained with ChaosCentroid and ChaosFrequency feature extraction and the predicting system based on the K-nearest neighbor technique are summarized in Tables 11 and 12, respectively, of Additional file 3. The experimental results show high prediction performance. This implies that, within each subregion, the sequence structure captured as a centroid by ChaosCentroid and the statistical distribution of mono-, di-, or higher-order nucleotides captured as a frequency by ChaosFrequency are close to each other for genomes of the same genotype. The overall performance trend of this predicting system is similar to that of the multilayer perceptron, but its performance is slightly higher than that of the multilayer perceptron.
Results from fuzzy K-nearest neighbor technique
The HPV genotype prediction results obtained with ChaosCentroid and ChaosFrequency feature extraction and the predicting system based on the fuzzy K-nearest neighbor technique are summarized in Tables 13 and 14, respectively, of Additional file 3. The overall performance trend of this predicting system is similar to that of the multilayer perceptron. Its overall performance is slightly higher than that of the multilayer perceptron, but it is statistically equal to that of the K-nearest neighbor technique because the same value of k was used.
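The relation between the two neighbor-based systems can be sketched as follows. This is a minimal illustration, assuming crisp training labels, Euclidean distance, and the inverse-distance weighting of the fuzzy K-nearest neighbor rule with fuzzifier m = 2; the function names are hypothetical and the settings used in the experiments may differ.

```python
import numpy as np
from collections import Counter

def nearest(X_train, x, k):
    """Indices and distances of the k training points closest to x."""
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]
    return idx, d[idx]

def knn_predict(X_train, y_train, x, k):
    """Crisp KNN: majority vote among the k nearest neighbors."""
    idx, _ = nearest(X_train, x, k)
    return Counter(y_train[idx].tolist()).most_common(1)[0][0]

def fuzzy_knn_memberships(X_train, y_train, x, k, n_classes, m=2.0, eps=1e-12):
    """Class memberships of x: each neighbor votes for its own class
    with weight 1 / d^(2 / (m - 1)); memberships sum to one."""
    idx, d = nearest(X_train, x, k)
    w = 1.0 / (d ** (2.0 / (m - 1.0)) + eps)  # eps guards against d == 0
    u = np.zeros(n_classes)
    np.add.at(u, y_train[idx], w)
    return u / w.sum()

# Toy one-dimensional features; labels are integer genotype indices.
X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0, 0, 0, 1])
x = np.array([0.5])
u = fuzzy_knn_memberships(X_train, y_train, x, k=3, n_classes=2)
print(knn_predict(X_train, y_train, x, k=3), u)
```

When the k neighbors of a query mostly share one genotype, the largest membership and the majority vote select the same label, which is consistent with the two systems performing equally for the same k.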
Comparative results with Related Method
The NCBI viral genotyping tool [77] is a web-based tool for identifying the genotype of a viral sequence. It works by sliding a window along the query sequence and processing each window (sequence segment) separately. Each segment is compared to a set of reference sequences using BLAST, which returns the similarity scores of the local alignments. The genotype of the reference sequence that matches the segment with the highest similarity score is assigned to that segment. The process is repeated for the next window until the whole length of the query sequence has been covered, and the results from all windows are then combined. If the same genotype is assigned to most segments, the query sequence is assigned that genotype. Because this tool provides a reliable alignment-based method, this experiment adopted it for identifying the genotypes of the viral genomes in the HPV data set. To evaluate the prediction performance, the results obtained by this genotyping tool were compared with the best results obtained by the proposed ChaosCentroid and ChaosFrequency feature extraction techniques with all predicting systems, as illustrated in Tables 2, 3, 4, 5 and 6.
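The window-and-vote procedure described above can be sketched as follows. This is only an illustration of the control flow: the real tool scores each segment with BLAST, whereas the sketch substitutes a toy similarity (the number of shared 3-mers), and the window size, step, and reference sequences are made-up values.

```python
from collections import Counter

def kmer_set(seq, k=3):
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(segment, reference, k=3):
    """Toy stand-in for a BLAST score: count of shared k-mers."""
    return len(kmer_set(segment, k) & kmer_set(reference, k))

def genotype_by_windows(query, references, window=8, step=4):
    """Assign each window the genotype of its best-matching reference,
    then return the genotype assigned to the most windows."""
    votes = Counter()
    last_start = max(len(query) - window, 0)
    for start in range(0, last_start + 1, step):
        segment = query[start:start + window]
        best = max(references, key=lambda g: similarity(segment, references[g]))
        votes[best] += 1
    genotype, _ = votes.most_common(1)[0]
    return genotype, dict(votes)

# Hypothetical references keyed by genotype label.
references = {"A": "ACGTACGTACGTACGT", "B": "TTGGCCAATTGGCCAA"}
query = "ACGTACGTACGTACGTTTGGCCAA"
genotype, votes = genotype_by_windows(query, references)
print(genotype, votes)
```

Every window requires a comparison against every reference, which is one reason the alignment-based approach becomes costly as the number of query genomes grows.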
The experimental results show that all methods except ChaosCentroid with the radial basis function network achieve the best performance on all four metrics (accuracy, sensitivity, specificity, and Matthews correlation coefficient) in predicting the HPV genotypes of the data set. This demonstrates that both the proposed techniques and the NCBI genotyping tool can be used to predict the genotypes of HPV genomes. Although there is no significant difference between the proposed techniques and the NCBI genotyping tool, some issues should be considered.
The NCBI genotyping tool provides a reliable method based on a homology-searching sequence alignment procedure. The limitation of alignment is that it is difficult to identify or classify protein or DNA sequences that do not have significant sequence homology. In addition, alignment against multiple sequences is time consuming, and the tool can process only one query sequence at a time. This method is therefore not appropriate for large-scale tasks.
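For reference, the four metrics can be computed from the confusion counts of a single genotype as in the sketch below. It uses the standard definitions and assumes a one-versus-rest treatment of each genotype; the exact averaging used in the tables is not restated here, and the counts in the example are invented.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and Matthews correlation
    coefficient from one-versus-rest confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }

# Invented counts: every genome of the genotype found, none misassigned.
perfect = binary_metrics(tp=10, tn=90, fp=0, fn=0)
print(perfect)
```

A method reaches the best value of all four metrics only when it produces no false positives and no false negatives for the genotype.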
In contrast, the proposed techniques, i.e. ChaosCentroid and ChaosFrequency, are based on chaos game representation, which provides a unique and scale-independent representation of DNA sequences through the statistical distribution of mono-, di-, tri-, or higher-order nucleotides along the sequences. An advantage of CGR over alignment is that it has the potential to reveal evolutionary and/or functional relationships between sequences that have no significant homology, as elucidated in [35]. Furthermore, it does not require prior knowledge of consensus sequences, nor does it involve exhaustive searches for sequences in databases. The limitation of CGR is that generating the representations from DNA sequences takes computational time. Nevertheless, this experiment used singular value decomposition to reduce the CGR to a smaller number of feature matrices, so the computational time of the prediction process was also reduced. The experimental results show that the proposed ChaosCentroid and ChaosFrequency, which are based on chaos game representation and singular value decomposition, can successfully extract the characteristic parameters of HPV genotypes for prediction.
Since user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful models, simulated methods, or predictors [78–80], we may make efforts in our future work to provide a web-server for the method presented in this paper.
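The ingredients named above can be sketched in a few lines. This is a simplified illustration rather than the feature extraction used in the experiments: the corner assignment of the nucleotides, the handling of empty subregions (NaN), and the use of the leading singular values as the reduced features are all assumptions, and the function names are hypothetical.

```python
import numpy as np

# Corner assignment is an assumption (a common CGR convention).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """Chaos game: start at the center and move halfway toward the
    corner of each nucleotide in turn."""
    x, y = 0.5, 0.5
    pts = []
    for base in seq.upper():
        if base not in CORNERS:  # skip ambiguous symbols such as N
            continue
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        pts.append((x, y))
    return np.array(pts, dtype=float).reshape(-1, 2)

def cell_indices(points, order):
    """Row and column of the subregion of each point on a 2^order grid."""
    n = 2 ** order
    cols = np.minimum((points[:, 0] * n).astype(int), n - 1)
    rows = np.minimum((points[:, 1] * n).astype(int), n - 1)
    return rows, cols, n

def frequency_matrix(points, order):
    """Number of CGR dots in each subregion (ChaosFrequency-style)."""
    rows, cols, n = cell_indices(points, order)
    F = np.zeros((n, n))
    np.add.at(F, (rows, cols), 1.0)
    return F

def centroid_distances(points, order):
    """Distance from each subregion centroid to the CGR center
    (ChaosCentroid-style); NaN marks an empty subregion."""
    rows, cols, n = cell_indices(points, order)
    D = np.full((n, n), np.nan)
    for r in range(n):
        for c in range(n):
            mask = (rows == r) & (cols == c)
            if mask.any():
                centroid = points[mask].mean(axis=0)
                D[r, c] = np.linalg.norm(centroid - 0.5)
    return D

def leading_singular_values(M, r):
    """Assumed reduction step: keep the r largest singular values."""
    return np.linalg.svd(M, compute_uv=False)[:r]

pts = cgr_points("ACGT")
F = frequency_matrix(pts, order=1)
D = centroid_distances(pts, order=1)
print(F, D, leading_singular_values(F, 2))
```

With `order=1` each dot falls in the quadrant of its own nucleotide, so the frequency matrix holds mononucleotide counts; higher orders resolve di-, tri-, and longer subsequences in the same way.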
Conclusion
This paper proposed two new feature extraction techniques, i.e. ChaosCentroid and ChaosFrequency, based on chaos game representation and singular value decomposition for predicting HPV genotypes from nucleotide sequences in HPV genomes. Both techniques concentrate on the local information among nucleotides. For the subregions in CGR, ChaosCentroid captures the structure of the sequences in the form of centroids, while ChaosFrequency captures the distribution of subsequences in the form of frequencies. Four different predicting systems, i.e. multilayer perceptron neural network, radial basis function network, K-nearest neighbor technique, and fuzzy K-nearest neighbor technique, were deployed. From the experiment, we found that the features extracted by the proposed techniques are significant and independent of the predicting systems. The comparative results demonstrated no significant difference between the proposed techniques and the NCBI viral genotyping tool. In addition, local information is more important than global information for achieving high prediction performance.
Abbreviations
HPV: Human papillomavirus
LSIL: Low grade squamous intraepithelial lesion
HSIL: High grade squamous intraepithelial lesion
CGR: Chaos game representation
SVD: Singular value decomposition
MLP: Multilayer perceptrons
RBF: Radial basis function
KNN: K-nearest neighbor
FKNN: Fuzzy K-nearest neighbor
References
Sheng J, Zhang WY. Identification of biomarkers for cervical cancer in peripheral blood lymphocytes using oligonucleotide microarrays. Chin Med J. 2010; 123:1000–5.
Abreu ALP, Souza RP, Gimenes F, Consolaro MEL. A review of methods for detect human papillomavirus infection. Virol J. 2012; 9:262.
Dobec M, Bannwart F, Kilgus S, Kaeppeli F, Cassinotti P. Human papillomavirus infection among women with cytological abnormalities in switzerland investigated by an automated linear array genotyping test. J Med Virol. 2011; 83:1370–6.
Rossi PG, Chini F, Bisanzi S, Burroni E, Carillo G, Lattanzi A, et al. Distribution of high and low risk hpv types by cytological status: a population based study from italy. Infect Agents Cancer. 2011; 6:2.
Couture MC, Page K, Stein ES, Sansothy N, Sichan K, Kaldor J, et al. Cervical human papillomavirus infection among young women engaged in sex work in phnom penh, cambodia: prevalence, genotypes, risk factors and association with hiv infection. BMC Infect Dis. 2012; 12:166.
Ursu RG, Onofriescu M, Nemescu D, Iancu LS. HPV prevalence and type distribution in women with or without cervical lesions in the northeast region of romania. Virol J. 2011; 8:558.
Lee SH, Vigliotti VS, Vigliotti JS, Pappu S. Routine human papillomavirus genotyping by dna sequencing in community hospital laboratories. Infect Agents Cancer. 2007; 2:11.
Carvalho NO, Castillo DM, Perone C, Januario JN, Melo VH, Filho GB. Comparison of hpv genotyping by type-specific pcr and sequencing. Mem Inst Oswaldo Cruz. 2010; 105(1):73–8.
Wang P, Xiao X. Predicting the risk type of human papillomaviruses based on sequence-derived features. In: Proceedings of 5th International Conference on Bioinformatics and Biomedical Engineering: 10–12 May 2011; Wuhan, China. USA: IEEE: 2011. p. 1–4.
Xiao X, Wang P. A new approach using geometric moments of distance matrix image for risk type prediction of human papillomaviruses. In: Proceedings of 2011 International Conference on Electronics, Communications and Control: 9–11 September 2011; Ningbo. USA: IEEE: 2011. p. 52–55.
Park S, Hwang S, Zhang B. Classification of the risk types of human papillomavirus by decision trees. In: Proceedings of 4th International Conference on Intelligent Data Engineering and Automated Learning: 21–23 March 2003; Hong Kong, China. Germany: Springer Berlin Heidelberg: 2003. p. 540–544.
Park S, Hwang S, Zhang B. Classification of human papillomavirus (hpv) risk type via text mining. Genomics Informatics. 2003; 1(2):80–6.
Eom J, Park S, Zhang B. Genetic mining of DNA sequence structures for effective classification of the risk types of human papillomavirus (HPV). In: Proceedings of the 11th International Conference on Neural Information Processing: 22–25 November 2004; Calcutta, India. Germany: Springer Berlin Heidelberg: 2004. p. 1334–1343.
Kim S, Zhang B. Human papillomavirus risk type classification from protein sequences using support vector machines. In: Proceedings of the 2006 International Conference on Applications of Evolutionary Computing: 10–12 April 2006; Budapest, Hungary. Germany: Springer Berlin Heidelberg: 2006. p. 57–66.
Kim S, Eom J. Prediction of the human papillomavirus risk types using gap-spectrum kernels. In: Proceedings of Third International Symposium on Neural Networks: 28 May – 1 June 2006; Chengdu, China. Germany: Springer Berlin Heidelberg: 2006. p. 710–5.
Kim S, Kim J, Zhang B. Ensembled support vector machines for human papillomavirus risk type prediction from protein secondary structures. Comput Biol Med. 2009; 39:187–93.
Jeffrey HJ. Chaos game representation of gene structure. Nucleic Acids Res. 1990; 18:2163–70.
Almeida JS, Carrico JA, Maretzek A, Noble PA, Fletcher M. Analysis of genomic sequences by chaos game representation. Bioinformatics. 2001; 17:429–37.
Lu J, Hu X, Liu X, Shi F. Predicting thermophilic nucleotide sequences based on chaos game representation features and support vector machine. In: Proceedings of 5th International Conference on Bioinformatics and Biomedical Engineering: 10–12 May 2011; Wuhan. USA: IEEE: 2011. p. 1–4.
Dutta C, Das J. Mathematical characterization of chaos game representation: New algorithms for nucleotide sequence analysis. J Mol Biol. 1992; 228:715–29.
Xiao Q, Zhou J, Shi L. A novel 3D graphical representation of RNA secondary structures based on chaos game representation. In: Proceedings of Sixth International Conference on Natural Computation: 10–12 August 2010; Yantai, Shandong. USA: IEEE: 2010. p. 2999–3002.
Deschavanne PJ, Giron A, Vilain J, Fagot G, Fertil B. Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol Biol Evol. 1999; 16:1391–9.
Tavassoly I, Tavassoly O, Rad MSR, Dastjerdi NM. Three dimensional chaos game representation of genomic sequences. In: Proceedings of Frontiers in the Convergence of Bioscience and Information Technologies: 11–13 October 2007; Jeju City. USA: IEEE: 2007. p. 219–223.
Yu ZG, Shi L, Xiao QJ, Anh V. Chaos game representation of genomes and their simulation by recurrent iterated function systems. In: Proceedings of the 2nd International Conference on Bioinformatics and Biomedical Engineering: 16–18 May 2008; Shanghai. USA: IEEE: 2008. p. 41–46.
Nair VV, Vijayan K, Gopinath DP, Nair AS. ANN based classification of unknown genome fragments using chaos game representation. In: Proceedings of 2010 Second International Conference on Machine Learning and Computing: 9–11 February 2010; Bangalore. USA: IEEE: 2010. p. 81–85.
Messaoudi I, Oueslati AE, Lachiri Z. Genomic data visualization. In: Proceedings of 2012 6th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications: 21–24 March 2012; Sousse. USA: IEEE: 2012. p. 772–8.
Yu ZG, Anh V, Lau KS. Chaos game representation of protein sequences based on the detailed HP model and their multifractal and correlation analyses. J Theor Biol. 2004; 226:341–8.
Yang JY, Yu ZG, Anh V. Clustering structure of large proteins using multifractal analyses based on 6-letters model and hydrophobicity scale of amino acids. Chaos, Solitons Fractals. 2009; 40:607–20.
Yang JY, Peng ZL, Yu ZG, Zhang RJ, Anh V, Wang D. Prediction of protein structural classes by recurrence quantification analysis based on chaos game representation. J Theor Biol. 2009; 257:618–26.
Hu XH, Xia JB, Niu XH, Ma X, Song CH, Shi F. Chaos game representation for discriminating thermophilic from mesophilic protein sequences. In: Proceedings of 3rd International Conference on Bioinformatics and Biomedical Engineering: 11–13 June 2009; Beijing. USA: IEEE: 2009. p. 1–4.
Nana L, Xiaohui N, Feng S, Xuehai H. Subcellular locations prediction of proteins based on chaos game representation. In: Proceedings of 3rd International Conference on Bioinformatics and Biomedical Engineering: 11–13 June 2009; Beijing. USA: IEEE: 2009. p. 1–4.
Song C, Shi F. Subcellular location of apoptosis proteins based on chaos game representation. In: Proceedings of International Conference on Future BioMedical Information Engineering: 13–14 December 2009; Sanya. USA: IEEE: 2009. p. 194–196.
Yu ZG, Xiao QJ, Shi L, Yu ZW, Anh V. Chaos game representation of functional protein sequences, and simulation and multifractal analysis of induced measures. Chinese Phys B. 2010; 19:068701.
Olyaee M, Yaghubi M. Improved protein structural class prediction based on chaos game representation. In: Proceedings of Fourth Asia International Conference on Mathematical/Analytical Modelling and Computer Simulation: 26–28 May 2010; Bornea. USA: IEEE: 2010. p. 486–91.
Basu S, Pan A, Dutta C, Das J. Chaos game representation of proteins. J Mol Graphics Modell. 1997; 15:279–89.
Chou KC. Graphic rule for drug metabolism systems. Curr Drug Metab. 2010; 11:369–78.
Zhou GP. The disposition of the lzcc protein residues in wenxiang diagram provides new insights into the protein-protein interaction mechanism. J Theor Biol. 2011; 284:142–8.
Wu ZC, Xiao X. 2D-MH: a web-server for generating graphic representation of protein sequences based on the physicochemical properties of their constituent amino acids. J Theor Biol. 2010; 267:29–34.
Xiao X, Chou KC. Using pseudo amino acid composition to predict protein attributes via cellular automata and others approaches. Curr Bioinf. 2011; 6:251–60.
Xiao X, Wang P. Cellular automata and its applications in protein bioinformatics. Curr Protein Pept Sci. 2011; 12:508–19.
Xiao X, Wang P. GPCR-2L: predicting g protein-coupled receptors and their types by hybridizing two different modes of pseudo amino acid compositions. Mol Biosystems. 2011; 7:911–9.
Alshalalfa M, Alhajj R. Combining singular value decomposition and t-test into hybrid approach for significant gene extraction from microarray data. In: Proceedings of 8th IEEE International Conference on BioInformatics and BioEngineering: 8–10 October 2008; Athens. USA: IEEE: 2008. p. 1–6.
Duan ZH, Liou LS, Shi T, DiDonato JA. Application of singular value decomposition and functional clustering to analyzing gene expression profiles of renal cell carcinoma. In: Proceedings of the 2003 IEEE Bioinformatics Conference: 11–14 August 2003. USA: IEEE: 2003. p. 392–3.
Tomfohr J, Lu J, Kepler TB. Pathway level analysis of gene expression using singular value decomposition. BMC Bioinf. 2005; 6:225.
Berger JA, Hautaniemi S, Mitra SK, Astola J. Jointly analyzing gene expression and copy number data in breast cancer using data reduction models. IEEE/ACM Trans Comput Biol Bioinf. 2006; 3:2–16.
Baty F, Rudiger J, Miglino N, Kern L, Borger P, Brutsche M. Exploring the transcription factor activity in high-throughput gene expression data using RLQ analysis. BMC Bioinf. 2013; 14:178.
Aghili SA, Sahin OD, Agrawal D, Abbadi AE. Efficient filtration of sequence similarity search through singular value decomposition. In: Proceedings of the Fourth IEEE Symposium on Bioinformatics and Bioengineering: 19–21 May 2004. USA: IEEE: 2004. p. 403–410.
Peters TJ, Smolikova-Wachowiak R, Wachowiak MP. Microarray image compression using a variation of singular value decomposition. In: Proceedings of the 29th Annual International Conference of the IEEE EMBS Cite Internationale: 22–26 Aug. 2007; France. USA: IEEE: 2007. p. 1176–1179.
Hu P, Bull SB, Jiang H. Gene network modular-based classification of microarray samples. BMC Bioinf. 2012; 13(Suppl 10):17.
Holec M, Klema J, Zelezny F, Tolar J. Comparative evaluation of set-level techniques in predictive classification of gene expression samples. BMC Bioinf. 2012; 13(Suppl 10):15.
Fan YN, Xiao X, Min JL. iNR-Drug: predicting the interaction of drugs with nuclear receptors in cellular networking. Int J Mol Sci. 2014; 15:4915–37.
Guo SH, Deng EZ, Xu LQ, Ding H, Lin H, Chen W, et al. iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition. Bioinformatics. 2014; 30:1522–9.
Liu B, Zhang D, Xu R, Xu J, Wang X. Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection. Bioinformatics. 2014; 30:472–9.
Qiu WR, Xiao X. iRSpot-TNCPseAAC: identify recombination spots with trinucleotide composition and pseudo amino acid components. Int J Mol Sci. 2014; 15:1746–66.
Chen W, Feng PM, Lin H. iSS-PseDNC: identifying splicing sites using pseudo dinucleotide composition. Biomed Res Int. 2014; 2014:623149.
Qiu WR, Xiao X, Lin WZ. iMethyl-PseAAC: identification of protein methylation sites via a pseudo amino acid composition approach. BioMed Res Int. 2014; 2014:947416.
Ding H, Deng EZ, Yuan LF, Liu L. iCTX-Type: a sequence-based predictor for identifying the types of conotoxins in targeting ion channels. BioMed Res Int. 2014; 2014:286419.
Xu Y, Wen X, Shao XJ, Deng NY. iHyd-PseAAC: predicting hydroxyproline and hydroxylysine in proteins by incorporating dipeptide position-specific propensity into pseudo amino acid composition. Int J Mol Sci. 2014; 15:7594–610.
Chou KC. Some remarks on protein attribute prediction and pseudo amino acid composition (50th anniversary year review). J Theor Biol. 2011; 273:236–47.
Chou KC, Shen HB. Review: Recent progresses in protein subcellular location prediction. Anal Biochem. 2007; 370:1–16.
Chou KC, Wu ZC, Xiao X. iLoc-Hum: using accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites. Mol Biosyst. 2012; 8:629–41.
Keller JM, Gray MR, Givens JA. A fuzzy k-nearest neighbor algorithm. IEEE Trans Syst Man Cybernet. 1985; SMC-15:580–5.
Lin WZ, Fang JA, Xiao X. iLoc-Animal: a multi-label learning classifier for predicting subcellular localization of animal proteins. Mol Biosystems. 2013; 9:634–44.
Chou KC, Cai YD. Prediction and classification of protein subcellular location: sequence-order effect and pseudo amino acid composition. J Cell Biochem. 2003; 90:1250–60.
Min JL, Xiao X, Chou KC. iEzy-Drug: A web server for identifying the interaction between enzymes and drugs in cellular networking. BioMed Res Int. 2013:701317.
Xiao X, Min JL, Wang P. iCDI-PseFpt: identify the channel-drug interaction in cellular networking with pseaac and molecular fingerprints. J Theor Biol. 2013; 337:71–9.
Xiao X, Wang P, Lin WZ. iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Anal Biochem. 2013; 436:168–77.
Kong L, Zhang L, Lv J. Accurate prediction of protein structural classes by incorporating predicted secondary structure information into the general form of chou’s pseudo amino acid composition. J Theor Biol. 2014; 344:12–8.
Mondal S, Pai PP. Chou’s pseudo amino acid composition improves sequence-based antifreeze protein prediction. J Theor Biol. 2014; 356:30–5.
Hajisharifi Z, Piryaiee M, Mohammad Beigi M, Behbahani M, Mohabatkar H. Predicting anticancer peptides with chou’s pseudo amino acid composition and investigating their mutagenicity via ames test. J Theor Biol. 2014; 341:34–40.
Chou KC, Cai YD. Prediction of membrane protein types by incorporating amphipathic effects. J Chem Inf Model. 2005; 45:407–13.
Liu B, Xu J, Lan X, Xu R, Zhou J, Wang X, et al. iDNA-Prot|dis: Identifying DNA-binding proteins by incorporating amino acid distance-pairs and reduced alphabet profile into the general pseudo amino acid composition. PLoS ONE. 2014; 9:106691.
Xu Y, Wen X, Wen LS, Wu LY. iNitro-Tyr: prediction of nitrotyrosine sites in proteins with general pseudo amino acid composition. PLoS ONE. 2014; 9:105018.
Chou KC, Wu ZC, Xiao X. iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins. PLoS One. 2011; 6:18258.
Chen L, Zeng WM, Cai YD. Predicting anatomical therapeutic chemical(atc) classification of drugs by integrating chemical-chemical interactions and similarities. PLoS ONE. 2012; 7:35254.
Chou KC. Some remarks on predicting multi-label attributes in molecular biosystems. Mol Biosyst. 2013; 9:1092–100.
Rozanov M, Plikat U, Chappey C, Kochergin A, Tatusova T. A web-based genotyping resource for viral sequences. Nucleic Acids Res. 2004; 32:654–9.
Xu R, Zhou J, Liu B, He Y, Zou Q, Wang X, Chou KC. Identification of DNA-binding proteins by incorporating evolutionary information into pseudo amino acid composition via the top-n-gram approach. J Biomolecular Struct Dynamics. 2014. doi:10.1080/07391102.2014.968624.
Qiu WR, Xiao X, Lin WZ. iUbiq-Lys: Prediction of Lysine Ubiquitination Sites in Proteins by Extracting Sequence Evolution Information Via a Grey System Model. in press.
Lin SX, Lapointe J. Theoretical and experimental biology in one. J Biomed Sci Eng. (JBiSE). 2013; 6:435–42.
Acknowledgements
The authors express their gratitude to the Thailand Research Fund (TRF) for supporting this research under the Golden Jubilee Scholarship.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
CL and YP participated in the design of the feature extractions and contributed to the refinement of the work. WT participated in the design of the feature extractions, collected the HPV data set, implemented the feature extractions, evaluated the prediction performance, and drafted the manuscript. All authors participated in revising the manuscript. All authors read and approved the final manuscript.
Additional files
Additional file 1
Genome names, NCBI access numbers, and HPV genotypes of all genomes in the HPV data set.
Additional file 2
HPV data set.
Additional file 3
Results of the HPV genotype prediction based on the features extracted by ChaosCentroid and ChaosFrequency with all predicting systems.
Rights and permissions
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Tanchotsrinon, W., Lursinsap, C. & Poovorawan, Y. A high performance prediction of HPV genotypes by Chaos game representation and singular value decomposition. BMC Bioinformatics 16, 71 (2015). https://doi.org/10.1186/s12859-015-0493-4
DOI: https://doi.org/10.1186/s12859-015-0493-4
Keywords
 HPV
 Genotype
 Chaos game representation
 Singular value decomposition
 Prediction