Protein structural class prediction based on an improved statistical strategy

Background: Protein structural class (PSC) is one of the most basic yet important classifications of protein structures, and techniques for predicting it have been developed for decades. Two popular kinds of index are those based on amino acid frequency (AAF) and those based on amino acid arrangement (AAA) with long-term correlation (LTC); both have been proposed in many works. Both indices have their pros and cons. For example, the AAF index focuses on statistical analysis, while the AAA-LTC index emphasizes long-term correlation and biological significance. Unfortunately, the datasets used in previous work were not very reliable, because they contained a small number of sequences with high sequence similarity.

Results: By modifying a statistical strategy, we propose a new index method that combines probability and information theory with long-term correlation. We also propose a numerically and biologically reliable dataset that includes more than 5700 sequences with low sequence similarity. The results show that the proposed approach has high accuracy. Compared with the amino acid composition (AAC) index using a distance method, the accuracy of our approach improves by 16–20% in the re-substitution test and by about 6–11% in the cross-validation test. Compared with the component coupled method (CCM), the improvements are about 23% and 15%, respectively.

Conclusion: A new index method that combines probability and information theory with long-term correlation is proposed in this paper. The statistical method is improved significantly by the new index. A cross-validation test was conducted, and the results show that the proposed method brings a substantial improvement.


Background
Protein function is strongly related to protein structure, so understanding protein structures is fundamental to the analysis of protein function. With the increasing number of solved structure entries in bioinformatics databases, the classification of protein structures has become an important task in bioinformatics research. Scientists have developed various methodologies for this classification. For example, based on the types and arrangements of secondary structural elements, Levitt and Chothia [1] proposed a method that recognizes ten protein structural classes: four principal classes and six small classes. Biologists commonly focus on the four principal classes, namely the all-α, all-β, α/β, and α+β classes. The prediction of these four principal protein structural classes is therefore a foundation of protein analysis. Many indices and methods have been proposed to predict protein structural class [2][3][4][5][6][7]. The commonly used indices and their corresponding methods are briefly described below.
Nishikawa et al. [8] found that protein structural classes are related to their amino acid compositions (AAC). Based on this hypothesis, Chou [9,10] proposed standard vectors built from the amino acid composition of proteins. These statistics-based indices are 20-dimensional vectors in which each component corresponds to the occurrence frequency of one amino acid in the protein sequence. Although these indices can be regarded as the feature vector of a sequence, they carry too little information to reflect the correlation among residues. Another weakness is that the statistical indices do not reflect biological significance well. Several methods have been built on these indices, such as the distance-based algorithm [11,12], the component-coupled algorithm [13][14][15], the support vector machine (SVM) based algorithm [16] and others [17,18].
Alternatively, one can use a protein structural class prediction index that is based on the arrangement of and correlation between residues. Such index methods, which use various physicochemical properties, have been tested and adopted for prediction. For example, Bu et al. [19] found that the auto-correlation function (ACF) of the average non-bonded energy can represent the protein structural class with better prediction accuracy. Although this approach considers the long-term correlation between different residues, it does not include the statistical characteristics of the sequences.
In this paper, a new index method is proposed. The method is based on information theory and probability theory. In this method, the residue occurrence frequency is used instead of physicochemical indices for the long-term correlation calculation. The statistical strategy for the residue occurrence frequency is changed from a single sequence to the whole training dataset. The results show that the accuracy is significantly improved.

Methods
Suppose the whole dataset $S$ contains $N$ sequences and can be divided into $m$ subsets $S_\xi$ ($\xi = 1, 2, \ldots, m$); in this paper we set $m = 4$, without loss of generality. Thus,

$$S = S_1 \cup S_2 \cup \cdots \cup S_m$$

The number of sequences in subset $S_\xi$ is $n_\xi$, so the total number of sequences is $N = \sum_{\xi=1}^{m} n_\xi$. Chou et al. [9] proposed an index based on the amino acid composition (AAC) frequency in a sequence (Equations 2-4), i.e.,

$$X_k^\xi = \left[ x_{k,1}^\xi, x_{k,2}^\xi, \ldots, x_{k,20}^\xi \right]^T \qquad (2)$$

where $x_{k,1}^\xi, x_{k,2}^\xi, \ldots, x_{k,20}^\xi$ are the normalized occurrence frequencies of the 20 residues for the $k$th protein in the subset $S_\xi$, and $T$ stands for the transpose.

The average occurrence frequencies, or the so-called standard vector, for subset $S_\xi$ are represented by

$$\bar{X}^\xi = \left[ \bar{x}_1^\xi, \bar{x}_2^\xi, \ldots, \bar{x}_{20}^\xi \right]^T \qquad (3)$$

where

$$\bar{x}_i^\xi = \frac{1}{n_\xi} \sum_{k=1}^{n_\xi} x_{k,i}^\xi, \quad i = 1, 2, \ldots, 20 \qquad (4)$$

Since Chou's contribution, many methods based on residue composition have been proposed. The n-order component coupled method is one of them. When n = 0, this algorithm reduces to the amino acid composition (AAC) method. When n = 1, the corresponding indices can be expressed as a 20 × 20 conditional probability matrix [20]. If n > 1, the n-order coupled components can be expressed as a multi-dimensional conditional probability matrix. In these residue-composition-based methods, the statistical sample must be large enough. However, the existing statistical approach calculates the probabilities of the 20 amino acids, or the conditional probabilities, from a single sequence. As a result, the conditional probabilities, especially the high-order coupled components, cannot be calculated accurately, since a single protein sequence is not long enough. For the n = 0 component, the influence of neighboring amino acids is not considered. As n increases, the long-term interaction between residues at different positions in the same sequence can be reflected, but at the cost of computational complexity.
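As an illustration of the AAC index and the standard vector described above, the following is a minimal Python sketch. The function names `aac_vector` and `standard_vector` and the alphabetical ordering of the 20 one-letter residue codes are assumptions made for the example; they are not taken from the paper.

```python
from collections import Counter

# The 20 standard residues as one-letter codes; this ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac_vector(sequence):
    """Normalized occurrence frequencies of the 20 residues in one sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [counts[a] / total for a in AMINO_ACIDS]

def standard_vector(sequences):
    """Unweighted mean of the per-sequence AAC vectors of one class."""
    vectors = [aac_vector(s) for s in sequences]
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(AMINO_ACIDS))]
```

Each sequence contributes equally to the standard vector here, regardless of its length, which is the property the per-sequence strategy relies on.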
To solve the problems mentioned above, a new method with a new index is proposed in this paper. It can be summarized as follows. First, a new statistical approach is proposed. The component frequency of each amino acid is calculated over each entire class (Equation 5), rather than as the average of the occurrence frequencies in the individual proteins of the class (Equation 4):

$$\bar{x}_i^\xi = \frac{\sum_{k=1}^{n_\xi} l_k^\xi \, x_{k,i}^\xi}{\sum_{k=1}^{n_\xi} l_k^\xi} \qquad (5)$$

where $l_k^\xi$ is the length of the $k$th sequence in the subset $S_\xi$, and the other parameters are defined as in Equation 4.
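The difference between the class-level statistic of Equation 5 and the per-sequence average of Equation 4 can be seen on a toy example. The sketch below is only illustrative: the function names are hypothetical, and it assumes that the class-level frequency is obtained by pooling residue counts over all sequences of the class.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed residue ordering

def class_frequency(sequences):
    """Residue frequencies pooled over every sequence of one class."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]

def mean_of_sequence_frequencies(sequences):
    """Average of per-sequence frequencies; every sequence has equal weight."""
    result = [0.0] * len(AMINO_ACIDS)
    for seq in sequences:
        counts = Counter(seq.upper())
        total = sum(counts[a] for a in AMINO_ACIDS)
        for i, a in enumerate(AMINO_ACIDS):
            result[i] += counts[a] / total / len(sequences)
    return result

toy_class = ["AAAA", "AC"]
pooled = class_frequency(toy_class)                  # frequency of A is 5/6
averaged = mean_of_sequence_frequencies(toy_class)   # frequency of A is 0.75
```

Pooling weights each sequence by its length, so long sequences dominate the estimate; whether that is desirable depends on the dataset.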
Second, we develop a method to improve the component coupled algorithm. Conditional probabilities between amino acids at different correlation lengths are calculated. To simplify the calculation, only a 2-dimensional (20 × 20) matrix is introduced for each possible distance between residues. The conditional probability is written as $P_d(a_i/a_j)$, where the subscript $d$ is the distance between the residues $a_i$ and $a_j$, that is, $d = i - j$. For each $d$, one has a separate 20 × 20 matrix of these conditional probabilities. According to the multiplication rule of probability,

$$P_d(a_i, a_j) = P_d(a_j) \, P_d(a_i/a_j) \qquad (7)$$

In Equation 7, $P_d(a_i, a_j)$ and $P_d(a_j)$ can be easily computed from the protein sequences, so the conditional probability $P_d(a_i/a_j)$ can also be calculated.
When $d + j$ exceeds the length of the sequence, the cyclic boundary condition is used: the residue whose position equals the remainder of $d + j$ divided by the sequence length is taken instead.
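A minimal sketch of how such a distance-d conditional probability matrix could be estimated from the sequences of one class is given below. The name `conditional_matrix` and the choice to skip pairs that involve non-standard residues are assumptions of this example; the cyclic boundary is implemented with the modulo operator.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed residue ordering
INDEX = {a: i for i, a in enumerate(AMINO_ACIDS)}

def conditional_matrix(sequences, d):
    """Return P with P[j][i] = P_d(a_i / a_j): the probability of finding
    residue a_i at distance d after residue a_j, over one class."""
    pair_counts = [[0] * 20 for _ in range(20)]
    for seq in sequences:
        length = len(seq)
        for pos in range(length):
            first = seq[pos]
            second = seq[(pos + d) % length]  # cyclic boundary condition
            if first in INDEX and second in INDEX:
                pair_counts[INDEX[first]][INDEX[second]] += 1
    matrix = []
    for row in pair_counts:
        row_total = sum(row)  # number of counted pairs that start with a_j
        matrix.append([c / row_total if row_total else 0.0 for c in row])
    return matrix
```

Dividing each pair count by its row total is the same as dividing the joint probability by the marginal probability, as in Equation 7. Rows for residues that never occur are left at zero.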
The third step is to turn the conditional probability matrix into an index for prediction. The information content of the conditional probability is used as the quantification index. For each residue $a_j$ in an undetermined sequence $k$, the index of the $d$-interval is calculated as

$$I_d(a_j) = -\ln P_d(a_{j+d}/a_j), \quad j = 1, 2, \ldots, l \qquad (8)$$

where $\ln$ is the natural logarithm and $l$ is the length of sequence $k$. For all the residues in sequence $k$, the total information content is obtained by

$$I_d = \sum_{j=1}^{l} I_d(a_j) \qquad (9)$$

To account for multi-residue effects on an amino acid, the information contents at different distances are accumulated to form the whole information content, $I_w$, i.e.,

$$I_w = \sum_{d} I_d \qquad (10)$$

From Equation 8, the larger the conditional probability is, the smaller the information content is. Hence, the class that gives the minimum whole information content is taken as the predicted class in our method:

$$PD = \arg\min_{\xi} I_w^\xi \qquad (11)$$

where $PD$ is the predicted result and $I_w^\xi$ denotes $I_w$ computed with the conditional probabilities of class $S_\xi$.
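The prediction rule can be sketched end to end as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function names are hypothetical, the default distances (2, 3, 4) are simply the range reported as working well in the Results, and the `pseudocount` is an addition of this example that keeps every probability above zero so that the logarithm stays finite.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed residue ordering
INDEX = {a: i for i, a in enumerate(AMINO_ACIDS)}

def train_matrix(sequences, d, pseudocount=1.0):
    """Estimate P_d(a_i / a_j) for one class. The pseudocount (an assumption
    of this sketch) must be positive to avoid zero probabilities."""
    counts = [[pseudocount] * 20 for _ in range(20)]
    for seq in sequences:
        length = len(seq)
        for pos in range(length):
            a_j, a_i = seq[pos], seq[(pos + d) % length]  # cyclic boundary
            if a_j in INDEX and a_i in INDEX:
                counts[INDEX[a_j]][INDEX[a_i]] += 1
    return [[c / sum(row) for c in row] for row in counts]

def train(classes, distances=(2, 3, 4)):
    """classes maps a class name to its training sequences."""
    return {name: {d: train_matrix(seqs, d) for d in distances}
            for name, seqs in classes.items()}

def whole_information(seq, matrices):
    """Sum over distances d and positions j of -ln P_d(a_{j+d} / a_j)."""
    length = len(seq)
    total = 0.0
    for d, matrix in matrices.items():
        for pos in range(length):
            a_j, a_i = seq[pos], seq[(pos + d) % length]
            if a_j in INDEX and a_i in INDEX:
                total -= math.log(matrix[INDEX[a_j]][INDEX[a_i]])
    return total

def predict(seq, model):
    """Return the class with the minimum whole information content."""
    return min(model, key=lambda name: whole_information(seq, model[name]))
```

A toy usage: training on `{"alt": ["ACACACACAC"], "rep": ["AAAAACCCCC"]}` with `distances=(1,)` assigns the query `"ACACAC"` to `"alt"`, because its residue pairs are the ones that class makes most probable.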

Dataset
To perform our statistical studies comprehensively, the latest version of the SCOP database [21] (version 1.71, updated on 24 January 2007) was used. Sequences from four classes with a similarity of less than 30% were selected: 1267 in the α class, 1424 in the β class, 1682 in the α/β class and 1551 in the α+β class (the reason for using this dataset is explained in detail in the Discussion). After removing ambiguous sequences that contain the letter x, the totals are 1250, 1375, 1565 and 1524, respectively (see additional files 1 and 2). Following the cross-validation principle, the whole set of sequences was randomly divided into two subsets. The training and prediction sets were non-homologous, and the number of sequences is large enough for training and testing (about 20 times the size of the dataset used in [9]).

Results
To test the feasibility and applicability of our index and method, the cross-validation method [22] was used in our study. All sequences of the 4 classes were randomly divided into 2 datasets, i.e., the training and the prediction datasets. The training dataset contains 2856 sequences, and the prediction dataset contains 2858 sequences.
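The data preparation can be sketched as below. The paper only states that the sequences were divided randomly; splitting each class in half separately is an assumption of this example, although with the per-class totals given in the Dataset section (1250, 1375, 1565 and 1524) it does yield 2856 training and 2858 prediction sequences. The function names and the fixed seed are hypothetical.

```python
import random

def clean(sequences):
    """Drop sequences that contain the ambiguous residue letter x."""
    return [s for s in sequences if "x" not in s.lower()]

def stratified_split(classes, seed=0):
    """Randomly split every class in half.

    classes maps a class name to its list of sequences. When a class has an
    odd number of sequences, the extra one goes to the prediction set.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    training, prediction = {}, {}
    for name, seqs in classes.items():
        shuffled = clean(seqs)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        training[name] = shuffled[:half]
        prediction[name] = shuffled[half:]
    return training, prediction
```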
Two traditional indices, the AAC and ACF indices mentioned above, were used for comparison with the results from our method. Three methods, namely the Hamming distance algorithm (DH), the Euclidean distance algorithm (DE) and the component coupled algorithm (CC), were used to assess the indices.
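The two distance algorithms can be sketched as nearest-standard-vector classifiers. The paper does not spell out the distance definitions; this example assumes that the Hamming distance between two frequency vectors means the sum of absolute component differences, which is how the term is commonly used in the AAC literature. The function names are hypothetical.

```python
import math

def hamming_distance(x, y):
    """Sum of absolute component differences (city-block form, assumed)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean_distance(x, y):
    """Square root of the sum of squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest_class(vector, standard_vectors, distance):
    """Return the class whose standard vector is closest to the query.

    standard_vectors maps a class name to its standard vector.
    """
    return min(standard_vectors,
               key=lambda name: distance(vector, standard_vectors[name]))
```

The two distances can disagree: a query that differs from one class moderately in many components and from another strongly in one component may be assigned differently by DH and DE.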
For the AAC index, the results of the DH, DE and CC methods are shown in Tables 1 and 2.
For the auto-correlation based method, we found that hydrophobicity-based indices give higher accuracy than other physicochemical properties. We used the Kyte and Doolittle [23] hydrophobicity values; the results, together with the autocorrelation function lengths used, are listed in Tables 3 and 4.
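An autocorrelation index built on the Kyte and Doolittle hydropathy scale can be sketched as follows. The exact autocorrelation formula used for the comparison is not given in the text, so the unnormalized lagged product average below is an assumption, as are the function name and the choice to ignore non-standard residues.

```python
# Kyte-Doolittle hydropathy values for the 20 standard residues.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def autocorrelation(sequence, max_lag):
    """Return [r_1, ..., r_max_lag], where r_n is the mean of h_i * h_(i+n)
    over all positions i for which both residues lie inside the sequence."""
    values = [KYTE_DOOLITTLE[a] for a in sequence.upper() if a in KYTE_DOOLITTLE]
    length = len(values)
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    return [
        sum(values[i] * values[i + n] for i in range(length - n)) / (length - n)
        for n in range(1, max_lag + 1)
    ]
```

The resulting vector of length `max_lag` can then be fed to the same distance or component coupled classifiers as the AAC vector.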
In our experiments, different long-term correlation distances were tested, and distances d between 2 and 4 gave better accuracy. The results for the training dataset and the prediction dataset are shown in Table 5.
The training and prediction results obtained with the three different indices are compared in Figures 1 and 2. We found that our index has the best accuracy in protein structural class prediction. With the same index, the DE method always obtained better accuracy than the DH method.

Discussion
We first discuss the dataset, since it is the most important factor in evaluating different indices and methods. Previous studies usually used a common dataset that includes only several hundred sequences [10]. Given its small scale, such a dataset is not reliable enough. Another critical issue is high sequence similarity. Take the 277 dataset [10] as an example. It contains 277 protein domains extracted from the SCOP database.
Remarkable pair-wise similarity can be found in each class after multiple sequence alignment. For instance, in the alpha class we found several groups of near-identical sequences; the biggest one contains about 20 sequences (see additional files 1 and 2). After pair-wise alignment among these 20 sequences, we found that the sequence similarity was over 85%, indicating that these sequences are nearly identical to each other. The same was found for the other 3 classes. Such high sequence similarity exists in both the training and test datasets, which clearly violates the principle of cross-validation. Therefore, we abandoned such datasets in order to obtain reliable results.
To emphasize the importance of the selected dataset, we compared the three methods above on two different datasets. The amino acid composition index was used in this comparison. Re-substitution and cross-validation tests were designed and implemented for the evaluation.
For the dataset including 138 sequences [10], the accuracies of the re-substitution test and the jack-knife test are shown in Tables 6 and 7 and plotted in Figure 3.
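The difference between the re-substitution test and the jack-knife test can be made concrete with a small sketch. It uses a nearest-mean classifier with squared Euclidean distance as a stand-in for the DE method; the function names and the data layout (a list of `(vector, label)` pairs) are assumptions of this example.

```python
def class_means(samples):
    """samples is a list of (vector, label); return label -> mean vector."""
    sums, counts = {}, {}
    for vec, label in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def classify(vec, means):
    """Assign the label of the closest mean vector."""
    return min(means,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(vec, means[lab])))

def resubstitution_accuracy(samples):
    """Train on all samples, then test on the same samples."""
    means = class_means(samples)
    return sum(classify(v, means) == lab for v, lab in samples) / len(samples)

def jackknife_accuracy(samples):
    """Leave each sample out in turn, train on the rest, test on that sample."""
    hits = 0
    for i, (vec, lab) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        hits += classify(vec, class_means(rest)) == lab
    return hits / len(samples)
```

In the re-substitution test every query has already contributed to the class means, whereas in the jack-knife test it has not. When a dataset contains near-duplicates, however, a close copy of the left-out sequence remains in the training part, which is one reason a homologous dataset can give optimistic jack-knife results.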
Our dataset is summarized in Tables 8 and 9, with a total of 5714 sequences: 2856 in the training dataset and 2858 in the testing dataset (see Figure 4).
From Tables 6 and 7, one can see that the prediction accuracy is very high for all three methods. This is because the 138 dataset, just like the 277 dataset, is homologous, which means that some sequences are almost the same. Interestingly, the accuracies of DH and DE are relatively higher in the cross-validation test than in the re-substitution test. This is mainly because these methods are insensitive to the dataset, which means that these algorithms extrapolate well. Compared with CC and SVM, the total accuracy of our method is much better. However, as with many advanced methods, the accuracies of the re-substitution and cross-validation tests differ significantly.
Traditional methods are usually based on simple criteria, while newly developed algorithms have more complicated rules. More prior probability information makes current methods more accurate. However, this information depends strongly on the dataset. Fortunately, with the increasing number of solved sequences, this problem can be handled well.
Generally speaking, with the three methods above, the accuracy on the 5714 dataset is much lower than that on the 138 dataset. The 138 dataset is unreliable due to its high sequence similarity. However, in the cross-validation test, the accuracy of DH and DE on the 5714 dataset is much higher than on the 138 dataset. This illustrates that increasing the dataset scale can remarkably improve the extrapolation of the algorithms.
From Tables 1, 2, 3 and 4, we found that the accuracy is clearly lower than the results reported before. This is mainly because the dataset we used is larger than and very different from the ones used before. Therefore, the traditional methods need to be improved as the sample size increases. Tables 1, 2, 3 and 4 also show that the difference in accuracy between the training and the prediction datasets is quite small, so these methods generalize well. This is because traditional methods involve very few constraints and technical manipulations, which avoids fluctuation between the training and test results.
With our method, the accuracy is 6% to 16% higher than with the traditional methods. This is because long-term correlation is introduced and conditional probability is used instead of physicochemical indices, which avoids errors introduced by other parameters. In our tests, the accuracy is high when the distance d is between 2 and 4. This is in good accordance with the periodicity of protein secondary structures. As is well known, most alpha helices have 3.6 residues per turn, so a hydrogen bond links a residue with the residue 3 or 4 positions behind it. Most beta strands have a period of 2 residues, which reflects a strong interaction between two residues at a 2-position interval.
The advantages of our method can be summarized in three aspects:
• In our method, the long-term correlation factor is considered without any other physicochemical parameters.
• The accuracy is significantly improved, by about 6-16% compared with the two traditional indices.
• The merits of both traditional methods are inherited, that is, the residue composition frequency and the amino acid arrangement.
However, there still exist some problems, which motivate our future study.
• In our method, we must calculate the correlation between residues that are d positions apart. When the residue position is near the end of a sequence, the residue d sites behind may exceed the length of the sequence. In such a case, the boundary treatment is crucial to the final result. For convenience, the cyclic boundary condition is used here. However, this approach has no biological significance and is not quite reliable. To solve this problem, we plan to test different types of extended boundary conditions.
• The presented method only calculates the correlation between a given residue and the residue d positions behind it. This is a "one-sided" statistic, and it does not extract enough information. To solve this problem, the correlation between the target residue and the residues at different sites both before and after it needs to be calculated.

Conclusion
In this paper, a new method based on new indices is proposed. A reliable dataset with a large number of entries and low sequence similarity is used to train and test our algorithm. The results show that our method has higher accuracy than the traditional methods. The application of conditional probability and information content shows that protein structural class prediction can be largely improved by combining information theory with probability theory.

Figure: Accuracy of 3 indices for the prediction dataset.

Table footnotes: (1) Note that the total number is slightly different from that in reference 10, since the PDB database has been updated in recent years. (2) CC means the component coupled method.

Figure 3: Accuracy of the amino acid composition index using the 138 dataset. DH, DE and CC mean the Hamming distance method, the Euclidean distance method and the component coupled method, respectively. RS means the re-substitution test, and CV means the cross-validation test.

Figure 4: Accuracy of the amino acid composition index using the 5714 dataset. DH, DE and CC mean the Hamming distance method, the Euclidean distance method and the component coupled method, respectively. RS means the re-substitution test, and CV means the cross-validation test.