Application of amino acid occurrence for discriminating different folding types of globular proteins

Background Predicting the three-dimensional structure of a protein from its amino acid sequence is a long-standing goal in computational/molecular biology. The discrimination of different structural classes and folding types are intermediate steps in protein structure prediction. Results In this work, we have proposed a method based on linear discriminant analysis (LDA) for discriminating 30 different folding types of globular proteins using amino acid occurrence. Our method was tested with a non-redundant set of 1612 proteins and it discriminated them with the accuracy of 38%, which is comparable to or better than other methods in the literature. A web server has been developed for discriminating the folding type of a query protein from its amino acid sequence and it is available at http://granular.com/PROLDA/. Conclusion Amino acid occurrence has been successfully used to discriminate different folding types of globular proteins. The discrimination accuracy obtained with amino acid occurrence is better than that obtained with amino acid composition and/or amino acid properties. In addition, the method is very fast to obtain the results.


Background
Deciphering the native conformation of a protein from its amino acid sequence, known as, protein folding problem is a challenging task. The recognition of proteins belonging to same fold/structural class is an intermediate step to protein structure prediction. For the past few decades, several methods have been proposed for predicting protein structural classes. These methods include discriminant analysis [1], correlation coefficient [2], hydrophobicity profiles [3], amino acid index [4], Bayes decision rule [5], amino acid distributions [6], functional domain occurrences [7], supervised fuzzy clustering approach [8] and amino acid principal component analysis [9]. These methods discriminated protein structural classes with the sensitivity of 70-100% and it mainly depends on the data set. Wang and Yuan [5] developed a data set of 674 globular protein domains belonging to four different structural classes and reported that methods claiming 100% sensitivity for structural class prediction could predict only with the sensitivity of 60% with this data set.
On the other hand, alignment profiles have been widely used for recognizing protein folds [10,11]. Recently, Cheng and Baldi [12] proposed a machine learning algo-rithm for fold recognition using secondary structure, solvent accessibility, contact map and β-strand pairing, which showed a pair wise sensitivity of 27%. On the other hand, amino acid properties have been used for discriminating membrane proteins [13], identification of membrane spanning regions [14], prediction of protein structural classes [15], protein folding rates [16], protein stability [17] etc. Towards this direction, Ding and Dubchak [18] proposed a method based on neural networks and support vector machines for fold recognition using amino acid composition and five other properties, and reported a cross-validated sensitivity of 45%. Further, Ofran and Margalit [19] showed the existence of significant similarity in amino acid composition between proteins of the same fold. In this work, we have used amino acid occurrence (not composition) for discriminating 30 different folding types of globular proteins. We have developed a method based on linear discriminant analysis (LDA), which discriminated a set of 1612 proteins with an accuracy of 38%, which is comparable to other methods in the literature, in spite of the simplicity of the method and large dataset.

Role of re-weighting for fold discrimination
We have computed the occurrence of all the 20 amino acid residues in each protein, which represents the elements of 20 dimensional vectors in each protein. We have applied LDA to these vectors for discrimination. We have employed two kinds of LDA, i.e., with and without reweighting. In LDA with re-weighting, i.e. W k = 1 in eq. (1), all folds equally contribute to maximize the performance of discrimination irrespective of the number of proteins in each fold; i.e., if one fold has hundreds of proteins and another has only few proteins, LDA is optimized to achieve the highest performance equally in all folds. This re-weighting is important especially when the number of proteins included in each fold has large variations. On the other hand, LDA without re-weighting, i.e. W k = N k in eq. (1), tends to achieve the maximum performance for the whole dataset.
We have used the measures, accuracy, sensitivity, precision and F1 for examining the performance of the method. In general, accuracy has the tendency to show high values without re-weighting since it is computed with all data. Sensitivity tends to increase by re-weighting, giving equal weight to each fold. In contrast, precision has the tendency to decrease by re-weighting, since re-weighting increase FP for folds with less number of proteins. On the other hand, F1 is independent of re-weighting as it is harmonic mean of sensitivity and precision.
In Table 1, we presented the discrimination results obtained with different measures and two kinds of LDA (with and without re-weighting). As expected, re-weighting significantly changed all the performances other than F1. Re-weighting increased the sensitivity whereas an opposite trend was observed for precision and accuracy. This is due to the divergence in the number of proteins in each fold (min. 25, max. 173, mean 54, see Table 2). F1 does not change significantly by re-weighting.
Remarkably, we achieved the accuracy of 38% (without re-weighting), which is the best performance to our knowledge, for large number of folds (30) and proteins (1612). Further, the method is extremely simple, which indicates that the amino acid occurrence of proteins carry sufficient information to discriminate protein folds.

Discrimination of proteins belonging to different folding types
We have examined the ability of the present method for predicting proteins belonging to 30 major folds. In Table  2, we have shown the performances of discriminating 30 different folds. We observed that the folds with less number of proteins have the sensitivity of less than 10% without re-weighting. For example, SAM domain like fold has the sensitivity of 8%, which has only 26 proteins. Similar tendency is also observed for the folds b.2, b.34, c.3, c.47, c.55, d. 15 and d.17. The sensitivity of these folds increased significantly with re-weighting. On the other hand, many folds with less than 30 proteins have the sensitivity of more than 20% without re-weighting (e.g., a.3, a.24, a.39 etc.). As there are 30 folds, the expected sensitivity is only 3.3% if classification is supposed to be random. In Table 2, we have also shown the ratio between the number of proteins in each fold and the total number of proteins, which ranges from 2 to 11%. Hence the sensitivity of 20% obtained for several folds is significantly higher than that of random for fold discrimination. Interestingly, most of the folds that were discriminated with more than The comparison between experimental versus predicted folds is shown in Fig. 1. In this figure, dark block indicates the presence of many proteins. The data are normalized in such a way that the total percentage of true fold is 100%. Fig. 1a showed that mainly the folds with more number of proteins are misclassified without re-weighting (e.g., a.4, b.1, c.1 and d.58). The trend has been changed after reweighting: the misclassified proteins are observed to be within the same structural class. Especially, in α + β class, the block diagonal region is distributed almost uniformly, which is partially caused by re-weighting. Since each fold is equally weighted, α + β class is less weighted than other classes. This causes inter-class misclassification between α + β and other classes, because α + β class has only three folds. However, two folds in small structural class can be discriminated with high accuracy/sensitivity/precision/F1 and α + β folds are difficult to discriminate using our method.

Comparison among different re-weighting procedures
The results presented in Tables 1 and 2 showed that the sensitivity of discriminating protein folds differs significantly between different methods (with and without reweighting). Hence, it would be difficult to choose the best method for fold recognition. However, it may be selected based on the interest of the users, whether the prediction is optimized for each fold or over all dataset.
Usually, training and test sets of data are obtained from sequence and structure databases and are culled with sequence identity. However, these data sets do not always reflect proper representatives of all proteins in different folds, e.g., protein population in each fold. Further, the proteins available in databases such as, PDB are biased with the proteins that can be solved experimentally, which may be different from the proportion of real proteins. Hence, considering these aspects would help to develop "good" methods for protein fold recognition in future.
In essence, based on the methods and data sets used in the present work, we suggest that the performance with reweighting is better than that without re-weighting.

Influence of amino acid occurrence in recognizing protein folds
The importance of amino acid occurrence is illustrated with Figure 2(a). In this figure we show the occurrence of the 20 types of amino acid residues in DNA/RNA binding 3-helical bundle (a.4) and Immunoglobulin-like β-sandwich (b.1). The average number of amino acid residues in these folds are 88 and 110, respectively. We observed that the residues Gly, Pro, Ser, Thr and Val are dominant in the fold b.1 whereas an opposite trend was observed for Leu and Arg. In Figure 2(b), we have shown the distribution of residues in "amino acid occurrence" space. It is clearly seen that the two folds are more or less separated in this space. We observed similar results about the variation of amino acid occurrences among different folds in our data set.  In addition, we have tested the performance of the method using amino acid composition (i.e., amino acid occurrence/total number of residues) in each protein. We noticed that the accuracy without re-weighting decreased to 33% indicating the importance of amino acid occurrence (un-normalized composition) in each fold (Table   1). Similar tendency is also observed for discriminating βbarrel membrane proteins [16]. Hence, we suggest that the amino acid occurrence is better than composition for discriminating protein folds. In fact, the normalization of amino acid composition produced the problem of co-linearity, i.e., diversity of vectors is not sufficient compared with the number of proteins. The reason for the dependency of F1 upon different types of LDA (with or without re-weighting) is that four folds have no positive proteins without re-weighting. On the other hand, amino acid occurrence has only two folds without any positive proteins (without re-weighting) as seen in Table 2.

Probability measure of discrimination
In order to have the feasibility of combining the results of our method with other methods we provided the probability of being a protein in a specific fold along with the predicted folding type. In Figure 3, we have shown the probability for fold a.4 (DNA/RNA binding 3-helical bun-dle). Clearly, the fold a.4 has the highest average probability. However, some other folds, (e.g., a.60, d.15, d.17 and d.58) have relatively higher probabilities. This may result in wrong discrimination, which may be fixed by combining the results with other methods.

Comparison with other methods
We have compared the performance of our method with other related works in the literature. Ding and Dubchak [18] introduced a combined method for predicting the folding type of a protein. They have used six parameters, amino acid composition, secondary structure, hydrophobicity, van der Waals volume, polarity and polarizability as attributes, and neural networks and support vector machines for recognition. These features have been combined with the number of votes in each method. They reported the sensitivity of 56% in a test set of 384 proteins and 10-fold cross validation sensitivity of 45% in a training set of 311 proteins from 27 folding types. We have used the same dataset of 311 proteins and assessed the performance of our method. We observed that our method could predict with the leave-one-out cross validation accuracy of 42% (with LDA without re-weighting), which is close to that (45%) reported in Ding and Dubchak [18].  In addition, we have selected the proteins from the folds that are common in both the studies and tested the performance of our method (trained with our dataset of 1612 proteins) in predicting the folding types of the proteins used in Ding and Dubchak [18]. The results are presented in Table 3. Interestingly, our method with re-weighting could correctly identify the folding types with F1 value of more than 30% in 11 among the 19 considered folds. The performances are similar to or better than that reported with the dataset of 1612 proteins. Although our method is optimized with different datasets it has the power to predict the folding type of independent dataset of proteins with similar sensitivity.

Amino acid occurrence
Further, there are several advantages in our method: (i) only one feature, amino acid occurrence is sufficient for prediction rather than six features. The comparison of results obtained with only one feature showed that the performance of our method (42%) is significantly better than that of Ding and Dubchak [18] reported with amino acid composition (20-49%), (ii) voting procedure is not necessary; our method can be directly used for multi-fold classifications, (iii) our method uses LDA, which requires significantly less computational power compared with SVM. In SVM one has to diagonalize the matrix with the size of (protein number) × (protein number); on the other hand, LDA requires only diagonalization of 20 (the number of kinds of amino acid residues) × 20 matrix independent of number of proteins and (iv) although they have reported the dependency of fold specific sensitivities upon number of proteins in each fold, it is difficult to compensate this effect without modifying the complicated voting systems; our method has freedom to compensate it as discussed in the previous sections.
Probability measure of discrimination Figure 3 Probability measure of discrimination. Rows : 103 proteins in fold (a.4). Columns : 30 folds. From left to right, the order is ID in Table 2. The darkest square corresponds to probability 0.5, and the lightest is zero.
Recently, Shen and Chou [20] reported better sensitivity for the same data set of Ding and Dubchak [18]. However, the results are biased with training set of data. We have evaluated the sensitivity of identifying proteins belonging to the folds, four helical up and down bundle (a.24) and EF hand-like (a.39) and we observed that the sensitivity is 30.5% and 24%, respectively. Our predicted sensitivities (38% and 44%, with re-weighting, see table 2) are better than that of Shen and Chou [20].

Possible reasons for obtaining good performance with amino acid occurrence
We have analyzed the possible reasons for obtaining good performance with amino acid occurrence. In Table 4 The usage of five features, i.e., predicted secondary structure, hydrophobicity, normalized van der Waals volume, polarity, polarizability [18] along with amino acid composition yielded the accuracy of 45% using sophisticated and time consuming methods. On the other hand, our simple method employing amino acid occurrence and five features has also showed almost the same value (44%).
The in-depth analysis of the results presented in Table 4 revealed many interesting features. For example, amino acid composition alone showed the accuracy of 35%, which is 7% less than that obtained with occurrence  tion and five features showed the accuracy of 39%, which is similar (38%) to that obtained with composition and length. Hence, length of the protein has an important role as that of five features for discriminating protein folds. This analysis demonstrates the importance of amino acid length and obtaining good performance with amino acid occurrence.
As an individual feature amino acid occurrence showed the best performance among all features, including secondary structure. The combination of amino acid occurrence with other features did not increase the sensitivity and the increase of other parameters is only marginal. This result reveals that the amino acid occurrence contains most of the information that are reflected in other physical features.
Generally, any physical feature can be expressed by amino acid occurrence. Hence, linear combination of amino acid occurrence may express many of physical properties of proteins. In order to verify this concept, we have computed the correlation coefficients between 49 amino acid properties [21][22][23] and the first discriminate function. Each property consists of 20 dimensional vector, like where is the kth physical property of ith amino acid.
Since discriminant function is also 20 dimensional vector and each component of which describes contribution from each amino acid, one can compute correlation coefficient between them.
As can be seen in Table 5, 23 out of 49 properties have high correlation coefficients and less than 5% q-values (i.e., FDR corrected p-values). This analysis shows that linear discriminant function can express many of physical properties, at least, partly. Hence, even if we do not consider physical properties directly, the consideration of amino acid occurrence could discriminate folds well.

Fold recognition on the web
We have developed a web server for discriminating protein folds from amino acid sequence [24]. It takes the amino acid sequence as input and displays the folding type in the output along with probability. Further, the server has the feasibility of selecting the method, with and without re-weighting, and the display options to show the probability details for each fold.

Advantages and limitations of the method
The main advantage of the present method is the discrimination of 30 different folding types of globular proteins with high accuracy/sensitivity/precision/F1. Further it will provide the probability of being a protein in a specific fold. The discrimination results along with probability may be helpful to select templates to build models to new protein. Further, it can be combined with other methods for better performance. The limitation of the method is the usage of only 30 specific folds for discrimination.

Conclusion
In this paper, we have proposed a simple method for discriminating 30 folding types of globular proteins. Interestingly, the simplest method is the best method for the truly complicated problems. Although complicated methods have several possibilities for tuning they generate over fitting to the data set. Further, the method proposed in this work is better than or comparable to other complicated methods, such as, neural networks and support vector machines proposed in the literature for discriminating folding types. In addition, our method has several advantages including the less computational time and classifying the folds at a single run rather than pair-wise comparisons. We have developed a web server [24], which takes the amino acid sequence as the input and displays the folding type in the output. The main limitation of the method is that its application is restricted to 30 folds considered in this work. However, the approach can be extended to other folds when significant representatives are available.

Dataset
We have used a dataset of 1612 globular proteins belonging to 30 major folding types obtained from SCOP database [25] for recognizing protein folds. This dataset has been constructed with the following criteria: (i) there should be at least 25 proteins in each fold and (ii) the sequence identity between any two proteins is not more than 25%. The amino acid sequences of all the proteins are available at [24].

Linear discriminant analysis
We have employed LDA in this work and a brief description is given below. First, we compute the amino acid occurrence of each protein, n i ≡ (n i1 , n i2 , ..., n ij , ..., n i20 ), where i is the number of protein; n i1 , n i2 etc. represents the number of amino acids of each type (Ala, Arg etc.) in ith protein. Then LDA tries to maximize S B (S T ) is the summation of squared distance between the center of mass of all proteins and that within fold (coordinate of each protein) along axis z, i.e., where K is the number of folds, N k is the number of proteins belonging to kth fold and is the center of mass along the axis z, and z k is that within kth fold, i.e., where i' is the i' th protein within the kth fold. z i is the linear combination of n ij with the set of coefficients a ≡ (a 0 ,  a 1 , ..., a j , ... a 20 ), Hence, LDA tries to find a which maximizes η 2 . In total, we can get 20 kinds of z i s which are orthogonal to each other, and discrimination is done based on these z i s. In addition one can introduce weights W k for each group using the equation: The discrimination is done with Bayesian scheme employing Gaussian kernel. Proteins in each fold are assumed to distribute in amino acid occurrence space obeying Gaussian distribution whose center is the mean occurrence within each fold and variance is computed along the 20 kinds of z coordinates. Then the fold with maximum probability is used assign the folding type of a protein.
The probability of each fold may also be used to find other probable folds for a specific sequence.
We have used lda module in MASS library of R [26] and the computational time is less than few seconds using Intel Pentium M processor (1.10 GHz) and 1 GB memory.

Scoring
In this paper, we employed four performances to validate the results. We computed TP k which is the number of proteins being correctly discriminated to be in kth category (e.g., fold). We have also computed FP k (FN k ), which is the number of proteins which are incorrectly discriminated as being (not being) in kth category. Then we defined sensitivity (or recall), Precision, and F1 as For validating whole data set we have taken the category average, where (Performance) is sensitivity/precision/F1. For some cases denominator of precision and/or F1 will be zero and we excluded these categories to compute the average.
The accuracy is defined as, N is the total number of proteins.