Research article | Open | Published:

A strategy to select suitable physicochemical attributes of amino acids for protein fold recognition

Abstract

Background

Assigning a protein to one of its folds is a transitional step towards discovering its three-dimensional structure, which is a challenging task in biomolecular (biological) science. Present research focuses on: 1) the development of classifiers, and 2) the development of feature extraction techniques based on syntactic and/or physicochemical properties.

Results

Apart from the above two main categories of research, we have shown that the selection of physicochemical attributes of the amino acids is an important step in protein fold recognition and has not been explored adequately. We have presented a multi-dimensional successive feature selection (MD-SFS) approach to systematically select attributes. The proposed method is applied on protein sequence data and an improvement of around 24% in fold recognition has been noted when selecting attributes appropriately.

Conclusion

The MD-SFS has been applied successfully in selecting physicochemical attributes of the amino acids. The selected attributes show improved protein fold recognition performance.

Background

Discovering the three dimensional structure of a protein from its amino acid sequence via computational means is a challenging task and open for research in biological science and bioinformatics. Deciphering protein structure elucidates protein functions. This has a profound impact on understanding the heterogeneity of proteins, protein-protein interactions and protein-peptide interactions. This further helps in drug design. A usual way to predict the structure of a protein is to first acquire proteins with known structures (e.g. by crystallography techniques) and then from their sequences, the prediction process can be conducted by developing recognition techniques. Thereafter, the developed techniques can be used to classify unknown protein sequences into one of its classes or folds. The length of a protein sequence (i.e., the number of amino acids in it) is usually different from the length of another protein sequence. However, two proteins with different lengths and low sequential similarities can be categorized to the same fold. The identification of protein folds from a protein sequence would bring us one step closer to the recognition of protein structures. A wide range of techniques have been developed over the past two decades to recognize protein folds. Despite numerous contributions and significant enhancements achieved [1, 2], the protein fold recognition problem is yet to be completely solved.

The focus in protein fold recognition can be broadly classified into two categories: 1) the development of classifiers to improve fold recognition, and 2) the development of feature extraction techniques using alphabetical sequence (syntactical-based) and/or using physicochemical properties of the amino acids (attribute-based or physicochemical-based). For the former case, several classifiers have been developed or used including linear discriminant analysis [3], Bayesian classifiers [4], Bayesian decision rule [5], K-Nearest Neighbor [6, 7], Hidden Markov Model [8, 9], Artificial Neural Network [10, 11] and ensemble classifiers [1, 12]-[14]. For the latter case, several feature extraction techniques have been developed including composition, transition and distribution [15], occurrence [16], pairwise frequencies [17], pseudo-amino acid composition [18], bigrams [19], autocorrelation [6, 20, 21] and deriving features by considering more physicochemical properties [22].

Dubchak et al. [15] proposed syntactical and physicochemical-based features for protein fold recognition. They used the five following attributes of amino acids for deriving physicochemical-based features namely, hydrophobicity (H), predicted secondary structure based on normalized frequency of α-helix (X), polarity (P), polarizability (Z) and van der Waals volume (V). The features proposed by Dubchak et al. [15] have been widely used in the field of protein fold recognition [4, 12, 22]-[28]. Apart from the above mentioned 5 attributes used by Dubchak et al. [15], features have also been extracted by incorporating other attributes of the amino acids. Some of the other attributes used are: solvent accessibility [29], flexibility [30], bulkiness [31], first and second order entropy [32], size of the side chain of the amino acids [22]. Several attributes have been picked for feature extraction usually in an arbitrary way for protein fold recognition. Contrary to this, Taguchi and Gromiha [16] argued that features from attributes of amino acids can be ignored due to having insufficient information and only syntactical-based features should be considered. This shows that proper exploration of the amino acid attributes has not been conducted. To this, we posed a question: ‘which of the attributes of the amino acids are to be selected for the protein fold recognition problem?’ The answer to this would open the third category of research apart from 1) the development of classifiers, and 2) the development of feature extraction techniques based on the syntactic and/or physicochemical properties.

In this study, we develop a methodology for selecting the attributes of the amino acids for protein fold recognition in a systematic manner. To do this, a successive feature selection (SFS) technique based on an exhaustive greedy search algorithm can be applied [33, 34]. The SFS technique can find important features from a group of features. However, since several features can be extracted from an attribute (e.g. composition, transition and distribution from hydrophobicity of amino acids) and there can be many attributes, each attribute is represented by a multi-dimensional group of features rather than by a single feature. Therefore, we develop a scheme to identify important attributes by investigating the multi-dimensional features corresponding to the attributes. For brevity we call the proposed technique multi-dimensional SFS (MD-SFS).

We show two schemes of MD-SFS: backward elimination and forward selection. In the backward elimination scheme, the search for the best subset of attributes starts by retaining all the given attributes. At each iteration, the attribute whose removal causes the minimum loss of information for the subset is discarded. This elimination of attributes from a subset is performed until all the attributes are ranked. This scheme is useful for finding attributes of low individual importance that could perform well if selected in an appropriate subset. In the forward selection scheme, the best attribute is selected first, and a subsequent attribute is included in the subset such that the included attribute improves the performance (e.g., in terms of classification) of the subset. This scheme, however, could be biased towards the highest ranking attribute.

Experiments are carried out using Dubchak’s (DD) dataset [25], Taguchi’s (TG) dataset (Taguchi and Gromiha, [16]) and the extended Ding and Dubchak (EDD) dataset [2]. The selection of physicochemical attributes by the MD-SFS technique improves protein fold recognition by around 18% to 24% on all the datasets when 10-fold cross-validation is applied. The MD-SFS technique is illustrated in the next section and its usefulness is demonstrated in the subsequent sections.

Multi-dimensional successive feature selection

The MD-SFS scheme is illustrated in Figures 1 and 2. The backward-elimination procedure of MD-SFS is shown in Figure 1 and the forward-selection procedure is shown in Figure 2. The purpose of MD-SFS is to select the best attribute for protein fold recognition. In the figures, four attributes ($T_a = 4$) are depicted. A feature extraction technique is used to extract d-dimensional features from each attribute. Attributes are represented as Aj (where $j = 1, 2, \ldots, T_a$) and the extracted features of Aj are represented as $f_{1j}, f_{2j}, \ldots, f_{dj}$. In the figures, there are 4 levels in total, including the beginning state. The number of attributes at each level is denoted by $N_A$. The classification accuracy using k-fold cross-validation of a subset of attributes is denoted by $H(\cdot)$ (Figure 2). The highest average classification accuracy using k-fold cross-validation at each level is denoted by $\alpha_l$, where $l = 0, 1, \ldots, T_a - 1$. The output is the ranked attributes.

Figure 1. Multi-dimensional successive feature selection: backward elimination scheme.

Figure 2. Multi-dimensional successive feature selection: forward selection scheme.

MD-SFS: backward elimination

For the backward-elimination case of MD-SFS (Figure 1), a group of features belonging to an attribute is dropped one at a time in each of the successive levels. This gives subsets of attributes containing features. The number of features in a subset at level l is $(T_a - l)d$. A classifier is used to compute the average classification accuracy using the k-fold cross-validation procedure on each of the subsets. The subset of attributes with the highest average classification accuracy progresses to the next level. The size of the subset is reduced by d features as we progress across the levels. This process terminates when all the attributes are ranked. In Figure 1, at level 1, the highest average classification accuracy ($\alpha_1$) is obtained by attribute subset {A1, A2, A4}. It is also possible that more than one subset has the same average classification accuracy. In that case, all the subsets with the highest average classification accuracy progress to the next level. In Figure 1, subset {A1, A2, A4} progresses to level 2 and at this level the subset with the highest average classification accuracy ($\alpha_2$) is {A2, A4}. At level 3, the subset with the highest average classification accuracy ($\alpha_3$) is {A2}. In Figure 1, the ranked attributes are {A2, A4, A1, A3}, where A2 is the top ranked attribute and A3 is the bottom ranked or least important attribute. Furthermore, there are two criteria by which attributes can be selected. For instance, if we want to select the best 3 attributes for the design then we can take {A2, A4, A1} from the ranked attributes. However, a better way would be to find the argument of the maximum of $\alpha_l$, i.e., $r = \arg\max_{l = 0, \ldots, T_a - 1} \alpha_l$. For instance, if r = 2 then this indicates that subset {A2, A4} at level 2 exhibits the maximum accuracy among all the selected subsets at all the levels. Therefore, the attributes of subset {A2, A4} can be selected for the design. We refer to the former criterion of selection as brute-n (where n is the number of attributes to be selected) and to the latter criterion as the maximum accuracy (MA) based criterion.
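The backward-elimination procedure and the two selection criteria can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the `score` callback (standing in for the k-fold cross-validated accuracy of a classifier) are hypothetical, ties are broken by keeping only the first best subset instead of carrying all tied subsets forward, and the toy accuracies are invented purely so that the run reproduces the Figure 1 walk-through.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

Subset = FrozenSet[str]


def md_sfs_backward(
    attributes: Sequence[str], score: Callable[[Subset], float]
) -> Tuple[List[str], List[float], List[Subset]]:
    """Rank attributes by backward elimination.

    `score(subset)` is assumed to return the average k-fold cross-validation
    accuracy obtained with all d-dimensional feature groups of `subset`.
    """
    current = frozenset(attributes)
    subsets = [current]        # best subset at level l = 0, 1, ..., T_a - 1
    alphas = [score(current)]  # alpha_l, the best accuracy at level l
    dropped: List[str] = []    # attributes in order of elimination
    while len(current) > 1:
        # Try dropping each attribute (i.e. its whole group of d features).
        trials = [(score(current - {a}), a) for a in sorted(current)]
        best_accuracy, removed = max(trials, key=lambda t: t[0])
        current = current - {removed}
        dropped.append(removed)
        subsets.append(current)
        alphas.append(best_accuracy)
    # The last survivor is top ranked; the first attribute dropped is last.
    ranked = sorted(current) + dropped[::-1]
    return ranked, alphas, subsets


def select_brute_n(ranked: Sequence[str], n: int) -> List[str]:
    """brute-n criterion: keep the n top-ranked attributes."""
    return list(ranked[:n])


def select_max_accuracy(alphas: Sequence[float], subsets: Sequence[Subset]) -> Subset:
    """MA criterion: keep the subset at level r = argmax_l alpha_l."""
    r = max(range(len(alphas)), key=lambda l: alphas[l])
    return subsets[r]


# Invented accuracies that mimic the Figure 1 walk-through (not real results).
toy: Dict[Subset, float] = {
    frozenset({"A1", "A2", "A3", "A4"}): 0.40,
    frozenset({"A1", "A2", "A4"}): 0.45,
    frozenset({"A2", "A4"}): 0.50,
    frozenset({"A2"}): 0.30,
}
ranked, alphas, subsets = md_sfs_backward(
    ["A1", "A2", "A3", "A4"], lambda s, table=toy: table.get(s, 0.10)
)
print(ranked)
print(select_brute_n(ranked, 3))
print(sorted(select_max_accuracy(alphas, subsets)))
```

With these toy scores the ranking is A2, A4, A1, A3, brute-3 keeps the first three of them, and the MA criterion picks the level-2 subset {A2, A4}. In practice `score` would train and evaluate a classifier on the concatenated features of the subset, which dominates the running time.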

The MD-SFS backward elimination procedure would require approximately between ${}^{T_a+1}C_2$ and $2^{T_a} - 1$ search combinations, where $T_a$ is the total number of attributes and the term ${}^{m}C_n$ is the n-combination of m elements. If $t_s$ denotes the number of attributes in a subset s then this subset has $t_s d$ features. Therefore, the computational complexity of a classifier performing classification using subset s depends on $t_s d$ features.
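To give a feel for these two bounds, the short sketch below evaluates them for a few values of $T_a$. It assumes the lower bound is the binomial coefficient "$T_a + 1$ choose 2" (one subset per level when there are no ties) and the upper bound is $2^{T_a} - 1$ (every non-empty subset, the worst case when ties keep all subsets alive); the function name is hypothetical.

```python
from math import comb
from typing import Tuple


def backward_search_bounds(t_a: int) -> Tuple[int, int]:
    """Lower and upper number of attribute subsets scored for T_a attributes."""
    lower = comb(t_a + 1, 2)  # no ties: 1 + T_a + (T_a - 1) + ... + 2
    upper = 2 ** t_a - 1      # every non-empty subset of the attributes
    return lower, upper


for t_a in (4, 10, 30):
    print(t_a, backward_search_bounds(t_a))
```

For the four attributes of Figure 1 this gives between 10 and 15 subsets; for 30 attributes the range is 465 to 1073741823, which is why ties that keep many subsets alive can become expensive.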

MD-SFS: forward selection

For the forward-selection case of MD-SFS (Figure 2), one attribute with its corresponding d-dimensional features is taken at a time for computing the average classification accuracy using the k-fold cross-validation procedure. The attribute corresponding to the highest average classification accuracy is stored; i.e., $r_1 = \arg\max_{j = 1, \ldots, T_a} H(A_j)$. The selected attribute containing the features goes to the next successive level. In the next level, the attribute that exhibits the highest average classification accuracy in combination with the attribute selected at the previous level, $A_{r_1}$, is retained. This process continues until all the attributes are ranked. The number of features used in computing the classification accuracy at level l is $(l + 1)d$. Further, we can apply the same two criteria (brute-n and MA-based) for obtaining attributes from the ranked set of attributes, as discussed for the MD-SFS backward elimination approach.

The MD-SFS forward selection would require around $T_a(T_a + 1)/2$ search combinations, where $T_a$ is the total number of attributes. A subset s with $t_s$ attributes has $t_s d$ features. The computational complexity of a classifier used to compute the classification accuracy depends on $t_s d$ features.
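A forward-selection counterpart can be sketched in the same spirit. Again this is only an illustration: `md_sfs_forward` and `make_toy_score` are hypothetical names, the scoring rule (a sum of invented per-attribute weights minus a penalty for each extra attribute) is made up so that accuracy peaks before all attributes are added, and ties are broken by taking the first best candidate.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple


def md_sfs_forward(
    attributes: Sequence[str], score: Callable[[FrozenSet[str]], float]
) -> Tuple[List[str], List[float]]:
    """Rank attributes by forward selection; alphas[l] is the best accuracy at level l."""
    selected: List[str] = []
    remaining = list(attributes)
    alphas: List[float] = []
    while remaining:
        # Score the already selected attributes together with each candidate.
        trials = [(score(frozenset(selected + [a])), a) for a in remaining]
        best_accuracy, best_attribute = max(trials, key=lambda t: t[0])
        selected.append(best_attribute)
        remaining.remove(best_attribute)
        alphas.append(best_accuracy)
    return selected, alphas


def make_toy_score(weights: Dict[str, float], penalty: float):
    """Invented scoring rule that also counts how often it is called."""
    calls = [0]

    def score(subset: FrozenSet[str]) -> float:
        calls[0] += 1
        return sum(weights[a] for a in subset) - penalty * (len(subset) - 1)

    return score, calls


weights = {"A1": 0.10, "A2": 0.30, "A3": 0.05, "A4": 0.20}  # not real data
score, calls = make_toy_score(weights, penalty=0.08)
ranked, alphas = md_sfs_forward(list(weights), score)
best_level = max(range(len(alphas)), key=lambda l: alphas[l])  # MA criterion
print(ranked, calls[0], ranked[: best_level + 1])
```

For four attributes the scorer is called 4 + 3 + 2 + 1 = 10 times, which matches $T_a(T_a + 1)/2$. The MA criterion here keeps the first three ranked attributes because adding the fourth lowers the toy accuracy.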

Methods

Dataset

In this study, three protein sequence datasets have been used: 1) DD-dataset [25], 2) TG-dataset (Taguchi and Gromiha, [16]) and 3) EDD-dataset [2]. The DD-dataset that we have used consists of 311 protein sequences in the training set, where any two proteins have no more than 35% sequence identity for aligned subsequences longer than 80 residues. The test set consists of 383 protein sequences where sequence identity is less than 40%. Both sets belong to 27 SCOP folds which represent all major structural classes: α, β, α/β, and α + β [25]. The training set and test set have been merged into a single set of data in order to perform the k-fold cross-validation process.

The TG-dataset consists of 1612 protein sequences belonging to 30 different folding types of globular proteins. The names of the 30 folds and the number of protein sequences in each of them are described in Taguchi and Gromiha [16]. The protein sequences of the TG-dataset have first been transformed into their corresponding PSSM (position-specific scoring matrix) [35] sequences by using PSI-BLAST (http://blast.ncbi.nlm.nih.gov/) (the cut-off E-value is set to E = 0.001).

The EDD-dataset consists of 3418 proteins with less than 40% sequential similarity belonging to the 27 folds that were originally used in the DD-dataset. We extracted the EDD-dataset from SCOP 1.75 in a similar manner to Dong et al. [2] in order to study our proposed method using a larger number of samples.

Physicochemical attributes

In this study, 30 physicochemical attributes (see endnote a) have been utilized, including the 5 popular attributes used by Dubchak et al. [15]. The attributes with their corresponding symbols are listed in Table 1. The residues of amino acids of these 30 attributes are given in Table 2.

Table 1 Physicochemical attributes used in the study
Table 2 Residues of amino acids of the 30 attributes

Feature extraction

As discussed in the Background Section, there exist several feature extraction techniques. Given a classifier, the features derived from different feature extraction techniques would exhibit different fold recognition performances. Since in this paper the aim is not to find a feature extraction technique for a particular classifier, we use a simple autocorrelation of the residues of protein sequences. The expression for autocorrelation features used in the paper is given as follows:

$$R_i = \frac{1}{N} \sum_{k=1}^{N-i} (s_k - \mu)(s_{k+i} - \mu), \qquad (1)$$

where N is the length of the protein sequence, $s_k$ is the residue of the kth amino acid in the protein sequence and μ is the mean (or average) of the N residues. In this work, we use i = 1, 2, …, 20. Therefore, each protein sequence gives 20-dimensional autocorrelation features.
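The autocorrelation features can be computed with a few lines of Python. The sketch below assumes that $s_k$ is the numeric value of the chosen attribute for the kth amino acid, that the sum is normalized by N, and that lags longer than the sequence contribute zero; the function name and the toy attribute values are hypothetical and are not taken from Table 2.

```python
from typing import Dict, List


def autocorrelation_features(
    sequence: str, attribute: Dict[str, float], max_lag: int = 20
) -> List[float]:
    """Return [R_1, ..., R_max_lag] for one attribute of one protein sequence."""
    values = [attribute[aa] for aa in sequence]  # s_1, ..., s_N
    n = len(values)
    mu = sum(values) / n
    centered = [v - mu for v in values]
    # R_i = (1/N) * sum_{k=1}^{N-i} (s_k - mu)(s_{k+i} - mu)
    return [
        sum(centered[k] * centered[k + i] for k in range(n - i)) / n
        for i in range(1, max_lag + 1)
    ]


# Made-up attribute values for four amino acids, for illustration only.
toy_attribute = {"A": 1.0, "C": 2.0, "D": 3.0, "E": 4.0}
print(autocorrelation_features("ACDE", toy_attribute, max_lag=3))
print(len(autocorrelation_features("ACDE" * 10, toy_attribute)))
```

Applying the function once per attribute and concatenating the results for the $t_s$ attributes of a subset gives the $t_s d$ features (with d = 20) that the classifier sees.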

Classifiers

In the literature, several classifiers have been used for the protein fold recognition problem. We used three techniques for classification: support vector machine (SVM), Naïve Bayes (NB) and linear discriminant analysis (LDA) with nearest centroid classifier [57]-[59]. SVM and NB classifiers are used from WEKA environment [60] by using WEKA’s default parameter settings.
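To make the accuracy function H(·) used by MD-SFS concrete, the following sketch computes an average k-fold cross-validation accuracy with a plain nearest-centroid classifier. It is a simplified stand-in, not the setup of this study: there is no LDA projection, no SVM or Naïve Bayes, no WEKA, and folds are formed by simple interleaving rather than by shuffling or stratification. The function name and the toy data are hypothetical.

```python
from typing import Dict, List, Sequence


def nearest_centroid_cv_accuracy(
    X: Sequence[Sequence[float]], y: Sequence[int], k: int = 10
) -> float:
    """Average accuracy of a nearest-centroid classifier over k interleaved folds."""
    n = len(X)
    accuracies: List[float] = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        if not test_idx:
            continue
        # Class centroids are estimated from the training part only.
        sums: Dict[int, List[float]] = {}
        counts: Dict[int, int] = {}
        for i in train_idx:
            acc = sums.setdefault(y[i], [0.0] * len(X[i]))
            for j, v in enumerate(X[i]):
                acc[j] += v
            counts[y[i]] = counts.get(y[i], 0) + 1
        centroids = {c: [v / counts[c] for v in s] for c, s in sums.items()}
        correct = 0
        for i in test_idx:
            # Predict the class whose centroid is closest in squared distance.
            predicted = min(
                centroids,
                key=lambda c: sum((a - b) ** 2 for a, b in zip(X[i], centroids[c])),
            )
            correct += int(predicted == y[i])
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Two well separated toy classes (invented data).
X = [[0.1 * i, 0.0] for i in range(10)] + [[5.0 + 0.1 * i, 5.0] for i in range(10)]
y = [0] * 10 + [1] * 10
print(nearest_centroid_cv_accuracy(X, y, k=5))
```

Any classifier with a train and predict step could be dropped into the same loop; only the returned average accuracy is needed by the attribute selection procedure.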

Results and discussions

Five attributes used by Ding and Dubchak [25] are used as a benchmark. These attributes are H, P, Z, X and V (see Table 1 for the description of these symbols). In all the experiments we use a 10-fold cross-validation process to obtain the recognition performance. First we present in Table 3 the fold recognition using these 5 attributes on DD, TG and EDD datasets. It can be clearly observed that the highest fold recognition on DD-dataset obtained by HPZXV is 32.8%, on TG-dataset is 28.8% and on EDD-dataset is 38.4%.

Table 3 Protein fold recognition (shown in percentage) on all the datasets using HPZXV attributes used by Ding and Dubchak [25]

Next we apply the MD-SFS backward elimination approach on the DD-dataset, TG-dataset and EDD-dataset, respectively, in three cases: 1) using the top 10 attributes of the amino acids from Table 1, 2) using the top 15 attributes of the amino acids from Table 1, and 3) using all 30 attributes from Table 1. We use two criteria, brute-n and MA-based (as discussed in Section MD-SFS: Backward Elimination), to select the attributes. Since the results in Table 3 are reported using 5 attributes, we apply brute-5 to compare the results with those of Table 3. The selected attributes with their corresponding protein fold recognition (abbreviated as PFR in Tables 4, 5, 6, 7, 8, 9, 10 and 11) performance on the DD-dataset using the brute-5 criterion are given in Table 4 and using the MA-based criterion are given in Table 5. The first row of results is for HPZXV (taken from Table 3). The first column indicates the number of attributes taken for attribute selection. The same setup has been used for all the remaining tables (Tables 6, 7, 8, 9, 10 and 11). It can be seen from Tables 4 and 5 that incorporating more attributes and then performing attribute selection helps to improve the recognition performance. By using only 5 attributes (Table 4), the recognition performance improves by 5.7% to 16.6% compared with the recognition performance of the HPZXV attributes. If the number of attributes is not fixed and selection is based on the MA criterion, then the improvement is between 14.1% and 18.1%.

Table 4 MD-SFS backward elimination approach on DD-dataset using brute-5 criterion
Table 5 MD-SFS backward elimination approach on DD-dataset using MA-based criterion
Table 6 MD-SFS backward elimination approach on TG-dataset using brute-5 criterion
Table 7 MD-SFS backward elimination approach on TG-dataset using MA-based criterion
Table 8 MD-SFS backward elimination approach on EDD-dataset using brute-5 criterion
Table 9 MD-SFS backward elimination approach on EDD-dataset using MA-based criterion
Table 10 MD-SFS forward selection approach on DD-dataset using brute-5 criterion
Table 11 MD-SFS forward selection approach on DD-dataset using MA-based criterion

A similar scheme has been applied using the TG-dataset and the results are reported in Tables 6 and 7 (Table 6 using brute-5 criterion and Table 7 using MA-based criterion). It can be observed from Table 6 that recognition performance has been improved between 7.5% and 10.2%. Also the improvement from Table 7 is between 12.6% and 18.1%.

We have also employed the EDD-dataset for the experiment and the results are reported in Tables 8 and 9 (Table 8 using brute-5 criterion and Table 9 using MA-based criterion). From Table 8, we note that the improvement in recognition performance is between 6.5% and 8.8%, and from Table 9, it is between 15.5% and 24.3%.

Subsequently we applied the MD-SFS forward selection approach on the DD, TG and EDD datasets. Again we use the brute-5 and MA-based criteria. The protein fold recognition performance using the DD-dataset with the brute-5 criterion is shown in Table 10 and with the MA-based criterion is shown in Table 11. It can be observed from Table 10 that by using only 5 attributes the recognition performance can be improved by between 4.5% and 12.2%. In a similar way, the improvement using the MA-based criterion ranges from 13.3% to 17.7%.

On TG-dataset, MD-SFS forward selection with brute-5 criterion is depicted in Table 12 and with MA-based criterion is depicted in Table 13. The improvement from Table 12 using only 5 attributes is between 8.1% and 10.4%; and, from Table 13 we have improvement from 12.4% to 17.5%.

Table 12 MD-SFS forward selection approach on TG-dataset using brute-5 criterion
Table 13 MD-SFS forward selection approach on TG-dataset using MA-based criterion

Similarly, on EDD-dataset, MD-SFS forward selection with brute-5 criterion is shown in Table 14 and with MA-based criterion is shown in Table 15. The improvement from Table 14 using only 5 attributes is between 7.4% and 8.7%; and, from Table 15 we have improvement from 10.5% to 16.2%.

Table 14 MD-SFS forward selection approach on EDD-dataset using brute-5 criterion
Table 15 MD-SFS forward selection approach on EDD-dataset using MA-based criterion

From the results, we can deduce that physicochemical-based attributes are important for the prediction accuracy of protein folds. An appropriately selected subset of attributes could enhance the prediction accuracy significantly. The subsets of attributes selected for different datasets are different. The attributes in a subset also vary depending on the classifier used. However, some attributes repeatedly appear in the obtained subsets. For instance, the subset BPEVO is selected from all 30 attributes using the brute-5 criterion on the DD-dataset when LDA is used, and the subset BPDFM is selected when SVM is used (see Table 4). It can be observed that the attributes B and P are common to both subsets. This could imply that these attributes contain more discriminative information for protein fold recognition than others. When we analyzed all the subsets using the brute-5 criterion on all three datasets (Tables 4, 6, 8, 10, 12 and 14), we found that the 5 most frequent attributes are J (appeared 12 times), B (appeared 9 times), T (appeared 9 times), F (appeared 8 times) and M (appeared 6 times). Therefore, these attributes (J, B, T, F and M) can be seen as important attributes. However, this does not imply that a subset containing all these 5 attributes would perform the best, as the performance of attributes in combination with other attributes is also crucial.

We have also carried out a statistical hypothesis test to exhibit the significance of the results achieved. In order to do this, we randomly selected m attributes from a given set of n attributes and computed prediction accuracy using these m attributes. We repeated this random selection r times and computed average prediction accuracy. All three classifiers (LDA, SVM and NB) are used for this purpose. We applied this testing on all the three benchmark datasets (DD, TG and EDD) and compared the results with the proposed schemes. In this testing, we used m = 5, n = 30 and r = 20. The results are reported in Tables 16, 17 and 18. It can be observed from these tables that the prediction accuracy using a random selection approach is inferior to the proposed schemes. This depicts that systematically selecting attributes (using MD-SFS procedures) contributed to the prediction accuracy of protein folds.
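The random-selection baseline described above can be sketched as follows. The sketch assumes a `score` callback that returns the prediction accuracy of a subset; the function name, the attribute labels A1 to A30, and the toy scoring rule (the fraction of five arbitrarily designated "informative" attributes present in the subset) are invented for illustration and say nothing about the actual results in Tables 16, 17 and 18.

```python
import random
from typing import Callable, FrozenSet, Sequence


def random_selection_accuracy(
    attributes: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    m: int = 5,
    r: int = 20,
    seed: int = 0,
) -> float:
    """Average accuracy over r random draws of m attributes out of the given set."""
    rng = random.Random(seed)  # fixed seed so the baseline is reproducible
    pool = list(attributes)
    accuracies = [score(frozenset(rng.sample(pool, m))) for _ in range(r)]
    return sum(accuracies) / r


attributes = [f"A{j}" for j in range(1, 31)]  # n = 30 hypothetical labels
informative = frozenset(attributes[:5])       # invented ground truth


def toy_score(subset: FrozenSet[str], informative: FrozenSet[str] = informative) -> float:
    """Toy accuracy: share of the informative attributes found in the subset."""
    return len(subset & informative) / len(informative)


baseline = random_selection_accuracy(attributes, toy_score)
print(baseline, toy_score(informative))
```

Comparing `baseline` with the accuracy of the subset returned by a systematic search is the kind of check this paragraph describes: if the selection procedure adds value, its subset should score clearly above the random average.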

Table 16 Statistical analysis using DD-dataset
Table 17 Statistical analysis using TG-dataset
Table 18 Statistical analysis using EDD-dataset

Furthermore, we have carried out paired t-test with 5% significance level to study the statistical significance of the prediction accuracy obtained. We used MD-SFS backward elimination method (using brute-5 criterion) as a prototype and used all the three classifiers (LDA, SVM and NB). We compared the results obtained by all the classifiers for HPZXV attributes for DD, TG and EDD benchmarks (the degree of freedom is 2). The paired t-test results for LDA, SVM and NB are 0.029, 0.003 and 0.004, respectively. These results show that the prediction accuracies obtained are significant.
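For readers who want to reproduce this kind of test, the sketch below runs a two-sided paired t-test on three pairs of accuracies (one pair per dataset, hence 2 degrees of freedom). It relies on the closed-form CDF of Student's t distribution for exactly 2 degrees of freedom, so it does not generalize to other sample sizes. The function names are hypothetical and the accuracies are invented placeholders, not values from this study.

```python
from math import sqrt
from typing import Sequence, Tuple


def two_sided_p_df2(t: float) -> float:
    """Two-sided p-value of Student's t with exactly 2 degrees of freedom.

    Uses the closed-form CDF for 2 d.o.f.: F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)).
    """
    return 1.0 - abs(t) / sqrt(t * t + 2.0)


def paired_t_test_three_pairs(
    a: Sequence[float], b: Sequence[float]
) -> Tuple[float, float]:
    """Return (t statistic, two-sided p-value) for three paired observations."""
    if len(a) != 3 or len(b) != 3:
        raise ValueError("this sketch only covers three pairs (2 d.o.f.)")
    diffs = [x - y for x, y in zip(a, b)]
    mean = sum(diffs) / 3
    variance = sum((d - mean) ** 2 for d in diffs) / 2  # sample variance, n - 1 = 2
    # Note: identical differences give zero variance and a division by zero.
    t = mean / sqrt(variance / 3)
    return t, two_sided_p_df2(t)


# Invented accuracies (percent) for three datasets; NOT values from the paper.
selected = [40.0, 38.0, 45.0]
benchmark = [30.0, 29.0, 33.0]
t, p = paired_t_test_three_pairs(selected, benchmark)
print(round(t, 2), p < 0.05)
```

With only three pairs the test has very little power, so a p-value below 0.05 requires improvements that are both large and consistent across the datasets.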

We can summarize that the performance of the protein fold recognition improved when the attributes are appropriately selected. This also shows that physicochemical attributes can play an important role in protein fold recognition if selected appropriately. It should also be noted that the performance can be improved further by considering several other feature extraction techniques with sophisticated ensemble classifiers.

Conclusion

In this study, we have shown that selecting physicochemical attributes of amino acids improves protein fold recognition performance significantly. It is, therefore, beneficial to explore important attributes in the process of determining the three-dimensional structure of proteins. To do this, we have developed a multi-dimensional successive feature selection (MD-SFS) technique and demonstrated it with both backward elimination and forward selection approaches. There are several attributes available (e.g. a list of 544 attributes can be found in AAindex, http://www.genome.jp/aaindex/, [61]) and the investigation of these attributes by an exhaustive search would help in solving the problem better. Though it is always useful to explore as many attributes as possible, this comes at the expense of additional computational cost and memory requirements. Nonetheless, computationally efficient techniques for an exhaustive exploration of important attributes should be developed along with the development of feature extraction and classification techniques.

Endnote

a. Though a large number of physicochemical-based attributes have been defined for amino acids, many authors (e.g. [31, 62]-[65]) have in the past used a limited number of attributes (up to 8) in their studies. We attempted to study the attributes which were given more emphasis in the literature.

References

  1. 1.

    Yang T, Kecman V, Cao L, Zhang C, Huang JZ: Margin-based ensemble classifier for protein fold recognition. Expert Syst Appl. 2011, 38: 12348-12355. 10.1016/j.eswa.2011.04.014.

  2. 2.

    Dong Q, Zhou S, Guan G: A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation. Bioinformatics. 2009, 25 (20): 2655-2662. 10.1093/bioinformatics/btp500.

  3. 3.

    Klein P: Prediction of protein structural class by discriminant analysis. Biochim Biopjys Acta. 1986, 874: 205-215. 10.1016/0167-4838(86)90119-6.

  4. 4.

    Chinnasamy A, Sung WK, Mittal A: Protein structure and fold prediction using tree-augmented naive Bayesian classifier. J Bioinform Comput Biol. 2005, 3 (4): 803-819. 10.1142/S0219720005001302.

  5. 5.

    Wang ZZ, Yuan Z: How good is prediction of protein-structural class by the component-coupled method?. Proteins. 2000, 38: 165-175. 10.1002/(SICI)1097-0134(20000201)38:2<165::AID-PROT5>3.0.CO;2-V.

  6. 6.

    Shen HB, Chou KC: Ensemble classier for protein fold pattern recognition. Bioinformatics. 2006, 22: 1717-1722. 10.1093/bioinformatics/btl170.

  7. 7.

    Ding YS, Zhang TL: Using Chou’s pseudo amino acid composition to predict subcellular localization of apoptosis proteins: an approach with immune genetic algorithm-based ensemble classifier. Patt Recog Letters. 2008, 29: 1887-1892. 10.1016/j.patrec.2008.06.007.

  8. 8.

    Bouchaffra D, Tan J: Protein fold recognition using a structural Hidden Markov Model. Proceedings of the 18th International Conference on Pattern Recognition. 2006, 3: 186-189.

  9. 9.

    Deschavanne P, Tuffery P: Enhanced protein fold recognition using a structural alphabet. Proteins: Structure, Function, and Bioinformatics. 2009, 76: 129-137. 10.1002/prot.22324.

  10. 10.

    Chen K, Zhang X, Yang MQ, Yang JY: Ensemble of probabilistic neural networks for protein fold recognition. Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering (BIBE). 2007, I: 66-70.

  11. 11.

    Ying Y, Huang K, Campbell C: Enhanced protein fold recognition through a novel data integration approach. BMC Bioinforma. 2009, 10 (1): 267-10.1186/1471-2105-10-267.

  12. 12.

    Dehzangi A, Amnuaisuk SP, Ng KH, Mohandesi E: Protein fold prediction problem using ensemble of classifiers. Proceedings of the 16th International Conference on Neural Information Processing. 2009, Part II: 503-511.

  13. 13.

    Dehzangi A, Amnuaisuk SP, Dehzangi O: Enhancing protein fold prediction accuracy by using ensemble of different classifiers. Aust J Intell Inf Process Syst. 2010, 26 (4): 32-40.

  14. 14.

    Dehzangi A, Karamizadeh S: Solving protein fold prediction problem using fusion of heterogeneous classifiers. INF, Int Interdiscip J. 2011, 14 (11): 3611-3622.

  15. 15.

    Dubchak I, Muchnik I, Kim SK: Protein folding class predictor for SCOP: approach based on global descriptors. Proceedings, 5th International Conference on Intelligent Systems for Molecular Biology. 1997, Kalkidiki, Greece, 104-107.

  16. 16.

    Taguchi Y-h, Gromiha MM: Application of amino acid occurrence for discriminating different folding types of globular proteins. BMC Bioinforma. 2007, 8: 404-10.1186/1471-2105-8-404.

  17. 17.

    Ghanty P, Pal NR: Prediction of protein folds: extraction of new features, dimensionality reduction, and fusion of heterogeneous classifiers. IEEE Trans On Nano Bioscience. 2009, 8: 100-110.

  18. 18.

    Chou KC: Prediction of protein cellular attributes using pseudo amino acid composition. Proteins. 2001, 43: 246-255. 10.1002/prot.1035. erratum: 2001, vol. 44, 60

  19. 19.

    Sharma A, Lyons J, Dehzangi A, Paliwal KK: A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition. J Theor Biol. 2013, 320 (7): 41-46.

  20. 20.

    Kurgan LA, Cios KJ, Chen K: SCPRED: Accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences. BMC Bioinforma. 2008, 9: 226-10.1186/1471-2105-9-226.

  21. 21.

    Liu T, Geng X, Zheng X, Li R, Wang J: Accurate Prediction of Protein Structural Class Using Auto Covariance Transformation of PSI-BLAST Profiles. Amino Acids. 2012, 42: 2243-2249. 10.1007/s00726-011-0964-5.

  22. 22.

    Dehzangi A, Amnuaisuk SP: Fold prediction problem: the application of new physical and physicochemical-based features. Protein Pept Lett. 2011, 18: 174-185. 10.2174/092986611794475101.

  23. 23.

    Krishnaraj Y, Reddy CK: Boosting methods for protein fold recognition: an empirical comparison. IEEE Int Conf Bioinfor Biomed. 2008, 393-396.

  24. 24.

    Valavanis IK, Spyrou GM, Nikita KS: A comparative study of multi-classification methods for protein fold recognition. Int J Comput Intell Bioinform Syst Biol. 2010, 1 (3): 332-346.

  25. 25.

    Ding C, Dubchak I: Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics. 2001, 17 (4): 349-358. 10.1093/bioinformatics/17.4.349.

  26. 26.

    Kecman V, Yang T: Protein fold recognition with adaptive local hyper plane Algorithm. Computational Intelligence in Bioinformatics and Computational Biology, CIBCB '09. IEEE Symposium. 2009, Nashville, TN, USA, 75-78.

  27. 27.

    Kavousi K, Moshiri B, Sadeghi M, Araabi BN, Moosavi-Movahedi AA: A protein fold classier formed by fusing different modes of pseudo amino acid composition via PSSM. Comput Biol Chem. 2011, 35 (1): 1-9. 10.1016/j.compbiolchem.2010.12.001.

  28. 28.

    Chmielnicki W, Stapor K: A hybrid discriminative-generative approach to protein fold recognition. Neurocomputing. 2012, 75: 194-198. 10.1016/j.neucom.2011.04.033.

  29. 29.

    Zhang H, Zhang T, Gao J, Ruan J, Shen S, Kurgan LA: Determination of protein folding kinetic types using sequence and predicted secondary structure and solvent accessibility. Amino Acids. 2010, 1-13.

  30. 30.

    Najmanovich R, Kuttner J, Sobolev V, Edelman M: Side-chain flexibility in proteins upon ligand binding. Proteins: Structure, Function, and Bioinformatics. 2000, 39 (3): 261-268. 10.1002/(SICI)1097-0134(20000515)39:3<261::AID-PROT90>3.0.CO;2-4.

  31. 31.

    Huang JT, Tian J: Amino acid sequence predicts folding rate for middle-size two-state proteins. Proteins: Structure, Function, and Bioinformatics. 2006, 63 (3): 551-554. 10.1002/prot.20911.

  32. 32.

    Zhang TL, Ding YS, Chou KC: Prediction protein structural classes with pseudo amino acid composition: approximate entropy and hydrophobicity pattern. J Theor Biol. 2008, 250: 186-193. 10.1016/j.jtbi.2007.09.014.

  33.

    Cormen TH, Leiserson CE, Rivest RL, Stein C: Introduction to algorithms. 1990, USA: MIT Press

  34.

    Sharma A, Imoto S, Miyano S: A top-r feature selection algorithm for microarray gene expression data. IEEE/ACM Trans Comput Biol Bioinform. 2012, 9 (3): 754-764.

  35.

    Schaffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF: Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 2001, 29: 2994-3005. 10.1093/nar/29.14.2994.

  36.

    Argos P, Rao JKM, Hargrave PA: Structural prediction of membrane-bound proteins. Eur J Biochem. 1982, 128: 565-575.

  37.

    Zimmerman JM, Eliezer N, Simha R: The characterization of amino acid sequences in proteins by statistical methods. J Theor Biol. 1968, 21: 170-201. 10.1016/0022-5193(68)90069-6.

  38.

    Charton M, Charton BI: The structural dependence of amino acid hydrophobicity parameters. J Theor Biol. 1982, 99: 629-644. 10.1016/0022-5193(82)90191-6.

  39.

    Burgess AW, Ponnuswamy PK, Scheraga HA: Analysis of conformations of amino acid residues and prediction of backbone topography in proteins. Isr J Chem. 1974, 12: 239-286.

  40.

    Fauchere JL, Charton M, Kier LB, Verloop A, Pliska V: Amino acid side chain parameters for correlation studies in biology and pharmacology. Int J Peptide Protein Res. 1988, 32: 269-278.

  41.

    Bundi A, Wuthrich K: 1H-NMR parameters of the common amino acid residues measured in aqueous solutions of the linear tetrapeptides H-Gly-Gly-X-L-Ala-OH. Biopolymers. 1979, 18: 285-297. 10.1002/bip.1979.360180206.

  42.

    Charton M, Charton BI: The dependence of the Chou-Fasman parameters on amino acid side chain structure. J Theor Biol. 1983, 111: 447-450.

  43.

    Khanarian G, Moore WJ: The Kerr effect of amino acids in water. Aust J Chem. 1980, 33: 1727-1741. 10.1071/CH9801727.

  44.

    Cid H, Bunster M, Canales M, Gazitua F: Hydrophobicity and structural classes in proteins. Protein Eng. 1992, 5: 373-375. 10.1093/protein/5.5.373.

  45.

    Chou PY, Fasman GD: Prediction of the secondary structure of proteins from their amino acid sequence. Adv Enzymol. 1978, 47: 45-148.

  46.

    Levitt M: Conformational preferences of amino acids in globular proteins. Biochemistry. 1978, 17: 4277-4285. 10.1021/bi00613a026.

  47.

    Dawson DM: The Biochemical Genetics of Man. Edited by: Brock DJH, Mayo O. 1972, Academic Press

  48.

    Dayhoff MO, Hunt LT, Hurst-Calderone S: Composition of proteins. Atlas of Protein Sequence and Structure. 1978, 5 (3): 363-375.

  49.

    Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure. 1978, 5 (3): 345-352.

  50.

    Eisenberg D, McLachlan AD: Solvation energy in protein folding and binding. Nature. 1986, 319: 199-203. 10.1038/319199a0.

  51.

    Handbook of Biochemistry: Section A. Proteins. Edited by: Fasman GD. 1976, CRC Press, 3

  52.

    Geisow MJ, Roberts RDB: Amino acid preferences for secondary structure vary with protein class. Int J Biol Macromol. 1980, 2: 387-389. 10.1016/0141-8130(80)90023-9.

  53.

    Grantham R: Amino acid difference formula to help explain protein evolution. Science. 1974, 185: 862-864. 10.1126/science.185.4154.862.

  54.

    Guy HR: Amino acid side-chain partition energies and distribution of residues in soluble proteins. Biophys J. 1985, 47: 61-70. 10.1016/S0006-3495(85)83877-7.

  55.

    Hutchens JO: Heat capacities, absolute entropies, and entropies of formation of amino acids and related compounds. Handbook of Biochemistry. Edited by: Sober HA. 1970, Cleveland, Ohio: Chemical Rubber Co, 2

  56.

    Janin J, Wodak S, Levitt M, Maigret B: Conformation of amino acid side-chains in proteins. J Mol Biol. 1978, 125: 357-386. 10.1016/0022-2836(78)90408-4.

  57.

    Sharma A, Paliwal KK: Rotational linear discriminant analysis technique for dimensionality reduction. IEEE Trans Knowl Data Eng. 2008, 20 (10): 1336-1347.

  58.

    Sharma A, Paliwal KK: A gradient linear discriminant analysis for small sample sized problem. Neural Processing Letters. 2008, 27 (1): 17-24. 10.1007/s11063-007-9056-7.

  59.

    Sharma A, Paliwal KK: Cancer classification by gradient LDA technique using microarray gene expression data. Data Knowl Eng. 2008, 66 (2): 338-347. 10.1016/j.datak.2008.04.004.

  60.

    Witten IH, Frank E: Data mining: practical machine learning tools with java implementations. 2000, San Francisco, CA: Morgan Kaufmann, http://www.cs.waikato.ac.nz/ml/weka/

  61.

    Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M: AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 2008, 36: D202-D205. 10.1093/nar/gkn255.

  62.

    Li ZC, Zhou XB, Lin YR, Zou XY: Prediction of protein structure class by coupling improved genetic algorithm and support vector machine. Amino Acids. 2008, 35: 581-590. 10.1007/s00726-008-0084-z.

  63.

    Liu L, Hu X: Based on improved parameters predicting protein fold. Sixth Int Conf Nat Comput (ICNC 2010). 2010, 6: 3291-3295.

  64.

    Kurgan L, Chen K: Prediction of protein structural class for the twilight zone sequences. Biochem Biophys Res Commun. 2007, 357: 453-460. 10.1016/j.bbrc.2007.03.164.

  65.

    Gromiha M: A statistical model for predicting protein folding rates from amino acid sequence with structural class information. J Chem Inf Model. 2005, 45: 494-501. 10.1021/ci049757q.

Author information

Correspondence to Alok Sharma.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

AS designed and carried out the experiments, and wrote the first draft of the manuscript. KKP assisted in designing a section of experiments. AD provided the dataset and helped in the second draft of the manuscript. JL also helped in the second draft of the manuscript. SI and SM financed the project. All authors read and approved the final manuscript.

About this article

Keywords

  • Support Vector Machine
  • Prediction Accuracy
  • Linear Discriminant Analysis
  • Recognition Performance
  • Fold Recognition