Predicting protein-ATP binding sites from primary sequence through fusing bi-profile sampling of multi-view features
- Ya-Nan Zhang†1,
- Dong-Jun Yu†2,
- Shu-Sen Li1,
- Yong-Xian Fan1,
- Yan Huang3 and
- Hong-Bin Shen1
© Zhang et al.; licensee BioMed Central Ltd. 2012
Received: 5 December 2011
Accepted: 31 May 2012
Published: 31 May 2012
Adenosine-5′-triphosphate (ATP) is a multifunctional nucleotide that plays an important role in cell biology as a coenzyme interacting with proteins. Revealing the binding sites between proteins and ATP is important for understanding the functionality of the proteins and the mechanisms of protein-ATP complexes.
In this paper, we propose a novel framework for predicting the functional residues through which proteins bind ATP molecules. The new prediction protocol combines sequence evolutionary information with bi-profile sampling of multi-view sequential features and sequence-derived structural features. The hypothesis behind this strategy is that a single-view feature can represent only part of the knowledge about the target, whereas multiple sources of descriptors can be complementary.
Prediction performance, evaluated by both 5-fold and leave-one-out jackknife cross-validation tests on two benchmark datasets consisting of 168 and 227 non-homologous ATP binding proteins respectively, demonstrates the efficacy of the proposed protocol. Our experimental results also reveal that the residue structural characteristics of real protein-ATP binding sites are significantly different from those of normal residues: for example, the binding residues do not show high solvent accessibility propensities, and binding tends to occur at the junctions between different secondary structure segments. Furthermore, tests with multiple ratios between positive and negative samples show that performance is affected by the imbalance of the training datasets. Increasing the dataset scale is also shown to improve prediction performance.
Keywords: Protein-ATP binding site prediction; Position specific scoring matrix; Bi-profile sampling; Cross-validation
Annotation of protein functions is one of the challenging tasks in the bioinformatics field, given the mass of protein sequence data in the post-genomic era [1–5]. It is generally acknowledged that protein function annotation not only promotes progress in cell biology but also benefits the development of the pharmaceutical industry. However, there is currently a huge gap between the large amount of available protein sequence data and the small number of proteins whose function has been identified. There is therefore an urgent need to bridge this gap by developing accurate automated bioinformatics approaches, since wet-lab experiments are particularly laborious and expensive. In many cases, a protein realizes its specific function through interaction with other molecules or ligands in the living cell. Specifically, a protein is activated when its amino acid residues interact with residues from other proteins or with small molecules, forming the so-called interaction interfaces [7–9]. Hence, in order to reveal a protein’s complex functions, the first critical step is often to accurately identify these interacting residues among hundreds or even thousands of other residues. Based on these important targets, we can then recognize the protein functions through either wet-lab analysis or other dry-lab experiments.
Many excellent works have been reported that distinguish interacting functional residues and analyze the characteristics of interaction interfaces [10–13]. The early approach was based on multiple alignment of orthologous protein sequences, followed by assigning the most highly conserved residues as interacting residues [14–16]. Although the alignment-based algorithm has been demonstrated to be successful when a large number of candidate protein sequences is available and the aligned sequences from distinct species share similar functions [17, 18], it often yields high false positive rates outside these situations. There is hence a strong need for more robust methods [19, 20]. Later on, analysis coupled with protein structural data was shown to improve prediction performance over that obtained from protein sequence features alone, where the corresponding network is modeled from the 3D structural data of the protein molecule. It is assumed that interacting residues should be easily recognized if they have distinguishing features. Based on this, many researchers began to focus on analyzing the unique sequential characteristics of interacting residues and the corresponding interaction geometry interfaces [7, 22]. Nooren et al. studied the composition of interacting residues and revealed different tendencies among residues from different types of protein complexes. It has been suggested that amino acid physicochemical properties may also aid the prediction of binding residues [24–27]. Furthermore, many structural characteristics related to the identification of critical residues have been intensively investigated, such as secondary structure information [28–34]. David et al. made a comprehensive characterization of protein interaction interfaces and found that main-chain atoms contribute significantly to protein-protein interactions, where the type of interaction is highly dependent on the secondary structure type. The recently proposed DoGSite method investigated the concept of subpockets, and the difference of Gaussian (DoG) approach was found to improve the prediction rates of protein active sites. Beyond the efforts to find discriminative features of interaction sites and their local sequential or structural environment, some studies aim to develop more robust machine learning algorithms. In the report by Sankararaman et al., both sequence conservation features and structure information are integrated into a logistic regression model for a rough prediction. A regularized maximum likelihood approach is then exploited when estimating the related parameters in order to avoid overfitting. These results show that a proper post-processing framework is helpful for yielding better prediction performance.
In this paper, we focus on the prediction of protein-ATP binding sites, which plays an essential role in revealing the mechanisms of protein-ATP complexes. ATP is a multifunctional nucleotide, closely related to adenosine diphosphate (ADP) and adenosine monophosphate (AMP), that acts as a coenzyme in transferring chemical energy and supplying energy for metabolism and chemical synthesis within cells. Numerous experimental studies have shown that ATP is converted into ADP or AMP when it releases the energy it carries and that, conversely, ADP or AMP is converted back into ATP after absorbing chemical energy. ATP is thus constantly recycled in organisms, and the amount consumed each day is comparable to the weight of the human body. Naturally, ATP needs to interact with many proteins in the body to accomplish its tasks, which makes it necessary to investigate ATP-protein binding residues. Such knowledge can of course be obtained by conducting various wet-lab experiments, but these can be very expensive and time-consuming. Consequently, developing computational methods for predicting protein-ATP binding residues has great potential for application. Despite the importance of this research, little work has been reported in this regard. Raghava et al. did significant work on predicting protein-ATP binding sites from the amino acid sequence. They first constructed a benchmark dataset consisting of 168 non-redundant ATP binding protein chains. They then trained several models on different feature groups, such as the amino acid composition of the protein sequence, protein evolutionary information in the form of a position specific scoring matrix (PSSM) profile, and seven physicochemical properties of amino acids. The studies by Raghava et al. revealed that evolutionary information is critical for distinguishing protein-ATP binding residues from the other residues of a protein sequence.
Kurgan et al. developed a sequence-based predictor called ATPsite for predicting protein-ATP binding sites and achieved very promising results. In this paper, we follow these pioneering studies, aiming to discover novel discriminative features around protein-ATP interaction sites and to further improve the prediction of protein-ATP binding sites. Considering the importance of protein structural features for predicting protein functional sites in other studies, we investigated, in addition to the amino acid sequential features, several sequence-derived structural features, i.e., protein secondary structure, amino acid disorder information, and solvent accessibility of amino acids. Furthermore, we made use of the so-called bi-profile sampling method to process these multi-view features. Our experimental results, based on both 5-fold and leave-one-out jackknife cross-validation tests, show that a proper fusion of multi-view features helps improve the prediction of protein-ATP binding sites.
In the present study, two benchmark datasets were adopted. First, we used the dataset of 168 protein sequences, denoted as ATP168, which was first organized by Raghava et al. This dataset was selected from the SuperSite encyclopedia and then reduced so that sequence identity is below 40%. To further demonstrate the effectiveness of the proposed method, we also used a second, more recent dataset of 227 sequences constructed by Kurgan et al., denoted as ATP227. The sequence identity between any two proteins in ATP227 is also less than 40%; the dataset is available at http://biomine.ece.ualberta.ca/ATPsite/.
In the following experiments, we performed both 5-fold and leave-one-out cross-validation (LOOCV) on the ATP168 and ATP227 datasets. Taking ATP168 as an example, in the case of LOOCV, each ATP binding protein chain in the dataset was singled out in turn as the testing sample, with the remaining 167 sequences constituting the training dataset. This procedure continued until all the protein chains in the dataset had been traversed. Binding and non-binding residues in the training dataset were regarded as positive and negative samples, respectively, and input into the machine learning models for training and prediction. It should be pointed out that the positive and negative training subsets are imbalanced, since the number of non-binding residues is far greater than the number of binding residues. Our experimental results with different ratios between positive and negative samples will show that this imbalance affects the final performance of the prediction model. At the beginning of the experiment design, we randomly selected from all non-binding residues a negative subset of the same size as the positive dataset.
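The protocol in this paragraph can be illustrated with a minimal sketch. It assumes that chains are identified by strings and that each residue is a `(features, label)` pair with label 1 for binding and 0 for non-binding; the function names, the toy data, and the fixed random seed are illustrative assumptions, not part of the original study.

```python
import random

def leave_one_out_splits(chain_ids):
    """Yield (held_out_chain, training_chains), holding out each chain once."""
    for i, held_out in enumerate(chain_ids):
        yield held_out, chain_ids[:i] + chain_ids[i + 1:]

def balanced_training_set(residues, seed=0):
    """Keep every binding residue plus an equally sized random sample of
    non-binding residues (a 1:1 positive-to-negative ratio)."""
    positives = [r for r in residues if r[1] == 1]
    negatives = [r for r in residues if r[1] == 0]
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sampled = rng.sample(negatives, min(len(positives), len(negatives)))
    return positives + sampled

# Toy data: three hypothetical chains and five residues with 1-D features.
chains = ["chainA", "chainB", "chainC"]
splits = list(leave_one_out_splits(chains))

toy_residues = [([0.1], 1), ([0.2], 0), ([0.3], 0), ([0.4], 0), ([0.5], 1)]
balanced = balanced_training_set(toy_residues)
```

Other positive-to-negative ratios could be explored by changing the sample size passed to `rng.sample`.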
Sequential feature extraction
where s is the original element of the PSSM and n_s is the corresponding normalized element. Our experiments will show that this normalization is important for improving the prediction success rates, by reducing the bias and noise contained in the original scores.
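A minimal sketch of this normalization step follows. It assumes that the normalization is the standard logistic function, 1 / (1 + exp(-s)), as the name LogisticPSSM suggests; the exact formula is not reproduced in this text. Taking the features over a sliding window of 17 residues follows the window size stated later in the paper, while the zero padding at the chain ends and the function names are assumptions made for illustration.

```python
import math

def logistic(s):
    """Standard logistic function, mapping a real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

def normalize_pssm(pssm):
    """Apply the logistic function to every element of an N x 20 PSSM."""
    return [[logistic(s) for s in row] for row in pssm]

def window_features(norm_pssm, center, window=17):
    """Flatten the normalized rows inside a window centered on `center`.
    Positions beyond either end of the chain are padded with zeros
    (an assumption; the padding scheme is not given in the text)."""
    half = window // 2
    width = len(norm_pssm[0])
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(norm_pssm):
            features.extend(norm_pssm[pos])
        else:
            features.extend([0.0] * width)
    return features

# Toy 3-residue PSSM with 20 columns per residue.
toy_pssm = [[0] * 20, [2] * 20, [-2] * 20]
normalized = normalize_pssm(toy_pssm)
features = window_features(normalized, center=1)
```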
In order to reflect the amino acid types of the interacting residues and their local environment, we also encoded each amino acid residue as a 20-D binary vector of zeros and ones according to the type of amino acid in alphabetical order; e.g., Ala is represented as (10000000000000000000), …, and Tyr as (00000000000000000001). It should be pointed out that this encoding yields a very sparse feature vector, containing few ones and many zeros, which needs further processing to avoid over-fitting in the learning and prediction steps.
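The binary encoding can be sketched as below. The ordering by one-letter code is an assumption; it is consistent with the examples in the text, where Ala (A) occupies the first position and Tyr (Y) the last. Returning an all-zero vector for non-standard symbols is also an assumption made for illustration.

```python
# One-letter codes of the 20 standard amino acids in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue):
    """Return the 20-D binary vector for one residue.
    Symbols outside the 20 standard codes give an all-zero vector."""
    vector = [0] * len(AMINO_ACIDS)
    index = AMINO_ACIDS.find(residue)
    if index >= 0:
        vector[index] = 1
    return vector

def encode_window(peptide):
    """Concatenate the one-hot vectors of all residues in a window."""
    encoded = []
    for residue in peptide:
        encoded.extend(one_hot(residue))
    return encoded

ala = one_hot("A")
tyr = one_hot("Y")
window_vector = encode_window("ACY")
```

For a window of 17 residues this yields 340 values of which at most 17 are ones, which illustrates the sparsity noted above.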
Derived feature extraction
In order to make full use of the protein sequence and prior knowledge, we also considered several derived protein features: (1) secondary structure, (2) disorder information, and (3) solvent accessibility. We extracted the protein secondary structure information predicted by the state-of-the-art algorithm PSIPRED, whose output file provides the probability profile of the three secondary structure states (helix, strand, and coil) for each residue in a protein sequence. From the output file we then constructed an N × 3 matrix for each protein, where N is the length of the peptide chain and 3 is the number of secondary structure types. In addition, we obtained the natively unstructured regions of each protein sequence from the DISOPRED2 server, which is recognized as one of the best servers for predicting disordered regions in protein sequences and gives the probability that each residue is disordered. We also obtained solvent accessibility from the SSpro program in the SCRATCH package; the results take the form of a solvent accessibility status, that is, ‘exposed’ or ‘buried’, for each residue in a protein sequence.
Bi-profile Bayes feature space
Before the extracted multi-view information is input into the machine learning algorithms for prediction, it is important to project these features into a proper space so that the learning algorithms can classify more accurately. The bi-profile sampling method was first introduced by Shao et al. for predicting methylation sites in proteins. This approach assumes that peptide chains from the positive dataset differ in their amino acid characteristics from peptide chains in the negative dataset. Shao et al. thereby improved prediction performance, and good accuracy was later also attained by Song et al. in predicting cleavage sites of caspase substrates. In this study, we apply the bi-profile sampling method to encode the sparse amino acid feature vector and some features that are unbalanced between categories. Those unbalanced features usually come from the uneven distribution of amino acid characteristics among different categories, as well as from the imbalance in size between the positive and negative datasets. The bi-profile sampling method can be briefly summarized as follows. A peptide chain P with n residues and no class label is written as $P = (a_1, a_2, \ldots, a_n)$, where $a_i$ indicates the i-th amino acid. Because we face a two-category classification task, i.e. distinguishing the real binding sites from normal ones, we extracted two classes of peptide chains distinguished by their central amino acid. We define S+ as the set of peptide chains containing only positive samples, and similarly obtain a negative set S–. For each element, that is, each single peptide chain in S+ or S–, the bi-profile sampling method calculates a series of posterior probabilities with respect to the corresponding peptide chain set.
Finally, we use a vector $V = (v_1, \ldots, v_n, v_{n+1}, \ldots, v_{2n})$ to represent the collective posterior probability for one peptide chain P, where the first n elements $(v_1, \ldots, v_n)$ denote the posterior probability of each amino acid at the corresponding position of P with respect to the positive dataset S+, while $(v_{n+1}, \ldots, v_{2n})$ indicate the posterior probability of each amino acid at the corresponding position of P with respect to the negative dataset S–.
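The encoding described above can be sketched as follows. The sketch assumes that the posterior probability at each position is estimated as the relative frequency of the observed symbol at that position in the positive or negative peptide set, as in the usual bi-profile Bayes formulation; the estimator is not spelled out in this text, and the function names and toy peptides are illustrative.

```python
from collections import Counter

def position_profile(peptides):
    """For each window position, the relative frequency of every symbol
    observed at that position among equal-length peptides."""
    length = len(peptides[0])
    profile = []
    for i in range(length):
        counts = Counter(p[i] for p in peptides)
        profile.append({sym: c / len(peptides) for sym, c in counts.items()})
    return profile

def bi_profile_encode(peptide, pos_profile, neg_profile):
    """Encode an n-residue peptide as a 2n-D vector: n values looked up in
    the positive profile followed by n values from the negative profile.
    Symbols never seen at a position get probability 0."""
    pos_part = [pos_profile[i].get(a, 0.0) for i, a in enumerate(peptide)]
    neg_part = [neg_profile[i].get(a, 0.0) for i, a in enumerate(peptide)]
    return pos_part + neg_part

# Toy 3-residue peptides standing in for the positive and negative sets.
s_plus = ["ACG", "ACA", "GCG"]
s_minus = ["TTT", "ATT", "TTG"]
vector = bi_profile_encode("ACT",
                           position_profile(s_plus),
                           position_profile(s_minus))
```

With a window of 17 residues the same procedure gives a 34-D vector. The same encoding applies to any per-residue symbol sequence, for example predicted secondary structure states instead of amino acids.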
Table 1 Summarization of different feature types and their vector representation dimensions
- Position-specific scoring matrix (PSSM)
- Position-specific scoring matrix normalized by logistic function (LogisticPSSM)
- Bi-profiled binary amino acid composition (Bipro-aa)
- Bi-profiled predicted secondary structure (Bipro-ss)
- Bi-profiled predicted residue disorder (Bipro-dis)
- Bi-profiled predicted residue solvent accessibility (Bipro-sa)
Support vector machines (SVM)
SVM is a machine learning approach based on the structural risk minimization principle of statistical learning theory, and it has been successfully applied in various areas of bioinformatics research [49–52]. In this paper, we used the freely available software package libsvm-3.11 developed by Chang and Lin, and we selected the Radial Basis Function (RBF) as the kernel function, since RBF has been demonstrated to perform well in many cases. Two parameters, the capacity parameter C and the kernel width g, then need to be optimized by a grid search approach. For further theoretical details about SVM, please refer to .
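The grid search over C and g can be sketched generically as below. The sketch does not call libsvm: `evaluate` is a caller-supplied function, for example one that returns the cross-validation accuracy of an RBF SVM trained with the given parameters. The exponent ranges are illustrative assumptions, not values reported in the paper, and `toy_score` is a stand-in so that the sketch runs on its own.

```python
import itertools
import math

def exponent_grid(start, stop, step):
    """Powers of two from 2**start to 2**stop (inclusive) in steps of `step`."""
    count = int((stop - start) / step) + 1
    return [2.0 ** (start + k * step) for k in range(count)]

def grid_search(evaluate, c_values, g_values):
    """Try every (C, g) pair and return (best_score, best_C, best_g)."""
    best = None
    for c, g in itertools.product(c_values, g_values):
        score = evaluate(c, g)
        if best is None or score > best[0]:
            best = (score, c, g)
    return best

# Assumed search ranges: C in 2^-5 ... 2^15, g in 2^3 ... 2^-15.
c_values = exponent_grid(-5, 15, 2)
g_values = exponent_grid(3, -15, -2)

def toy_score(c, g):
    """Stand-in for a cross-validation score; by construction it peaks
    at C = 2**3 and g = 2**-5."""
    return -abs(math.log2(c) - 3) - abs(math.log2(g) + 5)

best_score, best_c, best_g = grid_search(toy_score, c_values, g_values)
```

In practice `toy_score` would be replaced by a function that trains and cross-validates the SVM for each parameter pair.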
where TP, FP, TN, and FN denote the numbers of true positive, false positive, true negative, and false negative samples, respectively. We also use the Receiver Operating Characteristic (ROC) curve, which plots the true positive proportion, TP/(TP + FN), against the false positive proportion, FP/(FP + TN). The area under the ROC curve (AUC), which increases with the prediction performance, is also calculated.
Another important issue to be addressed here is how to report the evaluation indexes listed above objectively, especially in an imbalanced learning scenario. Let us begin by considering how a ROC curve is calculated. For a soft-type classifier, i.e., a classifier that outputs a continuous numeric value representing the confidence that a sample belongs to the predicted class, gradually adjusting the classification threshold produces a series of confusion matrices. From each confusion matrix, a ROC point with coordinates (TP/(TP + FN), FP/(FP + TN)) can then be computed, and the series of ROC points constitutes the ROC curve. In other words, each ROC point corresponds to a different confusion matrix, from which the evaluation indexes, i.e., Accuracy, Specificity, Sensitivity, and Precision, can be computed. Which ROC point, then, should be used to report the evaluation indexes of Eqs. (2)–(5)?
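The threshold sweep described above can be sketched as follows. The sketch assumes the standard definitions of Accuracy, Sensitivity, Specificity, and Precision in terms of TP, FP, TN, and FN (the equations themselves are not reproduced in this text), stores ROC points as (false positive proportion, true positive proportion), and integrates the curve with the trapezoidal rule. The function names and toy scores are illustrative, and both classes must be present in `labels`.

```python
def confusion(scores, labels, threshold):
    """Count TP, FP, TN, FN when samples scoring >= threshold are
    predicted positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def indexes(tp, fp, tn, fn):
    """Evaluation indexes for one confusion matrix (0.0 if undefined)."""
    def ratio(a, b):
        return a / b if b else 0.0
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }

def roc_points(scores, labels):
    """One (FPR, TPR) point per distinct threshold, starting at (0, 0)."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion(scores, labels, threshold)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(toy_scores, toy_labels)
area = auc(curve)
at_06 = indexes(*confusion(toy_scores, toy_labels, 0.6))
```

Each threshold gives a different set of indexes, which is why the choice of operating point matters when reporting them.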
Results and discussion
Analyzing the determinants of binding specificity
Table 2 Comparisons of residue structural features between real binding sites and normal ones in ATP168 and ATP227
Feature normalization is helpful for improving performance
Table 3 Combination of different feature groups and performance comparison based on 5-fold cross-validation tests on ATP168 and ATP227
Composition of different features (listed for ATP168 and for ATP227):
- LogisticPSSM + Bipro-aa
- LogisticPSSM + Bipro-dis
- LogisticPSSM + Bipro-sa
- LogisticPSSM + Bipro-ss
- LogisticPSSM + Bipro-all
Table 4 Combination of different feature groups and performance comparison based on leave-one-out jackknife tests on ATP168 and ATP227
Composition of different features (listed for ATP168 and for ATP227):
- LogisticPSSM + Bipro-aa
- LogisticPSSM + Bipro-dis
- LogisticPSSM + Bipro-sa
- LogisticPSSM + Bipro-ss
- LogisticPSSM + Bipro-all
Performance by fusing multi-view derived features
The above results have demonstrated that the evolutionary profile discriminates relatively well between binding and non-binding residues. We then tried to further improve the prediction performance by incorporating other multi-view features into the model. As the bi-profile sampling method has been demonstrated to be successful in other studies [41, 48], we applied this technique to encode the following features discussed above: (1) binary values of amino acid composition, (2) predicted protein secondary structures, (3) predicted amino acid disorder information, and (4) predicted solvent accessibility. Because the sliding window for extracting features contains 17 residues, and following the flowchart shown in Figure 1, each of these 4 features can be represented by a 34-D vector; they are abbreviated as Bipro-aa, Bipro-ss, Bipro-dis, and Bipro-sa respectively. Table 1 summarizes all the feature information and their vector representation dimensions.
Subsequently, we combine different feature groups and compare their discrimination ability in the corresponding trained models. For reference, Table 3 first lists performance comparisons, based on 5-fold cross-validation, among combinations of different feature groups. As shown in Table 3, incorporating any of the four feature groups into the logistic PSSM profile yields some improvement on both the ATP168 and ATP227 datasets. Taking ATP168 as an example, the LogisticPSSM profile coupled with predicted protein secondary structure information gives the best prediction accuracy among all combinations of feature groups, namely 77.56%. This probably means that protein secondary structure information has better discriminative ability for separating binding sites from non-binding sites, which is consistent with our analysis of Figure 4. When plotting the receiver operating characteristic (ROC) curves for the corresponding models in Table 3, we find that the AUC is also improved by fusing other information with the evolutionary features. For example, the AUC is 0.8579 with LogisticPSSM and Bipro-ss as input, compared with 0.8493 with LogisticPSSM as the only input feature. Table 4 lists the results obtained from the leave-one-out (jackknife) test, in which all the criteria are improved when multi-view features are considered. These results show that different features have their own merits and shortcomings, and that the fusion process can make them complementary to each other. Similar results are observed on the ATP227 dataset, as shown in Tables 3 and 4.
It is notable from Tables 3 and 4 that combining the PSSM profile with all four kinds of bi-profile features does not lead to the best prediction performance. This indicates that not all the features that can be calculated are useful. Moreover, simply combining these features serially also increases the information redundancy, which could in turn reduce the final accuracy. A proper feature fusion and selection approach should therefore be investigated in the future to further improve the prediction performance.
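Serial combination amounts to concatenating the per-residue feature vectors, as in the sketch below. The window size of 17 and the 34-D bi-profile vectors come from the text; the 20 PSSM columns per residue (and hence a 340-D LogisticPSSM vector) are an assumption, and the placeholder values carry no meaning.

```python
WINDOW = 17        # sliding-window size stated in the text
PSSM_COLUMNS = 20  # assumption: one PSSM column per standard amino acid

def serial_fusion(*feature_groups):
    """Concatenate per-residue feature vectors end to end."""
    fused = []
    for group in feature_groups:
        fused.extend(group)
    return fused

# Placeholder vectors with the dimensions discussed above.
logistic_pssm = [0.5] * (WINDOW * PSSM_COLUMNS)
bipro_ss = [0.1] * (2 * WINDOW)
bipro_sa = [0.2] * (2 * WINDOW)

pssm_plus_ss = serial_fusion(logistic_pssm, bipro_ss)
pssm_plus_ss_sa = serial_fusion(logistic_pssm, bipro_ss, bipro_sa)
```

Each added group lengthens the input vector, which illustrates how serial combination can add redundancy along with information.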
Imbalanced learning effects
Performance affected by dataset size
The core of statistical learning algorithms is learning prediction rules from training samples. The different sizes of the two datasets used in this study allow us to study how prediction performance is affected by the dataset scale. All the results listed in Tables 3 and 4 demonstrate that performance is consistently better when classifiers are trained on the ATP227 dataset than on ATP168, in both 5-fold and jackknife cross-validation tests. The results are typically improved by 2%–3% under the same feature inputs. For example, in the jackknife test the AUC is 0.8638 when the model is trained on ATP168 with LogisticPSSM and Bipro-ss as inputs, while this value increases to 0.8851 on ATP227. These results reveal that it is important to collect as many training samples as possible to make the learned rules more accurate. This is particularly important for small-sample problems, where experimentally derived knowledge is often very limited.
Comparison with other methods
We first compare our results with the previous results obtained by Raghava et al. on the same ATP168 dataset. In that study, the authors obtained an average accuracy of 75.11% based on 5-fold cross-validation on ATP168, while we achieved an accuracy of 77.56%, also based on 5-fold cross-validation, by incorporating the predicted protein secondary structure feature into the PSSM profile (with logistic normalization). In the leave-one-out cross-validation test, the combination of the PSSM profile and predicted protein secondary structures achieves 78.18% accuracy. After constructing the ATP227 dataset, the Kurgan lab built a predictor called ATPsite, for which an AUC value of 0.854 has been reported on ATP227. On the same dataset, the AUC values in this study are 0.8813 and 0.8851 in the 5-fold and jackknife cross-validation tests respectively, with LogisticPSSM and Bipro-ss as the input features. These results indicate that the proposed method outperforms these state-of-the-art approaches on the two benchmarks and can play an important complementary role to existing predictors. Prediction performance could be expected to improve further when consensus predictions are made based on these multiple predictors.
In this paper, we developed a protocol for predicting protein-ATP binding residues. In order to reflect the multi-view characteristics of binding residues, multiple sequential and sequence-derived structural features are exploited and further encoded by the bi-profile sampling approach. Our experimental results show that prediction performance can be improved by fusing multi-view features. Furthermore, we show that increasing the dataset size also helps enhance the power of ATP binding residue predictors. Performance is found to be affected by the imbalance between positive and negative samples. The current prediction protocol is expected to play an important complementary role to existing approaches for large-scale ATP binding protein function annotation.
The authors wish to thank the two anonymous reviewers for their valuable suggestions and comments, which were very helpful for improving this paper. This work was supported by the National Natural Science Foundation of China (No. 91130033, 61175024), Shanghai Science and Technology Commission (No. 11JC1404800), A Foundation for the Author of National Excellent Doctoral Dissertation of PR China (No. 201048), sponsored by Shanghai Pujiang Program and Innovation Program of Shanghai Municipal Education Commission (10ZZ17) and Program for New Century Excellent Talents in University (NCET-11-0330).
- Shapiro L, Harris T: Finding function through structural genomics. Curr Opin Biotechnol 2000, 11(1):31–35. 10.1016/S0958-1669(99)00064-6View ArticlePubMed
- Ofran Y, Punta M, Schneider R, Rost B: Beyond annotation transfer by homology: novel protein-function prediction methods to assist drug discovery. Drug Discov Today 2005, 10(21):1475–1482. 10.1016/S1359-6446(05)03621-4View ArticlePubMed
- Kurgan L, Cios K, Chen K: SCPRED: accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences. BMC Bioinforma 2008, 9: 226. 10.1186/1471-2105-9-226View Article
- Gromiha MM: Protein bioinformatics: from sequence to function. Academic Press/Elsevier, Amsterdam; Boston; 2010.
- Juncker AS, Jensen LJ, Pierleoni A, Bernsel A, Tress ML, Bork P, von Heijne G, Valencia A, Ouzounis CA, Casadio R, et al.: Sequence-based feature prediction and annotation of proteins. Genome Biol 2009, 10(2):206. 10.1186/gb-2009-10-2-206PubMed CentralView ArticlePubMed
- Bergamini CM, Dondi A, Lanzara V, Squerzanti M, Cervellati C, Montin K, Mischiati C, Tasco G, Collighan R, Griffin M, et al.: Thermodynamics of binding of regulatory ligands to tissue transglutaminase. Amino Acids 2010, 39(1):297–304. 10.1007/s00726-009-0442-5View ArticlePubMed
- Talavera D, Robertson DL, Lovell SC: Characterization of protein-protein interaction interfaces from a single species. PLoS One 2011, 6(6):e21053. 10.1371/journal.pone.0021053PubMed CentralView ArticlePubMed
- Bartoli L, Martelli PL, Rossi I, Fariselli P, Casadio R: The prediction of protein-protein interacting sites in genome-wide protein interaction networks: the test case of the human cell cycle. Curr Protein Pept Sci 2010, 11(7):601–608. 10.2174/138920310794109157View ArticlePubMed
- Zhao H, Yang Y, Zhou Y: Structure-based prediction of RNA-binding domains and RNA-binding sites and application to structural genomics targets. Nucleic Acids Res 2011, 39(8):3017–3025. 10.1093/nar/gkq1266PubMed CentralView ArticlePubMed
- Gromiha MM, Yabuki Y, Suresh MX, Thangakani AM, Suwa M, Fukui K: TMFunction: database for functional residues in membrane proteins. Nucleic Acids Res 2009, 37(Database issue):D201–204.PubMed CentralView ArticlePubMed
- Gromiha MM: Protein folding, stability and interactions. Curr Protein Pept Sci 2010, 11(7):497. 10.2174/138920310794109102View ArticlePubMed
- Chen K, Mizianty MJ, Kurgan L: Prediction and analysis of nucleotide-binding residues using sequence and sequence-derived structural descriptors. Bioinformatics 2012, 28(3):331–341. 10.1093/bioinformatics/btr657View ArticlePubMed
- Firoz A, Malik A, Joplin KH, Ahmad Z, Jha V, Ahmad S: Residue propensities, discrimination and binding site prediction of adenine and guanine phosphates. BMC Biochem 2011, 12: 20. 10.1186/1471-2091-12-20PubMed CentralView ArticlePubMed
- Glaser F, Pupko T, Paz I, Bell RE, Bechor-Shental D, Martz E, Ben-Tal N: ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information. Bioinformatics 2003, 19(1):163–164. 10.1093/bioinformatics/19.1.163View ArticlePubMed
- Lichtarge O, Bourne HR, Cohen FE: An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 1996, 257(2):342–358. 10.1006/jmbi.1996.0167View ArticlePubMed
- Thornton JM, George RA, Spriggs RV, Bartlett GJ, Gutteridge A, MacArthur MW, Porter CT, Al-Lazikani B, Swindells MB: Effective function annotation through catalytic residue conservation. Proc Natl Acad Sci U S A 2005, 102(35):12299–12304. 10.1073/pnas.0504833102PubMed CentralView ArticlePubMed
- Yeates TO, Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D: Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc Natl Acad Sci U S A 1999, 96(8):4285–4288. 10.1073/pnas.96.8.4285
- Thibert B, Bredesen DE, del Rio G: Improved prediction of critical residues for protein function based on network and phylogenetic analyses. BMC Bioinforma 2005, 6: 213. 10.1186/1471-2105-6-213
- Todd AE, Orengo CA, Thornton JM: Evolution of function in protein superfamilies, from a structural perspective. J Mol Biol 2001, 307(4):1113–1143. 10.1006/jmbi.2001.4513
- Tian W, Skolnick J: How well is enzyme function conserved as a function of pairwise sequence identity? J Mol Biol 2003, 333(4):863–882. 10.1016/j.jmb.2003.08.057
- Ezkurdia I, Bartoli L, Fariselli P, Casadio R, Valencia A, Tress ML: Progress and challenges in predicting protein-protein interaction sites. Brief Bioinform 2009, 10(3):233–246.
- de Vries SJ, Bonvin AM: Intramolecular surface contacts contain information about protein-protein interface regions. Bioinformatics 2006, 22(17):2094–2098. 10.1093/bioinformatics/btl275
- Nooren IM, Thornton JM: Structural characterisation and functional significance of transient protein-protein interactions. J Mol Biol 2003, 325(5):991–1018. 10.1016/S0022-2836(02)01281-0
- Moreira IS, Fernandes PA, Ramos MJ: Hot spots–a review of the protein-protein interface determinant amino-acid residues. Proteins 2007, 68(4):803–812. 10.1002/prot.21396
- DeLano WL: Unraveling hot spots in binding interfaces: progress and challenges. Curr Opin Struct Biol 2002, 12(1):14–20. 10.1016/S0959-440X(02)00283-X
- Ma B, Elkayam T, Wolfson H, Nussinov R: Protein-protein interactions: structurally conserved residues distinguish between binding sites and exposed protein surfaces. Proc Natl Acad Sci U S A 2003, 100(10):5772–5777. 10.1073/pnas.1030237100
- Burgoyne NJ, Jackson RM: Predicting protein interaction sites: binding hot-spots in protein-protein and protein-ligand interfaces. Bioinformatics 2006, 22(11):1335–1342. 10.1093/bioinformatics/btl079
- Bartlett GJ, Porter CT, Borkakoti N, Thornton JM: Analysis of catalytic residues in enzyme active sites. J Mol Biol 2002, 324(1):105–121. 10.1016/S0022-2836(02)01036-7
- Chea E, Livesay DR: How accurate and statistically robust are catalytic site predictions based on closeness centrality? BMC Bioinforma 2007, 8: 153. 10.1186/1471-2105-8-153
- Amitai G, Shemesh A, Sitbon E, Shklar M, Netanely D, Venger I, Pietrokovski S: Network analysis of protein structures identifies functional residues. J Mol Biol 2004, 344(4):1135–1146. 10.1016/j.jmb.2004.10.055
- Bate P, Warwicker J: Enzyme/non-enzyme discrimination and prediction of enzyme active site location using charge-based methods. J Mol Biol 2004, 340(2):263–276. 10.1016/j.jmb.2004.04.070
- Ben-Shimon A, Eisenstein M: Looking at enzymes from the inside out: the proximity of catalytic residues to the molecular centroid can be used for detection of active sites and enzyme-ligand interfaces. J Mol Biol 2005, 351(2):309–326. 10.1016/j.jmb.2005.06.047
- Zhang H, Zhang T, Chen K, Kedarisetti KD, Mizianty MJ, Bao Q, Stach W, Kurgan L: Critical assessment of high-throughput standalone methods for secondary structure prediction. Brief Bioinform 2011, 12(6):672–688. 10.1093/bib/bbq088
- Gromiha MM, Yokota K, Fukui K: Sequence and structural analysis of binding site residues in protein-protein complexes. Int J Biol Macromol 2010, 46(2):187–192. 10.1016/j.ijbiomac.2009.11.009
- Volkamer A, Griewel A, Grombacher T, Rarey M: Analyzing the topology of active sites: on the prediction of pockets and subpockets. J Chem Inf Model 2010, 50(11):2041–2052. 10.1021/ci100241y
- Sankararaman S, Sha F, Kirsch JF, Jordan MI, Sjolander K: Active site prediction using evolutionary and structural information. Bioinformatics 2010, 26(5):617–624. 10.1093/bioinformatics/btq008
- Hirokawa N, Takemura R: Biochemical and molecular characterization of diseases linked to motor proteins. Trends Biochem Sci 2003, 28(10):558–565. 10.1016/j.tibs.2003.08.006
- Bustamante C, Chemla YR, Forde NR, Izhaky D: Mechanical processes in biochemistry. Annu Rev Biochem 2004, 73: 705–748. 10.1146/annurev.biochem.72.121801.161542
- Chauhan JS, Mishra NK, Raghava GP: Identification of ATP binding residues of a protein from its primary sequence. BMC Bioinforma 2009, 10: 434. 10.1186/1471-2105-10-434
- Chen K, Mizianty MJ, Kurgan L: ATPsite: sequence-based prediction of ATP-binding residues. Proteome Sci 2011, 9(Suppl 1):S4. 10.1186/1477-5956-9-S1-S4
- Shao J, Xu D, Tsai SN, Wang Y, Ngai SM: Computational identification of protein methylation sites through bi-profile Bayes feature extraction. PLoS One 2009, 4(3):e4920. 10.1371/journal.pone.0004920
- Bauer RA, Gunther S, Jansen D, Heeger C, Thaben PF, Preissner R: SuperSite: dictionary of metabolite and drug binding sites in proteins. Nucleic Acids Res 2009, 37(Database issue):D195–200.
- Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22(13):1658–1659. 10.1093/bioinformatics/btl158
- Chen K, Mizianty MJ, Kurgan L: ATPsite: sequence-based prediction of ATP-binding residues. Proteome Sci 2011, 9(Suppl 1):S4. 10.1186/1477-5956-9-S1-S4
- Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999, 292(2):195–202. 10.1006/jmbi.1999.3091
- Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT: Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 2004, 337(3):635–645. 10.1016/j.jmb.2004.02.002
- Cheng J, Randall AZ, Sweredoski MJ, Baldi P: SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res 2005, 33(Web Server issue):W72–76.
- Song J, Tan H, Shen H, Mahmood K, Boyd SE, Webb GI, Akutsu T, Whisstock JC: Cascleave: towards more accurate prediction of caspase substrate cleavage sites. Bioinformatics 2010, 26(6):752–760. 10.1093/bioinformatics/btq043
- Smialowski P, Schmidt T, Cox J, Kirschner A, Frishman D: Will my protein crystallize? A sequence-based predictor. Proteins 2006, 62(2):343–355.
- Smialowski P, Martin-Galiano AJ, Mikolajka A, Girschick T, Holak TA, Frishman D: Protein solubility: sequence based prediction and experimental verification. Bioinformatics 2007, 23(19):2536–2542. 10.1093/bioinformatics/btl623
- Song J, Tan H, Takemoto K, Akutsu T: HSEpred: predict half-sphere exposure from protein sequences. Bioinformatics 2008, 24(13):1489–1497. 10.1093/bioinformatics/btn222
- Zhang H, Zhang T, Chen K, Shen S, Ruan J, Kurgan L: Sequence based residue depth prediction using evolutionary information and predicted secondary structure. BMC Bioinforma 2008, 9: 388. 10.1186/1471-2105-9-388
- Chang CC, Lin CJ: LIBSVM: a library for support vector machines. 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
- Vapnik VN: The nature of statistical learning theory. 2nd edition. New York: Springer; 2000.
- He H, Garcia EA: Learning from imbalanced data. IEEE Trans Knowl Data Eng 2009, 21(9):1263–1284.
- Jo T, Japkowicz N: Class Imbalances versus Small Disjuncts. ACM SIGKDD Explorations Newsletter 2004, 6(1):40–49. 10.1145/1007730.1007737
- Tompa P: Unstructural biology coming of age. Curr Opin Struct Biol 2011, 21(3):419–425. 10.1016/j.sbi.2011.03.012
- Dosztanyi Z, Tompa P: Prediction of protein disorder. Methods Mol Biol 2008, 426: 103–115. 10.1007/978-1-60327-058-8_6
- Hegyi H, Tompa P: Intrinsically disordered proteins display no preference for chaperone binding in vivo. PLoS Comput Biol 2008, 4(3):e1000017. 10.1371/journal.pcbi.1000017
- Faraggi E, Xue B, Zhou Y: Improving the prediction accuracy of residue solvent accessibility and real-value backbone torsion angles of proteins by guided-learning through a two-layer neural network. Proteins 2009, 74(4):847–856. 10.1002/prot.22193
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.