Identification of ATP binding residues of a protein from its primary sequence
© Chauhan et al; licensee BioMed Central Ltd. 2009
Received: 6 August 2009
Accepted: 19 December 2009
Published: 19 December 2009
One of the major challenges in the post-genomic era is to provide functional annotations for the large number of proteins arising from genome sequencing projects. The function of many proteins depends on their interaction with small molecules or ligands. ATP is one such important ligand that plays a critical role as a coenzyme in the function of many proteins. There is a need to develop methods for identifying ATP interacting residues in ATP binding proteins (ABPs) in order to understand the mechanism of protein-ligand interaction.
We compared the amino acid composition of ATP interacting and non-interacting regions of proteins and observed that certain residues are preferred for interaction with ATP. This study describes several models developed for identifying ATP interacting residues in a protein. All models were trained and tested on 168 non-redundant ABP chains. First, we developed a Support Vector Machine (SVM) based model using the primary sequence of proteins and obtained a maximum MCC of 0.33 with an accuracy of 66.25%. Second, another SVM based model was developed using the position specific scoring matrix (PSSM) generated by PSI-BLAST. The performance of this model (MCC 0.5) improved significantly over the previous one, in which only the primary sequence of the proteins was used.
This study demonstrates that it is possible to predict 'ATP interacting residues' in a protein with moderate accuracy using its sequence. Evolutionary information is important for the identification of 'ATP interacting residues', as it provides more information than the primary sequence alone. This method will be useful for researchers studying ATP-binding proteins. Based on this study, a web server has been developed for predicting 'ATP interacting residues' in a protein: http://www.imtech.res.in/raghava/atpint/.
Adenosine-5'-triphosphate (ATP) is an important molecule in cell biology, serving as both an energy molecule and a coenzyme. It interacts with a large number of proteins during cellular activities and plays a crucial role in various biological reactions. ATP binding proteins (ABPs) have a binding site that allows the ATP molecule to interact. This binding site is a micro-environment where ATP is captured and hydrolyzed to ADP, releasing energy that the protein uses to "do work" by changing its shape and/or making the enzyme catalytically active. These proteins are powered by the hydrolysis of ATP and convert this chemical energy into mechanical work. Many ATP binding proteins are transmembrane proteins responsible for the transport of a wide variety of substrates (e.g. lipids, sterols) across extra- and intracellular membranes. In summary, ATP binding proteins have important roles in membrane transport, muscle contraction, cellular motility and the regulation of various metabolic processes.
Thus it is important to identify ATP binding proteins and the 'ATP interacting residues' in these proteins. The experimental identification of residues that interact with ATP in a protein is costly and time consuming. There is therefore a need for alternative approaches such as computational techniques, which have been used successfully for predicting the function of proteins [3–12]. In the past, methods have been developed for the prediction of polynucleotide (DNA/RNA) interacting residues [8, 13, 14]. Saito et al developed a general method for predicting nucleotide-binding sites in a protein, which successfully predicts 31% of ATP binding sites (not ATP interacting residues). To the best of our knowledge, no prediction method has been developed specifically for detecting the residues that interact with ATP from a protein sequence. Thus, there is a need to develop a method for predicting 'ATP interacting residues' in a protein in order to understand protein-ATP interaction.
In this study, a systematic attempt has been made to develop a highly accurate and reliable method for predicting 'ATP interacting residues' in a protein. Initially, Support Vector Machine (SVM) based models were developed using the protein sequence. It has been shown in the past that evolutionary information provides more information [16, 17] than the protein sequence alone, so we also used evolutionary information in the form of a PSSM profile for developing a prediction method. All the models developed in this study were evaluated using the five-fold cross-validation technique.
We extracted 360 ATP binding protein chains from the SuperSite encyclopedia. After removing redundant sequences using the program CD-HIT, a total of 267 non-redundant PDB chains were obtained in which no two sequences have more than 40% identity. In the next step, we examined these proteins using the software Ligand Protein Contact (LPC) and removed those proteins that are not ATP binding proteins according to LPC. Our final dataset has 168 non-redundant ATP binding protein chains, available at http://www.imtech.res.in/raghava/atpint/atpdataset
Evaluation of a newly developed method is a major challenge for researchers. One of the commonly used techniques for evaluating a model is jack-knife or leave-one-out cross-validation (LOOCV) [4, 20, 21]. In this technique one sequence is used for testing and the remaining sequences for training; the process is repeated so that each sequence is used once for testing. Though this is the most thorough technique for evaluation, it is time consuming and computationally intensive. Thus, we used 5-fold cross-validation in this study, in which the sequences were randomly divided into five sets. One set was used for testing and the remaining four sets were used for training. This process was repeated five times so that each set was used once for testing [9, 22]. The final performance was obtained by averaging the performance over all five sets.
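The five-fold procedure described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `five_fold_splits`, the fixed random seed and the placeholder chain identifiers are assumptions made for the example.

```python
import random


def five_fold_splits(chain_ids, n_folds=5, seed=0):
    """Yield (train_ids, test_ids) pairs so that each chain is tested exactly once."""
    ids = list(chain_ids)
    # Random division of the chains; the seed is fixed only for reproducibility.
    random.Random(seed).shuffle(ids)
    # Deal the shuffled chains into n_folds sets of near-equal size.
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [c for j, fold in enumerate(folds) if j != k for c in fold]
        yield train, test


# Placeholder identifiers standing in for the 168 chains of the dataset.
splits = list(five_fold_splits([f"chain_{i}" for i in range(168)]))
```

Each model would be trained on `train` and scored on `test`, and the five scores averaged to give the final performance.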
Pattern or window size
We generated overlapping patterns (segments) of different window sizes, from 7 to 25, for every ATP binding protein sequence. If the central residue of a pattern was an 'ATP interacting residue', the pattern was assigned as a positive pattern (ATP interacting); otherwise it was assigned as a negative pattern (non-ATP interacting). To generate the patterns corresponding to the terminal residues of a protein sequence, we added (L-1)/2 dummy residues "X" at both termini of the protein (where L is the length of the pattern). For example, for window size 17 we added 8 "X" before the N-terminus and 8 "X" after the C-terminus, in order to create M patterns from a sequence of length M [16, 17]. Finally, we obtained a total of 3056 unique windows/patterns of length 17 from 3082 ATP interacting residues.
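The pattern generation described above can be sketched in a few lines. The function name `make_windows`, the 0-based position convention and the toy sequence are assumptions made for this illustration; only the "X" padding of (L-1)/2 residues and the labelling by central residue come from the text.

```python
def make_windows(sequence, interacting, window=17):
    """Return one (pattern, label) pair per residue of the sequence.

    interacting: set of 0-based positions of ATP interacting residues.
    label is +1 if the central residue interacts with ATP, otherwise -1.
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd so that a central residue exists")
    half = (window - 1) // 2
    # Pad both termini with (L-1)/2 dummy residues so every residue gets a window.
    padded = "X" * half + sequence + "X" * half
    patterns = []
    for i in range(len(sequence)):
        pattern = padded[i:i + window]
        label = 1 if i in interacting else -1
        patterns.append((pattern, label))
    return patterns


# Toy sequence and positions, chosen only for illustration.
example = make_windows("MKTAYIAKQR", interacting={2, 5}, window=7)
```

A sequence of length M yields exactly M patterns, and the residue at index `half` of each pattern is the residue being labelled.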
Support Vector Machine (SVM)
In most of our studies, including this one, we have implemented SVM using SVM light, a freely downloadable package available from http://svmlight.joachims.org/. SVM is a machine learning approach based on the structural risk minimization principle of statistical learning theory. The main reason we use this package frequently is that it supports a variety of kernels and parameters.
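As a rough illustration of how a window could be handed to SVM light, the sketch below writes one training example in the package's sparse text format (`<target> <index>:<value> ...`, with indices in ascending order). The binary encoding of 21 values per position (20 amino acids plus "X") is an assumption about how the binary patterns were built, and the helper names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYX"  # 20 residues plus the dummy residue "X"


def binary_features(pattern):
    """Map a window to sparse features {index: 1}; feature indices start at 1."""
    features = {}
    for pos, residue in enumerate(pattern):
        # str.index raises ValueError for symbols outside the 21-letter alphabet.
        column = AMINO_ACIDS.index(residue)
        features[pos * len(AMINO_ACIDS) + column + 1] = 1
    return features


def svmlight_line(label, features):
    """Format one example as '<target> <index>:<value> ...' with ascending indices."""
    parts = [f"{index}:{value}" for index, value in sorted(features.items())]
    return " ".join([f"{label:+d}"] + parts)


line = svmlight_line(1, binary_features("XAC"))
```

With a file of such lines, an RBF-kernel model with the learning parameters reported for the sequence-based model could plausibly be trained with `svm_learn -t 2 -g 0.1 -c 2 -j 3 train.dat model`, where the file names are placeholders.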
Position Specific Scoring Matrix (PSSM)
In this work, PSSM profiles were generated using PSI-BLAST, where each protein sequence was searched against the SWISS-PROT dataset using an E-value cut-off of 0.001. This profile contains the probability of occurrence of each type of amino acid at each position, along with insertions/deletions. Hence, the PSSM is considered a measure of residue conservation at a given location. The evolutionary information for each residue is encapsulated in a vector of 21 dimensions, so the PSSM of a protein with M residues has size 21 × M, where M is the length of the target sequence and each element represents the frequency of occurrence of one of the 20 amino acids, or of the dummy amino acid "X", at one position in the alignment.
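To make the dimensions concrete, the sketch below flattens the PSSM rows of one window into a single feature vector: a window of 17 residues with 21 values each gives 357 features. The function name, the row-per-residue layout and the use of all-zero rows beyond the termini are assumptions made for illustration; no score scaling is applied here.

```python
def pssm_window_features(pssm, center, window=17):
    """Flatten the PSSM rows of the window centred on `center` into one vector.

    pssm: list of M rows, one per residue, each holding 21 values.
    """
    half = (window - 1) // 2
    width = len(pssm[0])
    features = []
    for i in range(center - half, center + half + 1):
        if 0 <= i < len(pssm):
            features.extend(pssm[i])
        else:
            # Assumption: positions beyond the termini contribute all-zero rows.
            features.extend([0.0] * width)
    return features


# Toy 5-residue profile with a single 1.0 per row, for illustration only.
toy_pssm = [[float(r == c) for c in range(21)] for r in range(5)]
vector = pssm_window_features(toy_pssm, center=0, window=17)
```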
In this study we used the following seven important structural features as SVM input features:
Hydrophobicity
The hydrophobic effect is often a major contributor to the binding affinity between a protein and its ligand. All hydrophobicity values were taken from the Fauchère and Pliska scale.
Beta-sheet
Many nucleotide-binding proteins have a P-loop, or phosphate-binding loop, which is an ATP binding site motif. It is a glycine-rich loop preceded by a beta sheet. Thus the beta-sheet propensity may be an important feature in ATP binding proteins. It was obtained from the Chou and Fasman scale.
Polarity
Polarity is a separation of electric charge leading to a molecule having an electric dipole. It results from the uneven partial charge distribution between various amino acids in a protein. We used the polarity scale values of Grantham.
Solvation potential
The solvation potential is an important parameter of proteins that indicates the preference of amino acid residues to be exposed to solvent or buried in the interface. To calculate the solvation potential of each amino acid, we used the scale of Jones et al.
Residue interface propensities
The residue interface propensity is an important feature of protein binding sites that describes the propensity of each amino acid residue to occur in the interface area. Residue interface propensities for each of the 20 amino acids were taken from Jones and Thornton.
Net charge
The surface of a protein has a net charge that depends on the number and identities of the charged amino acids, and on pH. At a specific pH the positive and negative charges balance and the net charge is zero. The net charge of each amino acid was obtained from Klein et al.
Average accessible surface area
The accessible surface area is the surface area of a protein that is accessible to another protein or ligand. The average accessible surface area scale values of each amino acid were obtained from Janin et al.
To evaluate the performance of the methods, we used standard parameters that are routinely used in this field. The following is a brief description of the threshold dependent parameters that were used for evaluation.
Matthews's Correlation Coefficient (MCC)
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP is the number of correctly predicted ATP interacting residues, TN is the number of correctly predicted non-interacting residues, FP is the number of non-interacting residues predicted as interacting residues and FN is the number of interacting residues wrongly predicted as non-interacting.
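Given these four counts, the threshold dependent parameters can be computed as in the sketch below. It assumes the standard definitions of sensitivity, specificity, accuracy and MCC; the function name and the example counts are made up for illustration and are not results from this study.

```python
import math


def threshold_metrics(tp, tn, fp, fn):
    """Compute sensitivity, specificity, accuracy and MCC from confusion counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Illustrative counts only.
metrics = threshold_metrics(tp=40, tn=45, fp=5, fn=10)
```

An MCC of 1 corresponds to a perfect prediction and 0 to a prediction no better than random.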
All the parameters described above are threshold dependent, so the performance of a model depends on the threshold chosen. To provide a comprehensive view of a model's performance, we calculated these parameters at different thresholds (ranging from +1 to -1).
Area under the ROC Curve (AUC)
All the measures described above have a common drawback: their values depend on the threshold selected. A well-known threshold independent measure is the Receiver Operating Characteristic (ROC) curve, a plot of the true positive proportion, TP/(TP+FN), against the false positive proportion, FP/(FP+TN). We used the SPSS package to plot the ROC curve and calculate the AUC.
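For readers without access to SPSS, the same quantity can be approximated with a short script. The sketch below is not the procedure used in this study: it sweeps the threshold over the sorted SVM scores, collects the (false positive proportion, true positive proportion) points and integrates them with the trapezoidal rule. The function name and the +1/-1 label convention are assumptions.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve by trapezoidal integration.

    scores: SVM output per residue; labels: +1 (interacting) or -1 (non-interacting).
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes are needed to draw a ROC curve")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Residues with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2
    return auc
```

An AUC of 0.5 corresponds to random ranking of residues and 1.0 to a ranking that places every interacting residue above every non-interacting one.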
Prediction using BLAST
One of the methods routinely used for predicting the function of a new protein sequence is BLAST. It is a similarity based method that identifies segments/regions in the query sequence that are similar to a target sequence. This method can be used to predict ATP interacting residues in a protein by searching the query protein against a database of ATP-binding proteins. To evaluate the performance of BLAST on the dataset used in this study, we searched each ATP binding protein chain against the remaining ATP binding protein chains. Only 71 ATP binding protein chains showed similarity (a BLAST hit) with other ATP binding protein chains. Thus, BLAST cannot predict any ATP interacting residues in 97 of the 168 ATP binding protein chains in our dataset. To evaluate the performance of BLAST on the protein chains that did show similarity, we randomly picked 10 proteins that have similarity with other ATP-binding protein chains. Even on these proteins the performance of BLAST was very poor: the sensitivity was 44% and the probability of correct prediction was 43.37%. This result suggests that BLAST is not suitable for predicting ATP interacting residues in a protein.
SVM Modules using single sequence
The performance of the SVM model (learning parameters: g: 0.1, c: 2, j: 3) using amino acid sequence. Here g is the parameter of the RBF kernel, c is the trade-off between training error and margin, and j is the cost-factor.
The performance of SVM models using binary patterns of different window sizes. Learning parameters of the table rows, in order: g:0.1 c:1 j:1; g:0.1 c:1 j:1; g:0.1 c:3 j:3; g:0.1 c:2 j:1; g:0.1 c:1 j:1; g:0.1 c:2 j:3; g:0.1 c:1 j:1; g:0.1 c:1 j:1; g:0.1 c:2 j:2; g:0.1 c:2 j:1.
SVM Modules using evolutionary information
The performance of the SVM model (learning parameters: g: 0.01, c: 4, j: 1) using the PSI-BLAST profile.
SVM Module based on physico-chemical parameters
The performance of the SVM model (learning parameters: g: 0.001, c: 4, j: 1) using seven physico-chemical properties.
ATP interacting proteins play a significant role in signaling pathways, in which ATP is used as a substrate by kinases that phosphorylate proteins. The identification of ATP interacting residues is difficult using experimental techniques, so there is a need for computational techniques that identify ATP interacting residues in a protein from its sequence. Saito et al developed a general method for predicting binding sites using an empirical scoring system. Although this method detects ATP binding sites on a protein with low accuracy, it provides no information about ATP interacting residues. There are methods that identify ATP interacting residues in a protein if its structure is known [19, 37]. These are essentially assignment methods, which assign ATP interacting residues in a PDB file. In this study an attempt has been made to predict ATP interacting residues in a protein with high accuracy. An obvious question is whether existing techniques can be used for predicting ATP interacting residues. First, we used BLAST for this purpose. As shown in the Results section, we obtained poor performance in terms of both sensitivity and probability of correct prediction. Thus a routinely used similarity search technique like BLAST is not suitable for this problem. Next, we examined motif-based techniques. We searched for motifs using FingerPRINTScan in the 168 ATP binding protein chains used in this study and found motifs in only 54 proteins; no motif was found in the remaining 114 proteins. These motifs cover only around 11% of ATP interacting residues (Table S2; see Additional file), and no common motif was found across the ATP binding proteins (Table S3; see Additional file). These results show that motif-based methods cannot be used for identifying ATP interacting residues.
This study is a systematic attempt to understand and predict ATP interacting residues in a protein. First, we analyzed ATP interacting residues and their neighbors and found a significant difference between interacting and non-interacting residues. This suggests that ATP interacting residues can be predicted using machine learning techniques. Previous studies have shown that SVM performs better than other artificial intelligence techniques, particularly on small datasets. Thus an SVM based model was developed for predicting ATP interacting residues in a protein from its primary structure, and it achieved reasonable accuracy. As PSSM based evolutionary information provides better information, we also attempted to develop a method using evolutionary information for predicting ATP interacting residues. The performance of the SVM module increases significantly when evolutionary information is used in place of the single sequence. This demonstrates that evolutionary information is important for predicting ATP interacting residues. In this study we used window size 17, which raises the question of why 17 was chosen. Though window size 17 is frequently used in the prediction of secondary structure or interacting residues, this does not mean that it is applicable to every problem. One should try different window sizes in order to find the optimal window size for a given problem. We tried various window sizes from 7 to 25 residues for predicting ATP interacting residues and achieved maximum performance for window size 17. Although the accuracy of the binary pattern with window size 25 is better than that with 17, the difference between its sensitivity and specificity is much larger. This means that window size 17 is the most suitable for predicting ATP interacting residues. This is the first study of its kind, so it is difficult to compare its performance with existing methods. We hope this study will be useful for researchers working in this area. It is highly probable that other researchers will work on this problem and develop better methods.
In this study we have developed, for the first time, a method for predicting ATP interacting residues in a protein from its sequence using an SVM based model. It was observed that the evolutionary information (PSSM) based SVM modules perform better than the single sequence based modules. Though a number of previous studies have shown that evolutionary information is important for predicting the structural components of a protein, we have demonstrated for the first time that evolutionary information is also important for predicting ATP interacting residues. One of the major features of this study is that we provide a web service for predicting ATP interacting residues in a protein. Our web server, ATPint, allows users to identify ATP binding residues using the best model trained on our dataset. This server will help experimental biologists predict the ATP interacting residues of a protein from its primary sequence and reduce the number of experiments required.
We developed a web server, "ATPint", using a CGI-Perl 5.8.4 script to predict ATP interacting residues in proteins; it is available at http://www.imtech.res.in/raghava/atpint. The server allows users to predict ATP interacting residues using PSSM based SVM models. Users can select any threshold between -1 and +1; the default is 0.2. The prediction result is presented in graphical form, with the predicted ATP interacting and non-interacting residues displayed in different colors.
We are grateful to Dr Alok Kumar Mondal for proofreading the manuscript. The authors are thankful to Hifzur Rahman Ansari for his help in developing the web server. We are grateful to the funding agencies, the Council of Scientific and Industrial Research (CSIR) and the Department of Biotechnology (DBT), India, for supporting this study.
- Bustamante C, Chemla YR, Forde NR, Izhaky D: Mechanical processes in biochemistry. Annual Review of Biochemistry 2004, 73: 705–748. 10.1146/annurev.biochem.72.121801.161542
- Hirokawa N, Tekamura R: Biochemical and molecular characterization of diseases linked to motor proteins. Trends in Biochemical Sciences 2003, 28: 558–565. 10.1016/j.tibs.2003.08.006
- Garg A, Bhasin M, Raghava GPS: Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search. J Biol Chem 2005, 280: 14427–14432. 10.1074/jbc.M411789200
- Chou KC, Zhang CT: Review: Prediction of protein structural classes. Critical Reviews in Biochemistry and Molecular Biology 1995, 30: 275–349. 10.3109/10409239509083488
- Chou KC, Shen HB: Cell-PLoc: A package of web-servers for predicting subcellular localization of proteins in various organisms. Nature Protocols 2008, 3: 153–162. 10.1038/nprot.2007.494
- Chou KC, Shen HB: Review: Recent progresses in protein subcellular location prediction. Analytical Biochemistry 2007, 370: 1–16. 10.1016/j.ab.2007.07.006
- Chou KC, Shen HB: MemType-2L: A Web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochem Biophys Res Comm 2007, 360: 339–345. 10.1016/j.bbrc.2007.06.027
- Ahmad S, Gromiha M, Sarai A: Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics 2004, 20: 477–486. 10.1093/bioinformatics/btg432
- Kumar M, Gromiha M, Raghava GPS: Prediction of RNA binding sites in a protein using SVM and PSSM profile. Proteins: Structure, Function, and Bioinformatics 2007, 71: 189–194. 10.1002/prot.21677
- Shen HB, Chou KC: EzyPred: A top-down approach for predicting enzyme functional classes and subclasses. Biochem Biophys Res Comm 2007, 364: 53–59. 10.1016/j.bbrc.2007.09.098
- Xiao X, Wang P, Chou KC: GPCR-CA: A cellular automaton image approach for predicting G-protein-coupled receptor functional classes. Journal of Computational Chemistry 2009, 30: 1414–1423. 10.1002/jcc.21163
- Chou KC: Review: Structural bioinformatics and its impact to biomedical science. Current Medicinal Chemistry 2004, 11: 2105–2134.
- Jeong E, Chung IF, Miyano S: A network method for identification of RNA-interacting residues in protein. Genome Inform 2004, 15: 105–116.
- Jeong E, Miyano S: A Weighted profile based method for protein-RNA interacting residue prediction. In Lecture notes in computer science. Volume 39. Edited by: Corrado P, Luca C, Stephen E. Berlin/Heidelberg: Springer; 2006:123–139.
- Saito M, Go M, Shira T: An empirical approach for detecting nucleotide-binding sites on proteins. Protein Engineering, Design & Selection 2006, 19: 67–75. 10.1093/protein/gzj002
- Kaur H, Raghava GPS: Prediction of β-turns in proteins from multiple alignments using neural network. Protein Sci 2003, 12: 627–634. 10.1110/ps.0228903
- Kaur H, Raghava GPS: A neural-network based method for prediction of gamma-turns in proteins from multiple sequence alignment. Protein Sci 2003, 12: 923–929. 10.1110/ps.0241703
- Bauer RA, Günther S, Jansen D, Heeger C, Thaben P, Preissner R: SuperSite: dictionary of metabolite and drug binding sites in proteins. Nucl Acids Res 2009, 37: D195–200. 10.1093/nar/gkn618
- Sobolev V, Sorokine A, Prilusky J, Abola EE, Edelman M: Automated analysis of interatomic contacts in proteins. Bioinformatics 1999, 15: 327–332. 10.1093/bioinformatics/15.4.327
- Chen C, Chen L, Zou X, Cai P: Prediction of protein secondary structure content by using the concept of Chou's pseudo amino acid composition and support vector machine. Protein & Peptide Letters 2009, 16(1):27–31. 10.2174/092986609787049420
- Ding H, Luo L, Lin H: Prediction of cell wall lytic enzymes using Chou's amphiphilic pseudo amino acid composition. Protein & Peptide Letters 2009, 16: 351–355. 10.2174/092986609787848045
- Mishra N, Kumar M, Raghava GPS: Support vector machine based method for predicting Glutathione S-transferases proteins. Protein Pept Lett 2007, 14: 575–580.
- Joachims T: Making large-scale support vector machine learning practical. In Advances in kernel methods: support vector learning. Edited by: Scholkopf B, Burges C, Smola A. Cambridge, MA: MIT Press; 1999:169–184.
- Vapnik V: The nature of statistical learning theory. New York: Springer; 1995.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389
- Fauchére J, Pliska V: Hydrophobic parameters π of amino acid side chains from partitioning of N-acetyl-amino-acid amides. Eur J Med Chem 1983, 18: 369–375.
- Chou PY, Fasman GD: Amino acid scale: Conformational parameter for beta-sheet (computed from 29 proteins). Adv Enzym 1978, 47: 45–148.
- Grantham R: Amino acid difference formula to help explain protein evolution. Science 1974, 185: 862–864. 10.1126/science.185.4154.862
- Jones DT, Taylor WR, Thornton JM: A new approach to protein fold recognition. Nature 1992, 358: 86–89. 10.1038/358086a0
- Jones S, Thornton JM: Principles of protein-protein interactions. Proc Natl Acad Sci 1996, 93: 13–20. 10.1073/pnas.93.1.13
- Klein P, Kanehisa M, DeLisi C: Prediction of protein function from sequence properties: Discriminant analysis of a data base. J Biochim Biophys Acta 1984, 787: 221–226.
- Janin J, Wodak S: Conformation of amino acid side-chains in proteins. J Mol Biol 1978, 125(3):357–86. 10.1016/0022-2836(78)90408-4
- Du P, Li Y: Prediction of protein submitochondria locations by hybridizing pseudo-amino acid composition with various physicochemical features of segmented sequence. BMC Bioinformatics 2006, 7: 518. 10.1186/1471-2105-7-518
- Kaur H, Raghava GPS: Prediction of beta turns in proteins from multiple alignment using neural network. Protein Sci 2003, 12: 627–634. 10.1110/ps.0228903
- Kuznetsov IB, Gou Z, Li R, Hwang S: Using evolutionary and structural information to predict DNA-binding sites on DNA-binding proteins. Proteins 2006, 64: 19–27. 10.1002/prot.20977
- Jones S, Thornton JM: Analysis of Protein-Protein Interaction Sites using Surface Patches. J Mol Biol 1997, 272: 121–132. 10.1006/jmbi.1997.1234
- Ting G, Yanxin S, Zhirag S: A noval statistical ligand-binding site predictor: application to ATP-binding site. Protein Engineering, Design & Selection 2005, 18: 65–70. 10.1093/protein/gzi006
- Zdobnov EM, Apweiler R: InterProScan - an integration platform for the signature-recognition methods in InterPro. Bioinformatics 2001, 17(9):847–848. 10.1093/bioinformatics/17.9.847
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.