PlantPhos: using maximal dependence decomposition to identify plant phosphorylation sites with substrate site specificity
© Lee et al; licensee BioMed Central Ltd. 2011
Received: 14 March 2011
Accepted: 26 June 2011
Published: 26 June 2011
Protein phosphorylation catalyzed by kinases plays crucial regulatory roles in intracellular signal transduction. Because high-throughput mass spectrometry-based experiments are difficult to perform, there is a desire to predict phosphorylation sites using computational methods. However, previous studies on in silico prediction of plant phosphorylation sites do not consider kinase-specific phosphorylation data. We are therefore motivated to propose a new method that investigates different substrate specificities in plant phosphorylation sites.
Experimentally verified phosphorylation data were extracted from TAIR9, a protein database containing 3006 phosphorylation data from the plant species Arabidopsis thaliana. To investigate the various substrate motifs in plant phosphorylation, maximal dependence decomposition (MDD) is employed to cluster a large set of phosphorylation data into subgroups containing significantly conserved motifs. A profile hidden Markov model (HMM) is then trained as a predictive model for each subgroup. Cross-validation evaluation of the MDD-clustered HMMs yields an average accuracy of 82.4% for serine, 78.6% for threonine, and 89.0% for tyrosine models. Moreover, independent test results using Arabidopsis thaliana phosphorylation data from UniProtKB/Swiss-Prot show that the proposed models correctly predict 81.4% of phosphoserine, 77.1% of phosphothreonine, and 83.7% of phosphotyrosine sites. Interestingly, several MDD-clustered subgroups show amino acid conservation similar to the substrate motifs of well-known kinases from Phospho.ELM, a database containing kinase-specific phosphorylation data from multiple organisms.
This work presents a novel method for identifying plant phosphorylation sites with various substrate motifs. In both cross-validation and independent testing, the MDD-clustered models outperform models trained without MDD. The proposed method has been implemented as a web-based plant phosphorylation prediction tool, PlantPhos (http://csb.cse.yzu.edu.tw/PlantPhos/). Additionally, two case studies are presented to further evaluate the effectiveness of PlantPhos.
Protein phosphorylation is the most widespread and well-studied post-translational modification in eukaryotic cells. It is one of the most prevalent intracellular protein modifications and influences numerous cellular processes. It has been estimated that one-third to one-half of all proteins in a eukaryotic cell are phosphorylated. Furthermore, protein phosphorylation, catalyzed by specific kinases, plays crucial regulatory roles in intracellular signal transduction. Networks composed of proteins and small molecules transmit information from the cell surface to the nucleus, where they ultimately effect transcriptional changes. An estimated 1% to 3% of functional eukaryotic genes encode protein kinases, which suggests that kinases are involved in many aspects of cellular regulation and metabolism. However, a full understanding of the mechanism of intracellular signal transduction remains a major challenge in cellular biology.
Protein phosphorylation regulates various cellular processes not only in mammals but also in plants. The regulation of carbon and nitrogen metabolism in plants is reported to be driven by phosphorylation. Phosphorylation is also involved in modulating sucrose phosphate synthase, an enzyme which controls the signaling pathway for sucrose synthesis in plants. Phosphorylation further aids in modulating the synthesis of ammonia, a compound required to supply energy to organs that are not able to photosynthesize. In addition, protein phosphorylation is involved in plant growth and in the plant response to stress [6, 7]. Stone et al. have identified a number of plant kinases; however, the precise functional roles of specific protein kinases have not been widely elucidated.
Because of the scientific community's interest in further understanding the process of phosphorylation, mass spectrometry-based proteomics has been used to enable the large-scale mapping of in vivo phosphorylation sites. Several databases have since been developed to store experimentally verified phosphorylation sites with their catalytic kinases, such as Phospho.ELM, PhosphoSite, UniProtKB/Swiss-Prot, PHOSIDA, and dbPTM [13, 14]. While most resources focus on phosphorylation sites in mammalian organisms, some databases store phosphorylation sites in plants, such as PhosPhAt, P3DB, and TAIR. PhosPhAt consolidates knowledge of phosphorylation sites in Arabidopsis thaliana identified by mass spectrometry and offers a phosphorylation site prediction tool specifically trained on experimentally identified Arabidopsis thaliana phosphorylation motifs. P3DB provides a resource of protein phosphorylation data from multiple plants. Moreover, it implements a phosphopeptide BLAST browser that allows users to query the database for phosphopeptides similar to protein sequences of their interest. TAIR maintains a database of genetic and molecular data for Arabidopsis thaliana. Protein data stored in TAIR include the complete protein sequence along with phosphorylation site annotations.
Because identifying phosphorylation sites by mass spectrometry is complex and difficult, a number of mammalian protein phosphorylation prediction tools have been developed using different methods, with varying predictive performance. KinasePhos 1.0 [18, 19] incorporated profile HMMs for kinase-specific phosphorylation site prediction, with an overall predictive accuracy of about 87%. Version 2.0 of KinasePhos incorporated a support vector machine (SVM) with the protein coupling pattern to identify phosphorylation sites for 58 kinase groups. NetPhosK applied an artificial neural network algorithm to predict phosphorylation sites specific to 17 protein kinase (PK) groups. Scansite 2.0 identified short protein sequence motifs that are recognized by modular signaling domains, phosphorylated by serine/threonine or tyrosine kinases, or that mediate specific interactions with protein or phospholipid ligands. GPS [23, 24] is a group-based phosphorylation site prediction and scoring platform which clusters 216 unique protein kinases into 71 groups. PPSP is an approach based on Bayesian decision theory for predicting potential phosphorylation sites for around 70 protein kinase groups. PHOSIDA, which incorporated an SVM with surface accessibility and evolutionary conservation, achieved accuracies of 91.75%, 81.06%, and 76.19% for serine, threonine, and tyrosine, respectively.
With regard to plant phosphorylation prediction, PhosPhAt has utilized a set of 802 experimentally validated phosphoserine sites to develop an SVM classifier for identifying pSer sites in Arabidopsis thaliana. This yielded an area under the curve (AUC) of around 0.81 on a redundant TAIR7 protein dataset. More recently, Gao et al. incorporated protein sequence information and protein disordered regions, and integrated k-nearest neighbor and SVM for predicting phosphorylation sites. Their method utilized the PhosPhAt dataset of pSer in Arabidopsis thaliana and the TAIR7 non-redundant protein database. However, these works do not predict phosphorylation sites according to plant substrate site specificity. Therefore, there is a need to investigate the various substrate site specificities in plants and to utilize this information for predicting kinase-specific plant protein phosphorylation sites.
Information regarding protein kinases that phosphorylate substrates in plants is very limited. In the collection of experimentally verified plant phosphorylation sites from TAIR9 and UniProtKB/Swiss-Prot, phosphorylation sites are not annotated with their corresponding kinases. Because of this limitation, the majority of published methods for computationally identifying kinase-specific phosphorylation sites are trained mainly on data from non-plant organisms. This study aims to analyze plant phosphorylation sites, investigate substrate site specificity in plants, and, most importantly, present a novel method for identifying potential phosphorylation sites in plant proteins using the available substrate site specificity information. This work applies maximal dependence decomposition (MDD) to cluster all phosphorylated fragments into subgroups presenting meaningful and statistically significant site specificity. MDD was first used to group splice sites for splice site prediction. It divides a large group of aligned sequences into subgroups that capture the most significant dependencies between positions. Huang et al. have applied MDD to improve the prediction performance for the PKA, PKC, and CK2 kinase groups. In this study, MDD is adopted to investigate the various substrate specificities of plant phosphorylation sites. Additionally, the motif of each MDD-clustered subgroup is compared to the substrate motifs of known kinases in Phospho.ELM, a database integrating comprehensive information on kinase-specific phosphorylation sites from multiple organisms. According to a five-fold cross-validation evaluation, models trained with MDD-clustered subgroups improve predictive accuracy compared to models trained without MDD clustering. Furthermore, an independent data set is used to further evaluate the effectiveness of the models that achieve the best accuracy during cross-validation.
Finally, the MDD-clustered models are used to implement a web-based tool, PlantPhos, for identifying plant phosphorylation sites with substrate motifs that may potentially be recognized by plant kinases. The prediction tool and the data used in this study are available at http://csb.cse.yzu.edu.tw/PlantPhos/.
Materials and methods
Data collection and preprocessing
Table 1. Data statistics of the experimentally verified phosphorylation sites collected from the TAIR9 protein database.
Redundant sequence fragments, that is, sequences which contain the same amino acid at each corresponding position, are removed from both the positive and negative data sets. When the same fragment is found in both the positive and negative data sets, the copy in the negative data set is removed and the copy in the positive data set is retained, as shown in the redundancy removal step of Figure 1. The removal of redundant data resulted in 2506 positive fragments and 87877 negative fragments for S; 378 positive fragments and 10402 negative fragments for T; and 108 positive fragments and 1681 negative fragments for Y. Since the negative fragments greatly outnumber the corresponding positive fragments, the data are balanced by selecting as many negative fragments as there are positive fragments. The idea of balancing the negative data with the positive data follows PhosPhAt, which uses random selection to obtain a balanced data set. However, random selection may produce a non-representative data set and therefore unstable prediction performance; to avoid this, the K-means clustering method is used in this study. K-means clustering selects well-represented data from a large data set in order to achieve a more globalized sample. A data point which has a minimal distance from the data points surrounding it is selected as a representative. In this study, K-means clustering is performed based on sequence identity. The value of K, which denotes the number of samples to be obtained from the negative data set, is set to the number of corresponding positive data. This resulted in a data set consisting of 2506 positive and 2506 negative S fragments; 378 positive and 378 negative T fragments; and 108 positive and 108 negative Y fragments. Table 1 shows the final number of positive and negative fragments in S, T, and Y used for this study.
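The redundancy-removal rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `remove_redundant` and the toy fragments are hypothetical, and the sketch covers only the removal step, not the K-means selection of representative negatives.

```python
def remove_redundant(positives, negatives):
    """Drop duplicate fragments and any negative that also occurs as a positive.

    Hypothetical helper: fragments are plain strings of equal length, and two
    fragments are redundant when they are identical at every position.
    """
    # dict.fromkeys keeps the first occurrence and preserves input order.
    pos_unique = list(dict.fromkeys(positives))
    pos_set = set(pos_unique)
    # A fragment found in both sets is kept as a positive and removed here.
    neg_unique = [f for f in dict.fromkeys(negatives) if f not in pos_set]
    return pos_unique, neg_unique


# Toy 7-mer fragments (invented for illustration only).
positives = ["AAASAAA", "KRRSPEE", "AAASAAA"]
negatives = ["AAASAAA", "GGGSGGG", "GGGSGGG", "LLLSLLL"]
pos, neg = remove_redundant(positives, negatives)
print(pos)  # ['AAASAAA', 'KRRSPEE']
print(neg)  # ['GGGSGGG', 'LLLSLLL']
```

Keeping the positive copy reflects the rule stated in the text: an experimentally verified site takes precedence over an unannotated one.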
Data clustering by maximal dependence decomposition
One of the aims of this study is to investigate the substrate site specificity of plant phosphorylation sites based on amino acid sequences. In order to explore the conserved motifs in a large data set, MDD is applied to cluster all phosphorylated fragments into subgroups that show statistically significant site specificity. MDD is a methodology that divides a large set of aligned signal sequences into subgroups that capture the most significant dependencies between positions. In previous studies, MDD was proposed to group splice sites for splice site prediction. In this work, however, we group protein sequences instead of nucleotide sequences. MDD adopts the chi-square test $\chi^2(A_i, A_j)$ to evaluate the dependence of amino acid occurrence between two positions, $A_i$ and $A_j$, which surround the phosphorylation site. In order to extract motifs with conserved biochemical properties of amino acids, we categorize the twenty types of amino acids into five groups: neutral, acid, basic, aromatic, and imino groups, as shown in Additional file 1, Table S1. Then, a contingency table of amino acid occurrence between two positions is constructed, as presented in Additional file 1, Figure S1. The chi-square test is defined as:
\[ \chi^2(A_i, A_j) = \sum_{m=1}^{5} \sum_{n=1}^{5} \frac{(X_{mn} - E_{mn})^2}{E_{mn}} \]

where $X_{mn}$ represents the number of sequences that have an amino acid of group m in position $A_i$ and an amino acid of group n in position $A_j$, for each pair $(A_i, A_j)$ with i ≠ j. $E_{mn}$ is calculated as $E_{mn} = X_{mR} X_{Cn} / X$, where $X_{mR} = X_{m1} + \dots + X_{m5}$, $X_{Cn} = X_{1n} + \dots + X_{5n}$, and X denotes the total number of sequences. If a strong dependence is detected between two positions (defined as a $\chi^2$ value larger than 34.3, corresponding to a cutoff level of P = 0.005 with 16 degrees of freedom), then the process is continued as described by Burge and Karlin. As illustrated in Additional file 1, Figure S1, position +1 has the maximal dependence with the occurrence of imino amino acids. Subsequently, all data can be divided into two subgroups: one with imino amino acids in position +1 and one without imino amino acids in position +1. MDD clustering is a recursive process which divides the positive sets into tree-like subgroups. When applying MDD to cluster the sequences in the positive set, a parameter, the minimum-cluster-size, should be set. If the size of a subgroup is less than the minimum-cluster-size, the subgroup is not divided any further. The MDD process terminates when all subgroup sizes are less than the minimum-cluster-size. In previous works that utilize MDD, there are no set values for the parameters of MDD clustering. In order to obtain an optimal minimum cluster size, MDD clustering is executed using various values. Each subgroup is represented using WebLogo to graphically visualize the corresponding substrate motif. The resulting clusters are then analyzed as to whether or not they contain significantly conserved motifs. Subgroups with very similar motifs are merged into a single cluster in order to provide more meaningful groups and avoid redundant clusters, as shown in the MDD clustering step of Figure 1.
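To make the dependence test concrete, the following Python sketch computes the chi-square statistic for one pair of positions from a 5 x 5 contingency table of amino acid groups, and splits a set of fragments on the presence of a group at a position. It is a sketch under stated assumptions rather than the authors' code: the group membership in `GROUPS` is an illustrative guess (the actual grouping is given in Additional file 1, Table S1), the function names are hypothetical, and fragments are assumed to contain only the twenty standard amino acids.

```python
from collections import Counter

# ASSUMPTION: illustrative grouping only; the paper's grouping is defined in
# Additional file 1, Table S1, which is not reproduced here.
GROUPS = {
    "acid": "DE",
    "basic": "KRH",
    "aromatic": "FYW",
    "imino": "P",
    "neutral": "ACGILMNQSTV",
}
GROUP_NAMES = list(GROUPS)
AA_TO_GROUP = {aa: g for g, aas in GROUPS.items() for aa in aas}

CHI2_CUTOFF = 34.3  # from the text: P = 0.005 with 16 degrees of freedom


def chi_square(fragments, i, j):
    """Chi-square statistic for amino acid group occurrence at columns i and j."""
    counts = Counter((AA_TO_GROUP[f[i]], AA_TO_GROUP[f[j]]) for f in fragments)
    total = len(fragments)
    row = {m: sum(counts[(m, n)] for n in GROUP_NAMES) for m in GROUP_NAMES}
    col = {n: sum(counts[(m, n)] for m in GROUP_NAMES) for n in GROUP_NAMES}
    stat = 0.0
    for m in GROUP_NAMES:
        for n in GROUP_NAMES:
            expected = row[m] * col[n] / total  # E_mn = X_mR * X_Cn / X
            if expected > 0:  # cells in an empty row or column contribute nothing
                stat += (counts[(m, n)] - expected) ** 2 / expected
    return stat


def split_by_group(fragments, pos, group):
    """Divide fragments by whether column pos holds an amino acid of the group."""
    has = [f for f in fragments if AA_TO_GROUP[f[pos]] == group]
    lacks = [f for f in fragments if AA_TO_GROUP[f[pos]] != group]
    return has, lacks


# Toy two-column alignment in which the columns are perfectly dependent.
toy = ["DD"] * 50 + ["KK"] * 50
print(chi_square(toy, 0, 1) > CHI2_CUTOFF)  # True
```

A full MDD run would repeat this test over all position pairs, split on the position with the strongest dependence, and recurse until subgroups fall below the minimum-cluster-size; that recursion is omitted here.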
Model learning and evaluation
In this work, a profile HMM is learned from the site sequences of each optimized MDD-clustered subgroup. An HMM describes a probability distribution over a potentially infinite number of sequences. It can also be used to detect distant relationships between amino acid sequences. Here, we use the software package HMMER version 2.3.2 to build the profile HMMs, to calibrate the HMMs, and to search the protein sequences for putative phosphorylation sites. An HMM is built from positive instances of a class; thus, in this study, only positive data were utilized to build each model. The MDD-clustered sets of phosphorylation sites are taken as training sets to learn the HMMs. One HMM is built for each MDD-clustered subgroup.
For each model of the MDD-clustered subgroups, a threshold parameter is selected as a cut-off value for identifying potential positive data in a query. The threshold is chosen as the value which gives the best cross-validation performance for each training model. When searching for hits of an HMM, HMMER returns both a bit score and an expectation value (E-value). The bit score is the base-two logarithm of the ratio between the probability that the query sequence is generated by the HMM and the probability that it is generated by a random model. The E-value represents the expected number of sequences that would reach a score greater than or equal to the returned HMMER bit score by chance. A search result with an HMMER bit score greater than the threshold parameter is taken as a positive prediction. Decreasing the bit score threshold favors finding true positives, whereas increasing it favors finding true negatives.
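The arithmetic behind the bit score and the threshold rule can be illustrated as follows. This is only a sketch of the definition given above, not of how HMMER computes scores internally; the function names and the probabilities are hypothetical values chosen for illustration.

```python
import math


def bit_score(p_model, p_null):
    """Base-two log-odds of a sequence under the HMM versus a random model."""
    return math.log2(p_model / p_null)


def is_positive(score, threshold):
    """A hit is a positive prediction when its bit score exceeds the threshold."""
    return score > threshold


# Invented probabilities: the sequence is 8 times less likely under the HMM
# than under the random model, so the bit score is negative.
score = bit_score(2 ** -40, 2 ** -37)
print(score)                   # -3.0
print(is_positive(score, -5))  # True: a lower threshold admits more hits
print(is_positive(score, -2))  # False: a higher threshold rejects more hits
```

The example shows why a negative threshold can still yield positive predictions for short fragments, and how moving the threshold trades sensitivity against specificity.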
Cross-validation is an important evaluation procedure prior to the application of a predictor. The predictive performance of the constructed models is evaluated by five-fold cross-validation. Each training data set is split into five subgroups of approximately equal size. During cross-validation, each of the five subgroups is used in turn as the validation set, and the remainder is used as the training set. Next, the following measures of predictive performance of the trained models are calculated: Precision (Pre) = TP/(TP+FP), Sensitivity (Sn) = TP/(TP+FN), Specificity (Sp) = TN/(TN+FP), and Accuracy (Acc) = (TP+TN)/(TP+FP+TN+FN), where TP, TN, FP and FN represent the numbers of true positives, true negatives, false positives and false negatives, respectively. Along with five-fold cross-validation, different values for the HMMER bit score were also tested in order to obtain the optimal threshold parameter for predicting query sequences. Each value from -20 to 0 was tested as the HMMER bit score threshold. Then, the results of each fold for each bit score value are compared and analyzed. The value which yields the highest cross-validation performance over all five folds of a specific model is selected as the optimal HMMER threshold for that HMM.
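The four performance measures and the threshold scan can be sketched in Python as below. Several details are assumptions made for illustration and are not stated in the text: accuracy is used as the selection criterion, ties are broken in favor of the lowest threshold, validation scores are treated as one pooled list rather than per fold, and all function names and scores are hypothetical.

```python
def _ratio(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def metrics(tp, fp, tn, fn):
    """Precision, sensitivity, specificity and accuracy as defined in the text."""
    return {
        "Pre": _ratio(tp, tp + fp),
        "Sn": _ratio(tp, tp + fn),
        "Sp": _ratio(tn, tn + fp),
        "Acc": _ratio(tp + tn, tp + fp + tn + fn),
    }


def confusion(pos_scores, neg_scores, threshold):
    """Count TP, FP, TN, FN when scores above the threshold are called positive."""
    tp = sum(s > threshold for s in pos_scores)
    fn = len(pos_scores) - tp
    fp = sum(s > threshold for s in neg_scores)
    tn = len(neg_scores) - fp
    return tp, fp, tn, fn


def select_threshold(pos_scores, neg_scores, candidates=range(-20, 1)):
    """Scan integer thresholds from -20 to 0 and keep the most accurate one.

    ASSUMPTION: accuracy is the criterion; max() keeps the first (lowest)
    threshold among ties.
    """
    return max(
        candidates,
        key=lambda t: metrics(*confusion(pos_scores, neg_scores, t))["Acc"],
    )


# Invented bit scores for validation fragments of one subgroup model.
pos_scores = [-3.0, -1.5, -5.0, 2.0]
neg_scores = [-12.0, -9.5, -6.0, -15.0]
print(select_threshold(pos_scores, neg_scores))  # -6
print(metrics(40, 10, 45, 5)["Acc"])             # 0.85
```

A criterion that balances sensitivity and specificity could be substituted for accuracy by changing the `key` function.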
In classification, the prediction performance of the trained models may be overestimated because of possible over-fitting to the training set; thus, an independent test is performed. The experimental plant phosphorylation sites of UniProtKB/Swiss-Prot which were not included in TAIR9 are regarded as the independent test set and are used to estimate the actual prediction performance. Using a window size of 21-mer, there are 332, 105, and 14 phosphorylation sites for S, T, and Y, respectively. Following the procedure used to extract the negative training set, 664, 210, and 28 non-phosphorylation sites for S, T, and Y, respectively, are regarded as the negative set of independent test data. A balanced number of negative data is selected using K-means clustering to match the number of positive data. After the five-fold cross-validation evaluation, the independent test set is used to evaluate the MDD-clustered HMMs with the highest accuracy. After searching against all HMMs, the prediction result for each test data on each HMM is evaluated. If a query is predicted to be positive by at least one HMM, then it is reported by PlantPhos as a phosphorylation site. This is because each HMM represents the target motif of a specific plant protein kinase, so a positive hit on a model means that the query matches the motif that model represents. When a query is predicted as a phosphorylation site by two or more HMMs, the cluster which gives the highest prediction score is treated as the plant protein kinase motif that the query matches. Meanwhile, if a query is predicted to be negative by all HMMs, then it is considered a non-phosphorylation site. To account for variation introduced by balancing the positive and negative data, the independent testing is repeated 10 times on the S, T, and Y data.
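The decision rule for combining the subgroup models can be written compactly. The following Python sketch assumes that each subgroup HMM has already produced a bit score for the query fragment; the function name, the subgroup labels, and all scores and thresholds are hypothetical.

```python
def predict_site(scores, thresholds):
    """Combine per-subgroup HMM results for one query fragment.

    scores and thresholds map a subgroup name to its bit score and cutoff.
    Returns (is_phosphorylation_site, best_matching_subgroup_or_None).
    """
    # Keep only the subgroup models whose score exceeds their own threshold.
    hits = {name: s for name, s in scores.items() if s > thresholds[name]}
    if not hits:
        return False, None  # negative on every HMM: non-phosphorylation site
    # With several positive models, report the one with the highest score.
    return True, max(hits, key=hits.get)


# Invented example with three serine subgroup models.
thresholds = {"S1": -6, "S2": -8, "S3": -3}
print(predict_site({"S1": -4.0, "S2": -10.0, "S3": -1.0}, thresholds))   # (True, 'S3')
print(predict_site({"S1": -9.0, "S2": -10.0, "S3": -5.0}, thresholds))   # (False, None)
```

Because a single positive model is sufficient, adding more subgroup models can only raise sensitivity, at the possible cost of specificity, which is one reason each model's threshold is tuned separately.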
Results and discussion
Investigation of substrate site specificities
Table 3. The substrate site specificity and predictive performance in nine MDD-clustered subgroups of phosphoserine. Columns: number of data, entropy plot of substrate motif, HMMER bit score.
Table 4. The substrate site specificity and predictive performance in six MDD-clustered subgroups of phosphothreonine.
Table 5. The substrate site specificity and predictive performance in six MDD-clustered subgroups of phosphotyrosine.
Predictive performance of five-fold cross-validation
The cross-validation process includes the selection of the threshold parameter for each model. The threshold parameter is a specific bit score that serves as the cutoff value of HMMsearch for determining which query sequences match an HMM. The threshold is selected by first testing each value in the range of -20 to 0 as the bit score. It is then tuned to the value which allows the HMM to yield high and balanced specificity and sensitivity. Table 3 shows the threshold score selected for each model of pSer as well as its individual predictive performance. Tables 4 and 5 show the threshold score selected for each model of pThr and pTyr, respectively, as well as its individual predictive performance.
The comparison of five-fold cross-validation between single HMM and MDD-clustered HMMs.
Evaluation of the selected models using independent test set
The comparison of independent testing.
Comparison with other methods
Currently, very few prediction methods dedicated to identifying phosphorylation sites in plants have been proposed. Among these are PhosPhAt, which uses an SVM classifier, and the work of Gao et al., which incorporates protein sequence information and protein disordered regions and integrates k-nearest neighbor and SVM for predicting phosphorylation sites (see Additional file 1, Table S5). However, only PhosPhAt offers a readily available web tool, PhosPhAt 3.0, for predicting plant phosphorylation sites. For comparison, the independent test data set used for evaluating PlantPhos was utilized. Using a window size of 21-mer, 332 pSer, 105 pThr, and 14 pTyr sites were used as the positive test data. A balanced number of negative data is then selected using K-means clustering to match the number of positive data. The resulting data are then classified as phosphorylation or non-phosphorylation sites using both PlantPhos and PhosPhAt 3.0. Finally, the precision, sensitivity, specificity, and accuracy of both methods are calculated in order to compare their predictive performance. Table 7 shows the results of both methods on the same data set. PhosPhAt 3.0 yields an accuracy of 56.2%, 55.2%, and 46.4% for predicting S, T, and Y sites, respectively, which is lower than the proposed method, which yields an accuracy of 76.5%, 73.4%, and 76.2% for predicting S, T, and Y sites, respectively.
Moreover, PlantPhos is compared with general phosphorylation prediction tools in order to assess its ability to identify plant phosphorylation sites. NetPhos is a general phosphorylation prediction tool trained on mammalian phosphoproteins which uses an artificial neural network method to predict phosphorylation sites in query sequences. DisPhos is another general phosphorylation prediction tool, trained on experimentally verified phosphorylation sites, which uses disorder information to improve prediction accuracy. Furthermore, DisPhos provides an Arabidopsis thaliana predictor, which is selected for this test. These two phosphorylation tools, which are available online, are tested using the same independent data set used for evaluating PlantPhos. Table 7 shows that NetPhos yields an accuracy of 60.5%, 47.8%, and 46.4% for predicting S, T, and Y sites, respectively. DisPhos 1.3, on the other hand, yields an accuracy of 65.0%, 48.6%, and 55.8% for predicting S, T, and Y sites, respectively.
Comparison of motifs between MDD-clustered subgroups and well-known kinase groups
Implementation of the prediction scheme
The importance of phosphorylation in the regulation of protein functions and cell signaling is well established, but research in this field is hindered by experimental difficulties, especially in the investigation of substrate site specificity in various organisms. In this work, we have analyzed plant phosphorylation sites by applying MDD clustering; using this method, the available plant phosphorylation data were grouped into several subgroups, each showing a significantly conserved motif. We then developed a novel method for predicting protein phosphorylation sites in plants by training an HMM for each MDD-clustered subgroup; the HMMs are then used to predict potential phosphorylation sites with HMMsearch. Our method is evaluated by five-fold cross-validation, which yields an average accuracy of 82.4% for predicting pSer, 78.6% for predicting pThr, and 89.0% for predicting pTyr. Moreover, our method is further evaluated on an independent data set of experimental phosphorylation data in plant proteins from UniProtKB/Swiss-Prot, which shows that it can predict novel phosphorylation sites. Additionally, we further investigated the MDD-clustered motifs in plants by reference to the motifs of known kinases from Phospho.ELM. In this way, we observed similar kinase motifs between plants and other organisms. Lastly, the method has been implemented as a web tool named PlantPhos so that the research community can apply the proposed method to phosphorylation site prediction on plant protein data.
Future development of PlantPhos involves (i) the extension of the system to include plant species other than Arabidopsis thaliana; (ii) the acquisition of additional experimentally verified plant protein data to re-calibrate more robust MDD clusters; and (iii) a more comprehensive investigation of substrate site specificities in plants with additional plant phosphorylation data.
PlantPhos can be accessed via a web interface and is freely available to all interested users at http://csb.cse.yzu.edu.tw/PlantPhos/. All data sets used in this work are also available for download from the website.
The authors sincerely thank the National Science Council of the Republic of China for financially supporting this research under Contract Number NSC 99-2320-B-155-001.
- Steen H, Jebanathirajah JA, Rush J, Morrice N, Kirschner MW: Phosphorylation analysis by mass spectrometry: myths, facts, and the consequences for qualitative and quantitative measurements. Mol Cell Proteomics 2006, 5(1):172–181.
- Hubbard MJ, Cohen P: On target with a new mechanism for the regulation of protein phosphorylation. Trends Biochem Sci 1993, 18(5):172–177. 10.1016/0968-0004(93)90109-Z
- Steffen M, Petti A, Aach J, D'Haeseleer P, Church G: Automated modelling of signal transduction networks. BMC Bioinformatics 2002, 3: 34. 10.1186/1471-2105-3-34
- Stone JM, Walker JC: Plant protein kinase families and signal transduction. Plant Physiol 1995, 108(2):451–457. 10.1104/pp.108.2.451
- Diolez P, Kesseler A, Haraux F, Valerio M, Brinkmann K, Brand MD: Regulation of oxidative phosphorylation in plant mitochondria. Biochem Soc Trans 1993, 21(Pt 3(3)):769–773.
- Huber SC: Exploring the role of protein phosphorylation in plants: from signalling to metabolism. Biochemical Society Transactions 2007, 35.
- Luan S: Tyrosine phosphorylation in plant cell signaling. Proc Natl Acad Sci USA 2002, 99(18):11567–11569. 10.1073/pnas.182417599
- Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature 2003, 422(6928):198–207. 10.1038/nature01511
- Diella F, Gould CM, Chica C, Via A, Gibson TJ: Phospho.ELM: a database of phosphorylation sites-update 2008. Nucleic Acids Res 2008, 36 Database: D240–244.
- Hornbeck PV, Chabra I, Kornhauser JM, Skrzypek E, Zhang B: PhosphoSite: A bioinformatics resource dedicated to physiological protein phosphorylation. Proteomics 2004, 4(6):1551–1561. 10.1002/pmic.200300772
- Schneider M, Lane L, Boutet E, Lieberherr D, Tognolli M, Bougueleret L, Bairoch A: The UniProtKB/Swiss-Prot knowledgebase and its Plant Proteome Annotation Program. J Proteomics 2009, 72(3):567–573. 10.1016/j.jprot.2008.11.010
- Gnad F, Ren S, Cox J, Olsen JV, Macek B, Oroshi M, Mann M: PHOSIDA (phosphorylation site database): management, structural and evolutionary investigation, and prediction of phosphosites. Genome Biol 2007, 8(11):R250. 10.1186/gb-2007-8-11-r250
- Lee TY, Hsu JB, Chang WC, Wang TY, Hsu PC, Huang HD: A comprehensive resource for integrating and displaying protein post-translational modifications. BMC Res Notes 2009, 2: 111. 10.1186/1756-0500-2-111
- Lee TY, Huang HD, Hung JH, Huang HY, Yang YS, Wang TH: dbPTM: an information repository of protein post-translational modification. Nucleic Acids Res 2006, 34 Database: D622–627.
- Heazlewood JL, Durek P, Hummel J, Selbig J, Weckwerth W, Walther D, Schulze WX: PhosPhAt: a database of phosphorylation sites in Arabidopsis thaliana and a plant-specific phosphorylation site predictor. Nucleic Acids Res 2008, 36 Database: D1015–1021.
- Gao J, Agrawal GK, Thelen JJ, Xu D: P3DB: a plant protein phosphorylation database. Nucleic Acids Res 2009, 37 Database: D960–962.
- Huala E, Dickerman AW, Garcia-Hernandez M, Weems D, Reiser L, LaFond F, Hanley D, Kiphart D, Zhuang M, Huang W, et al.: The Arabidopsis Information Resource (TAIR): a comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant. Nucleic Acids Res 2001, 29(1):102–105. 10.1093/nar/29.1.102
- Huang HD, Lee TY, Tzeng SW, Horng JT: KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites. Nucleic Acids Res 2005, 33 Web Server: W226–229.
- Huang HD, Lee TY, Tzeng SW, Wu LC, Horng JT, Tsou AP, Huang KT: Incorporating hidden Markov models for identifying protein kinase-specific phosphorylation sites. J Comput Chem 2005, 26(10):1032–1041. 10.1002/jcc.20235
- Wong YH, Lee TY, Liang HK, Huang CM, Wang TY, Yang YH, Chu CH, Huang HD, Ko MT, Hwang JK: KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns. Nucleic Acids Res 2007, 35 Web Server: W588–594.
- Blom N, Sicheritz-Ponten T, Gupta R, Gammeltoft S, Brunak S: Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics 2004, 4(6):1633–1649. 10.1002/pmic.200300771
- Obenauer JC, Cantley LC, Yaffe MB: Scansite 2.0: Proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res 2003, 31(13):3635–3641. 10.1093/nar/gkg584
- Xue Y, Zhou F, Zhu M, Ahmed K, Chen G, Yao X: GPS: a comprehensive www server for phosphorylation sites prediction. Nucleic Acids Res 2005, 33 Web Server: W184–187.
- Zhou FF, Xue Y, Chen GL, Yao X: GPS: a novel group-based phosphorylation predicting and scoring method. Biochem Biophys Res Commun 2004, 325(4):1443–1448. 10.1016/j.bbrc.2004.11.001
- Xue Y, Li A, Wang L, Feng H, Yao X: PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory. BMC Bioinformatics 2006, 7: 163. 10.1186/1471-2105-7-163
- Gao J, Agrawal GK, Thelen JJ, Obradovic Z, Dunker AK, Xu D: A New Machine Learning Approach for Protein Phosphorylation Site Prediction in Plants. Lect Notes Comput Sci 2009, 5462/2009: 18–29.
- Lee TY, Lin ZQ, Hsieh SJ, Bretana NA, Lu CT: Exploiting maximal dependence decomposition to identify conserved motifs from a group of aligned signal sequences. Bioinformatics 2011, 27(13):1780–7. 10.1093/bioinformatics/btr291
- Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol 1997, 268(1):78–94. 10.1006/jmbi.1997.0951
- Shien DM, Lee TY, Chang WC, Hsu JB, Horng JT, Hsu PC, Wang TY, Huang HD: Incorporating structural characteristics for identification of protein methylation sites. J Comput Chem 2009, 30(9):1532–1543. 10.1002/jcc.21232
- Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res 2004, 14(6):1188–1190. 10.1101/gr.849004
- Eddy SR: Profile hidden Markov models. Bioinformatics 1998, 14(9):755–763. 10.1093/bioinformatics/14.9.755
- Chou KC, Shen HB: Recent progress in protein subcellular location prediction. Anal Biochem 2007, 370(1):1–16. 10.1016/j.ab.2007.07.006
- Vacic V, Iakoucheva LM, Radivojac P: Two Sample Logo: a graphical representation of the differences between two sets of sequence alignments. Bioinformatics 2006, 22(12):1536–1537. 10.1093/bioinformatics/btl151
- Schneider TD, Stephens RM: Sequence logos: a new way to display consensus sequences. Nucleic Acids Res 1990, 18(20):6097–6100. 10.1093/nar/18.20.6097
- Blom N, Gammeltoft S, Brunak S: Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J Mol Biol 1999, 294(5):1351–1362. 10.1006/jmbi.1999.3310
- Iakoucheva LM, Radivojac P, Brown CJ, O'Connor TR, Sikes JG, Obradovic Z, Dunker AK: The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Res 2004, 32(3):1037–1049. 10.1093/nar/gkh253
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.