
PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory



As a reversible and dynamic post-translational modification (PTM) of proteins, phosphorylation plays essential regulatory roles in a broad spectrum of biological processes. Although many studies have addressed the molecular mechanisms of phosphorylation dynamics, the intrinsic features that determine substrate specificity are still elusive and remain to be delineated.


In this work, we present a novel, versatile and comprehensive program, PPSP (Prediction of PK-specific Phosphorylation site), built on Bayesian decision theory (BDT). PPSP predicts potential phosphorylation sites accurately for ~70 PK (protein kinase) groups. Compared with four existing tools, Scansite, NetPhosK, KinasePhos and GPS, PPSP is more accurate and powerful. Moreover, PPSP also provides predictions for many novel PKs, such as TRK, mTOR, SyK and MET/RON. The accuracy for these novel PKs is also satisfactory.


Taken together, we propose that PPSP could be a powerful tool for experimentalists who study phosphorylation substrates and want to identify their PK-specific sites. Moreover, the BDT strategy could also be a general approach for predicting other PTMs, such as sumoylation and ubiquitination.


Protein phosphorylation, one of the most common post-translational modifications (PTMs), is performed reversibly and transiently by protein kinases (PKs). It plays crucial regulatory roles in a variety of cellular processes, including transcription [1], translation [2], mitosis/cell cycle [3], neurite outgrowth [4, 5] and signal transduction [6]. Many previous studies have increased our knowledge of phosphorylation. However, the intrinsic features of phosphorylation dynamics are still cryptic and remain to be dissected. Biochemically, the catalytic site of a PK hydrolyzes adenosine triphosphate (ATP) and transfers a phosphate moiety to the acceptor residue (S/T or Y in eukaryotes) in the substrate. Each PK modifies only a defined subset of substrates, which ensures signaling fidelity, and defects of PK function often lead to a variety of diseases, including cancers [7].

A widely adopted hypothesis holds that PKs phosphorylate their substrates at specific sites whose flanking residues match a canonical motif (consensus sequence) [8-10]. To date, the consensus motifs of ~30 PKs have been reported [11]. However, the specific target motifs of a large number of PKs remain to be identified. Elucidating PK-specific phosphorylation sites on substrates is therefore the foundation for understanding the molecular mechanism of substrate specificity, and is important for biomedical drug design. It has also been reported that a consensus motif alone is not enough to provide the specificity of PK recognition in vivo [12]. Numerous mechanisms have been proposed to contribute to PK specificity, such as complex formation between PKs and their substrates, subcellular co-localization, interaction through modular docking sites, and phosphopeptide-binding mechanisms [12-17]. In a cell, a protein kinase usually forms a tight complex with its target, either through a third scaffold protein or by recognizing and binding a short sequence of the substrate known as a docking site [12, 18]. Moreover, phosphopeptide-binding domains (PBDs) are also important for achieving substrate specificity. Numerous PBDs (PTB, WW, SH2, SH3, FHA, MH2, WD40, Polo-box and 14-3-3, etc.) bind the phosphorylated forms of specific proteins by recognizing distinct peptides surrounding the phosphorylated sites (pS/T or pY) [14-17, 19]. However, how these mechanisms achieve additional specificity for PKs beyond phosphorylation motifs is still elusive, and very few computational studies have been published in this area [13, 16, 19]. In addition, many docking sites and PBDs remain to be characterized. Thus, in this work, we focus on the prediction of PK-specific phosphorylation sites based on profiles/features of the surrounding primary sequences, as previously described [8-10].

Conventional experimental identification of PK-specific phosphorylation sites on substrates in vivo and in vitro has provided the foundation for understanding the mechanisms of phosphorylation dynamics. However, these experiments are often time-consuming and expensive. In addition, the enzymatic activity of PKs is usually diminished or impeded in vitro, which greatly hampers studies of phosphorylation. Recently, phospho-proteomic studies using mass spectrometry (MS) have generated large amounts of data in yeast [20], mouse [21] and human [8]. In these cases, however, it is still difficult to determine which PK is responsible for each site on the substrates. In silico prediction of PK-specific phosphorylation sites is therefore urgently needed to guide further experiments. To address this need, several excellent predictors have been implemented and reported [13, 22-25]. For example, NetPhos employs a consensus-motif-based approach implemented with artificial neural networks (ANNs) [22]. Its enhanced version, NetPhosK, can predict PK-specific phosphorylation sites for ~17 PKs [23]. Another online tool, Scansite [13], has constructed motif profiles of phosphorylation sites for ~20 PKs and can predict their respective target sites. Previously, we reported the web server GPS, which implements the GPS (Group-based Phosphorylation Predicting and Scoring) algorithm [26, 27]. GPS can predict ~70 kinds of PK-specific phosphorylation sites and performs very well on several PK groups, especially for the kinase Aurora-B. More recently, the web tool KinasePhos, based on hidden Markov models (HMMs), was constructed to predict phosphorylation sites for 18 PK-specific groups [24, 25].

In this study, we present a novel, convenient and comprehensive program, PPSP (Prediction of PK-specific Phosphorylation site), based on Bayesian decision theory (BDT). An online PPSP web service has also been constructed, which predicts PK-specific phosphorylation sites for 68 PK groups. The prediction performance of PPSP is satisfactory on several well-studied PKs and comparable with that of the existing tools NetPhosK, Scansite, KinasePhos and GPS. Moreover, PPSP also provides accurate predictions for many novel PKs, such as TRK, mTOR, SyK and MET/RON. Taken as a whole, PPSP is more accurate and powerful. We therefore propose that PPSP could be useful and insightful for further experimental design. In addition, the prediction results of PPSP, combined with careful experimental verification, will advance our understanding of the mechanisms of phosphorylation.


Preparation of training data set

First, we obtained the data set of phosphorylation sites from Phospho.ELM (Ver 2.0, Sep. 2004) [28] and filtered out the phosphorylation sites lacking PK information. About 1,400 sites remained. We also manually curated the recent literature (before Nov. 2004) and acquired ~660 items. These newly curated data have been submitted to Phospho.ELM for further integration. The two data sets were merged, and redundant items were removed when two items pointed to exactly the same phosphorylation site in one protein sequence. The total training data set then contained >2,000 non-redundant positive data with very few homologous sites (see additional file 1).

Since several PKs had too few known phosphorylation sites, we clustered such PKs into distinct sub-groups based on sequence homology. For example, eight ribosomal protein S6 kinases (RSK1, Q15418; RSK3, Q15349; RSK2, P51812; MSK1, O75676; MSK1, O75582; RSK4, Q9UK32; S6K1, P23443; STK14B, Q9UBS0) are homologous with high similarity, so we clustered them into a single PK group, S6K (Ribosomal protein S6 kinase, or RSK). In total, we defined 68 PK groups.

Although Swiss-Prot also curates a large number of phosphorylation sites, we found ~69% of the annotations to be ambiguous (7,924 of 11,520) (see additional file 2). Only 842 items are kinase-specific sites, and only 18 PKs have at least ten sites (see additional file 3). Phospho.ELM was constructed to allow both experimentalists and bioinformaticians to easily access extensive information on phosphoproteins and their sites, i.e., to track the primary reference and find out whether a site is really phosphorylated, whether it was identified in vivo or in vitro, and how the phosphorylation relates to the physiological response [28]. Its data have been collected manually from the literature and are of high quality. Taken together, although other resources have also collected phosphorylation sites, we chose Phospho.ELM for its comprehensiveness.

Positive & negative control for evaluation

The sequence information of these phosphorylation substrates was retrieved from ExPASy. As previously described [11], we adopted the experimentally identified phosphorylation sites as the positive control, while all other S/T or Y residues in the phosphorylation substrates were regarded as the negative control. The detailed statistics of the positive and negative data sets, categorized by PK group, are available (see additional file 4).
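As an illustration of how such positive and negative controls could be assembled, the following minimal Python sketch collects every S/T/Y residue of a substrate and splits the resulting windows by a set of known sites. It uses the 9-residue window (4 upstream and 4 downstream residues) that the training procedure below adopts. The function names and the `-` padding symbol for windows that run past a terminus are assumptions made for this sketch; the paper does not state how terminal sites were handled.

```python
from typing import List, Set, Tuple

ACCEPTORS = {"S", "T", "Y"}   # phospho-acceptor residues
FLANK = 4                     # 4 upstream + 4 downstream residues -> 9-mer
PAD = "-"                     # hypothetical padding symbol (assumption)


def candidate_windows(sequence: str) -> List[Tuple[int, str]]:
    """Return (1-based position, 9-mer) for every S/T/Y in the sequence."""
    padded = PAD * FLANK + sequence + PAD * FLANK
    windows = []
    for i, residue in enumerate(sequence):
        if residue in ACCEPTORS:
            # residue i of `sequence` sits at index i + FLANK of `padded`,
            # so the window centred on it starts at index i
            windows.append((i + 1, padded[i:i + 2 * FLANK + 1]))
    return windows


def split_controls(sequence: str, known_sites: Set[int]):
    """Known sites become positives; all other S/T/Y residues, negatives."""
    positives, negatives = [], []
    for pos, peptide in candidate_windows(sequence):
        (positives if pos in known_sites else negatives).append((pos, peptide))
    return positives, negatives


# Toy substrate with one annotated site at position 5 (made-up data).
positives, negatives = split_controls("AAAASAAAAT", {5})
```

In this toy call the serine at position 5 yields the positive window `AAAASAAAA`, and the C-terminal threonine yields a padded negative window.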

Bayesian Decision Theory (BDT)

Suppose that we have an unclassified datum x that belongs to one of two categories: C1 (defined as phosphorylated sites in this work) and C2 (defined as non-phosphorylated sites). Suppose further that the posterior probabilities of x for these two categories are denoted p(C1|x) and p(C2|x). Then the probability of a wrong prediction is:

\[
P(\mathrm{error} \mid x) =
\begin{cases}
p(C_1 \mid x), & \text{if } x \in C_2 \\
p(C_2 \mid x), & \text{if } x \in C_1
\end{cases}
\tag{1}
\]

The goal is to minimize the expectation of the error probability, which is defined as [29]:

P(error) = ∫P(error|x)p(x)dx    (2)

It is obvious that one should choose the more probable category as the prediction result, which can be formulated by the Bayesian Decision Rule [29]:

\[
\text{predict } x \text{ as }
\begin{cases}
C_1, & \text{if } P(C_1 \mid x) > P(C_2 \mid x) \\
C_2, & \text{otherwise}
\end{cases}
\tag{3}
\]
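The step from equations (1) and (2) to rule (3) can be spelled out briefly. Whichever category is chosen for a given x, the conditional error is the posterior of the other category, so choosing the more probable category leaves the smaller posterior as the error. Because p(x) is non-negative, minimizing the integrand at every x also minimizes the integral in (2). In the notation above, this reasoning reads:

```latex
\[
P(\mathrm{error} \mid x) = \min\bigl\{\, p(C_1 \mid x),\; p(C_2 \mid x) \,\bigr\}
\quad\Longrightarrow\quad
P(\mathrm{error}) = \int \min\bigl\{\, p(C_1 \mid x),\; p(C_2 \mid x) \,\bigr\}\, p(x)\, dx .
\]
```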

Furthermore, we can introduce a loss function λ(α_i|C_j), where the α_i, i = 1, 2, form the finite set of possible solutions (actions) and λ(α_i|C_j) is the loss incurred by taking action α_i when the true category is C_j. The expected loss (risk) of taking action α_i is then:

\[
R(\alpha_i \mid x) = \sum_{l=1}^{2} \lambda(\alpha_i \mid C_l)\, P(C_l \mid x)
\tag{4}
\]

Under this formulation, the goal of optimization becomes minimizing the overall risk for every x. Following the same rationale as the Bayesian Decision Rule, we obtain the best performance by computing R(α_i|x) for each solution α_i and choosing the one with the minimal risk; the resulting minimal overall risk is named the Bayes risk [29].
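The minimum-risk rule can be illustrated with a few lines of Python. This is a generic sketch of equation (4) for two actions and two categories, not the PPSP implementation; the function names and the example loss values are assumptions chosen for illustration.

```python
from typing import Sequence


def conditional_risk(loss_row: Sequence[float], posteriors: Sequence[float]) -> float:
    """R(a_i | x) = sum over l of loss(a_i | C_l) * P(C_l | x), as in equation (4)."""
    return sum(loss * p for loss, p in zip(loss_row, posteriors))


def bayes_action(loss: Sequence[Sequence[float]], posteriors: Sequence[float]) -> int:
    """Return the index of the action with the minimal conditional risk."""
    risks = [conditional_risk(row, posteriors) for row in loss]
    return min(range(len(risks)), key=risks.__getitem__)


# Zero-one loss: the minimum-risk action is simply the more probable category,
# which recovers the Bayesian Decision Rule of equation (3).
zero_one = [[0.0, 1.0],
            [1.0, 0.0]]
choice_symmetric = bayes_action(zero_one, [0.3, 0.7])    # picks index 1

# Made-up asymmetric loss: wrongly taking action 1 when the truth is C1 costs 5,
# so action 0 is chosen even though C2 is more probable.
asymmetric = [[0.0, 1.0],
              [5.0, 0.0]]
choice_asymmetric = bayes_action(asymmetric, [0.3, 0.7])  # picks index 0
```

The second call shows why a loss function is more flexible than the plain posterior comparison: the decision can shift when the two kinds of error are not equally costly.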

Training and prediction procedure

In this study, a local ennea-peptide (9 aa) is used to define a candidate phosphorylation site. It comprises the potential phosphorylation site together with its 4 upstream and 4 downstream residues, and is denoted $\vec{x} = (x_1, x_2, \ldots, x_9)'$. Given some positive (training) data, there are many ways to estimate $R(\alpha_i \mid \vec{x})$ (where α1 and α2 denote the two prediction results, true and false phosphorylation site, respectively). One simple way is to assume that all flanking residues are mutually independent, so that the Bayes risk can be formulated as:

\[
R(\alpha_i \mid \vec{x}) = \sum_{j=1}^{9} R(\alpha_i \mid x_j)
\tag{5}
\]

\[
R(\alpha_i \mid x_j) = E(\lambda \mid x_j, \alpha_i) = \sum_{l=1}^{2} \sum_{k=1}^{20} \lambda(j, k \mid \alpha_i, C_l)\, p(C_l \mid x_j)
\tag{6}
\]

Here p(C_l|x_j) is the posterior probability of x_j belonging to category C_l, and can be further described by the Bayes formula:

\[
p(C_l \mid x_j) = \frac{p(x_j \mid C_l)\, p(C_l)}{p(x_j)} = \frac{p(x_j \mid C_l)\, p(C_l)}{\sum_{l=1}^{2} p(x_j \mid C_l)\, p(C_l)}, \quad l = 1, 2
\tag{7}
\]

Here p(C_l) is the prior probability of category C_l, and p(x_j|C_l) can be estimated from the occurrence of each residue in the training data under the independence assumption of equation (5). Although the data set contains many more false phosphorylation sites than true ones, we assign equal prior probabilities to the two categories (no prior information), which avoids biased prediction results. The loss function we construct is based on the BLOSUM62 matrix [30], which accounts for the biochemical differences between residues, and can be formulated as:

\[
\lambda(j, k \mid \alpha_i, C_l) =
\begin{cases}
-\mathrm{BLOSUM62}(j, k), & \text{if } \alpha_i \neq C_l \\
0, & \text{if } \alpha_i = C_l
\end{cases}
\tag{8}
\]
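The posterior of equation (7), with equal priors for the two categories, could be estimated along the following lines. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, the peptides are assumed to contain only the 20 standard residues, and the pseudocount that avoids zero frequencies is an addition of this sketch that the paper does not mention.

```python
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_frequencies(peptides: List[str],
                         pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Estimate p(x_j | C) at each position j from aligned peptides of one category."""
    length = len(peptides[0])
    total = len(peptides) + pseudocount * len(AMINO_ACIDS)
    tables = []
    for j in range(length):
        counts = Counter(peptide[j] for peptide in peptides)
        tables.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return tables


def posterior_positive(residue: str, j: int,
                       pos_freq: List[Dict[str, float]],
                       neg_freq: List[Dict[str, float]],
                       prior_pos: float = 0.5) -> float:
    """p(C1 | x_j) from equation (7); prior_pos = 0.5 encodes the equal priors."""
    numerator = pos_freq[j][residue] * prior_pos
    denominator = numerator + neg_freq[j][residue] * (1.0 - prior_pos)
    return numerator / denominator


# Toy 3-residue "peptides" (made-up data) keep the example short.
pos_freq = position_frequencies(["RSA", "RSG"])
neg_freq = position_frequencies(["ASA", "GSG"])
p_arg_first = posterior_positive("R", 0, pos_freq, neg_freq)
```

With these toy data, arginine at the first position is three times as frequent among the positives as among the negatives after smoothing, so its posterior for the positive category is 0.75, while the central serine, present in every peptide, gives 0.5.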

Although other matrices could also be adopted, the BLOSUM62 matrix is chosen in this work. Moreover, we introduce a trade-off threshold b as the only parameter of the method, which balances the performance on the two categories. The final discriminant function for prediction is thus:

\[
\text{predict } \vec{x} \text{ as }
\begin{cases}
C_1, & \text{if } R(\alpha_2 \mid \vec{x}) - R(\alpha_1 \mid \vec{x}) > b \\
C_2, & \text{otherwise}
\end{cases}
\tag{9}
\]

The outline of the training and prediction procedure in this work is illustrated in Figure 1. We first estimate the probability distribution of each residue at each position of the true and false ennea-peptides in the training data. Then the Bayes risk of each possible solution (i.e., true or false phosphorylation site) is calculated. To implement the final discriminant function in equation (9) efficiently, we build a difference profile of Bayesian decision risk for each PK family/group. A candidate site for a given protein kinase is then scored against the profile, and the values for its residues are summed. Following equation (9), if the difference of risks (risk of the false prediction minus risk of the true prediction) is greater than the threshold b, PPSP predicts the site as a potential phosphorylation site of this PK. Otherwise, the site is predicted as a negative site that cannot be phosphorylated by this PK. In this work, the threshold for each PK has been optimized automatically.
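Taking equation (9) as the reference, scoring a candidate site against a difference profile can be sketched as follows. The profile layout (one residue-to-value table per window position), the function names, the toy values and the choice to score unknown residues as 0.0 are all assumptions of this sketch, not details given in the paper.

```python
from typing import Dict, List


def risk_difference(peptide: str, profile: List[Dict[str, float]]) -> float:
    """Sum the per-position values R(a2 | x_j) - R(a1 | x_j) over the window."""
    # Residues missing from a position's table contribute 0.0 (assumption).
    return sum(profile[j].get(residue, 0.0) for j, residue in enumerate(peptide))


def predict_site(peptide: str, profile: List[Dict[str, float]], b: float) -> bool:
    """Equation (9): predict a phosphorylation site when the risk difference exceeds b."""
    return risk_difference(peptide, profile) > b


# Made-up 3-position profile and threshold, only to show the mechanics.
toy_profile = [
    {"R": 2.0, "A": -1.0},
    {"S": 0.5},
    {"L": 1.5, "A": -0.5},
]
is_site = predict_site("RSL", toy_profile, b=3.5)      # 4.0 > 3.5
is_not_site = predict_site("ASA", toy_profile, b=3.5)  # -1.0 <= 3.5
```

Precomputing the profile once per PK group means that scoring a candidate is only a table lookup and a sum, which is presumably why the difference profile is described as the efficient way to implement the discriminant function.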

Figure 1

The outline of the training and prediction procedure of PPSP.

Results and discussions

Prediction performance of PPSP

Three measurements, i.e., sensitivity (Sn), specificity (Sp) and the Matthews correlation coefficient (MCC), are widely employed to evaluate prediction performance. They are defined as follows:

\[
Sn = \frac{TP}{TP + FN}, \qquad Sp = \frac{TN}{TN + FP},
\]

\[
MCC = \frac{(TP \times TN) - (FN \times FP)}{\sqrt{(TP + FN) \times (TN + FP) \times (TP + FP) \times (TN + FN)}}.
\]

Among the data with positive predictions by PPSP, the real positives are regarded as true positives (TP), while the others are defined as false positives (FP). Among the data with negative predictions by PPSP, the real positives are regarded as false negatives (FN), while the others are defined as true negatives (TN). Sensitivity (Sn) and specificity (Sp) give the proportions of correct predictions on the positive and negative data sets, respectively. When the numbers of positive and negative data differ greatly, however, the Matthews correlation coefficient (MCC) should also be calculated to assess the prediction performance. The value of MCC ranges from -1 to 1, and a larger MCC indicates better prediction performance.
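The three measurements follow directly from the four counts. The short Python sketch below implements the definitions above; the function name is hypothetical, both classes are assumed to be non-empty, and returning 0.0 when the MCC denominator vanishes is a convention chosen here rather than something the paper specifies.

```python
import math
from typing import Tuple


def sn_sp_mcc(tp: int, tn: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Sensitivity, specificity and Matthews correlation coefficient."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Convention for this sketch: MCC is 0.0 when the denominator is zero.
    mcc = (tp * tn - fn * fp) / denominator if denominator else 0.0
    return sn, sp, mcc


# Made-up, imbalanced example: 100 real sites against 1,000 non-sites.
sn, sp, mcc = sn_sp_mcc(tp=90, tn=900, fp=100, fn=10)
```

In this made-up example Sn and Sp are both 0.9, yet the MCC is only about 0.61, because the 100 false positives outnumber the 90 true positives. This is the situation in which the text recommends reporting MCC alongside Sn and Sp.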

To assess whether PPSP is unbiased and robust, we adopt the standard "Jack-Knife" validation. We perform the Jack-Knife validation for these PKs by removing one real site from the training data set at a time and re-calculating Sn and Sp each time. The final results are the averages of all Sn and Sp values from the Jack-Knife validation. Although "Jack-Knife" validation makes sense when the data set is small (i.e., N < 30), we have also carried out an additional test with n-fold (4-, 6-, 8- and 10-fold in this work) cross-validation for the 22 PK groups with larger positive data sets (N ≥ 30). As previously proposed [25], the tests are repeated 20 times and Sn and Sp are computed each time. The average Sn and Sp are then taken as the final results (see additional file 5).
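One reading of the Jack-Knife procedure described above is sketched below in Python. The `train` and `evaluate` callables are hypothetical placeholders for model fitting and for computing an (Sn, Sp) pair; the paper does not specify on which data Sn and Sp are re-calculated after each removal, so that choice is left to the caller.

```python
from typing import Any, Callable, List, Sequence, Tuple


def jackknife(positives: List[Any],
              negatives: Sequence[Any],
              train: Callable[[List[Any], Sequence[Any]], Any],
              evaluate: Callable[[Any, List[Any], Sequence[Any]], Tuple[float, float]],
              ) -> Tuple[float, float]:
    """Leave one positive site out at a time and average the (Sn, Sp) pairs."""
    results = []
    for i in range(len(positives)):
        kept = positives[:i] + positives[i + 1:]   # training set without site i
        model = train(kept, negatives)
        results.append(evaluate(model, positives, negatives))
    n = len(results)
    return (sum(r[0] for r in results) / n,
            sum(r[1] for r in results) / n)


# Toy stand-ins (assumptions): the "model" is just the set of sites it saw,
# and a site counts as recovered only if it was part of the training data.
toy_train = lambda pos, neg: set(pos)
toy_evaluate = lambda model, pos, neg: (sum(p in model for p in pos) / len(pos), 1.0)
avg_sn, avg_sp = jackknife(["a", "b", "c", "d"], [], toy_train, toy_evaluate)
```

With the memorizing toy model, each round misses exactly the held-out site, so the averaged Sn is 0.75; a model that generalizes would recover held-out sites and score higher.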

In Table 1, we list the prediction performances for four of the most well-studied PKs: PKA (Protein kinase A), CK2 (Casein Kinase II), ATM (Ataxia telangiectasia mutated) and S6K (Ribosomal protein S6 kinase, or RSK). The performances for self-consistency, Jack-knife validation and n-fold cross-validation are provided. For PKA, CK2, ATM and S6K, the Sn and Sp of the self-consistency test are 92.31% & 97.40%, 93.33% & 96.46%, 92.59% & 91.98%, and 89.47% & 95.90%, while those of the Jack-Knife validation are 90.11% & 90.46%, 83.21% & 88.44%, 86.05% & 91.89%, and 92.86% & 91.05%, respectively. The performances of the n-fold cross-validation are very similar to, and consistent with, the results of the Jack-Knife validation. PPSP is therefore quite robust and unbiased for these well-studied PKs. Moreover, PPSP can predict sites for several novel PKs (>30, see additional file 4). In Table 2, we choose four PKs for which no predictors were previously available: TRK (Neurotrophic tyrosine kinase receptor), mTOR (Mammalian target of rapamycin), SyK (Spleen tyrosine kinase), and MET/RON (Hepatocyte growth factor receptor/Macrophage-stimulating protein receptor). The prediction performance of PPSP on these PKs is also satisfactory, and the Jack-knife validation suggests that the PPSP approach is robust and unbiased for these novel PKs as well. The full prediction performance of PPSP is available from the PPSP website.

Table 1 The performances of self-consistency, Jack-knife validation and n-fold (4-, 6-, 8-, 10-fold in this work) cross-validation for four well-studied PKs of PKA, CK2, ATM and S6K. The n-fold cross-validation has been performed for the large data sets (N ≥ 30).
Table 2 The self-consistency performance and Jack-knife validation for four novel PKs of TRK, mTOR, SyK and MET/RON.

To evaluate the signal-to-noise performance of PPSP in retrieving phosphorylation sites, we perform two additional evaluations. First, we randomly generate 10,000 serine (S) and 10,000 threonine (T) ennea-peptides for serine/threonine kinases (STKs), and 10,000 tyrosine (Y) ennea-peptides for tyrosine kinases (TKs). Second, to determine the ability of PPSP to retrieve potential real phosphorylation sites from a full proteome, we download the protein sequences of the human proteome from a public database and randomly retrieve 10,000 S, T and Y ennea-peptides for STKs and TKs, respectively. We then compute the Risk Difference (RD) of each ennea-peptide. Under the default threshold of PPSP, the percentage of sites predicted to be potential true positive hits is listed in Table 3. The prediction results for the random and human proteome data sets are very similar. The distributions of the Risk Difference for PKA-specific site prediction on the random and human proteome data sets are plotted in Figure 2. In this work, the default threshold for PKA is 3.5, and the predicted Risk Differences of most ennea-peptides from the two data sets are smaller than this cut-off. Based on these analyses, we propose that PPSP can efficiently predict potential real sites with very few false positive hits. The proportions of predicted serine and threonine hits are not exactly equal; neither we nor others are yet able to explain this observation [25].
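The random-peptide evaluation can be sketched in Python as follows. The paper does not state how the random ennea-peptides were sampled, so drawing the flanking residues uniformly from the 20 standard amino acids is an assumption, as are the function names, the fixed seed and the `score` callable that stands in for the Risk Difference computation.

```python
import random
from typing import Callable, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_peptides(center: str, n: int, flank: int = 4, seed: int = 0) -> List[str]:
    """Generate n random peptides with a fixed central residue (S, T or Y)."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    peptides = []
    for _ in range(n):
        upstream = "".join(rng.choice(AMINO_ACIDS) for _ in range(flank))
        downstream = "".join(rng.choice(AMINO_ACIDS) for _ in range(flank))
        peptides.append(upstream + center + downstream)
    return peptides


def hit_rate(peptides: List[str],
             score: Callable[[str], float],
             threshold: float) -> float:
    """Percentage of peptides whose score exceeds the threshold."""
    hits = sum(1 for peptide in peptides if score(peptide) > threshold)
    return 100.0 * hits / len(peptides)


serine_peptides = random_peptides("S", 10_000)
# `score` would be the Risk Difference of a trained PK profile; the constant
# function used here is only a placeholder so that the sketch runs.
rate = hit_rate(serine_peptides, score=lambda peptide: 0.0, threshold=3.5)
```

A low hit rate on such background peptides, as reported in Table 3, is what supports the claim of few false positive hits.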

Table 3 The percentage of sites predicted to be potential true positive hits with the default cut-off of PPSP. Random ennea-peptides and data sets from the human proteome have been computed separately.
Figure 2

The distribution of the Risk Difference of the random and human proteome data sets for PKA-specific site prediction. A. Distribution of Risk Difference of random data set (serine) of PKA-specific site prediction. B. Distribution of Risk Difference of random data set (threonine) of PKA-specific site prediction. C. Distribution of Risk Difference of human proteome data set (serine) of PKA-specific site prediction. D. Distribution of Risk Difference of human proteome data set (threonine) of PKA-specific site prediction.

Comparison of PPSP with Scansite, NetPhosK, KinasePhos and GPS

With four well-studied PKs (PKA, CK2, ATM and S6K) as model kinases, we compare PPSP with four previous online prediction systems: Scansite, NetPhosK, KinasePhos and GPS. Table 4 lists the prediction performances of Scansite, NetPhosK, KinasePhos and GPS for PKA, CK2, ATM and S6K. Since we cannot re-perform the Jack-knife validation for these predictors, we submit the substrate sequences to the tools for prediction, and the self-consistency performance of PPSP is adopted for comparison. Scansite offers three thresholds for prediction (high, medium and low stringency), while KinasePhos controls prediction specificity with three cut-off values (90%, 95% and 100%). We calculate the prediction performances of Scansite and KinasePhos at each of their three thresholds separately. The default parameter is adopted for GPS. For NetPhosK, we adopt only the default cut-off value of 0.5, in the mode of prediction without filtering. PPSP, NetPhosK, KinasePhos and GPS all perform better than Scansite. For PKA, the prediction performance of PPSP is 90.11% (Sn) and 91.70% (Sp), which outperforms NetPhosK (Sn 79.12% & Sp 90.65%) with about 10% higher sensitivity and similar specificity. For CK2, the performance of PPSP is 83.21% (Sn) and 90.01% (Sp), slightly higher than that of NetPhosK (Sn 82.48% & Sp 89.43%). The prediction performance of KinasePhos is similar to that of PPSP on PKA and CK2. For ATM, however, NetPhosK achieves 86.01% (Sn) and 98.51% (Sp), whereas PPSP achieves 93.02% (Sn) and 94.06% (Sp). Although the specificity of PPSP is ~4% lower than that of NetPhosK, its sensitivity is ~7% higher. Finally, for S6K (called RSK in NetPhosK), the specificities of PPSP (97.97%) and NetPhosK (97.14%) are quite similar, but PPSP outperforms NetPhosK with ~10% higher sensitivity. On this basis, we propose that the prediction performance of PPSP is at least comparable with that of the existing systems.

Table 4 The prediction performance of Scansite, NetPhosK, KinasePhos and GPS for four well-studied PKs of PKA, CK2, ATM and S6K.

However, the analysis and comparison above are purely statistical and not intuitive. We therefore browsed the recent literature in PubMed and randomly chose some instances for comparison. One example is Bluetongue virus (BTV) nonstructural protein 2 (NS2, P23065), a substrate of CK2 [31]. As a nonspecific single-stranded RNA (ssRNA)-binding protein, NS2 accumulates in BTV-infected cells and functions in viral replication and morphogenesis [31-34]. NS2 can hydrolyze both ATP and GTP with high affinity, showing strong enzymatic activity [32]. Mutagenesis analysis demonstrated that CK2 phosphorylates NS2 at two serine sites, S249 and S259, probably modulating its RNA-binding properties or enzymatic activity, or influencing its ability to interact with other viral proteins [31]. In the prediction of CK2-specific phosphorylation sites, all four programs detect these sites successfully (see Table 5). In this case, Scansite with medium stringency gets the best hits. PPSP predicts three sites as positive hits (T247, S249 and S259), whereas NetPhosK returns too many results, with seven positive hits. Two additional instances are also provided in Table 5. One is the Drosophila transcription factor GAGA (Q08605), which regulates gene transcription and chromatin remodeling [35]. The other is human Calmodulin (P62158) [36]. The prediction results of the four programs are shown in Table 5, and the online prediction results of PPSP are shown in Figure 3. For the well-studied PKs, e.g., CK2, PPSP is thus accurate and comparable with the existing tools.

Table 5 The experimentally verified vs. predicted CK2-specific phosphorylation sites of Bluetongue virus (BTV) nonstructural protein 2 (NS2), Drosophila transcription factor GAGA and human Calmodulin protein.
Figure 3

The prediction results of Bluetongue virus (BTV) nonstructural protein 2 (NS2), Drosophila transcription factor GAGA and human Calmodulin protein with PPSP. Figure 3A – prediction results of NS2; Figure 3B – prediction results of GAGA; Figure 3C – prediction results of Calmodulin.

Application of PPSP to the novel PKs

To illustrate the application of PPSP to novel PKs, we employ PPSP to predict the phosphorylation sites of TRK. TRK is a sub-family of receptor tyrosine kinases (RTKs) consisting of three highly similar homologs, TRK-A, -B and -C [37]. TRK-A, -B and -C are activated specifically by nerve growth factor (NGF), by brain-derived neurotrophic factor (BDNF) and NT-4/-5, and by NT-3, respectively. In the activated state, TRK regulates a variety of biological processes, including cell survival, embryonic development, differentiation, proliferation, axon and dendrite growth and patterning, and apoptosis [37].

Recently, Ras guanine-releasing factor 1 (RasGrf1, Q13972), a guanine nucleotide exchange factor for GTPases of the Ras and Rho families, was proposed to be phosphorylated by and to interact with TRK-A, -B and -C in co-transfection experiments [5]. However, the exact sites at which TRK phosphorylates RasGrf1 remain to be identified. PPSP predicts two potential phosphorylation sites on RasGrf1, Y94 and Y1209 (see Figure 4). Moreover, human tumorous imaginal disc 1 (TID1, Q96EY1) was verified as a substrate of TRK by co-immunoprecipitation (Co-IP) [4], but its phosphorylation sites were not elucidated. PPSP predicts three candidate sites: Y94, Y95 and Y173 (see Figure 4). These predictions should be useful for further experiments to elucidate the phospho-regulation underlying cellular dynamics.

Figure 4

The diagram of potential phosphorylation sites of human RasGrf1 (Q13972) and TID1 (Q96EY1) by TRK.


In this work, we present a novel computational program, PPSP (prediction of PK-specific phosphorylation sites), based on Bayesian decision theory (BDT). We model a candidate phosphorylation motif as a nonapeptide (9 aa) consisting of a potential phosphorylation site (S/T or Y) flanked by 4 upstream and 4 downstream residues. With the BDT algorithm, we estimate the probability distributions of true and false phosphorylation sites and make predictions based on a loss function constructed with the BLOSUM62 matrix [30]. We have evaluated the sensitivity and specificity of PPSP by "Jack-knife" validation. An online PPSP web service has also been constructed, which predicts PK-specific phosphorylation sites for 68 PK groups. For comparison with four reported systems, Scansite, NetPhosK, KinasePhos and GPS, we take four well-studied PKs, PKA, CK2, ATM and S6K, as model kinases. Judged on these well-studied PKs, the prediction performance of PPSP is satisfactory and comparable with that of the existing tools. Moreover, PPSP also provides accurate predictions for many novel PKs, such as TRK, mTOR, SyK and MET/RON. Thus, compared with previous work, PPSP offers more accurate and more powerful prediction. The BDT approach could also be extended to the prediction of other PTMs, such as sumoylation and ubiquitination. In addition, although many phospho-proteomic studies have generated large amounts of data [8, 20, 21], the up-regulated PKs remain to be dissected. Likewise, even where the phospho-regulation between protein kinases and their respective substrates has been demonstrated, the exact phosphorylation sites are often unclear [4, 5]. Taken together, the prediction results of PPSP should be insightful and important for further experiments. The combination of computational and experimental identification will propel our understanding of phosphorylation dynamics into a new phase.
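The 9-residue motif model summarized above can be illustrated with a short sketch that enumerates candidate windows around every S, T or Y in a sequence. This is an illustration under stated assumptions, not the PPSP implementation: the function name `candidate_windows`, the toy sequence, and the use of `-` to pad windows that run past the protein termini are all hypothetical, since the text does not describe how termini are handled. The BDT scoring step itself is not shown.

```python
def candidate_windows(sequence, residues="STY", flank=4, pad="-"):
    """Yield (position, residue, window) for each candidate site.

    position is 1-based; window holds `flank` residues on each side of
    the candidate, so it is 2 * flank + 1 = 9 residues long by default.
    Padding with `pad` at the termini is an assumption of this sketch.
    """
    padded = pad * flank + sequence + pad * flank
    for i, aa in enumerate(sequence):
        if aa in residues:
            # sequence[i] sits at padded[i + flank], the window centre
            yield i + 1, aa, padded[i:i + 2 * flank + 1]


# Toy sequence, invented for illustration only.
toy = "MKRSLEEYAT"
for position, residue, window in candidate_windows(toy):
    print(f"{residue}{position}\t{window}")
```

Each window produced this way would then be passed to a scoring function; restricting `residues` to `"Y"` gives the candidate set for a tyrosine kinase such as TRK.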

Availability and requirements

PPSP has been implemented with Linux + Apache + PHP, and is freely available at: An up-to-date web browser (e.g. Internet Explorer, Netscape or Mozilla) is required.


  1. Schafmeier T, Haase A, Kaldi K, Scholz J, Fuchs M, Brunner M: Transcriptional feedback of neurospora circadian clock gene by phosphorylation-dependent inactivation of its transcription factor. Cell 2005, 122(2):235–246. 10.1016/j.cell.2005.05.032
  2. Singh CR, Curtis C, Yamamoto Y, Hall NS, Kruse DS, He H, Hannig EM, Asano K: Eukaryotic translation initiation factor 5 is critical for integrity of the scanning preinitiation complex and accurate control of GCN4 translation. Mol Cell Biol 2005, 25(13):5480–5491. 10.1128/MCB.25.13.5480-5491.2005
  3. Lou Y, Yao J, Zereshki A, Dou Z, Ahmed K, Wang H, Hu J, Wang Y, Yao X: NEK2A interacts with MAD1 and possibly functions as a novel integrator of the spindle checkpoint signaling. J Biol Chem 2004, 279(19):20049–20057. 10.1074/jbc.M314205200
  4. Liu HY, MacDonald JI, Hryciw T, Li C, Meakin SO: Human tumorous imaginal disc 1 (TID1) associates with Trk receptor tyrosine kinases and regulates neurite outgrowth in nnr5-TrkA cells. J Biol Chem 2005, 280(20):19461–19471. 10.1074/jbc.M500313200
  5. Robinson KN, Manto K, Buchsbaum RJ, MacDonald JI, Meakin SO: Neurotrophin-dependent tyrosine phosphorylation of Ras guanine-releasing factor 1 and associated neurite outgrowth is dependent on the HIKE domain of TrkA. J Biol Chem 2005, 280(1):225–235. 10.1074/jbc.M505720200
  6. Pawson T: Specificity in signal transduction: from phosphotyrosine-SH2 domain interactions to complex cellular systems. Cell 2004, 116(2):191–203. 10.1016/S0092-8674(03)01077-8
  7. Ma L, Chen Z, Erdjument-Bromage H, Tempst P, Pandolfi PP: Phosphorylation and functional inactivation of TSC2 by Erk implications for tuberous sclerosis and cancer pathogenesis. Cell 2005, 121(2):179–193. 10.1016/j.cell.2005.02.031
  8. Beausoleil SA, Jedrychowski M, Schwartz D, Elias JE, Villen J, Li J, Cohn MA, Cantley LC, Gygi SP: Large-scale characterization of HeLa cell nuclear phosphoproteins. Proc Natl Acad Sci U S A 2004, 101(33):12130–12135. 10.1073/pnas.0404720101
  9. Kreegipuu A, Blom N, Brunak S: PhosphoBase, a database of phosphorylation sites: release 2.0. Nucleic Acids Res 1999, 27(1):237–239. 10.1093/nar/27.1.237
  10. Manning BD, Cantley LC: Hitting the target: emerging technologies in the search for kinase substrates. Sci STKE 2002, 2002(162):PE49.
  11. Kim JH, Lee J, Oh B, Kimm K, Koh I: Prediction of phosphorylation sites using SVMs. Bioinformatics 2004, 20(17):3179–3184. 10.1093/bioinformatics/bth382
  12. Biondi RM, Nebreda AR: Signalling specificity of Ser/Thr protein kinases through docking-site-mediated interactions. Biochem J 2003, 372(Pt 1):1–13. 10.1042/BJ20021641
  13. Obenauer JC, Cantley LC, Yaffe MB: Scansite 2.0: Proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res 2003, 31(13):3635–3641. 10.1093/nar/gkg584
  14. Uhlik MT, Temple B, Bencharit S, Kimple AJ, Siderovski DP, Johnson GL: Structural and evolutionary division of phosphotyrosine binding (PTB) domains. J Mol Biol 2005, 345(1):1–20. 10.1016/j.jmb.2004.10.038
  15. Yaffe MB, Elia AE: Phosphoserine/threonine-binding domains. Curr Opin Cell Biol 2001, 13(2):131–138. 10.1016/S0955-0674(00)00189-7
  16. Yaffe MB, Leparc GG, Lai J, Obata T, Volinia S, Cantley LC: A motif-based profile scanning approach for genome-wide prediction of signaling pathways. Nat Biotechnol 2001, 19(4):348–353. 10.1038/86737
  17. Yaffe MB, Smerdon SJ: The use of in vitro peptide-library screens in the analysis of phosphoserine/threonine-binding domain structure and function. Annu Rev Biophys Biomol Struct 2004, 33: 225–244. 10.1146/annurev.biophys.33.110502.133346
  18. Holland PM, Cooper JA: Protein modification: docking sites for kinases. Curr Biol 1999, 9(9):R329–31. 10.1016/S0960-9822(99)80205-X
  19. Joughin BA, Tidor B, Yaffe MB: A computational method for the analysis and prediction of protein:phosphopeptide-binding sites. Protein Sci 2005, 14(1):131–139. 10.1110/ps.04964705
  20. Ficarro SB, McCleland ML, Stukenberg PT, Burke DJ, Ross MM, Shabanowitz J, Hunt DF, White FM: Phosphoproteome analysis by mass spectrometry and its application to Saccharomyces cerevisiae. Nat Biotechnol 2002, 20(3):301–305. 10.1038/nbt0302-301
  21. Ballif BA, Villen J, Beausoleil SA, Schwartz D, Gygi SP: Phosphoproteomic analysis of the developing mouse brain. Mol Cell Proteomics 2004, 3(11):1093–1101. 10.1074/mcp.M400085-MCP200
  22. Blom N, Gammeltoft S, Brunak S: Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J Mol Biol 1999, 294(5):1351–1362. 10.1006/jmbi.1999.3310
  23. Blom N, Sicheritz-Ponten T, Gupta R, Gammeltoft S, Brunak S: Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics 2004, 4(6):1633–1649. 10.1002/pmic.200300771
  24. Huang HD, Lee TY, Tzeng SW, Horng JT: KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites. Nucleic Acids Res 2005, 33(Web Server issue):W226–9. 10.1093/nar/gki471
  25. Huang HD, Lee TY, Tzeng SW, Wu LC, Horng JT, Tsou AP, Huang KT: Incorporating hidden Markov models for identifying protein kinase-specific phosphorylation sites. J Comput Chem 2005, 26(10):1032–1041. 10.1002/jcc.20235
  26. Xue Y, Zhou F, Zhu M, Ahmed K, Chen G, Yao X: GPS: a comprehensive www server for phosphorylation sites prediction. Nucleic Acids Res 2005, 33(Web Server issue):W184–7. 10.1093/nar/gki393
  27. Zhou FF, Xue Y, Chen GL, Yao X: GPS: a novel group-based phosphorylation predicting and scoring method. Biochem Biophys Res Commun 2004, 325(4):1443–1448. 10.1016/j.bbrc.2004.11.001
  28. Diella F, Cameron S, Gemund C, Linding R, Via A, Kuster B, Sicheritz-Ponten T, Blom N, Gibson TJ: Phospho.ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins. BMC Bioinformatics 2004, 5(1):79. 10.1186/1471-2105-5-79
  29. Duda RO, Hart PE, Stork DG: Pattern classification. 2nd edition. Beijing, China Machine Press; 2004:680.
  30. Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 1992, 89(22):10915–10919.
  31. Modrof J, Lymperopoulos K, Roy P: Phosphorylation of bluetongue virus nonstructural protein 2 is essential for formation of viral inclusion bodies. J Virol 2005, 79(15):10023–10031. 10.1128/JVI.79.15.10023-10031.2005
  32. Horscroft NJ, Roy P: NTP binding and phosphohydrolase activity associated with purified bluetongue virus non-structural protein NS2. J Gen Virol 2000, 81(Pt 8):1961–1965.
  33. Lymperopoulos K, Wirblich C, Brierley I, Roy P: Sequence specificity in the interaction of Bluetongue virus non-structural protein 2 (NS2) with viral RNA. J Biol Chem 2003, 278(34):31722–31730. 10.1074/jbc.M301072200
  34. Taraporewala ZF, Chen D, Patton JT: Multimers of the bluetongue virus nonstructural protein, NS2, possess nucleotidyl phosphatase activity: similarities between NS2 and rotavirus NSP2. Virology 2001, 280(2):221–231. 10.1006/viro.2000.0764
  35. Bonet C, Fernandez I, Aran X, Bernues J, Giralt E, Azorin F: The GAGA Protein of Drosophila is Phosphorylated by CK2. J Mol Biol 2005, 351(3):562–572. 10.1016/j.jmb.2005.06.039
  36. Arrigoni G, Marin O, Pagano MA, Settimo L, Paolin B, Meggio F, Pinna LA: Phosphorylation of calmodulin fragments by protein kinase CK2. Mechanistic aspects and structural consequences. Biochemistry 2004, 43(40):12788–12798. 10.1021/bi049365c
  37. Huang EJ, Reichardt LF: Trk receptors: roles in neuronal signal transduction. Annu Rev Biochem 2003, 72: 609–642. 10.1146/annurev.biochem.72.121801.161629



We thank Dr. Fengfeng Zhou (UGA) for helpful discussions and critical reading of this manuscript. The authors thank Drs T.J. Gibson and F. Diella for providing the data set of Phospho.ELM for this work. We thank the anonymous referees for their many helpful comments. The work is supported by grants from Chinese Natural Science Foundation (39925018 and 30121001), Chinese Academy of Science (KSCX2-2-01), Chinese 973 project (2002CB713700), Beijing Office for Science (H020220020220) and American Cancer Society (RPG-99-173-01) to X. Yao. X. Yao is a GCC Distinguished Cancer Research Scholar.

Author information



Corresponding author

Correspondence to Xuebiao Yao.

Additional information

Authors' contributions

YX and AL should be regarded as joint first authors. YX and AL designed the methodology, carried out the analysis and drafted the manuscript. LW developed the web service, contributed several insightful suggestions and greatly improved the manuscript. XY coordinated the research and finalized the manuscript.

Yu Xue and Ao Li contributed equally to this work.

Electronic supplementary material

Additional file 1: To test whether the training data sets are highly redundant, we retrieved the protein sequences of all substrates for each PK and used CD-HIT to check whether many of them are highly homologous. The results for the eight PK groups employed in this work are listed. Most of the PK-specific substrates show low similarity. For CK2 and PKA, we carefully checked each pair of homologous protein sequences and found that most of the phosphorylation sites are not homologous sites. Thus, we propose that the training data set has low redundancy and is suitable for this work. (XLS 14 KB)

Additional file 2: The statistics of the annotations of the phosphorylation information from Swiss-Prot database. The entries annotated with "by similarity", "potential", "probable" or "partial" are regarded as ambiguous annotations. There are only 842 annotations of the kinase-specific phosphorylation sites provided. (XLS 14 KB)

Additional file 3: The statistics of the annotations of the kinase-specific phosphorylation sites from the Swiss-Prot database. Only 18 PK groups have at least ten sites. (XLS 18 KB)

Additional file 4: Data set description for each protein kinase. (XLS 18 KB)

Additional file 5: The prediction performance of PPSP (self-consistency, Jack-Knife validation and n-fold cross-validation) for 22 PK groups with large data set (N ≥ 30). (XLS 20 KB)


About this article

Cite this article

Xue, Y., Li, A., Wang, L. et al. PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory. BMC Bioinformatics 7, 163 (2006).



  • Phosphorylation Site
  • Prediction Performance
  • Potential Phosphorylation Site
  • Bayesian Decision Theory
  • BLOSUM62 Matrix