PostMod: sequence based prediction of kinase-specific phosphorylation sites with indirect relationship

Background Post-translational modifications (PTMs) have a key role in regulating cell functions. Consequently, identification of PTM sites has a significant impact on understanding protein function and revealing cellular signal transductions. Especially, phosphorylation is a ubiquitous process with a large portion of proteins undergoing this modification. Experimental methods to identify phosphorylation sites are labor-intensive and of high-cost. With the exponentially growing protein sequence data, development of computational approaches to predict phosphorylation sites is highly desirable. Results Here, we present a simple and effective method to recognize phosphorylation sites by combining sequence patterns and evolutionary information and by applying a novel noise-reducing algorithm. We suggested that considering long-range region surrounding a phosphorylation site is important for recognizing phosphorylation peptides. Also, from compared results to AutoMotif in 36 different kinase families, new method outperforms AutoMotif. The mean accuracy, precision, and recall of our method are 0.93, 0.67, and 0.40, respectively, whereas those of AutoMotif with a polynomial kernel are 0.91, 0.47, and 0.17, respectively. Also our method shows better or comparable performance in four main kinase groups, CDK, CK2, PKA, and PKC compared to six existing predictors. Conclusion Our method is remarkable in that it is powerful and intuitive approach without need of a sophisticated training algorithm. Moreover, our method is generally applicable to other types of PTMs.


Background
Post-translational modifications (PTMs) have important implication on the protein functions involved in signal transductions and many human diseases. Especially, phosphorylation is one of the most ubiquitous of these processes with a reported 30~50% of eukaryotic proteins undergoing this modification. For this reason, identifying phosphorylation sites is important for understanding functional role of proteins and cell signalling networks. In order to determine phosphorylation sites several experimental tools such as mass spectrometry have been used. Experimental efforts using those techniques have made it possible to construct several databases for phosphorylation sites, such as Phospho. ELM [1,2], PhosphoSite [3], and PhosPhAt [4]. However, those techniques are time-consuming and high cost approaches. Due to such practical limitation, an efficient computational algorithm to recognize phosphorylation sites is highly desirable.
Previously, several methods to predict phosphorylation sites have been developed by probing evolutionary information, using physicochemical properties, or searching motif patterns. The most successful algorithms are machine learning-based approaches. Using the artificial neural network (ANN) models, NetPhosYesat [5] predicts phosphorylation sites in yeast, and NePhosK [6] provides a sequence-based phosphorylation site prediction service. Examples of support vector machine (SVM)-based approaches are PredPhospho [7], Auto-Motif [8,9], and kinasePhos2.0 [10] which trains SVM by using amino acid coupling patterns and solvent accessibility. Recently, probabilistic frameworks and new kernel methods were suggested. PPSP [11] used Bayesian decision theory to predict PK-specific phosphorylation sites, and SiteSeek [12] was implemented with a high search sensitivity by introducing a new adaptive locallyeffective kernel method with hydrophobic information. In addition, conditional random field model was applied to predict kinase-specific phosphorylation [13].
Despite high performance of those machine learning or statistical approaches, development of simple, intuitive, and generally applicable algorithms has been pursued. A group-based approach, GPS, simply and intuitively recognizes phosphorylation sites by calculating peptide similarities with BLOSUM62 matrix and deciding which group is closest to the given peptide after clustering known peptide groups [14]. Our study aimed to develop a new algorithm by inventing a new scoring method, as well as by introducing an effective noise-reducing system, which can be applied to different types of modifications. We developed a new scoring scheme to measure the sequence similarity by combining pairwise sequence similarity scores and profile-profile alignment scores. Basic assumption was that physicochemical information, motif information, and evolutionary information could be retrieved by measuring sequence similarities. We also generalized the motif scoring method, which has been conventionally used for predicting phosphorylation sites, by performing profile-profile alignments with gaps. It turned out that such generalization significantly improved the prediction accuracy. Considering both features together, we developed a new peptide sequence similarity scoring method. We then applied a noise-reducing system exploiting indirect relationships among peptide sequences. When we tested our new method on 48 different kinase groups, the results indicated that the two innovative features of our present work, i.e., a new sequence similarity scoring method and the noise-reducing system, both contributed to the outstanding performance of the new method in recognizing phosphorylation sites correctly, showing better performance than AutoMotif which is one of the best-performing methods. Also, by testing unbiased data set we can achieve better or comparable performance compared to six existing predictors.

Datasets
We developed our new method using Phospho.ELM (released in December 2008) database [2]. The database contains experimentally validated phosphorylation sites for 254 different kinases. From the database we selected kinase groups which contained more than 20 known phosphorylation sites, resulting in 48 different kinase groups in our test set. To develop and evaluate the new method, positive (phosphorylation) and negative (nonphosphorylation) peptides were needed to make the 'reference set'. For a specific phosphorylation type, positive peptides were all peptides in Phospho.ELM database that had the same type of phosphorylation. Negative peptides were randomly selected from sequences which shared the same phosphorylation residue types with positive peptides. We selected negative peptides 10 times more than the number of positive peptides. The whole dataset can be downloaded from our web server.
Peptide sequence similarity scoring scheme Our scoring system was designed to give a high score when two peptides have high similarity, indicating that if a query peptide gets high scores with known phosphorylation peptides, the query peptide is predicted to be a peptide with the same type of phosphorylation. To calculate peptide similarities we combined two different sequence similarity measures, one using BLOSUM62 matrix [15] and the other using profile-profile alignment which contains evolutionary information. Both measures are widely used to calculate sequence similarities. We assumed that comparing sequence similarity with BLOSUM62 matrix could provide similarity measure for physicochemical properties of the two sequences and motif patterns indirectly. Similarity score using BLOSUM62 matrix, S BLOSUM62 , between peptides A and B with fixed window size 7 was defined as The score(A i , B i ) is the substitution scores between two amino acids A i and B i in BLOSUM62 matrix. The window size 7 was determined by referencing a previously developed method, GPS [14]. If the candidate phosphorylation sites were near the N or C terminus, we represented the absent terminal sequences as X.
The second component of our new scoring scheme is the profile-profile alignment scores. The conventional way to measure sequence similarity for the purpose of predicting PTM sites is to use motif scoring methods where gapless alignments were typically assumed. We generalized the motif scoring method by performing profile-profile alignments allowing gaps. To calculate similarity scores based on profile-profile alignment we first generated the position specific scoring matrix (PSSM) and the position specific frequency matrix (PSFM) for a protein sequence which contained a given peptide by using PSI-BLAST [16]. We used blastpgp version 2.2.15 with default parameters except the options for the number of iterations (j = 5) and the cutoff E-value value (h = 0.001). Then we extracted PSSM and PSFM corresponding to the given peptide. Using both matrices we computed profile-profile alignment scores for the position i of a peptide A and the position j of a peptide B (PPA ij ) as follows, where, f ik and f jk are the frequencies of PSFM matrix at the position of i and j, and S ik and S jk represent the scores of PSSM matrix at the position of i and j. The detail procedure was reported in our previous work [17]. We aligned both peptides A and B and calculated similarity scores by using dynamic programming with gap penalty of 3.0 and gap extension penalty of 0.75. We referred to this profile-profile alignment scores as S profile . During profile-profile alignment we selected the window size as 41 after trying several different window sizes such as 7, 19, 41, and 101. In previous works, wide ranges of values from 7 to 19 have been used as a window size for calculating the sequence similarity. We increased the window size from 7 up to 101 to evaluate effect of considering long-range region surrounding phosphorylation sites. The performances for PKB-group with different window sizes were measured, and then 41 was selected as the optimal window size for calculating the profile-profile alignments. Once having calculated both types of similarity scores, S BLOSUM62 and S profile , we multiplied both scores to calculate the final similarity score (S combined ) of the two peptides as follows, The positive effect of combined-measure is described in Result and discussion section. We also tried a linear combination of S BLOSUM62 and S profile as the final similarity score and found that the multiplicative form of the two scores showed better performance.
Noise reduction scheme utilizing indirect relationships By using similarity scores we can rank all reference peptides for a given query peptide. If a scoring system is perfect, it would give higher scores to all true phosphorylation peptides than to any of non-phosphorylation peptides. However, our scoring system is obviously imperfect, partly because our current scoring system only considers sequence features. We may be able to improve the accuracy by adding new features, but in this work we focused on designing noise-reducing system by considering indirect relationships among reference peptides. Basic idea is that if highly ranked reference peptides tend to be known phosphorylation peptides, the query peptide is likely to be a phosphorylation site, otherwise a non-phosphorylation site. Figure 1 illustrates the noise-reducing scoring system of this work. In step 1, for a given query (in this example, a phosphorylation peptide), we calculate S combined scores for all peptides in the reference set, and select the top α highly ranked peptides (in this example, α = 5, consisting of 2 phosphorylation peptides (P j ) and 3 non-phosphorylation peptides (N j )). Next in step 2, indirect relationship matrix is constructed by calculating S combined scores between those top α hits and all peptides in the reference set (in this example, there are 5 phosphorylation and 5 non-phosphorylation peptides).
When constructing the matrix, if a peptide i is not included in the top 5 hits for a peptide j, the score (j, i) of the matrix is set to zero. In step 3, indirect scores are calculated by using indirect relationship matrix. We assume that scores between positive (or negative) peptides are signal, while those between positive (or negative) and negative (or positive) are noise. According to our hypothesis, indirect scores can be calculated as follows, where we give the weight of 10 to positive peptides since the number of negative peptides is 10 times that of positive peptides. For example, suppose that P 2 recognized P 1 and P 4 as signal, and N 3 as noise. Then, the final indirect score of P 2 (5.70) is calculated by adding P 1 and P 4 with weight 10 (10*(0.61+0.01)) and subtracting N 3 (0.50). Finally, from the indirect scores we consider the top β hits (in this example, β = 4) as query related peptides. Then, if the number of positive peptides are greater than γ (in this example, γ = 2), we predict the query peptide as a phosphorylation peptide. In this example P 2 , P 3 , P 4 , and N 2 are the top 4 hits, and thereby we predict that the query peptide is a phosphorylation peptide.
There existed several parameter determination issues in constructing the noise-reducing system such as the number of highly ranked peptides α, the value of β and γ. To determine those parameters we searched the optimized parameter set using PKB-group. It is obvious that different kinase groups have the different optimized parameters but we applied the same parameters to all cases to avoid over-fitting. As a result, for the number of highly ranked peptides we selected the value of half number of positive peptides in the reference set. For the value of β and γ we used value of one-third and onefourth of positive peptides, respectively.

Performance assessment
We assessed the prediction performance with leave-oneout cross validation (LOOCV). We used all dataset except one as the reference set and tested our scoring system with left-out one peptide. The accuracy (ACC), precision (P), and recall (R) were calculated to measure the performance. The equations are as follows. denote true positives, true negatives, false positives, and false negatives, respectively. We benchmarked another prediction server, AutoMotif [8,9]. We directly compared the performance of our method to that of AutoMotif since our dataset and evaluation scheme were same as AutoMotif. The performance data of AutoMotif were extracted from website [18]. Furthermore, to test unbiased data set we used new data set constructed by Illustration of the noise-reducing system. Illustration of the noise-reducing system. In step 1, we find the top 5 hits for a given query, where P j is a phosphorylation peptide and N j is a non-phosphorylation peptide. Next, S combined scores are calculated between the top 5 hits and all peptides in a reference set (10 peptides), where if a peptide i is not included in top 5 hits for a peptide j the score (j, i) is set to zero. In step 3, by summing each row of indirect relationship matrix we calculate indirect scores. During summation we assume that scores between positive (or negative) peptides are signal, while those between positive (or negative) and negative (or positive) are noise. Finally, we check the number of phosphorylation peptides among the top 4 hits by indirect scores. In this example P 2 , P 3 , P 4 , and N 2 are recognized as the top 4 hits, and among them 3 peptides are phosphorylation peptides, and thereby we predict that the query peptide is a phosphorylation peptide.

Results and discussion
Performance variation of various features for PKB-group We evaluated the ability of different level of similarity measures to discriminate phosphorylation and nonphosphorylation peptides. As we described in Method section, we tested four different types of scores, S BLOSUM62 , S profile , S combined , and our noise-reducing system. We selected PKB-group (protein kinase B) as a toy example to select parameters and to test performance variation of various features. The serine/threonine kinase PKB has been shown to play a crucial role in the control of diverse and important cellular functions such as cell survival and glycogen metabolism [21]. The result of performance comparison for PKB-group is shown in Table 1. In the aspect of precision and recall, S combined was significantly better than S BLOSUM62 and S profile . The precision and recall were increased more than 22% and 54%, respectively from those of S BLOSUM62 by combining S profile . This indicates that combining both similarity measures lead to significant positive effect. The positive effect is likely to be originated from removing ambiguous cases such as high score in S BLOSUM62 but low score in S profile , and vice versa. Furthermore, the noise-reducing system highly increased recall (24% increased) compared to S combined at the similar level of precisions (0.87 at Noise-reducing, 0.84 at S combined ), indicating the noisereducing system filtered out many false positives and rescued many true phosphorylation peptides which were falsely recognized as non-phosphorylation peptides by S combined .
We also evaluated overall performance across 48 kinase groups. The result is summarized in Table 2, showing that in terms of overall performance our noise-reducing system is most effective for identifying phosphorylation sites. Especially, recall was increased about 15% from S combined at the 0.68 precision. Also significant performance enhancement was occurred in S combined compared to both S BLOSUM62 and S profile . The results remark that combining S profile with S BLOSUM62 is generally effective to increase discriminate ability for phosphorylation sites. Moreover, we expect that we can apply the same noisereducing system to other types of post-translation modifications such as ubiquitination.
To better understand how the noise-reducing system determines phosphorylation sites with higher accuracy, we consider one specific example, the phosphorylation peptide, [RGGSASRS]. The first and fourth hits (both are non-phosphorylation sites, false positives) of the given peptide are [DGTSLKV] and [VQDTYQI], respectively, with S combined . However, the first hit does not have any indirect relationships with other highly ranked peptides, thereby in the noise-reducing system its rank drops to the 38 th . Also the fourth hit shows indirect relationships with highly ranked positive peptides (this is a false relationship), therefore the rank of third hit also drops to 40 th . On the other hand, the ranks of several positive peptides are increased since several true indirect relationships exist among highly ranked positive peptides. This example shows exactly how the noise-reducing system works to detect phosphorylation sites.
Importance of considering long-range region surrounding a phosphorylation site Several mechanisms have been proposed to understand kinase specific binding properties. Protein kinase forms a protein complex with its substrate through recognizing phosphorylation binding domains or short sequence patterns of substrates. To recognize kinase-specific motifs, not only short sequence patterns but also local structure around a phosphorylation site may be important. We assume that this information can be measured by calculating similarities of long-range region surrounding phosphorylation sites with profile-profile alignment (PPA).
We evaluated our assumption by comparing performance with various features. We drew number of true matches (phosphorylation peptides) according to number of false matches (non-phosphorylation peptides) up to 1000 false matches among 48 kinase families. The results are shown in figure 2, and the performance varies according to window size of peptides. In figure, values in brackets represent number of windowed residues. Here, S BLOSUM62 represents a short sequence pattern  BMC Bioinformatics 2010, 11(Suppl 1):S10 http://www.biomedcentral.com/1471-2105/11/S1/S10 comparison scheme with BLOSUM62 matrix for 7 windowed residues. When we extended window size as 41 the performance was degraded, therefore we fixed window size as 7 in S BLOSUM62 . To evaluate effect of longrange similarities first, we searched proper window size of long-range region by using PPA based score (S profile ) alone. Among three different window sizes, S profile with 19 windowed residues provided more precise similarities compared to 7 or 41 windowed residues. S profile (19 windowed residues), S profile (7 windowed residues), and S profile (41 windowed residues) recognized 610, 567, and 539 true matches up to 500 false matches. Next, we combined both S BLOSUM62 and S profile and measured performance variations. In this case, the best performance was occurred at the combination of S BLOSUM62 and S profile with 41 windowed residues. From the figure we note that if we combined S BLOSUM62 and S profile , the performance with S profile (41 windowed residues) was increased more than other two cases even though it showed worst performance when we considered S profile score alone. S combined with S profile (41 windowed residues) detected 1613 true matches up to 1000 false matches, while considering 19 and 7 windowed residues recognized 1583 and 1485 true matches, respectively. The results indicate that considering both short and longrange properties important to increase search sensitivity. When we search phosphorylation peptides the most important property may be physicochemical properties of adjacent residues to a phosphorylation site. However, together with this information, considering long-range region similarities can provide evolutionary or structural similarities surrounding kinase binding sites. Thereby both properties effectively contribute to measures peptide similarities.

Performance comparison with AutoMotif
Our new method was applied to 48 kinase groups, and the performance of 36 groups was compared to the benchmarked results of AutoMotif. The performance data of AutoMotif were extracted from AutoMotif web server. The mean accuracy, precision, and recall of AutoMotif with a polynomial kernel were 0.91, 0.47, and 0.17, respectively, while those of the new method were 0.93, 0.67, and 0.40, respectively. The performance variations among 48 kinase groups are shown in Table 3. We note that in general the new method shows better performance than AutoMotif. Especially, 22 kinase groups showed overall increased performance from AutoMotif in all three aspects (accuracy, precision, and recall).
From the individual performance variations, it is notable that the new method was more effective for the kinase groups that contain relatively a small number of positive peptides. For the large kinase groups that contain more than 100 positive peptides, such as CDK-group, CK2group, PKA-group, PKC-alpha, MAPK1, PKC_group, Src, and CK2-alpha, only in three kinase groups (MAPK1, Src, and CK2-alpha) our new method showed the performance enhancement in all three aspects, while in a majority of small kinase groups that contain less than 100 positive peptides, the new method achieved significantly better performance than AutoMotif. This tendency of performance improvement depending on the number of positive peptides may have stemmed from the amount of information. If we have many positive peptides, we can easily recognize phosphorylation peptides since the search space of phosphorylation peptides is dense. On the other hand, if we do not have enough number of positive peptides, prediction methods based on the sequence information of positive peptides inevitably contain much noise, making the predictions unreliable. In this situation applying our new method effectively reduces noise and retrieves weak signals, resulting in high search sensitivity. In this regard we expect that we can apply the new method to other small PTM groups and equally achieve performance enhancement.
Moreover, we conducted 10-fold cross validation to measure the performance variation. The results are In figure, values in brackets represent number of windowed residues in peptides. From the figure we note that S combined with S profile (41 windowed residues) shows best performance. The fact remarks that considering long-range region is effective to identify phosphorylation peptides.
BMC Bioinformatics 2010, 11(Suppl 1):S10 http://www.biomedcentral.com/1471-2105/11/S1/S10 shown in Table 4. The mean accuracy, precision, and recall (0.93, 0.59 and 0.43, respectively) were comparable or slightly degraded from LOOCV of the new method, but still the performance was better than AutoMotif (LOOCV). The result indicates that the new method performs well in an independent query set.

Performance assessment with independent test set
We also benchmarked our results to six existing methods, GPS [14], KinasePhos [10], NetPhosK [6], PPSP [11], PredPhospho [7], and Scansite [22] by using independent test set created by Wan et al. [19]. The test set consists of four main kinase families (CDK, CK2, PKA, and PKC) and  The area of under ROC curves are shown in Table 5. In table, top two ranked methods are bolded. We noted that our method was ranked as top or second in four kinases families. Our method showed better performance than GPS, KinasePhos, PredPhospho, and Scansite. Especially, compared to GPS designed to search phosphorylation peptides through sequence similarity using BLOSUM62 matrix, the performance improvement in our method indicates that addressing evolutionary information could be helpful to identity phosphorylation peptides. Furthermore, compared to PPSP our method shows significant improved results in CK2 family (6% increased) but similar performance in other kinase families. To conclude it is hard to say new method is outstanding compared to other methods but our method is generally effective to recognize phosphorylation peptides in four main kinase families.

Web server construction
We constructed the web server to provide an easy access to our new phosphorylation site prediction method. Our web server, PostMod (prediction of Post-translational Modification sites) was implemented with python CGI scripts and html. Currently we provide prediction of phosphorylation sites for 48 kinases but our future direction is to apply the new method to other kinds of PTM sites. Figure 3 shows the sever input (A) and output pages (B). The search sequence can be submitted by pasting it into the text box. The server allows a user to select one of 48 different kinase types. For the query sequence the server searches all putative phosphorylation sites and generates several peptides. Meanwhile, the server generates PSSM and PSFM matrix of the query sequence by executing blastpgp 2.2.20. After that, the sever compares sequence similarity with the peptide sequences in the reference set which are phosphorylation and non-phosphorylation sites of the corresponding kinase type from phospho.ELM database. Figure 3(B) is the output page for AMPK beta-1 chain (UniProt id is P80386). All putative 36 phosphorylation sites (S, T) were shown together with confidence scores. Three predicted phosphorylation sites were bolded. The confidence score is the fraction of phosphorylation peptides among the top β hits (see Method section). We set the threshold value for phosphorylation sites as 0.5 of confidence score. User can receive search result via email. The web server is available through our website [23].

Conclusion
The present method is remarkable in the sense that it is simple and computationally costless, and yet shows the outstanding performance improvement for the various kinds of kinases. We showed that when we combined the BLOSUM62 matrix-based similarity measure and the profile-profile alignment scores, the recognition results were significantly improved. Moreover, applying our noise-reducing system by exploiting indirect relationships effectively eliminated noise, and thereby increased the overall performance. The overall performance enhancement on 48 kinds of different kinases suggests that our method is generally applicable to other types of PTMs. Furthermore, it is expected that combining our method with better similarity methods would achieve higher accuracy for finding phosphorylation sites.
Performance degradation in a conventional sequence similarity measures is mainly originated from improper similarity scoring system, which gives higher scores to unrelated peptides, producing many false positives. The best solution may be developing new features which well discriminate positives from negatives. If we do not have such powerful features, we need to concentrate on removing noise. In this manner we addressed a concept of indirect relationships, and we showed that considering indirect relationships can be a powerful tool to eliminate the false positives.
To conclude, applying the new method produces good results without need of sophisticated machine learning techniques in detecting phosphorylation sites. Furthermore, we expect that applying our new method to other kinds of biological analysis would achieve high performance improvement.

Figure 3
The PostMod server input page (A) and result page (B). In input page, search sequence is pasted into the text box and one of 48 kinase types is selected. The default kinase type is AMPK_group. The example sequence is AMPK beta-1 chain (UniProt id is P80386). The search result of phosphorylation sites are shown in (B). There are 36 candidate phosphorylation sites (S, T) and three of them are recognized as phosphorylation sites (bolded line).