MotifAll: discovering all phosphorylation motifs
 Zengyou He^{1},
 Can Yang^{2},
 Guangyu Guo^{3},
 Ning Li^{3} and
 Weichuan Yu^{2}
https://doi.org/10.1186/1471210512S1S22
© He et al; licensee BioMed Central Ltd. 2011
Published: 15 February 2011
Abstract
Background
Phosphorylation motifs represent common patterns around the phosphorylation site. The discovery of such motifs reveals the underlying regulation mechanism and facilitates the prediction of unknown phosphorylation events. To date, large amounts of phosphorylation data have been gathered, making it possible to perform substrate-driven motif discovery using data mining techniques.
Results
We describe an algorithm called MotifAll that is able to efficiently identify all statistically significant motifs. The proposed method exploits a support constraint to reduce the search space and avoid generating random artifacts. As the number of phosphorylated peptides is far smaller than that of unphosphorylated ones, we divide the mining process into two stages: the first stage generates candidates from the set of phosphorylated sequences using only the support constraint, and the second stage tests the statistical significance of each candidate using the odds ratio derived from the whole data set. Experimental results on real data show that MotifAll outperforms current algorithms in terms of both effectiveness and efficiency.
Conclusions
MotifAll is a useful tool for discovering statistically significant phosphorylation motifs. Source code and data sets are available at: http://bioinformatics.ust.hk/MotifAll.rar.
Background
Protein phosphorylation is an essential post-translational modification event for the regulation and maintenance of most biological processes. Recent advances in high-throughput methods such as tandem mass spectrometry enable rapid and direct discovery of hundreds of phosphorylation sites in a single experiment [1]. The availability of large numbers of phosphorylation sites makes it possible to perform phosphorylation motif finding using data mining techniques.
According to [2] and [3], phosphorylation motif discovery is defined as finding a set of motifs that appear more often in the set of phosphorylated peptides P than in the set of unphosphorylated peptides N. That means each phosphorylation motif is “overexpressed” in P. Here all peptides have the fixed length L and they are aligned on the phosphorylated residue. We often call P the foreground set and N the background set.
The discovery of phosphorylation motifs is computationally challenging. Suppose a motif resides in peptides of length L. This motif is a consensus sequence (including the phosphorylation site) that consists of conserved positions and wildcard positions that can match any residue. Since each of the L – 1 positions flanking the phosphorylation site is either one of the 20 amino acids or a wildcard, and the pattern consisting only of wildcards is excluded, the number of all possible phosphorylation motifs is 21^{L–1} – 1. Though the length L is usually fixed to a small value (e.g., 13) in previous studies [2, 3], it is still infeasible to perform an exhaustive search over all potential phosphorylation motifs. Besides, it is also unclear which metric is most suitable for measuring the statistical significance of a motif.
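To make the size of this search space concrete, the count can be evaluated directly. The short sketch below is only an illustration; the function name `count_possible_motifs` and the default alphabet size of 20 amino acids are our own choices for the example.

```python
def count_possible_motifs(peptide_length: int, alphabet_size: int = 20) -> int:
    """Count non-trivial motifs over peptides of the given length."""
    flanking_positions = peptide_length - 1  # the phosphorylation site itself is fixed
    # Each flanking position is one of `alphabet_size` residues or a wildcard;
    # subtract 1 to exclude the motif made up of wildcards only.
    return (alphabet_size + 1) ** flanking_positions - 1


if __name__ == "__main__":
    for length in (7, 13, 21):
        print(length, count_possible_motifs(length))
```

For L = 13 this evaluates to roughly 7.4 × 10^{15} candidate motifs, which is why an exhaustive scan is impractical.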
To efficiently find phosphorylation motifs, two heuristic methods have been proposed. The MotifX method [2] is a greedy algorithm that reports motifs in an iterative manner. In each round, the single most statistically significant motif is detected. The peptides matching the motif identified in one round are removed prior to the next round of searching. This procedure repeats until no significant motifs can be found. The MoDL method [3] is an optimization-based algorithm, which formulates the motif-finding problem as the minimization of description length. In other words, MoDL tries to optimize the expressiveness of a set of motifs rather than quantifying the significance of a single motif. However, both MotifX and MoDL can only find a small subset of phosphorylation motifs. Neither can guarantee finding all statistically significant motifs, so some important phosphorylation patterns may remain unknown to biologists. Furthermore, MotifX is limited to discovering non-overlapping motifs, while MoDL usually reports at most three motifs.
In this paper, we present a new algorithm called MotifAll for the discovery of all statistically significant phosphorylation motifs. MotifAll uses the odds ratio to quantify the overexpressiveness of each motif in the set of phosphorylated peptides against the background set. To avoid exhaustive search, we impose a support constraint for each motif on the set of phosphorylated peptides. The use of the support constraint enables us to borrow ideas from association rule mining [4] to generate and prune candidates in a level-wise manner before calculating the odds ratio.
To demonstrate the superiority of the MotifAll method, we conduct experimental studies using the PhosPhAt database 3.0 of Arabidopsis phosphorylation sites [5, 6]. MotifAll finds more significant phosphorylation motifs than MotifX and MoDL. Furthermore, it is very fast and is able to finish the mining of large data sets within a reasonable time period.
The rest of the paper is organized as follows: Section 2 presents the MotifAll algorithm. Section 3 shows the experimental results. Section 4 gives some discussions and Section 5 concludes the paper.
Methods
There are two critical issues in phosphorylation motif finding. The first is how to measure the overexpressiveness of each motif. The second is how to perform the search in an efficient manner. In the proposed MotifAll method, we use the odds ratio to evaluate whether a candidate motif is overexpressed. To search efficiently, we use the following two strategies.

We impose a support constraint on each candidate motif. Here the support for a motif is defined as the percentage of phosphorylated peptides that match this motif. The notion of support is widely used in association rule mining [4]. A motif is said to be frequent if its support is no less than a given threshold. The aim of introducing the support constraint is twofold: on the one hand, it can filter out infrequent motifs that may correspond to random artifacts; on the other hand, it makes it possible to generate and prune motifs in a level-wise manner so as to avoid brute-force search.

We divide the mining process into two stages. In the first stage, we perform frequent motif finding using only the data set P, since the number of phosphorylated peptides is much smaller than that of unphosphorylated ones. In the second stage, we collect support information using the data set N to calculate the statistical significance of the candidate motifs from the first stage. We report all phosphorylation motifs whose p-values are no larger than a user-specified significance threshold.
Odds ratio and statistical significance score
The odds of an event is the probability that this event occurs divided by the probability that it does not occur. The odds ratio is defined as the ratio of the odds of an event in one group to the odds in the complementary group [7].
Table 1: A contingency table for a phosphorylation motif m
     matches m   does not match m
P    c_{00}      c_{01}
N    c_{10}      c_{11}
An odds ratio of 1 means that the target motif is equally likely to exist in both P and N. An odds ratio greater than 1 indicates that this motif is more likely to appear in the set P.
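Then, the z-value
Z(m) = LOR(m)/SE(m), (4)
where LOR(m) is the log odds ratio of motif m and SE(m) is its standard error, approximately follows a standard normal distribution.
Finally, we can calculate the p-value from Z(m) to assess the statistical significance of each motif.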
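As an illustration, the computation leading to Eq. (4) can be sketched from the four cell counts of the contingency table. The sketch assumes the standard textbook definitions OR = c_{00}c_{11}/(c_{01}c_{10}), LOR = ln OR and SE = sqrt(1/c_{00} + 1/c_{01} + 1/c_{10} + 1/c_{11}). The one-sided (upper-tail) p-value and the 0.5 correction for empty cells are assumptions made for this example, not details stated in the text, and the function name is hypothetical.

```python
import math


def motif_significance(c00, c01, c10, c11, correction=0.5):
    """Return (odds_ratio, z, p_value) for a 2x2 contingency table.

    c00 / c01: peptides in P that match / do not match the motif.
    c10 / c11: peptides in N that match / do not match the motif.
    """
    cells = [float(c) for c in (c00, c01, c10, c11)]
    if min(cells) == 0:
        # Assumption: add a small constant to every cell so that the
        # logarithm and the reciprocals below stay finite.
        cells = [c + correction for c in cells]
    a, b, c, d = cells
    odds_ratio = (a * d) / (b * c)
    log_odds_ratio = math.log(odds_ratio)
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = log_odds_ratio / standard_error  # Eq. (4)
    # Assumption: one-sided test, since only over-representation in P matters.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return odds_ratio, z, p_value


if __name__ == "__main__":
    print(motif_significance(50, 50, 10, 90))
```

A motif that is equally common in both sets gives an odds ratio of 1, a z-value of 0 and a p-value of 0.5 under this one-sided convention.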
Frequent motif mining
Given a set of phosphorylated peptides P and a user-specified support threshold s, the objective of frequent motif finding is to discover all motifs whose supports are no less than s. The support for a motif is the percentage of phosphorylated peptides in P that match this motif. In other words, a motif m is frequent if and only if its cell count c_{00} in Table 1 satisfies c_{00} ≥ |P|·s, with |·| denoting the size of a set. Note that a similar occurrence constraint is utilized in the MotifX algorithm [2], which represents the minimal number of phosphorylated peptides needed to match a residue/position pair in its greedy search procedure. There are two fundamental differences between our support constraint and their occurrence constraint:

The support constraint is applied to the entire motif rather than to a single residue/position pair.

The support constraint can be used to prune the search space besides preventing the generation of random artifacts.
One may argue that such a support constraint will result in the loss of some less frequent motifs that are statistically significant. We explain its rationale from two perspectives. First, since a motif describes a phosphorylation pattern, it should be applicable to many substrates; otherwise, such a pattern might be a random artifact due to the limited number of known phosphorylated peptides. Second, we can use a very small support threshold in the mining process to avoid missing infrequent motifs. In the extreme case, setting s = 1/|P| guarantees completeness. Certainly, this may result in the report of many meaningless motifs. From this viewpoint, we can regard the support threshold as a parameter for controlling the tradeoff between completeness and false positives.
More importantly, the use of the support constraint enables us to exploit a level-wise pruning strategy so as to reduce the search space. This idea has been widely used since the introduction of the Apriori algorithm for association rule mining [4]. The application of this strategy to frequent motif finding is rather straightforward. For the sake of completeness, we describe the mining procedure briefly.
In this paper, we focus on the discovery of pattern-based phosphorylation motifs, i.e., consensus sequences that consist of either conserved positions or wildcard positions (denoted by “.”) that can match any amino acid. Each motif has a single phosphorylated residue, which is denoted by an underlined character (S, T or Y).
We define the size of a motif as the number of conserved positions in that motif, not counting the phosphorylated residue. We also call a motif of size k a k-motif. We say that a k-motif is a generalization of another motif if both have the same conserved residues at the k conserved positions of the former. For instance, “D...S..P” and “D.Y.S..P” are a 2-motif and a 3-motif, respectively, and the first motif is a generalization of the second one.
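These definitions can be illustrated with a few helper functions. This is a minimal sketch, assuming that a motif is stored as a plain string of the same length as the peptides, with “.” as the wildcard and the index of the phosphorylation site supplied by the caller; the function names are hypothetical.

```python
WILDCARD = "."


def motif_size(motif: str, site: int) -> int:
    """Number of conserved positions, not counting the phosphorylation site."""
    return sum(1 for i, ch in enumerate(motif) if i != site and ch != WILDCARD)


def matches(motif: str, peptide: str) -> bool:
    """True if every conserved motif position equals the peptide residue."""
    return len(motif) == len(peptide) and all(
        m == WILDCARD or m == r for m, r in zip(motif, peptide)
    )


def is_generalization(general: str, specific: str) -> bool:
    """True if every conserved position of `general` is conserved, with the
    same residue, in `specific`. This is the same test as matching a peptide."""
    return matches(general, specific)


if __name__ == "__main__":
    print(motif_size("D...S..P", site=4))              # size of the first example motif
    print(is_generalization("D...S..P", "D.Y.S..P"))   # first generalizes second
```

With the example from the text, the first motif has size 2, the second has size 3, and the generalization check holds in one direction only.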
Furthermore, we use F_{k} to denote the set of all frequent k-motifs and Z_{k} to denote the set of all candidate (potentially frequent) k-motifs. At each level k, the search performs two steps:
1. Only the frequent motifs from F_{k–1} are used to generate Z_{k}, since a k-motif cannot be frequent if one of its (k – 1)-motif generalizations is infrequent. Therefore, the search space for k-motifs is reduced. An example search tree is given in Fig. 1.
2. The set P is scanned to count the support of the candidates in Z_{k}. Infrequent motifs are deleted from Z_{k} to generate F_{k}.
We stop the search when F_{ k } is empty and return F = ∪F_{ k } as the final result.
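The level-wise procedure can be sketched as follows. This is an Apriori-style illustration under our own assumptions rather than the released implementation: a motif is represented as a frozenset of (position, residue) pairs, the phosphorylation site is taken to be the central position of equal-length peptides, and candidates are generated by joining pairs of frequent motifs from the previous level.

```python
from collections import Counter
from itertools import combinations


def frequent_motifs(peptides, min_support):
    """Return {motif: count} for every motif matched by at least
    min_support * len(peptides) peptides."""
    if not peptides:
        return {}
    centre = len(peptides[0]) // 2  # assumed phosphorylation site
    min_count = min_support * len(peptides)

    # Level 1: count every (position, residue) pair outside the centre.
    counts = Counter(
        frozenset([(i, r)])
        for pep in peptides
        for i, r in enumerate(pep)
        if i != centre
    )
    level = {m: c for m, c in counts.items() if c >= min_count}  # F_1

    result = {}
    k = 1
    while level:  # stop when F_k is empty
        result.update(level)
        # Step 1: build Z_{k+1} from F_k only.
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            positions = {pos for pos, _ in union}
            if len(union) == k + 1 and len(positions) == k + 1:
                # Prune unless every k-motif generalization is frequent.
                if all(union - {pair} in level for pair in union):
                    candidates.add(union)
        # Step 2: scan P to count support and keep the frequent candidates.
        counts = Counter()
        for pep in peptides:
            for cand in candidates:
                if all(pep[i] == r for i, r in cand):
                    counts[cand] += 1
        level = {m: c for m, c in counts.items() if c >= min_count}
        k += 1
    return result


if __name__ == "__main__":
    demo = ["ASP", "RSP", "ASP", "GSD"]
    for motif, count in frequent_motifs(demo, 0.5).items():
        print(sorted(motif), count)
```

The pairwise join is quadratic in the size of F_{k}; it is kept here for brevity, and a production implementation would likely use a more careful candidate generation scheme.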
MotifAll algorithm
The MotifAll algorithm consists of two steps:
1. Find the set of all frequent motifs F from P using the level-wise search method introduced in the previous subsection.
2. Scan the set N to calculate the log odds ratio for each motif m ∈ F. Those motifs whose p-values are no greater than the user-specified significance threshold θ are returned to the user.
Note that the statistical evaluation of motifs can be done in various ways. Thus, we can also use other measures in MotifAll by simply replacing the log odds ratio with the new significance measure in step 2. The choice of significance evaluation measure will not change the completeness property of our algorithm.
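The second step, together with the option of swapping in another significance measure, can be sketched as below. The sketch takes the frequent motifs as given (as dotted strings) and treats the significance measure as a function argument. The one-sided log odds ratio test, the 0.5 correction for empty cells, and all function names are assumptions made for illustration.

```python
import math


def matches(motif: str, peptide: str) -> bool:
    """True if every conserved (non-'.') motif position equals the peptide residue."""
    return len(motif) == len(peptide) and all(
        m == "." or m == r for m, r in zip(motif, peptide)
    )


def log_odds_p_value(c00, c01, c10, c11):
    """One-sided p-value of a log odds ratio z-test (assumed form of the test)."""
    cells = [float(c) for c in (c00, c01, c10, c11)]
    if min(cells) == 0:
        cells = [c + 0.5 for c in cells]  # assumption: correction for empty cells
    a, b, c, d = cells
    z = math.log((a * d) / (b * c)) / math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return 0.5 * math.erfc(z / math.sqrt(2))


def significant_motifs(frequent, foreground, background, theta,
                       p_value=log_odds_p_value):
    """Return {motif: p} for the frequent motifs whose p-value is <= theta.

    `p_value` can be replaced by any function of the four cell counts.
    """
    results = {}
    for motif in frequent:
        c00 = sum(matches(motif, pep) for pep in foreground)
        c10 = sum(matches(motif, pep) for pep in background)
        p = p_value(c00, len(foreground) - c00, c10, len(background) - c10)
        if p <= theta:
            results[motif] = p
    return results


if __name__ == "__main__":
    fg = ["ASP"] * 8 + ["GSD"] * 2
    bg = ["ASP"] * 10 + ["GSD"] * 90
    print(significant_motifs(["ASP", ".SD"], fg, bg, theta=1e-3))
```

Because the frequent set F is fixed before any significance measure is applied, replacing `p_value` changes which motifs are reported but not which candidates are examined.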
Results
Data
To test the performance of MotifAll, we use the PhosPhAt database 3.0 of Arabidopsis phosphorylation sites [5, 6] to construct the set of phosphorylated peptides P. Note that we only utilize unambiguous site identifications in the construction process. The length of each extracted peptide is 21, with a measured phosphorylated residue in the 11th position. To generate the background data set N, we first extract all 21-mers with a phosphorylatable residue (S, T or Y) in the center position from the TAIR7 protein database. Then, we remove all peptides already in P. The remaining peptides form the background data.
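The background construction described above can be sketched as a sliding-window extraction. This is only an illustration: the function names are hypothetical, windows that would run past either end of a protein are skipped, and duplicate windows are kept. Neither of these two choices is specified in the text.

```python
def centred_windows(protein: str, residue: str, length: int = 21):
    """Yield every window of the given odd length whose central residue is `residue`."""
    half = length // 2
    # Assumption: sites closer than `half` residues to a terminus are skipped.
    for i in range(half, len(protein) - half):
        if protein[i] == residue:
            yield protein[i - half : i + half + 1]


def build_background(proteins, foreground, residue, length=21):
    """Collect all centred windows and drop those already in the foreground set."""
    known = set(foreground)
    return [
        window
        for protein in proteins
        for window in centred_windows(protein, residue, length)
        if window not in known
    ]


if __name__ == "__main__":
    # Toy example with 5-mers instead of 21-mers.
    print(build_background(["MKRSPTASDLL"], ["KRSPT"], "S", length=5))
```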
One may worry that such a background data generation procedure will hinder the extraction of meaningful motifs, since N contains many peptides that can be phosphorylated but have not been identified so far. We would like to point out that these potentially phosphorylated peptides are overwhelmed by the truly non-phosphorylated peptides in N. Thus, this data generation procedure does not change the characteristics of the background set N. Overall, we generate three groups of data for serine (denoted by PhAtS), threonine (denoted by PhAtT) and tyrosine (denoted by PhAtY), respectively. Their characteristics are the following:

PhAtS: It contains 2734 foreground sequences and 982050 background sequences.

PhAtT: It contains 415 foreground sequences and 550574 background sequences.

PhAtY: It contains 80 foreground sequences and 304344 background sequences.
Performance comparison
In the experiments, we compare our MotifAll algorithm against the MotifX algorithm [2] and the MoDL algorithm [3]. Note that we did not include other motif finding algorithms such as TEIRESIAS [8] in the comparison. This is because these algorithms are not designed for phosphorylation motif discovery and it has been shown that MoDL [3] is superior to these methods.
Firstly, our algorithm is able to find more statistically significant phosphorylation motifs than existing algorithms. This is because our method has a theoretical guarantee on the completeness of results under a given parameter setting. In particular, almost all reported motifs of MotifX and MoDL are included in the motif set of MotifAll. There are two exceptions: one is the motif “T D” detected by MotifX and the other is the motif “K . . T” found by MoDL. After checking these two motifs carefully, we found that the p-value of “T D” is 1.1 × 10^{–5} and the p-value of “K . . T” is 2.4 × 10^{–3} according to our significance test. Neither motif is statistically significant at the p-value threshold of 10^{–6}. Secondly, increasing the support threshold for MotifAll generates a motif set that is very similar to that of MotifX and MoDL. This is clearly visible in PhAtT and PhAtY. To further check if this is true, we also perform motif finding on PhAtS at a support threshold of 15%. We obtain three motifs under this setting: “R . . S”, “S . S” and “S P”. This result set is almost identical to that of MotifX and MoDL listed in Fig. 3. This means that MotifAll not only can find more useful motifs but also is capable of serving as a substitute for existing algorithms in a flexible manner.
Finally, we can generate more motifs using MotifX by lowering the occurrence threshold. For instance, if the occurrence threshold is set to 20 on PhAtS, MotifX will return 31 motifs. To test whether MotifAll can still find these motifs, we use an approximately equivalent setting of s = 1% to conduct the experiment. In total, MotifAll detects 1153 significant motifs, which include all 31 motifs reported by MotifX. The detailed results are available at http://bioinformatics.ust.hk/MotifAll.rar.
Effect of support constraint
The running efficiency test in Fig. 5 shows that increasing the support threshold also decreases the running time. More importantly, MotifAll can finish the motif finding procedure within 30 seconds, which is faster than MotifX and MoDL. This means MotifAll is capable of discovering motifs from large data sets very efficiently.
Discussion
The discovery of phosphorylation motifs is a computationally challenging problem. This paper tries to resolve it through the introduction of MotifAll algorithm. Here we discuss several related problems that may need to be further investigated.

Problem formulation: One critical question is how to formulate the computational problem for phosphorylation motif discovery. Currently, it is cast either as a search problem or as an optimization problem. MotifX falls into the first category while MoDL belongs to the second category. In this paper, we opt for the first category since such a formulation guarantees that each identified motif is statistically significant. However, it may report more motifs than necessary, as statistical significance does not necessarily agree with the biological interpretation. In this regard, the optimization-based formulation has the merit of generating a concise motif set. Therefore, a better problem formulation incorporating the biological knowledge of phosphorylation is still needed.

Motif evaluation: In MotifAll, we use two measures for motif assessment: statistical significance and support. Though our initial motivation for introducing the support constraint was to reduce the search space, we later found that it is also of practical importance in motif evaluation. Without the support constraint, it is almost impossible to obtain a concise motif set using only the significance test. From this perspective, we strongly recommend the adoption of support as one standard measure for phosphorylation motif evaluation in future studies.
Furthermore, we also need to develop new evaluation measures, since the combination of significance threshold and support is still insufficient for separating true phosphorylation motifs from false ones in some cases. The general question of developing an effective evaluation measure is open and needs more investigation.

Motif utilization: Experimental methods to identify phosphorylation sites are time consuming, labor intensive and expensive [9, 10]. The identified motifs can be used to predict potential phosphorylation sites before biological validation. For instance, the recently released ScanX method [11], which is built on MotifX, is one such representative. The capability of finding more significant motifs using MotifAll makes it possible to build more accurate classifiers for better prediction.
Conclusions
We introduced the MotifAll algorithm for finding phosphorylation motifs. MotifAll can identify all statistically significant motifs under a given parameter setting. It is also fast: it is able to find hundreds of meaningful motifs from millions of peptide sequences within one minute on a personal computer. Our experimental results show that it outperforms existing phosphorylation motif discovery algorithms.
We have shown that MotifAll is able to find more phosphorylation motifs than existing algorithms. However, it is very expensive and difficult to perform biological validation. To measure the correctness of the identified motifs that are not reported by other algorithms, one alternative strategy is to perform a permutation test so as to control the false positive rate. Unfortunately, the permutation test is a very time-consuming procedure since it needs to execute MotifAll many times. To address this issue, our future work will focus on the design and implementation of fast algorithms that support large-scale permutation tests.
Declarations
Acknowledgements
We thank Dr. Waltraud X. Schulze for providing us with the PhosPhAt 3.0 database. This work was partially supported by the Natural Science Foundation of China under Grant No. 61003176, the Fundamental Research Funds for the Central Universities of China (DUT10JR05 and DUT10ZD110), the General Research Fund 661408 and 621707 from the Hong Kong Research Grant Council and the Research Proposal Competition Awards RPC07/08.EG25 and RPC10.EG04 from the Hong Kong University of Science and Technology.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/12?issue=S1.
References
 1. Amanchy R, Periaswamy B, Mathivanan S, Reddy R, Tattikota S, Pandey A: A curated compendium of phosphorylation motifs. Nature Biotechnology 2007, 25(3):285–286. 10.1038/nbt0307285
 2. Schwartz D, Gygi SP: An iterative statistical approach to the identification of protein phosphorylation motifs from large-scale data sets. Nature Biotechnology 2005, 23(11):1391–139. 10.1038/nbt1146
 3. Ritz A, Shakhnarovich G, Salomon A, Raphael B: Discovery of phosphorylation motif mixtures in phosphoproteomics data. Bioinformatics 2009, 25: 14–21. 10.1093/bioinformatics/btn569
 4. Agrawal R, Srikant R: Fast algorithms for mining association rules. Proc. of VLDB’94 1994, 487–499.
 5. Heazlewood JL, Durek P, Hummel J, Selbig J, Weckwerth W, Walther D, Schulze WX: PhosPhAt: a database of phosphorylation sites in Arabidopsis thaliana and a plant-specific phosphorylation site predictor. Nucleic Acids Research 2007, 36: D1015–D1021. 10.1093/nar/gkm812
 6. Durek P, Schmidt R, Heazlewood J, Jones A, MacLean D, Nagel A, Kersten B, Schulze W: PhosPhAt: the Arabidopsis thaliana phosphorylation site database. An update. Nucleic Acids Research 2010, 38: D828–D834. 10.1093/nar/gkp810
 7. Wassertheil-Smoller S: Biostatistics and epidemiology. Springer; 2004.
 8. Rigoutsos I, Floratos A: Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics 1998, 14: 55–67. 10.1093/bioinformatics/14.1.55
 9. Xue Y, Li A, Wang L, Feng H, Yao X: PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory. BMC Bioinformatics 2006, 7: 163. 10.1186/147121057163
 10. Jung I, Matsuyama A, Yoshida M, Kim D: PostMod: sequence based prediction of kinase-specific phosphorylation sites with indirect relationship. BMC Bioinformatics 2010, 11(Suppl 1):S10. 10.1186/1471210511S1S10
 11. Schwartz D, Chou M, Church G: Predicting protein post-translational modifications using meta-analysis of proteome scale data sets. Molecular & Cellular Proteomics 2009, 8(2):365–379. 10.1074/mcp.M800332MCP200
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.