Motif-All: discovering all phosphorylation motifs
© He et al; licensee BioMed Central Ltd. 2011
- Published: 15 February 2011
Phosphorylation motifs represent common patterns around the phosphorylation site. The discovery of such motifs reveals the underlying regulation mechanisms and facilitates the prediction of unknown phosphorylation events. To date, large amounts of phosphorylation data have been gathered, making it possible to perform substrate-driven motif discovery using data mining techniques.
We describe an algorithm called Motif-All that is able to efficiently identify all statistically significant motifs. The proposed method exploits a support constraint to reduce the search space and to avoid generating random artifacts. As the number of phosphorylated peptides is far smaller than that of unphosphorylated ones, we divide the mining process into two stages: the first stage generates candidates from the set of phosphorylated sequences using only the support constraint, and the second stage tests the statistical significance of each candidate using the odds ratio derived from the whole data set. Experimental results on real data show that Motif-All outperforms current algorithms in terms of both effectiveness and efficiency.
Motif-All is a useful tool for discovering statistically significant phosphorylation motifs. Source code and data sets are available at: http://bioinformatics.ust.hk/MotifAll.rar.
- Significant Motif
- Association Rule Mining
- Support Threshold
- Frequent Motif
- Candidate Motif
Protein phosphorylation is an essential post-translational modification event for the regulation and maintenance of most biological processes. Recent advances in high-throughput methods such as tandem mass spectrometry enable rapid and direct discovery of hundreds of phosphorylation sites in a single experiment. The availability of large numbers of phosphorylation sites makes it possible to perform phosphorylation motif finding using data mining techniques.
According to  and , phosphorylation motif discovery is defined as finding a set of motifs that appear more often in the set of phosphorylated peptides P than in the set of unphosphorylated peptides N. That means each phosphorylation motif is “over-expressed” in P. Here all peptides have the fixed length L and they are aligned on the phosphorylated residue. We often call P the foreground set and N the background set.
The discovery of phosphorylation motifs is computationally challenging. Suppose a motif resides in peptides of length L. This motif is a consensus sequence (including the phosphorylation site) that consists of conserved positions and wild-card positions that can match any residue. Each of the L − 1 positions other than the phosphorylation site is either one of the 20 amino acids or a wild card, and the pattern consisting only of wild cards is excluded, so the number of all possible phosphorylation motifs is $21^{L-1} - 1$. Though the length L is usually fixed to a small value (e.g., 13) in previous studies [2, 3], it is still infeasible to check all potential phosphorylation motifs by exhaustive search. Besides, it is also unclear which metric is most suitable for measuring the statistical significance of a motif.
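To give a sense of scale, the count above can be evaluated directly. The following is a minimal sketch; the function name `count_possible_motifs` is hypothetical and is not part of the Motif-All distribution.

```python
def count_possible_motifs(peptide_length: int, alphabet_size: int = 20) -> int:
    """Number of distinct motifs over peptides of the given length.

    Each of the (peptide_length - 1) positions around the phosphorylation
    site is either one of `alphabet_size` residues or a wild card, and the
    pattern made only of wild cards is excluded.
    """
    choices_per_position = alphabet_size + 1  # 20 amino acids + wild card
    return choices_per_position ** (peptide_length - 1) - 1


if __name__ == "__main__":
    # Peptide lengths mentioned in the text: 13 (earlier studies) and 21 (this paper).
    for length in (13, 21):
        print(length, count_possible_motifs(length))
```

For L = 13 the count is already on the order of 7 × 10^15, which is why an exhaustive scan is impractical.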
To efficiently find phosphorylation motifs, two heuristic methods have been proposed. The Motif-X method is a greedy algorithm that reports motifs in an iterative manner. In each round, the single most statistically significant motif is detected. The peptides matching the motif identified in one round are removed prior to the next round of searching. This procedure repeats until no significant motif can be found. The MoDL method is an optimization-based algorithm, which formulates the motif-finding problem as the minimization of description length. In other words, MoDL tries to optimize the expressiveness of a set of motifs rather than quantifying the significance of a single motif. However, both Motif-X and MoDL can only find a small subset of phosphorylation motifs. Neither guarantees finding all statistically significant motifs, so some important phosphorylation patterns remain unknown to biologists. Furthermore, Motif-X is limited to discovering non-overlapping motifs, while MoDL usually reports at most three motifs.
In this paper, we present a new algorithm called Motif-All for the discovery of all statistically significant phosphorylation motifs. Motif-All uses the odds ratio to quantify the over-expressiveness of each motif in the set of phosphorylated peptides against the background set. To avoid exhaustive search, we impose a support constraint for each motif on the set of phosphorylated peptides. The use of a support constraint enables us to borrow ideas from association rule mining to generate and prune candidates in a level-wise manner before calculating the odds ratio.
To demonstrate the advantages of the Motif-All method, we conduct experimental studies using the PhosPhAt database 3.0 of Arabidopsis phosphorylation sites [5, 6]. Motif-All finds more significant phosphorylation motifs than Motif-X and MoDL. Furthermore, it is very fast and is able to finish the mining of large data sets within a reasonable time period.
The rest of the paper is organized as follows: Section 2 presents the Motif-All algorithm. Section 3 shows the experimental results. Section 4 gives some discussions and Section 5 concludes the paper.
There are two critical issues in phosphorylation motif finding. The first is how to measure the over-expressiveness of each motif. The second is how to perform the search in an efficient manner. In the proposed Motif-All method, we use the odds ratio to evaluate whether a candidate motif is over-expressed. To search efficiently, we use the following two strategies.
We impose a support constraint on each candidate motif. Here the support of a motif is defined as the percentage of phosphorylated peptides that match this motif. The notion of support is widely used in association rule mining. A motif is said to be frequent if its support is no less than a given threshold. The aim of introducing the support constraint is two-fold: on the one hand, it filters out infrequent motifs that may correspond to random artifacts; on the other hand, it makes it possible to generate and prune motifs in a level-wise manner so as to avoid brute-force search.
We divide the mining process into two stages. In the first stage, we perform frequent motif finding using only the data set P since the number of phosphorylated peptides is much smaller than that of unphosphorylated ones. In the second stage, we collect support information using the data set N to calculate the statistical significance of those candidate motifs from the first stage. We will report all phosphorylation motifs whose p-values are no larger than a user-specified significance threshold.
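The support of a motif, as defined above, can be illustrated with a short sketch. The helper names (`matches`, `support`, `is_frequent`) and the toy 7-residue peptides are assumptions made for illustration only; the paper itself works with 21-residue peptides.

```python
from typing import Iterable

WILDCARD = "."  # wild-card symbol, matches any residue


def matches(motif: str, peptide: str) -> bool:
    """True if every conserved position of `motif` equals the peptide residue."""
    if len(motif) != len(peptide):
        return False
    return all(m == WILDCARD or m == p for m, p in zip(motif, peptide))


def support(motif: str, peptides: Iterable[str]) -> float:
    """Fraction of the given peptides that match the motif."""
    peptide_list = list(peptides)
    if not peptide_list:
        return 0.0
    return sum(matches(motif, p) for p in peptide_list) / len(peptide_list)


def is_frequent(motif: str, peptides: Iterable[str], threshold: float) -> bool:
    """A motif is frequent if its support is no less than the threshold."""
    return support(motif, peptides) >= threshold


if __name__ == "__main__":
    # Toy foreground set, aligned on a central serine (illustrative data only).
    foreground = ["ARKSPLE", "GRRSPTD", "DEESDDE", "LRASLPQ"]
    print(support("...SP..", foreground))
    print(is_frequent(".R.S...", foreground, 0.6))
```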
Odds ratio and statistical significance score
The odds of an event is the probability that this event occurs divided by the probability that it does not occur. The odds ratio is defined as the ratio of the odds of an event in one group to the odds in the complementary group. Here the event is that a peptide matches the motif, and the two groups are P and N.
A contingency table for a phosphorylation motif
An odds ratio of 1 means that the target motif is equally likely to exist in both P and N. An odds ratio greater than 1 indicates that this motif is more likely to appear in the set P.
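Then, the z-value

$$Z(m) = \frac{\mathrm{LOR}(m)}{\mathrm{SE}(m)} \qquad (4)$$

follows a standard normal distribution, where LOR(m) is the log odds ratio of motif m and SE(m) is its standard error.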
Finally, we can calculate the p-value to assess the statistical significance of each motif.
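As an illustration of how the z-value in Eq. (4) leads to a p-value, the sketch below computes the odds ratio, its logarithm, a standard error and a one-sided p-value from the four counts of a 2 × 2 contingency table. Several points are assumptions rather than details taken from the paper: the standard error uses the common textbook formula for a log odds ratio (square root of the sum of reciprocal cell counts), 0.5 is added to every cell when a cell is empty, and the p-value is the upper tail of the standard normal distribution. The name `motif_significance` is hypothetical.

```python
import math
from typing import NamedTuple


class MotifStats(NamedTuple):
    odds_ratio: float
    log_odds_ratio: float
    standard_error: float
    z: float
    p_value: float


def motif_significance(match_p: int, nomatch_p: int,
                       match_n: int, nomatch_n: int) -> MotifStats:
    """Significance of a motif from its 2 x 2 contingency table.

    match_p / nomatch_p: peptides in P that match / do not match the motif.
    match_n / nomatch_n: peptides in N that match / do not match the motif.
    """
    cells = [float(match_p), float(nomatch_p), float(match_n), float(nomatch_n)]
    # Assumption: continuity correction so that empty cells do not divide by zero.
    if min(cells) == 0:
        cells = [c + 0.5 for c in cells]
    a, b, c, d = cells

    odds_ratio = (a / b) / (c / d)            # odds in P divided by odds in N
    log_odds_ratio = math.log(odds_ratio)
    # Assumption: textbook standard error of a log odds ratio.
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = log_odds_ratio / standard_error       # Eq. (4)
    # Assumption: one-sided test for over-expression in P (upper normal tail).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return MotifStats(odds_ratio, log_odds_ratio, standard_error, z, p_value)


if __name__ == "__main__":
    print(motif_significance(50, 50, 10, 90))
```

A motif equally common in P and N has an odds ratio of 1, a z-value of 0 and, under the one-sided convention assumed here, a p-value of 0.5.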
Frequent motif mining
Given a set of phosphorylated peptides P and a user-specified support threshold s, the objective of frequent motif finding is to discover all motifs whose supports are no less than s. The support of a motif is the percentage of phosphorylated peptides in P that match this motif. In other words, a motif m is frequent if and only if its cell count $c_{00}$ in Table 1 satisfies $c_{00} \ge |P|s$, with $|\cdot|$ denoting the size of a set. Note that a similar occurrence constraint is utilized in the Motif-X algorithm, where it represents the minimal number of phosphorylated peptides needed to match a residue/position pair in the greedy search procedure. There are two fundamental differences between our support constraint and their occurrence constraint:
The support constraint is applied to the entire motif rather than to a single residue/position pair.
The support constraint can be used to prune the search space besides preventing the generation of random artifacts.
One may argue that such a support constraint will result in the loss of some less frequent motifs that are statistically significant. We explain its rationale from two perspectives. First, since a motif describes a phosphorylation pattern, it should be applicable to many substrates; otherwise, the pattern might be a random artifact due to the limited number of known phosphorylated peptides. Second, we can use a very small support threshold in the mining process to avoid missing infrequent motifs. In the extreme case, setting s = 1/|P| guarantees completeness, since every motif that matches at least one peptide in P is then frequent. Certainly, this may result in reporting many meaningless motifs. From this viewpoint, we can regard the support threshold as a parameter for controlling the trade-off between completeness and false positives.
More importantly, the use of a support constraint enables us to exploit a level-wise pruning strategy so as to reduce the search space. This idea has been widely used since the introduction of the Apriori algorithm for association rule mining. The application of this strategy to frequent motif finding is rather straightforward. For the sake of completeness, we describe the mining procedure briefly.
In this paper, we focus on the discovery of pattern-based phosphorylation motifs, i.e., consensus sequences that consist of either conserved positions or wild-card positions (denoted by “.”) that can match any amino acid. Each motif has a single phosphorylated residue (S, T or Y), which is denoted with an underlined character.
We define the size of a motif as the number of conserved positions in that motif, not counting the phosphorylated residue. We also call a motif of size k a k-motif. We say that a k-motif is a generalization of another motif if the two motifs have the same conserved residues at the k conserved positions of the k-motif. For instance, “D...S..P” and “D.Y.S..P” are a 2-motif and a 3-motif, respectively, and the first motif is a generalization of the second one.
Furthermore, we use $F_k$ to denote the set of all frequent k-motifs and $Z_k$ to denote the set of all potential frequent k-motifs.
Only the frequent motifs from $F_{k-1}$ are used to generate $Z_k$, since a k-motif cannot be frequent if one of its (k − 1)-generalizations is infrequent. Therefore, the search space for k-motifs is reduced. An example search tree is given in Fig. 1.
The set P is scanned to count the support of the candidates in $Z_k$. Infrequent motifs are deleted from $Z_k$ to generate $F_k$.
We stop the search when $F_k$ is empty and return $F = \bigcup_k F_k$ as the final result.
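The level-wise procedure described above can be sketched in a few lines. This is a simplified Apriori-style illustration under stated assumptions, not the released Motif-All code: a motif is represented as a set of (position, residue) pairs that excludes the central phosphorylated residue, all peptides share one length and one central residue, the support threshold is positive, and the names `frequent_motifs`, `count_matches` and `render` are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Item = Tuple[int, str]    # (position, residue) of one conserved position
Motif = FrozenSet[Item]   # conserved positions only; the center is implicit


def count_matches(motif: Motif, peptides: List[str]) -> int:
    """Number of peptides that carry every conserved residue of the motif."""
    return sum(all(p[i] == r for i, r in motif) for p in peptides)


def frequent_motifs(peptides: List[str], min_support: float) -> Dict[int, Set[Motif]]:
    """Return {k: set of frequent k-motifs} found by level-wise search."""
    n = len(peptides)
    center = len(peptides[0]) // 2

    def keep_frequent(candidates: Iterable[Motif]) -> Set[Motif]:
        # One scan of the foreground set per level to count supports.
        return {m for m in candidates
                if count_matches(m, peptides) / n >= min_support}

    # Level 1: every off-center (position, residue) pair observed in the data.
    singles = {frozenset([(i, p[i])])
               for p in peptides for i in range(len(p)) if i != center}
    level = keep_frequent(singles)

    result: Dict[int, Set[Motif]] = {}
    k = 1
    while level:
        result[k] = level
        k += 1
        candidates: Set[Motif] = set()
        for a, b in combinations(level, 2):
            cand = a | b
            # Keep unions of size k that fix at most one residue per position.
            if len(cand) != k or len({i for i, _ in cand}) != k:
                continue
            # Prune: every (k - 1)-generalization must itself be frequent.
            if all(frozenset(sub) in level for sub in combinations(cand, k - 1)):
                candidates.add(cand)
        level = keep_frequent(candidates)
    return result


def render(motif: Motif, length: int, center_residue: str) -> str:
    """Write a motif in the dotted notation used in the text."""
    chars = ["."] * length
    chars[length // 2] = center_residue
    for i, r in motif:
        chars[i] = r
    return "".join(chars)


if __name__ == "__main__":
    toy = ["ARKSPLE", "GRRSPTD", "DEESDDE", "LRASLPQ"]  # illustrative data only
    for size, motifs in frequent_motifs(toy, 0.5).items():
        print(size, sorted(render(m, 7, "S") for m in motifs))
```

The pruning test is what keeps the candidate set small: a union is only counted against the data if all of its generalizations survived the previous level.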
1. Find the set of all frequent motifs F from P using the level-wise search method introduced in the previous subsection.
2. Scan the set N to calculate the log odds ratio for each motif m ∈ F. Those motifs whose p-values are no greater than the significance threshold θ are returned to the user.
Note that the statistical evaluation of motifs can be done in various ways. Thus, we can also use other measures in Motif-All by simply replacing the log odds ratio with the new significance measure in step 2. The choice of significance evaluation measure will not change the completeness property of our algorithm.
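The point that the significance measure is interchangeable can be made concrete by passing the measure to the second stage as a function. In the sketch below, the second stage counts matches in P and N and delegates the p-value to a caller-supplied function. The binomial tail probability used in the example is chosen purely to show a replacement measure and is not the measure used by Motif-All; all names (`significant_motifs`, `binomial_tail_p_value`) are hypothetical.

```python
from math import comb
from typing import Callable, Iterable, List, Tuple

# (matches in P, |P|, matches in N, |N|) -> p-value
PValueFn = Callable[[int, int, int, int], float]


def matches(motif: str, peptide: str) -> bool:
    """True if the peptide agrees with the motif at every non-wild-card position."""
    return len(motif) == len(peptide) and all(
        m == "." or m == p for m, p in zip(motif, peptide))


def binomial_tail_p_value(match_p: int, size_p: int,
                          match_n: int, size_n: int) -> float:
    """Example replacement measure: P(X >= match_p) for X ~ Bin(size_p, q),
    where q is the background match rate estimated from N."""
    q = match_n / size_n
    return sum(comb(size_p, i) * q ** i * (1 - q) ** (size_p - i)
               for i in range(match_p, size_p + 1))


def significant_motifs(frequent: Iterable[str],
                       foreground: Iterable[str],
                       background: Iterable[str],
                       p_value_fn: PValueFn,
                       theta: float) -> List[Tuple[str, float]]:
    """Second stage: keep the frequent motifs whose p-value is at most theta."""
    fg, bg = list(foreground), list(background)
    kept = []
    for motif in frequent:
        match_p = sum(matches(motif, x) for x in fg)
        match_n = sum(matches(motif, x) for x in bg)
        p = p_value_fn(match_p, len(fg), match_n, len(bg))
        if p <= theta:
            kept.append((motif, p))
    return kept


if __name__ == "__main__":
    fg = ["ARKSPLE", "GRRSPTD", "DEESDDE", "LRASLPQ"]  # illustrative data only
    bg = ["AAASAAA"] * 9 + ["AAASPAA"]
    print(significant_motifs(["...SP.."], fg, bg, binomial_tail_p_value, 0.1))
```

Because the first stage depends only on the support constraint, swapping `p_value_fn` changes which frequent motifs are reported but not which candidates are examined.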
To test the performance of Motif-All, we use the PhosPhAt database 3.0 of Arabidopsis phosphorylation sites [5, 6] to construct the set of phosphorylated peptides P. Note that we only utilize unambiguous site identifications in the construction process. The length of each extracted peptide is 21, with a measured phosphorylated residue in the 11th position. To generate the background data set N, we first extract all 21-mers with a phosphorylatable residue (S, T or Y) in the center position from the TAIR7 protein database. Then, we remove all peptides already in P. The remaining peptides form the background data.
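The background construction just described can be sketched as follows. This is an assumption-laden illustration rather than the authors' pipeline: protein sequences are taken to be plain strings, sites closer than 10 residues to either terminus are skipped because no full 21-mer exists there, duplicate windows are kept, and the function names are hypothetical.

```python
from typing import Iterable, List


def extract_windows(protein: str, residue: str, flank: int = 10) -> List[str]:
    """All (2 * flank + 1)-mers of `protein` centered on `residue`.

    Assumption: sites with fewer than `flank` residues on either side are skipped.
    """
    return [protein[i - flank:i + flank + 1]
            for i in range(flank, len(protein) - flank)
            if protein[i] == residue]


def build_background(proteins: Iterable[str], foreground: Iterable[str],
                     residue: str, flank: int = 10) -> List[str]:
    """Windows centered on `residue` that are not already in the foreground set."""
    known = set(foreground)
    background = []
    for protein in proteins:
        for window in extract_windows(protein, residue, flank):
            if window not in known:
                background.append(window)
    return background


if __name__ == "__main__":
    # Toy example with 5-mers (flank = 2); the paper uses 21-mers (flank = 10).
    print(build_background(["MKSPTSAYSL"], ["MKSPT"], "S", flank=2))
```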
One may worry that such a background data generation procedure will hinder the extraction of meaningful motifs, since N contains many peptides that can be phosphorylated but have not been identified so far. We would like to point out that these potentially phosphorylated peptides are far outnumbered by the truly non-phosphorylated peptides in N. Thus, this data generation procedure does not change the characteristics of the background set N. Overall, we generate three groups of data for serine (denoted by PhAtS), threonine (denoted by PhAtT) and tyrosine (denoted by PhAtY), respectively. Their characteristics are the following:
PhAtS: It contains 2734 foreground sequences and 982050 background sequences.
PhAtT: It contains 415 foreground sequences and 550574 background sequences.
PhAtY: It contains 80 foreground sequences and 304344 background sequences.
In the experiments, we compare our Motif-All algorithm against the Motif-X algorithm and the MoDL algorithm. Note that we did not include other motif finding algorithms such as TEIRESIAS in the comparison. This is because these algorithms are not designed for phosphorylation motif discovery and it has been shown that MoDL is superior to these methods.
Firstly, our algorithm is able to find more statistically significant phosphorylation motifs than existing algorithms. This is because our method has a theoretical guarantee on the completeness of results under a given parameter setting. In particular, almost all motifs reported by Motif-X and MoDL are included in the motif set of Motif-All. There are two exceptions: one is the motif “T D” detected by Motif-X and the other is the motif “K . . T” found by MoDL. After checking these two motifs carefully, we found that the p-value of “T D” is $1.1 \times 10^{-5}$ and the p-value of “K . . T” is $2.4 \times 10^{-3}$ according to our significance test. Neither motif is statistically significant at the p-value threshold of $10^{-6}$. Secondly, increasing the support threshold for Motif-All yields a motif set that is very similar to those of Motif-X and MoDL. This is clearly visible in PhAtT and PhAtY. To further check whether this is true, we also perform motif finding on PhAtS at a support threshold of 15%. We obtain three motifs under this setting: “R . . S”, “S . S” and “S P”. This result set is almost identical to those of Motif-X and MoDL listed in Fig. 3. This means that Motif-All not only can find more useful motifs but also can serve as a flexible substitute for existing algorithms.
Finally, we can generate more motifs using Motif-X by lowering the occurrence threshold. For instance, if the occurrence threshold is set to 20 on PhAtS, Motif-X returns 31 motifs. To test whether Motif-All can still find these motifs, we use an approximately equivalent setting of s = 1% to conduct the experiment. In total, Motif-All detects 1153 significant motifs, which include all 31 motifs reported by Motif-X. The detailed results are available at http://bioinformatics.ust.hk/MotifAll.rar.
Effect of support constraint
The running efficiency test in Fig. 5 shows that increasing the support threshold also decreases the running time. More importantly, Motif-All finishes the motif finding procedure within 30 seconds, which is faster than Motif-X and MoDL. This means Motif-All is capable of discovering motifs from large data sets very efficiently.
The discovery of phosphorylation motifs is a computationally challenging problem. This paper tries to resolve it through the introduction of the Motif-All algorithm. Here we discuss several related problems that may need further investigation.
Problem formulation: One critical question is how to formulate the computational problem for phosphorylation motif discovery. Currently, it is cast either as a search problem or as an optimization problem. Motif-X falls into the first category while MoDL belongs to the second. In this paper, we opt for the first category since such a formulation guarantees that each identified motif is statistically significant. However, it may report more motifs than necessary, as statistical significance does not necessarily agree with biological interpretation. In this regard, an optimization-based formulation has the merit of generating a concise motif set. Therefore, a better problem formulation incorporating biological knowledge of phosphorylation is still needed.
Motif evaluation: In Motif-All, we use two measures for motif assessment: statistical significance and support. Though our initial motivation of introducing the support constraint was to reduce the search space, we later found that it is also of practical importance in motif evaluation. Without the help of support constraint, it is almost impossible to obtain a concise motif set using only the significance test. From this perspective, we strongly recommend the adoption of support as one standard measure for phosphorylation motif evaluation in future studies.
Furthermore, we also need to develop new evaluation measures, since the combination of significance threshold and support is still insufficient for separating true phosphorylation motifs from false ones in some cases. The general question of developing an effective evaluation measure is open and needs further investigation.
Motif utilization: Experimental methods to identify phosphorylation sites are time consuming, labor intensive and expensive [9, 10]. The identified motifs can be used to predict potential phosphorylation sites before biological validation. For instance, the recently released Scan-X method, which is built on Motif-X, is one such representative. The capability of finding more significant motifs using Motif-All makes it possible to build more accurate classifiers for better prediction.
We introduced the Motif-All algorithm for finding phosphorylation motifs. Motif-All can identify all statistically significant motifs under a given parameter setting. Meanwhile, it is very fast such that it is able to find hundreds of meaningful motifs from millions of peptide sequences within one minute on a personal computer. Our experimental results show that it outperforms existing phosphorylation motif discovery algorithms.
We have shown that Motif-All is able to find more phosphorylation motifs than existing algorithms. However, biological validation is very expensive and difficult to perform. To assess the correctness of the identified motifs that are not reported by other algorithms, one alternative strategy is to perform a permutation test so as to control the false positive rate. Unfortunately, the permutation test is a very time-consuming procedure since it needs to execute Motif-All many times. To address this issue, our future work will focus on the design and implementation of fast algorithms that support large-scale permutation tests.
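One way such a permutation test might be organized is sketched below. The paper does not specify a scheme, so everything here is an assumption: foreground and background labels are shuffled while the set sizes are preserved, the mining procedure is passed in as a hypothetical callable `mine(foreground, background)` that returns the motifs it reports, and the average number of motifs found on shuffled data serves as a rough estimate of how many discoveries arise by chance.

```python
import random
from typing import Callable, List, Sequence, Tuple


def permuted_split(foreground: Sequence[str], background: Sequence[str],
                   rng: random.Random) -> Tuple[List[str], List[str]]:
    """Shuffle the labels: pool all peptides and redraw sets of the same sizes."""
    pool = list(foreground) + list(background)
    rng.shuffle(pool)
    k = len(foreground)
    return pool[:k], pool[k:]


def mean_null_discoveries(mine: Callable[[List[str], List[str]], list],
                          foreground: Sequence[str],
                          background: Sequence[str],
                          rounds: int = 100,
                          seed: int = 0) -> float:
    """Average number of motifs reported on label-shuffled data.

    `mine` is a placeholder for any motif discovery procedure; it is run once
    per round, which is why the test is expensive.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(rounds):
        fg, bg = permuted_split(foreground, background, rng)
        total += len(mine(fg, bg))
    return total / rounds
```

Dividing this average by the number of motifs reported on the real data gives a simple estimate of the proportion of false discoveries under the assumed shuffling scheme.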
We thank Dr. Waltraud X. Schulze for providing us with the PhosPhAt 3.0 database. This work was partially supported by the Natural Science Foundation of China under Grant No. 61003176, the Fundamental Research Funds for the Central Universities of China (DUT10JR05 and DUT10ZD110), the General Research Fund 661408 and 621707 from the Hong Kong Research Grant Council and the Research Proposal Competition Awards RPC07/08.EG25 and RPC10.EG04 from the Hong Kong University of Science and Technology.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S1.
- Amanchy R, Periaswamy B, Mathivanan S, Reddy R, Tattikota S, Pandey A: A curated compendium of phosphorylation motifs. Nature Biotechnology 2007, 25(3):285–286. 10.1038/nbt0307-285
- Schwartz D, Gygi SP: An iterative statistical approach to the identification of protein phosphorylation motifs from large-scale data sets. Nature Biotechnology 2005, 23(11):1391–139. 10.1038/nbt1146
- Ritz A, Shakhnarovich G, Salomon A, Raphael B: Discovery of phosphorylation motif mixtures in phosphoproteomics data. Bioinformatics 2009, 25: 14–21. 10.1093/bioinformatics/btn569
- Agrawal R, Srikant R: Fast algorithms for mining association rules. Proc. of VLDB’94 1994, 487–499.
- Heazlewood JL, Durek P, Hummel J, Selbig J, Weckwerth W, Walther D, Schulze WX: PhosPhAt: a database of phosphorylation sites in Arabidopsis thaliana and a plant-specific phosphorylation site predictor. Nucleic Acids Research 2007, 36: D1015-D1021. 10.1093/nar/gkm812
- Durek P, Schmidt R, Heazlewood J, Jones A, MacLean D, Nagel A, Kersten B, Schulze W: PhosPhAt: the Arabidopsis thaliana phosphorylation site database. An update. Nucleic Acids Research 2010, 38: D828-D834. 10.1093/nar/gkp810
- Wassertheil-Smoller S: Biostatistics and epidemiology. Springer; 2004.
- Rigoutsos I, Floratos A: Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics 1998, 14: 55–67. 10.1093/bioinformatics/14.1.55
- Xue Y, Li A, Wang L, Feng H, Yao X: PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory. BMC Bioinformatics 2006, 7: 163. 10.1186/1471-2105-7-163
- Jung I, Matsuyama A, Yoshida M, Kim D: PostMod: sequence based prediction of kinase-specific phosphorylation sites with indirect relationship. BMC Bioinformatics 2010, 11(Suppl 1):S10. 10.1186/1471-2105-11-S1-S10
- Schwartz D, Chou M, Church G: Predicting protein post-translational modifications using meta-analysis of proteome scale data sets. Molecular & Cellular Proteomics 2009, 8(2):365–379. 10.1074/mcp.M800332-MCP200
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.