Modular composition predicts kinase/substrate interactions
© Liu and Tozeren. 2010
Received: 4 December 2009
Accepted: 25 June 2010
Published: 25 June 2010
Skip to main content
© Liu and Tozeren. 2010
Received: 4 December 2009
Accepted: 25 June 2010
Published: 25 June 2010
Phosphorylation events direct the flow of signals and metabolites along cellular protein networks. Current annotations of kinase-substrate binding events are far from complete. In this study, we scanned the entire human protein sequences using the PROSITE domain annotation tool to identify patterns of domain composition in kinases and their substrates. We identified statistically enriched pairs of strings of domains (signature pairs) in kinase-substrate couples presented in the 2006 version of the PTM database.
The signature pairs enriched in kinase - substrate binding interactions turned out to be highly specific to kinase subtypes. The resulting list of signature pairs predicted kinase-substrate interactions in validation dataset not used in learning with high statistical accuracy.
The method presented here produces predictions of protein phosphorylation events with high accuracy and mid-level coverage. Our method can be used in expanding the currently available drafts of cell signaling pathways and thus will be an important tool in the development of combination drug therapies targeting complex diseases.
Transient interactions of proteins with other proteins, such as those that occur during phosphorylation events, comprise a fundamental element of signal processing in living cells . Protein kinases constitute one of the largest families of signaling proteins in eukaryotic cells . Currently, there are more than 500 known protein kinases in the human genome . A phosphorylated amino acid distinguishes itself from the unmodified residue by having a large hydrophilic group with increased hydrogen-bonding, hydration and salt-bridge formation capability. Such modifications often result in switches and altered lines of connections in signaling and metabolic pathways of living cells . Phosphorylation binding interactions are important downstream in gene expression pathways in binding of transcription factors to their substrate proteins .
Transient interactions between proteins often require multiple sites of physical connection and may even require a third party protein such as an adopter protein. Catalytic phosphorylation events at active sites is facilitated either by the utilization of protein recognition modules or the adaptation of docking interactions . Recent structural data indicates that specificity of binding between a kinase and a substrate does not necessarily arise from the active site but from substrate and the specific docking interactions . Globular domain - motif interactions accompany active site interactions in the binding of tyrosine kinase to their substrates. Large numbers of such globular domain/linear motif interactions have already been associated with protein-protein interactions (PPI). Web tools such as PROSITE , Pfam , PRINTS , ProDom , and InterPro  can be used to annotate the globular domains and larger linear motifs on the sequence of any given protein. Similarly, the web tool ELM  annotates on protein sequences large numbers of linear motifs known to be involved in protein interactions. Some of these motifs may play important roles in virus-host interactions via a mechanism for hijacking function [13, 14].
Known annotations of domain-motif interactions on protein partners often result in the prediction of large numbers of false positives in PPI . It is also becoming clear that selectivity of docking sites in MAPK kinases along with the catalytic motif is an important player in identifying PPI . An accurate method of PPI prediction based on interactions of short linear motifs on one protein with large globular domains on the protein pair is yet to be developed .
Computational prediction of PPI from primary sequences of proteins poses a number of other challenges to overcome including the noise in the training PPI data, lack of a true negative training set, as well as problems associated with 3D experimental and molecular modeling of proteins in potentially binding configurations . PPI prediction methods that were developed in the last decade include methods based on sequence homology , feature vectors and machine learning methodology , association studies  and knowledge guided inference of domain-domain interactions from incomplete PPI networks . Computational studies focusing on extracting domain signature pairs associated with PPI have utilized yeast datasets  or datasets spanning across species .
The success achieved in computational association of domain signature pairs with experimentally verified PPI in these aforementioned studies prompted us to investigate signature pair/PPI association in phosphorylation events within the human proteome. We asked the question whether modular composition of proteins (kinase and their substrates), combined with a database of known PPI, could be sufficient in a statistical enrichment procedure to predict known PPI not used in the training. The choice of domains as features for predicting PPI made sense because modular composition of proteins provides insights into their interaction with up and downstream proteins in cell signaling circuits [1, 5].
In addressing this question, we used the Post Translational Modification (PTM) database 2006 edition containing 5602 PPIs to identify statistically enriched signature pairs in kinase/substrate binding. Our ten-to-one and two-to-one learning and testing procedures produced receiver operator characteristic curves reflecting excellent accuracy in the identification of phosphorylation events. Additional verification included the use of PPI in the PTM 2009 edition and in other databases not included in PTM [22–24]. Our bioinformatics analysis uncovered sets of domain clusters that are specifically enriched in various kinases and kinase substrates. Moreover, we showed that pairs of such domain clusters bridges kinase and kinase substrates with high specificity and sensitivity.
The computational space in our model is large compared to other approaches focusing only on the domain annotation of proteins known to be interacting with each other. In the present study we scanned the entire proteome for domain annotation in order to develop background sets of randomly generated virtual protein pairs to be used in statistical enrichment of domains in protein subsets. Another feature specific to our method is the consideration of strings of domains as signatures for binding predictions. This assumption facilitated us to consider binding events between proteins involving multiple sets of domains. Results produced by our method achieved better PPI prediction accuracy in phosphorylation on the average than other presently available computational methods for PPI. Our study illustrates the dominance of a grammar based on interacting domain signature pairs in the language of post modification interactions between proteins in the human proteome.
The learning dataset on kinase/substrate binding was downloaded from the Post Translational Modification database (PTM), version 2006 . The dataset contained 5602 phosphorylation events between 272 kinases and 1432 kinase substrates. The independent testing datasets consisted of phosphorylation events not recorded in PTM 2006 but recorded in the PTM 2009 , the Human Protein Reference Database (HPRD) , and the Biological General Repository for Interaction Datasets (BioGRID) . Predictions of our model were used to match phosphorylated proteins in the PhosphoELM database  with candidate targeting kinases for further experimental verification.
Database of protein domains, families and functional sites named PROSITE  was downloaded to our Laboratory's Blade Center. In this set up, the search engine for PROSITE took protein FASTA sequences as inputs and returned hits of PROSITE domains (D) as outputs. Human protein sequences from the NCBI Gene Bank were scanned and a column matrix indicating the presence (1) and absence (0) of domains were assigned for each human protein. The dimension of these domain column matrices was equal to the number of domains (2102) in the PROSITE Database.
Statistical enrichment of domains in protein subgroups (target group) was performed with respect to control (background) group made of the entire protein kinase group. Domain column matrices were determined for each member of a target group and these matrices were summed up over the membership of the subgroup. Next, a set of proteins of the same number as the target group was selected randomly from the background group and the corresponding sum domain column matrix was computed. This operation was repeated 10,000 times and the p value for enrichment was computed by the fraction of times the background group had more domains of a given identity than the target group. List of domains (domain clusters) enriched in a kinase- or substrate subtype was identified as the list of signatures that are enriched in the target group, a kinase- or substrate subtype.
A score matrix was constructed for selecting signature pairs strongly associated with known PPIs in tyrosine- and serine/threonine phosphorylation subgroups. Specifically we wanted to identify signature pairs (such as A-B) such that presence of signature A in protein K and signature B in protein L would predict with high confidence a PPI between K and L. For this purpose, for the known PPI interactions in the learning dataset (EPPI), we generated a score matrix whose rows and columns identified the enriched signatures in tyrosine and serine/threonine kinases (TK, S/TK) and their substrates (TKB, S/TKB). Each element of the matrix corresponded to the number of EPPI for which a signature pair (A-B) was present in the opposing proteins of the pair. Another score matrix for virtual PPI, VPPI (background), was generated by randomly pairing proteins from the learning dataset, in effect creating VPPI interactions equal in number to all possible protein combinations from kinase and substrate proteins in the PPI set. The p value for signature pair enrichment in a given PPI subgroup was computed using the hypergeometric test in the R Project for Statistical Computing, based on the scores summed from the learning set and the background set. The resulting signature pairs were ranked according to their p value, with the one corresponding to the lowest p value ranked highest. The highest p value used as cut-off in the analysis was p = 0.001.
The signature pairs thus identified via statistical enrichment, were used to predict new PPI events. A protein pair was considered as undergoing phosphorylation interaction if they expressed at least one of the signature pairs determined by the enrichment analysis. Consider a protein pair (L, K) that is associated with a statistically enriched signature pair (A-B). Assumption that the presence of A-B means the presence of a phosphorylation PPI between L and K (PPPI) may lead to false positives. The prediction accuracy was evaluated by computing the probability that the match between predicted and experimental PPI sets to have occurred randomly. Consider there are N VPPI events that can be generated randomly from n kinases and m kinase substrates. Among the N VPPI, M have already been annotated as EPPI. Let the signature pair A-B predict Y number of PPPI, W of which have been verified as EPPI. The hypergeometric test than tests the probability of randomly choosing at least W EPPI by selecting Y PPPI out of a possible N VPPI. Lower the p value, higher is the accuracy of the PPI prediction method presented in this study.
The higher the value of Sp, the lower is the error for assuming PPI between L and K based on the presence of the enriched signature pair (A-B). Parameter Se is a measure of the coverage, namely the size of the PPI pool potentially predicted by A-B.
We also used precision and recall to evaluate the statistical enrichment of experimental PPI in our predicted PPI set. Precision (Pr) was defined as TP/(TP + FP). Recall (Re) is the same as the sensitivity parameter Se.
We used training and testing sets at 2-fold and 10-fold cross validation to test the accuracy of our predictions in 100 iterations using statistical enrichment with p values varying from zero to one . After each set of training and testing we determined the specificity and sensitivity. We plotted the receiver operating characteristics (ROC) curve using the average values of specificity and sensitivity over 100 iterations. The area under the ROC curve (AUC) quantified the likelihood that one can identify a kinase-substrate interaction using the method described above.
In addition, we used multiple validation processes to evaluate the performance of our model. The first process was to check the accuracy of the enriched signature pairs in predicting PPI among the random protein pairs derived from proteins in the learning data set. A p value, representing the probability of randomly generated prediction was computed for the PPI predicted by each signature pair by using the hypergeometric test.
Next, we compared the PPI predictions based on PTM 2006 learning database with PPI not present in PTM 2006 but present in PTM 2009 and in two other databases (HPRD, BioGrid). We identified the phosphorylation PPI in the HPRD and BioGrid databases as those PPI made of a kinase and a substrate partner of the same type (tyrosine or serine/threonine) as listed in Gene Ontology . For each comparison, we computed the number of PPI predicted, number of PPI matched, and the maximum number of virtual PPI that could be generated using the testing PPI dataset. These numbers yielded p values for random prediction using the hypergeometric test.
We tested the accuracy of PPI predictions of the present model, with two previously published domain based methods: correlated sequence-signature markers (CSSM)  and the knowledge-guided inference of domain-domain interactions (K-GIDDI) . Using the algorithms and data presented in these papers, we identified the enriched domain pair signatures and the resulting numbers of predicted PPI as well as the number of matched PPI (matching already annotated PPI) within the randomly generated PPI set from PTM 2006 as well as the validation datasets used for our model. We used the hypergeometric test, sensitivity, and specificity as described above to identify the accuracy of prediction.
Accuracy and coverage of the present approach for predicting kinase - substrate interactions
Approximately 75 percent of known kinase-substrate interactions occurred between proteins with at least one annotated PROSITE domain on their primary sequence. For cross validation, we used the kinase-substrate pair list in PTM 2006 and took its subset made of protein couples with both proteins expressing at least one annotated PROSITE domain. This restriction was necessary since our prediction method is based on existence of certain domain pairs (signature pairs) in interacting proteins. As described in the methods section, we used 10-fold and 2-fold cross validation in 100 iterations and generated receiver operating characteristic (ROC) curves for predicting tyrosine kinase and serine/threonine kinase interactions (Figure 3B). The figure indicates excellent accuracy at 10-fold cross validation and slightly lower accuracy in 2-fold cross validation. The areas under the ROC curves (AUC) for these cases are reported in the figure.
Next we compared our predicted PPI set with those phosphorylation PPI sets that have not been used in our statistical enrichment processes. Three PPI databases, BioGrid, HPRD, and PTM 2009, have contained hundreds of kinase/substrate phosphorylation events as shown in Table 1 testing part. We used the signature pairs listed in additional file 2 to predict PPI events among the proteins in the PPI events shown in Table1 testing set. The p values for the match between our predictions and the known PPI events not used in our enrichment procedures ranged from 9*10 -12 to 7*10 -7 for PPI presented in HPRD and BioGrid database whereas we had higher but still significant p values when predicting PTM 2009.
Next we compared the experimental data shown in Table 1 with the corresponding predictions that could be made using the domain based methods recently published (CSSM & K-GIDDI). These comparisons yielded p values that were larger than the ones for our method. In particular, the p values showed no significance for model CSSM predicting PTM 2009 and the serine/threonine binding data from BioGrid. The reason why our model yielded better results than CSSM could be due to our grammar differentiating between proteins with different domain string expression. Another reason maybe our use of randomly generated background PPI databases in our enrichment method rather than an analytical equation based only on data for PPI. Note also that CSSM model was for the yeast proteome and we used not their published results but generated PPI predictions using their procedure here for comparison with experimental data for the human proteome.
K-GIDDI simulation also yielded in higher p values than our method when compared with the human protein interactive data shown in Table 1. The comparison maybe unfavorable to K-GIDDI since the model incorporates PPI events from multiple species during training phase and therefore might miss some PPI events specific to human. Nevertheless, the fact that all these three approaches gave statistically significant predictions for at least the HPRD database indicates the validity of domain based approaches in predicting phosphorylation events. Sensitivity and specificity parameters were also computed for three approaches across different dataset. Present study shows same accuracy level of specificity as K-GIDDI and better coverage, CSSM show much better coverage but much less specificity.
Efficiency of the present score matrix enrichment in matching known phosphorylation PPI
Binding interactions of proteins with other proteins are at the foundations of cellular networks. Phosphorylation is responsible for the flow of signals and metabolites along the protein pathways . Dynamic binding interactions, such as those that occur in phosphorylation events, appear prominently in signaling pathways in health and in disease including hypertension, diabetes, HIV infection, and cancer [2, 30]. Although kinases have long been considered as drug targets, compounds targeting kinases (kinase inhibitors and natural substances) have been found to be more promiscuous than originally anticipated, which can potentially lead to side effects . It is important to identify potential phosphorylation partners of kinases in order to assess its range of impact on the flow of signals and metabolites along cellular pathways. Recent methods of mapping dynamic protein interactions in kinase signaling using live-cell fluorescence fluctuation spectroscopy and imaging already produced promising results  and kinase morphisms have been directly linked to population subtypes in disease states . These new experimental approaches will benefit from the ongoing efforts in predicting dynamic protein interactions based on existing data and learning/testing/validation approaches. Our study produces this type of computational prediction sets of protein-protein interactions for experimental validation.
We used large-scale bioinformatics databases and tools and developed a methodology for predicting phosphorylation binding events that are yet to be fully annotated. Our method benefits from the hypothesis and assumptions of the previous computational methods of PPI prediction and specifically utilizes the concept of correlated sequence signatures as markers of protein-protein interaction developed by Sprinzak and Margalit (2001). The two new elements in our approach consist of (a) expanding the definition of signature to strings of domains rather than a single domain and (b) the use of background composed of random pairing of kinase and substrates in the statistical processes for identifying signature pairs indicative of phosphorylation events. The first assumption is consistent with our observation that certain strings of domains are highly statistically enriched in kinase subtypes and their substrates compared to the rest of the kinase interactome. The second assumption, requirement of statistical enrichment, against highly differentiating background sets, allowed us to further reduce the set of correlated sequence signatures obtained solely on the data involving PPI. The list of signature pairs developed in the present study, when used in predicting kinase/substrate interactions in phosphorylation events, produced results that are largely matched with experimental data not used in statistical enrichments for signature identification. The p values associated with our predictions and their comparison to independent experimental data ranged from a low of 10 -11 to 0.0038, depending on the kinase subtype and the database used for comparison.
Thousands of human proteins have been identified as undergoing phosphorylation binding interactions in the PhosphoELM database but the identity of the kinases responsible for these phosphorylation events are yet to be quantified. Our method produced candidate kinases targeting these substrates. The resulting list turned out to be consistent with literature not yet included into the PhosphoELM database. In all cases, the partnering between the substrate and the kinase predicted in this study can serve as a guide for kinase identification studies involving known kinase substrates. Another important use of our method will be in expanding and revising existing literature on cellular pathways decorated with phosphorylation events. Such revisions will be useful in identifying the consequences of small drug interventions on a kinase in terms of its interaction with immediate neighbors. Last but not least, our observation that domains expressed by kinase proteins and their substrates are largely subtype-specific drastically reduces the upper bound for the number of experiments one has to conduct for quantifying a major subset of transient binding interactions between protein pairs associated with phosphorylation.
One important disadvantage of our method is the bias toward the discovery of PPI with proteins having similar domain composition. This feature is also persistent in PPI prediction methods based on sequence homology. This tendency is observable in our prediction of new results included in PTM 2009 based on the PTM 2006 dataset. Although our match is statistically significant, the p values we get for this comparison is significantly larger than comparison with HPRD and BioGrid. It is expected that our methodology will pick up more PPI events correctly as we learn more about the protein sequence grammar that relates domain expression with protein-protein interaction patterns.
Protein phosphorylation events redirect and redistribute the flow of signals and metabolites in cellular pathways. Kinases that phosphorylate multiple substrates have been favorable targets for drug development against many disease types. In this study, we developed a high throughput method that predicts potential binding partners for kinases using existing domain annotation tools and interactome databases for the human proteome. The method, when tested against independent databases, yields predictions with high statistical accuracy. Results indicate that domains expressed by any two proteins constitute a strong determinant of the potential for phosphorylation related binding interactions between them. Our expansion of the MAPK pathway using the prediction method outlined in the study presented results compatible with research literature.
Authors thank Dr. Lyle Ungar of the University of Pennsylvania for providing valuable input during the construction of our prediction method. This research was supported by a Calhoun fellowship to Yichuan Liu as well as BioAdvance funds to the Greater Philadelphia Bioinformatics Alliance.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.