PKIS: computational identification of protein kinases for experimentally discovered protein phosphorylation sites
© Zou et al.; licensee BioMed Central Ltd. 2013
Received: 25 January 2013
Accepted: 6 August 2013
Published: 13 August 2013
Dynamic protein phosphorylation is an essential regulatory mechanism in various organisms. In this capacity, it is involved in a multitude of signal transduction pathways. Kinase-specific phosphorylation data lay the foundation for reconstruction of signal transduction networks. For this reason, precise annotation of phosphorylated proteins is the first step toward simulating cell signaling pathways. However, the vast majority of kinase-specific phosphorylation data remain undiscovered and existing experimental methods and computational phosphorylation site (P-site) prediction tools have various limitations with respect to addressing this problem.
To address this issue, a novel protein kinase identification web server, PKIS, is presented here. It identifies the protein kinases responsible for experimentally verified P-sites at high specificity by combining the composition of monomer spectrum (CMS) encoding strategy with support vector machines (SVMs). Compared with widely used P-site prediction tools, including KinasePhos 2.0, Musite, and GPS2.1, PKIS substantially outperformed these tools in identifying protein kinases associated with known P-sites. In addition, PKIS was applied to all the P-sites in Phospho.ELM that currently lack kinase information. It identified 14 potential SYK substrates with 36 known P-sites. A further literature search showed that 5 of them were indeed phosphorylated by SYK. Finally, an enrichment analysis was performed and 6 significant SYK-related signal pathways were identified.
In general, PKIS can identify protein kinases for experimental phosphorylation sites efficiently. It is a valuable bioinformatics tool suitable for the study of protein phosphorylation. The PKIS web server is freely available at http://bioinformatics.ustc.edu.cn/pkis.
Reversible protein phosphorylation, which is one of the most common post-translational modifications in eukaryotes, is involved in various cellular processes including regulation of metabolism [1], DNA repair [2], and cellular differentiation [3]. It plays an especially dominant role in signal transduction in biological systems [4, 5]. Kinase-specific phosphorylation data, comprising substrate sites (P-sites) and the corresponding protein kinases, are the basis for reconstructing signal transduction networks and are widely used in different fields of biomedicine, especially in the identification of potential drug targets [6, 7]. For this reason, precise annotation of phosphorylated proteins is key to further research regarding phosphoproteomes.
In recent years, considerable effort has been devoted to the experimental and computational identification of phosphorylation data. Historically, phosphorylation sites were discovered mainly using low-throughput technology [8]. However, these techniques, such as 32P-labeling and degenerate peptide library screening, are costly, labor-intensive, and time-consuming [9, 10]. With recent developments in mass spectrometry, experimentally verified phosphorylation data have accumulated rapidly. For example, Wiśniewski et al. identified 12,035 unique P-sites in 4,579 mouse brain proteins using mass spectrometry [11]. However, this high-throughput technology cannot provide information about the protein kinases that catalyze the phosphorylation of these substrates. Systematically matching these P-sites to specific kinases experimentally is not currently feasible [12]. This limits the amount of protein kinase information that can be made available in phosphorylation databases. For example, a well-known database of experimentally verified phosphorylation data in eukaryotes, Phospho.ELM, currently lists 3,151 phosphorylation sites with corresponding kinase information [13]. However, this accounts for less than 12% of the total of 27,404 human phosphorylation sites deposited in this database. The increasingly large gap between experimentally verified phosphorylation data and protein kinase information hampers studies on protein phosphorylation and signal transduction pathways. Existing kinase-specific phosphorylation site prediction tools, such as PPSP [14], KinasePhos 2.0 [15], Musite [16], and GPS2.1 [17], can generate kinase information for experimentally verified phosphorylation sites, but they are designed to predict novel phosphorylation sites and therefore perform suboptimally when assigning kinases to known sites.
To address these limitations, this study presents a novel bioinformatics tool called the protein kinase identification server (PKIS). It is designed to identify, with high specificity, the protein kinases that act at known P-sites. Human phosphorylation data were retrieved from the Phospho.ELM database and used to train the kinase identification models by combining the composition of monomer spectrum (CMS) encoding with SVMs [18]. Comprehensive analysis shows that CMS encoding performs better than binary encoding in identifying protein kinases for known P-sites. The results of performance evaluation show that PKIS is more powerful than widely used P-site prediction tools for this task.
Prediction performance in different window sizes
Previous studies have demonstrated that the side chains of amino acids surrounding P-sites influence the phosphorylation process, including contacts with kinases. The sequence surrounding a P-site therefore plays a vital role in determining which kinase catalyzes the phosphorylation of the corresponding substrate. However, the residues flanking a P-site in the linear sequence may not be spatially adjacent to it, and experimentally determining exactly which residues surround the P-sites in every phosphorylated protein is difficult and time-consuming. For these reasons, all residues within 30 aa (amino acids) of the P-sites were selected for further examination.
To identify protein kinases confidently, the specificities of the SVM models in PKIS were all required to be at least 99.0%, which meant that the expected false positive rate was no larger than 1.0%. Changes in sensitivity at different window sizes were then examined (Figure 1B). For MET kinase, the best sensitivity (0.57) was obtained when m was equal to 19. For this reason, this optimized window size was used to build the SVM model for MET kinase in PKIS.
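The fixed-specificity criterion described above can be illustrated with a short sketch. It assumes that each site already has a real-valued decision score from a classifier (higher meaning more likely to be a substrate of the kinase) and that a site is called positive when its score is at or above the cutoff. The function names and this decision rule are illustrative assumptions, not the PKIS implementation.

```python
import math

def specificity(neg_scores, cutoff):
    """Fraction of negative sites scored below the cutoff (true negatives)."""
    return sum(s < cutoff for s in neg_scores) / len(neg_scores)

def sensitivity(pos_scores, cutoff):
    """Fraction of positive sites scored at or above the cutoff (true positives)."""
    return sum(s >= cutoff for s in pos_scores) / len(pos_scores)

def pick_cutoff(pos_scores, neg_scores, min_specificity=0.99):
    """Return the lowest cutoff whose specificity is at least min_specificity.

    Specificity never decreases as the cutoff rises, so the first candidate
    that satisfies the constraint also gives the highest sensitivity.
    """
    candidates = sorted(set(pos_scores) | set(neg_scores)) + [math.inf]
    for cutoff in candidates:
        if specificity(neg_scores, cutoff) >= min_specificity:
            return cutoff
    return math.inf  # not reached: an infinite cutoff has specificity 1.0

# Toy scores (hypothetical): 100 negatives and 4 positives.
neg = list(range(100))
pos = [99.5, 98.5, 50, 20]
cutoff = pick_cutoff(pos, neg)
print(cutoff, specificity(neg, cutoff), sensitivity(pos, cutoff))
```

Repeating this for each candidate window size and keeping the size with the highest sensitivity mirrors the selection procedure described for MET kinase.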
Evaluation of CMS encoding
Comparing with kinase-specific P-site prediction tools on the Phospho.ELM database
The performance of PKIS was evaluated and compared to three widely used kinase-specific P-site prediction tools: KinasePhos 2.0 [15], Musite [16], and GPS2.1 [17]. It should be pointed out that none of these tools provides an option for unbiased evaluation of performance, such as leave-one-out cross-validation (LOOCV). Consequently, all human phosphorylated proteins in the Phospho.ELM database had to be used as testing data to assess their performance. These results were biased, because the P-sites in the Phospho.ELM database were also used for model training by these tools [15-17]. This inevitably led to over-estimation of their performance. In contrast, the performance of PKIS was examined using LOOCV, which more faithfully reflects the true performance of the proposed method.
Performance of the PKIS web server on testing data
Comparison of PKIS with kinase-specific P-site prediction tools on testing data
A case study
Significant KEGG pathways enriched in the combined dataset
- Fc epsilon RI signaling pathway
- B cell receptor signaling pathway
- Pathogenic Escherichia coli infection
- ErbB signaling pathway
PKIS is freely accessible to users at the following web address: http://bioinformatics.ustc.edu.cn/pkis/. Users can submit phosphorylated proteins with verified P-sites and select all or some of the 56 predictive models available for protein kinase identification. As shown in Additional file 7: Figure S3, the predicted results are presented as a table in which each row represents a unique kinase-substrate pair. To help users understand the substrate binding preferences of each protein kinase, the CMS logo representing its substrate specificity is also provided in the predicted results. Datasets for all 56 kinases, including corresponding accession numbers and protein sequences, can be downloaded from http://bioinformatics.ustc.edu.cn/pkis/download.html.
Conclusions and discussions
Protein kinase identification is attracting significant attention due to the large number of P-sites discovered using high-throughput technologies. In the present study, a novel kinase identification web server was developed based on the CMS encoding strategy and SVMs. In addition, because different strategies for constructing negative datasets can lead to significantly different classification performance [33], specific negative data were generated for SVM training to achieve optimal performance. The results showed that PKIS outperformed several existing P-site prediction tools in the identification of protein kinases. However, there is still room for further improvement. The system showed limited identification performance for a few kinases. Protein phosphorylation is a highly complex biological process occurring in vivo. As such, the primary sequences around the potential P-sites may not be sufficient to indicate the corresponding protein kinase. The performance of this system may be enhanced by incorporating more biological information, such as protein functional domains and subcellular localization. Currently, kinase-specific phosphorylation data for other organisms are still sparse. However, as phosphorylation data accumulate rapidly, it may become possible to develop a platform that can accurately identify protein kinases in multiple organisms.
All 37,145 phosphorylation instances in humans were extracted from the latest version of Phospho.ELM (9.0). After excluding redundant records, 27,404 P-sites were recognized in 5,374 proteins, including 3,151 kinase-substrate pairs. These phosphorylation sites and their kinase information were used for further analysis. For each kinase, the corresponding phosphorylation instances were used as positive data (+). Negative data (−) comprised phosphorylation events catalyzed by other kinases, rather than the non-phosphorylation sites used by P-site prediction tools. To ensure reliable results, a total of 56 kinases with more than 10 positive instances were selected. Additional file 8: Table S5 summarizes the statistics of all these kinases.
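The dataset construction described above can be sketched as follows. The input format (a list of kinase and site pairs, with each site given as a hypothetical accession and position tuple) and the function name are assumptions made for illustration. The sketch also assumes that a site phosphorylated by both the target kinase and another kinase is kept out of the negative set, which the text does not specify.

```python
from collections import defaultdict

def build_kinase_datasets(pairs, min_positives=10):
    """Build (positives, negatives) per kinase from kinase-site annotations.

    pairs: iterable of (kinase, site), where site is any hashable identifier.
    Only kinases with MORE than min_positives positive sites are kept.
    """
    sites_by_kinase = defaultdict(set)
    for kinase, site in pairs:
        sites_by_kinase[kinase].add(site)

    # Every site that has at least one annotated kinase.
    all_sites = set()
    for sites in sites_by_kinase.values():
        all_sites |= sites

    datasets = {}
    for kinase, positives in sites_by_kinase.items():
        if len(positives) > min_positives:
            # Negatives: sites phosphorylated by other kinases only
            # (assumption: shared sites are excluded from the negatives).
            negatives = all_sites - positives
            datasets[kinase] = (positives, negatives)
    return datasets

# Toy annotations (hypothetical accessions and positions).
pairs = [("SYK", ("P1", i)) for i in range(11)]
pairs += [("MET", ("P2", 5)), ("MET", ("P1", 0))]
datasets = build_kinase_datasets(pairs)
print(sorted(datasets))
```

In this toy input, MET has only two annotated sites and is dropped, while its site that SYK does not share becomes a negative example for SYK.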
In this study, sequence information was encoded using an efficient encoding strategy called CMS [18]. As a part of CMS, the monomer spectrum (MS) represents the amino acid composition, and each feature value is the number of occurrences of the corresponding amino acid in a given window. For example, for the peptide CADKSPEQSPDAEYPTH, the resulting MS feature vector is 1, 2, 1, 3, 2, 0, 0, 2, 2, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0. For a protein sequence with a single-side window size of m, CMS combines the MS vectors computed for a series of window sizes from 3 to 2*m+1. Unlike the MS encoding strategy alone, this reflects the occurrence of amino acids at certain positions and therefore provides more sequential information than the amino acid composition for a given window size. Additional file 9: Figure S4 shows the differences between the CMS and MS encoding strategies.
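A minimal sketch of MS and CMS encoding, under stated assumptions, is given below. The text does not specify the order of the 21 feature positions, so the alphabet here (20 amino acids in alphabetical one-letter order plus a padding symbol X) is an assumption and the output order will differ from the vector quoted above, although the counts themselves agree. The sketch also assumes that the windows are centered on the P-site, grow by one residue on each side, and are padded with X beyond the sequence termini.

```python
from collections import Counter

# Assumed feature order: 20 amino acids plus 'X' as a padding symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"

def monomer_spectrum(peptide):
    """Count how often each symbol of ALPHABET occurs in the peptide."""
    counts = Counter(peptide)
    return [counts.get(symbol, 0) for symbol in ALPHABET]

def cms_encode(sequence, site_index, m):
    """Concatenate MS vectors for windows of size 3, 5, ..., 2*m+1.

    site_index is the 0-based position of the P-site in the sequence.
    """
    padded = "X" * m + sequence + "X" * m
    center = site_index + m
    features = []
    for k in range(1, m + 1):
        window = padded[center - k:center + k + 1]
        features.extend(monomer_spectrum(window))
    return features

ms = monomer_spectrum("CADKSPEQSPDAEYPTH")
cms = cms_encode("CADKSPEQSPDAEYPTH", site_index=8, m=2)
print(ms)
print(len(cms))
```

Because the smaller windows are nested inside the larger ones, residues close to the P-site contribute to more of the concatenated vectors than distant residues, which is how the encoding retains positional information.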
Classification and evaluation
The receiver operating characteristic (ROC) curves were also plotted and the area under the curve (AUC) was calculated as an additional measurement of performance. To minimize possible false positives in the results, a threshold was adopted for each prediction to guarantee that the specificity was no less than 99%. For each kinase, the optimal window size for CMS encoding was the one that gave the best sensitivity in LOOCV.
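As an illustration of these two measurements, the sketch below computes the AUC as the probability that a randomly chosen positive site scores higher than a randomly chosen negative site (ties counted as one half), which equals the area under the ROC curve, and then picks the window size with the best sensitivity. The function names are illustrative. In the example table, only the value 0.57 at m = 19 comes from the text (MET kinase); the other sensitivities are made-up placeholders.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def best_window(sensitivity_by_m):
    """Return the window size m with the highest sensitivity.

    Ties are resolved in favor of the smallest m (an assumption).
    """
    return max(sorted(sensitivity_by_m), key=lambda m: sensitivity_by_m[m])

example_auc = auc([0.9, 0.8], [0.1, 0.85])
# Sensitivities at fixed specificity; 15 and 23 are hypothetical values.
chosen_m = best_window({15: 0.50, 19: 0.57, 23: 0.55})
print(example_auc, chosen_m)
```

The pairwise AUC takes time proportional to the product of the two set sizes, which is acceptable for the dataset sizes described here. A rank-based formulation would be preferable for much larger sets.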
This work was supported by the National Natural Science Foundation of China (61101061, 31100955), the Fundamental Research Funds for the Central Universities (WK2100230011), and the Specialized Research Fund for the Doctoral Program of Higher Education (20113402120028).
1. Acin-Perez R, Gatti DL, Bai Y, Manfredi G: Protein phosphorylation and prevention of cytochrome oxidase inhibition by ATP: coupled mechanisms of energy metabolism regulation. Cell Metab. 2011, 13 (6): 712-719. 10.1016/j.cmet.2011.03.024.
2. Wood CD, Thornton TM, Sabio G, Davis RA, Rincon M: Nuclear localization of p38 MAPK in response to DNA damage. Int J Biol Sci. 2009, 5 (5): 428-437.
3. Lian I, Kim J, Okazawa H, Zhao J, Zhao B, Yu J, Chinnaiyan A, Israel MA, Goldstein LS, Abujarour R, et al: The role of YAP transcription coactivator in regulating stem cell self-renewal and differentiation. Genes Dev. 2010, 24 (11): 1106-1118. 10.1101/gad.1903310.
4. Mann M, Ong SE, Gronborg M, Steen H, Jensen ON, Pandey A: Analysis of protein phosphorylation using mass spectrometry: deciphering the phosphoproteome. Trends Biotechnol. 2002, 20 (6): 261-268. 10.1016/S0167-7799(02)01944-3.
5. Standen CL, Kennedy NJ, Flavell RA, Davis RJ: Signal transduction cross talk mediated by Jun N-terminal kinase-interacting protein and insulin receptor substrate scaffold protein complexes. Mol Cell Biol. 2009, 29 (17): 4831-4840. 10.1128/MCB.00155-09.
6. Tasaki S, Nagasaki M, Kozuka-Hata H, Semba K, Gotoh N, Hattori S, Inoue J, Yamamoto T, Miyano S, Sugano S, et al: Phosphoproteomics-based modeling defines the regulatory mechanism underlying aberrant EGFR signaling. PLoS One. 2010, 5 (11): e13926. 10.1371/journal.pone.0013926.
7. Sobolev B, Filimonov D, Lagunin A, Zakharov A, Koborova O, Kel A, Poroikov V: Functional classification of proteins based on projection of amino acid sequences: application for prediction of protein kinase substrates. BMC Bioinforma. 2010, 11: 313. 10.1186/1471-2105-11-313.
8. Trost B, Kusalik A: Computational prediction of eukaryotic phosphorylation sites. Bioinformatics. 2011, 27 (21): 2927-2935. 10.1093/bioinformatics/btr525.
9. Aponte AM, Phillips D, Harris RA, Blinova K, French S, Johnson DT, Balaban RS: 32P labeling of protein phosphorylation and metabolite association in the mitochondria matrix. Methods Enzymol. 2009, 457: 63-80.
10. Songyang Z, Blechner S, Hoagland N, Hoekstra MF, Piwnica-Worms H, Cantley LC: Use of an oriented peptide library to determine the optimal substrates of protein kinases. Curr Biol. 1994, 4 (11): 973-982. 10.1016/S0960-9822(00)00221-9.
11. Wisniewski JR, Nagaraj N, Zougman A, Gnad F, Mann M: Brain phosphoproteome obtained by a FASP-based method reveals plasma membrane protein topology. J Proteome Res. 2010, 9 (6): 3280-3289. 10.1021/pr1002214.
12. Linding R, Jensen LJ, Ostheimer GJ, van Vugt MA, Jorgensen C, Miron IM, Diella F, Colwill K, Taylor L, Elder K, et al: Systematic discovery of in vivo phosphorylation networks. Cell. 2007, 129 (7): 1415-1426. 10.1016/j.cell.2007.05.052.
13. Dinkel H, Chica C, Via A, Gould CM, Jensen LJ, Gibson TJ, Diella F: Phospho.ELM: a database of phosphorylation sites--update 2011. Nucleic Acids Res. 2011, 39 (Database issue): D261-D267.
14. Xue Y, Li A, Wang L, Feng H, Yao X: PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory. BMC Bioinforma. 2006, 7: 163. 10.1186/1471-2105-7-163.
15. Wong YH, Lee TY, Liang HK, Huang CM, Wang TY, Yang YH, Chu CH, Huang HD, Ko MT, Hwang JK: KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns. Nucleic Acids Res. 2007, 35 (Web Server issue): W588-W594.
16. Gao J, Thelen JJ, Dunker AK, Xu D: Musite, a tool for global prediction of general and kinase-specific phosphorylation sites. Mol Cell Proteomics. 2010, 9 (12): 2586-2600. 10.1074/mcp.M110.001388.
17. Xue Y, Liu Z, Cao J, Ma Q, Gao X, Wang Q, Jin C, Zhou Y, Wen L, Ren J: GPS 2.1: enhanced prediction of kinase-specific phosphorylation sites with an algorithm of motif length selection. Protein Eng Des Sel. 2011, 24 (3): 255-260. 10.1093/protein/gzq094.
18. Torii M, Liu H, Hu ZZ: Support vector machine-based mucin-type o-linked glycosylation site prediction using enhanced sequence feature encoding. AMIA Annu Symp Proc. 2009, 2009: 640-644.
19. Wang M, Li C, Chen W, Wang C: Prediction of PK-specific phosphorylation site based on information entropy. Sci China C Life Sci. 2008, 51 (1): 12-20. 10.1007/s11427-008-0012-1.
20. Veeranna, Lee JH, Pareek TK, Jaffee H, Boland B, Vinod KY, Amin N, Kulkarni AB, Pant HC, Nixon RA: Neurofilament tail phosphorylation: identity of the RT-97 phosphoepitope and regulation in neurons by cross-talk among proline-directed kinases. J Neurochem. 2008, 107 (1): 35-49.
21. Manning G, Whyte DB, Martinez R, Hunter T, Sudarsanam S: The protein kinase complement of the human genome. Science. 2002, 298 (5600): 1912-1934. 10.1126/science.1075762.
22. Newman RH, Hu J, Rho HS, Xie Z, Woodard C, Neiswinger J, Cooper C, Shirley M, Clark HM, Hu S, et al: Construction of human activity-based phosphorylation networks. Mol Syst Biol. 2013, 9: 655.
23. Scharenberg AM, El-Hillal O, Fruman DA, Beitz LO, Li Z, Lin S, Gout I, Cantley LC, Rawlings DJ, Kinet JP: Phosphatidylinositol-3,4,5-trisphosphate (PtdIns-3,4,5-P3)/Tec kinase-dependent calcium signaling pathway: a target for SHIP-mediated inhibitory signals. EMBO J. 1998, 17 (7): 1961-1972. 10.1093/emboj/17.7.1961.
24. Dennis G, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003, 4 (5): P3. 10.1186/gb-2003-4-5-p3.
25. da Huang W, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009, 4 (1): 44-57.
26. Cerwenka A, Lanier LL: Natural killer cells, viruses and cancer. Nat Rev Immunol. 2001, 1 (1): 41-49. 10.1038/35095564.
27. Foucault I, Liu YC, Bernard A, Deckert M: The chaperone protein 14-3-3 interacts with 3BP2/SH3BP2 and regulates its adapter function. J Biol Chem. 2003, 278 (9): 7146-7153. 10.1074/jbc.M209509200.
28. Foucault I, Le Bras S, Charvet C, Moon C, Altman A, Deckert M: The adaptor protein 3BP2 associates with VAV guanine nucleotide exchange factors to regulate NFAT activation by the B-cell antigen receptor. Blood. 2005, 105 (3): 1106-1113.
29. Chen CH, Martin VA, Gorenstein NM, Geahlen RL, Post CB: Two closely spaced tyrosines regulate NFAT signaling in B cells via Syk association with Vav. Mol Cell Biol. 2011, 31 (14): 2984-2996. 10.1128/MCB.05043-11.
30. Peters JD, Furlong MT, Asai DJ, Harrison ML, Geahlen RL: Syk, activated by cross-linking the B-cell antigen receptor, localizes to the cytosol where it interacts with and phosphorylates alpha-tubulin on tyrosine. J Biol Chem. 1996, 271 (9): 4755-4762. 10.1074/jbc.271.9.4755.
31. Kulathu Y, Grothe G, Reth M: Autoinhibition and adapter function of Syk. Immunol Rev. 2009, 232 (1): 286-299. 10.1111/j.1600-065X.2009.00837.x.
32. Kavallaris M, Chao SK, Yang C-PH, Horwitz SB: Posttranslational Modifications of Tubulin. Cytoskeleton and Human Disease. 2012, New Jersey: Humana Press, 241-257.
33. Gong H, Liu X, Wu J, He Z: Data construction for phosphorylation site prediction. Brief Bioinform. 2013, 10.1093/bib/bbt012.
34. Chang CC, Lin CJ: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST). 2011, 2 (3): 27.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.