Characterization and identification of ubiquitin conjugation sites with E3 ligase recognition specificities
© Nguyen et al.; licensee BioMed Central Ltd. 2015
Published: 21 January 2015
In eukaryotes, ubiquitin conjugation is an important mechanism underlying proteasome-mediated protein degradation and, as such, plays an essential role in the regulation of many cellular processes. In the ubiquitin-proteasome pathway, E3 ligases play important roles by recognizing a specific protein substrate and catalyzing the attachment of ubiquitin to a lysine (K) residue. As more experimental data on ubiquitin conjugation sites become available, it becomes possible to develop prediction models that scale to large data sets. However, no existing work has focused on investigating the substrate site specificities of ubiquitination. Herein, we present an approach that uses an iterative statistical method to identify ubiquitin conjugation sites with substrate site specificities.
In this investigation, a total of 6259 experimentally validated ubiquitinated proteins were obtained from dbPTM. After filtering out homologous fragments at 40% sequence identity, the training data set contained 2658 ubiquitination sites (positive data) and 5532 non-ubiquitinated sites (negative data). Because the substrate site specificities of E3 ligases are difficult to characterize by conventional sequence logo analysis, a recursive statistical method was applied to obtain significantly conserved motifs. The profile hidden Markov model (profile HMM) was adopted to construct predictive models from the identified substrate motifs. Five-fold cross-validation was then used to evaluate the predictive model, which achieved a sensitivity, specificity, and accuracy of 73.07%, 65.46%, and 67.93%, respectively. Additionally, an independent testing set, completely blind to the training data of the predictive model, was used to demonstrate that the proposed method could provide a promising accuracy (76.13%) and outperform an existing ubiquitination site prediction tool.
A case study demonstrated the effectiveness of the characterized substrate motifs for identifying ubiquitination sites. The proposed method presents a practical means of preliminary analysis and greatly reduces the number of potential targets that require further experimental confirmation. This method may also help unravel the mechanisms and roles of these sites in E3 recognition and ubiquitin-mediated protein degradation.
The important role that protein ubiquitination plays in cells has led to increasing interest in the computational identification of ubiquitination sites (Ubi-sites) [7–13]. However, most tools were developed on the basis of small-scale protein data. With advances in proteomics technology, it has become necessary to construct new models that are scalable and practical for large proteome data sets. Recently, two new approaches were introduced for identifying Ubi-sites from large-scale proteome data [14, 15]. UbiProber integrates key position and amino acid residue features, specifically designed for large-scale data, to predict both general and species-specific Ubi-sites. The analysis of UbiProber also showed that: 1) ubiquitination patterns are conserved across different species; 2) some key positions and key amino acid residues are essential for improving the prediction performance; 3) the physicochemical properties of residues in the flanking sequences surrounding a Ubi-site are important in the ubiquitination process. For Ubi-site identification in humans, hCKSAAP_UbSite uses amino acid patterns and properties to improve the prediction performance; its area under the receiver operating characteristic (ROC) curve (AUC) was 0.770 and 0.757 for the training and testing data sets, respectively.
Although UbiProber and hCKSAAP_UbSite demonstrated both accuracy and stability, there was room to improve their performance. In addition, there was still a lack of Ubi-site identification tools for large-scale data. Consequently, we were motivated to develop a new method to predict Ubi-sites based on their substrate site specificities. Five-fold cross-validation was adopted to evaluate the performance of the predictive models. When applied to the training data set, the model without MDD clustering achieved an accuracy of 60.17% and an MCC of 0.202. On the testing data set, it obtained an accuracy of 61.30% and an MCC of 0.225. Maximal dependence decomposition (MDD) was then employed to improve the predictive performance. The average performance of the model with MDD was better than that without MDD, reaching an overall accuracy of 67.93% and an MCC of 0.363. Furthermore, independent testing also showed that the combined MDD-clustered models yielded the highest performance, with a sensitivity, specificity, accuracy and MCC of 87.76%, 70.02%, 76.13%, and 0.549, respectively. Thus, this is a promising approach for researchers who are interested in identifying ubiquitination sites, especially in large-scale proteome data.
Materials and method
Data collection and pre-processing
Table 1. Data statistics of collected ubiquitination sites. Columns: resource (data set), number of ubiquitinated proteins, number of ubiquitinated lysines, number of non-ubiquitinated lysines. Rows: dbPTM (training set), mUbiSiDa (independent testing set).
Table 2. Data statistics after using CD-HIT. Rows: training data set (6259), testing data set (35494).
Detection of substrate site specificities
With recent advances in proteomics technology, more experimental data on ubiquitin conjugation sites have become available, giving us new opportunities to work on large-scale data. However, the complexity of large-scale data also presents a challenge. Although several prediction tools have demonstrated stability and effectiveness, their performance still requires significant improvement. In this work, Maximal Dependence Decomposition (MDD) was applied to detect the substrate site specificities of protein ubiquitination. In our previous works [24–30], we applied the MDD approach to amino acid sequences instead of nucleotides, successfully identified conserved motifs, and improved the prediction performance. MDD adopts a chi-square test, χ2(Ai, Aj), to iteratively assess the dependence of amino acid occurrence between two positions, Ai and Aj, that surround the Ubi-sites. In this study, MDD was applied to subdivide the positive training data (2658 Ubi-site fragments) into ten subgroups containing significant substrate motifs. The negative data for each MDD-clustered subgroup were randomly selected from the negative training data (5532 non-Ubi-site fragments) so that the positive-to-negative ratio was approximately 1:2.08 (the same as the ratio of positive to negative training data, 2658:5532). As illustrated in Figure S1 (in Additional File 1), these subgroups were used to generate profile hidden Markov models (profile HMMs) for the identification of protein ubiquitination sites with their corresponding substrate motifs.
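The dependence test and the negative sampling described above can be illustrated with a minimal Python sketch. It is a simplification and not the authors' implementation: it scores raw residues with a Pearson chi-square statistic, applies no significance cut-off, and performs a single split rather than the full iterative decomposition. The function names (`chi_square`, `split_once`, `sample_negatives`), the toy fragments, and the random seed are all hypothetical.

```python
import random
from collections import Counter
from itertools import product


def chi_square(fragments, i, j):
    """Pearson chi-square statistic for the residues in columns i and j."""
    n = len(fragments)
    row = Counter(f[i] for f in fragments)
    col = Counter(f[j] for f in fragments)
    joint = Counter((f[i], f[j]) for f in fragments)
    stat = 0.0
    for a, b in product(row, col):
        expected = row[a] * col[b] / n
        stat += (joint[(a, b)] - expected) ** 2 / expected
    return stat


def split_once(fragments, skip=()):
    """One split: pick the column with the largest summed chi-square against
    all other columns, then separate fragments by its most common residue."""
    cols = [c for c in range(len(fragments[0])) if c not in skip]
    scores = {i: sum(chi_square(fragments, i, j) for j in cols if j != i)
              for i in cols}
    pos = max(scores, key=scores.get)
    consensus = Counter(f[pos] for f in fragments).most_common(1)[0][0]
    having = [f for f in fragments if f[pos] == consensus]
    lacking = [f for f in fragments if f[pos] != consensus]
    return pos, consensus, having, lacking


def sample_negatives(negatives, n_positive, ratio=5532 / 2658, seed=0):
    """Randomly draw about ratio * n_positive negatives (capped at what exists)."""
    k = min(len(negatives), round(n_positive * ratio))
    return random.Random(seed).sample(negatives, k)


# Toy 5-residue windows with the lysine in the centre (column 2, skipped).
toy = ["AGKGD"] * 4 + ["RGKGE"] * 2
pos, consensus, having, lacking = split_once(toy, skip=(2,))
print(pos, consensus, len(having), len(lacking))
```

In the toy data only columns 0 and 4 vary, and they vary together, so the split falls on one of them and separates the four fragments of one pattern from the two of the other. Repeating `split_once` on each resulting subgroup until a size or significance criterion is met would give the iterative behaviour that the text describes.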
Model construction and cross-validation
A public software package, HMMER, was adopted to generate profile HMMs from the fragment sequences of each MDD-clustered subgroup. An HMM can detect distant relationships between amino acids surrounding the ubiquitination sites. In general, a profile HMM learns a predictive model from the positive data set of a class; thus, in this study, only ubiquitinated data (the positive training set) were used to build a predictive model. For the model of each MDD-clustered subgroup, a threshold parameter is selected as the cut-off value for identifying potential positive data in a query sequence. When searching for hits of an HMM, HMMER returns both a bit score and an expectation value (E-value). A search result with a HMMER bit score greater than the threshold parameter is taken as a positive prediction. Prior to the construction of a final model, the predictive performance of the models with varying parameters is evaluated by k-fold cross-validation, in which the training data are divided into k approximately equal-sized subgroups. The advantage of k-fold cross-validation is that all original data are used as both training set and test set, and each data point is used for testing exactly once. In this study, k is set to five. The models are initially evaluated using five-fold cross-validation and are gauged by their predictive performance. The following measures were employed to assess the performance of the predictive model: $\text{Sensitivity} = \frac{TP}{TP + FN}$, $\text{Specificity} = \frac{TN}{TN + FP}$, and $\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$, where TP, TN, FP and FN represent the numbers of true positives, true negatives, false positives and false negatives, respectively. The Matthews correlation coefficient (MCC) was also used to assess the agreement between the predicted results and the observed data. Finally, the models with the best predictive performance were further evaluated using the independent testing data.
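As a rough illustration of this evaluation procedure, the Python sketch below applies a bit-score threshold, computes the four measures, and generates five-fold splits. The function names are hypothetical, the MCC uses its standard definition, and the fold assignment (interleaved indices) is an assumption because the text does not say how the groups were formed. The confusion-matrix counts in the example are not reported in the paper; they were back-calculated from the published independent-test rates and set sizes (3267 positive and 6214 negative fragments) and should be read as an assumption.

```python
import math


def confusion(scores_pos, scores_neg, threshold):
    """Count TP, TN, FP, FN when a bit score above the threshold is a positive call."""
    tp = sum(s > threshold for s in scores_pos)
    fn = len(scores_pos) - tp
    fp = sum(s > threshold for s in scores_neg)
    tn = len(scores_neg) - fp
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def k_fold_indices(n, k=5):
    """Yield (train, test) index lists; every index is tested exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Assumed counts, inferred from the reported rates rather than taken from the paper.
result = metrics(tp=2867, tn=4351, fp=1863, fn=400)
print({name: round(value, 4) for name, value in result.items()})
```

With these assumed counts the four measures round to the values reported for the independent test (87.76%, 70.02%, 76.13% and 0.549), which suggests that the published figures are mutually consistent.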
Because of over-fitting to the training data set, the predictive performance of the trained models may be overestimated. Therefore, we constructed an independent testing data set to evaluate performance in a realistic setting. Recently, Chen et al. released a comprehensive database, named mUbiSiDa, of protein ubiquitination sites in mammals. The data set downloaded from mUbiSiDa, which included 35494 proteins, was selected as the independent testing data. The positive and negative data were generated using the same approach as applied to the training data, resulting in 110695 positive and 1217977 negative fragments. To avoid data redundancy, the homologous fragments were removed using CD-HIT at 40% sequence identity. Next, fragments in the negative data that were identical to positive fragments were filtered out to prevent over-fitting. As a result, the final independent testing data consisted of 3267 positive and 6214 negative fragments (Table 2). This testing data set was also used with other prediction tools to compare their performance with that of our models.
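The generation of positive and negative fragments and the removal of negatives that duplicate a positive can be sketched in Python as follows. The window half-width (`flank`), the padding character, and both function names are hypothetical, since the surviving text does not state the window size or how sequence termini were handled; redundancy removal with CD-HIT is not shown.

```python
def lysine_fragments(sequence, ubi_positions, flank=6, pad="-"):
    """Return (positives, negatives): windows centred on each lysine.

    ubi_positions holds the 1-based positions of annotated ubiquitinated
    lysines; every other lysine in the sequence is treated as a negative.
    """
    padded = pad * flank + sequence + pad * flank
    positives, negatives = set(), set()
    for idx, residue in enumerate(sequence):
        if residue != "K":
            continue
        # Residue idx sits at padded[idx + flank], the centre of this slice.
        window = padded[idx: idx + 2 * flank + 1]
        (positives if idx + 1 in ubi_positions else negatives).add(window)
    return positives, negatives


def drop_conflicting_negatives(positives, negatives):
    """Remove negative fragments that are identical to any positive fragment."""
    return negatives - positives


# Toy sequence: lysines at positions 2 and 5, only position 2 annotated.
pos_frags, neg_frags = lysine_fragments("AKAAKA", {2}, flank=2)
print(pos_frags, neg_frags)
```

Collecting fragments in sets removes exact duplicates within each class, and the set difference then discards any negative window whose sequence also occurs as a positive, so that no fragment carries both labels.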
Results and discussion
Amino acid composition of ubiquitination sites
Substrate motifs of ubiquitination sites
Table. Performance evaluation by five-fold cross-validation for all data and 10 MDD-clustered subgroups. Rows: MDD clusters 1 to 10, 10 MDD clusters (average).
Independent testing performance
Table. Independent testing performance for single HMM and MDD-clustered HMMs. Rows: single HMM (all data).
Interactions between E3 ligases and ubiquitinated proteins
As mentioned previously, ubiquitin-mediated protein degradation is a sequential process involving three major kinds of enzymes: the activating enzyme E1, the conjugating enzyme E2 and the ubiquitin ligase E3. In the ubiquitin-proteasome pathway, E3 ligases play very important roles by recognizing a specific protein substrate and catalyzing the attachment of ubiquitin to the target protein, usually at a lysine (K) residue. Therefore, a full understanding of the interactions between E3 ligases and ubiquitination substrate proteins has become an emerging topic in the investigation of the protein ubiquitination regulatory network. To investigate these interactions further, we collected E3 ligases and their protein-protein interactions in human and mouse. The experimentally verified E3 ligases were collected from four resources, as shown in Table S1 (in Additional File 1). After the removal of redundant data, the non-redundant set of E3 ligases contained 501 entries in human and 232 entries in mouse. The ubiquitination substrate (Ubi-substrate) proteins were extracted from the training and independent testing data sets that were used to train and test the predictor of ubiquitination sites described previously. As a result, 32260 Ubi-substrate proteins in human and 5195 Ubi-substrate proteins in mouse were obtained. In addition, the relationships between E3 ligases and Ubi-substrate proteins were investigated on the basis of protein-protein interactions. Table S2 (in Additional File 1) shows that the 501 human and 232 mouse E3 ligases interact with 3938 human and 604 mouse ubiquitinated proteins, covering 17397 human and 2949 mouse ubiquitination sites, respectively.
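The kind of join behind such counts can be sketched in Python as below. This is only an illustration under stated assumptions: interactions are treated as undirected pairs of protein identifiers, and the identifiers, site positions and function names are invented placeholders rather than data from the study.

```python
def e3_substrate_links(e3_ligases, substrates, interactions):
    """Map each E3 ligase to the ubiquitinated substrates it interacts with."""
    links = {}
    for a, b in interactions:
        # Interactions are undirected, so test both orientations of the pair.
        for e3, partner in ((a, b), (b, a)):
            if e3 in e3_ligases and partner in substrates:
                links.setdefault(e3, set()).add(partner)
    return links


def summarize(links, sites_per_substrate):
    """Count linked E3 ligases, their distinct substrates, and substrate sites."""
    partners = set().union(*links.values()) if links else set()
    n_sites = sum(len(sites_per_substrate.get(p, ())) for p in partners)
    return len(links), len(partners), n_sites


# Placeholder identifiers and site positions, for illustration only.
e3s = {"E3_A", "E3_B"}
subs = {"SUB_1", "SUB_2", "SUB_3"}
ppi = [("E3_A", "SUB_1"), ("SUB_2", "E3_A"), ("E3_B", "OTHER"), ("SUB_1", "SUB_3")]
links = e3_substrate_links(e3s, subs, ppi)
print(summarize(links, {"SUB_1": [10, 55], "SUB_2": [7]}))
```

Only pairs with an E3 ligase on one side and a known Ubi-substrate on the other are retained, so interactions with unannotated partners or between two substrates do not contribute to the totals.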
The recent rapid accumulation of proteomics data has given us the opportunity to mine large amounts of protein data, extract important information about ubiquitination, and build models to identify ubiquitination sites. However, as our experiments suggest, the performance of existing Ubi-site identification tools still appears to be hampered when dealing with large data sets. Thus, it is necessary to develop an effective approach to improve prediction performance. Previously, we demonstrated the ability of MDD to enhance the performance of predictive models by clustering a large set of aligned signal sequences into subgroups [23, 35–37]. In this work, we applied a similar strategy, combined with profile hidden Markov models, in a prediction model for Ubi-site identification: MDD was used to subdivide the positive training data into subgroups with statistically significant motifs, which yielded an overall improvement in all performance measures. Additionally, evaluation of our model with an independent testing data set showed the strength of our approach in comparison with an existing prediction tool. Therefore, this work has demonstrated that MDD-clustered HMMs can provide promising predictive ability in identifying ubiquitination sites from large-scale proteome data.
The authors sincerely appreciate the Ministry of Science and Technology, Taiwan (MOST 103-2221-E-155-020-MY3 to TYL, 103-2221-E-155-038 to JTYW, and 103-2221-E-038-013-MY2 to THC) for financially supporting this research.
Publication charge for this work was funded by MOST grant 103-2221-E-155-020-MY3 to TYL.
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 1, 2015: Selected articles from the Thirteenth Asia Pacific Bioinformatics Conference (APBC 2015): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S1
1. Pickart CM, Eddins MJ: Ubiquitin: structures, functions, mechanisms. Bba-Mol Cell Res. 2004, 1695 (1-3): 55-72.
2. Welchman RL, Gordon C, Mayer RJ: Ubiquitin and ubiquitin-like proteins as multifunctional signals. Nat Rev Mol Cell Bio. 2005, 6 (8): 599-609. 10.1038/nrm1700.
3. Hicke L, Schubert HL, Hill CP: Ubiquitin-binding domains. Nat Rev Mol Cell Bio. 2005, 6 (8): 610-621. 10.1038/nrm1701.
4. Burger AM, Seth AK: The ubiquitin-mediated protein degradation pathway in cancer: therapeutic implications. Eur J Cancer. 2004, 40 (15): 2217-2229. 10.1016/j.ejca.2004.07.006.
5. Hershko A, Ciechanover A: The ubiquitin system. Annu Rev Biochem. 1998, 67: 425-479. 10.1146/annurev.biochem.67.1.425.
6. Gilon T, Chomsky O, Kulka RG: Degradation signals for ubiquitin system proteolysis in Saccharomyces cerevisiae. Embo J. 1998, 17 (10): 2759-2766. 10.1093/emboj/17.10.2759.
7. Tung CW, Ho SY: Computational identification of ubiquitylation sites from protein sequences. BMC Bioinformatics. 2008, 9: 310. 10.1186/1471-2105-9-310.
8. Radivojac P, Vacic V, Haynes C, Cocklin RR, Mohan A, Heyen JW, Goebl MG, Iakoucheva LM: Identification, analysis, and prediction of protein ubiquitination sites. Proteins. 2010, 78 (2): 365-380. 10.1002/prot.22555.
9. Zhao XW, Li XT, Ma ZQ, Yin MH: Prediction of Lysine Ubiquitylation with Ensemble Classifier and Feature Selection. Int J Mol Sci. 2011, 12 (12): 8347-8361. 10.3390/ijms12128347.
10. Lee TY, Chen SA, Hung HY, Ou YY: Incorporating Distant Sequence Features and Radial Basis Function Networks to Identify Ubiquitin Conjugation Sites. Plos One. 2011, 6 (3).
11. Cai YD, Huang T, Hu LL, Shi XH, Xie L, Li YX: Prediction of lysine ubiquitination with mRMR feature selection and analysis. Amino Acids. 2012, 42 (4): 1387-1395. 10.1007/s00726-011-0835-0.
12. Feng KY, Huang T, Feng KR, Liu XJ: Using WPNNA Classifier in Ubiquitination Site Prediction Based on Hybrid Features. Protein Peptide Lett. 2013, 20 (3): 318-323.
13. Chen Z, Chen YZ, Wang XF, Wang C, Yan RX, Zhang ZD: Prediction of Ubiquitination Sites by Using the Composition of k-Spaced Amino Acid Pairs. Plos One. 2011, 6 (7).
14. Chen X, Qiu JD, Shi SP, Suo SB, Huang SY, Liang RP: Incorporating key position and amino acid residue features to identify general and species-specific Ubiquitin conjugation sites. Bioinformatics. 2013, 29 (13): 1614-1622. 10.1093/bioinformatics/btt196.
15. Chen Z, Zhou Y, Song JN, Zhang ZD: hCKSAAP_UbSite: Improved prediction of human ubiquitination sites by exploiting amino acid pattern and properties. Bba-Proteins Proteom. 2013, 1834 (8): 1461-1467. 10.1016/j.bbapap.2013.04.006.
16. Su MG, Huang KY, Lu CT, Kao HJ, Chang YH, Lee TY: topPTM: a new module of dbPTM for identifying functional post-translational modifications in transmembrane proteins. Nucleic Acids Res. 2014, 42 (Database issue): D537-545.
17. Lu CT, Huang KY, Su MG, Lee TY, Bretana NA, Chang WC, Chen YJ, Chen YJ, Huang HD: dbPTM 3.0: an informative resource for investigating substrate site specificity and functional association of protein post-translational modifications. Nucleic Acids Res. 2013, 41 (D1): D295-D305. 10.1093/nar/gks1229.
18. Lee TY, Huang HD, Hung JH, Huang HY, Yang YS, Wang TH: dbPTM: an information repository of protein post-translational modification. Nucleic Acids Res. 2006, 34 (Database issue): D622-627.
19. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, et al: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003, 31 (1): 365-370. 10.1093/nar/gkg095.
20. Chernorudskiy AL, Garcia A, Eremin EV, Shorina AS, Kondratieva EV, Gainullin MR: UbiProt: a database of ubiquitylated proteins. BMC Bioinformatics. 2007, 8: 126. 10.1186/1471-2105-8-126.
21. Huang Y, Niu BF, Gao Y, Fu LM, Li WZ: CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010, 26 (5): 680-682. 10.1093/bioinformatics/btq003.
22. Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: A sequence logo generator. Genome Res. 2004, 14 (6): 1188-1190. 10.1101/gr.849004.
23. Lee TY, Lin ZQ, Hsieh SJ, Bretana NA, Lu CT: Exploiting maximal dependence decomposition to identify conserved motifs from a group of aligned signal sequences. Bioinformatics. 2011, 27 (13): 1780-1787. 10.1093/bioinformatics/btr291.
24. Chen YJ, Lu CT, Lee TY: dbGSH: a database of S-glutathionylation. Bioinformatics. 2014, 30 (16): 2386-2388. 10.1093/bioinformatics/btu301.
25. Su MG, Lee TY: Incorporating substrate sequence motifs and spatial amino acid composition to identify kinase-specific phosphorylation sites on protein three-dimensional structures. BMC Bioinformatics. 2013, 14 (Suppl 16): S2. 10.1186/1471-2105-14-S16-S2.
26. Lee TY, Chen YJ, Lu CT, Ching WC, Teng YC, Huang HD: dbSNO: a database of cysteine S-nitrosylation. Bioinformatics. 2012, 28 (17): 2293-2295. 10.1093/bioinformatics/bts436.
27. Lee TY, Chen YJ, Lu TC, Huang HD: SNOSite: exploiting maximal dependence decomposition to identify cysteine S-nitrosylation with substrate site specificity. Plos One. 2011, 6 (7): e21849. 10.1371/journal.pone.0021849.
28. Lee TY, Bretana NA, Lu CT: PlantPhos: using maximal dependence decomposition to identify plant phosphorylation sites with substrate site specificity. BMC Bioinformatics. 2011, 12: 261. 10.1186/1471-2105-12-261.
29. Wong YH, Lee TY, Liang HK, Huang CM, Wang TY, Yang YH, Chu CH, Huang HD, Ko MT, Hwang JK: KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns. Nucleic Acids Res. 2007, 35 (Web Server issue): W588-594.
30. Huang HD, Lee TY, Tzeng SW, Horng JT: KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites. Nucleic Acids Res. 2005, 33 (Web Server issue): W226-229.
31. Eddy SR: Profile hidden Markov models. Bioinformatics. 1998, 14 (9): 755-763. 10.1093/bioinformatics/14.9.755.
32. Lu CT, Chen SA, Bretana NA, Cheng TH, Lee TY: Carboxylator: incorporating solvent-accessible surface area for identifying protein carboxylation sites. J Comput Aided Mol Des. 2011, 25 (10): 987-995. 10.1007/s10822-011-9477-2.
33. Chen T, Zhou T, He B, Yu HY, Guo XJ, Song XF, Sha JH: mUbiSiDa: A Comprehensive Database for Protein Ubiquitination Sites in Mammals. Plos One. 2014, 9 (1).
34. Maor R, Jones A, Nuhse TS, Studholme DJ, Peck SC, Shirasu K: Multidimensional protein identification technology (MudPIT) analysis of ubiquitinated proteins in plants. Mol Cell Proteomics. 2007, 6 (4): 601-610. 10.1074/mcp.M600408-MCP200.
35. Lee TY, Chen YJ, Lu TC, Huang HD, Chen YJ: SNOSite: Exploiting Maximal Dependence Decomposition to Identify Cysteine S-Nitrosylation with Substrate Site Specificity. Plos One. 2011, 6 (7).
36. Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997, 268 (1): 78-94. 10.1006/jmbi.1997.0951.
37. Huang KY, Wu HY, Chen YJ, Lu CT, Su MG, Hsieh YC, Tsai CM, Lin KI, Huang HD, Lee TY: RegPhos 2.0: an updated resource to explore protein kinase-substrate phosphorylation networks in mammals. Database (Oxford). 2014, 2014: bau034. 10.1093/database/bau034.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.