A novel method for high accuracy sumoylation site prediction from protein sequences
© Xu et al; licensee BioMed Central Ltd. 2008
Received: 05 April 2007
Accepted: 08 January 2008
Published: 08 January 2008
Protein sumoylation is an essential, dynamic and reversible post-translational modification that plays a role in dozens of cellular activities, especially the regulation of gene expression and the maintenance of genomic stability. Currently, the complexities of the sumoylation mechanism cannot be fully resolved by experimental approaches. In this regard, computational approaches might represent a promising way to direct the experimental identification of sumoylation sites and to shed light on the reaction mechanism.
Here we present a statistical method for sumoylation site prediction. A 5-fold cross validation test over the experimentally identified sumoylation sites yielded excellent prediction performance, with correlation coefficient, specificity, sensitivity and accuracy equal to 0.6364, 97.67%, 73.96% and 96.71%, respectively. Additionally, the predictor performance is maintained when high-level homologs are removed.
Using a statistical method, we have developed a new SUMO site predictor, SUMOpre, which achieved a correlation coefficient, specificity, sensitivity and accuracy of 0.6364, 97.67%, 73.96% and 96.71%, respectively, in 5-fold cross validation.
Sumoylation, a reversible post-translational modification (PTM) by the small ubiquitin-related modifier (SUMO), is essential to dozens of cellular activities, including subcellular transport, control of gross subnuclear architecture, direct and indirect effects on transcription, regulation of DNA damage recovery and replication, chromosome segregation, cell cycle progression, and competition with other ubiquitin-like modifiers (Ubls) [1–3]. Sumoylation is reportedly also a factor in various diseases and disorders, especially neural diseases such as neuronal intranuclear inclusion disease (NIID), Alzheimer's disease (AD), and Parkinson's disease (PD) [4, 5]. SUMO proteins are highly conserved across eukaryotes. Mammals express four SUMO genes (SUMO-1, SUMO-2, SUMO-3, and SUMO-4), among which SUMO-1 has received the most attention. Yeasts express only a single SUMO gene, while plants express at least eight. However, the exact role played by such a modification, for example positive or negative transcriptional regulation, is still unknown. Thus, more detailed information is needed on sumoylation substrates and sites.
It is still widely accepted that ψKxE/D [6, 7] (where ψ represents a large hydrophobic amino acid and x represents any amino acid) is the consensus motif for SUMO-1 conjugation. However, there are many cases of sumoylation that do not occur at sites with this consensus motif. In fact, approximately 26% (69/268) of confirmed sumoylation sites contain a non-consensus motif. Although it has been reported that in some cases a short peptide containing the ψKxE/D motif and a nuclear localization signal (NLS) is sufficient for SUMO-1 recognition in vivo, SUMO E3 ligases that increase the efficiency of SUMO conjugation may require more sequence information. Therefore, it is necessary to focus on the exact sumoylated site and the related sequence information that may be required.
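The split between consensus and non-consensus lysines described above can be illustrated with a short pattern scan. This is only a sketch: the article does not enumerate which residues count as "large hydrophobic", so the set `PSI` below (I, L, V, M, F) is an assumption, and the function names are hypothetical.

```python
import re

# Assumption: the "large hydrophobic" class psi is taken as I, L, V, M, F here;
# the article does not list the residues it includes.
PSI = "ILVMF"

# A lookahead is used so that overlapping motifs are all reported.
CONSENSUS = re.compile(rf"(?=([{PSI}]K[A-Z][ED]))")

def consensus_lysines(sequence):
    """Return 1-based positions of Lys residues inside a psi-K-x-E/D motif."""
    return [m.start() + 2 for m in CONSENSUS.finditer(sequence.upper())]

def split_by_motif(sequence):
    """Partition every Lys into (consensus positions, non-consensus positions)."""
    consensus = set(consensus_lysines(sequence))
    all_lys = [i + 1 for i, aa in enumerate(sequence.upper()) if aa == "K"]
    return ([p for p in all_lys if p in consensus],
            [p for p in all_lys if p not in consensus])
```

A scan of this kind only labels candidate sites by motif. It cannot find the non-consensus sites, which is the limitation of motif-only approaches that the text points out.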
Currently, the complexities of the sumoylation mechanism cannot be fully resolved by experimental approaches. Mutational analysis has been used to identify the majority of known sumoylated sites. However, for larger and more complex proteins, especially those with dozens of potential consensus and non-consensus sumoylation sites, mutational analysis is labor-intensive and time-consuming. Large-scale proteomic approaches are more suitable for high-throughput identification, but their accuracy and stability are limited by reagent availability and by the efficiency of computational peptide identification. Furthermore, their current results mainly concentrate on the identification of sumoylation substrates rather than sites [8–11]. Although Pedrioli et al. have introduced SUMmOn, an automated theoretical pattern recognition tool that identifies sumoylated sites by detecting diagnostic PTM fragment ion series within complex MS/MS spectra, its practical sensitivity and accuracy require further validation. In this regard, computational approaches might represent a promising way to direct the experimental identification of sumoylated sites. SUMOplot, for instance, was the first sumoylation site prediction tool and represented considerable progress. However, because it concentrates on data with the ψKxE or ψKxE/D consensus motif, its predictions may miss many non-consensus true positives. Another recent bioinformatics tool, SUMOsp, which applies GPS and MotifX to sumoylated site prediction, has achieved a prediction sensitivity as high as 89.12%. Nevertheless, its large number of free parameters and the small size of the true-positive dataset may cause over-prediction. Moreover, the most appropriate performance measure, Matthews' correlation coefficient (CC), is not high for either prediction tool.
In the current work, SUMOpre employs a new statistical method to predict sumoylated sites from the amino acid subsequence adjacent to each candidate Lys. Its correlation coefficient of 0.6401 was significantly higher than those of SUMOplot (0.4785) and SUMOsp (0.4873). The 5-fold cross validation/self-consistency tests also showed higher specificity (97.67%/97.74%) and accuracy (96.71%/96.79%), while keeping sensitivity (73.96%/74.25%) at a level equivalent to that of the two published predictors. In addition, the predictor performance was maintained when high-level homologs were removed. All these results show that SUMOpre has greater robustness and prediction accuracy for sumoylation site prediction. The SUMOpre web server is available online.
Effect of window length and threshold value on prediction performance
In order to derive good prediction parameters from limited experimental data, especially data with such a significantly unbalanced ratio of positives to negatives (approximately 1:25), it is crucial to choose an appropriate one-side window length (n) and threshold value (Thd) and to understand their effects on prediction performance.
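To make the role of the one-side window length n concrete, the following sketch extracts the 2n + 1 residue window around every Lys and labels it 1 or 0. The padding character used near the sequence termini and the function names are assumptions for illustration; the article does not state how termini are handled.

```python
PAD = "-"  # assumption: placeholder for positions beyond the sequence ends

def lysine_windows(sequence, n):
    """Yield (1-based position, window) for every Lys, with n residues on each side."""
    seq = sequence.upper()
    padded = PAD * n + seq + PAD * n
    for i, aa in enumerate(seq):
        if aa == "K":
            # Index i in seq is index i + n in padded; the window spans 2n + 1 characters.
            yield i + 1, padded[i:i + 2 * n + 1]

def label_windows(sequence, n, sumoylated_positions):
    """Attach a 1/0 label to each Lys window given the known modified positions."""
    known = set(sumoylated_positions)
    return [(window, 1 if pos in known else 0)
            for pos, window in lysine_windows(sequence, n)]
```

Because every Lys in a protein yields one window but only a few are reported as sumoylated, a labeling of this kind naturally produces the strongly unbalanced positive-to-negative ratio mentioned above.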
Stability of SUMOpre
Performance of SUMOpre using datasets filtering out highly homologous sequences.
To further illustrate the robustness of SUMOpre in terms of threshold-independent performance, receiver operating characteristic (ROC) curves for self-consistency, jack-knife validation and 5-fold cross validation are provided (see Additional file 1). In comparison with SUMOsp, both the ROC curves and the areas under the ROC curves (AUC) again indicate the robustness of SUMOpre.
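As an illustration of how a threshold-independent measure is obtained, the sketch below sweeps the decision threshold over a list of prediction scores to build ROC points and integrates them with the trapezoidal rule. It is a generic construction, not the procedure used to produce the additional file, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by sweeping the threshold over all scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    i = 0
    while i < len(ranked):
        # Tied scores are processed together so that ties do not bias the curve.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of sumoylated from non-sumoylated lysines.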
Comparison of SUMOpre with SUMOplot and SUMOsp
Performance comparisons using the whole dataset for training and testing.
Performance comparisons using the 144-protein dataset for training and 15-protein dataset for testing.
Use of Web service
We have developed a highly robust sumoylated site prediction tool using statistical methods. To avoid overtraining on the limited experimental data, we used as few fitting parameters as possible, and the predictor performance is maintained when high-level homologs are removed. A predictor with too many free coefficients would simply "remember" almost all of the information without any optimization, resulting in unrealistically high accuracies on current data and low accuracies on unknown protein sequences. Thus, SUMOpre provides better robustness, specificity and accuracy while retaining a level of sensitivity similar to that of other prediction methods. Furthermore, considering the highly unbalanced training data (the negative training dataset is approximately 25 times as large as the positive dataset), the main parameter for assessing predictive performance should be Matthews' correlation coefficient (CC). The CC of SUMOpre (0.6401) is significantly higher than the best values of SUMOsp (0.4873) and SUMOplot (0.4785).
Since about 74% of confirmed sumoylation sites in the training data contain a consensus motif, the free coefficients obtained by training are optimized to "remember" more information about the consensus motif. The prediction specificity of SUMOpre, Sp, is 0.89 for consensus sites and 0.25 for non-consensus sites, implying that the prediction of non-consensus sites is considerably harder.
Sumoylation mechanism significantly depends on sequence information
Why was SUMOpre able to perform so well by simply utilizing sequence information? The main reason is that the sumoylation mechanism itself is heavily dependent on sequence information. SUMO is conjugated to target proteins by an enzymatic cascade involving a SUMO activating enzyme (E1), a SUMO conjugating enzyme (E2), and typically a SUMO ligase (E3). SUMO proteins are activated by the heterodimeric E1 AOS1-UBA2 and use the E2 UBC9 for conjugation. There are currently three known types of E3s for the SUMO proteins: RanBP2 (Ran-binding protein-2), the PIAS family and Pc2 (Polycomb 2 homolog). These three types of enzymes have distinct subcellular localizations and mediate the modification of specific substrates [1, 18]. Furthermore, among the 57 sites with identified PDB structures, only 3 are buried in the protein interior (K99 in 1JI7 with 17% exposed, K347 in 1AM9 with 12% exposed and K447 in 1U0J with 7% exposed), while the other 54 are exposed on the protein surface. This may reflect the fact that UBC9 makes direct contact with substrates and has a sequence preference. In contrast to the more than one thousand protein kinases and their complicated phosphorylation recognition and modification systems with dissimilar site preferences, direct SUMO recognition of a single Lys site relies on a limited set of factors: three enzymes and other elements such as subcellular localization or appropriate presentation of the sequence on the substrates [1, 3]. In the absence of a large variety of enzymes with complex recognition mechanisms, motif recognition based solely on sequence information could be sufficient for sumoylation prediction.
As discussed by Matunis & Pickart, sumoylation is frequently site specific, which may reflect the maximum benefit from reduced entropy if the reacting lysine residue is forced into a catalytically favorable orientation. Furthermore, the performance of SUMOpre, based merely on sequence information from known sumoylated sites, supports the suggestion of a sequence-dependent recognition and modification mechanism. In fact, using a dataset with 268 sumoylation sites (including 69 with non-consensus motifs) and 6,361 non-reported sumoylation Lys sites (including 210 sites with consensus motifs), we have achieved a powerful predictor with CC = 0.6401 and sensitivity = 74.25%. All findings indicate that sequence information, especially the sequence in close proximity to the Lys, is an essential factor in the specificity of SUMO recognition and modification.
Using a statistical method, we have developed a new SUMO site predictor, SUMOpre, which achieved a correlation coefficient, specificity, sensitivity and accuracy of 0.6364, 97.67%, 73.96% and 96.71%, respectively, in 5-fold cross validation. Because it takes into account both the consensus ψKxE/D motif and non-consensus motifs, our method achieved greater robustness (correlation coefficients higher by 0.15) than other published predictors. Furthermore, the fact that our predictions are based on protein sequence alone supports the suggestion of a sequence-dependent recognition and modification mechanism.
PubMed was searched with the keywords 'SUMO' and 'sumoylation', yielding 268 unambiguously experimentally defined sumoylation sites in 159 proteins from 710 research articles published online before Aug. 10, 2006. Their primary sequences were extracted from the Swiss-Prot/TrEMBL database. These 159 protein sequences contain a total of 6,629 lysine (Lys) sites, including 268 experimentally identified sumoylated sites used as positive training data and 6,361 non-reported sumoylation Lys sites.
In order to compare prediction performance with the other two published predictors, the dataset was divided into two subsets. The first, a 144-protein set, included the proteins containing the 240 experimentally identified sumoylated sites that were reported before December 10, 2005 and utilized in SUMOsp. The second, a 15-protein set, contained the proteins with the 28 experimentally identified sumoylated sites that were reported after December 10, 2005. The other Lys sites not reported as sumoylated were collected as negative training datasets for both subsets (see Additional files 2 and 3).
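Here, $\theta_j(\omega_i \mid R_j)$ is a position-dependent 20-dimensional vector for the 20 types of amino acids, and $C$ is a constant. If the subsequence in the window is DAMKNEC, for example, the equation for the sumoylated state of Lys is:

$$1.0 = \theta_1(\mathrm{D}) + \theta_2(\mathrm{A}) + \theta_3(\mathrm{M}) + \theta_5(\mathrm{N}) + \theta_6(\mathrm{E}) + \theta_7(\mathrm{C}) + C$$

That for the non-sumoylated state is:

$$0 = \theta_1(\mathrm{D}) + \theta_2(\mathrm{A}) + \theta_3(\mathrm{M}) + \theta_5(\mathrm{N}) + \theta_6(\mathrm{E}) + \theta_7(\mathrm{C}) + C$$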
Eq. (5) is a linear equation. All the coefficients can be determined from the training dataset using the Multiple Linear Regression (MLR) method, which minimizes the sum of the squared deviations between the left and right sides of the equation. In the practical MLR process, one component must be left out of each 20-dimensional vector $\theta_j$, because the 20 amino acid indicators at a position always sum to one and would otherwise be collinear with the constant C. The number of free coefficients is therefore 2 × n × 19. In the above example with n = 3, there are 114 (2 × 3 × 19) free coefficients to be determined.
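The coefficient count can be checked with a small encoding sketch: each of the 2n flanking positions is turned into 19 indicator variables, and the score is the sum of the selected coefficients plus the constant. The choice of which residue to drop as the reference, the decision rule "score ≥ Thd", and all function names are assumptions made for illustration. Fitting the coefficients by least squares against the 1/0 targets is not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# One residue is left out per position as the reference level. Dropping the last
# one (Y) is an arbitrary choice made for this sketch.
ENCODED = AMINO_ACIDS[:-1]

def encode_window(window, n):
    """Turn a (2n + 1)-residue window into 2 * n * 19 indicator variables."""
    assert len(window) == 2 * n + 1
    flanks = window[:n] + window[n + 1:]  # skip the central Lys
    features = []
    for residue in flanks:
        features.extend(1.0 if residue == aa else 0.0 for aa in ENCODED)
    return features

def score(window, n, theta, constant):
    """Right-hand side of the linear model: position-specific terms plus C."""
    x = encode_window(window, n)
    return sum(t * v for t, v in zip(theta, x)) + constant

def predict(window, n, theta, constant, thd):
    """Call the central Lys sumoylated when the score reaches Thd (assumed rule)."""
    return score(window, n, theta, constant) >= thd
```

For the window DAMKNEC with n = 3, `encode_window` returns a vector of length 114 in which six entries are set, one per flanking residue.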
NCBI BLASTCLUST filter for highly homologous sequences
NCBI BLASTCLUST was employed to filter out highly homologous protein sequences from the original dataset. BLASTCLUST automatically and systematically clusters protein sequences based on pairwise matches found using the BLAST algorithm. Similarity threshold and minimum length coverage are two crucial parameters for filtering out highly homologous sequences: the former was set as a BLAST score density, while the latter restricted the minimum percentage of pairwise coverage. The similarity threshold was set at 0.3. For each of the different values of minimum length coverage applied to the 159 protein sequences, proteins were grouped into one cluster if their similarity and length coverage both exceeded the corresponding thresholds. Only one protein in each cluster was chosen to establish the new low-homology training and test datasets, while the remaining protein sequences were filtered out.
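The final step, keeping one protein per cluster, can be sketched as follows. The sketch assumes the clustering result is available as text with one cluster per line and whitespace-separated sequence identifiers. The article does not say which member of each cluster was retained, so taking the first one is an arbitrary choice, and the function names are hypothetical.

```python
def read_clusters(lines):
    """Parse cluster lines: whitespace-separated identifiers, one cluster per line."""
    return [line.split() for line in lines if line.strip()]

def pick_representatives(clusters):
    """Keep one identifier per cluster; taking the first member is an arbitrary choice."""
    return [members[0] for members in clusters]

def filter_dataset(records, representatives):
    """Keep only the (identifier, sequence) records whose identifier was retained."""
    keep = set(representatives)
    return [(ident, seq) for ident, seq in records if ident in keep]
```

The filtered records can then be used to rebuild the positive and negative Lys sets before retraining and retesting.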
Where TP is the number of positive cases that were correctly predicted; TN is the number of negative cases that were correctly rejected; FP is the number of over-predicted cases; and FN is the number of under-predicted cases.
Self-consistency, K-fold cross-validation and Jack-knife tests
Predictive quality was examined with three approaches, one based on the re-substitution test and the other two upon k-fold cross-validation.
Self-consistency
The sumoylation state for each motif in the entire dataset is predicted using the rules derived from the same dataset.
K-fold cross-validation
The dataset was randomly divided into k subsets. Each time, one of the k subsets was used as the test set and the other k − 1 subsets were assembled to form a training set.
Jack-knife (Leave-one-out cross validation)
An extreme validation deduced from k-fold cross validation with k equal to N, the number of data points in the set. It means that N separate times, the function approximator was trained on all the data except for the point being predicted.
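The partitioning behind k-fold cross-validation and the jack-knife test can be sketched as below. The fixed random seed and the function names are assumptions for illustration. The article does not describe how the random split was generated or whether positives and negatives were balanced across subsets.

```python
import random

def k_fold_splits(n_items, k, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # random partition into k subsets
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test)

def jackknife_splits(n_items):
    """Leave-one-out: k-fold with k equal to the number of data points."""
    return k_fold_splits(n_items, n_items)
```

Each data point appears in exactly one test set, so predictions collected across all folds cover the whole dataset once.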
List of abbreviations
- SUMO: small ubiquitin-related modifier
- n: one-side window length
- CC: Matthews' correlation coefficient
- ROC: receiver operating characteristic curve
- MLR: Multiple Linear Regression method
This work was supported in part by grants 30230100 and 30421003 from the Natural Science Foundation of China (NSFC).
- Hay RT: SUMO: a history of modification. Mol Cell 2005, 18(1):1–12. 10.1016/j.molcel.2005.03.012
- Kroetz MB: SUMO: a ubiquitin-like protein modifier. Yale J Biol Med 2005, 78(4):197–201.
- Seeler JS, Dejean A: Nuclear and unclear functions of SUMO. Nat Rev Mol Cell Biol 2003, 4(9):690–9. 10.1038/nrm1200
- Dorval V, Fraser PE: Small ubiquitin-like modifier (SUMO) modification of natively unfolded proteins tau and alpha-synuclein. J Biol Chem 2006, 281(15):9919–24. 10.1074/jbc.M510127200
- Shinbo Y, Niki T, Taira T, Ooe H, Takahashi-Niki K, Maita C, Seino C, Iguchi-Ariga SM, Ariga H: Proper SUMO-1 conjugation is essential to DJ-1 to exert its full activities. Cell Death Differ 2006, 13(1):96–108. 10.1038/sj.cdd.4401704
- Sampson DA, Wang M, Matunis MJ: The Small Ubiquitin-like Modifier-1 (SUMO-1) Consensus Sequence Mediates Ubc9 Binding and Is Essential for SUMO-1 Modification. J Biol Chem 2001, 276(24):21664–9. 10.1074/jbc.M100006200
- Rodriguez MS, Dargemont C, Hay RT: SUMO-1 Conjugation in Vivo Requires Both a Consensus Modification Motif and Nuclear Targeting. J Biol Chem 2001, 276(April 20):12654–9. 10.1074/jbc.M009476200
- Denison C, Rudner AD, Gerber SA, Bakalarski CE, Moazed D, Gygi SP: A proteomic strategy for gaining insights into protein sumoylation in yeast. Mol Cell Proteomics 2005, 4(3):246–54. 10.1074/mcp.M400154-MCP200
- Gocke CB, Yu H, Kang J: Systematic identification and analysis of mammalian small ubiquitin-like modifier substrates. J Biol Chem 2005, 280(6):5004–12. 10.1074/jbc.M411718200
- Hannich JT, Lewis A, Kroetz MB, Li SJ, Heide H, Emili A, Hochstrasser M: Defining the SUMO-modified proteome by multiple approaches in Saccharomyces cerevisiae. J Biol Chem 2005, 280(6):4102–10. 10.1074/jbc.M413209200
- Rosas-Acosta G, Russell WK, Deyrieux A, Russell DH, Wilson VG: A universal strategy for proteomic studies of SUMO and other ubiquitin-like modifiers. Mol Cell Proteomics 2005, 4(1):56–72.
- Pedrioli PG, Raught B, Zhang XD, Rogers R, Aitchison J, Matunis M, Aebersold R: Automated identification of SUMOylation sites using mass spectrometry and SUMmOn pattern recognition software. Nat Methods 2006, 3(7):533–9. 10.1038/nmeth891
- Xue Y, Zhou F, Fu C, Xu Y, Yao X: SUMOsp: a web server for sumoylation site prediction. Nucleic Acids Research 2006, 34(1):254–7. 10.1093/nar/gkl207
- SUMOpre web server [http://spg.biosci.tsinghua.edu.cn/service/sumoprd/predict.cgi]
- Wang ZX, Yuan Z: How good is prediction of protein structural class by the component-coupled method? Proteins 2000, 38(2):165–75. 10.1002/(SICI)1097-0134(20000201)38:2<165::AID-PROT5>3.0.CO;2-V
- SUMOplot web server [http://www.abgent.com/doc/sumoplot]
- SUMOsp web server [http://bioinformatics.lcd-ustc.org/sumosp]
- Welchman RL, Gordon C, Mayer RJ: Ubiquitin and ubiquitin-like proteins as multifunctional signals. Nat Rev Mol Cell Biol 2005, 6(8):599–609. 10.1038/nrm1700
- Manning G, Whyte DB, Martinez R, Hunter T, Sudarsanam S: The protein kinase complement of the human genome. Science 2002, 298(5600):1912–34. 10.1126/science.1075762
- Matunis MJ, Pickart CM: Beginning at the end with SUMO. Nat Struct Mol Biol 2005, 12(7):565–6. 10.1038/nsmb0705-565
- Swiss-Prot/TrEMBL database [http://cn.expasy.org]
- Garnier J, Osguthorpe DJ, Robson B: Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J Mol Biol 1978, 120(1):97–120. 10.1016/0022-2836(78)90297-8
- Garnier J, Gibrat JF, Robson B: GOR method for predicting protein secondary structure from amino acid sequence. Methods Enzymol 1996, 266: 540–53.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.