DeepUbi: a deep learning framework for prediction of ubiquitination sites in proteins

Fu, Hongli; Yang, Yingxi; Wang, Xiaobo; Wang, Hui; Xu, Yan

doi:10.1186/s12859-019-2677-9

Research article
Open access
Published: 18 February 2019

Hongli Fu¹, Yingxi Yang¹, Xiaobo Wang¹, Hui Wang² & Yan Xu¹,³ (ORCID: orcid.org/0000-0001-9462-580X)

BMC Bioinformatics volume 20, Article number: 86 (2019)

Abstract

Background

Protein ubiquitination occurs when ubiquitin binds to a lysine (K) residue of a target protein. It is an important regulator of many cellular functions in eukaryotes, such as signal transduction, cell division, and immune reactions. Experimental and clinical studies have shown that ubiquitination plays a key role in several human diseases, and recent advances in proteomic technology have spurred interest in identifying ubiquitination sites. However, most current computational tools for predicting target sites are based on small-scale data and shallow machine learning algorithms.

Results

As more experimentally validated ubiquitination sites emerge, a predictor is needed that can identify lysine ubiquitination sites in large-scale proteome data. In this work, we propose a deep learning predictor, DeepUbi, based on convolutional neural networks. Four different features are derived from the sequences and their physicochemical properties. In 10-fold cross-validation, DeepUbi obtains an AUC (area under the receiver operating characteristic curve) of 0.9, and the accuracy, sensitivity and specificity all exceed 85%. The more comprehensive indicator, MCC, reaches 0.78. We also develop a software package that can be freely downloaded from https://github.com/Sunmile/DeepUbi.

Conclusion

Our results show that DeepUbi has excellent performance in predicting ubiquitination based on large data.

Background

Ubiquitin was first discovered by Goldstein et al. in 1975 [1]. Ubiquitination, the covalent attachment of ubiquitin to a variety of cellular proteins, is a common post-translational modification (PTM) in eukaryotic cells [2]. In the process of ubiquitination, ubiquitin is attached to lysine (K) residues of substrates by a three-stage enzymatic reaction. Three enzymes are involved, namely ubiquitin-activating enzymes (E1s), ubiquitin-conjugating enzymes (E2s) and ubiquitin-ligating enzymes (E3s), which work one after another [3,4,5]. The ubiquitination system is responsible for many aspects of cellular molecular function, such as protein localization, metabolism, regulation and degradation [4,5,6,7]. It also participates in the regulation of various biological processes such as cell division and apoptosis, signal transduction, gene transcription, DNA repair and replication, intracellular transport and virus budding [4, 5]. Evidence has shown that ubiquitination is closely related to cell transformation, immune response and inflammatory response [8]. Abnormal ubiquitination status is also involved in many diseases. For example, the ubiquitination of metastasis suppressor 1, mediated by the Skp1-cullin1-F-box beta-transducin repeat-containing protein, is essential for regulating cell proliferation and migration in breast and prostate cancers [9].

Given these roles of ubiquitination, the precise prediction of ubiquitination sites is particularly important. Conventional experimental methods are time-consuming and labour-intensive, and thus computational methods are necessary as a supplementary approach [10, 11]. In recent years, a variety of machine learning methods have been applied to predict protein ubiquitination sites. Tung and Ho [12] developed a ubiquitination site predictor, UbiPred, using a support vector machine (SVM) with 31 informative physicochemical features selected from the published amino acid indices [13]. Radivojac et al. [14] used a random forest algorithm to develop a predictor, UbPred, in which 586 sequence attributes were employed as the input feature vector. Zhao et al. [15] adopted an ensemble approach with a voting mechanism. Lee et al. [16] designed UbSite, which uses an efficient radial basis function (RBF) kernel to identify ubiquitination sites. Chen et al. [17] proposed a predictor, CKSAAP_UbSite, using the composition of k-spaced amino acid pairs (CKSAAP). Cai et al. [18] proposed a predictor utilizing the nearest neighbour algorithm. Chen et al. [19] proposed a new tool, UbiProber, which was designed for both general and species-specific prediction. Chen et al. [20] developed hCKSAAP_UbSite by integrating four different types of predictive variables. Qiu et al. [21] developed iUbiq-Lys using a support vector machine. Cai and Jiang [22] used multiple machine learning algorithms to predict ubiquitination sites. Wang et al. [23] designed a tool, ESA-UbiSite, using an evolutionary screening algorithm (ESA). In addition, there are many other predictors such as UbiSite [24], UbiBrowser [25], RUBI [26], the WPNNA classifier [27], MDDLogo-clustered SVM models [28] and the non-canonical pathway network [29]. Although various ubiquitination site predictors have been developed, limitations remain: the existing computational methods for predicting ubiquitination sites are shallow machine learning methods, and their datasets are small. Meanwhile, a large amount of biomedical data has accumulated, and shallow machine learning algorithms do not handle big data well. In this study, we propose a lysine ubiquitination predictor, DeepUbi, using a deep learning framework on a large dataset.

Results

Cross-validation performance

After searching over a series of hyperparameter choices, we obtained a well-performing set of hyperparameters, shown in Table 1. The quality of the predictors is measured with the metrics defined in Eq. 4, and the remaining question is how to derive their values objectively. Three different validation methods are generally used to evaluate predictive performance: the independent dataset test, the sub-sampling test and the jackknife test [30]. The jackknife test can exclude the “memory” effect and the arbitrariness problem because the outcome obtained by jackknife cross-validation is always unique for a given benchmark dataset [21]. However, it is time-consuming, especially for big datasets. Because our dataset is large, k-fold cross-validation was used in this study to evaluate the performance of the proposed predictors.
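The k-fold procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `k_fold_indices`, the seed and the toy sample count are assumptions introduced here. Setting k equal to the number of samples gives the leave-one-out (jackknife) case, which shows why that test becomes expensive on a dataset of this size.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at offset i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Toy usage with a hypothetical 20-sample dataset and 4 folds.
splits = list(k_fold_indices(n_samples=20, k=4))
```

Each sample appears in exactly one test fold, so every fragment is predicted once by a model that did not see it during training.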

Table 1 The values of the tuned hyperparameters

First, 4-fold, 6-fold, 8-fold and 10-fold cross-validations are each executed 10 times with the simple One-Hot encoding scheme. The results are shown in Table 2. All of the accuracies are greater than 85% and the highest accuracy reaches 88.74%, illustrating the robustness of CNNUbi, the model without additional features. The ROC curves and AUC values shown in Fig. 1 give a more intuitive view; the largest AUC value is 0.89. These results show that the deep learning framework learns some intrinsic information and performs well. To obtain more information, we add each of the three other features to the One-Hot encoding scheme (see Table 3 and Fig. 2). In the 10-fold cross-validation, all the ROC curves are very close to each other. The One-Hot plus CKSAAP encoding scheme performs the best of all the feature combinations. We call this model DeepUbi; it has an AUC of 0.9066 and an MCC of 0.78.

Table 2 The results of 4-, 6-, 8- and 10-fold cross-validations with the One-Hot feature

Table 3 The results of four different encoding schemes in the 10-fold cross-validation

Our DeepUbi predictor was trained on balanced data. In the experimentally verified ubiquitination and non-ubiquitination data, the ratio of positive to negative peptides is 1:8. We therefore also tested the model trained on balanced data against naturally distributed data. The results in Table 4 show that the performance is slightly worse than on balanced data.

Table 4 The results of DeepUbi on naturally distributed data

Comparison with other existing methods

A comprehensive comparison of our models with the available sequence-based predictors was performed, and the corresponding data and results are shown in Table 5. In the last decade, many researchers have contributed to the prediction of ubiquitination sites in proteins. These predictors improved their accuracy by adding new features, using a variety of machine learning algorithms or adding new datasets, and their precision is approximately 0.8. In this study, the DeepUbi predictor applies a deep learning framework and is more accurate: its AUC is close to 0.9, and its accuracy, sensitivity and specificity are also better than those of existing methods. The comparison shows that the deep learning model performs very well on big datasets, which suggests that DeepUbi learned deeper characteristics.

Table 5 Comparison of DeepUbi and other ubiquitination prediction tools

To eliminate the impact of differences in data volume and make a fairer comparison, we conduct additional experiments. For each existing predictor, we randomly select from our data the same number of positive and negative samples as that predictor used, and repeat this selection 10 times. Each sample set is tested with 10-fold cross-validation, and the average results are listed in Table 6. Comparison of Table 5 and Table 6 shows that the DeepUbi results are much higher than those of the other predictors for the same number of samples. For example, UbiPred on its own data has an Acc of 84.44%, Sn of 83.44%, Sp of 85.43%, AUC of 0.85 and MCC of 0.69. Selecting the same number of samples as in the UbiPred data 10 times, the average result for DeepUbi is an Acc of 98.77%, Sn of 98.87%, Sp of 98.67%, AUC of 0.99 and MCC of 0.98. The AUC values of DeepUbi are close to 0.9, illustrating the performance of deep learning.

Table 6 The DeepUbi results for the same number of samples as the other existing tools

Analysis of ubiquitination peptides

To illustrate the performance of our predictor, we also analyse the training data. First, probability histograms of the composition of flanking amino acids surrounding the ubiquitination candidate sites are generated, as shown in Fig. 3a and b. The amino acid residues Ala (A), Glu (E), Leu (L), Arg (R) and Ser (S) appear at a higher ratio in the positive data (ubiquitination fragments), while Cys (C), Phe (F), His (H), Ile (I) and Val (Y) are more enriched in the negative data (non-ubiquitination fragments). Next, a well-known tool, Two Sample Logo [31], is applied to detect position-specific differences in amino acid composition between the positive and negative training data, and the sequence logo is shown in Fig. 3c. The results reveal the dependencies of flanking amino acids around the substrate sites.

Discussion

We use the biggest data repository designed for protein lysine modification to train the DeepUbi predictor. A convolutional neural network, a deep learning framework, is adopted to predict ubiquitination. It is composed of a convolutional layer, a nonlinear layer and a pooling layer. Convolutional neural networks can learn a large number of mapping relations between input and output without requiring any precise mathematical expression relating the two. Our model is built in six steps: inputting the fragment, constructing an embedding layer, building multi-convolution-pooling layers, adding features, constructing fully connected layers, and producing the output layer. This is the first use of a deep learning framework to predict ubiquitination.

Four encoding schemes are adopted in the feature construction: One-Hot encoding, the informative physicochemical properties, the composition of k-spaced amino acid pairs (CKSAAP) and the pseudo amino acid composition. One-Hot plus CKSAAP has the best performance, with an AUC of 0.9066 in the cross-validation.

In the data, the sequence motif analysis shows that there are differences between positive and negative fragments. Thus, it is feasible to obtain classification information from the peptide itself. Different features are adopted to train the model. The hybrid of One-Hot and CKSAAP is selected as the best, with an AUC of 0.9066.

DeepUbi has better performance than the existing tools. Researchers could use the predictor to select potential candidates and conduct experiments to verify them. This will reduce the range of candidate proteins and save time and labour. The sequence analysis of the ubiquitination will provide suggestions for future work.

In the future, we will first investigate other feature constructions that may better extract the properties of samples. Second, we aim to improve performance by increasing the depth and model parameters through system learning. The current method may also be used to identify other PTM sites in proteins.

Conclusion

In this work, we propose a new ubiquitination predictor, DeepUbi, which uses a deep learning framework and performs well on the biggest data set. DeepUbi extracts features from the original protein fragments and achieves an AUC of 0.9066 and an MCC of 0.78. The model is built in six steps: inputting the fragment, constructing an embedding layer, building multi-convolution-pooling layers, adding features, constructing fully connected layers, and producing the output layer. This is the first use of a deep learning framework in the prediction of ubiquitination. However, DeepUbi is not very deep, as we only use two convolution-pooling structures. We also develop a software package for DeepUbi that can be freely downloaded from https://github.com/Sunmile/DeepUbi. The deep learning model is an effective prediction method, and we expect to improve its accuracy by increasing the depth in the future.

Methods

Benchmark dataset

In this study, the ubiquitination data are collected from the PLMD database (v3.0, June 2017) [32], which is the biggest online data repository designed for protein lysine modification. The original data contain 121,742 ubiquitination sites from 25,103 proteins. Homologous samples in the data would bias the results, so we remove the redundant protein sequences using the CD-HIT web server [33], which is freely available at http://weizhongli-lab.org/cd-hit/, and obtain 12,053 different proteins with ≤30% sequence identity. A sliding window of length 15 × 2 + 1 = 31 is used to intercept the protein sequences with a lysine residue in the centre. If fewer than 15 residues lie upstream or downstream of the lysine, each missing position is filled with a “pseudo” residue ‘X’. There are far more negative peptides than positive peptides. To obtain a better predictor, we select the negative samples by deleting redundant segments so that no two negative peptides have ≥30% pairwise identity [24]. Finally, we obtain a training dataset containing 53,999 ubiquitination and 50,315 non-ubiquitination fragments. A detailed flow chart of these steps is shown in Fig. 4.
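The sliding-window step can be illustrated with a short sketch. It only shows the window extraction and ‘X’ padding described above; the function name `extract_fragments` and the toy sequence are hypothetical, and the redundancy filtering with CD-HIT is not reproduced here.

```python
def extract_fragments(sequence, flank=15, pad="X"):
    """Return (position, fragment) for every lysine in the sequence.

    position is 1-based; each fragment has length 2 * flank + 1
    with the lysine at the centre.
    """
    padded = pad * flank + sequence + pad * flank
    fragments = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # In the padded string the residue sits at index i + flank,
            # so the window covers padded[i : i + 2 * flank + 1].
            fragments.append((i + 1, padded[i:i + 2 * flank + 1]))
    return fragments

# Hypothetical toy sequence, not taken from PLMD.
toy = "MKTAYIAKQR"
frags = extract_fragments(toy)
```

Padding the whole sequence once, rather than each window separately, keeps the indexing simple: every window is a plain slice of the padded string.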

Feature construction

A good feature captures the correlation between the intrinsic ubiquitination characteristics of peptide sequences and the targets [34]. Four feature encoding schemes are adopted: One-Hot encoding, the informative physicochemical properties, the composition of k-spaced amino acid pairs and the pseudo amino acid composition.

One-Hot encoding

The conventional feature representation of amino acid composition uses 20 binary bits to represent an amino acid. To deal with sliding windows that extend beyond the N-terminus or C-terminus, one additional bit is appended to indicate this situation. Each residue is therefore represented by a vector of 20 + 1 = 21 bits, and a fragment of length L by an L × 21 matrix. For example, the amino acid A is represented by ‘100000000000000000000’ and R is represented by ‘010000000000000000000’.
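A minimal sketch of this 21-bit encoding is given below. The text fixes only the first two positions (A, then R); the rest of the ordering in `ALPHABET`, which follows the alphabetical order of the three-letter codes with X last, is an assumption made for illustration.

```python
# Assumed residue order: A and R match the examples in the text,
# the remaining order and the trailing X are illustrative choices.
ALPHABET = "ARNDCQEGHILKMFPSTWYV" + "X"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot(fragment):
    """Encode a fragment as a list of 21-bit rows, one row per residue."""
    rows = []
    for aa in fragment:
        row = [0] * len(ALPHABET)
        row[INDEX[aa]] = 1
        rows.append(row)
    return rows

encoded = one_hot("ARKX")
as_strings = ["".join(str(bit) for bit in row) for row in encoded]
```

A 31-residue fragment encoded this way yields a 31 × 21 matrix, which is the input shape used by the network.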

Informative physicochemical properties (IPCP)

In PTM site prediction, physicochemical properties are essential for extracting information from a fragment or protein. Tung and Ho [12] proposed an informative physicochemical property mining algorithm that quantifies the effectiveness of individual physicochemical properties in prediction. They used the value of the main effect difference (MED) [35] to estimate the individual effects of physicochemical properties. The property with the largest MED is the most effective in predicting ubiquitination sites. In this study, 31 informative physicochemical properties are selected as features; they are listed in Additional file 1: Table S1.

Compositions of K-spaced amino acid pairs (CKSAAP)

The CKSAAP encoding scheme is the composition of k-spaced residue pairs (pairs separated by k amino acids) in the protein sequence, which is useful for predicting flexible or rigid regions of proteins [36]. With 21 residue types (the 20 amino acids plus the pseudo residue X), there are 21 × 21 = 441 possible residue pairs (i.e., AA, AC, ..., XX). Therefore, for each k, the feature vector can be defined as

$$ \left\{\frac{N_{AA}}{N_{total}},\ \frac{N_{AC}}{N_{total}},\ \cdots,\ \frac{N_{XX}}{N_{total}}\right\} $$

(1)

where N_total is the total number of k-spaced residue pairs in the fragment and N_AA is the number of occurrences of the amino acid pair AA in the fragment. Each component of the vector represents the contribution of one k-spaced amino acid pair. For instance, the AA component is $ \frac{N_{AA}}{N_{total}} $. In this paper, k = 0, 1, 2, 3, 4, so the CKSAAP encoding scheme gives a vector of 441 × 5 = 2205 dimensions.
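Equation 1 can be sketched in a few lines. In this illustration the pair ordering produced by `itertools.product` and the residue order in `ALPHABET` are assumptions; only the counts and the normalisation follow the definition above. For a fragment of length 31, N_total equals 31 − k − 1, i.e. 30 pairs for k = 0 and 26 pairs for k = 4.

```python
from itertools import product

# Assumed residue order; the original only states that X is included.
ALPHABET = "ARNDCQEGHILKMFPSTWYV" + "X"
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # 441 pairs

def cksaap(fragment, k_values=(0, 1, 2, 3, 4)):
    """Return the concatenated CKSAAP vector (441 values per k)."""
    features = []
    for k in k_values:
        counts = dict.fromkeys(PAIRS, 0)
        n_total = len(fragment) - k - 1  # number of k-spaced pairs
        for i in range(n_total):
            counts[fragment[i] + fragment[i + k + 1]] += 1
        features.extend(counts[pair] / n_total for pair in PAIRS)
    return features

# Hypothetical 31-residue fragment with lysine at the centre.
vec = cksaap("X" * 5 + "MKTAYIAKQR" + "K" + "AGLSEERVDT" + "X" * 5)
```

Within each block of 441 values the components sum to 1, since every k-spaced pair in the fragment is counted exactly once.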

Pseudo amino acid composition (PseAAC)

Chou’s pseudo amino acid composition combines a set of discrete serial correlation factors with the traditional 20 amino acid components [37]. In this study, we select 20 correlation factors with a weight of 0.05, which gives a 40-dimensional vector.
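For reference, the general form of PseAAC from Chou [37] can be written as below. This is a sketch of the standard definition, not a statement of the authors' exact implementation: the symbols f_u (normalised occurrence frequency of amino acid u), θ_j (j-th tier correlation factor), Θ (the coupling function built from physicochemical property values), λ and w are introduced here, and the treatment of the pseudo residue X is not specified in the text. With λ = 20 and w = 0.05 the vector has 20 + λ = 40 components.

$$ x_u=\left\{\begin{array}{ll}\dfrac{f_u}{\sum_{i=1}^{20}f_i+w\sum_{j=1}^{\lambda}\theta_j}, & 1\le u\le 20\\ {}\dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20}f_i+w\sum_{j=1}^{\lambda}\theta_j}, & 21\le u\le 20+\lambda \end{array}\right. \qquad \theta_j=\frac{1}{L-j}\sum_{i=1}^{L-j}\Theta\left(R_i,R_{i+j}\right) $$

Here L is the fragment length and R_i is the residue at position i, so θ_j averages the coupling between all residue pairs that are j positions apart.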

Algorithm

Deep learning, driven by the availability of big data and the power of parallel and distributed computing, has facilitated major advances in numerous domains such as image recognition, speech recognition, and natural language processing [38]. A protein can be viewed as a sentence, and the residues in the protein sequence as “words”, so the prediction of ubiquitination can be treated as a natural language processing (NLP) task. Therefore, we propose a convolutional neural network (CNN) deep learning model and obtain good prediction performance on a large data set. A CNN is composed of a convolutional layer, a nonlinear layer and a pooling layer. Our model is constructed in six steps (input a fragment, construct an embedding layer, build multi-convolution-pooling layers, add features, construct fully connected layers, and an output layer), as shown in Fig. 5a.

The input protein fragment is represented as $ x\in {R}^{L\times 21} $, where L is the length of the fragment. The first layer is the embedding layer, which maps the input vectors into low-dimensional vector representations. It is essentially a lookup table that is learned from the data: E = xW_e, where e is the embedding dimension, W_e is the 21 × e embedding weight matrix and $ E\in {R}^{L\times e} $ is the resulting embedding matrix, whose entries are continuous values. We then treat the embedding matrix E as an image and use the convolutional neural network to extract features. Because adjacent residues in the fragments are always highly correlated, one-dimensional convolution can be used. The width of the convolution kernel equals the dimension of the embedding vector, and its height is a hyperparameter that is set manually. For example, for a convolution filter of height a_k, a feature map is obtained by the convolution

$$ z_k(m)=f\left(\sum_{i=1}^{a_k}\sum_{j=1}^{e}w\left(i,j\right)\times E\left(i+m,j\right)\right) $$

(2)

where f is the activation function, here a rectified linear unit (ReLU) [39], w is the weight matrix of the filter and $ {z}_k\in {R}^{L-{a}_k+1} $. The number of convolution filters of each size a_k is also set manually. Feature maps obtained from convolution kernels of different sizes have different lengths, so a max-pooling function is used to bring them to the same dimension. The final feature vector h is then obtained. For a more intuitive understanding, see Fig. 5b. For the first model, CNNUbi, we use the features obtained from the last step without additional features, i.e., h_new = h. For comparison, the second model, DeepUbi, is built with additional features, h_new = [h, b], where b denotes the additional features. Finally, each of the two output units has a score between 0 and 1, given by the softmax equation $ {p}_i=\frac{e^{o_i}}{\sum_j{e}^{o_j}} $. Here, o = F_cw_o, o_i is the input of class unit i, F_c is the output of the fully connected layer and w_o is the weight matrix. The cross-entropy objective function is used as the cost function

$$ \mathrm{CE}=-\sum_{n=1}^{N}\left[{y}^n\ln P\left({y}^n=1|{x}^n\right)+\left(1-{y}^n\right)\ln P\left({y}^n=0|{x}^n\right)\right] $$

(3)

where N represents the batch size, and xⁿ and yⁿ represent the n-th protein fragment in the batch and its label, respectively. DeepUbi is trained with the Adam optimizer under a variety of hyperparameters, such as the batch size, maximum epoch, learning rate, dropout rate and convolution blocks.
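The forward computation in Eqs. 2 and 3 can be traced on a toy example. The sketch below is not the authors' implementation: it uses plain Python, a single filter, hypothetical values for E and w, and zero-based indexing in place of the one-based indices of Eq. 2. It shows one filter producing a feature map of length L − a_k + 1, max-pooling reducing it to one value, a two-way softmax, and the cross-entropy for one labelled fragment.

```python
import math

def relu(v):
    return max(0.0, v)

def conv_feature_map(E, w):
    """Eq. 2 for one filter: E is L x e, w is a_k x e."""
    L, a_k, e = len(E), len(w), len(w[0])
    z = []
    for m in range(L - a_k + 1):
        s = sum(w[i][j] * E[i + m][j] for i in range(a_k) for j in range(e))
        z.append(relu(s))
    return z

def softmax(scores):
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

def cross_entropy(probs_pos, labels):
    """Eq. 3 over a batch; probs_pos[n] is P(y = 1 | x^n)."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probs_pos, labels))

# Hypothetical toy values: L = 4, e = 2, one filter of height a_k = 2.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
w = [[1.0, -1.0], [0.5, 0.5]]
z = conv_feature_map(E, w)   # feature map of length L - a_k + 1 = 3
h = max(z)                   # max-pooling keeps one value per filter
p = softmax([h, 0.0])        # toy scores for the two output units
loss = cross_entropy([p[0]], [1])
```

Because max-pooling returns one value per filter regardless of the filter height, filters of different heights can be concatenated into a feature vector of fixed length.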

Model evaluation and performance measures

A confusion matrix is a visual display tool for evaluating the quality of classification models. Each column of the matrix represents the sample situation of the model prediction and each row of the matrix represents the actual situation of the sample. There are four values in the matrix, where TP represents the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. In the literature, the following metrics based on the confusion matrix are often used to evaluate the performance of a predictor

$$ \left\{\begin{array}{l}\mathrm{Sp}=\dfrac{TN}{TN+FP}\\ {}\mathrm{Sn}=\dfrac{TP}{TP+FN}\\ {}\mathrm{Acc}=\dfrac{TP+TN}{TP+TN+FP+FN}\\ {}\mathrm{MCC}=\dfrac{TP\times TN-FP\times FN}{\sqrt{\left(TP+FN\right)\left(TN+FP\right)\left(TP+FP\right)\left(TN+FN\right)}}\end{array}\right. $$

(4)

where Sn represents the sensitivity, Sp is the specificity, Acc is the accuracy, and MCC is the Matthews correlation coefficient. The ROC (receiver operating characteristic) curve and the area under the ROC curve (AUC) are usually used to evaluate the discriminative power of a classifier.
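The four metrics in Eq. 4 follow directly from the confusion-matrix counts, as the sketch below shows. The function name `confusion_metrics` and the counts in the usage line are hypothetical and are not results from this study; returning 0 for MCC when the denominator is zero is a common convention adopted here as an assumption.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Compute Sn, Sp, Acc and MCC as defined in Eq. 4."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Convention assumed here: MCC is 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}

# Hypothetical counts for illustration only.
m = confusion_metrics(tp=90, tn=85, fp=15, fn=10)
```

MCC uses all four cells of the confusion matrix, which is why it is treated as the more comprehensive indicator when the classes are imbalanced.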

Abbreviations

Acc: Accuracy
AUC: Area under the ROC curve
CKSAAP: Composition of k-spaced amino acid pairs
CNN: Convolutional neural network
IPCP: Informative physicochemical properties
MCC: Matthews correlation coefficient
MED: Main effect difference
PseAAC: Pseudo amino acid composition
PTM: Post-translational modification
RBF: Radial basis function
ReLU: Rectified linear unit
Sn: Sensitivity
Sp: Specificity
SVM: Support vector machine

References

1. Goldstein G, Scheid M, Hammerling U, Schlesinger DH, Niall HD, Boyse EA. Isolation of a polypeptide that has lymphocyte-differentiating properties and is probably represented universally in living cells. Proc Natl Acad Sci U S A. 1975;72(1):11–5.
2. Wilkinson KD. Protein ubiquitination: a regulatory post-translational modification. Anticancer Drug Des. 1987;2(2):211–29.
3. Ou CY, Pi HW, Chien CT. Control of protein degradation by E3 ubiquitin ligases in Drosophila eye development. Trends Genet. 2003;19(7):382–9.
4. Herrmann J, Lerman LO, Lerman A. Ubiquitin and ubiquitin-like proteins in protein regulation. Circ Res. 2007;100(9):1276–91.
5. Welchman R, Gordon C, Mayer RJ. Ubiquitin and ubiquitin-like proteins as multifunctional signals. Nat Rev Mol Cell Biol. 2005;6(8):599–609.
6. Hurley JH, Sangho L, Gali P. Ubiquitin-binding domains. Biochem J. 2006;399(Pt 3):361.
7. Nath D, Shadan S. The ubiquitin system. Nature. 2009;458(7237):421.
8. Schwartz AL, Ciechanover A. The ubiquitin-proteasome pathway and pathogenesis of human diseases. Annu Rev Med. 1999;50:57–74.
9. Zhong J, Shaik S, Wan L, Tron AE, Wang Z, Sun L, Inuzuka H, Wei W. SCF beta-TRCP targets MTSS1 for ubiquitination-mediated destruction to regulate cancer cell proliferation and migration. Oncotarget. 2013;4(12):2339–53.
10. Hitchcock AL, Kathryn A, Gygi SP, Silver PA. A subset of membrane-associated proteins is ubiquitinated in response to mutations in the endoplasmic reticulum degradation machinery. Proc Natl Acad Sci U S A. 2003;100(22):12735–40.
11. Ikeda F, Dikic I. Atypical ubiquitin chains: new molecular signals. EMBO Rep. 2008;9(6):536–42.
12. Tung CW, Ho SY. Computational identification of ubiquitylation sites from protein sequences. BMC Bioinformatics. 2008;9(1):310.
13. Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M. AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 2008;36(Database issue):D202–5.
14. Radivojac P, Vacic V, Haynes C, Cocklin RR, Mohan A, Heyen JW, Goebl MG, Iakoucheva LM. Identification, analysis, and prediction of protein ubiquitination sites. Proteins. 2010;78(2):365–80.
15. Zhao X, Li X, Ma Z, Yin M. Prediction of lysine ubiquitylation with ensemble classifier and feature selection. Int J Mol Sci. 2011;12(12):8347–61.
16. Lee TY, Chen SA, Hung HY, Ou YY. Incorporating distant sequence features and radial basis function networks to identify ubiquitin conjugation sites. PLoS One. 2011;6(3):e17331.
17. Chen Z, Chen YZ, Wang XF, Wang C, Yan RX, Zhang ZD. Prediction of ubiquitination sites by using the composition of k-spaced amino acid pairs. PLoS One. 2011;6(7).
18. Cai YD, Huang T, Hu LL, Shi XH, Xie L, Li YX. Prediction of lysine ubiquitination with mRMR feature selection and analysis. Amino Acids. 2012;42(4):1387–95.
19. Chen X, Qiu JD, Shi SP, Suo SB, Huang SY, Liang RP. Incorporating key position and amino acid residue features to identify general and species-specific ubiquitin conjugation sites. Bioinformatics. 2013;29(13):1614–22.
20. Chen Z, Zhou Y, Song JN, Zhang ZD. hCKSAAP_UbSite: improved prediction of human ubiquitination sites by exploiting amino acid pattern and properties. BBA-Proteins Proteom. 2013;1834(8):1461–7.
21. Qiu WR, Xiao X, Lin WZ, Chou KC. iUbiq-Lys: prediction of lysine ubiquitination sites in proteins by extracting sequence evolution information via a gray system model. J Biomol Struct Dyn. 2015;33(8):1731–42.
22. Cai B, Jiang X. Computational methods for ubiquitination site prediction using physicochemical properties of protein sequences. BMC Bioinformatics. 2016;17:116.
23. Wang JR, Huang WL, Tsai MJ, Hsu KT, Huang HL, Ho SY. ESA-UbiSite: accurate prediction of human ubiquitination sites by identifying a set of effective negatives. Bioinformatics. 2017;33(5):661–8.
24. Huang C-H, Su M-G, Kao H-J, Jhong J-H, Weng S-L, Lee T-Y. UbiSite: incorporating two-layered machine learning method with substrate motifs to predict ubiquitin-conjugation site on lysines. BMC Syst Biol. 2016;10(1):S6.
25. Li Y, Xie P, Lu L, Wang J, Diao L, Liu Z, Guo F, He Y, Liu Y, Huang Q, et al. An integrated bioinformatics platform for investigating the human E3 ubiquitin ligase-substrate interaction network. Nat Commun. 2017;8(1):347.
26. Walsh I, Di Domenico T, Tosatto SCE. RUBI: rapid proteomic-scale prediction of lysine ubiquitination and factors influencing predictor performance. Amino Acids. 2014;46(4):853–62.
27. Kai-Yan F, Tao H, Kai-Rui F, Xiao-Jun L. Using WPNNA classifier in ubiquitination site prediction based on hybrid features. Protein Pept Lett. 2013;20(3):318–23.
28. Nguyen V, Huang K, Huang C, Lai KR, Lee T. A new scheme to characterize and identify protein ubiquitination sites. IEEE/ACM Trans Comput Biol Bioinform. 2017;14(2):393–403.
29. Ghosh S, Febin Prabhu Dass J. Non-canonical pathway network modelling and ubiquitination site prediction through homology modelling of NF-κB. Gene. 2016;581(1):48–56.
30. Chou KC. Some remarks on protein attribute prediction and pseudo amino acid composition. J Theor Biol. 2011;273(1):236–47.
31. Vacic V, Iakoucheva LM, Radivojac P. Two sample logo: a graphical representation of the differences between two sets of sequence alignments. Bioinformatics. 2006;22(12):1536–7.
32. Xu H, Zhou J, Lin S, Deng W, Zhang Y, Xue Y. PLMD: an updated data resource of protein lysine modifications. J Genet Genomics. 2017;44(5):243–50.
33. Huang Y, Niu BF, Gao Y, Fu LM, Li WZ. CD-HIT suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010;26(5):680–2.
34. Plewczynski D, Tkacz A, Wyrwicz LS, Rychlewski L. AutoMotif server: prediction of single residue post-translational modifications in proteins. Bioinformatics. 2005;21(10):2525–7.
35. Tung CW, Ho SY. POPI: predicting immunogenicity of MHC class I binding peptides by mining informative physicochemical properties. Bioinformatics. 2007;23(8):942–9.
36. Chen K, Kurgan LA, Ruan J. Prediction of flexible/rigid regions from protein sequences using k-spaced amino acid pairs. BMC Struct Biol. 2007;7:25.
37. Chou KC. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins-Structure Function and Genetics. 2001;43(3):246–55.
38. Min S, Lee B, Yoon S. Deep learning in bioinformatics. Brief Bioinform. 2017;18(5):851–69.
39. Nair V, Hinton GE. Rectified linear units improve restricted boltzmann machines. In: International conference on machine learning; 2010. p. 807–14.

Acknowledgements

Dr. Jun Ding helped us with the programming and processed the data. We also thank the anonymous reviewers who gave us very valuable suggestions. The manuscript was edited by American Journal Experts (AJE) prior to submission.

Funding

This work was supported by grants from the Natural Science Foundation of China (11671032) and the 2015 National Traditional Medicine Clinical Research Base Business Construction Special Topics (JDZX2015299). The funders had no role in the design of the study, in the collection, analysis, and interpretation of the data, or in writing the manuscript.

Availability of data and materials

A total of 121,742 ubiquitination sites were collected from PLMD database (http://plmd.biocuckoo.org/) and the proteins were retrieved from UniProt (https://www.uniprot.org/). The data is provided on website https://github.com/Sunmile/DeepUbi and the file name is “Raw Data”.

Author information

Hongli Fu and Yingxi Yang contributed equally to this work.

Authors and Affiliations

Department of Information and Computing Science, University of Science and Technology Beijing, Beijing, 100083, China
Hongli Fu, Yingxi Yang, Xiaobo Wang & Yan Xu
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Hui Wang
Beijing Key Laboratory for Magneto-photoelectrical Composite and Interface Science, University of Science and Technology Beijing, Beijing, 100083, China
Yan Xu

Contributions

YX and YY conceived of and designed the experiments. HF, XW, HW and YY performed the experiments and data analysis. HF and YX wrote the paper. YX and YY revised the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yan Xu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional file

Additional file 1:

Table S1. The 31 informative physicochemical properties and their corresponding MED (main effect difference) scores. (XLSX 42 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Fu, H., Yang, Y., Wang, X. et al. DeepUbi: a deep learning framework for prediction of ubiquitination sites in proteins. BMC Bioinformatics 20, 86 (2019). https://doi.org/10.1186/s12859-019-2677-9

Received: 08 November 2018
Accepted: 12 February 2019
Published: 18 February 2019
DOI: https://doi.org/10.1186/s12859-019-2677-9
