- Research article
- Open Access
A deep learning method to more accurately recall known lysine acetylation sites
© The Author(s). 2019
- Received: 17 September 2018
- Accepted: 16 January 2019
- Published: 23 January 2019
Lysine acetylation in proteins is one of the most important post-translational modifications (PTMs). It plays an important role in essential biological processes and is related to various diseases. Identifying lysine acetylation sites is the key to a comprehensive understanding of the regulatory mechanism of lysine acetylation. Several shallow machine learning algorithms have previously been applied to predict lysine modification sites in proteins. However, shallow machine learning has some disadvantages; for instance, it is not as effective as deep learning for processing big data.
In this work, a novel predictor named DeepAcet was developed to predict acetylation sites. Six encoding schemes were adopted, including a one-hot, BLOSUM62 matrix, a composition of K-space amino acid pairs, information gain, physicochemical properties, and a position specific scoring matrix to represent the modified residues. A multilayer perceptron (MLP) was utilized to construct a model to predict lysine acetylation sites in proteins with many different features. We also integrated all features and implemented the feature selection method to select a feature set that contained 2199 features. As a result, the best prediction achieved 84.95% accuracy, 83.45% specificity, 86.44% sensitivity, 0.8540 AUC, and 0.6993 MCC in a 10-fold cross-validation. For an independent test set, the prediction achieved 84.87% accuracy, 83.46% specificity, 86.28% sensitivity, 0.8407 AUC, and 0.6977 MCC.
The predictive performance of our DeepAcet is better than that of other existing methods. DeepAcet can be freely downloaded from https://github.com/Sunmile/DeepAcet.
- Lysine acetylation
- Deep learning
Post-translational modifications (PTMs) refer to the chemical modification of a protein after translation. PTMs play a crucial role in regulating many biological functions, such as protein localization in the cell, protein stabilization, and the regulation of enzymatic activity. Studies have shown that 50–90% of the proteins in the human body undergo PTMs, mainly through the splicing of the peptide chain backbone, the addition of new groups to the side chains of specific amino acids, or the chemical modification of existing groups. Acetylation is one of the most important and ubiquitous PTMs in proteins. Protein acetylation is a widespread covalent modification in eukaryotes that occurs by transferring acetyl groups from acetyl coenzyme A (acetyl CoA) to either the α-amino (Nα) group of amino-terminal residues or to the ε-amino group (Nε) of internal lysines at specific sites. The lysine acetylation catalyzed by histone acetyltransferases (HATs) or lysine acetyltransferases (KATs) reversibly regulates a large number of biological processes. The function of lysine acetylation in histones to control gene expression by modifying the chromatin structure has been widely studied. Recent studies in proteomics have shown that most acetylation events occur on non-chromatin associated proteins and play an important role in cell signaling and metabolism, protein activities and structure, and sister chromatid polymerization [5–7]. In addition to histone acetylation, non-histone acetylation is also important. Some studies have shown that acetylated non-histones affect the stability of mRNA, intracellular localization, protein-protein interactions, enzyme activity and transcriptional regulation [2, 8, 9]. In addition, most non-histone proteins targeted by acetylation are associated with cancer cell proliferation, tumorigenesis and immune functions.
Although a large number of lysine-acetylated proteins have been identified, many remain to be found, and the mechanism of protein acetylation is still largely unknown. Identifying acetylation sites is an essential step in understanding the molecular mechanisms of protein acetylation. In addition, some cancers [11, 12], neurodegenerative disorders [13, 14] and cardiovascular diseases [15, 16] are related to aberrant lysine acetylation. Thus, the identification of acetylation sites can provide guidance for the treatment of some diseases. Kim et al. first developed a method for detecting lysine acetylation sites at the proteomic level by enriching acetylated peptides with antibodies specific for acetylated lysine. Choudhary et al. used high-resolution mass spectrometry to identify 3600 lysine acetylation sites on 1750 proteins. However, the experimental identification of lysine acetylation is laborious, time-consuming and costly, and it has low throughput. Computational approaches are therefore needed to predict lysine acetylation sites.
Performance of DeepAcet
Performance measures and dimensions for the different features
Published studies suggest that combining different features can improve a model. Therefore, our next step was to test the predictive performance of combined features. We obtained a 2205-dimension feature vector from the CKSAAP encoding scheme, a 651-dimension feature vector from each of the one-hot and BLOSUM62 schemes, a 434-dimension feature vector from the 14 physicochemical properties from AAindex, a 1-dimension feature vector from IG and a 30-dimension feature vector from the PSSM encoding scheme. The total dimension of the features was 3972. We used all the features without feature selection as input to the neural network and evaluated their predictive performance with K-fold (k = 4, 6, 8, 10) cross-validation (Additional file 1: Table S1).
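The arithmetic behind these dimensions can be checked with a short sketch. It assumes a 21-letter alphabet (20 amino acids plus the pseudo residue X) and five CKSAAP gap values (k = 0 to 4); both are assumptions chosen because they reproduce the stated sizes, and the exact settings are described in the additional files rather than here. The IG and PSSM sizes are copied directly from the text.

```python
# Sketch: reproduce the feature dimensions quoted in the text.
# ALPHABET_SIZE and CKSAAP_GAPS are assumptions that match the stated totals.
FRAGMENT_LENGTH = 31   # fragment length used for the benchmark data
ALPHABET_SIZE = 21     # assumed: 20 amino acids + pseudo residue "X"
CKSAAP_GAPS = 5        # assumed: k = 0, 1, 2, 3, 4
N_PROPERTIES = 14      # physicochemical properties taken from AAindex

dims = {
    "CKSAAP": ALPHABET_SIZE ** 2 * CKSAAP_GAPS,     # residue pairs per gap value
    "one-hot": FRAGMENT_LENGTH * ALPHABET_SIZE,     # one indicator per position and residue
    "BLOSUM62": FRAGMENT_LENGTH * ALPHABET_SIZE,    # one matrix row per position
    "AAindex": FRAGMENT_LENGTH * N_PROPERTIES,      # 14 property values per position
    "IG": 1,                                        # as stated in the text
    "PSSM": 30,                                     # as stated in the text
}
total = sum(dims.values())

for name, size in dims.items():
    print(f"{name:9s}{size:6d}")
print(f"{'total':9s}{total:6d}")
```

Under these assumptions the one-hot and BLOSUM62 encodings each contribute 651 dimensions, which is what makes the stated total of 3972 add up.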
Performance measures for the 4-, 6-, 8-, and 10-fold cross-validations
Analysis between lysine acetylation and non-acetylation fragments
Optimal features analysis
Total number of features for the different K values
Comparison with other existing methods
Comparison of the performance results with different webserver tools
In this work, DeepAcet, a satisfactory predictor of unknown acetylation sites, was obtained by training a multilayer perceptron on a combination of various encoding schemes. For a long time, researchers have mainly used shallow machine learning algorithms to predict modified lysine sites. In practical applications, however, shallow machine learning is not well suited to extracting high-level features and performs poorly on large data sets. Shallow machine learning uses machine learning algorithms to parse data, learn data features and make decisions or predictions. Deep learning, which is loosely modeled on the structure and function of the human brain, learns representations from unstructured input data and uses them to make decisions. In recent years, deep artificial neural networks have received more and more attention and have been widely applied to image and speech recognition, natural language understanding, and computational biology [30–34]. By propagating data through many layers, a deep network can effectively extract data features and approximate highly complex functions, which improves the classification ability of predictors. Deep neural networks can also better handle high-dimensional encoding vectors by training complex multi-layer networks. Therefore, a deep neural network was used in this work.
The length of the input peptides is also a hyperparameter of the learning architecture. In the prediction of post-translational modifications, fragment lengths generally range from 21 to 41. We tested several other lengths, such as 21, 23, 25, 27, 29, 33 and 35, on our benchmark data and found that 31 was the best length (Additional file 3: Table S3).
Although we implemented a deep learning framework to build the model and obtained good results, there is still room for improvement. First, we considered only the composition and position information of the fragments and did not consider structural features. Second, there is no systematic method for tuning the hyperparameters of the neural network (e.g., the number of neurons and the number of iterations), which can only be adjusted through repeated experimentation. In the future, we will incorporate structural information into the features and explore new neural network architectures. Better robustness and accuracy could also be obtained with more experimentally verified acetylation sites. Meanwhile, researchers have found that acetylation is associated with diseases [35–37], so the association between acetylation and disease is another direction for future work.
Lysine acetylation in proteins has become a key post-translational modification in cell regulation. To fully understand the molecular mechanism of the biological processes associated with acetylation, a preliminary and critical step is to identify the acetylated substrates and the corresponding acetylation sites. Therefore, the prediction of acetylation sites through computational methods is desirable and necessary. We built a predictor, DeepAcet, from six types of features based on a deep learning framework. To obtain the best predictor, feature selection was used to remove uninformative features. The predictor achieved an accuracy of 84.95%, a specificity of 83.45%, a sensitivity of 86.44%, an AUC of 0.8540, and an MCC of 0.6993 in a 10-fold cross-validation. For the independent test set, the predictor achieved an accuracy of 84.87%, a specificity of 83.46%, a sensitivity of 86.28%, an AUC of 0.8407, and an MCC of 0.6977, results which were significantly superior to those of other predictors. DeepAcet can be freely downloaded from https://github.com/Sunmile/DeepAcet.
We retrieved 29,923 human lysine acetylation sites from the CPLM database (http://cplm.biocuckoo.org/) and their proteins from UniProt (http://www.uniprot.org/). Each protein was cut into fragments of length 31 centered on a lysine (K); this length was chosen after many trials. Positions missing at the protein termini were filled with the pseudo amino acid “X”. Fragments with an experimentally verified lysine acetylation site were assigned to the positive dataset, S+, and the other fragments to the negative dataset, S−. In general, a training dataset with high homology leads to over-fitting during training, which reduces the generalization ability of the classifier. Therefore, if more than 30% of the residues in two compared fragments were the same, only one of them was retained. After removing the redundant fragments, we obtained 16,107 positive and 57,443 negative fragments. Because an imbalanced training dataset would cause prediction errors, we randomly selected 16,107 negative fragments from S− to match the number of positive fragments.
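The fragment construction described above can be illustrated with a minimal sketch. The function names, the toy sequence, and the acetylated position are hypothetical. The position-wise identity measure and the greedy filter are assumptions as well, since the text does not say which tool or alignment was used for redundancy removal.

```python
import random

def extract_fragments(sequence, window=31, pad="X"):
    """Return (1-based position, fragment) for every lysine in `sequence`."""
    half = window // 2
    padded = pad * half + sequence + pad * half
    fragments = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i sits at index i + half in `padded`, so this slice
            # is centered on the lysine and padded with "X" at the termini.
            fragments.append((i + 1, padded[i:i + window]))
    return fragments

def split_by_label(fragments, acetylated_positions):
    """Split fragments into positives (S+) and negatives (S-)."""
    positives = [f for pos, f in fragments if pos in acetylated_positions]
    negatives = [f for pos, f in fragments if pos not in acetylated_positions]
    return positives, negatives

def identity(a, b):
    """Fraction of positions with the same residue (assumed similarity measure)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def remove_redundant(fragments, threshold=0.30):
    """Greedy filter: keep a fragment only if it is at most `threshold`
    identical to every fragment already kept."""
    kept = []
    for frag in fragments:
        if all(identity(frag, k) <= threshold for k in kept):
            kept.append(frag)
    return kept

def undersample(negatives, n, seed=0):
    """Randomly pick n negatives to balance the positive set."""
    return random.Random(seed).sample(negatives, n)

# Toy usage with a hypothetical sequence and a hypothetical acetylated site.
fragments = extract_fragments("MKTAYIAKQR")
positives, negatives = split_by_label(fragments, {2})
```

The greedy filter compares every fragment with all retained ones, so it scales quadratically; a dedicated clustering tool would be the practical choice for tens of thousands of fragments.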
The number of samples for the imbalanced, balanced, training and independent test sets
Existing machine learning algorithms operate on numerical vectors rather than on sequence samples. Thus, an important step before training the model was to convert the sequences into numerical vectors that the algorithm could process directly. This process is known as feature encoding or feature construction. In this work, six encoding schemes covering basic position information, evolutionary information and physicochemical properties were used to construct features. Details of the one-hot, BLOSUM62, composition of K-space amino acid pairs (CKSAAP), information gain (IG), AAindex, and position-specific scoring matrix (PSSM) encodings are available in Additional file 6: S6.
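As an illustration of feature encoding, the following sketch implements two of the six schemes, one-hot and CKSAAP, in their commonly used form. The alphabet ordering, the gap range (k = 0 to 4), and the normalization of pair counts to frequencies are assumptions; the authors' exact definitions are in Additional file 6.

```python
from itertools import product

# Assumed alphabet: 20 amino acids plus the padding residue "X".
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"

def one_hot(fragment):
    """Encode each residue as a 21-dimensional indicator vector."""
    vec = []
    for residue in fragment:
        vec.extend(1 if residue == a else 0 for a in ALPHABET)
    return vec

def cksaap(fragment, max_gap=4):
    """Composition of k-spaced amino acid pairs for k = 0..max_gap.

    For each gap k, count every pair (fragment[i], fragment[i + k + 1]) and
    divide by the number of such pairs, giving 441 frequencies per gap.
    The fragment must be longer than max_gap + 1 residues.
    """
    pairs = ["".join(p) for p in product(ALPHABET, repeat=2)]
    vec = []
    for k in range(max_gap + 1):
        counts = dict.fromkeys(pairs, 0)
        n_pairs = len(fragment) - k - 1
        for i in range(n_pairs):
            counts[fragment[i] + fragment[i + k + 1]] += 1
        vec.extend(counts[p] / n_pairs for p in pairs)
    return vec

# Toy fragment of length 31 (hypothetical sequence, padded with "X").
fragment = "X" * 14 + "MKTAYIAKQR" + "X" * 7
features = one_hot(fragment) + cksaap(fragment)
```

For a fragment of length 31, these two functions return vectors of length 651 and 2205 respectively, matching the dimensions reported for these encodings.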
During neural network training, we used the Rectified Linear Unit (ReLU) as the activation function and a softmax loss function in our model. The error backpropagation algorithm and the mini-batch gradient descent algorithm were used to optimize the parameters. As data pass from input to output, the network learns and extracts underlying features of the data. The last layer was the output layer, and the number of neurons in this layer equaled the number of categories. In the output layer we adopted the softmax function, which is commonly used as the activation function for classification. Mini-batch gradient descent trains the model on a small part of the training samples at each step, which reduces the cost of each gradient computation. The optimal batch size was 40. To accelerate gradient descent and suppress oscillation, we added a momentum term when optimizing the weights and biases. To reduce overfitting, we applied dropout in every layer of the neural network except the last. With dropout, not every neuron is fully connected during training, which reduces overfitting and speeds up training. Detailed parameter information about the neural network is shown in Additional file 7: Table S7. The predictor built with this deep learning framework is called DeepAcet.
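The building blocks named in this paragraph can be written down compactly. The sketch below is a plain-Python illustration, not the DeepAcet implementation: the learning rate, momentum coefficient, and dropout rate are placeholder values (the actual settings are in Additional file 7), and only the batch size of 40 comes from the text.

```python
import math
import random

def relu(values):
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, v) for v in values]

def softmax(logits):
    """Convert output-layer scores into class probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_loss(logits, true_class):
    """Cross-entropy of the softmax output for one sample."""
    return -math.log(softmax(logits)[true_class])

def dropout(values, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`
    and rescale the survivors so the expected activation is unchanged."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def momentum_step(weights, grads, velocity, lr=0.01, momentum=0.9):
    """Classical momentum update: v <- momentum * v - lr * g, then w <- w + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

def minibatches(samples, batch_size=40, seed=0):
    """Yield shuffled mini-batches; the last batch may be smaller."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]
```

In a full training loop, each mini-batch would be passed forward through the ReLU layers with dropout, the softmax loss would be backpropagated to obtain gradients, and `momentum_step` would update the weights and biases.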
Measurements of performance
We thank Dr. Jun Ding for his help with programming and data processing. We also thank the three anonymous reviewers, who gave us very valuable suggestions.
This work was supported by grants from the Natural Science Foundation of China (11671032), the Fundamental Research Funds for the Central Universities (No. FRF-TP-17-024A2) and the 2015 National traditional Medicine Clinical Research Base Business Construction Special Topics (JDZX2015299). The funders had no role in the design of the study; the collection, analysis, and interpretation of data; or the writing of the manuscript.
Y.X and Y.Y conceived and designed the experiments. M.W, H.W and Y.Y performed the experiments and data analysis. M.W and Y.X wrote the paper. Y.X and Y.Y revised the manuscript. All authors read and approved the manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare no competing financial interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Audagnotto M, Dal Peraro M. Protein post-translational modifications: in silico prediction tools and molecular modeling. Comput Struct Biotechnol J. 2017;15:307–19.
- Bannister AJ, Miska EA, Gorlich D, Kouzarides T. Acetylation of importin-alpha nuclear import factors by CBP/p300. Curr Biol. 2000;10(8):467–70.
- Deng W, Wang C, Zhang Y, Xu Y, Zhang S, Liu Z, Xue Y. GPS-PAIL: prediction of lysine acetyltransferase-specific modification sites from protein sequences. Sci Rep. 2016;6:39787.
- Meng X, Lv Y, Mujahid H, Edelmann MJ, Zhao H, Peng X, Peng Z. Proteome-wide lysine acetylation identification in developing rice (Oryza sativa) seeds and protein co-modification by acetylation, succinylation, ubiquitination, and phosphorylation. Biochim Biophys Acta. 2018;1866(3):451–63.
- Liu Z, Cao J, Gao X, Zhou Y, Wen L, Yang X, Yao X, Ren J, Xue Y. CPLA 1.0: an integrated database of protein lysine acetylation. Nucleic Acids Res. 2011;39(Database issue):D1029–34.
- Choudhary C, Weinert BT, Nishida Y, Verdin E, Mann M. The growing landscape of lysine acetylation links metabolism and cell signalling. Nat Rev Mol Cell Biol. 2014;15(8):536–50.
- Song J, Lafont A, Chen J, Wu FM, Shirahige K, Rankin S. Cohesin acetylation promotes sister chromatid cohesion only in association with the replication machinery. J Biol Chem. 2012;287(41):34325–36.
- Bannister AJ, Miska EA. Regulation of gene expression by transcription factor acetylation. Cell Mol Life Sci. 2000;57(8–9):1184–92.
- Gu W, Roeder RG. Activation of p53 sequence-specific DNA binding by acetylation of the p53 C-terminal domain. Cell. 1997;90(4):595–606.
- Spange S, Wagner T, Heinzel T, Kramer OH. Acetylation of non-histone proteins modulates cellular signalling at multiple levels. Int J Biochem Cell Biol. 2009;41(1):185–98.
- Yang XJ. The diverse superfamily of lysine acetyltransferases and their roles in leukemia and other diseases. Nucleic Acids Res. 2004;32(3):959–76.
- Geng H, Harvey CT, Pittsenbarger J, Liu Q, Beer TM, Xue C, Qian DZ. HDAC4 protein regulates HIF1alpha protein lysine acetylation and cancer cell response to hypoxia. J Biol Chem. 2011;286(44):38095–102.
- Iyer A, Fairlie DP, Brown L. Lysine acetylation in obesity, diabetes and metabolic disease. Immunol Cell Biol. 2012;90(1):39–46.
- Chen KC, Liao YC, Hsieh IC, Wang YS, Hu CY, Juo SH. OxLDL causes both epigenetic modification and signaling regulation on the microRNA-29b gene: novel mechanisms for cardiovascular diseases. J Mol Cell Cardiol. 2012;52(3):587–95.
- Jonas MC, Costantini C, Puglielli L. PCSK9 is required for the disposal of non-acetylated intermediates of the nascent membrane protein BACE1. EMBO Rep. 2008;9(9):916–22.
- Jeong H, Then F, Melia TJ Jr, Mazzulli JR, Cui L, Savas JN, Voisine C, Paganetti P, Tanese N, Hart AC, et al. Acetylation targets mutant huntingtin to autophagosomes for degradation. Cell. 2009;137(1):60–72.
- Mottet D, Castronovo V. Histone deacetylases: target enzymes for cancer therapy. Clin Exp Metastasis. 2008;25(2):183–9.
- Kim SC, Sprung R, Chen Y, Xu Y, Ball H, Pei J, Cheng T, Kho Y, Xiao H, Xiao L, et al. Substrate and functional diversity of lysine acetylation revealed by a proteomics survey. Mol Cell. 2006;23(4):607–18.
- Choudhary C, Kumar C, Gnad F, Nielsen ML, Rehman M, Walther TC, Olsen JV, Mann M. Lysine acetylation targets protein complexes and co-regulates major cellular functions. Science. 2009;325(5942):834–40.
- Lee TY, Hsu JB, Lin FM, Chang WC, Hsu PC, Huang HD. N-ace: using solvent accessibility and physicochemical properties to identify protein N-acetylation sites. J Comput Chem. 2010;31(15):2759–71.
- Shao J, Xu D, Hu L, Kwan YW, Wang Y, Kong X, Ngai SM. Systematic analysis of human lysine acetylation proteins and accurate prediction of human lysine acetylation through bi-relative adapted binomial score Bayes feature representation. Mol BioSyst. 2012;8(11):2964–73.
- Liu Y, Lin Y. A novel method for N-terminal acetylation prediction. Genomics, Proteomics & Bioinformatics. 2004;2(4):253–5.
- Suo SB, Qiu JD, Shi SP, Sun XY, Huang SY, Chen X, Liang RP. Position-specific analysis and prediction for protein lysine acetylation based on multiple features. PLoS One. 2012;7(11):e49108.
- Ao Li YX, Jin C, Wang M, Yaob X. Prediction of Nε-acetylation on internal lysines implemented in Bayesian Discriminant Method. Biochem Biophys Res Commun. 350(4):818–24.
- Hou T, Zheng G, Zhang P, Jia J, Li J, Xie L, Wei C, Li Y. LAceP: lysine acetylation site prediction using logistic regression classifiers. PLoS One. 2014;9(2):e89575.
- Kiemer L, Bendtsen JD, Blom N. NetAcet: prediction of N-terminal acetylation sites. Bioinformatics. 2005;21(7):1269–70.
- Xu Y, Ding YX, Deng NY, Liu LM. Prediction of sumoylation sites in proteins using linear discriminant analysis. Gene. 2016;576(1 Pt 1):99–104.
- Chen QY, Tang J, Du PF. Predicting protein lysine phosphoglycerylation sites by hybridizing many sequence based features. Mol BioSyst. 2017;13(5):874–82.
- Vacic V, Iakoucheva LM, Radivojac P. Two sample logo: a graphical representation of the differences between two sets of sequence alignments. Bioinformatics. 2006;22(12):1536–7.
- Angermueller C, Parnamaa T, Parts L, Stegle O. Deep learning for computational biology. Mol Syst Biol. 2016;12(7):878.
- Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, Ferrero E, Agapow PM, Zietz M, Hoffman MM, et al. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface. 2018;15(141).
- Voulodimos A, Doulamis N, Doulamis A, Protopapadakis E. Deep learning for computer vision: a brief review. Comput Intell Neurosci. 2018;2018:7068349.
- Zorzi M, Testolin A, Stoianov IP. Modeling language and cognition with deep unsupervised learning: a tutorial overview. Front Psychol. 2013;4:515.
- Shen D, Wu G, Suk HI. Deep learning in medical image analysis. Annu Rev Biomed Eng. 2017;19:221–48.
- Cui Y, Zhang M, Pestell R, Curran EM, Welshons WV, Fuqua SA. Phosphorylation of estrogen receptor alpha blocks its acetylation and regulates estrogen sensitivity. Cancer Res. 2004;64(24):9199–208.
- Zhou W, Capello M, Fredolini C, Piemonti L, Liotta LA, Novelli F, Petricoin EF. Mass spectrometry analysis of the post-translational modifications of alpha-enolase from pancreatic ductal adenocarcinoma cells. J Proteome Res. 2010;9(6):2929–36.
- Liu D, Liu C, Li J, Azadzoi K, Yang Y, Fei Z, Dou K, Kowall NW, Choi HP, Vieira F, et al. Proteomic analysis reveals differentially regulated protein acetylation in human amyotrophic lateral sclerosis spinal cord. PLoS One. 2013;8(12):e80779.
- Zhao S, Xu W, Jiang W, Yu W, Lin Y, Zhang T, Yao J, Zhou L, Zeng Y, Li H, et al. Regulation of cellular metabolism by protein lysine acetylation. Science. 2010;327(5968):1000–4.
- Liu Z, Wang Y, Gao T, Pan Z, Cheng H, Yang Q, Cheng Z, Guo A, Ren J, Xue Y. CPLM: a database of protein lysine modifications. Nucleic Acids Res. 2014;42(Database issue):D531–6.
- Xu Y, Wen X, Wen LS, Wu LY, Deng NY, Chou KC. iNitro-Tyr: prediction of nitrotyrosine sites in proteins with general pseudo amino acid composition. PLoS One. 2014;9(8):e105018.
- Gao J, Wang Z, Yang Y, Zhang W, Tao C, Guan J, Rao N. A novel approach for lie detection based on F-score and extreme learning machine. PLoS One. 2014;8(6):e64704.
- Richard HR, Hahnloser RS, Misha A, Mahowald Rodney J, Douglas H, Seung S. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature. 2000;405:947–51.
- Nahid AA, Mehrabi MA, Kong Y. Histopathological breast Cancer image classification by deep neural network techniques guided by local clustering. Biomed Res Int. 2018;2018:2362108.
- Li Y, Fu Y, Li H, Zhang S-W. The Improved Training Algorithm of Back Propagation Neural Network with Self-adaptive Learning Rate. 2009:73–76.