Pripper: prediction of caspase cleavage sites from whole proteomes
© Piippo et al. 2010
Received: 8 March 2010
Accepted: 15 June 2010
Published: 15 June 2010
Caspases are a family of proteases that have central functions in programmed cell death (apoptosis) and inflammation. Caspases mediate their effects through aspartate-specific cleavage of their target proteins, and at present almost 400 caspase substrates are known. Several methods have been developed to predict caspase cleavage sites in individual proteins, but currently none of them can be used to predict caspase cleavage sites in multiple proteins or entire proteomes, or to use several classifiers in combination. The ability to create a database of predicted caspase cleavage products for the whole genome could significantly aid in identifying novel caspase targets from tandem mass spectrometry based proteomic experiments.
Three different pattern recognition classifiers were developed for predicting caspase cleavage sites from protein sequences. Evaluation of the classifiers with quality measures indicated that all three classifiers performed well in predicting caspase cleavage sites, and combining different classifiers increased the accuracy further. A new tool, Pripper, was developed to utilize the classifiers and predict the caspase cut sites from an arbitrary number of input sequences. A database was constructed with the developed tool, and it was used to identify caspase target proteins from tandem mass spectrometry data from two different proteomic experiments. Both known and novel caspase cleavage products were identified using the database, demonstrating the usefulness of the tool. Pripper is not restricted to predicting only caspase cut sites: it can scan protein sequences for any given motif(s) and predict cut sites once a suitable cut site prediction model for any other protease has been developed. Pripper is freely available and can be downloaded from http://users.utu.fi/mijopi/Pripper.
We have developed Pripper, a tool for reading an arbitrary number of proteins in FASTA format, predicting their caspase cleavage sites and outputting the cleaved sequences to a new FASTA format sequence file. We show that Pripper is a valuable tool in identifying novel caspase target proteins from modern proteomics experiments.
Caspases form a protein family of aspartate-specific proteases. Altogether 15 different caspases have been identified in mammals, and they are grouped into two major subfamilies, inflammatory and apoptotic caspases. Caspases mediate their effects through aspartate-specific cleavage of their target proteins, and several hundred caspase target proteins have been identified. This caspase cascade plays a central role in the induction and transduction of apoptotic signals, as well as in the regulation of immunity, cellular proliferation and differentiation.
Typically, caspases recognize a four amino acid long motif P4P3P2P1 in their target sequences. Most often the last amino acid, P1, is Asp (D), but in some rare cases it is Glu (E), Gly (G) or Ala (A) [3, 4]. Studies of the amino acids in the motif have revealed that the motif is only moderately conserved between target proteins of different caspases. Caspases 1, 4, 5 and 13 tend to recognize the motif (W/L)EHD, caspases 2, 3 and 7 the motif DEXD, and caspases 6, 8, 9 and 10 the motif (I/L/V)E(H/T)D [4, 5]. Although a certain protein might have several caspase cleavage sites, recent studies indicate that it is cut at a single site at a time. The known caspase cut sites do not exhibit any preference for location in functional domains of proteins.
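The consensus motifs above can be expressed as regular expressions and used to locate candidate P1 positions. The following is a minimal sketch, not Pripper's own motif scanner: the class name CaspaseMotifScan, the group labels and the reading of X in DEXD as "any residue" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaspaseMotifScan {
    // Consensus motifs from the text written as regular expressions.
    // The map keys are illustrative labels, and "." stands for X (any residue).
    static final Map<String, Pattern> MOTIFS = new LinkedHashMap<>();
    static {
        MOTIFS.put("caspases 1/4/5/13", Pattern.compile("[WL]EHD"));
        MOTIFS.put("caspases 2/3/7", Pattern.compile("DE.D"));
        MOTIFS.put("caspases 6/8/9/10", Pattern.compile("[ILV]E[HT]D"));
    }

    // Returns the 1-based positions of the P1 residue for every motif hit.
    // The search restarts one residue after each hit so overlapping hits are kept.
    static List<Integer> p1Positions(String sequence, Pattern motif) {
        List<Integer> hits = new ArrayList<>();
        Matcher m = motif.matcher(sequence);
        int from = 0;
        while (m.find(from)) {
            hits.add(m.start() + 4); // motif is 4 residues long; P1 is its last residue
            from = m.start() + 1;
        }
        return hits;
    }

    public static void main(String[] args) {
        String toySequence = "AAADEVDGGLEHDK"; // made-up sequence, not a real protein
        for (Map.Entry<String, Pattern> e : MOTIFS.entrySet()) {
            System.out.println(e.getKey() + ": " + p1Positions(toySequence, e.getValue()));
        }
    }
}
```

Note that a site such as LEHD matches two of the three patterns, which illustrates why the motif alone only moderately discriminates between caspases.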
The prediction of caspase cleavage sites from proteins is not a trivial task due to the heterogeneity of target sequences, and several different tools have been developed to predict caspase cleavage sites from individual proteins. PeptideCutter (http://www.expasy.org/tools/peptidecutter/), PEPS, CasPredictor and GraBCas use scoring matrices to score different positions and amino acids at and near the caspase cleavage sites. A specific site in the sequence is classified either as a caspase cut site or as a non-cut site based on the score. In addition, three tools using pattern recognition methods have been developed. They use either support vector machines (CASVM) [4, 10] or neural networks. The recently described Cascleave uses the SVM method, but in addition to the primary sequence data of the proteins it also utilizes some structural features to predict caspase cleavage sites. In addition, a two-step model has been suggested that first uses either CASVM or GraBCas and then a scoring method to increase the positive predictive value of the classifiers. These prediction tools are valuable in offering a method to predict caspase cut sites for a single protein. However, there are currently no freely downloadable tools that can predict cut sites for several proteins simultaneously, e.g. for entire proteomes in a single run, or combine several different prediction methods. The availability of caspase cut sites for the whole genome could significantly aid in identifying caspase target sequences from tandem mass spectrometry data of biological samples.
We have developed a new tool, Pripper (Protein snipper), for predicting caspase cut sites for an arbitrary number of protein sequences. Three different pattern recognition classifiers were trained to predict caspase cut sites. The first is based on the support vector machine, the second on random forests and the third on the J48 algorithm. One additional classifier (Vote), which is capable of combining the prediction results from selected classifiers, was also constructed in the tool. Pripper is not restricted to predicting only caspase cut sites: it can scan protein sequences for any given motif(s) and predict cut sites once a suitable cut site prediction model for any other protease has been developed. A newly created classifier can be incorporated in the tool with minor programming effort.
Supervised learning methods were used to form recognizers for caspase cut sites in Pripper. These methods use a training data set of candidate cut sites whose classification is known, along with a number of features connected to these candidates. A robust way to train and evaluate a classifier is the leave-one-out method, in which the classifier is trained as many times as there are objects in the training data set. Each time, one object is left out of the training set and is used to test the classifier formed on the reduced training set. This way each object is used once for testing, and the combined results are used to evaluate the performance of the trained classifier.
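The leave-one-out procedure described above can be sketched as follows. This is an illustrative outline under stated assumptions rather than the authors' code: the Sample type, the evaluate method and the idea of passing the training routine as a function are all hypothetical names and design choices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

class LeaveOneOut {
    // One labelled candidate site: the residues around it and its known class.
    static final class Sample {
        final String window;
        final boolean isCutSite;
        Sample(String window, boolean isCutSite) {
            this.window = window;
            this.isCutSite = isCutSite;
        }
    }

    // Trains on all samples but one, tests on the held-out sample, and repeats
    // for every sample. Returns the counts {TP, FP, TN, FN}.
    static int[] evaluate(List<Sample> data,
                          Function<List<Sample>, Predicate<String>> trainer) {
        int[] counts = new int[4];
        for (int i = 0; i < data.size(); i++) {
            List<Sample> training = new ArrayList<>(data);
            Sample heldOut = training.remove(i); // reduced training set
            boolean predicted = trainer.apply(training).test(heldOut.window);
            if (predicted && heldOut.isCutSite) counts[0]++;       // true positive
            else if (predicted) counts[1]++;                        // false positive
            else if (!heldOut.isCutSite) counts[2]++;               // true negative
            else counts[3]++;                                       // false negative
        }
        return counts;
    }
}
```

Any training routine that maps a list of samples to a yes/no predictor can be plugged in as the trainer argument; with n samples it is invoked n times, which is why leave-one-out is practical mainly for modest training sets such as the one used here.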
The training sequences for known caspase cut sites were acquired from published material [3, 6, 16]. Species other than human were excluded from the sequence sets because of their smaller number of known cut sites. The sequences were downloaded using the EBI Dbfetch tool (http://www.ebi.ac.uk/cgi-bin/dbfetch). It was verified that each of the downloaded sequences contained the published motif at the published cut site.
Altogether, 443 positive cut site samples were gathered from 358 different proteins. Negative training sequences were generated from the same sequences that were used for the positive sequences. The 443 negative sites were selected at positions of Asp (D) that were not detected as caspase cut sites. The positive and negative sequences used are listed in the Supplementary material (Additional file 1: Training set). All the sequences used in the training set were unique.
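As an illustration of how negative candidates can be derived from the same proteins, the sketch below lists every Asp (D) position that is not a known cut site. The class and method names are hypothetical, and how the 443 negatives were chosen among such candidates is not described in the text, so no sampling step is shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class NegativeSites {
    // Returns 1-based positions of Asp (D) residues that are not listed as
    // known cut sites (known sites are given as 1-based P1 positions).
    static List<Integer> candidates(String sequence, Set<Integer> knownCutSites) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < sequence.length(); i++) {
            int position = i + 1;
            if (sequence.charAt(i) == 'D' && !knownCutSites.contains(position)) {
                out.add(position);
            }
        }
        return out;
    }
}
```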
Three different pattern recognition classifiers were trained to predict caspase cut sites from proteins. The first classifier was implemented with Support Vector Machines (SVM) [13, 17]. SVMs are classifiers that are based on the maximization of the margin between the classes. The data are considered as n-dimensional vectors, and the algorithm finds a hyperplane that separates the vectors of different classes with a maximal margin. A kernel function can be used to map vectors of the original feature space to a higher dimensional space in which the classes may become linearly separable. Feature vectors consisted of a fixed number of amino acids on both sides of the cutting site encoded in a numerical form. Each amino acid in the sequence was represented as an array of length 20, with one element for each of the 20 different amino acids. Only one element was set to one and the rest to zero, and the position of the one identifies the amino acid in question. For example, the amino acid Ala (A) was encoded by a vector [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] and Val (V) as [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]. A similar encoding has been used in other methods utilizing SVMs [4, 10]. The LIBSVM library was used to train the SVM classifier.
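The encoding described above can be sketched as follows. The residue order in ALPHABET is an assumption (the text only shows that Val maps to the first element and Ala to the fifth, so the tool's actual order differs), and leaving positions beyond the sequence ends as all-zero blocks is likewise an illustrative choice, not a documented behaviour of Pripper.

```java
class OneHotWindow {
    // Assumed residue order, used only for illustration.
    static final String ALPHABET = "ACDEFGHIKLMNPQRSTVWY";

    // Encodes `before` residues up to and including P1 plus `after` residues
    // following the cut site. p1Index is the 0-based index of the P1 residue.
    // Each residue becomes a block of 20 values with a single 1.0.
    static double[] encode(String sequence, int p1Index, int before, int after) {
        double[] features = new double[(before + after) * ALPHABET.length()];
        int start = p1Index - before + 1;
        for (int k = 0; k < before + after; k++) {
            int pos = start + k;
            if (pos < 0 || pos >= sequence.length()) continue; // outside sequence: zeros
            int residue = ALPHABET.indexOf(sequence.charAt(pos));
            if (residue >= 0) {
                features[k * ALPHABET.length() + residue] = 1.0;
            }
        }
        return features;
    }

    public static void main(String[] args) {
        // A 6-before, 4-after window yields a vector of (6 + 4) * 20 = 200 values.
        double[] v = encode("MKAADEVDGSAAK", 7, 6, 4);
        System.out.println(v.length);
    }
}
```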
The second trained classifier was the J48 method, which is implemented in the Weka classification library. It is a version of the C4.5 algorithm, a decision-tree-based classifier. The training data are organized as subtrees based on the selected feature that most effectively splits the data set. Information entropy is used to calculate an impurity value, the best split is used to split the tree, and the same procedure is iterated until the tree cannot be divided any further. Typically, a pruning method is also used to remove uninformative nodes of the tree, to improve the prediction accuracy and to avoid over-fitting. The leaves of the tree represent the classification results.
The third method used was Random Forest (RF), which is based on training a collection of trees, called a forest, using randomly selected features for splitting the decision tree during the tree growing process. During classification, each sample is classified with each tree in the forest, and the final output is given as the vote of all the trees. The class that gets the most votes is selected as the class output for the sample.
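To make the entropy-based split selection concrete, the sketch below computes the information gain obtained by splitting a set of labelled windows on the residue at one position. It illustrates the general idea only: the names are hypothetical, Weka's J48 is not called, and C4.5 additionally normalizes the gain (gain ratio), which is omitted here.

```java
import java.util.HashMap;
import java.util.Map;

class EntropySplit {
    // Binary entropy (in bits) of a node holding the given class counts.
    static double entropy(int positives, int negatives) {
        int total = positives + negatives;
        if (total == 0) return 0.0;
        double h = 0.0;
        for (int count : new int[] {positives, negatives}) {
            if (count > 0) {
                double p = (double) count / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // Entropy of the parent node minus the size-weighted entropy of the child
    // nodes created by grouping windows on the residue at `position`.
    static double informationGain(String[] windows, boolean[] labels, int position) {
        Map<Character, int[]> children = new HashMap<>();
        int pos = 0, neg = 0;
        for (int i = 0; i < windows.length; i++) {
            int[] counts = children.computeIfAbsent(windows[i].charAt(position), k -> new int[2]);
            if (labels[i]) { counts[0]++; pos++; } else { counts[1]++; neg++; }
        }
        double remainder = 0.0;
        for (int[] c : children.values()) {
            remainder += (double) (c[0] + c[1]) / windows.length * entropy(c[0], c[1]);
        }
        return entropy(pos, neg) - remainder;
    }
}
```

A position where every sample carries the same residue has zero gain and is never chosen for a split.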
Optimized parameter values for trained classifiers.
SVM-6-4: kernel function parameter γ; error parameter ν; stopping criterion ε; sequence length 6 before cut site, 4 after cut site
RF-12-12: number of features; number of trees; sequence length 12 before cut site, 12 after cut site
J48-4-2: minimum number of objects; number of folds; reduced error pruning; sequence length 4 before cut site, 2 after cut site
Each classifier was trained using the leave-one-out method. The number of correct positive classifications (true positives, TP), incorrect positive classifications (false positive, FP), correct negative classifications (true negatives, TN) and incorrect negative classifications (false negatives, FN) were counted from the leave-one-out training. Using these values the following measures for classifier evaluation were calculated.
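The formulas themselves are not reproduced in this text. As a hedged illustration, the sketch below computes measures that are commonly derived from these four counts (sensitivity, specificity, accuracy and the Matthews correlation coefficient); that these are the measures the authors reported is an assumption, and the counts in main are made up.

```java
class QualityMeasures {
    // Fraction of true cut sites that were predicted as cut sites.
    static double sensitivity(int tp, int fn) {
        return (double) tp / (tp + fn);
    }

    // Fraction of non-cut sites that were predicted as non-cut sites.
    static double specificity(int tn, int fp) {
        return (double) tn / (tn + fp);
    }

    // Fraction of all predictions that were correct.
    static double accuracy(int tp, int fp, int tn, int fn) {
        return (double) (tp + tn) / (tp + fp + tn + fn);
    }

    // Matthews correlation coefficient; defined as 0 when the denominator is 0.
    static double mcc(int tp, int fp, int tn, int fn) {
        double numerator = (double) tp * tn - (double) fp * fn;
        double denominator = Math.sqrt(
                (double) (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
        return denominator == 0.0 ? 0.0 : numerator / denominator;
    }

    public static void main(String[] args) {
        // Hypothetical counts for illustration only, not results from the paper.
        int tp = 40, fp = 10, tn = 40, fn = 10;
        System.out.println(sensitivity(tp, fn));
        System.out.println(specificity(tn, fp));
        System.out.println(accuracy(tp, fp, tn, fn));
        System.out.println(mcc(tp, fp, tn, fn));
    }
}
```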
The application for generating the cut sites from the given set of proteins was implemented in Java. The application uses the LIBSVM and Weka version 3.5.7 libraries to load the trained classification models and to use them in the classification of the protein sequences. Each classifier can be used separately to predict the caspase cut sites, or a Vote-classifier can be used. In the Vote-classifier the user selects the desired classifiers, the tool predicts the cut sites with each selected classifier, and a site is classified as a cut site only if the majority of the selected classifiers predict it to be a cut site.
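The majority rule of the Vote-classifier can be sketched as below. This is a simplified illustration, not Pripper's implementation: each classifier is modelled as a hypothetical Predicate over a sequence window, whereas the real classifiers work on differently sized windows and are loaded from LIBSVM and Weka models.

```java
import java.util.List;
import java.util.function.Predicate;

class VoteClassifier {
    // A site counts as a cut site only if strictly more than half of the
    // selected classifiers predict it to be one.
    static boolean isCutSite(String window, List<Predicate<String>> classifiers) {
        int votes = 0;
        for (Predicate<String> classifier : classifiers) {
            if (classifier.test(window)) votes++;
        }
        return votes * 2 > classifiers.size();
    }
}
```

With exactly two classifiers selected, a strict majority means that both must agree before a site is reported as a cut site.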
Two different protein samples were used to test the performance of the created caspase cleavage databases in the analysis of tandem mass spectrometry (MS/MS) data. One of the samples was a mitochondrial proteome of human keratinocytes subjected to two-dimensional electrophoresis (2-DE) based proteome analysis. The other sample was a mitochondrial fraction of influenza A virus infected human primary macrophages analyzed using an in-solution digestion and iTRAQ-labeling protocol (Applied Biosystems). In both samples, proteins were digested with trypsin and the resulting peptides were analyzed by LC-MS/MS using an Ultimate 3000 nano-LC (Dionex) and a QSTAR Elite hybrid quadrupole TOF-MS (Applied Biosystems/MDS Sciex) with nano-ESI ionization. The LC-MS/MS data were searched with in-house Mascot version 2.1 through the ProteinPilot 2.0.1 interface. The Mascot search criteria were trypsin digestion with one missed cleavage allowed; carbamidomethyl modification of cysteine (2-DE sample) or methylthio modification of cysteine and 4plex iTRAQ labelling of lysine and the peptide N-terminus (iTRAQ sample) as fixed modifications; methionine oxidation as a variable modification; precursor ion mass tolerance of 50 ppm; fragment ion mass tolerance of 0.2 Da; and peptide charge state of +1, +2 or +3. An identification threshold of p < 0.05 was used, with the additional requirement of at least one unique peptide identification for each protein ("bold red").
Caspases recognize a four amino acid long motif P4P3P2P1 in their target sequences; however, the prediction of caspase cleavage sites from proteins is a non-trivial task due to the heterogeneity of the target sequences. Three different classifiers were trained in the present research to predict caspase cleavage sites in protein sequences. Support vector machines have already been widely used, and they have turned out to be effective in various biological classification problems, including caspase cleavage site prediction [4, 10]. Therefore, the SVM was selected here as one of the trained classifiers. In addition, two different tree-based classifiers were trained, the Random Forest (RF) and J48. Tree-based methods have been used more rarely in biological classification problems, but RF has shown some promising results in other contexts and was therefore selected as one of the trained models. One advantage of decision-tree based models is that they output the decision tree and show the exact decision rules to the user. Since RF makes its decision on the basis of several trees, it cannot give a single, easily readable decision tree as an output. Therefore, a pure decision tree model, J48, which is capable of producing a single decision tree as an output, was also selected as one of the classifiers.
Training of the classifiers produced the best sequence length to be used in the prediction for each classifier. The best protein sequence length differed between the classifiers, from 4 to 12 amino acids before and 2 to 12 amino acids after the cut site. The best sequence length for the SVM classifier was 6 amino acids before the cut site and 4 after it, and therefore the trained classifier was named SVM-6-4. Correspondingly, the two other classifiers are called RF-12-12 and J48-4-2. It is noteworthy that the different classifiers have different optimized lengths of protein sequences, indicating that the amino acids surrounding the cut sites are very heterogeneous. Based on these data and previously published results, it is clear that the reliable prediction of caspase cleavage sites requires information on several amino acids surrounding the motif. The optimized lengths were quite similar to those of the best CASVM classifier, which takes into account 14 amino acids before and 10 after the cut site (CASVM P14-P10'). CASVM has also been optimized for shorter lengths containing only the cut site motif (P4-P1) or the cut site motif and the two following amino acids (P4-P2').
Although the J48 classifier was not the most efficient, its decision tree can provide important insight into caspase cleavage sites (Additional File 2 - J48-4-2 Decision Tree). The J48-4-2 classifier was trained with a sequence length of 4 amino acids before the cut site and 2 amino acids after the cut site (X1X2X3X4|X5X6). The decision tree shows that the classifier uses information from 5 amino acids. The most important features for the prediction are the amino acid that follows the cut site (X5) and the first amino acid of the cut site motif (X1). The last residue of the cut site motif, X4, is not part of the tree, but this can be explained by the fact that all negative samples were designed to contain Asp (D) in that position, leaving this residue uninformative.
Quality measures of trained classifiers and comparison with publicly available caspase cut site prediction models.
A new tool, Pripper, was developed to read an input file of protein sequences, to scan each protein sequence and predict cleavage sites for each protein and to write the predicted cleavage products to a new data file. Pripper reads input data and outputs a data file of protein sequences in FASTA format. The user can choose the motifs that the sequences are searched for, and classification is performed only to those sequence regions. If the user does not give any motifs, Pripper scans the whole length of the protein sequence residue by residue and performs the classification.
The user can also choose whether all possible cleavage product combinations (cut at single or multiple sites) are written to the output file, or whether each of the sequences is cut at a single site at a time or at all of the predicted sites simultaneously. The user can also choose whether the full sequence is written to the output file or not. The tool also allows a list of known cut sites to be used instead of the prediction. If a protein has user-given cleavage sites, new cleavage sites are not predicted for that protein; the given sites are used instead, and the cleavage products are written to the output file based only on those given cut sites. Pripper is freely available at http://users.utu.fi/mijopi/Pripper.
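The three output modes can be illustrated with the sketch below, which derives cleavage products from a sequence and a list of cut positions. The class and method names are hypothetical, and the reading of "all possible combinations" as every fragment between any two boundaries is our interpretation, not a documented detail of Pripper. Cut positions are assumed to be sorted, distinct, 1-based P1 positions lying strictly inside the sequence.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class CleavageProducts {
    // One cut at a time: each site yields an N-terminal and a C-terminal product.
    static List<String> singleSite(String seq, List<Integer> cutAfter) {
        List<String> out = new ArrayList<>();
        for (int p : cutAfter) {
            out.add(seq.substring(0, p));
            out.add(seq.substring(p));
        }
        return out;
    }

    // All sites cut simultaneously: consecutive fragments between the cut sites.
    static List<String> allSites(String seq, List<Integer> cutAfter) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int p : cutAfter) {
            out.add(seq.substring(start, p));
            start = p;
        }
        out.add(seq.substring(start));
        return out;
    }

    // Every fragment between any two boundaries (sequence start, cut sites,
    // sequence end), excluding the uncut full-length sequence.
    static Set<String> allCombinations(String seq, List<Integer> cutAfter) {
        List<Integer> bounds = new ArrayList<>();
        bounds.add(0);
        bounds.addAll(cutAfter);
        bounds.add(seq.length());
        Set<String> out = new LinkedHashSet<>();
        for (int i = 0; i < bounds.size(); i++) {
            for (int j = i + 1; j < bounds.size(); j++) {
                boolean fullLength = (i == 0 && j == bounds.size() - 1);
                if (!fullLength) out.add(seq.substring(bounds.get(i), bounds.get(j)));
            }
        }
        return out;
    }
}
```

With k cut sites, the single-site mode writes 2k products, the simultaneous mode k + 1, and the combination mode up to (k + 1)(k + 2)/2 - 1, which is why the choice of mode strongly affects the size of the resulting search database.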
Predicted caspase cut sites of the human protein sequences downloaded from the UniProtKB database.
Reported for each classifier: predicted number of cut proteins; % of all proteins (96 123); average number of cut sites per sequence; predicted cut sites total; % of all possible cut sites (1 749 441).
SVM-6-4 predicted that 73.2% of all the proteins contained a cut site. RF-12-12 predicted a similar proportion, 76.1%, but J48-4-2 predicted that more than 80% of the proteins contained a cut site. The Vote-classifier was set to perform the classification with both SVM-6-4 and RF-12-12, so that it predicts a cut site only if both classifiers predict the site to be cut. This classifier predicted approximately 69% of the proteins to have at least one caspase cleavage site. Based on this analysis, the human proteome contains approximately 66 500 caspase targets. The cleavage products from human protein sequences that were predicted by the Vote-classifier are available in Additional file 3.
Currently, only a few hundred caspase target proteins are known [3, 10], but it has been suggested that cells could contain thousands of caspase targets. Since the trained classifiers are based on the primary sequence data of proteins, it is possible that some putative caspase target sites reside in protein regions that are not accessible to caspases once the protein has folded into its three-dimensional structure. The caspase cut sites are not linked to any functional protein domains. However, it has been discovered that many of the caspase cleavage sites occur in loops and in α-helical regions, as well as in PEST-regions. PEST-regions are rich in Ser (S), Thr (T), Pro (P), Glu (E) and Asp (D) and form non-structural regions that are often found in unstable proteins that are susceptible to proteolysis. Therefore, the prediction efficiency of the classifiers might be improved by taking into account the secondary or higher-order structure of proteins.
We have developed a new tool, Pripper, for reading an arbitrary number of proteins in FASTA format, predicting their caspase cleavage sites and outputting the cleaved sequences to a new FASTA format sequence file. The generated sequence file can be used e.g. as a database for searching tandem mass spectrometry data, allowing the identification of caspase cleavage products. Three different pattern recognition classifiers, SVM-6-4, RF-12-12 and J48-4-2, were trained to predict caspase cleavage sites. The evaluation of the classifiers with the ROC curve (Figure 1) and quality measures (Table 2) indicated that all three classifiers performed well in predicting caspase cleavage sites. The comparison of our method to existing caspase cleavage classifiers showed that the best performing classifiers were SVM-6-4, RF-12-12 and Vote developed in this study and the previously published CASVM classifier with the longest prediction sequence (P14-P10').
The developed tool was used to construct a database of caspase cleavage products. Each possible Asp (D) amino acid and its surrounding amino acids were evaluated with SVM-6-4 and RF-12-12 classifiers and a site was predicted as cleaved only if both of the methods predicted the site to be a caspase cleavage site (Vote-classifier in the tool). The created database contains the predicted caspase cleavage products, and it was used to identify caspase cleavage products from tandem mass spectrometry data from real biological samples. Here we have shown that Pripper is a valuable tool in identifying novel caspase target proteins from mass spectrometry based proteomics experiments.
Being a standalone application, Pripper does not rely on web connections or depend on server availability, and thus only the local machine limits the maximum input file size and processing capacity. It can restrict processing to user-given motif patterns in the input sequences and produce the desired cleavage products, and it enhances the prediction results by combining different classifiers. In addition, the implemented application is not restricted to predicting caspase cut sites: it also provides a framework for scanning protein sequences for given motifs and predicting cut sites of any protease once a suitable cut site prediction model for that protease has been developed.
Project name: Protein Snipper (Pripper)
Project home page: http://users.utu.fi/mijopi/Pripper/
Operating system(s): Windows
Programming language: Java
Other requirements: Java version 1.6.0 or higher, libsvm, Weka, BioJava
Any restrictions to use by non-academics: none
SVM: Support Vector Machine
Funding: This work was supported by the Academy of Finland [grant number 114437], the Sigrid Juselius Foundation and the Helsinki Graduate School in Biotechnology and Molecular Biology.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.