Machine learning approach to predict protein phosphorylation sites by incorporating evolutionary information

Background Most existing in silico phosphorylation site prediction systems use a machine learning approach, which requires a good set of classification data in order to build the classification knowledge. Furthermore, phosphorylation is catalyzed by kinase enzymes, and hence the kinase information of the phosphorylated sites has been used as the major classification data in most existing systems. Since the number of kinase annotations in protein sequences is far smaller than the number of proteins sequenced to date, prediction systems that rely on information from the small clique of kinase-annotated proteins cannot be considered reliable for predictions outside the clique. Hence these systems are not generalized. In this paper, a novel generalized prediction system, PPRED (Phosphorylation PREDictor), is proposed that ignores the kinase information and uses only the evolutionary information of proteins for classifying phosphorylation sites. Results Experimental results based on cross validations and an independent benchmark reveal the significance of using the evolutionary information alone to classify phosphorylation sites from protein sequences. The prediction performance of the proposed system is better than that of the existing prediction systems that also do not incorporate kinase information. The system is also comparable to systems that incorporate kinase information in predicting such sites. Conclusions The approach presented in this paper provides an efficient way to identify phosphorylation sites in a given protein primary sequence. This is valuable information for molecular biologists working on protein phosphorylation sites and for bioinformaticians developing generalized prediction systems for post-translational modifications such as phosphorylation or glycosylation. PPRED is publicly available at http://www.cse.univdhaka.edu/~ashis/ppred/index.php.


Background
One of the most critical cellular phenomena is the phosphorylation of proteins, as it is involved in the signal transduction of various processes including cell cycle, proliferation and apoptosis [1][2][3]. This phenomenon is catalyzed by protein kinases affecting certain acceptor residues (serine, threonine and tyrosine) in substrate sequences. A study using 2D-gel electrophoresis showed that 30-50% of the proteins in a eukaryotic cell had undergone phosphorylation [4]. Accurate prediction of the phosphorylation sites of eukaryotic proteins may therefore help in understanding the overall intracellular activities.
Both experimental and computational methods have been developed to investigate phosphorylation sites. However, in vivo and in vitro methods are often time-consuming and expensive, and have very limited scope because of restrictions on many enzymatic reactions. In silico prediction of phosphorylation sites, on the other hand, may provide fast and automatic annotation of candidate phosphorylation sites. In addition, there are web servers that provide experimentally determined phosphorylation sites obtained from in vivo or in vitro experiments. For example, PHOSIDA [5] was developed as a phosphorylation site database integrating thousands of high-confidence in vivo phosphorylated sites identified by mass spectrometry-based proteomics in five different species (Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans and Saccharomyces cerevisiae).
A range of in silico predictors has been developed using different machine learning techniques. For example, PPSP was developed by applying Bayesian decision theory [6]. It can predict the phosphorylation sites for about 70 phospho-kinase groups. The training dataset of PPSP was collected from Phospho.ELM (version 2, September 2004) [7], and the phosphorylation sites without kinase information were filtered out, preserving only 1400 significant kinase-specific phosphorylated sites. DISPHOS was developed using a dataset from SWISS-PROT with phosphorylation annotations on eukaryotic proteins [8], resulting in a total of 1500 such phosphorylation sites. In the prediction system called NetPhosK, the six serine/threonine kinases with the largest number of known acceptor sites annotated in PhosphoBase [9] were identified. For each of the six kinases, 22 to 258 different substrate sites were considered. The information derived from the sequence logos of each of the groups was then incorporated to train a neural network [10]. KinasePhos was developed using PhosphoBase [9] and the SWISS-PROT (rel. 45) protein dataset, in which only 1163 sites were found to have kinase annotations [11]. Several kinase groups were split into smaller subgroups using maximal dependence decomposition. Each of the subgroups was then used separately in the training phase to build a profile hidden Markov model. Scansite 2.0 identifies short sequence motifs recognized for phosphorylation on serine, threonine or tyrosine residues [12]. In this case, many of the motifs were determined using oriented peptide library experiments. The peptides that were phosphorylated by the kinase enzymes were isolated and sequenced as an ensemble by Edman degradation. When sequenced in this manner, each Edman cycle revealed the relative amount of each amino acid residue occurring at the corresponding position. This information was scaled and normalized to obtain a type of PSSM (Position Specific Scoring Matrix).
The PSSM generated in that study did not include evolutionary information, because it was based on a limited number of proteins that were phosphorylated by kinases of the same type only. In such cases, evolutionary links between the protein under consideration and proteins without kinase annotations were not considered. NetPhos [13] is a neural network-based method for predicting potential phosphorylation sites at serine, threonine or tyrosine residues in protein sequences. This system did not consider any kinase-specific information for prediction. The AutoMotif Server AMS [14] performs phosphorylation site predictions based only on local sequence information, for example the preferences of short segments around phosphorylated residues. This server also did not use kinase-specific information during the training and prediction phases. The group-based prediction system GPS [15] classified the protein kinases into a hierarchical structure with four levels (group, family, subfamily and single protein kinase) in preparing its prediction system.
The in silico prediction systems that included kinase information performed particularly well when kinase information of the target proteins was known or species or group specific classification knowledge was known beforehand. For example, experimental studies of phosphorylation in yeast revealed strong preferences for particular kinases for specific substrates and also indicated that predictions based on phosphorylation site patterns on those cases could lead to substantial over-prediction [16].
The current set of phosphorylation site prediction systems has recently been analyzed [17]. The analysis revealed that the existing systems are not generalized, in the sense that they were trained mainly with a limited number of proteins having kinase annotations, which adds noise to the performance of the prediction systems when no kinase information is known. The prediction method (PPRED) proposed in this article overcomes this limitation by incorporating only evolutionary information, namely the PSSM profiles of the proteins, rather than any kinase-specific information. For a protein sequence, the PSSM profile generated by PSI-BLAST (Position Specific Iterated Basic Local Alignment Search Tool) of NCBI describes the likelihood of a particular residue substitution at a specific position based on evolutionary information [18][19][20], and it provides more comprehensive information about proteins than a single sequence [21,22].

Cross Validation Performance
In the training dataset (named A", collected from Phospho.ELM (ver. 8.1) [23]), there were 5724 phosphorylated proteins. The number of positive sites annotated by Phospho.ELM and the number of negative sites annotated in our system for each of the three residues S (serine), T (threonine) and Y (tyrosine) are shown in Table 1. The PSSM profiles of the proteins of the A" dataset provided the training instances for the SVMs. The ratio of the number of negative to positive sites was very large, which would bias the SVM training and lead to most unknown sites being predicted as negative. It was therefore necessary to reduce the number of negative instances. Four separate experiments were performed with training datasets in which the ratio of positive to negative training instances was 1 : 2, 1 : 1.5, 1 : 1 and 1 : 0.5 respectively.
In the first experiment, the training dataset was prepared with a ratio of positive to negative training instances of 1 : 2. To do this, the ratio $r = n/(2p)$ was first calculated, where n was the number of negative training instances and p was the number of positive training instances from the A" dataset. The training dataset was then prepared by selecting every r-th instance from the negative instance set. Three-fold cross validation was then performed on this modified training dataset, separately for five different window sizes (7, 9, 11, 13 and 15) and for each of the three residues (S, T and Y). Similarly, in the second, third and fourth experiments the ratios were set using the analogous formulae $r = n/(1.5p)$, $r = n/p$ and $r = n/(0.5p)$ respectively. Table 2, Table 3, Table 4 and Table 5 show the results of the cross validations. From the results it can be observed that PPRED showed the best balance of specificity and sensitivity when the positive and negative training datasets were of equal size. When the ratio of positive to negative instances was less than 1, the sensitivity dropped but the specificity rose, whereas when the ratio was greater than 1, the opposite trend was observed.

Optimum choice of dataset ratio
From the experimental results shown in Table 2, Table 3, Table 4 and Table 5, it is instructive to examine the corresponding ROC (Receiver Operating Characteristic) plots. Each discrete classifier produces a (false positive rate, true positive rate) pair that corresponds to a single point in ROC space [24]. As the proposed system (PPRED) is a discrete classifier, its output is only a class label (positive or negative). The ROC points of the four experiments for each of the three residues (serine, threonine and tyrosine) are shown in Figures 1, 2 and 3 respectively. Note that when the ratio of positive to negative training instances is 1:1, all the assessment parameters show better figures. PPRED therefore uses model 3, in which the ratio of positive to negative instances is 1:1.

Optimum choice of window size
The result shown in Table 4 reveals the fact that if more features were incorporated in a single instance of training data (that means increasing the window size), the prediction accuracy and sensitivity would increase. Nevertheless, there was a slight drop in the specificity.
But if the window size was increased beyond 15 (i.e., more features were added in an instance), the computational complexity and the required time for the SVMs training would increase exponentially. So the window size 15 can be considered as an optimum choice.

Independent Benchmark Results
The independent benchmark dataset, named B in this study, was collected from the article [17]. It contained 297 phospho-proteins, and Table 6 shows the number of positive and negative sites in these proteins. The authors of that article randomly chose 400 phosphorylation sites from the Phospho.ELM [7] database (three hundred from version 6.0 and one hundred from version 7.0 and UniProt release 11.3). The PSSM profiles of the proteins from the B dataset were used to prepare testing instances, which were given to the SVMs to assess the proposed prediction system using the classification knowledge built during the cross-validation phase. As in the cross-validation phase, separate tests were performed on the testing instances for five different window sizes for each of the three residues, as shown in Table 7. In predicting phospho-serine sites, PPRED showed accuracies of 61.34%, 67.82%, 68.44%, 67.77% and 64.96% for window sizes of 7, 9, 11, 13 and 15 respectively. A similar trend of increasing performance with increasing window size was found for the sensitivity, specificity and Matthews correlation coefficient. With a window size of 15 for S (serine), the system achieved up to 65% accuracy, 72% sensitivity and 65% specificity, with a Matthews correlation coefficient of 0.08. Similar observations were made in testing for T and Y sites. For T (threonine) site prediction with window size 15, the proposed system was 69.87% accurate, 67.06% sensitive and 69.90% specific, with a Matthews correlation coefficient of 0.07; for Y site prediction with window size 15, it was 65% accurate, 76% sensitive and 65% specific, with a Matthews correlation coefficient of 0.11. The independent benchmark test shown in Table 7 also underlines the importance of using more features (increased window size) to achieve better prediction performance.

Comparison with existing systems based on the benchmark
In the article [17], 297 protein entries were randomly extracted from the Phospho.ELM database (ver. 6 and 7), with 211, 85 and 97 phosphorylated sites of serine, threonine and tyrosine respectively, and the performance of five existing prediction systems (PPSP [6], DISPHOS [8], KinasePhos [11], NetPhosK [10] and Scansite 2.0 [12]) was tested with these 393 annotations. Table 6, however, shows that the Phospho.ELM server (ver. 8.1) contains more phosphorylation annotations for those 297 proteins: 923, 239 and 338 positive serine, threonine and tyrosine phosphorylated sites respectively. To compare the proposed system with the existing systems, it was checked whether the system could identify those 393 annotations. Table 8 compares the proposed system with nine existing prediction systems (PPSP [6], DISPHOS [8], KinasePhos [11], NetPhosK [10], Scansite 2.0 [12], AutoMotif Server AMS 2.0 [14], GPS 2.0 [15], PHOSIDA [5], NetPhos [13]) in terms of prediction score (Q3 score), which is the number of correctly identified phosphorylated sites. The window size in PPRED was set to 15, which had shown the better prediction performance. The test result shows that the system can correctly predict 152, 57 and 74 phosphorylated serine, threonine and tyrosine sites out of the 211, 85 and 97 annotated serine, threonine and tyrosine sites respectively of the independent benchmark. From the result it is evident that the proposed method has better accuracy in predicting phosphorylated serine, threonine and tyrosine sites than AutoMotif Server AMS, GPS, NetPhos, PHOSIDA and Scansite 2.0. Table 9, Table 10 and Table 11 show the detailed comparative analysis of the ten prediction systems, including the proposed system (PPRED), in terms of serine, threonine and tyrosine site predictions respectively. Performance parameters such as accuracy (Ac), sensitivity (Sn), specificity (Sp), Matthews correlation coefficient (Mcc) and false positive rate (FPR) are shown in the comparison tables. Each of the comparison tables underlines the competitive performance of the proposed system (PPRED) among the existing systems.
Table footnote: The sensitivity (Sn) and the specificity (Sp) columns of the table reveal that the system using this ratio identifies most of the sites as negative.
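The performance parameters named above can be computed from the four confusion-matrix counts. The following is a minimal sketch using the standard textbook definitions; it assumes (the text does not state the formulae) that these are the definitions used in the comparison tables, and the function name and the example counts are hypothetical.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Standard definitions of Ac, Sn, Sp, Mcc and FPR from confusion counts.

    Assumption: these textbook formulae match the ones behind the tables.
    """
    total = tp + tn + fp + fn
    ac = (tp + tn) / total
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # equals 1 - Sp
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Ac": ac, "Sn": sn, "Sp": sp, "Mcc": mcc, "FPR": fpr}

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=8, tn=6, fp=4, fn=2)
# The pair (FPR, Sn) is the single ROC point of a discrete classifier.
roc_point = (metrics["FPR"], metrics["Sn"])
```

As a check against the text, the 152 correctly predicted serine sites out of 211 give a sensitivity of 152/211, about 72%, which agrees with the serine sensitivity reported for window size 15.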

Discussion
Most of the existing phosphorylation site prediction systems use kinase-specific information of the phosphorylated sites. In those systems, proteins without kinase annotations in the phosphorylation-positive datasets available to date from Phospho.ELM [23] or SWISS-PROT [25] were not considered and hence were filtered out. The present update of the Phospho.ELM dataset (August 12, 2008) shows that only 20% of the positive phosphorylation sites contain kinase annotations, which means that the remaining 80% of the dataset is omitted in the design of the existing kinase-specific prediction systems. Such major truncations ignore some important properties of phosphorylation sites, such as the evolutionary conservation of phospho-proteins. Our hypothesis is that this information is useful in classifying phosphorylation sites. Moreover, evolutionary conservation has been found useful in many other in silico prediction systems, such as the prediction of protein-protein interaction sites [26], the prediction of DNA binding sites in proteins [27] and even motif finding [28].
The outcome of this study gives a direction for developing a phosphorylation site prediction system that uses generalized information (such as evolutionary information) from all phosphorylated proteins rather than partial information obtained from the kinase-annotated proteins. It also suggests that evolutionary conservation can be a good candidate feature for the prediction purpose.
The proposed method (PPRED) was successful in overcoming the limitations of the kinase-specific prediction methods in separating the two classes of proteins, phosphorylated and non-phosphorylated. In fact, our proposed system deliberately omitted the kinase-specific information of the phosphorylation sites to underline the importance of the evolutionary profiles alone in predicting phosphorylation sites. The prediction results also supported the hypothesis that the proposed system, using only the evolutionary information of proteins, can classify phosphorylated and non-phosphorylated sites from given primary protein sequences accurately enough to be used alongside any existing system. In designing the prediction system, all serine, threonine and tyrosine residues that were not annotated as phosphorylated and that were not positioned within a window of size 50 of any positively annotated residue were considered as negative sites. Some of the non-annotated sites treated as negative in our study could, however, be annotated as positive in future experiments; the whole system would then need to be re-trained with the new training data, which would in turn increase the prediction accuracy. The number of positive sites was found to be far smaller than that of the negative sites, and this imbalance biases the assessment of the prediction accuracy. If all the positive and negative sites were used in the training dataset, the system would predict most of the sites as negative. So, to attain good prediction accuracy, a reduction in negative training instances is required, but how far to reduce them is open to debate. In this study, separate experiments showed that the prediction system performs best when the number of negative sites is reduced to equal the number of positive sites.
Table footnote: Here in this table, the sensitivity (Sn) and the specificity (Sp) columns reveal that the system using this ratio identifies most of the sites as negative.
Furthermore, the numbers of serine, threonine and tyrosine sites were not equal either. Three separate prediction modules were therefore built in the proposed PPRED system for detecting the probable sites of the three phosphorylated residues (S, T and Y). For example, whenever a serine site is to be predicted for a phosphorylation event, the corresponding module takes over the job, which avoids the bias introduced by the number of phosphorylated sites of the other residues (in this case, T and/or Y).
Experimental results (Table 8) showed that the prediction score of the proposed system (PPRED) is better than those of the AutoMotif Server AMS, GPS, NetPhos, PHOSIDA and Scansite 2.0 systems. Moreover, PPRED uses only the evolutionary information of proteins in classification, whereas the other existing methods (KinasePhos, NetPhosK, PPSP and DISPHOS) used either kinase group information or many other features to train their corresponding machine learning programs. In this respect, the performance of PPRED is comparable to that of those prediction systems (Table 8).
The results shown in Table 9, Table 10 and Table 11 established the fact that evolutionary information has a good relation with the protein phosphorylation and hence can contribute in designing a good prediction system.

Conclusions
In this work, a novel phosphorylation site prediction system, PPRED was presented that incorporated only the evolutionary information of the proteins of both phosphorylated and non-phosphorylated classes. Experimental results of the system revealed that the system exhibits better prediction performance than some of the existing kinase specific and non-specific prediction systems. The results of the experiments also underlined the significance of using evolutionary information of both phosphorylated and non-phosphorylated proteins which were used as the only classification feature in the proposed system. Comparing the proposed system with other approaches, it was found that the proposed method provides a generalized and a more consistent prediction performance in all the cases. The incorporation of the evolutionary information contributed in both classifying the two types of sites and making the system more generalized.

Prediction System Design
The work flow for testing the proposed system with the independent benchmark dataset (B) is shown in Figure 4.
As the work flow diagram shows, sequences of both the A" and B datasets were given to PSI-BLAST's "blastpgp" program to generate the PSSM profiles, which encapsulate the evolutionary information of the proteins. The SVM training instances were then prepared from the PSSM profiles. There were two classes of instances for both datasets A" and B. The positive and negative class instances of A" were equalized in number, and both were merged together to prepare the training set.
A three-fold cross validation was performed on the final merged instance set using the SVMs training module. Separate model files (Knowledge base file) for each of the three phosphorylated residues (S, T and Y) and for each of the five different window sizes (7, 9, 11, 13 and 15) were stored on the disk. Each of the individual cross validation results were reported in the result section.
To test the system, the instance set B was used, and an appropriate model file stored in the cross-validation phase was chosen according to the type of residue and the size of the window. The chosen model file and instance set B were given to the SVMs prediction module for testing. The SVMs prediction module performs predictions of phosphorylation sites on the given instance set from B based on the given classification knowledge base (model) file. Figure 5 illustrates the flow of operations of the PPRED system when an unknown protein sequence is given to it for prediction. First, PSI-BLAST is employed to generate the PSSM profile of the given protein sequence. Then separate SVMs testing instances are prepared for each of the three residues (S, T and Y) and for each of the five different window sizes (7, 9, 11, 13 and 15). The appropriate knowledge-base (model) file stored at the cross-validation phase is chosen to predict "target labels" (+1: positive, -1: negative) for each of the testing instances.
The following sections explain each of the essential components of the PPRED system.

Figure 1 ROC of the proposed prediction system for predicting serine sites. There are five ROC points for each of the four models that represent the five window sizes (7, 9, 11, 13 and 15) for the corresponding model.
Table footnote: Here in this table, the sensitivity (Sn) and the specificity (Sp) columns reveal that the system using this ratio identifies most of the sites as positive.

Evolutionary Information of Proteins
The proposed method incorporates evolutionary information of phosphorylation sites. If we perform a multiple sequence alignment of the proteins against an nr (nonredundant) dataset of proteins, we will get a score of each of the twenty amino acids against each position of the target protein. The scores represent the evolutionary conservation information among the members of its lineage. This information can be represented as a two dimensional matrix which is known as the PSSM profile of the protein.

It was observed in this study that PSSM scores across a predefined window of a phosphorylated residue of some protein sequences have similar lineage of evolution. Hence the scores obtained from the PSSM profiles of phosphorylated proteins across a predefined window can be a good source of classification data for a prediction system. The PSSM profiles of phosphorylated proteins were generated using the PSI-BLAST method [18][19][20].

Figure 3 ROC of the proposed prediction system for predicting tyrosine sites. There are five ROC points for each of the four models that represent the five window sizes (7, 9, 11, 13 and 15) for the corresponding model.

Dataset Preparation
Two datasets were used in this study. The first is Phospho.ELM version 8.1, released on August 12, 2008 [23], and was named the A dataset in this study. The A dataset contains 6019 protein entries with a total of 18253 annotations of phosphorylation sites. Of these annotations, 13320 were phosphorylated serine sites, 2766 threonine sites and 2166 tyrosine sites (Additional file 1). There was one annotation of phosphorylated histidine, which was discarded from this experiment because the objective of this work is to classify only the most frequently occurring phosphorylated residues, namely serine, threonine and tyrosine. So a new dataset A' was prepared from the dataset A, containing every protein of the A dataset except the protein with the phosphorylated histidine site (Additional file 2).
The second dataset was collected from the article [17], whose authors used it to assess the performance of some existing prediction systems. This independent benchmark dataset was named the B dataset in our study. The B dataset contains 297 protein entries with annotations of 211 serine, 85 threonine and 97 tyrosine phosphorylated sites. However, 294 protein entries of the B dataset were also present in the A' dataset, so these 294 common protein entries were discarded from the A' dataset to form a new training dataset A" that is disjoint from the testing dataset B. The A" dataset and the independent benchmark dataset B can be found in Additional file 3 and Additional file 4 respectively.
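The derivation of A' and A" from A amounts to two set differences. The sketch below illustrates this with hypothetical accession identifiers; the function name and the toy entries are assumptions made for illustration, not part of the original pipeline.

```python
def build_training_entries(a_entries, histidine_entries, benchmark_entries):
    """Return (A', A'') as sets of protein accessions.

    A'  = A minus proteins carrying a phosphorylated histidine annotation
    A'' = A' minus proteins that also occur in the benchmark dataset B
    """
    a_prime = set(a_entries) - set(histidine_entries)
    a_double_prime = a_prime - set(benchmark_entries)
    return a_prime, a_double_prime

# Toy example with hypothetical accessions.
a = {"P1", "P2", "P3", "P4", "P5"}
histidine = {"P5"}
b = {"P2", "P3", "P9"}          # "P9" is in B but not in A
a_prime, a_double_prime = build_training_entries(a, histidine, b)

# Counts reported in the text: 6019 entries in A, 1 removed for histidine,
# 294 shared with B, leaving the 5724 proteins of A''.
training_size = 6019 - 1 - 294
```

The arithmetic at the end agrees with the 5724 phosphorylated proteins reported for the A" dataset.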

Positive Dataset Preparation
PSSM profiles of all the proteins of A" and B datasets were generated using PSI-BLAST search against the nonredundant (nr) database of protein sequences. A PSSM matrix for each of the proteins was generated by the "blastpgp" program of the PSI-BLAST package with three iterations of searching at cutoff E-value of 0.001 for inclusion of sequences in subsequent iterations.
For example, the command to generate a PSSM profile of the protein with accession P16386 is given below:
blastpgp -d nr -i "P16386.seq" -j 3 -h 0.001 -Q "P16386.pssm"
Here, the "P16386.seq" file contains the primary sequence of the protein P16386 in raw format. The option "-j 3" runs the "blastpgp" program for three iterations. The option "-h 0.001" restricts the inclusion of unrelated sequences with a cutoff E-value of 0.001. The "-Q" option saves the resultant PSSM profile in a file named "P16386.pssm". The PSSM thus generated contained the probability of occurrence of each type of amino acid at each position. The evolutionary information for each amino acid is encapsulated in an L × 20 matrix, where L is the length of the given protein sequence. Figure 6 shows a fragment of the PSSM profile of a protein with window size 11.
Table footnote: From this table it is evident that using ratio 1:1 shows good prediction performance for both positive and negative site predictions for each of the three residues. It is also evident that the performance increases if more features are included in each training instance (i.e., increasing the window size).
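To make the relation between the L × 20 PSSM and a training instance concrete, the sketch below extracts the rows inside a window centred on a candidate residue and flattens them into one feature vector. The flattening order, the zero padding at the sequence termini and the function name are assumptions made for illustration; the text does not specify how windows that run past the ends of a sequence were handled.

```python
def window_features(pssm, center, window):
    """Flatten the PSSM rows in a window centred on `center` (0-based).

    pssm   : list of L rows, each a list of 20 scores
    window : odd window size (7, 9, 11, 13 or 15 in the text)
    Positions outside the sequence are padded with zeros (assumption).
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    half = window // 2
    n_cols = len(pssm[0])
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(pssm[pos])
        else:
            features.extend([0] * n_cols)  # zero padding at the termini
    return features

# Toy PSSM of length 5 whose row i is filled with the value i.
toy_pssm = [[i] * 20 for i in range(5)]
vector = window_features(toy_pssm, center=2, window=3)
```

Under this layout a window of size w yields 20 × w features per instance, so the largest window in the text (15) gives 300 features.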

Negative Dataset Preparation
The most significant problem while compiling datasets for machine learning is that there is no negative data included in any of the known databases. Knowing the fact that a specific serine or threonine or tyrosine is not phosphorylated is extremely useful when designing a binary prediction method. Unfortunately, such information is very rarely published. To conclusively prove that a site is negative under all conditions is impossible.
In this study, the non-annotated sites satisfying the criterion stated in Proposition 1 were considered as negative sites. Proposition 1. A non-annotated residue is considered a negative site if it is not within a distance of 50 residues of any phosphorylation-annotated residue of the protein sequence.
Unfortunately, some of the negative sites obtained using Proposition 1 could prove to be positive in future experiments. However, negative sites were used in this study because only a few serine, threonine and tyrosine residues are phosphorylated, and the PSSM profile scores of the phosphorylated serine, threonine and tyrosine residues are skewed away from those of the non-phosphorylated serine, threonine and tyrosine residues. Therefore SVMs, which in practice allow some training errors, would regard false negative sites as errors.
The same approach of generating positive features of serine, threonine and tyrosine was employed to generate negative features as well.
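Proposition 1 can be sketched as a simple filter over the S, T and Y residues of a sequence. The code below is an illustration only: it uses 0-based positions and treats a residue exactly 50 positions away from an annotated site as still too close, which is one possible reading of the proposition; the function and parameter names are hypothetical.

```python
def negative_sites(sequence, positive_positions, min_distance=50, residues="STY"):
    """Return 0-based positions of S/T/Y residues treated as negative sites.

    A non-annotated residue qualifies only if it lies more than
    `min_distance` residues from every annotated (positive) position.
    """
    positives = set(positive_positions)
    negatives = []
    for i, aa in enumerate(sequence):
        if aa not in residues or i in positives:
            continue
        if all(abs(i - p) > min_distance for p in positives):
            negatives.append(i)
    return negatives

# Toy sequence: S at 0, S at 51, T at 62, alanine elsewhere.
toy_sequence = "S" + "A" * 50 + "S" + "A" * 10 + "T"
far_from_first = negative_sites(toy_sequence, positive_positions={0})
far_from_second = negative_sites(toy_sequence, positive_positions={51})
```

With the serine at position 0 annotated, both other candidates (51 and 62) are far enough away to count as negative; with the serine at position 51 annotated, only position 0 qualifies because the threonine at 62 lies within the 50-residue exclusion zone.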

Support Vector Machines
The Support Vector Machine (SVM) is a supervised learning algorithm for two-group classification problems [29,30]. SVMs are known for their high performance in classifying unknown data and have been applied to many problem areas. An SVM maps the feature vectors into a high-dimensional feature space and classifies the samples by means of a separating hyperplane in that space. At the training stage, the SVM searches for an optimal hyperplane by solving a quadratic programming optimization problem.
This hyperplane, determined by the criterion of maximizing the distance to the nearest feature vectors, has good generalization performance. We used LIBSVM (Library for Support Vector Machines) [31] with a radial basis function (RBF) kernel to predict phosphorylation sites.
The SVM with the RBF kernel has two parameters, γ and C. We fixed C and γ at their default values of 1 and 1/k respectively, where k is the number of attributes (features) in each instance of the training dataset.
Table footnote: There are a total of 211 phosphorylated serine sites, 85 threonine sites and 97 tyrosine sites in the benchmark dataset. Here it can be found easily that the proposed system (PPRED) shows better accuracy than AutoMotif Server AMS, GPS, NetPhos, PHOSIDA and Scansite 2.0 for predicting S, T and Y sites.
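For reference, the RBF kernel evaluated between two feature vectors x and y is K(x, y) = exp(-γ‖x − y‖²). The sketch below evaluates it in plain Python with γ defaulting to 1/k, where k is the number of features; it is an illustration of the kernel formula only and is not LIBSVM code. The function name is hypothetical.

```python
import math

def rbf_kernel(x, y, gamma=None):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2).

    If gamma is not given, it defaults to 1/k with k = number of features,
    mirroring the default value described in the text.
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    if gamma is None:
        gamma = 1.0 / len(x)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

same = rbf_kernel([1.0, 2.0], [1.0, 2.0])      # identical vectors
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0])     # gamma = 1/2, distance^2 = 2
```

Identical vectors give a kernel value of 1, and the value decays towards 0 as the vectors move apart; with 1/k scaling, the decay rate is normalized by the number of features, so larger windows do not automatically shrink all kernel values.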

Training System Design
From the PSSM profiles of the proteins of the A" dataset, negative and positive instances were prepared for each of the three phosphorylated residues --S, T and Y. The number of training instances of each residue of certain label (positive or negative) are shown in Table 1.
As mentioned above that a subset of non-annotated phosphorylation sites in the dataset A" is used as negative dataset. But from Table 1 it is evident that, the size of negative dataset greatly outnumbered that of the positive dataset for each of the three residues. If the SVMs were trained with these positive and negative datasets, it would predict most of the sites as negative [32].
To overcome this problem, the size of the negative dataset had to be reduced. This study performs four separate experiments that reduce the negative dataset to twice, one and a half times, half and equal to the size of the positive dataset. For example, to prepare a training dataset with the same number of positive and negative instances, the ratio r of negative dataset size to positive dataset size was first calculated. Then the positive dataset was kept intact (as the number of positive instances is less than that of negative instances), and only every r-th instance of the negative dataset was retained; all other negative instances were discarded. This way, the size of the negative dataset was made equal to the size of the positive dataset. After this equalization, the positive and the negative datasets were merged to form the training dataset for the SVMs. In the other three experiments, a similar technique was employed to prepare the training datasets for the three other ratios. The detailed results of the four experiments are discussed in the subsection "Cross Validation Performance" under the section "Results and Discussion".
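The reduction of the negative dataset can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the `target_ratio` parameter, the handling of a non-integer r by flooring evenly spaced indices, and the +1/-1 labels are all hypothetical choices.

```python
def downsample_negatives(negatives, n_positive, target_ratio=1.0):
    """Keep evenly spaced negative instances.

    The result has about target_ratio * n_positive items. With
    target_ratio == 1.0 the step equals r, the ratio of negative to
    positive dataset size, so every r-th negative instance is kept.
    """
    target_size = int(round(target_ratio * n_positive))
    if target_size <= 0:
        raise ValueError("target size must be positive")
    if target_size >= len(negatives):
        return list(negatives)  # nothing to discard
    step = len(negatives) / target_size
    # Assumption: a non-integer step is handled by flooring each index.
    return [negatives[int(i * step)] for i in range(target_size)]


def build_training_set(positives, negatives, target_ratio=1.0):
    """Merge all positives with the reduced negatives into labeled pairs."""
    kept = downsample_negatives(negatives, len(positives), target_ratio)
    # Labels +1 and -1 are an assumed encoding of positive and negative.
    return [(x, 1) for x in positives] + [(x, -1) for x in kept]
```

For instance, with 100 negative and 10 positive instances, a `target_ratio` of 1.0 keeps the negatives at indices 0, 10, 20, ..., 90, while a `target_ratio` of 2.0 keeps 20 of them.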

Testing the Proposed System
To evaluate the performance of the system, two separate testing phases were performed. In the first phase, a 3-fold cross validation was used, and in the second phase, dataset B was used.
• Phase 1: Three-fold cross validation test. In the three-fold cross validation, the merged training dataset was divided into three equal sets. Of these three sets, two sets were used for training and the remaining set was used for testing. This process was repeated three times in such a way that each of the three sets was used once for testing. The final performance parameters were obtained by averaging the performance over all three sets. It should be noted that, in each of the three training phases, the SVMs produced a knowledge base (Model File), which was stored on disk and used later during prediction.
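The cross validation loop described in Phase 1 can be sketched generically as below. The `train` and `evaluate` callables are hypothetical placeholders for SVM training and scoring, and the interleaved split into folds is an assumption, since the text does not say how the three sets were formed.

```python
def cross_validate(instances, train, evaluate, n_folds=3):
    """Run n-fold cross validation and return (mean score, list of models).

    train(training_instances) returns a model; evaluate(model,
    test_instances) returns a numeric score. Both are supplied by the caller.
    """
    # Assumption: folds are formed by taking every n-th instance.
    folds = [instances[i::n_folds] for i in range(n_folds)]
    scores, models = [], []
    for held_out in range(n_folds):
        training = [x for j, fold in enumerate(folds)
                    if j != held_out for x in fold]
        model = train(training)
        models.append(model)  # one model per fold, kept for later use
        scores.append(evaluate(model, folds[held_out]))
    return sum(scores) / n_folds, models
```

Each instance is used exactly once for testing and twice for training, and the returned list holds the three fold models, mirroring the three stored model files.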

Availability and Requirements
The PPRED (Phosphorylation Predictor) web server is publicly accessible at the URL http://www.cse.univdhaka.edu/~ashis/ppred/index.php. An internet browser is all that is needed to use the server. Any protein sequence of at most 10000 residues (in FASTA or raw sequence format) can be submitted to the PPRED server along with the submitter's email address. The PPRED server then performs the prediction task and notifies the submitter by email within a short period of time. The PPRED web server is installed on a desktop computer, and the server program runs on the Fedora Core 9 operating system. The installed software comprises LIBSVM version 2.9 and BLAST version 2.2.19 (with the NR database release of Nov 16, 2008). All training and testing datasets are available for download at that URL. It is worth mentioning that the time required to predict phosphorylation sites depends on the length of the submitted protein sequence.

Additional material
Additional file 1 Training dataset: A. The A dataset is the Phospho.ELM version 8.1 database that was released on Aug 12, 2008. The dataset contains 6019 protein entries with a total of 18253 annotations of phosphorylation sites. Of these annotations, 13320 are phosphorylated serine sites, 2766 are threonine sites and 2166 are tyrosine sites. There is one additional annotation of a phosphorylated histidine. Phospho.ELM is a database of S/T/Y phosphorylation sites hosted at the URL http://phospho.elm.eu.org/. The dataset was collected in December 2008.
Additional file 2 Training dataset: A'. The A' dataset was prepared from the dataset A and contains every protein of the A dataset except the protein containing the phosphorylated histidine site. We discarded this entry because the objective of this work was to classify only the most frequently occurring phosphorylated sites, which are serine, threonine and tyrosine residues.

Authors' contributions
AKB carried out the study relating to evolutionary information of proteins, participated in the sequence alignment to extract evolutionary information and drafted the manuscript. NN participated in the design of the study and supervised the whole research work. ARS participated in the design of the support vector features from the evolutionary information of proteins. All authors have read, revised and approved the final manuscript.