AntiBP2: improved version of antibacterial peptide prediction

Background Antibacterial peptides are among the effector molecules of the innate immune system. Over the last few decades several antibacterial peptides have been approved as drugs by the FDA, which has prompted interest in these peptides. In our recent study we analyzed 999 antibacterial peptides collected from the Antimicrobial Peptide Database (APD). We have also developed methods to predict and classify these antibacterial peptides using Support Vector Machines (SVM).
Results During analysis we observed that certain residues are preferred over others in antibacterial peptides, particularly at the N and C termini. These observations, together with the increased number of antibacterial peptides in APD, encouraged us to develop a new and more robust method for predicting antibacterial peptides within a protein from its amino acid sequence, or for deciding whether a given peptide has antibacterial properties. First, the binary patterns of the 15 N-terminal residues were used to predict antibacterial peptides with an SVM, achieving an accuracy of 85.46% with a Matthews correlation coefficient (MCC) of 0.705. We then used the binary patterns of the 15 C-terminal residues and achieved an accuracy of 85.05% with an MCC of 0.701. Later we developed a prediction method combining the N and C termini and achieved an accuracy of 91.64% with an MCC of 0.831. Finally, we developed an SVM based model using the amino acid composition of the whole peptide and achieved 92.14% accuracy with an MCC of 0.843. In this study we used five-fold cross-validation to develop all these models and tested their performance on an independent dataset. We further classified antibacterial peptides according to their sources and achieved an overall accuracy of 98.95%. We also classified antibacterial peptides into their respective families and obtained satisfactory results.
Conclusion Among antibacterial peptides there is a preference for certain residues at the N and C termini, which helps to discriminate them from non-antibacterial peptides. The amino acid composition of antibacterial peptides helps to distinguish them from non-antibacterial peptides and to classify them further by source and family. AntiBP2 will be helpful in discovering efficacious antibacterial peptides, which we hope will be useful against antibiotic-resistant bacteria. We have also developed a user-friendly web server for the biological community.


Background
In the past few decades, a large number of bacterial strains have evolved ways to adapt or become resistant to the currently available antibiotics [1]. The widespread resistance of bacterial pathogens to conventional antibiotics has prompted renewed interest in the use of alternative natural microbial inhibitors such as antimicrobial peptides. Antimicrobial peptides (AMPs) are a family of host-defense peptides, most of which are gene-encoded and produced by living organisms of all types [2][3][4][5][6][7][8]. AMPs are small molecular weight proteins with broad-spectrum antimicrobial activity against bacteria, viruses, and fungi [3,10]. These evolutionarily conserved peptides are usually positively charged and have both a hydrophobic and a hydrophilic side, which enables the molecule to be soluble in aqueous environments yet also enter lipid-rich membranes. Once in a target microbial membrane, the peptide kills target cells through diverse mechanisms [5].
Antimicrobial peptides have a broad spectrum of activity and can act as antibacterial, antifungal, antiviral and sometimes even anticancer peptides [10]. Besides their pathogen-lytic activity, these peptides may have other properties, such as mitogenic activity or acting as signaling molecules [10]. Extensive work has been done in the field of antibacterial peptides, describing their identification, characterization and mechanism of action, keeping in mind their numerous biotechnological applications [11][12][13]. Considerable work has also been done to collect and compile these peptides in the form of databases [14][15][16][17].
These antibacterial peptides have very low sequence homology, despite their common function [18]. Previously we developed a robust method, AntiBP [19], for predicting antibacterial peptides using SVM, a quantitative matrix (QM) and an artificial neural network (ANN). The growth in the number of antibacterial peptides in the APD database over the last 2 years motivated us to develop a prediction method based on the newer and larger (almost double) dataset. We once again analyzed the antibacterial peptides and developed SVM based models to predict them, because our previous study showed that SVM outperforms the other methods. In AntiBP2 we also extracted a clean dataset of antibacterial peptide families from Swiss-Prot and developed classification models for them. In the following text, we first discuss the method developed to distinguish antibacterial peptides from non-antibacterial peptides (prediction part) and then describe the method for classifying these peptides on the basis of source and family (classification part).

Analysis of the antibacterial peptides
Analysis of antibacterial peptides in AntiBP [19] had shown a preference for certain residues over others at both termini. The pLOGOs [20] also suggested a residue preference at different positions of antibacterial peptides. As the dataset in AntiBP2 was almost double the size of the dataset used in the previous method, AntiBP, we decided to analyze the antibacterial peptides again and look for any change or shift in the preference trend. We again generated sequence logos of the 15 N-terminal and 15 C-terminal residues using the pLOGO program (Figures 1 and 2).
Figure: Sequence logo of the last fifteen residues (C-terminus) of antibacterial peptides, where the size of each residue is proportional to its propensity.
The pLOGOs drawn in AntiBP2 showed a trend similar to that seen in AntiBP [19]. Here too, in the N-terminus dataset G, F, V and R predominate at the first position, and L, I, W and F are frequent at the second position. Similarly, certain residues are preferred at the C-terminus; for example, K, G, C and R are preferred at most positions. Although both the N and C termini have a high proportion of positively charged residues, the AntiBP2 analysis again showed a higher frequency of positively charged residues at the C-terminus than at the N-terminus (Figures 3 and 4). This may be because the C-terminus interacts first with the negatively charged bacterial membrane and penetrates it [21]. The N-terminus later helps to hamper crucial bacterial metabolic functions by interacting with intracellular components such as DNA and RNA [22]. Antibacterial peptides also have a high propensity for Cys, which is normally not preferred in most proteins. A comparison of the overall amino acid composition of antibacterial and non-antibacterial peptides shows that positively charged Lys is prominent in antibacterial peptides (Figure 5). Similarly, the propensity of Gly and Ile is also high in antibacterial peptides.

Prediction
The performance of the NT15, CT15, NTCT15 and whole peptide based prediction methods for antibacterial peptides is given in Table 1. The accuracies achieved by the NTCT15 model and the whole peptide based model were almost equal (~91%) and were the highest among all the models. The performance of the NT15 model was better than that of the CT15 model.

Performance on independent or blind dataset
The prediction models developed in this study were evaluated on an independent dataset of 466 sequences (Table 2). The antibacterial peptides in the independent dataset were not used for developing the above models, either in training or in testing.
The results of classification of frog's antibacterial peptides and mammalian antibacterial peptides into their respective families (5 each) are given in detail in Table 5 and

Discussion
A great deal of interest is shown nowadays in antibacterial peptides, the so-called "nature's antibiotics", which seem promising for overcoming the growing problem of antibiotic resistance [23][24][25]. The design of novel peptides with antimicrobial activity requires methods for narrowing down the candidate peptides so as to enable rational experimentation by wet-lab scientists. Attempts have been made to develop methods and strategies for designing effective antimicrobial peptides [26,27]. AntiBP is one such method, meant to discover efficacious antibacterial peptides that we hope could prove to be a boon in combating dreadful antibiotic-resistant bacteria. The enormous growth of antibacterial peptide data in the databases motivated us to develop an improved version of AntiBP using the same strategy. The new version was named AntiBP2.
The N- and C-terminus sequence logos of the AntiBP2 dataset were very similar to those of the previous method, AntiBP. This indicates that, although antibacterial peptides show little overall homology or conservation, the pattern of positional preference for certain residues remains constant. We once again developed a prediction method to distinguish antibacterial peptides from non-antibacterial peptides, this time using training data double the size of that used previously. We developed both whole peptide based compositional models and binary pattern based terminus approaches. We retained the whole peptide based method because peptides shorter than 15 residues are difficult to predict with the binary pattern based terminal models. We again achieved impressive results with all of these approaches, but the best performers were the NTCT15 and whole peptide based prediction models (achieving 91% accuracy). These were followed by the NT15 based prediction model, with the CT15 based model the poorest performer. This trend is similar to that seen in AntiBP. The performance of the prediction models on the independent dataset followed the trend seen during model development (and in the AntiBP method): the NTCT15 model performed best, followed by the NT15 and CT15 models, in that order.
BMC Bioinformatics 2010, 11(Suppl 1):S19 http://www.biomedcentral.com/1471-2105/11/S1/S19
In AntiBP2 we have also developed models that classify antibacterial peptides further into families with high accuracy. First we developed classification models that assign a source of origin to predicted antibacterial peptides. We then developed classification models that assign the antibacterial peptides to their corresponding families. The results attained by all the classification methods indicate that, although antibacterial peptides as a whole do not show great conservation or homology, they become more and more conserved as we go down to the level of a particular family. This is evident from the high accuracies achieved for each family in the various classification models. AntiBP2 is therefore an efficient method that can predict and classify antibacterial peptides. We hope that our method will help wet-lab scientists to design improved and efficacious antibacterial peptides in future.

Conclusion
The field of antibacterial peptide research is growing rapidly in response to the demand for novel antibacterial agents. AntiBP2 is an efficient method that can predict and classify antibacterial peptides and help to find new antibacterial peptides more quickly and conveniently. We hope that our method will promote research on the design of improved and efficacious antibacterial peptides in future.

Main dataset
The positive dataset for this method was once again fetched from the antimicrobial peptide database APD [17]. We retrieved a total of 999 unique antibacterial peptides from this database. We used this dataset to build the whole peptide composition based SVM models to predict antibacterial peptides of any length.
Negative dataset against whole peptide dataset
As there is no source of experimentally proven non-antibacterial peptides, we adopted the same strategy that was used to generate the negative dataset in AntiBP. We chose to extract random peptides from proteins belonging to all intracellular locations, excluding secretory proteins (because antibacterial peptides are mostly secreted outside the cell). Some of these randomly selected peptides could be antibacterial in nature, but the possibility is remote. For this we used the data used in MitPred [28]. The MitPred dataset contains proteins belonging to various intracellular locations (nucleus, cytoplasm, ER, Golgi complex, mitochondria). These proteins were mixed and shuffled thoroughly so that the negative dataset would not over-represent proteins from any particular location. We then selected those proteins that were >100 amino acids in length, because many of the antibacterial peptides in the positive dataset are >90 residues long. For each peptide in the positive dataset, we calculated its length and cut a random peptide of the corresponding length from a negative dataset protein. This gave 999 negative peptides.
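The length-matched sampling described above can be sketched as follows. This is a minimal illustration and not the authors' script: the function name `cut_negative_peptides`, the fixed random seed, and the rule of choosing a source protein uniformly at random among those long enough are all assumptions.

```python
import random

def cut_negative_peptides(positive_peptides, proteins, min_protein_len=100, seed=0):
    """Cut one random peptide per positive peptide, matched in length.

    positive_peptides: antibacterial peptide sequences (strings).
    proteins: non-secretory protein sequences (strings).
    """
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    # Keep only proteins longer than the threshold (the text uses >100 residues).
    pool = [p for p in proteins if len(p) > min_protein_len]
    negatives = []
    for pos in positive_peptides:
        n = len(pos)
        # Assumed rule: any pooled protein at least as long as the peptide may be used.
        candidates = [p for p in pool if len(p) >= n]
        if not candidates:
            raise ValueError(f"no protein long enough for a peptide of length {n}")
        protein = rng.choice(candidates)
        start = rng.randrange(len(protein) - n + 1)  # random window start
        negatives.append(protein[start:start + n])
    return negatives
```

Each returned peptide has the same length as its positive counterpart, so the two classes share the same length distribution.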

NT15, CT15 and NTCT15 datasets
We created NT15 and CT15 datasets by taking first fifteen and last fifteen residues respectively from the antibacterial peptides as done in AntiBP [19]. For NTCT15 dataset we concatenated the CT15 peptides with their corresponding NT15 counterparts. To reduce the redundancy in the positive dataset, duplicates were removed and we were left with 782 NT15, 786 CT15 peptides and 861 NTCT15 peptides.
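A minimal sketch of how the three terminal datasets could be derived is shown below. The function name `terminal_datasets` is hypothetical, and two details are assumptions not stated in the text: peptides shorter than 15 residues are skipped, and the N-terminal segment is placed before the C-terminal segment in the NTCT15 string.

```python
def terminal_datasets(peptides, k=15):
    """Build NT, CT and NTCT datasets from whole peptides, removing duplicates."""
    nt, ct, ntct = [], [], []
    for seq in peptides:
        if len(seq) < k:
            continue  # assumption: peptides shorter than k residues are skipped
        nt.append(seq[:k])             # first k residues
        ct.append(seq[-k:])            # last k residues
        ntct.append(seq[:k] + seq[-k:])  # assumed order: N-terminal then C-terminal

    def dedup(items):
        # dict.fromkeys keeps the first occurrence and preserves order
        return list(dict.fromkeys(items))

    return dedup(nt), dedup(ct), dedup(ntct)
```

Because duplicates are removed from each dataset separately, the three datasets can end up with different sizes, as in the counts reported above.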
Negative dataset against NT15, CT15 and NTCT15 datasets
The strategy used to generate the negative datasets for the NT15, CT15 and NTCT15 datasets was the same as in AntiBP. Once again we took the dataset of thoroughly mixed and shuffled proteins belonging to various subcellular locations. For the NT15 and CT15 negative datasets, peptides 15 residues long were cut randomly from this dataset. From these peptides we selected 786 to be used as the negative dataset against both the NT15 and CT15 datasets. The negative dataset for the NTCT15 dataset was created by extracting 861 random peptides (30 residues in length) from the non-secretory protein dataset.

Independent dataset
We took 466 peptides from the family classification dataset (which was fetched from Swiss-Prot) which were not present in our main dataset (taken from APD database). This dataset was not used either for training or testing the method. These peptides served as the independent dataset for evaluating the performance of the prediction models.

Techniques used
As the SVM based technique performed best in AntiBP [19], we used SVM to develop the prediction method in this study. All SVM models were developed using the freely available program SVM_Light [29], which allows users to run SVM with various kernels and parameters. Accuracy was computed at the cut-off score where sensitivity and specificity are nearly equal.
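One way to locate such a cut-off is to scan the observed scores and keep the one that minimizes the gap between sensitivity and specificity. The sketch below assumes real-valued SVM output scores and labels coded as +1 (antibacterial) and -1 (non-antibacterial); the function name `balanced_cutoff` and the scan over observed scores are illustrative choices, not the authors' procedure.

```python
def balanced_cutoff(scores, labels):
    """Return (cutoff, sensitivity, specificity) minimizing |Sn - Sp|.

    A peptide is predicted antibacterial when its score is >= cutoff.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    best = None
    for t in sorted(set(scores)):  # candidate cut-offs: the observed scores
        sn = sum(s >= t for s in pos) / len(pos)  # fraction of positives recovered
        sp = sum(s < t for s in neg) / len(neg)   # fraction of negatives rejected
        gap = abs(sn - sp)
        if best is None or gap < best[0]:
            best = (gap, t, sn, sp)
    return best[1], best[2], best[3]
```

Reporting accuracy at this point avoids favoring either class when comparing models.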

Evaluation of parameters
Five-fold cross-validation was used to evaluate the performance of all the models developed in this study. In five-fold cross-validation a dataset is randomly divided into five sets, each consisting of nearly equal numbers of antibacterial and non-antibacterial peptides. Four sets are used for training and the remaining set for testing. This process is repeated five times so that each set is used once for testing. The performance of a method is its average performance over the five sets. The following parameters, computed from the counts TP, TN, FP and FN, were used to assess the performance of a method. TP and TN are correctly predicted antibacterial and non-antibacterial peptides, respectively. FP and FN are wrongly predicted antibacterial and non-antibacterial peptides, respectively. Sensitivity (Sn), or percent coverage of antibacterial peptides, is the percentage of antibacterial peptides predicted as antibacterial; specificity (Sp), or percent coverage of non-antibacterial peptides, is the percentage of non-antibacterial peptides predicted as non-antibacterial; overall accuracy (Ac) is the percentage of correctly predicted antibacterial and non-antibacterial peptides. Five-fold cross-validation was used for the evaluation of all three methods.
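The displayed formulas for these parameters do not appear in this text. The standard definitions, which match the verbal descriptions above, are presumably the ones intended; MCC, which is reported in the abstract, is included on the same assumption.

```latex
\begin{align}
\mathrm{Sn}  &= \frac{TP}{TP + FN} \times 100 \\
\mathrm{Sp}  &= \frac{TN}{TN + FP} \times 100 \\
\mathrm{Ac}  &= \frac{TP + TN}{TP + TN + FP + FN} \times 100 \\
\mathrm{MCC} &= \frac{TP \cdot TN - FP \cdot FN}
                    {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{align}
```

Sn, Sp and Ac are percentages, whereas MCC ranges from -1 to 1, with 0 corresponding to random prediction.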

Prediction of antibacterial peptides
Whole peptide based approach
Although the terminus approaches are useful for scanning a larger protein sequence for antibacterial peptides, it is difficult to predict peptides that are shorter than 15 residues. Therefore, a whole peptide based SVM model was also developed in order to predict antibacterial peptides of any length. The amino acid composition of each peptide was used as input to train the SVM.
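As an illustration, amino acid composition can be computed as a fixed-length vector of 20 values, one per standard residue, regardless of peptide length. The sketch below expresses each value as a percentage and ignores non-standard residue symbols; both choices, and the name `amino_acid_composition`, are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, fixed order

def amino_acid_composition(seq):
    """Return the percent composition of the 20 standard amino acids."""
    seq = seq.upper()
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts)  # non-standard symbols are ignored (assumption)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [100.0 * c / total for c in counts]
```

Because the vector length does not depend on the peptide length, this representation also covers peptides shorter than 15 residues.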

NT15, CT15 and NTCT15 approach
The binary patterns of the NT15, CT15 and NTCT15 datasets were again used to develop prediction methods, as described in AntiBP. Performance was evaluated using five-fold cross-validation.
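A binary pattern is commonly built by representing each residue as a vector with a single 1 at the position of that amino acid, so a 15-residue peptide becomes 15 x 20 = 300 values. The sketch below assumes this 20-dimensional encoding (the exact encoding used in AntiBP is not restated here) and writes the result as a sparse line in the SVM_Light input format, `<target> <feature>:<value> ...`, with feature indices starting at 1. The function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, fixed order

def binary_pattern(peptide):
    """One-hot encode a peptide: 20 values per residue, concatenated."""
    vec = []
    for res in peptide.upper():
        one_hot = [0] * len(AMINO_ACIDS)
        idx = AMINO_ACIDS.find(res)
        if idx >= 0:            # unknown residues stay all-zero (assumption)
            one_hot[idx] = 1
        vec.extend(one_hot)
    return vec

def to_svmlight_line(label, vec):
    """Format a label (+1 or -1) and a feature vector as a sparse SVM_Light line."""
    feats = " ".join(f"{i}:{v}" for i, v in enumerate(vec, start=1) if v != 0)
    return f"{label:+d} {feats}"
```

For example, `to_svmlight_line(1, binary_pattern("GIGKFLHSAKKFGKA"))` yields a line with 15 non-zero features out of 300.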

Classification of antibacterial peptides
Multiclass SVM was used to develop models that classify antibacterial peptides by source, e.g. bacteria, insects, frogs, mammals and plants. N SVM models were constructed for N-class classification. For source classification the number of classes was 5, so five one-versus-rest (1-v-r) SVM models were constructed. The ith SVM was trained with all the samples of the ith class labelled positive and all other samples labelled negative. An unknown example was assigned to the class whose SVM gave the highest output score.
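The 1-v-r scheme described above can be sketched in two small steps: relabelling the training data for the ith SVM, and choosing the class with the highest score at prediction time. The function names and the example scores are hypothetical; training of the individual SVMs is not shown.

```python
def one_vs_rest_labels(class_labels, target):
    """Labels for the SVM of class `target`: +1 for that class, -1 for all others."""
    return [1 if c == target else -1 for c in class_labels]

def predict_class(scores_by_class):
    """Pick the class whose SVM produced the highest output score."""
    return max(scores_by_class, key=scores_by_class.get)

# Illustrative scores from five 1-v-r SVMs for one unknown peptide.
example_scores = {"bacteria": -0.4, "insect": 0.2, "frog": 1.1,
                  "mammal": -1.0, "plant": 0.3}
predicted = predict_class(example_scores)
```

With these scores the peptide is assigned to the class "frog", since that SVM gives the highest output.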
The results for the family prediction are given in Table 2.
Antibacterial peptides belonging to various sources were further classified into families. Classification models were developed for peptides belonging to insects, frogs and mammals. To classify insect antibacterial peptides into families, five 1-v-r SVMs were developed. In a similar way, five 1-v-r SVM models each were developed to classify frog and mammalian antibacterial peptides into their respective families. The detailed results of the classification of insect, frog and mammalian peptides are given in the results section (Tables 3, 4 and 5).

Availability and requirements
We developed a freely available web server, AntiBP2 [30], for predicting and classifying antibacterial peptides using the models developed in this study. The web server was developed on a SUN server (model T-1000) under the Solaris environment using the Perl programming language.