Dataset
Main dataset
The positive dataset for this method was again obtained from the Antimicrobial Peptide Database (APD) [17]. We retrieved a total of 999 unique antibacterial peptides from this database and used them to build whole-peptide, composition-based SVM models that predict antibacterial peptides of any length.
Negative dataset against whole peptide dataset
As there is no source of experimentally proven non-antibacterial peptides, we adopted the same strategy that was used to generate the negative dataset in AntiBP. We extracted random peptides from proteins belonging to all intracellular locations, excluding secretory proteins (because antibacterial peptides are mostly secreted outside the cell). Some of these randomly selected peptides could be antibacterial in nature, but the possibility is remote. For this purpose we used the data from MitPred [28]. The MitPred dataset contains proteins belonging to various intracellular locations (nucleus, cytoplasm, ER, Golgi complex, mitochondria). These proteins were mixed and shuffled thoroughly so that no particular location is overrepresented in the negative dataset. We then selected the proteins that were longer than 100 amino acids, because many of the antibacterial peptides in the positive dataset are longer than 90 residues. For each peptide in the positive dataset, we calculated its length and cut a random peptide of the corresponding length from a protein in the negative dataset. This yielded 999 negative peptides.
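The length-matched sampling described above can be sketched as follows. This is a minimal illustration rather than the authors' script: the function name, the fixed random seed, and the choice to draw one background protein at random per positive peptide are assumptions.

```python
import random

def sample_negative_peptides(positive_peptides, proteins, min_protein_len=100, seed=0):
    """Cut one random, length-matched peptide per positive peptide (illustrative sketch)."""
    rng = random.Random(seed)  # fixed seed is an assumption, used only for reproducibility
    # Keep background proteins longer than the cut-off (>100 residues in the text).
    pool = [p for p in proteins if len(p) > min_protein_len]
    rng.shuffle(pool)  # mix proteins from the different intracellular locations
    negatives = []
    for peptide in positive_peptides:
        length = len(peptide)
        # Only proteins at least as long as the positive peptide can supply a window.
        candidates = [p for p in pool if len(p) >= length]
        protein = rng.choice(candidates)  # raises IndexError if no protein is long enough
        start = rng.randrange(len(protein) - length + 1)
        negatives.append(protein[start:start + length])
    return negatives
```

Each returned peptide has the same length as the positive peptide at the same position in the input list, so the two datasets share one length distribution.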
NT15, CT15 and NTCT15 datasets
We created the NT15 and CT15 datasets by taking the first fifteen and the last fifteen residues, respectively, of the antibacterial peptides, as done in AntiBP [19]. For the NTCT15 dataset, we concatenated the NT15 and CT15 peptides derived from the same sequence. To reduce redundancy in the positive dataset, duplicates were removed, leaving 782 NT15, 786 CT15 and 861 NTCT15 peptides.
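A minimal sketch of how the three terminal datasets could be derived is given below. Two points are assumptions not stated in the text: peptides shorter than 15 residues are skipped, and the NTCT15 peptide is formed with the N-terminal segment first. The function name is hypothetical.

```python
def terminal_datasets(peptides, n=15):
    """Build de-duplicated NT, CT and NTCT peptide lists (illustrative sketch)."""
    nt, ct, ntct = [], [], []
    for seq in peptides:
        if len(seq) < n:
            continue  # assumption: peptides shorter than n residues are skipped
        nt.append(seq[:n])              # first n residues
        ct.append(seq[-n:])             # last n residues
        ntct.append(seq[:n] + seq[-n:]) # assumption: N-terminal part comes first

    def dedupe(items):
        # dict.fromkeys removes duplicates while keeping first-seen order
        return list(dict.fromkeys(items))

    return dedupe(nt), dedupe(ct), dedupe(ntct)
```

Because duplicates are removed separately in each list, the three datasets can end up with different sizes, as they do in the counts reported above.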
Negative dataset against NT15, CT15 and NTCT15 datasets
The strategy used to generate the negative datasets for the NT15, CT15 and NTCT15 datasets was the same as in AntiBP. Once again, we took the dataset of thoroughly mixed and shuffled proteins belonging to various subcellular locations. For the NT15 and CT15 negative datasets, peptides 15 residues long were cut randomly from this dataset. From these, we selected 786 peptides to serve as the negative dataset against both the NT15 and CT15 datasets. The negative dataset for NTCT15 was created by extracting 861 random peptides (30 residues in length) from the non-secretory protein dataset.
Datasets for Subfamily classification
The datasets for classification of antibacterial peptides were extracted from the protein sequence database Swiss-Prot. They include peptides from bacteria, insects, frogs, mammals and plants. The insect antibacterial peptides belonged to five families: apidaecin, attacin, cecropin, invertebrate defensin and lebocin. The mammalian antibacterial peptides belonged to the alpha-defensin, beta-defensin, cathelicidin, hepcidin and histatin families. The frog antibacterial peptides comprised sequences from the bombinin, brevinin, caerin, dermaseptin, dermorphin, phylloseptin, pleurain and tryptophillin families. As the dermorphin, phylloseptin, pleurain and tryptophillin families contained very few peptides, they were combined into a single class named "Other".
Independent dataset
We took 466 peptides from the family classification dataset (fetched from Swiss-Prot) that were not present in our main dataset (taken from APD). This dataset was used neither for training nor for testing the method. These peptides served as the independent dataset for evaluating the performance of the prediction models.
Techniques used
As the SVM-based technique performed best in AntiBP [19], we used SVM to develop the prediction method in this study. All SVM models were developed using the freely available program SVM_Light [29], which allows users to run SVM with various kernels and parameters. Accuracy was computed at the cut-off score where sensitivity and specificity are nearly equal.
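One way to locate the cut-off at which sensitivity and specificity are nearly equal is to scan the observed SVM scores and keep the threshold with the smallest gap between the two. The sketch below assumes labels of +1 for antibacterial and -1 for non-antibacterial peptides and predicts positive when the score is at or above the cut-off; the function name and the scanning procedure are illustrative, not taken from the paper.

```python
def balanced_cutoff(scores, labels):
    """Return (cutoff, sensitivity, specificity) with the smallest |Sn - Sp| (sketch)."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    best = None
    for cutoff in sorted(set(scores)):
        # A peptide is predicted antibacterial when its score is >= cutoff.
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if y != 1 and s < cutoff)
        sn = 100.0 * tp / positives
        sp = 100.0 * tn / negatives
        gap = abs(sn - sp)
        if best is None or gap < best[0]:
            best = (gap, cutoff, sn, sp)
    return best[1:]
```

The scan is quadratic in the number of peptides, which is acceptable for datasets of a few thousand sequences.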
Evaluation of parameters
Five-fold cross-validation was used to evaluate the performance of all the models developed in this study. In this technique, a dataset is randomly divided into five sets, each containing a nearly equal number of antibacterial and non-antibacterial peptides. Four sets are used for training and the remaining set for testing. This process is repeated five times so that each set is used once for testing. The performance of a method is its average performance over the five sets. The following parameters were used to assess the performance of a method:
Sensitivity (Sn) = TP / (TP + FN) × 100
Specificity (Sp) = TN / (TN + FP) × 100
Accuracy (Ac) = (TP + TN) / (TP + FP + TN + FN) × 100
where TP and TN are correctly predicted antibacterial and non-antibacterial peptides, respectively; FP are non-antibacterial peptides wrongly predicted as antibacterial, and FN are antibacterial peptides wrongly predicted as non-antibacterial. Sensitivity (Sn), or percent coverage of antibacterial peptides, is the percentage of antibacterial peptides predicted as antibacterial; specificity (Sp), or percent coverage of non-antibacterial peptides, is the percentage of non-antibacterial peptides predicted as non-antibacterial; overall accuracy (Ac) is the percentage of correctly predicted antibacterial and non-antibacterial peptides. Five-fold cross-validation was used for the evaluation of all three methods.
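The evaluation procedure can be illustrated with a short sketch that splits a labelled dataset into five class-balanced folds and computes Sn, Sp and Ac from true and predicted labels. Labels of +1 and -1, the function names and the round-robin fold assignment are assumptions made for illustration; training and scoring of the SVM itself are left out.

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with nearly equal class counts (sketch)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def performance(true_labels, predicted_labels):
    """Return sensitivity, specificity and accuracy as percentages."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t != 1 and p != 1)
    fp = sum(1 for t, p in pairs if t != 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p != 1)
    return {
        "Sn": 100.0 * tp / (tp + fn),
        "Sp": 100.0 * tn / (tn + fp),
        "Ac": 100.0 * (tp + tn) / len(pairs),
    }
```

In each round, one fold would serve as the test set and the other four as the training set; averaging the `performance` values over the five rounds gives the reported figures.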
Prediction of antibacterial peptides
Whole peptide based approach
Although the terminus-based approaches are useful for scanning a larger protein sequence for antibacterial peptides, they cannot readily predict peptides shorter than 15 residues. Therefore, a whole-peptide SVM model was also developed to predict antibacterial peptides of any length. The amino acid composition of each peptide was used as the input feature to train the SVM.
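Amino acid composition turns a peptide of any length into a fixed vector of 20 values, which is what makes a single model applicable to all lengths. The sketch below expresses each value as a percentage of the standard residues in the peptide; the residue ordering, the percentage scaling and the handling of non-standard characters are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, alphabetical by one-letter code

def amino_acid_composition(peptide):
    """Return a 20-value percent composition vector for a peptide (sketch)."""
    peptide = peptide.upper()
    counts = [peptide.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)  # assumption: non-standard characters are ignored
    if total == 0:
        raise ValueError("peptide contains no standard amino acids")
    return [100.0 * c / total for c in counts]
```

For example, a peptide made of equal numbers of alanine and lysine residues gives 50 at the positions for A and K and 0 elsewhere.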
NT15, CT15 and NTCT15 approach
Again, the binary patterns of the NT15, CT15 and NTCT15 datasets were used to develop prediction methods, as described in AntiBP. Performance was evaluated using five-fold cross-validation.
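The binary pattern encoding is defined in AntiBP rather than in this section. A common form, assumed here, represents each residue by a 20-dimensional vector with a single 1 at the position of that amino acid, so a 15-residue peptide becomes a 300-dimensional vector and a 30-residue NTCT15 peptide a 600-dimensional one. The residue ordering and the all-zero encoding of non-standard residues are also assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed residue ordering

def binary_pattern(peptide):
    """Encode a peptide as concatenated 20-dimensional one-hot vectors (sketch)."""
    vector = []
    for residue in peptide.upper():
        # One position is set per residue; a non-standard residue yields 20 zeros.
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector
```

Unlike composition, this encoding preserves residue order, which is why it requires peptides of a fixed length.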
Classification of antibacterial peptides
Multiclass SVM was used to develop models that classify antibacterial peptides by source, i.e. bacteria, insects, frogs, mammals and plants. For N-class classification, N SVM models were constructed. For classification of antibacterial peptides by source, the number of classes was 5, so five one-versus-rest (1-v-r) SVM models were constructed. The i-th SVM was trained with all samples of the i-th class labelled positive and all other samples labelled negative. An unknown example was assigned to the class corresponding to the SVM with the highest output score. The results for the family prediction are given in Table 2.
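The one-versus-rest scheme can be sketched in two small steps: relabelling the training data for the i-th SVM, and choosing the class whose SVM gives the highest score. The function names are hypothetical, and the scores in the usage note are invented for illustration; in practice they would be the outputs of the five trained SVM models.

```python
def one_vs_rest_labels(class_labels, target):
    """Label samples of the target class +1 and all other samples -1."""
    return [1 if c == target else -1 for c in class_labels]

def classify(scores_by_class):
    """Return the class whose SVM produced the highest output score."""
    return max(scores_by_class, key=scores_by_class.get)
```

For instance, `classify({"bacteria": -0.4, "insect": 0.9, "frog": 0.2, "mammal": -1.3, "plant": 0.1})` returns `"insect"`.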
Antibacterial peptides from the various sources were further classified into families. Classification models were developed for peptides from insects, frogs and mammals. To classify insect antibacterial peptides into families, five 1-v-r SVM models were developed. Similarly, five 1-v-r SVM models each were developed to classify frog and mammalian antibacterial peptides into their respective families. The detailed results for the classification of insect, frog and mammalian peptides are given in the results section (Tables 3, 4 and 5).