Analysis and prediction of antibacterial peptides

Background Antibacterial peptides are important components of the innate immune system, used by the host to protect itself from different types of pathogenic bacteria. Over the last few decades, the search for new drugs and drug targets has prompted an interest in these antibacterial peptides. We analyzed 486 antibacterial peptides, obtained from antimicrobial peptide database APD, in order to understand the preference of amino acid residues at specific positions in these peptides. Results It was observed that certain types of residues are preferred over others in antibacterial peptides, particularly at the N and C terminus. These observations encouraged us to develop a method for predicting antibacterial peptides in proteins from their amino acid sequence. First, the N-terminal residues were used for predicting antibacterial peptides using Artificial Neural Network (ANN), Quantitative Matrices (QM) and Support Vector Machine (SVM), which resulted in an accuracy of 83.63%, 84.78% and 87.85%, respectively. Then, the C-terminal residues were used for developing prediction methods, which resulted in an accuracy of 77.34%, 82.03% and 85.16% using ANN, QM and SVM, respectively. Finally, ANN, QM and SVM models were developed using N and C terminal residues, which achieved an accuracy of 88.17%, 90.37% and 92.11%, respectively. All the models developed in this study were evaluated using five-fold cross validation technique. These models were also tested on an independent or blind dataset. Conclusion Among antibacterial peptides, there is preference for certain residues at N and C termini, which helps to demarcate them from non-antibacterial peptides. Both the termini play a crucial role in imparting the antibacterial property to these peptides. Among the methods developed, SVM shows the best performance in predicting antibacterial peptides followed by QM and ANN, in that order. AntiBP (Antibacterial peptides) will help in discovering efficacious antibacterial peptides, which we hope will prove to be a boon to combat the dreadful antibiotic resistant bacteria. A user friendly web server has also been developed to help the biological community, which is accessible at .


Background
In the past few decades, a large number of bacterial strains have evolved ways to adapt or become resistant to the currently available antibiotics [1]. Researchers are focusing on alternative drugs based on antimicrobial peptides, which play an important role in innate immunity. As part of innate immune system, antimicrobial peptides provide protection against a wide variety of microorganisms in both vertebrates and invertebrates [2][3][4][5][6][7][8][9][10]. These peptides are very diverse with respect to amino acid sequence and secondary structure but share certain properties, such as affinity for the negatively charged phospholipids that are present on the outer surfaces of the cytoplasmic membrane of many microbial species. These peptides are ubiquitous, simple and effective factors acting within the innate immune system. Their short length and fast & efficient action against microbes has made them potential candidates as peptide drugs [4,11]. Several peptides and their derivatives have already passed clinical trials successfully [12,13] and several others are considered as potential therapeutics [12].
Antimicrobial peptides have a broad spectrum of activity, including activity against bacteria, fungi, viruses, and even cancer cells [14]. Other than having pathogen-lytic activities, these peptides have other properties like anti-tumour activity; mitogen activity, or act as signalling molecules [14]. In addition, they have a number of biotechnological applications, e.g. in transgenic plants [15,16], in aquaculture, and as aerosol spray for patients of cystic fibrosis [17]. In the past extensive work has been done in the field of antibacterial peptides, describing their identification, characterization, mechanism of action etc. The information about these peptides has been collected and compiled; following are major databases on antimicrobial peptides I) ANTIMIC have around 1700 sequences [18] ii) AMSDb consists 804 antimicrobial peptides of eukaryotic origin [19], iii) Peptaibol consists of around 300 antibiotic peptides originated from fungal organisms [20] and iv) APD consists of detailed information for 525 antimicrobial peptides [21].
Recently, a HMM (Hidden Markov Model) based method has been developed for searching conserved motifs of βdefensin family in genome databases [22]. The antibacterial peptides have little sequence homology, despite common properties [12]. Thus it is difficult to develop method for predicting the antibacterial peptides based on similarity. Moreover, experimental methods for identification and designing of antibacterial peptides are costly, time consuming and resource intensive. Thus there is a need to develop computational tools for predicting antibacterial peptides, which could be used to design potent peptides against bacterial pathogens.
In the present study, a systematic attempt has been made to understand this important class of peptides and to develop an algorithm for predicting antibacterial peptides with high accuracy. First, we collected and analyzed antibacterial peptides in order to understand preference of residue type at different positions. We also compared composition of antibacterial and non-antibacterial peptides. We analyzed residues of both N and C termini as both play important role in their activity; for example, Cterminus is responsible for the membrane interaction and pore formation, while the N-terminal region is important in bacteria-specific interaction process [23]. Based on our observation, we developed our methods using both N-terminus and C-terminus approach. In this study, three popular classification techniques, namely, Quantitative Matrix (QM), Artificial Neural Network (ANN) and Support Vector Machine (SVM), have been used to predict antibacterial peptides.

Analysis of the antibacterial peptides
The frequency of occurrence of all 20 natural amino acids at both termini was examined. It was observed that particular types of residues are preferred over others in anti-bacterial peptides. In order to demonstrate residue preference at different position of antibacterial peptides, sequence logos using plogo program [24] were generated. The sequence logos of 15 N-terminal and C-terminal residues are shown in Figure 1 and 2, respectively. As shown in Figure 1, certain residues are more abundant at specific positions, e.g., G, F, V, R at first position; L, I, W, F at 2 nd position etc. Overall antibacterial peptides are dominated by certain type of residues like G, R, K, L etc., being present at most of the positions. Similarly, certain residues are preferred at the C-terminus, for example residues K, G, C, and R are preferred at most of the positions.
In order to understand biasness of residues in antibacterial peptides, we computed frequency of each type of the residues in antibacterial and non-antibacterial peptides. As shown in Figure Additional file: 1 (Supplementary Material), residues G and F are significantly higher at first position of N-terminus of antibacterial peptides than that of non-antibacterial peptides. Similarly residues W, I, L, F are more frequent at the 2 nd position of N-terminus of Sequence logo of first fifteen residues (N-terminus) of anti-bacterial peptides Figure 1 The figure depicts the sequence logo of first fifteen residues (N-terminus) of antibacterial peptides, where size of residue is proportional to its propensity antibacterial peptides in comparison to non-antibacterial peptides. We also computed polar, non-polar, positive charge residues etc. at different positions of antibacterial peptides and compared with non-bacterial peptides (See Figure S11 to S20). Though both termini possess higher frequency of positively charged residues (K and R), the difference between the frequency of these residues between the antibacterial and non-antibacterial peptides is higher at the C-terminus. The difference in the frequencies of the positively charged residues between the positive and negative examples increases from position 1 to 5 at the N-terminus, whereas it tends to decrease from position 1 to 5 at the C-terminus. While higher frequency of the positively charged residues may be helping the C-terminus to interact with the negatively charged bacterial membrane, positively charged residues at the N-terminus may help in the interaction between these peptides and intracellular components like DNA and RNA, thus hampering the crucial bacterial functions. The proportion of Cysteine is higher at the N-terminus also. The proportion of negatively charged amino acids (D and E) were also found to be very low, as expected, since these may interfere during the course of penetration of the negatively charged cell membrane of the bacteria.

Overall compositional biasness
As shown in the above analysis ( Figure S1-S20), antibacterial peptides have preference for certain types of residues at certain positions. The next question is if the overall composition of these antibacterial is also different from that of non-antibacterial peptides. Thus we compared the overall amino acid composition of antibacterial and nonantibacterial peptides. As shown in Figure 3, certain residues like C, G, I and K are significantly higher in antibacterial peptides, as expected, since these residues are preferred at certain positions in antibacterial peptides. It is also noteworthy that residue R was not significantly higher in antibacterial peptides, though it was preferred at certain positions. Certain residues like D, E, S and P are not preferred in antibacterial peptides.

Prediction of Antibacterial Peptides
Whole peptide The prediction of antibacterial peptides is difficult, but very important. In the past attempts have been made, but these methods are subjective, based on physico-chemical property plots, than quantitative. The prediction of antibacterial peptides is difficult due to variation in the size of antibacterial peptides, since the machine learning techniques need fixed length pattern. Thus the question was how to circumvent this problem and develop a machine learning based method for predicting antibacterial peptides. As shown in Figure 3, amino acid composition of antibacterial peptides is significantly different from that of non-antibacterial peptides. In the past, methods have been developed based on machine learning techniques, using protein composition to classify the proteins (in spite of their variation in size), particularly for subcellular localization of proteins. Thus, we have developed a SVM based model using amino acid composition on our main dataset and achieved an accuracy of 89.04% with sensitivity of 89.92% and specificity of 88.09% (Table 1). Though we could predict antibacterial peptides with high accuracy using amino acid composition, we realised the weakness of this composition-based approach, since it is difficult to decide what length we should search in proteins as length of antibacterial peptides varied from 5 to ~80. Despite its better performance, it is impossible to use this approach for searching antibacterial peptides in genomes or proteins.
Overall amino acid composition of antibacterial vs non-anti-bacterial peptides Figure 3 Overall amino acid composition of antibacterial vs non-antibacterial peptides. This figure represents a comparison of composition bias of various amino acids in antibacterial and non-antibacterial peptides.
Sequence logo of last fifteen residues (C-terminus) of anti-bacterial peptides Figure 2 The figure depicts the sequence logo of last fifteen residues (C-terminus) of antibacterial peptides, where size of residue is proportional to its propensity.

N-terminal approach
It has been observed that certain types of residues are preferred at the N-terminal regions of antibacterial peptides. Thus we made an attempt to develop a method for predicting antibacterial peptides using their N-terminal residues. We created a dataset NT5 which consists of 5 Nterminal residues of antibacterial and non-antibacterial peptides (See Material and Methods). Then we developed a SVM model using the amino acid composition of these N-terminal residues and achieved an accuracy of 73.20% with 73.65% sensitivity and 72.75% specificity ( Table 1). The performance of this SVM module is much lower than of SVM module using composition of whole peptide, but this module can be used to scan antibacterial peptides in protein sequences. As we are using only first five residues here, it's possible to use these residues as binary pattern instead of composition. In binary pattern a vector of dimension 20 represents a residue, and for 5 residues the input vector of dimension is 100 (20 × 5). As shown in Table 1, we achieved an accuracy of 74.55% using the binary pattern based SVM model. Overall, the binarybased SVM model performs better than composition based model using first five N-terminal residues. Thus we extended our binary-based SVM models for predicting antibacterial peptides using more number of residues from the N-terminus. As shown in Table 2, we achieved an accuracy of 86.45% with specificity of 87.68% using binary-based SVM using the first 10 N-terminal residues. We got the best accuracy using first 15 N-terminal residues, which is 87.85% with specificity 87.98%.

C-terminal and N+C terminal
We have used the same strategy for the C-terminus as used for the N-terminus. The performance of binary-based SVM model using 15 C-terminal residues is shown in Table 3. Though performance was lower using C-terminal residues, the trend of performance was same. In order to check if using the N and C-termini of the peptides together will enhance the accuracy of the method, we developed a N+C-terminus based approach. In this approach the first and last 15 residues of each peptide in the original dataset were joined in order to create the positive dataset of NC-termini based method. Similar strategy as used in the N and C termini approaches were applied in this approach also. The performance of binary-based SVM model using N+C terminal residues is shown in Table 3. We got an accuracy of 92.11% with 92.11% specificity using binary-based SVM model using the first and last 15 residues. This model performs better than other models. The trend of performance in this case too follows the trend shown by the above two approaches, but the performance of this approach.

Quantitative Matrices (QM)
In case of QM, the contribution of each residue for each position of antibacterial peptide has been calculated. A matrix with weights for each amino acid residue in every position of peptide was generated using Equation 1. Table  S1 in supplementary file shows the propensity of each residue at each position in antibacterial peptides. Higher positive score of a residue at a given position means this residue is highly preferred at that position, for example G, W, D and C are highly preferred at positions 1, 2, 3, 4 and 5. Similarly, higher negative scores mean that residue is not preferred in antibacterial peptides at that position. One of the major advantages of QM is that the effect of each residue on antibacterial activity of peptide could be easily estimated. As shown in Table 4, we achieved an accuracy of 84.78%, 82.03% and 90.37%, using 15 residues of N, C and N+C terminal of peptides, respectively.

Artificial Neural Network (ANN)
The neural network is a powerful machine learning technique commonly used in the field of bioinformatics. Thus we also used this technique in this study for predicting antibacterial peptides. The performance of ANN-based method using 15 N-terminal residues is shown in Table 5.
We achieved an accuracy of 83.63%, 77.34% and 88.17%, using 15 residues of N, C and N+C terminal of peptides, respectively. Though the trend was same as that of SVM, QM and ANN models, the performance of ANN was poor compared to other techniques. This is probably due to the small number of peptides used for training the models.

Consensus approach
In an attempt to enhance the prediction accuracy we tried the consensus approach. In this approach SVM, QM and ANN, were separately used to predict whether a peptide is antibacterial or not. A peptide was considered to be predicted positive only if at least two of these techniques predict it as a positive. Otherwise the peptide was predicted negative. The accuracy of this consensus prediction was calculated. This was done for all the three approaches i.e. N-terminal; C-terminal and NC-terminal approach ( Table  6). It was observed that the performance of consensus model was slightly better than that of SVM model.

Performance on independent or blind dataset
The models developed in this study were evaluated on an independent dataset (Table 7). These antibacterial peptides in the independent dataset were not used for developing above models either in training or testing. This dataset consists of 39 antimicrobial peptides obtained from Swiss-Prot. As shown in Table 7, SVM, ANN and QM correctly predicted 23, 19 and 21 antibacterial peptides, respectively, using N-terminal residues of peptides. In case of C-terminus based approach SVM, ANN and QM correctly predicted 30, 27 and 29 peptides, respectively. It was interesting to note that the C-terminus based models perform better than the N-terminus based models, which is in contrast to the results when evaluated using five-fold cross validation. The models based on N+C termini correctly predicted 26, 25 and 27 peptides, using SVM, ANN and QM, respectively. Overall performance of our models was very poor on independent dataset in comparison to performance on the main dataset. In order to understand the reason of poor performance we randomly extracted 39 peptides from the main dataset to form a new independent dataset and added 39 peptides from the old independent dataset to the main dataset. Then we retrained our models and evaluated their performance on 39 peptides of new independent dataset. As shown in Table 7 performance of models on these 39 peptides was as good as performance on main dataset. This shows that our 39 original peptides used as independent dataset have either some uniqueness or some problem. Thus we reanalyzed these peptides and observed that 24 out of 39 are pure peptides (only antibacterial peptides) and 15 peptides are precursors to antibacterial peptides (containing both the signal sequence and the antibacterial peptide) that require further processing in order to produce active antibacterial peptides. The poor performance on independent dataset may be due to inclusion of these precursor sequences in the independent dataset as the dataset used for training the method contains only active antibacterial peptides (and not precursors of antibacterial peptides). Thus we checked the performance of our method on these 24 active antibacterial sequences, and as expected the performance of our method on these 24 sequences was as good as on main dataset.

Discussion
In the past attempts have been made to develop methods and strategies for designing effective antimicrobial peptides [25,26]. In this paper we have made systematic and comprehensive effort to understand and predict antibacterial peptides. As shown Figure 1 and 2, antibacterial peptides have preference for certain types of residues at certain positions. This observation is very important for experimentalists in designing de-novo antibacterial peptides; they may substitute amino acids at certain positions in order to make the peptides more effective. The quantitative matrices provide the magnitude of contribution of each type of residue at each position of antibacterial peptide. This information can be used directly for designing antibacterial peptides (See Supplementary Material). The major challenge was to develop an accurate method for predicting antibacterial peptides due to two major reasons; i) the antibacterial peptides have lot of variation in size and machine learning software need fixed length pattern and ii) experimentally proven non-antibacterial are not available which are very important for developing the method. This problem is similar to B-cell epitope prediction [27]. In order to handle the problem we developed methods using only limited number of residues from the N and C termini of peptides. In order to generate nonantibacterial peptides or negative examples, we obtained peptides randomly from non-secretory proteins [28]. Though some these peptides can be antibacterial, but chances are very rare.
First we attempted to develop models to discriminate antibacterial peptides from non-antibacterial peptides using whole peptides. In this case we used amino acid composition of peptides as input for our models. Though this model discriminates antibacterial peptides with high accuracy, it is difficult to use this model for scanning potential antibacterial proteins. Thus wee used N-terminal, C-terminal and N+C terminal approaches, as both N and C-termini play an important role in antibacterial activity, as suggested by the literature cited. The C-termi- nus interacts with the cell membrane and makes a pore, whereas the N-terminus helps in bacteria specific interaction process. One of the limitations of Machine Learning Techniques is that they require fixed length input vectors. Antibacterial peptides, on the contrary, are of varying length. Therefore, peptides of fixed lengths (5, 10, 15 20) were used to develop a SVM model. The number of peptides reduced significantly when 20 or 25 residue long peptides are used instead of 15 residues (335 20 mer peptides & 252 25 mer peptides). Thus in this study, we made models using 15 residues only as most (~60%) of the antibacterial peptides are 20-30 residues long. The performance of all the approaches was evaluated using five fold cross validation techniques. Performance of all modules based on QM, ANN and SVM was impressive but SVM based module using fifteen residues was the best. The poor performance of the ANN as compared to QM and SVM is due to the over optimization of ANN. It has been shown in the past that ANN suffers with the problem of over optimization when the training dataset is small [29].
AntiBP, though has some limitations like not including post-translational modifications and topological aspects, but to the best of our knowledge it is first method developed for predicting and designing potential antibacterial peptides. Our methods are likely to help the researchers in finding and in designing better peptides-based antibiotics.

Conclusion
Currently a great deal of interest is shown in antibacterial peptides, so called "nature's antibiotics", which seem to be promising to overcome the growing problem of antibiotic resistance [30][31][32]. The design of novel peptides with antimicrobial activities requires the development of methods for narrowing down the candidate peptides so as to enable rational experimentation by wet-lab scientists. AntiBP is one such method meant to discover efficacious antibacterial peptides that we hope could prove to be a boon to combat the dreadful antibiotic resistant bacteria.

Dataset
We extracted 486 antibacterial peptides having less than 61 residues from APD database [21]. We analyzed the length of these peptides and observed that most of them have more than 15 residues. Thus we created a dataset of antibacterial peptides having number of residues between 15 and 60. After removing identical peptides, we got a dataset of 436 non-redundant antibacterial peptides, also called positive dataset. In order to generate negative dataset or dataset of non-antibacterial peptides, we extracted equal number of peptides randomly from non-secretary proteins. Our final dataset, called main dataset, has 436 antibacterial and 436 random peptides.

N-terminal and C-terminal Dataset
In addition to the main dataset, we also created other datasets (e.g. NT5, NT10, NT15, CT15, and NTCT15) using terminal residues of peptides. In order to create a dataset NT15, we extracted 15 residues from N-terminus of antibacterial peptides. After removing identical peptides, we got 391 non-redundant peptides of length 15. An equal number of random peptides have been extracted from non-secretory proteins. Finally, dataset NT15 has 391 antibacterial peptides and 391 random peptides of length 15. Similarly, we created datasets NT5 and NT10 after extracting 5 and 10 residues from N-terminus of antibacterial peptides. In order to create dataset CT15, we extracted 15 residues from the C-terminus of antibacterial peptides. After removing identical peptides, we got 384 non-redundant antibacterial peptides and an equal number of random peptides. Similarly, we created datasets CT5 and CT10. In order to create dataset NTCT15, we extracted 15 residues from N-terminus and 15-residues from C-terminus of antibacterial peptides. After removing identical peptides, we got 431 non-redundant antibacterial peptides; equal number of random peptides was extracted from non-secretory proteins.

Blind or Independent dataset
All antibacterial peptides with residues between 15 and 80 were extracted from Swiss-Prot database and from this set peptides present in our main dataset were removed. Finally we got a dataset of 39 antibacterial peptides called blind and independent dataset. All these 39 peptides are experimentally verified and annotated as antibacterial peptides according to Swiss-Prot.

Quantitative matrix
The quantitative matrix is basically a propensity of each residue at a particular position. To classify the data of antibacterial peptides and non-antibacterial peptides, different quantitative matrices were generated for the Nterminal and C-terminal residues. Following equation was used to generate the quantitative matrices Q (i,r) = P (i,r) -N (i,r) Where, Q (i,r) is the weight of residue r at position 'i' in the matrix. 'r' can be any natural amino acid and the value of 'i' can vary from 1 to 15. P (i,r) and N (i,r) is the probability of residue 'r' at position 'i' in antibacterial peptides and non-antibacterial peptides respectively. E i,r and A i,r is number residue 'r' at position 'i' in antibacterial peptides and non-antibacterial peptides, respectively, and NP i,r is  the number of antibacterial peptides and NN i,r is the number of non-antibacterial peptides having residue 'r' at position 'i'. The quantitative matrix generated by using eq.(1), which is an addition matrix where the score of a peptide is calculated by summing up the scores of each residue at specific position along peptide sequence as Where, L is the length of the peptide.

Support Vector Machine
In this study, all SVM models have been developed using a freely available program SVM_LIGHT [33]. This program allows users to run SVM using various kernels and parameters. In this study, the accuracy was computed at a cut-off score where sensitivity and specificity are nearly equal.

Artificial Neural network
In order to develop ANN models we used Stuttgart Neural Network Simulator, SNNS version 4.2 [34]. The advantage of this package is that it allows incorporation of resulting networks in ANSI C functions for use in stand-alone code. The critical step in ANN is the optimization of the hidden nodes and other learning parameters, in order to achieve the best performance. Standard square error (SSE) was used for the performance function.

Evaluation of the Method
Five-fold cross-validation technique has been used to evaluate the performance of all the models developed in this study. In five fold cross-validation technique a dataset is randomly divided into five sets, where each set consists of nearly equal number of antibacterial peptides and non antibacterial peptides. Four sets are used for training and the remaining set for testing. This process is repeated five times so that each set is used once for testing. The performance of method is average performance of method on five sets. Following parameters has been used for assessing the performance of a method.
, where PPV is Probability of positive prediction.
Where TP and TN are correctly predicted antibacterial peptides and non-antibacterial peptides respectively. FP and FN are wrongly predicted antibacterial peptides and non -antibacterial peptides respectively. The five fold cross validation technique was used for evaluation of all the three methods used.