An advanced approach to identify antimicrobial peptides and their function types for penaeus through machine learning strategies

Abstract

Background

Antimicrobial peptides (AMPs) are essential components of the innate immune system and can protect the host from various pathogenic bacteria. The marine environment is known to be one of the richest sources for AMPs. Effective usage of AMPs and their derivatives can greatly improve the immunity and breeding survival rate of aquatic products. It is highly desirable to develop computational tools for rapidly and accurately identifying AMPs and their functional types, for the purpose of helping design new and more effective antimicrobial agents.

Results

In this study, we developed an advanced machine-learning-based computational approach, MAMPs-Pred, for identifying AMPs and their function types. First, SVM-Prot 188-D features were extracted and used as input to a two-layer multi-label classifier. Specifically, the first layer applies an RF classifier to identify whether a sequence is an AMP, and the second layer addresses the multi-type problem by applying PS-RF and LC-RF classifiers to identify the activities or function types of AMPs. To benchmark the method, MAMPs-Pred was also compared with the best-performing existing methods in the literature and showed improved identification accuracy.

Conclusions

The results reported in this study indicate that the MAMPs-Pred method achieves high performance in identifying AMPs and their functional types. The proposed approach is expected to supplement the tools and techniques previously developed for predicting AMPs and their function types.

Background

Antimicrobial peptides (AMPs) are crucial components of the innate immune system and can protect the host from various pathogenic bacteria and viruses. They are generally short peptides with 10–50 amino acids [1] and have very low sequence homology to one another. AMPs have attracted increasing research attention owing to their broad-spectrum antimicrobial activity and, more importantly, because they may overcome antimicrobial resistance, which makes them potential alternative therapeutic agents for humans or substitutes for conventional antibiotics.

However, the mechanisms of action of AMPs, as well as their structure-activity relationships, are not completely understood [2]. Identification and optimization of AMPs can provide a theoretical basis for discovery and design of new and more effective antimicrobial agents. For instance, a multidimensional signature model was proposed in [3] that facilitates discovery of AMPs and offers insights into the evolution of molecular determinants. Experimental and computational studies are generally devoted to dealing with this challenging task. Computational methods were developed to accelerate the process of prediction and classification of AMPs. Recently, approaches based on machine learning techniques are commonly adopted due to their high efficiency, high speed, low cost and generalization abilities. They can sufficiently mine the intrinsic linear and non-linear relationship between antibacterial activity and biochemical attributes, which is suitable for dealing with large scale antimicrobial peptide prediction tasks with complex models.

Methods of choice include support vector machines (SVMs) [4–7], nearest neighbor [8] or k-nearest neighbor algorithms [9], random forests (RFs) [10], decision tree models [11], hidden Markov models (HMMs) [12], and neural network models [13], which seek predictive power in the context of supervised classification. More recent work includes a "deep" network architecture for chemical data analysis and classification, together with a prospective proof-of-concept application, proposed in [14]. Some predictors apply only binary classifiers to identify whether a query peptide sequence is an AMP or not, such as [4, 5, 8]. Multi-class classifiers, which provide more detailed quantitative results, have also been developed. Lira et al. [11] created a decision tree model to classify the antimicrobial activities of synthetic peptides into four classes. ClassAMP [4] was developed to predict the propensity of a peptide sequence to have antibacterial, antifungal, or antiviral activity. However, a comparison of the sequences in the APD database [15, 16] shows that the same sequence may occur in different subclasses, which is in fact a very common phenomenon. Therefore, it is highly desirable to develop mechanisms for rapidly and accurately learning from multi-label datasets, for the purpose of helping design new and more effective antimicrobial agents. Considering the various possible functional types of AMPs, Xiao et al. proposed a two-level multi-label classifier, iAMP-2L, in which an improved fuzzy K-nearest neighbour (FKNN) algorithm is applied: AMPs are first identified, and the positive samples are then subjected to regular multi-label learning [9]. The prediction accuracy for 4 types of AMPs was further improved in [17]. Zhou’s method [18] applied the LIFT multi-label learning algorithm to predict 5 types of AMPs and achieved 70% prediction accuracy.

This paper aims to develop an advanced method, MAMPs-Pred, for the classification and prediction of AMPs and their function types, which achieves improved prediction accuracy over state-of-the-art methods. The marine environment is known to be one of the richest sources of AMPs. Predicting the AMPs of penaeus and their function types with this method is therefore meaningful, because it helps us to understand the immune system of marine species. In addition, it eases subsequent mining and exploration of the antimicrobial activity of other species.

In this approach, a 188-D feature set constructed from SVM-Prot features [19, 20] was used to map the peptide sequences to numeric feature vectors, which were subsequently used as input to a two-layer multi-label classifier. The first layer identifies whether a query peptide sequence is an AMP, and the second layer addresses the multi-type problem by identifying which function types an AMP belongs to. Different classification methods were compared, and the results were discussed and analyzed. In short, the combination of a first-layer 188D-RF classifier and a second-layer PS-RF or LC-RF classifier achieved the best performance. On the benchmark dataset, the proposed approach achieved higher accuracy than the best-performing existing approaches. In addition, the quality of the prediction was verified on penaeus sequences. The proposed method may play an important complementary role to the existing predictors in this area.

Materials and methods

Benchmark dataset

For the convenience of later description, the benchmark dataset is expressed by

$$\begin{array}{*{20}l} s &= s^{AMPs} \cup s^{non-AMPs} \end{array} $$
(1)

where sAMPs is the AMP dataset consisting of AMP sequences only, snonAMPs is the non-AMP dataset consisting of non-AMP sequences only, and ∪ is the symbol for union in set theory. The peptide sequences in sAMPs were fetched from the APD database [15, 16], which has collected all antimicrobial peptides from the PubMed, PDB, Google and Swiss-Prot databases. According to their different functional types, the AMP sequences can be further classified into 16 categories; i.e.,

$$\begin{array}{*{20}l} s^{AMPs} &= s_{1}^{AMPs} \cup s_{2}^{AMPs} \cup s_{3}^{AMPs} \cup \ldots \cup s_{16}^{AMPs} \end{array} $$
(2)

where the subscripts 1, 2, 3,...,16 represent “Wound healing”, “Spermicidal”, “Insecticidal”, “Chemotactic”, “Antifungal”, “Anti-protist”, “Antioxidant”, “Antibacterial”, “Antibiotic”, “Antimalarial”, “Antiparasital”, “Antiviral”, “Anticancer/tumor”, “Anti-HIV”, “Proteinase inhibitor” and “Surface immobilized”. The lengths of the AMPs range from 5 to 100 amino acids. Note that among the original 2954 sAMPs sequences, 278 sequences have unknown antibacterial activity.

Furthermore, to reduce homology bias and redundancy, the program CD-HIT [21] was utilized to winnow those sequences that have ≥ pairwise sequence identity to any other in a same subset. The CD-HIT alignment bandwidth was set to 5, in accordance with the shortest AMP length. To ensure that each subset has enough samples for statistical processing, and that all categories are covered, CD-HIT redundancy removal was performed only on subsets containing more than 180 sequences, which means that de-redundancy processing was performed only for the antifungal, antibacterial, antiviral and anti-cancer polypeptides. Finally, we obtained 2618 AMPs as the current benchmark dataset sAMPs, as shown in Table 1.

Table 1 Preprocessed benchmark dataset

The negative samples snonAMPs contain polypeptide sequences, snonAMPsPept, and protein fragments, snonAMPsProt.

snonAMPsPept was constructed according to the following procedure:

  1. Collected all the polypeptide sequences sUNPPeptide with length 1 to 15483, in total 79378, from the UniProt database.

  2. Removed any sequence that already exists in sAMPs, any sequence that contains any code other than the 20 native amino acid codes, and any sequence with length less than 5 or larger than 100.

  3. The process is described by following equation, and at this point 10503 sequences snonAMPsPept were obtained.

    $$\begin{array}{*{20}l} s^{non-AMPs-Pept}&=s^{UNP-Peptide}-s^{AMPs}-seq_{illeg}\\ (len \in [5,100]) \end{array} $$
    (3)
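The three filtering steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: the function name `filter_negative_peptides` and the toy sequences are hypothetical, and the real input would be the UniProt peptide collection and the APD-derived AMP set.

```python
# The 20 native amino acid codes; any other character marks an illegal sequence.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def filter_negative_peptides(candidates, known_amps, min_len=5, max_len=100):
    """Keep candidates that are not known AMPs, contain only the 20 native
    amino acid codes, and have a length within [min_len, max_len]."""
    amp_set = set(known_amps)
    kept = []
    for seq in candidates:
        if seq in amp_set:                      # already in the AMP dataset
            continue
        if not set(seq) <= STANDARD_AA:         # contains a non-native code
            continue
        if not (min_len <= len(seq) <= max_len):
            continue
        kept.append(seq)
    return kept

# Toy data (hypothetical sequences, for illustration only).
candidates = ["ACDEFGH", "GIGKFLHSAK", "ACDX", "ACD", "MKTAYIAKQR"]
known_amps = ["GIGKFLHSAK"]
negatives = filter_negative_peptides(candidates, known_amps)
print(negatives)
```

In this toy run one candidate is dropped because it is a known AMP, one because it contains the code `X`, and one because it is shorter than 5 residues.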

On the other hand, snonAMPsProt were constructed according to following procedures:

  1. Obtained Pfam families that sAMPs belong to. Because some AMPs are homologous and have the same family number, we remove duplicate family numbers from Pfam and get de-redundant families posPfam.

  2. Removed posPfam from the Pfam families and obtained negPfam. Fetched a random protein sequence with the length between 5 and 100 from each negPfam family.

  3. The process is described by following equation. In total 109 short protein sequences snonAMPsProt were obtained.

    $$\begin{array}{*{20}l} s^{non-AMPs-Prot}&=Ran(Pfam-posPfam)\\ (len \in [5,100]) \end{array} $$
    (4)

The snonAMPs were constructed by following equation.

$$\begin{array}{*{20}l} s^{non-AMPs}&=s^{non-AMPs-Pept} \cup s^{non-AMPs-Prot} \end{array} $$
(5)

The CD-HIT [21] program was then applied to winnow snonAMPs. Finally, 4371 sequences were constructed, which were used to form the negative samples dataset snonAMPs as shown in Table 1.

Feature extraction

In machine learning, choosing informative, discriminating and independent features is a crucial step for the success of a prediction method. The optimal feature set shall be able to capture the distribution patterns of the dataset.

In this study, we have adopted two feature extraction algorithms for comparison, which are SVM-Prot 188-D based on 8 types of physical-chemical properties and amino acid composition, and Pseudo amino acid composition features (Co-Pse-AAC) based on 5 types of physical-chemical properties respectively.

SVM-Prot is a web server for protein classification. It constructs 188-D features for describing and classifying protein sequences [19, 20]. The features have been applied successfully in several protein identification tasks, such as cytokines [22, 23] and enzymes [24, 25]. The extracted features cover hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility [19]. For each of these 8 types of physical-chemical properties, feature groups were designed to describe global information about the protein sequence. These feature groups are composition (C), transition (T) and distribution (D) [19, 26]. With the amino acids divided into three classes per property, C contributes 3 values, T contributes 3 values and D contributes 15 values, so the dimension of each property's feature vector is 21. In addition, amino acid composition (AAC) contributes 20 features, one for each of the 20 amino acids. The dimension of the 188-D features is therefore given by the following formula:

$$\begin{array}{*{20}l} D_{188-D}&=\sum_{i=1}^{L} D_{21Vct} + D_{aac} \end{array} $$
(6)

where L is the number of physical-chemical property types, which is 8 in this context, so that the total is 8 × 21 + 20 = 188. Take Cecropin A as an example; its 188-D features are shown in Table 2. To the best of our knowledge, this is the first attempt in the literature to apply the SVM-Prot 188-D feature set to the classification and identification of AMPs and non-AMPs.
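To make the dimension count concrete, the sketch below computes the 21 composition/transition/distribution (CTD) values for a single property plus the 20 AAC values. It is an illustration under stated assumptions, not the SVM-Prot implementation: the three-class hydrophobicity grouping is the partition commonly used in the CTD descriptor literature, the quantile convention for the distribution descriptor is one common choice, and the function names and the example sequence are hypothetical.

```python
import math
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed three-class hydrophobicity grouping (polar / neutral / hydrophobic).
HYDROPHOBICITY_GROUPS = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}

def ctd_21(seq, groups=HYDROPHOBICITY_GROUPS):
    """Return the 21 CTD values (3 C + 3 T + 15 D) for one property."""
    lookup = {aa: g for g, members in groups.items() for aa in members}
    encoded = [lookup[aa] for aa in seq]
    n = len(encoded)
    labels = sorted(groups)
    # Composition: fraction of residues falling in each class.
    comp = [encoded.count(g) / n for g in labels]
    # Transition: fraction of adjacent pairs that switch between two classes.
    pairs = list(zip(encoded, encoded[1:]))
    trans = []
    for a, b in combinations(labels, 2):
        count = sum(1 for p in pairs if p in ((a, b), (b, a)))
        trans.append(count / (n - 1) if n > 1 else 0.0)
    # Distribution: relative position of the first, 25%, 50%, 75% and last
    # occurrence of each class along the sequence.
    dist = []
    for g in labels:
        positions = [i + 1 for i, e in enumerate(encoded) if e == g]
        if not positions:
            dist.extend([0.0] * 5)
            continue
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            idx = max(0, math.ceil(frac * len(positions)) - 1)
            dist.append(positions[idx] / n)
    return comp + trans + dist

def aac_20(seq):
    """Return the 20 amino acid composition values."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

seq = "KWKLFKKIEKVGQNIRDG"  # hypothetical peptide, for illustration only
one_property = ctd_21(seq)
total_dims = 8 * len(one_property) + len(aac_20(seq))
print(len(one_property), total_dims)
```

Repeating `ctd_21` with the class groupings of the other seven properties and appending `aac_20` would give the full 188-dimensional vector.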

Table 2 188-D feature of cecropin A

On the other hand, pseudo amino acid composition features (Co-Pse-AAC) [27] are an efficient computational tool that has been widely used to predict the structures and functions of proteins, and has also been applied to DNA and RNA sequences [28]. The 40-dimension Co-Pse-AAC features were extracted; they incorporate the effects of sequence order. This method takes 5 types of physical-chemical properties into consideration.

Data balancing

Most machine learning classification algorithms are sensitive to imbalanced datasets [29]. Classifiers tend to have a higher recognition rate for the majority class, which makes it difficult to identify the minority class correctly [30–32]. In this study, there were 2718 AMPs samples and 4371 non-AMPs samples, which is a clearly imbalanced ratio. To mitigate the overfitting problem caused by imbalanced data, we applied two sampling mechanisms to construct the training dataset.

Firstly, we implemented a random under-sampling method to down-sample the large class set snonAMPs, so that the number of samples in the large class equals that in the small class; the resulting training dataset is denoted strain. The other method we applied is weighted random sampling [33], which balances the dataset by applying different weights to the samples of the two classes. Given that the ratio of sAMPs to snonAMPs is approximately 3:5, weight factors of 5 and 3 were applied to sAMPs and snonAMPs respectively, and the resulting training dataset is denoted sweighttr.
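Both balancing schemes can be illustrated with the standard library. This is a hedged sketch: the helper names `random_undersample` and `weighted_totals` are hypothetical, the toy class sizes are made up, and only the sample counts (2718 and 4371) and the weights (5 and 3) come from the text.

```python
import random

def random_undersample(majority, n_keep, seed=0):
    """Draw n_keep items from the majority class without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(majority), n_keep)

def weighted_totals(n_pos, n_neg, w_pos=5, w_neg=3):
    """Total weight carried by each class after per-sample weighting."""
    return n_pos * w_pos, n_neg * w_neg

# Toy classes (hypothetical identifiers) to show under-sampling.
negatives = [f"neg_{i}" for i in range(50)]
positives = [f"pos_{i}" for i in range(30)]
balanced_negatives = random_undersample(negatives, len(positives))

# Class totals for the counts quoted above with weights 5 and 3.
pos_total, neg_total = weighted_totals(2718, 4371)
print(len(balanced_negatives), pos_total, neg_total)
```

With these weights the two classes carry nearly equal total weight, which is the intended effect of the 5:3 weighting for a roughly 3:5 class ratio.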

Test dataset

The test dataset was constructed as follows. Firstly, we randomly picked 1382 negative samples from the sequences that had been deleted from snonAMPs in the CD-HIT process, and denoted them snonAMPsDEL. Further, in the phase of acquiring the benchmark dataset from the APD (Antimicrobial Peptide Database), there were 278 sequences with unknown antibacterial activity among the original 2954 sAMPs sequences, which are denoted snonAMPsNOACT.

The 278 snonAMPsNOACT sequences, together with the 1382 snonAMPsDEL, form the independent test dataset Stest for the first layer of our two-layer multi-label classifier, which is in total 1660 samples.

The 278 snonAMPsNOACT sequences were also used as the prediction dataset for the second layer of our two-layer multi-label classifier, as described in the following sections.

Two-layer multi-label classifier

In machine learning, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y, i.e., assigning a value of 0 or 1 for each label in y. In the multi-label problem there is no constraint on how many of the classes the instance can be assigned to. An overview of multi-label classification is available at [34].

In general, multi-label classification methods can be divided into two categories: adapted algorithm methods and problem transformation methods. Some classification models have been adapted to the multi-label task without requiring problem transformation. For instance, AdaBoost.MH and AdaBoost.MR are extended versions of AdaBoost for multi-label data, and the ML-kNN algorithm extends the k-NN classifier to multi-label data. Other examples include decision trees and neural networks adapted for multi-label learning.

Problem transformation methods form the other category of multi-label classification. By converting a multi-label problem into one or more single-label problems, virtually any existing single-label classifier can be used to meet the multi-label classification requirements. Representative algorithms include Binary Relevance (BR), Classifier Chains (CC), the Label Combination method (LC/LP), the ensemble LP method RAkEL, and the Pruned Sets method (PS). BR amounts to independently training one binary classifier for each label; CC is similar to BR, except that it takes label dependencies into account; LC/LP treats each label combination as a new label and thereby implicitly considers label correlations.
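The difference between BR and LC/LP is easiest to see on a toy label matrix. The sketch below shows only the target transformation, not the training of any classifier; the function names and the four example label sets are hypothetical.

```python
def binary_relevance(label_sets, all_labels):
    """BR: build one independent 0/1 target vector per label."""
    return {lab: [int(lab in ls) for ls in label_sets] for lab in all_labels}

def label_combination(label_sets):
    """LC/LP: map each distinct label set to a single class id."""
    class_ids = {}
    targets = []
    for ls in label_sets:
        key = frozenset(ls)
        if key not in class_ids:
            class_ids[key] = len(class_ids)  # new combination, new class
        targets.append(class_ids[key])
    return targets, class_ids

# Four peptides with hypothetical function-type annotations.
label_sets = [
    {"Antibacterial"},
    {"Antibacterial", "Antifungal"},
    {"Antiviral"},
    {"Antibacterial", "Antifungal"},
]
all_labels = ["Antibacterial", "Antifungal", "Antiviral"]

br_targets = binary_relevance(label_sets, all_labels)
lc_targets, lc_classes = label_combination(label_sets)
print(br_targets["Antibacterial"], lc_targets)
```

BR yields three separate binary problems, whereas LC yields one three-class problem in which the pair "Antibacterial + Antifungal" is a class of its own, which is how the co-occurrence of labels is captured.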

A polypeptide can be a non-AMP that does not have any antimicrobial activity. It is actually a prediction problem with negative samples, which cannot be handled directly by traditional multi-label classification. Incorporating non-AMPs rationally into predictive models is an essential issue for multi-label classification to predict function types of AMPs. To address this issue, we improve upon the state of the art in multi-label classification and make several contributions.

For the first-layer classifier, which identifies a query peptide sequence as an AMP or non-AMP, the random forest (RF) algorithm was applied as the base classifier because of its good performance and ease of use. Random forest is an ensemble method in which a classifier is constructed by combining several independent base classifiers. The individual predictions are aggregated into a final prediction by majority voting. Because several trees are averaged, the risk of overfitting is significantly lower.
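The aggregation step can be sketched as follows. This shows only the voting rule, assuming the per-tree predictions are already available; the function name and the example votes are hypothetical.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    return counts.most_common(1)[0][0]

# Hypothetical votes from five trees for one query peptide.
votes = ["AMP", "non-AMP", "AMP", "AMP", "non-AMP"]
decision = majority_vote(votes)
print(decision)
```

Using an odd number of trees for a two-class problem avoids ties in the vote.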

For the second-layer classifier, which identifies the functional type(s) to which a query AMP sequence belongs, a multi-label classification task is performed. We chose the Meka/Mulan open-source framework to implement the second-layer multi-label classifier. Meka is based on the Weka machine learning toolkit, one of the well-known data mining platforms (http://www.cs.waikato.ac.nz/ml/weka/), and integrates the open-source Java library Mulan to support learning from multi-label datasets. Meka provides a pruned sets method and a Classifier Chains (CC) method, and uses logarithmic loss to penalize mispredicted labels so that partially mispredicted label sets do not distort the overall result. For the second-layer prediction, PS-RF or LC-RF is applied as the multi-label classifier because of its performance.

Measurement metrics

The metrics sensitivity (SN), specificity (SP), overall accuracy (Acc) and Matthew’s correlation coefficient (Mcc) were applied to measure the performance of the first-layer classifier [18, 35–40], where TPi, FPi, TNi and FNi denote the numbers of true positive, false positive, true negative and false negative instances respectively.

$$\begin{array}{*{20}l} SN &= \frac{TP_{i}}{TP_{i} + {FN}_{i}} \end{array} $$
(7)
$$\begin{array}{*{20}l} SP &= \frac{TN_{i}}{FP_{i} + {TN}_{i}} \end{array} $$
(8)
$$\begin{array}{*{20}l} Acc &= \frac{TP_{i} + {TN}_{i}}{TP_{i} + {FP}_{i} + {TN}_{i} +{FN}_{i}} \end{array} $$
(9)
$$ {\begin{aligned} Mcc &= \frac{TP_{i} \times {TN}_{i} - {FP}_{i} \times {FN}_{i}} {\sqrt{({TP}_{i} + {FP}_{i}) \times ({TN}_{i} + {FN}_{i}) \times ({TP}_{i} + {FN}_{i}) \times ({TN}_{i} + {FP}_{i})}} \end{aligned}} $$
(10)
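Equations (7) to (10) translate directly into code. The sketch below is a plain transcription; the function name and the confusion-matrix counts in the example are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here rather than something stated in the text.

```python
import math

def _ratio(num, den):
    """Divide, returning 0.0 when the denominator is zero (assumed convention)."""
    return num / den if den else 0.0

def binary_metrics(tp, fp, tn, fn):
    """Compute SN, SP, Acc and Mcc from confusion-matrix counts."""
    sn = _ratio(tp, tp + fn)
    sp = _ratio(tn, fp + tn)
    acc = _ratio(tp + tn, tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    mcc = _ratio(tp * tn - fp * fn, denom)
    return {"SN": sn, "SP": sp, "Acc": acc, "Mcc": mcc}

# Hypothetical counts for illustration only.
scores = binary_metrics(tp=80, fp=10, tn=90, fn=20)
print(scores)
```

Unlike Acc, Mcc stays informative when the two classes differ strongly in size, which is why it is reported alongside the other three measures.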

The metrics Exact-Match Ratio (EMR), Hamming-Loss (H-Loss), Accuracy (Acc), Precision, Recall, Ranking-Loss (RL), Log-Loss, One-error (OE) and F1-Measure (F1-Mic, F1-Mac) were applied to evaluate the second-layer multi-label classifier.

$$\begin{array}{*{20}l} EMR(\Lambda_{t}) &= \frac{1}{K} \sum_{i=1}^{K} (\tilde{y_{i}} = y_{i}) \end{array} $$
(11)
$$\begin{array}{*{20}l} H-Loss(\Lambda_{t}) &= \frac{1}{K} \sum_{i=1}^{K} \frac{|\tilde{y_{i}} \cup {y_{i}}|-|\tilde{y_{i}} \cap {y_{i}}|}{L} \end{array} $$
(12)
$$\begin{array}{*{20}l} Acc(\Lambda_{t}) &= \frac{1}{K} \sum_{i=1}^{K} \frac{|\tilde{y_{i}} \cap {y_{i}}|}{|\tilde{y_{i}} \cup {y_{i}}|} \end{array} $$
(13)
$$\begin{array}{*{20}l} Precision(\Lambda_{t}) &= \frac{1}{K} \sum_{i=1}^{K} \frac{|\tilde{y_{i}} \cap {y_{i}}|}{|\tilde{y_{i}}|} \end{array} $$
(14)
$$\begin{array}{*{20}l} Recall(\Lambda_{t}) &= \frac{1}{K} \sum_{i=1}^{K} \frac{|\tilde{y_{i}} \cap {y_{i}}|}{|y_{i}|} \end{array} $$
(15)
$$\begin{array}{*{20}l} F1(\Lambda_{t}) &= \frac{2.0 \times Precision(\Lambda_{t}) \times Recall(\Lambda_{t})}{Precision(\Lambda_{t}) + Recall(\Lambda_{t})} \\ &= \frac{1}{K} \sum_{i=1}^{K} \frac{2 |\tilde{y_{i}} \cap {y_{i}}|}{|\tilde{y_{i}}| + |y_{i}|} \\ OE(\Lambda_{t}) &= \frac{1}{K} \sum_{i=1}^{K} \{[{argmax}_{y \in Y} h(x_{i}, y)] \not\in y_{i}\} \end{array} $$
(16)
$$\begin{array}{*{20}l} RL(\Lambda_{t}) &\!\!=\frac{1}{K} \sum_{i=1}^{K} \frac{1}{|\tilde{y_{i}}| \!\times\! |y_{i}|} |\{ (y_{1}, y_{2}) | f_{t} h((x_{i}, y_{1})) \\ \leq f_{t} h((x_{i}, y_{2})) \}| \end{array} $$
(17)
$$\begin{array}{*{20}l} Log-Loss(\Lambda_{t}) &= \frac{1}{KL} \sum_{i=1}^{K} \sum_{j=1}^{L}\\ & \left\{min\left[-Log-Loss\left(\tilde{w_{j}^{i}}, y_{j}^{i}\right), ln(K)\right]\right\} \end{array} $$
(18)
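The set-based measures (EMR, H-Loss, Acc, Precision, Recall and the per-instance form of F1) can be computed directly from true and predicted label sets. The sketch below uses the standard example-based definitions, with accuracy taken as intersection over union; the function name, the treatment of empty label sets and the three toy instances are assumptions made for illustration.

```python
def multilabel_metrics(true_sets, pred_sets, n_labels):
    """Example-based multi-label metrics averaged over K instances."""
    k = len(true_sets)
    totals = {"EMR": 0.0, "H-Loss": 0.0, "Acc": 0.0,
              "Precision": 0.0, "Recall": 0.0, "F1": 0.0}
    for y, p in zip(true_sets, pred_sets):
        inter = len(y & p)
        union = len(y | p)
        totals["EMR"] += float(y == p)
        # Labels on which truth and prediction disagree, per label slot.
        totals["H-Loss"] += (union - inter) / n_labels
        # Two empty sets are treated as a perfect match (assumed convention).
        totals["Acc"] += inter / union if union else 1.0
        totals["Precision"] += inter / len(p) if p else 0.0
        totals["Recall"] += inter / len(y) if y else 0.0
        totals["F1"] += 2 * inter / (len(y) + len(p)) if (y or p) else 1.0
    return {name: value / k for name, value in totals.items()}

# Three hypothetical peptides; L = 16 function types as in the benchmark.
true_sets = [{"Antibacterial", "Antifungal"}, {"Antibacterial"}, {"Antiviral"}]
pred_sets = [{"Antibacterial"}, {"Antibacterial"}, {"Antibacterial", "Antiviral"}]
ml_scores = multilabel_metrics(true_sets, pred_sets, n_labels=16)
print(ml_scores)
```

EMR rewards only fully correct label sets, while the other measures give partial credit, so a classifier can score low on EMR and still do well on Acc or F1.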

Results

First classifier - Identifying AMPs or non-AMPs

Firstly, we extracted SVM-Prot 188-D features and Co-Pse-AAC 40-D features for each peptide sequence. The first-layer classifier was then applied to identify whether the sequence is an AMP or not. Several common classifiers, including Random Forest (RF), Bagging, J48, OneR, Naive Bayes (NB), KNN, and LibSVM, were chosen for performance comparison. The results showed that the Random Forest and Bagging classifiers, both based on decision trees, achieved the highest prediction accuracy, exceeding 84% for both the SVM-Prot 188-D and the Co-Pse-AAC 40-D features (Fig. 1).

Fig. 1 The main flowchart of the AMPs identification and prediction process

Because the AMP dataset is highly imbalanced and the sampling method may affect prediction performance significantly, we further used the 1660 samples of the test dataset Stest to verify 5 RF- and Bagging-based classifiers (188D-RF–W, 188D-RF–R, 188D-Bagging–W, 188D-Bagging–R, 40D-RF-R), where W denotes weighted random sampling and R denotes random under-sampling.

Table 3 shows that the 188D-RF-W classifier, which is based on weighted random sampling, achieves good sensitivity and specificity on both the training set and the test set and can efficiently identify AMPs and non-AMPs. In Table 3, TPR denotes the true positive rate, FPR the false positive rate, and AUC the area under the curve. Hence, we use 188D-RF-W as the first-layer classifier of the proposed MAMPs-Pred method.

Table 3 Performance comparison of first-layer classifiers on test dataset Stest

Second classifier - Identifying function types of AMPs

We investigated several multi-label classification methods on the dataset sAMPs in order to find the best classifier for identifying the function types of AMPs. We first evaluated different problem transformation methods, including Binary Relevance (BR), Classifier Chains (CC), Bayesian Classifier Chains (BCC), Label Combination (LC) and Pruned Sets (PS), combined with representative single-label classifiers including J48, Random Tree, Random Forest, KNN and Bagging. We also investigated several adapted algorithm methods such as MLkNN, BRkNN, BP neural network, BPMLL, and DeepML; the details are not presented in this paper owing to space limitations.

All multi-label classifiers were evaluated on sAMPs using both a train/test dataset split and 10-fold cross-validation. The evaluation results of the BR-RF, PS-RF, CC-RF, BCC-RF, LC-RF and BRkNN methods on the dataset sAMPs are shown in Table 4. PS-RF and LC-RF achieved the highest overall accuracy, and 10-fold cross-validation gave better results than the train/test dataset split for all problem transformation methods.

Table 4 Performance Comparison of Second-layer Classifiers (10 fold cross-validation)

The second stage is to apply the PS-RF and LC-RF classifiers to predict the possible antimicrobial activities or function types of the 278 AMPs with unknown antibacterial activity, snonAMPsNOACT. PS-RF and LC-RF produced similar predictions. As shown in Fig. 2, the predictions comprise 1 wound healing activity, 1 spermicidal activity, 1 chemotactic activity, 1 antimalarial activity, 6 insecticidal activities, 27 antifungal activities, 27 anti-HIV activities, 13 antiparasital activities, 19 antiviral activities, 23 anticancer activities, 5 proteinase inhibitor activities and 223 antibacterial activities. None of the peptides was predicted to have anti-protist, antioxidant, antibiotic or surface immobilized activity.

Fig. 2 Predicting function types of snonAMPsNOACT

Performance evaluation

To benchmark our method, we present a comparative analysis of MAMPs-Pred against the best-performing existing methods in the literature. Most of the existing methods can only be used to identify a query peptide as an AMP or non-AMP.

To make the comparison feasible and applicable, we first compared the first-layer classifier of MAMPs-Pred with the first-level classifier of iAMP-2L. We used the independent test dataset \(S_{test}^{Ind}\) from [9], which contains 920 AMP and non-AMP sequences. The overall accuracy of iAMP-2L was 86.32%, whereas our method achieved a classification accuracy of 87.14%, outperforming iAMP-2L as shown in Table 5.

Table 5 Performance comparison of MAMPs-Pred and iAMP-2L first-layer on \(S_{test}^{Ind}\) dataset

The second-layer classifier of MAMPs-Pred was compared with the iAMP-2L method [9] and LIFT classification method proposed in [17]. It can be seen that our MAMPs-Pred method has gained an improved overall performance over iAMP-2L and LIFT as shown in Table 6.

Table 6 Performance comparison of MAMPs-Pred and iAMP-2L, LIFT second-layer on \(S_{test}^{Ind}\) data set

We attribute this improvement to two factors. The first is that the amino acid composition and the eight physicochemical properties used for feature extraction in this study better express the relationship between structure and the function types of antimicrobial peptides, and thus yield improved performance.

The second is that the pruned sets method applied in the second-layer multi-label classification, which transforms each label set into a single label and thereby directly models the label correlation, achieves better overall prediction performance.

Performance on predicting Penaeus AMPs

In total, 14298 protein sequences of shrimp (Penaeus) were fetched from the public UniProt database, including Penaeus monodon, Penaeus vannamei, etc. From these 14298 sequences we obtained 1452 sequences with a length between 5 and 100, and then extracted the SVM-Prot 188-D features, based on amino acid composition (AAC) and 8 physicochemical properties, for each penaeus protein sequence. The processed sequences were subsequently fed to the first-layer classifier of MAMPs-Pred. A total of 126 AMP or AMP-like sequences were detected, accounting for 8.68% of the 1452 sequences.

In the second-layer multi-label classification, we predicted the possible antimicrobial activities or function types of each AMP. All 126 penaeus AMP sequences were predicted to have antibacterial activity, one to have chemotactic activity, and four to have antifungal activity, as shown in Fig. 3. MAMPs-Pred can be regarded as an efficient data-mining method for predicting the potential antimicrobial peptides and antibacterial activities of query sequences.

Fig. 3 AMPs activity prediction of 126 shrimp sequences

Discussion

Antimicrobial peptides are increasingly gaining considerable attention both from research and industry, as well as clinical interest. With the growing microbial resistance to conventional antimicrobial agents, the demand for unconventional and efficient AMPs has become urgent. Effective usage of AMPs and their derivatives can greatly improve the immunity and breeding survival rate of aquatic products.

The results reported in this study indicate that the MAMPs-Pred method achieves high performance in identifying AMPs and their functional types. The proposed approach is expected to supplement the tools and techniques previously developed for the prediction of AMPs. The primary reason is that the amino acid composition and the eight physicochemical properties used for feature extraction in this study better express the relationship between structure and the function types of antimicrobial peptides. The second reason is that the pruned sets method applied in the second-layer multi-label classification achieves higher overall prediction performance.

As summarized in [41], the recognition accuracy of machine learning methods ranges from the upper 70 to the lower 90 percent. Reported recognition accuracy has steadily improved over the past decade, while there is room for improvement.

The current MAMPs-Pred approach can be extended in the following directions in future research:

1. Construct more reliable datasets of positive and negative samples to reduce the potential bias in model training introduced by sequence homology. We also believe that, as more data become available, the prediction accuracy can be significantly enhanced.

2. The two-level prediction requires learning and classification to be performed twice, which lowers prediction efficiency. An adaptive dynamic approach that may yield faster and more efficient prediction is of definite interest for our future research.

3. In this approach, the accumulation of prediction errors across the two layers might cause a significant drop in prediction accuracy. In future work, the current method should be extended to address this issue.

4. Predicting the AMPs and their function types of penaeus by this method can help us to understand the immune system of marine species. In addition, it eases subsequent mining and exploration of antimicrobial activity of other species. The predictor holds very high potential to become a useful high throughput tool to predict antimicrobial activity of other species.

Conclusion

In this study, we developed an advanced machine-learning-based computational approach, MAMPs-Pred, for identifying AMPs and their function types. First, SVM-Prot 188-D features were extracted and used as input to a two-layer multi-label classifier. The first layer applies an RF classifier to identify whether a sequence is an AMP, and the second layer addresses the multi-type problem by applying PS-RF and LC-RF classifiers to identify the activities or function types of AMPs.

Abbreviations

Acc: Overall accuracy

AMPs: Antimicrobial peptides

APD: Antimicrobial peptide database

CD-HIT: Cluster database at high identity with tolerance

ClassAMP: A method to predict the propensity of a peptide sequence

Co-Pse-AAC: Pseudo amino acid composition

EMR: Exact-match ratio

FKNN: Fuzzy K-nearest neighbour

H-Loss: Hamming-loss

iAMP-2L: A two-level multilabel classifier

LC-RF: Label combination-random forests

LIFT: Zhou’s multi-label learning algorithm

MAMPs-Pred: Our method

Mcc: Matthew’s correlation coefficient

PS-RF: Pruned sets-random forests

RF: Random forests

HMMs: Hidden Markov models

SN: Sensitivity

SP: Specificity

SVM-prot 188-D: A web server for protein classification with 188-D feature

SVM: Support vector machine

References

  1. Malmsten M. Antimicrobial peptides. Ups J Med Sci. 2014; 199:204.

  2. Torrent M, Nogues MV, Boix E. Discovering new in silico tools for antimicrobial peptide prediction. Curr Drug Targets. 2012. https://doi.org/10.2174/138945012802002311.

  3. Nannette YY, Michael RY. Multidimensional signatures in antimicrobial peptides. Proc Natl Acad Sci. 2004; 7363:7368. https://doi.org/10.1073/pnas.0401567101.

    Google Scholar 

  4. Meher PK, Sahu TK, Saini V, Rao AQ. Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into chou’s general PseAAC; 2017. https://doi.org/10.1038/srep42362.

  5. Khosravian M. Predicting antibacterial peptides by the concept of chou’s pseudo-amino acid composition and machine learning methods. Protein Pept Lett. 2013; 180:186. https://doi.org/10.2174/0929866511320020009.

    Google Scholar 

  6. Niarchou A. C-PAmP: large scale analysis and database construction containing high scoring computationally predicted antimicrobial peptides for all the available plant species. PLoS ONE. 2013. https://doi.org/10.1371/journal.pone.0079728.

    Article  CAS  Google Scholar 

  7. Lin HH, Han LY, Cai CZ, Ji ZL, Chen YZ. Prediction of transporter family from protein sequence by support vector machine approach. Proteins. 2006. https://doi.org/10.1002/prot.20605.

    Article  Google Scholar 

  8. Wang P. Prediction of antimicrobial peptides based on sequence alignment and feature selection methods. Plos ONE. 2011. https://doi.org/10.1371/journal.pone.0018476.

    Article  CAS  Google Scholar 

  9. Xiao X. iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Anal Biochem. 2013. https://doi.org/10.1016/j.ab.2013.01.019.

    Article  CAS  Google Scholar 

  10. Joseph S. ClassAMP: A prediction tool for classification of antimicrobial peptides. IEEE/ACM Trans Comput Biol Bioinform. 2012. https://doi.org/10.1109/TCBB.2012.89.

    Article  Google Scholar 

  11. Lira F. Prediction of antimicrobial activity of synthetic peptides by a decision tree model. Appl Environ Microbio. 2013. https://doi.org/10.1128/AEM.02804-12.

    Article  CAS  Google Scholar 

  12. Fjell CD. AMPer: a database and an automated discovery tool for antimicrobial peptides. Bioinformatics. 2013. https://doi.org/10.1093/bioinformatics/btm068.

    Article  CAS  Google Scholar 

  13. Daniel V. Deep learning improves antimicrobial peptide recognition. Bioinformatics. 2018. https://doi.org/10.1093/bioinformatics/bty179.

    Article  CAS  Google Scholar 

  14. Schneider P. Hybrid network model for “deep learning” of chemical data: application to antimicrobial peptides; 2006. https://doi.org/10.1002/minf.201600011.

    Article  Google Scholar 

  15. Wang Z, Wang G. APD: the antimicrobial peptide database. Nucleic Acids Res. 2004; 590:592. https://doi.org/10.1093/nar/gkh025.

    Google Scholar 

  16. Wang G. Li, Wang Z. APD2: the updated antimicrobial peptide database and its application in peptide design. Nucleic Acids Res. 2009; 933:937. https://doi.org/10.1093/nar/gkn823.

    Google Scholar 

  17. Wang P, Xiao X. Multi-label classifier design for predicting the functional types of antimicrobial peptides. Adv Mater Res. 2013. https://doi.org/10.4028/www.scientific.net/AMR.718-720.293.

    Article  Google Scholar 

  18. Zhou HL. A Multi-label classifier for prediction membrane protein functional types in animal. J Membr Biol. 2014; 1141:1148. https://doi.org/10.1007/s00232-014-9708-2.

    Google Scholar 

  19. Cai CZ, Han LY, Ji ZL, Chen X, Chen YZ. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. 2003. https://doi.org/10.1093/nar/gkg600.

    Article  CAS  Google Scholar 

  20. Li YH. SVM-Prot: SVM-Prot 2016: A web-server for machine learning prediction of protein functional families from sequence irrespective of similarity. PloS ONE. 2016. https://doi.org/10.1371/journal.pone.0155290.

    Article  Google Scholar 

  21. Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT Suite: a web server for clustering and comparing biological sequence. Bioinformatics. 2010. https://doi.org/10.1093/bioinformatics/btq003.

    Article  CAS  Google Scholar 

  22. Quan Z. An approach for identifying cytokines based on a novel ensemble classifer. BioMed Res Int. 2013. https://doi.org/10.1155/2013/686090.

    Google Scholar 

  23. Zeng XX. Identification of cytokine via an improved genetic algorithm. Front Comput Sci. 2015; 643:651.

    Google Scholar 

  24. Cheng XY. A global characterization and identification of multifunctional enzymes; 2012. https://doi.org/10.1371/journal.pone.0038979.

    Article  CAS  Google Scholar 

  25. Zou Q, Chen W, Huang Y, Liu X, Jiang Y. Identifying multi-functional enzyme with hierarchical multi-label classifier. J Comput Theor Nanosci. 2013; 1038:1043. https://doi.org/10.1166/jctn.2013.2804.

    Google Scholar 

  26. Chou KC. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics. 2005; 10:19. https://doi.org/10.1093/bioinformatics/bth466.

    Google Scholar 

  27. Bin L. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 2015; 65:71. https://doi.org/10.1093/nar/gkv458.

    Google Scholar 

  28. Song L. nDNA-prot: Identifcation of DNA-binding proteins based on unbalanced classification. BMC Bioinformatics. 2014. https://doi.org/10.1186/1471-2105-15-298.

    Article  Google Scholar 

  29. Zou Q, Guo M, Liu Y, Wang J. A Classification method for class-imbalanced data and its application on bioinformatics. J Comput Res Dev. 2010; 1407:1414.

    Google Scholar 

  30. Lin S. Under-sampling method research in class-imbalanced data. J Comput Res Dev. 2011; 47:53.

    Google Scholar 

  31. Batista GE, Prati RC, Monard MC. A study of the behavior of several methods for balancing machine learning training data. ACM Sigkdd Explor Newsl. 2004; 20:29. https://doi.org/10.1145/1007730.1007735.

    Google Scholar 

  32. Guo LJ. Research on imbalanced data classification based on ensemble and under-sampling. J Front Comput Sci Technol. 2013; 630:638.

    Google Scholar 

  33. Tsoumakas G, Katakis I. Multi label classification: an overview. Int J Data Warehous Min. 2007; 1:13.

    Google Scholar 

  34. Guo SH. iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition. Bioinformatics. 2014; 1522:1529. https://doi.org/10.1093/bioinformatics/btu083.

    Google Scholar 

  35. Lin H, Deng EZ, Ding H, Chen W, Chou KC. iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res. 2014; 12961:12972. https://doi.org/10.1093/nar/gku1019.

    Google Scholar 

  36. Tang H, Chen W, Lin H. Identification of immunoglobulins using Chou’s pseudo amino acid composition with feature selection technique. Mol BioSyst. 2016. https://doi.org/10.1039/c5mb00883b.

    Article  CAS  Google Scholar 

  37. Zhu PP. Predicting the subcellular localization of mycobacterial proteins by incorporating the optimal tripeptides into the general form of pseudo amino acid composition. Mol Biosyst. 2015; 558:563. https://doi.org/10.1039/c4mb00645c.

    Google Scholar 

  38. Chen W, Feng P, Ding H, Lin H, Chou KC. iRNA-Methyl: Identifying N(6)-methyladenosine sites using pseudo nucleotide composition. Anal Biochem. 2015; 26:33. https://doi.org/10.1016/j.ab.2015.08.021.

    Article  Google Scholar 

  39. Chen W, Feng P, Lin H. Prediction of replication origins by calculating DNA structural properties. FEBS Lett. 2012. https://doi.org/10.1016/j.febslet.2012.02.034.

    Article  CAS  Google Scholar 

  40. Chen W, Feng P, Lin H, Chou KC. iSS-PseDNC: identifying splicing sites using pseudo dinucleotide composition. BioMed Res Int. 2014. https://doi.org/10.1155/2014/623149.

    Google Scholar 

  41. Daniel V. Improving recognition of antimicrobial peptides and target selectivity through machine learning and genetic programming. IEEE/ACM Trans Comput Biol Bioinform. 2017. https://doi.org/10.1109/TCBB.2015.2462364.

    Article  Google Scholar 

Download references

Acknowledgements

We would like to acknowledge the authors of the works listed in the References.

Funding

The work was supported by the National Natural Science Foundation of China (Grant Nos. 61472333, 61772441, 61472335, 61425002), the Project of Marine Economic Innovation and Development in Xiamen (No. 16PFW034SF02), the Natural Science Foundation of the Higher Education Institutions of Fujian Province (No. JZ160400), the Natural Science Foundation of Fujian Province (No. 2017J01099), and the President Fund of Xiamen University (No. 20720170054). Publication costs are funded by grant 61772441 or 16PFW034SF02.

Availability of data and materials

The datasets and features can be downloaded from the following URL: https://github.com/JianyuanLin/SupplementaryData.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 20 Supplement 8, 2019: Decipher computational analytics in digital health and precision medicine. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-8.

Ethics approval and consent for participation

Not applicable.

Author information

Contributions

XL, YL and YC conceived and designed the experiments; YC collected the dataset; YL and YC performed the experiments; YL wrote the paper; XL, YL and CL analyzed the data; XL and YL discussed the results and improved the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xiangrong Liu.

Ethics declarations

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Lin, Y., Cai, Y., Liu, J. et al. An advanced approach to identify antimicrobial peptides and their function types for penaeus through machine learning strategies. BMC Bioinformatics 20 (Suppl 8), 291 (2019). https://doi.org/10.1186/s12859-019-2766-9

  • Published:

  • DOI: https://doi.org/10.1186/s12859-019-2766-9

Keywords