Dataset preparation
The datasets of pathogenic proteins were constructed by curating the available databases such as Virulence Factor Database (VFDB) [16], Pathosystems Resource Integration Center (PATRIC) [17], PHIDIAS virulent factors (http://www.phidias.us/victors/) [18], Antibiotic Resistance Proteins from Comprehensive Antibiotic Resistance Database (CARD) [19] and Antibiotic Resistance Genes Database (ARDB) [20]. The initial dataset contained 8,794 sequences (excluding antibiotic resistance proteins) as virulent proteins and 4,992 Antibiotic Resistance sequences. A negative dataset consisting of non-pathogenic proteins comprised of a total of 18,296 sequences was also constructed using the Database of Essential Genes [21].
The sequences with non-confirmatory annotations such as “hypothetical”, “like”, “may”, “possible”, “potential”, “predicted”, “probable”, “putative”, “uncharacterized”, “unknown”, “unnamed” were removed from the main dataset and were used for constructing the Real Dataset. For removing the redundancy in the remaining sequence data, clustering was performed using CD-HIT at a cut-off value of 0.95 (i.e., 95% sequence similarity) and the representative sequences from each cluster were obtained [22]. The clustering resulted in the sets of 4948 virulent proteins sequences (not including antibiotic resistance proteins), 1040 antibiotic resistance protein sequences and 11,029 non-pathogenic proteins.
Sub-categorization of datasets and curation into multiple classes
The sub-categorization was performed based on the function of the protein sequences followed by an extensive literature review to classify the different groups of proteins into broader categories. The uniqueness and the relatedness amongst the different classes of proteins in terms of their role in pathogenicity, structural diversity, and the origin of the proteins, were considered while classifying the proteins. Sequences with ambiguous annotations were removed to reduce the noise in the data. Sequences in the essential proteins database having similar annotations to virulent proteins were removed. Additionally, except polymerases, enzyme sequences were removed to decrease large variability within negative dataset.
Thus, MP4 protein dataset consisted of protein sequences belonging to 6 major classes: Antibiotic resistance proteins, Non-pathogenic proteins, Secretory proteins, Sigma proteins, Capsules and toxins. Non-pathogenic proteins were considered as a single class, the Antibiotic resistance proteins and Toxins were clubbed into a single class because both these types of protein have either evolved or adapted for the sole purpose of bacterial resistance and pathogenicity, while the other class comprising of secretory proteins and capsular proteins aids in the virulence of a microbe ranging from enhancing attachment to the optimization of resources in the bacterial niche to disruption of host cells. Thus, the sequences were classified as: Class 1 proteins consisting of Non-pathogenic proteins (1047 sequences), Class 2 proteins consisting of antibiotic resistance proteins and toxins (1020 sequences), and Class 3 proteins with secretory and capsular proteins (1492 sequences).
Each class was randomly divided into 80:20 ratios from which the 80% parts from all the classes were combined to make a training dataset (2848 sequences). Blind dataset contained rest 20% of the dataset (711 sequences), based on which statistical measures of data classification of machine learning model is estimated.
Construction of different validation datasets
For performance validation of MP4, two different datasets were used consisting of sequences not used for training. Sequences kept for the construction of real dataset-1, were curated into the three classes according to the aforementioned procedure. The real dataset-1 consisted of 308 manually curated sequences. A real dataset-2 consisting of 200 sequences previously used in MP3 [3], was also curated based on the sub-categorization mentioned in the previous section.
Construction of independent genomic and metagenomic validation datasets and comparative datasets
A literature review was performed to construct independent datasets consisting of 25 pathogenic and 25 non-pathogenic bacterial strains, respectively. These sequences were then downloaded from the NCBI FTP server (ftp://ftp.ncbi.nlm.nih.gov/).
The performance comparison between MP3 and MP4 was performed using the VirulentPred sequences [14] and Shigella flexineri virulence plasmid sequences which consisted of Shigella flexineri virulence plasmid group I and Shigella flexineri virulence plasmid group II. Shigella flexineri virulence plasmid group I consists of proteins that are translocated by Shigella into the host cells during the infection (Translocated proteins, 18 sequences) whereas Shigella flexineri virulence plasmid group II contains proteins that remain in the bacteria during the infection (Non-translocated proteins, 19 sequences), which were previously used for the validation of the performance of MP3 [3]. The Shigella flexineri virulence plasmid group III (1 sequence) [3] was also used to validate the function of MP4. All the sequences used to compare MP3 and MP4 were previously used by MP3 [3]. A third dataset consisting of 41 proteins from Mycobacterium tuberculosis NITR203 strain that included known as well as hypothetical proteins was used to assess ability of MP4 to predict and sub-categorize hypothetical sequences. These sequences were earlier used by MP3 [3].
Metagenomic datasets for healthy individual (SRR5898979) and colorectal cancer patient (SRR8865601) were obtained from [23]. Using SPAdes (version 3.13.0) [24], forward and reverse pairedend reads were assembled into single reads for CRC patient and healthy individual respectively. Using Prodigal (version 2.6.3) software [25], gene prediction was performed which were used as an input for MP4 to predict proportion of different classes of pathogenic proteins in the two samples.
Calculation of pathogenicity index
The pathogenicity index was calculated for all the strains and was used as basis for identifying and differentiating pathogenic strains from non-pathogenic strains.
$$Pathogenicity\,Index=\frac{Number\,of\,positive\,sequences}{Total\,number\,of\,sequences}$$
where;
$$Number\,of\,positive\,sequences = Number\,of\,sequences \in class 2+Number\,of\,sequences\in Class 3$$
Input features
Dipeptide frequency and pepstats features
The dipeptide frequency provides information about the amino acid sequence arrangement for a protein. As an input, it provides global information on the protein features in a fixed-length vector. It encompasses information such as local order and fraction of amino acid. The dipeptide frequency of each protein can be calculated using the following formula:
$$Dipepide\,frequency=\frac{Total\,number\,of\,dipeptides}{Total\,number\,of\,all\,possible\,dipeptides}\times 100$$
The pepstats features were calculated using EMBOSS:6.6.0.0 (http://emboss.sourceforge.net/apps/cvs/emboss/apps/embossupdate.html). The pepstats calculation provided a total of 34 features consisting of physicochemical properties of proteins including Molecular weight, Number of residues, Average residue weight, Charge, Isoelectric point, molar percent and extinction coefficient at 1 mg/ml (A280) (Additional file 1: Table S1). With these features, the sequences with variable lengths were converted into vectors with Lx434 dimension (L: number of rows in each dataset), which our machine learning algorithm can use for data classification.
Selection of appropriate machine learning model for classification
WEKA 3.8.2 was used to compare the various machine learning algorithms including PART, Random Forest, IBk and SVM using fivefold cross validation on training data with dipeptide frequency and pepstats features as inputs. Accuracy, precision, F-measure, MCC, ROC, true positive rate (TPR) and true negative rate (TNR) values were recorded for all the algorithms mentioned in this section.
Optimization of various RF and SVM parameters for the development of classification models
The comparison between different machine learning algorithms revealed that SVM and RF had comparable performances in terms of Accuracy, Precision, F-measure, MCC and ROC values on the training dataset. Hence, both SVM and RF were considered for further parameter optimization and evaluation using e1071 package [26] and randomForest library respectively, available in R (version 3.4). For the RF-based model, the importance of each feature was calculated at ntree = 500, using the mean decrease in accuracy at best mtry obtained with the help of tuneRF function which calculates mtry values using OOB error as an estimate. The OOB error that represents the error in prediction by randomForest algorithm was calculated using top 5, 10, 15, 20, 30, 50, 70 and 90% features (dipeptide frequencies + pepstats features) using different mtry values at ntree = 200–1000. The best performing model was selected based on the least %OOB error. The performance of various classification models employing RF algorithms was optimized using fivefold cross-validation. In RF, the OOB error was used as a measure for accuracy.
The tuning of SVM kernels was performed using tune function of R library e1071 at different values of cost, gamma, degree and coef0 with the help of all 434 features (Additional file 2: Table S2). Parameters for the kernel having the highest total cross-validation accuracy and the least error were selected for the development of SVM-based models. The SVM-based feature selection was performed using the VarImp function, a generic method for calculating variable importance for objects. The VarImp function uses ROC values as the measure of the importance of features. Three different lists consisting of important features for each class were obtained. Further, top 50, 100, 150, 200, 250, 300, 350, 400 and 410 features were extracted from each list. Then, for every list, the entries with the same ROC value between all the three classes were extracted to give final lists of top features.
Performance validation of final classification model
The performances of the final models were evaluated using 20% of total data kept as blind dataset and on the independent datasets consisting of 25 known pathogenic and non-pathogenic strains. Performances of the model are represented in terms of Sensitivity, Specificity, Accuracy, Precision and Balanced Accuracy.
$$Sensitivity\; = \;\frac{TP}{{TP + FN}},\;Specificity\; = \;\frac{TN}{{TN + FN}},\;Precision = \;\frac{TP}{{TP + FP}}$$
$$Accuracy\; = \;\frac{TP + TN}{{TP + FN + FP + TN}}$$
$$Balanced\,Accuracy=\frac{Sn+Sp}{2}$$
$$MCC=\frac{\left(TP*TN\right)-\left(FP*FN\right)}{\surd \left(TP+FP\right)\left(TP+FN\right)\left(TN+FP\right)\left(TN+FN\right)}$$
Sensitivity (Sn) Sensitivity measures the ability of the process to predict correct results.
Specificity (Sp) Specificity measures the ability of a process to predict incorrect results.
Accuracy Accuracy measures the degree of correctness of the predicted results to its actual value or the experimental value.
MCC Matthews correlation coefficient.
Balanced Accuracy Balanced Accuracy measures the average of the proportion corrects of each class individually. It is used when the dataset used for training purposes is unbalanced.
Where, TP: The pathogenic protein correctly identified as pathogenic.
FP: The non-pathogenic protein incorrectly identified as pathogenic protein.
TN: The non-pathogenic protein correctly identified as non-pathogenic.
FN: The pathogenic protein incorrectly identified as non-pathogenic protein.