Table 5 Sensitivity and specificity performance measures of binary classification on different test datasets when using machine learning algorithms with different training datasets

From: A novel strategy for classifying the output from an in silico vaccine discovery pipeline for eukaryotic pathogens using machine learning algorithms

Rows give the test dataset; column groups give the training dataset, with SN and SP reported for each.

| Algorithm | Test dataset | T. gondii SN | T. gondii SP | Plasmodium SN | Plasmodium SP | C. elegans SN | C. elegans SP | Combined species SN | Combined species SP | Benchmark SN | Benchmark SP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Decision tree ᵃ | T. gondii | 1.00ᵇ | 0.81ᵇ | 0.95 | 0.89 | 1.00 | 0.83 | 1.00 | 0.83 | 1.00 | 0.83 |
| | Plasmodium | 0.84 | 0.90 | 1.00ᵇ | 1.00ᵇ | 0.85 | 0.96 | 1.00 | 0.92 | 1.00 | 0.98 |
| | C. elegans | 0.87 | 0.93 | 1.00 | 0.99 | 1.00ᵇ | 1.00ᵇ | 1.00 | 0.99 | 1.00 | 0.98 |
| | Combined species | 0.87 | 0.92 | 1.00 | 0.99 | 0.98 | 0.99 | 1.00ᵇ | 0.98ᵇ | 1.00 | 0.97 |
| | Benchmark | 0.86 | 0.91 | 0.97 | 0.96 | 0.96 | 0.96 | 0.97 | 0.91 | 1.00ᵇ | 1.00ᵇ |
| Adaptive boosting ᵃ | T. gondii | 0.51ᵇ | 0.06ᵇ | 0.96 | 0.88 | 1.00 | 0.83 | 1.00 | 0.91 | 1.00 | 0.83 |
| | Plasmodium | 0.82 | 0.99 | 0.98ᵇ | 0.96ᵇ | 0.95 | 0.96 | 1.00 | 1.00 | 1.00 | 0.98 |
| | C. elegans | 0.87 | 0.99 | 1.00 | 1.00 | 1.00ᵇ | 1.00ᵇ | 1.00 | 1.00 | 1.00 | 0.98 |
| | Combined species | 0.87 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00ᵇ | 0.99ᵇ | 1.00 | 0.98 |
| | Benchmark | 0.85 | 0.99 | 0.97 | 0.98 | 0.97 | 0.96 | 0.99 | 0.99 | 0.98ᵇ | 0.97ᵇ |
| Random forest ᵃ | T. gondii | 0.97ᵇ | 0.90ᵇ | 1.00 | 0.83 | 1.00 | 0.89 | 1.00 | 1.00 | 1.00 | 0.83 |
| | Plasmodium | 0.87 | 1.00 | 0.99ᵇ | 0.99ᵇ | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 |
| | C. elegans | 0.83 | 1.00 | 0.98 | 1.00 | 1.00ᵇ | 1.00ᵇ | 1.00 | 1.00 | 1.00 | 1.00 |
| | Combined species | 0.84 | 1.00 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00ᵇ | 1.00ᵇ | 1.00 | 0.99 |
| | Benchmark | 0.82 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00 | 0.97 | 0.99 | 0.99ᵇ | 0.99ᵇ |
| k-Nearest neighbour | T. gondii | 0.80ᵇ | 0.83ᵇ | 1.00 | 0.83 | 0.95 | 0.83 | 1.00 | 0.83 | 0.90 | 0.78 |
| | Plasmodium | 0.77 | 0.96 | 0.95ᵇ | 0.84ᵇ | 0.88 | 0.96 | 0.99 | 0.94 | 0.81 | 0.96 |
| | C. elegans | 0.88 | 0.99 | 0.99 | 0.95 | 0.96ᵇ | 0.98ᵇ | 0.99 | 0.99 | 0.95 | 0.98 |
| | Combined species | 0.87 | 0.98 | 0.99 | 0.94 | 0.97 | 0.98 | 0.96ᵇ | 0.97ᵇ | 0.92 | 0.97 |
| | Benchmark | 0.93 | 0.96 | 1.00 | 0.90 | 0.96 | 0.96 | 0.96 | 0.97 | 0.98ᵇ | 0.96ᵇ |
| Naive Bayes classifier | T. gondii | 1.00ᵇ | 0.91ᵇ | 1.00 | 0.78 | 1.00 | 0.83 | 1.00 | 0.83 | 1.00 | 0.83 |
| | Plasmodium | 0.97 | 0.98 | 0.98ᵇ | 0.99ᵇ | 1.00 | 0.92 | 1.00 | 0.96 | 1.00 | 0.98 |
| | C. elegans | 0.87 | 1.00 | 0.92 | 0.95 | 1.00ᵇ | 0.98ᵇ | 0.97 | 0.98 | 1.00 | 0.99 |
| | Combined species | 0.89 | 0.99 | 0.93 | 0.95 | 1.00 | 0.97 | 0.98ᵇ | 0.97ᵇ | 1.00 | 0.98 |
| | Benchmark | 0.81 | 1.00 | 0.97 | 0.94 | 1.00 | 0.93 | 1.00 | 0.99 | 1.00ᵇ | 1.00ᵇ |
| Neural networks ᵃ | T. gondii | 0.98ᵇ | 0.90ᵇ | 0.99 | 0.83 | 1.00 | 0.84 | 1.00 | 0.91 | 0.99 | 0.83 |
| | Plasmodium | 0.88 | 0.92 | 0.99ᵇ | 0.89ᵇ | 0.99 | 0.97 | 0.97 | 0.98 | 0.93 | 0.97 |
| | C. elegans | 0.83 | 0.99 | 0.92 | 0.98 | 0.99ᵇ | 0.99ᵇ | 1.00 | 1.00 | 0.98 | 0.97 |
| | Combined species | 0.91 | 0.96 | 0.93 | 0.98 | 0.99 | 0.98 | 0.99ᵇ | 0.98ᵇ | 0.97 | 0.97 |
| | Benchmark | 0.78 | 0.97 | 0.97 | 0.97 | 0.99 | 0.95 | 0.99 | 0.96 | 1.00ᵇ | 0.95ᵇ |
| Support vector machines | T. gondii | 0.83ᵇ | 0.92ᵇ | 0.89 | 1.00 | 0.89 | 0.89 | 1.00 | 0.89 | 1.00 | 0.83 |
| | Plasmodium | 0.88 | 0.97 | 0.98ᵇ | 0.98ᵇ | 0.96 | 0.98 | 1.00 | 0.98 | 1.00 | 0.98 |
| | C. elegans | 0.83 | 0.89 | 0.98 | 0.99 | 0.94ᵇ | 0.99ᵇ | 0.99 | 1.00 | 0.91 | 0.99 |
| | Combined species | 0.84 | 0.91 | 0.98 | 0.98 | 0.99 | 0.99 | 0.92ᵇ | 0.99ᵇ | 0.93 | 0.98 |
| | Benchmark | 0.74 | 0.99 | 0.96 | 0.96 | 0.94 | 0.99 | 0.96 | 1.00 | 0.83ᵇ | 0.92ᵇ |

  1. Abbreviations: SN = sensitivity; SP = specificity; T. gondii = Toxoplasma gondii; Plasmodium = species in the genus Plasmodium including falciparum, yoelii yoelii, and berghei; C. elegans = Caenorhabditis elegans; Combined species = combination of T. gondii, Plasmodium, and C. elegans datasets; Benchmark = dataset comprising evidence for T. gondii and Neospora caninum proteins from published studies.
  2. ᵃ Results from the same input data fluctuate. The algorithm-specific R functions were executed 100 times and the prediction outcomes (false positives and negatives, true positives and negatives) were averaged to calculate SN and SP.
  3. ᵇ Obtained from multiple cross-validations, i.e., the algorithm-specific R functions randomly used 70% of the training dataset to build a model and the remaining 30% was used in the binary classification test. The cross-validation was executed 100 times and the prediction outcomes were averaged to calculate SN and SP.
  4. The values underlined denote the best performing training dataset for classifying the benchmark proteins.
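
The averaging procedure described in the footnotes can be sketched as follows. This is a minimal Python illustration, not the authors' code: the study used algorithm-specific R functions, and every name here (`confusion_counts`, `sn_sp`, `repeated_holdout`, `fit_threshold`, `predict_threshold`, `toy_rows`) is hypothetical. The classifier is a deliberately trivial one-feature threshold and the data are invented toy values. It assumes the standard definitions SN = TP / (TP + FN) and SP = TN / (TN + FP), applied to prediction outcomes averaged over 100 random 70%/30% splits.

```python
import random
from statistics import mean


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def sn_sp(tp, fp, tn, fn):
    """Sensitivity and specificity from (possibly averaged) outcome counts."""
    return tp / (tp + fn), tn / (tn + fp)


def repeated_holdout(rows, fit, predict, repeats=100, train_fraction=0.7, seed=0):
    """Average outcomes over repeated random splits, then compute SN and SP.

    rows is a list of (feature, label) pairs; fit and predict are supplied
    by the caller, so any classifier can be plugged in.
    """
    rng = random.Random(seed)
    totals = [0, 0, 0, 0]
    for _ in range(repeats):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train, test = shuffled[:cut], shuffled[cut:]
        model = fit(train)
        y_true = [label for _, label in test]
        y_pred = [predict(model, x) for x, _ in test]
        for i, count in enumerate(confusion_counts(y_true, y_pred)):
            totals[i] += count
    averaged = [total / repeats for total in totals]
    return sn_sp(*averaged)


# Hypothetical stand-in classifier: threshold halfway between the class means.
def fit_threshold(train):
    pos = [x for x, label in train if label == 1]
    neg = [x for x, label in train if label == 0]
    return (mean(pos) + mean(neg)) / 2


def predict_threshold(threshold, x):
    return 1 if x >= threshold else 0


# Invented, well-separated toy data: 10 positives and 10 negatives.
toy_rows = [(0.80 + 0.02 * i, 1) for i in range(10)] + \
           [(0.02 + 0.02 * i, 0) for i in range(10)]

if __name__ == "__main__":
    sn, sp = repeated_holdout(toy_rows, fit_threshold, predict_threshold)
    print(f"SN = {sn:.2f}, SP = {sp:.2f}")
```

Averaging the counts before dividing, rather than averaging per-run SN and SP, follows the wording of the footnotes and avoids undefined ratios in any single split whose test portion happens to contain only one class.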