Table 7 Description of proteins from the benchmark dataset that were misclassified by at least one machine learning algorithm

From: A novel strategy for classifying the output from an in silico vaccine discovery pipeline for eukaryotic pathogens using machine learning algorithms

| UniProt ID | Protein name | Subcellular annotation | Expected classification | Final classification (a) | Misclassification by algorithm (b) | Evidence profile (c) |
|---|---|---|---|---|---|---|
| Q27298 | SAG1 protein (P30) | Membrane | YES | YES | AB RF SVM | Q27298,0,Y,0.297,0.141,M,2,7.30,0.56,0,21.5,Secreted,0.255,0.205,YES |
| B0LUH4 | Microneme protein 13 | Unknown | YES | YES | kNN | B0LUH4,0,Y,0.888,0.907,S,1,0.11,0.11,0,29.0,Secreted,0.270,0.355,YES |
| P84343 | Peptidyl-prolyl cis-trans isomerase | Unknown | YES | YES | kNN | P84343,0,Y,0.817,0.963,S,1,1.11,1.11,0,29.0,Secreted,0.465,0.536,YES |
| Q9U483 | Microneme protein Nc-P38 | Unknown | YES | YES | kNN | Q9U483,0,Y,0.427,0.587,S,4,0.23,0.23,0,30.0,Secreted,0.355,0.1736,YES |
| B9PRX5 | Proteasome subunit alpha type | Unknown | YES | YES | RF SVM | B9PRX5,0,Y,0.250,0.254,M,2,16.81,7.23,0,22.0,Secreted,0.648,0.515,YES |
| B9QH60 | Acetyl-CoA carboxylase, putative | Unknown | YES | YES | SVM | B9QH60,1,N,0.322,0.019,M,1,22.02,0.00,1,5.0,Secreted,0.846,0.437,YES |
| B6K9N1 | Cytochrome P450 (putative) | Unknown | NO | NO | kNN | B6K9N1,1,N,0.131,0.041,U,2,15.35,0.03,0,5.0,Membrane,0.197,0.480,NO |
| B9Q0C2 | Anamorsin homolog | Cytoplasm | NO | NO | kNN | B9Q0C2,0,Y,0.245,0.108,U,4,0.54,0.00,0,20.0,Secreted,0.382,0.210,NO |
| B9PK71 | DNA-directed RNA polymerase subunit | Nucleus | NO | NO | NB | B9PK71,0,N,0.188,0.223,U,4,0.00,0.00,0,22.0,Secreted,0.368,0.380,NO |

  1. (a) The final classification takes into account the predictions from each algorithm and adopts the most frequent classification, i.e. a majority-rule approach. A YES classification is adopted for tied votes, e.g. Q27298.
  2. (b) Algorithms are executed multiple times on the same input data. An in-house Perl script summarises the multiple runs and reports, as a percentage, how often the predicted classification of a protein differs from the expected classification. A protein is regarded as misclassified by an algorithm only if this percentage is 100%.
  3. (c) Column headers: 1 = ID, 2 = Phobius_TM, 3 = Phobius_SP, 4 = SignalP, 5 = TargetP_SP, 6 = TargetP_loc, 7 = TargetP_RC, 8 = TMHMM_AA, 9 = TMHMM_First60, 10 = TMHMM_TM, 11 = WoLF_PSORT, 12 = WoLF_PSORT_annotation, 13 = MHCI, 14 = MHCII, 15 = Expected classification.
  4. Abbreviations: AB = adaptive boosting, RF = random forest, SVM = support vector machine, NB = naive Bayes, kNN = k-nearest neighbour, NN = neural network.
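
The rules in the footnotes can be illustrated with a minimal Python sketch. This is not the authors' in-house Perl script. The function names (`parse_profile`, `final_classification`, `is_misclassified`) are hypothetical, and the field names are taken from the column headers listed in the footnotes. The per-algorithm votes shown for Q27298 are an assumption inferred from its table row: the three listed algorithms are taken to vote NO and the remaining three to vote YES.

```python
from collections import Counter

# Field names follow the column-header footnote; all values are kept as strings.
FIELDS = [
    "ID", "Phobius_TM", "Phobius_SP", "SignalP", "TargetP_SP", "TargetP_loc",
    "TargetP_RC", "TMHMM_AA", "TMHMM_First60", "TMHMM_TM", "WoLF_PSORT",
    "WoLF_PSORT_annotation", "MHCI", "MHCII", "Expected",
]


def parse_profile(line):
    """Split a comma-separated evidence profile into a dict keyed by field name."""
    values = line.strip().split(",")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def final_classification(votes):
    """Majority rule over per-algorithm votes; a tie resolves to YES."""
    counts = Counter(votes.values())
    return "YES" if counts["YES"] >= counts["NO"] else "NO"


def is_misclassified(run_predictions, expected):
    """True only if the prediction differs from the expected class in every run."""
    return len(run_predictions) > 0 and all(p != expected for p in run_predictions)


if __name__ == "__main__":
    profile = parse_profile(
        "Q27298,0,Y,0.297,0.141,M,2,7.30,0.56,0,21.5,Secreted,0.255,0.205,YES"
    )
    # Assumed votes: AB, RF and SVM say NO, the other three say YES (a 3-3 tie).
    votes = {"AB": "NO", "RF": "NO", "SVM": "NO",
             "NB": "YES", "kNN": "YES", "NN": "YES"}
    print(profile["WoLF_PSORT_annotation"], final_classification(votes))
    # prints "Secreted YES"
```

Under these assumptions the tie for Q27298 resolves to YES, which matches the final classification reported in the table.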