Table 1 Datasets used for training and testing machine learning models

From: A novel strategy for classifying the output from an in silico vaccine discovery pipeline for eukaryotic pathogens using machine learning algorithms

| Name (a) | No. of proteins (b): Membrane-associated | No. of proteins (b): Secreted | No. of proteins (b): Neither membrane-associated nor secreted | Organism | Comments |
|---|---|---|---|---|---|
| T. gondii | 8 | 13 | 18 | Toxoplasma gondii | |
| Plasmodium | 47 | 26 | 51 | Plasmodium | Includes falciparum, yoelii yoelii, and berghei |
| C. elegans | 324 | 56 | 380 | Caenorhabditis elegans | |
| Combined species | 379 | 95 | 449 | Combination of organisms | Includes T. gondii, C. elegans, P. falciparum, P. yoelii yoelii, and P. berghei |
| Benchmark | 70 (c) | (merged with previous cell) | 70 | Combination of two organisms | T. gondii and Neospora caninum (excludes the proteins in T. gondii dataset) |

  1. (a) This is the name used to refer to the dataset throughout the paper.
  2. (b) Proteins (except those in the benchmark dataset) were initially grouped according to the subcellular location descriptor in UniProtKB. The grouping was then refined using cross-validation testing, epitope presence, and other UniProtKB annotations and Gene Ontology. Benchmark proteins were taken from published studies (70 experimentally shown to induce immune responses).
  3. (c) Combination of proteins from membrane-associated, secreted, and unknown subcellular locations.
  4. Note: Membrane-associated and secreted proteins are expected to receive a ‘YES’ classification for vaccine candidacy; proteins that are neither membrane-associated nor secreted are expected to receive a ‘NO’ classification. An attempt was made to represent YES and NO classifications equally in the training datasets.
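
The class balance described in the note can be checked directly from the counts in the table. The following is a minimal sketch, not part of the paper's pipeline: the names `TABLE_1`, `class_counts`, `summarize`, and `combined_matches_parts` are hypothetical, and the Benchmark row is left out because the table reports it as a single combined count rather than separate membrane-associated and secreted counts.

```python
# Counts transcribed from Table 1 as (membrane-associated, secreted, neither).
# All names here are illustrative assumptions, not taken from the paper.
TABLE_1 = {
    "T. gondii": (8, 13, 18),
    "Plasmodium": (47, 26, 51),
    "C. elegans": (324, 56, 380),
    "Combined species": (379, 95, 449),
}


def class_counts(membrane, secreted, neither):
    """Return (yes, no): expected YES = membrane-associated + secreted, NO = neither."""
    return membrane + secreted, neither


def summarize(table):
    """Compute the expected YES/NO counts and the YES fraction for each dataset."""
    summary = {}
    for name, (membrane, secreted, neither) in table.items():
        yes, no = class_counts(membrane, secreted, neither)
        summary[name] = {"yes": yes, "no": no, "yes_fraction": yes / (yes + no)}
    return summary


def combined_matches_parts(table, combined="Combined species"):
    """Check that the combined row equals the column-wise sum of the other rows."""
    parts = [counts for name, counts in table.items() if name != combined]
    totals = tuple(sum(column) for column in zip(*parts))
    return totals == table[combined]


if __name__ == "__main__":
    for name, row in summarize(TABLE_1).items():
        print(f"{name}: YES={row['yes']} NO={row['no']} "
              f"(YES fraction {row['yes_fraction']:.2f})")
    print("Combined row consistent:", combined_matches_parts(TABLE_1))
```

Under these counts the C. elegans dataset is exactly balanced (380 expected YES against 380 expected NO), while the smaller datasets lean slightly toward YES.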