
A novel strategy for classifying the output from an in silico vaccine discovery pipeline for eukaryotic pathogens using machine learning algorithms



An in silico vaccine discovery pipeline for eukaryotic pathogens typically consists of several computational tools to predict protein characteristics. The aim of the in silico approach to discovering subunit vaccines is to use predicted characteristics to identify proteins which are worthy of laboratory investigation. A major challenge is that these predictions inherently contain hidden inaccuracies and contradictions. This study focuses on how to reduce the number of false candidates using machine learning algorithms rather than relying on expensive laboratory validation. Proteins from Toxoplasma gondii, Plasmodium sp., and Caenorhabditis elegans were used as training and test datasets.


The results show that machine learning algorithms can effectively distinguish expected true from expected false vaccine candidates (with an average sensitivity and specificity of 0.97 and 0.98 respectively), for proteins observed to induce immune responses experimentally.


Vaccine candidates from an in silico approach can only be truly validated in a laboratory. Given any in silico output and appropriate training data, the number of false candidates allocated for validation can be dramatically reduced using a pool of machine learning algorithms. This will ultimately save time and money in the laboratory.


This study addresses a major problem raised by a previous feasibility study [1] of a high-throughput in silico vaccine discovery pipeline for eukaryotic pathogens. A typical in silico pipeline output is a collection of different protein characteristics that are predicted by freely available bioinformatics programs [1]. These protein characteristics (referred to henceforth as an evidence profile) represent potential evidence from which a researcher can make an informed decision as to a protein’s suitability as a vaccine candidate. The problem is that this evidence can be in different formats, contradictory, and inaccurate, culminating in large numbers of false positive and false negative decisions. The current solution is to accept that candidates will inevitably be missed due to the nature of an in silico approach and to rely on the laboratory validation to identify false candidates. The study herein focuses on how to reduce the false positive and false negative rates using a computational approach.

Eukaryotic pathogens are extremely complicated systems comprising thousands of unique proteins that are expressed in multifaceted life cycles and in response to varying environmental stimuli. A desired aim of an in silico approach for subunit vaccine discovery is to identify which of these proteins will evoke a protective, yet safe, immune response in the host [2, 3]. It is currently impossible, however, to know within an in silico environment how a host will truly respond to a single protein or combination of proteins. Consequently, an in silico approach is not an attempt to replace experimental work but is a complementary approach to predict which proteins among thousands are worthy of further laboratory investigation. Vaccine discovery tools have been developed for prokaryotes [4, 5], but there is no in silico pipeline available to the public for eukaryotic pathogens and no clear consensus as to what type of protein constitutes an ideal subunit vaccine. Currently, the characteristics of proteins guaranteed to induce the desired immune response are poorly defined. Nevertheless, some protein characteristics which are considered relevant to vaccine discovery are subcellular location and the presence of signal peptides, transmembrane domains, and epitopes [2, 6-8].

The poor reliability of the in silico output arises because an unknown percentage of the in silico input (e.g. protein sequences, database annotations, and predicted evidence itself) is acknowledged to be incorrect or missing. Bioinformatics programs used to predict protein characteristics are, in general, inaccurate [9-15]. The inaccuracy can be a consequence of erroneous input data or overly simplistic algorithms, or simply due to the complexity of the problem being solved. Since most prediction programs are imprecise, it can be expected that a percentage of the predicted protein characteristics will be incorrect. The difficulty encountered by a program user is to ascertain which of these predictions are correct and can contribute to the collection of evidence that supports a protein’s vaccine candidacy.

Given an in silico output, we propose that supervised machine learning methods can accurately classify the suitability of a protein, among potential thousands, for further laboratory investigation. Applying machine learning algorithms to solving biological problems is not novel. However, applying them to classify eukaryotic proteins for vaccine discovery is novel and this is reflected by the presence of only a few publications on the topic [16-18]. We illustrate the proposal on an in silico output comprising evidence from proteins experimentally shown to induce immune responses (referred to henceforth as the benchmark dataset) and hence expected to be likely vaccine candidates.

Results and discussion

Five datasets (see Table 1) containing evidence profiles were used in various ways to test the classification of a protein as either a vaccine candidate (YES classification) or non-vaccine candidate (NO classification). These evidence profiles for proteins from Toxoplasma gondii, Neospora caninum, Plasmodium sp., and Caenorhabditis elegans, were compiled from the output predictions made by seven bioinformatics programs (see Table 2).

Table 1 Datasets used for training and testing machine learning models
Table 2 High-throughput standalone programs used in this study to predict protein characteristics

A typical profile is a mixture of data types corresponding to an accuracy measure, a perceived reliability, or a type of score for the protein characteristic being predicted (see Figures 1 and 2). There will always be considerable uncertainty in the profile due to inherent inaccuracies in the source of the evidence. That is, there is an unknown but expected percentage of inaccuracy in the input sequence, training data (if required), and program algorithm itself, impeding precise prediction. This is irrespective of the target pathogen. The key question to be answered is whether we can classify potential vaccine candidates based on evidence profiles with hidden inaccuracies.

Figure 1

A schematic of a typical in silico vaccine discovery pipeline output. A typical in silico pipeline output is a collection of different protein characteristics that are predicted by bioinformatics programs. The schematic depicts a collection of some of the scores (potential evidence) associated with these predicted characteristics. A collection of scores for one protein is referred to as an evidence profile in the study. Each column represents a potential input variable or predictor for machine learning algorithms. The last column is a ‘YES’ or ‘NO’ as to whether the protein is expected to be a vaccine candidate (a requirement for machine learning training data) and represents the target variable i.e. the variable to be predicted for new profiles.

Figure 2

An extract of evidence profiles. Specific values from high-throughput standalone prediction programs are extracted and compiled to generate evidence profiles. Each row contains the collection of evidence for one protein (i.e. an evidence profile). Each column contains the score for a protein characteristic predicted by a specific program (i.e. an input variable or predictor). See the ‘Contents of evidence profiles’ subsection for a description of the columns.

Contents of evidence profiles

The columns in the evidence profile are as follows:
1 = UniProt ID.
2 = Number of predicted transmembrane helices (Phobius_TM).
3 = A ‘Y’ or ‘N’ to indicate a predicted signal peptide (Phobius_SP) - a ‘Y’ is more likely to be a secreted protein.
4 = Probability of a secretory signal peptide (SignalP).
5 = Probability of a secretory signal peptide (TargetP_SP).
6 = Predicted localisation based on the scores: M = mitochondrion, S = secretory pathway, U = other location (TargetP_loc).
7 = Reliability class (RC) - from 1 (most reliable) to 5 (least reliable) and is a measure of prediction certainty (TargetP_RC).
8 = Expected number of amino acid residues in transmembrane helices (the higher the number the more likely the protein is membrane-associated) (TMHMM_AA).
9 = Expected number of residues in the transmembrane helices located in the first 60 amino acids of the protein. The larger the number the more likely the predicted transmembrane helix in the N-terminal is a signal peptide (TMHMM_First60).
10 = Number of predicted transmembrane helices (TMHMM_TM).
11 = Number of nearest neighbours that have a similar location (WoLF PSORT).
12 = Predicted subcellular location (Secreted or Membrane or NOT_secreted_or_membrane) (WoLF_PSORT_annotation).
13 = Probability score encapsulating the collective potential of T-cell epitopes on the protein with respect to vaccine candidacy (MHCI). Raw affinity scores derived from IEDB Peptide-MHC I Binding predictor.
14 = Probability score encapsulating the collective potential of T-cell epitopes on the protein with respect to vaccine candidacy (MHCII). Raw affinity scores derived from IEDB Peptide-MHC II Binding predictor.
15 = Expected ‘YES’ or ‘NO’ vaccine candidacy (Target variable).
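The 15-column layout described above can be pictured as a simple record type. The following is a minimal Python sketch for illustration only: the study compiled its profiles with an in-house Perl script that is not reproduced here, so the class name, field names, the tab-delimited layout, and the example values are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class EvidenceProfile:
    """One row of an evidence profile file (column order as listed in the text)."""
    uniprot_id: str
    phobius_tm: int              # number of predicted TM helices
    phobius_sp: str              # 'Y' or 'N'
    signalp: float               # probability of a secretory signal peptide
    targetp_sp: float            # probability of a secretory signal peptide
    targetp_loc: str             # 'M', 'S' or 'U'
    targetp_rc: int              # reliability class, 1 (most) to 5 (least)
    tmhmm_aa: float              # expected residues in TM helices
    tmhmm_first60: float         # expected TM residues in the first 60 aa
    tmhmm_tm: int                # number of predicted TM helices
    wolf_psort: float            # nearest-neighbour score
    wolf_psort_annotation: str   # Secreted / Membrane / NOT_secreted_or_membrane
    mhci: float                  # collective MHC I epitope score
    mhcii: float                 # collective MHC II epitope score
    target: str                  # expected 'YES' or 'NO'


def parse_profile(line: str) -> EvidenceProfile:
    """Parse one tab-delimited line (the delimiter is an assumption)."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 15:
        raise ValueError(f"expected 15 columns, found {len(f)}")
    return EvidenceProfile(
        f[0], int(f[1]), f[2], float(f[3]), float(f[4]), f[5], int(f[6]),
        float(f[7]), float(f[8]), int(f[9]), float(f[10]), f[11],
        float(f[12]), float(f[13]), f[14],
    )


# Invented example row; the identifier and all values are hypothetical.
example_line = "EXAMPLE_ID\t2\tY\t0.91\t0.88\tS\t1\t45.2\t20.1\t2\t28\tSecreted\t0.75\t0.62\tYES"
example_profile = parse_profile(example_line)
```

Columns 2 to 14 would serve as input variables for a classifier and column 15 as the target variable.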

Classifying with one individual piece of evidence

The first test was to determine whether proteins could be correctly classified using an individual piece of evidence (i.e. one input variable from an evidence profile). Figure 3 shows an example of how the test was applied. The sensitivity and specificity of the classification is shown in Table 3. The most notable observation is that non-vaccine candidates are predominantly correctly classified but the main trade-off is a substantial number of false negatives, as evidenced by the low sensitivity scores. The conclusion here is that there is no one individual input variable that can precisely determine the classification. This is not an unexpected result because each input variable represents only one particular protein characteristic and there is currently no one characteristic that conclusively epitomises a vaccine candidate.

Figure 3

Example of test applied to a predicted protein characteristic for the purpose of binary classification. In this example, proteins are listed in descending order based on the number of transmembrane (TM) domains per protein predicted by the program Phobius (input value = Phobius_TM). A threshold value of 0 is applied to the score (i.e. number of TM domains) to segregate the list into two classifications. Above the threshold is ‘YES’ for vaccine candidacy and below or equal is ‘NO’. The classification is compared with the expected classification to determine sensitivity and specificity performance measures.
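The single-variable test described in this caption can be sketched as follows. This is an illustrative Python sketch rather than the code used in the study; the function names and the toy scores are assumptions. It applies a threshold to one input variable (above is ‘YES’, below or equal is ‘NO’) and compares the result with the expected classification.

```python
def classify_by_threshold(score, threshold=0):
    """'YES' if the score is above the threshold, otherwise 'NO'."""
    return "YES" if score > threshold else "NO"


def sensitivity_specificity(predicted, expected):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(predicted, expected))
    tp = sum(p == "YES" and e == "YES" for p, e in pairs)
    fn = sum(p == "NO" and e == "YES" for p, e in pairs)
    tn = sum(p == "NO" and e == "NO" for p, e in pairs)
    fp = sum(p == "YES" and e == "NO" for p, e in pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy data (invented): predicted TM domain counts and expected classes.
tm_counts = [3, 1, 0, 0, 2, 0]
expected = ["YES", "YES", "YES", "NO", "NO", "NO"]

predicted = [classify_by_threshold(n, threshold=0) for n in tm_counts]
sens, spec = sensitivity_specificity(predicted, expected)
```

In the toy data, the third protein is an expected candidate with no predicted TM domain and becomes a false negative, which is the kind of miss that lowers sensitivity for a single input variable.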

Table 3 Sensitivity and specificity performance measures of binary classification for individual input variables taken from datasets

Classifying with a rule-based approach

The next test was to determine if a combination of two or more input variables could efficiently perform the vaccine classification by applying an appropriate rule. Figure 4 illustrates the rule-based approach. A total of 17 combinations were tested with a programmed trial and error approach to obtain the maximum sensitivity and specificity. Table 4 shows the best rule from each combination. The best result achieved when tested on the benchmark dataset was 0.43 and 0.97 for sensitivity and specificity respectively. There were two main observations made from the rule-based testing: a rule that works well with one dataset does not necessarily generalise to another, and it is difficult to strike the ideal balance between sensitivity and specificity. For example, judicious adjustments to the rule threshold values can capture all proteins classified ‘YES’ in a test dataset (i.e. highly sensitive with zero false negatives) but at the expense of more false positives. Furthermore, if this adjusted rule is then applied to another dataset there are still false classifications. The conclusion here is that it is not feasible to compose a universal set of rules applicable to all datasets for the purpose of classifying proteins.

Figure 4

A graph of proteins from the combined training dataset using only two input variables to illustrate a rule-based approach for binary classification. Abbreviations: TMHMM_AA = number of amino acid residues in transmembrane helices (a transmembrane domain is expected to be greater than 18), WoLF PSORT = nearest neighbour score (16 = 50%). Triangles and circles indicate expected vaccine candidacy of proteins. The aim of the rule-based approach is to find the optimum threshold values that segregate the majority of triangles from the majority of circles. The best rule for binary classification is ‘NO if TMHMM_AA < 12 and WoLF PSORT < 15 (shaded area on graph) else YES’. Two examples of where YES and NO classification rules are broken are shown on the graph. When this best rule was applied to the benchmark dataset the sensitivity and specificity were 0.43 and 0.97 respectively.
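The two-variable rule quoted in this caption, together with a simple threshold search, can be sketched in Python as below. The rule itself is taken from the caption; the grid search is only an assumed stand-in for the ‘programmed trial and error approach’, whose details are not given, and the function names, grids, and toy profiles are hypothetical.

```python
from itertools import product


def rule_classify(tmhmm_aa, wolf_psort, t_aa=12, t_wolf=15):
    """'NO' if TMHMM_AA < t_aa and WoLF PSORT < t_wolf, else 'YES'."""
    return "NO" if (tmhmm_aa < t_aa and wolf_psort < t_wolf) else "YES"


def sens_spec(predicted, expected):
    """Return (sensitivity, specificity); 0.0 where a class is absent."""
    pairs = list(zip(predicted, expected))
    tp = sum(p == "YES" and e == "YES" for p, e in pairs)
    fn = sum(p == "NO" and e == "YES" for p, e in pairs)
    tn = sum(p == "NO" and e == "NO" for p, e in pairs)
    fp = sum(p == "YES" and e == "NO" for p, e in pairs)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec


def best_rule(profiles, aa_grid, wolf_grid):
    """Try every threshold pair and keep the first that maximises sens + spec.

    profiles: list of (tmhmm_aa, wolf_psort, expected_class) tuples.
    Returns (score, t_aa, t_wolf, sensitivity, specificity).
    """
    expected = [e for _, _, e in profiles]
    best = None
    for t_aa, t_wolf in product(aa_grid, wolf_grid):
        predicted = [rule_classify(a, w, t_aa, t_wolf) for a, w, _ in profiles]
        sens, spec = sens_spec(predicted, expected)
        if best is None or sens + spec > best[0]:
            best = (sens + spec, t_aa, t_wolf, sens, spec)
    return best


# Toy profiles (invented values): (TMHMM_AA, WoLF PSORT, expected class).
toy_profiles = [
    (25.0, 20, "YES"), (0.5, 28, "YES"), (30.0, 5, "YES"),
    (0.0, 4, "NO"), (3.0, 10, "NO"), (11.0, 14, "NO"),
]
result = best_rule(toy_profiles, aa_grid=[6, 12, 18], wolf_grid=[10, 15, 20])
```

Thresholds tuned this way fit the dataset they were searched on, which is consistent with the observation that a rule that works well with one dataset does not necessarily generalise to another.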

Table 4 Sensitivity and specificity of classifications on applying rule to benchmark dataset

Classifying with machine learning algorithms

Seven popular supervised machine learning algorithms were used in an attempt to improve on the rule-based approach. Table 5 shows the sensitivity and specificity performance measures of the binary classification. The five datasets were used interchangeably for both training and testing. The table is presented as a matrix with training datasets in columns and test datasets in rows. For example, the T. gondii dataset is used to build the decision tree model, which is then tested on the benchmark dataset. Included in the matrix are classification results from cross-validation, which are expected to approach 1.0 (most algorithms have an inherent unavoidable error i.e. noise). Cross-validation results that greatly differ from 1.0 suggest there is at least one problematic evidence profile. The combined species dataset is the combination of the T. gondii, Plasmodium, and C. elegans datasets. The results, therefore, are positively biased when the combined species dataset is used for training and testing on datasets other than the benchmark. Similarly, testing on the combined species dataset with species-specific trained models is also positively biased. The main benchmark for the algorithm comparison is the classification of the benchmark proteins using the combined species dataset to train the model.

Table 5 Sensitivity and specificity performance measures of binary classification on different test datasets when using machine learning algorithms with different training datasets

In summary, the best benchmark performing algorithm (based on the sum of sensitivity and specificity) is naïve Bayes; then adaptive boosting; followed jointly by random forest and support vector machines (SVM); then neural networks, k-nearest neighbour, and finally decision tree. With the exception of decision tree, the difference in performance is so minimal that the ranked performance here could easily change given different training and test datasets and/or fine-tuning of the algorithm parameters. Ultimately, there was no apparent difference between the algorithms with respect to solving this specific problem of classifying evidence profiles.

Factors affecting performance of machine learning algorithms

It is the content of the training dataset and in particular the number of problematic profiles in both the training and test datasets that have the greatest impact on the performance of the algorithm. Certain profiles are more problematic than others for some algorithms to classify and tend to be consistently misclassified. The T. gondii trained model performed the poorest when tested on the benchmark proteins irrespective of the algorithm used. It is tempting to assume that the poor performance from the T. gondii trained model was due to a misclassification of the target input variable for some of the evidence profiles. However, there are two other proposed reasons for this inaccuracy: the training dataset contains the least number of evidence profiles (39 in total), but more importantly it contains three labelled profiles with questionable evidence (i.e. erroneous evidence predictions identified when manually assessing them). Cross-validation is a useful indication that a particular profile is problematic. Problematic profiles, both in the training and test datasets, tend to contain ambiguous evidence which can cause the algorithm to make an unexpected classification. Based on cross-validation, the T. gondii data contained the most problematic profiles for all algorithms, followed by Plasmodium, benchmark and C. elegans datasets. Removing problematic profiles improves performance in cross-validation. It is therefore tempting to remove these problematic profiles from the training datasets for deployment but their removal negatively impacts performance. The motivation behind using the machine learning algorithms is to overcome the effects of erroneous evidence that is currently inherent in the in silico vaccine discovery output. Consequently, the training data should retain problematic profiles for building models for deployment. 
They need to be retained in the application of the model because it is unclear whether these problematic profiles are incorrect or whether they are correct but rare (i.e. they are outliers). New profiles for classification are expected to contain an unknown percentage of similar erroneous evidence. Algorithms vary in their ability to handle problematic profiles according to what other profiles are represented in the training dataset. For example, the combined species trained model is a collection of exactly the same profiles as those in the individual species trained models. However, the algorithms when trained with the combined species are able to correctly classify the problematic profiles more effectively than individual species trained models.

The results in Table 5 show that there is no fundamental difference between evidence profiles from different eukaryotic species. For example, the benchmark dataset is composed of T. gondii and N. caninum data and yet both the Plasmodium and C. elegans trained models outperformed the T. gondii trained model. The ideal training dataset for the classification problem described herein is one that contains the most variety of evidence profiles irrespective of the source species.

None of the algorithms can consistently classify evidence profiles without false predictions irrespective of the training dataset. Each algorithm nonetheless performed better than the rule-based approach with a collective average sensitivity and specificity of 0.97 and 0.98. The main reason why the machine learning algorithms performed better than the rule-based approach in this study is related to how they handle erroneous evidence. For example, a classification rule, applied to a combination of input variables, fails when only one input variable is erroneous. Machine learning algorithms, despite erroneous evidence in both the training and test datasets, can still exploit a generalised pattern within the collection of evidence for the purpose of classification.

A proposed classification system

The proposed classification system (see Figure 5) uses the ensemble of classifiers, excluding the decision tree, to make a final classification based on voting and a majority rule decision from predictions of the individual classifiers. In the case of a tied vote, the decision is deemed a YES classification. The logic behind this decision is that false positives are preferential to false negatives as they can be identified later during the laboratory validation. Table 6 shows the UniProt identifier for proteins from the benchmark dataset that were consistently incorrectly classified by the machine learning algorithms. At least one of the six algorithms failed to correctly classify six proteins (Q27298, B0LUH4, P84343, Q9U483, B9PRX5, B9QH60) that were expected to be YES and three proteins (B6K9N1, B9Q0C2, B9PK71) expected to be NO. Table 7 provides a description of these misclassified proteins. After applying the majority rule approach, all proteins were classified as expected. The final predicted classification of protein Q27298 was YES based on a tied decision. There are three possible reasons why a protein in the final classification process might be misclassified: 1) the expected classification is incorrect, 2) the majority of algorithms fail, and 3) the evidence profile is too problematic. The misclassifications in Table 6 suggest that they were mainly due to the failure of a particular algorithm when considering the successful classification by other algorithms. The evidence profiles for Q27298 and B9PRX5 are possibly problematic for the algorithms that made the misclassification. This is most likely because the algorithms have not been trained for a profile of this type i.e. the training dataset is failing. In this case (or in the case of any classified vaccine candidate), false positives can only be identified in the laboratory. 
Interpreting the relationship between evidence profiles and an immune response in host remains a challenge to the in silico vaccine discovery approach.

Figure 5

Overview of a proposed classification system using a pool of machine learning algorithms to determine the suitability of proteins for vaccine candidacy. Protein sequences for a target species are input into seven prediction programs. These programs provide evidence as to whether the proteins associated with the sequences are either membrane-associated or secreted, and contain epitopes. Evidence for each protein is collated to create an evidence profile. A collection of evidence profiles are used as input to a pool of six independent machine learning algorithms for classification. Final classification is based on voting and a majority rule decision.
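The voting step of the proposed system can be sketched in a few lines of Python. This is an illustrative sketch only (the function name and the example votes are assumptions): each of the six classifiers contributes one ‘YES’ or ‘NO’ vote, the majority decides, and a tied vote is deemed ‘YES’ as described in the text.

```python
def majority_vote(votes):
    """Combine per-classifier 'YES'/'NO' votes; a tie resolves to 'YES'."""
    if not votes:
        raise ValueError("at least one vote is required")
    if any(v not in ("YES", "NO") for v in votes):
        raise ValueError("votes must be 'YES' or 'NO'")
    yes = sum(v == "YES" for v in votes)
    no = len(votes) - yes
    # Ties favour 'YES' because a false positive can still be caught
    # during laboratory validation, whereas a false negative is lost.
    return "YES" if yes >= no else "NO"


# Invented votes from six classifiers for three hypothetical proteins.
clear_yes = majority_vote(["YES", "YES", "YES", "YES", "NO", "YES"])
tied = majority_vote(["YES", "NO", "YES", "NO", "YES", "NO"])
clear_no = majority_vote(["NO", "NO", "YES", "NO", "NO", "YES"])
```

With an even number of classifiers a tie is possible, so the tie-breaking direction is a design choice; here it follows the stated preference for false positives over false negatives.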

Table 6 Misclassified proteins from the benchmark dataset by machine learning algorithms
Table 7 Description of proteins from the benchmark dataset that were misclassified by at least one machine learning algorithm

Future developments

The outcome of the classification system is a list of proteins that are worthy of laboratory investigation. Each protein in the list is assumed to have an equal chance of being a vaccine candidate. An improvement to the proposed classification system is to score the proteins according to a likelihood or confidence level that the classifications are correct. The R functions for SVM and random forest support class probabilities, i.e. an estimated probability for each protein belonging to the ‘YES’ and ‘NO’ classes. For such an extension, the format of the training datasets is the same except that the target value would no longer be a ‘YES’ or ‘NO’ but a single probability score that attempts to encapsulate each snippet of evidence representing the evidence profile. Determining such a score is a challenge that still remains. The advantage of an appropriate scoring system is that the proteins in the vaccine candidacy list can then be ranked. A caveat here is that the ranking is based on a confidence level of prediction. A protein with a high probability score does not necessarily imply a high probability of an immune response when injected in a host.
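As a rough illustration of the ranking idea, the Python sketch below orders proteins by an estimated probability of belonging to the ‘YES’ class. It assumes such probabilities are already available from a classifier; the function name, the 0.5 cut-off, and the protein identifiers are hypothetical and not part of the study.

```python
def rank_candidates(yes_probabilities, cutoff=0.5):
    """Return (protein_id, probability) pairs at or above the cutoff,
    sorted from most to least confident 'YES' prediction."""
    kept = [(pid, p) for pid, p in yes_probabilities.items() if p >= cutoff]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Invented class probabilities for three hypothetical proteins.
toy_probabilities = {"PROT_A": 0.91, "PROT_B": 0.35, "PROT_C": 0.62}
ranked = rank_candidates(toy_probabilities)
```

The ranking reflects prediction confidence only, in line with the caveat above.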

The proposed classification system is intended to illustrate a framework on which researchers can build more efficient systems. For example, only seven high-throughput prediction programs were used here to create the evidence profiles. There are other bioinformatics programs [1] that could be used to predict similar or additional protein characteristics from protein sequences, such as GPI anchoring, molecular function, and biological process involvement. At the time of writing, there is no high-throughput standalone GPI predictor. Appropriate values that support vaccine candidacy could be extracted from these extra program outputs and added to the evidence profile as additional columns in the training datasets.

There are examples of proteins with annotated interior subcellular locations that have been observed to induce an immune response [19]. It is assumed here that these proteins are not naturally exposed to the immune system but were exposed as a consequence of experimental conditions. Nevertheless, the important point here is that they do induce an immune response and are potential vaccine candidates. These interior proteins are missed by the current proposed classification system. All protein types that induce an immune response in theory need to be addressed to create a totally encompassing system for in silico vaccine discovery. This can only be accomplished if distinguishing characteristics that exemplify antigenicity can be predicted given protein sequences. A prediction program that distinguishes antigenic and non-antigenic interior proteins is sought.


We conclude the following when given a high-throughput in silico vaccine discovery output consisting of predicted protein characteristics (evidence profiles) from thousands of proteins: 1) machine learning algorithms can perform binary classification (i.e. yes or no vaccine candidacy) for these proteins more accurately than human generated rules; 2) there is no apparent difference in performance (i.e. sensitivity and specificity) between the algorithms (adaptive boosting, random forest, k-nearest neighbour classifier, naive Bayes classifier, neural networks, and SVM) when performing this particular classification task; 3) none of the algorithms can consistently classify evidence profiles without false predictions using the training datasets in this study; 4) there is no fundamental difference in patterns in evidence profiles compiled from different species e.g. a model trained on one species can classify proteins from another and hence no target specific training datasets are required; 5) an ideal training dataset is one that contains the most variety of evidence profiles irrespective of the source species i.e. quality and variety are the most important factors that impact the accuracy of algorithms; and 6) a pool of algorithms with a voting and majority rule decision can perform classification with a high degree of accuracy e.g. 100% sensitivity and specificity was demonstrated in this study by correctly determining the expected classification of the benchmark dataset.

Vaccine candidates from an in silico approach can only be truly validated in a laboratory. There are essentially two options. One is to rely on laboratory validation to identify false candidates. The other is to use our proposed classification system to identify those proteins more worthy of laboratory validation. This will ultimately save time and money by reducing the false candidates allocated for validation.


Eukaryotic pathogens used in study

Toxoplasma gondii, Plasmodium sp., and Caenorhabditis elegans were the chosen species to train the machine learning algorithms. Toxoplasma gondii is an apicomplexan pathogen responsible for birth defects in humans [20] and is an important model system for the phylum Apicomplexa [21-23]. Species in the genus Plasmodium are also apicomplexan pathogens and can cause the disease malaria [24]. These species were selected because in comparison to most other pathogens, they have experimentally validated data for protein subcellular location, albeit limited for T. gondii. Caenorhabditis elegans is a free-living nematode that is not a pathogen but is rich in validated data [25]. This species was particularly chosen to investigate whether a universal training dataset could be used for the classification of proteins from any eukaryotic pathogen or whether target specific training datasets are required.

Training data for machine learning algorithms

Two sets of distinct evidence profiles for each training dataset were required: one set representing evidence for proteins that are vaccine candidates and another for non-vaccine candidates. The major challenge here is that there are too few examples of protein subunit vaccines, irrespective of the target pathogen, to create ideal training datasets. Consequently, the training datasets used in this study are based on proteins that are only likely vaccine candidates - ‘likely’ in this context is based on two a priori held hypotheses: 1) a protein that is either external to or located on, or in, the membrane of a pathogen is more likely to be accessible to surveillance by the immune system than a protein within the interior of a pathogen [26]; and 2) a protein containing peptides (T-cell epitopes) that bind to major histocompatibility complex (MHC) molecules fulfils one of several prerequisites for a vaccine based on this protein. That is, a protein vaccine candidate needs to contain T-cell epitopes to induce the creation of a memory T-cell repertoire capable of recognizing a pathogen [27, 28].

Appropriate protein sequences for T. gondii, C. elegans, and Plasmodium species were downloaded from the Universal Protein Resource knowledgebase (UniProtKB). In UniProtKB at the time of writing, there were 19,261 proteins for T. gondii species (this includes strains such as ME49, VEG, RH, and GT1), 25,765 for C. elegans, and 75,507 for the genus Plasmodium. Despite T. gondii being a well-studied organism, only 55 proteins had the status of manually annotated and reviewed. In comparison, C. elegans had 3,360 reviewed and Plasmodium 488. A challenge was that the protein annotations in UniProtKB (e.g. protein name, domains, protein families, subcellular location etcetera) were not necessarily indicative for selecting the desired three classes of proteins: secreted, membrane-associated, and other. The subcellular location annotation was the most informative out of all annotations. Of the reviewed proteins, 39 for T. gondii, 1,190 for C. elegans and 202 for Plasmodium had experimental evidence to support the annotation for their subcellular location. To aid in creating a preliminary training dataset, proteins from the desired subcellular locations were selected using the advanced search facility in UniProt and entering either a partial or whole term in the subcellular location field. Using the word ‘membrane’ in the UniProt advanced search, 11 of the 39 T. gondii proteins were selected. Similarly, 10 out of 39 were selected using the word ‘secreted’. For C. elegans, 796 of the 1,190 proteins with experimentally derived subcellular locations had the word ‘membrane’ and 47 had ‘secreted’ (unlike apicomplexan pathogens, C. elegans does not secrete proteins for the purpose of invasion and survival within host cells). There were only four Plasmodium proteins with the ‘secreted’ annotation in contrast to 134 with ‘membrane’ (there are many more secreted proteins in UniProtKB but not yet reviewed).
This broad word search selected undesired proteins with subcellular descriptions such as parasitophorous vacuole membrane and golgi apparatus membrane. Proteins with inappropriate subcellular descriptions were manually removed or reclassified in the training datasets on consultation with the UniProt controlled vocabulary. The expected ‘YES’ or ‘NO’ classification for each protein in the training datasets was fine-tuned in accordance with cross-validation testing, epitope presence as per reference to the Immune Epitope Database and Analysis Resource, and reference to other UniProtKB annotations and Gene Ontology. Descriptions of the datasets are shown in Table 1.

Bioinformatics prediction programs

The downloaded protein sequences from UniProtKB were used as input to seven prediction programs (WoLF PSORT [11], SignalP [29], TargetP [10], TMHMM [13], Phobius [12] and IEDB peptide-MHC I and II binding predictors [30, 31]). These programs have several features in common: applicable to eukaryotes, can be freely downloaded, run in a standalone mode, allow high-throughput processing, and execute in a Linux environment. The emphasis here is on high-throughput. An in-house Perl script selected values (potential evidence) from the program outputs and compiled them into one file to construct the evidence profiles.

Machine learning algorithms

Seven supervised machine learning algorithms were executed within R (a free software environment for statistical computing and graphics) via R functions from packages that can be downloaded from the Comprehensive R Archive Network (CRAN): 1) decision tree, also referred to as classification and regression trees (CART) [32], via the rpart R function (implemented in the rpart package); 2) adaptive boosting [33] via the ada R function [34]; 3) random forest via the randomForest R function [35]; 4) k-nearest neighbour classifier (k-NN) via the knn R function [36, 37] contained in the class package; 5) naive Bayes classifier via the naiveBayes R function contained in the e1071 package; 6) neural network (single-hidden-layer multilayer perceptron) via the nnet R function contained in the nnet package [36, 37]; and 7) support vector machines via the ksvm R function [38], which is contained in the kernlab package.

The algorithms were chosen because there is a wealth of literature on their successful application to a wide range of problems in multiple fields. The focus here is therefore on applying the algorithms to a specific biological problem, not on evaluating or judging their design and logic. Each algorithm is applied to building a classification model in a similar way: an algorithm-specific R function is used with the same training datasets. All seven machine learning R functions required at least two arguments: a data frame of categorical and/or numeric input variables (i.e. the training dataset consisting of the evidence profiles) and a class vector holding the ‘YES’ or ‘NO’ classification for each evidence profile (i.e. the target variable).

Cross-validation was performed to evaluate each training dataset and the resultant model built by each algorithm. An in-house R function executed each machine learning R function multiple times (e.g. 100 runs). For each run, the function randomly selected 70% of the training set to build a model and used the remaining 30% as test data for classification. The generic R function predict [39] was used to make the predictions. An in-house Perl script summarised the multiple runs, and the prediction outcomes were averaged to calculate the sensitivity and specificity performance measures.
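
The repeated 70%/30% evaluation can be sketched as follows. This is a Python illustration rather than the authors' in-house R and Perl code, and it rests on several assumptions: a simple 1-nearest-neighbour rule stands in for the seven R classifiers, the toy profiles are invented, and outcomes are pooled across runs before sensitivity, TP / (TP + FN), and specificity, TN / (TN + FP), are computed, which is one possible reading of how the outcomes "were averaged".

```python
import random


def nearest_neighbour(train_x, train_y, query):
    """Stand-in classifier: label of the closest training profile (1-NN)."""
    def squared_distance(profile):
        return sum((a - b) ** 2 for a, b in zip(profile, query))
    closest = min(range(len(train_x)), key=lambda i: squared_distance(train_x[i]))
    return train_y[closest]


def confusion_counts(actual, predicted, positive="YES"):
    """Return (TP, FN, TN, FP) for paired actual and predicted labels."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, tn, fp


def repeated_holdout(x, y, runs=100, train_fraction=0.7, seed=0):
    """Pool outcomes over random splits; return (sensitivity, specificity)."""
    rng = random.Random(seed)
    n_train = round(train_fraction * len(x))
    totals = [0, 0, 0, 0]  # TP, FN, TN, FP summed over all runs
    for _ in range(runs):
        order = list(range(len(x)))
        rng.shuffle(order)
        train, test = order[:n_train], order[n_train:]
        train_x = [x[i] for i in train]
        train_y = [y[i] for i in train]
        predicted = [nearest_neighbour(train_x, train_y, x[i]) for i in test]
        counts = confusion_counts([y[i] for i in test], predicted)
        totals = [t + c for t, c in zip(totals, counts)]
    tp, fn, tn, fp = totals
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


def toy_profiles():
    """Hypothetical numeric evidence profiles: two well-separated classes."""
    x = [(0.01 * i, 0.0) for i in range(10)]
    x += [(1.0 + 0.01 * i, 1.0) for i in range(10)]
    y = ["NO"] * 10 + ["YES"] * 10
    return x, y


if __name__ == "__main__":
    profiles, labels = toy_profiles()
    sens, spec = repeated_holdout(profiles, labels)
    print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Pooling the counts before dividing avoids undefined ratios in runs where the 30% test split happens to contain no positive or no negative examples.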

Benchmark dataset

The benchmark dataset consisted of a collection of evidence profiles derived from T. gondii and Neospora caninum (an apicomplexan pathogen that is morphologically and developmentally similar to T. gondii [40]). As with the evidence profiles for the training datasets, protein sequences (140 in total) downloaded from UniProtKB were input into the seven prediction programs and an in-house Perl script compiled the evidence profiles.

It is well acknowledged in the literature that the development of vaccines directed against T. gondii and N. caninum should focus on selecting proteins that are capable of eliciting mainly a cell-mediated immune (CMI) response involving CD4+ve T cells, Type 1 helper T cells (Th1), and interferon-gamma (IFN-γ), in addition to a humoral response [19, 41-43]. Seventy of the evidence profiles are for proteins from published studies. Twenty-two of these proteins have been observed to induce CMI responses and the remaining 48 have been experimentally shown to be membrane-associated or secreted. Eleven of the proteins have epitopes identified experimentally, and some of these epitopes have been shown to elicit significant humoral and cellular immune responses in vaccinated mice when used in combination with other epitopes [44-47]. Additional file 1: Table S1 lists the benchmark proteins along with a publication reference to the relevant study. A brief description of the vaccine significance for some of these proteins and an entire list of evidence profiles for the benchmark dataset are also provided in Additional file 1. A further 70 evidence profiles, for proteins that have been experimentally shown to be neither membrane-associated nor secreted, were added to the benchmark dataset.


  1. Goodswen SJ, Kennedy PJ, Ellis JT: A guide to in silico vaccine discovery for eukaryotic pathogens. Brief Bioinform. 2012, [Epub ahead of print]

  2. Mora M, Donati C, Medini D, Covacci A, Rappuoli R: Microbial genomes and vaccine design: refinements to the classical reverse vaccinology approach. Current Opinion in Microbiology. 2006, 9 (5): 532-536. 10.1016/j.mib.2006.07.003.

  3. Rappuoli R: Bridging the knowledge gaps in vaccine design. Nat Biotech. 2007, 25 (12): 1361-1366. 10.1038/nbt1207-1361.

  4. He Y, Xiang Z, Mobley HLT: Vaxign: The First Web-Based Vaccine Design Program for Reverse Vaccinology and Applications for Vaccine Development. Journal of Biomedicine and Biotechnology. 2010, 2010: 297505.

  5. Vivona S, Bernante F, Filippini F: NERVE: new enhanced reverse vaccinology environment. BMC Biotechnol. 2006, 6 (1): 35. 10.1186/1472-6750-6-35.

  6. Leuzzi R, Savino S, Pizza M, Rappuoli R: Handbook of Meningococcal Disease. Genome Mining and Reverse Vaccinology. 2006, Wiley-VCH Verlag GmbH & Co. KGaA, 391-402.

  7. Serino L, Pizza M, Rappuoli R: Pathogenomics. Reverse Vaccinology: Revolutionizing the Approach to Vaccine Design. 2006, Wiley-VCH Verlag GmbH & Co. KGaA, 533-554.

  8. Vivona S, Gardy JL, Ramachandran S, Brinkman FSL, Raghava GPS, Flower DR, Filippini F: Computer-aided biotechnology: from immuno-informatics to reverse vaccinology. Trends in biotechnology. 2008, 26 (4): 190-200. 10.1016/j.tibtech.2007.12.006.

  9. Dyrløv Bendtsen J, Nielsen H, von Heijne G, Brunak S: Improved prediction of signal peptides: signalP 3.0. J Mol Biol. 2004, 340 (4): 783-795. 10.1016/j.jmb.2004.05.028.

  10. Emanuelsson O, Brunak S, von Heijne G, Nielsen H: Locating proteins in the cell using targetP, signalP and related tools. Nat Protocols. 2007, 2 (4): 953-971. 10.1038/nprot.2007.131.

  11. Horton P, Park K-J, Obayashi T, Fujita N, Harada H, Adams-Collier CJ, Nakai K: WoLF PSORT: protein localization predictor. Nucleic Acids Res. 2007, 35 (suppl 2): W585-W587.

  12. Kall L, Krogh A, Sonnhammer ELL: A combined transmembrane topology and signal peptide prediction method. J Mol Biol. 2004, 338 (5): 1027-1036. 10.1016/j.jmb.2004.03.016.

  13. Krogh A, Larsson B, von Heijne G, Sonnhammer ELL: Predicting transmembrane protein topology with a hidden markov model: application to complete genomes. J Mol Biol. 2001, 305 (3): 567-580. 10.1006/jmbi.2000.4315.

  14. Peters B, Bui H-H, Frankild S, Nielsen M, Lundegaard C, Kostem E, Basch D, Lamberth K, Harndahl M, Fleri W, et al: A community resource benchmarking predictions of peptide binding to MHC-I molecules. PLoS Comput Biol. 2006, 2 (6): 574-584.

  15. Wang P, Sidney J, Dow C, Mothe B, Sette A, Peters B: A systematic assessment of MHC class II peptide binding predictions and evaluation of a consensus approach. PLoS Comput Biol. 2008, 4 (4): e1000048. 10.1371/journal.pcbi.1000048.

  16. Bhasin M, Raghava GPS: Prediction of CTL epitopes using QM, SVM and ANN techniques. Vaccine. 2004, 22 (23-24): 3195-3204.

  17. Bowman BN, McAdam PR, Vivona S, Zhang J, Luong T, Belew RK, Sahota H, Guiney D, Valafar F, Fierer J, et al: Improving reverse vaccinology with a machine learning approach. Vaccine. 2011, In Press, Uncorrected Proof

  18. Sollner J, Mayer B: Machine learning approaches for prediction of linear B-cell epitopes on proteins. J Mol Recognit. 2006, 19 (3): 200-208. 10.1002/jmr.771.

  19. Rocchi MS, Bartley PM, Inglis NF, Collantes-Fernandez E, Entrican G, Katzer F, Innes EA: Selection of Neospora caninum antigens stimulating bovine CD4(+ve) T cell responses through immuno-potency screening and proteomic approaches. Veterinary Research. 2011, 42: 1-91.

  20. Montoya JG, Liesenfeld O: Toxoplasmosis. Lancet. 2004, 363 (9425): 1965-1976. 10.1016/S0140-6736(04)16412-X.

  21. Che F-Y, Madrid-Aliste C, Burd B, Zhang H, Nieves E, Kim K, Fiser A, Angeletti RH, Weiss LM: Comprehensive proteomic analysis of membrane proteins in toxoplasma gondii. Mol Cell Proteomics. 2010, 10 (1): M110 000745.

  22. Kim K, Weiss LM: Toxoplasma gondii: the model apicomplexan. Int J Parasitol. 2004, 34 (3): 423-432. 10.1016/j.ijpara.2003.12.009.

  23. Roos DS: Themes and variations in apicomplexan parasite biology. Science. 2005, 309 (5731): 72-73. 10.1126/science.1115252.

  24. Snow RW, Guerra CA, Noor AM, Myint HY, Hay SI: The global distribution of clinical episodes of plasmodium falciparum malaria. Nature. 2005, 434 (7030): 214-217. 10.1038/nature03342.

  25. Kurz CL, Ewbank JJ: Caenorhabditis elegans: an emerging genetic model for the study of innate immunity. Nat Rev Genet. 2003, 4 (5): 380-390. 10.1038/nrg1067.

  26. Flower DR, Macdonald IK, Ramakrishnan K, Davies MN, Doytchinova IA: Computer aided selection of candidate vaccine antigens. Immunome Research. 2010, 6 (Suppl 2): S1. 10.1186/1745-7580-6-S2-S1.

  27. Kaech SM, Wherry EJ, Ahmed R: Effector and memory T-cell differentiation: implications for vaccine development. Nat Rev Immunol. 2002, 2 (4): 251-262. 10.1038/nri778.

  28. Sette A, Fikes J: Epitope-based vaccines: an update on epitope identification, vaccine design and delivery. Curr Opin Immunol. 2003, 15 (4): 461-470. 10.1016/S0952-7915(03)00083-9.

  29. Petersen TN, Brunak S, von Heijne G, Nielsen H: SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods. 2011, 8 (10): 785-786. 10.1038/nmeth.1701.

  30. Kim Y, Ponomarenko J, Zhu Z, Tamang D, Wang P, Greenbaum J, Lundegaard C, Sette A, Lund O, Bourne PE, et al: Immune epitope database analysis resource. Nucleic Acids Res. 2012, 40 (W1): W525-W530. 10.1093/nar/gks438.

  31. Kim Y, Sette A, Peters B: Applications for T-cell epitope queries and tools in the immune epitope database and analysis resource. J Immunol Methods. 2011, 374 (1-2): 62-69.

  32. Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. 1984, Wadsworth International Group

  33. Freund Y, Schapire RE: A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci. 1997, 55 (1): 119-139. 10.1006/jcss.1997.1504.

  34. Friedman J, Hastie T, Tibshirani R: Additive logistic regression: a statistical view of boosting. Ann Stat. 2000, 28 (2): 337-374.

  35. Breiman L: Random forests. Mach Learn. 2001, 45 (1): 5-32. 10.1023/A:1010933404324.

  36. Ripley BD: Pattern Recognition and Neural Networks. 1996, Cambridge University Press, 1

  37. Venables WN, Ripley BD: Modern Applied Statistics with S. 2002, Springer, 4

  38. Platt J: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers. 1999, 1999: 61-74.

  39. Chambers JM, Hastie TJ: Statistical Models in S. 1992, Wadsworth and Books/Cole Computer Science Series, Chapman and Hall

  40. Dubey JP, Carpenter JL, Speer CA, Topper MJ, Uggla A: Newly recognized fatal protozoan disease of dogs. J Am Vet Med Assoc. 1988, 192 (9): 1269-1285.

  41. Andrianarivo AG, Anderson ML, Rowe JD, Gardner IA, Reynolds JP, Choromanski L, Conrad PA: Immune responses during pregnancy in heifers naturally infected with neospora caninum with and without immunization. Parasitol Res. 2005, 96 (1): 24-31. 10.1007/s00436-005-1313-y.

  42. Reichel MP, Ellis JT: Neospora caninum - how close are we to development of an efficacious vaccine that prevents abortion in cattle?. Int J Parasitol. 2009, 39 (11): 1173-1187. 10.1016/j.ijpara.2009.05.007.

  43. Tuo WB, Fetterer R, Jenkins M, Dubey JP: Identification and characterization of neospora caninum cyclophilin that elicits gamma interferon production. Infect Immun. 2005, 73 (8): 5093-5100. 10.1128/IAI.73.8.5093-5100.2005.

  44. Cong H, Gu QM, Yin HE, Wang JW, Zhao QL, Zhou HY, Li Y, Zhang JQ: Multi-epitope DNA vaccine linked to the A(2)/B subunit of cholera toxin protect mice against toxoplasma gondii. Vaccine. 2008, 26 (31): 3913-3921. 10.1016/j.vaccine.2008.04.046.

  45. Maksimov P, Zerweck J, Maksimov A, Hotop A, Gross U, Pleyer U, Spekker K, Daeubener W, Werdermann S, Niederstrasser O, et al: Peptide microarray analysis of in silico-predicted epitopes for serological diagnosis of toxoplasma gondii infection in humans. Clin Vaccine Immunol. 2012, 19 (6): 865-874. 10.1128/CVI.00119-12.

  46. Nielsen HV, Lauemoller SL, Christiansen L, Buus S, Fomsgaard A, Petersen E: Complete protection against lethal toxoplasma gondii infection in mice immunized with a plasmid encoding the SAG1 gene. Infect Immun. 1999, 67 (12): 6358-6363.

  47. Wang Y, Wang M, Wang G, Pang A, Fu B, Yin H, Zhang D: Increased survival time in mice vaccinated with a branched lysine multiple antigenic peptide containing B- and T-cell epitopes from T. gondii antigens. Vaccine. 2011, 29 (47): 8619-8623. 10.1016/j.vaccine.2011.09.016.


SJG gratefully acknowledges receipt of a PhD scholarship from Zoetis (Pfizer) Animal Health.

Author information


Corresponding author

Correspondence to John T Ellis.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

SG conceived and designed the experiments, performed the experiments, and analysed the data. All authors contributed to the writing of the manuscript and read and approved the final version.

Electronic supplementary material


Additional file 1: Includes typical outputs from prediction programs used for the in silico vaccine discovery pipeline, a list of the benchmark test proteins along with a publication reference to relevant studies, and a brief description of the vaccine significance for some of these proteins. (PDF 272 KB)

Rights and permissions

Open Access. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Goodswen, S.J., Kennedy, P.J. & Ellis, J.T. A novel strategy for classifying the output from an in silico vaccine discovery pipeline for eukaryotic pathogens using machine learning algorithms. BMC Bioinformatics 14, 315 (2013).

  • Training Dataset
  • Vaccine Candidate
  • Machine Learning Algorithm
  • Benchmark Dataset
  • Laboratory Validation