A novel strategy for classifying the output from an in silico vaccine discovery pipeline for eukaryotic pathogens using machine learning algorithms

Background: An in silico vaccine discovery pipeline for eukaryotic pathogens typically consists of several computational tools to predict protein characteristics. The aim of the in silico approach to discovering subunit vaccines is to use predicted characteristics to identify proteins which are worthy of laboratory investigation. A major challenge is that these predictions inherently contain hidden inaccuracies and contradictions. This study focuses on how to reduce the number of false candidates using machine learning algorithms rather than relying on expensive laboratory validation. Proteins from Toxoplasma gondii, Plasmodium sp., and Caenorhabditis elegans were used as training and test datasets.
Results: The results show that machine learning algorithms can effectively distinguish expected true from expected false vaccine candidates (with an average sensitivity and specificity of 0.97 and 0.98 respectively), for proteins observed to induce immune responses experimentally.
Conclusions: Vaccine candidates from an in silico approach can only be truly validated in a laboratory. Given any in silico output and appropriate training data, the number of false candidates allocated for validation can be dramatically reduced using a pool of machine learning algorithms. This will ultimately save time and money in the laboratory.


Introduction
This document supplements the paper, 'A novel strategy for classifying the output from an in silico vaccine discovery pipeline for eukaryotic pathogens using machine learning algorithms'. It is not intended to be read cover to cover but as a reference to assist the reader in a more detailed understanding of the paper, if required.
The document is in five parts: 1) example outputs of the bioinformatics prediction programs used in the study; 2) information on the creation of the benchmark dataset including Table S1, comprising the compiled proteins with columns for Gene name, NCBI accession, UniProt ID, Protein description, Epitope experimental evidence, Organism, Study publication reference, and Comments; 3) a brief description of some of the protein types listed in Table S1 that studies have shown to be potential or at least speculative vaccine candidates; 4) Tables S2 and S3, showing experimental information about epitopes and MHC binding related to proteins in Table S1; and 5) a list of 'output values' (i.e. evidence profiles) generated by seven prediction programs given protein sequences associated with the proteins in Table S1.

Example outputs from prediction programs
Selected output values from seven bioinformatics prediction programs (WoLF PSORT [1], SignalP [2], TargetP [3], TMHMM [4], Phobius [5], and the IEDB peptide-MHC I and II binding predictors [6,7]) were used to test methods for vaccine candidate classification. Example outputs from each program are described below.

WoLF PSORT
Figure S1 shows a typical output from WoLF PSORT. Information about each protein sequence is displayed on separate lines (only three sequences are shown in Figure S1). Each field along the line contains a localisation class (based on UniProt "Subcellular Localization" field keywords) and a score, with fields separated by commas. There are 12 localisation classes that also map to Gene Ontology (GO) (see footnote 1). As an example of how to interpret the output in Figure S1, protein 'seq1' has six candidate sites listed in descending order of likelihood based on a score. The most likely assignment is dual localisation to the extracellular space (extr) and plasma membrane (plas), with a score of 11.5. The plasma membrane (on its own) is the next most likely site, followed by extracellular, endoplasmic reticulum (E.R.), lysosome (lyso) and finally peroxisome (pero). The accuracy of WoLF PSORT is influenced by the number of each type of localisation site in the training data, e.g. sites with few examples in the training dataset are seldom correctly predicted.
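As a rough illustration of the interpretation above, the following Python sketch splits one WoLF PSORT summary line into ranked (site, score) pairs. The line layout (an identifier followed by comma-separated "site score" fields), the label extr_plas for the dual site, and every score other than the 11.5 quoted above are assumptions made for illustration; Figure S1 remains the reference for the actual format.

```python
def parse_wolfpsort_line(line):
    """Split one summary line into (sequence_id, [(site, score), ...])."""
    seq_id, _, rest = line.strip().partition(" ")
    sites = []
    for field in rest.split(","):
        # Each field is assumed to look like "plas 9.0" (a trailing colon is tolerated).
        name, score = field.strip().rsplit(" ", 1)
        sites.append((name.rstrip(":"), float(score)))
    # Highest score first, matching the "descending order of likelihood" reading.
    sites.sort(key=lambda pair: pair[1], reverse=True)
    return seq_id, sites


# Hypothetical line: only the identifier, the site order and the 11.5 score follow the text.
example = "seq1 extr_plas 11.5, plas 9.0, extr 8.0, E.R. 3.0, lyso 3.0, pero 2.0"
seq_id, sites = parse_wolfpsort_line(example)
print(seq_id, "most likely site:", sites[0])
```

Keeping the full ranked list, rather than only the top site, preserves the lower-scoring alternatives that may matter when predictions from several programs contradict one another.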

SignalP
It is recommended in the SignalP user manual that only the first 50 to 70 amino acids of each sequence should be used in the prediction, as longer sequences increase the risk of false positives. To restrict the length of the input sequence a command-line parameter is used (e.g. -trunc 70). An example of the summary output from SignalP is shown in Figure S2. The output comprises five different scores between 0 and 1: 1) Cmax is the maximum "cleavage site" score (a C-score is calculated for each position in the submitted sequence and a significantly high score indicates a cleavage site); 2) Ymax is a derivative of the C-score combined with the S-score, resulting in a better cleavage site prediction than the raw C-score alone; 3) Smax is the "maximum signal peptide" prediction score (the S-score for the signal peptide prediction is calculated for every single amino acid position in the submitted sequence; a high score indicates that the corresponding amino acid is part of a signal peptide, and a low score indicates that the amino acid is part of a mature protein); 4) Smean is the "average of the S-score"; and 5) D is an average of the "Smean and Ymax" scores. Position (pos) is the location in the amino acid sequence where Cmax (i.e. cleavage site position), Ymax (i.e. length of signal peptide), and Smax occur. The "Y" or "N" is a yes or no indication that the sequence has a cleavage site and a signal peptide, according to whether D is above or below the Dmaxcut. High scores also indicate that the sequence is a secretory protein.
According to the authors of SignalP, a high D-score is the best indicator of secretory proteins [8].
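The relationship between D, Smean, Ymax and the Y/N call can be sketched in a few lines of Python. This is a minimal sketch, assuming D is the plain arithmetic mean of Smean and Ymax as described above (some SignalP versions may weight the two terms differently); the function names and the numeric values, including the cutoff, are hypothetical.

```python
def d_score(s_mean, y_max):
    """D as described in the text: the average of Smean and Ymax (assumed unweighted)."""
    return (s_mean + y_max) / 2.0


def has_signal_peptide(s_mean, y_max, d_max_cut):
    """Return 'Y' when D is above the cutoff, otherwise 'N'."""
    return "Y" if d_score(s_mean, y_max) > d_max_cut else "N"


# Hypothetical scores; the cutoff should be whatever Dmaxcut SignalP reports.
print(has_signal_peptide(s_mean=0.90, y_max=0.70, d_max_cut=0.45))
print(has_signal_peptide(s_mean=0.10, y_max=0.20, d_max_cut=0.45))
```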

TargetP
TargetP predicts the presence and length of secretory pathway signal peptides (SP) and mitochondrial targeting peptides (mTP) in the N-terminal presequences [9]. An example of TargetP output is shown in Figure S3. Len is the sequence length, followed by neural network scores for mitochondrial targeting peptide (mTP), secretory signal peptide (SP), and "other" localisations. The predicted localisation (loc) based on the scores is either mitochondrion (M) or secretory pathway (S) or any other location (-). The reliability class (RC) is from 1 (most reliable) to 5 (least reliable) and is a measure of prediction certainty. The truncated peptide length (TPlen) indicates the predicted presequence length to the cleavage site.
Footnote 1: Gene Ontology (GO) website at: http://www.geneontology.org/

TMHMM
Figure S4 shows one line of a typical output from TMHMM in a summary format. Each output line shows the length (len) of the protein sequence followed by the expected number of amino acid residues in transmembrane helices (ExpAA). If the ExpAA number is larger than 18 (a value proposed by the TMHMM creators), the protein is very likely to be a transmembrane protein (or to have a signal peptide). The output line also shows the expected number of residues in the transmembrane helices in the first 60 amino acids of the protein (First60), the number of predicted transmembrane helices (PredHel), and the predicted protein topology, i.e. the in/out orientation of the protein relative to the membrane. The creators of TMHMM propose that a First60 value greater than 10 indicates a possible N-terminal signal sequence.

Phobius
Figure S5 shows the output from Phobius in a short format. The output information for one protein sequence (SEQENCE) per line consists of the number of transmembrane (TM) helices, a "Y" or "N" indicator that the sequence has a signal peptide (SP), and a predicted topology (information for only one protein sequence is shown).
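The TMHMM thresholds described above (ExpAA larger than 18, First60 greater than 10) can be applied programmatically. The Python sketch below assumes a summary line made of whitespace-separated key=value fields named len, ExpAA, First60, PredHel and Topology; the example line and its values are hypothetical, and Figure S4 remains the reference for the real layout.

```python
def parse_tmhmm_short(line):
    """Parse one summary line into a dict of its key=value fields."""
    seq_id, *fields = line.split()
    record = {"id": seq_id}
    for field in fields:
        key, _, value = field.partition("=")
        record[key] = value
    return record


def interpret_tmhmm(record, expaa_cut=18.0, first60_cut=10.0):
    """Apply the two thresholds proposed by the TMHMM creators."""
    exp_aa = float(record["ExpAA"])
    first60 = float(record["First60"])
    return {
        # ExpAA > 18: very likely transmembrane (or carrying a signal peptide).
        "likely_transmembrane_or_signal": exp_aa > expaa_cut,
        # First60 > 10: a possible N-terminal signal sequence.
        "possible_n_terminal_signal": first60 > first60_cut,
        "predicted_helices": int(record["PredHel"]),
    }


# Hypothetical line for illustration only.
line = "seq1 len=346 ExpAA=23.05 First60=22.94 PredHel=1 Topology=i7-29o"
print(interpret_tmhmm(parse_tmhmm_short(line)))
```

A record that passes both thresholds, as in this example, is ambiguous between a genuine transmembrane helix and a signal peptide, which is one reason the signal peptide predictions from SignalP and Phobius are useful as cross-checks.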

T-Cell MHC class I and II binding prediction tools
The Immune Epitope Database Analysis Resource (IEDB) provides a downloadable Linux package (for a 32    (ARB) [11], Stabilized matrix method (SMM) [12], SMM with a Peptide-MHC Binding Energy Covariance matrix (SMMPMBEC), Scoring Matrices derived from Combinatorial Peptide Libraries (Comblib_Sidney2008) [13], Consensus [14], and NetMHCpan [15]. The available prediction methods for MHC class II are: Consensus [16], Average relative binding (ARB) [11], combinatorial library (unpublished method), NN-align [17] (this method is equivalent to netMHCII version 2.2), SMM-align [18] (equivalent to netMHCII version 1.1), Sturniolo [19] (a method also used in the program TEPITOPE [20]), and NetMHCIIpan [21]. Figure S6 shows a typical output from the MHC class I predictor using the Consensus method (some columns have been deleted and the format adjusted to fit the output on the page). Beginning at the start amino acid (numbered 1) of each sequence (denoted by #), a test subsequence of a specific peptide length (e.g. PepLengh = 9) is created (e.g. Sequence = MSMEGDRPS, located from amino acids 1 to 9 on sequence input #1). The subsequence is scored (e.g. in units of IC50 nM) for binding affinity against the MHC allele (e.g. HLA-A*02:05). Depending on the prediction method, scores are calculated for each amino acid at each position in the subsequence and are then added to yield the overall binding affinity.
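The way test subsequences are cut from an input sequence can be sketched as a sliding window. The Python sketch below assumes the window advances one residue at a time and reports 1-based start and end positions; only the first nine residues of the example sequence (MSMEGDRPS) come from the text, and the remaining residues and the function name are made up for illustration. It does not attempt to reproduce any of the scoring methods.

```python
def peptide_windows(sequence, pep_length=9):
    """Yield (start, end, peptide) for every window, using 1-based positions."""
    for i in range(len(sequence) - pep_length + 1):
        yield i + 1, i + pep_length, sequence[i:i + pep_length]


# Only the first nine residues are taken from the text; the rest are hypothetical.
protein = "MSMEGDRPSLKA"
for start, end, peptide in peptide_windows(protein):
    print(start, end, peptide)
```

Each peptide produced this way would then be scored against the chosen MHC allele by the selected prediction method.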

Benchmark dataset
The benchmark dataset comprises Toxoplasma gondii and Neospora caninum proteins compiled from published studies that have experimentally shown the proteins to be membrane-associated or secreted. More importantly, many of the proteins were observed to induce immune responses and therefore represent the type of proteins likely to be worthwhile vaccine candidates. Eleven of the proteins have epitopes identified experimentally and some of these epitopes have been shown to elicit significant humoral and cellular immune responses in vaccinated mice when used in combination with other epitopes. The compilation of proteins is used as test data in a proof-of-concept for a classification system that is described in the paper.
Two publications in particular were used to compile the protein list for the benchmark dataset. The first was a study by Rocchi and colleagues [22]. The aim was to identify tachyzoite antigens that are recognised by the cell-mediated immune (CMI) response of experimentally infected animals [22]. Six N. caninum proteins and 16 functional orthologues of T. gondii were identified as eliciting a CMI response. The study provided the NCBI accession numbers for these 22 identified proteins, most of which are included in Table S1 along with references to additional studies that support Rocchi's findings. Several of the proteins are not from the expected plasma membrane and extracellular sites: they include proteins from the cytoplasm (e.g. ribosomes and chaperonins) and nucleus (e.g. histone H4), as well as enzymes (e.g. proteasome complex and glutamine synthetase). Although these proteins were identified in Rocchi's study as inducing a CMI response, the classification system described in the paper does not classify them as potential vaccine candidates. This classification was expected as they are neither secreted nor membrane-associated, and have no epitope evidence. The assumption is that these proteins from the interior of the pathogen are not naturally exposed to the immune system of the host but were exposed during the study as a result of the immunological procedure.
Proteins that were not classified as potential vaccine candidates are indicated with 'Classification = NO' in the Comments column in Table S1.
The second main study to be highlighted here is by Che and colleagues [23]. The study involved a comprehensive proteomic analysis of membrane proteins in T. gondii. In brief, three proteomics strategies were used: one-dimensional gel electrophoresis liquid chromatography-tandem mass spectrometry (1D gel LC-MS/MS), biotin labelling in conjunction with 1D gel LC-MS/MS analysis, and a novel strategy that combined three-layer 'Sandwich' Gel Electrophoresis (TLSGE) with multidimensional protein identification technology (MudPIT) [23]. The transmembrane protein clusters identified in the study were deposited in the Einstein Biodefense Proteomics Research Center (http://toro.aecom.yu.edu/cgi-bin/biodefense/main.cgi) and the data provided to ToxoDB (http://ToxodB.org), which is part of EuPathDB. Only proteins identified by all three strategies and having one or more predicted transmembrane segments were included in Table S1. Several proteins from the Che study in Table S1 were not classified as potential vaccine candidates by the classification system (indicated with 'Classification = NO' in the Comments column). These questionable proteins were investigated further by examining each protein's annotation in UniProt, which included links to Gene Ontology and the availability of epitope evidence. For the most part, the function or subcellular locations of these proteins are not annotated as membrane-associated. The annotated function or subcellular location has been included, when applicable, in the Comments column of Table S1.
It seems to be well acknowledged in the literature that the development of vaccines directed against T. gondii or N. caninum should focus on selecting proteins that are capable of eliciting mainly a CMI response involving CD4+ve T cells, Type 1 helper T cells (Th1) and Interferon-gamma (IFN-γ) (this is in addition to the humoral response) [22,24,25,26]. The types of proteins that are likely to induce the required immune response are those that are secreted from specialized organelles (micronemes, rhoptries, and dense granules). These secreted proteins are involved in invasion of, and survival within, host cells. The proteins typically possess a classical N-terminal signal sequence [27] for directing the protein. Following their synthesis in the cytoplasm, proteins that carry a signal peptide can be routed to no fewer than six distinct destinations: (i) plasma membrane; (ii) micronemes; (iii) apicoplast; (iv) rhoptries; (v) dense granules, and subsequently to either the parasitophorous vacuole space or the parasitophorous vacuole membrane; and (vi) inner membrane complex (IMC) [28]. The secretory proteins are likely to have secondary targeting signals responsible for precise delivery to the appropriate destination [29,30] or are delivered by a cargo receptor and chaperone protein [31]. Supposed secretory proteins without obvious signal sequences in the N-terminal are probably inaccurately annotated in UniProt, as the first exon prediction is notoriously difficult [27].
Several proteins in Table S1 that were derived from the Rocchi and Che studies are hypothetical proteins and are possibly unique to T. gondii or Apicomplexans in general. A BLASTP search was performed using sequences of these hypothetical proteins as queries. The nearest characterised homologue found by BLASTP has been included in the Comments column when appropriate. Proteins that are used in the classification system training datasets, such as micronemal proteins (1, 4 and 6), are excluded from the test dataset.
The list of proteins in Table S1 was intended to illustrate a classification method proposed in the paper rather than to focus on any biological significance of particular vaccine candidates. For the purpose of a comprehensive study of N. caninum and T. gondii vaccine candidates, the list is acknowledged to be incomplete because an exhaustive search of the literature was not undertaken. There are some proteins in the list that have no evidence in the literature to indicate they are immunogenic or even likely to induce an immune response. These proteins do, nevertheless, have evidence that they are secreted or membrane-associated and have epitope evidence, hence their inclusion. To reiterate, the crux of the classification system is to distinguish secreted or membrane-associated proteins from all other types of proteins and especially proteins with epitope evidence. The entire premise for the in silico vaccine discovery approach presented in this paper is based on an a priori held hypothesis that a protein that is either external to or located on, or in, the membrane of a pathogen and/or contains peptides that bind to MHC molecules is more likely to be accessible to surveillance by the immune system than a protein within the interior of a pathogen [32].
The experimental evidence for the epitope and MHC binding information in Tables

The following proteins are in no particular order of importance but are grouped into three sections: membrane-associated, secreted, and miscellaneous.

Membrane-associated proteins
Collectively, surface antigens are known as the SRS (SAG1-related sequences) superfamily of proteins. The SRS2 protein is involved in the host cell invasion process [90] and polyclonal and monoclonal antibodies directed against it were shown to inhibit invasion of placental ovine trophoblasts in vitro [91].
Several rodent studies using NcSRS2 as a vaccine against N. caninum tachyzoites have demonstrated improved survival for the host [44], a Th2 immune response with reduced transplacental transmission [45], and humoral and cellular immune responses [46]. In two cattle studies, vaccines incorporating NcSRS2 induced T-lymphocyte activation and IFN-γ secretion [43,47].
Two surface proteins of 29 and 35 kDa (designated Ncp29/NcSAG1 and Ncp35/NcSRS2, respectively) from N. caninum tachyzoites were identified [34]. Localization studies and surface labelling with biotin demonstrated that Ncp29 and Ncp35 are membrane-associated and displayed on the surface of the parasite.
Ncp29 and Ncp35 were characterised as GPI-anchored surface proteins. Sequence comparisons of Ncp29 and Ncp35 with GenBank sequences indicated that they are most similar to the T. gondii surface antigen 1 (SAG1) and surface antigen 1-related sequence 2 (SRS2), respectively. Consequently, Ncp29 has been designated NcSAG1 and Ncp35 has been designated NcSRS2. Both NcSAG1 and NcSRS2 contain a tandem duplicated motif and 12 conserved cysteines, which are also found in all of the SAG and SRS proteins of T. gondii [34].
Recombinant vaccinia viruses expressing the surface protein NcSAG1 (or NcSRS2) were constructed and shown to protect effectively against N. caninum invasion in a mouse model system (the efficacy of NcSRS2 was higher than that of NcSAG1) [36]. In other studies, immunization with r-SAG1 delayed death by 60 hours in mice challenged with T. gondii RH tachyzoites [35]; a combined DNA/recombinant antigen vaccine, based on NcSAG1 and NcSRS2, respectively, exhibited a highly significant protective effect against experimentally induced cerebral neosporosis in mice [33]; and a combined vaccination with NcSRS2 and NcDG1 showed protective effects against experimental infection in gerbils [44].
Using liposomes as adjuvant, a purified membrane antigen from T. gondii (SAG1 p30) was shown to provide protection of mice from a fatal T. gondii infection [37]. In another study, immune splenocytes from mice immunized with p30 appeared to lyse peritoneal macrophages infected with T. gondii [40].

SRS domain-containing proteins (XP_002369822) are present in large numbers on the parasite surface and facilitate the invasion of multiple host and cell types [94]. They are considered to be extremely immunogenic in Toxoplasma [22].

Apical membrane antigen 1 (AMA1) is a conserved transmembrane adhesin of apicomplexan parasites and is an essential component of the moving junction complex involved in host-cell invasion. T. gondii AMA1 is secreted onto the parasite surface and subsequently released by proteolytic cleavage within its transmembrane domain [85]. The Plasmodium apical membrane antigen 1 has been shown to elicit a protective immune response against merozoites dependent on the correct pairing of its numerous disulfide bonds [84]. In a study using preincubation of free tachyzoites with anti-rNcAMA1 (an N. caninum AMA1 recombinant), IgG antibodies inhibited the invasion into host cells by N. caninum and T. gondii [86]. This finding indicates a potential common vaccine candidate to control two parasites.

Biopterin transport (BT) transmembrane protein: Massimine and colleagues report that the presence of putative folate transporter genes in the Toxoplasma genome, which are homologous to the BT1 family of proteins, suggests that Toxoplasma may encode proteins involved in folate transport. Folates are key elements in eukaryotic biosynthetic processes. In a study [95], BT1 in the species Leishmania donovani was inactivated by gene disruption mediated by homologous recombination. The L. donovani BT1 null mutant (i.e. an attenuated organism) showed less capacity to induce infection in mice than wild-type parasites and could elicit protective immunity against an L. donovani challenge in mice susceptible to infection [95]. The folate transport mechanism therefore represents a novel target in a vaccination strategy or the development of new drugs [73].

Calcium-transporting ATPase: Calcium controls a number of vital processes in apicomplexans including protein secretion, motility, and differentiation [76]. ATPases are membrane-bound transporters that couple ion movement through a membrane with the synthesis or hydrolysis of a nucleotide, usually ATP. A study showed evidence of a T. gondii plasma membrane-type Ca2+ ATPase and suggested that parasite calcium pathways may be exploited as new therapeutic targets for intervention [76]. The process of invasion involves two Ca2+-dependent events: protrusion of the conoid and the induced secretion of adhesive complexes from the micronemes [28]. Immunolocalization and challenge studies using a recombinant Vibrio cholerae ghost expressing Trypanosoma brucei Ca2+ ATPase (TBCA2) antigen demonstrated immune responses in mice [96]. However, the immunization failed to protect the mice against a T. brucei challenge, despite the induction of antigen-specific antibodies, Th1 cytokines, interleukin-2, and IFN-γ.

Glucose transporter protein: T. gondii uses host sugars for energy and to generate glycoconjugates that are important to its survival and virulence. A glucose transporter protein facilitates the transport of mannose, galactose, fructose, glucose, and hexose at its plasma membrane. A study [74] demonstrates that glucose is nonessential for T. gondii tachyzoites. However, a study has validated a hexose transporter of Plasmodium falciparum as a novel drug target [97]. There is no literature implicating transporter proteins as vaccine candidates. Segments of transporter proteins are nevertheless exposed to the immune system.

Secreted proteins
GRA2/GRA3/GRA4/GRA7 are dense granule proteins involved in the cellular invasion process. Dense granules are secretory vesicles that play a major role in the structural modifications of the parasitophorous vacuole (PV) in which the parasite develops [53].
Escherichia coli-expressed NcGRA2 demonstrated immunogenicity in an immunization/challenge mouse model of transplacental transmission, but gave only a partial reduction in foetal infection and pup mortality [98].
Similarly in another study, vaccination of mice with recombinant NcGRA2 expressed in a Brucella abortus strain induced only partial protection against transplacental transmission with a mortality of 10-50% [54].
Immunization of mice with plasmid DNA expressing NcGRA7 conferred partial protection against congenital neosporosis [51]. Also, both humoral and cellular immune responses against T. gondii were detected in sheep immunized with DNA plasmids encoding T. gondii GRA7 formulated with an adjuvant [50].
Studies using antibodies to immunolocalize the T. gondii dense granule protein GRA3 have shown that this protein associates strongly with the parasitophorous vacuole membrane (PVM); GRA3 has an N-terminal secretory signal sequence and a transmembrane domain, consistent with its insertion into the PVM. A homologue was identified in N. caninum (UniProt ID Q6YDA6). GRA3 possesses a dilysine 'KKXX' endoplasmic reticulum (ER) retrieval motif that interacts with the PVM and the calcium modulating ligand of host cell ER in the parasitism of T. gondii [48,49]. There is no evidence in the literature that GRA3 induces an immune response. However, the findings on GRA3 are consistent with the indication from the five prediction programs that GRA3, and most other dense granule proteins described here, are both membrane-associated and secreted.
GRA2 and GRA4 are not predicted to be membrane-associated.
NcMIC11/Nc-Mic3/MIC3 are from micronemes, which are secretory organelles, and are discharged by exocytosis during the attachment to the host cell surface to facilitate cell invasion [99]. Many microneme proteins also contain well-conserved functional domains associated with mainly adhesive activity (e.g. EGF-like and PAN_1 domains) and some protease activity (e.g. Peptidase_S8 and Rhomboid) [27].
MIC3 is expressed in all three infectious stages of T. gondii (tachyzoites, bradyzoites, and sporozoites). A DNA vaccine encoding the MIC3 protein has been demonstrated to elicit a strong specific immune response providing significant protection against T. gondii infection [68].
ROP1/ROP2/ROP4/ROP18 are secreted proteins from rhoptries (specialized secretory organelles in the apical complex) and are involved in a variety of cellular functions related to host cell invasion, formation of the parasitophorous vacuole, and parasite-host cell interplay [100].
The protein combinations of rROP2 + rROP4 + rGRA4 and rROP2 + rROP4 + rSAG1 were shown to be very effective in the development of a high level of protection irrespective of the genetic backgrounds and innate resistance to toxoplasmosis of the laboratory mice [38].
A DNA vaccine encoding the ROP1 antigen of T. gondii and ovine CD154 was demonstrated to stimulate humoral and cellular immune responses in sheep. The intramuscular injection of pROP1 alone induced a Th1-specific immune response [55].
Vaccination with recombinant NcROP2 induces a protective Th-1-biased or Th-2-biased immune response against experimental N. caninum infection in mice (depending also on the adjuvant used) [100]; fusion proteins ROP2-SAG1 exhibit immunogenicity as a recombinant protein vaccine, as a DNA vaccine, or as DNA boosted with protein in an immunization procedure [41]; and NcMIC1, NcMIC3, and NcROP2, applied either as single vaccines or as vaccine combinations, lead to significant protection against vertical transmission of N. caninum in mice [57].
The polymorphic rhoptry protein kinase ROP18 was recently shown to determine the difference in virulence between the T. gondii types I, II and III strains (which are prevalent in North America and Europe) by phosphorylating and inactivating the IFN-γ-induced immunity-related Guanosine Triphosphatases (IRGs) that promote killing by disrupting the parasitophorous vacuole membrane (PVM) in murine cells [62].
Cyclophilins (Peptidyl-prolyl cis-trans isomerase) are ubiquitous cytosolic proteins. A study has demonstrated cyclophilin (NcCyP) is present in N. caninum tachyzoites and is a major component responsible for the induction of IFN-γ production [26]. The production of IFN-γ in response to intracellular microbial exposure is critical to the development of a host protective immunity to control the acute phase of neosporosis.

Miscellaneous
The following proteins were included in the test dataset because the proteomics Che study identified them as transmembrane proteins. There is no evidence in the literature that these proteins induce an immune response and, from their annotated descriptions, are unlikely vaccine candidates i.e. they are not associated with the plasma membrane. The proteins remain in the test dataset essentially because proteins of these types are expected to be classified as vaccine candidates in a deployment of the classification system i.e. the prediction programs predict that they are membrane-associated given their protein sequences. Whether these proteins, or in fact any classified candidate, prove to be false positives, can only be determined in the laboratory.
Sortilin-like receptor is a transmembrane cargo receptor that functions in transport to the endolysosome system in yeast and mammals [31]. T. gondii sortilin-like receptor is required for the subcellular localization and formation of apical secretory organelles. It is a transmembrane protein that resides within Golgi-endosomal related compartments. The lumenal domain specifically interacts with rhoptry and microneme proteins, while the cytoplasmic tail recruits cytosolic sorting machinery involved in anterograde and retrograde protein transport [83].

Gliding-associated proteins (GAPs) are components of the glideosome. The glideosome is a unique attribute of the Apicomplexa phylum and is an actin- and myosin-based machine [77]. This macromolecular machine provides the gliding motility for parasite migration across biological barriers and for host-cell invasion and egress [28]. The glideosome is assumed not to be exposed to the immune system as it is located between the plasma membrane and inner membrane complex (IMC). GAP45 is anchored to the plasma membrane and IMC via its N- and C-terminal extremities, respectively.

Acetyl-CoA carboxylase (ACC) is an enzyme involved in fatty acid synthesis. This enzyme is synthesized in the cytosol and transported into the apicoplast [81]. Aryloxyphenoxypropionates, inhibitors of the plastid ACC of grasses, also inhibit T. gondii ACC [82].
Thioredoxin protein: The apicoplast in T. gondii is an essential chloroplast-related organelle, bounded by multiple membranes, to which proteins are trafficked via the secretory system. The thioredoxin protein in T. gondii is apicoplast-associated, is predominantly soluble or peripherally associated with membranes, and localizes primarily to the outer compartments of the apicoplast [87]. A role for the apicoplast in vaccine strategies is under investigation. Genetically attenuated malaria parasites (with deleted genes that encode for apicoplast fatty acid biosynthesis) have been trialled and provide sterile immunity in mice for 210 days [102]. Apicoplast fatty acid biosynthesis is essential for organelle biogenesis and parasite survival in T. gondii hosted by mice [103].
Lectin-domain protein: T. gondii has a broad host cell specificity, suggesting that adhesion should involve the recognition of ubiquitous surface-exposed host molecules or, alternatively, the presence of various parasite attachment molecules able to recognize different host cell receptors [71]. In a study [71], a sugar-binding activity (lectin) in tachyzoites of T. gondii was discovered that plays a role in vitro in erythrocyte agglutination and infection of human fibroblasts and epithelial cells. The results of the study suggest that the attachment of T. gondii to its target cell is mediated by parasite lectins and that sulfated sugars on the surface of host cells may function as a key parasite receptor.

Printout of evidence profiles used in benchmark dataset
List 1: List of evidence profiles for the test set proteins from Table S1.