EpiJen: a server for multistep T cell epitope prediction

Background The main processing pathway for MHC class I ligands involves degradation of proteins by the proteasome, followed by transport of products by the transporter associated with antigen processing (TAP) to the endoplasmic reticulum (ER), where peptides are bound by MHC class I molecules, and then presented on the cell surface by MHCs. The whole process is modeled here using an integrated approach, which we call EpiJen. EpiJen is based on quantitative matrices, derived by the additive method, and applied successively to select epitopes. EpiJen is available free online. Results To identify epitopes, a source protein is passed through four steps: proteasome cleavage, TAP transport, MHC binding and epitope selection. At each stage, different proportions of non-epitopes are eliminated. The final set of peptides represents no more than 5% of the whole protein sequence and will contain 85% of the true epitopes, as indicated by external validation. Compared to other integrated methods (NetCTL, WAPP and SMM), EpiJen performs best, predicting 61 of the 99 HIV epitopes used in this study. Conclusion EpiJen is a reliable multi-step algorithm for T cell epitope prediction, which belongs to the next generation of in silico T cell epitope identification methods. These methods aim to reduce subsequent experimental work by improving the success rate of epitope prediction.


Background
The accurate identification of T-cell epitopes remains a critical step in the development of subunit and peptidebased vaccines [1]. The first step of such studies is usually in silico prediction of potential MHC binders from the sequence of a studied protein, followed by labor-, timeand resource-consuming experiments which aim to verify the natural processing, presentation and T-cell recognition of the predicted peptides. As the veracity of initial in silico predictions improves, so subsequent "wet lab" work becomes faster, more efficient, and, ultimately, more successful.
The main processing pathway for Major Histocompatibility Complex (MHC) class I ligands involves degradation of proteins by the proteasome, followed by transport of the products by the transporter associated with antigen processing (TAP) to the endoplasmic reticulum (ER), where peptides are bound to MHC class I molecules, and then presented on the cell surface by MHCs. The proteasome is responsible for generating the C terminus but not the N terminus of the final presented peptide [2][3][4][5]. The proteasome is a multimeric proteinase with three active sites: a site with trypsin-like activity (cleavage after basic residues), one with chymotrypsin-like activity (cleavage after hydrophobic residues), and another with peptidylglutamyl-peptide hydrolytic activity (cleavage after acidic residues) [6][7][8]. In addition, in vertebrates there are three γ-interferon-inducible subunits that replace the constitutive subunits [9] and assemble the immunoproteasome. The immunoproteasomes have an altered hierarchy of proteosomal cleavage, enhancing cleavage after basic and hydrophobic residues and inhibiting cleavage after acidic residues [10,11]. This is in accord with the amino acid preferences for binding to MHC class I molecules at the C terminus [12].
TAP is an ATP-dependent peptide transport protein that belongs to the ATP-binding cassette (ABC) family of transporters. This family transports across membranes a wide range of molecules, from small sugars to large polypeptides [13]. There are two TAP proteins (TAP-1 and TAP-2) which form a transmembrane (TM) heterodimer. Both proteins encode one hydrophobic TM domain and one ATP-binding domain [14]. Extant experimental studies have shown that TAP prefers peptides of eight or more amino acids with hydrophobic or basic residues at the carboxy terminus [15,16]. TAP-mediated antigen presentation is important not only for cytosolic antigens but also for most epitopes within membrane/secretory proteins [17]. The TAP-dependent pathway is the principal processing route for peptides binding HLA-A1, HLA-A3, HLA-A11, HLA-A24, HLA-B15 and HLA-B27 [18][19][20]. Some peptides are able to access the ER via other, TAPindependent mechanisms. Examples of alleles exhibiting only partial dependence on TAP include HLA-A2, HLA-A23, HLA-B7 and HLA-B8 [21][22][23][24].
Proteins of the MHC are both polygenic (i.e. there are more than one MHC class I and MHC class II genes) and polymorphic (i.e. there are many alleles of each gene) [25]. Each class of MHC has several loci: HLA-A, HLA-B and HLA-C for class I and HLA-DR, HLA-DQ and HLA-DP for class II. MHC alleles may differ by as many as 30 amino acid substitutions, most of them are found within the binding site and are critical for the specificity of peptide binding and therefore for T cell recognition [26][27][28]. Such an uncommon degree of polymorphism implies a selective pressure to create and maintain it. Different polymorphic MHC alleles have different peptide specificities: each allele binds peptides exhibiting particular sequence patterns.
The next generation of T cell epitope identification methods will focus on integrated multi-step approaches, which subsume proteasome cleavage, TAP transport and MHC binding. The advantages of such integrated methods are higher accuracy and a lower rate of false positive predictions, although they may generate more false negative predictions due to the use of incomplete training sets or high thresholds for individual steps. Although some previous attempts have been made to combine predictive methods [57,60,62,63], true integrated methods have only recently emerged, examples include: SMM [64], NetCTL [65] and WAPP [66]. SMM stands for Stabilized Matrix Method and is a T cell epitope predictive tool based on QMs for binding to MHC class I molecules, peptide transport by TAP and proteasomal (or immunoproteasomal) cleavage of protein sequences [64]. NetCTL implements, in a combined manner, ANN-based proteasome cleavage prediction, a TAP binding QM and ANN-based MHC class I binding prediction [65]. WAPP applies proteasomal cleavage matrices, SVM-based TAP and MHC predictions as a series of successive filters [66]. All three methods emphasize the greatly reduced number of peptides which need to be tested in order to identify true epitopes; they show good accuracy for sets they have been tested with.
In the present study, we develop a multi-step algorithm for T cell epitope prediction, which we call EpiJen. The method is applied to a set of overlapping peptides generated from a whole protein sequence and acts as a series of filters which successfully reduce the number of potential epitopes ( Figure 1). The final set of peptides needed to be tested for epitopes rarely includes more than 5% of the whole sequence. QMs for each step were developed using the additive method [42]. Since its appearance in 2001, this method has been applied to more than 2500 peptides binding to several human and murine MHC class I and class II proteins [67] and has been shown to be both reliable and highly predictive, allowing us to design superbinders [68]. The method was also used to generate QMs for TAP binding [69] and proteasome cleavage [70]. Here, we combine all additive method human models and make them publicly available via the EpiJen server for T cell epitope prediction [71]. The performance of EpiJen is tested using external sets of known T cell epitopes and is compared with the predictions made by SMM, NetCTL and WAPP methods.

EpiJen step one: proteasome cleavage
The dataflow in EpiJen is shown in Figure 1. Initially, the protein is chopped into overlapping decamers and processed by a proteasome cleavage QM. A previously derived and tested p1p1' model, as described in the Methods section below [70], is used. The model takes into account only the contributions of the residues next to the cleavage site: C-terminus and the next aa. Two thresholds, 0.0 and 0.1, can be used here. Threshold 0.0 is recommended for alleles which prefer Phe or Trp at the C-terminus: HLA-A*24, HLA-B*07, HLA-B*27, HLA-B*35, HLA-B*51 and HLA-B*53. The epitopes for other alleles are predicted accurately at a threshold of 0.1. This initial step has a powerful filtering ability: between one half and two thirds of the true negatives were eliminated by this step. The "cleaved" peptides, present as nonamers, are then passed to the next filter: the TAP binding QM. Figure 1 Flowchart of EpiJen.

EpiJen step two: TAP transport
The TAP binding QM also has been derived and tested previously [69]. A threshold of 5.00 is recommended for both fully and partially TAP-dependent alleles. Pro and Asp at anchor position 2 has a strong negative effect on TAP binding [69]. For that reason, a threshold of 3.0 is recommended for epitopes binding to HLA-B*07, HLA-B*35, HLA-B*40, HLA-B*44, HLA-B*51 and HLA-B*53. The filtering ability of the TAP step is low. Up to 10% of the true negatives are eliminated here. The "transported" peptides move to the next filter: MHC binding.

EpiJen step three: MHC binding
EpiJen includes 18 QMs which can be used to predict binding to different HLA-A and B alleles. Certain QMs were developed for single alleles and others developed for allele families. QMs developed for whole supertypes were poorly predictive, especially for HLA-B supertypes. Some MHC models were derived previously [42,52,72], while others were developed for this study. The statistics of newly derived models are given in Tables 1 and 2. Quantitative data (continuous values like IC 50 s) were available for certain alleles, for the rest only sequences of binders were known (discontinuous values). As is described in the Methods section below, binding models based on continuous values were derived by multiple linear regression (MLR) ( Table 1) and those based on discontinuous values by discriminant analysis (DA) ( Table 2). "Leave-one-out" cross-validation tests indicate a higher predictive rate for the DA models (AUC ROC > 0.9; accuracy > 80%) than MLR models (q 2 ≈ 0.5). The filtering ability of this step is significant: approximately 25-30% of the true negatives are eliminated here. The thresholds for this step are 0.5 for the DA models and 5.3 for MLR models. These thresholds can not be altered by the user. They seek to reduce the number of false positives in long protein sequences.

EpiJen step four: epitope selection
All peptides which are presented by MHCs on the cell surface after being cleaved by the proteasome and transported by TAP could potentially be T cell epitopes. However, only a small number of all possible epitopes are actually immunogenic. To reduce the number of false positives we tested different thresholds, which we defined as percentages of available peptides sourced by one protein.
The top 5% threshold performed best, giving 85% sensitivity; we recommend it and use it as a default value for this step. Optional are thresholds 2, 3 and 4%.

External validation
A set of 160 epitopes and their source proteins were collected from AntiJen [73] ) four thresholds were tested: top 2% to 5%. As the number of nonepitopes generated from each protein was significantly higher than the number of epitopes, only two parameters -sensitivity ((true positives/(true positives + false negatives)) and positive predictive value (PPV) ((true positives/ (true positives + false positives)) -were used for comparison. Parameters accuracy ((true positives + true nega- tives)/total) and specificity ((true negatives/(true negatives + false positives)) could be misleading. If 98% of the peptides in one source protein are non-epitopes, a model that simply predicts everything as non-epitope will not be very useful, yet it will nonetheless have an overall accuracy of 98% and a specificity of 100%.
The true positives were 141 (5% cutoff), 132 (4% cutoff), 123 (3% cutoff) and 114 (2% cutoff). False negatives were 25, 34, 43 and 52, while the false positives decreased from 2743 to 2173, 1618 and 1060, respectively. The parameter sensitivity varies from 69% (at 2% cutoff) to 85% (at 5% cutoff) (Figure 2). The parameter PPV diminishes from 10% (at 2% cutoff) to 5% (at 5% cutoff). Thus, our tests indicate that a 5% threshold at the final epitope selection step is sufficient to generate an 85% epitope prediction. This means that by using EpiJen, one need only test 5% of the whole sequence in order to predict 85% of available epitopes. Detailed results from the external validation are given in Additional File 1, which is provided as Additional Material.

Comparison with SMM, NetCTL and WAPP
The ability of EpiJen to predict T cell epitopes was tested on a set of known T cell epitopes, which belonged to 12 HIV proteins, and the resulting predictions were compared with those made by SMM, NetCTL and WAPP. The comparisons were made in conditions close to those used by experimental immunologist: the complete sequence of a protein of interest is submitted to an available web server and the results recorded. NetCTL and WAPP predictions were made using default thresholds. The top 5% best predicted peptides were selected as a threshold for SMM and EpiJen. As WAPP only predicts peptides binding only to HLA-A*01, HLA-A*02, HLA-A*03 and HLA-B*27, but SMM does not predict HLA-B*27 binding, the epitopes used in the test set were restricted to the set of alleles common to all four programs: HLA-A*01, HLA-A*02 and HLA-A*03. As described in Methods, a set of 99 epitopes was compiled: 4 peptides binding to HLA-A*01, 66 to HLA-A*02 and 29 to HLA-A*03. Three peptides were promiscuous, binding to both HLA-A*02 and HLA-A*03. The predictions were compared in terms of sensitivity and PPV.
EpiJen recognized 61 out of 99 epitopes (62%sensitivity), followed by SMM with 57 (58%sensitivity), NetCTL with 49 (50%sensitivity), and WAPP with 33 (33%sensitivity) ( Table 3). The PPVs were low for all of the four methods: 21% for NetCTL, 17% for both EpiJen and WAPP, and 16% for SMM. Detailed results from this comparative study are given in Additional File 2, which is provided as Additional Material.

Discussion
EpiJen is a server for multistep T cell epitope prediction. The principal steps in the antigen processing pathway are modeled by a set of different QMs. The models are applied successively, eliminating a proportion of negatives at each stage. Proteasome cleavage (step 1) and TAP transport (step 2) models are applied to all alleles. MHC binding EpiJen combines several widely used methods in drug design [74,75], which have generally proven reliable and predictive. Moreover, the external validation of EpiJen and the comparison with other integrated methods showed that EpiJen performed best in terms of sensitivity.
The number of peptides that have to be tested in order to reach 85% reliability is 5% of a protein sequence. According to Larsen et al. [65], NetCTL, SYFPEITHI and BIMAS achieve 85% reliability within the top 7%, 10% and 9%, respectively. The moderate performance of all integrated methods when used to predict the HIV epitope set may be due to the fact that all peptides available in the Los Alamos HIV Database were included, regardless of whether a peptide has been naturally processed or not. EpiJen excluded most of the false negative HIV epitopes after the proteasome cleavage step and only a few of them were predicted as nonbinders. It is well known that "all models are wrong, yet some of them might be useful". The modeling process follows the accumulation of knowledge about a particular mechanism. As knowledge improves, so models improve. Antigen processing is a very complicated cascade of cellular events. It is clear that, for example, cleavage by the proteasome is only one event in antigen presentation: there are many more, and many of these are proteolytic. Analyses of peptide generation and T-cell epitopes expression in proteasome-inhibited cells suggest that cytoplasmic proteases other than proteasomes may also be involved in antigen processing pathway [76][77][78]. Tripeptidylpeptidase II (TPPII) was suggested to supply peptides because of its ability to cleave peptides in vitro and its upregulation in cells surviving partial proteasome inhibition [79]. Leucine aminopeptidase was found to generate antigenic peptides from N-terminally extended precursors [80]. Puromycin sensitive animopeptidase and bleomycin hydrolase were shown to trim N termini of synthetic peptides [81]. An enzyme located in the lumen in ER and called ERAAP (ER aminopeptidase associated with antigen processing) [82] or ERAP1 (ER aminopeptidase 1) [83,84], has been shown to be responsible for the final trimming of the N termini of peptides presented by MHC class I molecules.

External validation of EpiJen
Recently, it was shown that within the proteasome, pep-tides could be formed from noncontiguous parts of the source protein [85,86]. The mechanism of this splicing is not fully understood. Currently there is insufficient quantitative data about the role of the above mentioned events to allow a precise bioinformatic evaluation of their impact on the antigen processing pathway. Overall, it is clear that, ultimately, many more pathways, involving many more stages, will need to incorporated into predictive methods in order to model it accurately; given current data, however, EpiJen represents the most accurate and parsimonious approach to antigen prediction.
Compared to other methods, EpiJen offers many potential advantages. First, a large quantity of experimental data (more than 2500 peptides) has been used to develop the models. Second, the additive method combines two well known, widely used methods in drug design [42], which have generally proven to be both reliable and predictive: the Free-Wilson method [74] and partial least squares (PLS) [75]. Finally, and most importantly, EpiJen uses its models as successive filters: negatives are eliminated at each step rather than their score being summed in order to exceed a global threshold. This is in contrast to alternative methods [64,65]. The combined score, as used by SMM and NetCTL, obscures the final result, because a low  TP  3  6  2  15  5  4  0  3  2  0  14  3  57  FN  2  5  1  5  0  2  1  2  2  1  11  10  42  FP  15  18  8  41  23  14  5  17  3  4  115  27  290  Sensitivity %  60  55  67  75  100  67  0  60  50  0  56  (or even negative) TAP and/or proteasome score could be compensated for by a high MHC score. The cellular antigen processing pathway, as modeled in EpiJen, works in a hierarchical or successive manner not in parallel. Peptides that have been eliminated at any of the steps do not continue to the next step. EpiJen is thus based on a more mechanistically meaningful model of antigen presentation than other available methods. EpiJen is both a more adaptable and a more flexible approach, which should prove a significant advantage as combination methods, such as this, evolve.

Conclusion
EpiJen belongs to a new generation of integrated methods for T cell epitope prediction. It is an open system: new models will be included in the future, while old ones will be improved. Integrated methods aim to rationalize the process of epitope searching and accelerate epitope-based vaccine design. They possess significant potential for improving the predictive ability of in silico epitope identification by adding more features and new high quality experimental data.

Peptide datasets
More than 2500 peptides (nonamers) were used to develop models involved in EpiJen. Models for proteasome cleavage [70], TAP binding [69], HLA-A2 [71], HLA-A3 [52] and HLA-B35 [87] binding were derived previously. The majority of peptides were extracted from Anti-Jen [73], although the SYFPEITHI database [38] was also used. As has been described elsewhere [70], the training set for the proteasome cleavage model consisted of 489 naturally processed T-cell epitopes associated with HLA-A and HLA-B molecules and a test set of 231 peptides were used for external validation. All T-cell epitopes common to the two sets were first excluded from the training set. The training set for the TAP binding model included 163 poly-Alanine nonameric peptides and two test sets containing 85 peptides were used for external validation [69]. 1371 peptides were used for the development of the MHC binding models; the models were tested by "leave-oneout" cross-validation (LOO-CV).

Additive method
The additive method [42] can be used to develop a QM for any particular peptide-protein interaction. Nonameric peptides are presented as a binary string of length 180 (9 positions × 20 amino acids). A term is equal to 1 when a particular amino acid at a particular position is present and 0 when it is absent. The dependent variable could take either continuous values, like logIC 50 , or discontinuous ones, such as binder or non-binder. When the dependent variable is continuous, multiple linear regression is used to derive the QM. In the case of a discontinu-ous dependent variable, models are derived by discriminant analysis. In both cases, the matrix is solved using partial least squares (PLS).

Partial least squares (PLS)
PLS is a so called projection method [75], which can handle matrices with more variables than observations and with noisy and highly collinear data. In such cases, conventional statistical methods, such as multiple linear regression, produce over-fitted models, i.e. models that fit the training data well but are unreliable in prediction. PLS forms new variables, or principal components (PC), as linear combinations of the initial variables and then uses them to predict the dependent variable. The PLS method used in this study was implemented in SYBYL 6.9 [88]. The type of the dependent variable (continuous or discontinuous) determined the statistical method: linear regression or discriminant analysis.

Multiple linear regression by partial least squares (MLR-PLS)
MLR-PLS requires at least 30 peptides with experimentally measured affinities (IC 50 ) to generate good models. In this study MLR was applied only to models for binding to HLA-B7, HLA-B51 and HLA-B53 alleles. IC 50 values were collected from AntiJen [73]. The optimal number of PC used to derive the model was defined as the number which lead to the highest cross-validated q 2 and/or the lowest standard error of prediction (SEP). The q 2 values were derived after "leave-one-out" cross-validation (LOO-CV). The non-cross-validated models were assessed by the explained variance r 2 , standard error of estimate (SEE), and F-ratio. The non-cross-validated models were used at EpiJen step 3.

Discriminant analysis by partial least squares (DA-PLS)
Not enough quantitative data was found for peptide binding to alleles HLA-A24, HLA-B27, HLA-B40 and HLA-B44. Instead, sets of MHC binders for each allele were collected from AntiJen and SYFPEITHI. Each source protein was passed through the first two steps of EpiJen (proteasome cleavage and TAP binding) and only properly "cleaved" and "TAP-transported" peptides were selected. Among them the binders form a small set; the rest were considered as non-binders. The number of non-binders was significantly higher than that of binders (Table 2). In order to develop a reasonable model by DA, equivalent numbers of binders and non-binders are required. Otherwise, a model will represent an over-or under-estimation of amino acids contributions. The number of non-binders was reduced by the method of hierarchical clustering, as described below. This resulted in almost equivalent numbers of binders and non-binders for each allele family. The QMs were derived by PLS. The optimum number of PCs used do derive the model was defined as the number lead-ing to the lowest standard error of prediction (SEP) after "leave-one-out" cross-validation (LOO-CV). The prediction rate of the models was measured by LOO-CV using Receiver Operating Characteristic (ROC) curves [89]. Two variables sensitivity (true positives/total positives) and 1specificity (false positives/total negatives) were calculated at different thresholds. The area under the curve (AUC ROC ) is a quantitative measure of predictive ability and varies from 0.5 for a random prediction to 1.0 for a perfect prediction. Prediction a ccuracy ((true positives + true negatives)/total) at a threshold 0.5, was also calculated. The non-cross-validated models were used at EpiJen step 3.

Hierarchical clustering (HC)
Clustering is the process of dividing a set of entities into several subsets. Members of each subset are similar to each other but different from members of other subsets [90].
There have been numerous cluster methods described [91]. In the present study, HC, using the agglomerative algorithm [90], was applied. According to this algorithm, clusters are built from the bottom up, first by merging individual items into clusters, and then by merging clusters into superclusters, until the final merge brings all items into a single cluster. This method was applied as implemented in Sybyl 6.9 [88]. The distance between the clusters was calculated using the complete-linkage method, i.e. the distance between the most distant pair of data points in both clusters is taken into account.
In this study, HC reduced the number of non-binders for the DA-PLS models, i.e. models for peptide binding to HLA-A24, HLA-B27, HLA-B40 and HLA-B44. The last second or third level of the dendrogram, which contained a number of clusters close to the number of binders, was used. One peptide from each cluster was chosen at random to act as the exemplar for that cluster.
To reduce the number of non-epitopes, only proteins consisting of less than 1000 amino acids were considered in the study. As the number of non-epitopes generated from one protein was significantly higher than the number of epitopes, only two parameters -sensitivity and positive predictive value ((true positives/(true positives + false positives)) -were used for the assessment of program performance.
For the comparison between the integrated methods, a set of known HIV epitopes was collected from the CTL/CD8+ T cell epitope maps published in the HIV Molecular Immunology Database (last updates June 8, 2005) of the Los Alamos National Laboratory [92]. Epitopes are mapped to the HXB2 consensus protein sequence. Only nonamer epitopes, restricted by HLA-A*01, HLA-A*02 and HLA-A*03 at the serotype level, were included in the test set. Epitopes which spanned a protein boundary were not considered. The final set consisted of 99 epitopes from 12 source proteins. Proteins p2p7p1p6 and gag/pol TF did not have epitopes of interest and were not used. Four epitopes were restricted by HLA-A*01, 66 -by HLA-A*02 and 29 -by HLA-A*03. Three of the epitopes were promiscuous. Parameters sensitivity and positive predictive value were used to assess the results.

EpiJen Server
The EpiJen server is implemented in Perl, with an interface written in HTML. EpiJen identifies epitopes from nucleic acid and protein sequences, using the four-stage model described above, with options to vary the relevant requisite parameters, such as selected MHC allele and individual model thresholds. Prediction from nucleic acid allows for 3 or 6 frame translation, as recently it has become known that many antigenic peptides emerge from cryptic translational reading frames [93,94], as well as through post-translational splicing [85,86], amongst other noncanonical expression mechanisms.