Epitope identification is an essential step toward synthetic vaccine development since epitopes play an important role in activating the immune response. Classical experimental approaches are laborious and time-consuming, and therefore computational methods for generating epitope candidates have been actively studied. Most of these methods, however, are based on sophisticated nonlinear techniques for achieving higher predictive performance. The use of these techniques tends to diminish their interpretability with respect to binding potential: that is, they do not provide much insight into binding mechanisms.
We have developed a novel epitope prediction method named EpicCapo and its variants, EpicCapo+ and EpicCapo+REF. Nonapeptides were encoded numerically using a novel peptide-encoding scheme for machine learning algorithms that utilizes 40 amino acid pairwise contact potentials (referred to as AAPPs throughout this paper). EpicCapo+ and EpicCapo+REF outperformed other state-of-the-art methods without losing interpretability. Interestingly, the most informative AAPPs estimated by our study were those developed by Micheletti and Simons, whereas previous studies utilized two AAPPs developed by Miyazawa & Jernigan and Betancourt & Thirumalai. In addition, we found that all amino acid positions in nonapeptides, including non-anchor positions, could affect the performance of the predictive models. Finally, EpicCapo+REF was applied to identify candidates of promiscuous epitopes. As a result, 67.1% of the predicted nonapeptide epitopes were consistent with preceding studies based on immunological experiments.
Cytotoxic T lymphocytes (CTLs) play an important role in the vertebrate immune system. CTLs recognize pathogens via peptide presentation on major histocompatibility complex molecules (MHCs). If the source of peptides is an infectious virus, the CTL response could be stimulated, thus leading to the elimination of virus-infected cells. MHC-bound peptides are called epitopes, and they are usually composed of 8–20 amino acids. Epitope identification is an essential step toward synthetic vaccine development, since epitopes play an important role in the activation of the immune response. Epitopes are traditionally identified by synthesizing a large number of nonapeptides and subsequently performing affinity assays. Peptides with high affinity for MHC proteins are considered potential epitopes. However, developing a new vaccine with traditional methods is time-consuming and laborious. To avoid such bottlenecks, computational methods can be applied to search for candidate peptides and identify promising new epitopes.
Due to the importance of vaccines for humans, we focus on human MHCs, which are referred to as human leukocyte antigens (HLAs). There are three classes of HLAs: I, II, and III. Epitopes presented on HLA class I molecules are recognized by CTLs. HLA class I proteins can be categorized into three types according to their genes: HLA-A, HLA-B, and HLA-C. A majority of previous studies have focused on the HLA-A*02:01 allele because it is the most frequent allele of the A2 supertype in the Northeast Asian and Caucasian populations. Typically, the HLA-A*02:01 epitope consists of 8–10 amino acids, and many studies have focused on nonapeptides in particular: that is, epitopes that are 9 residues long. Figure 1A shows the nonapeptide epitope LLFGYPVYV fitted inside the HLA-A*02:01 binding cleft, which consists of two α-helices and one β-sheet (from PDB entry 1DUZ). Figure 1B shows the conformation of the nonapeptide epitope LLFGYPVYV.
Early epitope binding prediction algorithms were based on allele-specific motifs [8, 9]. For example, for the HLA-A*02:01 allele, positions 2 and 9 of nonapeptides were the most important ones for binding. The residues at both positions were defined as classical anchor residues, typically occupied by leucine, valine, and isoleucine, since the MHC molecule forms hydrophobic sites for amino acids at these two positions. Additionally, the residues at positions 1, 3, and 7 were identified as secondary anchor residues. Positions 1 and 3 mainly preferred tyrosine and phenylalanine [11, 12]. Position 7 was suggested to be an amphipathic site suitable for amino acids with small hydrophobic side-chains such as valine and alanine. In this manner, unknown peptides that matched such allele-specific motifs were determined to be epitopes.
As more data became available, statistical methods could be applied to calculating a positional scoring matrix. In the matrix, an element was defined individually for each position and specific amino acid, resulting in an L × 20 coefficient matrix where L is the length of the peptide. In general, the matrix is used under the assumption that each amino acid in a peptide sequence independently contributes to a certain binding energy according to an element included in the positional scoring matrix. Overall binding energy is estimated from the summation of binding energies from all positions. There are several methods based on such a positional scoring matrix: for example, BIMAS, Gibbs sampler, and SMMPMBEC.
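The additive assumption behind a positional scoring matrix can be illustrated with a short Python sketch. It is only a toy illustration: the function name `score_peptide`, the three-position matrix, and its values are hypothetical and are not taken from BIMAS, the Gibbs sampler, or SMMPMBEC.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def score_peptide(peptide: str, matrix: List[Dict[str, float]]) -> float:
    """Sum the independent, position-specific contribution of each residue.

    `matrix` has one row per peptide position (an L x 20 table); each row
    maps an amino acid letter to its coefficient at that position.
    """
    if len(peptide) != len(matrix):
        raise ValueError("peptide length must match the number of matrix rows")
    return sum(row[aa] for row, aa in zip(matrix, peptide))


# Hypothetical 3 x 20 matrix: all coefficients are 0.0 except two preferences.
toy_matrix = [{aa: 0.0 for aa in AMINO_ACIDS} for _ in range(3)]
toy_matrix[1]["L"] = -1.5  # invented value: leucine favoured at position 2
toy_matrix[2]["V"] = -1.0  # invented value: valine favoured at position 3

total = score_peptide("ALV", toy_matrix)  # 0.0 + (-1.5) + (-1.0)
```

Because each position contributes independently, such a model cannot capture interactions between positions, which is one motivation for the machine learning approaches discussed next.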
Currently, the most successful approach for epitope prediction utilizes machine learning algorithms. These algorithms require sufficiently large datasets for training in order to obtain reliable results. Fortunately, the Immune Epitope Database (IEDB) provides more than 100,000 MHC binding data entries related to T-cell epitopes from infectious pathogens, experimental pathogens, and self-antigens (autoantigens). IEDB encompasses patent data from biotechnological and pharmaceutical companies, as well as direct submissions from research programs and partners. Because reliable experimental data are provided, this volume offers a sufficient basis for developing good predictive models. Although IEDB is not the only database that provides such information, it has more entries than other existing databases. Examples of other databases are SYFPEITHI and AntiJen. One predictor based on artificial neural networks used data from both IEDB and SYFPEITHI and performed very well. SVRMHC, a predictor based on support vector regression (SVR), used data from AntiJen and used LIBSVM for its SVR implementation. Moreover, there also exists an epitope predictor based on a hidden Markov model.
The allele-specific motif method, the positional scoring matrix method, and machine learning-based methods use only sequence information in general. Almost none of these methods can provide a clear explanation about the effects of the physicochemical properties of amino acids on binding affinity. In some cases, there are not enough peptides for training: e.g., when using data from rare alleles. Therefore, three-dimensional (3D) structure-based methods have been developed [30–32] to uncover binding mechanisms and address all forces related to binding affinity. However, such methods are currently less reliable than data-driven methods. The reason is that 3D structure-based methods usually require a number of crystal structures of MHC-peptide complexes, which are still not available in large numbers.
Currently, more than 2,000 HLA alleles have been identified. Searching for epitopes that bind to a large number of those alleles would be computationally exhaustive and time-consuming. Therefore, the concept of allele supertypes was developed by clustering alleles into groups based on overlapping epitopes [34–38]. Within each supertype, most of the alleles should share the same epitopes. These epitopes are called ‘promiscuous epitopes’, which show great promise for vaccine development due to their potential for a high level of population coverage.
In this study, we have developed a novel epitope prediction method named EpicCapo. Peptides were encoded numerically by combining information on the peptide-MHC (pMHC) contact sites with amino acid pairwise contact potentials (AAPPs), accompanied by a support vector machine (SVM). Our method’s performance was evaluated by using benchmark datasets and then compared with other high-performance methods. In addition, identification of candidates of promiscuous CTL epitopes for influenza A viruses was demonstrated using the proposed method.
The H1N1 and H5N1 strains of influenza A virus have caused lethal flu in humans, as seen in the epidemics of 2005–2009. Although inactivated influenza vaccination is beneficial, the development of more effective vaccines is still needed, particularly for elderly adults, who are more susceptible to viral infections. Identification of promiscuous CTL epitopes might help address this issue by providing candidate peptides from viral proteins for vaccine development.
Results and discussion
Comparison of peptide-encoding schemes
We compared our peptide-encoding scheme (Section Peptide data encoding) with binary peptide-encoding and with four amino acid descriptors (Table 1). The results of the comparison of the peptide-encoding schemes (Table 2) showed that EpicCapo performed better than the others in the classification tasks. It achieved the highest average area under the curve (AUC; 0.882), followed by binary encoding (0.879), DPPS (0.878), FASGAI (0.874), z-scale (0.858), and ISA/ECI (0.796) schemes. All standard deviations were less than 0.01. A comparison of receiver operating characteristic (ROC) curves is shown in Figure
Holdout method using training dataset and testing dataset
sens | spec | F1 | ACC | AUC
0.883 ± 0.005 | 0.792 ± 0.006 | 0.886 ± 0.003 | 0.841 ± 0.004 | 0.915 ± 0.001
0.876 ± 0.005 | 0.821 ± 0.005 | 0.862 ± 0.003 | 0.848 ± 0.003 | 0.916 ± 0.001
0.865 ± 0.005 | 0.760 ± 0.007 | 0.834 ± 0.004 | 0.816 ± 0.004 | 0.888 ± 0.001
0.847 ± 0.004 | 0.761 ± 0.004 | 0.825 ± 0.003 | 0.801 ± 0.003 | 0.882 ± 0.001
0.847 ± 0.005 | 0.732 ± 0.005 | 0.815 ± 0.004 | 0.793 ± 0.004 | 0.873 ± 0.002
0.799 ± 0.005 | 0.652 ± 0.005 | 0.760 ± 0.003 | 0.731 ± 0.003 | 0.797 ± 0.001
0.883 ± 0.005 | 0.721 ± 0.006 | 0.831 ± 0.003 | 0.807 ± 0.003 | 0.883 ± 0.002
Means and standard deviations were calculated by 20 iterations of 10-fold cross validation.
Underlined values represent the highest performance.
sens = sensitivity; spec = specificity; F1 = F-score; ACC = accuracy; AUC = area under the curve.
*These three top-ranked AAPPs were MICC010101, SIMK990101, and SIMK990105 (see Additional file
Although EpicCapo used the largest number of features (M × K = 360), more than binary encoding (180), DPPS (90), FASGAI (54), z-scale (45), and ISA/ECI (18), we confirmed that its high performance was not due to the larger number of features. In our study, the training dataset was separated into 40 datasets corresponding to the 40 AAPPs. Each dataset consisted of 9 features. The classification functions were fitted to these datasets, and the AAPPs were then ranked by AUC. The results, as shown in Table 2, suggested that even when using only the three top-ranked AAPPs (27 features in total), the classification performance values are comparable to those obtained by using all AAPPs. These three top-ranked AAPPs were MICC010101, SIMK990101, and SIMK990105 (see Additional file 1). They have been previously used in identifying native-like protein structures [44, 45], and were also identified as important AAPPs in our accompanying experiments.
Classification results of benchmark datasets
We applied EpicCapo to benchmark datasets of 34 MHC-I alleles. As shown in Table 3, NetMHC performed the best among the existing methods, ahead of ARB, SMM, and SMMPMBEC. For EpicCapo, average AUCs were lower than in NetMHC (0.1%–3.4%) in 13 allele datasets and were higher than in NetMHC (0.1%–9.3%) in 21 allele datasets when using all of the 40 AAPPs (360 features). Almost all standard deviations were low; only a few alleles had standard deviations larger than 0.01. These standard deviations could be decreased if more data become available. To improve the performance of our method, we developed EpicCapo+ by selecting an appropriate subset of AAPPs. As seen in Table 3, the performance of EpicCapo+ was higher than that of EpicCapo and comparable with that of NetMHC. The overall performance of EpicCapo+ is significantly higher than that of other methods according to a paired t-test (two-tailed) comparison of average AUCs from all alleles. The IDs of AAPPs used for estimating the predictive models of EpicCapo+ are shown in Additional file
Classification results of 34 allele datasets
# of peptides
EpicCapo | EpicCapo+
0.972 ± 0.004 | 0.977 ± 0.003
0.950 ± 0.004 | 0.951 ± 0.004
0.901 ± 0.004 | 0.909 ± 0.004
0.920 ± 0.003 | 0.923 ± 0.003
0.925 ± 0.004 | 0.927 ± 0.004
0.934 ± 0.004 | 0.938 ± 0.003
0.945 ± 0.004 | 0.951 ± 0.002
0.853 ± 0.012 | 0.865 ± 0.011
0.941 ± 0.005 | 0.957 ± 0.007
0.944 ± 0.008 | 0.945 ± 0.010
0.930 ± 0.002 | 0.935 ± 0.003
0.926 ± 0.004 | 0.934 ± 0.004
0.891 ± 0.003 | 0.899 ± 0.003
0.901 ± 0.005 | 0.907 ± 0.003
0.960 ± 0.004 | 0.964 ± 0.002
0.942 ± 0.005 | 0.951 ± 0.004
0.940 ± 0.006 | 0.950 ± 0.005
0.886 ± 0.013 | 0.911 ± 0.009
0.949 ± 0.005 | 0.958 ± 0.003
0.900 ± 0.004 | 0.907 ± 0.007
0.811 ± 0.007 | 0.912 ± 0.011
0.798 ± 0.009 | 0.861 ± 0.013
0.813 ± 0.010 | 0.871 ± 0.008
0.930 ± 0.012 | 0.948 ± 0.015
0.916 ± 0.008 | 0.940 ± 0.008
0.927 ± 0.008 | 0.938 ± 0.006
0.792 ± 0.009 | 0.854 ± 0.010
0.959 ± 0.005 | 0.964 ± 0.004
0.940 ± 0.014 | 0.968 ± 0.006
0.956 ± 0.016 | 0.985 ± 0.017
0.844 ± 0.021 | 0.880 ± 0.017
0.950 ± 0.015 | 0.966 ± 0.009
0.883 ± 0.009 | 0.926 ± 0.008
0.984 ± 0.012 | 0.992 ± 0.013
For each dataset, AUCs were evaluated based on 5-fold cross validation. In the lower part, p-values of average AUCs were calculated using paired t-tests (two-tailed).
Means and standard deviations were calculated by 20 iterations of 5-fold cross validation for EpicCapo and EpicCapo+.
Underlined values represent the highest performance among ARB, SMM, SMMPMBEC, and NetMHC. Values in bold represent significant improvements of EpicCapo or EpicCapo+ AUCs from 20 iterations of 5-fold cross validation over the underlined values according to t-tests (one-tailed, significance level = 0.01).
In this experiment, EpicCapo+ was further developed as EpicCapo+REF to improve the predictive performance and identify important positions of nonapeptides in pMHC binding (Section Improving the performance of HLA-A-nonapeptide binding predictive models). The IDs of AAPPs used in EpicCapo+REF are shown in Table 4 (for more details on AAPPs, see Additional file 1). The most important AAPPs identified by EpicCapo+ were IDs 14 (MICC010101) and 28 (SIMK990105), which were selected in 13 out of 14 alleles. IDs 11 (KESO980102) and 26 (SIMK990103) were also considered to be important, because they were selected in 9 out of 14 alleles. In previous studies that used AAPPs in MHC I epitope prediction, AAPP IDs 19 (MIYS960102) and 2 (BETM990101) proved to be important in peptide-MHC binding predictions [5, 47, 48]. In our study, however, BETM990101 was not selected for an AAPP subset for any allele, and MIYS960102 was chosen for only two alleles (A*02:03 and A*02:06). In a report by Schueler-Furman et al., KESO980102 was also tested and compared with MIYS960102; however, there was no significant improvement in the predictive performance. Therefore, it is interesting that MICC010101, SIMK990105, KESO980102, and SIMK990103 were important for generating better predictive models in our study.
Optimal subsets of AAPPs and number of selected features identified by EpicCapo+REF using 14 HLA-A allele datasets
We further investigated the generated features according to the selected subset of AAPPs. In our peptide-encoding scheme, nine features were generated from one AAPP, corresponding to the nine amino acid positions in the nonapeptide. Previous studies have indicated that not all positions were important in pMHC binding [4, 10–12]. Therefore, some features corresponding to specific positions could be removed to improve the predictive performance.
The Relief-F algorithm was employed in our study to rank the features according to their importance in separating the nonbinding peptides from the binding ones. The ranking results showed that the ten top-ranked features correspond to positions 9 and 2 in most of the alleles, followed by positions 3, 1, or 7 (see Additional file 3). As indicated in Table 4, the overall AUC value of EpicCapo+REF was higher than that of EpicCapo+; however, it was still slightly lower than that of NetMHC in the A*01:01 and A*02:06 alleles. In summary, EpicCapo+REF performed better than other methods, with an average AUC of 0.935. Table 4 also shows the number of selected features after employing the Relief-F algorithm. These numbers differed between alleles. For the A*01:01, A*02:02, and A*06:01 alleles, no features were removed. However, for the A*02:06, A*24:02, A*29:02, and A*68:02 alleles, 20 or more features were removed. Interestingly, features corresponding to positions 5 and 8, which have previously been considered not to contribute significantly to HLA binding potentials, were still included in some of the selected feature subsets. Therefore, we assumed that features corresponding to different positions are not independent, and that features from all positions should be provided as input to estimate the highest-performing model (see Additional file
Candidates of promiscuous epitopes for the development of influenza A viral vaccines
Since EpicCapo+REF performed better than the other existing methods when tested with 14 HLA-A allele datasets, it was further used to find candidates of promiscuous epitopes from influenza A viral sequences. Epitopes from protein sequences of H1N1 (A/PR/8/34), H3N2 (A/Aichi/2/68), H1N1 (A/New York/4290/2009), and H5N1 (A/Hong Kong/483/97) were identified using EpicCapo+REF. The prediction results of all influenza A strains categorized into specific alleles are shown in Table 5. All 14 alleles were assigned to supertype groups using the supertype classification defined by previous studies [34–37]. The A*01:01 and A*26:01 alleles were assigned to the A1 group. The A*29:02 allele was assigned to an unidentified group. As shown in Table 5, there are a small number of predicted positive peptides in the A1 supertype. For example, in the case of H1N1 (A/PR/8/34), only one peptide was identified as positive for the allele A*26:01. In contrast, there were quite high numbers of predicted positive peptides in the A2, A24, and A3 supertypes. Even the A*29:02 allele, which was assigned to an unidentified group, had a higher number of predicted positive peptides than those in the A1 group. Based on our findings, when promiscuous epitopes were identified from the overlapping epitopes of four influenza A viral strains (Additional file 4), the A1 group rarely shared peptides with other groups. As shown in Additional file 4, the A*01:01 allele shared only one peptide (YSHGTGTGY) with A*29:02, and the A*26:01 allele shared the peptide DTVNRTHQY with A*29:02 and A*68:01. Moreover, the A*29:02 allele also shared peptides with the A2 and A3 groups: e.g., SMELPSFGV and QTYDWTLNR, respectively (Additional file 4). Therefore, A*29:02 can be considered as a special group that links A1, A2, and A3 together. Furthermore, Doytchinova et al. assigned A*29:02 to the A3 group. However, we did not find overlapping epitopes from the four influenza A viral strains in the A*24:02 allele assigned to the A24 group. This suggested that A*24:02 itself is different from the other alleles considered here, and this might be the reason why most of the previous studies assigned it separately to the A24 group [34–37]. As shown in Additional file 4, 51 peptides (67.1%) of the total 76 epitopes were immunologically validated as positive, whereas 9 peptides (11.8%) were validated as negative. No evidence of immunological validation could be obtained for 16 peptides (21.1%). These results indicate that our newly developed method provides high accuracy in epitope identification, given that most of the identified epitopes could be correlated with immunological experimental evidence. However, even without such immunological evidence, the epitopes identified by our computational approach might be considered as candidates for new vaccine development.
Prediction results of EpicCapo+REF using four influenza A strains categorized by specific alleles
# of predicted positive peptides
H1N1 New York/4290/2009
H5N1 Hong Kong/483/97
Our results are in agreement with the study by Uchida, which identified promiscuous epitopes from influenza A H1N1 (A/PR/8/34), H3N2 (A/Aichi/2/68), H1N1 (A/New York/4290/2009), and H5N1 (A/Hong Kong/483/97). Uchida found experimentally confirmed CTL epitopes in the A2 group. In our results, the epitopes identified by EpicCapo+REF in the A2 group were consistent with them (Table 6). In addition, we found promising candidates of promiscuous epitopes also for the A1 and A3 groups as shown in Additional file
Comparison of epitopes identified by EpicCapo+REF with the broadly protective influenza A viral epitopes identified by Uchida
Although the overall performance of EpicCapo+REF was high, there are two limitations in the use of this method. The first limitation is that the length of input peptides must be equal to 9. In a future study, we will extend EpicCapo+REF to be applicable to peptides with lengths of 8–11. The second limitation is that input peptides must not contain special or ambiguous amino acids. Examples of special amino acids are U (selenocysteine) and O (pyrrolysine). Examples of ambiguous amino acids are B (asparagine or aspartic acid), Z (glutamine or glutamic acid), and J (leucine or isoleucine). EpicCapo+REF is not applicable to these amino acids since they are not included in the AAPPs.
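The two input restrictions can be expressed as a simple pre-check. The following Python sketch is illustrative only; the function name `is_supported_peptide` is hypothetical and is not part of EpicCapo+REF.

```python
# The 20 standard amino acids; U, O, B, Z, and J are deliberately excluded.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_supported_peptide(peptide: str, required_length: int = 9) -> bool:
    """Return True if the peptide is a nonapeptide of standard amino acids."""
    if len(peptide) != required_length:
        return False
    return all(residue in STANDARD_AMINO_ACIDS for residue in peptide.upper())


ok = is_supported_peptide("LLFGYPVYV")         # nine standard residues
too_short = is_supported_peptide("LLFGYPVY")   # only eight residues
special = is_supported_peptide("LLFGYPVYU")    # contains selenocysteine (U)
```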
In this study, we have developed a novel method for epitope prediction. Peptides were encoded numerically, combining information of pMHC contact sites and amino acid pairwise contact potentials, accompanied by an SVM for estimating the predictive model. Our method achieved high performance in testing with benchmark datasets. In addition, our study identified a number of candidates of promiscuous CTL epitopes from four influenza A viral strains, consistent with previously reported immunological experiments. This consistency in results strongly supports the accuracy of our method. We speculate that our techniques may be useful in identifying promising candidates of promiscuous epitopes for the development of new vaccines.
Peptide data encoding
We propose a novel peptide-encoding scheme for machine learning algorithms. This scheme utilized the information of pMHC contact sites retrieved from the international ImMunoGeneTics information system, IMGT, the allele-specific positional scoring matrices developed by SMMPMBEC, and the AAPPs from AAindex.
The reference pMHC contact sites retrieved from IMGT were modified by adding more MHC positions. The added MHC positions were determined by observing the pMHC contact sites of 189 selected crystal structures of the HLA-nonapeptide complex collected from IMGT entries specific to the MHC-I receptor type. If there were new contact positions, the reference pMHC contact sites were modified by adding those new positions. As a result, the modified pMHC contact sites include more HLA-nonapeptide contact positions than the reference pMHC contact sites, which were derived from only 74 crystal structures of the HLA-nonapeptide complex. Utilizing the modified pMHC contact sites should provide more reliable results during the prediction. Additional file 5 shows the reference and added pMHC contact site positions. This information served as a binding template between the peptide and MHC. In NetMHCpan, the reference pMHC contact sites were used to extract a pseudosequence representing the given MHC molecule. When performing prediction, sequence information from both peptide and MHC was taken into account. However, the pairs of amino acids between the MHC molecule and peptide were not considered. Therefore, to generate a more informative predictive model, we used information about the pairs of amino acids at the interface between an MHC molecule and a nonapeptide, represented by AAPPs. In addition, the allele-specific positional scoring matrices developed by SMMPMBEC were used in our study. These matrices indicate how strongly a given amino acid is preferred or avoided at a specific position. Like NetMHCpan, SMMPMBEC did not use AAPPs. In our study, we showed that a proper selection of AAPPs could lead to higher performance in the prediction. The encoded data could be further used in classification or regression tasks using machine learning algorithms. In this study, we demonstrated the feasibility of the classification task by using the SVM implemented in the R package kernlab.
Here, we propose a novel scheme for encoding nonapeptides into input vectors of the SVM. Suppose $E(a_1, a_2)$ is an AAPP for the amino acids $a_1$ and $a_2$. If two or more types of AAPPs are available, we denote the $k$th type of AAPP by $E_k(a_1, a_2)$. Also, we denote the $i$th amino acid of the nonapeptide $n$ and the $j$th amino acid of HLA by $u_i(n)$ and $v_j$, respectively. In order to combine information of position-specific amino acid scores of the nonapeptides with AAPPs, we define a score $S_{k,i}(n)$ for the $i$th position and the $k$th type of AAPP as follows:

$$S_{k,i}(n) = T_i\big(u_i(n)\big) \cdot \frac{\sum_{j=1}^{L} \delta_{ij}\, E_k\big(u_i(n), v_j\big)}{\sum_{j=1}^{L} \delta_{ij}}$$

where $L$ is the length of the HLA protein, $T_i(a)$ is the $i$th position score of the amino acid $a$ for the nonapeptides described by SMMPMBEC, and $\delta_{ij}$ is an indicator variable that takes the value of 1 if the $i$th amino acid of a nonapeptide and the $j$th amino acid of HLA contact each other, and 0 otherwise. Here, the positional scoring matrix $T_i(a)$ is trained based on training data, multiplied by −1 to reverse the order of values (so that a high positive value denotes high preference between an amino acid and the position), and scaled into the range of 1 to 10, since we need to avoid loss of information when $T_i(a)$ equals zero. In fact, any range that does not include zero can be used; in this study, it is the range of 1 to 10. The scaling of positional scoring matrices is shown in Additional file 6. Note that $\sum_{j=1}^{L} \delta_{ij}$ is the number of contact sites for the $i$th amino acid of a nonapeptide (see Additional file 5). Intuitively, this score represents the average pair potential of the contact sites, weighted by the position-specific amino acid score for nonapeptides. Let $K$ be the number of AAPPs available, and $M$ be the length of the peptide, set to 9 throughout this study. Using this scoring scheme, we transform a nonapeptide $n$ into an $M \times K$-dimensional numerical vector, whose $(M(k-1) + i)$th element is $S_{k,i}(n)$. For example, the encoded nonapeptides consist of 9 features if one AAPP is used, and 360 features if 40 AAPPs are used. Figure 3 illustrates an example of the data-encoding scheme for the first position of the nonapeptide.
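The encoding described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function and argument names are hypothetical, the contact map is passed as a dictionary from peptide position to contacting HLA positions (both 0-based), each AAPP is passed as a function of two residues, and a peptide position without any contact site is given the value 0.0 (an assumption made here only to avoid division by zero). The toy peptide, HLA string, and potential values are invented and much shorter than real data.

```python
from typing import Callable, Dict, List, Sequence

Potential = Callable[[str, str], float]  # one AAPP, i.e. E_k(a1, a2)


def encode_peptide(
    peptide: str,
    hla: str,
    contacts: Dict[int, List[int]],           # i -> HLA positions j with delta_ij = 1
    position_scores: List[Dict[str, float]],  # T_i(a), assumed already rescaled to [1, 10]
    potentials: Sequence[Potential],
) -> List[float]:
    """Encode a peptide of length M into an M x K vector of scores S_{k,i}."""
    m = len(peptide)
    vector = [0.0] * (m * len(potentials))
    for k, potential in enumerate(potentials):
        for i, residue in enumerate(peptide):
            sites = contacts.get(i, [])
            if not sites:
                continue  # assumption: no contact sites -> feature left at 0.0
            # Average pair potential over the contact sites of position i ...
            mean_potential = sum(potential(residue, hla[j]) for j in sites) / len(sites)
            # ... weighted by the position-specific amino acid score.
            # 0-based index M*k + i corresponds to the 1-based (M(k-1) + i)th element.
            vector[m * k + i] = position_scores[i][residue] * mean_potential
    return vector


# Invented toy data: a 2-residue "peptide", a 2-residue "HLA", one potential.
TOY_TABLE = {("A", "G"): -0.25, ("A", "V"): -0.75, ("L", "V"): -1.0}


def toy_potential(a: str, b: str) -> float:
    return TOY_TABLE[(a, b)]


features = encode_peptide(
    peptide="AL",
    hla="GV",
    contacts={0: [0, 1], 1: [1]},
    position_scores=[{"A": 2.0}, {"L": 5.0}],
    potentials=[toy_potential],
)
# features[0] = 2.0 * mean(-0.25, -0.75); features[1] = 5.0 * (-1.0)
```

With nine positions and 40 potentials, the same function would return the 360-dimensional vector mentioned in the text.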
Our peptide-encoding scheme was compared with binary peptide-encoding and with four amino acid descriptors, as shown in Table 1, using the dataset reported by Bi and colleagues (supplementary information, Table S2). This dataset consists of 1,998 HLA-A*02:01-restricted nonapeptides with known quantitative affinities. The dataset was randomly partitioned into a training set containing 1,500 nonapeptides for estimating predictive models using the SVM, and a test set containing 498 nonapeptides for validating the models. For our peptide-encoding scheme, the positional scoring matrix was trained based on an external dataset downloaded from IEDB, consisting of 500 nonapeptides restricted to the HLA-A*02:01 allele (Additional file 7). These nonapeptides were included in neither the training nor the test set. For the binary peptide-encoding, each amino acid was encoded as a binary vector of length 20, resulting in a vector of length 180 for a nonapeptide. When amino acid descriptors are used, the length of an encoded vector is M times the length of the descriptor vector. The performances of the data-encoding schemes were evaluated in classification tasks, using a 10-fold cross validation. Throughout our experiments, the parameter C (cost of constraint violation), epsilon, and the type of kernel used for the SVM were 1, 0.1, and the radial basis kernel, respectively. The class for each nonapeptide was determined by using an IC50 affinity cutoff at 500 nM. Nonapeptides with an IC50 below 500 nM were considered binders, and all others non-binders. The study by Moutaftsi et al. showed that 90% of epitopes that could stimulate CD8+ T cell responses bound to MHC with affinities lower than 500 nM. The predictive performance is evaluated using five measures: overall accuracy (ACC), sensitivity (sens), specificity (spec), F-score (F1), and area under the curve (AUC) for the receiver operating characteristic curve. ACC, sens, spec, and F1 are defined as

$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad \mathrm{sens} = \frac{TP}{TP + FN}, \qquad \mathrm{spec} = \frac{TN}{TN + FP}, \qquad F_1 = \frac{2\,TP}{2\,TP + FP + FN}$$

where TP, FP, TN, and FN are the numbers of overall true positives, false positives, true negatives, and false negatives, respectively.
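For concreteness, the binder labelling rule and the four count-based measures can be written as a short Python sketch. It uses the standard definitions of accuracy, sensitivity, specificity, and F-score; the function names and the example counts are hypothetical and do not come from the study.

```python
from typing import Dict


def is_binder(ic50_nm: float, cutoff_nm: float = 500.0) -> bool:
    """Label a peptide as a binder when its IC50 is below the cutoff."""
    return ic50_nm < cutoff_nm


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute ACC, sens, spec, and F1 from confusion-matrix counts."""
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "sens": tp / (tp + fn),
        "spec": tn / (tn + fp),
        "F1": 2 * tp / (2 * tp + fp + fn),
    }


# Invented counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

AUC is not included here because it depends on the ranking of the continuous decision values rather than on a single set of counts.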
Validation of predictive models using benchmark datasets
The performance of EpicCapo was validated by using benchmark datasets of 34 MHC-I alleles provided by Peters et al. In this experiment, the positional scoring matrices were trained on the training data of each cross-validation fold. Twenty iterations of 5-fold cross validation were conducted to evaluate AUCs for EpicCapo. We compared the results of our method with those of ARB, NetMHC, SMM, and SMMPMBEC.
EpicCapo was further developed as EpicCapo+ by selecting AAPPs. Each encoded allele dataset was initially separated into 40 datasets according to the 40 AAPPs. The classification task was performed for each dataset to calculate AUC using the SVM and using the same parameters as EpicCapo. Then, the 40 datasets were ranked by AUC from highest to lowest. Next, the classification task was performed again by adding the datasets of AAPPs one by one based on their rank. Finally, the optimal subset of AAPPs that led to the highest AUC was identified for each allele. The average AUCs of all alleles as calculated from EpicCapo+ were compared with those from EpicCapo and other methods using paired t-tests (two-tailed). For each allele, the AUCs from 20 iterations of 5-fold cross validation of EpicCapo and EpicCapo+ were compared with the maximum AUC among other methods by using t-tests (one-tailed, significance level = 0.01).
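The selection procedure described above (rank the AAPPs individually, then add them one by one in rank order and keep the subset with the highest AUC) can be sketched as follows. This is an illustration under assumptions, not the authors' code: `evaluate_auc` is a hypothetical callback standing in for "train the SVM on the features of these AAPPs and return the cross-validated AUC", and the toy scoring function is invented so that the example is self-contained.

```python
from typing import Callable, List, Sequence, Tuple


def select_aapp_subset(
    aapp_ids: Sequence[str],
    evaluate_auc: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Return the rank-ordered prefix of AAPPs with the highest AUC."""
    # Step 1: evaluate each AAPP alone and rank from highest to lowest AUC.
    single_auc = {aapp: evaluate_auc([aapp]) for aapp in aapp_ids}
    ranked = sorted(aapp_ids, key=lambda aapp: single_auc[aapp], reverse=True)

    # Step 2: add AAPPs one by one in rank order and remember the best subset.
    best_subset: List[str] = []
    best_auc = float("-inf")
    current: List[str] = []
    for aapp in ranked:
        current.append(aapp)
        auc = evaluate_auc(list(current))
        if auc > best_auc:
            best_subset, best_auc = list(current), auc
    return best_subset, best_auc


# Invented stand-in for SVM cross validation: strong AAPPs help, weak ones hurt.
TOY_QUALITY = {"A": 0.90, "B": 0.85, "C": 0.60}


def toy_auc(subset: List[str]) -> float:
    strong = sum(1 for a in subset if TOY_QUALITY[a] > 0.8)
    weak = len(subset) - strong
    return max(TOY_QUALITY[a] for a in subset) + 0.01 * strong - 0.02 * weak


subset, auc = select_aapp_subset(["C", "A", "B"], toy_auc)
```

In this toy setting the procedure keeps the two strong potentials and stops benefiting once the weak one is added, mirroring how an optimal subset smaller than all 40 AAPPs can emerge.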
Improving the performance of HLA-A-nonapeptide binding predictive models
To increase the performance of our predictive models, the positional scoring matrices used in this experiment were trained based on datasets containing a larger number of nonapeptides. These matrices are available at
. After encoding 14 HLA-A allele datasets using the downloaded matrices, EpicCapo+ was performed again to identify optimal subsets of AAPPs therein. We used the Relief-F algorithm implemented in the machine learning software Weka to perform the feature selection task, ranking the features according to their importance in discriminating the MHC binder peptides from the non-binder ones. The default parameters provided by Weka were used, and a 5-fold cross validation was conducted for evaluating feature importance. The best feature subsets were constructed by adding the features, one by one, from the top-ranked feature to the last one in the classification task using the SVM. The AUC gradually increased with the addition of features, until it reached the highest value. Features after this point were considered irrelevant and ignored. We named this method, accompanied by the Relief-F algorithm, EpicCapo+REF.
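To give an intuition for how this family of algorithms ranks features, the sketch below implements the basic two-class Relief weighting in plain Python. It is a simplified relative of Relief-F (a single nearest hit and nearest miss per instance, no sampling) and is not the Weka implementation used in this study; the function name and the toy data are hypothetical.

```python
from typing import List, Sequence


def relief_weights(X: Sequence[Sequence[float]], y: Sequence[int]) -> List[float]:
    """Basic Relief: reward features that differ on the nearest miss and
    penalise features that differ on the nearest hit."""
    n, d = len(X), len(X[0])
    # Per-feature range, used to normalise differences to [0, 1].
    spans = []
    for f in range(d):
        column = [row[f] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(a: Sequence[float], b: Sequence[float], f: int) -> float:
        return abs(a[f] - b[f]) / spans[f]

    def distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(diff(a, b, f) for f in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue
        hit = min(hits, key=lambda j: distance(X[i], X[j]))
        miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for f in range(d):
            weights[f] += (diff(X[i], X[miss], f) - diff(X[i], X[hit], f)) / n
    return weights


# Invented toy data: feature 0 separates the classes, feature 1 is noise.
X_toy = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.8, 0.2], [0.9, 0.8], [1.0, 0.4]]
y_toy = [0, 0, 0, 1, 1, 1]

w = relief_weights(X_toy, y_toy)
ranking = sorted(range(len(w)), key=lambda f: w[f], reverse=True)
```

The resulting ranking would then feed the incremental procedure described in the text, in which features are added from the top-ranked one downward until the AUC stops improving.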
Identification of candidates of promiscuous epitopes
EpicCapo+REF was further tested to identify candidates of promiscuous epitopes—i.e., nonapeptides that were predicted to be MHC binders for various HLA alleles—from the protein sequences of four influenza A viral subtypes: H1N1 (A/PR/8/34), H3N2 (A/Aichi/2/68), H1N1 (A/New York/4290/2009), and H5N1 (A/Hong Kong/483/97). These protein sequences were downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov/). The nonapeptides were generated from these sequences by using a nonamer sliding window. Next, all of the generated nonapeptides were used as inputs in EpicCapo+REF predictive models. These models were estimated by using 14 HLA-A allele datasets, and each model was specific for each allele type. The identified epitopes were validated by cross-checking with the results of immunological experiments.
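The sliding-window step and the search for peptides predicted as binders for several alleles can be sketched as follows. The per-allele predictors are represented by hypothetical callables that return True for predicted binders; the toy rules, allele names, sequence, and the threshold of at least two alleles are assumptions made for illustration and are not taken from the study.

```python
from typing import Callable, Dict, List


def nonamers(sequence: str, width: int = 9) -> List[str]:
    """Generate all overlapping peptides of the given width."""
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]


def promiscuous_candidates(
    sequence: str,
    predictors: Dict[str, Callable[[str], bool]],
    min_alleles: int = 2,  # assumption: "various alleles" taken as two or more
) -> Dict[str, List[str]]:
    """Map each nonamer to the alleles predicted to bind it, keeping only
    peptides predicted as binders for at least `min_alleles` alleles."""
    candidates: Dict[str, List[str]] = {}
    for peptide in nonamers(sequence):
        alleles = [name for name, predict in predictors.items() if predict(peptide)]
        if len(alleles) >= min_alleles:
            candidates[peptide] = alleles
    return candidates


# Invented stand-ins for allele-specific predictive models.
toy_predictors = {
    "allele_x": lambda p: p[1] == "C",
    "allele_y": lambda p: p.endswith("K"),
    "allele_z": lambda p: p[-1] in "LM",
}

windows = nonamers("ACDEFGHIKLM")
shared = promiscuous_candidates("ACDEFGHIKLM", toy_predictors)
```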
The first author has been supported by Japanese government scholarship (Monbukagakusho) to study in Japan. The authors would like to thank all members in the Bioinformatics Laboratory of Kanazawa University for sharing their data mining and machine learning knowledge.
Graduate School of Natural Science and Technology, Kanazawa University
Institute of Science and Engineering, Kanazawa University
Department of Microbiology, Faculty of Science, Kasetsart University
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.