Quantitative prediction of mouse class I MHC peptide binding affinity using support vector machine regression (SVR) models
© Liu et al; licensee BioMed Central Ltd. 2006
Received: 22 December 2005
Accepted: 31 March 2006
Published: 31 March 2006
The binding between peptide epitopes and major histocompatibility complex proteins (MHCs) is an important event in the cellular immune response. Accurate prediction of the binding between short peptides and the MHC molecules has long been a principal challenge for immunoinformatics. Recently, the modeling of MHC-peptide binding has come to emphasize quantitative predictions: instead of categorizing peptides as "binders" or "non-binders" or as "strong binders" and "weak binders", recent methods seek to make predictions about precise binding affinities.
We developed a quantitative support vector machine regression (SVR) approach, called SVRMHC, to model peptide-MHC binding affinities. As a non-linear method, SVRMHC was able to generate models that out-performed existing linear models, such as the "additive method". By adopting a new "11-factor encoding" scheme, SVRMHC takes into account similarities in the physicochemical properties of the amino acids constituting the input peptides. When applied to MHC-peptide binding data for three mouse class I MHC alleles, the SVRMHC models produced more accurate predictions than those produced previously. Furthermore, comparisons based on Receiver Operating Characteristic (ROC) analysis indicated that SVRMHC was able to out-perform several prominent methods in identifying strongly binding peptides.
As a method with demonstrated performance in the quantitative modeling of MHC-peptide binding and in identifying strong binders, SVRMHC is a promising immunoinformatics tool with considerable future potential.
The T cell, a specialized type of immune cell, continuously searches out proteins originating from pathogenic organisms, such as viruses, bacteria, fungi, or parasites. The T cell surface is enriched in a particular receptor protein: the T cell receptor or TCR, which binds to major histocompatibility complex proteins (MHCs) expressed on the surfaces of other cells. MHCs bind small peptide fragments derived from both host and pathogen proteins. It is the recognition of such complexes that lies at the heart of the cellular immune response. These short peptides are known as epitopes. Although the significance of non-peptide epitopes, such as lipids and carbohydrates, is now understood increasingly well, peptidic B cell and T cell epitopes (as mediated by the humoral and cellular immune systems respectively) remain the primary tools by which the intricate complexity of the immune response might be examined. While the prediction of B-cell epitopes remains primitive, a multiplicity of sophisticated methods for the prediction of T-cell epitopes has developed.
The earliest efforts in predicting the binding of short peptides to MHC molecules focused on identifying peptide sequence motifs that were characteristic of binding to MHC. This motif approach assumed that the presence of certain residues at specific positions (which are referred to as "anchor" positions) critically defined the binding ability of the peptide to the MHC. This somewhat simplistic assumption rendered the motif approach prone to false predictions. Later methods adopted more informative representations of peptide binding and more sophisticated modeling strategies such as position-specific scoring matrices (PSSM) [4–7], artificial neural networks (ANN) [8–10], hidden Markov models (HMM) and support vector machine (SVM) classification [12, 13]. With increasing amounts of MHC-peptide binding data available to facilitate their optimization, these methods have become increasingly effective in making predictions about whether a given peptide binds to a particular MHC molecule, and, when it does bind, whether the binding is strong or weak.
Recently, the modeling of MHC-peptide binding has come to emphasize quantitative predictions: instead of categorizing peptides as "binders" or "non-binders" or as "strong binders" and "weak binders", several new methods make predictions about the precise binding affinities (usually expressed as pIC50, the negative logarithm of the IC50). The additive method developed by Doytchinova et al. is a representative example of this trend. In this method, the binding affinity of the MHC-peptide interaction is modeled as the sum of a peptide background contribution (a constant term), the amino acid contributions at each position, and (optionally) the adjacent peptide side-chain interactions. The additive method has been shown to be effective in modeling MHC-peptide binding for a range of human and mouse class I MHC molecules, and, using an iterative extension, also a set of human and mouse class II alleles [14–17]. Additive method models not only provide more precise information about the binding reactions, but also demonstrated enhanced accuracy in the prediction of untested peptides compared to other prediction methods such as SYFPEITHI, BIMAS and RANKPEP [16, 17].
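The verbal description of the additive method above can be written compactly as follows. This is only a schematic form: the symbols $P_0$, $P_i$, $P_{i,i+1}$, $a_i$ and $L$ are notation assumed here for illustration and are not taken from the original publications.

```latex
% Schematic additive model for a peptide a_1 a_2 ... a_L (notation assumed):
%   P_0        constant background contribution
%   P_i        contribution of residue a_i at position i
%   P_{i,i+1}  optional interaction between adjacent side chains
\mathrm{pIC}_{50} \;=\; P_0 \;+\; \sum_{i=1}^{L} P_i(a_i) \;+\; \sum_{i=1}^{L-1} P_{i,i+1}(a_i, a_{i+1})
```

Every term enters linearly, which is the sense in which the additive method is treated as a linear model in the comparisons that follow.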
In this paper, we explore two ways in which quantitative immunoinformatic techniques, such as the additive method, might be improved. The first is to exploit non-linearity, since properly chosen non-linear models can often out-perform linear models in describing complex systems. The second is to use a more informative scheme for encoding amino acids: most immunoinformatic methods encode amino acids by their identities using indicator variables, so information concerning similarities in physicochemical properties between the 20 amino acids is typically neglected.
Support Vector Machines (SVMs) are a class of learning-based non-linear modeling techniques with proven performance in a wide range of practical applications. Originally, SVMs were developed for classification or qualitative modeling problems. With the introduction of an ε-insensitive loss function, SVMs have been extended to solve nonlinear regression (or quantitative modeling) problems. In this study, we employed the SVM regression (SVR) technique to model MHC-peptide binding affinities for three mouse class I MHC alleles (H2-Db, H2-Kb and H2-Kk). We name this new modeling method SVRMHC. In SVRMHC models, peptides were described using a new 11-factor encoding scheme. This takes into account a number of important physicochemical parameters of the 20 amino acids (including hydrophobicity scale, polarity, isoelectric point, and accessible surface area). These SVRMHC models demonstrated consistently better performance than linear methods in terms of describing power, self-consistency, and prediction accuracy. Moreover, comparisons between our SVR models and several other popular prediction tools indicated that the SVRMHC models performed best in identifying strong binders to mouse class I MHC molecules.
The datasets used in this study, and the online implementation of the SVRMHC models for the three mouse class I alleles, can be accessed on the supplementary web site.
SVR model parameter optimization
For the training of the SVR models, one kernel parameter (γ), and two kernel-independent parameters (ε and C) need to be determined (Eq. (5)). There are no commonly agreed methods for determining optimal SVR model parameters. In most published SVR studies we have examined, these model parameters were determined one at a time, by first fixing all other parameters, then letting the parameter take a range of different values, and thus identifying the value that corresponds to the best model performance assessed by cross-validation [20, 21]. This method, though efficient in terms of execution time, disregards potential interactions between different model parameters. Cherkassky and Ma advocated picking two of the three SVR model parameters (ε and C) from training data based on characterizations of the data, such as noise level and sample number. This method, though theoretically sound, did not, in our hands, always find the best set of parameters. In this study, we adopted a parameter selection procedure that combines the method of Cherkassky and Ma with a grid search. For ε and C, we first calculated the "recommended value" using Cherkassky and Ma's formulas, then searched a parameter range from 1/10th of the recommended value to 10 times the recommended value. The kernel parameter γ does not depend on the datasets. We picked a search range for γ of [0.001, 1], which safely covered the γ ranges commonly used in the literature [20, 21, 23]. After setting search ranges for the three parameters, we undertook a grid search through the three-dimensional parameter space. For each parameter, four, six or eight equal-sized steps (on a logarithmic scale) were taken in the grid search.
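The search described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names, the `score` callback (standing in for a cross-validated figure of merit), and the reading of "steps" as the number of intervals between grid points are all assumptions.

```python
import itertools
import math


def log_grid(low, high, steps):
    """Return steps + 1 values from low to high, equally spaced on a log10 scale."""
    lo, hi = math.log10(low), math.log10(high)
    return [10 ** (lo + (hi - lo) * k / steps) for k in range(steps + 1)]


def grid_search(score, eps_rec, c_rec, steps=4, gamma_range=(0.001, 1.0)):
    """Score every (epsilon, C, gamma) triple on the grid and return the best one.

    score   -- hypothetical callback returning a cross-validated figure of
               merit for one parameter triple (higher is better)
    eps_rec -- recommended epsilon; the grid spans eps_rec/10 .. eps_rec*10
    c_rec   -- recommended C; the grid spans c_rec/10 .. c_rec*10
    """
    eps_values = log_grid(eps_rec / 10, eps_rec * 10, steps)
    c_values = log_grid(c_rec / 10, c_rec * 10, steps)
    gamma_values = log_grid(gamma_range[0], gamma_range[1], steps)
    # Exhaustive search, so interactions between parameters are not ignored.
    return max(itertools.product(eps_values, c_values, gamma_values),
               key=lambda params: score(*params))
```

Because every combination is evaluated, the cost grows with the cube of the number of grid points per parameter, which is the price paid for capturing parameter interactions that a one-at-a-time search misses.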
The parameters recommended by Cherkassky and Ma (2004) (ε and C) and the final optimized parameters (ε, C and γ) of the SVRMHC models constructed for the three mouse class I alleles. Table headings: model for H2-Db; model for H2-Kb; model for H2-Kk.
SVRMHC models performed better than linear models in quantitative predictions
The SVRMHC models constructed for the three MHC molecules demonstrated consistently better performance than linear models built from the same datasets.
Comparison between the additive method and the SVRMHC method in models constructed with the H2-Db, H2-Kb and H2-Kk datasets. Table headings: number of outliers; r² (self-consistency, outliers removed); r² (self-consistency, entire dataset); average residual (entire dataset); average residual (outliers removed); q² (LOO-CV, outliers removed).
The H2-Kb dataset consisted of 62 octamer peptides and associated binding affinities (Table 2). With SVRMHC, the step-wise outlier exclusion procedure determined and excluded 6 outliers, compared to 7 outliers removed by the additive method. The self-testing model constructed using the SVRMHC method produced an r² of 0.568 for the entire dataset (including outliers), in contrast to the r² of 0.370 produced by the additive model. An r² of 0.970 was obtained with SVRMHC after the 6 outliers were excluded, which was lower than the r² obtained by the additive model (0.989) after the exclusion of 7 outliers. The AR of the SVRMHC model for the entire dataset (including outliers) was 0.382 and the AR of the additive method was 0.443. However, the AR of the SVRMHC model after the 6 outliers were removed (0.130) was higher than the AR of the additive model after 7 outliers were removed (0.095). The LOO cross-validated q² of the model constructed with SVRMHC was 0.486, slightly higher than the additive LOO cross-validated q² of 0.454. These results indicated that for the H2-Kb dataset, SVRMHC produced models that had higher descriptive power and prediction accuracy, though the self-testing model exhibited a lower level of self-consistency after the outliers were removed.
The H2-Kk dataset is the largest of the three datasets, consisting of 154 octamers and associated binding affinities. With SVRMHC, no outliers were excluded, compared to 2 outliers excluded using the additive method. This, again, suggests that the SVRMHC method has higher "descriptive power" than linear methods. The self-testing additive model produced an r² of 0.849 (whole dataset) and 0.933 (2 outliers excluded). The self-testing SVRMHC model gave an r² of 0.973; since no outliers were determined, only one r² was calculated. The AR for the SVRMHC model was 0.039, compared to the additive model, which gave 0.178 for the entire dataset and 0.151 after the outliers were removed. The LOO cross-validated q² for SVRMHC was 0.721, compared to an additive LOO cross-validated q² of 0.456.
SVRMHC models out-performed other methods in identifying strong binders
We compared the performance of SVRMHC to that of existing prediction tools for MHC-peptide binding: the additive method, SYFPEITHI, BIMAS, RANKPEP, and SVMHC. At first, we attempted a strategy described in [16, 17]: trying to find recent literature reports of new peptide binding experiments for the three mouse class I MHC molecules. The hope was that predictions could be made for these binding experiments using both the SVRMHC model and the other methods, and a concomitant comparison in prediction accuracy could be made. However, this strategy was not successful, because in most recently published binding experiments pre-screening with prediction tools was used. This was most often SYFPEITHI and sometimes BIMAS [24–29]. Only the peptides predicted to be strong binders were tested experimentally, and peptides not predicted to be strong binders were disregarded. Moreover, false predictions (peptides predicted by SYFPEITHI or BIMAS to be strong binders but which were determined experimentally to be weak binders or non-binders) were sometimes not reported [24, 26, 27]. It is not surprising that prediction tools used in pre-screening (SYFPEITHI or BIMAS) always performed better in identifying good binders in these published studies (results not shown).
Thus, we applied another scheme for making comparisons between the SVRMHC method and other prediction methods: Receiver Operating Characteristic (ROC) analysis [30, 31]. The prediction performance of any classification-type model can be assessed using the combination of two properties, specificity and sensitivity. The sensitivity vs. (1-specificity) relationship is referred to as the ROC relationship (see Methods). In a ROC curve, the two axes both have a range of [0, 1], therefore the area under a ROC curve (A_ROC) also takes a range of [0, 1]. A purely "random guess" prediction model would have an A_ROC of 0.5. For better models, A_ROC would be higher than 0.5. The closer A_ROC is to 1, the better the performance of the predicting model.
From the MHC ligand database MHCBN, we downloaded all nonamer ligands for the H2-Db molecule and all octamer ligands for H2-Kb and H2-Kk. These peptides were divided into two groups: "strong binders" and "weak binders" (see Methods). For H2-Db, there were 28 strong binders and 44 weak binders, and for H2-Kb, there were 22 strong binders and 24 weak binders. No weak binders were retrieved for H2-Kk, thus no ROC analysis was conducted for this allele. The scores generated by these different prediction methods have very different meanings. The scores produced by the SVRMHC models and the additive models are predicted pIC50 values, while scores from BIMAS are predicted half-lives. RANKPEP outputs scores calculated from a PSSM (position-specific scoring matrix) profile. The scores produced by SYFPEITHI are nominal scores. They are generated by differentially scoring matches, within an individual peptide, to primary and secondary anchors within the target motif. Thus, they represent how close a particular peptide is to the expected pattern of a motif. The scores produced by SVMHC, on the other hand, are the distances between the peptides and the separating hyperplane defined by the SVM model. Despite the different meanings of these scores, they all roughly approximate increasing functions of predicted binding strength, therefore the areas under the ROC curves can be used as an objective measure of prediction performance.
Accurately predicting the binding between short peptides and MHC molecules remains a major task for immunoinformatics. Quantitative prediction of exact peptide binding affinity represents the most recent development in the field. Quantitative prediction is a finer-scale description of binding, and the ability to construct effective quantitative models manifests an improved understanding of the mechanism of MHC-peptide interactions. There have been two reported approaches to quantitative prediction of MHC-peptide binding. The first approach makes use of 3D QSAR (quantitative structure-activity relationship) techniques, and models the interaction between the peptide and the MHC molecule using CoMSIA (Comparative Molecular Similarity Indices Analysis) [33, 34]. This approach, though accurate, requires structural knowledge about how the peptide and the MHC molecule interact with each other in 3D space. The second approach to quantitative modeling is the bioinformatics approach, including the additive method and the SVRMHC method presented here. In contrast to 3D-QSAR, this approach uses only peptide sequences as its input, and does not require any 3D structural information. This property makes the bioinformatics methods more straightforward and generally applicable. In particular, they are more suitable for modeling the binding of less studied MHC molecules for which no 3D structural information is available.
Linear models, as exemplified here by the additive method, have previously demonstrated impressive performance in modeling a variety of MHC-peptide binding systems: the human class I allele HLA-A*0201, the mouse class I MHC alleles and the human class II allele DRB1*0401 [15, 16]. However, as is generally known, properly chosen non-linear models can often out-perform linear models in describing complex systems, although linear models can often be more intuitive and easily understood. Also, in many immunoinformatic techniques, including the additive method, amino acid residues are encoded by their identities, and the physicochemical properties of the amino acids are ignored. In this study, we have addressed both problems. The use of a non-linear SVR technique leads to an enhancement in predictivity. Meanwhile, the adoption of an 11-factor encoding of amino acids renders the resultant models sensitive to similarities in important physicochemical properties among the residues in the peptides being modeled.
Support vector machines (SVMs) are a new class of learning machines motivated by statistical learning theory, and they are gaining popularity because of their theoretically attractive features and strong empirical performance. Several reports in the literature describe SVM classification models developed to analyze peptide binding profiles qualitatively [12, 13, 36]; yet, to our knowledge, the current report is the first quantitative modeling study in which the SVR technique has been applied to model peptide binding.
In this study, we constructed SVRMHC models using the binding data of three mouse class I alleles (H2-Db, H2-Kb and H2-Kk), and compared the resulting models to linear models, built using the additive method, constructed using the same datasets. The models constructed with SVRMHC have been shown to be superior to those constructed with linear methods, in terms of descriptive power (as shown by a smaller number of "outliers"), prediction accuracy (manifest as a higher cross-validated correlation coefficient q²), self-consistency (higher non-cross-validated explained variance r²), and overall precision in prediction (lower average residual of the prediction). Although an improved performance was seen in all three SVRMHC models, the levels of improvement differed between the models constructed for the three MHC alleles. There seems to be a positive correlation between the amount of improvement achieved by the SVRMHC models and dataset size. With the largest of the three datasets, the H2-Kk dataset (154 peptides), the SVRMHC method demonstrated the greatest level of improvement: the LOO cross-validated q² increased from 0.456 to 0.721. With the two smaller datasets, the H2-Db dataset (65 peptides) and the H2-Kb dataset (62 peptides), the LOO cross-validated q² increased from 0.401 and 0.454 for the additive models to 0.456 and 0.486 for the SVRMHC models, respectively, marking a smaller improvement than for H2-Kk. When we looked at the self-consistency measure, with the two larger datasets (H2-Kk and H2-Db), the SVRMHC models consistently demonstrated higher levels of self-consistency than the linear models for the entire datasets as well as for the datasets after removal of outliers. For the smallest dataset, H2-Kb, the SVRMHC model produced a higher r² than the corresponding linear model for the entire dataset (0.568 vs. 0.370), but after outliers were removed, it produced a lower r² than the additive model did (0.970 vs. 0.989).
The same trend holds for the AR measurement. For the two larger datasets (H2-Kk and H2-Db), the SVRMHC models consistently produced lower AR values than the additive models for both the entire datasets and for the datasets after removal of outliers. However, for the smallest dataset (H2-Kb), the SVRMHC model produced a lower AR than the additive model for the entire dataset (0.382 vs. 0.443), but a higher AR than the additive model after the removal of outliers (0.130 vs. 0.095). These observations suggest that the SVRMHC approach may become more accurate as datasets grow.
The scores used in the 11-factor encoding for the 20 amino acids, after scaling to the range [0, 1]. Table headings include: hydrogen bond donors; average accessible surface area; van der Waals parameter R0; van der Waals parameter epsilon; free energy of solution in water; average side chain orientation angle.
The outliers determined by the additive method and the SVRMHC method for H2-Db, H2-Kb and H2-Kk. Common outliers determined by both methods are italicized. Table headings, repeated for each allele: outliers determined by the additive method; outliers determined by the SVRMHC method; predicted pIC50 (additive); predicted pIC50 (SVRMHC).
Comparison of performance between the additive method, SVRMHC, and SVR models with sparse encoding scheme for H2-Db, H2-Kb and H2-Kk. Table headings: SVR, sparse encoding; numbers of outliers.
The ROC analysis allows us to compare several prediction tools for MHC-peptide binding: SVRMHC, the additive method, SYFPEITHI, BIMAS, RANKPEP and SVMHC. Our ROC analysis indicated that the SVRMHC method, with an average A_ROC of 0.786, was the most accurate in identifying strong binders for the two mouse MHC molecules H2-Db and H2-Kb. It is followed by SYFPEITHI and the additive method (with average A_ROC = 0.727 and 0.726, respectively). SVMHC (average A_ROC = 0.704), BIMAS (average A_ROC = 0.695) and RANKPEP (average A_ROC = 0.621) were less accurate than the other three methods we compared. We need to stress, though, that this comparison is based on the models constructed for only two MHC molecules, and this rank order may not hold in the general case.
ROC-based comparison of the six prediction methods (SVRMHC, additive, SYFPEITHI, BIMAS, RANKPEP and SVMHC) after overlapping peptides were removed for the SVRMHC and additive methods, but not for the four qualitative methods.
Despite their encouraging performance, the SVR-based models reported here also exhibit some disadvantages. Most notably, these models are "black box" models, and are poorly interpretable. We cannot infer, for example, which peptide positions are the most important in determining the strength of the MHC-peptide binding. This does not, however, negate the utility of SVRMHC models as an immunological tool.
Currently, we are working to further improve SVR-based modeling methods, focusing on testing different combinations of physicochemical properties in the feature encoding scheme. We also plan to construct MHC-peptide binding models for other MHC molecules hosted in the AntiJen database [37, 38] and to make these prediction models available online. In the next phase of this project, we will adapt the SVR-based methodology to the more challenging task of predicting the MHC-peptide binding of class II MHCs.
In this paper, we demonstrated SVRMHC, an SVR-based quantitative modeling approach to model peptide-MHC binding affinities, and showed that SVRMHC is a promising immunoinformatics tool with considerable future potential. With the ongoing, rapid development of high-throughput functional proteomics technologies, such as peptide microarray technology, the SVR modeling approach is expected to see broader use in modeling MHC-peptide binding, and protein-peptide binding reactions in general.
Support Vector Machine Regression (SVR) overview
Support Vector Machines (SVMs) are a class of learning machines based on statistical learning theory [18, 35]. With the introduction of an ε-insensitive loss function, SVMs have been extended to solve nonlinear regression estimation problems. In SVR, with input data set $\{(x_i, d_i)\}_{i=1}^{n}$ (where $x_i$ is the input vector, $d_i$ is the desired real-valued label, and $n$ is the number of input records), $x$ is first mapped into a higher-dimensional feature space $F$ via a nonlinear mapping $\Theta$, then linear regression is performed in this space. In other words, SVR approximates a function using the following equation:
$$y = f(x) = w\,\Theta(x) + b \tag{2}$$
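The coefficients $w$ and $b$ are estimated by minimizing
$$R(C) = C\,\frac{1}{n}\sum_{i=1}^{n} L_\varepsilon(d_i, y_i) + \frac{1}{2}\lVert w \rVert^2 \tag{3}$$
where $L_\varepsilon(d, y)$ is the empirical error measured by the ε-insensitive loss function
$$L_\varepsilon(d, y) = \begin{cases} |d - y| - \varepsilon, & |d - y| \ge \varepsilon \\ 0, & \text{otherwise} \end{cases} \tag{4}$$
and the term $\frac{1}{2}\lVert w \rVert^2$ is a regularization term. The constant C is specified by the user, and it determines the trade-off between the empirical risk and the regularization term. ε is also specified by the user, and it is equivalent to the approximation accuracy of the training data.
The estimates of $w$ and $b$ are obtained by transforming Eq. (3) into the primal function:
$$\min_{w,\,b,\,\xi,\,\xi^*}\; \frac{1}{2}\lVert w \rVert^2 + C\,\frac{1}{n}\sum_{i=1}^{n}\left(\xi_i + \xi_i^*\right) \quad \text{subject to} \quad \begin{cases} d_i - w\,\Theta(x_i) - b \le \varepsilon + \xi_i \\ w\,\Theta(x_i) + b - d_i \le \varepsilon + \xi_i^* \\ \xi_i,\ \xi_i^* \ge 0 \end{cases} \tag{5}$$
where $\xi_i$ and $\xi_i^*$ are slack variables measuring how far a training point lies above or below the ε-tube.
By introducing Lagrange multipliers $a_i$ and $a_i^*$, the optimization problem can be transformed into a quadratic programming problem. The solution takes the following form:
$$f(x) = \sum_{i=1}^{n}\left(a_i - a_i^*\right) K(x, x_i) + b \tag{6}$$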
where $K$ is the kernel function, $K(x, x_i) = \Theta(x)^T \Theta(x_i)$. By using a kernel function, we can deal with problems of arbitrary dimensionality without having to compute the mapping $\Theta$ explicitly. Commonly used kernels include the linear kernel, the polynomial kernel, and the radial basis function (RBF) kernel. In this study, we chose the RBF kernel, as recommended by Chang and Lin. The RBF kernel takes the following form:
$$K(x, x_i) = \exp\left(-\gamma \lVert x - x_i \rVert^2\right), \quad \gamma > 0. \tag{7}$$
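The three ingredients of this overview (the ε-insensitive loss, the RBF kernel, and the kernel expansion that a trained SVR evaluates, namely a weighted sum of kernel values plus an offset) can be sketched in a few lines. This is an illustrative sketch only: the function and argument names are assumptions, `dual_coefs` stands for the differences of Lagrange multipliers produced by training, and no training (quadratic programming) step is shown.

```python
import math


def eps_insensitive_loss(d, y, eps):
    """Zero inside the eps-tube around the target, linear outside it."""
    return max(0.0, abs(d - y) - eps)


def rbf_kernel(x, xi, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and xi)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-gamma * sq_dist)


def svr_predict(x, support_vectors, dual_coefs, b, gamma):
    """Evaluate the kernel expansion sum_i coef_i * K(x, x_i) + b.

    support_vectors -- training inputs with non-zero coefficients
    dual_coefs      -- one coefficient per support vector (assumed to be
                       supplied by a separate training step)
    """
    return sum(c * rbf_kernel(x, sv, gamma)
               for c, sv in zip(dual_coefs, support_vectors)) + b
```

Note that the prediction never needs the mapping into feature space itself, only kernel values between the query and the stored training inputs.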
We constructed SVRMHC models using three MHC-peptide binding datasets for mouse class I MHC alleles. These sets have been used previously to construct models using the additive method, facilitating comparison between the two methods. The data consist of peptide sequences and experimentally measured binding affinities (expressed numerically as pIC50). The first dataset contains 65 nonamer peptides (H2-Db allele), the second dataset 62 octamers (H2-Kb), and the third dataset 154 octamer peptides (H2-Kk).
Encoding scheme of peptide sequences
The most widely used representation of an amino acid sequence in immunoinformatic modelling is the "sparse encoding" scheme [12, 40]. However, such an encoding scheme does not account for any similarity in physicochemical properties between amino acids. We developed a new encoding method. First, from AA-index, we picked a list of what we considered to be important general physicochemical properties (e.g., polarity, isoelectric point, and accessible surface area). To this list, we added a number of properties that were identified in 3-D QSAR analysis as key determinants of peptide-MHC interaction (volume, number of hydrogen bond donors, hydrophobicity). This led to a list of 17 physicochemical indices. We calculated the pair-wise squared correlation coefficients (r²) of these 17 factors. For any pair of factors with r² > 0.8, we eliminated one of the two factors. In the end, a list of 11 factors was obtained. The values of the 11 physicochemical parameters were linearly scaled to the range [0, 1] for the 20 amino acids (Table 5). The list of the 17 factors and their pair-wise r² values are presented in the online supplementary material. As input to the SVRMHC models, a given octamer or nonamer peptide sequence is represented as a long vector concatenated from the eight or nine numerical vectors (each of length 11) encoding the corresponding residues in the sequence. We name this encoding scheme the "11-factor encoding".
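The construction just described (dropping highly correlated properties, scaling each remaining property to [0, 1], and concatenating per-residue vectors) might be sketched as below. All names are assumptions, the numbers in `toy` are placeholders rather than real physicochemical values, and the greedy "keep the first of a correlated pair" rule is one possible reading, since the text does not say which factor of a pair was eliminated.

```python
def squared_correlation(u, v):
    """Squared Pearson correlation of two equal-length, non-constant lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov * cov / (var_u * var_v)


def prune_correlated(columns, threshold=0.8):
    """Keep a property only if its r^2 with every kept property is <= threshold."""
    kept = []
    for name, values in columns.items():
        if all(squared_correlation(values, columns[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


def minmax_scale(table):
    """Linearly rescale each property of {residue: [values]} to [0, 1].

    Assumes no property is constant across residues.
    """
    residues = list(table)
    n_props = len(table[residues[0]])
    lows = [min(table[r][j] for r in residues) for j in range(n_props)]
    highs = [max(table[r][j] for r in residues) for j in range(n_props)]
    return {r: [(table[r][j] - lows[j]) / (highs[j] - lows[j])
                for j in range(n_props)]
            for r in residues}


def encode_peptide(peptide, scaled):
    """Concatenate the per-residue property vectors into one input vector."""
    vector = []
    for residue in peptide:
        vector.extend(scaled[residue])
    return vector


# Placeholder table: three residues, two made-up properties each.
toy = {"A": [1.0, 10.0], "G": [0.0, 30.0], "L": [4.0, 20.0]}
scaled = minmax_scale(toy)
encoded = encode_peptide("GAL", scaled)  # length = 3 residues x 2 properties
```

With the real 11-factor table, an octamer would yield a vector of length 88 and a nonamer a vector of length 99.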
Outlier determination and exclusion
To ease comparison of SVRMHC models and those constructed previously using the additive method, we applied the same step-wise outlier determination and exclusion scheme as used in that work. For each dataset, an SVRMHC model was first constructed using the whole dataset, and a prediction was made for each sequence in the whole dataset. We called this model the "self-testing model". If at least one sequence in the dataset produced a residual value ≥ 2.0 log units in the "self-testing model" (the residual value is defined as the absolute value of the difference between the predicted affinity and true affinity on a logarithmic scale), then the sequence with the maximum residual value was excluded as an outlier, and a replacement self-testing model was constructed using the remaining sequences. This procedure was repeated until all sequences in the dataset had residual values < 2.0 log units.
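The step-wise procedure can be expressed as a short loop. The sketch below is model-agnostic: `fit` is a hypothetical callback that trains on the current data and returns a prediction function, so any regression method could be plugged in; it is not the implementation used in the study.

```python
def exclude_outliers(peptides, affinities, fit, threshold=2.0):
    """Drop the worst-fit peptide, refit, and repeat until all residuals < threshold.

    fit(peptides, affinities) -- hypothetical callback returning a function
                                 that maps a peptide to a predicted pIC50
    Returns (kept, outliers) as lists of (peptide, affinity) pairs.
    """
    kept = list(zip(peptides, affinities))
    outliers = []
    while kept:
        # Self-testing model: train and predict on the same sequences.
        predict = fit([p for p, _ in kept], [a for _, a in kept])
        residuals = [abs(predict(p) - a) for p, a in kept]
        worst = max(range(len(kept)), key=residuals.__getitem__)
        if residuals[worst] < threshold:
            break  # every remaining sequence is within the threshold
        outliers.append(kept.pop(worst))  # exclude one outlier per round
    return kept, outliers
```

Only one sequence is removed per round because removing the largest residual changes the refitted model, and with it the residuals of all remaining sequences.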
Assessment of model performance
The performance of the SVRMHC models was assessed using several metrics. The number of outliers determined and excluded can be considered as a measurement of the "descriptive power" of a model: a model that excludes a smaller number of outliers is better at describing the dataset as a whole than a model that excludes a greater number of outliers. For the final self-testing model (the self-testing model after all outliers are removed), we can assess its "self-consistency" using the explained variance (or squared correlation coefficient) r².
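The most important measure of a model's performance is its prediction accuracy, which can be assessed by the cross-validated correlation coefficient, q², of the model:
$$q^2 = 1 - \frac{\sum_{i=1}^{n}\left(\mathrm{pIC}_{50,i}^{\mathrm{pred}} - \mathrm{pIC}_{50,i}^{\mathrm{exp}}\right)^2}{\sum_{i=1}^{n}\left(\mathrm{pIC}_{50,i}^{\mathrm{exp}} - \overline{\mathrm{pIC}_{50}}\right)^2}$$
where $n$ is the number of peptides in the dataset, $\mathrm{pIC}_{50,i}^{\mathrm{pred}}$ and $\mathrm{pIC}_{50,i}^{\mathrm{exp}}$ are the predicted and experimentally measured pIC50 values for the $i$th peptide, respectively, and $\overline{\mathrm{pIC}_{50}}$ is the mean of the experimentally measured pIC50 values. As in the earlier additive-method work, we used leave-one-out (LOO) cross-validation to check our models' prediction performance.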
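Another metric that can be used to assess the performance of the models is the average residual (AR), defined simply as
$$\mathrm{AR} = \frac{1}{n}\sum_{i=1}^{n}\left|\mathrm{pIC}_{50,i}^{\mathrm{pred}} - \mathrm{pIC}_{50,i}^{\mathrm{exp}}\right|$$
where the superscripts pred and exp denote the predicted and experimentally measured values for the $i$th of $n$ peptides. The AR is a measure of the overall precision of the predictions made by the model. A model with a lower AR makes, overall, more precise predictions than a model with a higher AR.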
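Both metrics are straightforward to compute once leave-one-out predictions are available. The following sketch assumes the usual definitions (q² as one minus the ratio of the predictive residual sum of squares to the total sum of squares about the measured mean, and AR as the mean absolute residual); the function names and the `fit` callback are hypothetical.

```python
def loo_predictions(peptides, affinities, fit):
    """Predict each peptide from a model trained on all the others.

    peptides and affinities are lists; fit(peptides, affinities) is a
    hypothetical callback returning a peptide -> predicted pIC50 function.
    """
    predictions = []
    for i in range(len(peptides)):
        train_p = peptides[:i] + peptides[i + 1:]
        train_a = affinities[:i] + affinities[i + 1:]
        predictions.append(fit(train_p, train_a)(peptides[i]))
    return predictions


def q_squared(predicted, measured):
    """1 - (sum of squared prediction errors) / (total sum of squares)."""
    mean = sum(measured) / len(measured)
    press = sum((p - m) ** 2 for p, m in zip(predicted, measured))
    total = sum((m - mean) ** 2 for m in measured)
    return 1.0 - press / total


def average_residual(predicted, measured):
    """Mean absolute difference between predicted and measured pIC50."""
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(measured)
```

A q² of 0 corresponds to a model that predicts no better than the mean of the measured values, and q² can be negative for models that do worse than that.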
ROC analysis and comparisons of SVR models with other predicting tools
Prediction performance of any classification-type model can be assessed by the combination of two parameters: "false positive rate" and the "false negative rate" or, equivalently, specificity and sensitivity. Sensitivity is defined as 1- "false negative rate", and specificity is defined as the 1- "false positive rate". A plot of sensitivity vs. (1-specificity) is known as the ROC curve.
From the MHC ligand database MHCBN, all nonamer ligands for the H2-Db molecule and all octamer ligands for H2-Kb and H2-Kk were downloaded. In the MHCBN database, the peptide ligands are classified into five categories: "high binding", "moderate binding", "low binding", "no-binding" and "unknown". We grouped all peptides in the "high binding" and "moderate binding" categories together as "strong binders", all peptides in the "low binding" and "no-binding" categories together as "weak binders", and discarded the peptides in the "unknown" category. All ligands for the H2-Kk molecule downloaded from MHCBN were "strong binders", therefore the ROC analysis was not performed with H2-Kk.
The scores used for the SVRMHC method in the ROC analysis were the predicted pIC50 values of the test ligands for the final SVRMHC models. The scores used for the additive method, SYFPEITHI, BIMAS, RANKPEP, and SVMHC were obtained by querying the corresponding online prediction servers with default parameters. After the scores of all peptides for an MHC molecule (H2-Db or H2-Kb) were obtained, each score value was used in turn as a cut-off point. At each cut-off point $c$, the true positive rate was calculated as
$$r_{t,p} = \frac{\left| \{\, i \in S : s_i \ge c \,\} \right|}{|S|}$$
where $s_i$ is the predicted score for peptide $i$, and $S$ is the set of all "strong binders". The false positive rate was calculated as
$$r_{f,p} = \frac{\left| \{\, i \in W : s_i \ge c \,\} \right|}{|W|}$$
where $W$ is the set of all "weak binders". The ROC curve was plotted as $r_{t,p}$ vs. $r_{f,p}$.
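The cut-off sweep described above can be sketched as follows. The sketch assumes that a higher score means stronger predicted binding and that a peptide is called positive when its score is at or above the cut-off; both are assumptions, as are the function names. The trapezoidal area is added only as a common one-number summary of an ROC curve and is not taken from this section.

```python
from typing import List, Sequence, Tuple


def roc_points(strong_scores: Sequence[float], weak_scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs.

    Every observed score is used in turn as a cut-off; a peptide counts as
    predicted positive when its score is >= the cut-off (assumed convention).
    """
    if not strong_scores or not weak_scores:
        raise ValueError("both strong and weak binder sets must be non-empty")
    cutoffs = sorted(set(strong_scores) | set(weak_scores), reverse=True)
    points = [(0.0, 0.0)]  # cut-off above every score: nothing is called positive
    for cutoff in cutoffs:
        true_positive_rate = sum(s >= cutoff for s in strong_scores) / len(strong_scores)
        false_positive_rate = sum(s >= cutoff for s in weak_scores) / len(weak_scores)
        points.append((false_positive_rate, true_positive_rate))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under an ROC curve given as ordered (fpr, tpr) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

Because only the ranking of the scores matters, the same routine can be applied to scores on different scales, which is what allows predicted pIC50 values to be compared with the scores returned by other servers.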
We thank F. Xiao, Q. Su and Z. Zhang for their assistance in earlier phases of this work. Dr I.A. Doytchinova, Medical University, Sofia, provided considerable help and advice in model development. This project was supported by the Department of Neuroscience and the Graduate School, University of Minnesota.
- Blythe MJ, Flower DR: Benchmarking B cell epitope prediction: underperformance of existing methods. Protein Sci 2005, 14(1):246–248. 10.1110/ps.041059505
- Flower DR, Doytchinova IA, Paine K, P. T, Blythe MJ, Lamponi D, Zygouri C, Guan P, McSparron H, H. K: Computational Vaccine Design. In Drug Design: Cutting Edge Approaches. Edited by: Flower DR. Cambridge, Royal Society of Chemistry; 2002:136–180.
- Sette A, Buus S, Appella E, Smith JA, Chesnut R, Miles C, Colon SM, Grey HM: Prediction of major histocompatibility complex binding regions of protein antigens by sequence pattern analysis. Proc Natl Acad Sci U S A 1989, 86(9):3296–3300. 10.1073/pnas.86.9.3296
- Nielsen M, Lundegaard C, Worning P, Hvid CS, Lamberth K, Buus S, Brunak S, Lund O: Improved prediction of MHC class I and class II epitopes using a novel Gibbs sampling approach. Bioinformatics 2004, 20(9):1388–1397. 10.1093/bioinformatics/bth100
- Rammensee H, Bachmann J, Emmerich NP, Bachor OA, Stevanovic S: SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics 1999, 50(3–4):213–219. 10.1007/s002510050595
- Parker KC, Bednarek MA, Coligan JE: Scheme for ranking potential HLA-A2 binding peptides based on independent binding of individual peptide side-chains. J Immunol 1994, 152(1):163–175.
- Reche PA, Glutting JP, Reinherz EL: Prediction of MHC class I binding peptides using profile motifs. Hum Immunol 2002, 63(9):701–709. 10.1016/S0198-8859(02)00432-9
- Nielsen M, Lundegaard C, Worning P, Lauemoller SL, Lamberth K, Buus S, Brunak S, Lund O: Reliable prediction of T-cell epitopes using neural networks with novel sequence representations. Protein Sci 2003, 12(5):1007–1017. 10.1110/ps.0239403
- Brusic V, Rudy G, Honeyman G, Hammer J, Harrison L: Prediction of MHC class II-binding peptides using an evolutionary algorithm and artificial neural network. Bioinformatics 1998, 14(2):121–130. 10.1093/bioinformatics/14.2.121
- Honeyman MC, Brusic V, Stone NL, Harrison LC: Neural network-based prediction of candidate T-cell epitopes. Nat Biotechnol 1998, 16(10):966–969. 10.1038/nbt1098-966
- Mamitsuka H: Predicting peptides that bind to MHC molecules using supervised learning of hidden Markov models. Proteins 1998, 33(4):460–474. 10.1002/(SICI)1097-0134(19981201)33:4<460::AID-PROT2>3.0.CO;2-M
- Donnes P, Elofsson A: Prediction of MHC class I binding peptides, using SVMHC. BMC Bioinformatics 2002, 3(1):25. 10.1186/1471-2105-3-25
- Bhasin M, Raghava GP: SVM based method for predicting HLA-DRB1*0401 binding peptides in an antigen sequence. Bioinformatics 2004, 20(3):421–423. 10.1093/bioinformatics/btg424
- Doytchinova IA, Blythe MJ, Flower DR: Additive method for the prediction of protein-peptide binding affinity. Application to the MHC class I molecule HLA-A*0201. J Proteome Res 2002, 1(3):263–272. 10.1021/pr015513z
- Hattotuwagama CK, Toseland CP, Guan P, Taylor DL, Hemsley SL, Doytchinova IA, Flower DR: Class II Mouse Major Histocompatibility Complex Peptide Binding Affinity: In Silico bioinformatic prediction using robust multivariate statistics. J Chem Inf Mod (in press) 2005.
- Doytchinova IA, Flower DR: Towards the in silico identification of class II restricted T-cell epitopes: a partial least squares iterative self-consistent algorithm for affinity prediction. Bioinformatics 2003, 19(17):2263–2270. 10.1093/bioinformatics/btg312
- Hattotuwagama CK, Guan P, Doytchinova IA, Flower DR: New horizons in mouse immunoinformatics: reliable in silico prediction of mouse class I histocompatibility major complex peptide binding affinity. Org Biomol Chem 2004, 2(22):3274–3283. 10.1039/b409656h
- Cristianini N, Shawe-Taylor J: An introduction to support vector machines and other kernel-based learning methods. Cambridge, UK, Cambridge University Press; 2000.
- SVRMHC supplementary web site [http://SVRMHC.umn.edu/SVRMHC]
- Xue CX, Zhang RS, Liu HX, Liu MC, Hu ZD, Fan BT: Support vector machines-based quantitative structure-property relationship for the prediction of heat capacity. J Chem Inf Comput Sci 2004, 44(4):1267–1274. 10.1021/ci049934n
- Yao XJ, Panaye A, Doucet JP, Zhang RS, Chen HF, Liu MC, Hu ZD, Fan BT: Comparative study of QSAR/QSPR correlations using support vector machines, radial basis function neural networks, and multiple linear regression. J Chem Inf Comput Sci 2004, 44(4):1257–1266. 10.1021/ci049965i
- Cherkassky V, Ma Y: Practical selection of SVM parameters and noise estimation for SVM regression. Neural Netw 2004, 17(1):113–126. 10.1016/S0893-6080(03)00169-2
- Liu HX, Zhang RS, Yao XJ, Liu MC, Hu ZD, Fan BT: Prediction of the isoelectric point of an amino acid based on GA-PLS and SVMs. J Chem Inf Comput Sci 2004, 44(1):161–167. 10.1021/ci034173u
- Huang Y, Fayad R, Smock A, Ullrich AM, Qiao L: Induction of mucosal and systemic immune responses against human carcinoembryonic antigen by an oral vaccine. Cancer Res 2005, 65(15):6990–6999. 10.1158/0008-5472.CAN-04-3669
- Saren A, Pascolo S, Stevanovic S, Dumrese T, Puolakkainen M, Sarvas M, Rammensee HG, Vuola JM: Identification of Chlamydia pneumoniae-derived mouse CD8 epitopes. Infect Immun 2002, 70(7):3336–3343. 10.1128/IAI.70.7.3336-3343.2002
- Jaimes MC, Feng N, Greenberg HB: Characterization of homologous and heterologous rotavirus-specific T-cell responses in infant and adult mice. J Virol 2005, 79(8):4568–4579. 10.1128/JVI.79.8.4568-4579.2005
- Wrightsman RA, Luhrs KA, Fouts D, Manning JE: Paraflagellar rod protein-specific CD8+ cytotoxic T lymphocytes target Trypanosoma cruzi-infected host cells. Parasite Immunol 2002, 24(8):401–412. 10.1046/j.1365-3024.2002.00479.x
- Peng S, Ji H, Trimble C, He L, Tsai YC, Yeatermeyer J, Boyd DA, Hung CF, Wu TC: Development of a DNA vaccine targeting human papillomavirus type 16 oncoprotein E6. J Virol 2004, 78(16):8468–8476. 10.1128/JVI.78.16.8468-8476.2004
- Zhi Y, Kobinger GP, Jordan H, Suchma K, Weiss SR, Shen H, Schumer G, Gao G, Boyer JL, Crystal RG, Wilson JM: Identification of murine CD8 T cell epitopes in codon-optimized SARS-associated coronavirus spike protein. Virology 2005, 335(1):34–45. 10.1016/j.virol.2005.01.050
- Schueler-Furman O, Altuvia Y, Sette A, Margalit H: Structure-based prediction of binding peptides to MHC class I molecules: application to a broad range of MHC alleles. Protein Sci 2000, 9(9):1838–1846.
- Doytchinova I, Hemsley S, Flower DR: Transporter associated with antigen processing preselection of peptides binding to the MHC: a bioinformatic evaluation. J Immunol 2004, 173(11):6813–6819.
- Bhasin M, Singh H, Raghava GP: MHCBN: a comprehensive database of MHC binding and non-binding peptides. Bioinformatics 2003, 19(5):665–666. 10.1093/bioinformatics/btg055
- Doytchinova IA, Flower DR: Toward the quantitative prediction of T-cell epitopes: coMFA and coMSIA studies of peptides with affinity for the class I MHC molecule HLA-A*0201. J Med Chem 2001, 44(22):3572–3581. 10.1021/jm010021j
- Doytchinova IA, Flower DR: Physicochemical explanation of peptide binding to HLA-A*0201 major histocompatibility complex: a three-dimensional quantitative structure-activity relationship study. Proteins 2002, 48(3):505–518. 10.1002/prot.10154
- Vapnik V: Statistical Learning Theory. New York, John Wiley & Sons; 1998.
- Zhao Y, Pinilla C, Valmori D, Martin R, Simon R: Application of support vector machines for T-cell epitopes prediction. Bioinformatics 2003, 19(15):1978–1984. 10.1093/bioinformatics/btg255
- Guan P, Doytchinova IA, Zygouri C, Flower DR: MHCPred: A server for quantitative prediction of peptide-MHC binding. Nucleic Acids Res 2003, 31(13):3621–3624. 10.1093/nar/gkg510
- Toseland CP, Clayton DJ, McSparron H, Hemsley SL, Blythe MJ, Paine K, Doytchinova IA, Guan P, Hattotuwagama CK, Flower DR: AntiJen: a quantitative immunology database integrating functional, thermodynamic, kinetic, biophysical, and cellular data. Immunome Res 2005, 1(1):4. 10.1186/1745-7580-1-4
- Chang CC, Lin CJ: A practical guide to SVM classification, LibSVM documentation. 2004.
- Baldi P, Brunak S: Bioinformatics: the machine learning approach. Cambridge, MA, The MIT Press; 2001.
- Kawashima S, Ogata H, Kanehisa M: AAindex: Amino Acid Index Database. Nucleic Acids Res 1999, 27(1):368–369. 10.1093/nar/27.1.368
- MHCPred [http://www.jenner.ac.uk/MHCPred/]
- SYFPEITHI [http://www.syfpeithi.de/Scripts/MHCServer.dll/EpitopePrediction.htm]
- BIMAS [http://thr.cit.nih.gov/molbio/hla_bind/]
- RANKPEP [http://www.mifoundation.org/Tools/rankpep.html]
- SVMHC [http://www-bs.informatik.uni-tuebingen.de/SVMHC]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.