Predicting peptides binding to MHC class II molecules using multi-objective evolutionary algorithms

Background Peptides binding to Major Histocompatibility Complex (MHC) class II molecules are crucial for the initiation and regulation of immune responses. Predicting peptides that bind to a specific MHC molecule plays an important role in determining potential candidates for vaccines. The binding groove in class II MHC is open at both ends, allowing peptides longer than 9-mer to bind. Finding the consensus motif facilitating the binding of peptides to an MHC class II molecule is difficult because of the different lengths of binding peptides and the varying location of the 9-mer binding core. The level of difficulty increases when the molecule is promiscuous and binds to a large number of low affinity peptides. In this paper, we propose two approaches using multi-objective evolutionary algorithms (MOEA) for predicting peptides binding to MHC class II molecules. One uses the information from both binders and non-binders for self-discovery of motifs. The other, in addition, uses information from experimentally determined motifs for guided-discovery of motifs.
Results The proposed methods are intended for finding peptides binding to the MHC class II I-Ag7 molecule, a promiscuous binder to a large number of low affinity peptides. Cross-validation results across experiments on two motifs derived for I-Ag7 datasets demonstrate better generalization abilities and accuracies of the present method over earlier approaches. Further, the proposed method was validated and compared on two publicly available benchmark datasets: (1) an ensemble of qualitative HLA-DRB1*0401 peptide data obtained from five different sources, and (2) quantitative peptide data obtained for sixteen different alleles comprising three mouse alleles and thirteen HLA alleles. The proposed method outperformed earlier methods on most datasets, indicating that it is well suited for finding peptides binding to MHC class II molecules.
Conclusion We present two MOEA-based algorithms for finding motifs, one for self-discovery and the other for guided-discovery by experimentally determined motifs, and thereby predicting binding peptides to I-Ag7 molecule. Our experiments show that the proposed MOEA-based algorithms are better than earlier methods in predicting binding sites not only on I-Ag7 but also on most alleles of class II MHC benchmark datasets. This shows that our methods could be applicable to find binding motifs in a wide range of alleles.


Background
Major histocompatibility complex (MHC) molecules play a key role in initiating immune responses. They bind to and expose an antigen (or short peptides) to T cell receptors (TCR), triggering an immune response against the infected cell or foreign agent. MHC molecules make multiple contacts with the side-chains of binding peptides, which define the binding motif and determine the specificity of binding [1]. Prediction of peptides binding to an MHC class II molecule is difficult due to the different types of side chains and because the binding peptides are longer than 9aa (approximately 11 to 22aa) [1,2]. It has been previously observed that a core of 9aa is sufficient for binding peptides to an MHC class II molecule [3]; however, the exact location of the binding core (or motif) within the peptide is usually unknown and varies.
A binding motif is usually represented either by a consensus sequence or as a weight matrix [4]. The presence or composition of a motif can be experimentally determined from a large pool of putative binding peptides [3,5]. However, such wet-lab experiments are costly, time consuming, and cumbersome. Amino acids at specific sites of a motif that contribute significantly to the binding are referred to as primary anchor residues, and the corresponding sites as anchor positions. By using such position-specific information, earlier studies have found weight matrix models elaborating the nature and strength of binding motifs [6,7]. These models offer binding strengths of every residue at specific sites in the form of a position specific scoring matrix (PSSM) [7].
In general, MHC class-II prediction methods are categorized into two main classes [8]: (1) quantitative prediction methods that predict inhibitory concentration (IC50) values and (2) qualitative prediction methods that determine the binding status (binder or non-binder) based on the predictive score. Recent quantitative prediction approaches include SVRMHC [8], PLS-ISC [9], ARB [10], and SMM-align [11]. The ARB approach uses the full length of the peptide, whereas both the SVRMHC and PLS-ISC approaches use a preprocessing step involving alignment of sequences based on anchor position-specific residues. The underlying assumption of SMM-align is that amino acids occupying the 9-mer binding core motif are sufficient to determine the affinity of peptide-MHC binding. However, in some cases, the predictive performance could be improved by incorporating terminal residues known as peptide flanking residues (PFR) [11].
All methods predicting peptides binding to MHC molecules have their pros and cons; most show good performance only for the datasets upon which they were developed. Therefore, there is a need for new algorithms that perform well on previously unseen data. We propose to use MOEA to align a set of experimentally determined binding peptides at their binding cores and subsequently derive the consensus motif. The methods are especially useful when molecules are promiscuous and bind to a large number of low affinity peptides. The preliminary results of our work have been presented in [37].
I-Ag7 is the MHC class II molecule of the NOD mouse, critical for the development of insulin-dependent diabetes mellitus (IDDM) and other autoimmune disorders [38][39][40][41][42][43]. Knowledge of peptides binding to I-Ag7 is important in understanding the molecular basis of the development of IDDM in NOD mice. Experiments have demonstrated that I-Ag7 binding peptides are 9-30aa long [44]. Finding motifs in peptides binding to I-Ag7 is a non-trivial problem [45,46]. Despite numerous attempts, no consensus has been reached on the rules of peptide binding to the I-Ag7 molecule [38][39][40][41][42][43][44][45][46][47][48]. However, computational analyses on multiple datasets indicate that experimental motifs satisfy only a subset of the rules describing the optimal motif.
To demonstrate the utility in predicting peptides binding to other MHC molecules, our method is tested on two benchmark datasets comprising peptides of a number of different HLA (human MHC) and mouse alleles. The first dataset, referred to as BM-Set1 hereafter, consists of different combinations of peptides of the HLA-DRB1*0401 allele, and the second dataset, BM-Set2, consists of datasets from thirteen different HLA alleles and three mouse alleles.

Multi-Objective Evolutionary Algorithms (MOEA)
Evolutionary algorithms (EA) are based on the principles of biological evolution and have often been successful in solving complex search and optimization problems. The majority of bioinformatics applications of EA have been in the discovery of motifs such as transcription factor binding sites [49][50][51][52][53]. Yet, only a few researchers have used EA for the prediction of peptides binding to protein sequences [36].
An EA consists of (1) representing input variables as individuals or chromosomes (binary or real valued) in a population, (2) formulating the fitness (objective function) to evaluate individuals, (3) generating a new population by genetic operations (such as reproduction, crossover, and mutation) on the current population, and (4) determining if the population has reached the optimal fitness. The algorithm begins with an initial population and evolves over time. At a particular instance of evolution, every individual is evaluated by its fitness. New populations (offspring) are produced from selected highly fit individuals (parents), which undergo genetic operations. Each offspring is paired and compared to its parents. Highly fit individuals are retained in the population while less fit individuals are discarded. Search mechanisms such as elitism, constraint-handling, and multi-objective optimization are available for finding a better spread of solutions, depending on the needs of the optimization problem [54][55][56][57].
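The four steps above can be illustrated with a minimal single-objective sketch. It is not the authors' implementation: the tournament selection, one-point crossover, bit-flip mutation, the parameter defaults, and the function name `evolve` are all assumptions chosen for brevity, the sketch maximizes fitness, and it merges parents and offspring before truncation rather than pairing each offspring with its parents.

```python
import random

def evolve(fitness, n_bits, pop_size=20, generations=30, p_mut=0.02, seed=0):
    """Toy binary EA (hypothetical helper): returns the best individual and
    the best fitness recorded after each generation."""
    rng = random.Random(seed)
    # (1) represent candidate solutions as binary chromosomes
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        def pick():
            # binary tournament: the fitter of two random individuals is a parent
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b

        # (3) create offspring by crossover and mutation
        children = []
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, n_bits)          # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        # (2) evaluate, then keep the fittest of parents and offspring (elitism)
        pop = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
        history.append(fitness(pop[0]))
    # (4) here the stopping rule is simply a fixed number of generations
    return pop[0], history

# Toy objective: maximize the number of 1-bits in the chromosome.
best, history = evolve(sum, n_bits=16)
print(sum(best), history[0], history[-1])
```

Because survivors are chosen from the union of parents and offspring, the best fitness in this sketch can never decrease from one generation to the next.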
Multi-objective evolutionary algorithms (MOEA) are used to solve problems which require simultaneous optimization of a number of competing objective functions [58][59][60][61]. An MOEA maintains a set of solutions ranked by their dominance at a given instant of the evolution. A solution is said to dominate another if it is better or equal with respect to all objectives and strictly better in at least one objective [58]. Often there is more than one non-dominated solution; these best solutions are collectively known as the Pareto front. MOEAs yield a Pareto optimal set of solutions.
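The dominance relation defined above can be written down directly. The following sketch assumes that every objective is to be minimized; the function names and the toy objective vectors are illustrative only.

```python
from typing import List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points: List[Sequence[float]]) -> List[int]:
    """Indices of the points that no other point dominates."""
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]

# Four hypothetical solutions scored on two objectives.
objectives = [(3, 0.40), (5, 0.20), (4, 0.50), (3, 0.45)]
print(pareto_front(objectives))  # -> [0, 1]
```

The third and fourth vectors are both dominated by the first, so only the first two remain on the front.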
Non-dominated Sorting Genetic Algorithm II (NSGA-II) was recently introduced to incorporate several new genetic mechanisms for better convergence, such as non-dominated sorting, elitism, diversity preservation, and constraint handling [58]. In NSGA-II, a population is subjected to several rounds of non-dominated sorting. That is, all the non-dominated individuals are identified and assigned the same fitness value, after which a new set of non-dominated solutions is sought among the remaining individuals. The solutions found in subsequent rounds are assigned fitness values lower than those in the previous rounds. This process continues until the whole population is partitioned into non-dominated fronts with diverse fitness values. Elitism prevents the loss of fit individuals encountered in earlier generations by allowing earlier solutions to survive in the subsequent generations. The diversity of Pareto-optimal solutions is maintained by imposing a measure referred to as crowding distance. A solution that satisfies the constraints defined by the objective functions is called a feasible solution.
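The repeated rounds of non-dominated sorting and the crowding distance can be sketched as follows. This is a simple peeling procedure for illustration, not the faster bookkeeping used in NSGA-II itself, and it again assumes minimization; the crowding distance follows the usual definition (normalized gap between the two neighbours in each objective, with boundary solutions set to infinity).

```python
from typing import Dict, List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def non_dominated_sort(points: List[Sequence[float]]) -> List[List[int]]:
    """Partition point indices into fronts; front 0 is the Pareto front."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        front = sorted(i for i in remaining
                       if not any(dominates(points[j], points[i])
                                  for j in remaining if j != i))
        fronts.append(front)
        remaining -= set(front)   # peel the front off and repeat
    return fronts

def crowding_distance(points: List[Sequence[float]],
                      front: List[int]) -> Dict[int, float]:
    """Larger values mean a solution sits in a less crowded region."""
    dist = {i: 0.0 for i in front}
    for m in range(len(points[front[0]])):
        ordered = sorted(front, key=lambda i: points[i][m])
        lo, hi = points[ordered[0]][m], points[ordered[-1]][m]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (points[nxt][m] - points[prev][m]) / (hi - lo)
    return dist

pts = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)]
fronts = non_dominated_sort(pts)
print(fronts)                           # -> [[0, 1, 2], [3], [4]]
print(crowding_distance(pts, fronts[0]))
```

When two solutions belong to the same front, the one with the larger crowding distance is preferred, which spreads the retained solutions along the front.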

Peptide Binding to MHC Class II I-Ag7
In this paper, we attempt to find an optimal motif describing peptide binding to MHC class II molecules, using experimentally determined binding data. There are several factors that impede the derivation of such a consensus motif. The first is the strong resemblance among the peptides isolated in a single experiment and the second is the diversity among different datasets. A motif derived from a dataset lacking diversity indicates a bias towards the dataset used in deriving the motif. Such motifs are difficult to generalize to other experimental or previously unseen datasets. The MOEA based motif detection algorithm is designed to find a consensus motif on I-Ag7 datasets, which alleviates the influences arising from biased datasets and thereby predicts binding peptides more accurately in new datasets.

Predicting Peptides Binding to MHC Class II
We use our approach to find a consensus motif on seven experimental datasets of peptides binding to I-Ag7 molecules, obtained from the literature [40][41][42][43][62][63][64]. The motif is validated using an independent testing set generated from the Stratmann dataset [46]. The overall quality of prediction was measured using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve [65][66][67]. AUC values of all feasible solutions in the final population of the EA were evaluated and the solution with the highest AUC was chosen as the consensus motif (see Additional file 1).
Table 1 shows the information on the datasets extracted from the literature, which were used in the training. A '-' entry indicates that a particular piece of information is unavailable. As an example, the details of the experimental motif of Reizis et al. are given in Table 2. Table 3 shows the performance when an experimental motif is used to predict peptide binders in other datasets. As seen, a motif of a particular experiment does not characterize peptide binding of I-Ag7 molecules in other datasets. Table 4 shows the cross-validation performance of two motifs (by self-discovery and guided-discovery) derived using MOEA; in a particular cross-validation run, one experimental dataset was excluded and the motif was derived using the information of the remaining datasets. The motif was tested for predicting binders and non-binders of the left-out dataset. The self-discovery approach uses only the binding information whereas the guided-discovery approach uses both the binding information and the information associated with experimental motifs. As seen in Table 4, by achieving AUC values greater than 0.7 for all cross-validation runs, MOEA derived motifs demonstrate better generalization capabilities compared to experimentally determined motifs. The binding motifs derived from self-discovery and guided-discovery are illustrated as sequence logo plots [68] in Additional file 2.
To compare the performance of our method with earlier methods, a training dataset was created by combining all the experimental datasets given in Table 1. Motifs derived on the training dataset were tested on an independent test dataset, a balanced set generated from the Stratmann dataset. The Stratmann dataset was balanced by adding randomly generated non-binders. Twenty-five such balanced test datasets were assembled by generating random samples starting from different seeds and adding them to the Stratmann dataset. The results reported are based on the average AUC values over all balanced test sets. Figure 1 shows a comparison of the performances of motifs derived by MOEA and by earlier motif prediction approaches such as MEME and RANKPEP. An increase of 4-10% in predictive performance is observed with MOEA over the other approaches.
A comparison of the performances of MOEA derived motifs for BM-Set1 (see Table 5) with the enhanced Gibbs sampler [32], TEPITOPE [35], SVRMHC [8] and ARB [10] is given in Table 6. As seen, MOEA shows comparable or superior performance to the Gibbs sampler on all datasets except for the Southwood dataset. Out of the ten non-redundant (NR) datasets, MOEA outperformed the Gibbs sampler, TEPITOPE, SVRMHC and ARB on seven, nine, eight and ten datasets, respectively.
The performance of MOEA on BM-Set2 (see Table 7) was compared with the Gibbs sampler [32], TEPITOPE [35], SVRMHC [8], ARB [10] and NetMHCII [11]. Each allele dataset was subjected to five-fold cross-validation and the results are given in Table 8. The present method shows comparable or superior performance on the majority of allele datasets compared to the Gibbs sampler, SVRMHC, TEPITOPE, and NetMHCII. A fair comparison with the ARB method cannot be drawn because the method has been trained on quantitative data obtained from IEDB [10].

Discussion
We proposed two approaches using MOEA for deriving motifs (1) when only the information on binders and non-binders is known (i.e., self-discovery) and (2) when, in addition, the information on experimentally (wet-lab) determined motifs is available (i.e., guided-discovery).
Since the I-Ag7 molecule is known to bind to a large number of peptides of low affinity and appears to be a promiscuous binder, the prediction of peptides binding to the I-Ag7 molecule has been nontrivial. This has led to the definition of a number of suboptimal consensus motifs specific to the datasets. MOEA derived motifs had superior generalization capabilities to those derived with MEME and RANKPEP techniques as well as to the experimentally determined motifs on other datasets. The performances evaluated on two benchmark datasets indicate that the present MOEA based algorithm is applicable in deriving motifs on other class II MHC alleles as well.

Table 1: Information on I-Ag7 related peptide binding datasets and motifs. Unavailable information is indicated by "-".

Table 2: Representation of an experimentally derived I-Ag7 motif
Position Well-Tolerated Weakly-Tolerated Non-Tolerated
The description of the experimentally determined I-Ag7 9-mer peptide binding motif by Reizis: each position accommodates a well-tolerated, weakly-tolerated, or non-tolerated amino acid. The positions P4, P6 and P9 are the primary anchor positions where binding is highly likely to occur.

Table 5: The number of binders and non-binders in the original and non-redundant (NR) datasets in BM-Set1.
The likelihood of finding an optimal motif by MOEA is higher than by a local or greedy search because of the stochastic nature of EA. The proposed approach learns from the characteristics of both binders and non-binders in the training set, whereas other methods use information only from binders to determine motifs [27,32]. Moreover, the ranges of the parameters involved in MOEA are known, so the parameters of the fitness functions are quickly estimated in a few cross-validation runs. Furthermore, unlike the earlier methods, the present method does not rely on any prior information, such as anchor positions to obtain an alignment or prior distributions [8,9]. Given sufficient data samples representing both binders and non-binders, the method could be applicable to finding motifs in other types of molecules. A future direction of this research would be to integrate additional information such as peptide length [69] and PFR [70], as such information has been shown to have the potential to enhance motif detection [11,69]. This would lead to further improvement of the performance of the present algorithm.
Even though EAs are generally known to be computationally intensive, training for derivation of scoring matrices can be performed off-line and the prediction engines can be provided through web services. As seen in Tables 6 and 8, a single method does not always perform well on all types of allele datasets. Nevertheless, the present method showed higher accuracy in detecting motifs on the majority of MHC alleles in the benchmark datasets. Therefore, we believe that MOEA-based methods could provide a general framework for efficiently determining motifs in a wide range of MHC molecules.

Table 6: Comparison of AUC values of the BM-Set1 (DRB1*0401). †These values are based on smaller dataset sizes as SVRMHC didn't predict values for some of the peptides. The values from the Gibbs sampler were estimated from the matrix provided by the authors in [32].
In immunology, accuracy and speed in predicting binding peptides are of paramount importance. Computationally predicted binders do subsequently need to be validated with wet-lab experiments. By using computational predictions as an initial step, the high cost involved in initial screening and time-consuming clinical testing can be significantly reduced. Towards this end, the proposed MOEA methods present a promising way to predict peptides that bind to MHC class II alleles, including promiscuous and low affinity peptide binders.

Conclusion
We present two MOEA-based algorithms for finding motifs, one for self-discovery and the other for guided-discovery by experimentally determined motifs, and thereby predicting binding peptides to the I-Ag7 molecule. Our experiments show that the proposed MOEA-based algorithms are better than earlier methods in predicting binding sites not only on I-Ag7 but also on most alleles of class II MHC benchmark datasets. This demonstrates the applicability of our methods to finding binding motifs in a wide range of MHC alleles.

Datasets
Several I-Ag7 datasets were extracted from the literature [40][41][42][43][62][63][64] and from Brusic, V. (unpublished data). The numbers of binders and non-binders in each dataset are given in Table 1. The datasets consist of short peptides ranging from 9-30aa in length. Their binding affinities had been experimentally determined by independent studies and classified as binders or non-binders based on IC50 values according to the following scheme [41]: good binder (IC50 = 100 nM); weak binder (IC50 = 2000 nM); non-binder (IC50 = 50000 nM). The datasets in [40][41][42][43][62][63][64] were combined into a single training dataset and curated by removing duplicates and redundancy as follows: if a binder is a subsequence of another binder sequence, the longer binder sequence is discarded; if a non-binder is a subsequence of another non-binder, the shorter subsequence is discarded. Hereafter, the curated dataset is referred to as the training dataset and is denoted by $D = \{(x_i, v_i): i = 1, 2, \ldots, N\}$, where $N$ is the total number of peptide sequences and $x_i$ is the $i$-th peptide sequence with the label $v_i \in \{b, nb\}$ indicating whether the sequence $x_i$ is a binder ($b$) or a non-binder ($nb$). The training set contains $N = 438$ peptides, of which $N_b = 304$ are binders and $N_{nb} = 134$ are non-binders.
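The curation rule above can be sketched in a few lines. The function name `curate` and the example peptides are hypothetical; the sketch treats "subsequence" as a contiguous substring and does not try to resolve a peptide that appears with both labels, which the text does not specify.

```python
def curate(peptides):
    """peptides: list of (sequence, label) pairs with label 'b' or 'nb'.
    Returns the list with duplicates and nested redundancy removed."""
    unique = list(dict.fromkeys(peptides))   # drop exact duplicates, keep order
    kept = []
    for seq, label in unique:
        others = [s for s, l in unique if l == label and s != seq]
        if label == "b":
            # a binder that contains a shorter binder is the longer one: discard
            redundant = any(s in seq for s in others)
        else:
            # a non-binder contained in a longer non-binder is the shorter one: discard
            redundant = any(seq in s for s in others)
        if not redundant:
            kept.append((seq, label))
    return kept

# Made-up sequences, for illustration only.
raw = [("AAAYKLMNQ", "b"), ("GAAAYKLMNQT", "b"), ("AAAYKLMNQ", "b"),
       ("PPGS", "nb"), ("TPPGSW", "nb")]
print(curate(raw))  # -> [('AAAYKLMNQ', 'b'), ('TPPGSW', 'nb')]
```

Keeping the shorter binder narrows down where the binding core can lie, while keeping the longer non-binder retains more windows known not to bind.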

Comparison of Performances
The set of experimentally validated I-Ag7 motifs [38][39][40][41][42][43][44], derived largely from uncorrelated datasets [40][41][42][43], was extracted and is illustrated in Table 1 with the distribution of binders and non-binders in each dataset. Table 2 illustrates an experimentally validated motif of I-Ag7 reported by Reizis et al. [40]. Experimental motifs are described by the anchor positions and binding affinities of the amino acids of the motif. The residues which contribute significantly to the peptide binding are called primary anchor residues, and the positions at which they reside are called anchor positions. An amino acid occupying a specific position within a motif is characterized as well tolerated, weakly tolerated, or non-tolerated based on its involvement in the binding process.
An independent dataset was generated from the binders of the Stratmann dataset [46], consisting of a diverse set of I-Ag7 binding peptides with their binding affinities, to find the test accuracies in predicting binders and non-binders. The Stratmann dataset was balanced with randomly generated 9-mer non-binders so that, for the testing dataset, $N_b = N_{nb} = 112$.
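A balanced test set of this kind could be assembled as sketched below. The sampling scheme is an assumption: the source only states that non-binders were randomly generated 9-mers drawn from different seeds, so the uniform distribution over the 20 amino acids, the exclusion of exact matches to known binders, and the function names are illustrative.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_nonbinders(n, length=9, seed=0, exclude=()):
    """Draw n distinct random peptides, assumed here to be non-binders."""
    rng = random.Random(seed)
    excluded = set(exclude)
    drawn = set()
    while len(drawn) < n:
        pep = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        if pep not in excluded:
            drawn.add(pep)
    return sorted(drawn)

def balanced_test_sets(binders, n_sets=25):
    """One balanced set per seed: all binders plus equally many random 9-mers."""
    sets = []
    for seed in range(n_sets):
        decoys = random_nonbinders(len(binders), seed=seed, exclude=binders)
        sets.append([(b, "b") for b in binders] + [(d, "nb") for d in decoys])
    return sets

# Two placeholder binders stand in for the 112 Stratmann binders.
test_sets = balanced_test_sets(["AAAAAAAAA", "CCCCCCCCC"], n_sets=3)
print(len(test_sets), len(test_sets[0]))  # -> 3 4
```

Averaging the AUC over the resulting sets reduces the dependence of the reported figure on any single random draw of non-binders.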

Binding Score Matrix
A k-mer motif of amino acids is characterized by a PSSM $Q = \{q_{ia}\}_{k \times 20}$, where $q_{ia}$ denotes the binding strength of site $i$ when it is occupied by amino acid $a$. The binding score of a putative motif is computed by adding the binding scores assigned to each amino acid at the respective positions. The binding score indicates the likelihood of the motif binding to the molecule. The binding score $s_i$ of sequence $x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,n})$ of length $n$ is determined by the maximum value of the binding scores computed for all k-mer subsequences in $x_i$:

$s_i = \max_{1 \le j \le n-k+1} s_{ij}$ (1)

where $s_{ij}$ denotes the binding score of the subsequence beginning at location $j$ of sequence $i$, which is given by

$s_{ij} = \sum_{l=1}^{k} q_{l,\,x_{i,j+l-1}}$ (2)

and, assuming that only one motif instance exists in every sequence, the location $j^*$ of the motif is given by

$j^* = \arg\max_{1 \le j \le n-k+1} s_{ij}$ (3)

That is, the most likely motif instance of sequence $x_i$, say $m_i$, is given by the sequence $(x_{i,j^*}, x_{i,j^*+1}, \ldots, x_{i,j^*+k-1})$.
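The scoring described above amounts to sliding a k-residue window along the peptide and keeping the best-scoring window. A minimal sketch follows; the alphabet ordering, the function name `best_core`, the 0-based start index, and the toy 3-mer matrix are assumptions made for illustration (the paper uses 9-mers).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def best_core(peptide, pssm):
    """Return (score, start, core) of the highest-scoring k-mer window.
    pssm is a list of k rows with 20 entries each; ties keep the first window."""
    k = len(pssm)
    if len(peptide) < k:
        raise ValueError("peptide is shorter than the motif")
    best_score, best_start = None, None
    for j in range(len(peptide) - k + 1):
        # sum the position-specific scores of the window starting at j
        score = sum(pssm[l][AA_INDEX[peptide[j + l]]] for l in range(k))
        if best_score is None or score > best_score:
            best_score, best_start = score, j
    return best_score, best_start, peptide[best_start:best_start + k]

# Toy 3-mer matrix that rewards L, K, W at positions 1, 2, 3.
toy = [[0] * 20 for _ in range(3)]
toy[0][AA_INDEX["L"]] = 2
toy[1][AA_INDEX["K"]] = 1
toy[2][AA_INDEX["W"]] = 3
print(best_core("GGLKWGG", toy))  # -> (6, 2, 'LKW')
```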

Self-discovery of Motif
We derive a consensus motif from the training dataset, which consists of peptides from several experiments and of varying lengths. The positions of the binding cores within the peptides are unknown. The elements of the PSSM are represented as a 20k-tuple ($q_{ia}$: $i = 1, \ldots, k$; $a \in \Omega$), where $\Omega$ represents the amino acid alphabet. Each element of the tuple is represented by a binary word of size $\theta$ so that $q_{ia} \in [0, 2^{\theta} - 1]$. The k-mer motif is therefore represented in the EA by an individual encoded as a string of length $20k\theta$. Let the population at the $t$-th iteration of the evolution be denoted by $q(t) = \{q_1(t), q_2(t), \ldots, q_M(t)\}$, where $q_j(t)$ represents an individual in a population of size $M$.
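The encoding above can be made concrete with a small decoder that turns a bit string of length 20kθ into a k × 20 matrix. The row-major layout (all 20 amino acids of position 1 first) and the plain binary, rather than Gray, coding are assumptions; the source does not specify either.

```python
def decode(chromosome: str, k: int, theta: int):
    """Decode a '0'/'1' string of length 20*k*theta into a k x 20 integer
    matrix whose entries lie in [0, 2**theta - 1]."""
    if len(chromosome) != 20 * k * theta:
        raise ValueError("chromosome length must equal 20 * k * theta")
    # read consecutive theta-bit words as unsigned integers
    values = [int(chromosome[p:p + theta], 2)
              for p in range(0, len(chromosome), theta)]
    # group the words into k rows of 20 entries
    return [values[i * 20:(i + 1) * 20] for i in range(k)]

# One motif position (k = 1) with 3-bit words: first entry 7, last entry 5.
bits = "111" + "000" * 18 + "101"
print(decode(bits, k=1, theta=3))
```

For a 9-mer motif the chromosome therefore has 180θ bits, so the word size θ directly trades resolution of the scores against the size of the search space.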
The fitness function is designed to arrive at an optimal consensus of the motif by using the training dataset. A solution is evaluated based on its ability to maximize the accuracies in identifying true binders (TP) and true non-binders (TN) as well as to widen the gap between the total scores for binders and non-binders. This is achieved by two fitness functions: $f_1$, Eq. (4), to minimize the sum of false positives (FP) and false negatives (FN), and $f_2$, Eq. (5), to minimize the ratio between the average cumulative scores of non-binders and binders. Eqs. (4) and (5) are minimized subject to two constraints, Eqs. (6) and (7), where $s(m_i)$ denotes the score computed for the most likely motif instance $m_i$ of sequence $x_i$ of the training dataset, and the Kronecker $\delta$ is one when its argument is satisfied and zero otherwise. $N_b$ and $N_{nb}$ are the total counts of binders and non-binders in the dataset.
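The two objectives and their constraints can be sketched as below, with several assumptions stated plainly because the exact forms of Eqs. (4)-(7) are not reproduced here: a peptide is called a binder when its score reaches a hypothetical `threshold`; `kappa` is applied as a weight on false positives; the constraints are read as upper bounds `alpha1` on FP and `alpha2` on FN; and binder scores are assumed to have a positive mean.

```python
def evaluate(scores, labels, threshold, kappa=1.0, alpha1=None, alpha2=None):
    """Return (f1, f2, feasible) for one candidate motif.
    scores: best-window score s(m_i) per peptide; labels: 'b' or 'nb'."""
    fp = sum(1 for s, v in zip(scores, labels) if v == "nb" and s >= threshold)
    fn = sum(1 for s, v in zip(scores, labels) if v == "b" and s < threshold)
    binder_scores = [s for s, v in zip(scores, labels) if v == "b"]
    nonbinder_scores = [s for s, v in zip(scores, labels) if v == "nb"]
    # f1: misclassification count (FP optionally weighted by kappa), minimized
    f1 = kappa * fp + fn
    # f2: mean non-binder score over mean binder score, minimized
    f2 = ((sum(nonbinder_scores) / len(nonbinder_scores))
          / (sum(binder_scores) / len(binder_scores)))
    # assumed constraint form: FP <= alpha1 and FN <= alpha2
    feasible = ((alpha1 is None or fp <= alpha1)
                and (alpha2 is None or fn <= alpha2))
    return f1, f2, feasible

scores = [10, 8, 3, 6, 2]
labels = ["b", "b", "b", "nb", "nb"]
print(evaluate(scores, labels, threshold=5, alpha1=1, alpha2=1))
```

In this toy case one non-binder scores above the threshold and one binder below it, so f1 is 2, and f2 is 4/7 because the non-binders average 4 while the binders average 7.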

Scoring of Experimental Motifs
The description of an experimental k-mer motif conveys three kinds of information at each site: (1) the amino acid occupied, (2) the tolerance level of the amino acid, and (3) the strength of binding. Let us denote a k-mer motif validated in experiment "e" by $m(e)$ and the tolerance level of the residue at site $j$ by $\rho_j$, where $\rho_j \in$ {well, weak, unknown, non-tolerated}. The binding strength of site $j$ is expressed by $\sigma_j \in$ {primary-anchor, secondary-anchor, other}. Then, the binding score for a k-mer experimental motif is given by Eq. (8).

Guided-discovery of Motif
In this algorithm, we assume that experimentally determined motifs are available along with the experimental datasets. An MOEA is proposed to determine a motif closer to the experimental motifs. An objective function $f_3$, Eq. (9), is proposed to best represent the characteristics of a motif that is close to the knowledge embedded in the experimental motifs, where $\hat{Q}$ denotes the estimated PSSM of the motif. We use the same objective function in Eq. (4) to accurately predict binders of the training dataset. The MOEA minimizes the objective functions given in Eqs. (4) and (9), subject to the two constraints given in Eqs. (6) and (7).
The summation in Eq. (9) is taken over all the experimental motifs. The elements in the PSSM of experimental motifs are set to values within the same range $[0, 2^{\theta} - 1]$ as before. The following procedure is adopted to determine the elements of $Q(m(e))$: a well tolerated amino acid at an anchor position of the motif receives the highest possible score of $2^{\theta} - 1$; the lowest score of zero is assigned to a non-tolerated residue; weakly tolerated residues and residues at secondary anchor positions receive a score of $(2^{\theta} - 1)/2$; and all the other unknown positions receive a score of $(2^{\theta} - 1)/3$.
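The scoring rules for an experimental motif and the distance used in the guided objective can be sketched as follows. This is one possible reading of the rules: the input format, the function names, the treatment of a well tolerated residue at a secondary anchor as scoring half the maximum, and the unweighted sum over motifs in `f3` are all assumptions, and the example motif position is made up.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def experimental_pssm(motif, theta):
    """motif: one (anchor_type, {amino_acid: tolerance}) pair per position,
    anchor_type in {'primary', 'secondary', 'other'},
    tolerance in {'well', 'weak', 'non'}; unlisted residues are 'unknown'."""
    top = 2 ** theta - 1
    rows = []
    for anchor, tolerance in motif:
        row = []
        for aa in AMINO_ACIDS:
            t = tolerance.get(aa, "unknown")
            if t == "non":
                row.append(0.0)                 # non-tolerated residue
            elif t == "well" and anchor == "primary":
                row.append(float(top))          # well tolerated at an anchor
            elif t == "weak" or (t == "well" and anchor == "secondary"):
                row.append(top / 2)             # weak or secondary anchor
            else:
                row.append(top / 3)             # everything else
        rows.append(row)
    return rows

def f3(q_hat, experimental_pssms):
    """Sum over experimental motifs of the squared element-wise differences."""
    return sum((a - b) ** 2
               for q_e in experimental_pssms
               for row_hat, row_e in zip(q_hat, q_e)
               for a, b in zip(row_hat, row_e))

# A single hypothetical primary-anchor position with theta = 4 (max score 15).
q_e = experimental_pssm([("primary", {"L": "well", "D": "non", "A": "weak"})], 4)
zeros = [[0.0] * 20]
print(f3(q_e, [q_e]), f3(zeros, [q_e]))  # -> 0.0 706.25
```

A candidate matrix identical to every experimental matrix would reach the minimum of zero, so minimizing this distance pulls the evolved motif towards the experimental knowledge.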

Performance Comparison
The binding scores of I-Ag7 experimental motifs were computed using Eq. (8) by assigning the following values for binding strengths: primary = 4, secondary = 2, and others = 1, and for tolerance levels: well = 4, weak = 2, non-tolerated = -4, and unknown = 0. The experimentally determined motifs were used with peptide data in the guided-discovery of motifs.
We used AUC to compare the performance of the proposed methods with earlier approaches [28,34] and experimental motifs [38][39][40][41][42][43][44]. Whether a peptide is a binder or a non-binder is determined by a threshold on the binding score. By varying this threshold, the ROC curve was plotted, from which the AUC value was obtained. A comparison of the performances of the methods is given in Figure 1.
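The threshold sweep described above can be sketched as follows: sorting the peptides by decreasing score and lowering the threshold one distinct score at a time traces the ROC curve, whose area is accumulated with the trapezoidal rule. The function name and the toy scores are illustrative.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve; labels are 'b' (binder) or 'nb'."""
    pos = sum(1 for v in labels if v == "b")
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one binder and one non-binder")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    auc = 0.0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # admit every peptide whose score equals the current threshold
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == "b":
                tp += 1
            else:
                fp += 1
            i += 1
        tpr, fpr = tp / pos, fp / neg
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2   # trapezoid
        prev_tpr, prev_fpr = tpr, fpr
    return auc

print(roc_auc([0.9, 0.8, 0.7, 0.6], ["b", "nb", "b", "nb"]))  # -> 0.75
```

The value 0.75 matches the rank interpretation of AUC: of the four binder/non-binder pairs in the toy data, three have the binder scored higher.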
In order to compare with the MEME method, only binders in the I-Ag7 training set were submitted to the MEME motif discovery tool at the prediction server [71]. The motif of 9-mer length was obtained with the following options: zero or one motif per sequence, minimum and maximum width = 9. The performance of the RANKPEP approach on the testing dataset was evaluated by uploading the dataset to the online prediction server at [72] with a 4% binding threshold [34].

Benchmark Datasets
The proposed self-discovery approach was tested on BM-Set1, i.e., HLA-DRB1*0401, which consists of one training set and 10 testing datasets and had earlier been used to benchmark a number of motif finding algorithms [25,26,32,73]. The performance of MOEA was compared with earlier methods [8,10,32,35].
The training set consisting of binders and non-binders was assembled as follows: an ensemble of 532 unique binding peptides was extracted from the SYFPEITHI [44] and MHCPEP [63] databases and a set of 177 unique non-binders was extracted from the MHCBN database [20]. The datasets were pre-processed by removing peptides that did not allow a hydrophobic residue at the P1 position of all putative 9-mer binding cores and unnatural peptides containing more than 75% alanine [32].
Of the 10 testing datasets, 8 were taken from the MHC-bench as described in [74]. The other 2 datasets were extracted from experiments described by Southwood [75] and Geluk [76]. An affinity of IC50 = 1000 nM was taken as the threshold for peptide binding as described in [75]. Homology reduction had been carried out on all datasets in order to reduce the chances of over-fitting due to the redundancy of datasets. The peptides in the non-redundant (NR) datasets had sequence similarities of less than 90%. The numbers of binders and non-binders in the original and NR datasets are given in Table 5.
We tested our method on BM-Set2, comprising 3 mouse alleles and 13 HLA alleles, made available at [77]. These quantitative peptide datasets had been extracted from the IEDB at [78]. The number of binders and non-binders in each dataset is given in Table 7. The DRB3-0101 allele dataset was excluded from the benchmark dataset because of the significant imbalance between binders and non-binders (3 binders and 99 non-binders). With this dataset, we compared our method with [8,10,11,32,35].

Figure 1
Comparison of Performances. Comparison of the performance of MOEA based algorithms (self-discovery and guided-discovery) against MEME, RANKPEP, and experimental motifs on the balanced I-Ag7 test datasets (the performance was averaged over 25 test datasets).

In the fitness functions and constraints of the self-discovery approach, the constant $\kappa$ ($> N_b/N_{nb}$ for $N_b > N_{nb}$, or vice versa) was empirically determined to minimize the number of false positives. The two parameters $\alpha_1$ ($\ll N_{nb}$) and $\alpha_2$ ($\ll N_b$) are set to minimize FP and FN rates, respectively. If none of the individuals satisfies the constraints in Eqs. (6) and (7), MOEA reports no feasible solution. Given the training set, a few trial runs with different initializations are necessary to determine the best values of $\alpha_1$ and $\alpha_2$.
In Eq. (9), $|\hat{Q} - Q(m(e))|$ is the sum of squares of differences between individual elements of the estimated weight matrix $\hat{Q}$ and $Q(m(e))$. The knowledge of the experimental motif is incorporated into the consensus motif adaptively through the distance function used in $f_3$. Further, the fitness $f_1$ optimizes the specificity and sensitivity of the prediction of binders.
The preprocessed BM-Set1 training data comprised 456 unique peptides with a length distribution ranging from 9 to 30 amino acid residues.

Table 3: Validation of I-Ag7 experimental motifs
Performance measured by AUC of experimentally determined I-Ag7 motifs on their own datasets and other experimental datasets.

Table 7: Description of peptides in BM-Set2
The number of binders and non-binders in each of the datasets in BM-Set2. The datasets in BM-Set2 were obtained from [77]. The DRB3-0101 allele dataset was excluded from the performance comparison due to significant imbalance in the dataset (3 binders and 99 non-binders).

Table 8: Comparison of Performance on BM-Set2
Comparison of AUC values from five-fold cross-validation of allele datasets given in BM-Set2. "-" indicates that the allele is unavailable for testing with the respective prediction method.
BMC Bioinformatics 2007, 8:459 http://www.biomedcentral.com/1471-2105/8/459