Skip to main content

USMPep: universal sequence models for major histocompatibility complex binding affinity prediction



Immunotherapy is a promising route towards personalized cancer treatment. A key algorithmic challenge in this process is to decide if a given peptide (neoepitope) binds with the major histocompatibility complex (MHC). This is an active area of research and there are many MHC binding prediction algorithms that can predict the MHC binding affinity for a given peptide to a high degree of accuracy. However, most of the state-of-the-art approaches make use of complicated training and model selection procedures, are restricted to peptides of a certain length and/or rely on heuristics.


We put forward USMPep, a simple recurrent neural network that reaches state-of-the-art approaches on MHC class I binding prediction with a single, generic architecture and even a single set of hyperparameters both on IEDB benchmark datasets and on the very recent HPV dataset. Moreover, the algorithm is competitive for a single model trained from scratch, while ensembling multiple regressors and language model pretraining can still slightly improve the performance. The direct application of the approach to MHC class II binding prediction shows a solid performance despite of limited training data.


We demonstrate that competitive performance in MHC binding affinity prediction can be reached with a standard architecture and training procedure without relying on any heuristics.


Immunotherapy is a promising route towards personalized cancer treatment with a variety of possible realizations, see [14] for recent reviews. One path is the administration of nanoparticle vaccines customized with neoantigens. The major histocompatibility complex plays a central role in this process as it is supposed to bind to peptides derived from proteins of the cell or from pathogens in order to display them on the surface of the cell for recognition by T-cells. There are three classes of MHC molecules, where MHC class I and II are most important due to their involvement in the targeted immune response. Due to the special nature of the MHC protein, it can bind to peptides that are potentially structurally very different from each other. Therefore, the prediction if a MHC molecule binds to certain peptide is a very challenging task that is, however, a crucial sub task for neoantigen identification for practical realizations of personalized immunotherapy [1].

MHC binding prediction is a well-established problem in bioinformatics with a large number of existing algorithmic solutions. Although many of these algorithms show an excellent performance, they typically rely on complicated training procedures to achieve this performance, such as pretraining on prediction tasks for related alleles or training with artificial negative peptides. In addition, existing solutions use complicated model selection procedures to select a small number of well-performing models from potentially hundreds of trained models to eventually construct an ensemble classifier. Most of the existing approaches are restricted to peptides of fixed length, where shorter sequences are padded or longer sequences are trimmed to an appropriate length by well-motivated but still heuristic rules to identify so-called binding regions. The most prominent MHC I prediction algorithms are summarized in Table 1. We refer to dedicated reviews for more detailed comparisons [5, 6].

Table 1 Comparison of MHC I prediction tools

Finally, not all binding prediction tools are evaluated on standard benchmark datasets, which reduces the comparability, and, even where this is the case, it is often hard to disentangle algorithmic advancements from improvements due to larger amounts of training data. In addition, statements about the generalization in the sense of the algorithm’s performance when applied to unseen data are often difficult due to potential overlaps between train and test sets, in particular as training sets often remain undisclosed. This urges for the creation of benchmark repositories, where the existing data are processed in a standardized fashion and split into training, validation and test sets.

In this manuscript, we demonstrate that state-of-the-art performance can be reached with a straightforward approach: We use a single-layer recurrent neural network that is trained end-to-end on a regression task without any task-specific prior knowledge such as fixed embeddings in the form of amino acid similarity matrices. By construction, this model is able to incorporate input of variable length without the need for heuristics, such as for the identification of binding regions. The model is trained using standard training procedures without any artificial data or pretraining on related classification tasks. Even single models are very competitive. Ensembling or language model pretraining only slightly improve this performance. We fix hyperparameters only once and use standard benchmark datasets to assess the model performance. We provide, amongst others, evaluation results on the recently published HPV dataset [12], demonstrating an excellent performance, which strongly suggests that the measured model performance generalizes to unseen peptide data.

Recurrent architectures have already been used previously for MHC binding prediction [13, 14] and we discuss in more detail how USMPep stands out from these approaches. MHCnuggets [13] is rather similar to the proposed approach (apart from the use of fixed embeddings), but relies on a complex transfer learning protocol to achieve its performance. Only limited benchmarking results are available, which makes it difficult to realistically assess its prediction performance. The very recent MHCSeqNet [14] also uses a recurrent architecture, again with pretrained rather than learned embeddings, incorporating both peptide and allele sequence to train a single prediction model for all alleles. However, the paper frames the prediction task as a classification task, which makes it difficult to align the results with the large number of existing benchmark datasets that are predominantly targeted at regression tasks. Nevertheless, the inclusion of the allele sequence represents an exciting opportunity for MHC binding affinity prediction in particular in the light of recent advances in natural language processing on tasks involving two input sequences such as question answering tasks.


USMPep: universal sequence models for peptide binding prediction

The approach builds on the UDSMProt-framework [15] and related work in natural language processing [16]. We distinguish two variants of our approach, either train the regression from scratch or employ language model pretraining. A language model tries to predict the next token given the sequence up to this token, on unlabeled sequence data, here: of simulated proteasome-cleaved peptides. The architecture of the language model is, at its core, a recurrent neural network regularized by different kinds of dropout, and more specifically an averaged stochastic gradient descent weight-dropped long short-term memory (AWD LSTM) model [17]. After the language model pretraining step, the model is finetuned on the regression task of MHC binding prediction by replacing the output layer with a concat pooling layer and two fully connected layers, see Fig. 1 for a schematic representation. The setup closely follows that used in [15], where protein properties were predicted. The smaller dataset sizes and shorter sequence lengths in the peptide setting (in comparison to protein classification) do not allow for building up large contexts and were accounted for by the reduction of the number of layers from 3 to 1, of the number of hidden units from 1150 to 64 and of the embedding size from 400 to 50.

Fig. 1
figure 1

Schematic representation of the model architecture

Similar to [15], the training procedure included 1-cycle learning rate scheduling [18] and discriminative learning rates [16] during finetuning. Target variables for the regression model were log-transformed half-maximal inhibitory concentration (IC50)-values and a modified mean-squared error loss function [11] that allows to incorporate qualitative data.

Dropout rate, the number of training epochs, hidden layers, hidden units and embedding dimensions, were set based on selected alleles of a particular MHC class I dataset (Kim14 [19], see the detailed description below) by using the score on one of the provided cross-validation folds. The learning rate was determined based on range tests [18]. After this step, the aforementioned hyperparameters were kept fixed for all datasets and alleles both for MHC class I and class II prediction. In particular, neither hyperparameters nor models were selected based on test set scores.

For later convenience, the following acronyms refer to the prediction tools introduced in this work:

  • USMPep_FS_sng single prediction model trained from scratch

  • USMPep_FS_ens ensemble of ten prediction models trained from scratch

  • USMPep_LM_sng single prediction model with language model pretraining

  • USMPep_LM_ens ensemble of ten prediction models with language model pretraining

For simplicity, we consider ensembles of models with identical architectures and hyperparameters and average the final individual predictions.

MHC binding prediction datasets

For the downstream task of peptide MHC binding prediction, we benchmarked our model on three MHC class I and one MHC class II binding affinity datasets (details listed in Table 2). These datasets comprise peptide sequences and the corresponding binding affinities to specific MHC alleles.

Table 2 Details of training and test datasets

Kim14 is a commonly used binding affinity dataset compiled by [19], available on the Immune Epitope Database (IEDB)Footnote 1 [20], and is split into a non-overlapping training (BD2009) and test set (Blind). Similar peptides (of same length with at least 80% sequence identity) shared by training and test set were removed from Blind. For BD2009, we selected the provided cross-validation split without similar peptides between the subsamples (”cv_gs"). There are 53 class I alleles (human and mouse/ macaque alleles) with respectively 117326 and 27680 affinity measurements in BD2009 and Blind. For comparability with recently developed systematical benchmarks [5, 12], we tested USMPep on two further MHC I datasets, which we refer to as HPV and IEDB_16. The training data of the tools reported in the literature vary in size and compilation.

We trained our models on data provided by [11] and refer to this dataset as MHCFlurry18. It is assembled from an IEDB snapshot of December 2017 and the Kim14 dataset.

HPV is a recently published dataset of 743 affinity measurements of peptides derived from two human paillomavirus 16 (HPV16) proteins binding to seven human leukocyte antigen (HLA) class I alleles [12]. Peptides were considered as binders if they had IC50-values below 100000nM. For peptides classified as non-binders, quantitative measurements are not available.

IEDB16_I is made up of an IEDB snapshot of October 2016 [5]. It was filtered for quantitative measurements with IC50≤ 50000nM and 9mer peptides. Training sequences of other tools were removed from the dataset. It consists of 2827 affinity measurements across 32 class I alleles. We removed any sequences occurring in the test dataset from our training data MHCFlurry18.

In addition, we trained and tested USMPep on MHC class II binding data: Wang10 is an experimental binding affinity dataset from the IEDB site based on the dataset by [21]. We used it to train our prediction tools.

IEDB16_II is a MHC II test dataset provided by [5] from the same IEDB snapshot as the MHC I IEDB16_I test set above, filtered for quantitative measurements with IC50≤50000nM and 15mer peptides. After removing sequences present in the training data, 15034 affinity measurements covering 24 alleles remained in the test dataset. We benchmarked our models on this dataset.

Evaluation metrics

For performance evaluation, we consider two evaluation metrics that are most frequently considered in the literature [5, 11]: The area under the receiver operating characteristic curve (AUC ROC) measures the performance of binary classifying binders and non-binders. While AUC ROC is straightforward to evaluate, it comes with the disadvantage of having to specify a threshold to turn the targets into binary labels, which discards valuable label information during the evaluation procedure. Commonly applied threshold values exist for the datasets under consideration, as discussed in the previous section, but the simplicity of this procedure neglects a possible allele dependence of these threshold values [22]. Ranking metrics such as Spearman r evaluate the correlation between the rankings of measured and predicted affinities and circumvent this issue. Spearman r can only be evaluated for quantitative measurements, which discards information on test sets that contain also qualitative measurements. For both metrics, we calculated error bars based on 95% empirical bootstrap confidence intervals. For single models, we report the mean performance across 10 runs and the maximal deviation of the point estimate compared to the lower and upper bounds provided by the respective confidence intervals as a conservative error estimate. An alternative approach, which is taken for example in [14], is to frame the problem as a classification rather than a regression problem i.e. to predict the probability of binding of a peptide to a given allele and to use AUC ROC as performance metric. In principle, the raw prediction of our models before transforming back to IC50-values could also be interpreted as probabilities and are also available from our code repository to ensure straightforward comparability, even though the results between regression and classification models are only partially comparable due to different training objectives.

The prediction performance across different alleles that make up a single MHC benchmark dataset can be quantified in different ways. Overall performance measures can be calculated across multiple alleles by concatenating all target and prediction results and evaluating the respective metrics on this set. This predominantly used but rarely discussed method has to be contrasted with reporting the mean or the median of the respective performance measures across all alleles, which is the default evaluation metric for related tasks such as remote homology detection [23] or transcription factor binding site prediction [24]. The difference between both evaluation approaches is related to the discussion about micro vs. macro averages for the evaluation of multi-class classification problems [25]. In particular, there are two fundamental differences between both evaluation approaches: First, the datasets enter the overall score with different weights determined by the size of the respective test sets, which is a weighting based on the experimental availability of binding affinities whereas the mean score assigns equal weight to all test sets. Second, the overall performance measure implicitly assumes that prediction scores are directly comparable across different alleles, which seems slightly questionable in the light of the discussion of allele-dependent binding thresholds [22]. To give the reader a complete picture of the prediction performance, we will report overall as well as mean scores. In any case, we advocate to provide individual prediction for all peptides, which allows to possibly redo the analysis using a different performance metric at a later point in time. To this end, the peptide-wise binding affinity predictions for our tools are provided in the accompanying code repository.


The results section is organized as follows: In “MHC class I binding prediction” section, we present a detailed evaluation of the performance of USMPep for MHC class I binding affinity prediction. Three different benchmark datasets highlight different performance characteristics. In “MHC class II binding prediction” section, we investigate the applicability of our methods for MHC class II binding affinity prediction. Finally, we discuss language modeling on peptide data and its impact on downstream performance in “Language modeling on peptide data and its impact on downstream performance” section.

MHC class I binding prediction

IEDB16 dataset

We open the assessment of MHC class I binding prediction with results on the IEDB16 dataset that showcases the excellent predictive performance of USMPep. We compare to literature results that were evaluated in a recent comprehensive benchmark [5] on this dataset. This benchmark includes evaluation metrics testing not only accuracy of binder classification, but also accuracy of binding affinity ranking and of direct binding affinity prediction accuracy. Covering 32 HLA alleles, the IEDB16 dataset reflects a broad spectrum of MHC molecules.

In Fig. 2, we show overall AUC ROC and overall Spearman r as reported by [5] for the latest versions of the NetMHC tools, MHCFlurry, SMMPMBEC and consensus and our scores for the different versions of USMPep. This is supplemented by mean AUC ROC and mean Spearman r compared to results provided in the data repository accompanying [5]. For mean AUC ROC and Spearman r error bars could not be calculated for the literature approaches due to the fact that only allele-wise scores but no peptide-wise predictions were provided. In the light of the issues discussed in “Evaluation metrics” section, we advocate the use of mean scores rather than overall scores. For easy comparability, we also provide overall scores as they are used predominantly in the literature. It turns out that an ensemble of ten predictors with language model pretraining (USMPep_LM_ens), reaches the highest scores in both mean evaluation metrics. In this respect, the results of all four USMPep-variants are consistent with each other and similar (within error bars) to the result of MHCFlurry, the best-performing method in the benchmark [5]. This result stresses the claims of excellent prediction performance even for a single model trained from scratch. Interestingly, the performance of all proposed prediction tools is slightly worse when considering overall scores. The error boundaries of our tools barely touch those of MHCFlurry with regard to overall Spearman r. In particular, in terms of overall AUC ROC none of our predictors is consistent with MHCFlurry within error bars. We further investigated the origin of this performance deficiency and found that it could be traced back to a single allele, HLA-B-3801, which is peculiar in the sense that 172 of the 176 test set samples fall into a single Hobohl cluster [19] of sequences with more than 80% sequence similarity, i.e. show a particularly high sequence identity that is not seen in other test datasets. These 172 samples constitute a sizable amount of the overall 2827 test samples and strongly influence the predictive performance when using overall performance metrics. With the exceptions in terms of the overall metrics for the IEDB16 dataset, our proposed methods are consistent with the best-performing methods for all MHC I benchmark datasets both for overall and mean performance metrics.

Fig. 2
figure 2

Comparison of MHC class I predictors. AUC ROC and Spearman r are evaluated on predictions for the IEDB16_I test set (AUC ROC could not be evaluated for alleles HLA-B-2704, HLA-B-1503 and HLA-B-1501, whereas Spearman r could not be computed for alleles HLA-B-1503 and HLA-B-1501. These alleles are therefore not included in the scores.)

HPV dataset

As the training data is not publicly available for some MHC I prediction tools, a possible overlap between training and test datasets and correspondingly an overestimation of the predictive performance cannot be excluded. The same applies to the most common procedure of reducing the overlap between training and test set by merely removing sequences from the test set that are also contained in the training set in identical form rather than using more elaborate measures for sequence similarity. These issues can be circumvented by a performance evaluation on a dataset of different origin that has so far not been used to train MHC prediction tools. This applies to the recently released HPV binding affinity data [12]. However, in this benchmark, it is not possible to disentangle superior prediction performance due to larger amounts of training data from algorithmic advances since size and compilation of the training set of the algorithms vary.

There are only quantitative measurements available for the peptides considered as binders and we therefore chose to evaluate the predictive performance with AUC ROC. We report the performance of all models considered in [12] and our tools measured by AUC ROC in Table 3, where we used the predictions provided by [12]. Our USMPep tools show an excellent prediction performance. For three out of seven alleles, an USMPep-model even reaches the highest AUC ROC. All neural-network-based predictors show a similar AUC ROC evaluated across all measurements in the dataset, while the ensemble with language model pretraining (USMPep_LM_ens) shows the highest mean and overall scores among all prediction tools. As for the IEDB16 dataset, even the single model USMPep-tools are very competitive.

Table 3 Benchmarking MHC class I predictors on recently published binding affinity data (HPV16), see also Table 4 for allele-wise scores
Table 4 Allele-wise results on (HPV16), see also Table 3 for mean and overall scores

It is instructive to investigate the performance of the different MHC prediction tools restricted to peptides of a certain length, which is only possible for the HPV dataset, where peptide-wise predictions for all literature approaches are provided. The result of such an analysis is shown in Fig. 3. Our tools outperform the other models on only the 11mer peptides. This observation can be explained by the fact that the internal state of the recurrent neural network has to build up over the sequence. The longer the peptide, the more context is available, which is why USMPep generates comparably more accurate predictions for long sequences than for shorter ones.

Fig. 3
figure 3

Evaluating MHC class I predictors on recently published binding affinity data (HPV16) grouped by peptide length. Predictive performance is evaluated by mean AUC ROC. For allele-wise and overall performance comparisons see Tables 4 and 3. Percentages in brackets indicate the proportion of peptides of that particular length. For example, bars at 8(15%) show results for peptides that are eight amino acids long, which had a share of 15% of all peptides in total

Kim14 dataset

As final benchmark dataset for MHC class I prediction, we consider the Kim14 dataset that is interesting for a number of reasons. In order to investigate how the predictive power of our approach depends on the size of the training data set, we trained and tested our model on the Kim14BD2009 and Blind data. The authors of [11] kindly provided us with the Blind predictions of their tool trained on BD2009, which allow for a direct comparison with a state-of-the-art tool. Corresponding training routines are by now also available in the code repository accompanying [11].

First, we compare the prediction success measured by AUC ROC and Spearman r computed across all alleles (Fig. 4). No MHCFlurry predictors exist for alleles HLA-B-2703, HLA-B-0803 and HLA-B3801 with rank 45, 49 and 52 due to insufficient training data. These alleles were therefore also excluded for the scores of our tools. The predictors perform very similarly with regard to all metrics. Due to the different scales of the metrics, the minor performance variations appear to be more pronounced for Spearman r in comparison to AUC ROC. Nevertheless, the ranking of the almost equally performing predictors remains consistent. Our pretrained tool USMPep_LM_ens performs only slightly better than USMPep_FS_ens trained from scratch. This also holds for the single model versions. Both USMPep ensemble predictors are compatible with MHCFlurry.

Fig. 4
figure 4

Performance of USMPep and MHCFlurry on MHC class I binding prediction. Both models were trained on the Kim14BD2009 data AUC ROC and Spearman r were evaluated on the predictions for the Blind test set (AUC ROC could not be evaluated for allele HLA-B-4601, whereas Spearman r could not be computed for allele HLA-B-4601 and HLA-B-2703. These alleles are therefore not included in the scores.)

Second, to examine the impact of the training set size, we report allele-wise Spearman r scores in Fig. 5 for our predictors and MHCFlurry. The alleles are ranked by the size of the corresponding training set. While 9528 training sequences exist for the rank 0 MHC molecule HLA-A-0201, there are only 136 training peptides for allele HLA-B-3801 with rank 52. Spearman r is only shown for alleles with more than 25 quantitative measurements. The variance of the allelwise performances of the different tools becomes more pronounced the less training data are available. However, none of the models outperforms the others for the subset of alleles with less than 1000 training data points (rank range33 to 52). When averaging over the alleles with rank 33 to 52, the mean Spearman r scores of 0.54(5), 0.57(4), 0.57(5) for USMPep_FS_ens, USMPep_LM_ens and MHCFlurry, respectively, remain consistent within error bars. This observation is interesting considering the fact that for alleles with fewer than 1000 training measurements, MHCFlurry was pretrained on an augmented training set with measurements from BLOSUM similar alleles, USMPep_LM_ens was pretrained on a large corpus of unlabeled peptides and USMPep_FS_ens in contrast only saw the training sequences corresponding to one MHC molecule. These results stress that further efforts might be required to truly leverage the potential of unlabeled peptide data in order to observe similar improvements as seen for proteins [15] in particular for small datasets.

Fig. 5
figure 5

Performance of our MHC I prediction tools compared to MHCFlurry on single alleles. Spearman r was calculated for predictions on the Kim14Blind data for alleles with more than 25 quantitative measurements. The predictors were trained on Kim14BD2009. The alleles are ranked by the size of the corresponding training set (9528 peptides for rank 0 to 136 peptides with rank 52). No MHCFlurry predictors were provided for alleles HLA-B-2703, HLA-B-0803 and HLA-B-3801 with rank 45, 49 and 52

MHC class II binding prediction

Turning to MHC Class II binding prediction, we aim to demonstrate the universality of our approach beyond its applicability to different MHC I alleles. Here, we stress again that we use the same model architecture, the same pretrained language model in case of pretraining, and even the same set of hyperparameters for all MHC class I and class II alleles. The main difference between and MHC class I and class II binding prediction is the typically larger length of 15 amino acids for MHC class II compared to at most 11 for MHC class I. The analysis of the prediction performance in dependence of the length of the peptide in the previous section suggests that this setting is particularly suitable for the USMPep prediction tools. Unfortunately, the reported literature results vary widely concerning the selection of training data, which makes it difficult to distinguish between algorithmic improvements and improvements due to larger amounts of training data.

The USMPep prediction tools, and in particular the ensemble variants, show a solid performance compared to literature results, see Fig. 6. Whereas the USMPep-predictors always provided the best-performing method for MHC class I prediction, it is outperformed for MHC class II by NetMHCIIpan and nn_align. We deliberately decided to train on Wang10 instead of a more recent IEDB snapshot to work on a well-defined published dataset. However, this makes it hard to assess if the performance differences between our results and the best-performing methods can be attributed to the fact that the USMPep-predictors were trained using IEDB data up to 2010 whereas in particular the best-performing tools were trained on larger amounts and more recent data or if there a particular intricacies inherent to the MHC class II prediction task.

Fig. 6
figure 6

Comparison of MHC class II predictors. AUC ROC and Spearman r were evaluated on predictions for the IEDB16_II test set

Language modeling on peptide data and its impact on downstream performance

As final analysis, we analyze language modeling on peptide data and its impact on MHC binding affinity prediction as downstream task. To this end, we constructed a dataset of simulated proteasome-cleaved peptides to pretrain USMpep on a large corpus of unlabeled sequences. We filtered the SwissProt release 2018_10 for the human proteome and employed NetChop [26] to obtain proteasome cleavage sites for these proteins. The stochastic process of protein slicing was modeled by cutting with the cleavage probability provided by NetChop. We discarded sequences of less than eight and more than 20 amino acids length and obtained 6547641 peptides. We compare the performance of a peptide language model to that of a language model trained on human protein data using prediction accuracy as metric.

The results in terms of language model performance along with the corresponding downstream performance (MHC) on the regression task are compiled in Table 5 and allow a number of interesting observations: First, the language model performance increases considerably when training on (proteasome-cleaved) peptide data in accordance with expectations. It is crucial to remark, that the language modeling task on peptide data poses additional difficulties compared to language modeling on protein data as the sequences are comparably short and the model thus cannot build up a lot of context. Additionally, the model does not only have to learn the normal language model task for protein data but implicitly has to learn to stochastically predict cleavage sites. Second, even we evaluated on protein data, the protein language model only reaches an accuracy of 0.137, which is considerably lower than the accuracy of 0.41 reported in the literature [15]. This effect is a direct consequence of the considerably smaller model size (1 instead of 3 layers; 64 instead of 1150 hidden units; embedding size of 50 instead of 400).

Table 5 Language model and MHC class I binding affinity prediction performance

The details of the language model pretraining directly impact the downstream performance and show a consistent trend across all experiments described above even though the differences in downstream performance stay small and mostly remain consistent within error bars. In line with the general trend, the most downstream-task-adapted pretraining on peptide data performs best, generally performing slightly better than the corresponding model trained from scratch. In contrast, pretraining on protein data in general even leads to a loss in performance compared to training from scratch.


In this work, we put forward USMPep, a recurrent neural network that consistently shows excellent performance on three popular MHC class I binding prediction datasets as well as a solid performance on MHC class II binding prediction, see Table 6 for a performance summary. Most remarkably, this is achieved with a standard training procedure without incorporating artificial negative peptides, complicated transfer learning protocols or ensembling strategies and without relying on heuristics.

Table 6 Performance summary: Rank of USMPep compared to competitors across the different datasets

A central issue that prevents a true comparability of algorithmic approaches to the problem is the fact that the datasets that were used to train the prediction models differ between different literature approaches and are often not publicly available. This entangles the predictive power of a given algorithm with the data it was trained on. This urges for the creation of an appropriate benchmarking repository along with standardized evaluation procedures to allow for a structured benchmarking of MHC binding prediction algorithms. As a first step, we advocate to provide binding affinity predictions for all peptides to allow fine-grained comparisons of the overall predictive performance even at a later stage as opposed to reporting just a single score summarizing the performance across all datasets.

Availability and requirements

Project name: USMPepProject home page: system(s): Platform independentProgramming language: PythonOther requirements: see project homepageLicense: BSDAny restrictions to use by non-academics: as permitted by BSD License





Major histocompatibility complex


Averaged stochastic gradient descent weight-dropped long short-term memory


Immune Epitope Database


Area under the receiver operating characteristic curve


Blocks substitution matrix


Amino acid


Neural network


Peptide MHC binding energy covariance


  1. Scheetz L, Park KS, Li Q, Lowenstein PR, Castro MG, Schwendeman A, Moon JJ. Engineering patient-specific cancer immunotherapies. Nat Biomed Eng. 2019.

  2. Sahin U, Türeci Ö. Personalized vaccines for cancer immunotherapy. Science. 2018; 359(6382):1355–60.

    Article  CAS  PubMed  Google Scholar 

  3. Schumacher TN, Schreiber RD. Neoantigens in cancer immunotherapy. Science. 2015; 348(6230):69–74.

    Article  CAS  PubMed  Google Scholar 

  4. Hu Z, Ott PA, Wu CJ. Towards personalized, tumour-specific, therapeutic vaccines for cancer. Nat Rev Immunol. 2018; 18(3):168.

    Article  CAS  PubMed  Google Scholar 

  5. Zhao W, Sher X. Systematically benchmarking peptide-MHC binding predictors: From synthetic to naturally processed epitopes. PLOS Comput Biol. 2018; 14(11):1006457.

    Article  Google Scholar 

  6. Mei S, Li F, Leier A, Marquez-Lago TT, Giam K, Croft NP, Akutsu T, Smith AI, Li J, Rossjohn J, Purcell AW, Song J. A comprehensive review and performance evaluation of bioinformatics tools for HLA class I peptide-binding prediction. Brief Bioinformatics. 2019.

  7. Kim Y, Sidney J, Pinilla C, Sette A, Peters B. Derivation of an amino acid similarity matrix for peptide:MHC binding and its application as a Bayesian prior. BMC Bioinformatics. 2009; 10(1):394.

    Article  PubMed  PubMed Central  Google Scholar 

  8. Moutaftsi M, Peters B, Pasquetto V, Tscharke DC, Sidney J, Bui H-H, Grey H, Sette A. A consensus epitope prediction approach identifies the breadth of murine TCD8+-cell responses to vaccinia virus. Nat Biotechnol. 2006; 24(7):817–9.

    Article  CAS  PubMed  Google Scholar 

  9. Andreatta M, Nielsen M. Gapped sequence alignment using artificial neural networks: application to the MHC class I system. Bioinformatics. 2015; 32(4):511–7.

    Article  PubMed  PubMed Central  Google Scholar 

  10. Jurtz V, Paul S, Andreatta M, Marcatili P, Peters B, Nielsen M. NetMHCpan-4.0: Improved Peptide–MHC Class I Interaction Predictions Integrating Eluted Ligand and Peptide Binding Affinity Data. J Immunol. 2017; 199(9):3360–8.

    Article  CAS  PubMed  Google Scholar 

  11. O’Donnell TJ, Rubinsteyn A, Bonsack M, Riemer AB, Laserson U, Hammerbacher J. MHCflurry: Open-Source Class I MHC Binding Affinity Prediction. Cell Syst. 2018; 7(1):129–1324.

    Article  PubMed  Google Scholar 

  12. Bonsack M, Hoppe S, Winter J, Tichy D, Zeller C, Küpper MD, Schitter EC, Blatnik R, Riemer AB. Performance Evaluation of MHC Class-I Binding Prediction Tools Based on an Experimentally Validated MHC–Peptide Binding Data Set. Cancer Immunol Res. 2019; 7(5):719–36.

    Article  CAS  PubMed  Google Scholar 

  13. Bhattacharya R, Sivakumar A, Tokheim C, Guthrie VB, Anagnostou V, Velculescu VE, Karchin R. Evaluation of machine learning methods to predict peptide binding to MHC Class I proteins. bioRxiv. 2017.

  14. Phloyphisut P, Pornputtapong N, Sriswasdi S, Chuangsuwanich E. MHCSeqNet: a deep neural network model for universal MHC binding prediction. BMC Bioinformatics. 2019; 20(1).

  15. Strodthoff N, Wagner P, Wenzel M, Samek W. UDSMProt: universal deep sequence models for protein classification. Bioinformatics. 2020; 36(8):2401–9.

    Article  PubMed  PubMed Central  Google Scholar 

  16. Howard J, Ruder S. Universal language model fine-tuning for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne: Association for Computational Linguistics: 2018. p. 328–339.

    Google Scholar 

  17. Merity S, Keskar NS, Socher R. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182. 2017.

  18. Smith LN. A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820. 2018.

  19. Kim Y, Sidney J, Buus S, Sette A, Nielsen M, Peters B. Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions. BMC Bioinformatics. 2014; 15(1):241.

    Article  PubMed  PubMed Central  Google Scholar 

  20. Vita R, Mahajan S, Overton JA, Dhanda SK, Martini S, Cantrell JR, Wheeler DK, Sette A, Peters B. The Immune Epitope Database (IEDB): 2018 update. Nucleic Acids Res. 2018; 47(D1):339–43. doi:10.1093/nar/gky1006.

    Article  Google Scholar 

  21. Wang P, Sidney J, Kim Y, Sette A, Lund O, Nielsen M, Peters B. Peptide binding predictions for HLA DR, DP and DQ molecules. BMC Bioinformatics. 2010; 11(1):568.

    Article  PubMed  PubMed Central  Google Scholar 

  22. Paul S, Weiskopf D, Angelo MA, Sidney J, Peters B, Sette A. HLA Class I Alleles Are Associated with Peptide-Binding Repertoires of Different Size, Affinity, and Immunogenicity. J Immunol. 2013; 191(12):5831–9.

    Article  CAS  PubMed  Google Scholar 

  23. Chen J, Guo M, Wang X, Liu B. A comprehensive review and comparison of different computational methods for protein remote homology detection. Brief Bioinformatics. 2016; 19(2):231–44.

    Article  Google Scholar 

  24. Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015; 33(8):831–8.

    Article  CAS  PubMed  Google Scholar 

  25. Manning CD, Raghavan P, Schutze H. Introduction to Information Retrieval. New York: Cambridge University Press; 2008.

    Book  Google Scholar 

  26. Nielsen M, Lundegaard C, Lund O, Keşmir C. The role of the proteasome in generating cytotoxic T-cell epitopes: insights obtained from improved predictions of proteasomal cleavage. Immunogenetics. 2005; 57(1):33–41.

    Article  CAS  PubMed  Google Scholar 

  27. Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A. Automatic differentiation in PyTorch. In: 31st Conference on Neural Information Processing Systems (NIPS) Workshop Autodiff: 2017.

  28. Howard J, et GitHub. 2018. Accessed 26 Apr 2019.

Download references


The authors thank Patrick Wagner for discussions and work on related topics. The authors thank Timothy O’Donnell for correspondence and for kindly providing MHCFlurry predictions on the Kim14 dataset. USMPep was implemented using PyTorch [27] and [28].


This work was supported by the Bundesministerium für Bildung und Forschung (BMBF) through the Berlin Big Data Center under Grant 01IS14013A and the Berlin Center for Machine Learning under Grant 01IS18037I.

Author information

Authors and Affiliations



Conceptualization: NS, MW; Data Curation: JV, NS; Investigation: JV; Methodology: JV, NS; Software: JV, NS; Supervision: NS, WS; Validation: JV; Visualization: JV; Writing — Original Draft Preparation: JV, NS; Writing — Review & Editing: JV, MW, WS, NS; Final approval of the manuscript: JV, MW, WS, NS

Corresponding author

Correspondence to Nils Strodthoff.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Vielhaben, J., Wenzel, M., Samek, W. et al. USMPep: universal sequence models for major histocompatibility complex binding affinity prediction. BMC Bioinformatics 21, 279 (2020).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: