Table 1 Comparison of MHC I prediction tools

From: USMPep: universal sequence models for major histocompatibility complex binding affinity prediction

| Method | Architecture |
| --- | --- |
| SMMPMBEC [7] | One-hot encoding, linear model (scoring matrix) |
| consensus [8] | Linear model (scoring matrix), median rank as prediction |
| NetMHC4 [9] | Input: 9mer fixed length blocks substitution matrix (BLOSUM) encoding plus additional features; multilayer perceptron with one hidden layer |
| NetMHCpan4 [10] | Input: 9mer fixed length BLOSUM encoding for peptide, pseudo-sequence for MHC molecule plus additional features; multilayer perceptron with one hidden layer |
| MHCFlurry [11] | Input: 15mer fixed length BLOSUM62 encoding, missing residues filled with wildcard amino acid (AA); feedforward neural network (NN) with 0 to 2 locally connected and one fully connected hidden layer |
| USMPep (this work) | Learned embedding layer; AWD LSTM with one hidden layer |
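
The input encodings listed under Architecture can be illustrated with a minimal sketch. It shows one-hot encoding with a linear scoring matrix (as described for SMMPMBEC) and padding to a fixed length with a wildcard residue (as described for MHCFlurry). Several things here are assumptions for illustration only: one-hot vectors stand in for the BLOSUM62 encoding to keep the example short, the wildcard symbol `X`, the all-zero wildcard vector, and the choice to pad in the middle of the peptide are not taken from the table, and all function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
WILDCARD = "X"  # assumed placeholder symbol for a missing residue


def pad_to_fixed_length(peptide, length=15):
    """Pad a peptide to `length` by inserting wildcards in the middle.

    Middle padding is an illustrative choice, not the documented scheme
    of any of the tools in the table.
    """
    if len(peptide) > length:
        raise ValueError("peptide is longer than the fixed input length")
    missing = length - len(peptide)
    half = (len(peptide) + 1) // 2
    return peptide[:half] + WILDCARD * missing + peptide[half:]


def one_hot_encode(peptide):
    """Return one 20-dimensional vector per residue; wildcards map to zeros."""
    vectors = []
    for residue in peptide:
        vector = [0.0] * len(AMINO_ACIDS)
        if residue != WILDCARD:
            vector[AMINO_ACIDS.index(residue)] = 1.0
        vectors.append(vector)
    return vectors


def linear_score(peptide, scoring_matrix):
    """Score a peptide with a position-specific scoring matrix.

    scoring_matrix[position][amino_acid_index] holds one weight; the score
    is the sum of the weights selected by the one-hot encoding.
    """
    encoded = one_hot_encode(peptide)
    return sum(
        weight * bit
        for weights, bits in zip(scoring_matrix, encoded)
        for weight, bit in zip(weights, bits)
    )
```

With a toy matrix that assigns weight 1.0 to alanine at every position and 0.0 elsewhere, `linear_score` simply counts alanines, which makes the behaviour easy to check by hand.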

| Method | Training procedure |
| --- | --- |
| SMMPMBEC | Ridge regression with modified regularization, peptide MHC binding energy covariance (PMBEC) similarity matrix as Bayesian prior |
| consensus | Four scoring matrices from existing algorithms |
| NetMHC4 | Training on non-9mer peptides by insertion of wildcard AA or deletion at all possible positions; augmented training set with natural peptides for each length assumed to be negative |
| NetMHCpan4 | Same insertion/deletion procedure as NetMHC4; augmented training set with random artificial negatives |
| MHCFlurry | Pretraining on a BLOSUM62-similar allele for alleles with little training data; augmented training set with artificial negative peptides |
| USMPep | Optional: language model pretraining on unlabeled sequences |
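
The insertion/deletion procedure listed for NetMHC4 and NetMHCpan4 can be sketched as a function that maps a peptide of any length to candidate 9mers. This is a minimal sketch under stated assumptions: shorter peptides receive one contiguous run of wildcards at every possible position, longer peptides lose one contiguous stretch at every possible position, and the wildcard symbol `X` and the function name are hypothetical. The table does not say how a single candidate is chosen during training, so the sketch only enumerates candidates.

```python
def nine_mer_candidates(peptide, core_length=9, wildcard="X"):
    """Enumerate fixed-length candidates for a peptide of arbitrary length.

    Assumes contiguous insertions and deletions; candidates may repeat
    when the peptide contains repeated residues.
    """
    n = len(peptide)
    if n == core_length:
        return [peptide]
    if n < core_length:
        # Insert a run of wildcards at every possible position.
        gap = wildcard * (core_length - n)
        return [peptide[:i] + gap + peptide[i:] for i in range(n + 1)]
    # Delete a contiguous stretch of the surplus residues at every position.
    extra = n - core_length
    return [peptide[:i] + peptide[i + extra:] for i in range(core_length + 1)]
```

For example, an 8mer yields nine candidates (one per insertion position), and any peptide longer than nine residues yields ten candidates (one per deletion start position).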

| Method | Model selection |
| --- | --- |
| SMMPMBEC | Single model |
| consensus | Single model |
| NetMHC4 | Ensemble of 4 NNs |
| NetMHCpan4 | Ensemble of 100 NNs |
| MHCFlurry | Ensemble of 8-16 NNs selected from 320 models on a validation set |
| USMPep | Optional: ensemble of 10 NNs with identical architectures and hyperparameters |
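
The ensembling strategies under Model selection can be sketched in a few lines. The table does not state how ensemble members are combined or which validation metric is used for selection, so this sketch assumes that a higher validation score is better and that member predictions are averaged with an arithmetic mean. The function names are hypothetical.

```python
from statistics import mean


def select_top_models(validation_scores, k):
    """Return the indices of the k models with the highest validation score."""
    ranked = sorted(
        range(len(validation_scores)),
        key=lambda i: validation_scores[i],
        reverse=True,
    )
    return ranked[:k]


def ensemble_predict(predictions_per_model, selected=None):
    """Average per-peptide predictions over the selected models.

    predictions_per_model[m][p] is the prediction of model m for peptide p.
    If `selected` is None, every model contributes to the average.
    """
    if selected is None:
        selected = range(len(predictions_per_model))
    chosen = [predictions_per_model[i] for i in selected]
    return [mean(values) for values in zip(*chosen)]
```

Calling `ensemble_predict` without `selected` corresponds to an ensemble in which every trained model contributes, while passing the output of `select_top_models` corresponds to keeping only a validated subset of a larger model pool.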