From: USMPep: universal sequence models for major histocompatibility complex binding affinity prediction
Architecture | |
SMMPMBEC [7] | One-hot encoding, linear model (scoring matrix) |
consensus [8] | Linear model (scoring matrix), median rank as prediction |
NetMHC4 [9] | Input: 9mer fixed length blocks substitution matrix (BLOSUM) encoding plus additional features; multilayer perceptron with one hidden layer |
NetMHCpan4 [10] | Input: 9mer fixed length BLOSUM encoding for peptide, pseudo-sequence for MHC molecule plus additional features; multilayer perceptron with one hidden layer |
MHCFlurry [11] | Input: 15mer fixed length BLOSUM62 encoding, missing residues filled with wildcard amino acid (AA); feedforward neural network (NN) with 0 to 2 locally connected and one fully connected hidden layer |
USMPep (this work) | Learned embedding layer; AWD LSTM with one hidden layer |
Training procedure | |
SMMPMBEC | Ridge regression with modified regularization, peptide MHC binding energy covariance (PMBEC) similarity matrix as Bayesian prior |
consensus | Four scoring matrices from existing algorithms |
NetMHC4 | Training on non-9mer peptides by insertion of wildcard AA or deletion at all possible positions; training set augmented with natural peptides of each length, assumed to be negative |
NetMHCpan4 | Same insertion/deletion procedure as NetMHC4; training set augmented with random artificial negatives |
MHCFlurry | Pretraining on a BLOSUM62-similar allele for alleles with little training data; training set augmented with artificial negative peptides |
USMPep | Optional: language model pretraining on unlabeled sequences |
Model selection | |
SMMPMBEC | Single model |
consensus | Single model |
NetMHC4 | Ensemble of 4 NNs |
NetMHCpan4 | Ensemble of 100 NNs |
MHCFlurry | Ensemble of 8-16 NNs selected from 320 models on a validation set |
USMPep | Optional: ensemble of 10 NNs with identical architectures and hyperparameters |
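
The insertion/deletion procedure listed for NetMHC4 and NetMHCpan4 can be illustrated with a short sketch. This is not the tools' actual code: the function name `to_9mer_candidates`, the wildcard symbol `X`, and the choice to insert or delete one contiguous stretch of residues are assumptions made for illustration only.

```python
WILDCARD = "X"  # assumed wildcard amino acid symbol


def to_9mer_candidates(peptide, core_len=9, wildcard=WILDCARD):
    """Map a peptide of any length to fixed-length candidates.

    Longer peptides: delete one contiguous stretch at every possible position.
    Shorter peptides: insert one contiguous wildcard stretch at every position.
    (Contiguity is an assumption of this sketch.)
    """
    n = len(peptide)
    if n == core_len:
        return [peptide]
    if n > core_len:
        gap = n - core_len
        candidates = [peptide[:i] + peptide[i + gap:] for i in range(n - gap + 1)]
    else:
        pad = wildcard * (core_len - n)
        candidates = [peptide[:i] + pad + peptide[i:] for i in range(n + 1)]
    # Drop duplicates (repeated residues can yield identical strings), keep order.
    return list(dict.fromkeys(candidates))


if __name__ == "__main__":
    for pep in ("SIINFEKL", "ACDEFGHIKL"):
        print(pep, to_9mer_candidates(pep))
```

Every candidate has the fixed input length expected by a 9mer model; how the candidates are then scored and combined is specific to each tool and is not shown here.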