Integrating peptides' sequence and energy of contact residues information improves prediction of peptide and HLA-I binding with unknown alleles

Background The HLA (human leukocyte antigen) class I is a kind of molecule encoded by a large family of genes and is characteristic of high polymorphism. Now the number of the registered HLA-I molecules has exceeded 3000. Slight differences in the amino acid sequences of HLAs would make them bind to different sets of peptides. In the past decades, although many methods have been proposed to predict the binding between peptides and HLA-I molecules and achieved good performance, most experimental data used by them is limited to the HLAs with a small number of alleles. Thus they are inclined to obtain high prediction accuracy only for data with similar alleles. Because the peptides and HLAs together determine the binding, it's necessary to consider their contribution meanwhile. Results By taking into account the features of the peptides sequence and the energy of contact residues, in this paper a method based on the artificial neural network is proposed to predict the binding of peptides and HLA-I even when the HLAs' potential alleles are unknown. Two experiments in the allele-specific and super-type cases are performed respectively to validate our method. In the first case, we collect 14 HLA-A and 14 HLA-B molecules on Bjoern Peters dataset, and compare our method with the ARB, SMM, NetMHC and other 16 online methods. Our method gets the best average AUC (Area under the ROC) value as 0.909. In the second one, we use leave one out cross validation on MHC-peptide binding data that has different alleles but shares the common super-type. Compared to gold standard methods like NetMHC and NetMHCpan, our method again achieves the best average AUC value as 0.847. Conclusions Our method achieves satisfactory results. Whenever it's tested on the HLA-I with single definite gene or with super-type gene locus, it gets better classification accuracy. Especially, when the training set is small, our method still works better than the other methods in the comparison. Therefore, we could make a conclusion that by combining the peptides' information, HLAs amino acid residues' interaction information and contact energy, our method really could improve prediction of the peptide HLA-I binding even when there aren't the prior experimental dataset for HLAs with various alleles.


Background
In the cellular immune system, peptide binding to MHC (Major Histocompatibility Complex, in humans MHC is also called Human Leukocyte Antigen HLA) is the most selective step in recognition of pathogens. In humans, there are three types of MHC molecules and their recognizing, binding, transporting and functioning mechanism are distinct. Taking the MHC-I for example. Proteins in the cytosol are first degraded by the proteasome, and then peptides are internalized by TAP (transporter associating with antigen processing) channel in the endoplasmic reticulum, where MHC-I molecules freshly are synthesized. Complexes of MHC-I binding to peptide enter Golgi apparatus and finally externalize on the cell membrane to interact with T lymphocytes. Correctly and precisely predicting the T cell epitope has realistic meaningfulness, especially important for the vaccine design. Many experiments on analyzing the binding complexes of peptide and MHC indicate that the binding sites have the binding specificity. This specificity is usually determined by the molecular weight, the electric charge, the pH value and other attributes. In the past decades, many prediction methods have been proposed to predict the epitope. They could be categorized into the following types. The first one is the motif matching based methods. From the binding complex fragments of peptide and MHC, Rudensky [1] purified and detected the amino acid sequences, from which they proposed three binding motifs I-A s , I-A b , and I-E b . According to these three motifs, Cole [2] successfully predicted MHC epitopes in the Sendai virus M protein.
Analogous methods by checking whether peptides have the anchor sites matching with binding sites in the MHC molecules include works [3][4][5]. This kind of methods is simple and easy to understand but the prediction accuracy is not high. The second one is the scoring matrix based method. It could be viewed as the generalisation of the motif matching based method. For each type of MHC molecules, the existing peptide-MHC binding data are statistically analysed to generate a coefficient matrix, in which the element represents the degree of amino acid contributing to the binding when appearing in a certain position. Parker [6] used 154 synthetic peptides to get the molecule HLA-A2 scoring matrix; Kubo [7] got the HLA-A1, HLA-A3 and HLA-A11 scoring matrix and Udaka [8] got H2-K b , D b and L b scoring matrix respectively. Other similar methods to build up MHC I class and class II molecules' scoring matrix include ProPred [9], ARB [10], SMM [11]. Given the scoring matrix of one MHC molecule, the strength of any peptide binding to it could be calculated in an addition or a multiple way. In comparison with the first type method, they usually have higher accuracy, but there still exist some shortcomings like that they assume each amino acid residue to independently affect the binding and ignore the interactions among the amino acid residues. In order to further improve the prediction accuracy, some methods try to consider the whole peptide sequence and establish more complex models to reflect the real situation. This category of methods includes the Bayesian method (Bayesian) [12], HMM (Hidden Markov Model) [13], SVM (Support Vector Machine) [14], ANN (Artificial Neural Network) [15] and so on [16][17][18][19][20][21]. Before the prediction, the training process is essential for the prediction model. According to the different types of training data used, previous stated methods also could be categorized into the sequencebased methods and the structure-based methods. They have their own advantages and limitations. On one side, the sequence-based methods always adopt machine-learning approaches that need large amount of training data. When the training data is sufficient, they could get good prediction accuracy. In fact, HLA-I is extremely polymorphic. For example, in the database IMGT/HLA (international ImMunoGeneTics project) [22] the number of registered HLA-I (HLA-I has three major gene locus HLA-A, HLA-B and HLA-C.) molecules has exceeded 3000 and that for HLA-II is over 1100. Unfortunately, for most of HLA molecules with different alleles, there are few or even no experimentally obtained binding complex data to facilitate analysing their binding characteristics. Even for those HLAs that have experimental data, by 2010, IMGT/HLA has deposited 893 allele sequences of the HLA-A loci and 1534 allele sequences of the HLA-B loci, which implies the wide existence of polymorphism. Based on one frequently used dataset composed of 35 HLA-I molecules, the binding prediction accuracy of the HLA-I and peptide is reported to reach average 0.9 AUC (Area Under roc Curve), but this dataset is too special and only accounts for a small part of the known HLA-I molecules. Those HLAs with slightly different amino acid sequences may have their own binding specificity to different sets of peptides. Thus, these sequence-based methods are biased towards the known alleles if they use special training data and may have the over-fitting prediction problem. On the other hand, structure-based methods could jump over the obstacles of sequence polymorphism and directly take advantage of the MHC molecule complex's 3D structures and use their empirical force fields as the binding specificity to estimate the binding affinity. However, the available 3D structure dataset is insufficient. Now there are only tens of HLA-I molecules' 3D structures resolved, so the accuracy of structure-based methods is restricted by the inadequate number of 3D structures and is usually lower than sequence-based methods.
In order to overcome the existing problems and achieve better prediction accuracy, we still take machine-learning strategy based on the sequencing data. As mentioned above, HLA-I has more polymorphism than HLA-II. So here we focus on the HLA-I and a method based on the artificial neural network is proposed to predict on HLA-I with unknown alleles where there are limited or even no prior experimentally obtained dataset. Different from other sequence-based methods using ANN, our method not only considers peptide sequence information but also HLAs' amino acid residues and the energy of contact residues. With information integration, our method is expected to predict the binding of peptide and HLA-I with high accuracy.
The rest sections of this paper will be organized as followed. In the method part, a method based on artificial neural network will be introduced, in which the HLAs interaction residues extraction and contact energy computation will be described. The experiments on two datasets will be implemented in the results and discussion part. On dataset is on the benchmark dataset and the other one is on the super-type dataset. The performance of our method will be evaluated under these two conditions by comparing with other methods. The last part is the conclusion.

Methods
There're lots of classifiers could be used for prediction. Choosing a proper classifier relies on the application contexts. Here, we use ANN (Artificial Neural Network) classifier to predict the peptide and HLA-I binding. The reasons for utilizing the ANN model are their advantages of self-learning, self-adaptive and modelling nonlinear relationship. Some works [23,24] have proved that ANN is suitable for the epitope prediction. Besides the classifier, how to select the classification features is another important step. Not all of HLA's amino acids take action in the binding process. Therefore, we will first find out which amino acid residues really function and then compute their contact energy. These prominent features will be properly encoded and finally used to predict. The Figure 1 gives the framework of our method.

Interacting residues of HLA and peptide
The crystal structures of peptide-HLA binding complexes show that although HLA molecules from the same gene locus have different alleles, the complexes have similar spatial structure. Madden [25] have resolved the 3D structures of five complexes where different peptides binding to the same HLA-A*0201 molecule. The results show they have similar structure and HLAs' amino acid residue binding sites. Inspired by Figure 1 The framework of our method. The input data contains two parts, one is the peptide and the other is HLA molecule. HLA molecules will be processed by the steps of extracting interacting amino acid residues and computing the contact energy. Then they will be encoded as the classification features and input into the established classifier to do the training and predict.
Madden's work, one approach to overcome the influence of coding genes' polymorphism is to find out the major common interacting sites of HLAs from the same gene locus. We collect, calculate and finally get the frequent function residues of HLA from the existing structure data of peptide-HLA binding complexes. All raw peptide and HLA structure data come from the database PDB (protein Data Base). In total, there're 111 peptide-HLA-A binding complexes and 87 peptide-HLA-B binding complexes used. When the distance of residues of the HLA and peptide is less than 4Ȧ, we think that they interact. The results show that for HLAs from the same gene locus, their amino acid residue interacting sites are similar, which is consistent with the result of Madden. We discard those residues of HLAs that interact with peptides less than 5 times. The putative sites are shown in the Figure 2(a) and Figure 2(b) respectively for HLA-A and HLA-B.

Contact energy of amino acid residues
In fact, as the participants of binding interaction, amino acid residues of HLA and peptide don't independently contribute to the binding. Therefore, it's necessary to take into account the interaction among the amino acid residues. The structure of a peptide-HLA complex is decided by several forces such as the interaction between amino acids and water molecules, the interaction between amino acids and so on. Although there are several forces, the interaction between amino acids plays a major role. Generally, the protein structure could be estimated by the amino acid sequence and the interaction strength could be calculated by the classical electromagnetic theory. If the attracting strength between two amino acid residues is great, their position will have more chance to be near. Therefore, if there are sufficient protein structure data, the mutual attracting strength could be computed. In the structure biology, Miyazawa and Jernigan [26] matrix has been widely used for computing the protein sequence energy in different structure templates. But there's a limitation that Miyazawa and Jernigan matrix takes the solvent as the reference status and is usually accurate for nonpolar molecules interaction. But it couldn't ignore the interaction between amino acids and water molecules. Thus, we use the B matrix [27], which is based on Miyazawa and Jernigan matrix but refers to the threonine, to get the contact energy of amino acid residues.

Encode the peptide and HLA
For the existing prediction methods based on the artificial neural network, there are three approaches to encode the peptide. The first one is the parse matrix, in which each amino acid is represented by 19 zeros and 1 one. The second one is BLOSUM matrix and the third one is to select some attributes from the physical and chemical properties of amino acids. Among them, BLOSUM matrix usually has the best prediction accuracy and good ability of distinguishing the amino acids. In this paper, the BLOSUM matrix will encode peptides and the B matrix of the contact energy of amino acid residues will encode HLA sequence. Therefore, the encoding length for the HLA-A and peptide binding is 239 dimensions, in which 180 dimensions are for the peptide encoded by BLOSUM matrix and 59 dimensions are for HLA-A amino acid residues interacting with the peptides encoded by the B matrix. The encoding length for the HLA-B and peptide is 255 dimensions, in which 180 dimensions are for peptide encoded by BLOSUM matrix and 75 dimensions are for HLA-B amino acid residues interacting with the peptides encoded by the B matrix. In order to better measure the difference of affinity and facilitate the artificial neural network training. The affinity will be transformed to the logarithm format varying from 0 to 1 like [15].

Build and train ANNBM
There're several subtypes in the ANN. In this paper we use the error back propagation feed-forward neural network. Because HLA features are encoded by B Matrix, our ANN predictor is also specially named as ANNBM.
Theoretical analysis proves that ANN with one single hidden layer can map almost continuous relationship function. Therefore we establish the ANNBM consisting of three layers including an input layer, a hidden layer and an output layer. Neural network's input layer has 239 nodes for predicting peptide-HLA-A binding, while neural network's input layer has 255 nodes for predicting peptide-HLA-B binding. The output layer has only one node. It is the logarithmic value of binding affinity between HLA and peptide. When constructing the ANNBM, there is no golden criterion to determine the number of nodes in the hidden layer. With less hidden nodes, the ability of learning from samples is poor and unable to reflect the relationship perfectly; with excessive hidden nodes, it may remember the noise in the samples leading to the over learning problem and reduce the generalization ability. In principle, the number of hidden nodes depends on the training sample scale, the sample noise and the complexity of the relationship. A common way to set the number is called trail-and-error, so we test the hidden nodes varying from 2 to 12. The hidden layer with 9 nodes has gotten the minimum mean square error.
In ANNBM, the activation function still used the sigmoid function. For Sigmoid function when the input of variables is very big, its slope trends to 0.
Because of this characteristic, for some learning algorithms like the steepest descent algorithm, as the weights and thresholds are far from its best, the gradient is very small and leads to weights and the thresholds correction is very small, so we use the RPROP method to do network weights adjustment. RPROP takes into account only the sign of the partial derivative over all patterns but not the magnitude, and acts independently on each weight. For each weight, if there is a sign change of the partial derivative of the total error function compared to the last iteration, the update value for that weight is multiplied by a factor η − , where η − is less than 1. If the last iteration produces the same sign, the update value is multiplied by a factor of η + , where η + is greater than 1.
The update values are calculated for each weight in the manner described as above, and finally each weight is changed by its own update value, in the opposite direction of that weight's partial derivative, so as to minimise the total error function. The parameter η + is empirically set to 1.2 and η − to 0.5.

Results and discussion
Experiment on the benchmark dataset Bjoern Peters [28] collect peptide-MHC-I binding datasets and builds a benchmark dataset. This benchmark dataset comes from two research groups' works. They are Alessandro Sette in the La Jolla Gene and Immunology Institutes and Søren Buus in the Copenhagen University. Although their experiment systems have several differences in binding the judging index, detecting targets, the preparation of MHC molecules and the purity of peptides, the format of their experiment output is the same. For each peptide, the IC50/EC50 value is assigned to measure its binding affinity. In order to evaluate the consistency of experiment results from these two groups, Bjoern Peters makes the cross validation and results show good consistency. The benchmark finally contains 48 828 binding data on 35 HLA-I molecules. In this paper, we pick up HLA-I with the length 9 to validate our method.
From the table 1, we could see our ANNBM get the best performance. And the prediction is independent with the scale of dataset, which could be observed in the Figure 3. More interesting, the whole prediction accuracy of ANNBM is nearly similar with the second perfect result from the NetMHC and they both belong to the ANN model. Apparently, here the neural networks appear its non-linear mapping advantage that could learn the high order relationship from the training process. Therefore ANN based methods get higher accuracy than SMM and ARB. Besides the model selection, the differences in selecting and encoding of the features also determinate the classification accuracy. SMM and ARB are based on the scoring matrix. NetMHC combines the sparse, BLOSUM matrix and hidden Markov to encode peptides.
Remarkable, for the small peptide dataset such as B*4002, B*4402, B*4403 and B*5701 molecules, ANNBM outperforms the NetMHC and other methods. The four peptides' numbers are respectively 118, 119, 119 and 59. ANNBM method in the four molecules gets the AUC, respectively, 0.858, 0.824, 0.791 and 0.96, and NetMHC method in the four molecules gets the AUC, respectively, 0.754, 0.778, 0.763 and 0.826. We believe that the ANNBM benefits from taking in account the interaction information of HLA and peptides, whereas NetMHC just encodes the peptides information that is insufficient for ANN to learn the high order relationship under the small datasets situations. Figure 4 shows that when there're enough training dataset the prediction of these methods have no obvious difference just like HLA-A*0201, but under the small dataset situation showed in the Figure 5 ANNBM proves integrating sequence information and energy of contact residues really could improve the prediction accuracy. It isn't hard to understand when training data is sufficient, ordinary classifiers could learn the relationship by large amount of training. The outperformance of ANNBM in the small dataset proves ANNBM effectively catches the key factors for the prediction.

Prediction on unknown alleles dataset
If there is no prior binding dataset for a HLA molecule, we could make use of the Sette and Sidney work [29] to indirectly solve this obstacle. Sette and Sidney discover that HLA-I class molecules could be divided into several super classes according to their binding specificity.  [7,31]. NetMHC is the one that has the closest prediction accuracy in the benchmark testing. Here we check whether it could keep performing well. The NetMHCpan is another method designed to predict peptide-HLA binding with unknown alleles. The author of NetMHCpan is an experienced researcher focusing The panels from left to right and up to down are the linear fitting between the peptide number (x axis) and accuracy (y axis) on five methods: ANNBM, ARB, SMM, NetMHC, and Other methods. The right down picture is the standard deviation of the classification accuracy. We could see ANNBM gets the smallest slope rate and standard deviation, which proves that ANNBM is most independent with dataset scale and stable.

Conclusions
We have developed an ANNBM method that can well predict peptides binding to HLAs for which have limited or even no prior exact experimentally obtained data. Our method takes both peptide sequence information and contact energy of amino acid residues into account, and could give the quantitative binding affinity. Allelespecific benchmark and super-type experimental datasets successfully validate this method. ANNBM is stably From table.2, we can see that ANNBM method obtains the higher average AUC value than NetMHCpan and NetMHC methods by 0.023 and 0.073. NetMHC encoding method doesn't take into account the HLA molecules information. Although the training data comes from the same super-type and acquires perfect results on the allele specific benchmark dataset, the HLA differences in the same super class are not reflected, so it is not difficult to understand the NetMHC prediction accuracy decreases and lower than those of ANNBM and NetMHCpan that encode HLA molecules information. Comparing the encoding method of the HLA molecules between ANNBM and NetMHCpan, ANNBM uses the B matrix and each amino acid that could interact with peptide is denoted by a numerical value, while NetMHCpan uses the BLOSUM matrix and a 20 dimensions vector to denote each amino acid. Obviously, ANNBM has higher efficiency in the storage and computation. The average AUC of ANNBM is greater than that of NetMHCpan, especially on the A*0202 and B*3501, whose ROC curves are showed in figure 6 and 7.