Attention mechanism enhanced LSTM with residual architecture and its application for protein-protein interaction residue pairs prediction

Background Recurrent neural network(RNN) is a good way to process sequential data, but the capability of RNN to compute long sequence data is inefficient. As a variant of RNN, long short term memory(LSTM) solved the problem in some extent. Here we improved LSTM for big data application in protein-protein interaction interface residue pairs prediction based on the following two reasons. On the one hand, there are some deficiencies in LSTM, such as shallow layers, gradient explosion or vanishing, etc. With a dramatic data increasing, the imbalance between algorithm innovation and big data processing has been more serious and urgent. On the other hand, protein-protein interaction interface residue pairs prediction is an important problem in biology, but the low prediction accuracy compels us to propose new computational methods. Results In order to surmount aforementioned problems of LSTM, we adopt the residual architecture and add attention mechanism to LSTM. In detail, we redefine the block, and add a connection from front to back in every two layers and attention mechanism to strengthen the capability of mining information. Then we use it to predict protein-protein interaction interface residue pairs, and acquire a quite good accuracy over 72%. What’s more, we compare our method with random experiments, PPiPP, standard LSTM, and some other machine learning methods. Our method shows better performance than the methods mentioned above. Conclusion We present an attention mechanism enhanced LSTM with residual architecture, and make deeper network without gradient vanishing or explosion to a certain extent. Then we apply it to a significant problem– protein-protein interaction interface residue pairs prediction and obtain a better accuracy than other methods. Our method provides a new approach for protein-protein interaction computation, which will be helpful for related biomedical researches.

There is a standard RNN model, including three layers-input, recurrent, and output layer, whose outputs will be activated by linear or nonlinear functions acting on previous or latter inputs. The arrows show the flow in detail from the unit i to unit j. For the hidden layer unit h, we denote the input of hidden layer unit h at time t: the output of the hidden layer unit h at time t is denoted as b t h , and the activation function is the output layer's input can be calculated at the same time: Like the standard back propagation algorithm, BPTT is also a repeated application of chain rule. For the gradients of loss functions in RNN, the influence from loss function to hidden is not only through hidden layer's output, but also through its next time step: where Then we can get the derivative of whole network weight respectively : Long short term memory [2](LSTM), as a variant of RNN, proposed by Hochreiter and shown in Fig. 2, consists of one block which has three gates(input/forget/output gate) whose every activation probability is from 0(the gate closes)to 1(the gate opens), and some cells which can remember information and transit it to the next step, while the hidden layer unit in RNN is replaced by three gates. The output values of input gate and forget gate are determined by the prior cells states and the input values.
The subscripts ι, φ and ω denote the input, forget and output gate of the block respectively, and c denotes one of the C memory cells. The peephole weight from cell c to the input, forget and output gates is denoted as w cι , w cφ and w cω respectively. s t c denotes the state of cell c at time t. f, g and h is the activation function of the gates, cell input and output, respectively. Let I denote the number of inputs, K denote the number of outputs and H denote the number of cells in the hidden layer.
Viewing to the Fig. 2 framework, we can get the equations : input gate forget gate cell output gate cell's output When compared with RNN, LSTM is easier to change the weight of self-recursive model dynamically by adding the gates, and handle different scale data with better performance. Although there are many variants of LSTM, like  [3] which is a simplification of LSTM, and bidirectional LSTM [4], showing stronger performance, there are also some problems in LSTM-gradient explosion or gradient vanishing. [5,6] both mentioned that in their paper, and employed residual learning [7] to avoid that problem, and did related experiment in speech and human activity recognition. That is why the applications of LSTM that we see are always in shallow neural networks. Though there are a lot of methods [8,9] getting away from gradient explosion or gradient vanishing to some extent, such as weight regularization, batchnorm, clip gradient, etc, there are no better measures to solve the problem of gradient combining with layer scales. Recently, Sabeek [10] had done RNN in the depths of residual learning, which solved the gradient vanishing problem and showed a better performance. Given the thought of convolutional residual memory networks [11] and deep residual neural networks [7], we utilize a method with mathematical derivation to avoid the problems and deepen LSTM neural networks to excavate more information from original data in next section. Though some researchers aforementioned utilized this thought, there are some differences from our work-we use every two layers as a residue instead of one layer as a residue to accelerate the computational velocity in a sequential and larger dataset while Sabeek used it for sentimental analysis with a small dataset. And we prove its convergence theoretically. Furthermore, we utilize the attention mechanism to strengthen the extraction of information. This part will be shown in "Model architecture" section. If there are some notations you feel confused in "Results" section, we suggest that you'd better to read the "Methods" section before "Results" section. All of these will be described in the flow processes of the algorithm and application in our paper in Fig. 3.

Results
Because the impact to accuracy of FRPP of layer number in neural networks is usually more uncomplicated and efficient than units numbers in parametric numbers. Like the methods of dichotomization, we use different layer numbers in a wide bound to find one with the best performance, then in this way continue to find the neighbor layer numbers and choose the optimal unit number. Viewing to the Table 1 left, we find that layer_60, not only the predicted true positive amounts in top 1% but also the mean accuracy, shows better performance than others. In like manner the unit_n and the model layer_m_unit_n can be denoted similarly in whole passage. After that, we continue to narrow it. Table 1 right shows the layer number near to layer_60, which is better than ones around it. So we next search the optimal unit number in layer_60, and finally we choose the best result with unit number in layer_60. Based on Table 1, Table 2 shows the results of the number of different units in detail. Despite the model mean of layer_60_unit_6 is lower than layer_60_unit_8, the number of RFPP(1% ) is quite lager inversely. Table 3 elaborates the result of model layer_60_unit_8 further on. In this model we can predict 8/11 if we choose the top 1% pairs of every dimer in the test set as predictions.

Comparison with other methods
PPiPP [12] is a method by using protein sequences for monomer binding site predictions, and PAIRpred [13] is a fresh complex interface prediction approach published in 2014 and realizes a higher prediction accuracy. Zhenni Zhao [14] used a deep learning architecture-multi-layer LSTMs, to predict interface residue pairs, and achieved a better accuracy. Table 4 shows the results from the above-mentioned approaches in different Docking Benchmark Data dataset. The evaluation index is RFPP. When p equals 90%, our model can predict around 90% proteins correctly in our dataset if we choose top 194 residue pairs as prediction. And it improves around a third when comparing with others. Because of the differences of proteins that we select in our train and test set, and pre-treatment methods, we can only take a look at the results of the comparison partly. In addition, our protein sequence is longer and residue pairs amount is bigger than above, hence these can increase the difficulties for predicting RFPP. In order Fig. 3 The evolutional flow processes from methods to application in this paper to balance the comparison, we use another evaluation index-accuracy order, to replace it. Wei Wang.etc [15] used different machine learning methods chosen by different protein properties to predict interface residue pairs. we show the comparison and our prediction precision by choosing top 1% residue pairs in Table 5.
Furthermore, we also use random theory to calculate the RFPP. As we know mathematical expectation is one of the most significant numerical characteristics to describe the average of variables. X denotes the random variable of RFPP here. In order to correspond to our index of algorithm, we select 1000 pairs randomly, so where N denotes the number of surface residue pairs and M denotes the number of interface residue pairs. Then Why we use the inequality is that the the latter is simpler than the former in computational complexity, but calculation is still complicated based on pure theory. Monte Carlo simulation is a well-known method to compute the expectation by using the frequency of events to estimate its probability respectively. This will be more convenient for us to achieve them. We use, more specifically, random simulation about 10 billion times, then we count it that happens respectively. The formula: Table 1 The accuracy order of dimers in test set Accuracy order layer_10 layer_20 layer_30 layer_40 layer_50 layer_60 layer_70 layer_56 layer_58 layer_59 layer_60 layer_61 layer_62  Here,the purpose we extract the coefficient 1 10billion is to avoid something happening to reduce the error like the frequency 15 10billion limited to 0. All the results will be shown in the last row of Table 3. We can clearly see that our result is extremely better than random RFPP except 1GL1 and 1BUH.

Discussion
Viewing Tables 1 and 2, we select the two best prediction accuracy in each table while choosing top 1% as estimated index. According to the Fig. 4, we find that our model shows poor performance in protein 1BUH and good performance in protein both 2VDB and 1Z5Y commonly. One of the most possible reasons is that 1BUH is far away from the train data in homology while 2VDB and 1Z5Y aren't. This will be verified by identity matrix to some extent which shows the highest homology in train set is 12.86% between 1DFG and 1BUH. As for 1GL1, We notice that the random model with RFPP 124 shows better performance than our model with RFPP 194. This is hard to give an explanation. But from the perspective of homology, we find that 1GL1 has a little higher homology 16.7% with 2I9B. This may be one possible reason for 1GL1. We also depict some of proteinprotein interaction interface pairs predicted by our model in Fig. 5 where the first row is predicted well, but the second is not.
On the one hand, how to choose hyperparameters is also a complicated problem in deep learning. The existing methods such as grid search which gives a trick for us. On the other hand, most biological data will lose some information when we transform it. In detail we use three-dimensional coordinates of one atom to replace an amino acid for simplification and we excessively depend on the structure of monomers, It's one of the biggest limitations. Because our problem is to predict whether any two monomers can form a dimer complex. And the different features selection from original data make different Note: NCPD(m% )=n means that there are n dimers which meet the in equation accuracy order≤ m% , and the result of last row will be explained in next section Note: lstm_m_nodes_n means the model has m layer LSTMs,and each layer has n units prediction performance. If we don't consider any physicochemical and geometric properties, from sequence to predict structure directly usually shows low accuracy. And because our prediction method depends on the 9 feature values from monomers structure other than dimer complexes structure, therefore if some values are missing, we will delete the corresponding pairs or whole dimers. This is also a limitation. Recently AlQuraishi [16] employ bi-directional LSTM to predict protein structure from protein sequence and obtain state-of-art achievement. This may inspire us to rethink the problem from protein sequence perspective. Data extreme imbalance is a serious problem introduced to model for training. How to choose a good approach is also preferred.

Conclusions
In this paper, we employ a novel LSTM based on residual architecture and attention mechanism, and derive the gradient. Then we utilize this model to predict proteinprotein interaction interface residue pairs, and compare our model with standard LSTMs and other methods, to show that our prediction accuracy is more than 72 percent which far surpasses other methods in performance. This will be more significant for biomedical related research as well as the computational though there are a lot of further problems we can consider like the feature selections, coevolution [17] information, contact preferences and interface composition [18].

Algorithm derivation
Before deriving the equations of backward pass, we need to redefine LSTM. We call the LSTM unit a small block, and the two LSTM layers a big block, which possesses an additional connection from the output layer l to the output layer l + 2 (see bold line in Fig. 6). Figure 6 is a simplified version, and we just consider that there is only one cell in LSTM unit. However, what we usually use is full connection traditionally. In order to view the differences from different layers, we use the (·) l to present the values of the layer l respectively. For example, the b t c l denotes the cell output value of layer l. And if they are in a same layer, then we omit the superscript l additionally.
output gate forget gate input gate We can see that if gradient vanishing happens in layer l + 2 which also means that ∂(b t c ) l = 0, the conventional LSTM fail to update parameters before layer l + 2. But from (2.2), our model architecture can prohibit that because of 1 +

Background, data, and evaluation criteria
Proteins are the foundations of life activities for cells, but most of them exert their functions only having interaction with other molecules. As a result, protein-protein interaction prediction becomes a very important project. The first step of it is to know the site of interface residue pairs precisely. The most common methods are from experimental and computational perspective recently. One the one hand, anatomizing all proteins is unfeasible to experiment technicians for the high expenses. On the other hand, the computational methods become the scientific tidal current due to its low costs and convenience, such as template [19] and structure model [20] methods. In recent years, artificial intelligence especially machine learning and deep learning has been used in computer vision image and language recognition,etc, and received many achievements. At the same time some computational researchers transfer those methods to biology. Protein contact prediction [21] is one of the good instances by using deep residual networks. Though there are some achievements [13][14][15] in protein-protein interaction interface residue pairs predictions especially while Zhenni [14] used a deep learning architecture to tackle this project, we still need to proceed and develop new algorithms for its low accuracy. Here we will apply our method to predict interface residue pairs.
Our data is from benchmark versions 3.0, 4.0, and 5.0 [22,23] on the international Critical Assessment of PRotein-protein Interaction predictions(CAPRI). All selected dimers whose states are unbound satisfy our requirement and add up to 54, then they are randomly split into three parts including train, validation, test set with ratio around 6:2:2 (shown in Table 6). Moreover, In order to illustrate test efficiency of our data partition structure, we identity multi protein sequences homology comparison in ClustalW2 https://www.ebi.ac.uk/Tools/msa/muscle/. Both of the results are attached in supplementary-identity matrix, and only the homology≥ 30% of two dimers is shown in Table 6. From the identity matrix, we can see only the partition of 2I25(in train set) and 1H9D(in test set) is little unreasonable because of the homology with 40%, but we will show the better prediction result of 1H9D with such litter higher homology later. Every residue pair consists of 18 features which are concatenated by the two 9 feature values of each residue proposed basing on physicochemical and geometric properties which are common in computation. The 9 features are listed below and their computation are shown respectively in Table 7. Interior Contact area(IC) [24], Exterior Contact area with other residues(EC) [24] Exterior Void area(EV) [24,25], Absolute Exterior Solvent Accessible area(AESA) [25], Relative Exterior Solvent Accessible area(RESA) [25], Hydropathy Index(HI, two versions) [26,27] and pK α (two versions) [28]. paper [29]summarized these features and their respective tools for computation. Here we just simply describe it. IC is the Interior Contact area between atoms inside a residue. EC is the Exterior Contact area between Big block LSTM with no connection from the same layers and full connection from adjacent two layer networks. To simplify the network, we just consider an input with one unit in the layer l and an output with one unit in the layer l + 2 residues from the same protein. EV is the area does not contact with water molecules or any amino acid. AESA is the contact area between water molecules and surface residues. RESA is a proportion between AESA in protein and AESA of free amino acids. H1 and H2 are two versions of hydrophobicity index used to measure the hydrophobic ability. pKa is a reflection of the electrostatics of surface residue in the specific environment.
A residue pair is defined as interface if the contact areas of two amino acids from different two monomers are not zero. Here we use two statistical evaluation criteria combining biological meanings to measure our model prediction: rank of the first positive prediction(RFPP), and the number of correctly predicted dimers(NCPD). In order to overcome the length differences and balance the predicted difficult degree in different proteins, accuracy order is adopted.
accuracy order = RFPP TNRP , where TNRP is the total number of residue pairs in a dimer.

Model architecture
This is a binary classification problem. The input format is a matrix with dimension L×18 Fig. 7, since every amino acid consists of 9 features and a residue pair possesses 18 features. Where L is the number of combinations of amino acid residue pairs. We use the label 1 to present that the pair is an interface residue pair, and label 0 is opposite. Because the amount of label 0s is extremely larger than 1s, so we need to pre-treat the imbalance between the positive and negative samples. We use a distance to exclude some impossible residue pairs. The distance between different chains will be small to some way to meet a threshold if the residue pairs are contact. Therefore we choose the residue pairs with the most short distance, then choose 3 residues around them in each chain respectively, hence there are 3 × 3 pairs altogether. This method can reduce the amount of negative samples efficiently. Because we use this selective method which can make the data sequential, therefore the LSTM neural network is a quite good choice for us. Then the data pre-treated will be input to the neural network architecture. There are some hyperparameters to explain in detail. Dropout [30] is a way to prevent model from over-fitting, because it can be a probability from 0 to 1 to drop out the units and cutdown all the connections from the units to next units randomly. In this paper, we use 0.15 to dropout some redundant information of the inputs. According to the new achievement,  Wojciech Zeremba [31] proposed a new method-adding dropout from the current layer to next layer, but not to recurrent layer, to regularize the RNN, which inspires us to use dropout in LSTM and fit it in 0.6. These hyperparameters can be fitted by a common technique-grid search, and the results will be shown in supplementary. Attention has been widely used in speech recognition [32] and reasoning [33],etc for its efficient mechanism which can reallocate weight and retrieve some more critical information, therefore these motivate us to use attention in our model. The dense layer's activation function is softmax, and the loss function is categorical crossentropy. Softmax and crossentropy is designed as following σ (Z j ) = e z j K k=1 e z k for j = 1, 2, ..., K.
where p is a true distribution while q is an estimated distribution. Softmax function can mapping a n d vector to another n d vector whose elements are from 0 to 1. Crossentrop, equal to maximum likelihood estimation, is an index to measure the gap between the true distribution and the estimated distribution.