Skip to main content
  • Methodology Article
  • Open access
  • Published:

Attention mechanism enhanced LSTM with residual architecture and its application for protein-protein interaction residue pairs prediction



Recurrent neural network(RNN) is a good way to process sequential data, but the capability of RNN to compute long sequence data is inefficient. As a variant of RNN, long short term memory(LSTM) solved the problem in some extent. Here we improved LSTM for big data application in protein-protein interaction interface residue pairs prediction based on the following two reasons. On the one hand, there are some deficiencies in LSTM, such as shallow layers, gradient explosion or vanishing, etc. With a dramatic data increasing, the imbalance between algorithm innovation and big data processing has been more serious and urgent. On the other hand, protein-protein interaction interface residue pairs prediction is an important problem in biology, but the low prediction accuracy compels us to propose new computational methods.


In order to surmount aforementioned problems of LSTM, we adopt the residual architecture and add attention mechanism to LSTM. In detail, we redefine the block, and add a connection from front to back in every two layers and attention mechanism to strengthen the capability of mining information. Then we use it to predict protein-protein interaction interface residue pairs, and acquire a quite good accuracy over 72%. What’s more, we compare our method with random experiments, PPiPP, standard LSTM, and some other machine learning methods. Our method shows better performance than the methods mentioned above.


We present an attention mechanism enhanced LSTM with residual architecture, and make deeper network without gradient vanishing or explosion to a certain extent. Then we apply it to a significant problem– protein-protein interaction interface residue pairs prediction and obtain a better accuracy than other methods. Our method provides a new approach for protein-protein interaction computation, which will be helpful for related biomedical researches.


Recurrent neural network(RNN), proposed by Hochreiter, is a major neural network in deep learning, which does as a bridge to connect the the information from past to present. It is based on the back propagation algorithm and contains the factor caused by time, therefore RNN is a kind of back propagation through time(BPTT) algorithm. What’s more, it can tackle the sequencial data including temporal and spatial data owing to its property.

Look at the standard RNN Fig. 1, the information is forward propagation from inputs to outputs. We can describe those information flow by a series of equations. Symbols and notations in this paper mainly refer to the book [1] written by Alex Graves. But here we’ll write it briefly. x denotes the input vector value, \(x_{i}^{t}\) denotes the value of input ith of vector x at time t, and wij denotes the weight from the unit i to unit j. For the hidden layer unit h, we denote the input of hidden layer unit h at time t:

$$ a_{h}^{t} = \sum\limits_{i=1}^{I}w_{ih}x_{i}^{t}+ \sum\limits_{h'=1}^{H}w_{h'h}b_{h'}^{t-1}, $$
Fig. 1
figure 1

There is a standard RNN model, including three layers-input, recurrent, and output layer, whose outputs will be activated by linear or nonlinear functions acting on previous or latter inputs. The arrows show the flow in detail

the output of the hidden layer unit h at time t is denoted as \(b_{h}^{t}\), and the activation function is θh, so

$$ b_{h}^{t}=\theta(a_{h}), $$

the output layer’s input can be calculated at the same time:

$$ a_{k}^{t} = \sum\limits_{h=1}^{H}w_{hk}b_{h}^{t}. $$

Like the standard back propagation algorithm, BPTT is also a repeated application of chain rule. For the gradients of loss functions in RNN, the influence from loss function to hidden is not only through hidden layer’s output, but also through its next time step:

$$ \delta_{h}^{t} = \theta'(a_{h}^{t})\left(\sum\limits_{k=1}^{K}\delta_{k}^{t}w_{hk}+\sum\limits_{h'=1}^{t+1}w_{hh'}\right), $$


$$ \delta_{j}^{t}\stackrel{def}{=}\frac{\partial \mathcal{L}}{\partial a_{j}^{t}}, $$

Then we can get the derivative of whole network weight respectively :

$$ \frac{\partial\mathcal L}{\partial w_{ij}}=\sum\limits_{t=1}^{T} \frac{\partial\mathcal L}{\partial a_{j}^{t}}\frac{\partial a_{j}^{t}}{\partial w_{ij}}=\sum\limits_{t=1}^{T}\delta_{j}^{t} b_{i}^{t}. $$

Long short term memory [2](LSTM), as a variant of RNN, proposed by Hochreiter and shown in Fig. 2, consists of one block which has three gates(input/forget/output gate) whose every activation probability is from 0(the gate closes)to 1(the gate opens), and some cells which can remember information and transit it to the next step, while the hidden layer unit in RNN is replaced by three gates. The output values of input gate and forget gate are determined by the prior cells states and the input values.

Fig. 2
figure 2

The memory block with one cell of LSTM neural network

The subscripts ι,ϕ and ω denote the input, forget and output gate of the block respectively, and c denotes one of the C memory cells. The peephole weight from cell c to the input, forget and output gates is denoted as wcι,wcϕ and wcω respectively. \(s_{c}^{t}\) denotes the state of cell c at time t. f, g and h is the activation function of the gates, cell input and output, respectively. Let I denote the number of inputs, K denote the number of outputs and H denote the number of cells in the hidden layer.

Viewing to the Fig. 2 framework, we can get the equations :

input gate

$$ a_{\iota}^{t} =\sum\limits_{i=1}^{I}w_{i\iota}x_{i}^{t} + \sum\limits_{h=1}^{H}w_{h\iota}b_{h}^{t-1} + \sum\limits_{c=1}^{C}w_{c\iota}s_{c}^{t-1}, $$
$$ b_{\iota}^{t}=f\left(a_{\iota}^{t}\right), $$

forget gate

$$ a_{\phi}^{t} =\sum\limits_{i=1}^{I}w_{i\phi}x_{i}^{t} + \sum\limits_{h=1}^{H}w_{h\phi}b_{h}^{t-1} + \sum\limits_{c=1}^{C}w_{c\phi}s_{c}^{t-1}, $$
$$ b_{\phi}^{t}=f(a_{\phi}^{t}), $$


$$ a_{c}^{t}=\sum\limits_{i=1}^{I}w_{ic}x_{i}^{t} + \sum\limits_{h=1}^{H}w_{hc}b_{h}^{t-1}, $$
$$ s_{c}^{t}=b_{\phi}^{t} s_{c}^{t-1}+b_{\iota}^{t}g\left(a_{c}^{t}\right), $$

output gate

$$ a_{\omega}^{t} =\sum\limits_{i=1}^{I}w_{i\omega}x_{i}^{t} + \sum\limits_{h=1}^{H}w_{h\omega}b_{h}^{t-1} + \sum\limits_{c=1}^{C}w_{c\omega}s_{c}^{t-1}, $$
$$ b_{\omega}^{t} =f\left(a_{\omega}^{t}\right), $$

cell’s output

$$ b_{c}^{t}=b_{\omega}^{t} h\left(s_{c}^{t}\right). $$

When compared with RNN, LSTM is easier to change the weight of self-recursive model dynamically by adding the gates, and handle different scale data with better performance. Although there are many variants of LSTM, like GRU [3] which is a simplification of LSTM, and bidirectional LSTM [4], showing stronger performance, there are also some problems in LSTM–gradient explosion or gradient vanishing. [5, 6] both mentioned that in their paper, and employed residual learning [7] to avoid that problem, and did related experiment in speech and human activity recognition. That is why the applications of LSTM that we see are always in shallow neural networks. Though there are a lot of methods [8, 9] getting away from gradient explosion or gradient vanishing to some extent, such as weight regularization, batchnorm, clip gradient, etc, there are no better measures to solve the problem of gradient combining with layer scales. Recently, Sabeek [10] had done RNN in the depths of residual learning, which solved the gradient vanishing problem and showed a better performance. Given the thought of convolutional residual memory networks [11] and deep residual neural networks [7], we utilize a method with mathematical derivation to avoid the problems and deepen LSTM neural networks to excavate more information from original data in next section. Though some researchers aforementioned utilized this thought, there are some differences from our work–we use every two layers as a residue instead of one layer as a residue to accelerate the computational velocity in a sequential and larger dataset while Sabeek used it for sentimental analysis with a small dataset. And we prove its convergence theoretically. Furthermore, we utilize the attention mechanism to strengthen the extraction of information. This part will be shown in “Model architecture” section. If there are some notations you feel confused in “Results” section, we suggest that you’d better to read the “Methods” section before “Results” section. All of these will be described in the flow processes of the algorithm and application in our paper in Fig. 3.

Fig. 3
figure 3

The evolutional flow processes from methods to application in this paper


Because the impact to accuracy of FRPP of layer number in neural networks is usually more uncomplicated and efficient than units numbers in parametric numbers. Like the methods of dichotomization, we use different layer numbers in a wide bound to find one with the best performance, then in this way continue to find the neighbor layer numbers and choose the optimal unit number. Viewing to the Table 1 left, we find that layer_60, not only the predicted true positive amounts in top 1%0 but also the mean accuracy, shows better performance than others. In like manner the unit _n and the model layer_m_unit_n can be denoted similarly in whole passage. After that, we continue to narrow it. Table 1 right shows the layer number near to layer_60, which is better than ones around it. So we next search the optimal unit number in layer_60, and finally we choose the best result with unit number in layer_60. Based on Table 1, Table 2 shows the results of the number of different units in detail. Despite the model mean of layer _60_unit_6 is lower than layer _60_unit_8, the number of RFPP(1%0) is quite lager inversely. Table 3 elaborates the result of model layer _60_unit_8 further on. In this model we can predict 8/11 if we choose the top 1%0 pairs of every dimer in the test set as predictions.

Table 1 The accuracy order of dimers in test set
Table 2 The accuracy order of dimers in test set with layer _60
Table 3 The prediction results of layer _60_unit_8 in test set

Comparison with other methods

PPiPP [12] is a method by using protein sequences for monomer binding site predictions, and PAIRpred [13] is a fresh complex interface prediction approach published in 2014 and realizes a higher prediction accuracy. Zhenni Zhao [14] used a deep learning architecture–multi-layer LSTMs, to predict interface residue pairs, and achieved a better accuracy. Table 4 shows the results from the above-mentioned approaches in different Docking Benchmark Data dataset. The evaluation index is RFPP. When p equals 90%, our model can predict around 90% proteins correctly in our dataset if we choose top 194 residue pairs as prediction. And it improves around a third when comparing with others. Because of the differences of proteins that we select in our train and test set, and pre-treatment methods, we can only take a look at the results of the comparison partly. In addition, our protein sequence is longer and residue pairs amount is bigger than above, hence these can increase the difficulties for predicting RFPP. In order to balance the comparison, we use another evaluation index–accuracy order, to replace it. Wei Wang.etc [15] used different machine learning methods chosen by different protein properties to predict interface residue pairs. we show the comparison and our prediction precision by choosing top 1%0 residue pairs in Table 5.

Table 4 Comparison with PAIRpred, PPiPP and multi-layered LSTM
Table 5 Comparison by choosing top 1%0 residue pairs

Furthermore, we also use random theory to calculate the RFPP. As we know mathematical expectation is one of the most significant numerical characteristics to describe the average of variables. X denotes the random variable of RFPP here. In order to correspond to our index of algorithm, we select 1000 pairs randomly, so

$$P(X=i)= \left \{ \begin{aligned} &\frac{C_{N-M}^{i-1}C_{M}^{1}C_{N-M-i}^{1000-i}}{C_{N}^{1000}}, \quad i=1,2,...,1000 \\ &1-\sum\limits_{i=1}^{1000} \frac{C_{N-M}^{i-1}C_{M}^{1}C_{N-M-i}^{1000-i}}{C_{N}^{1000}}. \quad else \end{aligned} \right. $$

where N denotes the number of surface residue pairs and M denotes the number of interface residue pairs.


$$ E(X)= \sum\limits_{i} i\times P(X=i)\ge \sum\limits_{i=1}^{1000} i\times P(X=i)+1000\times \frac{C_{N-M}^{1000}}{C_{N}^{1000}} $$

Why we use the inequality is that the the latter is simpler than the former in computational complexity, but calculation is still complicated based on pure theory. Monte Carlo simulation is a well-known method to compute the expectation by using the frequency of events to estimate its probability respectively. This will be more convenient for us to achieve them. We use, more specifically, random simulation about 10 billion times, then we count it that happens respectively. The formula:

$${\begin{aligned} \sum\limits_{i=1}^{i=1000}i &\times\frac{\mathrm{count(RFPP=\mathit{i})}}{10 \text{billion} }+1000\\ &\times\frac{10 \text{billion}-\sum\limits_{i=1}^{1000}\mathrm{count(RFPP=\mathit{i})}}{10\text{billion}}=\frac{1}{10 \text{billion}}[\cdots] \end{aligned}} $$

Here,the purpose we extract the coefficient \(\frac 1{10 \text {billion}}\) is to avoid something happening to reduce the error like the frequency \(\frac {15}{10 \text {billion}}\) limited to 0. All the results will be shown in the last row of Table 3. We can clearly see that our result is extremely better than random RFPP except 1GL1 and 1BUH.


Viewing Tables 1 and 2, we select the two best prediction accuracy in each table while choosing top 1%0 as estimated index. According to the Fig. 4, we find that our model shows poor performance in protein 1BUH and good performance in protein both 2VDB and 1Z5Y commonly. One of the most possible reasons is that 1BUH is far away from the train data in homology while 2VDB and 1Z5Y aren’t. This will be verified by identity matrix to some extent which shows the highest homology in train set is 12.86% between 1DFG and 1BUH. As for 1GL1, We notice that the random model with RFPP 124 shows better performance than our model with RFPP 194. This is hard to give an explanation. But from the perspective of homology, we find that 1GL1 has a little higher homology 16.7% with 2I9B. This may be one possible reason for 1GL1. We also depict some of protein-protein interaction interface pairs predicted by our model in Fig. 5 where the first row is predicted well, but the second is not.

Fig. 4
figure 4

Prediction of different model parameters, where code _m_n means the layer number of LSTM is n, and the unit number in each LSTM layer is m. Longitudinal axis represents accuracy order and horizontal axis means PDB respectively

Fig. 5
figure 5

Model architecture. Where big block LSTM is defined as mentioned above

On the one hand, how to choose hyperparameters is also a complicated problem in deep learning. The existing methods such as grid search which gives a trick for us. On the other hand, most biological data will lose some information when we transform it. In detail we use three-dimensional coordinates of one atom to replace an amino acid for simplification and we excessively depend on the structure of monomers, It’s one of the biggest limitations. Because our problem is to predict whether any two monomers can form a dimer complex. And the different features selection from original data make different prediction performance. If we don’t consider any physicochemical and geometric properties, from sequence to predict structure directly usually shows low accuracy. And because our prediction method depends on the 9 feature values from monomers structure other than dimer complexes structure, therefore if some values are missing, we will delete the corresponding pairs or whole dimers. This is also a limitation. Recently AlQuraishi [16] employ bi-directional LSTM to predict protein structure from protein sequence and obtain state-of-art achievement. This may inspire us to rethink the problem from protein sequence perspective. Data extreme imbalance is a serious problem introduced to model for training. How to choose a good approach is also preferred.


In this paper, we employ a novel LSTM based on residual architecture and attention mechanism, and derive the gradient. Then we utilize this model to predict protein-protein interaction interface residue pairs, and compare our model with standard LSTMs and other methods, to show that our prediction accuracy is more than 72 percent which far surpasses other methods in performance. This will be more significant for biomedical related research as well as the computational though there are a lot of further problems we can consider like the feature selections, coevolution [17] information, contact preferences and interface composition [18].


Algorithm derivation

Before deriving the equations of backward pass, we need to redefine LSTM. We call the LSTM unit a small block, and the two LSTM layers a big block, which possesses an additional connection from the output layer l to the output layer l+2 (see bold line in Fig. 6).

Fig. 6
figure 6

Some of prediction of protein-protein interaction interface residue pairs, which are highlighted in surface and shown in different colors with amino acid name and site in corresponding chains. a 1H9D b 2VDB c 1GL1 d 1BUH

Figure 6 is a simplified version, and we just consider that there is only one cell in LSTM unit. However, what we usually use is full connection traditionally. In order to view the differences from different layers, we use the (·)l to present the values of the layer l respectively. For example, the \(\left (b_{c}^{t}\right)^{\mathit {l}}\) denotes the cell output value of layer l. And if they are in a same layer, then we omit the superscript l additionally.

$$ \left(\epsilon_{c}^{t}\right)^{\mathit{l+2}}\stackrel{def}{=}\frac{\partial \mathcal L}{\partial \left(b_{c}^{t}\right)^{\mathit{l}}+\partial\left(b_{c}^{t}\right)^{\mathit{l+2}}}, \qquad \epsilon_{s}^{t}\stackrel{def}{=}\frac{\partial \mathcal L}{\partial s_{c}^{t}}, $$

cell’s output

$$ \epsilon_{c}^{t} = \sum\limits_{k=1}^{K}w_{ck}\delta_{k}^{t} + \sum\limits_{g=1}^{G}w_{cg}\delta_{g}^{t+1}, $$

output gate

$$ {\begin{aligned} \delta_{\omega}^{t}=\frac{\partial{\mathcal{L}}}{\partial a_{\omega}^{t}} &=\frac{\partial{\mathcal{L}}}{\partial\left(b_{c}^{t}\right)^{l}+\partial\left(b_{c}^{t}\right)^{l+2}} \frac{\partial \left(b_{c}^{t}\right)^{{l}}+\partial \left(b_{c}^{t}\right)^{} {l+2}}{\partial \left(b_{\omega}^{t}\right)^{l}}\frac{\partial\left(b_{\omega}^{t}\right)^{l}}{\partial\left(a_{\omega}^{t}\right)^{l}}\\ &= \left(\epsilon_{c}^{t}\right)^{l+2}h\left(s_{c}^{t}\right)\left(1+\frac{\partial\left(b_{c}^{t}\right)^{l+2}}{\partial\left(b_{c}^{t}\right)^{l}}\right)f'\left(a_{\omega}^{t}\right), \end{aligned}} $$


$$ \epsilon_{s}^{t} = b_{\omega}^{t}h'\left(s_{c}^{t}\right)\epsilon_{c}^{t} + b_{\phi}^{t+1}\epsilon_{s}^{t+1}+ w_{c\iota}\delta_{\iota}^{t+1} + w_{c\phi}\delta_{\phi}^{t+1}+ w_{c\omega}\delta_{\omega}^{t}, $$


$$ \delta_{c}^{t} =\frac{\partial \mathcal L}{\partial a_{c}^{t}}=\frac{\partial \mathcal L}{\partial s_{c}^{t}} \frac {\partial s_{c}^{t}}{\partial a_{c}^{t}}=\epsilon_{s}^{t}b_{\iota}^{t}g'\left(a_{c}^{t}\right), $$

forget gate

$$ \delta_{\phi}^{t} = \frac{\partial\mathcal L}{\partial a_{\phi}^{t}}=\frac{\partial\mathcal L}{\partial s_{c}^{t}}\frac{\partial s_{c}^{t}}{\partial b_{\phi}^{t}}\frac{\partial b_{\phi}^{t}}{\partial a_{\phi}^{t}}=\epsilon_{s}^{t} s_{c}^{t-1}f'\left(a_{\phi}^{t}\right), $$

input gate

$$ \delta_{\iota}^{t} =\frac{\partial\mathcal L}{\partial a_{\iota}^{t}}=\frac{\partial\mathcal L}{\partial s_{c}^{t}}\frac{\partial s_{c}^{t}}{\partial b_{\iota}^{t}}\frac{\partial b_{\iota}^{t}}{\partial a_{\iota}^{t}}= \epsilon_{s}^{t} g\left(a_{c}^{t}\right) f'\left(a_{\iota}^{t}\right). $$

We can see that if gradient vanishing happens in layer l+2 which also means that \(\frac {\partial \left (b_{c}^{t}\right)^{l+2}}{\partial \left (b_{c}^{t}\right)^{l}}=0\), the conventional LSTM fail to update parameters before layer l+2. But from (2.2), our model architecture can prohibit that because of \(1+ \frac {\partial \left (b_{c}^{t}\right)^{l+2}}{\partial \left (b_{c}^{t}\right)^{l}}=1\).

Background, data, and evaluation criteria

Proteins are the foundations of life activities for cells, but most of them exert their functions only having interaction with other molecules. As a result, protein-protein interaction prediction becomes a very important project. The first step of it is to know the site of interface residue pairs precisely. The most common methods are from experimental and computational perspective recently. One the one hand, anatomizing all proteins is unfeasible to experiment technicians for the high expenses. On the other hand, the computational methods become the scientific tidal current due to its low costs and convenience, such as template [19] and structure model [20] methods. In recent years, artificial intelligence especially machine learning and deep learning has been used in computer vision image and language recognition,etc, and received many achievements. At the same time some computational researchers transfer those methods to biology. Protein contact prediction [21] is one of the good instances by using deep residual networks. Though there are some achievements [1315] in protein-protein interaction interface residue pairs predictions especially while Zhenni [14] used a deep learning architecture to tackle this project, we still need to proceed and develop new algorithms for its low accuracy. Here we will apply our method to predict interface residue pairs.

Our data is from benchmark versions 3.0, 4.0, and 5.0 [22, 23] on the international Critical Assessment of PRotein-protein Interaction predictions(CAPRI). All selected dimers whose states are unbound satisfy our requirement and add up to 54, then they are randomly split into three parts including train, validation, test set with ratio around 6:2:2 (shown in Table 6). Moreover, In order to illustrate test efficiency of our data partition structure, we identity multi protein sequences homology comparison in ClustalW2 Both of the results are attached in supplementary–identity matrix, and only the homology ≥30% of two dimers is shown in Table 6. From the identity matrix, we can see only the partition of 2I25(in train set) and 1H9D(in test set) is little unreasonable because of the homology with 40%, but we will show the better prediction result of 1H9D with such litter higher homology later. Every residue pair consists of 18 features which are concatenated by the two 9 feature values of each residue proposed basing on physicochemical and geometric properties which are common in computation. The 9 features are listed below and their computation are shown respectively in Table 7. Interior Contact area(IC) [24], Exterior Contact area with other residues(EC) [24] Exterior Void area(EV) [24, 25], Absolute Exterior Solvent Accessible area(AESA) [25], Relative Exterior Solvent Accessible area(RESA) [25], Hydropathy Index(HI, two versions) [26, 27] and pK α (two versions) [28]. paper [29]summarized these features and their respective tools for computation. Here we just simply describe it. IC is the Interior Contact area between atoms inside a residue. EC is the Exterior Contact area between residues from the same protein. EV is the area does not contact with water molecules or any amino acid. AESA is the contact area between water molecules and surface residues.

Table 6 The data partition structure and homology (≥30%)
Table 7 The 9 features and their computation

RESA is a proportion between AESA in protein and AESA of free amino acids. H1 and H2 are two versions of hydrophobicity index used to measure the hydrophobic ability. pKa is a reflection of the electrostatics of surface residue in the specific environment.

A residue pair is defined as interface if the contact areas of two amino acids from different two monomers are not zero. Here we use two statistical evaluation criteria combining biological meanings to measure our model prediction: rank of the first positive prediction(RFPP), and the number of correctly predicted dimers(NCPD). In order to overcome the length differences and balance the predicted difficult degree in different proteins, accuracy order is adopted.

\( accuracy \quad order = \frac {RFPP}{TNRP} \), where TNRP is the total number of residue pairs in a dimer.

Model architecture

This is a binary classification problem. The input format is a matrix with dimension L×18 Fig. 7, since every amino acid consists of 9 features and a residue pair possesses 18 features. Where L is the number of combinations of amino acid residue pairs. We use the label 1 to present that the pair is an interface residue pair, and label 0 is opposite. Because the amount of label 0s is extremely larger than 1s, so we need to pre-treat the imbalance between the positive and negative samples. We use a distance to exclude some impossible residue pairs. The distance between different chains will be small to some way to meet a threshold if the residue pairs are contact. Therefore we choose the residue pairs with the most short distance, then choose 3 residues around them in each chain respectively, hence there are 3×3 pairs altogether. This method can reduce the amount of negative samples efficiently. Because we use this selective method which can make the data sequential, therefore the LSTM neural network is a quite good choice for us. Then the data pre-treated will be input to the neural network architecture. There are some hyperparameters to explain in detail. Dropout [30] is a way to prevent model from over-fitting, because it can be a probability from 0 to 1 to drop out the units and cutdown all the connections from the units to next units randomly. In this paper, we use 0.15 to dropout some redundant information of the inputs. According to the new achievement, Wojciech Zeremba [31] proposed a new method–adding dropout from the current layer to next layer, but not to recurrent layer, to regularize the RNN, which inspires us to use dropout in LSTM and fit it in 0.6. These hyperparameters can be fitted by a common technique–grid search, and the results will be shown in supplementary. Attention has been widely used in speech recognition [32] and reasoning [33],etc for its efficient mechanism which can reallocate weight and retrieve some more critical information, therefore these motivate us to use attention in our model. The dense layer’s activation function is softmax, and the loss function is categorical crossentropy. Softmax and crossentropy is designed as following

$$ \sigma(\mathbf{Z}_{j})=\frac{e^{z_{j}}}{\sum_{k=1}^{K}e^{z_{k}}} \quad for \,\, j=1,2,...,K. $$
Fig. 7
figure 7

Big block LSTM with no connection from the same layers and full connection from adjacent two layer networks. To simplify the network, we just consider an input with one unit in the layer l and an output with one unit in the layer l+2

$$ H(p,q)=E_{p}[-\log q]=H(p)+D_{KL}(p\|q) $$

where p is a true distribution while q is an estimated distribution. Softmax function can mapping a nd vector to another nd vector whose elements are from 0 to 1. Crossentrop, equal to maximum likelihood estimation, is an index to measure the gap between the true distribution and the estimated distribution.

Availability of data and materials

Our code and parameters of model can be found in and data is available in



Back propagation through time


Long short term memory


The number of correctly predicted dimers


Rank of the first positive prediction


Recurrent neural network


Total number of residue pairs in a dimer


  1. Graves A. Supervised sequence labelling. In: Supervised Sequence Labelling with Recurrent Neural Networks. Springer: 2012. p. 5–13.

  2. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997; 9(8):1735–80.

    Article  CAS  PubMed  Google Scholar 

  3. Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. 2014.

  4. Zhou J, Xu W. End-to-end learning of semantic role labeling using recurrent neural networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers): 2015. p. 1127–37.

  5. Kim J, El-Khamy M, Lee J. Residual lstm: Design of a deep recurrent architecture for distant speech recognition. arXiv preprint arXiv:1701.03360. 2017.

  6. Zhao Y, Yang R, Chevalier G, Xu X, Zhang Z. Deep residual bidir-lstm for human activity recognition using wearable sensors. Math Problems Engineer. 2018; 2018(7316954):13.

    Google Scholar 

  7. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016. p. 770–8.

  8. Jozefowicz R, Zaremba W, Sutskever I. An empirical exploration of recurrent network architectures. In: Int Confer Mach Learn.2015. p. 2342–50.

  9. Chung J, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. 2014.

  10. Pradhan S, Longpre S. Exploring the depths of recurrent neural networks with stochastic residual learning. Report. 2016.

  11. Moniz J, Pal C. Convolutional residual memory networks. arXiv preprint arXiv:1606.05262. 2016.

  12. Ahmad S, Mizuguchi K. Partner-aware prediction of interacting residues in protein-protein complexes from sequence data. PLoS One. 2011; 6(12):29104.

    Article  Google Scholar 

  13. Afsar Minhas FuA, Geiss BJ, Ben-Hur A. Pairpred: Partner-specific prediction of interacting residues from sequence and structure. Proteins: Struct, Func, Bioinforma. 2014; 82(7):1142–55.

    Article  CAS  Google Scholar 

  14. Zhao Z, Gong X. Protein-protein interaction interface residue pair prediction based on deep learning architecture. IEEE/ACM Trans Comput Biol Bioinforma. 2017; 16(5):1753–59.

    Google Scholar 

  15. Wang W, Yang Y, Yin J, Gong X. Different protein-protein interface patterns predicted by different machine learning methods. Sci Rep. 2017; 7(1):16023.

    Article  PubMed  PubMed Central  Google Scholar 

  16. AlQuraishi M. End-to-end differentiable learning of protein structure. Cell systems. 2019; 8(4):292–301.

    Article  CAS  PubMed  Google Scholar 

  17. Ovchinnikov S, Kamisetty H, Baker D. Robust and accurate prediction of residue–residue interactions across protein interfaces using evolutionary information. Elife. 2014; 3:02030.

    Article  Google Scholar 

  18. Nadalin F, Carbone A. Protein–protein interaction specificity is captured by contact preferences and interface composition. Bioinformatics. 2017; 34(3):459–68.

    Article  PubMed Central  Google Scholar 

  19. Ohue M, Matsuzaki Y, Shimoda T, Ishida T, Akiyama Y. Highly precise protein-protein interaction prediction based on consensus between template-based and de novo docking methods. In: BMC Proceedings. BioMed Central: 2013. p. 6.

  20. Singh R, Park D, Xu J, Hosur R, Berger B. Struct2net: a web service to predict protein–protein interactions using a structure-based approach. Nucleic Acids Res. 2010; 38(suppl_2):508–15.

    Article  Google Scholar 

  21. Wang S, Sun S, Li Z, Zhang R, Xu J. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput Biol. 2017; 13(1):1005324.

    Article  Google Scholar 

  22. Vreven T, Moal IH, Vangone A, Pierce BG, Kastritis PL, Torchala M, Chaleil R, Jiménez-García B, Bates PA, Fernandez-Recio J, et al. Updates to the integrated protein–protein interaction benchmarks: docking benchmark version 5 and affinity benchmark version 2. J Mole Biol. 2015; 427(19):3031–41.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Janin J, Henrick K, Moult J, Ten Eyck L, Sternberg MJ, Vajda S, Vakser I, Wodak SJ. Capri: a critical assessment of predicted interactions. Proteins: Structure, Function, and Bioinformatics. 2003; 52(1):2–9.

    Article  CAS  Google Scholar 

  24. Fischer TB, Holmes JB, Miller IR, Parsons JR, Tung L, Hu JC, Tsai J. Assessing methods for identifying pair-wise atomic contacts across binding interfaces. J Struct Biol. 2006; 153(2):103–12.

    Article  CAS  PubMed  Google Scholar 

  25. Hubbard S, Thornton J. Naccess: Department of biochemistry and molecular biology, university college london. 1993. Software available at

  26. Eisenberg D. Three-dimensional structure of membrane and surface proteins. Ann Rev Biochem. 1984; 53(1):595–623.

    Article  CAS  PubMed  Google Scholar 

  27. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mole Biol. 1982; 157(1):105–32.

    Article  CAS  Google Scholar 

  28. Olsson MH, Søndergaard CR, Rostkowski M, Jensen JH. Propka3: consistent treatment of internal and surface residues in empirical p k a predictions. J Chem Theory Comput. 2011; 7(2):525–37.

    Article  CAS  PubMed  Google Scholar 

  29. Yang Y, Wang W, Lou Y, Yin J, Gong X. Geometric and amino acid type determinants for protein-protein interaction interfaces. Quantitative Biol. 2018; 6(2):163–74.

    Article  CAS  Google Scholar 

  30. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014; 15(1):1929–58.

    Google Scholar 

  31. Zaremba W, Sutskever I, Vinyals O. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329. 2014.

  32. Chorowski JK, Bahdanau D, Serdyuk D, Cho K, Bengio Y. Attention-based models for speech recognition. In: Advances in Neural Information Processing Systems: 2015. p. 577–85.

  33. Rocktäschel T, Grefenstette E, Hermann KM, Kočiskỳ T, Blunsom P. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664. 2015.

Download references


The authors gratefully acknowledge the discussion with Zhenni Zhao. Thanks to the support of Beijing Advanced Innovation Center for Structural biology.


We’d like to thank for National Natural Science Foundation of China(Nos. 31670725 and 91730301) for providing financial supports for this study and publication charges. These funding bodies did not play any role in the design of study, the interpretation of data, or the writing of this manuscript.

Author information

Authors and Affiliations



JL developed the algorithm, did the computation, and wrote the manuscript. XG designed the project, collected the data and revised the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xinqi Gong.

Ethics declarations

Ethics approval and consent to participate

No applicable.

Consent for publication

No applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver( applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Liu, J., Gong, X. Attention mechanism enhanced LSTM with residual architecture and its application for protein-protein interaction residue pairs prediction. BMC Bioinformatics 20, 609 (2019).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: