Protein–protein interaction prediction based on ordinal regression and recurrent convolutional neural networks

Background Protein protein interactions (PPIs) are essential to most of the biological processes. The prediction of PPIs is beneficial to the understanding of protein functions and thus is helpful to pathological analysis, disease diagnosis and drug design etc. As the amount of protein data is growing fast in the post genomic era, high-throughput experimental methods are expensive and time-consuming for the prediction of PPIs. Thus, computational methods have attracted researcher’s attention in recent years. A large number of computational methods have been proposed based on different protein sequence encoders. Results Notably, the confidence score of a protein sequence pair could be regarded as a kind of measurement to PPIs. The higher the confidence score for one protein pair is, the more likely the protein pair interacts. Thus in this paper, a deep learning framework, called ordinal regression and recurrent convolutional neural network (OR-RCNN) method, is introduced to predict PPIs from the perspective of confidence score. It mainly contains two parts: the encoder part of protein sequence pair and the prediction part of PPIs by confidence score. In the first part, two recurrent convolutional neural networks (RCNNs) with shared parameters are applied to construct two protein sequence embedding vectors, which can automatically extract robust local features and sequential information from the protein pairs. Based on it, the two embedding vectors are encoded into one novel embedding vector by element-wise multiplication. By taking the ordinal information behind confidence score into consideration, ordinal regression is used to construct multiple sub-classifiers in the second part. The results of multiple sub-classifiers are aggregated to obtain the final confidence score. Following that, the existence of PPIs is determined by the confidence score. We set a threshold \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta$$\end{document}θ, and say the interaction exists between the protein pair if its confidence score is bigger than \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta$$\end{document}θ. Conclusions We applied our method to predict PPIs on data sets S. cerevisiae and Homo sapiens. Through experimental verification, our method outperforms state-of-the-art PPI prediction models.


Background
Proteins [1,2] are critical to the cells and tissues in the body. They participate in various life activities, like antibody immunity, catalyzing metabolic reactions and transporting molecules, etc. Usually, proteins are associated with other proteins to form the protein complexes, so as to perform the functions of living organisms in a better way. Among the protein complexes, protein-protein interactions (PPIs) play a crucial role in successfully carrying out different biological processes in cells, such as transcription, translation, cell cycle control, and secretion [3]. Therefore, the problem of PPI prediction is of great significance in pathological analysis [4], disease diagnosis [5], drug design [6], and is becoming a research focus in the field of proteomics.
A large amount of high-throughput experimental methods have been applied to predict the PPIs from protein complexes, such as Yeast double hybrid screens [7], tandem affinity purification [8,9], and proteome chips and micro-array technology [10,11]. However, with the accumulation of protein data, these methods suffer from the restrictions of time and economic cost, and cannot meet the needs of human life science research in the post genomic era. It is also worth noting that due to the existence of the subjective or objective factors, such as operation error and experimental error, the experimental results often deviate slightly from the actual results, sometimes even leading to a large proportion of false positive or false negative experimental data. For example, there are about 80,000 PPIs predicted by these high-throughput experimental methods, but only a relatively small number (about 2400) of these PPIs could be obtained by more than one method [7]. Hence, only using high-throughput experimental methods would not get high-quality and reliable experimental results. To overcome these drawbacks, computational methods have attracted researcher's attention. They [12] predict the PPIs mainly based on the sequential information of amino acids of proteins. Besides, some other methods [13][14][15] are based on the structural information, or based on fusion of multiple information from different data sources. With these information, it is also necessary to extract effective features to guarantee good prediction results. For this purpose, researchers begin to focus on the protein sequence encoding techniques and the corresponding predicting techniques for PPIs.
Shen et al. [12] introduced a conjoint triad (CT) descriptor, which considered the properties of each amino acid and its neighbouring amino acids and extracted the features of the local environmental information in the amino acid sequence. Gough et al. [13] proposed a method to describe the amino acid sequence based on the physical and chemical properties, combined with the structural information of protein. Later based on the physical and chemical properties of amino acids, Guo et al. [14] established an auto covariance (AC) encoding method to get the correlation and the interaction information of amino acids at different positions. For the high dimension of features of protein sequences, Thanathamathee et al. [15] used a principal component analysis algorithm to reduce the dimension first, and then constructed a forward feedback neural network as a classifier to predict the PPIs. Recently, some deep learning frameworks were proposed. For instance, Hashemifar et al. [16] presented a Siamese-like convolutional neural network with random projection, and data augmentation technique was used to extract sequential information in the framework. Li et al. [17] discussed another deep neural network framework, which learned the local features automatically only from protein primary sequences, according to the encoding, embedding, convolutional neural network, and long short-term memory (LSTM) neural network layers.
Notably, confidence score could be regarded as a kind of measurement for PPIs. The higher confidence score one protein pair gets, the more likely the protein pair interacts. Thus in this work, we propose a novel method, called ordinal regression and recurrent convolutional neural network (OR-RCNN), to predict PPIs by confidence score. The method could be concluded into two parts: (1) an encoder of protein sequence pair based on recurrent convolutional neural network (RCNN), and (2) PPI prediction model based on ordinal regression. In order to deal with the protein sequence pair in a better way, two RCNNs with shared parameters are introduced here. Each RCNN encodes one of the protein sequence pair into an embedding vector, which integrates multiple convolution layers with pooling and bidirectional gate recurrent unit (GRU) layers with concatenate, so as to extract the local features and sequential information more accurately. An element-wise multiplication is then done on the two embedding vectors, encoding them into one novel embedding vector. Till now, the encoder of protein sequence pair is presented, which encodes a pair of protein sequences into one embedding vector aggregating multi-granularity features. In order to predict PPIs, we predict the confidence score of the embedding vector first. The ordinal information is hidden behind confidence score by means of artificially setting the ordinal sub-intervals of confidence score. To this end, the concept of rank is used to show the ordinal information, and each subinterval is corresponding to a rank value. To efficiently use the ordinal information, ordinal regression is applied here. Based on it, the prediction problem of confidence score is transformed into a series of binary classification problems. Some multi-layer perceptrons are utilized in the binary classification to get the ordinal information about rank. Later, all the ordinal information are aggregated to get the final confidence score of the protein pair. Finally, whether the interaction between protein pair exists is determined by its confidence score. Experimental results show that our method could boost the performance of PPI prediction on both the S. cerevisiae and Homo sapiens data sets.

Results
In this section, we first introduce two PPI data sets S. cerevisiae and Homo sapiens. Then, we give the experimental setting. Finally, the experimental results are presented.

Data sets
The high throughput data sets S. cerevisiae and Homo sapiens are both from STRING database [18]. There is a confidence score for each pair of protein sequences in S. cerevisiae and Homo sapiens. Let CS min = 0 and CS max = 1 be the minimal and maximal values of confidence score for protein pairs, respectively. We separate the interval of confidence score (0, 1) into K sub-intervals with equal length. Let K = 20 , and the 20 sub-intervals are (0, 0.05), [0.05, 0.1),…, [0.95, 1). The protein pair is labeled k, if its confidence score belongs to the k-th sub-interval. Thus, both datasets are split into 20 subsets according to the sub-intervals.
We only consider the data with the length of protein sequence between 50 and 2000. For the data set S. cerevisiae, there are 1584 data in the sub-interval (0, 0.05), and 5360 data in the sub-interval [0.7, 0.75). Considering that there are too many data in the rest 18 sub-intervals, we randomly select 5400 data in each sub-interval. Thus, there are totally 104144 data actually used in the experiment. For the data set Homo sapiens, since the true data set is too big, we randomly select 5000 data in each sub-interval in our experiment. We randomly select 90% data in each sub-interval as the training data, and the left 10% data as the test data for both S. cerevisiae and Homo sapiens.

Experimental settings
Since the protein sequences have different lengths from 50 to 2000, we extend the short protein sequence to a sequence with length 2000 by adding zero-padding technique [19]. Let the batch size be 768, and let d = 3 . The 3-max-pooling mechanism is applied in the pooling layer. The output of bidirectional GRU layer with concatenate operator is a 150-dimension vector.
The AMSGrad algorithm [20] is used to optimize the cross-entropy loss function L for each sub-classifier. In the algorithm, we set the learning rate to be 0.001, and set the exponential decay rates of β 1 and β 2 to be 0.9 and 0.999, respectively.
We first evaluate the performance of each sub-classifier to see if the ordinal information about label has been pick up correctly. The evaluation is based on different comparison types: comparison of all the sub-classifiers, comparison of pre-trained embedding methods, selection of key parameters, comparison of computing equation for confidence score and study on the impact of the ratio of training over test data. Five criterions are used in the evaluation, including accuracy, precision, sensitivity (recall), specificity and F 1 score. Then, we emphasize the fact that the concatenate operator in our method performs better than the residual shortcut operator. Finally, our method is compared with some existing methods to show its advantages, based on the mean absolute error (MAE), mean squared error (MSE) and the above five criterions.

Comparison of sub-classifiers
The performance of each sub-classifier for OR-RCNN method on data set S. cerevisiae is shown in Fig. 1. The 19 sub-classifiers have very close performance in accuracy, precision, sensitivity, and F 1 score. For the specificity, the first 5 sub-classifiers perform poorly while the last 14 sub-classifiers all have good performance. That is, when we get the ordinal information for lower labels, the sub-classifier is not a good choice for specificity, while for higher labels, it would be much better.
In the rest of the experiments, we will not show the performance for all the 19 subclassifier, but show that for only three typical sub-classifiers f 5 , f 10 , f 15 . The corresponding sub-classification problems are to predict if the label of a protein pair is bigger than 5, 10 and 15, respectively.

Comparison of pre-trained embedding methods
We mainly compare three embedding methods: (1) a co ; (2) a eh ; (3) one-hot. Let the half length C of context be 3, and let the size of negative sampling be 5. The embedding a co is obtained by pre-training 8000 protein sequences of data set SHS148k from database STRING. The embedding a eh is directly computed by the electrostaticity and hydrophobicity. The one-hot method assigns a 20 dimensional vector to each amino acid.   Table 1 shows the comparison results of performance of different embedding methods for sub-classifiers f 5 , f 10 , f 15 on data set S. cerevisiae, respectively. Obviously, the embedding methods a co and a eh get very close results to each other, while they outperform the one-hot method.

Selection of key parameters
There are two key parameters in the OR-RCNN method: (1) the dimension d ′ of hidden state, (2) the repeated times of RCNN unit. We study how to select the optimal parameters for our method.
Let the repeated times of RCNN unit be 5, we compare the prediction results for different dimensions of hidden state on data set S. cerevisiae. The result are shown in Table 2 for sub-classifiers f 5 , f 10 , f 15 , respectively. We examine the performance with d ′ = 10, 25, 50, 75 for each sub-classifier. As the dimension value increases from 10 to 50, the performance improves significantly. As it increases from 50 to 75, the performance improves only a little, or even decreased. Thus in most cases, d ′ = 50 is a better choice for our method.
Given d ′ = 50 , we investigate the influence of repeated times for RCNN unit on the OR-RCNN method. The repeated times is set from 1 to 5. Table 3 shows the compared results for sub-classifiers f 5 , f 10 , f 15 on data set S. cerevisiae, respectively. We can see that the more times the RCNN unit occurs in our method, the better performance it could achieve. However, when the repeated times range from 1 to 3, our method enhances the performance rapidly, and when repeated times range from 3 to 5, it enhances very slowly, or even reduce a little. Hence, we choose the best repeated times of RCNN unit to 5 in the experiments.

Comparison of different computing equations for confidence score
The computing equation for confidence score is a critical step in our method. For any pair of protein sequences, the middle value of sub-interval where the predicted label falls in is taken to be the confidence score. Here, we compare this equation with other two, which take two endpoint values in the sub-interval, respectively, where r(x i ) is the predicted label for protein pair x i . The compared results are presented in Table 4 on data set S. cerevisiae. Evidently, the MSE and MAE of our method are both lower than those of the Eqs. (1) and (2), implying that the selection of computing equation for confidence score would influence on the performance of OR-RCNN method. Moreover, the predicted confidence score for Eq. (6) approximates the true value more closely. Thus, we prefer Eq. (6) than the other two for our method. (1)

Study on the impact of the ratio of training set over test set
Given a data set, we have to split it into two subsets, one is the training set and the other is the test set. The ratio of training set over test set may influence the performance.
Here, we check the impact of the ratio of training set over test set on performance with three sub-classifiers f 5 , f 10 , f 15 , on S. cerevisiae data set. We set the ratio to 5:5, 6:4, 7:3, 8:2, 9:1, respectively, and the results are presented on Table 5. With the increasing of the ratio, the values of accuracy, precision, recall, specificity and F1-score all increase. Furthermore, the sub-classifier f 10 increases the fastest in the three sub-classifiers, f 15 increases a bit slower than f 10 . Therefore, the ratio of training data and test data does impact on the performance of our method, and it achieves better performance with a bigger ratio. Besides, the ratio of training data over test data impacts differently on the three sub-classifiers.

Comparison of the operators concatenate and residual shortcut
In our method, the concatenate operator is applied to the bidirectional gated recurrent unit (GRU) layer. Here, we compare the performance of our method with the concatenate operator and that of the same method with residual shortcut operator.  exhibits the comparison results on Guo's Yeast data set from database of interacting proteins (DIP) [21]. We can see that the accuracy, precision and F1-score of concatenate operator are all bigger than those of residual shortcut operator. Further, the accuracy of concatenate operator improves 1.6%, precision improves 1.25% and F1-score improves 1.74%, all compared with those of residual shortcut operator. It implies that the concatenate operator could improve the performance, thus it is more suitable for our method. In other words, the bidirectional GRU with concatenate enhances the delivering of features and makes use of the features much more efficiently.

Comparison with existing PPI prediction methods
In order to show the advantage of our method, we compare it with some state-of-art methods. Our method is an ensemble method and it consists of two modules: feature description model and prediction model. Here, we choose the methods AC [14] and composition transition distribution (CTD) descriptor [22] for feature description, and methods random forest (RF) [23], extreme gradient boosting (XGBoost) [24] and support vector machine (SVM) [25] for the prediction. Thus, we have the methods RF-AC, RF-CTD, XGBoost-AC, XGBoost-CTD, SVM-AC and SVM-CTD. Table 7 demonstrates the comparison results of MAE and MSE for the confidence score on data sets S. cerevisiae and Homo sapiens. We can see that the RF-CTD method achieves the smallest MAE and MSE among the existing methods on both data sets, and the results of RF-AC is very close to those of RF-CTD. Meanwhile, our method reduces the MAE and MSE by 49.78% and 57.33% on data set S. cerevisiae, and reduces the MAE and MSE by 44.12% and 50.75% on data set Homo sapiens, respectively, both compared with RF-CTD. To sum up, we have following two conclusions: (1) Our method achieves much more accurate values of confidence score on both data sets, thereby improving the performance on different species; (2) The reductions of MAE and MSE vary with data sets, that is, our method improves the performance to different degrees on different species.
After predicting the confidence score for the protein pairs, we set a threshold θ = 0.1, 0.2, . . . , 0.9 , to predict the PPIs. Figure 2 illustrates the accuracy, precision, specificity, F1-score and recall of our method, compared with RF-AC, RF-CTD and R-RCNN methods. Note that R-RCNN is not an existing method. In order to show the effectiveness of ordinary regression in our method, the R-RCNN method is introduced. It consists of two parts. The first part is the RCNN encoder, which is the same as that of our method. While the second part R represents the regression model. Here, we use the multi-layer perceptron with scalar output as the regression model. In most cases, our method could get better results for the five criterions than other methods. Concretely, when the value of θ increases from 0.1 to 0.9, on one hand, the values of accuracy, precision and specificity decrease first and then increase, while those of F1-score and recall tend to decrease slowly. On the other hand, the values of the five criterion of our method fluctuate in a smaller range compared with the other three methods. Especially when we choose the value of θ ranging from 0.5 to 0.9, our method improves the performance more significantly. In general, our method outperforms most of the existing methods.

Discussion
It is well known that most of the PPI prediction models contain two modules: one is an encoder encoding the protein pairs into feature vectors, the other is a prediction model determining whether the interactions exist in the protein pairs. Inspired by the idea of RCNN encoder and ordinal regression, we propose the OR-RCNN method to predict PPIs. On one hand, two RCNN encoders with shared parameters are assembled to one encoder, so that each protein pair could be encoded into one feature vector. For the encoder, we also substitute the concatenate operator for the residual shortcut operator in the bidirectional GRU layer, since experimental results on Guo's Yeast data set have shown the advantages of concatenate operator. On the other hand, considering the fact that the higher the confidence score of one protein pair is, the more likely the protein pair interacts, we suggest to mine the hidden ordinal information behind the confidence score to boost the performance of PPI prediction. For this purpose, the ordinal regression is applied in our method. Compared to the common regression model without using ordinal information, the ordinal regression model improves the performance in terms of five metrics: accuracy, precision, specificity, F1-score and recall. In summary, by combining the assembled RCNN encoder and the ordinal regression model, our OR-RCNN method significantly boosts the prediction performance, and outperforms most of the existing methods.

Conclusion
In this paper, an OR-RCNN method is proposed to predict PPIs according to its confidence score. In our method, the protein sequence pair is first encoded into one embedding vector based on two RCNN encoders. They share the same parameters, so as to reduce the complexity for training process. Next, multiple sub-classifiers are investigated to the embedding vector based on the idea of ordinal regression. It effectively exploits the ordinal information behind the confidence score by uniformly splitting the confidence interval into several non-overlapping sub-intervals, and rearranging the sub-intervals in an increasing order. Then, the ordinal information from these sub-classifiers of any protein pair are aggregated to get its confidence score. Finally, we predict the PPI of the protein pair with a threshold. Experiments have shown that the OR-RCNN method outperforms the state-of-the-art methods on data sets S. cerevisiae and Homo sapiens.

Methods
In this section, we describe the OR-RCNN method for PPI prediction task. Some basic concepts of the RCNN encoder are introduced first. Then, the general framework of our method is conducted. Finally, the technical details of our method are presented.

Preliminaries
Denote by A the vocabulary of 20 standard amino acids. Denote by S = [a 1 , . . . , a l ] the sequence of amino acids for a protein, where a i is an amino acid in the vocabulary.

Pre-trained embedding for amino acids
Since the sequential information of amino acids for a protein is usually non-numerical, the embedding method is necessary in the pre-training process. An amino acid a ∈ A could be embedded into a semi-latent vector a , and a is numerical.
Here, we introduce two embedding methods. The first method applies the Skip-Gram model [26] to the protein sequence. Let a co be the embedding, which measures the similarity of co-occurrence of two amino acids. Formally, to maximize the average log probability of the similarity, we minimize the objective function J SG where a co,t and a co,t+j are both the embedding results for the t'th amino acid a t ∈ S and the neighbor, respectively, and C is the length of the half context. Note, the context is a subsequence of the protein sequence S with length 2C + 1 . The probability p is a softmax function: where a ′ co,k is a negative sample not occurring in the same context with a co,t , and m is the size of negative sampling.
The second method [12] expresses the embedding as a eh . It measure the similarity of properties, like electrostaticity and hydrophobicity, between two amino acids. The reason is that electrostatic and hydrophobic interactions occupy the most important position in PPIs. They could be computed by their dipoles and volumes of the side chains of amino acids, respectively. Naturally, the 20 amino acids in A are divided into 7 classes. Thus, a eh is a 7 dimensional vector, like one-hot encoding method.

RCNN encoder
RCNN encoder [27] is applied to get the global sequential information and local features which are both crucial to predict PPIs. In the deep neural network encoder framework, there are mainly two computing modules. One is the convolution layer with pooling, and the other is the bidirectional GRU with residual. The general framework is illustrated in Fig. 3 [27].
The convolution layer with pooling The purpose of the convolution layer with pooling is to extract local information from the input. Let S ′ = [v 1 , v 2 , . . . , v l ] be an input sequence of pre-trained embedding for a protein or the output of a previous neural network layer. We sample a consecutive sub-sequence [v t , v t+1 , . . . , v t+d−1 ] (simply denoted by v t:t+d−1 ) from S ′ . By using the weight-sharing kernel M ∈ R d×d ′ , it generates a d ′ dimensional latent vector h 1 log p(a co,t+j |a co,t ), p(a co,t+j |a co,t ) = exp(a co,t+j · a co,t ) m k=1 exp(a ′ co,k · a co,t ) , Fig. 3 Illustration of RCNN encoder. The left illustrates the whole structure of RCNN encoder, and the right exhibits the structure of RCNN unit from the sub-sequence v t:t+d−1 , where d is the parameter for the kernel size and b M is a vector for bias. The latent vector h 1 t extracts local features from the sub-sequence v t:t+d−1 . Let t = 1, 2, . . . , l − d + 1 , respectively, it obtains a sequence of latent vectors , generating from all the sub-sequences of input sequence S ′ . And H is the output of convolution layer.
Consider that the size of H is too big, i.e., there are too many features extracted from the input. Thus, in the pooling layer, it aims at reducing the dimension of the output H to make it robust. To this end, the "n-max-pooling mechanism" [28,29] is employed to every sub-sequence sampled from H with length n, where the length n is a pre-defined parameter. Notably, any two sub-sequences sampled from H are not overlapped. The mechanism is to choose the maximal value of the subsequence as its value in each dimension j, defined as Though the pooling layer discretizes the output of convolution layer, the most important features to the subsequence are preserved in the pooling output, and the number of preserved features is only 1/n of that of output of convolution layer.
Bidirectional GRU with Residual The GRU [30,31] is an alternative of the long shortterm memory (LSTM) network. Compared with LSTM, the GRU is much more efficient, and it discovers the sequential information without the demand of single memory cells [32]. For the purpose, each unit is composed of two kinds of gates: one is the reset gate r t , and the other is the update gate z t .
Given an input vector v t ∈ S ′ , GRU updates the hidden state h 3 t based on the weighted average value of the candidate state h 3 t and the previous state h 3 t−1 . The updating equation is expressed as follows where M * and N * ( * ∈ {z, s, r} ) are weight matrices, b * is a bias vector, σ is a sigmoid function, and the notation ⊙ means the element-wise multiplication. Here, the reset gate r t calculates the candidate state h 3 t , and the update gate z t updates the hidden state h 3 t . The bidirectional GRU layer [27] takes into account the sequential information of the input sequence S ′ in two directions. In the forward encoding process −−→ GRU , the input sequence [v 1 , v 2 , . . . , v l ] is read from v 1 to v l . While in the backward encoding process ←−− GRU , it is read from v l to v 1 . For every input vector v t , the two encoding results for different directions are put together, that is, In addition, the residual mechanism [33] is also implemented in the bidirectional GRU layer. It identically maps the bidirectional GRU input to its output with a residual shortcut. Thus, the value of input vector v t is added to the hidden state h 4 t , and the bidirectional GRU layer with residual shortcut is defined as It greatly simplifies the training process, and requires much less time for updating parameters to converge.
Protein Sequence Encoding Figure 3(left) shows the general framework of the encoding process for a given protein sequence S, denoted by E RCNN (S).
Given a protein sequence S, the convolution layer with pooling and the bidirectional GRU layer with residual shortcut occur in the framework alternately. The convolution layer is the first encoding layer to extract the local features from the input sequence, and the pooling layer is to make the convolution result robust. Then, the robust results are input into the bidirectional GRU layer with residual, such that the sequential information are preserved. The two components form a RCNN unit, illustrated in Fig. 3(right). By using multiple RCNN units, we can get a multi-granular feature aggregation for the protein sequence S. Indeed, before the RCNN unit, the protein sequence S has been embedded into one vector, and it is the feature vector of the protein. By virtue of the first RCNN unit, the feature vector is encoded into another vector by Eq. (4), that is, the features are aggregated for the first time. Then, the aggregated vector is regarded as the input of the second RCNN unit, and it is aggregated again. In other words, the features of protein sequence S are aggregated as many times as the repeating occurrence of the RCNN unit.
On top of the framework, the last bidirectional GRU layer is followed by a convolution layer with pooling. The convolution layer is the same to that in RCNN unit, such that the local features are extracted from the final hidden states H ′ = [h ′ 1 , h ′ 2 , . . . , h ′ |H ′ | ] . However, the pooling layer differs from that in RCNN unit. Instead of the "n-max-pooling mechanism", the "global average pooling mechanism" [34] is applied here, since the dimensions of the final hidden states and the previous hidden states are not necessary to be equal. It takes the average of all the features, i.e., This is the result of protein sequence encoder.

Overview
The framework of OR-RCNN method is illustrated in Fig. 4. It is composed of two parts: one is the encoder for protein sequence pair (in the bottom dashed rectangle), the other is the prediction model for PPIs by confidence score.
The encoder for protein sequence pair contains two RCNNs with shared parameters and the element-wise multiplication technique. Each RCNN encodes one sequence of the protein pair into an embedding vector. Since the two RCNNs are both deep neural (4) networks, they share the same parameters to reduce the computational complexity of the training process. Based on it, we use the element-wise multiplication technique to transform the two embedding vectors into one vector. In other words, the protein sequence pair is encoded into one embedding vector. In order to predict PPI for the protein sequence pair, we use the confidence score to measure the likelihood of existence of PPI. The higher confidence score the protein sequence pair has, the more likely the protein pair interacts. Thus, the problem of PPI prediction is converted to the problem of the confidence score prediction. Given N protein pairs and the corresponding confidence scores, we first divide the interval of confidence score value into K sub-intervals. Obviously, the sub-intervals could be ranked in an increasing order. After ranking, we give the k'th ( k = 1, 2, . . . , K ) sub-interval a label k. Thus, each protein pair is labeled k ( k = 1, 2, . . . , K ), if its value of confidence score falls in the k'th sub-interval. To exploiting the ordinal information in a better way, the ordinal regression is investigated here. It trains K − 1 binary sub-classifiers by the N protein pairs and their labels. When a novel protein pair is coming, the K − 1 binary sub-classifiers could be jointly used to predict the final label for the protein pair. Based on it, the label of the protein pair is mapped into the value of confidence score by a certain computing equation. Finally, the PPI prediction result is totally determined by the confidence score.

Technical details
In order to simplify the notations, let x i = (S i 1 , S i 2 ), i = 1, . . . , N be N pairs of proteins, and CS i be the confidence score of x i .

Encoder for a pair of protein sequences
Given a pair of protein sequences x i = (S i 1 , S i 2 ) , the protein sequences S i 1 and S i 2 are first embedded into a vector, respectively, in the pre-training process. And the embedding vector is the feature vector of the corresponding protein. This step ensures the input proteins S i 1 and S i 2 are both numerical type, preparing for the follow-up work. Then, the two embedding vectors are both encoded to another vectors, respectively, by two RCNNs with concatenate operator. Note that, the RCNN with concatenate operator is a slightly modification of RCNN encoder with residual shortcut (Eq. (4)) [27]. Figure 5 shows the workflow of RCNN encoder with concatenate operator.
Our method only differs in the RCNN unit for the bidirectional GRU layer. Our method use the concatenate operator [35], instead of the residual mechanism. It connects all the features on the channels to realize the feature reuse. Given an input vector v t of the convolution layer and the hidden state h 4 t (Eq. (3)), the concatenate operator is defined as follows Here, the input vector v t is concatenated to the right side of hidden state h 4 t , so as to avoid the problem of gradient disappearing. Furthermore, it enhances the delivering of features, makes use of the features much more efficiently, and reduces the numbers of parameters to a certain extent. In other words, the protein sequences S i 1 and S i 2 are encoded to the E RCNN (S i 1 ) and E RCNN (S i 2 ) , respectively, by virtue of Eq. (5) in the bidirectional GRU layer. It is noteworthy that the two protein sequence encoders share the same parameters, which also reduces the number of parameters in our method, thereby reduces the computational cost.
Finally, the embedding vectors for protein sequences E RCNN (S i 1 ) and E RCNN (S i 2 ) are transformed into one vector by the element-wise multiplication, i.e., E RCNN (S i 1 ) ⊙ E RCNN (S i 2 ) (or simply write as x i ). This multiplication is a common technique to discover the relationship between the two embedding vectors. where q i = (q i 1 , q i 2 ) is an one-hot indicator for the class label of x i . Then, we have the sub-classifier f k , j = 1, 2.
If the inequality f k (·) > 0 holds true, it means the predicted label of the protein pair is bigger than k, otherwise, it is no more than k. Now, we summarize all the ordinal information from each sub-classifier f k , k = 1, 2, . . . , K − 1 to derive the order of a given protein pair x i . The K − 1 outputs f k (x i ), k = 1, 2, . . . , K − 1 are aggregated to predict the final label of x i . The final label is defined as where [·] is equal to 1, if the condition in [·] is satisfied, otherwise it is equal to 0. Moreover, the predicted confidence score for protein pair x i is expressed as which takes the middle value of the sub-interval corresponding to the predicted final label as its confidence score.
Finally, we predict PPI of protein pair x i . Given a threshold θ , if the predicted confidence score of x i is bigger than θ , we determine there exists an interaction between the protein pair x i , otherwise, there does not exist the interaction. Note that, the final prediction result is up to the value of threshold θ . We could adjust the prediction performance by setting the optimal value of θ through experiments.