Protein–protein interaction prediction based on ordinal regression and recurrent convolutional neural networks
BMC Bioinformatics volume 22, Article number: 485 (2021)
Abstract
Background
Protein–protein interactions (PPIs) are essential to most biological processes. Predicting PPIs helps to understand protein functions and thus supports pathological analysis, disease diagnosis, drug design, etc. As the amount of protein data grows quickly in the post-genomic era, high-throughput experimental methods are too expensive and time-consuming for PPI prediction. Computational methods have therefore attracted researchers’ attention in recent years, and a large number of them have been proposed based on different protein sequence encoders.
Results
Notably, the confidence score of a protein sequence pair can be regarded as a measure of PPIs: the higher the confidence score of a protein pair, the more likely the pair interacts. In this paper, a deep learning framework called ordinal regression and recurrent convolutional neural network (ORRCNN) is introduced to predict PPIs from the perspective of the confidence score. It contains two main parts: an encoder for the protein sequence pair and a predictor of PPIs by confidence score. In the first part, two recurrent convolutional neural networks (RCNNs) with shared parameters construct two protein sequence embedding vectors, automatically extracting robust local features and sequential information from the protein pair. The two embedding vectors are then encoded into one novel embedding vector by element-wise multiplication. In the second part, ordinal regression takes the ordinal information behind the confidence score into account by constructing multiple subclassifiers, whose results are aggregated to obtain the final confidence score. The existence of a PPI is then determined from the confidence score: we set a threshold \(\theta\) and say that the protein pair interacts if its confidence score is greater than \(\theta\).
Conclusions
We applied our method to predict PPIs on the S. cerevisiae and Homo sapiens data sets. In these experiments, our method outperforms state-of-the-art PPI prediction models.
Background
Proteins [1, 2] are critical to the cells and tissues of the body. They participate in various life activities, such as antibody immunity, catalysis of metabolic reactions and molecular transport. Proteins usually associate with other proteins to form protein complexes, so as to perform the functions of living organisms more effectively. Within these complexes, protein–protein interactions (PPIs) play a crucial role in carrying out different biological processes in cells, such as transcription, translation, cell cycle control, and secretion [3]. Therefore, PPI prediction is of great significance in pathological analysis [4], disease diagnosis [5] and drug design [6], and is becoming a research focus in proteomics.
Many high-throughput experimental methods have been applied to identify PPIs from protein complexes, such as yeast two-hybrid screens [7], tandem affinity purification [8, 9], and proteome chips and microarray technology [10, 11]. However, with the accumulation of protein data, these methods are restricted by time and economic cost, and cannot meet the needs of life science research in the post-genomic era. It is also worth noting that, owing to subjective or objective factors such as operation error and experimental error, the experimental results often deviate from the actual results, sometimes leading to a large proportion of false positives or false negatives. For example, about 80,000 PPIs have been predicted by these high-throughput experimental methods, but only a relatively small number of them (about 2400) could be obtained by more than one method [7]. Hence, high-throughput experimental methods alone do not yield high-quality, reliable results. To overcome these drawbacks, computational methods have attracted researchers’ attention. Some of them [12] predict PPIs mainly from the sequence information of the amino acids of proteins. Others [13,14,15] rely on structural information, or on the fusion of information from different data sources. With this information, it is also necessary to extract effective features to guarantee good prediction results. For this purpose, researchers have begun to focus on protein sequence encoding techniques and the corresponding prediction techniques for PPIs.
Shen et al. [12] introduced a conjoint triad (CT) descriptor, which considers the properties of each amino acid and its neighbouring amino acids and extracts features of the local environment in the amino acid sequence. Gough et al. [13] proposed a method to describe the amino acid sequence based on physical and chemical properties, combined with the structural information of the protein. Later, also based on the physical and chemical properties of amino acids, Guo et al. [14] established an auto covariance (AC) encoding method to capture the correlation and interaction information of amino acids at different positions. To handle the high dimension of protein sequence features, Thanathamathee et al. [15] first used principal component analysis to reduce the dimension, and then constructed a feed-forward neural network as a classifier to predict PPIs. Recently, some deep learning frameworks have been proposed. For instance, Hashemifar et al. [16] presented a Siamese-like convolutional neural network with random projection, using data augmentation to extract sequential information. Li et al. [17] discussed another deep neural network framework, which learns local features automatically from protein primary sequences alone through encoding, embedding, convolutional neural network, and long short-term memory (LSTM) layers.
Notably, the confidence score can be regarded as a measure of PPIs: the higher the confidence score of a protein pair, the more likely the pair interacts. Thus in this work, we propose a novel method, called ordinal regression and recurrent convolutional neural network (ORRCNN), to predict PPIs by confidence score. The method can be summarized in two parts: (1) an encoder for the protein sequence pair based on the recurrent convolutional neural network (RCNN), and (2) a PPI prediction model based on ordinal regression. To handle the protein sequence pair, two RCNNs with shared parameters are introduced. Each RCNN encodes one sequence of the pair into an embedding vector; it integrates multiple convolution layers with pooling and bidirectional gated recurrent unit (GRU) layers with concatenation, so as to extract local features and sequential information more accurately. An element-wise multiplication is then applied to the two embedding vectors, encoding them into one novel embedding vector. This completes the encoder, which encodes a pair of protein sequences into one embedding vector that aggregates multi-granularity features. To predict PPIs, we first predict the confidence score from the embedding vector. Ordinal information is hidden behind the confidence score once its range is split into ordered subintervals. The concept of rank expresses this ordinal information: each subinterval corresponds to a rank value. To use the ordinal information efficiently, ordinal regression is applied, which transforms the prediction of the confidence score into a series of binary classification problems. Multilayer perceptrons are used as the binary classifiers to obtain the ordinal information about the rank. All the ordinal information is then aggregated to get the final confidence score of the protein pair.
Finally, whether the protein pair interacts is determined by its confidence score. Experimental results show that our method boosts the performance of PPI prediction on both the S. cerevisiae and Homo sapiens data sets.
Results
In this section, we first introduce two PPI data sets S. cerevisiae and Homo sapiens. Then, we give the experimental setting. Finally, the experimental results are presented.
Data sets
The high-throughput data sets S. cerevisiae and Homo sapiens are both from the STRING database [18]. Each pair of protein sequences in S. cerevisiae and Homo sapiens has a confidence score. Let \(\overline{CS}_{min}=0\) and \(\overline{CS}_{max}=1\) be the minimal and maximal values of the confidence score for protein pairs, respectively. We separate the confidence score interval (0, 1) into K subintervals of equal length. Let \(K=20\), so that the 20 subintervals are (0, 0.05), [0.05, 0.1), …, [0.95, 1). A protein pair is labeled k if its confidence score belongs to the kth subinterval. Thus, both data sets are split into 20 subsets according to the subintervals.
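As an illustration of the labeling rule, the following minimal sketch maps a confidence score to its 1-based subinterval label. The function name `score_to_label` and its parameters are hypothetical, and sending the upper endpoint to the last label is an assumption made only so that the function is defined on the whole closed interval.

```python
def score_to_label(score, k_bins=20, cs_min=0.0, cs_max=1.0):
    """Return the 1-based index k of the equal-width subinterval containing score."""
    if not cs_min <= score <= cs_max:
        raise ValueError("score lies outside the confidence interval")
    # Position of the score measured in units of one subinterval width.
    position = (score - cs_min) * k_bins / (cs_max - cs_min)
    # Subintervals are closed on the left; clamping sends cs_max to label K
    # (an assumption, since the text leaves both endpoints open).
    return min(int(position), k_bins - 1) + 1


if __name__ == "__main__":
    for s in (0.03, 0.5, 0.72, 0.97):
        print(s, "->", score_to_label(s))
```

Scores lying exactly on a boundary such as 0.05 are subject to floating-point rounding, so a careful implementation might use integer or decimal arithmetic instead.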
We only consider protein sequences whose length is between 50 and 2000. For the data set S. cerevisiae, there are 1584 samples in the subinterval (0, 0.05) and 5360 samples in the subinterval [0.7, 0.75). Since the remaining 18 subintervals contain many more samples, we randomly select 5400 samples in each of them. Thus, 104,144 samples in total (1584 + 5360 + 18 × 5400) are used in the experiment. For the data set Homo sapiens, since the full data set is too big, we randomly select 5000 samples in each subinterval. For both S. cerevisiae and Homo sapiens, we randomly select 90% of the samples in each subinterval as training data and use the remaining 10% as test data.
Experimental settings
Since the protein sequences have different lengths from 50 to 2000, we extend each shorter protein sequence to length 2000 by zero-padding [19]. Let the batch size be 768, and let \(d=3\). The 3-max-pooling mechanism is applied in the pooling layer. The output of the bidirectional GRU layer with the concatenate operator is a 150-dimensional vector.
The AMSGrad algorithm [20] is used to optimize the cross-entropy loss function L for each subclassifier. In the algorithm, we set the learning rate to 0.001, and set the exponential decay rates \(\beta _{1}\) and \(\beta _{2}\) to 0.9 and 0.999, respectively.
We first evaluate the performance of each subclassifier to see whether the ordinal information about the label has been picked up correctly. The evaluation covers several comparisons: comparison of all the subclassifiers, comparison of pretrained embedding methods, selection of key parameters, comparison of computing equations for the confidence score, and a study of the impact of the ratio of training to test data. Five criteria are used in the evaluation: accuracy, precision, sensitivity (recall), specificity and \(F_{1}\) score. Then, we show that the concatenate operator in our method performs better than the residual shortcut operator. Finally, our method is compared with some existing methods to show its advantages, based on the mean absolute error (MAE), mean squared error (MSE) and the above five criteria.
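The five criteria follow from the confusion counts of a binary subclassifier. The sketch below uses their standard textbook definitions; the function name `binary_metrics` and the convention of returning 0 for an empty denominator are assumptions, not details taken from the experiments.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, specificity and F1 from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)          # also called sensitivity
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    print(binary_metrics([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0]))
```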
Experimental results
Comparison of subclassifiers
The performance of each subclassifier of the ORRCNN method on data set S. cerevisiae is shown in Fig. 1. The 19 subclassifiers have very close performance in accuracy, precision, sensitivity, and \(F_{1}\) score. For specificity, the first 5 subclassifiers perform poorly while the last 14 subclassifiers all perform well. That is, subclassifiers for lower labels have poor specificity, while those for higher labels do much better.
In the rest of the experiments, we do not show the performance of all 19 subclassifiers, but only that of three typical subclassifiers \(f_{5}, f_{10}, f_{15}\). The corresponding subclassification problems are to predict whether the label of a protein pair is bigger than 5, 10 and 15, respectively.
Comparison of pretrained embedding methods
We mainly compare three embedding methods: (1) \({\mathbf{a}}_{co}\); (2) \({\mathbf{a}}_{eh}\); (3) one-hot. Let the half length C of the context be 3, and let the size of negative sampling be 5. The embedding \({\mathbf{a}}_{co}\) is obtained by pretraining on 8000 protein sequences of the data set SHS148k from the STRING database. The embedding \({\mathbf{a}}_{eh}\) is computed directly from electrostaticity and hydrophobicity. The one-hot method assigns a 20-dimensional vector to each amino acid.
Table 1 shows the performance of the different embedding methods for subclassifiers \(f_{5}, f_{10}, f_{15}\) on data set S. cerevisiae. The embedding methods \({\mathbf{a}}_{co}\) and \({\mathbf{a}}_{eh}\) give very close results, and both outperform the one-hot method.
Selection of key parameters
There are two key parameters in the ORRCNN method: (1) the dimension \(d^{\prime}\) of the hidden state, and (2) the number of repetitions of the RCNN unit. We study how to select these parameters.
With the number of repetitions of the RCNN unit fixed at 5, we compare the prediction results for different dimensions of the hidden state on data set S. cerevisiae. The results are shown in Table 2 for subclassifiers \(f_{5}, f_{10}, f_{15}\). We examine the performance with \(d^{\prime}=10,25,50,75\) for each subclassifier. As the dimension increases from 10 to 50, the performance improves significantly. As it increases from 50 to 75, the performance improves only slightly, or even decreases. Thus in most cases, \(d^{\prime}=50\) is the better choice for our method.
Given \(d^{\prime}=50\), we investigate the influence of the number of repetitions of the RCNN unit on the ORRCNN method. The number of repetitions is varied from 1 to 5. Table 3 shows the results for subclassifiers \(f_{5}, f_{10}, f_{15}\) on data set S. cerevisiae. In general, the more often the RCNN unit is repeated, the better the performance. The performance improves rapidly as the number of repetitions goes from 1 to 3, and only slowly, or even drops slightly, from 3 to 5. We set the number of repetitions of the RCNN unit to 5 in the experiments.
Comparison of different computing equations for confidence score
The computing equation for the confidence score is a critical step in our method. For any pair of protein sequences, the middle value of the subinterval into which the predicted label falls is taken as the confidence score (Eq. (6)). Here, we compare this equation with two others, Eqs. (1) and (2), which take the two endpoint values of the subinterval, respectively. In these equations, \(\bar{r}(x_{i})\) denotes the predicted label for protein pair \(x_{i}\).
The results on data set S. cerevisiae are presented in Table 4. The MSE and MAE of our method are both lower than those of Eqs. (1) and (2), implying that the choice of computing equation for the confidence score influences the performance of the ORRCNN method. Moreover, the confidence score predicted by Eq. (6) approximates the true value more closely. Thus, we prefer Eq. (6) to the other two for our method.
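The three candidate rules compared in this subsection (left endpoint, middle value, right endpoint of the predicted subinterval) can be sketched as follows. The function name `label_to_score`, the `position` argument and the 1-based label convention are assumptions for illustration; the sketch does not claim to reproduce the exact form of the numbered equations.

```python
def label_to_score(label, k_bins=20, cs_min=0.0, cs_max=1.0, position="middle"):
    """Map a 1-based subinterval label to a confidence score.

    position selects the left endpoint, the middle value or the right
    endpoint of the label's subinterval.
    """
    offsets = {"left": 0.0, "middle": 0.5, "right": 1.0}
    width = (cs_max - cs_min) / k_bins
    return cs_min + (label - 1 + offsets[position]) * width


if __name__ == "__main__":
    true_score, label = 0.72, 15   # 0.72 lies in the 15th subinterval [0.7, 0.75)
    for position in ("left", "middle", "right"):
        predicted = label_to_score(label, position=position)
        print(position, round(predicted, 3), "abs error:", round(abs(predicted - true_score), 3))
```

When the label is predicted correctly, the middle value is never more than half a subinterval width away from the true score, whereas an endpoint can be off by a full width, which is one plausible reading of why the middle value gives lower errors.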
Study on the impact of the ratio of training set over test set
Given a data set, we have to split it into a training set and a test set, and the ratio between the two may influence the performance. Here, we check the impact of this ratio on the performance of the three subclassifiers \(f_{5}, f_{10}, f_{15}\) on the S. cerevisiae data set. We set the ratio to 5:5, 6:4, 7:3, 8:2 and 9:1, and the results are presented in Table 5. As the ratio increases, the values of accuracy, precision, recall, specificity and \(F_{1}\) score all increase. Furthermore, the performance of subclassifier \(f_{10}\) increases the fastest among the three, and that of \(f_{15}\) increases a bit more slowly than that of \(f_{10}\). Therefore, the ratio of training data to test data does affect the performance of our method, which improves with a bigger ratio, and it affects the three subclassifiers differently.
Comparison of the operators concatenate and residual shortcut
In our method, the concatenate operator is applied to the bidirectional gated recurrent unit (GRU) layer. Here, we compare the performance of our method with the concatenate operator against that of the same method with the residual shortcut operator. Table 6 shows the comparison results on Guo’s Yeast data set from the database of interacting proteins (DIP) [21]. The accuracy, precision and \(F_{1}\) score of the concatenate operator are all higher than those of the residual shortcut operator: accuracy improves by 1.6%, precision by 1.25% and \(F_{1}\) score by 1.74%. This implies that the concatenate operator improves the performance and is thus more suitable for our method. In other words, the bidirectional GRU with concatenation enhances the delivery of features and uses them more efficiently.
Comparison with existing PPI prediction methods
To show the advantage of our method, we compare it with some state-of-the-art methods. Our method is an ensemble method consisting of two modules: a feature description model and a prediction model. For feature description, we choose the AC method [14] and the composition transition distribution (CTD) descriptor [22]; for prediction, we choose random forest (RF) [23], extreme gradient boosting (XGBoost) [24] and support vector machine (SVM) [25]. Combining them gives the methods RFAC, RFCTD, XGBoostAC, XGBoostCTD, SVMAC and SVMCTD.
Table 7 shows the MAE and MSE of the confidence score on data sets S. cerevisiae and Homo sapiens. The RFCTD method achieves the smallest MAE and MSE among the existing methods on both data sets, and the results of RFAC are very close to those of RFCTD. Compared with RFCTD, our method reduces the MAE and MSE by 49.78% and 57.33% on data set S. cerevisiae, and by 44.12% and 50.75% on data set Homo sapiens, respectively. To sum up, we draw two conclusions: (1) our method predicts the confidence score much more accurately on both data sets, thereby improving the performance on different species; (2) the reductions of MAE and MSE vary with the data set, that is, our method improves the performance to different degrees on different species.
After predicting the confidence score for the protein pairs, we set a threshold \(\theta =0.1,0.2,\ldots ,0.9\) to predict the PPIs. Figure 2 illustrates the accuracy, precision, specificity, \(F_{1}\) score and recall of our method, compared with the RFAC, RFCTD and RRCNN methods. Note that RRCNN is not an existing method; it is introduced to show the effectiveness of ordinal regression in our method. It consists of two parts. The first part is the RCNN encoder, which is the same as that of our method. The second part, R, is a regression model, for which we use a multilayer perceptron with scalar output. In most cases, our method gets better results for the five criteria than the other methods. Concretely, as \(\theta\) increases from 0.1 to 0.9, the values of accuracy, precision and specificity first decrease and then increase, while those of \(F_{1}\) score and recall tend to decrease slowly. In addition, the five criteria of our method fluctuate within a smaller range than those of the other three methods. Especially for \(\theta\) between 0.5 and 0.9, our method improves the performance more significantly. In general, our method outperforms most of the existing methods.
Discussion
It is well known that most PPI prediction models contain two modules: an encoder that encodes the protein pairs into feature vectors, and a prediction model that determines whether the protein pairs interact. Inspired by the RCNN encoder and ordinal regression, we propose the ORRCNN method to predict PPIs. On the one hand, two RCNN encoders with shared parameters are assembled into one encoder, so that each protein pair can be encoded into one feature vector. In the encoder, we also replace the residual shortcut operator in the bidirectional GRU layer with the concatenate operator, since experimental results on Guo’s Yeast data set have shown the advantages of the concatenate operator. On the other hand, considering that the higher the confidence score of a protein pair is, the more likely the pair interacts, we propose mining the ordinal information hidden behind the confidence score to boost the performance of PPI prediction. For this purpose, ordinal regression is applied in our method. Compared with a common regression model that does not use ordinal information, the ordinal regression model improves the performance in terms of five metrics: accuracy, precision, specificity, \(F_{1}\) score and recall. In summary, by combining the assembled RCNN encoder and the ordinal regression model, our ORRCNN method significantly boosts the prediction performance and outperforms most of the existing methods.
Conclusion
In this paper, the ORRCNN method is proposed to predict PPIs according to the confidence score. In our method, the protein sequence pair is first encoded into one embedding vector by two RCNN encoders, which share the same parameters so as to reduce the complexity of the training process. Next, multiple subclassifiers are applied to the embedding vector based on the idea of ordinal regression. This effectively exploits the ordinal information behind the confidence score by uniformly splitting the confidence interval into several non-overlapping subintervals and arranging the subintervals in increasing order. Then, the ordinal information from these subclassifiers is aggregated to get the confidence score of the protein pair. Finally, we predict the PPI of the protein pair with a threshold. Experiments have shown that the ORRCNN method outperforms the state-of-the-art methods on data sets S. cerevisiae and Homo sapiens.
Methods
In this section, we describe the ORRCNN method for the PPI prediction task. Some basic concepts of the RCNN encoder are introduced first. Then, the general framework of our method is outlined. Finally, the technical details of our method are presented.
Preliminaries
Denote by \({\mathcal{A}}\) the vocabulary of 20 standard amino acids. Denote by \(S=[a_{1}, \ldots , a_{l}]\) the sequence of amino acids for a protein, where \(a_{i}\) is an amino acid in the vocabulary.
Pretrained embedding for amino acids
Since the sequence information of the amino acids of a protein is non-numerical, an embedding method is necessary in the pretraining process. An amino acid \(a \in {\mathcal{A}}\) is embedded into a numerical semi-latent vector \({\mathbf{a}}\).
Here, we introduce two embedding methods. The first method applies the Skip-Gram model [26] to the protein sequence. Let \({\mathbf{a}}_{co}\) be the embedding, which measures the similarity of co-occurrence of two amino acids. Formally, to maximize the average log probability of the similarity, we minimize the objective function \(J_{SG}\), the negative of that average log probability.
In \(J_{SG}\), \({\mathbf{a}}_{co,t}\) and \({\mathbf{a}}_{co,t+j}\) are the embeddings of the t’th amino acid \(a_{t} \in S\) and of its neighbor, respectively, and C is the half length of the context. Note that the context is a subsequence of the protein sequence S with length \(2C+1\). The probability p is a softmax function of these embeddings.
In the softmax, \({\mathbf{a}}_{co,k}^{\prime}\) is a negative sample not occurring in the same context as \({\mathbf{a}}_{co,t}\), and m is the size of negative sampling.
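Written out under the usual Skip-Gram conventions, the objective and the softmax described above could take the following form. This is a sketch that only uses the symbols defined in the text; the normalization by the sequence length \(l\) and the exclusion of the offset \(j=0\) are assumptions.

```latex
J_{SG} = -\frac{1}{l} \sum_{t=1}^{l} \; \sum_{-C \le j \le C,\; j \ne 0}
         \log p\left({\mathbf{a}}_{co,t+j} \mid {\mathbf{a}}_{co,t}\right),
\qquad
p\left({\mathbf{a}}_{co,t+j} \mid {\mathbf{a}}_{co,t}\right)
  = \frac{\exp\left({\mathbf{a}}_{co,t+j} \cdot {\mathbf{a}}_{co,t}\right)}
         {\sum_{k=1}^{m} \exp\left({\mathbf{a}}_{co,k}^{\prime} \cdot {\mathbf{a}}_{co,t}\right)}
```

With \(C=3\), each amino acid is compared with up to six neighbors, three on each side, and the denominator runs over the \(m\) negative samples.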
The second method [12] expresses the embedding as \({\mathbf{a}}_{eh}\). It measures the similarity of properties, such as electrostaticity and hydrophobicity, between two amino acids, because electrostatic and hydrophobic interactions dominate PPIs. These properties can be computed from the dipoles and the volumes of the side chains of the amino acids, respectively. On this basis, the 20 amino acids in \({\mathcal{A}}\) are divided into 7 classes. Thus, \({\mathbf{a}}_{eh}\) is a 7-dimensional vector, similar to a one-hot encoding.
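A class-based embedding of this kind can be sketched as a one-hot vector over the 7 classes. The grouping in `CLASSES` is the seven-class partition commonly reported for the conjoint triad descriptor; it is an assumption here and should be checked against [12]. The names `CLASSES`, `embed_eh` and `encode_sequence` are hypothetical.

```python
# Assumed seven-class partition of the 20 standard amino acids (one-letter codes).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]

# Lookup table from residue letter to class index.
CLASS_INDEX = {residue: i for i, group in enumerate(CLASSES) for residue in group}


def embed_eh(residue):
    """Return a 7-dimensional one-hot vector marking the class of one residue."""
    vector = [0] * len(CLASSES)
    vector[CLASS_INDEX[residue]] = 1
    return vector


def encode_sequence(sequence):
    """Embed every residue of a protein sequence given as a string."""
    return [embed_eh(residue) for residue in sequence]


if __name__ == "__main__":
    print(encode_sequence("MKV"))
```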
RCNN encoder
The RCNN encoder [27] is applied to get the global sequential information and the local features, both of which are crucial for predicting PPIs. In this deep neural network encoder framework, there are two main computing modules: the convolution layer with pooling, and the bidirectional GRU with residual. The general framework is illustrated in Fig. 3 [27].
The convolution layer with pooling
The purpose of the convolution layer with pooling is to extract local information from the input. Let \(S^{\prime} = [v_{1}, v_{2}, \ldots , v_{l}]\) be an input sequence of pretrained embeddings for a protein, or the output of a previous neural network layer. We sample a consecutive subsequence \([v_{t}, v_{t+1}, \ldots , v_{t+d-1}]\) (simply denoted by \(v_{t:t+d-1}\)) from \(S^{\prime}\). Using the weight-sharing kernel \(M \in {\mathbb{R}}^{d \times d^{\prime}}\), the layer generates a \(d^{\prime}\)-dimensional latent vector \(h_{t}^{1} = M v_{t:t+d-1} + b_{M}\) from the subsequence \(v_{t:t+d-1}\), where d is the kernel size and \(b_{M}\) is a bias vector. The latent vector \(h_{t}^{1}\) extracts local features from the subsequence \(v_{t:t+d-1}\). Letting \(t=1,2,\ldots ,l-d+1\), we obtain a sequence of latent vectors \(H = [h_{1}^{1}, h_{2}^{1}, \ldots , h_{l-d+1}^{1}]\), generated from all the subsequences of the input sequence \(S^{\prime}\). H is the output of the convolution layer.
The output H is large, i.e., many features are extracted from the input. The pooling layer therefore reduces the dimension of H and makes it robust. To this end, the “n-max-pooling mechanism” [28, 29] is applied to every subsequence of length n sampled from H, where n is a predefined parameter and no two sampled subsequences overlap. For each subsequence, the mechanism keeps the maximal value in each dimension j.
Though the pooling layer discretizes the output of the convolution layer, the most important features of each subsequence are preserved in the pooling output, and the number of preserved features is only 1/n of that of the convolution output.
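The convolution and pooling steps can be sketched in NumPy as below. The function names `conv1d` and `max_pool` are hypothetical, the kernel is given an explicit input-channel axis (shape `(d, c, d_out)`) as an implementation assumption, and dropping a trailing remainder shorter than n in the pooling step is also an assumption.

```python
import numpy as np


def conv1d(seq, kernel, bias):
    """Slide a weight-sharing kernel over a sequence.

    seq: (l, c) input vectors; kernel: (d, c, d_out); bias: (d_out,).
    Returns the (l - d + 1, d_out) sequence of latent vectors.
    """
    l, d = seq.shape[0], kernel.shape[0]
    rows = [
        np.tensordot(seq[t:t + d], kernel, axes=([0, 1], [0, 1])) + bias
        for t in range(l - d + 1)
    ]
    return np.stack(rows)


def max_pool(h, n):
    """Take the per-dimension maximum over non-overlapping windows of length n."""
    steps = h.shape[0] // n          # assumed: a trailing partial window is dropped
    return h[:steps * n].reshape(steps, n, h.shape[1]).max(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq = rng.normal(size=(10, 4))            # l = 10 vectors of dimension 4
    kernel = rng.normal(size=(3, 4, 5))       # d = 3, output dimension 5
    h = conv1d(seq, kernel, np.zeros(5))
    print(h.shape, max_pool(h, 3).shape)
```

With l = 10 and d = 3 the convolution yields l - d + 1 = 8 latent vectors, and 3-max-pooling keeps about one third of them.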
Bidirectional GRU with Residual
The GRU [30, 31] is an alternative to the long short-term memory (LSTM) network. Compared with LSTM, the GRU is much more efficient, and it captures sequential information without the need for separate memory cells [32]. For this purpose, each unit is composed of two kinds of gates: the reset gate \(r_{t}\) and the update gate \(z_{t}\).
Given an input vector \(v_{t} \in S^{\prime}\), the GRU updates the hidden state \(h_{t}^{3}\) as a weighted average of the candidate state \(\tilde{h}_{t}^{3}\) and the previous state \(h_{t-1}^{3}\). In the updating equations, \(M_{*}\) and \(N_{*}\) (\(* \in \{z,s,r\}\)) are weight matrices, \(b_{*}\) is a bias vector, \(\sigma\) is the sigmoid function, and \(\odot\) denotes element-wise multiplication. The reset gate \(r_{t}\) is used to calculate the candidate state \(\tilde{h}_{t}^{3}\), and the update gate \(z_{t}\) is used to update the hidden state \(h_{t}^{3}\).
The bidirectional GRU layer [27] takes into account the sequential information of the input sequence \(S^{\prime}\) in two directions. In the forward encoding process \(\overrightarrow{GRU}\), the input sequence \([v_{1}, v_{2}, \ldots , v_{l}]\) is read from \(v_{1}\) to \(v_{l}\), while in the backward encoding process \(\overleftarrow{GRU}\), it is read from \(v_{l}\) to \(v_{1}\). For every input vector \(v_{t}\), the encoding results of the two directions are put together into the hidden state \(h_{t}^{4}\) (Eq. (3)).
In addition, the residual mechanism [33] is implemented in the bidirectional GRU layer. It maps the bidirectional GRU input identically to its output through a residual shortcut: the value of the input vector \(v_{t}\) is added to the hidden state \(h_{t}^{4}\) (Eq. (4)).
This greatly simplifies the training process, and the parameter updates need much less time to converge.
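The GRU update, the two reading directions and the residual shortcut can be sketched in NumPy as follows. The gate equations are one standard GRU formulation using the symbols \(M_{*}\), \(N_{*}\), \(b_{*}\) from the text; which of the two states the update gate weights, and adding the input to each direction separately in the residual variant, are assumptions. All function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(rng, in_dim, hidden):
    """Random weights M_*, N_* and zero biases b_* for * in {z, r, s}."""
    params = {}
    for gate in "zrs":
        params["M" + gate] = rng.normal(scale=0.1, size=(hidden, in_dim))
        params["N" + gate] = rng.normal(scale=0.1, size=(hidden, hidden))
        params["b" + gate] = np.zeros(hidden)
    return params


def gru_step(v, h_prev, p):
    """One GRU update: gates, candidate state, then a weighted average."""
    z = sigmoid(p["Mz"] @ v + p["Nz"] @ h_prev + p["bz"])            # update gate
    r = sigmoid(p["Mr"] @ v + p["Nr"] @ h_prev + p["br"])            # reset gate
    cand = np.tanh(p["Ms"] @ v + r * (p["Ns"] @ h_prev) + p["bs"])   # candidate state
    return (1.0 - z) * h_prev + z * cand                             # assumed convention


def run_gru(seq, p, hidden):
    """Read seq in the given order and collect the hidden state at every step."""
    h, states = np.zeros(hidden), []
    for v in seq:
        h = gru_step(v, h, p)
        states.append(h)
    return np.stack(states)


def bidirectional_gru(seq, p_fwd, p_bwd, hidden, residual=False):
    """Concatenate forward and backward states for every position."""
    fwd = run_gru(seq, p_fwd, hidden)
    bwd = run_gru(seq[::-1], p_bwd, hidden)[::-1]   # re-align to input order
    if residual:
        # Assumed form of the shortcut; needs input dimension == hidden size.
        fwd, bwd = fwd + seq, bwd + seq
    return np.concatenate([fwd, bwd], axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq = rng.normal(size=(6, 4))
    p_fwd, p_bwd = init_params(rng, 4, 4), init_params(rng, 4, 4)
    print(bidirectional_gru(seq, p_fwd, p_bwd, 4, residual=True).shape)
```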
Protein Sequence Encoding
Figure 3 (left) shows the general framework of the encoding process for a given protein sequence S, denoted by \(E_{RCNN}(S)\).
Given a protein sequence S, the convolution layer with pooling and the bidirectional GRU layer with residual shortcut alternate in the framework. The convolution layer is the first encoding layer and extracts the local features from the input sequence; the pooling layer makes the convolution result robust. The robust results are then fed into the bidirectional GRU layer with residual, so that the sequential information is preserved. The two components form an RCNN unit, illustrated in Fig. 3 (right). By using multiple RCNN units, we get a multi-granular feature aggregation for the protein sequence S. Indeed, before the first RCNN unit, the protein sequence S has been embedded into a feature vector of the protein. The first RCNN unit encodes this feature vector into another vector by Eq. (4), that is, the features are aggregated for the first time. The aggregated vector is then the input of the second RCNN unit, where it is aggregated again. In other words, the features of the protein sequence S are aggregated as many times as the RCNN unit is repeated.
On top of the framework, the last bidirectional GRU layer is followed by a convolution layer with pooling. The convolution layer is the same as that in the RCNN unit, so that local features are extracted from the final hidden states \(H^{\prime} = [h_{1}^{\prime}, h_{2}^{\prime}, \ldots , h_{H^{\prime}}^{\prime} ]\). However, the pooling layer differs from that in the RCNN unit. Instead of the “n-max-pooling mechanism”, the “global average pooling mechanism” [34] is applied here, since the dimensions of the final hidden states and the previous hidden states need not be equal. It takes the average of all the features, and this average is the result of the protein sequence encoder.
Overview
The framework of ORRCNN method is illustrated in Fig. 4. It is composed of two parts: one is the encoder for protein sequence pair (in the bottom dashed rectangle), the other is the prediction model for PPIs by confidence score.
The encoder for the protein sequence pair contains two RCNNs with shared parameters and the element-wise multiplication technique. Each RCNN encodes one sequence of the protein pair into an embedding vector. Since the two RCNNs are both deep neural networks, they share the same parameters to reduce the computational complexity of the training process. We then use element-wise multiplication to transform the two embedding vectors into one vector. In other words, the protein sequence pair is encoded into one embedding vector.
To predict the PPI for a protein sequence pair, we use the confidence score to measure the likelihood that the PPI exists: the higher the confidence score of the pair, the more likely the pair interacts. Thus, the problem of PPI prediction is converted to the problem of confidence score prediction. Given \({\mathcal{N}}\) protein pairs and the corresponding confidence scores, we first divide the range of the confidence score into K subintervals, which can be ranked in increasing order. After ranking, we give the k’th (\(k=1,2,\ldots , K\)) subinterval the label k. Thus, each protein pair is labeled k (\(k=1,2,\ldots , K\)) if its confidence score falls in the k’th subinterval. To exploit the ordinal information better, ordinal regression is applied here. It trains \(K-1\) binary subclassifiers on the \({\mathcal{N}}\) protein pairs and their labels. When a new protein pair arrives, the \(K-1\) binary subclassifiers are jointly used to predict its final label. The label of the protein pair is then mapped to a confidence score by a computing equation. Finally, the PPI prediction result is determined by the confidence score.
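The reduction from one ordinal label to \(K-1\) binary problems, and the aggregation back to a label, can be sketched as follows. The k’th binary target asks whether the label is bigger than k, as in the description of \(f_{5}, f_{10}, f_{15}\); the aggregation rule (one plus the number of positive votes) is a common choice in ordinal regression and is an assumption here, as are the function names.

```python
def label_to_binary_targets(label, k_bins):
    """Encode a label in 1..K as K-1 binary targets; target k is 1 if label > k."""
    return [1 if label > k else 0 for k in range(1, k_bins)]


def aggregate_label(sub_outputs, threshold=0.5):
    """Combine K-1 subclassifier outputs (probabilities) into one label in 1..K.

    Assumed rule: the predicted label is 1 plus the number of subclassifiers
    that vote "bigger than k".
    """
    return 1 + sum(1 for prob in sub_outputs if prob > threshold)


if __name__ == "__main__":
    targets = label_to_binary_targets(3, k_bins=5)
    print(targets, "->", aggregate_label(targets))
```

Encoding a label and aggregating the resulting targets returns the original label, so the two functions are consistent with each other for every label from 1 to K.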
Technical details
In order to simplify the notations, let \(x_{i} = (S_{i_{1}}, S_{i_{2}}), i=1,\ldots ,{\mathcal{N}}\) be \({\mathcal{N}}\) pairs of proteins, and \(\overline{CS}_{i}\) be the confidence score of \(x_{i}\).
Encoder for a pair of protein sequences
Given a pair of protein sequences \(x_{i} = (S_{i_{1}}, S_{i_{2}})\), the protein sequences \(S_{i_{1}}\) and \(S_{i_{2}}\) are each first embedded into a vector in the pre-training process. This embedding vector serves as the feature vector of the corresponding protein. This step ensures that the inputs \(S_{i_{1}}\) and \(S_{i_{2}}\) are both numerical, preparing for the follow-up work.
Then, each of the two embedding vectors is encoded into another vector by an RCNN with the concatenate operator. Note that the RCNN with the concatenate operator is a slight modification of the RCNN encoder with residual shortcut (Eq. (4)) [27]. Figure 5 shows the workflow of the RCNN encoder with the concatenate operator.
Our method differs only in the RCNN unit for the bidirectional GRU layer, where it uses the concatenate operator [35] instead of the residual mechanism. The operator connects all the features on the channels to realize feature reuse. Given an input vector \(v_{t}\) of the convolution layer and the hidden state \(h_{t}^{4}\) (Eq. (3)), the concatenate operator (Eq. (5)) returns the joined vector \([h_{t}^{4}, v_{t}]\).
Here, the input vector \(v_{t}\) is concatenated to the right side of the hidden state \(h_{t}^{4}\), so as to avoid the vanishing gradient problem. Furthermore, it enhances feature propagation, makes more efficient use of the features, and reduces the number of parameters to a certain extent. In other words, the protein sequences \(S_{i_{1}}\) and \(S_{i_{2}}\) are encoded to \(E_{RCNN}(S_{i_{1}})\) and \(E_{RCNN}(S_{i_{2}})\), respectively, by virtue of Eq. (5) in the bidirectional GRU layer. It is noteworthy that the two protein sequence encoders share the same parameters, which also reduces the number of parameters in our method and thereby the computational cost.
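The contrast between a residual shortcut and the concatenate operator can be illustrated with a small sketch. The function names and toy vectors below are hypothetical and chosen only for illustration; the one point taken from the text is that \(v_{t}\) is appended to the right of \(h_{t}^{4}\), whereas an additive shortcut sums the two vectors and therefore needs them to have equal length.

```python
from typing import List

Vector = List[float]


def residual_shortcut(h: Vector, v: Vector) -> Vector:
    """Residual-style merge: add the input to the hidden state (equal lengths required)."""
    if len(h) != len(v):
        raise ValueError("residual shortcut needs vectors of equal length")
    return [a + b for a, b in zip(h, v)]


def concatenate_operator(h: Vector, v: Vector) -> Vector:
    """Concatenate-style merge: keep the hidden state and append the input on its right."""
    return h + v


h_t = [0.2, -0.5, 0.1]  # toy hidden state (illustrative values)
v_t = [1.0, 0.0, 0.3]   # toy input vector of the convolution layer

merged = concatenate_operator(h_t, v_t)
print(len(merged))  # 6: both feature sets are kept side by side
```

In this toy form the concatenated vector is longer than either input, so whatever layer follows would have to accept the larger width.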
Finally, the embedding vectors for protein sequences \(E_{RCNN}(S_{i_{1}})\) and \(E_{RCNN}(S_{i_{2}})\) are transformed into one vector by element-wise multiplication, i.e., \(E_{RCNN}(S_{i_{1}}) \odot E_{RCNN}(S_{i_{2}})\) (written simply as \(\bar{x}_{i}\)). This multiplication is a common technique for capturing the relationship between two embedding vectors.
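A minimal sketch of the pair encoding follows. The `toy_encoder` function is a hypothetical stand-in for \(E_{RCNN}\) (mean pooling followed by a linear map), not the RCNN itself, and the weights and sequences are made-up values. It is meant only to show parameter sharing and the element-wise product.

```python
from typing import List

Vector = List[float]


def toy_encoder(residue_vectors: List[Vector], weights: List[Vector]) -> Vector:
    """Stand-in for E_RCNN: mean-pool the residue vectors, then apply a linear map."""
    dim = len(residue_vectors[0])
    n = len(residue_vectors)
    pooled = [sum(r[j] for r in residue_vectors) / n for j in range(dim)]
    return [sum(w * p for w, p in zip(row, pooled)) for row in weights]


def encode_pair(s1: List[Vector], s2: List[Vector], weights: List[Vector]) -> Vector:
    """Encode both sequences with the SAME weights and merge by element-wise product."""
    e1 = toy_encoder(s1, weights)
    e2 = toy_encoder(s2, weights)
    return [a * b for a, b in zip(e1, e2)]


shared_weights = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]  # hypothetical 3x2 parameters
seq_a = [[1.0, 2.0], [3.0, 4.0]]              # two residues, two features each
seq_b = [[0.0, 1.0], [2.0, 1.0], [1.0, 1.0]]  # the sequences may differ in length

x_bar = encode_pair(seq_a, seq_b, shared_weights)
print(x_bar)  # [2.0, 2.5, 12.0]
```

Because both sequences pass through the same parameters and the product is commutative, swapping the two proteins gives the same pair embedding in this sketch.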
Prediction model for PPIs by confidence score
Suppose the confidence score of a protein pair falls in the interval \((\overline{CS}_{min}, \overline{CS}_{max})\). We separate it uniformly into K non-overlapping subintervals. The first subinterval is \((\overline{CS}_{min}, \overline{CS}_{min}+(\overline{CS}_{max}-\overline{CS}_{min})/K)\), and the k’th (\(k=2,\ldots ,K\)) subinterval is \([\overline{CS}_{min}+(k-1)(\overline{CS}_{max}-\overline{CS}_{min})/K, \overline{CS}_{min}+k(\overline{CS}_{max}-\overline{CS}_{min})/K)\). Accordingly, the K subintervals are ranked in increasing order. Then, the label \(y_{i}\) of \(x_{i}\) is set to k (\(k=1,2,\ldots ,K\)) if the confidence score of \(x_{i}\) falls in the k’th subinterval. Altogether, \(D =\{(\bar{x}_{i},y_{i}), i=1,\ldots , {\mathcal{N}} \}\) is the training data set for the prediction model for PPIs by confidence score.
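The labeling rule above can be sketched in a few lines. The function name `score_to_label` and the example range (0, 1) with K = 4 are assumptions made for illustration; only the uniform, left-closed subintervals come from the text.

```python
import math


def score_to_label(cs: float, cs_min: float, cs_max: float, K: int) -> int:
    """Return the label k in 1..K of the uniform subinterval that contains cs."""
    if not (cs_min < cs < cs_max):
        raise ValueError("score must lie strictly inside (cs_min, cs_max)")
    width = (cs_max - cs_min) / K
    # Subinterval k covers [cs_min + (k-1)*width, cs_min + k*width).
    k = math.floor((cs - cs_min) / width) + 1
    return min(k, K)  # guard against floating-point rounding near cs_max


# Illustrative range (0, 1) split into K = 4 subintervals of width 0.25.
print(score_to_label(0.10, 0.0, 1.0, 4))  # 1
print(score_to_label(0.25, 0.0, 1.0, 4))  # 2, because subintervals are closed on the left
print(score_to_label(0.99, 0.0, 1.0, 4))  # 4
```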
Now, we train the prediction model with the data set D. Since the ordinal information of each sample \(\bar{x}_{i}\) is hidden behind the label \(y_{i}\), ordinal regression [36, 37] is applied here to make full use of it. Ordinal regression can be regarded as the aggregation of \(K-1\) subclassification problems, where the k’th (\(k=1,2,\ldots ,K-1\)) subclassification problem is to determine whether the label of \(x_{i}\) is bigger than k. To this end, we divide the whole training set D into two subsets, the positive class \(D_{k}^{+}\) with labels bigger than k and the negative class \(D_{k}^{-}\) with labels no more than k, and relabel the samples as positive or negative accordingly.
Denote by \(f_{k}\) (\(k=1,2,\ldots ,K-1\)) the subclassifier for the k’th subclassification problem. The \(K-1\) binary subclassifiers \(f_{k}, k = 1, 2, \ldots , K-1\) are all trained on the entire training set D, each with a different division. Training every subclassifier on all the data is intended to improve classification performance and to reduce overfitting.
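The division of D into \(K-1\) binary problems can be sketched as follows. The helper name `binary_targets` and the 1/0 encoding of the positive and negative classes are assumptions made for illustration; the text fixes only the rule "label bigger than k" versus "label no more than k".

```python
from typing import Dict, List


def binary_targets(labels: List[int], K: int) -> Dict[int, List[int]]:
    """For each k in 1..K-1, mark a sample 1 if its label exceeds k, else 0."""
    return {k: [1 if y > k else 0 for y in labels] for k in range(1, K)}


labels = [1, 3, 2, 4]  # ordinal labels of four toy protein pairs, K = 4
targets = binary_targets(labels, K=4)
for k, t in targets.items():
    print(k, t)
# 1 [0, 1, 1, 1]
# 2 [0, 1, 0, 1]
# 3 [0, 0, 0, 1]
```

Every subproblem uses all four samples; only the split into positive and negative classes changes with k.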
During training, the subclassifier \(f_{k}\) is realized by a multilayer perceptron with a Leaky ReLU activation function [38], which mitigates the vanishing gradient problem and converges much faster than the sigmoid or tanh activation functions. Given a pair of protein sequences \(x_{i}\), the output of the perceptron is a two-dimensional vector, denoted by \(\hat{s}^{i} =(\hat{s}_{1}^{i},\hat{s}_{2}^{i})\), which is normalized by the softmax function to another vector, denoted by \(s^{i}=(s_{1}^{i},s_{2}^{i})\), with \(s_{j}^{i} = \exp (\hat{s}_{j}^{i})/(\exp (\hat{s}_{1}^{i})+\exp (\hat{s}_{2}^{i}))\) for \(j=1,2\).
Here, \(s_{1}^{i}\) and \(s_{2}^{i}\) represent the confidence levels of \(x_{i}\) belonging to the positive and negative classes, respectively. For the perceptron, the learning target is to minimize the cross-entropy loss function L computed from \(s^{i}\) and \(q^{i}\), where \(q^{i} =(q_{1}^{i}, q_{2}^{i})\) is a one-hot indicator for the class label of \(\bar{x}_{i}\). The trained perceptron then gives the subclassifier \(f_{k}\).
If the inequality \(f_{k}(\cdot ) > 0\) holds, the predicted label of the protein pair is bigger than k; otherwise, it is no more than k.
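The following sketch walks through one subclassifier's output for a single protein pair. Two choices are assumptions and not taken from the text: the loss is written per sample without any averaging over the data set, and the decision value is taken to be \(s_{1}^{i}-s_{2}^{i}\), which is just one form that is positive exactly when the positive class has the larger softmax score. All function names are hypothetical.

```python
import math
from typing import List


def softmax2(s_hat: List[float]) -> List[float]:
    """Softmax of the raw perceptron outputs (shifted by the max for stability)."""
    m = max(s_hat)
    exps = [math.exp(v - m) for v in s_hat]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(s: List[float], q: List[int]) -> float:
    """Per-sample cross-entropy between softmax output s and one-hot indicator q."""
    return -sum(qj * math.log(sj) for sj, qj in zip(s, q) if qj > 0)


def subclassifier_output(s: List[float]) -> float:
    """Assumed decision value: positive when the positive-class score is larger."""
    return s[0] - s[1]


s = softmax2([2.0, 0.0])             # raw outputs favouring the positive class
print(subclassifier_output(s) > 0)   # True: predicted label is bigger than k
print(cross_entropy(s, [1, 0]) < cross_entropy(s, [0, 1]))  # True
```

The loss is smaller when the one-hot indicator agrees with the class that the softmax output favours, which is what minimizing it during training rewards.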
Now, we summarize the ordinal information from the subclassifiers \(f_{k}, k = 1,2,\ldots , K-1\) to derive the order of a given protein pair \(x_{i}\). The \(K-1\) outputs \(f_{k}(\bar{x}_{i}), k = 1,2,\ldots , K-1\) are aggregated to predict the final label of \(x_{i}\). The final label is defined as one plus the number of subclassifiers with positive output, \(1+\sum _{k=1}^{K-1}[f_{k}(\bar{x}_{i})>0]\), where \([\cdot ]\) is equal to 1 if the condition in \([\cdot ]\) is satisfied, and 0 otherwise. Moreover, for a predicted final label k, the predicted confidence score for protein pair \(x_{i}\) is expressed as \(\overline{CS}_{min}+(k-\frac{1}{2})(\overline{CS}_{max}-\overline{CS}_{min})/K\), which takes the middle value of the subinterval corresponding to the predicted final label as its confidence score.
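The aggregation and the mapping back to a score can be sketched as below. The function names and the example range (0, 1) with K = 4 are assumptions for illustration; the counting rule and the midpoint rule follow the description above.

```python
from typing import List


def final_label(outputs: List[float]) -> int:
    """Final label = 1 + number of subclassifiers whose output is positive."""
    return 1 + sum(1 for f in outputs if f > 0)


def label_to_score(k: int, cs_min: float, cs_max: float, K: int) -> float:
    """Midpoint of the k'th uniform subinterval of (cs_min, cs_max)."""
    width = (cs_max - cs_min) / K
    return cs_min + (k - 0.5) * width


outputs = [0.8, 0.3, -0.6]  # toy values of f_1, f_2, f_3 for one pair (K = 4)
k_hat = final_label(outputs)
print(k_hat)                               # 3
print(label_to_score(k_hat, 0.0, 1.0, 4))  # 0.625, the midpoint of [0.5, 0.75)
```

Because the label is obtained by counting, it stays well defined even when the subclassifiers disagree in a non-monotone way, for example when \(f_{1}\) and \(f_{3}\) are positive but \(f_{2}\) is not.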
Finally, we predict the PPI of protein pair \(x_{i}\). Given a threshold \(\theta\), if the predicted confidence score of \(x_{i}\) is bigger than \(\theta\), we conclude that the two proteins in \(x_{i}\) interact; otherwise, we conclude that they do not. Note that the final prediction result depends on the value of the threshold \(\theta\). The prediction performance can be adjusted by choosing the value of \(\theta\) through experiments.
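A small sketch of the thresholding step is given below. Selecting \(\theta\) by accuracy on a held-out set is an assumption made for illustration, since the text states only that \(\theta\) is set through experiments; the function names, scores, and candidate thresholds are hypothetical.

```python
from typing import List, Sequence


def predict_interaction(score: float, theta: float) -> bool:
    """Predict an interaction when the confidence score exceeds theta."""
    return score > theta


def pick_threshold(scores: Sequence[float], truths: Sequence[bool],
                   candidates: Sequence[float]) -> float:
    """Return the candidate theta with the highest accuracy on the given pairs."""
    def accuracy(theta: float) -> float:
        hits = sum(predict_interaction(s, theta) == t for s, t in zip(scores, truths))
        return hits / len(scores)
    return max(candidates, key=accuracy)


scores: List[float] = [0.125, 0.375, 0.625, 0.875]  # predicted midpoint scores
truths: List[bool] = [False, False, True, True]     # known interaction status
best_theta = pick_threshold(scores, truths, candidates=[0.25, 0.5, 0.75])
print(best_theta)  # 0.5
```

Other selection criteria, such as F1 score, could be substituted for accuracy inside `pick_threshold` without changing the rest of the sketch.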
Availability of data and materials
The data sets used and/or analysed in this study are available from the corresponding articles. The three data sets, S. cerevisiae, Homo sapiens and Yeast, are all available at https://github.com/xuweixia88/ORRCNN.git.
Abbreviations
PPI: protein–protein interaction;
ORRCNN: ordinal regression and recurrent convolutional neural network;
RCNN: recurrent convolutional neural network;
CT: conjoint triad;
AC: auto covariance;
GRU: gated recurrent unit;
MAE: mean absolute error;
MSE: mean squared error;
DIP: database of interacting proteins;
CTD: composition transition distribution;
RF: random forest;
XGBoost: extreme gradient boosting;
SVM: support vector machine;
LSTM: long short-term memory.
References
Branden CI, Tooze J. Introduction to protein structure. New York: Garland Science; 2012.
Morozov AV, Havranek JJ, Baker D, Siggia ED. Protein–DNA binding specificity predictions with structural models. Nucleic Acids Res. 2005;33(18):5781–98.
Junker BH, Schreiber F. Analysis of biological networks. Hoboken: Wiley; 2008.
Furney SJ, Albà MM, López-Bigas N. Differences in the evolutionary history of disease genes affected by dominant or recessive mutations. BMC Genom. 2006;7(1):165.
Wu S, Shao F, Sun R, Sui Y, Wang Y, Wang J. Analysis of human genes with protein–protein interaction network for detecting disease genes. Physica A. 2014;398:217–28.
Clatworthy AE, Pierson E, Hung DT. Targeting virulence: a new paradigm for antimicrobial therapy. Nat Chem Biol. 2007;3(9):541.
Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA. 2001;98(8):4569–74.
Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, Yang LY, Wolting C, Donaldson I, Schandorff S, Shewnarane J, Vo M, Taggart J, Goudreault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova K, Willems AR, Sassi H, Nielsen PA, Rasmussen KJ, Andersen JR, Johansen LE, Hansen LH, Jespersen H, Podtelejnikov A, Nielsen E, Crawford J, Poulsen V, Sorensen BD, Matthiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CWV, Figeys D, Tyers M. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002;415(6868):180–3.
Aebersold R, Mann M. Mass spectrometry-based proteomics. Nature. 2003;422(6928):198–207.
MacBeath G, Schreiber SL. Printing proteins as microarrays for high-throughput function determination. Science. 2000;289(5485):1760–3.
Zhu H, Bilgin M, Bangham R, Hall D, Casamayor A, Bertone P, Lan N, Jansen R, Bidlingmaier S, Houfek T, Mitchell T, Miller P, Dean RA, Gerstein M, Snyder M. Global analysis of protein activities using proteome chips. Science. 2001;293(5537):2101–5.
Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, Li Y, Jiang H. Predicting protein–protein interactions based only on sequences information. Proc Natl Acad Sci USA. 2007;104(11):4337–41.
Bock JR, Gough DA. Whole-proteome interaction mining. Bioinformatics. 2003;19(1):125–34.
Guo Y, Yu L, Wen Z, Li M. Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences. Nucleic Acids Res. 2008;36(9):3025–30.
Thanathamathee P, Lursinsap C. Predicting protein–protein interactions using correlation coefficient and principle component analysis. IEEE; 2009. p. 3025–30.
Hashemifar S, Neyshabur B, Khan AA, Xu J. Predicting protein–protein interactions through sequence-based deep learning. Bioinformatics. 2018;34:802–10.
Li H, Gong X, Yu H, Zhou C. Deep neural network based predictions of protein interactions using primary sequences. Molecules. 2018;23(8):1923–38.
Damian S, Morris JH, Helen C, Michael K, Stefan W, Milan S, Alberto S, Doncheva NT, Alexander R, Peer B. The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible. Nucleic Acids Res. 2017;45:362–8.
Pan X, Hongbin S. Predicting RNA–protein binding sites and motifs through combining local and global deep convolutional neural networks. Bioinformatics. 2018;34(20):3427–36.
Reddi SJ, Kale S, Kumar S. On the convergence of Adam and beyond. In: Proceedings of the 6th international conference on learning representations; 2018. https://openreview.net/forum?id=ryQu7fRZ.
Salwínski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg DS. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004;32:449–51.
Yang L, Xia JF, Gui J. Prediction of protein–protein interactions from protein sequence using local descriptors. Protein Peptide Lett. 2010;17:1085–90.
Wong L, You Z, Li S, Huang Y, Liu G. Detection of protein–protein interactions from amino acid sequences using a rotation forest model with a novel PRLPQ descriptor. In: Huang D, Han K, editors. Advanced intelligent computing theories and applications—11th international conference, Lecture Notes in Computer Science, vol. 9227. Springer; 2015. p. 713–20.
Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining; 2016. p. 785–94.
Chang C, Lin C. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol. 2011;2(3):1–27.
Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th international conference on neural information processing systems, vol. 2; 2013. p. 3111–9.
Chen M, Ju CJT, Zhou G, Chen X, Wang W. Multifaceted protein–protein interaction prediction based on Siamese residual RCNN. Bioinformatics. 2019;35(14):305–14.
Hu B, Lu Z, Li H, Chen Q. Convolutional neural network architectures for matching natural language sentences. In: Proceedings of the 27th international conference on neural information processing systems; 2014. p. 2042–50.
Severyn A, Moschitti A. Twitter sentiment analysis with deep convolutional neural networks. In: Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval; 2015. p. 959–62.
Cho K, Van Merrienboer B, Bahdanau D, Bengio Y. On the properties of neural machine translation: encoder–decoder approaches. Comput Sci. 2014.
Chung J, Gulcehre C, Cho KH, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. 2014.
Dhingra B, Liu H, Yang Z, Cohen WW, Salakhutdinov R. Gated-attention readers for text comprehension. In: Proceedings of the 55th annual meeting of the association for computational linguistics; 2017. p. 1832–46.
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE conference on computer vision and pattern recognition; 2016. p. 770–8.
Lin M, Chen Q, Yan S. Network in network. In: Bengio Y, LeCun Y, editors. Proceedings of the 2nd international conference on learning representation; 2014. arXiv:1312.4400.
Iandola FN, Moskewicz MW, Karayev S, Girshick RB, Darrell T, Keutzer K. DenseNet: implementing efficient convnet descriptor pyramids. Eprint Arxiv; 2014.
Niu Z, Zhou M, Wang L, Gao X, Hua G. Ordinal regression with multiple output CNN for age estimation. In: 2016 IEEE conference on computer vision & pattern recognition; 2016. p. 4920–8.
Chen S, Zhang C, Dong M, Le J, Rao M. Using ranking-CNN for age estimation. In: 2017 IEEE conference on computer vision and pattern recognition; 2017. p. 742–51.
Maas AL, Hannun AY, Ng AY. Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of the 30th international conference on machine learning, Atlanta, Georgia, USA, vol. 28; 2013.
Acknowledgements
Not applicable.
About this Supplement
This article has been published as part of BMC Bioinformatics Volume 22 Supplement 6, 2021: 19th International Conference on Bioinformatics 2020 (InCoB2020). The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-22-supplement-6.
Funding
WX and YG were supported by the National Key Research and Development Program of China (grant No. 2016YFC0901704) and the National Natural Science Foundation of China (NSFC) (grant No. 61972100), YW and JG were supported by the National Natural Science Foundation of China (NSFC) (grant No. 61772367). NSFC funded the design of the study, and the analysis and interpretation of data; the National Key Research and Development Program of China funded the collection of data and the writing of the manuscript. Publication cost was funded by NSFC No. 61972100.
Author information
Contributions
JG conceived the work and revised the manuscript. WX designed the experiments and drafted the manuscript. YG and YW finished the experiments. All authors have read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Xu, W., Gao, Y., Wang, Y. et al. Protein–protein interaction prediction based on ordinal regression and recurrent convolutional neural networks. BMC Bioinformatics 22 (Suppl 6), 485 (2021). https://doi.org/10.1186/s12859-021-04369-0
DOI: https://doi.org/10.1186/s12859-021-04369-0
Keywords
 Protein–protein interaction
 Confidence score
 Ordinal regression
 Recurrent convolutional neural network