Long-distance dependency combined multi-hop graph neural networks for protein–protein interactions prediction
BMC Bioinformatics volume 23, Article number: 521 (2022)
Abstract
Background
Protein–protein interactions are widespread in biological systems and play an important role in cell biology. Because traditional laboratory-based methods have drawbacks, such as being time-consuming and costly, a large number of deep learning methods have emerged. However, these methods do not take into account the long-distance dependency information between each pair of amino acids in a sequence. In addition, most existing models based on graph neural networks aggregate only the first-order neighbors in the protein–protein interaction (PPI) network. Although multi-order neighbor information can be aggregated by stacking more network layers, doing so easily causes overfitting. It is therefore necessary to design a network that can capture long-distance dependency information between amino acids in the sequence and can directly capture multi-order neighbor information in the PPI network.
Results
In this study, we propose a long-distance dependency combined multi-hop graph neural network (LDMGNN) model to predict multi-label protein–protein interactions. In the LDMGNN model, we design the protein amino acid sequence encoding (PAASE) module with a multi-head self-attention Transformer block to extract the features of amino acid sequences by calculating the interdependence between every two amino acids. We also expand the receptive field in space by constructing a two-hop protein–protein interaction (THPPI) network. We combine the PPI network and the THPPI network with the amino acid sequence features, respectively, and input them into two identical GIN blocks at the same time to obtain two embeddings. The two embeddings are then fused and fed to the classifier to predict multi-label protein–protein interactions. Compared with other state-of-the-art methods, LDMGNN shows the best performance on both the SHS27k and SHS148k datasets. Ablation experiments show that the PAASE module and the construction of the THPPI network are feasible and effective.
Conclusions
In general terms, our proposed LDMGNN model achieves satisfactory results in the prediction of multi-label protein–protein interactions.
Background
Proteins are among the most common molecules in organisms. They are the material basis of life activities and participate in various biological processes [1]. Most vital biological processes are driven by protein–protein interactions (PPIs) rather than by an individual protein acting alone [2,3,4]. PPIs are widespread and play an important role in biological systems. For instance, PPIs are essential for cellular activities such as cell proliferation, immune response, signal transduction, DNA transcription, and replication [5]. Therefore, exploring the interactions between proteins is key to studying cell biology [6,7,8] and is of great significance to the diagnosis and treatment of diseases, as well as to the design and development of drugs [9]. At present, there are many methods for the prediction of PPIs, which can be broadly divided into two types: traditional laboratory-based methods and deep learning methods.
In pace with the rapid development of high-throughput technology, a number of traditional laboratory-based methods have been used to predict PPIs, such as mass spectrometric protein complex identification (MSPCI) [10], yeast two-hybrid (Y2H) [11, 12] and tandem affinity purification (TAP) [13, 14]. These methods allow the interactions between proteins to be observed visually. However, the vast majority of these experiments are performed at genome scale and have limited coverage. They also require a great deal of time and money. In addition, some experiments rely on obtaining target proteins from animals, which raises ethical and moral concerns [15,16,17,18,19]. To address the shortcomings of traditional methods, researchers have turned to deep learning methods.
Deep learning, by virtue of its powerful feature learning ability, has been valued in many fields and is evolving rapidly. Bioinformatics is no exception: deep learning has been applied to problems such as PPI prediction. Sun et al. [20] applied a stacked autoencoder (SAE) to capture amino acid sequence features to predict PPIs. Hang et al. [21] designed a deep neural network framework (DNN-PPI) capable of automatically acquiring features from the amino acid sequences of proteins. Chen et al. [7] constructed an end-to-end framework for PPI prediction based on a siamese residual recurrent neural network (PIPR), which extracts features from amino acid sequences. These deep learning models all exhibit excellent generalization ability on PPI prediction tasks. However, they are highly dependent on the amino acid sequences of proteins. Since there may be some similarities between different amino acids, these typical deep learning models cannot effectively capture the information of the entire amino acid sequence or the relationships between different amino acids.
It should be noted that all the methods mentioned above take only amino acid sequence features as input. They do not consider the interactions between proteins [22], which limits their prediction performance. Protein–protein interactions can be considered as hidden information to some extent, so combining them with amino acid sequence information can improve the accuracy of prediction. The PPI network can be viewed as a graph with each protein as a node and the connecting relationships between proteins as edges.
Therefore, Yang et al. [23] proposed to view the PPI network as an undirected graph and applied GCN [24] to the PPI prediction task for the first time. They constructed an unsigned variational graph autoencoder, which combines the PPI network with amino acid sequence information to learn the features of proteins and predict PPIs. Inspired by the methods of graph signal processing, Colonnese et al. [25] considered node features on PPI networks as signals and developed a Markov model to accomplish PPI prediction. Lv et al. [26] constructed the GNN-PPI model based on the graph isomorphism network (GIN) to predict the interactions between protein pairs. It considers not only the amino acid sequence information but also the correlation between proteins.
These models have improved the accuracy of PPI prediction. However, they consider only the information of two proteins that interact directly in the PPI network and ignore the information of protein pairs that interact indirectly. As a result, the information captured by these models is incomplete. Studies [27, 28] have shown that indirect interactions also carry meaningful information. In a biological network, two molecular nodes that do not interact directly also have some similarities [29]. It would therefore be helpful for the target node if the information of such indirectly interacting nodes could be aggregated.
In this paper, we propose a novel LDMGNN model for the multi-label prediction of protein–protein interactions. The model mainly aims to solve the two problems mentioned above. The first is that existing methods ignore the long-distance dependency information between amino acids in the sequence; the second is that existing methods do not fully consider the interactions among indirectly connected protein nodes. To solve the first problem, we use the Transformer with a multi-head self-attention mechanism to capture the correlation between every two amino acids in the sequence. For the second problem, we construct a two-hop protein–protein interaction (THPPI) network based on the PPI network to enhance graph representation learning. Overall, our main contributions are as follows:

We use the Transformer module with a multi-head self-attention mechanism to capture the long-distance dependency information between every two amino acids in the sequence.

Using the two-hop concept, we construct a THPPI network from the PPI network, which captures the information between indirectly interacting proteins and thus increases the receptive field in space.

The experimental results show that our method exhibits good performance.
Results
In the following, we introduce the datasets, experimental parameter settings, evaluation metrics, baselines, and the experimental results and analysis.
Datasets
In this paper, we use two common datasets, SHS27k and SHS148k, to evaluate our method. Both contain a large amount of PPI information and amino acid sequence information. They were randomly selected by Chen et al. [7] from the Homo sapiens subset of STRING [30] under the rule that the sequence alignment similarity is less than \(40 \%\). The SHS27k dataset contains 1690 proteins and 7624 pairs of PPIs, and the SHS148k dataset contains 5189 proteins and 44,488 pairs of PPIs. The interactions in the two datasets are divided into seven types: post-translational modification (ptmod), catalysis, reaction, activation, expression, binding and inhibition. These types reflect not only the physical correlation between proteins but also the functional correlation.
We regard all known PPIs as positive samples and randomly select the same number of negative samples from the unknown interactions, so the positive:negative sample ratio in our experiment is 1:1. Specifically, the number of negative samples is 7624 in the SHS27k dataset and 44,488 in the SHS148k dataset. Inspired by [26], and in order to evaluate the generalization ability of the LDMGNN model more realistically, we use three partition schemes to construct the test set: random, BFS and DFS. The test set accounts for \(20 \%\) of the dataset.
For the BFS and DFS schemes, the test set is built as follows. Given the protein set P and the PPI set I, we construct a protein–protein interaction network \(G=\langle {P}, {I}\rangle\). The size of the test set is fixed at N; because we divide the dataset by edges, N is the number of edges in the test set. First, a protein is randomly selected from P as the root node \(p_{root}\). The degree of the root node must be less than a given threshold t (we set \(t=5\)), that is, \(\left| N\left( p_{root}\right) \right| <t\), where \(N\left( p\right)\) denotes the set of neighbors of p. We set the initial test set \(I_{test}=\emptyset\) and the current node \(p_{cur} = p_{root}\). We then use the DFS (or BFS) algorithm to search the neighbor nodes \(p_{k}\) of the current node, i.e., \(p_{k} \in N\left( p_{cur}\right)\), and update the test set as \(I_{test}=I_{test} \cup I_{cur}\), where \(I_{cur}=\left\{ p_{cur}, p_{k}\right\}\). The process is repeated until the number of edges in the subgraph formed by all nodes in the test set exceeds N.
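The traversal-based partition can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_test_edges`, the seeded random generator, and the stopping rule (stop once `n_test` edges have been collected, rather than counting the edges of the induced subgraph) are all choices made here for brevity.

```python
import random
from collections import deque


def select_test_edges(edges, n_test, mode="bfs", t=5, seed=0):
    """Collect up to n_test edges by BFS or DFS from a random low-degree root."""
    rng = random.Random(seed)
    # Build an undirected adjacency list from the edge list.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # The root must have degree below the threshold t.
    candidates = sorted(p for p in neighbors if len(neighbors[p]) < t)
    if not candidates:
        raise ValueError("no protein has degree below the threshold t")
    root = rng.choice(candidates)

    test_edges = set()
    visited = {root}
    frontier = deque([root])
    # If the root's component is exhausted first, fewer edges are returned.
    while frontier and len(test_edges) < n_test:
        # BFS takes the oldest node (queue); DFS takes the newest (stack).
        cur = frontier.popleft() if mode == "bfs" else frontier.pop()
        for k in sorted(neighbors[cur]):
            if len(test_edges) >= n_test:
                break
            test_edges.add(frozenset((cur, k)))  # I_cur = {p_cur, p_k}
            if k not in visited:
                visited.add(k)
                frontier.append(k)
    return root, test_edges
```

Because the test edges are grown outward from a single root, they cluster in one region of the network, which is what makes the BFS and DFS schemes harder than a random split.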
Parameter settings and evaluation metric
Our experiments are performed on an NVIDIA RTX 3090 GPU with the PyTorch deep learning framework. We choose the Adam algorithm [31] as the optimizer, with a weight decay coefficient of 5e−4 and a batch size of 512. We train our models for 300 epochs with an initial learning rate of 0.001. We use the ReduceLROnPlateau function to adjust the learning rate and, to prevent overfitting, set the patience to 20. During model training, if the loss does not decrease for 20 consecutive iterations, training stops automatically.
Our task is to use a classifier to solve multi-label PPI classification, in which the interaction between each protein pair has at least one label. Moreover, the types of PPIs in the SHS27k and SHS148k datasets are extremely unbalanced [26]. Micro-F1 emphasizes the common labels in the datasets and is not easily affected by classes with very few or very many samples, so that each sample has the same importance [32]. Taking these points together, we choose Micro-F1 as the evaluation metric to measure the accuracy of our model. It is defined as
\(\text{Micro-F1}=\frac{2 \times Precision_m \times Recall_m}{Precision_m + Recall_m}\)
where \(Recall_m\) and \(Precision_m\) are the total recall and total precision over all classes:
\(Precision_m=\frac{\sum_{i=1}^{n} TP_{i}}{\sum_{i=1}^{n}\left( TP_{i}+FP_{i}\right) }, \quad Recall_m=\frac{\sum_{i=1}^{n} TP_{i}}{\sum_{i=1}^{n}\left( TP_{i}+FN_{i}\right) }\)
where n is the number of classes (7 in this experiment), and \(TP_{i}\), \(FP_{i}\), \(TN_{i}\) and \(FN_{i}\) denote the true positives, false positives, true negatives and false negatives of the ith class, respectively.
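To illustrate how micro-averaging pools the per-class counts before precision and recall are computed, the following minimal sketch evaluates Micro-F1 on multi-hot label vectors. The function name `micro_f1` and the list-of-lists input format are assumptions made for this example; they are not taken from the authors' code.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over multi-hot label vectors (lists of 0/1)."""
    tp = fp = fn = 0
    # Pool true positives, false positives and false negatives over all classes.
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In the setting described here each row would have seven entries, one per interaction type. Note that true negatives do not enter the computation.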
Baselines
To better illustrate the effectiveness of our model, we compare LDMGNN with different baselines, which can be divided into machine learning based and deep learning based methods. We choose three machine learning algorithms: SVM [33], RF [34] and LR [35]. Their inputs are common handcrafted protein features, i.e., AC [33] and CTD [36].
When comparing with the deep learning models, we construct the same architectures as in the original works. We input the SHS27k and SHS148k datasets into each model and change the output from the original binary classification to multi-label classification. These deep learning models are as follows.

HIN2Vec [37]: A representation learning framework for heterogeneous information networks (HIN). It uses different types of interactions among nodes to capture the features of nodes and meta paths in HIN.

SDNE [38]: A structural deep network embedding method for link prediction and multi-label classification tasks. It not only effectively captures the highly nonlinear network structure but also preserves the global and local structure of the network.

LPI-DLDN [39]: A deep learning model with a dual-network neural architecture composed of a feature importance ranking (FIR) network and an MLP network. Given the sequences of proteins and lncRNAs, it predicts the potential interactions between them.

LPI-deepGBDT [40]: A multiple-layer deep structure model based on gradient boosting decision trees. Given the sequences of proteins and lncRNAs, it predicts the unobserved lncRNA–protein interactions (LPIs).

DTI-CDF [41]: A cascade deep forest model based on hybrid features, which cascades the traditional machine learning models RF and XGB. Given the hybrid features (containing the information of drugs, targets and drug–target interactions), it predicts the interactions between drugs and targets.

PIPR [7]: An end-to-end network model for predicting PPI, which combines two residual RCNNs in a Siamese architecture. This method provides an automatic multi-granularity feature selection mechanism to capture the features of sequences.

GAT [42]: A neural network for graph-structured data. It learns node embeddings by applying a self-attention mechanism over the graph structure.

GNN-PPI [26]: A graph neural network model that, given protein amino acid sequence information and the PPI network, predicts multi-label PPI.
Our experiment is inspired by GNN-PPI [26]. However, compared with the GNN-PPI model, our LDMGNN model has two main innovations. First, in the amino acid sequence encoding part, we replace the biGRU block with a Transformer block with a multi-head self-attention mechanism, which not only captures the long-distance dependency information between amino acids but also solves the problem that biGRU cannot be parallelized. Second, considering that there may be some connection between nodes that do not interact directly, we construct a two-hop PPI network in order to capture more comprehensive protein information. The GNN-PPI model does not have this component.
Results and analysis
As shown in Table 1, LDMGNN shows the best performance among all the baselines. This result suggests that our model learns the long-distance dependency between amino acids and effectively expands the receptive field, which improves the prediction accuracy of multi-label PPI. From the perspective of dataset size, the performance of the model increases with the size of the dataset: our method performs better on SHS148k than on SHS27k, because more valuable information becomes available as the PPI network grows. From the perspective of the partition scheme, the performance improvement of our method under the BFS and DFS schemes is generally higher than that under the random scheme. For the SHS27k dataset, our method achieves absolute improvements of \(1.43 \%\), \(10.75 \%\) and \(3.48 \%\) over the GNN-PPI model under the random, BFS and DFS partition schemes, respectively. For the SHS148k dataset, the corresponding absolute improvements are \(0.12 \%\), \(2.61 \%\) and \(1.12 \%\). This illustrates that our method has a certain generalization ability and practical value.
Significant difference analysis
To verify whether the performance of our proposed LDMGNN model is statistically significantly different from that of the 11 baseline models, we conducted paired samples t-tests using SPSS software. The results are shown in Tables 2 and 3. Table 2 reports the correlation between the two samples in each pair, where the correlation lies in the interval \([-1, 1]\). A value greater than 0 indicates a positive correlation between the two samples, and a value less than 0 indicates a negative correlation; the larger the absolute value, the stronger the correlation. The corresponding significance level p value (i.e., Sig. in Table 2) should be less than 0.05, which indicates that the correlation between the samples is significant. A paired samples t-test only makes sense when there is a significant correlation between the paired samples. Table 3 shows the results of the paired samples t-tests: a p value (i.e., Sig.(2-tailed)) less than 0.05 indicates a significant difference between the two samples.
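For readers without SPSS, the two quantities behind such a comparison, the Pearson correlation of the paired samples and the paired t statistic, can be computed directly. The sketch below is illustrative only: the function name `paired_t_statistic` is hypothetical, and the p value (obtained from a t distribution with n − 1 degrees of freedom) is deliberately left out because it requires a statistics library.

```python
import math


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom, Pearson r) for paired samples a and b."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    # The paired t-test works on the per-pair differences.
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    # Undefined (division by zero) when all differences are identical.
    t = mean_d / math.sqrt(var_d / n)

    # Pearson correlation between the two samples.
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    r = cov / math.sqrt(ss_a * ss_b)
    return t, n - 1, r
```

Here `a` and `b` would hold the scores of two models on the same set of evaluation settings, so that each position forms one pair.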
As shown in Table 2, the correlation in the first row (LDMGNN & SVM) is 0.976 (the absolute value is close to 1), and the significance level p value (Sig.) is 0.001 (< 0.05), which indicates that the LDMGNN and SVM samples are significantly and strongly correlated. Similarly, the correlations of the remaining 10 pairs of samples are 0.917, 0.920, 0.916, 0.953, 0.938, 0.929, 0.943, 0.916, 0.938 and 0.965, respectively (all greater than 0.9), and the corresponding significance level p values (Sig.) are 0.010, 0.009, 0.010, 0.003, 0.006, 0.007, 0.005, 0.010, 0.006 and 0.002 (all less than 0.05). These 10 pairs are therefore all significantly correlated, and it is meaningful to perform a paired samples t-test on all 11 pairs.
From Table 3, we can see that the p values (i.e., Sig.(2-tailed)) of the first 10 pairs of samples are 0.001, 0.005, 0.000, 0.000, 0.009, 0.002, 0.001, 0.002, 0.015 and 0.011, respectively, all of which are less than 0.05. This shows that there is a significant difference within each of the first 10 pairs. The p value of the 11th pair is 0.094, which is greater than 0.05; we consider that the extreme imbalance of the data may be a factor here. From a practical perspective, we still believe that the LDMGNN model has practical significance. Lv et al. [26] studied the Homo sapiens subsets of the BioGRID database at two time points (2011/01/25 and 2021/01/25) and found that the newly discovered proteins had local patterns of BFS and DFS. In these two partition schemes, our LDMGNN model has a large improvement in accuracy compared with the GNN-PPI model.
Ablation analysis
To verify the importance and effectiveness of each module for the prediction model, we conduct an ablation study by deleting or replacing each module. We use GNN-PPI as the baseline for PPI prediction, which processes amino acid sequences using an RNN and aggregates only first-order neighbor information. −PMHGE denotes the LDMGNN model with PMHGE removed; unlike the baseline GNN-PPI, it uses the Transformer with a multi-head self-attention mechanism to learn the amino acid interdependency in the sequence. −PAASE denotes the LDMGNN model with the PAASE module removed; unlike the baseline GNN-PPI, it constructs a THPPI network and simultaneously aggregates first-order and second-order neighbor information, increasing the spatial receptive field of the model. LDMGNN is our full proposed model. Compared with the baseline GNN-PPI, LDMGNN not only captures the long-distance dependency information in the sequence but also increases the receptive field in space. We still use Micro-F1 as the evaluation metric.
As can be seen from Table 4, when the model only uses the multi-head self-attention mechanism to capture the long-distance dependency information in the sequence, for the SHS27k dataset it improves by \(0.81 \%\), \(5.03 \%\) and \(2.20 \%\) over the GNN-PPI model under the random, BFS and DFS partition schemes, respectively. For the SHS148k dataset, the corresponding changes are \(0.12 \%\), \(-2.36 \%\) and \(0.10 \%\), respectively. The results show that the prediction accuracy can generally be improved when the model only captures the interdependency of amino acids in the sequence, indicating that the long-distance dependency of amino acids in the sequence plays a positive role in the prediction of multi-label PPI. As for the decrease of \(2.36 \%\) in micro-F1 on the SHS148k dataset under the BFS partition scheme, we believe that it is caused by data imbalance.
When the model only increases the spatial receptive field of the network, for the SHS27k dataset it improves by \(1.37 \%\), \(4.80 \%\) and \(0.09 \%\) over the GNN-PPI model under the random, BFS and DFS partition schemes, respectively. For the SHS148k dataset, improvements of \(0.09 \%\), \(0.26 \%\) and \(0.24 \%\) are achieved, respectively. The results show that aggregating the information of first-order and second-order neighbors simultaneously can improve the accuracy of prediction, which suggests that an appropriate increase of the network receptive field also plays a positive role in the prediction of multi-label PPI. However, the improvement in accuracy is highest when the model both captures the interdependencies between amino acids in the sequence and aggregates the first-order and second-order neighbors, which further demonstrates that the LDMGNN model is effective for predicting multi-label PPI.
The selection of hop number
When constructing a multi-hop PPI network to increase the receptive field, we conducted experiments with different values of k to determine which k-hop network to construct. The experimental results are shown in Table 5. In the table, “OneHop” indicates the case where \(k = 1\): the PPI network is the original PPI network, and the target node aggregates only the information of first-order neighbors. “OneHop \(+\) TwoHop” represents the case where \(k=2\): we construct a THPPI network, and the target node aggregates the information of first-order and second-order neighbors simultaneously. “OneHop \(+\) TwoHop \(+\) ThreeHop” represents \(k=3\): we construct both a THPPI network and a three-hop PPI network, and the target node aggregates the information of first-order, second-order and third-order neighbors at the same time. We still use micro-F1 as the evaluation metric, and each boldface number marks the best accuracy under the corresponding partition scheme. The prediction accuracy is highest when k is 2.
It can be seen from Table 5 that when k is set to 2, the accuracy obtained under each partition scheme on the two datasets is higher than when k is set to 1. This indicates that it is necessary to construct a multi-hop PPI network to increase the receptive field of the network in space. However, as k increases further, the performance of the model on the two datasets shows a decreasing trend. When k is 3, the model achieves much lower accuracy under each partition scheme on both datasets than when k is 2. Indeed, the accuracy obtained when k is 3 is lower in all cases than when k is 1. This indicates that simply adding more hops to the PPI network is not the best choice: as the receptive field gradually increases, the model tends to overfit. Therefore, in this study, we set k to 2 when constructing the multi-hop PPI network. This further shows that it is effective and reasonable to aggregate the first-order and second-order neighbor information at the same time and to increase the receptive field of the network appropriately.
The effect of more negative samples on model performance
To test how additional negative samples affect the performance of the model, we increase the number of negative samples while keeping the number of positive samples constant, which changes the positive:negative sample ratio in the datasets. Specifically, for the SHS27k dataset, we randomly selected three sets of negative samples from protein pairs with unknown interactions, containing 22,872, 38,120 and 76,240 samples, respectively. The protein pairs with known interactions serve as positive samples, and their number is 7624. We thus obtain three different positive:negative sample ratios: 1:3, 1:5 and 1:10. Similarly, for the SHS148k dataset, we randomly selected three sets of negative samples from protein pairs with unknown interactions, containing 133,464, 222,440 and 444,880 samples, respectively. With the 44,488 known interacting protein pairs as positive samples, this again gives positive:negative sample ratios of 1:3, 1:5 and 1:10. We still use three partition schemes to construct the test set, namely random, BFS and DFS, and the test set accounts for \(20 \%\) of the dataset.
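The negative sampling at a chosen ratio can be sketched as follows. This is a simplified illustration, not the authors' procedure: the function name `sample_negatives` is hypothetical, and all unordered protein pairs are enumerated up front, which is only practical for small protein sets (a network with thousands of proteins would more likely use rejection sampling).

```python
import random
from itertools import combinations


def sample_negatives(proteins, positive_pairs, ratio=1, seed=0):
    """Draw ratio * (number of positives) pairs with no known interaction."""
    rng = random.Random(seed)
    known = {frozenset(pair) for pair in positive_pairs}
    # Every unordered pair that is not a known interaction is a candidate.
    unknown = [
        frozenset(pair)
        for pair in combinations(sorted(proteins), 2)
        if frozenset(pair) not in known
    ]
    n_neg = ratio * len(known)
    if n_neg > len(unknown):
        raise ValueError("not enough unknown pairs for the requested ratio")
    # Sampling without replacement keeps the negatives distinct.
    return rng.sample(unknown, n_neg)
```

With `ratio=3`, for example, 7624 positive pairs would yield 22,872 negatives, matching the 1:3 setting above.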
The experimental results are shown in Table 6, where 1:1 is the positive:negative sample ratio used by the LDMGNN model. For the SHS27k dataset under the random, BFS and DFS partition schemes, compared with the result at a ratio of 1:1, the performance of the model decreases by 3.09, 3.13 and 2.06, respectively, when the ratio is 1:3; by 6.60, 5.27 and 5.26 when the ratio is 1:5; and by 11.40, 8.94 and 8.36 when the ratio is 1:10. For the SHS148k dataset under the same three schemes, the performance decreases by 3.07, 2.26 and 4.82 when the ratio is 1:3; by 10.28, 5.77 and 11.15 when the ratio is 1:5; and by 16.83, 12.62 and 16.87 when the ratio is 1:10.
When we add more negative samples, the performance of the model decreases significantly. This is because, when the number of negative samples exceeds the number of positive samples, the model's judgment of the positive samples is affected, so that the classifier cannot capture the features of the positive samples well. Therefore, the imbalance between positive and negative samples negatively affects the performance of the model.
Discussion
Next, we explain why we use the Transformer with a multi-head self-attention mechanism to capture amino acid sequence information and why we construct a THPPI network to aggregate two-hop neighbor information.
It is acknowledged that the Transformer [43] was first proposed to replace the recurrent neural network (RNN) in natural language processing. It has two distinctive properties: it can capture the long-distance dependency of sequences, and it can be parallelized. Inspired by natural language processing methods such as BERT [44] and RoBERTa [45], we regard each amino acid as a vector and the amino acid sequence as a set of vectors. We consider that there may also be long-distance dependency between every two amino acids in a sequence, so we use the Transformer with a multi-head self-attention mechanism to capture amino acid sequence information.
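The reason self-attention handles long-distance dependency is that every position is scored against every other position, regardless of how far apart they are. The sketch below shows this with a single attention head in plain Python. It is a deliberately reduced illustration: the queries, keys and values are the input vectors themselves (identity projections), whereas a real Transformer block uses learned projection matrices, several heads, positional information and feed-forward layers. The function name `self_attention` is an assumption of this example.

```python
import math


def self_attention(x):
    """Single-head scaled dot-product self-attention with identity projections."""
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        # Score this position against every position, near or far.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        # Numerically stable softmax over the scores.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        # The output is the attention-weighted sum of all input vectors.
        outputs.append([sum(w * v[j] for w, v in zip(row, x)) for j in range(d)])
    return outputs, weights
```

Each row of `weights` sums to 1 and contains one entry per position in the sequence, so two identical residues receive the same weight whether they are adjacent or far apart.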
Meanwhile, we construct a THPPI network because there may be meaningful information between two indirectly interacting nodes. Competitive inhibition [46] in biochemistry illustrates that there may be some structural or functional similarity between two indirectly connected nodes. Fig. 1 shows a typical example [47] in which similar structures in biomolecules cause competitive inhibition. When a human is bitten by a snake, snake venom proteins follow the blood circulation into the nervous tissue space and bind to acetylcholine receptors (AchR) with a binding affinity much higher than that of acetylcholine (Ach). The snake venom proteins thus inhibit the binding of acetylcholine to acetylcholine receptors. Here, the snake venom proteins are structurally similar to acetylcholine. Enlightened by this case, we consider that a PPI network may contain two proteins that are only indirectly connected but are structurally or functionally similar. We therefore construct a THPPI network to aggregate the second-order neighbor information and enlarge the receptive field in space.
Conclusions
In this study, we propose the LDMGNN model to predict multi-label protein–protein interactions. LDMGNN first captures the potential features of amino acid sequences through the PAASE module and then concatenates this information with the topological information of the initial PPI network and with that of the THPPI network, respectively. The two resulting inputs are fed into the graph neural network separately, and the two obtained feature matrices are added element-wise to form the final embedding of proteins and protein pairs. Finally, this embedding is fed into a classifier to predict protein–protein interactions.
We carry out a series of experiments, and on the SHS27k and SHS148k datasets our model shows better performance than the existing models. Furthermore, we perform ablation experiments to verify that each module in the model is indispensable and that the parameters in the experiment are reasonable and effective. This indicates that the Transformer with a multi-head self-attention mechanism can successfully capture the long-distance dependency information in amino acid sequences, and that the spatial receptive field of the network can be increased by aggregating the first-order and second-order neighbor information simultaneously. In conclusion, the LDMGNN model can comprehensively learn the feature information between protein pairs and has good potential in PPI prediction.
Methods
This section introduces the proposed multi-label PPI prediction approach, which is an end-to-end representation learning model. Given the representation of the protein amino acid sequences, the adjacency matrix of the PPI network and the adjacency matrix of the THPPI network, we try to predict the labels of protein–protein pairs. The amino acid sequence representation used as input is a numerical vector obtained with the processing of Chen et al. [7]. In this section, we first define the multi-label PPI prediction problem and then introduce our LDMGNN model in detail.
Problem formulation
We represent the set of amino acids as M and define the amino acid sequence S of a protein, which consists of amino acids in varying proportions, as \(S=\left\{ m_{1}, m_{2}, \ldots , m_{l}\right\}\), where \(m_{i} \in M, i=1,2, \ldots , l\). We consider the initial PPI network as an undirected graph \(G_{1}=\langle {P}, {I}\rangle\), whose adjacency matrix is \(A_{1} \in \{0,1\}^{N \times N}\). Here P is the set of proteins, denoted as \(P=\left\{ p_{1}, p_{2}, \ldots , p_{n}\right\}\), and I is the set of protein–protein interactions, defined as \(I=\left\{ p_{i j} \mid p_{i j}=\left( p_{i}, p_{j}\right) , i \ne j, p_{i}, p_{j} \in P\right\}\). If \(p_{ij}=1\), there is an interaction between protein \(p_{i}\) and protein \(p_{j}\). If \(p_{ij}=0\), either there is no interaction between the two proteins or the interaction between them has not yet been identified. Similarly, we consider the constructed THPPI network as an undirected graph \(G_{2}=\langle {P}, {I}\rangle\), whose adjacency matrix is \(A_{2} \in \{0,1\}^{N \times N}\).
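This section does not spell out how \(A_{2}\) is derived from \(A_{1}\). One plausible construction, assumed here purely for illustration, links two proteins in the THPPI network when they share at least one common neighbor in the PPI network. Whether pairs that already interact directly are kept is also an assumption, exposed below as the hypothetical `exclude_direct` flag; the function name `two_hop_edges` is likewise not from the paper.

```python
def two_hop_edges(edges, exclude_direct=True):
    """Return unordered pairs of nodes that share at least one common neighbor."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    direct = {frozenset(e) for e in edges}

    two_hop = set()
    # Any two neighbors of the same node are exactly two hops apart via it.
    for nbrs in neighbors.values():
        ordered = sorted(nbrs)
        for i, u in enumerate(ordered):
            for v in ordered[i + 1:]:
                pair = frozenset((u, v))
                if exclude_direct and pair in direct:
                    continue
                two_hop.add(pair)
    return two_hop
```

In matrix terms this corresponds to the nonzero off-diagonal entries of \(A_{1}^{2}\), optionally masked by the entries of \(A_{1}\).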
The task of this multi-label classification is to learn a model \(F=(p, {\hat{y}})\) from the training set \(I_{train}\), whose input p is a protein pair with a known interaction, \(p \in I_{train}\). The output \({\hat{y}}\) is a 7-dimensional vector whose positions correspond to a finite set of labels L. We define the label set of PPIs as \(L=\left\{ \ell _{0}, \ell _{1}, \ldots , \ell _{n}\right\}\), where \(n=6\), so that L contains the seven types of protein–protein interaction: post-translational modification (PTMOD), catalysis, reaction, activation, expression, binding and inhibition. The interaction of each pair of proteins has at least one type. When a certain type of interaction is present, the corresponding position in the vector is 1; otherwise it is 0. The learned model F is used to predict the labels \({\hat{y}}_{ij}\) of the protein pairs \(p_{ij} \in I_{test}\).
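The 7-dimensional target vector can be illustrated with a small encoding routine. This is a minimal sketch: the function name `encode_labels` and the assignment of vector positions to interaction types (taken here from the order of the list above) are assumptions, not details given in the text.

```python
# Order follows the list in the text; the position of each type is an assumption.
INTERACTION_TYPES = [
    "ptmod", "catalysis", "reaction", "activation",
    "expression", "binding", "inhibition",
]


def encode_labels(types):
    """Return the 7-dimensional 0/1 label vector of one interacting protein pair."""
    present = set(types)
    unknown = present - set(INTERACTION_TYPES)
    if unknown:
        raise ValueError(f"unknown interaction types: {sorted(unknown)}")
    if not present:
        raise ValueError("an interacting pair has at least one type")
    return [1 if t in present else 0 for t in INTERACTION_TYPES]


# A pair annotated with two interaction types.
print(encode_labels(["binding", "activation"]))  # [0, 0, 0, 1, 0, 1, 0]
```

Because several positions can be 1 at the same time, the problem is multi-label rather than multi-class.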
Overview
The framework of the proposed LDMGNN model is shown in Fig. 2 and consists of three parts. The first part, “Protein Encoding”, is the core of the LDMGNN model and extracts the representation of the protein nodes. The second part is “Feature Fusion”, and the last part is “Multi-label PPI Prediction”.
Protein encoding
Protein encoding consists of two parts, which are trained together in an end-to-end manner. The first is protein amino acid sequence encoding (PAASE), which captures the protein feature from the amino acid sequence; we call this the sequence feature. The second is protein multi-hop graph encoding (PMHGE), which has two branches. One branch uses a GIN block to capture the first embedding of the protein, and the other branch uses a GIN block to capture the second embedding of the protein.
Protein amino acid sequence encoding (PAASE)
This module captures the protein feature from the amino acid sequence. The predefined features used in this module were obtained with the preprocessing of Chen et al. [7], who used an embedding method to represent each amino acid \(m \in M\) as a 13-dimensional vector composed of two subvectors, i.e., \(m=\left[ m_{1}, m_{2}\right]\). The first subvector \(m_1\) measures the co-occurrence similarity of amino acids and has dimension 5. The second subvector \(m_2\) represents the similarity of the electrostatic and hydrophobic properties among amino acids and is an eight-dimensional one-hot encoding.
As shown in Fig. 3, the predefined features first pass through a one-dimensional convolution layer with a convolution kernel of size 3. The resulting hidden features are fed into a normalization layer, which can accelerate the training of the model. A max pooling layer then extracts more representative features. To capture the long-distance dependency information in the sequence, these representative features are input into the Transformer module with a multi-head self-attention (MHSA) mechanism, which learns the interdependencies of amino acids. Next, the hidden features obtained from the MHSA Transformer layer are input into a one-dimensional average pooling layer for dimension reduction. Finally, the sequence feature is obtained through a fully connected layer.
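The first stages of this pipeline can be sketched in plain Python to show how the sequence axis shrinks before the Transformer is applied. This is only an illustration under stated assumptions: the pooling window, the absence of padding, the toy kernel weights and the names `conv1d` and `max_pool1d` are hypothetical, and the normalization layer is omitted.

```python
def conv1d(seq, kernels, bias):
    """Unpadded 1-D convolution along the sequence axis.

    seq:     list of positions, each a list of C_in channel values
    kernels: list of C_out kernels, each a list of `width` rows of C_in weights
    bias:    list of C_out offsets
    """
    width = len(kernels[0])
    out = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        out.append([
            b + sum(w * x
                    for w_row, x_row in zip(kernel, window)
                    for w, x in zip(w_row, x_row))
            for kernel, b in zip(kernels, bias)
        ])
    return out


def max_pool1d(seq, size):
    """Non-overlapping max pooling along the sequence axis."""
    return [
        [max(channel) for channel in zip(*seq[i:i + size])]
        for i in range(0, len(seq) - size + 1, size)
    ]


# Toy input: 8 residues with 2 channels; one kernel of size 3 that sums channel 0.
residues = [[float(i), 1.0] for i in range(8)]
kernel = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
hidden = conv1d(residues, [kernel], [0.0])
pooled = max_pool1d(hidden, 2)
print(len(residues), len(hidden), len(pooled))  # 8 6 3
```

Shortening the sequence in this way reduces the number of residue pairs that the subsequent self-attention layer has to compare.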
The Transformer with multi-head self-attention mechanism
This block learns the interdependence between every two amino acids in the amino acid sequence by calculating the correlation coefficient between them. It thereby captures not only the local information of amino acids in the sequence but also the long-distance dependency information between amino acids.
Because several types of interaction can exist between each pair of amino acids, we apply the multi-head self-attention mechanism to each amino acid of each protein amino acid sequence. We then extract a low-dimensional feature embedding of each amino acid by computing the different kinds of correlation between each pair of amino acids. Fig. 4 shows a brief diagram of the mechanism. Each amino acid \(m_{i}\) obtains a query vector \(q_{i} \in R^{d_q}\), a key vector \(k_{i} \in R^{d_k}\), and a value vector \(v_{i} \in R^{d_v}\). They are obtained by linear transformations of the amino acid features using the trainable parameters \(W_{q} \in R^{\text {feature\_in} \times d_q}\), \(W_{k} \in R^{\text {feature\_in} \times d_k}\) and \(W_{v} \in R^{\text {feature\_in} \times d_v}\), which are shared by all amino acid nodes.
Since there are h types of correlation for each pair of amino acids \(\left( m_{i}, m_{j}\right)\), the embeddings of the amino acid nodes are calculated with the multi-head self-attention mechanism. For each node \(m_{i}\) in the amino acid sequence, \(q_{i, 1}, \ldots , q_{i, h}\) are obtained by linear transformation of \(q_{i}\) using different weight matrices \(W_{q, 1}, \ldots , W_{q, h}\). Similarly, \(k_{i, 1}, \ldots , k_{i, h}\) are obtained by linear transformation of \(k_{i}\) using different weight matrices \(W_{k, 1}, \ldots , W_{k, h}\), and \(v_{i, 1}, \ldots , v_{i, h}\) are obtained by linear transformation of \(v_{i}\) using different weight matrices \(W_{v, 1}, \ldots , W_{v, h}\). For each type of correlation in each pair of amino acids, the coefficient \(\alpha _{i, j}^{h}\) is calculated with the query-key dot product method, which can be expressed as follows:
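One form consistent with this description, assuming the scaled dot product of the original Transformer of Vaswani et al. (the scaling by \(\sqrt{d_k}\) is an assumption and is not stated in the text), is:

\[
\alpha_{i, j}^{h}=\frac{\left\langle q_{i, h},\, k_{j, h}\right\rangle }{\sqrt{d_{k}}}
\]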
These correlation coefficients are then normalized. Each coefficient \(\alpha _{i, j}^{h}\) is used as the weight of the value vector of amino acid \(m_{j}\), and the weighted sum over all j gives the h embeddings of amino acid \(m_{i}\), which are \(z_{i}^{1}, \ldots , {z}_{i}^{h}\), respectively. Finally, these embeddings are concatenated within the Transformer to produce the final output \(z_{i}\) of an amino acid node in the current layer, which is expressed as:
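A form consistent with this description, writing \(\Vert\) for vector concatenation (a notational assumption; any output projection applied after the concatenation is not specified in the text), is:

\[
z_{i}=\big\Vert_{t=1}^{h}\, z_{i}^{t}
\]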
where \(z_{i} \in R^{\text {feature\_out}}\), and
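Following the weighted-sum description above, and assuming that the normalization is a softmax over j (the text states only that the coefficients are normalized), the embedding of each head can be written as:

\[
z_{i}^{h}=\sum _{j} \operatorname{softmax}_{j}\left( \alpha _{i, j}^{h}\right) v_{j, h}
\]

The whole block can be sketched in plain Python. This is a minimal illustration, not the released implementation: the function names and toy matrices are hypothetical, the \(\sqrt{d}\) scaling is assumed, and the two successive linear maps that produce \(q_{i,h}\), \(k_{i,h}\) and \(v_{i,h}\) are folded into a single matrix per head.

```python
import math


def matvec(x, W):
    """Row vector x (length n) times matrix W (n x m)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]


def softmax(scores):
    """Numerically stable softmax of a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(X, Wq, Wk, Wv):
    """One head; X holds one feature vector per amino acid."""
    Q = [matvec(x, Wq) for x in X]
    K = [matvec(x, Wk) for x in X]
    V = [matvec(x, Wv) for x in X]
    scale = math.sqrt(len(K[0]))
    Z = []
    for q in Q:
        # Raw query-key coefficients of residue i against every residue j.
        alpha = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        weights = softmax(alpha)  # normalized over j
        # Weighted sum of the value vectors.
        Z.append([sum(w * v[d] for w, v in zip(weights, V))
                  for d in range(len(V[0]))])
    return Z


def multi_head(X, heads):
    """heads is a list of (Wq, Wk, Wv); per-head outputs are concatenated."""
    per_head = [attention_head(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    return [sum((Z[i] for Z in per_head), []) for i in range(len(X))]


# Toy example: 3 residues, 2 input features, 2 heads with identity projections.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
Z = multi_head(X, [(I2, I2, I2), (I2, I2, I2)])
print(len(Z), len(Z[0]))  # 3 4
```

Every residue attends to every other residue in a single step, which is why the distance between two residues in the sequence does not limit the dependency that can be captured.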
Protein multi-hop graph encoding (PMHGE)
In protein multi-hop graph encoding, we use two GIN blocks to obtain two embeddings of each protein. The input of the first GIN block is the original PPI network and the sequence feature, and its output is the first embedding \(E_1\) of the protein. The input of the second GIN block is the two-hop PPI (THPPI) network and the sequence feature, and its output is the second embedding \(E_2\) of the protein. In this way, the features of both adjacent and non-adjacent nodes can be aggregated directly.
Two-hop protein–protein interaction (THPPI) network
The THPPI graph lets the GIN block learn new interactions between two proteins, which augments the graph representation. We construct the THPPI network from the PPI network. Specifically, we generate the adjacency matrix \(A_{2}\) of the THPPI network \(G_{2}\) from the adjacency matrix \(A_{1}\) of the original graph \(G_{1}\), so as to obtain the structural information of the THPPI network. The formula is as follows:
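Writing \(A_{1}\) and \(A_{2}\) for the adjacency matrices of \(G_{1}\) and \(G_{2}\) as in the problem formulation, one form consistent with the description that follows is the thresholded matrix product (an assumption, since only the sign function and the two-hop property are stated in the text):

\[
A_{2}={\text {sign}}\left( A_{1} \cdot A_{1}\right)
\]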
where \({\text {sign}}(x)\) is the sign function, which takes the value 1 when \(x>0\) and the value 0 when \(x \le 0\). It is worth noting that the new adjacency matrix contains self-connections. When the entry for protein nodes i and j is greater than 0, there is a two-hop relationship between them on the original graph. As in the original PPI network, an edge between two protein nodes in the THPPI network \(G_{2}\) indicates their interaction. Because \(G_{2}\) is an ordinary graph, graph neural networks can also perform message passing and aggregation on it.
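The construction can be illustrated on a small graph. The sketch below assumes that the two-hop adjacency matrix is obtained by applying the sign function to the product of the original adjacency matrix with itself; the function name `two_hop_adjacency` and the four-node path graph are hypothetical.

```python
def two_hop_adjacency(A):
    """Return sign(A . A) for a square 0/1 adjacency matrix given as nested lists."""
    n = len(A)
    return [
        # Entry (i, j) is 1 when some node k is adjacent to both i and j.
        [1 if any(A[i][k] and A[k][j] for k in range(n)) else 0 for j in range(n)]
        for i in range(n)
    ]


# Path graph 0 - 1 - 2 - 3.
A1 = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
A2 = two_hop_adjacency(A1)
for row in A2:
    print(row)
# [1, 0, 1, 0]
# [0, 1, 0, 1]
# [1, 0, 1, 0]
# [0, 1, 0, 1]
```

The diagonal entries are 1 because every node with at least one neighbor reaches itself in two hops, which matches the self-connections noted above. In this example the pairs (0, 2) and (1, 3) become neighbors, while the direct edges of the path do not appear in the two-hop matrix.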
Graph isomorphism network (GIN) block
The two GIN blocks in Fig. 2 are identical, and their structure is shown in Fig. 5. The input of this module is an \(L \times 256\) feature matrix, where L represents the number of proteins and 256 represents the number of features of each protein. This feature matrix is obtained by combining the structural information of the PPI network with the amino acid sequence information. After two linear layers, two ReLU activation layers and normalization layers, the protein embedding is obtained.
Graph neural networks [24, 48, 49, 50] have made tremendous progress on a variety of challenging tasks, and the graph isomorphism network (GIN) [51] has been shown to be as expressive as the Weisfeiler–Lehman graph isomorphism test, which is the upper bound for neighbor-aggregation graph neural networks (GNNs). Next, we introduce how the information of the PPI network and of the THPPI network is obtained through the GIN block.
As in other graph neural networks, the neighbor aggregation mechanism is the core of GIN. The feature of each node is updated iteratively by aggregating the features of its neighbors. After k iterations, the structural information in the k-hop neighborhood is captured, and the new feature vector \(g_{p}^{k}\) of node p can be expressed as:
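A generic form consistent with this description, with AGGREGATE and UPDATE used as placeholder names for the aggregation and update functions (the names are assumptions), is:

\[
g_{p}^{k}={\text {UPDATE}}^{k}\left( g_{p}^{k-1},\ {\text {AGGREGATE}}^{k}\left( \left\{ g_{u}^{k-1}: u \in N(p)\right\} \right) \right)
\]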
where N(p) is the set of all neighbor nodes of node p. We choose vector sum as the aggregation function and multilayer perceptrons (MLP) as the update function.
Then, for the original PPI network \(G_{1}\), after k iterations, node p obtains the feature vector \(g_{p_{1}}^{k}\), which can be expressed as:
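Assuming the standard GIN update of Xu et al. [51] with sum aggregation and an MLP update, as chosen above, a form consistent with this description is:

\[
g_{p_{1}}^{k}={\text {MLP}}^{k}\left( \left( 1+\epsilon _{1}\right) g_{p_{1}}^{k-1}+\sum _{p_{1}^{\prime } \in N\left( p_{1}\right) } g_{p_{1}^{\prime }}^{k-1}\right)
\]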
where \(p_{1}^{\prime }\) represents a first-order neighbor of node p, and \(\epsilon _{1}\) is a hyperparameter. Finally, we obtain the embedding of \(G_1\), which is called \(E_1\).
Similarly, for the THPPI network \(G_{2}\), after k iterations, node p obtains the feature vector \(g_{p_{2}}^{k}\), which can be expressed as:
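Under the same assumption of the standard GIN update [51], applied to the neighborhoods of \(G_{2}\), a form consistent with this description is:

\[
g_{p_{2}}^{k}={\text {MLP}}^{k}\left( \left( 1+\epsilon _{2}\right) g_{p_{2}}^{k-1}+\sum _{p_{2}^{\prime \prime } \in N\left( p_{2}\right) } g_{p_{2}^{\prime \prime }}^{k-1}\right)
\]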
where \(p_{2}^{\prime \prime }\) represents a second-order neighbor of node p, and \(\epsilon _{2}\) is a hyperparameter. On the original graph \(G_1\), \(p_{2}^{\prime \prime }\) is a second-order neighbor of p, whereas on the constructed graph \(G_2\), \(p_{2}^{\prime \prime }\) is a first-order neighbor of p. Finally, we obtain the embedding of \(G_2\), which is called \(E_2\).
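One iteration of the two branches can be sketched in plain Python. This is an illustration under stated assumptions rather than the released implementation: it uses the standard GIN update (sum aggregation followed by an MLP), takes the neighborhoods exactly as given by the adjacency matrix (so the self-connections of the two-hop graph are counted as neighbors), omits the normalization layers, and uses hypothetical names and toy identity weights.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(v, W, b):
    """Affine map of row vector v with weight matrix W (n x m) and bias b (length m)."""
    return [sum(x * W[i][j] for i, x in enumerate(v)) + b[j] for j in range(len(b))]


def gin_layer(features, adjacency, eps, W1, b1, W2, b2):
    """One GIN iteration: (1 + eps) * own feature + sum over neighbors, then a 2-layer MLP."""
    n = len(features)
    dim = len(features[0])
    out = []
    for p in range(n):
        agg = [(1.0 + eps) * x for x in features[p]]
        for u in range(n):
            if adjacency[p][u]:
                for d in range(dim):
                    agg[d] += features[u][d]
        hidden = relu(linear(agg, W1, b1))
        out.append(relu(linear(hidden, W2, b2)))
    return out


# Path graph 0 - 1 - 2 - 3 and its two-hop graph (with self-connections).
A1 = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
A2 = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]]
features = [[1.0], [2.0], [3.0], [4.0]]
W, b = [[1.0]], [0.0]  # identity MLP, so the aggregation is visible in the output

E1 = gin_layer(features, A1, 0.0, W, b, W, b)
E2 = gin_layer(features, A2, 0.0, W, b, W, b)
print(E1)  # [[3.0], [6.0], [9.0], [7.0]]
print(E2)  # [[5.0], [8.0], [7.0], [10.0]]
```

With a single iteration, the first branch sees only direct neighbors, while the second branch already sees the nodes two hops away on the original graph.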
Feature fusion
This operation integrates the embedding \(E_1\) of the original PPI network and the embedding \(E_2\) of the THPPI network into the same embedding space. We fuse the two embeddings by element-wise summation to obtain the final embedding \(E_{\text {out}}\) of all proteins, which is expressed as:
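Because the fusion is described as an element-wise summation of the two embeddings, the formula can be written as:

\[
E_{\text {out}}=E_{1}+E_{2}
\]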
Multi-label PPI prediction
We input the final embedding \(E_{\text {out}}\) of the proteins into a fully connected (FC) layer classifier, which predicts the interactions between two proteins. The embedding \(e_i\) of protein \(p_i\) and the embedding \(e_j\) of protein \(p_j\) are combined by a dot product operation. The formula is as follows:
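One form consistent with this description, reading the dot product as an element-wise product \(\odot\) so that the classifier receives a vector (an assumption; the text does not state which product is meant), is:

\[
{\hat{y}}_{i j}={\text {FC}}\left( e_{i} \odot e_{j}\right)
\]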
To supervise the training process of the model, we use the multi-task binary cross-entropy loss function, which is defined as follows:
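Assuming the standard form of the multi-task binary cross-entropy, summed over the label positions \(k=0, \ldots , n\) and over the training pairs (whether the sum is averaged is not specified in the text), the loss can be written as:

\[
{\mathcal {L}}=\sum _{k=0}^{n}\ \sum _{p_{i j} \in {\mathcal {I}}_{\text {train}}}-\left[ y_{i j}^{k} \log {\hat{y}}_{i j}^{k}+\left( 1-y_{i j}^{k}\right) \log \left( 1-{\hat{y}}_{i j}^{k}\right) \right]
\]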
where \({\mathcal {I}}_{\text {train}}\) represents the training set, and \(y_{i j}^{k}\) and \({\hat{y}}_{i j}^{k}\) denote the ground-truth label and the predicted probability for class k, respectively.
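The prediction and loss steps can be sketched together in plain Python. This is a minimal illustration under stated assumptions: the two protein embeddings are combined by an element-wise product, the classifier is a single fully connected layer with one sigmoid output per label, the loss is summed rather than averaged, and all names and toy weights are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_pair(e_i, e_j, W, b):
    """Element-wise product of two embeddings, one FC layer, one sigmoid per label."""
    combined = [a * c for a, c in zip(e_i, e_j)]
    logits = [sum(x * W[d][k] for d, x in enumerate(combined)) + b[k]
              for k in range(len(b))]
    return [sigmoid(z) for z in logits]


def multilabel_bce(y_true, y_prob, tiny=1e-12):
    """Sum of binary cross-entropy terms over all pairs and all label positions."""
    loss = 0.0
    for y_row, p_row in zip(y_true, y_prob):
        for y, p in zip(y_row, p_row):
            p = min(max(p, tiny), 1.0 - tiny)  # guard against log(0)
            loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss


# Untrained classifier (all-zero weights): every label gets probability 0.5.
W = [[0.0] * 7 for _ in range(2)]
b = [0.0] * 7
y_hat = predict_pair([1.0, 2.0], [3.0, 4.0], W, b)
y = [0, 0, 0, 1, 0, 1, 0]
print(round(multilabel_bce([y], [y_hat]), 4))  # 4.852
```

Each of the seven positions contributes an independent binary term, so a pair with several interaction types is penalized separately for each type.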
Availability of data and materials
The datasets used in this study can be downloaded from http://yellowstone.cs.ucla.edu/~muhao/pipr/SHS_ppi_beta.zip or from https://drive.google.com/open?id=1y_5gje6AofqjrkMPY58XUdKgDuu1mZCh. The source code is available online at https://github.com/666Coco123/LDMGNN.
Abbreviations
PPI: Protein–protein interaction
THPPI: Two-hop protein–protein interaction
DNA: Deoxyribonucleic acid
Y2H: Yeast two-hybrid
MSPCI: Mass spectrometric protein complex identification
TAP: Tandem affinity purification
SVM: Support vector machine
RF: Random forest
LR: Logistic regression
SAE: Stacked auto encoder
GCN: Graph convolutional networks
GIN: Graph isomorphism networks
RNN: Recurrent neural networks
ER: Erdős–Rényi
DFS: Depth first search
BFS: Breadth first search
AchR: Acetylcholine receptors
Ach: Acetylcholine
PAASE: Protein amino acid sequence encoding
MHSA: Multi-head self-attention
PMHGE: Protein multi-hop graph encoding
LDMGNN: Long-distance dependency combined multi-hop graph neural networks
References
1. Hu L, Wang X, Huang YA, Hu P, You ZH. A survey on computational models for predicting protein–protein interactions. Brief Bioinform. 2021;22:bbab036.
2. Raimondi D, Simm J, Arany A, Moreau Y. A novel method for data fusion over entity-relation graphs and its application to protein–protein interaction prediction. Bioinformatics. 2021;37:2275–81.
3. Meyer MJ, Das J, Wang X, Yu H. Instruct: a database of high-quality 3D structurally resolved protein interactome networks. Bioinformatics. 2013;29:1577–9.
4. Hamp T. Sequence-based prediction of protein–protein interactions (2014)
5. Huang K, Xiao C, Glass L, Zitnik M, Sun J. SkipGNN: predicting molecular interactions with skip-graph networks. Sci Rep. 2020;10:1–16.
6. Berggård T, Linse S, James P. Methods for the detection and analysis of protein–protein interactions. Proteomics. 2010;7(16):2833–42.
7. Chen M, Ju JT, Zhou G, Chen X, Wang W. Multifaceted protein–protein interaction prediction based on Siamese residual RCNN. Bioinformatics. 2019;35(14):305–14.
8. Xia Y, Xia CQ, Pan X, Shen HB. GraphBind: protein structural context embedded rules learned by hierarchical graph neural networks for recognizing nucleic-acid-binding residues. Nucl Acids Res. 2021;49:e51.
9. Liu L, Mamitsuka H, Zhu S. HPODNets: deep graph convolutional networks for predicting human protein-phenotype associations. Bioinformatics. 2021. https://doi.org/10.1093/bioinformatics/btab729.
10. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002;415(6868):180–3.
11. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A. 2001;98(8):4569–74.
12. Fields S, Sternglanz R. The two-hybrid system: an assay for protein–protein interactions. Trends Genet. 1994;10(8):286.
13. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002;415(6868):141–7.
14. Bürckstümmer T, Bennett KL, Preradovic A, Schütze G, Hantschel O, Superti-Furga G, Bauch A. An efficient tandem affinity purification procedure for interaction proteomics in mammalian cells. Nat Methods. 2006;3(12):1013.
15. Han J, Dupuy D, Bertin N, Cusick ME, Vidal M. Effect of sampling on topology predictions of protein–protein interaction networks. Nat Biotechnol. 2005;23(7):839–44.
16. Piehler J. New methodologies for measuring protein interactions in vivo and in vitro. Curr Opin Struct Biol. 2005;15(1):4–14.
17. Byron O, Vestergaard B. Protein–protein interactions: a supra-structural phenomenon demanding trans-disciplinary biophysical approaches. Curr Opin Struct Biol. 2015;35:76–86.
18. Gingras AC, Gstaiger M, Raught B, Aebersold R. Analysis of protein complexes using mass spectrometry. Nat Rev Mol Cell Biol. 2007;8(8):645–54.
19. Rivas J, Fontanillo C. Protein–protein interactions essentials: key concepts to building and analyzing interactome networks. PLoS Comput Biol. 2010;6(6):1000807.
20. Sun T, Bo Z, Lai L, Pei J. Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinform. 2017;18(1):277.
21. Hang L, Gong XJ, Yu H, Zhou C. Deep neural network based predictions of protein interactions using primary sequences. Molecules. 2018;23(8):1923.
22. Liu L, Zhu X, Ma Y, Piao H, Peng J. Combining sequence and network information to enhance protein–protein interaction prediction. BMC Bioinform. 2020;21(Suppl 16):1–13.
23. Yang F, Fan K, Song D, Lin H. Graph-based prediction of protein–protein interactions with attributed signed graph embedding. BMC Bioinform. 2020;21(1):1–16.
24. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks (2016)
25. Colonnese S, Petti M, Farina L, Scarano G, Cuomo F. Protein–protein interaction prediction via graph signal processing. IEEE Access. 2021;9:142681–92. https://doi.org/10.1109/ACCESS.2021.3119569.
26. Lv G, Hu Z, Bi Y, Zhang S. Learning unknown from correlations: graph neural network for inter-novel-protein interaction prediction (2021)
27. Zitnik M, Sosič R, Feldman MW, Leskovec J. Evolution of resilience in protein interactomes across the tree of life. Proc Natl Acad Sci. 2019;116(10):201818013.
28. Kovács I, Luck K, Spirohn K, Wang Y, Pollis C, Schlabach S, Bian W, Kim DK, Kishore N, Hao T. Network-based prediction of protein interactions. Nat Commun. 2019;10(1):1–8.
29. Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, et al. The genetic landscape of a cell. Science. 2010;327(5964):425–31.
30. Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, Simonovic M, Doncheva NT, Morris JH, Bork P, Jensen LJ. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucl Acids Res. 2018;47:D607–13.
31. Kingma D, Ba J. Adam: a method for stochastic optimization. Computer Science (2014)
32. Zhang M, Zhou Z. A review on multi-label learning algorithms. IEEE Trans Knowl Data Eng. 2014;26(8):1819–37.
33. Guo Y, Yu L, Wen Z, Li M. Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences. Nucl Acids Res. 2008;9:3025–30.
34. Wong L, You ZH, Li S, Huang YA, Liu G. Detection of protein–protein interactions from amino acid sequences using a rotation forest model with a novel PR-LPQ descriptor. In: International conference on intelligent computing (2015)
35. Yael S, Martin K, Roded S, Xue Y. A method for predicting protein–protein interaction types. PLoS ONE. 2014;9(3):90904.
36. Du X, Sun S, Hu C, Yao Y, Yan Y, Zhang Y. DeepPPI: boosting prediction of protein–protein interactions with deep neural networks. J Chem Inf Model. 2017;57:1499.
37. Fu T, Lee WC, Lei Z. Hin2vec: explore meta-paths in heterogeneous information networks for representation learning. In: Proceedings of the 2017 ACM on conference on information and knowledge management. CIKM ’17, pp. 1797–1806. Association for Computing Machinery, New York. 2017. https://doi.org/10.1145/3132847.3132953
38. Wang D, Cui P, Zhu W. Structural deep network embedding. In: ACM SIGKDD international conference on knowledge discovery & data mining (2016)
39. Lihong P, Wang C, Tian X, Zhou L, Li K. Finding lncRNA-protein interactions based on deep learning with dual-net neural architecture. IEEE/ACM Trans Comput Biol Bioinform. 2021. https://doi.org/10.1109/TCBB.2021.3116232.
40. Zhouzhou L, Wang Z, Tian X, Peng L. LPI-deepGBDT: a multiple-layer deep framework based on gradient boosting decision trees for lncRNA-protein interaction identification. BMC Bioinform. 2021;22(1):1–24.
41. Chu Y, Kaushik AC, Wang X, Wang W, Zhang Y, Shan X, Salahub DR, Xiong Y, Wei DQ. DTI-CDF: a cascade deep forest model towards the prediction of drug-target interactions based on hybrid features. Brief Bioinform. 2019;22(1):451–62. https://doi.org/10.1093/bib/bbz152.
42. Veličković P, Cucurull G, Casanova A, Romero A, Lió P, Bengio Y. Graph attention networks (2017)
43. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. Attention is all you need. In: Proceedings of the 31st international conference on neural information processing systems. NIPS’17. Red Hook: Curran Associates Inc., pp. 6000–6010 (2017)
44. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding (2018)
45. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V. RoBERTa: a robustly optimized BERT pretraining approach (2019)
46. Nelson DL, Cox MM. Lehninger principles of biochemistry. 5th ed. New York: Worth Publishers; 2008.
47. Tsetlin VI, Hucho F. Snake and snail toxins acting on nicotinic acetylcholine receptors: fundamental aspects and medical applications. FEBS Lett. 2004;557(1–3):9–13.
48. Li Y, Tarlow D, Brockschmidt M, Zemel R. Gated graph sequence neural networks. Computer Science (2015)
49. Hamilton WL, Ying R, Leskovec J. Inductive representation learning on large graphs. In: Proceedings of the 31st international conference on neural information processing systems. NIPS’17. Red Hook: Curran Associates Inc., pp. 1025–1035 (2017)
50. Hamilton WL, Ying R, Leskovec J. Representation learning on graphs: methods and applications (2017)
51. Xu K, Hu W, Leskovec J, Jegelka S. How powerful are graph neural networks? (2018)
Acknowledgements
Not applicable.
Funding
This work was supported by the Artificial Intelligence Program of Shanghai (2019RGZN01077) and the National Natural Science Foundation of China (No. 12271362).
Author information
Contributions
WZ and CXH conceived the work. WZ and CX designed and completed the experiment. CXH, XFQ and ZSY conducted the guidance. YRL adjusted the format of the manuscript. All authors have read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Zhong, W., He, C., Xiao, C. et al. Long-distance dependency combined multi-hop graph neural networks for protein–protein interactions prediction. BMC Bioinformatics 23, 521 (2022). https://doi.org/10.1186/s12859-022-05062-6
DOI: https://doi.org/10.1186/s12859-022-05062-6