
Long-distance dependency combined multi-hop graph neural networks for protein–protein interactions prediction

Abstract

Background

Protein–protein interactions are widespread in biological systems and play an important role in cell biology. Because traditional laboratory-based methods are time-consuming and expensive, a large number of deep learning methods have emerged. However, these methods do not take into account the long-distance dependency information between every two amino acids in a sequence. In addition, most existing models based on graph neural networks only aggregate the first-order neighbors in the protein–protein interaction (PPI) network. Although multi-order neighbor information can be aggregated by increasing the number of layers of the neural network, doing so easily causes over-fitting. It is therefore necessary to design a network that can capture long-distance dependency information between amino acids in the sequence and can directly capture multi-order neighbor information in the PPI network.

Results

In this study, we propose a multi-hop graph neural network model combining long-distance dependency information (LDMGNN) to predict multi-label protein–protein interactions. In the LDMGNN model, we design the protein amino acid sequence encoding (PAASE) module with a multi-head self-attention Transformer block, which extracts the features of amino acid sequences by calculating the interdependence between every two amino acids, and we expand the spatial receptive field by constructing a two-hop protein–protein interaction (THPPI) network. We combine the PPI network and the THPPI network with the amino acid sequence features, respectively, and feed them into two identical GIN blocks to obtain two embeddings. The two embeddings are then fused and input to a classifier to predict multi-label protein–protein interactions. Compared with other state-of-the-art methods, LDMGNN shows the best performance on both the SHS27k and SHS148k datasets. Ablation experiments show that the PAASE module and the construction of the THPPI network are feasible and effective.

Conclusions

In summary, our proposed LDMGNN model achieves satisfactory results in the prediction of multi-label protein–protein interactions.


Background

Proteins are among the most common molecules in organisms. They are the material basis of life activities and participate in various biological processes [1]. Most vital biological processes in organisms are driven by protein–protein interactions (PPIs) rather than by an individual protein acting alone [2,3,4]. PPIs are widespread and play an important role in biological systems. For instance, PPIs are essential for cell activities such as cell proliferation, immune response, signal transduction, DNA transcription, and replication [5]. Therefore, exploring the interactions between proteins is key to studying cell biology [6,7,8] and is of great significance for the diagnosis and treatment of diseases, as well as the design and development of drugs [9]. At present, there are many methods for the prediction of PPIs, which can be broadly divided into two types: laboratory-based traditional methods and deep learning methods.

With the rapid development of high-throughput technology, a number of laboratory-based traditional methods have been used to predict PPIs, such as mass spectrometric protein complex identification (MS-PCI) [10], yeast two-hybrid (Y2H) [11, 12] and tandem affinity purification (TAP) [13, 14]. These methods allow the interactions between proteins to be observed directly. However, the vast majority of such experiments are performed at genome scale and have limited coverage. At the same time, a great deal of time and money is needed to keep these experiments running smoothly. In addition, some experiments rely on obtaining target proteins from animals, which raises ethical concerns [15,16,17,18,19]. To address the shortcomings of traditional methods, researchers have turned to deep learning methods.

Deep learning, by virtue of its powerful feature learning ability, has been widely adopted and is rapidly evolving, and the bioinformatics field is no exception: it has been applied to problems such as protein–protein interaction prediction. Sun et al. [20] applied a stacked autoencoder (SAE) to capture amino acid sequence features to predict PPIs. Hang et al. [21] designed a deep neural network (DNN-PPI) framework capable of automatically acquiring features from the amino acid sequences of proteins. Chen et al. [7] constructed an end-to-end framework for PPI prediction based on a siamese residual recurrent neural network (PIPR), which extracted features from amino acid sequences. These deep learning models all exhibit excellent generalization ability on PPI prediction tasks. However, they are highly dependent on the amino acid sequences of proteins. Since there may be some similarities between different amino acids, these typical deep learning models cannot effectively capture the information of the entire protein amino acid sequence and the relationships between different amino acids.

It should be noted that all the methods mentioned above take only amino acid sequence features as input. They do not consider the interactions between proteins [22], which limits their prediction performance. Protein–protein interactions can be regarded as hidden information to some extent, so combining them with amino acid sequence information can improve prediction accuracy. The PPI network can be viewed as a graph with each protein as a node and each connection between proteins as an edge.

Therefore, Yang et al. [23] proposed to view the PPI network as an undirected graph and applied GCN [24] to the PPI prediction task for the first time. They constructed a signed variational graph autoencoder, which combined the PPI network with amino acid sequence information to learn protein features for PPI prediction. Inspired by graph signal processing methods, Colonnese et al. [25] considered node features on PPI networks as signals and developed a Markov model to accomplish PPI prediction. Lv et al. [26] constructed a GNN-PPI model based on the graph isomorphism network (GIN) to predict the interactions between protein pairs. It not only considered the amino acid sequence information but also fully considered the correlation between proteins.

These models have improved the accuracy of PPI prediction. However, they only considered the information of two proteins that interact directly in the PPI network, ignoring the information of protein pairs that interact indirectly. In this way, the information captured by these models is incomplete. Studies [27, 28] have shown that indirect interactions also carry meaningful information. In a biological network, two molecular nodes that do not interact directly can still share some similarities [29]. Therefore, for a target node, it would be helpful if the information of such indirectly interacting nodes could be aggregated.

In this paper, we propose a novel LDMGNN model for the multi-label prediction of protein–protein interactions. This model mainly aims to solve the two problems mentioned above. The first problem is that existing methods ignore the long-distance dependency information between amino acids in the sequence, and the second is that existing methods do not fully consider the interactions among indirectly connected protein nodes. To solve the first problem, we use the Transformer with a multi-head self-attention mechanism to capture the correlation between every two amino acids in the sequence. For the second problem, we construct a two-hop protein–protein interaction (THPPI) network based on the PPI network to enhance graph representation learning. Overall, our main contributions are as follows:

  • We use the Transformer module with a multi-head self-attention mechanism to capture the long-distance dependency information between every two amino acids in a sequence.

  • Based on the two-hop concept, we construct a THPPI network from the PPI network, which can capture the information between indirectly interacting proteins and thus enlarge the spatial receptive field.

  • The experimental results show that our method exhibits good performance.

Results

In the following, we introduce the datasets, experimental parameter settings, evaluation metrics, baselines, and the experimental results and analysis.

Datasets

In this paper, we use two common datasets, i.e., SHS27k and SHS148k, to evaluate our method. These two datasets contain a large amount of PPI information and amino acid sequence information. They were randomly selected by [7] from the Homo sapiens subset of STRING [30] under the rule that sequence alignment similarity is less than \(40 \%\). The SHS27k dataset contains 1690 proteins and 7624 pairs of PPIs, and the SHS148k dataset contains 5189 proteins and 44,488 pairs of PPIs. The interactions in the two datasets can be divided into seven types, i.e., post-translational modification (ptmod), catalysis, reaction, activation, expression, binding and inhibition. These types reflect not only the physical correlation between proteins but also the functional correlation.

We regard all known PPIs as positive samples, and negative samples of the same size are randomly selected from unknown interactions. In our experiment, the positive-to-negative sample ratio is 1:1. Specifically, in the SHS27k dataset, the number of negative samples is 7624; in the SHS148k dataset, the number of negative samples is 44,488. At the same time, inspired by [26], in order to evaluate the generalization ability of the LDMGNN model more realistically, we choose three partition schemes to divide the test set, i.e., random, BFS and DFS. Our test set accounts for \(20 \%\) of the dataset.

Given the protein set P and PPI set I, we construct a protein–protein interaction network \(G=\langle {P}, {I}\rangle\). The size of the fixed test set is N (we divide the dataset according to edges; that is, N refers to the number of edges in the test set). First, a protein is randomly selected from the protein set P as the root node \(p_{root}\). Given a threshold t, the degree of the root node \(p_{root}\) must be less than this threshold (we set \(t=5\)), that is, \(\left| N\left( p_{\text{ root } }\right) \right| <t\). We set the initial test set \(I_{\text{ test }}=\emptyset\) and the current node \(p_{cur} = p_{root}\). Then the DFS (or BFS) algorithm is used to search the neighbor nodes \(p_{k}\) of the current node \(p_{cur}\), i.e., \(p_{k} \in N\left( p_{cur}\right)\). At this point, the test set becomes \(I_{\text{ test }}=I_{\text{ test }} \cup I_{cur}\), where \(I_{cur}=\left\{ p_{cur}, p_{k}\right\}\). The process is repeated until the number of edges in the subgraph formed by all nodes in the test set exceeds N, as sketched below.
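A minimal sketch of this BFS/DFS-style partitioning, assuming the PPI network is given as an adjacency list; the function and variable names are illustrative and not taken from the authors' code:

```python
import random
from collections import deque

def bfs_dfs_test_split(adj, n_test_edges, t=5, use_bfs=True):
    """Grow a test subgraph from a low-degree root by BFS (or DFS).

    adj: dict mapping each protein to the set of its interaction partners.
    Returns the set of undirected edges assigned to the test set.
    """
    # pick a random root whose degree is below the threshold t
    root = random.choice([p for p, nbrs in adj.items() if len(nbrs) < t])
    test_edges, visited, frontier = set(), {root}, deque([root])
    while frontier and len(test_edges) < n_test_edges:
        cur = frontier.popleft() if use_bfs else frontier.pop()  # BFS pops left, DFS pops right
        for nbr in adj[cur]:
            test_edges.add(tuple(sorted((cur, nbr))))
            if nbr not in visited:
                visited.add(nbr)
                frontier.append(nbr)
    return test_edges
```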

Parameter settings and evaluation metric

Our experiment is performed on an NVIDIA RTX 3090 GPU with the PyTorch deep learning framework. We choose the Adam algorithm [31] as the optimization strategy, with a weight decay coefficient of 5e−4 and a batch size of 512. We train our models for 300 epochs with an initial learning rate of 0.001. We choose the ReduceLROnPlateau scheduler to vary the learning rate, with the patience set to 20 to prevent overfitting. During model training, if the loss does not decrease for 20 consecutive iterations, training stops automatically.
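For concreteness, this optimization setup corresponds roughly to the following PyTorch configuration; `model`, `train_loader` and `train_one_epoch` are hypothetical placeholders, and only the hyperparameter values come from the text:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=20)

for epoch in range(300):
    epoch_loss = train_one_epoch(model, train_loader, optimizer)  # batches of 512 samples
    scheduler.step(epoch_loss)  # lower the learning rate when the loss stops improving
```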

Our task is to use a classifier to solve multi-label PPI classification, where the interaction between each protein pair has at least one label. Moreover, the types of PPIs in the SHS27k and SHS148k datasets are extremely unbalanced [26]. Micro-F1 emphasizes the common labels in the datasets, is not easily affected by very small or very large samples, and gives each sample the same importance [32]. Taking this into consideration, we choose the micro-F1 metric to measure the accuracy of our model. The mathematical formula is as follows:

$$\begin{aligned} \ Micro{-}F1 = 2 \frac{Recall_m \times Precision_m}{Recall_m + Precision_m}, \end{aligned}$$
(1)

where \(Recall_m\) and \(Precision_m\) are the total recall and total precision for all classes, expressed with mathematical formulae as follows:

$$\begin{aligned} Recall_{m} = \frac{TP_{1}+TP_{2}+\cdots +TP_{n}}{TP_{1}+TP_{2}+\cdots +TP_{n}+FN_{1}+FN_{2}+\cdots +FN_{n}}, \end{aligned}$$
(2)
$$\begin{aligned} Precision_m = \frac{TP_1+TP_2+\cdots +TP_n}{TP_1+TP_2+\cdots +TP_n+FP_1+FP_2+\cdots +FP_n}, \end{aligned}$$
(3)

where n indicates the number of classes; in this experiment, n = 7. \(TP_{i}\), \(FP_{i}\), \(TN_{i}\) and \(FN_{i}\) indicate the true positives, false positives, true negatives and false negatives of the ith class, respectively.
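The formulas above amount to pooling the per-class counts before computing precision and recall. A minimal sketch for multi-label binary indicator arrays (equivalent, up to numerical smoothing, to `sklearn.metrics.f1_score(..., average='micro')`):

```python
import numpy as np

def micro_f1(y_true, y_pred, eps=1e-10):
    """Micro-averaged F1 for multi-label predictions.

    y_true, y_pred: binary arrays of shape (n_samples, n_classes); here n_classes = 7.
    """
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return 2 * precision * recall / (precision + recall + eps)
```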

Baselines

In order to better illustrate the effectiveness of our model, we compare LDMGNN with different baselines. These baselines can be divided into machine learning based and deep learning based methods. We choose three algorithms based on machine learning: SVM [33], RF [34] and LR [35]. Their inputs are common handcrafted protein features, i.e., AC [33] and CTD [36].

When comparing with the models based on deep learning, we construct the same architectures as the original works, input the SHS27k and SHS148k datasets into each model, and change the output from the original binary classification to multi-label classification. These deep learning models are as follows.

  • HIN2Vec [37]: A representation learning framework for heterogeneous information networks (HIN). It uses different types of interactions among nodes to capture the features of nodes and meta-paths in HIN.

  • SDNE [38]: A structural deep network embedding method for link prediction and multi-label classification tasks. It can not only effectively capture the highly nonlinear network structure, but also preserve the global and local structure of the network.

  • LPI-DLDN [39]: A deep learning model with a dual-network architecture composed of a feature importance ranking (FIR) network and an MLP network. Given the sequences of a protein and an lncRNA, it predicts the potential interaction between them.

  • LPI-deepGBDT [40]: A multiple-layer deep structure model based on gradient boosting decision trees. Given the sequences of a protein and an lncRNA, it predicts unobserved LPIs.

  • DTI-CDF [41]: A cascade deep forest model based on hybrid features, which cascades the traditional machine learning models RF and XGB. Given the hybrid features (containing information on the drug, the target and drug–target interactions), it predicts the interaction between drug and target.

  • PIPR [7]: An end-to-end network model for predicting PPIs, which combines two residual RCNNs in a Siamese architecture. This method provides an automatic multi-granularity feature selection mechanism to capture sequence features.

  • GAT [42]: A neural network for graph-structured data. It learns node embeddings by applying a self-attention mechanism over the graph structure.

  • GNN-PPI [26]: A graph neural network model that, given protein amino acid sequence information and the PPI network, predicts multi-label PPIs.

Our experiment is inspired by GNN-PPI [26]. However, compared with the GNN-PPI model, our LDMGNN model has the following two innovations. First, in the amino acid sequence encoding part, we replace the biGRU block with a Transformer block with a multi-head self-attention mechanism, which can not only capture the long-distance dependency information between amino acids but also solve the problem that biGRU cannot be parallelized. Second, considering that there may be some connection between nodes that do not interact directly, we construct a two-hop PPI network to capture more comprehensive protein information. This is not available in the GNN-PPI model.

Results and analysis

As shown in Table 1, the LDMGNN method shows the best performance compared to the other baselines. From this result, it can be seen that our model has fully learned the long-distance dependency between amino acids and effectively expanded the receptive field, which improves the prediction accuracy of multi-label PPI. From the perspective of dataset size, the performance of the model increases with the size of the dataset. Our method performs better on SHS148k than on SHS27k, because more valuable information can be obtained as the PPI network grows. From the perspective of the dataset partition scheme, the performance improvement of our method under the BFS and DFS partition schemes is generally higher than under the random scheme. For the SHS27k dataset, our method achieves absolute improvements of \(1.43 \%\), \(10.75 \%\) and \(3.48 \%\) over the GNN-PPI model under the random, BFS, and DFS partition schemes, respectively. For the SHS148k dataset, our method achieves absolute improvements of \(0.12 \%\), \(2.61 \%\) and \(1.12 \%\) over GNN-PPI under the random, BFS, and DFS partition schemes, respectively. This illustrates that our method has a certain generalization ability and has practical implications.

Table 1 Performance results of LDMGNN compared with other methods on two datasets, we report the mean and standard deviation of the test sets

Significant difference analysis

To verify whether the performance of our proposed LDMGNN model is statistically significantly different from these 11 baseline models, we conducted a paired samples t-test using SPSS software. The related results are shown in Tables 2 and 3. Table 2 represents the correlation between two samples, where the correlation value lies in the interval [\(-1\), 1]. A value greater than 0 indicates a positive correlation between the two samples, and a value less than 0 indicates a negative correlation; the larger the absolute value of the correlation, the stronger the correlation between the two samples. At the same time, the significance level p value (i.e., Sig. in Table 2) should be less than 0.05: when the p value is less than 0.05, the correlation between the samples is considered significant. A paired samples t-test is only meaningful when there is a significant correlation between paired samples. Table 3 shows the results of the paired samples t-test; when the p value (i.e., Sig.(2-tailed)) is less than 0.05, there is a significant difference between the two samples.

As shown in Table 2, the correlation of the first row (LDMGNN & SVM) is 0.976 (the absolute value is close to 1), and the significance level p value (Sig.) is 0.001 (< 0.05), which indicates that the two samples of LDMGNN and SVM are significantly and strongly correlated. Similarly, the correlations of the remaining 10 pairs of samples are 0.917, 0.920, 0.916, 0.953, 0.938, 0.929, 0.943, 0.916, 0.938 and 0.965, respectively (all greater than 0.9), and the corresponding significance level p values (i.e., Sig.) are 0.010, 0.009, 0.010, 0.003, 0.006, 0.007, 0.005, 0.010, 0.006 and 0.002 (all less than 0.05). Obviously, these 10 paired samples are all significantly correlated. Therefore, it is meaningful to perform a paired samples t-test on these 11 pairs of samples.

Table 2 Paired samples correlation of the LDMGNN model with 11 baseline models
Table 3 Paired samples test of the LDMGNN model with 11 baseline models

From Table 3, we can see that the p values (i.e., Sig.(2-tailed)) of the first 10 pairs of samples are 0.001, 0.005, 0.000, 0.000, 0.009, 0.002, 0.001, 0.002, 0.015 and 0.011, respectively, all of which are less than 0.05. This shows that there are significant differences between the first 10 pairs of samples. The p value (i.e., Sig.(2-tailed)) of the 11th pair of samples is 0.094, which is greater than 0.05. For this pair, we consider that the extreme imbalance of the data may be a contributing factor. Further, from a practical perspective, we believe that the LDMGNN model has certain practical significance. Lv et al. [26] studied the Homo sapiens subsets at two time points (2011/01/25 and 2021/01/25) in the BioGRID database and found that the newly discovered proteins exhibited local patterns of BFS and DFS. Under these two partition schemes, our LDMGNN model achieves a large improvement in accuracy compared with the GNN-PPI model.

Ablation analysis

In order to verify the importance and effectiveness of each module of this study for the prediction model, we conduct an ablation study by deleting or replacing each module. We use GNN-PPI as the baseline for PPI prediction, which processes amino acid sequences using an RNN and aggregates only first-order neighbor information. −PMHGE represents the removal of the PMHGE module from the LDMGNN model; unlike the baseline GNN-PPI, it uses the Transformer with a multi-head self-attention mechanism to learn the interdependency of amino acids in the sequence. −PAASE represents the removal of the PAASE module from the LDMGNN model; unlike the baseline GNN-PPI, it constructs a THPPI network and simultaneously aggregates first-order and second-order neighbor information, increasing the spatial receptive field of the model. LDMGNN is our proposed model; compared with the baseline GNN-PPI, it not only captures the long-distance dependency information in the sequence but also increases the spatial receptive field. We still use micro-F1 as the evaluation metric.

Table 4 Ablation studies on SHS27k and SHS148k datasets

As can be seen from Table 4, when the model only uses the multi-head self-attention mechanism to capture the long-distance dependency information in the sequence, it improves over the GNN-PPI model by \(0.81 \%\), \(5.03 \%\) and \(2.20 \%\) on the SHS27k dataset under the random, BFS and DFS partition schemes, respectively. For the SHS148k dataset, improvements of \(0.12 \%\), \(-2.36 \%\) and \(0.10 \%\) are achieved, respectively. The results show that prediction accuracy can be improved even if the model only captures the interdependency of amino acids in the sequence, indicating that the long-distance dependency of amino acids plays a positive role in the prediction of multi-label PPI. As for the \(2.36 \%\) decrease of micro-F1 on the SHS148k dataset under the BFS partition scheme, we believe it is caused by data imbalance.

When the model only increases the spatial receptive field of the network, it improves over the GNN-PPI model by \(1.37 \%\), \(4.80 \%\) and \(0.09 \%\) on the SHS27k dataset under the random, BFS and DFS partition schemes, respectively. For the SHS148k dataset, improvements of \(0.09 \%\), \(0.26 \%\) and \(0.24 \%\) are achieved, respectively. The results show that aggregating the information of first-order and second-order neighbors simultaneously can improve prediction accuracy, which suggests that an appropriate increase of the network receptive field also plays a positive role in the prediction of multi-label PPI. However, the accuracy improvement is highest when the model both captures the interdependencies between amino acids in the sequence and aggregates the first-order and second-order neighbors, which further demonstrates that the LDMGNN model is effective for predicting multi-label PPI.

The selection of hop number

When constructing a multi-hop PPI network to increase the receptive field, we conducted experiments on different k to determine the k-hop network to be constructed. The experimental results are shown in Table 5. In the table, “One-Hop” indicates the case where \(k = 1\). In this case, the PPI network is the original PPI network, and the target node only aggregates the information of first-order neighbors. In the table, “One-Hop \(+\) Two-Hop” represents the case where \(k=2\). We construct a THPPI network, where the target node aggregates the information of first-order and second-order neighbors simultaneously. “One-Hop \(+\) Two-Hop \(+\) Three-Hop” in the table represents \(k=3\). We construct both a THPPI network and a Three-Hop PPI network, and the target node aggregates the information of first-order, second-order, and third-order neighbors at the same time. We still use micro-F1 as the evaluation metric, and each boldface number means the best accuracy under this partitioning scheme. Obviously, the prediction accuracy is the highest when k is 2.

Table 5 Comparison experiment of hop number selection on SHS27k and SHS148k datasets

It can be seen from Table 5 that when k is set to 2, the accuracy obtained under each partition scheme on the two datasets is higher than when k is set to 1. This indicates that it is necessary to construct a multi-hop PPI network to increase the spatial receptive field of the network. However, as k increases further, the performance of the model on the two datasets shows a decreasing trend. When k is 3, the model achieves much lower accuracy under every partition scheme on both datasets than when k is 2; indeed, the accuracy obtained when k is 3 is lower in all cases than when k is 1. This indicates that simply constructing a higher-hop PPI network is not always better: as the receptive field gradually increases, the model tends to over-fit. Therefore, in this study, k is set to 2 when we construct the multi-hop PPI network. This further shows that aggregating first-order and second-order neighbor information at the same time, and thus appropriately increasing the receptive field of the network, is effective and reasonable.

The effect of more negative samples on model performance

To test how additional negative samples affect the performance of the model, we increase the number of negative samples while keeping the number of positive samples constant, so the positive-to-negative sample ratio in the datasets changes. Specifically, for the SHS27k dataset, we randomly selected three negative sample sets from protein pairs with unknown interactions, containing 22,872, 38,120 and 76,240 pairs, respectively. We took the protein pairs with known interactions as positive samples, of which there are 7624. Thus we obtain three different positive-to-negative sample ratios: 1:3, 1:5 and 1:10. Similarly, for the SHS148k dataset, we randomly selected three negative sample sets from protein pairs with unknown interactions, containing 133,464, 222,440 and 444,880 pairs, respectively. We again took the known interacting protein pairs as positive samples, of which there are 44,488, resulting in the same three ratios of 1:3, 1:5 and 1:10. We still choose three partition schemes to divide the test set, namely random, BFS and DFS, and our test set accounts for \(20 \%\) of the dataset.
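A simple way to realize this sampling, shown only as an illustrative sketch (the authors' exact sampling code is not given), is to keep drawing random protein pairs and discard those with a known interaction:

```python
import random

def sample_negative_pairs(proteins, positive_pairs, ratio):
    """Draw `ratio` negatives per positive from protein pairs with unknown interactions.

    proteins: list of protein identifiers; positive_pairs: iterable of known interacting pairs.
    """
    positives = {tuple(sorted(p)) for p in positive_pairs}
    negatives = set()
    while len(negatives) < ratio * len(positives):
        pair = tuple(sorted(random.sample(proteins, 2)))
        if pair not in positives:
            negatives.add(pair)
    return list(negatives)
```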

The experimental results are shown in Table 6, where 1:1 is the positive-to-negative sample ratio used by the LDMGNN model. As can be seen from Table 6, for the SHS27k dataset under the random, BFS and DFS partition schemes, compared with the 1:1 ratio, the performance of the model decreases by 3.09, 3.13 and 2.06, respectively, when the ratio is 1:3; by 6.60, 5.27 and 5.26 when the ratio is 1:5; and by 11.40, 8.94 and 8.36 when the ratio is 1:10. For the SHS148k dataset under the same three partition schemes, the performance decreases by 3.07, 2.26 and 4.82 when the ratio is 1:3; by 10.28, 5.77 and 11.15 when the ratio is 1:5; and by 16.83, 12.62 and 16.87 when the ratio is 1:10.

Obviously, when we add more negative samples, the performance of the model decreases significantly. This is because when the number of negative samples far exceeds the number of positive samples, the model's judgment of the positive samples is affected, and the classifier cannot capture the features of the positive samples well. Therefore, the imbalance between positive and negative samples negatively affects the performance of the model.

Table 6 Results of increasing the number of negative samples on SHS27k and SHS148k

Discussion

Next, we will introduce why we use the Transformer with a multi-head self-attention mechanism to capture amino acid sequence information and why we construct a THPPI network to aggregate two-hop neighbor information.

The Transformer [43] was first proposed to replace recurrent neural networks (RNNs) in natural language processing. It has two distinctive properties: it can capture the long-distance dependencies of sequences, and it can be parallelized. Inspired by natural language processing methods such as BERT [44] and RoBERTa [45], we regard each amino acid as a vector and the amino acid sequence as a set of vectors, and we consider that there may also be long-distance dependencies between every two amino acids in a sequence. Therefore, we use the Transformer with a multi-head self-attention mechanism to capture amino acid sequence information.

Meanwhile, we construct a THPPI network because there may be meaningful information between two indirectly interacting nodes. Indeed, competitive inhibition [46] in biochemistry also suggests that there may be structural or functional similarity between two indirectly connected nodes. Figure 1 shows a typical example [47] of structurally similar biomolecules causing competitive inhibition. When a human is bitten by a snake, snake venom proteins travel through the blood circulation into the nervous tissue space and bind to acetylcholine receptors (AchR) with a binding affinity much higher than that of acetylcholine (Ach). Thus, the snake venom proteins inhibit the binding of acetylcholine to acetylcholine receptors. Here, the snake venom proteins are structurally similar to acetylcholine. Inspired by this case, we consider that in a PPI network there may be two proteins that are indirectly connected but structurally or functionally similar. We therefore construct a THPPI network to aggregate second-order neighbor information and enlarge the spatial receptive field.

Fig. 1 The illustration of competitive inhibition

Conclusions

In this study, we propose the LDMGNN model to predict multi-label protein–protein interactions. LDMGNN first captures the potential features of amino acid sequences through the PAASE module and then combines this information with the topological information of the initial PPI network and of the THPPI network, respectively. The two combinations are fed into the graph neural network, and the two resulting feature matrices are added element-wise to form the final embedding of each protein. Finally, this embedding is fed into a classifier to predict protein–protein interactions.

We carry out a series of experiments, and on both the SHS27k and SHS148k datasets our model shows better performance than existing models. Furthermore, we perform ablation experiments to verify that each module of the model is indispensable and that the experimental parameters are reasonable and effective. This indicates that the Transformer with a multi-head self-attention mechanism can successfully capture the long-distance dependency information in amino acid sequences, and that the spatial receptive field of the network can be increased by aggregating first-order and second-order neighbor information simultaneously. In conclusion, the LDMGNN model can comprehensively learn the feature information between protein pairs and has good potential in PPI prediction.

Methods

This section introduces the proposed multi-label PPI prediction approach, which is an end-to-end representation learning model. Given the representation of the protein amino acid sequences, the adjacency matrix of the PPI network and the adjacency matrix of the THPPI network, we predict the labels of protein–protein pairs. The amino acid sequence representation used as input is processed into numerical vectors following [7]. In this section, we first define the multi-label PPI prediction problem and then introduce our LDMGNN model in detail.

Problem formulation

We represent the set of amino acids as M, and define the amino acid sequence S of a protein, which consists of amino acids in varying proportions, as \(S=\left\{ m_{1}, m_{2}, \ldots , m_{l}\right\}\), where \(m_{i} \in M, i=1,2, \ldots , l\). We consider the initial PPI network as an undirected graph \(G_{1}=\langle {P}, {I}\rangle\), whose adjacency matrix is \(A_{1} \in \{0,1\}^{N \times N}\), where P is the set of proteins, denoted \(P=\left\{ p_{1}, p_{2}, \ldots , p_{n}\right\}\), and I is the set of protein–protein interactions, defined as \(I=\left\{ p_{i j} \mid p_{i j}=\left( p_{i}, p_{j}\right) , i \ne j, p_{i}, p_{j} \in P\right\}\). If \(p_{ij}=1\), there is an interaction between protein \(p_{i}\) and protein \(p_{j}\); if \(p_{ij}=0\), there is no interaction between them or the interaction has not yet been identified. Similarly, we consider the constructed THPPI network as an undirected graph \(G_{2}=\langle {P}, {I}\rangle\), whose adjacency matrix is \(A_{2} \in \{0,1\}^{N \times N}\).

The task of this multi-label classification is to learn a model \(F=(p, {\hat{y}})\) from the training set \(I_{train}\), whose input p is a protein pair with known interaction, \(p \in \ I_{train}\). The output is a 7-dimensional vector corresponding to a finite label set L. We define the label set of PPIs as \(L=\left\{ \ell _{0}, \ell _{1}, \ldots , \ell _{n}\right\}\) with \(n=6\), where the seven labels correspond to the types of protein–protein interaction: post-translational modification (ptmod), catalysis, reaction, activation, expression, binding and inhibition. The interaction of each pair of proteins contains at least one type. When a certain type of interaction is present, the corresponding position in the vector is 1; otherwise it is 0. The learned model F is used to predict the labels \({\hat{y}}_{ij}\) of protein pairs \(p_{ij} \in \ I_{test}\).
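As a concrete illustration of this label encoding (the ordering of the seven types below is an assumption, not necessarily the one used in the datasets):

```python
import numpy as np

PPI_TYPES = ['ptmod', 'catalysis', 'reaction', 'activation',
             'expression', 'binding', 'inhibition']

def encode_labels(interaction_types):
    """Map the set of interaction types of one protein pair to a 7-dimensional binary vector."""
    y = np.zeros(len(PPI_TYPES), dtype=np.float32)
    for t in interaction_types:
        y[PPI_TYPES.index(t)] = 1.0
    return y

# A pair annotated with both activation and binding:
encode_labels({'activation', 'binding'})  # -> [0., 0., 0., 1., 0., 1., 0.]
```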

Overview

The framework of the proposed LDMGNN model is shown in Fig. 2. We introduce the framework in three parts. The first part is “Protein Encoding”, which is the core of the LDMGNN model and is used to extract the representations of protein nodes. The second part is “Feature Fusion”, and the last part is “Multi-label PPI Prediction”.

Fig. 2 The illustration of the LDMGNN framework. Here \(G_{1}\) indicates the original PPI network, and \(G_{2}\) indicates the THPPI network. The symbol \(\oplus\) indicates element-wise addition. \(e_i^{T}\) and \(e_j^{T}\) represent the embedding vectors of proteins \(p_i\) and \(p_j\), respectively. The MUL block represents the dot product operation. The FC block classifies the types of protein interactions; as shown in the figure, there are seven different types

Protein encoding

The process of protein encoding can be regarded as two parts, which are trained together in an end-to-end manner. The first is protein amino acid sequence encoding (PAASE), which captures protein features from the amino acid sequence; we call these the sequence features. The second is protein multi-hop graph encoding (PMHGE), which consists of two branches: one branch uses a GIN block to capture the first embedding of the proteins, and the other branch uses a GIN block to capture the second embedding of the proteins.

Protein amino acid sequence encoding (PAASE)

This module captures protein features from the amino acid sequence. The predefined features used as its input are obtained from the preprocessing of [7]. Chen et al. [7] used an embedding method to represent each amino acid \(m \in M\) as a 13-dimensional vector composed of two sub-vectors, i.e., \(m=\left[ m_{1}, m_{2}\right]\). The first sub-vector \(m_1\) measures the co-occurrence similarity of amino acids and has dimension 5. The second sub-vector \(m_2\) represents the electrostatic and hydrophobic similarity among amino acids and is an eight-dimensional one-hot encoding.

As shown in Fig. 3, the predefined features are first passed through a one-dimensional convolution layer with a kernel size of 3. The resulting hidden features are fed into a normalization layer, which stabilizes training and speeds it up. A max-pooling layer then extracts more representative features. To capture the long-distance dependency information in the sequence, these representative features are input into the Transformer module with a multi-head self-attention mechanism, which learns the interdependencies of amino acids. Next, the hidden features obtained from the MHSA Transformer layer are passed through a one-dimensional average pooling layer for dimension reduction. Finally, the sequence feature is obtained through a fully connected layer.
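A minimal PyTorch sketch of this pipeline is given below; the layer sizes, number of heads and pooling factor are illustrative assumptions rather than the authors' exact settings:

```python
import torch
import torch.nn as nn

class PAASE(nn.Module):
    """Conv1d -> normalization -> max pooling -> MHSA Transformer -> average pooling -> FC."""
    def __init__(self, in_dim=13, hidden=128, out_dim=256, n_heads=4):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm1d(hidden)
        self.pool = nn.MaxPool1d(kernel_size=2)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=n_heads, batch_first=True)
        self.mhsa = nn.TransformerEncoder(layer, num_layers=1)
        self.fc = nn.Linear(hidden, out_dim)

    def forward(self, x):                        # x: (batch, seq_len, 13) pre-embedded amino acids
        h = self.conv(x.transpose(1, 2))         # (batch, hidden, seq_len)
        h = self.pool(torch.relu(self.norm(h)))  # (batch, hidden, seq_len // 2)
        h = self.mhsa(h.transpose(1, 2))         # self-attention over sequence positions
        h = h.mean(dim=1)                        # one-dimensional average pooling
        return self.fc(h)                        # per-protein sequence feature
```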

Fig. 3 The illustration of the PAASE module. The MHSA Transformer block refers to the Transformer with the multi-head self-attention mechanism; MHSA is used in the figure for brevity

The transformer with multi-head self-attention mechanism

The function of this block is to learn the interdependence between every two amino acids in the amino acid sequence by calculating the correlation coefficients between them. In this way it captures not only the local information of amino acids in the sequence but also the long-distance dependency information between amino acids.

Because there may be multiple types of interactions between each pair of amino acids, we apply a multi-head self-attention mechanism to every amino acid of each protein sequence. We then extract the low-dimensional feature embedding of each amino acid by computing the different kinds of correlations between each pair of amino acids. As shown in Fig. 4 (for convenience, only a brief diagram is presented), each amino acid \(m_{i}\) obtains a query vector \(q_{i} \in R^{dq}\), a key vector \(k_{i} \in R^{dk}\), and a value vector \(v_{i} \in R^{dv}\). They are obtained by linear transformations of the amino acid features using trainable parameters \(W_{q} \in R^{\text {feature} \_ \text {in} \times {dq}}\), \(W_{k} \in R^{\text {feature} \_ \text {in} \times {dk}}\) and \(W_{v} \in R^{\text {feature} \_ \text {in} \times {dv}}\), which are shared across all amino acid nodes.

Since there are h types of correlations for each pair of amino acids \(\left( m_{i}, m_{j}\right)\), the embeddings of amino acid nodes are calculated using the multi-head self-attention mechanism. For each node \(m_{i}\) in the amino acid sequence, \(q_{i, 1}, \ldots , q_{i, h}\) are obtained by linear transformations of \(q_{i}\) using different weight matrices \(W_{q, 1}, \ldots , W_{q, h}\). Similarly, \(k_{i, 1}, \ldots , k_{i, h}\) are obtained by linear transformations of \(k_{i}\) using weight matrices \(W_{k, 1}, \ldots , W_{k, h}\), and \(v_{i, 1}, \ldots , v_{i, h}\) are obtained by linear transformations of \(v_{i}\) using weight matrices \(W_{v, 1}, \ldots , W_{v, h}\). For each type of correlation in each pair of amino acids, the coefficient \(\alpha _{i, j}^{h}\) can be calculated with the query-key dot product, expressed mathematically as follows:

$$\begin{aligned} \alpha _{i, j}^{h}=q_{i, h} \cdot k_{j, h}^{T}. \end{aligned}$$
(4)

These correlation coefficients are then normalized. The correlation coefficient \(\alpha _{i, j}^{h}\) is used as the weight of the value vectors, and a weighted sum is carried out to obtain the h embeddings of each amino acid \(m_{i}\), namely \(z_{i}^{1}, \ldots , {z}_{i}^{h}\). Finally, the embeddings of all heads are concatenated and linearly projected to produce the final output \(Z_{i}\) of an amino acid node in the current layer, expressed mathematically as:

$$\begin{aligned} Z_{i}={\text {concat}}\left( z_{i}^{1} \ldots z_{i}^{h}\right) \cdot W_{o} , \end{aligned}$$
(5)

where \(Z_{i} \in R^{\text{ feature }\_\text {out }}\), and

$$\begin{aligned} z_{i}^{h}=\sum _{j} {\text {softmax}}_{j}\left( \frac{\alpha _{i, j}^{h}}{\sqrt{d_{k}}}\right) \cdot v_{j}^{h}. \end{aligned}$$
(6)
Fig. 4 The illustration of the multi-head self-attention mechanism. The multi-head self-attention mechanism operates on each amino acid node. \(m_{i}\left( q_{i}, k_{i}, v_{i}\right)\) represents the query vector, key vector and value vector of amino acid node \(m_{i}\). \(m_{i}\left( q_{i, h}, k_{i, h}, v_{i, h}\right)\) represents the query vector, key vector, and value vector of the hth head for amino acid node \(m_{i}\). The symbol \(\oplus\) indicates element-wise addition

Protein multi-hop graph encoding (PMHGE)

In the process of protein multi-hop graph encoding, we use two GIN blocks to obtain two embeddings of the proteins. The input of the first GIN block is the original PPI network together with the sequence features, and the output is the first protein embedding \(E_1\). The input of the second GIN block is the two-hop PPI network (THPPI) together with the sequence features, and the output is the second protein embedding \(E_2\). In this way, not only the features of adjacent nodes but also the features of non-adjacent nodes can be aggregated directly.

Two-hop protein–protein interaction (THPPI) network

Through the THPPI network, the GIN block can learn new interactions between two proteins, augmenting the graph representation. We construct the THPPI network from the PPI network. Specifically, we generate the adjacency matrix \(A^{2}\) of the THPPI network \(G_{2}\) from the adjacency matrix A of the original graph \(G_{1}\), so as to obtain the structural information of the THPPI network. The mathematical formula is as follows:

$$\begin{aligned} A^{2}={\text {sign}}\left( A \cdot A^{T}\right) , \end{aligned}$$
(7)

where \({\text {sign}}(x)\) is the sign function: when \(x>0\) the value is 1, and when \(x \le 0\) the value is 0. It is worth noting that the new adjacency matrix contains self-connections. When the entry for protein nodes i and j is greater than 0, there is a two-hop relationship between them on the original graph. As in an ordinary PPI network, two protein nodes in the THPPI network \(G_{2}\) are connected by an edge to indicate their interaction, so graph neural network models can also perform message passing and aggregation operations on \(G_{2}\).
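Since A is a binary matrix, Eq. (7) reduces to thresholding the matrix product, for example:

```python
import numpy as np

def two_hop_adjacency(A):
    """Eq. (7): A2 = sign(A @ A.T) for a binary adjacency matrix A of shape (N, N).

    The result connects every pair of proteins that share at least one neighbor,
    and it contains self-connections because A @ A.T has a positive diagonal for
    every non-isolated node.
    """
    return ((A @ A.T) > 0).astype(A.dtype)
```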

Graph isomorphic network (GIN) block

The two GIN blocks in Fig. 2 are identical, as shown in Fig. 5 below. The input of this module is an \(L \times 256\) feature matrix, where L is the number of proteins and 256 is the number of features per protein. This feature matrix is obtained by concatenating the structural information of the PPI network with the amino acid sequence information. After two linear layers, two ReLU activation layers and normalization layers, the protein embedding is obtained.

Fig. 5 The illustration of the GIN block

Graph neural networks [24, 48,49,50] have made tremendous progress on a variety of extremely challenging tasks, and the graph isomorphism network (GIN) [51] has been proved to be among the most powerful GNN variants at present. Next, we introduce how the information of the PPI network and of the THPPI network is obtained through the GIN block.

Similar to other graph neural networks, the neighbor aggregation mechanism is the core of GIN. We iteratively update the feature of each node by aggregating the features of its neighbors. After k iterations, the structural information in the k-hop neighborhood is captured. The new feature vector \(g_{p}^{k}\) of node p can be expressed by the following mathematical formula:

$$\begin{aligned} g_{p}^{k}={\text {Update}}\left( \left\{ g_{p}^{k-1}, a_{p}^{k}\right\} \right) , a_{p}^{k}={\text {Agg}}\left( \left\{ g_{p^{\prime }}^{k-1} \mid p^{\prime } \in N(p)\right\} \right) . \end{aligned}$$
(8)

where N(p) is the set of all neighbor nodes of node p. We choose vector sum as the aggregation function and multi-layer perceptrons (MLP) as the update function.

Then, for the original PPI network \(G_{1}\), after the kth iteration, node p obtains the feature vector \(g_{p_{1}}^{k}\), which can be expressed as:

$$\begin{aligned} g_{p_{1}}^{k}=M L P^{k}\left( \left( 1+\epsilon _{1}^{k}\right) g_{p_{1}}^{k-1}+\sum _{p_{1}^{\prime } \in N(p)} g_{p_{1}^{\prime }}^{k-1}\right) , \end{aligned}$$
(9)

where \(p_{1}^{\prime }\) represents a first-order neighbor of node p, and \(\epsilon _{1}\) is a hyperparameter. Finally, we obtain the embedding of \(G_1\), which is called \(E_1\).

Similarly, for the THPPI network \(G_{2}\), after the kth iteration, node p obtains the feature vector \(g_{p_{2}}^{k}\), which can be expressed as:

$$\begin{aligned} g_{p_{2}}^{k}=M L P^{k}\left( \left( 1+\epsilon _{2}^{k}\right) g_{p_{2}}^{k-1}+\sum _{p_{2}^{\prime \prime } \in N(p)} g_{p_{2}^{\prime \prime }}^{k-1}\right) , \end{aligned}$$
(10)

where \(p_{2}^{\prime \prime }\) represents a second-order neighbor of node p, and \(\epsilon _{2}\) is a hyperparameter. On the original graph \(G_1\), \(p_{2}^{\prime \prime }\) is a second-order neighbor of p; on the constructed graph \(G_2\), \(p_{2}^{\prime \prime }\) is a first-order neighbor of p. Finally, we obtain the embedding of \(G_2\), which is called \(E_2\).
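With a dense adjacency matrix, one GIN update of Eqs. (9) and (10) can be sketched as below; the same layer type is applied once with the PPI adjacency \(A\) and once with the THPPI adjacency \(A^{2}\) (hidden sizes are illustrative):

```python
import torch
import torch.nn as nn

class DenseGINLayer(nn.Module):
    """One GIN update: MLP((1 + eps) * own feature + sum of neighbor features)."""
    def __init__(self, dim=256, eps=0.0):
        super().__init__()
        self.eps = nn.Parameter(torch.tensor(eps))
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, g, adj):          # g: (N, dim) node features, adj: (N, N) adjacency matrix
        agg = (1.0 + self.eps) * g + adj @ g
        return self.mlp(agg)

# E1 = DenseGINLayer()(features, A1)   # embedding from the original PPI network
# E2 = DenseGINLayer()(features, A2)   # embedding from the THPPI network
```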

Feature fusion

This operation integrates the embedding \(E_1\) of the original PPI network and the embedding \(E_2\) of the THPPI network into the same embedding space. We fuse the two embeddings to obtain the final embedding \(E_{\text{ out } }\) of all proteins, using element-wise summation as the fusion operation in this paper. Expressed mathematically:

$$\begin{aligned} E_{\text{ out } } =E_1 + E_2. \end{aligned}$$
(11)

Multi-label PPI prediction

We input the final embedding \(E_{\text{ out } }\) of the proteins into a fully connected (FC) layer classifier, which predicts the interactions between two proteins. We use a dot product operation to combine the embedding \(e_i\) of protein \(p_i\) and the embedding \(e_j\) of protein \(p_j\). The mathematical formula is as follows:

$$\begin{aligned} {\hat{y}}_{i j}=FC\left( e_i \cdot e_j\right) . \end{aligned}$$
(12)

In order to better supervise the training process of the model, we choose the multi-task binary cross-entropy loss function, whose mathematical formula is as follows:

$$\begin{aligned} L=\sum _{k=0}^{n}\left( \sum _{p_{i j} \in {\mathcal {I}}_{\text{ train } }}-y_{i j}^{k} \log {\hat{y}}_{i j}^{k}-\left( 1-y_{i j}^{k}\right) \log \left( 1-{\hat{y}}_{i j}^{k}\right) \right) , \end{aligned}$$
(13)

where \({\mathcal {I}}_{\text{ train } }\) represents the training set, and \(y_{i j}^{k}\) and \({\hat{y}}_{i j}^{k}\) denote the ground-truth label and predicted probability for class k, respectively.
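A sketch of this prediction head and loss is shown below. Interpreting the MUL block as an element-wise (Hadamard) product of the two embeddings is an assumption on our part; `BCEWithLogitsLoss` is used so that the sigmoid is applied internally:

```python
import torch
import torch.nn as nn

n_classes, emb_dim = 7, 256
fc = nn.Linear(emb_dim, n_classes)                  # FC classifier over the combined embedding
criterion = nn.BCEWithLogitsLoss(reduction='sum')   # multi-task binary cross-entropy, Eq. (13)

def predict_pair(e_i, e_j):
    """Eq. (12): combine the two protein embeddings, then classify into 7 interaction types."""
    return fc(e_i * e_j)                             # element-wise product is an assumption

# Training step for a batch of pairs (idx_i, idx_j) with 7-dim ground-truth vectors y_true:
# logits = predict_pair(E_out[idx_i], E_out[idx_j])
# loss = criterion(logits, y_true)
```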

Availability of data and materials

The datasets used in this study can be downloaded from the http://yellowstone.cs.ucla.edu/~muhao/pipr/SHS_ppi_beta.zip or from the https://drive.google.com/open?id=1y_5gje6AofqjrkMPY58XUdKgDuu1mZCh. The source code is available online at: https://github.com/666Coco123/LDMGNN.

Abbreviations

PPI:

Protein–protein interaction

THPPI:

Two-hop protein–protein interaction

DNA:

Deoxyribonucleic acid

Y2H:

Yeast two-hybrid

MS-PCI:

Mass spectrometric protein complex identification

TAP:

Tandem affinity purification

SVM:

Support vector machine

RF:

Random forest

LR:

Logistic regression

SAE:

Stacked auto encoder

GCN:

Graph convolutional networks

GIN:

Graph isomorphism networks

RNN:

Recurrent neural networks

ER:

Erdös–Rényi

DFS:

Depth first search

BFS:

Breadth first search

AchR:

Acetylcholine receptors

Ach:

Acetylcholine

PAASE:

Protein amino acid sequence encoding

MHSA:

Multi-head self-attention

PMHGE:

Protein multi-hop graph encoding

LDMGNN:

Long-distance dependency combined multi-hop graph neural networks

References

  1. Hu L, Wang X, Huang YA, Hu P, You ZH. A survey on computational models for predicting protein–protein interactions. Brief Bioinform. 2021;22:bbab036.

  2. Raimondi D, Simm J, Arany A, Moreau Y. A novel method for data fusion over entity-relation graphs and its application to protein–protein interaction prediction. Bioinformatics. 2021;37:2275–81.

  3. Meyer MJ, Das J, Wang X, Yu H. Instruct: a database of high-quality 3d structurally resolved protein interactome networks. Bioinformatics. 2013;29:1577–9.

  4. Hamp T. Sequence-based prediction of protein–protein interactions (2014)

  5. Huang K, Xiao C, Glass L, Zitnik M, Sun J. SkipGNN: predicting molecular interactions with skip-graph networks. Sci Rep. 2020;10:1–16.

  6. Berggrd T, Linse S, James P. Methods for the detection and analysis of protein–protein interactions. Proteomics. 2010;7(16):2833–42.

  7. Chen M, Ju JT, Zhou G, Chen X, Wang W. Multifaceted protein–protein interaction prediction based on siamese residual RCNN. Bioinformatics. 2019;35(14):305–14.

  8. Xia Y, Xia CQ, Pan X, Shen HB. GraphBind: protein structural context embedded rules learned by hierarchical graph neural networks for recognizing nucleic-acid-binding residues. Nucl Acids Res. 2021;49: e51.

  9. Liu L, Mamitsuka H, Zhu S. HPODNets: deep graph convolutional networks for predicting human protein-phenotype associations. Bioinformatics. 2021. https://doi.org/10.1093/bioinformatics/btab729.

  10. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K. Systematic identification of protein complexes in saccharomyces cerevisiae by mass spectrometry. Nature. 2002;415(6868):180–3.

  11. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A. 2001;98(8):4569–74.

  12. Fields S, Sternglanz R. The two-hybrid system: an assay for protein–protein interactions. Trends Genet. 1994;10(8):286.

  13. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002;415(6868):141–7.

  14. Bürckstümmer T, Bennett KL, Preradovic A, Schütze G, Hantschel O, Superti-Furga G, Bauch A. An efficient tandem affinity purification procedure for interaction proteomics in mammalian cells. Nat Methods. 2006;3(12):1013.

  15. Han J, Dupuy D, Bertin N, Cusick ME, Vidal M. Effect of sampling on topology predictions of protein-protein interaction networks. Nat Biotechnol. 2005;23(7):839–44.

  16. Piehler J. New methodologies for measuring protein interactions in vivo and in vitro. Curr Opin Struct Biol. 2005;15(1):4–14.

  17. Byron O, Vestergaard B. Protein-protein interactions: a supra-structural phenomenon demanding trans-disciplinary biophysical approaches. Curr Opin Struct Biol. 2015;35:76–86.

  18. Gingras AC, Gstaiger M, Raught B, Aebersold R. Analysis of protein complexes using mass spectrometry. Nat Rev Mol Cell Biol. 2007;8(8):645–54.

  19. Rivas J, Fontanillo C. Protein-protein interactions essentials: key concepts to building and analyzing interactome networks. PLoS Comput Biol. 2010;6(6):1000807.

  20. Sun T, Bo Z, Lai L, Pei J. Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinform. 2017;18(1):277.

  21. Hang L, Gong XJ, Yu H, Zhou C. Deep neural network based predictions of protein interactions using primary sequences. Molecules. 2018;23(8):1923.

  22. Liu L, Zhu X, Ma Y, Piao H, Peng J. Combining sequence and network information to enhance protein–protein interaction prediction. BMC Bioinform. 2020;21(Suppl 16):1–13.

  23. Yang F, Fan K, Song D, Lin H. Graph-based prediction of protein–protein interactions with attributed signed graph embedding. BMC Bioinform. 2020;21(1):1–16.

  24. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks (2016)

  25. Colonnese S, Petti M, Farina L, Scarano G, Cuomo F. Protein–protein interaction prediction via graph signal processing. IEEE Access. 2021;9:142681–92. https://doi.org/10.1109/ACCESS.2021.3119569.

  26. Lv G, Hu Z, Bi Y, Zhang S. Learning unknown from correlations: graph neural network for inter-novel-protein interaction prediction (2021)

  27. Zitnik M, Sosi R, Feldman MW, Leskovec J. Evolution of resilience in protein interactomes across the tree of life. Proc Natl Acad Sci. 2019;116(10):201818013.

  28. Kovács I, Luck K, Spirohn K, Wang Y, Pollis C, Schlabach S, Bian W, Kim DK, Kishore N, Hao T. Network-based prediction of protein interactions. Nat Commun. 2019;10(1):1–8.

  29. Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Al E. The genetic landscape of a cell. Science. 2010;327(5964):425–31.

  30. Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, Simonovic M, Doncheva NT, Morris JH, Bork P, Jensen LJ. String v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucl Acids Res. 2018;47:D607–13.

  31. Kingma D, Ba J. Adam: a method for stochastic optimization. Computer Science (2014)

  32. Zhang M, Zhou Z. A review on multi-label learning algorithms. IEEE Trans Knowl Data Eng. 2014;26(8):1819–37.

  33. Guo Y, Yu L, Wen Z, Li M. Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences. Nucl Acids Res. 2008;9:3025–30.

  34. Wong L, You ZH, Li S, Huang YA, Liu G. Detection of protein–protein interactions from amino acid sequences using a rotation forest model with a novel PR-LPQ descriptor. In: International conference on intelligent computing (2015)

  35. Yael S, Martin K, Roded S, Xue Y. A method for predicting protein–protein interaction types. PLoS ONE. 2014;9(3):90904.

  36. Du X, Sun S, Hu C, Yao Y, Yan Y, Zhang Y. Deepppi: boosting prediction of protein–protein interactions with deep neural networks. J Chem Inf Model. 2017;57:1499.

  37. Fu T, Lee W-C, Lei Z. Hin2vec: explore meta-paths in heterogeneous information networks for representation learning. In: Proceedings of the 2017 ACM on conference on information and knowledge management. CIKM ’17, pp. 1797–1806. Association for Computing Machinery, New York. 2017. https://doi.org/10.1145/3132847.3132953

  38. Wang D, Cui P, Zhu W. Structural deep network embedding. In: ACM SIGKDD international conference on knowledge discovery & data mining (2016)

  39. Lihong P, Wang C, Tian X, Zhou L, Li K. Finding LNCRNA-protein interactions based on deep learning with dual-net neural architecture. IEEE/ACM Trans Comput Biol Bioinform. 2021. https://doi.org/10.1109/TCBB.2021.3116232.

  40. Zhouzhou L, Wang Z, Tian X, Peng L. LPI-deepGBDT: a multiple-layer deep framework based on gradient boosting decision trees for lncRNA-protein interaction identification. BMC Bioinform. 2021;22(1):1–24.

  41. Chu Y, Kaushik AC, Wang X, Wang W, Zhang Y, Shan X, Salahub DR, Xiong Y, Wei D-Q. DTI-CDF: a cascade deep forest model towards the prediction of drug-target interactions based on hybrid features. Brief Bioinform. 2019;22(1):451–62. https://doi.org/10.1093/bib/bbz152.

  42. Velikovi P, Cucurull G, Casanova A, Romero A, Lió P, Bengio Y. Graph attention networks (2017)

  43. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. Attention is all you need. In: Proceedings of the 31st international conference on neural information processing systems. NIPS’17. Red Hook: Curran Associates Inc., pp. 6000–6010 (2017)

  44. Devlin J, Chang MW, Lee K, Toutanova K. Bert: pre-training of deep bidirectional transformers for language understanding (2018)

  45. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V. Roberta: a robustly optimized bert pretraining approach (2019)

  46. Nelson DL, Cox MM. Lehninger principles of biochemistry. 5th ed. New York: Worth Publishers; 2008.

  47. Tsetlin VI, Hucho F. Snake and snail toxins acting on nicotinic acetylcholine receptors: fundamental aspects and medical applications. FEBS Lett. 2004;557(1–3):9–13.

  48. Li Y, Tarlow D, Brockschmidt M, Zemel R. Gated graph sequence neural networks. Computer Science (2015)

  49. Hamilton WL, Ying R, Leskovec J. Inductive representation learning on large graphs. In: Proceedings of the 31st international conference on neural information processing systems. NIPS’17. Red Hook: Curran Associates Inc., pp. 1025–1035 (2017)

  50. Hamilton WL, Ying R, Leskovec J. Representation learning on graphs: methods and applications (2017)

  51. Xu K, Hu W, Leskovec J, Jegelka S. How powerful are graph neural networks? (2018)


Acknowledgements

Not applicable.

Funding

This work was supported by the Artificial Intelligence Program of Shanghai (2019-RGZN-01077), National Natural Science Foundation of China No. 12271362.

Author information

Authors and Affiliations

Authors

Contributions

WZ and CXH conceived the work. WZ and CX designed and completed the experiment. CXH, XFQ and ZSY conducted the guidance. YRL adjusted the format of the manuscript. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Changxiang He.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article


Cite this article

Zhong, W., He, C., Xiao, C. et al. Long-distance dependency combined multi-hop graph neural networks for protein–protein interactions prediction. BMC Bioinformatics 23, 521 (2022). https://doi.org/10.1186/s12859-022-05062-6


Keywords