Effective comparative analysis of protein-protein interaction networks by measuring the steady-state network flow using a Markov model
- Hyundoo Jeong1,
- Xiaoning Qian1 and
- Byung-Jun Yoon1Email author
https://doi.org/10.1186/s12859-016-1215-2
© The Author(s) 2016
Published: 6 October 2016
Abstract
Background
Comparative analysis of protein-protein interaction (PPI) networks provides an effective means of detecting conserved functional network modules across different species. Such modules typically consist of orthologous proteins with conserved interactions, which can be exploited to computationally predict the modules through network comparison.
Results
In this work, we propose a novel probabilistic framework for comparing PPI networks and effectively predicting the correspondence between proteins, represented as network nodes, that belong to conserved functional modules across the given PPI networks. The basic idea is to estimate the steady-state network flow between nodes that belong to different PPI networks based on a Markov random walk model. The random walker is designed to make random moves to adjacent nodes within a PPI network as well as cross-network moves between potential orthologous nodes with high sequence similarity. Based on this Markov random walk model, we estimate the steady-state network flow – or the long-term relative frequency of the transitions that the random walker makes – between nodes in different PPI networks, which can be used as a probabilistic score measuring their potential correspondence. Subsequently, the estimated scores can be used for detecting orthologous proteins in conserved functional modules through network alignment.
Conclusions
Through evaluations based on multiple real PPI networks, we demonstrate that the proposed scheme leads to improved alignment results that are biologically more meaningful at reduced computational cost, outperforming the current state-of-the-art algorithms. The source code and datasets can be downloaded from http://www.ece.tamu.edu/~bjyoon/CUFID.
Keywords
Background
Complex biological mechanisms such as signaling pathways and metabolic processes are governed and coordinated by numerous protein-protein interactions (PPIs). In addition to gene expression profiles, PPIs provide invaluable information that can be exploited to predict novel functional modules that perform critical biological functions. Thanks to recent advances in high-throughput protein interaction measurement techniques, PPI networks for different species have been archived in public databases, where the coverage and quality of these networks continue to improve over time. To translate these protein interaction data into useful biological knowledge – for example, that of the functional organization of cells and the detailed mechanisms of various cellular functions – we need effective means for analyzing the available PPI networks to accurately annotate the protein functions and to identify modules of proteins that may be potentially involved in crucial biological processes conserved across species.
Although one can study functions of proteins in PPI networks through biological experiments, it takes a large amount of valuable resources including labor, experimental cost, and time. As we have increasing evidence that various functional modules are conserved across different species, in which orthologous proteins and their interactions are preserved, comparative network analysis based on computational approaches would be one reasonable alternative that can save the cost and time of expensive biological experiments [1, 2]. Through comparative network analysis techniques such as network querying and network alignment [3], we can identify conserved functional modules as well as functionally similar proteins. By identifying corresponding protein nodes across networks, functional annotations of known proteins in well-studied species could be transferred to matching proteins in the PPI networks of less-studied species, which provides an efficient way of predicting potential functions of unknown proteins.
Illustration of a pairwise network alignment. The blue colored node in network X is deleted in network Y, and the pink colored node in network Y is inserted. Note that an inserted node in one network can be viewed as a deleted node in the other network
Several network alignment algorithms have been proposed to identify and predict orthologous protein pairs and conserved functional modules in different networks. The pioneering network alignment algorithms, PathBLAST [4] and NetworkBLAST [1, 5], focus on identifying highly conserved local complexes. However, PathBLAST can only search for linear paths, and NetworkBLAST constructs local alignments where one protein can have multiple matching partners, which may yield ambiguous alignments. IsoRank [6] estimates the node correspondence using a modification of the widely-known PageRank algorithm [7], where the basic idea is that two proteins have a high probability to be aligned if their neighboring proteins are also matched well. IsoRankN [8] extends IsoRank to align multiple PPI networks by adopting PageRank-Nibble [9], a spectral clustering method. IsoRank and IsoRankN are relatively time consuming and require a huge amount of memory as the size of the network increases. SMETANA [10] adopts a semi-Markov random walk (SMRW) model to estimate the node correspondence scores. These scores are updated through the intra-network and cross-network probabilistic consistency transformations, which are subsequently used to greedily build the network alignment. SMETANA-CSRW [11] estimates the node correspondence scores using a context-sensitive random walk (CSRW) model [12], which integrates the node similarity and the topological similarity between networks. Then, it constructs the final alignment based on a greedy approach. Although SMETANA-CSRW has slightly higher computational complexity as the network size increases, the utilization of the CSRW model has been shown to improve the accuracy of the alignment results. PINALOG [13] detects dense subnetworks as communities. Then, it constructs the initial community mapping and extends the alignment by mapping the neighboring nodes of the core proteins. HubAlign [14] first assigns weights to the nodes and edges in the PPI networks based on their topological importance (i.e., likelihood to be a hub), and then calculates the alignment score for every pair of proteins based on the global topological property and sequence information. Then, the algorithm constructs a global network alignment using a greedy seed-and-extension approach. Both PINALOG and HubAlign are more dependent on the topological similarity between networks than node similarity for obtaining the network alignment results, which may degrade the alignment accuracy when handling incomplete PPI networks or networks that may contain a relatively large number of false positive interactions.
In this paper, we propose a novel network alignment algorithm, called CUFID-align (Comparative network analysis Using the steady-state network Flow to IDentify orthologous proteins). The algorithm estimates the node correspondence by measuring the steady-state network flow of a random walk model over an integrated network of the given PPI networks. To accurately estimate the node correspondence based on the steady-state network flow, in a way that effectively captures the biological significance, we design the Markov random walk model such that the relative frequency that the random walker makes transitions between a pair of nodes in different PPI networks is proportional to the pairwise node similarity and the topological similarity between the surrounding network regions. The proposed scheme effectively captures the functional correspondence between nodes across different networks and the estimated node correspondence scores can lead to accurate network alignment results, as will be demonstrated through performance assessment based on real PPI networks.
Methods
Problem formulation
Suppose that we have a pair of PPI networks with the graph representations \({\mathcal {G}}_{X} = \left ({{\mathcal {U}},{\mathcal {D}}}\right)\) and \({\mathcal {G}}_{Y} = \left ({{\mathcal {V}},{\mathcal {E}}} \right)\), in which nodes represent proteins in each PPI network (i.e., \(u_{i} \in \mathcal {U}\) or \(v_{j} \in \mathcal {V}\)), and edges (\(d_{ij} \in \mathcal {D}\) or \(e_{ij} \in \mathcal {E}\)) indicate that the corresponding protein u i (or v i ) binds with the protein u j (or v j ). The edge weights in the PPI networks can indicate the strength or confidence of the interactions between the proteins. Given a pair of nodes across the PPI networks, we assume that the pairwise node similarity score s(u i ,v j ), \(u_{i} \in {\mathcal {U}}\) and \(v_{j} \in {\mathcal {V}}\) can be computed, for example, based on the sequence similarity between the proteins. In this study, we utilized BLAST bit scores between proteins as the pairwise node similarity scores. However, other types of similarity measurements (or their combinations) could be also used as the pairwise node similarity score in case such measurements can be easily obtained.
according to the MEA criterion. This MEA approach has been widely used by many multiple sequence alignment algorithms [15–19] and it has been shown to be useful for network alignment [10, 11] and network querying [20] as well.
Motivation and overview of the proposed method
Based on the above problem setting, to construct a confident network alignment, it is crucial to accurately estimate the pairwise node alignment probabilities. To obtain biologically meaningful alignment results, it is necessary that the pairwise node alignment probability is proportional to both the pairwise node similarity (i.e., sequence similarity) and the topological similarity between the subnetwork regions surrounding the nodes in the respective networks. This is based on the observation that orthologous proteins typically have a high level of compositional similarity and often display similar interaction patterns to their neighboring nodes [1, 2]. To accurately estimate the pairwise node alignment probability by effectively integrating these two different types of similarities, we propose to utilize the concept of steady-state network flow (i.e., the amount of ‘water’ that flows through a given channel in the network). Similar concepts have been previously adopted in various engineering applications to find the solutions to similar assignment problems. For example, in digital communication systems, the water-filling algorithm [21] is utilized to compute the optimal allocation of resources. Conceptually, it pours ‘water’ into an OFDM (orthogonal frequency division multiplexing) channel, and the ‘water level’ in the OFDM channel is utilized to find the optimal solution of the transmit power for each subcarrier. In digital image processing, the so-called watershed method [22] is used to find edges or contours of objects in the given image. The watershed method assumes that ‘water’ flows along the image gradient (e.g., intensity differences) and eventually reaches the local minima so that the ‘water level’ in the image provides the solution for the desired image segmentation.
Illustration for constructing the integrated network from a network pair. Dotted lines in the left figure indicate the pairwise node similarity, in which the thickness of each line is proportional to the pairwise node similarity. To construct the integrated network, we insert an edge between each pair of nodes across different networks if they have positive pairwise node similarity
Illustration of how node correspondence is measured based on the steady-state network flow. Straight lines represent protein-protein interactions, and dotted lines indicate pairwise node similarity. In this example, there is a larger steady-state network flow between the node pair (a, b) than the node pair (a, d) because the node pair (a, b) has higher pairwise node similarity and as the nodes have similar interacting nodes in the respective networks. In contrast, although the node pair (a, d) has positive pairwise node similarity, the neighboring nodes in the respective networks are not similar, which leads to a smaller steady-state network flow between the nodes (a, d) compared to the flow between (a, b)
Estimating the node correspondence through a Markov random walk model
In order to effectively estimate the node correspondence by integrating both the pairwise node similarity and topological similarity using a Markov random walk model, we first construct the integrated network G=(V,E) by combining \({\mathcal {G}}_{X}\) and \({\mathcal {G}}_{Y}\). Nodes of the integrated network G are the union of the nodes of \({\mathcal {G}}_{X}\) and \({\mathcal {G}}_{Y}\) (i.e., \(V={\left \{ {{\mathcal {U}},{\mathcal {V}}} \right \}}\)), and edges are the union of the edges of \({\mathcal {G}}_{X}\), \({\mathcal {G}}_{Y}\), and additional weighted edges \({\mathcal {F}}\), where \({\mathcal {F}} = \left \{ {s\left ({u_{i},v_{j}} \right)|u_{i} \in {\mathcal {U}},v_{j} \in {\mathcal {V}}} \right \}\) (i.e., \(E={\left \{ {{\mathcal {D}},{\mathcal {E}},{\mathcal {F}}} \right \}}\)). On this integrated network G, we allow the random walker to randomly move from the current node to any of its neighboring nodes at each time step. We define two different types of random moves based on their starting and ending points. First, if the random walker moves from a node in \({\mathcal {U}}\) to a node in \({\mathcal {U}}\) (or from a node in \({\mathcal {V}}\) to a node in \({\mathcal {V}}\)), we define it as an intra-network random move, as the random walk takes place in the same PPI network. Second, if the random walker moves from a node in \({\mathcal {U}}\) to a node in \({\mathcal {V}}\) (or from a node in \({\mathcal {V}}\) to a node in \({\mathcal {U}}\)), we refer to this as a cross-network random move. The intra-network random move mainly aims to capture the topological similarity between the two PPI networks while the cross-network random move aims to incorporate the pairwise node similarity between nodes that originally belong to different PPI networks.
where D X is a \(\left | {\mathcal {U}} \right | \times \left | {\mathcal {U}} \right |\) dimensional diagonal matrix such that \(D_{X} \left [ {i,i} \right ] = \sum \limits _{\forall j} {A_{X} \left [ {i,j} \right ]} \), and D Y is a \(\left | {\mathcal {V}} \right | \times \left | {\mathcal {V}} \right |\) dimensional diagonal matrix such that \(D_{Y} \left [i,i\right ] = \sum \limits _{\forall j} {A_{Y} \left [ {i,j}\right ]}\).
where S T is a \(\left |{\mathcal {V}}\right |\times \left |{\mathcal {U}}\right |\) dimensional matrix for the pairwise node similarity score, and \( {\textbf {D}_{S^{T}}}\phantom {\dot {i}\!}\) is a \(\left | {\mathcal {U}} \right | \times \left |{\mathcal {U}}\right |\) dimensional diagonal matrix such that \(D_{S^{T}}\left [{i,i}\right ] = \sum \limits _{\forall j} {s^{T}\left [{i,j}\right ]}\). In fact, the transition probability matrices P X→Y and P Y→X are normalized pairwise node similarity score matrices in the row-wise and column-wise manner.
Based on the proposed random walk protocol, the random walker transits more frequently between the pair of nodes (u i ,v j ) if the node u i and the node v j have a higher pairwise node similarity and also if their neighboring nodes also have higher pairwise node similarity (i.e., higher topological similarity). So, as a result, the random walker will spend more time on an edge that connects a pair of nodes (u i ,v j ), \(u_{i} \in {\mathcal {U}}\) and \(v_{j} \in {\mathcal {V}}\) as their overall similarity (or node correspondence) increases. Hence, we can effectively estimate the pairwise node alignment probability – which should be proportional to the desired node correspondence – by measuring the steady-state network flow through each edge (u i ,v j ), \(u_{i} \in {\mathcal {U}}\) and \(v_{j} \in {\mathcal {V}}\).
Illustration of the steady-state network flow. Note that the red colored arrows indicate the intra-network random moves, while the blue colored arrows represent the cross-network random moves
where C is a \(\left |{\mathcal {U}}\right |\times \left |{\mathcal {V}}\right |\) dimensional matrix for the steady-state network flow (i.e., pairwise node correspondence scores), π X is a \(\left |{\mathcal {U}}\right |\times \left |{\mathcal {U}}\right |\) dimensional diagonal matrix such that π X [i,i]=π(u i ), \(u_{i} \in {\mathcal {U}}\), and π Y is a \(\left | \mathcal {V}\right |\times \left |{\mathcal {V}} \right |\) dimensional diagonal matrix such that π Y [j,j]=π(v j ), \(v_{j} \in {\mathcal {V}}\).
After transforming and removing node correspondence scores lower than a specific threshold using Eq. (13), we obtain the final node correspondence scores \(\bar {\mathbf {C}}\), which will be used to construct the network alignment.
Constructing the pairwise network alignment
Finally, to find the optimal solution of Eq. (1) based on the derived pairwise node alignment probability, we construct the maximum weighted bipartite matching (MWBM) between \(\mathcal {G}_{X}\) and \(\mathcal {G}_{Y}\), using an efficient implementation of the MWBM algorithm included in the GAIMC library [23].
Results and discussion
Datasets and experimental set-up
We assessed the performance of CUFID-align based on the IsoBase dataset [24], which includes PPI networks of five different species: H. sapiens (human), M. musculus (mouse), D. melanogaster (fly), C. elegans (worm), and S. cerevisiae (yeast). PPI networks in the IsoBase dataset were constructed by integrating five different databases: BioGRID [25], DIP [26], HPRD [27], MINT [28], and IntAct [29]. In IsoBase, the H. sapiens network has 22,369 proteins and 43,757 interactions; the M. musculus network has 24,855 proteins and 452 interactions; the D. melanogaster network has 14,098 proteins and 26,726 interactions; the C. elegans network has 19,756 proteins and 5,853 interactions; and the S. cerevisiae network has 6,659 proteins and 38,109 interactions.
We assessed the quality of the predicted network alignment based on the following metrics: correct nodes (CN), specificity (SPE), gene ontology consistency (GOC) scores, conserved interactions (CI), conserved orthologous interactions (COI), and computation time. Note that CN, SPE, and GOC scores assess the biological significance of the alignment, and CI and COI assess the topological quality of the alignment. If the aligned nodes have the same functional annotation based on the KEGG Orthology (KO) group annotations [30], we considered the node alignment to be correct. CN counts the total number of correctly aligned nodes in a given network alignment. SPE is the relative ratio of the total number of correctly aligned node pairs to the total number of aligned node pairs.
where |c| is the number of proteins having the particular GO term c, and |r o o t(c)| is the total number of proteins under the root GO term of the particular GO term c, where three root GO terms are molecular function (MF, GO:0003674), biological process (BP, GO:0008150), and cellular component (CC, GO:0005575). Note that if at least one protein in the aligned protein pair does not have a functional annotation such as KO group annotations or GO terms, the aligned protein pair was removed before computing the performance metrics CN, SPE, and GOC scores.
We compared the performance of CUFID-align against a number of state-of-the-art alignment methods: IsoRank [6], SMETANA [10], SMETANA-CSRW [11], PINALOG [13], and HubAlign [14]. Additionally, to verify the effectiveness of the network-based approach over the conventional approach that uses sequence similarity alone, we compared the various network-based methods and with a method that finds the best mapping between networks solely based on the sequence similarity between the proteins. More specifically, given a network pair, we tried to predict the network alignment by using maximum weighted bipartite matching based on BLAST bit scores. Since both SMETANA and SMETANA-CSRW yield many-to-many mappings by default, we used the parameter n max =1 to obtain one-to-one mappings. Other than this, the default parameters were used in our experiments (i.e., α=0.9 and β=0.8). For HubAlign, we used the default parameters (i.e., λ=0.1, d=10, and α=0.7). For IsoRank, we set the parameter α=0.6 as recommend in the original paper. For CUFID-align, we set the parameter α=0.9 and β=90 percentile of the transformed correspondence score. We performed all experiments on a desktop computer equipped with a 3.2 GHz Intel i5 quad-core processor and 8 GB memory.
Performance assessment based on the IsoBase dataset
Pairwise alignment results for the IsoBase dataset. Protein functionality is determined based on the KEGG Orthology (KO) group annotations
Yeast – Fly | Yeast – Worm | Yeast – Human | Yeast – Mouse | Fly – Worm | ||||||
CN a | SPE b | CN | SPE | CN | SPE | CN | SPE | CN | SPE | |
CUFID-align | 1,708 | 0.748 | 1,548 | 0.834 | 1,330 | 0.736 | 1,304 | 0.794 | 2,616 | 0.873 |
SMETANA-CSRW | 1,610 | 0.757 | 1,426 | 0.850 | 1,224 | 0.733 | 1,192 | 0.802 | 2,444 | 0.870 |
SMETANA | 1,530 | 0.733 | 1,422 | 0.843 | 1,134 | 0.710 | 1,182 | 0.782 | 2,338 | 0.852 |
PINALOG | 1,368 | 0.722 | 640 | 0.737 | 1,100 | 0.682 | 76 | 0.400 | 672 | 0.689 |
HubAlign | 1,326 | 0.681 | 98 | 0.170 | 1,082 | 0.633 | 42 | 0.231 | 102 | 0.201 |
IsoRank | 1,414 | 0.712 | 650 | 0.703 | 1,142 | 0.702 | 76 | 0.369 | 918 | 0.818 |
BLAST-MWBMc | 1,712 | 0.776 | 1,544 | 0.836 | 1,334 | 0.768 | 1,280 | 0.792 | 2,680 | 0.885 |
Fly – Human | Fly – Mouse | Worm – Mouse | Worm – Human | Human – Mouse | ||||||
CN | SPE | CN | SPE | CN | SPE | CN | SPE | CN | SPE | |
CUFID-align | 2,528 | 0.754 | 2,364 | 0.788 | 1,818 | 0.807 | 1,858 | 0.791 | 5,178 | 0.983 |
SMETANA-CSRW | 2,358 | 0.763 | 2,146 | 0.768 | 1,610 | 0.811 | 1,722 | 0.803 | 5,002 | 0.978 |
SMETANA | 2,096 | 0.706 | 2,112 | 0.764 | 1,578 | 0.803 | 1,570 | 0.780 | 4,876 | 0.972 |
PINALOG | 1,172 | 0.604 | 118 | 0.567 | 66 | 0.458 | 482 | 0.677 | 282 | 0.972 |
HubAlign | 354 | 0.219 | 34 | 0.230 | 24 | 0.188 | 32 | 0.063 | 144 | 0.667 |
IsoRank | 1,736 | 0.725 | 146 | 0.566 | 72 | 0.456 | 644 | 0.793 | 286 | 0.979 |
BLAST-MWBM | 2,580 | 0.766 | 2,374 | 0.781 | 1,824 | 0.808 | 1,884 | 0.794 | 5,140 | 0.982 |
When it comes to the specificity of the alignment results, random walk based methods (CUFID-align, SMETANA-CSRW, and SMETANA) achieve relatively higher SPE compared to PINALOG and HubAlign. SPE of HubAlign appears to be more sensitive than the other methods with respect to the average degrees of the PPI networks. CUFID-align, SMETANA-CSRW, and SMETANA achieve similar SPE, often higher than those of PINALOG and HubAlign. This means that CUFID-align can in general more accurately align protein pairs that have the same functional annotations compared to PINALOG and HubAlign.
GOC scores of various pairwise network alignment algorithms
The above results show that CUFID-align can accurately predict matching proteins in different species that have similar functionalities, according to the functional annotations of proteins that are currently available. The results also imply that the proposed algorithm may provide a useful tool for predicting the functions of unknown proteins in less studied species through network alignment with species that have been better studied.
Number of conserved interactions (CI) obtained by different network alignment algorithms
Yeast – Fly | Yeast – Worm | Yeast – Human | Yeast – Mouse | Fly – Worm | |
CUFID-align | 1,721 | 486 | 3,421 | 56 | 347 |
SMETANA-CSRW | 337 | 110 | 2,468 | 31 | 107 |
SMETANA | 504 | 171 | 2,377 | 37 | 116 |
PINALOG | 2,982 | 1,000 | 6,231 | 225 | 666 |
HubAlign | 836 | 4,013 | 2,659 | 545 | 3,276 |
IsoRank | 1,436 | 764 | 3,165 | 176 | 558 |
BLAST-MWBM | 246 | 89 | 1,317 | 14 | 70 |
Fly – Human | Fly – Mouse | Worm – Mouse | Worm – Human | Human – Mouse | |
CUFID-align | 1,547 | 59 | 18 | 459 | 318 |
SMETANA-CSRW | 710 | 41 | 8 | 198 | 336 |
SMETANA | 965 | 50 | 16 | 283 | 337 |
PINALOG | 2,730 | 88 | 47 | 917 | 358 |
HubAlign | 9,317 | 491 | 459 | 3,743 | 532 |
IsoRank | 1,471 | 106 | 130 | 569 | 350 |
BLAST-MWBM | 441 | 12 | 2 | 138 | 253 |
Number of conserved orthologous interactions (COI) obtained by different network alignment algorithms
Yeast – Fly | Yeast – Worm | Yeast – Human | Yeast – Mouse | Fly – Worm | |
CUFID-align | 91 | 10 | 743 | 5 | 3 |
SMETANA-CSRW | 91 | 8 | 749 | 6 | 2 |
SMETANA | 86 | 10 | 705 | 8 | 4 |
PINALOG | 129 | 15 | 970 | 19 | 4 |
HubAlign | 57 | 2 | 634 | 15 | 5 |
IsoRank | 94 | 11 | 741 | 10 | 4 |
BLAST-MWBM | 74 | 8 | 556 | 4 | 2 |
Fly – Human | Fly – Mouse | Worm – Mouse | Worm – Human | Human – Mouse | |
CUFID-align | 202 | 13 | 1 | 21 | 111 |
SMETANA-CSRW | 196 | 10 | 1 | 26 | 139 |
SMETANA | 230 | 14 | 1 | 26 | 123 |
PINALOG | 180 | 22 | 2 | 27 | 134 |
HubAlign | 67 | 15 | 4 | 5 | 98 |
IsoRank | 185 | 17 | 1 | 18 | 142 |
BLAST-MWBM | 112 | 6 | 0 | 15 | 94 |
We also compared the network-based approaches with the sequence-similarity-based approach. As we can see in Table 1 and Fig. 5, a simple sequence-similarity-based approach can construct network alignments with high functional coherence, and that the node similarity score may provide useful guidelines for identifying orthologous proteins. However, this results should be taken with a grain of salt, since they are likely due to the fact that the current functional annotations of proteins are often based on sequence similarity between proteins. As shown in Tables 2 and 3, BLAST-MWBM – which uses BLAST bit score and MWBM without using any network information – can identify a much smaller number of CIs and COIs compared to the network-based methods. These results imply that strong dependence on sequence similarity for constructing a network alignment has the potential risk of getting biased results that may fail to capture important protein interactions that are conserved across different species, which may be critical in deciphering the underlying cellular mechanisms that involve those interactions. In contrast, network-based methods, including CUFID-align, that incorporate topological information for constructing network alignments can make accurate and balanced predictions that identify both orthologous proteins as well as conserved interactions. Our results clearly show the importance of effective integration of node similarity and topological similarity for effective comparative analysis of PPI networks.
CPU time of the tested network alignment algorithms (in seconds)
Yeast – Fly | Yeast – Worm | Yeast – Human | Yeast – Mouse | Fly – Worm | |
CUFID-align | 6.22 | 4.79 | 11.22 | 5.70 | 12.88 |
SMETANA-CSRW | 243.64 | 163.24 | 448.29 | 435.94 | 3,002.20 |
SMETANA | 6.65 | 5.81 | 11.47 | 9.12 | 26.11 |
HubAlign | 451.24 | 75.67 | 571.23 | 5.30 | 55.87 |
PINALOG | 997.85 | 1,654.66 | 1,984.03 | 2,202.15 | 2,141.00 |
IsoRank | 1,737.07 | 369.52 | 3401.29 | 64.47 | 181.55 |
Fly – Human | Fly – Mouse | Worm – Mouse | Worm – Human | Human – Mouse | |
CUFID-align | 18.93 | 18.38 | 25.71 | 28.36 | 68.59 |
SMETANA-CSRW | 6,104.70 | 6,420.80 | 6,383.70 | 6,084.10 | 4,9185.00 |
SMETANA | 63.43 | 60.85 | 53.24 | 56.28 | 454.11 |
HubAlign | 532.31 | 4.92 | 1.91 | 84.45 | 8.99 |
PINALOG | 3,127.35 | 1,611.39 | 101.86 | 6,764.56 | 4,864.16 |
IsoRank | 1433.27 | 37.77 | 16.92 | 326.79 | 77.64 |
Extension of CUFID-align for the alignment of multiple networks
In this work, we have focused on the steady-state network flow approach and its application to the pairwise network alignment problem. However, the problem of multiple network alignment has been gaining wide interest in the research community and its practical importance has been increasing as the number of available PPI networks for different species continue to increase. Although it is beyond the scope of the current paper, we expect the extension of CUFID-align for multiple network alignment will be relatively straightforward. First of all, to this aim, we can modify the transition probability matrix in Eq. (9) by concatenating the normalized adjacency matrices and node similarity score matrices for the multiple PPI networks to be aligned. Following the construction of this extended transition probability matrix, the steps for computing the node correspondence scores – shown in Eqs. (11) and (13) – can be modified by constructing diagonal matrices and inserting corresponding the matrices into the diagonal terms. The extended version of CUFID-align for multiple PPI network alignment is expected to have distinctive advantages over other existing multiple PPI network alignment algorithms. First, it may be able to estimate the ‘global’ node correspondence scores more accurately. Currently, most multiple PPI network alignment algorithms estimate the node correspondence scores for every PPI network pair in the interest of computational complexity. The estimated pairwise node correspondence scores are later updated based on additional transformations to make them more suitable for multiple network alignment. However, considering that the ultimate goal is in constructing the alignment of multiple networks, it would be preferable to estimate the node correspondence scores (or equivalently, node alignment probabilities) Pr[u i ∼v j |G] considering all networks, rather than just estimating \(\Pr \left [ {u_{i} \sim v_{j} |{\mathcal {G}}_{X},{\mathcal {G}}_{Y}} \right ]\) based on the given network pair, where \(u_{i} \in {\mathcal {G}}_{X} \), \(v_{j} \in {\mathcal {G}}_{Y} \), and G is the set of all PPI networks including \({\mathcal {G}}_{X}\) and \({\mathcal {G}}_{Y}\). Since the aforementioned extension of CUFID-align estimates the node correspondence scores based on an integrated network that combines all networks in G, it has the potential to accurately compute the posterior node-to-node alignment probability given all the networks. Computation of such ‘global’ node correspondence score may lead to improved multiple network alignment results. Second, the extended version of CUFID-align will still be computationally very efficient, as most steps in CUFID-align only require simple matrix operations even if extended to multiple networks. Finally, the extended approach will require relatively low computational resources (especially, in terms of memory). For example, suppose that there are N PPI networks, where the number of nodes in the i-th network G i is V i . To align the N PPI networks, IsoRankN will need the pairwise node correspondence scores for each of the \(\binom {N}{2}\) network pairs, where for each pair, the algorithm will need to construct a |V i ·V j |×|V i ·V j | dimensional matrix. However, CUFID-align can compute the global node correspondence scores by constructing a single \(\left | {\sum \limits _{i = 1}^{N} {V_{i}} } \right | \times \left | {\sum \limits _{i = 1}^{N} {V_{i}}}\right |\) dimensional matrix. We are currently working on extending CUFID-align for multiple network alignment.
Conclusions
In this paper, we proposed CUFID-align, a novel network alignment algorithm based on the concept of steady-state network flow of a Markov random walk model on an integrated network. Given a pair of PPI networks, CUFID-align constructs an integrated network and a Markov random walk model on the resulting network such that the steady-state network flow between a pair of nodes in different PPI networks increases when the nodes have higher pairwise node similarity (typically measured based on sequence similarity) and topological similarity. For this purpose, the Markov random walk model is designed to make more frequent transitions between protein nodes that have higher overall similarity, thereby making the steady-state network flow – which reflects the long-run behavior of the random walker – an effective measure of the correspondence between nodes that belong to different networks. As we have shown in our performance assessment results using real PPI networks in the IsoBase database, CUFID-align can accurately align proteins with identical functional annotations at a relatively low computational cost. Our results show that CUFID-align may provide an effective means of computationally annotating the functions of proteins through comparative analysis of PPI networks.
Declarations
Acknowledgements
The authors would like to thank to Dr. David Gleich for providing the GAIMC library used in the implementation of CUFID-align proposed in this paper. This work was supported in part by the National Science Foundation through the NSF Award CCF-1149544 and NSF Award CCF-1447235.
Declarations
This article has been published as part of BMC Bioinformatics Volume 17 Supplement 13, 2016: Proceedings of the 13th Annual MCBIOS conference. The full contents of the supplement are available online at http://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-17-supplement-13.
Funding
Publication cost for this article was funded by the National Science Foundation through the NSF Award CCF-1149544 and NSF Award CCF-1447235.
Availability of data and materials
The source code and datasets can be downloaded from http://www.ece.tamu.edu/~bjyoon/CUFID.
Authors’ contributions
Conceived the method: HJ, XQ, BJY. Developed the algorithm and performed the simulations: HJ. Analyzed the results and wrote the paper: HJ, XQ, BJY. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interest.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
- Sharan R, Suthram S, Kelley RM, Kuhn T, McCuine S, Uetz P, Sittler T, Karp RM, Ideker T. Conserved patterns of protein interaction in multiple species. Proc Natl Acad Sci U S A. 2005; 102(6):1974–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Sharan R, Ideker T. Modeling cellular machinery through biological network comparison. Nat Biotechnol. 2006; 24(4):427–33.View ArticlePubMedGoogle Scholar
- Yoon BJ, Qian X, Sahraeian SME. Comparative analysis of biological networks: Hidden Markov model and Markov chain-based approach. IEEE Signal Proc Mag. 2012; 1(29):22–34.View ArticleGoogle Scholar
- Kelley BP, Yuan B, Lewitter F, Sharan R, Stockwell BR, Ideker T. PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Res. 2004; 32(suppl 2):W83–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Kalaev M, Smoot M, Ideker T, Sharan R. NetworkBLAST: comparative analysis of protein networks. Bioinformatics. 2008; 24(4):594–6.View ArticlePubMedGoogle Scholar
- Singh R, Xu J, Berger B. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc Natl Acad Sci U S A. 2008; 105(35):12763–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Page L, Brin S, Motwani R, Winograd T. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford Digital Library Technologies Project; 1999.Google Scholar
- Liao CS, Lu K, Baym M, Singh R, Berger B. IsoRankN: spectral methods for global alignment of multiple protein networks. Bioinformatics. 2009; 25(12):i253–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Andersen R, Chung F, Lang K. Local graph partitioning using PageRank vectors. Proc IEEE Foundations of Computer Science; 2006, pp. 475–86.Google Scholar
- Sahraeian SME, Yoon BJ. SMETANA: accurate and scalable algorithm for probabilistic alignment of large-scale biological networks. PLoS ONE. 2013; 8(7):e67995.View ArticlePubMedPubMed CentralGoogle Scholar
- Jeong H, Yoon BJ. Accurate multiple network alignment through context-sensitive random walk. BMC Syst Biol. 2015; 9(Suppl 1):S7.View ArticlePubMedPubMed CentralGoogle Scholar
- Jeong H, Yoon BJ. Effective estimation of node-to-node correspondence between different graphs. IEEE Signal Proc Lett. 2015; 22(6):661–5.View ArticleGoogle Scholar
- Phan HT, Sternberg MJ. PINALOG: a novel approach to align protein interaction networks–implications for complex detection and function prediction. Bioinformatics. 2012; 28(9):1239–45.View ArticlePubMedPubMed CentralGoogle Scholar
- Hashemifar S, Xu J. HubAlign: an accurate and efficient method for global alignment of protein–protein interaction networks. Bioinformatics. 2014; 30(17):i438–44.View ArticlePubMedPubMed CentralGoogle Scholar
- Do CB, Mahabhashyam MS, Brudno M, Batzoglou S. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res. 2005; 15(2):330–40.View ArticlePubMedPubMed CentralGoogle Scholar
- Roshan U, Livesay DR. ProbAlign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics. 2006; 22(22):2715–21.View ArticlePubMedGoogle Scholar
- Sahraeian SME, Yoon BJ. PicXAA: greedy probabilistic construction of maximum expected accuracy alignment of multiple sequences. Nucleic Acids Res. 2010; 38(15):4917–28.View ArticlePubMedPubMed CentralGoogle Scholar
- Sahraeian SME, Yoon BJ. PicXAA-R: efficient structural alignment of multiple RNA sequences using a greedy approach. BMC bioinforma. 2011; 12:1.View ArticleGoogle Scholar
- Sahraeian SME, Yoon BJ. PicXAA-Web: a web-based platform for non-progressive maximum expected accuracy alignment of multiple biological sequences. Nucleic Acids Res. 2011; 39:W8–12.View ArticlePubMedPubMed CentralGoogle Scholar
- Sahraeian SME, Yoon BJ. RESQUE: Network reduction using semi-Markov random walk scores for efficient querying of biological networks. Bioinformatics. 2012; 28(16):2129–36.View ArticlePubMedGoogle Scholar
- Cover TM, Thomas JA. Elements of information theory. Hoboken: John Wiley & Sons; 2012.Google Scholar
- Vincent L, Soille P. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Trans Pattern Anal Mach Intell. 1991; 13(6):583–98.View ArticleGoogle Scholar
- Gleich D. GAIMC: graph algorithms in Matlab code. Matlab Toolbox. 2009. https://github.com/dgleich/gaimc. Accessed 25 May 2016.
- Park D, Singh R, Baym M, Liao CS, Berger B. IsoBase: a database of functionally related proteins across PPI networks. Nucleic Acids Res. 2011; 39(suppl 1):D295–300.View ArticlePubMedGoogle Scholar
- Breitkreutz BJ, Stark C, Reguly T, Boucher L, Breitkreutz A, Livstone M, Oughtred R, Lackner DH, Bähler J, Wood V, et al. The BioGRID interaction database 2008 update. Nucleic Acids Res. 2008; 36(suppl 1):D637–40.PubMedGoogle Scholar
- Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004; 32(suppl 1):D449–51.View ArticlePubMedPubMed CentralGoogle Scholar
- Prasad TK, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, et al. Human protein reference database 2009 update. Nucleic Acids Res. 2009; 37(suppl 1):D767–72.View ArticleGoogle Scholar
- Ceol A, Aryamontri AC, Licata L, Peluso D, Briganti L, Perfetto L, Castagnoli L, Cesareni G. MINT, the molecular interaction database 2009 update. Nucleic Acids Res. 2009; 38:D532–39.View ArticlePubMedPubMed CentralGoogle Scholar
- Aranda B, Achuthan P, Alam-Faruque Y, Armean I, Bridge A, Derow C, Feuermann M, Ghanbarian A, Kerrien S, Khadake J, et al.The IntAct molecular interaction database in. Nucleic Acids Res. 2010; 38(suppl 1):D525–31.View ArticlePubMedGoogle Scholar
- Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000; 28:27–30.View ArticlePubMedPubMed CentralGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene Ontology: tool for the unification of biology. Nat Genet. 2000; 25:25–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Shih YK, Parthasarathy S. Identifying functional modules in interaction networks through overlapping Markov clustering. Bioinformatics. 2012; 28(18):i473–9.View ArticlePubMedPubMed CentralGoogle Scholar