AligNet: alignment of protein-protein interaction networks

Abstract

Background

All molecular functions and biological processes are carried out by groups of proteins that interact with each other. Metaproteomic data continuously generates new proteins whose molecular functions and relations must be discovered. Protein-protein interaction networks (PPIN) are a widely accepted structure to model functional relations between proteins, and their analysis and alignment have become a key ingredient in the study and prediction of protein-protein interactions, protein function, and evolutionarily conserved assembly pathways of protein complexes. Several PPIN aligners have been proposed, but attaining the right balance between network topology and biological information is one of the most difficult and important points in the design of any PPIN alignment algorithm.

Results

Motivated by the challenge of well-balanced and efficient algorithms, we have designed and implemented AligNet, a parameter-free pairwise PPIN alignment algorithm aimed at bridging the gap between topologically efficient and biologically meaningful matchings. A comparison of the results obtained with AligNet and with the best aligners shows that AligNet indeed achieves a good balance between topological and biological matching.

Conclusion

In this paper we present AligNet, a new pairwise global PPIN aligner that produces biologically meaningful alignments, by achieving a good balance between structural matching and protein function conservation, and more efficient computations than state-of-the-art tools.

Background

One of the most difficult problems in systems biology is to discover protein-protein interactions as well as their associated functions. The alignment and analysis of protein-protein interaction networks (PPIN) has become a key ingredient to obtain functional orthologs as well as evolutionarily conserved assembly pathways of protein complexes. With this purpose, several pairwise alignment algorithms have been proposed in the last 15 years. The early aligners [1–5] were aimed at finding local alignments between regions with similar structure in the networks under comparison. But since the alignments between regions of the pair of PPIN could be mutually inconsistent, it could be impossible to merge them into an alignment of the whole networks. In contrast, a global alignment algorithm is aimed at finding the best overall alignment between whole PPIN [6]. Several such global PPIN aligners have been proposed in recent years [4, 7–11].

Most PPIN aligners are based on the idea that “two nodes are similar when their corresponding neighbors are so,” taking into account both the network topology and the biological features of the proteins in the definition of “similarity.” The problem is that attaining the right balance between network topology and biological information is one of the most difficult and important points in any PPIN alignment algorithm. As shown in [12, 13], when an alignment process is guided by topological information only, it produces alignments with a high topological coherence but a low biological coherence, while when it is guided by sequence information only, the resulting alignments have a high biological coherence but a low topological coherence. This becomes especially inconvenient in those aligners where the user has to choose the value of a parameter that specifies the desired balance between the topological and the sequence similarities. In addition, most aligners are not computationally efficient.

Motivated by this lack of well-balanced and efficient algorithms, we have designed AligNet, a parameter-free pairwise PPIN alignment algorithm aimed at bridging the gap between topologically efficient and biologically meaningful matchings. The overall idea of the algorithm is to obtain many local alignments that are combined and extended into a meaningful global alignment. The final alignment captures the benefits of considering both types of alignments: with the local alignments we capture the topological similarity between the networks and we speed up the running time of the algorithm, while with the final global alignment we resolve the inconsistencies among the local alignments and yield an overall alignment of the pair of input PPIN. AligNet has been implemented in R [14], and the implementation is freely available from https://github.com/biocom-uib/AligNet.

A comparison of the results obtained with AligNet and with the best aligners assessed in [12, 13] shows that AligNet indeed achieves a good balance between topological and biological matching. In the tests reported in this paper, AligNet obtained high functional consistency scores between aligned proteins in most of the alignments and also a reasonable fraction of conserved interactions. In addition, AligNet, together with HubAlign [8], had the best running times among all the aligners considered in our tests.

Methods

In this paper, by a graph we understand an undirected graph, that is, a structure G=(V,E) with V a finite set of nodes and E a family of 2-element subsets {u,v} of V called the edges of the graph. A PPIN is modelled in a natural way as a graph, with its nodes representing the proteins and its edges, their interactions.

We now introduce some notation. Let \(G=(V,E)\) be a graph. We say that an edge \(e=\{u,v\}\) is incident to \(u\) and \(v\). The nodes \(v\) such that \(\{u,v\}\in E\) are the neighbors of \(u\), and they form the set \(N_{G}(u)\). The degree \(\deg(u)\) of a node \(u\in V\) is the number of edges incident to it. A path between two nodes \(u,v\in V\) is a sequence of pairwise different edges \(\{u,u_{1}\},\{u_{1},u_{2}\},\ldots,\{u_{k-1},u_{k}\},\{u_{k},v\}\) such that the first and last edges are incident to \(u\) and \(v\), respectively, and every pair of consecutive edges share a node (different from \(u\) and \(v\), in the case of the first and last edges, respectively). The length of a path is the number of edges forming it, and its intermediate nodes are \(u_{1},\ldots,u_{k}\). Two nodes are connected when there exists a path between them. For every pair of connected nodes \(u,v\in V\), their distance \(d_{G}(u,v)\) in \(G\) is the length of a shortest path connecting them. The diameter \(D(G)\) of \(G\) is the maximum distance between any two connected nodes in \(G\). The cardinality of a set \(X\) is denoted by \(|X|\).
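As an illustration of these notions, the following Python sketch computes neighbor sets, distances and the diameter with a breadth-first search. It is not taken from the AligNet implementation (which is written in R); the function names and the toy graph are hypothetical.

```python
from collections import deque

def neighbors(edges):
    """Build N_G(u) for every node from a list of 2-element edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def distances_from(adj, source):
    """Breadth-first search: d_G(source, v) for every v connected to source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """D(G): maximum distance between any two connected nodes."""
    return max(max(distances_from(adj, u).values()) for u in adj)

# Hypothetical toy graph: a path a-b-c-d plus a separate edge x-y.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]
adj = neighbors(toy_edges)
print(len(adj["b"]))                   # degree of b: 2
print(distances_from(adj, "a")["d"])   # d_G(a, d): 3
print(diameter(adj))                   # D(G): 3
```

Nodes in different connected components (such as a and x above) simply do not appear in each other's distance table, which matches the convention that the distance is only defined for connected nodes.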

AligNet receives as input two graphs \(G=(V,E)\) and \(G'=(V',E')\) representing two PPIN (in particular, each node is injectively identified with a protein) and it produces, as output, a similarity score for them and a local and a global alignment between them. Figure 1 shows the pipeline of AligNet. The main steps of AligNet, described below, are:

  1. The computation of overlapping clusterings \(C(G)\) and \(C(G')\), respectively, of the input networks \(G\) and \(G'\).

  2. The computation of alignments between pairs of clusters in \(C(G)\) and \(C(G')\).

  3. The computation of a matching between \(C(G)\) and \(C(G')\).

  4. The computation of a local alignment of the input networks \(G\) and \(G'\).

  5. The extension of this local alignment to a meaningful global alignment.

Fig. 1
Pipeline of the AligNet algorithm

Step 1. Overlapping clusterings. The first step in AligNet consists in computing an overlapping clustering of each input network. These clusterings are based on the following similarity score s(u,v) between pairs of proteins (nodes) u,v in a PPIN G: If u,v are not connected by a path, then s(u,v)=0, and if they are connected,

$$s(u,v)= \frac{B(u,v)+\frac{D(G)+1-d_{G}(u,v)}{D(G) +1}}{2}, $$

where B(u,v) is the normalized bit score of the proteins u and v, that is, the rescaled version of their alignment score obtained with BLAST+, which is independent of the size of the search space [15]. The intuition behind this similarity score is that two proteins are similar if they have similar sequences and they are relatively close to each other in the graph.
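The similarity score can be written as a small Python function. This is only a sketch under the assumption that the normalized bit score B(u,v) lies in [0,1] and that non-connected pairs are signalled with a distance of `None`; the function name and argument names are hypothetical.

```python
def similarity(bit_score, dist, diam):
    """s(u,v) from the normalized bit score B(u,v), the graph distance
    d_G(u,v) (None when u and v are not connected) and the diameter D(G)."""
    if dist is None:
        return 0.0
    closeness = (diam + 1 - dist) / (diam + 1)
    return (bit_score + closeness) / 2

print(similarity(0.5, 1, 3))     # adjacent nodes: (0.5 + 3/4) / 2 = 0.625
print(similarity(0.5, None, 3))  # not connected: 0.0
```

The topological term equals 1 when the distance is 0 and decreases linearly to 1/(D(G)+1) at the maximum distance, so sequence similarity and graph proximity contribute equally to the average.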

To obtain the overlapping clustering of an input network, we define a cluster centered at each node. To avoid the choice of a fixed and arbitrary cluster size, we consider the similarity score distribution and define the cluster centered at each node as follows. Let α be the third quartile of the distribution of the similarity score values of pairs of nodes, so that only 25% of the pairs of nodes (u,v) are such that s(u,v)>α. Then, for every node \(u\in V\), the cluster \(C_{u}\) in \(G\) centered at \(u\) is

$$C_{u}=\left\{ v\in V \mid s(u,v) > \alpha\right\}. $$

Let \(C(G)=\left\{C_{u}\mid u\in V\right\}\) and \(C(G')=\left\{C_{u'}\mid u'\in V'\right\}\).
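A minimal Python sketch of this clustering step follows. Several choices are assumptions and not taken from the source: scores are given for unordered pairs of distinct nodes, the third quartile is computed with the "inclusive" interpolation method, and the center is always placed in its own cluster (which presumes s(u,u)>α). The node names and scores are hypothetical.

```python
from itertools import combinations
from statistics import quantiles

def overlapping_clustering(nodes, s):
    """Return (alpha, {u: C_u}); s maps frozenset({u, v}) to s(u, v)."""
    scores = [s[frozenset(p)] for p in combinations(nodes, 2)]
    alpha = quantiles(scores, n=4, method="inclusive")[2]  # third quartile
    clusters = {
        u: {u} | {v for v in nodes if v != u and s[frozenset((u, v))] > alpha}
        for u in nodes
    }
    return alpha, clusters

# Hypothetical similarity scores on a 4-node network.
nodes = ["a", "b", "c", "d"]
pair_scores = {("a", "b"): 0.9, ("a", "c"): 0.2, ("a", "d"): 0.4,
               ("b", "c"): 0.6, ("b", "d"): 0.1, ("c", "d"): 0.8}
s = {frozenset(p): value for p, value in pair_scores.items()}
alpha, clusters = overlapping_clustering(nodes, s)
print(clusters)
```

In this toy input only the pairs {a,b} and {c,d} score above the third quartile, so the clusters centered at a and b coincide, as do those centered at c and d; there is still one cluster per node.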

Figure 2 displays two toy PPIN that will be used as a running example throughout this section. The first network consists of 8 nodes and 9 edges, while the second network consists of 9 nodes and 17 edges. Figure 3 displays their overlapping clusterings: since one cluster is centered at each node, the first network has 8 clusters and the second one has 9.

Fig. 2
a A subnetwork of the Drosophila melanogaster PPI network. b A subnetwork of the Homo sapiens PPI network

Fig. 3
Overlapping clusterings. This figure shows the overlapping clusterings of the PPIN in Fig. 2 obtained by AligNet. We can see here the 8 clusters of the network on the left in Fig. 2, and the 9 clusters of the network on the right in Fig. 2. The center of every cluster is highlighted in blue. Since we have considered two small pieces of PPIN, some clusters cover the entire piece of network: this is the case for the first cluster on the left and for the second cluster on the right. Notice that we obtain the whole piece of network when we consider the cluster of a node that is in the center of the network

Step 2. Alignments between pairs of clusters. In this second step, AligNet computes an alignment between every pair of clusters \(C_{u}\in C(G)\) and \(C_{u'}\in C(G')\) such that \(B(u,u')>0\). These alignments define an alignment score between every such pair of clusters that will be used in the third step to compute a matching between \(C(G)\) and \(C(G')\).

Formally, for every \(u\in V\) and \(u'\in V'\) such that \(B(u,u')>0\), the alignment between \(C_{u}\in C(G)\) and \(C_{u'}\in C(G')\) is obtained as follows:

  • Match \(u\) with \(u'\). Set \(L_{u,u'}=\left\{(u,u')\right\}\), \(L_{u,u'}^{(1)}=\{u\}\) and \(L_{u,u'}^{(2)}=\{u'\}\).

  • For every \(v\in C_{u}\cap N_{G}(u)\) and for every \(v'\in C_{u'}\cap N_{G'}(u')\), let

    $$F\left(v,v'\right)=| \deg(v)-\deg(v')| - B\left(v,v'\right) +1. $$

    Compute a matching \(M_{u,u'}\subseteq (C_{u}\cap N_{G}(u))\times (C_{u'}\cap N_{G'}(u'))\) that minimizes \(\sum_{(v,v')\in M_{u,u'}} F(v,v')\) using the Hungarian algorithm [16]. Sort the pairs in \(M_{u,u'}\) in decreasing order of their F value, and concatenate them to \(L_{u,u'}\). Add their first coordinates to \(L_{u,u'}^{(1)}\) and their second coordinates to \(L_{u,u'}^{(2)}\).

  • Iterate the previous step, replacing \((u,u')\) by the rest of the pairs in \(L_{u,u'}\) and removing from \(C_{u}\) and \(C_{u'}\) the nodes already aligned.

    More specifically, in the k-th iteration, take the k-th element \((v_{0},v_{0}')\) of \(L_{u,u'}\). For every \(w\in (C_{u}\setminus L_{u,u'}^{(1)})\cap N_{G}(v_{0})\) and every \(w'\in (C_{u'}\setminus L_{u,u'}^{(2)})\cap N_{G'}(v_{0}')\), compute \(F(w,w')\). Then, compute a matching

    $$ M_{v_{0},v_{0}'}\subseteq \left((C_{u}\setminus L_{u,u'}^{(1)})\cap N_{G}(v_{0})\right)\times \left((C_{u'}\setminus L_{u,u'}^{(2)})\cap N_{G'}(v_{0}')\right) $$

    that minimizes \(\sum_{(v,v')\in M_{v_{0},v_{0}'}} F(v,v')\). Sort the pairs forming \(M_{v_{0},v_{0}'}\) in decreasing order of their F value, and concatenate them to \(L_{u,u'}\). Add their first coordinates to \(L_{u,u'}^{(1)}\) and their second coordinates to \(L_{u,u'}^{(2)}\).

The resulting alignment \(L_{u,u'}\) defines a partial injective mapping \(\eta_{u,u'}: C_{u}\rightarrow C_{u'}\). The nodes in \(C_{u}\) that are matched to nodes in \(C_{u'}\) form the domain of the mapping \(\eta_{u,u'}\), which is denoted by \(Dom\, \eta_{u,u'}\). Figure 4 shows an example of the alignment of a pair of clusters: one cluster from the first network and another cluster from the second network. The general idea behind this alignment procedure is that \(u\) is matched to \(u'\) and then a node \(v\in C_{u}\) should be matched to a node \(v'\) in \(C_{u'}\) when they have similar sequences and similar degrees, provided that, furthermore, there exist paths connecting \(u\) with \(v\) and \(u'\) with \(v'\) such that their intermediate nodes are already aligned in sequential order along the paths. The alignment procedure gives priority to matching neighbors of nodes \(x,x'\) at the shortest possible distance from the respective cluster centers and with \(F(x,x')\) as large as possible among those pairs already matched at the same iterative step.
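A single matching step of this procedure can be sketched in Python as follows. To keep the example self-contained, the Hungarian algorithm is replaced by an exhaustive search over assignments, which is only feasible for tiny inputs; it is assumed that every node of the smaller side gets matched. The node names, degrees and bit scores are hypothetical.

```python
from itertools import permutations

def cost(deg_v, deg_w, bit_score):
    """F(v, v') = |deg(v) - deg(v')| - B(v, v') + 1."""
    return abs(deg_v - deg_w) - bit_score + 1

def min_cost_matching(left, right, F):
    """Exhaustive stand-in for the Hungarian algorithm on tiny inputs.
    Matches every node of the smaller side; F maps (v, v') to its cost."""
    if len(left) <= len(right):
        candidates = (list(zip(left, p)) for p in permutations(right, len(left)))
    else:
        candidates = (list(zip(p, right)) for p in permutations(left, len(right)))
    best = min(candidates, key=lambda m: sum(F[pair] for pair in m))
    # Pairs are appended to L_{u,u'} in decreasing order of F, as in the text.
    return sorted(best, key=lambda pair: F[pair], reverse=True)

# Hypothetical unaligned neighbors of u (left) and of u' (right).
deg = {"v1": 2, "v2": 3}
deg_prime = {"w1": 3, "w2": 2, "w3": 5}
B = {("v1", "w1"): 0.1, ("v1", "w2"): 0.8, ("v1", "w3"): 0.0,
     ("v2", "w1"): 0.7, ("v2", "w2"): 0.2, ("v2", "w3"): 0.3}
F = {(v, w): cost(deg[v], deg_prime[w], B[(v, w)]) for v in deg for w in deg_prime}
matching = min_cost_matching(list(deg), list(deg_prime), F)
print(matching)  # [('v2', 'w1'), ('v1', 'w2')]
```

Here v1 is paired with w2 and v2 with w1 because these pairs have equal degrees and the highest bit scores; w3 stays unmatched in this step. For clusters of realistic size, a polynomial-time assignment solver would be needed instead of the exhaustive search.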

Fig. 4
Alignment of a pair of clusters. This figure shows how AligNet aligns two clusters, which corresponds to Step 2 of our algorithm. The clusters in this example are, respectively, the first in the list of clusters of \(G\), shown on the left in Fig. 3, and the seventh in the list of clusters of \(G'\), shown on the right in Fig. 3. We show in the picture all the steps needed to align the cluster of \(G\) with the cluster of \(G'\). From top to bottom in this figure, we can see that AligNet first aligns the centers of the clusters, which are the nodes highlighted in blue. Then, AligNet aligns the neighbors of the centers (second row). Next, AligNet aligns the neighbors of the neighbors. In each step we show in a different colour the nodes that are aligned in the present step. Notice that, in this example, there are two nodes that remain unmatched

Step 3. Matching between families of clusters. Let

$$\mathcal{A}=\left\{\eta_{u,u'} \mid u \in V,\ u'\in V', B(u,u')>0\right\}$$

be the set of alignments obtained in step 2. The score of each \(\eta_{u,u'}\in \mathcal{A}\) is defined as

$$Score(\eta_{u,u'})=\frac{\sum_{v\in Dom \, \eta_{u,u'}}B(v,\eta_{u,u'}(v))}{|Dom \,\eta_{u,u'}|}+ \frac{|Dom \, \eta_{u,u'}|}{\max_{\eta_{w,w'}\in \mathcal{A}}|Dom \,\eta_{w,w'}|}. $$

This score assesses simultaneously the average similarity of the sequences of the proteins matched by \(\eta_{u,u'}\) and their number.
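The score can be sketched in Python as follows, representing a partial injective mapping as a dictionary. The function and argument names are hypothetical, and returning 0 for an empty mapping is an assumption made only to keep the function total (in step 2 the domain always contains the cluster center).

```python
def alignment_score(eta, B, max_domain_size):
    """Score of a partial injective mapping eta (dict v -> v').
    B maps (v, v') to the normalized bit score; max_domain_size is the
    largest |Dom eta_{w,w'}| over all alignments in the set A."""
    if not eta:
        return 0.0
    mean_bit_score = sum(B[(v, w)] for v, w in eta.items()) / len(eta)
    return mean_bit_score + len(eta) / max_domain_size

# Hypothetical alignment of two pairs, with a largest domain of 4 nodes.
eta = {"a": "x", "b": "y"}
B = {("a", "x"): 1.0, ("b", "y"): 0.5}
print(alignment_score(eta, B, 4))  # 0.75 + 2/4 = 1.25
```

Both summands lie in [0,1] when the bit scores are normalized, so the score ranges between 0 and 2.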

Once all these scores have been computed, AligNet obtains a matching between \(C(G)\) and \(C(G')\) by applying the maximum weighted bipartite matching algorithm to the bipartite graph whose nodes are the clusters in \(C(G)\) and \(C(G')\), whose edges connect pairs of clusters \(C_{u}\in C(G)\) and \(C_{u'}\in C(G')\) with \(B(u,u')>0\), and where the weight of the edge connecting \(C_{u}\) with \(C_{u'}\) is the score \(Score(\eta_{u,u'})\). We shall denote by \(\mathcal{C}\) the set of partial injective mappings \(\eta_{u,u'}\) corresponding to pairs of clusters \((C_{u},C_{u'})\) that are matched by this matching. Figure 5 shows the matching obtained in this step between the families of clusters in Fig. 3.

Fig. 5
Alignment of the clusterings. This figure shows the final assignment (same colour) between the clusters in Fig. 3 produced by AligNet in Step 3. Each of the eight clusters obtained from \(G\) is aligned to one, and only one, of the nine clusters obtained from \(G'\). Hence, one cluster from \(G'\) remains unmatched: the second cluster in the third row on the right in Fig. 3. In this figure, we show the clusters from \(G\) on the left and their corresponding images from \(G'\) on the right

Step 4. Local alignment of PPIN. In this step, AligNet produces a local alignment between \(G\) and \(G'\) from the matching between \(C(G)\) and \(C(G')\) obtained in the previous step.

The main idea is to define this alignment by merging the partial injective mappings \(\eta_{u,u'}\in \mathcal{C}\). The problem is that these mappings may be inconsistent. A first approach to overcome this problem would be to consider the weighted bipartite hypergraph with set of nodes \(V\cup V'\) and where every mapping \(\eta_{u,u'}\) defines a hyperarc with source its domain, target its image, and weight \(Score\left(\eta_{u,u'}\right)\), and to solve on it the weighted bipartite hypergraph assignment problem, whose solution would provide a well-defined local alignment of the input networks.

However, in order to decrease the computation time of AligNet, we do not define this hypergraph from the whole \(\mathcal{C}\), but just from a subset \(\mathcal{R}\) of best-scored alignments built recursively as follows. Starting with \(\mathcal{R}=\emptyset\), AligNet adds to \(\mathcal{R}\) at each step a mapping \(\eta_{w_{0},w_{0}'}\in \mathcal{C}\) with \(w_{0}\) not belonging to the union of the domains of the mappings \(\eta_{w,w'}\) already in \(\mathcal{R}\) and with maximum \(Score\left(\eta_{w_{0},w_{0}'}\right)\) among all such mappings. AligNet iterates this procedure until every node in \(\bigcup_{\eta_{u,u'}\in \mathcal{C}} Dom\, \eta_{u,u'}\) belongs to the domain of some mapping in \(\mathcal{R}\). In Fig. 6 we give the subset \(\mathcal{R}\) of \(\mathcal{C}\) for the networks in our running example.
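The greedy construction of the subset of best-scored alignments can be sketched in Python as follows. The data layout (a dictionary keyed by the pair of cluster centers) and the function name are hypothetical, and the loop stops when no eligible mapping remains, which is an assumed reading of the stopping condition stated above.

```python
def select_best_alignments(C):
    """C maps (w0, w0') to (eta, score), eta being a dict v -> v'.
    Returns the keys of the mappings chosen for R, in selection order."""
    covered, chosen = set(), []
    while True:
        # Eligible mappings: center w0 not yet in the domain of a chosen mapping.
        eligible = [k for k in C if k[0] not in covered and k not in chosen]
        if not eligible:
            break
        best = max(eligible, key=lambda k: C[k][1])
        chosen.append(best)
        covered |= set(C[best][0])  # add the domain of the chosen mapping
    return chosen

# Hypothetical matched cluster alignments with their scores.
C = {
    ("a", "x"): ({"a": "x", "b": "y"}, 1.5),
    ("b", "y"): ({"b": "y", "c": "z"}, 1.2),
    ("c", "z"): ({"c": "z"}, 0.9),
}
print(select_best_alignments(C))  # [('a', 'x'), ('c', 'z')]
```

In this toy input the alignment centered at b is skipped because b is already covered by the best-scored alignment, so only two hyperarcs would enter the hypergraph instead of three.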

Fig. 6
Appropriate set of alignments. This figure shows how AligNet constructs the appropriate set of alignments used to obtain the final local alignment, which corresponds to Step 4 of our aligner. First of all, a maximum score alignment between a pair of clusters is chosen: in this case, it is the matching between the clusters in Fig. 4. Both clusters are shown in the second row of this figure. The shadowed nodes are the nodes that are not aligned. Next, a maximum score alignment of a pair of clusters with source a cluster centered at a shadowed node is chosen: it turns out to be the one in the second row in Fig. 5 and it is shown in the third row in this figure. Finally, the last alignment to be included in the appropriate set of alignments must be the one with source cluster centered at the remaining shadowed node: this corresponds to the alignment in the last row in Fig. 5, shown at the bottom of this figure. Notice that in the end, that is, when we consider the three alignments together, there are four nodes in the source network with inconsistent assignments

Then, AligNet obtains, from the directed hypergraph with nodes \(V\cup V'\) and hyperarcs defined by the mappings \(\eta_{u,u'}\in \mathcal{R}\) as explained above, a well-defined local alignment between \(G\) and \(G'\) as a solution of the corresponding weighted bipartite hypergraph assignment problem [17]. Figure 7 shows the local alignment obtained from the hypergraph corresponding to Fig. 6.

Fig. 7
Local alignment. This figure shows the local alignment of the original networks obtained by AligNet in its fourth step, once the inconsistent assignments have been resolved. The coherent assignment of nodes is obtained as the solution to the weighted bipartite hypergraph assignment problem, for the hypergraph associated with the appropriate set of alignments described in Fig. 6. In this case, the hypergraph has three hyperarcs, corresponding to the three alignments considered in the appropriate set of alignments

Step 5. Global meaningful alignment of PPIN. In order to extend the local alignment produced in the previous step, AligNet iterates the following procedure:

  • It removes the nodes in \(G\) and \(G'\) that have already been aligned, and it recomputes the score of each alignment \(\eta_{u,u'}\) following the same definition as in step 3, but only taking into account the remaining nodes in its domain and image.

  • It computes a new optimal matching \(\mathcal{C}\) between \(C(G)\) and \(C(G')\), as in step 3, but using as edges those \(\eta_{u,u'}\) whose updated score is positive, and as weights these updated scores.

  • It computes a new set \(\mathcal{R}\) of best-scored alignments \(\eta_{u,u'}\) with \(Score(\eta_{u,u'})>0\), as in step 4.

  • It defines a new directed hypergraph whose nodes are the nodes in \(V\cup V'\) not yet aligned and whose hyperarcs are the mappings \(\eta_{u,u'}\) in the new set \(\mathcal{R}\), understood as hyperarcs with source the still unaligned nodes in their domain and target the still unaligned nodes in their image.

  • It computes a local alignment between unaligned nodes in \(V\) and \(V'\) by solving the weighted bipartite hypergraph assignment problem for this hypergraph, and it adds this local alignment to the alignment obtained so far.

This procedure is iterated while there exist unaligned nodes belonging to the domain or the image of some alignment \(\eta_{u,u'}\) with (updated) positive score. In Fig. 8 we show the final global meaningful alignment obtained with AligNet for the networks in our running example.

Fig. 8
Final global alignment. This figure shows the final global alignment of the original networks obtained by AligNet. Notice that, in the fifth step of AligNet, the previous alignment is extended to a global one. In this case, there were two unmatched nodes in the source network in Fig. 7 which are now assigned

Results

In this section we report the tests performed to assess the performance of AligNet. Following the comparisons published in [12, 13], we decided to compare AligNet with SPINAL [7], HubAlign [8], NATALIE [18], L-GRAAL [19], and PINALOG [20] on the dataset used in [12], which consists of the PPIN of M. musculus (mus), C. elegans (cel), D. melanogaster (dme), S. cerevisiae (sce), and H. sapiens (hsa), downloaded from the IsoBase database [21] (version 1.0.2); see Table 1. Unfortunately, we had to discard the aligner NATALIE from our tests because some computations did not finish.

Table 1 Number of nodes and edges (with and without loops) of the PPIN considered as input data in our tests

In a first assessment of the alignments, we used two quality measures: the edge correctness ratio (EC), which quantifies the amount of structure preserved by the alignment, and the functional coherence value (FC), which assesses the functional similarity of the aligned proteins by comparing their Gene Ontology annotations. More formally, let \(G=(V,E)\) and \(G'=(V',E')\) be two PPIN such that \(|V|\leq |V'|\) and let \(\mu : V\rightarrow V'\) be a mapping defining an alignment. The edge correctness ratio of μ is

$$EC(\mu)= \frac{\left|\left\{\{u,v\} \in E \, : \, \{\mu(u),\mu(v)\} \in E'\right\} \right|}{ \min\{|E|, |E'|\}} $$

and the functional coherence value of μ is

$$FC(\mu)= \frac{\sum_{u\in V} FS(u,\mu(u))}{|V|}, $$

where the similarity score FS is defined by

$$ FS(u,u') = \frac{|GO(u) \cap GO(u')|}{|GO(u) \cup GO(u')|}, $$

with \(GO(u)\) and \(GO(u')\) the sets of GO annotations of the proteins \(u\) and \(u'\), respectively.
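Both quality measures can be sketched in Python as follows. The function names, the toy networks and the annotation labels (`t1`, `t2`, ...) are hypothetical; it is also assumed here that a pair of proteins without any annotation has FS equal to 0 and that unaligned nodes contribute 0 to FC, two corner cases that the definitions above leave open.

```python
def edge_correctness(E1, E2, mu):
    """EC(mu): edges of G conserved by mu, over min(|E|, |E'|)."""
    E2 = {frozenset(e) for e in E2}
    conserved = sum(
        1 for u, v in E1
        if u in mu and v in mu and frozenset((mu[u], mu[v])) in E2
    )
    return conserved / min(len(E1), len(E2))

def functional_similarity(go_u, go_v):
    """FS: Jaccard index of two sets of GO annotations."""
    union = go_u | go_v
    return len(go_u & go_v) / len(union) if union else 0.0

def functional_coherence(V, mu, GO1, GO2):
    """FC(mu): mean FS(u, mu(u)) over the nodes of the smaller network."""
    total = sum(functional_similarity(GO1.get(u, set()), GO2.get(mu[u], set()))
                for u in V if u in mu)
    return total / len(V)

# Hypothetical toy alignment.
E1 = [("a", "b"), ("b", "c")]
E2 = [("x", "y"), ("x", "z"), ("z", "t")]
mu = {"a": "x", "b": "y", "c": "z"}
GO1 = {"a": {"t1", "t2"}, "b": {"t3"}, "c": set()}
GO2 = {"x": {"t1"}, "y": {"t3"}, "z": {"t4"}}
print(edge_correctness(E1, E2, mu))                       # 1 of 2 edges: 0.5
print(functional_coherence(["a", "b", "c"], mu, GO1, GO2))  # (0.5 + 1 + 0) / 3 = 0.5
```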

Tables 2 and 3, as well as Figs. 9 and 10, report the EC and FC scores of the alignments, respectively. These scores were produced by the aligners under consideration using their default parameters whenever parameters were needed. Because all alignments attained a very low FC score, to put these low scores in perspective, we estimated the maximum value \(FC_{max}\) of the FC score for every pair of networks. This maximum value \(FC_{max}\) was obtained by solving the maximum weighted bipartite matching problem on the complete bipartite graph that has the proteins as nodes and where the weight of each edge connecting a protein in one network to a protein in the other network is the FS score of the corresponding pair of proteins. These maximum values are listed in Table 3. We observe that they are very low, being around 0.2 in most computations. We also observe in Tables 2 and 3 that AligNet and HubAlign obtained the best balance between FC and EC scores, followed by PINALOG and L-GRAAL.

Fig. 9
Edge Correctness Scores. This figure shows the edge correctness score obtained for each aligner in every alignment. The different aligners are presented in different colours

Fig. 10
Functional Coherence Scores. This figure shows the functional coherence score obtained for each aligner in every alignment. The purple dot shows the maximum value expected for every alignment. The different aligners are presented in different colours

Table 2 Edge correctness ratio obtained in every alignment
Table 3 Functional coherence value obtained in every alignment

In addition, in our first test and in order to measure the dispersion of the EC and FC scores used to evaluate the aligners, we introduced some noise into the networks by randomly adding and deleting 5% of the edges. For every aligner, we were able to compute 100 new pairwise alignments considering the perturbed networks of M. musculus mapped to the perturbed networks of C. elegans, D. melanogaster, and S. cerevisiae. In this way, for every aligner we ended up with a sample of 100 EC and FC scores for each of the alignments mus–cel, mus–sce and mus–dme. Table 4 presents the mean of the EC and FC scores as well as their standard deviation. Also, to visualise the score distributions, we present the results as violin plots (see Figs. 11, 12 and 13). We conclude that small perturbations of the real networks produced small variations of the EC and FC scores.
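One possible reading of this perturbation is sketched below in Python: a fraction of the existing edges is removed at random and the same number of random non-edges is added. The exact protocol is not detailed in the text, so this interpretation, the function name, the `seed` parameter and the toy network are all assumptions.

```python
import random
from itertools import combinations

def perturb(nodes, edges, fraction=0.05, seed=0):
    """Delete a fraction of the edges at random and add as many new ones."""
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    k = round(fraction * len(edge_set))
    removed = set(rng.sample(sorted(edge_set, key=sorted), k))
    non_edges = [frozenset(p) for p in combinations(sorted(nodes), 2)
                 if frozenset(p) not in edge_set]
    added = set(rng.sample(non_edges, k))
    return (edge_set - removed) | added

# Hypothetical toy network: 20 nodes on a ring plus chords, 40 edges in total.
nodes = list(range(20))
edges = ([(i, (i + 1) % 20) for i in range(20)]
         + [(i, (i + 2) % 20) for i in range(20)])
original = {frozenset(e) for e in edges}
perturbed = perturb(nodes, edges)
print(len(perturbed), len(perturbed & original))  # 40 edges, 38 of them unchanged
```

Enumerating all non-edges is quadratic in the number of nodes, which is acceptable for a toy example; for networks of realistic size, sampling candidate pairs at random and rejecting existing edges would be more practical.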

Fig. 11
Scores of mus–cel alignments. This figure shows as violin plots the distribution of the EC and FC scores obtained for every aligner in the alignments of the perturbed networks of mus and cel

Fig. 12
Scores of mus–sce alignments. This figure shows as violin plots the distribution of the EC and FC scores obtained for every aligner in the alignments of the perturbed networks of mus and sce

Fig. 13
Scores of mus–dme alignments. This figure shows as violin plots the distribution of the EC and FC scores obtained for every aligner in the alignments of the perturbed networks of mus and dme

Table 4 Statistics of the EC and FC scores

As a second test, we compared the behavior of AligNet, PINALOG, HubAlign, and L-GRAAL in relation to the alignment of protein complexes (we excluded SPINAL from this test because its results in the EC and FC tests were not convincing). Following the procedure explained in [20], we considered the database MIPS CORUM [22] as the gold standard for the human protein complexes and the information available in [23] as the gold standard for the yeast complexes. In addition, we considered the functional information available in MIPS CORUM for the human complexes and in MIPS FunCat [24] for the yeast complexes. To measure the quality of an alignment in terms of its behaviour on protein complexes, we used the complex functional coherence value (CFC), defined as the percentage of aligned complexes that are aligned correctly. More specifically, if we call a pair of complexes, one in each network, coherent when they share some biological function and incoherent otherwise, and if we denote by CP and NCP the numbers of coherent and incoherent pairs of aligned complexes, then \(CFC=\frac{CP}{CP + NCP}\times 100\). We report the results obtained by all the aligners in Table 5 and Fig. 14. We observe there that AligNet obtained the highest CFC value (25.34), followed by PINALOG (24.48), whereas HubAlign and L-GRAAL obtained very low CFC values (5 and 4.75, respectively).
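The CFC value can be sketched in Python as follows. The complex identifiers and function labels are hypothetical, and returning 0 when no complexes are aligned is an assumption.

```python
def complex_functional_coherence(aligned_pairs, functions_1, functions_2):
    """CFC = CP / (CP + NCP) * 100, where a pair of aligned complexes is
    coherent when the two complexes share at least one biological function."""
    cp = sum(1 for c1, c2 in aligned_pairs if functions_1[c1] & functions_2[c2])
    ncp = len(aligned_pairs) - cp
    return 100 * cp / (cp + ncp) if aligned_pairs else 0.0

# Hypothetical aligned complexes and their functional annotations.
functions_1 = {"y1": {"f1"}, "y2": {"f2"}, "y3": {"f3"}, "y4": {"f4"}}
functions_2 = {"h1": {"f1", "f9"}, "h2": {"f5"}, "h3": {"f6"}, "h4": {"f7"}}
pairs = [("y1", "h1"), ("y2", "h2"), ("y3", "h3"), ("y4", "h4")]
cfc = complex_functional_coherence(pairs, functions_1, functions_2)
print(cfc)  # one coherent pair out of four: 25.0
```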

Fig. 14
Complex Functional Coherence. This figure shows the number of non-assigned complexes (in blue), the number of coherent pairs (in green), the number of incoherent pairs (in red) and the complex functional coherence value (yellow dot). The number of complexes is shown on the left axis, while the complex functional coherence value is shown on the right axis

Table 5 Number of not assigned, correctly assigned (CP), incorrectly assigned (NCP) protein complexes and the complex functional coherence value obtained for every aligner

In order to further compare the results obtained by AligNet on protein complexes with those of the other aligners, we counted, for each other aligner A, the complexes that were not aligned either by AligNet or by A; the coherent and incoherent pairs among those complexes that were aligned by AligNet but not by A; and the coherent and incoherent pairs among those complexes that were aligned by A but not by AligNet. The results are given in Table 6 and Fig. 15. We observe there that the number of incoherent pairs produced by HubAlign, L-GRAAL and PINALOG versus AligNet is nearly double the number of incoherent pairs produced by AligNet versus the others.

Fig. 15
Complex Functional Coherence Precision. This figure shows the number of coherent pairs (green) and incoherent pairs (red) obtained with one aligner versus the other

Table 6 Numbers of complexes assigned by AligNet and not assigned by the other aligners, and conversely

As a third test to evaluate the aligners, we considered the essential proteins, i.e. those proteins that are indispensable for the survival of an organism, again in the human and yeast PPIN. We evaluated the aligners' performance assuming that essential proteins must be aligned to essential proteins. Thus, for every alignment between the PPIN of S. cerevisiae and H. sapiens, a true positive (TP) is an essential protein matched to an essential protein while a false positive (FP) is an essential protein matched to a non-essential one. In the same way, a true negative (TN) is a non-essential protein matched to a non-essential one and a false negative (FN) is a non-essential protein matched to an essential one. The essential protein information was retrieved from the DEG Database [25] (http://www.essentialgene.org/). We considered the following statistical measures to evaluate the aligners' performance: specificity, defined by TN/N; precision, defined by TP/(TP+FP); F1-score, defined by 2TP/(2TP+FP+FN); accuracy, defined by (TP+TN)/(P+N); and balanced accuracy, defined by ((TP/P)+(TN/N))/2, where P and N are the numbers of essential and non-essential proteins, respectively, in S. cerevisiae. Also, we calculated the Pearson correlation of this binary classification problem, called MCC (Matthews Correlation Coefficient), defined by

$${{\text{MCC}}={\frac {{{TP}}\times {{TN}}-{{FP}}\times {{FN}}}{\sqrt {({{TP}}+{{FP}})({{TP}}+{{FN}})({ {TN}}+{ {FP}})({{TN}}+{{FN}})}}}}$$

and the proficiency, also called uncertainty coefficient or entropy coefficient. The uncertainty coefficient in this test is defined as follows: let \(\{p_{1},\ldots,p_{n}\}\) be the set of proteins in S. cerevisiae and let \(\eta\) be an alignment between the PPIN of S. cerevisiae and H. sapiens. Two random variables X and Y are considered: X is a binary vector \(X=(x_{i})_{i=1,\ldots,n}\) such that \(x_{i}\) takes the value 1 if protein \(p_{i}\) is essential and the value 0 otherwise, and Y is a binary vector \(Y=(y_{i})_{i=1,\ldots,n}\) such that \(y_{i}\) takes the value 1 if protein \(\eta(p_{i})\) is essential and the value 0 otherwise. Then, the uncertainty coefficient is defined by

$$UC=(H(X)- H(X|Y)) /H(X)$$

where H(X) is the entropy of X and H(X|Y) is the conditional entropy of X given Y. In this test, the uncertainty coefficient measures the capability to predict that an S. cerevisiae protein is essential provided that its image by η is essential. In Fig. 16 we show the values of each statistical measure obtained for every aligner. As we can observe there, all aligners have similar values of accuracy and balanced accuracy. Concerning specificity, precision and F1-score, HubAlign obtained the lowest values while the other aligners are comparable. The highest proficiency and MCC values were obtained by AligNet while the lowest ones were obtained by PINALOG.
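The MCC and the uncertainty coefficient can be sketched in Python from the two binary vectors X and Y. The function names are hypothetical; the conditional entropy is computed through the standard identity H(X|Y) = H(X,Y) − H(Y), and returning 0 when a denominator vanishes is an assumption.

```python
from math import log2, sqrt

def confusion(x, y):
    """TP, FP, TN, FN as defined in the text: x[i] = 1 if protein p_i is
    essential, y[i] = 1 if its image under the alignment is essential."""
    tp = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    fp = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    tn = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    fn = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return tp, fp, tn, fn

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((values.count(v) / n) * log2(values.count(v) / n)
                for v in set(values))

def uncertainty_coefficient(x, y):
    """UC = (H(X) - H(X|Y)) / H(X), with H(X|Y) = H(X,Y) - H(Y)."""
    hx = entropy(list(x))
    h_cond = entropy(list(zip(x, y))) - entropy(list(y))
    return (hx - h_cond) / hx if hx else 0.0

x = [1, 1, 0, 0]                                     # hypothetical essentiality labels
print(mcc(*confusion(x, [1, 1, 0, 0])))              # perfect agreement: 1.0
print(uncertainty_coefficient(x, [1, 1, 0, 0]))      # Y determines X: 1.0
print(uncertainty_coefficient(x, [1, 0, 1, 0]))      # Y independent of X: 0.0
```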

Fig. 16

Binary Classifier Metrics. This figure shows the results obtained for each aligner in the essential proteins alignment test, for every statistical measure

Finally, in order to study the efficiency of the considered aligners, we observed the running time and memory space they needed to perform an alignment. We ran our implementation of AligNet on a server with 4 processors at 2.6 GHz and 20 GB of RAM, and we also ran the latest implementations of PINALOG (downloaded from http://www.sbg.bio.ic.ac.uk/~pinalog/), SPINAL (downloaded from http://code.google.com/p/spinal/), HubAlign (downloaded from https://github.com/hashemifar/HubAlign) and L-GRAAL (downloaded from http://www0.cs.ucl.ac.uk/staff/natasa/L-GRAAL/). NATALIE could not align the two smallest networks, C. elegans and D. melanogaster, on a computer with 64 GB of RAM. PINALOG, SPINAL, HubAlign and L-GRAAL were able to complete all the alignments. In order to visualize their running times, we show the running time of every finished computation for each aligner in the top barplot in Fig. 17. We can observe that AligNet is considerably faster than PINALOG and SPINAL, with a running time of less than 1,000 seconds in most of the alignments. However, it is difficult to see the running times in some alignments because SPINAL needed more than 20,000 seconds for the alignment between S. cerevisiae and H. sapiens. Thus, in order to visualize the results in the cases where the aligners consumed less than 3,500 seconds, we show in Fig. 18 the running times cut at ten minutes. We observe there that HubAlign is the fastest aligner, followed by AligNet.

Fig. 17

Running times. This figure shows the running times (in seconds) we obtained when we performed all the alignments for every pair of the considered networks. In this figure we present the results obtained with the aligners AligNet, PINALOG, SPINAL, HubAlign and L-GRAAL

Fig. 18

Running times cut at 10 minutes. We show in this figure the running times for those alignments that took less than 10 minutes

In Fig. 19, we present the running times ordered by network size. We observe that, in the case of AligNet, the running time increases with the size of the networks. However, this is not the case for L-GRAAL, SPINAL and HubAlign. On the other hand, PINALOG presents a correlation between network sizes and running times, but it is the slowest aligner. Thus, AligNet is the aligner that presents the strongest correlation between running time and network size.

Fig. 19

Time Consistency. This figure shows the running times in seconds obtained for every pairwise alignment and every aligner. We ordered the pairwise alignments considering the size (number of nodes) of the networks

Discussion

We performed three tests to evaluate and compare our tool AligNet to the best aligners according to [12, 13]. In the first test we assessed the alignment correctness by calculating the EC and FC scores. We present the results in Tables 2 and 3, as well as in Figs. 9 and 10. We can observe there that aligning small networks with a small number of edges, such as M. musculus, produced alignments with high EC scores, especially when the target network has a large number of edges. However, we can also observe that, when the number of edges in the source network increased, the EC scores decreased dramatically, even in the case of HubAlign. As far as functional coherence is concerned, we can observe in Table 3 and Fig. 10 that all aligners attained a very low FC score, whose value in most of the computations is around 0.2 points below the maximum score that can be obtained. An overview of Figs. 9 and 10 shows that the order from the highest to the lowest EC score is almost the opposite of the order from the highest to the lowest FC score. That is, the alignment with the highest EC score gets the lowest FC score. AligNet and HubAlign are the aligners that obtained the best balance between FC and EC scores, followed by PINALOG and L-GRAAL.

In this first test, we also measured the amount of variation or dispersion of the EC and FC scores used to evaluate the aligners. We introduced some noise into the networks by randomly adding and deleting 5% of the edges. In this way, for every aligner we ended up with a sample of 100 EC and FC scores for each of the alignments mus–cel, mus–sce and mus–dme. In Table 4, the means of the EC and FC scores as well as their standard deviations are presented. Notice that the difference between the means of the EC scores obtained by AligNet and HubAlign is around 0.3, HubAlign being the aligner with the highest EC scores, while the difference between the means of the FC scores obtained by AligNet and PINALOG is at most 0.05, PINALOG being the aligner with the highest FC scores but the lowest EC scores. Thus, the goal of AligNet is accomplished, since it clearly obtained the best balance between EC and FC scores. Also, to visualise the distribution of the scores, in Figs. 11, 12 and 13 we present the results as violin plots. In violin plots we can observe the probability density of the EC and FC scores as well as all the data that is shown in a box plot. As we can observe in these figures, HubAlign and L-GRAAL obtained the highest EC scores but the lowest FC scores, in contrast to PINALOG, which obtained the lowest EC scores but the highest FC scores. Notice that the shapes of the violins show the distribution of the scores: flat and wide violins indicate that most of the values are near the mean, in contrast to vertical and narrow violins, where the values are dispersed away from the mean. There is only one vertical violin, corresponding to the EC scores in the alignments of L-GRAAL between mus and sce. This entails that, except for this case, small perturbations of the real networks produced small variations of the EC and FC scores.
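The perturbation step described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure actually used: the text does not say exactly how the 5% of edges were chosen, so the sketch assumes that 5% of the original edges are deleted uniformly at random and that the same number of edges, absent from the original network, are added. The function names `normalise` and `perturb_edges` are hypothetical.

```python
import random


def normalise(u, v):
    """Store an undirected edge as an ordered pair so (u, v) equals (v, u)."""
    return (u, v) if u <= v else (v, u)


def perturb_edges(nodes, edges, fraction=0.05, seed=None):
    """Return a noisy copy of an undirected edge set.

    Assumption: `fraction` of the original edges are deleted at random and
    the same number of previously absent edges are added at random.
    """
    rng = random.Random(seed)
    node_list = sorted(set(nodes))
    original = {normalise(u, v) for u, v in edges if u != v}
    k = round(fraction * len(original))

    # Make sure enough absent edges exist, so the loop below terminates.
    max_edges = len(node_list) * (len(node_list) - 1) // 2
    if len(original) + k > max_edges:
        raise ValueError("not enough absent edges to add")

    removed = set(rng.sample(sorted(original), k))
    added = set()
    while len(added) < k:
        u, v = rng.sample(node_list, 2)  # two distinct nodes
        candidate = normalise(u, v)
        if candidate not in original:
            added.add(candidate)
    return (original - removed) | added


# Toy network for illustration only: a ring of 50 nodes with extra chords.
toy_nodes = range(50)
toy_edges = [(i, (i + 1) % 50) for i in toy_nodes]
toy_edges += [(i, (i + 2) % 50) for i in toy_nodes]
noisy = perturb_edges(toy_nodes, toy_edges, fraction=0.05, seed=1)
print(len(noisy))
```

Calling the function with different seeds would give a collection of noisy networks; aligning each of them and recording the scores would yield a sample comparable in spirit to the 100 EC and FC scores mentioned above.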

In the second test we evaluated the alignment of protein complexes. We used the complex functional coherence value (CFC) to measure the quality of an alignment in terms of its behaviour on protein complexes. The CFC score is defined as the ratio of complexes that are aligned correctly with respect to the aligned complexes. In Table 5 and Fig. 14 we show the results obtained by all the aligners. AligNet obtained the highest CFC value (25.34), followed by PINALOG (24.48), whereas HubAlign and L-GRAAL obtained very low CFC values (5 and 4.75, respectively). In order to further compare the results obtained by AligNet on protein complexes with those of the other aligners, we counted, for each other aligner A, the complexes that were aligned neither by AligNet nor by A; the coherent and incoherent pairs among those complexes that were aligned by AligNet but not by A; and the coherent and incoherent pairs among those complexes that were aligned by A but not by AligNet. The results are given in Table 6 and Fig. 15. In the first two numerical columns of the table we can see that 891 complexes were aligned neither by AligNet nor by HubAlign; 263 complexes were aligned by AligNet but not by HubAlign, of which 88 were correctly aligned (coherent pairs) and 175 were incorrectly aligned (by AligNet); and 378 complexes were aligned by HubAlign but not by AligNet, of which 21 were correctly aligned and 357 were incorrectly aligned (by HubAlign). Therefore, HubAlign aligned more complexes than AligNet, but AligNet achieved a higher precision in the alignment of complexes than HubAlign: 33.5% vs 5.6%. In a similar way, AligNet also showed a higher precision than L-GRAAL and a slightly higher precision than PINALOG (19.2% vs 17.4%), although PINALOG aligned more complexes than AligNet. Our interpretation is that AligNet is more conservative than PINALOG.
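The precision figures quoted for the comparison with HubAlign follow directly from the counts of coherent and incoherent pairs given above. The short sketch below reproduces that arithmetic; the function name `complex_precision` is a hypothetical helper introduced only for this illustration.

```python
def complex_precision(coherent, incoherent):
    """Fraction of aligned complexes that form coherent pairs."""
    aligned = coherent + incoherent
    if aligned == 0:
        raise ValueError("no aligned complexes")
    return coherent / aligned


# Counts quoted in the text for the AligNet vs HubAlign comparison.
alignet_only = complex_precision(coherent=88, incoherent=175)   # 88 of 263
hubalign_only = complex_precision(coherent=21, incoherent=357)  # 21 of 378

print(f"AligNet: {alignet_only:.1%}, HubAlign: {hubalign_only:.1%}")
```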

In the third test we evaluated the alignment of essential proteins in the human and yeast PPINs. We evaluated the aligners' performance under the assumption that essential proteins must be aligned to essential proteins, and we computed seven statistical measures. In Fig. 16 we show the values of each statistical measure obtained for every aligner. As we can observe there, all aligners have similar values of accuracy and balanced accuracy. Concerning specificity, precision and F1-score, HubAlign obtained the lowest values while the other aligners are comparable. The highest proficiency and MCC values were obtained by AligNet, while the lowest ones were obtained by PINALOG.

Finally, one of the weak points of PPIN aligners is their lack of efficiency. Indeed, as we have already mentioned, although NATALIE was suggested as a good aligner, it could not align the two smallest networks, C. elegans and D. melanogaster, on a computer with 64 GB of RAM. With PINALOG, SPINAL, HubAlign and L-GRAAL, we were able to complete all the alignments. In order to visualize their running times, we show the running time of every finished computation for each aligner in Fig. 17. We can observe there that SPINAL is, by a wide margin, the slowest aligner to compute the alignment between H. sapiens and S. cerevisiae, and also between D. melanogaster and S. cerevisiae. PINALOG is the slowest one, also by a wide margin, to compute the alignment between C. elegans and H. sapiens, as well as the alignment between H. sapiens and M. musculus. AligNet is considerably faster than PINALOG and SPINAL, with a running time of less than 1,000 seconds in most of the alignments. In only one computation, the alignment between D. melanogaster and H. sapiens, is AligNet slower than PINALOG and SPINAL, and the difference is less than 2,000 seconds. Because SPINAL needed more than 20,000 seconds for the alignment between S. cerevisiae and H. sapiens, it is difficult to see the running times in some alignments. Thus, in order to visualize the results in the cases where the aligners consumed less than 3,500 seconds, we show in Fig. 18 the running times cut at ten minutes. We conclude from that figure that HubAlign is the fastest aligner, followed by AligNet.

We also present the running times ordered by network size in Fig. 19. The running time should be expected to increase with the size of the networks, and this is the case for AligNet. However, this is not the case for L-GRAAL, SPINAL and HubAlign. Actually, L-GRAAL shows a running time that is unpredictable with respect to network size. In sum, HubAlign is clearly the fastest aligner, but the correlation between its running times and the network sizes is not clear. PINALOG presents a correlation between network sizes and running times, but it is the slowest aligner. AligNet presents the strongest correlation between running time and network size, and it is faster than PINALOG.
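One way to put a number on the relation between network size and running time is a rank correlation. The sketch below is only an illustration: the text does not state which correlation measure, if any, was computed, so Spearman's rank correlation is assumed here, and the sizes and times in the example are placeholder values, not measurements from this study. The helper names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Placeholder values for illustration only (not results from the paper).
network_sizes = [1000, 2500, 4000, 6000, 9000]
running_times = [40.0, 95.0, 310.0, 280.0, 900.0]
print(round(spearman(network_sizes, running_times), 3))
```

A value close to 1 would indicate that running time grows consistently with network size, which is the behaviour described for AligNet above.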

Conclusions

In this paper we present AligNet, a new method and software tool for the pairwise global alignment of PPINs, aimed at producing biologically meaningful alignments by achieving a good balance between structural matching and protein function conservation. AligNet is a parameter-free algorithm that, given two PPINs, produces a consistent alignment from the smaller network, in terms of number of nodes, to the larger network. Its implementation in R is freely available from https://github.com/biocom-uib/AligNet.

In order to assess the correctness of AligNet, we have evaluated the quality of the alignments obtained with it and with the four best aligners established in [12, 13], namely PINALOG, SPINAL, HubAlign and L-GRAAL. As a result of the comparison between the aligners, we found again, as was the case in [12, 13], that the agreement between the alignments obtained with different aligners is very low. Most global aligners achieved a high node coverage, meaning that the average number of assigned nodes in the source network is high, but all of them obtained a very low biological coherence value. With respect to the topological coherence value, some aligners were able to obtain a high score, but it was associated with a low biological coherence score. Overall, we can conclude that AligNet is the aligner that obtained the best balance between topological coherence (it preserves 60% of the edges) and functional coherence (relative functional coherence values between 20% and 40% and the highest complex functional coherence score, 25.34). PINALOG obtained functional coherence scores similar to those of AligNet, but lower topological coherence scores and the lowest proficiency value. HubAlign and L-GRAAL obtained high topological coherence scores but very low CFC values. SPINAL surprisingly obtained a very low topological coherence value. Thus, we recommend AligNet when both the biological function and the topological structure are to be preserved.

Abbreviations

PPIN: Protein-Protein Interaction Networks
EC: Edge Correctness
FC: Functional Coherence
CFC: Complex Functional Coherence
CP: Coherent Pairs
NCP: Incoherent Pairs
TP: True Positive
FP: False Positive
TN: True Negative
FN: False Negative
MCC: Matthews Correlation Coefficient
UC: Uncertainty Coefficient

References

  1. Kelley BP, Yuan B, et al. PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Res. 2004; 32(Web Server issue):W83–88.

  2. Koyutürk M, Kim Y, et al. Pairwise alignment of protein interaction networks. J Comput Biol. 2006; 13(2):182–199.

  3. Li Z, Wang Y, et al. Alignment of protein interaction networks by integer quadratic programming. In: 2006 International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE; 2006. p. 5527–30.

  4. Liang Z, Xu M, Teng M, Niu L. NetAlign: a web-based tool for comparison of protein interaction networks. Bioinformatics. 2006; 22(17):2175–7.

  5. Narayanan M, Karp RM. Comparing protein interaction networks via a graph match-and-split algorithm. J Comput Biol. 2007; 14(7):892–907.

  6. Elmsallati A, Clark C, Kalita J. Global alignment of protein-protein interaction networks: A survey. IEEE/ACM Trans Comput Biol Bioinforma. 2016; 13(4):689–705.

  7. Aladaǧ AE, Erten C. SPINAL: Scalable protein interaction network alignment. Bioinformatics. 2013; 29(7):917–24.

  8. Hashemifar S, Xu J. HubAlign: an accurate and efficient method for global alignment of protein-protein interaction networks. Bioinformatics. 2014; 30(17):i438–44.

  9. Neyshabur B, Khadem A, Hashemifar S, Arab SS. NETAL: a new graph-based method for global alignment of protein-protein interaction networks. Bioinformatics. 2013; 29(13):1654–62.

  10. Patro R, Kingsford C. Global network alignment using multiscale spectral signatures. Bioinformatics. 2012; 28(23):3105–14.

  11. Singh R, Xu J, Berger B. Global alignment of multiple protein interaction networks with application to functional orthology detection. PNAS. 2008; 105(35):12763–8.

  12. Clark C, Kalita J. A comparison of algorithms for the pairwise alignment of biological networks. Bioinformatics. 2014; 30(16):2351–9.

  13. Malod-Dognin N, Ban K, Pržulj N. Unified alignment of protein-protein interaction networks. Sci Rep. 2017; 7:953.

  14. Zuur AF, Ieno EN, Meesters EHWG. A Beginner’s Guide to R. Springer; 2009.

  15. Camacho C, Coulouris G, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009; 10(1):1.

  16. Kuhn HW. The Hungarian method for the assignment problem. Naval Res Logist. 2005; 52(1):7–21.

  17. Borndörfer R, Heismann O. The hypergraph assignment problem. Discrete Optim. 2015; 15:15–25.

  18. Klau GW. A new graph-based method for pairwise global network alignment. BMC Bioinformatics. 2009; 10(1):S59.

  19. Malod-Dognin N, Pržulj N. L-GRAAL: Lagrangian graphlet-based network aligner. Bioinformatics. 2015; 31(13):2182–9.

  20. Phan HTT, Sternberg MJE. PINALOG: A novel approach to align protein interaction networks—implications for complex detection and function prediction. Bioinformatics. 2012; 28(9):1239–45.

  21. Park D, Singh R, et al. IsoBase: a database of functionally related proteins across PPI networks. Nucleic Acids Res. 2011; 39(suppl 1):D295–D300.

  22. Ruepp A, Brauner B, et al. CORUM: the comprehensive resource of mammalian protein complexes. Nucleic Acids Res. 2008; 36(suppl 1):D646–D650.

  23. Gavin A-C, Aloy P, et al. Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006; 440(7084):631–6.

  24. Ruepp A, Zollner A, et al. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res. 2004; 32(18):5539–45.

  25. Luo H, Lin Y, Gao F, Zhang C-T, Zhang R. DEG 10, an update of the Database of Essential Genes that includes both protein-coding genes and non-coding genomic elements. Nucleic Acids Res. 2014; 42:D574–D580.

Acknowledgements

We thank Gabriel Riera for the technical support.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 21 Supplement 6, 2020: Selected articles from the 15th International Symposium on Bioinformatics Research and Applications (ISBRA-19): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-21-supplement-6.

Funding

This research was partially supported by the Spanish Ministry of Science, Innovation and Universities and the European Regional Development Fund through projects DPI2015-67082-P and PGC2018-096956-B-C43 (FEDER/ MICINN/AEI). Publication costs are funded by Spanish Ministry of Economy and Competitiveness and European Regional Development Fund project PGC2018-096956-B-C43 (FEDER/ MICINN/AEI). The funding body did not play any roles in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Author information

Contributions

ML, FR and GV conceived and coordinated the study, performed data analysis and drafted the manuscript. RA and AA performed all the bioinformatic analyses. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Mercè Llabrés.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Alcalá, A., Alberich, R., Llabrés, M. et al. AligNet: alignment of protein-protein interaction networks. BMC Bioinformatics 21 (Suppl 6), 265 (2020). https://doi.org/10.1186/s12859-020-3502-1

  • DOI: https://doi.org/10.1186/s12859-020-3502-1

Keywords