AligNet: alignment of protein-protein interaction networks

Alcalá, Adrià; Alberich, Ricardo; Llabrés, Mercè; Rosselló, Francesc; Valiente, Gabriel

doi:10.1186/s12859-020-3502-1

Volume 21 Supplement 6

Selected articles from the 15th International Symposium on Bioinformatics Research and Applications (ISBRA-19): bioinformatics

Research
Open access
Published: 18 November 2020

AligNet: alignment of protein-protein interaction networks

Adrià Alcalá^1,2,
Ricardo Alberich^1,2,
Mercè Llabrés ORCID: orcid.org/0000-0002-5358-6768^1,2,
Francesc Rosselló^1,2 &
…
Gabriel Valiente³

BMC Bioinformatics volume 21, Article number: 265 (2020) Cite this article

2819 Accesses
8 Citations
1 Altmetric
Metrics details

Abstract

Background

All molecular functions and biological processes are carried out by groups of proteins that interact with each other. Metaproteomic data continuously generates new proteins whose molecular functions and relations must be discovered. A widely accepted structure to model functional relations between proteins are protein-protein interaction networks (PPIN), and their analysis and alignment has become a key ingredient in the study and prediction of protein-protein interactions, protein function, and evolutionary conserved assembly pathways of protein complexes. Several PPIN aligners have been proposed, but attaining the right balance between network topology and biological information is one of the most difficult and key points in the design of any PPIN alignment algorithm.

Results

Motivated by the challenge of well-balanced and efficient algorithms, we have designed and implemented AligNet, a parameter-free pairwise PPIN alignment algorithm aimed at bridging the gap between topologically efficient and biologically meaningful matchings. A comparison of the results obtained with AligNet and with the best aligners shows that AligNet achieves indeed a good balance between topological and biological matching.

Conclusion

In this paper we present AligNet, a new pairwise global PPIN aligner that produces biologically meaningful alignments, by achieving a good balance between structural matching and protein function conservation, and more efficient computations than state-of-the-art tools.

Background

One of the most difficult problems in systems biology is to discover protein-protein interactions as well as their associated functions. The alignment and analysis of protein-protein interaction networks (PPIN) has become a key ingredient to obtain functional orthologs as well as evolutionary conserved assembly pathways of protein complexes. With this purpose, several pairwise alignment algorithms have been proposed in the last 15 years. The early aligners [1–5] were aimed at finding local alignments between regions with similar structure in the networks under comparison. But since the alignments between regions of the pair of PPIN could be mutually inconsistent, it could be impossible to merge the alignments between regions into an alignment of the whole networks. In contrast, a global alignment algorithm is aimed at finding the best overall alignment between whole PPIN [6]. Several such global PPIN aligners have been proposed during the last years [4, 7–11].

Most PPIN aligners are based on the idea that “two nodes are similar when their corresponding neighbors are so,” taking into account both the network topology and the biological features of the proteins in the definition of “similarity.” The problem is that attaining the right balance between network topology and biological information is one of the most difficult and key points in any PPIN alignment algorithm. As it is shown in [12, 13], when an alignment process is guided by topological information only, it produces alignments with a high topological coherence but a low biological coherence, while when it is guided by sequence information only, the resulting alignments have a high biological coherence but a low topological coherence. This becomes specially inconvenient in those aligners where the user has to choose the value of a parameter that specifies the desired balance between the topological and the sequence similarities. In addition, most aligners are not efficient from the computational point of view.

Motivated by this lack of well-balanced and efficient algorithms, we have designed AligNet, a parameter-free pairwise PPIN alignment algorithm aimed at filling the gap between efficient topologically and biologically meaningful matchings. The overall idea of the algorithm is to obtain many local alignments that are combined and extended into a meaningful global alignment. The final alignment captures the benefits of considering both types of alignments: with the local alignments we capture the topological similarity between the networks and we speed up the running time of the algorithm, while with the final global alignment we solve the inconsistencies among the local alignments and yield an overall alignment of the pair of input PPIN. AligNet has been implemented in R [14], and the implementation is freely available from https://github.com/biocom-uib/AligNet.

A comparison of the results obtained with AligNet and with the best aligners assessed in [12, 13] shows that AligNet achieves indeed a good balance between topological and biological matching. In the tests reported in this paper, AligNet obtained high functional consistence scores between aligned proteins in most of the alignments and also a reasonable fraction of conserved interactions. In addition, AligNet, together with HubAlign [8], had the best running times among all the aligners considered in our tests.

Methods

In this paper, by a graph we understand an undirected graph, that is, a structure G=(V,E) with V a finite set of nodes and E a family of 2-element subsets {u,v} of V called the edges of the graph. A PPIN is modelled in a natural way as a graph, with its nodes representing the proteins and its edges, their interactions.

We introduce now some notations. Let G=(V,E) be a graph. We say that an edge e={u,v} is incident to u and v. The nodes v such that {u,v}∈E are the neighbors of u, and they form the set N_G(u). The degree deg(u) of a node u∈V is the number of edges incident to it. A path between two nodes u,v∈V is a sequence of pairwise different edges {u,u₁},{u₁,u₂},…,{u_k−1,u_k},{u_k,v} such that the first and last edges are incident to u and v, respectively, and every pair of consecutive edges share a node (different from u and v, in the case of the first and last edges, respectively). The length of a path is the number of edges forming it, and its intermediate nodes are u₁,…,u_k. Two nodes are connected when there exists a path between them. For every pair of connected nodes u,v∈V, their distance d_G(u,v) in G is the length of a shortest path connecting them. The diameter D(G) of G is the maximum distance between any two connected nodes in G. The cardinality of a set X is denoted by |X|.

AligNet receives as input two graphs G=(V,E) and G^′=(V^′,E^′) representing two PPIN (in particular, each node of them is injectively identified with a protein) and it produces, as output, a similarity score for them and a local and a global alignment between them. Figure 1 shows the pipeline of our algorithm AligNet. The main steps in AligNet that are described below are:

1
The computation of overlapping clusterings C(G) and C(G^′), respectively, of the input networks G and G^′.
2
The computation of alignments between pairs of clusters in C(G) and C(G^′).
3
The computation of a matching between C(G) and C(G^′).
4
The computation of a local alignment of the input networks G and G^′.
5
The extension of this local alignment to a meaningful global alignment.

Step 1. Overlapping clusterings. The first step in AligNet consists in computing an overlapping clustering of each input network. These clusterings are based on the following similarity score s(u,v) between pairs of proteins (nodes) u,v in a PPIN G: If u,v are not connected by a path, then s(u,v)=0, and if they are connected,

$$s(u,v)= \frac{B(u,v)+\frac{D(G)+1-d_{G}(u,v)}{D(G) +1}}{2}, $$

where B(u,v) is the normalized bit score of the proteins u and v, that is, the rescaled version of their alignment score obtained with BLAST+, which is independent of the size of the search space [15]. The intuition behind this similarity score is that two proteins are similar if they have similar sequences of nucleotides and they are relatively close to each other in the graph.

To obtain the overlapping clustering of an input network, we define a cluster centered at each node. To avoid the choice of a fixed and arbitrary cluster size, we considered the similarity score distribution and define the cluster centered at each node as follows. Let α be the third quartile of the distribution of the similarity score values of pairs of nodes, so that only 25% of the pairs of nodes (u,v) are such that s(u,v)>α. Then, for every node u∈V, the cluster C_u in Gcentered at u is

$$C_{u}=\left\{ v\in V \mid s(u,v) > \alpha\right\}. $$

Let C(G)={C_u∣u∈V} and $\phantom {\dot {i}\!}C(G')=\left \{C_{u'}\mid u'\in V'\right \}$.

Figure 2 displays two toy PPIN that will be used as a running example throughout this section. The first network consists of 8 nodes and 9 edges, while the second network consists of 9 nodes and 17 edges. Figure 3 displays the PPI networks considered as a running example as well as its overlapping clustering. The first network consists of 8 nodes and 9 edges, so there are 8 clusters. The second network consists of 9 nodes and 17 edges, and its overlapping clustering has 9 clusters.

Step 2. Alignments between pairs of clusters. In this second step, AligNet computes an alignment between every pair of clusters C_u∈C(G) and $\phantom {\dot {i}\!}C_{u'}\in C(G')$ such that B(u,u^′)>0. These alignments define an alignment score between every such a pair of clusters that will be used in the third step to compute a matching between C(G) and C(G^′).

Formally, for every u∈V and u^′∈V^′ such that B(u,u^′)>0, the alignment between C_u∈C(G) and $\phantom {\dot {i}\!}C_{u'}\in C(G')$ is obtained as follows:

Match u with u^′. Set $\phantom {\dot {i}\!}L_{u,u'}=\left \{(u,u')\right \}, L_{u,u'}^{(1)}=\{u\}$ and $\phantom {\dot {i}\!}L_{u,u'}^{(2)}=\{u'\}$.
For every v∈C_u∩N_G(u) and for every $\phantom {\dot {i}\!}v'\in C_{u'}\cap N_{G'}(u')$, let
$$F\left(v,v'\right)=| \deg(v)-\deg(v')| - B\left(v,v'\right) +1. $$
Compute a matching $\phantom {\dot {i}\!}M_{u,u'}\subseteq (C_{u}\cap N_{G}(u))\times (C_{u'}\cap N_{G'}(u'))$ that minimizes $\phantom {\dot {i}\!}\sum _{(v,v')\in M_{u,u'}} F(v,v')$ using the Hungarian algorithm [16]. Sort the pairs in $\phantom {\dot {i}\!}M_{u,u'}$ in decreasing order of their F value, and concatenate them to $\phantom {\dot {i}\!}L_{u,u'}$. Add their first coordinates to $\phantom {\dot {i}\!}L_{u,u'}^{(1)}$ and their second coordinates to $\phantom {\dot {i}\!}L_{u,u'}^{(2)}$.
Iterate step (ii), replacing (u,u^′) by the rest of the pairs in $\phantom {\dot {i}\!}L_{u,u'}$ and removing from C_u and $\phantom {\dot {i}\!}C_{u'}$ the nodes already aligned.

More specifically, in the k-th iteration, take the k-th element (v₀,v0′) of $\phantom {\dot {i}\!}L_{u,u'}$. For every $\phantom {\dot {i}\!}w\in (C_{u}\setminus L_{u,u'}^{(1)})\cap N_{G}(v_{0}) $ and every $\phantom {\dot {i}\!}w'\in (C_{u'}\setminus L_{u,u'}^{(2)})\cap N_{G'}(v_{0}')$, compute F(w,w^′). Then, compute a matching
$$ M_{v_{0},v_{0}'}\subseteq \left((C_{u}\setminus L_{u,u'}^{(1)})\cap N_{G}(v_{0})\right)\times \left((C_{u'}\setminus L_{u,u'}^{(2)})\cap N_{G'}(v_{0}')\right) $$
that minimizes $\phantom {\dot {i}\!}\sum _{(v,v')\in M_{v_{0},v_{0}'}} F(v,v')$. Sort the pairs forming $\phantom {\dot {i}\!}M_{v_{0},v_{0}'}$ in decreasing order of their F value, and concatenate them to $\phantom {\dot {i}\!}L_{u,u'}$. Add their first coordinates to $\phantom {\dot {i}\!}L_{u,u'}^{(1)}$ and their second coordinates to $\phantom {\dot {i}\!}L_{u,u'}^{(2)}$.

The resulting alignment $\phantom {\dot {i}\!}L_{u,u'}$ defines a partial injective mapping $\phantom {\dot {i}\!}\eta _{u,u'}: C_{u}\rightarrow C_{u'}$. The nodes in C_u that are matched to nodes in $\phantom {\dot {i}\!} C_{u'}$ form the domain of the mapping $\phantom {\dot {i}\!}\eta _{u,u'}$, which is denoted by $\phantom {\dot {i}\!}Dom\, \eta _{u,u'}$. Figure 4 shows an example of the alignment of a pair of clusters: one cluster from the first network and another cluster from the second network. The general idea behind this alignment procedure is that u is matched to u^′ and then a node v∈C_u should be matched to a node v^′ in $\phantom {\dot {i}\!}C_{u'}$ when they have similar sequences and similar degrees, provided that, furthermore, there exist paths connecting u with v and u^′ with v^′ such that their intermediate nodes are already aligned in sequential order along the paths. The alignment procedure gives priority to matching neighbors of nodes x,x^′ at the possible shortest distance of the respective cluster centers and with F(x,x^′) as large as possible among those pairs already matched at the same iterative step.

Step 3. Matching between families of clusters. Let

$$\mathcal{A}=\left\{\eta_{u,u'} \mid u \in V,\ u'\in V', B(u,u')>0\right\}$$

be the set of alignments obtained in step 2. The score of each $\phantom {\dot {i}\!}\eta _{u,u'}\in \mathcal {A}$ is defined as

$$Score(\eta_{u,u'})=\frac{\sum_{v\in Dom \, \eta_{u,u'}}B(v,\eta_{u,u'}(v))}{|Dom \,\eta_{u,u'}|}+ \frac{|Dom \, \eta_{u,u'}|}{max_{\eta_{w,w'}\in \mathcal{A}}|Dom \,\eta_{w,w'}|}. $$

This score assesses simultaneously the average similarity of the sequences of the proteins matched by $\phantom {\dot {i}\!}\eta _{u,u'}$ and their number.

Once computed all these scores, AligNet obtains a matching between C(G) and C(G^′) by applying the maximum weighted bipartite matching algorithm to the bipartite graph whose nodes are the clusters in C(G) and C(G^′), whose edges connect pairs of clusters C_u∈C(G) and $\phantom {\dot {i}\!}C_{u'}\in C(G')$ with B(u,u^′)>0, and the weight of the edge connecting C_u with $\phantom {\dot {i}\!}C_{u'}$ is the score $\phantom {\dot {i}\!}Score(\eta _{u,u'})$. We shall denote by $\mathcal {C}$ the set of partial injective mappings $\phantom {\dot {i}\!}\eta _{u,u'}$ corresponding to pairs of clusters $\phantom {\dot {i}\!}(C_{u},C_{u'})$ that are matched by this matching. Figure 5 shows the matching obtained in this step between the families of clusters in Fig. 3.

Step 4. Local alignment of PPIN. In this step, AligNet produces a local alignment between G and G^′ from the matching between C(G) and C(G^′) obtained in the previous step.

The main idea is to define this alignment by merging the partial injective mappings $\phantom {\dot {i}\!}\eta _{u,u'}\in \mathcal {C}$. The problem is that these mappings may be inconsistent. A first approach to overcome this problem would be to consider the weighted bipartite hypergraph with set of nodes V⊔V^′ and where every mapping $\phantom {\dot {i}\!}\eta _{u,u'}$ defines a hyperarc with source its domain, target its image, and weight $\phantom {\dot {i}\!}Score\left (\eta _{u,u'}\right)$, and to solve on it the weighted bipartite hypergraph assignment problem, whose solution would provide a well-defined local alignment of the input networks.

However, in order to decrease the computation time of AligNet, we do not define this hypergraph from the whole $ \mathcal {C}$, but just from a subset $\mathcal {R}$ of best-scored alignments built recursively as follows. Starting with $\mathcal {R}=\emptyset $, AligNet adds to $\mathcal {R}$ at each step a mapping $\phantom {\dot {i}\!}\eta _{w_{0},w_{0}'}\in \mathcal {C}$ with w₀ not belonging to the union of the domains of the mappings $\phantom {\dot {i}\!}\eta _{w,w'}$ already in $\mathcal {R}$ and with maximum $\phantom {\dot {i}\!}Score\left (\eta _{w_{0},w_{0}'}\right)$ among all such mappings. AligNet iterates this procedure until every node in $\phantom {\dot {i}\!}\bigcup _{\eta _{u,u'}\in \mathcal {C}} Dom\, \eta _{u,u'}$ belongs to the domain of some mapping in $\mathcal {R}$. In Fig. 6 we give the subset $\mathcal {R}$ of $\mathcal {C}$ for the networks in our running example.

Then, Alignet obtains from the directed hypergraph with nodes V⊔V^′ and hyperarcs defined by the mappings $\phantom {\dot {i}\!}\eta _{u,u'}\in \mathcal {R}$ as explained above, a local well-defined alignment between G and G^′ as a solution of the corresponding weighted bipartite hypergraph assignment problem [17]. Figure 7 shows the local alignment obtained from the hypergraph corresponding to Fig. 6.

Step 5. Global meaningful alignment of PPIN. In order to extend the local alignment produced in the previous step, AligNet iterates the following procedure:

It removes the nodes in G and G^′ that have already been aligned, and it recomputes the score of each alignment $\phantom {\dot {i}\!}\eta _{u,u'}$ following the same definition as in step 3, but only taking into account the remaining nodes in its domain and image.
It computes a new optimal matching $\mathcal {C}$ between C(G) and C(G^′), as in step 3, but using as edges those $\phantom {\dot {i}\!}\eta _{u,u'}$ whose updated score is positive, and weights these updated scores.
It computes a new set $\mathcal {R}$ of best-scored alignments $\phantom {\dot {i}\!}\eta _{u,u'}$ with $\phantom {\dot {i}\!}Score(\eta _{u,u'})>0$, as in step 4.
It defines a new directed hypergraph whose nodes are the nodes in V∪V^′ not yet aligned and hyperarcs the mappings $\phantom {\dot {i}\!}\eta _{u,u'}$ in the new set $\mathcal {R}$, understood as hyperarcs with source the still unaligned nodes in their domain and target the still unaligned nodes in their image.
It computes a local alignment between unaligned nodes in V and V^′ by solving the weighted bipartite hypergraph assignment problem for this hypergraph, and it adds this local alignment to the alignment obtained so far.

This procedure is iterated while there exist nodes not aligned belonging to the domain or the image of some alignment $\phantom {\dot {i}\!}\eta _{u,u'}$ with (updated) positive score. In Fig. 8 we show the final global meaningful alignment obtained with AligNet for the networks in our running example.

Results

In this section we report the tests performed to assess the performance of AligNet. Following the comparisons published in [12, 13], we decided to compare AligNet with SPINAL [7], HubAlign [8], NATALIE [18], L-GRAAL [19], and PINALOG [20] on the dataset used in [12], which consists of the PPIN of M. musculus (mus), C. elegans (cel), D. melanogaster (dme), S. cerevisiae (sce), and H. sapiens (hsa), downloaded from the IsoBase database [21] (version 1.0.2); see Table 1. Unfortunately, we had to discard the aligner NATALIE from our tests because some computations did not finish.

Table 1 Number of nodes and edges (with and without loops) of the PPIN considered as input data in our tests

Full size table

In a first assessment of the alignments, we used two quality measures: the edge correctness ratio (EC), which quantifies the amount of structure preserved by the alignment, and the functional coherence value (FC), which assesses the functional similarity of the aligned proteins by comparing their Gene Ontology annotation. More formally, let G=(V,E) and G^′=(V^′,E^′) be two PPIN such that |V|≤|V^′| and let μ:V→V^′ be a mapping defining an alignment. The edge correctness ratio of μ is

$$EC(\mu)= \frac{\left|\left\{\{u,v\} \in E \, : \, \{\mu(u),\mu(v)\} \in E'\right\} \right|}{ min\{|E|, |E'|\}} $$

and the functional coherence value of μ is

$$FC(\mu)= \frac{\sum_{u\in V} FS(u,\mu(u))}{|V|}, $$

where the similarity score FS is defined by

$$ FS(u,u') = \frac{|GO(u) \cap GO(u')|}{|GO(u) \cup GO(u')|}, $$

with GO(u) and GO(u^′) the sets of GO annotations of the proteins u and u^′, respectively.

Tables 2 and 3, as well as Figs. 9 and 10, report the EC and FC scores of the alignments, respectively. These scores are produced by the aligners under consideration using the aligners’ parameters suggested by default whenever it was needed. Because all alignments attained a very low FC score, to put these low scores in perspective, we estimated the maximum value FC_max of the FC score for every pair of networks. This maximum value FC_max was obtained solving the maximum weighted bipartite matching problem, where the complete bipartite graph had the proteins as nodes and the weight of each edge connecting one protein in a network to a protein in the other network was the FC score of the corresponding pair of proteins. These maximum values are listed in Table 3. We observe that they are very low, being around 0.2 in most computations. Also, we observe in Tables 2 and 3, that AligNet and HubAlign obtained the best balance between FC and EC scores followed by PINALOG and L-GRAAL.

Table 2 Edge correctness ratio obtained in every alignment

Full size table

Table 3 Functional coherence value obtained in every alignment

Full size table

In addition, in our first test and in order to measure the amount of variation or dispersion of the EC and FC scores used to evaluate the aligners, we introduced some noise to the networks by randomly adding and deleting 5% of the edges. For every aligner, we were able to compute 100 new pairwise alignments considering the perturbed networks of M. musculus mapped to the perturbed networks of C. elegans, D. melanogaster, and S. cerevisiae. In this way, for every aligner we ended up with a sample of 100 EC and FC scores for each of the alignments mus–cel, mus–sce and mus–dme. In Table 4, the mean of the EC and FC scores as well as their standard deviation are presented. Also, to visualise the scores distribution, we considered violin plots to present the results (See Figures 11,12 and 13). We conclude that small perturbations of the real networks produced small variations of the EC and FC scores.

Table 4 Statistics of the EC and FC scores

Full size table

As a second test, we compared the behavior of AligNet, PINALOG, HubAlign, and L-GRAAL in relation to the alignment of protein complexes (we excluded SPINAL from this test because its results in the EC and FC tests were not convincing). Following the procedure explained in [20], we considered the database MIPS CORUM [22] as the gold standard for the human protein complexes and the information available in [23] as the gold standard for the yeast complexes. In addition, we considered the functional information available in MIPS CORUM for the human complexes and in MIPS FunCat [24] for the yeast complexes. To measure the quality of an alignment in terms of its behaviour on protein complexes, we used the complex functional coherence value (CFC), defined as the ratio of complexes that are aligned correctly with respect to the aligned complexes. More specifically, if we call a pair of complexes, one in each network, coherent when they share some biological function and incoherent otherwise, and if we denote by CP and NCP the numbers of coherent and incoherent pairs of aligned complexes, then $CFC=\frac {CP}{CP + NCP}\times 100 $. We report the results obtained by all the aligners in Table 5 and Fig. 14. We observe there that AligNet obtained the highest CFC value (25.34) followed by PINALOG (24.48) whereas HubAlign and L- GRAAL obtained a very low CFC value (5, 4.75 resp.).

Table 5 Number of not assigned, correctly assigned (CP), incorrectly assigned (NCP) protein complexes and the complex functional coherence value obtained for every aligner

Full size table

In order to further compare the results obtained by AligNet on protein complexes with those of the others aligners, we counted, for each other aligner A, the complexes that were not aligned either by AligNet or by A; the coherent and incoherent pairs among those complexes that were aligned by AligNet but not by A; and the coherent and incoherent pairs among those complexes that were aligned by A but not by AligNet. The results are given in Table 6 and Fig. 15. We observe there that the number of incoherent pairs by HubAlign, L-GRAAL and PINALOG versus AligNet nearly double the number of incoherent pairs by AligNet versus the others.

Table 6 Numbers of complexes assigned by AligNet and not assigned by the other aligners, and conversely

Full size table

As a third test to evaluate the aligners, we considered the essential proteins, i.e. those proteins that are indispensable for the survival of an organism, again in the human and yeast PPINs. We evaluate the aligners performance assuming that essential proteins must be aligned to essential proteins. Thus, for every alignment between the PPIN of S. cerevisiae and H. sapiens, a true possitive (TP) is an essential protein matched to an essential protein while a false possitive (FP) is an essential protein matched to a non essential one. In the same way, a true negative (TN) is a non essential protein matched to a non essential one and a false negative (FN) is a non essential protein matched to an essential one. The essential proteins information was retrieved from the DEG Database [25] (http://www.essentialgene.org/). We considered the following statistical measures to evaluate the aligners performance: specificity defined by TN/N, precision defined by TN/N, F₁-score defined by 2TP/(2TP+FP+FN), accuracy defined by (TP+TN)/(P+N) and balanced accuracy, defined by ((TP/P)+(TN/N))/2, where P and N are the number of essential and non essential proteins respectively in S. cerevisiae. Also, we calculated the Pearson correlation of this binary classification problem, called MCC (Matthews Correlation Coefficient) defined by

$${{\text{MCC}}={\frac {{{TP}}\times {{TN}}-{{FP}}\times {{FN}}}{\sqrt {({{TP}}+{{FP}})({{TP}}+{{FN}})({ {TN}}+{ {FP}})({{TN}}+{{FN}})}}}}$$

and the proficiency, also called uncertainty coefficient or entropy coefficient. The uncertainty coefficient in this test is defined as follows: let {p₁,…p_n} be the set of proteins in S. cerevisiae and let η be an alignment between the two PPIN S. cerevisiae and H. sapiens. Two random variables X and Y are considered such that, X is a binary vector X=(x_i)_i=1,…,n such that x_i takes the value 1 if protein p_i is essential and the value 0 otherwise. Y is a binary vector Y=(y_i)_i=1,…,n such that y_i takes the value 1 if protein η(p_i) is essential and the value 0 otherwise. Then, the uncertainty coefficient is defined by

$$UC=(H(X)- H(X|Y)) /H(X)$$

where H(X) is the entropy of X and H(X|Y) is the conditional entropy. In this test, the uncertainty coefficient measures the capability to predict that a S. cerevisiae protein is essential provided that its image by η is essential. In Fig. 16 we show the values for each statistical measure obtained for every aligner. As we can observe there, all aligners have a similar value of accuracy and balanced accuracy. Concerning specificity, precision and F₁-score, HubAlign obtained the lowest value while the others aligners are comparable. The highest proficiency and MCC values were obtained by AligNet while the lowest one was obtained by PINALOG.

Finally, in order to study the efficiency of the considered aligners, we observed their running time and memory space needed to perform an alignment. We run our implementation of AligNet on a server with 4 processors at 2.6 GHz and 20 GB of RAM and we also run the latest implementations of PINALOG (downloaded from http://www.sbg.bio.ic.ac.uk/~pinalog/), SPINAL (downloaded from http://code.google.com/p/spinal/), HubAlign (downloaded from" https://github.com/hashemifar/HubAlign) and also L-GRAAL (downloaded from http://www0.cs.ucl.ac.uk/staff/natasa/L-GRAAL/). NATALIE could not align the two smallest networks, C. elegans and D. melanogaster, on a computer with 64 GB of RAM. PINALOG, SPINAL, HubAlign and L-GRAAL were able to complete all the alignments. In order to visualize their running times, we show the running time of every finished computation for each aligner in the top barplot in Fig. 17. We can observe that AligNet is considerably faster than PINALOG and SPINAL, with a running time of less than 1,000 seconds in most of the alignments. However,it is difficult to see the running times in some alignments because SPINAL needed more than 20,000 seconds for the alignment between S. cerevisiae and H. sapiens. Thus, in order to visualize the results in the cases where the aligners consumed less than 3,500 seconds, we describe in Fig. 18 the running times cutting at ten minutes. We observe there that HubAlign is the fastest aligner followed by AligNet.

In Fig. 19, we present the running times ordered by the networks size. We observe that in the case of AligNet the running time increases as so do the networks. However, this is not the case of L-GRAAL, SPINAL and HubAlign. On the other hand, PINALOG presents a correlation between networks sizes and running times but it is the slowest aligner. Thus, AligNet is the aligner that present the strongest correlation between running time and networks size.

Discussion

We performed three tests to evaluate and compare our tool AligNet to the best aligners according to [12, 13]. In the first test we assessed the alignment correctness by calculating the EC and FC scores. We present the results in Tables 2 and 3, as well as Figs. 9 and 10. We can observe there that the alignments of small networks with a small number of edges, such as M. musculus, produced alignments with high EC scores, especially when the target network has a large number of edges. However, we can also observe that, when the number of edges in the source network increased, the EC scores decreased dramatically even in the case of HubAlign. As far as the functional coherence goes, we can observe in Table 3 and Fig. 10 that all aligners attained a very low FC score whose value in most of the computations is around 0.2 points below the maximum score that can be obtained. An overview to Figs. 9 and 10 reflects that the order from the highest to the lowest EC scores is almost the opposite to the order from the highest to the lowest FC score. That is, the alignment with the highest EC score gets the lowest FC score, being AligNet and HubAlign the aligners that obtained the best balance between FC and EC scores followed by PINALOG and L-GRAAL.

In this first test, we also measured the amount of variation or dispersion of the EC and FC scores used to evaluate the aligners. We introduced some noise to the networks by randomly adding and deleting 5% of the edges. In this way, for every aligner we ended up with a sample of 100 EC and FC scores for each of the alignments mus–cel, mus–sce and mus–dme. In Table 4, the mean of the EC and FC scores as well as their standard deviation are presented. Notice that the differences between the mean of the EC scores obtained by AligNet and HubAlign is around 0.3 being HubAlign the aligner with highest EC scores, while the differences between the mean of the FC scores obtained by AligNet and PINALOG is at maximum 0.05 being PINALOG the aligner with highest FC scores but the lowest EC scores. Thus, the goal of AligNet is accomplished since it clearly obtained the best balance between EC and FC scores. Also, to visualise the scores distribution, in Figures 11,12 and 13 we present the results considering violin plots. In violin plots we can observe the probability density of the EC and FC scores as well as all the data that is shown in a box plot. As we can observe in these figures, HubAlign and L-GRAAL obtained the highest EC scores but the lowest FC scores in contrast to PINALOG that obtained the lowest EC scores but the highest FC scores. Notice that, the violin’s shapes show the scores distribution, that is, flat and wide violins indicate that most of the values are near to the mean in contrast to vertical and narrow violins where the values are dispersed away from the mean. There is only a vertical violin corresponding to the EC scores in the alignments of L-GRAAL between mus and sce. This entails that except for this vertical violin case, small perturbations of the real networks produced small variations of the EC and FC scores.

In the second test we evaluated the alignment of protein complexes. We used the complex functional coherence value (CFC) to measure the quality of an alignment in terms of its behaviour on protein complexes. The CFC score is defined as the ratio of complexes that are aligned correctly with respect to the aligned complexes. In Table 5 and Fig. 14 we show the results obtained by all the aligners. AligNet obtained the highest CFC value (25.34) followed by PINALOG (24.48) and HubAlign but L- GRAAL obtained a very low CFC value (5, 4.75 resp.). In order to further compare the results obtained by AligNet on protein complexes with those of the others aligners, we counted, for each other aligner A, the complexes that were not aligned either by AligNet or by A; the coherent and incoherent pairs among those complexes that were aligned by AligNet but not by A; and the coherent and incoherent pairs among those complexes that were aligned by A but not by AligNet. The results are given in Table 6 and Fig. 15. In its first two numerical columns we can see that 891 complexes were not aligned neither by AligNet nor by HubAlign; 263 complexes were aligned by AligNet but not by HubAlign, of which 88 were correctly aligned (coherent pairs) and 175 were incorrectly aligned (by AligNet); and 378 complexes were aligned by HubAlign but not by AligNet, of which 21 were correctly aligned and 357 were incorrectly aligned (by HubAlign). Therefore, HubAlign aligned more complexes than AligNet, but AligNet achieved a higher precision in the alignment of complexes than HubAlign: 33.5% vs 5.6%. In a similar way, AligNet also showed a higher precision than L-GRAAL and a slightly higher precision than PINALOG (19.2% vs 17.4%), although PINALOG aligned more complexes than AligNet. Our interpretation is that AligNet is more conservative than PINALOG.

In the third test we evaluated the alignment of essential proteins in the human and yeast PPINs. We evaluated the aligners performance assuming that essential proteins must be aligned to essential proteins and we compute seven statistical measures. In Fig. 16 we show the values for each statistical measure obtained for every aligner. As we can observe there, all aligners have a similar value of accuracy and balanced accuracy. Concerning specificity, precision and F₁-score, HubAlign obtained the lowest value while the others aligners are comparable. The highest proficiency and MCC values were obtained by AligNet while the lowest one was obtained by PINALOG.

Finally, one of the weak points of PPIN aligners is their lack of efficiency. Indeed, as we have already mentioned, although NATALIE was suggested as a good aligner, it could not align the two smallest networks, C. elegans and D. melanogaster, on a computer with 64 GB of RAM. With respect to PINALOG, SPINAL, HubAlign and L-GRAAL, we were able to complete all the alignments. In order to visualize their running times, we show the running time of every finished computation for each aligner in Fig. 17. We can observe there that SPINAL is, with a big difference, the slowest one to compute the alignment between H. sapiens and S. cerevisiae, and also between D. melanogaster and S. cerevisiae. PINALOG is the slowest one, also with a big difference, to compute the alignment between C. elegans and H. sapiens, as well as the alignment between H. sapiens and M. musculus. AligNet is considerably faster than PINALOG and SPINAL, with a running time of less than 1,000 seconds in most of the alignments. Only in one computation, the alignment between D. melanogaster and H. sapiens, AligNet is slower than PINALOG and SPINAL, and the difference is less than 2,000 seconds. Because SPINAL needed more than 20,000 seconds for the alignment between S. cerevisiae and H. sapiens, it is difficult to see the running times in some alignments. Thus, in order to visualize the results in the cases where the aligners consumed less than 3,500 seconds, we show in Fig. 18 the running times cutting at ten minutes. We observe there that HubAlign is the fastest aligner. Thus, we conclude that HugAlign is the fastest aligner followed by AligNet.

We also present the running times ordered by the networks size in Fig. 19. It should be expected that the running time increases as so do the networks, and this is the case of AligNet. However, this is not the case of L-GRAAL, SPINAL and HubAlign. Actually, L-GRAAL shows an unpredictable running time related with the networks size. In sum, HubAlign is clearly the faster aligner but the correlation between the networks size and running times is not clear. PINALOG presents a correlation between networks sizes and running times but it is the slowest aligner. And AligNet present the strongest correlation between running time and networks size and it is faster than PINALOG.

Conclusions

In this paper we present AligNet, a new method and software tool for the pairwise global alignment of PPIN aimed to produce biologically meaningful alignments by achieving a good balance between structural matching and protein function conservation. AligNet is a parameter-free algorithm that, given two PPIN, produces a consistent alignment from the smaller network, in terms of number of nodes, to the larger network. Its implementation in R is freely available from https://github.com/biocom-uib/AligNet.

In order to assess the correctness of AligNet, we have evaluated the quality of the alignments obtained with it and with the 4 best aligners established in [12, 13], namely: PINALOG, SPINAL, HubAlign, and L-GRAAL. As a result of the comparison between the aligners, we obtained again, as it was the case in [12, 13], that the agreement of the alignments obtained with different aligners is very low. Most global aligners achieved a high node coverage, meaning that the average number of assigned nodes in the source network is high, but all of them obtained a very low biological coherence value. With respect to the topological coherence value, some aligners were able to obtain a high score but it was associated with a low biological coherence score. Overall, we can conclude that AligNet is the aligner that obtained the best balance between topological coherence (it preserves 60% of the edges) and functional coherence (relative function coherence values between 20% and 40% and the highest complex functional coherence score, 25.34). PINALOG obtained similar functional coherence scores than those of AligNet, lower topological coherence scores and the lowest proficiency value. HubAlign and L-GRAAL obtained high topological coherence scores but very low CFC values. SPINAL surprisingly obtained a very low topological coherence value. Thus, we recommend Alignet to preserve the biological function, and to preserve the topological structure.

Abbreviations

PPIN:: Protein-Protein Interaction Networks
EC:: Edge Correctness
FC:: Functional Coherence
CFC:: Complex Functional Coherence
CP:: Coherent Pairs
NCP:: Incoherent Pairs
TP:: True Positive
FP:: False Positive
TN:: True Negative
FN:: False Negative
MCC:: Matthews Correlation Coefficient
UC:: Uncertainty Coefficients

References

Kelley BP, Yuan B, et al.PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Res. 2004; 32(Web Server issue):W83–88.
Article CAS PubMed PubMed Central Google Scholar
Koyutürk M, Kim Y, et al.Pairwise alignment of protein interaction networks. J Comput Biol. 2006; 13(2):182–199.
Article PubMed Google Scholar
Li Z, Wang Y, et al.Alignment of protein interaction networks by integer quadratic programming. In: 2006 International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE: 2006. p. 5527–30.
Liang Z, Xu M, Teng M, Niu L. NetAlign: a web-based tool for comparison of protein interaction networks. Bioinformatics. 2006; 22(17):2175–7.
Article CAS PubMed Google Scholar
Narayanan M, Karp RM. Comparing protein interaction networks via a graph match-and-split algorithm. J Comput Biol. 2007; 14(7):892–907.
Article CAS PubMed Google Scholar
Elmsallati A, Clark C, Kalita J. Global alignment of protein-protein interaction networks: A survey. IEEE/ACM Trans Comput Biol Bioinforma. 2016; 13(4):689–705.
Article CAS Google Scholar
Aladaǧ AE, Erten C. SPINAL: Scalable protein interaction network alignment. Bioinformatics. 2013; 29(7):917–24.
Article PubMed CAS Google Scholar
Hashemifar S, Xu J. HubAlign: an accurate and efficient method for global alignment of protein-protein interaction networks. Bioinformatics. 2014; 30(17):i438–44.
Article CAS PubMed PubMed Central Google Scholar
Neyshabur B, Khadem A, Hashemifar S, Arab SS. NETAL: a new graph-based method for global alignment of protein-protein interaction networks. Bioinformatics. 2013; 29(13):1654–62.
Article CAS PubMed Google Scholar
Patro R, Kingsford C. Global network alignment using multiscale spectral signatures. Bioinformatics. 2012; 28(23):3105–14.
Article CAS PubMed PubMed Central Google Scholar
Singh R, Xu J, Berger B. Global alignment of multiple protein interaction networks with application to functional orthology detection. PNAS. 2008; 105(35):12763–8.
Article CAS PubMed Google Scholar
Clark C, Kalita J. A comparison of algorithms for the pairwise alignment of biological networks. Bioinformatics. 2014; 30(16):2351–9.
Article CAS PubMed Google Scholar
Malod-Dognin N, Ban K, Pržulj N. Unified alignment of protein-protein interaction networks. Sci Rep. 2017; 7(953).
Alain FZ, Elena NI, Erik HWG. Meesters. A Beginner’s Guide to R: Springer; 2009.
Camacho C, Coulouris G, et al.BLAST+: architecture and applications. BMC Bioinformatics. 2009; 10(1):1.
Article CAS Google Scholar
Kuhn HW. The Hungarian method for the assignment problem. Naval Res Logist. 2005; 52(1):7–21.
Article Google Scholar
Borndörfer R, Heismann O. The hypergraph assignment problem. Discrete Optim. 2015; 15:15–25.
Article Google Scholar
Klau GW. A new graph-based method for pairwise global network alignment. BMC Bioinformatics. 2009; 10(1):S59.
Article PubMed PubMed Central Google Scholar
Malod-Dognin N, Pržulj N. L-GRAAL: Lagrangian graphlet-based network aligner. Bioinformatics. 2015; 31(13):2182–9.
Article CAS PubMed PubMed Central Google Scholar
TPhan HT, Sternberg MJE. PINALOG: A novel approach to align protein interaction networks—implications for complex detection and function prediction. Bioinformatics. 2012; 28(9):1239–45.
Article CAS Google Scholar
Park D, Singh R, et al.IsoBase: a database of functionally related proteins across PPI networks. Nucleic Acids Res. 2011; 39(suppl 1):D295–D300.
Article CAS PubMed Google Scholar
Ruepp A, Brauner B, et al.CORUM: the comprehensive resource of mammalian protein complexes. Nucleic Acids Res. 2008; 36(suppl 1):D646–D650.
CAS PubMed Google Scholar
Gavin A-C, Aloy P, et al.Proteome survey reveals modularity of the yeast cell machinery. Nature. 2006; 440(7084):631–6.
Article CAS PubMed Google Scholar
Ruepp A, Zollner A, et al.The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res. 2004; 32(18):5539–45.
Article CAS PubMed PubMed Central Google Scholar
Luo H, Lin Y, Gao F, Zhang C-T, Zhang R. DEG 10, an update of the Database of Essential Genes that includes both protein-coding genes and non-coding genomic elements. Nucleic Acids Res. 2014; 42:D574–D580.
Article CAS PubMed Google Scholar

Download references

Acknowledgements

We thank Gabriel Riera for the technical support.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 21 Supplement 6, 2020: Selected articles from the 15th International Symposium on Bioinformatics Research and Applications (ISBRA-19): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-21-supplement-6.

Funding

This research was partially supported by the Spanish Ministry of Science, Innovation and Universities and the European Regional Development Fund through projects DPI2015-67082-P and PGC2018-096956-B-C43 (FEDER/ MICINN/AEI). Publication costs are funded by Spanish Ministry of Economy and Competitiveness and European Regional Development Fund project PGC2018-096956-B-C43 (FEDER/ MICINN/AEI). The funding body did not play any roles in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Author information

Authors and Affiliations

Department of Mathematics and Computer Science, University of the Balearic Islands, Palma de Mallorca, E-07122, Spain
Adrià Alcalá, Ricardo Alberich, Mercè Llabrés & Francesc Rosselló
Balearic Islands Health Research Institute (IdISBa), Palma de Mallorca, E-07010, Spain
Adrià Alcalá, Ricardo Alberich, Mercè Llabrés & Francesc Rosselló
Algorithms, Bioinformatics, Complexity and Formal Methods Research Group, Technical University of Catalonia, Barcelona, E-08034, Spain
Gabriel Valiente

Authors

Adrià Alcalá
View author publications
You can also search for this author in PubMed Google Scholar
Ricardo Alberich
View author publications
You can also search for this author in PubMed Google Scholar
Mercè Llabrés
View author publications
You can also search for this author in PubMed Google Scholar
Francesc Rosselló
View author publications
You can also search for this author in PubMed Google Scholar
Gabriel Valiente
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

ML, FR and GV conceived and coordinated the study, performed data analysis and drafted the manuscript. RA and AA performed all the bioinformatic analyses. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Mercè Llabrés.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Alcalá, A., Alberich, R., Llabrés, M. et al. AligNet: alignment of protein-protein interaction networks. BMC Bioinformatics 21 (Suppl 6), 265 (2020). https://doi.org/10.1186/s12859-020-3502-1

Download citation

Received: 13 August 2019
Accepted: 16 April 2020
Published: 18 November 2020
DOI: https://doi.org/10.1186/s12859-020-3502-1

Selected articles from the 15th International Symposium on Bioinformatics Research and Applications (ISBRA-19): bioinformatics

AligNet: alignment of protein-protein interaction networks