A new graph-based method for pairwise global network alignment
- Gunnar W Klau^{1}Email author
Affiliated with
DOI: 10.1186/1471-2105-10-S1-S59
© Klau. 2009
Published: 30 January 2009
Advertisement
DOI: 10.1186/1471-2105-10-S1-S59
© Klau. 2009
Published: 30 January 2009
In addition to component-based comparative approaches, network alignments provide the means to study conserved network topology such as common pathways and more complex network motifs. Yet, unlike in classical sequence alignment, the comparison of networks becomes computationally more challenging, as most meaningful assumptions instantly lead to NP-hard problems. Most previous algorithmic work on network alignments is heuristic in nature.
We introduce the graph-based maximum structural matching formulation for pairwise global network alignment. We relate the formulation to previous work and prove NP-hardness of the problem.
Based on the new formulation we build upon recent results in computational structural biology and present a novel Lagrangian relaxation approach that, in combination with a branch-and-bound method, computes provably optimal network alignments. The Lagrangian algorithm alone is a powerful heuristic method, which produces solutions that are often near-optimal and – unlike those computed by pure heuristics – come with a quality guarantee.
Computational experiments on the alignment of protein-protein interaction networks and on the classification of metabolic subnetworks demonstrate that the new method is reasonably fast and has advantages over pure heuristics. Our software tool is freely available as part of the LISA library.
In systems biology, complex biological systems are often modeled as networks. Examples include protein-protein interaction (PPI), metabolic, gene-regulatory, and signal transduction networks. The increasing quality and quantity of available data creates the need for automated analysis methods to better understand cellular processes, network organization, evolutionary changes, and disease mechanisms [1, 2]. Based on the assumption that evolutionary conservation implies functional significance, comparative approaches may help improve the accuracy of data, elucidate protein pathways and complexes, generate, investigate, and validate hypotheses about the underlying networks, and transfer functional annotations. In addition to component-based comparative approaches, network alignments provide the means to study conserved network topology such as common pathways and more complex network motifs. Yet, unlike in classical sequence alignment, the comparison of networks becomes computationally more challenging, as most meaningful assumptions instantly lead to NP-hard problems.
One of the first contributions to automatic biological network alignment is [3], where the authors introduce a concept later called global alignment graph and find functionally related enzyme clusters in metabolic networks using a simple heuristic. Kelley et al. [4] formalize the concept and present the PATHBLAST algorithm, which heuristically finds high-scoring common paths in two protein-protein interaction networks using randomized dynamic programming. Detecting more complex shared topologies has been addressed by Sharan et al. [5], where the authors introduce a probabilistic model for protein complexes and propose a heuristic greedy approach to search for dense subgraphs in the global alignment graph, which correspond to significant shared complexes in the original PPI networks. Koyutürk et al. [6] also use the global alignment graph with a more elaborate scoring scheme to compute pairwise alignments of PPI networks. Narayanan and Karp [7] compare two PPI networks using a different model based on a graph-matching algorithm. They restrict the structural conservation to the environment of a node and thus achieve a polynomial running time.
While most of the above approaches aim at computing local alignments, a recent method by Singh et al. [8] focuses explicitly on computing global alignments between protein interaction networks. They heuristically approach the problem by preferably matching nodes which have a similar neighborhood, which they encode as an eigenvalue problem.
For multiple network alignment, the method from [5] has been adapted in [9]. Koyutürk et al. [10] determine multiple alignments by contracting the global alignment graph and then applying algorithms from frequent itemset extraction. Jaeger and Leser [11] determine conserved subgraphs among k PPI networks using a heuristic for multidimensional matching in a k-partite graph that results from linking each protein to its best ortholog match candidate in each of the other networks. The GRAEMLIN algorithm [12] uses local search to construct a global multiple alignment. Singh et al. have adapted their method for the multiple case [13].
In this paper, we introduce the maximum structural matching formulation for global network alignment and show its relation to the global alignment graph. We derive integer linear programming formulations for the maximum structural matching problem and a Lagrangian relaxation algorithm based on these formulations. To our knowledge, this is the first contribution to the relatively young field of biological network alignment that does not approach the problem heuristically. Still, our computational results indicate that the Lagrangian approach is reasonably fast to provably optimally align even large networks. We present preliminary results from two ongoing proof-of-concept studies, where we use the method to globally align protein-protein-interaction networks and to classify metabolic subnetworks.
Note that this is a methodological paper whose purpose is to introduce the new approach with mathematical rigor. The two proof-of-concept studies demonstrate the potential of the method in practice. However, a detailed comparison to other methods is beyond the scope of this article and will be carried out as future work.
In this section we give a formal definition of network alignment. We define the global pairwise network alignment problem and present a graph-theoretical reformulation, which is an extension of the maximum weight trace formulation, which has been proposed by Kececioglu for classical sequence alignment [14]. Furthermore, we relate our definition to previous work.
In analogy to the classical sequence case, we define a pairwise alignment of two networks as follows. Note that this definition is already quite close to the formulation presented later in this section and can readily be extended to multiple network alignment. Let "-" denote the gap symbol.
Note that in contrast to sequence alignments, network alignments do not have to respect an inherent sequential order of the objects to align.
where σ: V _{1} × V _{2} → ℝ^{≥0} gives the score of mapping individual nodes onto each other and τ: V _{1} × V _{2} × V _{1} × V _{2} → ℝ^{≥0} gives the score of mapping pairs of nodes onto each other.
Given these definitions, we are able to define the network alignment problem formally:
Definition 3 (Pairwise global network alignment). Given two networks G _{1} = (V _{1}, E _{1}) and G _{2} = (V _{2}, E _{2}) and a scoring function s as defined in Def. 2, the pairwise global network alignment problem asks for a highest-scoring alignment A* of G _{1} and G _{2}, that is, s(A), where denotes the set of all possible alignments of G _{1} and G _{2}.
Theorem 1. The pairwise global network alignment problem as defined in Def. 3 is NP-hard.
Proof. It is easy to see that the pairwise network alignment problem is in NP, since a non-deterministic algorithm needs only guess the best alignment a. We prove NP-hardness by a simple reduction from the maximum common subgraph problem. A common subgraph of two graphs G _{1} = (V _{1}, E _{1}) and G _{2} = (V _{2}, E _{2}) is characterized by subsets E _{1}' ⊆ E _{1} and E _{2}' ⊆ E _{2} such that the two subgraphs = (V _{1}', E _{1}') and = (V _{2}', E _{2}') are isomorphic, where V _{1}' and V _{2}' denote the vertices that are the endpoints of edges in E _{1}' and E _{2}', respectively. A maximum common subgraph is a common subgraph with the maximum number of edges, and its computation is a well-known NP-hard problem [16].
We can solve the maximum subgraph problem with an algorithm for network alignment by simply using the following scoring function:σ(i, j) = 0 for all i ∈ V _{1}, j ∈ V _{2}
A best network alignment will then correspond to a maximum common subgraph. □
The above definition of network alignment is very close to the notion of trace as introduced by Kececioglu for classical sequence alignment [14]. We give an analogous definition for the alignment of networks:
Definition 4 (Alignment graph). Given two networks G _{1} = (V _{1}, E _{1}) and G _{2} = (V _{2}, E _{2}), the alignment graph A is a complete bipartite edge-weighted graph with vertex set V _{1} ∪ V _{2}. The weight of an edge e = (i, j) with i ∈ V _{1} and j ∈ V _{2} is w(e) = σ(i, j) and represents the gain of aligning the endpoints of the edge.
Observation 1. There is a one-to-one correspondence between matchings in the alignment graph and network alignments.
The alignment graph provides an alternative way to represent an alignment of the nodes in a network. Yet, in the basic version we are unable to deal with structural conservation. Therefore we introduce the concept of structural matches, which have already been used for structural alignment, where they are referred to as interaction matches [15].
Definition 5 (Structural match). A structural match is a pair of alignment edges (i, j), (k, l) in the alignment graph. We say that a network alignment realizes a match (i, j), (k, l) if it realizes both alignment edges (i, j) and (k, l).
We are now able to reformulate the pairwise global network alignment problem as a combinatorial problem in the alignment graph. Let > denote an arbitrary order of the edges in A.
The maximum structural matching problem asks for a highest-scoring structural matching.
Observation 1 straightforwardly extends to the structural case and yields the following result.
Lemma 1. Consider a network alignment a and the matching M it realizes in the alignment graph. Then we have s(a) = s(M).
This allows us to concentrate on the alignment graph to find the best pairwise global network alignment. In the next section, we present an integer linear programming approach to determine a maximum structural matching in a bipartite graph.
Note that our definition of alignment graph is different, but in a sense equivalent, to the global alignment graph concept used in the PATHBLAST algorithm [4] and first introduced in [3]. The following observation relates the two concepts for the case of pairwise alignment; the multiple case is analogous. The global alignment graph contains weighted nodes for pairs of nodes in the original networks – which correspond to the alignment edges in our bipartite alignment graph – and weighted edges represent conserved interactions, gaps, or mismatches – which correspond to structural matches in our definition. Weights of nodes and edges correspond to the weights of alignment edges and structural matches, respectively. Determining clique-like heavy subgraphs in the global alignment graph – for which several heuristics have been presented – is equivalent to our definition of network alignment as a maximum structural matching in our alignment graph. We nevertheless prefer our alternative definition, because it allows us to employ the well-studied field of matchings in bipartite graphs as the next sections will show.
We can straightforwardly cast the maximum structural matching problem as a non-linear integer program.
We now apply variable splitting or Lagrangian decomposition, a well-known technique in mathematical programming [18], to build a good basis for a Lagrangian approach. In computational biology, this technique has already been successfully applied to the maximum contact map overlap problem in computational structural proteomics [19] and to structural RNA alignment [15].
The following result allows us to concentrate on solving the ILP (11)–(16).
Theorem 2. A feasible solution respecting the constraints of ILP (11)–(16) corresponds to a structural matching in the alignment graph whose score is equal to the score of the objective function.
Proof. Let (x, ) be a feasible solution of the ILP. Clearly, x represents a network alignment. Now consider a variable with = 1. Inequality (13) ensures that the first half of the match, namely, (i, j), is realized, whereas the second half is taken care of by equality (14) in combination with inequality (13). Thus, the solution corresponds to a structural matching, the score of which, due to property (10), clearly equals the score of (11). For the other direction of the proof, setting the variables x and y according to the characteristic vectors of a structural matching does not violate any of the constraints. Again, it is easy to see that the structural score of the matching and the objective function value coincide. □
Inspired by recent successes in solving similar integer linear programs using Lagrangian relaxation, we propose to employ this approach to find provably optimal and near-optimal solutions of ILP (11)–(16).
A fundamental result in mathematical optimization says that for each choice of penalty terms λ, each solution of the relaxed problem provides an upper bound for the original problem. Naturally, we are interested in the tightest such bound.
The penalty vectors in (22) change the weights of the structural matches and, intuitively, can be used to force pairs of complementary directed structural match variables to agree on their choices. We employ subgradient optimization for this task and find the Lagrangian multipliers that yield the lowest upper bound. Subgradient optimization is an iterative process that involves solving the relaxed problem over and over again, see, for example [20] for a detailed description of the method. The following result implies that we can do this efficiently.
Theorem 3. The relaxed problem can be reduced to the bipartite matching problem and can be solved in polynomial time.
Proof. The proof is similar to the one given in [19] for the contact map overlap problem and rests upon the observation that each directed structural match variable can be assigned unambiguously to an alignment variable – unlike in the undirected, original case. We can therefore concentrate on the alignment variables x. If such a variable x _{ ij }is zero, then its contribution to the objective function is zero as well, since all incident directed structural match variables are forced to zero due to constraint (19). If, on the other hand, an edge (i, j) is part of the solution, we can compute its contribution to the objective function, or its profit, in polynomial time as follows: we assign the weight (i, j, k, l) + λ _{ ijkl }to each edge (l, m) in the alignment graph and compute the profit p _{ ij }of edge (i, j) via a maximum bipartite matching according to these weights. The resulting matching corresponds to the best case that may happen if alignment edge (i, j) is part of the solution.
To compute the overall best solution, we choose those alignment edges that give the best network alignment according to their profits p. Again, this is a bipartite matching problem, which can be solved in polynomial time. □
Theorem 3 gives us a good upper bound. In order to find good lower bounds, we analyze the network alignment given by the solution of each relaxed problem and compute the best structural completion, yielding a feasible solution for the original problem. Given a matching M, we simply add all structural matches (i, j), (k, l) with both (i, j) ∈ M and (k, l) ∈ M.
Let u*, l* be the best upper and lower bounds found by our algorithm, respectively, and let (x*, y*) be the best solution it finds. Our algorithm for network alignment is then as follows:
Initialization;
λ = 0; u* = ∞; l* = -∞;
Main Loop;
repeat
x = solution of relaxed problem with value u;
adapt Lagrangian multipliers;
compute structural completion (x, y) of x with value l;
if u <u* then u* = u;
if l >l* then
l* = l;
(x*, y*) = (x, y);
until l* = u* or some termination criteria are met;
As the structural matching problem is NP-hard, there will, in the general case, be a duality gap unless P equals NP. In other words, there will be instances for which u* and l* will not coincide. Therefore we define some additional termination criteria like, for example, a maximal number of iterations. Although the possible duality gap makes our algorithm heuristic in nature, it nevertheless comes with a quality guarantee due to the computation of the upper bound. Often this bound is quite good, and then it is fair to say that our algorithm efficiently computes provably near-optimal solutions. In addition, it is straightforward to embed the Lagrangian approach into a branch-and-bound framework resulting in a truly exact approach for the network alignment problem, which will then, of course, take exponential time to finish for some instances.
We have implemented that Lagrangian algorithm for network alignment described in the previous section and offer it as the freely available software tool NATALIE within the PLANET LISA framework. PLANET LISA is a library of algorithms for computational structural and systems biology, which has initially been created to facilitate computational structural comparisons of RNA molecules and proteins [21]. In its basic version, NATALIE reads in two graphs in GraphML format [22] as well as additional information that determine the σ- and τ-parts in the scoring function depending on the application.
The purpose of this paper is to introduce the new method; we have not yet performed a detailed comparative study including other tools, which we plan to carry out as future work. We present, however, preliminary results from two ongoing projects that utilize the NATALIE algorithm. These studies demonstrate that the method works well in practice and has a high potential to become a very competitive tool in the area of network alignment.
In a cooperation with the Knowledge Management in Bioinformatics group of the Humboldt-Universität Berlin we use NATALIE to align protein-protein interaction networks based on orthology information about proteins in different species.
Number of potential orthologs. Average number of potential orthologs as compared to H. sapiens.
Species | n | m | ∅ cand. compared to H. sapiens |
---|---|---|---|
H. sapiens | 9 695 | 34 979 | n/a |
M. musculus | 3 247 | 3 116 | 5.47 |
D. melanogaster | 10 232 | 41 332 | 2.87 |
S. cerevisiae | 5 864 | 25 527 | 2.85 |
Comparison of H. sapiens against other species. Results of comparing the PPI network of H. sapiens against other species. The entries in the table denote the instance, the value of the best solution found, the value of the upper bound, and the resulting quality guarantee.
H. sapiens vs. | best solution | upper bound | guarantee |
---|---|---|---|
M. musculus | 1 087 | 1 087 | 100.00% |
D. melanogaster | 284 | 285 | 99.65% |
S. cerevisiae | 431 | 431 | 100.00% |
A common way to represent the topology of a metabolic network is its stoichiometric matrix, which characterizes the system of homogeneous linear equations that describe the network of biochemical reactions at steady state. Together with the Computational Systems Biochemistry group at Charité, Berlin, we investigate randomization models for a given metabolic network.
We therefore consider environments of different sizes for each reaction in the network and classify the resulting subnetworks according to their topological equivalence. Two reaction environments are topologically equivalent if the induced stoichiometric submatrices are permutation-equivalent, that is, one can be transformed into the other only by permuting its rows and columns. Then, randomized networks can be generated by swapping reaction environments that exhibit the same topology.
Classification of metabolic subnetworks. Number of comparisons for the classification of metabolic subnetworks depending on different reaction environment sizes s.
s | E. coli | S. cerevisiae |
---|---|---|
1 | 114 172 | 87 863 |
2 | 423 956 | 490 528 |
3 | 78 680 948 | 122 067 031 |
For each comparison, NATALIE has to decide whether the two subnetworks are topologically equivalent or not. Although, in the current version, it takes about two weeks to do the computations, NATALIE finds the correct answer for all of the more than 200 million comparisons and thus correctly computes the equivalence classes. In this application, the quality guarantee of the Lagrangian approach is indispensable, and the same results could not have been computed with a purely heuristic method. Yet, they could have been obtained probably much faster using a tailor-made algorithm for detecting graph isomorphisms. We plan, however, to develop a similarity metric between stoichiometric matrices based on the maximum common subgraph of their corresponding labeled graphs and have therefore used our novel approach, which has been proven efficient enough for this application. The details of this study will be described elsewhere.
We believe that the maximum structural matching formulation and our algorithmic contribution is a first step towards a very competitive framework for network alignment, query, and comparison problems. We see perspectives for many interesting research directions. As the formulation as well as the algorithm can deal with multiple alignments, we plan to adapt the concepts to the multiple case. For practical purposes, a progressive alignment method seems to be appropriate for which an adequate consensus concept has to be developed. Moreover, the analogy to classical sequence alignment suggests to investigate local network alignments, where a first step consists in computing maximum connected motifs.
As the formulation is very flexible, it can easily be adapted to any type of undirected or directed, labeled or unlabeled, and weighted or unweighted network. It can be used for answering network queries as well as for detecting repeated motifs in a single network.
Clearly, a good search procedure is only one component in a successful alignment framework. The analogy to sequence alignment suggests that a lot of further research has to go into development and evaluation of suitable scoring functions and into statistical analysis of the results. This more statistically-oriented line of research will be different for each of the numerous applications for network alignment in computational biology. Currently, we address these topics in the ongoing projects, the alignment of PPI networks and the comparison of metabolic networks. Likewise, a visualization of the results is an important research topic. Here, we envision an integration into the CYTOSCAPE software [24].
GWK thanks the Knowledge Management in Bioinformatics group of Humboldt-Universität Berlin and, in particular, Samira Jaeger for providing the PPI network comparison data and conversion into NATALIE format and the Computational Systems Biochemistry group at Charité, Berlin, and, in particular, Carola Huthmacher, for providing the data for the computation of stoichiometric equivalence of metabolic subnetworks.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.