A new graph-based method for pairwise global network alignment

Background In addition to component-based comparative approaches, network alignments provide the means to study conserved network topology such as common pathways and more complex network motifs. Yet, unlike in classical sequence alignment, the comparison of networks becomes computationally more challenging, as most meaningful assumptions instantly lead to NP-hard problems. Most previous algorithmic work on network alignments is heuristic in nature. Results We introduce the graph-based maximum structural matching formulation for pairwise global network alignment. We relate the formulation to previous work and prove NP-hardness of the problem. Based on the new formulation we build upon recent results in computational structural biology and present a novel Lagrangian relaxation approach that, in combination with a branch-and-bound method, computes provably optimal network alignments. The Lagrangian algorithm alone is a powerful heuristic method, which produces solutions that are often near-optimal and – unlike those computed by pure heuristics – come with a quality guarantee. Conclusion Computational experiments on the alignment of protein-protein interaction networks and on the classification of metabolic subnetworks demonstrate that the new method is reasonably fast and has advantages over pure heuristics. Our software tool is freely available as part of the LISA library.

computationally more challenging, as most meaningful assumptions instantly lead to NP-hard problems.

Previous work
One of the first contributions to automatic biological network alignment is [3], where the authors introduce a concept later called global alignment graph and find functionally related enzyme clusters in metabolic networks using a simple heuristic. Kelley et al. [4] formalize the concept and present the PATHBLAST algorithm, which heuristically finds high-scoring common paths in two protein-protein interaction networks using randomized dynamic programming. Detecting more complex shared topologies has been addressed by Sharan et al. [5], where the authors introduce a probabilistic model for protein complexes and propose a heuristic greedy approach to search for dense subgraphs in the global alignment graph, which correspond to significant shared complexes in the original PPI networks. Koyutürk et al. [6] also use the global alignment graph with a more elaborate scoring scheme to compute pairwise alignments of PPI networks. Narayanan and Karp [7] compare two PPI networks using a different model based on a graph-matching algorithm. They restrict the structural conservation to the environment of a node and thus achieve a polynomial running time.
While most of the above approaches aim at computing local alignments, a recent method by Singh et al. [8] focuses explicitly on computing global alignments between protein interaction networks. They heuristically approach the problem by preferably matching nodes which have a similar neighborhood, which they encode as an eigenvalue problem.
For multiple network alignment, the method from [5] has been adapted in [9]. Koyutürk et al. [10] determine multiple alignments by contracting the global alignment graph and then applying algorithms from frequent itemset extraction. Jaeger and Leser [11] determine conserved subgraphs among k PPI networks using a heuristic for multidimensional matching in a k-partite graph that results from linking each protein to its best ortholog match candidate in each of the other networks. The GRAEMLIN algorithm [12] uses local search to construct a global multiple alignment. Singh et al. have adapted their method for the multiple case [13].

Contribution
In this paper, we introduce the maximum structural matching formulation for global network alignment and show its relation to the global alignment graph. We derive integer linear programming formulations for the maximum structural matching problem and a Lagrangian relaxation algorithm based on these formulations. To our knowledge, this is the first contribution to the relatively young field of biological network alignment that does not approach the problem heuristically. Still, our computational results indicate that the Lagrangian approach is reasonably fast to provably optimally align even large networks. We present preliminary results from two ongoing proof-of-concept studies, where we use the method to globally align protein-protein-interaction networks and to classify metabolic subnetworks.
Note that this is a methodological paper whose purpose is to introduce the new approach with mathematical rigor. The two proof-of-concept studies demonstrate the potential of the method in practice. However, a detailed comparison to other methods is beyond the scope of this article and will be carried out as future work.

A combinatorial formulation for network alignment
In this section we give a formal definition of network alignment. We define the global pairwise network alignment problem and present a graph-theoretical reformulation, which is an extension of the maximum weight trace formulation, which has been proposed by Kececioglu for classical sequence alignment [14]. Furthermore, we relate our definition to previous work.
In analogy to the classical sequence case, we define a pairwise alignment of two networks as follows. Note that this definition is already quite close to the formulation presented later in this section and can readily be extended to multiple network alignment. Let "-" denote the gap symbol.
Definition 1 (Network alignment). Given two networks G 1 = (V 1 , E 1 ) and G 2 = (V 2 , E 2 ), a network alignment a: Note that in contrast to sequence alignments, network alignments do not have to respect an inherent sequential order of the objects to align. Definition 2 (Score). The score of a network alignment a: gives the score of mapping individual nodes onto each other and τ: This definition allows a quite flexible modeling of scores, which may be used to express mismatches and gaps, and which can also be based on additional information, such as, for example, edge weights. Typically, the σ-part of the scoring function will be based on pairwise similarity of the objects represented by the nodes and will assign, say, similar proteins in two protein-protein interaction networks a high score, whereas the τ-part will reward conserved interactions between pairs of nodes. Note that the definition is similar to structural alignment scoring functions as, for example, used to compare RNA molecules [15]. Figure 1 illustrates the definitions.
Given these definitions, we are able to define the network alignment problem formally:  Proof. It is easy to see that the pairwise network alignment problem is in NP, since a non-deterministic algorithm needs only guess the best alignment a. We prove NP-hardness by a simple reduction from the maximum common subgraph problem. A common subgraph of two graphs G 1 = (V 1 , E 1 ) and G 2 = (V 2 , E 2 ) is characterized by subsets E 1 ' ⊆ E 1 and E 2 ' ⊆ E 2 such that the two subgraphs = (V 1 ', denote the vertices that are the endpoints of edges in E 1 ' and E 2 ', respectively. A maximum common subgraph is a common subgraph with the maximum number of edges, and its computation is a well-known NP-hard problem [16].
We can solve the maximum subgraph problem with an algorithm for network alignment by simply using the following scoring function: and A best network alignment will then correspond to a maximum common subgraph. ᮀ The above definition of network alignment is very close to the notion of trace as introduced by Kececioglu for classical sequence alignment [14]. We give an analogous definition for the alignment of networks: Definition 4 (Alignment graph). Given two networks G 1 = (V 1 , E 1 ) and G 2 = (V 2 , E 2 ), the alignment graph A is a complete bipartite edge-weighted graph with vertex set V 1 ∪ V 2 . The weight of an edge e = (i, j) with i ∈ V 1 and j ∈ V 2 is w(e) = σ(i, j) and represents the gain of aligning the endpoints of the edge. Figure 2 shows the alignment graph for the instance given in Fig. 1. In analogy to the sequence case, we say that a network alignment a realizes an edge Similar to the trace formulation we strive to establish a connection between an alignment a and the alignment graph A. As the order of the vertices does not play a role, this connection is precisely characterized by the graphtheoretical concept of matching. A matching in a graph is a subset of its edges such that no two chosen edges share a common endpoint.

Observation 1.
There is a one-to-one correspondence between matchings in the alignment graph and network alignments.
The alignment graph provides an alternative way to represent an alignment of the nodes in a network. Yet, in the basic version we are unable to deal with structural conservation. Therefore we introduce the concept of structural Network alignment a Figure 1 Network alignment a. A dashed arrow from a node i ∈ V 1 from the first network Unmapped vertices are mapped to gaps. The score of the alignment depends on the values given in σ and τ. For simplicity, we assume that σ(i, j) = 1 for all i ∈ V 1 and j ∈ V 2 and that τ(i, j, k, l) = 1 if (i, k) ∈ E 1 and (j, l) ∈ E 2 and τ(i, j, k, l) = 0 otherwise. This leads to a score of 4 + 5 = 9 in the example.
for and otherwise matches, which have already been used for structural alignment, where they are referred to as interaction matches [15].

Definition 5 (Structural match).
A structural match is a pair of alignment edges (i, j), (k, l) in the alignment graph. We say that a network alignment realizes a match (i, j), (k, l) if it realizes both alignment edges (i, j) and (k, l).
We are now able to reformulate the pairwise global network alignment problem as a combinatorial problem in the alignment graph. Let > denote an arbitrary order of the edges in A.
Definition 6 (Maximum structural matching). Given two networks G 1 = (V 1 , E 1 ) and G 2 = (V 2 , E 2 ) and a scoring function s, the structural score of a matching M in the alignment graph A is given by The maximum structural matching problem asks for a highest-scoring structural matching.
Observation 1 straightforwardly extends to the structural case and yields the following result.

Lemma 1. Consider a network alignment a and the matching M it realizes in the alignment graph. Then we have s(a) = s(M).
This allows us to concentrate on the alignment graph to find the best pairwise global network alignment. In the next section, we present an integer linear programming approach to determine a maximum structural matching in a bipartite graph.
Note that our definition of alignment graph is different, but in a sense equivalent, to the global alignment graph concept used in the PATHBLAST algorithm [4] and first introduced in [3]. The following observation relates the two concepts for the case of pairwise alignment; the multiple case is analogous. The global alignment graph contains weighted nodes for pairs of nodes in the original networks -which correspond to the alignment edges in our bipartite alignment graph -and weighted edges represent conserved interactions, gaps, or mismatches -which correspond to structural matches in our definition. Weights of nodes and edges correspond to the weights of alignment edges and structural matches, respectively. Determining clique-like heavy subgraphs in the global alignment graph -for which several heuristics have been presented -is equivalent to our definition of network alignment as a maximum structural matching in our alignment graph. We nevertheless prefer our alternative definition, because it allows us to employ the well-studied field of matchings in bipartite graphs as the next sections will show.

Integer linear programming formulations for network alignment
We can straightforwardly cast the maximum structural matching problem as a non-linear integer program.
For each edge (i, j) ∈ V 1 × V 2 of the alignment graph, we define a binary variable x ij with the interpretation x ij = 1 if (i, j) is part of the structural matching and x ij = 0 otherwise.
Let δ(v) denote the set of edges incident to vertex v. The formulation is then Inequalities (2) make sure that the choice of alignment edges corresponds to a matching in the bipartite graph and go back to Birkhoff's theorem [17]. Linearization leads to the following integer linear program (ILP), which forms the basis of our Lagrangian relaxation approach. We define variables y ijkl = x ij x kl and obtain Alignment graph We now apply variable splitting or Lagrangian decomposition, a well-known technique in mathematical programming [18], to build a good basis for a Lagrangian approach. In computational biology, this technique has already been successfully applied to the maximum contact map overlap problem in computational structural proteomics [19] and to structural RNA alignment [15].
We therefore split each structural variable y ijkl into two "directed" variables and and make sure that they adopt the same value. Likewise, we define new weights for the directed structural variables with the property setting . The resulting integer linear program is then: The following result allows us to concentrate on solving the ILP (11)-(16).

Theorem 2.
A feasible solution respecting the constraints of ILP (11)- (16) corresponds to a structural matching in the alignment graph whose score is equal to the score of the objective function.
Proof. Let (x, ) be a feasible solution of the ILP. Clearly, x represents a network alignment. Now consider a variable with = 1. Inequality (13) ensures that the first half of the match, namely, (i, j), is realized, whereas the second half is taken care of by equality (14) in combination with inequality (13). Thus, the solution corresponds to a structural matching, the score of which, due to property (10), clearly equals the score of (11). For the other direction of the proof, setting the variables x and y according to the characteristic vectors of a structural matching does not violate any of the constraints. Again, it is easy to see that the structural score of the matching and the objective function value coincide. ᮀ

Lagrangian relaxation for network alignment
Inspired by recent successes in solving similar integer linear programs using Lagrangian relaxation, we propose to employ this approach to find provably optimal and nearoptimal solutions of ILP (11)- (16).
Therefore, we relax constraint (14) and obtain the following relaxed problem: Here, vector λ contains the Lagrangian multipliers, which penalize the violation of (14). We exploit the fact that, in our case, λ ijkl = -λ klij and rewrite (17) . relaxed problem over and over again, see, for example [20] for a detailed description of the method. The following result implies that we can do this efficiently.
Theorem 3. The relaxed problem can be reduced to the bipartite matching problem and can be solved in polynomial time.
Proof. The proof is similar to the one given in [19] for the contact map overlap problem and rests upon the observation that each directed structural match variable can be assigned unambiguously to an alignment variableunlike in the undirected, original case. We can therefore concentrate on the alignment variables x. If such a variable x ij is zero, then its contribution to the objective function is zero as well, since all incident directed structural match variables are forced to zero due to constraint (19). If, on the other hand, an edge (i, j) is part of the solution, we can compute its contribution to the objective function, or its profit, in polynomial time as follows: we assign the weight (i, j, k, l) + λ ijkl to each edge (l, m) in the alignment graph and compute the profit p ij of edge (i, j) via a maximum bipartite matching according to these weights. The resulting matching corresponds to the best case that may happen if alignment edge (i, j) is part of the solution.
To compute the overall best solution, we choose those alignment edges that give the best network alignment according to their profits p. Again, this is a bipartite matching problem, which can be solved in polynomial time. ᮀ Theorem 3 gives us a good upper bound. In order to find good lower bounds, we analyze the network alignment given by the solution of each relaxed problem and compute the best structural completion, yielding a feasible solution for the original problem. Given a matching M, we simply add all structural matches (i, j), (k, l) with both (i, j) ∈ M and (k, l) ∈ M.
Let u*, l* be the best upper and lower bounds found by our algorithm, respectively, and let (x*, y*) be the best solution it finds. Our algorithm for network alignment is then as follows: 12 until l* = u* or some termination criteria are met; As the structural matching problem is NP-hard, there will, in the general case, be a duality gap unless P equals NP. In other words, there will be instances for which u* and l* will not coincide. Therefore we define some additional termination criteria like, for example, a maximal number of iterations. Although the possible duality gap makes our algorithm heuristic in nature, it nevertheless comes with a quality guarantee due to the computation of the upper bound. Often this bound is quite good, and then it is fair to say that our algorithm efficiently computes provably near-optimal solutions. In addition, it is straightforward to embed the Lagrangian approach into a branch-andbound framework resulting in a truly exact approach for the network alignment problem, which will then, of course, take exponential time to finish for some instances.

Results and discussion
We have implemented that Lagrangian algorithm for network alignment described in the previous section and offer it as the freely available software tool NATALIE within the PLANET LISA framework. PLANET LISA is a library of algorithms for computational structural and systems biology, which has initially been created to facilitate computational structural comparisons of RNA molecules and proteins [21]. In its basic version, NATALIE reads in two graphs in GraphML format [22] as well as additional information that determine the σand τ-parts in the scoring function depending on the application.
The purpose of this paper is to introduce the new method; we have not yet performed a detailed comparative study including other tools, which we plan to carry out as future work. We present, however, preliminary results from two ongoing projects that utilize the NATALIE algorithm. These studies demonstrate that the method works well in practice and has a high potential to become a very competitive tool in the area of network alignment.

First proof of concept: comparison of protein-protein interaction networks
In a cooperation with the Knowledge Management in Bioinformatics group of the Humboldt-Universität Berlin we use NATALIE to align protein-protein interaction networks based on orthology information about proteins in different species.
We analyze data from the following four species: Homo sapiens, Mus musculus, Drosophila melanogaster, and Saccharomyces cerevisiae. The PPI networks were obtained using data from several open databases and their origin is described in [11]. Candidates for orthologous proteins between the species were determined using protein enzyme classes, InterPro domains, and a minimum sequence identity threshold of α = 0.4, see again [11] for details. In a prototypical experiment, we compare the network of H. sapiens against all other networks using a simple scoring function. Table 1 provides information about the network sizes, where n and m denote the number of nodes and edges in the networks, and the average number of potential orthologs for a sequence identity threshold of α = 0.4 as compared to the network of H. sapiens. We use the following scoring function and align the three pairs of PPI networks. We set This scoring function simply counts the number of conserved interactions of proteins that are potentially orthologous. We limit the CPU time to 1 h and yield the results summarized in Tab. 2.
Clearly, more elaborate scoring schemes may yield biologically more meaningful solutions. This simple experiment demonstrates, however, that the Lagrangian algorithm performs very well even on large data. All solutions but the alignment computed for D. melanogaster are provably optimal and even this alignment is very close to optimal. Figure 3 shows the alignment computed between the PPI networks of Mus musculus and Homo sapiens.

Second proof of concept: classification of metabolic subnetworks
A common way to represent the topology of a metabolic network is its stoichiometric matrix, which characterizes the system of homogeneous linear equations that describe the network of biochemical reactions at steady state. Together with the Computational Systems Biochemistry group at Charité, Berlin, we investigate randomization models for a given metabolic network.
We therefore consider environments of different sizes for each reaction in the network and classify the resulting subnetworks according to their topological equivalence. Two reaction environments are topologically equivalent if the induced stoichiometric submatrices are permutationequivalent, that is, one can be transformed into the other only by permuting its rows and columns. Then, randomized networks can be generated by swapping reaction environments that exhibit the same topology.
We employ a result by Colbourn [23] and determine permutation-equivalence via computing whether two corresponding labeled graphs are isomorphic. Since two graphs are isomorphic if and only if their maximum common subgraph equals the input graphs, we can use NATALIE to do the computations. We compute the equivalence classes of reaction environments of sizes s ∈ {1, 2, 3} of the metabolic networks of E. coli and S. cerevisiae, which were obtained from the Systems Biology Research group at UCSD. The graphs that correspond to the stoichiometric matrices of these reaction environments are typically quite small and have rarely more than twenty vertices. Table 3 shows the number of pairwise comparisons that had to be computed: For each comparison, NATALIE has to decide whether the two subnetworks are topologically equivalent or not.
Although, in the current version, it takes about two weeks to do the computations, NATALIE finds the correct answer for all of the more than 200 million comparisons and thus correctly computes the equivalence classes. In this application, the quality guarantee of the Lagrangian approach is indispensable, and the same results could not have been computed with a purely heuristic method. Yet, they could  have been obtained probably much faster using a tailormade algorithm for detecting graph isomorphisms. We plan, however, to develop a similarity metric between stoichiometric matrices based on the maximum common subgraph of their corresponding labeled graphs and have therefore used our novel approach, which has been proven efficient enough for this application. The details of this study will be described elsewhere.

Conclusion
We believe that the maximum structural matching formulation and our algorithmic contribution is a first step towards a very competitive framework for network alignment, query, and comparison problems. We see perspectives for many interesting research directions. As the formulation as well as the algorithm can deal with multiple alignments, we plan to adapt the concepts to the multiple case. For practical purposes, a progressive alignment method seems to be appropriate for which an adequate consensus concept has to be developed. Moreover, the analogy to classical sequence alignment suggests to investigate local network alignments, where a first step consists in computing maximum connected motifs.
As the formulation is very flexible, it can easily be adapted to any type of undirected or directed, labeled or unlabeled, and weighted or unweighted network. It can be used for answering network queries as well as for detecting repeated motifs in a single network.
Clearly, a good search procedure is only one component in a successful alignment framework. The analogy to sequence alignment suggests that a lot of further research has to go into development and evaluation of suitable scoring functions and into statistical analysis of the results. This more statistically-oriented line of research will be different for each of the numerous applications for network alignment in computational biology. Currently, we address these topics in the ongoing projects, the alignment of PPI networks and the comparison of metabolic networks. Likewise, a visualization of the results is an important research topic. Here, we envision an integration into the CYTOSCAPE software [24].