CMPF: Class-switching minimized pathfinding in metabolic networks
© Lim and Wong; licensee BioMed Central Ltd. 2012
Published: 13 December 2012
Skip to main content
© Lim and Wong; licensee BioMed Central Ltd. 2012
Published: 13 December 2012
The metabolic network is an aggregation of enzyme catalyzed reactions that converts one compound to another. Paths in a metabolic network are a sequence of enzymes that describe how a chemical compound of interest can be produced in a biological system. As the number of such paths is quite large, many methods have been developed to score paths so that the k-shortest paths represent the set of paths that are biologically meaningful or efficient. However, these approaches do not consider whether the sequence of enzymes can be manufactured in the same pathway/species/localization. As a result, a predicted sequence might consist of groups of enzymes that operate in distinct pathway/species/localization and may not truly reflect the events occurring within cell.
We propose a path weighting method CMPF (Class-switching Minimized Pathfinder) to search for routes in a metabolic network which minimizes pathway switching. In biological terms, a pathway is a series of chemical reactions which define a specific function (e.g. glycolysis). We conjecture that routes that cross many pathways are inefficient since different pathways define different metabolic functions. In addition, native routes are also well characterized within pathways, suggesting that reasonable paths should not involve too many pathway switches. Our method can be generalized when reactions participate in a class set (e.g., pathways, species or cellular localization) so that the paths predicted have minimal class crossings.
We show that our method generates k-paths that involve the least number of class switching. In addition, we also show that native paths are recoverable and alternative paths deviates less from native paths compared to other methods. This suggests that paths ranked by our method could be a way to predict paths that are likely to occur in biological systems.
Metabolic networks consist of small chemical molecules that are transformed from one to another by enzymes in a specified series of reactions. Predicting source-to-target routes in metabolic pathways is an interesting problem that has applications in synthetic biology, bioengineering and systems biology. For example, a biologist might be interested to know how long-chained lipids might be produced. It is also useful in tracer and genetic knockout experiments .
The problem is defined on a set of rules which dictate how substrates are transformed into products on a given enzyme. Blum and Kohlbacher described two early approaches targeted at this problem . The first models compound transforming rules into a graph and employs shortest path algorithms to predict routes [2, 3]. The second approach expands a set of compounds (initially just the starting substrate) by applying the set of rules iteratively adding the products of reactions to the set . A shortest path algorithm is then run on a graph constructed from the resulting set of compounds.
It is uninteresting to only output a single shortest path, because native paths seldom have the shortest topological length. For example, the path from glucose to pyruvate has a shortest length of 4, but the native path requires 9 reactions. On the other hand, the number of paths between a source and a target can grow exponentially in a highly connected network. For example, Blum and Kohlbacher reported that between glucose and pyruvate, there can be about 500,000 paths . Most reported methods use Eppstein's k-shortest path algorithm  in which the paths are not guaranteed to be simple loopless paths. Routes with loops are biologically uninteresting because they overlap with a shorter route previously discovered. The problem of finding k-shortest simple loopless paths incurs higher complexity and is harder to implement. The results from shortest-path finding is also highly dependent on the weights associated with nodes or edges, which are modelled differently in different approaches. For example, Croes et al. use nodes to represent chemical compounds and assigned weights to nodes based on its degree centrality . Rahman et al. assign weights to edges based on compound structure similarity .
Earlier approaches also try to avoid compounds participating in many reactions but do not play an important role in the path. These compounds are termed as 'currency' compounds since they are circulating in metabolic pathways. Blum and Kohlbacher use atom mapping rules which keeps track of atom transfer between substrate and product compounds to avoid paths with currency compounds . More recent approaches relied on new information -- e.g., the RPAIR database [7–10] -- to avoid 'currency' compounds and to infer more reliable routes. Xia et al. use species information to model the weights of graph edges , they believe that reactions that can be found in more species are more reliable. However, these approaches do not check whether it is possible to start with a reaction from a certain pathway/species and end with a reaction so that as little change in pathway/species is required in the route. It might be biologically unmeaningful to obtain routes where there are reactions in the middle of the route that cannot be produced from the pathway/species that the first substrate started from. We think that routes with minimal switches are more preferable. In addition, most approaches fail to recover the native source-to-target route that is used by biological systems. We suggest that plausible routes should not deviate too much from native routes.
One way to find paths that minimizes species crossing is to reduce this problem to the original k shortest path problem. This can be done by modelling the network using a reaction connecting graph where by the nodes are reactions and an edge connects two reactions if the product of one is the substrate of the other. Unique reaction nodes are duplicated for each species that the reaction occurs in. For example, if ten species are involved in a reaction, then ten separate reaction nodes will be created. An edge is assigned weight of zero if the two nodes have the same supporting species and one otherwise. Any k-shortest path produced by the algorithm also has minimum species crossing because each cross-species reaction in the path has a penalty of one, whereas reactions that do not cross species have no penalty. However, this results in a prohibitively large number of nodes and edges. For example, if every reaction is supported by 1000 species on average, then for every two reaction nodes in the original graph, we obtain 2000 nodes with 1,000,000 edges between them.
Instead, the algorithm in CMPF uses bounded depth path enumeration and scores the paths based on a scoring scheme where paths that cross species/pathways many times have higher penalty scores than those that do not. We use a bipartite graph to model the metabolic network which consists of RPAIR nodes r 1, ..., r n and compound nodes c 1, ..., c n . An RPAIR is pair of compounds with similar chemical structure on two sides of a reaction . A directed edge connects RPAIR to compound and vice versa if the compound participates in the reaction represented by the RPAIR. We use the notation RPAIR and reaction interchangeably in this paper, since they represent similar concepts.
Now, we are ready to define our basic framework of reaction transition weight and linear path score as follows:
where γ is some constant denoting the cost of making a within-pathway transition and δ is the extra cost of making a pathway switch; note that δ = 0 if P ϕ (r i -1) ∩ P(r i ) is nonempty, as a pathway switch is not needed in this situation. We will further refine γ and δ in the next section.
The weight of a transition r i → r i+1 based on species is denoted weight S,ϕ (r i → r i +1), which is defined analogously.
where weight ϕ (r i → r i+1) is either weight P,ϕ (r i → r i +1) or weight S,ϕ (r i → r i+1), for pathway- or species-based score respectively.
In our scoring scheme, the weight of a transition r i → r i+1 in a linear path depends on whether the pathways (or species) that support r i also support r i+1 in that specific linear path. Thus, a topologically shorter linear path may have a higher score (i.e., cost) than a longer one. Moreover, the computation of the score is independent between different linear paths, suggesting exhaustive enumeration of linear paths as a method to find and rank linear paths. Since most linear paths that are useful usually do not exceed a certain topological length and also because exhaustive enumeration can be slow, we use bounded depth enumeration to speed up the search process.
We construct a global metabolic bipartite graph annotated with species/pathway information and exclude non-main RPAIRs to avoid 'currency' compounds. We enumerate all possible paths up to a specified depth and return the k lowest scoring linear paths based on our scoring scheme. In this way, we guarantee that linear paths generated by our method have the least number of cross-pathway/species reactions and are always optimal up to the depth to which we traverse the graph.
Many other approaches have tried giving paths meaning by defining meaningful edge weights. Our framework allows us to reuse these ideas and incorporate these weights. For example, Croes et al. use node connectivity as edge weights so that 'currency' compounds are avoided ; and Xia et al. use the inverse of organism frequency that a reaction belongs to as edge weights so that reactions that belong to more organisms are preferred . In CMPF, we use γ 1 and γ 2 to denote these two scoring strategies.
Native paths rarely involve a switch between pathways. On the other hand, alternative paths might involve such switching. The penalty given when a switch occurs can be made more meaningful, since some pathway crossing is preferable to others. Our framework discussed thus far allows us to rank linear paths by their pathway switching equivalence class. A pathway switching equivalence class here refers to a group of linear paths with the same number of pathway/switches. However, the arbitration of linear paths within the same pathway switching equivalence class is random.
We can do better by defining a function which computes the distance from one pathway to another. We define the 'metabolite closure' M(x) of a pathway x to be the set of metabolites that are generated within that pathway. We hypothesize that pathways performing similar function have similar metabolite closures because the end product often determines intermediate metabolites. Hence, a switch from pathway x to pathway y would be preferred if their metabolite closures agree well with each other.
In addition, a switch can involve two sets of pathways. In this case, we take the average distance of all possible switchings.
and δ 1 = 0 if P ϕ (r i ) ∩ P(r i+1) is non-empty.
Similarly, a switch between species is more likely to happen if the two species are related. The distance between two species can be similarly defined using their metabolite closures. A more intuitive measure of species similarity is one that is based on the taxonomy tree. In this case, the path length from species x to y in the taxonomy tree can be used to determine the distance between them; we denote this distance by .
which is the average taxonomic path length between the two sets of species that support these two consecutive reactions in the linear path ϕ; δ 2 = 0 if S ϕ (r i ) ∩ S(r i+1) is non-empty.
We modify Definition 1 to a weighted sum of scores defined in the previous section.
where γ 1, γ 2, δ 1, and δ 2 are described earlier; and w 1 + w 2 + w 3 + w 4 = 1.
Comparing average number of pathways crossed from the top 20 routes in two methods.
There are two consequences to long pathway traces. The first is that pathway transitions 'hop' from one pathway to another. The number of consecutive reactions in the same pathway is small. We believe this that such transitions activates many different biological functions without achieving any specific purpose. For example, one of NEAT's predicted paths is shown in Figure 9b. The dotted lines represent additional reaction steps in the same pathway not shown in the figure. The width of the dotted edge represents the number of consecutive reaction steps in the same pathway. It starts by breaking down glucose (glycolysis pathway) and transits to converting and breaking down of lactose, half way through the process, the intermediate metabolite is converted back to a glucose analog, switching back to glycolysis followed by a transition to the pentose phosphate pathway. In contrast, CMPF prefers to stick to the same pathway until a switch is permissible, as indicated by the thicker dotted lines (See Figure 9c).
The second consequence is that a transition might be made to a non-relevant pathway. For example, one of MRSD's predicted paths is shown in Figure 9a. The reaction starts from breaking down glucose to producing fructose followed by a diversion to the glycine protein pathway before finally producing pyruvate. We think this is biologically not meaningful because breaking down of glucose into pyruvate is a simple function that does not involve anabolism or catabolism of amino-acids. While it is technically possible to obtain pyruvate from glucose by going through the protein pathway, it might make more sense to produce amino acids after transiting to the protein pathway rather than coming back to break down glucose. Instead, whenever a switch is permissible, CMPF prefers to switch to the most similar pathway based on our dynamic scoring function.
We build a compound-RPAIR bipartite 'super' graph representing the all reactions in the metabolic network from all species and pathways as described previously. The RPAIR database from KEGG is categorized into 'main', 'trans', 'cofac', 'ligase' and 'leave' depending on their roles in a chemical reaction . To avoid currency compounds, we used main RPAIRs to construct the 'super' graph. This allows us to enumerate paths in a reasonable time since many irrelevant edges are excluded from the 'super' graph. However, we note that some 'trans' RPAIRs are present in native paths, suggesting that some 'trans' RPAIRs are important for pathfinding.
On the other hand, permitting edges from 'trans' RPAIRs make the exhaustive search significantly slower. To allow a more comprehensive search to run within reasonable time, we permitted 'trans' RPAIRs edges if they do not increase the graph branching factor by too much. To do this, we measure the increase in node connectivity after adding 'trans' RPAIR edges. The 'trans' RPAIR edges would be added only if they lie within one standard deviation from the median. This is a heuristic approach and thus may miss informative paths that include 'trans' RPAIRs.
We developed a software package to exhaustively traverse the graph up to a user specified depth threshold [see Additional file 1]. The user specifies the starting and final product as well as the weights for scoring paths. The paths are displayed using a well known graph visualization tool, GraphViz . The software also allows users to highlight paths by their score ranking.
The problem of predicting source target routes in a biological pathway depends on the users' searching criteria. We have shown in this paper that our proposed path scoring scheme gives users the alternative to find paths that minimizes class crossing and also allows users to evaluate predicted paths. Our scoring scheme is also sufficiently flexible to allow us to find routes with minimal switching between species or any other class that a reaction can participate in. We evaluate our method against other graph-based methods and demonstrate that paths ranked by our scoring scheme align better to native paths. This suggests that alternative paths predicted by our method might be more likely to occur in real biological systems.
Project name: CMPF
Project homepage: http://compbio.ddns.comp.nus.edu.sg:8080/cmpf/
Operating System(s): Platform independent. Windows system is tested
Programming language: Java
Other requirements: Java runtime environment 1.6 and above, at least 1GB of free RAM.
This work was supported in part by a National University of Singapore research scholarship (Lim) and a Singapore National Research Foundation grant NRF-G-CRP-2007-04-082(d)
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 17, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S17.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.