An efficient algorithm for de novo predictions of biochemical pathways between chemical compounds
© Nakamura et al.; licensee BioMed Central Ltd. 2012
Published: 13 December 2012
Skip to main content
© Nakamura et al.; licensee BioMed Central Ltd. 2012
Published: 13 December 2012
Prediction of biochemical (metabolic) pathways has a wide range of applications, including the optimization of drug candidates, and the elucidation of toxicity mechanisms. Recently, several methods have been developed for pathway prediction to derive a goal compound from a start compound. However, these methods require high computational costs, and cannot perform comprehensive prediction of novel metabolic pathways. Our aim of this study is to develop a de novo prediction method for reconstructions of metabolic pathways and predictions of unknown biosynthetic pathways in the sense that it does not require any initial network such as KEGG metabolic network to be explored.
We formulated pathway prediction between a start compound and a goal compound as the shortest path search problem in terms of the number of enzyme reactions applied. We propose an efficient search method based on A* algorithm and heuristic techniques utilizing Linear Programming (LP) solution for estimation of the distance to the goal. First, a chemical compound is represented by a feature vector which counts frequencies of substructure occurrences in the structural formula. Second, an enzyme reaction is represented as an operator vector by detecting the structural changes to compounds before and after the reaction. By defining compound vectors as nodes and operator vectors as edges, prediction of the reaction pathway is reduced to the shortest path search problem in the vector space. In experiments on the DDT degradation pathway, we verify that the shortest paths predicted by our method are biologically correct pathways registered in the KEGG database. The results also demonstrate that the LP heuristics can achieve significant reduction in computation time. Furthermore, we apply our method to a secondary metabolite pathway of plant origin, and successfully find a novel biochemical pathway which cannot be predicted by the existing method. For the reconstruction of a known biochemical pathway, our method is over 40 times as fast as the existing method.
Our method enables fast and accurate de novo pathway predictions and novel pathway detection.
Identification of the metabolic pathway of a chemical compound and discovery of new metabolic pathways are important in various fields. In general, an enzyme reaction pathway is a sequence of applications of enzymes (represented by EC number) that derives a goal compound from a given compound. In the field of drug discovery , the mechanism of side effects based on information about metabolic pathways has been investigated to clarify the movement of drugs in the body and to optimize drug candidate compounds. In the field of toxicity prediction, exploration of the dynamics of in vivo chemical substances identified the metabolic pathway, leading to the elucidation of the mechanisms of toxicity . Prediction of the biochemical pathways for secondary metabolites has received the most attention in recent years. Although secondary metabolites have been used as lead compounds for food and medicines, most of their biosynthetic pathways still remain unknown. Further, computational methods that support de novo design of biosynthetic pathways are expected in the field of synthetic biology. In the synthetic biology approach, the de novo design is not necessarily limited to biochemical routes that already exist in nature .
To solve the problem of predicting various metabolic pathways, many attempts from bioinformatics have been made so far. Existing approaches can be broadly divided into three methods: the fingerprint-based method, the maximum common substructure search method, and the reaction rule-based method.
A chemical compound is represented by a fingerprint of the molecular structure, and the Tanimoto coefficient between fingerprints for compounds is calculated to indicate similarity. It then predicts that there is a metabolic pathway between compounds if the similarity exceeds a certain threshold. The necessary calculations are fast, but accurate path prediction is difficult.
This approach focuses on the maximum common substructure between compounds to predict a metabolic pathway. The maximum common substructure search is an NP-hard problem, and requires enormous computation time in order to evaluate the similarity between compounds of complex structures [7, 8]. Various approximation algorithms have been studied in the search for a computationally tractable approach [9–16].
This requires a database of reaction rules constructed from known metabolic reactions, and attempts to predict a metabolic pathway as a sequence of reaction rules. As a feature of reaction rules, some techniques focus on physicochemical properties and structures , while other methods focus on enzyme and gene information [26, 27]. Since prediction ability depends on the size and type of metabolites used to build reaction rules, comprehensive prediction is difficult using the exhaustive search algorithm such as breadth-first search [23, 24], and the approach has only been used to predict specific pathways and enzymes. In addition, the complexity of the features used to construct the reaction rules is a factor that has made comprehensive prediction difficult.
This study aims at a comprehensive and de novo approach to predict metabolic pathways between two arbitrary known or unknown compounds, and belongs to the rule-based methods. Using a simple feature that focuses only on the structural formula of compounds, our method enables comprehensive prediction that has been difficult for the conventional methods. Enzyme reactions on the metabolic pathways are used as the reaction rules, and are extracted from the KEGG database . The feature on which we focus is the information before and after the structural change of the compound caused by the enzyme reaction. By using the information about this structural change, our method predicts the enzyme reactions that give the shortest path between two compounds for a given query. Further, a constraint for applying an enzyme reaction rule to a compound is set as the substrate inclusion condition, that is, the compound must include the substrate of the enzyme reaction as part of its own structure. This constraint and shortest-path strategy lead to de novo prediction of unknown biosynthetic pathways that a knowledge-based approach [17, 18] cannot predict. In this study, the metabolic pathway prediction problem is reduced to the shortest path problem, and the search method is based on A* algorithm to traverse nodes in the order of priority and employs the LP solution as an admissible heuristics for estimating the distance to the goal.
First, a chemical compound is represented by a feature vector which counts the frequencies of substructures in the structural formula. Second, a set of enzyme reaction rules is collected from the KEGG pathway database. Third, a reaction rule is represented as an operator vector by detecting the structural change to compounds before and after the reaction. Fourth, by defining compound vectors as nodes and operators as edges, prediction of a reaction pathway from a start compound to a goal compound is reduced to the shortest path search problem in the vector space. Then, "the output for reaction pathway prediction consists of a sequence of applied reaction rules". The A* algorithm is used to efficiently search for the shortest path. Finally, the Linear Programming (LP) algorithm is used as an admissible heuristic for estimating the distance to the goal.
The data for compounds and metabolic enzyme reaction information used in this method all come from KEGG. First, we extracted the information pathways from KEGG pathway . By using the KEGG API, we concretely collected all enzyme reactions registered in the global map on the KEGG pathway. In the KEGG Reaction, a pair of compounds that are registered as "main" before and after the reaction indicate that it is a metabolic reaction present in KEGG enzyme or the KEGG pathway global map. In this study, the 2D-SDF structure was extracted only for those pairs registered as "main" to ensure the focus on compound metabolic reactions. Further, most KEGG reactions are registered as reversible reaction, and therefore, the forward and reverse directions were treated as a separate reaction. As a result, the 14570 enzyme reactions and the 6073 related compounds were obtained.
A key idea in our method is that a chemical compound is converted to a feature vector that represents substructure statistics extracted from the structural formula of the compound. This feature-vector representation evaluates whether a feature, such as a specific substructure, exists in a chemical compound or how many times that feature appears. This converts information about compounds into numerical vectors, called feature vectors, whose ith value corresponds to the existence or frequency of the ith feature considered. This feature-vector representation enables us to reduce the pathway search problem to a computationally feasible problem in the vector space, as will be discussed later in detail.
where is a set of paths whose length (depth), or number of bonds, is between l and u (u ≥ l) and which appear at least once in the chemical structures in the dataset. f c (p) is the number of appearances of path p in the structure of a chemical compound c.
We call the path length range specified by l and u the "representation-depth", and denote it by "depth l-u", which is crucial for the expressiveness of vector representation.
Further, every reaction rule R a for the reaction a is defined as a pair R a = (U a , O a ) of the substrate vector and the operator vector O a . As a result, an application of the enzyme reaction to a compound can be achieved simply by "addition" of the operator vector to the compound vector (see also Figure 1), and a reaction pathway from a start compound S to a goal compound G is represented by a sequence of applications (additions) of operator vectors:
In this method, different reactions may sometimes be represented by the same vector because of insufficient short-length path counts in the compound vector.
Note that this computationally easy procedure for substrate inclusion is a great advantage of our method using vector representation, because the graph inclusion problem for determining whether a compound structure contains a substrate structure is computationally hard (NP-hard).
where h(t) is the true distance to the goal. That is, a heuristic function h'(t) that always underestimates the distance to the goal is required. Such a heuristic function h'(t) is referred to as an "admissible heuristics". If a given heuristic is admissible, the A* algorithm will reliably find a shortest path. The A* algorithm was implemented using the data structure "sorted priority queue" for maintaining the nodes to be traversed with weights of the evaluation function value f(t), while the breadth-first search uses the simple "queue" for the nodes to be traversed with no weight.
By setting the heuristic function h'(t) to zero for any node t, the A* algorithm becomes equivalent to the breadth-first (BF) search as exhaustive search.
where ||O max || represents the maximum norm among all of the operator vectors. The MH distance divided by this norm becomes an admissible heuristic, because this modified MH distance indicates the number of times the goal node G is reached by only applying the largest norm operator, and hence does not exceed the true distance to the goal node.
This optimization problem is an Integer Programming (IP) problem. The solution to this problem is similar to that for the shortest reaction path problem between the start node and the goal node, except that it does not take into account the order of application of the reaction rules and it ignores the constraint conditions when applying reaction rules. Nevertheless, the solution to "minimize ∑ k w k " provides the tightest underbound for estimating the distance from the current node t to the goal node G, and it is obviously admissible. However, a critical defect is that the IP problem is computationally hard.
Our approach is to relax the constraints on the optimization problem "minimize ∑ k w k " and to treat w k as a real number rather than an integer, that is, "continuous relaxation". The optimization problem now becomes an LP problem that can be solved in polynomial time. Note that allowing w k to be a real number means that we may apply an operator a real number of times, for example "apply the operator 1.5 times". In other words, a real-valued solution for the optimization problem "minimize ∑ k w k " can be considered as the shortest distance to the goal node in a real vector space. In addition, it is well known and obvious that the real solutions for the optimization problem "minimize ∑ k w k " with linear equation constraints are always smaller than the integer solutions. Therefore, the LP solution is admissible for guaranteeing the shortest path. We use this value as the LP heuristic function, which is another advantage of our method using the vector representation.
For solving the LP heuristic, we used IBM ILOG CPLEX in . CPLEX is one of the fastest optimization problem solvers, and can be used for linear programming, quadratic programming, constraint programming, mixed integer programming, and is applicable to large-scale problems.
Reaction rules for the whole KEGG pathway database
Dimensionality of vector representation
Number of operator vectors O
Some reactions are registered as different in KEGG, but the changes in structure are the same and only the substrates are different.
Some reactions are actually different but are represented by the same vector.
The structure registered as "main" is unchanged by the reaction.
The weakness of the second reason can be reduced by increasing the representation-depth for the vectors, which increases the number of reactions distinguished due to the improved expressive power.
In this study, we used the well-known DDT degradation pathway data set  as pathway data to verify the validity of our method. DDT is a chemical substance that can be synthesized for minimal cost, and began to be used as an insecticide during the 1940s because of its insecticidal action against many insects. However, the human carcinogenicity of DDT and its long-term persistency in the environment has since been pointed out . It is important to evaluate the negative impact on the environment, and human health studies analyzing the metabolism of DDT has continued in recent years .
Reaction rules only for the DDT degradation pathway
Number of operator vectors
In our experiments, 20 × 19 = 380 pathway routes were selected for the search problem. The first validation experiment only used the 46 enzyme reaction rules contained in the DDT degradation pathway. In the second "more general" experiment, all KEGG reaction rules were used to search the DDT pathway.
Agreement rate with the true pathway
Agreement of the distance (%)
Agreement of the route (%)
Average computational time (seconds/pair) for finding 380 pathway routes
Comparing the efficiency of the heuristic functions in this table showed in particular that a significant reduction in computational time was achieved by the LP heuristic. On the other hand, in the depth 0-3, reduction in computation time was not seen for most heuristics. This implies that, as the representation depth increases, the substrate inclusion condition works more effectively, and the number of branches in the search space becomes smaller.
Average number of branchings in the search (#branch/pair)
Average computational time (seconds/pair) using all KEGG reaction rules
The agreement rate between the true distance and the true pathway route using the LP heuristic were 100% (380/380). Thus, despite using the generic operators (all KEGG reaction rules), the results showed that the method had high reproducibility.
Another pathway prediction using all KEGG reaction rules was executed for Lutein biosynthesis pathway. Lutein biosynthesis pathway is a secondary metabolic pathway from the start compound "Lycopene" to the goal compound "Lutein". Lycopene is a red carotenoid and Lutein is a plant carotenoid, and there are two routes from Lycopene to Lutein in KEGG pathway database: the one is via Zeinoxanthin and the other is via α-Cryptoxanthin. The Lutein biosynthesis pathway has other difficulty compared with DDT pathway prediction: the structures of chemical compounds in the pathway are significantly larger than the ones in DDT pathway, and the KEGG pathway predictive tool PathPred [23, 37] could not predict this pathway.
Our method with the LP heuristics succeeded to precisely predict all pathways between every pair of compounds on the Lutein biosynthesis pathway. The average computational time for the LP heuristic to predict the shortest paths for all pairs was 10.9 seconds. On the other hand, PathPred failed to predict the pathway between Lycopene and Lutein, where the default parameters of PathPred were used: "Simcomp Threshold" was set at 0.4, "Prediction cycle" was set at 1, and Reference pathway was set at "Biosynthesis of Secondary Metabolites (Plants)".
To demonstrate the effectiveness of our method for finding novel pathways, we applied our method to predict a biochemical pathway for the start node "Delphinidin" and the goal node "Gentiodelphin". Both compounds are present in the KEGG database. Gentiodelphin is a plant-derived secondary metabolite associated with blue dye, and is known to be synthesized from Delphinidin . The KEGG pathway predictive tool PathPred was also used for performance comparison.
Overall, our A*-based algorithm with the LP heuristic is more comprehensive and computationally efficient prediction method for biochemical pathway finding.
We have achieved high-speed pathway predictions using a vector-based search that simply focuses on the 2D structures of compounds. The A* algorithm guarantees the discovery of the shortest path, and the efficient search is achieved by the Linear Programming heuristic that estimates the distance to the goal. Results of verification experiments show the high reproducibility of KEGG pathways, the validity of the novel predicted pathway, and the versatility of our method.
where P is the size of the search space if all solutions are explored, N is the number of reaction rules, and d is the distance from the start node to the goal node. In the heuristic search, the search space can be reduced by visiting the nodes on the true path on a priority basis. Table 5 shows that the LP heuristic can significantly reduce the search space compared to searching all possible solutions.
where B is the ratio that bounds the branching, that is, the ratio at which operator applications are eliminated. When B increases, the base of the exponential function becomes smaller and hence the exponential increase can be reduced. That is, B plays a role in minimizing the exponential expansion of the search space. The significant reduction in computational time achieved by increasing the representation-depth for the vector representation is considered to be due to this reason. In other words, designing a high-speed searching method requires both an accurate heuristic function that estimates the distance to the goal and an effective bound on the branching to reduce the search space.
Our experimental results for comprehensive predictions using all 8108 KEGG reaction rules show that our proposed method is able to reproduce enzyme reaction pathways in the KEGG pathway database with high accuracy. This is presumably due to the LP heuristic and bound on branching due to the substrate inclusion constraint on the vector representation.
Our proposed method in this paper is a de novo prediction method in the sense that it does not require any initial network such as KEGG metabolic network as input and it is not a method just to traverse the pathway network. Our method takes as input the set of enzyme reaction rules collected from the KEGG pathway database. However, this does not necessarily imply that the pathway prediction using the list of all reaction rules is equal to the path search on KEGG pathway network. For each compound occurring at a node in KEGG pathway network, the KEGG network only contains the enzyme reactions whose substrate is exactly equal to the compound as an edge connected to the node. On the other hand, our method applies all reaction rules to a given compound if the compound is not only equal to the substrate of the reaction rule but also contains the substrate as a sub-structure (the substrate inclusion condition). Therefore, the search space of our method is exponentially larger than the KEGG pathway network. Further, our method is able to predict unknown biosynthetic pathways between two arbitrary known or unknown compounds.
We have proposed a computationally efficient method to predict biochemical reaction pathways that derives a goal compound from a start compound. A chemical compound is represented by a feature vector that counts the frequencies of substructure occurrences in the structural formula. A set of enzyme reaction rules collected from the KEGG pathway database was represented using operator vectors, by determining the structural change in the compounds before and after the reaction. Two constraint conditions when applying reaction rules were substrate inclusion and compound formation. By defining each compound vector as a node and each operator as an edge, prediction of reaction pathways was reduced to the shortest path search problem in a vector space. We proposed an efficient search method that uses the A* algorithm for the shortest path search problem. We used an LP solution for heuristic estimation of the distance to the goal. The results showed that our method had high reproducibility for KEGG pathways and a high possibility of predicting new reaction pathways. We understand that we need larger-scale experiments to test the general performance and stability of our method on a number of various known pathways. This is one of our important future works. Also in the future work, the resulting shortest distance can be thought of as a kind of similarity measure between compounds that represents metabolic information, and hence applications to determining similarity of compounds for drug discovery such as [38–40] can be also expected.
Department of Biosciences and Informatics, Faculty of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan.
This work was supported in part by a Grant program for bioinformatics research and development from the Japan Science and Technology Agency. This work was also supported by Grant-in-Aid for KAKENHI (Grant-in-Aid for Scientific Research) on Innovative Areas (No.221S0002) and Scientific Research (A) No.23241066 from the Ministry of Education, Culture, Sports, Science and Technology of Japan.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 17, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S17.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.