Volume 13 Supplement 17
Eleventh International Conference on Bioinformatics (InCoB2012): Bioinformatics
An efficient algorithm for de novo predictions of biochemical pathways between chemical compounds
 Masaomi Nakamura^{1},
 Tsuyoshi Hachiya^{1},
 Yutaka Saito^{1},
 Kengo Sato^{1} and
 Yasubumi Sakakibara^{1}
DOI: 10.1186/1471-2105-13-S17-S8
© Nakamura et al.; licensee BioMed Central Ltd. 2012
Published: 13 December 2012
Abstract
Background
Prediction of biochemical (metabolic) pathways has a wide range of applications, including the optimization of drug candidates and the elucidation of toxicity mechanisms. Recently, several methods have been developed for pathway prediction to derive a goal compound from a start compound. However, these methods incur high computational costs and cannot perform comprehensive prediction of novel metabolic pathways. The aim of this study is to develop a method for reconstructing metabolic pathways and predicting unknown biosynthetic pathways that is de novo in the sense that it does not require any initial network, such as the KEGG metabolic network, to be explored.
Results
We formulated pathway prediction between a start compound and a goal compound as a shortest path search problem in terms of the number of enzyme reactions applied. We propose an efficient search method based on the A* algorithm and heuristic techniques that use a Linear Programming (LP) solution to estimate the distance to the goal. First, a chemical compound is represented by a feature vector which counts frequencies of substructure occurrences in the structural formula. Second, an enzyme reaction is represented as an operator vector by detecting the structural changes to compounds before and after the reaction. By defining compound vectors as nodes and operator vectors as edges, prediction of the reaction pathway is reduced to a shortest path search problem in the vector space. In experiments on the DDT degradation pathway, we verify that the shortest paths predicted by our method are biologically correct pathways registered in the KEGG database. The results also demonstrate that the LP heuristic achieves a significant reduction in computation time. Furthermore, we apply our method to a secondary metabolite pathway of plant origin, and successfully find a novel biochemical pathway which cannot be predicted by the existing method. For the reconstruction of a known biochemical pathway, our method is over 40 times as fast as the existing method.
Conclusions
Our method enables fast and accurate de novo pathway predictions and novel pathway detection.
Background
Identification of the metabolic pathway of a chemical compound and discovery of new metabolic pathways are important in various fields. In general, an enzyme reaction pathway is a sequence of applications of enzymes (represented by EC number) that derives a goal compound from a given compound. In the field of drug discovery [1], the mechanism of side effects based on information about metabolic pathways has been investigated to clarify the movement of drugs in the body and to optimize drug candidate compounds. In the field of toxicity prediction, exploration of the dynamics of in vivo chemical substances identified the metabolic pathway, leading to the elucidation of the mechanisms of toxicity [2]. Prediction of the biochemical pathways for secondary metabolites has received the most attention in recent years. Although secondary metabolites have been used as lead compounds for food and medicines, most of their biosynthetic pathways still remain unknown. Further, computational methods that support de novo design of biosynthetic pathways are expected in the field of synthetic biology. In the synthetic biology approach, the de novo design is not necessarily limited to biochemical routes that already exist in nature [3].
To solve the problem of predicting various metabolic pathways, many bioinformatics approaches have been proposed so far. Existing approaches can be broadly divided into three methods: the fingerprint-based method, the maximum common substructure search method, and the reaction rule-based method.
Fingerprint-based method [4]
A chemical compound is represented by a fingerprint of the molecular structure, and the Tanimoto coefficient between fingerprints for compounds is calculated to indicate similarity. The method then predicts that there is a metabolic pathway between compounds if the similarity exceeds a certain threshold. The necessary calculations are fast, but accurate path prediction is difficult.
Maximum common substructure search method [5, 6]
This approach focuses on the maximum common substructure between compounds to predict a metabolic pathway. The maximum common substructure search is an NP-hard problem, and requires enormous computation time in order to evaluate the similarity between compounds of complex structures [7, 8]. Various approximation algorithms have been studied in the search for a computationally tractable approach [9–16].
Rule-based method [17–24]
This requires a database of reaction rules constructed from known metabolic reactions, and attempts to predict a metabolic pathway as a sequence of reaction rules. As a feature of reaction rules, some techniques focus on physicochemical properties and structures [25], while other methods focus on enzyme and gene information [26, 27]. Since prediction ability depends on the size and type of metabolites used to build reaction rules, comprehensive prediction is difficult with an exhaustive search algorithm such as breadth-first search [23, 24], and the approach has only been used to predict specific pathways and enzymes. In addition, the complexity of the features used to construct the reaction rules is a factor that has made comprehensive prediction difficult.
This study aims at a comprehensive and de novo approach to predict metabolic pathways between two arbitrary known or unknown compounds, and belongs to the rule-based methods. Using a simple feature that focuses only on the structural formula of compounds, our method enables comprehensive prediction that has been difficult for the conventional methods. Enzyme reactions on the metabolic pathways are used as the reaction rules, and are extracted from the KEGG database [28]. The feature on which we focus is the information before and after the structural change of the compound caused by the enzyme reaction. By using the information about this structural change, our method predicts the enzyme reactions that give the shortest path between two compounds for a given query. Further, a constraint for applying an enzyme reaction rule to a compound is set as the substrate inclusion condition, that is, the compound must include the substrate of the enzyme reaction as part of its own structure. This constraint and the shortest-path strategy lead to de novo prediction of unknown biosynthetic pathways that a knowledge-based approach [17, 18] cannot predict. In this study, the metabolic pathway prediction problem is reduced to the shortest path problem. The search method is based on the A* algorithm, which traverses nodes in order of priority, and employs the LP solution as an admissible heuristic for estimating the distance to the goal.
Methods
First, a chemical compound is represented by a feature vector which counts the frequencies of substructures in the structural formula. Second, a set of enzyme reaction rules is collected from the KEGG pathway database. Third, a reaction rule is represented as an operator vector by detecting the structural change to compounds before and after the reaction. Fourth, by defining compound vectors as nodes and operators as edges, prediction of a reaction pathway from a start compound to a goal compound is reduced to the shortest path search problem in the vector space. The output of reaction pathway prediction is then a sequence of applied reaction rules. The A* algorithm is used to efficiently search for the shortest path. Finally, a Linear Programming (LP) solution is used as an admissible heuristic for estimating the distance to the goal.
KEGG reaction data
All compound data and metabolic enzyme reaction data used in this method come from KEGG. First, we extracted pathway information from KEGG pathway [29]. Specifically, using the KEGG API, we collected all enzyme reactions registered in the global map of the KEGG pathway. In KEGG Reaction, a pair of compounds registered as "main" before and after a reaction indicates a metabolic reaction present in KEGG enzyme or in the KEGG pathway global map. In this study, 2D-SDF structures were extracted only for pairs registered as "main", to keep the focus on metabolic reactions of compounds. Further, most KEGG reactions are registered as reversible reactions; therefore, the forward and reverse directions were treated as separate reactions. As a result, 14570 enzyme reactions and 6073 related compounds were obtained.
Representation of chemical compounds and enzyme reactions
A key idea in our method is that a chemical compound is converted to a feature vector that represents substructure statistics extracted from the structural formula of the compound. This feature-vector representation evaluates whether a feature, such as a specific substructure, exists in a chemical compound or how many times that feature appears. This converts information about compounds into numerical vectors, called feature vectors, whose i-th value corresponds to the existence or frequency of the i-th feature considered. This feature-vector representation enables us to reduce the pathway search problem to a computationally feasible problem in the vector space, as will be discussed later in detail.
Formally, a chemical compound c is represented by the vector ${D}_{l}^{u}\left(c\right)={\left({f}_{c}\left(p\right)\right)}_{p\in {\mathcal{P}}_{l}^{u}}$, where ${\mathcal{P}}_{l}^{u}$ is the set of paths whose length (depth), or number of bonds, is between l and u (u ≥ l) and which appear at least once in the chemical structures in the dataset. f_{ c }(p) is the number of appearances of path p in the structure of a chemical compound c.
We call the path length range specified by l and u the "representation-depth", and denote it by "depth l-u". This range is crucial for the expressiveness of the vector representation.
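The path-count representation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the molecule is given as a hypothetical labelled graph (a list of atom symbols plus an undirected bond list), bond orders and hydrogens are ignored, and the function name `path_features` is introduced here for illustration only.

```python
from collections import Counter


def path_features(atoms, bonds, l=0, u=2):
    """Count labelled simple paths with between l and u bonds (depth l-u)."""
    adjacency = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    counts = Counter()

    def extend(path):
        n_bonds = len(path) - 1
        # Each undirected path is walked from both ends; keep one direction.
        if n_bonds >= l and (n_bonds == 0 or path[0] < path[-1]):
            labels = tuple(atoms[i] for i in path)
            # Canonical form: the smaller of the label sequence and its reverse.
            counts[min(labels, labels[::-1])] += 1
        if n_bonds == u:
            return
        for nxt in adjacency[path[-1]]:
            if nxt not in path:  # simple paths only, no atom visited twice
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return counts


# Heavy atoms of ethanol (C-C-O) at depth 0-2.
ethanol = path_features(["C", "C", "O"], [(0, 1), (1, 2)], l=0, u=2)
```

Each key of the returned counter plays the role of one path p, and its value plays the role of f_{ c }(p); fixing an ordering of the keys over the whole dataset would give the vector itself.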
Further, every reaction rule R_{ a } for the reaction a is defined as a pair R_{ a } = (U_{ a }, O_{ a }) of the substrate vector ${U}_{a}\left(={D}_{l}^{u}\left(i\right)\right)$ and the operator vector O_{ a }. As a result, an application of the enzyme reaction to a compound can be achieved simply by "addition" of the operator vector to the compound vector (see also Figure 1), and a reaction pathway from a start compound S to a goal compound G is represented by a sequence of applications (additions) of operator vectors:
$S\underset{{U}_{1}\le S}{\overset{{R}_{1}=\left({U}_{1},{O}_{1}\right)}{\to}}{X}_{1}\left(=S+{O}_{1}\right)\underset{{U}_{2}\le {X}_{1}}{\overset{{R}_{2}=\left({U}_{2},{O}_{2}\right)}{\to}}{X}_{2}\left(={X}_{1}+{O}_{2}\right)\to \cdots \to {X}_{n-1}\underset{{U}_{n}\le {X}_{n-1}}{\overset{{R}_{n}=\left({U}_{n},{O}_{n}\right)}{\to}}G.$
In this method, different reactions may sometimes be represented by the same vector, because counts of short paths alone are not always sufficient to distinguish them.
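The "addition" of an operator vector can be illustrated with sparse vectors stored as dictionaries. The sketch below assumes that the operator vector is the difference between the product and substrate feature vectors, which is consistent with the chain S + O_{ 1 } = X_{ 1 } above; the function names and the toy path counts are hypothetical.

```python
def operator_vector(substrate, product):
    """Product counts minus substrate counts (assumed operator definition)."""
    keys = set(substrate) | set(product)
    diff = {k: product.get(k, 0) - substrate.get(k, 0) for k in keys}
    return {k: v for k, v in diff.items() if v != 0}


def apply_operator(compound, operator):
    """Add an operator vector to a compound vector, dropping zero entries."""
    result = dict(compound)
    for key, delta in operator.items():
        result[key] = result.get(key, 0) + delta
    return {k: v for k, v in result.items() if v != 0}


# Toy depth 0-1 counts (illustrative values, not taken from KEGG).
substrate = {"C": 2, "O": 1, "C-C": 1, "C-O": 1}
product = {"C": 2, "O": 2, "C-C": 1, "C-O": 2}
op = operator_vector(substrate, product)

# Applying the same operator to a larger compound that contains the substrate.
larger = {"C": 3, "O": 1, "C-C": 2, "C-O": 1}
derived = apply_operator(larger, op)
```

Because the operator stores only the changed counts, the same rule can be applied to any compound vector, not just to the registered substrate.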
Two constraint conditions for applying enzyme reaction rules
Note that this computationally easy procedure for substrate inclusion is a great advantage of our method using vector representation, because the graph inclusion problem for determining whether a compound structure contains a substrate structure is computationally hard (NP-hard).
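A component-wise comparison is enough to test the substrate inclusion condition U ≤ X on vectors. The following sketch uses the same dictionary representation as a hypothetical example; `includes_substrate` is a name chosen here, and the compound formation condition is not modelled.

```python
def includes_substrate(compound, substrate):
    """True if every path count of the substrate is covered by the compound."""
    return all(compound.get(path, 0) >= n for path, n in substrate.items())


# Toy vectors (illustrative values only).
compound = {"C": 3, "O": 1, "C-C": 2, "C-O": 1}
small_substrate = {"C": 2, "O": 1, "C-O": 1}
other_substrate = {"C": 2, "N": 1}
```

The check runs in time linear in the number of nonzero substrate features. Note that it is a necessary condition for true substructure containment rather than an exact graph test: a compound that contains the substrate always passes, but passing does not prove containment.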
Search algorithm between two compounds
A* algorithm and heuristics
Here h(t) is the true distance from node t to the goal. That is, the heuristic function h'(t) must never overestimate the distance to the goal: h'(t) ≤ h(t). Such a heuristic function h'(t) is referred to as an "admissible heuristic". If a given heuristic is admissible, the A* algorithm is guaranteed to find a shortest path. The A* algorithm was implemented using a sorted priority queue that keeps the nodes to be traversed ordered by the value of the evaluation function f(t), whereas the breadth-first search uses a simple queue in which the nodes carry no weight.
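A compact sketch of such a search is given below. It assumes the standard A* evaluation function f(t) = g(t) + h'(t), where g(t) is the number of rules applied so far, and it represents nodes as tuples of path counts and rules as (substrate, operator) pairs. The depth limit `max_depth` is an addition of this sketch to keep the toy search finite; all names and the toy rules are hypothetical.

```python
import heapq
from itertools import count


def a_star(start, goal, rules, heuristic, max_depth=10):
    """Return a shortest list of rule indices leading from start to goal.

    heuristic(node, goal) must never overestimate the remaining distance.
    Returns None if no pathway with at most max_depth steps is found.
    """
    tie = count()  # tie-breaker so the heap never compares nodes or paths
    frontier = [(heuristic(start, goal), next(tie), 0, start, [])]
    best_g = {start: 0}
    while frontier:
        _, _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if g > best_g[node] or g == max_depth:
            continue  # stale queue entry, or depth limit reached
        for idx, (substrate, operator) in enumerate(rules):
            # Substrate inclusion: every count of the node covers the substrate.
            if any(x < s for x, s in zip(node, substrate)):
                continue
            nxt = tuple(x + o for x, o in zip(node, operator))
            if g + 1 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g + 1
                f = g + 1 + heuristic(nxt, goal)
                heapq.heappush(frontier, (f, next(tie), g + 1, nxt, path + [idx]))
    return None


# Toy two-dimensional example: three rules, shortest pathway uses rules 0 and 1.
toy_rules = [((1, 0), (-1, 1)), ((0, 1), (0, 1)), ((0, 3), (0, -1))]
zero = lambda node, goal: 0  # with h' = 0 the search behaves like breadth-first search
toy_path = a_star((1, 0), (0, 2), toy_rules, zero)
```

Swapping `zero` for a tighter admissible heuristic changes only the order in which nodes are expanded, not the length of the returned pathway.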
Breadth-first search (exhaustive search)
By setting the heuristic function h'(t) to zero for any node t, the A* algorithm becomes equivalent to the breadth-first (BF) search, that is, an exhaustive search.
Manhattan distance
Here O_{ max } represents the maximum norm among all of the operator vectors. The MH distance divided by this norm is an admissible heuristic: it is the number of steps that would be needed to reach the goal node G if every step applied the operator with the largest norm, and hence it does not exceed the true distance to the goal node.
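The scaled Manhattan heuristic can be sketched in a few lines. The code assumes the form h'(t) = ||G - t||_1 / O_{ max }, with O_{ max } taken as the largest L1 norm among the operator vectors, which is our reading of the description above; at least one operator is assumed to be nonzero, and the toy values are hypothetical.

```python
def manhattan_heuristic(node, goal, operators):
    """Manhattan distance to the goal divided by the largest operator L1 norm."""
    o_max = max(sum(abs(x) for x in op) for op in operators)
    distance = sum(abs(g - t) for g, t in zip(goal, node))
    return distance / o_max


# Toy example: two operators, each step changes the vector by at most 2 in L1 norm.
toy_operators = [(-1, 1), (0, 1)]
estimate = manhattan_heuristic((1, 0), (0, 2), toy_operators)
```

In this toy case the estimate is 1.5, while the true distance is 2 (apply the first operator, then the second), so the estimate does not overestimate.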
Linear programming (LP) heuristics
This optimization problem is an Integer Programming (IP) problem. It is similar to the shortest reaction path problem between the start node and the goal node, except that it does not take into account the order of application of the reaction rules and it ignores the constraint conditions when applying reaction rules. Nevertheless, the solution to "minimize ∑_{ k } w_{ k }" provides the tightest lower bound for estimating the distance from the current node t to the goal node G, and it is admissible because it only drops constraints of the original problem. However, a critical defect is that the IP problem is computationally hard.
Our approach is to relax the constraints on the optimization problem "minimize ∑_{ k } w_{ k }" and to treat w_{ k } as a real number rather than an integer, that is, "continuous relaxation". The optimization problem now becomes an LP problem that can be solved in polynomial time. Note that allowing w_{ k } to be a real number means that we may apply an operator a real number of times, for example "apply the operator 1.5 times". In other words, a real-valued solution for the optimization problem "minimize ∑_{ k } w_{ k }" can be considered as the shortest distance to the goal node in a real vector space. In addition, it is well known that the optimal value of the real-valued problem "minimize ∑_{ k } w_{ k }" with linear equation constraints is never larger than that of the integer problem, because every integer solution is also a feasible real solution. Therefore, the LP solution is admissible and guarantees the shortest path. We use this value as the LP heuristic function, which is another advantage of our method using the vector representation.
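As a hedged illustration of the relaxation (the constraint form below is our reading of the description above, with $O_{k}$ the operator vectors, $t$ the current node and $G$ the goal node), the LP heuristic can be written as

$h'_{LP}(t) = \min \sum_{k} w_{k} \quad \text{subject to} \quad t + \sum_{k} w_{k} O_{k} = G, \quad w_{k} \ge 0, \quad w_{k} \in \mathbb{R}.$

For a one-dimensional toy case with $G - t = 3$, $O_{1} = 2$ and $O_{2} = 1$ (hypothetical values), the LP optimum is $w_{1} = 1.5$, $w_{2} = 0$ with value 1.5, whereas the integer optimum is $w_{1} = 1$, $w_{2} = 1$ with value 2. The relaxed value does not exceed the integer value, which is what admissibility requires.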
To solve the LP problem for the heuristic, we used IBM ILOG CPLEX [33]. CPLEX is one of the fastest optimization solvers; it supports linear programming, quadratic programming, constraint programming, and mixed integer programming, and is applicable to large-scale problems.
Results
Datasets and target pathways
KEGG Reaction dataset
Table 1. Reaction rules for the whole KEGG pathway database
Representation-depth                      0-1    0-2    0-3

Dimensionality of vector representation   76     254    653
Number of operator vectors O              4240   5542   8108
The number of operator vectors is smaller than the number of collected enzyme reactions (14570) for the following reasons:
 1. Some reactions are registered as different in KEGG, but the changes in structure are the same and only the substrates are different.
 2. Some reactions are actually different but are represented by the same vector.
 3. The structure registered as "main" is unchanged by the reaction.
The weakness arising from the second reason can be reduced by increasing the representation-depth for the vectors, which increases the number of reactions distinguished due to the improved expressive power.
DDT degradation pathway
In this study, we used the well-known DDT degradation pathway data set [34] as pathway data to verify the validity of our method. DDT is a chemical substance that can be synthesized at minimal cost, and began to be used as an insecticide during the 1940s because of its insecticidal action against many insects. However, the human carcinogenicity of DDT and its long-term persistence in the environment have since been pointed out [35]. Evaluating its negative impact on the environment and on human health is important, and studies analyzing the metabolism of DDT have continued in recent years [36].
Table 2. Reaction rules only for the DDT degradation pathway
Representation-depth          0-1   0-2   0-3

Number of operator vectors    38    44    46
In our experiments, 20 × 19 = 380 pathway routes were selected for the search problem. The first validation experiment only used the 46 enzyme reaction rules contained in the DDT degradation pathway. In the second "more general" experiment, all KEGG reaction rules were used to search the DDT pathway.
Reconstruction of DDT pathway by shortest path finding
Table 3. Agreement rate with the true pathway
Representation-depth            0-1    0-2    0-3

Agreement of the distance (%)   93.2   98.4   100
Agreement of the route (%)      81.3   93.2   100
Computational times for heuristics
Table 4. Average computational time (seconds/pair) for finding 380 pathway routes
Representation-depth   BF       MH       LP

0-1                    1534     27.9     0.872
0-2                    52.1     0.255    0.0325
0-3                    0.0240   0.0314   0.0310
Comparing the heuristic functions in this table shows that the LP heuristic achieved a substantial reduction in computational time: at depths 0-1 and 0-2 it was more than 1000 times faster than BF (0.872 s versus 1534 s, and 0.0325 s versus 52.1 s). At depth 0-3, in contrast, neither heuristic reduced the computation time relative to BF. This implies that, as the representation-depth increases, the substrate inclusion condition works more effectively, and the number of branches in the search space becomes smaller.
Table 5. Average number of branchings in the search (#branch/pair)
Representation-depth   BF     MH      LP

0-1                    8389   1572    225
0-2                    1575   129.4   17.4
0-3                    12.8   11.6    7.6
Prediction of DDT pathway using all KEGG reaction rules
Table 6. Average computational time (seconds/pair) using all KEGG reaction rules
Representation-depth   BF    MH    LP

0-1                    N/A   N/A   N/A
0-2                    N/A   N/A   N/A
0-3                    N/A   N/A   61.9
With the LP heuristic, the agreement rates with the true pathways were 100% (380/380) for both the distance and the route. Thus, even with the generic operators (all KEGG reaction rules), the method reproduced the known pathways reliably.
Prediction of Lutein biosynthesis pathway using all KEGG reaction rules
Another pathway prediction using all KEGG reaction rules was carried out for the Lutein biosynthesis pathway, a secondary metabolic pathway from the start compound "Lycopene" to the goal compound "Lutein". Lycopene is a red carotenoid and Lutein is a plant carotenoid, and there are two routes from Lycopene to Lutein in the KEGG pathway database: one via Zeinoxanthin and the other via α-Cryptoxanthin. Predicting the Lutein biosynthesis pathway poses an additional difficulty compared with the DDT pathway: the chemical structures in the pathway are significantly larger than those in the DDT pathway, and the KEGG pathway prediction tool PathPred [23, 37] could not predict this pathway.
Our method with the LP heuristic succeeded in precisely predicting all pathways between every pair of compounds on the Lutein biosynthesis pathway. The average computational time for the LP heuristic to predict the shortest paths for all pairs was 10.9 seconds. On the other hand, PathPred failed to predict the pathway between Lycopene and Lutein with its default parameters: "Simcomp Threshold" was set at 0.4, "Prediction cycle" was set at 1, and the reference pathway was set at "Biosynthesis of Secondary Metabolites (Plants)".
Finding novel biochemical pathways for secondary metabolites of plant origin
To demonstrate the effectiveness of our method for finding novel pathways, we applied our method to predict a biochemical pathway for the start node "Delphinidin" and the goal node "Gentiodelphin". Both compounds are present in the KEGG database. Gentiodelphin is a plant-derived secondary metabolite associated with blue dye, and is known to be synthesized from Delphinidin [23]. The KEGG pathway prediction tool PathPred was also used for performance comparison.
Overall, our A*-based algorithm with the LP heuristic is a more comprehensive and computationally efficient prediction method for biochemical pathway finding.
Discussion
We have achieved high-speed pathway predictions using a vector-based search that simply focuses on the 2D structures of compounds. The A* algorithm guarantees the discovery of the shortest path, and the efficient search is achieved by the Linear Programming heuristic that estimates the distance to the goal. Results of verification experiments show the high reproducibility of KEGG pathways, the validity of the novel predicted pathway, and the versatility of our method.
Search space for pathway predictions
The search space grows exponentially with the path length, $P={N}^{d}$, where P is the size of the search space if all solutions are explored, N is the number of reaction rules, and d is the distance from the start node to the goal node. In the heuristic search, the search space can be reduced by visiting the nodes on the true path on a priority basis. Table 5 shows that the LP heuristic can significantly reduce the search space compared to searching all possible solutions.
Here B is the ratio that bounds the branching, that is, the fraction of operator applications that are eliminated. When B increases, the base of the exponential function becomes smaller and hence the exponential increase can be reduced. That is, B plays a role in limiting the exponential expansion of the search space. The significant reduction in computational time achieved by increasing the representation-depth for the vector representation is considered to be due to this effect. In other words, designing a high-speed search method requires both an accurate heuristic function that estimates the distance to the goal and an effective bound on the branching to reduce the search space.
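A small worked example shows how strongly B matters. It assumes that the unbounded search space has the form $P = N^{d}$ and the bounded one the form $P = \{(1-B)N\}^{d}$; this is our reading of the definitions above, not a formula quoted from the text. With N = 8108 reaction rules, and hypothetical values d = 3 and B = 0.99 chosen only for illustration:

$N^{d} = 8108^{3} \approx 5.3 \times 10^{11}, \qquad \{(1-B)N\}^{d} = (0.01 \times 8108)^{3} \approx 5.3 \times 10^{5}.$

Under these assumptions, eliminating 99% of the operator applications at each step shrinks the space to be explored by a factor of $10^{6}$ for a pathway of length 3.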
Reproducibility of KEGG Pathway
Our experimental results for comprehensive predictions using all 8108 KEGG reaction rules show that our proposed method is able to reproduce enzyme reaction pathways in the KEGG pathway database with high accuracy. This is presumably due to the LP heuristic and bound on branching due to the substrate inclusion constraint on the vector representation.
De novo prediction of known and unknown biosynthetic pathways
The method proposed in this paper is a de novo prediction method in the sense that it does not require any initial network, such as the KEGG metabolic network, as input, and it does not merely traverse an existing pathway network. Our method takes as input the set of enzyme reaction rules collected from the KEGG pathway database. However, pathway prediction using the list of all reaction rules is not equivalent to a path search on the KEGG pathway network. For each compound occurring at a node in the KEGG pathway network, the network contains as edges only the enzyme reactions whose substrate is exactly equal to that compound. In contrast, our method applies a reaction rule to a given compound not only when the compound is equal to the substrate of the rule but also when it contains the substrate as a substructure (the substrate inclusion condition). Therefore, the search space of our method is exponentially larger than the KEGG pathway network. Further, our method is able to predict unknown biosynthetic pathways between two arbitrary known or unknown compounds.
Conclusions
We have proposed a computationally efficient method to predict biochemical reaction pathways that derive a goal compound from a start compound. A chemical compound is represented by a feature vector that counts the frequencies of substructure occurrences in the structural formula. A set of enzyme reaction rules collected from the KEGG pathway database was represented using operator vectors, by determining the structural change in the compounds before and after the reaction. Two constraint conditions when applying reaction rules were substrate inclusion and compound formation. By defining each compound vector as a node and each operator as an edge, prediction of reaction pathways was reduced to the shortest path search problem in a vector space. We proposed an efficient search method that uses the A* algorithm for the shortest path search problem. We used an LP solution for heuristic estimation of the distance to the goal. The results showed that our method had high reproducibility for KEGG pathways and a high possibility of predicting new reaction pathways. Larger-scale experiments are needed to test the general performance and stability of our method on a variety of known pathways; this is an important subject of future work. In addition, the resulting shortest distance can be regarded as a kind of similarity measure between compounds that reflects metabolic information, and hence applications to determining the similarity of compounds for drug discovery, such as [38–40], can also be expected.
Author's information
Department of Biosciences and Informatics, Faculty of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan.
List of abbreviations
 LP: Linear Programming
 MH: Manhattan
 BF: Breadth-first
 IP: Integer Programming
 DDT: dichlorodiphenyltrichloroethane
Declarations
Acknowledgements
This work was supported in part by a Grant program for bioinformatics research and development from the Japan Science and Technology Agency. This work was also supported by Grant-in-Aid for KAKENHI (Grant-in-Aid for Scientific Research) on Innovative Areas (No.221S0002) and Scientific Research (A) No.23241066 from the Ministry of Education, Culture, Sports, Science and Technology of Japan.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 17, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S17.
References
 Cho A, Yun H, Park J, Lee S, Park S: Prediction of novel synthetic pathways for the production of desired chemicals. BMC Systems Biology. 2010, 4: 35. 10.1186/1752-0509-4-35.
 Nicholson J, Connelly J, Lindon J, Holmes E: Metabonomics: a platform for studying drug toxicity and gene function. Nature Reviews Drug Discovery. 2002, 1 (2): 153-162. 10.1038/nrd728.
 Medema M, van Raaphorst R, Takano E, Breitling R: Computational tools for the synthetic design of biochemical pathways. Nature Reviews Microbiology. 2012, 10 (3): 191-202. 10.1038/nrmicro2717.
 Tohsato Y, Nishimura Y: Metabolic pathway alignment based on similarity between chemical structures. IPSJ Digital Courier. 2007, 3 (0): 736-745.
 Kotera M, McDonald A, Boyce S, Tipton K: Eliciting possible reaction equations and metabolic pathways involving orphan metabolites. Journal of Chemical Information and Modeling. 2008, 48 (12): 2335-2349. 10.1021/ci800213g.
 Leber M, Egelhofer V, Schomburg I, Schomburg D: Automatic assignment of reaction operators to enzymatic reactions. Bioinformatics. 2009, 25 (23): 3135-3142. 10.1093/bioinformatics/btp549.
 Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The Chemistry Development Kit (CDK): An open-source Java library for chemo- and bioinformatics. Journal of Chemical Information and Computer Sciences. 2003, 43 (2): 493-500. 10.1021/ci025584y.
 Rahman S, Bashton M, Holliday G, Schrader R, Thornton J: Small molecule subgraph detector (SMSD) toolkit. Journal of Cheminformatics. 2009, 1: 1-13. 10.1186/1758-2946-1-1.
 McGregor J, Willett P: Use of a maximum common subgraph algorithm in the automatic identification of ostensible bond changes occurring in chemical reactions. Journal of Chemical Information and Computer Sciences. 1981, 21 (3): 137-140. 10.1021/ci00031a005.
 Stahl M, Mauser H: Database clustering with a combination of fingerprint and maximum common substructure methods. Journal of Chemical Information and Modeling. 2005, 45 (3): 542-548. 10.1021/ci050011h.
 Takahashi Y, Sukekawa M, Sasaki S: Automatic identification of molecular similarity using reduced-graph representation of chemical structure. Journal of Chemical Information and Computer Sciences. 1992, 32 (6): 639-643. 10.1021/ci00010a009.
 Sussenguth E: A graph-theoretic algorithm for matching chemical structures. Journal of Chemical Documentation. 1965, 5: 36-43. 10.1021/c160016a007.
 Raymond J, Willett P: Effectiveness of graph-based and fingerprint-based similarity measures for virtual screening of 2D chemical structure databases. Journal of Computer-Aided Molecular Design. 2002, 16: 59-71. 10.1023/A:1016387816342.
 Raymond J, Willett P: Maximum common subgraph isomorphism algorithms for the matching of chemical structures. Journal of Computer-Aided Molecular Design. 2002, 16 (7): 521-533. 10.1023/A:1021271615909.
 Raymond J, Gardiner E, Willett P: Heuristics for similarity searching of chemical graphs using a maximum common edge subgraph algorithm. Journal of Chemical Information and Computer Sciences. 2002, 42 (2): 305-316. 10.1021/ci010381f.
 Cao Y, Jiang T, Girke T: A maximum common substructure-based algorithm for searching and predicting drug-like compounds. Bioinformatics. 2008, 24 (13): i366. 10.1093/bioinformatics/btn186.
 Hatzimanikatis V, Li C, Ionita J, Henry C, Jankowski M, Broadbelt L: Exploring the diversity of complex metabolic networks. Bioinformatics. 2005, 21 (8): 1603-1609. 10.1093/bioinformatics/bti213.
 Li C, Henry C, Jankowski M, Ionita J, Hatzimanikatis V, Broadbelt L: Computational discovery of biochemical routes to specialty chemicals. Chemical Engineering Science. 2004, 59 (22-23): 5051-5060. 10.1016/j.ces.2004.09.021.
 Hou B, Ellis L, Wackett L: Encoding microbial metabolic logic: predicting biodegradation. Journal of Industrial Microbiology & Biotechnology. 2004, 31 (6): 261-272.
 Langowski J, Long A: Computer systems for the prediction of xenobiotic metabolism. Advanced Drug Delivery Reviews. 2002, 54 (3): 407-415. 10.1016/S0169-409X(02)00011-X.
 Oh M, Yamada T, Hattori M, Goto S, Kanehisa M: Systematic analysis of enzyme-catalyzed reaction patterns and prediction of microbial biodegradation pathways. Journal of Chemical Information and Modeling. 2007, 47 (4): 1702-1712. 10.1021/ci700006f.
 Talafous J, Sayre L, Mieyal J, Klopman G: META. 2. A dictionary model of mammalian xenobiotic metabolism. Journal of Chemical Information and Computer Sciences. 1994, 34 (6): 1326-1333. 10.1021/ci00022a015.
 Moriya Y, Shigemizu D, Hattori M, Tokimatsu T, Kotera M, Goto S, Kanehisa M: PathPred: an enzyme-catalyzed metabolic pathway prediction server. Nucleic Acids Research. 2010, 38 (Web Server): W138-W143.
 Gao J, Ellis L, Wackett L: The University of Minnesota pathway prediction system: multi-level prediction and visualization. Nucleic Acids Research. 2011, 39 (Web Server): W406-W411.
 Gonzalez-Lergier J, Broadbelt L, Hatzimanikatis V: Theoretical considerations and computational analysis of the complexity in polyketide synthesis pathways. Journal of the American Chemical Society. 2005, 127 (27): 9930-9938. 10.1021/ja051586y.
 Yamanishi Y, Vert J, Kanehisa M: Supervised enzyme network inference from the integration of genomic data and chemical information. Bioinformatics. 2005, 21 (suppl 1): i468-i477. 10.1093/bioinformatics/bti1012.
 Feist A, Henry C, Reed J, Krummenacker M, Joyce A, Karp P, Broadbelt L, Hatzimanikatis V, Palsson B: A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information. Molecular Systems Biology. 2007, 3: 121.
 Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M: KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Research. 2012, 40: D109-D114. 10.1093/nar/gkr988.
 KEGG PATHWAY Database. [http://www.kegg.jp/kegg/pathway.html]
 Swamidass SJ, Chen J, Bruand J, Phung P, Ralaivola L, Baldi P: Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity. Bioinformatics. 2005, 21 (Suppl 1): 359-368.
 Nagamine N, Sakakibara Y: Statistical prediction of protein-chemical interactions based on chemical structure and mass spectrometry data. Bioinformatics. 2007, 23 (15): 2004-2012. 10.1093/bioinformatics/btm266.
 Sakakibara Y, Hachiya T, Uchida M, Nagamine N, Sugawara Y, Yokota M, Nakamura M, Popendorf K, Komori T, Sato K: COPICAT: A software system for predicting interactions between proteins and chemical compounds. Bioinformatics. 2012, doi:10.1093/bioinformatics/bts031
 IBM ILOG CPLEX. [http://www06.ibm.com/software/jp/websphere/ilog/optimization/coreproductstechnologies/cplex/]
 DDT degradation - Reference pathway. [http://www.kegg.jp/kegg-bin/show_pathway?map00351]
 Higginson J: DDT: Epidemiological evidence. IARC Scientific Publications. 1985, 65: 107-117.
 Manaca M, Grimalt J, Gari M, Sacarlal J, Sunyer J, Gonzalez R, Dobaño C, Menendez C, Alonso P: Assessment of exposure to DDT and metabolites after indoor residual spraying through the analysis of thatch material from rural African dwellings. Environmental Science and Pollution Research. 2011, 19 (3): 756-762.
 PathPred: Pathway Prediction server. [http://www.genome.jp/tools/pathpred/]
 Hattori M, Okuno Y, Goto S, Kanehisa M: Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. Journal of the American Chemical Society. 2003, 125 (39): 11853-11865. 10.1021/ja036030u.
 Tsuda K, Kin T, Asai K: Marginalized kernels for biological sequences. Bioinformatics. 2002, 18 (suppl 1): S268. 10.1093/bioinformatics/18.suppl_1.S268.
 Nagamine N, Shirakawa T, Minato Y, Torii K, Kobayashi H, Imoto M, Sakakibara Y: Integrating statistical predictions and experimental verifications for enhancing protein-chemical interaction predictions in virtual screening. PLoS Computational Biology. 2009, 5 (6): e1000397. 10.1371/journal.pcbi.1000397.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.