Volume 11 Supplement 1

## Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010)

- Research
- Open Access

# Reaction graph kernels predict EC numbers of unknown enzymatic reactions in plant secondary metabolism

- Hiroto Saigo†
^{1}, - Masahiro Hattori†
^{2}, - Hisashi Kashima
^{3}and - Koji Tsuda
^{4}Email author

**11 (Suppl 1)**:S31

https://doi.org/10.1186/1471-2105-11-S1-S31

© Saigo et al; licensee BioMed Central Ltd. 2010

**Published:**18 January 2010

## Abstract

### Background

Understanding of secondary metabolic pathway in plant is essential for finding druggable candidate enzymes. However, there are many enzymes whose functions are not yet discovered in organism-specific metabolic pathways. Towards identifying the functions of those enzymes, assignment of EC numbers to the enzymatic reactions they catalyze plays a key role, since EC numbers represent the categorization of enzymes on one hand, and the categorization of enzymatic reactions on the other hand.

### Results

We propose reaction graph kernels for automatically assigning EC numbers to unknown enzymatic reactions in a metabolic network. Reaction graph kernels compute similarity between two chemical reactions considering the similarity of chemical compounds in reaction and their relationships. In computational experiments based on the KEGG/REACTION database, our method successfully predicted the first three digits of the EC number with 83% accuracy. We also exhaustively predicted missing EC numbers in plant's secondary metabolism pathway. The prediction results of reaction graph kernels on 36 unknown enzymatic reactions are compared with an expert's knowledge. Using the same data for evaluation, we compared our method with E-zyme, and showed its ability to assign more number of accurate EC numbers.

### Conclusion

Reaction graph kernels are a new metric for comparing enzymatic reactions.

## Keywords

- Random Walk
- Label Sequence
- Graph Kernel
- Loganin
- Secologanin

## Background

Fulfilling such missing EC numbers on a pathway can be casted as a multi-class classification problem given a pair of substrate and product as an input and the corresponding EC number as an output. Kotera et al. proposed an automatic EC number assignment system "E-zyme" for metabolic reactions [3]. Yamanishi et al. recently reformed the engine of E-zyme by introducing multi-layered matching and weighted majority voting [4]. However, E-zyme is still based on the detection of maximum common subgraphs between chemical compounds, so the shift of a large chemical group is not correctly detected [5]. Also, the E-zyme system is a rule-based method and does not allow approximate matching, which results in poor coverage. In many cases, E-zyme rejects a query because none of the rules matches [6].

In this paper, we propose to represent a metabolic reaction as a *reaction graph*, where each vertex corresponds to a chemical compound, and an edge between two chemical compounds corresponds to their relationships in reaction. A reaction graph is a 'graph of graphs', because each node contains a graph representing a chemical compound. To evaluate the similarity of two reaction graphs, we use marginalized graph kernels [7] in a recursive way. First, we compute graph kernels between every pair of chemical compounds and then use it as a node kernel for an upper-level graph kernel. In our experiment based on the KEGG/REACTION database, our reaction graph kernel in combination with kernel nearest neighbor showed 83% accuracy for classifying 4610 reactions into 124 classes. Furthermore, we exhaustively extracted missing enzymatic reactions in the plant's secondary metabolism in the KEGG database. Among the 56 reactions extracted, we could assign EC numbers to 36 reactions with the help of an expert from the KEGG team. The performance of our method is compared with E-zyme on this external validation set. Reaction graph kernel successfully assigned EC numbers to 22 EC classes, 14 EC subclasses and 12 EC subsubclasses. On the other hand, E-zyme could assign EC numbers to only 14 EC classes, 10 EC subclasses and 8 EC subsubclasses, due to its low coverage. The biochemical grounds for manual assignments are shown together with the individual prediction results of reaction graph kernels and E-zyme.

Data and supplementary information is available from http://www.mpi-inf.mpg.de/%7Ehiroto/RGKDATA/

## Results and Discussion

### Reaction graph and reaction graph kernel

Given such a chemical reaction, a task is to predict the EC number of the enzyme catalyzing the reaction. In this case, the enzyme is *secologanin synthase* (EC 1.3.3.9), which turns a substrate (*Loganin*) into a product (*Secologanin*) with *NADPH* as a cofactor. However, if the information on the enzyme is not available, we need to look up the entries in the database whose reactions are similar to the reaction of interest. A reasonable similarity metric is a key to solving this problem.

*reaction graphs*. A reaction graph consists of vertices, which are compounds in a reaction, and edges which denote the relationships between compounds. The edge labels are chosen from either 'main', 'leave', 'cofactor', 'transferase' or 'ligase' based on the categorization in the KEGG/RPAIR database. We additionally introduced a 'group' edge which connects all the compounds on the substrate side of the reaction, and all the compounds on the product side of the reaction. An example of reaction graph corresponding to Equation (1) is presented in Figure 2.

To evaluate the similarity between two reaction graphs, we use random walk kernels [7] in a recursive way. We first compute all the pairwise similarities of the vertices (chemical compounds) using random walk kernels. Then the compound-wise similarities are used as the label matching probabilities for the upper-level graph kernel. The details of random walk kernels are described in the Methods section.

### Leave-one-out prediction of missing EC numbers

In order to evaluate the reaction graph kernels, we collected metabolic reactions from the KEGG/REACTION database. Following the pre-process used by [3], we did not use reactions which (i) do not have EC numbers, (ii) include chemical compounds whose structures are not available, (iv) have classes 97 and 99, (v) have only one reaction in the same subsubclass. This pre-processing resulted in 4, 610 reactions in 6 classes, 50 subclasses, and 124 subsubclasses.

In this experiment, we withheld one reaction from the database, and predicted its EC number using all the other reaction-enzyme pairs. For the prediction, we used the nearest neighbor approach based on the reaction graph kernels. For the calculation of the reaction graph kernels, we used Chemcpp (Available from http://chemcpp.sourceforge.net/) with the "non-tottering" option [8]. The random walk parameter of the lower-level and upper-level graph kernels were selected from {0.99, 0.8, 0.7, 0.6}, respectively, and 0.9 was used for both kernels, since it performed best in the experiments.

Leave-one-out cross validation accuracy

EC class | EC subclass | EC subsubclass | |
---|---|---|---|

full-edge | 94.8% | 86.0% | 82.5% |

RPAIR | 92.3% | 81.4% | 78.1% |

main-pair | 77.8% | 69.8% | 66.2% |

### Predicting EC numbers of unannotated reactions in plant's secondary metabolism

In order to further evaluate the proposed method, we performed a blind test, where we tested only reactions whose EC numbers are not yet assigned in the secondary metabolism of plants. First we collected metabolic reactions from the KEGG "Biosynthesis of Secondary Metabolites - Reference pathway" data. From the resulting 56 reactions, we removed 20 reactions which are either non-enzymatic reactions or multi-step reactions whose systems are too complicated, based on an expert's judgement. Then we tested the E-zyme and reaction graph kernels on the remaining 36 reactions.

Number of correct predictions and accuracy in top *k* candidates for 36 unknown reactions

Coverage | EC main | EC sub | EC subsub | ||
---|---|---|---|---|---|

RGK | TOP1 | 100% | 22 (61.1%) | 14 (38.9%) | 12 (33.3%) |

TOP3 | 56 (51.9%) | 30 (27.8%) | 24 (22.2%) | ||

TOP5 | 86 (47.8%) | 37 (20.6%) | 27 (15.0%) | ||

E-zyme | TOP1 | 61.1% | 14 (63.6%) | 10 (45.5%) | 8 (36.4%) |

TOP3 | 42 (63.6%) | 24 (36.4%) | 18 (27.3%) | ||

TOP5 | 57 (51.8%) | 30 (27.3%) | 24 (21.8%) |

A list of newly annotated reactions is presented in Additional file 1, together with prediction results of E-zyme and reaction graph kernels (RGK). For reaction graph kernels, the Z-score (
, where *x* is a raw score, and *μ* and *σ* are the mean and the standard deviation of the candidate scores) is calculated so that one can find a candidate with a saliently higher score than others. The biochemical grounds for the manual assignment of the EC numbers are presented in the "Comments" column. Since the enzyme nomenclature and the scheme of EC number classification had been published [9], we can infer the plausible EC numbers from given information of reaction formula. As can be seen in Additional file 1, some reactions progress in multiple steps and have several correct EC numbers. However, neither reaction graph kernels nor E-zyme considers such situations, which remains for future research.

## Conclusion

We proposed an alternative method for assigning EC numbers to unknown enzymatic reactions based on reaction graph kernels which measure similarity between reaction graphs. On a blind test predicting missing EC numbers in plant secondary metabolism pathway, we demonstrated that reaction graph kernels collected more number of accurate potential EC numbers than E-zyme.

## Methods

In this section, we introduce graph kernels that define similarity metrics between two labeled graphs.

### Random walk graph kernel

The key idea behind the random walk graph kernel is to use random walks on the given graphs to generate label sequences, and each graph is represented as a bag of label sequences from the random walks. The similarity of two graphs are defined as the number of common label sequences weighted by the probability of the corresponding walks (or more precisely, the probability of common label sequences being generated). The random walk graph kernel is a valid kernel, since it is interpreted as an inner product in the feature space spanned by the label sequences.

Let us assume that we want to define a similarity metric between two labeled graphs *G*_{1} = (*V*_{1}, *E*_{1}, *L*_{1}(*V*_{1})) and *G*_{2} = (*V*_{2}, *E*_{2}, *L*_{2}(*V*_{2})), where *V*_{1} and *V*_{2} are sets of vertices, *E*_{1} and *E*_{2} are sets of edges, and *L*_{1} and *L*_{2} are sets of labels of the vertices. Although our description assumes that the edges are not labeled we convert labeled edges to labeled vertices if the edges have labels. (Actually, we have bond labels in the lower-level graph kernel, and reaction labels in the upper-level graph kernel.) This conversion increases the number of vertices from |*V*_{1}| + |*V*_{2}| to |*V*_{1}| + |*V*_{2}| + |*E*_{1}| + |*E*_{2}| and doubles the number of edges.

*G*

_{1}and

*G*

_{2}to define our graph kernel. First, we define a random walk over one graph. Let u

_{1}(

*t*) be a |

*V*

_{1}|-dimensional vector representing the probability distribution of the position of the random walk over the vertices in

*G*

_{1}at time

*t*. The random walk starts with an initial distribution u

_{1}(0). One possible choice of u

_{1}(0) is the uniform distribution over

*V*

_{1}. At each time step

*t*, the random walk terminates with probability 1 -

*λ*

_{1}where 0 <

*λ*

_{1}< 1. The random walk proceeds with probability

*λ*

_{1}, and moves to the next vertex by using a transition matrix T

_{1}. The (

*i, j*)-th element of T

_{1}indicates the probability of a transition from the

*j*-th vertex to the

*i*-th vertex in

*G*

_{1}. One possible choice of

*T*

_{1}is the normalized adjacency matrix of

*G*

_{1}. The dynamics of the random walks over

*G*

_{1}are given as

*t*, the probability distribution over

*V*

_{1}is represented as (1 -

*λ*

_{1}) (

*λ*

_{1}T

_{1})

^{ t }u

_{1}(0). A random walk over

*G*

_{2}is defined by using u

_{2}(

*t*),

*λ*

_{2}, and T

_{2}defined accordingly. Since we want to compute the probability of two label sequences produced by the random walks matching, we consider the joint random walk using T

_{1}and T

_{2}over

*G*

_{1}and

*G*

_{2}, respectively. Specifically, the joint distribution of the two random walks is given as U(

*t*) = u

_{1}(

*t*) ⊗ u

_{2}(

*t*)

^{⊤}, where ⊗ indicates the Kronecker product. Noting that the two random walks are independent of each other, the dynamics for the joint random walk is given as

*V*

_{1}| × |

*V*

_{2}| vertex-wise kernel matrix. The values of M can take any values between zero and one according to the similarities between the labels. One simple choice of M is the Dirac kernel, where the (

*i*

_{1},

*i*

_{2})-th elements of M is 1 if the

*i*

_{1}-th node in

*V*

_{1}and the

*i*

_{2}-th node in

*V*

_{2}have an identical label, and is 0 otherwise. We will discuss the specific choice of M for our reaction kernel later. The dynamics of the "label matching" joint random walks are represented as

where * is the Hadamard (element-wise) product, and V(0) ≡ M* U(0).

*i*

_{1},

*i*

_{2})-th element of V is denoted by . Now our goal is reduced to computing the infinite sum . From Eq. (2), we have the relation

used by [10] to update the current solution starting from
← V(0). The computational complexity of each update is *O*(|*E*_{1}||*V*_{2}| + |*E*_{2}||*V*_{1}| + |*V*_{1}||*V*_{2}|), where the first term and the second term are for applying T_{1} and T_{2} to
, respectively. The third term is for the application of M, so it can be replaced by the number of non-zero elements in M. Therefore,
can be updated very efficiently if the graphs and M are sparse. The iteration is continued until convergence, but usually a few dozen steps are sufficient.

Our specific choices of M in our reaction graph kernel are as follows. For the upper-level reaction graph kernel, the elements of M for defining similarities among chemical compounds are replaced by the lower-level compound graph kernel, while we use the Dirac kernel for the elements of M for chemical reactions. In the lower-level compound graph kernel, we use the Dirac kernel for both bond similarities and atom similarities.

## Notes

## Declarations

### Acknowledgements

KT is supported by JSPS Kakenhi 21680025 and the Functional RNA Project of New Energy and Industrial Technology Development Organization (NEDO).

This article has been published as part of *BMC Bioinformatics* Volume 11 Supplement 1, 2010: Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S1.

## Authors’ Affiliations

## References

- Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita K, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Research 2006, 34: D354–357. 10.1093/nar/gkj102PubMed CentralView ArticlePubMedGoogle Scholar
- Bennett RN, Wallsgrove RM: Secondary metabolites in plant defence mechanisms. New Phytologist 1994, 127(4):617–633. 10.1111/j.1469-8137.1994.tb02968.xView ArticleGoogle Scholar
- Kotera M, Okuno Y, Hattori M, Goto S: Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions. Journal of American Chemical Society 2004, 126: 16487–16498. 10.1021/ja0466457View ArticleGoogle Scholar
- Yamanishi Y, Hattori M, Kotera M, Goto S, Kanehisa M: E-zyme: predicting potential EC numbers from the chemical transformation pattern of substrate-product pairs. Bioinformatics 2009, 25(12):i79-i86. 10.1093/bioinformatics/btp223View ArticleGoogle Scholar
- Arita M: In silico atomic tracing by substrate-product relationships in Escherichia coli intermediary metabolism. Genome Res 2003, 13: 2455–2466. 10.1101/gr.1212003PubMed CentralView ArticlePubMedGoogle Scholar
- Yamanishi Y, Mihara H, Osaki M, Muramatsu H, Esaki N, Sato T, Hizukuri Y, Goto S, Kanehisa M: Prediction of missing enzyme genes in a bacterial metabolic network. FEBS J 2007, 2262–2273. 10.1111/j.1742-4658.2007.05763.xGoogle Scholar
- Kashima H, Tsuda K, Inokuchi A: Marginalized Kernels between Labeled Graphs. In Proceedings of the Twentieth International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann; 2003:321–328.Google Scholar
- Pierre M, Ueda N, Akutsu T, Perret JL, Vert JP: Graph Kernels for Molecular Structure-Activity Relationship Analysis with Support Vector Machines. Journal of Chemical Information and Modeling 2005, 939–951.Google Scholar
- International Union of Biochemistry and Molecular Biology Nomenclature Committee:Enzyme nomenclature 1992. Academic Press; 1992. [http://www.chem.qmul.ac.uk/iubmb/enzyme/]Google Scholar
- Vishwanathan SVN, Borgwardt K, Schraudolph N: Fast computation of graph kernels. In Advances in Neural Information Processing Systems 19. MIT Press; 2007:1449–1456.Google Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.