First, a chemical compound is represented by a feature vector which counts the frequencies of substructures in the structural formula. Second, a set of enzyme reaction rules is collected from the KEGG pathway database. Third, a reaction rule is represented as an operator vector by detecting the structural change to compounds before and after the reaction. Fourth, by defining compound vectors as nodes and operators as edges, prediction of a reaction pathway from a start compound to a goal compound is reduced to the shortest path search problem in the vector space. Then, "the output for reaction pathway prediction consists of a sequence of applied reaction rules". The A* algorithm is used to efficiently search for the shortest path. Finally, the Linear Programming (LP) algorithm is used as an admissible heuristic for estimating the distance to the goal.
KEGG reaction data
The data for compounds and metabolic enzyme reaction information used in this method all come from KEGG. First, we extracted the information pathways from KEGG pathway [29]. By using the KEGG API, we concretely collected all enzyme reactions registered in the global map on the KEGG pathway. In the KEGG Reaction, a pair of compounds that are registered as "main" before and after the reaction indicate that it is a metabolic reaction present in KEGG enzyme or the KEGG pathway global map. In this study, the 2D-SDF structure was extracted only for those pairs registered as "main" to ensure the focus on compound metabolic reactions. Further, most KEGG reactions are registered as reversible reaction, and therefore, the forward and reverse directions were treated as a separate reaction. As a result, the 14570 enzyme reactions and the 6073 related compounds were obtained.
Representation of chemical compounds and enzyme reactions
A key idea in our method is that a chemical compound is converted to a feature vector that represents substructure statistics extracted from the structural formula of the compound. This feature-vector representation evaluates whether a feature, such as a specific substructure, exists in a chemical compound or how many times that feature appears. This converts information about compounds into numerical vectors, called feature vectors, whose i th value corresponds to the existence or frequency of the i th feature considered. This feature-vector representation enables us to reduce the pathway search problem to a computationally feasible problem in the vector space, as will be discussed later in detail.
Substructures or paths extracted from chemical structures, which are regarded as graphs with atoms as nodes and bonds as edges, can be an effective descriptor of chemical compounds [30–32]. In this study, the feature vector based on the 2D chemical structure is defined as the number of appearances of each path in the structure of the chemical compound as follows:
(1)
where is a set of paths whose length (depth), or number of bonds, is between l and u (u ≥ l) and which appear at least once in the chemical structures in the dataset. f
c
(p) is the number of appearances of path p in the structure of a chemical compound c.
For example, methane (CH4),
can be represented by the following feature vector:
(3)
We call the path length range specified by l and u the "representation-depth", and denote it by "depth l-u", which is crucial for the expressiveness of vector representation.
According to the feature-vector representation of chemical compounds, every enzyme reaction rule in the 14570 KEGG enzyme reactions is represented as an operator vector. An operator vector expresses the change in chemical structure before and after the reaction, which is computed as the subtraction of the substrate compound vector from the product compound vector: Let i denote the substrate compound of an enzyme reaction a and j denote the product compound of a. Let and denote the vector representation of i and j, respectively. Then, the operator vector O
a
for the reaction a is defined as (see also Figure 1):
Further, every reaction rule R
a
for the reaction a is defined as a pair R
a
= (U
a
, O
a
) of the substrate vector and the operator vector O
a
. As a result, an application of the enzyme reaction to a compound can be achieved simply by "addition" of the operator vector to the compound vector (see also Figure 1), and a reaction pathway from a start compound S to a goal compound G is represented by a sequence of applications (additions) of operator vectors:
In this method, different reactions may sometimes be represented by the same vector because of insufficient short-length path counts in the compound vector.
Two constraint conditions for applying enzyme reaction rules
As a constraint for applying a reaction rule to a compound, the substrate inclusion condition is set as inclusion of the substrate vector. When attempting to apply the operator vector of a reaction rule R = (U, O) to a compound vector C, the compound vector C must include the substrate vector U as part of its own structure (see Figure 2). The precise determination procedure requires the entries c
k
in the compound vector C = (c1, ..., c
n
) to exceed the corresponding value u
k
in the substrate vector U = (u1, ..., u
n
), as represented by the following formula:
(4)
Note that this computationally easy procedure for substrate inclusion is a great advantage of our method using vector representation, because the graph inclusion problem for determining whether a compound structure contains a substrate structure is computationally hard (NP-hard).
The second constraint is the "non-negative" compound-vector condition. Since the operator vector O denotes a change in structure before and after the reaction, it sometimes contains negative values and the application (addition) of this to a compound vector C may produce a vector containing negative values. A compound vector is a vector that we define as representing the frequency of occurrence of partial structures, so negative values are not appropriate. Therefore, we set the following non-negative condition to filter out compound vectors containing negative values, preventing inappropriate products that contain negative values as a result of applying an operator vector.
(5)
Search algorithm between two compounds
The purpose of this study is, given a start compound S and a goal compound G, to find the pathway leading to G by applying reaction rules. Here, the distance of a pathway between S and G is defined as "the number of rule applications (i.e., path length)". By introducing this distance definition, the metabolic pathway prediction problem between compounds can be replaced by a mathematical shortest path search problem. That is, finding the shortest path for reaching the integer vector of the goal compound by successively adding integer vectors of reaction rules to integer vectors of intermediate compounds can be considered as the problem of finding the shortest path in the integer-vector space. In this search space, a node is an intermediate compound vector and an edge is an applied reaction operator vector (see Figure 3).
A* algorithm and heuristics
The A* algorithm uses a best-first search and finds a least-cost path from a given start node to a goal node. It uses a distance-plus-cost heuristic function to determine the order in which the search algorithm visits nodes to be explored in the search space. In the A* algorithm, the evaluation function (the distance-plus-cost heuristic) f(t) for each node t in the search space is defined as the sum of two functions, the past-cost function p(t), which is the cost (distance) it has taken to get from the start node to the current node t, and a heuristic estimate h'(t) of the distance to the goal (see Figure 3):
In addition, the condition that ensures the A* algorithm finds a shortest path is expressed by the following formula:
where h(t) is the true distance to the goal. That is, a heuristic function h'(t) that always underestimates the distance to the goal is required. Such a heuristic function h'(t) is referred to as an "admissible heuristics". If a given heuristic is admissible, the A* algorithm will reliably find a shortest path. The A* algorithm was implemented using the data structure "sorted priority queue" for maintaining the nodes to be traversed with weights of the evaluation function value f(t), while the breadth-first search uses the simple "queue" for the nodes to be traversed with no weight.
Breadth-first search (exhaustive search)
By setting the heuristic function h'(t) to zero for any node t, the A* algorithm becomes equivalent to the breadth-first (BF) search as exhaustive search.
Manhattan distance
Since each node t is an n-dimensional vector representing a compound, the most simple heuristic function h'(t) is to use the Manhattan (MH) distance, denoted by MH(t, G), between the current node vector t and the goal node vector G:
(8)
However, naive use of the MH distance is inadmissible and does not guarantee the shortest path solution. Therefore, we use the following modified MH heuristic function h'(t):
(9)
where ||O
max
|| represents the maximum norm among all of the operator vectors. The MH distance divided by this norm becomes an admissible heuristic, because this modified MH distance indicates the number of times the goal node G is reached by only applying the largest norm operator, and hence does not exceed the true distance to the goal node.
Linear programming (LP) heuristics
A path from the current node t to the goal node G is represented by a sequence of reaction rules R1, ..., R
m
, and hence the difference ΔT = G - t between the current node vector t = (t1, ..., t
n
) and the goal node vector G = (g1, ..., g
n
) can be represented by a linear sum of operator vectors . Let w
k
denote the number of times that the operator vector O
k
is applied. Then the difference ΔT between the current node t and the goal node G may be expressed as follows:
and the sum of the coefficients w
k
is exactly equal to the distance (the number of applications of operators) between the current node and the goal node. Now, consider the following optimization problem for the current node vector t = (t1, ..., t
n
) and the goal node vector G = (g1, ..., g
n
):
(12)
(13)
This optimization problem is an Integer Programming (IP) problem. The solution to this problem is similar to that for the shortest reaction path problem between the start node and the goal node, except that it does not take into account the order of application of the reaction rules and it ignores the constraint conditions when applying reaction rules. Nevertheless, the solution to "minimize ∑
k
w
k
" provides the tightest underbound for estimating the distance from the current node t to the goal node G, and it is obviously admissible. However, a critical defect is that the IP problem is computationally hard.
Our approach is to relax the constraints on the optimization problem "minimize ∑
k
w
k
" and to treat w
k
as a real number rather than an integer, that is, "continuous relaxation". The optimization problem now becomes an LP problem that can be solved in polynomial time. Note that allowing w
k
to be a real number means that we may apply an operator a real number of times, for example "apply the operator 1.5 times". In other words, a real-valued solution for the optimization problem "minimize ∑
k
w
k
" can be considered as the shortest distance to the goal node in a real vector space. In addition, it is well known and obvious that the real solutions for the optimization problem "minimize ∑
k
w
k
" with linear equation constraints are always smaller than the integer solutions. Therefore, the LP solution is admissible for guaranteeing the shortest path. We use this value as the LP heuristic function, which is another advantage of our method using the vector representation.
For solving the LP heuristic, we used IBM ILOG CPLEX in [33]. CPLEX is one of the fastest optimization problem solvers, and can be used for linear programming, quadratic programming, constraint programming, mixed integer programming, and is applicable to large-scale problems.