### Basic Framework

One way to find paths that minimizes species crossing is to reduce this problem to the original k shortest path problem. This can be done by modelling the network using a reaction connecting graph where by the nodes are reactions and an edge connects two reactions if the product of one is the substrate of the other. Unique reaction nodes are duplicated for each species that the reaction occurs in. For example, if ten species are involved in a reaction, then ten separate reaction nodes will be created. An edge is assigned weight of zero if the two nodes have the same supporting species and one otherwise. Any k-shortest path produced by the algorithm also has minimum species crossing because each cross-species reaction in the path has a penalty of one, whereas reactions that do not cross species have no penalty. However, this results in a prohibitively large number of nodes and edges. For example, if every reaction is supported by 1000 species on average, then for every two reaction nodes in the original graph, we obtain 2000 nodes with 1,000,000 edges between them.

If the reaction nodes are compacted by storing all the species supporting that reaction, then a k-shortest-path approach does not work. This is because two routes might share a common reaction and a switch is incurred in one of the routes but not the other. On the other hand, a k-shortest-path approach requires the edge weight to be unchanged. When the shared reaction has a shared edge weight, a switch cannot be captured by such a framework. This suggests that paths should be scored independently from one another (See Figure 2).

Instead, the algorithm in CMPF uses bounded depth path enumeration and scores the paths based on a scoring scheme where paths that cross species/pathways many times have higher penalty scores than those that do not. We use a bipartite graph to model the metabolic network which consists of RPAIR nodes *r*_{1}, ..., *r*_{
n
} and compound nodes *c*_{1}, ..., *c*_{
n
}. An RPAIR is pair of compounds with similar chemical structure on two sides of a reaction [7]. A directed edge connects RPAIR to compound and vice versa if the compound participates in the reaction represented by the RPAIR. We use the notation RPAIR and reaction interchangeably in this paper, since they represent similar concepts.

Given a reaction *r*_{
i
} in a linear path *ϕ* = *r*_{1} → ... → *r*_{
n
}, we write *P*_{
ϕ
} (*r*_{
i
}) and *S*_{
ϕ
} (*r*_{
i
}) respectively to be the set of pathways and species that "support" the reaction *r*_{
i
} in the linear path *ϕ*. We assume that the pathways and species that support the first reaction *r*_{1} of *ϕ* are all the pathways and species that *r*_{1} belongs to; that is, *P*_{
ϕ
} (*r*_{1}) = *P*(*r*_{1}) and *S*_{
ϕ
} (*r*_{1}) = *S*(*r*_{1}). For an intermediate reaction *r*_{
i
} in the linear path *ϕ*, if possible, the pathways and species that support it should be those that *r*_{
i
} belongs to and also support the preceding reaction *r*_{i - 1}. Thus, *P*_{
ϕ
} (*r*_{
i
}) = *P*_{
ϕ
} (*r*_{i -1}) ∩ *P*(*r*_{
i
}), provided *P*_{
ϕ
} (*r*_{i-1}) ∩ *P*(*r*_{
i
}) is nonempty; and *S*_{
ϕ
} (*r*_{
i
}) = *S*_{
ϕ
} (*r*_{i - 1}) ∩ *S*(*r*_{
i
}), provided *S*_{
ϕ
} (*r*_{i - 1}) ∩ *S*(*r*_{
i
}) is nonempty. On the other hand, when the set of pathways or species that support the preceding reaction *r*_{i -1}is totally different from the set of pathways or species that *r*_{
i
} belongs to, it is not possible to transition from reaction *r*_{i -1}to *r*_{
i
} in the linear path *ϕ*. That is, in this case, a pathway or species switch is necessary. We assume that the entire set of pathways or species that *r*_{
i
} belongs to can be used for this switch. Thus, *P*_{
ϕ
} (*r*_{
i
}) = *P*(*r*_{
i
}) when *P*_{
ϕ
} (*r*_{i -1}) ∩ *P*(*r*_{
i
}) is empty; and *S*_{
ϕ
}(*r*_{
i
}) = *S*(*r*_{
i
}) when *S*_{
ϕ
} (*r*_{i -1}) ∩ *S*(*r*_{
i
}) is empty (See Figure 3).

Now, we are ready to define our basic framework of reaction transition weight and linear path score as follows:

**Definition 1** *The weight of a transition r*_{
i
} → *r*_{i+1}*in a linear path ϕ* = *r*_{1} → ... → *r*_{
n
} *based on pathways is defined as*

weigh{t}_{P,\varphi}\left({r}_{i}\to {r}_{i+1}\right)=\gamma +\delta ,

*where γ is some constant denoting the cost of making a within-pathway transition and δ is the extra cost of making a pathway switch; note that δ* = 0 *if P*_{
ϕ
}(*r*_{i -1}) ∩ *P*(*r*_{
i
}) *is nonempty, as a pathway switch is not needed in this situation. We will further refine γ and δ in the next section*.

The weight of a transition *r*_{
i
} → *r*_{i+1}based on species is denoted *weight*_{
S,ϕ
} (*r*_{
i
} → *r*_{i +1}), which is defined analogously.

**Definition 2** *The pathway- or species-based score of linear path ϕ* = *r*_{1} → ... → *r*_{
n
} *is just the sum of the weights of all the transitions in ϕ based on pathways or species. That is*,

score\left(\varphi \right)=\sum _{i=1}^{n-1}weigh{t}_{\varphi}\left({r}_{i}\to {r}_{i+1}\right)

*where weight*_{
ϕ
} (*r*_{
i
} → *r*_{i+1}) *is either weight* _{P,ϕ}(*r*_{
i
} → *r*_{i +1}) *or weight* _{S,ϕ}(*r*_{
i
} → *r*_{i+1}), *for pathway*- *or species-based score respectively*.

In our scoring scheme, the weight of a transition *r*_{
i
} → *r*_{i+1}in a linear path depends on whether the pathways (or species) that support *r*_{
i
} also support *r*_{i+1}in that specific linear path. Thus, a topologically shorter linear path may have a higher score (i.e., cost) than a longer one. Moreover, the computation of the score is independent between different linear paths, suggesting exhaustive enumeration of linear paths as a method to find and rank linear paths. Since most linear paths that are useful usually do not exceed a certain topological length and also because exhaustive enumeration can be slow, we use bounded depth enumeration to speed up the search process.

We construct a global metabolic bipartite graph annotated with species/pathway information and exclude non-main RPAIRs to avoid 'currency' compounds. We enumerate all possible paths up to a specified depth and return the *k* lowest scoring linear paths based on our scoring scheme. In this way, we guarantee that linear paths generated by our method have the least number of cross-pathway/species reactions and are always optimal up to the depth to which we traverse the graph.

### Framework Extensions

#### Edge Weights

Many other approaches have tried giving paths meaning by defining meaningful edge weights. Our framework allows us to reuse these ideas and incorporate these weights. For example, Croes et al. use node connectivity as edge weights so that 'currency' compounds are avoided [2]; and Xia et al. use the inverse of organism frequency that a reaction belongs to as edge weights so that reactions that belong to more organisms are preferred [11]. In CMPF, we use *γ*_{1} and *γ*_{2} to denote these two scoring strategies.

#### Penalty Scores

Native paths rarely involve a switch between pathways. On the other hand, alternative paths might involve such switching. The penalty given when a switch occurs can be made more meaningful, since some pathway crossing is preferable to others. Our framework discussed thus far allows us to rank linear paths by their pathway switching equivalence class. A pathway switching equivalence class here refers to a group of linear paths with the same number of pathway/switches. However, the arbitration of linear paths within the same pathway switching equivalence class is random.

We can do better by defining a function which computes the distance from one pathway to another. We define the 'metabolite closure' M(x) of a pathway x to be the set of metabolites that are generated within that pathway. We hypothesize that pathways performing similar function have similar metabolite closures because the end product often determines intermediate metabolites. Hence, a switch from pathway x to pathway y would be preferred if their metabolite closures agree well with each other.

**Definition 3** *The distance between two pathways x and y is the average of the normalized size of the set difference of their metabolite closures M(x) and M(y)*.

{\delta}_{1}^{\prime}\left(x,y\right)=0.5*\frac{M\left(x\right)\backslash M\left(y\right)}{M\left(x\right)}+0.5*\frac{M\left(y\right)\backslash M\left(x\right)}{M\left(y\right)}

In addition, a switch can involve two sets of pathways. In this case, we take the average distance of all possible switchings.

**Definition 4** *Given a transition r*_{
i
} → *r*_{i +1}*in a linear path ϕ and P*_{
ϕ
}(*r*_{
i
}) **∩** *P*(*r*_{i+1}) *is an emptyset, we define*

{\delta}_{1}=\frac{{\sum}_{x\in {P}_{\varphi}\left({r}_{i}\right),y\in P\left({r}_{i+1}\right)}{\delta}_{1}^{\prime}\left(x,y\right)}{|{P}_{\varphi}\left({r}_{i}\right)\times P\left({r}_{i+1}\right)|}

*and δ*_{1} = 0 *if P*_{
ϕ
} (*r*_{
i
}) ∩ *P*(*r*_{i+1}) *is non-empty*.

Similarly, a switch between species is more likely to happen if the two species are related. The distance between two species can be similarly defined using their metabolite closures. A more intuitive measure of species similarity is one that is based on the taxonomy tree. In this case, the path length from species x to y in the taxonomy tree can be used to determine the distance between them; we denote this distance by {\delta}_{2}^{\prime}\left(x,y\right).

**Definition 5** *Given a transition r*_{
i
} → *r*_{i+1}*in a linear path ϕ and S*_{
ϕ
} (*r*_{
i
}) ∩ *S*(*r*_{i+1}) *is an emptyset, we define*

{\delta}_{2}=\frac{{\sum}_{x\in {S}_{\varphi}\left({r}_{i}\right),y\in S\left({r}_{i+1}\right)}{\delta}_{2}^{\prime}\left(x,y\right)}{|{S}_{\varphi}\left({r}_{i}\right)\times S\left({r}_{i+1}\right)|},

*which is the average taxonomic path length between the two sets of species that support these two consecutive reactions in the linear path ϕ*; *δ*_{2} = 0 *if S*_{
ϕ
} *(r*_{
i
}*)* ∩ *S(r*_{i+1}) *is non-empty*.

#### Combination of Scoring Functions

We modify Definition 1 to a weighted sum of scores defined in the previous section.

**Definition 6** *The weight of a transition r*_{
i
} → *r*_{i + 1}*in a linear path ϕ = r*_{1} → ... → *r*_{
n
}*is defined as*

weigh{t}_{\varphi}\left({r}_{i}\to {r}_{i+1}\right)={w}_{1}*{\gamma}_{1}+{w}_{2}*{\gamma}_{2}+{w}_{3}*{\delta}_{1}+{w}_{4}*{\delta}_{2}

*where γ*_{1}, *γ*_{2}, *δ*_{1}, *and δ*_{2} *are described earlier*; *and w*_{1} + *w*_{2} + *w*_{3} + *w*_{4} = 1.