### Pathway alignment and data organization

Most pathway databases and repositories (such as KEGG [1], MetaCyc [2], and BioCarta [4] provide an instance of each pathway for each of the organisms in which this pathway occurs. Instances describing the same cellular process could be different, of course, but they are considered analogous. Thus, differences between two organisms that are specific to a given pathway can be found by using a pathway alignment engine such as MetaPathwayHunter (MPH) [5], to align the two instances against each other, producing (the not necessarily equal) two similarity scores. When looking at a set of *n* organisms, we can arrange the results of an all-against-all comparison of the organisms for a specific pathway in an *n* × *n* matrix *a*, where *a*_{
ij
}is the similarity score of aligning the pathway as it appears in organism *i* with its analogous occurrence in organism *j*.

Each such 2 dimensional array is called a *page* (Figure 1A), and in itself it constitutes a similarity matrix that can be used for building a phylogeny. A single pathway, however, is not a good enough basis for deriving meaningful phylogenies (and for calibrating the similarity score). Rather, we are interested in a global picture based on all the available pathways. Thus, we create a page for each of the *m* pathways in the database; then, the resulting pages are organized in an *n* × *n* × *m* 3 dimensional (3D) array, *A*, where each page is indexed by the corresponding pathway (Figure 1B); now *A*_{
ijk
}is the similarity score for comparing pathway *k* as it appears in organisms *i* and *j*. Note that since some pathways do not exist for some of the organisms (or are missing from the database), the sizes of the pages might not be commensurate with each other; to accommodate for this phenomenon and to obtain a regular *n* × *n* × *m* 3D array we stretch the pages to be the same size, denoting missing entries appropriately (see more on this below).

### Interpreting and weighting the data

Analyzing these data in order to obtain insights about the evolutionary history of the organisms was performed as follows: We collapse the 3D array into a 2-dimensional (dis)similarity matrix by combining the *m* entries in each column into one scalar entry; this matrix would serve us as a distance matrix in order to build a phylogenetic tree using known algorithms. The first step in the process of combining a column's entries is setting a threshold that describes scores as relevant: all similarity scores below this threshold (assuming all similarity scores, as produced by MPH, are lower than or equal to 0 and the higher the score - the better the similarity) will not contribute to the values in the similarity matrix, and all similarity scores above it will contribute to them equally. Using this method - on one hand - causes some loss of information, but - on the other hand - enables combining the data in an effective manner while also handling the issue of missing entries that will be discussed in more detail later in this section. For each page we create a *Threshold Graph* (TG) in the following manner: define a node *v* ∈ *V* for each organism; two such nodes are connected by a directed edge *e* ∈ *E* if their similarity score is higher than the given threshold. Using (1) we calculate *w*, an assessed weight for each TG:

This weight, inspired by (but not identical to) Shannon's entropy [24], reflects the amount of information in the page. This formula gives scores ≥0 to each graph, based on the degrees of all its nodes. A graph will have *w* = 0 if the degrees of all nodes in the graph are 0, since the part of the formula will be 0 for each node.

Alternatively, w = 0 if all node degrees are |V|, in which case the will be equal to 0 for all nodes. In the first case, the organisms are all different from one another and there is nothing to learn about their similarity. In the latter, each organism is "very similar" to all the other organisms, which can teach us nothing new about the whole group. Each edge in the TG is given a weight according to this formula. The 3D array is modified by replacing *A*_{
ijk
}by 0 if nodes *i* and *j* are not joined by an edge in the TG corresponding to pathway *k*, and by the calculated weight *w* if an edge exists between organisms *i* and *j* in the TG. In fact, each page is now the adjacency matrix of the weighted TG.

This technique also solves the problem of missing entries, as follows: for the cases where a pathway exists in one of the organisms but does not in the other, we decided never to connect the two organisms with an edge in the TG, and as a result the corresponding entry in the array will always be 0. To accommodate the cases where the pathway is missing in both organisms, we added a new global, Boolean parameter *b*: If *b* is true then each such two organisms will be joined by an edge, otherwise such edges will not be formed. Intuitively, if the confidence in the data is high then *b* should be assigned true, since in this case the absence of a pathway in both organisms tells us that they are similar in their lacking. If our confidence in the data is low, *b* should be assigned false since in this case the pathway could be missing due to lack of data, and this might not indicate a real (dis)similarity.

Naturally, using a threshold in forming the TG resulted in loss of information. However, we found that the advantage of being able to include missing entries is far more significant to our results than the information lost by not using the actual alignment score. Furthermore, by adjusting the threshold value, we can make sure that our results are as precise as possible. After dealing with this issue we can safely add all the similarity scores in each vector along the pathway axis and thus create a 2-dimensional similarity matrix.

### Building phylogenetic trees

The resulting 2D matrix is turned into a dissimilarity matrix so that it can be used as input to an algorithm for building phylogenetic trees, *e.g.* the neighbor-joining algorithm [25], as embodied in *e.g.* the Phylip [26] package, as follows: each element in the matrix is subtracted from the maximum element in it; recall that this element must appear along the main diagonal since each organism obtains a perfect similarity score when aligned with itself.

Finally, it is possible to identify the contribution of specific pathways to the resulting phylogeny: One can trace differences among phylogenetic trees back to the pathways that cause them in order to understand their biological reason; furthermore, the alignments produced by the MPH algorithm can be used to further elucidate these differences. Moreover, we can study the pathways whose TGs got the highest weights to see if they induce an interesting classification of the organisms into groups.

### The computational pipeline for inferring a phylogenetic tree based on metabolic pathways

The pseudo-code for the computational pipeline is given here:

**Input**: n: number of organisms

m: number of pathways

P: matrix n*m. Entry P_{ij} is the instance of pathway j in organism i

d: deletion score for MPH

t: threshold

b: Boolean parameter to determine whether a pathway missing in two organisms is considered over (*b* = true) or under (*b* = false) the threshold

**Output**: tree: a phylogenetic tree

**for** k := 1..m **do**

**for** i, j := 1..n **do**

**if** exists P_{ik} and exists P_{jk} **then**

**if** size of P_{ik} < P_{jk} **then** A_{ijk} := MPH(P_{ik}, P_{jk}, d);

**else** A_{ijk} := MPH(P_{jk}, P_{ik}, d);

**elsif** not exists P_{ik} and not exists P_{jk} **then** A_{ijk} := *both missing*;

**else** A_{
ijk
}:= *one missing*;

build graph TG(V, E) s. t.

V := {1..n};

E := {(i, j) | A_{ijk} ≥ t}; //*only numeric values*

**if** b **then** E := E ∪ {(i, j) | A_{ijk} = *both missing*};

;

**for** i, j := 1..n **do**

**if** (i, j) ∈ E **then** A_{ijk} := *w*(*TG*(*V*, *E*));

**else** A_{ijk} := 0;

**for** i, j := 1..n **do**

SimMat_{ij} := ;

DisSimMat := SimMat_{11}*1_{(n, n)} - SimMat;

tree := NeighbourJoining(DisSimMat);

The time complexity of this algorithm is *O(mn*^{2}), which is dominated by the time needed to read the entries of the input array, assuming that *m* >>*n*.

### Data sources

Our algorithm was applied to data that was obtained from the MetaCyc repository [2]. We chose this database since it is very rich in data and it is organized in a way that best suits our requirements, *i.e.*, it contains one variation of each pathway for all the organisms for which it is known. Initially we used over 660 metabolic pathways for the 205 organisms that were retrieved from this database (the entire data available at the time of the assay, September 2007). This presented a significant computational challenge, namely making over 26 million individual pathway alignments. Fortunately, this problem is "embarrassingly parallel", *i.e.* each alignment can be run separately, and thus we were able to solve it using a computational Grid facility.

### Studying co-evolution of metabolic pathways

For comparing the evolution of different metabolic pathways the following steps were performed. First, a distance matrix (*i.e.* a page, see Figure 1) was computed for each pathway (as before). Next, all the pages were compared to each other, one pair at a time. To this end, we considered both the fact that a pathway may appear only in part of the organisms and that in cases that a pathway appears in a pair of organisms its structure in the two organisms may be different (see the previous subsections). Thus, the distance measure between a pair of matrices includes two components: one is the generalized Hamming distance that 'captures' similar/non-similar appearances in the same organism, *H*(*p*_{1}, *p*_{2}). This can be further explained as follows: Let *d*_{1} denote the value of an entry in the distance matrix corresponding to a case where a pathway does not appear in the two organisms; let *d*_{2} denote the value of an entry in the distance matrix corresponding to a case where a pathway appears in the two organisms; let *d*_{3} denote the value of an entry in the distance matrix corresponding to a case where a pathway appears in the first organism but not in the second one; and let *d*_{4} denote the value of an entry in the distance matrix corresponding to a case where a pathway appears in the second organism but not in the first one. Let *D*(*x*, *y*) denote the contribution to the generalized Hamming distance due to value *x* in a certain entry in the page of one pathway and *y* in the same entry in the page of the second pathway. We used ∀_{
x
}*D*(*x*, *x*) = 0;

*D*(*d*_{3}, *d*_{4}) = *D*(*d*_{4}, *d*_{3}) = 2, *D*(*d*_{3}, *d*_{1}) = *D*(*d*_{4}, *d*_{1}) = *D*(*d*_{3}, *d*_{2}) = *D*(*d*_{4}, *d*_{2}) = 1, and *D*(*d*_{1}, *d*_{3}) = *D*(*d*_{1}, *d*_{4}) = *D*(*d*_{2}, *d*_{3}) = *D*(*d*_{2}, *d*_{4}) = 1.

The final score was normalized by dividing it by the number of entries in a page.

The second component, *L*(*p*_{1}, *p*_{2}), considers only entries where both pathways appear in both organisms and it is the *L*_{1} distance between the vectors that are composed of these entries in the two organisms. The final distance is a weighted average of the two distances, computed as in (2). As the first component of the score reflected a rougher distance measure we used *W*_{
p
}= 100; however, the result was robust to small changes in *W*_{
p
}.

Based on this measure we generated a distance matrix between pathways that can be further analyzed as described in the next section (for example by clustering analysis).