A clique-based method for the edit distance between unordered trees and its application to analysis of glycan structures.

BACKGROUND
Measuring similarities between tree structured data is important for analysis of RNA secondary structures, phylogenetic trees, glycan structures, and vascular trees. The edit distance is one of the most widely used measures for comparison of tree structured data. However, it is known that computation of the edit distance for rooted unordered trees is NP-hard. Furthermore, there is almost no available software tool that can compute the exact edit distance for unordered trees.


RESULTS
In this paper, we present a practical method for computing the edit distance between rooted unordered trees. In this method, the edit distance problem for unordered trees is transformed into the maximum clique problem and then efficient solvers for the maximum clique problem are applied. We applied the proposed method to similar structure search for glycan structures. The result suggests that our proposed method can efficiently compute the edit distance for moderate size unordered trees. It also suggests that the proposed method has the accuracy comparative to those by the edit distance for ordered trees and by an existing method for glycan search.


CONCLUSIONS
The proposed method is simple but useful for computation of the edit distance between unordered trees. The object code is available upon request.


Background
Analysis of tree structured data is important in bioinformatics because there exist various kinds of tree structured biological data, which include RNA secondary structures [1,2], phylogenetic trees [3][4][5], glycans (i.e., sugar chains) [6][7][8][9], and vascular trees [10,11]. Various techniques have been applied to analyses of these tree structured data. Though machine learning techniques have been extensively applied to analysis of glycan structures [7][8][9], it is still important to develop simple comparison/search methods because machine learning methods are not appropriate for fast search of similar objects. Indeed, in analysis of biological sequences, such sequence search/comparison tools as FASTA, BLAST and SSEAECH are still widely used. Therefore, it is worthy to develop search/comparison methods for tree structured data. In order to compare tree structured data, it is required to define some measure of similarity or dissimilarity between two trees. Among various measures, the tree edit distance is the most fundamental and has been extensively studied [12]. It measures the distance between two trees by means of the minimum cost sequence of edit operations that transforms one tree into another tree, where an edit operation is either a deletion of a node, an insertion of a node, or a substitution of a label of a node. For the tree edit distance problem for ordered trees, Tai developed an O(n 6 ) time algorithm [13], where n is the number of nodes in a larger input tree. Several improvements followed from this work. Demaine et al. recently developed an O(n 3 ) time algorithm and showed that this bound is optimal under some computation strategy [14].
The tree edit distance between ordered trees is useful if the ordering among children has an important meaning (e.g., for RNA secondary structures). However, in some applications, it is preferable to regard input trees as unordered trees. At least, in many applications, more flexible matching can be made possible if input trees are regarded as unordered trees and thus the chance that similar data is missed can be decreased. It is to be noted that edit distance for unordered trees is always smaller than that for ordered trees. Unfortunately, Zhang et al. proved that the tree edit distance problem for unordered trees is NP-hard [15]. Furthermore, Zhang and Jiang proved that it is MAX SNP-hard [16], which means that there exists no polynomial time approximation scheme unless P=NP. In order to cope with this hardness, Akutsu et al. developed a fixed parameter algorithm which works in O(2.62 k · poly(n)) time [17], where k is the maximum allowed edit distance. Their algorithm might be useful for comparison of very similar trees (i.e., k is small). However, it is not useful for comparison of non-similar trees. Horesh et al. developed an A* algorithm [3]. Their algorithm works efficiently for moderate size trees. However, their algorithm can only handle unit cost cases (i.e., the cost of each edit operation is 1). Some alternatives to the tree edit distance for unordered trees have been proposed [6,12,18,19]. However, none of them is widely accepted as a measure of similarity for unordered trees. Therefore, it is still needed to develop a practical method for calculating tree edit distance between unordered trees.
In this paper, we propose a practical method using algorithms for computing the maximum clique. The idea of the method is simple: the edit distance problem is reduced to the maximum clique problem and then practical solvers for the maximum clique problem are applied. The maximum clique problem is a fundamental problem in computer science and is to find a complete subgraph of the maximum number of vertices in a given undirected graph. Though the maximum clique problem is proven to be NP-hard, several practical algorithms have been developed and successfully applied for practical problems [20][21][22][23]. By utilizing such algorithms [20,21], we can solve the edit distance problem for unordered trees of moderate size (i.e., trees with 304 5 nodes). Though similar reductions have been proposed for similar edit distance problems [24,25], to our knowledge, it is the first method that exactly solves the proper tree edit distance problem for unordered trees using maximum clique, where we use the fastest maximum clique algorithms [21,22] developed by one of the authors and his collaborators. Furthermore, to our knowledge, it is the first practical method for computing the unordered tree edit distance with general editing cost functions.
In order to evaluate the proposed method, we perform computational experiments using glycan structure data stored in the KEGG database [26]. The result suggests that our proposed method can efficiently compute the edit distance for moderate size unordered trees. It also suggests that the proposed method has the accuracy comparative to those by the edit distance for ordered trees and by an existing method for glycan search.

Tree edit distance
Here, we briefly review tree edit distance and edit distance mapping (see also Figure 1) for rooted, labelled and unordered trees [12,15,16].
Let T be a rooted unordered tree. We assume that each node v has a label ℓ(v) over an alphabet Σ. r(T), V (T), and E(T) denote the root of T, the set of nodes in T, and the set of edges in T, respectively. For a node v Î V(T), anc(v) denotes the set of ancestors of v. In the following, n denotes the number of nodes in a larger input tree (i.e., n = max{|V(T 1 )|, |V(T 2 )|}).
An edit operation on a tree T is either a deletion, an insertion, or a substitution, where each operation is defined as follows (see also Figure 1 For each of edit operations, the cost is defined as follows: g(a, b): cost of substituting a node with label a to label b, g(a, ): cost of deleting a node labeled with a, g( , a): cost of inserting a node labeled with a.
The edit distance dist(T 1 , T 2 ) between two unordered trees T 1 and T 2 is defined as the cost of the minimum cost sequence of edit operations that transforms T 1 to T 2 . In this paper, we adopt the following standard assumption so that dist(T 1 , T 2 ) becomes a distance metric [12,15]: • g(a, b) ≥ 0 for any (a, b) Σ′ × Σ′, • g(a, a) = 0 for any a Σ′, We call T 2 a subtree of T 1 if T 2 is obtained from T 1 only by deletion operations. It should be noted that this definition of subtree is different from a subgraph of a tree.
There exists a close relationship between the edit distance and the edit distance mapping (or just mapping) [12,15]. M ⊆ V (T 1 ) × V(T 2 ) is called a mapping if the following conditions are satisfied for any two pairs (u 1 Let I 1 and I 2 be the sets of nodes in V(T 1 ) and V(T 2 ) not appearing in M, respectively. Then, the following relation holds [12,15]: It is seen that f(u, v) = f(v, u) ≥ 0 holds. It should also be noted that under the unit cost model (i.e., g(a, b) = 1 for all a ≠ b), f(v, v) = 2 and f(u, v) = 1 hold for ℓ(u) ≠ ℓ(v). Let score(M) be the score of a mapping M defined by score(M) = ∑ (u,v) M f(u, v). Let M OPT be the mapping with the maximum score. Then, we can see from the definition that the following property holds [17]: assuming that the root of T 1 corresponds to the root of T 2 in M OPT , where this assumption can be removed if we add dummy nodes having the same label to T 1 and T 2 as the new roots.

Reduction to maximum clique
Let G(V, E) be an undirected graph. Then, a subgraph G′ The maximum clique problem is to find a maximum clique (i.e., a clique with the maximum number of vertices) in a given undirected graph. Though the maximum clique problem is known to be NP-hard, several practical algorithms have been developed [20][21][22][23]. In some cases, weighted versions of the maximum clique problem are utilized. Among such variants, we consider the case that weights are given to vertices. Let w(v) denote the weight of a vertex v in G(V, E). Then, a weighted version of the maximum clique problem is to find a clique G′(V′, E′) which maximizes ∑ vÎV′ w(v). In this paper, we call this variant the maximum vertex weighted clique problem, whereas the maximum clique problem denotes the original one.
Our proposed method is based on a simple reduction from the edit distance problem for unordered trees to the maximum clique problem. Based on Eq. (1), for calculating the tree edit distance, it is enough to find a mapping M which maximizes ∑ (u, v) M f(u, v). In order to find such a mapping, we construct an undirected graph G(V,E) from two input trees T 1 and T 2 by where the first two conditions and the last two conditions in the definition of E correspond to conditions (i) and (ii) for the edit distance mapping, respectively. We can see that there is a one-to-one correspondence between the set of cliques and the set of edit distance mappings, where we let r(T 1 ) correspond to r(T 2 ) (because the root cannot be deleted or inserted). Here, we assign weight w (x) to each vertex x = (u, v) V by w(x) = f(u, v). Then, we can see from Eq. (1) that the tree edit distance can be obtained by finding a maximum vertex weighted clique.
It is to be noted that if we consider the case of g(a, ) = g( , a) = 1, g(a, a) = 0 for all a Σ, and g(a, b) = 2 for all Fukagawa et al. BMC Bioinformatics 2011, 12(Suppl 1):S13 http://www.biomedcentral.com/1471-2105/12/S1/S13 and thus we can use a non-weighted version of maximum clique algorithms (see Figure 2). In such a case, the resulting mapping gives a largest common subtree (a tree with the largest number of nodes which is a subtree of both T 1 and T 2 ) [17].

Maximum clique algorithms
In this study, we use algorithms for both the maximum clique problem and the maximum vertex weighted clique problem. For both problems, Tomita and his collaborators have been developing several algorithms.
Recent studies on comparison with other existing algorithms suggest that their algorithms are fastest in most cases [22]. Based on several preliminary experiments, we chose MCQ and MWCQ as algorithms for the maximum clique problem and the maximum vertex weighted clique problem, respectively, where MWCQ is basically an extended version of MCQ. Details of MCQ and MWCQ are given in [21] and [20], respectively. Though there are some theoretical studies on related algorithms [23], the worst case time complexities of MCQ and MWCQ are left open. Therefore, we cannot discuss the time complexity of our proposed method, whereas it is straight-forward to see that the graph obtained by reduction from input trees has O(|V(T 1 )| × |V(T 2 )|) nodes and O(|V(T 1 )| 2 × |V(T 2 )| 2 ) edges.

Results
We implemented the above mentioned maximum clique-based method (MCQ-based method) and maximum vertex weighted clique-based method (MWCQ-based method) using C language. We performed computational experiments using a PC with Intel Core 2 Duo 2.8 GHz CPU and 3.48 GB RAM running under the Cygwin/Widnows XP operating system. As tree structures, we used glycan structures obtained from KEGG/ Glycan database [26].

Results on efficiency
First we examined the computational efficiency of MWCQ-based method, where we used the standard weighting scheme (i.e., f(v, v) = 2 and f(u, v) = 1 for ℓ(u) ≠ ℓ(v)) corresponding to the unit cost edit distance. We randomly selected 100 pairs of glycan structures with a specified range of the total number of nodes (i.e., the sum of the numbers of nodes in T 1 and T 2 ) and measured the average CPU time (user time) per pair. Unbalanced cases in which the size of one structure was not larger than 1/3 of the other structure were excluded. For each of the ranges in 60~79, we took the average over 20 pairs because there did not exist an enough number of pairs, where we could use 18 pairs among 20 pairs for the range of 65~69 because there were two very bad cases for which the program could not output a solution within 10 minutes. For the ranges of 80~84 and 85~89, we could use only 9 and 5 pairs, respectively. The result is shown in Table 1. From this table, it is seen that the proposed method works efficiently for moderate size trees (i.e., trees with 30~45 nodes), which means that the proposed method works efficiently for most glycan structures.
Next we examined the computational efficiency of MCQ-based method, which corresponds to the case of computation of the largest common subtree (i.e., f(v, v) = 2 and f(u, v) = 0 for ℓ(u) ≠ ℓ(v)). As in the case of MWCQ-based method, we randomly selected 100 pairs of glycan structures with a specified range of the total number of nodes and measured the average CPU time, where we used a fewer number of pairs when the number of nodes was no less than 60 as in the case of MWCQ-based method. The result is shown in Table 2. From this table, it is seen that MCQ-based method works very fast for most glycan structures. It is to be noted that CPU time does not necessarily increase as the size of input trees because the size of transformed Fukagawa et al. BMC Bioinformatics 2011, 12(Suppl 1):S13 http://www.biomedcentral.com/1471-2105/12/S1/S13 clique instances strongly depends on the distribution of identical labels in input trees and thus does not so much depend on the size of input trees. Though MCQ-based method is very fast, it makes extensive use of identity of node labels (node pairs without non-identical labels are ignored and thus the number of remaining nodes in G(V, E) becomes very small). On the other hand, MWCQ-based method takes all node pairs between T 1 and T 2 into account and thus is not very fast. Compared with an existing method [3], MCQbased method is much faster but solves an easier problem (it seems from their results that their method can be applied to comparison of trees with up to 90 nodes (sum of two input trees) though CPU time is not shown in [3]). On the other hand, it seems from Table 1 that MWCQ-based method has a similar performance with that in [3] though [3] solves non-labeled cases whereas we solved labeled cases. However, MWCQ-based method has a merit: it can handle general editing cost functions whereas the method in [3] can only handle the unit editing cost.

Results on similar structure search
Though the ordered and unordered tree edit distances are widely-accepted (dis)similarity measures on trees, we performed computational experiments in order to examine how it is useful for similarity search for glycans. We used a dataset compiled by Yamanishi et al. [9] on four properties on glycans, where we used 355 structures among 356 glycan structures listed in their list since we could not obtain one structure. Though this dataset is prepared for evaluating machine learning methods, we applied it to evaluation of search methods. We compared the following four similarity search methods: global glycan alignment and local glycan alignment implemented in the KCaM glycan search tool (version of Sept. 2004 with the default parameters) [6], unit cost ordered tree edit distance, and unit cost unordered tree edit distance (i.e., MWCQ-based method). Glycan alignment scores were introduced for efficient comparison of glycan structures. Though it is based on tree edit distance, the deletion (and corresponding insertion) operation is simplified so that only one child and its descendants can survive if a node is deleted. Therefore, there is a possibility that similar structures are missed by glycan alignment.
We evaluated the performance of similarity search using the AUC score [27]. In order to apply the AUC score, we need positive and negative samples. For that purpose, each pair of sequences in the dataset is regarded as a positive sample if the distance (resp., alignment score) is smaller (resp., greater) than a given threshold. Otherwise, it is regarded as a negative sample. Each positive sample is classified into either true positive or false positive according to whether sequences in the pair belong to the same class. Similarly, each negative sample is classified into either false negative or true negative according to whether sequences in the pair belong to the same class. Then, true positive rate and false positive rate are defined as the ratio of the number of true positive samples to the number of true positive and false negative samples and the ratio of the number of false positive samples to the number of false positive and true negative samples, respectively. The ROC (Receiver Operating Characteristic) curve is a graphical plot of the true positive rate vs. the false positive rate obtained by varying the threshold. The AUC (Area Under Curve) score is defined as the area under the ROC curve: AUC scores of 1 and 0.5 correspond to complete classification and random classification, respectively. The resulting ROC curves are shown in Figure 3 and Figure 4, and the resulting AUC scores are shown in Table 3. It should be mentioned that we could not obtain meaningful AUC scores for plasma and serum datasets (i.e., AUC scores for these data sets were less than 0.65 though the local alignment method produced better results). Since it seems that these data sets are not  appropriate for simple search, Table 3 lists AUC scores only for leukemia and erythrocyte datasets. It is seen from the table that the tree edit distance measures are better than the alignment scores for leukemia data but are worse for erythrocyte data. It is also seen that the AUC scores for ordered tree edit distance are very close to the AUC scores for unordered tree edit distance. Though we cannot conclude that the unordered tree edit distance is better than other similarity measures for glycan search, it is comparative to other measures. Fukagawa et al. BMC Bioinformatics 2011, 12(Suppl 1):S13 http://www.biomedcentral.com/1471-2105/12/S1/S13 In order to see more differences between ordered tree edit distance and unordered tree edit distance, we computed ordered tree edit distances when the order of children of every node was reversed in one of the input trees. The results are also shown in Table 3 (denoted as "reversed ordered tree"), Figure 3, and Figure 4 (denoted as 'reverse'). For the result on the erythrocyte dataset, it is seen that the difference between ordered tree edit distance and unordered tree edit distance becomes larger (i.e., the difference increases from 0.004 to 0.008) though it is still small. Though we do not clearly understand the reason of this small difference, it might be because a single path in each glycan structure is relevant for the features studied in this paper.
The total CPU time for computing the distances (or scores) between all pairs of glycans in the dataset is also shown for each method in Table 3. Though the proposed clique-based method took more CPU time than other methods, the differences were not very large. It should be mentioned that we used a clique-based method for computing ordered tree edit distance for simplicity of implementation and thus CPU time on ordered tree edit distance would be much larger here than that by an efficient dynamic programming-based algorithm [14], but that is not relevant because CPU time for unordered tree edit distance is fast enough in Table 3.
Here, we briefly explain the methodological differences among measures. Figure 5 illustrates the difference between unordered tree edit distance and ordered tree edit distance. As shown in Figure 5, suppose that there exist two trees T 1 and T 2 with roots r 1 and r 2 . Suppose further that r 1 has two subtrees A and B, and r 2 has two subtrees B′ and A′ in these orders, where A and A′ (B and B′, respectively) are similar to each other, and A is larger than B. If two trees are regarded as ordered, ordered tree edit mapping takes only matching between A and A′ into account. Otherwise, unordered tree mapping takes matchings between A and A′, and between B and B′ into account. Figure 6 illustrates the difference of the deletion operation between tree edit and glycan alignment. In tree edit, all children of the deleted node u become children of the parent v of u. However, in glycan alignment, only one child can be a child of v and the other children are deleted along with their descendants, where the surviving child is chosen so that the resulting score is maximized. It is seen from these figures that the tree edit distance for unordered trees provides the most flexible matching.

Conclusions
In this paper, we have proposed a clique-based method for computing the tree edit distance between rooted A' B' Figure 5 Comparison of unordered and ordered tree edit distances.
Fukagawa et al. BMC Bioinformatics 2011, 12(Suppl 1):S13 http://www.biomedcentral.com/1471-2105/12/S1/S13 unordered trees. We implemented two versions: one using a maximum clique (MCQ) algorithm [21] and the other one using a maximum vertex weighted clique (MWCQ) algorithm [20]. The former one is faster than an existing A* algorithm [3]. However, it uses a non-standard editing cost scheme and thus is not more useful than the A* algorithm. The efficiency of the latter one is similar to that of the A* algorithm. However, it has two merits: it can handle general cost distances whereas the A* algorithm can only handle the unit cost distance, improvements of maximum clique algorithms lead to improvements of the efficiency of edit distance computation.
We also compared the unordered edit distance with ordered edit distance, global and local glycan alignment scores for glycan similarity search. Though the result did not show clear advantage of the unordered edit distance, it was comparative to these three measures. It is to be noted that the unit cost model was used for edit distance measures whereas score functions specialized for glycans were used for glycan alignments. Therefore, if we use editing costs specialized for glycans, we may obtain better performances. Such a development is left as future work.
Finally we again note that the edit distances for both ordered and unordered trees are already established measures for calculating the (dis)similarity between trees [12]. Therefore, application of the proposed method is not limited to glycan structures. It might be applied to analysis of various tree structure data if each tree consists of up to several tens of nodes. Fukagawa et al. BMC Bioinformatics 2011, 12(Suppl 1):S13 http://www.biomedcentral.com/1471-2105/12/S1/S13