A clique-based method for the edit distance between unordered trees and its application to analysis of glycan structures
 Daiji Fukagawa^{1},
 Takeyuki Tamura^{2},
 Atsuhiro Takasu^{3},
 Etsuji Tomita^{4, 5} and
 Tatsuya Akutsu^{2}
https://doi.org/10.1186/1471-2105-12-S1-S13
© Fukagawa et al; licensee BioMed Central Ltd. 2011
Published: 15 February 2011
Abstract
Background
Measuring similarities between tree structured data is important for analysis of RNA secondary structures, phylogenetic trees, glycan structures, and vascular trees. The edit distance is one of the most widely used measures for comparison of tree structured data. However, it is known that computation of the edit distance for rooted unordered trees is NP-hard. Furthermore, there is almost no available software tool that can compute the exact edit distance for unordered trees.
Results
In this paper, we present a practical method for computing the edit distance between rooted unordered trees. In this method, the edit distance problem for unordered trees is transformed into the maximum clique problem and then efficient solvers for the maximum clique problem are applied. We applied the proposed method to similar structure search for glycan structures. The result suggests that our proposed method can efficiently compute the edit distance for moderate size unordered trees. It also suggests that the accuracy of the proposed method is comparable to that of the edit distance for ordered trees and that of an existing method for glycan search.
Conclusions
The proposed method is simple but useful for computation of the edit distance between unordered trees. The object code is available upon request.
Background
Analysis of tree structured data is important in bioinformatics because there exist various kinds of tree structured biological data, which include RNA secondary structures [1, 2], phylogenetic trees [3–5], glycans (i.e., sugar chains) [6–9], and vascular trees [10, 11]. Various techniques have been applied to analyses of these tree structured data. Though machine learning techniques have been extensively applied to analysis of glycan structures [7–9], it is still important to develop simple comparison/search methods because machine learning methods are not appropriate for fast search of similar objects. Indeed, in analysis of biological sequences, such sequence search/comparison tools as FASTA, BLAST and SSEARCH are still widely used. Therefore, it is worthwhile to develop search/comparison methods for tree structured data. In order to compare tree structured data, it is required to define some measure of similarity or dissimilarity between two trees. Among various measures, the tree edit distance is the most fundamental and has been extensively studied [12]. It measures the distance between two trees by means of the minimum cost sequence of edit operations that transforms one tree into another tree, where an edit operation is either a deletion of a node, an insertion of a node, or a substitution of a label of a node. For the tree edit distance problem for ordered trees, Tai developed an O(n^{6}) time algorithm [13], where n is the number of nodes in the larger input tree. Several improvements followed from this work. Demaine et al. recently developed an O(n^{3}) time algorithm and showed that this bound is optimal under some computation strategy [14].
The tree edit distance between ordered trees is useful if the ordering among children has an important meaning (e.g., for RNA secondary structures). However, in some applications, it is preferable to regard input trees as unordered trees. At least, in many applications, more flexible matching can be made possible if input trees are regarded as unordered trees and thus the chance that similar data is missed can be decreased. It is to be noted that the edit distance for unordered trees is never larger than that for ordered trees. Unfortunately, Zhang et al. proved that the tree edit distance problem for unordered trees is NP-hard [15]. Furthermore, Zhang and Jiang proved that it is MAX SNP-hard [16], which means that there exists no polynomial time approximation scheme unless P=NP. In order to cope with this hardness, Akutsu et al. developed a fixed parameter algorithm which works in O(2.62^{ k } · poly(n)) time [17], where k is the maximum allowed edit distance. Their algorithm might be useful for comparison of very similar trees (i.e., k is small). However, it is not useful for comparison of dissimilar trees. Horesh et al. developed an A* algorithm [3]. Their algorithm works efficiently for moderate size trees. However, their algorithm can only handle unit cost cases (i.e., the cost of each edit operation is 1). Some alternatives to the tree edit distance for unordered trees have been proposed [6, 12, 18, 19]. However, none of them is widely accepted as a measure of similarity for unordered trees. Therefore, a practical method for calculating the tree edit distance between unordered trees is still needed.
In this paper, we propose a practical method using algorithms for computing the maximum clique. The idea of the method is simple: the edit distance problem is reduced to the maximum clique problem and then practical solvers for the maximum clique problem are applied. The maximum clique problem is a fundamental problem in computer science and is to find a complete subgraph with the maximum number of vertices in a given undirected graph. Though the maximum clique problem is proven to be NP-hard, several practical algorithms have been developed and successfully applied to practical problems [20–23]. By utilizing such algorithms [20, 21], we can solve the edit distance problem for unordered trees of moderate size (i.e., trees with 30 ~ 45 nodes). Though similar reductions have been proposed for similar edit distance problems [24, 25], to our knowledge, it is the first method that exactly solves the proper tree edit distance problem for unordered trees using maximum clique, where we use the fastest maximum clique algorithms [21, 22] developed by one of the authors and his collaborators. Furthermore, to our knowledge, it is the first practical method for computing the unordered tree edit distance with general editing cost functions.
In order to evaluate the proposed method, we perform computational experiments using glycan structure data stored in the KEGG database [26]. The result suggests that our proposed method can efficiently compute the edit distance for moderate size unordered trees. It also suggests that the accuracy of the proposed method is comparable to that of the edit distance for ordered trees and that of an existing method for glycan search.
Methods
Tree edit distance
Let T be a rooted unordered tree. We assume that each node v has a label ℓ(v) over an alphabet Σ. r(T), V(T), and E(T) denote the root of T, the set of nodes in T, and the set of edges in T, respectively. For a node v ∈ V(T), anc(v) denotes the set of ancestors of v. In the following, n denotes the number of nodes in the larger input tree (i.e., n = max{|V(T_{1})|, |V(T_{2})|}).
An edit operation on a tree T is either a deletion, an insertion, or a substitution, where each operation is defined as follows (see also Figure 1):
Deletion: Delete a non-root node v in T with parent u, making the children of v become children of u. The children are inserted in the place of v into the set of the children of u.
Insertion: Inverse of delete. Insert a node v as a child of u in T, making v the parent of some of the children of u.
Substitution: Change the label of a node v in T.
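The deletion operation can be illustrated with a short Python sketch. It assumes a tree stored as two dictionaries, one from each node to the set of its children and one from each node to its parent; this representation and the helper name `delete_node` are choices made for this illustration, not part of the original method description.

```python
def delete_node(children, parent, v):
    """Delete non-root node v; its children become children of v's parent.

    children: dict mapping node -> set of child nodes (unordered tree)
    parent:   dict mapping node -> parent node (the root maps to None)
    Both dictionaries are modified in place.
    """
    u = parent[v]
    if u is None:
        raise ValueError("the root cannot be deleted")
    children[u].remove(v)
    for c in children[v]:
        # Re-attach each child of v directly under u.
        parent[c] = u
        children[u].add(c)
    del children[v]
    del parent[v]


# Example tree: a has children b and e; b has children c and d.
children = {"a": {"b", "e"}, "b": {"c", "d"}, "c": set(), "d": set(), "e": set()}
parent = {"a": None, "b": "a", "c": "b", "d": "b", "e": "a"}
delete_node(children, parent, "b")
print(sorted(children["a"]))  # children of a after deleting b
```

Insertion is the inverse: a new node is attached under u and adopts a chosen subset of the children of u.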
For each edit operation, the cost is defined as follows:
γ(a, b): cost of substituting label a of a node with label b,
γ(a, ε): cost of deleting a node labeled with a,
γ(ε, a): cost of inserting a node labeled with a,
where ε denotes the empty symbol.
The edit distance dist(T_{1}, T_{2}) between two unordered trees T_{1} and T_{2} is defined as the cost of the minimum cost sequence of edit operations that transforms T_{1} to T_{2}. In this paper, we adopt the following standard assumptions so that dist(T_{1}, T_{2}) becomes a distance metric [12, 15]:

γ(a, b) ≥ 0 for any (a, b) ∈ Σ′ × Σ′,

γ(a, a) = 0 for any a ∈ Σ′,

γ(a, b) = γ(b, a) for any (a, b) ∈ Σ′ × Σ′,

γ(a, c) ≤ γ(a, b) + γ(b, c) for any (a, b, c) ∈ Σ′ × Σ′ × Σ′,
where Σ′ = Σ ∪ {ε}. We call T_{2} a subtree of T_{1} if T_{2} is obtained from T_{1} only by deletion operations. It should be noted that this definition of subtree is different from a subgraph of a tree.
An edit distance mapping (or simply a mapping) between T_{1} and T_{2} is a set M ⊆ V(T_{1}) × V(T_{2}) such that every two pairs (u_{1}, v_{1}), (u_{2}, v_{2}) ∈ M satisfy the following conditions:
 (i) u_{1} = u_{2} iff v_{1} = v_{2},
 (ii) u_{1} ∈ anc(u_{2}) iff v_{1} ∈ anc(v_{2}).
Here we define a score function f(u, v) for (u, v) ∈ V(T_{1}) × V(T_{2}) by
f(u, v) = γ(ℓ(u), ε) + γ(ε, ℓ(v)) − γ(ℓ(u), ℓ(v)).
Then the edit distance is given by
dist(T_{1}, T_{2}) = ∑_{u∈V(T_{1})} γ(ℓ(u), ε) + ∑_{v∈V(T_{2})} γ(ε, ℓ(v)) − ∑_{(u, v)∈M_{OPT}} f(u, v),   (1)
where M_{OPT} is a mapping that maximizes ∑_{(u, v)∈M} f(u, v), assuming that the root of T_{1} corresponds to the root of T_{2} in M_{OPT}; this assumption can be removed if we add dummy nodes having the same label to T_{1} and T_{2} as the new roots.
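The relation between a mapping and its cost can be made concrete with a small Python sketch. The cost of a mapping equals the cost of deleting every node of T_{1}, plus the cost of inserting every node of T_{2}, minus the total score f(u, v) of the mapped pairs. The function names, the dictionary-based tree labels, and the use of `None` for the empty symbol are assumptions made for this illustration; the sketch does not check that the given mapping satisfies conditions (i) and (ii).

```python
def unit_cost(a, b):
    """Unit cost function: 0 for identical labels, 1 otherwise (None = empty symbol)."""
    return 0 if a == b else 1


def score(gamma, label_u, label_v, eps=None):
    """f(u, v): cost saved by mapping u to v instead of deleting u and inserting v."""
    return gamma(label_u, eps) + gamma(eps, label_v) - gamma(label_u, label_v)


def cost_of_mapping(gamma, labels1, labels2, mapping, eps=None):
    """Cost of the edit script induced by a (valid) mapping between two trees.

    labels1, labels2: dict node -> label for T1 and T2
    mapping: iterable of (u, v) pairs with u in T1 and v in T2
    """
    delete_all = sum(gamma(a, eps) for a in labels1.values())
    insert_all = sum(gamma(eps, b) for b in labels2.values())
    gain = sum(score(gamma, labels1[u], labels2[v], eps) for u, v in mapping)
    return delete_all + insert_all - gain


# T1 has nodes labeled a, b, c; T2 has nodes labeled a, c.
labels1 = {1: "a", 2: "b", 3: "c"}
labels2 = {1: "a", 2: "c"}
mapping = [(1, 1), (3, 2)]  # a -> a, c -> c; node b is left unmapped (deleted)
print(cost_of_mapping(unit_cost, labels1, labels2, mapping))
```

The edit distance is the minimum of this quantity over all valid mappings, which is why maximizing the total score is sufficient.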
Reduction to maximum clique
Let G(V, E) be an undirected graph. Then, a subgraph G′(V′, E′) of G(V, E) is called a clique if it is a complete subgraph (i.e., {{v_{ i }, v_{ j }} | v_{ i }, v_{ j } ∈ V′, v_{ i } ≠ v_{ j }} = E′). The maximum clique problem is to find a maximum clique (i.e., a clique with the maximum number of vertices) in a given undirected graph. Though the maximum clique problem is known to be NP-hard, several practical algorithms have been developed [20–23]. In some cases, weighted versions of the maximum clique problem are utilized. Among such variants, we consider the case that weights are given to vertices. Let w(v) denote the weight of a vertex v in G(V, E). Then, a weighted version of the maximum clique problem is to find a clique G′(V′, E′) which maximizes ∑_{ v∈V′ }w(v). In this paper, we call this variant the maximum vertex weighted clique problem, whereas the maximum clique problem denotes the original one.
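The difference between the two problem variants can be seen on a toy graph. The following Python sketch enumerates all vertex subsets by brute force; it is only an illustration of the definitions and is not the MCQ or MWCQ algorithm used in this paper. The function names and the edge representation (a set of `frozenset` pairs) are assumptions made for this example.

```python
from itertools import combinations


def is_clique(vertices, edges):
    """True if every pair of the given vertices is joined by an edge."""
    return all(frozenset((a, b)) in edges for a, b in combinations(vertices, 2))


def max_weight_clique(vertices, edges, weight):
    """Brute-force maximum vertex weighted clique (exponential time, tiny graphs only)."""
    best, best_weight = (), 0
    for k in range(1, len(vertices) + 1):
        for subset in combinations(vertices, k):
            if is_clique(subset, edges):
                w = sum(weight[v] for v in subset)
                if w > best_weight:
                    best, best_weight = subset, w
    return set(best), best_weight


vertices = [1, 2, 3, 4]
edges = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (3, 4)]}

# With unit weights the problem is the ordinary maximum clique problem.
print(max_weight_clique(vertices, edges, {v: 1 for v in vertices}))
# With a heavy vertex 4, a smaller clique has the larger total weight.
print(max_weight_clique(vertices, edges, {1: 1, 2: 1, 3: 1, 4: 5}))
```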
Our proposed method is based on a simple reduction from the edit distance problem for unordered trees to the maximum clique problem. Based on Eq. (1), for calculating the tree edit distance, it is enough to find a mapping M which maximizes ∑_{(u, v)∈M}f(u, v). In order to find such a mapping, we construct an undirected graph G(V,E) from two input trees T_{1} and T_{2} by
V = {(u, v) | u ∈ V(T_{1}), u ≠ r(T_{1}), v ∈ V(T_{2}), v ≠ r(T_{2})},
E = {{(u_{1}, v_{1}), (u_{2}, v_{2})} | u_{1} ≠ u_{2}, v_{1} ≠ v_{2}, u_{1} ∈ anc(u_{2}) iff v_{1} ∈ anc(v_{2}), u_{2} ∈ anc(u_{1}) iff v_{2} ∈ anc(v_{1})},
where the first two conditions and the last two conditions in the definition of E correspond to conditions (i) and (ii) for the edit distance mapping, respectively. We can see that there is a one-to-one correspondence between the set of cliques and the set of edit distance mappings, where we let r(T_{1}) correspond to r(T_{2}) (because the root cannot be deleted or inserted). Here, we assign weight w(x) to each vertex x = (u, v) ∈ V by w(x) = f(u, v). Then, we can see from Eq. (1) that the tree edit distance can be obtained by finding a maximum vertex weighted clique.
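The reduction can be sketched end to end in Python for very small trees. The sketch below builds the vertex set and the adjacency test exactly as in the definitions of V and E above, but it finds the best clique by brute-force enumeration rather than by MCQ or MWCQ, so it is usable only for trees with a handful of nodes. The parent-map tree representation, the function names, and the use of `None` for the empty symbol are assumptions made for this illustration.

```python
from itertools import combinations


def ancestors(parent):
    """Map each node to the set of its proper ancestors (parent map: root -> None)."""
    anc = {}
    for v in parent:
        chain, p = set(), parent[v]
        while p is not None:
            chain.add(p)
            p = parent[p]
        anc[v] = chain
    return anc


def unordered_edit_distance(parent1, label1, parent2, label2, gamma, eps=None):
    """Unordered tree edit distance via a brute-force maximum weighted clique."""
    anc1, anc2 = ancestors(parent1), ancestors(parent2)
    root1 = next(u for u in parent1 if parent1[u] is None)
    root2 = next(v for v in parent2 if parent2[v] is None)

    def f(u, v):
        a, b = label1[u], label2[v]
        return gamma(a, eps) + gamma(eps, b) - gamma(a, b)

    # Vertex set V: pairs of non-root nodes, each weighted by f(u, v).
    V = [(u, v) for u in parent1 if u != root1 for v in parent2 if v != root2]
    weight = {x: f(*x) for x in V}

    def adjacent(x, y):
        # Edge set E: the two pairs must be consistent with conditions (i) and (ii).
        (u1, v1), (u2, v2) = x, y
        return (u1 != u2 and v1 != v2
                and (u1 in anc1[u2]) == (v1 in anc2[v2])
                and (u2 in anc1[u1]) == (v2 in anc2[v1]))

    best = 0
    max_size = min(len(parent1), len(parent2)) - 1  # a clique uses each node at most once
    for k in range(1, max_size + 1):
        for S in combinations(V, k):
            if all(adjacent(x, y) for x, y in combinations(S, 2)):
                best = max(best, sum(weight[x] for x in S))

    delete_all = sum(gamma(a, eps) for a in label1.values())
    insert_all = sum(gamma(eps, b) for b in label2.values())
    # The roots are mapped to each other, so their score is always subtracted.
    return delete_all + insert_all - f(root1, root2) - best


def unit_cost(a, b):
    return 0 if a == b else 1


# T1: a(b(c, d), e).  T2: a(e, d, c), i.e. T1 with b deleted and children reordered.
parent1 = {0: None, 1: 0, 2: 1, 3: 1, 4: 0}
label1 = {0: "a", 1: "b", 2: "c", 3: "d", 4: "e"}
parent2 = {0: None, 1: 0, 2: 0, 3: 0}
label2 = {0: "a", 1: "e", 2: "d", 3: "c"}
print(unordered_edit_distance(parent1, label1, parent2, label2, unit_cost))
```

Because sibling order plays no role in the ancestor test, the reordered children of the root in the second tree do not contribute to the distance.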
Maximum clique algorithms
In this study, we use algorithms for both the maximum clique problem and the maximum vertex weighted clique problem. For both problems, Tomita and his collaborators have been developing several algorithms. Recent studies on comparison with other existing algorithms suggest that their algorithms are the fastest in most cases [22]. Based on several preliminary experiments, we chose MCQ and MWCQ as algorithms for the maximum clique problem and the maximum vertex weighted clique problem, respectively, where MWCQ is basically an extended version of MCQ. Details of MCQ and MWCQ are given in [21] and [20], respectively.
Though there are some theoretical studies on related algorithms [23], the worst case time complexities of MCQ and MWCQ are left open. Therefore, we cannot discuss the time complexity of our proposed method, whereas it is straightforward to see that the graph obtained by reduction from input trees has O(|V(T_{1})| × |V(T_{2})|) vertices and O(|V(T_{1})|^{2} × |V(T_{2})|^{2}) edges.
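To give a feeling for these bounds, the following Python sketch counts the vertices of the reduced graph (one per pair of non-root nodes) and the largest possible number of edges (every pair of vertices). The function name and the example tree sizes are hypothetical and chosen only for illustration; the actual number of edges is smaller because many vertex pairs violate the conditions in the definition of E.

```python
from math import comb


def reduced_graph_size_bounds(n1, n2):
    """Vertex count and an upper bound on the edge count of the reduced graph.

    n1, n2: numbers of nodes of the two input trees (roots included)
    """
    vertices = (n1 - 1) * (n2 - 1)   # pairs of non-root nodes
    max_edges = comb(vertices, 2)    # at most one edge per pair of vertices
    return vertices, max_edges


# Two hypothetical input trees with 45 nodes each.
print(reduced_graph_size_bounds(45, 45))
```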
Results
We implemented the above-mentioned maximum clique-based method (MCQ-based method) and maximum vertex weighted clique-based method (MWCQ-based method) in the C language. We performed computational experiments using a PC with an Intel Core 2 Duo 2.8 GHz CPU and 3.48 GB RAM running under the Cygwin/Windows XP operating system. As tree structures, we used glycan structures obtained from the KEGG/Glycan database [26].
Results on efficiency
Table 1. CPU time on maximum vertex weighted clique-based method

| total number of nodes | average CPU time (sec.) |
| --- | --- |
| 30 ~ 34 | 0.004340 |
| 35 ~ 39 | 0.004990 |
| 40 ~ 44 | 0.015200 |
| 45 ~ 49 | 0.050800 |
| 50 ~ 54 | 0.473000 |
| 55 ~ 59 | 2.160000 |
| 60 ~ 64 | 3.020000 |
| 65 ~ 69 | 15.300000 |
| 70 ~ 74 | 4.380000 |
| 75 ~ 79 | 2.610000 |
| 80 ~ 84 | 7.930000 |
| 85 ~ 89 | 232.000000 |
Table 2. CPU time on maximum clique-based method

| total number of nodes | average CPU time (sec.) |
| --- | --- |
| 30 ~ 34 | 0.010400 |
| 35 ~ 39 | 0.000191 |
| 40 ~ 44 | 0.000203 |
| 45 ~ 49 | 0.001100 |
| 50 ~ 54 | 0.000780 |
| 55 ~ 59 | 0.004530 |
| 60 ~ 64 | 0.125000 |
| 65 ~ 69 | 4.600000 |
| 70 ~ 74 | 0.016400 |
| 75 ~ 79 | 0.032800 |
| 80 ~ 84 | 0.000087 |
| 85 ~ 89 | 0.000032 |
Though the MCQ-based method is very fast, it makes extensive use of identity of node labels (node pairs with non-identical labels are ignored and thus the number of remaining vertices in G(V, E) becomes very small). On the other hand, the MWCQ-based method takes all node pairs between T_{1} and T_{2} into account and thus is not as fast. Compared with an existing method [3], the MCQ-based method is much faster but solves an easier problem (it seems from their results that their method can be applied to comparison of trees with up to 90 nodes (sum of two input trees), though CPU time is not shown in [3]). It seems from Table 1 that the MWCQ-based method has performance similar to that of the method in [3], though [3] solves unlabeled cases whereas we solved labeled cases. However, the MWCQ-based method has a merit: it can handle general editing cost functions whereas the method in [3] can only handle the unit editing cost.
Results on similar structure search
Though the ordered and unordered tree edit distances are widely accepted (dis)similarity measures on trees, we performed computational experiments in order to examine how useful they are for similarity search for glycans. We used a dataset compiled by Yamanishi et al. [9] on four properties of glycans, where we used 355 of the 356 glycan structures in their list since we could not obtain one structure. Though this dataset was prepared for evaluating machine learning methods, we applied it to evaluation of search methods. We compared the following four similarity search methods: global glycan alignment and local glycan alignment implemented in the KCaM glycan search tool (version of Sept. 2004 with the default parameters) [6], unit cost ordered tree edit distance, and unit cost unordered tree edit distance (i.e., the MWCQ-based method). Glycan alignment scores were introduced for efficient comparison of glycan structures. Though glycan alignment is based on tree edit distance, the deletion (and corresponding insertion) operation is simplified so that only one child and its descendants can survive if a node is deleted. Therefore, there is a possibility that similar structures are missed by glycan alignment.
Table 3. Comparison of glycan similarity measures via AUC score
In order to see more differences between ordered tree edit distance and unordered tree edit distance, we computed ordered tree edit distances when the order of children of every node was reversed in one of the input trees. The results are also shown in Table 3 (denoted as “reversed ordered tree”), Figure 3, and Figure 4 (denoted as ‘reverse’). For the result on the erythrocyte dataset, it is seen that the difference between ordered tree edit distance and unordered tree edit distance becomes larger (i.e., the difference increases from 0.004 to 0.008) though it is still small. Though we do not clearly understand the reason for this small difference, it might be because a single path in each glycan structure is relevant for the features studied in this paper.
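The reversal used in this experiment can be sketched in a few lines of Python. The sketch assumes an ordered tree written as a nested `(label, children)` pair, where `children` is a list in left-to-right order; this representation and the function name are choices made for this illustration. Reversal changes the ordered tree but leaves the underlying unordered tree unchanged.

```python
def reverse_children(tree):
    """Return a copy of an ordered tree with the child order reversed at every node."""
    label, children = tree
    return (label, [reverse_children(c) for c in reversed(children)])


# Ordered tree a(b(c, d), e).
tree = ("a", [("b", [("c", []), ("d", [])]), ("e", [])])
print(reverse_children(tree))
```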
The total CPU time for computing the distances (or scores) between all pairs of glycans in the dataset is also shown for each method in Table 3. Though the proposed clique-based method took more CPU time than other methods, the differences were not very large. It should be mentioned that we used a clique-based method for computing ordered tree edit distance for simplicity of implementation, and thus the CPU time for ordered tree edit distance is likely much larger here than it would be with an efficient dynamic programming-based algorithm [14]. This does not affect the conclusion, because the CPU time for unordered tree edit distance in Table 3 is already short enough.
Conclusions
In this paper, we have proposed a clique-based method for computing the tree edit distance between rooted unordered trees. We implemented two versions: one using a maximum clique (MCQ) algorithm [21] and the other using a maximum vertex weighted clique (MWCQ) algorithm [20].
The former is faster than an existing A* algorithm [3]. However, it uses a non-standard editing cost scheme and thus is not more useful than the A* algorithm. The efficiency of the latter is similar to that of the A* algorithm. However, it has two merits: it can handle general cost distances whereas the A* algorithm can only handle the unit cost distance, and improvements of maximum clique algorithms lead to improvements in the efficiency of edit distance computation.
We also compared the unordered edit distance with the ordered edit distance and with global and local glycan alignment scores for glycan similarity search. Though the result did not show a clear advantage of the unordered edit distance, it was comparable to these three measures. It is to be noted that the unit cost model was used for edit distance measures whereas score functions specialized for glycans were used for glycan alignments. Therefore, if we use editing costs specialized for glycans, we may obtain better performance. Such a development is left as future work.
Finally we again note that the edit distances for both ordered and unordered trees are already established measures for calculating the (dis)similarity between trees [12]. Therefore, application of the proposed method is not limited to glycan structures. It might be applied to analysis of various tree structure data if each tree consists of up to several tens of nodes.
Declarations
Acknowledgements
The work of TA was partially supported by MEXT Grant-in-Aid No. 22240009. The work of DF and AT was partly supported by MEXT Grant-in-Aid No. 18049069. The work of ET was partially supported by MEXT Grant-in-Aid No. 22500009.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S1.
References
1. Jiang T, Lin G, Ma B, Zhang K: A general edit distance between RNA structures. J Comput Biol 2002, 9: 371–388. 10.1089/10665270252935511
2. Ma B, Wang L, Zhang K: Computing similarity between RNA structures. Theoret Comp Sci 2002, 276: 111–132. 10.1016/S0304-3975(01)00192-X
3. Horesh T, Mehr R, Unger R: Designing an A* algorithm for calculating edit distance between rooted-unordered trees. J Comput Biol 2006, 13: 1165–1176. 10.1089/cmb.2006.13.1165
4. Jansson J, Ng JHK, Sadakane K, Sung WK: Rooted maximum agreement supertrees. Algorithmica 2005, 43: 293–307. 10.1007/s00453-004-1147-5
5. Moret BME, Wang LS, Warnow T: Toward new software for computational phylogenetics. IEEE Computer 2002, 35: 55–64.
6. Aoki KF, Yamaguchi A, Ueda N, Akutsu T, Mamitsuka H, Goto S, Kanehisa M: KCaM (KEGG Carbohydrate Matcher): a software tool for analyzing the structures of carbohydrate sugar chains. Nucleic Acids Res 2004, 32: W267–W272. 10.1093/nar/gkh473
7. Hashimoto K, Aoki-Kinoshita KF, Ueda N, Kanehisa M, Mamitsuka H: A new efficient probabilistic model for mining labeled ordered trees applied to glycobiology. ACM Trans Knowledge Discovery from Data 2008, 2: 6.
8. Kuboyama T, Hirata K, Aoki-Kinoshita KF: An efficient unordered tree kernel and its application to glycan classification. In Proceedings of the 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining (Lecture Notes in Computer Science, Vol. 5012). Edited by: T W, E S, M TK, A I. Springer; 2008:184–195.
9. Yamanishi Y, Bach F, Vert JP: Glycan classification with tree kernels. Bioinformatics 2007, 23: 1211–1216. 10.1093/bioinformatics/btm090
10. Chen Z, Molloi S: Automatic 3D vascular tree construction in CT angiography. Comp Med Imaging and Graphics 2003, 27: 469–479. 10.1016/S0895-6111(03)00039-9
11. Yu KC, Ritman EL, Higgins E: System for the analysis and visualization of large 3D anatomical trees. Comput in Biol and Med 2007, 37: 1802–1830. 10.1016/j.compbiomed.2007.06.005
12. Bille P: A survey on tree edit distance and related problems. Theoret Comput Sci 2005, 337: 217–239. 10.1016/j.tcs.2004.12.030
13. Tai KC: The tree-to-tree correction problem. J ACM 1979, 26: 422–433. 10.1145/322139.322143
14. Demaine ED, Mozes S, Rossman B, Weimann O: An optimal decomposition algorithm for tree edit distance. ACM Trans Alg 2009, 6: 1. 10.1145/1644015.1644017
15. Zhang K, Statman R, Shasha D: On the editing distance between unordered labeled trees. Inf Proc Lett 1992, 42: 133–139. 10.1016/0020-0190(92)90136-J
16. Zhang K, Jiang T: Some MAX SNP-hard results concerning unordered labeled trees. Inf Proc Lett 1994, 49: 249–254. 10.1016/0020-0190(94)90062-0
17. Akutsu T, Fukagawa D, Takasu A, Tamura T: Exact algorithms for computing tree edit distance between unordered trees. Theoret Comput Sci, in press.
18. Jiang T, Wang L, Zhang K: Alignment of trees - an alternative to tree edit. Theoret Comp Sci 1995, 143: 137–148. 10.1016/0304-3975(95)80015-8
19. Zhang K: A constrained edit distance between unordered labeled trees. Algorithmica 1996, 15: 205–222. 10.1007/BF01975866
20. Nakamura T, Tomita E: Efficient algorithms for finding a maximum clique with maximum vertex weight. In Technical Report of the University of Electro-Communications (in Japanese). Tokyo; 2005.
21. Tomita E, Seki T: An efficient branch-and-bound algorithm for finding a maximum clique. In Proceedings of the 4th International Conference on Discrete Mathematics and Theoretical Computer Science (Lecture Notes in Computer Science, Vol. 2731). Edited by: Calude C, Dinneen MJ, Vajnovszki V. Springer; 2003:278–289.
22. Tomita E, Sutani Y, Higashi T, Takahashi S, Wakatsuki M: A simple and faster branch-and-bound algorithm for finding a maximum clique. In Proceedings of the 4th International Workshop on Algorithms and Computation (Lecture Notes in Computer Science, Vol. 5942). Edited by: Rahman MS, Fujita S. Springer; 2010:191–203.
23. Tomita E, Akutsu T, Matsunaga T: Efficient algorithms for finding maximum and maximal cliques - Effective tools for bioinformatics. In Biomedical Engineering, Trends, Researches and Technologies. Vienna: INTECH; in press.
24. Pelillo M, Siddiqi K, Zucker SW: Matching hierarchical structures using association graphs. IEEE Trans Patt Anal Mach Intell 1999, 21: 1105–1119. 10.1109/34.809105
25. Torsello A, Hancock ER: Computing approximate tree edit distance using relaxation labeling. Patt Recog Lett 2003, 24: 1089–1097. 10.1016/S0167-8655(02)00255-6
26. Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M: KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res 2010, 38: D355–D360. 10.1093/nar/gkp896
27. Gribskov M, Robinson NL: Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput Chem 1996, 20: 25–33. 10.1016/S0097-8485(96)80004-0
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.