Tree edit distance
Here, we briefly review tree edit distance and edit distance mapping (see also Figure 1) for rooted, labelled and unordered trees [12, 15, 16].
Let T be a rooted unordered tree. We assume that each node v has a label ℓ(v) over an alphabet Σ. r(T), V(T), and E(T) denote the root of T, the set of nodes in T, and the set of edges in T, respectively. For a node v ∈ V(T), anc(v) denotes the set of ancestors of v. In the following, n denotes the number of nodes in the larger of the two input trees (i.e., n = max{|V(T1)|, |V(T2)|}).
An edit operation on a tree T is either a deletion, an insertion, or a substitution, where each operation is defined as follows (see also Figure 1):
Deletion: Delete a non-root node v in T with parent u, making the children of v become children of u. These children take the place of v in the set of children of u.
Insertion: Inverse of deletion. Insert a node v as a child of u in T, making v the parent of some of the children of u.
Substitution: Change the label of a node v in T.
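As an illustration, the three operations can be sketched on a tree stored as a child-to-parent map. The dictionaries `parent` and `label` and the function names are assumptions made for this sketch only; they are not part of the method described here.

```python
def delete(parent, label, v):
    """Delete the non-root node v; its children become children of its parent u."""
    u = parent[v]
    if u is None:
        raise ValueError("the root cannot be deleted")
    for c in parent:
        if parent[c] == v:
            parent[c] = u
    del parent[v], label[v]


def insert(parent, label, v, new_label, u, adopted=()):
    """Insert v as a child of u; v becomes the parent of the chosen children of u."""
    if v in parent:
        raise ValueError("node already exists")
    if any(parent[c] != u for c in adopted):
        raise ValueError("adopted nodes must be children of u")
    parent[v], label[v] = u, new_label
    for c in adopted:
        parent[c] = v


def substitute(label, v, new_label):
    """Change the label of node v."""
    label[v] = new_label


# Hypothetical example tree: r has child a, and a has children b and c.
parent = {"r": None, "a": "r", "b": "a", "c": "a"}
label = {"r": "A", "a": "B", "b": "C", "c": "D"}

delete(parent, label, "a")                           # b and c become children of r
insert(parent, label, "x", "B", "r", adopted=["b"])  # x now sits between r and b
substitute(label, "c", "E")                          # relabel c
```

Because the trees are unordered, a parent map is sufficient here: no sibling order needs to be maintained when children are moved.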
For each edit operation, the cost is defined as follows, where ε denotes the blank symbol:
γ(a, b): cost of changing the label of a node from a to b,
γ(a, ε): cost of deleting a node labeled with a,
γ(ε, a): cost of inserting a node labeled with a.
The edit distance dist(T1, T2) between two unordered trees T1 and T2 is defined as the cost of the minimum cost sequence of edit operations that transforms T1 to T2. In this paper, we adopt the following standard assumptions so that dist(T1, T2) becomes a distance metric [12, 15]:
- γ(a, b) ≥ 0 for any (a, b) ∈ Σ′ × Σ′,
- γ(a, a) = 0 for any a ∈ Σ′,
- γ(a, b) = γ(b, a) for any (a, b) ∈ Σ′ × Σ′,
- γ(a, c) ≤ γ(a, b) + γ(b, c) for any (a, b, c) ∈ Σ′ × Σ′ × Σ′,
where Σ′ = Σ ∪ {ε}. We call T2 a subtree of T1 if T2 is obtained from T1 only by deletion operations. It should be noted that this definition of subtree is different from a subgraph of a tree.
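The four assumptions on the cost function can be checked mechanically for a finite alphabet. The following is a minimal sketch; the function names and the use of `None` for the blank symbol are assumptions made for illustration.

```python
from itertools import product

EPS = None  # stands for the blank symbol (assumption of this sketch)


def satisfies_metric_assumptions(gamma, alphabet):
    """Check non-negativity, identity, symmetry and the triangle inequality."""
    symbols = list(alphabet) + [EPS]
    for a, b in product(symbols, repeat=2):
        if gamma(a, b) < 0 or gamma(a, b) != gamma(b, a):
            return False
    if any(gamma(a, a) != 0 for a in symbols):
        return False
    return all(gamma(a, c) <= gamma(a, b) + gamma(b, c)
               for a, b, c in product(symbols, repeat=3))


def unit_cost(a, b):
    """Every deletion, insertion and substitution costs 1."""
    return 0 if a == b else 1


def skewed_cost(a, b):
    """Deletion and insertion cost 1, substitution costs 3."""
    if a == b:
        return 0
    return 1 if EPS in (a, b) else 3


print(satisfies_metric_assumptions(unit_cost, "abc"))    # True
print(satisfies_metric_assumptions(skewed_cost, "abc"))  # False
```

The second cost function fails because a substitution (cost 3) is more expensive than a deletion followed by an insertion (cost 2), which violates the triangle inequality through the blank symbol.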
There exists a close relationship between the edit distance and the edit distance mapping (or just mapping) [12, 15]. M ⊆ V(T1) × V(T2) is called a mapping if the following conditions are satisfied for any two pairs (u1, v1), (u2, v2) ∈ M:
(i) u1 = u2 iff v1 = v2,
(ii) u1 ∈ anc(u2) iff v1 ∈ anc(v2).
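Conditions (i) and (ii) can be tested directly on a candidate set of node pairs. The sketch below assumes that each tree is given as a child-to-parent dictionary; this representation and the helper names are illustrative choices.

```python
from itertools import combinations


def ancestors(parent, v):
    """Return the set of proper ancestors of v."""
    result = set()
    p = parent[v]
    while p is not None:
        result.add(p)
        p = parent[p]
    return result


def is_mapping(M, parent1, parent2):
    """Check conditions (i) and (ii) for every two pairs in M."""
    for (u1, v1), (u2, v2) in combinations(list(M), 2):
        # (i) one-to-one
        if (u1 == u2) != (v1 == v2):
            return False
        # (ii) ancestor relations are preserved, checked in both directions
        if (u1 in ancestors(parent1, u2)) != (v1 in ancestors(parent2, v2)):
            return False
        if (u2 in ancestors(parent1, u1)) != (v2 in ancestors(parent2, v1)):
            return False
    return True


# T1 is the chain r1 - x - y; T2 has root r2 with two children p and q.
parent1 = {"r1": None, "x": "r1", "y": "x"}
parent2 = {"r2": None, "p": "r2", "q": "r2"}

print(is_mapping({("r1", "r2"), ("x", "p")}, parent1, parent2))              # True
print(is_mapping({("r1", "r2"), ("x", "p"), ("y", "q")}, parent1, parent2))  # False
print(is_mapping({("x", "p"), ("x", "q")}, parent1, parent2))                # False
```

The second candidate is rejected because x is an ancestor of y in T1 whereas p is not an ancestor of q in T2; the third is rejected because x is paired with two different nodes.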
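Let I1 and I2 be the sets of nodes in V(T1) and V(T2) not appearing in M, respectively. Then, the following relation holds [12, 15]:
dist(T1, T2) = min_M { ∑_{u∈I1} γ(ℓ(u), ε) + ∑_{v∈I2} γ(ε, ℓ(v)) + ∑_{(u,v)∈M} γ(ℓ(u), ℓ(v)) },
where the minimum is taken over all mappings M.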
Here we define a score function f(u, v) for (u, v) ∈ V(T1) × V(T2) by
f(u, v) = γ(ℓ(u), ε) + γ(ε, ℓ(v)) - γ(ℓ(u), ℓ(v)).
It follows from the symmetry of γ and the triangle inequality that f(u, v) = f(v, u) ≥ 0 holds. It should also be noted that under the unit cost model (i.e., γ(a, b) = 1 for all a ≠ b), f(u, v) = 2 holds for ℓ(u) = ℓ(v) and f(u, v) = 1 holds for ℓ(u) ≠ ℓ(v). Let score(M) be the score of a mapping M defined by score(M) = ∑_{(u,v)∈M} f(u, v). Let M_OPT be the mapping with the maximum score. Then, we can see from the definition that the following property holds [17]:
dist(T1, T2) = ∑_{u∈V(T1)} γ(ℓ(u), ε) + ∑_{v∈V(T2)} γ(ε, ℓ(v)) - score(M_OPT),    (1)
assuming that the root of T1 corresponds to the root of T2 in M_OPT, where this assumption can be removed if we add dummy nodes having the same label to T1 and T2 as the new roots.
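The link between the cost of a mapping and its score can be checked numerically: for any mapping M, the cost of deleting the unmapped nodes of T1, inserting the unmapped nodes of T2 and relabelling the mapped pairs equals the total deletion cost of T1 plus the total insertion cost of T2 minus score(M). The sketch below uses the unit cost model and hypothetical node names; it works on labels only and does not verify that M is a valid mapping.

```python
EPS = None  # stands for the blank symbol (assumption of this sketch)


def unit_cost(a, b):
    return 0 if a == b else 1


def f(a, b, gamma):
    """Score of pairing a node labeled a with a node labeled b."""
    return gamma(a, EPS) + gamma(EPS, b) - gamma(a, b)


def mapping_cost(M, label1, label2, gamma):
    """Cost computed directly from the unmapped nodes and the mapped pairs."""
    mapped1 = {u for u, _ in M}
    mapped2 = {v for _, v in M}
    deletions = sum(gamma(label1[u], EPS) for u in label1 if u not in mapped1)
    insertions = sum(gamma(EPS, label2[v]) for v in label2 if v not in mapped2)
    substitutions = sum(gamma(label1[u], label2[v]) for u, v in M)
    return deletions + insertions + substitutions


def cost_via_score(M, label1, label2, gamma):
    """Same cost, computed as (delete all) + (insert all) - score(M)."""
    total = (sum(gamma(a, EPS) for a in label1.values())
             + sum(gamma(EPS, b) for b in label2.values()))
    score = sum(f(label1[u], label2[v], gamma) for u, v in M)
    return total - score


label1 = {"r1": "a", "x": "b", "y": "c"}
label2 = {"r2": "a", "p": "b", "q": "d"}
M_full = {("r1", "r2"), ("x", "p"), ("y", "q")}
M_part = {("r1", "r2"), ("x", "p")}

print(mapping_cost(M_full, label1, label2, unit_cost),
      cost_via_score(M_full, label1, label2, unit_cost))  # 1 1
print(mapping_cost(M_part, label1, label2, unit_cost),
      cost_via_score(M_part, label1, label2, unit_cost))  # 2 2
```

Since the first two terms do not depend on M, minimizing the cost over mappings is the same as maximizing the score.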
Reduction to maximum clique
Let G(V, E) be an undirected graph. Then, a subgraph G′(V′, E′) of G(V, E) is called a clique if it is a complete subgraph (i.e., {{v_i, v_j} | v_i, v_j ∈ V′, v_i ≠ v_j} = E′). The maximum clique problem is to find a maximum clique (i.e., a clique with the maximum number of vertices) in a given undirected graph. Though the maximum clique problem is known to be NP-hard, several practical algorithms have been developed [20–23]. In some cases, weighted versions of the maximum clique problem are utilized. Among such variants, we consider the case that weights are given to vertices. Let w(v) denote the weight of a vertex v in G(V, E). Then, a weighted version of the maximum clique problem is to find a clique G′(V′, E′) which maximizes ∑_{v∈V′} w(v). In this paper, we call this variant the maximum vertex weighted clique problem, whereas the maximum clique problem denotes the original one.
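The difference between the two problems can be seen on a small graph. The following brute-force sketch enumerates all vertex subsets, so it is exponential and only meant to illustrate the definitions; it is not one of the practical algorithms cited above, and it assumes positive weights.

```python
from itertools import combinations


def is_clique(vertices, edges):
    """True if every two distinct vertices in the subset are adjacent."""
    return all(frozenset(pair) in edges for pair in combinations(vertices, 2))


def best_clique(vertices, edges, weight):
    """Return a clique of maximum total vertex weight by exhaustive search."""
    best = ()
    for k in range(1, len(vertices) + 1):
        for subset in combinations(vertices, k):
            if (is_clique(subset, edges)
                    and sum(weight[v] for v in subset) > sum(weight[v] for v in best)):
                best = subset
    return set(best)


# A triangle a-b-c and a separate edge d-e.
vertices = ["a", "b", "c", "d", "e"]
edges = {frozenset(e) for e in [("a", "b"), ("a", "c"), ("b", "c"), ("d", "e")]}

unit = {v: 1 for v in vertices}                      # unweighted problem
heavy = {"a": 1, "b": 1, "c": 1, "d": 5, "e": 5}     # vertex weighted problem

print(sorted(best_clique(vertices, edges, unit)))    # ['a', 'b', 'c']
print(sorted(best_clique(vertices, edges, heavy)))   # ['d', 'e']
```

With unit weights the triangle is optimal, whereas with the second weight assignment the smaller clique {d, e} has the larger total weight.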
Our proposed method is based on a simple reduction from the edit distance problem for unordered trees to the maximum clique problem. By Eq. (1), to calculate the tree edit distance it suffices to find a mapping M which maximizes ∑_{(u,v)∈M} f(u, v). To find such a mapping, we construct an undirected graph G(V, E) from two input trees T1 and T2 by
V = {(u, v) | u ∈ V(T1), u ≠ r(T1), v ∈ V(T2), v ≠ r(T2)},
E = {{(u1, v1), (u2, v2)} | u1 ≠ u2, v1 ≠ v2, u1 ∈ anc(u2) iff v1 ∈ anc(v2), u2 ∈ anc(u1) iff v2 ∈ anc(v1)},
where the first two conditions and the last two conditions in the definition of E correspond to conditions (i) and (ii) for the edit distance mapping, respectively. We can see that there is a one-to-one correspondence between the set of cliques and the set of edit distance mappings, where we let r(T1) correspond to r(T2) (because the root cannot be deleted or inserted). Here, we assign weight w(x) to each vertex x = (u, v) ∈ V by w(x) = f(u, v). Then, we can see from Eq. (1) that the tree edit distance can be obtained by finding a maximum vertex weighted clique.
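The reduction can be sketched end to end for small trees. In the sketch below, trees are child-to-parent dictionaries, `None` stands for the blank symbol, and a simple exhaustive search with a weight bound stands in for a dedicated maximum vertex weighted clique algorithm; all of these are assumptions made for illustration. The score of the root pair is added separately because the roots are not vertices of the graph.

```python
from itertools import combinations

EPS = None  # stands for the blank symbol (assumption of this sketch)


def unit_cost(a, b):
    return 0 if a == b else 1


def ancestors(parent, v):
    out, p = set(), parent[v]
    while p is not None:
        out.add(p)
        p = parent[p]
    return out


def f(a, b, gamma):
    return gamma(a, EPS) + gamma(EPS, b) - gamma(a, b)


def build_graph(parent1, label1, parent2, label2, gamma):
    """Vertices are pairs of non-root nodes; edges join compatible pairs."""
    anc1 = {u: ancestors(parent1, u) for u in parent1}
    anc2 = {v: ancestors(parent2, v) for v in parent2}
    V = [(u, v) for u in parent1 if parent1[u] is not None
         for v in parent2 if parent2[v] is not None]
    w = {(u, v): f(label1[u], label2[v], gamma) for u, v in V}
    adj = {x: set() for x in V}
    for (u1, v1), (u2, v2) in combinations(V, 2):
        if (u1 != u2 and v1 != v2
                and (u1 in anc1[u2]) == (v1 in anc2[v2])
                and (u2 in anc1[u1]) == (v2 in anc2[v1])):
            adj[(u1, v1)].add((u2, v2))
            adj[(u2, v2)].add((u1, v1))
    return V, adj, w


def max_weight_clique(V, adj, w):
    """Exhaustive search; prunes when the remaining weight cannot improve."""
    best = [0, []]

    def extend(clique, weight, candidates):
        if weight > best[0]:
            best[0], best[1] = weight, list(clique)
        if weight + sum(w[x] for x in candidates) <= best[0]:
            return
        for i, x in enumerate(candidates):
            rest = [y for y in candidates[i + 1:] if y in adj[x]]
            extend(clique + [x], weight + w[x], rest)

    extend([], 0, list(V))
    return best[0], best[1]


def edit_distance(parent1, label1, parent2, label2, gamma=unit_cost):
    V, adj, w = build_graph(parent1, label1, parent2, label2, gamma)
    score, _ = max_weight_clique(V, adj, w)
    r1 = next(u for u in parent1 if parent1[u] is None)
    r2 = next(v for v in parent2 if parent2[v] is None)
    score += f(label1[r1], label2[r2], gamma)  # roots correspond to each other
    total = (sum(gamma(label1[u], EPS) for u in parent1)
             + sum(gamma(EPS, label2[v]) for v in parent2))
    return total - score


# T1: root (a) with children labeled b and c; T2: root (a) with children b and d.
parent1 = {"r1": None, "x": "r1", "y": "r1"}
label1 = {"r1": "a", "x": "b", "y": "c"}
parent2 = {"r2": None, "p": "r2", "q": "r2"}
label2 = {"r2": "a", "p": "b", "q": "d"}
print(edit_distance(parent1, label1, parent2, label2))  # 1
```

In this example the best clique is {(x, p), (y, q)} with weight 3, which corresponds to a single substitution of label c by label d.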
It is to be noted that if we consider the case of γ(a, ε) = γ(ε, a) = 1 and γ(a, a) = 0 for all a ∈ Σ, and γ(a, b) = 2 for all a, b ∈ Σ with a ≠ b, we have f(u, v) = 2 for ℓ(u) = ℓ(v) and f(u, v) = 0 for ℓ(u) ≠ ℓ(v), and thus we can use a non-weighted version of maximum clique algorithms (see Figure 2). In such a case, the resulting mapping gives a largest common subtree (a tree with the largest number of nodes which is a subtree of both T1 and T2) [17].
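The score values for this cost model can be verified with a few lines of code. One way to read the consequence is that vertices with weight 0 never contribute to the weight of a clique and can be dropped, after which every remaining vertex has weight 2 and the weight of a clique is twice its size. The names below and the three-letter alphabet are assumptions of this sketch.

```python
EPS = None  # stands for the blank symbol (assumption of this sketch)


def lcs_cost(a, b):
    """Deletion and insertion cost 1; substitution between distinct labels costs 2."""
    if a == b:
        return 0
    return 1 if EPS in (a, b) else 2


def f(a, b):
    return lcs_cost(a, EPS) + lcs_cost(EPS, b) - lcs_cost(a, b)


labels = "abc"
weights = {(a, b): f(a, b) for a in labels for b in labels}

# Only pairs with equal labels receive a non-zero weight.
print(sorted(pair for pair, w in weights.items() if w > 0))
# [('a', 'a'), ('b', 'b'), ('c', 'c')]
print(set(weights.values()))  # {0, 2}
```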
Maximum clique algorithms
In this study, we use algorithms for both the maximum clique problem and the maximum vertex weighted clique problem. Tomita and his collaborators have developed several algorithms for both problems. Recent comparisons with other existing algorithms suggest that their algorithms are the fastest in most cases [22]. Based on several preliminary experiments, we chose MCQ and MWCQ as algorithms for the maximum clique problem and the maximum vertex weighted clique problem, respectively, where MWCQ is basically an extended version of MCQ. Details of MCQ and MWCQ are given in [21] and [20], respectively.
Though there are some theoretical studies on related algorithms [23], the worst case time complexities of MCQ and MWCQ are left open. Therefore, we cannot discuss the time complexity of our proposed method, whereas it is straightforward to see that the graph obtained by the reduction from the input trees has O(|V(T1)| × |V(T2)|) vertices and O(|V(T1)|² × |V(T2)|²) edges.