Tree edit distance for leaf-labelled trees on free leafset and its comparison with frequent subsplit dissimilarity and popular distance measures
 Jakub Koperwas^{1} and
 Krzysztof Walczak^{1}
https://doi.org/10.1186/1471-2105-12-204
© Koperwas and Walczak; licensee BioMed Central Ltd. 2011
Received: 1 March 2011
Accepted: 25 May 2011
Published: 25 May 2011
Abstract
Background
This paper is devoted to distance measures for leaf-labelled trees on a free leafset. A leaf-labelled tree is a tree in which only the leaves (terminal nodes) are labelled. This data structure is used in bioinformatics to model the evolutionary history of genes and species, and in linguistics to model the evolutionary history of languages. Many domain-specific problems arise that need to be solved with the help of tree postprocessing techniques such as distance measures.
Results
Here we introduce a tree edit distance designed for leaf-labelled trees on a free leafset, which turns out to be a metric. It is presented together with the notion of a tree edit consensus tree. We provide a statistical evaluation of the proposed measure, using the RF, MAST and frequent-subsplit-based dissimilarity measures as reference measures.
Conclusions
The tree edit distance was proven to be a metric and has the advantage of using different costs for contraction and pruning, so its properties can be tuned to the needs of the user. Two of the presented methods have the most interesting properties. E(3,1) is very discriminative (it has a wide range of values) and has a very regular distance distribution, similar in shape to a normal distribution, which works well for both similar and dissimilar trees. NFC(2,1), on the other hand, is proportional or nearly proportional to the number of mutation operations used, irrespective of their type.
Background
This paper is devoted to distance measures for leaf-labelled trees on a free leafset. A leaf-labelled tree is a tree in which only the leaves (terminal nodes) are labelled. This data structure is used in bioinformatics to model the evolutionary history of genes and species, and in linguistics to model the evolutionary history of languages. Many domain-specific problems arise that need to be solved with the help of tree postprocessing techniques such as distance measures, consensus trees and clustering. Distance measures play the most important role, as they are very often the starting point for more complicated techniques. One such problem is that of competing evolutionary hypotheses: in the process of phylogenetic tree reconstruction, different candidate trees may be obtained, and researchers have to determine the true tree of life.
Many existing techniques are designed for trees built on the same leafset, which is very limiting. Here we focus on techniques that do not require trees to contain the same set of leaves. Previously we introduced the simple z-restriction approach [1] and the more sophisticated frequent subsplit approach [2, 3]. Here we introduce a tree edit distance designed for leaf-labelled trees on a free leafset, which turns out to be a metric. It is presented together with the notion of a tree edit consensus tree and some new results for the frequent-subsplit-based dissimilarity measures. For the purpose of experimental testing we follow and extend the methodology presented in [4]. We use the popular Robinson-Foulds [5] and MAST [6] based distances as the reference measures. The experiments yield very promising results.
Methods
Basic Notions
Here we provide the basic notions and describe the basic operations on leaf-labelled trees that were chosen as the editing operations for the new tree edit distance measure. Some derived notions are also presented here.
A leaf-labelled tree is a tree with labels assigned to its leaves. Unrooted leaf-labelled trees are very often represented as a set of splits [7].
Definition [Split] The split (or bipartition) A|B of a tree T with leafset L, corresponding to an edge e, is the pair of leafsets A and B obtained by removing edge e from T, which splits T into two disconnected trees; A ∪ B = L. If |A| = 1 or |B| = 1, the split is trivial.
In this paper, we will refer to the leafset of a given split s as L(s). The set of splits corresponding to the edges of a tree forms a unique representation of that tree. We will refer to the set of splits of a given tree as S(T). We will use s ∈ S(T), or s ∈ T, to denote that split s occurs in tree T.
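As a minimal sketch of the split representation, the following Python function derives S(T) from an unrooted tree given as an edge list. The edge-list encoding, the internal node names u, v, w and the function name `splits_of` are assumptions made for illustration only; they do not come from the paper.

```python
def splits_of(edges):
    """Return the set of splits S(T) of an unrooted tree given as (node, node) edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    leaves = frozenset(n for n, nbrs in adjacency.items() if len(nbrs) == 1)

    def leaves_behind(start, blocked):
        # Leaves reachable from `start` without crossing the edge to `blocked`.
        seen, stack = {start, blocked}, [start]
        while stack:
            node = stack.pop()
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return frozenset(seen - {blocked}) & leaves

    splits = set()
    for u, v in edges:
        side_a = leaves_behind(u, v)
        # A split is stored as an unordered pair {A, B} with A ∪ B = L.
        splits.add(frozenset((side_a, leaves - side_a)))
    return splits


# Hypothetical five-leaf tree: leaves a..e, internal nodes u, v, w.
edges = [("a", "u"), ("b", "u"), ("u", "v"), ("e", "v"),
         ("v", "w"), ("c", "w"), ("d", "w")]
tree_splits = splits_of(edges)
```

Each edge yields exactly one split, so this seven-edge tree has five trivial splits and the two non-trivial splits ab|cde and abe|cd.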
Definition [Contraction]
The contraction of a tree T is obtained by removing a chosen internal edge from T and identifying the nodes adjacent to the contracted edge.
T1: a|bcde, b|acde, c|abde, d|abce, e|abcd, ab|cde, abe|cd
T2: a|bcde, b|acde, c|abde, d|abce, e|abcd, ab|cde
T1 is called a refinement of T2; however, T2 is also a subtree of T1 (in more general terms), therefore we will say that T2 is a c-subtree of T1.
Definition [c-subtree] A c-subtree of a tree T is a subtree constructed from its supertree T using only contraction operations.
Definition [Pruning] Pruning is the operation of removing a chosen leaf from a tree and afterwards removing the nodes of degree two (which is called forced contraction). On a set of splits, pruning can be illustrated as removing the leaf from every split and then removing duplicate and invalid splits, which corresponds to forced contraction.
T1: a|bcde, b|acde, c|abde, d|abce, e|abcd, ab|cde, abe|cd
T2: a|bce, b|ace, c|abe, |abce, e|abc, ab|ce, abe|c
T3: a|bce, b|ace, c|abe, e|abc, ab|ce
T3 is called an induced subtree of T1; here, however, we will call it a p-subtree.
Definition [p-subtree] A p-subtree PS of a tree T is a subtree constructed from its supertree T using only pruning operations.
Definition [restricted tree, z-restricted tree, induced subtree] A z-restricted tree T^{ z } (alternatively denoted as T|z in the literature and also called an induced subtree) of a tree T on leafset z is a p-subtree of T in which all leaves not in z have been pruned. In this paper we use both the T^{ z } and T|z notations; the second is more popular and clearer, but it sometimes conflicts with the split notation.
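The two basic operations can be sketched directly on split sets. The code below is an illustrative reading of the definitions above, not the authors' implementation; the helper names `split`, `contract`, `restrict` and `prune` are assumptions.

```python
def split(text):
    """Parse 'ab|cde' into an unordered pair of leaf sets."""
    left, right = text.split("|")
    return frozenset((frozenset(left), frozenset(right)))


def contract(tree, edge_split):
    """Contraction: drop one non-trivial split (internal edge) from the tree."""
    if any(len(side) == 1 for side in edge_split):
        raise ValueError("only non-trivial splits may be contracted")
    return tree - {edge_split}


def restrict(tree, z):
    """z-restriction: keep only the leaves in z, with forced contraction."""
    z = frozenset(z)
    restricted = set()
    for s in tree:
        a, b = (side & z for side in s)
        if a and b:                             # a split with an empty side is invalid
            restricted.add(frozenset((a, b)))   # the set drops duplicate splits
    return frozenset(restricted)


def prune(tree, leaf):
    """Pruning: remove a single leaf from the tree."""
    leaves = frozenset().union(*(side for s in tree for side in s))
    return restrict(tree, leaves - {leaf})


t1 = frozenset(map(split, ["a|bcde", "b|acde", "c|abde", "d|abce",
                           "e|abcd", "ab|cde", "abe|cd"]))
c_subtree = contract(t1, split("abe|cd"))   # six splits remain
p_subtree = prune(t1, "d")                  # five splits remain
```

Pruning d turns abe|cd into abe|c, which duplicates the trivial split c|abe, so the set representation performs the forced contraction implicitly.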
In [2] we have introduced the subsplit term which is used for the distance and consensus methods discussed later in this paper.
Common information extraction techniques
Splits from T1: a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abcd|ef
Splits from T2: a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def, abce|fd
The common splits of these trees, which build the strict consensus tree, are as follows: a|bcdef, b|acdef, c|abdef, d|abcef, e|abcdf, f|abcde, ab|cdef, abc|def.
Because the concept of a strict consensus tree is very restrictive, for many trees the consensus tree can easily become a star (a tree built of only trivial splits). To deal with this problem, many variations of consensus trees have been proposed, among them the majority rule consensus tree.
Definition [Majority rule consensus tree] The majority rule consensus tree is built from the splits that occur in the majority of trees.
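Both consensus notions reduce to simple set operations on split sets. The sketch below assumes that all trees share one leafset and that "majority" means strictly more than half of the trees; the function names are illustrative.

```python
from collections import Counter


def split(text):
    """Parse 'ab|cdef' into an unordered pair of leaf sets."""
    left, right = text.split("|")
    return frozenset((frozenset(left), frozenset(right)))


def strict_consensus(trees):
    """Splits that occur in every tree."""
    return frozenset.intersection(*trees)


def majority_rule_consensus(trees):
    """Splits that occur in more than half of the trees."""
    counts = Counter(s for tree in trees for s in tree)
    return frozenset(s for s, n in counts.items() if n > len(trees) / 2)


trivial = ["a|bcdef", "b|acdef", "c|abdef", "d|abcef", "e|abcdf", "f|abcde"]
t1 = frozenset(map(split, trivial + ["ab|cdef", "abc|def", "abcd|ef"]))
t2 = frozenset(map(split, trivial + ["ab|cdef", "abc|def", "abce|df"]))

strict = strict_consensus([t1, t2])
majority = majority_rule_consensus([t1, t2, t1])
```

With the two trees above, the strict consensus keeps eight splits; adding a second copy of T1 to the profile lets abcd|ef enter the majority rule consensus.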
Definition [Maximum Agreement Subtree (MAST)] [6] For a given profile of leaf-labelled trees T_{1}, ..., T_{ n }, an agreement subtree is a tree T_{ A } such that T_{ A } = T_{1}|x = ··· = T_{ n }|x for a given x, where x ⊆ L(T). The Maximum Agreement Subtree is an agreement subtree with the maximum number of leaves [6].
Several versions of the MAST problem exist, such as RMAST, which considers only rooted trees, and UMAST for general unrooted trees.
The MAST problem without any restrictions is NP-hard in general [7]. However, when the degree of one of the input trees is bounded, the problem can be solved in polynomial time [7]. It can also be solved in polynomial time when the number of trees is limited to two.
Distance measures
For example, in Figure 4:
d_{RF}(T_{1}, T_{2}) = 2
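For trees on the same leafset, the RF distance can be sketched as the size of the symmetric difference of the two split sets. The unnormalised count is assumed here because it agrees with the values quoted in this paper; the six-leaf trees below are an illustrative example and are not claimed to be the trees of Figure 4.

```python
def split(text):
    """Parse 'ab|cdef' into an unordered pair of leaf sets."""
    left, right = text.split("|")
    return frozenset((frozenset(left), frozenset(right)))


def tree(leaves, nontrivial):
    """Build a split set from the leaves and the non-trivial splits."""
    leaves = frozenset(leaves)
    trivial = {frozenset((frozenset(leaf), leaves - {leaf})) for leaf in leaves}
    return frozenset(trivial | {split(s) for s in nontrivial})


def rf_distance(t1, t2):
    """Number of splits found in exactly one of the two trees."""
    return len(t1 ^ t2)


t1 = tree("abcdef", ["ab|cdef", "abc|def", "abcd|ef"])
t2 = tree("abcdef", ["ab|cdef", "abc|def", "abce|df"])
print(rf_distance(t1, t2))  # 2
```

Trivial splits are shared whenever the leafsets coincide, so only the non-trivial splits contribute to the count.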
The tree edit distance [9], [10] between T1 and T2 is defined as the minimal cost of a sequence of editing operations (inserting a node, deleting a node and relabelling a node) that transforms T1 into T2. It is based on the concept of edit distance for strings. The tree edit distance was defined for node-labelled and edge-labelled trees. The distance has nice features: it is intuitive and it does not require the compared trees to have the same set of leaves. However, it cannot be used directly for trees in which only the leaves are labelled. Some artificial labelling of internal nodes is required to use it for such trees, which makes it less intuitive. This distance has not been popular for leaf-labelled (phylogenetic) trees. In our opinion, however, the idea of edit distance can be applied to leaf-labelled trees, provided that the selected editing operations are natural for them. Such an approach can lead to better distance measures for leaf-labelled trees than the existing ones, and it is presented later in this paper.
The MAST distance between trees T1 and T2 is the number of leaves that need to be removed to obtain the Maximum Agreement Subtree.
For the trees from Figure 5, d_{ mast } = 2.
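For two small trees on the same leafset, the MAST distance can be sketched by brute force: try ever larger sets of leaves to remove until the two restricted trees coincide. This exhaustive search is exponential and is meant only as an illustration, not as the polynomial algorithm mentioned above; the trees used are the eight-leaf pair whose splits are listed in the algorithm walkthrough later in this paper.

```python
from itertools import combinations


def split(text):
    left, right = text.split("|")
    return frozenset((frozenset(left), frozenset(right)))


def tree(leaves, nontrivial):
    leaves = frozenset(leaves)
    trivial = {frozenset((frozenset(leaf), leaves - {leaf})) for leaf in leaves}
    return frozenset(trivial | {split(s) for s in nontrivial})


def restrict(t, z):
    """Keep only the leaves in z; invalid and duplicate splits disappear."""
    result = set()
    for s in t:
        a, b = (side & z for side in s)
        if a and b:
            result.add(frozenset((a, b)))
    return frozenset(result)


def mast_distance(t1, t2, leaves):
    """Fewest leaves to remove from both trees so that they become identical."""
    leaves = frozenset(leaves)
    for removed in range(len(leaves) + 1):
        for drop in combinations(sorted(leaves), removed):
            z = leaves - frozenset(drop)
            if restrict(t1, z) == restrict(t2, z):
                return removed
    return len(leaves)


t1 = tree("abcdefgh", ["gh|abcdef", "fgh|abcde", "efgh|abcd", "aefgh|bcd", "abefgh|cd"])
t2 = tree("abcdefgh", ["fg|abcdeh", "dfg|abceh", "dfgh|abce", "defgh|abc", "adefgh|bc"])
print(mast_distance(t1, t2, "abcdefgh"))  # 2
```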
Representative Splitset and derived similarity measure
Here, we recall the basis of our representative splitset approach, which is the foundation for a new consensus technique and new similarity measure, applicable to trees where the leafset may vary without discarding any information. For the detailed information see [2].
Notion of Representative Splitset
Definition [Frequent subsplit] Frequent subsplit s with support minsup in a profile of trees is a split that is a subsplit of at least one split in at least minsup of trees. The minsup parameter is called the minimal support. It may be an absolute value which denotes the minimum number of trees in which the split is supposed to be found (as a subsplit). It can also be given as a relative value, where it is a minimal percentage of the trees in which the split is supposed to be found.
T1: cd|abefghi, bcd|aefghi, abcd|efghi, hi|abcdefg, ghi|abcdef, fghi|abcde plus trivial splits
T2: bc|adefghj, abc|defghj, abcd|efghj, hj|abcdefg, ghj|abcdef, fghj|abcde plus trivial splits
According to our approach, we count the number of trees in which the split occurs (as a subsplit of any split), rather than counting the number of splits of which it is a subsplit. For example, in Figure 6, abcd|efgh has support 2/2 (100%), because it occurs in both trees: in the first as a subsplit of abcd|efghi, and in the second as a subsplit of abcd|efghj. The argument for counting trees rather than splits is that some subsplits may occur frequently as subsplits of many splits, but only in one tree. Such subsplits are considered uninteresting.
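The support count can be sketched as follows. The subsplit test used here, that a split s is a subsplit of s' when s equals s' restricted to L(s), is an assumed reading of the definition given in [2]; the function names are illustrative, and only the non-trivial splits listed above are included.

```python
def split(text):
    left, right = text.split("|")
    return frozenset((frozenset(left), frozenset(right)))


def is_subsplit(sub, sup):
    """True if `sub` equals `sup` restricted to the leaves of `sub`."""
    (a, b), (c, d) = tuple(sub), tuple(sup)
    return (a <= c and b <= d) or (a <= d and b <= c)


def support(candidate, trees):
    """Number of trees in which `candidate` is a subsplit of at least one split."""
    return sum(any(is_subsplit(candidate, s) for s in t) for t in trees)


t1 = set(map(split, ["cd|abefghi", "bcd|aefghi", "abcd|efghi",
                     "hi|abcdefg", "ghi|abcdef", "fghi|abcde"]))
t2 = set(map(split, ["bc|adefghj", "abc|defghj", "abcd|efghj",
                     "hj|abcdefg", "ghj|abcdef", "fghj|abcde"]))

print(support(split("abcd|efgh"), [t1, t2]))  # 2
print(support(split("cd|abefgh"), [t1, t2]))  # 1
```

Each tree is counted at most once, however many of its splits contain the candidate, which matches the tree-based counting argued for above.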
Definition [Representative splitset] A representative splitset is a set that contains the maximal frequent subsplits s, i.e. those for which there is no other frequent subsplit s_{ x } that is also a supersplit of s.
Definition [Majority-rule representative splitset MRFS] The majority-rule representative splitset is a representative splitset with minsup = 50%.
Frequent Splitset Interpretation
It is clear that, from the splits of FS, we cannot directly construct one tree because the splits in general have different leafsets.
The full reasoning about frequent interpretation was provided in [2]. Here we just recall the conclusions which were derived from the split compatibility Definition and use the fact that from a compatible set of splits a tree can be built:
Conclusion 1: For each distinct leafset z from the frequent splitset (FS) with a support greater than 50%, a tree can be built. The tree is built on the z-restricted versions of those splits from FS whose leafset is a superset of z. Therefore the frequent splitset (minsup > 50%) can be represented as a set of trees. In particular, this applies to the strict and majority-rule frequent splitsets.
Conclusion 2: Each split from the frequent splitset discussed above will occur in at least one tree, in a restricted form.
Conclusion 3: Conclusions 1 and 2 are also true for a tree based on the intersection of all the distinct leafsets from the frequent splitset.
Conclusion 4: The set of trees resulting from the frequent splitset will also contain a consensus tree, provided that the input dataset of trees was built on the same leafset.
Strict frequent splitset: abcd|efgh, gh|abcdef, fgh|abcde, bc|aefgh, h|abcdefg, a|bcdefgh, b|acdefgh, c|abdefgh, d|abcefgh, e|abcdfgh, f|abcdegh, g|abcdefh
FSbased Dissimilarity Measure
SFS(T_{1}, T_{2}) = trivial(5) + ab|cd, SFS(T_{1}, T_{3}) = trivial, SFS(T_{2}, T_{3}) = trivial
d_{RF}(T_{1}, T_{2}) = 4, d(T_{1}, T_{2}) = 1 - 6/9 = 3/9
d_{RF}(T_{2}, T_{3}) = 4, d(T_{2}, T_{3}) = 1 - 5/9 = 4/9
The RF distance states that T_{1} and T_{2} are exactly as dissimilar as T_{2} and T_{3}, whilst our measure arrives at a different result. This result is intuitive, since T_{1} and T_{2} share a common non-trivial subsplit ab|cd. For trees on different leafsets, the RF distance does not work at all, whilst our measure does.
The main drawback of this measure is that it is not a metric, however it achieves very good statistical characteristics and clustering results as described in the Results section. In this paper the method was compared to RF, MAST and edit distance in the series of experiments.
Tree Edit Distance and Tree Edit Consensus for LeafLabelled Trees
Tree Edit Distance for LeafLabelled Trees
Definition [Edit script] An edit script S(T1, T2) for leaf-labelled trees T1 and T2 is the pair of subscripts S(T1) and S(T2), which are sequences of editing operations (contractions and prunings) that can be applied to the input trees T1 and T2 to unify them. S(T1, T2) = S(T1) ∪ S(T2). The subscripts S(T1) and S(T2) are unidirectional, which means that by using S(T1) we can modify T1 to obtain the tree that is a unification of T1 and T2, but not necessarily in the opposite direction.
Definition [Edit script cost] The cost of an edit script, Cost(S), is the sum of the defined costs of its editing operations, where Cost(e) = cost_{ c } if e is a contraction and Cost(e) = cost_{ p } if e is a pruning operation. Forced contractions may be counted or not, depending on the application.
Definition [Tree edit distance for leaf-labelled trees]
Given positive costs for the contraction and pruning operations, the tree edit distance for leaf-labelled trees T1 and T2 is the cost of the minimal-cost edit script, d(T1, T2) = min Cost(S), where forced contractions are counted as normal contractions. Note that, in order to keep the resulting tree leaf-labelled, only contractions that correspond to non-trivial splits are performed, unless it is a forced contraction; then a trivial split may also be contracted to remove a split with a duplicate representation. In this paper, we focus on the edit distance that counts forced contractions. Thanks to this property, we can prove that it is a true metric.
However, there is also an interesting variant of the edit distance in which forced contractions are ignored. The metric property of this variant is yet to be verified, but the measure will also be considered in the experiments due to its interesting features.
Definition [No Forced Contraction Dissimilarity Measure for leaf-labelled trees] Given positive costs for the contraction and pruning operations, the No Forced Contraction Dissimilarity Measure for leaf-labelled trees T1 and T2 is the cost of the minimal-cost edit script, d(T1, T2) = min Cost(S), where forced contractions are ignored.
Tree Edit Distance versus RF Distance and MAST
As mentioned earlier, comparing distance measures is not a trivial task. Here, we provide a subjective opinion about why this measure is better than others, however an objective statistical comparison will be provided in the Results section.
The RF and MAST distances have some drawbacks which have emerged from the fact that the RF distance may use only contraction operations and MAST uses only pruning operations and forced contractions. There are of course some cases when all three distances perform equally well, as in the example of Figure 1, where for Trees T1 and T2, the RF Distance = 1, MAST = 1, Edit Distance = 1.
RF Distance Drawbacks
1) The first drawback of the RF distance is that it cannot be used at all for leaf-labelled trees on a free leafset. For example, Figure 2 shows two trees on different leafsets (T1 and T3). The RF distance is undefined here. It can be seen that the removal of one leaf and one internal edge is sufficient to make the two trees identical. In this case, both the MAST and Edit distances can be used, as MAST can be extended to support a free leafset and the Edit Distance is naturally suitable for a free leafset. The MAST and Edit distances provide different values: MAST distance = 1, Edit Distance = 2. These differences are discussed in the following section.
2) The second drawback of the RF distance is that even if the trees are on the same leafset, one noisy leaf may cause the trees to be considered totally different (all splits must be removed), although removing that one leaf may significantly reduce the distance between the trees. Such a situation is illustrated in Figure 12. In terms of the RF distance, trees T1 and T2 look totally different because of leaf d, so all non-trivial splits (all the information) must be removed in order to make them identical: RF Distance = 10. However, removing only leaf d would result in trees differing by only 2 splits. The MAST distance equals 2 and the Edit Distance equals 6.
MAST Distance Drawbacks
1) The first drawback of the MAST distance occurs when the trees are similar except for one internal edge, as in Figure 13. In this case, the MAST distance equals 3, as at least 3 leaves must be removed in order to make the trees identical. However, both the RF and Edit distances may just remove one edge, so the RF Distance equals 1 and the Edit Distance equals 1.
2) The second drawback is that MAST counts only the leaves that are removed from both input trees. If leaves that are present in only one tree are also counted, in order to support different leafsets, then the distance will ignore some subtle changes. For example, in Figure 14, the distance between T1 and T2 is identical to the distance between T1 and MAST(T1, T2), which is obviously incorrect. The solution to this problem would be to count a leaf twice if it is removed from both trees and once if it is removed from one tree. This solution would not, however, fix another problem: MAST completely ignores forced contractions. Therefore, some subtle differences may again be missed. For example, the MAST distance between T1 in Figure 1 and T2 in Figure 2 is the same as between T2 from Figure 1 and T2 from Figure 2, which is again incorrect. In this case, the MAST distance would be equal to 1, but the Edit Distance equals 2 and 1 respectively.
Edit Distance Advantage
In the previous sections, we showed the drawbacks of the RF and MAST distances and showed that the Edit Distance is better because:
- it can be used for trees on a free leafset
- it can distinguish differences that the MAST distance cannot, as it can use both contraction and pruning.
Values of c, p and edit distances for various examples.
Fig (trees) | c-dist | p-dist | Edit dist
1 (T1, T2) | 1 | 2 | 1
2 (T1, T3) | - | 2 | 2
11 (T1, T3) | - | 5 | 3
14 (T1, T2) | 4 | 4 | 4
13 (T1, T2) | 1 | 4 | 1
15 (T1, T2) | 8 | 7 | 4
Cost Manipulation
The difference between the Edit Distance and other distances is especially visible when the costs of the operations are not the same. Although in some cases both operations can be equally good, one may prefer, for example, contraction over pruning. The motivation can be, for example, the need to keep as many leaves as possible in the tree edit consensus. Therefore, our distance uses costs of editing operations. For example, consider the trees T1 and T2 in Figure 12, and assume that the pruning cost equals 2 and the contraction cost equals 1. For these trees, if only the pruning operation is used, then the p-distance equals 12 (removal of d and h from both trees, plus forced contractions). If only contraction is used, then the RF distance equals 10 (removal of all non-trivial splits in both trees). If the Edit Distance is used, then the distance equals 8 (removal of d from both trees with the resulting forced contractions, then removal of the two differing splits). The edit script is also illustrated, with its intermediate trees (T1' and T2'). If we assume the following costs: 3 for pruning and 1 for contraction, then the Edit Distance would consist of contraction operations only, and the distance would be equal to 10.
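The arithmetic behind these totals can be checked with a few lines of Python. The operation counts below are inferred from the totals stated in the text (forced contractions are counted as contractions); they are an assumption rather than figures quoted from the paper.

```python
def script_cost(prunings, contractions, cost_p, cost_c):
    """Cost of an edit script with the given numbers of operations."""
    return prunings * cost_p + contractions * cost_c


# Inferred operation counts for the Figure 12 example.
pruning_only = dict(prunings=4, contractions=4)       # prune d and h in both trees, 4 forced contractions
contraction_only = dict(prunings=0, contractions=10)  # contract every non-trivial split
mixed = dict(prunings=2, contractions=4)              # prune d in both trees, 2 forced + 2 regular contractions

for cost_p in (2, 3):
    costs = [script_cost(**ops, cost_p=cost_p, cost_c=1)
             for ops in (pruning_only, contraction_only, mixed)]
    print(cost_p, costs)
# 2 [12, 10, 8]
# 3 [16, 10, 10]
```

With a pruning cost of 3, the mixed script no longer beats the contraction-only script, which is why the distance settles at 10.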
Tree Edit Distance Metric Proof
In order to show that our measure is a true metric, the following conditions must be proved:
- d(T_{1}, T_{2}) = 0 ⇔ T_{1} = T_{2}
- d(T_{1}, T_{2}) = d(T_{2}, T_{1})
- d(T_{1}, T_{2}) + d(T_{2}, T_{3}) ≥ d(T_{1}, T_{3})
The first two conditions are met by definition. The minimal edit script that unifies T1 with itself contains no operations, therefore the distance is equal to 0. On the other hand, two different trees T1 and T2 can be unified only by applying at least one editing operation, and because the costs must be positive, the distance between different trees cannot be 0.
As the definition states that the distance is the minimal cost of unifying two trees by applying the editing operations to either T1 or T2, it is symmetric by definition.
The third condition is slightly more complicated and requires more explanation:
Lemma: Given the edit scripts corresponding to the distances d(T_{1}, T_{2}) and d(T_{2}, T_{3}), we can unify trees T1 and T3 using the operations of both scripts (or a subset of them).
Proof:
Let us denote:
- TPCX: the tree edit consensus (unification) of trees T1 and T2.
- TPCY: the tree edit consensus (unification) of trees T2 and T3.
- Sx1(T1) = TPCX: the edit subscript that transforms T1 to TPCX
- Sx2(T2) = TPCX: the edit subscript that transforms T2 to TPCX
- Sy2(T2) = TPCY: the edit subscript that transforms T2 to TPCY
- Sy3(T3) = TPCY: the edit subscript that transforms T3 to TPCY
Because S1(S2(T)) = S2(S1(T)), which will be shown in the next section, we have:
Sx2(Sy2(T2)) = Sy2(Sx2(T2)) and therefore Sx2(TPCY) = Sy2(TPCX), because Sy2(T2) = TPCY and Sx2(T2) = TPCX.
Therefore, there exists some tree TPCZ such that Sx2(TPCY) = TPCZ and Sy2(TPCX) = TPCZ (dotted line in Figure 16), which can be obtained from T1 and T3 with at most the same number of operations as unifying T1 with T2 and T2 with T3.
Theorem: The tree edit distance for leaflabelled trees meets the third metric condition.
Proof: By the Lemma presented earlier, there exists an edit script S(T1, T3) that unifies trees T1 and T3 at a cost no greater than the sum d(T_{1}, T_{2}) + d(T_{2}, T_{3}):
Cost(S(T1, T3)) ≤ d(T_{1}, T_{2}) + d(T_{2}, T_{3})
and because
d(T_{1}, T_{3}) ≤ Cost(S(T1, T3))
therefore:
d(T_{1}, T_{3}) ≤ d(T_{1}, T_{2}) + d(T_{2}, T_{3})
Edit Subscript Order
In this section we show that for a given edit subscript (i.e. a set of operations on one tree), changing the order of the operations does not change the resulting tree. Therefore, it also does not increase the cost. To show this, we need to show that if an edit script consists of operations p_{1}, ..., p_{ n } and c_{1}, ..., c_{ m }, then changing the order of the operations does not change the result.
Let us assume that a tree is represented by two sets: E(e_{1} ... e_{ n }), a set of internal edges referred to by non-trivial splits, and L(l_{1} ... l_{ n }), a set of leaves. Let us consider an edit script ES that transforms T_{1}, represented by E_{1}, L_{1}, into a tree T_{2}, represented by E_{2}, L_{2}, with E_{2} ⊆ E_{1} and L_{2} ⊆ L_{1}. Assume for a moment that we do not handle forced contractions. Under this assumption the two edit operations, contraction and pruning, operate on different sets of items: edges (referred to by splits) and leaves respectively. Therefore the position of a contraction relative to a pruning operation in the edit script (and vice versa) does not affect the result. Additionally, by elementary set theory, the order in which items are removed from a set is irrelevant, so the position of a contraction relative to another contraction in the edit script is irrelevant. The same holds for pruning. This leads directly to the conclusion that the order of operations in an edit script does not affect the final result.
One may notice that pruning also removes some edges, but only trivial ones, which are not considered in the edit distance and may be removed at any time. One may also notice that pruning changes the bipartition representation of all non-trivial splits. This is not a problem either, as the total number of edges is not affected. Although we use the split representation very often, here it is the number of edges that is important, not the form of their split representations.
As presented earlier in this paper, pruning may occasionally introduce a forced contraction (see Figure 2). This does not, however, break the assumption that the operations work on different sets and are independent of each other.
Let us represent pruning p as a pair of operations: p', which only removes the leaf as assumed earlier, and fc, which performs the forced contraction. Each time pruning p occurs in an edit script, it is replaced either with p', if there is no forced contraction to perform, or with p', fc. Operation fc may be treated as a regular contraction, with the only difference that it is inserted by pruning. It can also be easily shown that if we change the order from p', fc to fc, p', the result remains the same, as the pruning operation does not trigger a forced contraction if the appropriate edge was removed earlier in the edit script.
The last thing to mention is edge matching. A forced contraction removes an edge that is a duplicate of another edge with respect to their split representations. For example, e_{1} = ab|cde and e_{2} = abe|cd will be considered duplicates if leaf e is pruned. It may therefore be unclear which edge, e_{1} or e_{2}, is in fact removed from the set of edges. For this reason, if a forced contraction is to be done, we treat it as the unification of duplicate edges. This means that operations c(e_{1}) and c(e_{2}) can be used interchangeably within the edit script, as the edges are unified sooner or later.
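The order independence can be checked on a small example by applying the same operations in every possible order. The sketch below is illustrative only: it identifies a contracted edge by its original split, looks it up in restricted form once leaves have been pruned, and uses an eight-leaf example tree chosen so that the contracted edge stays non-trivial. The helper names are assumptions.

```python
from itertools import permutations


def split(text):
    left, right = text.split("|")
    return frozenset((frozenset(left), frozenset(right)))


def leafset(t):
    return frozenset().union(*(side for s in t for side in s))


def restrict(t, z):
    """Keep only the leaves in z; invalid and duplicate splits disappear."""
    result = set()
    for s in t:
        a, b = (side & z for side in s)
        if a and b:
            result.add(frozenset((a, b)))
    return frozenset(result)


def apply_script(t, script):
    """Apply ('p', leaf) and ('c', original_split) operations in the given order."""
    for kind, arg in script:
        if kind == "p":   # pruning, including any forced contraction
            t = restrict(t, leafset(t) - {arg})
        else:             # contraction of the edge, in its current restricted form
            t = t - restrict({arg}, leafset(t))
    return t


leaves = "abcdefgh"
t1 = frozenset(
    {frozenset((frozenset(x), frozenset(leaves) - {x})) for x in leaves}
    | set(map(split, ["gh|abcdef", "fgh|abcde", "efgh|abcd", "aefgh|bcd", "abefgh|cd"]))
)
operations = [("p", "d"), ("p", "a"), ("c", split("gh|abcdef"))]
results = {apply_script(t1, order) for order in permutations(operations)}
print(len(results))  # 1
```

All six orderings produce the same split set, consisting of six trivial splits plus fgh|bce and efgh|bc.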
Algorithm for Counting Edit Distance of Leaflabelled Trees
The algorithm can also be presented with pseudocode as follows:
function edit_distance(T1, T2, cost_c, cost_p) {
    D1 = L(T1) - L(T2);    // leaves that occur only in T1
    D2 = L(T2) - L(T1);    // leaves that occur only in T2
    T1' = prune(T1, D1);
    T2' = prune(T2, D2);
    // after the above operations L(T1') == L(T2')
    cost = (|D1| + |D2|) * cost_p
         + cost_c * (how_many_fc(T1, T1', |D1|) + how_many_fc(T2, T2', |D2|))
         + edit_distance_the_same_leafset(T1', T2', cost_c, cost_p);
    return cost;
}
function edit_distance_the_same_leafset(T1, T2, cost_c, cost_p) {
    // option 1: unify the trees using contractions only
    minCost = RFdistance(T1, T2) * cost_c;
    L = L(T1);    // L(T1) == L(T2)
    // option 2: prune one leaf from both trees and recurse
    for each leaf in L
        T1' = prune(T1, leaf);
        T2' = prune(T2, leaf);
        cost_all_fc = cost_c * (how_many_fc(T1, T1', 1) + how_many_fc(T2, T2', 1));
        costT = edit_distance_the_same_leafset(T1', T2', cost_c, cost_p)
              + 2 * cost_p + cost_all_fc;
        if (costT < minCost) minCost = costT;
    end for
    return minCost;
}
function how_many_fc(T1, T1', k) {
    // k is the number of pruned leaves; it is subtracted to prevent counting
    // the trivial split contractions directly associated with leaf removal
    return |S(T1)| - |S(T1')| - k;
}
function prune(T1, L): prunes all leaves in set L from tree T1 and
    performs forced contractions.
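The pseudocode can be turned into a small runnable Python sketch on split sets. This is one possible reading, not the authors' implementation: trees are frozensets of splits, the memoisation via `lru_cache` stands in for the stored partial results mentioned below, and the number of forced contractions is taken as |S(T)| - |S(T')| - k, clamped at zero for degenerate two-leaf inputs. All function names are assumptions.

```python
from functools import lru_cache


def split(text):
    left, right = text.split("|")
    return frozenset((frozenset(left), frozenset(right)))


def tree(leaves, nontrivial):
    leaves = frozenset(leaves)
    trivial = {frozenset((frozenset(leaf), leaves - {leaf})) for leaf in leaves}
    return frozenset(trivial | {split(s) for s in nontrivial})


def leafset(t):
    return frozenset().union(*(side for s in t for side in s))


def prune(t, removed):
    """Remove leaves; invalid and duplicate splits vanish (forced contraction)."""
    removed = frozenset(removed)
    kept = set()
    for s in t:
        a, b = (side - removed for side in s)
        if a and b:
            kept.add(frozenset((a, b)))
    return frozenset(kept)


def how_many_fc(before, after, k):
    """Splits lost beyond the k trivial splits of the pruned leaves."""
    return max(0, len(before) - len(after) - k)


@lru_cache(maxsize=None)
def same_leafset_distance(t1, t2, cost_c, cost_p):
    best = len(t1 ^ t2) * cost_c              # contractions only (RF distance)
    for leaf in leafset(t1):                  # or prune one leaf from both trees
        p1, p2 = prune(t1, {leaf}), prune(t2, {leaf})
        fc = how_many_fc(t1, p1, 1) + how_many_fc(t2, p2, 1)
        rest = same_leafset_distance(p1, p2, cost_c, cost_p)
        best = min(best, rest + 2 * cost_p + fc * cost_c)
    return best


def edit_distance(t1, t2, cost_c, cost_p):
    d1, d2 = leafset(t1) - leafset(t2), leafset(t2) - leafset(t1)
    p1, p2 = prune(t1, d1), prune(t2, d2)
    fc = how_many_fc(t1, p1, len(d1)) + how_many_fc(t2, p2, len(d2))
    return ((len(d1) + len(d2)) * cost_p + fc * cost_c
            + same_leafset_distance(p1, p2, cost_c, cost_p))


t1 = tree("abcdefgh", ["gh|abcdef", "fgh|abcde", "efgh|abcd", "aefgh|bcd", "abefgh|cd"])
t2 = tree("abcdefgh", ["fg|abcdeh", "dfg|abceh", "dfgh|abce", "defgh|abc", "adefgh|bc"])
print(edit_distance(t1, t2, 1, 1))  # 6
print(edit_distance(t1, t2, 1, 2))  # 8
```

Because each restricted pair of trees is cached, the recursion visits at most one state per subset of leaves, which keeps this eight-leaf example fast while remaining exponential in general.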
This algorithm is exponential with respect to the number of leaves. It may be possible to improve it so that it has the same complexity as MAST for two trees (which is polynomial), but further investigation is required. For the purpose of this paper, we used a dynamic programming algorithm, in which partial results are stored in memory and reused when necessary. It turned out that the algorithm needed to evaluate only a small part of all possible combinations, which also gives grounds for optimism that a better algorithm will be found.
Let us look at a few steps of the naive algorithm for trees T1 and T2 from Figure 12:
T1: gh|abcdef, fgh|abcde, efgh|abcd, aefgh|bcd, abefgh|cd
T2: fg|abcdeh, dfg|abceh, dfgh|abce, defgh|abc, adefgh|bc
The trees are built on the same leafset, so we may directly calculate the RF distance: d_{RF}(T1, T2) = 10. Let us now remove some leaves (for clarity of presentation we do not show all of them). Removing a, we obtain:
T1': gh|bcdef, fgh|bcde, efgh|bcd, (efgh|bcd), befgh|cd
T2': fg|bcdeh, dfg|bceh, dfgh|bce, defgh|bc, (defgh|bc)
plus the trivial splits
Parentheses denote splits that will be force-contracted.
d_{RF}(T1', T2') = 8, cost of pruning: 2, and cost of forced contractions: 2. Total cost on this path: 12, so the result is worse than d_{RF}(T1, T2), and pruning additional leaves will not improve the result either.
So let us remove d instead of a:
T1': gh|abcef, fgh|abce, efgh|abc, aefgh|bc, (abefgh|c)
T2': (fg|abceh), fg|abceh, fgh|abce, efgh|abc, aefgh|bc
d_{RF}(T1', T2') = 2, cost of pruning: 2, and cost of forced contractions: 2. Total cost on this path: 6, so the result is better than d_{RF}(T1, T2). We may continue by pruning another leaf to see if we can improve the result further, so we remove g:
T1": (h|abcef), fh|abce, efh|abc, aefh|bc
T2": (f|abceh), fh|abce, efh|abc, aefh|bc
d_{RF}(T1", T2") = 0, cost of pruning: 2, cost of forced contractions: 2, and cost carried over from the previous step: 4, so the total cost is equal to 8, which is a worse result. We will not continue with pruning other leaves, as it will not lead to a better result.
Therefore the best cost is 6, and the edit script contains two subscripts:
From T1: p(d), fc(c|abefgh), c(gh|abcef)
From T2: p(d), fc(fg|abceh), c(fg|abceh)
Tree Edit Consensus Tree
Similarly, we may define a new consensus method on the basis of editing operations, called the Tree Edit Consensus Tree. The Tree Edit Consensus Tree is the maximal (with respect to leaves and edges) common subtree of the input trees, obtained by contraction and pruning operations, and is defined as follows:
Definition [Tree edit consensus tree (PCConsensus tree)] Given positive costs for the contraction and pruning operations, the tree edit consensus for leaf-labelled trees T1 ... Tn is the tree obtained by applying the minimal-cost edit script that unifies these trees. For trees T1 and T2 in Figure 12, the tree edit consensus tree is the tree TC = T1" = T2". As with MAST, the tree edit consensus tree is not unique. The experimental assessment of this method was done in the PhD thesis [11] and will not be recalled here.
Tree Edit Consensus Algorithm
where SCT stands for the strict consensus tree, TEC is the Tree Edit Consensus, k is the number of forced contractions that need to be performed, and cost_{ fc } is equal to cost_{ c }.
The tree edit consensus tree may be obtained by recording the pruning operations used along the optimum path. The recorded prunings must be applied to the input trees, and afterwards all unmatched edges must be contracted (strict consensus tree).
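This last step can be sketched as follows, assuming the set of recorded prunings is already known (here it is simply passed in by the caller). The function name `tree_edit_consensus` is hypothetical, and the example reuses the eight-leaf trees of the walkthrough with the pruning of d, whose path had the lowest cost there.

```python
def split(text):
    left, right = text.split("|")
    return frozenset((frozenset(left), frozenset(right)))


def tree(leaves, nontrivial):
    leaves = frozenset(leaves)
    trivial = {frozenset((frozenset(leaf), leaves - {leaf})) for leaf in leaves}
    return frozenset(trivial | {split(s) for s in nontrivial})


def prune(t, removed):
    """Remove leaves; invalid and duplicate splits vanish (forced contraction)."""
    removed = frozenset(removed)
    kept = set()
    for s in t:
        a, b = (side - removed for side in s)
        if a and b:
            kept.add(frozenset((a, b)))
    return frozenset(kept)


def tree_edit_consensus(trees, recorded_prunings):
    """Apply the recorded prunings, then keep only the splits common to all trees."""
    pruned = [prune(t, recorded_prunings) for t in trees]
    return frozenset.intersection(*pruned)   # strict consensus of the pruned trees


t1 = tree("abcdefgh", ["gh|abcdef", "fgh|abcde", "efgh|abcd", "aefgh|bcd", "abefgh|cd"])
t2 = tree("abcdefgh", ["fg|abcdeh", "dfg|abceh", "dfgh|abce", "defgh|abc", "adefgh|bc"])
consensus = tree_edit_consensus([t1, t2], {"d"})
```

The intersection drops gh|abcef and fg|abceh, the two unmatched edges, and keeps the seven trivial splits together with fgh|abce, efgh|abc and aefgh|bc.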
Quality of similarity measures
The quality of similarity measures is not obvious to estimate. The best possible method would be a method based on external criteria i.e. based on expert knowledge. In biological applications, it could be a comparison of the consensus tree with the true phylogenetic tree. The true tree however is something that is not known. We agree with the opinion presented in [4] that similarity measures are especially difficult to score as they are very subjective about what is similar and what is not. Such a subjective approach to score the distance requires that someone arbitrarily selects the best distance matrix. This matrix is to be compared with the distance matrix achieved with a given similarity measure. A more objective method would be to prove that a given measure is best in some particular applications or may be used to solve some particular problems, like for example the identification of the ancestral paralog position in the paralog families mentioned in [12].
The methods that can be applied to measure the quality of distance measures and consensus techniques can be roughly divided into:

qualitative methods, which try to define properties that a given consensus method or similarity measure must meet, as in [7]

quantitative methods, which try to measure the quality of consensus or similarity methods, such as [13]. They are often based on some assumptions due to the lack of verified domain knowledge

statistical methods which display the statistical properties of the given method to help an expert score the method instead of scoring it automatically, because the quality of a metric may depend on the application.
In an axiomatic approach, the most common requirement for the similarity measure is that it meets metric properties, or at least pseudometric ones.
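As an illustration of the axiomatic requirement, the sketch below tests the metric axioms on a finite sample of objects. It is an assumption-laden helper, not part of the paper: the name `check_metric_axioms` and its report keys are hypothetical, and a check on a sample can only refute the axioms, never prove them. A pseudometric is allowed to fail only the `separates_distinct` entry.

```python
from itertools import product
from typing import Callable, Dict, Sequence, TypeVar

T = TypeVar("T")


def check_metric_axioms(
    items: Sequence[T],
    dist: Callable[[T, T], float],
    tol: float = 1e-9,
) -> Dict[str, bool]:
    """Test the metric axioms on every pair and triple of the sample."""
    report = {
        "non_negative": True,        # d(x, y) >= 0
        "zero_on_identical": True,   # d(x, x) = 0
        "separates_distinct": True,  # d(x, y) > 0 when x != y
        "symmetric": True,           # d(x, y) = d(y, x)
        "triangle": True,            # d(x, z) <= d(x, y) + d(y, z)
    }
    for x, y in product(items, repeat=2):
        d = dist(x, y)
        if d < -tol:
            report["non_negative"] = False
        if x == y and abs(d) > tol:
            report["zero_on_identical"] = False
        if x != y and abs(d) <= tol:
            report["separates_distinct"] = False
        if abs(d - dist(y, x)) > tol:
            report["symmetric"] = False
    for x, y, z in product(items, repeat=3):
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:
            report["triangle"] = False
    return report


# Toy distances on numbers, standing in for tree distances.
metric_report = check_metric_axioms([0, 1, 2, 5], lambda a, b: abs(a - b))
squared_report = check_metric_axioms([0, 1, 2], lambda a, b: (a - b) ** 2)
```

The squared difference fails the triangle inequality on the sample 0, 1, 2, which shows how a candidate measure can be rejected quickly before any proof is attempted.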
The quantitative approach is not very suitable for distance measures due to the lack of objective criteria. Even if we are supplied with biological data that contain groups of trees and can compute, for example, the ratio of inner-group distances to between-group distances, such an approach is not very trustworthy. This is because we only see the effect of the distance measure on selected sets, and this effect may differ in different parts of the tree space. We also ignore some potential properties of the distance, for example that it may be better for some tree topologies but worse for others, an observation that could give hints on where to use it and where not. Simply put, the quality of a metric may depend on the application.
Apart from proving the metric properties of some distances, we chose the statistical approach described in [4] and performed two kinds of experiment:

Analysis of distance probability distribution

Analysis of distance dynamics with respect to number of changes in trees.
In the first approach, we compute the distance for a large number of pairs of randomly generated unrooted trees, drawn according to different distributions, and examine the probability distribution P(d(T1, T2) = k). Details of the random generation of rooted trees can be found in [14]; this method can be adapted for unrooted trees. We try to determine:

whether the distance distribution shows any regularity or follows a known distribution, which would indicate that the distance does not behave in a random fashion

whether the distance is sufficiently discriminative (takes a large number of distinct values) and whether its discriminative power is equally strong for similar and dissimilar trees.
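The first experiment can be sketched as below: draw pairs of random objects, compute their distance, and tabulate the empirical estimate of P(d(T1, T2) = k). This is a generic sketch under stated assumptions: the tree generator and the distance are passed in as callables, and the names `sample_distances`, `empirical_distribution` and `distinct_values` are hypothetical. The toy generator at the end uses integers in place of trees.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def sample_distances(
    sample_tree: Callable[[random.Random], T],
    dist: Callable[[T, T], float],
    n_pairs: int,
    seed: int = 0,
) -> List[float]:
    """Distances between n_pairs independently generated pairs."""
    rng = random.Random(seed)
    return [dist(sample_tree(rng), sample_tree(rng)) for _ in range(n_pairs)]


def empirical_distribution(distances: List[float]) -> Dict[float, float]:
    """Relative frequency of each observed distance value k."""
    counts = Counter(distances)
    total = len(distances)
    return {k: counts[k] / total for k in sorted(counts)}


def distinct_values(distances: List[float]) -> int:
    """Crude indicator of how discriminative a distance is."""
    return len(set(distances))


# Toy stand-in: "trees" are integers 0..3 and the distance is |a - b|.
toy = sample_distances(lambda rng: rng.randint(0, 3), lambda a, b: abs(a - b), 1000)
toy_distribution = empirical_distribution(toy)
```

The number of distinct values is only a rough proxy for discrimination; the shape of the distribution for small and large k still has to be inspected by eye or with a statistical test.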
The other approach is to mutate the random tree with different mutation operations and see how the distance changes.
More details about the experiments will be provided in the Results section.
Results and Discussion
In this section, an experimental evaluation of the proposed methods is presented. For the purpose of the experiments, we use randomly generated trees with different distributions and we evaluate the statistical properties of the similarity measures as described previously in this paper. Of our proposals, we decided to evaluate the Tree Edit Distance, the No Forced Contraction Similarity Measure (called NFC here) and the FS-based Similarity measure.
For comparison with existing distance measures we have chosen the RF and MAST distances. RF is one of the most popular and computationally efficient distances, and MAST and RF are, in a way, the foundations of the Edit measures presented in this paper. We decided not to normalise the distance values because normalisation is sometimes not obvious (as in the case of the Edit distance). Normalisation is not necessary in the first experiment, as we study the distribution rather than absolute values. In the second experiment, the lack of normalisation does not prevent us from observing the dynamics; it only prevents us from spotting the crossing points of the distances. Normalising by the maximum observed value, as done in the literature, in our opinion distorts the results: if the real maximum value is not reached, the graph is distorted. The only modifications concern the FS dissimilarity measure, whose values are scaled and biased so that they can be compared with the other distances on the same chart.
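The point about normalisation can be illustrated with the RF distance, taken here in its common form as the number of splits present in exactly one of the two trees. The sketch assumes both trees share a leafset and that every split is stored in the same canonical way (as the side not containing a fixed reference leaf); the function names are hypothetical. For binary unrooted trees on n leaves the largest possible value under this convention is 2(n - 3).

```python
from typing import FrozenSet, List, Set

Split = FrozenSet[str]


def rf_distance(splits_a: Set[Split], splits_b: Set[Split]) -> int:
    """Number of non-trivial splits found in exactly one of the two trees."""
    return len(splits_a ^ splits_b)


def max_rf_binary(n_leaves: int) -> int:
    """Largest RF value for binary unrooted trees on the same n leaves."""
    return 2 * (n_leaves - 3)


def normalise(values: List[float], maximum: float) -> List[float]:
    """Divide every value by the chosen maximum."""
    return [v / maximum for v in values]


# T1 = ((a,b),c,(d,e)) and T2 = ((a,c),b,(d,e)); splits avoid reference leaf a.
t1 = {frozenset("cde"), frozenset("de")}
t2 = {frozenset("bde"), frozenset("de")}
print(rf_distance(t1, t2))  # 2

# Hypothetical sample of RF values for trees with 8 leaves.
observed = [2, 4, 6]
by_observed = normalise(observed, max(observed))
by_theoretical = normalise(observed, max_rf_binary(8))
```

With the observed maximum, the value 6 is mapped to 1.0 although trees at distance 10 exist for 8 leaves, which is the distortion described above.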
In this experiment, trees with 8 leaves are presented, however tests were also performed with trees with up to 17 leaves for unconstrained trees and 12 leaves for binary trees, with similar results being obtained.
Distribution of Distance Probability
For this test, 1000 pairs of trees with 8 leaves were generated and the probability distribution P(d(T1, T2) = k) was examined under different tree generation models, as mentioned previously in this paper. As the Edit Distance and its version with no penalty for forced contraction are parametrizable, various pruning and contraction operation costs were used. In the following experiments, the edit distance with cost of contraction equal to x and cost of pruning equal to y is denoted by E(x, y), and the version with no cost for forced contraction is denoted by NFC(x, y).
Unrooted binary leaf-labelled trees on the same leafset
Conclusion: The first conclusion is that by modifying the costs of the Edit distance, we can obtain a measure with very well-behaved properties: it is very discriminative and suitable both for similar and dissimilar trees. Moreover, the similarity of the Edit, RF and MAST distributions shows that the distance is not accidental.
Unrooted unconstrained leaf-labelled trees on the same leafset
Unrooted leaf-labelled trees on a free leafset
Conclusion: To summarise the key points of this experiment:

The RF distance is not very discriminative for binary trees and is also weak for distant trees. It is not suitable for trees with different leafsets.

The MAST distance is good for the same and different leafset, and is good both for distant and similar trees, however it is only weakly discriminative.

The Edit distance, with the variant where cost of contraction = 1 and pruning = 3, looks very promising as it has a wide range of values and is equally good for distant and similar trees.

The FS dissimilarity measure is similar to the Edit distance, but it does not have a very regular distribution.

NFC here behaves like E(1,1), i.e. it is equivalent to RF for the same leafset and equivalent to MAST for different leafsets, which is good. However, it is still only very weakly discriminative.
Dynamics of Distances
For this test, one tree is randomly generated and then the second tree is obtained with k mutation operations. Here, we observe the dynamics of distance changes with respect to number and type of mutations. Due to the nature of most of the examined distances i.e. Edit Distance, No Forced Contraction Similarity Measure, MAST and RF, we use the following types of mutation:

Contraction: we remove a randomly selected split

Pruning: we remove a randomly selected leaf

Nearest Non-Brother Interchange (NNBI).
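The first two mutation operations can be sketched on a split-set representation as follows. This is an illustrative sketch, not the authors' code: the tree encoding (a leaf set plus canonical non-trivial splits) and all function names are assumptions, and NNBI is omitted because it is not specified in this section.

```python
import random
from typing import Callable, FrozenSet, Set, Tuple

Split = FrozenSet[str]
Tree = Tuple[FrozenSet[str], Set[Split]]  # (leaf set, non-trivial splits)


def canonical(side: FrozenSet[str], leaves: FrozenSet[str]) -> Split:
    """Store a split as the side that does not contain the smallest leaf."""
    return leaves - side if min(leaves) in side else side


def contract_random_split(tree: Tree, rng: random.Random) -> Tree:
    """Contraction: remove one randomly selected split (no-op on a star tree)."""
    leaves, splits = tree
    if not splits:
        return leaves, set()
    victim = rng.choice(sorted(splits, key=sorted))
    return leaves, splits - {victim}


def prune_random_leaf(tree: Tree, rng: random.Random) -> Tree:
    """Pruning: remove one randomly selected leaf and drop trivial splits."""
    leaves, splits = tree
    if len(leaves) <= 3:
        return tree  # nothing meaningful left to prune
    victim = rng.choice(sorted(leaves))
    kept = leaves - {victim}
    result: Set[Split] = set()
    for side in splits:
        a = side & kept
        b = kept - a
        if len(a) >= 2 and len(b) >= 2:
            result.add(canonical(a, kept))
    return kept, result


def mutate(tree: Tree, k: int,
           operation: Callable[[Tree, random.Random], Tree],
           seed: int = 0) -> Tree:
    """Apply the same mutation operation k times."""
    rng = random.Random(seed)
    for _ in range(k):
        tree = operation(tree, rng)
    return tree


# Caterpillar tree ((a,b),c,d,(e,f)) with splits stored relative to leaf a.
start = (frozenset("abcdef"),
         {frozenset("cdef"), frozenset("def"), frozenset("ef")})
contracted = mutate(start, 2, contract_random_split)
pruned = mutate(start, 1, prune_random_leaf)
```

Plotting a distance between `start` and its mutated copy against k, separately for each operation, would reproduce the kind of dynamics curve discussed in this experiment.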
Conclusions
In this paper we have proposed a new technique for measuring the distance between leaf-labelled trees on a free leafset, and evaluated it against the frequent subsplit based method and other measures. The tree edit distance was proven to be a metric. It is difficult to pick the best distance measure, as they all have different interesting properties and may be used in different applications. Two of the presented methods carry the most interesting properties. E(3,1) is very discriminative (having a wide range of values) and has a very regular distance distribution, similar in shape to a normal distribution, and is good both for similar and dissimilar trees. NFC(2,1), on the other hand, is proportional or nearly proportional to the number of mutation operations used, irrespective of their type. All of these distances have the great advantage that they can take different costs of contraction and pruning, so their properties can be tuned to the needs of the user. Future work will be dedicated to finding a more efficient algorithm for the tree edit distance and to a deeper experimental evaluation of the tree edit consensus method for leaf-labelled trees on the same leafset.
References
1. Koperwas J, Walczak K: Clustering of leaf labeled-trees on free leafset. In RSEISP. Springer, Heidelberg; 2007:736–745. LNAI 4585
2. Koperwas J, Walczak K: Frequent Subsplit Representation of Leaf-Labelled Trees. In EvoBIO. Springer, Heidelberg; 2008:95–105. LNCS 4973
3. Koperwas J, Walczak K: Phylogenetic Trees Dissimilarity Measure Based on Strict Frequent Splits Set and Its Application for Clustering. In RSKT. Springer, Heidelberg; 2008:604–611. LNAI 5009
4. Steel M, Penny D: Distributions of tree comparison metrics: some new results. Syst Biol 1993, 42: 126.
5. Robinson DR, Foulds LR: Comparison of phylogenetic trees. Mathematical Biosci 1981, 53: 131–147. 10.1016/0025-5564(81)90043-2
6. Finden CR, Gordon AD: Obtaining common pruned trees. J Classification 1985, 2: 255–276. 10.1007/BF01908078
7. Bryant D: Building trees, hunting for trees, and comparing trees: Theory And Methods In Phylogenetic Analysis. In Ph.D Thesis. University of Canterbury; 1997.
8. McMorris FR, Meronk DB, Neumann DA: A view of some consensus methods for trees. In Numerical Taxonomy. Edited by: Felsenstein J. Springer-Verlag; 1983:122–125.
9. Bille P: Tree Edit Distance, Alignment Distance and Inclusion. Technical report TR2003–23, IT University Technical Report Series 2003.
10. Klein P: Computing the Edit-Distance between Unrooted Ordered Trees. Proc. 6th Annual European Symp. Algorithms, 24–26 Aug. 1998, 91–102.
11. Koperwas J: Clustering Techniques of Leaf-Labelled Trees and Their Applications. In Ph.D Thesis. Warsaw University of Technology, Warsaw; 2009.
12. Bolikowski L, Gambin A: New Metrics for Phylogenies. Fundamenta Informaticae 2007, 78: 1–18.
13. Berger-Wolf TY, Williams TL, Moret BE, Warnow TJ: An experimental evaluation of phylogenetic consensus methods. Technical Report TRCS2003–19, Department of Computer Science, University of New Mexico 2003.
14. Oden NL, Shao KT: An algorithm to equiprobably generate all directed trees with labeled terminal nodes and unlabeled interior nodes. Bull Mathematical Biology 1984, 46(3):379–387.
15. Robinson DF: Comparison of labeled trees with valency three. J Combinatorial Theory, Series B 1971, 11: 105–119. 10.1016/0095-8956(71)90020-7
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.