This paper is devoted to distance measures for leaf-labelled trees on a free leafset. A leaf-labelled tree is a special type of tree in which only the leaves (terminal nodes) are labelled. This data structure is used in bioinformatics to model the evolutionary history of genes and species, and in linguistics to model the evolutionary history of languages. Many domain-specific problems arise that have to be solved with the help of tree postprocessing techniques such as distance measures.
Results
Here we introduce a tree edit distance designed for leaf-labelled trees on a free leafset, which turns out to be a metric. It is presented together with the notion of a tree edit consensus tree. We provide a statistical evaluation of the proposed measure, using the RF, MAST and frequent-subsplit-based dissimilarity measures as reference measures.
Conclusions
The tree edit distance was proven to be a metric and has the advantage of using different costs for contraction and pruning, so its properties can be tuned to the needs of the user. Two of the presented methods have the most interesting properties. E(3,1) is very discriminative (having a wide range of values) and has a very regular distance distribution, similar in shape to a normal distribution, which works well for both similar and dissimilar trees. NFC(2,1), on the other hand, is proportional or nearly proportional to the number of mutation operations used, irrespective of their type.
Background
This paper is devoted to distance measures for leaf-labelled trees on a free leafset. A leaf-labelled tree is a special type of tree in which only the leaves (terminal nodes) are labelled. This data structure is used in bioinformatics to model the evolutionary history of genes and species, and in linguistics to model the evolutionary history of languages. Many domain-specific problems arise that have to be solved with the help of tree postprocessing techniques such as distance measures, consensus trees and clustering. Distance measures play the most important role, as they are very often the starting point for more complicated techniques. One such problem is that of competing evolutionary hypotheses: in the process of phylogenetic tree reconstruction, different candidate trees may be obtained, and researchers have to determine the true tree of life.
Many existing techniques are designed for trees built on the same leafset, which is very limiting. Here we focus on techniques that do not require trees to contain the same set of leaves. Previously we introduced the simple z-restriction approach [1] and the more sophisticated frequent subsplit approach [2, 3]. Here we introduce a tree edit distance designed for leaf-labelled trees on a free leafset, which turns out to be a metric. It is presented together with the notion of a tree edit consensus tree and some new results for the frequent-subsplit-based dissimilarity measures. For the experimental testing we follow and extend the methodology presented in [4]. We use the popular Robinson-Foulds [5] and MAST [6] based distances as the reference measures. The experiments yield very promising results.
Methods
Basic Notions
Here we provide the basic notions and describe the basic operations on leaf-labelled trees that were chosen as the editing operations for the new tree edit distance measure. Some derived notions are also presented.
A leaf-labelled tree is a tree with labels assigned to its leaves. Unrooted leaf-labelled trees are very often represented as a set of splits [7].
Definition [Split] The split (or bipartition) A|B of a tree T with leafset L, corresponding to an edge e, is the pair of leafsets A and B obtained by removing the edge e from T, which splits T into two disconnected trees; A ∪ B = L. If |A| = 1 or |B| = 1, the split is trivial.
In this paper, we will refer to the leafset of a given split s as L(s). The set of splits corresponding to each edge builds a unique representation of a given tree. We will refer to the set of splits for a given tree as S(T). We will use s ∈ S(T), or s ∈ T to denote that split s occurs in tree T.
Definition [Contraction]
The Contraction of a tree T is obtained by removing a chosen internal edge from tree T and identifying the two nodes adjacent to the contracted edge.
Because each split corresponds to an edge (provided that no internal nodes of degree two occur), a contraction may be realised by removing a split from the splitset that represents the given tree. Figure 1 illustrates a contraction operation, and the splitsets are as follows:
T1: a|bcde, b|acde, c|abde, d|abce, e|abcd, ab|cde, abe|cd
T2: a|bcde, b|acde, c|abde, d|abce, e|abcd, ab|cde
T1 is called a refinement of T2; however, T2 is also a subtree of T1 (in more general terms), therefore we will say that T2 is a c-subtree of T1.
Definition [c-subtree] A c-subtree of a tree T is a subtree constructed from its supertree T using contraction operations only.
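The split-set view of contraction can be sketched in a few lines of Python. The encoding below (a split as an unordered pair of leaf sets) and the helper names `split`, `is_trivial` and `contract` are assumptions made for illustration only; the two non-trivial splits of T1 from Figure 1 are read here as ab|cde and abe|cd.

```python
def split(a, b):
    """Encode the split A|B as an unordered pair of leaf sets."""
    return frozenset((frozenset(a), frozenset(b)))


def is_trivial(s):
    """A split is trivial when one of its sides holds a single leaf."""
    return any(len(side) == 1 for side in s)


def contract(splits, s):
    """Contract the internal edge that corresponds to the non-trivial split s."""
    if s not in splits or is_trivial(s):
        raise ValueError("only a non-trivial split of the tree can be contracted")
    return splits - {s}


# Trees T1 and T2 of Figure 1 (assumed reading of the split lists).
trivial = {split(x, set("abcde") - {x}) for x in "abcde"}
T1 = frozenset(trivial | {split("ab", "cde"), split("abe", "cd")})
T2 = contract(T1, split("abe", "cd"))

# T2 is a c-subtree of T1: its split set is a subset of the split set of T1.
is_c_subtree = T2 <= T1
```

Because the pair is unordered, `split("abe", "cd")` and `split("cd", "abe")` denote the same split, which mirrors the fact that A|B and B|A describe the same edge.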
Definition [Pruning] Pruning is the operation of removing a chosen leaf from a tree and afterwards removing the nodes of degree two (which is called forced contraction). On a set of splits, pruning can be described as removing the leaf from every split and then removing duplicate and invalid splits, which corresponds to forced contraction.
Figure 2 illustrates a pruning operation, where T2 is the result of removing leaf d (after which a node of degree two emerges) and T3 is the tree obtained once the forced contraction has also been applied.
T3 is called an induced subtree of T1; however, here we will call it a p-subtree.
Definition [p-subtree] A p-subtree PS of a tree T is a subtree constructed from its supertree T using pruning operations only.
Definition [restricted tree, z-restricted tree, induced subtree] A z-restricted tree T^{z} (alternatively denoted as T|z in the literature and also called an induced subtree) of a tree T on leafset z is a p-subtree of T from which all leaves not in z were pruned. In this paper we use both the T^{z} and T|z notations; the second one is more popular and clearer, however it sometimes conflicts with the split notation.
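Pruning and z-restriction on a split set can be sketched as follows. This is a minimal illustration, assuming the same unordered-pair encoding of splits as a Python `frozenset`; the names `restrict` and `prune` are hypothetical, and the example tree is T1 of Figure 1 with its non-trivial splits read as ab|cde and abe|cd.

```python
def split(a, b):
    """Encode the split A|B as an unordered pair of leaf sets."""
    return frozenset((frozenset(a), frozenset(b)))


def restrict(splits, z):
    """Return the z-restricted split set.

    Leaves outside z are removed from every split; splits with an empty
    side are invalid and dropped, and duplicates vanish because the result
    is a set (this is the forced contraction).
    """
    z = frozenset(z)
    restricted = set()
    for s in splits:
        sides = [side & z for side in s]
        if all(sides):  # both sides still contain at least one leaf
            restricted.add(frozenset(sides))
    return restricted


def prune(splits, leaf):
    """Prune a single leaf, i.e. restrict the tree to all remaining leaves."""
    leaves = frozenset().union(*next(iter(splits)))
    return restrict(splits, leaves - {leaf})


trivial = {split(x, set("abcde") - {x}) for x in "abcde"}
T1 = trivial | {split("ab", "cde"), split("abe", "cd")}
T3 = prune(T1, "d")  # abe|cd collapses onto the trivial split c|abe
```

Using a set for the result is the design choice that makes forced contraction implicit: a split that becomes a duplicate after the leaf is removed is simply stored once.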
Definition [Restricted Split Equality (z-equality)] [1] Splits s_{1} and s_{2} are restrictedly equal on leafset z if their z-restricted versions on leafset z are equal, i.e. if
s_{1}^{z} = s_{2}^{z} (1)
Figure 3 illustrates a more complicated tree together with its p-subtree (the restricted subtree on z = acde) and its c-subtree.
In [2] we have introduced the subsplit term which is used for the distance and consensus methods discussed later in this paper.
Definition [Subsplit and supersplit] [2] Split s_{1} is a subsplit of s_{2}, and s_{2} is a supersplit of s_{1}, iff s_{1} is restrictedly equal to s_{2} on the leafset of s_{1} and the leafset of s_{1} is a subset of the leafset of s_{2}:
s_{1}^{L(s_{1})} = s_{2}^{L(s_{1})} ∧ L(s_{1}) ⊆ L(s_{2}) (2)
This can be presented alternatively as:
(3)
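The subsplit relation translates directly into a short check. The sketch below follows the verbal definition (restricted equality on the leafset of s1 plus leafset inclusion); the helper names `leafset`, `restrict_split` and `is_subsplit` are assumptions introduced for this example.

```python
def split(a, b):
    """Encode the split A|B as an unordered pair of leaf sets."""
    return frozenset((frozenset(a), frozenset(b)))


def leafset(s):
    """L(s): all leaves that occur on either side of the split."""
    return frozenset().union(*s)


def restrict_split(s, z):
    """Restrict both sides of the split to the leafset z."""
    return frozenset(side & z for side in s)


def is_subsplit(s1, s2):
    """True if s1 is a subsplit of s2 (equivalently, s2 is a supersplit of s1)."""
    z = leafset(s1)
    return z <= leafset(s2) and restrict_split(s2, z) == s1


inside = is_subsplit(split("ab", "cd"), split("abe", "cdf"))    # ab|cd within abe|cdf
crossing = is_subsplit(split("ac", "bd"), split("abe", "cdf"))  # sides are mixed
foreign = is_subsplit(split("ab", "cg"), split("abe", "cdf"))   # g is not a leaf of s2
```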
Common information extraction techniques
Definition [The strict consensus tree] [8] The strict consensus tree is defined in terms of splits: it is the tree constructed from all splits common to all trees in a given profile of trees. Figure 4 presents two trees together with their strict consensus tree.
The common splits of these trees, which build the strict consensus tree, are as follows: abcdef, bacdef, cabdef, dabcef, eabcdf, fabcde, abcdef, abcdef.
Because the concept of a consensus tree is very strict, for many trees, a consensus tree can easily become a star (a tree built of only trivial splits). In order to deal with this problem, many variations of consensus trees have been proposed, among others, a majority rule consensus tree.
Definition [Majority rule consensus tree] The majority rule consensus tree is built from splits that occur in the majority of trees.
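Both consensus notions reduce to counting in how many trees each split occurs. The following sketch assumes trees on the same leafset given as sets of splits; the three-tree profile is a made-up example (trivial splits omitted for brevity), and the function names are illustrative.

```python
from collections import Counter


def split(a, b):
    """Encode the split A|B as an unordered pair of leaf sets."""
    return frozenset((frozenset(a), frozenset(b)))


def strict_consensus(profile):
    """Splits that occur in every tree of the profile."""
    return set.intersection(*(set(tree) for tree in profile))


def majority_rule_consensus(profile):
    """Splits that occur in more than half of the trees of the profile."""
    counts = Counter(s for tree in profile for s in set(tree))
    return {s for s, c in counts.items() if 2 * c > len(profile)}


# Hypothetical profile of three trees on the leafset abcde.
profile = [
    {split("ab", "cde"), split("abc", "de")},
    {split("ab", "cde"), split("abd", "ce")},
    {split("ab", "cde"), split("abc", "de")},
]
strict = strict_consensus(profile)
majority = majority_rule_consensus(profile)
```

In this profile ab|cde is shared by all trees, while abc|de occurs in two of three, so it survives only in the majority-rule result.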
Definition [Maximum Agreement Subtree (MAST)] [6] For a given profile of leaf-labelled trees T_{1}, ..., T_{n}, an Agreement Subtree is a tree T_{A} for which T_{A} = T_{1}|x = ··· = T_{n}|x for a given x, where x ⊆ L(T). The Maximum Agreement Subtree is an agreement subtree with the maximum number of leaves [6].
An example of a MAST can be seen in Figure 5. In Figure 5, T1 and T2 are the input trees, the leaf d is removed from both trees resulting in T1' and T2' respectively. Finally the leaf h needs to be removed to achieve an identical tree TM, which is a maximum agreement subtree of T1 and T2.
Several versions of the MAST problem exist like RMAST, which considers only rooted trees, or UMAST for general unrooted trees.
The MAST problem without any restrictions is NP-hard in general [7]. However, when the degree of one of the input trees is bounded, the problem can be solved in polynomial time [7]. The problem is also polynomial when the number of trees is limited to two.
Distance measures
The Robinson-Foulds (RF) distance [5] originated from phylogenetic analysis. It is defined as the difference between the number of all splits and the number of splits shared by the compared trees. As it was proposed for phylogenetic trees, it is defined for leaf-labelled trees with the same leafset. The RF distance between two trees T_{1} and T_{2} with sets of splits S_{1} and S_{2}, respectively, is as follows:
d_{RF}(T_{1}, T_{2}) = |S_{1} ∪ S_{2}| − |S_{1} ∩ S_{2}| (4)
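On split sets, the verbal definition (all splits minus shared splits) is the size of the symmetric difference. A minimal sketch, assuming the unordered-pair encoding of splits and the illustrative names `leaves` and `rf_distance`; the example uses T1 and T2 of Figure 1 with non-trivial splits read as ab|cde and abe|cd.

```python
def split(a, b):
    """Encode the split A|B as an unordered pair of leaf sets."""
    return frozenset((frozenset(a), frozenset(b)))


def leaves(splits):
    """Leafset of a tree given by its split set."""
    return frozenset().union(*(side for s in splits for side in s))


def rf_distance(s1, s2):
    """Number of splits that occur in exactly one of the two trees."""
    if leaves(s1) != leaves(s2):
        raise ValueError("the RF distance is defined only for trees on the same leafset")
    return len(s1 ^ s2)


trivial = {split(x, set("abcde") - {x}) for x in "abcde"}
T1 = trivial | {split("ab", "cde"), split("abe", "cd")}
T2 = trivial | {split("ab", "cde")}
d = rf_distance(T1, T2)
```

The explicit leafset check reflects the restriction stated in the text: for trees on different leafsets the function refuses to return a value rather than producing a misleading one.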
The tree edit distance [9], [10] between T1 and T2 is defined as the minimal cost of a sequence of editing operations (inserting a node, deleting a node and relabelling a node) that transforms T1 into T2. It is based on the concept of edit distance for strings. Tree edit distance was defined for node-labelled and edge-labelled trees. The distance has useful features: it is intuitive and it does not require the compared trees to have the same set of leaves. However, it cannot be used directly for trees in which only the leaves are labelled. Some artificial labelling of internal nodes is required to use it for such trees, which makes it less intuitive. This distance has not been popular for leaf-labelled (phylogenetic) trees. In our opinion, however, the idea of edit distance can be applied to leaf-labelled trees, provided that the selected editing operations are natural for them. Such an approach, which can lead to better distance measures for leaf-labelled trees than the existing ones, is presented later in this paper.
The MAST distance between trees T1 and T2 is the number of leaves that need to be removed to obtain the Maximum Agreement Subtree.
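For two small trees on the same leafset, the MAST distance can be found by exhaustive search over the leaves to keep. The sketch below is a brute-force illustration only (exponential in the number of leaves), not one of the polynomial algorithms mentioned above; trees are given by their non-trivial splits, and the names `restrict` and `mast_distance` are assumptions.

```python
from itertools import combinations


def split(a, b):
    """Encode the split A|B as an unordered pair of leaf sets."""
    return frozenset((frozenset(a), frozenset(b)))


def restrict(splits, keep):
    """Non-trivial splits of the tree restricted to the leaves in `keep`."""
    out = set()
    for s in splits:
        a, b = (side & keep for side in s)
        if len(a) >= 2 and len(b) >= 2:
            out.add(frozenset((a, b)))
    return out


def mast_distance(leaves, s1, s2):
    """Smallest number of leaves whose removal makes both trees identical."""
    for size in range(len(leaves), -1, -1):
        for kept in combinations(sorted(leaves), size):
            keep = frozenset(kept)
            if restrict(s1, keep) == restrict(s2, keep):
                return len(leaves) - size


# T1 and T2 of Figure 1, non-trivial splits only (assumed reading).
S1 = {split("ab", "cde"), split("abe", "cd")}
S2 = {split("ab", "cde")}
d = mast_distance(set("abcde"), S1, S2)
```

Comparing only non-trivial splits is sufficient here because two unrooted trees on the same leafset are identical exactly when their split sets coincide, and the trivial splits are then the same by construction.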
Representative Splitset and derived similarity measure
Here, we recall the basis of our representative splitset approach, which is the foundation for a new consensus technique and new similarity measure, applicable to trees where the leafset may vary without discarding any information. For the detailed information see [2].
Notion of Representative Splitset
Definition [Frequent subsplit] Frequent subsplit s with support minsup in a profile of trees is a split that is a subsplit of at least one split in at least minsup of trees. The minsup parameter is called the minimal support. It may be an absolute value which denotes the minimum number of trees in which the split is supposed to be found (as a subsplit). It can also be given as a relative value, where it is a minimal percentage of the trees in which the split is supposed to be found.
Consider the trees shown in Figure 6, which are represented as follows:
According to our approach, we count the number of trees in which the split occurs (as a subsplit of any split), rather than counting the number of splits of which it is a subsplit. For example, in Figure 6: abcdefgh has the support 2/2 (100%), because it occurs in both trees: in the first one as a subsplit of abcdefghi, and in the second one as a subsplit of abcdefghj. The argument for counting trees rather than splits is that there may be some subsplits that occur frequently as subsplits of many splits, but only in one tree. Such subsplits are considered uninteresting.
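Counting trees rather than splits can be made concrete with a small support function. This is a sketch under assumptions: the two-tree profile is a made-up example (not the trees of Figure 6), and the helper names `is_subsplit`, `support` and `is_frequent` are illustrative.

```python
def split(a, b):
    """Encode the split A|B as an unordered pair of leaf sets."""
    return frozenset((frozenset(a), frozenset(b)))


def is_subsplit(s1, s2):
    """True if s1 equals s2 restricted to the leafset of s1."""
    z = frozenset().union(*s1)
    return z <= frozenset().union(*s2) and frozenset(side & z for side in s2) == s1


def support(s, profile):
    """Number of trees in which s is a subsplit of at least one split."""
    return sum(any(is_subsplit(s, t) for t in tree) for tree in profile)


def is_frequent(s, profile, minsup):
    """minsup is a relative value between 0 and 1."""
    return support(s, profile) >= minsup * len(profile)


# Hypothetical profile; only non-trivial splits are listed.
tree1 = {split("abe", "cdf"), split("ab", "cdef")}
tree2 = {split("abg", "cd")}
profile = [tree1, tree2]

s = split("ab", "cd")
sup = support(s, profile)  # tree1 contributes once although two of its splits match
```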
Definition [Representative splitset] The representative splitset is the set that contains the maximal frequent subsplits s, i.e. those for which there is no other frequent subsplit s_{x} that is also a supersplit of s.
Definition [strict representative splitset SFS] The strict representative splitset SFS is a representative splitset with minsup = 100%. More formally, SFS can be represented as follows:
(5)
where
(6)
Definition [Majorityrule representative splitset MRFS] The Majorityrule representative splitset is a representative splitset with minsup = 50%.
Frequent Splitset Interpretation
It is clear that, from the splits of FS, we cannot directly construct one tree because the splits in general have different leafsets.
The full reasoning about frequent interpretation was provided in [2]. Here we just recall the conclusions which were derived from the split compatibility Definition and use the fact that from a compatible set of splits a tree can be built:
Conclusion 1: For each distinct leafset z from frequent splitset (FS) with a support greater than 50%, a tree can be built. The tree is built on zrestricted versions of those splits from FS having a leafset as a superset of z. Therefore the frequent splitset (minsup > 50%) can be represented as a set of trees. In particular, it affects the strict and majorityrule frequent splitset.
Conclusion 2: Each split from the frequent splitset discussed above will occur in at least one tree, in a restricted form.
Conclusion 3: Conclusions 1 and 2 are also true for a tree based on the intersection of all the distinct leafsets from the frequent splitset.
Conclusion 4: The set of trees resulting from the frequent splitset will also contain a consensus tree, provided that the input dataset of trees was built on the same leafset.
For example, the strict-frequent splitset of the trees from Figure 6 contains splits built on two distinct leafsets, abcdefg and abcefg, and the intersection of these leafsets is equal to the second leafset. Therefore, this strict-frequent splitset is illustrated by two trees, as shown in Figure 7.
For a more difficult example, let us look at trees T_{1} and T_{2} from Figure 8. Here we have three distinct leafsets, {abcdefgh}, {abcefgh} and {abcdefg}, and their intersection {abcefg}. Therefore, as a visualisation we present four trees on these leafsets, as shown in Figure 9.
FSbased Dissimilarity Measure
Based on the notion of the frequent subsplit, we defined a dissimilarity measure between two trees (or splitsets) [3]. It is not only applicable to trees with different leafsets but also gives more intuitive results for trees with the same leafset:
(7)
where SFS is a strictfrequent splitset and is the modified sum of both splitsets, which means that, if for splits s_{1} ∈ S_{1}, s_{2} ∈ S_{2} , s_{1} is a supersplit of s_{2}, only the supersplit (s_{1}) is included in the result. Formally, it can be represented as follows:
(8)
Such a measure determines the dissimilarity on the basis of how many subsplits they share in common. Let us compare this measure to the most popular: RF distance. Consider the example from Figure 10:
According to the RF distance, T_{1} and T_{2} are exactly as dissimilar as T_{2} and T_{3}, whilst our measure arrives at a different result, which is intuitive since T_{1} and T_{2} share a common nontrivial subsplit abcd. For trees on different leafsets, the RF distance does not work at all, whilst our measure does.
The main drawback of this measure is that it is not a metric, however it achieves very good statistical characteristics and clustering results as described in the Results section. In this paper the method was compared to RF, MAST and edit distance in the series of experiments.
Tree Edit Distance and Tree Edit Consensus for Leaf-Labelled Trees
Tree Edit Distance for Leaf-Labelled Trees
In the following sections we define a new distance and a consensus notion based on editing operations on leaf-labelled trees. We choose contraction and pruning as the editing operations for leaf-labelled trees. If tree T3 is a subtree of T1 for which both pruning and contraction operations are allowed, then we call it a pc-subtree or edit subtree. An example of transforming tree T1 into T3 using editing operations is shown in Figure 11.
Definition [Edit script] An edit script S(T1, T2) for leaf-labelled trees T1 and T2 is the pair of subscripts S(T1) and S(T2), which are sequences of editing operations (contractions and prunings) that can be applied to the input trees T1 and T2, respectively, to unify them: S(T1, T2) = S(T1) ∪ S(T2). The subscripts S(T1) and S(T2) are unidirectional, which means that by using S(T1) we can modify T1 to obtain the tree that is a unification of T1 and T2, but not necessarily in the opposite direction.
Definition [Edit script Cost] The cost of an edit script, Cost(S), is the sum of the defined costs of its editing operations, where Cost(e) = cost_{c} if e is a contraction and Cost(e) = cost_{p} if e is a pruning operation. Forced contractions may be counted or not, depending on the application.
Definition [Tree edit distance for leaflabelled trees]
Given positive costs for the contraction and pruning operations, the tree edit distance for leaf-labelled trees T1 and T2 is the cost of the minimal-cost edit script, d(T1, T2) = min Cost(S), where forced contractions are counted as normal contractions. Note that, in order to keep the resulting tree leaf-labelled, only contractions that correspond to nontrivial splits are performed, unless the contraction is forced; in that case a trivial split may also be contracted in order to remove a split with a duplicate representation. In this paper, we focus on the edit distance which counts forced contractions. Thanks to this property, we can prove that it is a true metric.
However, there is also an interesting variant of the edit distance in which forced contractions are ignored. The metric property of this variant is yet to be verified, but the measure is also considered in the experiments because of its interesting features.
Definition [No Forced Contraction Dissimilarity Measure for leaf-labelled trees] Given positive costs for the contraction and pruning operations, the No Forced Contraction Dissimilarity Measure for leaf-labelled trees T1 and T2 is the cost of the minimal-cost edit script, d(T1, T2) = min Cost(S), where forced contractions are ignored.
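The two definitions differ only in whether forced contractions contribute to the cost of a script. A minimal sketch of that cost computation, assuming a script is summarised by its operation counts and that a forced contraction is charged the contraction cost; the function name `script_cost` and the `count_forced` flag are illustrative and not taken from the authors' implementation.

```python
def script_cost(prunings, contractions, forced, cost_p, cost_c, count_forced=True):
    """Cost of one edit script, given the number of operations of each kind.

    With count_forced=True forced contractions are charged like ordinary
    contractions (edit distance); with count_forced=False they are ignored
    (the no-forced-contraction variant).
    """
    if min(prunings, contractions, forced) < 0:
        raise ValueError("operation counts must be non-negative")
    if cost_p <= 0 or cost_c <= 0:
        raise ValueError("operation costs must be positive")
    cost = cost_p * prunings + cost_c * contractions
    if count_forced:
        cost += cost_c * forced
    return cost


# Script described for Figure 12: leaf d is pruned from both trees
# (2 prunings, 2 forced contractions), then 2 differing splits are contracted.
with_forced = script_cost(2, 2, 2, cost_p=1, cost_c=1)
without_forced = script_cost(2, 2, 2, cost_p=1, cost_c=1, count_forced=False)
```

These values are costs of one particular script. Both measures are defined as a minimum over all scripts, and the script that is minimal for one variant need not be minimal for the other.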
Tree Edit Distance versus RF Distance and MAST
As mentioned earlier, comparing distance measures is not a trivial task. Here, we provide a subjective opinion about why this measure is better than others, however an objective statistical comparison will be provided in the Results section.
The RF and MAST distances have some drawbacks which have emerged from the fact that the RF distance may use only contraction operations and MAST uses only pruning operations and forced contractions. There are of course some cases when all three distances perform equally well, as in the example of Figure 1, where for Trees T1 and T2, the RF Distance = 1, MAST = 1, Edit Distance = 1.
RF Distance Drawbacks
1) The first drawback of the RF distance is that it is totally useless for leaflabelled trees on a free leafset. For example, Figure 2 shows two trees on a different leafset (T1 and T3). The RF distance is undefined here. It can be seen that the removal of one leaf and the internal edge is sufficient to make the two trees identical. In this case, both the MAST and Edit Distance can be used as MAST can be extended to support a free leafset, and the Edit Distance is naturally suitable for a free leafset. MAST and Edit distances will provide different values: MAST distance = 1, Edit Distance = 2. These differences will be discussed in the following section.
2) The second drawback of the RF distance is that even if the trees are on the same leafset, one noisy leaf may cause the trees to be considered totally different (all splits must be removed). Removal of one leaf may significantly reduce the distance between trees. Such a situation is illustrated in Figure 12. Trees T1 and T2 look totally different, in terms of the RF distance, because of leaf d, thus all nontrivial splits must be removed (all the information!) in order to make them identical. RF Distance = 10, however, removing only leaf d would result in trees differing by only 2 splits! Therefore, the MAST distance equals 2 and the Edit Distance equals 6.
MAST Distance Drawbacks
1) The first drawback of the MAST distance occurs when the trees are similar except for one internal edge as in Figure 13. In this case, the MAST distance would equal 3 as it requires the removal of at least 3 leaves in order to make the trees identical. However, both the RF and Edit distances may just remove one edge, thus the RF Distance would equal 1 and the Edit Distance would equal 1.
2) The second drawback is that MAST counts only the leaves that are removed from both input trees. If it is allowed to also count leaves that are present in only one tree in order to support a different leafset, then the distance will ignore some subtle changes. For example, in Figure 14, the distance between T1 and T2 is identical as in the example with T1 and MAST(T1, T2), which is obviously incorrect. The solution to this problem would be to count the leaf twice if it is removed from both trees and count it once if is removed from one tree. This solution would not however fix another problem: MAST completely ignores forced contractions. Therefore, some subtle differences may again be missed. For example the MAST distance between T1 in Figure 1 and T2 in Figure 2 is the same as between T2 from Figure 1 and T2 from Figure 2, which is again incorrect. In this case, the MAST distance would be equal to 1, but the Edit Distance equals 2 and 1 respectively.
Edit Distance Advantage
In the previous sections, we showed the drawbacks of RF and MAST distances and showed that the Edit Distance is better because:
it can be used for trees on a free leafset
it can distinguish differences where the MAST distance cannot as it can use both contraction and pruning.
Let us compare the values of these distances. The Edit Distance is easily compared to the RF distance, provided the same cost of contraction is used to count both distances. The only difference between them is that the RF distance cannot use pruning. However, it is impossible to compare the values directly to MAST as this distance is not well defined if the leaf is removed from one tree only, and the cost of forced contractions is ignored. Therefore, in order to compare the values, we will use the RF distance (denoted here as the cdistance) and instead of MAST, we will count the cost of each pruning operation and forced contraction (denoted here as the pdistance). Some distance values are presented in Table 1.
Table 1. Values of c, p and edit distances for various examples.
Fig (trees)    c-dist    p-dist    Edit-dist
1 (T1, T2)     1         2         1
2 (T1, T3)     -         2         2
11 (T1, T3)    -         5         3
14 (T1, T2)    4         4         4
13 (T1, T2)    1         4         1
15 (T1, T2)    8         7         4
These results show that in some situations pruning operations are better at unifying trees, in others contractions are, and sometimes neither performs well on its own, so that using both is better. To sum up, the cases in which one editing operation is better than the other are as follows.
Pruning is better:
when the trees are not on the same leafset, in which case pruning is necessary (Figures 2, 11, 14)
when the trees are on the same leafset but contain some noisy leaf (leaf d in trees T1 and T2 from Figure 12).
Contraction is better:
when two significantly large subtrees are connected directly in one tree, but connected with an additional edge in the second tree (Figure 13).
Both operations seem to be equally good when the trees are on the same leafset, there is no noisy data, and the degrees of the nodes are relatively small.
Because there are cases in which one operation is better than the other, a distance based on both operations should be better than one based on only one of them. A distance constructed in this way will choose the most appropriate editing operations. For example, in Figure 15 the c-distance equals 8 (all nontrivial splits), the p-distance equals 7 (d and b in both trees plus 3 forced contractions), and the pc-distance (i.e. the edit distance with unitary costs) equals 4 (d in both trees plus 2 splits).
Cost Manipulation
The difference between the Edit Distance and other distances is especially visible when the costs of the operations are not the same. Although in some cases both operations can be equally good, one may prefer, for example, contraction over pruning. The motivation can be, for example, the need to keep as many leaves as possible in the tree edit consensus. Therefore, our distance uses the costs of editing operations. For example, consider the trees T1 and T2 in Figure 12, and assume that the pruning cost equals 2 and the contraction cost equals 1. For these trees, if only the pruning operation is used, then the p-distance equals 12 (removal of d and h from both trees and forced contractions); if only contraction is used, then the RF distance equals 10 (removal of all nontrivial splits in both trees); however, if the Edit Distance is used, then the distance is equal to 8 (removal of d from both trees, then removal of the two differing splits). The edit script is also illustrated, with its intermediate trees (T1' and T2'). If we assume the following costs: 3 for pruning and 1 for contraction, then the Edit Distance would consist of contraction operations only, and the distance would be equal to 10.
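The arithmetic behind these values can be written out explicitly. The operation counts used here (two prunings, two forced contractions and two contractions for the mixed script, ten contractions for the contraction-only script) are taken from the step-by-step example for Figure 12 given later in the paper; the symbols c_p and c_c for the pruning and contraction costs are shorthand introduced only for this sketch.

```latex
\begin{align*}
\mathrm{Cost}_{\text{mixed}} &= 2c_p + 2c_c + 2c_c = 2c_p + 4c_c \\
\mathrm{Cost}_{\text{contraction only}} &= 10c_c \\
(c_p, c_c) = (2, 1):\quad \mathrm{Cost}_{\text{mixed}} &= 2 \cdot 2 + 4 \cdot 1 = 8 < 10 \\
(c_p, c_c) = (3, 1):\quad \mathrm{Cost}_{\text{mixed}} &= 2 \cdot 3 + 4 \cdot 1 = 10 = \mathrm{Cost}_{\text{contraction only}}
\end{align*}
```

Under these counts the mixed script is strictly cheaper exactly when c_p < 3c_c. At c_p = 3c_c the two scripts tie, which is consistent with a contraction-only script of cost 10 being reported for the costs 3 and 1.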
Tree Edit Distance Metric Proof
In order to show that our measure is a true metric, the following conditions must be proved:
d(T_{1}, T_{2}) = 0 if and only if T_{1} = T_{2}
d(T_{1}, T_{2}) = d(T_{2}, T_{1})
d(T_{1}, T_{3}) ≤ d(T_{1}, T_{2}) + d(T_{2}, T_{3})
The first two conditions are met by definition. The minimal edit script that unifies T1 with itself contains no operations, therefore the distance is equal to 0. On the other hand, two different trees T1 and T2 can be unified only by applying some editing operations, and because costs must be positive, the distance between different trees cannot be 0.
As the Definition states that the distance is the minimal cost of unifying two trees, by applying the editing operations either to T1 or T2, it is therefore symmetric by Definition.
The third condition is slightly more complicated and requires more explanation:
Lemma : Having the edit scripts corresponding to distances d(T_{1}, T_{2}) and d(T_{2}, T_{3}), we can unify trees T1 and T3 using the same operations as on both scripts (or a subset of them).
Proof:
Let us denote:
TPCX the tree edit consensus (unification) of trees T1 and T2.
TPCY the tree edit consensus (unification) of trees T2 and T3.
Sx1(T1) = TPCX  the edit subscript that transforms T1 to TPCX
Sx2(T2) = TPCX  the edit subscript that transforms T2 to TPCX
Sy2(T2) = TPCY  the edit subscript that transforms T2 to TPCY
Sy3(T3) = TPCY  the edit subscript that transforms T3 to TPCY
The mentioned artefacts are presented in Figure 16.
Because S1(S2(T)) = S2(S1(T)), which will be shown in the next section, we have:
Sx2(Sy2(T2)) = Sy2(Sx2(T2)) and hence Sx2(TPCY) = Sy2(TPCX), because Sy2(T2) = TPCY and Sx2(T2) = TPCX
Therefore, there exists some tree TPCZ, such that Sx2(TPCY) = TPCZ and Sy2(TPCX) = TPCZ (dotted line on Figure 16), which can be obtained from T1 and T3 with at most the same number of operations as unifying T1 with T2 and T2 with T3.
Theorem: The tree edit distance for leaflabelled trees meets the third metric condition.
Proof: Due to Lemma, presented earlier, there exists an edit script S(T1, T3) that can unify trees T1 and T3 at the same cost (or less) than the sum of: d(T_{1}, T_{2}) + d(T_{2}, T_{3})
In this section we show that for a given edit subscript (i.e. a set of operations on one tree), changing the order of its operations does not change the resulting tree, and therefore does not increase the cost. To show this, we need to show that if an edit script consists of the operations p_{1}, ..., p_{n} and c_{1}, ..., c_{m}, then changing their order does not change the result.
Let us assume that a tree is represented by two sets: E(e_{1} ... e_{n}), the set of internal edges referred to by nontrivial splits, and L(l_{1} ... l_{n}), the set of leaves. Let us consider an edit script ES that transforms T_{1}, represented by E_{1}, L_{1}, into a tree T_{2}, represented by E_{2}, L_{2}, with E_{2} ⊆ E_{1} and L_{2} ⊆ L_{1}. Assume for a moment that we do not handle forced contractions. Under this assumption the two edit operations, contraction and pruning, operate on different sets of items: edges (referred to by splits) and leaves, respectively. Therefore the position of a contraction operation relative to a pruning operation in the edit script (and vice versa) does not affect the result. Additionally, from set theory, the order in which items are removed from a set is irrelevant, so the position of a contraction operation relative to other contraction operations in the edit script is irrelevant. The same holds for pruning. This leads directly to the conclusion that the order of operations in an edit script does not affect the final result.
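The set-based argument can be checked mechanically on a toy instance. The sketch below models a tree as a pair (E, L) of edge and leaf sets and, like the argument itself at this point, ignores forced contractions; the edge names, the script and the function `apply_script` are made up for illustration.

```python
from itertools import permutations


def apply_script(tree, script):
    """Apply contractions ('c', edge) and prunings ('p', leaf) in the given order."""
    edges, leaves = tree
    for kind, item in script:
        if kind == "c":
            edges = edges - {item}
        elif kind == "p":
            leaves = leaves - {item}
        else:
            raise ValueError("unknown operation: %r" % (kind,))
    return edges, leaves


tree = (frozenset({"e1", "e2", "e3"}), frozenset("abcdef"))
script = [("c", "e1"), ("p", "d"), ("c", "e3"), ("p", "a")]

# Every ordering of the four operations yields the same (E, L) pair.
results = {apply_script(tree, order) for order in permutations(script)}
```

The check covers all 24 orderings of this one script, so it illustrates the claim rather than proving it; the general statement follows from the commutativity of set removal.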
One may notice that pruning also removes some edges, but only trivial ones, which are not considered in edit distance and may be removed at any time. One may also notice that pruning changes the bipartition representation of all nontrivial splits. It is also not a problem, as the total number of edges is not affected. Although we use split representation very often, here the number of edges is important (not the form of their split representations).
As it was presented earlier in this paper, pruning may occasionally introduce forced contraction (see Figure 2). It does not, however, break the assumption that operations work on different sets and are independent of each other.
Let us represent pruning p as the pair of operations p' that removes leaf only as assumed earlier and fc that performs forced contraction. So each time pruning p occurs in edit script it is replaced with either p' if there is no forced contraction to perform or p', fc. Operation fc may be treated as regular contraction operation, with the only difference that it is inserted by pruning. It can also be easily shown that if we change the order from p', fc to fc, p' the result remains the same as the pruning operation does not trigger forced contraction if the appropriate edge was removed earlier in edit script.
The last thing that we mention is edge matching. Forced contraction removes an edge which is a duplicate of another edge with respect to their split representations. For example, e_{1} = ab|cde and e_{2} = abe|cd will be considered duplicates if leaf e is pruned. It may therefore be unclear which edge, e_{1} or e_{2}, is in fact removed from the set of edges. Therefore, if a forced contraction is to be performed, we treat it as the unification of duplicate edges. That means that operations c(e_{1}) and c(e_{2}) can be used interchangeably within the edit script, as the edges are unified sooner or later.
Algorithm for Counting Edit Distance of Leaflabelled Trees
The naive algorithm for this problem can be illustrated as follows:
(9)
(10)
where d_{s} is the distance for trees on the same leafset, T_{2} − s denotes tree T2 with split s removed, Δ stands for the symmetric difference and k indicates the number of forced contractions that need to be performed. To keep the metric property, cost_{fc} is equal to cost_{c}. In the equation, the part cost_{p} * |L(T_{1}) Δ L(T_{2})| + k * cost_{fc} is in fact the cost of unifying the leafsets of both trees. The algorithm is therefore exponential with respect to the number of leaves and the number of splits. We modify the algorithm on the basis of two observations. The first is that the order of editing operations is irrelevant, therefore the algorithm can try pruning operations before it tries contractions. The second is that the Edit Distance will never remove splits that occur in both trees (except in forced contractions), which can easily be proved; only differing splits (i.e. the RF distance) are considered. After these modifications the algorithm is as follows (for L(T1) = L(T2)):
(11)
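The worked example later in this section adds, for every pruned leaf, the pruning cost in both trees and the cost of the triggered forced contractions to the RF distance of the remaining trees. Under that reading, the modified recursion for trees on a common leafset L can plausibly be written as below. This is a hedged reconstruction consistent with that example, not a quotation of the original equation; the notation T \setminus l (tree T with leaf l pruned) and k_l (the number of forced contractions triggered in both trees by pruning l) are assumptions.

```latex
d_s(T_1, T_2) = \min\Bigl(
    cost_c \cdot d_{RF}(T_1, T_2),\;
    \min_{l \in L} \bigl[\, 2\, cost_p + k_l \cdot cost_{fc}
        + d_s(T_1 \setminus l,\; T_2 \setminus l) \,\bigr]
\Bigr)
```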
The algorithm can also be presented with pseudocode as follows:
function how_many_fc(T1, T1', k) { return |S(T1)| - |S(T1')| + k }
// the k parameter is used to prevent counting the trivial split contraction directly
// associated with leaf removal
function prune(T1, L): prunes all leaves in set L from tree T1 and
performs forced contractions.
This algorithm is now exponential only with respect to the number of leaves. It may be possible to improve it further so that it has the same complexity as MAST for two trees (which is polynomial), but further investigation is required. For the purpose of this paper, we used a dynamic programming algorithm in which partial results are stored in memory and reused when necessary. It turned out that the algorithm needed to evaluate only a small part of all possible combinations, which also gives grounds for optimism that a better algorithm will be found.
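The memoised search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: trees are given as sets of non-trivial splits on a common leafset, only non-trivial splits are stored (so the drop in their number after pruning a leaf equals the number of forced contractions), and the names `split`, `prune` and `edit_distance` are hypothetical.

```python
from functools import lru_cache


def split(side_a, side_b):
    """A split as an unordered pair of leaf sets (single-letter leaf names)."""
    return frozenset({frozenset(side_a), frozenset(side_b)})


def prune(splits, leaves):
    """Restrict splits to `leaves`; trivial and duplicate splits disappear,
    which models the forced contractions."""
    kept = set()
    for s in splits:
        a, b = (side & leaves for side in s)
        if len(a) >= 2 and len(b) >= 2:
            kept.add(frozenset({a, b}))
    return frozenset(kept)


def edit_distance(splits1, splits2, leaves, cost_c=1, cost_p=1):
    """Edit distance for two trees on the same leafset (exponential search)."""
    cost_fc = cost_c  # forced contraction costs the same as contraction

    @lru_cache(maxsize=None)  # partial results are stored and reused
    def d_s(s1, s2, live):
        # Option 1: contract every differing split (the RF distance).
        best = cost_c * len(s1 ^ s2)
        if best <= 2 * cost_p:  # no pruning path can be cheaper
            return best
        # Option 2: prune one leaf from both trees and recurse.
        for leaf in live:
            rest = live - {leaf}
            t1, t2 = prune(s1, rest), prune(s2, rest)
            k = (len(s1) - len(t1)) + (len(s2) - len(t2))
            best = min(best, 2 * cost_p + k * cost_fc + d_s(t1, t2, rest))
        return best

    return d_s(frozenset(splits1), frozenset(splits2), frozenset(leaves))


# Two caterpillar trees that differ only in the position of leaf x.
t1 = [split("xa", "bcdef"), split("xab", "cdef"),
      split("xabc", "def"), split("xabcd", "ef")]
t2 = [split("ab", "cdefx"), split("abc", "defx"),
      split("abcd", "efx"), split("abcde", "fx")]
d_unit = edit_distance(t1, t2, "xabcdef")            # E(1,1)
d_e31 = edit_distance(t1, t2, "xabcdef", cost_c=3)   # E(3,1)
```

In this toy input the RF distance is 8, while pruning x from both trees and paying for the two forced contractions makes the trees identical, so the search prefers the pruning path.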
Let us look at a few steps of the naive algorithm for trees T1 and T2 from Figure 12.
The trees are built on the same leafset, so we may directly evaluate the d_{s} equation. d_{RF}(T1, T2) = 10. Let us remove some leaves (for clarity of presentation we will not show all of them). Let us remove a; we obtain:
Curly braces denote splits that will be force-contracted.
d_{RF}(T1', T2') = 8, cost of pruning: 2, and cost of forced contractions: 2. Total cost on this path: 12, so the result is worse than d_{RF}(T1, T2), and pruning additional leaves will not improve the result either.
So let us remove d instead of a:
T1':ghabcef, f ghabce, efghabc, aefghbc, (abefghc)
d_{RF}(T1', T2') = 2, cost of pruning: 2, and cost of forced contractions: 2. Total cost on this path: 6, so the result is better than d_{RF}(T1, T2). We may continue by pruning another leaf to see whether the result can be improved further, so we remove g:
T1":(habcef), fhabce, efhabc, aefhbc
T2":(fabceh), fhabce, efhabc, aefhbc
d_{RF}(T1", T2") = 0, cost of pruning: 2, and cost of forced contractions: 2, cost carried over from the previous step: 4, so the total cost is equal to 8, which is a worse result. We will not continue pruning other leaves as it will not lead to a better result.
Therefore the best cost is 6, and the edit script contains two subscripts:
From T1: p(d), fc(cabefgh), c(ghabcef)
From T2: p(d), fc(fgabceh), c(fgabceh)
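The arithmetic of the four paths explored above can be re-checked with a few lines. The sketch assumes unit costs (cost_p = cost_c = cost_fc = 1), which is what the figures quoted in the example imply; the function name `path_cost` is hypothetical.

```python
def path_cost(pruned_leaves, forced_contractions, rf_remaining,
              cost_p=1, cost_c=1):
    """Cost of one search path: each pruned leaf is removed from both trees,
    forced contractions cost the same as contractions, and the remaining
    differing splits (RF distance) are contracted."""
    cost_fc = cost_c
    return (2 * cost_p * pruned_leaves
            + cost_fc * forced_contractions
            + cost_c * rf_remaining)


paths = {
    "no pruning": path_cost(0, 0, 10),
    "prune a": path_cost(1, 2, 8),
    "prune d": path_cost(1, 2, 2),
    "prune d, g": path_cost(2, 4, 0),
}
best_path = min(paths, key=paths.get)
```

The minimum is reached by pruning d only, in agreement with the best cost of 6 stated above.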
Tree Edit Consensus Tree
Similarly, we may define a new consensus method on the basis of editing operations called the Tree Edit Consensus Tree. The Tree Edit Consensus Tree is the maximal (with respect to leaves and edges) common subtree of the input trees, obtained by contraction and pruning operations and is defined as follows:
Definition [Tree edit consensus tree (PCConsensus tree)]: Given positive costs of the contraction and pruning operations, the tree edit consensus tree for leaflabelled trees T1 ... Tn is the tree obtained by applying the minimal cost edit script that unifies these trees. For trees T1 and T2 in Figure 12, the tree edit consensus tree is TC = T1" = T2". Like MAST, the tree edit consensus tree is not unique. The experimental assessment of this method was done in the PhD thesis [11] and will not be repeated here.
Tree Edit Consensus Algorithm
As with the edit distance, the algorithm relies on two facts: if a pruning operation is used, it must be used on all input trees, and contraction is performed only for splits that do not occur in all input trees. The naive dynamic programming algorithm which computes the score of the tree edit consensus may be defined as follows:
(12)
(13)
where SCT stands for the strict consensus tree, TEC is the Tree Edit Consensus, k is the number of forced contractions that need to be performed and cost_{fc} is equal to cost_{c}.
The tree edit consensus tree may be obtained by recording the pruning operations used along the optimum path. The recorded prunings must be applied to the input trees, and afterwards all unmatching edges must be contracted (which yields the strict consensus tree).
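A possible shape of the score computation is sketched below. It is an assumption-laden illustration rather than the algorithm of the paper: all input trees are taken to be on the same leafset and given as sets of non-trivial splits, the contraction branch charges every split that is missing from the strict consensus (the intersection of the split sets), and the pruning branch charges the pruning of one leaf in every tree plus the forced contractions. The names `split`, `prune` and `tec_score` are hypothetical.

```python
from functools import lru_cache


def split(side_a, side_b):
    """A split as an unordered pair of leaf sets (single-letter leaf names)."""
    return frozenset({frozenset(side_a), frozenset(side_b)})


def prune(splits, leaves):
    """Restrict splits to `leaves`; trivial and duplicate splits disappear."""
    kept = set()
    for s in splits:
        a, b = (side & leaves for side in s)
        if len(a) >= 2 and len(b) >= 2:
            kept.add(frozenset({a, b}))
    return frozenset(kept)


def tec_score(trees, leaves, cost_c=1, cost_p=1):
    """Cost of the cheapest edit script that unifies all input trees."""
    cost_fc = cost_c
    n = len(trees)

    @lru_cache(maxsize=None)
    def score(ts, live):
        # Contract every split that is absent from the strict consensus.
        common = frozenset.intersection(*ts)
        best = cost_c * sum(len(t - common) for t in ts)
        if best <= n * cost_p:  # pruning a leaf costs at least n * cost_p
            return best
        # Or prune one leaf from all trees and recurse.
        for leaf in live:
            rest = live - {leaf}
            pruned = tuple(prune(t, rest) for t in ts)
            k = sum(len(t) - len(p) for t, p in zip(ts, pruned))
            best = min(best, n * cost_p + k * cost_fc + score(pruned, rest))
        return best

    return score(tuple(frozenset(t) for t in trees), frozenset(leaves))


t1 = [split("xa", "bcdef"), split("xab", "cdef"),
      split("xabc", "def"), split("xabcd", "ef")]
t2 = [split("ab", "cdefx"), split("abc", "defx"),
      split("abcd", "efx"), split("abcde", "fx")]
score_two = tec_score([t1, t2], "xabcdef")
score_three = tec_score([t1, t2, t1], "xabcdef")
```

Recording the leaf chosen at each minimum would give the prunings to apply before taking the strict consensus, as described above.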
Quality of similarity measures
The quality of similarity measures is not easy to estimate. The best possible method would be one based on external criteria, i.e. on expert knowledge. In biological applications, it could be a comparison of the consensus tree with the true phylogenetic tree. The true tree, however, is not known. We agree with the opinion presented in [4] that similarity measures are especially difficult to score, as the judgement of what is similar and what is not is very subjective. Such a subjective approach to scoring a distance requires that someone arbitrarily selects the best distance matrix, which is then compared with the distance matrix obtained with a given similarity measure. A more objective method would be to prove that a given measure is best in some particular applications or may be used to solve some particular problems, for example the identification of the ancestral paralog position in the paralog families mentioned in [12].
The methods that can be applied to measure the quality of distance measures and consensus techniques can be roughly divided into:
qualitative methods which try to define properties that the given consensus method or similarity measure must meet as in [7]
quantitative methods which try to measure the quality of consensus or similarity methods such as [13]. They are often based on some assumptions due to the lack of verified domain knowledge
statistical methods which display the statistical properties of the given method to help an expert score the method instead of scoring it automatically, because the quality of a metric may depend on the application.
In an axiomatic approach, the most common requirement for the similarity measure is that it meets metric properties, or at least pseudometric ones.
The quantitative approach is not very suitable for distance measures due to the lack of objective criteria. Even if we are supplied with biological data which contain groups of trees, and may compute for example the proportion of inner-group distances to between-group distances, such an approach is not very trustworthy. This is because we see the effect of the distance measure only on selected sets, and this effect may be different for different parts of the treespace. We also ignore some potential properties of the distance, for example that it may be better for some topologies of trees but worse for others, an observation which could give hints on where to use it and where not. Simply put, the quality of a metric may depend on the application.
Apart from proving the metric properties of some distances, we choose the statistical approach as described in [4] and perform two kinds of experiment:
Analysis of distance probability distribution
Analysis of distance dynamics with respect to number of changes in trees.
In the first approach, we compute the distance for a large number of unrooted trees randomly generated according to different distributions and examine the probability distribution P(d(T1,T2) = k). The details of random generation of rooted trees can be found in [14]; this method can be adapted for unrooted trees. We try to determine:
whether the distance distribution has any regularities or follows any known distribution, which shows that the distance does not work in a random fashion
whether the distance is sufficiently discriminative (has a large number of values), and whether the discrimination property is equally strong for similar and different trees.
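The quantities examined in the first experiment can be computed from a list of observed distances as sketched below. The sample values are made up for illustration only and do not come from the experiments; the function name `empirical_distribution` is hypothetical.

```python
from collections import Counter


def empirical_distribution(distances):
    """Estimate P(d(T1, T2) = k) as the relative frequency of each value k."""
    counts = Counter(distances)
    total = len(distances)
    return {k: counts[k] / total for k in sorted(counts)}


# Made-up distances for eight hypothetical tree pairs.
sample = [2, 4, 4, 6, 6, 6, 6, 4]
distribution = empirical_distribution(sample)

# The number of distinct values is a simple proxy for discriminative power.
unique_values = len(distribution)
```

Plotting `distribution` for each measure gives the kind of chart discussed in the Results section, and `unique_values` corresponds to the counts of unique values reported there.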
The other approach is to mutate the random tree with different mutation operations and see how the distance changes.
More details about the experiments will be provided in the Results section.
Results and Discussion
In this section, an experimental evaluation of the proposed methods is presented. For the purpose of the experiments, we use randomly generated trees with different distributions and we evaluate the statistical properties of the similarity measures as described previously in this paper. Of our proposals, we decided to evaluate the Tree Edit Distance, the No Forced Contraction Similarity Measure (called NFC here) and the FS-based similarity measure.
For comparison with existing distance measures we have chosen the RF and MAST distances. RF is one of the most popular and computationally efficient distances, and MAST and RF are in a way the foundations of the Edit measures presented in this paper. We decided not to normalise the values of the distances because normalisation is sometimes not obvious (as in the case of the Edit distance). Normalisation is not necessary in the first experiment as we study the distribution rather than absolute values. In the second experiment, the lack of normalisation does not prevent us from observing the dynamics; it only prevents us from spotting the crossing points of the distances. The approach of normalising by the maximum observed value, as used in the literature, in our opinion distorts the results, because if the real maximum value is not reached then the graph is distorted. The only modifications are made to the FS dissimilarity measure, i.e. its values are scaled and shifted so that it can be compared with the other distances on the same chart.
In this experiment, trees with 8 leaves are presented, however tests were also performed with trees with up to 17 leaves for unconstrained trees and 12 leaves for binary trees, with similar results being obtained.
Distribution of Distance Probability
For this test, 1000 pairs of trees with 8 leaves were generated and the distribution of probability P(d(T1, T2) = k) was examined under different tree generation models, as mentioned previously in this paper. As the Edit Distance and its version with no penalty for forced contraction are parametrizable, various pruning and contraction operations costs were used. In the following experiments, the edit distance with cost of contraction equal to x, and cost of pruning equal to y is denoted by E(x, y), the version with no cost for forced contraction is denoted by NFC(x, y).
Unrooted binary leaflabelled trees on the same leafset
First consider Figure 17, which shows the RF, MAST and Edit distances with the costs of pruning and contraction equal to 1. It appears that each of these distances took only 4 unique values on this dataset, and only 3 of them are frequent enough to be visible in the figure. This leads to the conclusion that they are not very discriminative, as the total number of unrooted binary trees with 8 leaves is ≈ 10^{4}. The Edit distance and the RF distance behave identically here: the number of occurrences of a particular distance value increases asymptotically with the value, which means that these distances are good only for similar trees. MAST, on the other hand, is also not very discriminative, but its distribution is more reminiscent of the normal distribution.
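The order of magnitude quoted above follows from the standard count of unrooted binary leaf-labelled trees, (2n - 5)!! for n leaves, which is not stated in the text and is used here as background. A short check (the function name is hypothetical):

```python
def num_unrooted_binary_trees(n):
    """Number of unrooted binary leaf-labelled trees on n >= 3 leaves,
    computed as the double factorial (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5)."""
    count = 1
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count


trees_with_8_leaves = num_unrooted_binary_trees(8)  # 11!! = 10395
```

With roughly ten thousand possible trees, a distance that takes only 4 values necessarily maps very many different tree pairs to the same value.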
Figure 18 shows that the E(1,1), E(1,2), NFC(1,1), NFC(1,2), NFC(2,1) distributions are similar or identical. Due to the fact that E(1,1) is identical to RF for these data, they won't be discussed more here.
The distances E(2,1) and E(3,1) are significantly different, especially E(3,1) which is compared to MAST and RF in Figure 19. The E(3,1) distribution is similar in shape to MAST so it can be used both for similar and distant trees, however it has a wider range of values (11 unique values for E(3,1) as compared to 4 for MAST.)
Conclusion: The first conclusion is that by modifying the costs of the Edit distance, we can achieve a measure with very well-behaved properties: very discriminative and suitable both for similar and dissimilar trees. Moreover, the similarity of the Edit, RF and MAST distributions shows that the distance is not accidental.
The FS similarity measure is the hardest to interpret (see Figure 20). It also has a wide range of values (19 unique values), so it is discriminative; however, the shape of the distribution is very irregular. If we merge the low peaks with the neighbouring high peaks, we obtain something similar to the RF distance, i.e. frequency increasing with increasing distance value. The conclusion is that it is discriminative but works better for similar trees than for distant trees. It is worth remembering that this measure is certainly not a metric, which may affect its properties.
Unrooted unconstrained leaflabelled trees on the same leafset
This distribution leads to similar observations and conclusions. The E(1,1), E(1,2), NFC(1,1), NFC(1,2), NFC(2,1) distributions are similar or identical (the figure has been omitted). E(1,1) and RF are again similar; however, the distribution does not rise asymptotically with increasing distance value (Figure 21). Both E(3,1) (Figure 22) and FS (Figure 23) look better than MAST and RF as they take more values (E(3,1): 24, FS: 27, versus RF: 8 and MAST: 4), which makes them more discriminative.
Unrooted leaflabelled trees on a free leafset
In this experiment trees with at most 8 leaves were generated. The binary and unconstrained versions are discussed together as they differ only in the RF distance. The RF distribution in this experiment does not resemble the typical RF distribution. The main reason is that RF is unsuitable for comparing trees with different leafsets, as it will always return the maximum value, which also depends on the number of leaves of the trees. Therefore the distribution reflects the conditional probability of selecting two trees with the same leafset (left part of the graph) and trees with different leafsets (right part of the graph) in Figure 24 (binary) and Figure 25 (unconstrained). As for the other distances, both E(3,1) and FS behave similarly, having a wide range of values, while E(3,1) is more regular (see Figure 26).
Conclusion: To summarise the key points of this experiment:
The RF distance is not very discriminative for binary trees, it is also weak for distant trees. It is not suitable for trees with different leafsets.
The MAST distance is good for the same and different leafset, and is good both for distant and similar trees, however it is only weakly discriminative.
The Edit distance, in the variant E(3,1) where the cost of contraction = 3 and the cost of pruning = 1, looks very promising as it has a wide range of values and is equally good for distant and similar trees.
The FS dissimilarity measure is similar to the Edit distance, but it does not have a very regular distribution.
NFC here behaves like E(1,1) i.e. it is equivalent to RF for the same leafset and equivalent to MAST for different leafsets, which is good. However it is still only very weakly discriminative.
Dynamics of Distances
For this test, one tree is randomly generated and then the second tree is obtained with k mutation operations. Here, we observe the dynamics of distance changes with respect to number and type of mutations. Due to the nature of most of the examined distances i.e. Edit Distance, No Forced Contraction Similarity Measure, MAST and RF, we use the following types of mutation:
Contraction: we remove a randomly selected split
Pruning: we remove a randomly selected leaf
Nearest Non-Brother Interchange (NNBI).
Nearest Non-Brother Interchange (NNBI) is a modification of the NNI operation [15]. We choose the nearest leaves that are not brothers and interchange them, as shown in Figure 27. The motivation for this type of mutation is that we wanted a modification of a tree such that both pruning and contraction can be used to level the changes made by the operation. Direct use of both C and P in the mutation process leads to a situation where the number of leaves changes, and therefore the RF distance is hard to compare with. The NNBI operation can be levelled with either one contraction or one pruning operation if the cousins are 3 edges away from each other. The resulting input trees are on the same leafset, so the RF distance can also be included in the experiments, which is exactly what we wanted to achieve. For the trees shown in Figure 27, it is possible to either prune leaf f (or d) from both trees or contract splits efabcd and edabcf to make the trees identical.
To analyse the results, let us first look at the distances computed with respect to the contraction operation (Figure 28). All distances have similar linear dynamics and could simply be scaled to behave identically on these data. It can be seen that all distances that have a cost of contraction equal to 1 are identical. NFC(2,1) was not identical but very similar, so it is illustrated with the same line. The distances with a cost of c = 1 and NFC(2,1) scale the most naturally, as the distance is simply equal to the number of mutations, i.e. it is directly proportional to the number of mutations with proportionality constant k = 1. Increasing the cost of contraction makes the edit distances increase quickly.
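The linear behaviour under contraction mutations can be reproduced on a toy tree. The sketch below is an illustration only: it uses a fixed caterpillar tree with 8 leaves instead of the randomly generated trees of the experiment, represents a tree as its set of non-trivial splits, and uses hypothetical helper names.

```python
import random


def split(side_a, side_b):
    """A split as an unordered pair of leaf sets (single-letter leaf names)."""
    return frozenset({frozenset(side_a), frozenset(side_b)})


def rf_distance(splits1, splits2):
    """RF distance on the same leafset: size of the symmetric difference."""
    return len(splits1 ^ splits2)


def contract_random(splits, k, rng):
    """Contraction mutation: remove k randomly selected splits."""
    removed = rng.sample(list(splits), k)
    return frozenset(splits) - set(removed)


leaves = "abcdefgh"
# Caterpillar tree: ab|cdefgh, abc|defgh, ..., abcdef|gh (5 non-trivial splits).
tree = frozenset(split(leaves[:i], leaves[i:]) for i in range(2, 7))

rng = random.Random(0)
rf_after_k = [rf_distance(tree, contract_random(tree, k, rng))
              for k in range(len(tree) + 1)]
```

Each contraction removes exactly one split that the original tree still has, so the RF distance grows by one per mutation regardless of which splits are drawn.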
For a pruning mutation, the situation looks very similar (see Figure 29), i.e. all distances would have identical values if scaled; however, a few things should be pointed out. The RF distance here gets smaller with an increasing number of p operations. This is because it does not work for trees with different leafsets: in such a case it returns the maximum possible value, which is lower for a smaller leafset, and this is exactly what is illustrated in the figure. Another point is that while RF was the distance best scaled for contractions, MAST is the distance best scaled for pruning. This is natural because RF uses contraction while MAST uses pruning. So if we consider not just pruning or just contraction, but wish to use both, then the distances do not have the same dynamics. What is seen here is that NFC(2,1) scales best for both contraction and pruning. This can be visualised better when we see the reaction of the distances to contraction and pruning on the same chart. In Figure 30, we can see the reaction of RF, MAST and NFC(2,1) with respect to contraction, pruning, and Nearest Non-Brother Interchange (i.e. an operation that is neither contraction nor pruning but whose effect can be levelled by either of these operations). We can see that only the NFC dynamics are similar irrespective of the type of mutation used.
Conclusions
In this paper we have proposed a new technique for measuring the distance between leaf labelled trees on a free leafset, and provided its evaluation with respect to the frequent subsplit based method and other measures. The tree edit distance was proven to be a metric and has the advantage of using different costs for contraction and pruning, therefore its properties can be tuned depending on the needs of the user. It is difficult to pick the best distance measure as they all have different interesting properties and may be used in different applications. Two of the presented methods carry the most interesting properties. E(3,1) is very discriminative (having a wide range of values) and has a very regular distance distribution which is similar in shape to a normal distribution and is good both for similar and non-similar trees. NFC(2,1), on the other hand, is proportional or nearly proportional to the number of mutation operations used, irrespective of their type. Future work will be dedicated to discovering a more efficient algorithm for the tree edit distance and to a deep experimental evaluation of the tree edit consensus method for leaflabelled trees on the same leafset.
Declarations
Authors’ Affiliations
(1)
Institute of Computer Science, Warsaw University of Technology
References
Koperwas J, Walczak K: Clustering of leaf labeled trees on free leafset. In RSEISP. Springer, Heidelberg; 2007:736–745. LNAI 4585
Koperwas J, Walczak K: Frequent Subsplit Representation of Leaf-Labelled Trees. In EvoBIO. Springer, Heidelberg; 2008:95–105. LNCS 4973
Koperwas J, Walczak K: Phylogenetic Trees Dissimilarity Measure Based on Strict Frequent Splits Set and Its Application for Clustering. In RSKT. Springer, Heidelberg; 2008:604–611. LNAI 5009
Steel M, Penny D: Distributions of tree comparison metrics - some new results. Syst Biol 1993, 42:126.
Finden CR, Gordon AD: Obtaining common pruned trees. J Classification 1985, 2:255–276.
Bryant D: Building trees, hunting for trees, and comparing trees: Theory And Methods In Phylogenetic Analysis. In Ph.D Thesis. University of Canterbury; 1997.
McMorris FR, Meronk DB, Neumann DA: A view of some consensus methods for trees. In Numerical Taxonomy. Edited by: Felsenstein J. Springer-Verlag; 1983:122–125.
Bille P: Tree Edit Distance, Alignment Distance and Inclusion. Technical report TR2003–23, IT University Technical Report Series 2003.
Klein P: Computing the Edit-Distance between Unrooted Ordered Trees. Proc. 6th Annual European Symp. Algorithms, 24–26 Aug. 1998, 91–102.
Koperwas J: Clustering Techniques of Leaf-Labelled Trees and Their Applications. In Ph.D Thesis. Warsaw University of Technology, Warsaw; 2009.
Bolikowski L, Gambin A: New Metrics for Phylogenies. Fundamenta Informaticae 2007, 78:1–18.
Berger-Wolf TY, Williams TL, Moret BE, Warnow TJ: An experimental evaluation of phylogenetic consensus methods. Technical Report TRCS2003–19, Department of Computer Science, University of New Mexico 2003.
Oden NL, Shao KT: An algorithm to equiprobably generate all directed trees with labeled terminal nodes and unlabeled interior nodes. Bull Mathematical Biology 1984, 46(3):379–387.
Robinson DF: Comparison of labeled trees with valency three. J Combinatorial Theory, Series B 1971, 11:105–119.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.