A scalable method for identifying frequent subtrees in sets of large phylogenetic trees
 Avinash Ramu^{1},
 Tamer Kahveci^{2}Email author and
 J Gordon Burleigh^{3}
DOI: 10.1186/1471210513256
© Ramu et al.; licensee BioMed Central Ltd. 2012
Received: 5 February 2012
Accepted: 5 September 2012
Published: 3 October 2012
Abstract
Background
We consider the problem of finding the maximum frequent agreement subtrees (MFASTs) in a collection of phylogenetic trees. Existing methods for this problem often do not scale beyond datasets with around 100 taxa. Our goal is to address this problem for datasets with over a thousand taxa and hundreds of trees.
Results
We develop a heuristic solution that aims to find MFASTs in sets of many, large phylogenetic trees. Our method works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees which serve as the seeds of larger subtrees. In the second phase, it combines these small seeds to build larger candidate MFASTs. In the final phase, it performs a postprocessing step that ensures that we find a frequent agreement subtree that is not contained in a larger frequent agreement subtree. We demonstrate that this heuristic can easily handle data sets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods.
Conclusions
Although this heuristic does not guarantee to find all MFASTs or the largest MFAST, it found the MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on large empirical data sets. Its performance is robust to the number and size of the input trees. Overall, this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses.
Keywords
Phylogenetic trees Frequent subtreeBackground
Alternate approaches have been proposed to reveal highly supported subtrees. The maximum agreement subtree (MAST) problem seeks the largest subtree that is present in all members of a given collection of trees [8]. For example, in Figure 1 the MAST includes taxa A, B, C, and D. Finding the MAST is an NPhard problem [9], although efficient algorithms exist to compute the MAST in some cases (e.g., [9–17]). In practice, since any difference in any single tree will reduce the size of the MAST, the MAST is often quite small, limiting it usefulness.
A less restrictive problem is to find frequent agreement subtrees (FAST), or subtrees that are found in many, but not necessarily all, of the input trees (see [18]). In this problem, a subtree is declared as frequent if it is in at least as many trees as a user supplied frequency threshold. Several algorithmic approaches have been suggested to identify FASTs, and specifically the maximum FASTs (MFASTs), or FASTs that contain the largest number of taxa. A variant of this problem seeks the maximal FASTs, i.e., FASTS that are not contained in any other FASTs. Notice that an MFAST is a maximal FAST, however, the inverse is not necessarily true. Zhang and Wang defined algorithms, implemented in Phylominer, to identify FASTs from a collection of phylogenetic trees [19, 20]. These algorithms are guaranteed to find all FASTs but they may be prohibitively slow for data sets larger than 20 taxa. Cranston and Rannala implemented MetropolisHastings and Threshold Accepting searches to identify large FASTs from a Bayesian posterior distribution of phylogenetic trees [21]. This approach can handle thousands of input trees but it may not be feasible if the trees have more than 100 taxa [21].
Another approach to reveal highly supported subtrees from a collection of trees is to identify and remove rogue taxa, or taxa whose position in the input trees is least consistent. Recently, several methods have been developed that can identify and remove rogue taxa from collections of trees with thousands of taxa [22–24]. However, unlike MAST or FAST approaches, they do not provide guarantees about the support for the remaining taxa.
In this paper, we describe a heuristic approach for identifying MFASTs in collections of trees. Unlike previous methods, our method easily scales to datasets with over a thousand taxa and hundreds of trees. Towards this goal, we develop a heuristic solution that works in multiple phases. In the first phase, it identifies small candidate subtrees from the set of input trees which serve as the seeds of larger subtrees. In the second phase, it combines these seeds to build larger candidate MFASTs. In the final phase, it performs a post processing step. This step ensures that the size (i.e., number of taxa) of the FAST found can not be increased further by adding a new taxon without reducing its frequency below a user supplied frequency threshold. We demonstrate that this heuristic can easily handle data sets with 1000 taxa. We test the effectiveness of these approaches on simulated data sets and then demonstrate its performance on large, empirical data sets. Although our heuristic does not guarantee to find all MFASTs or the largest MFAST in theory, it found the true MFAST in all of our synthetic datasets where we could verify the correctness of the result. It also performed well on the empirical data sets. Its performance is robust with respect to the number of input trees and the size of the input trees.
Methods
In this section we describe our method that aims to find Maximum Frequent Agreement SubTrees (MFASTs) in a given set of m phylogenetic trees $\mathcal{T}$= {T_{1}, T_{2}, …, T_{ m }}. Our method follows from the observation that an MFAST is present in a large number of trees in $\mathcal{T}$. The method builds MFASTs bottom up from small subtrees of taxa in the trees in $\mathcal{T}$. Briefly, it works in three phases.

Phase 1. Seed generation (Section “Phase one: Seed generation”).In the first phase, we identify small subtrees from the input trees that have a potential to be a part of an MFAST. We call each such subtree a seed.

Phase 2. Seed combination (Section “Phase two: Seed combination”).In the second phase, we construct an initial FAST by combining the seeds found in the first phase.

Phase 3. Post processing (Section “Phase three: Postprocessing”).In the third phase, we grow the FAST further to obtain the maximal FAST that contains it by individually considering the taxa which are not already in the FAST. We report the resulting maximal FAST as a possible MFAST.
First, we present the the basic definitions needed for this paper in Section “Preliminaries and notation”. We then discuss each of the three phases above in detail.
Preliminaries and notation
In this section, we present the key definitions and notations needed to understand the rest of the paper. We describe our method using rooted and bifurcating phylogenetic trees. However, our method and definitions can easily be applied to unrooted or multifurcating trees with minor or no modifications. Also, we assume that all the taxa are placed at the leaf level nodes of the phylogenetic tree, and all the internal nodes are inferred ancestors. Figure 2(a) shows a sample phylogenetic tree built on five taxa. We define the size of a tree as the number of taxa in that tree. We start by defining key terms.
Definition 1 (Clade)
Let T be a phylogenetic tree. Given an internal node of T, we define the set of all nodes and edges of T contained under that node as the clade rooted at that node.
Each internal node of a phylogenetic tree corresponds to a clade of that tree. Figure 2(b) depicts the clade of the tree in Figure 2(a) rooted at x_{1}.
Definition 2 (Contraction)
Let T be a phylogenetic tree with n taxa. The contraction operation transforms T into a tree with n−1 taxa by removing a given taxon in T along with the edge that connects that taxon to T.
The contraction operation can extract the clades of a tree by removing all the taxa that are not a part of that clade. It can also extract parts of the tree that are not necessarily clades. We use the term subtree to denote a tree that is obtained by applying contractions to arbitrary set of taxa in a given tree. Formal definition is as follows.
Definition 3 (Subtree)
Let T and T’ be two phylogenetic trees. We say that T’ is a subtree of T if T can be transformed into T’ by applying a series of contractions on T.
If a tree T’ is a subtree of another tree T, we say that T’ is present in T. Notice that a clade is always a subtree, but the inverse is not true all the time. Figures 2(b) and 2(c) illustrate two subtrees of the tree in Figure 2(a). Let us denote the number of combinations of k taxa from a set of n taxa with $\left(\genfrac{}{}{0ex}{}{n}{k}\right)$. In general, if a tree has n taxa, then that tree contains $\left(\genfrac{}{}{0ex}{}{n}{k}\right)$ subtrees with k taxa. As a consequence, that tree contains 2^{ n } − 1 subtrees of any size including itself.
Definition 4 (Frequency)
Definition 5 (FAST)
Let $\mathcal{T}$ = {T_{1}, T_{2}, … , T_{ m }} be a set of m phylogenetic trees and T be a phylogenetic tree. Let γ be a number in [0, 1] interval that denotes frequency cutoff. We say that T is a Frequent Agreement SubTree (FAST) of $\mathcal{T}$ if its frequency in $\mathcal{T}$ is at least γ (i.e., $\mathit{\text{freq}}(T,\mathcal{T})\ge \gamma $).
We say that a FAST is maximal if there is no other FAST that contains all the taxa in that FAST. Clearly, larger FASTs indicate biologically more relevant consensus patterns. The following definition summarizes this.
Definition 6 (MFAST)
Let $\mathcal{T}$ = {T_{1}, T_{2}, …, T_{ m }} be a set of m phylogenetic trees. Let γ be a number in [0, 1] interval that denotes frequency cutoff. A FAST T of $\mathcal{T}$ is a Maximum Frequent Agreement SubTree (MFAST) of $\mathcal{T}$ if there is no other FAST T’ of $\mathcal{T}$ that has a larger size than T.
Commonly used variables and functions in this paper
$\mathcal{T}$  A set of phylogenetic trees 
T _{ i }  i th tree 
m  Number of trees in $\mathcal{T}$ 
n  Number of taxa in each input tree 
a _{ i }  i th taxa 
$\mathit{\text{freq}}(T,\mathcal{T})$  Frequency of the subtree T in $\mathcal{T}$ 
γ  Frequency cutoff 
S _{ i }  i th seed (each seed is a subtree of a tree in $\mathcal{T}$) 
k  Size of a seed 
c  Number of contractions used to create a seed 
Phase one: Seed generation
 1.
Seed size ( k ) is the number of taxa in the seed.
 2.
Number of contractions (c) is the number of taxa we prune from a clade taken from an input tree in order to extract the seed.
 3.
Frequency ( f ) is the fraction of input trees in which the seed is present.
The last feature denotes the number of trees in $\mathcal{T}$ in which the seed is present. For example in Figure 4, there are nine seeds S_{1}, S_{2}, …, S_{9} extracted from the three input trees using only one contraction. Among these, the frequency of S_{1} is 1 as it is present in all the trees. Frequency of S_{2} is about 0.67 for it is present in only two out of three trees (T_{1} and T_{2}). The frequency of the rest of the seeds is only about 0.33. Recall that, by definition, an MFAST is present in at least a fraction γ of the trees in $\mathcal{T}$. Therefore, we consider only the seeds whose frequency values are equal to or greater than this number ( i.e. , f ≥ γ).
Given the values of k, c and γ, we extract all the seeds which possess the desired feature values from the set of input trees as follows. In the newick string representation of a tree, a pair of matching parentheses corresponds to an internal node in the tree. The number of taxa in the clade rooted at this internal node is given by the number of labels between the two matching parentheses. Following from this observation, we scan the newick string of each tree one by one. For each such tree, we identify the clades which have k + c taxa. Notice that, if a tree contains n taxa, then it contains at most $\frac{n}{k+c}$ clades of size k + c as no two such clades can contain common taxa. We then extract all combinations of k taxa from each of these clades by contracting the remaining c taxa. The number of ways this can be done is $\left(\genfrac{}{}{0ex}{}{k+c}{c}\right)$. Notice that all the small trees extracted this way possess the first two characteristics explained above. At this point, we however do not know their frequencies. Therefore, we call them potential seeds. It is worth mentioning that the same seed might be extracted from different trees. As we extract a new potential seed, before storing it in the list of potential seeds, we check if it is already present there. We include it in the potential seed list only if it does not exist there yet. Otherwise, we ignore it. This way, we maintain only one copy of each seed.
Once we build our potential seed list for all the trees in $\mathcal{T}$, we go over them one by one and count their frequency in $\mathcal{T}$ as the fraction of trees that contain them. We filter all the potential seeds whose frequencies are less than the frequency cutoff. We keep the remaining ones as the list of seeds along with the frequency of each seed.
In Figure 4, consider the tree T_{1} that has four taxa. For k = 3 and c = 1, there is only one clade of size k + c = 4 which is the tree T_{1} itself. We extract four potential seeds, each having three leaves from this tree. The potential seeds in this figure are given by S_{1}, S_{2}, S_{5} and S_{7} which we extract by contracting a_{4}, a_{3}, a_{2} and a_{1} respectively from T_{1}.
Phase two: Seed combination
At the end of the first phase, we obtain a set of frequent seeds from the input trees. Notice that each seed is a FAST as each seed is present in sufficient number of trees specified by γ. These seeds are the basic building blocks of our method. In the second phase of our method, we combine subsets of these seeds to construct larger FASTs.
We first define what it means to combine two seeds. In order to combine two seeds, it is a necessary condition that both seeds are present in at least one common tree T in $\mathcal{T}$. We call such a tree T as the reference tree. We combine two seeds with the guidance of a reference tree. Let S_{1} and S_{2} be two seeds and let T be their reference tree. Let L_{1}, L_{2} and L be the set of taxa in S_{1}, S_{2} and T respectively. Combining S_{1} and S_{2} results in the tree that is equivalent to the one obtained by contracting the taxa in L − (L_{1} ∪ L_{2}) from T. For simplicity, we will denote the combine operation using T as the reference network with the ⊕_{ T } symbol. For instance we denote combining S_{1} and S_{2} with T being the reference tree as S_{1} ⊕ _{ T }S_{2}. To simplify our notation, whenever the identity of the reference tree is irrelevant, we will use the symbol ⊕ instead of ⊕_{ T }.
So far, we have explained how to combine two seeds S_{1} and S_{2} using a reference tree. It is possible that many trees in $\mathcal{T}$ have both seeds present in them. Thus, one question is which of these trees should we use as the reference tree to combine the two seeds? The brief answer is that all such trees need to be considered. However, we make several observations that helps us avoid combining S_{1} and S_{2} using each such reference tree one by one exhaustively without ignoring any of such trees. We explain them next.
Consider two trees T_{1} and T_{2} from $\mathcal{T}$ where both seeds are present in. There are two cases for T_{1} and T_{2}.

Case 1: S_{1} ⊕T_{1}S_{2} = S_{1} ⊕T_{2}S_{2}. In this case, it does not matter whether we use T_{1} or T_{2} as the reference tree. They will both lead to the same combined subtree. Thus, we use only one.

Case 2: S_{1} ⊕T_{1}S_{2} ≠ S_{1} ⊕T_{2}S_{2}. In this case, the trees T_{1} and T_{2} lead to alternative combination topologies. So, we consider both of them separately.
We utilize the observations above as follows. We start by picking one reference tree arbitrarily. Once we create a combined subtree using that tree, we check whether that subtree is present in the remaining trees in $\mathcal{T}$. We mark those trees that contain it as considered for reference tree and never use them as reference for the same seed pair again. This is because those trees fall into the first case described above. This way, we also store the frequency of the combined subtree in $\mathcal{T}$. If the number of unmarked trees is too small (i.e., less than γ × m) then it means that even if all the remaining trees agree on the same combined topology for the two seeds under consideration, they are not sufficient to make it a FAST. Thus, we do not use any of the remaining trees as reference for those two seeds. Otherwise, we pick another unmarked tree arbitrarily and repeat the same process until we run out of reference trees.
The next question we need to answer is which seed pairs should we combine? To answer this question we first make the following proposition.
Proposition 1
Proof
□
 1.
Inorder Combination (Section “Inorder combination”).
 2.
Minimum Overlap Combination (Section “Minimum overlap combination”).
Both approaches accept the list of seeds computed in the first phase as input and produce a larger FAST that is a combination of multiple seeds. Both of them also assume that the list of input seeds are already sorted in decreasing order of their frequencies. We discuss these approaches next.
Inorder combination
The inorder combination approach follows from Proposition 1. It assumes that the seeds with higher frequencies have greater potential to be a part of an MFAST. It exploits this assumption as follows, first it picks a seed as the starting point to create a FAST. It then grows this seed by combining it with other seeds starting from the most frequent one as long as the frequency of the resulting tree remains at least as large as the given cutoff γ. It repeats this process by trying each seed as the starting point, Algorithm Algorithm 1 In order combination presents this approach.
Algorithm 1 In order combination
FAST ← ∅
for all seeds S_{ i } do
FAS T^{ ′ } ← S_{ i }
Mark S_{ i } as considered
repeat
S_{ j } ← seed with highest frequency among unconsidered seeds Mark S_{ j } as considered CUTOFF ← γ t _FAST^{ ′ } ← FAS T^{ ′ }
repeat
Pick the next unconsidered tree $T\in \mathcal{T}$ as reference
Mark all the trees as that contain FAS T^{ ′ } ⊕_{ T }S_{ j } as considered
if freq($\mathit{\text{FAS}}{T}^{\prime}{\oplus}_{T}{S}_{j},\mathcal{T})\ge $ CUTOFF then
t _FAS T^{ ′ } ← FAS T^{ ′ } ⊕_{ T }S_{ j }
CUTOFF ← freq($\mathit{\text{FAS}}{T}^{\prime}{\oplus}_{T}{S}_{j},\mathcal{T})$
end if
until Less than γ × m unmarked reference trees are left in $\mathcal{T}$
FAS T^{ ′ }←t _FAS T^{ ′ }
Unmark all trees in $\mathcal{T}$
until all seeds are considered
if size of FAS T^{ ′ } ≥ size of FAST then
FAST ← FAS T^{ ′ }
end if
Unmark all seeds
end for
In Algorithm Algorithm 1 In order combination we first initialize the FAST as empty. We then consider each seed one by one. We initialize a temporary subtree denoted by FAST’ with the seed S _{ i } under consideration and mark S_{ i } as considered. We combine the FAST’ with a seed S_{ j } which has the highest frequency amongst the seeds that have not been added. If multiple seeds have the highest frequency, we randomly pick one of them and mark that seed S _{ j } as added to the FAST’. There can be alternative ways to combine FAST’ with S_{ j } leading to different topologies. We use the trees in $\mathcal{T}$ that contain both FAST’ and S_{ j } as guides to try only the topologies that exist in $\mathcal{T}$. We stop constructing alternative topologies as soon as we ensure that there are not sufficient number of trees to yield frequency of γ. We set FAST’ to the combined seed if the combined seed has large enough frequency. We then consider the seed with the next highest frequency for addition and repeat this step till all S _{ j } have been considered. If the resulting temporary FAST is larger than FAST we replace the smaller FAST with the larger one. In the next iteration, we initialize the FAST with the next S _{ i } . Using this approach we can initialize the FAST with all S _{ i } , alternatively if the user wishes to limit the amount of time spent using a maximum time cutoff we stop the outermost loop (i.e., alternative initializations of FAST’) as soon as the allowed running time budget is reached.
Notice that in Algorithm 3 each seed S_{ i } can lead to a different FAST. We record only the FAST that has the largest size. However, it is trivial to maintain the top k FASTs with the largest size instead if the user is looking for k alternative maximal FASTs.
Minimum overlap combination
The purpose of combining seeds is to construct a FAST that is large in size. Our inorder combination approach (Section “Inorder combination”) aimed to maximize the frequency of the combined seeds. In this section, we develop our second approach, named Minimum Overlap Combination. This approach picks seeds so that their combination produced as large subtree as possible. We elaborate on this approach next.
When we combine two seeds, the size of the resulting tree becomes at least as big as the size of each of these seeds. Formally let S_{1} and S_{2} be two seeds (i.e., trees). Let L_{1} and L_{2} be the set of taxa combined in S_{1} and S_{2}. We denote the size of a set, say L_{1}, with L_{1}. The size of the tree resulting from combination of S_{1} and S_{2} is L_{1} + L_{2} − L_{1} ∩ L_{2}. For a given fixed seed size, the first two terms of this formulation remains unchanged regardless of the seed. The last term determines the growth in the size of the FAST. Thus, in order to grow the FAST rapidly, it is desirable to combine two frequent subtrees with a small number of common taxa.
Our second approach follows from the observation above. We introduce a criteria called the overlap between two subtrees as the number of taxa common between them. Our minimum overlap combination approach works the same as Algorithm Algorithm 1 In order combination with a minor difference in selecting the seed S_{ j } that will be combined with the current temporary FAST (i.e., FAST’). Rather than choosing the seed with the largest frequency, this approach chooses the one that has the least overlap with FAST’ among all the unconsidered and frequent seeds. If multiple seeds have the same smallest overlap, it considers the frequency as the tie breaker and chooses the one with the largest frequency among those.
Phase three: Postprocessing
So far we described how to obtain seeds (Section “Phase one: Seed generation”) and how to combine them to construct FAST (Section “Phase two: Seed combination”). The two approaches we developed for combining seeds aim to maximize the size of FAST. However, they do not ensure the maximality of the resulting FAST. There are two main reasons that prevent our seed combining algorithms from constructing maximal FAST. First, some of the taxa of a maximal FAST may not appear in any seed (i.e. false negatives). As a result no combination of seeds will lead to that maximal FAST. Second, even if all the taxa of a maximal FAST are parts of at least one seed, our algorithms will reject combining that seed with the FAST of the seeds if those seeds contain other taxa that are not part of the maximal FAST (i.e. false positives).
In the postprocessing phase, we tackle abovementioned problem. Algorithm 3 describes the post processing phase in detail. We do this by considering all taxa which are not already present in the FAST one by one. We iteratively grow the current FAST by including one more taxon at a time if the frequency of the resulting FAST remains at least as large as the frequency cutoff γ. We repeat these iterations until no new taxon can be included in the FAST. Thus the resulting FAST is guaranteed to be maximal.
Algorithm 2 Post processing
INPUT = FAST from the seed combination phase
INPUT = $\mathcal{T}$
OUTPUT = Maximal FAST
RESULT ← FAST
for all a_{ i } not in FAST do
CUTOFF ← γ
t_RESULT ← RESULT
repeat
Pick the next unconsidered tree $T\in \mathcal{T}$ as reference
RESULT’ ← RESULT ⊕_{ T }a_{ i }
Mark all the trees that contain RESULT’ as considered
if frequency of RESULT’ ≥ CUTOFF then t_RESULT ← RESULT’ CUTOFF ← frequency of RESULT’
end if
until Less than γ × m unmarked reference trees are left in $\mathcal{T}$
RESULT ← t_RESULT
Unmark all trees in $\mathcal{T}$
end for
return RESULT
We expect the post processing step to identify quickly the taxa that have a potential to be in an MFAST that might have not been considered during the seed generation and seed combination phases. At the end of the post processing step we obtain an MFAST.
Complexity analysis of our method
In this section we discuss the complexity of our method in terms of the three phases involved in it. Let $\mathcal{T}$ be a set of m phylogenetic trees having n leaves each. The complexity of the different phases of our method are as follows.
Phase one
Finding the seeds involves enumerating all the subtrees and checking their frequencies. Given seed size k and number of contractions c, each tree will contain at most $\frac{n}{k+c}$ clades each leading to $\left(\genfrac{}{}{0ex}{}{k+c}{c}\right)$ alternative subtrees. Thus, in total there can be up to $\frac{\mathrm{mn}}{k+c}\left(\genfrac{}{}{0ex}{}{k+c}{c}\right)$ seeds (possibly many of them identical) from all the trees in $\mathcal{T}$. Typically, the values of k and c are fixed and small (in our experiments we have k ∈ {3, 4, 5} and c ∈ {0, 1, 2, 3, 4, 5}) leading to O(mn) seeds.
The complexity of finding whether a seed is present in a single tree is O(n log n). Given that there are m trees in $\mathcal{T}$, the cost of computing the frequency of a single seed is O(mn log n). Thus, the time complexity for finding the frequency of all the seeds is this expression multiplied by the number of seeds, which is O(m^{2}n^{2} log n).
Phase two
Consider a set of p frequent seeds that will be considered for combining in this phase. Recall that we have two approaches to combine them. Below, we focus on each.
INORDER COMBINATION We try to combine each seed with every other seed leading to O(p^{2}) iterations. The complexity of checking the frequency of each combined subtree is O(mn logn). Also, there can be up to O(m) different reference trees for guiding the combine operation. Multiplying these terms, we obtain the complexity of phase using this approach as O(p^{2}m^{2}n log n).
MINIMUM OVERLAP COMBINATION The complexity of combining the frequent seeds using the minimum overlap combination approach is very similar to the inorder approach except for an additional term. The additional complexity is because we maintain the overlap between the subtrees. This leads to the complexity O(p^{2}n^{2} + p^{2}m^{2}n log n).
Phase three
Here, we consider the FAST obtained from each of the p frequent seeds in phase two. For each FAST, we sequentially go over each taxa one by one leading to O(n) iterations. There can be up to O( γ × m) references to add a taxon. So the cost of extending all p FASTs is O(γ × mnp).
respectively.
Results and discussion
This section evaluates the performance of our MFAST algorithm experimentally.
Implementation details
We implemented our MFAST algorithm using C and Perl. More specifically, we implemented the first two phases (seed generation and seed combination) in C and the third phase (post processing) in Perl. We utilize the functions provided in the newick Utilities [25] package by modifying the source code provided in that package. We use k ∈ {3, 4, 5} and c ∈ {0, 1, 2, 3, 4, 5} in all of our experiments unless otherwise stated. In our experiments, we observed that the minimum overlap combination produced larger MFASTs than the inorder combination approach. Therefore, we limit our experimental results to the minimum overlap combination approach.
Methods compared against
We have compared our method against Phylominer [20] and the MAST command implemented in PAUP* [26]. Among these, Phylominer also seeks MFASTs in a collection of trees. However, the time complexity of this method is exponential in the size of the input trees, and hence it becomes intractable for large trees. In our experiments, we observed that it does not scale beyond 50 taxa. PAUP* is primarily a program for phylogenetic inference, although it also can compute MASTs. MASTs have a strict 100% agreement criterion unlike the arbitrary frequency cutoff values γ in our method.
Evaluation Criteria
We evaluate our algorithm based on the size of the MFAST found. Larger MFASTs are preferable. When possible, we report the size of the optimal solution as well.
Test Environment
We ran our experiments on Linux servers equipped with dual AMD Opteron dual core processors running at 2.2 GHz and 3 GB of main memory to test the performance of our method.
Datasets
We test the performance and verify the results of our method on synthetic datasets and real datasets.

Synthetic dataset We built synthetic datasets in which we embedded an MFAST as described below. We characterize each synthetic dataset using five parameters. The first two parameters denote the size and number of trees in $\mathcal{T}$. MFAST frequency specifies the fraction of trees in $\mathcal{T}$ which contain an MFAST. MFAST size is the number of taxa in the embedded MFAST. The noise percentage is the percentage of taxa that is not a part of the embedded MFAST but is placed on the branches within the clade that contains the MFAST. We place all the other taxa on the branches outside this clade.Given an instantiation of these parameters, we first created a tree that has n’ taxa. This tree serves as the MFAST. We then created m × f trees that contain this MFAST. We build each of these trees by inserting n − n^{ ′ } taxa randomly in the MFAST. With probability ∊ we insert each taxa within the clade that contains MFAST. With probability 1 − ∊ we insert it outside that clade. We then created m − (m × f) trees that do not contain the current MFAST. We simply do this by inserting all the taxa one by one at a random location.
 1.
Tree size (n).
 2.
Number of trees (m).
 3.
MFAST frequency (f).
 4.
MFAST size (n’).
 5.
Noise percentage (∊).

Real datasets. We use two empirical datasets to evaluate the performance of our heuristic. The data sets contain 200 bootstrap trees generated from phylogenetic analysis of the Gymnosperm [27] and Saxifragales (Burleigh, unpublished) plant clades. To make the bootstrap trees, we assembled supermatrices, matrices of concatenated gene alignments with partial taxon overlap, from gene sequence data available in GenBank. We performed a maximum likelihood bootstrap analysis on each supermatrix using RAxML v. 7.0.4 [28]. The Gymnosperm trees each contain 959 taxa, and the Saxifragales trees each contain 950 taxa.
Effects of number of input trees
In our first experiment, we analyze how the number of input trees in $\mathcal{T}$ affects the performance of our algorithm. For this purpose, we created 30 synthetic datasets. The size of the embedded MFAST in all the datasets was 15. Among these 30 datasets, 10 contained 50 trees, 10 contained 100 trees and 10 contained 200 trees. We set the noise percentage to 20% in all the datasets. The frequency of the embedded MFAST was 0.8. We set the number of taxa in all the trees in these datasets to 100.
Evaluation of the effect of the number of trees in $\mathcal{T}$
Number of trees  MFAST size  

Before post processing  After post processing  
50  14.5  16.0 
100  15.3  15.8 
200  14.4  15.4 
Effects of tree size
Our second experiment considers the impact of the number of taxa in the input trees contained in $\mathcal{T}$ on the success of our method. To carry out this test, we built datasets with varying tree sizes (i.e., n). Particularly, we used n = 100, 250, 500 and 1000. For each value of n, we repeated the experiment 10 times by creating 10 datasets with the same properties. In all datasets, we set the number of trees to m = 100, the noise percentage at ∊ = 20%, the size of the embedded MFAST at 15% of n, and the MFAST frequency at 0.8.
Evaluation of the effect of the size of the trees in $\mathcal{T}$
Number of  MFAST size  

taxa  Embedded  Reported  
Before post  After post  
processing  processing  
100  15  15.3  15.8 
250  38  32.3  38.8 
500  75  43.7  76.0 
1000  150  69.8  151.0 
The results also suggest that our method identifies a significant percentage of the taxa in the embedded MFAST after the second phase (i.e., before postprocessing) when the tree size is small. As the tree size grows, it starts missing some taxa at this phase. It however recovers the missing taxa during the postprocessing phase even for the largest tree size. This indicates that at the end of phase two our method could identify a backbone of the actual MFAST. The unidentified taxa at this phase are scattered throughout the clades in the input trees. Thus, there is no clade of size k + c that contains them with c contractions for small k or c. As evident from Table 3, this however does not prevent our method from recovering them. This is because the backbone reported at the end of phase two is large enough, and thus specific enough, to recover the missing taxa one by one in the last phase. This is a significant observation as it demonstrates that our method works well even with small values of k and c.
Effects of noise percentage
Recall that the noise percentage ∊denotes the percentage of taxa that is added inside the clade that contains the MFAST. As ∊increases, the pairs of taxa in the MFAST get farther away from each other in the tree that contains it. As a result, fewer taxa from MFAST will be contained in small clades of size k + c. This raises the question whether our method works well as ∊increase and thus the MFAST taxa gets scattered around in the trees that contain it.
Evaluation of the effect of the noise in the trees in $\mathcal{T}$
Noise (%)  MFAST size  

Before post processing  After post processing  
20  15.3  15.8 
40  13.6  15.0 
60  12.7  15.0 
The results suggest that our method can identify the embedded MFAST successfully even when the noise percentage is very high. We observe that the size of the MFAST found by our method before post processing decreases slowly with increasing amount of noise. This is not surprising as the taxa contained in the embedded MFAST gets more spread out (and thus farther away from each other) in the trees in $\mathcal{T}$ with increasing noise. As a result, if there are taxa that are not part of any seed with the provided values of k and c, they will never be included in the computed MFAST at the end of phase two. We however observe that (i) only a small number of such taxa exists. For instance, even for the largest noise percentage (∊ = 60%), only 2.3 taxa (i.e., 15  12.7) are missing on the average. (ii) The missing taxa are recovered during phase three. This is because the computed MFAST at the end of phase two is very large, and thus it is specific to the embedded MFAST.
Impact of seed creation
So far, in our experiments we consistently observed two major points for all the parameter settings (see Sections “Effects of number of input trees” to “Effects of noise percentage”): (i) Our method always finds a large subtree of the embedded MFAST after phase two. (ii) Our method always recovers the entire embedded MFAST after phase three. The second observation can be explained from the first one that the outcome of phase two is large enough to build the entire MFAST precisely. The first observation however indicates that the set of seeds generated in phase one contain a significant percentage of the taxa in the embedded MFAST. In this section, we take a closer look into this phenomenon and explain why this is the case even for small values of seed size k and contraction amount c, and large noise percentage ∊. To do that, we will compute the probability that a subset of the taxa of the embedded MFAST appears in at least one seed generated in phase one. In our computation, we will assume that the taxa can appear at any location of a given tree with the same probability. We discuss the implication of this assumption later in this section.
A lower bound to NS(h, k) is h − k + 1 which can be obtained by picking a contiguous block of k taxa from the canonical newick representation of the MFAST by considering all possible h − k + 1 starting point locations. Notice that the larger the value of P(n, k,c, m, h), the higher the chances that our algorithm will construct some part of the MFAST. Similarly, the larger the value of NS(h, k), the higher the chances that our algorithm will construct some part of the MFAST.
Overall, we conclude from this experiment that even small values of k and c suffices to capture a part of the MFAST in phase two. Therefore, although our algorithm’s complexity increases exponentially with k and c, we do not need to use large values for k and c. This enables our algorithm to scale to very large datasets with thousands of taxa and trees. These results explain the theory behind the practical results we observed in Sections “Effects of number of input trees” to “Effects of noise percentage”.
Evaluation of state of the art methods
So far, we have shown that our method could successfully find the MFASTs contained in sets of trees $\mathcal{T}$ for up to 1000 taxa and 200 trees (i.e., n = 1000 and m = 200). An obvious question is how well do existing methods perform on the same datasets. Here, we answer this question for two existing programs, namely PAUP* (version 4.0b10) and Phylominer.
When we fix the number of trees and the number of taxa to 100, PAUP* was able to find the MAST for for all datasets. As we grow the number of taxa to 250 or larger while keeping the number of trees as 100, PAUP* runs our of memory and fails to return any results. After reducing the number of trees to 50, PAUP* still runs out of memory and cannot report any results for more than 100 taxa.
The scalability problem of Phylominer is even more severe. Phylominer is able to compute the MFASTs on datasets with up to 20 taxa. However, as we increase the number of taxa further, its performance deteriorates quickly. When we set the number of taxa to 100, even with as few as 100 trees, Phylominer takes more than a week to report a result. Moreover, in our experiments, the maximum size of the subtrees it found on average contained fewer than 7 taxa, even though the size of the true MAST was 10.
Another interesting question about existing methods would be whether the majority consensus rule can be used to find MFASTs. To evaluate this, we used the same three synthetic datasets used in Section “Effects of noise percentage”. Recall each of these three datasets contains an MFAST of size 15 which is embedded in 80% of the trees. The datasets are created with 20%, 40% and 60% noise indicating different levels of difficulty in recovering the embedded MFAST. We computed 70% majority consensus tree. Notice that if majority consensus rule can identify an MFAST, that would correspond to a bifurcating subtree topology in the consensus tree. In other words a subtree is bifurcating in this experiment only if 70% or more of the input trees agree on the topology of that subtree. The resulting tree, however, was multifurcating for all the three datasets. This means that majority consensus rule could not recover even a smaller portion of the embedded tree while our method was able to locate the entire MFAST successfully (see Table 4).
These results demonstrate that both PAUP* and Phylominer are not well suited to finding agreement subtrees in larger datasets, our method scales better in terms of both the number of taxa and the number of trees. When PAUP* runs to completion, we observed that it reports the true results. Recall from previous experiments that our method always found the true results on the same datasets as well as larger datasets. This suggests that our method has the potential to have an impact in large scale phylogenetic analysis when existing methods fail.
Empirical dataset experiments
To examine the performance of the MFAST method on real data, we performed experiments using 200 maximum likelihood bootstrap trees from a phylogenetic analysis of gymnosperms (959 taxa) and Saxifragales (950 taxa). Specifically, we evaluated how the performance of the MFAST algorithm was affected by the number of input trees and the size of the input trees.
Effects of number of input trees
The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets before and after post processing (phase three)
Number of  MFAST size  

trees  Gymnosperms  Saxifragales  
Before  After  Only  Before  After  Only  
50  78.5  129.8  99.5  64.7  122.0  84.1 
100  68.4  119.2  83.1  55.4  112.8  74.7 
200  76.0  118.0  84.0  40.0  105.0  75.0 
The large gap between the MFAST sizes before and after the post processing suggests that phase three is the main reason behind the success of our method, and thus, the costly seed combination phase (i.e., phase two) may be unnecessary. To answer whether this conjecture is correct, we ran a variant of our method by disabling the second phase; we only ran the post processing phase starting from each seed as the initial MFAST one by one. We reported the largest MFAST found that way as the output of this variant in Table 5. The results demonstrate that although phase three can grow a large FAST, phase two is essential to find the largest frequent agreement subtree. In other words, post processing finds the true MFAST only if a large portion of it is already found (which is the role served by phase two). In conclusion, phase three of our method cannot replace phase two, yet both phases are essential for the success of our method.
Effects of size of input tree
The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets before and after post processing (phase three) for different number of taxa
Number of  MFAST size  

leaves  Gymnosperms  Saxifragales  
Before  After  Only  Before  After  Only  
100  41.2  56.1  43.5  43.5  50.7  38.5 
250  67.2  88.5  63.0  62.3  76.2  54.6 
500  91.6  123.0  74.9  52.0  86.7  62.9 
All  76.0  118.0  84.0  40.0  105.0  75.0 
Similar to the experiments in Section “Effects of number of input trees”, we investigated the gap between the MFAST sizes before and after the post processing step. We ran a variant of our method by disabling the second phase; we only ran the post processing phase starting from each seed as the initial MFAST one by one. We reported the largest MFAST found that way as the output of this variant in Table 6. The results are in parallel with those in Table 5. Phase three can grow a large frequent agreement subtree, but not quite as big as that when both phase two and three are executed.
Effects of sample size
In our final experiment, we evaluated the effect of the maximum time cutoff , we described in Section “Inorder combination” on the accuracy of our method. Recall that, this cutoff limits the number of initial seeds tried in our algorithm by randomly sampling a small percentage of the seeds. It only uses the sampled seeds as possible initial seeds. However, it uses the entire set of seeds while growing the MFAST determined by the initial seed. As each initial seed roughly takes the same amount of time to grow into an MFAST, using x% of the seeds as the sample set reduces the total running time our method to roughly x% of that of our original implementation.
The size of the MFAST found by our method on the Gymnosperms and Saxifragales datasets for different random subsamples of the total number of taxa
Sampling  MFAST size  

percentage  Gymnosperms  Saxifragales 
2  85.9  74.3 
5  87.5  75.6 
10  88.4  75.2 
25  87.5  75.5 
50  88.5  76.2 
100  88.5  76.2 
Conclusion
In this paper, we present a heuristic for finding the maximum agreement subtrees. The heuristic uses a multistep approach which first identifies small candidate subtrees (called seeds), from the set of input trees, combines the seeds to build larger candidate MFASTs, and then performs a postprocessing step to increase the size of the candidate MFASTs. We demonstrate that this heuristic can easily handle data sets with 1000 taxa, greatly extending the estimation of MFASTs beyond current methods. Although this heuristic is not guaranteed to find all MFASTs, it performs well using both simulated and empirical data sets. Its performance is relatively robust to the number of input trees and the size of the input trees, although with the larger data sets, the post processing step becomes more important. Overall this method provides a simple and fast way to identify strongly supported subtrees within large phylogenetic hypotheses.
Although the method we developed is described and implemented for the rooted and bifurcating trees, it can be trivially extended to multifurcating as well as unrooted trees. The central technical difference in the case of unrooted trees would be the definition of clade (see Definition 1) as the definition requires a root. A clade in an unrooted tree encompasses two sets of nodes; (i) a given set of taxa X, (ii) the set of all internal nodes that are on a path between two taxa in X on the phylogenetic tree. We expect that this will increase the number of seeds substantially and thus make the problem more computationally intensive. The amount of increase will depend on the tree topology. The theoretical worst case happens when all the taxa are connected to a single internal node (i.e., star topology). In that case any subset of taxa can lead to a potential seed as long as the subset size is equal to the seed size allowed. One possible way to overcome this problem would be to exploit randomization or graph coloring strategies and avoid enumerating majority of the possible seeds.
Abbreviations
 MAST:

Maximum agreement subtree
 FAST:

Frequent agreement subtree
 MFAST:

Maximum frequent agreement subtree.
Declarations
Acknowledgments
This work was supported partially by the National Science Foundation (grants CCF0829867 and IIS 0845439).
Authors’ Affiliations
References
 Goloboff PA, Catalano SA, Mirande JM, Szumik CA, Arias JS, Källersjö M, Farris JS: Phylogenetic analysis of 73 060 taxa corroborates major eukaryotic groups. Cladistics 2009, 25(3):211–230. 10.1111/j.10960031.2009.00255.xView Article
 Price MN, Dehal PS, Arkin AP: Fasttree 2  approximately maximumlikelihood trees for large alignments. PLoS ONE 2010, 5(3):e9490. 10.1371/journal.pone.0009490PubMed CentralView ArticlePubMed
 Smith SA, Beaulieu JM, Stamatakis A, Donoghue MJ: Understanding angiosperm diversification using small and large phylogenetic trees. American Journal of Botany 2011, 98: 404–414. 10.3732/ajb.1000481View ArticlePubMed
 Felsenstein J: Confidence Limits on Phylogenies: An Approach Using the Bootstrap. Evol 1985, 39(4):783–791. 10.2307/2408678View Article
 Farris JS, Albert VA, Källersjö M, Lipscomb D, Kluge: Parsimony jackknifing outperforms neighborjoining. Cladistics 1996, 12(2):99–124. 10.1111/j.10960031.1996.tb00196.xView Article
 Huelsenbeck JP, Rannala B, Masly JP: Accommodating phylogenetic uncertainty in evolutionary studies. 2000 Science, 30 June 2000, 288(5475):2349–2350. 10.1126/science.288.5475.2349
 Bryant D: A classification of consensus methods for phylogenetics. Bioconsensus (Piscataway, NJ, 2000/2001), volume 61 of DIMACS Ser. Discrete Math. Theoret. Comput. Sci. Amer. Math. Soc., 2003, 163–183.
 Finden CR, Gordon AD: Obtaining common pruned trees. J Classification 1985, 2: 255–276. 10.1007/BF01908078View Article
 Amir A, Keselman D: Maximum agreement subtree in a set of evolutionary trees: Metrics and efficient algorithms. SIAM J Comput 1997, 26(6):1656–1669. 10.1137/S0097539794269461View Article
 Kubicka E, Kubicki G, McMorris FR: An algorithm to find agreement subtrees. J Classification 1995, 12(1):91–99. 10.1007/BF01202269View Article
 Farach M, Przytycka TM, Thorup M: On the agreement of many trees. Inf Process Lett 1995, 55(6):297–301. 10.1016/00200190(95)00110XView Article
 Bryant D: Building trees, hunting for trees and comparing trees. PhD thesis. Dept. Mathematics: University of Canterbur; 1997.
 Cole R, FarachColton M, Hariharan R, Przytycka TM, Thorup M: An o(n log n) algorithm for the maximum agreement subtree problem for binary trees. SIAM J Comput 2000, 30(5):1385–1404. 10.1137/S0097539796313477View Article
 Lee CM, Hung LJ, Chang MS, Shen CB, Tang CY: An improved algorithm for the maximum agreement subtree problem. Inf Process Lett 2005, 94(5):211–216. 10.1016/j.ipl.2005.02.005View Article
 Berry V, Nicolas F: Improved parameterized complexity of the maximum agreement subtree and maximum compatible tree problems. IEEE/ACM Trans Comput Biology Bioinform 2006, 3(3):289–302. 10.1109/TCBB.2006.39View Article
 Guillemot S, Nicolas F: Solving the maximum agreement subtree and the maximum compatible tree problems on many bounded degree trees. In Proceedings of the 17th Annual conference on Combinatorial Pattern Matching (CPM’06). Edited by: Lewenstein M, Valiente G(Eds. ).. Berlin, Heidelberg: SpringerVerlag; 165–176. 10.1007/11780441_16
 Guillemot S, Nicolas F, Berry V, Paul C: On the approximability of the maximum agreement subtree and maximum compatible tree problems. Discrete Applied Mathematics 2009, 157(7):1555–1570. 10.1016/j.dam.2008.06.007View Article
 Chi Y, Xia Y, Yang Y, Muntz: Correction to ”mining closed and maximal frequent subtrees from databases of labeled rooted trees”. IEEE Trans Knowl Data Eng 2005, 17(12):1737.View Article
 Zhang S, Wang JTL: Mining frequent agreement subtrees in phylogenetic databases. In Proceedings of the 6th SIAM International Conference on Data Mining (SDM 2006). Edited by: Ghosh J, Lambert D, Skillicorn DB, Srivastava J. Maryland: Bethesda; April 2006:222–233.View Article
 Zhang S, Wang JTL: Discovering frequent agreement subtrees from phylogenetic data. IEEE Trans Knowl Data Eng 2008, 20(1):68–82.View Article
 Cranston KA, Rannala B: Summarizing a Posterior Distribution of Trees Using Agreement Subtrees. Syst B 2007, 56(4):578–590. 10.1080/10635150701485091View Article
 Pattengale ND, Swenson KM, Moret BME: Uncovering hidden phylogenetic consensus. In Proc. 6th Int’l Symp. Bioinformatics Research & Appls. ISBRA’10, in Lecture Notes in Computer Science Edited by: Borodovsky M, Gogarten JP, Przytycka TM, Rajasekaran S. pp 128–139. Springer, 2010 Springer, 2010
 Pattengale ND, Aberer AJ, Swenson KM, Stamatakis A, Moret BME: Uncovering hidden phylogenetic consensus in large data sets. IEEE/ACM Trans Comput Biology Bioinform 2011, 8(4):902–911.View Article
 Aberer AJ, Stamatakis A: A simple and accurate method for rogue taxon identification. In Proceedings of the IEEE International Conference on, Bioinformatics and Biomedicine (BIBM ’11). IEEE Computer Society, Washington, DC, USA; 118–122. 10.1109/BIBM.2011.70
 Junier T, Zdobnov EM: The newick utilities: highthroughput phylogenetic tree processing in the unix shell. Bioinformatics 2010, 26(13):1669–1670. 10.1093/bioinformatics/btq243PubMed CentralView ArticlePubMed
 Swofford DL: PAUP∗ Phylogenetic analysis using parsimony (and other methods). Version 4.0 beta 10.,2002. Sunderland, Massachusetts: Sinauer Assoc;
 Burleigh JG, Barbazuk WB, Davis JM, Morse AM, Soltis PS: Exploring diversification and genome size evolution in extant gymnosperms through phylogenetic synthesis. J Botany 2012, 2012: 6. Article ID 292857. 10.1155/2012/292857View Article
 Stamatakis A: Raxmlvihpc: maximum likelihoodbased phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 2006, 22(21):2688–2690. 10.1093/bioinformatics/btl446View ArticlePubMed
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.