ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees

Background Evolutionary histories can be discordant across the genome, and such discordances need to be considered in reconstructing the species phylogeny. ASTRAL is one of the leading methods for inferring species trees from gene trees while accounting for gene tree discordance. ASTRAL uses dynamic programming to search for the tree that shares the maximum number of quartet topologies with input gene trees, restricting itself to a predefined set of bipartitions. Results We introduce ASTRAL-III, which substantially improves the running time of ASTRAL-II and guarantees polynomial running time as a function of both the number of species (n) and the number of genes (k). ASTRAL-III limits the bipartition constraint set (X) to grow at most linearly with n and k. Moreover, it handles polytomies more efficiently than ASTRAL-II, exploits similarities between gene trees better, and uses several techniques to avoid searching parts of the search space that are mathematically guaranteed not to include the optimal tree. The asymptotic running time of ASTRAL-III in the presence of polytomies is \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$O\left ((nk)^{1.726} D \right)$\end{document}O(nk)1.726D where D=O(nk) is the sum of degrees of all unique nodes in input trees. The running time improvements enable us to test whether contracting low support branches in gene trees improves the accuracy by reducing noise. In extensive simulations, we show that removing branches with very low support (e.g., below 10%) improves accuracy while overly aggressive filtering is harmful. We observe on a biological avian phylogenomic dataset of 14K genes that contracting low support branches greatly improve results. Conclusions ASTRAL-III is a faster version of the ASTRAL method for phylogenetic reconstruction and can scale up to 10,000 species. With ASTRAL-III, low support branches can be removed, resulting in improved accuracy. Electronic supplementary material The online version of this article (10.1186/s12859-018-2129-y) contains supplementary material, which is available to authorized users.

allowed to coalesce inside each branch; lineages that fail to coalesce at the root of each branch are moved to the parent branch.
Many "summary" methods have been developed to infer a species tree from a collection of input trees. Examples include MP-EST [17], NJst [18], ASTRID [19], DISTIQUE [20], ASTRAL [21,22] and STAR [23], which only use gene tree topologies, and GLASS [24] and STEAC [23], which also use branch lengths. These methods are all proved statistically consistent under the MSCM, given error-free input gene trees; when input trees are inferred from sequence data, statistical consistency is not guaranteed [25]. Most methods take rooted gene trees as input, but some methods (e.g., ASTRAL, NJst/ASTRID and DISTIQUE) use unrooted input trees. ASTRAL-II [22] is currently one of the commonly used summary methods.
In this paper, we introduce an improved version of ASTRAL called ASTRAL-III. As we will show, compared to ASTRAL-II, the new version has better running time without sacrificing accuracy. The improvements in the running time are both theoretical (reducing the asymptotic running time so that it is guaranteed to grow polynomially with the dataset size) and empirical.

Notations and definitions
Let the set of n species be called L and let G be the set of k input gene trees on L. Let [ d] represent the set {1, 2 . . . , d}. We use Q(t) to denote the set of quartet trees induced by a tree t. Any subset of L is called a cluster. We define a partition as a set of clusters that are pairwise mutually exclusive; note that we abuse the term "partition" here because the union of all clusters in a partition need not give the complete set. Each node in an unrooted tree defines a partition. A bipartition (tripartition) is a partition with cardinality two (three); a partition with cardinality at least four corresponds to a multifurcation (also referred to as a polytomy). Let X (the constraint bipartition set) be a set of clusters such that for each A ∈ X, we also have L − A ∈ X. We use Y to represent the set of all tripartitions that can be build from X: We use E to denote the set of all unique partitions and their frequency in G. Thus,  (1) where N(g) is the set of all partitions representing all internal nodes in the tree g. We also define D as the sum of the cardinalities of unique partitions in gene trees: ( 2 )

ASTRAL (old versions)
The problem addressed by ASTRAL is to find the tree that shares the maximum number of induced quartet topologies with the collection of input gene trees: Problem statement: Given a set G of input gene trees, find the species tree t that maximizes g∈G |Q(g) ∩ Q(t)|.
Lafond and Scornavacca recently proved this problem is NP-hard [26].

ASTRAL-I and ASTRAL-II algorithms
ASTRAL solves a constrained version of the problem where a set of clusters X restricts bipartitions that the output species tree may include (recall ∀A ∈ X : L − A ∈ X). Note that setting X to the powerset solves the unconstrained problem. Based on the fact that an unrooted quartet species tree always matches the most likely unrooted quartet gene tree [27], ASTRAL is proved statistically consistent [21]. ASTRAL uses dynamic programming to solve the problem using the recursive relation: where the function w(T) scores each tripartition T = (A|B|C) against each node in each input gene tree. Let partition M = (M 1 |M 2 |...|M d ) represent an internal node of degree d in a gene tree. The overall contribution of T to the score of any species tree that includes T is: where, defining a i = |A ∩ M i |, b i = |B ∩ M i |, and c i = |C ∩ M i |, we have:

QI(T, M) = i∈[d] j∈[d]−{i} k∈[d]−{i,j}
As previously proved [21], QI(T, M) computes twice the number of quartet trees that are going to be shared between any two trees if one includes only T and the other includes only M. ASTRAL-II requires d 3 time for computing QI(.), making its overall running time O n 3 k|Y | with polytomies of unbounded degrees or O(nk|Y |) in the absence of polytomies.
Noting trivially that |Y | < |X| 2 , the previously published running time analysis of ASTRAL-II was O nk|X| 2 for binary gene trees and O n 3 k|X| 2 for trees with polytomies. A recent result by Kane and Tao [28] (motivated by the analysis of ASTRAL) proved that |Y | ≤ |X| 3/log 3 (27/4) . This result immediately gives us a better upper bound on the running time. 1.726 and O n 3 k|X| 1.726 , respectively, with and without polytomies in gene trees.

Corollary 1 ASTRAL-II runs in O nk|X|
In ASTRAL-I, X is the set of all bipartitions observed in input gene trees. While sufficient for statistical consistency and often for accuracy, under some conditions, this set X is too restrictive. To address this shortcoming, ASTRAL-II [22] uses several heuristics (see Additional file 1: Appendix A) and further expands the set X. Even though ATRAL-II tries to limit |X|, it does not provide any guarantees as to how it grows with n and k. In the worst case, |X| can grow exponentially, and thus, ASTRAL-II does not guarantee polynomial running time. The relatively high accuracy of ASTRAL-II has been shown in several simulations [20,22,29,30] and it has been adopted by the community as one of the main methods used in phylogenomics. ASTRAL has the ability to compute branch lengths in coalescent units [2] and a measure of branch support called local posterior probability [31].

Limitations of ASTRAL-II
Several shortcomings of ASTRAL-II in terms of running time are addressed here (ASTRAL-III); our improvements, in turn, enable new types of analyses.
While ASTRAL-II can analyze datasets with a thousand species and gene trees in reasonable time, it does not easily scale to many tens of thousands of input trees. Datasets with more than ten thousand loci are already available (e.g., [5]) and as more genomes are sequenced, more are destined to become available in the near future. Moreover, being able to handle large k (i.e., numbers of input trees) enables using multiple trees per locus (e.g., a Bayesian sample) as input to ASTRAL. The limited scalability of ASTRAL with k has two reasons. First, the set X is not bounded in ASTRAL-II and can grow to become the power set. Thus, in ASTRAL-II, |X| can theoretically grow exponentially with n. We fix this in ASTRAL-III by modifying heuristics that form the set X so that they all guarantee that |X| = O(nk). The second cause of the slowdown is that computing each w(T) for a tripartition T requires (nk). This computation does not exploit similarities between gene trees, a shortcoming that we fix in ASTRAL-III.
Beyond large k, ASTRAL-II, which scales as O n 3 k|X| 1.726 in the presence of polytomies, can quickly become prohibitively slow for input trees with large polytomies. ASTRAL-III uses a mathematical trick to enable scoring of gene tree polytomies in time similar to binary nodes. The ability to handle large polytomies in input gene trees is important for two reasons. Some of the conditions that are conducive to ILS, namely shallow trees, are also likely to produce identical gene sequence data for several species. The gene tree should leave the relationship between identical sequences unresolved (FastTree [32] automatically does it and RAxML, which outputs an arbitrary resolution, warns the user about the input). Moreover, all summary methods, including ASTRAL, are sensitive to gene tree estimation error [22,[33][34][35][36][37]. One way of dealing with gene tree error, previously studied in the context of minimizing deep coalescence [38], is to contract low support branches in gene trees and use these unresolved trees as input to the summary method. While earlier studies found no evidence that this approach helps ASTRAL-II when the support is judged by SH-like Fast-Tree support [22], no study has tested this approach with bootstrap support values. We will for the first time evaluate the effectiveness of contracting branches with low bootstrap support and show that conservative filtering of very low support branches does, in fact, help the accuracy.

ASTRAL-III
ASTRAL-III has six new features:  [39], is used to compute an upper-bound on the best possible resolution of a clade; we need not expand a clade recursively when its upper-bound is below the best available score. The worst case asymptotic running time does not change due to this feature. 6. A two-stage heuristic mechanism is designed to further tighten the upper bounds used in pruning unnecessary parts of the search space. The worst case asymptotic running time is not impacted.
A beta version of ASTRAL-III was recently described [40] and that version included features 3-5 but not the others. We next describe each improvement.

New search space: |X| = O(nk)
ASTRAL-II uses several heuristic methods to build X (see the original paper [22] for details). The main method involves computing several extended majority consensus trees from gene trees and then resolving polytomies in these consensus trees using three techniques (mentioned below). These steps are repeated for 10 rounds or more until very few (less than a constant threshold) of the bipartitions observed are new to X. Because the number of rounds is not constant or a function of n and k, we cannot bound how X grows with n and k for ASTRAL-II. In ASTRAL-III, we limit the number of rounds by a constant value (default set to 110). This enables us to provide guarantees of a polynomial growth of |X| with n and k.
To get to X = O(nk), we need further changes. As mentioned, three techniques are used to resolve each polytomy of degree d in extended majority consensus trees. The first technique uses a precomputed distance matrix to build a UPGMA tree starting from sides of the polytomy and adds the new bipartitions from the UPGMA tree to X. This can only add O(d) = O(n) resolutions. The second technique computes a greedy consensus of gene trees subsampled to randomly selected taxa (one from each side of the polytomy) and adds bipartitions from the greedy consensus to X. This also can only add O(d) new bipartitions. The third resolution samples a taxon from each side of the polytomy, computes d caterpillar trees, each constructed based on decreasing similarity to each sampled taxon, and adds the bipartitions from all these caterpillar trees to X. This quadratic resolution step can add O d 2 = O n 2 bipartitions to X. To have |X| = O(n), we need to change this step. Let d 1 . . . d r be the list of all polytomy degrees in an extended majority consensus tree in the ascending ordered. We find the smallest threshold q such that

New search space: handling gene tree polytomies
We also change the way ASTRAL builds X in the presence of gene tree polytomies. Our goal is to increase |X| compared to ASTRAL-II for multifurcating gene trees. However, |X| is enlarged at most by a constant factor and we retain |X| = O(nk).
If a gene tree includes polytomies, ASTRAL-II adds bipartitions implied by resolutions of that polytomy to the set X using a guide tree g. To build g, a greedy consensus of all gene trees is computed and is further refined to become binary by applying UPGMA to each polytomy of the greedy tree using a precomputed similarity matrix (see the original paper [22] for details). To resolve a gene tree polytomy of degree d, ASTRAL-II first randomly samples d taxa, each from one side of the polytomy. Let S be the sampled taxa. All bipartitions from the tree g restricted to the set S of leaves are added to X. While in ASTRAL-II this process is done only once, in ASTRAL-III, we repeat the process three times with different random samples S. This increases |X| but at most by a constant factor. The enlarged X can lead to improved accuracy when input trees include many polytomies.
The second change in ASTRAL-III is that we now use a UPGMA tree inferred based on the similarity matrix as the guide tree. We observed that the UPGMA tree summarizes the input gene trees more accurately than the greedy tree (see Additional file 1: Table S1). Finally, in ASTRAL-III, we improve the definition of the similarity matrix in the presence of gene tree polytomies. Unlike in ASTRAL-II, we ensure that unresolved quartet trees induced by gene trees do not increase the similarity between pairs of taxa included in those quartets. Note that the similarity matrix, which is based on quartets, should not be confused with the quartet score optimized by ASTRAL.

Efficient handling of Polytomies
Recall that ASTRAL-II uses Eq. 4 to score a tripartition against a polytomy of size d in (d 3 ) time. Our next Lemma shows that this can be improved.

QI(T, M) can be computed in time (d).
Proof In (d) time, we can compute: where a i = |A ∩ M i | and b i = |B ∩ M i |; we can also compute S b , S c , S a,c and S b,c (similarly defined). Equation 4 computes twice the number of quartet tree topologies shared between an unrooted tree with internal node T and another tree with one internal node M [22]. Equation 4 can be rewritten as: (the derivation is given in the Additional file 1: Appendix B). Computing Eq. 6 instead of Eq. 4 clearly reduces the running time to (d) instead of d 3 .
ASTRAL needs to score each of the |Y | tripartitions considered in the dynamic programming against each internal node of each input gene tree. The sum of degrees of k trees on n leaves is O(nk) (since that sum can never exceed the number of bipartitions in gene trees) and thus:

Gene trees as a Polytree
ASTRAL-II scores each dynamic programming tripartition against each individual node of each gene tree. However, nodes that are repeated in several gene trees need not be recomputed. Recalling the definitions of E and D (Eqs. 1 and 2),

Lemma 2 The score of a tripartition T = (A|B|C) against all gene trees (i.e., the w(T) score) can be computed in (D).
Proof In ASTRAL-III, we keep track of nodes that appear in multiple trees. This enables us to reduce the total calculation by using multiplicities: We achieve this in two steps. In the first step, for each distinct gene tree cluster W, we compute the cardinality of the intersection of W and sets A, B, and C once using a depth-first search with memoization. Let children(W ) denote the set of children of W in an arbitrarily chosen tree g ∈ G containing W. Then, we have the following recursive relation:

Trimming of the dynamic programming
We now introduce an upper-bound (proved in Additional file 1: Appendix B): . When computing V (A) by maximizing the score over all resolutions of A, imagine that we first encounter A and then A . We avoid A ). This approach clearly makes the order of processing of the resolutions important. To heuristically improve the efficiency of this approach, we order all A ). Note that computing U(A) does not require recursive computations down the dynamic programming DAG. Thus, the use of this upper-bound results in the trimming of the search space. However, as far as we can tell, this trimming does not improve the theoretical running time.

Two-staged α-trimming
In order to further trim the search space, another upperbound of V (A) is calculated. For a given α ≥ 1 and any ordering of the set A : We can compute V α (A) using an algorithm equivalent to our dynamic programming for computing V (A), except that, as resolutions of a clade A are being tested, a new one is accepted only if it improves upon the previous best resolution by a factor of α (thus, α = 1 simply reproduces our existing dynamic programming). When computing , and thus we no longer need to recursively compute The hope is that the U α function will give us tighter upper bounds compared to the U function previously defined. Whether this happens or not depends on the choice of α, the order of visiting clusters, and the particularities of a dataset. While any choice of α ≥ 1 would guarantee the correct solution to the dynamic programming, we have empirically selected a heuristic to choose α. We set α = U(L) and g(A) = 0 for |A| = 1. Just as before, we order the clusters in the decreasing order of U(A, A i ).

Experimental setup
We study three research questions:

RQ1:
Can contracting low support branches improve the accuracy of ASTRAL? RQ2: How do the running time and search space compare between ASTRAL-II and ASTRAL-III? RQ3: How accurate is ASTRAL-III, which guarantees polynomial size search space, compared to ATRAL-II?

Datasets
Avian biological dataset: Neoavian relationships show extremely high levels of gene tree discord, perhaps because their ancestors experienced a rapid radiation [5]. A dataset of 48 genomes representing all avian orders has been used to partially resolve this rapid radiation [5]. A set of 14,446 loci (including exons, introns, and UCEs) was used to produce two reference species trees using concatenation and using a coalescent-based method [5,33]. We use the set of all unbinned gene trees and compare ASTRAL-III with and without contraction against both reference trees. SimPhy-homogeneous (S100): We simulated a new 101-taxon dataset using SimPhy [42] with 50 replicates, each with a different species tree. The species trees are simulated under the birth-only process with birth rate 10 −7 , fixed haploid N e of 400K, and the number of generations sampled from a log-normal distribution with mean 2.5M. For each replicate, 1000 true gene trees are simulated under the MSCM (exact commands shown in Additional file 1: Appendix C and parameters given in Additional file 1: Table S2). The average normalized RF distance between true species trees and true gene trees was in most replicates in the [0.3, 0.6] range, with an average of 0.46 (Fig. 1). We use Indelible [43] to simulate the nucleotide sequences along the gene trees using the GTR evolutionary model [44] with 4 different fixed sequence lengths: 1600, 800, 400, and 200bp. We then use Fast-Tree2 [32] to estimate both ML and 100 bootstrapped gene trees under the GTR+ (requiring more than two million runs in total). Gene tree estimation error, measured by the FN rate between the true gene trees and the estimated gene trees, depended on the sequence length as shown in Fig. 1 (0.55, 0.42, 0.31, and 0.23 on average for b a Fig. 1 Properties of the S100 dataset. a The density plot of the amount of true gene discordance measured by the FN rate between the true species tree and the true gene trees. b The density plot of gene tree estimation error measured by FN rate between true gene trees and estimated gene trees for different sets of sequence lengths 200bp, 400bp, 800bp, and 1600bp, respectively). We sample 1000, 500, 200, or 50 genes to generate datasets with varying numbers of gene trees.

SimPhy-ASTRAL2 (S200):
This dataset (201 taxa) is from the ASTRAL-II paper [22]. We use its most challenging model conditions with max tree height set to 500K generations and two rates of speciation: 10 −6 and 10 −7 (respectively, recent and deep speciation events). Compared to S100, this dataset has a much higher level of ILS. This was the only case in the ASTRAL-II paper where enlarging X substantially impacted accuracy [22]. We use S200 to test if our changes to X have compromised the accuracy. Like S100, gene alignments have varying lengths and mutation rates, leading to a wide range of gene tree error [22]. We analyze the data using 1000, 200, or 50 genes, and each model condition has 50 replicates; following the original paper, three replicates with low signal are removed.

Methods and Evaluation
We compare ASTRAL-III (version 5.5.4) to ASTRAL-II (version 4.11.1) in terms of running time and accuracy. To address RQ1, we draw bootstrap support values on the ML gene trees and then contract branches with bootstrap support up to a threshold (0, 3, 5, 7, 10, 20, 33, 50, and 75%,) using the newick utility package [45]. Together with the original gene trees, we have 10 different versions of ASTRAL-III.
To measure the accuracy of estimated species trees, we use False Negative (FN) rate. Note that in all our species tree comparisons, FN rate is equivalent to normalized Robinson-Foulds (RF) [46] metric because the ASTRAL species trees are fully resolved. All running times are measured on a cluster with servers with Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz; each run was assigned to a single process, sharing cache and memory with other jobs.

RQ1: Impact of contracting low support branches on accuracy
We investigate RQ1 on the two simulated datasets where bootstrapping was feasible (avian and S100) and on the real avian dataset. On S200, due to its size, bootstrapping was not feasible and thus we cannot test RQ1.

S100
On this dataset, contracting very low support branches in most cases improves the accuracy (Fig. 2 and Additional file 1: Table S3). However, the excessive removal of branches with high, moderate, or occasionally low support degrades the accuracy. Nevertheless, filtering at 10% is always beneficial on average (Additional file 1: Table S3). The threshold where contracting starts to become detrimental depends on the condition, especially the number of gene trees and the alignment length, perhaps representing a signal to noise ratio trade-off.
As the number of genes increases, the optimal threshold for contracting also tends to increase. Combining all model conditions, the error continues to drop until a 20% contracting threshold with 1000 genes, whereas no substantial improvement is observed after contracting branches with 5% support for 50 genes (Fig. 2). Nevertheless, removing branches with 10% or 20% does not increase the error with 50 genes. Perhaps, with few gene trees, removing branches of low support leaves us with very little information left; thus, regardless of whether we contract or not, we don't get much signal around the most difficult branches. In contrast, when many gene trees are given, perhaps even after removing many branches, still enough gene trees with a resolution around difficult species tree branches are left.
The alignment length and gene tree error also impact the effect of contraction. For short alignments (200bp) and 1000 genes, contracting branches with up to 10% support reduces the species tree error by 21% (from 8.8% with no contraction to 6.9%). As alignment length grows, benefits of gene tree contraction diminish, so that with 1600bp genes, the reduction in error is merely from 4.1 to 3.7%. This pattern is perhaps expected because, with longer alignments, branch support is expected to increase. Thus, with longer gene alignments and consequently better gene trees with higher support, there is less room for improvement by reducing the noise. Consistent with this explanation, grouping replicates based on average gene tree error gives similar results as grouping by alignment length (see Additional file 1: Figure S1).

avian-like simulations
On the avian simulated dataset, contracting low support branches helps accuracy marginally, but the extent of impact depends on the model condition (Fig. 3). With moderate ILS (2X), we see no improvements as a result of contracting low support branches, perhaps because the average error is below 5% even with no contraction, leaving little room for improvements. Increasing ILS, we start to see improvements using contracted gene trees. Removing branches of up to 5% support reduces the error from 13 to 11% with 0.5X, and from 8 to 7% for the 1X condition.
When ILS is fixed to 1X and sequence length is varied (Fig. 3), contracting is helpful mostly with short sequences (e.g., 250 bp). With longer sequences, where gene tree estimation error is low, little or no improvement in accuracy is obtained. The best accuracy is typically observed by contracting at 0-5%. The gains in accuracy comparing no contraction to contraction at 0, 3, 5% thresholds are statistically significant (p = 0.017, 0.028, and 0.013) according to one-tailed paired t-tests.

Avian biological dataset
The original analyses on this dataset [5,33] report that MP-EST [17] run on 14,446 gene trees produces a tree that conflicts with strong evidence from the literature and other analyses on the same dataset. The statistical binning method was developed to address this shortcoming by combining loci together to reduce gene tree error [33,34]. MP-EST run on binned gene trees (i.e., binned MP-EST) produced results [5,33] that were largely congruent with the concatenation using ExaML [47] and differed in only five branches with low support (Fig. 4a, b); both trees were used as the reference [5]. Here, we test if simply contracting low support gene tree branches and using ASTRAL-III produces trees congruent with the reference trees.
Similar to MP-EST, when ATRAL-III is run on 14,446 gene trees with no contraction, the results differ in nine and 11 branches, respectively, with respect to the reference binned MP-EST and concatenation trees (Fig. 4c). Moreover, this tree contradicts some strong results from the avian analyses (e.g., not recovering the Columbea group [5]). ASTRAL-III with no contraction finishes in 32 hours, but with contraction, depending on the threshold, it takes 3 to 84 h (> 50 h for 0 -20% thresholds and < 26 hours for 33 -75%). Contracting 0% branches has minimal impact on the discordance (eight discordant branches with binned MP-EST instead of nine). However, contracting low support branches with 3-33% thresholds dramatically reduces the discordance with the reference tree (2, 2, 4, 2, 3, and 3 discordant branches, respectively, for 3, 5, 7, 10, 20, and 33%). Three thresholds (3, 5, and 10%) produce an identical tree (Fig. 4d). The remaining differences are among the branches that are deemed unresolved by Jarvis et al. and change among the reference trees as well [5]. Contracting at 50 and 75% thresholds, however, increases discordance to five and six branches, respectively.
Thus, consistent with simulations, contracting very low support branches seems to produce the best results, when judged by similarity with the reference trees. To summarize, ASTRAL-III obtained on unbinned but collapsed gene trees agreed with all major relations in Jarvis et al., including the novel Columbea group, whereas the unresolved tree missed important clades (Fig. 4).

RQ2: Running time improvements Varying the number of genes (k)
We compare ASTRAL-III to ASTRAL-II on the avian simulated dataset, changing the number of genes from 2 8 to 2 14 and forcing X to be the same for both versions to enable comparing impacts of improved weight calculation (Fig. 5). We allow each replicate run to take up to two days. ASTRAL-III improves the running time over ASTRAL-II and the extent of the improvement depends on k (see Additional file 1: Figure S2). With 1000 genes or more, there is at least a 2.1X improvement. With 2 13 genes, the largest value where both versions could run, ASTRAL-III finishes on average 3.2 times faster than ASTRAL-II (234 versus 758 minutes). ASTRAL-II is not able to finish on the dataset with k = 2 14 , while ASTRAL-III finishes on all conditions. Moreover, fitting a line to the average running time in the log-log scale graph reveals that on this dataset, the running time of ASTRAL-III on average grows as O(k 2.08 ), which is better than that of ASTRAL-II at O(k 2.28 ), and both are better than the theoretical worst case, which is O(k 2.726 ). These results are consistent with the fact that ASTRAL-III considers similarities between gene tree nodes.

Running time for large polytomies
ASTRAL-III has a clear advantage compared to ASTRAL-II with respect to the running time when gene trees include polytomies (Fig. 6a and Additional file 1: Figure S3). Since ASTRAL-II and ASTRAL-III can have a different set X, we show the running time per each weight calculation (i.e., Eq. 3). As we contract low support branches and hence increase the prevalence of polytomies, the weight calculation time quickly grows for ASTRAL-II, whereas, in ASTRAL-III, the weight calculation time remains flat, or even decreases. These results are consistent with a change of asymptotic running time to score a polytomy of size

The search space
Comparing the size of the search space (|X|) between ATRAL-II and ASTRAL-III shows that as intended, the search space is decreased in size for cases with no polytomy but can increase in the presence of polytomies (Fig. 6b). With no contraction, on average, |X| is always c d a b Fig. 4 Avian dataset with 14,446 genes. Shown are reference trees from the original paper [5] using the coalescent-based binning (a) and concatenation (b), and two new trees using ASTRAL-III with no contraction (c) and with contraction with 3, 5, and 10% thresholds (d). Support values (bootstrap for a, b and local posterior probability for c, d) shown for all branches except those with full support; in (d), support is shown for 3, 5, and 10%, respectively. Branches conflicting with the reference coalescent-based tree are shown as dotted red lines  Figure S2 for normal scale). A line is fit to the data points in the log/log space and line slopes are shown. ASTRAL-II did not finish on 2 14 genes in 48 hours smaller for ASTRAL-III than ASTRAL-II. With few errorprone gene trees (50 gene trees from 200bp alignments), the search space has reduced dramatically but with many genes or high-quality gene trees, the reductions are minimal. Moreover, the search space for gene trees estimated from short alignments (e.g., 200 bp) is several times larger than those based on longer alignments (e.g., 1600 bp) for both methods. These are results of the first feature of ASTRAL-III that forces the search space to grow at O(nk). Contracting low support branches initially increases the search space. This is because ASTRAL-III unlike ASTRAL-II adds multiple resolutions per polytomy to X. Further contraction results in reductions in |X|, presumably because many polytomies exist and they are resolved similarly inside ASTRAL-III.

RQ3: ASTRAL-II versus ASTRAL-III accuracy
Despite limiting |X| to grow at most linearly with n and k, the accuracy of ASTRAL-III remains unchanged compared to ASTRAL-II (Table 1 and Additional file 1: Figures S4-S7). Importantly, even for the very challenging S200 dataset, the accuracy is not reduced substantially even though |X| is reduced by up to 47%. Over all datasets, differences in error are less than 0.002, except for three datasets where the error of ASTRAL-III was less than ASTRAL-II by 0.003, 0.005, and 0.006 and two cases where the error increased by 0.004. Over all datasets, the differences between ASTRAL-II and ASTRAL-III were not statistically significant according to a paired t-test (p-value = 0.496). Since ASTRAL-III has a reduced search space, its quartet scores are typically slightly lower than ASTRAL-II, but these reductions are never more than 0.06%. As expected, the largest drops in the quartet score happen for the challenging S200 dataset with only 50 gene trees. The search space reduces in almost all cases and the reductions can be as much as 72%. Thus, the improved running time of ASTRAL-III does not come at the price of reduced accuracy.

Discussion
Below we further comment on ASTRAL-III in terms of accuracy and running time. We finish by comparing ASTRAL-III and ASTRAL-III-beta.

Accuracy
Although tree accuracy can improve with contracted gene trees, the gap between performance on true gene trees and estimated gene trees remains wide (Additional file 1: Table S3). On the S100 dataset, respectively for 50, 200, 500, and 1000 genes, the best average error with 1600bp gene trees among all contraction levels were 9.8%, 5.9%, 4.3%, and 3.7% compared to 7.0%, 3.7%, 2.4%, and 1.5% a b Fig. 6 Weight calculation and |X| on S100. Average and standard error of (a) the time it takes to score a single tripartition using Eq. 3 and (b) search space size |X| are shown for both ASTRAL-II and ASTRAL-III on the S100 dataset. Running time is in log scale. We vary numbers of gene trees (boxes) and sequence length (200 and 1600). See Additional file 1: Figure S3 for similar patterns for with 400 and 800bp alignments with true gene trees. Thus, while contracting low support branches helps in addressing gene tree error, improved methods of gene tree estimation remain crucial. Our results also indicate that in the presence of noisy gene trees, increased numbers of genes are needed to achieve high accuracy. For example, on the S100 dataset, with 1000 gene trees of only 200bp and contracting with a 10% threshold, the species tree error was 6.9%, which slightly outperformed the accuracy with only 50 true gene trees. This observation encourages the use of a large number of gene trees; incidentally, a main feature of ASTRAL-III is improved running time with many genes.
The best choice of the threshold of contraction was somewhat sensitive to the dataset. Testing up to 1000 gene trees, we observed that more gene trees clearly increased the optimal threshold, but did not test beyond 1000 genes. One can predict that perhaps the trend may continue but also that the optimal threshold will not indefinitely increase. Similarly, we saw that the amount of gene tree error due to lack of signal impacts the optimal threshold. One may expect that other sources of error, including incorrect orthology, incorrect alignment, and model misspecifications may also impact the optimal threshold. Regardless of the choice of the optimal threshold, it seems that the largest benefits are associated with removing the least supported branches. Overall, a threshold of 10% seemed to provide a good default value.
In most datasets, a substantial accuracy improvement was observed when 0% BS branches were removed. Branches of 0% support are presumably resolved arbitrarily. The use of conserved genes or closely related taxa can increase instances where two or more taxa have identical sequences in some genes. Some tree estimation methods generate binary trees even under such conditions. Removing branches that are arbitrarily resolved make sense and, as our results indicate, improves accuracy.
The main competitor of ASTRAL is NJst [18] and its fast implementation, ASTRID [19], but these tools are not able to handle polytomies in input gene trees. ASTRAL-III makes it efficient to use unresolved gene trees. Moreover, beyond contracting low support branches, other strategies could be used to reduce impacts of gene tree uncertainty. Previous studies indicate that simply using the set of all bootstrap gene tree replicates as input to ASTRAL increases error [21], perhaps due to the increased noise [31,35]. However, using a sample from the Bayesian distribution for each gene tree may improve the accuracy of ASTRAL.
Finally, theoretical implications of removing low support branches are less clear than its empirical impact. In principle, branches that have low support are not necessarily expected to be randomly selected among gene trees. Thus, while our empirical results support the use of (conservative) filtering, the resulting procedure may lose statistical guarantees of consistency. Future work should study conditions where ASTRAL remains statistically consistent with contracted gene trees.

Running time Large n
To assess limits of ASTRAL-III in terms of scalability, we tested it on 20 replicates of a dataset with 5,000 species and 1000 true gene trees (simulation procedure described in Additional file 1: Appendix C and parameters given in Additional file 1: Table S4). ASTRAL-III took between 2 and 62 h to run on this dataset (9.4 hours on average). We also attempted to test ASTRAL-III on four replicates of a dataset with 10,000 species and 1000 true gene trees, allowing a week of running time. Of the four replicates, two were able to finish within the allotted time. Thus, depending on the nature of the data, ASTRAL-III may be able to scale to datasets with up to 10,000 species given sufficient running time.

Average running time, |X|, and |Y|
The ASTRAL-III running time analysis is based on several worst-case assumptions, and real data may grow less rapidly with both n and k. Overall, although the exact a b c Fig. 7 Empirical search space. a |X| is shown for ASTRAL-II and ASTRAL-III for avian-like simulateds dataset with varying numbers of genes. b |X| is shown for ASTRAL-III for several datasets with varying n. c The density plots of log X |Y| across all ASTRAL-III runs reported in this paper. Size of the dynamic programming space Y is never above |X| 1.312 here  value depends on the dataset and especially the amount of discordance, the running time of ASTRAL seems to grow roughly quadratically with both n and k (i.e., proportionally to n 2 k 2 ); see Additional file 1: Figures S2 and S8.
ASTRAL-III bounds |X| to grow at most linearly with n and k. Empirically, we observe that |X| grows sublinearly with k closetoO k 3 4 on the avian simulated dataset (Fig. 7a). Note that the avian dataset has one of the highest levels of ILS; the dependence on k is expected to be lower for datasets with lower gene tree discordance. Testing the growth with n is more difficult because as n changes, other factors such as the amount of discordance also change. Nevertheless, across all the datasets that we had available, we tested the change in running time for fixed k as n changes and observed a linear growth (Fig. 7b), matching the worst-case scenario.
Finally, establishing empirical running time growth requires establishing the rate of the growth of |Y | with respect to |X|. The |Y | ≤ |X| 1.726 upper-bound is for specialized formations of the set X [28]. Empirically, as |X| increases, the size of |Y | in ASTRAL-III does not increase as fast as the worst-case scenario implies. Across all of our ASTRAL-III runs in this paper, |Y | ranged in 90% of our runs between |X| 1.07 and |X| 1.20 , and the overall average was |X| 1.11 (Fig. 7c).

Comparisons to ASTRAL-III-beta
The beta version of ASTRAL-III [40] included features 3-5 but not changes to X (features 1 and 2) or the twostaged α-trimming technique (feature 6). For completeness, we compared ASTRAL-III-beta and ASTRAL-III in terms of accuracy, quartet score, and the running time ( Table 2). Accuracy and quartet scores are very similar, perhaps with a small improvement since the beta version. The search space is reduced since the beta version (due to features 1 and 2), and the running times are substantially decreased (at least by half in most cases). The reductions in the running time are due to α-trimming, reduced |X|, in addition to further improvements in details of our implementation of the polytree data-structure.
To further demonstrate the impact of the α-trimming feature, we randomly chose 18 species from the avian dataset with 1500bp and 1X ILS. On this limited dataset, Negative numbers indicate ASTRAL-III-beta has a larger value (i.e., has higher error, larger search space, better quartet scores, and is slower) we ran ASTRAL-III in its exact mode (i.e., setting X to the power set) with 100 gene trees. Without any trimming of the dynamic programming (i.e., without features 5 and 6), the running time was 40 minutes. Emulating ASTRAL-III-beta, we disabled α-trimming but kept the trimming (feature 5) and the running time reduced to 33 min. Adding the α-trimming feature dramatically reduced the running time to 13 min. Thus, when X includes many bipartitions that have very little promise in improving the quartet score (as in the exact mode of ASTRAL), the α-trimming approach is very effective in reducing the running time.