Gene tree discordance and the multispecies coalescent
We begin by briefly reviewing the multispecies coalescent model. Under the model, the genealogical history of orthologous lineages from k species is modeled backward in time conditional on a fixed rooted species tree with topology and branch lengths specified. Looking back in time, lineages from a pair of species cannot share common ancestry more recently than the time at which the species share common ancestry (Fig. 1). As a result, conditional on the species tree, not all topologies are equally likely for the gene tree; moreover, a random sample of gene trees that have evolved on the species tree contains information about the species tree topology and branch lengths [24]. In a general treatment of the model, the number of lineages per species is arbitrary, but here we restrict attention to one lineage per species.
Studies of the properties of inference methods applied to sets of gene trees produced under the model can make use of analytical formulas for the probability distribution of gene tree topologies conditional on a species tree [22, 25]. Such formulas employ the species tree topology and branch lengths as parameters, producing a discrete distribution that contains a probability for each possible gene tree topology. This distribution is complex, potentially with significant weight on gene tree topologies that disagree with the species tree, and its properties can differ substantially for species trees with different topologies and different numbers of species [25–28]. In general, under the model, the extent of the disagreement of gene tree topologies with species tree topologies increases as branch lengths in species trees decrease [9, 25], particularly when multiple short branches occur in succession [29].
A key quantity in evaluating gene tree probabilities is a function g
i,j
(T) that computes the probability that exactly i−j coalescent events happen in time T, beginning from i lineages at time 0 [30]:
$$ g_{i,j}(T) = \sum_{k=j}^{i}\frac{e^{-k(k-1)T/2} (2k-1)(-1)^{k-j}j_{(k-1)} \, i_{[k]}}{j! \, (k-j)! \, i_{(k)}}, $$
(1)
where a
(k)=a(a+1)…(a+k−1), a
[k]=a(a−1)…(a−k+1), and a
(0)=a
[0]=1. T is measured in coalescent time units, representing a number of generations normalized by the number of gene copies of a locus present in a population (2N for diploids, where N is the effective population size measured as a number of individuals).
Bipartitions
A tree with k leaf nodes has 2k−3 bipartitions: k−3 nontrivial bipartitions in which each of the subsets has at least two leaves, and k trivial bipartitions produced from cuts that separate one leaf from the other k−1 leaves. The k trivial bipartitions appear in every tree topology with a fixed leaf label set; henceforth we assume that bipartitions are nontrivial unless otherwise noted. The number of leaves in the larger of the two leaf subsets of a (nontrivial) bipartition is at most k−2. The bipartition separating, for example, taxa A and B from taxa C and D, is annotated AB |CD (Fig. 1).
Consider a species tree and a gene tree—both on the same taxon set—in which one gene tree lineage is sampled per species. We say that a nontrivial bipartition ϕ of the species tree is observed in the gene tree if for some internal node of the gene tree, a cut on that branch produces the bipartition ϕ of the leaf nodes. For a set \(\mathcal {G}\) of gene trees, if each of the k−3 nontrivial bipartitions of a species tree S is observed for at least one gene tree in the set, we say that \(\mathcal {G}\) is a bipartition cover of S.
For gene trees and species trees sharing the same set of k taxa, our goal is to study the probability that a random gene tree set \(\mathcal {G}\) containing n gene trees sampled under the multispecies coalescent model is a bipartition cover of a species tree S. We then use this calculation to set an upper bound on the number of loci n required so that with a specified minimum probability, a random n-locus gene tree set is a bipartition cover of S.
Exact computation for four-taxon species trees
We first calculate for four-taxon species trees the exact probability that a gene tree set is a bipartition cover of a species tree. A four-taxon species tree S has only one nontrivial bipartition (Fig. 1), which appears in five of the 15 rooted gene tree topologies. Consider a species tree whose nontrivial bipartition is AB |CD. This bipartition appears in the gene trees with topologies ((AB),(CD)), (((AB),C),D), (((AB),D),C), (((CD),A),B), and (((CD),B,A)).
We compute the probability that a set \(\mathcal {G}\) of gene trees is a bipartition cover for a four-taxon species tree S with bipartition AB |CD. Because the species tree has only one nontrivial bipartition, all that is required is for one of the gene trees in \(\mathcal {G}\) to have one of the five topologies with the bipartition AB |CD. For four-taxon species trees, it is straightforward to calculate the probabilities under the multispecies coalescent model of each of the 15 gene tree topologies [27]. The probability that a gene tree possesses the species tree bipartition and hence is a bipartition cover is the sum of the probabilities of the five gene tree topologies with bipartition AB |CD.
We must consider two cases, in which S represents the symmetric (Fig. 2
a) or asymmetric species tree topology (Fig. 2
b). Employing tabulations of gene tree probabilities for four-taxon species trees ([27], Tables 4 and 5), we examine both species tree topologies, denoting the probability that a gene tree has bipartition AB |CD in the symmetric case by \({P^{s}_{1}}\) and in the asymmetric case by \({P^{a}_{1}}\). The subscript 1 indicates that this quantity is for a single gene tree; we will generalize to sets of n gene trees in the next step. Labeling the species tree branch lengths in coalescent time units by T
1 and T
2 as in Fig. 2, in the symmetric case,
$$\begin{aligned} {P_{1}^{s}} = \ & \left[ g_{2,1}(T_{1})g_{2,1}(T_{2}) + \frac{1}{3}g_{2,1}(T_{1})g_{2,2}(T_{2}) \right. \\ \ & + \left. \frac{1}{3}g_{2,2}(T_{1})g_{2,1}(T_{2}) + \frac{1}{9}g_{2,2}(T_{1})g_{2,2}(T_{2}) \right] \\ \ & + \left[\frac{1}{3}g_{2,1}(T_{1})g_{2,2}(T_{2}) + \frac{1}{18}g_{2,2}(T_{1})g_{2,2}(T_{2})\right] \\ \ & + \left[\frac{1}{3}g_{2,1}(T_{1})g_{2,2}(T_{2}) + \frac{1}{18}g_{2,2}(T_{1})g_{2,2}(T_{2})\right] \\ \ & + \left[\frac{1}{3}g_{2,2}(T_{1})g_{2,1}(T_{2}) + \frac{1}{18}g_{2,2}(T_{1})g_{2,2}(T_{2})\right] \\ \ & + \left[\frac{1}{3}g_{2,2}(T_{1})g_{2,1}(T_{2}) + \frac{1}{18}g_{2,2}(T_{1})g_{2,2}(T_{2})\right]. \\ \end{aligned} $$
For the asymmetric case,
$$\begin{aligned} {P_{1}^{a}} = \ & \left[ \frac{1}{3}g_{2,1}(T_{1})g_{2,2}(T_{2}) + \frac{1}{9}g_{2,2}(T_{1})g_{3,2}(T_{2}) \right. \\ \ & + \left. \frac{1}{9}g_{2,2}(T_{1})g_{3,3}(T_{2}) \right] \\ \ & + \left[ g_{2,1}(T_{1})g_{2,1}(T_{2}) + \frac{1}{3}g_{2,1}(T_{1})g_{2,2}(T_{2}) \right. \\ \ & + \frac{1}{3}g_{2,2}(T_{1})g_{3,1}(T_{2}) + \frac{1}{9}g_{2,2}(T_{1})g_{3,2}(T_{2}) \\ \ & \left. + \frac{1}{18}g_{2,2}(T_{1})g_{3,3}(T_{2})\right] \\ \ & + \left[ \frac{1}{3}g_{2,1}(T_{1})g_{2,2}(T_{2}) + \frac{1}{9}g_{2,2}(T_{1})g_{3,2}(T_{2})\right.\\ \ & \left. +\frac{1}{18}g_{2,2}(T_{1})g_{3,3}(T_{2})\right] \\ \ & + \left[ \frac{1}{18}g_{2,2}(T_{1})g_{3,3}(T_{2}) \right] + \left[ \frac{1}{18}g_{2,2}(T_{1})g_{3,3}(T_{2}) \right]. \\ \end{aligned} $$
The five terms demarcated by brackets in \({P_{1}^{s}}\) and \({P_{1}^{a}}\) give the probabilities of the five gene tree topologies with bipartition AB |CD: ((AB),(CD)), (((AB),C),D), (((AB),D),C), (((CD),A),B), and (((CD),B,A)), respectively.
Simplifying these equations using Eq. 1, we find that
$$\begin{array}{@{}rcl@{}} {P_{1}^{s}} & = & 1-\frac{2}{3}e^{-(T_{1}+T_{2})} \end{array} $$
(2)
$$\begin{array}{@{}rcl@{}} {P_{1}^{a}} & = & 1-\frac{2}{3}e^{-T_{1}}. \end{array} $$
(3)
Note that these two equations are similar in that in each case, the quantity in the exponent, T
1+T
2 or T
1, corresponds to the length of the only internal branch of the unrooted species tree (Fig. 2).
Equations 2 and 3 give the probabilities that a single gene tree is a bipartition cover of the species tree, in the symmetric and asymmetric cases, respectively. Recall that our goal is to calculate the probability that a set \(\mathcal {G}\) of n gene trees is a bipartition cover, or that the species tree bipartition is observed in at least one of n sampled gene trees. This quantity—\({P_{n}^{s}}\) in the symmetric case and \({P_{n}^{a}}\) in the asymmetric case—is 1 minus the probability that the bipartition is observed in none of the n trees. Because each gene tree is independent conditional on the species tree, we have
$$\begin{array}{@{}rcl@{}} {P_{n}^{s}} & = & 1 - (1-{P^{s}_{1}})^{n} \end{array} $$
(4)
$$\begin{array}{@{}rcl@{}} {P_{n}^{a}} & = & 1 - (1-{P^{a}_{1}})^{n}. \end{array} $$
(5)
In Fig. 3, we plot \({P_{n}^{a}}\) as a function of the number of loci n for several fixed values of T
1; the behavior of \({P_{n}^{s}}\) is analogous, except with T
1 replaced by T
1+T
2. For each value of T
1, \({P_{n}^{a}}\) increases with n, approaching 1 as n→∞. For larger T
1, the initial probability that a single gene tree has bipartition AB |CD is greater, so that the number of gene trees required before \({P_{n}^{a}}\) achieves a specified value is smaller. As T
1→0, gene trees approach a scenario in which the gene lineages from species A, B, and C persist into the common ancestor of the three species. Each possible sequence of coalescences among these three lineages is equally likely, and the probability that a random gene tree contains the nontrivial bipartition AB |CD is \({P_{1}^{a}} = \frac {1}{3}\). P
n
then approaches \(1-(\frac {2}{3})^{n}\).
A general upper bound for k-taxon species trees
For k>4, the number of nontrivial bipartitions in a k-taxon species tree exceeds 1, and the event that a random gene tree possesses a nontrivial species tree bipartition ϕ
1 is not independent of the event of its possessing a second such bipartition ϕ
2. To perform a comparably simple calculation in the general k-taxon case to that achieved in the four-taxon case, we focus on deriving a lower bound on the probability that a random n-locus gene tree set \(\mathcal {G}\) is a bipartition cover of a k-taxon species tree S.
Let S be a rooted k-taxon species tree with fixed topology and branch lengths. Denote the k−3 nontrivial bipartitions of S by ϕ
1,ϕ
2,…,ϕ
k−3. Denote the k−2 internal branches of S by e
1,e
2,…,e
k−2, with associated lengths T
1,T
2,…,T
k−2. If one side of the root of S has only a single leaf, then the internal branch immediately descended from the other side is associated with a trivial bipartition. We indicate this internal branch by e
c
, with c∈{1,2,…,k−2}, and we denote its associated branch length T
c
. If both sides of the root of S each have at least two descendant leaves, then each of the k−2 internal branches is associated with a nontrivial bipartition, and the two branches immediately descended from the root share the same bipartition. We indicate by {1,2,…,k−2}∖c the set of indices for internal branches that produce nontrivial bipartitions, understanding that if the two sides of the root each have at least two descendant leaves so that e
c
does not exist, this index set reduces to {1,2,…,k−2}.
Let E
i,n
be the event that bipartition ϕ
i
is observed at least once in a set \(\mathcal {G}\) of n random gene trees, and let \(Q_{i,n} = \mathbb {P}[E_{i,n}]\) be the associated probability that at least one of n random gene trees possesses ϕ
i
. Then E
n
=E
1,n
∩E
2,n
∩⋯∩E
k−3,n
is the event that a gene tree set \(\mathcal {G}\) with n gene trees is a bipartition cover of S. Denote by \(Q_{n} = \mathbb {P}[E_{n}]\) the probability that a random gene tree set is a bipartition cover: that among n gene trees, all bipartitions of S appear at least once.
The Q
i,n
have a complex dependence, so that if a gene tree possesses one of the bipartitions ϕ
i
, its conditional probability of possessing another bipartition ϕ
j
might substantially increase in relation to the unconditional probability. Our strategy for bounding the desired probability Q
n
from below amounts to supposing that each bipartition ϕ
i
is as improbable as the least-probable bipartition and bounding the probability of the least-probable bipartition from below (Lemma 1). We then disregard the dependence among the Q
i,n
to bound from below the joint probability that all of the E
i,n
are observed in a gene tree set (Theorem 2).
Let T
min= mini∈{1,2,…,k−2}T
i
denote the length of the shortest internal branch in the species tree S. We obtain a lower bound on Q
i,n
, which we then use to bound Q
n
. Our lower bound for Q
n
is a function of only k, T
min, and n, and it can be inverted to produce an upper bound on the smallest n that achieves a desired minimal value for Q
n
.
Lemma 1
mini∈{1,2,…,k−3}Q
i,n
≥1−[1−g
k−2,1(T
min)]n.
Proof
Consider Q
i,n
for some i. Q
i,n
is the probability that bipartition ϕ
i
is observed in at least one of n random gene trees that are conditionally independent given the species tree. It therefore equals 1 minus the probability that ϕ
i
fails to be observed in all n gene trees: Q
i,n
=1−(1−Q
i,1)n. Because for fixed n≥1, the function 1−(1−x)n increases monotonically in x on [0,1],
$$\begin{array}{@{}rcl@{}} \min_{i \in \{1, 2, \dots, k-3 \}} Q_{i,n} & = & \min_{i \in \{1, 2, \dots, k-3 \}} [ 1-(1-Q_{i,1})^{n} ] \\ & = & 1- \left(1- \min_{i \in \{1, 2, \dots, k-3 \}} Q_{i,1} \right)^{n}. \end{array} $$
(6)
To produce a lower bound on mini∈{1,2,…,k−3}Q
i,n
, it remains to bound mini∈{1,2,…,k−3}Q
i,1 from below. A sufficient condition for bipartition ϕ
i
to be observed in a gene tree is for all the lineages descended from the internal branch \(e_{\phi _{i}}\) associated with ϕ
i
in the species tree to coalesce to a single lineage on that branch. In case ϕ
i
is associated with two internal branches—the two immediately descended from the root on opposite sides—it is sufficient for the lineages on one side to coalesce to a single lineage on the internal branch associated with that side. Supposing that k
i
is the number of taxa descended in S from branch e
i
and T
i
is the branch length for e
i
, the probability Q
i,1 that ϕ
i
is observed in a single gene tree is therefore bounded below by \(\phantom {\dot {i}\!}g_{k_{i,1}}(T_{i})\), and:
$$ \begin{aligned} 1~- \ & \left(1- \min_{i \in \{1, 2, \dots, k-3 \}} Q_{i,1} \right)^{n} \\ \ & \geq 1- \left[ 1-\min_{i \in \{1, 2, \dots, k-2 \} \setminus c} g_{k_{i},1}(T_{i}) \right]^{n}. \end{aligned} $$
(7)
In this step, although the species tree has k−3 nontrivial bipartitions, it has k−2 internal branches, one of which possibly produces a trivial bipartition. If cuts on two of the k−2 internal branches, say j
1 with \(k_{j_{1}}\) descendant leaf nodes and j
2 with \(\phantom {\dot {i}\!}k_{j_{2}}\) descendant leaf nodes, produce the same (nontrivial) bipartition ϕ
i
, then \(\phantom {\dot {i}\!}Q_{i,1} \geq g_{k_{j_{1}},1}(T_{j_{1}})\) and \(Q_{i,1} \geq g_{k_{j_{2}},1}(T_{j_{2}})\phantom {\dot {i}\!}\).
The quantity \(g_{k_{i},1}(T_{i})\phantom {\dot {i}\!}\)—the probability that k
i
lineages coalesce to 1 lineage during time T
i
—decreases monotonically with increasing k
i
, and increases monotonically with increasing T
i
. Because a species tree internal branch associated with a nontrivial bipartition has at most k−2 descendant leaves, and because the shortest internal branch length is T
min,
$$ g_{k_{i},1}(T_{i}) \geq g_{k-2,1}(T_{i}) \geq g_{k-2,1}(T_{\min}). $$
(8)
This condition applies to each of the k−2 internal branches—including both immediately descended from the root in the case that the root does not have a pendant edge as one of its descendants. We take the minimum over internal branches that produce nontrivial bipartitions to obtain
$$ \min_{i \in \{1, 2, \dots, k-2 \} \setminus c} g_{k_{i},1}(T_{i}) \geq g_{k-2,1}(T_{\min}). $$
(9)
We can connect inequalities 6, 7, and 9 to conclude
$$ \min_{i \in \{1, 2, \dots, k-3 \}} Q_{i,n} \geq 1- [ 1 - g_{k-2,1}(T_{\min})]^{n}. $$
(10)
We thus have the desired result. □
The approach of this proof amounts to replacing the species tree S with \(S_{T_{\min }}\), a tree with the same topology as S but with all internal branch lengths set to T
min, the minimum branch length in S. Next, it is noted that each bipartition is at least as probable as the least probable bipartition. The probability of the least probable bipartition is then bounded from below by computing a lower bound on one specific way of observing an arbitrary bipartition: the probability of a bipartition is at least as great as the probability that all of the lineages for leaves that descend from its associated internal edge coalesce on that edge.
Now that we have a lower bound for the probability of an arbitrary bipartition, it remains to simultaneously consider all k−3 bipartitions.
Theorem 2
Q
n
≥1−(k−3)[1−g
k−2,1(T
min)]n.
Proof
As the probability of an intersection, Q
n
can be written \(Q_{n} = \mathbb {P}[E_{n}] = \mathbb {P}[\bigcap _{i=1}^{k-3} E_{i,n}]\). The minimal probability of the intersection of a set of possibly dependent events can be bounded by Bonferroni’s inequality [31]. It follows that
$$ Q_{n} \geq 1 - \sum_{i=1}^{k-3} \mathbb{P}[\overline{E}_{i,n}], $$
(11)
where \(\overline {E}_{i,n}\) is the complement of event E
i,n
.
We then have
$$ \begin{aligned} Q_{n} & \geq 1 - \sum_{i=1}^{k-3} (1 - \mathbb{P}[E_{i,n}]) \\ & = \left(\sum_{i=1}^{k-3} Q_{i,n} \right)- \left(k-4\right) \\ & \geq \left(k-3\right)\left(\min_{i \in \{1, 2, \dots, k-3 \} }Q_{i,n}\right) - (k-4). \\ \end{aligned} $$
(12)
We invoke Lemma 1 to obtain mini∈{1,2,…,k−3}Q
i,n
≥1−[1−g
k−2,1(T
min)]n, from which
$$ Q_{n} \geq 1 - (k-3)[1 - g_{k-2,1}(T_{\min})]^{n}. $$
(13)
This completes the proof. □
Note that given the species tree S, for small values of n, it is possible for (k−3)[1−g
k−2,1(T
min)]n≥1, so that the theorem produces a negative value for the lower bound on Q
n
. Because Q
n
is a probability, in these cases, we have the trivial result Q
n
≥0. As n increases, however, eventually (k−3)[1−g
k−2,1(T
min)]n<1, so that in the theorem, Q
n
is bounded from below by a positive quantity.
By solving for n, for a specified probability q, Eq. 13 can be used to calculate an upper bound on the minimal value of n for which Q
n
≥q. Setting Q
n
=q for 0<q<1,
$$ n = \frac{\log [(1-q)/(k-3)]}{ \log [1 - g_{k-2,1}(T_{\min})]}. $$
(14)
Equation 14 gives an upper bound on the number of sampled gene trees required for a random gene tree set to be a bipartition cover with probability at least q. It applies irrespective of the species tree topology and branch lengths.
Influences on the upper bound
For fixed values of q, we numerically computed the number of gene trees n required for achieving Q
n
≥q in Eq. 14. In Fig. 4, we plot log10(n) as a function of the number of taxa k for a range of minimum branch lengths and q=1−10−2 and q=1−10−5.
When T
min=1 or T
min=0.5, so that the shortest internal branch length in the species tree has a value of 1 or 0.5 coalescent time units, n grows slowly as a function of k and remains less than 104 for species trees containing up to 30 species. By contrast, when T
min=0.2 or T
min=0.1, species trees with up to k=8 taxa have n<104, but the number of gene trees n grows rapidly and exceeds 104 for larger k. The patterns are fairly insensitive to the value of q, as q contributes to Eq. 14 only via the logarithmic term log(1−q).
Accuracy of the upper bound
We next compared our upper bound on the number of loci required to produce a bipartition cover with probability q (Eq. 14) to values of this number of loci obtained in stochastic simulations under the multispecies coalescent. The simulations allow us to quantify the extent to which our upper bound overestimates the true number of required gene trees.
Simulations were conducted using COAL [25] to compute the exact multinomial distribution of gene tree topologies for “caterpillar” species trees in which all branch lengths were set to T
min. The caterpillar case represents a difficult scenario for species tree inference, as the extent of gene tree discordance can be greater with caterpillar species trees than other species tree topologies [28, 29, 32, 33]. For fixed values of n
s
, the number of simulated gene trees in gene tree sets, we resampled 104 independent gene tree sets from this exact multinomial distribution, identifying for each gene tree set all gene tree clades that appeared in at least one of the random gene trees. This clade identification step was conducted using Biopython [34].
Next, we recorded the empirical proportion of simulations in which the n
s
gene trees produced a bipartition cover of the species tree. Treating this empirical probability of a bipartition cover as an estimate of \(Q_{n_{s}}\), we then computed the number of loci n in Eq. 14 using the estimated \(\hat {Q}_{n_{s}}\) for q, denoting this number of loci n
b
. The ratio \(\frac {n_{b}}{n_{s}}\) represents the factor by which our upper bound on the minimum number of loci required for producing a bipartition cover exceeded the actual number of loci required in simulated gene tree sets. A value of \(\frac {n_{b}}{n_{s}}=1\) indicates that our upper bound is accurate; values larger than 1 indicate that our upper bound overestimates the number of required gene trees by a factor of \(\frac {n_{b}}{n_{s}}\).
Figure 5 presents \(\frac {n_{b}}{n_{s}}\) as a function of q. In each panel, representing different values of T
min, \(\frac {n_{b}}{n_{s}}\) is relatively close to 1 for k=4 taxa, indicating a reasonably accurate upper bound. As k increases, \(\frac {n_{b}}{n_{s}}\) progressively increases as well. For small k, with relatively few internal branches, fewer ways exist for coalescent events to occur other than on the internal branch of minimum length, so that our consideration of only those coalescences in obtaining the bound disregards fewer alternative ways of producing bipartitions. It hence produces a more accurate n
b
.
Comparing the three panels of Fig. 5, we see that \(\frac {n_{b}}{n_{s}}\) is smaller and the bound n
b
is therefore tighter when T
min is large than when T
min is small. For small T
min, it is unlikely that all lineages below a species tree branch of length T
min will coalesce on the branch, so that our consideration of only the case in which such coalescences occur in producing Eq. 14 is less accurate. For each T
min value, the level of overestimation does not strongly depend on the value of q, especially for q near 1.