Properties of minimal descriptor ARGs
Given the topology of an ARG G with the embedded (marginal) trees and annotations representing genetic events, we have seen that this defines an unambiguous genetic flow giving rise to the samples S(G). This annotation also implicity defines the segments associated with each node. The M embedded trees in G correspond to M segments on the chromosome of the K samples which is encoded by the leaf nodes of G. We assume that these M segments of interest are consecutive on the chromosomes of the samples. Thus these trees can be numbered by consecutive integers from 1 to M, the values indicating the order on the chromosomes. Thus the multiple edges of G (defined in the last section as colored edges) implies an annotation of a node v as well. We define this as sg(v) which is formalized below.
Definition 6 ( sg(v) overlap) Given node v in an ARG G, sg(v) is the set of the embedded trees that v is incident on. Two vertices v1and v2in G are said to overlap if sg(v1) ∩ sg(v2) ≠ ø.
In [14], this is defined as gm(v), however in this paper we use sg(.) to avoid confusion with the SNPs as genetic events. For example, consider the marked node v of Fig 2(a). Let the red, green and blue trees be numbered 1, 2 and 3 respecting the order in which they appear on the chromosomal segment as defined in the samples in Fig 5. Then sg(v) = {1, 2}. Also, let the leaf nodes marked 1, 2 and 3 of the same tree in Fig 6(a) be u1,u
2
,u
3
respectively. Then
sg( u1) = sg(u
2
) = sg(u
3
) = {1, 2, 3}.
Fig 6(b) displays these segments at each node. To simplify the exposition, assume that the M segments are numbered consecutively from 1 to M.
Definition 7 (gapped/ non-gapped node) A node v in ARG G is called a gapped node if the elements of sg(v) are not consecutive. A non-gapped node is a node that is not a gapped node.
Note that when the elements of sg(v) are not consecutive, or v is gapped, it implies that while tracking the relevant chromosomal segments in the ARG, v appears to be carrying segments that matter (i.e., are ancestral to the corresponding segments of some of the samples) and are interspersed with segments that do not matter (i.e., are not ancestral to the corresponding segments in any of the samples).
On boundedness of ARGs
A generic ARG is not necessarily bounded, i.e, it can have an infinite number of vertices. See Fig 7(a) for an example of such an ARG.
Theorem 3
An unbounded ARG G always has a bounded minimal descriptor.
Proof: We prove this by constructing a bounded minimal descriptor from the unbounded G. Since G is unbounded, G has no GMRCA. Obtain G′ from G by removing all coalescent vertices that are not t-coalescent. Then by definition G′ is structure-preserving and S(G) = S(G′).
By Thm 1 (2), G′ has M embedded trees. For each embedded tree T
i
, let L
i
be the set of LCAs in G. Note that the LCA of a tree is completely defined topologically. Let
. Then we prove the following:
The only infinite chains in G′ originate in a vertex in L and they are disjoint.
First we need to show that any infinite chain in G′ must originate in a vertex in L. Assume the contrary that there is a chain originating in v ∉ L. Let u be an extant sample reachable from v and let i ∈ sg(u), then v must be an LCA in tree i and then v ∈ L by construction, which is a contradiction. Next we need to show that no two chains overlap. Assume the contrary that a pair of infinite chains, one originating in v
1
∈ L and the other in v
2
∈ L, cross paths, say at u. Since it is a chain, u is a coalescent node and not the GMRCA. If sg( v1) ∩ sg(v
2
) = ø, then it contradicts the fact all (non-GMRCA) coalescent nodes in G′ must be t-coalescent. But if i ∈ sg( v1) ∩ sg(v
2
), then it contradicts the fact that v
1
and v
2
are LCAs of the embedded tree i. Hence no two chains can cross.
(B) and (C) prove the statement. Then since the preconditions hold, we can apply the 1-vertex compactification to G′ by adding the new vertex v′ and obtain G″. To show that G″ is a minimal descriptor of G using Definition 5, we next assert the following three statements:
(i) S(G″) = S(G), (ii) G″ preserves the structure of G and (iii) v′ is the GMRCA of G″.
From (A) and the definition of the 1-vertex compactification process, statements (i) and (ii) hold. Also by the construction, v′ is the LCA of all the vertices in L. Hence by Thm 1 (3), v′ is the GMRCA of G″. This concludes that G″ is a minimal descriptor of G with the GMRCA v′, hence bounded.
Corollary 1
The degree of the GMRCA in a minimal descriptor is ≤ MK.
This follows from the cardinality of C in the proof. Since each segment can have no more than K LCAs, |L| ≤ KM.
Fig 7 illustrates the construction procedure used in the proof of the theorem on a simple example with K = M = 2. The corollary states that even if the underlying ARG G is unbounded, there exists a minimal descriptor with a GMRCA with not just finite but a priori bounded degree.
Back to properties of ARGs
The previous section gave a global characteristic of a minimal descriptor ARG in terms of resolvable nodes. In this section we explore properties of nodes of the minimal descriptor ARG that can be determined by studying a very local neighborhood of the node.
Theorem 4 1. If a coalescent node v, except the GMRCA, in a minimal descriptor ARG has descendants u1, u2, .., u
l
, for some l > 1, then for any u
i
, there exists u
j
, i ≠ j, such that
sg(u
i
) ∩ sg(u
j
) ≠ ø.
In other words, for each descendant u
i
of v there exists another descendant u
j
of v overlapping with u
i
,1 ≤ i ≠ j ≤ l.
2. The number of vertices in a minimal descriptor ARG is finite. Moreover, if n
c
the number of coalescent events, n
e
is the number of genetic exchange events, and n
v
the number of vertices in a nontrivial minimal descriptor ARG, excluding the leaf nodes, then the following holds:
-
1
≤ n
c
≤ M(K – 1) + 1, (1)
-
0
≤ n
e
≤ K(M – 1) + M(K – 1), (2)
n
v
= O(MK). (3)
Proof: (1) Since, by Definition 5, every coalescent node in the minimal descriptor is t-coalescent, this must hold. (2) By Thm 3, every minimal descriptor ARG has a GMRCA. Also the GMRCA is a coalescent node. In G there are kM segments that eventually coalesce into M segments. The smallest number of coalescences occurs when all vertices coalesce into a single vertex, the GMRCA, giving a lower bound of 1. Again, by (1), since every coalescent vertex must be the coalescence in at least one embedded tree, the number of coalescent vertices, excluding the GMRCA, in a minimal descriptor ARG is no more than (K – 1)M. By Thm 1 (3), including the GMRCA introduces only one more node and this proves n
c
≤ M(K – 1) + 1 of Eqn 1.
By the definition of the parameters of the ARG, G must have at least one genetic exchange vertex encoding (M – 1) genetic exchange events. However, the above count excludes the leaf nodes and it is possible to encode these (M – 1) events in a single genetic exchange leaf node. Hence a lower bound of 0 for n
e
. When a vertex in the ARG is displayed as a linear ordering of the non-mixing segments with distinct colors for each segment (as in Fig 6(b)), then the potential junctions for the exchange events are at the junction of the colored segments. Recall that by the definition of the ARG, each unit has at most M non-mixing segments, hence can have no more than M – 1 genetic exchange events. Thus there are K(M – 1) such junctions potentially each representing a recombination (or exchange) event. We adopt the following convention: each non-mixing segment in a vertex v contributes to a junction to its left. Thus, by this convention, the left-most segment has no associated junction. See vertex v in Fig 8 as an illustration. Each distinct non-mixing segment is shown by a distinct color; gap is shown as a white segment and junctions are marked by arrows of the same color as that of the segment associated with it. Thus the green, red, blue, magenta colored segments show the associated junctions, and the leftmost green segment has none in Fig 8(b). We call the non leftmost as interior segments. Also, note that a gap does not contribute to a potential junction by our convention. For a recombination event to occur multiple times at the same junction involving the same or similar set of samples, it is clear that a coalescence must occur. Then the following can be verified.
Each coalescent event of an embedded (marginal) tree can create at most one additional junction in the coalescing node.
Consider Fig 8. In (a), v is not t-coalescent. In (b) four coalescences in four embedded trees produces three junctions in the coalescing node. Since the total number of coalescent events in the embedded trees is no more than M(K – 1), the additional junctions is bounded by the same number. Hence n
e
≤ K(M – 1) + M(K – 1) of Eqn 2. See Figs 9-10 for illustration of other scenarios where there is no increase in the number of junctions in the coalescing vertices. The coalescent events are absorbed in the worst case scenario in the count of genetic exchange events. Next, Eqn 3 follows from Eqns 1 and 2.
Corollary 2
The minimal descriptor satisfies the bounded-degree property (of Thm 1 (1)).
By Corollary 1, there is an a priori bound on the GMRCA. By the theorem there is such a bound on all the other vertices. Hence the result.
Binary ARGs
The sampling algorithms incorporating the coalescent process produces ARGs that bound the indegree and outdegree of a node to two [3]. We call such ARGs as binary ARGs and they also give stronger characteristics that can be further exploited by the sampling algorithms. We find that, to prove these characteristics, it is not necessary to restrict the incoming edges of a node to two.
Definition 8 (binary ARG)
A vertex in a binary G has no more than two outgoing edges, except the GMRCA.
We next identify some properties on these binary ARGs that can be again used in the sampling algorithms, if required. Consider the scenario where the genetic exchange event is restricted to recombinations. In other words if a node v has two incoming edges them the genetic material of v is split such that the left part of the segment is derived from one of the parental nodes and the right part is derived from the other parent. This is in contrast to an arbitrary segment being derived from one and the remaining from the other parental node. Next, we prove a rather unexpected property of a node of a binary minimal descriptor ARG under recombinations: the genetic material carried by a node has no gaps. This result is somewhat counter-intuitive (since it appears unduly restrictive and is counter to earlier beliefs) and is proved in (2) below.
Theorem 5
1. If n
c
is the number of coalescent events, n
e
the number of genetic exchange events, and n
v
the number of vertices in a nontrivial binary minimal descriptor, excluding the leaf nodes, then the following holds:
(K – 1) ≤ n
c
≤ M(K – 1) + 1, (4)
-
0
≤ n
e
≤ K(M – 1), (5)
n
v
= O(MK). (6)
2. Let (a) the genetic exchange events be restricted to recombinations and (b) all the leaf nodes be non-gapped. Then no node in a binary minimum descriptor is gapped.
Proof (1) The proof of Eqn 4 is along the lines of that of Eqn 1. We show n
e
≤ K(M - 1) of Eqn 5 as follows. By the definition of the ARG, each unit has at most M non-mixing segments, hence can have no more than M – 1 genetic exchange events. Thus there are K(M – 1) such junctions. For a recombination event to occur multiple times at the same junction involving the same or similar set of samples, it is clear that a coalescence must occur. Following the notation used in the proof of Thm 4, We next prove the following statement:
If there is some overlap in the coalescing vertices u
1
,u
2
, then there is no increase in the number of junctions from the sum total in u
1
and u
2
to that of the coalesced vertex v.
Every non-mixing segment of v corresponds to the same non-mixing segment in at least one of the descendants u
1
,u
2
. There can be an increase in the junctions in v if and only if a leftmost non-mixing segment in say u1, is not the leftmost in v. It cannot be a leftmost in u
2
as well, since it is not a leftmost in v. Thus it is an interior in u
2
. Thus this segment has a corresponding junction in u
2
that contributes to the junction in v without introducing an increase in the count of junctions. Since a coalescence does not increase the count of junctions, n
e
≤ K(M – 1) holds. Next, Eqn 6 follows from Eqns 4 and 5. This completes the proof of Statement (1).
(2) Assume the contrary, i.e., there exist gapped nodes in G. Amongst all such nodes, consider a least node v, i.e., v is such that there is no other gapped node in all the paths from v to the reachable nongapped leaf nodes. Let v have only one child u. Since the only genetic exchange event is recombination, u must be gapped for v to be gapped. This leads to a contradiction that v is the least such node. Then v must have two children. Consider the two following cases. Case i: Let the two descendant nodes of v be coalescent nodes. By the assumption each of them is non-gapped and by Thm 4 (1) the two must overlap. Hence v must be non-gapped as well. Case ii: Let at least one of the descendant nodes of v, say u, be a genetic-exchange node. Again, since the only genetic exchange event is recombination, the segment transmitted by u to v is nongapped. Hence by Thm 4 (1) v is nongapped. Thus for v to be gapped at least one of its two descendants must be gapped, leading to a contradiction.
Comparison with the standard coalescent
We introduce a definition of equivalence of ARG instances here.
Definition 9 (equivalent ARG instances) Let G and G′ be two ARG instances with G = ∪
T∈Τ
T and G′ = ∪
T∈Τ′
T where each T is a tree. G and G′ are said to be equivalent if and only if the following conditions hold. (1) S(G) = S(G′), i.e., both define the same set of samples and (2) there exists a bijection f : Τ→Τ′ such that for every T∈Τ, f(T) is isomorphic to T via an edge-length as well as leaf-label preserving isomorphism.
A graph-theoretic isomorphism definition, for the equivalence of two ARG instances, is unduly rigid and this weaker, but more relevant, definition of equivalence is derived from the typical handling of ARGs in literature [13, 17–19]. Next, we prove the following.
Theorem 6
Given an instance of ARG G, its minimal descriptor G
md
is equivalent to G. (In this sense, the minimal descriptor of a binary ARG is the standard coalescent model.)
The equivalence of G and G
md
follows from Defns 5 and 9. (Further, since the binary ARG is an alternative modeling of the standard coalescent, the minimal descriptor of the binary ARG is equivalent to the standard coalescent model.)
Furthermore, in our experiments that involved comparison with other models utilizing the recombination rate parameter, r, we enforce this parameter on the minimal descriptor to simplify the task of comparison. We observe empirically that properties such as embedded (marginal) tree branch length distribution, LD decay of samples etc match very well with that of MS[15] and GENOME[16] although the first method uses the coalescent time to the next event and the second carries out the simulation at every generation. However, that is not the case with the approximate models [17–19].
Estimating redundancy in an ARG
Recall that in G there is no vertex that has one descendant and one ascendant (Property 1 (1)). We call G reduced if it has no vertex that has only one descendant. Given G, if G′ is obtained from G by removing all and only those vertices that have single descendants then G′ is called as reduced G. Since a coalescent node cannot have a single descendant the following is easily verified: If G′ is a reduction of G then G′ is structure-preserving and S(G) = S(G′). A reduced G is a canonical form and then it is meaningful to compare the number of vertices between canonical forms.
Observation 4
If G
md
is a minimal descriptor of G, then the number of vertices in reduced G
md
, which is no more than in reduced G.
We have shown that an unbounded G (with infinite number of vertices) always has a minimal descriptor (with number of vertices = O(MK)). Hence we focus on a subspace here- that of binary minimal descriptors. To estimate their number, we use Thm 5 (2) that states no node in a binary minimal descriptor is gapped. Characterizing the type of nodes as gapped and non-gapped, we simply compare the cardinality of the respective ‘universal sets’ of nodes of binary ARGs and binary minimal descriptors, for a rough estimate. The following is the ratio of non-gapped to gapped configurations, where M is the usual parameter:
.
Sampling algorithm
An ARG construction has the following two independent tasks: (1) Constructing the topology of the structure including the lengths (or time estimates) of the edges. (2) Annotating the edges of the structure with genetic events, the number of the events (say, mutations) is a function of the length of the edges. The topology is a critical part of the ARG and the graph-theoretic treatment of the problem isolates the topology which has lead to various novel insights. Due to the fundamental characteristics in capturing the essence of the evolving population and its versatility, the standard coalescent approach [12, 15] can be used to estimate the lengths of the edges in the topology. Thus even with the graph-theoretic treatment of the problem, we appeal to essentials of coalescent theory for the sampling of the minimal descriptors. All the methods discussed in the background section are based on the standard coalescent model, which is analogous to the binary ARGs in this paper. Hence we focus only on this subspace of generic ARGs in this section.
Based on the models presented on this paper, there are at least two possible approaches to sampling the space: (1) One is sampling the space of standard ARGs. (2) Since a minimal descriptor is also an ARG the other approach is to directly sample the subspace of standard ARGs, corresponding to the binary minimal descriptors. In the first approach, a standard ARG is sampled and its minimal descriptor is extracted either as a post-processing step or simultaneously during the construction process. This approach has the advantage that the sampling distribution is exactly the same as that of the standard ARG, which is well studied in literature. The second approach is to directly sample a subspace of ARGs. A theoretical (time-expensive) sampling algorithm with the ‘true’ probabilities for a generic ARG is presented in [14]. Though extensively in use, the sampling distribution of the standard ARGs (binary) of the practical algorithms is not fully understood (see, for instance, discussions in [18]). Thus an understanding of the sampling distribution of a subspace of ARGs may be equally elusive.
There are two local properties of the vertices of a minimal descriptor: (1) every coalescent vertex is also a coalescent in one of the M embedded trees (Thm 2), and, (2) a coalescing vertex must overlap with another (Thm 4). Since these are local properties, it is possible to exploit one or both of these (related) properties in the sampling algorithms in both the above approaches. In our experiments we have used the overlap property of Thm 4 in the second approach of the last paragraph to sample the space of minimal descriptor of binary ARGs. To recapitulate, the input parameters to our sampling algorithm are: (a) N(≥ 1): population size is 2N at each generation. (b) K(> 1): the number of samples. (c) M(≥ 1) & (s0,s1,s2,…,s
M
): M is the number of nonmixing segments. The lengths of the segments are (s0,s1,s2,…,s
M
) that model varying recombination rates along the chromosome.