Representing and decomposing genomic structural variants as balanced integer flows on sequence graphs
 Daniel R. Zerbino^{1, 2}Email author,
 Tracy Ballinger^{1, 2},
 Benedict Paten^{2},
 Glenn Hickey^{2} and
 David Haussler^{2}
https://doi.org/10.1186/s1285901612584
© The Author(s) 2016
Received: 28 January 2016
Accepted: 13 September 2016
Published: 29 September 2016
Abstract
Background
The study of genomic variation has provided key insights into the functional role of mutations. Predominantly, studies have focused on single nucleotide variants (SNV), which are relatively easy to detect and can be described with rich mathematical models. However, it has been observed that genomes are highly plastic, and that whole regions can be moved, removed or duplicated in bulk. These structural variants (SV) have been shown to have significant impact on phenotype, but their study has been held back by the combinatorial complexity of the underlying models.
Results
We describe here a general model of structural variation that encompasses both balanced rearrangements and arbitrary copynumber variants (CNV).
Conclusions
In this model, we show that the space of possible evolutionary histories that explain the structural differences between any two genomes can be sampled ergodically.
Keywords
Background
Genomic studies, especially in the field of human health, generally do not focus on the majority of bases which are common to all individuals, but instead on the minute differences which are shown to be associated to a variable phenotype [1–4]. These variants are caused by diverse processes, which modify the genome in different ways. One common task is to find the evolutionary history which most parsimoniously explains the differences between an ancestral genome and a derived genome.
The best known variants are short single nucleotide variants (SNV) or multiple nucleotide variants (MNVs), which affect at most a handful of consecutive bases. These few bases are substituted, inserted or deleted, without affecting the neighbouring bases or the overall structure of the genome. Especially when only substitutions are taken into consideration, this process can be fully understood using mathematical tools [5]: not only is it trivial to describe a parsimonious history that explains the appearance of these variants with a minimum number of mutational events, but posterior likelihoods can be computed across the space of all possible histories.
However, rearrangement events sometimes change the overall structure of the genome without changing a base. Rearrangements can be decomposed into sequences of basic operations, known as double cut and joins (DCJ) [6]. In a DCJ operation the DNA polymer is cleaved at two loci then ligated again, so as to produce a new sequence from the same bases. A DCJ operation can further be decomposed into single cuts or joins (SCJ) [7]. The DCJ operation creates balanced rearrangements, i.e. without loss or gain of material. However, coupled with the loss or insertion of detached fragments, these DCJ operations can explain all structural variants, including copynumber variants (CNV) [8]. These SVs are known to have significant impact on phenotype and health [9, 10], but the combinatorial complexity of rearrangement models has restricted their study.
In the absence of CNVs, it is possible to compute a parsimonious history in polynomial time [6, 7, 11, 12], but computing its posterior likelihood against all possible histories is computationally prohibitive, as it supposes a random Markov walk across the symmetric group of order 2n, where n is the number of bases in the genome [13].
However, in the presence of CNVs, even computing a parsimonious history is difficult, and several teams have approached this problem with slightly different assumptions. Promising results were constructed around a model with a single whole genome duplication and local rearrangements, known as the Genome Halving Problem [14–18]. Some studies did not allow for duplicated regions [19, 20]. Others allowed for duplications, considered as independent events on atomic regions, and focusing on the problem of matching segmental copies [8, 21–24]. Bader generalized this model further [25, 26], allowing for larger duplications of contiguous regions along the original genome.
Other studies extended the SCJ model with an approximate algorithm based on a restricted model of cuts, joins, duplications and deletions [27]. Zeira and Shamir demonstrated the NPhardness of computing an optimal history with fewest duplications [28]. They nonetheless presented a linear time solution to an SCJ version of the Genome Halving Problem.
We described in [29] the history graph. This data structure represents genomes and their evolutionary relationships, allowing for substitutions as well as rearrangements and copynumber changes to occur, much like the trajectory graph defined by Shao [24]. This data structure explicitly represents genomic structure, and assumes that the extant and imputed ancestral genomes are phased, and therefore directly representable as sets of nucleotide sequences. In this model, we were able to compute, between bounds, the number of DCJ events separating two genomes with uneven content, allowing for wholechromosome duplications and deletions, which are free of cost. This model is extremely general, as it allows segmental duplication and deletion event to occur at any point in the evolutionary history and over any contiguous region at that time point. It also allows duplicate copies to be created then lost during the course of evolution.
However, in practice, it is much easier to estimate copynumber in a sample, using for example shotgun sequencing fragment density with regard to a reference, than it is to construct the underlying genomes. We therefore ask whether it is possible to evaluate the number of rearrangements between a sample and a supposed ancestral genome using copynumbers and breakpoints alone. By evaluating the number of rearrangements, we are indirectly proposing to simultaneously infer the rearrangement events, and thus assemble the sample’s genome, using evolutionary parsimony.
We describe here a general model of structural variation which encompasses both balanced rearrangements and arbitrary copynumbers variants (CNV). In this model, we show that the difference between any two genomes can be decomposed into a sequence of smaller optimization problems. Mathematically speaking, we represent genomes as points in a \(\mathbb {Z}\)module of elements that we call flows, and histories as differences between flows. The overall problem of finding a parsimonious history is then replaced by that of finding an optimal decomposition of the problem, which we address with an ergodic sampling strategy.
Results
Directed history graphs
Given a bidirected history graph H we construct its directed history graph H ^{′} as follows: each segment vertex is replaced by a segment edge from a tail to a head vertex, which are distinct. The orientation can be chosen arbitrarily, so long as the DNA label reflects the sequence that will be read when traversing from tail to head. The bidirected adjacencies incident on the head side of the segment are connected to the head vertex of the segment, likewise for the tail of the segment. If a branch connects two segments in H, a branch similarly connects their head vertices in H ^{′}, and another branch their tail vertices.
A directed history graph is trivially equivalent to its bidirected counterpart, so we will use the two concepts interchangeably. Because a vertex in a directed history graph can be incident with at most one adjacency edge and at most one segment edge, each connected component of segment and adjacency edges in H ^{′} is a simple path of alternating segment and adjacency edges, which we call a thread.
Sequence graphs and copynumber
Henceforth we assume all threads in a layered history are circular, as in Fig. 1, thus removing corner cases. The sequence graph G obtained from a layered history is constructed by contracting all the branches of H (Fig. 1, right side). Duplicate edges between two vertices of G are merged together. This defines a projection from each edge of H onto the edge of G it is merged into. For a given thread t in H, its copynumber weighting on G is the function which assigns to each edge e of G the number of times t traverses a thread edge that projects onto e.
From here onwards, G is assumed to be a sequence graph, and E the set of its edges. Using shotgun sequencing read density and breakpoint analysis, an approximate sequence graph for a threadstructured genome is obtained far more easily than the genome itself, motivating the analysis of such structures.
Flow functions
We examine the properties of copynumber weightings on sequence graphs. The edge space \(\mathbb {R}^{E}\) denotes the set of realvalued weightings on the set of edges E. \(\mathbb {R}^{E}\) is a vector space isomorphic to \( \mathbb {R}^{E}\).
A flow on G is a weighting in \(\mathbb {R}^{E}\) such that at each vertex the total weight of segment edge incidences equals the total weight of adjacency edge incidences. This equality on all of the vertices is called the balance condition, which characterizes flows. It is essential to distinguish between an edge and an edge incidence. In particular, if an adjacency edge connects a vertex v to itself, its weight is contributed twice to v. Let \(\mathcal {F}(G)\) denote the set of all flows on G. Because the balance condition is conserved by addition and multiplication by a real scalar, \(\mathcal {F}(G)\) is a subspace of \(\mathbb {R}^{E}\).
Let \(\mathcal {F}_{\mathbb {Z}}(G)\) denote the set of integervalued flows on G and \(\mathcal {F}_{\mathbb {Z}}^{+}(G)\) the cone of nonnegative integer flows. \(\mathcal {F}_{\mathbb {Z}}(G)\) is a \(\mathbb {Z}\)modular lattice of the vector space \(\mathcal {F}(G)\).
Copynumber and flow functions
Lemma 1
The copynumber of a circular thread is necessarily a flow in \(\mathcal {F}_{\mathbb {Z}}^{+}(G)\).
Proof
A circular thread describes a cycle that alternates between adjacency and segment edges. □
A valid flow sequence is a sequence of nonnegative integer flows (f _{1},…,f _{ k }) in \(\mathcal {F}_{\mathbb {Z}}^{+}(G)^{k}\) such that for any segment edge e, if there exists i such that f _{ i }(e)=0 then f _{ j }(e)=0 for all i≤j≤k. In addition, for every segment edge e, f _{1}(e)=1. This ensures that if the number of copies of a segment falls to zero, then it can not be recreated in any subsequent stage of the history, and that at the start of the flow sequence there is exactly one copy of each segment.
A cycle traversal is a closed walk through the graph, possibly with edge reuse, with an explicit starting vertex. A cycle is an equivalence class of cycle traversals defined by a common circular permutation of edges and its reverse.
Lemma 2
Any layered history has a valid flow sequence, and any valid flow sequence has a realization as a layered history.
Proof
The first part follows easily from Lemma 1. To prove the second part, we first decompose each flow in the sequence into a set of threads. To decompose a flow of \(\mathcal {F}_{\mathbb {Z}}^{+}(G)\) into a sum of circular thread flows, we color the edges of G that have nonzero weight such that segments edge are green, and adjacencies orange. Edges with weight 0 are removed. Each edge with weight of absolute value n>1 is replaced by n identical edges of weight 1, creating a multigraph G ^{′}. This defines a trivial mapping from the edges of G ^{′} to those of G. By construction, G ^{′} is a balanced biedgecolored graph (as defined in [30]), therefore it is possible to find an edgecovering composed of coloralternating cycles. This can be done in polynomial time by a procedure akin to the typical greedy construction on balanced noncolored graphs [30, 31]. By the mapping defined above, we defined a set C of cycles in G that alternate between segment and adjacency edges. By construction, each edge of G has as many cycle edges mapping to it as its flow weight.
We now demonstrate by induction that we can construct a history graph H from this set of cycle decompositions. If the flow sequence has length 1, we simply construct threads that correspond to the cycles in C. Because each segment has a flow of exactly 1, there is a bijection between the newly created segments and the segments of G. Now let us assume the property demonstrated for all flow sequences of length k or less. Given a flow sequence (f _{1},…,f _{ k+1}), we first construct a realisation H ^{′} of (f _{1},…,f _{ k }). We then choose for every segment edge e in G for which f _{ k }(e) is strictly positive a segment edge in the leaf layer of H ^{′} that maps to it. We create under this chosen segment copy f _{ k+1}(e) novel descendant segment edges, and assign them an arbitrary order. We decompose f _{ k+1} into a set of cycles in G as described above. For each cycle (picked in any order), we pick a random traversal and create a thread, greedily using up available segments. Once connected by adjacencies, these segments form a thread. By construction, each edge e is visited as many times in the set of cycles as newly created segments map to it, so this process is guaranteed to use up all the newly created segments. By construction, the total flow of these threads is equal to f _{ k+1}. The history graph thus extended is a realisation of (f _{1},…,f _{ k+1}). □
In [29], we defined the minimum rearrangement cost r(G) of a layered history graph G as the minimum number of DCJ operations it would require to account for all the changes that occur in that layered history graph, assuming that whole chromosome duplications and deletions are free of cost. We demonstrated that this cost is NPhard to compute, but provided upper and lower bounds for it that can be computed in polynomial time.
The lower bound is closely related to earlier results [6, 25].
By extension, we define the minimum rearrangement cost of a valid flow sequence as the minimum rearrangement cost across all of its realizations, and seek to demonstrate the existence of tractable bounds on this value.

V is the set of vertices of G.

p is the number of adjacency edges e where f _{ i }(e)>0.

C is the set of active components, i.e. connected components of vertices and adjacency edges e where f _{ i }(e)+f _{ i+1}(e)>0
Theorem 1
Proof
If k>2, to prove the theorem for the flow sequence (f _{1},…,f _{ k }) on (g _{1},…,g _{ k−1}), we remove g _{1} from H, and obtain a reduced history graph H ^{′} composed of bilayered sequence graphs (g _{2},…,g _{ k−1}), and its sequence graph G ^{′}. The sequence graph G ^{′} is slightly different from G, in that each segment edge of G ^{′} has a copynumber of 1 in f _{2}, whereas each segment edge of G has a copynumber of 1 in f _{1}. However, there exists a mapping from the vertices of the second layer of H to the first, implicitly defining a mapping Φ from the vertices of G ^{′} to the vertices of G such that for any edge e in G and any i≥2, \(f_{i}(e) = \sum _{e' : \Phi (e')=e}f_{i}'(e')\). On G ^{′} we compute the flow sequence of H ^{′}, s ^{′}=(f2′,…,f k′). These flows are all in \(\mathcal {F}^{+}_{\mathbb {Z}}(G')\)
By the inductive hypothesis, the rearrangement cost of g _{ i+1} is greater than \(\mathcal {C}^{l}_{f'_{i},f'_{i+1}}\), as computed on G ^{′}. We now compare \(\mathcal {C}^{l}_{f'_{i},f'_{i+1}}\) to \(\mathcal {C}^{l}_{f_{i},f_{i+1}}\), which is computed on G. For simplicity, we create a sequence of graphs (G _{1},G _{2},…,G _{ q }) of an arbitrary length q, describing an evolution from G _{1}=G ^{′} to G _{ q }=G, and decomposing Φ into a sequence of projections. At each step of the sequence, exactly two nodes are merged. Note that elements of this sequence are not necessarily valid graphs. Edge labels (segment or adjacency) are preserved, and two edges between the same vertices and having the same label are merged together. Projecting f _{ i } and f _{ i+1} across this sequence of graphs, we compute \(\mathcal {C}^{l}\) at each step. At each step, the number of vertices V decreases by 1. If the vertices belong to two different active components, the number of active components decreases by 1 and p remains unchanged. Otherwise, the number of active components remains unchanged and p decreases by at most 1, as two edges could be merged into one. The sequence of values of \(\mathcal {C}^{l}\) therefore decreases, hence \(\mathcal {C}^{l}_{f_{i},f_{i+1}} \le \mathcal {C}^{l}_{f'_{i},f'_{i+1}} \le r(g_{i+1})\). □
Theorem 2
Proof
We will construct g _{ i } directly. For every segment edge s of G, we create f _{ i }(s) top layer segments, images of s. In the following, each vertex in g _{ i } is assigned an adjacency to a partner vertex in G, on the assumption that our construction algorithm determines independently which specific image in g _{ i } it is connected to.
We start by creating the segment edges of g _{ i }. On both ends of a segment s, we compute \(\mathcal {D} + \mathcal {N}\) and choose the smaller value m _{ s }. We select m _{ s } images of s to be marked for duplication. The duplicated images of s are assigned 2 descendants each. If f _{ i }(s)=m _{ s }, an additional f _{ i+1}(s)−2m _{ s } descendant segments are assigned at random to the duplicated segments, otherwise the f _{ i }(s)−m _{ s } nonduplicated segments copies are assigned f _{ i+1}(s)−2m _{ s } descendant segments as evenly as possible, such that the difference in number of descendants between any two of those segments is never greater than 1. At the end of this process, segment copies with 0 descendants are marked for deletion. Segments which are marked neither for deletion nor duplication are undetermined.
Let d be the weighting on G equal to m i n(0,f _{ i }−f _{ i+1}), i.e. the number of images of an edge of G which are lost between the two layers of g _{ i }. By extension, for any vertex v, d _{ v } is the sum of d across all adjacencies incident on v.
Hence all top layer segments marked for deletion can be connected to adjacencies marked for deletion. For every adjacency edge e with a duplication count D(e)>0, we create D(e)duplicated images of this edge, connected in priority to duplicated copies of s, then to undetermined segments. We already connected all segments marked for deletion, hence duplicated adjacencies are connected exclusively to duplicated or undetermined segments. For every adjacency edge e with no flow change in G, we create f _{ i }(e) images of this edge connected to the remaining unattached segments.
We finally connect the bottom segments. If two duplicated segments are connected by a duplicated edge, then two of their descendant segments are connected by adjacencies, thus creating two trivial lifting adjacencies in g _{ i }. Otherwise, for each segment which is incident on an adjacency not marked for deletion, it necessarily has at least one descendant segment, which is connected by an adjacency to a descendant of its partner, thus creating one trivial lifting adjacency in g _{ i }. All remaining bottom segments are connected at random, conditional on respecting the adjacency counts specified by f _{ i+1}.
We now evaluate the upper bound rearrangement complexity of the corresponding DNA history graph as quoted above from [29]. By construction, deleted adjacencies and adjacencies with no flow change do not create nontrivial lifted edges in g _{ i }. Therefore only adjacencies e in G such that f _{ i+1}(e)>f _{ i }(e) give rise to nontrivial lifted edges in g _{ i }, unless they are connected to two duplicated segments. The number of duplicated adjacencies incident on a segment s which are not connected to a duplicated segment is bounded by its imbalance, hence there are at most I _{ i } adjacencies of this type across the graph. In addition, there are S _{ i } bottom layer adjacencies which were added at random. Hence the total number of nontrivial lifted edges is bounded by I _{ i }+S _{ i }. The construction algorithm guarantees that each perfect component in G gives rise to a simple module in G, hence, by the upper bound from [29], the cost is bounded by S _{ i }+I _{ i }−l _{ i }. □
Primary extensions
We say that a valid flow transition (F _{ A },F _{ B }) is a lookup if \(C^{l}_{F_{A},F_{B}} = C^{u}_{F_{A},F_{B}}\). In this case it is easy to compute the rearrangement cost of the transition. If a transition is not a lookup, then one way to assess its cost is to sample flow sequences with the same initial and final flows that are separated by smaller, more incremental changes.
A valid subsequence s _{2} of a given valid flow sequence s _{1} with the same initial and final elements is called a reduction of s _{1}, and conversely s _{1} is an extension of s _{2}.
We call the multiset of nonzero flows δ f _{1},…,δ f _{ k } a decomposition of Δ F.
If, for each i, either the flow transition (F _{ i },F _{ i+1}) is a lookup or at least simple enough so that we can compute the transition cost in reasonable time, then by sampling decompositions of the overall flow difference Δ F, we can arrange these into extensions of this flow difference, and evaluate these extensions efficiently, keeping track of the cheapest one we find. This provides a sampling strategy for producing plausible explanations for complicated observed flow transitions. To make this work in practice, we introduce the notion of a primary (or nearprimary) flow, and demonstrate how to decompose a flow difference f into a sum of primary or nearprimary flows. The primary or nearprimary flow differences we get are usually a lookup, and if not, the time required to compute their cost is reasonable.
We first introduce primary flows, which are a generating set of \(\mathcal {F}_{\mathbb {Z}}(G)\), then a superset of these, the nearprimary flows, which can be sampled ergodically with a simple algorithm.
We measure the elements of \(\mathcal {F}_{\mathbb {Z}}(G)\) with the L _{1} norm, defined by \(\f\_{1}=\sum _{e \in E}f(e)\). A nonzero flow f is primary if there do not exist two nonzero integer flows f _{1} and f _{2} such that f=f _{1}+f _{2} and ∥f∥_{1}=∥f _{1}∥_{1}+∥f _{2}∥_{1}.
Theorem 3
Primary flows are a generating set of \(\mathcal {F}_{\mathbb {Z}}(G)\).
Proof
We will decompose a flow f by induction. For k≥1, we will define a set S _{ k } of k nonzero flows \(({f^{k}_{1}}, \ldots, {f^{k}_{k}})\) (allowing for duplicate flows), such that \(f = \sum _{f' \in S_{k}}f'\) and \(\f\_{1} = \sum _{f' \in S_{k}} \f'\_{1}\). At the start, S _{1}=(f). If there exists i≤k such that \({f^{k}_{i}}\) is not primary, then there exist two nonzero integer flows f _{ b } and f _{ c } such that \({f^{k}_{i}}=f_{b}+f_{c}\) and \(\{f^{k}_{i}}\_{1}=\f_{b}\_{1}+\f_{c}\_{1}\). We define \(S_{k+1}=({f^{k}_{1}},\ldots,f^{k}_{i1},f_{b},f_{c},f^{k}_{i+1},\ldots,{f^{k}_{k}})\). It is straightforward to verify that \(f = \sum _{f' \in S_{k+1}}f'\) and \(\f\_{1} = \sum _{f' \in S_{k+1}} \f'\_{1}\). We proceed until no nonprimary flows remain. In that case, then f was successfully decomposed as a sum of primary flows. Since the L _{1} norm of a nonzero flow is necessarily a nonzero integer, the total number k of flows we create in this way is bounded by ∥f∥_{1}. The nonzero flow f was therefore decomposed as a finite sum of primary flows. □
A valid primary flow sequence is a valid flow sequence s=(f _{1},…,f _{ k }) such that for each flow transition, (f _{ i },f _{ i+1}), its associated flow change (f _{ i+1}−f _{ i }) is a primary flow.
Corollary 1
Any valid sequence of flows in \(\mathcal {F}_{\mathbb {Z}}(G)\) can be extended into a valid primary flow sequence.
Proof
We decompose the flow changes of a valid sequence into primary flow changes using the approach described in Theorem 3. At each step, a flow change f is replaced by two flows f _{1} and f _{2} such that a) f=f _{1}+f _{2} and b) ∥f∥_{1}=∥f _{1}∥_{1}+∥f _{2}∥_{1}. From equality a), it follows that at every edge e we have 0<f(e)≤f _{1}(e)+f _{2}(e). Given equality b) we have at every edge e f(e)=f _{1}(e)+f _{2}(e), hence f _{1} and f _{2} have the same sign as f. Therefore if f is the flow change between two nonnegative genome flows F _{ A } and F _{ B }, F _{ A }+f _{1}=F _{ B }−f _{2} is also nonnegative. The extended flow sequence (…,F _{ A },F _{ A }+f _{1},F _{ B },…), with flow changes (…,f _{1},f _{2},…), is also a valid flow sequence. □
By the above result, any flow change f can always be decomposed, not necessarily uniquely, into primary flow changes (f _{ i })_{ i=1..k } such that \(f = \sum _{i=1}^{k} f_{i}\) where \(\f\_{1} = \sum _{i=1}^{k} \f_{i}\_{1}\). This type of decomposition is fast to compute, and useful to create instances of primary flow sequences.
However, these decompositions are intrinsically limited in scope. In particular, the second condition implies that when the flow change f is, say, positive on a given edge e, then ∀i<k,f _{ i }(e)≥0. This forbids edge reuse, i.e. cases where adjacencies are temporarily created, then deleted, or segments temporarily duplicated, then deleted. Edge reuse is common in actual rearrangement histories, including minimum cost histories, so we do not always want to impose this second condition.
Given a flow f, a primary extension of f is a set {c _{1} f _{1},…,c _{ n } f _{ n }}, such that \(\{c_{1}, \ldots, c_{n}\} \in {\mathbb {N}^{*}}^{n}\) and f _{1},…,f _{ n } are primary flows, such that \(f = \sum _{i=1}^{n} c_{i} f_{i}\) and the component flows f _{1},…,f _{ n } are linearly independent, i.e. no component can be written as a linear combination of the others. Note that since the dimension of the subspace of all flows is at most ∥E∥, where E is the set of edges in the graph, no primary extension can have more than ∥E∥ components. Further, since the components f _{ i } are required to be linearly independent, once these are specified for a given f, the coefficients c _{ i } are uniquely determined.
This definition eliminates trivially nonparsimonious cases where a simpler decomposition could be obtained with a strict subset of the component flows. This amounts to forbidding a some cases of homeoplasy, i.e. we allow a simple rearrangement event to happen multiple times, in which case the component is assigned a weight greater than 1, for example in the case of tandem duplications, but we don’t allow distinct sets of components to have identical or inverse flow changes, creating a new configuration and then undoing it again for no net effect.
Characterising the space of primary flows
We now demonstrate the ties between primary flows and the space of even length cycles.
If a weighting is such that the sum of incident weights at every vertex is 0, it is said to satisfy the conjugate balance condition. A weighting f is a flow if and only if (iff) \(\hat f\) satisfies the conjugate balance condition.
For each edge e, let δ _{ e } be the weighting that assigns 1 to e, and 0 to all other edges. The set {δ _{ e }}_{ e∈E } forms a trivial spanning set of \(\mathbb {Z}^{E}\).
Given the cycle traversal t=(v,{e _{ i }}_{ i=1..2n })∈V×E ^{2n } of an even length cycle c (possibly with edge reuse, possibly with 2 or more consecutive adjacency or segment edges), its associated alternating weighting is \(w(t)=\sum _{i=1}^{2n}(1)^{i} \delta _{e_{i}}\). Because the weights of the edges in the cycle alternate between 1 and 1 along an even length cycle, an alternating weighting satisfies the conjugate balance condition, see Fig. 4b).
The alternating flow f of an evenlength cycle traversal t is the conjugate of its alternating weighting, i.e. \(f=\hat {w}(t)=\sum _{i=1}^{2n}(1)^{i}\hat \delta _{e_{i}}\). Conversely, t is a traversal of f. Because an alternating flow is conjugate to a weighting that satisfies the conjugate balance condition, it satisfies the balance condition defined above, and hence is a flow. See Fig. 4c.
For the next few definitions, we assume that t=(v,{e _{ i }}_{ i=1..2n })∈V×E ^{2n }) is an even length cycle traversal on G.
Lemma 3
A cycle traversal t is tight iff \(\ \hat {w}(t) \_{1} = t\), the length of t.
Proof
□
Lemma 4
A cycle traversal t that is both nested and synchronized is necessarily tight.
Proof
Assuming by contradiction that t is not tight, there exists two indices i<j such that e _{ i }=e _{ j } and j−i is odd. They cannot be traversed in opposite directions, else e _{ i } and e _{ j−1} would coincide at their endpoints, even though (j−1)−i is even, which contradicts synchronization. They cannot either be traversed in the same direction, else e _{ i } and e _{ j } on one hand and e _{ i−1} and e _{ j−1} on the other would coincide, which contradicts the nesting constraint. □
A simple cycle is a cycle without vertex reuse. A cactus graph is a graph in which any two simple cycles intersect in at most one vertex, which may be called an intersection point [32]. A connected graph is 2edge connected if for any two distinct vertices A and B, two edge removals are sufficient to disconnect the vertices. It is wellknown that a connected graph is a cactus graph iff it is 2edge connected. A cutpoint in a graph is a vertex that if removed splits the connected component in which it resides into two or more connected components called subcomponents.
Lemma 5
Given a simple cycle C, one of its traversals t=(v,{e _{ i }}_{ i=1..l }) and a set of properly nested pairs of indices in [1,l]^{2}, merging the vertices at the end of the edges indicated by the index pairs creates a cactus graph.
Proof
We will demonstrate that the graph is 2edge connected. Let A and B be vertices in the graph. Both A and B can be traced to the merging of two sets of vertices in C. Let I(A) be the sets of indices of original vertices merged into A, likewise for B. Both sets cannot overlap, else they would be identical and A and B would not be distinct. Because of the nesting constraint, no vertex in [m i n(I(B)),m a x(I(B))] can be merged to a vertex outside this interval, therefore cutting the edges at indices m i n(I(B)) and m a x(I(B))+1 breaks the graph into two components. Because all the indices of I(B) belong to the interval, whereas all the indices of I(A) are outside of this interval, A is disconnected from B. Thus, the graph is 2edge connected, in other words it is a cactus graph. □
Theorem 4
The set of primary flows is identical to the set of alternating flows of nested and synchronized cycle traversals.
Proof
Let f be a primary flow, and t=(v,{e _{ i }}_{ i=1..l }) be a cycle traversal in G that alternates between edges where f is strictly positive and edges where f is strictly negative. We will demonstrate by contradiction that t is nested and synchronized.
We will first assume that t is not synchronized, i.e. there exist distinct indices i and j such that i and j coincide, yet (i−j) is even. Without loss of generality, we assume that i<j. Let t _{1}={e _{ i+1},…,e _{ j }} and t _{2}={e _{1},…,e _{ i },e _{ j+1},…,e _{ l }}. As i−j is even, both are even length, primary cycles. It follows that \(\hat w(t_{1}) + \hat w(t_{2}) = \hat {w}(t) = f\), and obviously t _{1}+t _{2}=l. Since t alternates between edges where f is positive and negative, e _{ i }=e _{ j } implies i−j is even, hence t is tight. It is obvious that this property is preserved in t _{1} and t _{2}. Hence, by Lemma 3, \( \ \hat w(t_{1}) \_{1} + \ \hat w(t_{2}) \_{1} = t_{1} + t_{2} =l \). Therefore \(\ f \_{1} = \ \hat w(t_{1}) \_{1} + \ \hat w(t_{2}) \_{1}\), hence f could not be minimal. Figure 5 gives an example.
Let t=(v,{e _{ i }}_{ i=1..l }) be a nested and synchronized cycle traversal, we will demonstrate that its alternating flow f is primary. By Lemma 5, because t is nested, the subgraph of G it traverses is a cactus graph. Because it is impossible to find three indices i,j and k such that (i−j), (j−k) and (k−i) are all odd, t can visit any vertex at most twice. Because all the edges in this cactus graph can be traversed in a single cycle traversal, it can be edgepartitioned into a set of simples cycles or chains such that at each vertex at most two components intersect. Because t is synchronized, if f=f _{1}+f _{2} is a decomposition of f and t _{1} is a traversal of f _{1}, t _{1} must necessarily switch between edges in different simple cycles whenever it reaches a cutpoint vertex, else the resulting cycle would not be evenlength. Thus t _{1} traverses all of the edges at least once, and is equal to t. Therefore f is primary (see Fig. 7 for an illustration). □
Nearprimary extensions
Instead, we focus on a superset of primary flows. A flow derived from a tight and nested traversal is called a nearprimary flow. Following Lemma 4, primary flows are a subset of nearprimary flows. To illustrate the difference between the two sets, on Fig. 5, the left traversal is nearprimary, the right one is primary.
A nearprimary extension of a flow f is a set (c _{1} f _{1},…,c _{ n } f _{ n }) for some integer constants (c _{1},…,c _{ n }) and nearprimary flows (f _{1},…,f _{ n }) such that \(f = \sum _{i=1}^{n} c_{i} f_{i}\) and the component flows f _{1},…,f _{ n } are linearly independent.
Lemma 6
The number of possible nearprimary extensions of a flow sequence is finite.
Proof
Because a synchronized traversal visits each edge at most twice, there are only a finite number of synchronized traversals, and hence a finite number of nearprimary flows. Since the flows in a decomposition (c _{1} f _{1},…,c _{ n } f _{ n }) of any flow f are linearly independent, n is at most E, and since we must have f=c _{1} f _{1}+…+c _{ n } f _{ n }, the coefficients c _{1},…,c _{ n } are uniquely determined by the choice of f _{1},…,f _{ n }. Hence the number of decompositions is finite. □
This implies that any scoring scheme can be used to construct a valid probability mass function on the set of nearprimary extensions of a flow sequence, which constitute a natural set of flow histories.
Collapsing equivalent solutions
Let K be the set of edges with zero overall flow change, and Ω the set of vertices which are only incident to K. To prevent a combinatorial explosion because of equivalent cycles going through Ω, Ω is collapsed into a universal connector vertex, ω, connected to all other vertices in the graph. In other words, if an alternating flow traversal has edges incident with vertices of Ω, it is represented as going to ω, selflooping on that vertex, then continuing out. This ensures that otherwise equivalent histories are not distinguished because of irrelevant labeling differences between interchangeable vertices. Any alternating flow traversal which selfloops from ω to itself more than once can be automatically simplified into a simple flow traversal with lesser complexity, avoiding extrapolating useless operations on the edges of K.
Given a valid flow decomposition, if an edge of K is duplicated then deleted, leaving no final CNV change, then the flows of the two events are summed up into a single flow. This constrains partially the space of possible histories, as it precludes a third event from being timed between the two combined events, but it reduces the search space. To mitigate this, this constraint is only applied to the end results of the sampling, as during the sampling stage we allow for events to have nonzero flow over edges of K.
After these two transformations, a valid flow decomposition is said to be collapsed.
Ergodic sampling of nearprimary extensions
where E _{1},E _{2} and E _{3} are adjacency edges such that E _{1} connects v ^{′} to ω, E _{2} connects ω to ω and E _{3} connects ω to v.
where E _{1} is a (possibly new) adjacency edge that connects the destination vertex of e _{ l }, v ^{′}, to ω, and E _{2} is a (possibly new) adjacency edge that connects ω to the starting vertex of e _{1}, v.
Theorem 5
The process of applying random merges and splits provides an ergodic exploration of all possible nearprimary collapsed extensions of a given flow transition.
Proof
We are given a nearprimary collapsed extension H of a flow transition \((f_{1},f_{2}) \in \mathcal {F}(G)^{2}\), and wish to reach another nearprimary collapsed extension H° that also spans Δ f=f _{2}−f _{1}. Because they are both collapsed, H° and H cover exactly the same set of vertices, namely E∖Ω.
We choose at random a nearprimary component flow δ f° of H°, and one of its tight traversals t°. Since H and H° cover the same set of edges, it is possible to greedily define a sequence of flows of H that cover each of the edges of t°. The flows can be sequentially merged, since each shares an overlap of at least one vertex with a previously defined flow. Following the greedy merge, they may be greedily split to recover δ f°. It is thus possible to reconstruct any nearprimary component flow of H° with splits and merges between the vectors in H. □
Fixing invalid flow sequences
A flow sequence is a sequence of positive flows, from which a sequence of flow changes can be derived. However, a nearprimary extension of a positive flow can not necessarily be ordered to define a valid flow sequence because cumulatively adding up the flow changes can produce negative values.
Imposing that each flow of a flow sequence is a positive flow would require much computation and possibly break the ergodicity of the method described below. For this reason we preferred to impose a softer constraint, namely increasing the cost of a history showing inconsistencies. For every segment edge which goes from nonnegative to negative or from 0 to positive flow, a penalty charge of 2 events is added to the history. This cost corresponds to the cost of temporarily duplicating a region then deleting it.
Simulations
Measuring accuracies of solutions
To test our model, we created 90 artificial rearrangement histories. Each simulated history begins with a single chromosome composed of 100, 150, or 200 atomic blocks of sequence in order to have a range not only in the number of rearrangement events per simulation, but in the density or breakpoint reuse of genomic rearrangements as well. The history is built by repeatedly randomly deleting, duplicating or inverting segments of contiguous blocks. The number of structural rearrangement operations applied was randomized between 1 and 33, again, in order to generate histories with a range of complexity and density. For example, a simulation with 100 sequence blocks and 10 rearrangement events will have a higher density of events than a simulation with 200 sequence blocks and 10 rearrangement events; although the sequence of deletions, duplications, or inversions in both histories may be identical, in the smaller genome, they are more likely to overlap or occur in the same genomic regions.
Given the final copy number profile and novel adjacencies of the rearranged genome, we sampled possible flow histories using importance sampling based on the above ergodic exploration of the space with ten independent runs of 2,000 iterations each. Each nearprimary flow, representing a set of rearrangement events, received a likelihood score equal to the fraction of sampled histories that contain it. So as not to penalise valid flow transitions that are extensions of multiple flow transitions from the simulated history, we counted as true positive any sampled flow transition which could be decomposed as a finite sum of original simulated flow transitions.
Discussion
As discussed in the introduction, there have been many attempts to define parsimomious histories given a set of rearranged genomes with deletions and duplications. Because of the inherent complexity of the problem, different models made different simplifying assumptions. The model presented here is particularly flexible, in that it allows arbitrarily complex rearrangement events containing duplications and deletions of arbitrary contiguous regions, not just predefined atomic components or predefined rearrangement transformations. There exists a “free lunch” problem whereby if arbitrary de novo insertions are allowed, then any differences between the ancestral and derived genome can be explained by a parsimonious but absurd evolutionary history consisting of an entire genome deletion followed by the insertion of a completely new genome [8]. To preclude such trivially unrealistic solutions, our model does not allow insertions of entirely new sequence, only duplication of existing DNA. In addition, this is the first such model which offers an explicit ergodic sampling procedure. In counterpart, we currently bracket the cost of each flow transition between an upper and lower bound, with the expectation that the two bounds meet for most encountered nearprimary flow transitions. Given their constrained size, it should be possible to exhaustively test the cost of nearprimary flow transitions when these two bounds diverge. In addition, the collapsing of histories does preclude some event orderings.
Theorem 2 only describes the cost of each epoch independently, and does not allow us to extrapolate the cost of entire histories. Because the bilayered components of a layered history graph share intermediary layers, defining the structure of one layer constrains the possible layers immediately before and after it. It is therefore sometimes impossible to construct a realisation of a flow sequence such that every bilayered component has a cost below its corresponding upper bound. Determining the cost of entire histories would presumably require reconstructing these histories in their entirety, as we describe in [29]. However, this explicit approach requires the ancestral and derived genomes to be fully determined, which is often not the case when working with sequencing data. The loss of precision therefore appears to be necessary to handle the incompleteness of genomic data.
The application of the theory described here would therefore be most relevant to unassembled short read data, for example metagenomic samples or tumors. Inferring the evolutionary history of the observed copynumber changes and alignment breakpoints would help both understand the evolutionary processes at play and determine the most likely sequences of the sampled genomes. However, such datasets pose further issues because they generally contain a diversity of different genomes, and the data is itself quite noisy. Given that most of the properties demonstrated here hold for the edge space \(\mathbb {R}^{E}\), we believe that it will be possible to extend the entire model to handle arbitrary mixtures of genomes, where copynumber is weighted by prevalence. Noisy artifacts would thus be represented as low frequency elements that can be filtered out at the end of the analysis. As genomic sequencing progresses and long range data becomes more affordable, it might be possible to obtain nearfinished assemblies straight from the data, in which case the model described in [29] would be more relevant.
From a mathematical standpoint, the set of flows of G is very rich. It bears similarity to the cycle space described in [33]. The balance condition guarantees that a positive flow can be decomposed as a set of weighted threads, and that in a flow transition, every created adjacency is compensated by a either a corresponding adjacency deletion or a segment duplication, much like double entry bookkeeping. The nearprimary flows further provide us with a finite subset of flows, such that all possible realisations of the data can be sampled ergodically. However, unlike models for single base substitutions and balanced rearrangements, this model of evolutionary cost is not a distance function, because it is asymmetric, i.e. the distance between two genomes depends on which genome is ancestral. For example, duplicating a segment then translocating one of its copies generally requires two DCJ operations, yet only one operation (a segmental deletion) is needed to return to the original genome.
Conclusion
We presented here a model to efficiently sample the space of possible rearrangement histories in the presence of duplications and deletions. Confronted with the NPhard problem of inferring parsimonious histories described in [29] from everyday genomic data, we opted for a simplification of the problem which allows us to adopt an efficient ergodic sampling strategy. Through simulation, we verified that this approach produces exploitable results. In practice, it is capable of accurately discriminating true rearrangement events and sampling their possible orderings. There remain a few open questions, in particular whether it is possible to efficiently compute the rearrangement cost of any primary flow sequence.
Declarations
Acknowledgments
Not applicable.
Funding
We would like to thank the Howard Hughes Medical Institute, Dr. and Mrs. Gordon Ringold, NIH grant 2U41 HG00237113 and NHGRI/NIH grant 5U01HG004695, and the European Molecular Biology Laboratory (DZ and TB) for providing funding.
Availability of data and materials
All the code used in the simulations and sampling is available at https://github.com/dzerbino/cnavg .
Authors’ contributions
The theory was developed jointly by DZ, BP, GH and DH. The simulations were coded and run by DZ and TB. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 The International HapMap3 Consortium. Integrating common and rare genetic variation in diverse human populations. Nature. 2010; 467:52–8.View ArticleGoogle Scholar
 The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012; 491:56–65.View ArticlePubMed CentralGoogle Scholar
 Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008; 455(7216):1061–8. doi:http://dx.doi.org/10.1038/nature07385.
 The International Cancer Genome Consortium. International network of cancer genome projects. Nature. 2010; 464:993–8.View ArticlePubMed CentralGoogle Scholar
 Jukes T, Cantor C. Evolution of Protein Molecules. Mammalian Protein Metabolism. Vol. 3. New York: Academic Press; 1969.Google Scholar
 Yancopoulos S, Attie O, Friedberg R. Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics. 2005; 21(16):3340–346. doi:http://dx.doi.org/10.1093/bioinformatics/bti535.
 Feijão P, Meidanis J. Scj: A breakpointlike distance that simplifies real several rearrangement problems. IEEEACM Trans Comp Biol Bioinf. 2011; 8(5):1318–29.View ArticleGoogle Scholar
 Yancopoulos S, Friedberg R. DCJ path formulation for genome transformations which include insertions, deletions, and duplications. J Comput Biol. 2009; 16(10):1311–38. doi:http://dx.doi.org/10.1089/cmb.2009.0092.
 Zhang F, Gu W, Hurles M, Lupski J. Copy number variation in human health, disease, and evolution. Annu Rev Hum Genet. 2009; 10:451–81.View ArticleGoogle Scholar
 Shlien A, Malkin D. Copy number variations and cancer. Genome Med. 2009; 1(6):62. doi:http://dx.doi.org/10.1186/gm62.
 Hannenhalli S, Pevzner PA. Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals. J ACM. 1999; 46(1):1–27.View ArticleGoogle Scholar
 Bergeron A, Mixtacki J, Stoye J. A unifying view of genome rearrangements. Algorithm Bioinforma. 2006; 4175:163–73.View ArticleGoogle Scholar
 Durrett R. Genome Rearrangement: Recent Progress and Open Problems In: Nielsen R, editor. Statistical Methods in Molecular Evolution. New York: SpringerVerlag: 2005.Google Scholar
 ElMabrouk N, Nadeau JH, Sankoff D. Genome halving. Lect Notes Comput Sci. 1998; 1448:235–50.View ArticleGoogle Scholar
 ElMabrouk N, Sankoff D. On the reconstruction of ancient doubled circular genomes using minimum reversal. Genome Inf. 1999; 10:83–93.Google Scholar
 ElMabrouk N, Bryant B, Sankoff D. Reconstructing the predoubling genome. In: Proc. Third Ann. Int’l Conf. Computational Molecular Biology (RECOMB). Berlin: SpringerVerlag: 1999. p. 153–163.Google Scholar
 ElMabrouk N, Sankoff D. The reconstruction of doubled genomes. SIAM J Comput. 2003; 32:754–92.View ArticleGoogle Scholar
 Alekseyev MA, Pevzner PA. Colored de bruijn graphs and the genome halving problem. IEEEACM Trans Comp Biol Bioinf. 2007; 4(1):98–107.View ArticleGoogle Scholar
 ElMabrouk N. Sorting signed permutations by reversals and insertions/deletions of contiguous segments. J Discrete Algorithm. 2001; 1:105–22.Google Scholar
 Braga MDV, Willing E, Stoye J. Double cut and join with insertions and deletions. J Comput Biol. 2011; 18(9):1167–84. doi:http://dx.doi.org/10.1089/cmb.2011.0118.
 Chen X, Zheng J, Fu Z, Nan P, Zhong Y, Lonardi S, Jiang T. Assignment of orthologous genes via genome rearrangement. IEEEACM Trans Comp Biol Bioinf. 2005; 2(4):302–15.View ArticleGoogle Scholar
 Shao M, Lin Y. Approximating the edit distance for genomes with duplicate genes under dcj, insertion and deletion. BMC Bioinforma. 2012; 13(Suppl 19):513.View ArticleGoogle Scholar
 Shao M, Moret BME. Comparing genomes with rearrangements and segmental duplications. Bioinformatics. 2015; 31:329–8.View ArticleGoogle Scholar
 Shao M, Lin Y, Moret BME. Sorting genomes with rearrangements and segmental duplications through trajectory graphs. BMC Bioinforma. 2013; 14(Suppl 15):9.View ArticleGoogle Scholar
 Bader M. Sorting by reversals, block interchanges, tandem duplications, and deletions. BMC Bioinforma. 2009; 10(Suppl 1):9. doi:http://dx.doi.org/10.1186/1471210510S1S9.
 Bader M. Genome rearrangements with duplications. BMC Bioinforma. 2010; 11(Suppl 1):27.View ArticleGoogle Scholar
 OzeryFlato M, Shamir R. Sorting cancer karyotypes by elementary operations. J Comput Biol. 2009; 16(10):1445–60.View ArticlePubMedGoogle Scholar
 Zeira R, Shamir R. Sorting by cuts, joins and whole chromosome duplications. Combinatorial Pattern Matching. Lecture Notes in Computer Science. vol. 9133. Heidelberg: Springer: 2015. p. 396–409.Google Scholar
 Paten B, Zerbino D, Hickey G, Haussler D. A unifying parsimony model of genome evolution. BMC Bioinforma. 2014; 15:206.View ArticleGoogle Scholar
 Pevzner PA. DNA phyical mapping and alternating eulerian cycles in colored graphs. Algorithmica. 1995; 13:77–105.View ArticleGoogle Scholar
 Kotzig A. Moves without forbidden transitions in a graph. Matematický časopis. 1968; 18(1):76–80.Google Scholar
 Harary F, Uhlenbeck GE. On the number of husimi trees: I. Proc Natl Acad Sci. 1952; 39:315–22.View ArticleGoogle Scholar
 MacLane S. A combinatorial condition for planar graphs. Fundam Math. 1937; 28:22–32.Google Scholar