 Methodology article
 Open access
 Published:
A unifying model of genome evolution under parsimony
BMC Bioinformatics volume 15, Article number: 206 (2014)
Abstract
Background
Parsimony and maximum likelihood methods of phylogenetic tree estimation and parsimony methods for genome rearrangements are central to the study of genome evolution yet to date they have largely been pursued in isolation.
Results
We present a data structure called a history graph that offers a practical basis for the analysis of genome evolution. It conceptually simplifies the study of parsimonious evolutionary histories by representing both substitutions and double cut and join (DCJ) rearrangements in the presence of duplications. The problem of constructing parsimonious history graphs thus subsumes related maximum parsimony problems in the fields of phylogenetic reconstruction and genome rearrangement. We show that tractable functions can be used to define upper and lower bounds on the minimum number of substitutions and DCJ rearrangements needed to explain any history graph. These bounds become tight for a special type of unambiguous history graph called an ancestral variation graph (AVG), which constrains in its combinatorial structure the number of operations required. We finally demonstrate that for a given history graph G, a finite set of AVGs describe all parsimonious interpretations of G, and this set can be explored with a few sampling moves.
Conclusion
This theoretical study describes a model in which the inference of genome rearrangements and phylogeny can be unified under parsimony.
Background
In genome evolution there are two interacting relationships between nucleotides of DNA resulting from two key features: DNA nucleotides descend from common ancestral nucleotides, and they are covalently linked to other nucleotides. In this paper we explore the combination of these two relationships in a simple graph model, allowing for change by the process of replication, where a complete sequence of DNA is copied, by substitution, in which the chemical characteristics of a nucleotide are changed, and by the coordinated breaking and rematching of covalent adjacencies between nucleotides in rearrangement operations. These processes have quite different dynamics: DNA molecules replicate essentially continuously, much more rarely substitutions occur and more rarely still rearrangement operations take place. For this reason, and because of inherent complexity issues, a wealth of models, data structures and algorithms have studied these processes either in isolation or in a more limited combination.
Such evolutionary methods generally start with a set of observed sequences in an alignment, an alignment being a partitioning of elements in the sequences into equivalence classes, each of which represents elements that are homologous, i.e. that share a recognisably recent common ancestor. Though alignments represent an uncertain inference, and though there optimisation for standard models is intractable for multiple sequences ([1]), we make the common assumption that the alignment is given, as efficient heuristics exist to compute reasonable genome alignments ([2–4]).
If the sequences in an alignment only differ from one another by substitutions and rearrangements that delete subsequences, or insert novel subsequences (collectively indels), then the alignment data structure is naturally a 2D matrix. In such a matrix, by convention, the rows represent the sequences and the columns represent the equivalence classes of elements. The sequences are interspersed with “gap” symbols to indicate where elements are missing from a column due to indels. From such a matrix alignment, phylogenetic methods infer a history of replication ([5]). Such a history is representable as a phylogenetic tree, whose internal nodes represent the most recent common ancestors (MRCA) of subsets of the input sequences. To create a history including the MRCA sequences, additional rows can be added to the matrix ([6–8]). Both the problem of imputing maximum parsimony phylogenetic trees from matrix alignments and calculating maximum parsimony MRCA sequences given a phylogenetic tree and a matrix alignment are NPhard ([9, 10]).
In addition to substitutions and short indels, homologous recombination operations are a common modifier of individual genomes within a population. The alignment of long DNA sequences related by these operations is also representable as a matrix. However, the history of replication of such an alignment is no longer generally representable as a single phylogenetic tree, as each column in the matrix may have its own distinct tree. To represent the MRCAs of such an alignment requires a more complex data structure, termed an ancestral recombination graph (ARG) ([11, 12]). It is NPhard under the infinite sites model (no repeated or overlapping changes) to determine the minimum number of homologous recombinations needed to explain the evolutionary history of a given set of sequences, and probably NPhard under more general models ([13]).
Larger DNA sequences, or complete genomes, are often permuted by more complex rearrangements, such that the matrix alignment representation is insufficient. Instead, the alignment naturally forms a graph called a breakpoint graph ([14, 15]). Assuming rearrangements are balanced (neither involving the gain or loss of material), inferring parsimonious rearrangement histories between two genomes has polynomial or better time complexity, whether based upon inversions ([16]), translocations ([17]) or doublecutandjoin (DCJ) operations ([18]). However, for three or more genomes with balanced rearrangements ([19]) or when rearrangements are unbalanced (involving the gain or loss of material) leading to duplications (additional copies of subsequences resulting from rearrangement), these exact parsimony methods are intractable. Exact solutions in the most general case are therefore only feasible for relatively small problems ([20]) before heuristics become necessary ([21, 22]).
Despite the hardness of the general case, there has been substantial work on computing maximum parsimony results, allowing for a wider repertoire of rearrangements. ElMabrouk studied inversions and indels, though gave no exact algorithm for the general case ([23]). Recently Yancopoulous ([24]) then Braga ([25]) considered the distance between pairs of genomes differing by DCJ operations and indels, the latter providing the first lineartime algorithm for balanced rearrangements and indels, and the former proposing a datastructure to model duplications. Many methods have been proposed that deal with the combination of rearrangements and duplications, for good recent reviews see ([26, 27]), however until recently there were no algorithms to our knowledge that explicitly unified both duplications and genome rearrangements as forms of general unbalanced rearrangement. First [28] provided a model allowing for a subset of duplications and deletions as well as balanced DCJ operations, giving a lower bound approximation, while [29] studied a model allowing atomic (single gene) duplications, insertions and deletions, but arrived at no closedform formula for the total number of rearrangements.
The graph model introduced in this paper is capable of representing a general evolutionary history for any combination of replication, substitution and rearrangement operations, including duplications and homologous recombinations. It therefore generalises phylogenetic trees, graphs representing histories with indels, ancestral recombination graphs and breakpoint graphs, building upon the methods described above. We start by introducing this graph and then develop a maximum parsimony problem that, somewhat imperfectly, generalises maximum parsimony variants of all the problems mentioned, facilitating the study of all these subproblems in one unified domain. We adopt the common assumption that all substitutions and rearrangements occur independently of one another, and account for tradeoffs between them by independent rearrangement and substitution costs, which are themselves essentially sums over the numbers of inferred events. Importantly, replications that are combined with unbalanced rearrangements are costed by the underlying rearrangement cost. We finally provide a bounded sampling approach to cope with the NPhardness of the general maximum parsimony problem.
Results
Sequence graphs and threads
Sequence graphs are used extensively in comparative genomics, in rearrangement theory typically under the name (multi or master) breakpoint graph ([14, 15, 22]) and in alignment under the name Abruijn ([30]) or adjacency graph ([31]). We use the following bidirected form, which is similar to that used by [32] for sequence assembly.
A (bidirected) sequence graph G=(V_{ G },E_{ G }) is a graph in which a set V_{ G } of vertices are connected by a set E_{ G } of bidirected edges ([33]), termed adjacencies. A vertex represents a subsequence of DNA termed a segment. A vertex x is oriented, having a tail side and a head side, respectively denoted x_{ h e a d } and x_{ t a i l }. These categories {h e a d,t a i l} are called orientations. An adjacency, which represents the covalent bond between adjacent nucleotides of DNA, is a pair set of sides. We refer to the two sides contained in an adjacency as its endpoints. Adjacencies are bidirected, in that each endpoint is not just a vertex, but a vertex with an independent orientation (either head or tail). For convenience, we say a side is attached if it is contained in an adjacency, else it is unattached. By extension, we say a vertex is attached if either of its sides are attached, else it is unattached.
Associated with a sequence graph is a labeling, i.e. a function l:V_{ G }→Σ^{∗}∪{∅} where Σ={A/T,C/G,G/C,T/A} is the alphabet of bases, which are oriented, paired nucleotides of DNA, and Σ^{∗} is the set of all possible labels consisting of finite sequences of bases in Σ. Bases and labels are directed. For ρ/τ∈Σ, ρ is the forward complement and τ is the reverse complement. If a vertex is traversed from its tail to its head side, its label is read as the sequence of its forward complements. Conversely, if traversed from head to tail, the label is read as the reverse sequence of the reverse complements. A vertex x∈V_{ G } for which l(x)=∅ is unlabeled. A label represents a multibase allele. A path through the sequence therefore represents a single DNA sequence (and its reverse complement) whose bases are encoded by the labels of the vertices, where unlabeled vertices represent missing information.
A thread is a connected component in a sequence graph in which each side is connected to at most one adjacency. A thread graph is a sequence graph in which every connected component is a thread. In this paper we limit ourselves to investigating thread graphs. A thread may be a simple cycle, representing a circular DNA molecule, or have two unattached sides, in which case it represents a linear DNA molecule or fragment of a larger DNA molecule. An example thread graph is shown in Figure 1. A thread graph is phased, in that each thread is assigned a maximal DNA sequence (and its reverse complement), and any path though that thread corresponds to a subsequence of these maximal sequences. In contrast, a sequence graph that is not a thread graph may be unphased, in that there exist many possible maximal sequences for each of its connected components.
History graphs
Nucleotides of DNA derive from one another by a process of replication. This replication process is represented in history graphs, which add ancestry relationships to thread graphs.
A history graph G=(V_{ G },E_{ G },B_{ G }) is a thread graph with an additional set B_{ G } of directed edges between vertices, termed branches. Each vertex is incident with at most one incoming branch. The event graph D(G) is the directed graph formed by the contraction^{a} of adjacencies in E_{ G }. For G to be a history graph D(G) must be a directed acyclic graph (DAG), a property we term acyclicity. Example history graphs are shown in Figure 2(A,B), along with an event graph in Figure 2(C) for the history graph shown in Figure 2(B).
To avoid confusion we define terminology to discuss branch relationships. Each weakly connected component of branches forms a branchtree. Two vertices are homologous if they are in the same branchtree. A vertex y is a descendant of a vertex x, and conversely y is an ancestor of x, if y is reachable by a directed path of branches from x. If two homologous vertices do not have an ancestor/descendant relationship then they are indirectly related. For a branch e=(x,y), x is the parent of e and y, and y is the child of e and a child of x. Similarly, e is the parent branch of y and a child branch of x. A vertex is a leaf if it has no incident outgoing branches, a root if it has no incident incoming branches, else it is internal. We reuse the terminology of parent, child, homologous, ancestor, descendant and indirectly related with sides. Two sides have a given relationship if their vertices have the relationship and they have the same orientation. Similarly, a side is a leaf (resp. root) if its vertex is a leaf (resp. root).
Simple histories
We formally define a class of history graphs, called simple histories, for which parsimonious sequences of substitutions and rearrangements can be trivially derived.
A bilayered history graph is a history graph whose threads can be partitioned into root and leaf layers, such that every branch connects a vertex in the root layer with a vertex in the leaf layer. A rearrangement epoch is a bilayered history graph in which every branch tree is a root with 1 child, every vertex is labeled, and any set of homologous sides are either all attached or all unattached. For n≥2, an nway replication epoch is a bilayered history graph in which every branch tree is a root with n children, every vertex is labeled, any set of homologous sides are either all attached or all unattached, if two root sides x_{ α } and y_{ β } are attached by an adjacency then each child of x_{ α } is attached to a child of y_{ β }, and a root vertex has at most one child with a label different from its own. An epoch is either a rearrangement epoch or an nway replication epoch for some n≥2. A layered history graph is a history graph that can be edge partitioned into a finite sequence of bilayered history graphs, such that the leaf layer of a contained bilayered history graph is the root layer of the following bilayered history graph. A simple history is a layered history graph whose bilayered subgraphs are all epochs. An example simple history with epoch subgraphs is shown in Figure 2(D).
A substitution occurs on a branch if the labels of its endpoints are not identical. Note that a substitution can occur either in a rearrangement or a replication epoch. The substitution cost of a simple history H is the total number of substitutions, denoted s(H). The example simple history in Figure 2(D) has substitution cost 4. Note the requirement that all homologous sides in a simple history be either all attached or all unattached does not forbid rearrangements involving the observed ends of chromosomes (linear threads), because it is always possible to add material to a simple history at zero cost that attaches such unattached sides and allows them to participate in rearrangements.
The substitution cost defined deals, abstractly, with changes of alleles in which any change between alleles is scored equally. However for the case Σ^{∗}=Σ, i.e. single base labels, the substitution cost is the minimum number of single base changes. Furthermore, any history graph in which all homologous labels have the same length can easily be converted to a semantically equivalent history graph for which Σ^{∗}=Σ. More complex substitution costs to deal with the case where the alphabet represents the alleles of genes, as is commonly dealt with in rearrangement theory, are straightforward but not pursued here for simplicity.
A rearrangement cycle in a rearrangement epoch is a circular path consisting of one or more repetitions of the basic pattern consisting of an adjacency edge in the root layer, a forward branch to the leaf layer, an adjacency edge in the leaf layer and a reverse branch to the root layer. Its size is the number of repetitions in it of this basic pattern minus 1. A linear path that follows this same basic pattern but does not complete every pattern and return to the original vertex is a degenerate rearrangement cycle. Its size is the size of the smallest rearrangement cycle that can be obtained from it by adding edges. The rearrangement cost of a simple history H is the total size of all rearrangement cycles in it, denoted r(H). This cost is known to be the number of doublecutandjoin (DCJ) operations needed to achieve all the rearrangements.
Lemma 1.
The rearrangement cost of an epoch is the minimum number of doublecutandjoin (DCJ) operations required to convert the root layer’s adjacencies into the leaf layer’s adjacencies.
Proof.
Similar to that given in [18].
The example simple history in Figure 2(D) has rearrangement cost 3.
Because different studies lay different emphases on substitution or rearrangement (e.g. because of the available data) and because the events do not have the same probability in practice, we allow for a degree of freedom in the definition of the overall cost function. A (simple history) cost function for a simple history is any monotone function on the substitution and rearrangement costs in which both substitutions and rearrangements have nonzero cost.
Reduction
Not all history graphs are as detailed as simple histories. We define below a partial order relationship that describes how one graph can be a generalization of another graph, so for example, a less detailed history graph can be used to subsume multiple simple histories.
A branch whose child is unlabeled and unattached is referred to as having a freechild. A branch whose parent is unlabeled, unattached and a root with a single child is referred to as having a freeparent. A vertex is isolated if it has no incident adjacencies or branches.
A reduction operationis an operation upon a history graph that either:

Deletes an adjacency, an isolated vertex or the label of a vertex.

Contracts a branch with a freechild or freeparent.
See Figure 3(AE) for examples. The inverse of a reduction operation is an extension operation.
Lemma 2.
The result of a reduction operation is itself a history graph.
A history graph G is a reduction of another history graph G^{′} if G is isomorphic to a graph that can be obtained from G^{′} by a sequence of reduction operations, termed a reduction sequence.
Lemma 3.
The reduction relation is a partial order.
We write G\u22de{G}^{\prime} to indicate that G is a reduction of G^{′} and G≺G^{′} to indicate that G is a reduction of G^{′} not equal to G^{′}. Like reduction and extension operations, if G is a reduction of G^{′}, G^{′} is an extension of G. An examination of the reduction relation is in the Discussion section.
History graph cost
Using the parsimony principle, we now extend parsimony cost functions, previously defined on simple histories, to all history graphs.
A simple history H that is an extension of a history graph G is called a realisation of G. The set \mathcal{\mathscr{H}}(G) is the realisations of G.
For a given cost function c the cost of a history graph G is^{b}
Lemma 4.
The problem of finding the cost of a history graph is NPhard.
Proof.
There are parsimony problems on either substitutions or rearrangements alone that are NPhard and can be formulated as special cases of the problem of finding the minimum cost realisation of a history graph ([9, 34]).
The lifted graph
Although determining the cost of a history graph is NPhard, we will show that the cost can be bounded such that the bounds become tight for a broad, characteristic subset of history graphs. To do this we introduce the concept of lifted labels and adjacencies, which are used to project information about labels and adjacencies from descendant to ancestral vertices and are useful in reasoning about the cost of a history graph.
The freeroots of a history graph G are a set of additional vertices such that a single, unique freeroot is assigned to each root vertex in G (see the top of Figure 4(A)). For a vertex x, its lifting ancestor A(x) is the most recent labeled ancestor of x, else if no such vertex exists, the freeroot of the branchtree containing x. For a side x_{ α } its lifting ancestor (overloading notation) A(x_{ α }) is its most recent attached ancestor, else if no attached ancestor exists, its ancestral side in the free root of the branch tree containing it.
For a labeled vertex y, a lifted label is a label identical to l(y) on its lifting ancestor. For a vertex the lifted labels is therefore a multiset, because the same lifted label may be lifted to a lifting ancestor from multiple distinct descendants and each is considered an element of the multiset.
For an adjacency {x_{ α },y_{ β }}, a lifted adjacency is a bidirected edge {A(x_{ α }),A(y_{ β })}. In analogy with the lifted labels for a vertex, the lifted adjacencies for a side is the multiset of lifted adjacencies incident with the side.
A history graph G with freeroots, lifted labels and lifted adjacencies is a lifted graph L(G). Figure 4(A) shows an example lifted graph that outlines these concepts.
Some lifted elements do not imply change between descendant and ancestral states, while others do. To formalise such a notion we define trivial and nontrivial labels and and adjacencies. A lifted label ρ of a labeled vertex x is trivial if l(x)=ρ. A lifted label ρ on an unlabeled vertex x (necessarily a free root) is trivial if it is the only lifted label on x. Otherwise a lifted label is nontrivial.
A junction side is a most recent common ancestor (MRCA) of two attached, indirectly related sides. For a history graph G, a lifted adjacency e={A(x_{ α }),A(y_{ β })} is trivial if there exists no unattached junction side on the path of branches from (but excluding) A(x_{ α }) to (but excluding) x_{ α }, or on the path of branches from (but excluding) A(y_{ β }) to (but excluding) y_{ β } and either there is a (regular) adjacency between A(x_{ α }) and A(y_{ β }) in G or A(x_{ α }) and A(y_{ β }) are free roots, else e is nontrivial. See Figure 4(A) for examples of trivial and nontrivial labels and adjacencies.
Ancestral variation graphs
We can now define a broad class of history graphs for which cost can be computed in polynomial time. To do this we will define ambiguity, information that is needed to allow the tractable assessment of cost. There are two types of ambiguity. The substitution ambiguity of a history graph G, denoted u_{ s }(G), is the total number of nontrivial lifted labels in excess of one per vertex. Substitution ambiguity reflects uncertainty about MRCA bases. The substitution ambiguity of the history graph in Figure 2(B) is 1, as there exists one vertex with two nontrivial lifted labels.
The rearrangement ambiguity of a history graph G, denoted u_{ r }(G), is the total number of nontrivial lifted adjacency incidences in excess of one per side.Rearrangement ambiguity reflects uncertainty about MRCA adjacencies. The rearrangement ambiguity of the history graph in Figure 2(B) is 5, because two sides have three incident nontrivial lifted edges and one side has two incident nontrivial lifted edges.
The ambiguity of a history graph G is u(G)=u_{ s }(G)+u_{ r }(G). An ancestral variation graph (AVG) H is a history graph such that u(H)=0, i.e. an unambiguous history graph.
Lemma 5.
Simple histories are AVGs.
While simple histories are AVGs, so are many other history graphs that are far less detailed. For example, the AVG in Figure 5 is not a simple history.
Bounds on cost
We provide trivially computable lower and upper bound cost functions for history graphs that are tight for AVGs.
The lower bound substitution cost (LBSC) of a history graph G, denoted s_{ l }(G), is the total number of distinct (not counting duplicates in the multiset) nontrivial lifted labels at all vertices minus the number of unlabeled vertices with nontrivial lifted labels (necessarily free roots). The LBSC of the history graph in Figure 2(B) is 4.
The upper bound substitution cost (UBSC) of a history graph G, denoted s_{ u }(G), is the total number of nontrivial lifted labels at all vertices minus the maximum number of identical lifted labels at each unlabeled vertex with nontrivial lifted labels (again, necessarily free roots). The UBSC of the history graph in Figure 2(B) is 5. For the AVG in Figure 5, LBSC = UBSC = 4.
The module graph of a history graph G is a multigraph in which the vertices are the sides of vertices in L(G) that have incident real or lifted adjacencies and the edges are the real and lifted adjacencies in L(G) incident with these sides. Each connected component in a module graph is called a module. The set of modules in the module graph for G is denoted M(G). Figure 4(B) shows the modules for Figure 4(A).
The lower bound rearrangement cost (LBRC) for a history graph G is:
For a history graph that is a simple history this definition is equivalent to the earlier definition of rearrangement cost for simple histories.
The upper bound rearrangement cost (UBRC) of a history graph G, denoted r_{ u }(G), is the total number of nontrivial lifted adjacencies in L(G) minus the number of modules in M(G) in which every side has exactly one incident nontrivial lifted edge. The LBRC of the history graph in Figure 2(B) is 3 and its UBRC is 6. For the AVG in Figure 5 LBRC = UBRC = 3.
Theorem 1.
For any history graph G and any cost function c, c(s_{ l }(G),r_{ l }(G))≤C(G,c)≤c(s_{ u }(G),r_{ u }(G)) with equality if G is an AVG.
The proof is in given the Methods section.
Theorem 1 demonstrates that LBSC and LBRC are lower bounds on cost, UBSC and UCRC are upper bounds on cost, and that all these bounds become tight at the point of zero ambiguity. This implies that to assess cost of an arbitrary history graph G we need only search for extensions of G to the point that they have zero ambiguity and not the complete set of simple history realisations of G. For an AVG H, as the lower and upper bounds on cost are equivalent, we write r(H)=r_{ l }(H)=r_{ u }(H) and s(H)=s_{ l }(H)=s_{ u }(H).
Goptimal AVGs
We now explore the process of sampling AVG extensions of an initial starting graph. Though it is possible to start from any history graph, in practice we are likely to start from a history graph G based on sequence alignments, such as that shown in Figure 2(A). If G is already an AVG, by Theorem 1, it is trivial to assess its cost. If not we sample AVG extensions of G in order to assess cost and explore the set of most parsimonious realisations of G. With the aim of restricting this search, ultimately to a finite space, we first define the set of Goptimal AVGs.
An AVG extension H of a history graph G is Gparsimonious w.r.t. a cost function c if C(G,c)=c(s(H),r(H)). The set of Gparsimonious AVGs is necessarily infinite: it is always possible to add arbitrary vertices without affecting substitution or rearrangement costs. To avoid the redundant sampling of AVG extensions of G and their own extensions we define the notion of minimality.
An AVG extension H of G is Gminimal if there is no other AVG H^{′} such that G≺H^{′}≺H. The set of Gminimal AVGs contains those AVGs that can not be reduced without either ceasing to be AVGs or extensions of G. This set is also infinite for some DNA history graphs (Lemma 9 below).
An AVG is Goptimal w.r.t. a cost function c if it is both Gparsimonious w.r.t. to c and Gminimal. We establish below that the set of Goptimal AVGs is finite for any history graph G. By definition, any Gparsimonious AVG is either Gminimal or has a Gminimal reduction therefore we can implicitly represent and explore the set of parsimonious realisations of G by sampling just the Goptimal AVGs.
Gbounded history graphs
Unfortunately, because the history graph cost problem is NPhard, it is unlikely that there exists an efficient way to sample only Goptimal. Instead, we now define a finite bounding set that contains Goptimal and can be efficiently searched. Conveniently this bounding set is the same for all cost functions.
A label of a vertex x is a junction (overloading the term junction, but using it analogously) if x has more than one lifted label, else it is a bridge if x has one lifted label, its lifted label is nontrivial, the most recent labeled ancestor of x is labeled the same as x and this ancestor has at least one nontrivial lifted label (see Figure 6(A,B)).
A side x_{ α } is a bridge side if it is not a junction, is incident with one nontrivial lifted adjacency and an adjacency e that defines a trivial lifted adjacency e^{′} whose A(x_{ α }) endpoint is a junction side incident with a nontrivial lifted adjacency, and such that if e is deleted at least one endpoint of e^{′} in the original graph remains a junction side in the resulting graph (see Figure 6(C,D)). An adjacency is a junction (again, overloading the term junction) if either of its endpoints are junctions, else it is a bridge (overloading bridge) if either of its endpoints are bridge sides.
An element is nonminimal if it is a branch with a freechild or freeparent, an isolated vertex, or label or adjacency that is not a junction or bridge.
For G\u22de{G}^{\prime}, an element in G^{′} is Greducible if there exists a reduction operation in a reduction sequence from G^{′} to G that either deletes the element if it is an adjacency, label or vertex or contracts it if it is a branch. We are interested in the set of Greducible elements of an extension of G, as they are the elements which may be added and removed during an iterative sampling procedure.
For G\u22de{G}^{\prime}, the Gunbridged graph of G^{′} is the reduction resulting from the deletion of all Greducible bridge adjacencies in G^{′}. A side x_{ α } that has no attached descendants is a hanging side. A pair of adjacencies e and e^{′}, each with a hanging side, and such that e has an endpoint whose most recent attached ancestor is incident with e^{′}, form a pair of pingpong adjacencies. We call e the ping adjacency and e^{′} the pong adjacency (Figure 6(E)).
A history graph G^{′} is Gbounded if it is an extension of G that does not contain a Greducible nonminimal element and its Gunbridged graph does not contain a Greducible ping adjacency.
Theorem 2.
The set of Gbounded AVGs contains the Goptimal AVGs for every cost function.
The proof is given in the Methods section.
Importantly, the following theorem demonstrates that there is a constant k such that any Gbounded history graph is at most k times the cardinality of G.
Theorem 3.
A Gbounded history graph contains less than or equal to max(0,10n−8)Greducible adjacencies and max(0,2m−2,20n−16,20n+2m−18) additional vertices, where n is the number of adjacencies in G and m is the number of labeled vertices in G. This bound is tight for all values of n and m.
The proof is given in the Methods section.
The set of Gbounded history graphs and, by inclusion, the set of Goptimal AVGs are therefore finite.
The Gbounded poset
Finally we demonstrate how to navigate between Gbounded history graphs using a characteristic set of operations that define a hierarchy between these graphs.
For a vertex x in a Gbounded history graph the composite minimisation of xis as follows:

If x is unattached and unlabeled and has a Greducible parent branch, the contraction of the parent branch, renaming the resulting merged vertex x.

If x is then an unattached, unlabeled root and has a single Greducible child branch, the contraction of the child branch, renaming the resulting merged vertex x.

The deletion of x if subsequently isolated, unlabeled and Greducible.
A Gbounded reduction operation on a Gbounded history graph is one of the following operations, provided it results in a Gbounded history graph.

a label detachment: the deletion of a Greducible label on a vertex x, followed by the composite minimisation of x (Figure 7(AC)).

an adjacency detachment: the deletion of a Greducible adjacency {x_{ a },y_{ ß }} followed by the composite minimisation of x and y (Figure 7(DF)). The inverse of an adjacency detachment is an adjacency attachment.

a lateraladjacency detachment: the adjacency detachment of a pair of Greducible junction adjacencies {x_{ a },y_{ ß }} and {A(x_{ a }),A(y_{ ß })}, and a subsequent adjacency attachment that creates an adjacency that includes x_{ a } or y_{ ß } as an endpoint (Figure 7(DE)).
Note that the first two Gbounded reduction operations are combinations of reduction operations, while the lateraladjacency detachment, which proves necessary to avoid creating intermediate graphs with Greducible pingpong edges, involves both reduction and extension operations, but always reduces the total number of adjacencies. As with reduction operations, the inverse of a Gbounded reduction operation is a Gbounded extension operation. A Gbounded history graph G^{′} is a Gbounded reduction (resp. extension) of another Gbounded history graph G^{′′} if G^{′} is isomorphic to a graph that can be obtained from G^{′′} by a sequence of Gbounded reduction (resp. extension) operations.
Lemma 6.
The Gbounded reduction relation is a partial order.
The Gbounded poset is the set of Gbounded history graphs with the Gbounded reduction relation. We write ≺_{ G } to denote the Gbounded reduction relation and ≺·_{ G } to denote its covering relation (i.e. A≺·_{ G }B iff A≺_{ G }B and there exists no C such that A≺_{ G }C≺_{ G }B).
Theorem 4.
The Gbounded poset is finite, has a single least element G, and its maximal elements are all AVGs. Also, G^{′}≺·_{ G }G^{′′} iff there exists a single Gbounded reduction operation that transforms G^{′′} into G^{′}.
The proof is given in the Methods section.
As the Gbounded poset is finite, it can be represented by a Hasse diagram whose nodes are the Gbounded history graphs and whose edges, which are the covering relation, represent equivalence classes of Gbounded operations. Figure 8 shows a simple Gbounded poset Hasse diagram.
A basic implementation
The previous four theorems establish the mechanics of everything we need to sample the finite set of Goptimal AVGs, and thus, amongst other things, determine the cost of a history graph. Although it will require further work to establish practical and efficient sampling algorithms, we have implemented a simple graph library in Python that for an input history graph G iteratively generates Gbounded AVGs (https://github.com/dzerbino/pyAVG) through sequences of Gbounded extension operations.
To test the library we used simulations. For each simulation we generated a simple history H by forward simulation, starting from a genome with 5 vertices in a single thread and simulating 4 epochs in which either whole chromosome replication or rearrangements occurred and substitutions were made at a constant rate at each branch. The labels in the simulation correspond to single DNA bases. To ensure complexity, we selected histories with substitutions, rearrangements and at least two epochs of replication. We created a reduction G of H by removing from H all labels of internal vertices and adjacencies incident on internal vertices and finally contracting the parent branch of all internal vertices. As a result, the reduced history contained only the leaf threads and branch trees that, containing no internal vertices, simply indicate the homologies between the vertices. To simulate incomplete genome assemblies, we randomly removed, on average, 10% of the adjacencies, labels and vertices from these leaf threads. To test our library we enumerated sequences of Gbounded history graphs starting at G, at each step picking at random a possible Gbounded extension operation until we reached an AVG. We sampled 20,000 starts for each of 20 randomly sampled pairs of history and starting graph. To make the search strategy efficient, we restarted the search if we reached an extension with a higher total sum of lower bound substitution and rearrangement costs than s_{ u }(G)+r_{ u }(G), initially, and then subsequently the sum of the substitution and rearrangement costs of the best AVG found up to that point. Tables 1 and 2 show the results of these 20 sampling runs. Figure 9 shows one example of H, G and a sampled AVG.
For these simulations the minimum rearrangement cost of any sampled AVG is often close or equal to r_{ l }(G), while the maximum rearrangement cost of any sampled AVG is generally slightly greater than r_{ u }(G). Notably, we found that AVG extensions sometimes had lower cost than the original simple history, this occurring because of the information loss that resulted from reducing H to G.
Repeating these experiments with histories that started with 10 root vertices in the simple history, but which were otherwise simulated identically, demonstrates that the naive random search procedure implemented here fails to find reasonable histories within a set of only 20,000 random samples (data not shown), so, as might be expected, more intelligent sampling strategies will be needed to find parsimonious interpretations of even moderately complex datasets. However, with more efficient sampling algorithms, a history graph sampling algorithm could be applied to find solutions to various established parsimony problems, such as the DCJ median problem, or be used for less explored problems, such as the inference of gene trees incorporating synteny information.
Discussion
We have introduced a general model for genome evolution under parsimony, but the reduction relation and the definition of the Gbounded set may appear arbitrary. We highlight below the reasons for our choice of reduction relation, how reduction relates to other orderings over graphs, and how we can easily approximate a set of Greducible elements, something critical to the sampling of Gbounded extensions of a given graph. We then briefly discuss the possibilities of yet more compact graphical representations.In the reduction relation, we allow the deletion of vertices, vertex labels and adjacencies, but forbid branch deletion. Otherwise, extensions would allow the invention of homology between vertices (see Figure 10(A)). Conversely, branches can be contracted but not adjacencies, otherwise extensions could create interstitial vertices without any rearrangement (see Figure 10(B)).We disallow the nontrivial contraction of the incoming branch of attached or labeled vertices, with the one exception for branches with freeparents, because it would allow a reduction to merge previously separate threads (see Figure 10(C)), and because vertices could be reduced to become ancestors of originally indirectly related vertices (see Figure 10(D)). We allow the one exception for the contraction of the incoming branch of attached or labeled vertices when the branch has a freeparent because disallowing it would forbid reductions that removed information from root vertices (see Figure 10(E)) and allowing it does not permit the issues highlighted in Figures 10(CD).It is informative to consider the relationship between reduction operations and the reduction relation. When a graph contains multiple copies of isomorphic structures, distinct reduction operations can result in isomorphic reductions (see Figure 10(FI)), therefore each possible reduction in the covering set (transitive reduction) of the reduction relation represents an equivalence class of reduction operations.
A valid permutation of a reduction sequence is a permutation in which all operations remain reduction operations when performed in sequence. Clearly not all permutations of a reduction sequence have this property, however the following lemma illustrates the relationship between valid permutations.
Lemma 7.
All valid permutations of a reduction sequence create isomorphic reductions.
Reduction is somewhat analogous to a restricted form of the graph minor. Importantly, the graph minor is a wellquasiordering (WQO) ([35]), i.e. in any infinite set of graphs there exists a pair such that one is the minor of the other.
Lemma 8.
Reduction is not a WQO.
Proof.
Consider the infinite set of cyclic threads, they are not reductions of one another.
An ordering is a WQO if every set has a finite subset of minimal elements. In contrast, it can be shown that for the reduction relation, even the set of AVG extensions of a single base history G can have an infinite set of minimal elements.
Lemma 9.
There exists a history graph G with an infinite number of Gminimal extensions.
The proof is given in the Methods section.
One barrier to exploring the Gbounded poset is deciding for a pair of history graphs G and G^{′} such that G\u22de{G}^{\prime} if an element is Greducible. This problem is of unknown complexity, and may well be NPhard. To avoid the potential complexity of this problem we can define an alternative notion of reducibility. A fix for (G,G^{′}), where G\u22de{G}^{\prime}, is a history subgraph of ({V}_{{G}^{\prime}},{E}_{{G}^{\prime}},{B}_{{G}^{\prime}}^{+}) isomorphic to G, where {B}_{{G}^{\prime}}^{+} is the transitive closure of {B}_{{G}^{\prime}}. Starting from an input history graph G and a fix isomorphic to it, we can easily update the fix as we create extensions of G. For an extension of G, elements in the fix become the equivalent of Girreducible, while elements not in the fix become the equivalent of Greducible. From a starting graph we can therefore explore a completely analogous version of Gbounded, replacing the question of Greducibilty with membership of the fix.
Following from Lemma 7, there is a bijection between the set of fixes for G\u22de{G}^{\prime} and the set of equivalence classes of reduction sequences that are all valid permutations of each other. This is the limitation of considering membership of a fix instead of assessing if an element is Greducible, it limits us to considering only a single equivalence class of reduction sequences in exploring the analogous poset to Gbounded.
It is in general possible to reduce the size of the set Gbounded while still maintaining the properties that it can be efficiently sampled and contains Goptimal. However, this is likely to be at the expense of making the definition of Gbounded more complex. One approach is to add further “forbidden configurations” to the definition of Gbounded, like the Greducible ping adjacencies that are forbidden in the current definition of Gbounded. Forbidding these was essential to making Gbounded finite, but we might consider also forbidding other configurations just to make Gbounded smaller.
It is possible to consider a graph representation of histories that use fewer vertex nodes if we are willing to allow for the possibility that a subrange of the sequence of a vertex be ancestral to a subrange of the sequence of another vertex. This is a common approach in ancestral recombination graphs ([11]). Such a representation entails the additional complexity of needing to specify the sequence subranges for every branch, but may in some applications be a worthwhile trade off for reducing the number of vertices in the graph. The theory of such graphs is mathematically equivalent to the theory of the history graphs presented here, but the implementation would differ.
Conclusion
We have introduced a graph model in which a set of chromosomes evolves via the processes of whole chromosome replication, gain and loss, substitution and DCJ rearrangements. We have demonstrated upper and lower bounds on maximum parsimony cost that are trivial to compute despite the intractability of the underlying problem. Though these cost bounding functions are relatively crude and can almost certainly be tightened for many cases, they become tight for AVGs. This implies that we only need to reach AVG extensions to assess cost when sampling extensions.
To our knowledge, this is the first fully general model of chromosome evolution by substitution, replication, and rearrangement. However, it has its limitations. For example, it treats common rearrangements, such as recombinations and indels as any other rearrangement, and only takes into account maximum parsimony evolutionary histories. We anticipate future extensions that incorporate more nuanced cost functions, as well as probabilistic models over all possible histories.
The constructive definition of the Gbounded poset, coupled with the upper and lower bound functions, suggests simple branch and bound based sampling algorithms for exploring lowcost genome histories. To facilitate the practical exploration of the space of optimal and near optimal genome histories, we expect that more advanced sampling strategies across the Gbounded poset could be devised.
Methods
Proof of Theorem 1
We first define some convenient notations to describe lifted labels and edges. For a vertex x let {L}_{x}^{\prime}=({L}_{x},{N}_{x}) be its multiset of lifted labels, where L(x) is the set of distinct lifted labels for x, and for each lifted label ρ, N_{ x }(ρ) is the number of times ρ appears as a lifted label for x, i.e. L_{ x }={l(y):A(y)=x}⊆Σ^{∗} and {N}_{x}:{L}_{x}\to {\mathbb{Z}}_{+} such that N_{ x }(ρ)={y:A(y)=x,l(y)=ρ}.
For a side x_{ α }, and overloading notation, let {L}_{{x}_{\alpha}}^{\prime}=({L}_{{x}_{\alpha}},{N}_{{x}_{\alpha}}) be its multiset of lifted edges, where L(x_{ α }) is the set of distinct lifted adjacencies incident with x_{ α }, and for each lifted adjacency {x_{ α },w_{ γ }}, {N}_{{x}_{\alpha}}(\{{x}_{\alpha},{w}_{\gamma}\}) is the number of sides whose lifting ancestor is x_{ α }, and which are connected by an adjacency to a side whose lifting ancestor is w_{ γ }, i.e. {L}_{{x}_{\alpha}}=\left\{\right\{{x}_{\alpha}=A({y}_{\alpha}),A({z}_{\beta})\}:\{{y}_{\alpha},{z}_{\beta}\}\in {E}_{G}\} and {N}_{{x}_{\alpha}}={L}_{{x}_{\alpha}}\to {\mathbb{Z}}_{+} such that {N}_{{x}_{\alpha}}(\{{x}_{\alpha},{w}_{\gamma}\})=\{{y}_{\alpha}:\{{x}_{\alpha}=A({y}_{\alpha}),{w}_{\gamma}\}\in {L}_{{x}_{\alpha}}\}.
Note that for a side x_{ α }, {N}_{{x}_{\alpha}}(\{{x}_{\alpha},{w}_{\gamma}\}) gives the multiplicity of lifted adjacency incidences with x_{ α }, not the multiplicity of {x_{ α },w_{ γ }}. In particular, if two sides x_{ α } and {x}_{\alpha}^{\prime} are attached and share the same lifting ancestor A(x_{ α }), then {N}_{A({x}_{\alpha})}(\{A({x}_{\alpha}),A({x}_{\alpha})\}) is incremented by 2. On the contrary, if x_{ α } is connected to w_{ γ } and A(x_{ α }) is distinct from A(w_{ γ }), then both {N}_{A({x}_{\alpha})}(\{A({x}_{\alpha}),A({w}_{\gamma})\}) and {N}_{A({w}_{\gamma )}}(\{A({x}_{\alpha}),A({w}_{\gamma})\}) are incremented by 1.
For a vertex (resp. side) x the multiset of nontrivial lifted labels (adjacencies) is {\stackrel{~}{L}}_{x}^{\prime}=({\stackrel{~}{L}}_{x},{\xd1}_{x})\subseteq {L}_{x}^{\prime}.
The equivalence of LBSC to UBSC and LBRC to UBRC for AVGs
Lemma 10.
For any AVG H, s_{ l }(H)=s_{ u }(H).
Proof.
For a vertex x without substitution ambiguity there is at most one nontrivial lifted label, that, if it exists, has a multiplicity of one, therefore {\stackrel{~}{L}}_{x}={\stackrel{~}{L}}_{x}^{\prime}=0 or 1. Let δ_{a,b} be the Kronecker delta, i.e. δ_{a,b}=1 if a=b, else 0. It is easily verified for every possible case:
summing over modules, therefore:
A module is simple if each side has at most one incidence with a nontrivial lifted adjacency.
Lemma 11.
All modules in an AVG are simple.
Proof.
Follows from definition of rearrangement ambiguity.
Lemma 12.
For an AVG H, r_{ l }(H)=r_{ u }(H).
Proof.
Let M be a simple module and let {k}_{M}=\sum _{{x}_{\alpha}\in {V}_{M}}{\delta}_{1,{\stackrel{~}{L}}_{{x}_{\alpha}}^{\prime}}, i.e. the number of sides in V_{ M } with a single incidence with a nontrivial lift.
As the module is simple it is a path or a cycle, and hence V_{ M }−k_{ M }=0,1 or 2, from which it is easily verified that:
Summing over modules in H, which are all simple, therefore:
As all modules of H are simple, k_{ M } is always even and {k}_{M}=\sum _{{x}_{\alpha}\in {V}_{M}}{\stackrel{~}{L}}_{{x}_{\alpha}}^{\prime}, therefore:
therefore, for an AVG H
A bounded transformation of a history graph into an AVG
In this section we will prove that any history graph G has an AVG extension H such that s_{ u }(G)≥s_{ u }(H) and r_{ u }(G)≥r_{ u }(H). To do this we define sequences of extension operations that when applied iteratively and exhaustively construct such an extension.
A vertex or side x is ambiguous if {\stackrel{~}{L}}_{x}^{\prime}>1. For an ambiguous freeroot x^{′} and unlabeled root vertex x such that A(x)=x^{′}, a root labeling extension is a labeling of x with a member of the set \underset{\rho \in {\stackrel{~}{L}}_{{x}^{\prime}}}{arg\phantom{\rule{1pt}{0ex}}\text{max}}{N}_{{x}^{\prime}}(\rho ) (See Figure 11(A)).
Lemma 13.
For any history graph G containing an ambiguous freeroot there exists a root labeling extension G^{′} of G such that s_{ u }(G)=s_{ u }(G^{′}), r_{ u }(G)=r_{ u }(G^{′}) and u(G)>u(G^{′}).
For a branch (x,x^{′}) an interpolation is the extension resulting from the creation of a new vertex x^{′′} and branches (x,x^{′′}) and (x^{′′},x^{′}) and the deletion of (x,x^{′}). Let x be a labeled and ambiguous vertex and x^{′} be a labeled vertex such that A(x^{′})=x and l(x)≠l(x^{′}). A substitution ambiguity reducing extension is the interpolation of a vertex x^{′′} along the parent branch of x^{′} labeled with l(x) (See Figure 11(B)).
Lemma 14.
For any history graph G containing no ambiguous freeroots and such that u_{ s }(G)>0, there exists a substitution ambiguity reducing extension G^{′} of G such that s_{ u }(G)=s_{ u }(G^{′}), r_{ u }(G)=r_{ u }(G^{′}) and u(G)>u(G^{′}).
The following is used for eliminating rearrangement ambiguity. For an unattached junction side x_{ α } a junction side attachment extension is the extension resulting from the following: If x_{ α } has no attached ancestor, the creation of a new vertex and adjacency connecting a side of the new vertex to x_{ α } (see Figure 11(C) for an example), else {A(x_{ α }),y_{ β }}∈E_{ G } and the extension is the creation of a new vertex y^{′}, branch (y,y^{′}) and adjacency \{{x}_{\alpha},{y}_{\beta}^{\prime}\} (See Figure 11(D)).
Lemma 15.
For any history graph G containing an unattached junction side, there exists a junction side attachment extension G^{′} of G such that s_{ u }(G)=s_{ u }(G^{′}), r_{ u }(G)≥r_{ u }(G^{′}), u(G)≥u(G^{′}) and G^{′} contains one less unattached junction side than G.
Let {x_{ α },y_{ β }} and {A(x_{ α }),z_{ γ }} be a pair of adjacencies and A(x_{ α }) be ambiguous. A rearrangement ambiguity reducing extension is the interpolation along the parent branch of x a vertex x^{′}, the creation of a new vertex z^{′}, new branch (z,z^{′}) and new adjacency \{{x}_{\alpha}^{\prime},{z}_{\gamma}^{\prime}\} (See Figure 11(E)).
Lemma 16.
For any history graph G containing no unattached junction sides and such that u_{ r }(G)>0, there exists a rearrangement ambiguity reducing extension G^{′} of G such that s_{ u }(G)=s_{ u }(G^{′}), r_{ u }(G)≥r_{ u }(G^{′}) and u(G)>u(G^{′}).
We can now prove the desired lemma.
Lemma 17.
Any history graph G has an AVG extension H such that s_{ u }(G)≥s_{ u }(H) and r_{ u }(G)≥r_{ u }(H).
Proof. Using the previous 4 lemmas it is easily verified the result of the following algorithm is an AVG extension H for a history graph G such that s_{ u }(G)≥s_{ u }(H) and r_{ u }(G)≥r_{ u }(H).
A bounded transformation of an AVG into a realisation
In this section we will prove that any AVG H has a realisation H such that s_{ l }(H)=s(H) and r_{ l }(H)=r(H).
A vertex connected by an adjacency to another vertex with more child branches has missing children. A root vertex that is connected to a nonroot vertex has a missing parent. Missing parents and missing children are collectively missing branches. An unattached side with homologous attached sides has a missing adjacency.We will define a series of extension types that when combined iteratively create an extension in which all vertices are labeled and no elements have missing adjacencies or branches. For each extension type defined below Figure 12 shows an example.
For an attached root vertex x, the creation of a new vertex x^{′} and branch (x^{′},x) is a case 1 extension. The case 1 extension is used iteratively to initially ensure all roots are unattached.
For an attached leaf vertex, the creation of a new vertex x^{′} and branch (x,x^{′}) is a case 2 extension. The case 2 extension is used iteratively to initially ensure all leaves are unattached.
For a side x_{ α } if A(x_{ α }) is in a module M, x_{ α } is in the face of M. Let M be a simple module containing an odd number of sides and let x_{ α } be an unattached root side in the face of M. The following is a case 3 extension: the creation of a pair of vertices y and y^{′}, an adjacency connecting a side of y to x_{ α } and the branch (y,y^{′}). The case 3 extension is used iteratively to ensure all modules contain an even number of sides.
Similarly to vertices and sides, a thread X is ancestral to a thread Y in a history graph G, and reversely Y is a descendant of X, if there exists a directed path in D(G) from the vertex representing X to the vertex representing Y, otherwise two threads are unrelated if they do not have an ancestor/descendant relationship. For a vertex x, T(x) is the thread it is part of. For a pair of unattached root sides x_{ α } and y_{ β } in the face of a simple module such that T(x)=T(y) or T(x) and T(y) are unrelated, the creation of a new adjacency {x_{ α },y_{ β }} is a case 4 extension. The case 4 extension is used iteratively to ensure all modules contain attached root sides.
Let x_{ α } be a side in the face of a simple module M such that x_{ α } is internal, unattached and has an attached parent. Let (y,y^{′}) be a branch such that {y}_{\beta}^{\prime} is a side in the face of M, T(y) is not descendant of T(x), if T(y)=T(x) then y is unattached, T(y^{′}) is descendant or unrelated to T(x), and the sides A(x_{ α }) and A({y}_{\beta}^{\prime}) in M are connected by a path containing an odd number of adjacencies/lifted adjacencies. If y_{ β } is unattached and T(y) is unrelated or equal to T(x) then the creation of the adjacency {x_{ α },y_{ β }} is the case 5 extension, else the interpolation of a vertex y^{′′} on the branch (y,y^{′}) and creation of the adjacency \{{x}_{\alpha},{y}_{\beta}^{\mathrm{\prime \prime}}\} is the case 5 extension. The case 5 extension is used iteratively to ensure all internal vertices are attached.
For an adjacency {x_{ α },y_{ β }} such that y has fewer children than x, the creation of a new vertex y^{′} and branch (y,y^{′}) is a case 6 extension. The case 6 extension is used iteratively to ensure there are no vertices with missing children.
Let x_{ α } and y_{ β } be a pair of unattached leaf sides in the face of a simple module M such that T(x) and T(y) are unrelated or equal, A(x_{ α }) and A(y_{ β }) are attached and are either connected by an adjacency or both not incident with a nontrivial lifted adjacency. The creation of a new adjacency {x_{ α },y_{ β }} is a case 7 extension. The case 7 extension is used iteratively to ensure there are no leaf vertices with missing adjacencies.
For a branchtree containing no labeled vertices, the labeling of any single vertex in the branchtree with a member of Σ^{∗} is a case 8 extension. For a branch (x,y), such that y is labeled and x is unlabeled the labeling of x with the label of y is a case 9 extension. For a branch (x,y), such that x is labeled and y is unlabeled the labeling of y with the label of x is a case 10 extension. The case 8, 9 and 10 extensions are used iteratively to ensure there are no unlabeled vertices.
Lemma 18.
For an AVG H, if H^{′} is obtained from H by any of the 10 extensions cases above then s_{ l }(H)=s_{ l }(H^{′}) and r_{ l }(H)=r_{ l }(H^{′}).
Lemma 19.
For an AVG H, each of the ten types of extensions above can only be applied consecutively a finite number of times until there are no more opportunities in the graph to apply an extension of that type.
Lemma 20.
Any AVG H has an AVG extension H^{′} with no missing labels, adjacencies or branches and such that s_{ l }(H)=s_{ l }(H^{′}) and r_{ l }(H)=r_{ l }(H^{′}).
Proof.
We will demonstrate that the following algorithm converts an AVG into an AVG with no missing adjacencies or branches or unlabeled vertices.
It follows from Lemma 19 that the algorithm always terminates and from Lemma 18 that H^{′} is an AVG such that s_{ l }(H)=s_{ l }(H^{′}) and r_{ l }(H)=r_{ l }(H^{′}).
It remains to prove that H^{′} has no missing branches or adjacencies or unlabeled vertices. Call the AVG extension resulting at the end of the ith loop of line 5 of the algorithm the case i complete extension. The following series of compounding statements are straightforward to verify.

The case 3 complete extension contains no modules with an odd number of sides.

The case 2 extensions ensure that all root vertices are unattached, and every case 3 extension attaches a root vertex in a module with an odd number of sides to a newly created root vertex, so ensuring the module contains an even number of sides, so for every module with an odd number of sides there exists a case 3 extension.

The case 4 complete extension additionally contains no root sides with missing adjacencies or root vertices with missing parents.

The case 3 extensions ensure that there always 0 or 2 unattached root sides in a module, so any unattached root side in a module always has a potential unattached partner root side within the module. The requirement that sides connected in a case 4 extension be in the same or unrelated threads prior to connection does not prevent any root side within the face of a module from becoming attached, because the case 1 extensions ensure that all root vertices are unrelated, the case 2 extensions do not effect root vertices and the case 3 and 4 vertices only result in root vertices being connected to one another.

The case 5 complete extension additionally contains no internal vertices with missing adjacencies.

The case 4 extensions ensure that all root sides within modules are attached. The case 2 extensions ensure that all attached sides have children and the case 3, 4 and 5 extensions ensure this remains true. Given this, and that every module has an even number of sides within it (as a case 3 complete extension), it is straightforward to verify that there is always a case 5 extension in a sequence of such extensions for any internal side within the face of a module.

The case 6 complete extension additionally contains no vertices with missing child branches.

The case 7 complete extension additionally contains no leaf sides with missing adjacencies, and therefore has no missing branches or adjacencies.

Analogously with the case 4 extensions, the requirement that sides connected in a case 7 extension be in the same or unrelated threads does not prevent any leaf side within the face of a module from becoming attached by a case 7 extension, this is because the case 2 extensions ensure all leaf vertices are unrelated, the case 3, 4, 5 and 6 extensions do not connect leaf vertices, and the case 7 extensions only connect leaf sides to one another.

The case 8 complete extension additionally contains no branchtrees without any labeled vertices.

The case 9 complete extension additionally contains no unlabeled ancestral vertices that have labeled descendants.

The case 10 complete extension additionally contains no unlabeled vertices, and therefore has no missing adjacencies, branches or labels.
We can now prove the desired lemma.
Lemma 21.
Any AVG H has a realisation H such that s_{ l }(H)=s(H) and r_{ l }(H)=r(H).
Proof.
Lemma 20 demonstrates there exists an AVG extension H^{′} of H with no missing labels, adjacencies or branches such that s_{ l }(H)=s_{ l }(H^{′}) and r_{ l }(H)=r(H^{′}). H^{′}is converted to a simple history with the same cost as follows.

On every branch of H^{′} interpolate a vertex.

Label each interpolated vertex identically to its parent.

Connect the sides of the interpolated vertices to one another such that for any adjacency {x_{ α },y_{ β }} connecting interpolated vertices, \{A({x}_{\alpha}),A({y}_{\beta})\}\in {E}_{{H}^{\prime}}.
It is easily verified that the result is an AVG that can be edge partitioned into rearrangement and replication epochs and hence is a simple history.
LBSC and LBRC are lower bounds
Lemma 22.
LBSC is a lower bound on substitution cost.
Proof.
From Lemmas 17 and 21 it follows that every history graph has a realisation. It is sufficient therefore to further prove that for any simple history H, s(H)=s_{ l }(H) and that a history graph G has no extension G^{′} such that s_{ l }(G)>s_{ l }(G^{′}). The former is easily verified and we now prove the latter.
Let (G=G_{ n })≺G_{n−1}≺…G_{2}≺(G_{1}=G^{′}) be a sequence of n history graphs for a reduction sequence of n−1 reduction operations. For some integer i∈ [ 1,n) if the ith reduction operation is a vertex deletion, adjacency deletion or branch contraction, as these each have no impact on the calculation of LBSC, s_{ l }(G_{i+1})=s_{ l }(G_{ i }). Else the ith reduction operation is a label deletion. Let x be the vertex whose label is being deleted. As the number of nontrivial lifted labels for A(x) after the deletion of x is less than or equal to the sum of nontrivial lifted labels for x and A(x), it follows that s_{ l }(G_{i+1})≤s_{ l }(G_{ i }). Therefore by induction s_{ l }(G)≤s_{ l }(G^{′}).
Lemma 23.
LBRC is a lower bound on rearrangement cost.
Proof.
Analogously to the proof of Lemma 22, from Lemmas 17 and 21 it follows that every history graph has a realisation. It is sufficient therefore to further prove that for any simple history H, r(H)=r_{ l }(H) and that a history graph G has no extension G^{′} such that r_{ l }(G)>r_{ l }(G^{′}). The former is easily verified and we now prove the latter.
Let (G=G_{ n })≺G_{ n }−1≺…G_{2}≺(G_{1}=G^{′}) be a sequence of n history graphs for a reduction sequence of n−1 reduction operations. For some integer i∈[1,n) if the ith reduction operation is a label deletion, vertex deletion or contraction of a branch with a freeparent, as each removes an element that has no effect on the calculation of the LBRC, r_{ l }(G_{i+1})=r_{ l }(G_{ i }).
Else if the ith reduction operation is a contraction of a branch with a freechild, as the child is unattached the only possible effect on the LBRC calculation is the conversion of nontrivial lifted adjacencies into trivial lifted adjacencies, therefore r_{ l }(G_{i+1})≤r_{ l }(G_{ i }) (see Figure 13(A)).
Let q(M) and p(M) be the number of unattached and attached sides in a module M, as q(M)+p(M)=V_{ M }:
As each side may be incident with at most one adjacency p(M) is even and p(M)/2 is the number of adjacencies in M, therefore:
Hence r_{ l }(G)=E_{ G }+Q(G)−M(G), where Q(G)=\sum _{M\in M(G)}\lceil q(M)/2\rceil. Suppose r_{ l }(G_{i+1})>r_{ l }(G_{ i }). If the ith reduction operation is an adjacency deletion, {E}_{{G}_{i+1}}+1={E}_{{G}_{i}}, therefore Q(G_{i+1})−M(G_{i+1})≥Q(G_{ i })−M(G_{ i })+2.
The removal of an adjacency can reduce the number of modules by at most two, therefore M(G_{ i })−M(G_{i+1})≤2. The number of modules decreases by the maximum of two only when the adjacency to be deleted connects two sides that each have no incident lifted adjacencies (see Figure 13(B)). However, in this case Q(G_{ i })=Q(G_{i+1})+1, as the number of unattached sides in a module decreases by 2, therefore if M(G_{ i })−M(G_{i+1})=2 then r_{ l }(G_{i+1})≤r_{ l }(G_{ i }).
An unattached side in a module is the side of a freeroot, and such a freeroot side has incident lifted adjacencies. The side of a freeroot with no incident lifted adjacencies can not become part of a module by the removal of any adjacency from the associated history graph, as by definition the homologous sides in its associated branchtree are all unattached. The removal of an adjacency can therefore only decrease or leave the same the total number of unattached sides in modules. The only way for Q(G_{i+1})−Q(G_{ i }) to be positive is therefore by the redistribution of unattached sides between modules to exploit the ceiling function. As in the removal of a single adjacency at most two unattached sides can be redistributed from a single module (see Figure 13(C)), therefore Q(G_{i+1})−Q(G_{ i })≤1. But if Q(G_{i+1})−Q(G_{ i })=1 then it is easily verified M(G_{ i })−M(G_{i+1})≤0. This is all the cases, therefore r_{ l }(G_{i+1})≤r_{ l }(G_{ i }), by induction therefore r_{ l }(G)≤r_{ l }(G^{′}).
Theorem 1.
For any history graph G and any cost function c, c(s_{ l }(G),r_{ l }(G))≤C(G,c)≤c(s_{ u }(G),r_{ u }(G)) with equality if G is an AVG.
Proof.
Follows from Lemmas 10, 12, 17, 21, 22 and 23.
Proof of Theorem 2
We first classify nonminimal adjacencies and labels.
A nonminimal label of a vertex x is (see Figure 14(A)):

A leaf if {L}_{x}^{\text{'}}=\left\{\right\},

else, as it is not a junction, {L}_{x}^{\text{'}}=1 and:

the label is redundant if {\stackrel{~}{L}}_{x}^{\text{'}}=\left\{\right\},

else complicating if l(A(x))?l(x),

else l(A(x))=l(x) and, as it is not a bridge, then {\stackrel{~}{L}}_{A(x)}=\left\{\right\} and it is an unnecessary bridge.

A nonminimal adjacency {x_{ α },y_{ β }} is (see Figure 14(B)):

a leaf if {L}_{{x}_{a}}^{\text{'}}?{L}_{{y}_{\xdf}}^{\text{'}}=\left\{\right\},

else, as it is not a junction, neither x_{ a } or y_{ ß } are junction sides and it is complex if {L}_{{x}_{a}}^{\text{'}}>1 or {L}_{{y}_{\xdf}}^{\text{'}}>1,

else {L}_{{x}_{a}}^{\text{'}}=1, {L}_{{y}_{\xdf}}^{\text{'}}=1 and:

the adjacency is redundant if {L}_{{x}_{a}}?{L}_{{y}_{\xdf}}=\left\{\right\{{x}_{a},{y}_{\xdf}\left\}\right\},

else complicating if {A(x_{ a }),A(y_{ ß })} is a nontrivial lifted adjacency,

else {A(x_{ a }),A(y_{ ß })} is a trivial lifted adjacency and, as it is not a bridge either:

({L}_{A({x}_{a})}^{\text{'}}?{L}_{{x}_{a}}^{\text{'}})\backslash \left\{\right\{{x}_{a},{y}_{\xdf}\left\}\right\}=1 and ({L}_{A({y}_{\xdf})}^{\text{'}}?{L}_{{y}_{\xdf}}^{\text{'}})\backslash \left\{\right\{{x}_{a},{y}_{\xdf}\left\}\right\}=1 and it is a removable bridge.
{\stackrel{~}{L}}_{A({x}_{a})}^{\text{'}}?{\stackrel{~}{L}}_{A({y}_{\xdf})}^{\text{'}}=\left\{\right\} 
and it is an unnecessary bridge,


Lemma 24.
A Gminimal AVG contains no Greducible nonminimal elements.
Proof.
We prove the contrapositive. It is easily verified that the deletion of any single nonminimal vertex or contraction of a nonminimal branch from an AVG results in a reduction that is also an AVG. It is also easily verified that the deletion of each possible type of nonminimal label/adjacency from an AVG results in a reduction that is also an AVG, with the exceptions of a complex nonminimal adjacency, which can not be present within an AVG (because such an edge implies ambiguity), and a removable bridge adjacency. After deletion of a removable bridge adjacency {x_{ α },y_{ β }} the adjacency {A(x_{ α }),A(y_{ β })} ceases to be a junction adjacency, and may either become a bridge, in which case the resulting graph is an AVG, or it may become a nonminimal adjacency. If it becomes a nonminimal adjacency, then, by the prior argument, if it is not a removable bridge adjacency then its deletion results in an AVG, else if it is a removable bridge then after the deletion of {A(x_{ α }),A(y_{ β })}, the process of considering if {A(A(x_{ α })),A(A(y_{ β }))} is nonminimal and deleting if necessary is repeated iteratively until the resulting graph is an AVG.
Lemma 25.
The only Greducible adjacencies in the Gunbridged graph of an extension of G containing no nonminimal elements are junction adjacencies.
Proof.
By definition, the only Greducible adjacencies in an extension of G with no Greducible nonminimal elements are junction adjacencies and bridges. Each deletion of a Greducible bridge adjacency does not create any Greducible nonminimal adjacencies, as a junction adjacency connecting sides that are the lifting ancestors of the sides connected by a bridge adjacency remains a junction adjacency after the deletion of the bridge, and the lifted adjacencies incident with the sides connected by the bridge, which are nontrivial, lift to this junction instead and therefore remain nontrivial.
Lemma 26.
The Gunbridged graph of a Goptimal AVG for any cost function contains no Greducible ping adjacencies.
Proof.
Let H be a Goptimal AVG whose Gunbridged graph H^{′} contains one or more Greducible ping adjacencies. Example subgraphs containing pingpong adjacencies are shown in Figure 15(AC). Let e={x_{ α },y_{ β }} be such a Greducible ping adjacency and y_{ β } a hanging endpoint in H^{′}. From Lemma 25, the adjacency e must be a junction. Delete e from H^{′} giving H^{′′}, note G\u22de{H}^{\mathrm{\prime \prime}}. If x_{ α } has no most recent attached ancestor create a new vertex and connect it with an adjacency to x_{ α } as shown in Figure 15(D), else do the same but connect the new vertex by a branch that makes it the child of the vertex connected by an adjacency to the most recent attached ancestor of x_{ α }, as shown in Figure 15(E). Note that it does not matter in this second case if the most recent attached ancestor of x_{ α } is a pong adjacency, as demonstrated in Figures 15(C) and (F). It is easily verified that each modification defines the Gunbridged graph of a valid AVG extension H^{′′′} of G that has one fewer Greducible ping adjacencies in its Gunbridged graph, one less rearrangement and the same number of substitutions in its most parsimonious realisation as in the most parsimonious realisation of H. This contradiction to the assumption that H was Goptimal establishes the result.
Theorem 2.
The Gbounded AVGs contain the Goptimal AVGs for every cost function.
Proof.
Follows from Lemmas 24 and 26.
Proof of Theorem 3
In the following let n be the number of adjacencies in a history graph G.
Lemma 27.
If n=0 any Gbounded extension of G contains 0 adjacencies.
Proof.
Follows from Lemma 24.
As the n=0 case is trivial now assume that n≥1. For an adjacency {x_{ α },y_{ β }} its received incidence is {L}_{{x}_{\alpha}}^{\prime}+{L}_{{y}_{\beta}}^{\prime} and its projected incidence is equal to the number of members of {A(x_{ α }),A(y_{ β })} that are attached, either 0, 1 or 2. For an adjacency, the difference between projected incidence and received incidence is the incidence transmission. A positive incidence transmission occurs when the projected incidence is greater than the received incidence number, conversely a negative incidence transmission occurs when the projected incidence is less than the received incidence. The incidence sum of a history graph is the sum of the received incidences of its adjacencies, or, equivalently, the sum of the projected incidences of its adjacencies.
Lemma 28.
The maximum possible incidence sum of G is 2n−2.
Proof.
The 2n term is because each adjacency has a projected incidence of at most 2, the −2 term is because at least one adjacency has a projected incidence of 0.
It is trivial to show this bound can be achieved for all values of n.
Lemma 29.
The Gunbridged graph G^{′′} for a Gbounded history graph G^{′} has no Greducible adjacencies with a positive incidence transmission.
Proof.
By Lemma 25, the only Greducible adjacencies in G^{′′} are junction adjacencies. Junction adjacencies have an incidence transmission of 0 or less.
Lemma 30.
The Gunbridged graph G^{′′} for a Gbounded history graph G^{′} contains less than or equal to 2n−1 adjacencies that either have a negative incidence transmission, or which are Girreducible and have an incidence transmission of 0.
Proof.
Let k_{i,j} be the number of adjacencies in G^{′′} that have a projected incidence of i and a received incidence of j. As the sum of projected incidences equals the sum of received incidences therefore:
Separating the contributions of adjacencies with a negative incidence transmission:
The first term of the lefthand side of the equation is the total incidence transmission of all adjacencies in G^{′′} with a positive incidence transmission. Using Lemma 29, these adjacencies must all be Girreducible. Let k be the number of G^{′′}irreducible adjacencies that have an incidence transmission of 0, as:
therefore b y substitution:
The righthand side of the inequality is the number of adjacencies with a negative incidence transmission plus two times the number of Girreducible adjacencies with an incidence transmission of 0.
As {\sum}_{i=0}^{2}{\sum}_{j=i+1}^{\infty}(ji1){k}_{i,j} can not be negative, it remains only to prove that this term must be positive. Assume that there are 2n or more adjacencies that either have a negative incidence transmission, or which are Girreducible and have an incidence transmission of 0 (i.e. a contradiction of the lemma). As n>0, there must be at least one adjacency in G^{′′} with a projected incidence of 0 and a received incidence of greater than 0 in some ancestral thread (i.e. {\sum}_{j=1}^{\infty}{k}_{0,j}>0). Either such an edge has a received incidence of 2 or greater, in which case the considered term must be positive, or a larger graph exists (see Figure 16) that is Hbounded and Hunbridged with respect to a graph H, which has the same number of adjacencies as G and an extra edge with a projected incidence of 0 and a received incidence of 2 or greater, which implies that {\sum}_{i=0}^{2}{\sum}_{j=i+1}^{\infty}{k}_{i,j}+2k<2n. In either case we derive a contradiction to the assumption of the number of adjacencies, therefore:
Lemma 31.
The Gunbridged graph G^{′′} of a Gbounded history graph G^{′} contains less than or equal to 3n−3Greducible adjacencies with an incidence transmission of 0.
Proof.
Let X be the set of Greducible adjacencies with an incidence transmission of 0 in G^{′′}. The sum of received incidences equals the sum of projected incidences for members of X, therefore the sum of received incidences of other adjacencies in G^{′′} (≤2n−1Greducible adjacencies with positive or negative incidence transmission by Lemmas 29 and 30 and ≤n Girreducible adjacencies) is equal to the sum of their projected incidences, which by Lemma 28 is at most 2(n+2n−1)−2=6n−4. By Lemma 25, any adjacency e={x_{ α },y_{ β }} in X must be a junction adjacency, and, as it has 0 incidence transmission, must have a hanging endpoint and projected incidence of 2 (see Figure 17(A)). Let e^{′} and e^{′′} be the adjacencies incident with A(x_{ α }) and A(y_{ β }), respectively (see Figure 17(A)). As there exist no Greducible ping adjacencies, e^{′} and e^{′′} are either Girreducible or Greducible junction adjacencies with a negative incidence transmission. As e projects at least one incidence to each such adjacency, X has a cardinality at most (6n−4)/2=3n−2. It remains to prove that it must be at least one less than this bound.
Now let e be a Greducible junction adjacency in G^{′′} that is contained in a thread that is ancestral or unrelated to all threads that contain a Greducible adjacency or label. If G^{′′} contains more adjacencies than G then such an adjacency must clearly exist in G^{′′}.
If e makes projected incidences to Girreducible adjacencies then it makes projected incidences to adjacencies not in X. If e does not make projected incidences then it has negative incidence transmission, and either e is a hanging adjacency, in which case it must receive projected incidences from adjacencies that are not in X (else there exists a Greducible ping adjacency), or e is not a hanging adjacency and a larger graph exists (see Figure 17(B)) that is Hbounded with respect to a graph H with the same number of adjacencies as G, in which case, using Lemma 30, there must be less than 2n−1Greducible negative transmission incidence adjacencies in G^{′′}. Therefore either there exist projected incidences made between adjacencies not in X or there are fewer than 2n−1Greducible negative transmission incidence adjacencies in G^{′′}, either way, there are fewer than 6n−4 projections made from adjacencies in X to adjacencies not in X, and as there are no projections made between adjacencies in X, and all adjacencies in X have a projected incidence of 2, therefore X has cardinality less than 3n−2.
Lemma 32.
A Gbounded history graph G^{′} contains less than or equal to 5n−4 junction adjacencies.
Proof.
From Lemmas 29, 30 and 31 it follows that the unbridged graph of G^{′} contains less than 5n−4 junction adjacencies. Extending the argument of Lemma 25, it is easily verified that G^{′} contains the same number of junction adjacencies as its unbridged graph.
Lemma 33.
A Gbounded history graph G^{′} contains less than or equal to 10n−8Greducible adjacencies and 20n−16 additional attached vertices. These bounds are tight for all n≥1.
Proof.
Let i and j be the numbers of Greducible junction and bridge adjacencies in G^{′}, respectively. As bridges and junctions are the only Greducible adjacencies in G^{′}, i+j is equal to the total number of Greducible adjacencies in G^{′}. Assume that i+j>10n−8. From Lemma 32 it follows that i≤5n−4, therefore j>5n−4. As j>5n−4, it follows from Lemma 32 there exists in G^{′} a pair of Greducible bridge adjacencies {x_{ α },y_{ β }}, {w_{ α },z_{ β }} such that {A(x_{ α }),A(y_{ β })}={A(w_{ α }),A(z_{ β })} (see Figure 18(A)). However, in this case there exists an extension H of G that contains the same number of adjacencies as G^{′} but one additional Greducible junction adjacency (see Figure 18(B)), therefore in H the number of Greducible junction adjacencies is greater than 5n−4, a contradiction of Lemma 32, therefore i+j≤10n−8. From this bound, trivially, the bound of the number of additional attached vertices follows. Figure 19 shows both bounds are tight for all n.
Let m be the number of labeled vertices in the history graph G. As with the n=0 case, the m=0 case is similarly trivial, but in terms of the number of Greducible labels.
Lemma 34.
If m=0 any Gbounded extension of G contains 0 labels.
Proof.
Follows from Lemma 24.
Now assume that m≥1 and that n≥0.
Lemma 35.
A Gbounded history graph G^{′} contains less than or equal to 2m−2Greducible vertex labels. This bound is tight for all m≥1.
Proof.
Let i and j be the number of junction and bridge labels, respectively, in G^{′}. By Lemma 24, the total number of Greducible labels in G^{′} is less than or equal to i+j. The number of bridges j is less than or equal to the total number of child branches of vertices that are label junctions, which, as the connected components of branches are trees, is equal or fewer than two times the number of leaf labels minus 2, and therefore equal or fewer than 2m−2. Furthermore, by definition, the lifting ancestor of a vertex with a bridge label has a nontrivial lifted label, which implies such a vertex’s label is not a bridge, therefore j≤2m−2−i, therefore j+i≤2m−2. Figure 20 shows this bound is tight for all m.
We are now in a position to prove the desired theorem for any value of n and m.
Theorem 3.
A Gbounded history graph contains less than or equal to max(0,10n−8)Greducible adjacencies and max(0,2m−2,20n−16,20n+2m−18) additional vertices. This bound is tight for all values of n≥0 and m≥0.
Proof.
Lemmas and 27 and 33 prove the bound on the number of Greducible adjacencies, it remains to prove the bound on the number of additional vertices.
Let X, Y and Z be the total numbers, respectively, of additional attached, labeled and both unattached and unlabeled vertices in G^{′}.
From Lemmas 27 and 33 it follows that X≤ max(0,20n−16). From Lemmas 34 and 35 it follows that Y≤ max(0,2m−2). Combining these results X+Y≤ max(0,2m−2,20n−16,20n+2m−18).
Assume X+Y+Z> max(0,2m−2,20n−16,20n+2m−18). As X+Y≤ max(0,2m−2,20n−16,20n+2m−18), Z≥1. As G^{′} contains no nonminimal branches, Z is a count of additional root vertices that are unlabeled, unattached and have two or more children, all of which are either labeled, attached or both. Using this information, it is straightforward to demonstrate that there exists a modified pair of history graphs (H,H^{′}) such that H has the same size and cardinality as G, and H^{′} is a Hbounded extension of H that has more labeled or attached vertices than G^{′}. The existence of (H,H^{′}) contradicts either or both Lemmas 35 or Lemma 33.
Figure 21 shows this bound is tight for all n and m.
Proof of Theorem 4
A adjacency {x_{ α },y_{ β }} is old if both A(x_{ α }) and A(y_{ β }) are each independently either the side of a freeroot or incident with a Girreducible adjacency.
Lemma 36.
For any Gbounded history graph G^{′} not isomorphic to G there exists a label detachment or adjacency detachment that results in a Gbounded history graph.
Proof.
As G is not isomorphic to G^{′}, G^{′} contains one or more Greducible elements. If there exists a Greducible label that is a bridge then its label detachment results in a Gbounded reduction (see Figure 22(AB)). Else if there exists a Greducible label it is a junction label and its label detachment results in a Gbounded reduction (see Figure 22(BC)). Else if there exists a Greducible adjacency that is a bridge then its adjacency detachment results in a Gbounded reduction (see Figure 22(CD)). Else there exists a Greducible adjacency that is an old junction adjacency and whose adjacency detachment results in a Gbounded reduction (see Figure 22(DEF)).
The previous lemma implies that for any Gbounded history graph there exists a sequence of label and adjacency detachments that results in G. We now seek the inverse, to demonstrate the existence of a sequence of moves to create a Gbounded AVG from any Gbounded history graph.
The inverse of a label/adjacency/lateraladjacency detachment is, respectively, a label/adjacency/lateraladjacency attachment.
The graph in Figure 23(A) has no valid label/adjacency attachment operation that results in a Gbounded history graph, yet it is not an AVG, because it has an unattached junction side, {x}_{\mathit{\text{tail}}}^{\prime} and any adjacency attachment of {x}_{\mathit{\text{tail}}}^{\prime} results in the creation of a Greducible ping adjacency. This motivates the need for the lateraladjacency detachment/attachment operation, that we use to avoid the creation of Greducible ping adjacencies. Notably, while both label detachment and adjacency detachment operations define reductions, the result of a lateraladjacency detachment, though an extension of G, is not necessarily a reduction of the starting graph, though it always has one fewer adjacency.
Lemma 37.
A Gbounded history graph G^{′} such that u(G)>0 has a label/adjacency/lateraladjacency attachment that results in a Gbounded history graph.
Proof.
If G^{′} has a freeroot x such that {L}_{x}^{\prime}>1, then the labeling of the root of the branchtree whose freeroot is x is a label attachment that results in a Gbounded extension that contains an additional junction label (see Figure 7(AB) in the main text). Else, if G^{′} has substitution ambiguity then there exists a labeled vertex with two or more nontrivial lifted labels for which there exists a label attachment that results in a Gbounded extension, which contains an additional bridge label (see Figure 7(BC) in the main text). Else G^{′} has rearrangement ambiguity. If G^{′} has one or more unattached junction sides, let x_{ α } be such a side. If the most ancestral attached descendants of x_{ α } are not incident with hanging adjacencies then the creation of an isolated vertex y and adjacency {x_{ α },y_{ α }} is an adjacency attachment that results in a Gbounded extension (see Figure 7(CD) in the main text). Else there exists a lateraladjacency attachment that results in a Gbounded history graph in which x_{ α } is incident with an adjacency with no hanging endpoints (see Figure 7(DE), the operation is also an adjacency attachment in this example). Else G^{′} does not have an unattached junction side, and there exists an attached junction side with two or more incident nontrivial lifted adjacencies for which there exists an adjacency attachment that results in a Gbounded extension that contains an additional bridge adjacency (see Figure 7(EF) in the main text).
Given Theorem 3, the previous lemma implies that for any Gbounded history graph there exists a sequence of label/adjacency/lateraladjacency attachment operations that result in a Gbounded AVG.
Theorem 4.
The Gbounded poset is finite, has a single least element G, its set of maximal elements are AVGs, and if and only if there exists a Gbounded reduction operation to transform G^{′′} into G^{′} then G^{′}≺·_{ G }G^{′′}.
Proof.
That Gbounded is finite follows from Theorem 3. Lemma 36 implies it has a single least element. As a corollary of Theorem 3 and Lemma 37 it follows that the set of maximal elements of the Gbounded poset are AVGs.
It remains to prove G^{′}≺·_{ G }G^{′′} if and only if there exists a Gbounded reduction operation to transform G^{′′} into G^{′}. The only if follows by definition. If G^{′}≺_{ G }G^{′′} but not G^{′}≺·_{ G }G^{′′} then there exists a G^{′′′} such that G^{′}≺_{ G }G^{′′′}≺_{ G }G^{′′}. If G^{′′} is transformed to G^{′} by a single Gbounded reduction operation, to complete the proof it is sufficient to show that no such G^{′′′} can exist, this is easily verified.
Proof of Lemma 9
Lemma 9.
There exists a history graph G with an infinite number of Gminimal extensions.
Proof.
We will demonstrate there exists an infinite set of Gminimal AVG extensions of the history graph G shown in Figure 24(A). The extensions are composed of the repeating subgraph shown in bold in Figure 24(B) and the terminal elements shown in bold in Figure 24(C) that attach the most ancestral copies of w and y.
Consider the AVG extension H_{0} with zero copies of the repeating subunit and the terminal elements to attach w^{0} and y^{0}, as in Figure 24(C). As a, b, c and d are labeled but no other vertices are labeled, the adjacencies {a_{ head },b_{ head }} and {c_{ head },d_{ h e a d }} are Girreducible, because removal of either in any reduction would create a graph that can not then be an extension of G. Given this observation, by definition w^{0} and y^{0} or any vertices produced by contracting incident branches of w^{0} and y^{0} must be junctions in any Gminimal reduction, and therefore be attached, but by definition of the reduction relation, \{{w}_{\mathit{\text{head}}}^{0},{w}_{\mathit{\text{head}}}^{{}^{\prime}0}\} and \{{y}_{\mathit{\text{head}}}^{0},{y}_{\mathit{\text{head}}}^{{}^{\prime}0}\} can not be removed and yet w^{0} and y^{0} be attached in any Gminimal reduction. This therefore implies that the bridge adjacencies \{{x}_{\mathit{\text{head}}}^{0},{x}_{\mathit{\text{head}}}^{{}^{\prime}0}\} and \{{z}_{\mathit{\text{head}}}^{0},{z}_{\mathit{\text{head}}}^{{}^{\prime}0}\} are also not removed in a Gminimal reduction, but this is all the adjacencies in H_{0}, as all the vertices in H_{0} are attached, therefore H_{0} is Gminimal.
Let H_{ i } be an AVG with i such layers, where i>0 (Figure 24(D) shows an example for i=2). To prove that H_{ i } is a Gminimal AVG extension we proceed by induction. H_{0} is the base case. Assume the adjacencies incident with w^{i−1} and y^{i−1} are not removed in any Gminimal reduction. Using similar logic to the base case the adjacencies incident w^{i}, x^{i}, y^{i} and z^{i} are similarly not removed in a Gminimal reduction, again as this is all the added adjacencies and all vertices are attached, using the induction therefore H_{ i } is Gminimal.
Endnotes
^{a} The contraction of an edge e is the removal of e from the graph and merger of the vertices x and y incident with e to create new vertex z, such that edges incident with z were incident either with x or y or both, in the latter case becoming a loop edge on z.
^{b} Note: while \mathcal{\mathscr{H}}(G) is infinite we show in the sequel that the infimum of this set of costs is always achieved by a history, hence the infimum is the minimum.
References
Elias I: Settling the intractability of multiple alignment. J Comput Biol. 2006, 13 (7): 13231339.
Miller W, Rosenbloom K, Hardison RC, Hou M, Taylor J, Raney B, Burhans R, King DC, Baertsch R, Blankenberg D, Kosakovsky Pond SL, Nekrutenko A, Giardine B, Harris RS, Tyekucheva S, Diekhans M, Pringle TH, Murphy WJ, Lesk S, Weinstock GM, LindbladToh K, Gibbs RA, Lander ES, Siepel A, Haussler D, Kent WJ: 28way vertebrate alignment and conservation track in the UCSC Genome browser. Genes Dev. 2007, 17 (12): 17971808.
Darling AE, Mau B, Perna NT: Progressivemauve: multiple genome alignment with gene gain, loss and rearrangement. PloS one. 2010, 5 (6): e11147
Paten B, Earl D, Nguyen N, Diekhans M, Zerbino D, Haussler D: Cactus: algorithms for genome multiple sequence alignment. Genome Res. 2011, 21 (9): 15121528.
Felsenstein J: Inferring Phylogenies. 2004, Sinauer Associates: Sunderland
Blanchette M, Green ED, Miller W, Haussler D: Reconstructing large regions of an ancestral mammalian genome in silico. Genome Res. 2004, 14 (12): 24122423.
Kim J, Sinha S: Indelign: a probabilistic framework for annotation of insertions and deletions in a multiple alignment. Bioinformatics (Oxford, England). 2007, 23 (3): 289297.
Paten B, Herrero J, Fitzgerald S, Beal K, Flicek P, Holmes I, Birney E: Genomewide nucleotidelevel mammalian ancestor reconstruction. Genome Res. 2008, 18 (11): 18291843.
Day W: Computational complexity of inferring phylogenies from dissimilarity matrices. Bull Math Biol. 1987, 49 (4): 461467.
Chindelevitch L, Li Z, Blais E, Blanchette M: On the inference of parsimonious indel evolutionary scenarios. J Bioinform Comput Biol. 2006, 4 (3): 721744.
Song YS, Hein J: Constructing minimal ancestral recombination graphs. J Comput Biol. 2005, 12 (2): 147169.
Westesson O, Holmes I: Accurate detection of recombinant breakpoints in wholegenome alignments. PLoS Comput Biol. 2009, 5 (3): e1000318
Wang LL, Zhang KK, Zhang LL: Perfect phylogenetic networks with recombination. J Comput Biol. 2001, 8 (1): 6978.
Bergeron A, Mixtacki J, Stoye J: A unifying view of genome rearrangements. Lecture Notes in Bioinformatics. 4175: 163173.
Alekseyev M, Pevzner P: Multibreak rearrangements and chromosomal evolution. Theor Comput Sci. 2008, 395 (2–3): 193202.
Hannenhalli S, Pevzner PA: Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals. J ACM. 1999, 46 (1): 127.
Bergeron A, Mixtacki J, Stoye J: On sorting by translocations. J Comput Biol. 2006, 13 (2): 567578.
Yancopoulos S, Attie O, Friedberg R: Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics. 2005, 21 (16): 33403346.
Caprara A: Formulations and hardness of multiple sorting by reversals. Proc. 3rd Conf. Computational Molecular Biology RECOMB99. 1999, 1: 8493.
Xu AW: A fast and exact algorithm for the median of three problem: a graph decomposition approach. J Comput Biol. 2009, 16 (10): 13691381.
Bourque G, Pevzner PA: Genomescale evolution: reconstructing gene orders in the ancestral species. Genome Res. 2002, 12 (1): 2636.
Ma J, Ratan A, Raney BJ, Suh BB, Miller W, Haussler D: The infinite sites model of genome evolution. Proc Natl Acad Sci USA. 2008, 105 (38): 1425414261.
ElMabrouk N: Sorting signed permutations by reversals and insertions/deletions of contiguous segments. J Discrete Algorithm. 2000, 1 (1): 105121.
Yancopoulos S, Friedberg R: DCJ path formulation for genome transformations which include insertions, deletions, and duplications. J Comput Biol. 2009, 16 (10): 13111338.
Braga MD, Willing E, Stoye J: Double cut and join with insertions and deletions. J Comput Biol. 2011, 18 (9): 11671184.
ElMabrouk N, Sankoff D: Analysis of gene order evolution beyond singlecopy genes. Methods Mol Biol. 2012, 855: 397429.
Chauve C, ElMabrouk N, Gueguen L, Semeria M, Tannier E: Models and Algorithms for Genome Evolution. 2013, London: SpringerVerlag
Bader M: Genome rearrangements with duplications. BMC Bioinformatics. 2010, 11 (Suppl 1): S27S27.
Shao M, Lin Y: Approximating the edit distance for genomes with duplicate genes under DCJ, insertion and deletion. BMC Bioinformatics. 2012, 13 (Suppl 19): S13
Raphael B, Zhi D, Tang H, Pevzner P: A novel method for multiple alignment of sequences with repeated and shuffled elements. Genome Res. 2004, 14 (11): 23362346.
Paten B, Diekhans M, Earl D, John JS, Ma J, Suh B, Haussler D: Cactus graphs for genome comparisons. J Comput Biol. 2011, 18 (3): 469481.
Medvedev P, Brudno M: Maximum likelihood genome assembly. J Comput Biol. 2009, 16 (8): 11011116.
Edmonds J, Johnson EL: Matching: A WellSolved Class of Integer Linear Programs. 1970, 2730.
Tannier E, Zheng C, Sankoff D: Multichromosomal median and halving problems under different genomic distances. BMC Bioinformatics. 2009, 10: 120
Bienstock D, Langston MA: Algorithmic implications of the graph minor theorem. Handbooks in Operations Research and Management Science. 1994, 7: 481502.
Acknowledgements
We would like to thank Dent Earl for his help with figures and the Howard Hughes Medical Institute, Dr. and Mrs. Gordon Ringold, NIH grant 2U41 HG00237113 and NHGRI/NIH grant 5U01HG004695 for providing funding.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
BP, DZ, GH and DH developed the theory. BP and DZ implemented the theory and performed the experiments. BP wrote the paper, which was edited by DZ, GH and DH. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Paten, B., Zerbino, D.R., Hickey, G. et al. A unifying model of genome evolution under parsimony. BMC Bioinformatics 15, 206 (2014). https://doi.org/10.1186/1471210515206
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/1471210515206