 Proceedings
 Open access
 Published:
Sorting genomes with rearrangements and segmental duplications through trajectory graphs
BMC Bioinformatics volumeÂ 14, ArticleÂ number:Â S9 (2013)
Abstract
We study the problem of sorting genomes under an evolutionary model that includes genomic rearrangements and segmental duplications. We propose an iterative algorithm to improve any initial evolutionary trajectory between two genomes in terms of parsimony. Our algorithm is based on a new graphical model, the trajectory graph, which models not only the final states of two genomes but also an existing evolutionary trajectory between them. We show that redundant rearrangements in the trajectory correspond to certain cycles in the trajectory graph, and prove that our algorithm converges to an optimal trajectory for any initial trajectory involving only rearrangements.
Introduction
Genomescale evolutionary events can be divided into two categories: genomic rearrangements and contentmodifying operations. Genomic rearrangements shuffle gene orders and change gene orientations; they include inversions, transpositions, block exchanges, circularizations, and linearizations, all of which act on a single chromosome, and translocations, fusions, and fissions, which act on two chromosomes. All of these operations can be modelled in terms of the doublecutandjoin (DCJ) operation [1, 2], which has formed the basis for much algorithmic research on rearrangements over the last few years [3â€“6]. Contentmodifying operations, which affect both the number of gene copies and the gene orders, include insertions, deletions, and duplications.
A basic problem in phylogenetic inference is to compute an edit sequence between two genomes, i.e., a most parsimonious series of evolutionary operations that can transform one genome into the other. Many algorithms have been proposed for various edit problems under different evolutionary models and different assumptions about the genomes. Most of these algorithms use the same underlying data structure, the breakpoint graph, introduced by Bafna and Pevzner to study the edit problem under unsigned inversions [7, 8]. In 1995, Hannenhalli and Pevzner gave the first polynomialtime algorithm to compute the edit distance (the length of the edit sequence) under signed inversions for unichromosomal genomes [9], which was later improved to linear time [10]. For multichromosomal genomes, the edit distance under the HannenhalliPevzner model (signed inversions and translocations) has been studied through a series of papers [9, 11â€“13], culminating in a fairly complex lineartime algorithm [3]. Under DCJ operations, Bergeron et al. [1] gave a simple lineartime algorithm to compute the edit distance, based on a slightly different representation of the rearrangements, using an adjacency graph. All of these algorithms consider only rearrangements and thus also assume equal gene content and no duplicate genes; consequently, their corresponding adjacency graphs have a very simple structurea set of independent cycles and paths, making it possible to design efficient algorithms. Adjacency graphs have also been extended to study rearrangements with insertions and deletions [6, 14, 15], wholegenome duplications [16, 17] as well as incorporating sequence information [18].
Segmental duplications have long been recognized as major driving forces of evolution [19, 20]. In human genomes segmental duplications are hotspots for nonallelic homologous recombination leading to genomic disorders, copynumber polymorphisms, and gene and transcript innovations [21]. Kahn and Raphael gave an efficient dynamic programming algorithm to compute the duplication distance, in which rearrangement operations are not allowed [22], work later extended by introducing likelihood techniques and used to analyze the evolutionary relationships between duplication blocks [23]. Combining general segmental duplication with rearrangements, however, has remained an open problem, in spite of considerable work on combining rearrangements with contentmodifying operations.
ElMabrouk [24] extended the HP approach to include insertions and deletions, by providing an exact algorithm to compute edit distances for inversions and losses and also a heuristic to approximate edit distances for inversions, losses, and nonduplicating insertions; this approach was extended and refined by Marron et al. [25]. Yancopoulos et al. [6] proposed a model for calculating the edit distance under DCJ, singlegene insertions, singlegene deletions, and duplications. Lin et al. [26] proposed a new evolutionary model that integrates gene duplications and losses with genome rearrangement. Shao et al. [27] gave an approximation algorithm to compute the edit distance for two genomes with duplicated genes under a model that includes DCJ, insertions and deletions. A maximum likelihood approach has also been proposed to infer the evolutionary tree on wholegenome data under a model that considers both rearrangements and genecontent modifying events [28]. Most of this work, however, focused on distance computations based on the final states of the genomes and so cannot support identification of the actual sequence and location of any segmental duplications.
In this paper, we propose an iterative algorithm to refine any initial evolutionary trajectory between two genomes with rearrangements and segmental duplications. We introduce a new graphical data structure, the trajectory graph, to model any given evolutionary trajectory between two genomes. We begin by defining and illustrating the trajectory graph, then show correspondences between redundant rearrangements in the initial trajectory and certain cycles in the corresponding trajectory graph, and provide an effective algorithm to remove these redundant rearrangements by iteratively resolving the active (will be defined in the next section) cycles in the trajectory graph. We also prove that this algorithm converges to an optimal trajectory from any initial trajectory when the model is restricted to rearrangements.
The trajectory graph
We use the notation introduced by Bergeron et al. [1]. Each genome is represented as a set of chromosomes, while each chromosome is a linear or circular list of genes. Each gene is represented by a nonzero integer, where the sign of the integer codes the orientation of that gene along the chromosome. The two ends of a gene g are called extremities, the head denoted g_{ h } and the tail g_{ t }. A gene g always points from g_{ t } to g_{ h }, i.e., we have g = g_{ t } â†’ g_{ h } and âˆ’g = g_{ h } â†’ g_{ t }. Two consecutive genes a and b can be connected by one adjacency, denoted as (a, b). Each adjacency can also be represented by the set of the two adjacent extremities. For example, adjacency (1, 2) can be written as {1_{ h }, 2_{ t }}, and adjacency (1, âˆ’2) can be written as {1_{ h }, 2_{ h }}. Note that (a, b) and (b, a) are two different adjacencies, while (a, b) and (âˆ’b, âˆ’a) are the same. Each linear chromosome ends with two singleextremity sets, to which we add a special null extremity, specified by 0, to form two normal adjacencies [6, 27]. Thus, a linear chromosome with n genes has n + 1 adjacencies, while a circular chromosome with n genes has n adjacencies. The adjacency set of a genome is defined as the set of all adjacencies in the genome. If all genes in the genome are distinct, the genome is uniquely represented by its adjacency set.
In our evolutionary model, we study two kinds of operations: DCJ and segmental duplication. A DCJ operation makes two cuts in the genome, producing four cut ends, and then rejoins them to produce two possible outcomes (a third "repairs" the cuts and thus makes no change). We can view a DCJ operation as a function of adjacencies: the input is two adjacencies (a, b) and (c, d) and the output is two new adjacencies (a, d) and (c, b), or (a, c) and (b, d). A segmental duplication operation duplicates a segment (a substring of the genome) and either creates a new circular chromosome out of the copy or inserts the copy in the genome at some location outside the original segment. A segment consisting of n genes g_{1}, g_{2}, ..., g_{ n } can be represented by its n  1 adjacencies, (g_{1}, g_{2}), (g_{2}, g_{3}), ..., (g_{ n1 }, g_{ n }). To keep notation shorter, we will simply write (g_{1}, g_{2}, ..., g_{ n }) to represent these n  1 adjacencies and, when appropriate, we simply use g' to represent the copy of the original gene g after a duplication.
We now define the two types of segmental duplications in term of adjacencies. The first type of the segmental duplication [19, 20] inserts a copy of a segment (g_{1}, g_{2}, ..., g_{ n }) to a position specified by the adjacency (a, b); thus it takes (g_{1}, g_{2}, ..., g_{ n }) and (a, b) as input, and outputs \left(a,\phantom{\rule{0.3em}{0ex}}{g}_{1}^{\mathrm{\xe2\u20ac\xb2}}\right),\phantom{\rule{0.3em}{0ex}}\left({g}_{1}^{\mathrm{\xe2\u20ac\xb2}},\phantom{\rule{0.3em}{0ex}}{g}_{2}^{\mathrm{\xe2\u20ac\xb2}},\xe2\u2039\xaf\phantom{\rule{0.3em}{0ex}},\phantom{\rule{0.3em}{0ex}}{g}_{n}^{\mathrm{\xe2\u20ac\xb2}}\right),\phantom{\rule{0.3em}{0ex}}\left({g}_{n}^{\mathrm{\xe2\u20ac\xb2}},\phantom{\rule{0.3em}{0ex}}b\right) and (g_{1}, g_{2}, ..., g_{ n }) or \left(a,\phantom{\rule{0.3em}{0ex}}{g}_{n}^{\mathrm{\xe2\u20ac\xb2}}\right),\phantom{\rule{0.3em}{0ex}}\left({g}_{n}^{\mathrm{\xe2\u20ac\xb2}},\phantom{\rule{0.3em}{0ex}}{g}_{n1}^{\mathrm{\xe2\u20ac\xb2}},\xe2\u2039\xaf\phantom{\rule{0.3em}{0ex}},\phantom{\rule{0.3em}{0ex}}{g}_{n1}^{\mathrm{\xe2\u20ac\xb2}}\right),\phantom{\rule{0.3em}{0ex}}\left({g}_{1}^{\mathrm{\xe2\u20ac\xb2}},\phantom{\rule{0.3em}{0ex}}b\right) and (g_{ 1 }, g_{2}, ..., g_{ n }). Note that \left({g}_{1}^{\mathrm{\xe2\u20ac\xb2}},\phantom{\rule{0.3em}{0ex}}{g}_{2}^{\mathrm{\xe2\u20ac\xb2}},\xe2\u2039\xaf\phantom{\rule{0.3em}{0ex}},\phantom{\rule{0.3em}{0ex}}{g}_{n}^{\mathrm{\xe2\u20ac\xb2}}\right) and \left({g}_{n}^{\mathrm{\xe2\u20ac\xb2}},\phantom{\rule{0.3em}{0ex}}{g}_{n1}^{\mathrm{\xe2\u20ac\xb2}},\xe2\u2039\xaf\phantom{\rule{0.3em}{0ex}},\phantom{\rule{0.3em}{0ex}}{g}_{1}^{\mathrm{\xe2\u20ac\xb2}}\right) represent the same adjacency set. The second type of the segmental duplication [29] creates a new circular chromosome with the copy of the segment (g_{1}, g_{2}, ..., g_{ n }), thus it takes only (g_{1}, g_{2}, ..., g_{ n }) as input and outputs \left({g}_{n}^{\mathrm{\xe2\u20ac\xb2}},\phantom{\rule{0.3em}{0ex}}{g}_{1}^{\mathrm{\xe2\u20ac\xb2}}\right),\phantom{\rule{0.3em}{0ex}}\left({g}_{1}^{\mathrm{\xe2\u20ac\xb2}},\phantom{\rule{0.3em}{0ex}}{g}_{2}^{\mathrm{\xe2\u20ac\xb2}},\xe2\u2039\xaf\phantom{\rule{0.3em}{0ex}},\phantom{\rule{0.3em}{0ex}}{g}_{n}^{\mathrm{\xe2\u20ac\xb2}}\right) and (g_{1}, g_{2}, ..., g_{ n }).
Each operation can be represented as a directed subgraph, composed of one operation node representing the operation itself, a number of input adjacency nodes (one for each of its input adjacencies), a number of output adjacency nodes (one for each of its output adjacencies), one directed edge from each input adjacency node to the operation node, and one directed edge from the operation node to each output adjacency node. Figure 1 illustrates these graph components for DCJ and the two types of duplication.
We say that an edge is active if the adjacency associated with the node to which it is attached has been changed by the operation, inactive otherwise. In the graph for a DCJ operation, all four edges are active. In the graph for the first type of duplication, only the edge from (a, b) to the operation node and the two edges from the operation node to the output adjacencies \left(a,\phantom{\rule{0.3em}{0ex}}{g}_{1}^{\mathrm{\xe2\u20ac\xb2}}\right) and \left({g}_{n}^{\mathrm{\xe2\u20ac\xb2}},\phantom{\rule{0.3em}{0ex}}b\right), or \left(a,\phantom{\rule{0.3em}{0ex}}{g}_{n}^{\mathrm{\xe2\u20ac\xb2}}\right) and \left({g}_{1}^{\mathrm{\xe2\u20ac\xb2}}\phantom{\rule{0.3em}{0ex}},\phantom{\rule{0.3em}{0ex}}b\right) are active, while the other edges are inactive. In the graph for the second type of duplication, only the edge from the operation node to \left({g}_{n}^{\mathrm{\xe2\u20ac\xb2}},\phantom{\rule{0.3em}{0ex}}{g}_{1}^{\mathrm{\xe2\u20ac\xb2}}\right) is active; all other edges are inactive. We say that a cycle in the trajectory graph is an active cycle if all of its edges are active and define the size of a cycle as the number of the operation nodes it contains.
Given two adjacency sets X and Y, a sorting path P = {p_{1}, p_{2}, ..., p_{ n }} from X to Y is a series of operations that transform X into Y. The trajectory graph G(P) with respect to P naturally delineates the input and output of each operation and the dependency relationships between them. Let S_{ k } be the adjacency set after sequentially performing operations p_{1}, p_{2}, ..., p_{ k } starting from X. The trace from X to Y with respect to P is (X = S_{0}, S_{1}, S_{2}, ..., S_{ n } = Y). To construct G(P), the initial step is to draw one adjacency node for each adjacency in X. Then we sequentially handle each operation in P, ensuring that, before tackling operation p_{ k }, all adjacency nodes of outdegree 0 in the current graph are exactly S_{ k1 }. To add operation p_{ k }, we connect the component graph for operation p_{ k } to the current graph by replacing all the input adjacency nodes with their counterparts in the current graph. After all operations are added, the set of all adjacency nodes of outdegree 0 is then exactly Y. Figure 2 illustrates the construction.
By construction, the trajectory graph has the following two properties. First, the adjacency nodes of indegree 0 form the adjacency set of X and the adjacency nodes of outdegree 0 form the adjacency set of Y, while all other adjacency nodes have indegree 1 and outdegree 1. Second, the trajectory graph is a directed acyclic graph and any topological sorting of all the operation nodes is a valid sorting path from X to Y.
An iterative algorithm to improve any trajectory
Given any two genomes and initial evolutionary trajectory P between them, we build the trajectory graph G(P) and give the following sufficient condition in G(P) to identify and resolve redundant rearrangements in P.
Theorem 1. Let P be a sorting path from X to Y. If G(P) contains active cycles, then we can find another sorting path P' from X to Y with fewer DCJ operations and an equal number of duplications.
Proof. Let C be an active cycle in G(P). Since G(P) is directed, we can represent C as two nodedisjoint directed paths starting from the same top node and ending at the same bottom node. Clearly, neither of these two nodes can be an adjacency node: the top node must have outdegree 2 and the bottom node must have indegree 2. Moreover, the bottom node must be a DCJ node, since there is at most one active edge pointing to each duplication node. We now show how to exchange the bottom operation node with one of the two parent operation nodes, thereby moving one operation node out of C, until there are only two operation nodes left in C, which we can always replace with at most one operation.
We choose one of the two parent operation nodes to guarantee that the exchange will not create any directed cycles. If both parent nodes are independent (there is no directed path from one to the other), then we arbitrarily choose one; otherwise, we always choose the one that is on the directed path from the other to the bottom node.
Figure 3 shows how to exchange the bottom DCJ node with a parent DCJ node. We view the two DCJ nodes as a single supernode, with three input adjacencies and three output adjacencies. We replace the current two DCJ nodes with two new ones, keeping the inputs and outputs of the supernode unchanged. The new top DCJ node takes the two input adjacencies of the supernode that are linked in C as its inputs and outputs two adjacencies, one of which is among the outputs of the supernode. The new bottom DCJ node takes the other output adjacency of the new top DCJ node and the remaining input adjacency of the supernode as inputs and outputs the other two output adjacencies of the supernode. After the exchange, the new bottom DCJ node is out of the new active cycle, while the new top DCJ node becomes the bottom node of the new active cycle.
Figure 4 shows how to exchange the bottom DCJ node with its parent duplication node. Again, we consider these two operation nodes as a single supernode. The new DCJ node takes the two adjacency nodes linked in C as inputs and outputs two adjacencies, one of which is among the outputs of the supernode while the other is the insert position of the new bottom duplication node. After the exchange, the new bottom duplication node will be out of the new active cycle and the new parent DCJ node will be the bottom node of the new active cycle.
Through these exchanges, C will be reduced to just two operation nodes, the top one and the bottom one. Now we show that we can always replace these two operations with at most one operation.
Consider first the case in which the two operation nodes are both DCJ operations. If the input and output of the supernode are different, then we can use one new DCJ node to connect them, as shown in Figure 5(a, b); otherwise, we do not need any operation node, as shown in Figure 5(c, d).
Next consider the case in which the top node is a duplication and the bottom node is a DCJ. Assume the top duplication inserts a copy of the segment to position (a, b). If (a, b) is not one of the output adjacencies of the bottom DCJ node, then these two operations can be replaced by one duplication which inserts the inverted segment to the same position, as shown in Figure 6(a, b); otherwise, these two operations can be replaced by one duplication which creates a circular chromosome from the copy of the segment, as shown in Figure 6(c, d).
Note that the trajectory graph remains a directed acyclic graph. Thus we can retrieve one sorting path from any topological sorting of the operation nodes in the final trajectory graph; since we also showed that the number of DCJ operation nodes has reduced by at least one and the number of duplication nodes is unchanged, the theorem is proved. â–¡
Given the trajectory graph G(P) constructed from two input genomes and an initial trajectory P, our algorithm iteratively applies Theorem 1 to reduce the number of operations until there is no active cycle in G(P). This algorithm will always terminate since resolving each cycle will reduce the number of operation nodes in G(P) and such number is always nonnegative.
The trajectory graph restricted to rearrangements
We study the trajectory graphs with only DCJ operations. We show that the above iterative algorithm always converges to an optimal trajectory for any initial trajectory.
We first investigate the structure of the trajectory graphs under a rearrangementonly model and illustrate the close relationship with adjacency graphs. Recall that given two adjacency sets X and Y, the adjacency graph is defined as a bipartite multigraph A = {X, Y, E}, in which u âˆˆ X and v âˆˆ Y are linked by one edge if u and v share one extremity and by two edges if they share two extremities. If X and Y have the same extremity set and each extremity appears only once, the adjacency graph consists of nodedisjoint cycles and the minimum number of DCJ operations needed to transform X into Y is X  c, where c is the number of cycles in A [1].
Let a trajectory P consist of only DCJ operations. For a connected component C of the trajectory graph G(P), we use I(C) to denote the set of adjacency nodes of indegree 0 in C, O(C) to denote the set of adjacency nodes of outdegree 0 in C, A(C) to denote the set of adjacency nodes with indegree 1 and outdegree 1, and D(C) to denote the set of DCJ nodes in C. We say that a connected component is trivial if it is a single adjacency node and nontrivial otherwise (for examples see Figure 2).
Lemma 1. Let C be a connected component in the trajectory graph G(P) where P consists of only DCJ operations. Then we have C is a tree if and only if D(C) = I(C)  1.
Proof. If C is trivial, it is a tree, and we have D(C) = 0 and I(C) = 1, hence the lemma holds. Assume then that C is nontrivial. The number of edges equals the sum of indegrees, which is O(C)+ A(C)+ 2 Â· D(C); the number of nodes is I(C) + O(C) + A(C) + D(C), since I(C) and O(C) are disjoint when C is nontrivial. C is connected, so it is a tree exactly when it has one more vertices than it has edges, hence we can write O(C)+A(C)+2Â·D(C) = I(C)+O(C)+A(C)+D(C)1, which yields D(C) = I(C)1, as desired. â–¡
The following lemma shows that there is onetoone correspondence between trees in the trajectory graph and cycles in the adjacency graph.
Lemma 2. Let C be a tree in the trajectory graph G(P) where P consists of only DCJ operations. The corresponding adjacency graph A = {I(C), O(C), E} consists of exactly one cycle.
Proof. If C is trivial, then A is a cycle of length 2, hence the lemma holds. For a nontrivial C, we proceed by contradiction. Since C is nontrivial, we have that I(C) and O(C) are disjoint. Besides, I(C) and O(C) have the same extremity set. Thus, A = {I(C), O(C), E} consists of nodedisjoint cycles. Now suppose that A consists of two or more cycles. We partition A into two parts by taking one arbitrarily chosen cycle as the first part and the remaining cycle(s) as the second part. We then label all extremities in the first part as e_{1} and all extremities in the second part as e_{2}. We divide the adjacency nodes in C into three categories: if its two extremities are both labeled e_{1}, then label the node a_{1}; if its two extremities are both labeled e_{2}, then label the node a_{2}; otherwise, label the node a_{3}. Now we can classify any DCJ operation into one of the following 7 types:

1.
{a_{1}, a_{1}} â†’ {a_{1}, a_{1}}, or

2.
{a_{2}, a_{2}} â†’ {a_{2}, a_{2}}, or

3.
{a_{1}, a_{2}} â†’ {a_{3}, a_{3}}, or

4.
{a_{1}, a_{3}} â†’ {a_{1}, a_{3}}, or

5.
{a_{2}, a_{3}} â†’ {a_{2}, a_{3}}, or

6.
{a_{3}, a_{3}} â†’ {a_{3}, a_{3}}, or

7.
{a_{3}, a_{3}} â†’ {a_{1}, a_{2}}.
Note that the labels on the adjacency nodes in I(C) âˆª O(C) must be either a_{1} or a_{2}, since these adjacencies are the nodes of the adjacency graph, whose two extremities have the same type. Note also that there must exist at least one adjacency node in C that is labeled as a_{3}: otherwise the DCJ operation can be only of type (1) or (2) and C can be divided into two disconnected subgraphs, one defined by adjacency nodes labeled a_{1} and DCJ nodes of type (1), which contradicts the fact that C is connected. Now, remove all adjacency nodes labeled a_{1} or a_{2} and their adjacent edges: the remaining nodes must have even total degreesee Figure 7. The reason is that all adjacency nodes in I(C) âˆª O(C) are removed, while adjacency nodes in A(C) either are removed or are labeled as a_{3} and have degree 2. Moreover, all operation nodes must have even total degree, as easily verified by checking the 7 types. Hence there must be one cycle in C, a contradiction since C is a tree. â–¡
Theorem 2. A trajectory P consisting of only DCJ operations is optimal if and only if the corresponding trajectory graph G(P) consists of trees.
Proof. If G(P) contains at least one cycle C, then this cycle C must be an active cycle since all edges in G(P) are active for the DCJonly model. Thus, according to Theorem 1, we have that P is not optimal.
We now prove that, if G(P) consists of only trees, then P is optimal. Now assume that G(P) consists of m trees, T_{1}, T_{2}, ..., T_{ M }. Applying Lemma 1 to each tree, we find that the total number of DCJ nodes in G(P) is
On the other hand, according to Lemma 2, there are exactly m cycles in the adjacency graph A=\left\{{\xe2\u02c6\xaa}_{k=1}^{m}I\left({T}_{k}\right),{\xe2\u02c6\xaa}_{k=1}^{m}O\left({T}_{k}\right),E\right\}, which implies that the minimum number of DCJ operations needed is {\xe2\u02c6\u2018}_{k=1}^{m}\leftI\left({T}_{k}\right)\right\phantom{\rule{2.77695pt}{0ex}}m. Thus P is optimal, as desired. â–¡
Corollary 1. The iterative algorithm converges to an optimal trajectory from any initial trajectory.
Proof. The iterative algorithm terminates when there is no active cycles in the trajectory graph. According to Theorem 2, any trajectory retrieved from the trajectory graph is optimal. â–¡
Discussion and conclusion
Theorem 1 gives us a means to reduce the cost of a given sorting path. Unfortunately, the converse of the theorem does not hold: it is not hard to see how to take advantage of duplication nodes to produce a counterexample. Thus repeated applications of the constructive proof of Theorem 1 do not ensure optimality. However, the iterative improvement procedure can form the basis for strong heuristics or good approximation algorithms.
The trajectory graph naturally combines rearrangements and segmental duplications (or, in general, contentmodifying operations) in a single model. Such a basis is crucial to the development of strong characterizations and good algorithms. We also took a step in that direction by showing that the trajectory graph is a proper augmentation of the adjacency graph, in terms of that both graphs can be used to obtain an optimal sorting scenario when only DCJ operations are considered, and by describing an efficient iterative algorithm that can be used on a more general model. Our current work focuses on using this improvement method within a large optimization framework (e.g. incorporating methods to find good initial trajectories) to derive fast and accurate approximations.
Different trajectories may correspond to the same trajectory graph, if they can be retrieved by different ways of topological sorting in the same graph. Thus the trajectory graph is useful in representing equivalent trajectories as well as characterizing the space of all optimal trajectories under both rearrangements and contentmodifying operations (like the cases under inversions [30, 31] and DCJ operations [32] on adjacency graphs), and thus forming a basis for future statistical analysis.
Under the DCJ model, if the two cuts are in the same linear chromosome, one of the two nontrivial outcomes is to circularize a segment of DNA as a circular chromosome (also called circular intermediate), which has recently been inferred in the evolution of cow genomes [33]. Recent evidence also showed that segmental duplications may also be mediated by circular intermediates in fish genomes [29]. The proof of Theorem 1 makes natural use of the connection between rearrangements and segmental duplications through circular intermediates, and thus may be useful to identify possible circular intermediates in the evolutionary trajectory.
References
Bergeron A, Mixtacki J, Stoye J: A unifying view of genome rearrangements. Proc 6th Workshop Algs in Bioinf (WABI'06), Volume 4175 of Lecture Notes in Comp Sci. 2006, Springer Verlag, Berlin, 163173.
Yancopoulos S, Attie O, Friedberg R: Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics. 2005, 21 (16): 33403346. 10.1093/bioinformatics/bti535.
Bergeron A, Mixtacki J, Stoye J: A new lineartime algorithm to compute the genomic distance via the double cut and join distance. Theor Comput Sci. 2009, 410 (51): 53005316. 10.1016/j.tcs.2009.09.008.
Chen X: On sorting permutations by doublecutandjoins. Proc 16th Conf Computing and Combinatorics (COCOON'10), Volume 6196 of Lecture Notes in Comp Sci. 2010, Springer Verlag, Berlin, 439448.
Chen X, Sun R, Yu J: Approximating the doublecutandjoin distance between unsigned genomes. BMC Bioinformatics. 2011, 12 (Suppl 9): S1710.1186/1471210512S9S17.
Yancopoulos S, Friedberg R: Sorting genomes with insertions, deletions and duplications by DCJ. Proc 6th RECOMB Work on Comp Genomics (RECOMBCG'08). Lecture Notes in Computer Science 5267, Volume 5267 of Lecture Notes in Comp Sci. 2008, Springer Verlag, Berlin, 170183.
Bafna V, Pevzner P: Genome rearrangements and sorting by reversals. Proc 34th Ann IEEE Symp Foundations of Comput Sci (FOCS'93). 1993, 148157.
Bafna V, Pevzner P: Genome rearrangements and sorting by reversals. SIAM J on Computing. 1996, 25: 272289. 10.1137/S0097539793250627.
Hannenhalli S, Pevzner P: Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals). Proc 27th Ann ACM Symp Theory of Comput (STOC'95). 1995, ACM Press, New York, 178189.
Bader D, Moret B, Yan M: A fast lineartime algorithm for inversion distance with an experimental comparison. J Comput Biol. 2001, 8 (5): 483491. 10.1089/106652701753216503.
Jean G, Nikolski M: Genome rearrangements: a correct algorithm for optimal capping. Inf Proc Letters. 2007, 104: 1420. 10.1016/j.ipl.2007.04.011.
OzeryFlato M, Shamir R: Two notes on genome rearrangement. J Bioinf Comp Bio. 2003, 1: 7194. 10.1142/S0219720003000198.
Tesler G: Efficient algorithms for multichromosomal genome rearrangements. J Comput Syst Sci. 2002, 65 (3): 587609. 10.1016/S00220000(02)000119.
Braga M, Willing E, Stoye J: Genomic distance with DCJ and indels. Algorithms in Bioinformatics. 2010, 90101.
ElMabrouk N: Genome rearrangement by reversals and insertions/deletions of contiguous segments. Proc 11th Ann Symp Combin Pattern Matching (CPM'00), Volume 1848 of Lecture Notes in Comp Sci. 2000, Springer Verlag, Berlin, 222234.
Alekseyev M, Pevzner P: Whole genome duplications and contracted breakpoint graphs. SIAM J on Computing. 2007, 36 (6): 17481763. 10.1137/05064727X.
ElMabrouk N, Sankoff D: The Reconstruction of Doubled Genomes. SIAM J Computing. 2003, 32 (3): 754792. 10.1137/S0097539700377177.
Paten B, Diekhans M, Earl D, John J, Ma J, Suh B, Haussler D: Cactus graphs for genome comparisons. J Comput Biol. 2011, 18 (3): 469481. 10.1089/cmb.2010.0252.
Bailey J, Eichler E: Primate segmental duplications: crucibles of evolution, diversity and disease. Nature Reviews Genetics. 2006, 7 (7): 552564. 10.1038/nrg1895.
Lynch M: The Origins of Genome Architecture. 2007, Sinauer
Jiang Z, Tang H, Ventura M, Cardone M, MarquesBonet T, She X, Pevzner P, Eichler E: Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution. Nature Genetics. 2007, 39 (11): 13611368. 10.1038/ng.2007.9.
Kahn C, Raphael B: Analysis of segmental duplications via duplication distance. Bioinformatics. 2008, 24 (16): i133i138. 10.1093/bioinformatics/btn292.
Kahn C, Hristov B, Raphael B: Parsimony and likelihood reconstruction of human segmental duplications. Bioinformatics. 2010, 26 (18): i446i452. 10.1093/bioinformatics/btq368.
ElMabrouk N: Sorting signed permutations by reversals and insertions/deletions of contiguous segments. Journal of Discrete Algorithms. 2001, 1: 105122.
Marron M, Swenson K, Moret B: Genomic distances under deletions and insertions. Proc 9th Conf Computing and Combinatorics (COCOON'03). 2003, Springer, 537547.
Lin Y, Moret B: A new genomic evolutionary model for rearrangements, duplications, and losses that applies across eukaryotes and prokaryotes. J Comput Biol. 2011, 18 (9): 10551064. 10.1089/cmb.2011.0098.
Shao M, Lin Y: Approximating the edit distance for genomes with duplicate genes under DCJ, insertion and deletion. BMC Bioinformatics. 2012, 13 (Suppl 19): S1310.1186/1471210513S19S13.
Lin Y, Hu F, Tang J, Moret B: Maximum likelihood phylogenetic reconstruction from highresolution wholegenome data and a tree of 68 eukaryotes. Proc 18th Pacific Symp on Biocomputing (PSB'13). 2013, 285296.
Fujimura K, Conte M, Kocher T: Circular DNA Intermediate in the Duplication of Nile Tilapia vasa Genes. PLoS ONE. 2011, 6 (12): e2947710.1371/journal.pone.0029477.
Bergeron A, Chauve C, Hartman T, StOnge K: On the properties of sequences of reversals that sort a signed permutation. Proceedings of JOBIM, Volume 2. 2002, Citeseer, 99108.
Braga M, Sagot MF, Scornavacca C, Tannier E: Exploring the solution space of sorting by reversals, with experiments and an application to evolution. ACM/IEEE Trans on Comput Bio & Bioinf. 2008, 5 (3): 348356.
Braga M, Stoye J: The solution space of sorting by DCJ. J Comput Biol. 2010, 17 (9): 11451165. 10.1089/cmb.2010.0109.
Durkin K: Serial translocation by means of circular intermediates underlies colour sidedness in cattle. Nature. 2012, 482 (7383): 8184. 10.1038/nature10757.
Acknowledgements
This work is supported by a grant from EPFL and a fellowship of the Swiss National Science Foundation to YL (grant no. 146708). The authors would like to thank the reviewers and editors for their helpful comments.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 15, 2013: Proceedings from the Eleventh Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S15.
Author information
Authors and Affiliations
Corresponding authors
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Shao, M., Lin, Y. & Moret, B. Sorting genomes with rearrangements and segmental duplications through trajectory graphs. BMC Bioinformatics 14 (Suppl 15), S9 (2013). https://doi.org/10.1186/1471210514S15S9
Published:
DOI: https://doi.org/10.1186/1471210514S15S9