Sorting by reversals, block interchanges, tandem duplications, and deletions

Background Finding sequences of evolutionary operations that transform one genome into another is a classic problem in comparative genomics. While most of the genome rearrangement algorithms assume that there is exactly one copy of each gene in both genomes, this does not reflect the biological reality very well – most of the studied genomes contain duplicated gene content, which has to be removed before applying those algorithms. However, dealing with unequal gene content is a very challenging task, and only few algorithms allow operations like duplications and deletions. Almost all of these algorithms restrict these operations to have a fixed size. Results In this paper, we present a heuristic algorithm to sort an ancestral genome (with unique gene content) into a genome of a descendant (with arbitrary gene content) by reversals, block interchanges, tandem duplications, and deletions, where tandem duplications and deletions are of arbitrary size. Conclusion Experimental results show that our algorithm finds sorting sequences that are close to an optimal sorting sequence when the ancestor and the descendant are closely related. The quality of the results decreases when the genomes get more diverged or the genome size increases. Nevertheless, the calculated distances give a good approximation of the true evolutionary distances.


Background
During evolution, genomes are subject to genome rearrangements, which are large scale mutations that can alter the ordering and orientation (strandedness) of the genes on the chromosomes or even change the genome content by inserting, deleting, or duplicating genes. Because these events are rare compared to point mutations, they can give us valuable information about ancient events in the evolutionary history of organisms. For this reason, one is interested in the most "plausible" genome rearrangement scenario between two genomes. More precisely, given two genomes, one wants to find an optimal (shortest) sequence of rearrangements that transforms this genome into the other. In the classical approach, each gene has exactly one copy in each genome, and only operations that do not change the genome content are considered. These "classical operations" are nowadays a well-studied subject, where the most important operations are reversals (also called inversions), where a section of the genome is excised, reversed in orientation, and reinserted, and transpositions, where a section of the genome is excised and reinserted at a new position in the genome. While the problem of Sorting by reversals can be solved in polynomial time [1][2][3], and the reversal distance can be determined in linear time [4], the problem gets more complicated if one also considers transpositions, and there are only approximation algorithms known [5][6][7]. To simplify the existing algorithms, Yancopoulos et al. invented the double cut and join operator, which can simulate reversals and block interchanges (a more generalized form of a transposition), resulting in a simple and efficient algorithm [8].
However, restricting the genes to be unique in each genome does not reflect the biological reality very well, as in most genomes that have been studied, there are some genes that are present in two or more copies. This holds especially for the genomes of plants, and one of the most prominent genomes is the one of the flowering plant Arabidopsis thaliana, where large segments of the genome have been duplicated (see e.g. [9]). There are various evolutionary events that can change the content of the genome, like duplications of single genes, horizontal gene transfer, or tandem duplications. For a nice overview in the context of comparative genomics, see [10]. From an algorithmic point of view, the existence of duplicated genes complicates many existing algorithms, for example the problem of sorting arbitrary strings by reversals [11] and the problem of sorting by reversals and duplications [12] have been proven to be NP-hard. So far, most of the existing algorithms restrict duplications to have a fixed length [13], or simulate duplications by arbitrary insertions [14][15][16]. Even with these restrictions, it is hard to solve most of the problems exactly, and heuristics have to be used.

Related work
While genome rearrangement problems without duplications is a well studied subject, considering genomes with duplicated genes is a rather new field of research. One of the first works on this topic was done by Sankoff [17], where the following problem was examined. Given two genomes with duplicated genes, identify in both genomes the "true exemplars" of each gene and remove all other genes, such that the rearrangement distance between these modified genomes is minimized. This approach minimizes the number of classical rearrangement operations, but not the one of duplications and deletions. In the work of El-Mabrouk [18], for a given genome with duplicated gene content, one searches for a hypothetical ancestor with unique gene content such that the reversal and duplication distance towards this ancestor is minimized. Bertrand et al. [13] developed an algorithm for the following problem. Given two genomes with duplicated gene content, find a hypothetical ancestor such that the sum of the reversal and duplication distance of both genomes to this ancestor is minimized. However, in this work, duplications are restricted to have the length of one marker, i.e. a duplication can only duplicate segments that are identical in the initial genomes. Therefore, this approach is disadvantageous if large segmental duplications happened during evolution. Fu et al. extended this approach to the greedy algorithm MSOAR for assigning orthologous genes, which works well in practice [12,19]. Other approaches [14][15][16] simulate duplications by arbitrary insertions. Recently, Yancopoulos and Friedberg provided a mathematical model of a genome rearrangement distance for genomes with unequal gene content [20], combining the DCJ operator [8] with arbitrary but lengthweighted insertions and deletions. Another field of research is the "Genome halving problem", where a rearrangement scenario consists of a whole genome duplication followed by a series of classical rearrangement operations. It has been studied first for reversals and translocations [21,22] and recently has been extended to the double cut and join operator [23,24].
To the best of our knowledge, the only approach that creates a rearrangement scenario between two genomes, consisting of duplications of arbitrary length and classical genome rearrangements, is the one of Ozery-Flato and Shamir [25]. They use a greedy algorithm that starts with one genome and in each step applies the simplest and most evident operation that brings this genome closer to the target genome. If there is no evident operation, the algorithm aborts. Although this approach fails on complicated rearrangement scenarios, they were able to find rearrangement scenarios for more than 98% of the karyotypes in the "Mitelman database of chromosome aberrations in cancer" [26].

Our contribution
In this paper, we will focus on the following problem. Given an ancestral genome with unique gene content and the genome of a descendant with arbitrary gene content, find the shortest sequence of reversals, block interchanges, tandem duplications, and deletions that transforms the ancient genome into the one of the descendant. In contrast to most of the previous works, tandem duplications and deletions can be of arbitrary length. We developed a lower bound for the distance, and a heuristic greedy algorithm based on this lower bound. The approach can be extended to also include general duplications and insertions of single elements, as described in Section "Discussion". Experimental results on simulated data show that our algorithm works well in practice. problem is defined as follows. Given two genomes ' and  and a set of possible operations, where each operation is assigned a weight, find a sequence of minimum weight that transforms ' into . This minimum weight will be denoted by d(', ). In our algorithm, we will restrict the set of operations to reversals, deletions, tandem duplications (all of weight 1), and block interchanges (of weight 2), as defined in the next subsection. For simplification, we will also assume that ' = id, i.e. we search for a sequence of operations that transforms the identity genome into , and we write d() as short hand for d(id, ).

Operations
In our algorithm, we will restrict the set of operations to the following four types of operations. A reversal rev(i, j) (with 0 <i <j <n + 1) is an operation that inverts the the order of the elements of the segment  i ...  j-1 . Additionally, the orientation of every element in the segment is flipped. A block interchange bi(i, j, k, l) (with 0 <i  j  k  l <n + 1) is an operation that changes the positions of the segments  i ...  j-1 and  k ...  l-1 in . A tandem duplication td(i, j) (with 0 <i <j <n + 1) is an operation that adds a copy of the segment i ...  j-1 before the element  j . A deletion del(i, j) (with 0 <i <j <n + 1) cuts the segment  i ...  j-1 out of .
We will use the double cut and join operator (short DCJ) to simulate reversals and transpositions. A double cut and join DCJ(i, j, x) (with 0 <i <j <n + 1 and x  {+, -}) cuts the genome  before the elements  i and  j (i.e. we get the segments  0 ...  i-1 ,  i ...  j-1 , and  j ...  n+1 ), and rejoins the cut ends in two new pairs. If x = +, we rejoin such that the elements  i-1 and  j-1 as well as the elements  i and  j-1 become adjacent. This is equivalent to a the reversal of the . If x = -, we rejoin such that the elements  i-1 and  j as well as the elements  i and  j-1 become adjacent. This cuts the genome into the linear genome  0 ..
... This circular genome can be absorbed by applying another DCJ with one cutting point in the linear genome and the other cutting point in the circular genome. Depending on how we rejoin, those two DCJs are equivalent to either two consecutive reversals or to one block interchange. Thus, we can reduce the set of operations to DCJs, tandem duplications, and deletions, provided that we demand that circular genomes must be absorbed in the next step.

The breakpoint graph
Our main tool for visualization is the breakpoint graph. This graph has been introduced by Bafna and Pevzner to solve rearrangement problems on genomes without duplicated genes [27]. We extend this graph such that it can also be used for genomes with duplicated genes. The breakpoint graph of a genome  can be constructed as follows. First, we write the set of vertices {+0, -1, +1, -2, +2, ..., -n, +n, -(n + 1)} from left to right on a straight line. Second, we add a reality edge (+i, Note that two different genomes can have the same breakpoint graph, like e.g. and . However, this will not cause problems in our algorithm, because we use the identity genome as ancestral genome, which has a unique breakpoint graph.

A lower bound
Instead of searching for a sequence of operations op 1 , ..., op k that sorts id into , one can also search for the inverse sequence that sorts  into id. This is more convenient, as it is easier to track the changes in the breakpoint graph caused by the performed operations (remember that the reality edges are defined as the adjacencies in the identity genome). Thus, we only apply inverse operations, i.e. we sort a genome  into id by DCJs, inverse tandem duplications, and inverse deletions. Note that the inverse of a DCJ is still a DCJ, while an inverse deletion is an insertion. To keep our original problem in mind, we will use the term "inverse deletion" and not "insertion". Lemma 1. The breakpoint graph of the identity genome has n + 1 components and no loops. No breakpoint graph has more components. If a breakpoint graph has n + 1 components, it is the breakpoint graph of the identity genome if and only if it has no loops.
Proof. The first statement is easy to verify. As each vertex is connected with another vertex by a reality edge, the maximum possible number of components in the breakpoint graph is n + 1. If a genome is not the id, it must contain a breakpoint. The desire edge corresponding to this breakpoint either is a loop, or it connects two vertices that are not connected by a reality edge. In the latter case, the breakpoint graph contains a component with at least 4 vertices and therefore cannot have n + 1 components. ᮀ We will now examine how an operation can change the number of components and loops.

DCJ
A DCJ cuts the genome at two positions, and rejoins the cut ends. This has the following effect on the breakpoint graph. Two desire edges (u, v) and (w, x) are removed, and w.l.o.g. the desire edges (u, w) and (v, x) are added to the breakpoint graph. This can increase C by at most 1. If one of the removed edges is a loop, all three vertices are in the same component after the operation, i.e. C will not be increased by this operation. As a DCJ removes only two edges, S can be decreased by at most 2.

Inverse tandem duplication
An inverse tandem duplication deletes the following desire edges. (a) Edges that are inside the duplicated segment. All these edges have a multiplicity  2, thus deleting these edges neither changes C nor S. (b) The edge between the last node of the segment and the first node of the copy. This can increase C by 1, or decrease S by 1 (but not both).

Inverse deletion
An inverse deletion splits the genome at one position and adds arbitrary elements. In the breakpoint graph, one desire edge is removed and several desire edges are added. Therefore, a deletion can increase C by at most 1 or decrease S by at most 1. As C can only be decreased if the removed edge is a 1-bridge, an inverse deletion cannot increase C and decrease S.

Theorem 1. A lower bound lb() of the distance d() is
where S i is the number of vertices with a loop in component C i .

Proof.
Operations that increase C() by 1 or decrease S() by 1 decrease the lower bound at most by 1. For an oper- The breakpoint graph Figure 1 The ation that remove two loops, there are two cases. (a) It acts on two loops of the same component C i . This decreases S i by 2 and the lower bound is decreased by 1. (b) It acts on two loops in two components C i and C j . This can decrease two of the summands by 1, but the components C i and C j are merged and C is decreased by 1, thus the lower bound is decreased by at most 1. ᮀ

The algorithm
The algorithm uses a greedy strategy to sort the genome. In each step, it searches for operations that decrease the lower bound, i.e. we search for operations that increase C or decrease S, and check their effect on the lower bound. If there is no such operation, we will use additional heuristics to search for small sequences of operations that bring us closer to our goal. The main idea behind these heuristics is to reduce the number of missing elements and duplicates and to create adjacencies.

Operations that decrease the lower bound
As a DCJ removes two desire edges and rejoins the endpoints with two new desire edges, it can only increase C if the removed desire edges are a 2-bridge, or two 1-bridges in the same component. If the DCJ rejoins the endpoints such that we get a linear and a circular genome, we need a lookahead to search for another DCJ that absorbs this DCJ. Those two DCJs are directly merged into two reversals or one block interchange with a weight of 2. Inverse tandem duplications can only remove one desire edge with a multiplicity of 1 (the one between the duplicated segments), thus an inverse tandem duplication can increase C only if this edge is a 1-bridge. Additionally, one has to check whether the segments on both sides of the cutting point are identical. Inverse deletions just remove one desire edge, thus also an inverse deletion can increase C only if the removed edge is a 1-bridge. Additionally, one has to check whether there is a segment that can be inserted such that no desire edge in the inserted segment merges two components. Although there is such a segment in most cases, practical tests have shown that it is better to only insert segments that have no breakpoints, i.e. we perform only an inverse deletion if the breakpoint is of the form or with x <y. In summary, the main task in finding operations that increase C is to find 1bridges and 2-bridges in the breakpoint graph, which can be done very efficiently by following the algorithm devised in [28].
Finding operations that decrease S is rather straightforward, as we just have to scan the breakpoint graph for loops with a multiplicity of 1 and find the corresponding position in the genome. An operation that decreases S can be an inverse tandem duplication or an inverse deletion that removes this loop, or a DCJ that removes two loops with a multiplicity of 1, or a DCJ on a loop and another desire edge of the same component.

Heuristics for the remaining cases
If there is no operation that decreases the lower bound, one heuristic would be to decrease the number of duplicated elements without increasing the lower bound. If there are two consecutive copies of the same segment, we can remove one of them by an inverse tandem duplication. As an inverse tandem duplication only removes desire edges, it can never increase the lower bound. This is different in the general case of an inverse duplication, where the duplicated segments are separated by a nonempty segment in the genome. In this case, the removal of one of these segments (which can be simulated by a block interchange and an inverse tandem duplication) creates a new desire edge between the last element before the removed segment and the first element after the removed segment. If the corresponding vertices in the breakpoint graph are in different components, or if they are identical and the new desire edge would increase <Fences>QS i / 2<Fences>N of this component, the operation would increase the lower bound, i.e. we cannot easily provide a sequence of operations that removes one of the duplicated segments and does not increase the lower bound. However, the situation is different if we have at least three copies of the segment.

Lemma 2.
If there are three identical copies of a segment that are maximal (i.e. they cannot be extended in any direction such that still all three copies are identical), then there exists a sequence of operations that removes two of these copies and does not increase the lower bound.
Proof. Let a be the vertex corresponding to the leftmost element of the segment, and let b be the vertex corresponding to the rightmost vertex of the segment. There are reality edges (v 1 , a) and (b, w 1 ), (v 2 , a) and (b, w 2 ), and (v 3 , a) and (b, w 3 ) (from the elements enclosing the first, second, and third copy of the segment). Because the segment is maximal, we can assume w.l.o.g. that w 1  w 2 . As v 1 , v 2 , and v 3 are all adjacent to a, they must be in the same component, as well as w 1 , w 2 , and w 3 . By deleting the first two segments, we remove the desire edges (v 1 , a), (b, w 1 ), (v 2 , a), and (b, w 2 ), and get the new desire edges (v 1 , w 1 ) and (v 2 , w 2 ). If this merges two components, the new desire edges are a 2-bridge, and we can apply a DCJ that replaces them by the desire edges (v 1 , v 2 ) and (w 1 , w 2 ). If v 1 = v 2 this can create a new loop. This loop can be removed by another DCJ between the edges (v 1 , v 2 ) and (v 3 , a) (note that v 3  v 1 because the segments are maximal, and v 1  a because otherwise the loop was already there before the operation). In fact, the operations of the sequence can be  xy  yx arranged such that all DCJs are reversals, so we do not have to find appropriate follow-ups. An illustration of the sequence is depicted in Fig. 2. ᮀ We will now examine what we can do with elements with a multiplicity of at most 2. A first strategy would be to create adjacencies wherever this is possible without creating loops (note that creating adjacencies cannot decrease C). As a precondition, there must be a reality edge (a, b) and the desire edges (a, c) and (b, d) with c  d.
If there are no further adjacencies to create, and all elements have a multiplicity of at most 2, all the possible cases for a reality edge and its adjacent desire edges are depicted in Fig. 3.

Lemma 3. If all elements in  have a multiplicity  2, and
there is no DCJ that creates an adjacency without creating a loop, then there is a reality edge with adjacent desire edges corresponding to Case A, B, or C in Fig. 3. For these cases, there is an operation or a sequence of operations that removes this configuration.
Proof. If a reality edge and its adjacent desire edges correspond to Case D or E, then the reality edge starting at vertex a must correspond to Case C (this follows from the preconditions). Now, let us assume that all reality edges correspond to one of the Cases F and G. The elements adjacent to a reality edge of these cases occur either twice in the genome, or they do not occur at all. As we work with augmented genomes, there must be at least two elements that occur exactly once in the genome and have a breakpoint, otherwise the genome is the id. This is a contradiction to our assumption, therefore there must be at least one reality edge corresponding to Cases A, B, or C. We will now provide sequences for these cases.

Case A
The genome is of the form , the element x -1 is missing. Let y be the largest element <x that is not missing. We apply an inverse deletion of the elements to between and , i.e.  becomes . The desire edge (+a, -x) is removed, the inserted desire edges are the edge (+a, -(y + 1)) and some adjacencies. The reality edge (+(x -1), -x) is split from the component, the edge (+a, -(y + 1)) may merge two components, so the overall number of components cannot be decreased. As the element y + 1 was not present The remaining cases Figure 3 The remaining cases. The different configurations in which each vertex is adjacent to at most two desire edges, and no adjacencies can be created without creating a loop. In all cases, the picture shows a reality edge (horizontal line) and its adjacent desire edges (all other lines). Sequences for elements with multiplicity = 3. An example for a sequence if an element × has a multiplicity  3. First, we remove two of the copies of x by two block interchanges and inverse tandem duplications. If this merges two components, the component can be cut by an additional reversal. For other orientations of the element x, similar sequences can be applied.
If merge of components, apply reversal

Remove − → x by 2 block interchanges + inverse tandem duplications
in the original genome, the edge (+a, -(y + 1)) cannot be a loop.

Case B
x is in a duplicated segment, w.l.o.g. the segment is leftmaximal. We extend it to the right until it is also rightmaximal. Nevertheless, we will denote the leftmost vertex of the duplicated segment by -x and the rightmost vertex by +x, i.e.
(the copies of x can also have negative orientation). As the segment is rightmaximal, -b  -c or the segments have different orientation and touch each other, i.e. . In the first case, we remove the copy of x that is not adjacent to x -1, i.e. we remove the desire edges (+a, -x) and (+x, -c), and create the new desire edge (+a, -c). If +a = -c, the loop can be removed by a DCJ on this edge and the edge (+x, -b). In the second case, we remove the copy of x that is not adjacent to x -1, i.e. we remove the loop and the desire edge (+a, -x), and we create the desire edge (+x, +a). In both cases, the desire edge (+(x -1), -x) is split from the component, and adding one new desire edge can merge only two components, so the overall number of components does not decrease. Additionally, also S cannot increase.

Case C
The genome is of the form . We remove the second copy of x. This removes the desire edges (+b, -x) and (+x, -d) and adds the desire edge (+b,d). If this has merged two components, then (+a, -x) and (+b, -d) are 1-bridges with disjoint endpoints (remember that there is no desire edge from vertex +(x -1)), so a DCJ on these two edges splits the component again. If +b = -d, we have a loop, so we will not apply this sequence. Instead, we use the symmetrical case in which we remove the first copy of x. If both +b = -d and +a = -c, we can remove the loop (+b, -d) by applying a DCJ on it and the desire edge (+a, -x). Note that there is the possibility that the first DCJ creates a circular genome that cannot be absorbed in the next step. In this case, we can apply the sequence for Case A twice, i.e. we add the same elements before both copies of x. ᮀ

Completeness of the algorithm
Whenever the algorithm cannot apply an operation that decreases the lower bound, it searches for sequences that remove duplicated segments, for operations that create adjacencies, and for sequences according to the cases A to C in the previous subsection. Then, one of these sequences is selected and applied to the genome. The pseudocode of the algorithm can be seen in Fig. 4.
To prove the completeness of the algorithm, we need the following lemma. Proof. Let  be a genome such that () is maximal. If m() > 0 we could transform  by adding all missing elements without increasing r() or decreasing a(). This would be a contradiction to the fact that () is maximal, therefore m() = 0. Now, let us assume that r() > 0, and let x be the smallest element that is duplicated. As x -1 has a multiplicity of 1, there is at least one copy of x that has a breakpoint. Removing this copy decreases r() by 1 and  Creating The algorithm Figure 4 The algorithm. The algorithm in pseudocode.
while π = id do Find all operations that decrease lb(π) if operation found then apply an operation that maximizes τ (π) else find inverse tandem duplications find sequences for segments with multiplicity ≥ 3 find operations that create adjacencies find sequences for cases A, B, C apply a sequence that maximizes τ (π) end if end while (page number not for citation purposes) a() by at most 1, while m() remains unchanged. This would increase (), and lead to a contradiction. Therefore,  is the genome without duplicated or missing elements with the maximum number of adjacencies, i.e.  =

id. ᮀ
We are now ready to prove the following theorem. Theorem 2. The algorithm terminates after a finite number of steps. When the algorithm terminates, the genome  is transformed into id.
Proof. As none of the operations and sequences of operations discussed above increases the lower bound, and the lower bound is minimized for id, only a finite number of operations that decrease the lower bound can be applied. As we have shown in the last subsections, the algorithm always finds a sequence of operations as long as   id. Table 1 shows the changes of () when applying these sequences. As all sequences increase (), only a finite number of those sequences can be applied between two operations that decrease the lower bound. Therefore, the algorithm must terminate, and  is transformed into id. ᮀ

Testing
We used simulated data to assess the performance of our algorithm. We generated test cases by creating the identity genome id of size n and applying random sequences of n Performance Figure 5 Performance. Performance of our algorithm on simulated data. In each diagram, the x-axis corresponds to the sequence weight used to obtain the test case, while the y-axis corresponds to the weight of the reconstructed sequence. Each value is the average of 10 created test cases.

Discussion
In the following, we will discuss how the set of operations could be extended.

Duplications
While our algorithm only considers tandem duplications, one might also be interested in including arbitrary duplications. This would be rather easy if tandem duplications have weight 1, and all other duplications have weight 2, since the general case of a duplication can change the lower bound by at most 2. However, weighting all duplications equally is a more complicated subject. If all duplications have weight 2, this could be disadvantageous in detecting tandem duplications, as an inverse tandem duplication can decrease the lower bound by at most 1.
On the other hand, if all duplications have weight 1, duplications would be favored over DCJs. This could lead to sequences of many small duplications, instead of first merging the segments and then just applying one big duplication.

Insertions
Insertions of single elements could be easily included in our algorithm, because the inverse of this operation decreases the lower bound by at most 1. Insertions of arbitrary length are more complicated. On the other hand, allowing insertions of arbitrary length are neither biologically meaningful nor do they make sense in combination with arbitrary deletions, because one could solve every sorting problem by just one deletion and one insertion step. Thus, further research in including insertions should also include a reasonable length depending weighting of the insertions.

Conclusion
We presented an algorithm that works well for smaller genomes and distances. Although our results are promising, this algorithm should be seen as a first step in handling duplications of arbitrary length. Further research could improve the algorithm itself by finding closer lower bounds and better heuristics, or extend the algorithm such that it considers more different operations (as described in Section "Discussion") and can also handle multichromosomal genomes.
Frequencies of the operations Figure 6 Frequencies of the operations. The relative frequencies of the different operations.