Sorting by weighted inversions considering length and symmetry

Large-scale mutational events that occur when stretches of DNA sequence move throughout genomes are called genome rearrangements. In bacteria, inversions are one of the most frequently observed rearrangements. In some bacterial families, inversions are biased in favor of symmetry as shown by recent research. In addition, several results suggest that short segment inversions are more frequent in the evolution of microbial genomes. Despite the fact that symmetry and length of the reversed segments seem very important, they have not been considered together in any problem in the genome rearrangement field. Here, we define the problem of sorting genomes (or permutations) using inversions whose costs are assigned based on their lengths and asymmetries. We consider two formulations of the same problem depending on whether we know the orientation of the genes. Several procedures are presented and we assess these procedure performances on a large set of more than 4.4 × 109 permutations. The ideas presented in this paper provide insights to solve the problem and set the stage for a proper theoretical analysis.


Background
Among various large-scale rearrangement events that have been proposed to date, inversions were established as the main explanation for the genomic divergence in many organisms [1][2][3]. An inversion occurs when a chromosome breaks at two locations, and the DNA between those locations is reversed.
In some families of bacteria, an 'X'-pattern is observed when two circular chromosomes are aligned [2,4]. Inversions symmetric to the origin of replication (meaning that the breakpoints are equally distant from the origin of replication) have been proposed as the primary mechanism that explains the pattern [4]. The justification relies on the fact that one single highly asymmetric inversion affecting a large area of the genome could destroy the 'X'-pattern, although short inversions may still preserve it.
Darling, Miklós and Ragan [1] studied eight Yersinia genomes and added evidence that symmetric inversions are "over-represented" with respect to other types of inversions. They also found that inversions are shorter than expected under a neutral model. In many cases, short inversions affect only a single gene, as observed by Lefebvre et al. [5] and Sankoff et al. [6], which contrasts with the null hypothesis that the two endpoints of an inversion occur by random and independently.
Despite the importance of symmetry and length of the reversed segment, both have been somewhat overlooked in the genome rearrangement field. Indeed, the most important result regarding inversions is a polynomial time algorithm presented by Hannenhalli and Pevzner [3] that considers an unit cost for each inversion no matter its length or symmetry. When gene orientation is not taken into account, finding the minimum number of inversions that transform one genome into the other is a NP-Hard problem [7].
Some results have considered at least one of the concepts. There is a research line that considers the total sum of the inversion lengths as the objective function of a minimization problem. Several results have been presented both when gene orientation is considered [8,9] and when it is not [10][11][12].
Regarding symmetry, the first results were presented by Ohlebusch et al [13]. Their algorithm uses symmetric inversions in a restricted setting to compute an ancestral genome and, therefore, is not a generic algorithm to compute the rearrangement distance using only symmetric inversions. In 2012, Dias et al. presented an algorithm that considers only symmetric and almostsymmetric inversions [14]. They later included unitary inversions to the problem and provided a randomized heuristic to compute scenarios between two genomes that uses solely these operations [15].
Here we propose a new genome rearrangement problem that combines the concepts of symmetry and length of the reversed segments. Whereas previous works restricted the set of allowed operations by considering only inversions that satisfy constrains like symmetry or almost-symmetry [14,15], here we allow all possible inversions.
The problem we are proposing aims at finding lowcost scenarios between genomes when gene orientation is taken into account and when it is not, which is useful, among others, for building phylogenetic trees, annotating genomes or correcting already existing annotations. The results obtained are the first steps in exploring this interesting new problem.

Definitions
Formally, a chromosome is represented as a n-tuple whose elements represent genes. If we assume no gene duplication, then this n-tuple is a permutation π = (π 1 π 2 ≠ π n ), 1 ≤ |π i | ≤ n and |π i | ↔ |π j | ↔ i ≠ j. Because we focus on bacterial chromosomes, we assume permutations to be circular, and π 1 is the first gene after the origin of replication.
We consider two cases depending on whether we know the orientation of the genes. If we know the orientation, then π is a signed permutation such that each element π i ∈ {−n, −(n − 1), ..., −1, +1, +2, ..., +n}. If we do not know the orientation of the genes, then π is an unsigned permutation such that π i ∈ {1, 2, ..., n}.
Given two permutations a and s, we are interested in finding rearrangement scenarios that link a to s. Therefore, our scenarios are sequences of operations r 1 , r 2 , ..., r t such that a·r 1 ·r 2 ·... r t = s.
Let ι = (1 2 ... n) be the identity permutation, sorting a permutation π = (π 1 π 2 ... π n ) is the process of transforming π into ι. Note that s·s −1 = s −1 ·s = ι n . Thus, the scenario that transforms a into s can be used to transform a permutation π into a permutation ι if we take π = s −1 ·a. Therefore, we hereafter consider sorting permutations by inversions.
Previous researchers have worked on the inversion distance d(π) of an arbitrary permutation π, which is the minimum number of inversions that transform π into ι. Here we consider that each inversion r(i, j) has a cost which is based on the length and the symmetry of endpoints i and j.
Our cost function is: cost(r(i, j)) = |slice(ι, i) − slice(ι, j)| + 1. Its behavior is explained by looking at the two cases that arise. Figure 2 illustrates both cases.
In this case, the cost function can be simplified to cost (r(i, j)) = abs(i−j)+1, which means that it is proportional to the number of elements in the reversed segment.
This cost is what one would expect from a lengthweighted inversion distance in such a way that larger inversions cost more than short inversions.
Case 2: i > n 2 and j < n 2 , or j > n 2 and In this case, the cost function is penalizing the asymmetry instead of the number of elements in the reversed segment. In effect, if the inversion r(i, j) is perfectly symmetric (meaning that i and j are equally distant from the origin of replication, so slice(ι, i) = slice(ι, j)), then the cost is given by cost(r(i, j)) = 1.
Therefore, our problem is to find a sequence of operations r 1 , r 2 , ..., r t such that π·r 1 ·r 2 ... r t = ι and t k=1 cost(ρ k ) is minimum.

Methods
This section presents several greedy algorithms that take advantage from the characteristics of the cost function. The first greedy approach was named LR and constructs a solution by placing one element each time in the final position. After we place an element, we guarantee that we will not move it again.
Three other greedy approaches have been established, named NB, SMP and NB+SMP. These approaches rely on greedy functions to estimate how good an inversion might be. For each greedy function f, the benefit of an inversion r is the difference in the value returned by f before and after r is applied divided by the cost of r. For instance, let π be an arbitrary permutation, the ben- Note that we expect f to assign smaller values to permutations closer to the identity.
Using the greedy function f, we construct a sequence of inversions that sort π by iteratively adding an inversion with the best benefit among all possible inversions.
The greedy function f guarantees a (possibly not optimum) solution if we can always find an inversion r such that f (π·r) < f (π) for any π ≠ ι. However, our greedy functions presented in the following sections do not always guarantee that. Therefore, we study each case and we developed ways to circumvent this issue.

Left or right heuristic
We use the term LR to refer to this approach as an acronym for Left or Right. We first divide the elements in the permutation in two groups. The first group refers to the elements that are in slices classified as sorted and the second group comprises those elements that are in unsorted slices. A slice s is in the sorted group if p(π, s) = s, p(π, n − s + 1) = n − s + 1, and the slices {1, 2, ..., s − 1} are also in the sorted group. Otherwise, s is in the unsorted group.
First, the Left or Right heuristic selects the least slice in the unsorted group. Then, we determine the element  that should be moved first to that slice: the left or the right. The left (right) side is composed of the elements which are in positions that have indices bigger (lower) than the middle position.
To make this choice, we compute the total cost to put a given element in its final place. For signed permutations, we also consider the cost to place the element with a positive sign. After computing a cost for placing the left and the right elements, we choose the side that has the minimum cost. In case of tie, we move the right element.
If the slice has only one element that does not belong to it, we find the element that should be in that slice and perform an inversion to place it in the final position. After placing the element, we might have to change its sign on the signed version of our problem.

Slice-misplaced pairs heuristic
We use the term SMP to refer to this heuristic as an acronym for Slice-Misplaced Pairs.
Let SMP (π) be the number of slice-misplaced pairs in π and Δ SMP (π, r) = SMP (π·r)−SMP (π) be the variation in the number of slice-misplaced pairs caused by an inversion r, then the benefit of an inversion is given by For some permutations π ≠ ι, there is no inversion r such that SMP (π·r) < SMP (π). The following lemmas give us a better understanding from the properties of these permutations. We start by stating in Lemma 1 when we can be sure that at least one slice-misplaced pair can be removed.
Lemma 1 Let π i , π j be two elements in π such that |i − j| = 1 and π i~πj . Then there is at least one inversion r such that Δ SMP (π, r) < 0.
Proof Let us assume, without loss of generality, that slice (π, π i ) < slice(π, π j ). Therefore, slice(ι, will remove the slice-misplaced pair {π i , π j } when creating the permutation s = π·r. Let π k and π l be the elements in the same slice as π i and π j in π, respectively, the following statements suffice to conclude that Δ SMP (π, r) < 0. Note that it may occur that π j is the only element in the slice, therefore it suffices to prove the first statement.
Observe that we do not need to consider cases where π i~πl or π j~πk in π, because in the worst scenario these slice-misplaced pairs will not be removed.
Lemma 2 Let π i be an element in π such that , then there is at least one inversion r such that Δ SMP (π, r) < 0. □ Proof If π i is the only element having slice(ι, π i ) = n 2 or if it occurs that slice(π , π i ) ≥ slice(π , π j ) for π i , π j such that slice(ι, π i ) = slice(ι, π j ) = n 2 , then π i will form a slice-misplaced pair with all possible element π k such that slice(π, π k ) > slice(π, π i ). Therefore, it is straightforward from Lemma 1 that at least one inversion r such that Δ SMP (π, r) < 0 exists as long as slice(π , π i ) = n 2 .
If π i = n 2 and |i -j| ≠ 1, it is straightforward from Lemma 1 that we can move π j toward the slice. When |i − j| = 1 we apply the inversion r(i, k) if i < k or r(k, i) if k < i, where k = n − j + 1. □ Lemma 3 Let π = (π 1 π 2 ... π n ) be a permutation such that Lemmas 1 and 2 find no inversion that decreases the number of slice-misplaced pairs, then π has the form: For n even Proof Lemma 2 implies that if n is odd, then π n + 1 2 is in the highest slice in ι. The same occurs with the elements π n 2 and π n 2 +1 that should be in the highest slice in ι if n is even. We know that π i ≁ π i+1 , otherwise one could find r such that Δ SMP (π, r) < 0 using the Lemma 1. That leads to the elements being ordered according to theirs slices in ι as shown. □ Lemma 4 Δ SMP (π, r) = 0 for any perfectly symmetric inversion r(i, j) such that j = n − i + 1.
Proof Inversions r(i, j) such that j = n − i + 1 have no effect in the slice of any element in π. That said, no slice-misplaced pair will be created or removed whatsoever.
Lemma 5 Let π be a permutation such that SMP (π) = 0, then slice(π, π i ) = slice(ι, π i ) for any 1 ≤ i ≤ n. In this case, perfectly symmetric inversions can sort π if it is unsigned and perfectly symmetric plus unitary inversions should be enough to sort π if it is signed.
Proof Let π i be an element in π such that slice(π, π i ) ≠ slice(ι, π i ). There must exist an element π j such that slice(π, π j ) = slice(ι, π i ) ≠ slice(ι, π j ). Two cases are possible: (i) π i~πj , therefore SMP (π) > 0 or (ii) π i ≁ π j , thus there must exist an element π k such that slice(π, π k ) = slice(ι, π j ) ≠ slice(ι, π k ). In this second case, we can restart the entire process by calling π j and π k as π i and π j , respectively. Since we have a finite number of elements in the permutation, we know that the first case will be reached eventually.
If slice(π, π i ) = slice(ι, π i ) for any 1 ≤ i ≤ n, we can reach the identity permutation by first performing perfectly symmetric inversions r(i, j), slice(ι, i) = slice(ι, j), if |π i | = j and |π j | = i. It requires at most n 2 inversions and each one will cost one unit. It should be enough for unsigned permutations, but for signed permutations we still need to perform unitary inversions r(i, i) if s(π i ) = −1. Lemma 6 Let π = (π 1 π 2 ... π n ) be a permutation such that Lemmas 1 and 2 find no inversion that decreases the number of slice-misplaced pairs and SMP (π) > 0, then we can apply two inversions r 1 and r 2 such that SMP (π) > SMP (π·r 1 ·r2).
• a = slice(ι, π i ) = slice(ι, s i ) • a = slice(ι, π i+1 ) = slice(ι, s j ) We know that b ≥ c because π has the form described by Lemma 3, and we know that a ≠ b and a ≠ c because at most two elements can have a as slice in the identity permutation. Five different situations are possible: 1 a > b > c: we have s i~si+1 in s because a > b. Therefore, the inversion r 2 (i, i + 1) has Δ SMP (π·r 1 , r 2 ) < 0 according to Lemma 1.
Lemmas 1 and 2 show cases such that at least one inversion will have positive benefit. No positive benefit inversion is guaranteed on permutations in the form described by Lemma 3, but in those cases we can use Lemmas 5 and 6 to assure that our greedy approach will eventually reach the identity permutation.

Number of breakpoints heuristic
We use the term NB to refer to this heuristic as an acronym for Number of Break-points. Consider the extended permutation that can be obtained from π by inserting two new elements: π 0 = 0 and π n+1 = n + 1. The extended permutation is still denoted as π. Below, we present two definitions of breakpoint depending on whether we are dealing with signed or unsigned permutations.
Definition 2 A pair of elements π i , π i+1 in a signed permutation π, with 0 ≤ i ≤ n, is a breakpoint if π i+1π i ≠ 1.
We use NB(π) to represent the number of breakpoints in a permutation and Δ NB (π, r) = NB(π·r) -NB(π) to represent the variation in the number of break-points caused by an inversion r.
The identity permutation ι is the only permutation with no breakpoints. Therefore, an inversion that decreases the number of breakpoints indirectly leads to the identity permutation. Since an inversion can affect only two breakpoints, we know that Δ NB (π, r) ∈ {−2, −1, 0, 1, 2}.
The benefit of an arbitrary inversion r can be computed as ben NB (π , ρ) = − NB(π ,ρ) cost(ρ) . We compute the benefit of all possible inversion and we choose one that maximizes the benefit. Since breakpoint is a very common concept in the genome rearrangement field, previous publications have clearly stated which kind of permutations will not allow any inversion to decrease the number of breakpoints.
Strips can be either increasing (π i < π i+1 < ... < π j ) or decreasing (π i > π i+1 > ... > π j ). Strips with only one element are considered decreasing strips. Kececioglu and Sankoff [16] proved that every unsigned permutation with a decreasing strip has at least one inversion that removes at least one breakpoint. Therefore, those inversions will have positive (non-zero) benefit. The same idea holds for signed permutations.
When we find a permutation π with no decreasing strips, we have no positive benefit. Therefore, we can use one of the following strategies to create decreasing strips as a contingency plan.
1. Use the Left or Right Heuristic known as LR. 2. The Left or Right Heuristic could break one strip because only a single element will be moved. Therefore, another approach is to compute the cost of placing the strip that contains the left element in the left side and the strip that contains the right element in the right side and hence use the less costly option.
3. Revert the entire unsorted group with one inversion. 4. Find the increasing strips and compute the cost of all possible inversions that reverse one or more of them. Note that we do not consider inversions that split any strip. After that, use the less costly inversion. We named this approach as the Best Strip strategy.
The Best Strip strategy leads to the best results in our experiments. Therefore, we use it in our comparative analysis.
For unsigned permutations, we use the inversion r that maximizes the benefit if, and only if, ben NB+SMP (π , ρ) > 0. Otherwise, we use the Best_Strip strategy in order to guarantee that the input permutation will be sorted.
For signed permutations, we compute the unsigned permutation π′ by removing the signs from π and we apply in π the inversion r′ that would be used in π′ according to the method previously described for unsigned permutations. In the end, if π ≠ ι, we simply change the signs of negative elements with unitary inversions.

Results and discussion
We implemented the algorithms using C++ Programming Language and experiments were executed at the cluster provided by the IN2P3 Computing Center (http://cc.in2p3.fr/).
We performed two batches of experiments. We first show experiments using small permutations and then we use considerably longer sequences up to size 100.

Small permutations
We generated a dataset with all possible unsigned permutations up to size 12, which accounts for 12 n=2 n! = 522, 956, 312 instances. We did the same for all possible signed permutations up to size 10, which accounts for 10 n=2 2 n n! = 3, 912, 703, 160 instances. Therefore, we have used a large dataset having more than 4.4 × 10 9 permutations.
For every permutation in the dataset, we were able to compute a minimum cost solution for comparison purposes. The minimum cost solution was calculated using a graph structure G n , for n ∈ {1, 2, ..., 12}. We define G n as follows. A permutation π is a vertex in G n if, and only if, π has n elements. Let π and s be two vertices in G n , we build an edge from π to s if, and only if, there is an inversion r that transforms π into s. The weight assigned to this edge is cost(r). Finally, we calculate the shortest path from ι to each vertex in G n using a variant of Dijkstra's algorithm for the single-source shortestpaths problem. This variant gives us the minimum cost to sort permutations in G n , as well as an optimum scenario of inversions.
Let heu_cost be the cost for sorting a permutation using some of our heuristics and opt_cost be the optimum cost, we can compute the approximation ratio as heu cost opt cost . Figures 3 and 4 summarizes our results for unsigned and singed permutations, respectively. The graphs in Figures 3(a) and 4(a) show how often each heuristic returns the optimum cost. The graphs in Figures 3(b) and 4(b) show the average ratio considering all permutations of a given size, while the graphs in Figures 3(c) and 4(c) exhibit the maximum ratios for the same group of permutations. These graphs are maximum or average values and they may not answer the question: for a single instance π, is there any algorithm that is likely to provide the best answer? The graphs in Figures 3(d) and  We observe that NB+SMP leads to the best results and it is consistently better than using just the number of breakpoints (NB) or just the variation in the number of slice-misplaced pairs (SMP) in every aspects we plot in Figures 3 and 4.
Individually, NB and SMP may be worse than NB+SMP, but NB returns results that are much closer to the optimum solution than SMP. That is true both on signed and unsigned permutations as we can see in Figures 3(b) and 4(b). The other graphs also corroborate this fact, which supports our decision of favoring the number of breakpoints when we compute Δ NB+SMP .
The simplistic approach LR leads to inferior results as reasonable. However, when we consider only the maximum ratio aspect, we observe that LR outdoes SMP. Indeed, the SMP approach is not consistent with respect to this aspect, which indicates that particular permutations are hard to sort using only the slice-misplaced pairs.
A final test checks if any profit is gained from running all possible heuristic. We added a new curve labeled as All, which selects for each instance the less costly result between those produced by our heuristics. As we can see, running all possible heuristics and keeping the best result accomplishes very good results. Indeed, it is consistently better than using solely the NB+SMP heuristic.

Large permutations
We ran our algorithm on a set of arbitrarily large permutations. This set is composed of 190,000 random signed permutations and 190,000 random unsigned permutations. In both cases, the permutation size ranges from 10 to 1000 in intervals of 5, with 10,000 permutations of each size. Here, we do not have an exact solutions for these permutations. Therefore, we use the average cost instead of the approximation ratio to base our analysis.
The analysis on random permutations reinforces the notion that NB+SMP leads to the best results. We observe in Figures 5 and 6 that the difference between NB and NB+SMP is very small on average. However, NB+SMP returns the less costly scenario in more cases.
Using random permutations allows us to draw information about the running time of each heuristic. In Table 1 we observe that LR is the fastest heuristic and the sorting scenario can be obtained almost instantly (less than 1 milliseconds), which is reasonable since this heuristic is very simplistic. The heuristic SMP and NB +SMP are the ones that take more time to finish, which can be explained, in part, by the fact that both need to compute slice-misplaced pairs. Since SMP returns scenarios with more inversions than NB+SMP, the former requires about twice the time used by the latter.

Conclusions
We have defined a new genome rearrangement problem based on the concepts of symmetry and length of the reversed segments in order to assign a cost for each inversion. The problem we are proposing aims at finding low-cost scenarios between genomes. We considered the cases when gene orientations is taken into account and when it is not. We have provided the first steps in exploring this problem.
We presented several heuristics and we assessed their performances on a large set of more than 4.4 × 10 9 permutations. The ideas we used to develop these heuristics together with the experimental results set the stage for a proper theoretical analysis.
As in other problems in the genome rearrangement field, we would like to know the complexity of determining the distance between any two genomes using the operations we defined. That seems to be a difficult problem that we intend to keep studying. We plan to design approximation algorithms and more effective heuristics.