Sorting by weighted inversions considering length and symmetry
BMC Bioinformatics volume 16, Article number: S3 (2015)
Abstract
Large-scale mutational events that occur when stretches of DNA sequence move throughout genomes are called genome rearrangements. In bacteria, inversions are one of the most frequently observed rearrangements. Recent research shows that, in some bacterial families, inversions are biased in favor of symmetry. In addition, several results suggest that short segment inversions are more frequent in the evolution of microbial genomes. Despite the fact that symmetry and length of the reversed segments seem very important, they have not been considered together in any problem in the genome rearrangement field. Here, we define the problem of sorting genomes (or permutations) using inversions whose costs are assigned based on their lengths and asymmetries. We consider two formulations of the same problem depending on whether we know the orientation of the genes. Several procedures are presented and we assess their performance on a large set of more than 4.4 × 10^{9} permutations. The ideas presented in this paper provide insights to solve the problem and set the stage for a proper theoretical analysis.
Background
Among the various large-scale rearrangement events that have been proposed to date, inversions were established as the main explanation for the genomic divergence in many organisms [1–3]. An inversion occurs when a chromosome breaks at two locations, and the DNA between those locations is reversed.
In some families of bacteria, an 'X'-pattern is observed when two circular chromosomes are aligned [2, 4]. Inversions symmetric to the origin of replication (meaning that the breakpoints are equally distant from the origin of replication) have been proposed as the primary mechanism that explains the pattern [4]. The justification relies on the fact that one single highly asymmetric inversion affecting a large area of the genome could destroy the 'X'-pattern, although short inversions may still preserve it.
Darling, Miklós and Ragan [1] studied eight Yersinia genomes and added evidence that symmetric inversions are "over-represented" with respect to other types of inversions. They also found that inversions are shorter than expected under a neutral model. In many cases, short inversions affect only a single gene, as observed by Lefebvre et al. [5] and Sankoff et al. [6], which contrasts with the null hypothesis that the two endpoints of an inversion occur at random and independently.
Despite the importance of symmetry and length of the reversed segment, both have been somewhat overlooked in the genome rearrangement field. Indeed, the most important result regarding inversions is a polynomial time algorithm presented by Hannenhalli and Pevzner [3] that considers a unit cost for each inversion no matter its length or symmetry. When gene orientation is not taken into account, finding the minimum number of inversions that transform one genome into the other is an NP-hard problem [7].
Some results have considered at least one of the concepts. There is a research line that considers the total sum of the inversion lengths as the objective function of a minimization problem. Several results have been presented both when gene orientation is considered [8, 9] and when it is not [10–12].
Regarding symmetry, the first results were presented by Ohlebusch et al. [13]. Their algorithm uses symmetric inversions in a restricted setting to compute an ancestral genome and, therefore, is not a generic algorithm to compute the rearrangement distance using only symmetric inversions. In 2012, Dias et al. presented an algorithm that considers only symmetric and almost-symmetric inversions [14]. They later added unitary inversions to the problem and provided a randomized heuristic to compute scenarios between two genomes that uses solely these operations [15].
Here we propose a new genome rearrangement problem that combines the concepts of symmetry and length of the reversed segments. Whereas previous works restricted the set of allowed operations by considering only inversions that satisfy constraints like symmetry or almost-symmetry [14, 15], here we allow all possible inversions.
The problem we are proposing aims at finding low-cost scenarios between genomes when gene orientation is taken into account and when it is not, which is useful, among other applications, for building phylogenetic trees, annotating genomes or correcting already existing annotations. The results obtained are the first steps in exploring this interesting new problem.
Definitions
Formally, a chromosome is represented as an n-tuple whose elements represent genes. If we assume no gene duplication, then this n-tuple is a permutation π = (π_{1} π_{2} ... π_{n}), 1 ≤ π_{i} ≤ n and π_{i} ≠ π_{j} ↔ i ≠ j. Because we focus on bacterial chromosomes, we assume permutations to be circular, and π_{1} is the first gene after the origin of replication.
We consider two cases depending on whether we know the orientation of the genes. If we know the orientation, then π is a signed permutation such that each element π_{ i } ∈ {−n, −(n − 1), ..., −1, +1, +2, ..., +n}. If we do not know the orientation of the genes, then π is an unsigned permutation such that π_{ i } ∈ {1, 2, ..., n}.
We treat permutations as functions such that π(i) = π_{i} and π(−i) = −π(i). The inverse of a permutation π is denoted by π^{−1}, for which π^{−1}(π_{i}) = i for all 1 ≤ i ≤ n. The composition between two permutations π and σ is similar to function composition in such a way that π·σ = (π_{σ(1)} π_{σ(2)} ⋯ π_{σ(n)}).
Let π = (π_{1} π_{2} ... π_{n}) be a signed permutation. An inversion ρ(i, j), 1 ≤ i ≤ j ≤ n, is an operation such that π·ρ(i, j) = (π_{1} ⋯ π_{i−1} −π_{j} −π_{j−1} ⋯ −π_{i+1} −π_{i} π_{j+1} ⋯ π_{n}). Let π = (π_{1} π_{2} ⋯ π_{n}) be an unsigned permutation. An inversion ρ(i, j), 1 ≤ i < j ≤ n, is an operation such that π·ρ(i, j) = (π_{1} ⋯ π_{i−1} π_{j} π_{j−1} ⋯ π_{i+1} π_{i} π_{j+1} ⋯ π_{n}).
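The two definitions of an inversion can be illustrated with a minimal Python sketch. The function name `apply_inversion` and the `signed` flag are illustrative choices, not names taken from the authors' implementation; positions are 1-based as in the text.

```python
def apply_inversion(pi, i, j, signed=False):
    """Return pi·ρ(i, j) for 1-based positions i <= j.

    The segment pi[i..j] is reversed; for signed permutations every
    element of the reversed segment also changes sign.
    """
    pi = list(pi)
    if not 1 <= i <= j <= len(pi):
        raise ValueError("need 1 <= i <= j <= n")
    segment = pi[i - 1:j][::-1]
    if signed:
        segment = [-x for x in segment]
    return pi[:i - 1] + segment + pi[j:]


# Signed case: order and signs of positions 2..4 change.
print(apply_inversion([-5, 3, 4, -2, 1], 2, 4, signed=True))
# Unsigned case: only the order of positions 2..4 changes.
print(apply_inversion([5, 3, 4, 2, 1], 2, 4))
```

With `signed=True` and `i == j` the call acts as a unitary inversion that only flips the sign of one element.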
Given two permutations α and σ, we are interested in finding rearrangement scenarios that link α to σ. Therefore, our scenarios are sequences of operations ρ_{1}, ρ_{2}, ..., ρ_{ t } such that α·ρ_{1}·ρ_{2}·... ρ_{ t } = σ.
Let ι = (1 2 ... n) be the identity permutation. Sorting a permutation π = (π_{1} π_{2} ... π_{n}) is the process of transforming π into ι. Note that σ·σ^{−1} = σ^{−1}·σ = ι. Thus, the scenario that transforms α into σ can be used to transform a permutation π into ι if we take π = σ^{−1}·α. Therefore, we hereafter consider sorting permutations by inversions.
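The reduction to sorting can be sketched as follows, assuming the conventions stated above (π(−i) = −π(i) and (π·σ)_{i} = π(σ_{i})). The helper names `inverse`, `compose` and `relabel` are hypothetical and chosen only for this illustration.

```python
def sign_of(x):
    return 1 if x > 0 else -1


def inverse(sigma):
    """Inverse permutation: inverse(sigma)(sigma_i) = i, extended to negatives."""
    inv = [0] * len(sigma)
    for pos, value in enumerate(sigma, start=1):
        inv[abs(value) - 1] = sign_of(value) * pos
    return inv


def compose(pi, sigma):
    """(pi·sigma)_i = pi(sigma_i), using pi(-x) = -pi(x)."""
    return [sign_of(s) * pi[abs(s) - 1] for s in sigma]


def relabel(alpha, sigma):
    """Permutation to sort so that the scenario also turns alpha into sigma."""
    return compose(inverse(sigma), alpha)


sigma = [2, -3, 1]
print(compose(inverse(sigma), sigma))   # identity
print(relabel([3, -1, 2], [-1, 2, 3]))  # permutation to be sorted
```

Unsigned permutations are handled by the same functions because all their elements are positive.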
Previous researchers have worked on the inversion distance d(π) of an arbitrary permutation π, which is the minimum number of inversions that transform π into ι. Here we consider that each inversion ρ(i, j) has a cost which is based on the length and the symmetry of endpoints i and j.
The following functions help us to define our cost function and can be applied to identify any element i in the permutation π. Position: p(π, i) = k ⇔ |π_{k}| = i, p(π, i) ∈ {1, 2, ..., n}. Sign: s(π, i) = k ⇔ π_{p(π, i)} = k × i, s(π, i) ∈ {−1, +1}. Slice: slice(π, i) = min{p(π, i), n − p(π, i) + 1}, slice(π, i) ∈ {1, 2, ..., ⌈n/2⌉}.
Figure 1(b) shows the values of these functions for the signed permutation π = (−5 +3 +4 −2 +1). Note that the values for the functions slice and position would not change if instead we had the unsigned permutation π = (5 3 4 2 1).
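As a small illustration, the three functions can be written directly from their definitions. The sketch below assumes that position ignores the sign of the element, and the printed values are computed from the definitions rather than copied from Figure 1(b); the function names are illustrative.

```python
def position(pi, i):
    """p(pi, i): 1-based position of element i, ignoring its sign."""
    return [abs(x) for x in pi].index(i) + 1


def sign(pi, i):
    """s(pi, i): +1 or -1, the orientation of element i in pi."""
    return 1 if pi[position(pi, i) - 1] > 0 else -1


def slice_of(pi, i):
    """slice(pi, i): distance of element i from the nearest end, starting at 1."""
    p = position(pi, i)
    return min(p, len(pi) - p + 1)


pi = [-5, 3, 4, -2, 1]
for element in range(1, len(pi) + 1):
    print(element, position(pi, element), sign(pi, element), slice_of(pi, element))
```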
Our cost function is: cost(ρ(i, j)) = |slice(ι, i) − slice(ι, j)| + 1. Its behavior is explained by looking at the two cases that arise. Figure 2 illustrates both cases.
Case 1: i, j ≤ ⌈n/2⌉ or i, j ≥ ⌈n/2⌉.
In this case, the cost function can be simplified to cost(ρ(i, j)) = |i − j| + 1, which means that it is proportional to the number of elements in the reversed segment. This cost is what one would expect from a length-weighted inversion distance in such a way that longer inversions cost more than shorter inversions.
Case 2: i > ⌈n/2⌉ and j < ⌈n/2⌉, or j > ⌈n/2⌉ and i < ⌈n/2⌉.
In this case, the cost function is penalizing the asymmetry instead of the number of elements in the reversed segment. In effect, if the inversion ρ(i, j) is perfectly symmetric (meaning that i and j are equally distant from the origin of replication, so slice(ι, i) = slice(ι, j)), then the cost is given by cost(ρ(i, j)) = 1.
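A minimal sketch of the cost function follows. It takes the absolute difference of the two slices, which is the reading consistent with both cases described above; the names `slice_in_identity` and `cost` are illustrative.

```python
def slice_in_identity(k, n):
    """slice(ι, k): in the identity, element k sits at position k."""
    return min(k, n - k + 1)


def cost(i, j, n):
    """Cost of the inversion ρ(i, j) on a permutation of size n."""
    return abs(slice_in_identity(i, n) - slice_in_identity(j, n)) + 1


n = 9
print(cost(2, 4, n))  # both endpoints on the same side: length of the segment
print(cost(1, 9, n))  # perfectly symmetric inversion
print(cost(2, 7, n))  # endpoints on opposite sides: asymmetry is penalized
print(cost(3, 3, n))  # unitary inversion
```

Under this cost a long but nearly symmetric inversion such as ρ(2, 7) is cheaper than the short one-sided inversion ρ(2, 4).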
Therefore, our problem is to find a sequence of operations ρ_{1}, ρ_{2}, ..., ρ_{t} such that π·ρ_{1}·ρ_{2}· ... ·ρ_{t} = ι and Σ_{k=1}^{t} cost(ρ_{k}) is minimum.
Methods
This section presents several greedy algorithms that take advantage of the characteristics of the cost function. The first greedy approach, named LR, constructs a solution by placing one element at a time in its final position. After we place an element, we guarantee that we will not move it again.
Three other greedy approaches have been established, named NB, SMP and NB+SMP. These approaches rely on greedy functions to estimate how good an inversion might be. For each greedy function f, the benefit of an inversion ρ is the difference in the value returned by f before and after ρ is applied divided by the cost of ρ. For instance, let π be an arbitrary permutation. The benefit of ρ is computed as ben_{f}(π, ρ) = (f(π) − f(π·ρ)) / cost(ρ). Note that we expect f to assign smaller values to permutations closer to the identity.
Using the greedy function f, we construct a sequence of inversions that sorts π by iteratively adding an inversion with the best benefit among all possible inversions. The greedy function f guarantees a (possibly not optimum) solution if we can always find an inversion ρ such that f(π·ρ) < f(π) for any π ≠ ι. However, the greedy functions presented in the following sections do not always guarantee that. Therefore, we study each case and develop ways to circumvent this issue.
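One greedy step can be sketched as below for unsigned permutations. The function `misplaced` is a toy stand-in for f (it is not one of the greedy functions of this paper), and `best_inversion` is a hypothetical name; the sketch only shows how the benefit ranks candidate inversions.

```python
from itertools import combinations


def apply_inversion(pi, i, j):
    """Unsigned inversion ρ(i, j) on 1-based positions."""
    return pi[:i - 1] + pi[i - 1:j][::-1] + pi[j:]


def cost(i, j, n):
    """|slice(ι, i) - slice(ι, j)| + 1."""
    return abs(min(i, n - i + 1) - min(j, n - j + 1)) + 1


def misplaced(pi):
    """Toy greedy function: number of elements not in their final position."""
    return sum(1 for k, x in enumerate(pi, start=1) if x != k)


def best_inversion(pi, f):
    """Return (benefit, i, j) maximizing (f(pi) - f(pi·ρ)) / cost(ρ)."""
    n = len(pi)
    best = None
    for i, j in combinations(range(1, n + 1), 2):
        benefit = (f(pi) - f(apply_inversion(pi, i, j))) / cost(i, j, n)
        if best is None or benefit > best[0]:
            best = (benefit, i, j)
    return best


print(best_inversion([3, 2, 1], misplaced))
```

A full heuristic would repeat this step until the identity is reached and would need a fallback for permutations where no inversion has positive benefit.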
Left or right heuristic
We use the term LR to refer to this approach as an acronym for Left or Right. We first divide the elements in the permutation into two groups. The first group contains the elements that are in slices classified as sorted and the second group comprises the elements that are in unsorted slices. A slice s is in the sorted group if p(π, s) = s, p(π, n − s + 1) = n − s + 1, and the slices {1, 2, ..., s − 1} are also in the sorted group. Otherwise, s is in the unsorted group.
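The classification into sorted and unsorted slices can be sketched as follows. The function name `least_unsorted_slice` is illustrative; as in the definition above, only positions are checked, so signs of signed permutations are ignored here.

```python
def least_unsorted_slice(pi):
    """Return the smallest slice s that is not in the sorted group, or None.

    Slices are scanned from the outside in; the first slice whose two
    positions s and n - s + 1 do not hold elements s and n - s + 1
    (up to sign) is the least unsorted slice.
    """
    n = len(pi)
    for s in range(1, (n + 1) // 2 + 1):
        left, right = s, n - s + 1
        if abs(pi[left - 1]) != left or abs(pi[right - 1]) != right:
            return s
    return None


print(least_unsorted_slice([1, 4, 3, 2, 5]))
print(least_unsorted_slice([1, 2, 3, 4]))
```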
First, the Left or Right heuristic selects the least slice in the unsorted group. Then, we determine the element that should be moved first to that slice: the left or the right. The left (right) side is composed of the elements in positions with indices lower (higher) than the middle position.
To make this choice, we compute the total cost to put a given element in its final place. For signed permutations, we also consider the cost to place the element with a positive sign. After computing a cost for placing the left and the right elements, we choose the side that has the minimum cost. In case of tie, we move the right element.
If the slice has only one element that does not belong to it, we find the element that should be in that slice and perform an inversion to place it in the final position. After placing the element, we might have to change its sign on the signed version of our problem.
Slice-misplaced pairs heuristic
We use the term SMP to refer to this heuristic as an acronym for Slice-Misplaced Pairs.
Definition 1 We say that a pair {π_{i}, π_{j}}, 1 ≤ i < j ≤ n, is slice-misplaced in π if slice(π, π_{i}) > slice(π, π_{j}) and slice(ι, π_{i}) < slice(ι, π_{j}). We use π_{i} ~ π_{j} to represent that π_{i} and π_{j} are slice-misplaced.
Let SMP(π) be the number of slice-misplaced pairs in π and Δ_{SMP}(π, ρ) = SMP(π·ρ) − SMP(π) be the variation in the number of slice-misplaced pairs caused by an inversion ρ. An inversion that removes slice-misplaced pairs has Δ_{SMP}(π, ρ) < 0, so its benefit is given by ben_{SMP}(π, ρ) = −Δ_{SMP}(π, ρ) / cost(ρ).
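Counting slice-misplaced pairs can be sketched as below. The sketch reads Definition 1 as applying to unordered pairs, so a pair counts whenever the two elements are ordered one way by their slices in π and the strictly opposite way by their slices in ι; this symmetric reading is an assumption, as are the function names.

```python
from itertools import combinations


def slice_of_position(p, n):
    return min(p, n - p + 1)


def smp(pi):
    """Number of slice-misplaced pairs of pi (signs are ignored)."""
    n = len(pi)
    # Slice currently occupied by each element of pi.
    in_pi = {abs(x): slice_of_position(p, n) for p, x in enumerate(pi, start=1)}
    count = 0
    for a, b in combinations(range(1, n + 1), 2):
        diff_pi = in_pi[a] - in_pi[b]
        diff_id = slice_of_position(a, n) - slice_of_position(b, n)
        if diff_pi * diff_id < 0:  # strictly opposite slice order
            count += 1
    return count


print(smp([2, 1, 3]), smp([3, 2, 1]), smp([1, 2, 3]))
```

The second value illustrates that a permutation can have no slice-misplaced pairs without being sorted.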
For some permutations π ≠ ι, there is no inversion ρ such that SMP(π·ρ) < SMP(π). The following lemmas give us a better understanding of the properties of these permutations. We start by stating in Lemma 1 when we can be sure that at least one slice-misplaced pair can be removed.
Lemma 1 Let π_{i}, π_{j} be two elements in π such that |i − j| = 1 and π_{i} ~ π_{j}. Then there is at least one inversion ρ such that Δ_{SMP}(π, ρ) < 0.
Proof Let us assume, without loss of generality, that slice(π, π_{i}) < slice(π, π_{j}). Therefore, slice(ι, π_{i}) > slice(ι, π_{j}) since π_{i} ~ π_{j}. The inversion ρ = ρ(i, j) if i < j or ρ = ρ(j, i) if i > j will remove the slice-misplaced pair {π_{i}, π_{j}} when creating the permutation σ = π·ρ. Let π_{k} and π_{l} be the elements in the same slice as π_{i} and π_{j} in π, respectively. The following statements suffice to conclude that Δ_{SMP}(π, ρ) < 0. Note that it may occur that π_{j} is the only element in the slice, in which case it suffices to prove the first statement.

If π_{j} ≁ π_{k} in π, then π_{i} ≁ π_{k} in σ. We know that slice(ι, π_{j}) > slice(ι, π_{k}) since in π we have π_{j} ≁ π_{k} and slice(π, π_{j}) > slice(π, π_{k}). Therefore, π_{i} ≁ π_{k} in σ because slice(ι, π_{k}) < slice(ι, π_{j}) < slice(ι, π_{i}) and slice(σ, π_{k}) < slice(σ, π_{i}).

If π_{i} ≁ π_{l} in π, then π_{j} ≁ π_{l} in σ. We know that slice(ι, π_{i}) < slice(ι, π_{l}) since in π we have π_{i} ≁ π_{l} and slice(π, π_{i}) < slice(π, π_{l}). Therefore, π_{j} ≁ π_{l} in σ because slice(ι, π_{l}) > slice(ι, π_{i}) > slice(ι, π_{j}) and slice(σ, π_{l}) > slice(σ, π_{j}).
Observe that we do not need to consider cases where π_{i} ~ π_{l} or π_{j} ~ π_{k} in π, because in the worst scenario these slice-misplaced pairs will not be removed. □
Lemma 2 Let π_{i} be an element in π such that slice(ι, π_{i}) = ⌈n/2⌉. If slice(π, π_{i}) ≠ ⌈n/2⌉, then there is at least one inversion ρ such that Δ_{SMP}(π, ρ) < 0.
Proof If π_{i} is the only element having slice(ι, π_{i}) = ⌈n/2⌉ or if it occurs that slice(π, π_{i}) ≥ slice(π, π_{j}) for π_{i}, π_{j} such that slice(ι, π_{i}) = slice(ι, π_{j}) = ⌈n/2⌉, then π_{i} will form a slice-misplaced pair with every element π_{k} such that slice(π, π_{k}) > slice(π, π_{i}). Therefore, it is straightforward from Lemma 1 that at least one inversion ρ such that Δ_{SMP}(π, ρ) < 0 exists as long as slice(π, π_{i}) ≠ ⌈n/2⌉.
If π_{i} = ⌈n/2⌉ and |i − j| ≠ 1, it is straightforward from Lemma 1 that we can move π_{j} toward the slice. When |i − j| = 1 we apply the inversion ρ(i, k) if i < k or ρ(k, i) if k < i, where k = n − j + 1. □
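Lemma 3 Let π = (π_{1} π_{2} ... π_{n}) be a permutation such that Lemmas 1 and 2 find no inversion that decreases the number of slice-misplaced pairs. Then π has the form: slice(ι, π_{1}) ≤ slice(ι, π_{2}) ≤ ... ≤ slice(ι, π_{⌈n/2⌉}) and slice(ι, π_{⌊n/2⌋+1}) ≥ ... ≥ slice(ι, π_{n−1}) ≥ slice(ι, π_{n}).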
Proof Lemma 2 implies that if n is odd, then π_{(n+1)/2} is in the highest slice in ι. The same occurs with the elements π_{n/2} and π_{n/2+1}, which should be in the highest slice in ι if n is even. We know that π_{i} ≁ π_{i+1}, otherwise one could find ρ such that Δ_{SMP}(π, ρ) < 0 using Lemma 1. That leads to the elements being ordered according to their slices in ι as shown. □
Lemma 4 Δ_{ SMP } (π, ρ) = 0 for any perfectly symmetric inversion ρ(i, j) such that j = n − i + 1.
Proof Inversions ρ(i, j) such that j = n − i + 1 have no effect on the slice of any element in π. Therefore, no slice-misplaced pair will be created or removed.
Lemma 5 Let π be a permutation such that SMP (π) = 0, then slice(π, π_{ i }) = slice(ι, π_{ i }) for any 1 ≤ i ≤ n. In this case, perfectly symmetric inversions can sort π if it is unsigned and perfectly symmetric plus unitary inversions should be enough to sort π if it is signed.
Proof Let π_{ i } be an element in π such that slice(π, π_{ i }) ≠ slice(ι, π_{ i }). There must exist an element π_{ j } such that slice(π, π_{ j }) = slice(ι, π_{ i }) ≠ slice(ι, π_{ j }). Two cases are possible: (i) π_{ i } ~ π_{ j }, therefore SMP (π) > 0 or (ii) π_{ i } ≁ π_{ j }, thus there must exist an element π_{ k } such that slice(π, π_{ k }) = slice(ι, π_{ j }) ≠ slice(ι, π_{ k }). In this second case, we can restart the entire process by calling π_{ j } and π_{ k } as π_{ i } and π_{ j }, respectively. Since we have a finite number of elements in the permutation, we know that the first case will be reached eventually.
If slice(π, π_{i}) = slice(ι, π_{i}) for any 1 ≤ i ≤ n, we can reach the identity permutation by first performing perfectly symmetric inversions ρ(i, j), slice(ι, i) = slice(ι, j), if π_{i} = j and π_{j} = i. It requires at most ⌈n/2⌉ inversions and each one will cost one unit. It should be enough for unsigned permutations, but for signed permutations we still need to perform unitary inversions ρ(i, i) if s(π_{i}) = −1.
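The sorting procedure described in this proof can be sketched as follows, assuming the input already has every element in its own slice. The sketch processes slices from the outside in, because a symmetric inversion also swaps the elements of every inner slice; the function names are illustrative.

```python
def apply_inversion(pi, i, j, signed):
    segment = pi[i - 1:j][::-1]
    if signed:
        segment = [-x for x in segment]
    return pi[:i - 1] + segment + pi[j:]


def sort_slice_aligned(pi, signed=False):
    """Sort a permutation whose elements are all in their final slices.

    Returns the sorted permutation and the list of inversions (i, j) used.
    Every inversion used is perfectly symmetric or unitary, so each costs 1.
    """
    pi, n, scenario = list(pi), len(pi), []
    for s in range(1, n // 2 + 1):
        i, j = s, n - s + 1
        if {abs(pi[i - 1]), abs(pi[j - 1])} != {i, j}:
            raise ValueError("an element is outside its slice")
        if abs(pi[i - 1]) == j:  # the two elements of slice s are swapped
            pi = apply_inversion(pi, i, j, signed)
            scenario.append((i, j))
    if signed:
        for i in range(1, n + 1):
            if pi[i - 1] < 0:  # fix the orientation with a unitary inversion
                pi = apply_inversion(pi, i, i, signed)
                scenario.append((i, i))
    return pi, scenario


print(sort_slice_aligned([5, 2, 3, 4, 1]))
print(sort_slice_aligned([-1, 4, -3, 2, 5], signed=True))
```

Since every inversion in the returned scenario costs one unit, the total cost equals the length of the scenario.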
Lemma 6 Let π = (π_{1} π_{2} ... π_{n}) be a permutation such that Lemmas 1 and 2 find no inversion that decreases the number of slice-misplaced pairs and SMP(π) > 0. Then we can apply two inversions ρ_{1} and ρ_{2} such that SMP(π) > SMP(π·ρ_{1}·ρ_{2}).
Proof Assuming no inversion as described by Lemmas 1 and 2 is possible, we know that π has the form of Lemma 3. Moreover, there is at least one pair of elements π_{i}, π_{i+1} such that slice(ι, π_{i}) = slice(ι, π_{i+1}) and slice(ι, π_{i}) ≠ ⌈n/2⌉. We hereafter consider, without loss of generality, that (i) slice(π, π_{i}) = i and slice(π, π_{i+1}) = i + 1; (ii) slice(π, π_{i+1}) ≥ slice(π, π_{x}) and slice(π, π_{i+1}) ≥ slice(π, π_{x+1}) for every pair π_{x}, π_{x+1} such that slice(ι, π_{x}) = slice(ι, π_{x+1}); (iii) π_{j} and π_{j+1} are elements that share the same slice with π_{i+1} and π_{i}, respectively.
The first inversion we apply is ρ_{1}(i + 1, j). This inversion is perfectly symmetric and hence Δ_{ SMP } (π, ρ_{1}) = 0 by Lemma 4. We assert that the inversion ρ_{2} applied after ρ_{1} will have Δ_{ SMP } (π·ρ_{1}, ρ_{2}) < 0 in order to prove the lemma. That said, we start by compiling attributes of σ = π·ρ_{1} and we will use these attributes to find ρ_{2}.

a = slice(ι, π_{i}) = slice(ι, σ_{i})
a = slice(ι, π_{i+1}) = slice(ι, σ_{j})
b = slice(ι, π_{j}) = slice(ι, σ_{i+1})
c = slice(ι, π_{j+1}) = slice(ι, σ_{j+1})
We know that b ≥ c because π has the form described by Lemma 3, and we know that a ≠ b and a ≠ c because at most two elements can have a as slice in the identity permutation. Five different situations are possible:
1. a > b > c: we have σ_{i} ~ σ_{i+1} in σ because a > b. Therefore, the inversion ρ_{2}(i, i + 1) has Δ_{SMP}(π·ρ_{1}, ρ_{2}) < 0 according to Lemma 1.
2. b > c > a: we have σ_{j} ~ σ_{j+1} in σ because c > a. Therefore, the inversion ρ_{2}(j, j + 1) has Δ_{SMP}(π·ρ_{1}, ρ_{2}) < 0 according to Lemma 1.
3. b > a > c: we show that this case is not possible. We have assumed that slice(π, π_{i+1}) ≥ slice(π, π_{x}) and slice(π, π_{i+1}) ≥ slice(π, π_{x+1}) for every pair π_{x}, π_{x+1} such that slice(ι, π_{x}) = slice(ι, π_{x+1}). Therefore, the element π_{j−1} cannot have slice(ι, π_{j−1}) = b, which forces us to conclude that there is one element π_{z} such that slice(ι, π_{z}) = b and z = i + 2. However, since no pair of elements π_{x}, π_{y} such that slice(π, π_{x}) = slice(π, π_{y}) > b can have π_{x} and π_{y} on the same side, we must have more elements on one side of the permutation than on the other, which is impossible.
4. b = c > a: we have σ_{j} ~ σ_{j+1} in σ because c > a. Therefore, the inversion ρ_{2}(j, j + 1) has Δ_{SMP}(π·ρ_{1}, ρ_{2}) < 0 according to Lemma 1.
5. a > b = c: we have σ_{i} ~ σ_{i+1} in σ because a > b. Therefore, the inversion ρ_{2}(i, i + 1) has Δ_{SMP}(π·ρ_{1}, ρ_{2}) < 0 according to Lemma 1.
Lemmas 1 and 2 show cases such that at least one inversion will have positive benefit. No positive benefit inversion is guaranteed on permutations in the form described by Lemma 3, but in those cases we can use Lemmas 5 and 6 to assure that our greedy approach will eventually reach the identity permutation.
Number of breakpoints heuristic
We use the term NB to refer to this heuristic as an acronym for Number of Breakpoints. Consider the extended permutation that can be obtained from π by inserting two new elements: π_{0} = 0 and π_{n+1 }= n + 1. The extended permutation is still denoted as π. Below, we present two definitions of breakpoint depending on whether we are dealing with signed or unsigned permutations.
Definition 2 A pair of elements π_{i}, π_{i+1} in a signed permutation π, with 0 ≤ i ≤ n, is a breakpoint if π_{i+1} − π_{i} ≠ 1.
Definition 3 A pair of elements π_{i}, π_{i+1} in an unsigned permutation π, with 0 ≤ i ≤ n, is a breakpoint if |π_{i+1} − π_{i}| ≠ 1.
We use NB(π) to represent the number of breakpoints in a permutation and Δ_{NB}(π, ρ) = NB(π·ρ) − NB(π) to represent the variation in the number of breakpoints caused by an inversion ρ.
The identity permutation ι is the only permutation with no breakpoints. Therefore, an inversion that decreases the number of breakpoints indirectly leads to the identity permutation. Since an inversion can affect only two breakpoints, we know that Δ_{ NB } (π, ρ) ∈ {−2, −1, 0, 1, 2}.
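Counting breakpoints on the extended permutation can be sketched as below. The sketch uses the signed difference for signed permutations and the absolute difference for unsigned ones, which is the reading that allows decreasing strips in the unsigned case; function names are illustrative.

```python
def apply_inversion(pi, i, j, signed=False):
    segment = pi[i - 1:j][::-1]
    if signed:
        segment = [-x for x in segment]
    return pi[:i - 1] + segment + pi[j:]


def breakpoints(pi, signed=False):
    """NB(pi): breakpoints of pi extended with 0 and n + 1."""
    ext = [0] + list(pi) + [len(pi) + 1]
    count = 0
    for a, b in zip(ext, ext[1:]):
        diff = b - a if signed else abs(b - a)
        if diff != 1:
            count += 1
    return count


def delta_nb(pi, i, j, signed=False):
    """Variation in the number of breakpoints caused by ρ(i, j)."""
    return breakpoints(apply_inversion(pi, i, j, signed), signed) - breakpoints(pi, signed)


print(breakpoints([3, 1, 2]))
print(breakpoints([-2, -1], signed=True))
print(delta_nb([3, 2, 1], 1, 3))
```

In the signed example the pair (−2, −1) is not a breakpoint, because the two genes are adjacent and merely reversed together.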
The benefit of an arbitrary inversion ρ can be computed as ben_{NB}(π, ρ) = −Δ_{NB}(π, ρ) / cost(ρ), so that an inversion removing breakpoints has positive benefit. We compute the benefit of all possible inversions and we choose one that maximizes the benefit.
Since breakpoint is a very common concept in the genome rearrangement field, previous publications have clearly stated which kind of permutations will not allow any inversion to decrease the number of breakpoints.
Definition 4 Let π be an unsigned permutation, a strip of π is an interval [π_{ i }, ... π_{ j }] with no breakpoint such that (π_{i−1}, π_{ i }) and (π_{ j } , π_{j+1}) are breakpoints.
Strips can be either increasing (π_{ i } < π_{i+1 }< ... < π_{ j }) or decreasing (π_{ i } > π_{i+1 }> ... > π_{ j }). Strips with only one element are considered decreasing strips. Kececioglu and Sankoff [16] proved that every unsigned permutation with a decreasing strip has at least one inversion that removes at least one breakpoint. Therefore, those inversions will have positive (nonzero) benefit. The same idea holds for signed permutations.
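A strip decomposition for unsigned permutations can be sketched as follows. The sketch works on the extended permutation and treats the framing elements 0 and n + 1 as belonging to increasing strips; that convention is an assumption here, since the text does not say how the framing elements are classified.

```python
def strips(pi):
    """Split the extended unsigned permutation into labelled strips.

    Returns a list of (elements, kind) with kind "inc" or "dec".
    Single-element strips are decreasing, except for the framing
    elements 0 and n + 1 (assumed increasing).
    """
    n = len(pi)
    ext = [0] + list(pi) + [n + 1]
    runs, current = [], [ext[0]]
    for prev, nxt in zip(ext, ext[1:]):
        if abs(nxt - prev) == 1:   # no breakpoint: the strip continues
            current.append(nxt)
        else:                      # breakpoint: close the current strip
            runs.append(current)
            current = [nxt]
    runs.append(current)
    labelled = []
    for run in runs:
        if len(run) > 1:
            kind = "inc" if run[1] > run[0] else "dec"
        else:
            kind = "inc" if run[0] in (0, n + 1) else "dec"
        labelled.append((run, kind))
    return labelled


def has_decreasing_strip(pi):
    return any(kind == "dec" for _, kind in strips(pi))


print(strips([3, 4, 1, 2, 6, 5]))
print(has_decreasing_strip([3, 4, 1, 2]))
```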
When we find a permutation π with no decreasing strips, no inversion has a positive benefit. Therefore, we can use one of the following strategies to create decreasing strips as a contingency plan.

1. Use the Left or Right heuristic (LR).
2. The Left or Right heuristic could break one strip because only a single element will be moved. Therefore, another approach is to compute the cost of placing the strip that contains the left element on the left side and the strip that contains the right element on the right side, and then use the less costly option.
3. Reverse the entire unsorted group with one inversion.
4. Find the increasing strips and compute the cost of all possible inversions that reverse one or more of them. Note that we do not consider inversions that split any strip. After that, use the least costly inversion. We call this approach the Best Strip strategy.
The Best Strip strategy leads to the best results in our experiments. Therefore, we use it in our comparative analysis.
Number of breakpoints plus slice-misplaced pairs heuristic
We use the term NB+SMP to refer to this approach since it uses concepts retrieved from NB and SMP. We decided to favor breakpoint reduction in our greedy function: Δ_{NB+SMP}(π, ρ) = Δ_{NB}(π, ρ) + Δ_{SMP}(π, ρ) / n^{2}. We use the benefit computed as ben_{NB+SMP}(π, ρ) = −Δ_{NB+SMP}(π, ρ) / cost(ρ), so that an inversion that decreases the greedy function has positive benefit.
For unsigned permutations, we use the inversion ρ that maximizes the benefit if, and only if, ben_{NB+SMP}(π, ρ) > 0. Otherwise, we use the Best Strip strategy in order to guarantee that the input permutation will be sorted.
For signed permutations, we compute the unsigned permutation π′ by removing the signs from π and we apply to π the inversion ρ′ that would be used in π′ according to the method previously described for unsigned permutations. In the end, if π ≠ ι, we simply change the signs of negative elements with unitary inversions.
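The selection rule for unsigned permutations can be sketched as below. The benefit is computed as the decrease of NB + SMP / n² divided by the cost, so it is positive when the combined value drops; dividing by n² keeps the SMP term below one breakpoint, so it mostly acts as a tie-breaker. All names are illustrative, slice-misplaced pairs are counted as unordered pairs (an assumption), and the Best Strip fallback is not implemented.

```python
from itertools import combinations


def slice_of(p, n):
    return min(p, n - p + 1)


def cost(i, j, n):
    return abs(slice_of(i, n) - slice_of(j, n)) + 1


def apply_inversion(pi, i, j):
    return pi[:i - 1] + pi[i - 1:j][::-1] + pi[j:]


def nb(pi):
    ext = [0] + list(pi) + [len(pi) + 1]
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(b - a) != 1)


def smp(pi):
    n = len(pi)
    in_pi = {x: slice_of(p, n) for p, x in enumerate(pi, start=1)}
    return sum(1 for a, b in combinations(range(1, n + 1), 2)
               if (in_pi[a] - in_pi[b]) * (slice_of(a, n) - slice_of(b, n)) < 0)


def score(pi):
    """Combined greedy value: breakpoints dominate, SMP breaks ties."""
    return nb(pi) + smp(pi) / len(pi) ** 2


def choose_inversion(pi):
    """Best positive-benefit inversion as (benefit, i, j), or None."""
    n, best = len(pi), None
    for i, j in combinations(range(1, n + 1), 2):
        benefit = (score(pi) - score(apply_inversion(pi, i, j))) / cost(i, j, n)
        if benefit > 0 and (best is None or benefit > best[0]):
            best = (benefit, i, j)
    return best


print(choose_inversion([3, 2, 1]))
print(choose_inversion([1, 2, 3]))
```

A return value of `None` marks the situation in which a fallback such as the Best Strip strategy would be needed.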
Results and discussion
We implemented the algorithms in C++ and the experiments were executed on the cluster provided by the IN2P3 Computing Center (http://cc.in2p3.fr/).
Our source code is freely available at https://github.com/chrbaudet/SWILS.
We performed two batches of experiments. We first show experiments using small permutations and then we use considerably longer sequences up to size 100.
Small permutations
We generated a dataset with all possible unsigned permutations up to size 12, which accounts for Σ_{n=2}^{12} n! = 522,956,312 instances. We did the same for all possible signed permutations up to size 10, which accounts for Σ_{n=2}^{10} 2^{n} n! = 3,912,703,160 instances. Therefore, we have used a large dataset having more than 4.4 × 10^{9} permutations.
For every permutation in the dataset, we were able to compute a minimum cost solution for comparison purposes. The minimum cost solution was calculated using a graph structure G_{n}, for n ∈ {1, 2, ..., 12}. We define G_{n} as follows. A permutation π is a vertex in G_{n} if, and only if, π has n elements. Let π and σ be two vertices in G_{n}. We build an edge from π to σ if, and only if, there is an inversion ρ that transforms π into σ. The weight assigned to this edge is cost(ρ). Finally, we calculate the shortest path from ι to each vertex in G_{n} using a variant of Dijkstra's algorithm for the single-source shortest-paths problem. This variant gives us the minimum cost to sort permutations in G_{n}, as well as an optimum scenario of inversions.
Let heu_cost be the cost for sorting a permutation using one of our heuristics and opt_cost be the optimum cost. We compute the approximation ratio as heu_cost / opt_cost.
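The shortest-path computation can be sketched for small unsigned permutations as follows. This is a plain textbook Dijkstra with a binary heap, not the variant used by the authors, and the function name `optimal_costs` is hypothetical. Because every inversion is its own inverse and its cost depends only on the positions i and j, the distance from ι to π equals the minimum cost of sorting π.

```python
import heapq
from itertools import combinations


def optimal_costs(n):
    """Minimum sorting cost of every unsigned permutation of size n."""
    identity = tuple(range(1, n + 1))
    # All inversions ρ(i, j), i < j, with their position-based cost.
    moves = [(i, j, abs(min(i, n - i + 1) - min(j, n - j + 1)) + 1)
             for i, j in combinations(range(1, n + 1), 2)]
    dist = {identity: 0}
    heap = [(0, identity)]
    while heap:
        d, pi = heapq.heappop(heap)
        if d > dist[pi]:
            continue  # stale heap entry
        for i, j, c in moves:
            nxt = pi[:i - 1] + pi[i - 1:j][::-1] + pi[j:]
            if d + c < dist.get(nxt, float("inf")):
                dist[nxt] = d + c
                heapq.heappush(heap, (d + c, nxt))
    return dist


costs = optimal_costs(3)
for perm in sorted(costs):
    print(perm, costs[perm])
```

The number of vertices grows as n!, so this direct approach is only practical for small n, in line with the size limits used in the experiments.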
Figures 3 and 4 summarize our results for unsigned and signed permutations, respectively. The graphs in Figures 3(a) and 4(a) show how often each heuristic returns the optimum cost. The graphs in Figures 3(b) and 4(b) show the average ratio considering all permutations of a given size, while the graphs in Figures 3(c) and 4(c) exhibit the maximum ratios for the same group of permutations. These graphs show maximum or average values and they may not answer the question: for a single instance π, is there any algorithm that is likely to provide the best answer? The graphs in Figures 3(d) and 4(d) address this question by assessing the number of times each algorithm provides the least costly sequence. The values we present do not add up to 100% because of ties.
We observe that NB+SMP leads to the best results and it is consistently better than using just the number of breakpoints (NB) or just the variation in the number of slice-misplaced pairs (SMP) in every aspect we plot in Figures 3 and 4.
Individually, NB and SMP may be worse than NB+SMP, but NB returns results that are much closer to the optimum solution than SMP. That is true both on signed and unsigned permutations as we can see in Figures 3(b) and 4(b). The other graphs also corroborate this fact, which supports our decision of favoring the number of breakpoints when we compute Δ_{ NB+SMP }.
The simplistic approach LR leads to inferior results, as expected. However, when we consider only the maximum ratio, we observe that LR outdoes SMP. Indeed, the SMP approach is not consistent with respect to this aspect, which indicates that particular permutations are hard to sort using only the slice-misplaced pairs.
A final test checks if any profit is gained from running all the heuristics. We added a new curve labeled as All, which selects for each instance the least costly result among those produced by our heuristics. As we can see, running all the heuristics and keeping the best result gives very good results. Indeed, it is consistently better than using solely the NB+SMP heuristic.
Large permutations
We ran our algorithms on a set of larger permutations. This set is composed of 190,000 random signed permutations and 190,000 random unsigned permutations. In both cases, the permutation size ranges from 10 to 100 in intervals of 5, with 10,000 permutations of each size. Here, we do not have exact solutions for these permutations. Therefore, we base our analysis on the average cost instead of the approximation ratio.
The analysis on random permutations reinforces the notion that NB+SMP leads to the best results. We observe in Figures 5 and 6 that the difference between NB and NB+SMP is very small on average. However, NB+SMP returns the less costly scenario in more cases.
Using random permutations allows us to draw information about the running time of each heuristic. In Table 1 we observe that LR is the fastest heuristic and the sorting scenario can be obtained almost instantly (less than 1 millisecond), which is reasonable since this heuristic is very simplistic. The heuristics SMP and NB+SMP are the ones that take more time to finish, which can be explained, in part, by the fact that both need to compute slice-misplaced pairs. Since SMP returns scenarios with more inversions than NB+SMP, the former requires about twice the time used by the latter.
Conclusions
We have defined a new genome rearrangement problem based on the concepts of symmetry and length of the reversed segments in order to assign a cost for each inversion. The problem we are proposing aims at finding low-cost scenarios between genomes. We considered the cases when gene orientation is taken into account and when it is not. We have provided the first steps in exploring this problem.
We presented several heuristics and we assessed their performances on a large set of more than 4.4 × 10^{9} permutations. The ideas we used to develop these heuristics together with the experimental results set the stage for a proper theoretical analysis.
As in other problems in the genome rearrangement field, we would like to know the complexity of determining the distance between any two genomes using the operations we defined. That seems to be a difficult problem that we intend to keep studying. We plan to design approximation algorithms and more effective heuristics.
References
Darling AE, Miklós I, Ragan MA: Dynamics of genome rearrangement in bacterial populations. PLoS Genetics. 2008, 4 (7): 1000128.
Dias U, Dias Z, Setubal JC: A simulation tool for the study of symmetric inversions in bacterial genomes. Comparative Genomics. Lecture Notes in Computer Science, Springer. 2011, 6398: 240-251.
Hannenhalli S, Pevzner PA: Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals. Journal of the ACM. 1999, 46 (1): 1-27.
Eisen JA, Heidelberg JF, White O, Salzberg SL: Evidence for symmetric chromosomal inversions around the replication origin in bacteria. Genome Biology. 2000, 1 (6): 0011.1-0011.9.
Lefebvre JF, El-Mabrouk N, Tillier E, Sankoff D: Detection and validation of single gene inversions. Bioinformatics. 2003, 19 (Suppl 1): 190-196.
Sankoff D, Lefebvre JF, Tillier E, El-Mabrouk N: The distribution of inversion lengths in bacteria. Comparative Genomics. Lecture Notes in Computer Science, Springer. 2005, 3388: 97-108.
Caprara A: Sorting permutations by reversals and Eulerian cycle decompositions. SIAM Journal on Discrete Mathematics. 1999, 12 (1): 91-110.
Swidan F, Bender M, Ge D, He S, Hu H, Pinter R: Sorting by length-weighted reversals: Dealing with signs and circularity. Combinatorial Pattern Matching. Lecture Notes in Computer Science, Springer. 2004, 3109: 32-46.
Arruda TS, Dias U, Dias Z: Heuristics for the sorting by length-weighted inversions problem on signed permutations. Algorithms for Computational Biology. Lecture Notes in Computer Science. 2014, 8542: 59-70.
Pinter RY, Skiena S: Genomic sorting with length-weighted reversals. Genome Informatics. 2002, 13: 2002.
Bender MA, Ge D, He S, Hu H, Pinter RY, Skiena S, Swidan F: Improved bounds on sorting by length-weighted reversals. Journal of Computer and System Sciences. 2008, 74 (5): 744-774.
Arruda TS, Dias U, Dias Z: Heuristics for the sorting by length-weighted inversion problem. Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics. 2013, 498-507.
Ohlebusch E, Abouelhoda MI, Hockel K, Stallkamp J: The median problem for the reversal distance in circular bacterial genomes. Proceedings of Combinatorial Pattern Matching. 2005, 116-127.
Dias Z, Dias U, Setubal JC, Heath LS: Sorting genomes using almost-symmetric inversions. Proceedings of the 27th Symposium On Applied Computing (SAC'2012), Riva del Garda, Italy. 2012, 1-7.
Dias U, Baudet C, Dias Z: Greedy randomized search procedure to sort genomes using symmetric, almost-symmetric and unitary inversions. Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics. BCB'13. 2013, ACM, New York, NY, USA, 181-190.
Kececioglu JD, Ravi R: Of Mice and Men: Algorithms for Evolutionary Distances Between Genomes with Translocation. Proceedings of the 6th Annual Symposium on Discrete Algorithms. 1995, 604-613.
Acknowledgements
This work was supported by a Postdoctoral Fellowship from FAPESP to UD (number 2012/015843), by project funding from CNPq (numbers 477692/20125 and 483370/20134), FAPESP (number 2014/194018) and CAPES/COFECUB (number 831/15) to ZD, and by French Project ANR MIRI BLAN081335497 and the ERC Advanced Grant SISYPHE to CB.
The authors also thank the Center for Computational Engineering and Sciences at Unicamp for financial support through the FAPESP/CEPID Grant 2013/082937.
Experiments were run on the cluster provided by the IN2P3 Computing Center (http://cc.in2p3.fr/).
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 19, 2015: Brazilian Symposium on Bioinformatics 2014. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S19
Declarations
This work was supported by CNPq grants (numbers 477692/20125 and 483370/20134), FAPESP grants (numbers 2012/015843, 2013/082937 and 2014/194018) and CAPES/COFECUB (number 831/15), by French Project ANR MIRI BLAN081335497 and the ERC Advanced Grant SISYPHE (grant number [247073]10). The publication charges of this article were funded by FAPESP/CEPID grant 2013/082937.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
ZD came up with the idea of a cost function that uses both the symmetry and the length of the affected segment. CB, ZD and UD designed the algorithms and the batches of experiments. UD and CB programmed an early prototype in Python. After a series of iterative improvements, CB developed the final C++ code. UD drafted the manuscript. All authors read and approved the final manuscript.
Christian Baudet, Ulisses Dias and Zanoni Dias contributed equally to this work.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Baudet, C., Dias, U. & Dias, Z. Sorting by weighted inversions considering length and symmetry. BMC Bioinformatics 16 (Suppl 19), S3 (2015). https://doi.org/10.1186/1471-2105-16-S19-S3
DOI: https://doi.org/10.1186/1471-2105-16-S19-S3