Preliminaries
A chromosome
is a string over the alphabet Σ = {1, ..., n}, where each element may have a positive or negative orientation (indicated by
or
). We get the inverse of an element
(indicated by
) by inverting its orientation. The reflection of a chromosome
is the chromosome
. It is considered to be equivalent to πi. A genome π= {πi, ..., πm} is a multiset of chromosomes. The multiplicity mult(π, x) of an element x is the number of its occurrences (with arbitrary orientation) in π. A segment is a consecutive sequence of elements of a chromosome. Each element x can also be represented by the ordered set of its extremities x
t
(the tail ) and x
h
(the head ), where the ordering of the extremities is x
t
x
h
if x has positive orientation, and x
h
x
t
otherwise. For example, the chromosome
can also be written as (1
t
1
h
2
t
2
h
). The two extremities belonging to the same element are called co-elements. We say the tail/head x
t/h
of an element x is a telomere if there is a chromosome in π beginning or ending with xt /h. The value t(π, x
t
) determines how often x
t
is a telomere in π, the value t(π, x
h
) is defined analogously. Two consecutive extremities in a chromosome πiwhich are no co-elements form an inner adjacency w.r.t. another genome ρ if they are also consecutive in a chromosome of ρ, otherwise they form an inner breakpoint. An extremity which is a telomere in π forms a telomere adjacency w.r.t. another genome ρ if it is also a telomere in ρ, otherwise it forms a telomere breakpoint. For example, if ρ = {(1
t
1
h
2
t
2
h
3
t
3
h
)} and π = {(1
t
1
h
2
t
2
h
), (1
t
1
h
3
t
3
h
)}, then 1
t
and 3
h
form telomere adjacencies, while 2
h
forms a telomere breakpoint. Applying an operation op to a genome π yields the genome op(π). A genome rearrangement problem is defined as follows. Given two genomes ρ and π and a set of possible operations, where each operation is assigned a weight, find a sequence of minimum weight (i.e. the sum of the weights of the operations is minimized) that transforms ρ into π. This minimum weight will be denoted by d(ρ, π). In our algorithm, we will consider all operations listed in Subsection "Operations". For simplification, we will assume that two chromosomes in ρ are either disjoint (i.e. they have no element in common) or identical. Furthermore, each element in Σ must appear in at least one chromosome of ρ. Note that these restrictions only hold for the genome ρ, not for π.
Operations
The following set of operations will be considered by our algorithm. A reversal inverts the order of the elements of a segment. Additionally, the orientation of each element in the segment is inverted. A transposition cuts a segment out of a chromosome, and reinserts it at another position in the same chromosome. If we additionally apply a reversal on this segment, we speak of an inverted transposition. A fusion concatenates two chromosomes. Both chromosomes πiand πjcan be inverted before the operation, i.e. we can replace πior πjby its reflection. A fission splits a chromosome into two chromosomes. A translocation splits two chromosomes πiand πjinto πi, t, πi, hand πj, t, πj, h, and then concatenates πi, twith πj, hand πj, twith πi, h. Again, both chromosomes can be inverted before the operation. A tandem duplication inserts an identical copy of a segment immediately after this segment in a chromosome. A transposition duplication inserts an identical copy of a segment at an arbitrary position in the genome. A deletion cuts a segment out of a chromosome. A chromosome duplication creates an identical copy of a chromosome. A chromosome deletion deletes a chromosome.
All operations have weight 1, except for (inverted) transpositions and transposition duplications, which have weight 2. These weights are chosen rather by mathematical than biological reasons, but still result in biologically realistic scenarios (see also Section "Discussion"). To simplify the analysis of the effects of the operations in Section "A lower bound", we will there use the double cut and join operator (DCJ), which has been introduced in [7]. A DCJ cuts a genome at two positions, and rejoins the cut ends in two new pairs. It is a well-known fact that reversals, translocations, fusions, and fission can be described by one DCJ operation, while transpositions can be described by two DCJ operations. Duplications and deletions cannot be described by DCJ operations and therefore must be examined separately.
The breakpoint graph
Our main tool for developing the algorithm is the breakpoint graph, which is an edge-colored multigraph that visualizes the current adjacencies and breakpoints. It has been introduced by Bafna and Pevzner to solve rearrangement problems on genomes without duplicated genes [24]. We extend this graph such that it can also be used for genomes with duplicated genes. The breakpoint graph of two genomes ρ and π can be constructed as follows. First, we write the set of vertices {1
t
, 1
h
, 2
t
, 2
h
, ..., n
t
, n
h
} from left to right on a straight line. Second, for each pair (x
t/h
, y
t/h
) of extremities that are no co-elements but consecutive in a chromosome of ρ, we connect the corresponding vertices by a gray edge. Third, we analogously add a black edge for each pair of extremities that are consecutive in π. However, if one of the endpoints is not an endpoint of a gray edge (i.e. it corresponds to a telomere in ρ), we do not add the black edge (this is required to obtain a good lower bound in Section "A lower bound"). For an example, see Fig. 1. In contrast to the original breakpoint graph, each vertex can be the endpoint of several gray and black edges. The multiplicity of an edge (v, v') is the number of black edges between v and v'. A black edge (v, v) is called a loop. Let L(ρ, π) denote the number of vertices with a loop. Two vertices v, v' are in the same component of the graph if and only if there is a path (consisting of gray and black edges) from v to v'. Let C(ρ, π) denote the number of components in the breakpoint graph of π. A black edge is called a 1-bridge if the removal of this edge increases C(ρ, π). A pair of black edges is called a 2-bridge if none of the edges is a 1-bridge and the removal of both edges increases C(ρ, π).
A lower bound
Instead of searching for a sequence of operations op1, ..., op
k
that sorts ρ into π, one can also search for the inverse sequence
that sorts π into ρ (we will call such a sequence a sorting sequence). This is more convenient, because then the breakpoint graph has some nice properties due to the limitations in ρ. Note that simply restricting π instead of ρ would not work, because the operations are directed from the restricted ancestral genome to the unrestricted descendant, i.e. we would nevertheless have to invert the operations. In the following, we will examine what effects the inverse operations have on the breakpoint graph. In the unichromosomal case, we were mainly interested in the effects on components and loops (see [23]). As the breakpoint graph does not contain edges for telomeres, we additionally have to consider the amount of incorrect telomeres (denoted by T(ρ, π)), which is defined as follows.
In other words, for each telomere x
t
(or x
h
) in ρ, we count how often we must create this telomere in π such that t(ρ, x
t
) ≤ t(π, x
t
) (or t(ρ, x
h
) ≤ t(π, x
h
)). Additionally, we add the amount of telomeres x
t
and x
h
in π that have to be removed, i.e. they are not telomeres in ρ.
Lemma 1. If π = ρ, then C(ρ, π) is maximized, and L(ρ, π) and T(ρ, π) are minimized.
Proof. If π = ρ, the set of gray edges and the set of black edges in the breakpoint graph are equal. Thus, removing black edges does not increase the number of components, and adding black edges can only decrease the number of components. Therefore, changing π cannot increase C(ρ, π). L(ρ, π) and T(, π) are both positive numbers, and if π = ρ, they are equal to 0 and therefore minimized. □
In [23], we have shown that all operations can either change the number of components by 1, or change the number of loops by 2. These observations still hold for our slightly modified breakpoint graph. We will now examine how an operation can change T(ρ, π).
Inverse reversal, translocation, transposition
These operations can be simulated by one or two inverse DCJs (which is equivalent to a normal DCJ), thus it is sufficient to examine the effects of a DCJ (note that transpositions, which require two DCJs, have weight 2). A DCJ can only change T(ρ, π) if one of its edges is the end of a chromosome. Then, a telomere xt /his removed and a new telomere yt /his created. This decreases T (ρ, π) by 1 or 2 if xt /his not a telomere in π and yt /his a telomere in π, otherwise the operation does not decrease T(ρ, π). However, in the first case, the DCJ did not cut any black edge, as we neither draw black edges for telomeres in ρ nor for telomeres in π.
Inverse fusion (fission)
Splitting a chromosome will only decrease T(ρ, π) if both new telomeres are also telomeres in ρ. In this case, no black edge in the breakpoint graph is removed, i.e. C(ρ, π) and L(ρ, π) remain unchanged. T(ρ, π) is decreased by at most 2.
Inverse fission (fusion)
Concatenating two chromosomes can decrease T(ρ, π) by at most 2. This operation never removes a black edge, thus C(ρ, π) cannot be increased and L(ρ, π) cannot be decreased.
Inverse tandem duplication
This operation does never change the set of telomeres in π, and therefore cannot change T(ρ, π).
Inverse transposition duplication
This operation can decrease T(ρ, π) only if the duplicated segment is at a chromosome end, and the new chromosome end (after deleting the segment) is a telomere in ρ. In this case, no black edge with multiplicity 1 is removed, therefore C(ρ, π) and L(ρ, π) remain unchanged. The decrement of T(ρ, π) is ≤ 2.
Inverse deletion (insertion)
This operation can only change T(ρ, π) if we insert a segment at a chromosome end. In this case, no black edge is removed, i.e. C(ρ, π) cannot be increased and L(ρ, π) cannot be decreased. T(ρ, π) is decreased by at most 2.
Inverse chromosome duplication
This operation can decrease T(ρ, π) by at most 2 (if the telomeres of the removed chromosome are not telomeres in ρ). Only black edges with multiplicity ≥ 2 are removed, thus C(ρ, π) and L(ρ, π) remain unchanged.
Inverse chromosome deletion (chromosome insertion)
This operation can decrease T(ρ, π) by at most 2 (if the telomeres of the new chromosome are also telomeres in ρ). In the breakpoint graph, no black edges are removed, i.e. C(ρ, π) cannot be increased and L(ρ, π) cannot be decreased.
Theorem 1. A lower bound lb(ρ, π) of the distance d(ρ, π) is
where c(ρ) = n + (number of different chromosomes in ρ), and L
i
(ρ, π) is the number of vertices with a loop in component C
i
in the breakpoint graph of ρ and π.
Proof. An operation that decreases T(ρ, π) will neither increase C(ρ, π) nor decrease L(ρ, π), therefore we can separate every sequence into operations that decrease T(ρ, π) and operations that decrease
. Each operation decreases T(ρ, π) by at most 2, so we need at least
operations of the first kind. Furthermore, if ρ = π, then C(ρ, π) = c(ρ) and therefore CL(ρ, π) = 0. As each operation decreases CL(ρ, π) by at most one (the proof in [23] still holds), we need at least CL(ρ, π) operations of the second kind. Therefore, any sorting sequence must have at least lb(ρ, π) operations. □
Corollary 1. lb(ρ, ρ) = 0.
Unfortunately, there are genomes π ≠ ρ with lb(ρ, π) = 0, i.e. it is not sufficient to sort until the lower bound reaches 0. We therefore have to introduce another distance measure τ(ρ, π). We will use the following definitions.
For example, if ρ =
and π =
, then lb(ρ, π) = 0 and τ(ρ, π) = 2.
Lemma 2. If ρ = π, then τ(ρ, π) = 0. Otherwise, τ(ρ, π) > 0.
Proof. It is clear that τ(ρ, π) = 0 if ρ = π. In order to minimize τ(ρ, π), it is necessary to minimize m(ρ, π) and to maximize ia(ρ, π) and ta(ρ, π). ia(ρ, ρ) and ta(ρ, ρ) are independent of π and therefore fixed. Each inner adjacency is weighted by 2. We can interpret this as if each of the adjacent extremities is weighted by 1. Therefore, we can say that each element in π can account at most 2 to ia(ρ, π) + ta(ρ, π), and this value is reached if there is an adjacency on both sides of the element. Thus, the contribution to τ(ρ, π) of all occurrences of an element x in π is minimized if mult(ρ, x) = mult(π, x) and no extremity of x is part of any breakpoint. Every additional occurrence of x may increase ia(ρ, π) + ta(ρ, π) by 2, but also increases m(ρ, π) by 4 and therefore increases τ(ρ, π) by at least 2. This means that τ(ρ, π) is minimized if each element has the same multiplicity in ρ and π, and the breakpoint graph contains no breakpoints. This is the case if and only if ρ and π are identical. □
The algorithm
The algorithm uses a greedy strategy to sort π into ρ by inverse operations. For better readability, we will simply write "operation" instead of "inverse operation" in this section. First, we define Δlb(op) = lb(ρ, π) - lb(ρ, op(π)) and Δτ(op) = τ(ρ, π) - τ(ρ, op(π)). The score σ of an operation op is defined as the tuple σ(op) = (Δlb(op), Δτ(op)). The comparison operator between two scores is defined by σ(op1) >σ(op2) if Δlb(op1) > Δlb(op2) ∨ Δ(lb(op1) = Δlb(op2) ∧ Δτ (op1) > Δτ (op2)). In each step, we search for operations that decrease the lower bound, and apply the one with the greatest score. If no such operation exists, we use additional heuristics to find operations that do not change the lower bound but have a positive score (i.e. σ (op) >(0, 0)). There is still the possibility that we even do not find such an operation. In this case, we use a fallback algorithm that is guaranteed to terminate.
Operations that decrease the lower bound
Finding operations that increase C(ρ, π) can be done by finding 1-bridges and 2-bridges in the breakpoint graph and verifying additional preconditions, as shown in [23]. The only difference is that now a DCJ can cut only one black edge. This is the case when the other cutting point contains a telomere in ρ or π. Thus, we also have to consider DCJs that act on a 1-bridge and a telomere. Such a DCJ can be interpreted as inverse reversal, translocation, fission, or transposition. In the last case, we have to find a third cutting point in the same chromosome such that the resulting inverse transposition still increases C(ρ, π). Also finding operations that decrease L(ρ, π) is straightforward and can be done as in [23]. The remaining task is to find operations that decrease T(ρ, π). For this, we create a list of telomeres in π that are not telomeres in ρ, and another list of inner breakpoints in π where at least one of the adjacent elements is a telomere in ρ. Operations that decrease T(ρ, π) must act on one or two points of these lists, depending on the operation type. Creating the lists can be done by a linear scan over π, therefore all operations that decrease T(ρ, π) can be found in quadratic time. The only exceptions are inverse deletions and inverse chromosome deletions, which may add segments of arbitrary content. Practical tests have shown that it is better to only allow the insertion of segments without any breakpoints. This does not only lead to better sorting sequences, but also keeps the time complexity of finding the operations in O(n2).
Additional operations
If there is no operation that decreases lb(ρ, π), we may still find operations that do not change the lower bound but decrease τ(ρ, π). Searching for all these operations would exceed our computing capacity, so we just search for the following subset of these operations that can be found easily.
-
There are inverse tandem duplications and transposition duplications that do not change σ(ρ, π), but decrement τ(ρ, π). We therefore search for identical consecutive segments that are maximal, i.e. they cannot be extended in any direction, and check the effect on σ(ρ, π) and τ(ρ, π) if we remove one of them. This operation corresponds either to an inverse tandem duplication, or to an inverse transposition duplication.
-
Depending on the telomeres of a chromosome, the lower bound can remain unchanged during an inverse chromosome duplication, but τ(ρ, π) can decrease. We therefore search for identical chromosomes and check the score of removing one of them.
-
Inserting a segment of consecutive elements x with mult(ρ, x) >mult(π, x) decreases τ(ρ, π). We search for such segments of maximal length and try to insert them by an inverse deletion. Note that this is not always possible as this operation can increase the lower bound by merging two components.
-
Creating inner or telomere adjacencies never increases the lower bound, but decreases τ(ρ, π). We therefore search for DCJs and inverse fissions that create new adjacencies without splitting old ones.
The fallback algorithm
It is possible that there is neither an operation that decreases lb(ρ, π), nor an operation that decreases τ(ρ, π), so the main algorithm gets stuck. However, this case cannot occur if all elements have the same multiplicity in ρ and in π.
Lemma 3. If ρ ≠ π and mult(ρ, x) = mult(π, x) holds for all elements x, then there is an operation with positive score.
Proof. When the preconditions are fulfilled, there must be at least one breakpoint in π. We have to distinguish three cases. (1) This is a telomere breakpoint. W.l.o.g. a chromosome in π ends with x
h
, but x
h
is not a telomere in ρ. Then, mult(ρ, x) = mult(ρ, x + 1) (as they are in the same chromosome), and therefore there must be another breakpoint including (x + 1)
t
. An operation that creates an adjacency between x
h
and (x + 1)
t
will not decrease the lower bound, but decrease τ (ρ, π) by at least 2. (2) The breakpoint is an inner breakpoint between two extremities that are telomeres in ρ. In this case, the score of cutting the chromosome at this breakpoint is (1, 2), because both extremities become telomeres (i.e. T(ρ, π) increases by 2), and we create two telomere adjacencies. (3) The breakpoint is an inner breakpoint, and at least one of the adjacent extremities is not a telomere in ρ. W.l.o.g., the breakpoint is of the form (x
h
, y
h
), and x
h
is not a telomere in ρ. Then, mult(ρ, x) = mult(ρ, x + 1), thus there must be another breakpoint including (x + 1)
t
. An operation that creates an adjacency between x
h
and (x + 1)
t
will not increase the lower bound, but decrease τ(ρ, π) by at least 2. □
The fallback algorithm will first ensure that the precondition of the lemma holds. For each chromosome ρiin ρ, we determine the element x with the most occurrences in π. We then create maximal segments of consecutive elements
... such that each element z in the segment belongs to ρiand mult(π, z) <mult(π, x), and add this segment by an inverse deletion to π. Note that this can be done without creating new breakpoints. This step is repeated until all elements belonging to the same component in ρ have the same multiplicity in π. We then transform ρ into ρ' by applying chromosome duplications and chromosome deletions on ρ such that for each element x, mult(ρ', x) = mult(π, x). Now, we apply our normal algorithm to sort π into ρ'. To ensure that the precondition of Lemma 3 is always fulfilled, we forbid operations that create or delete elements, i.e. any kind of duplication or deletion. Due to Lemma 3, the algorithm will find a sorting sequence that transforms π into ρ'. As last step, we have to undo the operations that transformed ρ into ρ'.