Volume 12 Supplement 9

Proceedings of the Ninth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics

Open Access

Genomic distance under gene substitutions

  • Marília D V Braga1,
  • Raphael Machado1,
  • Leonardo C Ribeiro1 and
  • Jens Stoye2
BMC Bioinformatics 2011, 12(Suppl 9):S8

DOI: 10.1186/1471-2105-12-S9-S8

Published: 5 October 2011

Abstract

Background

The distance between two genomes is often computed by comparing only the common markers between them. Some approaches are also able to deal with non-common markers, allowing the insertion or the deletion of such markers. In these models, a deletion and a subsequent insertion that occur at the same position of the genome count as two sorting steps.

Results

Here we propose a new model that sorts non-common markers with substitutions, which are more powerful operations that encompass insertions and deletions. A deletion and an insertion that occur at the same position of the genome can be modeled as a substitution, counting as a single sorting step.

Conclusions

Comparing genomes with unequal content, but without duplicated markers, we give a linear time algorithm to compute the genomic distance considering substitutions and double-cut-and-join (DCJ) operations. This model provides a parsimonious genomic distance to handle genomes free of duplicated markers, which in practice is a lower bound on the real genomic distances. The method could also be used to refine orthology assignments, since in some cases a substitution could actually correspond to an unannotated orthology.

Background

The genomic distance is often computed taking into consideration only the common markers, which occur in both genomes [1–3]. Approaches to deal with unique markers (which occur in only one genome) also exist, but they usually allow only insertions or deletions of these markers. Insertions and deletions are called indels for short. In [4], the operations allowed are inversions and indels, while the models given in [5] and [6] consider indels and the double cut and join (DCJ) operation [7], which is able to represent most large scale mutation events in genomes, such as inversions, translocations, fusions and fissions. The mentioned approaches assign the same weight to all rearrangement operations, including indels, regardless of the size of the affected regions and the particular types of the operations. A drawback of these models is that, if a deletion and a subsequent insertion occur at the same position of the genome, the cost is the same as that of a deletion and an insertion at different positions.

In the present work we propose a more parsimonious model in which, instead of deleting or inserting, we allow the substitution of unique markers between two genomes, as illustrated in Figure 1. We do not suggest that a substitution occurs at a precise moment in evolution; instead, it represents a region that underwent continuous mutations (duplications, losses and gene mutations), so that a group of genes is transformed into a different group of genes (either of which may also be empty, allowing a substitution to represent an insertion or a deletion). Other studies also represent continuous mutations as a rearrangement event [8, 9]. By minimizing substitutions we are able to establish a relation between indels that could have occurred at the same position of the compared genomes, identifying genomic regions that could be subject to these continuous mutations. Observe that we suggest that such regions have a common evolutionary origin. We develop a method to count the minimum number of substitutions that could have occurred, by assigning the same weight to substitutions and to the other operations, similarly to the approaches that handle indels.
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_Fig1_HTML.jpg
Figure 1

(i) An optimal sorting scenario with DCJ operations and indels. (ii) An optimal sorting scenario with DCJ operations and indels in which the last two operations occur in the same position of the genome, between markers a and b. (iii) A more parsimonious alternative to the deletion of consecutive markers s and u and the insertion of consecutive markers x and y would be the substitution of s and u by x and y.

We analyze genomes with unequal content, but without duplicated markers, and extend the results given in [6] to develop a linear time algorithm that exactly computes the genomic distance with substitutions and DCJ operations. The objective of this model is to provide a parsimonious genomic distance to handle genomes free of duplicated markers, which in practice is a lower bound on the real genomic distances. In the present work, we do not study algorithms to generate parsimonious sorting scenarios. Nevertheless, in the analysis of the evolution of human chromosomes X and Y, we manually obtain a parsimonious evolutionary scenario under our model, which is consistent with the results given in [10].

In the remainder of this section we introduce some concepts given in [1] and [6] and define the operation that substitutes markers in a genome - these are the basis of the method that we will present here.

Preliminaries

In the present study duplicated markers are not allowed. Given two genomes A and B, possibly with unequal content, we denote by $\mathcal{G}$ the “reduced” genome [4], that is the set of markers that occur once in A and once in B. Moreover, the set $\mathcal{A}$ contains the markers that occur only in A and the set $\mathcal{B}$ contains the markers that occur only in B. The markers in sets $\mathcal{A}$ and $\mathcal{B}$ are also called unique markers. Observe that the sets $\mathcal{G}$, $\mathcal{A}$ and $\mathcal{B}$ are disjoint.

A genome is possibly composed of linear and circular chromosomes. Each marker g in a genome is a DNA fragment and is represented by the symbol $g$, if it is read in direct orientation, or by the symbol $\overline{g}$, if it is read in reverse orientation. An example of a pair of genomes is given in Figure 2.
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_Fig2_HTML.jpg
Figure 2

For genomes A, composed of three linear chromosomes, and B, composed of one single chromosome, we have https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq10_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq11_HTML.gif and https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq12_HTML.gif .

In the following we adopt definitions which we have given in [6] (some of them are generalizations of concepts introduced by Bergeron et al. [1]).

$\mathcal{G}$-adjacencies

Each one of the two ends of a linear chromosome is called a telomere and is represented by the symbol ◦. For each marker $g \in \mathcal{G}$, denote its two extremities by $g^t$ (tail) and $g^h$ (head). A $\mathcal{G}$-adjacency in genome A (respectively in genome B) is in general a linear string $v = \gamma_1\ell\gamma_2$, such that $\gamma_1$ and $\gamma_2$ are telomeres or extremities of markers of $\mathcal{G}$ and $\ell$, the string composed of the markers that are between $\gamma_1$ and $\gamma_2$ in A (respectively in B), contains no marker that also belongs to $\mathcal{G}$. The string $\ell$ is said to be the label of v, and the extremities $\gamma_1$ and $\gamma_2$ are said to be $\mathcal{G}$-adjacent. If $\ell$ is a non-empty string, v is said to be labeled, otherwise v is said to be clean.

A $\mathcal{G}$-adjacency $\gamma_1\ell\gamma_2$ can also be represented by $\gamma_2\overline{\ell}\gamma_1$. Furthermore, $\circ\ell\circ$ represents a linear chromosome composed only of markers that are not in $\mathcal{G}$. In the same way, a $\mathcal{G}$-adjacency given by a label $\ell$ alone corresponds to a whole circular chromosome composed only of markers that are not in $\mathcal{G}$. This is the only case of a $\mathcal{G}$-adjacency in which we have a circular instead of a linear string.

Two genomes A and B can then be represented by the sets $V(A)$ and $V(B)$, containing their $\mathcal{G}$-adjacencies. For the two genomes in Figure 2, we have https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq28_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq29_HTML.gif and https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq30_HTML.gif .
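To make the definition concrete, the following is a minimal sketch of how the adjacencies of one chromosome could be listed, given the set of common markers. The data representation (signed marker strings such as `-b`, the `o` stand-in for the telomere symbol, extremities written as `at`/`ah`) and all function names are assumptions made for this illustration and are not taken from the original work.

```python
TELOMERE = "o"  # stand-in for the telomere symbol (assumption of this sketch)


def name(marker):
    """Strip the orientation sign: '-b' -> 'b'."""
    return marker.lstrip("-")


def ends(marker):
    """Return the (left, right) extremities of a signed marker as it is read."""
    n = name(marker)
    return (n + "h", n + "t") if marker.startswith("-") else (n + "t", n + "h")


def g_adjacencies(chromosome, common, circular=False):
    """List the adjacencies (left, label, right) of one chromosome.

    `chromosome` is a list of signed markers. Each label is the tuple of
    unique markers found between two consecutive common markers (or telomeres).
    """
    pos = [i for i, m in enumerate(chromosome) if name(m) in common]
    if not pos:  # singleton: the whole chromosome is a single label
        end = None if circular else TELOMERE
        return [(end, tuple(chromosome), end)]
    out = []
    first, last = pos[0], pos[-1]
    if not circular:
        out.append((TELOMERE, tuple(chromosome[:first]), ends(chromosome[first])[0]))
    for i, j in zip(pos, pos[1:]):
        out.append((ends(chromosome[i])[1], tuple(chromosome[i + 1:j]),
                    ends(chromosome[j])[0]))
    if circular:  # close the circle from the last common marker to the first
        label = tuple(chromosome[last + 1:] + chromosome[:first])
        out.append((ends(chromosome[last])[1], label, ends(chromosome[first])[0]))
    else:
        out.append((ends(chromosome[last])[1], tuple(chromosome[last + 1:]), TELOMERE))
    return out
```

For instance, with common markers a, b, c, the linear chromosome `["a", "x", "-b", "c"]` yields four adjacencies, of which only the one between the head of a and the head of b is labeled (with the unique marker x).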

The DCJ operation

A cut performed on a genome A separates two adjacent markers of A. A cut affects a $\mathcal{G}$-adjacency v of $V(A)$ as follows: if v is linear, the cut is done between two symbols of v, creating two open ends in two separate linear strings; if v is circular, the cut creates two open ends in one linear string. A double-cut and join or DCJ applied on a genome A is the operation that generally performs two cuts in $V(A)$, creating four open ends, and joins these open ends in a different way. A DCJ operation can correspond to several rearrangement events, such as an inversion, a translocation, a fusion, or a fission [7].

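We represent by $(\{\gamma_1\ell_1|\ell_4\gamma_4,\ \gamma_3\ell_3|\ell_2\gamma_2\} \rightarrow \{\gamma_1\ell_1|\ell_2\gamma_2,\ \gamma_3\ell_3|\ell_4\gamma_4\})$ a DCJ applied on $\gamma_1\ell_1\ell_4\gamma_4$ and $\gamma_3\ell_3\ell_2\gamma_2$, that creates $\gamma_1\ell_1\ell_2\gamma_2$ and $\gamma_3\ell_3\ell_4\gamma_4$. Observe that one or more extremities among $\gamma_1$, $\gamma_2$, $\gamma_3$ and $\gamma_4$ can be equal to ◦ (a telomere), as well as one or more labels among $\ell_1$, $\ell_2$, $\ell_3$ and $\ell_4$ can be equal to $\varepsilon$ (the empty string). Particular cases include circular adjacencies and are described in [6].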
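The general case of this notation can be sketched in a few lines. The tuple layout below (an adjacency together with its cut position, labels as tuples of markers) is a hypothetical encoding chosen for the example; the particular cases involving circular adjacencies are not covered.

```python
def dcj(cut1, cut2):
    """Apply a DCJ to two adjacencies, each given together with its cut position.

    cut1 = (g1, l1, l4, g4) stands for the adjacency g1 l1 | l4 g4
    cut2 = (g3, l3, l2, g2) stands for the adjacency g3 l3 | l2 g2
    Labels are tuples of markers (the empty tuple is the empty string).
    Returns the two adjacencies g1 l1 l2 g2 and g3 l3 l4 g4.
    """
    g1, l1, l4, g4 = cut1
    g3, l3, l2, g2 = cut2
    return (g1, l1 + l2, g2), (g3, l3 + l4, g4)


# Cutting after the label "s" in the first adjacency and before the label "u"
# in the second one concatenates both labels into a single adjacency.
first, second = dcj(("ah", ("s",), (), "bt"), ("ch", (), ("u",), "dt"))
print(first)   # ('ah', ('s', 'u'), 'dt')
print(second)  # ('ch', (), 'bt')
```

The example also shows why the choice of cut position matters for labels: the same pair of joins could have split the two labels differently, whereas here they end up accumulated in one adjacency.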

Adjacency graph and the DCJ distance

The adjacency graph AG(A, B) [1] is the bipartite graph that has a vertex for each $\mathcal{G}$-adjacency in $V(A)$ and a vertex for each $\mathcal{G}$-adjacency in $V(B)$. Then, for each $g \in \mathcal{G}$, we have one edge connecting the vertex in $V(A)$ and the vertex in $V(B)$ that contain $g^h$ and one edge connecting the vertex in $V(A)$ and the vertex in $V(B)$ that contain $g^t$.

The connected components of the graph AG(A, B) are cycles and paths that alternate vertices in $V(A)$ and $V(B)$. A path that has one endpoint in $V(A)$ and the other in $V(B)$ is called an AB-path. In the same way, both endpoints of an AA-path are in $V(A)$, and both endpoints of a BB-path are in $V(B)$. Furthermore, AG(A, B) can have two extra types of components: each $\mathcal{G}$-adjacency that corresponds to a linear (respectively circular) chromosome is a linear (respectively circular) singleton. Linear singletons are particular cases of AA-paths and BB-paths. An example of an adjacency graph is given in Figure 3.
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_Fig3_HTML.jpg
Figure 3

For genomes A and B, the adjacency graph contains one cycle, two AA-paths (one is a linear singleton) and two AB-paths.

The number of AB-paths in AG(A, B) is always even and a DCJ operation can be of three types [1, 6]: optimal when it either increases the number of cycles by one, or the number of AB-paths by two; neutral when it does not affect the number of cycles and AB-paths; or counter-optimal when it either decreases the number of cycles by one, or the number of AB-paths by two.

Singletons, AB-paths composed of one single edge, and cycles composed of two edges are said to be DCJ-sorted. Longer paths and cycles are said to be DCJ-unsorted. The procedure of using DCJ operations to turn AG(A, B) into DCJ-sorted components is called DCJ-sorting of A into B. The DCJ distance of A and B, denoted by $d_{DCJ}(A, B)$, corresponds to the minimum number of steps required to do a DCJ-sorting of A into B and can be easily obtained:

Theorem 1 ([1]) Given two genomes A and B without duplicated markers, we have $d_{DCJ}(A, B) = |\mathcal{G}| - c - \frac{b}{2}$, where $\mathcal{G}$ is the set of common markers between A and B, and c and b are the number of cycles and of AB-paths in AG(A, B).
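The DCJ distance formula (number of common markers, minus the number of cycles, minus half the number of AB-paths) can be evaluated with a short sketch such as the one below. It assumes a hypothetical genome encoding, a list of `(markers, circular)` pairs with signed marker strings, and uses a union-find structure to collect the components of the adjacency graph; unique markers are simply skipped, since they do not affect the DCJ distance. This is an illustration, not the authors' implementation.

```python
def marker_names(genome):
    """Set of marker names of a genome given as [(markers, circular), ...]."""
    return {m.lstrip("-") for markers, _ in genome for m in markers}


def adjacencies(genome, common):
    """Adjacencies between common markers, as tuples of extremities.

    Unique markers are skipped; a telomeric adjacency is a 1-tuple."""
    adj = []
    for markers, circular in genome:
        ext = []
        for m in markers:
            n = m.lstrip("-")
            if n in common:
                ext += [n + "h", n + "t"] if m.startswith("-") else [n + "t", n + "h"]
        if not ext:
            continue  # singleton chromosome: contributes no cycle or AB-path
        if circular:
            ext = ext[1:] + ext[:1]
            adj += [(ext[i], ext[i + 1]) for i in range(0, len(ext), 2)]
        else:
            adj += [(ext[0],)]
            adj += [(ext[i], ext[i + 1]) for i in range(1, len(ext) - 1, 2)]
            adj += [(ext[-1],)]
    return adj


def dcj_distance(genome_a, genome_b):
    """Number of common markers - cycles - (AB-paths / 2)."""
    common = marker_names(genome_a) & marker_names(genome_b)
    va, vb = adjacencies(genome_a, common), adjacencies(genome_b, common)
    parent = list(range(len(va) + len(vb)))  # union-find over all vertices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    vertex_in_a = {e: i for i, v in enumerate(va) for e in v}
    for j, v in enumerate(vb):
        for e in v:  # one edge per extremity of a common marker
            parent[find(vertex_in_a[e])] = find(len(va) + j)
    # Count the telomeric vertices of each genome per connected component.
    telomeres = {}
    for i, v in enumerate(va + vb):
        count = telomeres.setdefault(find(i), [0, 0])
        if len(v) == 1:
            count[0 if i < len(va) else 1] += 1
    cycles = sum(1 for t in telomeres.values() if t == [0, 0])
    ab_paths = sum(1 for t in telomeres.values() if t == [1, 1])
    return len(common) - cycles - ab_paths // 2
```

For example, `dcj_distance([(["a", "b", "c"], False)], [(["a", "-b", "c"], False)])` evaluates to 1, a single inversion: the graph has one cycle and two AB-paths over three common markers.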

Runs of unique markers

Given a component C of AG(A, B), we can obtain a string $\ell(C)$ by the concatenation of the labels of the $\mathcal{G}$-adjacencies of C in the order in which they appear. Cycles, AA-paths and BB-paths can be read in any direction, but AB-paths should always be read from A to B. If C is a cycle and has labels in both genomes A and B, we should start to read in a labeled $\mathcal{G}$-adjacency v of A, such that the first labeled vertex before v is a $\mathcal{G}$-adjacency in B; otherwise C has labels in at most one genome and we can start anywhere. Each maximal substring of $\ell(C)$ composed only of markers in $\mathcal{A}$ (respectively in $\mathcal{B}$) is called an $\mathcal{A}$-run (respectively a $\mathcal{B}$-run). Each $\mathcal{A}$-run or $\mathcal{B}$-run can be simply called a run [6].
A component composed only of clean $\mathcal{G}$-adjacencies has no run and is said to be clean, otherwise the component is labeled. We denote by Λ(C) the number of runs in a component C. A path can have any number of runs, while a cycle has zero, one, or an even number of runs. Figure 4 shows a BB-path with 4 runs.
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_Fig4_HTML.jpg
Figure 4

A BB-path with 4 runs. Only the labels of the $\mathcal{G}$-adjacencies are represented.
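Counting the runs of a component amounts to counting the maximal blocks of consecutive labeled vertices that belong to the same genome, with clean vertices ignored. The sketch below assumes that a component is given as the list of its vertices in reading order, each as a hypothetical `(genome, label)` pair; for cycles, a first and a last block from the same genome are merged, which has the same effect as choosing the starting vertex as described above.

```python
def count_runs(vertices, cyclic=False):
    """Number of runs of a component.

    `vertices` lists (genome, label) pairs in reading order, where genome is
    "A" or "B" and an empty label marks a clean adjacency."""
    tags = [genome for genome, label in vertices if label]
    runs = sum(1 for i, t in enumerate(tags) if i == 0 or tags[i - 1] != t)
    if cyclic and runs > 1 and tags[0] == tags[-1]:
        runs -= 1  # first and last block are the same run on a cycle
    return runs


# A BB-path whose labeled vertices alternate between the two genomes:
bb_path = [("B", "x"), ("A", "s"), ("B", "y"), ("A", "u"), ("B", "")]
print(count_runs(bb_path))  # 4
```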

Substitutions

The unique markers in $\mathcal{A}$ and $\mathcal{B}$ are represented in AG(A, B) as labels and singletons and, in order to sort A into B, they also have to be considered. Here we propose a model in which only the following operation can be applied to unique markers. A substitution is an operation that affects the label of one single $\mathcal{G}$-adjacency, by substituting contiguous markers in this label.

Consider the labels $\ell_1$ and $\ell_2$, where $|\ell_1| = m$ and $|\ell_2| = n$. The substitution of $\ell_1$ by $\ell_2$ in a $\mathcal{G}$-adjacency is represented by $(\gamma_1\ell_3|\ell_1|\ell_4\gamma_2 \rightarrow \gamma_1\ell_3|\ell_2|\ell_4\gamma_2)$ (for better reading in our notation we omit the curly set brackets for singleton sets). One or both extremities among $\gamma_1$ and $\gamma_2$ can be equal to ◦ (a telomere), as well as one or both labels among $\ell_3$ and $\ell_4$ can be equal to $\varepsilon$ (the empty string). The substitution of $\ell_1$ by $\ell_2$ in a circular singleton is represented by $(|\ell_1|\ell_3| \rightarrow |\ell_2|\ell_3|)$. Observe that at most one chromosome can be entirely substituted at once (but we do not allow the substitution of a linear by a circular chromosome and vice-versa). Moreover, if m = 0, we have an insertion of n contiguous markers. On the other hand, if n = 0, we have a deletion of m contiguous markers. Thus, insertions and deletions, also called indels, are special cases of substitutions.
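As a small illustration of the operation on a single label, the sketch below replaces m contiguous markers by n new ones; the function name and the list-based label encoding are assumptions of this example. The insertion (m = 0) and deletion (n = 0) special cases fall out of the same code.

```python
def substitute(label, start, m, new_markers):
    """Replace the m contiguous markers label[start:start + m] by new_markers."""
    if start < 0 or m < 0 or start + m > len(label):
        raise ValueError("substituted markers must be contiguous and inside the label")
    return label[:start] + list(new_markers) + label[start + m:]


print(substitute(["s", "u"], 0, 2, ["x", "y"]))  # substitution: ['x', 'y']
print(substitute(["s", "u"], 1, 0, ["x"]))       # insertion:    ['s', 'x', 'u']
print(substitute(["s", "u"], 0, 1, []))          # deletion:     ['u']
```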

The DCJ-substitution distance of A and B, denoted by $d_{DCJ}^{sub}(A, B)$, is the minimum number of DCJs and substitutions required to transform A into B. Since substitutions include indels, $d_{DCJ}^{sub}(A, B)$ is upper bounded by the DCJ-indel distance, the minimum number of DCJ and indel operations required to transform A into B, that can be computed in linear time [6]. In the present work we give an approach to exactly compute $d_{DCJ}^{sub}(A, B)$ also in linear time.

Results and discussion

The main result of the present study is an exact formula to compute the DCJ-substitution distance in linear time. We achieve this formula by developing the substitution-potential of two genomes, a property that allows us to obtain a good upper bound on the genomic distance with DCJ operations and substitutions. Then we show how some special DCJ operations reduce the overall number of substitutions and obtain the exact formula. Although the objective of this model is to provide a parsimonious genomic distance, which in practice is a lower bound on real distances, we ran some experiments on data from human X and Y chromosomes and obtained a parsimonious sorting scenario that is consistent with the results available in the literature. We also observe that the DCJ-substitution method could be used to refine orthology assignments.

The substitution-potential

Observe that a $\mathcal{G}$-adjacency with a non-empty label $\ell$ can be cut in at least two different positions, either before or after $\ell$. Since the position of the cut does not change the effect of the DCJ on $d_{DCJ}(A, B)$, we can choose to cut at positions that allow the concatenation of the labels of the original $\mathcal{G}$-adjacencies. As a consequence, a set of labels of one genome can be accumulated with DCJ operations. In particular, when we apply optimal DCJs on only one component of the adjacency graph, we can accumulate an entire run in a single $\mathcal{G}$-adjacency:

Proposition 1 ([6]) A run can be entirely accumulated in the label of one single $\mathcal{G}$-adjacency with optimal DCJ operations.

Given a DCJ operation $\rho$, let $\Lambda_0$ and $\Lambda_1$ be, respectively, the number of runs in AG(A, B) before and after $\rho$. We define $\Delta\Lambda(\rho) = \Lambda_1 - \Lambda_0$.

Proposition 2 ([6]) Given any DCJ operation $\rho$, we have $\Delta\Lambda(\rho) \geq -2$.

In order to obtain the exact formula for the DCJ-substitution distance, we will first analyze the components of the adjacency graph separately. Given two genomes A and B and a component $C \in AG(A, B)$, we denote by $d_{DCJ}(C)$ the minimum number of DCJ operations required to do a separate DCJ-sorting in C, applying DCJs on vertices of C (or vertices that result from DCJs applied on vertices that were in C). It is possible to do a separate DCJ-sorting using only optimal DCJs in any component of AG(A, B), thus, in other words, $d_{DCJ}(A, B) = \sum_{C \in AG(A, B)} d_{DCJ}(C)$ [2]. In [6] we have already defined the indel-potential of a component, denoted by $\lambda(C)$, that is the minimum number of runs that we can obtain by DCJ-sorting C with optimal DCJ operations only, and can be computed with the formula given in the next proposition.

Proposition 3 ([6]) Given a component C in AG(A, B), we have $\lambda(C) = \left\lceil \frac{\Lambda(C)+1}{2} \right\rceil$, if Λ(C) ≥ 1. Otherwise λ(C) = 0.
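As a worked instance, taking the indel-potential formula of [6] to be the ceiling of (Λ(C) + 1)/2, a component with four runs, such as the BB-path of Figure 4, has indel-potential 3, while components with one or two runs have indel-potential 1 and 2, respectively:

```latex
\lambda(C)\big|_{\Lambda(C)=4} = \left\lceil \frac{4+1}{2} \right\rceil = 3,
\qquad
\lambda(C)\big|_{\Lambda(C)=1} = \left\lceil \frac{1+1}{2} \right\rceil = 1,
\qquad
\lambda(C)\big|_{\Lambda(C)=2} = \left\lceil \frac{2+1}{2} \right\rceil = 2.
```

In other words, optimal DCJ-sorting can merge the four runs of such a path down to three, but not further.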

Similarly, here we denote by $\sigma(C)$ the substitution-potential of a component C, that is the minimum number of substitutions that we can obtain by DCJ-sorting C with optimal DCJ operations only. In order to find a formula to compute $\sigma(C)$, we first obtain a stronger version of Proposition 1 where not only the labels of a run are accumulated into a single $\mathcal{G}$-adjacency, but pairs of consecutive runs are accumulated into adjacent $\mathcal{G}$-adjacencies (that are $\mathcal{G}$-adjacencies connected by a single edge in the adjacency graph).

Proposition 4 ([6]) If $\gamma_1\gamma_2$ is a clean $\mathcal{G}$-adjacency in a DCJ-unsorted component C of AG(A, B), such that neither $\gamma_1$ nor $\gamma_2$ is a telomere, then it is always possible to extract a clean cycle from C with an optimal DCJ operation.

Proposition 5 Two consecutive runs in a component C can be entirely accumulated into the labels of two adjacent $\mathcal{G}$-adjacencies of C with optimal DCJs.

Proof: By Proposition 1 we assume that two consecutive runs of C are accumulated into $\mathcal{G}$-adjacencies $v_A$ and $v_B$. If $v_A$ and $v_B$ are not adjacent, there are only clean $\mathcal{G}$-adjacencies between $v_A$ and $v_B$ in C. By Proposition 4, we can apply optimal DCJs to extract clean cycles until $v_A$ and $v_B$ are adjacent.

Pairs of consecutive runs that are accumulated into adjacent $\mathcal{G}$-adjacencies can be extracted into a labeled DCJ-sorted component, that can be sorted with one substitution. Observe that minimizing the number of pairs of consecutive runs is equivalent to minimizing the total number of runs. Hence, we can determine the substitution-potential from the indel-potential.

Proposition 6 Given a component C in AG(A, B), we have $\sigma(C) = \left\lceil \frac{\Lambda(C)+1}{4} \right\rceil$, if Λ(C) ≥ 1. Otherwise σ(C) = 0.

Proof: By Proposition 5 we can assume that the runs of C are accumulated into pairs of adjacent $\mathcal{G}$-adjacencies. By Proposition 3, we can obtain $\left\lceil \frac{\Lambda(C)+1}{2} \right\rceil$ runs doing a separate DCJ-sorting in C with optimal DCJs. Moreover, these optimal DCJs can be done in such a way that pairs of runs that were accumulated into adjacent $\mathcal{G}$-adjacencies remain in these adjacent $\mathcal{G}$-adjacencies. Since each one of these pairs can be sorted with one substitution, the substitution-potential of C is equal to the number of pairs of labeled adjacent $\mathcal{G}$-adjacencies, which is:
$$\sigma(C) = \left\lceil \frac{1}{2} \left\lceil \frac{\Lambda(C)+1}{2} \right\rceil \right\rceil = \left\lceil \frac{\Lambda(C)+1}{4} \right\rceil.$$
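The relation between the two potentials used in this proof (halving the indel-potential and rounding up gives the ceiling of (Λ(C) + 1)/4) can be checked numerically. The sketch below uses integer arithmetic only; the function names are chosen for this illustration.

```python
def indel_potential(runs):
    """lambda(C) = ceil((runs + 1) / 2) if runs >= 1, else 0."""
    return (runs + 2) // 2 if runs >= 1 else 0


def substitution_potential(runs):
    """sigma(C) = ceil((runs + 1) / 4) if runs >= 1, else 0."""
    return (runs + 4) // 4 if runs >= 1 else 0


# sigma(C) equals the indel-potential halved and rounded up, for any run count.
for runs in range(200):
    assert substitution_potential(runs) == (indel_potential(runs) + 1) // 2

print([substitution_potential(r) for r in range(9)])
# [0, 1, 1, 1, 2, 2, 2, 2, 3]
```

The printed values show that the substitution-potential only increases at run counts that are multiples of 4, which is what the later analysis of recombinations exploits.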

The formulas to compute $\lambda(C)$ and $\sigma(C)$, given in Propositions 3 and 6 above, are indeed very similar. Consequently, many of the results obtained in [6] can be adapted to the new substitution-potential. Let $\sigma_0$ and $\sigma_1$ be, respectively, the sums of the number $\sigma$ for the components of the adjacency graph before and after a DCJ operation $\rho$. We then define $\Delta\sigma(\rho) = \sigma_1 - \sigma_0$. Furthermore, let $dcj(\rho)$ be respectively 0, +1 and +2 depending on whether $\rho$ is optimal, neutral or counter-optimal. We also define $\Delta d(\rho) = dcj(\rho) + \Delta\sigma(\rho)$.

Proposition 7 Given a DCJ operation $\rho$ acting on a single component, we have $\Delta d(\rho) \geq 2$ if $\rho$ is counter-optimal, or $\Delta d(\rho) \geq 0$ if $\rho$ is neutral.

We denote by $d_{DCJ}^{sub}(C)$ the minimum number of DCJs and substitutions required to sort separately a component C of AG(A, B). The definition of $\sigma$ and Proposition 7 guarantee that $d_{DCJ}^{sub}(C) = d_{DCJ}(C) + \sigma(C)$.

Observe that, if C is a singleton in the adjacency graph, $d_{DCJ}^{sub}(C) = 1$, corresponding to the insertion or the deletion of the whole chromosome. We do not allow the substitution of a linear by a circular singleton and vice-versa. However, each pair composed of a singleton in genome A and a singleton in genome B (such that both are linear or both are circular) can be sorted with one single substitution, which saves one sorting step per pair. Let $P_L$ and $P_C$ be, respectively, the maximum number of disjoint pairs of linear and circular singletons in the adjacency graph. Together with the DCJ-substitution distance per component, these numbers give a good upper bound for $d_{DCJ}^{sub}(A, B)$:

Lemma 1 Given two genomes A and B without duplicated markers, we have:
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_Equb_HTML.gif
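The pairing of singletons described before Lemma 1 can be sketched as follows. Any linear singleton of A can be paired with any linear singleton of B (and likewise for circular ones), so the maximum number of disjoint pairs of each kind is the smaller of the two counts. The input encoding (one label per singleton) and the function name are assumptions made for this illustration.

```python
from collections import Counter


def singleton_pairs(singletons_a, singletons_b):
    """Return (P_L, P_C) for two genomes.

    Each argument is an iterable with one label, "linear" or "circular",
    per singleton of the corresponding genome (hypothetical encoding).
    """
    count_a = Counter(singletons_a)
    count_b = Counter(singletons_b)
    # A pair needs one singleton from each genome, both of the same kind.
    p_l = min(count_a["linear"], count_b["linear"])
    p_c = min(count_a["circular"], count_b["circular"])
    return p_l, p_c


# A has two linear and one circular singleton, B has one linear singleton.
print(singleton_pairs(["linear", "linear", "circular"], ["linear"]))  # -> (1, 0)
```

Each pair counted here replaces one deletion and one insertion of a whole chromosome by a single substitution.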

The formula given by Lemma 1 above corresponds to the exact distance for a particular set of genomes. Given a https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq93_HTML.gif -adjacency γℓ of a genome A such that γ≠, then γ is said to be a tail of a linear chromosome in A. Two genomes are co-tailed if their sets of tails are equal (this includes two genomes composed only of circular chromosomes).
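A co-tailedness test can be sketched in Python under explicit assumptions: a genome is a list of (markers, is_circular) pairs, markers are signed integers, a positive marker is read from its tail extremity "t" to its head extremity "h", and the tails of a linear chromosome are taken to be the outermost extremities of its first and last common markers (unique markers at the chromosome ends are skipped). This encoding and all function names are hypothetical.

```python
def marker_set(genome):
    """All marker identifiers (unsigned) occurring in a genome."""
    return {abs(m) for markers, _ in genome for m in markers}


def tails(genome, common):
    """Extremities of common markers that end a linear chromosome."""
    result = set()
    for markers, is_circular in genome:
        if is_circular:
            continue  # circular chromosomes have no tails
        kept = [m for m in markers if abs(m) in common]
        if not kept:
            continue  # linear singleton: no common marker at all
        first, last = kept[0], kept[-1]
        result.add((abs(first), "t" if first > 0 else "h"))
        result.add((abs(last), "h" if last > 0 else "t"))
    return result


def co_tailed(genome_a, genome_b):
    """True if both genomes have the same set of tails."""
    common = marker_set(genome_a) & marker_set(genome_b)
    return tails(genome_a, common) == tails(genome_b, common)


a = [([9, 1, 2, 3], False)]   # marker 9 is unique to this genome
b = [([1, -2, 3], False)]
c = [([3, 1, 2], False)]
print(co_tailed(a, b))  # -> True
print(co_tailed(a, c))  # -> False
```

Two genomes made only of circular chromosomes both yield the empty set of tails, so the sketch reports them as co-tailed, in line with the remark above.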

Theorem 2 Given two co-tailed genomes A and B without duplicated markers, we have:
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_Equc_HTML.gif

However, for genomes that are not co-tailed, DCJs applied to two components of the adjacency graph can lead to a shorter sequence of operations sorting one genome into the other, as we will see in the next section.

The DCJ-substitution distance

Recall that ∆σ(ρ) = σ1 – σ0, where σ0 and σ1 are the sums of the values of σ over the components of the adjacency graph before and after ρ. A DCJ operation ρ that acts on two components of the adjacency graph is called a recombination.

Proposition 8 Given any recombination ρ, we have ∆σ(ρ) ≥ –2.

Proof: Only the recombinations that decrease or do not change the number of runs (∆Λ ≤ 0) have to be analyzed (we cannot have ∆σ ≤ –1 if the number of runs increases). Consider the recombination of two paths with i and j runs that results in two new paths with i′ and j′ runs. The best case occurs when i and j are multiples of 4, i′ and j′ are multiples of 4 minus 1, and ∆Λ = –2, which gives: https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq94_HTML.gif . The analysis of recombinations involving cycles is analogous.
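The arithmetic of this proof can be checked numerically. The sketch below ASSUMES that the substitution-potential of a component with Λ ≥ 1 runs is σ = ⌈(Λ + 1)/4⌉ and that σ = 0 when there are no runs. This closed form is our reading of Proposition 6, which is not restated in this section, and it should be checked against that proposition before relying on the code. The sketch also assumes, as in the proof, that a recombination lowers the total number of runs by at most 2.

```python
def sigma(runs: int) -> int:
    """ASSUMED substitution-potential: ceil((runs + 1) / 4), or 0 without runs."""
    return 0 if runs == 0 else (runs + 4) // 4  # integer form of the ceiling


# Best case named in the proof: i = j = 4 and i' = j' = 3 (so ∆Λ = –2).
best_case = (sigma(3) + sigma(3)) - (sigma(4) + sigma(4))
print(best_case)  # -> -2

# Exhaustive check over small run counts with ∆Λ in {–2, –1, 0}.
worst = 0
for i in range(21):
    for j in range(21):
        for delta_runs in (-2, -1, 0):
            total = i + j + delta_runs
            if total < 0:
                continue
            for i_new in range(total + 1):
                j_new = total - i_new
                change = sigma(i_new) + sigma(j_new) - sigma(i) - sigma(j)
                worst = min(worst, change)
print(worst)  # -> -2
```

The search ignores the constraints that path types impose on run counts, so it explores more cases than can actually occur; the bound ∆σ ≥ –2 holds even in this larger space.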

All recombinations involving at least one cycle are counter-optimal and any counter-optimal recombination has ∆d ≥ 0, thus only path recombinations can have ∆d ≤ –1. The following definitions are similar to those given in [6], except that here we have a larger number of labeled path types.

Consider an integer i ≥ 0. For a second integer k ∈ {1, 3}, let https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq95_HTML.gif (respectively https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq96_HTML.gif ) be a sequence with an odd number 4i + k of runs, starting and ending with an https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq97_HTML.gif -run (respectively https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq98_HTML.gif -run). Similarly, for k ∈ {2, 4}, let https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq99_HTML.gif (respectively https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq100_HTML.gif ) be a sequence with an even number 4i + k of runs, starting with an https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq101_HTML.gif -run (respectively https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq102_HTML.gif -run) and ending with a https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq103_HTML.gif -run (respectively https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq104_HTML.gif -run). An empty sequence (with no run) is represented by ε.
Then each one of the notations AA ε , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq105_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq106_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq107_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq108_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq109_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq110_HTML.gif , BB ε , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq111_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq112_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq113_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq114_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq115_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq116_HTML.gif , AB ε , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq117_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq118_HTML.gif , 
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq119_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq120_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq121_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq122_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq123_HTML.gif and https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq124_HTML.gif represents a particular type of path (AA, BB or AB) with a particular structure of runs (ε, https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq125_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq126_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq127_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq128_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq129_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq130_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq131_HTML.gif , or https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq132_HTML.gif ).

The components on which the cuts are applied are called sources and the components obtained after the joinings are called resultants of the recombination. The complete set of recombinations with ∆d ≤ –1 is given in Table 1. In Table 2 we also list recombinations with ∆d = 0 that create at least one source of recombinations of Table 1. We denote by • an AB-path that cannot be a source in Tables 1 and 2, such as AB ε , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq133_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq134_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq135_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq136_HTML.gif , https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq137_HTML.gif and https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq138_HTML.gif .
Table 1

Path recombinations that have ∆d ≤ –1 and allow the best reuse of the resultants.

sources | resultants | ∆σ | dcj | ∆d
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq139_HTML.gif | • + • | –2 | 0 | –2
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq140_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq141_HTML.gif | –2 | +1 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq142_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq143_HTML.gif | –2 | +1 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq144_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq145_HTML.gif | –2 | +1 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq146_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq147_HTML.gif | –2 | +1 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq148_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq149_HTML.gif | –2 | +1 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq150_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq151_HTML.gif | –2 | +1 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq152_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq153_HTML.gif | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq154_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq155_HTML.gif | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq156_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq157_HTML.gif | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq158_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq159_HTML.gif | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq160_HTML.gif | • + • | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq161_HTML.gif | • + • | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq162_HTML.gif | • + • | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq163_HTML.gif | • + • | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq164_HTML.gif | • + • | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq165_HTML.gif | • + • | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq166_HTML.gif | • + • | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq167_HTML.gif | • + • | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq168_HTML.gif | • + • | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq169_HTML.gif | • + • | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq170_HTML.gif | • + • | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq171_HTML.gif | • + • | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq172_HTML.gif | • + • | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq173_HTML.gif | • + • | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq174_HTML.gif | • + • | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq175_HTML.gif | • + • | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq176_HTML.gif | • + • | –1 | 0 | –1
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq177_HTML.gif | • + • | –2 | +1 | –1

Table 2

Recombinations that have ∆d = 0 and create resultants that can be used in recombinations with ∆d ≤ –1 (listed in Table 1).

sources | resultants | ∆σ | dcj | ∆d
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq178_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq179_HTML.gif | –2 | +2 | 0
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq180_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq181_HTML.gif | –1 | +1 | 0
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq182_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq183_HTML.gif | –1 | +1 | 0
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq184_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq185_HTML.gif | –1 | +1 | 0
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq186_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq187_HTML.gif | –1 | +1 | 0
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq188_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq189_HTML.gif | –2 | +2 | 0
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq190_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq191_HTML.gif | –1 | +1 | 0
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq192_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq193_HTML.gif | –1 | +1 | 0
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq194_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq195_HTML.gif | –1 | +1 | 0
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq196_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq197_HTML.gif | –1 | +1 | 0

Proposition 9 The recombinations with ∆d = 0 involving cycles or circular singletons cannot create new components that can be used as sources of recombinations listed in Tables 1 and 2.

The two sources of a recombination can also be called partners. Looking at Table 1 we observe that some types of paths have more partners than other types of paths. For example, all partners of https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq198_HTML.gif and https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq199_HTML.gif paths are also partners of https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq200_HTML.gif and https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq201_HTML.gif paths. Furthermore, some resultants of recombinations in Tables 1 and 2 can be used in other recombinations. These observations allow the identification of groups of recombinations, as listed in Table 3.
Table 3

All recombination groups obtained from Tables 1 and 2 (the recombinations from Table 2 appear only in groups in Y and Z). The column scr indicates the contribution of each path in the distance decrease.

group | sources | resultants | ∆d | scr
U | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq202_HTML.gif | 2• | –2 | –1
V | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq203_HTML.gif | 4• | –3 | –3/4
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq204_HTML.gif | 4• | –3 | –3/4
W | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq205_HTML.gif | 3• | –2 | –2/3
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq206_HTML.gif | 3• | –2 | –2/3
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq207_HTML.gif | 3• | –2 | –2/3
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq208_HTML.gif | 3• | –2 | –2/3
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq209_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq210_HTML.gif | –2 | –2/3
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq211_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq212_HTML.gif | –2 | –2/3
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq213_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq214_HTML.gif | –2 | –2/3
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq215_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq216_HTML.gif | –2 | –2/3
X | Recombinations from Table 1 with ∆d = –1 |  | –1 | –1/2
Y | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq217_HTML.gif | 4• | –2 | –1/2
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq218_HTML.gif | 4• | –2 | –1/2
Z | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq219_HTML.gif | 3• | –1 | –1/3
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq220_HTML.gif | 3• | –1 | –1/3
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq221_HTML.gif | 3• | –1 | –1/3
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq222_HTML.gif | 3• | –1 | –1/3
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq223_HTML.gif | 3• | –1 | –1/3
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq224_HTML.gif | 3• | –1 | –1/3
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq225_HTML.gif | 3• | –1 | –1/3
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq226_HTML.gif | 3• | –1 | –1/3
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq227_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq228_HTML.gif | –1 | –1/3
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq229_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq230_HTML.gif | –1 | –1/3
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq231_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq232_HTML.gif | –1 | –1/3
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq233_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq234_HTML.gif | –1 | –1/3
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq235_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq236_HTML.gif | –1 | –1/3
  | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq237_HTML.gif | https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq238_HTML.gif | –1 | –1/3

The deductions shown in Table 3 can be computed with an approach that greedily maximizes the number of recombinations in U, V, W, X, Y and Z, in this order. The U part contains only one operation, and the two groups in V are mutually exclusive after applying U. The part W is then the application of all possible remaining groups of two operations with ∆d = –2. Similarly, the part X is only the application of all possible remaining operations with ∆d = –1. After X, the two groups in Y are mutually exclusive, and then the same happens to the groups in Z. Although some groups in W, X and Z have some reusable resultants, those are actually never reused (if operations that are lower in the table used as sources the resultants of higher operations, the sources of all such operations would already have been consumed by operations that occupy even higher positions in the table). Due to this fact, the number of operations in U, V, W, X, Y and Z depends only on the initial number of each type of component.

With the results presented in this section we have an exact formula to compute the DCJ-substitution distance:

Theorem 3 Given two genomes A and B without duplicated markers, we have:
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_Equd_HTML.gif

where P_L and P_C are the numbers of disjoint pairs of linear and circular singletons, and U, V, W, X, Y and Z are computed as described above.
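To make the role of the recombination groups concrete, the following Python sketch totals the number of sorting steps saved by a given selection of groups. It ASSUMES that U, V, W, X, Y and Z are counts of applied groups and that each group saves the amount listed in the ∆d column of Table 3. The function name and the example counts are hypothetical, and the sketch does not reproduce the formula of Theorem 3 itself.

```python
# ∆d per group, read from the ∆d column of Table 3.
GROUP_DELTA_D = {"U": -2, "V": -3, "W": -2, "X": -1, "Y": -2, "Z": -1}


def recombination_deduction(group_counts):
    """Total number of sorting steps saved by the selected groups.

    group_counts maps a group name ("U" ... "Z") to how many groups of that
    kind were applied (an assumed interpretation of U, V, W, X, Y and Z).
    """
    return -sum(GROUP_DELTA_D[name] * n for name, n in group_counts.items())


# Made-up counts, for illustration only.
print(recombination_deduction({"U": 1, "V": 0, "W": 2, "X": 3, "Y": 0, "Z": 1}))  # -> 10
```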

The formula given in Theorem 3 is analogous to the one which we have obtained in [6] to compute the DCJ-indel distance. Both formulas depend on factors that can be computed in linear time [6].

Triangular inequality

Note that, since only unique markers can be substituted in this model, we avoid the “free lunch problem”, mentioned in [5], that is the possibility of transforming any genome A into any genome B by simply substituting the whole content of A by the whole content of B. However, the triangular inequality can be disrupted in the DCJ-substitution distance. In other words, given any three genomes A, B and C without duplicated markers, there is no guarantee that the triangular inequality https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_IEq239_HTML.gif holds. In a companion paper [11] we provide an efficient way to establish the triangular inequality a posteriori in both the DCJ-indel [6] and the DCJ-substitution distances.
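Whether a given set of pairwise distances respects the triangular inequality can be tested directly. The sketch below checks the standard form d(X, Z) ≤ d(X, Y) + d(Y, Z) over all ordered triples; the function name is hypothetical and the distances in the example are made-up numbers, not results from this paper.

```python
from itertools import permutations


def triangle_violations(names, d):
    """Return all ordered triples (x, y, z) with d(x, z) > d(x, y) + d(y, z)."""
    return [
        (x, y, z)
        for x, y, z in permutations(names, 3)
        if d(x, z) > d(x, y) + d(y, z)
    ]


# Toy symmetric distances between three genomes (illustrative values only).
toy = {frozenset("AB"): 1, frozenset("BC"): 1, frozenset("AC"): 3}


def toy_distance(x, y):
    return toy[frozenset((x, y))]


print(triangle_violations(["A", "B", "C"], toy_distance))
# -> [('A', 'B', 'C'), ('C', 'B', 'A')]
```

With a real distance function in place of `toy_distance`, an empty result means that the inequality holds for every triple in the input.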

Experiments

The objective of this model is to provide a parsimonious genomic distance, that in practice is a lower bound to real distances. Nevertheless, we could run some experiments on data from human X and Y chromosomes and obtained a parsimonious sorting scenario that is coherent with the results available in the literature. During evolution, a portion of the human Y chromosome has become increasingly subjected to local mutations, while the X chromosome remained relatively conserved, as we will see in the following. Human X and Y chromosomes are very different and, while X is 155 Mbp long, the Y chromosome is 58 Mbp long. However, they still share pseudo-autosomal regions at both extremities and are believed to have evolved from an identical autosomal pair [12] (the autosomes are all non-sex chromosomes). Current theories suggest that the pseudo-autosomal region, which originally covered the whole chromosomes, was successively pruned by a few big inversions on the Y chromosome [13] (we call these inversions pruning). After each pruning inversion, several mutations seem to have occurred on the affected part of the Y chromosome, while X remained “closer” to the common ancestor.

A parsimonious scenario of 8 inversions on the markers common to chromosomes X and Y has been published in [10, Fig. 7], and is given as an argument to support the existence and bounds of the three most recent pruning inversions, but unique markers were simply ignored. We used our method to compute the DCJ-substitution distance using the same dataset, but reincorporating the unique markers, and obtained a DCJ-substitution distance of 14. Then we manually reconstructed the evolutionary scenario of human chromosomes X and Y and obtained a parsimonious scenario with 8 inversions and 6 substitutions (including 2 insertions and 1 deletion) that is coherent with the pruning inversions given in [10] (see Figure 5). Although a DCJ is a very comprehensive operation and can represent many rearrangement events, in the analysis of unichromosomal genomes DCJs often represent only inversions, and this also happens in this dataset.
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-12-S9-S8/MediaObjects/12859_2011_Article_4815_Fig5_HTML.jpg
Figure 5

A parsimonious scenario of 8 inversions and 6 substitutions (including 2 insertions and 1 deletion) sorting human X into Y chromosome, using the dataset given in [10]. The symbol ‘P’ represents the current pseudo-autosomal region in the beginning of X and Y. Each number represents a common marker, each symbol x_i represents a unique marker in X and each symbol y_i represents a unique marker in Y (the unique markers were also obtained from the data in [10]). The three pruning inversions suggested in [10, Fig. 7] are underlined. The boundary of the pseudo-autosomal region, indicated with vertical dots, is shifted to the left after each pruning inversion.
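The step count of this scenario, and the effect of counting a substitution as one step, can be retraced with a few lines of Python. The variable names are illustrative. The second total re-costs the same scenario with every proper substitution split into a deletion plus an insertion; it is not the DCJ-indel distance, which a different scenario might achieve with fewer steps.

```python
inversions = 8
substitutions = 6            # substitution events in the scenario of Figure 5
insertions, deletions = 2, 1  # substitutions in which one side is empty

# Substitutions that replace one non-empty region by another.
proper_substitutions = substitutions - insertions - deletions

dcj_substitution_steps = inversions + substitutions
same_scenario_with_indels_only = (
    inversions + insertions + deletions + 2 * proper_substitutions
)

print(proper_substitutions)            # -> 3
print(dcj_substitution_steps)          # -> 14
print(same_scenario_with_indels_only)  # -> 17
```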

Discussion

Our method was designed to find gene mutations, but it could also help to improve orthology assignments, that is, the computational prediction of orthologous pairs of genes from different species. No orthology predictor is able to find all assignments correctly. In particular, when comparing two different species, some pairs of orthologous genes that are below the predictor threshold remain unassigned. Since our substitutions establish a relation between different genes in the two compared genomes, they correspond to candidates to be assigned as orthologous genes.

Conclusions and future work

In this work we presented a new model to compare two genomes with unequal content, but without duplicated markers, using substitutions and DCJ operations, and developed a linear time algorithm to exactly compute the DCJ-substitution distance.

Although the objective of this model is to provide a parsimonious genomic distance, that in practice is a lower bound to real distances, based on our method we have manually reconstructed a parsimonious evolutionary scenario of human chromosomes X and Y. We considered biological constraints that are specific to this case and obtained a scenario that is coherent with the results given in the literature.

By reconstructing a parsimonious scenario that minimizes substitutions, we may identify genomic regions that were subject to continuous mutations during evolution and could have a common evolutionary origin. Currently our method is only able to compute the genomic distance, but in a future work we intend to study the space of all parsimonious sorting scenarios and develop methods to systematically identify such regions.

The DCJ-substitution model could also be used to refine orthology assignments, since in some cases a substitution could actually correspond to an unannotated orthology. We also plan on exploring the use of our method in refining orthology in a future work.

Declarations

Acknowledgements

This research was partially supported by the Brazilian research agencies CNPq (grant PROMETRO 563087/2010-2) and FAPERJ (grant INST E-26/111.837/2010).

This article has been published as part of BMC Bioinformatics Volume 12 Supplement 9, 2011: Proceedings of the Ninth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S9.

Authors’ Affiliations

(1) Instituto Nacional de Metrologia, Qualidade e Tecnologia, Duque de Caxias
(2) AG Genominformatik, Technische Fakultät, Universität Bielefeld

References

  1. Bergeron A, Mixtacki J, Stoye J: A unifying view of genome rearrangements. Proc. of WABI 2006, LNBI 2006, 4175: 163–173.
  2. Braga MDV, Stoye J: The solution space of sorting by DCJ. Journal of Computational Biology 2010, 17(9):1145–1165. 10.1089/cmb.2010.0109
  3. Hannenhalli S, Pevzner P: Transforming men into mice (polynomial algorithm for genomic distance problem). Proc. of FOCS 1995, 581–592.
  4. El-Mabrouk N: Sorting Signed Permutations by Reversals and Insertions/Deletions of Contiguous Segments. Journal of Discrete Algorithms 2001, 1: 105–122.
  5. Yancopoulos S, Friedberg R: DCJ path formulation for genome transformations which include insertions, deletions, and duplications. Journal of Computational Biology 2009, 16(10):1311–1338. 10.1089/cmb.2009.0092
  6. Braga MDV, Willing E, Stoye J: Double Cut and Join with Insertions and Deletions. Journal of Computational Biology 2011, 18: 1167–1184. 10.1089/cmb.2011.0118
  7. Yancopoulos S, Attie O, Friedberg R: Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics 2005, 21: 3340–3346. 10.1093/bioinformatics/bti535
  8. Boore JL: The duplication/random loss model for gene rearrangement exemplified by mitochondrial genomes of deuterostome animals. In Comparative Genomics. Edited by: Sankoff D, Nadeau JH. 2000, 133–148.
  9. Moritz C, Dowling TE, Brown WM: Evolution of animal mitochondrial DNA: relevance for population biology and systematics. Annu. Rev. Ecol. Syst 1987, 18: 269–292. 10.1146/annurev.es.18.110187.001413
  10. Ross MT, et al.: The DNA sequence of the human X chromosome. Nature 2005, 434: 325–337. 10.1038/nature03440
  11. Braga MDV, Machado R, Ribeiro LC, Stoye J: On the weight of indels in genomic distances. BMC Bioinformatics 2011, 12(Suppl 9):S13. 10.1186/1471-2105-12-S9-S13
  12. Ohno S: Sex chromosomes and sex-linked genes. Springer-Verlag, Berlin; 1967.
  13. Lahn BT, Page DC: Four evolutionary strata on the human X chromosome. Science 1999, 286: 964–967. 10.1126/science.286.5441.964

Copyright

© Braga et al; licensee BioMed Central Ltd. 2011

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.