Multichromosomal median and halving problems under different genomic distances

Tannier, Eric; Zheng, Chunfang; Sankoff, David

doi:10.1186/1471-2105-10-120

Methodology article
Open access
Published: 22 April 2009

BMC Bioinformatics volume 10, Article number: 120 (2009)

Abstract

Background

Genome median and genome halving are combinatorial optimization problems that aim at reconstructing ancestral genomes as well as the evolutionary events leading from the ancestor to extant species. Exploring complexity issues is a first step towards devising efficient algorithms. The complexity of the median problem for unichromosomal genomes (permutations) has been settled for both the breakpoint distance and the reversal distance. Although the multichromosomal case has often been assumed to be a simple generalization of the unichromosomal case, it is also a relaxation so that complexity in this context does not follow from existing results, and is open for all distances.

Results

We settle here the complexity of several genome median and halving problems, including a surprising polynomial result for the breakpoint median and guided halving problems in genomes with circular and linear chromosomes, showing that the multichromosomal problem is actually easier than the unichromosomal problem. Still other variants of these problems are NP-complete, including the DCJ double distance problem, previously mentioned as an open question. We list the remaining open problems.

Conclusion

This theoretical study clears up a wide swathe of the algorithmic study of genome rearrangements with multiple multichromosomal genomes.

Background

The gene order or syntenic arrangement of ancestral genomes may be reconstructed based on comparative evidence from present-day genomes – the phylogenetic approach – or on internal evidence in the case of genomes descended from an ancestral polyploidisation event, or from a combination of the two. The computational problem at the heart of phylogenetic analysis is the median problem, while internal reconstruction inspires the halving problem, and the combined approach gives rise to guided halving. How these problems are formulated depends (1) on the karyotypic framework: the number of chromosomes in a genome and whether they are constrained to be linear, or if circular chromosomes are also permitted, and (2) on the objective function used to evaluate possible solutions. This function is based on some notion of genomic distance, either the number of adjacent elements on a chromosome in one genome that are disrupted in another – the breakpoint distance – or the number of evolutionary operations necessary to transform one genome to another.

While the karyotypes allowed in an ancestor vary only according to the dimensions of single versus multiple chromosome, and linear versus circular versus mixed, the genomic distances of interest have proliferated according to the kinds of evolutionary operations considered, from the classic, relatively constrained, reversals/translocations distance to the more inclusive Double Cut-and-Join (DCJ) measure, and many others [1].

The computational complexity of some of these problems has been settled for some specific distances and karyotypic contexts, and it is sometimes taken for granted that these results carry over to other combinations of context and distance. This is not necessarily the case. In this paper, we survey the known results and unsolved cases for three distance measures in three kinds of karyotype. We include several results presented here for the first time, as well as discussions on the definitions of the distances. The results contain both new polynomial-time algorithms and NP-hardness proofs. This paper is the full version of an extended abstract that has appeared in [2], which announced the results without giving all the proofs. In particular, a full discussion on the breakpoint distance definition, as well as the proofs of Theorem 2, Theorem 4, and Theorem 6 are added here, which makes this version a complete and definitive one.

Genomes, breakpoints and rearrangements

Multichromosomal genomes

We follow the general formulation of a genome in [3]. A gene A is an oriented sequence of DNA, identified by its tail A^t and its head A^h. Tails and heads are the extremities of the genes. An adjacency is an unordered pair of gene extremities. A genome Π is a set of adjacencies on a set of genes. Each adjacency in a genome means that two gene extremities are consecutive on the DNA molecule. In a genome, each gene extremity is adjacent to zero or one other extremity. An extremity x that is not adjacent to any other extremity is called a telomere, and can be written as an adjacency x∘ with a null symbol ∘. The adjacency x∘ is called a telomeric adjacency. For a genome Π on a set of genes 𝒢, consider the graph G_Π whose vertices are all the extremities of the genes in 𝒢, and whose edges include all the non-telomeric adjacencies in Π as well as an edge joining the head and the tail of each gene. This graph is a set of disjoint paths and cycles. Every connected component is called a chromosome of Π. A chromosome is linear if it is a path, and circular if it is a cycle. A genome with only linear, or only circular, chromosomes is called a linear or circular genome, respectively. An example of a graph G_Π is given in Figure 1.

A genome can also be represented as a set of strings, by writing the genes for each chromosome in the order in which they appear in the paths and cycles of the graph G_Π, with a bar over the gene if the head of the gene appears before the tail (we say it has negative sign), and none if the tail appears before the head (it has positive sign). For each linear chromosome, there are two possible equivalent strings, according to the arbitrarily chosen starting point. One is obtained from the other by reversing the order and switching the signs of all the genes. For circular chromosomes, there are also two possible circular string representations, according to the direction in which the cycle is traversed. For example, chromosome C₁ of the genome Π of Figure 1 may be written (12 14 1 8) or ( 7 4 ).
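The passage from a set of adjacencies to chromosomes and signed strings can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the definitions above: genes are labelled by positive integers, an extremity is encoded as a pair (gene, "t") or (gene, "h"), telomeres are simply the extremities left unpaired, and a negative integer plays the role of a barred gene. The function name `chromosomes` is hypothetical.

```python
def chromosomes(genes, adjacencies):
    """Split a genome into chromosomes by walking the graph G_Pi.

    genes: iterable of positive integer gene labels (assumption of this sketch).
    adjacencies: iterable of pairs of extremities, each extremity being
                 (gene, "t") or (gene, "h"); telomeres are left unpaired.
    Returns a list of (kind, signed_gene_order), kind being "linear" or "circular".
    """
    genes = list(genes)
    partner = {}
    for x, y in adjacencies:
        partner[x], partner[y] = y, x

    def other_end(ext):
        gene, side = ext
        return (gene, "h" if side == "t" else "t")

    seen = set()

    def walk(start):
        # Enter a gene at `start`, cross it, follow the adjacency, repeat.
        order, ext = [], start
        while True:
            gene, side = ext
            # Entering by the tail gives a positive sign, by the head a negative one.
            order.append(gene if side == "t" else -gene)
            seen.add(gene)
            nxt = partner.get(other_end(ext))
            if nxt is None or nxt == start:
                return order
            ext = nxt

    result = []
    # Paths: start from every telomere whose chromosome was not visited yet.
    for g in genes:
        for side in ("t", "h"):
            if (g, side) not in partner and g not in seen:
                result.append(("linear", walk((g, side))))
    # Whatever remains lies on cycles.
    for g in genes:
        if g not in seen:
            result.append(("circular", walk((g, "t"))))
    return result
```

Each linear chromosome is reported in one of its two equivalent readings, namely the one starting from the first telomere met in the order of `genes`; each circular chromosome is read starting from the tail of its first unvisited gene.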

A genome with only one chromosome is called unichromosomal. These correspond to signed permutations: the two string representations are (linear or circular) signed permutations.

Genomes with duplicates

A duplicated gene A is a pair of homologous oriented sequences of DNA, identified by two tails A1^t and A2^t, and two heads A1^h and A2^h. An all-duplicates genome Δ is a set of adjacencies on a set of duplicated genes.

For a genome Π on a gene set 𝒢, a doubled genome Π ⊕ Π is an all-duplicates genome on the set of duplicated genes from 𝒢 such that if A^x B^y (x, y ∈ {t, h}) is a (possibly telomeric) adjacency of Π (A^x or B^y may be ∘), then either A1^x B1^y and A2^x B2^y, or A2^x B1^y and A1^x B2^y, are adjacencies of Π ⊕ Π.

Note the difference between a general all-duplicates genome and the special case of a doubled genome: the former has two copies of each gene, while in the latter these copies are organised in such a way that there are two identical copies of each chromosome when we ignore the 1's and 2's in the A1^x's and A2^x's: it has two linear copies of each linear chromosome, and for each circular chromosome, either two circular copies or one circular chromosome containing the two successive copies. Note also that for a genome Π, there is an exponential number of possible doubled genomes Π ⊕ Π (exactly two to the power of the number of non-telomeric adjacencies in Π). These definitions correspond to duplicated and perfectly duplicated genomes found in [4], and differ slightly from the perfectly duplicated genome definition found in [5], as discussed in [4]. An example of an all-duplicates genome and a doubled genome is shown in Figure 2. Doubled genomes are the immediate result of an evolutionary event called Whole Genome Duplication (WGD), which is known to have occurred in many evolutionary lineages, from protists [6] to yeasts, to plants, to fish, to amphibians and even to mammals [7]. All-duplicates genomes derive from doubled genomes through a series of rearrangement events. Typically, all-duplicates genomes pertain to extant species, while doubled genomes are ancestral configurations inferred to exist immediately after the WGD, and that are to be reconstructed.
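The count of doubled genomes can be made concrete by enumerating them. The following Python sketch assumes a hypothetical encoding: an extremity of Π is any hashable label such as "1h", its two copies are the pairs (label, 1) and (label, 2), and a doubled genome is returned as a set of non-telomeric adjacencies together with its set of telomeres. Each non-telomeric adjacency xy of Π offers exactly two choices (x1y1 with x2y2, or x1y2 with x2y1), whereas a telomere can be doubled in one way only.

```python
from itertools import product


def doubled_genomes(adjacencies, telomeres):
    """Yield every doubled genome of an ordinary genome.

    adjacencies: iterable of pairs (x, y) of extremity labels (non-telomeric).
    telomeres: iterable of extremity labels that are telomeres.
    Each result is (set of frozenset adjacencies, set of telomeres) on the
    duplicated extremities (label, 1) and (label, 2).
    """
    adjacencies = list(adjacencies)
    # A telomeric adjacency x∘ always becomes x1∘ and x2∘.
    doubled_telomeres = {(x, k) for x in telomeres for k in (1, 2)}
    for choice in product((False, True), repeat=len(adjacencies)):
        doubled = set()
        for (x, y), crossed in zip(adjacencies, choice):
            if crossed:
                doubled.add(frozenset(((x, 1), (y, 2))))
                doubled.add(frozenset(((x, 2), (y, 1))))
            else:
                doubled.add(frozenset(((x, 1), (y, 1))))
                doubled.add(frozenset(((x, 2), (y, 2))))
        yield doubled, doubled_telomeres
```

For a linear chromosome (1 2 3), which has two non-telomeric adjacencies, the generator yields four distinct doubled genomes.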

In discussing all-duplicates genomes, we will sometimes contrast them with ordinary genomes which have a single copy of each gene.

The breakpoint distance

The breakpoint distance has been well-studied for permutations, i.e., unichromosomal genomes [8, 9], but only a few published discussions have focused on how it should be defined for multichromosomal genomes (see [10] for one suggestion). The distance should depend not only on common adjacencies, or rather their absence, but also on common telomeres (or lack thereof) in two genomes. Here we propose a definition that we wish to be valid for all types of karyotypes, based on a most general approach integrating all possible information from the two genomes. For two genomes Π and Γ on a set of n genes, suppose Π has N_Π chromosomes, and Γ has N_Γ chromosomes. Let a(Π, Γ) be the number of common adjacencies, and e(Π, Γ) be the number of common telomeres of Π and Γ. Then insofar as it should depend additively on these components, we may suppose the breakpoint distance has the form

d_BP(Π, Γ) = n - β·a(Π, Γ) - θ·e(Π, Γ) + γ·(N_Π + N_Γ) + ψ·|N_Π - N_Γ|,

where β, θ and γ are positive parameters, while ψ may have either sign. A linear genome Π with N_Π chromosomes has n - N_Π adjacencies and 2N_Π telomeres, so d_BP(Π, Π) = (1 - β)·n + (β - 2θ + 2γ)·N_Π. Taking Π = Γ and imposing d_BP(Π, Π) = 0 thus yields the relations β = 1 and 1 - 2θ + 2γ = 0, so θ = γ + 1/2, and the distance formula reduces to:

d_BP(Π, Γ) = n - a(Π, Γ) - ((2γ + 1)/2)·e(Π, Γ) + γ·(N_Π + N_Γ) + ψ·|N_Π - N_Γ|.

It is most plausible to count a total of 1 breakpoint for a fusion or fission of linear chromosomes, which implies γ = ψ = 0, so the most natural choice of breakpoint distance between Π and Γ is

d_BP(Π, Γ) = n - a(Π, Γ) - e(Π, Γ)/2.

It might be argued that a fission or fusion should count for as many as 2 breakpoints, or anything between 1 and 2, so that alternate values of γ and ψ might be entertained, provided γ ∈ [0, ], and ψ ∈ [0,1 - γ]. This may have an influence on how to calculate the number of breakages within a scenario, as discussed in [11]. For example, the parameters chosen in [10] are γ = and ψ = , giving rise to the disadvantage of there possibly being more breakpoints between two genomes than adjacencies in either one. For example, in comparing Π = (1 2 3 4 5) and Γ in which five linear chromosomes each contain one gene i ∈ {1,...,5}, the definition in [10] would count 9 breakpoints, which seems counterintuitive, while our definition counts 4, which seems more reasonable. Whether all the results presented in this paper also hold for the definition in [10] is open.
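The example comparing Π = (1 2 3 4 5) with five single-gene chromosomes can be checked with a short Python sketch of the distance with γ = ψ = 0, that is, n minus the number of common adjacencies minus half the number of common telomeres. The encoding is an assumption of this sketch: chromosomes are given as lists of signed integers (a negative integer standing for a barred gene), and the helper names `genome_from_strings` and `breakpoint_distance` are hypothetical.

```python
def genome_from_strings(linear=(), circular=()):
    """Return (adjacencies, telomeres) for chromosomes given as signed gene lists."""

    def ends(g):
        # Extremities of gene g in reading order: tail first if the sign is positive.
        a = abs(g)
        return ((a, "t"), (a, "h")) if g > 0 else ((a, "h"), (a, "t"))

    adjacencies, telomeres = set(), set()
    for chrom in linear:
        chrom = list(chrom)
        telomeres.add(ends(chrom[0])[0])
        telomeres.add(ends(chrom[-1])[1])
        for g1, g2 in zip(chrom, chrom[1:]):
            adjacencies.add(frozenset((ends(g1)[1], ends(g2)[0])))
    for chrom in circular:
        chrom = list(chrom)
        # A circular chromosome also joins its last gene back to its first.
        for g1, g2 in zip(chrom, chrom[1:] + chrom[:1]):
            adjacencies.add(frozenset((ends(g1)[1], ends(g2)[0])))
    return adjacencies, telomeres


def breakpoint_distance(n, genome_a, genome_b):
    """Breakpoint distance n - a - e/2 between two genomes on the same n genes."""
    adj_a, tel_a = genome_a
    adj_b, tel_b = genome_b
    a = len(adj_a & adj_b)   # common adjacencies
    e = len(tel_a & tel_b)   # common telomeres
    return n - a - e / 2


pi = genome_from_strings(linear=[[1, 2, 3, 4, 5]])
gamma = genome_from_strings(linear=[[1], [2], [3], [4], [5]])
print(breakpoint_distance(5, pi, gamma))  # 4.0
```

Here the two genomes share no adjacency and exactly two telomeres (the tail of gene 1 and the head of gene 5), which gives 5 - 0 - 1 = 4.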

The definition of the breakpoint distance is easily transposable to the comparison of two all-duplicates genomes. For one all-duplicates genome Δ and one ordinary genome Π, the breakpoint distance between Π and Δ is the minimum breakpoint distance between Δ and a doubled genome Π ⊕ Π, that is, d_BP(Π, Δ) = min_Π⊕Π d_BP(Π ⊕ Π, Δ).

The Double Cut-and-Join distance

Given a genome Π, a double-cut-and-join (DCJ) is an operation ρ acting on two adjacencies pq and rs (possibly some of p, q, r, s are ∘ symbols, so that telomeric adjacencies are considered; one adjacency can even be ∘∘). The DCJ operation replaces pq and rs either by pr and qs, or ps and qr. An example of DCJ operation on the genome Π of Figure 1 is drawn in Figure 3.
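A DCJ is easy to express directly on a set of adjacencies. The sketch below is a minimal Python illustration under assumptions of its own: adjacencies are stored as ordered pairs of extremity labels, the null symbol ∘ is encoded as `None`, telomeric adjacencies are stored explicitly, and the adjacency ∘∘ is treated as implicitly available and never stored. The function name `dcj` and the flag `crossed` are hypothetical.

```python
NULL = None  # stands for the null symbol ∘


def dcj(genome, pq, rs, crossed=False):
    """Apply one DCJ to a genome given as a set of adjacency pairs.

    pq and rs must appear in `genome` exactly as written (same orientation),
    except for (NULL, NULL), which is implicit. They are replaced by pr and qs,
    or by ps and qr when `crossed` is True.
    """
    p, q = pq
    r, s = rs
    result = set(genome)
    for adj in (pq, rs):
        if adj != (NULL, NULL):
            result.remove(adj)  # raises KeyError if the adjacency is absent
    new = [(p, s), (q, r)] if crossed else [(p, r), (q, s)]
    for adj in new:
        if adj != (NULL, NULL):
            result.add(adj)
    return result


# Linear chromosome (1 2 3) with explicit telomeric adjacencies.
genome = {(NULL, "1t"), ("1h", "2t"), ("2h", "3t"), ("3h", NULL)}

# Reversal of gene 2: cut 1h2t and 2h3t, join 1h2h and 2t3t.
reversed_2 = dcj(genome, ("1h", "2t"), ("2h", "3t"))

# Fission between genes 1 and 2: cut 1h2t and ∘∘, join 1h∘ and 2t∘.
split = dcj(genome, ("1h", "2t"), (NULL, NULL))
```

The same function covers fusions (two telomeric adjacencies replaced by one adjacency and ∘∘) and translocations, depending only on which two adjacencies are cut.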

A DCJ can reverse an interval of a genome, cause the fission of one chromosome into two or the fusion of two chromosomes into one, or effect a reciprocal translocation: the exchange of two telomere-containing segments between two chromosomes. Two consecutive DCJ operations, excising and circularising a chromosomal segment followed by a re-linearisation of the circular intermediate and reintegration on the same chromosome, using two new cut-points, result in a block interchange: two segments of the genome appear to simply exchange their positions. In the case where these two segments are consecutive, the two DCJs result in a transposition, the apparent movement of a segment from one place on a chromosome to another. The DCJ operation is thus a very general framework, introduced by Yancopoulos et al. [12], as well as by Lin et al. in a special case [13], and has since been adopted by Bergeron et al. [3, 14] and many others, sometimes under other names such as spring [15] or "2-break rearrangement" [16].

If Π and Γ are two genomes on a set of n genes, the minimum number of DCJ operations needed to transform Π into Γ is called the DCJ distance and denoted d_DCJ(Π, Γ).

This DCJ distance is easily defined also for two all-duplicates genomes. For one all-duplicates genome Δ and one ordinary genome Π, the DCJ distance between Π and Δ is d_DCJ(Π, Δ) = min_Π⊕Π d_DCJ(Π ⊕ Π, Δ).

The reversal/translocation distance

The reversal/translocation distance was introduced by Hannenhalli and Pevzner [17], and is equivalent to the DCJ distance constrained to linear genomes.

If Π is a linear genome, a linear DCJ operation is a DCJ operation on Π that results in a linear genome. This allows reversals, chromosome fusions, fissions, and reciprocal translocations. DCJs that create circular intermediates, temporary circular chromosomes, and thereby mimic block interchanges and transpositions, are not allowed. Chromosome fusions and fissions are particular cases of translocations in this framework, justifying the appellation RT-distance. If Π and Γ are linear genomes, the RT distance between Π and Γ is the minimum number of linear DCJ operations that transform Π into Γ, and is denoted d_RT(Π, Γ).

Computational problems

The classical literature on genome rearrangements aims at reconstructing the evolutionary events and ancestral configurations that explain the differences between the organization of extant genomes. The focus has been on the genomic distance, median and halving problems. More recently the doubled distance and guided halving problems have also emerged as important. In each of the ensuing sections of this paper, these five problems are examined for a specific combination of distance d (breakpoint, DCJ or RT) and kind of multichromosomal karyotype (linear, circular, mixed).

1. Distance. Given two genomes Π, Γ, compute d(Π, Γ). Once the distance is calculated, an additional problem in the cases of DCJ and RT is to reconstruct the rearrangement scenario of length d(Π, Γ), i.e. the putative events that differentiate the genomes.
2. Double distance. Given an all-duplicates genome Δ and an ordinary genome Π, compute d(Δ, Π). This computation evaluates the evolutionary distance posterior to a WGD of the given genome Π, leading to an all-duplicates genome Δ, and locates the genes of the all-duplicates genome on chromosomes in one of the two ancestral copies of the ordinary genome. Because the assignment of labels "1" or "2" to the two identical (for our purposes) copies of a duplicated gene in Δ is arbitrary, the double distance problem is equivalent to finding such an assignment that minimises the distance between Δ and a genome Π ⊕ Π considered as ordinary genomes, where all the genes on any one chromosome in Π ⊕ Π are uniformly labeled "1" or "2" [16, 18]. The double distance function is not symmetric because Δ is an all-duplicates genome and Π is an ordinary one, thus capturing the presumed asymmetric temporal and evolutionary relationship between the ancestor Π and the present-day genome Δ.
3. Median. Given three genomes Π₁, Π₂, Π₃, find a genome M which minimises d(Π₁, M) + d(Π₂, M) + d(Π₃, M). The median problem estimates the common ancestor of two genomes, given a third one as an outgroup. This is meaningful even in the "unrooted" case, where it is not specified which of the three genomes is the outgroup, because of the symmetry of the sum to be minimised.
4. Halving. Given an all-duplicates genome Δ, find an ordinary genome Π which minimises d(Δ, Π), the double distance mentioned above. The goal of a halving analysis is to reconstruct the ancestor of an all-duplicates genome at the time of a WGD event.
5. Guided halving. Given an all-duplicates genome Δ and an ordinary genome Π, find an ordinary genome M which minimises d(Δ, M) + d(M, Π). The guided halving problem is similar to the genome halving problem for Δ, but it takes into account the ordinary genome Π of an organism presumed to share a common ancestor with M, the reconstructed undoubled ancestor of Δ. A variant of the guided halving problem introduced in [19] is to find an ordinary genome M that is a solution to genome halving, that is, minimises d(Δ, M), and which in addition minimises d(M, Π). This helps in choosing, among the numerous solutions to the genome halving problem, the one that is closest to the outgroup. We do not study this variant here, and it is open for all genomic distances.

We will survey these five computational problems for the three distances that we have introduced, in the cases of multichromosomal genomes containing all linear chromosomes, all circular chromosomes, or permitting both. The latter are referred to as mixed genomes.

While many problems are open for multichromosomal genomes, there is a huge amount of research on these problems for unichromosomal genomes, whether circular or linear (the two cases are often equivalent up to some transformations [1]). They are not systematically particular cases of the multichromosomal problems, as the constraint of keeping only one chromosome along a rearrangement scenario can result in more difficult problems. More precisely, unichromosomal DCJ problems reduce to RT multichromosomal ones. Indeed, the RT operations always transform a unichromosomal genome into a unichromosomal one. As this paper contains very few results on the RT distance, practically the unichromosomal cases are often independent and not generalized here. Results on unichromosomal genomes are summarised in Table 1, together with the results for the multichromosomal case we review or present here. A complete survey on these problems can be found in [1].

Results

Breakpoint distance, circular and mixed genomes

In this section, d = d_BP, and genomes are considered in their most general definition, that is, multichromosomal with both circular and linear chromosomes allowed. All the results also stand for circular genomes, but not always for linear genomes, which will be considered in a following section. A mixed karyotype is rarely observed as the nuclear genome of a eukaryotic species, and so is probably unstable. Nevertheless this case is of great theoretical interest, as it is the only combination of distance and karyotype where all five problems mentioned in the previous section prove to be polynomially solvable, including the median problem which is hard for almost every other variant. Furthermore, the solutions in this context may suggest approaches for other variants of the problems, as well as providing a rapid bound for other distances, through the Watterson et al. bound [8].

Distance and double distance

The distance computation follows directly from the definition, and is easily achievable in linear time. The double distance computation is also easy: let Π be a genome and Δ be an all-duplicates genome. Let a(Π, Δ) be the sum, for every adjacency xy in Π, of the number of adjacencies among x1y1, x1y2, x2y1, x2y2 in Δ. Let e(Π, Δ) be the sum, for every telomere x in Π, of the number of telomeres among x1 and x2 in Δ.

Then we obtain

d(Π, Δ) = 2n - a(Π, Δ) - e(Π, Δ)/2.

Indeed, it is a lower bound on the distance, because a(Π, Δ) and e(Π, Δ) are upper bounds on the number of common adjacencies and common telomeres, respectively, between Δ and any Π ⊕ Π. This lower bound is attained by constructing Π ⊕ Π in the following way: let xy be a possibly telomeric adjacency in Π (either x or y may be a ∘ symbol); if x1y1 or x2y2 is an adjacency in Δ, choose x1y1 and x2y2 as adjacencies in Π ⊕ Π; if x1y2 or x2y1 is an adjacency in Δ, choose x1y2 and x2y1 as adjacencies in Π ⊕ Π; the two cases are either mutually exclusive if xy is not telomeric, or identical if xy is telomeric, so the assignment is made without ambiguity. For all adjacencies that have not been assigned, assign them arbitrarily.
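The two counts a(Π, Δ) and e(Π, Δ) translate directly into code. The following Python sketch assumes a hypothetical encoding in which an extremity of Π is a label such as "1h", its two copies in Δ are the pairs (label, 1) and (label, 2), adjacencies are frozensets of two extremities, and each genome is passed as a pair (adjacencies, telomeres). It computes the breakpoint double distance as 2n minus the adjacency count minus half the telomere count, without building Π ⊕ Π explicitly.

```python
def double_breakpoint_distance(n, ordinary, duplicated):
    """Breakpoint double distance between an ordinary and an all-duplicates genome.

    ordinary:   (adjacencies, telomeres) on extremity labels.
    duplicated: (adjacencies, telomeres) on pairs (label, copy), copy in {1, 2}.
    n is the number of genes of the ordinary genome.
    """
    adj_p, tel_p = ordinary
    adj_d, tel_d = duplicated
    a = 0
    for adj in adj_p:
        x, y = tuple(adj)
        # Count the adjacencies of Δ among x1y1, x1y2, x2y1, x2y2.
        for i in (1, 2):
            for j in (1, 2):
                if frozenset(((x, i), (y, j))) in adj_d:
                    a += 1
    # Count the telomeres of Δ among x1, x2 for each telomere x of Π.
    e = sum((x, k) in tel_d for x in tel_p for k in (1, 2))
    return 2 * n - a - e / 2


# Π is the linear chromosome (1 2).
pi = ({frozenset(("1h", "2t"))}, {"1t", "2h"})

# Δ is a single linear chromosome carrying the two copies in succession.
delta = (
    {
        frozenset((("1h", 1), ("2t", 1))),
        frozenset((("2h", 1), ("1t", 2))),
        frozenset((("1h", 2), ("2t", 2))),
    },
    {("1t", 1), ("2h", 2)},
)
print(double_breakpoint_distance(2, pi, delta))  # 1.0
```

In this example both copies of the adjacency 1h2t are present in Δ and two of the four doubled telomeres are preserved, so the distance is 4 - 2 - 1 = 1, which corresponds to the single fusion that joined the two copies.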

Median

The following result contrasts with the NP-completeness proofs of almost all median problems in the literature [20–22] (see [23, 24] for tractability results on some variants). The problem is NP-complete for unichromosomal genomes, that is, when the median genome M is required to be unichromosomal, whether the genomes are linear or circular [20, 21], but the multichromosomal case happens to be easier.

Theorem 1. There is a polynomial time algorithm for the breakpoint median problem for multichromosomal genomes.

Proof. Let Π₁, Π₂, Π₃ be three genomes on a gene set 𝒢 of size n. For any genome M on 𝒢, let s(M) = d(Π₁, M) + d(Π₂, M) + d(Π₃, M) be the median score of M.

Draw a graph G on the vertex set containing (1) all extremities of genes in 𝒢, and (2) one supplementary vertex t_x for every gene extremity x. For any pair of gene extremities x, y, draw an edge xy weighted by the number of genomes, among Π₁, Π₂, Π₃, for which xy is an adjacency. Then there is an edge between each pair of gene extremities, weighted by 0, 1, 2, or 3. Now for any vertex x, draw an edge x t_x weighted by half the number of genomes, among Π₁, Π₂, Π₃, having x as a telomere. Each edge x t_x is then weighted by 0, 1/2, 1, or 3/2. Finally, put an edge of weight 0 between t_x and t_y for all pairs of gene extremities x, y. Let M be a perfect matching in G. Clearly, the edges joining gene extremities in M define the adjacencies of a genome, which we also call M. The relation between the weight of the perfect matching M and the median score of the genome M is easy to state:

Claim 1. The weight w(M) of the perfect matching M in G is 3n - s(M).

Indeed, for any genome Π_i, d(Π_i, M) = n - a_i - e_i/2, where a_i = a(Π_i, M) is the number of common adjacencies between M and Π_i, and e_i = e(Π_i, M) is the number of common telomeres between M and Π_i. If M and Π_i have a common adjacency or a common telomere, this accounts for 1 or 1/2, respectively, in the weight of the perfect matching M. So the weight of the matching M is w(M) = (a_1 + e_1/2) + (a_2 + e_2/2) + (a_3 + e_3/2), which yields d(Π₁, M) + d(Π₂, M) + d(Π₃, M) = 3n - w(M).

Conversely, any genome M can be extended to a perfect matching M in G such that s(M) = 3n - w(M): construct the matching M by including the edges xy and t_x t_y for each adjacency xy and an edge x t_x for each telomere x.

Claim 1 implies that a maximum weight perfect matching M is a minimum score median genome. As the maximum weight perfect matching problem is polynomial [25], so is the breakpoint median problem. □
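Claim 1 can be checked on very small instances with the Python sketch below. To stay self-contained it does not implement a maximum weight perfect matching algorithm: plain exhaustive enumeration of all genomes on the gene extremities stands in for it, which is exponential and only usable for a handful of genes. The encoding (extremities as pairs (gene, "t") or (gene, "h"), genomes as sets of frozenset adjacencies, telomeres being the unpaired extremities) and the function names are assumptions of this sketch.

```python
def all_genomes(extremities):
    """Yield every genome on the given extremities as a frozenset of adjacencies."""
    if not extremities:
        yield frozenset()
        return
    x, rest = extremities[0], extremities[1:]
    # Either x is a telomere (matched to t_x in the graph G) ...
    yield from all_genomes(rest)
    # ... or x is adjacent to some other extremity y.
    for i, y in enumerate(rest):
        for g in all_genomes(rest[:i] + rest[i + 1:]):
            yield g | {frozenset((x, y))}


def matching_weight(median, inputs, extremities):
    """Weight w(M): 1 per shared adjacency and 1/2 per shared telomere, per input."""
    tel_m = set(extremities) - {x for adj in median for x in adj}
    w = 0
    for adj_i in inputs:
        tel_i = set(extremities) - {x for adj in adj_i for x in adj}
        w += len(median & adj_i) + len(tel_m & tel_i) / 2
    return w


def breakpoint_median(n, inputs):
    """Return (median genome, median score) by exhaustive search over genes 1..n."""
    extremities = [(g, s) for g in range(1, n + 1) for s in ("t", "h")]
    best = max(all_genomes(extremities),
               key=lambda m: matching_weight(m, inputs, extremities))
    # Claim 1, stated for len(inputs) genomes: score = len(inputs) * n - w(M).
    score = len(inputs) * n - matching_weight(best, inputs, extremities)
    return best, score


p1 = {frozenset(((1, "h"), (2, "t")))}   # linear chromosome (1 2)
p3 = {frozenset(((1, "h"), (2, "h")))}   # linear chromosome (1 -2)
median, score = breakpoint_median(2, [p1, p1, p3])
print(sorted(map(sorted, median)), score)
```

With two copies of (1 2) and one copy of (1 -2) as input, the search returns (1 2) as the median with score 1.5, the breakpoint distance between the two distinct inputs. For realistic sizes the enumeration would be replaced by a polynomial maximum weight perfect matching routine, as in the proof.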

If the three genomes in the instance are circular, then it is possible to constrain the result to also be circular by restricting the graph G to the extremities of the genes. Then, in the same way, a perfect matching gives a circular solution to the median problem. This is not the case for linear genomes, since there is no way to guarantee that no chromosome in the solution is circular.

Note that a generalisation of this algorithm remains valid if the median of more than three genomes is to be computed. The phylogeny problems, both "big" and "small" versions, which also generalise the median problem for three genomes, remain open. The big problem is the search for a Steiner tree in the space of genomes, minimising the sum of the distances on its branches, while in the small problem, presumably easier, the graph-theoretical structure of the tree, namely its vertex set and edge or branch set, are given, and only the genomes corresponding to the extra vertices (not corresponding to the given genomes) need to be reconstructed.

Halving

To our knowledge, genome halving with the breakpoint distance has not yet been studied. In this framework, it has an easy solution, using a combination of elements from the maximum weight perfect matching technique in the solution of the median problem presented above, and the double distance computation. Let Δ be an all-duplicates genome on a gene set 𝒢, and G be the graph on the vertex set containing (1) all the extremities of the genes in 𝒢, and (2) one supplementary vertex t_x for every gene extremity x. For any pair of gene extremities x, y, draw an edge in G weighted by zero, one or two according to the number of adjacencies in Δ among x1y1, x1y2, x2y1, and x2y2. Now for any vertex x, draw an edge x t_x weighted by half the number of telomeres among x1 and x2 in Δ. Finally, put an edge of weight 0 between t_x and t_y for all pairs of gene extremities x, y.

For a genome M on 𝒢, define a perfect matching, also called M, by including edges xy and t_x t_y for each adjacency xy, and an edge x t_x for each telomere x. Let w(M) be the weight of the matching M.

Claim 2. For a genome M on 𝒢, the perfect matching M thus constructed satisfies w(M) = 2n - d(Δ, M).

Indeed, the weight of the perfect matching M is a(M, Δ) + e(M, Δ)/2, that is, 2n - d(Δ, M), according to the double distance formula (see above in this section).

Conversely, it is easy to see that any perfect matching on G defines a genome M such that w(M) = 2n - d(Δ, M). This implies that the maximum weight perfect matching solves the genome halving problem in the breakpoint distance context.
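The edge weights of the halving graph can be tabulated in a few lines. The Python sketch below only builds the weights and evaluates w(M) for a candidate genome M; it does not run a matching algorithm, and any maximum weight perfect matching routine could be applied to the resulting table. The encoding is an assumption: extremities are labels such as "1h", their copies in Δ are pairs (label, copy), and the telomere edge x t_x is keyed as ("tel", x). The function names are hypothetical.

```python
from collections import Counter


def halving_edge_weights(dup_adjacencies, dup_telomeres):
    """Weights of the halving graph G built from an all-duplicates genome Δ."""
    weights = Counter()
    for adj in dup_adjacencies:
        (x, _), (y, _) = tuple(adj)
        if x != y:  # an adjacency x1x2 corresponds to no edge of G
            weights[frozenset((x, y))] += 1
    for x, _ in dup_telomeres:
        weights[("tel", x)] += 0.5
    return weights


def matching_weight(weights, adjacencies, telomeres):
    """w(M) for a genome M given by its adjacencies (pairs) and its telomeres."""
    return (sum(weights[frozenset(adj)] for adj in adjacencies)
            + sum(weights[("tel", x)] for x in telomeres))


# Δ: one linear chromosome carrying gene copies 1_1 2_1 1_2 2_2 in this order.
delta_adj = {
    frozenset((("1h", 1), ("2t", 1))),
    frozenset((("2h", 1), ("1t", 2))),
    frozenset((("1h", 2), ("2t", 2))),
}
delta_tel = {("1t", 1), ("2h", 2)}
w = halving_edge_weights(delta_adj, delta_tel)

# Candidate M = linear chromosome (1 2): w(M) = 2 + 1/2 + 1/2 = 3, so d(Δ, M) = 4 - 3 = 1.
print(matching_weight(w, [("1h", "2t")], ["1t", "2h"]))
```

In this small example the circular chromosome (1 2) reaches the same weight 3, which illustrates that the halving problem may have several optimal solutions.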

Again, it is possible to solve the problem on only circular genomes by restricting the graph G to the gene extremities, dropping the t_x supplementary vertices.

Guided Halving

As is the case for the median problem, this context provides, to our knowledge, the only polynomial result for the guided genome halving problem. The solution combines elements of the three previous results, on the double distance, median and halving problems.

Let Δ be an all-duplicates genome on a gene set 𝒢, and Π be an ordinary genome on 𝒢. Let G be the graph on the vertex set containing (1) all the extremities of the genes in 𝒢, and (2) one supplementary vertex t_x for every gene extremity x.

For any pair of gene extremities x, y, there is an edge in G weighted by the number of adjacencies among x1y1, x1y2, x2y1, x2y2 in Δ, and xy in Π. Now there is an edge x t_x for any gene extremity x, weighted by half the number of telomeres among x1, x2 in Δ and x in Π. So each edge between gene extremities has an integer weight in {0, 1, 2, 3}, and x t_x edges may have weight 0, 1/2, 1, or 3/2. Add 0-weight edges t_x t_y for all pairs x, y of gene extremities.

For any genome M, let s(M) = d(Δ, M) + d(M, Π). It is possible to construct a perfect matching M in G from genome M by choosing edges xy and t_x t_y for every adjacency xy in M, and an edge x t_x for every telomere x in M. Its weight is denoted w(M).

Claim 3. For a genome M, the perfect matching thus constructed satisfies w(M) = 3n - s(M).

Indeed, the weight of the perfect matching M is a(M, Δ) + e(M, Δ)/2 + a(M, Π) + e(M, Π)/2. According to the double distance formula (see above in this section) and the definition of the breakpoint distance, this yields w(M) = 3n - s(M).

Conversely, if M is a perfect matching in G, its edges between gene extremities define the adjacencies of a genome M which satisfies s(M) = 3n - w(M). This implies that the maximum weight perfect matching solves the guided genome halving problem in the breakpoint distance context.

As is the case for the median problem, it is possible to generalise this statement for an arbitrary number of ordinary outgroup genomes. The phylogenetic problems are open.

Again, we can solve the problem on circular genomes by dropping the t_x supplementary vertices in the graph G.

Breakpoint distance, linear case

In this section, d = d_BP and all genomes must be linear, as is most appropriate for modeling the eukaryotic nuclear genome. In contrast to the model of the previous section, all the problems concerning at least three genomes are NP-complete.