The median problem, namely to construct the genome, the sum of whose distances from three given genomes is minimized, is of biological interest because it is at the heart of several approaches to phylogenetic inference based on gene order. It is also of computational interest since it represents one of the major axes of generalizations of simple pairwise gene order comparison, and most but not all versions are NP-hard [1].

One concern about the median problem, perhaps of more pertinence to applications than to theory, is that the solution is generally not unique, and that different solutions may be at a considerable distance from each other (e.g., [2]). A second concern is the tendency, when the three input gene orders are relatively highly rearranged with respect to each other, for the median to fall on or near one of these input orders (e.g., [3]), rather than "in the middle", as might be more intuitively satisfying.

In this study, based on a series of simulations, we investigate the simplest median problems: that of unsigned genes under the breakpoint distance and that of signed genes under the breakpoint distance. We make use of a reduction of these problems to the Traveling Salesman Problem (TSP) [4], which we can now solve rapidly for genomes with thousands of genes [5]. We find that, as gene orders become more random with respect to each other and as the number of genes increases, the median does indeed tend to approach an input genome, in terms of the distance normalized by the number of genes. Moreover, with the same input genomes, there are different solutions that approach each of the corners. We formalize these observations in terms of a conjecture.

We generalize this conjecture to the case of the median of four or more genomes. We also conjecture that the phenomenon of medians "seeking corners" carries over to other distances often applied to gene orders. Finally we discuss how it fits in with more general ideas of loss of evolutionary signal as gene orders become increasingly rearranged.

### The breakpoint median problem for circular chromosomes

For the unsigned case, we consider genomes modeled as (single) circular permutations on genes 1, …, *n*. Let *A* = *a*_{1}, …, *a*_{n} be such a permutation. The unordered pair (*a*_{i}, *a*_{i+1}) are called *adjacent*; they constitute an *adjacency* on *A*, for 1 ≤ *i* < *n*. In addition, circularity means that *a*_{n} is adjacent to *a*_{1}.

Consider two unsigned genomes *A* = *a*_{1}, …, *a*_{n} and *B* = *b*_{1}, …, *b*_{n} on the same set of *n* genes. If two genes *g* and *h* are adjacent in *A* but not in *B* (that is, neither *gh* nor *hg* appears in *B*), then they determine a *breakpoint*. The *breakpoint distance* *d*(*A*, *B*) between *A* and *B* is defined as the number of breakpoints in *A* (or, equivalently, in *B*). This can be calculated as *d*(*A*, *B*) = *n* − adj(*A*, *B*), where adj(*A*, *B*) is the number of adjacencies in common between *A* and *B*.
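The formula *d*(*A*, *B*) = *n* − adj(*A*, *B*) can be illustrated with a minimal Python sketch. The function names `adjacencies` and `breakpoint_distance` are hypothetical, chosen here for illustration, and the sketch assumes genomes are given as lists of distinct gene labels with *n* ≥ 3.

```python
def adjacencies(genome):
    """Return the unordered adjacencies of a circular unsigned genome.

    Assumes n >= 3 so that the n adjacencies are all distinct.
    """
    n = len(genome)
    # The modulo wraps around so that the last gene is adjacent to the first.
    return {frozenset((genome[i], genome[(i + 1) % n])) for i in range(n)}


def breakpoint_distance(a, b):
    """Breakpoint distance d(A, B) = n - adj(A, B)."""
    if sorted(a) != sorted(b):
        raise ValueError("genomes must be on the same set of genes")
    return len(a) - len(adjacencies(a) & adjacencies(b))


A = [1, 2, 3, 4, 5, 6]
B = [1, 2, 4, 3, 5, 6]  # genes 3 and 4 exchanged
print(breakpoint_distance(A, B))  # -> 2
```

Because adjacencies are stored as unordered pairs and the index wraps around, rotating or reversing a genome leaves its distance to any other genome unchanged, as expected for circular unsigned gene orders.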

For a signed genome, each gene is assigned a positive or negative orientation. If gene *h*, with a given orientation in *A*, follows gene *g*, also with a given orientation, which we write *gh*, then if either *gh* or −*h* − *g* is in *B*, this constitutes a common adjacency in the two genomes. Otherwise the two genes determine a breakpoint.
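The signed rule can be sketched in the same way. In this illustrative Python fragment a signed genome is assumed to be a list of nonzero integers whose sign encodes orientation, and each adjacency *gh* is stored in a canonical form so that *gh* and −*h* −*g* compare equal. The names `signed_adjacencies` and `signed_breakpoint_distance` are hypothetical.

```python
def signed_adjacencies(genome):
    """Return canonical adjacencies of a circular signed genome.

    The adjacency "g followed by h" is the same as "-h followed by -g",
    so we keep the smaller of the two tuples as the representative.
    """
    n = len(genome)
    result = set()
    for i in range(n):
        g, h = genome[i], genome[(i + 1) % n]
        result.add(min((g, h), (-h, -g)))
    return result


def signed_breakpoint_distance(a, b):
    """Number of adjacencies of A that are not adjacencies of B."""
    return len(a) - len(signed_adjacencies(a) & signed_adjacencies(b))


A = [1, 2, 3, 4]
B = [1, -3, -2, 4]   # segment 2 3 inverted, with signs flipped
C = [1, 3, 2, 4]     # same order change, signs left unchanged
print(signed_breakpoint_distance(A, B))  # -> 2
print(signed_breakpoint_distance(A, C))  # -> 3
```

The comparison of `B` and `C` shows why orientation matters: the inversion that flips signs preserves the internal adjacency between genes 2 and 3, whereas reordering without flipping signs does not.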

Given three genomes *A*, *B*, and *C* on the same set of *n* genes, the breakpoint median problem is the problem of finding a genome *M*, called the *median*, such that the *median sum* *d*(*M*, *A*) + *d*(*M*, *B*) + *d*(*M*, *C*), i.e., the sum of the breakpoint distances between *M* and the given genomes, is minimized. This definition holds for both unsigned and signed genomes.

More generally, for *k* ≥ 3, the *k*-median problem for breakpoints requires, for *k* given genomes *A*_{1}, …, *A*_{k} on the same set of *n* genes, finding a genome *M* such that the median sum

$$\sum_{i=1}^{k} d(M, A_i)$$

is minimized. Where the meaning is clear, we will use the term "median" to refer to 3-medians.

The unichromosomal breakpoint median problems are known to be NP-hard ([6] and [7]), as are most, but not all, versions of the median problem, with metrics different from the breakpoint distance and/or on spaces of genomes different from that of circular unichromosomal genomes [1].

Nevertheless, by reducing the *k*-breakpoint median problem to the TSP [4], we can solve instances containing many thousands of genes rapidly [5], making use of *Concorde*, a software package that combines many of the recent advances in the field to rapidly produce TSP solutions [8].

Given *k* ≥ 3 genomes *A*_{1}, …, *A*_{k}, to reduce the *k*-median problem for unsigned genomes to the TSP on *n* vertices, let *G* be the complete graph on *n* vertices, where each vertex represents one gene. For each edge *xy* let *v*(*xy*) be equal to the number of times the genes corresponding to *x* and *y* are adjacent (do not form a breakpoint) in genomes *A*_{1}, …, *A*_{k}, so *v*(*xy*) can be any value among 0, …, *k*. Define the edge weight *w*(*xy*) = *k* − *v*(*xy*). Then a solution of the TSP on *G* with weights *w*(·), namely a minimum weight Hamilton cycle, defines a genome with a minimum sum of breakpoint distances to the *k* given genomes.
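The reduction can be made concrete with a small Python sketch. It builds the weights *w*(*xy*) = *k* − *v*(*xy*) and then finds a minimum weight Hamilton cycle by exhaustive search. The exhaustive search is only a stand-in that is feasible for a handful of genes; it is not the Concorde-based approach of [5], and the function names are hypothetical.

```python
from itertools import permutations


def adjacencies(genome):
    """Unordered adjacencies of a circular unsigned genome (assumes n >= 3)."""
    n = len(genome)
    return {frozenset((genome[i], genome[(i + 1) % n])) for i in range(n)}


def tsp_weights(genomes):
    """Edge weights w(xy) = k - v(xy) on the complete graph over the genes."""
    k = len(genomes)
    v = {}
    for genome in genomes:
        for edge in adjacencies(genome):
            v[edge] = v.get(edge, 0) + 1
    genes = sorted(genomes[0])
    return {
        frozenset((x, y)): k - v.get(frozenset((x, y)), 0)
        for i, x in enumerate(genes)
        for y in genes[i + 1:]
    }


def median_sum(m, genomes):
    """Sum of breakpoint distances d(M, A_i) = n - adj(M, A_i)."""
    return sum(len(m) - len(adjacencies(m) & adjacencies(g)) for g in genomes)


def brute_force_median(genomes):
    """Minimum weight Hamilton cycle by exhaustive search (small n only)."""
    w = tsp_weights(genomes)
    genes = sorted(genomes[0])
    first, rest = genes[0], genes[1:]  # fix one gene to skip rotations
    best_tour, best_cost = None, None
    for perm in permutations(rest):
        tour = [first, *perm]
        cost = sum(
            w[frozenset((tour[i], tour[(i + 1) % len(tour)]))]
            for i in range(len(tour))
        )
        if best_cost is None or cost < best_cost:
            best_tour, best_cost = tour, cost
    return best_tour, best_cost


genomes = [
    [1, 2, 3, 4, 5, 6],
    [1, 3, 5, 2, 4, 6],
    [1, 4, 2, 5, 3, 6],
]
median, cost = brute_force_median(genomes)
print(median, cost, median_sum(median, genomes))
```

The weight of any Hamilton cycle equals *nk* minus the total number of adjacencies it shares with the *k* inputs, which is exactly its median sum, so the printed tour cost and median sum coincide.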

A similar strategy transforms the median problem for signed genomes into the TSP.

### The conjectures

We start with the unsigned case. For a given *n* ≥ 1, consider a number of random genomes drawn independently from the set of all circular permutations, each with probability 2/(*n* − 1)!.

Let $\mathcal{G}_n$ be the set of genomes containing *n* genes. For *A* ∈ $\mathcal{G}_n$, let the neighbourhood of *A* be

$$N_\varepsilon(A) = \left\{ X \in \mathcal{G}_n : \frac{d(X, A)}{n} < \varepsilon \right\}, \tag{1}$$

in other words, the set of genomes that are close to *A* in the normalized sense.

We note that, for all *A*, *B* ∈ $\mathcal{G}_n$,

$$\frac{d(A, B)}{n} \le 1, \tag{2}$$

because there can be no more than *n* breakpoints between two genomes of length *n*.

We impose a uniform measure *p*_{n} on $\mathcal{G}_n$, so that *p*_{n}(*A*) = 1/(*n* − 1)! for all *A* ∈ $\mathcal{G}_n$. Then for random *A*, *B* ∈ $\mathcal{G}_n$, for large *n* the number of adjacencies approaches a Poisson distribution with parameter λ = 2 [9], so that

$$\frac{d(A, B)}{n} \to 1 \tag{3}$$

as *n* increases.
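The Poisson behaviour can be checked informally by simulation. The following Python sketch, with hypothetical function names and arbitrary choices of *n*, number of trials, and random seed, compares random circular gene orders with a fixed reference order and reports the mean number of common adjacencies and the mean normalized breakpoint distance.

```python
import random


def common_adjacencies(a, b):
    """Number of unordered adjacencies shared by two circular genomes."""
    n = len(a)
    adj_a = {frozenset((a[i], a[(i + 1) % n])) for i in range(n)}
    adj_b = {frozenset((b[i], b[(i + 1) % n])) for i in range(n)}
    return len(adj_a & adj_b)


def simulate(n=200, trials=2000, seed=0):
    """Mean common adjacencies and mean normalized breakpoint distance."""
    rng = random.Random(seed)
    reference = list(range(n))  # by symmetry, one genome can be held fixed
    total = 0
    for _ in range(trials):
        other = reference[:]
        rng.shuffle(other)  # uniformly random gene order
        total += common_adjacencies(reference, other)
    mean_adj = total / trials
    return mean_adj, 1 - mean_adj / n


mean_adj, mean_norm_dist = simulate()
print(f"mean common adjacencies: {mean_adj:.3f}")
print(f"mean normalized breakpoint distance: {mean_norm_dist:.4f}")
```

For unsigned circular genomes the expected number of common adjacencies is 2*n*/(*n* − 1), so the first figure should be close to 2 and the second close to 1 − 2/*n*.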

We propose the following:

**Conjecture 1 "Medians Seek the Corners"**
*For any ε* > 0, δ > 0, *there is an n′ such that if A*_{1}, …, *A*_{k} *are k genomes drawn at random from* $\mathcal{G}_n$, *where n* > *n′*, *and M is a k-median for these genomes, then*

$$\left| \, p_n\left[ M \in N_\varepsilon(A_i) \right] - \frac{1}{k} \, \right| < \delta$$

*for i* = 1, …, *k*.

It is important to note that not only would a median tend to be close to one of the input genomes *A*_{1}, …, *A*_{k}, but other median solutions for the same input genomes would simultaneously be close to each of the other input genomes, in equal proportions.

**Corollary 1** *For n* = 1, …, *if A*_{1}, …, *A*_{k} *are k genomes drawn at random from* $\mathcal{G}_n$, *then the expected normalized median sum*

$$\mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^{k} d(M, A_i) \right] \to k - 1$$

*as n* → ∞.

**Corollary 2**
*As n* → ∞, *if A*_{1}, …, *A*_{k} *are k genomes drawn at random from* $\mathcal{G}_n$, *and M*_{1} *and M*_{2} *are two medians of these k genomes, then*

We now turn to the case of signed genomes. Here, not only are there (*n* − 1)! gene orders, but there are 2^{n} ways of assigning orientations to the genes. Thus the set $\mathcal{G}^{\pm}_n$ of all signed genomes contains 2^{n}(*n* − 1)! elements. The definition of a neighborhood in Eq. (1) carries over with the unsigned set $\mathcal{G}_n$ replaced by $\mathcal{G}^{\pm}_n$. For the uniform measure *q*_{n} on $\mathcal{G}^{\pm}_n$, the Poisson parameter for the number of common adjacencies in two genomes is 1/2 instead of 2 [9], but the limiting value of the normalized breakpoint distance is still 1, as in Eq. (3).

Conjecture 1 and Corollaries 1 and 2 are also proposed for the signed case, where the unsigned set $\mathcal{G}_n$ is replaced by the signed set $\mathcal{G}^{\pm}_n$ and *p*_{n} is replaced by *q*_{n}.

The conjecture and its corollaries might seem counterintuitive, especially if the median is conceived of as being "in the middle" of the input genomes. For example, we could imagine constructing a genome containing a proportion 1/*k* of its adjacencies in common with each of the random input genomes. Its normalized distance would then be approximately (*k* − 1)/*k* from each of them, for a combined median sum of *k* − 1, the same as in Corollary 1. Moreover, this would accord well with the notion of the median as being in the middle. However, such medians would not satisfy Corollary 2.
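The arithmetic behind this comparison can be written out as a rough check. It uses two idealizations that hold only in the limit: the normalized distance between two random genomes is taken to be exactly 1, as in Eq. (3), and a "corner" median is taken to coincide with one input genome. The subscripted labels are introduced here for illustration only.

```latex
\begin{align*}
\text{corner median:}\quad
  \frac{1}{n}\sum_{i=1}^{k} d(M_{\mathrm{corner}}, A_i)
  &\approx 0 + (k-1)\cdot 1 = k-1, \\
\text{middle median:}\quad
  \frac{1}{n}\sum_{i=1}^{k} d(M_{\mathrm{middle}}, A_i)
  &\approx k \cdot \frac{k-1}{k} = k-1.
\end{align*}
```

For *k* = 3 both expressions give a normalized median sum of 2, so the median sum alone cannot distinguish the two kinds of solution.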