Preliminaries
Genomes and rearrangement operations
We model the evolutionary rearrangement of a genome containing n distinct signed genes through the accumulated operation of number of processes familiar in classical genetics: inversion, reciprocal translocation, transposition, chromosome fusion and fission, operating on linear chromosomes. We will not delve into the details of the operations; formally they can all be subsumed under a single operation called double-cut-and-join (DCJ) which need not be described here. All that is needed for our purposes is a formula due to Yancopoulos et al.[10] that gives d(G1, G2), the minimum number of rearrangement operations needed to transform one genome into another in terms of properties of the “breakpoint graph” determined by G1 and G2, the initial and final genomes. To calculate D efficiently, we construct and analyze the breakpoint graph as follows.
For each genome, each gene g with a positive polarity is replaced by two vertices representing its two ends, i.e., by a “tail” vertex and a “head” vertex in the order g
t
, g
h
; for –g we would put g
h
, g
t
. Each pair of successive genes in the gene order defines an adjacency, namely the pair of vertices that are adjacent in the vertex order thus induced. For example, if i, j, –k are three neighbouring genes on a chromosome then the unordered pairs {i
h
, j
t
} and {j
h
, k
h
} are the two adjacencies they define.
If there are m genes on a chromosome, it has determined 2m vertices by this stage. The first and the last of these vertices are called telomeres. We convert all the telomeres in genome G1 and G2 into adjacencies with additional vertices all labelled T1 or T2, respectively. The breakpoint graph has a blue edge connecting the vertices in each adjacency in G1 and a red edge for each adjacency in G2. We make a cycle of any path ending in two T1 or two T2 vertices, connecting them by a red or blue edge, respectively, while for a path ending in a T1 and a T2, we collapse them to a single vertex denoted “T”.
Each vertex is now incident to exactly one blue and one red edge. This bicoloured graph decomposes uniquely into κ alternating cycles. If n′ is the number of blue edges, then [10]:
d(G1, G2) = n′ – κ. (1)
For a given unrooted binary tree T on N given genomes G1, G2, ⋯, G
N
(and thus with N – 2 unknown ancestral genomes M1, M2, ⋯, M
N
–2 and 2N – 3 branches) as depicted in Fig. 1, the small phylogeny problem is to infer the ancestral genomes so that the total edge length of T, namely ∑
XY∈E
(
T
)d(X, Y), is minimal.
The computational complexity of the median problem, which is just the small phylogeny problem with N = 1, is known to be NP-hard and hence so is that of the general small phylogeny problem. Our method will be shown to run in linear time, so obviously it is not guaranteed to find an exact solution. One of the goals of this paper is to determine for what range of instances PATHGROUPS leads to accurate solutions, and how the approach may be improved to extend this range.
Data structure and algorithm
In this section we first discuss PATHGROUPS in some detail as it applies to the median problem with three given genomes and one ancestor to be reconstructed. Then we describe how this works for the simultaneous reconstruction of all the ancestors in the small phylogeny problem.
Paths and fragments
We generalize our definition of a path to be any connected subgraph of a breakpoint graph, namely any connected part of a cycle. Initially, each blue edge in the given genomes is a path.
A fragment is any set of genes connected by red edges in a linear order. The set of fragments represents the current state of the reconstruction procedure. Initially the set of fragments contains all the genes, but no red edges, so each gene is a fragment by itself.
Pathgroups
The objective function for the small phylogeny problem consists of the sum of a number of genomic distances, one distance for every branch in the phylogeny. Each of these distances corresponds to a breakpoint graph. A given genome determines blue edges in one breakpoint graph, while the red edges correspond to the ancestral genome being constructed. For each such ancestor, the red edges are identical in all the breakpoint graphs corresponding to distances to that ancestor. A pathgroup is a set of three paths, all beginning with the same vertex, one path from each partial breakpoint graph currently being constructed. Initially, there is one pathgroup for each non-T vertex. (We do not construct pathgroups for each T vertex separately, to be explained later, though paths ending in T vertices are found in other pathgroups.)
Pathgroups overlap because most paths are in two pathgroups, one associated with its initial vertex and one with its final vertex. With respect to a given path xy, we say the pathgroup determined by vertex x is the partner of the pathgroup determined by y. For the kind of binary (or bifurcating) trees we use, each pathgroup may have up to three distinct partners.
Priorities
Our main algorithm aims to construct three breakpoint graphs with a maximum aggregate number of cycles. At each step it adds an identical red edge to each path in the pathgroup, altering all three breakpoint graphs. This removes two (partner) pathgroups, turning one or more of their paths into cycles and concatenating each of their remaining paths with a path in some other pathgroup. We do not add red edges incident to T vertices. It is always possible to create one cycle, at least, (if not all the paths in the pathgroup end in T) by adding a red edge between the two ends of any one of the paths. The strategy is to create as many cycles as possible. If alternate choices of steps create the same number of cycles, we choose one that sets up the best configuration for the next step. Thus the pathgroups are prioritized,
-
1.
first by the maximum number of cycles that can be created within the group, without giving rise to circular chromosomes.
-
2.
second, for those pathgroups allowing equal numbers of cycles, by considering the maximum number of cycles that could be created in the next iteration of step 1, in any one pathgroup affected by the current choice.
A pathgroup may receive no priority, if creating any cycle within the pathgroup necessarily creates a circular chromosome. Note that in adding a red edge xy, this causes not only the disappearance of two partnered pathgroups, but it also changes paths in other pathgroups, which we call secondary pathgroups. Furthermore, each secondary pathgroup may itself have partner pathgroups whose paths, though not affected by the addition of xy, may have changed priorities. We call these tertiary pathgroups.
The pathgroups and the priorities are illustrated in Figure 2.
The makeCycles algorithm
By maintaining a list of pathgroups for each priority level, and a list of fragment endpoint pairs (initial and final), together with appropriate pointers, the algorithm makeCycles requires O(n) running time.
Algorithm makeCycles
input: pathgroups each consisting of three blue one-edge paths
output: ancestral genome
while: there remain pathgroups with priorities
-
1.
add red edge to pathgroup of highest priority, creating at least one cycle, thus deleting this pathgroup and its partner.
-
2.
update the paths in the secondary pathgroups affected by the addition of the red edge, and update the red fragment extended by this edge or created by the joining together of two existing red fragments.
-
3.
update the priorities of the secondary pathgroups, the tertiary pathgroups, and the at most two pathgroups associated with the endpoints of the red fragment extended or created in step 2.
Not all the red fragments output by makeCycles are complete chromosomes of the ancestral genome; they may just be chromosome fragments. We know
Proposition [2]: Adding a red edge xy in a pathgroup creates at most four secondary pathgroups and at most eight tertiary pathgroups.
The proposition, together with the two facts:
-
1.
the total number of pathgroups decreases by two by each step
-
2.
the calculation or recalculation of the priority of each pathgroup requires constant time
ensure the O(n) running time of the algorithm. If we had allowed red edges to connect T vertices or allowed pathgroups determined by T vertices, the number of potential secondary and tertiary pathgroups affected by the addition of a red edge would have increased considerably, depending on the number of chromosomes in the genomes, but this would not have affected the O(n) property.
Seven priority levels are illustrated in Figure 2, based on the construction of
-
1.
three cycles,
-
2.
two cycles setting up a) three, b) two or c) one in the look-ahead, or
-
3.
one cycle setting up a) three, b) two or c) one in the look-ahead.
Note that when a red edge is defined, the pathgroup is emptied, either by the creation of cycles, or by the integration of x as a non-endpoint of some path.
An example of the solution to the median problem is given as Additional File 1.
Small phylogeny
To apply PATHGROUPS to the small phylogeny problem, we set up an entire set of pathgroups for every internal (ancestral) node. Initially, in the pathgroups for those ancestral nodes connected to two given genomes, one of the paths will be missing and replaced by a single vertex of the breakpoint graph, as illustrated in Figure 2. Those ancestral nodes connected to only one given genome will have two missing paths in each pathgroup, both replaced by the vertex. Finally those ancestral nodes connected only to other internal nodes will have all paths missing in each pathgroup, all replaced by the vertex.
As the algorithm executes, a pathgroup with the highest priority found among any of the internal nodes is chosen to be processed next. Pathgroups connected to two given nodes will tend to be processed first, building up all three paths and combining the pathgroups one by one. Each time a red edge is added to a path, this becomes a blue edge in the corresponding pathgroup for ancestral genome(s) connected to it.
Eventually even the nodes furthest from any given genomes will accumulate enough edges in their pathgroups so that cycles can be formed and so that fragments of the associated genomes begin their reconstruction.