Reconstruction of ancestral gene orders using intermediate genomes

Background The problem of reconstructing ancestral genomes in a given phylogenetic tree arises in many different comparative genomics fields. Here, we focus on reconstructing the gene order of ancestral genomes, a problem that has been largely studied in the past 20 years, especially with the increasing availability of whole genome DNA sequences. There are two main approaches to this problem: event-based methods, that try to find the ancestral genomes that minimize the number of rearrangement events in the tree; and homology-based, that look for conserved structures, such as adjacent genes in the extant genomes, to build the ancestral genomes. Results We propose algorithms that use the concept of intermediate genomes, arising in optimal pairwise rearrangement scenarios. We show that intermediate genomes have combinatorial properties that make them easy to reconstruct, and develop fast algorithms with better reconstructed ancestral genomes than current event-based methods. The proposed framework is also designed to accept extra information, such as results from homology-based approaches, giving rise to combined algorithms with better results than the original methods.


Background
Reconstructing ancestral genomes based on gene order data is an important problem that has been largely studied in the past years, especially with the increasing availability of whole genome DNA sequences. This problem, also called the small parsimony problem, receives as input a phylogenetic tree with extant genomes at its leaves, and asks for the reconstructed gene orders at the internal nodes of the tree, corresponding to ancestral genomes.
There are two main approaches for solving this problem. The first is a parsimonious approach, that will be called event-or distance-based, that assumes a model of rearrangement and reconstructs the ancestral genomes that minimize the total number of rearrangement events on the tree, or said differently, the total tree length, as measured by the rearrangement distance.
The first proposed method using the distance-based approach, BPAnalysis [1], used the breakpoint distance.
With the development of better rearrangement models, such as signed reversals and the Double-Cut-and-Join (DCJ) distance [2,3], new algorithms have been proposed. Notable examples of recent distance-based methods include MGRA [4], PATHGROUPS [5], and GASTS [6], based on minimizing the DCJ distance, and SCJ Small Phylogeny [7], based on the Single-Cut-or-Join (SCJ) model [8]. Since rearrangement distance problems are usually NP-hard for three or more genomes, these methods tend to be time consuming, especially when exact solutions are implemented.
On the other hand, homology methods do not use rearrangement models, but instead look for conserved structures between the input genomes, such as conserved adjacencies or gene clusters, and use this information to assemble Contiguous Ancestral Regions (CARs), as pioneered by Ma et al. in InferCARs [9], an approach explored by many recent algorithms, such as GapAdj [10], ANGES [11], PMAG [12], MLGO [13] and ProCARs [14].
In this paper, a new approach is proposed, inspired by combinatorial ideas from rearrangement models, though not directly motivated as a distance-based method. Specifically, we propose the use of intermediate genomes, arising from optimal rearrangement scenarios between two genomes. The motivation is the fact that, if we consider an internal node in a phylogenetic tree, corresponding to a speciation event, then by the parsimony principle, this internal node should be an intermediate genome in a optimal rearrangement scenario between the two genomes at its children nodes. Yet, most current parsimonious approaches focus only on minimizing events on the tree, without extra restrictions.
This paper is organized as follows. First, we characterize all possible intermediate genomes in a parsimonious path between two genomes, and show that using the restriction that internal nodes must be intermediate genomes improves the ancestral reconstruction results, even though the trees do not minimize the number of events. The proposed framework can easily accept additional information, such as results from homology-based methods, giving rise to combined algorithms that have better results than the original methods.

Preliminaries
We represent multichromosomal genomes using a similar notation as in previous works [3,15]. A gene is an oriented fragment of DNA represented by a signed integer, where the sign denotes the orientation. A chromosome is represented by a sequence of genes, flanked in the extremities by telomeres (○) if the chromosome is linear; otherwise, it is circular. Genomes are sets of chromosomes, where each gene occurs exactly once. A gene g has two extremities, the tail (g t ) and the head (g h ). An adjacency in a genome is an unordered pair of either consecutive gene extremities in a chromosome, or a gene extremity with a telomere (called telomeric adjacency ). A genome G A is represented either by the set of chromosomes, or also by the set of adjacencies A. For instance, the genome G A of Figure 1 has two linear chromosomes and five genes, with Given two genomes A and B with the same set of genes, the breakpoint graph [16] of A and B, denoted by BP (A, B), is a graph G where the vertex set V (G) is the set of gene extremities, and the edge set E(G) is the set of non-telomeric adjacencies of A and B, called A-edges and B-edges, respectively. Adjacencies from A an B are usually drawn with different colors, as seen in Figure 2. A connected component in BP(A, B) is always a color alternating component and will be called AB-component. The size of an AB-component is the number of vertices it has. An AB-component is either a cycle, an even path or an odd path, where the parity of a path is based on the number of vertices of the path (an isolated vertex is considered an odd path). Note that the parity is usually defined on edges, not vertices, but here the vertex parity is more natural.
The Double-Cut-and-Join (DCJ) [2,3] operation rearranges genes in a genome by cutting two adjacencies and rejoining the extremities in a different way. For instance, adjacencies pq and rs become pr and qs, or p○ and qr become pr and q○. It is also possible to cut an adjacency pq into p○ and q○, and the reverse operation.
The DCJ distance d(A, B) between genomes A and B is defined as the minimum number of DCJ operations needed to transform A into B. A sorting scenario S between two genomes A and B is an ordered list of The DCJ distance d(A, B) can be determined using the Breakpoint Graph BP(A, B) [3]: where n is the number of genes, c and o are the number of cycles and odd paths in BP(A, B), respectively. The problem of transforming A into B is then the problem of finding DCJ operations that increase the number of cycles and odd paths in BP(A, B).
Another way of determining the DCJ distance from BP (A, B) is considering the sorting cost of each AB-component independently [8,15]. An AB-component with n vertices has cost n/2 − 1 if it is a cycle, (n − 1)/2 if it is an odd path and n/2 if it is an even path. The sorting cost of a component C will be denoted as d(C). In the example of It is sometimes useful to include telomeric adjacencies in BP(A, B). With similar ideas as used by Braga and Stoye [15], telomere nodes are added in the Breakpoint Graph as follows: for each odd path, add one telomere and connect both telomeric adjacencies of A and B of the odd path to this new telomere node. For each even path, since both telomeric adjacencies are from the same genome, add two telomere nodes, connecting each adjacency with a different telomere, also adding and edge between both telomeres, representing what is usually called a null chromosome on the other genome. This extended version can be called Circular Breakpoint Graph, because it has only cycles, as we can see in Figure 3. The Circular Breakpoint Graph has the advantage that every AB-component is a cycle (called an ABcycle), and any DCJ operation is equivalent to removing two edges of the same color (same genome) and replacing with two new edges that reconnect the four vertices in a different way. Another property of the Circular BP (A, B) is that the sorting cost of each component preserved. An odd path with n vertices has cost (n − 1)/2 in the original BP(A, B). With the added telomere, it becomes a cycle with n+1 vertices, with cost (n + 1)/2 − 1 = (n − 1)/2. Similarly, an even path with n vertices has original cost n/2, and with the two added telomeres it becomes a cycle with n + 2 vertices, with cost (n + 2)/2 − 1 = n/2. Therefore, determining the DCJ distance using the circular BP(A, B) components is slightly simpler, as only cycles are present. Equation (1) can also be used for the circular BP(A, B), with n increased by 1 for each pair of telomeres added, since they are new "genes" in the graph.
During the rest of this paper, the term Breakpoint Graph or BP(A, B) always refers to the circular version, with telomeric adjacencies. Also, since edges correspond to adjacencies, both terms will be used inter-changeably when in context. The study of intermediate genomes is motivated by the fact that if genomes A and B share a common ancestor, by the parsimony principle, it is reasonable to expect that their most recent common ancestor is an intermediate genome of A and B. Therefore, when reconstructing a phylogenetic tree based on gene order and rearrangement events, one possibility is to enforce that any genome corresponding to an internal node on a binary tree must be an intermediate genome of its two children nodes, in a bottom-up approach. One difficulty is that the number of intermediate genomes is potentially huge. To the best of our knowledge there are no studies about the number of intermediate genomes, but the number of possible DCJ sorting scenarios has been determined [15,17], growing exponentially with the distance between the genomes, so one would expect that a similar behavior occurs with intermediate genomes.

Intermediate genomes
A-edges are drawn in green, and B-edges in blue. Nevertheless, intermediate genomes have nice combinatorial properties, and approaches exploiting this properties can be both simple and fast, at the same time producing better results than previous methods, as it will be shown in the Results section.

Properties of intermediate genomes
Braga and Stoye showed that it is possible to sort genomes with DCJ operations acting on each component of the Breakpoint Graph independently [15], and from the two possible ways of applying a DCJ in two edges of a component, one is optimal, reducing the distance between the genomes, and the other is not, as shown in Figure 4. A scenario that follows this strategy will be called independent component scenario. There are scenarios where DCJ operations act on two different components, but these are extremely rare and even non-existing in some cases [15], so they will not be considered here.
It is interesting to notice that, since the optimal DCJ breaks a cycle in two, the newly created edges do not intersect. Conversely, in the non-optimal operation, the edges do intersect. This leads to the intuition that the edges of any intermediate genome M of A and B, when drawn on BP(A, B), do not cross and do not touch two different components. Such a scenario is shown in Figure 5. This is indeed a general property of intermediate genomes, summarized in the following Theorem. Therefore, the presence of only non-crossing chords is a necessary and sufficient condition for an intermediate genome, in scenarios without component recombination. A similar statement was proven by Ouangraoua and Bergeron [17], where it was shown that optimal DCJ scenarios sorting an AB-cycle are maximal chains in the lattice of non-crossing partitions of the cycle.
The proof of Theorem 1 will be derived from a series of small lemmas. The first two prove the necessary condition of the theorem.  Since this is the only crossing, between each of the pair of consecutive vertices p, q, r, s in the AB-cycle there is an even number of vertices, otherwise the edge adjacent to the odd vertex would have to cross pr or sq. Then, w. l.g, p, q, s, r have to appear consecutively in a AM-cycle, while p, s, q, r appear consecutively in a BM-cycle, as shown in Figure 6 Lemma 3 Given an AB-cycle of size 2n with n noncrossing M-edges covering all vertices, the sum of AM-cycles and BM-cycles is n + 1.
Proof The proof is by induction on n, the number of M-edges. Let C denote the complete component with the AB-cycle and the M-edges, and let C A and C B denote the number of AM-cycles and BM-cycles in a component C, respectively. If n = 1, then C A = C B = 1, and the lemma holds, as seen in Figure 7(a).
For n > 1, since the edges of M are non-crossing chords covering the entire cycle, M has always an edge in common with one of the genomes. W.l.g, let's assume an A-edge e = pq is common between A and M , and let's call r and s the vertices adjacent to p and q, belonging to B-edges rp and qs, as shown in Figure 7      On-Line Encyclopedia of Integer Sequences [18]. Therefore, the number of possible intermediate genomes for two genomes A and B, denoted here as IG (A, B), is given by where K ∈ BP(A, B) represent all AB-cycles in BP (A, B), and |K| is the number of vertices in a cycle K.

Intermediate genomes for ancestral reconstruction
Given a tree T with n extant genomes at the leaves, the ancestral genome reconstruction problem asks to reconstruct the ancestral genomes corresponding to the internal nodes of the tree.
One proposed solution to this problem, in the context of a rearrangement distance model, is to find ancestral genomes such that the total length of the tree, defined as the sum of all rearrangement distances on the edges, is minimized. This is sometimes called the small phylogeny problem, since the tree T is given, in contrast to the large phylogeny problem, where no tree is given and the aim is to find, from all possible topologies, a tree with minimum length.
In particular, for three genomes A, B and C, this becomes the well studied genome median problem, where the aim is to find a genome M minimizing d(A, M) + d(B, M) + d(C, M). The median problem is used by many ancestral reconstruction methods as an intermediate step in an iterative approach. Although it is NP-hard for DCJ and many models, it is widely studied, with several exact and heuristic methods already proposed [19,20]. On the other hand, some authors have recently conjectured that it might not be the best option for ancestral inference, since in many instances the solution is one of the input genomes, giving no new information for ancestral reconstruction [21].
Using the concept of intermediate genomes, the following alternative problems are proposed: Problem 1 (Intermediate Genome Small Phylogeny) Given a binary tree T with n extant genomes at the leaves, find ancestral genomes corresponding to the internal nodes of the tree such that the tree length is minimized, with the restriction that each internal node is an intermediate genome of its children.  [22], where they try to find intermediate genomes that are simultaneously in scenarios from A to C and from B to C, and are as far as as possible to C, meaning that they are close to the input genomes A and B. In order to do that, it is necessary to find the common intermediate genomes in both scenarios, and this usually leads to an exponential number of scenarios to be considered. Therefore, exact solutions are only found for small genomes, and heuristics have to be used for larger genomes.
The approach proposed here has the advantage that we can detect intermediate genomes without the need to find rearrangement scenarios. The number of intermediate genomes is still exponential, but the use of additional information from the outgroup genome or external input can greatly reduce the search space, as we will see in the next section.
It is likely that the intermediate genome problems proposed here are NP-hard, as their original counterparts, but this remains an open question. Similarly as previous algorithms, one can develop an approach to solve Problem 1 by solving iteratively several instances of Problem 2. But, as a simpler first approach, the solution for the small phylogeny problem presented here is based on a bottom-up procedure, where at each step an internal node is reconstructed as an intermediate genome between its two children, who must be leaves in the tree. In order to do that, instead of minimizing the distance as in Problem 2, we will use the concept of adjacency guides.

Adjacency guides
We propose to solve the following problem: Problem 3 (Guided Intermediate Genome) Given a set of adjacencies G, called the adjacency guide, and two genomes A and B, find an intermediate genome M of A and B using the adjacencies G as a guide. This is intentionally a very open definition. What does it mean to use G as a guide? One could maximize the number of adjacencies of G that are present in M , for instance, giving rise to a maximum cardinality independent set problem in a circle graph, that can be solved polynomially. A similar idea was used by Swenson et al. [23] in the context of finding the maximum number of non-interfering reversals in the breakpoint graph.
But, more simply, just consider each adjacency a ∈ G sequentially, adding a if both vertices belong to the same AB-cycle in BP (A, B), and it does not cross with previously added adjacencies. Also, a has to respect the parity constraint, as seen in Lemma 2, where each edge in M has to split an AB-cycle in two cycles with even number of vertices.
This problem can be solved independently and recursively for each AB-cycle. For an added adjacency a, the AB-cycle is divided in two and the algorithm is then executed on both cycles. This is the Guided IG function from Algorithm 1, with a small example shown in Figure 8.
An interesting property of this algorithm is that it can introduce new adjacencies, other than the ones from the guide G, when this is the only option for an intermediate genome. Specifically, for any AB-cycle of size 2, either from the original BP(A, B) or coming from the recursion, the corresponding adjacency is chosen, since there is no other option, as M has to be an intermediate genome. For instance, in Figure 8, although only two adjacencies from the guide are used, it is possible to infer three more, because of the three cycles of size 2.
The next question is then, what to do with the larger AB-cycles? One can think of two ways to answer this question, based on the different approach between homology-based and distance-based algorithms.
As an homology-based algorithm, after the guide is processed, no new adjacency is added in M , since the aim is to reconstruct contiguous ancestral regions (CARs), without the need for a complete genome, and there is not enough information to choose new adjacencies. If we think as a distance algorithm, trying to minimize the tree length, it is not a good idea to have a fragmented genome, and completing the adjacencies in any non-crossing way builds an intermediate genome M , minimizing d(A, M) + d(B, M). Since at this point in the algorithm we have no extra information, one way would be to just choose noncrossing edges randomly. In practice the adjacencies were chosen all from the same colored edges, either all A-edges or all B-edges, in an attempt to maximize the common adjacencies with input genomes A and B. The number of common adjacencies does not correspond directly to the number of A-or B-edges in a component, since the recursion produces new A-and B-edges that might not be present in the original genomes. From the two possible ways to do that (all A-or all B-edges), we choose the one with more common adjacencies. For instance, on the AB-cycle of Figure 8, we can choose either the two green or two blue edges, but both blue ones are original from B, and only one green is from A, therefore we choose the blue adjacencies. If there is no difference, one option is picked randomly.
A pseudo-code of the main functions of the ancestral reconstruction algorithm is shown in Algorithm 1.

Finding adjacency guides
How do we find adjacency guides? The simplest answer is to search directly in the tree, traversing it from the internal node being reconstructed. For each leaf found, from closest to farthest, its adjacencies are added to the guide. It is a very naive algorithm but, somewhat surprisingly, it can give good results. This guide will be called leaf guide.
Better adjacency guides potentially improve the reconstruction, and there is already a good amount of well designed algorithms for finding those guides. Namely, the output of any homology algorithm, such as InferCARs [9], ANGES [11], SCJ [7] or Pro-CARs [14], can be directly used as a guide. It does not mean that all adjacencies from the homology algorithm are automatically added to the ancestral genome; each adjacency is still going to be considered and possibly rejected if it breaks any intermediate genome property. This type of guide is the ancestral guide.
Any combination of ancestral guides and leaf guide could potentially be used. In the Results section a more detailed description of the combinations tested in this work is given.

Results
The proposed algorithms were implemented in Python and tested on simulated datasets, where a tree with genomes at the leaves is given as input, and the algorithms try to reconstruct the ancestral genomes at internal nodes of the tree. The amount of correct adjacencies, missing adjacencies and wrong adjacencies for each ancestral genome is then measured.
To compare the proposed algorithms, as the homology methods representatives we ran SCJ [7] and Pro-CARs [14]. SCJ was chosen because since it is a very conservative method, it has a very low false positive rate [7], providing an almost error-free set of adjacencies that the intermediate genomes method can improve upon. ProCARs was chosen because it was the easiest of the homology methods to run, and according to its authors it has similar results when compared to ANGES, Infer-CARs and GapAdj [14]. For the distance methods, PATHGROUPS [5] and GASTS [6] were chosen, as these are the most recent methods for DCJ-based small phylogeny.
For the Intermediate Genomes algorithm, four variations of ancestral guides were tested: no ancestral guide, SCJ, ProCARs and perfect. The perfect guide is a "cheating algorithm" that uses as guide the correct ancestral genome, from the simulation. This is done to determine some form of "upper bound" on the proposed IG algorithms, as we can measure how many adjacencies are rejected by the algorithm, even though they are present in the correct ancestral genome. On all four types, after all adjacencies from the ancestral guide are applied, the leaf guide is used. These four variants will be called IG-Pure, IG-SCJ, IG-ProCARS and IG-perfect, depending on the ancestral guide.
For each of these four guide options, two variants were tested: the homology approach, were no extra adjacencies are added in the guided IG problem, for components larger than 2; and the distance approach, were adjacencies are added, and since there is some randomization in the choice, the algorithm runs 50 times, with the shortest DCJ distance tree chosen as the output. pq pop_first_element(G) 20: if p ∉ or q ∉ C then: continue 21: i C.idx(p) 22: j C.idx(q) 23: if (j − i) mod 2 = 0 then: continue 24: Algorithm 1: Ancestral Reconstruction with Intermediate Genomes. The main function is IG_SMALL_-PHYLOGENY, that receives a tree T and a genome list G , at the leaves of T , and returns a new list with added ancestral genomes. In a bottom-up approach, it chooses two leaves, to reconstruct the ancestral parent node. First, it builds an adjacency guide with ancestral and/or leaf guides, as described in the text. Then it calls GUIDED_IG, that will build a set of adjacencies of the ancestral genome by calling GUIDED_IG_C for each component of BP (A, B), which in turn calls itself recursively each time an adjacency is applied in a component. When there are no adjacency guides for a component, in the homology approach no new adjacency is returned; in the distance approach, two sets are built, with all adjacencies from the same color, and the one with more common adjacencies with the input genomes is returned. After an ancestral genome is reconstructed, its leaves are pruned so it becomes a leaf and the main loop continues until all the ancestral genomes have been reconstructed.

Simulated datasets
For each simulation, similarly as previous works [24], a birth-death model with a birth rate of 0.001 and a death rate of 0 generates an ultrametric tree with a chosen number of leaves. To modify the branch lengths, for each branch a real number d is uniformly chosen from the interval [−c, +c], and the branch length is multiplied by e d . For this experiments, c = 2. The branch lengths are then rescaled so the tree has a target diameter D. At the root of the tree, a genome is generated and evolved through the tree, applying l rearrangements in each branch, where l is the branch length. The rearrangement events are randomly chosen between reversals (90%), transpositions (5%) and translocations (5%). We generated 2 main datasets, one with 20 leaf genomes with 500 genes, and another with 12 leaf genomes with 5000 genes. Also, the diameter D varies from 1n to 4n, where n is the number of genes. For each D, 10 simulations were generated. The leaf genomes and the tree topology is then given as input for the ancestral reconstruction methods.

Discussion
The simulation results are summarized in Figure 9, for the homology algorithms, and in Figures 10, 11 and 12, for the distance algorithms. More detailed results are shown in Table 1 and running times for two different datasets are displayed in Table 2.
For each internal node of the tree, given the reconstructed genome R and the simulated ancestral S, three measurements were made: correct adjacencies (true positives in Table 1), adjacencies in R also present in S; missing adjacencies (false negatives in Table 1), present in S but not in R; and wrong adjacencies (false positives in Table 1), present in R but not in S. The percentage of all adjacencies in each category, for each different tree diameter D, is shown in the figures and table. For the distance methods, the total DCJ length of the tree is also measured, as well as the DCJ error, the distance between R and S, and the total sum of these values is shown on Figure 10.
The homology algorithms have a very low error rate, because they tend to choose adjacencies only when there is good information for doing it. In contrast, distance algorithms usually have a higher number correct adjacencies, at the cost of a much larger error rate, since they have to build more complete genomes that minimize distances, filling the missing adjacencies.
For the homology algorithms comparison, we can see that the IG-Pure algorithm already has good results in terms of correctly reconstructed adjacencies, almost the same as ProCARs. When combined with SCJ and Pro-CARs as ancestral guide, the IG algorithm improves on  Compared to the distance methods, the IG algorithms have better results in almost all measurements, especially as the tree diameter increases. While the DCJ tree length from PATHGROUPS and GASTS is always smaller on average, which is expected since they are distance minimizers, the IG algorithms are quite close, even though they are not designed to minimize the tree length, and in a few instances the IG-ProCARs algorithm even returned a smaller tree. More interestingly, the DCJ error between the reconstructed and real ancestral genomes is smaller in the IG methods, significantly more as the diameter increases, raising the question: is it the best choice to find a genome median, minimizing the length of the tree, when this median is actually more distant to the true ancestral genome?
On the other hand, if we focus on the IG-perfect results shown in Table 1 we can identify the proportion of adjacencies from the true ancestral genomes that were rejected because they are not part of the intermediate genome being reconstructed. The more distant the genomes are, the more adjacencies are rejected, with almost 30% of the adjacencies of the true ancestors being rejected when the diameter is 4n. This shows that even though restricting intermediate genomes on internal nodes seems to enhance results over the unrestricted versions, there is still much to be improved. An open question is whether a mixed approach can be developed, allowing some flexibility on departing from the intermediate genome restriction, potentially obtaining better results and, perhaps the more challenging aspect, how can we detect when to do it?

Conclusion
A new approach for ancestral reconstruction of gene orders is proposed. The key aspect is the restriction that each ancestral genome must be an intermediate genome of its children. This would seem like a basic assumption in parsimonious methods, but it was not used in previous methods. The results of this paper indicate that, even with naive approaches, simple algorithms based on intermediate genomes are fast and obtain good results. Furthermore, they are easily combined with homology-based algorithms, enhancing their results.
There are many directions for improving the current approach, as many simple algorithm design decisions were made in this first study. For instance, the order of the adjacencies in the guides is definitely important in the reconstructed ancestral genomes, as it potentially changes which adjacencies are accepted or rejected. A better logic in selecting which adjacencies are used and in which order they should be considered might increase the number of correctly reconstructed adjacencies. For the distance-based approach, when adjacencies have to be included without guide information, instead of the simple approach presented here, a better option would be to choose intermediate genomes that try to respect the distance given by the tree branch lengths, if available.
In a more high level view, it would be interesting to investigate the concept of intermediate genomes in more complex DCJ models, that include insertion and deletions events [25,26], to develop more complete ancestral reconstruction methods, that can accept genomes with different sets of genes, a limitation of any parsimonious method based on the basic DCJ distance.