Volume 13 Supplement 10

## Selected articles from the 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11)

# Gene order in rosid phylogeny, inferred from pairwise syntenies among extant genomes

- Chunfang Zheng
^{1}and - David Sankoff
^{1}Email author

**13(Suppl 10)**:S9

**DOI: **10.1186/1471-2105-13-S10-S9

© Zheng and Sankoff et al; licensee BioMed Central Ltd. 2012

**Published: **25 June 2012

## Abstract

### Background

Ancestral gene order reconstruction for flowering plants has lagged behind developments in yeasts, insects and higher animals, because of the recency of widespread plant genome sequencing, sequencers' embargoes on public data use, paralogies due to whole genome duplication (WGD) and fractionation of undeleted duplicates, extensive paralogy from other sources, and the computational cost of existing methods.

### Results

We address these problems, using the gene order of four core eudicot genomes (cacao, castor bean, papaya and grapevine) that have escaped any recent WGD events, and two others (poplar and cucumber) that descend from independent WGD s, in inferring the ancestral gene order of the rosid clade and those of its main subgroups, the fabids and malvids. We improve and adapt techniques including the OMG method for extracting large, paralogy-free, multiple orthologies from conflated pairwise synteny data among the six genomes and the PATHGROUPS approach for ancestral gene order reconstruction in a given phylogeny, where some genomes may be descendants of WGD events. We use the gene order evidence to evaluate the hypothesis that the order Malpighiales belongs to the malvids rather than as traditionally assigned to the fabids.

### Conclusions

Gene orders of ancestral eudicot species, involving 10,000 or more genes can be reconstructed in an efficient, parsimonious and consistent way, despite paralogies due to WGD and other processes. Pairwise genomic syntenies provide appropriate input to a parameter-free procedure of multiple ortholog identification followed by gene-order reconstruction in solving instances of the "small phylogeny" problem.

## Background

- 1.
The relative recency of these data. Although almost twenty dicotyledon angiosperms have been sequenced and released, most of this has taken place in the last two years (at the time of writing) and the comparative genomics analysis has been reserved by the various sequencing consortia for their own first publication, often delayed for years following the initial data release.

- 2.
Algorithms maximizing a well-defined objective function for reconstructing ancestors through the median constructions and other methods are computationally costly, increasing both with

*n*, the number of genes orthologous across the genomes, and especially with $\frac{d}{n}$, where*d*is the number of rearrangements occurring along a branch of the tree. Indeed, even with moderate values of $\frac{d}{n}$, these methods may fail to execute at all in reasonable time. - 3.
Whole genome duplication (WGD), which is rife in the plant world, particularly among the angiosperms [13, 14], sets up a comparability barrier between those species descending from a WGD event and species in all other lineages originating before the event [3]. This is largely due to the process of duplicate gene reduction, eventually affecting most pairs of duplicate genes created by the WGD, which distributes the surviving members of duplicate pairs between two homeologous chromosomal segments in an unpredictable way,

*fractionation*[15–19], thus scrambling gene order and disrupting the phylogenetic signal. This difficulty is compounded by the residual duplicate gene pairs created by the WGD, complicating orthology identification essential for gene order comparison between species descended from the doubling event and those outside it. - 4.
Global reconstruction methods are initially designed to work under the assumption of identical gene complement across the genomes, but if we look at dicotyledons, for example, each time we increase the set of genomes being studied by one, the number of genes common to the whole set is reduced by approximately $\frac{1}{3}$. Even comparing six genomes, retaining only the genes common to all six, removes 85% of the genes from each genome, almost completely spoiling the study as far as local syntenies are concerned.

- 1.
The software has been developed and tested with all the released and annotated dicotyledon genome sequences, even though "ethical" claims by sequencing consortia leaders discourage the publication of the results on the majority of them at this time. In this enterprise, we benefit from the up-to-date and well organized CO GE platform [2, 20], with its database of thousands of genome sequences and its sophisticated, user-friendly SYN MAP facility for extraction of synteny blocks.

- 2.
PATHGROUPS aims to rapidly reconstruct ancestral genomes according to a minimum total rearrangement count (using the DCJ metric [21]) along all the branches of a phylogenetic tree. PATHGROUPS' speed is due to its heuristic approach (greedy search with look-ahead), which entails a small accuracy penalty as $\frac{d}{n}$ increases, but allows it to return a solution for values of $\frac{d}{n}$ where exact methods are no longer feasible. The implementation first produces a rapid initial solution of the "small phylogeny" problem (i.e., where the tree topology is given and the ancestral genomes are to be constructed), followed by an iterative improvement treating each ancestral node as a median problem (one unknown genome to be constructed on the basis of the three given adjacent genomes), using techniques to avoid convergence to local minima.

- 3.
The comparability barrier erected by a WGD event is not completely impenetrable, even though gene order fractionation is further confounded by genome rearrangement events. The WGD-origin duplicate pairs remaining in the modern genome will contain much information about gene order in the ancestral diploid, immediately before WGD. The gene order information is retrievable through the method of

*genome halving*[22], which is incorporated in a natural way into PATHGROUPS, where it is combined with information on single-copy genes. - 4.
One of the main technical contributions of this paper is the feature of PATHGROUPS that allows the genome complement of the input genomes to vary. Where the restriction to equal gene complement would lead to reconstructions involving only about 15% of the genes, the new feature allows close to 100% of the genes with orthologs in at least two genomes to appear in the reconstructions. The other key innovation we put to phylogenetic use for the first time here is our "orthologs for multiple genomes" (OMG) method for combining the genes in the synteny block sets output by SYN MAP for pairs of genomes, into orthology sets containing at most one gene from every genome in the phylogeny [23].

Both the PATHGROUPS and the OMG procedures are parameter-free. There are no thresholds or other arbitrary settings. We argue that the appropriate moment to tinker with such parameters is during the synteny block construction and not during the orthology set construction nor the ancestral genome reconstruction. A well-tuned synteny block method goes a long way to attenuate genome alignment problems due to paralogy. It is also the appropriate point to incorporate thresholds for declaring homology, since these depend on evolutionary divergence, which is specific to pairs of genomes. Finally, the natural criteria for constructing pairwise syntenies do not extend in obvious ways to three or more genomes.

## Methods

### Six eudicotyledon sequences

*Arabidopsis*, and some for which the gene annotations are not easily accessible leaves us the six depicted in Figure 1, namely cacao [24], castor bean [25], cucumber [26], grapevine [27, 28], papaya [29] and poplar [30]. Of the two main eudicot clades, asterids and rosids, only the latter is represented, including the order Vitales, considered the closest relative of the eurosids [13, 31]. Poplar and cucumber are the only two to have undergone ancestral WGD since the divergence of the grapevine.

### Formal methods

A genome is a set of chromosomes, each chromosome consisting of a number of genes linearly ordered. The genes are all distinct and each has positive or negative polarity, indicating on which of the two DNA strands the gene is located.

Genomes can be rearranged through the accumulated operation of number of processes: inversion, reciprocal translocation, transposition, chromosome fusion and fission. These can all be subsumed under a single operation called double-cut-and-join which we do not describe here. For our purposes all we need is a formula due to Yancopoulos *et al*. [21], stated below, that gives the genomic distance, or length of a branch in a phylogeny, in terms of the minimum number of rearrangement operations needed to transform one genome into another.

#### Rearrangement distance

*d*(

*G*

_{1},

*G*

_{2}) is a metric counting the number of rearrangement operations necessary to transform one multichromosomal gene order

*G*

_{1}into another

*G*

_{2}, where both contain the same

*n*genes. To calculate

*d*efficiently, we use the breakpoint graph of

*G*

_{1}and

*G*

_{2}, constructed as illustrated in Figure 2: For each genome, each gene

*g*with a positive polarity is replaced by two vertices representing its two ends, i.e., by a "tail" vertex and a "head" vertex in the order

*g*

_{ t },

*g*

_{ h }; for -

*g*we would put

*g*

_{ h },

*g*

_{ t }. Each pair of successive genes in the gene order defines an adjacency, namely the pair of vertices that are adjacent in the vertex order thus induced. For example, if

*i*,

*j*,

*-k*are three neighbouring genes on a chromosome then the unordered pairs {

*i*

_{ h },

*j*

_{ t }} and {

*j*

_{ h },

*k*

_{ h }} are the two adjacencies they define. There are two special vertices called telomeres for each linear chromosome, namely the first vertex from the first gene on the chromosome and the second vertex from the last gene on the chromosome.

If there are *m* genes on a chromosome, there are 2*m* vertices at this stage. As mentioned, the first and the last of these vertices are telomeres. We convert all the telomeres in genome *G*_{1} and *G*_{2} into adjacencies with additional vertices all labelled *T*_{1} or *T*_{2}, respectively. The breakpoint graph has a blue edge connecting the vertices in each adjacency in *G*_{1} and a red edge for each adjacency in *G*_{2}. We make a cycle of any path ending in two *T*_{1} or two *T*_{2} vertices, connecting them by a red or blue edge, respectively, while for a path ending in a *T*_{1} and a *T*_{2}, we collapse them to a single vertex denoted "*T*".

*κ*alternating cycles. If

*n*' is the number of blue edges, then [21]:

#### The median problem and small phylogeny problem

Let *G*_{1}, *G*_{2} and *G*_{3} be three genomes on the same set of *n* genes. *The rearrangement median problem is to find a genome M such that* *d*(*G*_{1}, *M*) + *d*(*G*_{2}, *M*) + *d*(*G*_{3}, *M*) *is minimal*.

*T*on

*N*given genomes

*G*

_{1},

*G*

_{2}, ⋯,

*G*

_{ N }(and thus with

*N*- 2 unknown ancestral genomes

*M*

_{1},

*M*

_{2}, ⋯,

*M*

_{N - 2}and 2

*N*- 3 branches) as depicted in Figure 3,

*the small phylogeny problem is to infer the ancestral genomes so that the total edge length of*

*T*,

*namely*

*is minimal*.

The computational complexity of the median problem, which is just the small phylogeny problem with *N* = 3, is known to be NP-hard and hence so is that of the general small phylogeny problem.

### The OMG problem

#### Pairwise orthologies

As justified in the Introduction, we construct sets of orthologous genes across the set of genomes by first identifying pairwise synteny blocks of genes. In our study, genomic data were obtained and homologies identified within synteny blocks, using the SYN MAP tool in CO GE[2, 20]. This was applied to the six dicot genomes in CO GE shown in Figure 1, i.e., to 15 pairs of genomes. We repeated all the analyses to be described here using the default parameters of SYN MAP, with minimum block size 1, 2, 3 and 5 genes.

#### Multi-genome orthology sets

The pairwise homologies SYN MAP provided for all 15 pairs of genomes determine the set of edges *E* of the *homology graph* *H = (V*, *E)*, where *V* is the set of genes in any of the genomes participating in at least one homology relation.

The understanding of orthologous genes in two genomes as originating in a single gene in the most recent common ancestor of the two species, leads logically to transitivity as a necessary consequence. If gene *x* in genome *X* is orthologous both to gene *y* in genome *Y* and gene *z* in genome *Z*, then *y* and *z* must also be orthologous, even if SYN MAP does not detect any homology between *y* and *z*. The operational criteria for identifying homologs in SYN MAP, combining sequence similarity and syntenic context correspondences, may sometimes indicate that *x* is homologous to *y* and *z*, but not necessarily that *y* and *z* are homologous. This may be due to threshold criteria, differing rates or durations of evolution, or simply statistical fluctuation. Nevertheless, it seems logical to extend all homology relations by transitivity, so that in this example we will consider *y* to be homologous to *z*.

Ideally, then, all the genes in a connected component of *H* should be orthologous; insofar as SYN MAP resolves all relations of paralogy, we should expect *at most* one gene from each genome in such an orthology set, or two for genomes that descend from a WGD event.

In practice, gene *x* in genome *X* may be identified as homologous to both *y*_{1} and *y*_{2} in genome *Y*. Or *x* in *X* is homologous both to gene *y*_{1} in genome *Y* and gene *z* in genome *Z*, while *z* is also homologous to *y*_{2}. By transitivity, we again obtain that *x* is homologous to both *y*_{1} and *y*_{2} in the same genome. While one gene being homologous to several paralogs in another genome is commonplace and meaningful, this should be relatively rare in the output from SYN MAP, where syntenic correspondence is a criterion for resolving paralogy. Aside from tandem duplicates, which do not interfere with gene order calculations, and duplicates stemming from WGD events (which are handled separately by our methods [10]), we consider duplicate homologs in the same genome, inferred directly by SYN MAP or indirectly by being members of the same connected component, as evidence of error or noise.

Suppose *G* = (*V*_{
G
}, *E*_{
G
}) is a connected component of *H* with duplicate homologs in the same genome (or more than two in the case of a WGD descendant). We delete a subset of edges *E'* ⊂ *E*_{
G
}, so that the remaining graph *Q* decomposes into smaller connected components, *Q* = *Q*_{1} ∪ ⋯ ∪*Q*_{
t
}, where each *Q*_{
i
} is free from (non-WGD) paralogy. To decide which edges to delete, we define an objective function to be the total number of edges in the transitive closure of *Q*, i.e., in all the cliques generated by the components *Q*_{
i
}. In other words, we seek to maximize ${\sum}_{1}^{t}\left(\begin{array}{c}\hfill \left|{E}_{i}\right|\hfill \\ \hfill 2\hfill \end{array}\right)$, where *Q*_{
i
} = (*V*_{
i
}, *E*_{
i
}). We are not aware of any algorithm for this problem, aside from the heuristic we have recently developed [23], presented here in simplified form, but conjecture it to be NP-hard.

Let $\stackrel{\u0304}{P}$ be the transitive closure of any graph *P*. To obtain $\stackrel{\u0304}{P}$ we can raise its adjacency matrix *M*_{
P
} (including 1's on the diagonal) to successively higher powers ${M}_{P}^{r}$ until a maximal set of non-zero elements is attained. These non-zero elements correspond to the edges of the connectivity graph $\stackrel{\u0304}{P}$, which is the union of a set of disjoint cliques. In practice, the construction of $\stackrel{\u0304}{P}$ can be made more efficient using Warshall's algorithm [32].

The edges of $\u1e20$, where *G* is a component of the homology graph *H*, define a single clique, since *G* is connected. These edges represent both given and indirectly inferred orthologies as discussed above, but there may be paralogies. To remedy this by deleting edges from *G* to produce an optimal union of paralogy-free components *Q* = *Q*_{1}∪ ⋯ ∪*Q*_{
t
}, we first examine the star subgraph *s*(*v*) of $\u1e20$ containing *ν*(*v*) vertices, namely *v*, its *ν*(*v*) *-* 1 neighbours, and the *ν*(*v*) - 1 edges connecting the former to the latter.

Let *c*(*v*) ≥ 1 be the number of distinct genomes represented among the vertices in *s*(*v*). Let $F(E)={\displaystyle {\sum}_{{}_{v\in V}}c(v)}$.

### Without WGD descendants

- 1.
set

*E*' = ∅. - 2.
**while**there are still some*v*∈*V*where*ν*(*v*) >*c*(*v*),- (a)
find the edge

*e*∈*E*\*E*' that maximizes - (b)
**if**there are several such*e*, find the one that minimizes - (c)
*E*←*E*' ∪ {*e*}

$F\left(E\backslash {E}^{\u2033}\right)=\sum _{v\in V}c\left(v\right),\phantom{\rule{0.3em}{0ex}}\mathsf{\text{where}}\phantom{\rule{0.3em}{0ex}}{E}^{\u2033}={E}^{\prime}\cup \left\{e\right\}$${F}^{+}\left(E\backslash {E}^{\u2033}\right)=\sum _{\left(V,E\backslash {E}^{\u2033}\right)}\nu \left(v\right)-c\left(v\right).$ - (a)
- 3.
relabel as

*Q*_{1}, ⋯,*Q*_{ t }the disjoint components created by deleting edges. These contain the vertices of the required components of*Q*.

Implicit in each greedy step is an attempt to create large orthology sets. If the deleted edges create two partitioned components, i.e., each with no internal paralogy, then the increment in *F* will be proportional to the sum of the squares of the number of vertices in each one. This favours a decomposition into one large and one small component rather than two equal sized components.

WGD **descendants allowed**. To handle paralogs of WGD origin, the definition of *c*(*v*) must be amended to take account an allowance of 2 vertices from a single genome in *s*(*v*) if these are from the appropriate genomes. And the condition in Step 2 must require that at most two vertices be contained in *s*(*v*) from any one genome, and only if these involve WGD descendants.

Note that it is neither practical nor necessary to deal with *H* in its entirety, with its hundred thousand or so edges. It suffices to delete edges, if necessary, from each connected component *G* independently. Typically, this will contain only a few genes and very rarely more than 100. The output is a decomposition of *G* into two or more smaller sets with no undesired paralogy. These are the orthology sets we input into the gene order reconstruction step.

For the small number (typically from 1 to 5) of very large components *G* we encounter, called "tangles" in [23], we break them into more tractable size sets by extracting genes with large numbers of homologs, together with their immediate homologs, and treat them independently. This is done recursively on the remaining part of *G* until a small enough set of homologies is obtained that can be handled by the procedure detailed above.

Though these procedures require only a few minutes of computation, there are a number of devices we employ to slash this time without materially affecting the end results of our analysis. One is simply to remove at the outset all components *G* containing only two genes from two genomes separated by three or more ancestral nodes in the given phylogenetic tree. The algorithms later in our pipeline would not infer an ortholog of such genes in the ancestral genomes, so there is no point in including them in the analysis. This step allows great computational saving when the minimum size of syntenic blocks in SYN MAP is set to 1.

### PATHGROUPS

Once we have our solution to the OMG problem on the set of pairwise syntenies, we can proceed to reconstruct the ancestral genomes. First, we briefly review the PATHGROUPS approach (previously detailed in [10, 11]) as it applies to the median problem with three given genomes and one ancestor to be reconstructed, *all having the same gene complement*. The same principles apply to the simultaneous reconstruction of all the ancestors in the small phylogeny problem, and to the incorporation of genomes having previously undergone WGD.

We redefine a path to be any connected subgraph of a breakpoint graph, namely any connected part of a cycle. Initially, each blue edge in the given genomes is a path. A *fragment* is any set of genes connected by red edges in a linear order. The set of fragments represents the current state of the reconstruction procedure. Initially the set of fragments contains all the genes, but no red edges, so each gene is a fragment by itself.

The objective function for the small phylogeny problem consists of the sum of a number of genomic distances, one distance for every branch in the phylogeny. Each of these distances corresponds to a breakpoint graph. A given genome determines blue edges in one breakpoint graph, while the red edges correspond to the ancestral genome being constructed. For each such ancestor, *the red edges are identical in all the breakpoint graphs corresponding to distances to that ancestor*.

A pathgroup is a set of three paths, all beginning with the same vertex, one path from each partial breakpoint graph currently being constructed. Initially, there is one pathgroup for each vertex.

- 1.
by the maximum number of cycles that can be created within the group, without giving rise to circular chromosomes, and

- 2.
for those pathgroups allowing equal numbers of cycles, by considering the maximum number of cycles that could be created in the next iteration of step 1, in any one pathgroup affected by the current choice.

By maintaining a list of pathgroups for each priority level, and a list of fragment endpoint pairs (initial and final), together with appropriate pointers, the algorithm requires *O*(*n*) running time.

In the current implementation of PATHGROUPS[11], much greater accuracy, with little additional computational cost, is achieved by designing a refined set of 163 priorities, based on a two-step look-ahead greedy algorithm.

For completeness, we remark that some genomes are incompletely assembled and only available in the form of fragmented chromosomes. These are treated as full chromosomes by our procedures; for this and other reasons the reconstructed ancestors may also be output as chromosomal fragments. To correct the distance between two such fragmented genomes, we note that part of the DCJ distance allows for a number of chromosomal fusions or fissions to equalize the numbers of chromosomes in the two genomes. This number is a methodological artifact and should be removed from the DCJ score to estimate the true evolutionary distance. Details of this correction have been published elsewhere [33].

#### Inferring the gene content of ancestral genomes

The assumption of equal gene content simplifies the mathematics of PATHGROUPS and allows for rapid computation. Unfortunately it also drastically reduces the number of genes available for ancestral reconstruction, so that the method loses its utility when more than a few genomes are involved.

In this section, we address the problem of assigning gene content to the ancestral genomes, a question that was avoided previously when all genomes had the same content. Then in the next section we show how to adapt PATHGROUPS to the unequal gene content median problem.

There are two natural ways to assign genes parsimoniously to ancestral genomes. One is to treat a different presence or absence status at the two ends of a branch of a phylogenetic tree as an evolutionary event, and to minimize, by dynamic programming, the number of events for each gene. However, if we have a rooted tree, it is may be more appropriate to allow any number of loss events for a gene but only one gain (innovation) event, since convergent evolution of a gene is unlikely. With real data sets, however, this rule (Dollo's principle) may be too restrictive. In our implementation, we compromise, in allowing multiple gains but when there are equally costly choices during execution of the assignment algorithm, to choose the one that attributes the gain as early in the tree as possible.

Using dynamic programming on unrooted trees, our assignment of genes to ancestors simply assures that if a gene is in at least two of the three adjacent nodes of an ancestral genome, it will be in that ancestor. If it is in less than two of the adjacent nodes, it will be absent from the ancestor.

#### Median and small phylogeny problems with unequal genomes

To generalize our construction of the three breakpoint graphs for the median problem to the case of three unequal genomes, we set up the pathgroups much as before, and we use a slightly modified priority structure. Each pathgroup, however, may have three paths, as before, or only two paths, if the initial vertex of the paths comes from a gene absent from one of the leaves. Moreover, when one or two cycles are completed by drawing a red edge, this edge must be left out of the third breakpoint graph if the corresponding gene is missing from the third genome.

The consequence of this strategy is that some of the paths in the breakpoint graph will never be completed into cycles, impeding the evaluation of the objective function (1). We could continue to search for cycles under a weakened definition, but this would be computationally costly to do in an exhaustive way, spoiling the linear run time property of the algorithm.

The small phylogeny problem can be formulated and solved using the same principles as the median problem, as with the case of equal genomes. The solution, however, only serves as an initialization. As in [11], the solution can be improved by applying the median algorithm to each ancestral node in turn, based on the three neighbour nodes, and iterating until convergence of the total tree length (2). At each step, the new median is accepted if the sum of the three branch lengths is no greater than the existing one. This strategy of allowing the median to change as long as it does not increase total tree length is effective in exploring local solution space and avoiding local minima.

## Results

### Coping with fractionation

*T*of a WGD into a phylogeny by creating an immediate median-like ancestor node

*A*in the tree where two of the paths (say

*G*

_{1}and

*G*

_{2}in Figure 4 connect to

*T*and the third (

*G*

_{3}) to an ancestral node

*R*in the phylogeny. Like all ancestral nodes,

*R*is connected to two other nodes in the tree, leaves or ancestral.

There are some technical differences connected with avoiding the creation of circular chromosomes in PATHGROUPS for WGD. Our current implementation can only handle the case where *T* contains exactly two copies of every gene in *R*. Thus we consider only the duplicate genes in *T* in constructing *A* during the small phylogeny analysis. After this is constructed single-copy genes are added to *A* in a way that does not change the DCJ distance (1). This simply involves inserting in *A* each run of single-copy genes next to one of its adjacent (in *T*) double-copy genes *g*, and inserting the same run of single-copy genes next to the duplicate of *g* as well. Sometimes both copies of *g* have adjacent single-copy runs in *T*, due to the process of fractionation. In this case the two single-copy runs must be merged (or *consolidated* [15]). Using present methods, evidence from *R* does not contribute to how this merger proceeds, so that the gene order in this consolidated run may have a large random component. This is particularly true of longer runs, with more than two or three genes.

Adding some randomness to a gene order will tend to create roughly one new rearrangement per added breakpoint [34] and fractionation tends to involve deletions of two or three consecutive WGD paralogs [18], though many of the deletions will be adjacent, creating longer runs of single-copy genes.

This suggests that the distance between *A* and *R* may be exaggerated by a the addition of anywhere from $\frac{1}{4}$*s* to $\frac{1}{2}$*s* on the average for each single-copy run of length *s* for *s* larger than some cutoff value. Therefore, as a crude correction, we deduct from the distance $\frac{1}{3}$*s* for *s* > 3.

*s*for the two genomes, reflecting the relative recency of the poplar WGD. The distance correction turns out to be 2524 for cucumber but only 536 for poplar. The distances portrayed in the next section incorporate these corrections.

### The Malpighiales

It has been suggested recently that the order Malpighiales should be assigned to the malvids rather than the fabids [35]. In our results, the tree supporting this suggestion is indeed more parsimonious than the more traditional one. However, based on the limited number of genomes at our disposal, this is not conclusive.

### Properties of the solution as a function of synteny block size

Of the 18,000 orthology sets, the number of genes considered on each branch ranged from 12,000 to 15,000. When the minimum block size is 5, the typical branch length over the 11 branches of the tree (including one branch from each WGD descendant to its perfectly doubled ancestor plus one from that ancestor to a speciation node) is about 1600, so that $\frac{d}{n}$ is around 0.12, a low value for which simulations have shown PATHGROUPS to be rather accurate, at least in the equal genomes context [11].

Figure 9 shows the convergence behaviour as the set of medians algorithms is repeated at each ancestral node. Each iteration required about 8 minutes on a MacBook.

### Block validation

To what extent do the synteny blocks output by SYN MAP for a pair of genomes appear in the reconstructed ancestors on the path between these two genomes in the phylogeny? Answering this in a positive way could validate the notion of syntenic conservation implicit in the block construction. If, however, the ancestors did not reflect the pairwise block construction due to conflicting homology structure among other descendants of the same ancestors, we would be forced to discount the pairwise syntenies as artifactual.

Since our reconstructed ancestral genomes are not in the curated CO GE database (and are lacking the DNA sequence version required of items in the database), we cannot use SYN MAP to construct synteny blocks between modern and ancestor genomes. We can only see if the genes in the original pairwise syntenies tend to be colinear as well in the ancestor.

On the path connecting grapevine to cacao in the phylogeny in Figure 1, there are two ancestors, the malvid ancestor and the rosid ancestor. There are 308 syntenic blocks containing at least 5 genes in the output of SYN MAP. A total of 11,229 genes are involved, of which 10,872 and 10,848 (97%) are inferred to be in the malvid and rosid ancestor respectively.

Integrity of cacao-grapevine syntenic blocks

malvid ancestor | rosid ancestor | |||
---|---|---|---|---|

synteny breaks | number | intra-block movement = 1.0) | number | intra-block movement = 1.0) |

0 | 140 (45%) | 126 (90%) | 153 (50%) | 146 (95%) |

1 | 66 (21%) | 62 (94%) | 64(21%) | 58 (91%) |

2 | 42 (14%) | 39 (93%) | 47(15%) | 37 (79%) |

> 2 | 60 (19%) | 58 (97%) | 44(14%) | 38 (86%) |

## Discussion

We have developed a methodology for reconstructing ancestral gene orders in a phylogenetic tree, minimizing the number of genome rearrangements they imply over the entire tree. The input is the set of synteny blocks produced by SYN MAP for all pairs of genomes. The two steps in this method, OMG and PATHGROUPS, are parameter-free; we argue that the proper moment for entering thresholds and other parameters, as well as resolving paralogy, is in the pairwise synteny construction. Our method rapidly and accurately handles large data sets (tens of thousands of genes per genome, and potentially dozens of genomes), although we have been constrained, for non-technical reasons (i.e., embargoes), to present the case of 6 genomes only. There is no requirement of equal gene complement.

For larger numbers of genomes, the quadratic increase in the number of pairs of genomes would become problematic, but this could be handled by extracting information from SYN MAP only from genomes pairs that are relatively close phylogenetically.

Future work will concentrate first on ways to complete cycles in the breakpoint graph which are currently left as paths, without substantially increasing computational complexity. This will increase the accuracy (optimality) of the results. Second, the incorporation of WGD descendants in the phylogeny will be upgraded to reflect the new unequal gene content techniques, in order to reduce the crude correction terms now associated with single-copy regions. Third, to increase the biological utility of the results, a post-processing component will be added to differentiate regions of confidence in the reconstructed genomes from regions of ambiguity.

## Availability

The PATHGROUPS software, together with sample data, may be downloaded from http://137.122.149.195/IsbraSoftware/smallPhylogenyInDel.html

The OMG software, together with sample data, may be downloaded from http://137.122.149.195/IsbraSoftware/OMGMec.html

The data used here, as well as other genomic data, and the SynMap software for producing pairwise homology sets are available at http://genomevolution.org/CoGe/OrganismView.pl and http://genomevolution.org/CoGe/SynMap.pl, respectively.

Because of the variety of formats in which genome data are released, and incorporated into CO GE, the conversion of several SynMap pairwise homology outputs into a master homology graph, conserving positional (on chromosome, fragment, contig, scaffold, pseudomolecule, etc.) information, at this time still requires short programs or scripts specific to the genomes under study.

## Declarations

### Acknowledgements

This article has been published as part of *BMC Bioinformatics* Volume 13 Supplement 10, 2012: "Selected articles from the 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11)". The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S10.

Thanks to Victor A. Albert for advice, Eric Lyons for much help and Nadia El-Mabrouk for encouragement in this work. Research supported by a postdoctoral fellowship to CZ from the NSERC, and a Discovery grant to DS from the same agency. DS holds the Canada Research Chair in Mathematical Genomics.

## Authors’ Affiliations

## References

- Moore G, Devos KM, Wang Z, Gale MD: Cereal genome evolution. Grasses, line up and form a circle. Current Biology. 1995, 5: 737-739. 10.1016/S0960-9822(95)00148-5.View ArticlePubMedGoogle Scholar
- Lyons E, Pedersen B, Kane J, Alam M, Ming R, Tang H, Wang X, Bowers J, Paterson A, Lisch D, Freeling M: Finding and comparing syntenic regions among Arabidopsis and the outgroups papaya, poplar, and grape: CoGe with rosids. Plant Physiology. 2008, 148: 1772-1781. 10.1104/pp.108.124867.PubMed CentralView ArticlePubMedGoogle Scholar
- Tang H, Wang X, Bowers J, Ming R, Alam M, Paterson A: Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps. Genome Research. 2008, 18: 1944-10.1101/gr.080978.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Murphy WJ, Larkin DM, Everts-van der Wind A, Bourque G, Tesler G, Auvil L, Beever JE, Chowdhary BP, Galibert F, Gatzke L, Hitte C, Meyers SN, Milan D, Ostrander EA, Pape G, Parker HG, Raudsepp T, Rogatcheva MB, Schook LB, Skow LC, Welge M, Womack JE, O'brien SJ, Pevzner PA, Lewin HA: Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science. 2005, 309: 613-617. 10.1126/science.1111387.View ArticlePubMedGoogle Scholar
- Ma J, Zhang L, Suh BB, Raney BJ, Burhans RC, Kent WJ, Blanchette M, Haussler D, Miller W: Reconstructing contiguous regions of an ancestral genome. Genome Research. 2006, 16: 1557-1565. 10.1101/gr.5383506.PubMed CentralView ArticlePubMedGoogle Scholar
- Adam Z, Sankoff D: The ABCs of MGR with DCJ. Evolutionary Bioinformatics. 2008, 4: 69-74.Google Scholar
- Ouangraoua A, Boyer F, McPherson A, Tannier E, Chauve C: Prediction of contiguous regions in the amniote ancestral genome. Bioinformatics Research and Applications, 5th International Symposium (ISBRA), Lecture Notes in Computer Science. Edited by: Mandoiu II, Narasimhan G, Zhang Y. 2009, Springer, 5542: 173-185.Google Scholar
- Gordon JL, Byrne KP, Wolfe KH: Additions, losses, and rearrangements on the evolutionary route from a reconstructed ancestor to the modern
*Saccharomyces cerevisiae*genome. PLoS Genetics. 2009, 5: e1000485-10.1371/journal.pgen.1000485.PubMed CentralView ArticlePubMedGoogle Scholar - Tannier E: Yeast ancestral genome reconstructions: The possibilities of computational methods. Comparative Genomics, 7th International Workshop (RECOMB-CG), Lecture Notes in Computer Science. Edited by: Ciccarelli F, Miklós I. 2009, Springer, 5817: 1-12. 10.1007/978-3-642-04744-2_1.Google Scholar
- Zheng C: PATHGROUPS, a dynamic data structure for genome reconstruction problems. Bioinformatics. 2010, 26: 1587-1594. 10.1093/bioinformatics/btq255.View ArticlePubMedGoogle Scholar
- Zheng C, Sankoff D: On the PATHGROUPS approach to rapid small phylogeny. BMC Bioinformatics. 2011, 12 (Suppl 1): S4-10.1186/1471-2105-12-S1-S4.PubMed CentralView ArticlePubMedGoogle Scholar
- Bertrand D, Gagnon Y, Blanchette M, El-Mabrouk N: Reconstruction of ancestral genome subject to whole genome duplication, speciation, rearrangement and loss. Algorithms in Bioinformatics, 10th International Workshop, (WABI), Lecture Notes in Computer Science. Edited by: Moulton V, Singh M. 2010, Springer, 6293: 78-89. 10.1007/978-3-642-15294-8_7.Google Scholar
- Soltis DE, Albert VA, Leebens-Mack J, Bell CD, Paterson AH, Zheng C, Sankoff D, Depamphilis CW, Wall PK, Soltis PS: Polyploidy and angiosperm diversification. American Journal of Botany. 2009, 96: 336-348. 10.3732/ajb.0800079.View ArticlePubMedGoogle Scholar
- Burleigh JG, Bansal MS, Wehe A, Eulenstein O: Locating large-scale gene duplication events through reconciled trees: implications for identifying ancient polyploidy events in plants. Journal of Computational Biology. 2009, 16: 1071-1083. 10.1089/cmb.2009.0139.View ArticlePubMedGoogle Scholar
- Langham RJ, Walsh J, Dunn M, Ko C, Goff SA, Freeling M: Genomic duplication, fractionation and the origin of regulatory novelty. Genetics. 2004, 166: 935-945. 10.1534/genetics.166.2.935.PubMed CentralView ArticlePubMedGoogle Scholar
- Thomas B, Pedersen B, Freeling M: Following tetraploidy in an Arabidopsis ancestor, genes were removed preferentially from one homeolog leaving clusters enriched in dose-sensitive genes. Genome Research. 2006, 16: 934-946. 10.1101/gr.4708406.PubMed CentralView ArticlePubMedGoogle Scholar
- Sankoff D, Zheng C, Zhu Q: The collapse of gene complement following whole genome duplication. BMC Genomics. 2010, 11: 313-10.1186/1471-2164-11-313.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang B, Zheng C, Sankoff D: Fractionation statistics. BMC Bioinformatics. 2011, 12 (Suppl 9): S5-10.1186/1471-2105-12-S9-S5.View ArticleGoogle Scholar
- Sankoff D, Zheng C, Wang B: A model for biased fractionation after whole genome duplication. BMC Genomics. 2012,Google Scholar
- Lyons E, Freeling M: How to usefully compare homologous plant genes and chromosomes as DNA sequences. Plant Journal. 2008, 53: 661-673. 10.1111/j.1365-313X.2007.03326.x.View ArticlePubMedGoogle Scholar
- Yancopoulos S, Attie O, Friedberg R: Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics. 2005, 21: 3340-3346. 10.1093/bioinformatics/bti535.View ArticlePubMedGoogle Scholar
- El-Mabrouk N, Sankoff D: The reconstruction of doubled genomes. SIAM Journal on Computing. 2003, 32: 754-792. 10.1137/S0097539700377177.View ArticleGoogle Scholar
- Zheng C, Swenson KM, Lyons E, Sankoff D: OMG! Orthologs in multiple genomes - competing graph-theoretical formulations. Algorithms in Bioinformatics - 11th International Workshop (WABI), Lecture Notes in Computer Science. Edited by: Przytycka TM, Sagot MF. 2011, Springer, 6833: 364-375. 10.1007/978-3-642-23038-7_30.Google Scholar
- Argout X, Salse J, Aury JM, Guiltinan MJ, Droc G, Gouzy J, Allegre M, Chaparro C, Legavre T, Maximova SN, Abrouk M, Murat F, Fouet O, Poulain J, Ruiz M, Roguet Y, Rodier-Goud M, Barbosa-Neto JF, Sabot F, Kudrna D, Ammiraju JS, Schuster SC, Carlson JE, Sallet E, Schiex T, Dievart A, Kramer M, Gelley L, Shi Z, Berard A, Viot C, Boccara M, Risterucci AM, Guignon V, Sabau X, Axtell MJ, Ma Z, Zhang Y, Brown S, Bourge M, Golser W, Song X, Clement D, Rivallan R, Tahi M, Akaza JM, Pitollat B, Gramacho K, D'Hont A, Brunel D, Infante D, Kebe I, Costet P, Wing R, McCombie WR, Guiderdoni E, Quetier F, Panaud O, Wincker P, Bocs S, Lanaud C: The genome of
*Theobroma cacao*. Nature Genetics. 2011, 43: 101-108. 10.1038/ng.736.View ArticlePubMedGoogle Scholar - Chan AP, Crabtree J, Zhao Q, Lorenzi H, Orvis J, Puiu D, Melake-Berhan A, Jones KM, Redman J, Chen G, Cahoon EB, Gedil M, Stanke M, Haas BJ, Wortman JR, Fraser-Liggett CM, Ravel J, Rabinowicz PD: Draft genome sequence of the oilseed species
*Ricinus communis*. Nature Biotechnology. 2010, 28: 951-956. 10.1038/nbt.1674.PubMed CentralView ArticlePubMedGoogle Scholar - Huang S, Li R, Zhang Z, Li L, Gu X, Fan W, Lucas W, Wang X, Xie B, Ni P, Ren Y, Zhu H, Li J, Lin K, Jin W, Fei Z, Li G, Staub J, Kilian A, van der Vossen EA, Wu Y, Guo J, He J, Jia Z, Ren Y, Tian G, Lu Y, Ruan J, Qian W, Wang M, Huang Q, Li B, Xuan Z, Cao J, Wu Z, Zhang J, Cai Q, Bai Y, Zhao B, Han Y, Li Y, Li X, Wang S, Shi Q, Liu S, Cho WK, Kim JY, Xu Y, Heller-Uszynska K, Miao H, Cheng Z, Zhang S, Wu J, Yang Y, Kang H, Li M, Liang H, Ren X, Shi Z, Wen M, Jian M, Yang H, Zhang G, Yang Z, Chen R, Liu S, Li J, Ma L, Liu H, Zhou Y, Zhao J, Fang X, Li G, Fang L, Li Y, Liu D, Zheng H, Zhang Y, Qin N, Li Z, Yang G, Yang S, Bolund L, Kristiansen K, Zheng H, Li S, Zhang X, Yang H, Wang J, Sun R, Zhang B, Jiang S, Wang J, Du Y, Li S: The genome of the cucumber,
*Cucumis sativus L*. Nature Genetics. 2009, 41: 1275-1281. 10.1038/ng.475.View ArticlePubMedGoogle Scholar - Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, Choisne N, Aubourg S, Vitulo N, Jubin C, Vezzi A, Legeai F, Hugueney P, Dasilva C, Horner D, Mica E, Jublot D, Poulain J, Bruyere C, Billault A, Segurens B, Gouyvenoux M, Ugarte E, Cattonaro F, Anthouard V, Vico V, Del Fabbro C, Alaux M, Di Gaspero G, Dumas V, Felice N, Paillard S, Juman I, Moroldo M, Scalabrin S, Canaguier A, Le Clainche I, Malacrida G, Durand E, Pesole G, Laucou V, Chatelet P, Merdinoglu D, Delledonne M, Pezzotti M, Lecharny A, Scarpelli C, Artiguenave F, Pe ME, Valle G, Morgante M, Caboche M, Adam-Blondon AF, Weissenbach J, Quetier F, Wincker P: The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature. 2007, 449: 463-467. 10.1038/nature06148.View ArticlePubMedGoogle Scholar
- Velasco R, Zharkikh A, Troggio M, Cartwright DA, Cestaro A, Pruss D, Pindo M, Fitzgerald LM, Vezzulli S, Reid J, Malacarne G, Iliev D, Coppola G, Wardell B, Micheletti D, Macalma T, Facci M, Mitchell JT, Perazzolli M, Eldredge G, Gatto P, Oyzerski R, Moretto M, Gutin N, Stefanini M, Chen Y, Segala C, Davenport C, Dematte L, Mraz A, Battilana J, Stormo K, Costa F, Tao Q, Si-Ammour A, Harkins T, Lackey A, Perbost C, Taillon B, Stella A, Solovyev V, Fawcett JA, Sterck L, Vandepoele K, Grando SM, Toppo S, Moser C, Lanchbury J, Bogden R, Skolnick M, Sgaramella V, Bhatnagar SK, Fontana P, Gutin A, Van de Peer Y, Salamini F, Viola R: A high quality draft consensus sequence of the genome of a heterozygous grapevine variety. PLoS ONE. 2007, 2: e1326-10.1371/journal.pone.0001326.PubMed CentralView ArticlePubMedGoogle Scholar
- Ming R, Hou S, Feng Y, Yu Q, Dionne-Laporte A, Saw JH, Senin P, Wang W, Ly BV, Lewis KL, Salzberg SL, Feng L, Jones MR, Skelton RL, Murray JE, Chen C, Qian W, Shen J, Du P, Eustice M, Tong E, Tang H, Lyons E, Paull RE, Michael TP, Wall K, Rice DW, Albert H, Wang ML, Zhu YJ, Schatz M, Nagarajan N, Acob RA, Guan P, Blas A, Wai CM, Ackerman CM, Ren Y, Liu C, Wang J, Wang J, Na JK, Shakirov EV, Haas B, Thimmapuram J, Nelson D, Wang X, Bowers JE, Gschwend AR, Delcher AL, Singh R, Suzuki JY, Tripathi S, Neupane K, Wei H, Irikura B, Paidi M, Jiang N, Zhang W, Presting G, Windsor A, Navajas-Perez R, Torres MJ, Feltus FA, Porter B, Li Y, Burroughs AM, Luo MC, Liu L, Christopher DA, Mount SM, Moore PH, Sugimura T, Jiang J, Schuler MA, Friedman V, Mitchell-Olds T, Shippen DE, dePamphilis CW, Palmer JD, Freeling M, Paterson AH, Gonsalves D, Wang L, Alam M: The draft genome of the transgenic tropical fruit tree papaya (
*Carica papaya Linnaeus)*. Nature. 2008, 452: 991-996. 10.1038/nature06856.PubMed CentralView ArticlePubMedGoogle Scholar - Tuskan GA, Difazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A, Schein J, Sterck L, Aerts A, Bhalerao RR, Bhalerao RP, Blaudez D, Boerjan W, Brun A, Brunner A, Busov V, Campbell M, Carlson J, Chalot M, Chapman J, Chen GL, Cooper D, Coutinho PM, Couturier J, Covert S, Cronk Q, Cunningham R, Davis J, Degroeve S, Dejardin A, Depamphilis C, Detter J, Dirks B, Dubchak I, Duplessis S, Ehlting J, Ellis B, Gendler K, Goodstein D, Gribskov M, Grimwood J, Groover A, Gunter L, Hamberger B, Heinze B, Helariutta Y, Henrissat B, Holligan D, Holt R, Huang W, Islam-Faridi N, Jones S, Jones-Rhoades M, Jorgensen R, Joshi C, Kangasjarvi J, Karlsson J, Kelleher C, Kirkpatrick R, Kirst M, Kohler A, Kalluri U, Larimer F, Leebens-Mack J, Leple JC, Locascio P, Lou Y, Lucas S, Martin F, Montanini B, Napoli C, Nelson DR, Nelson C, Nieminen K, Nilsson O, Pereda V, Peter G, Philippe R, Pilate G, Poliakov A, Razumovskaya J, Richardson P, Rinaldi C, Ritland K, Rouze P, Ryaboy D, Schmutz J, Schrader J, Segerman B, Shin H, Siddiqui A, Sterky F, Terry A, Tsai CJ, Uberbacher E, Unneberg P, Vahala J, Wall K, Wessler S, Yang G, Yin T, Douglas C, Marra M, Sandberg G, Van de Peer Y, Rokhsar D: The genome of black cottonwood,
*Populus trichocarpa (Torr. & Gray)*. Science. 2006, 313: 1596-1604. 10.1126/science.1128691.View ArticlePubMedGoogle Scholar - Forest F, Chase MW: Eudicots. The Timetree of Life. Edited by: Hedges SB, Kumar S. 2009, Oxford University Press, 169-176.Google Scholar
- Warshall S: A theorem on Boolean matrices. Journal of the ACM. 1962, 9: 11-12. 10.1145/321105.321107.View ArticleGoogle Scholar
- Muñnoz A, Sankoff D: Rearrangement phylogeny of genomes in contig form. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2010, 7: 579-587.View ArticleGoogle Scholar
- Sankoff D, Haque L: The distribution of genomic distance between random genomes. Journal of Computational Biology. 2006, 13: 1005-1012. 10.1089/cmb.2006.13.1005.View ArticlePubMedGoogle Scholar
- Shulaev V, Sargent DJ, Crowhurst RN, Mockler TC, Folkerts O, Delcher AL, Jaiswal P, Mockaitis K, Liston A, Mane SP, Burns P, Davis TM, Slovin JP, Bassil N, Hellens RP, Evans C, Harkins T, Kodira C, Desany B, Crasta OR, Jensen RV, Allan AC, Michael TP, Setubal JC, Celton JM, Rees DJ, Williams KP, Holt SH, Ruiz Rojas JJ, Chatterjee M, Liu B, Silva H, Meisel L, Adato A, Filichkin SA, Troggio M, Viola R, Ashman TL, Wang H, Dharmawardhana P, Elser J, Raja R, Priest HD, Bryant DW, Fox SE, Givan SA, Wilhelm LJ, Naithani S, Christoffels A, Salama DY, Carter J, Lopez Girona E, Zdepski A, Wang W, Kerstetter RA, Schwab W, Korban SS, Davik J, Monfort A, Denoyes-Rothan B, Arus P, Mittler R, Flinn B, Aharoni A, Bennetzen JL, Salzberg SL, Dickerman AW, Velasco R, Borodovsky M, Veilleux RE, Folta KM: The genome of woodland strawberry (
*Fragaria vesca*). Nature Genetics. 2011, 43: 109-116. 10.1038/ng.740.PubMed CentralView ArticlePubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.