Reconstructing ancestral gene orders with duplications guided by synteny level genome reconstruction

Background Reconstructing ancestral gene orders in the presence of duplications is important for a better understanding of genome evolution. Current methods for ancestral reconstruction are limited by either computational constraints or the availability of reliable gene trees, and often ignore duplications altogether. Recently, methods that consider duplications in ancestral reconstructions have been developed, but the quality of reconstruction, counted as the number of contiguous ancestral regions found, decreases rapidly with the number of duplicated genes, complicating the application of such approaches to mammalian genomes. However, such high fragmentation is not encountered when reconstructing mammalian genomes at the synteny-block level, although the relative positions of genes in such reconstruction cannot be recovered. Results We propose a new heuristic method, MultiRes, to reconstruct ancestral gene orders with duplications guided by homologous synteny blocks for a set of related descendant genomes. The method uses a synteny-level reconstruction to break the gene-order problem into several subproblems, which are then combined in order to disambiguate duplicated genes. We applied this method to both simulated and real data. Our results showed that MultiRes outperforms other methods in terms of gene content, gene adjacency, and common interval recovery. Conclusions This work demonstrates that the inclusion of synteny-level information can help us obtain better gene-level reconstructions. Our algorithm provides a basic toolbox for reconstructing ancestral gene orders with duplications. The source code of MultiRes is available on https://github.com/ma-compbio/MultiRes. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1262-8) contains supplementary material, which is available to authorized users.


S.1 Introduction
This document presents the preliminary steps for processing the input to Mul-tiRes , and explains the main optimization routine used [1]. It also contains figures showing the variation of the performance and runtime of MultiRes with changes in the parameter.

S.1 Reconstructing the ancestor using synteny blocks
An initial preprocessing step in the method is to use the phylogenetic tree and the synteny blocks to find a representation of the ancestor by ordering the synteny blocks. We use ANGES [2] as the preferred software to do this, but ideally, any other ancestral reconstruction software, such as inferCARs [3], MGRA [4] or ProCARs [5]. In each case, the output is a set of contiguous ancestral regions (CARs), each of which corresponds to a reconstruction of the ancestral synteny order. For the runs presented in this paper, we use ANGES [2], which works on the concept of consecutive-ones matrices. In contrast, MGRA, for example, tries to minimize rearrangement distances between the extant genomes under a certain model.
Our choice of method may affect the final reconstruction, since it uses the synteny-level reconstruction to guide the gene-level scaffolding step. However, in this paper, we consider experiments on the mammalian X-chromosome only. We used both ANGES and MGRA2 [6] to reconstruct the ancestral X-chromosome using blocks of resolution 100K. The main difference is that MGRA2 filters out markers which are not present in all reference extant species, while ANGES does not need to do any such filtering. As a result, we used the reconstructions provided by ANGES for our experiments.
It is important to note that almost none of these methods is suitable for reconstructing synteny/gene orders in the presence of duplications. In the data sets we work on, the mammalian X-chromosomes, almost 20% of the gene families have duplicated occurrences in multiple extant species, with copy numbers as high as 10.

S.2 Finding conserved ancestral adjacencies
In order to find conserved ancestral adjacencies and weight them, we use the following rule: if an adjacency between two markers, or between two block extremities, is present in two extant species, and the unique path between them on the phylogenetic tree contains the ancestor of interest, then we infer the adjacency as conserved in the ancestor. This is a very conservative criterion, and we expect to recover many false positives. However, since we are optimizing over the set of adjacencies later, this is not a particularly troubling issue.
A conserved adjacency between two markers is assigned a weight based on its conservation pattern in the extant species. In order to compute this weight, we use the weighting function available in ANGES [2], which itself is based on the method developed in [3].

S.3 Inferring ancestral copy numbers
Ancestral copy numbers are computed using Wagner parsimony [7] with equal costs for gain and loss of gene copies. More specifically, we consider the number of genes in a single gene family for each extant species, and use a dynamic programming approach to compute the minimum cost ancestral scenario at the root. Then, by backtracking, we find the best cost scenario at the ancestor of interest. The markers inherit the copy numbers of the respective gene families.
It is important to note that we merely find an upper limit to the copy number in the ancestor. For example, if a gene family, and by extension, a marker, is said to have copy number k in the ancestor, we may use fewer than k copies of the marker in the reconstruction. However, it has not been possible to implement a lower limit to the copy number in the method yet. This is left open as a theoretical challenge.

S.4 The optimization scheme
A key part of our method is the use of an algorithm mentioned in [1]. Since this allows the incorporation of duplicated genes, we have chosen it as the preferred optimization algorithm in the new method. In this algorithm, given a graph G = (V, E), where V is a set of genes or gene extremities, and E is the set of adjacencies between them, and µ : V → N, where µ provides an upper bound on the number of copies of each gene, the output is a subset E ⊆ E such that 1. for each gene (extremity) v ∈ V , there exist at most 2µ (v)(for extremities, µ (v)) edges in E with v as an end,  showing how copy numbers might be inferred. This is a highly simplified parsimony scheme, in which the number of copies of a gene is provided at the leaf nodes. Losses are depicted by red arrows, and gains by green arrows. We compute a set of annotations for the internal nodes such that the number of gains and losses is minimized. In this example, there are two such scenarios, which are show side by side.
2. there exists a set of linear or circular walks on G which involves only the edges in E , and no v is visited more than 2µ (v)(for extremities, µ (v)) times, and 3. there is no subset E ⊆ E with these properties which has larger size (or weight, if the edges are weighted).
The algorithm used for this is relatively simple to describe. Let G = (V, E) be the original adjacency graph, with gene families (not extremities) for vertices, and let µ : V → N be the copy number function. We first construct a new graph G = (V , E ) as follows. On this graph, we run a maximum-matching routine, which outputs a set of edges in the matching. Finally, we keep those adjacencies e = {u, v} ∈ E for which we find 2 edges {u i , e u } , {v i , e v } ∈ E in the matching. This is guaranteed to be a maximum matching, and since we have at most 2µ (v) copies of every gene family v, we are also guaranteed that each gene family has at most 2µ (v) neighbours.

S.1 Simulation methodology
We first created a unichromosomal ancestral genome at the root of the tree at both the synteny block and gene resolutions, using the length and copy number distributions of the blocks, genes and gaps in the human X-chromosome. We used an arbitrary cut-off of a maximum copy number of 15 to generate the distributions. Approximately 85% of the gene families found in humans having copy number 1, about 9% of them having copy number 2, and the rest distributed between copy number 3 to 15. About 93% of the duplicated genes have copy number between 2 and 5, so a cut-off above 5 does not affect the distribution significantly. Using these distributions, we evolved the genome along the branches of the phylogenetic tree as follows.

For each inversion, we simulated k inversions within the inverted region,
where k takes on values from {1, 3} for two different simulation sets. If k = 1, we label the instance as a low rearrangement instance, and otherwise as a high rearrangement instance.
For each rearrangement rate, we created 50 data sets. The parameters for the low rearrangement/high rearrangement simulation sets were chosen so that the number of pairwise breakpoints in the phylogenetic tree between two ingroup species, where breakpoints are defined as adjacencies that are present in exactly one of the two species, is roughly within one standard deviation from the number of breakpoints between the species observed in the real data. The rates of duplications and loss were chosen so that the distribution of copy numbers in the extant human genome created in the simulations is similar to the real data. The average length of the ancestral genome after filtering for copy numbers and by conservation, in terms of the number of adjacencies between genes, was 735 and 756 for the low rearrangement and high rearrangement simulation sets respectively. In total, this leads to 8 2 possible parameter combinations, since the segment length has to be at least as large as the window length. For both rearrangement sets, we do not notice a significant variation in any of the rates. For larger parameter values, the computation over all simulations took too long, and were excluded from the analysis.