Probabilistic modeling of the evolution of gene synteny within reconciled phylogenies

Background Most models of genome evolution concern either genetic sequences, gene content or gene order. They sometimes integrate two of the three levels, but rarely the three of them. Probabilistic models of gene order evolution usually have to assume constant gene content or adopt a presence/absence coding of gene neighborhoods which is blind to complex events modifying gene content. Results We propose a probabilistic evolutionary model for gene neighborhoods, allowing genes to be inserted, duplicated or lost. It uses reconciled phylogenies, which integrate sequence and gene content evolution. We are then able to optimize parameters such as phylogeny branch lengths, or probabilistic laws depicting the diversity of susceptibility of syntenic regions to rearrangements. We reconstruct a structure for ancestral genomes by optimizing a likelihood, keeping track of all evolutionary events at the level of gene content and gene synteny. Ancestral syntenies are associated with a probability of presence. We implemented the model with the restriction that at most one gene duplication separates two gene speciations in reconciled gene trees. We reconstruct ancestral syntenies on a set of 12 drosophila genomes, and compare the evolutionary rates along the branches and along the sites. We compare with a parsimony method and find a significant number of results not supported by the posterior probability. The model is implemented in the Bio++ library. It thus benefits from and enriches the classical models and methods for molecular evolution.


Background
Genomes evolve through processes that modify their content and organization at different scales, ranging from substitutions, insertions or deletions of single nucleotides to large scale chromosomal rearrangements. Extant genomes are the result of a combination of many such processes, which makes it difficult to reconstruct the big picture of genome evolution. Instead, most models and methods focus on one scale and use only one kind of data, such as gene orders or sequence alignments.
Models based on sequence alignments were first developed in the 1960's and underwent steady development until reaching a high level of complexity [1]. In a recent development, they have been extended to include gene content, modeling duplications, losses and transfers of genes with reconciliation methods [2,3]. Reconciled gene trees account for evolutionary events at both the sequence level and the gene family level. They thus yield a better representation of genome evolution and pave the way for approaches integrating other levels of information [4,5].
In parallel extant gene orders have long been used to infer evolutionary relationships between organisms and to reconstruct ancestral genomes [6][7][8]. Although the early stages of their development were computational challenges, methods based on gene orders gradually overcame theoretical and computational constraints so that they can now handle unequal gene content, multichromosomal genomes, whole genome duplications and dozens of genomes with large amounts of genes [9][10][11], and can be inserted into probabilistic frameworks [12][13][14][15][16][17].
All ingredients are present to integrate gene order and sequence evolution models, yet this leap has not been taken, mostly because of computational issues. Reconstructing gene order histories is often hard [18]. A computational solution to reconstruct gene orders and scale up with the size of datasets is to see a genome as a set of independently evolving adjacencies, i.e. the links between consecutive genes [19]. One can reconstruct ancestral gene orders following three main steps: • Group potentially homologous adjacencies (they connect homologous pairs of genes) • For each group, reconstruct the common history of adjacencies, by recovering ancestral ones • Assemble the ancestral adjacencies in each ancestral species to obtain ancestral chromosomes The assumption that adjacencies evolve independently allows quick computations at the second step: the size of the data can be an order of magnitude larger than without the assumption. But an optimization assembly step is required because of possible conflicts between adjacencies wrongly assumed independent [20].
Another difficulty is the integration of gene content dynamics. Often probabilistic solutions are limited to invariable gene content [12][13][14]. A solution is to encode altogether the presence and absence of genes and adjacencies as binary characters and use a binary sequence evolution model [15,16], but it lacks an evolutionary model of gene content and order dynamics. Gene profiles [21] or reconciled gene trees [22,10] are more promising for integration with sequence evolution models. They were mainly used with parsimony methods to reconstruct ancestral adjacencies, which makes it difficult to combine with a model at a different scale.
We propose a probabilistic model of adjacency evolution accounting for gene duplications and losses, using extant gene orders and reconciled gene trees. We base on the parsimony algorithm of DeCo [10] that we adapt to Felsenstein's maximum likelihood algorithm [1] with a birth/death process that models the evolution of adjacencies. We compute the most likely adjacencies in ancestral genomes and the quantity of gains and losses of adjacencies in all the branches of a species trees, thus providing an insight into the dynamics of rearrangements in these lineages. The model is implemented in Bio++ [23], the present form allowing at most one duplication node between two speciation nodes in gene trees. We compute the likelihood of gene orders in a set of 12 drosophila whose genomes are annotated in the Ensembl Metazoa [24] database. We optimize branch lengths in a species phylogeny and construct ancestral genomes. We compare the results with a parsimony approach, showing that while most adjacencies inferred by parsimony have a good probability, a non negligible proportion (> 11%) are not supported (posterior probability < 0.5).

Species tree
A rooted species tree is a binary tree that describes the evolutionary relationships between organisms. The leaves of the tree are available species, internal nodes are ancestral species. The species tree has branch lengths indicating the quantity of expected evolution. Branch lengths can be also estimated as an output.

Adjacencies
An ordered set of genes is represented by a set of adjacencies, which are pairs of consecutive genes. For example, a genome A containing the sequence of genes a 1 − a 2 − a 3 − a 4 contains adjacencies a 1 a 2 , a 2 a 3 , and a 3 a 4 . Adjacencies are not oriented, meaning that a 1 a 2 is equivalent to a 2 a 1 .

Gene trees
Genes are grouped into homologous families across genomes. The evolutionary history of each family is represented by a rooted gene tree. Gene trees are reconciled with the species tree (see precomputation below).

Principle
The principle is illustrated on Figure 1. It consists in reconstructing hypothetical ancestral adjacencies, modeling the evolution of adjacencies, computing a maximum likelihood of the model given the data, and computing the a posteriori probability of presence for each ancestral adjacency.
In this section, we give an overview of the main steps in our method. All these steps are detailed in the following sections, except the precomputation, for which we refer to [10].
• Precomputation: gene trees are reconciled with the species tree in order to minimize the number of duplications and losses (using DeCo [10]). It consists in annotating each internal node by an ancestral gene, together with the species it belongs to, and the evolutionary event (speciation, duplication, loss) taking place at the bifurcation. This determines a set of ancestral genes for all ancestral species. Gene losses are also annotated in the trees.
• Classify extant adjacencies so that every class can be handled independently. Inside each class, two gene families with two trees are involved and all adjacencies have an extremity in each family.
• For each selected pair of gene families, construct a tree, called the tree of possible adjacencies. Its nodes are all the couples of nodes from each gene tree, which are in the same extant or ancestral species (the speciation nodes), plus some duplication nodes; the leaves are labeled with the pattern of presence/ absence of the possible adjacencies in the data.
• Compute, between successive nodes of this tree, the probability of presence or absence of the adjacency using the model of evolution described below.
• Compute the likelihood of the adjacency given the observed adjacencies.
• Compute a posteriori probabilities of presence of ancestral adjacencies.
The likelihood computation for one adjacency tree allows to obtain a likelihood for the whole dataset by multiplying all likelihoods, considered as independent, and to optimize parameters. These can concern branch lengths on the species tree, or a law of differential fragility for different genome sites, modeling different susceptibility to rearrangements among chromosomal regions [25,26].

Adjacency classes
We first reduce the problem to two gene trees, without loss of generality, by classifying adjacencies. Reconciled gene trees define ancestral genes of ancestral species. A necessary condition for an adjacency i 1 i 2 to be an ancestor of a 1 a 2 is that i 1 is an ancestor of a 1 and i 2 an ancestor of a 2 . By the same idea a necessary condition for adjacencies a 1 a 2 and b 1 b 2 to be homologous is that there is a common ancestor i 1 of a 1 and b 1 , and a common ancestor i 2 of a 2 and b 2 , such that i 1 and i 2 are in the same species. This condition for homology is an equivalence relation on all extant adjacencies, which can be clustered and treated by equivalence classes of homology. To a class we can associate i 1 and i 2 the most ancient distinct common ancestors of all adjacency extremities in the class. So every adjacency in the class has an extremity which is a descendant of i 1 and an extremity which is a descendant of i 2 . Without loss of generality we can work with the two sub-trees rooted at i 1 and i 2 .

Trees of possible adjacencies
We now suppose that we have G 1 and G 2 two reconciled gene trees with some leaves of G 1 involved in adjacencies with some leaves of G 2 . Each node n in G 1 and G 2 is annotated with an event (speciation, duplication, loss) and a species S(n). Take each pair of nodes i 1 i 2 , where i 1 and i 2 are speciation nodes associated with the same ancestral species s, i 1 ∈ G 1 and i 2 ∈ G 2 . Since S(i 1 ) = S(i 2 ) and adjacencies exist between leaves of G 1 and leaves of G 2 , i 1 i 2 is called a possible adjacency.
All possible adjacencies define nodes of the tree of possible adjacencies, in which duplication nodes can be added, as explained below.
If i 1 i 2 is a possible adjacency such that S(i 1 ) = S(i 2 ) = s, let s 1 and s 2 be the two children of s in the species tree. There is a descent path in the tree of possible adjacencies from i 1 i 2 to all possible adjacencies j 1 j 2 in s 1 such that i 1 is an ancestor of j 1 and i 2 is an ancestor of j 2 , and a similar independent path from s to s 2 . If there is no duplication node between i 1 and j 1 and i 2 and j 2 , then this path is a single edge. If there is at least one duplication node between i 1 and j 1 or i 2 and j 2 , then the path from i 1 i 2 to j 1 j 2 has two edges, one between i 1 i 2 and d, a new duplication node, and one from d to j 1 j 2 . The node i 1 i 2 always has only two descendants, but the node d can have an arbitrary number, according to the number of duplications in the gene lineages.
Loss of one or both genes involved in the adjacency in a branch leading to a species s′ leads to the loss of the adjacency in s′. In this case, a loss leaf of the tree of possible adjacencies is constructed. An example of construction of a tree of possible adjacencies for two reconciled gene trees is drawn in Figure 2. Once each pair of nodes i 1 i 2 has been considered, the resulting tree is the tree of possible adjacencies for G 1 and G 2 on which we can apply a model of evolution.

Model of evolution
We consider possible adjacencies as evolutionary objects in a binary alphabet. An adjacency can either be present Figure 1 Principle of the method. Given two gene trees (dark blue tree and light blue tree) reconciled within a species tree (black tree), and sharing adjacencies in some extant species (species A and C), we reconstruct hypothetical ancestral adjacencies (in species D and E) using a model of evolution and maximum likelihood algorithm. Our method allows for losses (cross in light blue tree between species B and species D), and duplications (empty square in dark blue tree) of genes.
(state 1) or absent (state 0) in a genome. The transition rate matrix for the birth/death process which describes the evolution of a binary object is: Where is the rate of 0 1 (gain of an adjacency) over the rate of 1 0 (loss of an adjacency). Probabilities of transition between two states separated by a amount t of time can be computed using a classical binary substitution model: Where λ = (κ + 1) 2 2κ .
In the case when there is no duplication in the two gene trees, likelihoods can be computed directly from the tree of possible adjacencies (which itself has no duplication nodes) with Felsenstein's algorithm [1].
An adjacency can be lost because of a rearrangement (1 0), or because at least one of the two adjacent genes is lost. In the first case, the state of the leaf in the tree of possible adjacency is simply 0. In the second case, we assign an undetermined state ? to the loss leaf in the tree of possible adjacencies to differentiate it from a loss due to a rearrangement. We do not compute probabilities of transition for branches leading to these nodes.
In the case when there are duplication nodes, we write the probabilities according to a model of evolution of adjacencies in presence of duplications: when one gene belonging to an adjacency is duplicated, the adjacency is transmitted to one of the two copies of the gene. This is always verified, whether the duplication is tandem or remote. For example, consider a gene i 2 involved in an Figure 2 The tree of possible adjacencies. A tree of possible adjacencies is constructed from two reconciled gene trees (top trees). The nodes of the tree are annotated as speciation nodes (grey nodes), duplication nodes (white squares), or losses (crossed grey nodes). A binary state is attributed to each leaf according to the presence/absence pattern of the adjacency in extant species (light blue square). The true evolutionary history of the adjacency is represented on the blue tree. An adjacency exists between genes a 1 and a 2 , c 1 and c 2 , d 1 and d 2 . It is absent between genes a 1 and a 2 and genes c 1 and c 2 . If one extremity of the adjacency is lost (species B), the adjacency node is given an undefined state "?". adjacency i 1 i 2 in species I with a gene i 1 . In species A (descendant of species I ), i 1 has one descendant a 1 , whereas i 2 is duplicated, giving two copies a 2 and a 2 . If the duplication is in tandem it leads to the gene order a 1 a 2 a 2 , and the only adjacency conserved with a 1 is a 1 a 2 . Otherwise it leads to the gene order a 1 a 2 . . . a 2 and again only a 1 a 2 is conserved. Note nevertheless that the adjacency a 1 a 2 can appear later in the phylogeny following a rearrangement.
Between two speciation events, we have no date for duplication events. We argue that fixing a date, for example with gene branch lengths, would be a mistake as the position of a duplication between two speciations influences the transition probabilities. Besides, the probabilistic approach means that we can account for all possible dates. Hence we compute an average transition probability for the duplicated branch over all the moments on the branch of the species where this duplication could have occurred. To do this, we integrate the transition probabilities P(t) uniformly over the length of this branch. Depending on the date of the duplications, the probabilities of the several resulting adjacencies are more or less linked. Hence, the integrated transition probability is no longer from one adjacency to another adjacency, but from one adjacency to the set of all the possible adjacencies that result from the duplication. In the previous example (one duplication), the transition probability is from i 1 i 2 to ((a 1 a 2 , a 1 a 2 ) . We can fully model such a process as several processes in parallel. If Q is the generator of the binary model, Q ⊗ Q ⊗ ... ⊗ Q is the generator of the whole process, where ⊗ is the Kronecker product. Here, from a single Q generator at the beginning of the branch, along the branch each event of duplication gives rise to a larger Kronecker product. From a computational point of view, the whole parallel process is considered all along the branch, but just a subset of the transition probabilities is used.
We restrict here the description of the model to the case when there is at most one duplication node between two speciation nodes in the gene trees, which means that in the tree of possible adjacencies, duplication nodes have at most four descendants (because a gene duplication would have occurred in each gene tree). However, in case of several duplications, the same principle holds, with much more complicated formula.

One duplication
If there is one duplication in one gene tree (from a to a 1 and a 2 ) and no duplication in the other, then in the non duplicated branch probabilities are settled with the matrix P. The duplicated branch has a length drawn from the uniform distribution on the non duplication branch length, because it starts from the duplication. So the average transition matrix on the duplicated branch is: As in the duplicated branch there is no adjacency (state 0) at the moment of the duplication, we are only interested by the (0, z) components of N 1 (t), z ∈ [0, 1]. Calculating the integral yields: Let x be the state of adjacency i 1 i 2 , y the state of a 1 a 2 and z the state of a 1 a 2 , (x, y, z) ∈ [0, 1] 3 . Assuming that a 1 a 2 is on the duplicated branch, the overall transition probabilities from x to y and z are given by P x,y (t) × N 1 0,z (t) . The two choices for the duplicated branch are considered during the computation of the likelihood.

Two duplications
If both i 1 and i 2 are duplicated, we assume that both duplications are independent. Note that with this assumption, we do not model the case of joint duplications, where a fragment of chromosome is duplicated (i.e. several consecutive genes are duplicated following a single duplication event). Without loss of generality, we assume in the computation that one duplication occurs after the other. The average transition matrix integrated uniformly along both branches is: Since, as before, only one gene pair inherits the adjacency, we are only interested by the (., 0, 0, 0) (., ., ., .) components of P(t) ⊗ N 11 (t) (Figure 3).

Likelihood computation
Likelihood is computed in the rooted tree of possible adjacencies in a bottom-up way. From here, we describe adjacency nodes with single letters for better clarity. We denote as D i the data that is below node i.
Take a speciation node i in the tree, which descendants are nodes j and k (with branch lengths respectively t 1 and t 2 ). Let x, y, z ∈ [0, 1] be the respective states of i, j, k. We compute the partial conditional likelihoods of D i in the classical way:  Now, let i be a duplication node with two children j and k. Since it concerns only one branch in the species tree, there is a unique branch length t involved. We defined the model of evolution such that the contribution of one child is included using the basic transition matrix P(t) and the contribution of the other child (the child on the duplicated branch) is included using the transition matrix N 1 (t). The partial likelihoods of i can then be computed by allowing the equal possibility that either j or k is on the duplicated branch: If we generalize this problem, computing the partial likelihoods of a duplication node i means exploring the combinatorics of possible states for i's children and the combinatorics of attributing the duplicated branch(es) to the children. Take a duplication node i with n speciation nodes as descendants in the same species. Each node is in a binary state, which means that there are 2 n combinations of states for i's children. We could explore all these combinations to compute i's likelihood but binary characters quickly lead to redundancies in the computation. We can avoid some of these redundancies and reduce the space of exploration by defining patterns. A pattern is an unordered set of 0s and 1s. There are n+1 possible patterns representing the states of i's children. For each pattern p, we can compute the pattern's pseudo likelihood by exploring all its possible orders (i. e. all the possible ways of ordering the 1s and 0s in the pattern): where Y is one possible order of p. If i has n children, Y is a vector of n binary characters representing the states of the n children. Y c is thus the c th element of Y and D c the data below the c th child of i.
We define the weight ω(p) of the pattern p as the number of possible orders for p : ω(p) = n N , with N the number of 1s in p. We give the generalized formula for computing the partial likelihood of i when i has four children (n = 4, which means that the only concerned integrated transition matrix is N 11 (t)): This formula is valid for any number of children for the duplication node i, provided N 11 is replaced by an appropriate matrix.

Ancestral adjacencies reconstruction
Ancestral states, that is, posterior probabilities of presence of adjacencies in the tree of possible adjacencies, are reconstructed by a top-down (from the root to the leaves) algorithm following the the bottom-up likelihood computation algorithm. In the top-down likelihood computation algorithm, we compute the conditional likelihoods of each node i according to the conditional likelihood of the data below it (D i ), and to the conditional likelihood of the data that is on the other part of its father f, D f , and to the conditional likelihood that is below the brothers of i (say one brother i′).
If father f of node i is a speciation node: where y is the state of i, x is the state of f, z the state of i′ and t′ the length of the branch from f to i′.
If father f of node i is a duplication node with one duplication (i.e. two sons i and i′), the likelihood of node i is the average of both scenarios: And the equivalent to the case of two duplications in the bottom-up algorithm is achieved by computing i's Figure 3 Model of evolution with duplications. A duplication in the same species in each of the two gene trees leads to a duplication node with four children (white square) in the tree of possible adjacencies between the two gene trees. Immediately after a duplication event, the adjacency is broken for the duplicated branch (branch leading to the rightmost red node). The second duplication leads to the simultaneous apparition of two other branches (leading to left and middle red nodes). The adjacency is also broken at the beginning of these two branches. The probabilities of transition between the duplication node and its four children are then given by the (., 0, 0, 0) (., ., ., .) components of P(t 1 ) ⊗ N 11 (t 1 ). In the likelihood computation, all positions for the blue and red nodes are considered. partial likelihoods when i's father is a duplication node with four children i, i′, i″, i‴, and the likelihood is an average of four scenarios: wzu (Pxy(t).N 11 0,0,0→w,z,u (t) + Pxw(t)N 11 0,0,0→y,z,u (t) +Pxz(t)N 11 0,0,0→w,y,u (t) + Pxu(t)N 11 0,0,0→w,z,y (t)) L(Di |w).L(Di |z).L(Di |u) (13) From these conditional likelihoods, a posteriori probabilities of presence of adjacencies can be computed. The result is, for each ancestral species, a set of adjacencies associated with probabilities of presence. Transforming it into a bona fide gene order necessitates finding a subset of probable adjacencies in which one ancestral gene can be adjacent to only two others. Efficient methods exist [20] to do so, but they ignore the main source of possible conflict between adjacencies when they are seen as independently evolving characters: errors in gene trees [27]. So in general we prefer presenting a set of adjacencies associated with probabilities, and leave open the way of choosing among them and/or correcting the input data to avoid conflict.

Implementation and availability
We implemented the model of evolution and the likelihood calculation algorithm in the Bio++ library (http:// biopp.univ-montp2.fr/). The algorithm that builds the trees of possible adjacencies was implemented in a separate program which also uses Bio++. Reconciliation was performed with DeCo [10]. All the analytical formulas in our model were computed using Maxima (see Additional file 1). These programs are available upon request to the authors.

Dataset
We selected 12 drosophila species from the Ensembl Metazoa [24] database. We used the species tree from [28], the gene trees and the chromosomal locations from Ensembl Metazoa. We pruned the gene trees to keep only the drosophilae clade, and reconciled them with the species tree using [10]. We reduced the dataset to the 9223 gene trees with at most one duplication between two speciation nodes in reconciled gene trees. We built a set of extant adjacencies by connecting consecutive genes in the reduced dataset, provided they were on the same chromosome or scaffold. We built 13059 trees of possible adjacencies from this set of reconciled gene trees and extant adjacencies. By maximum likelihood, we optimized the branch lengths of the species tree using our model of evolution, from the 3608 trees of possible adjacencies without any duplication (Figure 4). Optimizing branch lengths over many trees remains computationally intensive, especially for trees with several duplications (then the combinatorics increases). The choice of the sample to optimize from was thus a trade-off between accuracy and computational cost. While we optimized branch lengths, we also optimized the model's parameters in a non-stationary way.
Note that the drosophila genomes are not all perfectly assembled and some are fragmented in several hundred contigs. So all the signal does not have to be interpreted as rearrangements, but some of it is due to the absence of adjacencies in extant genomes.

Ancestral adjacencies
We computed posterior probabilities of presence and absence for all possible ancestral adjacencies, given the optimized branch lengths. We report in Table 1 the number of genes and adjacencies in extant and ancestral species. Note that the difference between the number of genes and adjacencies in extant species gives the number of chromosomes or scaffolds. This goes from the well assembled melanogaster genomes in 8 scaffolds to simulans with 445 scaffolds, with all intermediaries. Despite the fact that assembly is incomplete, we have enough adjacencies in the dataset to make a signal for the reconstruction of ancestral adjacencies. And indeed, 54222 adjacencies with posterior probability > 0.9 are proposed. The signal is weaker for ancient species, as in ANC10, with only 2360 adjacencies for 8026 genes, depicting a very fragmented ancestral genome.
The "degree" column in Table 1 shows that in general less than 4% of the genes harbor a conflicting signal with more than 2 attached adjacencies having posterior probability > 0.9. While this remains a high rate of error, it means that most of the supported signal constitutes linear ancestral contigs or chromosomes. The conflict is variable according to the lineages. A surprisingly high amount of conflict arises for the ancestor of yacuba and erecta, predicted as recent. Perhaps this reflects an ambiguity in the species tree which precisely at this place is debated [29]. It seems that rearrangements support an alternative topology.

Comparison with parsimony
We compare the results with those obtained by [10] (DeCo software) on the same data ( Figure 5). DeCo reconstructs ancestral adjacencies according to a parsimony principle, whereas we reconstruct all possible ancestral adjacencies along with a posterior probability of presence for each one. Most of the adjacencies reconstructed by Column "genes" is the number of genes in the dataset. Column "coverage" is the proportion of the genes in the dataset to the total number of genes annotated in Ensembl. Column "adjacencies" is the number of adjacencies in an extant genome or adjacencies in ancestral genomes with a posterior probability > 0.9.
Column "genes with more than 2 adjacencies" is the number of genes involved in more than 2 adjacencies of posterior probability > 0.9. This value is reported only for ancestral genomes, as in extant all genes have 0, 1 or 2 neighbors by definition.
DeCo are given a high probability of presence according to our model (70% have a support > 0.9). Interestingly, a few of them are given low probabilities of presence (11% have probabilities of presence < 0.5), suggesting that our model could bring a finer understanding of the evolution of these adjacencies. Figure 5 shows the distribution of posterior probabilities, as computed by our model, of all the possible adjacencies (in grey), and of all the adjacencies inferred by parsimony (in red). We always reconstruct more ancestral adjacencies than DeCo because DeCo reconstructs ancestral adjacencies up to the last common ancestor of an adjacency class, whereas we reconstruct possible ancestral adjacencies up to the most ancient ancestor of an adjacency class. This explains why many possible ancestral adjacencies have low or no support in the presence/absence pattern at the leaves.

Discussion
Probabilistic models of evolution have at least four advantages over parsimony approaches: they provide more accurate results in presence of many mutations; they provide a natural support scheme of the results in the form of a probability of ancestral states; the likelihood is computed by an integration over all scenarios rather than choosing only one, even if optimal; and several models at different scales of the genome can be integrated.
But most probabilistic models of gene order evolution are computationally intractable on large datasets, working with too large state spaces. Coding gene order by binary characters is a solution, like for many characters characterized by their presence or absence. Then it is possible, like in [30], to use a standard model of binary sequence evolution to achieve a probabilistic reconstruction of phylogenies and ancestral gene orders based on the presence/absence of adjacencies in extant species. This way can handle unequal gene content but does not model the processes of joint evolution of gene content and order, and has to simplify the data to make it fit into standard models. As a result a part of the understanding of genome evolution remains out of reach. This is why we put some efforts in a model of gene neighborhood evolution handling complex histories of genes depicted by their reconciled phylogenies.
We gain several advantages. For example the model allows to follow a pattern of descent of adjacencies. Links between genes evolve, just as genes evolve too. This can be used to detect the positional orthology (orthology of a gene as a locus, in addition to a sequence) when a gene is duplicated in an asymmetric way [31] -not in tandem, so that from the loci point of view, only one duplicate is a descendant of the unique copy before duplication. Here we allow any kind of duplication, symmetric or not, but in any case an adjacency is transmitted to one copy. In the case of a tandem duplication, this does not yield an asymmetry for the genes, because a gene has two adjacencies, and the two can transmit a descendant to a different copy in the case of a tandem duplication. But in the case of an asymmetric duplication, the two adjacencies are transmitted to the same copy of a gene and a positional homolog is detected.
We also keep track of the evolutionary events that can be responsible for the gain and loss of an adjacency. For example an adjacency can be lost because one of the genes is lost, or because of a rearrangement. It is two different reasons for an adjacency to be absent, and we are able with a model to differentiate both cases.
We found that a significant number of adjacencies inferred by parsimony on a drosophila dataset are not supported by a probabilistic model. It corroborates the usual findings in evolutionary models each time reasonably distant species are compared, whether it is sequence evolution [1], gene content evolution [32], or gene order evolution [12].
There are still several limitations to this work. For the moment the computation time is one of them, the efficiency of optimization algorithms coupled with our model allowed us to work only on a small fixed phylogeny. Theoretically we could even infer phylogenies, coupling a model of sequence evolution and such a model of genome organization evolution, but it will necessitate algorithmic progresses. Another limit is that our current implementation only handles independent duplication events, although we are also developing a model for joint duplications. Finally, the possible presence of many duplications yields intricate integrals difficult to solve analytically, if we want to stick with exact solutions integrating over their position in a branch. Numerical approximations or simplifying hypotheses have to be incorporated. For the moment families with many duplications are filtered out.

Conclusions
The present model is a proof of concept that it is possible to handle whole genomes of dozens of species, including genes with complex histories, into a probabilistic model for gene organization.
We open a path that has many possible continuations: • Handle joint duplications of two consecutive genes as a single duplication event.
• Handle more than one gene duplication between two gene speciations.
• Jointly infer probabilistic presence and absence of genes and gene neighborhoods, using conditional probabilities mixing two models. • Integrate the model into an integrative probabilistic model of genome evolution, handling both sequence evolution and gene content evolution, like Phyldog [27].
• With this integration the model can be used to infer species phylogenies, or at least in the current state of computational complexity, to test among a small number of species phylogenies. For example we will test two different alternative drosophila species tree topologies according to the likelihood of our model, and according to the coherence of ancestral genomes (the linear organization of genes along chromosomes). • Use this model to detect highly variable sites by correlating variable rates of adjacency evolution (in a similar framework as for sequence evolution [34]) and intergene sizes, and bring a stone to the study of fragile and solid regions [26].