Evolution of genes neighborhood within reconciled phylogenies: an ensemble approach

Chauve, Cedric; Ponty, Yann; Zanetti, João Paulo Pereira

doi:10.1186/1471-2105-16-S19-S6

Volume 16 Supplement 19

Brazilian Symposium on Bioinformatics 2014

Research
Open access
Published: 16 December 2015

BMC Bioinformatics volume 16, Article number: S6 (2015)

Abstract

Context

The reconstruction of evolutionary scenarios for whole genomes in terms of genome rearrangements is a fundamental problem in evolutionary and comparative genomics. The DeCo algorithm, recently introduced by Bérard et al., computes parsimonious evolutionary scenarios for gene adjacencies, from pairs of reconciled gene trees. However, as for many combinatorial optimization algorithms, there can exist many co-optimal, or slightly sub-optimal, evolutionary scenarios that deserve to be considered.

Contribution

We extend the DeCo algorithm to sample evolutionary scenarios from the whole solution space under the Boltzmann distribution, and also to compute Boltzmann probabilities for specific ancestral adjacencies.

Results

We apply our algorithms to a dataset of mammalian gene trees and adjacencies, and observe a significant reduction of the number of syntenic conflicts observed in the resulting ancestral gene adjacencies.

Background

The reconstruction of the evolutionary history of genomic characters along a given species tree is a long-standing problem in computational biology. This problem has been well studied for several types of genomic characters, for which efficient algorithms exist to compute parsimonious evolutionary scenarios; classical examples include gene and genome sequences [1], gene content [2], and gene family evolution [3, 4]. Recently, Bérard et al. [5] extended the corpus of such results to syntenic characters. They introduced the notion of adjacency forest, which models the evolution of gene adjacencies within a phylogeny, motivated by the reconstruction of the architecture of ancestral genomes, and described an efficient dynamic programming (DP) algorithm, called DeCo, to compute parsimonious adjacency evolutionary histories. So far, DeCo is the only existing tractable model that considers the evolution of gene adjacencies within a general phylogenetic framework: other tractable models of genome rearrangements accounting for a given species phylogeny are either limited to single-copy genes and ignore gene-specific events [6], assume restrictions on the gene duplication events, such as considering only whole-genome duplication (see [7] and references therein), or require a dated species phylogeny [8].

From a methodological point of view, most existing algorithms to reconstruct evolutionary scenarios along a species tree in a parsimony framework rely on dynamic-programming along this tree, whose introduction can be traced back to Sankoff in the 1970s (see [9] for a recent retrospective on this topic). Recently, several works considered more general approaches for such parsimony problems that either explore a wider range of values for combinatorial parameters of parsimonious models [10] or consider several alternate histories for a given instance, chosen for example from the set of all possible co-optimal scenarios or from the whole solution space, including suboptimal solutions (see [11–13] for examples of this approach for the gene tree/species tree reconciliation problem).

The present work follows the latter approach and extends the DeCo DP scheme toward an exploration of the whole solution space of adjacency histories, under the Boltzmann probability distribution, which assigns to each solution a probability defined in terms of its parsimony score. This principle of exploring the solution space of a combinatorial optimization problem under the Boltzmann probability distribution is sometimes known as the "Boltzmann ensemble approach". It was initially introduced in the context of RNA folding, where the probability of any given conformation at the thermodynamic equilibrium follows a Boltzmann distribution, i.e. a conformation s is observed for a given RNA w with probability $e^{- E_{w, s} / k T} / Z_{w}$, where $E_{w, s}$ is the free energy of conformation s over w, k is the Boltzmann constant, T is the temperature, and $Z_{w}$ is the partition function of w. This latter quantity can be seen as a renormalization factor, and is key in the study of RNA thermodynamics, but its computation involves summing over an exponential number of conformations compatible with the RNA sequence. A major paradigm shift occurred in RNA research when McCaskill [14] showed in 1990 how an efficient algorithm for the partition function could be adapted from a DP scheme for energy minimization through a simple change of algebra. This seminal work also introduced a variant of the inside-outside algorithm [15] for computing base-pairing probabilities.

While this Boltzmann ensemble approach has been used for a long time in RNA structure analysis, to the best of our knowledge this is not the case in comparative genomics, where exact probabilistic models have been favoured recently [16, 17]. However, probabilistic models still pose computational challenges for large datasets, and so far no probabilistic model exists for gene adjacencies, which motivates our work. In the specific case of the DeCo model, the ability to explore alternative co-optimal or slightly sub-optimal solutions is crucial. Indeed, along a chromosome each ancestral gene can be adjacent to at most two other genes, a constraint that DeCo does not enforce. As a result, the initial experiments using DeCo on mammalian gene trees produced hundreds of ancestral genes involved in more than two ancestral gene adjacencies [5]. This raises the question of filtering inferred ancestral adjacencies to reduce the level of syntenic conflict, which can be done on the basis of their Boltzmann probabilities. We reason that some of the erroneously predicted adjacencies may result from combinatorial optimization artifacts, and that features of a parsimonious gene adjacency evolutionary scenario that are not robust to considering alternative equivalent, or slightly worse, solutions should be considered dubious.

Methods

Models

A phylogeny is a rooted tree which represents the evolutionary relationships of a set of elements represented by its nodes: internal nodes are ancestors, leaves are extant elements, and edges represent direct descents between parents and children. We consider here three kinds of phylogenies (illustrated in Figure 1): species trees, reconciled gene trees and adjacency trees/forests. Trees we consider are always rooted. For a tree T and a node x of T, we denote by T(x) the subtree rooted at x. If x is an internal node, we assume it has either one child, denoted by a_x, or two children, denoted by a_x and b_x.

Species trees

A species tree S is a binary tree that describes the evolution of a set of related species, from a common ancestor (the root of the tree), through the mechanism of speciation. For our purpose, species are identified with genomes, and genes are arranged linearly or circularly along chromosomes.

Reconciled gene trees

A reconciled gene tree is also a binary tree that describes the evolution of a set of genes, called a gene family, through the evolutionary mechanisms of speciation, gene duplication and gene loss, within the given species tree S. Therefore, each node of a gene tree G represents a gene loss, an extant gene or an ancestral gene. Ancestral genes are represented by the internal nodes of G, while gene losses and extant genes are represented by the leaves of G.

We denote by s(g) ∈ S the species of a gene g ∈ G, and by e(g) the evolutionary event that leads to the creation of the two children a_g and b_g. If g is an internal node of G, then e(g) is a speciation (denoted by Spec) if the species pair {s(a_g), s(b_g)} equals the species pair {a_s(g), b_s(g)}, or a gene duplication (GDup) if s(a_g) = s(b_g) = s(g). Finally, if g is a leaf, then e(g) indicates either a gene loss (GLoss) or an extant gene (Extant), in which case e(g) is not an evolutionary event.

Adjacency trees and forests

A gene adjacency is a pair of genes that appears consecutively along a chromosome. An adjacency tree represents the evolution of an ancestral adjacency through the evolutionary events of speciation, gene duplication, gene loss (these events, as described above, occur at the gene level and are modelled in the reconciled gene trees), and adjacency duplication (ADup), adjacency loss (ALoss) and adjacency break (ABreak), that are adjacency-specific events.

The duplication of an adjacency {g₁, g₂} follows from the simultaneous duplication of both its genes g₁ and g₂ (with s(g₁) = s(g₂) and e(g₁) = e(g₂) = GDup), resulting in the creation of two distinct adjacencies each belonging to {a_g1, b_g1} × {a_g2, b_g2}.
An adjacency may disappear due to several events, such as the loss of exactly one (gene loss) or both (adjacency loss) of its genes, or a genome rearrangement that breaks the contiguity between the two genes (adjacency break).

Finally, to model the complement of an adjacency break, i.e. the creation of adjacencies through a genome rearrangement, adjacency gain (AGain) events are also considered, and result in the creation of a new adjacency tree. It follows that the evolution of the adjacency between two genes can be described by a forest of adjacency trees, called an adjacency forest. In this forest, each node v belongs to a species denoted by s(v), and is associated with an evolutionary event e(v) ∈ {Spec, GDup, ADup} if v is an internal node, or e(v) ∈ {Extant, GLoss, ALoss, ABreak} if v is a leaf. Finally, adjacency gain events are associated with the roots of the trees of the adjacency forest. So, in the same way that a gene tree G evolves within the species tree S, an adjacency forest F describing the evolution of the adjacency between two gene families G₁ and G₂ evolves within S, G₁ and G₂. We refer the reader to Figure 1 for an illustration.

Parsimony scores and the Boltzmann distribution

When considered in a parsimonious framework, the score of an adjacency forest F is the number of adjacency gains and breaks; other events are not considered as they are the by-products of evolutionary events already accounted for in the score of the reconciled gene trees G₁ and G₂. We denote by s_a(F) the parsimony score of an adjacency forest F. Let $F (G_{1}, G_{2})$ be the set of all adjacency forests for G₁ and G₂, including both optimal and sub-optimal ones, where we assume that at least one extant adjacency is composed of extant genes from G₁ and G₂.

We define the Boltzmann factor of an adjacency forest F as

B (F) = e^{- \frac{s_{a} (F)}{k T}} .

(1)

The partition function associated to two trees G₁ and G₂ is obtained as

Z (G_{1}, G_{2}) = \sum_{F \in F (G_{1}, G_{2})} e^{- \frac{s_{a} (F)}{k T}}

(2)

where kT is an arbitrary constant. The partition function implicitly defines a Boltzmann probability distribution over $F (G_{1}, G_{2})$ , where the probability of an adjacency forest F is defined by:

P (F) = \frac{e^{\frac{- s_{a} (F)}{k T}}}{Z (G_{1}, G_{2})} .

(3)

By exponentially favouring adjacency forests with lower parsimony scores, the Boltzmann distribution provides an alternative way to probe the search space, which is heavily influenced by the choice of kT. Indeed, decreasing kT values will skew the Boltzmann distribution towards more parsimonious adjacency forests. Its limiting distributions are uniform over the whole search space (kT → +∞) or over the set of co-optimal forests (kT → 0) (see Figure 2 for an illustration).
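The effect of kT on the distribution can be checked numerically. The following is a minimal sketch, assuming a hypothetical list of parsimony scores for five adjacency forests (these values and the function name `boltzmann_probabilities` are illustrative and not taken from the dataset or from DeClone); it evaluates Equation (3) for several values of kT.

```python
import math

def boltzmann_probabilities(scores, kT):
    """Return the Boltzmann probability of each solution, given its parsimony score."""
    # Shifting all scores by the minimum leaves the ratios unchanged
    # and avoids floating point underflow for small kT.
    best = min(scores)
    factors = [math.exp(-(s - best) / kT) for s in scores]
    Z = sum(factors)  # partition function of the shifted scores
    return [f / Z for f in factors]

# Hypothetical scores: two co-optimal forests and three suboptimal ones.
scores = [1, 1, 2, 3, 3]
for kT in (0.01, 0.1, 0.5, 100.0):
    probs = boltzmann_probabilities(scores, kT)
    print(kT, [round(p, 3) for p in probs])
```

For small kT the probability mass concentrates on the two co-optimal forests, while for large kT the distribution approaches the uniform distribution over all five forests, in line with the two limiting cases described above.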

A Boltzmann probability distribution on the set of all adjacency forests for a given instance also implies a well-defined notion of probability for features of adjacency forests. For example, one can associate a probability with a specific potential ancestral adjacency (i.e. an adjacency between two genes from a given ancestral species), defined as the sum of the Boltzmann factors of the adjacency forests that contain this adjacency, divided by the partition function.

Algorithms

DeCo, the algorithm described in [5] to compute a parsimonious adjacency forest, is a DP scheme constrained by S, G₁ and G₂. We first present this algorithm, then describe how to extend it into a Boltzmann ensemble algorithm.

The DeCo DP scheme

Let G₁ and G₂ be two reconciled gene trees and g₁ and g₂ be two nodes of G₁ and G₂, respectively, such that s(g₁) = s(g₂). The DeCo algorithm computes, for every such pair of nodes g₁ and g₂, two quantities denoted by c₁(g₁, g₂) and c₀(g₁, g₂), which correspond to the score of a most parsimonious adjacency forest for the pair of subtrees G₁(g₁) and G₂(g₂), under the hypothesis of the presence (c₁) or absence (c₀) of an ancestral adjacency between g₁ and g₂. As usual in DP along a species tree, the score of a parsimonious adjacency forest for G₁ and G₂ is given by min(c₁(r₁, r₂), c₀(r₁, r₂)) where r₁ is the root of G₁ and r₂ the root of G₂.

So, c₁(g₁, g₂) and c₀(g₁, g₂) can be computed as the minimum of sums of the scores of adjacency gains or breaks and, more importantly, of terms of the form c₁(x, y) and c₀(x, y) with (x, y) ∈ {g₁, a_g1, b_g1} × {g₂, a_g2, b_g2} − {(g₁, g₂)}, using the two combinatorial operators min and +.

(Un)-ambiguity of the DeCo DP scheme

As defined in [18], the ambiguity of a DP algorithm can be characterized as follows: a DP explores a combinatorial solution space (here for DeCo, the space of all possible adjacency forests, including possible suboptimal solutions), which can be explicitly generated by replacing in the equations min by $\cup$ (the set-theoretic union operator) and + by the Cartesian product × between combinatorial sets. A DP algorithm is then unambiguous if the unions are disjoint, i.e. the sets provided as its arguments do not overlap.

We claim that the DeCo dynamic programming scheme is unambiguous. Indeed, computing c₁(g₁, g₂) and c₀(g₁, g₂) branches on disjoint subcases that each involve a different set of terms c₁(x, y) and c₀(x, y). The only case that deserves closer attention is the case where e(g₁) = e(g₂) = GDup, as a simultaneous duplication can be obtained by two successive duplications. But in this case, the number of AGain events is different (see Figure 3), which ensures that the corresponding solutions are pairwise distinct.

Stochastic backtrack algorithm through algebraic substitutions

As mentioned in [18], any unambiguous dynamic programming scheme can be adapted through algebraic changes to exhaustively generate the set of all adjacency forests, and also to compute the corresponding partition function. To that purpose one simply needs to replace the arithmetic operators (min, +) with (Σ, ×), and to exponentiate any atomic cost $C \in ℝ$ into a (partial) Boltzmann factor $e^{- C / k T}$ (see Figure 3).
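The substitution can be illustrated on a toy recurrence. The sketch below is not the DeCo scheme (whose equations are given in Figure 3): the `RULES` table, its state names and its costs are hypothetical, chosen only so that every solution is produced by exactly one sequence of choices (an unambiguous scheme). The same recurrence is evaluated once with (min, +) and once with (Σ, ×) after exponentiating the atomic costs.

```python
import math

# Hypothetical toy DP: each state maps to a list of alternatives, and each
# alternative is (atomic cost, tuple of sub-states it recurses on).
RULES = {
    "root": [(0, ("left", "right")), (1, ("left",))],
    "left": [(0, ()), (1, ())],
    "right": [(0, ()), (2, ())],
}

def evaluate(state, combine_alternatives, combine_terms, lift):
    """Evaluate the recurrence at `state` under the chosen algebra."""
    values = []
    for cost, children in RULES[state]:
        value = lift(cost)  # atomic cost, possibly turned into a Boltzmann factor
        for child in children:
            sub = evaluate(child, combine_alternatives, combine_terms, lift)
            value = combine_terms(value, sub)
        values.append(value)
    return combine_alternatives(values)

def min_score(state="root"):
    # (min, +) algebra: score of a most parsimonious solution.
    return evaluate(state, min, lambda a, b: a + b, lambda c: c)

def partition_function(state="root", kT=1.0):
    # (sum, x) algebra, with each cost C replaced by exp(-C / kT).
    return evaluate(state, sum, lambda a, b: a * b,
                    lambda c: math.exp(-c / kT))

print(min_score())            # optimal score of the toy instance
print(partition_function())   # sum of Boltzmann factors over all six solutions
```

The toy instance has six solutions with scores 0, 1, 1, 2, 2 and 3, so the second call returns the sum of the corresponding six Boltzmann factors without enumerating them explicitly.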

This precomputation allows us to sample adjacency forests under the Boltzmann distribution, by changing the deterministic backtrack used for maximum parsimony into a stochastic operation. Indeed, assume that the partition function version of the DeCo equation computes c₁(g₁, g₂) (resp. c₀(g₁, g₂)) as $\sum_{i \in [1, k_{1}]} t_{i}$ , where the t_i denote the contribution to the partition function of one of the local alternatives within the DP scheme. The latter are typically computed recursively as combinations of atomic adjacency gain/break costs, and recursive terms of the form c₁(x, y) and c₀(x, y) with (x, y) ∈ {g₁, a_g1, b_g1} × {g₂, a_g2, b_g2} − {(g₁, g₂)}.

Then a (possibly non-parsimonious) random solution can be generated recursively for c₁(g₁, g₂) (resp. c₀(g₁, g₂)), by branching on some t_i with probability t_i/c₁(g₁, g₂) (resp. t_i/c₀(g₁, g₂)), and then proceeding recursively on each occurrence of a recursive term within the alternative t_i. The correctness of the algorithm, i.e. the fact that the random process generates each adjacency forest with its Boltzmann probability, follows immediately from general considerations on unambiguous DP schemes [18].

The stochastic nature of the backtrack does not affect its worst-case complexity. This Boltzmann sampling algorithm, for an instance composed of two gene trees G₁ and G₂ of respective sizes (number of leaves) n₁ and n₂, has time complexity of $O (n_{1} \times n_{2})$ for each backtrack.
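The stochastic backtrack can be sketched on a small example. The code below assumes a hypothetical toy recurrence (`RULES`, with made-up states and costs) rather than the DeCo equations: it first fills the partition function table, then samples a solution by choosing each alternative t_i with probability t_i divided by the partition function of the current state.

```python
import math
import random

# Hypothetical toy DP: state -> list of (atomic cost, sub-states).
RULES = {
    "root": [(0, ("left", "right")), (1, ("left",))],
    "left": [(0, ()), (1, ())],
    "right": [(0, ()), (2, ())],
}

def alternative_weight(cost, children, Z, kT):
    """Contribution t_i of one alternative to the partition function."""
    weight = math.exp(-cost / kT)
    for child in children:
        weight *= Z[child]
    return weight

def partition_functions(kT):
    """Precompute the partition function of every state reachable from the root."""
    Z = {}
    def visit(state):
        if state not in Z:
            for _, children in RULES[state]:
                for child in children:
                    visit(child)
            Z[state] = sum(alternative_weight(c, ch, Z, kT)
                           for c, ch in RULES[state])
    visit("root")
    return Z

def sample(state, Z, kT, rng):
    """Draw one solution below `state`; return (score, chosen alternatives)."""
    alternatives = RULES[state]
    weights = [alternative_weight(c, ch, Z, kT) for c, ch in alternatives]
    # Branch on alternative i with probability t_i / Z[state].
    index = rng.choices(range(len(alternatives)), weights=weights)[0]
    cost, children = alternatives[index]
    score, picks = cost, [(state, index)]
    for child in children:
        sub_score, sub_picks = sample(child, Z, kT, rng)
        score += sub_score
        picks.extend(sub_picks)
    return score, tuple(picks)

rng = random.Random(2015)  # fixed seed so that runs are reproducible
Z = partition_functions(kT=1.0)
for _ in range(3):
    print(sample("root", Z, 1.0, rng))
```

Each call to `sample` visits one alternative per state it reaches, so the cost of a backtrack is bounded by the size of the DP table, consistent with the complexity stated above.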

Rescaling to avoid numerical overflows

The partition function values $Z (g_{1}, g_{2})$, handled during the computation, typically grow exponentially in the total number of nodes in G₁ and G₂, and may end up overflowing the floating point data type used within the DP tables. Following practice in RNA folding prediction [19], we address this issue by iteratively applying a homogeneous rescaling of these values during the computation, to keep the values found in the DP table asymptotically close to 1, while still allowing for analysis of the Boltzmann distribution.

To that purpose, one introduces a rescaling factor α which is applied, as a multiplicative term, to some of the DP rules. A rescaling is homogeneous for a pair of (sub)trees (G₁(g₁), G₂(g₂)) (abridged into (g₁, g₂) from now on) when the number of occurrences of α, encountered during the generation of a given solution F, only depends on (g₁, g₂) and not on specific features of F. Let us denote by $κ_{g_{1}, g_{2}}$ the number of occurrences of α for (g₁, g₂); then the rescaled contribution of a given solution F is $e^{- \frac{s_{a} (F)}{k T}} \times α^{κ_{g_{1}, g_{2}}}$, while the rescaled partition function, computed by the modified DP scheme, is given by

Z_{α} (g_{1}, g_{2}) = \sum_{F \in F (g_{1}, g_{2})} e^{- \frac{s_{a} (F)}{k T}} \times α^{κ_{g_{1}, g_{2}}} .

(4)

A direct execution of the stochastic backtrack algorithm then returns each forest F with probability

\frac{e^{- \frac{s_{a} (F)}{k T}} \times α^{κ_{g_{1}, g_{2}}}}{Z_{α} (g_{1}, g_{2})} = \frac{e^{- \frac{s_{a} (F)}{k T}}}{Z (g_{1}, g_{2})} = P (F)

(5)

In other words, the introduction of the rescaling does not induce any bias in the stochastic sampling, i.e. the sampling still follows a Boltzmann distribution.
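To see why Equation (5) holds, it may help to spell out the cancellation. The short derivation below assumes only the homogeneity condition stated above, namely that the exponent of α (written $κ_{g_{1}, g_{2}}$ here) is the same for every forest F generated from (g₁, g₂), so that the rescaling term factors out of the sum:

```latex
\begin{align*}
Z_{\alpha}(g_1, g_2)
  &= \sum_{F \in F(g_1, g_2)} e^{-\frac{s_a(F)}{kT}} \times \alpha^{\kappa_{g_1, g_2}}
   = \alpha^{\kappa_{g_1, g_2}} \times Z(g_1, g_2), \\
\frac{e^{-\frac{s_a(F)}{kT}} \times \alpha^{\kappa_{g_1, g_2}}}{Z_{\alpha}(g_1, g_2)}
  &= \frac{e^{-\frac{s_a(F)}{kT}} \times \alpha^{\kappa_{g_1, g_2}}}
          {\alpha^{\kappa_{g_1, g_2}} \times Z(g_1, g_2)}
   = \frac{e^{-\frac{s_a(F)}{kT}}}{Z(g_1, g_2)} = P(F).
\end{align*}
```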

On the other hand, α can be used to constrain the values $Z_{α} (g_{1}, g_{2})$ to avoid numerical overflows. For instance, since Equation (4) gives $Z_{α} (G_{1}, G_{2}) = α^{κ_{G_{1}, G_{2}}} \times Z (G_{1}, G_{2})$, setting $α^{*} = Z {(G_{1}, G_{2})}^{- 1 / κ_{G_{1}, G_{2}}}$ yields $Z_{α^{*}} (G_{1}, G_{2}) = 1$. Furthermore, if the rescaling terms are regularly distributed during the execution of the DP scheme, then the intermediate values $c_{0 | 1} (g_{1}, g_{2})$ also typically remain close to 1, thereby avoiding numerical over/underflows. In practice, $Z (G_{1}, G_{2})$ is the end product of the computation, and thus cannot be used to determine a suitable value for α. However, any value that avoids numerical over/underflow can be used, so DeClone accepts as input a prescribed value for α. Note that α can also typically be inferred from a partial computation, based on the first occurrence of an under/overflow in the DP matrices. To apply these concepts in the context of the DeCo DP scheme, we are left to find a homogeneous rescaling.

Fortunately, we observe that the number of recursive calls $c_{0 | 1} (g_{1}^{'}, g_{2}^{'})$ , where $e (g_{1}^{'})$ ≠ GDup and $e (g_{2}^{'})$ ≠ GDup, is provably constant within the solutions generated from any call c_0|1(g₁, g₂). For the sake of simplicity, we assume here that calls of the form c_0|1(g₁, g₂), where e(g₁) = GLoss (resp. e(g₂) = GLoss), are expanded into calls c_0|1(g₁, a_g2) and c_0|1(g₁, b_g2) (resp. c_0|1(a_g1, g₂) and c_0|1(b_g1, g₂)), unless g₂ (resp. g₁) is also a leaf. From this observation that can be tediously verified by induction, we adapt the DP scheme as illustrated by Figure 3.

Inside-Outside algorithm

While the sampling algorithm described above provides a flexible, easy to implement, approach to analyze the Boltzmann distribution, it only allows for the computation of estimates for properties of interest (for example the occurrence of a specific ancestral adjacency in evolutionary scenarios), whose accuracy may critically depend on the number of samples, the - a priori unknown - variance of the underlying distribution, or other factors. However, whenever the property of interest, in conjunction with the DP scheme, fulfills certain technical conditions [18], it is possible to compute its expectation exactly in polynomial time, by transforming the DP scheme using a variant of the inside-outside algorithm.

More precisely, our objective is to compute the probabilities associated with each of the $O (n_{1} \times n_{2})$ left-hand-side (LHS) to right-hand-side (RHS) transitions in the DP recurrence. Let us denote by l → r an LHS/RHS transition, such that

l \in \{0, 1\} \times G_{1} \times G_{2} \quad \text{and} \quad r \in ℝ^{+} \times {(\{0, 1\} \times G_{1} \times G_{2})}^{*},

(6)

and by $F_{l \to r}$ the set of forests whose production uses the l → r transition. The Boltzmann probability of (l → r) is then defined as

P (l \to r) := \sum_{F \in F_{l \to r}} P (F) \equiv \frac{\sum_{F \in F_{l \to r}} e^{- \frac{s_{a} (F)}{k T}}}{Z (G_{1}, G_{2})} .

(7)

Since $Z (G_{1}, G_{2})$ is known, it is sufficient to compute the numerator of the above fraction, i.e. the total Boltzmann factor of the forests $F_{l \to r}$ that feature (l → r). On the other hand, the number of forests in $F_{l \to r}$ typically grows exponentially in n₁ + n₂, so one must find an efficient strategy for computing this summation.

The principle of the inside-outside algorithm [15] is to decompose each of the executions, associated with a forest in $F_{l \to r}$, into: a) an inside part, generated from the recursive calls in the RHS r; and b) an outside part, which denotes the context in which the LHS l appears, i.e. an execution of the DP scheme which features a recursive call to l, and is truncated at that point. Let us remark that the inside and outside parts are independent, i.e. any inside part can be combined with any outside part to form a valid execution of the DP scheme, and the score of the associated forest is simply obtained by summing the scores of its two parts. Thus, the total Boltzmann factor of the forests $F_{l \to r}$, with $l := (x_{l}, g_{l}, g_{l}^{'})$, can be decomposed as

\sum_{F \in \mathcal{F}_{l \to r}} e^{- \frac{s_{a} (F)}{k T}} \equiv \underbrace{e^{- \frac{C_{r}}{k T}} \times \prod_{(x, g, g') \in r} c_{x} (g, g')}_{\text{inside contribution}} \times \underbrace{d_{x_{l}} (g_{l}, g_{l}^{'})}_{\text{outside contribution}},

where C_r denotes the constant score increment in the RHS, and $d_{x_{l}} (g_{l}, g_{l}^{'})$ is the outside partition function, i.e. the total Boltzmann factor of all outside parts that are truncated at l. This term can be computed in $O (n_{1} \times n_{2})$ by inverting the DP scheme of Figure 3 in a purely generic, yet quite technical, fashion [18]. To limit the risk of mistakes in the derivation/implementation of DP equations for d_0|1(g₁, g₂), we implemented an ad hoc parser, based on the inversion principle described by Ponty and Saule [18].

Once the probabilities P(l → r) are known, it is possible to determine the probability of an (ancestral) adjacency (g₁, g₂) by simply summing over the probabilities of transitions that infer such an adjacency, i.e. that feature a recursive call of the form c₁(g₁, g₂) within their RHS. Iterating this over all (g₁, g₂) pairs, one obtains an adjacency matrix, as shown in Figure 2.
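The inside/outside decomposition can be sketched on a small acyclic recurrence. The example below is an illustration under stated assumptions, not the DeCo equations or the parser mentioned above: `RULES`, `ORDER` and the costs are hypothetical, the table `c` plays the role of the inside partition functions and `d` that of the outside partition functions, and the probability of a transition is obtained as outside × inside / Z.

```python
import math

# Hypothetical toy DP: state -> list of (atomic cost, sub-states).
RULES = {
    "root": [(0, ("left", "right")), (1, ("left",))],
    "left": [(0, ()), (1, ())],
    "right": [(0, ()), (2, ())],
}
ORDER = ["root", "left", "right"]  # every state is listed before its sub-states

def transition_probabilities(kT):
    """Return {(state, alternative index): Boltzmann probability of that transition}."""
    def boltz(cost):
        return math.exp(-cost / kT)

    # Inside pass (sub-states first): c[state] is the partition function below state.
    c = {}
    for state in reversed(ORDER):
        c[state] = sum(boltz(cost) * math.prod(c[ch] for ch in children)
                       for cost, children in RULES[state])

    # Outside pass (parents first): d[state] sums the Boltzmann factors of all
    # contexts in which a recursive call to `state` can appear.
    d = {state: 0.0 for state in ORDER}
    d["root"] = 1.0
    for state in ORDER:
        for cost, children in RULES[state]:
            for i, child in enumerate(children):
                siblings = math.prod(c[ch] for j, ch in enumerate(children) if j != i)
                d[child] += d[state] * boltz(cost) * siblings

    Z = c["root"]
    probabilities = {}
    for state in ORDER:
        for index, (cost, children) in enumerate(RULES[state]):
            inside = boltz(cost) * math.prod(c[ch] for ch in children)
            probabilities[(state, index)] = d[state] * inside / Z
    return probabilities

for transition, p in transition_probabilities(kT=1.0).items():
    print(transition, round(p, 4))
```

In this toy instance, the two alternatives of `root` have probabilities summing to 1, and the alternatives of `right` sum to the probability of the only `root` alternative that calls `right`, which is the kind of sanity check one may also apply to an actual implementation.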

Results and discussion

Data

We re-analyzed a dataset described in [5] composed of 5,039 reconciled gene trees and 50,389 extant gene adjacencies, forming 6,074 DeCo instances, with genes taken from 36 mammalian genomes from the Ensembl database in 2012. In [5], these data were analyzed using the DeCo algorithm, which computed a single parsimonious adjacency forest per instance. Altogether, these adjacency forests defined 112,188 (resp. 96,482) ancestral and extant genes (resp. adjacencies), where, by "ancestral adjacency", we mean an adjacency that involves two genes g₁ and g₂ whose descendants in their respective gene trees do not belong to the same species s(g₁) (equal to s(g₂)), i.e. g₁ and g₂ are pre-speciation genes that were not duplicated within their species (this choice is motivated by the fact that the reconstruction of ancestral genomes considers pre-speciation genomes). More importantly, we observe 5,817 ancestral genes participating in three or more ancestral adjacencies, which represents a significant level of syntenic conflict (close to 5%), as a gene can only be adjacent to at most two neighboring genes along a chromosome.

DeCo scores, solution space

Unlike reconciled gene trees, whose mutation cost can be high, most adjacency forests have a relatively low cost, with only 32 instances leading to forests of score 5 or above, while the average number of parsimonious syntenic events (adjacency gains and breaks) is 1.25. This illustrates the fact that syntenic events, which are due to genome rearrangements, are rare evolutionary events. This in turn suggests that parsimony is a relevant criterion for such characters, and that the robustness of syntenic characters with respect to the whole solution space should be assessed in terms of optimal or slightly suboptimal evolutionary scenarios.

Boltzmann sampling and exact Boltzmann probabilities

For each instance, we sampled 1,000 adjacency forests under the Boltzmann distribution, for three values of kT (0.01, 0.1 and 0.5), and recorded the frequency of all observed ancestral adjacencies. Then, for the same values of kT, we computed the exact Boltzmann probability of all potential ancestral adjacencies using the inside-outside algorithm. The results observed were very similar whether sampling or exact probabilities were considered. However, the time required to compute exact Boltzmann probabilities is polynomial, so the exact Boltzmann approach based on the inside-outside algorithm should naturally be favoured in applications. Consequently, we discuss only the case of exact Boltzmann probabilities below.

The main difference between the three values of kT is that, with kT = 0.5, non-optimal adjacency forests have a higher Boltzmann probability in the Boltzmann distribution, while kT = 0.1 skews the distribution toward optimal adjacency forests and slightly suboptimal ones, and kT = 0.01 ensures that the probability of sub-optimal adjacency forests is extremely low and almost does not contribute to the partition function. We then looked at the numbers of ancestral adjacencies, genes and syntenic conflicts from ancestral adjacencies in terms of Boltzmann probability. Table 1 below summarizes the obtained results.

Table 1 Characteristics of ancestral genes and adjacencies from observed ancestral adjacencies filtered by Boltzmann probability (leftmost column), with different kT values.

The difference observed between the results with different values of kT supports the view that parsimony is an appropriate criterion for looking at gene adjacency evolution. Indeed, in the results obtained with kT = 0.5, which gives a higher probability to non-optimal adjacency forests, the number of conserved ancestral adjacencies drops sharply after probability 0.6, showing that very few ancestral adjacencies appear with high probability. However, with kT = 0.1 and kT = 0.01, by taking a high probability threshold (starting at a threshold of 0.6), we significantly reduce the number of syntenic conflicts while maintaining a number of ancestral genes similar to that of the experiments described in [5]; this observation illustrates the potential of the ensemble approach compared to the classical dynamic programming approach that relies on a single arbitrary optimal solution. Next, the experiment with kT = 0.01, which considers only co-optimal scenarios (the probability of non-optimal scenarios falls under the numerical precision), shows that, despite conserving only ancestral adjacencies with maximal support in terms of Boltzmann probability, a significant number of syntenic conflicts remains. We conjecture that this is due to errors in the considered reconciled gene trees, and it would be interesting to see whether the information about highly supported conflicting adjacencies can be used to correct reconciled gene trees.
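The threshold-based filtering discussed here can be sketched as follows. This is a minimal illustration on made-up data: the gene names, the probabilities and the function names are hypothetical and do not come from the mammalian dataset. An ancestral gene is counted as being in syntenic conflict when it takes part in more than two retained adjacencies.

```python
from collections import Counter

def filter_adjacencies(probabilities, threshold):
    """Keep the adjacencies whose Boltzmann probability reaches the threshold."""
    return {adj for adj, p in probabilities.items() if p >= threshold}

def conflicting_genes(adjacencies):
    """Return the genes involved in more than two adjacencies."""
    degree = Counter(gene for adj in adjacencies for gene in adj)
    return {gene for gene, count in degree.items() if count > 2}

# Hypothetical Boltzmann probabilities of ancestral adjacencies.
probabilities = {
    ("a", "b"): 0.95,
    ("a", "c"): 0.90,
    ("a", "d"): 0.35,
    ("d", "e"): 0.80,
}

for threshold in (0.0, 0.6):
    kept = filter_adjacencies(probabilities, threshold)
    print(threshold, len(kept), sorted(conflicting_genes(kept)))
```

Without filtering, gene `a` has three neighbours and is in conflict; with a threshold of 0.6 the weakly supported adjacency is dropped and no conflict remains, while all four genes other than `d`'s lost partner relation are still covered by at least one adjacency.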

Conclusions

The main contribution of our work is an extension of the DeCo dynamic programming scheme to consider adjacency forests in a probabilistic framework, under the Boltzmann distribution. The application of our algorithms to a mammalian gene dataset, together with a simple threshold-based approach to filter ancestral adjacencies, proved effective in significantly reducing the number of syntenic conflicts, illustrating the interest of the ensemble approach. This preliminary work raises several questions and can be extended along several lines. Among them, we can cite two of immediate interest. First, given the Boltzmann probabilities of the adjacency gains and breaks associated with ancestral adjacencies, we could use them to compute a Maximum Expected Accuracy adjacency forest, which is a parsimonious adjacency forest in a scoring model where each event is weighted by its Boltzmann probability (see [20] for an example of this approach for RNA secondary structures). This would provide a unique evolutionary scenario per instance. Next, we considered here an evolutionary model based on speciation, duplication and loss. A natural extension would be to include the event of lateral gene transfer in the model. Efficient reconciliation algorithms exist for several variants of this model [3, 4], together with an extension of DeCo, called DeCoLT [21]. DeCoLT is also based on dynamic programming, and it is likely that the techniques we developed in the present work also apply to this algorithm.

References

Liberles DA: Ancestral Sequence Reconstruction. 2007, Oxford University Press, Oxford, UK
Chapter Google Scholar
Csürös M: Ancestral reconstruction by asymmetric Wagner parsimony over continuous characters and squared parsimony over distributions. RECOMB-CG. Lecture Notes in Computer Science. 2008, Springer, Berlin, Germany, 5267: 72-86. doi:10.1007/978-3-540-87989-3_6
Google Scholar
Bansal MS, Alm EJ, Kellis M: Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss. Bioinformatics. 2012, 28 (12): 283-291. doi:10.1093/bioinformatics/bts225
Article CAS Google Scholar
Doyon J-P, Scornavacca C, Gorbunov KY, Szöllosi GJ, Ranwez V, Berry V: An efficient algorithm for gene/species trees parsimonious reconciliation with losses, duplications and transfers. RECOMB-CG. Lecture Notes in Computer Science. 2010, Springer, Berlin, Germany, 6398: 93-108. doi:10.1007/978-3-642-16181-0_9
Google Scholar
Bérard S, Gallien C, Boussau B, Szöllosi GJ, Daubin V, Tannier E: Evolution of gene neighborhoods within reconciled phylogenies. Bioinformatics. 2012, 28 (18): 382-388. doi:10.1093/bioinformatics/bts374
Article CAS Google Scholar
Biller P, Feijão P, Meidanis J: Rearrangement-based phylogeny using the single-cut-or-join operation. IEEE/ACM Trans. Comput. Biology Bioinform. 2013, 10 (1): 122-134. doi:10.1109/TCBB.2012.168
Article Google Scholar
Gagnon Y, Blanchette M, El-Mabrouk N: A flexible ancestral genome reconstruction method based on gapped adjacencies. BMC Bioinformatics. 2012, 13 (S-19): 4-doi:10.1186/1471-2105-13-S19-S4
Google Scholar
Ma J, Ratan A, Raney BJ, Suh BB, Zhang L, Miller W, Haussler D: DUPCAR: reconstructing contiguous ancestral regions with duplications. Journal of Computational Biology. 2008, 15 (8): 1007-1027. doi:10.1089/cmb.2008.0069
Article PubMed CAS PubMed Central Google Scholar
Csürös M: How to infer ancestral genome features by parsimony: Dynamic programming over an evolutionary tree. Models and Algorithms for Genome Evolution. 2013, Springer, Berlin, Germany, 29-45. doi:10.1007/978-1-4471-5298-9_3
Libeskind-Hadas R, Wu Y-C, Bansal MS, Kellis M: Pareto-optimal phylogenetic tree reconciliation. Bioinformatics. 2014, 30 (12): 87-95. doi:10.1093/bioinformatics/btu289
Bansal MS, Alm EJ, Kellis M: Reconciliation revisited: Handling multiple optima when reconciling with duplication, transfer, and loss. Journal of Computational Biology. 2013, 20 (10): 738-754. doi:10.1089/cmb.2013.0073
Scornavacca C, Paprotny W, Berry V, Ranwez V: Representing a set of reconciliations in a compact way. J. Bioinformatics and Computational Biology. 2013, 11 (2): doi:10.1142/S0219720012500254
Doyon J-P, Hamel S, Chauve C: An efficient method for exploring the space of gene tree/species tree reconciliations in a probabilistic framework. IEEE/ACM Trans. Comput. Biology Bioinform. 2012, 9 (1): 26-39. doi:10.1109/TCBB.2011.64
McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990, 29 (6-7): 1105-1119. doi:10.1002/bip.360290621
Baker JK: Trainable grammars for speech recognition. The Journal of the Acoustical Society of America. 1979, 65 (S1): 132-132. doi:10.1121/1.2017061
Arvestad L, Lagergren J, Sennblad B: The gene evolution model and computing its associated probabilities. J. ACM. 2009, 56 (2): doi:10.1145/1502793.1502796
Mahmudi O, Sjöstrand J, Sennblad B, Lagergren J: Genome-wide probabilistic reconciliation analysis across vertebrates. BMC Bioinformatics. 2013, 14 (Suppl 15): S10. doi:10.1186/1471-2105-14-S15-S10
Ponty Y, Saule C: A combinatorial framework for designing (pseudoknotted) RNA algorithms. Algorithms in Bioinformatics (Proceedings of WABI'11). Lecture Notes in Computer Science. Edited by: Przytycka, T., Sagot, M.-F. 2011, Springer, Berlin Heidelberg, Germany, 6833: 250-269. doi:10.1007/978-3-642-23038-7_22
Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P: Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie/Chemical Monthly. 1994, 125 (2): 167-188. doi:10.1007/BF00818163
Clote P, Lou F, Lorenz WA: Maximum expected accuracy structural neighbors of an RNA secondary structure. BMC Bioinformatics. 2012, 13 (Suppl 5): S6. doi:10.1186/1471-2105-13-S5-S6
Patterson M, Szöllosi GJ, Daubin V, Tannier E: Lateral gene transfer, rearrangement, reconciliation. BMC Bioinformatics. 2013, 14 (Suppl 15): S4. doi:10.1186/1471-2105-14-S15-S4

Acknowledgements

J.P.P.Z.'s visit to Simon Fraser University was funded by the São Paulo Research Foundation (FAPESP).

Declarations

The publication charges for this article were funded by the Simon Fraser University (SFU) Open Access fund.

This article has been published as part of BMC Bioinformatics Volume 16 Supplement 19, 2015: Brazilian Symposium on Bioinformatics 2014. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S19

Author information

Authors and Affiliations

Department of Mathematics, Simon Fraser University, 8888 University Drive, V5A 1S6, Burnaby, Canada
Cedric Chauve, Yann Ponty & João Paulo Pereira Zanetti
Pacific Institute for the Mathematical Sciences - CNRS UMI 3069, Simon Fraser University, 8888 University Drive, V5A 1S6, Burnaby, Canada
Yann Ponty
Laboratoire d'Informatique de l'École Polytechnique - CNRS UMR 7161, École Polytechnique, 91128, Palaiseau Cedex, France
Yann Ponty
Institute of Computing, University of Campinas, Av. Albert Einstein, 1251, 13083-852, Campinas, Brazil
João Paulo Pereira Zanetti
São Paulo Research Foundation, FAPESP, R. Pio XI, 1500, 05468-140, São Paulo, Brazil
João Paulo Pereira Zanetti

Authors

Cedric Chauve
Yann Ponty
João Paulo Pereira Zanetti

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

All authors participated in all aspects of the study.

Cedric Chauve, Yann Ponty and João Paulo Pereira Zanetti contributed equally to this work.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Chauve, C., Ponty, Y. & Zanetti, J.P.P. Evolution of genes neighborhood within reconciled phylogenies: an ensemble approach. BMC Bioinformatics 16 (Suppl 19), S6 (2015). https://doi.org/10.1186/1471-2105-16-S19-S6

Published: 16 December 2015
DOI: https://doi.org/10.1186/1471-2105-16-S19-S6
