Distinguishing successive ancient polyploidy levels based on genome-internal syntenic alignment

Background A basic tool for studying the polyploidization history of a genome, especially in plants, is the distribution of duplicate gene similarities in syntenically aligned regions of a genome. This distribution can usually be decomposed into two or more components identifiable by peaks, or local maxima, each representing a different polyploidization event. The distributions may be generated by means of a discrete time branching process, followed by a sequence divergence model. The branching process, as well as the inference of fractionation rates based on it, requires knowledge of the ploidy level of each event, which cannot be directly inferred from the pair similarity distribution. Results For a sequence of two events of unknown ploidy, either tetraploid, giving rise to whole genome doubling (WGD), or hexaploid, giving rise to whole genome tripling (WGT), we base our analysis on triples of similar genes. We calculate the probability of the four triplet types with origins in one or the other event, or both, and impose a mutational model so that the distribution resembles the original data. Using a ML transition point in the similarities between the two events as a discriminator for the hypothesized origin of each similarity, we calculate the predicted number of triplets of each type for each model combining WGT and/or WGD. This yields a predicted profile of triplet types for each model. We compare the observed and predicted triplet profiles for each model to confirm the polyploidization history of durian, poplar and cabbage. Conclusions We have developed a way of inferring the ploidy of up to three successive WGD and/or WGT events by estimating the time of origin of each of the similarities in triples of genes. This may be generalized to a larger number of events and to higher ploidies.

r-nomial distributions, the output of one being the input of the next, where the parameters, namely the r-nomial probabilities, express resistance to gene loss through fractionation [5][6][7][8][9].
These parameters, which are key to understanding the cycle of polyploidization and fractionation, can be estimated using the information inferred from the distribution of gene pair similarities.
One aspect of the branching process that cannot be inferred from the study of pairs only is the ploidy r of the various events. Thus, durian is known to have undergone two whole genome triplings [9,10], the γ tripling almost 120 million years ago [11], ancestral to most flowering plants, and a more recent tripling (10-20 Mya) not shared by even closely related species, like cacao. By just looking at the distribution of similarities in Fig. 1 engendered by these events, however, there is no direct way of knowing whether one or both of the two component distributions represent WGD, WGT, or other polyploid events. To resolve this problem, the main point of this presentation, we propose to add triples of similar (> 50 %) genes to the study of gene pairs. Our technique, based on the branching process, responds to our concern in previous ad hoc treatments [7,12] of how to use triples rigorously from a statistical point of view.
In the next section, we summarize the general branching process approach to analyzing the distribution of gene pair similarities. We then focus on four competing twoevent models involving WGD and/or WGT. We define four types of gene triplet according to whether the gene pairs within them were created by the first event, the second event, or both. Within each model, we calculate the expected number of triplets of each type. Thus creates an "underlying" profile of triplet distribution to compare to the "observed" profile of triplets in the data. Because of the way the two components of the pair similarity distribution overlap, however, the origin of each triplet in the data is not always obvious. Thus we create a "predicted" profile of triplet distribution by grafting a paralog divergence model onto the branching process, making use of a maximum likelihood dividing point between the two components. We apply this analysis to the genomes of durian, poplar Populus trichocarpa [13] and cabbage Brassica oleracea [14], each of which has a different sequence of polyploidization events. These histories are captured correctly for the first two, but the results for B. oleracea prompt an extension to three-event models, which we carry out, and suggest further work to higher numbers of events.

The branching process and two-event models
Denote by m i the total number of individuals (genes)at time t i , i = 1, . . . , n. Set m 1 = 1. At time t i , i = 1, . . . n − 1, each of the m i genes is replaced by r i ≥ 2 progeny, but only j ≥ 1 of them survive until time t i+1 , with probability u (i) j . Of the total of m i genes at time t i , let a (i) j be the number for which j progeny survive until time t i+1 , so that The probability distribution of the evolutionary histories represented by the given r = {r i } n−1 i=1 and the variable a = a (2) The expected number of genes at time t n is This is illustrated by the sample trajectory in Fig. 2, in which a WGT at time t 1 , with all three progeny surviving -the 3-nomial sample has value 3 -is followed by another independent WGT at time t 2 where the three lineages show one, two or all three offspring surviving, i.e., the independent 3-nomials samples have values 1,2 and 3, respectively. We will study the case of two successive polyploidy events, with r 1 and r 2 taking on values 2 or 3, i.e., WGD or WGT, in all four combinations, i.e., in the set M of "models", denoted (3,3), (3,2), (2,3) and (2,2). Because of the limited information that can be inferred about each component of the distribution of similarities, we can only infer the probabilities of samples of value 1, 2, or 3, so that we are limited to 2-nomials and, with some assumptions, 3-nomials, by far the biologically most important cases.
To infer parameters like fractionation rate in the polyploidization history of a genome, based on the distribution of gene pair similarities, we need to know the ploidies r i of the various events. This motivates us to extend our study from gene pairs only to also include gene triplets.

Triplet probabilities in four models
With no loss of generality, we study triplets of similarities among three genes, rather than the triplets of genes themselves. A triplet is a (multi-)set {t i , t j , t k }, where each of i, j and k may be 1 or 2. Let T = {{t 1 , t 1 , t 1 }, {t 1 , t 1 , t 2 }, {t 1 , t 2 , t 2 }, {t 2 , t 2 , t 2 }}. We classify each kind of triplet we according to whether each of the three paralogies among the three pairs of gene originates in the first event or the second event. Thus in the branching process illustration in Fig. 2 the blue dots represent genes that form a {t 2 , t 2 , t 2 } triplet and the red dots form a {t 1 , t 1 , t 2 } triplet. The single red dot combines with the three pairs of blue genes to form three additional {t 1 , t 1 , t 2 } triplets. And there are a further nine {t 1 , t 1 , t 2 } triplets in the sample. We can calculate the expected number of triplets of each type by enumerating the triplets of each type in each possible trajectory of the process, and multiplying by the probability of this trajectory from Expression (2). The enumeration within a trajectory is easily done by considering every triple of genes at time t n and identifying the last common ancestor of each pair. The probabilities of the trajectories have previously been calculated [5][6][7]. We then sum these results over all trajectories. These are listed in Table 1.
This table provides  It can be seen in Table 1 that the profiles of triplet types produced by the different models are very different. If we could observe the triplet profile produced by the branching process underlying a given set of data, we could easily identify which model was responsible. However, we only see the data after mutational processes have applied. When a mutational divergence model is applied to the similarities, a single trajectory of the branching process, producing one ideal type of triplet, can produce many very different data triplets.
We can try to categorize the set of triplets in a set of data by how closely they resemble one of the four basic types. If the two components of the similarity distribution were completely separate, this would also be an easy matter. But the usual large overlap between the components means that we cannot automatically ascribe any data triplet to any particular underlying triplet.

A statistical approach
As a solution to this problem, we first try to find the best transition, or cutoff point H somewhere between the peaks of the two components. For this we compute the product of the probability density at each similarity value less than H, according to the component with probability that two progeny survive after the first polyploidization event. u = probability that three survive. Similarly v and v are the probabilities that two or three progeny survive, respectively, after the second event mean at t 1 , and the density at each similarity value greater than H, according to the component with mean at t 2 , and maximize with respect to H. I.e., We then categorize the triplets in the data according to the transition value H. If a similarity x is less than H we classify it as being produced at time t 1 , and we write x ∈ I. If it is greater than H, we classify it as being created at time t 2 and we write x ∈ J. This creates eight "octants", defined by the 8 = 2 3 combination of the three triplet similarities, which in turn are collapsed into the four types of triplet in T tabulated in Table 1, {t 1 , t 1 , t 1 }, {t 1 , t 1 , t 2 }(representing three octants),{t 1 , t 2 , t 2 } (representing three octants) and {t 2 , t 2 , t 2 }. The number of triplets of the four types we call the observed profile.
Although we can compare the observed profile with the underlying profile, this comparison is not too meaningful since it neglects the fact that many of the data triplets classified as one type may be generated by a different underlying type, not as an error, but simply as a result of the normal process of duplicate gene sequence divergence clearly operative in the more or less dispersed and overlapping components of the distribution of gene pair similarities.
We can, however, take this process into account in producing a predicted profile for each model. We first calculate the variance-covariance matrix of the t 1 similarities in triplets containing at least of them and t 2 similarities in triplets containing at least two of these. We fixed covar(t 1 , t 2 ) = 0, in accordance with the Markov nature of the branching process.
For each model M ∈ M, we construct the predicted profile of triplet types by integrating over the trivariate normal with means drawn from the EMMIX analysis or identified by eye with the distribution component peaks, and covariance estimated as above, restricted to the domains defined by the transition point. Thus our prediction of {t 1 , t 1 , t 1 } triplets would involve a restriction to the   For comparative purposes we normalize the underlying and predicted profiles so that the total number of triples is the same as the observed profiles.

Results
We compare the three profiles for three well-studied flowering plant genomes that are known to have undergone multiple polyploidizations in the last 120 million years, to see if our method predicts the right combination of WGT and WGD.

Durian
Starting with the durian genome, the (3,3) model, known to represent true evolutionary history, is the only one with a credible prediction profile in Table 2, the only one that has reasonable values for all four triplet types. The three others all fail to predict one or both of the {t 1 , t 1 , t 1 } and {t 2 , t 2 , t 2 } triplets. This indicates the potential of our statistical method, since the original durian sequence article [3] did not recognize the second event as a tripling.

Poplar
The predicted profile of the (3,2) model in Table 3 summarizes the true history of the Populus trichocarpa genome (COGE ID 25127), whose gene pair similarity distribution is displayed in Fig. 3. Along with γ , this shares the ancient "salicoid" WGD with other members of the Salicaceae family [15]. (3,2) is the only model that correctly identifies both the γ event as a WGT, and the more recent event as a WGD.

Cabbage
The recent ancestor of Brassica oleracea genome (COGE ID 26018), underwent a WGT that gave rise not only to the crucifers and mustard genera, but also radishes and other related genera. Early than that a WGD called the α doubling is apparent in the whole range of family Brassicacea genera, including Arabidopsis. A still earlier WGD, the β doubling, occurred in the order Brassicales lineage that includes the Brassicaceae. Thus the cabbage genome counts γ , β, α and a Brassica WGT in its evolutionary history [16,17]. In Fig. 4, we see that at least the two recent components are clearly distinguishable, so we first carried out an analysis excluding gene pairs of similarity less than 76%. This analysis in Table 4 does not give satisfactory results. Indeed, the (3,3) profile matches the observed profile much better than the (2,3) profile does. This can be partially attributed to substantial number of similarities generated by the β and even γ doublings greater than 76%.
We can partially correct this by adding a third event to our branching process. This leads to eight models instead of four, and ten kinds of triplet, summarized in Table 5.
We fix the mean of the first component at 71% to account for the γ event, that for the second component, representing the α event, at 79.5% and we find two ML discrimination points, as in Fig. 5.
The results of this are shown in Table 6. Here the (3,2,3) model is just as close to the observed profile than the competing (3,3,3) model, with the notable exception of {t 2 , t 2 , t 2 } triplets. The absence of a distinction between the α and β events means that the similarities they generate are all conflated to yield an excess of t 2 , and consequently an excess of t 2 triples, so that a WGT is inferred rather than a WGD.
The obvious remedy for this would be to construct fourevent models (sixteen of them), with profiles consisting of 20 different triplets. We leave this for further work. In general the number of models is exponential: 2 m for m events, while the number of triples follows the polynomial (i.e., cubic) tetrahedral sequence (A00292 in [18]) 1 6 m(m + 1)(m + 2), so that eventually there would not be enough data to discriminate among the models. Choosing among models with different numbers of events would  require some standard for model selection such as the Akaike or Bayesian information criteria.

Conclusions
We model the process of fractionation to account for the distribution of gene pair similarities after a number of whole genome doublings, triplings, etc., each followed by a period of duplicate gene loss. The model is a discretetime branching process, with synchronous birth number r i ≥ 2 across the i − th generation population and deaths determined by a r i -nomial law conditioned on at least one survivor.
The observations of gene pair similarities consist of a mixture of normals, each component generated by one event, with the event time estimated by the sequence divergence from the event to the present. Despite the overlapping distributions, we can estimate the mean (via a local mode), standard deviation and proportion of the sample.
Statistics on gene pairs alone do not allow us to infer r i , so we introduce the study of gene triplets. We find formulae for the expected number of each kind of triplet, categorized as to which events produced the similarities among the three pairs of genes.
We develop a way of grafting a gene divergence model on this underlying profile of triplets to produce a predicted profile of the number of triplets of each kind. This can then be compared with the observed number of triplets.

Further work
Distinguishing among the four models combining tetraploidization or hexaploidization in two successive events is the simplest example of a more general problem. The theoretical way is clear to extending these ideas to include, for example, octoploidization through the extraction of quadruplets instead of triplets. In addition, there is a straight forward extension to the case of three or more successive polyploidization events, which we have undertaken in the study of Brassica oleracea. Here, three events, three normal components and two transition points are estimated from the distribution of similarities. The combinatorial probabilities have been worked out for this case and many others, and the methodology is available to complete this.
It is true that in some cases, such as that we presented in [9] , concerning Durio zibethinus, the ploidy is evident from the clear presence in the SYNMAP self-comparison dotplots of sets of r regions covering a large proportion of the genome, each set represented by exactly r 2 synteny blocks showing synteny among all r regions. (r = 3 in the case of Durio.) Clear cases like this are rare, however, especially for genomes where the last polyploidization is more remote in time.
In previous work [7], we used additional information, beyond that contained in the similarity distribution, to confirm the recent hexaploidization of Brassica rapa against the alternative of tetraploidization. This kind of data, however, namely speculation about the number of single-copy genes in the current genome, was extremely subjective in that report, and is unreliable even when assessed by the best available methods on well-assembled and annotated genomes.
A distribution of gene pair similarities is generated in the comparison of two related genomes as well as in the self-comparison of a single genome. The number of  -u = probability that two progeny survive after the first polyploidization event. u = probability that three survive. Similarly v and v are the probabilities that two or three progeny survive, respectively, after the second event. w and w are the probabilities that two or three progeny survive after the third event  orthologous gene pairs available when comparing two related genomes is generally much greater than the number of paralogous pairs identified in the self-comparison of two genomes, simply because the loss by fractionation of one copy of a duplicated gene does not eliminate all related orthology pairs: the other remaining copy and its orthology pairs remain intact. This suggests an avenue to improved accuracy of polyploidy levels inference. The larger number of data, however, may not always compensate for the fact that the speciation component of the similarity distribution is always the most recent one [9], so that the more remote (earlier) components associated with polyploidy become statistically less clear and informative.