 Research
 Open Access
 Published:
Distinguishing successive ancient polyploidy levels based on genomeinternal syntenic alignment
BMC Bioinformatics volume 20, Article number: 635 (2019)
Abstract
Background
A basic tool for studying the polyploidization history of a genome, especially in plants, is the distribution of duplicate gene similarities in syntenically aligned regions of a genome. This distribution can usually be decomposed into two or more components identifiable by peaks, or local maxima, each representing a different polyploidization event. The distributions may be generated by means of a discrete time branching process, followed by a sequence divergence model. The branching process, as well as the inference of fractionation rates based on it, requires knowledge of the ploidy level of each event, which cannot be directly inferred from the pair similarity distribution.
Results
For a sequence of two events of unknown ploidy, either tetraploid, giving rise to whole genome doubling (WGD), or hexaploid, giving rise to whole genome tripling (WGT), we base our analysis on triples of similar genes. We calculate the probability of the four triplet types with origins in one or the other event, or both, and impose a mutational model so that the distribution resembles the original data. Using a ML transition point in the similarities between the two events as a discriminator for the hypothesized origin of each similarity, we calculate the predicted number of triplets of each type for each model combining WGT and/or WGD. This yields a predicted profile of triplet types for each model. We compare the observed and predicted triplet profiles for each model to confirm the polyploidization history of durian, poplar and cabbage.
Conclusions
We have developed a way of inferring the ploidy of up to three successive WGD and/or WGT events by estimating the time of origin of each of the similarities in triples of genes. This may be generalized to a larger number of events and to higher ploidies.
Background
Given the pervasiveness of whole genome doubling (WGD) and tripling (WGT) in the ancestral lineages of plant species, a widespread feature of plant genome publications is the display of the distribution of duplicate gene identities (or similarities, distances, K_{s},...). This is illustrated in Fig. 1, which represents the distribution of similarities between syntenically aligned duplicate genes [1, 2] in the durian (Durio zibethinus) genome [3]. The two visually distinguishable but overlapping parts of the distribution are the legacy of two successive polyploidization events. The means (t_{1} and t_{2}), variances and proportion of the total sample of each component of the distribution can be estimated by mixtures of models techniques such as EMMIX [4].
These distributions can be explained and generated by a discretetime branching process model of polyploidization and fractionation (not timehomogeneous), mathematically represented by the product of successive rnomial distributions, the output of one being the input of the next, where the parameters, namely the rnomial probabilities, express resistance to gene loss through fractionation [5–9].
These parameters, which are key to understanding the cycle of polyploidization and fractionation, can be estimated using the information inferred from the distribution of gene pair similarities.
One aspect of the branching process that cannot be inferred from the study of pairs only is the ploidy r of the various events. Thus, durian is known to have undergone two whole genome triplings [9, 10], the γ tripling almost 120 million years ago [11], ancestral to most flowering plants, and a more recent tripling (1020 Mya) not shared by even closely related species, like cacao. By just looking at the distribution of similarities in Fig. 1 engendered by these events, however, there is no direct way of knowing whether one or both of the two component distributions represent WGD, WGT, or other polyploid events. To resolve this problem, the main point of this presentation, we propose to add triples of similar (> 50 %) genes to the study of gene pairs. Our technique, based on the branching process, responds to our concern in previous ad hoc treatments [7, 12] of how to use triples rigorously from a statistical point of view.
In the next section, we summarize the general branching process approach to analyzing the distribution of gene pair similarities. We then focus on four competing twoevent models involving WGD and/or WGT. We define four types of gene triplet according to whether the gene pairs within them were created by the first event, the second event, or both. Within each model, we calculate the expected number of triplets of each type. Thus creates an “underlying” profile of triplet distribution to compare to the “observed” profile of triplets in the data. Because of the way the two components of the pair similarity distribution overlap, however, the origin of each triplet in the data is not always obvious. Thus we create a “predicted” profile of triplet distribution by grafting a paralog divergence model onto the branching process, making use of a maximum likelihood dividing point between the two components. We apply this analysis to the genomes of durian, poplar Populus trichocarpa [13] and cabbage Brassica oleracea [14], each of which has a different sequence of polyploidization events. These histories are captured correctly for the first two, but the results for B. oleracea prompt an extension to threeevent models, which we carry out, and suggest further work to higher numbers of events.
The branching process and twoevent models
Denote by m_{i} the total number of individuals (genes)at time t_{i},i=1,…,n. Set m_{1}=1. At time t_{i}, i=1,…n−1, each of the m_{i} genes is replaced by r_{i}≥2 progeny, but only j≥1 of them survive until time t_{i+1}, with probability \(u_{j}^{(i)}\).
Of the total of m_{i} genes at time t_{i}, let \(a_{j}^{(i)}\) be the number for which j progeny survive until time t_{i+1}, so that
The probability distribution of the evolutionary histories represented by the given \(\mathbf {r}=\{r_{i}\}_{i=1}^{n1}\) and the variable \(\mathbf {a}=\left \{a_{j}^{(i)}\right \}_{j=1\dots r_{i}}^{i=1\dots n1}\) is
The expected number of genes at time t_{n} is
This is illustrated by the sample trajectory in Fig. 2, in which a WGT at time t_{1}, with all three progeny surviving – the 3nomial sample has value 3 – is followed by another independent WGT at time t_{2} where the three lineages show one, two or all three offspring surviving, i.e., the independent 3nomials samples have values 1,2 and 3, respectively. We will study the case of two successive polyploidy events, with r_{1} and r_{2} taking on values 2 or 3, i.e., WGD or WGT, in all four combinations, i.e., in the set \(\mathscr {M}\) of “models”, denoted (3,3),(3,2),(2,3) and (2,2). Because of the limited information that can be inferred about each component of the distribution of similarities, we can only infer the probabilities of samples of value 1, 2, or 3, so that we are limited to 2nomials and, with some assumptions, 3nomials, by far the biologically most important cases.
To infer parameters like fractionation rate in the polyploidization history of a genome, based on the distribution of gene pair similarities, we need to know the ploidies r_{i} of the various events. This motivates us to extend our study from gene pairs only to also include gene triplets.
Methods
Triplet probabilities in four models
With no loss of generality, we study triplets of similarities among three genes, rather than the triplets of genes themselves. A triplet is a (multi)set {t_{i},t_{j},t_{k}}, where each of i,j and k may be 1 or 2. Let \(\mathscr {T}=\{\{t_{1},t_{1},t_{1}\}, \{t_{1},t_{1},t_{2}\}, \{t_{1},t_{2},t_{2}\},\{t_{2},t_{2},t_{2}\}\}\). We classify each kind of triplet we according to whether each of the three paralogies among the three pairs of gene originates in the first event or the second event. Thus in the branching process illustration in Fig. 2 the blue dots represent genes that form a {t_{2},t_{2},t_{2}} triplet and the red dots form a {t_{1},t_{1},t_{2}} triplet. The single red dot combines with the three pairs of blue genes to form three additional {t_{1},t_{1},t_{2}} triplets. And there are a further nine {t_{1},t_{1},t_{2}} triplets in the sample. We can calculate the expected number of triplets of each type by enumerating the triplets of each type in each possible trajectory of the process, and multiplying by the probability of this trajectory from Expression (2). The enumeration within a trajectory is easily done by considering every triple of genes at time t_{n} and identifying the last common ancestor of each pair. The probabilities of the trajectories have previously been calculated [5–7]. We then sum these results over all trajectories. These are listed in Table 1.
This table provides the expected number of triplets W_{M}(Δ) of each type \(\Delta \in \mathscr {T}\) produced by the branching process for each model \(M \in \mathscr {M}\). For a given genome, the four numerical values of W_{M}(·) constitute the underlying profile of the model M. The underlying profiles for each model based on maximum likelihood values of u and v are given in the top half of Tables 2, 3 and 4 below. Because of the limited number of parameters that can be inferred from the distribution of similarities, we assume u^{′}=u^{2} and v^{′}=v^{2}.
It can be seen in Table 1 that the profiles of triplet types produced by the different models are very different. If we could observe the triplet profile produced by the branching process underlying a given set of data, we could easily identify which model was responsible. However, we only see the data after mutational processes have applied. When a mutational divergence model is applied to the similarities, a single trajectory of the branching process, producing one ideal type of triplet, can produce many very different data triplets.
We can try to categorize the set of triplets in a set of data by how closely they resemble one of the four basic types. If the two components of the similarity distribution were completely separate, this would also be an easy matter. But the usual large overlap between the components means that we cannot automatically ascribe any data triplet to any particular underlying triplet.
A statistical approach
As a solution to this problem, we first try to find the best transition, or cutoff point H somewhere between the peaks of the two components. For this we compute the product of the probability density at each similarity value less than H, according to the component with mean at t_{1}, and the density at each similarity value greater than H, according to the component with mean at t_{2}, and maximize with respect to H. I.e.,
We then categorize the triplets in the data according to the transition value H. If a similarity x is less than H we classify it as being produced at time t_{1}, and we write x∈I. If it is greater than H, we classify it as being created at time t_{2} and we write x∈J. This creates eight “octants”, defined by the 8=2^{3} combination of the three triplet similarities, which in turn are collapsed into the four types of triplet in \(\mathscr {T}\) tabulated in Table 1, {t_{1},t_{1},t_{1}},{t_{1},t_{1},t_{2}}(representing three octants), {t_{1},t_{2},t_{2}} (representing three octants) and {t_{2},t_{2},t_{2}}. The number of triplets of the four types we call the observed profile.
Although we can compare the observed profile with the underlying profile, this comparison is not too meaningful since it neglects the fact that many of the data triplets classified as one type may be generated by a different underlying type, not as an error, but simply as a result of the normal process of duplicate gene sequence divergence clearly operative in the more or less dispersed and overlapping components of the distribution of gene pair similarities.
We can, however, take this process into account in producing a predicted profile for each model. We first calculate the variancecovariance matrix Σ of the t_{1} similarities in triplets containing at least of them and t_{2} similarities in triplets containing at least two of these. We fixed covar(t_{1},t_{2})=0, in accordance with the Markov nature of the branching process.
For each model \(M\in \mathscr {M}\), we construct the predicted profile of triplet types by integrating over the trivariate normal with means drawn from the EMMIX analysis or identified by eye with the distribution component peaks, and covariance estimated as above, restricted to the domains defined by the transition point. Thus our prediction of {t_{1},t_{1},t_{1}} triplets would involve a restriction to the domain (I,I,I) where all three coordinates are less than or equal to H. Our prediction of apparently {t_{1},t_{1},t_{2}} triplets would be confined to the three octants where two coordinates are less than or equal to H and one is greater. The integrals are weighted by W(Δ), the expected number of triplets. For example, the predicted number of {t_{1},t_{1},t_{2}} triplets would be;
where μ is the vector of component means. To summarize, we have defined three types of triplet profile:
the observed profile, based on triples of genes all having high similarity scores with each other, which can be compiled from the list of gene pairs produced by the SYNMAP function of COGE [1, 2],
the underlying profile for each model \(M\in \mathscr {M}\), produced by the combinatorial probabilities of the branching process for each model \(M\in \mathscr {M}\),
the predicted profile for each model \(M\in \mathscr {M}\), produced by grafting a gene pair divergence model on the underlying profile.
For comparative purposes we normalize the underlying and predicted profiles so that the total number of triples is the same as the observed profiles.
Results
We compare the three profiles for three wellstudied flowering plant genomes that are known to have undergone multiple polyploidizations in the last 120 million years, to see if our method predicts the right combination of WGT and WGD.
Durian
Starting with the durian genome, the (3,3) model, known to represent true evolutionary history, is the only one with a credible prediction profile in Table 2, the only one that has reasonable values for all four triplet types. The three others all fail to predict one or both of the {t_{1},t_{1},t_{1}} and {t_{2},t_{2},t_{2}} triplets. This indicates the potential of our statistical method, since the original durian sequence article [3] did not recognize the second event as a tripling.
Poplar
The predicted profile of the (3,2) model in Table 3 summarizes the true history of the Populus trichocarpa genome (COGE ID 25127), whose gene pair similarity distribution is displayed in Fig. 3. Along with γ, this shares the ancient “salicoid” WGD with other members of the Salicaceae family [15]. (3,2) is the only model that correctly identifies both the γ event as a WGT, and the more recent event as a WGD.
Cabbage
The recent ancestor of Brassica oleracea genome (COGE ID 26018), underwent a WGT that gave rise not only to the crucifers and mustard genera, but also radishes and other related genera. Early than that a WGD called the α doubling is apparent in the whole range of family Brassicacea genera, including Arabidopsis. A still earlier WGD, the β doubling, occurred in the order Brassicales lineage that includes the Brassicaceae. Thus the cabbage genome counts γ,β,α and a Brassica WGT in its evolutionary history [16, 17]. In Fig. 4, we see that at least the two recent components are clearly distinguishable, so we first carried out an analysis excluding gene pairs of similarity less than 76%.
This analysis in Table 4 does not give satisfactory results. Indeed, the (3,3) profile matches the observed profile much better than the (2,3) profile does. This can be partially attributed to substantial number of similarities generated by the β and even γ doublings greater than 76%.
We can partially correct this by adding a third event to our branching process. This leads to eight models instead of four, and ten kinds of triplet, summarized in Table 5.
We fix the mean of the first component at 71% to account for the γ event, that for the second component, representing the α event, at 79.5% and we find two ML discrimination points, as in Fig. 5.
The results of this are shown in Table 6. Here the (3,2,3) model is just as close to the observed profile than the competing (3,3,3) model, with the notable exception of {t_{2},t_{2},t_{2}} triplets. The absence of a distinction between the α and β events means that the similarities they generate are all conflated to yield an excess of t_{2}, and consequently an excess of t_{2} triples, so that a WGT is inferred rather than a WGD.
The obvious remedy for this would be to construct fourevent models (sixteen of them), with profiles consisting of 20 different triplets. We leave this for further work. In general the number of models is exponential: 2^{m} for m events, while the number of triples follows the polynomial (i.e., cubic) tetrahedral sequence (A00292 in [18]) \(\frac {1}{6}m(m+1)(m+2)\), so that eventually there would not be enough data to discriminate among the models. Choosing among models with different numbers of events would require some standard for model selection such as the Akaike or Bayesian information criteria.
Conclusions
We model the process of fractionation to account for the distribution of gene pair similarities after a number of whole genome doublings, triplings, etc., each followed by a period of duplicate gene loss. The model is a discretetime branching process, with synchronous birth number r_{i}≥2 across the i−th generation population and deaths determined by a r_{i}nomial law conditioned on at least one survivor.
The observations of gene pair similarities consist of a mixture of normals, each component generated by one event, with the event time estimated by the sequence divergence from the event to the present. Despite the overlapping distributions, we can estimate the mean (via a local mode), standard deviation and proportion of the sample.
Statistics on gene pairs alone do not allow us to infer r_{i}, so we introduce the study of gene triplets. We find formulae for the expected number of each kind of triplet, categorized as to which events produced the similarities among the three pairs of genes.
We develop a way of grafting a gene divergence model on this underlying profile of triplets to produce a predicted profile of the number of triplets of each kind. This can then be compared with the observed number of triplets.
Further work
Distinguishing among the four models combining tetraploidization or hexaploidization in two successive events is the simplest example of a more general problem. The theoretical way is clear to extending these ideas to include, for example, octoploidization through the extraction of quadruplets instead of triplets. In addition, there is a straight forward extension to the case of three or more successive polyploidization events, which we have undertaken in the study of Brassica oleracea. Here, three events, three normal components and two transition points are estimated from the distribution of similarities. The combinatorial probabilities have been worked out for this case and many others, and the methodology is available to complete this.
It is true that in some cases, such as that we presented in [9], concerning Durio zibethinus, the ploidy is evident from the clear presence in the SYNMAP selfcomparison dotplots of sets of r regions covering a large proportion of the genome, each set represented by exactly \(\binom {r}{2}\) synteny blocks showing synteny among all r regions. (r=3 in the case of Durio.) Clear cases like this are rare, however, especially for genomes where the last polyploidization is more remote in time.
In previous work [7], we used additional information, beyond that contained in the similarity distribution, to confirm the recent hexaploidization of Brassica rapa against the alternative of tetraploidization. This kind of data, however, namely speculation about the number of singlecopy genes in the current genome, was extremely subjective in that report, and is unreliable even when assessed by the best available methods on wellassembled and annotated genomes.
A distribution of gene pair similarities is generated in the comparison of two related genomes as well as in the selfcomparison of a single genome. The number of orthologous gene pairs available when comparing two related genomes is generally much greater than the number of paralogous pairs identified in the selfcomparison of two genomes, simply because the loss by fractionation of one copy of a duplicated gene does not eliminate all related orthology pairs: the other remaining copy and its orthology pairs remain intact. This suggests an avenue to improved accuracy of polyploidy levels inference. The larger number of data, however, may not always compensate for the fact that the speciation component of the similarity distribution is always the most recent one [9], so that the more remote (earlier) components associated with polyploidy become statistically less clear and informative.
Availability of data and materials
The datasets analysed during the current study are available in the COGE repository [1, 2], https://genomevolution.org/coge/
Abbreviations
 ML:

Maximum likelihood
 WGD:

Whole genome doubling
 WGT:

Whole genome tripling
References
 1
Lyons E, Freeling M. How to usefully compare homologous plant genes and chromosomes as DNA sequences. Plant J. 2008; 53:661–73. https://doi.org/10.1111/j.1365313X.2007.03326.x.
 2
Lyons E, Pedersen B, Kane J, Alam M, Ming R, Tang H, et al.Finding and comparing syntenic regions among Arabidopsis and the outgroups papaya, poplar and grape: COGE with rosids. Plant Physiol. 2008; 148:1772–81.
 3
Teh BT, Lim K, Yong CH, Ng CCY, Rao SR, et al.The draft genome of tropical fruit durian Durio zibethinus. Nat Genet. 2017; 49:1633–41.
 4
McLachlan GJ, Peel D, Basford KE, Adams P. The EMMIX software for the fitting of mixtures of normal and tcomponents. J Stat Softw. 1999; 4:1–14.
 5
Zhang Y, Zheng C, Sankoff D. Evolutionary model for the statistical divergence of paralogous and orthologous gene pairs generated by whole genome duplication and speciation. IEEE/ACM Trans Comput Biol Bioinforma. 2017; 15(5):1545–5963.
 6
Sankoff D, Zheng C, Zhang Y, Meidanis J, Lyons E, Tang H. Models for similarity distributions of syntenic homologs and applications to phylogenomics. IEEE/ACM Trans Comput Biol Bioinforma. 2018. https://doi.org/10.1109/TCBB.2018.2849377.
 7
Zhang Y, Zheng C, Sankoff D. Pinning down ploidy in paleopolyploid plants. BMC Genomics. 2018; 19(Suppl 5):287.
 8
Zhang Y, Zheng C, Sankoff D. A branching process for homology distributionbased inference of polyploidy, speciation and loss. Algoritm Mol Biol. 2019; 14:18.
 9
Zhang Y, Zheng C, Islam S, Kim YM, Sankoff D. Branching out to speciation in a model of fractionation: the Malvaceae. IEEE/ACM Trans Comput Biol Bioinforma. 2019. in press.
 10
Wang J, Yuan J, Yu J, Meng F, Sun P, Li Y, et al.Recursive paleohexaploidization shapes the durian genome. Plant Physiol. 2019. https://doi.org/10.1104/pp.18.00921.
 11
Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, et al.The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature. 2007; 449:463–7.
 12
RendónAnaya M, IbarraLaclette E, Méndez Bravo A, Lan T, Zheng C, Carretero Paulet L, et al.The avocado genome informs deep angiosperm phylogeny, highlights introgressive hybridization, and reveals pathogeninfluenced gene space adaptation. PNAS. 2019. in press.
 13
Tuskan GA, DiFazio S, Jansson S, Bohlmann J, Grigoriev I, et al.The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science. 2006; 313:1596–604.
 14
Liu S, Liu Y, Yang X, Tong C, Edwards D, Parkin IAP, et al.The Brassica oleracea, genome reveals the asymmetrical evolution of polyploid genomes. Nat Commun. 2014; 5:3930.
 15
Dai X, Hu Q, Cai Q, Feng K, Ye N, Tuskan GA, et al.The willow genome and divergent evolution from poplar after the common genome duplication. Cell Res. 2014; 24:1274–7.
 16
Barke MS, Vogel H, Schranz ME. Paleopolyploidy in the Brassicales: analyses of the Cleome transcriptome elucidate the history of genome duplications in Arabidopsis and other Brassicales. Genome Biol Evol. 2009; 1:391–9. https://doi.org/10.1093/gbe/evp040.
 17
Edger PP, Hall JC, Harkess A, Tang M, Coombs J, Mohammadin S, et al.Brassicales phylogeny inferred from 72 plastid genes: A reanalysis of the phylogenetic localization of two paleopolyploid events and origin of novel chemical defenses. Am J Bot. 2018; 105:463–9.
 18
Sloane NJA. The Online Encyclopedia of Integer Sequences. Published electronically at https//oeis.org. Accessed 18 June 2019.
Acknowledgments
The authors thank Victor A. Albert for encouraging us to explore this research topic.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 20, 2019: Proceedings of the 17th Annual Research in Computational Molecular Biology (RECOMB) Comparative Genomics Satellite Workshop: Bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume20supplement20.
Funding
Publication costs are funded by the Natural Sciences and Engineering Research Council of Canada Discovery Grant RGPIN201605585. The research described in this article as well as publication costs were also partly funded by this grant. DS holds the Canada Research Chair in Mathematical Genomics.
Author information
Affiliations
Contributions
DS and YZ. designed the study. YZ carried out most of the research. CZ did part of the analysis. DS and YZ wrote the paper. All authors read and approved the paper.
Corresponding author
Correspondence to David Sankoff.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Zhang, Y., Zheng, C. & Sankoff, D. Distinguishing successive ancient polyploidy levels based on genomeinternal syntenic alignment. BMC Bioinformatics 20, 635 (2019). https://doi.org/10.1186/s128590193202x
Published:
Keywords
 Polyploidy
 Whole genome doubling
 Whole genome tripling
 Gene triples
 Plant genomes
 Branching process