Structural vs. functional mechanisms of duplicate gene loss following whole genome doubling
 David Sankoff^{1}Email author,
 Chunfang Zheng^{1},
 Baoyong Wang^{1} and
 Carlos Fernando Buen Abad Najar^{2}
https://doi.org/10.1186/1471210516S17S9
© Sankoff et al. 2015
Published: 7 December 2015
Abstract
Background
The loss of duplicate genes  fractionation  after whole genome doubling (WGD) is the subject to a debate as to whether it proceeds gene by gene or through deletion of multigene chromosomal segments.
Results
WGD produces two copies of every chromosome, namely two identical copies of a sequence of genes. We assume deletion events excise a geometrically distributed number of consecutive genes with mean µ ≥ 1, and these events can combine to produce singlecopy runs of length l. If µ = 1, the process is genebygene. If µ > 1, the process at least occasionally excises more than one gene at a time. In the latter case if deletions overlap, the later one simply extends the existing run of singlecopy genes. We explore aspects of the predicted distribution of the lengths of singlecopy regions analytically, but resort to simulations to show how observing run lengths l allows us to discriminate between the two hypotheses.
Conclusions
Deletion run length distributions can discriminate between genebygene fractionation and deletion of segments of geometrically distributed length, even if µ is only slightly larger than 1, as long as the genome is large enough and fractionation has not proceeded too far towards completion.
Keywords
Background
The process of whole genome doubling (WGD) gives rise to two copies of each chromosome in a genome, containing the same genes in the same order. Through an attrition mechanism known as fractionation, one of each pair of duplicate genes is lost over evolutionary time. The rare deletion of both copies of a gene can be excluded from our considerations of the interleaving patterns of deletions from duplicated regions first discovered by Wolfe and Shields [1]. The retention of one copy of each pair is what differentiates the WGD/fractionation model from approaches to gene duplication, insertion and deletion in the study of comparative genomics, pioneered by ElMabrouk [2].
An important biological controversy arises from the question of whether duplicated genes are deleted through random excision  elimination of excess DNA  namely the deletion of chromosomal segments containing one or more genes [3], which we term the "structural" mechanism, or through geneby gene events such as epigenetic silencing and pseudogenization [4], which are "functional" mechanisms. This question is important to evolutionary theory because it speaks directly to the role of WGD, and gene duplication in general, in disrupting gene order, in creating functional innovation, and in the radiation of new species. It is a question of whether selection operates on the level of simply permitting nonlethal deletions or whether more subtle effects are in play, such as dosage balance of interacting genes.
This debate may be formulated in terms of deletion events removing a number X of contiguous genes, where X is drawn from a geometric distribution γ with mean µ. Here the oneatatime deletion model is represented by µ = 1, while the random number of deletions at a time holds if µ > 1. In the latter case, the possibility of two overlapping events is handled by a biologically realistic additive runlength assumption.
In this paper, we investigate the discrimination problem of choosing between the two models based on deletion runlength statistics (resulting from overlapping deletion events). This involves comparing an observed genome containing singlecopy genes, originally members of duplicate pairs, to the predictions of the models for µ = 1 and for µ > 1. This requires knowledge of the runlength distribution, given a total number of deleted genes and remaining duplicate pairs. While this is easily calculated for the case µ = 1, the the distribution for the opposing scenario µ > 1 is not known.
In the first part of this paper, we analyze aspects of the deletion runlength distribution ψ when µ > 1 for the deletionlength distribution γ, including some new and surprising analytical results, the clearest of which pertain to a continuous analog of the problem. We then show why it is difficult to describe ψ in closed form or other easily computable format. In the second part, we simulate the distribution and carry out a study of the discrimination problem for various values of µ, genome size N and θ, the proportion of undeleted genes at time t. We conclude with a discussion of the remaining mathematical problems to be solved before the method can be applied to data from real WGD descendants.
Results
The models
For modeling purposes, we consider a doubled genome made up, at the outset, of a pair of identical linear chromosomes each containing genes g_{1}, . . . , g_{ N }, where N is large enough so that we can neglect end effects  particular behaviors near g_{ 1 } and g_{ N } . At each time t = 1, 2, . . ., one such doubled gene g_{i} is chosen at random, and a value X is chosen from a geometric distribution γ with mean µ. If X = a, then g_{i}, g_{i+1}, . . . , g_{i+a−1}are deleted from one of the genomes  they become singlecopy genes  unless some of these are already singlecopy. In the latter case, we skip existing singlecopy genes and proceed to convert the next doublecopy genes we encounter until a total of a doublecopy genes have been converted to singlecopy. Note that this overlapping of deletion events never occurs if µ = 1 since, in this case, by definition, exactly one doublecopy gene is selected and deleted in each step. For simplicity, we assume all deletions take place from one and the same genome. In a more complete model, deletion events would occur on one or the other chromosome, with probabilities φ and 1 − φ [5].
The "skipping" procedure, introduced in [6], is a natural way to model the deletion process, since deletion of part of a chromosome and the subsequent rejoining of the chromosome directly before and directly after the deleted fragment means that this fragment is no longer "visible" to the deletion process. As observers, however, we have a record of the deleted genes, as one copy of each gene must be retained in the genome.
Overlapping deletion events and skipping result in the creation of runs of singlecopy genes whose length is the sum of a number of geometric variables. The sum of r identical geometric variables produces a negative binomial distribution with parameter r, but the skipping process does not involve the sum of identical random variables, since a deletion with a large value of a is more likely to overlap an existing single copy region than a deletion with small a. Thus, at any point of time t > 0, the distribution ψ_{t} of singlecopy run lengths will tend to contain a higher frequency of runs of length 1, and of very long runs, than would be generated by the negative binomial. On the other hand, the distribution of run lengths of the remaining doublecopy genes is geometrically distributed with a probability distribution ρ_{t}, where the mean ν_{ t } decreases with t [5, 6].
Analysis of overlap probabilities
An attempt to determine ψ_{ t }analytically starts with the calculation of how many deletion events have overlapped to form a run of singlecopy genes at time t. In [6], we derived a formula to predict whether a deletion event would create a new run of singlecopy genes, probability p_{0}; overlap exactly one existing run, thus extending it without changing the total number of runs, probability p_{1}; overlap two runs, producing one larger combined run in place of the two preexisting ones, probability p_{2}; and so on. Other probabilities deal with the events that a run "touches" a preexisting run without overlapping it. These probabilities all depend solely on γ and ρ_{ t }. For example, we examine the case of p_{ 0 }. The other probabilities are all formulated in analogous ways.
where j indexes the starting position of the deletion within a run of length l, and a is the number of genes deleted in the event.
Typically, µ is somewhere between 1 and 2, [3, 4], and ν_{ t } of the order of 10^{3} or 10^{4}. Thus p_{0} is initially only slightly less than 1 but declines rapidly as ν_{ t } decreases exponentially.
with mean µ. If X = a, then the segment [g, g +a] is deleted from one of the genomes  [g, g + a] becomes a singlecopy region  unless part of it is already singlecopy. In the latter case, we skip existing singlecopy regions and proceed to convert the next doublecopy region we encounter until a total measure a of doublecopy regions have been converted to singlecopy.
In analogy with the discrete model, the combined length of the remaining doublecopy segments is exponentially distributed according to probability distribution σ_{ t }, with a mean ν_{ t } that decreases with t.
which is reminiscent of the relation (4) in the discrete case with large ν_{ t }.
Thus we have the surprisingly uncomplicated result that the number q of preexisting runs of singlecopy regions overlapped by a new deletion event is geometrically distributed on q = 0, 1, . . . with parameter µ/(µ + ν_{ t }).
On the runlength distribution
How to account for the distorting effect of skipping on the runlength distribution will require additional insight and research. In the interim, we may use simulations to study the discrimination problem.
Simulations
We first simulated the fractionation process for all combinations of the following parameter values:

gene number N = 100 to 900, in steps of 100.

µ = 1.0 to 2.4, in steps of 0.1.

Proportion of the genes deleted, 1 − θ = 0.1 to 0.9, in steps of 0.1.
For each combination of the parameters µ, N and θ, we calculated the distribution of run lengths l for singlecopy regions, and similarly for doublecopy regions. The simulation was repeated 1000 times and the frequencies of length (l = 1,2,3,...) of runs of deleted genes were averaged over the 1000 trials to get a reasonably accurate estimate of the cumulative F_{ µ,N,1−θ }. Similarly we estimated the cumulative G_{ µ,N,1−θ }for runs of remaining doublecopy genes.
Once the cumulative distributions were established, we then carried out the actual discrimination study. For each value of µ and N, we sampled 1000 new individual trajectories of the deletion process until 1 − θ = 90 % of the genes were deleted. For each value of 1 − θ = 0.1, 0.2, . . . , 0.9, we created "bins" corresponding to the fifteen values of µ for which we had constructed cumulatives. Then for each sample S_{ i }, at each 1 − θ = 0.1, . . . , 0.9 we counted the frequency of runs of deleted genes of length = 1, 2, . . . and constructed a cumulative distribution. We calculated the KolmogorovSmirnov statistic ${D}_{\mathit{\mu},N,1\theta}^{\left(Si\right)}$between the sample cumulative and the previously established distribution F_{ µ,N,1−θ }for each fifteen values of µ and assigned the sample to the bin corresponding to the minimal value of this statistic, which was our estimate $\widehat{\mu}$ for that sample.
Conclusions
In this work we have made some progress in deriving the runlength distribution ψ_{ t } for singlecopy regions, although this problem is still not completely resolved. From an analytical point of view, it is unexpected and interesting that in the continuous version of the problem, the number of preexisting runs overlapped by a deletion event follows a geometric distribution.
The simulation study showed the much greater difficulty in distinguishing between the structural and functional models when the mean µ of the deletion size distribution is 1.1 rather than 1.9, when N is 100 rather than 900, and when the proportion of genes deleted is bigger than 50% rather than less than 40%. The latter effect is also apparent in empirical studies [7].
Our simulation results are based on a "binning" strategy for determining $\widehat{\mu}$ for the purposes of discrimination, rather than an asymmetrical testing approach comparing the hypotheses µ = 1 and µ > 1. This is justified by the lack of any biological significance, and high rates of error, in comparing $\mu =\text{1}+\in $ and µ = 1 for very small $\phantom{\rule{0.5em}{0ex}}\in $, as well as the global picture it offers of the degradation of discriminatory power as a function of µ, N and θ.
This work has for the first time enabled the systematic discrimination between the two models of duplicate deletion following WGD. Future research will continue on the analytical determination of ψ_{ t }as well as extension to the "twosided" deletion models proposed in [5]. Eventually, we will have to allow processes of genome rearrangement to disrupt runs of singlecopy genes or doublecopy genes, as in [7]. It is these kinds of model that will eventually be useful for analyzing data from real genomes.
Declarations
Acknowledgements
Research supported in part by grants to DS from the Natural Sciences and Engineering Research Council of Canada (NSERC). CFBAN was a Mitacs Globalink intern. DS holds the Canada Research Chair in Mathematical Genomics.
Declarations
Publication of this article was funded by NSERC Discovery grant DG 88672008 RGPIN.
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 17, 2015: Selected articles from the Fourth IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2014): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S17.
Authors’ Affiliations
References
 Wolfe KH, Shields DC: Molecular evidence for an ancient duplication of the entire yeast genome. Nature. 1997, 387: 70813.PubMedView ArticleGoogle Scholar
 ElMabrouk N: Genome rearrangement by reversals and insertions/deletions of contiguous segments. Combinatorial Pattern Matching, 11th Annual Symposium Lecture Notes in Computer Science. Edited by: Giancarlo, R., Sankoff, D. 2000, Springer, 1848: 222234.View ArticleGoogle Scholar
 van Hoek MJ, Hogeweg P: The role of mutational dynamics in genome shrinkage. Molecular Biology and Evolution. 2007, 24: 24852494.PubMedView ArticleGoogle Scholar
 Byrnes JK, Morris GP, Li WH: Reorganization of adjacent gene relationships in yeast genomes by wholegenome duplication and gene deletion. Molecular Biology and Evolution. 2006, 23: 11361143.PubMedView ArticleGoogle Scholar
 Sankoff D, Zheng C, Wang B: A model for biased fractionation after whole genome duplication. BMC Genomics. 2012, 13: S1, S8View ArticleGoogle Scholar
 Wang B, Zheng C, Sankoff D: Fractionation statistics. BMC Bioinformatics. 2011, 12: S9, S5View ArticleGoogle Scholar
 Sankoff D, Zheng C, Zhu Q: The collapse of gene complement following whole genome duplication. BMC Genomics. 2010, 11: 313PubMedPubMed CentralView ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.