Volume 16 Supplement 17
Structural vs. functional mechanisms of duplicate gene loss following whole genome doubling
© Sankoff et al. 2015
Published: 7 December 2015
The loss of duplicate genes - fractionation - after whole genome doubling (WGD) is the subject to a debate as to whether it proceeds gene by gene or through deletion of multi-gene chromosomal segments.
WGD produces two copies of every chromosome, namely two identical copies of a sequence of genes. We assume deletion events excise a geometrically distributed number of consecutive genes with mean µ ≥ 1, and these events can combine to produce single-copy runs of length l. If µ = 1, the process is gene-by-gene. If µ > 1, the process at least occasionally excises more than one gene at a time. In the latter case if deletions overlap, the later one simply extends the existing run of single-copy genes. We explore aspects of the predicted distribution of the lengths of single-copy regions analytically, but resort to simulations to show how observing run lengths l allows us to discriminate between the two hypotheses.
Deletion run length distributions can discriminate between gene-by-gene fractionation and deletion of segments of geometrically distributed length, even if µ is only slightly larger than 1, as long as the genome is large enough and fractionation has not proceeded too far towards completion.
Keywordswhole genome doubling fractionation run length statistics gene loss gene duplication
The process of whole genome doubling (WGD) gives rise to two copies of each chromosome in a genome, containing the same genes in the same order. Through an attrition mechanism known as fractionation, one of each pair of duplicate genes is lost over evolutionary time. The rare deletion of both copies of a gene can be excluded from our considerations of the interleaving patterns of deletions from duplicated regions first discovered by Wolfe and Shields . The retention of one copy of each pair is what differentiates the WGD/fractionation model from approaches to gene duplication, insertion and deletion in the study of comparative genomics, pioneered by El-Mabrouk .
An important biological controversy arises from the question of whether duplicated genes are deleted through random excision - elimination of excess DNA - namely the deletion of chromosomal segments containing one or more genes , which we term the "structural" mechanism, or through gene-by gene events such as epigenetic silencing and pseudogenization , which are "functional" mechanisms. This question is important to evolutionary theory because it speaks directly to the role of WGD, and gene duplication in general, in disrupting gene order, in creating functional innovation, and in the radiation of new species. It is a question of whether selection operates on the level of simply permitting non-lethal deletions or whether more subtle effects are in play, such as dosage balance of interacting genes.
This debate may be formulated in terms of deletion events removing a number X of contiguous genes, where X is drawn from a geometric distribution γ with mean µ. Here the one-at-a-time deletion model is represented by µ = 1, while the random number of deletions at a time holds if µ > 1. In the latter case, the possibility of two overlapping events is handled by a biologically realistic additive run-length assumption.
In this paper, we investigate the discrimination problem of choosing between the two models based on deletion run-length statistics (resulting from overlapping deletion events). This involves comparing an observed genome containing single-copy genes, originally members of duplicate pairs, to the predictions of the models for µ = 1 and for µ > 1. This requires knowledge of the run-length distribution, given a total number of deleted genes and remaining duplicate pairs. While this is easily calculated for the case µ = 1, the the distribution for the opposing scenario µ > 1 is not known.
In the first part of this paper, we analyze aspects of the deletion run-length distribution ψ when µ > 1 for the deletion-length distribution γ, including some new and surprising analytical results, the clearest of which pertain to a continuous analog of the problem. We then show why it is difficult to describe ψ in closed form or other easily computable format. In the second part, we simulate the distribution and carry out a study of the discrimination problem for various values of µ, genome size N and θ, the proportion of undeleted genes at time t. We conclude with a discussion of the remaining mathematical problems to be solved before the method can be applied to data from real WGD descendants.
For modeling purposes, we consider a doubled genome made up, at the outset, of a pair of identical linear chromosomes each containing genes g1, . . . , g N , where N is large enough so that we can neglect end effects - particular behaviors near g 1 and g N . At each time t = 1, 2, . . ., one such doubled gene gi is chosen at random, and a value X is chosen from a geometric distribution γ with mean µ. If X = a, then gi, gi+1, . . . , gi+a−1are deleted from one of the genomes - they become single-copy genes - unless some of these are already single-copy. In the latter case, we skip existing single-copy genes and proceed to convert the next double-copy genes we encounter until a total of a double-copy genes have been converted to single-copy. Note that this overlapping of deletion events never occurs if µ = 1 since, in this case, by definition, exactly one double-copy gene is selected and deleted in each step. For simplicity, we assume all deletions take place from one and the same genome. In a more complete model, deletion events would occur on one or the other chromosome, with probabilities φ and 1 − φ .
The "skipping" procedure, introduced in , is a natural way to model the deletion process, since deletion of part of a chromosome and the subsequent rejoining of the chromosome directly before and directly after the deleted fragment means that this fragment is no longer "visible" to the deletion process. As observers, however, we have a record of the deleted genes, as one copy of each gene must be retained in the genome.
Overlapping deletion events and skipping result in the creation of runs of single-copy genes whose length is the sum of a number of geometric variables. The sum of r identical geometric variables produces a negative binomial distribution with parameter r, but the skipping process does not involve the sum of identical random variables, since a deletion with a large value of a is more likely to overlap an existing single copy region than a deletion with small a. Thus, at any point of time t > 0, the distribution ψt of single-copy run lengths will tend to contain a higher frequency of runs of length 1, and of very long runs, than would be generated by the negative binomial. On the other hand, the distribution of run lengths of the remaining double-copy genes is geometrically distributed with a probability distribution ρt, where the mean ν t decreases with t [5, 6].
Analysis of overlap probabilities
An attempt to determine ψ t analytically starts with the calculation of how many deletion events have overlapped to form a run of single-copy genes at time t. In , we derived a formula to predict whether a deletion event would create a new run of single-copy genes, probability p0; overlap exactly one existing run, thus extending it without changing the total number of runs, probability p1; overlap two runs, producing one larger combined run in place of the two pre-existing ones, probability p2; and so on. Other probabilities deal with the events that a run "touches" a pre-existing run without overlapping it. These probabilities all depend solely on γ and ρ t . For example, we examine the case of p 0 . The other probabilities are all formulated in analogous ways.
where j indexes the starting position of the deletion within a run of length l, and a is the number of genes deleted in the event.
with mean µ. If X = a, then the segment [g, g +a] is deleted from one of the genomes - [g, g + a] becomes a single-copy region - unless part of it is already single-copy. In the latter case, we skip existing single-copy regions and proceed to convert the next double-copy region we encounter until a total measure a of double-copy regions have been converted to single-copy.
In analogy with the discrete model, the combined length of the remaining double-copy segments is exponentially distributed according to probability distribution σ t , with a mean ν t that decreases with t.
which is reminiscent of the relation (4) in the discrete case with large ν t .
Thus we have the surprisingly uncomplicated result that the number q of pre-existing runs of single-copy regions overlapped by a new deletion event is geometrically distributed on q = 0, 1, . . . with parameter µ/(µ + ν t ).
On the run-length distribution
How to account for the distorting effect of skipping on the run-length distribution will require additional insight and research. In the interim, we may use simulations to study the discrimination problem.
We first simulated the fractionation process for all combinations of the following parameter values:
gene number N = 100 to 900, in steps of 100.
µ = 1.0 to 2.4, in steps of 0.1.
Proportion of the genes deleted, 1 − θ = 0.1 to 0.9, in steps of 0.1.
For each combination of the parameters µ, N and θ, we calculated the distribution of run lengths l for single-copy regions, and similarly for double-copy regions. The simulation was repeated 1000 times and the frequencies of length (l = 1,2,3,...) of runs of deleted genes were averaged over the 1000 trials to get a reasonably accurate estimate of the cumulative F µ,N,1−θ . Similarly we estimated the cumulative G µ,N,1−θ for runs of remaining double-copy genes.
Once the cumulative distributions were established, we then carried out the actual discrimination study. For each value of µ and N, we sampled 1000 new individual trajectories of the deletion process until 1 − θ = 90 % of the genes were deleted. For each value of 1 − θ = 0.1, 0.2, . . . , 0.9, we created "bins" corresponding to the fifteen values of µ for which we had constructed cumulatives. Then for each sample S i , at each 1 − θ = 0.1, . . . , 0.9 we counted the frequency of runs of deleted genes of length = 1, 2, . . . and constructed a cumulative distribution. We calculated the Kolmogorov-Smirnov statistic between the sample cumulative and the previously established distribution F µ,N,1−θ for each fifteen values of µ and assigned the sample to the bin corresponding to the minimal value of this statistic, which was our estimate for that sample.
In this work we have made some progress in deriving the run-length distribution ψ t for single-copy regions, although this problem is still not completely resolved. From an analytical point of view, it is unexpected and interesting that in the continuous version of the problem, the number of pre-existing runs overlapped by a deletion event follows a geometric distribution.
The simulation study showed the much greater difficulty in distinguishing between the structural and functional models when the mean µ of the deletion size distribution is 1.1 rather than 1.9, when N is 100 rather than 900, and when the proportion of genes deleted is bigger than 50% rather than less than 40%. The latter effect is also apparent in empirical studies .
Our simulation results are based on a "binning" strategy for determining for the purposes of discrimination, rather than an asymmetrical testing approach comparing the hypotheses µ = 1 and µ > 1. This is justified by the lack of any biological significance, and high rates of error, in comparing and µ = 1 for very small , as well as the global picture it offers of the degradation of discriminatory power as a function of µ, N and θ.
This work has for the first time enabled the systematic discrimination between the two models of duplicate deletion following WGD. Future research will continue on the analytical determination of ψ t as well as extension to the "two-sided" deletion models proposed in . Eventually, we will have to allow processes of genome rearrangement to disrupt runs of single-copy genes or double-copy genes, as in . It is these kinds of model that will eventually be useful for analyzing data from real genomes.
Research supported in part by grants to DS from the Natural Sciences and Engineering Research Council of Canada (NSERC). CFBAN was a Mitacs Globalink intern. DS holds the Canada Research Chair in Mathematical Genomics.
Publication of this article was funded by NSERC Discovery grant DG 8867-2008 RGPIN.
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 17, 2015: Selected articles from the Fourth IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2014): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S17.
- Wolfe KH, Shields DC: Molecular evidence for an ancient duplication of the entire yeast genome. Nature. 1997, 387: 708-13.PubMedView ArticleGoogle Scholar
- El-Mabrouk N: Genome rearrangement by reversals and insertions/deletions of contiguous segments. Combinatorial Pattern Matching, 11th Annual Symposium Lecture Notes in Computer Science. Edited by: Giancarlo, R., Sankoff, D. 2000, Springer, 1848: 222-234.View ArticleGoogle Scholar
- van Hoek MJ, Hogeweg P: The role of mutational dynamics in genome shrinkage. Molecular Biology and Evolution. 2007, 24: 2485-2494.PubMedView ArticleGoogle Scholar
- Byrnes JK, Morris GP, Li WH: Reorganization of adjacent gene relationships in yeast genomes by whole-genome duplication and gene deletion. Molecular Biology and Evolution. 2006, 23: 1136-1143.PubMedView ArticleGoogle Scholar
- Sankoff D, Zheng C, Wang B: A model for biased fractionation after whole genome duplication. BMC Genomics. 2012, 13: S1, S8-View ArticleGoogle Scholar
- Wang B, Zheng C, Sankoff D: Fractionation statistics. BMC Bioinformatics. 2011, 12: S9, S5-View ArticleGoogle Scholar
- Sankoff D, Zheng C, Zhu Q: The collapse of gene complement following whole genome duplication. BMC Genomics. 2010, 11: 313-PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.