Structural vs. functional mechanisms of duplicate gene loss following whole genome doubling

Background The loss of duplicate genes - fractionation - after whole genome doubling (WGD) is the subject to a debate as to whether it proceeds gene by gene or through deletion of multi-gene chromosomal segments. Results WGD produces two copies of every chromosome, namely two identical copies of a sequence of genes. We assume deletion events excise a geometrically distributed number of consecutive genes with mean µ ≥ 1, and these events can combine to produce single-copy runs of length l. If µ = 1, the process is gene-by-gene. If µ > 1, the process at least occasionally excises more than one gene at a time. In the latter case if deletions overlap, the later one simply extends the existing run of single-copy genes. We explore aspects of the predicted distribution of the lengths of single-copy regions analytically, but resort to simulations to show how observing run lengths l allows us to discriminate between the two hypotheses. Conclusions Deletion run length distributions can discriminate between gene-by-gene fractionation and deletion of segments of geometrically distributed length, even if µ is only slightly larger than 1, as long as the genome is large enough and fractionation has not proceeded too far towards completion.


Background
The process of whole genome doubling (WGD) gives rise to two copies of each chromosome in a genome, containing the same genes in the same order. Through an attrition mechanism known as fractionation, one of each pair of duplicate genes is lost over evolutionary time. The rare deletion of both copies of a gene can be excluded from our considerations of the interleaving patterns of deletions from duplicated regions first discovered by Wolfe and Shields [1]. The retention of one copy of each pair is what differentiates the WGD/fractionation model from approaches to gene duplication, insertion and deletion in the study of comparative genomics, pioneered by El-Mabrouk [2].
An important biological controversy arises from the question of whether duplicated genes are deleted through random excision -elimination of excess DNAnamely the deletion of chromosomal segments containing one or more genes [3], which we term the "structural" mechanism, or through gene-by gene events such as epigenetic silencing and pseudogenization [4], which are "functional" mechanisms. This question is important to evolutionary theory because it speaks directly to the role of WGD, and gene duplication in general, in disrupting gene order, in creating functional innovation, and in the radiation of new species. It is a question of whether selection operates on the level of simply permitting non-lethal deletions or whether more subtle effects are in play, such as dosage balance of interacting genes.
This debate may be formulated in terms of deletion events removing a number X of contiguous genes, where X is drawn from a geometric distribution γ with mean µ.
Here the one-at-a-time deletion model is represented by µ = 1, while the random number of deletions at a time holds if µ > 1. In the latter case, the possibility of two overlapping events is handled by a biologically realistic additive run-length assumption.
In this paper, we investigate the discrimination problem of choosing between the two models based on deletion run-length statistics (resulting from overlapping deletion events). This involves comparing an observed genome containing single-copy genes, originally members of duplicate pairs, to the predictions of the models for µ = 1 and for µ > 1. This requires knowledge of the run-length distribution, given a total number of deleted genes and remaining duplicate pairs. While this is easily calculated for the case µ = 1, the the distribution for the opposing scenario µ > 1 is not known.
In the first part of this paper, we analyze aspects of the deletion run-length distribution ψ when µ > 1 for the deletion-length distribution γ, including some new and surprising analytical results, the clearest of which pertain to a continuous analog of the problem. We then show why it is difficult to describe ψ in closed form or other easily computable format. In the second part, we simulate the distribution and carry out a study of the discrimination problem for various values of µ, genome size N and θ, the proportion of undeleted genes at time t. We conclude with a discussion of the remaining mathematical problems to be solved before the method can be applied to data from real WGD descendants.

The models
For modeling purposes, we consider a doubled genome made up, at the outset, of a pair of identical linear chromosomes each containing genes g 1 , . . . , g N , where N is large enough so that we can neglect end effects -particular behaviors near g 1 and g N . At each time t = 1, 2, . . ., one such doubled gene g i is chosen at random, and a value X is chosen from a geometric distribution g with mean µ. If X = a, then g i , g i+1 , . . . , g i+a−1 are deleted from one of the genomes -they become single-copy genes -unless some of these are already single-copy. In the latter case, we skip existing single-copy genes and proceed to convert the next double-copy genes we encounter until a total of a double-copy genes have been converted to single-copy. Note that this overlapping of deletion events never occurs if µ = 1 since, in this case, by definition, exactly one double-copy gene is selected and deleted in each step. For simplicity, we assume all deletions take place from one and the same genome. In a more complete model, deletion events would occur on one or the other chromosome, with probabilities and 1 − [5].
The "skipping" procedure, introduced in [6], is a natural way to model the deletion process, since deletion of part of a chromosome and the subsequent rejoining of the chromosome directly before and directly after the deleted fragment means that this fragment is no longer "visible" to the deletion process. As observers, however, we have a record of the deleted genes, as one copy of each gene must be retained in the genome.
Overlapping deletion events and skipping result in the creation of runs of single-copy genes whose length is the sum of a number of geometric variables. The sum of r identical geometric variables produces a negative binomial distribution with parameter r, but the skipping process does not involve the sum of identical random variables, since a deletion with a large value of a is more likely to overlap an existing single copy region than a deletion with small a. Thus, at any point of time t > 0, the distribution ψ t of single-copy run lengths will tend to contain a higher frequency of runs of length 1, and of very long runs, than would be generated by the negative binomial. On the other hand, the distribution of run lengths of the remaining double-copy genes is geometrically distributed with a probability distribution ρ t , where the mean ν t decreases with t [5,6].

Analysis of overlap probabilities
An attempt to determine ψ t analytically starts with the calculation of how many deletion events have overlapped to form a run of single-copy genes at time t. In [6], we derived a formula to predict whether a deletion event would create a new run of single-copy genes, probability p 0 ; overlap exactly one existing run, thus extending it without changing the total number of runs, probability p 1 ; overlap two runs, producing one larger combined run in place of the two pre-existing ones, probability p 2 ; and so on. Other probabilities deal with the events that a run "touches" a pre-existing run without overlapping it. These probabilities all depend solely on g and r t . For example, we examine the case of p 0 . The other probabilities are all formulated in analogous ways.
The proportion of genes in single-copy runs of length lρ t (l) . The probability p 0 that a deletion event falls within a run of double-copy genes without deleting the terms at either end is where j indexes the starting position of the deletion within a run of length l, and a is the number of genes deleted in the event.
This formula requires quadratic computing time, but the p i for higher i, require polynomial time of degree i + 2.
Here we exemplify with p 0 to show that these probabilities can in fact be reduced to closed form, so that computing time is a negligible constant. The formula in (1), when expanded, consists of a number of partial sums of the geometric distributions g and r t and means of these distributions, all of which are readily reduced to closed form, plus sums of terms of the form , which themselves can be considered in terms of a geometric distribution with mean ζ, where Then (1) reduces to: For large ν t , i.e., during the early stage of the process, Typically, µ is somewhere between 1 and 2, [3,4], and ν t of the order of 10 3 or 10 4 . Thus p 0 is initially only slightly less than 1 but declines rapidly as ν t decreases exponentially.
We proceed in an analogous way to derive closed forms for p 1 , p 2 , . . ., but it is perhaps more instructive here to present the continuous version of the deletion process. Here the two identical chromosomes at time t = 0 are linear segments, long enough in comparison with the other parameters of the model so that end effects can be ignored. At each time t = 1, 2, . . . , a random point g is chosen on the chromosome, and a value X is chosen from an exponential distribution with mean µ. If X = a, then the segment [g, g +a] is deleted from one of the genomes -[g, g + a] becomes a single-copy region -unless part of it is already singlecopy. In the latter case, we skip existing single-copy regions and proceed to convert the next double-copy region we encounter until a total measure a of doublecopy regions have been converted to single-copy.
In analogy with the discrete model, the combined length of the remaining double-copy segments is exponentially distributed according to probability distribution s t , with a mean ν t that decreases with t.
The proportion of undeleted regions accounted for by Then the probability p 0 that a deletion event falls completely within an undeleted segment is Carrying out the integrations, we find which is reminiscent of the relation (4) in the discrete case with large ν t .
The probability p 1 that a deletion event overlaps exactly one existing run of deletions is: It can be proved by induction that the probability a deletion event overlaps exactly q existing runs of deletions is: Thus we have the surprisingly uncomplicated result that the number q of pre-existing runs of single-copy regions overlapped by a new deletion event is geometrically distributed on q = 0, 1, . . . with parameter µ/(µ + ν t ).

On the run-length distribution
Although having a closed form for p q constitutes progress towards the computation of the run-length distribution ψ t , or eventually towards some analytical results on it, how to find this distribution remains a difficult question. As mentioned in the previous section, long deletion events will be involved in more skipping than small ones. This is illustrated in Figure 1, where runs built from a small number of events tend to be composed of shorter deletions, especially when µ is large. Had we just added independent samples from a geometric distribution, the curves in the figure would have been horizontal lines.
How to account for the distorting effect of skipping on the run-length distribution will require additional insight and research. In the interim, we may use simulations to study the discrimination problem.

Simulations
We first simulated the fractionation process for all combinations of the following parameter values: • gene number N = 100 to 900, in steps of 100. For each combination of the parameters µ, N and θ, we calculated the distribution of run lengths l for single-copy regions, and similarly for double-copy regions. The simulation was repeated 1000 times and the frequencies of length (l = 1,2,3,...) of runs of deleted genes were averaged over the 1000 trials to get a reasonably accurate estimate of the cumulative F µ,N,1−θ . Similarly we estimated the cumulative G µ,N,1−θ for runs of remaining double-copy genes.
Once the cumulative distributions were established, we then carried out the actual discrimination study. For each value of µ and N, we sampled 1000 new individual trajectories of the deletion process until 1 − θ = 90 % of the genes were deleted. For each value of 1 − θ = 0.1, 0.2, . . . , 0.9, we created "bins" corresponding to the fifteen values of µ for which we had constructed cumulatives. Then for each sample S i , at each 1 − θ = 0.1, . . . , 0.9 we counted the frequency of runs of deleted genes of length = 1, 2, . . . and constructed a cumulative distribution. We calculated the Kolmogorov-Smirnov statistic D (Si) μ,N,1−θ between the sample cumulative and the previously established distribution F µ,N,1−θ for each fifteen values of µ and assigned the sample to the bin corresponding to the minimal value of this statistic, which was our estimateμ for that sample. Figure 2 shows the distributions ofμ for the 1000 samples With all four values of N in Figure 2, the most accurate inference is made for µ = 1, the gene-by-gene model. This brings us back to the original problem of discriminating between the gene-by-gene "functional model" (µ = 1) and the random excision "structural" model (µ > 1). Figure 3 shows the frequency with which we estimateμ = 1, for various values of µ and N = 200 or 900, as a function of 1 − θ, the proportion of genes deleted. The upper curves in the figure show that we can correctly identify the µ = 1 model around 70-85% of the time; more for N = 900 and less for N = 200, as long as 1 − θ < 50%. In other words, the type I error of a test of H 0 : µ = 1 against H 1 : µ > 1 with these parameters and procedures, is about 15-30%. The lower curves show that incorrectly inferringμ = 1 occurs around 20% of the time when µ = 1.2, but very rarely for µ = 1.9 or even µ = 1.5, until 1 − θ begins to exceed 50%. In other words, if now H m : µ = m, for some constant m > 1, is the null hypothesis and H 0 is the alternative, then the type I error is very small unless m is very close to 1 (e.g., m = 1.2) or 1 − θ is large (e.g., >50% if m = 1.5).
Up to now, we have examined only runs of singlecopy genes. What of the runs of remaining double-copy genes? Figure 4 compares some of the results from the same simulations as Figure 3, but using the cumulative G µ,N,1−θ for runs of double-copy genes as well as F µ,N,1−θ for runs of single-copy genes. The main observation is that the double-copy approach systematically infers µ = 1 with higher frequency for small values of 1 − θ, whether or not this inference corresponds to the generating µ. It systematically infers µ = 1 with lower frequency for large values of 1 − θ, again whether or not this is correct. These simulations establish ranges of Figure 1 Non-independence of overlapping events. Simulation of the number of overlapping deletion events making up a singlecopy region, when 70% of the genes are single copy. With a large number of events in a run, the individual events tend to have greater lengths. From [6].
values of N, µ and 1 − θ for which we can and cannot discriminate between the two models.

Conclusions
In this work we have made some progress in deriving the run-length distribution ψ t for single-copy regions, although this problem is still not completely resolved. From an analytical point of view, it is unexpected and interesting that in the continuous version of the problem, the number of pre-existing runs overlapped by a deletion event follows a geometric distribution.
The simulation study showed the much greater difficulty in distinguishing between the structural and functional models when the mean µ of the deletion size distribution is 1.1 rather than 1.9, when N is 100 rather than 900, and when the proportion of genes deleted is bigger than 50% rather than less than 40%. The latter effect is also apparent in empirical studies [7].
Our simulation results are based on a "binning" strategy for determiningμ for the purposes of discrimination, rather than an asymmetrical testing approach comparing the hypotheses µ = 1 and µ > 1. This is justified by the lack of any biological significance, and high rates of error, in comparing μ = 1+ ∈ and µ = 1 for very small ∈ , as well as the global picture it offers of the degradation of discriminatory power as a function of µ, N and θ.
This work has for the first time enabled the systematic discrimination between the two models of duplicate deletion following WGD. Future research will continue on the analytical determination of ψ t as well as extension to the "two-sided" deletion models proposed in [5]. Eventually, we will have to allow processes of genome rearrangement to disrupt runs of single-copy genes or double-copy genes, as in [7]. It is these kinds of model that will eventually be useful for analyzing data from real genomes.