Skip to main content

Structural vs. functional mechanisms of duplicate gene loss following whole genome doubling



The loss of duplicate genes - fractionation - after whole genome doubling (WGD) is the subject to a debate as to whether it proceeds gene by gene or through deletion of multi-gene chromosomal segments.


WGD produces two copies of every chromosome, namely two identical copies of a sequence of genes. We assume deletion events excise a geometrically distributed number of consecutive genes with mean µ ≥ 1, and these events can combine to produce single-copy runs of length l. If µ = 1, the process is gene-by-gene. If µ > 1, the process at least occasionally excises more than one gene at a time. In the latter case if deletions overlap, the later one simply extends the existing run of single-copy genes. We explore aspects of the predicted distribution of the lengths of single-copy regions analytically, but resort to simulations to show how observing run lengths l allows us to discriminate between the two hypotheses.


Deletion run length distributions can discriminate between gene-by-gene fractionation and deletion of segments of geometrically distributed length, even if µ is only slightly larger than 1, as long as the genome is large enough and fractionation has not proceeded too far towards completion.


The process of whole genome doubling (WGD) gives rise to two copies of each chromosome in a genome, containing the same genes in the same order. Through an attrition mechanism known as fractionation, one of each pair of duplicate genes is lost over evolutionary time. The rare deletion of both copies of a gene can be excluded from our considerations of the interleaving patterns of deletions from duplicated regions first discovered by Wolfe and Shields [1]. The retention of one copy of each pair is what differentiates the WGD/fractionation model from approaches to gene duplication, insertion and deletion in the study of comparative genomics, pioneered by El-Mabrouk [2].

An important biological controversy arises from the question of whether duplicated genes are deleted through random excision - elimination of excess DNA - namely the deletion of chromosomal segments containing one or more genes [3], which we term the "structural" mechanism, or through gene-by gene events such as epigenetic silencing and pseudogenization [4], which are "functional" mechanisms. This question is important to evolutionary theory because it speaks directly to the role of WGD, and gene duplication in general, in disrupting gene order, in creating functional innovation, and in the radiation of new species. It is a question of whether selection operates on the level of simply permitting non-lethal deletions or whether more subtle effects are in play, such as dosage balance of interacting genes.

This debate may be formulated in terms of deletion events removing a number X of contiguous genes, where X is drawn from a geometric distribution γ with mean µ. Here the one-at-a-time deletion model is represented by µ = 1, while the random number of deletions at a time holds if µ > 1. In the latter case, the possibility of two overlapping events is handled by a biologically realistic additive run-length assumption.

In this paper, we investigate the discrimination problem of choosing between the two models based on deletion run-length statistics (resulting from overlapping deletion events). This involves comparing an observed genome containing single-copy genes, originally members of duplicate pairs, to the predictions of the models for µ = 1 and for µ > 1. This requires knowledge of the run-length distribution, given a total number of deleted genes and remaining duplicate pairs. While this is easily calculated for the case µ = 1, the the distribution for the opposing scenario µ > 1 is not known.

In the first part of this paper, we analyze aspects of the deletion run-length distribution ψ when µ > 1 for the deletion-length distribution γ, including some new and surprising analytical results, the clearest of which pertain to a continuous analog of the problem. We then show why it is difficult to describe ψ in closed form or other easily computable format. In the second part, we simulate the distribution and carry out a study of the discrimination problem for various values of µ, genome size N and θ, the proportion of undeleted genes at time t. We conclude with a discussion of the remaining mathematical problems to be solved before the method can be applied to data from real WGD descendants.


The models

For modeling purposes, we consider a doubled genome made up, at the outset, of a pair of identical linear chromosomes each containing genes g1, . . . , g N , where N is large enough so that we can neglect end effects - particular behaviors near g 1 and g N . At each time t = 1, 2, . . ., one such doubled gene gi is chosen at random, and a value X is chosen from a geometric distribution γ with mean µ. If X = a, then gi, gi+1, . . . , gi+a−1are deleted from one of the genomes - they become single-copy genes - unless some of these are already single-copy. In the latter case, we skip existing single-copy genes and proceed to convert the next double-copy genes we encounter until a total of a double-copy genes have been converted to single-copy. Note that this overlapping of deletion events never occurs if µ = 1 since, in this case, by definition, exactly one double-copy gene is selected and deleted in each step. For simplicity, we assume all deletions take place from one and the same genome. In a more complete model, deletion events would occur on one or the other chromosome, with probabilities φ and 1 − φ [5].

The "skipping" procedure, introduced in [6], is a natural way to model the deletion process, since deletion of part of a chromosome and the subsequent rejoining of the chromosome directly before and directly after the deleted fragment means that this fragment is no longer "visible" to the deletion process. As observers, however, we have a record of the deleted genes, as one copy of each gene must be retained in the genome.

Overlapping deletion events and skipping result in the creation of runs of single-copy genes whose length is the sum of a number of geometric variables. The sum of r identical geometric variables produces a negative binomial distribution with parameter r, but the skipping process does not involve the sum of identical random variables, since a deletion with a large value of a is more likely to overlap an existing single copy region than a deletion with small a. Thus, at any point of time t > 0, the distribution ψt of single-copy run lengths will tend to contain a higher frequency of runs of length 1, and of very long runs, than would be generated by the negative binomial. On the other hand, the distribution of run lengths of the remaining double-copy genes is geometrically distributed with a probability distribution ρt, where the mean ν t decreases with t [5, 6].

Analysis of overlap probabilities

An attempt to determine ψ t analytically starts with the calculation of how many deletion events have overlapped to form a run of single-copy genes at time t. In [6], we derived a formula to predict whether a deletion event would create a new run of single-copy genes, probability p0; overlap exactly one existing run, thus extending it without changing the total number of runs, probability p1; overlap two runs, producing one larger combined run in place of the two pre-existing ones, probability p2; and so on. Other probabilities deal with the events that a run "touches" a pre-existing run without overlapping it. These probabilities all depend solely on γ and ρ t . For example, we examine the case of p 0 . The other probabilities are all formulated in analogous ways.

The proportion of genes in single-copy runs of length l is l ρ t ( l ) / ν t , where ν t = l > 0 l ρ t ( l ) . The probability p0 that a deletion event falls within a run of double-copy genes without deleting the terms at either end is

p 0 = l > 2 l ρ t ( l ) v t j = 2 l - 1 1 l a = 1 l - j γ ( a ) = 1 v t l > 2 ρ t ( l ) j = 2 l - 1 a = 1 l - j γ ( a ) = 1 v t l > 2 ρ t ( l ) a = 1 l - 2 ( l - a - 1 ) γ ( a )

where j indexes the starting position of the deletion within a run of length l, and a is the number of genes deleted in the event.

This formula requires quadratic computing time, but the p i for higher i, require polynomial time of degree i + 2. Here we exemplify with p0 to show that these probabilities can in fact be reduced to closed form, so that computing time is a negligible constant. The formula in (1), when expanded, consists of a number of partial sums of the geometric distributions γ and ρ t and means of these distributions, all of which are readily reduced to closed form, plus sums of terms of the form - 1 μ - 1 v t l and l - 1 μ - 1 v t l , which themselves can be considered in terms of a geometric distribution with mean ζ, where

1 - 1 ζ = ( 1 - 1 μ ) ( 1 - 1 v t )

Then (1) reduces to:

p 0 = ν t - 1 2 ( μ + ν t - 1 ) ν t

For large ν t , i.e., during the early stage of the process,

p 0 v t μ + v t 1 - 1 v t

Typically, µ is somewhere between 1 and 2, [3, 4], and ν t of the order of 103 or 104. Thus p0 is initially only slightly less than 1 but declines rapidly as ν t decreases exponentially.

We proceed in an analogous way to derive closed forms for p 1 , p 2 , . . ., but it is perhaps more instructive here to present the continuous version of the deletion process. Here the two identical chromosomes at time t = 0 are linear segments, long enough in comparison with the other parameters of the model so that end effects can be ignored. At each time t = 1, 2, . . . , a random point g is chosen on the chromosome, and a value X is chosen from an exponential distribution

f ( a ) = 1 μ e - a μ , a 0

with mean µ. If X = a, then the segment [g, g +a] is deleted from one of the genomes - [g, g + a] becomes a single-copy region - unless part of it is already single-copy. In the latter case, we skip existing single-copy regions and proceed to convert the next double-copy region we encounter until a total measure a of double-copy regions have been converted to single-copy.

In analogy with the discrete model, the combined length of the remaining double-copy segments is exponentially distributed according to probability distribution σ t , with a mean ν t that decreases with t.

The proportion of undeleted regions accounted for by segments of length ldl is l σ l v t d l , where ν t = 0 l σ ( l ) d l . Then the probability p0 that a deletion event falls completely within an undeleted segment is

p 0 = l = 0 l σ t ( l ) v t x = 0 l 1 l y = 0 l - x f ( y ) d y d x d l

Carrying out the integrations, we find

p 0 = v t μ + v t

which is reminiscent of the relation (4) in the discrete case with large ν t .

The probability p1 that a deletion event overlaps exactly one existing run of deletions is:

p 1 = 1 v t l = 0 z = 0 σ t ( l ) σ t ( z ) x = 0 l y = l - x l - x + z f ( y ) d y d x d z d l
= v t μ + v t μ μ + v t

It can be proved by induction that the probability a deletion event overlaps exactly q existing runs of deletions is:

p q = v t μ + v t μ μ + v t q

Thus we have the surprisingly uncomplicated result that the number q of pre-existing runs of single-copy regions overlapped by a new deletion event is geometrically distributed on q = 0, 1, . . . with parameter µ/(µ + ν t ).

On the run-length distribution

Although having a closed form for p q constitutes progress towards the computation of the run-length distribution ψ t , or eventually towards some analytical results on it, how to find this distribution remains a difficult question. As mentioned in the previous section, long deletion events will be involved in more skipping than small ones. This is illustrated in Figure 1, where runs built from a small number of events tend to be composed of shorter deletions, especially when µ is large. Had we just added independent samples from a geometric distribution, the curves in the figure would have been horizontal lines.

Figure 1
figure 1

Non-independence of overlapping events. Simulation of the number of overlapping deletion events making up a single-copy region, when 70% of the genes are single copy. With a large number of events in a run, the individual events tend to have greater lengths. From [6].

How to account for the distorting effect of skipping on the run-length distribution will require additional insight and research. In the interim, we may use simulations to study the discrimination problem.


We first simulated the fractionation process for all combinations of the following parameter values:

  • gene number N = 100 to 900, in steps of 100.

  • µ = 1.0 to 2.4, in steps of 0.1.

  • Proportion of the genes deleted, 1 − θ = 0.1 to 0.9, in steps of 0.1.

For each combination of the parameters µ, N and θ, we calculated the distribution of run lengths l for single-copy regions, and similarly for double-copy regions. The simulation was repeated 1000 times and the frequencies of length (l = 1,2,3,...) of runs of deleted genes were averaged over the 1000 trials to get a reasonably accurate estimate of the cumulative F µ,N,1−θ . Similarly we estimated the cumulative G µ,N,1−θ for runs of remaining double-copy genes.

Once the cumulative distributions were established, we then carried out the actual discrimination study. For each value of µ and N, we sampled 1000 new individual trajectories of the deletion process until 1 − θ = 90 % of the genes were deleted. For each value of 1 − θ = 0.1, 0.2, . . . , 0.9, we created "bins" corresponding to the fifteen values of µ for which we had constructed cumulatives. Then for each sample S i , at each 1 − θ = 0.1, . . . , 0.9 we counted the frequency of runs of deleted genes of length = 1, 2, . . . and constructed a cumulative distribution. We calculated the Kolmogorov-Smirnov statistic D μ , N , 1 θ ( S i ) between the sample cumulative and the previously established distribution F µ,N,1−θ for each fifteen values of µ and assigned the sample to the bin corresponding to the minimal value of this statistic, which was our estimate μ ^ for that sample.

Figure 2 shows the distributions of μ ^ for the 1000 samples S1, . . . , S1000. The four panels are the results of N = 900, 300, 200 and 100. A separate distribution is drawn for each of the trial values of µ used to generate the samples. For N = 900 (top left), there is a clear pattern of the mode of the distribution to occur at the same value of µ that generated the data, though the distributions become more spread out for higher values of µ. The same pattern may be seen for N = 300 (top right), though considerably degraded. This loss of accuracy of μ ^ continues through N = 200 (bottom left) and N = 100 (bottom right), where the modes for μ ^ when µ = 1.1 are in the µ = 1.0 bin.

Figure 2
figure 2

Discrimination between models based on run lengths of deleted genes. Frequency of μ ^ , the value for which D μ , N , 1 - θ ( S i ) between the sample cumulative and the distribution F µ,N,1−θ is minimal. All data involve a proportion of 1 − θ = 0.20 deleted genes. Top left: N = 900. Top right: N = 300. Bottom left: N = 200. Bottom right: N = 100.

With all four values of N in Figure 2, the most accurate inference is made for µ = 1, the gene-by-gene model. This brings us back to the original problem of discriminating between the gene-by-gene "functional model" (µ = 1) and the random excision "structural" model (µ > 1). Figure 3 shows the frequency with which we estimate μ ^ = 1 , for various values of µ and N = 200 or 900, as a function of 1 − θ, the proportion of genes deleted. The upper curves in the figure show that we can correctly identify the µ = 1 model around 70-85% of the time; more for N = 900 and less for N = 200, as long as 1 − θ < 50%. In other words, the type I error of a test of H0 : µ = 1 against H1 : µ > 1 with these parameters and procedures, is about 15-30%. The lower curves show that incorrectly inferring μ ^ = 1 occurs around 20% of the time when µ = 1.2, but very rarely for µ = 1.9 or even µ = 1.5, until 1 − θ begins to exceed 50%. In other words, if now H m : µ = m, for some constant m > 1, is the null hypothesis and H0 is the alternative, then the type I error is very small unless m is very close to 1 (e.g., m = 1.2) or 1 − θ is large (e.g., >50% if m = 1.5).

Figure 3
figure 3

Assessment of tests. Frequency of μ ^ = 1 as a function of 1 − θ, for µ = 1 (functional hypothesis) and various µ > 1 (structural hypothesis), for N = 900 and 200.

Up to now, we have examined only runs of single-copy genes. What of the runs of remaining double-copy genes? Figure 4 compares some of the results from the same simulations as Figure 3, but using the cumulative G µ,N,1−θ for runs of double-copy genes as well as F µ,N,1−θ for runs of single-copy genes. The main observation is that the double-copy approach systematically infers µ = 1 with higher frequency for small values of 1 − θ, whether or not this inference corresponds to the generating µ. It systematically infers µ = 1 with lower frequency for large values of 1 − θ, again whether or not this is correct. These simulations establish ranges of values of N, µ and 1 − θ for which we can and cannot discriminate between the two models.

Figure 4
figure 4

Comparison of double-copy and single-copy analyses. Frequency of μ ^ = 1 as a function of 1 − θ, for µ = 1 and 1.2 and N = 900 and 200. Results based on runs of single-copy (deleted) genes contrasted with results from double-copy (undeleted) genes.


In this work we have made some progress in deriving the run-length distribution ψ t for single-copy regions, although this problem is still not completely resolved. From an analytical point of view, it is unexpected and interesting that in the continuous version of the problem, the number of pre-existing runs overlapped by a deletion event follows a geometric distribution.

The simulation study showed the much greater difficulty in distinguishing between the structural and functional models when the mean µ of the deletion size distribution is 1.1 rather than 1.9, when N is 100 rather than 900, and when the proportion of genes deleted is bigger than 50% rather than less than 40%. The latter effect is also apparent in empirical studies [7].

Our simulation results are based on a "binning" strategy for determining μ ^ for the purposes of discrimination, rather than an asymmetrical testing approach comparing the hypotheses µ = 1 and µ > 1. This is justified by the lack of any biological significance, and high rates of error, in comparing μ= 1 + and µ = 1 for very small , as well as the global picture it offers of the degradation of discriminatory power as a function of µ, N and θ.

This work has for the first time enabled the systematic discrimination between the two models of duplicate deletion following WGD. Future research will continue on the analytical determination of ψ t as well as extension to the "two-sided" deletion models proposed in [5]. Eventually, we will have to allow processes of genome rearrangement to disrupt runs of single-copy genes or double-copy genes, as in [7]. It is these kinds of model that will eventually be useful for analyzing data from real genomes.


  1. Wolfe KH, Shields DC: Molecular evidence for an ancient duplication of the entire yeast genome. Nature. 1997, 387: 708-13.

    Article  PubMed  CAS  Google Scholar 

  2. El-Mabrouk N: Genome rearrangement by reversals and insertions/deletions of contiguous segments. Combinatorial Pattern Matching, 11th Annual Symposium Lecture Notes in Computer Science. Edited by: Giancarlo, R., Sankoff, D. 2000, Springer, 1848: 222-234.

    Chapter  Google Scholar 

  3. van Hoek MJ, Hogeweg P: The role of mutational dynamics in genome shrinkage. Molecular Biology and Evolution. 2007, 24: 2485-2494.

    Article  PubMed  CAS  Google Scholar 

  4. Byrnes JK, Morris GP, Li WH: Reorganization of adjacent gene relationships in yeast genomes by whole-genome duplication and gene deletion. Molecular Biology and Evolution. 2006, 23: 1136-1143.

    Article  PubMed  CAS  Google Scholar 

  5. Sankoff D, Zheng C, Wang B: A model for biased fractionation after whole genome duplication. BMC Genomics. 2012, 13: S1, S8-

    Article  Google Scholar 

  6. Wang B, Zheng C, Sankoff D: Fractionation statistics. BMC Bioinformatics. 2011, 12: S9, S5-

    Article  Google Scholar 

  7. Sankoff D, Zheng C, Zhu Q: The collapse of gene complement following whole genome duplication. BMC Genomics. 2010, 11: 313-

    Article  PubMed  PubMed Central  Google Scholar 

Download references


Research supported in part by grants to DS from the Natural Sciences and Engineering Research Council of Canada (NSERC). CFBAN was a Mitacs Globalink intern. DS holds the Canada Research Chair in Mathematical Genomics.


Publication of this article was funded by NSERC Discovery grant DG 8867-2008 RGPIN.

This article has been published as part of BMC Bioinformatics Volume 16 Supplement 17, 2015: Selected articles from the Fourth IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2014): Bioinformatics. The full contents of the supplement are available online at

Author information

Authors and Affiliations


Corresponding author

Correspondence to David Sankoff.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

All authors participated in the research, wrote the paper, read and approved the manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Sankoff, D., Zheng, C., Wang, B. et al. Structural vs. functional mechanisms of duplicate gene loss following whole genome doubling. BMC Bioinformatics 16 (Suppl 17), S9 (2015).

Download citation

  • Published:

  • DOI: