Fractionation statistics
- Baoyong Wang^{1},
- Chunfang Zheng^{1} and
- David Sankoff^{1}Email author
https://doi.org/10.1186/1471-2105-12-S9-S5
© Wang et al; licensee BioMed Central Ltd. 2011
Published: 5 October 2011
Abstract
Background
Paralog reduction, the loss of duplicate genes after whole genome duplication (WGD) is a pervasive process. Whether this loss proceeds gene by gene or through deletion of multi-gene DNA segments is controversial, as is the question of fractionation bias, namely whether one homeologous chromosome is more vulnerable to gene deletion than the other.
Results
As a null hypothesis, we first assume deletion events, on one homeolog only, excise a geometrically distributed number of genes with unknown mean µ, and these events combine to produce deleted runs of length l, distributed approximately as a negative binomial with unknown parameter r, itself a random variable with distribution π(·). A more realistic model requires deletion events on both homeologs distributed as a truncated geometric. We simulate the distribution of run lengths l in both models, as well as the underlying π(r), as a function of µ, and show how sampling l allows us to estimate µ. We apply this to data on a total of 15 genomes descended from 6 distinct WGD events and show how to correct the bias towards shorter runs caused by genome rearrangements. Because of the difficulty in deriving π(·) analytically, we develop a deterministic recurrence to calculate each π(r) as a function of µ and the proportion of unreduced paralog pairs.
Conclusions
The parameter µ can be estimated based on run lengths of single-copy regions. Estimates of µ in real data do not exclude the possibility that duplicate gene deletion is largely gene by gene, although it may sometimes involve longer segments.
Background
Whole genome doubling (WGD) triggers the wholesale shedding of duplicate genes through processes such as epigenetic silencing, pseudogenization, and deletion of chromosomal segments containing one or more genes [1–5]. The extent to which this paralog reduction is a gene-by-gene inactivation process [6] targeting redundant copies at random points throughout the genome, or a consequence of largely random excision, elimination of excess DNA [2], is controversial and likely varies from one phylogenetic domain to another. The distinction between these two processes is not sharp: The inactivation effect may be produced not only by pseudogenization and various suppression and silencing mechanisms but also by the actual excision of a small but critical region of a gene or promoter. Conversely, the apparent excision of two or more adjacent genes may rather be due to any of a variety of genetic, epigenetic or functional interactions, rather than the deletion of a DNA fragment. Nevertheless, the determination of whether paralog reduction is a gene-by-gene process or the deletion of longer stretches DNA is key to understanding the dynamics of genome evolution, not only following WGD, but as part of the continual innovative expansion and simplifying shrinkage of genomes over time.
The other face of paralog reduction is the process of fractionation. When a duplicate gene is lost, it may be lost from one copy (homeolog) of a chromosome or the other. When compared to the pre-WGD genome, or to a closely related but unduplicated genome, this creates an interleaving pattern, such that it is only by consolidating[4] the two homeologous single-copy regions that the full original gene complement becomes apparent. That the consolidated region is directly comparable to homologous regions in related genomes is due to the fact that single-copy genes are rarely deleted - of the two duplicates created by WGD, it is unlikely that both are deleted, for obvious functional reasons.
Fractionation is an important evolutionary process whenever WGD occurs, and is of particular interest for comparative genomics, since it results in a genome that is highly scrambled with respect to its pre-WGD ancestor. The study of fractionation also brings up the question of bias: Are paralogs always or generally lost from the same “side”, or are they lost randomly from one homeologous chromosome or the other [3–5, 7]?
In this paper, we analyze paralog reduction and fractionation in terms of two models, one easier to analyze but the other more realistic. First we model paralog reduction on only one of the two homeologous chromosomes as a series of excisions of geometrically distributed lengths and show how to use the observed run lengths of single-copy genes on the other chromosome to estimate the parameter of the deletion length distribution.
In the second model we allow excisions on both homeologous chromosomes, and model deletion lengths in terms of truncated geometric distributions to account for the above-mentioned prohibition against deleting single-copy genes.
This work is essentially the creation of a simple, one-parameter “null” model of paralog reduction, where deletion is by random events involving geometrically distributed (with mean µ) numbers of genes on one homeologous chromosome or randomly on both of them. This sets up the possibility of statistical tests of real WGD descendants, to see if the geometric hypothesis is acceptable and to see if fractionation is unbiased or not. We will not explicitly investigate the alternative hypotheses of gene-by-gene excision or biased fractionation; our task here, aside from estimating the parameters of our model, is simply to set up the null statistical model with a view to eventually developing useful statistical tests of hypothesis for this problem.
In a previous study of post-WGD evolution [3], we took chromosomal rearrangement events into account. In the present paper, we do not incorporate rearrangement into our model, but we do reanalyze some of the data, to explore the effects of genome rearrangement processes in confounding the evidence of fractionation and to suggest a way of redressing the loss of information.
The lengths of runs of undeleted genes may be considered independent samples from a geometric distribution, and the lengths of runs of deleted genes are also independent, but we show that the deletion events making up a run of deleted genes are not independent. As a consequence, the distribution of deleted run lengths seems beyond the scope of straightforward mathematical derivation. The major analytical and computational result of this paper is the construction, implementation and evaluation of a deterministic recurrence to calculate the distribution of the number of deletion events per run as a function of µ and the proportion θ of unreduced paralog pairs.
The models
The structure of the data
The data on paralog reduction are of the form (G, H), where G and H are binary sequences indexed by ℤ, satisfying the condition that g(i) + h(i) > 0. This condition models the prohibition against deleting both copies of a duplicated gene. We may also assume that whatever process generated the 0s and 1s is homogeneous on ℤ.
The sequence G + H consists of alternating runs of 1s and 2s. We denote by p(l), l ≥ 1 the probability distribution of length of runs of 1s. For any finite interval of ℤ we denote by f(l), l ≥ 1 the empirical frequency distribution of length of runs of 1s.
The use of ℤ instead of a finite interval is consistent with our goal of getting to the mathematical essence of the process, without any complicating parameters such as interval length. In practice, we will use long intervals of 100,000 or 300,000 so that any edge effects will be negligible. See [3] and the section below on 15 WGD-descendant genomes for ad hoc ways of handling biological scale intervals.
One-sided deletion
and we convert g(i) = 0, g(i + 1) = 0, … , g(i + a – 1) = 0, unless one or more of these is already 0, in which case we skip over it and continue to convert the next available 1s into 0s, until a total of a 1s have been converted. This is a natural way to model the excision process, since deletion of duplicates and the subsequent rejoining of the DNA directly before and directly after the excised fragment means that this fragment is no longer “visible” to the deletion process. Observationally, however, we know deletion has occurred because we have access to the sequence H, which retains copies of the deleted terms.
if these geometric variables were independent. As we shall see later, however, the hypothesis of independence does not hold.
One-sided model
event | i | a | -7 | -6 | -5 | -4 | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | r |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
start | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
1 | -1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
2 | -4 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1,1 |
3 | 4 | 4 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1,1,1 |
4 | -5 | 4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 3,1 |
The one-sided model is an extreme version of biased fractionation and is not meant to model any real situation. It is, however, relatively tractable and hence provides a mathematical framework for understanding more realistic cases.
Two-sided deletion
In a more realistic model, deletions can occur both in sequence G and sequence H as in Table 2. Thus before choosing a position i, we chose either G or H with probability ø and 1 – ø, respectively. The default we shall study here, , represents unbiased fractionation. Then position i, where g(i) + h(i) = 2 and geometric variable a are chosen as before.
Two-sided model
event | i | a | -7 | -6 | -5 | -4 | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | r |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
start | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | |||
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ||||
1 | -1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ||||
2 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1,1 | ||
-4 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ||
3 | 4 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1,1,1 |
1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ||||
4 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 3,1 | ||
-5 | 4 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
In this model, the deletion length is no longer geometric but a mixture of geometric and truncated geometric variables, and run length is no longer negative binomially distributed, even approximately.
Results
Simulations to determine π
Finally, the remaining graph in Fig. 1 shows that any edge effects in our simulation are negligible. Whether we work with G and H on an interval of ℤ of length 100,000 or, as in another simulation, length 300,000, gives virtually the same results.
Non-independence of deletion events in a run
Application to 15 descendants of WGD events
To explore the relevance of our models for real genomes, we emphasize that we observe only the proportion θ of unreduced duplicates and the distribution of run lengths of single-copy genes on both homeologous chromosomes. (We can also observe the distribution of the run size of surviving paralog pairs, although models have not been developed for this.) We cannot observe t or λ. We cannot sample from the geometric distribution of deletion sizes, only their accumulation into runs, so that we cannot directly estimate its mean µ, the parameter of biological interest.
In [3], we studied 15 descendants of 6 ancient WGD events. In real genome sequences such as these, many or most runs of deleted paralogs will be impossible to identify; one or both of the homeologous regions will have been disrupted by inversions, translocations or other rearrangement events that juxtapose the surviving genes in the run with genes originally remote on the chromosome or from elsewhere in the genome.
Among all the runs of single-copy genes in a WGD descendant genome, it is only the AU that can be used as evidence for the paralog reduction process, because it is only from these that we can reconstruct common conserved homeologous regions on two chromosomes (or remote regions on one chromosome).
Data on 15 genomes
t | n | m | 1 – θ | ||
---|---|---|---|---|---|
S. cerevisiae | 150 | 5616 | 4498 | 0.89 | 6.0958 |
C. glabrata | 150 | 5180 | 4382 | 0.92 | 5.3839 |
V. polyspora | 150 | 5112 | 4164 | 0.9 | 4.922 |
S. bayanus | 150 | 5857 | 4773 | 0.9 | 5.8297 |
N. castelli | 150 | 5213 | 4053 | 0.88 | 5.0717 |
Paramecium | 20 | 38626 | 14576 | 0.55 | 2.0299 |
populus | 70 | 20082 | 7228 | 0.53 | 1.6402 |
Arabidopsis | 50 | 25655 | 13267 | 0.68 | 3.6086 |
fugu | 350 | 14251 | 12653 | 0.941 | 3.806 |
medaka | 350 | 14564 | 13352 | 0.957 | 5.0629 |
stickleback | 350 | 16726 | 14876 | 0.941 | 4.3792 |
tetraodon | 350 | 17120 | 16088 | 0.969 | 6.876 |
chicken | 450 | 10077 | 8495 | 0.915 | 3.6122 |
opossum | 450 | 13339 | 11589 | 0.93 | 5.5507 |
human | 450 | 13828 | 12144 | 0.935 | 3.818 |
Consider an AU of length u. There are u+1 possible breakpoints in an AU of length u, including the two at either end of the run of single-copy genes involving the flanking duplicate genes, that could destroy the AU, according to definition.
This correction procedure is relatively unstable, since it is very sensitive to the arbitrary parameter α. All the more so with very low values of θ, as on the right of Fig. 6, where the model begins to percolate, i.e., where the runs merge together at a rapidly increasing rate. Nevertheless we see no evidence in the figure that µ is much greater than 1, leaving a gene-by-gene model very much a viable candidate alongside the geometric excision model.
A model for π(r) in the one-sided model
We are interested in inferring µ from the observed distribution of run lengths and the proportion of undeleted terms θ. At the outset θ = 1. As t → ∞, θ → 0. We are not, however, interested in t, since it is not observable and any time-based inference we can make about µ will depend only on run lengths and θ in any case. On the other hand, r, the number of deletion events per run is an interesting variable since we can assume run length is rµ on average, and we can model the evolution of r directly in the one-sided model. We consider the distribution π as a function of θ.
- 1.
new runs (r = 1) falling completely within an existing run of undeleted terms, not touching the preceding or following run of deleted terms
- 2.
runs that touch, overlap or entirely engulf exactly one previous run of deleted terms with r ≥ 1, thus lengthening that run to r + 1 events,
- 3.
runs that touch, overlap or engulf, by the skipping process, two previous runs of r_{1} and r_{2} events respectively, creating a new run of r_{1} + r_{2} + 1 events, and diminishing the total number of runs by 1, and
- 4.
runs that touch, overlap or engulf, by the skipping process, k > 2 previous runs of of r_{1}, … , r_{ k } events respectively, creating a new run of r_{1} + … + r_{ k } + 1 events, and diminishing the total number of runs by k – 1. Case 3 above may be considered a special case of this for k = 2 and Case 2 for k = 1.
The distribution ρ(l) of lengths of the undeleted runs is geometric, since each deletion event creates a randomly placed demarcation between two runs in the sequence consisting of all the remaining terms. The number of terms between two successive such demarcations corresponds to the difference between successive order statistics, and is hence geometrically distributed.
where j indexes the starting position of the deletion within the run, and a is the number of terms deleted in the event.
The event A adds one new run with r = 1. The events B and C lengthen an existing run from r events to r + 1. The events D and E join two existing runs of of r and s events to create a single run of length r + s + 1. In our initial model, we neglect the merger of three or more runs of deletions. There is no conceptual difficulty in including three or more mergers, but the proliferation of embedded summations leads to computational problems. Thus we should expect the model to be adequate until θ gets very small, when mergers of several runs at a time become common.
The last lines of each of (7),(9) and (10) include the collection of terms, significantly cutting down on computing time when these formulae are implemented.
The new proportion θ′ of undeleted terms is .
We implement equations (7) to (17) as a recurrence with a step size parameter Λ to control the number of events using the same p_{ A }, p_{ B }, p_{ C }, p_{ D } and p_{ E } and δ(·) between successive normalizations, using Λδ(·) instead of δ(·) in (15)-(17). The choice of Λ determines the trade-off between computing speed and accuracy.
Other possible sources of error might be due to the cutoffs in x used for calculations involving γ(x) and ρ(x). However, extensive testing of various cutoff values has indicated such errors to be negligible in our implementation.
Conclusions
We have developed a model for the fractionation process based on deletion events excising a geometrically-distributed number of contiguous paralogs from one of a pair of homeologous chromosomes. This is extended to the mathematically less tractable case where both homeologs are susceptible to deletion events. The existence of data prompting this model is due to a functional biological constraint against deleting both copies of a duplicate pair of genes.
The mathematical framework we propose should eventually serve for testing the geometric excision hypothesis against alternatives such as gene-by-gene inactivations or imbalanced fractionation, although we have not developed these here.
Simulations of these models indicate the feasibility of estimating the mean µ of the deletion event process from observations of the length of runs of single-copy genes and the overall proportion of single-copy genes. Application to real data from an earlier survey of 15 genomes descended from 6 WGD events, however, is hampered by the accumulation of rearrangement events that have obscured most of the runs of single-copy genes. We have proposed a way of correcting for the missing runs, but this remains a rather approximate procedure.
The main outstanding question remains the exact derivation of π, the distribution of the number of deletion events contributing to a run of single-copy genes. The simulations are convenient in practice, since they depend on only one parameter µ as they evolve over time, but they give little mathematical insight. Our most important advance is a deterministic recurrence for the π(r) as the proportion θ of undeleted genes decreases, albeit for the one-sided model only. This takes into account the appearance of new runs over time, the lengthening of existing runs, as well as the merger of two existing runs with the new deletions to form a single, longer one. This calculation fits the process as simulated rather well and seems promising for further development.
Declarations
Acknowledgements
Research funded in part by a Discovery grant from the Natural Sciences and Engineering Research Council of Canada.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 9, 2011: Proceedings of the Ninth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S9.
Authors’ Affiliations
References
- Byrne K, Wolfe K: The Yeast Gene Order Browser: combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Research 2005, 15: 1456–1461. 10.1101/gr.3672305PubMedPubMed CentralView ArticleGoogle Scholar
- van Hoek M, Hogeweg P: The role of mutational dynamics in genome shrinkage. Molecular Biology and Evolution 2007, 24: 2485–2494. 10.1093/molbev/msm183PubMedView ArticleGoogle Scholar
- Sankoff D, Zheng C, Zhu Q: The collapse of gene complement following whole genome duplication. BMC Genomics 2010, 11: 313. 10.1186/1471-2164-11-313PubMedPubMed CentralView ArticleGoogle Scholar
- Langham R, Walsh J, Dunn M, Ko C, Goff S, Freeling M: Genomic duplication, fractionation and the origin of regulatory novelty. Genetics 2004, 166: 935–945. 10.1534/genetics.166.2.935PubMedPubMed CentralView ArticleGoogle Scholar
- Thomas B, Pedersen B, Freeling M: Following tetraploidy in an Arabidopsis ancestor, genes were removed preferentially from one homeolog leaving clusters enriched in dose-sensitive genes. Genome Research 2006, 16: 934–946. 10.1101/gr.4708406PubMedPubMed CentralView ArticleGoogle Scholar
- Byrnes JK, Morris GP, Li WH: Reorganization of Adjacent Gene Relationships in Yeast Genomes by Whole-Genome Duplication and Gene Deletion. Molecular Biology and Evolution 2006, 23(6):1136–1143. 10.1093/molbev/msj121PubMedView ArticleGoogle Scholar
- Edger P, Pires J: Gene and genome duplications: the impact of dosage-sensitivity on the fate of nuclear genes. Chromosome Research 2009, 17: 699–717. 10.1007/s10577-009-9055-9PubMedView ArticleGoogle Scholar
- El-Mabrouk N, Sankoff D: The reconstruction of doubled genomes. SIAM Journal on Computing 2003, 32: 754–792. 10.1137/S0097539700377177View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.