A continuous analog of run length distributions reflecting accumulated fractionation events

Background We propose a new, continuous model of the fractionation process (duplicate gene deletion after polyploidization) on the real line. The aim is to infer how much DNA is deleted at a time, based on segment lengths for alternating deleted (invisible) and undeleted (visible) regions. Results After deriving a number of analytical results for “one-sided” fractionation, we undertake a series of simulations that help us identify the distribution of segment lengths as a gamma with shape and rate parameters evolving over time. This leads to an inference procedure based on observed length distributions for visible and invisible segments. Conclusions We suggest extensions of this mathematical and simulation work to biologically realistic discrete models, including two-sided fractionation.


Background
In the course of evolution, new genomes occasionally arise by duplication or triplication of an existing genome, so that there are two or three identical copies of each maternal and each paternal chromosome. After a (usually) transient period of polyploidy marked by unusual patterns of meiosis where more than just one maternal and paternal chromosome are aligned and recombine, processes of sequence divergence and chromosome rearrangement lead to more familiar diploid patterns. At the same time a process of fractionation eliminates some or most of the duplicate genes, some from each chromosomal copy, but in the simplest model, never all members of a duplicate pair or triple -for reasons of viability. Fractionation processes have been surveyed across evolutionarily diverse types of eukaryote organisms [1].
Since one copy of a duplicate pair of genes must be retained, we can identify not only the chromosomal regions that have been retained -by simple observation of the genome -but also each region that is now invisible -by reference to the duplicate chromosome that has necessarily retained a copy of this region. Thus, the data on which inferences about the deletion process can be made consist of alternating segments of deleted and undeleted genome of varying lengths.
Among the important questions about the nature of the deletion process, we can ask whether deletion proceeds one gene at a time or by larger chromosomal fragments. In this paper, we model the process as the deletion of segments from the real line, with a biologically realistic treatment afforded to overlapping deletions. Previous work focused on the difficult question of how many overlapping deletion events are responsible for each contiguous deleted region [2][3][4], but was not able to account analytically for the dynamics of the process.
In the present paper we attack and solve the inference problem of the size, form and spacing of deletion events, allowing for a number of sweeps over the genome as a way of accounting for overlapping deletions. We carry this out in a continuous analog of the original discrete geneorder context, and address the "one-sided" version of the problem, where all deletions occur on one of the duplicate chromosomes.
There has been a certain amount of work on the quantification of the fractionation process, starting in 2006 with [5], which claimed deletions involved one gene at a time, and [6], which treated the number of genes deleted in a single event as a random variable with mean greater than 1. Other work of this kind includes [1] and [7]. However, the modelling of fractionation where the whole genome evolves as a stochastic process began with [2]. The previously unstudied phenomenon taken into account in that work was the overlap of deletion events, something that assumes much importance soon after the fractionation process commences. Overlap must be handled differently if all deletions occur from one copy of the genome or in either copy. To isolate the most important aspect of overlap, [2] gave analytical results for the case where deletions all occurred on one copy ("one-sided" model). Then [3] extended this to the more realistic case where deletion could occur at different rates, or the same rate, from either copy of the genome ("two-sided" model). This analysis was more difficult and could not be taken as far as with the one-sided model.
For the one-sided model, a closed form solution of how many deletion events contribute to a deleted region after a single event (i.e., at a single step in the fractionation process) was obtained in [4].

The proposed model
We model the fractionation process in terms of a number of successive sweeps of a point process with parameter ν on the positive reals, i.e., ν ∈ R + , representing one copy of the genome. At the origin, we say that all points of this genome are "visible". A deletion event, rendering a segment of exponentially (mean μ) distributed length "invisible", occurs at each point determined by the point process. The second copy of the genome remains undisturbed throughout and retains a 1-to-1, length preserving, correspondence with the fractionating copy, without regard to any disruption caused by invisibility. In applications, the acceptance of the one-gene-at-a time theory of deletion depends on whether μ is below or above a certain absolute value, but the present work is part of the mathematical preliminaries to the practical questions. The eventual goal of this work is to determine the relative size of the "spacing" parameter ν and the deletion length parameter μ. The model innovation here is to introduce the parameter ν in the place of a rate parameter in previous work, which was awkward to work with.
During the first sweep, illustrated at the top of Fig. 1 at time (or step) t = 1, the first deletion point x 1 is determined by sampling from the exponential distribution with mean ν. Then a deletion length a 1 is chosen from another exponential distribution with mean μ. Normally, ν μ, but this is not necessary to the analysis. The segment [ x 1 , x 1 + a 1 ) is "deleted", or is designated as invisible. The next deletion point x 2 is chosen by sampling x 2 from the first exponential distribution (mean ν), so that x 2 = x 2 + x 1 + a 1 . Then the length a 2 of the second deleted segment is determined by sampling from γ again. The process continues in this way to find x 3 , a 3 , . . . Concatenating only those segments that are still visible, we see that x 1 , x 2 , . . . are points determined by a point process with parameter ν. Associated with each of these points x is an "event counter" C(x). Initially, each C(x) = 1. We define a function π t (i), i = 1, . . . measuring the proportion of event counters registering i events at time t ≥ 1. Thus π 1 (1) = 1 and π 1 (j) = 0, for all j > 1.
At times t = 1, 2, . . ., the second, third, . . . sweeps begin, all independent of the first sweep and each other, and each applied to the concatenated visible segments only. We sample x (t) 1 and a (t) 1 in the same way as x 1 and a 1 according to ρ and γ , respectively, to determine a deletion interval [ We find the next deletion point by sampling We continue the t sweep, adding visible deletion points and making others invisible. Some deletion points from the earlier sweep will remain unchanged, i.e. are still visible. The x i and the additional deletion points remaining from the earlier sweep define a process with mean λ t , a parameter that decreases with t, as the undeleted segments are interrupted by more and more deletions. This parameter is important as it is directly inferable from the observed genome at time t.
More important, it is clear, that at each sweep, more and more of the genome becomes invisible. Since each concatenation of visible segments still extends to the positive reals, we cannot observe directly how much the genome has been reduced in absolute terms. But thanks to the length-preserving isomorphism between the second copy of the genome and the fractionating one, for any large finite interval we can observe the proportion of the genome that is left by time t and we can predict that it is approximately (1 − μ ν+μ ) t . We will calculate λ, the number of deletion points in [ x i , x i+1 ), the distribution p(j), j = 1, ... of the number j of pre-existing deletion points in intervals deleted during each sweep, and discuss how to calculate π 1 (j), j ≥ 1, the proportion of event counters with C = j.

The length of undeleted segments λ
After the first sweep, x i is the only deletion point in [ x (1) i , a (1) i ) and the only deletion point in the visible [ x (1) i , x (1) i+1 ), so that λ 1 = ν. During the second sweep, the number of these first-sweep deletion points that the visible [ x (2) i , x (2) i+1 ) contains is Poisson distributed with mean ν ν+μ , while the remaining first-sweep deletion points that the invisible [ x (2) i , a (2) i ) contains are Poisson distributed with mean μ ν+μ . (These are approximations, since the true means are In addition the visible segment contains one new deletion  point, created during the second sweep itself. We can then predict λ 2 to be roughlŷ ( Suppose λ t−1 is the parameter of the point process that generates the deletion points visible after sweep t − 1. Then, in the sweep at time t, the number of deletion points that the invisible [ x (2) i , a (2) i ) will contain is Poisson distributed with mean μ λ t−1 . The number of deletion points in the visible [ x i , x i+1 ), not including x i , is Poisson distributed with mean ν λ t−1 . In addition, the visible segment contains one new deletion point, created during the t-th sweep itself. λ t can thus be predicted to be approximatelŷ ( 4 ) The treatment of overlapping deletions The discussions in this section and the next do not depend on t, so let be the exponential distribution with mean λ. From [4], the probability p 0 that a deletion event contains no extant deletion points is Carrying out the integrations, we find The probability p 1 that a deletion event overlaps exactly one existing run of deletions is: It can be proved by induction that the probability a deletion event overlaps exactly q existing runs of deletions is: (10) Thus we have the surprisingly uncomplicated result that the number q of pre-existing runs of single-copy regions overlapped by a new deletion event is geometrically distributed on q = 0, 1, . . . with parameter μ/(μ + λ). The event count C(x) at a visible deletion point x tells us how many deletion events have occurred to make up the invisible segment adjacent to x. In contrast to the undeleted segments, where we know that no events occurred, observing that a segment has been deleted does not tell us C(x). Some work has focused on the distribution π(i) of the probabilities that a deletion point x has C(x) = i, and we are able to calculate how π changes with each sweep. Then we can update π t by a linear  Fig. 6 Relation between slope of ln 1/β as a function of t, and μ ν combination of the distribution of changes due to the deletion and the existing π t−1 . Let (i) represent the change in π i at any sweep t. This can be calculated from Eq. (10) and the net effect that a deletion overlapping q existing runs has on the various π. Without giving details here, Unfortunately, even knowing the dynamics of C does not help us with the inference problem, since the number of events associated with an invisible segment, is not directly associated with the total length of the segment. It is known that the overlapping gamma variables making up each segment are related in a complex way, and cannot simply be treated as the sum of gammas drawn a single population.
This leads us to the approach in the next two sections, where simulations strongly suggest the functional form of the distribution of invisible segment lengths, including shape and rate parameters that can be observed, leading to inference of the simulation parameters based on the observations.

Simulation
Our simulation experiments were based on initial visible segments of length 10,000, which is very long in comparison to the deletion lengths with μ ≤ 10. In other words, we do not risk artificial effects, like a disappearing genome, after a few sweeps, t ≤ 10. Moreover, after each sweep, if the total undeleted length = L, we add, to the end of the remaining visible portion, segments where the lengths of the visible portions total 10, 000 − L, copied from a replicate trial. The program, written in Java, was repeated 5 times for each configuration of the parameters μ, ν and t. Each set of 5 trials averaged a total of less than 3 min on a Lenovo Y50 laptop.
After each sweep, we calculated the distribution of segment lengths for both the invisible and visible parts of the model genome.

Parameter estimation
The results of the simulations strongly suggest that the lengths of the invisible segments are gamma distributed, as illustrated in the Cullen-Frey graphs at the top of Fig. 2. As the parameters ν, μ and t change, the moments of the simulated distributions also change, but remain those of a gamma distribution. Similarly, the distribution of the lengths of the visible segments is always exponential, as at the bottom of Fig. 2, with rate As a first step towards the ability to infer μ and ν from the length distributions of invisible and visible segments, we would like to predict α and β, the shape and rate parameters of the gamma distribution, from t, μ and ν. Table 1 suggests, for a fixed value of t and a fixed value μ ν , that shape is constant as μ changes, and that the rate is inversely proportion to μ.
Similar results hold for each combination of t and μ ν , with different shape constants and rate proportions. Figure 3 shows how the shape constant varies with t for four values of μ ν . The four coefficients of the linear relationships inferred from Fig. 3 are plotted in Fig. 4. Fitting this curve with a quadratic yields As for the rate parameter of the gamma, Fig. 5 shows that it is the logarithm of the rate that behaves linearly over time for a fixed value of μ ν . The four coefficients of the linear relationships inferred from Fig. 5 are plotted in Fig. 6. Fitting this curve with a quadratic yields The observable quantities in our model are the distribution of visible segment lengths, predicted to be exponential with mean λ, and the shape and rate parameters α and β of the predicted gamma distribution of invisible segment lengths. These three observable quantities are related to the unknown model parameters μ, ν, and t through Eqs. (17), (18) and (19). With the given value of these parameters, we can estimate the values of μ, ν, and t.
Lacking a closed form solution for μ, ν, and t in terms of λ, α and β, we use the following procedure. Since t must be an integer, we can find values of ν t and μ t for each t = 1, 2, . . . with Eqs. (17) and (18). Then we can solve Eq. (19) to find β t .  We then compare all the β t , for t = 1, 2, . . . with the β observed in the simulation, and set As an example, in one set of simulations where μ = 1, ν = 3 and t = 5, the experimental value of parameters are λ −1 = 1.665595, shape = 0.6711252 and β = 0.3504422. When t ≤ 2, there is no solution for μ. For t > 2, Table 2 shows the results of this procedure, where 100δ is 100 × the normalized difference between β and β t in Eq. (20).
The minimum value of 100δ occurs when t = 5, expressing the fact that the inferred values of μ and ν, together with t = 5, are the parameter values most consistent with the observed values of α, β and λ. Other typical examples spanning a range of parameter values are given in Tables 3, 4 and 5. It can be seen, at least in these diverse examples, that the inference procedure generally identifies the correct value of t, and good estimates of μ and ν.

Discussion
The introduction of sweeps consisting of alternating jumps and deletions, with time-invariant parameters ν and μ, provide us with an improved possibility of solving the fractionation model completely. We do announce such a solution, though it has much room for improvement. Though the exponential distribution of visible segment lengths should be easy to establish analytically, it is also possible that the gamma distribution of invisible segment lengths could be proved, including the α and β parameters as a function of the number of sweeps. Depending on the functional form of such a solution, the inference of t, μ and ν might be amenable through closed form formulae rather than the quadratic modeling. Nevertheless, we have succeeded for the first time in inferring the parameters of a fractionation model, albeit a "one-sided" model and a continuous analog of more realistic discrete fractionation models.

Conclusions
Aside from theoretical improvements, the first priority for this work should be the return to a discrete gene-order model of fractionation with the insights gained in the current report. This should be extended to, or at least tested on simulations of, two-sided fractionation models with subgenome dominance (higher deletion rates on one copy of the genome than the other).