Volume 14 Supplement 15

## Proceedings of the Eleventh Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics

# The dynamics of alternative pathways to compensatory substitution

- Chris A Nasrallah
^{1}Email author

**14(Suppl 15)**:S2

https://doi.org/10.1186/1471-2105-14-S15-S2

© Nasrallah; licensee BioMed Central Ltd. 2013

**Published: **15 October 2013

## Abstract

The role of epistatic interactions among loci is a central question in evolutionary biology and is increasingly relevant in the genomic age. While the population genetics of compensatory substitution have received considerable attention, most studies have focused on the case when natural selection is very strong against deleterious intermediates. In the biologically-plausible scenario of weak to moderate selection there exist two alternate pathways for compensatory substitution. In one pathway, a deleterious mutation becomes fixed prior to occurrence of the compensatory mutation. In the other, the two loci are simultaneously polymorphic. The rates of compensatory substitution along these two pathways and their relative probabilities are functions of the population size, selection strength, mutation rate, and recombination rate. In this paper these rates and path probabilities are derived analytically and verified using population genetic simulations. The expected time durations of these two paths are similar when selection is moderate, but not when selection is weak. The effect of recombination on the dynamics of the substitution process are explored using simulation. Using the derived rates, a phylogenetic substitution model of the compensatory evolution process is presented that could be used for inference of population genetic parameters from interspecific data.

## Background

Complex interactions among loci play an important role in many evolutionary processes. They have arisen in the context of adaptation as Sewall Wright's shifting balance theory [1, 2] and in speciation as Dobzhansky-Muller incompatibilities [3, 4]. Such interactions have been shown to play a role in the intramolecular evolution within proteins [5, 6] and RNAs [7, 8], as well as the intermolecular co-evolution between different proteins [9] and the regulation of gene expression [10]. An understanding of the the dynamics of such dependent evolutionary processes is therefore applicable for a wide range of fundamental questions in evolutionary biology.

Often these problems are cast in the framework of a fitness landscape, with peaks corresponding to genotypes of high fitness and valleys to genotypes of low fitness [11, 12]. When the landscape is very flat and all genotypes of similar fitness, different loci may appear to evolve independently. But when the landscape is more rugged and certain combinations of alleles are deleterious, the method and rate at which an entire population can cross from one peak to another become the central issues.

If a haploid genome possesses a combination of alleles at different loci that are associated with high fitness, a mutation at any locus is likely to result in a suboptimal combination of alleles and consequently a haplotype of low fitness. However, a second mutation at another locus may restore the combination of alleles to one of high fitness. An entire population moving from one fitness peak to another in this manner is called compensatory substitution.

Theoretical population geneticists have used simple but powerful models, typically consisting of two loci with two alleles at each locus, to discern the rate at which such shifts can occur [13–21]. The expected time for these shifts is a function of the population size, the mutation rate, the strength of selection against intermediates, and the probability of recombination between loci [13]. With only four allelic combinations, the frequency of these haplotypes in a population becomes a difficult three-dimensional diffusion problem [17, 18]. The approach has uniformly been to simplify the dimensionality of the problem in order to obtain analytical results, with simulations frequently used to verify results and explore problems for which analytic solutions could not be found [13–21].

Kimura demonstrated that population genetic theory could allow for the occurrence of compensatory substitutions, using the simplest possible model of compensatory evolution with two linked loci and two alleles at each locus [13]. Subsequent efforts have extended Kimura's results to models of greater complexity by allowing for: different fitnesses for the two "peak" haplotypes [14, 15, 22], different fitnesses for the two deleterious intermediate haplotypes [14, 15, 17, 19], different mutation rates [14, 17, 19], reversible mutations [18, 19, 21], diploid populations [20, 21], and expanding to a two locus four allele model [18]. These mutation/selection models are depicted in additional file 1.

While most of this theoretical work assumed that selection (and often mutation) was very strong (but see [19]), there are many cases in which selection against intermediates is not expected to be strong. A high frequency of deleterious intermediate states was observed among homologous stem pairs of RNA between species [23], taken as evidence that selection on RNA intermediates could not be as strong as previously supposed. The role effective population size has played in the evolution of tRNAs was examined [24] and it was concluded that natural selection on paired sites has been weak to moderate. It has also been observed in RNA intramolecular interactions that certain non-canonical intermediate configurations (GU or UG) are more stable than others, and that compensatory substitutions tend to favor paths through less-deleterious intermediates [12]. These studies underscore the need to understand the dynamics of the compensatory substitution process when selection is moderate to weak.

The dynamics of the compensatory substitution process for both Type 1 and Type 2 events is the focus of this work. The rates of these alternate events are derived, representing the first derivation of the direct double-substitution rate under weak selection and reversible mutation. The interactions among selection strength, population size, mutation, recombination, and their effect on the probability of the alternative compensatory substitution pathways are explored using simulation. This work also shows that the expected time to compensatory substitution may depend upon the fixation pathway. These results yield insights into how fitness valleys can be crossed when selection is more moderate.

## Methods

### Theory

#### Substitution Model

*N*diploid (or 2

*N*haploid) individuals and a two-locus, two-allele mutation model as shown in Figure 3A. The first locus consists of alleles

*A*and

*a*, the second of alleles

*B*and

*b*. Let

*µ*be the bidirectional mutation rate between alleles

*A*and

*a*, and let this same rate correspond to mutations between

*B*and

*b*. Under a haploid (genic) selection mechanism, the fitnesses of the four possible haplotypes

*AB, aB, Ab*, and

*ab*are 1, 1 -

*s*, 1 -

*s*, and 1 respectively. This situation corresponds to the case of symmetric neutral compensatory evolution [13] but with reversible mutation [18]. For this derivation assume that the loci are tightly linked, such that recombination between loci does not occur.

Define a four-state continuous-time Markov chain of the substitution process, as shown in Figure 3B, in which each state represents the novel fixation of the population for each of these four haplotypes. The notion of novel fixation events is an important one: a population fixed for *AB*, becoming polymorphic then losing the polymorphism to become again fixed for *AB* does not represent a novel fixation event and no change of state will have been observed. In this model the population may move between two fixed states if (1) the haplotypes differ at only one of the two loci, or (2) both haplotypes are of high fitness.

Direct movement between states *Ab* and *aB* is disallowed, as such a move is highly unlikely unless intermediates are only very slightly deleterious. This is because if both of these states are of low fitness, the process must actually avoid a fitness peak while making two substitutions. If only single mutation events are allowed, then one of the fitter haplotypes must be created in the process, and is likely to fix via selection. If double mutations are allowed, these will occur at rate *µ*^{2} and must then fix by drift, and it is far more likely that a single mutant of higher fitness will arise before this happens.

#### Rates into deleterious states

*AB*haplotype. Following the formulation of [25], the rate at which new

*aB*haplotypes arise that are destined to go to fixation in this population is the product of the rate at which new

*aB*mutants arise in the population and the probability that such a new mutation becomes fixed in the population. The rate at which

*aB*haplotypes arise in a population of 2

*N AB*haplotypes, or equivalently the probability that a new

*aB*mutant will arise in such a population in a single generation, is 2

*Nµ*, where

*µ*is the per-locus mutation rate. Recalling that the

*aB*haplotype has fitness 1 -

*s*, the probability of fixation of such a new mutation is given by $\frac{1-{e}^{2s}}{1-{e}^{4Ns}}$[26]. Consequently the rate from

*AB*to

*aB*is

If the assumption is made that mutation rates are uniform across loci and between all alleles, then Equation 1 will hold for leaving any state with fitness 1 {*AB, ab*} and entering any state with fitness 1 - *s* {*aB, Ab*}.

The probability of fixation of a deleterious haplotype does not depend upon whether or not another fit haplotype arises in the process. Consider a population of *AB* individuals in which an *aB* haplotype has arisen. Before it becomes fixed, suppose a mutation occurs creating a new *ab* haplotype from one of the copies of *aB*. Because the *AB* and *ab* haplotypes are of equal fitness, the emergence of the new haplotype affects the probability of fixation of the intermediate exactly as would a reverse mutation. Alternatively, the introduction of a new *Ab* haplotype via mutation from *AB* will serve to increase slightly the probability of fixation of the *aB* haplotype by lowering the overall population fitness, but this effect is expected to be slight unless the mutation rate is very high.

#### Rates out of deleterious states

*aB*haplotype. If the population fitness is renormalized to be 1, then a new

*ab*haplotype arising in such a population will have fitness $1+t=\frac{1}{1-s}$, and will have selection coefficient $t=\frac{s}{1-s}$ where

*t*> 0 implies selection in favor of this haplotype. In this case the probability of fixation of such a new haplotype given that it has arisen is $\frac{1-{e}^{-2t}}{1-{e}^{-4Nt}}$, and such haplotypes arise by mutation at rate 2

*Nµ*per generation, yielding an overall rate of

As before, under symmetric mutation this rate *r*_{2} will hold for leaving states with fitness 1 - *s* {*aB, Ab*} and entering any state with fitness 1 {*AB, ab*}.

#### Rates of double substitution

For Equations 1 and 2 the assumption was made that new haplotypes arise in populations that are fixed for a particular state. It is possible however that a new haplotype could arise while the population is still polymorphic, and the identities and frequencies of the existing haplotypes could vary greatly. This will be considered when deriving the rate of entering the *ab* state.

*x*be the combined frequency of the two deleterious haplotypes in the population (each with fitness 1 -

*s*), and

*y*= 1 -

*x*be the combined frequency of the remaining haplotypes (each with fitness 1). In this case, the problem reduces to a one-dimensional diffusion process on

*x*. Sewall Wright [1] derived the equilibrium distribution of a mutant allele class resulting from the actions of reversible mutation and selection in a finite population as

where *θ* = 4*Nµ* and *C* is a normalizing constant such that ${\int}_{0}^{1}\varphi \left(x\right)dx=1$. Equation 3 differs slightly from Wright's original because there are two loci and mutations may occur at either locus. This provides the distribution of the frequency of deleterious intermediate haplotypes {*Ab, aB*}.

Consider a population lacking the *ab* haplotype. If the mutation rate is sufficiently low that the possibility of a double mutant *ab* arising from an *AB* haplotype in a single generation can be ignored, then *ab* haplotypes can only arise from *Ab* or *aB* haplotypes via single mutation. Therefore, conditional on the deleterious frequency *x*, the probability that a new *ab* haplotype enters the population this generation is 2*Nµx*.

*y*group can be approximated by

*y*group after a new

*ab*mutant has entered the population [26]. The probability of fixation of the new

*ab*haplotype is

*AB*and

*ab*have equal fitness. Taken together, and integrating over all possible values of the deleterious intermediate frequency, the average probability

*α*in each generation that a new

*ab*haplotype will arise that is destined to go to fixation is

Equations 4 - 6, from [19] but included here that this manuscript may be self-contained, give the rate *α* at which new successful *ab* haplotypes enter the population. By marginalizing over the stationary frequencies of deleterious intermediates at equilibrium, this key insight [19] implicitly considers both pathways by which the new *ab* haplotype can fix (with or without a deleterious intermediate ever fixing in the population). Previous work on rates of compensatory substitution only considered the case in which the deleterious intermediate could never fix because selection was too strong [13, 14, 18].

*into*state

*ab*

*π*

_{ k }is the stationary probability of being in state

*k*and

*r*

_{2}is given by Equation 2. So the total rate into state

*ab*is the sum of rates into

*ab*from each other state, weighted by the stationary probability of being in that particular state. This assumes that the detailed-balance conditions are met, meaning that the process is at stationarity. This is not such a strong assumption, as the process reaches something very close to stationarity quickly [19]. Solving for

*r*

_{3}yields

*α*also describes the total rate into state

*AB*, and therefore

The expected stationary frequency of the fixed states can be derived from Equation 3, which provides the distribution of the frequency of the pooled deleterious intermediate haplotypes (and conversely the pooled high-fitness haplotypes as well). Alternatively these frequencies can be estimated from simulations. Because the model is symmetric, *π*_{
AB
} *= π*_{
ab
} and *π*_{
Ab
} *= π*_{
aB
}.

Equations 8 and 9 represent the rates of direct double-substitution, without first fixing the deleterious intermediate. This is the first derivation of a direct double-substitution rate when selection is weak.

#### A discrete-time Markov model

Embedded within the continuous-time Markov model of Figure 3B describing the rates of moving between states is a discrete-time Markov chain representing the transition probabilities between states (Figure 3C). Consider, as before, only novel fixation events such that each state is the population becoming newly fixed for that state. The amount of time required for each fixation event to occur is not considered here, merely the state changes themselves. This model will assist in understanding some of the properties of the compensatory substitution process.

For simplicity, the deleterious haplotypes *aB* and *Ab*, both of relative fitness 1-*s*, are considered together as a single state representing the population being fixed for one of the deleterious haplotypes. Because both the *AB* and *ab* haplotypes have fitness 1 and because mutation is symmetric, the probabilities of moving from state *aB/Ab* to either *AB* or *ab* are both $\frac{1}{2}$. Since each simulation replicate ends when the population becomes fixed for the *ab* haplotype, *ab* is considered an absorbing state; a model of the true process differs only in that *ab* is not absorbing and would be completely symmetric.

In this Markov model there is some probability *β* of moving from state *AB* directly to state *ab* and some probability 1 - *β* of moving to one of the deleterious intermediate states. When *β* = 0 the compensatory substitution can only occur via the fixation of the intermediate, and when *β* = 1 the intermediate can never become fixed.

### Simulations

The forward evolution of a population of 2*N* haploid individuals is simulated using Wright-Fisher sampling. Each simulation replicate begins with the population fixed for the *AB* haplotype and ends when the *ab* haplotype becomes fixed in the population. The mutation-selection model used for simulation is the twolocus, two-allele model of [18], shown in Figure 3A. This is the simplest possible model of compensatory evolution that allows reversible mutation.

In each generation the number of mutations occurring in the population in a generation is drawn as a Poisson(4*Nµ*) random variable, where *µ* is the per generation per individual per locus mutation rate. If more than one mutation is to be introduced, the haplotypes to be mutated are selected uniformly without replacement, meaning that two mutations cannot occur to the same copy. Mutations to a haplotype occur at the first or second locus with equal probability. The number of recombination events in each generation is drawn as a Poisson(2*N ρ*) random variable, where *ρ* is the recombination rate per generation per individual. For each recombination event, two haplotypes are chosen without replacement with probability proportional to their frequency, the two selected haplotypes recombined, and the two recombinant haplotypes returned to the population. Following mutation and recombination, one generation begets the next by resampling with replacement with probabilities proportional to the haplotype fitnesses (see Supporting Information for full details of population resampling).

After each round of mutation-recombination-resampling the composition of the population is evaluated to determine the allele frequencies. Whenever a haplotype becomes fixed in the population that is different from the last fixed haplotype, the generation and haplotype are recorded. If the last fixed haplotype before the simulation is completed was *AB*, it is a Type 2 event (Figure 2B). If the last fixed haplotype before completion was a deleterious intermediate (*aB* or *Ab*), it is a Type 1 event (Figure 2A).

Type 2 events encompass two scenarios: either the deleterious intermediate haplotype is lost from the population before the *AB* haplotype, leaving only the two high-fitness haplotypes, or the deleterious intermediate persists until after the *AB* haplotype is lost. While the latter case may result in one of the two new alleles becoming fixed before the other, as long as the intermediate haplotype does not fix it remains a Type 2 event (Figure 2C).

Simulation Parameters

Parameter | Symbol | Value |
---|---|---|

Mutation Rate |
| (0.001, 0.01, 0.1, 1.0) |

Recombination Rate | 2 | (0, 5) |

Selection Coefficient |
| (0, 0.1, 0.2, ... , 3.0) |

Population Size | 2 | 200 |

## Results and Discussion

### Direct double substitution probability

The direct double substitution probability *β* (Eq. 10) can be easily estimated from simulations. When fixed for the *AB* haplotype, the population can subsequently become fixed for either the *ab* haplotype or for one of the deleterious intermediate haplotypes. *β* can be estimated as the proportion of times in which the population becomes fixed for the *ab* haplotype after having been fixed for the *AB* haplotype.

*β*under different simulation conditions are plotted as the points in Figure 4. For small mutation rates and low selection coefficients there is very little chance of observing a direct double substitution, while for high mutation rates and large selection coefficients there is little chance of observing anything but direct double substitutions.

*β*can be derived from the rates of the continuous-time Markov model of Figure 3B. Because the rates out of state

*AB*are given by Equations 1 and 8,

Analytical expectations for *β* from Equation 10 are shown in Figure 4 for comparison to the simulation results. The analytical result is a very good approximation to the observed direct double substitution probability across the range of selection coefficients, especially when the mutation rate is relatively low. The poor approximation of the analytical prediction when the mutation rate is very high is to be expected since the derivation of *α* assumes relatively low mutation rates.

### Alternate compensatory substitution pathway probabilities

The probability that compensatory substitutions occur without the deleterious intermediate first becoming fixed (Type 2) is a function of selection, mutation, and recombination, all scaled by population size. Importantly, it is not the same as the direct double-substitution probability *β*. This is because the process may visit other states before proceeding from *AB* to *ab* directly. Consequently, *β* serves as a lower bound on the probability that a compensatory substitution event is of Type 2.

*β*as well. Given that the population is in state

*AB*, the probability that it moves directly to

*ab*without returning to

*AB*is

*β*. The probability that it moves from

*AB*to

*ab*via the deleterious intermediate, without returning to

*AB*is (1 -

*β*)

*/*2. Thus the relative probability that it moves from

*AB*to

*ab*directly before moving to

*ab*via the deleterious intermediate is

The population could of course have returned to the initial state *AB* from the deleterious intermediate as well. But because the process is Markovian, the probability of observing a Type 2 event given that the current state is *AB* is still given by Equation 11.

*β*, the proportion of Type 2 compensatory substitution events increases with the strength of selection (Figure 5). In addition, Type 2 compensatory substitution events are very rare when the mutation rate is relatively low, such that the vast majority of compensatory substitution events involve the fixing of the deleterious intermediate haplotype.

Recombination is often thought of as having the positive effect of putting together beneficial mutations [27, 28]. Even negatively-epistatic alleles that are beneficial when paired will be aided by recombination [15]. But when the two mutants are individually deleterious and jointly neutral, recombination increases the time required for compensatory evolution [13]. Empirically, fewer compensatory substitutions are observed as physical distance (and hence recombination) between interacting loci increases [29, 30].

In this context, recombination reduces the proportion of Type 2 substitutions observed (Figure 5). Since the *a* and *b* alleles spend most of their time at low frequency due to selection against them, recombination serves to break apart *ab* haplotypes much more than it creates them. Consider a population with a single copy of each of the *a* and *b* alleles. A single *ab* haplotype will be broken up by recombination if it is one of the two selected, but an *ab* haplotype will only be created by recombination if *both* of the haplotypes bearing the deleterious alleles are selected. This tendency to separate rare alleles increases the chance that at least one of the alleles will be lost, decreasing the probability of Type 2 events. This effect of recombination is obviated when the mutation rate is very high, where Type 2 events dominate over a wide range of selective strength.

Taken together these paint a picture of the dynamics of the compensatory neutral substitution process. If mutation rates are relatively low, then times to compensatory substitution will be very large, particularly when selection is strong [17]. In this case the process will likely proceed through the deleterious intermediate fixed state, unless selection is strong (Figure 5). Recombination, if strong, will increase both the time to fixation [19] and the probability of passing through the deleterious intermediate state (Figure 5). Conversely, high mutation and strong selection will lead to faster Type 2 substitution events [13], and high mutation rate alone provides immunity from the effect of recombination (Figure 5). This suggests potentially very different compensatory substitution dynamics in different species.

### Reversions to the ancestral state

It was noted earlier that the population, after fixing a deleterious intermediate state, could revert to the original fixed state *AB* rather than proceeding to the alternative state *ab*. In fact, it can cycle back a number of times before eventually becoming fixed for *ab*, either via the intermediate or directly. Each time in state *AB* the probability of "failure" (leaving then returning to *AB*) is (1 - *β*)*/* 2, and the probability of "success" (reaching *ab* by any means) is (1 + *β*)*/* 2. The number of returns to *AB* before ultimately reaching *ab* is therefore a geometrically distributed random variable on the space {0,1,2,...} with mean (1 - *β*)*/*(1 + *β*). Because only the final path from *AB* to *ab* determines if an event is of Type 1 or Type 2, the expected number of reversions is the same for both pathways.

*AB*before fixing in state

*ab*via any pathway. The number of returns to the initial state is lower when the mutation is high. This is because high mutation increases the rate of appearance (and subsequent fixation) of

*ab*haplotypes (Figure 5); if intermediates do not become fixed, there can be no return to the initial condition. Recombination had little effect on the number of reversions, slightly increasing the number of reversions when selection is moderate to large and mutation is low (Additional file 2). Increasing selection against intermediates reduces the number of reversions by removing deleterious alleles from the population; by preventing deleterious mutants from fixing in the population it also prevents reversions (as well as Type 1 fixation events).

### Relative time of the two pathways

An important question is whether the expected amount of time to compensatory substitution differs for Type 1 and Type 2 events. It is not obvious that these two events should take the same amount of time, or if differences between them should remain constant across a wide range of selection, mutation, and recombination. Because the expected number of reversions to the ancestral state is the same regardless of the pathway by which fixation is ultimately achieved, the power to detect differences between Type 1 and Type 2 events is maximized by comparing the time required for the final path from *AB* to *ab*.

The number of generations from the last time the population enters state *AB* until the time it enters state *ab* was calculated for each simulation replicate. If there is genuinely no difference in the expected time for Type 1 and Type 2 fixation events, then the difference in the two sample means is expected to be zero. Under certain assumptions about the samples (normally distributed and equal variances), the difference in means, normalized by the pooled variance, will be a standard normal random variable. Large deviations in this test statistic imply significant differences in the expected times of the two pathways.

Recombination had little effect on either the differences observed under weak selection or the lack of differences observed at stronger selection (Figure 7). Because recombination most strongly affects the probability of observing Type 2 events when the mutation rate is low and selection is strong (Figure 5), if recombination is to have an effect on the difference in path times it should be most evident under these conditions. However, it is under precisely these conditions that the difference between path times is minimized (Figure 7).

### Extensions

#### Model asymmetry

The total rate of compensatory substitution *α* in the more general case in which mutation for the two loci and selection coefficients against the two deleterious haplotypes could differ was previously derived [19]. This is attractive because it would allow for more biologically realistic scenarios, such as the semi-stable U-G intermediates in RNA stem pairing [23]. However, in deriving the more general case they made the assumption that *Ns* was very large in order to rule out the possibility of the coexistence of both deleterious haplotypes. This assumption might be problematic for many systems in which either population sizes or selection coefficients are not expected to be very large. Nevertheless, it should be straightforward to incorporate into the work presented here.

The case in which the fitness of *ab* is far greater than that of any other state, such as when normal cells mutate to cancerous ones, was considered by [22]. Under these conditions they derived a rate of double substitution, giving it the colorful term "stochastic tunneling." However, unlike the approach shown here, these authors assumed that the probability of fixation of a new double mutant was independent of the population composition at the time it arises. This assumption simplifies matters greatly, but such an approximation may be limited to the case in which the *ab* state is of exceptionally high fitness.

#### Expanding the state space

The two-locus, two-allele model considered here will be insufficient for some important biological cases, such as the evolution of RNA stem positions. In this case, either the state space of the model can be expanded to a two-locus, four-allele model, or the state space of RNA can be compressed somehow to fit the current model. The latter approach is simpler, but risks losing valuable information about the evolutionary process by grouping states together. The former approach for the case of strong selection and large populations was explored by [18].

#### A Phylogenetic substitution model

**Q**= {

*q*

_{ ij }} of the continuous-time Markov chain can be defined as

where $\mathbb{X}=\left\{aB,\phantom{\rule{0.5em}{0ex}}Ab\right\},\mathbb{Y}=\left\{AB,\phantom{\rule{0.5em}{0ex}}ab\right\}$, and *u* is a rate scaling factor to ensure that branch lengths are in terms of expected number of substitutions per site.

Phylogenetic models of the compensatory substitution process have been developed previously [31, 32], but these models have double and single substitution rates that are uncoupled from each other and that are purely descriptive in nature. Equation 13 on the other hand represents a fully-defined phylogenetic substitution model for the analysis of compensatory substitution specified entirely in terms of the underlying population genetic parameters. This is appealing because it moves towards models that are *interpretive*, in which parameters have meaning with respect to the process of evolution, rather than models that are simply *descriptive* of patterns observed.

The model described by Equation 13 presents a potentially powerful way to investigate the evolution of compensatory evolution using interspecific data, allowing the estimation of population parameters of interest. Application of the basic model described here to interspecific data would be making key assumptions about the evolutionary process, namely that each branch of the phylogenetic tree would share a common population size, selection coefficient, and mutation rate. Relaxing this constraint to allow different branches to have their own parameters has been done in different contexts using the Dirichlet process [33, 34] and could readily be applied to this model as well.

#### Origination vs. fixation process

The total compensatory substitution rate *α* is actually the rate at which successful compensatory haplotypes *arise* in the population, the inverse of which being the time until such haplotypes arise [19]. This is slightly different from what was calculated in simulations, which was the time until the actual fixation event itself. The difference in these times is very small: A compensatory (second) mutation that is destined to go to fixation, from the time it arises in the population, takes on the order of 4*N* generations until it becomes fixed [19, 35], which is trivial compared to the time typically spent waiting for the eventually-successful mutation to arise. Consequently the simulations should be expected to yield ever-so-slightly longer fixation times than predicted by *α*.

These two processes are distinguished as origination and fixation processes, respectively [36]. Origination processes can often have more desirable statistical properties than fixation processes. For example, in the case of neutral linkage the origination process remains Poisson while the fixation process is something more complex and overdispersed [37]. While substitution models are rarely explicit about the process being modeled, the theoretical results and phylogenetic substitution model presented here pertain to the origination process rather than the fixation process.

## Conclusions

There are many cases in which loci interact epistatically and natural selection is weak to moderate, and this work derived rates for the compensatory neutral substitution process along the different possible paths to fixation under these conditions. These were shown to have good predictive power regarding the probability of observing compensatory substitution via alternate paths under a range of conditions. Because there is little difference in the time to fixation between the paths when selection is not negligible, the relative probability of these paths may be important for distinguishing the evolutionary history of compensatory substitutions. To this end, a phylogenetic substitution model parameterized in terms of these population genetic parameters was presented that could be used for inference of population histories using interspecific data.

## Declarations

### Acknowledgements

The author would like to thank Jeff Thorne, Josh Schraiber, Alex Griffing and two anonymous reviewers for helpful comments on this manuscript. This work was supported by National Institutes of Health grants GM-069801 (to John Huelsenbeck) and GM-070806 (to Jeff Thorne), and National Science Foundation postdoctoral fellowship DBI-1202884 (to CN). Publication charges for this article were paid for by National Science Foundation postdoctoral fellowship DBI-1202884 (to CN).

This article has been published as part of *BMC Bioinformatics* Volume 14 Supplement 15, 2013: Proceedings from the Eleventh Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S15.

## Authors’ Affiliations

## References

- Wright S: Evolution in Mendelian populations. Genetics. 1931, 16: 97-159.PubMed CentralPubMedGoogle Scholar
- Wright S: The roles of mutation, inbreeding, crossbreeding and selection in evolution. Proceedings of the Sixth International Congress on Genetics. 1932, 1: 356-366.Google Scholar
- Dobzhansky T: Studies on hybrid sterility. II. Localization of sterility factors in
*Drosophila*pseudoobscura hybrids. Genetics. 1936, 21: 113-135.PubMed CentralPubMedGoogle Scholar - Muller HJ: Reversibility in evolution considered from the standpoint of genetics. Bio Rev Camb Philos Soc. 1939, 14: 261-280. 10.1111/j.1469-185X.1939.tb00934.x.View ArticleGoogle Scholar
- Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL: Protein evolution with dependence among codons due to tertiary structure. Molecular Biology and Evolution. 2003, 20: 1692-1704. 10.1093/molbev/msg184.View ArticlePubMedGoogle Scholar
- Rodrigue N, Lartillot N, Bryant D, Philippe H: Site interdependence attributed to tertiary structure in amino acid sequence evolution. Gene. 2005, 347: 207-217. 10.1016/j.gene.2004.12.011.View ArticlePubMedGoogle Scholar
- Chen Y, Carlini DB, Baines JF, Parsch J, Braverman JM, Tanda S, Stephan W: RNA secondary structure and compensatory evolution. Genes Genet Syst. 1999, 74: 271-286. 10.1266/ggs.74.271.View ArticlePubMedGoogle Scholar
- Yu J, Thorne JL: Dependence among sites in RNA evolution. Molecular Biology and Evolution. 2006, 23: 1525-1537. 10.1093/molbev/msl015.View ArticlePubMedGoogle Scholar
- Clark NL, Gasper J, Sekino M, Springer SA, Aquadro CF, Swanson WJ: Coevolution of interacting fertilization proteins. PLoS Genetics. 2009, 5: e1000570-10.1371/journal.pgen.1000570.PubMed CentralView ArticlePubMedGoogle Scholar
- Goncalves A, Leigh-Brown S, Thybert D, Stefflova K, Turro E, Flicek P, Brazma A, Odom DT, Marioni JC: Extensive compensatory cis-trans regulation in the evolution of mouse gene expression. Genome Research. 2012, 22: 2376-2384. 10.1101/gr.142281.112.PubMed CentralView ArticlePubMedGoogle Scholar
- Huynen M, Hogeweg P: Pattern generation in molecular evolution: exploitation of the variation in RNA landscapes. Journal of Molecular Evolution. 1994, 39: 71-79.View ArticlePubMedGoogle Scholar
- Meer MV, Kondrashov AS, Artzy-Randrup Y, Kondrashov FA: Compensatory evolution in mitochondrial tRNAs navigates valleys of low fitness. Nature. 2010, 464: 279-282. 10.1038/nature08691.View ArticlePubMedGoogle Scholar
- Kimura M: The role of compensatory neutral mutations in molecular evolution. Journal of Genetics. 1985, 64: 7-19. 10.1007/BF02923549.View ArticleGoogle Scholar
- Iizuka M, Takefu M: Average time until fixation of mutants with compensatory fitness interaction. Genes Genet Syst. 1996, 71: 167-173. 10.1266/ggs.71.167.View ArticleGoogle Scholar
- Michalakis Y, Slatkin M: Interaction of selection and recombination in the fixation of negativeepistatic genes. Genet Res. 1996, 67: 257-269. 10.1017/S0016672300033747.View ArticlePubMedGoogle Scholar
- Phillips PC: Waiting for a compensatory mutation: phase zero of the shifting-balance process. Genetical Research. 1996, 67: 271-283. 10.1017/S0016672300033759.View ArticlePubMedGoogle Scholar
- Stephan W: The rate of compensatory evolution. Genetics. 1996, 144: 419-426.PubMed CentralPubMedGoogle Scholar
- Higgs PG: Compensatory neutral mutations and the evolution of RNA. Genetica. 1998, 91-101. 102/103Google Scholar
- Innan H, Stephan W: Selection intensity against deleterious mutations in RNA secondary structures and rate of compensatory nucleotide substitutions. Genetics. 2001, 159: 389-399.PubMed CentralPubMedGoogle Scholar
- Ichinose M, Iizuka M, Kado T, Takefu M: Compensatory evolution in diploid populations. Theoretical Population Biology. 2008, 74: 199-207. 10.1016/j.tpb.2008.07.002.View ArticlePubMedGoogle Scholar
- Ichinose M, Iizuka M, Kusumi J, Takefu M: Models of compensatory molecular evolution: effects of back mutation. Theoretical Population Biology. 2013, 323: 1-10.View ArticleGoogle Scholar
- Iwasa Y, Michor F, Nowak MA: Stochastic tunnels in evolutionary dynamics. Genetics. 2004, 166: 1571-1579. 10.1534/genetics.166.3.1571.PubMed CentralView ArticlePubMedGoogle Scholar
- Rousset F, Pelandakis M, Solignac M: Evolution of compensatory substitutions through G-U intermediate state in
*Drosophila*rRNA. PNAS. 1991, 88: 10032-10036. 10.1073/pnas.88.22.10032.PubMed CentralView ArticlePubMedGoogle Scholar - Piskol R, Stephan W: The role of the effective population size in compensatory evolution. Genome Biology and Evolution. 2011, 3: 528-538. 10.1093/gbe/evr057.PubMed CentralView ArticlePubMedGoogle Scholar
- Halpern AL, Bruno WJ: Evolutionary distances for protein-coding sequences: Modeling site-specific residue frequencies. Molecular Biology and Evolution. 1998, 15: 910-917. 10.1093/oxfordjournals.molbev.a025995.View ArticlePubMedGoogle Scholar
- Kimura M: On the probability of fixation of mutant genes in a population. Genetics. 1962, 47: 713-719.PubMed CentralPubMedGoogle Scholar
- Fisher RA: The Genetical Theory of Natural Selection. 1930, Oxford: Oxford University PressView ArticleGoogle Scholar
- Muller HJ: Some genetic aspects of sex. The American Naturalist. 1932, 66: 118-138. 10.1086/280418.View ArticleGoogle Scholar
- Stephan W, Kirby DA: RNA folding in
*Drosophila*shows a distance effect for compensatory fitness interactions. Genetics. 1993, 135: 97-103.PubMed CentralPubMedGoogle Scholar - Piskol R, Stephan W: Analyzing the evolution of RNA secondary structures in vertebrate introns using Kimura's model of compensatory fitness interactions. Molecular Biology and Evolution. 2008, 25: 2483-2492. 10.1093/molbev/msn195.View ArticlePubMedGoogle Scholar
- Tillier ERM: Maximum likelihood with multiparameter models of substitution. Journal of Molecular Evolution. 1994, 39: 409-417. 10.1007/BF00160273.View ArticleGoogle Scholar
- Tillier ERM, Collins RA: Neighbor joining and maximum likelihood with RNA sequences: addressing the interdependence of sites. Molecular Biology and Evolution. 1995, 12: 7-15. 10.1093/oxfordjournals.molbev.a040195.View ArticleGoogle Scholar
- Huelsenbeck JP, Jain S, Frost SWD, Pond SLK: A Dirichlet process model for detecting positive selection in protein-coding DNA sequences. Proceedings of the National Academy of Science, USA. 2006, 103: 6263-6268. 10.1073/pnas.0508279103.View ArticleGoogle Scholar
- Heath TA, Holder MT, Huelsenbeck JP: A Dirichlet process prior for estimating lineage-specific substitution rates. Molecular Biology and Evolution. 2012, 29: 939-955. 10.1093/molbev/msr255.PubMed CentralView ArticlePubMedGoogle Scholar
- Kimura M: The length of time required for a selectively neutral mutant to reach fixation through random frequency drift in a finite population. Genet Res. 1970, 15: 131-133. 10.1017/S0016672300001439.View ArticlePubMedGoogle Scholar
- Gillespie JH: The causes of molecular evolution. 1991, New York: Oxford University PressGoogle Scholar
- Watterson GA: Substitution times for mutant nucleotides. Journal of Applied Probability. 1982, 19: 59-70.View ArticleGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.