- Methodology article
- Open Access

# Accurate and fast methods to estimate the population mutation rate from error prone sequences

- Bjarne Knudsen
^{1}and - Michael M Miyamoto
^{2}Email author

**10**:247

https://doi.org/10.1186/1471-2105-10-247

© Knudsen and Miyamoto; licensee BioMed Central Ltd. 2009

**Received:**25 February 2009**Accepted:**11 August 2009**Published:**11 August 2009

## Abstract

### Background

The population mutation rate (θ) remains one of the most fundamental parameters in genetics, ecology, and evolutionary biology. However, its accurate estimation can be seriously compromised when working with error prone data such as expressed sequence tags, low coverage draft sequences, and other such unfinished products. This study is premised on the simple idea that a random sequence error due to a chance accident during data collection or recording will be distributed within a population dataset as a singleton (i.e., as a polymorphic site where one sampled sequence exhibits a unique base relative to the common nucleotide of the others). Thus, one can avoid these random errors by ignoring the singletons within a dataset.

### Results

This strategy is implemented under an infinite sites model that focuses on only the internal branches of the sample genealogy where a shared polymorphism can arise (i.e., a variable site where each alternative base is represented by at least two sequences). This approach is first used to derive independently the same new Watterson and Tajima estimators of θ, as recently reported by Achaz [1] for error prone sequences. It is then used to modify the recent, full, maximum-likelihood model of Knudsen and Miyamoto [2], which incorporates various factors for experimental error and design with those for coalescence and mutation. These new methods are all accurate and fast according to evolutionary simulations and analyses of a real complex population dataset for the California seahare.

### Conclusion

In light of these results, we recommend the use of these three new methods for the determination of θ from error prone sequences. In particular, we advocate the new maximum likelihood model as a starting point for the further development of more complex coalescent/mutation models that also account for experimental error and design.

## Keywords

- Ancestral Sequence
- Internal Branch
- Evolutionary Simulation
- Basal Branch
- External Branch

## Background

The population mutation rate (θ) remains one of the most fundamental parameters in genetics, ecology, and evolutionary biology [3–5]. This interest in θ derives from the fact that this parameter measures the effective size (*N*_{
e
}) and whole-locus mutation rate (μ) of a population, which are of great importance in understanding its demography and history. Specifically, θ is a compound parameter that is calculated as the product of 2*pN*_{
e
}μ (with *p* = 1 or 2 for haploids and diploids, respectively). Correspondingly, a number of alternative methods are available to estimate θ from a population sample of allelic sequences [6–8]. These alternative methods range from relatively simple summary statistics (moment methods) to full coalescent/mutation models. Indeed, the estimation of θ remains central to even the most complex coalescent/mutation models that are otherwise concerned with the determination of other population genetic parameters (e.g., for growth, migration, and recombination).

A population sample of sequences is obtained from interbreeding or potentially interbreeding individuals and is therefore usually associated with a small number of mutations [9–12]. Thus, when estimating θ from a population sample, sequence errors can pose a real problem, since their numbers can begin to approach or even surpass those for the mutations [13–19]. This problem becomes particularly acute when working with error prone data such as expressed sequence tags (EST), low coverage draft sequences, and other such unfinished products [20, 21]. For example, an error rate of one mistake per every 500 nucleotides (e.g., as for an EST dataset obtained from single sequencing passes) will make a significant contribution to the observed variation among sequences that differ because of mutations by <1 to 2%. If uncorrected, such errors can lead to an inflated estimate of θ and even erroneous conclusions about the biology of their population [1, 2, 18, 22].

This study relies on the simple premise that the random mistakes of error prone sequences can be avoided by ignoring the singletons within their population sample. This strategy is first used to obtain independently the same new Watterson [25] and Tajima [26] estimators of θ, which were recently reported by Achaz [1] for error prone sequences. This approach is then implemented in the recent maximum likelihood (ML) model of Knudsen and Miyamoto [2], which incorporates various factors of experimental error and design with those for coalescence and mutation. By relying on only shared polymorphisms, these three new approaches allow for more accurate estimates of θ. However, this greater accuracy comes with a cost as singletons due to actual mutations are ignored along with those due to random errors. To assess this tradeoff, the three new methods are tested against each other and their original predecessors that count singletons with evolutionary simulations and/or analyses of a real population dataset for the California seahare (*Aplysia californica*). These tests document that these new approaches offer reliable and fast alternatives for the determination of θ from error prone sequences.

## Results and discussion

### Three new methods for estimating θ from error prone sequences

#### Infinite sites model

As in their original versions, the new Watterson, Tajima, and Knudsen/Miyamoto methods rely on the infinite sites model to accommodate a neutral mutation process [27, 28]. The infinite sites model assumes that only a single mutation can occur at any homologous position of the population sample. Thus, each variable site will be represented by only two bases that subdivide the sampled sequences into two non-overlapping subsets consisting of those with the first nucleotide versus the remainder with the second base. Correspondingly, each mutation will map to a specific branch within the sample genealogy, which thereby partitions the sequences into their two non-overlapping subsets. For example, the shared polymorphism at position 3 in Figure 1 is attributable to a unique mutation along the internal branch that partitions sequences I and II from III, IV, and V.

As random errors are treated as rare chance events, they are also modeled in this study along with the mutations by the infinite sites process. Thus, only a single random error or mutation is allowed at any site of the sampled sequences. In turn, each random error is limited to a single sequence (and therefore to a particular singleton) in contrast to a mutation that can also result in a shared polymorphism (Figure 1). The reason is that random errors arise during the experimental determination and recording of individual sequences, whereas mutations occur at specific points within the sample genealogy. Thus, a mutation along an internal branch of the genealogy will result in a new base that will be shared by two or more of its descendant sequences.

#### New Watterson estimator (θ'_{
W
})

*T*

_{ i }as the length of time (as scaled by

*N*

_{ e }generations) during which there are exactly

*i*ancestors for

*n*sampled sequences. Standard coalescent theory tells us that:

*n*sequences (as measured in units of scaled coalescent time) can now be calculated as:

*n*

_{ s }be the observed number of segregating (polymorphic) sites in the dataset. Under the infinite sites model,

*n*

_{ s }also counts the number of mutations, since each observed variable site is attributable to a single mutation. The expected number of mutations per locus per unit of branch length is θ/2. Thus, an estimate of θ can be obtained as:

[25].

A new Watterson estimator that avoids singletons (and thereby random sequence errors, θ'_{
W
}) can now be derived from Equation (4) by adjusting both its numerator and denominator. The numerator is easily adjusted by counting only the shared polymorphic sites in the original dataset (*n'*_{
s
}). Although more complicated, the denominator can also be readily adjusted by including in its calculation only the lengths of the internal branches where shared polymorphisms can arise (see below).

*i*ancestors for the

*n*sampled sequences. Looking forward in time, the probability that a particular branch of the genealogy is not chosen for the next split (leading to

*i*+ 1 lineages) is [(

*i*- 1)/

*i*]. Thus, the probability that this branch remains unbroken to the present is:

Our Equation (6) is equivalent to equation (10) of Fu and Li [31], apart from our use of different symbols and terms and of time as scaled by *N*_{
e
}generations (rather than generations alone). Thus, as previously noted by them, the total length of the external branches where a singleton can occur is independent of the original number of sampled sequences. Fu and Li [31] also obtained the variance for the total length of the external branches as their equation (14).

*n*- 1) sampled sequences, a mutation in the internal branch leading to the common ancestor of the (

*n*- 1) group will result in a singleton within the dataset (i.e., a variable site where the single nonmember sequence exhibits the unique base) (Figure 2). Thus, the length of this (

*n*- 1) basal branch (as weighted by its probability of occurrence) must also be accounted for in the adjustment of the denominator for θ'

_{ W }. The weighted length of this (

*n*- 1) basal branch can be calculated as follows. First, the chance of this branch is determined as the probability of an (

*n*- 1) asymmetrical topology or equivalently as the probability that either one of the two basal branches for the genealogy remains unbroken to the present. According to Equation (5), the latter probability is 2 [1/(

*n*- 1)]. Second, the unweighted length of this branch is then calculated as the length of the time interval

*T*

_{2}, which is 1 [Equation (2)]. Thus, the weighted length of the (

*n*- 1) basal branch is:

In words, θ'_{
W
}ignores the singletons (and thereby random errors) in the original dataset, while restricting the total branch length calculation to only those internal internodes where a shared polymorphism can arise.

Apart from our use of different symbols and terms for calculating the denominator, our Equation (8) is equivalent to equation (13) of Achaz [1]. Achaz [1] also derived his equation B22 for the calculation of the associated variance for *n'*_{
s
}(i.e., his Var [*S*_{-η1}]).

#### New Tajima estimator (θ'_{
T
})

_{ ij }denote the number of observed pairwise differences between sampled sequences

*i*and

*j*(for

*i*≠

*j*). Standard coalescent theory tells us that the expected waiting time for these two sequences to coalesce is ~

*N*

_{ e }generations. Thus, the expected value of each π

_{ ij }is θ, whereas that for their sum is:

[26].

The Tajima and Watterson estimators have the same expected value of θ, even though the former is based on the total differences between each sampled sequence pair whereas the latter is obtained from the total number of segregating sites within the sample. Their different summaries of the observed variation are the basis of Tajima's *D* that tests for departures from the standard neutral model [32] (see also [31, 33, 34]).

_{ W }. First, let π'

_{ ij }count the number of observed pairwise differences between sequences

*i*and

*j*after the removal of all singletons from their dataset. Then, the expected number of singletons can be calculated from Equations (6) and (7) as the total length of the external branches and (

*n*- 1) basal branch (as weighted by its probability of occurrence) multiplied by θ/2 {i.e., [1 + (1/

*n*- 1)] θ}. The removal of each singleton from the original dataset reduces the expected sum in Equation (9) by (

*n*- 1). Combining these results, we obtain:

Our Equation (12) for θ'_{
T
}is equivalent to equation (15) of Achaz [1], except for our use of different symbols and total π'_{
ij
}(rather than average π'_{
ij
}) for the population sample. Achaz [1] also derived his equation B23 for the calculation of the associated variance for average π'_{
ij
}[i.e., his Var[π_{-η1}]).

#### New Knudsen/Miyamoto model (θ'_{
KM
})

Knudsen and Miyamoto ([2], hereafter referred to as "KM") developed a full coalescent/mutation model that accounts for three specific factors of experimental error and design: (a) For random sequence errors, (b) For unobserved polymorphisms due to missing data, and (c) For the uncertain assignment of the multiple sequencing reads for a diploid or polyploid individual to its two or more homologues. Their KM model uses recursion to calculate an exact probability of the population sample under the standard Fisher [35] and Wright [36] model for reproduction (hereafter, referred to as "FW") and the infinite sites process for both mutations and random sequence errors [37, 38]. Their model relies on ML to estimate θ and the expected number of errors per full length sequence (ε).

*s*

_{ i }~

*s*

_{ j }).

Definitions of the parameters and factors used in Equation (13) of the KM model

Parameter or factor | Description |
---|---|

| The number of segregating sites within the current set of sampled and/or ancestral sequences |

σ | Counts the number of "singletons" for sampled or ancestral sequence |

| Probability of α |

| Probability of β |

| Probability of |

| Sampled and/or ancestral sequences |

| Signifies that the available regions of sampled and/or ancestral sequences |

| | Measures the relative degree to which sampled or ancestral sequence |

| | Summarizes the current number of sampled and/or ancestral sequences during a particular coalescent interval in the genealogy |

_{ i }> 0). Here, the definition of a "singleton" is expanded to include the derived mutations of the common ancestors for the different groups of related sampled sequences as well as those for the singletons (in the strict sense) of the original dataset (Table 1). This expanded use of the term acknowledges that a shared polymorphism under the infinite sites model is due to a unique mutation within the common ancestor of those sampled sequences sharing the derived base (Figure 3). Thus, even though they result in shared polymorphisms among the sampled sequences, these derived mutations are ultimately counted as "singletons" as one works backwards in time. This expanded definition of a "singleton" allows for the economical use of σ

_{ i }alone to track the mutations of both the original shared polymorphisms and singletons.

*s*

_{ i }| factor then quantifies the total amount of sequence available for this sampled allele. For example, the (0.2:1.0) score for the leftmost (first) sampled sequence in Figure 4a indicates that it is lacking the initial 20% of the full multiple alignment. Thus, |

*s*

_{ i }| = 0.8 for this partial, leftmost, sampled sequence. In turn, as one works backwards in time, the closed intervals for the available regions of the sequences for common ancestors are calculated as the union of the known lengths for their two immediate descendants. Thus, the closed interval of the common ancestor for the two leftmost sampled sequences in Figure 4a is (0.2:1.0) ∪ (0.0:1.0) = (0.0:1.0), with |

*s*

_{ i }| = 1.0.

The purpose of Σ_{
i
}|*s*_{
i
}| in Equation (13) is to track the total available length of all sampled and/or ancestral sequences for the detection of mutations as one works backwards in time. The corresponding use of |*s*_{
i
}| in the bottom line of Equation (13) allows for the adjustment of σ_{
i
}due to unobserved polymorphisms resulting from missing data.

The KM model uses only Equation (13) when working with error free sequences. In turn, this model also relies on equation (4) of Knudsen and Miyamoto [2] when dealing with error prone sequences. This additional equation of the KM model assumes that the random errors are uniformly distributed along the sampled sequences and that their total number is Poisson distributed with an intensity of λ = ε Σ_{
i
}|*s*_{
i
}|. The inclusion of Σ_{
i
}|*s*_{
i
}| in the calculation of λ allows for the adjustment of this error rate to account for incomplete sampled sequences.

In contrast to its predecessor, the KM' model uses only Equation (13) when working with either error prone or error free sequences. As for the new Watterson and Tajima estimators, the KM' model operates by counting only those internal branches of the genealogy where a mutation will result in a shared polymorphism (Figure 4b). The sampled sequences are rescored as unknowns with empty intervals (∅), as is the sequence of the common ancestor for the (*n* - 1) basal group of each asymmetrical genealogy (Figure 2). Correspondingly, |*s*_{
i
}| is then reset to 0.0 for these sequences. The KM' model ignores the external and basal branches of the genealogy, where a mutation will result in a singleton, by multiplying σ_{
i
}for each sampled and/or (*n* - 1) ancestral sequence by |*s*_{
i
}| = 0.0 in Equation (13). In this way, the KM' model discounts the singletons in favor of the shared polymorphisms within the dataset without the use of equation (4) and its associated parameters (e.g., λ and ε) of the original KM model.

For sampled sequences without missing data, the previous explanation is complete as to how the KM' model corrects for the branches of the genealogy where a mutation will result in a singleton. However, for incomplete sampled sequences, the "above and below" test of the KM' model also becomes necessary to correct for the regions of each common ancestral sequence where a mutation in its internal branch will lead to a singleton, rather than to a shared polymorphism, because of this missing information (Figure 4b). In this test, "below" refers to those sampled sequences that belong to the monophyletic group of the common ancestor in question. "Above" then corresponds to the remaining, more distantly related, sampled sequences. In Figure 4b, the two leftmost sampled sequences are direct descendants of the first, leftmost, common ancestor, whereas the other three are not. Thus, these two sampled sequences lie "below," whereas the other three occur "above" the internal branch for this leftmost common ancestor.

The above and below test checks whether comparable information for a region is known for at least two, descendant, sampled sequences below and at least two, more distantly related, sampled sequences above an internal branch in the genealogy. If not, then a mutation within this region of the ancestral sequence that corresponds to this internal branch will result in a false singleton within the original dataset. For example, a mutation within the first 20% of the leftmost common ancestor in Figure 4 will result in a false singleton of the second sampled sequence rather than in a true shared polymorphism of the first and second. The problem is that the leftmost sampled sequence is missing comparable information for the detection of this mutation as a shared change in the leftmost common ancestor. The above and below test corrects for such regions of the common ancestral sequences during the rescoring of their closed intervals and |*s*_{
i
}|. As the first 20% of the leftmost common ancestor fails the below half of this test, the KM' model disregards this region during the recalculation of its closed interval as (0:2:1.0) and |*s*_{
i
}| = 0.8 (Figure 4b).

### Evolutionary simulations and A. californica dataset

#### Evolutionary simulations

To test the three new procedures, evolutionary simulations were conducted according to standard methods [4]. Two hundred datasets apiece were simulated for eight or 16 sequences of length 500 from a single population under the baseline conditions of the standard FW and infinite sites models with θ = 1, 2, 4, or 8. In addition to these baseline conditions, sequence errors were introduced as four or eight randomly placed changes among the eight and 16 sequences, respectively, for an expected ε of 0.5. Estimates of θ were then obtained for the 200 datasets of each tested combination with the FW model and the original and new Watterson estimators, Tajima estimators, and KM and KM' models.

*n*

_{ s }) in Equation (4) of θ

_{ W }, but a smaller one to the sum of the pairwise differences in Equation (10) of θ

_{ T }. Specifically, each singleton is limited in Equation (10) to the (

*n*- 1) pairwise comparisons of the unique sequence to the (

*n*- 1) other sequences. Thus, the Tajima estimator is less vulnerable than the Watterson estimator to random errors even though its vulnerability is still significant.

Results of the evolutionary simulations

Evolutionary simulations | Coalescent/mutation models | Watterson and Tajima estimators | |||||||
---|---|---|---|---|---|---|---|---|---|

θ |
| ε | θ | θ | θ' | θ | θ' | θ | θ' |

A1: 1 | 8 | 0.0 | 0.96 ± 0.10 |
| 1.10 ± 0.15 | 0.98 ± 0.10 | 1.12 ± 0.16 | 1.02 ± 0.12 | 1.10 ± 0.15 |

A2: 1 | 8 | 0.5 |
| 0.90 ± 0.14 | 0.96 ± 0.15 |
| 0.94 ± 0.15 |
| 0.95 ± 0.16 |

A3: 1 | 16 | 0.0 | 1.02 ± 0.10 |
| 1.01 ± 0.12 | 1.00 ± 0.09 | 0.99 ± 0.11 | 0.97 ± 0.10 | 0.96 ± 0.12 |

A4: 1 | 16 | 0.5 |
| 0.99 ± 0.12 | 1.02 ± 0.13 |
| 1.03 ± 0.13 |
| 1.01 ± 0.13 |

B1: 2 | 8 | 0.0 | 1.92 ± 0.17 |
| 1.80 ± 0.25 | 1.92 ± 0.18 | 1.81 ± 0.25 | 1.89 ± 0.20 | 1.82 ± 0.26 |

B2: 2 | 8 | 0.5 |
| 1.80 ± 0.20 | 1.92 ± 0.26 |
| 1.91 ± 0.26 |
| 1.92 ± 0.25 |

B3: 2 | 16 | 0.0 | 2.06 ± 0.16 |
| 2.02 ± 0.19 | 2.04 ± 0.16 | 2.04 ± 0.20 | 2.09 ± 0.19 | 2.10 ± 0.22 |

B4: 2 | 16 | 0.5 |
| 1.88 ± 0.17 | 1.92 ± 0.19 |
| 1.92 ± 0.19 |
| 1.96 ± 0.22 |

C1: 4 | 8 | 0.0 | - | - | 4.05 ± 0.44 | 4.27 ± 0.35 | 4.27 ± 0.53 | 4.31 ± 0.39 | 4.32 ± 0.54 |

C2: 4 | 8 | 0.5 | - | - | 3.89 ± 0.46 |
| 4.09 ± 0.53 |
| 4.13 ± 0.54 |

C3: 4 | 16 | 0.0 | - | - | 4.00 ± 0.31 | 4.12 ± 0.28 | 4.21 ± 0.38 | 4.24 ± 0.36 | 4.28 ± 0.41 |

C4: 4 | 16 | 0.5 | - | - | 3.82 ± 0.31 |
| 3.87 ± 0.35 |
| 3.86 ± 0.38 |

D1: 8 | 8 | 0.0 | - | - | 7.88 ± 0.71 | 7.75 ± 0.49 | 7.87 ± 0.78 | 7.78 ± 0.55 | 7.85 ± 0.77 |

D2: 8 | 8 | 0.5 | - | - | 7.90 ± 0.85 |
| 8.38 ± 1.03 |
| 8.39 ± 1.03 |

D3: 8 | 16 | 0.0 | - | - | 7.82 ± 0.54 | 8.08 ± 0.49 | 8.20 ± 0.72 | 8.26 ± 0.69 | 8.32 ± 0.82 |

D4: 8 | 16 | 0.5 | - | - | 8.00 ± 0.58 |
| 8.31 ± 0.71 |
| 8.60 ± 0.81 |

In contrast, the KM model underestimates θ when the sequences are errorless (Table 2). Furthermore, this tendency to underestimate θ is also evident (although not significant) when the sequences contain errors (simulations A2, A4, B2, and B4; see also [2] for additional cases). In these situations, mutations are sometimes counted as random errors, thereby deflating their estimates of θ. In cases where few to no random errors are expected (i.e., finished sequences), one can first perform a likelihood ratio test to determine if ε = 0 [39]. If this null hypothesis cannot be rejected, then the KM analysis should be restricted to only Equation (13). However, if ε > 0 according to this likelihood ratio test, then equation (4) of Knudsen and Miyamoto [2] is also needed and the user of the KM model must remain aware that her/his estimate of θ most likely includes some slight downward bias.

Unlike their original versions, the new Watterson estimator, Tajima estimator, and KM' model all consistently recover the true θ in the simulations both with and without errors (Table 2). Furthermore, these three methods also avoid the tendency of the KM model to underestimate θ due to unnecessary or "greedy" parameters for random errors. In particular, the reliance of the KM' model on one less equation [(4)] and fewer parameters (e.g., λ and ε) makes it much simpler and less prone to over-parameterizations than its predecessor.

The mean estimates of θ for the new Watterson estimator, Tajima estimator, and KM' model are also generally associated with greater standard deviations than those for their original versions (Table 2). The proximal reason for these increased standard deviations is that the elimination of singletons results in the loss of some actual mutations along with the random errors. However, this cost of the three new methods appears to be relatively small, given that their standard deviations are never twice as great as those for their original versions (with these discrepancies usually being much smaller). Indeed, the standard deviations for the three new methods are less than or equal to those of their counterparts in four cases (simulations A2, A4, and B4 for θ'_{
W
}and A4 for θ'_{
T
}in Table 2). Perhaps more surprising is that the standard deviations for the new Watterson and Tajima estimators are comparable to those for the KM' model, particularly when θ = 1 or 2. These similarities in their standard deviations are surprising given that the KM' model is a full ML model.

As summary statistics, the original and new Watterson and Tajima estimators are all computationally fast, requiring much less than 1 CPU second on a 2.4 GHz Pentium 4 CPU to analyze each of the simulated datasets. More importantly, the KM' model is much faster than the FW and KM models as documented by its completed analyses of the more complex datasets (simulations C1 to C4 and D1 to D4 in Table 2). In contrast, the FW and KM models fail to complete their analyses of these more complex datasets due to time and memory constraints. The KM' model is much faster than the FW and KM models, because it relies on only shared polymorphic sites. Less variable positions results in faster coalescences and fewer choices as one works back through the coalescent/mutation recursion of Equation (13).

#### Aplysia dataset

The *A. californica* dataset was the original motivating force behind the development of the KM model [2]. Thus, this real dataset was also analyzed with the KM' model to test further its performance against that of its predecessor. This dataset consists of 18 sequencing reads for six diploid individuals from a laboratory population of the California seahare at the Laboratory for Marine Bioscience, University of Florida (LL Moroz and AB Kohn, unpublished data). Three cloned inserts were sequenced from each individual as a pair of single sequencing passes starting from both ends of an internal segment of 1731 base pairs for the protein-coding region of the nuclear *FMRF* gene. These pairs of passes overlap in the middle for nine sequences, but at most by only 58 bases. Thus, these 18 essentially single-pass sequences contain many random errors and some missing data and their assignments to the two homologues of each diploid remain uncertain (even though their individual sources are known). Correspondingly, the KM analysis of this dataset with 44 singletons and 10 shared polymorphisms required the full use of its factors for experimental error and design (i.e., for random errors, missing data, and uncertain homologue assignments).

A full ML analysis of this complex dataset by the KM model proved too time and memory consuming and a two step procedure was therefore adopted instead whereby the number of errors was first ML estimated followed by the determination of θ for this ML value [2]. This heuristic approach resulted in an estimate of θ_{
KM
}= 6.32 for this sample, which still took more than 12 hours to complete on the same CPU as for the simulations (see above). In contrast, a full ML analysis of this dataset by the KM' model using the same factors for experimental design is much faster as it took less than 1 hour on this CPU to obtain a similar estimate of θ'_{
KM
}= 7.17. Correspondingly, nucleotide diversity (π) for this sample is calculated as (7.17/1731 positions) = 0.0041 mutations per site. To summarize, the KM and KM' models both support similar estimates of θ for this real dataset, but the latter (as in the evolutionary simulations in Table 2) is much faster as confirmed by its completed, full, ML analysis.

## Conclusion

Expressed sequence tags, low coverage draft sequences, and other such unfinished products are known to contain random errors due to chance accidents during their data collection and recording [1, 11, 21, 23]. However, such sequences often constitute the only available allelic information for a population and methods to deal with their random mistakes are therefore needed to obtain accurate estimates of θ from these error prone data. This study is based on the simple premise that random sequence errors are distributed as singletons. Thus, one can avoid the random mistakes of error prone sequences by ignoring their singletons in favor of their shared polymorphisms (Figures 2 and 4).

This strategy is implemented in the new Watterson estimator, Tajima estimator, and KM' model. These new methods are all accurate and fast according to their evolutionary simulations and analysis of the real complex dataset for *A. californica* (Table 2). These methods come with the cost of increased standard deviations, but this price appears small or even negligible compared to their advantages of significantly improved accuracy and/or computational speed. Obviously, additional evolutionary simulations and applications to real datasets are now needed to evaluate more fully under what conditions the removal of singletons is warranted in light of this tradeoff. Nevertheless, the current successes with the new Watterson estimator, Tajima estimator, and KM' model support our recommendation that these three methods be given serious consideration when estimating θ from error prone sequences.

Unlike the Watterson and Tajima estimators that represent summary statistics, the KM and KM' models both constitute full ML models that offer the framework for the further incorporation of other experimental and population genetic factors [2]. In particular, such further developments are encouraged for the KM' model in light of its current successes in the evolutionary simulations and *A. californica* analysis (Table 2). For example, biased errors within a sample due to the systematic misreading of specific bases during DNA sequencing, the postmortem biochemical degradation of ancient DNA, and/or other such sources of error can be accommodated by a finite sites process that allows for repeated mistakes as well as mutations at the same sequence positions [14–16, 40]. Likewise, additional factors to account for other population genetic processes such as recombination (which is most likely the most important parameter overlooked in this study of the nuclear *FMRF* gene for *A. californica*) can be accommodated by the use of ancestral recombination graphs [41]. Inevitably, these more complex versions of the KM' model will require the use of sampling based procedures for their implementation (e.g., Markov chain Monte Carlo approximations), since the current use of direct ML evaluation will remain practical for only the smaller datasets and simpler models [14, 42–44].

## Declarations

### Acknowledgements

We thank Leonid L Moroz and Andrea B Kohn for the use of their unpublished *A. californica* dataset; William Farmerie and Michele R Tennant for their suggestions about this research; the Carlsberg Foundation for its fellowship to BK; and the University of Florida for its support. A copy of our computer program to estimate θ'_{
W
}, θ'_{
T
}, and θ'_{
KM
}is available upon request from BK.

## Authors’ Affiliations

## References

- Achaz G: Testing for neutrality in samples with sequencing errors.
*Genetics*2008, 179: 1409–1424. 10.1534/genetics.107.082198PubMed CentralView ArticlePubMedGoogle Scholar - Knudsen B, Miyamoto MM: Incorporating experimental design and error into coalescent/mutation models of population history.
*Genetics*2007, 176: 2335–2342. 10.1534/genetics.106.063560PubMed CentralView ArticlePubMedGoogle Scholar - Fu YX: Statistical properties of segregating sites.
*Theoretical Population Biology*1995, 48: 172–197. 10.1006/tpbi.1995.1025View ArticlePubMedGoogle Scholar - Hein J, Shierup M, Wiuf C:
*Sequence Variation, Genealogies and Evolution*. New York: Oxford University Press; 2005.Google Scholar - Wakeley J:
*Coalescent Theory: An Introduction*. Greenwood Village: Roberts & Company Publishers; 2008.Google Scholar - Hudson RR: Gene genealogies and the coalescent process. In
*Oxford Surveys in Evolutionary Biology*.*Volume 7*. Edited by: Antonovics J, Futuyma D. Oxford: Oxford University Press; 1990:1–44.Google Scholar - Fu YX: Estimating effective population size or mutation rate using the frequencies of mutations of various classes in a sample of DNA sequences.
*Genetics*1994, 138: 1375–1386.PubMed CentralPubMedGoogle Scholar - Felsenstein J:
*Inferring Phylogenies*. Sunderland: Sinauer Associates; 2004.Google Scholar - Li WH, Sadler LA: Low nucleotide diversity in man.
*Genetics*1991, 129: 513–523.PubMed CentralPubMedGoogle Scholar - Clark AG, Whittam TS: Sequencing errors and molecular evolutionary analysis.
*Molecular Biology and Evolution*1992, 9: 744–752.PubMedGoogle Scholar - Tiffin P, Gaut BS: Molecular evolution of the wound-induced serine protease inhibitor
*wip1*in*Zea*and related genera.*Molecular Biology and Evolution*2001, 18: 2092–2101.View ArticlePubMedGoogle Scholar - Nabholz B, Jean-Francois M, Bazin E, Galtier N, Glemin S: Determination of mitochondrial genetic diversity in mammals.
*Genetics*2008, 178: 351–361. 10.1534/genetics.107.073346PubMed CentralView ArticlePubMedGoogle Scholar - Ho SY, Phillips MJ, Cooper A, Drummond AJ: Time dependency of molecular rate estimates and systematic overestimation of recent divergence times.
*Molecular Biology and Evolution*2005, 22: 1561–1568. 10.1093/molbev/msi145View ArticlePubMedGoogle Scholar - Ho SY, Heupink TH, Rambaut A, Shapiro B: Bayesian estimation of sequence damage in ancient DNA.
*Molecular Biology and Evolution*2007, 24: 1416–1422. 10.1093/molbev/msm062View ArticlePubMedGoogle Scholar - Johnson PL, Slatkin M: Inference of population genetic parameters in metagenomics: A clean look at messy data.
*Genome Research*2006, 16: 1320–1327. 10.1101/gr.5431206PubMed CentralView ArticlePubMedGoogle Scholar - Johnson PL, Slatkin M: Accounting for bias from sequencing error in population genetic estimates.
*Molecular Biology and Evolution*2008, 25: 199–206. 10.1093/molbev/msm239View ArticlePubMedGoogle Scholar - Axelsson E, Willerslev E, Gilbert MTP, Nielsen R: The effect of ancient DNA damage on inferences of demographic histories.
*Molecular Biology and Evolution*2008, 25: 2181–2187. 10.1093/molbev/msn163View ArticlePubMedGoogle Scholar - Burgess R, Yang Z: Estimation of hominoid ancestral population sizes under Bayesian coalescent models incorporating mutation rate variation and sequencing errors.
*Molecular Biology and Evolution*2008, 25: 1979–1994. 10.1093/molbev/msn148View ArticlePubMedGoogle Scholar - Lynch M: Estimation of nucleotide diversity, disequilibrium coefficients, and mutation rates from high-coverage genome-sequencing projects.
*Molecular Biology and Evolution*2008, 25: 2409–2419. 10.1093/molbev/msn185PubMed CentralView ArticlePubMedGoogle Scholar - Lottaz C, Iseli C, Jongeneel CV, Bucher P: Modeling sequencing errors by combining hidden Markov models.
*Bioinformatics*2003, 19: i103-i112. 10.1093/bioinformatics/btg1067View ArticleGoogle Scholar - Long AD, Beldade P, Macdonald SJ: Estimation of population heterozygosity and library construction-induced mutation rate from expressed sequence tag collections.
*Genetics*2007, 176: 711–714. 10.1534/genetics.106.063610PubMed CentralView ArticlePubMedGoogle Scholar - Slatkin M, Pollack JL: Subdivision in an ancestral species creates asymmetry in gene trees.
*Molecular Biology and Evolution*2008, 25: 2241–2246. 10.1093/molbev/msn172PubMed CentralView ArticlePubMedGoogle Scholar - Wesche PL, Gaffney DJ, Keightley PD: DNA sequence error rates in Genbank records estimated using the mouse genome as a reference.
*DNA Sequence*2004, 15: 362–364.View ArticlePubMedGoogle Scholar - Shendure JA, Porreca GJ, Church GM: Overview of DNA sequencing strategies.
*Current Protocols in Molecular Biology*2008, 81: 7.1.1–7.1.11.Google Scholar - Watterson GA: On the number of segregating sites in genetical models without recombination.
*Theoretical Population Biology*1975, 7: 256–276. 10.1016/0040-5809(75)90020-9View ArticlePubMedGoogle Scholar - Tajima F: Evolutionary relationship of DNA sequences in finite populations.
*Genetics*1983, 105: 437–460.PubMed CentralPubMedGoogle Scholar - Kimura M: The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations.
*Genetics*1969, 61: 893–903.PubMed CentralPubMedGoogle Scholar - Kimura M:
*The Neutral Theory of Molecular Evolution*. Cambridge: Cambridge University Press; 1983.View ArticleGoogle Scholar - Kingman JFC: The coalescent.
*Stochastic Processes and their Applications*1982, 13: 235–248. 10.1016/0304-4149(82)90011-4View ArticleGoogle Scholar - Donnelly P, Tavaré S: Coalescents and genealogical structure under neutrality.
*Annual Review of Genetics*1995, 29: 401–421. 10.1146/annurev.ge.29.120195.002153View ArticlePubMedGoogle Scholar - Fu YX, Li WH: Statistical tests of neutrality of mutations.
*Genetics*1993, 133: 693–709.PubMed CentralPubMedGoogle Scholar - Tajima F: Statistical method for testing the neutral mutation hypothesis by DNA polymorphism.
*Genetics*1989, 123: 585–595.PubMed CentralPubMedGoogle Scholar - Simonsen KL, Churchill GA, Aquadro CF: Properties of statistical tests of neutrality for DNA polymorphism data.
*Genetics*1995, 141: 413–429.PubMed CentralPubMedGoogle Scholar - Fu YX: Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection.
*Genetics*1997, 147: 915–925.PubMed CentralPubMedGoogle Scholar - Fisher RA:
*The Genetical Theory of Natural Selection*. Oxford: Clarendon Press; 1930.View ArticleGoogle Scholar - Wright S: Evolution in Mendelian populations.
*Genetics*1931, 16: 97–159.PubMed CentralPubMedGoogle Scholar - Griffiths RC: Genealogical-tree probabilities in the infinitely-many-sites model.
*Journal of Mathematical Biology*1989, 27: 667–680.View ArticlePubMedGoogle Scholar - Griffiths RC, Tavaré S: Unrooted genealogical tree probabilities in the infinitely-many-sites model.
*Mathematical Biosciences*1995, 127: 77–98. 10.1016/0025-5564(94)00044-ZView ArticlePubMedGoogle Scholar - Huelsenbeck JP, Rannala B: Phylogenetic methods come of age: Testing hypotheses in an evolutionary context.
*Science*1997, 276: 227–232. 10.1126/science.276.5310.227View ArticlePubMedGoogle Scholar - Huse SM, Huber JA, Morrison HG, Sogin ML, Welch DM: Accuracy and quality of massively parallel DNA pyrosequencing.
*Genome Biology*2007, 8: R143.1-R143.9. 10.1186/gb-2007-8-7-r143View ArticleGoogle Scholar - Griffiths RC: Ancestral inference from gene trees. In
*Genes, Fossils, and Behaviour: An Integrated Approach to Human Evolution*. Edited by: Donnelly P, Foley R. Amsterdam: IOS Press; 2001:137–172.Google Scholar - Kuhner MK, Felsenstein J: Sampling among haplotype resolutions in a coalescent-based genealogy sampler.
*Genetic Epidemiology*2000, 19: S15-S21. Publisher Full Text 10.1002/1098-2272(2000)19:1+%3C;::AID-GEPI3%3E;3.0.CO;2-VView ArticlePubMedGoogle Scholar - Nielsen R: Estimation of population parameters and recombination rates from single nucleotide polymorphisms.
*Genetics*2000, 154: 931–942.PubMed CentralPubMedGoogle Scholar - Beerli P: Comparison of Bayesian and maximum-likelihood inference of population genetic parameters.
*Bioinformatics*2006, 22: 341–345. 10.1093/bioinformatics/bti803View ArticlePubMedGoogle Scholar - Edwards AW: Estimation of the branch points of a branching diffusion process.
*Journal of the Royal Statistical Society*1970, 32: 155–174.Google Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.