Accurate reconstruction of viral quasispecies spectra through improved estimation of strain richness

Background Estimating the number of different species (richness) in a mixed microbial population has been a main focus in metagenomic research. Existing methods of species richness estimation ride on the assumption that the reads in each assembled contig correspond to only one of the microbial genomes in the population. This assumption and the underlying probabilistic formulations of existing methods are not useful for quasispecies populations where the strains are highly genetically related. The lack of knowledge on the number of different strains in a quasispecies population is observed to hinder the precision of existing Viral Quasispecies Spectrum Reconstruction (QSR) methods due to the uncontrolled reconstruction of a large number of in silico false positives. In this work, we formulated a novel probabilistic method for strain richness estimation specifically targeting viral quasispecies. By using this approach we improved our recently proposed spectrum reconstruction pipeline ViQuaS to achieve higher levels of precision in reconstructed quasispecies spectra without compromising the recall rates. We also discuss how one other existing popular QSR method named ShoRAH can be improved using this new approach. Results On benchmark data sets, our estimation method provided accurate richness estimates (< 0.2 median estimation error) and improved the precision of ViQuaS by 2%-13% and F-score by 1%-9% without compromising the recall rates. We also demonstrate that our estimation method can be used to improve the precision and F-score of ShoRAH by 0%-7% and 0%-5% respectively. Conclusions The proposed probabilistic estimation method can be used to estimate the richness of viral populations with a quasispecies behavior and to improve the accuracy of the quasispecies spectra reconstructed by the existing methods ViQuaS and ShoRAH in the presence of a moderate level of technical sequencing errors. Availability http://sourceforge.net/projects/viquas/

A major observation on the QSR methods ViQuaS and ShoRAH was that the precision (fraction of reconstructed strains that are true: equation 8) of reconstructed spectra was less than the recall rate (fraction of true strains that are reconstructed: equation 7) owing to the reconstruction of in silico false positives. However, we observed in [8] that ViQuaS has the best recall rates among the four methods ViQuaS, ShoRAH, PredictHaplo and QuRe. Also, ShoRAH performs at comparable levels with PredictHaplo in terms of recall. Furthermore, both QuRe and Predic-tHaplo demonstrate higher precision values than the corresponding recall values. Therefore, we realize that the F-score (the geometric mean of recall and precision: equation 9) values of the spectra reconstructed by ViQuaS and ShoRAH can be improved by controlling the generation of false positives without compromising the recall rates, but such an improvement cannot be achieved in QuRe and PredictHaplo as the spectra generated by them usually contain a lower number of strains than the actual number of strains in the population. Therefore, other algorithmic changes will be needed to improve the F-score values of spectra generated by QuRe and PredictHaplo without compromising the recall rates. In this paper, we present a novel probabilistic method to estimate the number of strains in a viral quasispecies population and a strategy to improve the precision and F-score of ViQuaS analysis pipeline without compromising the recall rates, by reducing the number of in silico false positives using above estimates as input information. We also show that the same strategy can be used to improve the performance of ShoRAH.
The number of different microbial types in a mixed population is termed as richness. Thus far we find two popular methods of estimating the richness of a mixed microbial population named PHACCS [9] and CatchAll [10]. The input to both methods takes the same form. In fact, the input is the contig spectrum of the metagenome derived from the target mixed microbial population. Both methods rely on probabilistic parameter estimation strategies. A major assumption regarding the input is that the reads in each contig correspond to only one of the microbial genomes in the population [9], [10]. In other words, it is assumed that the different microbes in the population do not comprise of significant common genomic regions. This assumption is acceptable for populations such as soil, lake water, sea water bacterial populations and bacteriophages, but it is unacceptable for quasispecies populations where the viral strains are highly genetically related. Hence, the requirement arises to formulate the richness estimation problem for viral quasispecies in an alternative framework.
The main contributions of this work are two fold. 1) We formulate a novel probabilistic method to estimate the strain richness of a viral quasispecies population and 2) we propose a reconfiguration for the recently published pipeline ViQuaS [8], that significantly improves the precision of reconstructed quasispecies spectra without compromising the recall rate. Furthermore, we discuss how the existing quasispecies spectrum reconstruction method ShoRAH [1] can benefit from the proposed estimation strategy.

Strain richness estimation
In this work we formulate the strain richness estimation problem as a parameter estimation task given a single observation from a discrete probability distribution.
Consider the instance where the biological sample for next-generation sequencing is collected, at which the quasispecies spectrum can be safely assumed as static. Let us assume the following notations to formulate the proposed estimation problem.
• L = Length of the genomic segment of the known reference (or the wild type) genome of the quasispecies population we are interested in reconstructing. • s = Number of different strains in the quasispecies population.
• r i = Number of mutations in the i th strain with respect to the reference genome where i ∈ {1, 2, 3, ..., s}.
• n = Number of total possible mutations that can occur in a single strain. It should be noted that two distinct strains can have one or more common mutations, but not all mutations can be common. For the formulation we assume that each strain contains a constant number of mutations with respect to the reference genome. We use the value r as the number of mutations in each strain and define the probability mass function (p.m.f.) of U under the parameters s, n and r (Pr(U = u; s, n, r)) as in equation 1.
where n, r, u, s ∈ Z + and x ∈ N 0 and equation 2 defines P s (x).
In real world populations, the number of mutations in each strain may not be a fixed value (r), but we show in our results that the effect of this variability on the estimation strategy is minimal under practical settings.
Proof Case 1 : s = 1 s = 1 corresponds to a population where there is only one strain. If the number of mutations in this strain is r, all these mutations are unique in the population. Hence, when s = 1 the random variable U can only take the value U = r. Therefore, Pr(U = u; s, n, r) = 1 when s = 1 and u = r Pr(U = u; s, n, r) = 0 when s = 1 and u ≠ r Case 2 : s ∈ 1, n r and U ≤ r s > 1 corresponds to a population where there are more than one strain. If the number of mutations per strain is r, the first strain contributes r number of unique mutations towards the value of U . The second strain cannot have the same r mutations that are observed in the first strain. Therefore, the second strain contains at least one mutation which is different from the already observed r mutations. Consequently, when s >1, U takes a value greater than r. Also the total number of strains that are possible to be present is n r .

Therefore,
Pr(U = u; s, n, r) = 0 when s ∈ 1, n r and u ≤ r Case 3 : s ∈ 1, n r and U > min(n, rs) Under any circumstance, the number of unique mutations cannot be greater than the number of total possible mutations that can occur in a single strain (n). Consequently, the number of unique mutations in the entire population cannot grow beyond n.
Furthermore, when s ∈ 1, n r , the first strain contributes r number of unique mutations towards the value of U and the maximum number any subsequent strain can contribute is also r. Hence, the value of U cannot be higher than rs. As a result, when s ∈ 1, n r , the maximum value attainable by U is min(n, rs). Therefore, Pr(U = u; s, n, r) = 0 when s ∈ 1, n r and u > min(n, rs) Case 4 : s ∈ 1, n r and U ∈ (r, min(n, rs)] Based on Case 2 and Case 3 we understand that, when s ∈ 1, n r , U can get an integer value within the range (r, min(n, rs)].
Consider the situation where we have observed U = ux with (s -number of strains. The probability of occurrence of this situation is Pr(U = ux; s -1, n, r). In order to observe U = u with s number of strains, the remaining s th strain should contain exactly x number of new unique mutations. x can be any natural number in the range [0, r]. Hence, the probability of observing U = u with s number of strains is, where P s (x) is the probability of observing x (x ∈ [0, r]) number of new unique mutations in the s th strain.
Consider the case where x ∈ [1, r]. For the s th strain to contain exactly x number of new unique mutations out of a total of r, only (r -x) number of mutations in the s th strain can be a subset of the already observed (u -x) number of unique mutations. The remaining x number of mutations should arise from the unobserved n -(u -x) possible mutations. In addition, the total number of ways that these r mutations can occur without being equivalent to one of the (s -1) number of already observed ways is n r − (s − 1).
Hence, when x ∈ [1, r], Consider the case where x = 0. In this case all r mutations of the s th strain must come from the already observed u unique mutations in the (s -1) number of strains. However, the s th strain cannot be equivalent to any of the already observed (s -1) strains. The number of ways this scenario can occur is u r − (s − 1) . Similar to the case where x ∈ [1, r], the total number of ways that these r mutations can occur without being equivalent to one of the (s -1) number of already observed ways is n r − (s − 1) Hence, when x = 0, We observe that for all U = u where u ∈ [r, n), there exist a global optimum value in the likelihood function of s given U (i.e. L(s|U = u)) such that s is finite. Hence, given an observed value for U we can readily use the maximum likelihood strategy to estimate the value of s. A closed form equation for the maximum likelihood estimator of s (s mle ) cannot be derived. Nevertheless, s mle can be obtained numerically in a time efficient manner using dynamic programming.
The expected value of U given the value of s (i.e. E(U|s)) is a strictly increasing discrete valued function of s for given values of n and r. Therefore, as a second strategy we can estimate s using the method of moments.
In real world quasispecies populations, we can assume that r, s <<<n. Since it is impractical to know the value of n, we set n = L. This approximation is observed to have minimal effect on the parameter estimators when r, s <<<L (i.e. when r, s <<<n the exact value of n has minimal effect on the estimated value of s).

Calculating r given a NGS metagenome
For a given metagenome the chance of each strain being completely sequenced, decreases with its relative frequency. [11] extends the Lander-Waterman model of sequencing [12] to derive the probability that all the strains are covered by the given number of sequenced reads assuming the reads are uniformly distributed along the genomic region of interest. The theoretically re-constructible minimum relative frequency (f min ), derived according to [11] and [13], defines a relative frequency value for the strains in the population above which the probability of a strain being completely covered is at least p min . We set p min = 0.99 in our study.
Given a NGS metagenome, we use ViQuaS algorithm presented in [8] to reconstruct the unsupervised quasispecies spectrum of the viral population. We demonstrated in [8] that ViQuaS reconstructs a highly reliable spectrum for strains having a relative frequency greater than the theoretically re-constructible minimum relative frequency (f min ). Hence we obtain an approximate value for r by calculating the median number of mutations present in the reconstructed strains having a relative frequency greater than f min .
Alternatively, if we can safely assume that the sequenced reads provide a uniform coverage across the genomic region of interest, following formulae can be used to calculate the parameter r.
Calculating U given a NGS metagenome Depending on the information we are interested in, we use different strategies to calculate U . Let us denote s t as the total number of strains in the population (i.e. richness) and s f as the number of strains having a relative frequency greater than f min . Each read in the metagenome carries zero or more mutations belonging to a single strain of the original population. However, it is unlikely that all the mutations carried by the strains having a relative frequency less than f min are captured during sequencing. Hence, defining U as the number of unique mutations captured in all reads of the metagenome leads us to an estimate of s (ŝ t ) less than s t (i.e. a lower bound for s t ).
As a second strategy, we define U as the number of unique mutations captured in the local haplotypes of the ViQuaS analysis pipeline having a local haplotype frequency greater than f min . Due to the presence of common genomic regions, reads originating from two or more low frequency strains can form a single local haplotype having a local haplotype frequency greater than f min . Therefore, the second strategy gives us a U value corresponding to a number of strains slightly higher than s f . Hence, the estimated value (ŝ f ) is an upper bound for s f .

Reconfiguration of ViQuaS
The ViQuaS analysis pipeline presented in [8] performs an unsupervised reconstruction of strains given a NGS read set. We propose to reconfigure the original ViQuaS pipeline as follows. First, the unsupervised algorithm provides information needed to calculate the parameter r and the observation U as outlined in the previous subsections, facilitating the calculation ofŝ f . Thisŝ f value is fed back to the global spectrum reconstruction algorithm of ViQuaS to decide whether it should continue with the unsupervised result or terminate after having reconstructedŝ f number of strains. Assume the unsupervised ViQuaS pipeline has reconstructed s f,u number of strains with a frequency greater than f min . Then, if, s f ,u >ŝ f ⇒ Terminate afterŝ f strains else, continue with the unsupervised result

Data sets
For the ease of comprehension and comparison of results, we used the simulated data sets (SS1, SS2, SS3, SS4, SS5, SS6, SS7 and SS8 ) described in detail in [8] to benchmark the ViQuaS and ShoRAH using the proposed reconfigurations presented in this paper. (We have provided the description in [8] as Additional File 1 for the ease of comprehension.) We also used the V11909 real Roche 454 HIV-1 data set to demonstrate the applicability of the estimation strategy on real data. Details of V11909 are provided in [8].

Validation of strain richness estimation theory
We used 4800 simulated quasispecies populations with known input parameters to validate the strain richness estimation theory for practical settings where the number of mutations per strain and the number of strains are much less than the target genome length (r, s <<<L). The

Estimating s t and s f in quasispecies populations using NGS data
We used the simulated data set SS1 to evaluate the performance of the richness estimation method on NGS data derived from viral quasispecies populations. For each sample, the value of the random variable U was calculated at two stages as described under the Calculating U given a NGS metagenome subsection corresponding to st and s f . Figures 3 and 4 illustrate We observed that the estimation errors of both parameters decrease with increasing Diversity (Diversity ∝ r) and increase with increasing s t . Furthermore, we observe that the estimation errors of s t are predominantly negative (706 out of 800 instances in SS1 ) and that of s f are predominantly positive (787 out of 800 instances in SS1 ) and the magnitudes of median error are close to zero. The major causes of positive estimation error of st are: (i) significant difference between the calculated r and the actual r used in simulating data and (ii) the existence of samples where s t s f . The major cause of negative estimation error of s t is the significant difference between the calculated r and the actual value. This confirms our claim in Calculating U given a NGS metagenome subsection that the corresponding estimates (ŝ t andŝ f ) give a lower bound for s t and an upper bound for s f .

Enhanced QSR performance in ViQuaS and ShoRAH
In our previous work [8] we observed that the precision of reconstructed quasispecies spectra are significantly lower than the recall rate under all simulation settings due to the reconstruction of a significantly higher number of false positive strains. Using the estimated valueŝ f as an upper bound for s f we controlled the growth of false positive strains and obtained significant gain in precision (2%-13%) as shown in Figure 5. Consequently we also observed significant gain in F-score values (1%-9%) for reconstructed spectra using ViQuaS ( Figure 6). (Refer [8] for the comparison of performance between the methods ViQuaS, ShoRAH, QuRe and PredictHaplo under the same simulation settings.) Furthermore, we applied the same reconfiguration we proposed on Vi-QuaS using the estimated upper bounds for s f (ŝ f ) on an existing QSR methods named ShoRAH to evaluate the performance enhancement attainable via our new estimation strategy. Table 1 summarizes the gain in precision and F-score attained by the two methods, ViQuaS and ShoRAH. Similar to ViQuaS, we observe that ShoRAH also attains considerable gain in precision and F-score when the added knowledge ofŝ f values is used. Most importantly, no compromise in recall rates were observed associated with the gain in precision of both ViQuaS and ShoRAH. (Original recall values are presented in [8]. Performance measures of QuRe and Predic-tHaplo are not included in this table as they cannot be improved using the proposed reconfiguration strategy.) The three performance measurement terms recall, precision and F-score were calculated according to the following equations. Further details regarding the calculations are found in [8].
Precision = True Positive Strains with a relative frequency > fmin Total number of reconstructed strains with a relative frequency > fmin Application on real data Analyzing V11909 [14] using the estimation method we found that it contains 16 mutations per strain on average (r = 16) within the 1044 bp long region of interest. The Diversity, number of strains, exact nucleotide sequences and the relative frequencies of the strains in the population are unknown. The estimated technical error rate of this sample is 0.11% [13], mean read length is 90bp and the total number of reads (n total ) is 5177. The theoretically re-constructible minimum relative frequency (f min ) for this set of reads is 2.5%. The number of unique mutations found in the total set of reads was 752 and the number of unique mutations found in the local haplotypes having a relative frequency greater than f min was 90. Accordingly, our method estimated thatŝ t = 82 andŝ f = 6 for V11909 HIV-1 data set using the p.d.f. Pr(U; s, n = 1044, r = 16).

Discussion and Conclusions
To the best of our knowledge there exists no dedicated method in literature to estimate the number of strains in a viral quasispecies population. Our previous studies highlighted that the unsupervised quasispecies spectrum reconstruction methods such as ShoRAH [1] and QuRe [2] reconstruct respectively higher and lower number of in silico false positive strains. These methods were not aimed at providing accurate estimates for the number of strains in a population. Estimation of species richness in mixed microbial populations is a problem that is closely related to the problem discussed here. However, methods addressing mixed microbial population richness estimation are not applicable to the problem at hand due to the characteristic differences between the subjective populations. We presented in this paper a novel probabilistic method to estimate the number of strains in a quasispecies population based on the distribution of mutations among different strains. The derived p.m.f. Pr(U = u; s, n, r) shows a significant relationship to the well known hypergeometric distribution. We were unable to derive closed form expressions for the maximum likelihood and Method of Moments estimators for the parameter s. They were calculated using dynamic programming as both estimators are deterministic. We chose to use the Method of Moments estimator as it provided marginally better estimates than the maximum likelihood estimator on benchmark data sets.
We demonstrated that the variability in the number of mutations per strains in a population has minimal effect on the estimation method under practical parameter values of n, r and s. However, the estimates are considerably sensitive to the parameter r. This implies that identifying the location of the distribution (i.e. median, mean or mode) of the number of mutations per strain is a critical step in the presented method. Accordingly, we identified that significant differences between the calculated and correct r values as the main cause of the outliers (high error values) of the error plots in Figures 3  and 4.
Using the estimated upper bound of s f (ŝ f ), we reconfigured the Vi-QuaS analysis pipeline to control the growth of in silico false positives. The summarized results presented in Figures 5, 6 andTable 1 show that the reconfigured ViQuaS pipeline improves the precision and F-score of reconstructed spectra compared to the previously proposed ViQuaS pipeline [8]. The highlight of the proposed reconfiguration is that we use the knowledge from both the unsupervised algorithm and the probabilistic estimation method to make a well informed decision to limit the number of false positives. This strategy allows the reconfigured ViQuaS pipeline to compensate for errors introduced by: (i) the uncontrolled growth of strains when using the unsupervised algorithm alone and (ii) the sensitivity of the estimation method to calculation errors of the parameter r.
We also demonstrated that, similar to ViQuaS, the performance of the existing method ShoRAH can be substantially improved using the added knowledge of the estimatedŝ f values. It will be an interesting study to see whether QuRe and PredictHaplo can be improved   by giving the estimatedŝ f values as input to the global spectrum reconstruction stage of the algorithms. This will require the ability to edit the source code of the respective methods. Also, the reconfiguration proposed in this paper can be used to reduce the computational cost to a certain extent by appropriately reducing the number of strains being reconstructed in the global strain reconstruction stage of a quasispecies spectrum reconstruction pipeline. Although the method can handle an error rate of 0.1% (theoretical maximum substitutional error probability of quality trimmed reads with a PHRED threshold of 30), we identify the absence of a proper discounting method to overcome the effect of technical sequencing errors (in cases of very high error rates) when calculating different parameters of the estimation method as the main drawback of our proposal. We plan to study further on such a discounting strategy that can overcome both substitutional and homopolymeric errors.
We also plan to study further on the p.m.f. with the view of improving the efficiency of the calculation methods and providing sound mathematical proof for the observed properties of the parameter estimators. We hope this new probability distribution will be useful in many theoretical and applied statistical problems in the future.
In addition to modeling the distribution of mutations as proposed in this paper, a quasispecies population may also be modeled using master equations [15][16][17] to estimate the strain richness. However, solving highly complex master equations has been a main bottleneck for decades [18] when using such techniques.

Additional material
Additional File 1: Description of simulated data sets. Detailed description of the simulated data sets SS1-SS8.

Competing interests
The authors declare that they have no competing interests.
Authors' contributions DJ formulated the problem and the statistical solution, carried out the simulations, analyzed real data and wrote the manuscript. IS and SKH contributed in formulating the statistical solution, designing the validation and simulation studies and preparing the manuscript. BCC and SLT contributed in designing the simulation study, analyzing real data and preparing the manuscript.