Estimating time since infection in early homogeneous HIV-1 samples using a poisson model

Background The occurrence of a genetic bottleneck in HIV sexual or mother-to-infant transmission has been well documented. This results in a majority of new infections being homogeneous, i.e., initiated by a single genetic strain. Early after infection, prior to the onset of the host immune response, the viral population grows exponentially. In this simple setting, an approach for estimating evolutionary and demographic parameters based on comparison of diversity measures is a feasible alternative to the existing Bayesian methods (e.g., BEAST), which are instead based on the simulation of genealogies. Results We have devised a web tool that analyzes genetic diversity in acutely infected HIV-1 patients by comparing it to a model of neutral growth. More specifically, we consider a homogeneous infection (i.e., initiated by a unique genetic strain) prior to the onset of host-induced selection, where we can assume a random accumulation of mutations. Previously, we have shown that such a model successfully describes about 80% of sexual HIV-1 transmissions provided the samples are drawn early enough in the infection. Violation of the model is an indicator of either heterogeneous infections or the initiation of selection. Conclusions When the underlying assumptions of our model (homogeneous infection prior to selection and fast exponential growth) are met, we are under a very particular scenario for which we can use a forward approach (instead of backwards in time as provided by coalescent methods). This allows for more computationally efficient methods to derive the time since the most recent common ancestor. Furthermore, the tool performs statistical tests on the Hamming distance frequency distribution, and outputs summary statistics (mean of the best fitting Poisson distribution, goodness of fit p-value, etc). The tool runs within minutes and can readily accommodate the tens of thousands of sequences generated through new ultradeep pyrosequencing technologies. The tool is available on the LANL website.


Background
The occurrence of a genetic bottleneck in HIV sexual or mother-to-infant transmissions has been well documented [1,2]. This results in about 80% of new infections being homogeneous, i.e., initiated by a single genetic strain [3]. Due to the availability of early samples taken at multiple time points from acute HIV subjects, we now know that the viral population grows exponentially during the early phases of infection [4], prior to the onset of the host immune response or significant target cell depletion. Given this simple setting, our goal is to clearly distinguish infections that were initiated by a single strain (homogenous infection) from those where multiple strains entered the host. Furthermore, for those cases where we are able to determine that the infection was indeed homogenous, we seek to estimate the time since the most recent common ancestor (MRCA) given a sample of genetic sequences. In contrast to coalescent and Bayesian inference methods (e.g., BEAST [5]), which are based on a simulation of genealogies, we do not simulate the genealogical history of the observed sequences, but rather follow the diversity structure of the entire viral population. While in most settings a forward simulation is not attainable (hence the use of coalescence), our approach becomes not only feasible but extremely simple to realize when modeling a homogeneous infection within a small number of generations from the transmission bottleneck. The increased efficiency of the algorithm (which runs within minutes instead of hours) facilitates the analysis of samples from a large number of subjects [3] and enables the application to massive new data sets that are currently being gathered using pyrosequencing [6]. We have successfully tested our tool in both cases, i.e., hundreds of patients and tens of thousands of sequences. These methods are robust against violations of assumptions that do not strongly affect overall diversity. Quantities like the time since the MRCA, when computed by our tool, yield virtually identical results [3,7] to those computed by the coalescent or Bayesian methods, as shown in Figure 1. The tool is available on the LANL website [8]. An extensive explanation of how the tool works and how to format the input data is provided in the Poisson Fitter Explanation link available on the tool website [8].

Implementation
Our basic framework is that of an exponentially growing population following a narrow bottleneck, with lineageindependent mutation rates at all sites and no differential selection among the attested forms. When the resulting diversity is small, almost every change is at a distinct locus, and the pairwise differences between genetic strains, i.e., the Hamming distances (HD), follow a Poisson distribution [9]. There is little evidence of recombination in samples from this early stage when care is taken to guard against in vitro recombination [10]. Furthermore, when sites evolve independently starting from a homogeneous infection, homologous recombination does not change the HD distribution nor the overall amount of diversity. It does, however, affect the distribution when these assumptions are violated (unpublished data). The Poisson Fitter tool [8] uses this  [3] and how they correlate to the times estimated using the Beast tool [5]. The data is highly correlated: r = 0.973 and the best fitting slope is 1.036. fact to fit a Poisson distribution to the pairwise HD frequency distribution obtained from the input data. Because a good fit is obtained when the infection is homogeneous and free of differential selection pressure [7], the tool can be used to detect large violations of one or more assumptions of the null model.

Input data
The Poisson Fitter tool [8] was originally developed to analyze genetic sequences from early HIV-1 infected patients (Fiebig Stage I and II [3]). However, it can be used on any set of highly conserved sequences to test whether the observed diversity could be due to random accumulation of mutations or whether one needs to invoke alternate explanations like early onset of selection or multiple original infecting strains. The tool accepts multiple alignments, obtained through either single genome amplification [10] or deep sequencing [6], but they must be uploaded as one file. All alignments need to be in fasta format and always start with a consensus or reference sequence, which is assumed to be the ancestral strain. More details on how to format the input data is provided on the tool web page [8].

Controlling for APOBEC enrichment
APOBEC is a host enzyme. During replication it causes mutations that can introduce stop codons and inactivate the virus. These substitutions are recognizable by a G A mutational pattern embedded in the APOBEC3G/F motif [11,12], which is: GRD, where R is the IUPAC symbol for G or A, and D for G, A or T. Their occurrence is sporadic, often leading to overtly hypermutated sequences [11,12], and thus violating our assumption of a constant mutation rate (see Results and Discussion for an example). Our tool allows for two types of checks: the first method searches for hypermutated sequences, i.e., particular sequences that present a statistically significant excess of G A mutations at APOBEC3G/F motifs compared to elsewhere in the sample. The test is performed using the Hypermut tool [13] and the user can select a P-value threshold (default P < 0.1). With the second method, an artificial sequence is created, which summarizes all mutations found across the sample, and is then tested for APOBEC enrichment via the Hypermut tool [13]. When the latter is significantly enriched, it indicates enrichment across the sample for mutations embedded in the APOBEC motif, in which case we call the sample overall enriched [14]. The user can choose whether and how to correct for APOBEC enrichment. Two correction methods are provided, which correspond to the two tests discussed above. The first option is to simply remove all sequences that yield a P-value below the threshold. Because this method reduces the number of sequences, we have devised a second option which looks at all G's in the consensus sequence and removes the positions that fall within an APOBEC GRD motif from the entire alignment. This leaves the number of sequences unchanged and provides an unbiased correction. With either option, a new corrected alignment will be created, and the tool will proceed to analyze both the uncorrected and the corrected alignments.

Star phylogeny
When the samples are relatively small, under our assumed model of neutral evolution and rapid exponential growth, all sequences are likely to coalesce to the same founder or MRCA [15]. When this happens, we say that the sample follows a star phylogeny. This has been shown in HIV-1 early homogenous infections [3,16,17]. Under a star-phylogeny scenario, the pairwise HD distribution coincides with the self-convolution of the HD 0 frequency distribution, where HD 0 is the Hamming distance from the consensus sequence. When the sequences coalesce at the founder we can use the following mathematical formulation to compute the HD frequencies [7]: where δ x, y is the Kroenecker delta, and X i is the number of sequences that are i bases from the consensus, and Y j are the number of pairs of sequences that are different at j bases. By comparing these theoretical frequencies Y j to the observed ones (i.e., the number of pairs that are j bases from one another), one can verify whether the star-phylogeny assumption is valid for the tested alignment.

Fitting the Poisson distribution
A Poisson distribution is fitted to the observed pairwise HD distribution using a maximum likelihood method (see [7] for details). A c 2 goodness of fit (GOF) is performed to test whether the HD distribution significantly diverges from a Poisson (small P-values indicate a bad fit). The test takes into account the non-independence of pairwise HD distances by comparing the observed frequencies to the expected ones if the sample were to follow a star phylogeny. Prior to the onset of positive selection, the population is assumed to undergo a rapid expansion during which the basic reproductive number R 0 > 1. Therefore, when the sample yields a GOF P-value above 0.05 (indicating a non-significant divergence from a Poisson distribution), we can estimate the time since the MRCA using the parameters characterizing intrahost HIV evolution. Following Stafford et al. [18], we set R 0 = 6 for acute HIV-1 infection samples. We assume a constant mutation rate across lineages, which we fix at an average value of = 2.16 × 10 -5 per site and per replication cycle. This value has been adjusted from what was originally derived for HIV-1 by Mansky et al. [19] by considering only base substitutions and ignoring insertions and deletions. However, it has not been corrected for possible extra purifying selection in vivo. Using these parameters, we use the mean of the fitted Poisson distribution to infer the time since the MRCA [7].

Results and Discussion
All the parameters explained in the previous section are computed and included in the output table called "Log Likelihood -Estimated Parameters." This comprises, for each sample: the number of sequences in the sample, the mean and maximum pairwise HD, the mean of the best fitting Poisson distribution, the corresponding time since the MRCA, and the goodness of fit P-value It is important to notice that when the sample meets our model's assumptions, the mean of the best fitting Poisson distribution is in fact the mean pairwise HD of the sample. A second table, called "Convolution Estimates," provides the observed HD frequencies and the estimated ones calculated using equation (1). A more detailed explanation of the parameters is provided in the Explanation file on the tool web page. Figure 2 shows the graphics obtained by analyzing a fragment of the NEF HIV-1 gene (169 base pairs) from patient CH40 [3,16]. The data have been published [16] and submitted to the NCBI Sequence Read Archive http://www.ncbi.nlm.nih.gov/Traces/sra under accession number SRA020793. This sample was obtained through deep sequencing [6,20] and yielded a little over 4,000 sequences, though our tool can easily handle ten times as many sequences: because it works only with counts of pairwise distances, it can handle samples of almost any reasonable size, though very large jobs will slow the server. The left panel in Figure 2 shows the pairwise HD frequency counts (black), the best fitting Poisson distribution (blue), and the expected counts if it were a starphylogeny (red) on a logarithmic scale. The fact that the red line and the black line are indistinguishable confirms that the sample follows a star-like phylogeny. Because the Poisson fit is very sensitive to deviations in the upper tail of the distribution, the tool outputs graphics in the logarithmic scale whenever the sample size is above 100; this helps visualize possible deviations at the higher distances. Though the values are discrete, lines are used for better visualization. In the right panel a histogram of the frequency counts is shown together with, in red, the best fitting Poisson distribution. In this case the sample yielded a good fit (P = 0.981) and a time to MRCA of 34 days 95% CI = (31, 38).
As a second example, Figure 3 shows a sample drawn from single genome amplification sequencing [3,10]. All fifty sequences used for this example are available through Accession Numbers EU575084-133. In this case the original alignment does not yield a good fit to the Poisson distribution (top left panel, red line), but the tool detected APOBEC3G/F mediated hypermutation. By selecting the option to correct for APOBEC signatures by both methods, two more alignments are produced: one where two significantly (P < 0.1) hypermutated sequences are removed, and one where instead the alignment position where APO-BEC induced mutation could potentially affect results are removed. It is noteworthy that the first type of correction still does not yield a good Poisson fit (GOF Figure 2 Example of output graphics for a 454 sample that conformed to the model. Pairwise HD frequency plots on a logarithmic scale (black, left panel), together with the best fitting Poisson (blue) and the theoretical counts expected if the sample were to follow a star-like phylogeny. The right panel shows the pairwise HD histogram and the best fitting Poisson distribution (red). In the legend we report the GOF P-value (P = 0.981), the estimated days since the infections (d = 34(31, 38)), and the number of sequences in the sample (N = 4046). P < 0.0001). This is because the sample is overall enriched for G A. One can check this by looking at the Hypermut Results Table and noticing that the sequence called compressedMutations yields a P = 6 × 10 -6 for APOBEC enrichment. Therefore, only when all positions with a G embedded in a APOBEC3G/F motif are removed, does one finally achieves a good Poisson fit (P = 0.865) and a biologically sensible (given the clinical data available for this subject) time estimate of the time since the MRCA of 12 days, 95% CI = (8,16). This example illustrates how APOBEC enrichment can cause the Poisson fit to fail and hence how it is necessary to isolate the APOBEC induced mutations in order to make sensible estimates on the timing of the infection.
Unlike the example in Figure 2, where a logarithmic scale is used for better visualization, when the sample size is under 100, the star-phylogeny is represented in the manner shown in the panels on the right: the observed pairwise HD frequency counts are shown by the blue histograms, whereas the ones computed theoretically are shown in red. For both APOBEC-corrected and non-corrected samples, the red lines follow the histograms faithfully, which deems both samples as star-like phylogenies.
Both of the examples above obviously meet our model's assumptions of exponential growth with no selection and negligible recombination rate. When one or more assumption is not met the goodness of fit P-value lowers considerably and therefore the time since the MRCA is inaccurate. There are several factors that can cause this Figure 3 Example of output graphics for an SGA sample that was enriched for APOBEC mediated substitutions. HD frequency plots with best fitting Poisson (red line), on the left (panels A and C), and with theoretical star-phylogeny frequencies (red line), on the right (panels B and D). The top panels represent an alignment not corrected for enrichment for APOBEC motifs, whereas the bottom panels represent the same alignment after the G positions that in the consensus are in the APOBEC context have been removed from the alignment. Prior to the correction, the Poisson does not fit the HD frequency distribution (GOF P < 0.0001), whereas after the correction the Poisson yields a good fit (GOF P = 0.865).
to happen: for instance, the infection may be nonhomogeneous, the sample may not be "early" enough, or one may have sampled an unlikely early random mutation that distorts the Poisson distribution. When analyzing HIV-1 data, we recommend using samples taken within the first 2-5 weeks of infection, or characterized as Fiebig stage I or II [3]. At later Fiebig stages selection and recombination are clearly observed, and the diversity is controlled by these later selective bottlenecks. The probability of an early stochastic mutation violating the model assumption is calculated in [14], and is typically small.
Finally, we notice that our tool can be applied to subsets of sequences sampled at later time points when there is evidence of a narrow bottleneck. For example, in Fischer et al. [16] we were able to isolate the escape lineages after the immune response had begun and applied the tool to estimate the timing of each lineage. The tool has been used primarily on large HIV-1 data sets [3,16], though it can be used on any population that grows in a similar fashion, as appears to be the case of HCV for instance [21].

Conclusions
Our tool enables quantitative characterization of acute infection samples and can be usefully applied in large scale vaccine and prophylaxis studies where estimates of the time since the MRCA and/or the timing of the onset of host selection can be extremely informative. The tool can rapidly detect whether the mutational distribution in a set of HIV sequences is consistent with a star phylogeny and/or a Poisson model, indicative of a population evolving from a single ancestor, with lineage independent mutations under no differential selection of surviving forms. If the model is violated, the tool automatically evaluates whether this is a consequence of APOBEC mediated substitutions. When the model is satisfied, it can be be used to estimate times to the most recent common ancestor of the lineage, rapidly providing timing estimates that are in good accord with coalescent methods. The speed and simplicity of the algorithm enables it to be applied to massive data sets obtained through ultra-deep sequencing methods [6,20]. We have used it on tens of thousands of sequences [16] to estimate the time since the MRCA and also, at later time points, to estimate the timing of escape lineages, which arose after the onset of selection. The tool can be used on any other population which presents similar evolutionary patterns [21].