Estimating time since infection in early homogeneous HIV-1 samples using a poisson model
© Giorgi et al; licensee BioMed Central Ltd. 2010
Received: 19 May 2010
Accepted: 25 October 2010
Published: 25 October 2010
The occurrence of a genetic bottleneck in HIV sexual or mother-to-infant transmission has been well documented. This results in a majority of new infections being homogeneous, i.e., initiated by a single genetic strain. Early after infection, prior to the onset of the host immune response, the viral population grows exponentially. In this simple setting, an approach for estimating evolutionary and demographic parameters based on comparison of diversity measures is a feasible alternative to the existing Bayesian methods (e.g., BEAST), which are instead based on the simulation of genealogies.
We have devised a web tool that analyzes genetic diversity in acutely infected HIV-1 patients by comparing it to a model of neutral growth. More specifically, we consider a homogeneous infection (i.e., initiated by a unique genetic strain) prior to the onset of host-induced selection, where we can assume a random accumulation of mutations. Previously, we have shown that such a model successfully describes about 80% of sexual HIV-1 transmissions provided the samples are drawn early enough in the infection. Violation of the model is an indicator of either heterogeneous infections or the initiation of selection.
When the underlying assumptions of our model (homogeneous infection prior to selection and fast exponential growth) are met, we are under a very particular scenario for which we can use a forward approach (instead of backwards in time as provided by coalescent methods). This allows for more computationally efficient methods to derive the time since the most recent common ancestor. Furthermore, the tool performs statistical tests on the Hamming distance frequency distribution, and outputs summary statistics (mean of the best fitting Poisson distribution, goodness of fit p-value, etc). The tool runs within minutes and can readily accommodate the tens of thousands of sequences generated through new ultradeep pyrosequencing technologies. The tool is available on the LANL website.
Our basic framework is that of an exponentially growing population following a narrow bottleneck, with lineage-independent mutation rates at all sites and no differential selection among the attested forms. When the resulting diversity is small, almost every change is at a distinct locus, and the pairwise differences between genetic strains, i.e., the Hamming distances (HD), follow a Poisson distribution . There is little evidence of recombination in samples from this early stage when care is taken to guard against in vitro recombination . Furthermore, when sites evolve independently starting from a homogeneous infection, homologous recombination does not change the HD distribution nor the overall amount of diversity. It does, however, affect the distribution when these assumptions are violated (unpublished data). The Poisson Fitter tool  uses this fact to fit a Poisson distribution to the pairwise HD frequency distribution obtained from the input data. Because a good fit is obtained when the infection is homogeneous and free of differential selection pressure , the tool can be used to detect large violations of one or more assumptions of the null model.
The Poisson Fitter tool  was originally developed to analyze genetic sequences from early HIV-1 infected patients (Fiebig Stage I and II ). However, it can be used on any set of highly conserved sequences to test whether the observed diversity could be due to random accumulation of mutations or whether one needs to invoke alternate explanations like early onset of selection or multiple original infecting strains. The tool accepts multiple alignments, obtained through either single genome amplification  or deep sequencing , but they must be uploaded as one file. All alignments need to be in fasta format and always start with a consensus or reference sequence, which is assumed to be the ancestral strain. More details on how to format the input data is provided on the tool web page .
Controlling for APOBEC enrichment
APOBEC is a host enzyme. During replication it causes mutations that can introduce stop codons and inactivate the virus. These substitutions are recognizable by a G → A mutational pattern embedded in the APOBEC3G/F motif [11, 12], which is: GRD, where R is the IUPAC symbol for G or A, and D for G, A or T. Their occurrence is sporadic, often leading to overtly hypermutated sequences [11, 12], and thus violating our assumption of a constant mutation rate (see Results and Discussion for an example). Our tool allows for two types of checks: the first method searches for hypermutated sequences, i.e., particular sequences that present a statistically significant excess of G → A mutations at APOBEC3G/F motifs compared to elsewhere in the sample. The test is performed using the Hypermut tool  and the user can select a P-value threshold (default P < 0.1). With the second method, an artificial sequence is created, which summarizes all mutations found across the sample, and is then tested for APOBEC enrichment via the Hypermut tool . When the latter is significantly enriched, it indicates enrichment across the sample for mutations embedded in the APOBEC motif, in which case we call the sample overall enriched . The user can choose whether and how to correct for APOBEC enrichment. Two correction methods are provided, which correspond to the two tests discussed above. The first option is to simply remove all sequences that yield a P-value below the threshold. Because this method reduces the number of sequences, we have devised a second option which looks at all G's in the consensus sequence and removes the positions that fall within an APOBEC GRD motif from the entire alignment. This leaves the number of sequences unchanged and provides an unbiased correction. With either option, a new corrected alignment will be created, and the tool will proceed to analyze both the uncorrected and the corrected alignments.
where δ x, y is the Kroenecker delta, and X i is the number of sequences that are i bases from the consensus, and Y j are the number of pairs of sequences that are different at j bases. By comparing these theoretical frequencies Y j to the observed ones (i.e., the number of pairs that are j bases from one another), one can verify whether the star-phylogeny assumption is valid for the tested alignment.
Fitting the Poisson distribution
A Poisson distribution is fitted to the observed pairwise HD distribution using a maximum likelihood method (see  for details). A χ2 goodness of fit (GOF) is performed to test whether the HD distribution significantly diverges from a Poisson (small P-values indicate a bad fit). The test takes into account the non-independence of pairwise HD distances by comparing the observed frequencies to the expected ones if the sample were to follow a star phylogeny. Prior to the onset of positive selection, the population is assumed to undergo a rapid expansion during which the basic reproductive number R0 > 1. Therefore, when the sample yields a GOF P-value above 0.05 (indicating a non-significant divergence from a Poisson distribution), we can estimate the time since the MRCA using the parameters characterizing intrahost HIV evolution. Following Stafford et al. , we set R0 = 6 for acute HIV-1 infection samples. We assume a constant mutation rate across lineages, which we fix at an average value of ϵ = 2.16 × 10-5 per site and per replication cycle. This value has been adjusted from what was originally derived for HIV-1 by Mansky et al.  by considering only base substitutions and ignoring insertions and deletions. However, it has not been corrected for possible extra purifying selection in vivo. Using these parameters, we use the mean of the fitted Poisson distribution to infer the time since the MRCA .
Results and Discussion
All the parameters explained in the previous section are computed and included in the output table called "Log Likelihood - Estimated Parameters." This comprises, for each sample: the number of sequences in the sample, the mean and maximum pairwise HD, the mean of the best fitting Poisson distribution, the corresponding time since the MRCA, and the goodness of fit P-value It is important to notice that when the sample meets our model's assumptions, the mean of the best fitting Poisson distribution is in fact the mean pairwise HD of the sample. A second table, called "Convolution Estimates," provides the observed HD frequencies and the estimated ones calculated using equation (1). A more detailed explanation of the parameters is provided in the Explanation file on the tool web page.
Unlike the example in Figure 2, where a logarithmic scale is used for better visualization, when the sample size is under 100, the star-phylogeny is represented in the manner shown in the panels on the right: the observed pairwise HD frequency counts are shown by the blue histograms, whereas the ones computed theoretically are shown in red. For both APOBEC-corrected and non-corrected samples, the red lines follow the histograms faithfully, which deems both samples as star-like phylogenies.
Both of the examples above obviously meet our model's assumptions of exponential growth with no selection and negligible recombination rate. When one or more assumption is not met the goodness of fit P-value lowers considerably and therefore the time since the MRCA is inaccurate. There are several factors that can cause this to happen: for instance, the infection may be non-homogeneous, the sample may not be "early" enough, or one may have sampled an unlikely early random mutation that distorts the Poisson distribution. When analyzing HIV-1 data, we recommend using samples taken within the first 2-5 weeks of infection, or characterized as Fiebig stage I or II . At later Fiebig stages selection and recombination are clearly observed, and the diversity is controlled by these later selective bottlenecks. The probability of an early stochastic mutation violating the model assumption is calculated in , and is typically small.
Finally, we notice that our tool can be applied to subsets of sequences sampled at later time points when there is evidence of a narrow bottleneck. For example, in Fischer et al.  we were able to isolate the escape lineages after the immune response had begun and applied the tool to estimate the timing of each lineage. The tool has been used primarily on large HIV-1 data sets [3, 16], though it can be used on any population that grows in a similar fashion, as appears to be the case of HCV for instance .
Our tool enables quantitative characterization of acute infection samples and can be usefully applied in large scale vaccine and prophylaxis studies where estimates of the time since the MRCA and/or the timing of the onset of host selection can be extremely informative. The tool can rapidly detect whether the mutational distribution in a set of HIV sequences is consistent with a star phylogeny and/or a Poisson model, indicative of a population evolving from a single ancestor, with lineage independent mutations under no differential selection of surviving forms. If the model is violated, the tool automatically evaluates whether this is a consequence of APOBEC mediated substitutions. When the model is satisfied, it can be be used to estimate times to the most recent common ancestor of the lineage, rapidly providing timing estimates that are in good accord with coalescent methods. The speed and simplicity of the algorithm enables it to be applied to massive data sets obtained through ultra-deep sequencing methods [6, 20]. We have used it on tens of thousands of sequences  to estimate the time since the MRCA and also, at later time points, to estimate the timing of escape lineages, which arose after the onset of selection. The tool can be used on any other population which presents similar evolutionary patterns .
This work was supported by the Los Alamos National Laboratory Directed Research and Development program, the Center for HIV/AIDS Vaccine Immunology, NIH grants U19-AI067854-05, AI28433-19, RR06555-18, and by NIAID via NIH-DOE interagency agreement (Y1-AI-1500-01). We wish to thank Brian Gaschen for technical support.
- Wolinsky SM, Wike CM, Korber BT, Hutto C, Parks WP, et al.: Selective transmission of human immunodeficiency virus type-1 variants from mothers to infants. Science 1992, 255: 1134–1137. 10.1126/science.1546316View ArticlePubMedGoogle Scholar
- Delwart EL, Magierowska M, Royz M, Foley B, Peddada L: Homogeneous quasispecies in 16 out of 17 individuals during very early HIV-1 primary infection. AIDS 2001, 15: 1–7. 10.1097/00002030-200101050-00003View ArticleGoogle Scholar
- Keele BF, Giorgi EE, Salazar-Gonzalez JF, Decker JM, Pham KT, et al.: Identification and characterization of transmitted and early founder virus envelopes in primary HIV-1 infection. Proc Natl Acad Sci USA 2008, 105: 7552–7557. 10.1073/pnas.0802203105View ArticlePubMedPubMed CentralGoogle Scholar
- Ribeiro RM, Qin L, Chavez LL, Li D, Self SG, Perelson AS: Estimation of the initial viral growth rate and basic reproductive number during acute HIV-1 infection. J of Virology 2010, (12):6096–102. 10.1128/JVI.00127-10Google Scholar
- Drummond AJ, Rambaut A: BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol 2007., 7(214):Google Scholar
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, et al.: Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005, (437):376–380.Google Scholar
- Lee HY, Giorgi EE, et al.: Modeling sequence evolution in acute HIV-1 infection. J Theor Biol 2009, 2(261):341–360. 10.1016/j.jtbi.2009.07.038View ArticleGoogle Scholar
- Poisson fitter[http://www.hiv.lanl.gov/content/sequence/POISSON_FITTER/poisson_fitter.html]
- Slatkin M, Hudson RR: Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics 1991, (129):555–562.Google Scholar
- Hahn S, Zhong XY, Troeger C, Burgemeister R, Gloning K, Holzgreve W: Current applications of single-cell PCR. Cell Mol Life Sci 2000, (57):96–105. 10.1007/s000180050501Google Scholar
- Bourara K, Liegler TJ, et al.: Target cell APOBEC3G can induce limited G-to-A mutation in HIV-1. PLoS Path 2007, (3):1477–1485.Google Scholar
- Simon V, Zennou V, et al.: Natural variation in Vif: differential impact on APOBEC3G/3F and a potential role in HIV-1 diversification. PLoS Path 2005, 1(1):e6. 10.1371/journal.ppat.0010006View ArticleGoogle Scholar
- Wood N, Bhattacharya T, Keele BF, Giorgi EE, Liu M, Gaschen B, Daniels M, Ferrari G, Haynes BF, McMichael A, Shaw GM, Hahn BH, Korber B, Seoighe C: HIV evolution in early infection: selection pressures, patterns of insertion and deletion, and the impact of APOBEC. PLoS Path 2009, 5(5):414.View ArticleGoogle Scholar
- Wakeley J: Coalescent Theory, an Introduction. Roberts and Co; 2008.Google Scholar
- Fischer W, Bhattacharya T, Keele BF, Giorgi EE, Hraber PT, Perelson AS, Shaw GM, Korber BT, et al.: Rapid mutational escape from cytotoxic T-cell responses in acute HIV-1 infection--an ultra-deep view. PLoS ONE, in press.
- Salazar-Gonzalez JF, Salazar MG, Keele BF, Learn GH, Giorgi EE, et al.: Genetic identity, biological phenotype, and evolutionary pathways of transmitted/founder viruses in acute and early HIV-1 infection. Journal of Experimental Medicine 2009, 206(6):1273–1289. 10.1084/jem.20090378View ArticlePubMedPubMed CentralGoogle Scholar
- Stafford MA, Corey L, Cao Y, Daar ES, Ho DD, et al.: Modeling plasma virus concentration during primary HIV infection. J Theor Biol 2000, (203):285–301. 10.1006/jtbi.2000.1076Google Scholar
- Mansky LM, Temin HM: Lower in vivo mutation rate of human immunodeficiency virus type 1 than that predicted from the fidelity of purified reverse transcriptase. J Virol 1995, (69):5087–5094.Google Scholar
- Tsibris AM, Korber B, Arnaout R, Russ C, Lo CC, Leitner T, Gaschen B, Theiler J, Paredes R, Su Z, Hughes MD, Gulick RM, Greaves W, Coakley E, Flexner C, Nusbaum C, Kuritzkes DR: Quantitative deep sequencing reveals dynamic HIV-1 escape and large population shifts during CCR5 antagonist therapy in vivo. PLoS ONE 2009, 5(4):5683. 10.1371/journal.pone.0005683View ArticleGoogle Scholar
- Wang GP, Sherrill-Mix SA, Chang KM, Quince C, Bushman FD: Hepatitis C virus transmission bottlenecks analyzed by deep sequencing. J Virol 2010, 84(12):6218–28. 10.1128/JVI.02271-09View ArticlePubMedPubMed CentralGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.