Error correction and statistical analyses for intrahost comparisons of feline immunodeficiency virus diversity from highthroughput sequencing data
 Yang Liu^{1, 4},
 Francesca Chiaromonte^{1, 4},
 Howard Ross^{5},
 Raunaq Malhotra^{6},
 Daniel Elleder^{2, 4, 7} and
 Mary Poss^{2, 3, 4}Email author
Received: 6 March 2014
Accepted: 29 April 2015
Published: 30 June 2015
Abstract
Background
Infection with feline immunodeficiency virus (FIV) causes an immunosuppressive disease whose consequences are less severe if cats are coinfected with an attenuated FIV strain (PLV). We use virus diversity measurements, which reflect replication ability and the virus response to various conditions, to test whether diversity of virulent FIV in lymphoid tissues is altered in the presence of PLV. Our data consisted of the 3′ half of the FIV genome from three tissues of animals infected with FIV alone, or with FIV and PLV, sequenced by 454 technology.
Results
Since rare variants dominate virus populations, we had to carefully distinguish sequence variation from errors due to experimental protocols and sequencing. We considered an exponentialnormal convolution model used for background correction of microarray data, and modified it to formulate an error correction approach for minor allele frequencies derived from highthroughput sequencing. Similar to accounting for overdispersion in counts, this accounts for errorinflated variability in frequencies – and quite effectively reproduces empirically observed distributions. After obtaining errorcorrected minor allele frequencies, we applied ANalysis Of VAriance (ANOVA) based on a linear mixed model and found that conserved sites and transition frequencies in FIV genes differ among tissues of dual and single infected cats. Furthermore, analysis of minor allele frequencies at individual FIV genome sites revealed 242 sites significantly affected by infection status (dual vs. single) or infection status by tissue interaction. All together, our results demonstrated a decrease in FIV diversity in bone marrow in the presence of PLV. Importantly, these effects were weakened or undetectable when error correction was performed with other approaches (thresholding of minor allele frequencies; probabilistic clustering of reads). We also queried the data for cytidine deaminase activity on the viral genome, which causes an asymmetric increase in G to A substitutions, but found no evidence for this host defense strategy.
Conclusions
Our error correction approach for minor allele frequencies (more sensitive and computationally efficient than other algorithms) and our statistical treatment of variation (ANOVA) were critical for effective use of highthroughput sequencing data in understanding viral diversity. We found that coinfection with PLV shifts FIV diversity from bone marrow to lymph node and spleen.
Keywords
Background
The dynamics of lentiviral infection within a host have been intensively studied because they reveal important temporal and spatial features of virushost interaction [1–3]. These interesting dynamics arise largely due to the unique lentiviral life history strategy that leaves a DNA copy (a provirus) of the viral RNA in the genome of an infected cell. The mutational spectrum attributed to a lentivirus population arises from errors introduced during the conversion of the RNA genome to DNA. Viral population structure is evident across tissues of a lentivirusinfected host in part due to the error rate in reverse transcription of the viral genome and in part due to the targeted movement of infected cells to specific tissues [1, 4]. The primary cells for lentiviral infection include monocytes and T cells, which migrate between sites of inflammation in tissue and lymphoid organs via the blood. Differential movement and subsequent activation of infected cells determine the diversity of viruses among host tissues. If an infected cell is activated at a site of inflammation, new progeny lentiviruses can be produced and infect naïve cells recruited to the site. Newly infected cells in a tissue will contain a provirus differing at several sites from the parental virus and indicate that virus replication has occurred. In contrast, expansion of an infected cell without production of virus and cell reinfection will increase the provirus census size without virus replication or increase in virus diversity. Thus virus diversity in a tissue will change depending on the number of infected cells migrating into or out of a tissue and virus replication in the tissue even if there is no change in census size.
Changes in lentivirus population dynamics under different conditions inform mechanisms that contribute to infection outcome. For example, cats experience insidious immune system dysfunction when infected with the lentivirus feline immunodeficiency virus (FIVfca; FIV hereafter) [5, 6], which presents a similar clinical profile as human immunodeficiency virus infected humans [6–8]. Cats infected with FIV derived from cougars (FIVpco; strain PLV) do not develop disease [9] but are protected against the loss of CD4 T cells, which is an indicator of FIVinduced immune dysfunction [10]. Data indicate that innate immunity may play a key role in PLV protection [9], and that humoral immune parameters do not significantly differ between cats infected with both FIV and PLV and those infected with FIV alone [11, 12]. We also observed that the dynamics of FIVinfected cells in the blood are significantly different in dual and single infection [13]. Cats infected with both FIV and PLV have a lower effective population size of FIV and the FIV population undergoes a bottleneck at 3–5 weeks post infection that further reduces the effective population size. In the present study, we investigate mechanisms contributing to the differential outcomes of FIV infection in the presence and absence of PLV by testing the hypothesis that FIV diversity in immune tissues is altered in the presence of PLV.
Next generation sequencing (NGS) approaches have recently been applied to study diversity in viral populations. These studies have identified low frequency substitutions (e.g. those that may be associated with drug resistance [14–16]) and changes in viral population diversity within hosts during infection [17–20] utilizing innovative methods to reconstruct individual viral haplotypes from NGS sequences [21, 22].
Nevertheless, evaluating diversity in a viral population from NGS data remains challenging because of the short size of the reads and the presence of errors whose rates are higher than in Sanger sequencing [23] and vary with the specific sequencing platform. The development of algorithms to differentiate between errors and actual genetic variants and/or to perform error correction in NGS data is an active area of research. These algorithms commonly utilize error thresholds [24] or Poisson/Binomial error distributions [18, 19, 25, 26] which can be sitespecific. Thresholds or error distribution parameters are fixed in a variety of ways; e.g. using values from existing literature [24], values derived from quality scores [19, 26], or values estimated by computing errors from sequences of cloned samples obtained under conditions matching those of the samples under consideration [18, 25].
Importantly, these algorithms are used to detect “true” single nucleotide variants in the presence of error, rather than to reduce the contribution of errors to the observed signals of interest. In our analyses, these signals are minor allele frequencies; that is, the frequencies with which nucleotides different from the reference occur at any given position in the viral genome – so the problem can be rephrased by saying that, in addition to be able to separate “truly” non0 frequencies from those that are non0 only due to error, one would want to clean the former from error that they, too, contain. Unless observed signals are properly corrected, they will reflect a combination of errors and “true” biological signals – with an inflated variability that can influence subsequent statistical analyses. When the signals of interest are read counts (e.g. from RNASeq or ChIPSeq), this inflation results in socalled overdispersion [27–29]; since a Poisson distribution, where the variance is bound to equal the mean, is inadequate to model these counts, researchers often switch to a Negative Binomial distribution, where the variance can be larger than the mean [29]. Of course the issue of how to model simultaneously “true” biological signals and errors superimposed to them concerns also other quantities derived from highthroughput sequencing data; e.g., the minor allele frequencies we are interested in. Our error correction approach is based exactly on developing such a model.
We note here the existence of another broadly used class of algorithms that specifically target error correction working at the level of the reads. These utilize probabilistic clustering of reads within overlapping sequence windows; reads are aligned to the reference and error correction is accomplished by converting each aligned read to the consensus or the cluster centroid in the window [17, 21, 30], or removing rare reads [31]. While these algorithms have been shown to reduce perbase error, they are computationally expensive and work most effectively when sequencing error rates are substantially lower than substitution rates underlying the data [32].
In this study we use data generated by Roche 454 sequencing of partial viral genomes to investigate whether the presence of PLV changes the FIV population diversity in several lymphoid tissues of the cat host. Virus populations evolving over short infection times likely comprise a large number of very lowfrequency minor alleles – with the ranges of “true” signals and errors substantially overlapping. As thresholding does not really accomplish error correction, and algorithms such as [17, 21, 30, 31] may not be effective in these settings, it was paramount for us to develop an alternative error correction approach.
To do so, we borrowed an idea used in background correction of microarray data. For such data, one can simultaneously model “true” signals and errors with an exponentialnormal convolution, and perform error correction (i.e. reduce the error portion of the observed signals) using conditional expectations [33, 34]. Minor allele frequencies are continuous quantities, but their empirically observed distributions (especially when very low frequencies are abundant) are not necessarily well represented by an exponentialnormal convolution. We therefore adapted the original idea modifying the model as to match empirically observed distributions.
After correcting minor allele frequencies with our approach, we analyzed them with an ANOVA framework and found significant evidence for tissue differences in FIV population diversity in the presence and absence of PLV. Importantly, when the data was processed by thresholding minor allele frequencies or running ShoRAH (an error correction algorithms based on probabilistic clustering of reads) [21, 30] this evidence was reduced or lost.
Results
Error correction
In this study we use NGS data to characterize FIV genetic diversity in tissues of infected cats in the presence or absence of PLV. Genetic diversity reflects the selective pressures, migration, and growth experienced by a virus population. Virus populations replicating under weak selection pressure will have an abundance of rare variants, which can carry key information about evolutionary processes. In this situation, removing minor allele frequencies below a threshold to eliminate errors may in fact remove relevant signal from the data  and also more sophisticated error correction algorithms (e.g., based on probabilistic clustering of reads) may not be effective. Thus, before analyzing the data with an ANOVA framework, we developed and implemented our own error correction approach.
All sequencing reads from each library were aligned to a reference genome representing the cloned virus used in the infection experiments. Deletions and insertions in sequence tags were not considered, so for each nucleotide position along the genome there were 4 possible alleles (A, C, G, T) – one representing the reference, and 3 representing minor alleles. We computed frequencies of each minor allele at each position from each of 12 libraries (two libraries were available per ANOVA “treatment”; i.e. tissue and infection status combination).
To perform error correction, we started by considering an exponentialnormal convolution model, as in the Robust MultiArray Average (RMA) software [35–37]. RMA was developed for microarray data, which consists of continuous values of fluorescent intensities, and cannot be applied to discrete read counts data – e.g. from RNAseq. However, with appropriate modifications, the RMA approach is suitable for our study because minor allele frequencies, albeit derived from read counts produced by DNASeq, are continuous quantities. Compared to thresholding, which only removes very low observed frequencies, a model that convolutes “true” signal and error allows us to account for the increased variability due to the latter, and to correct observed minor allele frequencies of any size by calculating conditional expectations – as is done in RMA (see Methods). Moreover, we can operate directly on minor allele frequencies instead of going back to reads and attempt to modify or remove erroneous ones – as in error correction algorithms based on probabilistic clustering. As we show below, utilizing a convolution model for the frequencies is computationally much faster, and in fact more effective for our type of data.
Parameter Estimates for the ExponentialNormal Convolution Model
Library pairs  \( {\widehat{\alpha}}^a \)  \( {\widehat{\mu}}^b \)  \( {\widehat{\sigma}}^c \)  Total Reads  Average Coverage 

1 & 2  637  3.2×10^{4}  5.1×10^{5}  63439  1810 
3 & 4  750  2.4×10^{4}  4.7×10^{5}  90839  2775 
5 & 6  810  1.4×10^{4}  1.7×10^{5}  133801  4236 
7 & 8  813  1.8×10^{4}  2.1×10^{5}  132589  4078 
9 & 10  865  1.4×10^{4}  3.6×10^{5}  177873  5530 
11 & 12  888  2.1×10^{4}  6.6×10^{5}  118396  3916 
All  777  1.6×10^{4}  5.1×10^{5}  716937  3724 
To better capture the peak at very low minor allele frequencies, we introduced a point mass at 0 with proportion p as part of the modeling of the “true” minor allele frequencies. Our modified model is thus represented by a mixture of a point mass at 0 (weight p) and an exponential (weight (1p)) providing the signal, which is then convoluted with a normal noise. Note that there are now four parameters to estimate in the modified model: the proportion p, the exponential rate α, and the normal mean μ and standard deviation σ. We did not attempt their joint estimation on all library pairs; instead, we proceeded as follows. We considered the estimates of α, μ and σ obtained with the original RMA on the library pair 1 and 2; these were \( \widehat{\alpha} \) = 637, \( \widehat{\mu} \) = 3.2 (10^{−4}) and \( \widehat{\sigma} \) = 5.1 (10^{−5}). Because this pair had a good fit to the exponentialnormal convolution model, and because all libraries share the same preparation protocols and sequencing platform, we fixed those values for all other library pairs, and then estimated p separately for each pair using a grid search between 0 and 1 to find a satisfactory match between empirical and simulated distributions (see inserts in Fig. 2; the same was done for the pool of all libraries – insert in Fig. 1). The very good match between histograms and distributions simulated from the modified model with parameter values selected as described above (solid curves in Figs. 1 and 2) suggests that our approach works well: an exponentialnormal convolution model with parameters α, μ and σ estimated from the library pair 1 and 2, when appropriately “spiked” at 0 with a proportion p specific to each library pair, does provide a good reconstruction of the empirical distributions for all library pairs.
Minor Alleles Before and After Removal
Library  No. minor alleles  Minimum frequency  No. minor alleles 

(before correction)  (before correction)  (after correction)  
1  6253  2.4×10^{4}  6253 
2  6526  2.6×10^{4}  6526 
3  7911  1.8×10^{4}  7666 
4  7083  1.9×10^{4}  6919 
5  7314  1.7×10^{4}  7092 
6  8588  1.1×10^{4}  6887 
7  7839  1.7×10^{4}  7280 
8  8344  1.4×10^{4}  7390 
9  7581  1.7×10^{4}  7469 
10  10965  0.7×10^{4}  8915 
11  7711  1.5×10^{4}  7224 
12  8744  1.1×10^{4}  7964 
In the second error correction stage, we adjusted all remaining minor allele frequencies using the exponentialnormal convolution model with parameters \( \widehat{\alpha} \) = 637, \( \widehat{\mu} \) = 3.2×10^{−4} and \( \widehat{\sigma} \) = 5.1×10^{−5}. Specifically, we replaced each observed frequency with the conditional expectation of the signal given the frequency from the convolution model (see Methods). In summary, our error correction procedure first removes minor alleles frequencies likely to be due to error alone, and then adjusts the remaining minor allele frequencies as to reduce the error component they carry. The resulting errorcorrected minor allele frequencies were used in the subsequent ANOVA analyses.
FIV genetic diversity across tissues and single or dual infection status
Our previous results demonstrated that FIV effective population size in peripheral blood cells is lower in the presence of PLV [13]. Because T cells in the blood are in transit between tissues and lymphoid organs, we hypothesized that the decrease in FIV effective population size in the blood was due to an affect of PLV on FIV tissue replication and migration. Since each cat in the study provided multiple observations (from the three sampled tissues) we adopted a splitplot ANOVA scheme, which comprised the fixed effects of tissue and infection status (single or dual), their interaction, and a random effect for cats nested within infection status (see Methods). Here we concentrate on infectionrelated effects, i.e. infection status fixed effects and tissue by infection status interaction effects, as a direct test of the hypothesis. The results for tissue fixed effects are provided in Figure S1 (see Additional file 1). Among several genetic diversity measures computed from the error corrected minor allele frequencies, which could be used as response variables in our analysis, we considered total number of conserved positions, frequency of transitions and frequency of transversions for the complete 3′ genome, and separately for each of the four constituent genes. We also considered the substitution frequencies at each of the 4603 nucleotide positions in the 3′ portion of the FIV genome under evaluation. Each response was appropriately transformed to be amenable for ANOVA analysis (see Methods).
Substitutions Matrix Of Minor Alleles With Frequencies Significantly Affected By InfectionRelated Effects
To  A  C  G  T 

From  
A    37  23  27 
C  14    14  10 
G  11  24    20 
T  26  7  29   
Performance of our error correction approach
To evaluate our error correction approach we considered its sensitivity in terms of number of retained minor alleles and, relatedly, the statistical power it affords in detecting effects on FIV diversity through ANOVA. We processed the data with simple thresholding and repeated our ANOVA analyses. If we remove minor allele frequencies using a threshold of ≤ 1 %, which is commonly done [24], we lose most of the minor alleles in our data and practically all significant effects in ANOVA . If we use a threshold of ≤ 0.05 %, the reported 454platform substitution error rate [38], we retain more alleles and some significant effects, but still lose power to detect interactions at gene and genome scales. If we use a threshold of 0.023 % (i.e. the cutoff used in the first stage of our correction procedure, but without implementing the second stage where minor allele frequencies above threshold are corrected) we retain significant effects at gene and genome scales but with elevated pvalues (weakened significance) compared to the full implementation of our approach (full comparisons in terms of number of detected minor alleles and ANOVA pvalues are provided in Tables S1 and S2, see Additional file 3). Moreover, without correcting frequencies above threshold, we would likely increase false positives in the individual site analysis by unduly increasing frequency differences at sites where some minor alleles fall below threshold in some libraries and above in some others. These arguments suggest that even selecting a threshold based on model considerations and the data at hand is not enough; an effective error correction approach must also account for the error component carried by minor allele frequencies that survive the cutoff.
We also compared our approach to ShoRAH, which performs error correction using probabilistic clustering of reads [21, 30]. Similar to 1 % thresholding, ShoRAH retained very few minor alleles (Table S1, see Additional file 3) and led to lower power in detecting interaction and tissue effects at gene and genome scales – with only one tissue effect identified as significant (Table S2, see Additional file 3). Importantly, we also compared computational burden (running time of the correction steps; Table S3, see Additional file 3). Overall, our approach took less than a second per library, while ShoRAH took hours. This very large difference is due to the fact that ShoRAH performs expensive operations on a huge number of reads, while our approach perform inexpensive operations directly on the minor allele frequencies. Thus, for studies focusing on minor allele frequencies our approach, in addition to being more sensitive and affording higher power in subsequent statistical analyses, is computationally much more convenient.
Enzymatic deamination and asymmetric substitution analyses
As a second approach, we individually evaluated a cytidine deaminase recognition triplet GGA, which is present at 146 trinucleotide sites of the 3′ portion of the FIV genome. Using a Wilcoxon signed rank test we found that substitutions of the first G in the GGA triplet to A occur at significantly higher frequencies in the absence of PLV (pvalue < 0.001), while substitutions to A at the second G in the GGA triplet do not show a significant difference between FIV sequences from single and dual infection.
Discussion
The most pronounced effect of a concurrent PLV infection on FIV population dynamics in peripheral blood cell is a decrease in FIV effective population size and a transient population bottleneck within a month of FIV infection [13]. We hypothesized that changes observed previously in blood could reflect differences in tissue replication of FIV and/or migration patterns of FIVinfected cells when PLV is present. Our results provide evidence that the presence of PLV affects FIV population dynamics in tissues, with an apparent shift of replication or immigration of FIVinfected cells from bone marrow to both spleen and mesenteric lymph node.
High throughput sequencing approaches provide an opportunity to evaluate the full mutational spectrum in the viral population but analysis of short read data is challenging. Analysis of our experimental data was additionally difficult because cats were inoculated with a cloned virus. FIV evolutionary rates are estimated at 3 x 10^{−3} substitutions per site per year [43]. Because few of these substitutions become fixed in the population, we expected an abundance of low frequency variants in a population of replicating viruses. Thus, in order to be able to fully exploit the rich information in NGS data to address our hypothesis, we needed to properly account for error in our libraries, which would obscure the signal from low frequency substitutions.
Both our error correction approach and subsequent statistical treatment of the data advance analysis of viral diversity based on high throughput sequencing data. Instead of removing minor allele frequencies below a threshold, which we demonstrate can eliminate or weaken signal, we considered a convolution model that combines an exponential signal with normal error, and allows us to derive errorcorrected variant frequencies as estimated conditional expectations. This idea was first implemented in the RMA software for application to microarray data [33, 34]. However, in order to effectively apply it to minor allele frequencies from our sequencing data, which has an abundance of very small frequencies, we had to modify the model incorporating an “extra amount” of 0signal. We showed that this modification works quite effectively, in the sense that the modified model reproduces the observed frequency distributions, and that it is more critical the higher the coverage of a library – since higher coverage tends to increase the number of very small frequencies. We note that parameter estimation for the modified model is implemented in an “ad hoc” manner; a more rigorous estimation procedure is left for future development.
Analytically, using a model that convolutes “true” signal and error accounts for errorinduced inflation of variability in the observed signal (see Methods). This is logically similar to accounting for overdispersion (e.g. due to sequencing errors, sample preparation protocols, etc.) in stateoftheart approaches for analyzing read counts from NGS [27–29]. In effect, an appropriate convolution model allows one to exploit information in very small frequencies when correcting larger ones.
Compared to simple thresholding of minor allele frequencies, our approach is more sensitive and affords increased statistical power in subsequent statistical analyses. Our approach also appears to guarantee better sensitivity and statistical power than ShoRAH – likely because our data is characterized by an abundance of very small minor allele frequencies, which approaches based on probabilistic clustering of reads tend to overcorrect. In addition, our approach is orders of magnitude faster – because correcting at the level of the aligned reads can be very timeconsuming, especially for libraries with high coverage and deep sequencing. In contrast, our approach corrects directly the minor allele frequencies requiring only a few, computationally fast estimation and correction steps.
In summary, our correction approach does manage to effectively exploit information in very abundant small minor allele frequencies – which is not exploited by, and in fact hinders, other recently proposed algorithms [17, 21, 30].
This makes it more sensitive and affords increased power to ascertain biological effects in subsequent statistical analyses. Our approach is also computationally much leaner, providing huge running time gains.
High rates of asymmetric substitution can be due to a host defense mechanism that entails enzymatic deamination of cytidine residues in the viral genome, resulting in an excess of G to A mutation [44]. Viral replication is disrupted through the production of premature termination codons in viral proteins or the creation of proteins with subfunctional folding properties. FIV can replicate in cells expressing cytidine deaminase because it encodes an accessory protein, vif, which protects the viral genome from cytidine deamination, in part by increasing enzyme degradation [45, 46]. The vif from PLV, which is derived from a cougar, does not protect the PLV genome from domestic cat cytidine deaminase. Because there is evidence for cytidine deamination of the PLV genome [39], we reasoned that elevated levels of cytidine deaminase might reduce FIV replication in some tissues in a coinfection with PLV. We queried our comprehensive viral sequence data set for evidence of asymmetry in general, and of an increase in the minor allele frequency of G to A substitutions at an enzyme trinucleotide target site, GGA. Our data do not support an increase in G to A substitution frequency in the FIV genome in the presence of PLV. In fact, a significant increase in G to A substitutions at the first G in the trinucleotide target GGA was detected in the FIV genome in single infections. While these data do support that processes acting on the FIV genome differ in the presence and absence of PLV, the lack of evidence for G to A substitution bias in the FIV genome in dual infection suggests that restriction by cytidine deaminase activity is not the primary mechanism for observed changes in tissue dynamics.
High throughput sequence analysis of retrovirus genomes affords a unique perspective on viral dynamics. Because retroviruses integrate their genome into the genome of the infected cell and the target cells for retroviral infection are migratory, an increase in the overall virus population diversity of a sampled tissue can be caused either by active virus replication in the tissue, with ensuing integration into susceptible cells, or by recruitment of migratory cells, which were infected at a different time or location. Our ANOVA results support a significant tissue by treatment interaction. The FIV population diversity is highest in bone marrow in the absence of PLV as evidenced by more genomic sites affected by substitutions in the dUTPase/Integrase portion of pol and env and by an increased frequency of transitions in orfA. In addition the substitution (minor allele) frequencies were higher in FIV derived from bone marrow of infected cats. In contrast, in the presence of PLV, FIV population diversity is decreased in bone marrow and elevated in spleen and lymph node, suggesting a shift in target tissue for virus replication. Collectively, these results support our hypothesis that a primary mechanism by which PLV attenuates virulent FIV infection is altering the withinhost dynamics of infected cells and/or virus replication. By reducing immigration of infected cells and/or FIV replication in bone marrow, PLV could confer protection to hematopoietic cells essential to maintain immune system health. Our analyses, which rely on a thorough procedure to account for errors in NGS, provide an important advance in using high throughput approaches to interrogate tissue specific virus replication in different treatment regimes.
Conclusions
This article demonstrates a systematic framework to access the full frequency spectrum of genomic diversity in viral populations based on highthroughput sequencing of viral genomes. We address the problem of separating signals from the errors intrinsic in NGS technologies, which is critical to understand the underlying biological phenomena. The issue is particularly important when it is necessary to use information from low frequency variants. We propose an error correction approach that is easy to implement, computationally fast, and provides good performance in distinguishing rare variant signals within data characterized by an abundance of very small minor allele frequencies. As a consequence, the approach also guarantees good statistical power when using ANOVA based on linear mixed models on the errorcorrected data. Doing so, we find evidence that FIV population dynamics change among tissues in the presence of PLV.
Methods
Viral sequence data
Library Information
Library  Cat number  Infection  Tissue 

1  97/99  Single  Spleen 
2  02/06  Single  Spleen 
3  89/93  Dual  Spleen 
4  03/05  Dual  Spleen 
5  97/99  Single  Lymph node 
6  02/06  Single  Lymph node 
7  89/93  Dual  Lymph node 
8  03/05  Dual  Lymph node 
9  97/99  Single  Bone marrow 
10  02/06  Single  Bone marrow 
11  89/93  Dual  Bone marrow 
12  03/05  Dual  Bone marrow 
Sequence preprocessing
The average length of raw reads was 299 bp. We applied a strict read quality filter using a threshold of 0.02 in CLC Genomics Workbench (version 5.1) [47], which resulted in highquality reads with average length of 263 bp. Using the same software, these were mapped to the complete FIVC36 genome (9466 bp), which is the sequence of the cloned virus used to infect cats in this study. Mapping parameters were implemented as follows: Insertion Cost = 3; Deletion Cost = 1; Mismatch Cost = 2; Length Fraction = 0.9; Similarity = 0.9; Global Alignment; Ignoring Nonspecific Match. After alignment, all insertions in reads leading to a gap in the reference sequence were removed to maintain the reference at 9466 bp, and the corresponding 4603 bp sequence from the 3′ half of FIV genome was utilized for subsequent analyses.
Error correction using a convolution model
The dataset corresponding to each of the 12 libraries consists of rows, representing the read coverage, and columns, representing sequence variants in the reads, for every position in the 3′ portion of the FIVC36 genome. The frequency of each variant (minor allele) at every position in each library is regarded as a combination of a “true” biological signal and error; our aim is to correct for the latter.
The Robust Multichip Average (RMA) software in Bioconductor [48] proposes a background correction procedure for genomewide microarray data based on an exponentialnormal convolution model. For the purpose of parameter estimation, RMA treats the frequencies at or below the overall frequency mode as the “left half” of a normal error distribution with mean μ and standard deviation σ. μ is thus estimated by the overall mode itself, and σ by “doubling” the spread on its left (see [34]). We actually truncated the normal at 0 to better reflect the absence of negative frequencies when estimating the standard deviation. Frequencies above the overall mode are treated as reflecting, by and large, an exponential “true” signal distribution with rate α – shifted to the right by μ (Figure S3, see Additional file 5). Following the RMA75 implementation [34], we took the 75^{th} percentile of all frequencies minus the estimated μ as the 75 % percentile to anchor an exponential distribution, and estimated α through its cumulative distribution function. Simulation studies show that this approach guarantees a conservative and robust estimation of the signal rate [34].
For the purpose of error correction, we first remove from each library all minor allele frequencies below the 0.1 % percentile of the exponentialnormal convolution model estimated on the library pair 1 and 2 (this corresponds to a threshold of 0.023 %). The removed frequencies are “reassigned” to the reference nucleotide at their sequence positions. Second, we adjust the remaining minor allele frequencies using the exponentialnormal convolution model estimated on the library pair 1 and 2. Each observed frequency is replaced with the conditional expectation of the signal given the observed frequency itself, as illustrated in the scheme below. All calculations are implemented in the statistical computing environment R, version 2.15 [49].
Scheme for error correction

Set up the exponentialnormal convolution model X = S + E, where X is the (observable) variant frequency, S the “true” signal ~ Exponential (α), and E the error ~ Normal (μ, σ^{2}) independent of S.

The expected value of the signal given the frequency is:

To compute the errorcorrected frequency, fix a quantile q of the estimated ExponentialNormal convolution distribution (we used q = 0.1 %), and set:

For all cases with x _{ corrected } = 0, attribute the frequency x back to the reference allele.
Accounting for error variability in a convolution model
The “true” exponential signal has variance Var(S) = α^{−2}. Convolution with the independent normal error inflates the variance of the observed signal to Var(X) = Var(S + E) = Var(S) + Var(E) = α^{−2} + σ^{2}. The model thus accounts for error variability, and so does the correction based on it – where the observed signal is replaced with the expected value of the “true” signal given the observed signal itself.
Error correction using ShoRAH
ShoRAH is a software to perform error correction, reconstruct viral haplotypes, and estimate their relative frequencies in a population. It takes as input a reference genome and a set of reads aligned to the reference. Error correction is performed with probabilistic clustering of reads in a moving window, using a Dirichlet Process Mixture Model (DPM) [21, 30]. Utilizing the same reads alignments from which we derived the minor allele frequencies then corrected with our approach, we ran the shorah.py script with default parameters (window size = 201 bp; α parameter for the DPM = 0.1) on each of our 12 libraries. We took error corrected reads from the intermediate output files marked by the suffix “_cor.fas”, realigned them to the reference using the CLC Genomics Workbench (see Sequence Preprocessing above), recomputed minor allele frequencies, and repeated the ANOVA analyses (see below) across the 12 libraries. We then compared results, in terms of number of minor alleles surviving correction in each library (Table S1, see Additional file 3), in terms of ANOVA pvalues (Table S2, see Additional file 3), and in terms of running time of the correction steps (Table S3, see Additional file 3) to those obtained with our error correction approach.
Analysis of variance
As responses (y) we take measurements reflecting virus replication in each environment (e.g. number of conserved positions, transition frequency, transversion frequency, individual substitution frequency). We consider these at the genome level, at the gene level, and at individual sites along the genome. We also transform them by natural logarithm (for counts) and logit (for frequencies) to satisfy the basic assumption underlying ANOVA (zero counts or frequencies are shifted right by a very small amount prior to transformation).
When running multiple tests (e.g. on ANOVA effects for single genomic positions) we employ the Benjamini Hochberg method for False Discovery Rate (FDR) control on the expected proportion of incorrectly rejected null hypotheses. All calculations are implemented in the statistical computing environments SAS, version 9.2 [50] and R, version 2.15 [49].
Asymmetric substitutions analysis
This analysis identifies the cumulative effect of mutation, drift and selection that has occurred on FIV as it evolves from the sequence used to initiate infection. Rates of substitution were estimated from a nucleotide association matrix in which the columns represent the nucleotides occurring in the infecting (reference) strain and the rows represent the nucleotides occurring among the aligned reads. The cells contained the frequency of the associations computed along a sliding window of the alignment and centered on each site. The variablewidth window was symmetrical about the site with a width the minimum sufficient to include all four nucleotide bases in the reference sequence. The cell frequencies in each column (reference base) were scaled by dividing by the number of times that base occurred in the window in the reference sequence.
The nucleotide association table for each window was tested for asymmetry using an Rlanguage script provided by Ababneh et al. [42]. This script computed the overall asymmetry [40] and partitioned it into a component due to the marginal distributions, corresponding to the test of [41], and a component due to internal asymmetry. Sites where the associated Pvalue for overall asymmetry approached zero were noted. These sites all have associated high values of the asymmetry index AI (Figure S4, see Additional file 6). While the values of AI are not independent due to the sliding window used to compute them, they offer a separate measure of asymmetry.
Wilcoxon signed rank test
Wilcoxon signed rank test, a nonparametric statistical hypothesis test, was used to compare the G to A substitution frequencies of 146 occurrences of the GGA trinucleotide between single and dual infections. The substitution frequencies were averaged over libraries corresponding to the same infection status. We performed comparisons for the first and second G separately. Wilcoxon signed rank test is used to test the null hypothesis that the median difference of G to A substitution frequencies between single and dual infections is zero.
Availability of supporting data
The data sets supporting the results of this article are included within the article (and its additional file). The original sequence data sets are available in Dryad [submission in progress].
Declarations
Acknowledgements
We thank Dr. Abinash Padhi and Brian Huylebroeck for technical assistance, Dr. James Rosenberger for discussion of statistical design, Dr. Paul Medvedev for discussion of error correction methods, Dr. Sue VandeWoude and Julie Terwee for providing samples, and the Penn State Genomics Core Facility, University Park, PA for sequencing services. We also thank the reviewers and editors who provided very useful comments on a prior version of this manuscript. This research was supported by NIH HL092791.
Authors’ Affiliations
References
 Poss M, Rodrigo AG, Gosink JJ, Learn GH, de Vange PD, Martin Jr HL, et al. Evolution of envelope sequences from the genital tract and peripheral blood of women infected with clade A human immunodeficiency virus type 1. J Virol. 1998;72:8240–51.PubMedPubMed CentralGoogle Scholar
 Nickle DC, Jensen MA, Shriner D, Brodie SJ, Frenkel LM, Mittler JE, et al. Evolutionary indicators of human immunodeficiency virus type 1 reservoirs and compartments. J Virol. 2003;77:5540–6.View ArticlePubMedPubMed CentralGoogle Scholar
 Salemi M, Burkhardt BR, Gray RR, Ghaffari G, Sleasman JW, Goodenow MM. Phylodynamics of HIV1 in lymphoid and nonlymphoid tissues reveals a central role for the thymus in emergence of CXCR4using quasispecies. PLoS One. 2007;2:e950.View ArticlePubMedPubMed CentralGoogle Scholar
 Blackard JT. HIV compartmentalization: a review on a clinically important phenomenon. Curr HIV Res. 2012;10:133–42.View ArticlePubMedGoogle Scholar
 Burkhard MJ, Dean GA. Transmission and immunopathogenesis of FIV in cats as a model for HIV. Curr HIV Res. 2003;1:15–29.View ArticlePubMedGoogle Scholar
 VandeWoude S, Apetrei C. Going wild: lessons from naturally occurring Tlymphotropic lentiviruses. Clin Microbiol Rev. 2006;19:728–62.View ArticlePubMedPubMed CentralGoogle Scholar
 Elder JH, Lin YC, Fink E, Grant CK. Feline immunodeficiency virus (FIV) as a model for study of lentivirus infections: parallels with HIV. Curr HIV Res. 2010;8:73–80.View ArticlePubMedPubMed CentralGoogle Scholar
 Bendinelli M, Pistello M, Lombardi S, Poli A, Garzelli C, Matteucci D, et al. Feline immunodeficiency virus: an interesting model for AIDS studies and an important cat pathogen. Clin Microbiol Rev. 1995;8:87–112.PubMedPubMed CentralGoogle Scholar
 Terwee JA, Yactor JK, Sondgeroth KS, Vandewoude S. Puma lentivirus is controlled in domestic cats after mucosal exposure in the absence of conventional indicators of immunity. J Virol. 2005;79:2797–806.View ArticlePubMedPubMed CentralGoogle Scholar
 VandeWoude S, Hageman CA, O’Brien SJ, Hoover EA. Nonpathogenic lion and puma lentiviruses impart resistance to superinfection by virulent feline immunodeficiency virus. J Acquir Immune Defic Syndr. 2002;29:1–10.View ArticlePubMedGoogle Scholar
 Terwee JA, Carlson JK, Sprague WS, Sondgeroth KS, Shropshire SB, Troyer JL, et al. Prevention of immunodeficiency virus induced CD4+ Tcell depletion by prior infection with a nonpathogenic virus. Virology. 2008;377:63–70.View ArticlePubMedPubMed CentralGoogle Scholar
 Zheng X, Carver S, Troyer RM, Terwee JA, VandeWoude S. Prior virus exposure alters the longterm landscape of viral replication during feline lentiviral infection. Viruses. 2011;3:1891–908.View ArticlePubMedPubMed CentralGoogle Scholar
 Padhi A, Ross H, Terwee J, Vandewoude S, Poss M. Profound differences in virus population genetics correspond to protection from CD4 decline resulting from feline lentivirus coinfection. Viruses. 2010;2:2663–80.View ArticlePubMedPubMed CentralGoogle Scholar
 Hoffmann C, Minkah N, Leipzig J, Wang G, Arens MQ, Tebas P, et al. DNA bar coding and pyrosequencing to identify rare HIV drug resistance mutations. Nucleic Acids Res. 2007;35:e91.View ArticlePubMedPubMed CentralGoogle Scholar
 Barzon L, Lavezzo E, Militello V, Toppo S, Palu G. Applications of nextgeneration sequencing technologies to diagnostic virology. Int J Mol Sci. 2011;12:7861–84.View ArticlePubMedPubMed CentralGoogle Scholar
 Radford AD, Chapman D, Dixon L, Chantrey J, Darby AC, Hall N. Application of nextgeneration sequencing technologies in virology. J Gen Virol. 2012;93:1853–68.View ArticlePubMedPubMed CentralGoogle Scholar
 Eriksson N, Pachter L, Mitsuya Y, Rhee SY, Wang C, Gharizadeh B, et al. Viral population estimation using pyrosequencing. PLoS Comput Biol. 2008;4:e1000074.View ArticlePubMedPubMed CentralGoogle Scholar
 Willerth SM, Pedro HA, Pachter L, Humeau LM, Arkin AP, Schaffer DV. Development of a low bias method for characterizing viral populations using next generation sequencing technology. PLoS One. 2010;5:e13564.View ArticlePubMedPubMed CentralGoogle Scholar
 Wright CF, Morelli MJ, Thebaud G, Knowles NJ, Herzyk P, Paton DJ, et al. Beyond the consensus: dissecting withinhost viral population diversity of footandmouth disease virus by using nextgeneration genome sequencing. J Virol. 2011;85:2266–75.View ArticlePubMedGoogle Scholar
 Henn MR, Boutwell CL, Charlebois P, Lennon NJ, Power KA, Macalalad AR, et al. Whole genome deep sequencing of HIV1 reveals the impact of early minor variants upon immune recognition during acute infection. PLoS Pathog. 2012;8:e1002529.View ArticlePubMedPubMed CentralGoogle Scholar
 Zagordi O, Bhattacharya A, Eriksson N, Beerenwinkel N. ShoRAH: estimating the genetic diversity of a mixed sample from nextgeneration sequencing data. BMC bioinformatics. 2011;12:119.View ArticlePubMedPubMed CentralGoogle Scholar
 Prabhakara S, Malhotra R, Acharya R, Poss M. MutantBin: Unsupervised Haplotype Estimation of Viral Population Diversity Without Reference Genome. J Comput Biol. 2013;20:453–63.View ArticlePubMedGoogle Scholar
 Shendure J, Ji H. Nextgeneration DNA sequencing. Nat Biotechnol. 2008;26:1135–45.View ArticlePubMedGoogle Scholar
 Romano CM, Lauck M, Salvador FS, Lima CR, VillasBoas LS, Araujo ES, et al. Inter and intrahost viral diversity in a large seasonal DENV2 outbreak. PLoS One. 2013;8:e70318.View ArticlePubMedPubMed CentralGoogle Scholar
 Wang C, Mitsuya Y, Gharizadeh B, Ronaghi M, Shafer RW. Characterization of mutation spectra with ultradeep pyrosequencing: application to HIV1 drug resistance. Genome Res. 2007;17:1195–201.View ArticlePubMedPubMed CentralGoogle Scholar
 Morelli MJ, Wright CF, Knowles NJ, Juleff N, Paton DJ, King DP, et al. Evolution of footandmouth disease virus intrasample sequence diversity during serial transmission in bovine hosts. Vet Res. 2013;44:12.View ArticlePubMedPubMed CentralGoogle Scholar
 Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106.View ArticlePubMedPubMed CentralGoogle Scholar
 Hashimoto TB, Edwards MD, Gifford DK. Universal count correction for highthroughput sequencing. PLoS Comput Biol. 2014;10:e1003494.View ArticlePubMedPubMed CentralGoogle Scholar
 Robinson MD, McCarthy DJ. Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–40.View ArticlePubMedGoogle Scholar
 Zagordi O, Geyrhofer L, Roth V, Beerenwinkel N. Deep sequencing of a genetically heterogeneous sample: local haplotype reconstruction and read error correction. J Comput Biol. 2010;17:417–28.View ArticlePubMedGoogle Scholar
 Skums P, Dimitrova Z, Campo DS, Vaughan G, Rossi L, Forbi JC, et al. Efficient error correction for nextgeneration sequencing of viral amplicons. BMC bioinformatics. 2012;13 Suppl(10):S6.View ArticleGoogle Scholar
 Beerenwinkel N, Gunthard HF, Roth V, Metzner KJ. Challenges and opportunities in estimating viral genetic diversity from nextgeneration sequencing data. Front Microbiol. 2012;3:329.View ArticlePubMedPubMed CentralGoogle Scholar
 Bolstad BM. Low Level Analysis of Highdensity Oligonucleotide Array Data: Background, Normalization and Summarization. Dissertation. University of California, Berkeley, Department of Statistics. 2004.Google Scholar
 McGee M, Chen Z. Parameter estimation for the exponentialnormal convolution model for background correction of affymetrix GeneChip data. Statistical applications in genetics and molecular biology. 2006; 5:Article24.Google Scholar
 Irizarry RA, Hobbs B, Collin F, BeazerBarclay YD, Antonellis KJ, Scherf U, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4:249–64.View ArticlePubMedGoogle Scholar
 Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003;31:e15.View ArticlePubMedPubMed CentralGoogle Scholar
 Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–93.View ArticlePubMedGoogle Scholar
 Niklas N, Proll J, Danzer M, Stabentheiner S, Hofer K, Gabriel C. Routine performance and errors of 454 HLA exon sequencing in diagnostics. BMC bioinformatics. 2013;14:176.View ArticlePubMedPubMed CentralGoogle Scholar
 Poss M, Ross HA, Painter SL, Holley DC, Terwee JA, Vandewoude S, et al. Feline lentivirus evolution in crossspecies infection reveals extensive GtoA mutation and selection on key residues in the viral polymerase. J Virol. 2006;80:2728–37.View ArticlePubMedPubMed CentralGoogle Scholar
 Bowker AH. A test for symmetry in contingency tables. J Am Stat Assoc. 1948;43:572–4.View ArticlePubMedGoogle Scholar
 Stuart A. A test for homogeneity of the marginal distributions in a twoway classification. Biometrika. 1955;42:412–6.View ArticleGoogle Scholar
 Ababneh F, Jermiin LS, Ma C, Robinson J. Matchedpairs tests of homogeneity with applications to homologous nucleotide sequences. Bioinformatics. 2006;22:1225–31.View ArticlePubMedGoogle Scholar
 Hayward JJ, Rodrigo AG. Molecular epidemiology of feline immunodeficiency virus in the domestic cat (Felis catus). Vet Immunol Immunopathol. 2010;134:68–74.View ArticlePubMedGoogle Scholar
 Chiu YL, Greene WC. The APOBEC3 cytidine deaminases: an innate defensive network opposing exogenous retroviruses and endogenous retroelements. Annu Rev Immunol. 2008;26:317–53.View ArticlePubMedGoogle Scholar
 Marin M, Rose KM, Kozak SL, Kabat D. HIV1 Vif protein binds the editing enzyme APOBEC3G and induces its degradation. Nat Med. 2003;9:1398–403.View ArticlePubMedGoogle Scholar
 Conticello SG, Harris RS, Neuberger MS. The Vif protein of HIV triggers degradation of the human antiretroviral DNA deaminase APOBEC3G. Curr Biol. 2003;13:2009–13.View ArticlePubMedGoogle Scholar
 CLC Genomics Workbench [http://www.clcbio.com/products/clcgenomicsworkbench/]
 Bioconductor: Open Source Software for Bioinformatics [http://www.bioconductor.org/]
 The R Project for Statistical Computing [http://www.rproject.org/]
 SAS [http://www.sas.com/]
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.