Viral quasispecies inference from 454 pyrosequencing
- Wan-Ting Poh†1,
- Eryu Xia†7,
- Kwanrutai Chin-inmanu2, 3,
- Lai-Ping Wong1,
- Anthony Youzhi Cheng1,
- Prida Malasit2, 4, 5,
- Prapat Suriyaphol2, 3,
- Yik-Ying Teo1, 6, 7, 8, 9, 10Email author and
- Rick Twee-Hee Ong1, 10
© Poh et al.; licensee BioMed Central Ltd. 2013
Received: 19 May 2013
Accepted: 15 November 2013
Published: 5 December 2013
Many potentially life-threatening infectious viruses are highly mutable in nature. Characterizing the fittest variants within a quasispecies from infected patients is expected to allow unprecedented opportunities to investigate the relationship between quasispecies diversity and disease epidemiology. The advent of next-generation sequencing technologies has allowed the study of virus diversity with high-throughput sequencing, although these methods come with higher rates of errors which can artificially increase diversity.
Here we introduce a novel computational approach that incorporates base quality scores from next-generation sequencers for reconstructing viral genome sequences that simultaneously infers the number of variants within a quasispecies that are present. Comparisons on simulated and clinical data on dengue virus suggest that the novel approach provides a more accurate inference of the underlying number of variants within the quasispecies, which is vital for clinical efforts in mapping the within-host viral diversity. Sequence alignments generated by our approach are also found to exhibit lower rates of error.
The ability to infer the viral quasispecies colony that is present within a human host provides the potential for a more accurate classification of the viral phenotype. Understanding the genomics of viruses will be relevant not just to studying how to control or even eradicate these viral infectious diseases, but also in learning about the innate protection in the human host against the viruses.
KeywordsVirus quasispecies Sequence alignment 454 pyrosequencing
Virus populations exist as pools of non-identical but related members called quasispecies . Quasispecies are associated with the error-prone replications, high mutation rates and short generation times of the evolutionary dynamics of viruses, generating the genetic diversity that allows the species to persist in their hosts . Due to the genetically labile nature of such viruses, which include most RNA virus such as dengue virus (DENV) and HIV, they often develop resistance to vaccines and antiviral drugs very quickly. The fitness of a virus, defined as the ability for a given virus variant to tolerate environmental changes and to reproduce successfully, is often reflected in the frequency . It is thus of interest to try to study and characterise the fittest variants of the quasispecies, which may lead to the development of more effective therapeutic treatments.
When compared to high-throughput next-generation sequencing (NGS) techniques, traditional Sanger capillary sequencing tends to be more time consuming and relatively more expensive per sequenced base. NGS techniques have been widely applied in de-novo sequencing, re-sequencing, metagenomics and intra-host characterization of infections pathogens. These techniques produce more sequencing fragments as compared to traditional Sanger sequencing, thus allowing more details with a much higher coverage. However, confounding results may be derived due to the PCR step, shorter read lengths and higher sequencing error rates of these sequencing fragments.
Here, we are interested in using NGS to re-sequence the virus genome to characterize the fittest variants within a quasispecies from infected patients. While the cost of NGS is falling rapidly, the shorter reads and higher sequencing error rates require more computationally intensive analyses in assembling the sequence reads and in distinguishing if each polymorphic site is a genuine biological variant or a sequencing error.
There are currently several methods for quasispecies assembly, of which the majority of these methods have been summarized in the extensive review by Beerenwinkel and Zagordi, which looked at the problems and challenges faced in deciphering viral populations through the use of “ultra-deep sequencing” and also compared several of the existing methods and their limitations . The quasispecies assembly problem can be divided into four components: (a) pre-processing of low quality reads and mapping of the filtered reads to the respective reference genome; (b) distinguishing whether each polymorphic site is the result of sequencing errors or due to a genuine mutation; (c) reconstructing full haplotypes from filtered and corrected reads; and (d) calculating the frequencies and confidence scores for the constructed haplotypes. The existing methods typically provide solutions that address a combination of at most three components in the problem of quasispecies reconstruction, frequently missing out on the pre-processing step [5-11].
More recently, Prosperi and colleagues introduced a set of formulae for the combinatorial analysis of a quasispecies . They also introduced a reconstruction algorithm based on combinations of multinomial distributions using amplicons. This was subsequently implemented into QuRe to analyze long reads through sliding windows with a Poisson error correction method . Also, QColors was published for non-contiguous reads  and QuasiRecomb, which takes into consideration recombination events that may occur in DNA viruses and RNA viruses such as HIV .
Of the approaches for quasispecies reconstruction, five methods have accompanying software that have been made publicly available:
In this paper, we aim to: (1) introduce a novel method QuasQ for reconstructing the genome sequences of the quasispecies that appropriately incorporates the base quality scores of each sequenced fragment; (2) using the base quality scores as well as the frequencies of each sequenced fragment, to derive the likelihood scores which will be used to effectively reduce the number of false positive haplotypes in the quasispecies inference. To compare the performance of QuasQ with the updated versions of ShoRAH, ViSpA, QuRe and QuasiRecomb, we performed a series of simulations generated from reference DENV serotype 1 sequence data where we vary: (i) the frequency distributions of the simulated variants; (ii) the total number of simulated variants; and (iii) the overall coverage of the data. In addition, we attempt to reconstruct the quasispecies of subtype B type-1 human immunodeficiency virus (HIV-1) pol clones from a real experiment obtained from 454 GS FLX Titanium platform . Through these, we showed that our approach, QuasQ, shows reasonable recall rate in inferring the number of true unique variants with few false positives and frequencies that lie close to that of the clinical data as compared to ShoRAH, ViSpA, QuRe and QuasiRecomb. Finally, we apply this algorithm to clinical datasets of isolated DENV consisting of all four serotypes, sequenced on the 454 Genome Sequencer FLX System machine. QuasQ and the simulated datasets are available for download at http://www.statgen.nus.edu.sg/~software/quasq.html.
Summary of the average (and standard deviation) of the recall rates, i.e. the number of true polymorphic sites detected out of all true polymorphic sites, and precision, i.e. the proportion of true polymorphic sites detected with respect to the total number of polymorphic sites reported by QuasQ, in each of the eight scenarios
Summary of the eight settings used to simulate the data for comparing the performance of the different sequence alignment approaches
However, in our analysis of the simulated data under scenario A with QuasiRecomb (assuming both options of with and without recombination), around 10,000 haplotypes were constructed during each round of simulations with a precision of between 0.0002-0.0004 and these accounted for only 10%-40% of the simulated variants (see Additional file 1: Table S1). As a result, we subsequently excluded it from the comparison against the other 4 methods.
Summary of the mean and standard deviation of the total number of constructed haplotypes by QuasQ, ShoRAH, QuRe, ViSpA and ViSpA (with corrected reads) over 50 runs across each of the eight scenarios as described in Table 2
Mean number of constructed sequences (SD)
Summary of the mean F-measure for QuasQ, ShoRAH, QuRe and ViSpA (with ShoRAH corrected reads) over 50 runs across each of the eight scenarios as described in Table 2
The constructed haplotype sequences are compared against each of the 10 or 15 simulated sequences, and we identified the constructed sequence that has the highest match to each of the simulated sequences and calculate the extent of base identity between the constructed sequence and the best-matched simulated sequence. For those simulated sequences, which are not matched by any of the constructed sequences, there will not be any base identity score. We observe that in general, the sequences constructed by QuasQ have a higher match to the simulated sequences than those by the other three methods (Additional file 1:Figures S2-S54).
Subtype B HIV-1 quasispecies results
Summary of the recall rate, precision and F-measure for QuasQ, ShoRAH, QuRe, ViSpA, ViSpA (using ShoRAH corrected reads), QuasiRecomb (no recombination) and QuasiRecomb (with recombination) in reconstruction of the subtype B HIV-1 pol gene quasispecies
QuasiRecomb (No recomb)
QuasiRecomb (With recomb)
Clinical DENV data from Thailand
We apply QuasQ, ShoRAH, ViSpA, QuRe and QuasiRecomb to reconstruct the quasispecies sequences of the five dengue virus strains from the four serotypes, although we emphasize the results do not necessarily have any bearing on the relative performance of the approaches given the absence of any knowledge on the underlying quasispecies diversity.
Summary of the mean and standard deviation of the read lengths from 454 pyrosequencing of serotypes 1, 3 and 4 of the dengue virus obtained from lab cultures in Thailand
QuasiRecomb (No recomb)
QuasiRecomb (With recomb)
The haplotypes constructed by each of the four softwares are aligned back to the reference of the respective serotypes using BLASTn (E value <0.001). Additional file 1: Figure S56-S58 shows the top scoring alignment where the points represent the positions where the haplotypes differ from the reference sequence. Across all three serotypes, QuRe constructed haplotypes that aligned to <20% of the reference genome. Majority of the haplotypes constructed by ViSpA, before using ShoRAH corrected reads, aligned to ≤80% of the reference genome with ≤80% identity. And it is observed that ViSpA constructs haplotypes with significantly higher hamming distance to the reference genome as compared to ShoRAH and QuRe and this number is drastically decreased when ViSpA is used with ShoRAH corrected reads; the constructed haplotypes by ViSpA with ShoRAH corrected reads have significantly lower hamming distance as compared to the haplotypes constructed by ShoRAH and QuasQ.
Summary of the alignment result of the translated haplotypes of serotypes 1, 3 and 4 of the dengue virus obtained from lab cultures in Thailand constructed by QuasQ, ShoRAH, QuRe, ViSpA, ViSpA (with ShoRAH corrected reads), QuasiRecomb (no recombination) and QuasiRecomb (with recombination) with the translated reference genome
QuasiRecomb (No recomb)
QuasiRecomb (With recomb)
We have introduced a strategy QuasQ that utilizes the base quality scores of each sequence read to infer the authenticity of a mutation in reconstructing the genome sequences of viral quasispecies that are present within a single human host. Owing to the fact that each host carries multiple variants from a single serotype, inferring the number of quasispecies that are actually present as well as reconstructing the genome sequences of these different quasispecies is not trivial. While there are available softwares for performing this sequence alignment, these typically do not incorporate the quality of the sequence base calling, which can be valuable in deciding whether an observed polymorphic site contains a genuine mutation or is attributed to a base calling error. In an ideal setting, the presence of a single discordant site between two genome sequences should indicate that these genome sequences belong to two closely-related yet distinct variants, however in practice these two sequences are likely to be identical and the single discordant site is very likely a base calling error. With the novel collapsing method introduced in this paper, we have showed effectiveness in the reduction of such haplotypes caused by sequencing errors. Our simulations and comparisons on the real HIV-1 data have indicated that QuasQ outperformed the existing methods with publicly available software, by: (1) identifying higher number of true variants, while (2) minimizing the number of false variants, (3) where the reconstructed haplotype sequences exhibit higher degree of genetic similarities with the true sequences and (4) reconstructs haplotypes with frequencies resembling that of the original data. These findings strongly indicate that a sequence alignment strategy for viruses that utilizes the base quality scores ends up inferring a lower number of variants in a quasispecies, as this will down-weigh or even remove the contribution of polymorphic sites that are mainly attributed to base calling errors.
The ability to correctly infer the individual sequences of the quasispecies opens up vast opportunities in investigating the epidemiology of viruses, such as dengue. Consider a setting where dengue patients are admitted to a clinic and blood samples are obtained daily, allowing a longitudinal tracking of the evolution of dengue virus diversity. Association analysis can subsequently be performed to correlate clinical dengue symptoms with quasispecies diversity, allowing the functional impact and virulence of either individual or clusters of mutations to be assessed. In such a longitudinal setting, the ability to track dominant variant of a quasispecies will also allow researchers to identify specific strains that are more adaptive and resistant to innate host immunity, which allows the assessment of the mutations that correlate with this resistance. Understanding the genomic architecture of antibody resistance in the dengue virus is expected to be vital in designing effective vaccines. In addition, identifying the responsible mutations for dengue severity may provide the potential for predicting the outcome of dengue infection, whether it is the milder dengue fever or the more severe hemorrhagic form. The latter application has important public health implication, as dengue is more prevalent in developing countries and prolonged monitoring of dengue patients in a hospital setting for the purpose of minimizing hemorrhagic fever can place a considerable burden on the healthcare system.
By introducing the ability to infer the viral quasispecies colony that is present within a human host, QuasQ provides the potential for a more accurate classification of the viral phenotype. Perhaps the future of viral genomics will focus on comparing healthy controls who are carriers of virulent strains of the quasispecies against subjects exhibiting severe clinical symptoms of the infectious disease but are in fact affected by less virulent strains of the quasispecies. It is clear that understanding the genomics of viruses will be relevant not just to studying how to control or even eradicate these viral infectious diseases, but also in learning about the innate protection in the human host against the viruses.
Our algorithm QuasQ assumes the availability of high-throughput DNA data sequenced using 454 pyrosequencing. This technology produces reads with average read lengths of 330 bases, and yields per-base quality scores calibrated on the Phred scale for all valid calls . This base quality score maps directly to the probability that a particular base is incorrectly called due to a sequencing error. The distribution of base quality scores from a clinical application of 454 sequencing of DENV (the details of the sequencing can be found in subsection “Dengue virus template preparation and sequencing”) is shown in Additional file 1: Figure S1. When the sequencing is unable to generate a valid call of one of the four bases on a particular read, a null call or “N” is assigned. By incorporating this information on base quality and read quality, the QuasQ algorithm thus contains four components: (1) pre-processing and quality checking of the raw sequence reads, and subsequently mapping the high quality reads to the reference genome; (2) local correction of sequencing errors; (3) global reconstruction and collapsing of haplotypes; and finally (4) inferring the number of variants and their respective likelihoods and frequencies.
A small fraction of the reads tend to possess more sequencing errors than others in 454 sequencing. These low quality reads thus contribute disproportionately to the total error rates. It has been reported previously that sequence reads (i) containing one or more N calls; or (ii) with lengths that are either extremely short or long are often indicative of low-quality sequencing . We thus remove any reads that contain at least one N call, or are of extreme lengths (defined as sequence reads with lengths in the 1 percentile of either extremes of the distribution, see Additional file 1: Figure S2 for the distribution of average base quality scores against read lengths).
Mapping to the reference genome
The reads that remain from the pre-processing are mapped against the reference genome using Bowtie 2.0.0  (see Additional file 1). The sequencing process can occasionally introduce in vivo artifacts and produce high-quality sequence reads that are not reflective of the target genome. If undetected, these artifacts will erroneously increase the diversity of the haplotypes constructed. We thus implement an additional step of mapping each read against the reference genome while permitting some discordance to reflect genuine rapid evolutionary changes of variants as compared to the reference genome. Our pipeline retains only those reads that were uniquely mapped and where at least 80% of each read maps to the reference genome with at least 80% similarity.
A known source of problem with 454 pyrosequencing is the tendency to incorrectly ascertain the sequence content in the presence of a homopolymer (a stretch of identical bases), resulting in either an undercount or overcount of the identical bases . To accommodate the possibility of such homopolymer-induced errors, our algorithm allows indels to be inserted or deleted with respect to the reference genome.
Local error correction
A mutation in the haplotypes may either be genuine and explains the diversity of the variants within a quasispecies, or it can be an error introduced during sequencing. To minimize the occurrence of the latter, we implemented a sliding window clustering algorithm to correct for sequencing errors using the co-variation of the alleles. At every window, all combinations of alleles that have frequencies ≤ 0.5% of the coverage at that window are identified and clustered with those of the shortest hamming distance. The intuition behind this is that combinations of alleles that occur singly are more likely to be caused by sequencing errors, which are rare and occur randomly, as compared to true mutations, which should occur in proportions relative to their respective frequencies in the population.
Quasispecies sequence reconstruction
The reads after post-processed for local error correction are piled-up by positions, and all polymorphic sites (PS) where two or more alleles are present will be identified (Figure 7a). The reference genome is then ‘reduced’ to consist of only the polymorphic sites identified earlier (Figure 7b). Similarly, all mapped reads are ‘reduced’ to consist only of the bases called in the same polymorphic sites. The ‘reduced’ mapped reads are then grouped into disjoint sets according to their starting polymorphic sites (Figure 7c). The longest representative reads based on sequence base identities would therefore be identified for each set of reads, such that shorter reads that are complete subsequence of the longer reads will be filtered out. A read-graph method is then applied to connect the disjoint sets of representative merged reads in ascending order of the positions of polymorphic sites, where each node consist of the read sequence of DNA bases, with directed edges to connect two nodes if the first node is a complex prefix of the second node (Figure 7d). For any two reads that overlap, it is important to ensure that the overlapping region spans genuine polymorphic sites that are not the result of in vitro artifacts. To evaluate this, the method looks for at least one sequence read that not only spans the overlapping portion but also extends to include the polymorphic site immediately adjacent to the right and, if possible, left flank of the overlapping segment, depending on the coverage (Figure 7e).
We calculate the likelihood of each constructed haplotype by factoring into account the chance that a polymorphic site is attributed to a sequencing error. The intuition behind this inference is that a sequencing error is more likely to have happened when: (1) only a small proportion of the reads spanning a polymorphic site carry the minor allele(s); (2) the average base quality score for the minor allele is low. It is however important to acknowledge the possibility that situation (1) can happen if it is a recent mutation and is thus present only in a small proportion of the quasispecies.
For each constructed haplotype sequence, the reads spanning each polymorphic site can be divided into two categories: (a) reads with the exact same base composition as the constructed haplotype sequence; (b) reads without the exact same base composition as the constructed haplotype sequence. We are thus interested to calculate the likelihood for the observed data spanning each constructed haplotype sequence given that this haplotype sequence is real. This means that at each polymorphic site, either all the reads are correct (thus allowing the relative frequencies of the different alleles to be estimated from the proportion of the reads carrying each allele) or that those reads carrying a different allele at this site are erroneous.
P(Observed base | Sequence is real) =
P(Observed base | Sequence is real, all reads without errors) ×
P(All reads without errors) +
P(Observed base | Sequence is real, category (a) contains errors) ×
P(category (a) contains errors) +
P(Observed base | Sequence is real, category (b) contains errors) ×
Inferred haplotype construction
Using the aligned sequences for the inferred haplotype of a quasispecies, we then proceed to construct a neighbor-joining tree similar to the method by Saitou and Nei , where the distance between any two sequence alignments (inferred haplotypes) is simply quantified as the proportion of sites that differ. As every branch on the tree corresponds to an inferred haplotype sequence with a known likelihood score, two neighboring branches are collapsed into a single branch based on two parameters. The first parameter is the patristic distance between the two branches where both branches will be collapsed if the distance is less than a user-specified threshold. The second parameter is the difference in posterior probabilities of the two branches where both branches will be collapsed if the difference between them is larger than a user-specified threshold. The posterior probabilities for the two branches or sequence haplotypes k, l are calculated as described in the earlier paragraph. The intuition behind this branch collapsing is to remove highly similar haplotype sequences and retain only the sequence that is most probable.
The frequencies of the constructed haplotypes are subsequently inferred using freqEst, an Expectation-Maximization (EM) algorithm, which allows the most prevalent haplotype to be identified. The probability distribution of each of the constructed haplotypes (p 1 ,…,p H ), where H is the set of all the constructed haplotypes, is estimated by maximizing the log-likelihood function of the probability of observing a read drawn with uniform probability, across all the reads that are consistent with the constructed haplotypes.
Number of real variants within quasispecies.
Overall coverage of the reads.
Deep sequencing of genomes promises the information for detection of low frequency mutations but brings with it higher chances of sequencing errors in the increasing number of reads. The ability to identify low-frequency variants at low coverage and also to be able to differentiate sequencing errors from true polymorphisms in data at high coverage is what we want to test for. Hence, we simulate the datasets at ~700x and ~1500x coverage with average of ~30,000 and ~60,000 reads respectively.
Our simulations assume a total genome size of 10,700 basepairs, and the ancestral sequence is taken to be DENV 1 strain hawaii. We adopt a sequential approach in generating the quasispecies genome sequences, where the ith variant is effectively a copy of one of the previous (i - 1) variants except at the mutation sites. The resultant genome sequences thus constitute the true data that we will attempt to reconstruct with QuasQ, ShoRAH, ViSpA, QuRe and QuasiRecomb. Aside from varying the total number of sequences generated, we vary, too, the frequencies of each of these sequences and finally used ART 454 to generate reads of different coverage. There are eight simulation scenarios we consider after accounting for the different combinations of the three factors (see Table 2), and 50 datasets are simulated within each scenario.
Subtype B HIV-1 quasispecies
We attempt to reconstruct the viral quasispecies from a real experiment in which the sequence of the variants of the quasispecies sequenced was known beforehand, and thus can be used to access the effectiveness of each software. The dataset from the experiment by Zagordi et al. , in which 10 different clinical isolates of subtype B HIV-1 quasispecies were pooled in a mixed sample in different proportions (at a maximum contribution of 30% from each isolate, at an average diversity of about 7%, and with an estimated rate of heterogeneity of 0.35). The first 1,245 bases of the pol gene were sequenced without PCR amplification using 454 FLX Titanium resulting in 16,540 reads.
Dengue virus template preparation and sequencing
In the application of QuasQ to actual dengue viruses, we isolated five DENV prototype strains commonly used in the laboratory consisting of: (i) serotype 1, strain Hawaii (acc: EU848545); (ii) serotype 2, strain NGC (acc: M29095); (iii) serotype 2, strain 16681 (acc: U87411); (iv) serotype 3, strain H87 (acc: M93130); and (v) serotype 4, strain H241 (acc: AY947539). These prototypes are cultured in C6/36 cell line obtained from the Dengue Hemorrhagic Fever Research Unit, Siriraj Hospital in Bangkok, Thailand. Dengue RNA was extracted from the viral-culture supernatant with the QIAamp Viral RNA Kit (QIAGEN) and converted to cDNA by SuperScript III First-Strand Synthesis System (Invitrogen). The AccuPrime™ Taq DNA Polymerase High Fidelity (Invitrogen) was used to amplify the cDNA template. The PCR product was purified using the QIAquick Gel Extraction Kit (QIAGEN) and the DNA concentration was measured using NanoDrop. Pooled dsDNA template of all four serotypes of the DENV was finally sequenced by the Genome Sequencer FLX System (Roche Company) according to standard protocol which starts from random fragmentation by Nebulizer.
The program settings for ShoRAH version 0.6 that were used in our analyses considered a window size of 330 basepairs, based on our average read lengths of 350 bp, and a recommended alpha of 0.1 for the Dirichlet process mixture. We also provided ShoRAH with a file that contains the appropriate serotype reference sequence. An example command line that was used is: python shorah.py -b INPUT.bam -r INPUT_REF.fas -w 330 -a 0.1 -k > global.log
The program settings for ViSpA version 01 that were used in our analyses allow for (i) 1 (for corrected reads) or 5 (for uncorrected reads) mismatches when deciding whether a read is a sub-read of the super-read; and (ii) 105 (for simulations with mutation rate of 0.02) or 262 (for simulations with mutation rate of 0.05) mutations, in distinguishing between variants. We also provided ViSpA with a file that contains the appropriate serotype reference sequence. An example command line that was used is: sh main.bash INPUT.fas INPUT_REF.fas 12 (i) (ii).
We find that majority of the haplotypes constructed by ViSpA aligns to <80% of the length of any simulated variants, as described in the “Simulation Results” section later. Hence we excluded ViSpA in the later comparions. However, since ViSpA was packaged with the aligner, SEGEMEHL, as opposed to Bowtie that was ran for the other three softwards, to make it a fair comparison, Bowtie-aligned data was used during ViSpA’s analysis as well. Additional rounds of analysis were ran on ViSpA using ShoRAH corrected reads to test the viability of ViSpA’s alignment algorithm, despite the absence of the pre-processing step - ViSpA (Corrected).
The program settings for QuRe version 0.9997 that were used in our analyses considered the defaults of 0.0044 homopolymeric error rate, 0.0007 non-homopolymeric error rate and 2,500 iterations. A relatively small number of iterations was used because at a higher number, QuRe very often run out of memory. We also provided QuRe with a file that contains the appropriate serotype reference sequence. An example command line that was used is: java -Xmx4G QuRe INPUT.fas INPUT_REF.fas 0.0044 0.0007 2500
The program settings for QuasiRecomb version 1.1 that were used in our analyses considered both with and without recombination, without gaps and applied the conservative method. The example command lines there were used are:
Without Recombination: java -jar ~/src/QuasiRecomb.jar -i INPUT.bam -noGaps -quality -unpaired -conservative -noRecomb
With Recombination: java -jar ~/src/QuasiRecomb.jar -i INPUT.bam -noGaps -quality -unpaired -conservative
In assessing these five softwares, QuasQ, ShoRAH, ViSpA, QuRe and QuasiRecomb, we quantify and compare their performances in terms of (i) how many of the simulated variants have been correctly inferred by the software (recall rate); (ii) out of the total number of constructed haplotypes, how many unique simulated sequences are reported by the software (precision); and (iii) the degree of genetic similarity between the reconstructed haplotype sequences and the simulated variants.
WTP, LPW, AYC, YYT and RTO acknowledge the support from the National Research Foundation (NRF-RF-2010-05). EX is supported by a scholarship from NUS Graduate School for Integrative Sciences and Engineering. PM, KC, and PS are supported by the Office of the Higher Education Commission and Mahidol University under the National Research Universities Initiative.
- Schneider WL, Roossinck MJ: Genetic diversity in RNA virus quasispecies is controlled by host-virus interactions. J Virol. 2001, 75: 6566-6571. 10.1128/JVI.75.14.6566-6571.2001.PubMed CentralView ArticlePubMedGoogle Scholar
- Lauring AS, Andino R: Quasispecies theory and the behavior of RNA viruses. PLoS Pathog. 2010, 6: e1001005-10.1371/journal.ppat.1001005.PubMed CentralView ArticlePubMedGoogle Scholar
- Domingo E, Sheldon J, Perales C: Viral quasispecies evolution. Microbiol Mol Biol Rev. 2012, 76: 159-216. 10.1128/MMBR.05023-11.PubMed CentralView ArticlePubMedGoogle Scholar
- Beerenwinkel N, Zagordi O: Ultra-deep sequencing for the analysis of viral populations. Curr Opin Virol. 2011, 1: 413-418. 10.1016/j.coviro.2011.07.008.View ArticlePubMedGoogle Scholar
- Astrovskaya I, Tork B, Mangul S, Westbrooks K, Mandoiu I, et al: Inferring viral quasispecies spectra from 454 pyrosequencing reads. BMC Bioinforma. 2011, 12 (6): S1-View ArticleGoogle Scholar
- Westbrooks K, Astrovskaya I, Campo D, Khudyakov Y, Berman P, et al: HCV quasispecies assembly using network flows. Proceedings of the 4th international conference on Bioinformatics research and applications. 2008, Atlanta, GA, USA: Springer-Verlag, 159-170.Google Scholar
- Jojic V, Hertz T, Jojic N: Population sequencing using short reads: HIV as a case study. Pac Symp Biocomput. 2008, 13: 114-125.Google Scholar
- Eriksson N, Pachter L, Mitsuya Y, Rhee SY, Wang C, et al: Viral population estimation using pyrosequencing. PLoS Comput Biol. 2008, 4: e1000074-10.1371/journal.pcbi.1000074.PubMed CentralView ArticlePubMedGoogle Scholar
- Zagordi O, Geyrhofer L, Roth V, Beerenwinkel N: Deep Sequencing of a Genetically Heterogeneous Sample: Local Haplotype Reconstruction and Read Error Correction. J Comput Biol. 2010, 17: 417-428. 10.1089/cmb.2009.0164.View ArticlePubMedGoogle Scholar
- Zagordi O, Bhattacharya A, Eriksson N, Beerenwinkel N: ShoRAH: estimating the genetic diversity of a mixed sample from next-generation sequencing data. BMC Bioinforma. 2011, 12: 119-10.1186/1471-2105-12-119.View ArticleGoogle Scholar
- Prabhakaran S, Rey M, Zagordi O, Beerenwinkel N, Roth V: HIV-haplotype inference using a constraint-based dirichlet process mixture model. Extended abstract at the Twenty-Third Annual Conference on Neural Information Processing Systems (NIPS), Machine Learning in Computational Biology (MLCB) workshop. 2010, BC, Canada: WhistlerGoogle Scholar
- Prosperi MC, Prosperi L, Bruselles A, Abbate I, Rozera G, et al: Combinatorial analysis and algorithms for quasispecies reconstruction using next-generation sequencing. BMC Bioinforma. 2011, 12: 5-10.1186/1471-2105-12-5.View ArticleGoogle Scholar
- Prosperi MC, Salemi M: QuRe: software for viral quasispecies reconstruction from next-generation sequencing data. Bioinformatics. 2012, 28: 132-133. 10.1093/bioinformatics/btr627.PubMed CentralView ArticlePubMedGoogle Scholar
- Huang A, Kantor R, DeLong A, Schreier L, Istrail S: QColors: an algorithm for conservative viral quasispecies reconstruction from short and non-contiguous next generation sequencing reads. Silico Biol. 2011-2012, 11 (5-6): 193-201.Google Scholar
- Zagordi O, Töpfer A, Prabhakaran S, Roth V, Halperin E, et al: Probabilistic Inference of Viral Quasispecies Subject to Recombination. Research in Computational Molecular Biology. Edited by: Chor B. 2012, Berlin Heidelberg: Springer, 342-354.View ArticleGoogle Scholar
- Zagordi O, Klein R, Däumer M, Beerenwinkel N: Error correction of next-generation sequencing data and reliable estimation of HIV quasispecies. Nucl Acids Res. 2010, 38 (21): 7400-7409. 10.1093/nar/gkq655.PubMed CentralView ArticlePubMedGoogle Scholar
- Gilles A, Meglecz E, Pech N, Ferreira S, Malausa T, et al: Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing. BMC Genomics. 2011, 12: 245-10.1186/1471-2164-12-245.PubMed CentralView ArticlePubMedGoogle Scholar
- Huse SM, Huber JA, Morrison HG, Sogin ML, Welch DM: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol. 2007, 8: R143-10.1186/gb-2007-8-7-r143.PubMed CentralView ArticlePubMedGoogle Scholar
- Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012, 9: 357-359. 10.1038/nmeth.1923.PubMed CentralView ArticlePubMedGoogle Scholar
- Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4: 406-425.PubMedGoogle Scholar
- Huang W, Li L, Myers JR, Marth GT: ART: a next-generation sequencing read simulator. Bioinformatics. 2012, 28: 593-594. 10.1093/bioinformatics/btr708.PubMed CentralView ArticlePubMedGoogle Scholar
- Behura S, Severson D: Nucleotide substitutions in dengue virus serotypes from Asian and American countries: insights into intracodon recombination and purifying selection. BMC Microbiol. 2013, 13: 37-10.1186/1471-2180-13-37.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.