Analysis of high-depth sequence data for studying viral diversity: a comparison of next generation sequencing platforms using Segminator II
© Archer et al; licensee BioMed Central Ltd. 2012
Received: 28 October 2011
Accepted: 23 March 2012
Published: 23 March 2012
Next generation sequencing provides detailed insight into the variation present within viral populations, introducing the possibility of treatment strategies that are both reactive and predictive. Current software tools, however, need to be scaled up to accommodate for high-depth viral data sets, which are often temporally or spatially linked. In addition, due to the development of novel sequencing platforms and chemistries, each with implicit strengths and weaknesses, it will be helpful for researchers to be able to routinely compare and combine data sets from different platforms/chemistries. In particular, error associated with a specific sequencing process must be quantified so that true biological variation may be identified.
Segminator II was developed to allow for the efficient comparison of data sets derived from different sources. We demonstrate its usage by comparing large data sets from 12 influenza H1N1 samples sequenced on both the 454 Life Sciences and Illumina platforms, permitting quantification of platform error. For mismatches median error rates at 0.10 and 0.12%, respectively, suggested that both platforms performed similarly. For insertions and deletions median error rates within the 454 data (at 0.3 and 0.2%, respectively) were significantly higher than those within the Illumina data (0.004 and 0.006%, respectively). In agreement with previous observations these higher rates were strongly associated with homopolymeric stretches on the 454 platform. Outside of such regions both platforms had similar indel error profiles. Additionally, we apply our software to the identification of low frequency variants.
We have demonstrated, using Segminator II, that it is possible to distinguish platform specific error from biological variation using data derived from two different platforms. We have used this approach to quantify the amount of error present within the 454 and Illumina platforms in relation to genomic location as well as location on the read. Given that next generation data is increasingly important in the analysis of drug-resistance and vaccine trials, this software will be useful to the pathogen research community. A zip file containing the source code and jar file is freely available for download from http://www.bioinf.manchester.ac.uk/segminator/.
Sequencing platforms such as the 454 Life Sciences GS-FLX  and Illumina  are providing a previously unprecedented insight into the extent of pathogen variation [3–7]. This is due to the depth of coverage that can be obtained across individual genes or genomes , as well as the ability to analyze large numbers of samples simultaneously [9, 10]. Within chronic viral infections, such as HIV-1 and HCV, the study of variation is important as it has been directly associated with both disease progression and the outcome of treatment [11–14]. These sequencing platforms have the potential to accurately quantify the variation within viral populations [15, 16], including those sampled through time [3, 17], and from differing compartments within the host , thus, providing a more complete picture of viral evolution. This has applications for the development of treatment strategies that are both reactive and predictive. Reactive in that the information derived from individual hosts can be incorporated into the optimization of current drug treatment regimes [14, 19–21]. Predictive in that the likelihood of the emergence of resistant variants can be calculated and incorporated into strategies such as the algorithmic design of therapeutic vaccines [22, 23].
However, prior to the practical and routine application of current sequencing technologies to the characterization of patterns of variation, a number of non-trivial challenges need to be overcome . Primarily this involves the detection of error introduced during sequencing [25, 26], and the separation of this from real genetic variation. This issue is particularly acute for RNA viruses within an individual host for which the degree of genetic variation may be on the same order as the error rate. The extent and nature of sequencing error varies between platforms . For example, the 454 technology utilizes a sequencing by-synthesis method during which the incorporation of cyclically delivered nucleotides into the growing DNA strand, via pyrophosphate liberation, is measured . In regions where the complement strand contains a homopolymeric stretch, the strength of the signal is proportional to the number of bases incorporated. Ambiguities in signal intensity are most frequently manifested as under- and over-calls on the length of these stretches . An overall error rate of about 1% has been observed [7, 25] which has been partitioned into insertion (0.7%), deletion (0.2%) and mismatch error (0.1%). Unlike the 454 platform, sequences generated on the Illumina platform are extended one base at a time , thus, under- and over-calls are uncommon. Mismatch error, however, has been observed [26, 27], and, because of the high coverage that is achievable, the quantification of the rate at which this error occurs is particularly relevant when attempting to identify low frequency genetic variants.
The development of an accurate, dynamic picture of viral evolution within the host has also been hampered by the logistics of data management. The many mapping, assembly and analysis programs available [30, 31] primarily have the goal of obtaining an accurate estimate of a genomic sequence or detecting variation at the allelic level. Other commercially available packages such as Geneius  and CLC Genomics Workbench http://www.clcbio.com/, that do offer tools to analyze sequence variability, have limited capacity to account for temporally sampled viral data. At the time of writing no freely available, generally applicable software exists that addresses the task of detecting and characterizing the high levels of genetic variation encountered within rapidly evolving viral populations using data that has either been temporally or compartmentally sampled. Prior to our current work we presented a framework, for the mapping of the short sequence segments (reads) generated on the 454 platform, which was applied to the detection of clusters of low frequency drug resistant forms . Here we extend this framework so that it (i) is applicable to viral data generated on the Illumina platform (and other sequencing platforms), (ii) outputs a range of metrics for characterizing variability including base, indel and codon frequencies, coverage and quality scores and (iii) allows for multiple data sets to be managed within a project permitting comparative analysis. To demonstrate the usefulness of our software we assess platform-induced variation present within reads derived from 12 Influenza A H1N1/09 infected individuals. For each individual, because the same sample was used for read amplification on both the 454 and Illumina platforms, any variation associated solely with a single platform can be considered as potential sequencing error. H1N1 genomes are particularly useful for characterizing insertion and deletion error that has been introduced by the sequencing platform because indels are rare in this data .
Case study: data sets
Sample ID's and read numbers before and after mapping
454 Life Sciences
No. of reads
No. of reads mapped (%)
No. of reads
No. of reads mapped (%)
Case study: template creation, mapping and alignment
To construct a reference sequence, for each genomic segment of the H1N1 genome, a sample of data-all available full length US sequences-were downloaded from GenBank http://www.ncbi.nlm.nih.gov/genbank/, aligned using Muscle  and a consensus sequence generated. The number of sequences included for each gene was: PB2: 1401, PB1: 1397, PA: 1387, HA: 1578, NP: 1461, NA: 1586, MP: 1490 and NS: 1409. Consensus sequences from each were concatenated together to produce a genome length template of length 13,284 nucleotides. At each concatenation point, a string of Ns was incorporated in order to separate the segments for visualization purposes; these were ignored in the read mapping. For each of the 12 samples, reads from both platforms were separately mapped to the template using our framework (Figure 2), following which a data set specific consensus sequence was generated by maintaining the most frequent residue present at each site. Reads were then remapped and aligned to the consensus sequence, which resulted in the final assemblies used in downstream analysis. The parameters for the k-mer matching step were k-mer length, k-mer density (minimum number of k-mer hits before a location is considered) and k-mer skip (rather than search every k-mer within the read every ith k-mer is used, where i is the skip parameter); the values set for these were 8, 2 and 2, respectively. Note, for conserved data the speed of the mapping can be increased using the skip parameter as searching every k-mer to find the approximate location of a read leads to redundancy. For the pairwise alignment step the parameters were: match, gap open and gap extension scores as well as the value for transversions and transitions with the values: 1, -2, -0.5, 2 and 1, respectively.
Case study: results
Here we have developed a framework for the comparison of next generation data derived from multiple sources. We have applied the framework to the comparison of platform dependent error rates present within data generated on both the 454 Life Sciences and Illumina platforms. Other applications include the comparison of temporally sampled data or data derived from different tissues within the host. Given that novel next generation sequencing platforms and chemistries are being developed, ease of comparison of data from different sources is timely. Our software permits the efficient analysis of multiple such data sets within one integrated framework. We demonstrate its usage by comparing extremely large data sets of influenza H1N1 samples sequenced on both the 454 Life Sciences and Illumina platforms. Previously we have demonstrated how phylogenetic trees can be inferred from combined time points of viral data, permitting the tracking of the emergence of distinct viral lineages associated with HIV-1 drug resistance forms [3, 35].
To determine the importance of characterizing the extent of platform induced variation in this study we compared the entropy present within the read data derived from the 12 H1N1/09 samples to that of a sample of data from GenBank (Figure 4C). We observed that the median entropies obtained from reads pooled according to platform were higher than those obtained from the GenBank data, thus, highlighting the importance of quantifying how much of this variation was caused by platform error, prior to the detection of genuine variants. In order to quantify this we considered any differences in variation between the data from each of the platforms as being platform dependent (Figure 5). Following the calculation of the per site ratios between variant and consensus frequencies, it was observed that for insertions and deletions the ratios within the 454 data were significantly higher than their Illumina counterparts (Figure 6A). The higher ratios were strongly associated with hps+ regions (Figure 6B), a characteristic that has been previously observed, for example, within a study designed to characterize platform error using the reverse transcriptase gene of HIV-1 . This result highlights the suitability of the Illumina technology for viral data sets, which is often prone to high levels of biologically relevant indels.
While mapping of low frequency variation at sites on the template to locations within the underlying reads it was observed that within the 454 data, with the exception of the read starts, the frequencies follow a random expectation (Figure 7A, C and 7E). This is unsurprising given the observation that the majority of platform-induced variation is dependent on the location of sites in relation to hps+ regions (Figure 6B), as defined by the genome. Since these regions are located randomly in relation to the read, platform dependent variation would be expected to appear randomly with respect to read length. The major implication of this is that that error occurring at particular positions on a genome may be replicated across multiple independent samples, as has been previously observed . It also suggests, that within the 454 data the genomic positioning of hps+ sites can be used to accurately predict where this error is likely to occur, and that steps can be taken to reduce its effects on data analysis, such as the removal of insertions within reads that fall with hps+ regions and that are associated with low quality scores. Conversely within the data obtained from the Illumina platform the distributions of indel error in relation to read location does not fall consistently within the random expectation (Figure 7B, D and 7F). For this data hps+ regions were not observed to influence location. Combined, this suggests that read location plays an influential role in the introduction of such error, although the levels present, especially given the far higher coverage, are extremely low.
For mismatches, ratios between consensus and variant frequencies were observed to be relatively similar on both platforms, although as a result of higher coverage many more sites within the Illumina data harbored low-level change (Figure 8A and 8B). To demonstrate the applicability of our software to data sets derived from different sources the topology generated by the number of variable sites cross-validated using both platforms, in relation to the quality of the nucleotides present was plotted (Figure 9). The topology follows the general trend of high non-consensus and quality score values lead to few polymorphic sites, low non-consensus and quality score thresholds lead to many polymorphic sites. Of particular interest, however, is that between the 55th and 56th quality percentiles there is a consistent decrease in the number of variable sites identified, suggesting a possible cutoff value for this parameter. Interestingly, quality alone does not appear to be sufficient for accurately identifying the presence of low-level biological variation as, without cross-validation using the Illumina platform, the number of sites identified within each sample is consistently higher (Figure 10). In the latter the median number of variable sites across all 12 samples is 1947, which is almost three times higher than when cross-platform validation is performed. Reflecting this uncertainty is the development of many probabilistic methods that attempt to improve the reliability of identifying low-level variation at sites within data obtained from a single platform [36–39]. Our frameworks ability to store temporally sampled data provides an opportunity to derive a set of priors characterizing the expected variation within that individual. Combined with the error rates described here and in conjunction with read quality scores this forms the bases for our first future update which involves filtering platform dependent variation from temporally sampled read data using a Bayesian approach.
We have provided software, Segminator II, which can be used for the processing of temporally, spatially or otherwise linked viral data obtained from next generation sequencing platforms. In a demonstration of the usability of our software we have also quantified the amount of platform dependent error that is present within data generated on both the 454 and Illumina platforms and, thus, highlighted the need for care when calling low frequency variants using a single platform. Given that next generation data is increasingly important in the analysis of drug-resistance and vaccine trials, this software will be useful to the retroviral research community.
Availability and requirements
Project name: Segminator II
Project home page: http://www.bioinf.manchester.ac.uk/segminator/
Operating system: e.g. Platform independent
Programming language: Java
Other requirements: Java 1.6 or higher
License: GNU Lesser GPL
Any restrictions to use by non-academics: license needed
We wish to thank BBSRC for funding, project grant: BB/H012419/1. We also wish to thank Felix Feyertag, Bram Vrancken and Kenneth Henry for useful discussion.
- Droege M, Hill B: The Genome Sequencer FLX System-longer reads, more applications, straight forward bioinformatics and more complete data sets. J Biotechnol 2008, 136: 3–10. 10.1016/j.jbiotec.2008.03.021View ArticlePubMedGoogle Scholar
- Bennett S: Solexa Ltd. Pharmacogenomics 2004, 5: 433–438. 10.1517/146224220.127.116.113View ArticlePubMedGoogle Scholar
- Archer J, Rambaut A, Taillon BE, Harrigan PR, Lewis M, Robertson DL: The evolutionary analysis of emerging low frequency HIV-1 CXCR4 using variants through time-an ultra-deep approach. PLoS Comput Biol 2010, 6: e1001022. 10.1371/journal.pcbi.1001022PubMed CentralView ArticlePubMedGoogle Scholar
- Holt KE, Parkhill J, Mazzoni CJ, Roumagnac P, Weill FX, Goodhead I, Rance R, Baker S, Maskell DJ, Wain J, et al.: High-throughput sequencing provides insights into genome variation and evolution in Salmonella Typhi. Nat Genet 2008, 40: 987–993. 10.1038/ng.195PubMed CentralView ArticlePubMedGoogle Scholar
- Kuroda M, Katano H, Nakajima N, Tobiume M, Ainai A, Sekizuka T, Hasegawa H, Tashiro M, Sasaki Y, Arakawa Y, et al.: Characterization of quasispecies of pandemic 2009 influenza A virus (A/H1N1/2009) by de novo sequencing using a next-generation DNA sequencer. PLoS One 2010, 5: e10256. 10.1371/journal.pone.0010256PubMed CentralView ArticlePubMedGoogle Scholar
- Poon AF, Swenson LC, Dong WW, Deng W, Kosakovsky Pond SL, Brumme ZL, Mullins JI, Richman DD, Harrigan PR, Frost SD: Phylogenetic analysis of population-based and deep sequencing data to identify coevolving sites in the nef gene of HIV-1. Mol Biol Evol 2009, 27: 819–832.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang C, Mitsuya Y, Gharizadeh B, Ronaghi M, Shafer RW: Characterization of mutation spectra with ultra-deep pyrosequencing: application to HIV-1 drug resistance. Genome Res 2007, 17: 1195–1201. 10.1101/gr.6468307PubMed CentralView ArticlePubMedGoogle Scholar
- ten Bosch JR, Grody WW: Keeping up with the next generation: massively parallel sequencing in clinical diagnostics. J Mol Diagn 2008, 10: 484–492. 10.2353/jmoldx.2008.080027PubMed CentralView ArticlePubMedGoogle Scholar
- Binladen J, Gilbert MT, Bollback JP, Panitz F, Bendixen C, Nielsen R, Willerslev E: The use of coded PCR primers enables high-throughput sequencing of multiple homolog amplification products by 454 parallel sequencing. PLoS One 2007, 2: e197. 10.1371/journal.pone.0000197PubMed CentralView ArticlePubMedGoogle Scholar
- Meyer M, Stenzel U, Hofreiter M: Parallel tagged sequencing on the 454 platform. Nat Protoc 2008, 3: 267–278. 10.1038/nprot.2007.520View ArticlePubMedGoogle Scholar
- Lal RB, Chakrabarti S, Yang C: Impact of genetic diversity of HIV-1 on diagnosis, antiretroviral therapy & vaccine development. Indian J Med Res 2005, 121: 287–314.PubMedGoogle Scholar
- Luciani F, Alizon S: The evolutionary dynamics of a rapidly mutating virus within and between hosts: the case of hepatitis C virus. PLoS Comput Biol 2009, 5: e1000565. 10.1371/journal.pcbi.1000565PubMed CentralView ArticlePubMedGoogle Scholar
- Tazi L, Imamichi H, Hirschfeld S, Metcalf JA, Orsega S, Perez-Losada M, Posada D, Lane HC, Crandall KA: HIV-1 infected monozygotic twins: a tale of two outcomes. BMC Evol Biol 2011, 11: 62. 10.1186/1471-2148-11-62PubMed CentralView ArticlePubMedGoogle Scholar
- Westby M, Lewis M, Whitcomb J, Youle M, Pozniak AL, James IT, Jenkins TM, Perros M, van der Ryst E: Emergence of CXCR4-using human immunodeficiency virus type 1 (HIV-1) variants in a minority of HIV-1-infected patients following treatment with the CCR5 antagonist maraviroc is from a pretreatment CXCR4-using virus reservoir. J Virol 2006, 80: 4909–4920. 10.1128/JVI.80.10.4909-4920.2006PubMed CentralView ArticlePubMedGoogle Scholar
- Bushman FD, Hoffmann C, Ronen K, Malani N, Minkah N, Rose HM, Tebas P, Wang GP: Massively parallel pyrosequencing in HIV research. AIDS 2008, 22: 1411–1415. 10.1097/QAD.0b013e3282fc972ePubMed CentralView ArticlePubMedGoogle Scholar
- Shendure J, Hanlee J: Next-generation DNA sequencing. Nat Biotech 2008, 26: 1135–1145. 10.1038/nbt1486View ArticleGoogle Scholar
- Tsibris AM, Korber B, Arnaout R, Russ C, Lo CC, Leitner T, Gaschen B, Theiler J, Paredes R, Su Z, et al.: Quantitative deep sequencing reveals dynamic HIV-1 escape and large population shifts during CCR5 antagonist therapy in vivo. PLoS One 2009, 4: e5683. 10.1371/journal.pone.0005683PubMed CentralView ArticlePubMedGoogle Scholar
- Rozera G, Abbate I, Bruselles A, Vlassi C, D'Offizi G, Narciso P, Chillemi G, Prosperi M, Ippolito G, Capobianchi MR: Massively parallel pyrosequencing highlights minority variants in the HIV-1 env quasispecies deriving from lymphomonocyte sub-populations. Retrovirology 2009, 6: 15. 10.1186/1742-4690-6-15PubMed CentralView ArticlePubMedGoogle Scholar
- Hinkley T, Martins J, Chappey C, Haddad M, Stawiski E, Whitcomb JM, Petropoulos CJ, Bonhoeffer S: A systems analysis of mutational effects in HIV-1 protease and reverse transcriptase. Nat Genet 2011, 43: 487–489. 10.1038/ng.795View ArticlePubMedGoogle Scholar
- Korber B, Gaschen B, Yusim K, Thakallapally R, Kesmir C, Detours V: Evolutionary and immunological implications of contemporary HIV-1 variation. Br Med Bull 2001, 58: 19–42. 10.1093/bmb/58.1.19View ArticlePubMedGoogle Scholar
- Shafer RW, Schapiro JM: HIV-1 drug resistance mutations: an updated framework for the second decade of HAART. AIDS Rev 2008, 10: 67–84.PubMed CentralPubMedGoogle Scholar
- Barouch DH, O'Brien KL, Simmons NL, King SL, Abbink P, Maxfield LF, Sun YH, La Porte A, Riggs AM, Lynch DM, et al.: Mosaic HIV-1 vaccines expand the breadth and depth of cellular immune responses in rhesus monkeys. Nat Med 2010, 16: 319–323. 10.1038/nm.2089PubMed CentralView ArticlePubMedGoogle Scholar
- Fischer W, Perkins S, Theiler J, Bhattacharya T, Yusim K, Funkhouser R, Kuiken C, Haynes B, Letvin NL, Walker BD, et al.: Polyvalent vaccines for optimal coverage of potential T-cell epitopes in global HIV-1 variants. Nat Med 2007, 13: 100–106. 10.1038/nm1461View ArticlePubMedGoogle Scholar
- Pop M, Salzberg SL: Bioinformatics challenges of new sequencing technology. Trends Genet 2008, 24: 142–149. 10.1016/j.tig.2007.12.006PubMed CentralView ArticlePubMedGoogle Scholar
- Gilles A, Meglecz E, Pech N, Ferreira S, Malausa T, Martin JF: Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing. BMC Genomics 2011, 12: 245. 10.1186/1471-2164-12-245PubMed CentralView ArticlePubMedGoogle Scholar
- Nakamura K, Oshima T, Morimoto T, Ikeda S, Yoshikawa H, Shiwa Y, Ishikawa S, Linak MC, Hirai A, Takahashi H, et al.: Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res 2011, 39: e90. 10.1093/nar/gkr344PubMed CentralView ArticlePubMedGoogle Scholar
- Harismendy O, Ng PC, Strausberg RL, Wang X, Stockwell TB, Beeson KY, Schork NJ, Murray SS, Topol EJ, Levy S, et al.: Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol 2009, 10: R32. 10.1186/gb-2009-10-3-r32PubMed CentralView ArticlePubMedGoogle Scholar
- Mardis ER: Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 2008, 9: 387–402. 10.1146/annurev.genom.9.081307.164359View ArticlePubMedGoogle Scholar
- Brockman W, Alvarez P, Young S, Garber M, Giannoukos G, Lee WL, Russ C, Lander ES, Nusbaum C, Jaffe DB: Quality scores and SNP detection in sequencing-by-synthesis systems. Genome Res 2008, 18: 763–770. 10.1101/gr.070227.107PubMed CentralView ArticlePubMedGoogle Scholar
- Bao S, Jiang R, Kwan W, Wang B, Ma X, Song YQ: Evaluation of next-generation sequencing software in mapping and assembly. J Hum Genet 2011, 56: 406–414. 10.1038/jhg.2011.43View ArticlePubMedGoogle Scholar
- Oshlack A, Robinson MD, Young MD: From RNA-seq reads to differential expression results. Genome Biol 2010, 11: 220. 10.1186/gb-2010-11-12-220PubMed CentralView ArticlePubMedGoogle Scholar
- Geneious v5.4[http://www.geneious.com/]
- Zhou B, Donnelly ME, Scholes DT, St George K, Hatta M, Kawaoka Y, Wentworth DE: Single-reaction genomic amplification accelerates sequencing and vaccine production for classical and Swine origin human influenza a viruses. J Virol 2009, 83: 10309–10313. 10.1128/JVI.01109-09PubMed CentralView ArticlePubMedGoogle Scholar
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32: 1792–1797. 10.1093/nar/gkh340PubMed CentralView ArticlePubMedGoogle Scholar
- Archer J, Braverman MS, Taillon BE, Desany B, James I, Harrigan PR, Lewis M, Robertson DL: Detection of low-frequency pretherapy chemokine (CXC motif) receptor 4 (CXCR4)-using HIV-1 with ultra-deep pyrosequencing. AIDS 2009, 23: 1209–1218. 10.1097/QAD.0b013e32832b4399PubMed CentralView ArticlePubMedGoogle Scholar
- Li H: Improving SNP discovery by base alignment quality. Bioinformatics 2011, 27: 1157–1158. 10.1093/bioinformatics/btr076PubMed CentralView ArticlePubMedGoogle Scholar
- Quinlan AR, Stewart DA, Stromberg MP, Marth GT: Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nat Methods 2008, 5: 179–181. 10.1038/nmeth.1172View ArticlePubMedGoogle Scholar
- Salmela L, Schroder J: Correcting errors in short reads by multiple alignments. Bioinformatics 2011, 27: 1455–61. 10.1093/bioinformatics/btr170View ArticlePubMedGoogle Scholar
- Zagordi O, Klein R, Daumer M, Beerenwinkel N: Error correction of next-generation sequencing data and reliable estimation of HIV quasispecies. Nucleic Acids Res 2010, 38: 7400–7409. 10.1093/nar/gkq655PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.