Routine performance and errors of 454 HLA exon sequencing in diagnostics
© Niklas et al.; licensee BioMed Central Ltd. 2013
Received: 23 November 2012
Accepted: 30 May 2013
Published: 3 June 2013
Next-generation sequencing (NGS) has changed genomics significantly, and a growing number of applications rely on sequencing with different platforms. Now, in 2012, after a decade of development and evolution, NGS has been accepted in a variety of research fields. Determining sequencing errors is essential for taking next-generation sequencing beyond research use only. This study describes the overall 454 system performance across multiple GS Junior runs with an in-house established and validated diagnostic assay for human leukocyte antigen (HLA) exon sequencing. Based on these data, we extracted, evaluated and characterized errors and variants of 60 HLA loci per run with respect to their adjacencies.
We determined an overall error rate of 0.18% in a total of 118,484,408 bases. 31.3% of all reads analyzed (n=349,503) contain one or more errors. Insertions form the largest group, accounting for 50% of the errors. Incorrect bases are not distributed equally along sequences and tend to be more frequent at sequence ends. Certain sequence positions in the middle or at the beginning of the read accumulate errors. Typically, the quality score at the actual error position is lower than the adjacent scores.
Here we present the first error assessment of a human next-generation sequencing diagnostics assay in an amplicon sequencing approach. The improvements in sequence quality and error rate made over the years are evident, and both have now reached a level at which diagnostic applications become feasible. The error rates presented here are lower than previously published ones, and we can confirm and quantify the often described relation between homopolymers and errors. Nevertheless, a certain depth of coverage is needed, in particular for challenging areas of the sequencing target. Furthermore, the use of error-correcting tools is not essential but might contribute to the capacity and efficiency of a sequencing run.
Keywords: Next-generation sequencing; Human leukocyte antigen typing; Error characteristics; Quality control
Next-generation sequencing systems have boosted genetics in the last few years. Reduced costs, simpler wet-lab workflows and longer reads have led to an enormous increase in sequencing projects and sequencing data. Roche/454 Life Sciences is one of the major players in the NGS field: its pyrosequencing technology allows for the longest reads of all second-generation sequencing techniques, with further technological improvements proposed, and two differently sized platforms allow for scalability. This technology is based on DNA templates immobilized on beads which are loaded onto a PicoTiterPlate (PTP). Subsequently, nucleotides flow over this plate in periodic cycles and are incorporated if complementary to the template strand. An enzyme cascade is activated, leading to the release of photons, which are detected by an ultra-sensitive CCD camera. The lengths of homopolymers (stretches of the same nucleotide) are determined from the amount of emitted light. Long homopolymers in particular are a major challenge for the 454 technology itself and for bioinformatics, analysis and interpretation [4, 5].
Taking NGS from basic research applications to routine diagnostic assays is a logical consequence [6-8]. HLA typing by NGS is one of the most rapidly evolving fields of application and is moving towards routine diagnostics [9-13]. Our lab is certified by the European Federation for Immunogenetics for HLA typing and has years of experience in HLA typing and next-generation sequencing [14, 15]. For transplantation of haematopoietic stem cells, DNA-based high-resolution typing of HLA is an absolute necessity in order to achieve the best possible histocompatibility and to reduce the risk of severe graft-versus-host disease. Most recently, we have demonstrated that NGS HLA typing is feasible for routine diagnostics.
For diagnostic applications it is essential to know possible errors in workflow and data analysis. There are already implemented mechanisms controlling and dealing with errors in a quality management controlled laboratory. Every next-generation sequencing platform and technique has its own application dependent error profile. Several groups have estimated errors for special fields of genomics, including bacterial, viral and antibody sequencing [4, 18, 19].
Here we present a detailed error assessment for sequences of NGS HLA typing on a 454 platform. We analyzed multiple runs and assess the level of safety for diagnostic NGS applications on the basis of how often errors occur, whether they recur and whether they are linked to sequence motifs.
Performance, accuracy and errors
Taking all six runs together, 373,792 reads passed the built-in quality filtering, with a total of 146,860,970 bases sequenced and an average read length of 393 base pairs.
Table: Overall run performances (columns: Qual 98% 400 bp, median read length, average read length; per-run values not preserved).
93.5% of the generated raw reads could be aligned to HLA reference sequences and were used for further analysis. After trimming primers and reducing reads to exon information, 118,484,408 bases (81% of the original output) were taken into account when calling variants and determining errors. 563 variants in the exon regions were defined as true variants, known from Sanger sequence-based typing (SBT) and additional pseudogene analysis. A further 13,505 variants were detected and categorized as errors.
109,473 reads had at least one error; therefore 31.3% of all reads contain errors in their coding region, and on average such a read had 2.08 errors. After applying the error correction tool Acacia, errors still remained in 25.1% of all reads.
The number of reads containing an error was multiplied by the corresponding length of the error, resulting in 212,415 erroneous bases. The total error rate of 0.18% was defined as the percentage of wrong bases among the total exon bases, with insertions accounting for 0.09%, deletions for 0.04% and substitutions for 0.05%. Insertions had an average length of 1.12 bases, deletions 1.07 bases and substitutions one base; overall, errors had an average length of 1.06 bases.
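The arithmetic behind the overall error rate and the share of affected reads can be checked with a short sketch. It uses only the counts reported in the text; the function name `percent` is a hypothetical helper introduced for illustration.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Counts reported in the text
total_exon_bases = 118_484_408
erroneous_bases = 212_415
reads_analyzed = 349_503
reads_with_errors = 109_473

overall_rate = percent(erroneous_bases, total_exon_bases)
reads_affected = percent(reads_with_errors, reads_analyzed)

print(f"{overall_rate:.2f}% of exon bases are erroneous")
print(f"{reads_affected:.1f}% of reads contain at least one error")
```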
38.15% of these errors were detected in all six runs, meaning 0.07% reproducible errors (0.03% insertions, 0.03% deletions and 0.008% substitutions) associated with 81,026 bases.
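One way to identify errors that recur in every run is a set intersection over per-run error lists. The following is a minimal sketch under the assumption that each error is keyed by amplicon, reference position and error type; the key layout and the toy data are illustrative and not taken from the study's scripts.

```python
def reproducible_errors(runs):
    """Return the errors observed in every run.

    Each run is an iterable of hashable error keys, here assumed to be
    (amplicon, reference_position, error_type) tuples.
    """
    run_sets = [set(run) for run in runs]
    if not run_sets:
        return set()
    return set.intersection(*run_sets)


# Toy data for three hypothetical runs
run_a = {("A_exon2", 101, "ins"), ("B_exon3", 57, "del"), ("C_exon2", 12, "sub")}
run_b = {("A_exon2", 101, "ins"), ("B_exon3", 57, "del")}
run_c = {("A_exon2", 101, "ins"), ("DRB1_exon2", 88, "sub")}

shared = reproducible_errors([run_a, run_b, run_c])
print(shared)
```

Only the insertion at position 101 of the first amplicon appears in all three toy runs, so it is the only error counted as reproducible.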
Several publications analyze accuracy and errors in 454 sequencing data. Huse et al. analyzed bacterial 16S rDNA with the older GS20 platform and confirmed their basic findings for Standard chemistry, Prabakaran et al. characterized errors in a small set of 3,467 antibody sequences, and Gilles et al. used control DNA fragments of the 454 workflow for error assessment. As stated previously, error characteristics are sequence-motif dependent; hence every application needs its own error profile.
Run performance of the GS Junior platform is stated to be approximately 136,760 reads per run for shotgun sequencing, and 70,000 reads are expected from amplicon experiments. Most of our runs in this study did not reach this number of sequences, averaging 62,299 reads; this was nevertheless sufficient for HLA genotyping of 10 samples (six loci per sample).
Per base error rates
The enzyme used for amplification has an error rate of 8.3×10⁻⁶. Accordingly, approximately 25,052 erroneous bases in our experiment are due to PCR artifacts, contributing 11.8% to our total error rate. Our error rate of 0.18% is markedly lower than previously published error rates: 0.49% for Standard chemistry, and 0.4% and 1.07% for Titanium chemistry [24, 27]. The high error rate of 1.07% can be explained by the use of the 454 control fragments for error analysis: (long) homopolymers, the weak point of 454 systems, are overrepresented in the control fragments compared with natural DNA sequences. Lind et al. give an error rate of 1.1% for a shotgun HLA sequencing approach with Standard chemistry. Since the GS20, many improvements in protocol, reagents and software have been made to the 454 technology. Additionally, reads tend to become error-prone towards their ends, so the (intron-)trimmed analysis further reduces the number of errors. As a consequence of this analysis strategy, 19% of the produced output is not analyzed.
Insertions (50%) are the most frequent errors, followed by substitutions (28%) and deletions (22%). The substitution rate is even lower than that stated for Illumina’s MiSeq system in Loman et al. Both publications mention insertions as the most frequent errors. In contrast to previously published error data, substitutions are the second most frequent errors, including PCR or application-specific errors. Gilles et al. reported a substitution rate seven times lower than the deletion rate, originating from the overrepresented homopolymers.
68.7% of all reads were free from errors, consistent with Huse et al. Hence, without denoising or smoothing, a loss of one third of the data must be taken into account. With error correction, an additional 6.2% of reads (of total reads generated) could be recovered, leaving a quarter of sequences still exhibiting errors. We use a conservative approach without additional modifications of the data to prevent the introduction of false positive mutations. The majority of reads containing errors (77.2%) have fewer than three wrong bases. The reduced error rate in our setting is the reason for the satisfactory average of 2.08 errors per read and the average length of 1.06 bases per error.
For 1,743 variants (13%) there was evidence (in at least one of the six runs) supporting the mutation in both sequencing directions, in accordance with Challis et al.
Read position and motifs
The occurrence of erroneous bases was strongly linked to read and reference position: 38.15% of them occurred at the same positions on resequencing. There is strong evidence that errors are also closely linked to specific sequence positions and DNA patterns. As a result, the individual error rates of the six runs differ only slightly from each other and from the given average values. Vandenbroucke et al. indicated that every amplicon has its own error profile.
Based on our examination we can state that more errors are located in the second half of the read than in the first half, indicated by a median error position of 236 with an average read length of 393.
Quality values calculated from the averaged error rates were compared to the average quality values estimated by the GS Junior at the same positions (Figure 5A). Below values of 30, the empirical quality value is higher than the estimated one; above 30 the GS Junior overestimates its own performance (Q30 = accuracy of 99.9%).
The distribution of quality scores along the read (Figure 5B) exhibits a very similar pattern in all runs, with valleys (lower quality scores) in some regions and peaks (high quality scores) in others. The overall pattern with a considerable decrease at around 300 bp is typical for all GS Junior runs; the positions and magnitudes of the peaks are library specific and highly reproducible. The quality scores surrounding error positions correspond to the overall run performance, which was slightly better in runs 4 and 5 and below expectations for run 6 due to variations in the complex workflow and chemistry.
Comparing the quality values of the actual error position with those of its neighborhood (see Figure 1) reveals that the erroneous base sits in a quality valley: quality values in the area around an error are below those at other positions, and the value at the actual error position is lower still.
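The relation between quality values and error rates used in this comparison is the standard Phred definition, Q = -10·log10(p). The sketch below shows the conversion in both directions and how an empirical quality value could be derived from observed error counts; the function names are hypothetical and the code is not the study's own implementation.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality value to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred quality value."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


def empirical_quality(errors: int, bases: int) -> float:
    """Phred-scaled quality derived from observed errors among called bases."""
    if bases <= 0 or errors <= 0:
        raise ValueError("errors and bases must be positive")
    return error_prob_to_phred(errors / bases)


# Q30 corresponds to an error probability of 0.001, i.e. 99.9% accuracy.
accuracy_q30 = 1 - phred_to_error_prob(30)

# Overall counts reported in the text, expressed on the Phred scale.
overall_q = empirical_quality(212_415, 118_484_408)
print(f"Q30 accuracy: {accuracy_q30:.3%}, overall empirical Q: {overall_q:.1f}")
```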
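A quality valley of this kind can be tested for programmatically. The following sketch assumes a per-read list of quality scores and a 0-based error index, and calls a position a valley if its score is strictly lower than every score within a small window on either side; the window size and the strictness of the criterion are assumptions made for illustration.

```python
def is_quality_valley(qualities, pos, window=3):
    """Return True if the score at pos is below all scores in the window around it."""
    if not 0 <= pos < len(qualities):
        raise IndexError("pos is outside the read")
    left = qualities[max(0, pos - window):pos]
    right = qualities[pos + 1:pos + 1 + window]
    neighbours = list(left) + list(right)
    if not neighbours:
        return False
    return qualities[pos] < min(neighbours)


# Toy quality profile with a dip at index 4
quals = [38, 37, 36, 30, 22, 31, 36, 38]
print(is_quality_valley(quals, 4))  # dip position
print(is_quality_valley(quals, 1))  # ordinary position
```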
Homopolymers pose a major challenge for base-calling algorithms in the 454 sequencing systems; thus, errors are strongly associated with homopolymer regions [4, 24]. At first glance, the 50.4% of errors located outside homopolymeric regions may seem to contradict this. Considering the distribution of homopolymers of given lengths in the HLA reference sequences, however, homopolymers are significantly (p<0.01) more prone to errors than single bases (proportions are plotted in Figure 2). In general, accuracy drops as homopolymer length increases, with the exception of 2-mers, which have the best quality scores at error positions (Figure 3).
In this study we present a detailed error characterization of 454 sequencing using data from a diagnostic assay. In our amplicon sequencing approach, 0.18% of the total bases used for HLA typing are erroneous. This error rate supports the use of 454 next-generation sequencing for HLA typing. Although amplicon sequencing is considered more sophisticated than shotgun sequencing from a bioinformatics perspective, the presented data are even better than those of previously published shotgun approaches.
Several software products are able to correct errors; however, most of them are specialized for a specific application and sequence context. Moreover, if error models are already known, many tools are able to simulate sequencing data from a reference sequence, but without taking neighboring sequence motifs into account [31-33].
Additionally, knowing error rates allows the sequence depth needed for a certain accuracy to be reduced, making diagnostics more cost-effective. The given data outperform previous publications that used test fragments, non-human samples, or outdated software or reagents.
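To relate errors to homopolymers, the homopolymer runs in a reference sequence first have to be located. A minimal sketch of this step is given below, using `itertools.groupby` from the Python standard library; the function name and the toy sequence are illustrative assumptions, not part of the study's Perl scripts.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=2):
    """Return (start, base, length) for each run of identical bases.

    start is the 0-based position of the first base of the run;
    runs shorter than min_len are skipped.
    """
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


example_runs = homopolymer_runs("ACCGTTTA")
print(example_runs)  # [(1, 'C', 2), (4, 'T', 3)]
```

Counting how many error positions fall inside such runs, relative to how many reference bases the runs cover, gives the kind of proportion that can then be compared against single bases.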
Clinical setting and experimental design
Data processing was carried out on the GS Junior attendant PC with default settings for Amplicon sequencing, without any modifications to the processing pipeline or filtering. HLA genotypes are routinely typed with ATF software (Conexio Genomics, Perth, Australia). For assessment of variations and errors the GS Amplicon Variant Analyzer (AVA) (Roche 454 Life Sciences, Branford, USA) was used for alignment and output of sequences.
Variant and error detection
Genotypes of tested samples were determined beforehand by Sanger SBT. Expected variants could therefore be defined with an allele database (IMGT/HLA 3.7.0 2012-07). To overcome missing intron information in the allele database, only exon sequence was considered. The AVA software does not output all detected variants by default. Variants were therefore generated by a Perl script (Roche 454 Life Sciences, Branford, CT, USA) that goes through all multiple alignments in AVA and reports discrepancies from the reference sequences. Sequences A*01:01:01:01, B*07:02:01, C*01:02:01, DQB1*02:01:01, DRB1*01:01:01 and DPB1*01:01:01 were used as references.
Detected variants were compared to known variants. For locus A, exon 2, the pseudogene HLA-Y is amplified by approximately 25%; for locus DRB1, the loci DRB3, DRB4 and DRB5 are also amplified. These known side-products were not considered as errors. Alignments were examined for pseudogene evaluation.
Acacia was used as an error correction tool with default parameters, and the corrected sequences were investigated with respect to the previous error results.
A series of Perl 5.10.0 scripts (The Perl Foundation, Walnut, CA, USA) was used for variant data extraction, mapping of quality values to variant positions, and assessment of read qualities and homopolymer runs. R 2.14.2 (2012-02-29) was used for graphics generation and statistical tests. To average quality scores, they were translated to error rates, averaged, and then converted back to quality scores.
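The idea of reporting discrepancies between an aligned read and its reference can be sketched as follows. This is not the Perl script used in the study; it is an illustrative Python version that assumes a pairwise gapped alignment with `-` as the gap character and reports 1-based reference positions, with insertions assigned to the preceding reference base.

```python
def call_discrepancies(ref_aln, read_aln, gap="-"):
    """List (ref_position, kind, detail) for each column where read and reference differ."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned sequences must have equal length")
    found = []
    ref_pos = 0  # number of reference bases consumed so far
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == gap and read_base == gap:
            continue  # empty column, nothing to report
        if ref_base == gap:
            # extra base in the read, placed after the last reference base
            found.append((ref_pos, "insertion", read_base))
        elif read_base == gap:
            ref_pos += 1
            found.append((ref_pos, "deletion", ref_base))
        else:
            ref_pos += 1
            if ref_base != read_base:
                found.append((ref_pos, "substitution", f"{ref_base}>{read_base}"))
    return found


calls = call_discrepancies("ACG-TA", "ATGCT-")
print(calls)
```

Discrepancies that match variants expected from the known genotype would be treated as true variants, and the remainder as candidate errors.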
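The averaging of quality scores via error rates can be illustrated with a short sketch. It assumes the standard Phred relation Q = -10·log10(p); the function name is hypothetical and the code is not the authors' script.

```python
import math


def average_quality(scores):
    """Average Phred scores by averaging their error probabilities."""
    if not scores:
        raise ValueError("scores must not be empty")
    probs = [10 ** (-q / 10) for q in scores]
    mean_prob = sum(probs) / len(probs)
    return -10 * math.log10(mean_prob)


avg_q = average_quality([20, 40])
print(f"{avg_q:.2f}")
```

For the scores 20 and 40 this yields a value close to 23 rather than the arithmetic mean of 30, because the average is dominated by the lower-quality base. Averaging on the error-rate scale therefore avoids overstating the quality of a position.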
Availability of supporting data
Sequence information is available at NCBI’s SRA database, accession number SRP020222.
HLA: Human leukocyte antigens
PCR: Polymerase chain reaction
AVA: Amplicon Variant Analyzer
EFI: European Federation for Immunogenetics.
The authors thank Sabine Singh for critical review of the manuscript.
1. Kodama Y, Shumway M, Leinonen R: The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res. 2012, 40: D54-D56. 10.1093/nar/gkr854.
2. Loman NJ, Misra RV, Dallman TJ, Constantinidou C, Gharbia SE, Wain J, et al: Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol. 2012, 30: 434-439. 10.1038/nbt.2198.
3. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, et al: Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005, 437: 376-380.
4. Huse SM, Huber JA, Morrison HG, Sogin ML, Welch DM: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol. 2007, 8: R143. 10.1186/gb-2007-8-7-r143.
5. De Schrijver JM, De LK, Lefever S, Sabbe N, Pattyn F, Van NF, et al: Analysing 454 amplicon resequencing experiments using the modular and database oriented Variant Identification Pipeline. BMC Bioinforma. 2010, 11: 269. 10.1186/1471-2105-11-269.
6. Voelkerding KV, Dames SA, Durtschi JD: Next-generation sequencing: from basic research to diagnostics. Clin Chem. 2009, 55: 641-658. 10.1373/clinchem.2008.112789.
7. Voelkerding KV, Dames S, Durtschi JD: Next generation sequencing for clinical diagnostics-principles and application to targeted resequencing for hypertrophic cardiomyopathy: a paper from the 2009 William Beaumont Hospital Symposium on Molecular Pathology. J Mol Diagn. 2009, 2010 (12): 539-551.
8. Gabriel C, Stabentheiner S, Danzer M, Proll J: What Next? The Next Transit from Biology to Diagnostics: Next Generation Sequencing for Immunogenetics. Transfus Med Hemother. 2011, 38: 308-317. 10.1159/000332433.
9. Lank SM, Golbach BA, Creager HM, Wiseman RW, Keskin DB, Reinherz EL, et al: Ultra-high resolution HLA genotyping and allele discovery by highly multiplexed cDNA amplicon pyrosequencing. BMC Genomics. 2012, 13: 378. 10.1186/1471-2164-13-378.
10. Bentley G, Higuchi R, Hoglund B, Goodridge D, Sayer D, Trachtenberg EA, et al: High-resolution, high-throughput HLA genotyping by next-generation sequencing. Tissue Antigens. 2009, 74: 393-403. 10.1111/j.1399-0039.2009.01345.x.
11. Holcomb CL, Hoglund B, Anderson MW, Blake LA, Bohme I, Egholm M, et al: A multi-site study using high-resolution HLA genotyping by next generation sequencing. Tissue Antigens. 2011, 77: 206-217. 10.1111/j.1399-0039.2010.01606.x.
12. Wang C, Krishnakumar S, Wilhelmy J, Babrzadeh F, Stepanyan L, Su LF, et al: High-throughput, high-fidelity HLA genotyping with deep sequencing. Proc Natl Acad Sci U S A. 2012, 109: 8676-8681. 10.1073/pnas.1206614109.
13. Shiina T, Suzuki S, Ozaki Y, Taira H, Kikkawa E, Shigenari A, et al: Super high resolution for single molecule-sequence-based typing of classical HLA loci at the 8-digit level using next generation sequencers. Tissue Antigens. 2012.
14. Gabriel C, Danzer M, Hackl C, Kopal G, Hufnagl P, Hofer K, et al: Rapid high-throughput human leukocyte antigen typing by massively parallel pyrosequencing for high-resolution allele identification. Hum Immunol. 2009, 70: 960-964. 10.1016/j.humimm.2009.08.009.
15. Proll J, Danzer M, Stabentheiner S, Niklas N, Hackl C, Hofer K, et al: Sequence capture and next generation resequencing of the MHC region highlights potential transplantation determinants in HLA identical haematopoietic stem cell transplantation. DNA Res. 2011, 18: 201-210. 10.1093/dnares/dsr008.
16. Spellman SR, Eapen M, Logan BR, Mueller C, Rubinstein P, Setterholm MI, et al: A perspective on the selection of unrelated donors and cord blood units for transplantation. Blood. 2012, 120: 259-265. 10.1182/blood-2012-03-379032.
17. Danzer M, Niklas N, Stabentheiner S, Hofer K, Pröll J, Stückler C, et al: Rapid, scalable and highly automated HLA genotyping using next-generation sequencing: A transition from research to diagnostics. BMC Genomics. 2013, 14: 221. 10.1186/1471-2164-14-221.
18. Skums P, Dimitrova Z, Campo DS, Vaughan G, Rossi L, Forbi JC, et al: Efficient error correction for next-generation sequencing of viral amplicons. BMC Bioinforma. 2012, 13 (10): S6.
19. Prabakaran P, Streaker E, Chen W, Dimitrov DS: 454 antibody sequencing - error characterization and correction. BMC Res Notes. 2011, 4: 404. 10.1186/1756-0500-4-404.
20. 454 Life Science Corp: 454 Sequencing System Software Manual v2.7. 454 Manual. 2012.
21. Bragg L, Stone G, Imelfort M, Hugenholtz P, Tyson GW: Fast, accurate error-correction of amplicon pyrosequences using Acacia. Nat Methods. 2012, 9: 425-426. 10.1038/nmeth.1990.
22. Brockman W, Alvarez P, Young S, Garber M, Giannoukos G, Lee WL, et al: Quality scores and SNP detection in sequencing-by-synthesis systems. Genome Res. 2008, 18: 763-770. 10.1101/gr.070227.107.
23. Huse SM, Welch DM, Morrison HG, Sogin ML: Ironing out the wrinkles in the rare biosphere through improved OTU clustering. Environ Microbiol. 2010, 12: 1889-1898. 10.1111/j.1462-2920.2010.02193.x.
24. Gilles A, Meglecz E, Pech N, Ferreira S, Malausa T, Martin JF: Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing. BMC Genomics. 2011, 12: 245. 10.1186/1471-2164-12-245.
25. 454 Life Science Corp: 454 Sequencing System Guidelines for Amplicon Experimental Design. 454 Guidelines. 2011.
26. Frey B, Suppmann B: Demonstration of the Expand™ PCR System’s Greater Fidelity and Higher Yields with a lacI-based PCR Fidelity Assay. Biochemica. 1995, 1: 8-9.
27. Quince C, Lanzen A, Davenport RJ, Turnbaugh PJ: Removing noise from pyrosequenced amplicons. BMC Bioinforma. 2011, 12: 38. 10.1186/1471-2105-12-38.
28. Lind C, Ferriola D, Mackiewicz K, Heron S, Rogers M, Slavich L, et al: Next-generation sequencing: the solution for high-resolution, unambiguous human leukocyte antigen typing. Hum Immunol. 2010, 71: 1033-1042. 10.1016/j.humimm.2010.06.016.
29. Challis D, Yu J, Evani US, Jackson AR, Paithankar S, Coarfa C, et al: An integrative variant analysis suite for whole exome next-generation sequencing data. BMC Bioinforma. 2012, 13: 8. 10.1186/1471-2105-13-8.
30. Vandenbroucke I, Van MH, Verhasselt P, Thys K, Mostmans W, Dumont S, et al: Minor variant detection in amplicons using 454 massive parallel pyrosequencing: experiences and considerations for successful applications. Biotechniques. 2011, 51: 167-177.
31. McElroy KE, Luciani F, Thomas T: GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics. 2012, 13: 74. 10.1186/1471-2164-13-74.
32. Lysholm F, Andersson B, Persson B: An efficient simulator of 454 data using configurable statistical models. BMC Res Notes. 2011, 4: 449. 10.1186/1756-0500-4-449.
33. Balzer S, Malde K, Lanzen A, Sharma A, Jonassen I: Characteristics of 454 pyrosequencing data-enabling realistic simulation with flowsim. Bioinformatics. 2010, 26: i420-i425. 10.1093/bioinformatics/btq365.
34. Churchill GA, Waterman MS: The accuracy of DNA sequences: estimating sequence quality. Genomics. 1992, 14: 89-98. 10.1016/S0888-7543(05)80288-5.
35. Robinson J, Mistry K, McWilliam H, Lopez R, Parham P, Marsh SG: The IMGT/HLA database. Nucleic Acids Res. 2011, 39: D1171-D1176. 10.1093/nar/gkq998.
36. R Development Core Team: R: A Language and Environment for Statistical Computing. 2012, Vienna, Austria: R Foundation for Statistical Computing.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.