Improved base-calling and quality scores for 454 sequencing based on a Hurdle Poisson model
© De Beuf et al.; licensee BioMed Central Ltd. 2012
Received: 6 April 2012
Accepted: 1 November 2012
Published: 15 November 2012
454 pyrosequencing is a commonly used massively parallel DNA sequencing technology with a wide variety of application fields such as epigenetics, metagenomics and transcriptomics. A well-known problem of this platform is its sensitivity to base-calling insertion and deletion errors, particularly in the presence of long homopolymers. In addition, the base-call quality scores are not informative with respect to whether an insertion or a deletion error is more likely. Surprisingly, not much effort has been devoted to the development of improved base-calling methods and more intuitive quality scores for this platform.
We present HPCall, a 454 base-calling method based on a weighted Hurdle Poisson model. HPCall uses a probabilistic framework to call the homopolymer lengths in the sequence by modeling well-known 454 noise predictors. Base-calling quality is assessed based on estimated probabilities for each homopolymer length, which are easily transformed to useful quality scores.
Using a reference data set of the Escherichia coli K-12 strain, we show that HPCall produces superior quality scores that are very informative towards possible insertion and deletion errors, while maintaining a base-calling accuracy that is better than the current one. Given the generality of the framework, HPCall has the potential to also adapt to other homopolymer-sensitive sequencing technologies.
A first step in the analysis of next-generation sequencing (NGS) data is the transformation of the measured intensity signals to a sequence of nucleotides. This process, referred to as base-calling, is an important task, as systematic base-calling errors may mislead downstream analysis , e.g. in genome assembly and sequence mapping. More accurate base-calling and more reliable base-calling quality scores result in a better distinction between sequencing errors and true polymorphisms between the base-called reads and a reference sequence. This is an essential merit in the detection of single nucleotide polymorphisms (SNPs) or sequence variants [2–4]. A myriad of applications such as the characterization of HIV mutation spectra , the detection of somatic mutations in cancer , and the identification of operational taxonomic units in metagenomics  have the potential to benefit from improved base-calling and more informative quality scores.
The 454 Life Sciences system is based on the sequencing-by-synthesis principle. In each flow of the sequencing process, light produced by a pyrosequencing reaction is emitted if one or more identical nucleotides are incorporated into the DNA template. The addition of each of the 4 possible nucleotide solutions A, C, G or T occurs in a fixed and known order. Hence, 454 base-calling is a matter of discerning the number of incorporated nucleotides or homopolymer length (HPL) of a known nucleotide type from the measured intensity signal in each flow . Consequently, the principal sources of 454 sequencing errors are insertion and deletion errors (indels). These are more frequent in sequences containing long homopolymers [9, 10], because the increase of intensity signal when more nucleotides are incoporated, attenuates at higher HPLs. This makes it harder to discriminate between subsequent homopolymer lengths, resulting in an inflation of undercalls or overcalls as the HPL increases (Additional file 1: Figure S1).
For these reasons we have developed HPCall, a general probabilistic framework that seamlessly integrates the base-calling with more informative quality score assignment. HPCall is based on the classification of the calls in groups representing the possible HPLs. To this end, a statistical model for count data is used that predicts the HPL in each sequencing cycle as a function of a number of explanatory variables. A particular property of the 454 pyrosequencing process is the high abundance of background intensities (HPL 0). Whenever a nucleotide flow does not match the nucleotide at the interrogated position of the DNA template, no nucleotide is incorporated and no sequencing reaction takes place. The resulting light intensity mainly reflects background optical noise. Consequently, there are more zero counts than expected for a Poisson distribution. Hurdle Poisson regression models  are one way to deal with these excess zeros.
For each possible HPL, the model produces an estimated probability that this HPL is truly present in the DNA sequence at the interrogated position. The called HPL is then the HPL with the largest estimated probability. The calculation of these probabilities allows the simultaneous construction of quality scores. These scores directly reflect the base-calling’s uncertainties and provide information about potential undercall or overcall errors. In the model we combine information of flowgrams and earlier-stage raw intensities. By including the raw intensities we employ the additional information otherwise removed by the preprocessing (Figure 1), both for the base-calling and for the calculation of the quality scores. However, they are not strictly necessary for the method to provide valid results.
To assess the base-calling accuracies, DNA of the reference K-12 strain MG1655 of the bacterium Escherichia coli was sequenced at the NXTGNT sequencing center, using shotgun sequencing with Titanium reagents. Results are presented for a random subset of 15000 out of the 635979 produced reads. The reads in the standard 454 pipeline were produced using software version 2.3.
Before running the Hurdle Poisson base-calling model a preliminary data preparation step is performed in HPCall. In this step several raw data files are merged to create a data set that can be used for calibration of the model, if needed, and for subsequent base-calling. After the base-calling three output files are created: (a) the base-called reads, (b) the associated Phred-like quality scores, and (c) a file with the base-calling probabilities by HPL. A visualization of the base-calling pipeline, including a more detailed description, can be found in the Additional file (Additional file 2: Figure S2).
Weighted Hurdle Poisson model
By considering a positive θ the underdispersion in the count data can be modeled properly.
with the f j and g j being smooth functions of the corresponding predictor variables xj,bcand yj,bc, respectively. Details about the smooth functions are provided in the Additional file 3. Note that the predictor variables in (5) and (6) can be specified separately. In this paper we propose that the following information is used in either or both of the 2 submodels:
flowgram values in the current flow;
log2 raw intensities in the current flow with or without a read-specific normalization;
the cumulative sum of flowgram values and log2 raw intensities up to the current flow; this allows for modeling a cycle-specific effect and recognizes the abundance of homopolymers in the preceding flows;
flowgram values of 1, 4 and/or 8 flows before and/or after the current flow; this corrects for homopolymers in preceding and subsequent flows.
Parameter estimation or model training
The model parameters are estimated by maximum likelihood. This is done by using an iteratively reweighted least squares procedure  for both submodels. Efficient fitting of the model is conducted by the -package .
The estimation of the model parameters is based on the use of a representative training data set generated from a reference DNA sequencing experiment. The corresponding reference HPLs are the class indicators in the classification model. For the E. coli reference run we randomly selected 1000 reads to fit the base-calling model. The remaining 14000 reads in the subset were used for assessing the performance of the base-calling method.
Base-calling and quality score construction
In flow bc, HPCall calls the HPL n bc for which is maximal. These probabilities are obtained by plugging the estimated parameters from submodels (5) and (6) into model (1). They are also very useful quality scores because they provide a direct probabilistic interpretation to the base-calling uncertainties and give insight into potential undercall or overcall errors. Moreover, they can also be used for the construction of Phred scores in a similar fashion as the traditional 454 quality scores: it is a quality score that reflects the probability that the called base is not an overcall. In particular, the Phred-like quality score of the k-th called base in a homopolymer stretch (k > 0) is thus given by: . Since we can obtain the probabilities for all possible HPLs, we can also calculate an alternative quality score that reflects the probability that the called base is not an undercall. This is given by . Using Q Sk,overcall and Q Sk,undercall a new quality score is calculated: Q Sk,HPCall = Idir × min(Q Sk,undercall,Q Sk,overcall) with Idir = − 1 if Q Sk,undercall < Q Sk,overcall and Idir = 1 if Q Sk,undercall > Q Sk,overcall. The sign of Q Sk,HPCall thus indicates whether an undercall or an overcall is more likely.
The performance of HPCall is compared with that of the native 454 base-caller and of Pyrobayes  based on the E. coli reference run. The Phred-like quality scores produced by the different base-callers are compared to ‘observed’ quality scores. The latter are computed by grouping all the bases with an equal quality score together, and computing for each group the proportion of overcalls. An observed quality score is calculated as Q Sobserved = −10 log10(observed overcall error rate). High HPCall quality scores are trimmed to 40, just like it is done by the native 454 base-caller. Further, the proportions of high-quality bases for the different base-callers are compared. Next, we illustrate the added value of the HPCall base-calling probabilities and the new quality scores Q SHPCall.
The raw base-calling accuracy is assessed to give insight into the base- and read-level error rates. For HPCall the reproducibility of this accuracy is evaluated based on 10 random training data sets. Subsequently, an indel and SNP-analysis is conducted using variant detection software. We use both ssaha2  and subread  (http://sourceforge.net/projects/subread/) to map the base-called reads to the reference sequence and ssahaSNP  to compute the number of sequence variants, both SNPs and indels. False positive calls are determined by comparing the base-calls to the E. coli K-12 strain reference genome. Finally, the computational performance for the different methods is compared.
Quality scores and base-calling probabilities
Base-calling probabilities example 1: undercall
reference sequence: AAAAA
native 454: AAAA
Q S 454
Q S HPCall
Base-calling probabilities example 2: overcall
reference sequence: AA
native 454: AAA
Q S 454
Q S HPCall
Base-calling probabilities example 3: 0-1 undercall
reference sequence: T
native 454: -
Q S 454
Q S HPCall
Read-wise assessment and sequence variant analysis
The reads produced by HPCall are mapped to the reference sequence using ssaha2 and subread. In the mapping of the HPCall reads the traditional quality scores produced by HPCall without sign information are used. The read-wise error rate is compared to the native 454 base-caller (Additional file 10: Table S2). For this data set mapping percentages of 99.47% (ssaha2) and 99.43% (subread) are obtained. 66% (ssaha2) to 69% (subread) of the reads produced by HPCall map perfectly to the reference genome of the E. coli K-12 strain, whereas this is only the case for 56% (ssaha2) to 60% (subread) of the reads produced by the native 454 base-caller. This evidently leads to a higher percentage of 454 reads with at least one mismatch to the reference genome as compared to reads generated by HPCall. The mapping location of the reads produced by HPCall and those produced by the native 454 base-caller differed by more than 10 bp for only 2 reads. The good performance of HPCall was confirmed by a read-wise assessment on data of a 454 amplicon resequencing experiment of the human TP53 gene. More details can be found in the Additional file 11. Sequence variants of the mapped reads are detected by the ssahaSNP program (Additional file 10: Table S3). A reduction of the number of sequence variants with 40% is obtained when using HPCall as compared to the native 454 base-caller. The decrease is observed both for indels and for SNPs.
The performance of HPCall was tested using a representative 454 dataset containing 198,347 sequences. Both the native 454 basecaller and Pyrobayes processed the dataset in approximately 4h, while the base-calling by HPCall was conducted in approximately 6.5h. Both Pyrobayes and HPCall had comparable memory footprints of less than 1GB. HPCall used 3.5GB hard disk space to store the preprocessed data before actual basecalling took place. The computational performance was measured on a 2×6 Core Intel Xeon X7460, 2.66 GHz Processors GNU/Linux server system with 128 GB RAM.
The HPCall pipeline contains three modules. The first is a preprocessing module that stores all required data in a SQL database. The second module performs the actual base-calling by means of the R package . All base-calls, HPCall probabilities and quality scores are postprocessed in the final module to produce the final output files. The HPCall software and manual are available at https://sourceforge.net/projects/hpcall/.
One of the main contributions of HPCall is that the base-calling and quality score assignment are seamlessly integrated and occur simultaneously, instead of in two separate steps. For a given cycle and nucleotide, the probability of being the correct HPL is estimated for each possible HPL based on different noise predictors, and the call corresponds to the HPL with the maximum probability. In this way the extent of the maximal probability provides direct information about the base-calling uncertainty and can thus be used as a measure for the base-calling quality. Moreover, in the case of a miscall the second largest probability indicates whether an undercall or an overcall is more likely. This information is important for the downstream analysis of sequencing data, but it is completely lacking from traditional Phred-like quality scores produced by current 454 base-callers. The distributions of maximum base-calling probabilities in case of a miscall are more evenly distributed between 0.5 and 1 than in the case of a correct call where it is very often nearly 1. This suggests that relatively small maximum probabilities are often associated with miscalls and therefore should raise caution.
Because they are commonly used in the analysis of NGS experiments, HPCall also calculates Phred-like quality scores, based on the base-calling probabilities. These can be used in the same way as 454 quality scores. They are related to the probability of not having an overcall. These ‘overcall’ quality scores appear to compete well with the 454 quality scores, while the Pyrobayes quality scores perform clearly worse. At the same time, however, HPCall produces considerably more high-quality scores. Since we have all possible base-calling probabilities at our disposal, we can also calculate alternative quality scores based on the probability of not having an undercall. Subsequently, a summarizing Phred-like quality score is constructed by determining which of these two quality scores has the smallest value at the base-called HPL and this information is coded by the sign of the quality score (minus for undercall, plus for overcall). This new quality score now also contains information about the direction of a possible miscall. Quality-aware sequence aligners may use these scores to provide more reliable mapping results. We further illustrate the use of the HPCall base-calling probabilities and the Phred-like HPCall quality scores for assessing indels in sequence variant detection. In each sequencing flow, the native 454 base-caller produces quality scores for each called base, i.e. for a homopolymer of length 3, also 3 quality scores are provided. These quality scores are not informative to discriminate between potential undercalls or overcalls. Furthermore, in the situation that 0 bases are called instead of 1, no quality scores are provided by the other base-callers. Hence, no information is given about the probability that indeed 0 bases should have been called. In contrast, HPCall clearly indicates which type of miscall - undercall or overcall - is possibly to be expected in these examples, by means of the second highest base-calling probability and the sign of the HPCall quality score.
Besides the added value of the base-calling probabilities and quality scores, the prediction accuracy of HPCall surpasses that of the native 454 base-caller and of Pyrobayes. Based on our E. coli data we detect a 35% reduction of base-calling errors as compared to the current 454 base-caller. This reduction is quite stable throughout the whole HPL range. This number is based on a model that uses not only information from the preprocessed flowgram values, but also from the earlier-stage raw intensities to call the HPL in each flow of the sequencing process. If only information of flowgram values is used, the reduction of base-calling errors is still there, but it is smaller. Hence, although preprocessing raw intensities to flowgram values in a separate step prior to base-calling has the merit of reducing the spatial, read-specific and background optical noise in the data to a large extent, it also seems to remove crucial information for the base-calling task itself. The lower number of base-calling errors is also reflected in the lower number of detected indels and SNPs after mapping the base-called reads to the E. coli reference sequence. The beneficial performance of HPCall was confirmed on a 454 amplicon resequencing experiment of the human TP53 gene. When HPCall is run using the model trained on the E. coli data set the base-calling accuracy slightly decreases (see Additional file 12). For optimal results, it is therefore recommended to retrain the model for different experiments. For calibration of the base-caller the associated HPLs of a reference sequence are used to fit the model. A possible way to implement this is by adding plasmids to the sequencing experiment. The 454 sequencer uses control reads containing varying HPLs for recalibrating its native base-caller. Hence, these control reads would be very valuable for this purpose. Up to now, however, the 454 software does not allow to extract the flowgram values associated with these reads. The larger accuracy and creation of the more informative quality scores by HPCall comes at the cost of an additional computing time that is in the same order as the time for native 454 base-calling. Based on the E. coli data, the accuracy performance of HPCall is stable across different training data sets used to fit the model.
While HPCall was primarily developed for base-calling of 454 data, it has the potential to be used for other homopolymer-sensitive sequencing platforms as well, e.g. the PGM sequencer of Ion Torrent. Within the broad framework of the Hurdle Poisson model the algorithms to train the model remain unchanged. This also means that similar informative quality scores can be produced for other platforms. Only the explanatory variables used to predict the HPL will be specific for each platform, e.g. the nucleotide flow order of the Ion PGM sequencer is different from the 454 sequencer. Although it was not the main focus of this research, a first test with PGM 314 E. coli reference data already shows promising results (details in Additional file 12).
In this paper, we have proposed an alternative method for the base-calling of 454 pyrosequencing data, referred to as HPCall. Based on the obtained results, we strongly believe that using HPCall for base-calling and taking advantage of the base-calling probabilities in downstream tasks like mapping, genome assembly and sequence variant detection will lead to more accurate and powerful applications. Although HPCall is developed based on sequencing data of the 454 sequencing system, the underlying probabilistic framework is quite general. Therefore, we expect it to be rather straightforward to adapt HPCall for use in novel emerging sequencing platforms based on flow cycles, for which base-calling of long homopolymers is critical, e.g. Ion Torrent PGM.
Herwig Van Marck and Bie Verbist are kindly acknowledged for the insightful discussions. Part of this research was supported by IAP research network grant no. P6/03 of the Belgian government (Belgian Science Policy) and Ghent University (Multidisciplinary Research Partnership “Bioinformatics: from nucleotides to networks”).
- Corrada Bravo H, Irizarry RA: Model-based quality assessment and base-calling for second-generation sequencing data. Biometrics. 2010, 66: 665-674. 10.1111/j.1541-0420.2009.01353.x.View Article
- Pop M: Genome assembly reborn: recent computational challenges. Briefings Bioinf. 2009, 10 (4): 354-366. 10.1093/bib/bbp026.View Article
- Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008, 18 (11): 1851-1858. 10.1101/gr.078212.108.PubMed CentralView ArticlePubMed
- Quinlan AR, Stewart DA, Stromberg MP, Marth GT: Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nat Methods. 2008, 5 (2): 179-181. 10.1038/nmeth.1172.View ArticlePubMed
- Hoffmann C, Minkah N, Leipzig J, Wang G, Arens MQ, Tebas P, Bushman FD: DNA bar coding and pyrosequencing to identify rare HIV drug resistance mutations. Nucleic Acids Res. 2007, 35 (13): e91-10.1093/nar/gkm435.PubMed CentralView ArticlePubMed
- Kan Z, Jaiswal BS, Stinson J, Janakiraman V, Bhatt D, Stern HM, Yue P, Haverty PM, Bourgon R, Zheng J, Moorhead M, Chaudhuri S, Tomsho LP, Peters BA, Pujara K, Cordes S, Davis DP, Carlton VEH, Yuan W, Li L, Wang W, Eigenbrot C, Kaminker JS, Eberhard DA, Waring P, Schuster SC, Modrusan Z, Zhang Z, Stokoe D, de Sauvage FJ, Faham M, Seshagiri S: Diverse somatic mutation patterns and pathway alterations in human cancers. Nature. 2010, 466 (7308): 869-873. 10.1038/nature09208.View ArticlePubMed
- Petrosino JF, Highlander S, Luna RA, Gibbs RA, Versalovic J: Metagenomic pyrosequencing and microbial identification. Clin Chem. 2009, 55 (5): 856-866. 10.1373/clinchem.2008.107565.PubMed CentralView ArticlePubMed
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer MLI, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, et al: Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005, 437 (7057): 376-380.PubMed CentralPubMed
- Huse S, Huber J, Morrison H, Sogin M, Welch D: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol. 2007, 8 (7): R143-10.1186/gb-2007-8-7-r143.PubMed CentralView ArticlePubMed
- Shendure J, Ji H: Next-generation DNA sequencing. Nat Biotech. 2008, 26 (10): 1135-1145. 10.1038/nbt1486.View Article
- Brockman W, Alvarez P, Young S, Garber M, Giannoukos G, Lee WL, Russ C, Lander ES, Nusbaum C, Jaffe DB: Quality scores and SNP detection in sequencing-by-synthesis systems. Genome Res. 2008, 18 (5): 763-770. 10.1101/gr.070227.107.PubMed CentralView ArticlePubMed
- Ewing B, Green P: Base-calling of automated sequencer traces using Phred. II. error probabilities. Genome Res. 1998, 8 (3): 186-194.View ArticlePubMed
- Frith MC, Wan R, Horton P: Incorporating sequence quality data into alignment improves DNA read mapping. Nucleic Acids Res. 2010, 38 (7): e100-10.1093/nar/gkq010.PubMed CentralView ArticlePubMed
- Hamada M, Wijaya E, Frith MC, Asai K: Probabilistic alignments with quality scores: an application to short-read mapping toward accurate SNP/indel detection. Bioinformatics. 2011, 27 (22): 3085-3092. 10.1093/bioinformatics/btr537.View ArticlePubMed
- Albers CA, Lunter G, MacArthur DG, McVean G, Ouwehand WH, Durbin R: Dindel: Accurate indel calls from short-read data. Genome Res. 2011, 21 (6): 961-973. 10.1101/gr.112326.110.PubMed CentralView ArticlePubMed
- Mullahy J: Specification and testing of some modified count data models. J Econometrics. 1986, 33 (3): 341-365. 10.1016/0304-4076(86)90002-3.View Article
- Ridout MS, Besbeas P: An empirical model for underdispersed count data. Stat Modell. 2004, 4: 77-89. 10.1191/1471082X04st064oa.View Article
- Hastie TJ, Tibshirani RJ: Generalized additive models. 1990, London: Chapman & Hall
- McCullagh P, Nelder JA: Generalized Linear Models, Second Edition (Monographs on Statistics and Applied Probability). 1989, London: Chapman & Hall/CRC
- Yee T: The VGAM package. R News. 2008, 8: 28-39.
- Ning Z, Cox AJ, Mullikin JC: SSAHA: a fast search method for large DNA databases. Genome Res. 2001, 11 (10): 1725-1729. 10.1101/gr.194201.PubMed CentralView ArticlePubMed
- Liao Y, Shi W: Subread: a superfast read aligner with high sensitivity and accuracy (In preparation).
- Ning Z, Caccamo M, Mullikin JC: ssahaSNP - a polymorphism detection tool on a whole genome scale. 2005 IEEE Comput Syst Bioinf Conference - Workshops. 2005, 0: 251-252.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.