Base calling for high-throughput short-read sequencing: dynamic programming solutions
© Das and Vikalo; licensee BioMed Central Ltd. 2013
Received: 20 September 2012
Accepted: 20 March 2013
Published: 15 April 2013
Next-generation DNA sequencing platforms are capable of generating millions of reads in a matter of days at rapidly decreasing costs. Despite their proliferation and continued technological improvements, the performance of next-generation sequencing remains adversely affected by imperfections in the underlying biochemical and signal acquisition procedures. To this end, various techniques, including statistical methods, are used to improve the read lengths and accuracy of these systems. The development of base calling algorithms that are accurate, computationally efficient, and scalable remains an ongoing challenge.
We develop model-based statistical methods for fast and accurate base calling in Illumina’s next-generation sequencing platforms. In particular, we propose a computationally tractable parametric model which enables dynamic programming formulation of the base calling problem. Forward-backward and soft-output Viterbi algorithms are developed, and their performance and complexity are investigated and compared with the existing state-of-the-art base calling methods for this platform. A C code implementation of our algorithm named Softy can be downloaded from https://sourceforge.net/projects/dynamicprog.
We demonstrate high accuracy and speed of the proposed methods on reads obtained using Illumina’s Genome Analyzer II and HiSeq2000. In addition to performing reliable and fast base calling, the developed algorithms enable incorporation of prior knowledge which can be utilized for parameter estimation and is potentially beneficial in various downstream applications.
Advancements in sequencing technologies have enabled fast sequencing of millions of reads at a rapidly decreasing cost. Sequencing-by-synthesis, including Illumina’s platforms based on reversible terminator chemistry and 454’s pyrosequencing platforms, is taking us closer to affordable routine sequencing. However, inherent uncertainties in ensemble-based sequencing-by-synthesis, along with data acquisition noise, present a major bottleneck in the quest for reads that are as long and reliable as those provided by the conventional Sanger sequencing method.
Sequencing-by-synthesis on Illumina’s platforms typically involves the following steps. First, multiple copies of the genome being sequenced are randomly broken into short fragments, followed by ligation of sequencing adapters to the fragments. In the next phase, the DNA sample is introduced into one of the 8 lanes (or 16 for HiSeq), each split into multiple tiles containing a lawn of primers covalently bonded to the surface, where clonal clusters of the captured forward and reverse DNA strands are generated. In particular, captured strands hybridize to neighboring primers to form so-called U-shaped bridges, and the ensuing bridge amplification is repeated ≈35 times to generate clusters containing ≈2000 molecules. In the final stage, sequencing, fluorescently labeled reversible terminators (a different color for each base) are introduced and incorporated into the complementary strands of the DNA templates. Imaging of the fluorescently labeled clusters is followed by cleavage and unblocking of the incorporated nucleotides, and labeled reversible terminators are added anew to proceed with synthesis of the complementary strands.
Images acquired at the end of each sequencing cycle are processed, and the resulting signals are analyzed to determine the type of nucleotide incorporated into each complementary strand. The problem of inferring the order of nucleotides in a template from the noisy signals is referred to as base calling. For base calling, Illumina uses its in-house software Bustard. Fundamentally, Bustard attempts to reverse the effects of various uncertainties on the signal and then makes base calls. While it is computationally very fast, the error rates of Bustard’s calls may be significantly improved upon by more sophisticated algorithms.
Several base calling strategies for the Illumina platform have been proposed in recent years, e.g., [2-4]. In [5], a parametric model of the sequencing-by-synthesis process was proposed and a Monte Carlo method for base calling was presented. While outperforming Bustard, this scheme is computationally intensive and impractical for processing the tens of millions of reads generated by today’s sequencers. The follow-up paper by Kao and Song [6] suggested a modification leading to computationally feasible base calling, albeit with a slight degradation in performance compared to [5]. In [7], an algorithm was presented that achieves a significant improvement in speed at the cost of a minor deterioration in base calling error rate.
Following a simplification of the parametric model of Illumina’s sequencing-by-synthesis platform proposed in [5], in this paper we present a formulation of the base calling problem amenable to being solved by dynamic programming methods. In particular, we derive the forward-backward and soft-output Viterbi algorithm (SOVA) for solving the base calling problem. The performance of the proposed algorithms is demonstrated on experimental reads acquired from Illumina’s Genome Analyzer II and HiSeq2000 and compared with several recent base calling techniques.
In this section, we present the mathematical model that leads to the dynamic programming formulation of the base calling problem, and present algorithms for base calling and parameter estimation.
Comprehensive model of sequencing-by-synthesis in Illumina’s platforms
A detailed mathematical model of Illumina’s platforms, describing the various sources of uncertainty in the sequencing process and the data acquisition step, was introduced in [5]. For the sake of a self-contained presentation, we briefly review this model here.
where d is a constant droop factor over all cycles and all reads, and σ is the standard deviation of λ_i.
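The effect of a constant droop factor can be illustrated with a short sketch (our own illustration, not the paper's exact equation): if a fraction d of strands permanently stops synthesizing each cycle, the fraction of strands still active decays geometrically with the cycle index.

```python
import numpy as np

def expected_signal_decay(d, n_cycles):
    """Hypothetical illustration of droop: fraction of strands still
    active after each of n_cycles cycles, assuming a fraction d of
    strands drops out (stops synthesizing) every cycle."""
    cycles = np.arange(1, n_cycles + 1)
    return (1.0 - d) ** cycles
```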
Illumina’s base calling approach
Prior to base calling, Bustard (Illumina’s base calling software) estimates the cross-talk matrix K using signals generated while synthesizing the first two bases of all reads, taking each entry of K as the median of the estimates obtained from individual read signals. Bustard then infers X by inverting K and multiplying the result with Y. Next, it computes a tile-wide average signal and renormalizes X_i accordingly; this procedure corrects for signal droop. The matrix E is estimated from the first 12 bases, inverted, and multiplied by the normalized X_i values, which compensates for phasing/prephasing. Finally, in each cycle a base call is made by selecting the base with the highest corrected signal.
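This pipeline can be sketched as follows (a simplified reading of the text, not Illumina's actual implementation: the phasing correction via E is omitted, and droop is normalized per read rather than tile-wide for brevity).

```python
import numpy as np

def bustard_like_correction(Y, K):
    """Sketch of a Bustard-style correction: undo cross-talk by
    inverting K, renormalize each cycle's signal by its total to
    compensate for droop, then call the strongest channel.

    Y : (n_cycles, 4) raw channel intensities for one read
    K : (4, 4) cross-talk matrix (column c = response to base c)
    """
    X = Y @ np.linalg.inv(K).T            # remove cross-talk: X_i = K^{-1} Y_i
    scale = X.sum(axis=1, keepdims=True)  # per-cycle total signal
    scale[scale == 0] = 1.0
    X_norm = X / scale                    # droop-normalized signal
    return X_norm.argmax(axis=1)          # base call per cycle
```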
Parameter estimation and basecalling approach of BayesCall and naiveBayesCall
BayesCall relies on the comprehensive model reviewed in this section to perform base calling and significantly reduces error rates compared to Bustard. However, it suffers from two major computational bottlenecks. First, the lack of a closed-form solution to the E-step of the EM algorithm used for parameter estimation necessitates computationally intensive numerical optimization. Hence, the parameter estimation stage is time consuming, requiring ≈25 minutes per iteration on an 8-core machine. Consequently, BayesCall performs a single parameter estimation step that uses reads from all the tiles in a lane to generate one set of parameters for the entire lane. Detailed analysis of BayesCall and naiveBayesCall error rates indicates that using a single set of parameters for an entire lane results in serious performance degradation on tiles whose parameters differ significantly from the ones used by the base calling algorithms (data not shown). Moreover, base calling in BayesCall relies on simulated annealing, a computationally intensive algorithm, which makes base calling times prohibitively long. To overcome this issue, the follow-up paper [6] proposes a simplified heuristic which reduces base calling time to ≈6 hours per lane at a small cost in performance relative to BayesCall. However, since naiveBayesCall uses the same parameter estimation step as BayesCall, tiles whose parameters differ significantly from the single parameter set computed for the entire lane see very small performance improvements over Bustard.
Our model refinements for fast tractable base calling
While providing a detailed description of the various sources of uncertainty, the mathematical model of the sequencing process reviewed in the previous section leads to computationally demanding base calling algorithms. To simplify the model and enable practically feasible base calling, we approximate λ_i by its mean. This approximation is justified by analysis of experimental data, which shows that the coefficient of variation (the ratio of the standard deviation to the mean) of λ_i in (3) is small (typically below 0.1 in early cycles and below 0.06 in later ones). On the other hand, it is desirable that the model allow variations in the droop factor from one cycle to another. Therefore, we describe the decay using a read-dependent transduction coefficient λ, which maps synthesis events to the generated signal intensity, and cycle-dependent droop factors.
where β_{i,i+1} is a cycle-dependent parameter which allows us to approximate the nonlinear dependence of E on p_cf and hence facilitates efficient base calling.
where X_i = (1−β_i)S_i + β_i S_{i+1}, and β_{i,i+1} is relabeled as β_i for simplicity of notation. Let us collect all the parameters into a vector Θ. Given Θ, λ, and Y_1, Y_2, …, Y_N, the goal of base calling is to determine S_1, S_2, …, S_L. In our approach, we first obtain estimates of the parameters Θ using an unsupervised learning scheme. Then, the posterior probabilities of the S_i are determined using either the forward-backward or the soft-output Viterbi algorithm. Details of the proposed scheme follow.
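The simplified signal model can be sketched directly (a minimal illustration under our reading of the text; droop factors are folded into λ and noise is omitted):

```python
import numpy as np

def model_signal(S_onehot, beta, lam, K):
    """Per-cycle signal under the simplified model from the text:
    X_i = (1 - beta_i) * S_i + beta_i * S_{i+1}, observed as
    Y_i ≈ lam * K @ X_i.  Sketch only: droop is absorbed into lam
    and the additive noise term is dropped.

    S_onehot : (L, 4) one-hot template bases
    beta     : (L,) cycle-dependent phasing parameters
    """
    L = len(S_onehot)
    X = np.empty((L, 4))
    for i in range(L):
        nxt = S_onehot[i + 1] if i + 1 < L else np.zeros(4)
        X[i] = (1 - beta[i]) * S_onehot[i] + beta[i] * nxt
    return lam * (X @ K.T)
```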
where the scalar coefficient λ and the template sequence matrix S are latent variables and Θ is the set of parameters which need to be determined; the objective is the log-posterior function which, in the absence of any prior information, is also the log-likelihood function. The expectation in (7) is evaluated with respect to λ and S given the current estimates of Θ.
Earlier results indicate that base calling may be significantly improved by allowing the parameters to be cycle dependent. We observe the same and thus divide a sequencing run into windows of length W = 6, estimating the model parameters window by window. Parameters for window l are initialized using the values estimated in the previous window, l−1. To prevent over-fitting, (7) is optimized over two windows, l and l+1, and the resulting Θ is used as the set of parameters for window l. A window length of W = 6 was found to be short enough to capture time variations in the parameters while keeping the run times of parameter estimation and base calling low.
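The windowing schedule can be made concrete with a short sketch (the EM fitting itself is abstracted away; the window spans are our own illustration of the described two-window overlap):

```python
def windowed_estimation(n_cycles, W=6):
    """Window schedule per the text: parameters for window l are fit
    by optimizing over windows l and l+1, but only kept for window l;
    window l+1 is re-fit in the next step, initialized from window l.
    Returns (use_span, fit_span) pairs of half-open cycle ranges."""
    windows = []
    for start in range(0, n_cycles, W):
        fit_span = (start, min(start + 2 * W, n_cycles))  # optimize over l and l+1
        use_span = (start, min(start + W, n_cycles))      # keep params for l only
        windows.append((use_span, fit_span))
    return windows
```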
Initialization for the first window
The EM algorithm requires initialization of the parameters in the first time window. We can reliably call the first two bases in a template by simply identifying the channels having the largest signal in the first two test cycles, from which an initial estimate of K is obtained. As done by Bustard, multiple estimates of the columns of K are computed from the first two signals of each read in a tile, and the median of these estimates provides the initial estimate. Subsequently, we find the mean of each column and add it to the corresponding diagonal entry. We then use the inverse of the resulting matrix to call bases again and iteratively refine the estimates of the entries of the matrix; the number of iterations is set to 5.
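The median-based initialization of K can be sketched as follows (a simplified illustration; the diagonal adjustment and iterative refinement described above are omitted):

```python
import numpy as np

def init_crosstalk(first_two_signals):
    """Initialize the 4x4 cross-talk matrix K from the first two
    cycles of every read: call each base as the strongest channel,
    collect the corresponding intensity vectors as estimates of that
    base's column of K, and take channel-wise medians.

    first_two_signals : (n_reads, 2, 4) intensities
    """
    cols = {b: [] for b in range(4)}
    for read in first_two_signals:
        for y in read:
            cols[int(np.argmax(y))].append(y)   # crude base call
    K = np.zeros((4, 4))
    for b in range(4):
        if cols[b]:
            K[:, b] = np.median(np.array(cols[b]), axis=0)
    return K
```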
Given the initial estimate of K, an empirical estimate of σ is obtained by computing the difference between the intensity vector and the column of the cross-talk matrix that corresponds to the called base; a covariance matrix is computed from these differences for each read, and the final estimate is formed as the median over reads. Parameters α_i and β_i are negligible in the early cycles and are initialized as zeros. To estimate the droop coefficients, we start by summing the signal vector elements of each read to obtain the total signal acquired in the i-th cycle; the droop for the i-th cycle is then calculated for each individual read, and the median value over all reads is chosen as the initial value. Details of this step are omitted for brevity.
E-step for the first window
where w_{j,k} denote the normalized weights of N_IS = 500 samples generated for each read in the training set from a Gaussian sampling distribution (such a choice of sampling distribution is suggested by analysis of experimental data). The mean of the sampling distribution for each read in the training set is obtained by maximizing the log-likelihood function (9) given the current estimates of the parameters Θ and the base calls.
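Self-normalized importance-sampling weights of this kind are best computed in the log domain for numerical stability. A generic sketch (the density functions are caller-supplied placeholders, not the paper's exact expressions):

```python
import numpy as np

def normalized_is_weights(samples, log_target, log_proposal):
    """Self-normalized importance-sampling weights: draw samples from
    a proposal (e.g. a Gaussian over lambda) and weight each by the
    target/proposal density ratio, normalized to sum to one.
    `log_target` and `log_proposal` are log-density callables."""
    log_w = np.array([log_target(s) - log_proposal(s) for s in samples])
    log_w -= log_w.max()          # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()
```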
M-step for the first window
The objective function in (9) is separately differentiable and convex in each of the parameters in Θ except β. To optimize it, we rely on a cyclic coordinate descent scheme which rotates among the components of Θ; to find β, we employ a grid search. The coordinate descent is terminated when the ratio of the change in the objective value to its value in the previous iteration falls below ε = 0.003. We use a similar stopping criterion to terminate the expectation-maximization algorithm.
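A minimal sketch of cyclic coordinate descent with a grid search and the relative-change stopping rule (for simplicity every parameter is updated by grid search here, whereas the text uses grid search only for β):

```python
def coordinate_descent(f, params, grids, eps=0.003, max_iter=100):
    """Cyclic coordinate descent: update each parameter in turn by
    minimizing the objective over a grid of candidate values, and
    stop when the relative decrease of the objective falls below eps.

    f      : objective, called as f(params_dict)
    params : dict of parameter name -> initial value
    grids  : dict of parameter name -> candidate values
    """
    prev = f(params)
    for _ in range(max_iter):
        for name, grid in grids.items():
            params[name] = min(grid, key=lambda v: f({**params, name: v}))
        cur = f(params)
        if abs(prev - cur) / max(abs(prev), 1e-12) < eps:
            break
        prev = cur
    return params
```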
E-step for subsequent windows
where 1≤j≤16. Clearly, we need to find the corresponding posterior probabilities. For this, we turn to dynamic programming ideas, in particular the forward-backward and soft-output Viterbi algorithms.
for all 1≤k≤16, 1≤i≤M. To ensure that the finite size of the trellis does not adversely affect the reliability of the computed probabilities, we include an extra 5 cycles in the calculations (i.e., we use Y_1,…,Y_{M+5}).
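The forward-backward recursion on such a trellis can be sketched generically (the 16-state base-pair structure and the Gaussian emission model from the text are abstracted into emission and transition score matrices):

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x)))."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def forward_backward(log_emis, log_trans):
    """Forward-backward over an S-state trellis; returns per-cycle
    posterior state probabilities.

    log_emis  : (M, S) per-cycle log emission scores
    log_trans : (S, S) log transition scores (row = from-state)
    """
    M, S = log_emis.shape
    la = np.full((M, S), -np.inf)       # forward (alpha) metrics
    lb = np.zeros((M, S))               # backward (beta) metrics
    la[0] = log_emis[0]
    for i in range(1, M):
        for s in range(S):
            la[i, s] = log_emis[i, s] + logsumexp(la[i - 1] + log_trans[:, s])
    for i in range(M - 2, -1, -1):
        for s in range(S):
            lb[i, s] = logsumexp(log_trans[s] + log_emis[i + 1] + lb[i + 1])
    lp = la + lb                        # unnormalized log posteriors
    lp -= lp.max(axis=1, keepdims=True)
    p = np.exp(lp)
    return p / p.sum(axis=1, keepdims=True)
```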
Soft-output Viterbi algorithm
The forward-backward algorithm computes exact posterior probabilities of the bases in a sequence. Alternatively, one can rely on heuristics that obtain reasonably good approximations of the posterior probabilities while suffering only a minor degradation in accuracy. One such heuristic is the soft-output Viterbi algorithm (SOVA), a modification of the Viterbi algorithm implemented on the same trellis described in the previous sections.
where the recursion is initialized by setting v_0(0)=1, v_k(0)=0 for all k>0. This recursion is at the core of the Viterbi algorithm, which then proceeds by backtracking through the optimal trellis path to determine the most likely sequence of states. The Viterbi algorithm, however, provides only the most likely sequence of states and does not find posterior probabilities of the symbols. To this end, a soft-output variant of the Viterbi algorithm was proposed in [9]. SOVA traces back the optimal path through the trellis and, for each symbol (i.e., base), explores alternative paths that could have changed the decision of the Viterbi algorithm for that symbol. The cost metrics of the alternative paths are then used to approximate the posterior probability of the base under consideration. To keep the procedure computationally efficient, we limit the deviation of the alternative paths from the optimal one to 3 edges. Note that the posterior probabilities obtained in this fashion need to be normalized. As demonstrated in the subsequent sections, the forward-backward algorithm achieves lower base calling error rates than SOVA, but at the cost of reduced speed.
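A sketch of the Viterbi recursion with a crude soft output (a simplification of SOVA: instead of tracing bounded-length alternative paths, we normalize the metric of the best path constrained to pass through each state at each cycle):

```python
import numpy as np

def viterbi_soft(log_emis, log_trans):
    """Viterbi with backtracking plus an approximate per-cycle soft
    output.  Returns (most likely state path, per-cycle normalized
    confidences over states)."""
    M, S = log_emis.shape
    v = np.full((M, S), -np.inf)        # best-path metric ending in s
    ptr = np.zeros((M, S), dtype=int)   # backpointers
    v[0] = log_emis[0]
    for i in range(1, M):
        for s in range(S):
            scores = v[i - 1] + log_trans[:, s]
            ptr[i, s] = int(np.argmax(scores))
            v[i, s] = log_emis[i, s] + scores[ptr[i, s]]
    # best completion metric from each state (backward max recursion)
    w = np.zeros((M, S))
    for i in range(M - 2, -1, -1):
        for s in range(S):
            w[i, s] = np.max(log_trans[s] + log_emis[i + 1] + w[i + 1])
    # backtrack the optimal path
    path = np.zeros(M, dtype=int)
    path[-1] = int(np.argmax(v[-1]))
    for i in range(M - 1, 0, -1):
        path[i - 1] = ptr[i, path[i]]
    # soft outputs: best path through each state, normalized per cycle
    m = v + w
    m -= m.max(axis=1, keepdims=True)
    p = np.exp(m)
    return path, p / p.sum(axis=1, keepdims=True)
```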
M-step for subsequent windows
The optimization follows the same procedure as described for the first window.
Updating λ - After each step of the EM algorithm used for estimating parameters in a given window, we make base calls using the outputs of either the forward-backward algorithm or SOVA. The calls and the most recent parameters are then employed to update λ by maximizing the log-likelihood function (9). The updated value of λ is used by the EM algorithm in the next window.
and so on. Note that these probabilities also serve as the ‘quality scores’ assigned to the given base calls (more on quality scores in the next section). Clearly, we need to find the posterior probabilities; for this, we again turn to the soft-output Viterbi and forward-backward algorithms described in the previous section.
and choosing the positive solution as the value of λ.
Performance of various base calling algorithms can be compared by evaluating the error rates they achieve when applied to determining the order of nucleotides in a known sequence. In practical applications, where the sequence being analyzed is not known, we need to assess the confidence of a base calling procedure. To this end, quality scores indicate how reliable the corresponding base calls are. The quality scores that we assign to base calls are the posterior probabilities of the bases computed by the forward-backward/SOVA schemes; in particular, we use the posterior probabilities computed according to (15). To assess the ‘goodness’ of quality scores, we consider their discrimination ability [10, 11]. The discrimination ability at a given error rate is obtained by sorting all bases by quality score in descending order and counting the number of bases called before the error rate exceeds the predefined threshold.
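The discrimination ability described above can be computed directly:

```python
import numpy as np

def discrimination_ability(quality, is_error, err_threshold):
    """Discrimination ability at a given error rate: sort base calls
    by quality score in descending order and count how many are
    called before the cumulative error rate first exceeds the
    threshold.

    quality  : per-base quality scores (higher = more confident)
    is_error : per-base 0/1 indicators of an incorrect call
    """
    order = np.argsort(-np.asarray(quality, dtype=float))
    errs = np.asarray(is_error, dtype=float)[order]
    cum_err_rate = np.cumsum(errs) / np.arange(1, len(errs) + 1)
    above = np.nonzero(cum_err_rate > err_threshold)[0]
    return int(above[0]) if len(above) else len(errs)
```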
Comparison of error rates and speed for GAII
Comparison of error rates for HiSeq
Error rate (Pair 1)
Error rate (Pair 2)
For each read, Bustard’s most computationally expensive step is the correction of phasing effects. For both the forward-backward algorithm and SOVA, we need to evaluate 16 objective functions for the states at each stage of the trellis. To avoid finite-window effects, an additional 5 cycles are included in the computations for each window of length 6. Therefore, for a 76-cycle read, we need to evaluate 131×16 state values. The additional overhead of combining these values requires mostly additions (when the algorithms are implemented in the log domain). naiveBayesCall, on the other hand, performs matrix inversions of the same complexity as those performed by Bustard, followed by the evaluation of 4×76×21 terms; the factor 21 arises because naiveBayesCall solves a quartic equation using a golden-section search that requires 21 evaluations per base. Thus, the forward-backward algorithm and SOVA are ≈3 times faster than naiveBayesCall.
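The quoted operation counts and the resulting ≈3× speedup can be checked with simple arithmetic:

```python
# Back-of-the-envelope operation counts from the text (76-cycle read):
fb_sova_evals = 131 * 16   # FB/SOVA: 16 trellis states over 131 effective cycles
nbc_evals = 4 * 76 * 21    # naiveBayesCall: 21 golden-section evaluations per base
speedup = nbc_evals / fb_sova_evals   # ≈3, consistent with the claim above
```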
Implementation and running times
We implemented our algorithms in C and ran them on an Intel i7 machine at 3.07 GHz using only a single core. For the forward-backward scheme, it takes approximately 240 seconds to read in an intensity file, perform the parameter estimation step on 250 reads, call bases for the whole tile, and write the output in fastq format; for SOVA it takes about 180 seconds. Processing an entire lane requires about 400 minutes and 300 minutes for FB and SOVA, respectively. naiveBayesCall, on the other hand, requires 19 hours just for its parameter estimation step while its base calling takes 6 hours; thus, our FB and SOVA implementations are 4 and 5 times faster, respectively. Note that the run times of naiveBayesCall are reported for an implementation on an 8-core processor; a parallel implementation of our algorithm is expected to reduce the total running time by roughly a factor of 8. In addition, since our proposed schemes are online (as opposed to batch) in nature, they can provide very high quality base calls to the end user almost instantaneously. A comparison of the running times for processing an entire lane with the forward-backward algorithm, SOVA, and the other base callers is shown in Table 1.
Improving error rates using supervised parameter estimation
Although the described parameter estimation procedure assumes no supervision, the proposed forward-backward and SOVA schemes allow incorporation of non-uniform priors that may improve accuracy of the inferred parameters and hence the overall base calling performance. Illumina platforms typically have a dedicated control lane comprising reads from a known reference. In such a case, it is possible to obtain priors by aligning the reads onto the reference and using them to improve the accuracy of the estimated parameters.
Comparison of error rates for supervised and unsupervised schemes
We presented a formulation of the base calling problem on Illumina platforms that is amenable to being solved by dynamic programming methods, and proposed forward-backward and soft-output Viterbi algorithms for solving it. The base calling error rates of the proposed algorithms were demonstrated on experimental data to be superior to those of Illumina’s Bustard and several other publicly available base callers. The developed base callers were tested on data obtained by Genome Analyzer II and HiSeq2000, but the model, concepts, and algorithms should apply to other Illumina platforms as well. The developed schemes are online (as opposed to batch), scalable, and much faster than competing model-based base callers. In addition, they are capable of accepting soft inputs (priors) and generating soft outputs (posteriors), a feature we exploited to devise a supervised scheme for learning parameters of the sequencing model and which may further be useful in applications where prior knowledge about reads is available.
We would like to thank the authors of [5] for providing us with their dataset. This work was funded in part by the National Institutes of Health under grant 1R21HG006171-01.
1. Ledergerber C, Dessimoz C: Base-calling for next-generation sequencing platforms. Brief Bioinformatics. 2011, 12(5): 489-497. doi:10.1093/bib/bbq077.
2. Rougemont J, Amzallag A, Iseli C, Farinelli L, Xenarios I, Naef F: Probabilistic base calling for Solexa sequencing data. BMC Bioinformatics. 2008, 9: 431. doi:10.1186/1471-2105-9-431.
3. Kircher M, Stenzel U, Kelso J: Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol. 2009, 10: R83. doi:10.1186/gb-2009-10-8-r83.
4. Whiteford N, Skelly T, Curtis C, Ritchie ME, Lohr A, Zaranek AW, Abnizova I, Brown C: Swift: primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics. 2009, 25: 2194-2199. doi:10.1093/bioinformatics/btp383.
5. Kao WC, Stevens K, Song YS: BayesCall: a model-based base-calling algorithm for high-throughput short-read sequencing. Genome Res. 2009, 19: 1884-1895. doi:10.1101/gr.095299.109.
6. Kao WC, Song YS: naiveBayesCall: an efficient model-based base-calling algorithm for high-throughput sequencing. Lect Notes Comput Sci. 2010, 6044: 233-247. doi:10.1007/978-3-642-12683-3_15.
7. Das S, Vikalo H: OnlineCall: fast online parameter estimation and base calling for Illumina’s next-generation sequencing. Bioinformatics. 2012, 28: 1677-1683. doi:10.1093/bioinformatics/bts256.
8. McLachlan GJ, Krishnan T: The EM Algorithm and Extensions. 1997, Hoboken: Wiley and Sons.
9. Hagenauer J, Hoeher P: A Viterbi algorithm with soft-decision outputs and its applications. Proc IEEE GLOBECOM, Dallas, TX. 1989, 47: 1-7.
10. Ewing B, Green P: Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res. 1998, 8: 186-194.
11. Smith AD, Xuan Z, Zhang MQ: Using quality scores and longer reads improves accuracy of Solexa read mapping. Bioinformatics. 2008, 9: 128-135.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.