BMBC: a Bayesian method of base calling for Solexa sequence data
 Yuan Ji^{1}Email author,
 Riten Mitra^{2}Email author,
 Fernando Quintana^{3},
 Alejandro Jara^{3},
 Peter Mueller^{4},
 Ping Liu^{5},
 Yue Lu^{6} and
 Shoudan Liang^{7}
https://doi.org/10.1186/1471210513S13S6
© Ji et al; licensee BioMed Central Ltd. 2012
Published: 24 August 2012
Abstract
Base calling is a critical step in the Solexa nextgeneration sequencing procedure. It compares the positionspecific intensity measurements that reflect the signal strength of four possible bases (A, C, G, T) at each genomic position, and outputs estimates of the true sequences for short reads of DNA or RNA. We present a Bayesian method of base calling, BMBC, for SolexaGA sequencing data. The Bayesian method builds on a hierarchical model that accounts for three sources of noise in the data, which are known to affect the accuracy of the base calls: fading, phasing, and crosstalk between channels. We show that the new method improves the precision of base calling compared with currently leading methods. Furthermore, the proposed method provides a probability score that measures the confidence of each base call. This probability score can be used to estimate the false discovery rate of the base calling or to rank the precision of the estimated DNA sequences, which in turn can be useful for downstream analysis such as sequence alignment.
Keywords
Introduction
Next generation sequencing (NGS) such as Solexa sequencing (http://www.illumina.com) is a powerful tool producing massive sequences of short reads. It is considered the “digital” version of the classic microarray technology because in principle it measures the exact number of gene copies rather than relative abundances. NGS can be used for studies of sequence variations in genomes ([1, 2]), proteinDNA interactions ([3, 4]), transcriptome analysis ([5–7]), and de novo genome assembly [8]. The full potential of the technology is still being explored as quantitative researchers try to find efficient ways to streamline the sample processing and model the processed data.
In summary, the final data are millions of quadruple vectors. Each vector contains four continuous scores that represent the fluorescent intensities of nucleotides A, C, G, and T. Using these data, our task is to estimate the sequence of each short read.
We acknowledge that the proposed method in this paper deals with the data from Solexa genome analyzer. New sequencing technologies have been developed by Solexa/Illumina, such as the HiSeq series. However, numerous data sets have already been generated using the genome analyzer, which need to be properly analyzed. We believe that our proposed basecalling approach will contribute to the analysis of the existing data and also future data from experiments that still use the genome analyzer for sequencing. To our knowledge, a few methods for base calling are available in the literature. Most researchers use the default procedure, Bustard, built into the commercial software of the Illumina Genome Analyzer. The procedure yields an estimated base for each cycle along with a quality score called fastq. The fastq score measures the most likely base intensity relative to the three other intensities on a logarithmic scale from –5 to 40. In practice, DNA tags with small fastq scores are discarded in Solexa base calling. A more recent statistical method of base calling is by [10], who considered a variety of issues in the sequencing data including the base calling. Other works include [11, 12]. A recent addition to this group of methods is Ibis (Improved Base Calling for Genome Sequence Analyzer) [13]. Ibis applies multiclass Support Vector Machines to raw cluster intensities. The model is trained from data obtained from a reference genome.
In this paper, we propose a modelbased Bayesian method of basecalling (BMBC) for Solexa sequencing data. The BMBC method presents a hierarchical model that applies a probabilisticbased inference for base calling. The estimation of model parameters is computed via Markov chain Monte Carlo (MCMC) simulations and the posterior samples are used to compute the probability that each base is A, C, G, or T. These posterior probabilities are used to estimate the true DNA sequences, to rank the base calls, and to compute the false discovery rates (FDR). The remainder of this paper is organized as follows: The Methodology section presents a probability model for base calling, and the posterior inference procedure. The section on Numerical examples presents the basecalling results for a Solexa sequencing data set using the BMBC method and three other methods as comparison. The Discussion sections ends the paper.
Methodology
Other important systematic biases also affect the accuracy of base calling. For a discussion, see [14, 15]. However, these biases can be removed or reduced using standard statistical techniques. We assume that these biases have been removed and now the goal is to model the intensity scores.
Hierarchical models
We first consider models for sequence data of a single colony, i.e., measurements corresponding to a short read, with say I = 36 bases. Let y = {y_{1}, …, y_{36}} represent the 36 quadruplets of nucleotide intensity measurements, where y_{ i } = (y_{ i }_{1}, y_{ i }_{2}, y_{ i }_{3}, y_{ i }_{4})′ is the 4 × 1 vector for cycle i, respectively representing the intensities of four nucleotides, A, C, G, and T at location i of the short reads. Therefore, strong signals are indicated by large positive values of y_{ ij }. Because for each cycle only one true nucleotide is present, ideally only one of the four y_{ ij }’s should be positive and the remaining three should be zero. In the presence of noise, this is not the case. First, due to channel crosstalk, y_{ i }_{1} and y_{ i }_{2} are positively correlated, as are y_{ i }_{3} and y_{ i }_{4}. Second, because of fading, the intensities decay over cycles; that is, for later cycles, the values of y_{ i }’s are smaller on average. Last, when phasing is present, the intensity scores at cycle i depend on the ones at cycle (i – 1).
where I_{ j }’s are four indicator functions Ind(·) that truncate the multivariate normal. Here, . These indicators reflect the prior belief that the true base should have the largest intensity. Models (2) and (2) attempt to account for three sources of noise in the data. Specifically, due to fading, the intensity signals weaken as the cycle indicator i gets larger. Therefore, we include the exponential factor exp(–β · i^{ λ }) to describe the decay of the mean signal. Note that we specify an exponent λ to allow for more flexibility. For the phasing, we add a term α · y_{ i }_{–1},_{ j } to the mean of the multivariate normal (thus autoregressive), i.e., the intensity of the current cycle i depends on the intensity of the previous cycle (i – 1) for i ≥ 2.
We use a log N(0,1) prior for µ_{11}. Here, g_{1} accounts for the cross talk from channel C to channel A. We assign a beta(1, 1) as its prior. For g_{ ε }_{1} and g_{ ε }_{2}, we use beta(2,10) to reflect our strong belief that the intensities at channels G and T are much smaller than the intensity at channel A. We have tried other beta priors beta(a, b) with a ≪ b and obtained similar results in base calling.
The model is completed by specifying the discrete uniform prior for k_{ i }, i.e., Pr(k_{ i } = j) = 1/4 for j = 1, 2, 3, 4, a beta(1, 1) prior for λ, α, and β, and an inverse Wishart(diag(1, 4), 6) prior for ∑_{ j }, where diag(1, 4) is the 4 × 4 identity matrix.
The models above are built for one colony of sequencing data. With multiple colonies, we use y_{ ic } = (y_{ic,1}, …, y_{ic,4}) to denote the quadruple intensities of cycle i in colony c, and k_{ ic } to represent the latent indicator of the true base of cycle i in colony c. The models for y_{ ic } are the same as in (2) and (2), with y_{ ic } and k_{ ic } replacing y_{ i } and k_{ i }. The priors for k_{ ic }, µ_{ j }, λ, α, β, and ∑_{ j } remain unchanged. Since y_{ ic }’s are conditionally independent, the joint likelihood for all the data is simply the product of the likelihood function for each y_{ ic }. For simplicity, the mathematical expression of the models is omitted.
Posterior inference
as the posterior probability that the i th cycle in colony c has a true base of A, C, G, or T, respectively. These samples can be used to perform base calling. Specifically, the Bayesian base call corresponds to the nucleotide with the largest posterior probability in its cycle. That is, we assign base A, C, G, or T to cycle i in colony c if s_{ ic } equals 1, 2, 3, or 4, where . In addition, one can assess the accuracy of the proposed method by computing an estimated Bayesian FDR ([16, 17]) using the ξ’s. We will demonstrate this feature with a concrete example in the next section.
Numerical examples
We compared the performance of the BMBC method with currently leading methods, including the Solexa Bustard, the Rolexa method [11], and the BI method [10].
Data
We obtained Solexa DNA sequencing data from the control lane for a bacteria phage. This is part of the standard Solexa protocol. To illustrate the performance of base calling methods, we randomly selected three subsets, with each containing 1,000 colonies of the sequence data.
The control lane sequences the genome of an enterobacteria phage, phiX174, which is composed of 5,386 bases of single stranded DNA sequences and has no polymorphism. DNA preparation follows Illumina Control DNA library protocol (Illumina Cat. No CT9011001). DNA are broken to a size of 200 nucleotides and are subject to 18 cycles of polymerase chain reaction (PCR) amplification before the generation of DNA colonies by single molecule PCR. The sequences of DNA colonies are probed by 36 cycles of sequencing by synthesis.
Each DNA read is compared to the entire phage genome of 5,386 positions to search for the best matches. This is done using the Solexa software PhageAlign. After a tag is aligned to the phage genome, the matched sequence on the phage genome is considered to be the true sequence and any mismatched nucleotide is considered a sequencing error. The assignment of the true sequence is correct because 1) the phage genome contains no polymorphism and 2) the small genome size makes a mistaken sequence match over 36 nucleotides highly unlikely. Note that this is not the case for the human genome, where polymorphism occurs ([18]). Here, we treat the bases obtained from the above procedure as the “true” ones and compare the performance of base calling methods based on the deviation from these bases.
Analysis with random subsets
We first applied all the methods to a small data set for illustration purpose. We then implemented the BMBC method on a data set from the control lane of the Solexa sequencing, consisting of about 5 million short reads. We compare the following four basecalling methods using the phage sequencing data.

Bustard from Solexa’s Genome Analyzer: this is the commercial software provided by Illumina. More detailed information about the Genome Analyzer can be found at http://www.illumina.com.

Rolexa: this is a method building upon modelbased clustering [11], which assumes that the quadruplets of intensities follow fourcomponent univariate Gaussian mixture models. Instead of performing a full Bayesian inference using the joint posterior distribution, the Rolexa method applies the EM algorithm to obtain point estimates of the parameters.

BI: this is the intensity model proposed in Bravo and Irrizary (2010). The authors carefully examined potential noises in the intensity data and proposed a linear mixture model with different means given the indicator of true bases. They applied the EM algorithm to obtain the posterior probabilities of the true base calls. See [10].

BMBC: our proposed method.
We applied all four methods to the three random subsets of phage sequencing data, each with 1,000 colonies.
For the BMBC method, we performed base calling using 100 colonies at a time. The Markov chains converged fast and mixed extremely well. We only needed to throw away 100 burnin samples with a total of 600 iterations for every 100 colonies.
Error rates for different methods under comparison
Data sets  Number of wrong calls (percentage)  

BMBC  Solexa  BI  Rolexa  
1  1,340 (3.7%)  1,455 (4.0%)  1,428 (4.0%)  1,601 (4.4%) 
2  1,354 (3.7%)  1,514 (4.2%)  1,426 (4.0%)  1,432 (4.0%) 
3  1,385 (3.8%)  1,438 (4.0%)  1,444 (4.0%)  1,345 (3.7%) 
In Table 1, we used ACGT as the base calls for the Rolexa method. In the original paper by Rougemont et al. (2008), the authors focused on using the International Union of Pure and Applied Chemistry (IUPAC) symbols (http://www.bioinformatics.org/sms/iupac.html) as base calls. These symbols include not only ACGT, but other ambiguous calls that represent more than one base within ACGT. The authors stated that the IUPAC symbols gave the Rolexa better performance. For a fair comparison, we used the ACGT symbols for the Rolexa.
For ease of exposition, we now focus on the results of an arbitrary subset, data set 1 in Table 1. We computed the difference in the number of correct calls per colony between the BMBC method and each of the other three methods.
Full data analysis
We implemented the BMBC method on a data set consisting of 5,120,000 colonies. The data are from a control lane in a standard Solexa run, in which the true sequences are known. We first splitted the data into 8 equal parts, each comprising of 640,000 colonies. We then applied the BMBC method to each of the eight subsets in parallel. The eight jobs were executed on an iMAC with 2.8 GHz Intel Core i7 and 16 GB of memory. It took about 4 hours to complete the computation. We have built an R package “BMBC”, available to be downloaded from http://odin.mdacc.tmc.edu/~yuanj/soft.html
Basecall Matching Rates
Predicted calls  

A  C  G  T  
A  97.22  2.00  .3  .3  
True calls  C  1.06  95.75  1.29  1.88 
G  .00  .00  92.89  6.33  
T  .00  .01  .00  98.52 
Discussion
An important feature of the BMBC method is that it yields marginal posterior probabilities of the four nucleotides for each base. This allows a full probabilitybased inference for base calling and subsequent analysis. For example, one can associate the posterior probability of the base call with the estimated base and use it as a quality control measure for downstream sequence alignment. Sequences mapped to a genome with overall high posterior probabilities are more reliable than those with lower probabilities.
We also compared our method with the Bayesian classifier BayesCall in [12]. The computation was slow compared to the other methods. The slow speed could be a potential shortcoming for its application to data from NGS platforms, typical consisting of about millions of clusters. Naive Bayes classifiers, on the other hand, suffer from the simplistic assumptions of independence which are grossly violated in datasets of these type. One important feature of BMBC is that it does not require any prior learning for its application to GAI data. However, unsupervised clustering is not always feasible for data from newer sequencing technologies. Ibis [13] specifically uses large training data sets to analyze GAII control lanes. In addition, certain platforms possess unique features and need algorithms specially tailored to their specific requirements. Ibis, for example, is designed to model the features of bidirectional phasing and T accumulation which are present in GAII. On the other hand, BMBC is more suited towards addressing the issues of phasing, fading and cross talk that arise in the context of modeling GAI data.
We acknowledge that there is a scope of improving the model by incorporating the error sources unique to the latest sequencing platforms.
Declarations
Acknowledgement
Yuan Ji’s and Peter Müller’s research is partly supported by NIH/NCI R01 CA132897. Shoudan Liang’s research is partly supported by NIH/NCI K25 CA123344. Fernando Quintana’s research is partly supported by grants FONDECYT 1100010.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 13, 2012: Selected articles from The 8th Annual Biotechnology and Bioinformatics Symposium (BIOT2011). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/13/S13/S1
Authors’ Affiliations
References
 Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, Kim PM, Palejev D, Carriero NJ, Du L, Taillon BE, Chen Z, Tanzer A, Saunders AC, Chi J, Yang F, Carter NP, Hurles ME, Weissman SM, Harkins TT, Gerstein MB, Egholm M, Snyder M: Pairedend mapping reveals extensive structural variation in the human genome. Science 2007, 318(5849):420–426. [http://www.hubmed.org/fulltext.cgi?uids=17901297] 10.1126/science.1149504PubMed CentralView ArticlePubMedGoogle Scholar
 Hillier LW, Marth GT, Quinlan AR, Dooling D, Fewell G, Barnett D, Fox P, Glasscock JI, Hickenbotham M, Huang W, Magrini VJ, Richt RJ, Sander SN, Stewart DA, Stromberg M, Tsung EF, Wylie T, Schedl T, Wilson RK, Mardis ER: Wholegenome sequencing and variant discovery in C. elegans. Nat Methods 2008, 5(2):183–188. [http://www.hubmed.org/fulltext.cgi?uids=18204455] 10.1038/nmeth.1179View ArticlePubMedGoogle Scholar
 Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim TK, Koche RP, Lee W, Mendenhall E, O’Donovan A, Presser A, Russ C, Xie X, Meissner A, Wernig M, Jaenisch R, Nusbaum C, Lander ES, Bernstein BE: Genomewide maps of chromatin state in pluripotent and lineagecommitted cells. Nature 2007, 448(7153):553–560. [http://www.hubmed.org/fulltext.cgi?uids=17603471] 10.1038/nature06008PubMed CentralView ArticlePubMedGoogle Scholar
 Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K: Highresolution profiling of histone methylations in the human genome. Cell 2007, 129(4):823–837. [http://www.hubmed.org/fulltext.cgi?uids=17512414] 10.1016/j.cell.2007.05.009View ArticlePubMedGoogle Scholar
 Hafner M, Landgraf P, Ludwig J, Rice A, Ojo T, Lin C, Holoch D, Lim C, Tuschl T: Identification of microRNAs and other small regulatory RNAs using cDNA library sequencing. Methods 2008, 44: 3–12. [http://www.hubmed.org/fulltext.cgi?uids=18158127] 10.1016/j.ymeth.2007.09.009PubMed CentralView ArticlePubMedGoogle Scholar
 Vera JC, Wheat CW, Fescemyer HW, Frilander MJ, Crawford DL, Hanski I, Marden JH: Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol Ecol 2008, 17(7):1636–1647. [http://www.hubmed.org/fulltext.cgi?uids=18266620] 10.1111/j.1365294X.2008.03666.xView ArticlePubMedGoogle Scholar
 Friedländer MR, Chen W, Adamidi C, Maaskola J, Einspanier R, Knespel S, Rajewsky N: Discovering microRNAs from deep sequencing data using miRDeep. Nat Biotechnol 2008, 26(4):407–415. [http://www.hubmed.org/fulltext.cgi?uids=18392026] 10.1038/nbt1394View ArticlePubMedGoogle Scholar
 Chaisson MJ, Pevzner PA: Short read fragment assembly of bacterial genomes. Genome Res 2008, 18(2):324–330. [http://www.hubmed.org/fulltext.cgi?uids=18083777] 10.1101/gr.7088808PubMed CentralView ArticlePubMedGoogle Scholar
 Erlich Y, Mitra PP, delaBastide M, McCombie WR, Hannon GJ: AltaCyclic: a selfoptimizing base caller for nextgeneration sequencing. Nat Methods 2008, 5(8):679–682. [http://www.hubmed.org/fulltext.cgi?uids=18604217] 10.1038/nmeth.1230PubMed CentralView ArticlePubMedGoogle Scholar
 Bravo H, Irizarry R: Modelbased quality assessment and basecalling for secondgeneration sequencing data. Biometrics 2010., 66: To appear To appearGoogle Scholar
 Rougemont J, Amzallag A, Iseli C, Farinelli L, Xenarios I, Naef F: Probabilistic base calling of Solexa sequencing data. BMC Bioinformatics 2008, 9: 431. [http://www.biomedcentral.com/1471–2105/9/431] 10.1186/147121059431PubMed CentralView ArticlePubMedGoogle Scholar
 Kao W, Stevens K, Song Y: BayesCall: A modelbased basecalling algorithm for highthroughput shortread sequencing. Genome Research 2009, 19: 1884–1895. 10.1101/gr.095299.109PubMed CentralView ArticlePubMedGoogle Scholar
 Kircher M, Stenzel U, Kelso J: Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biology 2009, 10: R83. 10.1186/gb2009108r83PubMed CentralView ArticlePubMedGoogle Scholar
 Metzker ML, Raghavachari R, Burgess K, Gibbs RA: Elimination of residual natural nucleotides from 3’OmodifieddNTP syntheses by enzymatic mopup. Biotechniques 1998, 25(5):814–817. [http://www.hubmed.org/fulltext.cgi?uids=9821582]PubMedGoogle Scholar
 Metzker ML: Emerging technologies in DNA sequencing. Genome Res 2005, 15(12):1767–1776. [http://www.hubmed.org/fulltext.cgi?uids=16339375] 10.1101/gr.3770505View ArticlePubMedGoogle Scholar
 Newton M, Noueiry A, Sarkar D, Ahlquist P: Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics 2004, 5: 155–176. 10.1093/biostatistics/5.2.155View ArticlePubMedGoogle Scholar
 Ji Y, Yin G, Tsui K, Kolonin M, Sun J, Arap W, Pasqualini R, Do KA: Bayesian mixture models for complex highdimension count data in phage display experiments. Journal of the Royal Statistical Society, Series C (Applied Statistics) 2007, 56(2):139–152. 10.1111/j.14679876.2007.00570.xView ArticleGoogle Scholar
 Ji Y, Xu Y, Zhang Q, Tsui KW, Yuan Y, Liang S, Liang H: BMMap: Bayesian mapping of multireads for nextgeneration sequencing data. Tech. rep The University of Texas M. D. Anderson Cancer Center; 2010. [http://odin.mdacc.tmc.edu/~ylji]Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.