Probabilistic base calling of Solexa sequencing data
© Rougemont et al. 2008
Received: 04 June 2008
Accepted: 13 October 2008
Published: 13 October 2008
Skip to main content
© Rougemont et al. 2008
Received: 04 June 2008
Accepted: 13 October 2008
Published: 13 October 2008
Solexa/Illumina short-read ultra-high throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data poses new challenges; currently a fair proportion of the tags are routinely discarded due to an inability to match them to a reference sequence, thereby reducing the effective throughput of the technology.
We propose a novel base calling algorithm using model-based clustering and probability theory to identify ambiguous bases and code them with IUPAC symbols. We also select optimal sub-tags using a score based on information content to remove uncertain bases towards the ends of the reads.
We show that the method improves genome coverage and number of usable tags as compared with Solexa's data processing pipeline by an average of 15%. An R package is provided which allows fast and accurate base calling of Solexa's fluorescence intensity files and the production of informative diagnostic plots.
Ultra-high-throughput sequencing is having a growing impact on biological research by providing a fast and high resolution access to genome-scale information. The versatile technique can be used for unbiased genotyping [1–3], transcriptome analysis [4–6], protein-DNA interactions[7, 8], de-novo sequencing[9, 10]. While the sample processing is relatively streamlined, innovations in data management and information processing are necessary to exploit the full potential of the technology. A standard Solexa/Illumina Genome Analyzer "classic" run produces 700 Gb of image files and 200 Gb of processed data files over 3.5 days totaling nearly 400,000 image files and 20,000 processed files. The latest GAII upgrade further increases this volume of data, mostly by acquiring larger images (although only 100 tiles) and with the ability to perform paired-end sequencing (72 bases per colony). The computing infrastructure required for managing daily sequencing runs is extremely costly to set up and maintain. Developing new algorithms to extract more information from available images and reduce the number of sequencing runs per project will therefore prove extremely valuable. Finally, well-designed quality metrics and diagnostic tools will allow a rapid assessment of the quality of the sequencing runs and decide the applicable data retention policy.
The Solexa/Illumina Genome Analyzer performs sequencing-by-synthesis of a random array of clonal DNA colonies attached to the surface of a flow cell. There are about 8 million such colonies on each of the 8 lanes of the cell. At each cycle of synthesis all four nucleotides, labelled with four different fluorescent dyes and blocked at the 3'-ends, are introduced in the flow cell. Up to 36 such cycles of synthesis are performed.
The data acquisition on the Genome Analyzer "classic" proceeds as follows: each lane of the cell is divided into roughly 300 tiles that are individually photographed through four different filters. The image analysis software localizes each colony on each picture and quantifies the corresponding four fluorescence intensities. The output consists of one file per tile with one row per colony made of four coordinates and up to 144 real numbers for 36 intensity quadruples. The base calling starts downstream of this quantification and reconstructs the DNA sequence that likely generated each colony. The Solexa data analysis pipeline outputs two important files for each tile in each lane: a sequence file with the sequence determined from each intensity row and a fast-q file with a quality score for each base called. This fast-q score measures the most likely base intensity relative to the three other intensities on a logarithmic scale from -5 to 40 (it is asymptotically equal to a Phred score). Here we propose an alternative probabilistic base calling method based on the fluorescence intensity quantifications that uses the extended IUPAC alphabet to code ambiguous bases. An information criterion is used to control the length of trustable reads. We show that this methodology increases the specific mapping of the tags onto reference genomes by about 15% (typically 10–25%) on raw sequences and an increase of up to 70% after quality filtering. The method is implemented in a freely distributed software called Rolexa.
Similar approaches have recently been published. Closest to ours in their use of Gaussian mixtures is the method introduced by Cokus et al. in their analysis of Arabidopsis methylation patterns. The Alta-Cyclic base caller  uses a support vector machine that needs to be trained on a known dataset. Our approach is computationally light and modular in that it offers a set of complementary functionalities that attempt to address the various biases observed in Solexa sequence [14–16] based on simple models of the biochemistry involved.
Several sources of noise perturb the acquisition step: signal over noise ratio in the images depends on the position of the colony within the imaging field (boundary effect), colonies can be hard to segment on the pictures, fluorophore emission spectra partially overlap as emissions "leak" into adjacent channels. Moreover synthesis efficiency is limited and therefore, within each colony, some DNA strands incorporate a non-complementary base or are de-synchronized because they failed to incorporate a nucleotide at a previous step. Both effects lead to the emission of a different fluorophore than the majority of the colony. These effects are possibly dependent on the base composition of the sequence and are obviously deteriorating with each additional chemistry cycle.
As observed above, there are three main sources of systematic bias at the level of intensity data. The first is the cross-talk between color channels: for example the A and C channels are not independent. Thus we transformed the raw intensities by a linear mapping to the basis with axes at angles ϕ and θ with respect to the original axes (cf. methods). We optimize the two angles so as to minimize the overall correlation between the transformed coordinates. We repeat this operation at each cycle of sequencing as well as with the other two, G and T channels (Fig. 1B).
The second important bias is the colony dephasing: the amount of fluorescence emitted in a particular channel at cycle n depends on the number of corresponding bases present in the sequence at positions 1, ..., n-1 because incorporation failures accumulated from previous cycles will be partly compensated at cycle n thereby increasing the signal in all channels. This cross-cycle dependence can be modelled by a binomial distribution with parameter q which is the probability of not elongating the complementary strand at each cycle of synthesis. We assume that this rate is equal for all nucleotides and all cycles. We determine the value of q by minimizing the average correlation between intensities at cycle n and n+1.
The last major source of systematic variation is due to an optical effect: on each tile, the colonies near the center of the image appear brighter than the ones near the edges (Additional file 2). We correct this by fitting a two-dimensional lowess  model to the intensities for each tile and subtracting the difference between the fit and the median intensity.
The three corrections are applied sequentially (cf. Methods) to the raw intensities before applying the model-based clustering algorithm described next.
A surprising property of Solexa sequences is the imbalance between complementary A and T base counts as well as between G and C. As shown in Figure 5B, there is progressive deterioration in the proportions as the sequencing progresses, which is likely related to the varying noise levels across fluorescent dyes for complementary base pairs as well as dye-specific chemical effects (see Fig. 1). In consequence an intensity close to the background is more likely to be called T than A, or C than G. Applying our corrections at the level of intensities stabilizes the proportions of bases, which is particularly pronounced for the T's. For reasons we do not currently understand the A/T ratio is not exactly one but stabilizes around 0.9 (Figure 5B).
To ascertain whether our increased coverage is not simply the consequence of the more degenerate alphabet, we verified that introducing ambiguities at random positions does not similarly improve the mapping. We thus selected the tags that did not match on the genome based on Solexa base calling, but did match after Rolexa introduced one to five ambiguous bases. Then we introduced ambiguities in these tags, with the same frequency as Rolexa, but at random positions. Figure 5D shows that only about 2% of those randomized mutations found a match on the genome, indicating that the entropy is a specific predictor of ambiguous positions.
While Solexa's quality score tends to decrease along the sequence, its distribution mostly spreads, rather than shifts, downwards (Fig. 3B). Computing a global length cutoff based on the average quality will therefore discard a lot of high-quality bases and not necessarily ensure a uniform quality. Thus we expect to increase the number of tags that can be mapped to a reference sequence by cutting them to a shorter length . However this procedure has a downside since it will reduce the coverage length per tag and increase the probability of finding multiple matches. Similarly, standard Solexa procedures suggest selecting tags with high average fast-q. Yet, a low average can be the result of just a few uncertain bases near the end of an otherwise useful tag.
Several points in the analysis of Solexa high throughput sequencing technology can likely benefit from further improvements. First the disequilibrium between complementary bases should be reduced. Although the phiX174 is a single-stranded DNA virus, the library was prepared from the double-stranded covalently closed circular form of the genome. As shown, the output of the sequencing shows an increasing deterioration of the equilibrium between complementary bases as the sequencing cycles proceed (Figure 5B). Our approach improves on this but does not solve the issue completely.
Similar approaches have recently been Dohm et al. have observed similar bias to the ones described here, but only proposed to correct them at the level of the sequence alignment, not at the level of the base calling. Cokus et al. use Solexa's pre-treated data (_sig2 files) and apply a very similar EM procedure to fit a Gaussian mixture model for probabilistic base calling. They do not use information based metrics to reduce the probabilities to IUPAC codes, but rather construct position-weight matrices with which they scan the reference genome, which is computationally expensive and not directly applicable for de-novo sequencing. Erlich et al. train a Support Vector Machine optimized on a reference sequence which is computationally highly expensive. Rolexa only needs a (nowadays common) multi-core computer and runs a complete analysis of one lane in 10 hours over 5 cores. Moreover it is based on modeling the bio-chemical properties of the system.
We have not considered here the potentially important benefits of fine-tuning the image analysis algorithms. Looking at images generated by the microscopic device shows that when the density of colonies is high in some region of the images, bleeding-over occurs and assigning the correct fluorescence intensity to each colony is clearly a delicate problem (see ).
Due to the large file size and format of the Solexa output data, concurrently (and randomly) accessing 20,000 text files puts a heavy strain on any standard file system, not to mention backup devices. Rolexa works with compressed inputs and outputs, which already reduces file size considerably. Still, a better suited file format could help both the storage and the processing, e.g. using suffix tables and trees[27, 28]. The latest GAII upgrade to the Solexa/Illumina sequencer generates even more data, through larger acquisition area, longer reads, and paired-end sequencing. Generating longer reads require efficient and reliable algorithms for base calling with reasonable levels of accuracy up to the end of the read. Furthermore, this increased throughput requires these algorithms to be fast and be based on direct and simple methods that are re-usable without tuning from one run to the next.
Solexa/Illumina high-throughput sequencing has already and will increasingly produce vast amounts of systems scale genomics and functional genomics data. As with other high-throughput techniques, improvements in signal processing and statistical assessment of the data will prove to be a key step in the maturation of the technology and the progress towards reliable applications and new discoveries.
The phiX174 Control Library used was prepared by Illumina (Cat. No CT-901-1001). Briefly, the double-stranded covalently closed circular form of the viral DNA was broken into 100–400 bp fragments by nebulization; the ends repaired with Klenow, T4 DNA polymerase and PNK; and a base A was added on the 3'ends. After ligation of the double-stranded genomic adapters the sample was gel-purified to isolate fragments with "inserts" of approximately 200 bp and amplified by 18 cycles of PCR (Illumina protocol "Preparing Samples for Sequencing Genomic DNA", Part # 11251892 Rev. A). The library is quality controlled by cloning an aliquot into a TOPO plasmid and capillary sequencing 5–10 clones.
DNA Colonies were prepared by using a "Standard Cluster Generation Kit" (Cat. No. FC-103-1001) and 35 cycles of isothermal amplification in the flow-cell on the "Illumina Cluster Station" using a pM dilution of the 10 nM library. After amplification, one of the strands is removed; the free 3'-ends are blocked by terminal transferase in presence of dideoxynucleotides; and the genomic sequencing primer hybridized. The flow-cell was transferred to the Genome Analyzer "classic" and sequencing was performed for 36 cycles using a "36 Cycle Sequencing Kit" (Cat. No FC-104-1003) with the version 2.0 of the scanning buffer.
The samples used for Additional file 3 came from the pooled DNA obtained by long-range PCR amplification of a 30 kb region of chromosome 19 from 3 different individuals plus a 50 kb region of chromosome 3 from a fourth individual. Sequencing was performed as described above for the phiX174.
All data analysis for this paper has been performed with the R statistical framework http://www.r-project.org/ and the Rolexa package. This package uses the mclust routines as well as the fork package to run efficiently on multi-core architectures. Matching of short tags onto the genome have been performed with the fetchGWI tool by first generating a comprehensive index of the phiX174 genome and matching each query with its index entry. We used align0  to search for best matches from tags to the genome and estimate error rates (see Fig. 5A). When counting errors, an alignment of IUPAC code with one of its compatible bases was counted as correct match.
Raw data analysis (image analysis, initial base calling and fast-q scores) used the Firecrest image analysis module and the Bustard base-caller from the Illumina software suite (SolexaPipeline-0.2.2.6). No filtering or analysis with Gerald was performed.
The parameters ϕ AC , θ AC , ϕ GT , θ GT are determined by minimizing the following function:F n (θ AC , ϕ AC , θ GT , ϕ GT ) = cor(M -1 I (A, n, •), M -1 I(C, n, •))2 + cor(M -1 I(G, n, •), M -1 I(T, n, •))2,
which is minimized to determine q.
Lastly, we correct systematic bias in function of the cluster coordinate as follows: we fit a 2-dimensional lowess  as a function of (x, y) coordinates and then subtract the difference between that fit and the median intensity across all four channels, for each tile and cycle.
We used the EEV model of the mclust algorithm to fit the Gaussian mixtures used to assign base probabilities in function of the four-dimensional intensity vector, similar as what was performed in . This model assumes Gaussian mixtures with four covariance matrices of the same shape and volume but with varying orientation. We initialize the classification by attributing each colony to the nucleotide with the highest (corrected) intensity. Given that initial classification, an M step of the mclust algorithm is performed which estimates the maximum likelihood parameters given the class attributions, where the parameters to estimate are the global scale and shape parameters as well as the centers and orientations of each class (using the covariance parameterization described in ). This is then followed by an E step of the EM algorithm to estimate the conditional probabilities of each data point belonging to each class given the parameters estimates obtained previously. Full convergence of the EM algorithm is offered as an option but occasionally runs into spurious optima due to the effect of outliers (similarly to what was observed in ). Further details of the implementation can be found in the package documentation (see Availability section).
The Rolexa algorithms require two types of cutoffs, which can both be easily user-defined in the Rolexa package. In the analyses presented, the limits between the different IUPAC bases in the probability simplex (Figure 2A) were set to HT(n) = log2(n+0.5) with n = 1,2,3 (Figure 2B). Secondly the length-dependent cutoffs IT(n) were used to filter out uncertain bases by selecting the longest sub-tag S with total entropy smaller than IT(n = length (S)). In Figure 6 we used the following 6 choices: constants IT c (n) = c with the constant c set to 2, 4, 6, or 8, and two cutoffs increasing with the tag length: IT Log (n) = log2 (4 + (n - 1)/5) and IT Exp (n) = 2(1+(n-1)/36). The latter two cutoffs interpolate between 2 and approximately 4 over the length of the sequence, but the first cutoff is concave (increases faster at the beginning) and the second is convex.
We have developed an R package called Rolexa which is freely available from http://bbcf.epfl.ch/Software. It is distributed under the GPL license and uses the mclust package which is part of the R distribution.
FN thanks the Swiss National Science Foundation grant no 3100A0-113617 for financial support. We are grateful to Carlo Rivolta for providing early access to his data. Part of the data analysis was performed on the Vital-IT high-performance computing facility of the Swiss Institute of Bioinformatics.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.