Statistical properties of the fluorescent emissions
Several sources of noise perturb the acquisition step: the signal-to-noise ratio in the images depends on the position of the colony within the imaging field (boundary effect), colonies can be hard to segment in the pictures, and fluorophore emission spectra partially overlap so that emissions "leak" into adjacent channels. Moreover, synthesis efficiency is limited: within each colony, some DNA strands incorporate a non-complementary base or become de-synchronized because they failed to incorporate a nucleotide at a previous step. Both effects lead to the emission of a different fluorophore than that of the majority of the colony. These effects possibly depend on the base composition of the sequence [17] and clearly worsen with each additional chemistry cycle.
We used the sequencing of phiX174 (see Materials and Methods) to analyze the signal in the four color channels as the sequencing progresses. We first observe that the distribution of intensities in the individual channels shows a good separation between background noise and signal, although the shape of the histograms strongly depends on the dye used (Fig. 1A and Additional file 1). For example, G has a narrower dynamic range than T, and the range generally decreases with the cycle number; the largest range spans 4–5 logs. As the sequencing progresses, the dynamic range decreases, signal-to-noise ratios worsen and the separation between background noise and signal becomes increasingly blurred (Additional file 1). Next, we observe that the A and C channels, as well as the T and G channels, are highly correlated (Fig. 1A).
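The kind of inspection described above can be reproduced with a few lines of code. The following is a minimal sketch (not the published Rolexa code); it assumes the raw intensities for one cycle of one tile are available as a NumPy array of shape (n_colonies, 4) with columns ordered A, C, G, T.

import numpy as np

CHANNELS = ["A", "C", "G", "T"]

def channel_summary(intensities: np.ndarray) -> None:
    # Dynamic range per channel, in logs, over strictly positive values.
    for i, ch in enumerate(CHANNELS):
        v = intensities[:, i]
        v = v[v > 0]
        print(f"{ch}: dynamic range ~ {np.log10(v.max() / v.min()):.1f} logs")
    # Pearson correlations between channels; the A/C and G/T pairs are
    # expected to show the strongest off-diagonal values (cross-talk).
    corr = np.corrcoef(intensities, rowvar=False)
    print(np.round(corr, 2))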
Reducing positional bias, dephasing and cross-talk
As observed above, there are three main sources of systematic bias at the level of intensity data. The first is the cross-talk between color channels: for example, the A and C channels are not independent. We therefore transform the raw intensities by a linear mapping to a basis whose axes lie at angles ϕ and θ with respect to the original axes (cf. Methods), optimizing the two angles so as to minimize the overall correlation between the transformed coordinates. We repeat this operation at each cycle of sequencing, as well as for the other pair of channels, G and T (Fig. 1B).
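A hedged sketch of this cross-talk correction follows (an illustration of the idea, not the exact Rolexa implementation). For one correlated channel pair, e.g. (A, C), at one cycle, the raw intensities are re-expressed in a basis whose axes make angles phi and theta with the original axes, and the two angles are chosen to minimize the absolute correlation of the transformed coordinates.

import numpy as np
from scipy.optimize import minimize

def decrosstalk(x: np.ndarray, y: np.ndarray):
    """Return decorrelated coordinates for one channel pair at one cycle."""
    def transformed(angles):
        phi, theta = angles
        # Columns are the new basis vectors expressed in the original axes;
        # solving B @ c = p gives the coordinates c of each point in that basis.
        basis = np.array([[np.cos(phi), np.cos(theta)],
                          [np.sin(phi), np.sin(theta)]])
        return np.linalg.solve(basis, np.vstack([x, y]))
    def cost(angles):
        u, v = transformed(angles)
        return abs(np.corrcoef(u, v)[0, 1])
    # Start from the original axes (phi = 0, theta = pi/2) and minimize the
    # residual correlation.
    res = minimize(cost, x0=[0.0, np.pi / 2], method="Nelder-Mead")
    return transformed(res.x)

The same operation is applied independently to the (A, C) and (G, T) pairs at every cycle.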
The second important bias is colony dephasing: the amount of fluorescence emitted in a particular channel at cycle n depends on the number of corresponding bases present in the sequence at positions 1, ..., n-1, because incorporation failures accumulated during previous cycles are partly compensated at cycle n, thereby increasing the signal in all channels. This cross-cycle dependence can be modelled by a binomial distribution with parameter q, the probability of not elongating the complementary strand at each cycle of synthesis. We assume that this rate is equal for all nucleotides and all cycles, and we determine the value of q by minimizing the average correlation between intensities at cycles n and n+1.
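The sketch below illustrates one way to act on this binomial model (an assumption-laden illustration, not the exact Rolexa algorithm). With a per-cycle non-elongation probability q, a strand emitting at cycle n may in fact be reporting an earlier sequence position j ≤ n; the probability of that event is binomial, which yields a lower-triangular mixing matrix that can be inverted to dephase the observed intensities.

import numpy as np
from scipy.stats import binom
from scipy.linalg import solve_triangular

def dephasing_matrix(n_cycles: int, q: float) -> np.ndarray:
    """W[n, j] = P(the base emitted at cycle n sits at sequence position j)."""
    W = np.zeros((n_cycles, n_cycles))
    for n in range(n_cycles):          # 0-based cycle index
        for j in range(n + 1):         # 0-based position index
            # j incorporations during the previous n cycles, then one more now.
            W[n, j] = binom.pmf(j, n, 1.0 - q) * (1.0 - q)
        W[n] /= W[n].sum()             # renormalise the emission weights
    return W

def correct_dephasing(obs: np.ndarray, q: float) -> np.ndarray:
    """obs: (n_cycles, 4) observed intensities of one colony; returns the
    dephased estimate by solving the triangular system W @ true = obs."""
    W = dephasing_matrix(obs.shape[0], q)
    return solve_triangular(W, obs, lower=True)

As stated in the text, q itself would be estimated by scanning candidate values and keeping the one that minimizes the average correlation between corrected intensities at consecutive cycles.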
The last major source of systematic variation is due to an optical effect: on each tile, the colonies near the center of the image appear brighter than the ones near the edges (Additional file 2). We correct this by fitting a two-dimensional lowess [18] model to the intensities for each tile and subtracting the difference between the fit and the median intensity.
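The following sketch mirrors the positional-bias correction just described. Because common Python lowess implementations are one-dimensional, a simple Gaussian-kernel local average stands in here for the two-dimensional lowess smoother; the subtraction of the fitted trend relative to the tile median is the same. The bandwidth value is an arbitrary placeholder.

import numpy as np

def spatial_correction(xy: np.ndarray, intensity: np.ndarray,
                       bandwidth: float = 200.0) -> np.ndarray:
    """xy: (n_colonies, 2) positions on the tile; intensity: (n_colonies,).
    Returns intensities with the smooth positional trend removed."""
    fit = np.empty(len(intensity), dtype=float)
    for i, p in enumerate(xy):
        # Local weighted mean of the intensities around colony i.
        w = np.exp(-np.sum((xy - p) ** 2, axis=1) / (2.0 * bandwidth ** 2))
        fit[i] = np.sum(w * intensity) / np.sum(w)
    # Subtract the deviation of the local fit from the tile-wide median, so
    # that colonies near the bright centre are brought down to the same level
    # as those near the edges.
    return intensity - (fit - np.median(intensity))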
The three corrections are applied sequentially (cf. Methods) to the raw intensities before applying the model-based clustering algorithm described next.
Model-based clustering and information-theoretic base calling
We used a model-based clustering algorithm [12, 19–22] to classify the intensity quadruples into four groups; four well-delineated clusters corresponding to the four bases clearly emerge (Fig. 1A–B). Specifically, we model the intensities measured in the four channels by a mixture of four 4-dimensional Gaussian random variables, and we use the intensity quadruples from all colonies in one or a few combined tiles to fit the model parameters. The fitted model provides four probability distributions on the space of intensity quadruples, namely the probability PA(k) = P(A|I1(k), ..., I4(k)) that the kth base to call is an A given the measured intensities in all four channels at cycle k, and similarly for PC, PG and PT. We measure the level of uncertainty in our base calling by the entropy h(k) = -[PA(k)log2PA(k) + PC(k)log2PC(k) + PG(k)log2PG(k) + PT(k)log2PT(k)], which quantifies the uncertainty (in bits) in the determination of the correct kth base [23]. Knowing h and the four probabilities, we then use cutoffs in the probability simplex to decide which IUPAC code to call (Figure 2A, Methods). As the sequencing progresses, we also compute the cumulative entropy of each colony, H(n) = h(1) + ... + h(n), which estimates the log2 of the number of actual sequences compatible with the codes called up to position n. This total entropy is used to rank tags from least to most ambiguous. Figure 3A shows that this ambiguity score correlates with, but is markedly different from, the Solexa fast-q quality score. The ambiguity metric is useful for genome assembly or polymorphism identification because it allows low-quality tags to be down-weighted when deriving statistics from multiple alignments of tags. As shown below, this metric can also be used to optimize tag lengths and increase the chance of identifying a match on the reference genome.
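A minimal sketch of the clustering and entropy computation is given below, using scikit-learn's Gaussian mixture model as a stand-in for the model-based clustering algorithm cited above; the probability-simplex cutoffs that map the posterior onto IUPAC codes are not reproduced here.

import numpy as np
from sklearn.mixture import GaussianMixture

def call_probabilities(quadruples: np.ndarray) -> np.ndarray:
    """quadruples: (n_colonies, 4) corrected intensities at one cycle.
    Returns the posterior P(base | intensities) per colony, shape (n, 4)."""
    gmm = GaussianMixture(n_components=4, covariance_type="full")
    gmm.fit(quadruples)
    # The mixture components still have to be matched to A, C, G and T
    # (e.g. by the channel with the highest fitted mean).
    return gmm.predict_proba(quadruples)

def entropy(posterior: np.ndarray) -> np.ndarray:
    """Per-base uncertainty h(k) in bits: 0 for an unambiguous call, up to
    2 bits when all four bases are equally likely."""
    p = np.clip(posterior, 1e-12, 1.0)
    return -np.sum(p * np.log2(p), axis=1)

The cumulative entropy of a tag is then simply the running sum of h(k) over cycles, used to rank tags from least to most ambiguous.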
Genome coverage statistics
To assess the quality of our base calling and to compare it with the sequences obtained via Solexa's analysis pipeline, we compute the mapping efficiency, #{reads mapping exactly to the genome} / #{total number of reads}. We used the fetchGWI tool [24] to search for unique exact matches of each sequenced tag, encoded in the IUPAC code, on the 5386 nt reference phiX174 genome sequence [RefSeq:NC_001422]. We thus discard every tag that matches at more than one position or does not match exactly anywhere on the reference sequence. One lane (330 tiles) of the Solexa flow cell produced 8 M tags, 3 M unique tags and 3.8 M mappable tags, which amounts to a throughput of 137 million immediately usable bases per run. Sorting tags by decreasing quality, we see (Figure 4) that low-entropy tags are easily identified by both the Solexa and Rolexa pipelines, but that the coverage achieved by Rolexa-called tags increases significantly among the low-quality sequences, resulting in an increased total coverage of up to 10–25% (average 15%). We also see that ranking by quality (or entropy, data not shown) is a judicious prioritization strategy, since the coverage increase is sharp in the top part of the list and subsequently plateaus.
To estimate sequencing error rates, we used align0 [25] to search for an optimal match between each tag and the phiX genome, and then computed the number of mismatches between tag and reference. Figure 5A shows how the error rate increases as a function of the sequencing cycle for Solexa tags. Rolexa tags called with the most probable ACGT base showed a slower increase, and introducing IUPAC codes significantly decreased both the intercept and the slope of the error rate as a function of the sequencing cycle.
Base distribution statistics
A surprising property of Solexa sequences is the imbalance between complementary A and T base counts, as well as between G and C [14]. As shown in Figure 5B, the proportions progressively deteriorate as the sequencing progresses, which is likely related to the varying noise levels across fluorescent dyes for complementary base pairs as well as to dye-specific chemical effects (see Fig. 1). As a consequence, an intensity close to the background is more likely to be called T than A, or C than G. Applying our corrections at the level of intensities stabilizes the proportions of bases, an effect that is particularly pronounced for the T's. For reasons we do not currently understand, the A/T ratio is not exactly one but stabilizes around 0.9 (Figure 5B).
To ascertain that the increased coverage is not simply a consequence of the more degenerate alphabet, we verified that introducing ambiguities at random positions does not similarly improve the mapping. We selected the tags that did not match the genome based on Solexa base calling, but did match after Rolexa introduced one to five ambiguous bases. We then introduced ambiguities in these tags with the same frequency as Rolexa, but at random positions. Figure 5D shows that only about 2% of these randomized mutations found a match on the genome, indicating that the entropy is a specific predictor of ambiguous positions.
Optimizing tag length
While Solexa's quality score tends to decrease along the sequence, its distribution mostly spreads, rather than shifts, downwards (Fig. 3B). Imposing a global length cutoff based on the average quality will therefore discard many high-quality bases without necessarily ensuring a uniform quality. Thus we expect to increase the number of tags that can be mapped to a reference sequence by cutting them to a shorter length [26]. However, this procedure has a downside: it reduces the coverage length per tag and increases the probability of finding multiple matches. Similarly, standard Solexa procedures suggest selecting tags with a high average fast-q score. Yet a low average can be the result of just a few uncertain bases near the end of an otherwise useful tag.
We tested the different selections by applying the following quality filters. For the Solexa method we cut the tags at lengths 20, 25, 26, 28 and 30, and then filtered out all sequences with an average fast-q score below 30, 25, or 20. In comparison, we used the following filtering procedure for Rolexa tags: we chose 3 different length-dependent entropy cutoffs IT(k) (see Methods) and searched within each read for the longest k-mer with total entropy less than IT(k). We then extended this subsequence in both directions up to the next ambiguous base and finally removed all tags shorter than 10 bases. The coverage statistics for the different filters are summarized in Figure 6. We performed a similar analysis of the 330 tiles of the sequencing of targeted human genomic regions and found an average increase of 50% in nucleotide coverage (Additional file 3). We see that the efficiency of Rolexa is superior in all datasets, as measured both by the ratio of actual coverage to expected coverage and by the proportion of tags having a unique match on the genome. The latter criterion is important since in many applications of high-throughput sequencing (such as gene expression measurement or ChIP-Seq), the extent of the coverage is less important than the number of hits on the genome. Similarly, in genotyping and targeted re-sequencing, where inexact matches are expected, the ability to reliably filter out low-quality tags before matching to the reference sequence is of the highest importance, since actual polymorphisms must be distinguished from sequencing errors.
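A hedged sketch of the entropy-based trimming applied to Rolexa tags is given below. It assumes IT is a callable returning the entropy budget allowed for a k-mer of length k (the actual cutoff functions are defined in the Methods), and it interprets "extend up to the next ambiguous base" as extending through unambiguous A/C/G/T calls only.

import numpy as np

def trim_tag(bases: str, h: np.ndarray, IT, min_len: int = 10) -> str:
    """bases: called sequence (IUPAC codes); h: per-base entropies in bits.
    Returns the retained subsequence, or '' if it is shorter than min_len."""
    n = len(bases)
    best = None
    # Longest window whose total entropy stays below the length-dependent cutoff.
    for start in range(n):
        for end in range(start + 1, n + 1):
            k = end - start
            if h[start:end].sum() < IT(k) and (best is None or k > best[1] - best[0]):
                best = (start, end)
    if best is None:
        return ""
    start, end = best
    # Extend in both directions up to (not including) the next ambiguous base.
    while start > 0 and bases[start - 1] in "ACGT":
        start -= 1
    while end < n and bases[end] in "ACGT":
        end += 1
    out = bases[start:end]
    return out if len(out) >= min_len else ""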