### Sample preparation and Genome Analyzer sequencing

The phiX174 Control Library used was prepared by Illumina (Cat. No. CT-901-1001). Briefly, the double-stranded, covalently closed circular form of the viral DNA was broken into 100–400 bp fragments by nebulization; the ends were repaired with Klenow, T4 DNA polymerase and PNK; and a single *A* base was added to the 3' ends. After ligation of the double-stranded genomic adapters, the sample was gel-purified to isolate fragments with "inserts" of approximately 200 bp and amplified by 18 cycles of PCR (Illumina protocol "Preparing Samples for Sequencing Genomic DNA", Part # 11251892 Rev. A). The library was quality-controlled by cloning an aliquot into a TOPO plasmid and capillary sequencing 5–10 clones.

DNA colonies were prepared using a "Standard Cluster Generation Kit" (Cat. No. FC-103-1001) and 35 cycles of isothermal amplification in the flow-cell on the "Illumina Cluster Station" using a pM dilution of the 10 nM library. After amplification, one of the strands was removed; the free 3' ends were blocked by terminal transferase in the presence of dideoxynucleotides; and the genomic sequencing primer was hybridized. The flow-cell was transferred to the Genome Analyzer "classic" and sequencing was performed for 36 cycles using a "36 Cycle Sequencing Kit" (Cat. No. FC-104-1003) with version 2.0 of the scanning buffer.

### Sequencing of human cells

The samples used for Additional file 3 came from pooled DNA obtained by long-range PCR amplification[30] of a 30 kb region of chromosome 19 from 3 different individuals plus a 50 kb region of chromosome 3 from a fourth individual. Sequencing was performed as described above for phiX174.

### Data analysis

All data analysis for this paper was performed with the R statistical framework (http://www.r-project.org/) and the Rolexa package. This package uses the *mclust* routines[20] as well as the *fork* package to run efficiently on multi-core architectures. Matching of short tags onto the genome was performed with the *fetchGWI* tool[24] by first generating a comprehensive index of the phiX174 genome and then matching each query against its index entry. We used *align0* [25] to search for the best matches of tags to the genome and to estimate error rates (see Fig. 5A). When counting errors, an IUPAC code aligned with one of its compatible bases was counted as a correct match.
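The error-counting rule can be illustrated with a minimal sketch. It assumes an ungapped alignment of two equal-length strings and uses the standard IUPAC nucleotide code table; the function name `count_errors` is hypothetical and is not part of Rolexa or *align0*.

```python
# Standard IUPAC nucleotide codes mapped to the bases they are compatible with.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_errors(tag, reference):
    """Count positions where the tag's IUPAC code excludes the reference base.

    Assumption: both strings are already aligned without gaps and have the
    same length. A real aligner may also introduce gaps.
    """
    if len(tag) != len(reference):
        raise ValueError("tag and reference must have the same length")
    return sum(ref not in IUPAC[code] for code, ref in zip(tag, reference))


# "R" is compatible with G and "N" with any base, so this tag has no errors.
errors = count_errors("ARN", "AGT")
```

Under this rule an ambiguous call is never penalized as long as the true base is among the bases it allows.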

Raw data analysis (image analysis, initial base calling and fast-q scores) used the *Firecrest* image analysis module and the *Bustard* base-caller from the Illumina software suite (SolexaPipeline-0.2.2.6). No filtering or analysis with *Gerald* was performed.

### Preliminary data transformation

We model the measured intensities *I*(*α*, *n*, *x*) (where *α* = *A*, *C*, *G*, *T* is the dye channel, *n* = 1, ..., 36 is the cycle number and *x* denotes the colony coordinates) as the following combination of unbiased intensities *J*(*α*, *n*, *x*):

$$I(\alpha,n,x)=\sum_{m=1}^{n}\;\sum_{\beta=A,C,G,T} M(\alpha,\beta)\,J(\beta,m,x)\,R(m,n),$$

where the 4 × 4 matrix *M* is a mixture matrix which is block diagonal and depends on the 4 parameters *ϕ*_{AC}, *θ*_{AC}, *ϕ*_{GT} and *θ*_{GT}:

$$M\left(\{A,C\},\{A,C\}\right)=\begin{pmatrix}\cos\theta_{AC} & \sin\theta_{AC}\\ \cos\phi_{AC} & \sin\phi_{AC}\end{pmatrix},$$

and similarly for the *G*, *T* block, and the dephasing matrix *R* is a function of the parameter *q* and has a binomial structure:

$$R(m,n)=\begin{cases}0 & \text{if } m>n,\\[4pt] \dbinom{n}{m}\,q^{\,n-m}(1-q)^{m} & \text{if } m\le n.\end{cases}$$
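As an illustration, the mixing and dephasing model above can be written out for a single colony. This is a minimal sketch, not the Rolexa implementation: the function names are hypothetical, the channels are assumed to be ordered *A*, *C*, *G*, *T*, and `J` is assumed to be a list of per-cycle four-channel intensity vectors.

```python
import math


def mixing_matrix(theta_ac, phi_ac, theta_gt, phi_gt):
    """Block-diagonal 4x4 matrix M with channels ordered A, C, G, T."""
    m = [[0.0] * 4 for _ in range(4)]
    for offset, (theta, phi) in ((0, (theta_ac, phi_ac)), (2, (theta_gt, phi_gt))):
        m[offset][offset] = math.cos(theta)
        m[offset][offset + 1] = math.sin(theta)
        m[offset + 1][offset] = math.cos(phi)
        m[offset + 1][offset + 1] = math.sin(phi)
    return m


def dephasing(m, n, q):
    """Binomial dephasing weight R(m, n); zero when m > n."""
    if m > n:
        return 0.0
    return math.comb(n, m) * q ** (n - m) * (1 - q) ** m


def measured_intensity(J, M, q):
    """Apply the model to one colony.

    J[m - 1][b] is the unbiased intensity of channel b at cycle m.
    Returns I with the same layout: I[n - 1][a].
    """
    cycles = len(J)
    I = []
    for n in range(1, cycles + 1):
        row = []
        for a in range(4):
            total = 0.0
            for m in range(1, n + 1):
                mixed = sum(M[a][b] * J[m - 1][b] for b in range(4))
                total += mixed * dephasing(m, n, q)
            row.append(total)
        I.append(row)
    return I
```

With *θ* = 0 and *ϕ* = π/2 in both blocks, *M* is the identity, and with *q* = 0 only the *m* = *n* term survives, so the measured intensities reduce to the unbiased ones.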

The parameters *ϕ*_{AC}, *θ*_{AC}, *ϕ*_{GT}, *θ*_{GT} are determined by minimizing the following function:

$$F_{n}(\theta_{AC},\phi_{AC},\theta_{GT},\phi_{GT})=\operatorname{cor}\!\left(M^{-1}I(A,n,\bullet),\,M^{-1}I(C,n,\bullet)\right)^{2}+\operatorname{cor}\!\left(M^{-1}I(G,n,\bullet),\,M^{-1}I(T,n,\bullet)\right)^{2},$$

which defines an intermediate intensity matrix *K* = *M*^{-1} *I*. This is then introduced into the function

$$G(q)=\sum_{\alpha,n}\operatorname{cor}\!\left(R^{-1}K(\alpha,n,\bullet),\,R^{-1}K(\alpha,n+1,\bullet)\right)^{2},$$

which is minimized to determine *q*.
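The crosstalk objective can be sketched for a single cycle as follows. This is an illustration under stated assumptions rather than the package code: the function names are hypothetical, each argument `i_a`, `i_c`, `i_g`, `i_t` is assumed to be a list of one channel's intensities across colonies, and the text does not specify which minimizer is used, so only the objective is shown. The dephasing objective follows the same pattern with *R*^{-1} applied across cycles.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def unmix_pair(i_first, i_second, theta, phi):
    """Apply the inverse of the 2x2 block [[cos t, sin t], [cos p, sin p]]."""
    det = math.cos(theta) * math.sin(phi) - math.sin(theta) * math.cos(phi)
    k_first = [(math.sin(phi) * a - math.sin(theta) * c) / det
               for a, c in zip(i_first, i_second)]
    k_second = [(-math.cos(phi) * a + math.cos(theta) * c) / det
                for a, c in zip(i_first, i_second)]
    return k_first, k_second


def crosstalk_objective(i_a, i_c, i_g, i_t, theta_ac, phi_ac, theta_gt, phi_gt):
    """Sum of squared correlations between unmixed A/C and G/T channels."""
    k_a, k_c = unmix_pair(i_a, i_c, theta_ac, phi_ac)
    k_g, k_t = unmix_pair(i_g, i_t, theta_gt, phi_gt)
    return pearson(k_a, k_c) ** 2 + pearson(k_g, k_t) ** 2
```

Any general-purpose numerical minimizer could then be run over the four angles; which one to use is a design choice left open here.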

Lastly, we correct systematic bias as a function of the cluster coordinates as follows: for each tile and cycle, we fit a 2-dimensional lowess [18] as a function of the (*x*, *y*) coordinates and then subtract the difference between that fit and the median intensity across all four channels.

### Model-based clustering and data fitting

We used the *EEV* model of the *mclust* algorithm[20] to fit the Gaussian mixtures used to assign base probabilities as a function of the four-dimensional intensity vector, similar to what was done in [12]. This model assumes Gaussian mixtures with four covariance matrices of the same shape and volume but with varying orientation. We initialize the classification by attributing each colony to the nucleotide with the highest (corrected) intensity. Given that initial classification, an M step of the *mclust* algorithm is performed, which estimates the maximum likelihood parameters given the class attributions; the parameters to estimate are the global scale and shape parameters as well as the centers and orientations of each class (using the covariance parameterization described in [20]). This is followed by an E step of the EM algorithm, which estimates the conditional probability of each data point belonging to each class given the parameter estimates obtained previously. Full convergence of the EM algorithm is offered as an option but occasionally runs into spurious optima due to the effect of outliers (similar to what was observed in [12]). Further details of the implementation can be found in the package documentation (see Availability section).
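The initialization step can be illustrated with a small sketch. It covers only the hard assignment by highest intensity and the class centers, which are one part of the M step; the shared scale and shape and the per-class orientations estimated by *mclust* are deliberately omitted. The function names are hypothetical and the channels are assumed to be ordered *A*, *C*, *G*, *T*.

```python
def initial_classes(intensities):
    """Assign each colony to the channel index with the highest intensity."""
    return [max(range(4), key=lambda a: row[a]) for row in intensities]


def class_centers(intensities, labels):
    """Mean four-channel intensity vector of each non-empty class."""
    centers = {}
    for k in range(4):
        members = [row for row, lab in zip(intensities, labels) if lab == k]
        if members:
            centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers


# Four colonies with corrected intensities in the order A, C, G, T.
colonies = [[5.0, 1.0, 0.0, 0.0],
            [3.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 2.0, 7.0],
            [0.0, 9.0, 1.0, 0.0]]
labels = initial_classes(colonies)
centers = class_centers(colonies, labels)
```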

### Cutoffs for base calling and tag length

The Rolexa algorithms require two types of cutoffs, both of which can easily be user-defined in the Rolexa package. First, in the analyses presented, the limits between the different IUPAC bases in the probability simplex (Figure 2A) were set to *HT*(*n*) = log_{2}(*n* + 0.5) with *n* = 1, 2, 3 (Figure 2B). Second, the length-dependent cutoffs *IT*(*n*) were used to filter out uncertain bases by selecting the longest sub-tag *S* with total entropy smaller than *IT*(*n*), where *n* = length(*S*). In Figure 6 we used the following 6 choices: four constant cutoffs *IT*_{c}(*n*) = *c* with *c* set to 2, 4, 6 or 8, and two cutoffs increasing with the tag length, *IT*_{Log}(*n*) = log_{2}(4 + (*n* - 1)/5) and *IT*_{Exp}(*n*) = 2^{1 + (*n* - 1)/36}. The latter two cutoffs interpolate between 2 and approximately 4 over the length of the sequence, but *IT*_{Log} is concave (it increases faster at the beginning) whereas *IT*_{Exp} is convex.
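The length-dependent filtering can be sketched as follows. Two assumptions are made that the text does not spell out: a sub-tag is taken to be a contiguous stretch of the read, and its total entropy is taken to be the sum of the per-base entropies in bits. All function names are hypothetical and are not the Rolexa API.

```python
import math


def it_const(c):
    """Constant cutoff IT_c(n) = c."""
    return lambda n: float(c)


def it_log(n):
    """Concave cutoff IT_Log(n) = log2(4 + (n - 1) / 5)."""
    return math.log2(4 + (n - 1) / 5)


def it_exp(n):
    """Convex cutoff IT_Exp(n) = 2 ** (1 + (n - 1) / 36)."""
    return 2 ** (1 + (n - 1) / 36)


def base_entropy(probs):
    """Shannon entropy (bits) of one base's probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def longest_subtag(entropies, cutoff):
    """Return (start, length) of the longest contiguous window whose summed
    entropy is strictly below cutoff(length); (0, 0) if no window qualifies.
    Ties are resolved in favor of the leftmost window.
    """
    n = len(entropies)
    prefix = [0.0]
    for h in entropies:
        prefix.append(prefix[-1] + h)
    for length in range(n, 0, -1):
        limit = cutoff(length)
        for start in range(n - length + 1):
            if prefix[start + length] - prefix[start] < limit:
                return start, length
    return 0, 0


# One very uncertain base (entropy 3.0) splits the read; the constant cutoff
# of 2 keeps only the clean stretch after it.
window = longest_subtag([0.1, 0.1, 3.0, 0.1, 0.1, 0.1], it_const(2))
```

The exhaustive scan is quadratic in the read length, which is negligible for 36-cycle reads.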

### Availability

We have developed an R package called Rolexa which is freely available from http://bbcf.epfl.ch/Software. It is distributed under the GPL license and uses the *mclust* package which is part of the R distribution.