TileQC: A system for tile-based quality control of Solexa data

Dolan, Peter C; Denver, Dee R

doi:10.1186/1471-2105-9-250

Software
Open access
Published: 28 May 2008

TileQC: A system for tile-based quality control of Solexa data

Peter C Dolan¹ &
Dee R Denver¹

BMC Bioinformatics volume 9, Article number: 250 (2008) Cite this article

9919 Accesses
23 Citations
3 Altmetric
Metrics details

Abstract

Background

Next-generation DNA sequencing technologies such as Illumina's Solexa platform and Roche's 454 approach provide new avenues for investigating genome-scale questions. However, they also present novel analytical challenges that must be met for their effective application to biological questions.

Results

Here we report the availability of tileQC, a tile-based quality control system for Solexa data written in the R language. TileQC provides a means of recognizing bias and error in Solexa output by graphically representing data generated by flow cell tiles. The data represented in the images is then made available in the R environment for further analysis and automation of error detection.

Conclusion

TileQC offers a highly adaptable and powerful tool for the quality control of Solexa-based DNA sequence data.

Background

New high-throughput sequencing technologies have arisen over the last decade that produce very large numbers of small sequencing reads (hundreds of thousands to millions), making possible the rapid and inexpensive sequencing and resequencing of genomes [1]. Despite the excitement generated by these new technologies, they also present substantial challenges that include sequence assembly of millions of short-read fragments (~30 bp for Illumina's Solexa sequencing approach) for de novo sequencing applications [2, 3], and the rapid and accurate mapping of short sequence reads to genomic locations for resequencing [4]. Regardless of the application, one major concern is the ability to effectively characterize the reliabilities of DNA sequence reads deriving from "next-generation" platforms that rely on novel sequencing chemistries such as Solexa's reversible dye-labeled terminator approach. Furthermore, these platforms have abandoned the electrophoresis-based approaches of traditional Sanger sequencing; instead, DNA sequence data is collected in real-time from novel sequencing substrates. Development of quality-control tools for these next-generation DNA sequencing technologies is critical for their effective and accurate application to biological questions.

Illumina's Solexa sequencing approach consists of a process whereby DNA samples are nebulized to small pieces (~150 bp), then ligated to adapters that bind to linker molecules on the surface of a flow cell where amplified DNA clusters are ultimately sequenced in real-time using Solexa's reversible dye terminator approach [5]. Each flow cell contains eight lanes onto which DNA molecules from distinct samples can be independently sequenced. Each lane is subdivided into hundreds of tiles (200 tiles in earlier systems, 300 in the most recent system) – four images are collected from each tile (one for each of the four base dyes) per sequencing cycle. These tile images constitute the raw data from which DNA sequence information is ultimately derived. Illumina provides a standard front-end analysis pipeline for Solexa data where image analysis is carried out by Firecrest and base calls are made by Bustard. In making a base call, Bustard assigns a quality score (Q-score) to each of the 4 potential nucleotides. These Solexa quality scores range from -40 to 40. They are not equal to Phred quality scores, but are asymptotically identical [10]. Assuming no ambiguity, the nucleotide with the highest Q-score is called. In an ideal call, there is one +40 and three -40 s. The aggregate quality score (Q_AG-score) for a base call is the maximum Q-score minus the sum of the remaining three Q-scores.

After Firecrest and Bustard, Eland provides alignments of individual Solexa sequence reads to a user-defined reference genome. Eland subdivides all sequence reads into eight categories: those where sequences align to unique genomic regions with 0, 1 or 2 mismatches (U0, U1 and U2, respectively), those where sequences align to repetitive regions with 0, 1 or 2 mismatches (R0, R1 and R2, respectively), those where there are three or more mismatches to the reference genome which is defined as the "no match" (NM) category, and those containing two or more bases that were unable to be called (QC).

Here we provide an openly available software program, tileQC, for quality control of Solexa output. TileQC relies on the R programming environment and a mySQL database server configured for use by the tileQC program. Minor changes in the initialization script allow almost any SQL server to be used. Initial configuration is minimal but flexible enough that a gamut of security options is possible.

TileQC features both qualitative and quantitative error detection. The qualitatively oriented functions display the locations of reads on a tile as dots in a square. The read's color and size are coded using Eland categorizations and/or the Q_AG-score data derived from Bustard. The Eland-coded images represent the data after all other processing has occurred and reveal irregularities arising during any stage of the processing pipeline. Q_AG-score coded images, on the other hand, are produced from the Bustard output and not only produce a greater range of values than the Eland categorizations, but also have greater resolution, allowing the Solexa output to be analyzed down to the level of individual read cycles. This increased flexibility may obscure errors that are obvious at the Eland level. However, once an error is detected, the Q_AG-score coding allows for a more accurate assessment of that detected error's underlying cause and/or location.

The guiding philosophy behind tileQC's qualitative error assessment features is that the researcher's visual pattern recognition is the best way to detect novel errors. Once a new type of error is identified the data extraction features of the program may then be used as a starting point for the programmatic detection and filtration of similar errors.

Implementation

The current version (tileQC 1.0, see Additional file 1) runs on Windows, Linux, and Macintosh operating systems, and requires the programming environment R (version 2.5 or higher) and a properly configured MySQL server (detailed directions for configuration are available at [8]). The package 'RMySQL' must also be installed within the R environment. (The package 'RMySQL' also requires the package 'DBI', however, installing 'RMySQL' will install the 'DBI' package automatically).

The R software is available for download [6] as is the 'RMySQL' package (see the FAQ at [6] for details on downloading and installing an R package). The database server MySQL is also available for free download [7]. The TileQC system was implemented using the R language: source code, installation instructions and tutorials are available at the tileQC website [8].

In order to convert text-files into database form (and/or import data directly from the text files) the utilities sed, tr, grep, and wc must also be installed. These programs are part of the standard installation on most flavors of Unix (including recent versions of the Macintosh OS). For the win32 platform all the necessary programs are included in GNU utilities for win32 available at [9].

Results and Discussion

Throughout this section all Solexa data used in examples was generated from several of our Caenorhabditis elegans genomic DNA runs (unpublished data) on an Illumina Genome Analyzer. All C. elegans data were subject to the standard Solexa data analysis pipeline prior to application of tileQC tools.

The first role of tileQC is to facilitate the conversion of text based Solexa pipeline output to a more flexible SQL database format (in our case the MySQL database server). If a compatible database does not already exist, tileQC will (upon request) create one. Creating a database requires that both the SQL server and the tileQC program be properly configured (see [8] for details). Once the Eland and Q_AG-score data is in database form the full power of both SQL and R may be brought to bear upon the analysis of that data. Encapsulating the database connection within an R object enables the mundane details to be managed invisibly and frees the researcher to focus on the analysis of the data rather than the mechanics of accessing and manipulating that data. Although supplementary to the package's primary purpose of tile-based quality control (QC), this feature is useful in its own right, and simplifies the mechanics of querying a database containing Solexa data. The standard SQL query language is enhanced by the inclusion of a simple form of expression substitution. Here, for example, we see the extraction of five reads covering the location 332,080 in Chromosome I of the C. elegans genome (note the use of #current.table# in the SQL command):

> celegans$runSQL("select seq, type, locus, muta, mutb from #current.table#

where locus >= 332048 and locus <= 332080 and segment = 'CHR_I' limit 5")

seq type locus muta mutb

1 AATTTTTTGAATTTGCTCGCCGCATTTCGACTTTCT U2 332053 23A 28T

2 TGAATTTGCTCGCCGAATTTCGACTTTCTTACAATT U2 332060 21T 30G

3 GAATTTGCTCGCCGAATTTCGACTTTCTGACAATTT U1 332061 20T

4 GAATTTGCTCGCCGAATTTCGCCTTTCTGACACTTG U2 332061 20T 22A

5 GAATTTGCTCGCCGAATTTCGACTTTCTTACAATTT U2 332061 20T 29G

The primary purpose of the package is, of course, tile-based quality control. Often there are patterns in the errors generated during the Solexa sequencing process that become visible when the physical locations of a tile's reads are plotted in colors and sizes that depend upon the category to which they have been assigned by Eland. For this purpose, the tileQC package contains functions that are optimized to create such qualitative displays. The visual representation appears on the left and a relative frequency histogram of the number of reads in each Eland category for that tile appears to the right. The researcher may select which categories of read are to be displayed, and even filter the unique reads based upon whether they match the forward strand, the reverse strand, or either. The homogeneity of the Solexa process ensures that, when the machine is functioning properly, the relative frequencies are similar from tile to tile and distributed uniformly across each tile. Major discrepancies in these conditions are immediately discerned by sight.

Many such discrepancies are small and their effects are limited to one, or (possibly) a few, tiles. Figure 1 contains examples of three such situations. Often these aberrations have obvious causes such as bubbles in the reagents. However some, such as the rectangle in Figure 1a, remain mysterious. Detection of aberrant tiles is particularly important for researchers doing single nucleotide polymorphism (SNP) detection as a single tile with an increased error rate may yield a variety of false positives.

Tile-plots using Q_AG-score data allow for a more in depth analysis of the data and a better identification of an error's source, but one must be cautious – many types of errors that are clear from the Eland perspective may be difficult to discern from the Q_AG-score perspective. Sometimes, this can be alleviated by knowing the proper way to transform the Q_AG-score data into intensity data. Any desired function may be applied to a read's set of Q_AG-scores and the output of that function is normalized and transformed into an intensity value for the dot corresponding to that read. An example of this process can be seen in Figure 2 where we see three views of the same tile. The first, Figure 2a, uses Eland encoding, whereas Figures 2b and 2c use two distinct Q_AG-score encodings. In Figure 2b we see the results of applying the function mean across the first 32-cycles of a read. The error detected through Eland categorization is invisible in this image, but once again is easily detectable in Figure 2c where the minimum of the 32 inter-read Q_AG-scores are being plotted. The source of the error becomes apparent in Figure 2d – a cycleplot displaying the mean Q_AG-score for the first 32 cycles of the aberrant tile. We see from this graph, that the problem arises during cycle 28, and Figure 3, showing the minimum Q_AG-scores for a variety of cycles, confirms this interpretation. From this example, one might be tempted to assume that any tile with a cycle whose mean Q_AG-score drops below a certain value is to be discarded, but it is possible for a non-aberrant tile to have mean Q_AG-scores this low (as can be seen in Figure 4).

The visual pattern recognition of a researcher may also discern more subtle and global biases that escape simple numeric detection as in Figures 1, 2, 3, 4 where we see problematic reads occurring more frequently on the boundary of tiles, or in Figure 5 where a consistent increase in error rate is associated with reads from the upper left corner of a tile. Once such errors are known to occur, more sophisticated statistical techniques may be used to detect and remove the resulting biases, or at the very least to filter out the offending data.

Despite the power of the human visual system, some patterns of error that occur over an entire lane of Solexa data may not be apparent from direct observation of the 200–300 QC plots associated with that lane. However, some of these global biases are revealed when summarizing statistics are extracted for each tile and plotted in a single graph. In Figure 6 the blue dots correspond to the total number of perfect matches in each tile. The tiles within each lane are arranged as two vertical strips. In Figure 6, the left strip contains tiles 1–100 (with 1 at the top). The right strip contains tiles 101–200, but this time 101 is at the bottom and 200 at the top. The droop in the graph is indicative of an error rate that increased as the machine worked its way from the top of the lane to bottom. Aberrant tiles may also become visible from this perspective – in this same figure a tile with a particularly low number of unique matches was identified (the dot marked by yellow), and the graph of that tile's Eland categorized reads was superimposed upon a blank section of the summarization graph to illustrate how a bubbled tile may be detected in a summarization graph. We are presently working to improve the summary statistics features of tileQC and anticipate more advanced summary reports in future versions of this software.

Conclusion

The tileQC system offers a versatile and powerful tool for the quality control of Solexa-based DNA sequence data. Future challenges include the development of an interface that unifies the task of summarization with that of quantitative testing. This short-term goal (partially completed) will lead to a plug-in style of summarization and analysis that will allow researchers to flexibly encapsulate any desired post-processing or data extraction within a shareable R object. Mid-range goals include an interactive graphical interface for more convenient data exploration as well as a freely available library of analytic modules.

Availability and requirements

The tileQC system is freely available from [8]. It requires R (version 2.5 or higher), the R package 'RMySQL' and MySQL (version 5.0 or higher). In order to convert Solexa output from text to database form it requires the Solexa pipeline (up to version 0.3) output files of the form '_prb.txt' and '_eland_result.txt' as well as the utilities wc, grep, tr, and sed.

Abbreviations

The abbreviations are Q_AG-score:: aggregate quality score
SNP:: single nucleotide polymorphism.

References

Mardis ER: Anticipating the 1,000 dollar genome. Genome Biol 2006, 7(7):112. 10.1186/gb-2006-7-7-112
Article PubMed Central PubMed Google Scholar
Warren RL, Sutton GG, Jones SJ, Holt RA: Assembling millions of short DNA sequences using SSAKE. Bioinformatics 2007, 23(4):500–501. 10.1093/bioinformatics/btl629
Article CAS PubMed Google Scholar
Jeck WR, Reinhardt JA, Baltrus DA, Hickenbotham MT, Magrini V, Mardis ER, Dangl JL, Jones CD: Extending assembly of short DNA sequences to handle error. Bioinformatics 2007, 23(21):2942–2944. 10.1093/bioinformatics/btm451
Article CAS PubMed Google Scholar
Bentley DR: Whole-genome re-sequencing. Curr Opin Genet Dev 2006, 16(6):545–552. 10.1016/j.gde.2006.10.009
Article CAS PubMed Google Scholar
SS DNA Sequencing[http://www.illumina.com/downloads/SS_DNAsequencing.pdf]
The R Project for Statistical Computing[http://www.r-project.org/]
MySQL[http://www.mysql.com/]
TileQC Homepage[http://science.oregonstate.edu/~dolanp/tileqc]
Sourceforge: UnxUtils[http://sourceforge.net/projects/unxutils]
Understanding Qualities[http://maq.sourceforge.net/qual.shtml]

Download references

Acknowledgements

We thank Larry J. Wilhelm and Dana K. Howe for help with developing tileQC. Thanks to Chris Sullivan and Mark Dasenko at the OSU Center for Genome Research and Biocomputing for assistance with Solexa data and computing support. Also thanks to Brian Knaus, Dr. Rongkun Shen, and Dr. Albyn Jones for valuable advice. We are grateful to the National Institutes of Health and OSU Computational and Genome Biology Initiative for funding support.

Author information

Authors and Affiliations

Department of Zoology and Center for Genome Research and Biocomputing, Oregon State University, Corvallis, Oregon, 97331, USA
Peter C Dolan & Dee R Denver

Authors

Peter C Dolan
View author publications
You can also search for this author in PubMed Google Scholar
Dee R Denver
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Peter C Dolan.

Additional information

Authors' contributions

PCD contributed to writing the manuscript and developed the software. DRD contributed to writing the manuscript. Both authors have read and approved the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 6

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Dolan, P.C., Denver, D.R. TileQC: A system for tile-based quality control of Solexa data. BMC Bioinformatics 9, 250 (2008). https://doi.org/10.1186/1471-2105-9-250

Download citation

Received: 31 January 2008
Accepted: 28 May 2008
Published: 28 May 2008
DOI: https://doi.org/10.1186/1471-2105-9-250

TileQC: A system for tile-based quality control of Solexa data