TileQC: A system for tile-based quality control of Solexa data
© Dolan and Denver; licensee BioMed Central Ltd. 2008
Received: 31 January 2008
Accepted: 28 May 2008
Published: 28 May 2008
Next-generation DNA sequencing technologies such as Illumina's Solexa platform and Roche's 454 approach provide new avenues for investigating genome-scale questions. However, they also present novel analytical challenges that must be met for their effective application to biological questions.
Here we report the availability of tileQC, a tile-based quality control system for Solexa data written in the R language. TileQC provides a means of recognizing bias and error in Solexa output by graphically representing data generated by flow cell tiles. The data represented in the images is then made available in the R environment for further analysis and automation of error detection.
TileQC offers a highly adaptable and powerful tool for the quality control of Solexa-based DNA sequence data.
New high-throughput sequencing technologies have arisen over the last decade that produce very large numbers of short sequencing reads (hundreds of thousands to millions), making possible the rapid and inexpensive sequencing and resequencing of genomes. Despite the excitement generated by these new technologies, they also present substantial challenges, including the assembly of millions of short-read fragments (~30 bp for Illumina's Solexa sequencing approach) for de novo sequencing applications [2, 3] and the rapid and accurate mapping of short sequence reads to genomic locations for resequencing. Regardless of the application, one major concern is the ability to characterize effectively the reliability of DNA sequence reads derived from "next-generation" platforms that rely on novel sequencing chemistries such as Solexa's reversible dye-labeled terminator approach. Furthermore, these platforms have abandoned the electrophoresis-based approaches of traditional Sanger sequencing; instead, DNA sequence data are collected in real time from novel sequencing substrates. Development of quality-control tools for these next-generation DNA sequencing technologies is critical for their effective and accurate application to biological questions.
Illumina's Solexa sequencing approach is a process in which DNA samples are nebulized into small fragments (~150 bp) and ligated to adapters that bind to linker molecules on the surface of a flow cell, where amplified DNA clusters are ultimately sequenced in real time using Solexa's reversible dye terminator approach. Each flow cell contains eight lanes, in which DNA molecules from distinct samples can be sequenced independently. Each lane is subdivided into hundreds of tiles (200 tiles in earlier systems, 300 in the most recent system), and four images are collected from each tile (one for each of the four base dyes) per sequencing cycle. These tile images constitute the raw data from which DNA sequence information is ultimately derived.
Illumina provides a standard front-end analysis pipeline for Solexa data in which image analysis is carried out by Firecrest and base calls are made by Bustard. In making a base call, Bustard assigns a quality score (Q-score) to each of the four possible nucleotides. These Solexa quality scores range from -40 to 40; they are not equal to Phred quality scores but are asymptotically identical to them. Assuming no ambiguity, the nucleotide with the highest Q-score is called. An ideal call has one Q-score of +40 and three of -40. The aggregate quality score (QAG-score) for a base call is the maximum Q-score minus the sum of the remaining three Q-scores, so an ideal call has a QAG-score of 40 - (-120) = 160.
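The QAG-score definition above can be illustrated with a minimal R sketch. The function names (qag_score, solexa_to_phred) and the example matrix are our own illustrative choices and are not part of tileQC or the Solexa pipeline; the Solexa-to-Phred conversion is the commonly used relation between the two scales and is included only to show why they agree for high-quality calls.

```r
# Aggregate quality score for one base call, following the definition in the
# text: the highest of the four Q-scores minus the sum of the other three.
qag_score <- function(q) {
  stopifnot(length(q) == 4)
  max(q) - (sum(q) - max(q))
}

# Commonly used Solexa-to-Phred relation (assumption: not taken from tileQC).
# For large positive scores the two scales become nearly identical.
solexa_to_phred <- function(q) 10 * log10(10^(q / 10) + 1)

# Hypothetical Q-scores for two sequencing cycles of one read
# (columns in the order A, C, G, T).
prb <- rbind(c(40, -40, -40, -40),   # an ideal call
             c(12,  -5, -20, -30))   # a less certain call

cycle_qag <- apply(prb, 1, qag_score)   # 160 for the ideal call, 67 for the second
phred_at_40 <- solexa_to_phred(40)      # very close to 40
phred_at_0  <- solexa_to_phred(0)       # about 3, where the scales diverge
```

Applying qag_score row by row in this way is what allows quality to be examined at the level of individual read cycles.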
After Firecrest and Bustard, Eland aligns individual Solexa sequence reads to a user-defined reference genome. Eland assigns each read to one of eight categories: reads that align to unique genomic regions with 0, 1 or 2 mismatches (U0, U1 and U2, respectively); reads that align to repetitive regions with 0, 1 or 2 mismatches (R0, R1 and R2, respectively); reads with three or more mismatches to the reference genome, which form the "no match" (NM) category; and reads containing two or more bases that could not be called (QC).
Here we provide an openly available software program, tileQC, for quality control of Solexa output. TileQC relies on the R programming environment and a MySQL database server configured for use by the tileQC program. Minor changes to the initialization script allow almost any SQL server to be used. Initial configuration is minimal, yet flexible enough to accommodate a wide range of security options.
TileQC features both qualitative and quantitative error detection. The qualitatively oriented functions display the locations of reads on a tile as dots in a square. Each read's color and size are coded using the Eland categorizations and/or the QAG-score data derived from Bustard. The Eland-coded images represent the data after all other processing has occurred and reveal irregularities arising during any stage of the processing pipeline. QAG-score coded images, on the other hand, are produced from the Bustard output; they offer both a greater range of values than the Eland categorizations and greater resolution, allowing the Solexa output to be analyzed down to the level of individual read cycles. This increased flexibility may obscure errors that are obvious at the Eland level. However, once an error is detected, the QAG-score coding allows a more accurate assessment of that error's underlying cause and/or location.
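A display of this kind can be approximated in a few lines of base R graphics. The sketch below is not tileQC's own plotting code: the function plot_tile, the colour and size table, and the simulated read positions are all assumptions made for illustration.

```r
# Hypothetical colour and size coding for the eight Eland categories;
# tileQC's actual palette may differ.
eland_style <- data.frame(
  type = c("U0", "U1", "U2", "R0", "R1", "R2", "NM", "QC"),
  col  = c("darkgreen", "green3", "yellowgreen",
           "blue", "steelblue", "skyblue", "red", "grey50"),
  cex  = c(0.4, 0.5, 0.6, 0.4, 0.5, 0.6, 0.8, 0.8),
  stringsAsFactors = FALSE
)

# Draw each read as a dot at its tile position, coded by Eland category.
# Returns (invisibly) the row of the style table used for each read.
plot_tile <- function(reads, style = eland_style) {
  idx <- match(reads$type, style$type)
  plot(reads$x, reads$y, col = style$col[idx], cex = style$cex[idx],
       pch = 16, asp = 1, xlab = "x", ylab = "y",
       main = "Reads on one tile, coded by Eland category")
  invisible(idx)
}

# Simulated reads standing in for one tile of real pipeline output.
set.seed(1)
reads <- data.frame(
  x    = runif(500),
  y    = runif(500),
  type = sample(eland_style$type, 500, replace = TRUE),
  stringsAsFactors = FALSE
)

if (interactive()) plot_tile(reads)
```

With real data, a cluster of red (NM) or grey (QC) dots confined to one region of the square would be the kind of spatial pattern the qualitative display is meant to expose.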
The guiding philosophy behind tileQC's qualitative error assessment features is that the researcher's visual pattern recognition is the best way to detect novel errors. Once a new type of error has been identified, the data extraction features of the program may be used as a starting point for the programmatic detection and filtering of similar errors.
The current version (tileQC 1.0, see Additional file 1) runs on Windows, Linux, and Macintosh operating systems, and requires the R programming environment (version 2.5 or higher) and a properly configured MySQL server (detailed directions for configuration are available at the tileQC website). The package 'RMySQL' must also be installed within the R environment. ('RMySQL' in turn requires the package 'DBI'; however, installing 'RMySQL' installs 'DBI' automatically.)
The R software is available for download from the R Project website (http://www.r-project.org/), as is the 'RMySQL' package (see the R FAQ for details on downloading and installing an R package). The MySQL database server is also available as a free download. The tileQC system was implemented in the R language; source code, installation instructions and tutorials are available at the tileQC website (http://science.oregonstate.edu/~dolanp/tileqc).
To convert text files into database form (and/or import data directly from the text files), the utilities sed, tr, grep, and wc must also be installed. These programs are part of the standard installation on most flavors of Unix (including recent versions of the Macintosh OS). For the win32 platform, all the necessary programs are included in the GNU utilities for win32, available from SourceForge as UnxUtils (http://sourceforge.net/projects/unxutils).
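Whether these prerequisites are met can be checked from within R before any import is attempted. The following is a minimal sketch using only base R functions; the variable names are illustrative, and the check is our own suggestion rather than something tileQC is documented to perform.

```r
# Command-line utilities needed for the text-to-database conversion.
required_tools <- c("sed", "tr", "grep", "wc")

# Sys.which() returns the full path of each program, or "" if it is not found.
tool_paths <- Sys.which(required_tools)
missing_tools <- names(tool_paths)[tool_paths == ""]

if (length(missing_tools) > 0) {
  message("Not found on the search path: ",
          paste(missing_tools, collapse = ", "))
}

# tileQC requires R version 2.5 or higher.
r_ok <- getRversion() >= "2.5.0"
```

On win32, an empty entry in tool_paths usually means that the directory holding the GNU utilities has not been added to the PATH environment variable.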
Results and Discussion
Throughout this section, all Solexa data used in examples were generated from several of our Caenorhabditis elegans genomic DNA runs (unpublished data) on an Illumina Genome Analyzer. All C. elegans data were subjected to the standard Solexa data analysis pipeline prior to application of the tileQC tools.
The first role of tileQC is to facilitate the conversion of text-based Solexa pipeline output to a more flexible SQL database format (in our case, using the MySQL database server). If a compatible database does not already exist, tileQC will (upon request) create one. Creating a database requires that both the SQL server and the tileQC program be properly configured (see the tileQC website for details). Once the Eland and QAG-score data are in database form, the full power of both SQL and R may be brought to bear upon their analysis. Encapsulating the database connection within an R object allows the mundane details to be managed invisibly and frees the researcher to focus on analyzing the data rather than on the mechanics of accessing and manipulating them. Although supplementary to the package's primary purpose of tile-based quality control (QC), this feature is useful in its own right and simplifies the mechanics of querying a database containing Solexa data.
The standard SQL query language is enhanced by the inclusion of a simple form of expression substitution. Here, for example, we see the extraction of five reads covering the location 332,080 in Chromosome I of the C. elegans genome (note the use of #current.table# in the SQL command):
> celegans$runSQL("select seq, type, locus, muta, mutb from #current.table#
where locus >= 332048 and locus <= 332080 and segment = 'CHR_I' limit 5")
                                   seq type  locus muta mutb
1 AATTTTTTGAATTTGCTCGCCGCATTTCGACTTTCT   U2 332053  23A  28T
2 TGAATTTGCTCGCCGAATTTCGACTTTCTTACAATT   U2 332060  21T  30G
3 GAATTTGCTCGCCGAATTTCGACTTTCTGACAATTT   U1 332061  20T
4 GAATTTGCTCGCCGAATTTCGCCTTTCTGACACTTG   U2 332061  20T  22A
5 GAATTTGCTCGCCGAATTTCGACTTTCTTACAATTT   U2 332061  20T  29G
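The idea behind placeholders such as #current.table# can be sketched with plain string replacement. The function substitute_sql and the table name run1_lane3 below are hypothetical and serve only to illustrate the mechanism; tileQC's actual substitution rules may differ.

```r
# Replace each #name# placeholder in an SQL string with a supplied value.
# 'values' is a named list; fixed = TRUE makes gsub() match the placeholder
# literally rather than as a regular expression.
substitute_sql <- function(sql, values) {
  for (name in names(values)) {
    placeholder <- paste0("#", name, "#")
    sql <- gsub(placeholder, values[[name]], sql, fixed = TRUE)
  }
  sql
}

template <- paste("select seq, type, locus from #current.table#",
                  "where segment = 'CHR_I' limit 5")

# 'run1_lane3' is a made-up table name used purely for illustration.
query <- substitute_sql(template, list(current.table = "run1_lane3"))
```

Keeping the table name out of the query text means the same query can be reused unchanged when the object is pointed at a different run or lane.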
The primary purpose of the package is, of course, tile-based quality control. Errors generated during the Solexa sequencing process often form patterns that become visible when the physical locations of a tile's reads are plotted in colors and sizes that depend upon the category to which Eland has assigned them. The tileQC package therefore contains functions that are optimized to create such qualitative displays. The visual representation appears on the left, and a relative frequency histogram of the number of reads in each Eland category for that tile appears on the right. The researcher may select which categories of read are to be displayed, and may even filter the unique reads according to whether they match the forward strand, the reverse strand, or either. The homogeneity of the Solexa process ensures that, when the machine is functioning properly, the relative frequencies are similar from tile to tile and the reads are distributed uniformly across each tile. Major departures from these conditions are immediately apparent on visual inspection.
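The expectation that relative frequencies are similar from tile to tile also suggests a simple programmatic check. The sketch below is one possible approach and not the method implemented in tileQC: the function names, the use of the per-category median as the "typical" tile, and the 0.1 threshold are all assumptions, and the data are constructed by hand rather than taken from a real run.

```r
eland_levels <- c("U0", "U1", "U2", "R0", "R1", "R2", "NM", "QC")

# Relative frequency of each Eland category within each tile
# (one row per tile; each row sums to 1).
tile_frequencies <- function(reads) {
  counts <- table(reads$tile, factor(reads$type, levels = eland_levels))
  unclass(prop.table(counts, margin = 1))
}

# Flag tiles in which any category's relative frequency differs from the
# across-tile median by more than 'threshold' (an arbitrary cut-off).
flag_tiles <- function(freq, threshold = 0.1) {
  typical <- apply(freq, 2, median)
  deviation <- apply(abs(sweep(freq, 2, typical)), 1, max)
  names(deviation)[deviation > threshold]
}

# Hand-built example: ten tiles of 200 reads each; tile 7 has an excess
# of "no match" (NM) reads.
normal <- rep(eland_levels, times = c(100, 30, 10, 20, 10, 5, 20, 5))
faulty <- rep(eland_levels, times = c(20, 10, 5, 10, 5, 5, 140, 5))
reads <- do.call(rbind, lapply(1:10, function(tile) {
  data.frame(tile = tile,
             type = if (tile == 7) faulty else normal,
             stringsAsFactors = FALSE)
}))

freq <- tile_frequencies(reads)
flagged <- flag_tiles(freq)   # "7"
```

A check of this kind complements rather than replaces the visual display: it can detect a tile whose category mix is unusual, but not a tile whose mix is normal while its reads are spatially clustered.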
The tileQC system offers a versatile and powerful tool for the quality control of Solexa-based DNA sequence data. Future challenges include the development of an interface that unifies the task of summarization with that of quantitative testing. This short-term goal (partially completed) will lead to a plug-in style of summarization and analysis that will allow researchers to flexibly encapsulate any desired post-processing or data extraction within a shareable R object. Mid-range goals include an interactive graphical interface for more convenient data exploration as well as a freely available library of analytic modules.
Availability and requirements
The tileQC system is freely available from the tileQC homepage (http://science.oregonstate.edu/~dolanp/tileqc). It requires R (version 2.5 or higher), the R package 'RMySQL' and MySQL (version 5.0 or higher). To convert Solexa output from text to database form, it requires the Solexa pipeline (up to version 0.3) output files of the form '_prb.txt' and '_eland_result.txt' as well as the utilities wc, grep, tr, and sed.
Abbreviations
QAG-score: aggregate quality score
SNP: single nucleotide polymorphism
We thank Larry J. Wilhelm and Dana K. Howe for help with developing tileQC. Thanks to Chris Sullivan and Mark Dasenko at the OSU Center for Genome Research and Biocomputing for assistance with Solexa data and computing support. Also thanks to Brian Knaus, Dr. Rongkun Shen, and Dr. Albyn Jones for valuable advice. We are grateful to the National Institutes of Health and OSU Computational and Genome Biology Initiative for funding support.
- Mardis ER: Anticipating the 1,000 dollar genome. Genome Biol 2006, 7(7):112. doi:10.1186/gb-2006-7-7-112
- Warren RL, Sutton GG, Jones SJ, Holt RA: Assembling millions of short DNA sequences using SSAKE. Bioinformatics 2007, 23(4):500–501. doi:10.1093/bioinformatics/btl629
- Jeck WR, Reinhardt JA, Baltrus DA, Hickenbotham MT, Magrini V, Mardis ER, Dangl JL, Jones CD: Extending assembly of short DNA sequences to handle error. Bioinformatics 2007, 23(21):2942–2944. doi:10.1093/bioinformatics/btm451
- Bentley DR: Whole-genome re-sequencing. Curr Opin Genet Dev 2006, 16(6):545–552. doi:10.1016/j.gde.2006.10.009
- SS DNA Sequencing [http://www.illumina.com/downloads/SS_DNAsequencing.pdf]
- The R Project for Statistical Computing [http://www.r-project.org/]
- TileQC Homepage [http://science.oregonstate.edu/~dolanp/tileqc]
- Sourceforge: UnxUtils [http://sourceforge.net/projects/unxutils]
- Understanding Qualities [http://maq.sourceforge.net/qual.shtml]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.