PANDAseq: paired-end assembler for illumina sequences
© Masella et al.; licensee BioMed Central Ltd. 2012
Received: 23 September 2011
Accepted: 14 February 2012
Published: 14 February 2012
Illumina paired-end reads are used to analyse microbial communities by targeting amplicons of the 16S rRNA gene. Publicly available tools are needed to assemble overlapping paired-end reads while correcting mismatches and uncalled bases; many errors could be corrected to obtain higher sequence yields using quality information.
PANDAseq assembles paired-end reads rapidly and with the correction of most errors. Uncertain error corrections come from reads with many low-quality bases identified by upstream processing. Benchmarks were done using real error masks on simulated data, a pure source template, and a pooled template of genomic DNA from known organisms. PANDAseq assembled reads more rapidly and with reduced error incorporation compared to alternative methods.
PANDAseq rapidly assembles sequences and scales to billions of paired-end reads. Assembly of control libraries showed a 4-50% increase in the number of assembled sequences over naïve assembly with negligible loss of "good" sequence.
Single-gene sequencing has become the benchmark for studying microbial taxonomic composition of environmental samples, by amplification of hypervariable regions of the 16S rRNA gene. Next-generation sequencing platforms, such as Illumina, are now adapted for the generation of multi-million-member sequence libraries for sample comparisons [1–4]. The PCR amplicons used for sequencing typically encompass one or more 16S rRNA gene hypervariable regions and amplicon lengths typically extend beyond the sequencing limit of the Illumina single-read method, which is typically less than 150 bases. Because the Illumina platform can generate amplicon sequences in a paired-end format, based on each template's position on the flow cell, paired reads can be directly matched and assembled. The prefiltering step of the genome assembly software PHRAP can be used to assemble reads . Although the Needleman-Wunsch algorithm  embedded in Merger (http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/merger.html) has been used to assemble Illumina paired-end reads , PANDAseq makes use of Illumina-specific properties, including the low probability of gap-inclusion.
PANDAseq aligns each set of paired-end sequence reads in a three-step process. First, it determines the locations of the amplification primers, if they are specified and were sequenced. Then, it identifies the optimal overlap. Finally, it reconstructs the complete sequence, correcting any errors, and checks for various constraints, such as length and quality.
To score alignments, we calculate the probability that the true nucleotides, and , are the same, given the observed nucleotides, X and Y. We estimate this with the included quality information found in the Illumina reads. For each base, CASAVA provides an encoded quality score, which is the probability of the base being miscalled. This probability (ϵ) is approximated by where A0 is the ASCII quality value in the Illumina analysis pipeline versions before CASAVA 1.8 and A 1 is the ASCII value used in CASAVA 1.8. 
while assuming that , which is the highest value score assigned by Illumina  and, intuitively, assuming that is P.
The program then finds the best overlap greater than a specified threshold for the forward and reverse sequences, F and R, respectively. If no suitable overlap is found, then the read pair is discarded. This is done for the entire read, even if there are primers to be removed, as it is possible for the overlap to be sufficiently long to be in the primer region. A schematic is shown in Figure 1.
where and and the remainder is as above with e fixed at a value determined empirically to be the average error rate. This value of ϵ was calculated by counting the mismatch rate in known index tags in a defined community data set (described below). This parameter need not be retuned as it is only an estimate of the error. Because the index read is short and sequenced earlier in the process, it likely has fewer errors and, therefore, its error rate should underestimate the true error rate. Regardless, the error rate specified for this step should not negatively affect the ability of PANDAseq to identify the best overlap for the forward and reverse reads.
Once the overlap is selected, the output sequence is constructed and an overall quality score is calculated. During this process, the primer regions are disregarded if primers were specified. The unpaired regions are copied from the available strands and the quality score for these regions is the product of the probability of those bases being correct. For the overlapping region, the decision-making process is more complex. If the bases agree, the base is included and the quality of this base is assumed to be . If the bases disagree, the base with the higher quality score is chosen and the quality of this base is assumed to be . If either or both bases are uncalled, they are considered to always match, noting that unassigned bases are always associated with the lowest quality score by CASAVA .
In certain cases, the CASAVA pipeline masks the quality score at the end of the read, replacing all quality scores with the lowest quality score . In this case, special quality scoring is used by PANDAseq. If one base is masked, the probability of the other base is used if the bases match or uniformly random, , is used if they do not match. If both are in the masked region, the quality is assumed to be uniformly random, .
The constructed sequence can then be validated against user-specified criteria. The quality score assigned to the assembled sequence is the geometric mean of the quality scores calculated above, which compensates for the variable lengths of the final sequences. PANDAseq enables users to reject sequences based on low quality score, lengths that are too short or too long, or the presence of uncalled bases. A module system is also available within PANDAseq to allow more sophisticated validation of user sequences, such as verification of known secondary structure or conserved regions. Note that there is a detailed manual included with the software that describes example usage scenarios.
Results and discussion
To validate PANDAseq, we used three experimental tests: (1) a test using simulated data to verify algorithmic correctness, (2) a test using sequence data from a single-template PCR amplicon to verify the quality of assembled reads, and (3) a test with experimental data obtained from a defined mixture of genomic DNA fragments to compare PANDAseq assembly yields with naïve assembly.
Read error correction frequencies
Quality (Geometric Mean)
0.9 - 1.0
0.6 - 0.9
Error-free Input and Output
All Errors Retained
Input Errors Reduced
Number of sequences with correct overlap regions
Percentage of Output
Assembling 1350 602 reads took just 364 seconds on a Macintosh Pro with 2 quad-core Intel Xeon 2.93 GHz processors. Profiling indicated that cache faults are the limiting factor in performance (data not shown), and the current design minimises cache faults during analysis of each sequence pair.
There is a concern that, when making a choice between two disagreeing bases, the reconstructed sequence does not reflect the true sequence. For the control library, the disagreeing bases were dominated by mismatches in the quality-masked region of the reads, where both bases were of low quality and the decision would be arbitrary because there is no reasonable way to discern which base is better. In those cases, the entire reads are of low quality and likely to be discarded due to the quality threshold. However, mismatches generally occur between a base with a high quality score and a base with a low quality score, simplifying the choice of which base is correct. In control library data, only 20% of mismatched bases both had quality scores masked by CASAVA. Since the quality masked region must be quite long for this to occur, only few sequences suffer strongly from these mismatched quality-masked bases. This is due to the overlap region typically being longer than the quality-masked regions.
PANDAseq produces additional high-quality assemblies from Illumina paired-end reads than naïve assembly for minimal computational cost and provided more rapid and higher quality results compared to existing assemblers. Error correction, particularly of uncalled bases, increases the number of assembled sequences. Although it is possible for PANDAseq to produce incorrect assemblies, most assemblies are correct because incorrect assemblies have low quality scores, as these mismatches occur in quality-masked regions of both reads, and are discarded. This software provides a versatile and powerful way to assemble paired-end Illumina reads without otherwise discarding high-quality sequence data.
Availability and Requirements
Project name: PANDAseq
Project home page: https://github.com/neufeld/pandaseq
Operating system(s): POSIX-compliant (Windows, Linux, and MacOS)
Programming language: C
Other requirements: None
License: GNU GPL
Any restrictions to use by non-academics:
This research was supported by the Natural Sciences and Engineering Research Council of Canada, through Discovery Grants to JDN and to DGB and through a Strategic Projects Grant to JDN, and by Early Researcher Awards from the Government of Ontario to JDN and to DGB.
- Bartram AK, Lynch MDJ, Stearns JC, Moreno-Hagelsieb G, Neufeld JD: Generation of multimillion-sequence 16S rRNA gene libraries from complex microbial communities by assembling paired-end Illumina reads. Appl Environ Microbiol 2011, 77: 3846–3852. [http://aem.asm.org/cgi/content/abstract/77/11/3846] 10.1128/AEM.02772-10PubMed CentralView ArticlePubMedGoogle Scholar
- Gloor GB, Hummelen R, Macklaim JM, Dickson RJ, Fernandes AD, MacPhee R, Reid G: Microbiome Profiling by Illumina sequencing of combinatorial sequence-tagged PCR products. PLoS ONE 2010, 5: e15406. 10.1371/journal.pone.0015406PubMed CentralView ArticlePubMedGoogle Scholar
- Degnan PH, Ochman H: Illumina-based analysis of microbial community diversity. ISME J 2011. [http://www.nature.com/ismej/journal/v6/n1/full/ismej201174a.html]Google Scholar
- Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, Turnbaugh PJ, Fierer N, Knight R: Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci USA 2011, 108(Suppl 1):4516–4522. [http://genomebiology.com/2011/12/5/R50]PubMed CentralView ArticlePubMedGoogle Scholar
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48: 443–453. 10.1016/0022-2836(70)90057-4View ArticlePubMedGoogle Scholar
- Zhou HW, Li DF, Tam NF, Jiang XT, Zhang H, Sheng HF, Qin J, Liu X, Zou F: BIPES, a cost-effective high-throughput method for assessing microbial diversity. ISME J 2011, 5: 741–749. [http://www.nature.com/ismej/journal/v5/n4/abs/ismej2010160a.html] 10.1038/ismej.2010.160PubMed CentralView ArticlePubMedGoogle Scholar
- Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 2010, 38(6):1767–1771. [http://nar.oxfordjournals.org/content/38/6/1767.abstract] 10.1093/nar/gkp1137PubMed CentralView ArticlePubMedGoogle Scholar
- Illumina, Inc: CASAVA Software Version 1.7 User Guide. Illumina, Inc; 2010.Google Scholar
- Muyzer G, de Waal EC, Uitterlinden AG: Profiling of complex microbial populations by denaturing gradient gel electrophoresis analysis of polymerase chain reaction-amplified genes coding for 16S rRNA. Appl Environ Microbiol 1993, 59: 695–700. [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC202176/]PubMed CentralPubMedGoogle Scholar
- Rodrigue S, Materna AC, Timberlake SC, Blackburn MC, Malmstrom RR, Alm EJ, Chisholm SW: Unlocking short read sequencing for metagenomics. PLoS ONE 2010, 5: e11840. [http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0011840] 10.1371/journal.pone.0011840PubMed CentralView ArticlePubMedGoogle Scholar
- Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22: 1658–1659. [http://bioinformatics.oxfordjournals.org/content/22/13/1658.long] 10.1093/bioinformatics/btl158View ArticlePubMedGoogle Scholar
- Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen AS, McGarrell DM, Marsh T, Garrity GM, Tiedje JM: The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res 2009, 37(Database issue):D141–145.PubMed CentralView ArticlePubMedGoogle Scholar
- Cole JR, Chai B, Farris RJ, Wang Q, Kulam-Syed-Mohideen AS, McGarrell DM, Bandela AM, Cardenas E, Garrity GM, Tiedje JM: The ribosomal database project (RDP-II): introducing myRDP space and quality controlled public data. Nucleic Acids Res 2007, 35(Database issue):D169–172.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.