iAssembler: a package for de novo assembly of Roche-454/Sanger transcriptome sequences
© Zheng et al; licensee BioMed Central Ltd. 2011
Received: 27 May 2011
Accepted: 23 November 2011
Published: 23 November 2011
Expressed Sequence Tags (ESTs) have played significant roles in gene discovery and gene functional analysis, especially for non-model organisms. For organisms with no full genome sequences available, ESTs are normally assembled into longer consensus sequences for further downstream analysis. However, current de novo EST assembly programs often generate a large number of assembly errors that negatively affect downstream analysis. To generate more accurate consensus sequences from ESTs, tools are needed to reduce or eliminate errors from de novo assemblies.
We present iAssembler, a pipeline that can assemble large-scale ESTs into consensus sequences with significantly higher accuracy than existing assemblers. iAssembler employs the MIRA and CAP3 assemblers to generate initial assemblies, then identifies and corrects two common types of transcriptome assembly errors: 1) ESTs from different transcripts (mainly alternatively spliced transcripts or paralogs) are incorrectly assembled into the same contigs; and 2) ESTs from the same transcripts fail to be assembled together. iAssembler can be used to assemble ESTs generated using the traditional Sanger method and/or the Roche-454 massively parallel pyrosequencing technology.
We compared the performance of iAssembler and several other de novo EST assembly programs using both Roche-454 and Sanger EST datasets. The comparison demonstrated that iAssembler generated significantly more accurate consensus sequences than the other assembly programs.
Expressed sequence tags (ESTs) are short sub-sequences of transcribed genes and have been extensively used for gene discovery and digital expression analysis. Recent advances in next-generation sequencing (NGS) technologies allow sequencing of large-scale ESTs in an efficient and cost-effective way. One of these technologies, the Roche-454 massively parallel pyrosequencing platform, has been widely used to sequence transcriptomes of various non-model organisms [4–9] because its relatively long reads (currently ~400 bp) greatly facilitate de novo assembly.
Several de novo assembly programs such as CAP3, MIRA, TGICL, Phrap, and Newbler (Roche) have been developed to assemble EST sequence reads into longer contigs. However, most of these programs were developed primarily for genome sequence assembly, and even their transcriptome assembly modes have not been fully optimized. As a result, two types of assembly errors are frequently observed: 1) type I errors, in which ESTs derived from alternatively spliced transcripts or paralogs are incorrectly assembled into one transcript; and 2) type II errors, in which ESTs derived from the same transcript fail to be assembled together. We have investigated these two types of errors in the Dana-Farber Cancer Institute (DFCI) Plant Gene Index, which was created by assembling Sanger ESTs into unigenes using TGICL, as well as in several other EST databases. Surprisingly, we found that a large number of unigenes with significant overlap (e.g., > 500 bp) and high sequence identity (e.g., > 99%) were not assembled together, such as TC219875 and TC221582 in the DFCI Tomato Gene Index (Additional file 1). Conversely, ESTs with significant sequence differences were assembled together, e.g., AW218649 and TC237370 (< 92% identity; Additional file 1), and AW031810 and TC223103 (alternative splicing; Additional file 1) in the DFCI Tomato Gene Index. The assembly error rates are also high for Roche-454 ESTs: after reanalyzing several published datasets, we consistently observed that a large portion of Roche-454 unigenes contain assembly errors. Recently, Kumar and Blaxter recommended an assembly strategy that involves combining differently optimal assemblies from multiple programs. This strategy can generate better assemblies by exploiting the strengths of different assembly programs; however, the resulting assemblies still contain a significant number of mis-assemblies. To date, no program is available that can efficiently identify and correct the two types of errors described above.
In this paper we describe iAssembler, a package that can efficiently assemble large-scale EST datasets and automatically identify and correct assembly errors. We demonstrate the utility and performance of this program by performing assemblies on different EST datasets with different sets of parameters.
iAssembler is implemented in Perl and can be executed under either 32-bit or 64-bit Linux systems with Bioperl installed. Although the MIRA, CAP3 and NCBI megablast programs are required by iAssembler, they are already integrated into the iAssembler package for the user's convenience. Thus iAssembler is easy to install and simple to use.
Architecture of iAssembler
Error corrections in iAssembler
The unique feature of iAssembler is its ability to detect and automatically correct assembly errors. Following initial assemblies by MIRA and CAP3, all-versus-all pairwise sequence alignments of the resulting unigenes are performed using the NCBI megablast program. Unigene pairs whose overlap length, overlap identity, and overhang length meet user-defined cutoffs are identified as type II assembly errors, i.e., sequences from the same transcript that failed to be assembled together. The megablast assembler then uses the pairwise alignment information to join these unigenes into new contigs. Next, the type I error corrector module maps individual EST members to their corresponding contigs using megablast. ESTs whose sequence identity to the corresponding contig is below the user-defined cutoff and/or whose overhang length exceeds the user-defined cutoff are identified as type I assembly errors, i.e., different transcripts incorrectly assembled together. These misassembled ESTs are extracted by the type I error corrector and, together with the unigenes derived from the current round of assembly and error correction, are used as the input sequences for the next round of assembly and error correction (Figure 1).
The iterative assembly strategy employed by iAssembler can reduce the accuracy of the final unigene base calls, because later assemblies are performed on unigenes generated from previous assemblies rather than on ESTs. During assembly by the CAP3 and megablast assemblers, the depth of coverage by individual EST members at each unigene position is therefore lost and cannot be used in base calling of the assembled sequences, which causes a significant number of wrongly called bases in unigenes. To address this, iAssembler provides a unigene base error correction module (Figure 1), which reassigns each base of a unigene according to the SAM output file (generated by iAssembler) containing detailed alignment information of individual ESTs to their corresponding unigenes. The most frequent base covering a specific position is assigned to that position of the unigene.
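The two error criteria described above can be written as simple predicates. The following is a minimal sketch, not the iAssembler implementation: the `Alignment` and `Cutoffs` records and their field names are hypothetical, the default cutoff values are borrowed from the evaluation settings reported later in this paper (40 bp overlap, 97% identity, 30 bp overhang), and whether the comparisons are inclusive at the boundary is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one pairwise megablast alignment."""
    overlap_len: int          # length of the aligned region, in bp
    percent_identity: float   # identity within the aligned region
    max_overhang: int         # longest unaligned end, in bp


@dataclass
class Cutoffs:
    """User-defined cutoffs; defaults mirror the evaluation settings."""
    min_overlap: int = 40
    min_identity: float = 97.0
    max_overhang: int = 30


def is_type2_error(aln: Alignment, c: Cutoffs) -> bool:
    """Two unigenes that should have been merged: the overlap is long
    enough, similar enough, and the overhangs are short enough."""
    return (aln.overlap_len >= c.min_overlap
            and aln.percent_identity >= c.min_identity
            and aln.max_overhang <= c.max_overhang)


def is_type1_error(aln: Alignment, c: Cutoffs) -> bool:
    """An EST that does not fit its contig: identity is too low
    and/or the overhang is too long."""
    return (aln.percent_identity < c.min_identity
            or aln.max_overhang > c.max_overhang)
```

Note the asymmetry: a type II call requires all three conditions to hold, whereas a type I call is triggered by either a low identity or a long overhang.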
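The majority-vote rule for base correction can be illustrated with a short sketch. It makes simplifying assumptions that the text does not state: alignments are gap-free, each EST is described only by its 0-based start offset on the unigene, and positions without EST coverage keep their original base. The function name `correct_bases` is hypothetical, and a real SAM record would also require CIGAR handling.

```python
from collections import Counter


def correct_bases(unigene: str, placements: list[tuple[int, str]]) -> str:
    """Re-call each unigene base as the most frequent EST base covering it.

    placements: (0-based start offset on the unigene, gap-free EST sequence).
    """
    # One counter of observed bases per unigene position.
    columns = [Counter() for _ in unigene]
    for start, est in placements:
        for offset, base in enumerate(est):
            pos = start + offset
            if 0 <= pos < len(unigene):
                columns[pos][base.upper()] += 1

    corrected = []
    for original, counts in zip(unigene, columns):
        if counts:
            # Most frequent base among the ESTs covering this position.
            corrected.append(counts.most_common(1)[0][0])
        else:
            # No coverage: keep the base from the assembly (assumption).
            corrected.append(original)
    return "".join(corrected)
```

For example, if three ESTs all carry `C` at a position where the unigene has `G`, the corrected unigene carries `C` at that position.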
Following corrections of type I and II assembly and unigene base calling errors, iAssembler reevaluates the resulting unigenes and identifies and corrects new assembly and base calling errors. The error identification and correction steps will be iterated until no new errors can be identified or corrected.
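The stopping rule, iterating until no new errors are identified or corrected, is a fixed-point loop. A minimal sketch follows, assuming a hypothetical `step` callable that performs one round of error identification and correction and reports how many errors it fixed. The `max_rounds` safety cap is an addition for illustration and is not described in the text.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def iterate_until_stable(
    state: T,
    step: Callable[[T], tuple[T, int]],
    max_rounds: int = 50,
) -> tuple[T, int]:
    """Apply `step` repeatedly until it reports zero corrections.

    Returns the final state and the number of rounds executed.
    """
    rounds = 0
    while rounds < max_rounds:
        state, n_fixed = step(state)
        rounds += 1
        if n_fixed == 0:
            # Nothing left to correct: the assembly is stable.
            break
    return state, rounds
```

The last round is always a confirming pass that finds nothing, so an assembly needing three rounds of correction runs four rounds in total.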
iAssembler is designed to generate highly accurate assemblies of EST sequences through an iterative assembly strategy and automated error detection and correction. The three assemblers in iAssembler (MIRA, CAP3 and the megablast assembler) are all based on the overlap-layout-consensus strategy; thus iAssembler is applicable to ESTs with relatively long sequences, such as those generated using the Sanger and/or Roche-454 platforms.
Workflow of iAssembler
The workflow of iAssembler is shown in Figure 1. iAssembler takes Roche-454 and/or Sanger EST sequences in FASTA format as its input. Before being fed to iAssembler, the EST sequences need to be cleaned by removing low quality regions and known sequence contaminations (e.g., adapters, vectors, and rRNAs) to avoid misassemblies and misinterpretations. This can be achieved by using sequence cleaning programs such as SeqClean or LUCY. It is worth noting that iAssembler itself does not contain functions to clean and trim raw EST sequences.
Cleaned EST sequences are supplied to iAssembler with appropriate user-defined parameters. iAssembler first employs MIRA to assemble the EST sequences and then assembles the resulting MIRA unigenes using CAP3. These two open source assemblers were chosen because we have observed that MIRA is efficient in handling large-scale and relatively short Roche-454 reads, while CAP3 can complement MIRA by correcting certain type II assembly errors. Following initial assemblies by MIRA and CAP3, type II assembly errors (unigenes belonging to the same transcripts) are identified by performing all-versus-all pairwise sequence alignments of the resulting unigenes using the NCBI megablast program. iAssembler then uses the pairwise alignment information to assemble these unigenes into new contigs using the megablast assembler module. Next, iAssembler identifies type I assembly errors by aligning individual EST members to their corresponding unigenes. The misassembled ESTs, whose alignments to their corresponding unigenes do not satisfy user-specified cutoffs such as minimum percent identity or maximum overhang, are then extracted and used in the next round of assembly and error correction. Finally, unigene base calling errors are corrected based on the alignment information of individual ESTs to their corresponding unigenes contained in the SAM output file. iAssembler iterates through the error identification and correction steps until no new errors can be identified or corrected.
The main output of iAssembler includes 1) the final assembled unigene sequence file in FASTA format, 2) a text file summarizing the statistics of alignments of ESTs against their corresponding unigenes, which provides the information needed to assess the quality of the assembly, and 3) a file containing detailed alignment information of individual EST sequences against their corresponding unigenes in SAM format. SAM is a generic format for storing read alignments against reference sequences and has been adopted by most next-generation sequence alignment and assembly programs. SAM files can be processed and manipulated by SAMtools. For example, SAMtools can convert SAM files into BAM files, the binary form of SAM files, for significantly faster access and reduced disk usage, and can generate pileup output from SAM files for SNP detection. SAM files can also be viewed by several next-generation sequence assembly visualization programs, including IGV and Tablet.
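One way to picture the step that turns pairwise alignment information into new contigs is to group unigenes that are linked, directly or transitively, by qualifying alignments. The sketch below uses a union-find structure for this grouping. It is an illustration only: the text does not specify how the megablast assembler module joins unigenes, so the single-linkage grouping, the function name `cluster_unigenes`, and the unigene identifiers are all assumptions.

```python
def cluster_unigenes(ids, linked_pairs):
    """Group unigenes connected by qualifying pairwise alignments.

    ids: iterable of unigene identifiers.
    linked_pairs: (id_a, id_b) pairs flagged as type II errors.
    Returns a sorted list of sorted clusters.
    """
    ids = list(ids)
    parent = {u: u for u in ids}

    def find(u):
        # Follow parent links to the cluster root, halving the path as we go.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a   # merge the two clusters

    clusters = {}
    for u in ids:
        clusters.setdefault(find(u), []).append(u)
    return sorted(sorted(members) for members in clusters.values())
```

Each multi-member cluster would then be a candidate for joining into one contig, while singletons are carried forward unchanged.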
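Because the alignment output is in SAM format, simple summaries can be derived with a few lines of code. The sketch below counts how many ESTs are aligned to each unigene, relying only on the standard SAM layout (tab-separated fields, header lines starting with `@`, FLAG in column 2, RNAME in column 3, flag bit 0x4 marking an unmapped read). The function name and the unigene identifiers in the usage note are hypothetical, and the exact contents of the SAM file written by iAssembler are not assumed.

```python
from collections import Counter


def ests_per_unigene(sam_lines):
    """Count aligned ESTs per reference (unigene) from SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # column 2: bitwise FLAG
        rname = fields[2]       # column 3: reference sequence name
        if flag & 0x4 or rname == "*":
            continue  # unmapped read
        counts[rname] += 1
    return counts
```

Passing an open file handle works as well as a list of lines, e.g. `ests_per_unigene(open("alignments.sam"))`, where the file name is a placeholder.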
Evaluation of iAssembler
Table 1. Commands and parameters used for evaluating EST assembly programs
Program: Command and parameters
iAssembler: iAssembler.pl -i input_est -h 40 -e 30 -p 97 -d -o output ("-e 10" for Arabidopsis)
CAP3: cap3 input_est -o 40 -y 30 -p 97 -f 6 -s 251 ("-y 10" for Arabidopsis)
TGICL: tgicl input_est -l 40 -v 30 -p 97 ("-v 10" for Arabidopsis)
MIRA (olive): mira -project=project -fasta=input_est -job=denovo,est,normal,454 -notraceinfo -GE:not=1 454_SETTINGS -LR:wqf=no -AS:epoq=no:mrl=30 COMMON_SETTINGS -AS:nop=4 -SK:not=1:pr=97 -CL:pec=no 454_SETTINGS -AL:mo=40:mrs=97
MIRA (tomato and Arabidopsis): mira -project=project -fasta=input_est -job=denovo,est,normal,sanger -notraceinfo -GE:not=1 SANGER_SETTINGS -LR:wqf=no -AS:epoq=no:mrl=30 COMMON_SETTINGS -AS:nop=4 -SK:not=1:pr=97 -CL:pec=no SANGER_SETTINGS -AL:mo=40:mrs=97
Phrap: phrap input_est -ace
Newbler: runAssembly -cdna -urt -notrim -ml 40 -mi 97 -o output input_est
Table 2. Performances of assembly programs with tomato Sanger ESTs (minimum overlap: 40 bp, minimum overlap percent identity: 97%, maximum overhang: 30 bp). Reported for each program: average unigene length (bp); number of type I errors (identity < 97%; overhang > 30 bp); number of type II errors; total assembly errors; run time (minutes).
Table 3. Performances of assembly programs with olive Roche-454 ESTs (minimum overlap: 40 bp, minimum overlap percent identity: 97%, maximum overhang: 30 bp). Reported for each program: average unigene length (bp); number of type I errors (identity < 97%; overhang > 30 bp); number of type II errors; total assembly errors; run time (minutes).
We then tested the performance of these assemblers using another set of parameters: minimum overlap length of 50 bp, minimum overlap percent identity of 95%, and maximum overhang length of 20 bp. The results again indicated that iAssembler generated much higher-quality assemblies than the other assembly programs we investigated (Additional file 2).
Table 4. Performances of assembly programs with a curated Arabidopsis EST dataset (minimum overlap: 40 bp, minimum overlap percent identity: 97%, maximum overhang: 10 bp). Reported for each program: average unigene length (bp); number of unigenes perfectly aligned to Arabidopsis cDNAs; number of unigenes not perfectly aligned to Arabidopsis cDNAs; number of unigene pairs perfectly aligned to the same Arabidopsis cDNAs with ≥ 40 bp overlaps (type II errors); number of ESTs and corresponding unigenes not aligned to the same Arabidopsis cDNAs (type I errors).
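The type II error count used with the curated Arabidopsis dataset (unigene pairs aligned to the same cDNA with at least 40 bp of overlap) amounts to an interval-overlap test per cDNA. The sketch below shows one way to compute such a count. The input layout (unigene identifier, cDNA identifier, 1-based inclusive start and end on the cDNA), the function name, and the identifiers are assumptions for illustration and do not reproduce the evaluation scripts used in this study.

```python
from collections import defaultdict
from itertools import combinations


def count_overlapping_pairs(hits, min_overlap=40):
    """Count unigene pairs placed on the same cDNA with enough overlap.

    hits: (unigene_id, cdna_id, start, end), 1-based inclusive coordinates.
    """
    by_cdna = defaultdict(list)
    for unigene, cdna, start, end in hits:
        by_cdna[cdna].append((unigene, start, end))

    pairs = 0
    for placed in by_cdna.values():
        for (_, s1, e1), (_, s2, e2) in combinations(placed, 2):
            # Shared length of two inclusive intervals (negative if disjoint).
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap >= min_overlap:
                pairs += 1
    return pairs
```

Unigenes placed on different cDNAs are never compared, so only pairs that plausibly derive from the same transcript contribute to the count.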
In summary, our extensive evaluations of iAssembler and other EST assembly programs using different datasets and parameters show that iAssembler performs significantly better, generating far fewer assembly errors when assembling Sanger and/or Roche-454 ESTs.
As shown in Tables 2 and 3, the higher quality of assemblies achieved by iAssembler comes at the cost of longer run time. The most time-consuming steps of iAssembler are the initial assembly of EST sequences by MIRA and error detection by megablast. The run time of iAssembler can be significantly reduced by using the multi-threading support of the megablast and MIRA programs.
In this study, we describe a standalone package called iAssembler, which can perform de novo assembly of ESTs generated by traditional Sanger and/or next-generation Roche-454 massively parallel pyrosequencing technologies. Through the use of an iterative assembly strategy and automated error detection and correction, iAssembler can deliver much higher accuracy in EST assembly than the other existing EST assembly programs we investigated. Although iAssembler can only be executed under a command line interface, it is very easy to install and simple to use.
Availability and requirements
Project name: iAssembler
Project home page: http://bioinfo.bti.cornell.edu/tool/iAssembler
Operating system(s): Linux
Programming language: Perl
Other requirements: Bioperl version 1.006 or higher
Third-party tools: BLAST, CAP3 and MIRA. These tools are already integrated into the iAssembler package.
Any restrictions to use by non-academics: none
We would like to thank Qi Sun and Thomas Brutnell for critical review of the manuscript. This work was supported by National Science Foundation grants (IOS-0923312, IOS-1110080) and the United States-Israel Binational Agricultural Research and Development Fund (IS-3877-06) to ZF.
- Rudd S: Expressed sequence tags: alternative or complement to whole genome sequences? Trends Plant Sci 2003, 8: 321–329.
- Fei Z, Tang X, Alba RM, White JA, Ronning CM, Martin GB, Tanksley SD, Giovannoni JJ: Comprehensive EST analysis of tomato and comparative genomics of fruit ripening. Plant J 2004, 40: 47–59.
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, et al.: Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005, 437: 376–380.
- Barakat A, DiLoreto DS, Zhang Y, Smith C, Baier K, Powell WA, Wheeler N, Sederoff R, Carlson JE: Comparison of the transcriptomes of American chestnut (Castanea dentata) and Chinese chestnut (Castanea mollissima) in response to the chestnut blight infection. BMC Plant Biol 2009, 9: 51.
- Hahn DA, Ragland GJ, Shoemaker DD, Denlinger DL: Gene discovery using massively parallel pyrosequencing to develop ESTs for the flesh fly Sarcophaga crassipalpis. BMC Genomics 2009, 10: 234.
- Meyer E, Aglyamova GV, Wang S, Buchanan-Carter J, Abrego D, Colbourne JK, Willis BL, Matz MV: Sequencing and de novo analysis of a coral larval transcriptome using 454 GSFlx. BMC Genomics 2009, 10: 219.
- Bellin D, Ferrarini A, Chimento A, Kaiser O, Levenkova N, Bouffard P, Delledonne M: Combining next-generation pyrosequencing with microarray for large scale expression analysis in non-model species. BMC Genomics 2009, 10: 555.
- Sun C, Li Y, Wu Q, Luo H, Sun Y, Song J, Lui EM, Chen S: De novo sequencing and analysis of the American ginseng root transcriptome using a GS FLX Titanium platform to discover putative genes involved in ginsenoside biosynthesis. BMC Genomics 2010, 11: 262.
- Guo S, Zheng Y, Joung JG, Liu S, Zhang Z, Crasta OR, Sobral BW, Xu Y, Huang S, Fei Z: Transcriptome sequencing and comparative analysis of cucumber flowers with different sex types. BMC Genomics 2010, 11: 384.
- Huang X, Madan A: CAP3: A DNA sequence assembly program. Genome Res 1999, 9: 868–877.
- Chevreux B, Pfisterer T, Drescher B, Driesel AJ, Muller WEG, Wetter T, Suhai S: Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res 2004, 14: 1147–1159.
- Pertea G, Huang X, Liang F, Antonescu V, Sultana R, Karamycheva S, Lee Y, White J, Cheung F, Parvizi B, et al.: TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics 2003, 19: 651–652.
- Phrap assembly program [http://www.phrap.org/]
- Quackenbush J, Cho J, Lee D, Liang F, Holt I, Karamycheva S, Parvizi B, Pertea G, Sultana R, White J: The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Res 2001, 29: 159–164.
- Kumar S, Blaxter ML: Comparing de novo assemblers for 454 transcriptome data. BMC Genomics 2010, 11: 571.
- Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol 2000, 7: 203–214.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25: 2078–2079.
- seqclean program [http://seqclean.sourceforge.net/]
- Chou HH, Holmes MH: DNA sequence quality trimming and vector removal. Bioinformatics 2001, 17: 1093–1104.
- Robinson JT, Thorvaldsdottir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP: Integrative genomics viewer. Nat Biotechnol 2011, 29: 24–26.
- Milne I, Bayer M, Cardle L, Shaw P, Stephen G, Wright F, Marshall D: Tablet--next generation sequence assembly visualization. Bioinformatics 2010, 26: 401–402.
- Alagna F, D'Agostino N, Torchia L, Servili M, Rao R, Pietrella M, Giuliano G, Chiusano ML, Baldoni L, Perrotta G: Comparative 454 pyrosequencing of transcripts from two olive genotypes during fruit development. BMC Genomics 2009, 10: 399.
- NCBI dbEST database [http://www.ncbi.nlm.nih.gov/dbEST/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.