ESTclean: a cleaning tool for next-gen transcriptome shotgun sequencing
© Tae et al.; licensee BioMed Central Ltd. 2012
Received: 9 July 2012
Accepted: 22 September 2012
Published: 26 September 2012
With the advent of next-generation sequencing (NGS) technologies, full cDNA shotgun sequencing has become a major approach in the study of transcriptomes, and several different protocols in 454 sequencing have been invented. As each protocol uses its own short DNA tags or adapters attached to the ends of cDNA fragments for labeling or sequencing, different contaminants may lead to mis-assembly and inaccurate sequence products.
We have designed and implemented a new program for raw sequence cleaning that runs either through a graphical user interface or as a batch script. The cleaning process consists of several modules including barcode trimming, sequencing adapter trimming, amplification primer trimming, poly-A tail trimming, vector screening and low quality region trimming. These modules can be combined to suit various sequencing applications.
ESTclean is a software package not only for cleaning cDNA sequences, but also for helping to develop sequencing protocols by providing summary tables and figures for sequencing quality control in a graphical user interface. It is particularly effective at cleaning reads from complicated sequencing protocols that use barcodes and multiple amplification primers.
Full cDNA shotgun sequencing is a major approach to characterizing whole transcriptomes and measuring gene expression. With the advent of next-generation sequencing (NGS) technologies such as 454 (Roche) and Solexa (Illumina), NGS has become popular in the study of transcriptomes, especially in non-model organisms, because of its cost efficiency compared to Sanger sequencing. In addition, several protocols have been developed to apply NGS technologies, and each protocol uses its own short DNA tags or adapters attached to the ends of DNA fragments for labeling or sequencing. Since NGS technologies eliminate bacterial cloning, library preparation is fast and cheap and free of vector contamination. However, a simple protocol for 454 transcriptome sequencing can produce artifact sequences, e.g., concatenated amplification primers. This problem can be overcome by using several amplification steps, each of which uses different primers.
In transcriptome sequencing projects, the quality of the initial data greatly affects downstream analyses, and removing contamination has become one of the most important steps. Several software tools are available for this purpose, including VecScreen, Lucy, Cross_match, SeqClean, Figaro, and SeqTrim. Although these programs have been used in many sequencing projects, most of them are not suited to detecting the diverse contamination produced by NGS-based protocols, especially those using two or more PCR amplification primers. None of them supports new sequencing features such as barcodes or MIDs (Multiplex Identifiers), which are used to pool different samples. Many biologists also have difficulty using these programs because of complicated parameters, environment-specific operations and command-line execution.
In this paper, we present a new program named ESTclean to clean raw sequences with seven modules that perform end sequence trimming, barcode trimming, sequencing adapter trimming, amplification primer trimming, poly-A tail trimming, vector screening and low quality region trimming. These modules can be combined based on various sequencing applications, e.g., parallel tagged sequencing. ESTclean provides a GUI with a user-friendly environment to manage sequencing protocols and analysis pipelines. It also produces various summary tables and figures to aid quality control by showing trimming statistics for each module; identifying problematic reads with primer concatenation, wrongly oriented primers, and no barcodes; and assessing sequencing biases.
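As a rough illustration of how independent trimming modules can be combined, the sketch below chains two of them as plain functions. It is not ESTclean's implementation: the function names, the poly-A length threshold, and the quality cutoff are assumptions made only for this example.

```python
import re

def trim_poly_a(seq, qual, min_len=8):
    """Remove a 3' run of at least min_len A's (min_len is an assumed threshold)."""
    match = re.search(r"A{%d,}$" % min_len, seq)
    if match:
        return seq[:match.start()], qual[:match.start()]
    return seq, qual

def trim_low_quality_3prime(seq, qual, min_q=20):
    """Drop 3' bases whose quality score is below min_q (assumed cutoff)."""
    end = len(seq)
    while end > 0 and qual[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]

def run_pipeline(seq, qual, modules):
    """Apply each trimming module in order; every module maps (seq, qual) -> (seq, qual)."""
    for module in modules:
        seq, qual = module(seq, qual)
    return seq, qual

# Hypothetical read: 8 transcript bases, a 10-base poly-A tail, 2 low-quality bases.
seq = "ACGTACGT" + "A" * 10 + "CC"
qual = [30] * 18 + [5, 5]
clean_seq, clean_qual = run_pipeline(seq, qual, [trim_low_quality_3prime, trim_poly_a])
print(clean_seq)
```

In this toy setup the order of modules matters: if poly-A trimming runs before the low-quality bases are removed, the tail is no longer at the 3' end and is left in place.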
The most common sources of contamination in NGS-based ESTs are barcodes, sequencing adapters, and amplification primers. Barcodes, or MIDs (Multiplex Identifiers), are short DNA tags attached to the 5’ ends of reads in order to distinguish pooled samples. Sequencing adapters are attached to both ends of DNA fragments for cloning and sequencing. Although the 454 data processing software is supposed to trim sequencing adapters, 3’ sequencing adapters often remain, depending on the software version and fragment size. Amplification primers are attached to both ends of cDNAs to prepare cDNA libraries before fragmentation. These primers are often concatenated to each other in badly designed sequencing protocols.
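To make the role of 5’ barcodes concrete, here is a minimal sketch of assigning a read to a sample and removing its barcode. The barcode sequences, sample names, and the one-mismatch tolerance are hypothetical, and ESTclean's own matching procedure may differ.

```python
def trim_barcode(read, barcodes, max_mismatches=1):
    """Return (sample, trimmed_read), or (None, read) if no barcode matches the 5' end.

    barcodes maps a sample name to its barcode sequence; the best match
    (fewest mismatches within max_mismatches) wins.
    """
    best = None  # (mismatches, sample, barcode_length)
    for sample, barcode in barcodes.items():
        prefix = read[:len(barcode)]
        if len(prefix) < len(barcode):
            continue  # read is shorter than the barcode
        mismatches = sum(a != b for a, b in zip(prefix, barcode))
        if mismatches <= max_mismatches and (best is None or mismatches < best[0]):
            best = (mismatches, sample, len(barcode))
    if best is None:
        return None, read
    return best[1], read[best[2]:]

# Hypothetical barcodes, chosen only for illustration.
barcodes = {"sample_A": "AAGGCC", "sample_B": "TTGGAA"}
print(trim_barcode("AAGGCCTTTACG", barcodes))  # ('sample_A', 'TTTACG')
print(trim_barcode("CCCCCCCC", barcodes))      # (None, 'CCCCCCCC')
```

Reads returned with `None` correspond to the "no barcode" category of problematic reads and could be collected separately for inspection.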
Although NGS-based cDNA sequencing does not use vectors for amplification, ESTclean has a module to screen for known vectors using VecScreen. ESTclean also has a module that modifies SFF files to set a clean region for each read, provided that users have the SFF tools. Read sequences discarded at any step can be collected, saved as a FASTA file, and analyzed using BLAST against a user-provided sequence database.
Results and discussion
However, apparent over-trimming may in fact be correct trimming that only looks excessive because ESTclean works without reference sequences. What happens if the genomic bases adjacent to a read are the same as the first bases of a sequencing adapter, amplification primer, or poly-A tail? For example, suppose a read ACGTcaat comes from the genomic sequence ACGTCGGA, and the lowercase bases in the read are an amplification primer. ESTclean should then remove caat. However, GMAP can align the raw read through base c, so perfect cleaning of caat is evaluated as over-trimming by 1 bp. We extended this observation to all over-trimmed reads, excluding those trimmed because of low quality scores. Additional file 4 shows the subsequences over-trimmed by ESTclean at the 5’ and 3’ ends. Most of those sequences are parts of sequencing adapters and amplification primers, and especially of poly-A tails. To confirm this, we extracted trimmed subsequences of 6 bases that include an over-trimmed region and examined these 6-mers. Indeed, almost all are parts of sequencing adapters and poly-A tails: 18,759 (100%) and 68,999 (92%) of the reads over-trimmed at the 5’ and 3’ ends, respectively (Additional file 5).
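The ACGTcaat example can be reproduced with a small sketch. It assumes a simplified stand-in for the GMAP alignment, namely an ungapped, exact prefix match of the read against the reference starting at the same position; the function names are hypothetical.

```python
def aligned_prefix_length(read, reference):
    """Number of leading read bases that match the reference exactly (case-insensitive).

    A simplified stand-in for a real aligner such as GMAP.
    """
    length = 0
    for read_base, ref_base in zip(read.upper(), reference.upper()):
        if read_base != ref_base:
            break
        length += 1
    return length

def trimming_error(read, clean_end, reference):
    """Aligned length minus clean length at the 3' end.

    Positive values are scored as over-trimming, negative values as
    under-trimming, and zero means the clean region matches the alignment.
    """
    return aligned_prefix_length(read, reference) - clean_end

# Read ACGTcaat: the cleaner keeps the first 4 bases and removes the primer 'caat'.
print(trimming_error("ACGTcaat", 4, "ACGTCGGA"))  # 1
```

The first primer base c happens to match the next genomic base C, so the alignment extends one base beyond the true transcript sequence and the correct trim is scored as over-trimming by 1 bp.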
Incomplete cleaning of EST sequences leads to incorrect downstream analyses, such as mis-assembly and inaccurate biological interpretation, so sequence cleaning has become one of the important tasks in transcriptome sequencing. ESTclean has been developed to remove the different kinds of contaminants from raw sequences. It provides not only trimming and screening modules, but also useful and user-friendly features including project management and quality control of sequencing protocols and raw sequences. It can also generate a script that executes the trimming modules in a command-line environment, in order to support automated sequence assembly pipelines. We compared the performance of ESTclean with that of SeqClean on a real sequencing run of Drosophila melanogaster. ESTclean outperformed SeqClean in terms of the numbers of under-trimmed reads and bases. Although ESTclean produced more over-trimmed reads in this experiment, these resulted from correct trimming performed without knowledge of reference sequences.
Availability and requirements
Project Name: ESTclean
Project home page: http://sourceforge.net/projects/estclean/
Operating system(s): Platform independent
Programming language: Perl (v5.0 or later), Java (v1.5.0 or later)
Other requirements: BLAST (v2.2.9 or later) (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST)
License: GNU GPL
Any restrictions to use by non-academic users: license needed
We would like to give special thanks to H. Tang, J. K. Colbourne, J. Carter, Z. Lai, K. Mockaitis, and Z. Smith at the Center for Genomics and Bioinformatics, Indiana University for valuable comments. This work was supported in part by the National Institutes of Health [CA134304] and the National Research Foundation of Korea Grant funded by the Korean Government [NRF-2009-352-D00275].
- Schuster SC: Next-generation sequencing transforms today’s biology. Nat Meth 2008, 5: 16–18. doi:10.1038/nmeth1156
- Meyer E, Aglyamova G, Wang S, Buchanan-Carter J, Abrego D, Colbourne J, Willis B, Matz M: Sequencing and de novo analysis of a coral larval transcriptome using 454 GSFlx. BMC Genomics 2009, 10: 219. doi:10.1186/1471-2164-10-219
- VecScreen http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html
- Chou HH, Holmes MH: DNA sequence quality trimming and vector removal. Bioinformatics 2001, 17(12):1093–1104. doi:10.1093/bioinformatics/17.12.1093
- Cross_match http://www.phrap.org/phredphrapconsed.html
- SeqClean https://sourceforge.net/projects/seqclean/
- White JR, Roberts M, Yorke JA, Pop M: Figaro: a novel statistical method for vector sequence removal. Bioinformatics 2008, 24(4):462–467. doi:10.1093/bioinformatics/btm632
- Falgueras J, Lara A, Fernandez-Pozo N, Canton F, Perez-Trabado G, Claros MG: SeqTrim: a high-throughput pipeline for pre-processing any type of sequence read. BMC Bioinformatics 2010, 11: 38. http://www.biomedcentral.com/1471-2105/11/38 doi:10.1186/1471-2105-11-38
- Parallel Tagged Sequencing https://bioinf.eva.mpg.de/pts/
- Altschul S, Gish W, Miller W, Myers E, Lipman D: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410.
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino-acid sequence of two proteins. J Mol Biol 1970, 48: 443–453. doi:10.1016/0022-2836(70)90057-4
- Wu TD, Watanabe CK: GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 2005, 21(9):1859–1875. doi:10.1093/bioinformatics/bti310
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.