Next generation transcriptomes for next generation genomes using est2assembly
© Papanicolaou et al; licensee BioMed Central Ltd. 2009
Received: 3 August 2009
Accepted: 24 December 2009
Published: 24 December 2009
The decreasing costs of capillary-based Sanger sequencing and next generation technologies, such as 454 pyrosequencing, have prompted an explosion of transcriptome projects in non-model species, where even shallow sequencing of transcriptomes can now be used to examine a range of research questions. This rapid growth in data has outstripped the ability of researchers working on non-model species to analyze and mine transcriptome data efficiently.
Here we present a semi-automated platform 'est2assembly' that processes raw sequence data from Sanger or 454 sequencing into a hybrid de-novo assembly, annotates it and produces GMOD compatible output, including a SeqFeature database suitable for GBrowse. Users are able to parameterize assembler variables, judge assembly quality and determine the optimal assembly for their specific needs. We used est2assembly to process Drosophila and Bicyclus public Sanger EST data and then compared them to published 454 data as well as eight new insect transcriptome collections.
Analysis of such a wide variety of data allows us to understand how these new technologies can assist EST project design. We determine that assembler parameterization is as essential as standardized methods to judge the output of ESTs projects. Further, even shallow sequencing using 454 produces sufficient data to be of wide use to the community. est2assembly is an important tool to assist manual curation for gene models, an important resource in their own right but especially for species which are due to acquire a genome project using Next Generation Sequencing.
Much of the recent progress in our understanding of genomics has come from the study of model genetic organisms such as the fruit fly Drosophila melanogaster, the nematode worm Caenorhabditis elegans and the zebrafish Danio rerio. In these model species, a full genome sequence combined with a well annotated collection of gene models is currently available. To address fundamental questions in evolutionary biology, however, we need to expand the genomic and transcriptomic resources available for non-model species, notably insects . Non-model insects represent attractive models for the study of a range of important biological questions both of applied and fundamental importance, such as those relating to studying insecticide resistance , wing pattern development  or co-evolution . From a functional biologist's point of view, crucial experimental tractability can be gained via a combination of rich sequence data (including Expressed Sequence Tag - EST - collections from several tissues), gene models, functional annotation and in-depth knowledge of an organism's genetics, preferably coupled with the ability to manipulate them [5, 6].
Since the advent of EST sequencing, the number of organisms represented in dbEST (which houses all public Sanger-sequenced EST projects)  has exploded. Transcriptomics has grown to be at least as an important resource to non-model species communities as it has proven for the traditional models. It is now conceivable that many non-model species will have at least a workable outline of their genomes available. In turn, large collections of ESTs important in genome annotation will be generated as the community identifies -omics as a major resource for species with an evolutionary importance [8–10]. This scenario is already a reality as the related technologies become more cost-effective. Such non-model species transcriptome projects, being focused on particular applications and biological questions, are especially useful to the wider community even if not targeting annotation of any specific genome [11–13]. One of the most important benefits of shallow EST sequencing is the ability to acquire candidate sequence data for downstream applications such as phylogenetics, multi-locus population genetics and expression studies . It is hoped that with a wide application of Next Generation Sequencing (NGS) the bottleneck in obtaining sequence data will no longer exist . The reality is, however, that this vast amount of sequence data has outstripped the ability of most researchers working on non-model species, who often have limited bioinformatic support, to analyze and mine their new datasets. Further, there is currently no standardized platform for providing researchers with such analyses for transcriptomes. The Generic Model Organism Database (GMOD) group, derived mainly from the model species communities, is a collection of software, platforms and standardized approaches in dealing with -omic data. They are best known for the GBrowse project, the sequence viewer used by WormBase, FlyBase and others . One of the most important contributions of the GMOD group, however, is the development and dissemination of standards capable of generic use: extensive use of BioPerl , database connectivity frameworks and Chado, the generic database schema for almost any type of data produced by the community . Such standards allow tools to be capable of a unique level of interconnectivity and interoperability.
Here we describe a software suite, est2assembly, which aims to address the above bioinformatic deficit while being embedded in the GMOD framework. It accepts raw sequence data (from 454 or Sanger technologies) and produces annotated assemblies in a GMOD/Chado-compatible format with minimum user input. Further, using the common file format of GFF (which stands for General Feature Format), users can share their data with collaborators and visualize them with tools such as GBrowse. The platform is highly automated and standardized, and we show it allows for direct comparison between various datasets. The modular nature of est2assembly allows users to independently make use of different subroutines. Extensive log files guide the user through the assembly process and the output. We demonstrate this platform using a range of 454 data from a phylogenetically diverse sample of insects. We benchmark the platform and compare these non-model species collections with 745,124 public EST data from D. melanogaster collected via conventional capillary sequencing which is still considered the gold standard in insect EST data and Bicyclus anynana, the butterfly with the highest number of Sanger-sequenced ESTs in GenBank.
All software on which the platform depends is free and installation requirements are straightforward. In brief, the platform makes extensive use of BioPerl (1.6+) and other open-source Perl modules available from CPAN. The GPL-licensed MIRA assembler is required . The proprietary Newbler2 - including the associated SFF toolkit - (454 Life Sciences) is optional. For 454 next generation sequencing files we use sff_extract  as this is the only known 454 basecalling software which is not under a restrictive license. In this paper, we made use of the complete platform and utilized both Newbler2 and MIRA version 2.9.37. Further, some modules have dependencies such as installation of the Chado database schema , EMBOSS , NCBI-BLAST , SSAHA2 , RepeatMasker  (with an recommended registration to RepBase ), prot4EST , FASTY 3.4  and annot8r . Due to the modular nature of the platform, a researcher needs to install only the components which will be of use their application of the platform. We provide a comprehensive installation script to ease the procedure of installation of the above 3rd party software.
The est_process module is driven by preprocessest.pl which accomplishes the following steps: i) project creation, including calling the bases or reading the SSF files; ii) masking short sequences (e.g. adaptors) using SSAHA2; iii) BLAST2 to detect unwanted sequences (as defined by the user) which can cause problems in the assembly (e.g. mitochondrial, rDNA and contaminants); iv) removal or masking using these BLAST output files; v) Repeat masking performed using RepeatMasker and a user provided database (which can be extracted from RepBase and customized); vi) polyA/T screening performed using a tiered approach of a custom algorithm. Then a final step cleans the output files and prepares assembly specific input files including an XML NCBI TraceInfo file. A user may interrupt the program and resume from any of the above steps.
The conversion of raw trace files to input files for an assembly is performed independently for each technology. Users have the option to convert and quality trim Sanger-derived sequences using the gold standard of phred or in the case of data derived from ABI sequencers data, make use of any internal KB basecalling. In the latter case, we use a custom quality trim subroutine provided by Steffi Gebauer-Jung (MPI for Chemical Ecology) involving two sliding windows to avoid local optima: a larger one scans the trace for a sudden drop in quality values and a finer search pinpoints the exact location. For 454 sequencing, due to licensing prohibiting the use of Roche's proprietary flowgram extracting software by ordinary researchers, SFF files are extracted using sff_extract, an open-source alternative. In addition to detection of low quality regions, we identify adaptor sequence introduced either in the making of the cDNA library or subsequent pre-sequencing steps. We use a two tiered search using SSAHA2, which combines the SSAHA and the cross_match algorithms  and BLAST2 from NCBI. We found that the SSAHA2 search itself is best utilized by using three iterations: two searches for adaptor sequence (with one prior- and one post-polyA/T masking) and one restriction site search with a parameterization for extremely short target sequences (restriction sites tend to be ca 7-10 bp). The platform uses BLAST2 to screen for common contaminants found in molecular biology labs via a customizable FASTA database. In any transcriptome project, it is also undesirable to clutter the assembler with sequences that are overrepresented due to transposon activity, mitochondrial or rDNA origin. The platform allows users to screen them using RepeatMasker via a user-defined database. To assist Lepidopterists in particular, we have included a prediction of repetitive elements from Bombyx mori (C. Smith, University of San Francisco, pers. communication) which users can concatenate with the Insect repeat library from RepBase. The intensive steps of running BLAST and SSAHA2 can seamlessly utilize multiple threads to reduce run time.
Trimming of polyA/T tails is essential in EST projects and we use a routine to iteratively scan for such homo-oligomers. The routine can also be used independently of the platform and is highly customizable; users can specify which base (A, T, C or G) they wish to search for, seed length, min/max length of homo-oligomer, depth of search of each sequence and other options. In order to minimize false positives in A/T rich genomes or errors produced by the pyrosequencing methodology of 454, the platform utilizes 5 rounds with increasing minimum length and decreasing search seed length. Except for the second round pair, the scan is performed only at the ends of the sequence for a length of one-third of the total sequence length. The second round scans deeper to 350 bp from each end. An additional feature of the routine is the use of any suspected polyadenylation signal site (PAS) upstream of a hypothesized polyA/T site. Currently, we use a simple pattern search but implementing a model-based approach  could be of use. This assists in correcting the masking and avoiding over-trimming of the 3' UTR or for allowing a short polyA sequence which would normally be below acceptable length to be masked. In addition, it uses this information to detect false positives if there is a significant amount of sequence (>50 bp) between the end of the polyA and the start of any vector masked sequence (or the end of the sequence). If no PAS site is found upstream of the polyA tail (in a 50 bp window) and the suspected polyA is shorter than the specified cut-off, then the polyA is not masked but still tagged in the log-files. We found that this option enhances polyA masking in Sanger-derived sequences but by default is switched off for short reads. An output for each polyA/T found and which criteria were used is produced in a log file should users wish to exploit them in gene model construction. At this stage, an XML file (using the NCBI TraceArchive template) containing the low quality, adaptor sequence and polyA/T trim points is generated. This file, when used with the original untrimmed and unmasked file, can guide assemblers such as MIRA on how to perform clipping of undesired regions. This approach can be used to tackle any potential false positives that may arise in the preprocessing steps.
It is worth noting here that we find that near-perfect signatures of the 454 adaptor sequences can persist even within regions of high quality assembly, which could be the result of the chimeric ligation of molecules. In such a scenario, we recommend that if manual inspection is not feasible that the sequence region is masked and the assembler allowed to determine if the region is truly an adaptor sequence or part of the sequenced species genome: false positives will have multiple reads in the assembly exhibiting high identity down- and up-stream of the suspected site and therefore still assemble. To facilitate assemblers who cannot make such judgements, preprocessest.pl allows the flagging of sequences which have more adaptor sequences than the user defines. If the user wishes then only the longest stretch of high quality sequence between two adaptors is used.
The output files of est_process are provided to the second module, parameterize_assembly. It could be straightforward to plug-in various assemblers in future versions but we currently make use of Newbler as it is the standard for 454 data and MIRA due to its ability to analyze Sanger/Next Generation data concurrently and the provision of excellent support. In the current version, datasets from Sanger and/or 454 can be provided and users will concatenate any datasets originating from identical technologies. A configuration file is responsible for defining which parameters are passed on to the assembler MIRA. This allows for multiple runs of the same datasets in order to explore the parameter space.
Assembly quality is estimated using analyze_blast.pl via summary indexes based on the coverage of one or more reference organism databases used in a similarity search (e.g. NCBI-BLAST). Coverage is calculated on a base pair basis by counting unique hits to a particular base pair. Overlapping coverage (i.e. redundancy) is calculated by counting the total number of hits a base pair (or amino acid; the platform uses the term position) receives. We perform these calculations for both the assembly and the database. The ratio of redundancy over coverage is summed for the database and the assembly. If more than one reference database is used (which is recommended in order to discount any organism-specific effects) then the total sum is used. One has to be aware that the absolute numbers of coverage are volatile as they are dependent on the BLAST cut-offs used. When the same cutoffs are used, however, comparison of assemblies is a meaningful index to evaluate how parameterization influences the assembly. Another quality control index is the number of reads included in the assembly. This can act as a proxy for the downstream utility of an assembly, for example the number of SNPs which can be determined or the likelihood we can detect alternative splicing or frame-shift-causing sequencing errors. Finally, we also consider the proportion of the reference database covered (or the average if more than one is used) as a proxy to eventual gene finding.
In EST assemblies a large portion of the consensus can be non-coding sequence. Such sequences often diverge rapidly due to the lack of any selective constraints. The unfortunate result in any assembly process is that sequences of this type, which are from identical genomic regions, fail to assemble together. The script trim_assembly.pl is one of the two methods to remove such redundant contigs. It first scans for polyA/T tails which may have been built during the assembly. Then it defines a set of 'high-quality' contigs using user-specified cut-off for length and number of reads included. The other contigs are only included in the final set if they a) don't have a high sequence similarity to a high-quality contig, b) have a high sequence similarity to a reference proteome, c) specifically requested by the user by proving a list file. The output of the contigs which have been excluded is cataloged in a log file.
Protein identification, SNP discovery and data mining
Data mining of EST projects is driven by searching the dataset for the signature of a favorite protein or sequence. The platform makes use of BLAST similarity searches using the contig consensus. The analyse_assembly.pl and analyze_blast.pl scripts, which are employed during parameterization, can be used standalone to estimate the quality indexes for any BLAST report. They allow users to identify the exact coverage of a FASTA input file in relation to a reference database using BLASTx, BLASTn or tBLASTx. They provide the ability to run multiple blasts in a threaded fashion with one command. The script has the additional ability to output a FASTA file with the part of the input file which matches the reference and a second file with the part which did not. This approach is very useful in extracting the segment of the assembly which is known to be coding. For example, a tBLASTx approach is useful when multiple species have been sequenced but no reference proteome exists or is under-annotated (see Results section).
Deeper annotation can be performed using predicted proteins. Protein predictions are accomplished using prot4EST  (included in the distribution). The current version of prot4EST does not produce the Open Reading Frame (ORF) which can differ substantially from the contig (one of the utilities of prot4EST is that it corrects for frameshifts). We acquire the ORF using FASTY or failing that, EMBOSS's transambig and attempt to correct for any ambiguous codons to match the consensus. These are then annotated using similarity to known proteins which may have annotated ontology terms: Gene Ontology (GO) , Enzyme Commission (EC)  and Kyoto Encyclopaedia of Genes and Genomes (KEGG) . Annotations based on electronic similarity are assigned using annot8r. If computational resources allow, InterProScan annotations can be included to allow users to search for specific protein domains, motifs and sequence signals. The output of the above procedures is then converted to a common file format using ic_annot8r2gff.pl and can be databased.
Further, we provide the script analyse_assembly.pl to help with BLAST reports on a single computer but if EST projects are commonplace then access to a PC-farm or a high performance computing system is required. For this reason, we include a set of simple scripts to facilitate use and error-checking of LSF queue submissions allowing, therefore, for the automation of BLAST and InterProScan annotations.
Of particular interest to biologists is the identification of Single Nucleotide Polymorphism markers (SNPs). The assembler MIRA produces such a set and can be included in the assembly GFF file. We also extract a 'high-quality' SNP dataset by including those SNPs which have the minor allele frequency above a user-specified threshold and have at least 20 invariable bases up and downstream of the SNP position. This padding can be customized by the user but we default to 20 in order to design primers and create a unique identification sequence for submission to dbSNP . This SNP identification is accomplished via ic_create_snps.pl which estimates the position of each SNP in relation to the ORF and, if determined as coding, provide the codon position, the alleles and whether the nucleotide change causes a synonymous or non-synonymous amino acid change. Due to prot4EST's approach to determine the ORF, a simple translation of co-ordinates is not possible. We, therefore, perform a local alignment using FASTY . We also include a similar implementation for SEAN  which would be of use to users with small amounts of data.
Data for benchmarking the platform was provided by dbEST (for D. melanogaster and B. anynana) or by collaborators. In more detail, the GS20 sequencing of a Manduca sexta (tobacco hornworm moth) hemocyte cDNA library was provided by Haobo Jiang and is published by Zou et al. The Chrysomela tremulae (a beetle) midgut GSLFX and Manduca sexta midgut are published by Pauchet et al and  respectively; the GSFLX of cDNA from whole larvae of Euphydryas aurinia (marsh fritillary butterfly) was provided by Yannick Pauchet (University of Exeter); the GSFLX-Titanium sequencing of cDNA from wing discs of Papilio dardanus (African swallowtail butterfly) by Iva Fuková (University of Exeter); the Sanger capillary and GSFLX data of Heliconius melpomene were prepared from wing-discs of developmental stages from late larval through to mid-pupal stage by Ronald Lee and Chris D. Jiggins (University of Cambridge) and is published by Ferguson et al. The Sanger capillary and GSFLX data of Heliconius erato was also generated from wing-discs and was provided by W. Owen McMillan (North Carolina State University). The GS20 Melitaea cinxia dataset from whole larvae is published by Vera et al and was downloaded from NCBI's Short Read Archive after communication with Howard Fescemyer (Pennsylvania State University). The Bicyclus anynana data are based on the capillary-sequencing technology of a variety of tissues, were obtained from dbEST and is published by Beldade et al. The D. melanogaster data are also based on capillary-sequencing technology of a variety of tissues and were retrieved from dbEST. We did not include any singletons in the resulting assemblies. For the saturation curves, we used the H. melpomene 454 preprocessed dataset (without any Sanger sequences) and created pseudo-datasets by randomly splitting it in datasets containing 1/5, 2/5, 3/5 and 4/5 of the initial data. We repeated the procedure 5 times for each pseudo-dataset thus generating and annotating 20 pseudo-assemblies using the same procedures as for the main assembly.
Reference proteomes used in this study were from D. melanogaster, Anopheles gambiae, Apis mellifera, B. mori and Tribolium castaneum. For each species we used the RefSeq  and UniProt  curated proteins, concatenated with the predictions provided by each organism's Genome Database [44–50] and made non-redundant at the 100% level using cd-hit . In the cross-dataset comparison, we identify ORF coverage by calculating the proportion of the reference proteome aligned to the assembly (e-value <= 1e-5; bit-score >= 80 bits) as an indication to gene-finding. We also estimate the proportion of the assembly aligning to the proteome (e-value <= 1e-5; bit-score >= 80 bits) to determine the portion of the assembly likely to be coding and also include the improvement of including a tBLASTx search (e-value <= 1e-15; bit-score >= 80 bits).
Implementation overview for the biologist
The est2assembly platform allows for the processing of data either directly from DNA sequencers (pyrosequencing or Sanger based) or as FASTA files. The software also allows users to combine their own datasets with those available in public databases and a script is included to automatically download such sequences from the European Bioinformatics Institute (EBI). Moreover, any large sequenced genomic regions (such as from Bacterial Artificial Chromosomes; BACs) can contribute CoDing Sequences (CDSs) to the assembler. This form of dataset concatenation has the advantage that a pool of shorter NGS reads will assemble better if a longer sequence (such a full-length mRNA sequence) is also included in the same pool. Currently, two sequencing technologies can be processed: Sanger capillary-based data and 454 pyrosequencing data. The input data is fed into the preprocessest.pl which removes low quality sequences, any adaptors and polyA/Ts that may be present. This script also removes any contaminants and repeats which can cause serious misalignments. Two versions of the processed FASTA files are produced: trimmed and 'masked', the latter accompanied by a quality file and an NCBI Traceinfo XML file which defines trim points in relation to the original files. Researchers can choose to use the untrimmed but masked files or the original sequences (with the XML file which contains the exact regions of high quality sequence) or the trimmed files for assemblers such as Newbler. Further, at this point in the pipeline, graphs can also be generated (using two scripts provided in the distribution) to examine the effect of the pre-processing on the data.
Once a user is satisfied with the quality control of the reads, the assembly can begin. An optimal assembly needs to be chosen according to the needs of a project but often the assembler is used as a black box despite the fact that assemblers are mere computation machines and therefore the results may or may not be biologically meaningful. For this reason, the next step is submitting these files to a second script, parameterize_assembly, which launches the assemblers (currently Newbler and MIRA are supported) with varying parameters, compares the results to one or more reference proteomes and computes a number of indexes suitable for transcriptome projects. Which index is most useful (i.e. the optimality criteria) depends on the aim of the particular project. For example, in a gene-hunting project one may wish to optimize for the number of genes discovered and minimize redundant contigs, where as in SNP project, one may wish to maximize for the number of reads in the assembly and tolerate redundancy which can later be addressed manually for contigs of interest. For the reference proteomes, we suggest that more than one is provided in order to remove species-specific bias and increase the power of detecting coding sequences but, for computational reasons, too divergent species will not be useful for parameterization. In our work and this paper, we used species with a genome sequence which are in the same Phylum as our data datasets. As parameterization is focused on exploring how different parameters behave, the exact details of the reference proteome and BLAST cut-off values are not as important since the platform ensures they are used in a consistent fashion.
Simple BLAST-driven indexes are reported for each proteome: the number of queries which have similarity with a reference protein; the number of reference proteins which have similarity to a contig in the assembly. It then calculates the indexes for each base pair/amino acid rather in order to detect the level of partial ORF sequencing. Further, an annotation redundancy index is based on the number of one-to-many hits (summed in both directions) between the reference proteomes and the assembly. Finally, est2assembly reports the number of reads included in the assembly. Which criterion is chosen to identify the best assembly is left to the individual researchers. In this paper, since we are focused on maximizing the 'number of genes found' - like many non-model species projects - we used the reference proteins found as the main criterion.
Once the desired assembly is chosen, users may opt for the removal of a subset of redundant contigs using trim_assembly.pl and evaluate if there is a loss of 'number of genes found'. By redundancy we mean here the fraction of contigs that are likely to originate from the same locus - as defined by the degree of similarity. Often, an assembler fails to align them due to a high number of mismatches caused by a non-conserved region (such as in non-coding regions) accumulating mutations, an alternative exon or - in libraries constructed from outbred individuals- multiple SNPs. The aim of a reduction of redundancy is to reduce the strain of computing resources on the subsequent annotation steps. Researchers will, therefore, annotate a considerably smaller dataset. We support GFF conversion for a number of publicly available tools that we consider to be of most use in transcriptome project and we use routinely: the CAF format (the new standardized file format for assemblies); prot4EST (ORF prediction); BLAST (similarity annotation); annot8r (Gene Ontology, Enzyme Commission and KEGG pathway term assignment according to similarity to known proteins) and InterProScan (protein domain identification).
As our interest is primarily in ensuring that data produced by multiple researchers can be integrated, we decided to utilize the community tools of BioPerl and GMOD. The bioinformatics community has been converging on a set of standard formats. One such flat-file format for sequence annotation is the General Feature Format (GFF) specification which is currently being standardized as version 3. The GFF format is a tab and semicolon delimited file making it both machine- and spreadsheet- readable and has become the format of choice for the GMOD software group. From a database perspective, there are two additional important formats: BioPerl's Bio::DB::SeqFeature and Chado. The former is a highly denormalized database schema which allows for rapid queries of sequence data by sacrificing control of data integrity. Chado, on the other hand, is a normalized modular database schema created to serve as the main data warehouse of multiple types of data. It is logical therefore for researchers to utilize Chado as a data warehouse and Bio::DB::SeqFeature for driving user-visualization software such as GBrowse. Data can be loaded into a database via BioPerl and we provide a script to load multiple GFFs in the correct order and allow for later additions. The Chado schema requires a PostgreSQL database and we find that the SeqFeature database works well with PostgreSQL as well. Once the database is loaded, one can use to drive popular tools such as GBrowse, Apollo and Artemis (Figure 1B) in order to curate the project. Transcriptome project curation requires the ability to join contigs and we find that the user-friendliness of the proprietary program Geneious (Biomatters Ltd, Auckland New Zealand) is efficient for the purpose and a free version is available. A future version of this platform may make use of Geneious' interactivity interface (API). Due to the popularity of GBrowse as a sequence viewer, we provide configuration files that can be readily customized. When the complete analysis is loaded, researchers and their collaborators can view any annotated contig in three inter-linked web pages: assembled contig, predicted ORF and protein (Figure 1C).
As est2assembly is unique in the field as it is not one pipeline but a complete framework to analyze transcriptomes from raw data to a format wet-lab biologists can analyze. The preprocessing step has been built to take advantage of NCBI's TraceXML in order to annotate vector and clipping positions. This allows us to use the original (unmasked) data to produce the assembly using MIRA which can make intelligent decisions if a clipped region is a false positive. The assembler parameterization allows bioinformaticians to seamlessly explore the parameter space as Figures 2 and 3 mentioned above (Implementation overview for the Biologist). Had we used the default settings, we would have an assembly that had a lower number of identified reference proteins (data not shown), a lower number of reads used but similar degree of the redundancy index (Figure 2). Likewise, had we opted for the Newbler assembler without investigating the MIRA assembly, we would have both a lower number of proteins identified and shorter CDSs (Figure 3).
In addition, as we mention above and show below, current NGS assemblers produce non-redundant contig sets. This is a correct procedure in order to avoid assembling close paralogs (as joining contigs is easier that splitting them) but results in a downstream computational problem with annotating a redundant set of objects. Our trim_assembly approach is highly customizable using concepts intuitive to biologists and produces better clustering than the standard cd-hit-est we used to use: for one of the more redundant assemblies we started with 54,748 contigs and reduced it to 37,012 contigs with trim_assembly when compared to 51,012 when using cd-hit-est with both strand search enabled.
SNP marker identification in trimmed assemblies
Total High Quality SNPs
Predicted with SEAN
Predicted with MIRA
D. melanogaster (Sanger)
B. anynana (Sanger)
M. cinxia (GS20)
M. sexta (Sanger + GSFLX)
C. tremulae (GSFLX)
E. aurinia (GSFLX)
H. erato (Sanger + GSFLX)
H. melpomene (Sanger + GSFLX)
P. dardanus (GSFLX Titan.)
The number of contigs is, however, rarely an accurate prediction of the number of genes sequenced. Non-coding DNA (e.g. UTR - UnTranslated Region or intron read-throughs) sequenced from multiple haplotypes is not easy to assemble due to high levels of heterozygosity caused by the fact that the majority of the UTR is evolving without constraints. Upon investigation, this seems to be the major cause of contig inflation and the platform can use two methods to alleviate the issue by filtering in favor of regions likely to be coding. One method is used in Figure 5 but we may have the unfortunate effect of removing small contigs containing novel proteins which are small in size and low in expression. The other method - available only to users with a dataset from related species - is to use a tBLASTx approach (via analyse_assembly.pl) to complement the BLASTx approach of the reference proteome. Novel proteins, even if evolving rapidly, are expected to show significant similarity on the amino acid level between the two species. The platform allows one to extract the contig regions matching this coding fraction and thus have a dataset known to be coding. The two approaches are not mutually exclusive but can be complementary: the trim_assembly.pl is highly customizable regarding similarity and abundance levels whereas the tBLASTx approach will not tackle the issue of redundancy (i.e. two contigs originating from one locus).
One other point of note is relating to the procedure of choosing a cDNA generation protocol. All NGS datasets compared here use the SMART technology to produce cDNA apart from the GS20 dataset of M. sexta which was produced using GC-rich random primers. We do not know why the number of reads was much lower than the GS20 dataset from M. cinxia but a significantly higher proportion of the reads matches a predicted CDS which translates to a better return to projects aimed at gene-hunting. We cannot be sure why this occurs: it could be due to a high number of low quality reads in M. cinxia, it could be due to a slightly better protein identification based on the closely related B. mori but it is logical to expect that, in species known to have a GC enriched coding sequences, the primer protocol would enrich for regions with high GC content and therefore more likely target coding regions.
Finally, researchers often wish to know how deep one should sequence in order to sequence the complete transcriptome in their sample. This is important in planning non-model species transcriptomes projects. As the H. melpomene 454 data originate from 2.5 full-plates (i.e. 5 half-plates) and the cDNA is harvested from a single tissue (wing discs) we used 20 pseudo-assemblies to investigate the effect of deeper sequencing. As Figure 7 shows, most transcripts for that particular tissue were identified after 1.5 half plates were sequenced as the exponential curve is approaching a plateau (Figure 7A). With 1 half-plate, however, 74.5% of the plateau value for proteins identified was attained, showing that even shallow sequencing of a non-model species is highly worthwhile. For SNP detection, the function is not exponential (Figure 7B): with each subsequent run the number of high quality SNP markers increased linearly.
There are several advantages of est2assembly over other platforms for processing EST raw data (e.g. [54, 55]). First, preprocessing of raw sequence is essential and our platform offers a standard method for consistently accomplishing this for hundreds of thousands of sequences with straightforward user customization. Second, parameterizing an assembler is a tedious process and our platform is the only one which automates many of the routines. Third, annotation of an assembly with est2assembly can be readily standardized and automated for processing large numbers of datasets with minimum investment in time. Deciding on the optimal assembly is a subjective process and depends on the project but by providing the means to explore the parameter space allows for a standardization of an approach which is often ad-hoc. We calculate the BLAST-based index using two approaches and can determine the number of unique proteins found, actual proportion of amino acids found and obtain an estimate of the assembly proportion that is actually coding. In this case, as more datasets are published, we can benchmark laboratory protocols and sequencing technologies involved in acquiring full length genes. Importantly, the rich log output guides the wet-lab biologist who generated the data to perform in-depth investigations and hold a better understanding of their project. With est2assembly, we have not aimed to produce a 'black box' but a program which gives feedback to the user as to the quality and characteristics of the different assemblies achieved with their data. We showed that analysis of an assembly can give important insights to the technologies and protocols employed to acquire a transcriptome. Future work can focus on including more annotation modules and developing a Java/JDBC-driven Graphical User Interface (GUI) and relational database to allow molecular biologists with no computing knowledge to supervise the data analysis.
Shallow sequencing EST projects are becoming a gold-mine for biologists working on non-model species and are often used for both gene or SNP discovery but until now no software exists to link the SNP to both an assembly and the codon it may be part of. The est2assembly platform allows for the classification and identification of SNPs which may be under selection or point to a misalignment. Such data are important in manual curation of an assembly and lacking from any other software. We can also obtain coding synonymous SNPs for which a PCR primer is straightforward to design but are under low levels of selection. Non-coding markers are also useful to researchers who wish to investigate selection in non-coding DNA.
Special considerations, however, have to apply to projects working on non-model species, especially when datasets are restricted for financial or biological reasons. Often the design of the experiment is not conceived with full knowledge of a technology's capabilities and limitations. Here we show that different technologies and lab protocols differ in their ability to produce an assembly and project design plays an important role. At times, but not always, such project design bottlenecks can be overcome. For example, assemblers treat the common issue problem of inflated contig number caused by non-optimal alignments by assuming that they are based solely on sequencing errors or repeats. The result is that fewer genes are discovered in non-model species. The issue is confounded because researchers undertake a transcriptome project for different reasons. Even though in gene discovery project the norm is to sequence a cDNA library from specific tissues and with a limited number of haplotypes, this is rarely the case in most EST projects of non-model species. Researchers often utilize EST projects as both a gene-finding project and a SNP discovery protocol and therefore are tempted to include a high number of out-bred individuals. It should be noted that both Newbler and MIRA are not clustering algorithms but assemblers and therefore their main aim is not to identify alternative splicing events or cope with a high degree of heterozygosity in the sequenced sample. There are methods to alleviate the problem such as including a final clustering step (e.g. miraEST ; CLOBB ; or CAP3 . Our platform does not yet contain such a clustering step as the levels of heterozygosity can vary and a supervised algorithm we are developing as a future module may provide a more optimal solution. Such an algorithm would be tailored (i.e. trained) for each transcriptome project and make assignment of supercontigs (e.g. the merging of alternative splicing events and non-coding regions belonging to the same locus) more robust. Another issue which cannot be resolved using bioinformatics is the quality of material used for cDNA preparation. Even though we cannot be certain regarding the cause, the M. cinxtia and E. aurinia datasets argue against a whole animal approach in constructing the cDNA library. Such cases have been shown to be problematic in enzymatic reactions due to PCR-inhibiting pigments . The inclusion of micro organisms in the digestive tract or the outer body can also result in acquiring contaminating sequence from another species. The later cases can be investigated bioinformatically [60, 61] so as to prevent generating an erroneous transcriptome survey.
In conclusion, the modern transcriptome sequencing approaches are very powerful and cost effective but they still yield partial transcriptomes. In the future, however, our single most important limitation will not be raw transcriptomic or genomic data. We have shown that the ability to accurately annotate an assembly depends on using a correct reference proteome or utilize phylogenetic framework. Further, comparison to an appropriate reference proteome is invaluable in choosing among different assemblies, yet such proteomes are themselves incomplete. A concerted annotation effort based on transcriptome sequencing from a diverse phylogenetic collection is required, which will accelerate the filling of proteome space beyond the limited set of model organisms that currently occupy it. With the NGS capabilities, it is obvious that such an effort should make full use of transcriptomic data in order but we still lack the necessary infrastructure. The est2assembly software, however, has enabled the development of such infrastructure, an alpha stage preview of which is available at http://www.insectacentral.org.
Availability and requirements
Project name: est2assembly
Project home page: http://code.google.com/p/est2assembly/
Operating system: Linux
Programming language: Perl
Dependencies on proprietary software: None
Other requirements: NCBI-BLAST, EMBOSS toolkit, BioPerl dev-branch, PostgreSQL, MIRA (included), sff_extract (included)
Optional programs (included): annot8r, prot4EST
License: General Public License version 3
Bacterial Artificial Chromosome
European Bioinformatics Institute
Expressed Sequence Tag
General Feature Format
Generic Model Organism Database
General Public License
Graphical User Interface
Next Generation Sequencing (technology)
Open Reading Frame
Single Nucleotide Polymorphism
We would like to thank Karl Gordon (CSIRO) for helping with end-user testing, two anonymous referees for improving the manuscript and the following for making pre-publication data available: Chris Jiggins and his laboratory (Univ. of Cambridge), Owen McMillan and his laboratory (State Univ. of N. Carolina), Yannick Pauchet and Iva Fuková (Univ. of Exeter). Further, Bastien Chevreux provided development versions of MIRA and excellent support, Jose Blanca provided sff_extract, James Wasmuth provided support for prot4EST, Ralf Schmid for annot8r, Derek Huntley for SEAN and Steffi Gebauer-Jung for TrimbyWindow. David Clements and Scott Cain helped with Chado and GBrowse. We also thank the TU-Dresden Deimos PC-Farm for computational support. The authors report no conflicting interests. AP was supported by the Max Planck Gesellschaft and the European Union Research Network GAMEXP; DGH was supported by the Max Planck Gesellschaft; RHfC was supported by the European Union Research Network EMBEK1.
- Van Straalen NM, Roelofs D: An introduction to ecological genomics. Oxford: Oxford University Press; 2006.Google Scholar
- Heckel DG, Gahan LJ, Daly JC, Trowell S: A genomic approach to understanding Heliothis and Helicoverpa resistance to chemical and biological insecticides. Philos Trans R Soc Lond B Biol Sci 1998, 353: 1713–1722. 10.1098/rstb.1998.0323PubMed CentralView ArticleGoogle Scholar
- Brakefield PM, Gates J, Keys D, Kesbeke F, Wijngaarden PJ, Monteiro A, French V, Carroll SB: Development, plasticity and evolution of butterfly eyespot patterns. Nature 1996, 384: 236–242. 10.1038/384236a0View ArticlePubMedGoogle Scholar
- Rausher MD: Natural selection and the evolution of plant insect interactions. In Insect chemical ecology: an evolutionary approach. Edited by: Rausher MD, Isman MB. New York: Chapman & Hall; 1992:20–88.Google Scholar
- Ewing B, Green P: Analysis of expressed sequence tags indicates 35,000 human genes. Nat Genet 2000, 25: 232–234. 10.1038/76115View ArticlePubMedGoogle Scholar
- Rudd S: Expressed sequence tags: alternative or complement to whole genome sequences? Trends Plant Sci 2003, 8: 321–329. 10.1016/S1360-1385(03)00131-6View ArticlePubMedGoogle Scholar
- Boguski MS, Lowe TMJ, Tolstoshev CM: dbEST-- database for "expressed sequence tags". Nat Genet 1993, 4: 332–333. 10.1038/ng0893-332View ArticlePubMedGoogle Scholar
- Beldade P, McMillan WO, Papanicolaou A: Butterfly genomics eclosing. Heredity 2008, 100: 150–157. 10.1038/sj.hdy.6800934View ArticlePubMedGoogle Scholar
- Mita K, Morimyo M, Okano K, Koike Y, Nohata J, Kawasaki H, Kadono-Okuda K, Yamamoto K, Suzuki MG, Shimada T: The construction of an EST database for Bombyx mori and its application. Proc Natl Acad Sci 2003, 100: 14121–14126. 10.1073/pnas.2234984100PubMed CentralView ArticlePubMedGoogle Scholar
- Mita K, Kasahara M, Sasaki S, Nagayasu Y, Yamada T, Kanamori H, Namiki N, Kitagawa M, Yamashita H, Yasukochi Y: The genome sequence of silkworm, Bombyx mori. DNA Res 2004, 11: 27–35. 10.1093/dnares/11.1.27View ArticlePubMedGoogle Scholar
- Papanicolaou A, Gebauer-Jung S, Blaxter ML, McMillan WO, Jiggins CD: ButterflyBase: a platform for lepidopteran genomics. Nucleic Acids Res 2008, 36: D582–587. 10.1093/nar/gkm853PubMed CentralView ArticlePubMedGoogle Scholar
- Bouck A, Vision T: The molecular ecologist's guide to expressed sequence tags. Mol Ecol 2007, 16: 907–924. 10.1111/j.1365-294X.2006.03195.xView ArticlePubMedGoogle Scholar
- Thomson RC, Shedlock AM, Edwards SV, Shaffer HB: Developing markers for multilocus phylogenetics in non-model organisms: A test case with turtles. Mol Phylogenet Evol 2008, 49: 514–525. 10.1016/j.ympev.2008.08.006View ArticlePubMedGoogle Scholar
- Papanicolaou A, Joron M, McMillan WO, Blaxter ML, Jiggins CD: Genomic tools and cDNA derived markers for butterflies. Mol Ecol 2005, 14: 2883–2897. 10.1111/j.1365-294X.2005.02609.xView ArticlePubMedGoogle Scholar
- Schuster SC: Next-generation sequencing transforms today's biology. Nat Methods 2008, 5: 16–18. 10.1038/nmeth1156View ArticlePubMedGoogle Scholar
- Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S: The Generic Genome Browser: A Building Block for a Model Organism System Database. Genome Res 2002, 12: 1599–1610. 10.1101/gr.403602PubMed CentralView ArticlePubMedGoogle Scholar
- Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H: The Bioperl toolkit: Perl modules for the life sciences. Genome Res 2002, 12: 1611–1618. 10.1101/gr.361602PubMed CentralView ArticlePubMedGoogle Scholar
- Mungall CJ, Emmert DB: A Chado case study: an ontology-based modular schema for representing genome-associated biological information. Bioinformatics 2007, 23: i337–346. 10.1093/bioinformatics/btm189View ArticlePubMedGoogle Scholar
- Chevreux B, Wetter T, Suhai S: Genome sequence assembly using trace signals and additional sequence information. Proc German Conf Bioinformatics 1999, 99: 45–56.Google Scholar
- SFF extract[http://bioinf.comav.upv.es/sff_extract/]
- Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 2000, 16: 276–277. 10.1016/S0168-9525(00)02024-2View ArticlePubMedGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
- Ning Z, Cox AJ, Mullikin JC: SSAHA: A Fast Search Method for Large DNA Databases. Genome Res 2001, 11: 1725–1729. 10.1101/gr.194201PubMed CentralView ArticlePubMedGoogle Scholar
- Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J: Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 2005, 110: 462–467. 10.1159/000084979View ArticlePubMedGoogle Scholar
- Wasmuth JD, Blaxter ML: Prot4EST: Translating Expressed Sequence Tags from neglected genomes. BMC Bioinformatics 2004, 5: 187. 10.1186/1471-2105-5-187PubMed CentralView ArticlePubMedGoogle Scholar
- Pearson WR, Wood T, Zhang Z, Miller W: Comparison of DNA sequences with protein sequences. Genomics 1997, 46: 24–36. 10.1006/geno.1997.4995View ArticlePubMedGoogle Scholar
- Schmid R, Blaxter ML: annot8r: GO, EC and KEGG annotation of EST datasets. BMC Bioinformatics 2008, 9: 180. 10.1186/1471-2105-9-180PubMed CentralView ArticlePubMedGoogle Scholar
- Phred, Phrap, and Consed[http://www.phrap.com]
- Ji G, Zheng J, Shen Y, Wu X, Jiang R, Lin Y, Loke J, Davis K, Reese G, Li Q: Predictive modeling of plant messenger RNA polyadenylation sites. BMC Bioinformatics 2007, 8: 43. 10.1186/1471-2105-8-43PubMed CentralView ArticlePubMedGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT: Gene Ontology: tool for the unification of biology. Nat Genet 2000, 25: 25–29. 10.1038/75556PubMed CentralView ArticlePubMedGoogle Scholar
- Bairoch A: The ENZYME database in 2000. Nucleic Acids Res 2000, 28: 304–305. 10.1093/nar/28.1.304PubMed CentralView ArticlePubMedGoogle Scholar
- Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 2000, 28: 27–30. 10.1093/nar/28.1.27PubMed CentralView ArticlePubMedGoogle Scholar
- Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001, 29: 308–311. 10.1093/nar/29.1.308PubMed CentralView ArticlePubMedGoogle Scholar
- Huntley D, Baldo A, Johr S, Sergot M: SEAN: SNP prediction and display program utilizing EST sequence clusters. Bioinformatics 2006, 22: 495. 10.1093/bioinformatics/btk006View ArticlePubMedGoogle Scholar
- Zou Z, Najar F, Wang Y, Roe B, Jiang H: Pyrosequence analysis of expressed sequence tags for Manduca sexta hemolymph proteins involved in immune responses. Insect Biochem Mol Biol 2008, 38: 677–682. 10.1016/j.ibmb.2008.03.009PubMed CentralView ArticlePubMedGoogle Scholar
- Pauchet Y, Wilkinson P, van Munster M, Augustin S, Pauron D, Ffrench-Constant RH: Pyrosequencing of the midgut transcriptome of the poplar leaf beetle Chrysomela tremulae reveals new gene families in Coleoptera. Insect Biochem Mol Biol 2009, 39: 403–13. 10.1016/j.ibmb.2009.04.001View ArticlePubMedGoogle Scholar
- Pauchet Y, Wilkinson P, Vogel H, Nelson DR, Reynolds SE, Heckel DG, ffrench-Constant RH: Pyrosequencing Manduca sexta larval midgut transcriptome: messages for digestion, detoxification and defence. Insect Mol Biol, in press.
- Ferguson L, Lee SF, Chamberlain N, Nadea N, Joron M, Baxter S, Wilkinson P, Papanicolaou A, Kumar S, Thuan-Jin Clark R, Davidson C, Glithero R, Beasle H, Vogel H, Ffrench-Constant R H, Jiggins CD: Characterization of a hotspot for mimicry: assembly of a butterfly wing transcriptome to genomic sequence at the HmYb/Sb locus. Mol Ecol, in press.
- Vera JC, Wheat CW, Fescemyer HW, Frilander MJ, Crawford DL, Hanski I, Marden JH: Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol Ecol 2008, 17: 1636–47. 10.1111/j.1365-294X.2008.03666.xView ArticlePubMedGoogle Scholar
- Beldade P, Saenko SV, Pul N, Long AD: A Gene-Based Linkage Map for Bicyclus anynana Butterflies Allows for a Comprehensive Analysis of Synteny with the Lepidopteran Reference Genome. PLoS Genet 2009, 5: e1000366. 10.1371/journal.pgen.1000366PubMed CentralView ArticlePubMedGoogle Scholar
- Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2007, 35: D61–65. 10.1093/nar/gkl842PubMed CentralView ArticlePubMedGoogle Scholar
- Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M: The universal protein resource (UniProt). Nucleic Acids Res 2005, 33: D154–159. 10.1093/nar/gki070PubMed CentralView ArticlePubMedGoogle Scholar
- Drysdale RA, Crosby MA: FlyBase: genes and gene models. Nucleic Acids Res 2005, 33: D390–395. 10.1093/nar/gki046PubMed CentralView ArticlePubMedGoogle Scholar
- Lawson D, Arensburger P, Atkinson P, Besansky NJ, Bruggner RV, Butler R, Campbell KS, Christophides GK, Christley S, Dialynas E: VectorBase: a home for invertebrate vectors of human pathogens. Nucleic Acids Res 2007, 35: D503–505. 10.1093/nar/gkl960PubMed CentralView ArticlePubMedGoogle Scholar
- Lawson D, Arensburger P, Atkinson P, Besansky NJ, Bruggner RV, Butler R, Campbell KS, Christophides GK, Christley S, Dialynas E: VectorBase: a data resource for invertebrate vector genomics. Nucleic Acids Res 2009, 37: D583–587. 10.1093/nar/gkn857PubMed CentralView ArticlePubMedGoogle Scholar
- Solignac M, Zhang L, Mougel F, Li B, Vautrin D, Monnerot M, Cornuet JM, Worley KC, Weinstock GM, Gibbs RA: The genome of Apis mellifera: dialog between linkage mapping and sequence assembly. Genome Biol 2007, 8: 403. 10.1186/gb-2007-8-3-403PubMed CentralView ArticlePubMedGoogle Scholar
- Wang J, Xia Q, He X, Dai M, Ruan J, Chen J, Yu G, Yuan H, Hu Y, Li R: SilkDB: a knowledgebase for silkworm biology and genomics. Nucleic Acids Res 2005, 33: D399. 10.1093/nar/gki116PubMed CentralView ArticlePubMedGoogle Scholar
- Wang L, Wang S, Li Y, Paradesi MSR, Brown SJ: BeetleBase: the model organism database for Tribolium castaneum . Nucleic Acids Res 2007, 35: D476–479. 10.1093/nar/gkl776PubMed CentralView ArticlePubMedGoogle Scholar
- Yamamoto K, Narukawa J, Kadono-Okuda K, Nohata J, Suetsugu Y, Sasanuma M, Sasanuma S, Mita K, Minami H, Shimomura M: Silkworm genome analysis: Construction of an integrated genome database, KAIKObase. Seikagaku 2006, A12627: 78.Google Scholar
- Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22: 1658–9. 10.1093/bioinformatics/btl158View ArticlePubMedGoogle Scholar
- Harismendy O, Frazer K: Method for improving sequence coverage uniformity of targeted genomic intervals amplified by LR-PCR using Illumina GA sequencing-by-synthesis technology. BioTechniques 2009, 46: 229. 10.2144/000113082View ArticlePubMedGoogle Scholar
- Goldsmith MR, Shimada T, Abe H: The genetics and genomics of the silkworm, Bombyx mori . Annu Rev Entomol 2005, 50: 71–100. 10.1146/annurev.ento.50.071803.130456View ArticlePubMedGoogle Scholar
- Parkinson J, Anthony A, Wasmuth J, Schmid R, Hedley A, Blaxter M: PartiGene - constructing partial genomes. Bioinformatics 2004, 20: 1398–1404. 10.1093/bioinformatics/bth101View ArticlePubMedGoogle Scholar
- Paquola AC, Nishyiama Jr MY, Reis EM, da Silva AM, Verjovski-Almeida S: ESTWeb: bioinformatics services for EST sequencing projects. Bioinformatics 2003, 19: 1587–1587. 10.1093/bioinformatics/btg196View ArticlePubMedGoogle Scholar
- Chevreux B, Pfisterer T, Drescher B, Driesel AJ, Müller WEG, Wetter T, Suhai S: Using the miraEST Assembler for Reliable and Automated mRNA Transcript Assembly and SNP Detection in Sequenced ESTs. Genome Res 2004, 14: 1147–1159. 10.1101/gr.1917404PubMed CentralView ArticlePubMedGoogle Scholar
- Parkinson J, Guiliano DB, Blaxter M: Making sense of EST sequences by CLOBBing them. BMC Bioinformatics 2002, 3: 31. 10.1186/1471-2105-3-31PubMed CentralView ArticlePubMedGoogle Scholar
- Huang X, Madan A: CAP3: A DNA sequence assembly program. Genome Res 1999, 9: 868–877. 10.1101/gr.9.9.868PubMed CentralView ArticlePubMedGoogle Scholar
- Bextine B, Tuan S, Shaikh H, Blua M, Miller TA: Evaluation of Methods for Extracting Xylella fastidiosa DNA from the Glassy-Winged Sharpshooter. J Econ Entomol 2004, 97: 757–763. 10.1603/0022-0493(2004)097[0757:EOMFEX]2.0.CO;2View ArticlePubMedGoogle Scholar
- Friedel CC, Jahn KHV, Sommer S, Rudd S, Mewes HW, Tetko IV: Support vector machines for separation of mixed plant-pathogen EST collections based on codon usage. Bioinformatics 2005, 21: 1383–1388. 10.1093/bioinformatics/bti200View ArticlePubMedGoogle Scholar
- Emmersen J, Rudd S, Mewes HW, Tetko IV: Separation of sequences from host-pathogen interface using triplet nucleotide frequencies. Fungal Genet Biol 2007, 44: 231–241. 10.1016/j.fgb.2006.11.010View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.