Optimization of de novo transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms
© Haznedaroglu et al.; licensee BioMed Central Ltd. 2012
Received: 25 February 2012
Accepted: 26 June 2012
Published: 18 July 2012
The k-mer hash length is a key factor affecting the output of de novo transcriptome assembly packages using de Bruijn graph algorithms. Assemblies constructed with varying single k-mer choices might result in the loss of unique contiguous sequences (contigs) and relevant biological information. A common solution to this problem is the clustering of single k-mer assemblies. Even though annotation is one of the primary goals of a transcriptome assembly, the success of assembly strategies does not consider the impact of k-mer selection on the annotation output. This study provides an in-depth k-mer selection analysis that is focused on the degree of functional annotation achieved for a non-model organism where no reference genome information is available. Individual k-mers and clustered assemblies (CA) were considered using three representative software packages. Pair-wise comparison analyses (between individual k-mers and CAs) were produced to reveal missing Kyoto Encyclopedia of Genes and Genomes (KEGG) ortholog identifiers (KOIs), and to determine a strategy that maximizes the recovery of biological information in a de novo transcriptome assembly.
Analyses of single k-mer assemblies resulted in the generation of various quantities of contigs and functional annotations within the selection window of k-mers (k-19 to k-63). For each k-mer in this window, generated assemblies contained certain unique contigs and KOIs that were not present in the other k-mer assemblies. Producing a non-redundant CA of k-mers 19 to 63 resulted in a more complete functional annotation than any single k-mer assembly. However, a fraction of unique annotations remained (~0.19 to 0.27% of total KOIs) in the assemblies of individual k-mers (k-19 to k-63) that were not present in the non-redundant CA. A workflow to recover these unique annotations is presented.
This study demonstrated that different k-mer choices result in various quantities of unique contigs per single k-mer assembly which affects biological information that is retrievable from the transcriptome. This undesirable effect can be minimized, but not eliminated, with clustering of multi-k assemblies with redundancy removal. The complete extraction of biological information in de novo transcriptomics studies requires both the production of a CA and efforts to identify unique contigs that are present in individual k-mer assemblies but not in the CA.
Transcriptomic studies using high-throughput sequencing data have enabled researchers to study the global and specific gene expression of many different organisms without the need for a fully sequenced and annotated genome [1, 2]. Recently, de Bruijn graph-based  software packages such as Oases , Trans-ABySS , SOAPdenovo , and Trinity  have been developed to facilitate the transcriptome assembly of massive amounts of short read sequences produced using next generation DNA sequencing technologies. The power and robustness of these packages for forming contiguous sequences (contigs) has been tested, and comparative evaluations on computational resources such as execution time and parallelization, storage, and memory usage have been documented [8–10]. The choice of k-mer length (the length parameter defining the sequence overlap between two reads forming a contig) significantly affects the final assembly product . Shorter k-mer values might be a better choice in low-coverage studies to prevent the formation of complex overlapping nodes; whereas a larger k-mer choice would be more practical for high-coverage sequencing projects  to improve assembly accuracy. As an alternative to a single best k-mer value selection, multi-k value based methods have been adopted to compile different k-mer assemblies in order to improve performance, sensitivity, and specificity of the overall de novo transcriptome assemblies [2, 12]. Multi-k value based transcriptome assemblies come along with additional complexities, requiring algorithms to efficiently cluster homologous sequences from each single-k assembly and to remove redundant contigs to generate the final non-redundant clustered assembly (CA). Several algorithms such as CD-HIT-EST , VMATCH , and TGI Clustering tools  have been developed to obtain an optimal assembly clustering.
To date, the optimization studies for both single k-mer and clustered multi k-mer assemblies have largely focused on the length and number of contigs produced as a metric to evaluate the quality of the assembly output. There is, however, a limited understanding of how functional annotation—a primary goal of de novo transcriptome analysis—is affected by k-mer selection and clustering of multi-k assemblies. In this study, we report the significance of k-mer selection in the de novo assembly and annotation of a non-model eukaryotic organism’s transcriptome with no reference genome information available. We document the variations in uniqueness and the degree of functional annotations obtained under single k-mer and multi-k clustering methods, and present an assembly strategy to optimize the functional annotation to generate the gene catalogue of a non-model eukaryotic organism. Analysis is performed on Illumina short read sequencing of mRNA transcripts from the microalgae Neochloris oleoabundans, a candidate species for the production of microalgae-based biofuels [16, 17].
Herein, we also demonstrate that the combination of individual k-mer assemblies improves, but does not complete the annotation of all available unique contigs produced in an assembly. A workflow and useful scripts are provided to allow retrieval of additional biological information from contigs that are present in individual k-mer assemblies, but not in the clustered k-mer assembly.
Results and discussion
Sequencing and de novo transcriptome assembly
Transcriptome sequencing and assembly summary
Raw sequencing reads
Reads requiring trimming
Minimum read length
Lower quartile read length
Median read length
Upper quartile read length
Maximum read length
Number of reads assembled
Number of reads mapped
Number of contigs (≥ 100 bp)
Number of contigs (≥ 5,000 bp)
Number of contigs (≥ 8,000 bp)
Average length of contigs
Longest contig length
Effects of k-mer selection on mapping, functional annotation, and coverage
This analysis has clearly shown that the number of missing KOIs was minimal for mid-range k-mers, i.e. from 21 to 41, but it was more prominent for short and long k-mer sizes, i.e. 19, 43, and above. The fact that the highest quantities of missing KOIs corresponded to the highest and lowest number of generated contigs, in k-19 and k-63 assemblies respectively, was not surprising as these two extreme assemblies likely contained more unannotated contigs compared to other single k-mer assemblies, where higher accuracy in biological annotation is achieved with optimal mid-range k-mer length and ultimately contig length.
Despite the fact that k-mer 19 has resulted in a lower quality assembly in terms of coverage and missing annotations as discussed above, utilization of the k-19 assembly might still have value in annotating the transcriptome. Overall, lower k-mer assemblies are more successful in capturing transcripts with lower abundances, but as k-mer length increases, transcripts with higher abundances are more likely to be detected. Therefore, all individual assemblies of k-19 to k-63 were utilized to generate a multi-k based CA .
Assembly clustering and optimization
Number of KOIs present in individual k -mer assemblies but missing from the combined assemblies generated with different programs
Nevertheless, Table2 and Figure5 collectively indicated that there were still missing KOIs in both CAs and single k-mer assemblies. Although the number of missing KOIs in the CA was low compared to the total number of KOIs identified, there was still some relevant biological information lost during this clustering step. This was attributed to the heuristic-based search used in both CD-HIT-EST and Oases to reduce computation time and memory usage, resulting in minor inconsistencies during the removal of redundant sequences.
To further characterize the degree of lost biological information, the missing KOIs identified during the pair-wise comparisons in Figure5 were subjected to full biological annotation using KEGG BRITE gene and protein families. The missing KOI lists and their full biological annotations are presented as Additional file 1 and Additional file 2 [annotated KOIs missing in single k-mer assemblies otherwise present in CAs, corresponding to an average 0.19% of total KOIs, identified with CD-HIT-EST 1.0 scenario (Additional file 1) and 0.27% in Oases multi-k (Additional file 2)], and Additional file 3 and Additional file 4 [annotated KOIs missing in CAs otherwise present in single k-mer assemblies, corresponding to an average 3.43% of total KOIs, identified with CD-HIT-EST 1.0 scenario (Additional file 3) and 3.48% in Oases multi-k (Additional file 4)]. The detailed discussion of lost biological information would be out of the scope of this paper, as its nature and value would differ for each researcher and assembly goal. Nevertheless, a general important interpretation is that there were relevant genes encoding enzymes and proteins (of particular interest for a lipid producing microalgae species with respect to this study) identified as missing in single k-mer assemblies but present in CAs (Additional file 1 and Additional file 2) and vice-versa (Additional file 3 and Additional file 4). This suggests that comprehensive annotation should include, in addition to the CA, an interrogation of unique genes in the assemblies of individual k-mers from 19 and 63.
Suggested workflow for optimizing de novo transcriptome annotation
Although the generated CA provided the best annotation results, comparison with single k-mer assemblies suggested that this approach still results in the loss of some biological information as discussed above. A workflow is presented in Additional file 5, along with several useful scripts, as a guide to improve the annotation of de novo assembled transcriptome. The workflow first includes quantifying the number of annotations that could possibly be generated in single k-mer assemblies via quick annotation services (such as KAAS) to determine the optimal k-mer value range targeted to capture the most comprehensive functional annotation. Next, a clustered assembly should be generated using these k-mer values to produce the full set of non-redundant contigs. Finally, pairwise comparisons are performed to identify the unique contigs that are not present in the multi-k clustered assembly (otherwise detected in single k-mer assemblies). These missing contigs should be incorporated into the final assembly product prior to annotation.
For the de novo transcriptome assembly of non-model organisms from short read sequencing data, de Bruijn graph based algorithms use k-mer hash lengths to accommodate transcripts with different sizes. Here, we provide an in-depth analysis of the effects of individual k-mer length and multiple k-mer assembly methods on transcriptome annotation. Results demonstrate that different k-mer choices result in different quantities of unique contigs per single k-mer assembly, which in turn impact the amount of biological information that is retrievable from the transcriptome. Although this undesirable effect could be minimized with clustering of multi-k assemblies, it is not completely eliminated due to limitations in the heuristic algorithms used in redundancy removal when clustered k-mer assemblies are used. We present useful scripts and a workflow to retrieve some of the missing biological information. With high-throughput DNA sequencing methods removing limitations in transcriptome coverage, assembly-based optimization is important for continually improving the completeness of transcriptomes, particularly in non-model organisms for which the reference genome is not available. Taken together, our results provide important guidance on selecting and combining k-mer lengths to improve the extraction of biological information from de novo transcriptome assemblies.
Algae growth, cDNA sequencing, read trimming, and assembly
Neochloris oleoabundans (a Chlorophyceae class green microalgae) was grown in batch cultures under nitrogen stressed and unstressed conditions [17, 18]. Total RNA was extracted after 11 days of growth using Rneasy Plant Mini Kit (Qiagen, Germantown, MD). Library preparation was conducted using mRNA-Seq Kit supplied by Illumina (Illumina, Inc., San Diego, CA). Briefly, the mRNA fraction was isolated from total RNA using two rounds of hybridization to Dynaloligo(dT) magnetic beads (Invitrogen, Carlsbad, CA). The mRNA was then fragmented in the presence of divalent cations at 94°C, and subsequently converted into double stranded cDNA following the first- and second-strand cDNA synthesis using random hexamer primers. After polishing the ends of the cDNA using T4 DNA polymerase and Klenow DNA polymerase for 30 min at 20°C, a single adenine base was added to the 3’ ends of cDNA molecules. Illumina mRNA-Seq Kit specific adaptors were then ligated to cDNA 3’ ends. Subsequently, the cDNA was PCR-amplified for 15 cycles and amplicons were purified using the Qiagen PCR purification kit (Qiagen, Germantown, MD). The size and concentration of the cDNA libraries were determined on Agilent 2100 bioanalyzer (Agilent Technologies, Santa Clara, CA). Each cDNA library was loaded onto a lane of the Illumina flow cell and sequenced at the Yale Center for Genome Analysis using a Genome Analyzer IIx and the 99 bp single-read recipe. An additional lane was also used to run sequencing controls. Raw sequencing reads (44,568,122 reads; 99 bp single-ended) of cDNA were analyzed with the FastQC quality control tool (v0.10.0) to evaluate the read sequence quality . Low quality reads with a Phred score value of 13 and less were removed using the SolexaQA software package (v1.1) . After trimming, the FastQC analysis was conducted again to ensure quality measures were met in the remaining reads.
Based on their common application in de novo transcriptomic studies using Illumina reads [21, 22], the Velvet (v1.2.03)  and Oases (v0.2.06)  packages were utilized to assemble the high quality reads. In order to investigate the impact of k-mer choice on the assembly dynamics, separate assemblies were performed for odd k-mer values ranging from 19 to 63 using the “oases_pipeline.py” script provided in the Oases package. The raw sequence data used in this study has been submitted to National Center for Biotechnology Information (NCBI) Short Read Archive (SRA), and are available for public access with accession numbers: SRR391512.1 and SRR391513.1.
KOI assignment and functional annotation
The resulting contigs from each individual k-mer assembly were submitted to the Kyoto Encyclopedia of Genes and Genomes (KEGG) Automatic Annotation Server (KAAS) (v1.6a)  for KOI assignment using the default settings with single-directional best hit (SBH) method and databases that included several eukaryotic organisms (including the green microalgae C. reinhardtii) . Functional annotation of the KOI assignments was derived from the KEGG BRITE genes and protein families database . Comparison of KOIs for each k-mer assembly to KOIs for all other individual k-mer assemblies was performed to determine the number of KOIs unique to each k-mer assembly. This analysis was generated using a custom built script in the R programming language. This script is provided in Additional file 6.
Mapping reads to the assembled transcriptome
To understand the relationship between coverage and generated contigs, trimmed reads were mapped against each assembly (k-19 to k-63) using Bowtie  and the contig coverage was estimated. Prior to mapping, any read shorter than the k-value of the assembly was removed from the set of trimmed reads. This was done to ensure that reads, which were not used in the assembly, were not mapped to the assembly. Bowtie produced all alignments  with 0, 1, and 2 mismatches allowed and utilized the following settings: -a -phred64 -quals -suppress 1,2,4,5,6,7,8 -q -best. Fold coverage was calculated based on the average number of reads mapped per contig in a given k-mer assembly. This calculation was performed using a custom designed Python script (provided in Additional file 6).
Assembly clustering and optimization
Clustered assemblies (CA) were generated from the single k-mer assemblies of 19 to 63. Clustering of contigs and redundancy removal were performed using the following three different programs: Oases (by using its incorporated multi-k option), CD-HIT-EST (v4.0-2010-04-20) , and VMATCH (v2.1.6) . CD-HIT-EST was run with the following parameters: -n 8 -T 4, and 3 different values for the '-c' parameter (0.9, 0.95, 1.0). VMATCH was run with the following parameters: -d -p -l 18 -dbcluster 100 0 -v -nonredundant for vmatch and -allout -pl -dna for mkvtree. Contigs from the combined assemblies were submitted to KAAS for KOI assignment as previously described.
Pair-wise comparison of single k-mer assemblies versus CA was performed to determine if unique KOIs existed in specific k-mer assemblies but not in the clustered assemblies, and vice versa. To make these comparisons, scripts written in the Python programming language were developed (See Additional file 6).
Berat Z. Haznedaroglu, Darryl Reeves and Hamid Rismani-Yazdi equal authorship.
This research was supported by the Connecticut Center for Advanced Technologies under a Fuel Diversification Grant and by the National Science Foundation, Grant #0854322 awarded to JP. BZH was supported by a joint postdoctoral fellowship from the Yale Climate and Energy Institute and Yale Institute for Biospheric Studies. DR was supported in part from the National Library of Medicine, Grant #T15 LM07056. We acknowledge use of computational facilities at the Yale University Biomedical High Performance Computing Center NIH Grant# RR19895.
- Iyer MK, Chinnaiyan AM: RNA-Seq unleashed. Nat Biotech. 2011, 29 (7): 599-600. 10.1038/nbt.1915.View Article
- Martin JA, Wang Z: Next-generation transcriptome assembly. Nat Rev Genet. 2011, 12 (10): 671-682. 10.1038/nrg3068.View ArticlePubMed
- De Bruijn NG: A combinatorical problem. Koninklijke Nederlandse Akademie v Wetenschappen. 1946, 46: 758-764.
- Schulz MH, Zerbino DR, Vingron M, Birney E: Oases: Robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics. 2012, 10.1093/bioinformatics/bts094.
- Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, Mungall K, Lee S, Okada HM, Qian JQ, et al: De novo assembly and analysis of RNA-seq data. Nat Meth. 2010, 7 (11): 909-912. 10.1038/nmeth.1517.View Article
- Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, et al: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010, 20 (2): 265-272. 10.1101/gr.097261.109.PubMed CentralView ArticlePubMed
- Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, et al: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotech. 2011, 29 (7): 644-652. 10.1038/nbt.1883.View Article
- Bao S, Jiang R, Kwan W, Wang B, Ma X, Song Y-Q: Evaluation of next-generation sequencing software in mapping and assembly. J Hum Genet. 2011, 56 (6): 406-414. 10.1038/jhg.2011.43.View ArticlePubMed
- Narzisi G, Mishra B: Comparing De novogenome assembly: the long and short of it. PLoS One. 2011, 6 (4): e19175-10.1371/journal.pone.0019175.PubMed CentralView ArticlePubMed
- Zhang W, Chen J, Yang Y, Tang Y, Shang J, Shen B: A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies. PLoS One. 2011, 6 (3): e17915-10.1371/journal.pone.0017915.PubMed CentralView ArticlePubMed
- Zerbino DR, Birney E: Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18 (5): 821-829. 10.1101/gr.074492.107.PubMed CentralView ArticlePubMed
- Surget-Groba Y, Montoya-Burgos JI: Optimization of de novo transcriptome assembly from next-generation sequencing data. Genome Res. 2010, 20 (10): 1432-1440. 10.1101/gr.103846.109.PubMed CentralView ArticlePubMed
- Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006, 22 (13): 1658-1659. 10.1093/bioinformatics/btl158.View ArticlePubMed
- Kurtz S, Vmatch: Large scale sequence analysis software. http://www.vmatch.de/,
- Pertea G, Huang X, Liang F, Antonescu V, Sultana R, Karamycheva S, Lee Y, White J, Cheung F, Parvizi B, et al: TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics. 2003, 19 (5): 651-652. 10.1093/bioinformatics/btg034.View ArticlePubMed
- Griffiths M, Harrison S: Lipid productivity as a key characteristic for choosing algal species for biodiesel production. J Appl Phycol. 2009, 21 (5): 493-507. 10.1007/s10811-008-9392-7.View Article
- Li Y, Horsman M, Wang B, Wu N, Lan C: Effects of nitrogen sources on cell growth and lipid accumulation of green alga Neochloris oleoabundans. Appl Microbiol Biotech. 2008, 81 (4): 629-636. 10.1007/s00253-008-1681-1.View Article
- Pruvost J, Van Vooren G, Cogne G, Legrand J: Investigation of biomass and lipids production with Neochloris oleoabundans in photobioreactor. Bioresource Technol. 2009, 100 (23): 5988-5995. 10.1016/j.biortech.2009.06.004.View Article
- Andrews S, FastQC: A quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/,
- Cox M, Peterson D, Biggs P: SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics. 2010, 11 (1): 485-10.1186/1471-2105-11-485.PubMed CentralView ArticlePubMed
- Garg R, Patel RK, Tyagi AK, Jain M: De novo assembly of chickpea transcriptome using short reads for gene discovery and marker identification. DNA Res. 2011, 18 (1): 53-63. 10.1093/dnares/dsq028.PubMed CentralView ArticlePubMed
- Feldmeyer B, Wheat C, Krezdorn N, Rotter B, Pfenninger M: Short read Illumina data for the de novo assembly of a non-model snail species transcriptome (Radix balthica, Basommatophora, Pulmonata), and a comparison of assembler performance. BMC Genomics. 2011, 12 (1): 317-10.1186/1471-2164-12-317.PubMed CentralView ArticlePubMed
- Moriya Y, Itoh M, Okuda S, Yoshizawa AC, Kanehisa M: KAAS: an automatic genome annotation and pathway reconstruction server. Nucl Acids Res. 2007, 35 (suppl 2): W182-W185.PubMed CentralView ArticlePubMed
- Aoki-Kinoshita KF, Kanehisa M: Gene annotation and pathway mapping in KEGG. Comparative Genomics. Volume 2. Edited by: Bergman NH. 2007, Totowa, New Jersey: Humana Press, 71-91. vol. 396
- Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3): R25-10.1186/gb-2009-10-3-r25.PubMed CentralView ArticlePubMed
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.