Novel software package for cross-platform transcriptome analysis (CPTRA)
© Zhou et al. 2009
Published: 8 October 2009
Next-generation sequencing techniques enable several novel transcriptome profiling approaches. Recent studies indicated that digital gene expression profiling based on short sequence tags has superior performance as compared to other transcriptome analysis platforms including microarrays. However, the transcriptomic analysis with tag-based methods often depends on available genome sequence. The use of tag-based methods in species without genome sequence should be complemented by other methods such as cDNA library sequencing. The combination of different next generation sequencing techniques like 454 pyrosequencing and Illumina Genome Analyzer (Solexa) will enable high-throughput and accurate global gene expression profiling in species with limited genome information. The combination of transcriptome data acquisition methods requires cross-platform transcriptome data analysis platforms, including a new software package for data processing.
Here we present a software package, CPTRA: Cross-Platform TRanscriptome Analysis, to analyze transcriptome profiling data from separate methods. The software package is available at http://people.tamu.edu/~syuan/cptra/cptra.html. It was applied to a case study of non-target site glyphosate resistance in horseweed, and the data were mined to discover resistance target gene(s). The input data for the software include a long-read sequence dataset with proper annotation and a short-read sequence tag dataset for the quantification of transcripts. By combining the two datasets, the software carries out unique sequence tag identification, tag counting for transcript quantification, and cross-platform sequence matching, whereby the short sequence tags can be annotated with a function, level of expression, and Gene Ontology (GO) classification. Multiple sequence search algorithms were implemented and compared. The analysis highlighted the importance of transport genes in glyphosate resistance and identified several candidate genes for downstream analysis.
CPTRA is a powerful software package for next generation sequencing-based transcriptome profiling in species with limited genome information. According to our case study, the strategy can greatly broaden the application of next generation sequencing for transcriptome analysis in species without a reference genome sequence.
The recent development of next generation sequencing techniques has revolutionized biological and biomedical research and has provided many enabling platforms for systems biology [1, 2]. However, maximizing the potential of next generation sequencing depends heavily on the available data analysis tools. Some features of next generation sequencing data are different from those of traditional Sanger sequencing. For example, the Illumina Genome Analyzer can generate up to 20 gigabases of short read sequences per run. These short read sequences can be 18 bases, 36 bases or 76 bases in read length. They can also be generated from either single end or paired end runs. The different sequence formats, diverse applications, and the large amount of data generated all require new strategies for sequence analysis. Various sequence analysis tools have been developed to address the needs of different applications of next generation sequencing including de novo sequencing, whole genome re-sequencing, metagenome sequencing, transcriptome profiling, microRNA profiling, ChIP-seq, and others [3, 5–14]. In this paper, we focus on a software package providing the enabling tools for cross-platform transcriptome analysis.
Next generation sequencing techniques have enabled several novel approaches for transcriptome profiling [1, 4, 15]. Depending on the read length, different next generation sequencing techniques can be optimized for different types of transcriptome profiling [16, 17]. The 454 pyrosequencing platform provides longer read lengths of 200 to 400 bases and a relatively lower sequencing yield of around 200 to 400 megabases per run. Considering the read length, 454 sequencing has some advantages for transcriptome analysis, since the longer reads allow for better assembly of the sequences, which is particularly important for species without reference genome information. Compared to 454 sequencing, the shorter read length and higher sequencing throughput of SOLiD and Solexa have enabled better transcript quantification, where the deep sequence coverage allows better digital quantification of gene expression levels. Even though they are not considered part of the next generation sequencing techniques, MPSS (Massively Parallel Signature Sequencing) and iGentifier can also be employed for semi-quantitative transcriptome profiling with data output similar to digital gene expression (DGE) profiling [2, 3, 19]. The so-called digital gene expression profiling technique employs a strategy similar to serial analysis of gene expression (SAGE), in which sequence tags around a four-base restriction enzyme site are sequenced and quantified across different samples [20–22]. The SOLiD and Solexa-based methods provide much deeper sequence coverage of the tags and thus provide more accurate quantification. In fact, a recent study has indicated that digital gene expression profiling is more accurate than any microarray platform.
Despite the significant advantages of short sequence tag-based gene expression profiling methods, all these platforms, including MPSS, DGE, iGentifier and SAGE, are heavily dependent on the availability of a reference genome sequence, which limits the application of these techniques to sequenced or well-characterized species only [4, 19, 23]. However, one of the advantages and tasks of next generation sequencing is to expand the usage of sequence-based transcriptome and genome analysis to a variety of species with limited or no genome information. Novel experimental approaches accompanied by usable software are needed for such analysis.
We hereby describe a new approach for cross-platform transcriptome analysis and apply it to a case study. The case study analyzes the molecular mechanisms of herbicide resistance in horseweed, Conyza canadensis, a major weed in the US. No genome information is available for horseweed. The project serves as a suitable case study because it combines different sequencing platforms, including 454 sequencing, cDNA sequencing and iGentifier, for comprehensive transcriptome profiling of horseweed's response to glyphosate treatments. The goal of the study was to discover novel genes involved in herbicide detoxification in non-target site resistance, in which multiple pathways including P450, GST and ABC transporters could be involved [24, 25]. Limited genome information greatly hinders the application of sequence-based transcriptomic profiling in this and other weedy species.
Herein we present a software package, CPTRA, for analyzing transcriptome profiling data from different sequencing platforms. We present the software design, data input, and output. The software package is freely available at http://people.tamu.edu/~syuan/cptra/cptra.html. We also evaluated the performance of the package and compared the CPU time for different algorithms. In a follow-up case study, the software package was employed to analyze our cDNA library, 454 sequencing, and iGentifier data to dissect the mechanisms of non-target herbicide resistance in horseweed. The analysis revealed the effectiveness of the approach for cross-platform transcriptome profiling and the potential for the software package to be broadly applied for transcriptome analysis in essentially any species.
Figure 1 outlines the schema of the cross-platform transcriptome experimental design and data analysis flow. As shown in the figure, two types of input data are analyzed together. Our previous analysis indicated that the direct annotation of sequences shorter than 40 bases is not feasible. We therefore developed the CPTRA package in Python for transcriptome analysis based on two or three types of sequencing data. The input of the package consists of sequence tags from different sequencing platforms, including DGE and iGentifier, and annotated cDNA sequences of the same species. In the first step of the analysis, the sequence tags are grouped to form a set of unique tags with a count for each tag. The tags are then aligned to the cDNA sequences under a specified limit on the number of allowed mismatches. CPTRA uses the alignment results to compute normalized expression counts for each cDNA sequence.
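The three steps described above (grouping tags, matching them to cDNA sequences, and computing normalized counts) can be illustrated with a minimal Python sketch. This is not the CPTRA code: the function names are hypothetical, the naive forward-strand scan stands in for the alignment step, and the tags-per-million scaling and the rule of counting only tags that hit a single cDNA are assumptions made for illustration, since the normalization scheme is not specified here.

```python
from collections import Counter


def group_tags(reads):
    """Collapse raw tag reads into unique tags with a count for each."""
    return Counter(r.strip().upper() for r in reads if r.strip())


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def match_tag(tag, cdna, max_mismatch=0):
    """True if the tag occurs in the cDNA with at most max_mismatch substitutions.

    Only the forward strand is scanned (an assumption of this sketch).
    """
    n = len(tag)
    return any(
        hamming(tag, cdna[i:i + n]) <= max_mismatch
        for i in range(len(cdna) - n + 1)
    )


def expression_counts(tag_counts, cdnas, max_mismatch=0, scale=1_000_000):
    """Sum scaled tag counts per cDNA, using only tags that hit one cDNA.

    The per-million scaling is an assumed normalization, not CPTRA's.
    """
    total = sum(tag_counts.values())
    result = {name: 0.0 for name in cdnas}
    for tag, count in tag_counts.items():
        hits = [name for name, seq in cdnas.items()
                if match_tag(tag, seq, max_mismatch)]
        if len(hits) == 1:
            result[hits[0]] += count * scale / total
    return result
```

Under these assumptions, calling `expression_counts(group_tags(reads), cdnas, max_mismatch=1)` returns a dictionary mapping each cDNA name to its scaled tag count. A scan of this kind grows with the product of tag number and total cDNA length, so it is practical only for small test datasets.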
The major speed-limiting step of the CPTRA package lies in the cross-platform sequence matching function. We compared different algorithms, including the NCBI-BLAST programs (megablast, blastn) and a regular pattern search. The regular pattern search identifies a string in a sequence file using one of the direct pattern search algorithms implemented in Python. Megablast gives the best performance, whilst the regular pattern search is significantly slower. The regular pattern search is essentially impractical for the large scale of next generation sequencing data, but it allows ambiguous nucleotide codes, which are abundant in iGentifier sequencing results. We estimated that processing a dataset with 1000 iGentifier tags and 2500 cDNA sequences would take about 2 hours with the regular pattern search. The NCBI-BLAST programs take only seconds for such a task, but they cannot consider ambiguous nucleotides in the alignment. Megablast is thus preferred if the data quality is high and the tags contain no ambiguous nucleotide codes.
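To illustrate how a pattern search can tolerate ambiguous nucleotide codes, the following minimal sketch translates a tag containing IUPAC ambiguity codes into a regular expression and searches each cDNA sequence with it. It is an illustration only: the function names are hypothetical, ambiguity is handled in the tag but not in the cDNA, and the actual pattern search algorithm used in CPTRA may differ.

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def tag_to_pattern(tag):
    """Compile a tag with IUPAC codes into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in tag.upper()))


def find_matches(tag, cdnas):
    """Return names of cDNA sequences containing the (possibly ambiguous) tag."""
    pattern = tag_to_pattern(tag)
    return [name for name, seq in cdnas.items() if pattern.search(seq.upper())]
```

Because every tag is scanned against every cDNA sequence, the running time of such a search grows with the product of the two dataset sizes, which is consistent with the slow performance reported above for the regular pattern search.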
We employed the CPTRA package to analyze the latest sequence-based transcriptome analysis data for our horseweed project. The purpose of the study was to evaluate the effectiveness of the package and the impact of different sequence coverage on the analysis output. Detailed information about the study can be found in our previous work. For this study, there were three types of input data: the Unigene sequences and iGentifier data as previously presented, along with recently sequenced ESTs from 454 pyrosequencing (Peng, unpublished data). The iGentifier data are similar to DGE data and provide short sequence tags for quantification. The 454 EST reads averaged 140 bases in length and totaled up to 50 megabases for the study (Peng, unpublished data).
We have introduced CPTRA as a software package for cross-platform transcriptome analysis and presented a performance evaluation for CPTRA. Cross-platform transcriptome analysis often involves a short read tag-based platform for transcript quantification and a longer read length platform for annotation. The combination of the two platforms allows us to exploit the advantages of both to reach an accurate quantification together with the functional and ontology annotation of the transcripts. Tag-based methods such as DGE, SAGE and iGentifier have been, and will continue to be, broadly applied in global gene expression profiling to provide digital quantification with high confidence. The application of tag-based methods is limited in species without a reference genome or large scale EST data, because the 17 to 50 base sequence tags normally cannot be accurately annotated [16, 17]. This limitation requires new complementary strategies. The recent development of 454 sequencing enables the high-throughput sequencing of ESTs with read lengths up to 400 bases, which can be readily assembled and annotated [16, 17]. The combination of long and short read sequencing platforms will allow us to explore gene expression in a broad spectrum of species regardless of the available genome information and to quantify gene expression with the most accurate transcriptome analysis platforms like DGE or SAGE [21, 26].
The software package directly addresses the needs of cross-platform sequence-based transcriptome analysis and enables next-generation sequencing-based transcriptome analysis. Despite the diverse tools developed for next generation sequencing analysis, few software packages directly handle cross-platform transcriptome analysis data [1–3, 22, 33]. The current version of the package allows us to take Solexa and SOLiD short sequence tags along with the iGentifier data as the input for sequence quantification, and to take the annotated 454 or other cDNA sequence data as the source of annotation, thus making the best use of short- and long-read sequence data. We will expand the application to MPSS and SAGE in the future. As described in the results, the software thus provides a comprehensive solution for combined analysis of sequence tags and annotated ESTs or cDNAs. In order to implement a sequence matching function with a reasonable speed, we also compared different algorithms and determined that Megablast serves as the best option for handling the large datasets generated by next generation sequencing.
Global gene expression profiling is a crucial component of functional genomics, and transcriptome analysis tools have been under continuous development [2, 22]. Traditional transcriptome analysis platforms include microarray, SAGE, and real-time PCR [21, 26, 27, 34]. The development of next generation sequencing has enabled many novel transcriptome tools, among which sequence tag-based DGE has the promise to become the most accurate option for transcriptome analysis [20, 21]. The recently available RNAseq and other methods can also be very powerful for transcriptome analysis in species with adequate sequence information [15, 17]. However, the performance of RNAseq relative to microarray technology has not been studied as thoroughly as that of DGE. More importantly, the application of RNAseq in species with no genome sequence might be difficult because of the complicated assembly of short sequence tags. Even though several new assemblers for short read sequences have been developed, these new assemblers are currently applied mostly in microbial genome studies [1, 5, 13, 35]. Cross-platform global gene expression profiling thus represents a viable choice for transcriptome analysis in species without a reference genome because it combines the high accuracy of DGE and the sequence information from 454 or ESTs. CPTRA provides enabling software for such analysis.
The case study for the glyphosate resistance data revealed several important considerations for applying CPTRA and cross-platform strategies. First, the input sequence coverage is important. As shown in Figure 3, the addition of 454 sequencing data greatly increased the number of annotated tags. Essentially, increasing transcriptome coverage for the input long read sequences serves to provide more tags for annotation. Second, the quality of the sequence assembly is important. Figure 4 shows a significant number of multiple hit annotations. A detailed analysis of these multiple hit annotations indicated that many of them are 454 ESTs that have the same functional annotation; these ESTs can overlap with one another but failed to meet the assembly criteria. The unassembled 454 singletons can lead to the annotation of the same sequence multiple times. Therefore, both sequence coverage and proper assembly are important for deriving the correct annotation of the sequence tags. Overall, our study showed that CPTRA and cross-platform sequence analysis are powerful solutions for transcriptome analysis in species with limited genome sequence information. Further deep sequencing of the samples with Solexa and more 454 sequencing will allow us to better understand the molecular and genomic mechanisms of glyphosate resistance in horseweed. The strategy can be used to study a variety of biological questions in many species without a reference genome.
All of the plant growth, RNA extraction, sequencing and other bench work were as previously described. The dataset was collected from the published data for further analysis with the CPTRA platform. The sequence assembly was performed using the TGICL software (http://compbio.dfci.harvard.edu/tgi/software/) with the default settings. The contig sequences were compared with the UniProtKB TrEMBL database using the blastx program (ncbi-blast package). The annotations, including functional description and GO annotation, were parsed from the top hit of each contig with an E-value cutoff of 1e-10.
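The top-hit selection described above can be sketched as follows, assuming the blastx results were saved in the standard 12-column tab-separated BLAST tabular format (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score). The output format actually used in the study is not stated, so the format, the function name, and the choice of bit score to rank hits are assumptions of this sketch.

```python
def top_hits(lines, max_evalue=1e-10):
    """Keep the best-scoring hit per query among hits passing the E-value cutoff.

    Assumes 12-column tab-separated BLAST tabular output; lines starting
    with '#' (comment lines) and blank lines are skipped.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit is weaker than the cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

The subject identifier of each retained hit could then be used to look up the functional description and GO terms of the corresponding database entry; contigs with no hit passing the cutoff are simply absent from the returned dictionary and remain unannotated.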
The software package is implemented in Python. Internally, a major component of the CPTRA package is a class providing universal data handling functionalities, i.e., grouping tags and producing output. The functionalities specific to individual sequencing platforms are implemented by subclassing. Currently, the functions implemented by subclassing include parsing files with different formats and aligning short read data to cDNA sequences. The CPTRA package calls Megablast to align the reads to the reference.
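The design described above, a base class for shared data handling with platform-specific behavior supplied by subclasses, might look roughly like the following sketch. All class names, method names, and the two input formats are hypothetical and chosen only for illustration; they do not reflect the actual CPTRA classes or the real file formats of any sequencing platform, and the alignment step is omitted.

```python
from collections import Counter


class TagDataset:
    """Platform-independent handling: group tags and produce output rows."""

    def __init__(self):
        self.counts = Counter()

    def parse_line(self, line):
        """Return (tag, count) or None; implemented by platform subclasses."""
        raise NotImplementedError

    def load(self, lines):
        """Accumulate tag counts from an iterable of input lines."""
        for line in lines:
            parsed = self.parse_line(line)
            if parsed is not None:
                tag, count = parsed
                self.counts[tag] += count
        return self

    def to_rows(self):
        """Unique tags sorted by descending count, then alphabetically."""
        return sorted(self.counts.items(), key=lambda kv: (-kv[1], kv[0]))


class PlainTagDataset(TagDataset):
    """Hypothetical format: one tag per line, each occurrence counted once."""

    def parse_line(self, line):
        tag = line.strip().upper()
        return (tag, 1) if tag else None


class CountedTagDataset(TagDataset):
    """Hypothetical format: tag and precomputed count separated by a tab."""

    def parse_line(self, line):
        line = line.strip()
        if not line:
            return None
        tag, count = line.split("\t")
        return tag.upper(), int(count)
```

With this layout, supporting an additional platform would only require a new subclass that overrides `parse_line`, while grouping and output stay in the base class.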
Three types of input data from the glyphosate resistance study were used. First, a previously published Unigene set was used to perform the cross-platform transcriptome analysis with the iGentifier dataset. Second, the Unigene dataset was assembled together with an unpublished 454 dataset (Peng, unpublished data) and further annotated. The combined dataset was then analyzed together with the iGentifier dataset for the cross-platform transcriptome analysis. The single hit tags were then clustered based on the iGentifier expression level. The cluster analysis was carried out with MEV4.0 (Multiple Experiment Viewer), which allowed the output of the cluster results and individual gene expression patterns.
We deeply appreciate Ryan D. Syrenne for the thorough editing of the text.
The research is supported by Texas Agrilife Research and Monsanto Inc.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 11, 2009: Proceedings of the Sixth Annual MCBIOS Conference. Transformational Bioinformatics: Delivering Value from Genomes. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S11.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.