NGS-Trex: Next Generation Sequencing Transcriptome profile explorer
© Mignone et al.; licensee BioMed Central Ltd. 2013
Published: 22 April 2013
Next-Generation Sequencing (NGS) technology has greatly increased the ability to sequence DNA in a massively parallel and cost-effective manner. Nevertheless, NGS data analysis requires bioinformatics skills and computational resources well beyond the possibilities of many "wet biology" laboratories. Moreover, most projects require only a few sequencing cycles and standard tools or workflows to carry out suitable analyses for the identification and annotation of genes, transcripts and splice variants found in the biological samples under investigation. These projects can benefit from easy-to-use systems that automatically analyse sequences and mine data without requiring a strong bioinformatics background or hardware infrastructure.
To address this issue we developed an automatic system targeted to the analysis of NGS data obtained from large-scale transcriptome studies. This system, which we named NGS-Trex (NGS Transcriptome profile explorer), is available through a simple web interface (http://www.ngs-trex.org). It allows the user to upload raw sequences and easily obtain an accurate characterization of the transcriptome profile after setting the few parameters required to tune the analysis procedure. The system is also able to assess differential expression at both gene and transcript level (i.e. splicing isoforms) by comparing the expression profiles of different samples.
By using simple query forms the user can obtain lists of genes, transcripts and splice sites, ranked and filtered according to several criteria. Data can be viewed as tables, text files or through a simple genome browser which helps the visual inspection of the data.
NGS-Trex is a simple tool for RNA-Seq data analysis mainly targeted to "wet biology" researchers with limited bioinformatics skills. It offers simple data mining tools to explore the transcriptome profiles of samples investigated with NGS technologies.
Although Next-Generation Sequencing (NGS) technologies are becoming increasingly accessible and cost-effective, and a large number of analysis tools are regularly made available to the research community, the throughput level reached and the complexity of the analysis still pose serious problems related to the management and interpretation of data.
While the conceptual analysis pipeline to deal with NGS data is relatively easy to draw, the actual implementation can be challenging. Indeed, the simple mapping of sequences onto a reference genome requires adequate computational power and properly set-up systems. Things get even more complicated when it comes to comparing data with annotated features, because of the need for up-to-date databases and automated analysis procedures.
Many tools are available to perform those tasks (for a recent review and comparison see ), some of which can run on remote servers and thus reduce hardware and software requirements; nevertheless, they still require some skill to install, or suitable training to use.
The efforts needed to set up the infrastructure and to acquire the required skills are beyond the possibilities of many laboratories and are often not justified for projects where only a few deep sequencing cycles are needed to address specific biological questions.
To fill this gap we developed an automatic system targeted to the analysis of Next Generation Sequencing data obtained from large-scale transcriptome studies. This system, which we named NGS-Trex (NGS Transcriptome profile explorer), is available through a simple web interface (http://www.ngs-trex.org); besides requiring only simple user interaction, it offers an accurate characterization of the transcriptome profile of the samples under investigation. The system is also able to assess differential expression at both gene and transcript level (i.e. splicing isoforms) by comparing the expression profiles of different datasets.
Among the different tools available, the only one that can be directly compared to NGS-Trex is the recently published GeneProf system. GeneProf (http://www.geneprof.org/GeneProf/) and NGS-Trex share the same philosophy, both being web-based and intended to offer an "easy to use" tool for RNA-Seq data analysis. The well-established Galaxy tool can also be used to analyse RNA-Seq data. However, Galaxy is a more general-purpose framework and requires substantial user interaction at each stage of the workflow, resulting in a relatively complex tool with a steep learning curve.
To complete the overall procedure the user has to perform three simple steps: 1) create a "project" and upload the sequences; 2) tune the analysis parameters; 3) mine data for relevant information.
The basic analysis workflow implemented in our tool follows a common scheme for RNA-Seq data. It is organized into four steps: 1) sequence filtering to discard low-quality reads, to remove cloning linkers and to identify, whenever possible, the sequence strand; 2) mapping of reads onto the reference genome; 3) comparison of mapped reads to annotated features; 4) identification of "interesting" features (such as highly expressed genes, un-annotated splicing events, significant differences in expression levels).
The annotation process is critical, and several factors may affect the quality of the annotation. The main problem is related to ambiguities in the correct assignment of a read to a gene. Two distinct classes of ambiguities can be identified: we define as class I ambiguities those generated by overlapping genes, and as class II ambiguities those caused by reads that are not uniquely mapped on the genome (mainly due to paralogous genes, pseudogenes or gene clusters).
Several analysis workflows simply bypass class II ambiguities by discarding sequences that are not uniquely mapped. The main problem with this approach is that it introduces an experimental bias by altering the read counts of gene families. Our system attempts to solve ambiguities using information about the relative orientation of genes and the confidence of the assignment of the read to the ambiguous genes, and by relying on strand information obtained from spliced reads. Indeed, the donor and acceptor sites of spliced reads can be used to guess the correct strand of reads when they are not oriented.
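As an illustration of how donor and acceptor sites can suggest the strand of an unoriented spliced read, the following minimal sketch considers only canonical GT-AG introns. It is not the NGS-Trex implementation; the function name and its inputs are hypothetical.

```python
def infer_strand(intron_start_dinuc, intron_end_dinuc):
    """Guess the transcript strand from the two intron-boundary dinucleotides,
    both read on the forward genomic strand.

    A canonical GT...AG intron of a forward-strand gene reads GT at the intron
    start and AG at the intron end. For a reverse-strand gene the same intron
    appears reverse-complemented, i.e. CT at the start and AC at the end.
    Returns '+', '-' or None when the boundaries are not canonical.
    """
    pair = (intron_start_dinuc.upper(), intron_end_dinuc.upper())
    if pair == ("GT", "AG"):
        return "+"
    if pair == ("CT", "AC"):
        return "-"
    return None  # non-canonical boundaries: strand cannot be guessed here


if __name__ == "__main__":
    print(infer_strand("GT", "AG"))
    print(infer_strand("CT", "AC"))
    print(infer_strand("AA", "TT"))
```

A read whose strand is inferred this way could then be treated as oriented in the subsequent assignment step.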
If reads are oriented (or spliced) and overlapping genes are on opposite strands, ambiguities can be easily solved and the read is assigned to the gene on the same strand; conversely, if reads are not oriented and/or genes are on the same strand, the correct assignment of a read to the right gene is more challenging. However, a reasonable solution to this issue is the assignment of the read to the competing genes with different levels of confidence (the order of assignment of reads is "T" > "G" > "P"). Whenever an ambiguity cannot be solved, the user can choose either to discard the read or to assign the same read to both genes.
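The strand-based part of this reasoning can be sketched as follows. This is only an illustration under simple assumptions (each candidate gene is described by a name and a strand, and an unoriented read has strand None); the function and data layout are hypothetical and do not reproduce the actual NGS-Trex code.

```python
def resolve_by_strand(read_strand, candidate_genes):
    """Keep only the overlapping genes compatible with the read strand.

    read_strand: '+', '-' or None (read not oriented).
    candidate_genes: list of (gene_name, gene_strand) tuples.
    Returns the names of the genes that remain candidates.
    """
    if read_strand is None:
        # No orientation available: the ambiguity cannot be reduced by strand.
        return [name for name, _ in candidate_genes]
    # Oriented (or spliced) read: keep the genes lying on the same strand.
    return [name for name, strand in candidate_genes if strand == read_strand]


if __name__ == "__main__":
    genes = [("GENE_A", "+"), ("GENE_B", "-")]
    print(resolve_by_strand("+", genes))   # one gene left: ambiguity solved
    print(resolve_by_strand(None, genes))  # both genes remain candidates
```

When more than one gene remains, the confidence levels and the user's choice (discard or assign to all competing genes) would decide the outcome.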
Finding "interesting features"
In addition to read assignment to genes and transcripts, NGS-Trex identifies un-annotated splice sites. Putative new splice sites are classified according to their donor/acceptor sites and to the number of supporting reads. However, they are not assembled into transcript models.
Finally, whenever two or more datasets are available for the same project, the system automatically identifies genes and introns that show a differential expression pattern. As pointed out in  a standard procedure for the identification of differentially expressed (DE) genes is not yet available. Moreover, the current implementation of DE gene calculation in NGS-Trex treats each pair of datasets separately (biological replicates - if available - are not taken into account). In this context Fisher's exact test appears adequate to reliably explore putative DE genes and was implemented in our tool.
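For readers unfamiliar with Fisher's exact test, a minimal pure-Python sketch of the two-sided test on a 2x2 table is given below. The table layout (reads assigned to the gene versus all other mapped reads, in each of two datasets) is an assumption made for illustration, as are the function names; the sketch is suitable only for small counts and is not the NGS-Trex implementation.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    The p-value is the sum of the hypergeometric probabilities of all tables
    with the same margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of observing x in the top-left cell given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))  # small tolerance for float ties
    return min(1.0, total)


def de_pvalue(gene_reads_1, mapped_reads_1, gene_reads_2, mapped_reads_2):
    """Assumed layout: gene reads vs. remaining mapped reads, per dataset."""
    return fisher_exact_two_sided(gene_reads_1, mapped_reads_1 - gene_reads_1,
                                  gene_reads_2, mapped_reads_2 - gene_reads_2)


if __name__ == "__main__":
    print(de_pvalue(3, 4, 1, 4))
```

Because the total number of mapped reads enters each row of the table, differences in sequencing depth between the two datasets are accounted for in the comparison.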
Data mining tools
Upon completion of the analysis process, statistical summaries of the data are provided together with several data mining tools.
Statistics show summary information about the parameters used for the analysis, length distribution charts for both submitted and processed reads, and a flowchart of the reads classified during the mapping and annotation processes. Finally, an overview of specific features, such as the number of novel introns and of differentially expressed genes or introns, is shown.
The "DE genes" and "DE introns" panels can be used to highlight genes and introns that are significantly differentially expressed, i.e. up- or down-regulated between a reference and the other datasets.
Finally, with the "New introns" query form it is possible to obtain a list of un-annotated splicing events, and with the "Transcripts" query panel it is possible to investigate the relation between reads and known transcripts by identifying the position of reads with respect to CDS and UTR regions.
Results are shown as tables and can be downloaded as tab-delimited text files to be easily imported into a spreadsheet program. A basic genome browser interface is also available to allow visual inspection of data and to simplify the download of reads assigned to a gene or mapped onto a genomic region.
To benchmark our system we processed the sequences obtained by Carraro et al. with the Titanium 454-Roche platform. Sequences are available at the NCBI Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/Traces/sra) with accession SRA012436. The authors carefully analysed the data and performed experimental validation to confirm - among other findings - splicing variants, enrichment in alternative splicing events and alterations in the expression levels of several genes. Sequences were obtained from a pool of cDNA produced from two distinct cell lines, HB4a and HB4aC5.2. A barcoding sequence was inserted to discriminate between the two datasets.
We first analysed the whole dataset mainly using our default values (we only tuned the overlap and minimum similarity thresholds for the mapping step to match the values used in the paper: 70% and 96% respectively).
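To make the two thresholds concrete, the sketch below shows one plausible way such a filter could be expressed. It assumes that "overlap" means the fraction of the read covered by the alignment and that "similarity" means the fraction of matching positions within the aligned region; both definitions, and the function name, are assumptions for illustration and may differ from what NGS-Trex or the aligners actually compute.

```python
def passes_mapping_thresholds(read_length, aligned_length, matches,
                              min_overlap=0.70, min_similarity=0.96):
    """Return True if an alignment satisfies both thresholds.

    overlap    = aligned_length / read_length   (assumed definition)
    similarity = matches / aligned_length       (assumed definition)
    """
    if read_length <= 0 or aligned_length <= 0:
        return False
    overlap = aligned_length / read_length
    similarity = matches / aligned_length
    return overlap >= min_overlap and similarity >= min_similarity


if __name__ == "__main__":
    # 80% of the read aligned with 97.5% identity: kept.
    print(passes_mapping_thresholds(100, 80, 78))
    # Only 60% of the read aligned: rejected.
    print(passes_mapping_thresholds(100, 60, 60))
```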
[Table; numeric values not preserved in this copy. Column headers: Label given in Carraro et al. | Carraro et al. | Corresponding label in NGS-Trex. Labels, in the order given: Total reads (1, 2); Processed reads (2); Partly aligning to Human Genome; Mapped low similarity (2); Completely aligning to Human Genome; Mapped reads (1, 2); Mapped at one genome position; Mapped with low frequency (2); Mapped against Known Gene DB; Mapped against RefSeq; Classified as "T" (2); Identified transcripts (2); Identified genes (1, 2).]
To identify DE genes we processed the sequences again, using the barcoding information to discriminate between the two datasets. Although NGS-Trex does not explicitly deal with pooled samples, this can easily be achieved by creating as many datasets as the barcodes used and defining the proper barcode sequence as the linker in the filtering step of the procedure.
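The idea of separating pooled reads by barcode can be sketched as below. The sketch assumes that the barcode sits at the 5' end of each read and must match exactly; the barcode sequences in the example are invented placeholders, not those used in the benchmark, and the function is hypothetical rather than part of NGS-Trex.

```python
def split_by_barcode(reads, barcodes):
    """Split pooled reads into one dataset per barcode.

    reads: iterable of (read_id, sequence) tuples.
    barcodes: dict mapping dataset name -> barcode sequence.
    Returns a dict dataset name -> list of (read_id, trimmed_sequence);
    reads matching no barcode are collected under the key None.
    """
    datasets = {name: [] for name in barcodes}
    datasets[None] = []
    for read_id, seq in reads:
        for name, barcode in barcodes.items():
            if seq.upper().startswith(barcode.upper()):
                # Remove the barcode, as a linker would be removed.
                datasets[name].append((read_id, seq[len(barcode):]))
                break
        else:
            datasets[None].append((read_id, seq))
    return datasets


if __name__ == "__main__":
    pooled = [("r1", "ACGTTTGGA"), ("r2", "TGCAGGCCA"), ("r3", "GGGGGGGG")]
    # Placeholder barcodes, for illustration only.
    print(split_by_barcode(pooled, {"HB4a": "ACGT", "HB4aC5.2": "TGCA"}))
```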
[Table: Novel AS Events; numeric values not preserved in this copy. Column headers: Carraro et al. | Carraro et al. (*) | NGS-Trex (N > 1) | NGS-Trex (N > 2).]
It is beyond the scope of this discussion to make a detailed comparison between the results obtained from the two methods. The differences can be explained by different parameters and software used. The goal of this comparison was simply to demonstrate that results obtained with our system with a very simple submission of sequences, without requiring a deep knowledge of underlying analysis procedures, are qualitatively comparable to those obtained from a manually curated analysis that involves the combination of custom procedures with the usage of external tools.
Comparison with other tools
Several tools are already available for the analysis of NGS data (for a detailed comparison table see supplementary note in ). Most of them require complex installation procedures and have specific hardware requirements, or need substantial user interaction to perform the analysis. To our knowledge, the only available tool comparable to NGS-Trex is GeneProf. Nonetheless, this tool seems more focused on the early stages of the analysis: it offers automatic tools to import sequences from the SRA database and shows many plots about sequence quality, base composition and read alignment statistics. On the other hand, we were not able to find information about splicing or about annotation at transcript level. Finally, we did not find any method to download specific sequence sets (e.g. reads assigned to a specific gene). This is quite important from a "wet biologist" perspective, as it allows careful inspection of sequences to identify mutations or to design primers for experimental validation of observed data.
We do not exclude that a more experienced user could extract this information with GeneProf, but in this respect NGS-Trex seems more user-friendly.
NGS-Trex is a user-friendly system to analyse RNA-Seq data. Indeed, to our knowledge no other web tool is currently available to visualize and download data mining results as well as specific subsets of sequence reads (e.g. those assigned to a specific gene or supporting a splice site). These read subsets can be used for further analyses with third-party software. For example, the current NGS-Trex version does not allow the direct identification of SNPs; nonetheless, while a genome-wide identification of SNPs is not possible, the exploration and extraction tools currently provided make it very easy to download the sequences assigned to a specific gene and - by using external resources - process only the relevant sequences to obtain the required result.
The reliability of the results provided by NGS-Trex is supported by the benchmark assessment, taking into account that the observed discrepancies may be explained by the different tools and parameters used as well as by differences in the genome feature annotation considered. Our benchmark assessment has shown that the results obtained by our system are comparable to those obtained from a manually curated analysis that combines custom procedures with external tools.
The analysis workflow has been implemented as a combination of custom Perl and PHP scripts. Data are stored as a combination of flat files and relational databases using the MySQL RDBMS. Several tools have been included in the pipeline, namely 1) NGS QC Toolkit for the processing of fastq sequences, 2) GMAP (v. 2011-10-04) and 3) TopHat (v. 2.0.6) for the alignment step. All programs are used with default values.
Comparison with annotation is performed using a custom database which is mainly populated from several sources obtained from the NCBI ftp site. Gene mapping information is derived from NCBI, while RefSeq and GenBank sequences (assigned to genes as part of UniGene clusters) are mapped onto the reference genome using GMAP. Sequences are updated every 6 months.
NGS-Trex currently provides the genomes of Homo sapiens (hg18, hg19), Mus musculus (mm9) and a few bacteria. More genomes will be added, also upon request. As previously described, a critical step in the analysis procedure is read assignment in the case of ambiguities. The user can define whether to discard ambiguously assigned reads or to assign those reads to all competing genes. Moreover, it is possible to trigger a procedure that tries to solve ambiguities. This procedure adopts several strategies: the first criterion is the confidence level of the read assignment, the order from higher to lower being "T", "G", "P" (see text and Figure 2 for details). Whenever a read is assigned to more than one gene with the same confidence, the procedure can still solve ambiguities for spliced reads if 1) the read supports an annotated splicing site, or 2) the read is not oriented but the correct strand can be guessed using donor and acceptor sites (in this case the read is treated as oriented and is assigned to the gene on the same strand). If - after this procedure - it is still not possible to assign a read to a single gene, it can be either assigned to all competing genes or discarded.
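The first criterion of this procedure, together with the final fallback, can be sketched as follows. The sketch covers only the confidence ranking ("T" before "G" before "P") and the user's choice on unresolved ties; the intermediate steps based on spliced reads are deliberately omitted, and all names are hypothetical rather than taken from NGS-Trex.

```python
# Lower rank means higher confidence, following the order "T", "G", "P".
CONFIDENCE_RANK = {"T": 0, "G": 1, "P": 2}


def resolve_by_confidence(assignments, on_tie="discard"):
    """Pick the gene(s) a read should be assigned to.

    assignments: dict mapping gene name -> confidence label ('T', 'G' or 'P').
    on_tie: 'discard' to drop reads that stay ambiguous,
            'all' to assign them to every competing gene.
    Returns a sorted list of gene names (empty if the read is discarded).
    """
    if not assignments:
        return []
    best = min(CONFIDENCE_RANK[label] for label in assignments.values())
    winners = sorted(gene for gene, label in assignments.items()
                     if CONFIDENCE_RANK[label] == best)
    if len(winners) == 1 or on_tie == "all":
        return winners
    return []  # still ambiguous and the user chose to discard


if __name__ == "__main__":
    print(resolve_by_confidence({"GENE1": "T", "GENE2": "G"}))
    print(resolve_by_confidence({"GENE1": "G", "GENE2": "G"}, on_tie="all"))
```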
Differential expression (both at gene and splice site level) is evaluated applying Fisher's exact test. Read counts from different samples are normalized by the total number of reads mapped onto the genome.
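As a small illustration of normalizing by the number of mapped reads, the sketch below rescales raw counts so that datasets of different sequencing depth become comparable. Expressing the result per million mapped reads is an assumption made here for readability; the scale actually used by NGS-Trex is not stated, and the function name is hypothetical.

```python
def normalize_counts(counts, total_mapped_reads, scale=1_000_000):
    """Rescale raw read counts by the total number of mapped reads.

    counts: dict mapping feature name (gene or splice site) -> raw read count.
    total_mapped_reads: number of reads mapped onto the genome in the dataset.
    scale: assumed reporting unit (reads per `scale` mapped reads).
    """
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return {name: n * scale / total_mapped_reads for name, n in counts.items()}


if __name__ == "__main__":
    # The same raw count weighs half as much in a dataset twice as deep.
    print(normalize_counts({"GENE1": 50}, 1_000_000))
    print(normalize_counts({"GENE1": 50}, 2_000_000))
```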
Concerning input data, NGS-Trex can handle long reads (454-Roche) and short reads (Illumina), of fixed or variable length. Files in fasta or fastq format are accepted. Paired-end reads are currently not supported (although they can be analysed as single-end reads). Multiplexed runs can be processed.
This article was supported by Ministero dell'Istruzione, dell'Università e della Ricerca: FIRB "Futuro in ricerca" code RBFR08PWXIF.
This work was supported by "Ministero dell'Istruzione, Università e Ricerca" (MIUR, Italy): "PRIN 2009- Large-scale identification and characterization of alternative splicing in the human transcriptome through computational and experimental approaches"; Consiglio Nazionale delle Ricerche: Flagship Project "Epigen", PNR-CNR Aging Program 2012-2014, Italian Research Foundation for amyotrophic lateral sclerosis (AriSLA) and by Italian Ministry for Foreign Affairs (Italy-Israel actions).
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 7, 2013: Italian Society of Bioinformatics (BITS): Annual Meeting 2012. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S7
- Halbritter F, Vaidya HJ, Tomlinson SR: GeneProf: analysis of high-throughput sequencing experiments. Nature Methods. 2011, 9: 7-8. 10.1038/nmeth.1809.
- Goecks J, Nekrutenko A, Taylor J, The Galaxy Team: Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010, 11 (8): R86. 10.1186/gb-2010-11-8-r86.
- Wu TD, Watanabe CK: GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics. 2005, 21: 1859-1875. 10.1093/bioinformatics/bti310.
- Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25: 1105-1111. 10.1093/bioinformatics/btp120.
- Costa V, Angelini C, De Feis I, Ciccodicola A: Uncovering the Complexity of Transcriptomes with RNA-Seq. Journal of Biomedicine and Biotechnology. 2010, 2010: 1-20.
- Kvam VM, Liu P, Si Y: A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. Am J Bot. 2012, 99: 248-256. 10.3732/ajb.1100340.
- Auer PL, Doerge RW: Statistical design and analysis of RNA sequencing data. Genetics. 2010, 185: 405-416. 10.1534/genetics.110.114983.
- Carraro DM, Ferreira EN, de Campos Molina G, Puga RD, Abrantes EF, Trapé AP, Eckhardt BL, Ekhardt BL, Nunes DN, Brentani MM, Arap W, Pasqualini R, Brentani H, Dias-Neto E, Brentani RR: Poly (A)+ transcriptome assessment of ERBB2-induced alterations in breast cell lines. PLoS ONE. 2011, 6: e21022. 10.1371/journal.pone.0021022.
- Patel RK, Jain M: NGS QC Toolkit: A Toolkit for Quality Control of Next Generation Sequencing Data. PLoS ONE. 2012, 7: e30619. 10.1371/journal.pone.0030619.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.