MAP-RSeq: Mayo Analysis Pipeline for RNA sequencing
- Krishna R Kalari†1,
- Asha A Nair†1,
- Jaysheel D Bhavsar1,
- Daniel R O’Brien1,
- Jaime I Davila1,
- Matthew A Bockol1,
- Jinfu Nie1,
- Xiaojia Tang1,
- Saurabh Baheti1,
- Jay B Doughty1,
- Sumit Middha1,
- Hugues Sicotte1,
- Aubrey E Thompson2,
- Yan W Asmann3 and
- Jean-Pierre A Kocher1, 4
© Kalari et al.; licensee BioMed Central Ltd. 2014
Received: 22 February 2014
Accepted: 23 June 2014
Published: 27 June 2014
Although the costs of next generation sequencing technology have decreased over the past years, there is still a lack of simple-to-use applications for comprehensive analysis of RNA sequencing data. There is no one-stop shop for transcriptomic genomics. We have developed MAP-RSeq, a comprehensive computational workflow that can be used to obtain genomic features from transcriptomic sequencing data for any genome.
To optimize tools and parameters, MAP-RSeq was validated using both simulated and real datasets. The MAP-RSeq workflow consists of six major modules: alignment of reads, quality assessment of reads, gene expression assessment and exon read counting, identification of expressed single nucleotide variants (SNVs), detection of fusion transcripts, and summarization of transcriptomics data with a final report. This workflow is available for human transcriptome analysis and can be easily adapted and used for other genomes. Several clinical and research projects at the Mayo Clinic have applied the MAP-RSeq workflow for RNA-Seq studies. The results from MAP-RSeq have thus far enabled clinicians and researchers to understand the transcriptomic landscape of diseases for better diagnosis and treatment of patients.
Our software provides gene counts, exon counts, fusion candidates, expressed single nucleotide variants, mapping statistics, visualizations, and a detailed research data report for RNA-Seq. The workflow can be executed on a standalone virtual machine or on a parallel Sun Grid Engine cluster. The software can be downloaded from http://bioinformaticstools.mayo.edu/research/maprseq/.
Next generation sequencing (NGS) technology breakthroughs have allowed us to define the transcriptomic landscape for cancers and other diseases. RNA-Sequencing (RNA-Seq) is information-rich; it enables researchers to investigate a variety of genomic features, such as gene expression, characterization of novel transcripts, alternative splice sites, single nucleotide variants (SNVs), fusion transcripts, long non-coding RNAs, small insertions, and small deletions. Multiple software packages are available for read alignment, quality control, gene expression and transcript quantification for RNA-Seq [2–5]. However, the majority of the RNA-Seq bioinformatics methods focus on only a few genomic features for downstream analysis [6–9]. At present there is no comprehensive RNA-Seq workflow that can simply be installed and used for multiple genomic feature analysis. At the Mayo Clinic, we have developed MAP-RSeq, a comprehensive computational workflow to align, assess and report multiple genomic features from paired-end RNA-Seq data efficiently with a quick turnaround time. We have tested a variety of tools and methods to accurately estimate genomic features from RNA-Seq data. The best performing publicly available bioinformatics tools, along with optimized parameters, were included in our workflow. Where needed, we have integrated in-house methods or tools to fill in the gaps. We have thoroughly investigated and compared the available tools and have optimized parameters to make the workflow run seamlessly in both virtual machine and cluster environments. Our software has been tested with paired-end sequencing reads from all Illumina platforms. Thus far, we have processed 1,535 Mayo Clinic samples using the MAP-RSeq workflow. The MAP-RSeq research reports for RNA-Seq data have enabled Mayo Clinic researchers and clinicians to exchange datasets and findings.
Standardizing the workflow has allowed us to build a system that enables investigation across multiple studies within the Mayo Clinic. MAP-RSeq is a production application that allows researchers with minimal expertise in Linux or Windows to install the workflow and to analyze and interpret RNA-Seq data.
MAP-RSeq uses a variety of freely available bioinformatics tools along with in-house developed methods using Perl, Python, R, and Java. MAP-RSeq is available in two versions. The first version is single threaded and runs on a virtual machine (VM). The VM version is straightforward to install. The second version is multi-threaded and is designed to run on a cluster environment.
The virtual machine version of MAP-RSeq is available for download from the MAP-RSeq website (http://bioinformaticstools.mayo.edu/research/maprseq/). It includes a sample dataset, references (limited to chromosome 22), and the complete MAP-RSeq workflow pre-installed. Virtual Box software (free for Windows, Mac, and Linux at https://www.virtualbox.org/wiki/Downloads) needs to be installed on the host system. The host system also needs to meet the following requirements: at least 4GB of physical memory and at least 10GB of available disk space. Although our sample data is only from human chromosome 22, this virtual machine can be extended to the entire human reference genome or to other species. However, this requires allocating more memory (~16GB) than may be available on a typical desktop system, and building the index reference files for the species of interest.
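The host requirements above can be checked before importing the virtual machine. The following is a minimal sketch in Python 3, not part of MAP-RSeq; the function names are hypothetical, and the thresholds are simply the 4GB and 10GB figures quoted in the text. Physical memory is passed in by the caller because there is no portable standard-library call for it.

```python
import shutil

# Thresholds taken from the requirements stated in the text.
MIN_MEMORY_GB = 4
MIN_DISK_GB = 10


def free_disk_gb(path="."):
    """Return the free disk space (in GiB) on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1024 ** 3


def meets_requirements(memory_gb, disk_gb):
    """Return True when both memory and free disk meet the stated minimums.

    Hypothetical helper: `memory_gb` must be supplied by the caller.
    """
    return memory_gb >= MIN_MEMORY_GB and disk_gb >= MIN_DISK_GB
```

For example, `meets_requirements(8, free_disk_gb())` reports whether a host with 8GB of memory also has enough free disk space in the current directory.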
MAP-RSeq installation and run time for QuickStart virtual machine
- Download: ~20 minutes on consumer grade internet
- Time to import into VM: ~10 minutes
- Run time with sample data (chr22 only): ~30 minutes
MAP-RSeq installation and run time in a Linux environment
- Download: ~10 minutes on consumer grade internet
- ~6 hours (mostly downloading and indexing references)
- Depends on the sample data used
Sun grid engine
MAP-RSeq requires four processing cores with a total of 16GB RAM for optimal performance. It also requires 8GB of storage space for installing tools and reference files, plus additional storage space for analysing input data and writing output files. The following packages need to be preinstalled and referenced in the environment path: JAVA version 1.6.0_17 or higher, Perl version 5.10.0 or higher, Python version 2.7 or higher, Python-dev, Cython, Numpy and Scipy, gcc and g++, Zlib, Zlib-devel, ncurses, ncurses-devel, R, libgd2-xpm, and mailx. MAP-RSeq uses bioinformatics tools such as BEDTools, UCSC Blat, Bowtie, Circos, FastQC, GATK, HTSeq, Picard Tools, RSeQC, Samtools, and TopHat. Our user manual and README files provide detailed information on the dependencies, bioinformatics tools and parameters for MAP-RSeq. The application requires configuration files, such as run, tool and sample information files, as described in the user manual.
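Because the listed packages must be referenced in the environment path, a quick pre-flight check can save a failed run. The sketch below is illustrative only and is not shipped with MAP-RSeq. It is written for Python 3 (`shutil.which` is not available in Python 2.7), and the executable names are assumptions, since the paper lists packages rather than command names.

```python
import shutil

# Assumed command names for some of the packages listed in the text.
REQUIRED_EXECUTABLES = ["java", "perl", "python", "gcc", "g++", "R", "mailx"]


def find_missing(executables, which=shutil.which):
    """Return the names in `executables` that cannot be located on the PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [name for name in executables if which(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_EXECUTABLES)
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All listed executables were found on PATH.")
```

This only confirms that a command can be found; it does not verify the minimum versions stated above, and libraries such as Zlib or ncurses would need a separate check.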
Wall clock times to run MAP-RSeq at different read counts
MAP-RSeq processing time
Results and discussion
Several research and clinical projects [24–26] at Mayo Clinic have applied the MAP-RSeq workflow to obtain gene expression, single nucleotide variants and fusion transcripts for a variety of cancer and disease related studies. Currently there are multiple ongoing projects or clinical trial studies for which we generate both RNA-Sequencing and exome sequencing datasets at the Mayo Clinic Sequencing Core. We have developed our RNA-Seq and DNA-Seq workflows such that sequencing data can be directly supplied to the pipelines with less manual intervention. Analysis of the next generation sequencing datasets along with phenotype data enables further understanding of the genomic landscape to better diagnose and treat patients.
Gene expression and exon expression read counts
A gene expression count is defined as the sum of reads in exons for the gene, whereas an exon expression count is defined as the sum of reads in a particular exon of a gene. Gene expression counts in the MAP-RSeq pipeline can be obtained using HTSeq software (default) or featureCounts software. The gene annotation files were obtained from the Cufflinks website. Exon expression counts are obtained using the intersectBed function from the BEDTools Suite.
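The two definitions above can be illustrated with a small, self-contained sketch. This is not the HTSeq, featureCounts or intersectBed implementation; the function names, the tuple layout and the half-open coordinate convention are assumptions made for the example.

```python
from collections import defaultdict


def count_reads_per_exon(exons, reads):
    """Count reads overlapping each exon.

    exons: list of (gene, exon_id, chrom, start, end), half-open intervals
    reads: list of (chrom, start, end), half-open intervals
    """
    counts = {}
    for gene, exon_id, chrom, start, end in exons:
        n = 0
        for r_chrom, r_start, r_end in reads:
            # Two half-open intervals overlap when each starts before the other ends.
            if r_chrom == chrom and r_start < end and r_end > start:
                n += 1
        counts[(gene, exon_id)] = n
    return counts


def gene_counts_from_exons(exon_counts):
    """Sum exon counts per gene, following the definition in the text."""
    totals = defaultdict(int)
    for (gene, _exon_id), n in exon_counts.items():
        totals[gene] += n
    return dict(totals)


exons = [
    ("GENE_A", "e1", "chr22", 100, 200),
    ("GENE_A", "e2", "chr22", 300, 400),
    ("GENE_B", "e1", "chr22", 1000, 1100),
]
reads = [
    ("chr22", 150, 250),
    ("chr22", 190, 310),
    ("chr22", 350, 450),
    ("chr22", 1050, 1150),
    ("chr21", 150, 250),
]
exon_counts = count_reads_per_exon(exons, reads)
gene_counts = gene_counts_from_exons(exon_counts)
```

Note that in this naive sum a read spanning two exons of the same gene contributes to both exon counts and therefore twice to the gene total, whereas HTSeq assigns each read to a gene at most once. The sketch shows the definitions, not the exact numbers the pipeline would report.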
Alignment statistics from MAP-RSeq using simulated dataset from BEERS (rows reported: total number of single reads; reads used for alignment; total number of reads mapped; reads mapped to transcriptome; reads mapped to junctions; reads contributing to gene abundance; reads contributing to exon abundance; number of SNVs identified)
Each sample is associated with a phenotype, such as tumor, normal, treated or control, and that metadata needs to be obtained to form groups for differential expression analysis. Detailed quality control checks are required prior to gene expression analysis in order to remove any outlier samples. A variety of software packages are used for differential expression analysis of RNA-Seq gene expression data [4, 30–32]. Several published studies have compared the differential expression methods and concluded that there are substantial differences in sensitivity and specificity among them [33–35]. We have chosen the edgeR software, an R package, for gene expression analysis. The MAP-RSeq source code includes Perl and R scripts, along with instructions, that can be used after a MAP-RSeq run for differential expression analysis.
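The grouping step described above, and the library-size scaling that usually precedes a comparison between groups, can be sketched as follows. This is not edgeR's statistical model and not the Perl or R code distributed with MAP-RSeq; the function names and sample labels are hypothetical, and counts per million is used only as a simple, widely known normalization.

```python
from collections import defaultdict


def group_samples(phenotypes):
    """Group sample IDs by phenotype label.

    phenotypes: dict mapping sample ID -> phenotype (e.g. "tumor", "normal")
    """
    groups = defaultdict(list)
    for sample, phenotype in sorted(phenotypes.items()):
        groups[phenotype].append(sample)
    return dict(groups)


def counts_per_million(gene_counts):
    """Scale raw gene counts of one sample to counts per million mapped reads."""
    total = sum(gene_counts.values())
    if total == 0:
        raise ValueError("sample has no counted reads")
    return {gene: 1e6 * n / total for gene, n in gene_counts.items()}


groups = group_samples({"s1": "tumor", "s2": "normal", "s3": "tumor"})
cpm = counts_per_million({"GENE_A": 3, "GENE_B": 1})
```

A sample whose scaled profile looks very different from the rest of its group is a candidate outlier for the quality control checks mentioned in the text; the actual test for differential expression would still be run in edgeR.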
Expressed SNVs (eSNVs) from RNA-Seq
Fusion transcript detection
Summarization of data and final report
The workflow generates two main reports for end users: 1) a summary report for all samples in a run, with links to detailed reports and six QC visualizations per sample, and 2) a final data report folder that contains exon, gene, fusion and expressed SNV files with annotations for further statistical and bioinformatics analysis.
A screenshot of an example report from MAP-RSeq is shown in Figure 2. A complete form of the report is presented in the additional file provided (see Additional file 1). Detailed descriptions of the samples processed by MAP-RSeq along with the study design and experiment details are reported by the workflow. Results are summarized for each sample in the report. Detailed quality control information, links to gene expression counts, exon counts, variant files, fusion transcript information and various visualization plots are also reported.
MAP-RSeq is a comprehensive, simple-to-use application. MAP-RSeq reports alignment statistics, in-depth quality control statistics, gene counts, exon counts, fusion transcripts, and SNVs per sample. The output from the workflow can be plugged into other software or packages for subsequent downstream bioinformatics analysis. Several research and clinical projects at the Mayo Clinic have used the gene expression, SNV and fusion transcript reports from the MAP-RSeq workflow for a wide range of cancers and other disease-related studies. In the future, we plan to extend our workflow so that alternatively spliced transcripts and non-coding RNAs can also be obtained.
Availability and requirements
Project name: MAP-RSeq
Project home page: http://bioinformaticstools.mayo.edu/research/maprseq/
Operating system(s): Linux or VM
Programming language: PERL, Python, JAVA, R and BASH
Other requirements: none
License: Open Source
Any restrictions to use by non-academics: none
This work is supported by the Mayo Clinic Center for Individualized Medicine (CIM). KRK is supported by CIM and the Eveleigh Family Career Development Award. We acknowledge Jason Reisz from Appistry, Jason Weirather, Bruce Eckloff and Chris Kolbert for their constructive suggestions and feedback during the implementation of this workflow.
These studies were supported in part by funds from the Center for Individualized Medicine, Eveleigh Family Foundation (KRK), and the Mayo Foundation. Additional support was obtained from Pharmacogenomics Research Network (KRK) and Breast cancer SPORE career development award (KRK). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
- Barrett CL, Schwab RB, Jung H, Crain B, Goff DJ, Jamieson CHM, Thistlethwaite PA, Harismendy O, Carson DA, Frazer KA: Transcriptome sequencing of tumor subpopulations reveals a spectrum of therapeutic options for squamous cell lung cancer. PLoS One. 2013, 8 (3): e58714. 10.1371/journal.pone.0058714.
- Chen YH, Souaiaia T, Chen T: PerM: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds. Bioinformatics. 2009, 25 (19): 2514-2521. 10.1093/bioinformatics/btp486.
- Head SR, Mondala T, Gelbart T, Ordoukhanian P, Chappel R, Hernandez G, Salomon DR: RNA purification and expression analysis using microarrays and RNA deep sequencing. Methods Mol Biol. 2013, 1034: 385-403. 10.1007/978-1-62703-493-7_25.
- Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26 (1): 139-140. 10.1093/bioinformatics/btp616.
- Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, He X, Mieczkowski P, Grimm SA, Perou CM, MacLeod JN, Chiang DY, Prins JF, Liu J: MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res. 2010, 38 (18): e178. 10.1093/nar/gkq622.
- Goncalves A, Tikhonov A, Brazma A, Kapushesky M: A pipeline for RNA-seq data processing and quality assessment. Bioinformatics. 2011, 27 (6): 867-869. 10.1093/bioinformatics/btr012.
- Habegger L, Sboner A, Gianoulis TA, Rozowsky J, Agarwal A, Snyder M, Gerstein M: RSEQtools: a modular framework to analyze RNA-Seq data using compact, anonymized data summaries. Bioinformatics. 2011, 27 (2): 281-283. 10.1093/bioinformatics/btq643.
- Qi J, Zhao FQ, Buboltz A, Schuster SC: inGAP: an integrated next-generation genome analysis pipeline. Bioinformatics. 2010, 26 (1): 127-129. 10.1093/bioinformatics/btp615.
- Wang Y, Mehta G, Mayani R, Lu JX, Souaiaia T, Chen YH, Clark A, Yoon HJ, Wan L, Evgrafov OV, Knowles JA, Deelman E, Chen T: RseqFlow: workflows for RNA-Seq data analysis. Bioinformatics. 2011, 27 (18): 2598-2600.
- MAP-RSeq website. [http://bioinformaticstools.mayo.edu/research/maprseq/]
- Virtual Box download webpage. [https://www.virtualbox.org/wiki/Downloads]
- CGHub webpage. [https://cghub.ucsc.edu/]
- Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010, 26 (6): 841-842. 10.1093/bioinformatics/btq033.
- Kent WJ: BLAT–the BLAST-like alignment tool. Genome Res. 2002, 12 (4): 656-664. 10.1101/gr.229202. Article published online before March 2002.
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3): R25. 10.1186/gb-2009-10-3-r25.
- Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, Jones SJ, Marra MA: Circos: An information aesthetic for comparative genomics. Genome Res. 2009, 19 (9): 1639-1645. 10.1101/gr.092759.109.
- FastQC website. [http://www.bioinformatics.babraham.ac.uk/projects/fastqc/]
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA: The genome analysis toolkit: a map reduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010, 20 (9): 1297-1303. 10.1101/gr.107524.110.
- Anders S, Pyl PT, Huber W: HTSeq — A Python framework to work with high-throughput sequencing data. bioRxiv preprint. 2014.
- Picard Tools webpage. [http://picard.sourceforge.net]
- Wang LG, Wang SQ, Li W: RSeQC: quality control of RNA-seq experiments. Bioinformatics. 2012, 28 (16): 2184-2185. 10.1093/bioinformatics/bts356.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.
- Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25 (9): 1105-1111. 10.1093/bioinformatics/btp120.
- Egan JB, Barrett MT, Champion MD, Middha S, Lenkiewicz E, Evers L, Francis P, Schmidt J, Shi CX, Van Wier S, Badar S, Ahmann G, Kortuem KM, Boczek NJ, Fonseca R, Craig DW, Carpten JD, Borad MJ, Stewart AK: Whole genome analyses of a well-differentiated liposarcoma reveals novel SYT1 and DDR2 Rearrangements. PLoS One. 2014, 9 (2): e87113. 10.1371/journal.pone.0087113.
- Norton N, Sun Z, Asmann YW, Serie DJ, Necela BM, Bhagwate A, Jen J, Eckloff BW, Kalari KR, Thompson KJ, Carr JM, Kachergus JM, Geiger XJ, Perez EA, Thompson EA: Gene expression, single nucleotide variant and fusion transcript discovery in archival material from breast tumors. PLoS One. 2013, 8 (11): e81925. 10.1371/journal.pone.0081925.
- Sakuma T, Davila JI, Malcolm JA, Kocher JP, Tonne JM, Ikeda Y: Murine leukemia virus uses NXF1 for nuclear export of spliced and unspliced viral transcripts. J Virol. 2014.
- Liao Y, Smyth GK, Shi W: featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014, 30 (7): 923-930. 10.1093/bioinformatics/btt656.
- Cufflink index and annotation. [http://cufflinks.cbcb.umd.edu/igenomes.html]
- Grant GR, Farkas MH, Pizarro AD, Lahens NF, Schug J, Brunk BP, Stoeckert CJ, Hogenesch JB, Pierce EA: Comparative analysis of RNA-Seq alignment algorithms and the RNA-Seq unified mapper (RUM). Bioinformatics. 2011, 27 (18): 2518-2528.
- Hardcastle TJ, Kelly KA: baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinform. 2010, 11: 422. 10.1186/1471-2105-11-422.
- Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11 (10): R106. 10.1186/gb-2010-11-10-r106.
- Smyth GK: Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004, 3: Article 3.
- Soneson C, Delorenzi M: A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinform. 2013, 14: 91. 10.1186/1471-2105-14-91.
- Seyednasrollah F, Laiho A, Elo LL: Comparison of software packages for detecting differential expression in RNA-seq studies. Brief Bioinform. 2013.
- Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, Mason CE, Socci ND, Betel D: Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol. 2013, 14 (9): R95. 10.1186/gb-2013-14-9-r95.
- Kim D, Salzberg SL: TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biol. 2011, 12 (8): 1-
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.