SPARTA: Simple Program for Automated reference-based bacterial RNA-seq Transcriptome Analysis
© Johnson et al. 2016
Received: 2 November 2015
Accepted: 29 January 2016
Published: 4 February 2016
Many tools exist for the analysis of bacterial RNA sequencing (RNA-seq) transcriptional profiling experiments to identify differentially expressed genes between experimental conditions. Generally, the workflow includes quality control of reads, mapping to a reference, counting transcript abundance, and statistical tests for differentially expressed genes. In spite of the numerous tools developed for each component of an RNA-seq analysis workflow, easy-to-use bacterially oriented workflow applications to combine multiple tools and automate the process are lacking. With many tools to choose from for each step, the task of identifying a specific tool, adapting the input/output options to the specific use-case, and integrating the tools into a coherent analysis pipeline is not a trivial endeavor, particularly for microbiologists with limited bioinformatics experience.
To make bacterial RNA-seq data analysis more accessible, we developed a Simple Program for Automated reference-based bacterial RNA-seq Transcriptome Analysis (SPARTA). SPARTA is a reference-based bacterial RNA-seq analysis workflow application for single-end Illumina reads. SPARTA is turnkey software that simplifies the process of analyzing RNA-seq data sets, making bacterial RNA-seq analysis a routine process that can be undertaken on a personal computer or in the classroom. The easy-to-install, complete workflow processes whole transcriptome shotgun sequencing data files by trimming reads and removing adapters, mapping reads to a reference, counting gene features, calculating differential gene expression, and, importantly, checking for potential batch effects within the data set. SPARTA outputs quality analysis reports, gene feature counts and differential gene expression tables and scatterplots.
SPARTA provides an easy-to-use bacterial RNA-seq transcriptional profiling workflow to identify differentially expressed genes between experimental conditions. This software will enable microbiologists with limited bioinformatics experience to analyze their data and integrate next generation sequencing (NGS) technologies into the classroom. The SPARTA software and tutorial are available at sparta.readthedocs.org.
Keywords: Bioinformatics, Data analysis, Transcriptomics, Microbiology, Next-generation sequencing, High-throughput sequencing
One of the most common applications of RNA sequencing (RNA-seq) is to identify differentially expressed genes under differing experimental conditions. Before biological insights can be gained, one must process and analyze the large datasets generated from each sequencing experiment. Each sample contains millions of reads that must be trimmed and assessed for read quality, mapped back to a reference genome (or assembled de novo in the absence of a reference), counted for transcript abundance, and tested for differential gene expression. Many computational analysis tools have been developed specifically to work with RNA-seq data; however, no single tool is usually sufficient, and several different applications must be assembled into a workflow. This task can be complicated because both the choice of tool and the input and output file formats of each tool need to be considered and potentially modified to meet the requirements of the subsequent analysis step. Several RNA-seq analysis workflows exist; however, most are designed for eukaryotic organisms [1–11]. The goal of this work is to assemble several open-source computational tools to deliver a complete, accessible, and easy-to-use reference-based bacterial RNA-seq analysis workflow that is amenable to both the research laboratory and undergraduate classroom.
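The file-format handoff between steps can be made concrete with a minimal Python sketch. It assumes a linear chain of four stages and a conventional set of formats (FASTQ reads, SAM alignments, a counts table, a differential expression table); the stage names, the format labels, and the `Stage` and `check_chain` helpers are hypothetical illustrations, not part of SPARTA. Quality assessment is left out of the chain because it reads the trimmed reads and produces a report rather than feeding the next step.

```python
from typing import List, NamedTuple, Tuple


class Stage(NamedTuple):
    """One workflow step and the file formats it reads and writes (illustrative)."""
    name: str
    consumes: str
    produces: str


# Assumed linear chain; the format labels are conventional, not SPARTA-specific.
STAGES = [
    Stage("trim reads and remove adapters", "fastq", "fastq"),
    Stage("map reads to reference", "fastq", "sam"),
    Stage("count gene features", "sam", "counts"),
    Stage("test differential expression", "counts", "de_table"),
]


def check_chain(stages: List[Stage]) -> List[Tuple[str, str]]:
    """Return (upstream, downstream) name pairs whose formats do not match."""
    mismatches = []
    for upstream, downstream in zip(stages, stages[1:]):
        if upstream.produces != downstream.consumes:
            mismatches.append((upstream.name, downstream.name))
    return mismatches


if __name__ == "__main__":
    problems = check_chain(STAGES)
    if problems:
        for upstream, downstream in problems:
            print("format mismatch between '%s' and '%s'" % (upstream, downstream))
    else:
        print("every stage can read the output of the stage before it")
```

Swapping one tool for another that emits a different format (for example, binary instead of text alignments) would surface as a mismatch here, which is the kind of integration work an assembled workflow takes off the user's hands.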
SPARTA requires Python 2, NumPy (a Python library for numerical analyses), Java and R. Once Python is installed, the user initializes SPARTA, which then checks for the necessary dependencies at runtime. If any of these dependencies are not met, SPARTA informs the user of the missing components. To reduce complex software installation, SPARTA is distributed with the required software, and an online tutorial (http://sparta.readthedocs.org) guides the user through installation and data analysis procedures for each operating system platform. The workflow maintains analytic flexibility for specific use cases by allowing the user to tailor the options utilized for each analysis step, but can proceed without requiring option specification. Further, SPARTA will write the necessary R commands at runtime and will generate the appropriate contrasts to test all possible comparisons between user-defined experimental conditions. The workflow is distributed with an example data set containing the first 100,000 reads from a previously published study. This data set is included to allow the user to become rapidly familiarized with the analysis procedure and to confirm that the appropriate dependencies are met.
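A runtime dependency check of the kind described above might look like the following minimal sketch. It is not SPARTA's code: it is written for Python 3 (SPARTA itself targets Python 2, where `shutil.which` is unavailable), and the executable names `java` and `Rscript`, the module name `numpy`, and the helper `missing_dependencies` are assumptions chosen for illustration.

```python
import importlib.util
import shutil

# Assumed names: executables searched for on PATH, modules searched for by
# import name. SPARTA's actual checks may look for different things.
REQUIRED_EXECUTABLES = ("java", "Rscript")
REQUIRED_MODULES = ("numpy",)


def missing_dependencies(executables=REQUIRED_EXECUTABLES,
                         modules=REQUIRED_MODULES):
    """Return the names of executables and top-level modules that cannot be found."""
    missing = [exe for exe in executables if shutil.which(exe) is None]
    missing += [mod for mod in modules
                if importlib.util.find_spec(mod) is None]
    return missing


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        # Report every missing component at once rather than stopping at the first.
        print("Missing dependencies: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```

Collecting all missing components before reporting lets the user install everything in one pass instead of rerunning the check after each fix.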
Results and discussion
RNA-seq transcriptional profiling is becoming increasingly routine, and there is a demand for applications such as SPARTA that enable stand-alone workflows. Though several bacterial RNA-seq analysis workflows have been developed [4, 5, 11], SPARTA is currently the only workflow capable of addressing the possibility of batch effects within the data set as well as the other necessary analysis procedures to identify differentially expressed genes. Using a previously published data set, SPARTA was capable of analyzing 4 experimental conditions containing 8 samples with approximately 30 million reads per sample in 4 h on an off-the-shelf iMac computer (8 GB RAM, Intel i5 2.7 GHz quad-core processor). SPARTA can also be implemented in high performance computing environments utilizing the non-interactive mode functionality.
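To give a sense of what testing all possible comparisons means for a four-condition experiment like the one above, the sketch below enumerates every unordered pair of conditions. The condition labels and the `pairwise_contrasts` helper are hypothetical, and the `A-B` string form is only one common way of writing a contrast for R-based differential expression tools; how SPARTA formats its generated R commands is not shown here.

```python
from itertools import combinations


def pairwise_contrasts(conditions):
    """Return every unordered pair of distinct conditions as 'A-B' strings."""
    # dict.fromkeys drops repeated labels while keeping first-seen order.
    unique = list(dict.fromkeys(conditions))
    return ["{}-{}".format(a, b) for a, b in combinations(unique, 2)]


# Placeholder labels standing in for four user-defined experimental conditions.
conditions = ["cond1", "cond2", "cond3", "cond4"]
contrasts = pairwise_contrasts(conditions)
print(len(contrasts))  # 6 pairwise comparisons for 4 conditions
for contrast in contrasts:
    print(contrast)
```

The number of comparisons grows as n(n-1)/2, so four conditions give six contrasts and eight would give twenty-eight, which is why generating them automatically is convenient.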
As NGS technologies and applications continue to permeate life science research, undergraduate education must incorporate the use of contemporary sequencing techniques to address biological questions. However, despite the rapid increase in data intensive experimental biology, undergraduates receiving a life sciences degree are often not exposed to the tools and basic computational skills required to study NGS data sets. To address this shortcoming, we have developed an online tutorial to guide students through the RNA-seq analysis process (http://sparta-teaching.readthedocs.org). The SPARTA teaching tool was integrated into a senior level genomics course and successfully engaged students in the theory and application of RNA-seq data analysis.
Future directions and functionality
Additional features and functionality that will be incorporated into future releases of SPARTA are listed below and will be updated at http://sparta.readthedocs.org/en/latest/wishlist.html. Further, for those who wish to become involved in the active development of SPARTA, the current state of the code base and feature development can be found on GitHub through http://sparta.readthedocs.org under the “Contribute” heading. Future releases of SPARTA will include, but are not limited to: 1) automated batch effect correction, 2) additional input file format support, 3) paired-end reads support, 4) read mapping output with normalized expression values, 5) operon analysis and definition of untranslated regions, 6) files for visualization in common genome browsers, and 7) reference-free transcriptome analysis.
SPARTA is a bacterial RNA-seq analysis tool capable of taking raw Illumina reads to differentially expressed genes in a turnkey, stand-alone workflow format that takes advantage of existing state-of-the-art analysis tools and warns the user of potential batch effects. By reducing the computational proficiency required to perform transcriptional profiling experiments using RNA-seq, SPARTA can enable microbiologists to accelerate their research and give instructors the ability to incorporate a hands-on approach to NGS technologies in the classroom. Further, SPARTA maintains analytic flexibility by allowing the user to tailor the analysis through option specification but is capable of proceeding with default values.
Availability and requirements
NGS: next generation sequencing
This project was supported by grants to RBA from the NIH (R21AI105867) and the Bill & Melinda Gates Foundation (OPP1119065).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Golosova O, Henderson R, Vaskin Y, Gabrielian A, Grekhov G, Nagarajan V, et al. Unipro UGENE NGS pipelines and components for variant calling, RNA-seq and ChIP-seq data analyses. PeerJ. 2014;2:e644.
- Michalovova M, Kubat Z, Hobza R, Vyskot B, Kejnovsky E. Fully automated pipeline for detection of sex linked genes using RNA-Seq data. BMC Bioinformatics. 2015;16(1):78.
- D’Antonio M, D’Onorio De Meo P, Pallocca M, Picardi E, D’Erchia AM, Calogero RA, et al. RAP: RNA-Seq Analysis Pipeline, a new cloud-based NGS web application. BMC Genomics. 2015;16:S3.
- McClure R, Balasubramanian D, Sun Y, Bobrovskyy M, Sumby P, Genco CA, et al. Computational analysis of bacterial RNA-Seq data. Nucleic Acids Res. 2013;41(14):e140.
- Tjaden B. De novo assembly of bacterial transcriptomes from RNA-seq data. Genome Biol. 2015;16:1.
- Kalari KR, Nair AA, Bhavsar JD, O’Brien DR, Davila JI, Bockol MA, et al. MAP-RSeq: Mayo Analysis Pipeline for RNA sequencing. BMC Bioinformatics. 2014;15:224.
- Wang Y, Mehta G, Mayani R, Lu J, Souaiaia T, Chen Y, et al. RseqFlow: workflows for RNA-Seq data analysis. Bioinformatics. 2011;27(18):2598–600.
- Goncalves A, Tikhonov A, Brazma A, Kapushesky M. A pipeline for RNA-seq data processing and quality assessment. Bioinformatics. 2011;27(6):867–9.
- Habegger L, Sboner A, Gianoulis TA, Rozowsky J, Agarwal A, Snyder M, et al. RSEQtools: a modular framework to analyze RNA-Seq data using compact, anonymized data summaries. Bioinformatics. 2011;27(2):281–3.
- Qi J, Zhao F, Buboltz A, Schuster SC. inGAP: an integrated next-generation genome analysis pipeline. Bioinformatics. 2010;26(1):127–9.
- Forstner KU, Vogel J, Sharma CM. READemption-a tool for the computational analysis of deep-sequencing-based transcriptome data. Bioinformatics. 2014;30(23):3421–3.
- Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114–20.
- FastQC - http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25.
- Anders S, Pyl PT, Huber W. HTSeq–a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015;31(2):166–9.
- Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.
- Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25.
- McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 2012;40(10):4288–97.
- Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet. 2010;11(10):733–9.
- Baker JJ, Johnson BK, Abramovitch RB. Slow growth of Mycobacterium tuberculosis at acidic pH is regulated by phoPR and host-associated carbon sources. Mol Microbiol. 2014;94(1):56–9.