spliceR: an R package for classification of alternative splicing and prediction of coding potential from RNA-seq data
© Vitting-Seerup et al.; licensee BioMed Central Ltd. 2014
Received: 15 November 2013
Accepted: 17 March 2014
Published: 23 March 2014
RNA-seq data is currently underutilized, in part because it is difficult to predict the functional impact of alternate transcription events. Recent software improvements in full-length transcript deconvolution prompted us to develop spliceR, an R package for classification of alternative splicing and prediction of coding potential.
spliceR uses the full-length transcript output from RNA-seq assemblers to detect single or multiple exon skipping, alternative donor and acceptor sites, intron retention, alternative first or last exon usage, and mutually exclusive exon events. For each of these events spliceR also annotates the genomic coordinates of the differentially spliced elements, facilitating downstream sequence analysis. For each transcript isoform fraction values are calculated to identify transcript switching between conditions. Lastly, spliceR predicts the coding potential, as well as the potential nonsense mediated decay (NMD) sensitivity of each transcript.
spliceR is an easy-to-use tool that extends the usability of RNA-seq and assembly technologies by allowing greater depth of annotation of RNA-seq data. spliceR is implemented as an R package and is freely available from the Bioconductor repository (http://www.bioconductor.org/packages/2.13/bioc/html/spliceR.html).
Alternative splicing is one of the most important RNA modifications, leading to protein diversification and contributing to the complexity of higher organisms . Recent advances in RNA sequencing (RNA-seq), combined with modern RNA-seq assembly software such as Cufflinks , allows for high-resolution profiling of the RNA landscape. The technological and computational advances enables identification of a catalog of distinctly spliced transcripts originating from the same transcription unit/gene. These full-length transcripts are however underutilized, in part because it is difficult to predict the functional impact of alternate transcription events leading to the different transcripts. In addition to the potential for impact on protein domains, alternate splicing may also alter RNA processing, stability and localization. Nonsense mediated decay (NMD) is tightly linked to alternate splicing, and the mechanism by which even small changes in alternative splicing can result in degradation via NMD is well described . To predict splicing events that may lead to these functional changes, it is necessary to classify the type of event as well as annotate the genomic position of the regions that are differentially spliced. Such classifications also enable systematic follow-up analyses, such as sequence analysis of the differentially spliced regions to infer the underlying mechanisms. However, at present there are no available tools that adequately perform these analyses. Existing methods, including MISO , Astalavista  and DiffSplice , do not output the genomic coordinates of differentially spliced regions [4, 7–11], have insufficient classification of alternative splicing (i.e., only a subset of alternative splicing classes are supported) [4–11], or cannot assess novel features [4, 7, 8, 12]. Furthermore, none of the existing tools for analyzing alternative splicing support NMD predictions [4–12].
This prompted us to develop the R package spliceR. spliceR uses the full-length transcripts created by RNA-seq assemblers to detect single- and multiple exon skipping/inclusion (ESI, MESI), alternative donor and acceptor sites (A5, A3), intron retention (IR), alternative first or last exon usage (ATSS, ATTS), and mutually exclusive exon events (MEE). spliceR annotates the genomic coordinates of the differentially spliced elements, facilitating downstream sequence analysis. Finally, spliceR predicts the coding potential of transcripts, calculates untranslated region (UTR) and open reading frame (ORF) lengths and predicts whether transcripts are NMD-sensitive based on compatible annotated start codon positions and their downstream ORF.
Retrieval of data
spliceR is implemented as an R package and is freely available from the Bioconductor repository (http://www.bioconductor.org/packages/2.13/bioc/html/spliceR.html). It is based on standard Bioconductor  classes such as GRanges, allowing for full flexibility and modularity, and support for all species and versions supported in the Bioconductor annotation packages. An example dataset is included to allow easy exploration of the package.
Classification of alternative splicing
Isoform fraction values
For each transcript and condition, spliceR calculates an isoform fraction (IF) value, which is calculated as (transcript expression) / (gene expression) *100 to represent the contribution of a transcript to the expression of the parent gene. Furthermore delta-IF (dIF) values, which measure the absolute change in IF values between conditions, are also calculated. Since these dIF values measure changes in relative abundance of isoforms within a gene, they are well suited for identifying and analyzing changes in the isoform usage. Such analysis does however require that the expression values of transcripts and genes have been normalized to length, sequencing depths, and possibly other biases, a standard feature of the Cufflinks FPKM (fragments per kilobase of transcript per million fragments sequenced) metric. For data from other assemblers, the user may need to accommodate this requirement manually. These IF values are highly dependent on the underlying data quality and relative expression strengths, and should be viewed as a helpful ranking statistic for identifying changes in alternative splicing between conditions.
Analysis of coding potential
spliceR retrieves the genomic exon sequences of each transcript from one of the Bioconductor annotation files. ORF annotation is then retrieved from the UCSC Genome Browser repository either from RefSeq, UCSC or Ensembl, as specified by the user. Alternatively, a custom ORF-table can be supplied. For each transcript, the most upstream compatible start codon is identified, the downstream sequence is translated, and the relative position of the first in-frame stop codon and the distance to the final exon-exon junction is recorded and returned to the user. Transcripts are marked NMD-sensitive if the stop codon falls more than 50 nt upstream of the final exon-exon junction, indicating a pre-mature stop codon (PTC). This is based on the literature consensus , however this setting is user-configurable. The position of the start codon is also annotated, which, in combination with the found stop codon and the annotated transcript lengths, enables users to calculate 5′UTR lengths, ORF lengths and 3′UTR lengths. In future versions, we plan to include alternative methods of coding region prediction, such as the logistic regression model implemented in the program CPAT .
Visualization and data integration tools
The spliceR package generates a GTF file that can be uploaded to the UCSC genome browser, to help users visualize transcripts and to allow for integration of the RNA-seq analysis with external annotation sources. This spliceR GTF file has two main advantages over the corresponding GTF file generated by Cufflinks’ Cuffmerge tool: Firstly, spliceR’s optional filters uses stringent criteria, e.g. requiring that the transcript should be expressed, or that Cufflinks marks the transcript deconvolution as successful. In our experience, this removes up to 80% of transcripts originally predicted to belong to the same gene in the Cufflinks GTF file (data not shown), making the GTF file generated by spliceR more suitable for visual analysis.
Secondly, spliceR can color-code transcripts according to their expression level within the parent gene. This feature facilitates easy visualization of changes in gene and transcript expression both within and between conditions.
The tabulated output of spliceR facilitates a number of downstream analyses, including identification of transcripts that exhibit major changes between samples, the ability to filter the output for specific splicing classes, as well as sequence analysis in or around regions that are spliced in or out between samples. Examples include detection of enriched motifs, or identification of protein domains, miRNA response elements or localization signals that are spliced in or out. spliceR facilitates these types of analyses by outputting the genomic coordinates of each alternatively spliced element.
Results and discussion
To show the power and versatility of spliceR we reanalyzed the publically available RNA-seq dataset from Zhang et al. , which compared two experimental conditions: the human colorectal cell line HT116 with and without a siRNA directed towards Usp49, a histone H2B deubiquitinase. To compare our approach with the original analysis, we used the original Cufflinks data (GEO: GSE38100). Transcripts successfully deconvoluted by cufflinks were used as the input into spliceR. The resulting dataset contained 5,496 single-transcript genes, and 1,867 multi-transcript genes. Combined, the multi-transcripts genes were predicted to express 4,612 unique transcripts 2.47 transcripts per gene). The analysis and results presented here are based on only five lines of R code, (modified to hg18), shown in Figure 1B-D, which illustrates the power and ease-of-use of spliceR. For reference, the spliceR analysis took ~30 minutes on a typical laptop (MacBook Pro 2.5Ghz i5, 8 GB RAM).
Splicing pattern and transcript structure
Frequency of splice site consensus sequences
5′ end (GT)
3′ end (AG)
Comparison of the analyzed transcripts to RefSeq
Intron chain level
Alternative splicing and NMD
From the Usp49 KD dataset spliceR identified a total of 6,121 alternative splicing events (Additional file 1: Table S1), distributed across the different splicing classes shown in Figure 2. spliceR found 8,179 (80.9%) transcripts without a PTC (PTC-), 642 (6,4%) transcripts with a PTC (PTC+) while 1,287 (12.7%) transcripts did not have any annotated compatible start codons. Similar fractions of transcripts were predicted to be NMD sensitive when all transcripts from RefSeq (8.2%) and Gencode (9.7%) were analyzed with spliceR, indicating that a non-neglectable fraction of transcripts could be NMD sensitive. By using spliceR’s annotation of start and stop codons, the length of both 5′UTRs, 3′UTRs and ORFs were analyzed, but no changes between conditions were found (data not shown).
We next assessed transcripts whose relative abundance was altered by the Usp49KD, by filtering for genes that had both a large positive and large negative dIF value (corresponding to a binary transcript-switch). 183 high confidence transcript switches were found: in 18 instances (~9.8%), an NMD-negative transcript was down-regulated while a NMD-sensitive transcript was up-regulated. This illustrates that failing to assess the NMD sensitivity can lead to overestimation of the number of functionally relevant transcript switches.
Here, we have introduced the R package spliceR, which increases the usability and power of RNA-seq and assembly technologies by providing a full overview of alternative splicing events and protein coding potential of transcripts. spliceR is flexible and easily integrated in existing workflows, supports input and output of standard Bioconductor data types, and enables investigators to perform many different downstream analyses of both transcript abundance and differentially spliced elements. We demonstrate the power and versatility of spliceR by showing how new conclusions can be made from existing RNA-seq data.
Availability and requirements
SpliceR is implemented as an R package, is freely available from the Bioconductor repository and can be installed simply by copy/pasting two lines into an R console.
Project name: spliceR
Project home page: http://www.bioconductor.org/packages/2.13/bioc/html/spliceR.html
Operating system(s): Platform independent
Programming language: R and C
Other requirements: R v 3.0.2 or higher
Any restrictions to use by non-academics: No limitations
KVS, JW and AS were supported by grants from the Lundbeck Foundation, the Novo Nordisk Foundation, and the RiMod-FTD Joint EU program for Neurodegenerative research to AS. Work in the BTP lab was supported through a center grant from the Novo Nordisk Foundation (The Novo Nordisk Foundation Section for Stem Cell Biology in Human Disease). We thank Dr Christine Wells, Glasgow University, for comments on the manuscript.
- Barbosa-Morais NL, Irimia M, Pan Q, Xiong HY, Gueroussov S, Lee LJ, Slobodeniuc V, Kutter C, Watt S, Colak R, Kim T, Misquitta-Ali CM, Wilson MD, Kim PM, Odom DT, Frey BJ, Blencowe BJ: The evolutionary landscape of alternative splicing in vertebrate species. Science. 2012, 338: 1587-1593. 10.1126/science.1230612.View ArticlePubMedGoogle Scholar
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010, 28: 511-515. 10.1038/nbt.1621.View ArticlePubMed CentralPubMedGoogle Scholar
- Guthrie C, Steitz J, Lejeune F, Maquat LE: Mechanistic links between nonsense-mediated mRNA decay and pre-mRNA splicing in mammalian cells. Curr Opin Cell Biol. 2005, 17: 309-315. 10.1016/j.ceb.2005.03.002.View ArticleGoogle Scholar
- Katz Y, Wang ET, Airoldi EM, Burge CB: Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat Methods. 2010, 7: 1009-1015. 10.1038/nmeth.1528.View ArticlePubMed CentralPubMedGoogle Scholar
- Foissac S, Sammeth M: ASTALAVISTA: dynamic and flexible analysis of alternative splicing events in custom gene datasets. Nucleic Acids Res. 2007, 35: W297-W299. 10.1093/nar/gkm311.View ArticlePubMed CentralPubMedGoogle Scholar
- Hu Y, Huang Y, Du Y, Orellana CF, Singh D, Johnson AR, Monroy A, Kuan P-F, Hammond SM, Makowski L, Randell SH, Chiang DY, Hayes DN, Jones C, Liu Y, Prins JF, Liu J: DiffSplice: the genome-wide detection of differential splicing events with RNA-seq. Nucleic Acids Res. 2012, 41: 1-18.Google Scholar
- Rogers MF, Thomas J, Reddy AS, Ben-Hur A: SpliceGrapher: detecting patterns of alternative splicing from RNA-Seq data in the context of gene models and EST data. Genome Biol. 2012, 13: R4-10.1186/gb-2012-13-1-r4.View ArticlePubMed CentralPubMedGoogle Scholar
- Ryan MC, Cleland J, Kim R, Wong WC, Weinstein JN: SpliceSeq: a resource for analysis and visualization of RNA-Seq data on alternative splicing and its functional impacts. Bioinformatics. 2012, 28: 2385-2387. 10.1093/bioinformatics/bts452.View ArticlePubMed CentralPubMedGoogle Scholar
- Sacomoto GAT, Kielbassa J, Chikhi R, Uricaru R, Antoniou P, Sagot M-F, Peterlongo P, Lacroix V: KISSPLICE: de-novo calling alternative splicing events from RNA-seq data. BMC Bioinforma. 2012, 13 (6): S5-Google Scholar
- Shen S, Park JW, Huang J, Dittmar KA, Lu Z, Zhou Q, Carstens RP, Xing Y: MATS: a Bayesian framework for flexible detection of differential alternative splicing from RNA-Seq data. Nucleic Acids Res. 2012, 40: e61-10.1093/nar/gkr1291.View ArticlePubMed CentralPubMedGoogle Scholar
- Zhou A, Breese MR, Hao Y, Edenberg HJ, Li L, Skaar TC, Liu Y: Alt event finder: a tool for extracting alternative splicing events from RNA-seq data. BMC Genomics. 2012, 13 (8): S10-View ArticlePubMed CentralPubMedGoogle Scholar
- Liu Q, Chen C, Shen E, Zhao F, Sun Z, Wu J: Detection, annotation and visualization of alternative splicing from RNA-Seq data with SplicingViewer. Genomics. 2012, 99: 178-182. 10.1016/j.ygeno.2011.12.003.View ArticlePubMedGoogle Scholar
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5: R80-10.1186/gb-2004-5-10-r80.View ArticlePubMed CentralPubMedGoogle Scholar
- Robinson MD, McCarthy DJ, Smyth GK: edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26: 139-140. 10.1093/bioinformatics/btp616.View ArticlePubMed CentralPubMedGoogle Scholar
- Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11: R106-10.1186/gb-2010-11-10-r106.View ArticlePubMed CentralPubMedGoogle Scholar
- Hardcastle TJ, Kelly KA: baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010, 11: 422-10.1186/1471-2105-11-422.View ArticlePubMed CentralPubMedGoogle Scholar
- Weischenfeldt J, Waage J, Tian G, Zhao J, Damgaard I, Jakobsen JS, Kristiansen K, Krogh A, Wang J, Porse BT: Mammalian tissues defective in nonsense-mediated mRNA decay display highly aberrant splicing patterns. Genome Biol. 2012, 13: R35-10.1186/gb-2012-13-5-r35.View ArticlePubMed CentralPubMedGoogle Scholar
- Wang L, Park HJ, Dasari S, Wang S, Kocher J-P, Li W: CPAT: coding-potential assessment tool using an alignment-free logistic regression model. Nucleic Acids Res. 2013, 41: e74-10.1093/nar/gkt006.View ArticlePubMed CentralPubMedGoogle Scholar
- Zhang Z, Jones A, Joo H-Y, Zhou D, Cao Y, Chen S, Erdjument-Bromage H, Renfrow M, He H, Tempst P, Townes TM, Giles KE, Ma L, Wang H: USP49 deubiquitinates histone H2B and regulates cotranscriptional pre-mRNA splicing. Genes Dev. 2013, 27: 1581-1595. 10.1101/gad.211037.112.View ArticlePubMed CentralPubMedGoogle Scholar
- Burset M, Guigo R: Evaluation of gene structure prediction programs. 1996, 367: 353-367.Google Scholar
- Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, Sonnhammer ELL, Eddy SR, Bateman A, Finn RD: The Pfam protein families database. Nucleic Acids Res. 2012, 40: D290-D301. 10.1093/nar/gkr1065.View ArticlePubMed CentralPubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.