Comparisons of computational methods for differential alternative splicing detection using RNA-seq in plant systems

Background Alternative Splicing (AS) as a post-transcription regulation mechanism is an important application of RNA-seq studies in eukaryotes. A number of software and computational methods have been developed for detecting AS. Most of the methods, however, are designed and tested on animal data, such as human and mouse. Plants genes differ from those of animals in many ways, e.g., the average intron size and preferred AS types. These differences may require different computational approaches and raise questions about their effectiveness on plant data. The goal of this paper is to benchmark existing computational differential splicing (or transcription) detection methods so that biologists can choose the most suitable tools to accomplish their goals. Results This study compares the eight popular public available software packages for differential splicing analysis using both simulated and real Arabidopsis thaliana RNA-seq data. All software are freely available. The study examines the effect of varying AS ratio, read depth, dispersion pattern, AS types, sample sizes and the influence of annotation. Using a real data, the study looks at the consistences between the packages and verifies a subset of the detected AS events using PCR studies. Conclusions No single method performs the best in all situations. The accuracy of annotation has a major impact on which method should be chosen for AS analysis. DEXSeq performs well in the simulated data when the AS signal is relative strong and annotation is accurate. Cufflinks achieve a better tradeoff between precision and recall and turns out to be the best one when incomplete annotation is provided. Some methods perform inconsistently for different AS types. Complex AS events that combine several simple AS events impose problems for most methods, especially for MATS. MATS stands out in the analysis of real RNA-seq data when all the AS events being evaluated are simple AS events. Electronic supplementary material The online version of this article (doi:10.1186/s12859-014-0364-4) contains supplementary material, which is available to authorized users.

1 Simulation Pipeline 1.1 Step 1: Simulating biological replicates In order to approximate the situation in real RNA-seq experiment, we required two groups of empirical RNA-seq samples representing control and treatment groups respectively. First, the pipeline selected a random subset of genes that had more than one transcript based on annotation and that were expressed (have non-zero read counts in every replicate) in both input groups as true AS genes. The total transcripts copy number on a simulated gene was proportional to the number of reads counted on the real gene. We also introduced biological variance to gene expression by using Negative Binomial(NB) distributions. NB distribution is widely used for modeling variance across biological replicates. For each gene g we calculated mean µ g and variance σ 2 g of gene-level read counts across replicates and then performed a Loess regression f on the set of points (µ g , σ 2 g ). Thus we can borrow information across genes and do not rely on having large enough number of replicates to estimate variance. In the simulation studies with the same dispersion pattern we forced the regression function f to be the same under two conditions. For the simulation studies using different dispersion patterns the regression function f was learned from each of the two input groups and thus it differed for the two simulated conditions. The advantage of using Loess function is that Loess fitting does not make the same assumption of global homoscedasticity as general linear regression. Finally, the transcript counts for gene g were generated by NB distribution parameterized by mean µ g and fitted variance f (µ g ).

Step 2: Simulating differential splicing
We defined a parameter, P ALT , to control the relative transcript abundances across conditions. P ALT stands for Percentage of ALTernative form, ranging from 0 to 1. The relative transcript abundances of a multi-isoform gene g which has i isoforms, denoted by e g = (e 1 g , ..., e i g ), were decided through the following formulas.
• if g is a AS gene, then we set e j g = P ALT, if j = i and e j g = 1−P ALT • if g is not a AS gene, then we draw the relative abundance from a standard uniform distribution e j g ∈ unif orm(0, 1) with a constraint i j=1 e j g = 1 In addition, we introduced another parameter Read Depth(RD) to allow user to control the mean per-based read depth which is defined as: L * N/T Where L is the read length; N is the number of reads mapped to transcriptome; T is the transcriptome size.
Therefore the final absolute transcript abundance in the custom transcriptome expression profile are the product of gene-level transcript counts from step 1, relative transcript abundances and read depth tuner which makes sure the desired read depth is generated. Finally, the program, Flux Simulator calls this profile to generate RNA-seq reads. Figure S1: A two-step simulation pipeline. SAM files from real data are used as input for this pipeline. In the first step biological replicates are simulated by using Negative Binomial (NB) models. The raw fragment counts mean µ g and variance σ 2 g are calculated from the input. A regression function f is fitted on the set of points (µ g , σ 2 g ). Then the fitted variances are used as parameters in the NB models to generate three replicates, e.g. a, b, c. In the second step. The updated gene-level fragment counts are separate onto transcript levels based on the relative abundances and desired read depth. Finally, Flux Simulator is used to generated simulated RNA-seq reads.

Sanity check of synthetic data
To simulate biological replicates, we used Arabidopsis heat shock dataset [1] which contains three replicates for each of the two time points. The first time point was immediate after heat stress. The second was 24 h after recovery from the heat stress. The mean fragment counts across replicates and mean-variance relationship used in the simulation were estimated from the heat shock data set. Figure S2 shows the mean and variance of fragment counts in the log scale for synthetic data in baseline simulation study RD100 H D and heat shock data. There was a good agreement which indicated that the negative binomial model used in the simulation captured the mean-variance relationship or dispersion well. We further compared the distribution of the mean fragment counts in log scale. The simulation again captured the distribution in real data well.  Figure S2: Comparison between real (left panels) and synthetic data (right panels). The 2 panels on top are scatter plots of mean-variance relationship across replicates. The blue lines are LOWESS regression lines. The orange lines are variance = mean lines. It is clear that the real data is overdispersed with respect to what we would expect from a Poisson distribution and that it was well captured by a negative binomial distribution using in the simulated data. The two panels at the bottom compare the fragment counts distribution.

Command lines and parameter choices 3.1 Cufflinks
Cufflinks was written in Python and C++. It can be downloaded from http://cufflinks. cbcb.umd.edu/. We used the version 2.1.1 in this study. A newer version 2.2.0 was release while we were writing the paper.

DEXSeq
DEXSeq is a R package available in Bioconductor. We used the latest version 1.8.0 in this study.

DSGseq
DSGseq consists of a set of R scripts but is not a standard R packages. It can be downloaded from http://bioinfo.au.tsinghua.edu.cn/software/DSGseq/. We used the latest version 0.

SeqGSEA
SeqGSEA is a R package available in Bioconductor. We used the version 1.2.1. A newer version 1.5.0 was release while we were writing the paper.

SplicingCompass
SplicingCompass is a R package. We used the latest version 1.0.1.

Computational time requirement
We ran the code shown in the previous section in Iowa State University super cluster called Lightning. The code was all executed in a single node and a single core with 16GB RAM. Although we used a cluster, this amount of computational power can be easily obtained in a standard PC. All the programs were finished within a few hours. The computation time required for SeqGSEA is largely affected by the permutation times. In this study, we set it to 1000. The total required CPU time for each method in the baseline simulation study RD100 H D is given in the Table S1. 6 Visualization of read alignments in heat shock data for experimentally validated AS genes We have examined a few Arabidopsis genes that are known to be differentially spliced in response to ambient temperature changes. The following figures are the visualization of reads alignment of these few known genes using Integrated Genome Browser [4]. Solid bars represent reads, and thin lines indicate gaps in the alignment.

LHY
LATE ELONGATED HYPOCOTYL (LHY), circadian clock genes, are known to be differential spliced in response to temperature changes [2]. 5 transcripts have been found (based on TAIR10) in gene AT1G01060 which belongs to LHY gene family. Transcript AT1G01060.4 differs from other transcripts by 3-nt difference in the 3' site. Figure S3: LHY. Visualization of read alignments in heat shock data. Reads from heat stress group are colored in red whereas reads from control groups are colored in blue. The black arrow indicates where the AS event happens in the gene model.

SR45
AT1G16610 encodes SR45 which is a member of SR protein family. A alternative 3'SS event differed by a 21-nt sequence has been found to occur as ambient temperature changes [7]. Figure S4: SR45. Visualization of read alignments in heat shock data. Reads from heat stress group are colored in red and reads from control groups are colored in blue. The black arrow indicates where the AS event happens in the gene model.

SR1/SR34
AT1G02840 encodes SR1/SR34 protein, a member of highly conserved family of spliceosome proteins. An alternative 3'SS event has been found as ambient temperature changes [6]. Figure S5: SR1/SR34. Visualization of read alignments in heat shock data. Reads from heat stress group are colored in red and reads from control groups are colored in blue. The black arrow indicates where the AS event happens in the gene model.

SR30
AT1G09140 encodes SR30, a member of highly conserved family of spliceosome proteins. An alternative 3'SS event has been found in response to heat stress [6].