 Proceedings
 Open access
 Published:
Challenges in estimating percent inclusion of alternatively spliced junctions from RNAseq data
BMC Bioinformatics volume 13, Article number: S11 (2012)
Abstract
Transcript quantification is a longstanding problem in genomics and estimating the relative abundance of alternativelyspliced isoforms from the same transcript is an important special case. Both problems have recently been illuminated by highthroughput RNA sequencing experiments which are quickly generating large amounts of data. However, much of the signal present in this data is corrupted or obscured by biases resulting in nonuniform and nonproportional representation of sequences from different transcripts. Many existing analyses attempt to deal with these and other biases with various taskspecific approaches, which makes direct comparison between them difficult. However, two popular tools for isoform quantification, MISO and Cufflinks, have adopted a general probabilistic framework to model and mitigate these biases in a more general fashion. These advances motivate the need to investigate the effects of RNAseq biases on the accuracy of different approaches for isoform quantification. We conduct the investigation by building models of increasing sophistication to account for noise introduced by the biases and compare their accuracy to the established approaches.
We focus on methods that estimate the expression of alternativelyspliced isoforms with the percentsplicedin (PSI) metric for each exon skipping event. To improve their estimates, many methods use evidence from RNAseq reads that align to exon bodies. However, the methods we propose focus on reads that span only exonexon junctions. As a result, our approaches are simpler and less sensitive to exon definitions than existing methods, which enables us to distinguish their strengths and weaknesses more easily. We present several probabilistic models of of positionspecific read counts with increasing complexity and compare them to each other and to the current stateoftheart methods in isoform quantification, MISO and Cufflinks. On a validation set with RTPCR measurements for 26 cassette events, some of our methods are more accurate and some are significantly more consistent than these two popular tools. This comparison demonstrates the challenges in estimating the percent inclusion of alternatively spliced junctions and illuminates the tradeoffs between different approaches.
Introduction
Determining the relative abundance of gene transcripts in a cell  whether in relation to each other or in relation to corresponding transcripts in other cells  is an important and longstanding problem in genomics. Since introduction of RNAseq, a highthroughput experimental method of measuring the RNA content of a sample by reversetranscribing it and sequencing the resultant cDNA, this problem has been illuminated by vast amounts of data and by many methods for elucidating transcript abundance [1]. Current collections of RNAseq data are rapidly growing in multiple dimensions such as species, tissues, and conditions [2].
This data deluge necessitates more sophisticated and accurate analysis methods, which in turn create an opportunity to gain deeper insights into the role and regulation of transcript abundance in important developmental and disease processes. Undoubtedly, one important research area that can benefit from these advances is the study of RNA splicing, an essential cellular process that effectively increases the phenotypic complexity of eukaryotic organisms without necessitating an increase in their genetic complexity. Accurate measurements of the expression levels for isoforms from a large number of genes are especially useful for research into the molecular mechanisms that regulate alternative splicing in different tissues. For example, the recent advances in the RNA splicing code that determines the relative abundance of alternatively spliced isoforms [3] was made possible by highthroughput microarray technology. In principle, RNAseq can lead to much richer datasets at a fraction of the cost. Thus RNAseq technology can lead to significant new breakthroughts, as the code quality achieved by [3] leaves a lot of room for improvement. The focus of this paper  estimation of the percent inclusion of alternativelyspliced exons from RNAseq data  is a step toward a more accurate interpretation of the natural splicing code. This problem is complicated by several sources of bias in short read counts including those due to the cDNA fragmentation and primer amplification steps of current RNAseq protocols [4, 5]. These biases lead to widely varying abundances for reads from different positions in the transcript. We investigate this positionspecific bias further and suggest methods to mitigate it.
Specifically, we restrict our interest only to exonskipping events [6, 7]. The numerical quantity which captures relevant information for these events is termed percentsplicedin (PSI). For each exonskipping event, PSI is defined as the expression of isogorms containing the alternatively spliced exon (i.e. those containing a given cassette exon and its flanking constitutive exons) as a fraction of the total expression for both alternatively and constitutively spliced isoforms (i.e. those containing the flanking exons only) which is reported in percent. Accurate estimation of PSI is not only desirable on its own, but it can also be used to improve the resolution of differential splicing and thus improve the predictive power of the splicing code [3].
There are several recent tools for estimating relative abundance of isoforms, which deal with positionspecific biases in different ways [5, 7–9]. MISO can directly estimate PSI specifically for exonskipping events [7], while most others estimate the expression of whole isoforms from which a PSI value may be derived. This makes MISO the natural point of reference for our comparisons, but we also include Cufflinks [5] in the comparisons because of its popularity and explicit modeling of fragmentation and amplification biases. However, for the task of estimating PSI, Cufflinks' focus on multiexon isoforms appears to be detrimental, as we show in the Results section.
Our pursuit of robust estimates for PSI necessitates an appropriate measure of the uncertainty for these estimates. This additional necessity is crucial for the task of deciphering the natural RNA splicing code. Linking noisy RNAseq read counts with the sequence determinants of RNA splicing is a hard task that requires good measurement of splicing levels even in case of transcripts with minimal coverage. For this task it is just as important to quantify the range of possible PSI values supported by the RNAseq data, given that the positionspecific bias can dramatically influence these estimates. We start by framing the classic IID sampling assumption as a Poisson model and modify it to mitigate the effect of positionspecific biases. This leads to three methods of increasing complexity. We evaluate our models in terms of their accuracy and consistency. We compare our methods' accuracy to each other and to existing approaches of estimating PSI with respect to a reference set of 26 RTPCR measurements from a human cell line. As we discussed above, we are interested in developing algorithms that provide robust estimates: A handful of highly biased positions in the transcript, from which a much larger number of reads is obtained simply due to fragmentation bias, should not unduly influence the estimate of PSI. Our results show a moderate increase in accuracy and a significant increase in consistency of our methods over the current state of the art methods for quantifying of alternative splicing events.
Methods
RNAseq data
RNAseq data was generated from a HeLa cell line by the Blencowe Lab at the University of Toronto [10]. The protocol consisted of polyAselected RNA extraction, random hexamer primed reverse transcription, cDNA fragmentation (with mean insert size of 220nt), and 50nt pairedend sequencing by Illumina GA. This dataset is publicly available on the NCBI Gene Expression Omnibus with accession number GSE26463. 305 million RNAseq reads were sequenced and mapped to the reference human genome (NCBI build37, UCSC hg19) using TopHat, which is capable of reporting splitread alignments across splice junctions [11]. TopHat produced errorfree alignments for 66 million reads (about 22% of the total). For each exonexon junction, the reads that overlapped it by at least 8nt were selected and their positions were noted. Positions that contained reads mapping elsewhere were excluded. The number of 3' fragment ends (i.e. reads starts) around the junction was tabulated into a profile of read hits for each junction. This profile of read start counts is also called a read cover, in contrast to the more popular read coverage.
Figure 1 illustrates the actual cover profile for a representative constitutive (i.e. exclusion) junction with a relatively high total number of reads. Positiondependent biases in the read cover lead to positions with zero reads, as well as positions with many mode reads than are expected based on other positions. These two situations are sometimes treated differently, but they are essentially due to the same cause: positiondependent effects. Note that these positiondependent effects are present in the majority of junctions regardless of their underlying expression. Another source of error is mismatched reads but, in this work, we deliberately used only errorfree alignments (as opposed to the common practice of allowing a small number of mismatches) in order to differentiate the positional biases from mismatch noise. When estimating PSI, the individual read covers for each pair of alternative junctions that flank an alternative exon can be tabulated into a joint inclusion junction cover using halfcounts at each position. This is common practice for analyses of alternative splicing as it is assumed that the increased sample size results in better estimates of expression. However, we note that averaging the read covers for the two alternative junctions is not appropriate when the constitutive annotation of the two flanking exons is in question, and this approach does not significantly reduce the harsh effects of positional biases.
The existing tools for isoform quantification, MISO and Cufflinks were provided with the entire alignment, not just the reads mapping to junctions. MISO (version 0.2) and Cufflinks (version 1.2) were run with default parameters except for the pairedend read insert size and the number of samples from the appropriate posterior, which were set to 220 and 10000, respectively.
Native model
The first model we study makes the simplifying assumption that reads are sampled independently and identically distributed (IID) from the expressed isoforms. We refer to it as the "Native" model, because its key component, the Poisson arrival process, is a natural model for IID read coverage. This "Native" model has worked sufficiently well in the past for analysis in many respectable DNA and RNA sequencing studies [2].
Many simple models of RNAseq data assume, either explicitly or implicitly, that reads are sampled uniformly along the length of a transcript [1, 12]. However, actual RNAseq data do not follow this assumption because of multiple sequence and positionspecific biases inherent in the cDNA library preparation and sequencing [4, 5, 7, 13]. Still, we might expect this assumption to hold for sufficiently short regions on a transcript, such as the neighborhood around an exonexon junction. In this case, the number of read starts x_{ p }mapping to each position p near the junction should follow a Poisson distribution whose mean is estimated by \stackrel{\u0303}{\alpha}=\frac{1}{P}{\sum}_{p}{x}_{p} where the region of interest spans positions {1, 2, ... P}. The mean and matching variance α will estimate both the overall expression for that junction and the model's uncertainty in that expression. Unfortunately, reads are not distributed uniformly, even along short regions with sufficient coverage. As shown on Figure 1, the read counts covering the region within 50nt of a representative constitutive junction are highly variable and nonuniform. The corresponding cover for the two alternative junctions (not shown) contains about twice as many read counts in total, but they are split over two neighborhoods of 50nt. In general, RNAseq data deviates from the Native Poisson model in two ways:

the high sparsity of the data (~ 80% of positions have no reads starting at them) causes\stackrel{\u0303}{\alpha}, the average cover for the region, to underestimate the expected abundance α.

the variance of the nonzero elements x_{ p }> 0 is three times larger than that dictated by the Native model.
Note that the Poisson model describes the likelihood P(x_{ p } α) of observing a particular read cover profile x_{ p }given the unknown expression α. However, we are interested in the posterior probability P(α  x_{ p }) of the hidden expression given the observed data. This posterior can be obtained from the likelihood of the observed data and the prior over the expression through the classic Bayes' Rule:
Once we have distributions over the expected expression for both the alternative (a.k.a. inclusion) and the constitutive (a.k.a exclusion) junctions, α^{i}and α^{e}respectively, we combine them to produce the posterior over the PSI estimate of this model P\left({\Psi}_{\mathsf{\text{Native}}}{x}_{p}^{i},{x}_{p}^{e}\right) given the observed read counts over the inclusion ({x}_{p}^{i}) and exclusion ({x}_{p}^{e}) junctions, respectively. There is no closedform expression for this distribution, but we can estimate it with the ratios of samples from the inclusion and exclusion posteriors:
Gaussian model
In order to alleviate the shortcomings of the Native model, we propose two simple modifications which result in a new Gaussian model that is more robust to the positionspecific biases present in RNAseq data. To deal with the sparse cover and its effect on the expected expression, α, we dismiss all unmappable positions, i.e. those positions which coincide with the start of reads that map elsewhere in the reference genome or transcriptome. This leaves only the set of position indexes Q which coincide with the hits of only uniquelymappable reads. Therefore, the normalized expression of a junction is \gamma =\frac{1}{\leftQ\right}{\sum}_{q\in Q}{x}_{q}+\frac{1}{p} where we have added the pseudocount \frac{1}{P} in order to avoid dividing by zero for junctions which have no uniquelymappable reads, e.g. those that come from homologous regions of the genome.
To deal with the high variance at positions with nonzero read count, we approximate the PSI ratio of normalized junction expressions with a Gaussian distribution. Unlike the Poisson distribution whose mean and variance are identical by definition, the link between the mean and variance of this Gaussian approximation can be relaxed in order to make the model more robust. The mean μ is estimated by the ratio of the normalized read counts for the inclusion and exclusion junctions (γ^{i}and γ^{e}, respectively). The standard deviation σ is proportional to the geometric mean of μ and its complement 1  μ. The variance σ^{2}is normalized by the total number of uniquely mappable reads in the alternative and constitutive junction Γ = γ^{i}Q^{i} + γ^{e}Q^{e}, where Q^{i} is the number of uniquelymappable positions for the inclusion junction, and Q^{i} is that for the exclusion junction. Finally, the variance is lowerbounded by an arbitrary threshold in order to avoid overfitting the noisy RNAseq data:
This approximation allows us to skip the Bayesian procedure and sampling approximation required by the Native model, since we can directly specify the posterior distribution of our estimate for PSI given the read counts around a junction: P\left({\Psi}_{\mathsf{\text{Gaussian}}}{x}_{{p}^{\prime}}\right)~\mathcal{N}\left(\stackrel{\u0303}{\mu},{\stackrel{\u0303}{\sigma}}^{2}\right).
Bootstrap technique
To robustly estimate PSI without explicitly modeling sequence and position dependent bias, we propose a method based on randomly resampling the observed data. This method computes the degree of uncertainty in PSI by estimating the consistency within the observed dataset. It belongs to a general class of statistical methods called bootstraping that have been successfully used to model complex and unknown distributions [14].
The bootstrap can be used to assess the uncertainty in the PSI estimates produced by any method that takes positiondependent read counts as input. Here, we use a Poisson model. We assume that there are P mappable junction positions for each exon skipping event. We observe {x}_{p}^{i} inclusion reads and {x}_{p}^{e} exclusion reads for each position p = {1, 2, ... P}. To estimate PSI from such a dataset, a simple approach assumes that for every position, {x}_{p}^{i} and {x}_{p}^{e} are generated by a Poisson distribution with realvalued underlying abundances β^{i}and β^{j}respectively. A Poisson distribution is used to model the process of how RNAseq reads in each position arise from the true abundance of isoforms in the biological sample. Because of the IID assumption, the maxmimum likelihood (ML) estimator of β is simply the sum of the observed reads. Instead of simply using the ML estimator, we take a Bayesian approach where we assume an improper prior for P(β) = 1 for the abundances of both inclusion and exclusion variants. The posterior of β is a Gamma distribution with a shape parameter equal to 1:
where Gamma(θ, k) denote the real valued Gamma distribution with scale parameter θ and shape parameter k. In this application, the shape parameter is one plus the sum of the reads across positions. The Gamma random variable in the above equation incorporates our belief of likely values of isoform abundances (β) given the observed reads, with the IID assumption for read generation across positions. However, the IID assumption described above is highly incorrect, because of positiondependent effects introduced by RNAseq technologies. We use the bootstrap to assess the uncertainty induced by these effects as follows. Instead of summing over the reads at all positions, we generate a sample of P positions with replacement from the observed data and then sum the reads at those positions to produce an estimate of β as described above.
The above procedure is repeated to generate a distribution of β estimates, which can be used to form a distribution of PSI. In our approach, one million β^{i}and β^{e}are generated with which one million samples of Ψ_{bootstrap} are produced.
Robust mixture model
We propose a robust mixture model of read counts that span alternativelyspliced junctions from exon skipping events. The mixture has three components:

1.
A zerocover component to explain the empty positions arising from sparse fragmentation bias.

2.
A noise component to capture the read stacks arising from the other type of positional bias.

3.
A Poisson component to capture the remaining signal in the read cover.
Formulating a mixture model allows us to explicitly capture each of the two types of bias alongside the underlying signal in RNAseq data.
For each cassette splicing event, our model links the hidden expression counts λ^{i}and λ^{e}, for the inclusion and exclusion junctions, to the unknown PSI and coverage values: {\Psi}_{\lambda}\in \mathbb{Q} and C ∈ ℤ, and to the observed read counts: {x}_{p}^{i}\in \mathbb{Z} and {x}_{p}^{i}\in \mathbb{Z} where p ∈ {1, 2, ..., P} are positions in the neighborhood of each junction. As before, Ψ_{λ}, C, and λ are linked by a deterministic relationship:
Figure 2 shows the plate diagram for the Robust Mixture model. Its priors and factors are described in the following sections. The the priors and factors combine via Bayes' Rule (already described in Equation (1)) to give the posterior distribution over the hidden variables and mixture weights of this model.
Priors

PSI: Ψ_{λ} ~ Uniform[0, 1]
even though the empirical distribution is closer to a convex Beta distribution with preference for extreme values of Ψ_{λ}, we use the least informative prior in order to gain the most information about this hidden variable of interest [7].

Cover: C ~ Gamma(θ, k)
with scale parameter θ = 77.77 and shape parameter k = 0.77 estimated from C's empirical distribution.

Expression: A complex prior on λ^{i}and λ^{e}is induced by the priors on Ψ_{λ} and C through the relation in equation (11). We impose no further restriction on the distribution of these hidden variables.

Mixture: The weights of the three mixture components represent the relative strengths of the signal and the two noise models. The observed sparsity of RNAseq data ( where 80% of junctionneighboring positions have no read alignments starting from them) is an upper bound on the true sparsity because we expect to see zerocover positions in junctions with very low expression. Therefore we chose 60% sparsity as a reasonable compromise. Likewise, the observed readstack outlier rates for the Illumina platform is a lower bound on the actual fraction of outlier reads (3% of all junctionadjacent positions have a read count that is 10× higher than the simple average).
{p}_{0}\left({m}_{p}\right)=\left\{\begin{array}{cc}\hfill 0.60\hfill & \hfill \mathsf{\text{Zero}}\phantom{\rule{0.3em}{0ex}}\mathsf{\text{Cover}}\left({m}_{p}=0\right)\hfill \\ \hfill 0.36\hfill & \hfill \mathsf{\text{Poisson}}\phantom{\rule{0.3em}{0ex}}\mathsf{\text{Model}}\phantom{\rule{0.3em}{0ex}}\left({m}_{p}=1\right)\hfill \\ \hfill 0.04\hfill & \hfill \mathsf{\text{Read}}\phantom{\rule{0.3em}{0ex}}\mathsf{\text{Stacks}}\left({m}_{p}=2\right)\hfill \end{array}\right.(12)
Factors

Deterministic: λ^{i}, λ^{e}~ δ(λ^{i}= Ψ_{ λ }* C)δ(λ^{e}= C  λ^{i})

Multinomial: m_{ p }~ Multinomial(c_{ z }, c_{ p }, c_{ s })
This factor allows our model to learn the actual mixture weights for each of the components from the observed data.

Mixture: We use a mixture factor in order to capture each of the two biases and the actual signal in separate components. The choice for each component is motivated by the form of the signal or noise it is designed to capture.
{x}_{p}{m}_{p,}\lambda ~\left\{\begin{array}{cc}\hfill \delta \left({x}_{p}=0\right)\hfill & \hfill \mathsf{\text{Sparsity}}\left({m}_{p}=0\right)\hfill \\ \hfill \mathsf{\text{Poisson}}\left(\lambda \right)\hfill & \hfill \mathsf{\text{Signal}}\left({m}_{p}=1\right)\hfill \\ \hfill \mathsf{\text{Uniform}}\left[1,L\right]\hfill & \hfill \mathsf{\text{Noise}}\left({m}_{p}=2\right)\hfill \end{array}\right.(13)
Practical considerations
Performing inference in the Native and Robust Mixture models described above is intractable due to the complex partition function that normalizes the posterior distribution P(Ψx_{ p }). To compute the posterior, we could use advanced approximate inference methods such as Expectation Maximization used by IsoEM [8], Markov Chain Monte Carlo used by MISO [7], and combinatorial optimization used by Cufflinks [5, 12]. However, we note that discretizing the values of their parameters allows us to approximate the partition function and directly calculate the posterior distribution over the discretized PSI values: Ψ_{ α }and Ψ_{λ} respectively. In contrast, the Gaussian and bootstrap models give a posterior over Ψ_{ γ }directly, either in a closed form expression or in the form of samples from a provably exact distribution. Figure 3 shows that the resulting posterior distributions for all PSI estimators are wellformed, especially for junctions with sufficiently high read cover, and gives support for the viability of our discretization scheme for junctions of medium or even low read cover. Finally, performing inference with discretized parameters takes considerably less time at a minimal loss of precision. This allows our methods to analyze an entire prealigned RNAseq dataset in the manner of a few minutes, while other methods take tens of hours or even days on the same task, while other methods take hours on the same task.
Results and discussion
Accurate estimation of PSI
In order to evaluate the accuracy of our models and compare it to that of the existing methods, we selected a validation set of 26 cassette exons with reference PSI values derived from RTPCR experiments in HeLa cells [10]. The 26 events include 11 highexpression events with between 10 and 20 read starts per position, 8 mediumexpression events with about 1 read start per position, and 7 lowexpression events with 10 or fewer reads total across all 50 positions (≤ 0.2 read starts per position). Figure 3 compares the posterior distributions over PSI inferred by six different methods: our four methods described in the Methods section, and two popular tools for isoform quantification, MISO and Cufflinks. All tools shared the same input, but were able to extract varying amount of information from it. The shared TopHat alignment file included the mapping of reads to a reference set constructed only from the constitutive and alternative exons of the 26 cassette events. Our tools were able to use only the reads mapping across junctions, while MISO and Cufflinks was free to use the entire set of alignments. Furthermore, our methods did not benefit from the pairedend dependencies between the reads, while both MISO and Cufflinks were able to do so. To be fair, we note that Cufflinks is designed for wholetranscript quantification. Thus, we did not expect it to be competitive with the other methods on a highly restricted reference set consisting of only three exons per alternative splicing event
While limited, this comparison clearly shows that no particular method outperforms the others on every event. However, it does suggest that our methods are more accurate, especially when they agree with each other. We investigate the consistency of our methods in a later part of the Results section. Unfortunately there is no canonical way to measure the error between a distribution estimate and a point target. However, we modify three existing distance metrics between distributions and propose a new metric which allow us to compute the overall performance of the six methods on all 26 events. Given a PDF distribution of PSI estimates P(x) and a target value ψ described by discretized Gaussian distribution Q_{ ψ }(x) centered at the point target, ψ. We used an arbitrary standard deviation σ = 0.05 which is comparable to the accuracy needed for downstream applications of PSI estimates. The new metric directly computes the distance between a distribution and its target.

Variation distance, which measures the total deviation between the two distributions
V\left(P,{Q}_{\psi}\right)=\sum _{0\le x\le 1}P\left(x\right){Q}_{\psi}\left(x\right)(14) 
Disagreement distance between CDFs, which measures the maximum deviation. In our case, the maximum is attained at the mode of either P or Q_{ ψ }
S\left(P,{Q}_{\psi}\right)=\underset{0\le y\le 1}{\text{max}}\sum _{0\le x\le y}P\left(x\right){Q}_{\psi}\left(x\right)(15) 
KL divergence, which measures the asymmetric disagreement between P or Q_{ ψ }with respect to the latter
{D}_{\mathsf{\text{KL}}}\left({Q}_{\psi}\parallel P\right)=\sum _{0\le x\le 1}{Q}_{\psi}\left(x\right)\text{log}\frac{P\left(x\right)}{{Q}_{\psi}\left(x\right)}(16) 
Novel confidenceweighted {L}_{\frac{1}{2}} error distance, is designed to penalize distributions that distribute weight away from the target ψ
{E}_{\frac{1}{2}}\left(P,\psi \right)=\sum _{0\le x\le 1}P\left(x\right){\u2225x\psi \u2225}_{\frac{1}{2}}(17)
Table 1 shows the overall performance of each PSI estimation method over the 26 target events according to each of these error metrics. While our most robust methods perform well on three of these metrics, it is not surprising that MISO outperforms every other method on the remaining Smetric because it always distributes its posterior mass wider than our methods. The disagreement distance, S(P, Q_{ ψ }) rewards this extensive hedging because it is very susceptible to sampling noise which is abundant on Figure 3. The remaining metrics are chosen to be more robust when faced with this sampling noise.
Consistent estimation of PSI
In order to further investigate the consistency of PSI estimation methods, we performed a random subsampling procedure. This procedure chooses a random half of the positions around a junction and uses the subset of reads that start at those positions to obtain an unbiased estimate of the noise associated with the positional bias. A dataset with reduced set of positions is equivalent to a dataset with reduced signaltonoise ratio. Comparing the PSI estimate of a method given each half of the positions can measure the consistency of that method. Figure 4 depicts the consistency of the most accurate methods from Table 1 with a nonstandard 2D color visualization. We call a this visualization a constellation plot because of its superficial resemblance to images of deepspace galaxies.
We expect more consistent methods to produce consistently more similar estimates of PSI. For each method, we calculate the KLdivergence between its PSI estimate on a particular event to the PSI estimate on all other events. We compare the mean of all crossevent divergence to the divergence between PSI estimates from complementary halves of the same event. The former divergence we call the interexon distance, and the latter we call the intraexon distance. Then, the ratio between the inter and intraexon distances is a measure of the method's consistency for that particular exon. More consistent methods will have a higher ratio over all events. Figure 5 compares the consistency ratios of our four methods and that of MISO using a larger dataset of over 1000 events (including the 26 validated by RTPCR).
Consistency of the PSI estimates is especially important to the downstream uses of our methods. If only a randomly selected subset of positions are taken into account, the PSI estimate (and its uncertainty) should be very similar to the estimate that would be computed based on the complementary set of transcript positions. Thus we defined a measure of consistency of the estimator as the ratio of the average distance of the PSI distributions obtained from two different genes and the average distance from PSI distributions obtained from different position subsets of the same transcript. High values of this ratio indicated that using a smaller subset of the positions will not affect the estimate of PSI drastically, but that this is not achieved in a trivial way by always estimating either a high or a very low level of exon inclusion.
Runtime and efficiency
While accuracy and consistency are the most important considerations for any approach of estimating PSI, runtime and efficiency are becoming increasingly relevant as the amount of RNAseq data grows rapidly. Table 2 compares the runtimes of all methods on both the small validation set of 26 events and the larger set of 1051 events. To estimate the distribution over PSI values for each event, we used 10,000 samples for all methods. Sampling from the Gaussian model was direct whereas other models sampled the expression for inclusion and exclusion isoforms separately. It is not surprising that the run time of our preprocessing grows linearly with the number of RNAseq reads, and we expect the same happens to the preprocessing subroutines of both MISO and Cufflinks. However, the estimation subroutines in the two established tools are disproportionately slower on the larger dataset than any of our simple methods, including the robust and very consistent bootstrap model.
Conclusion
This work addressed the problem of estimating relative abundances of alternativelyspliced cassette exons from the sparse and noisy evidence in RNAseq data. First, we investigated the raw data and reviewed known fragmentation biases resulting from current RNAseq protocols. Next, we identified positionspecific anomalies affected by these biases, and proposed a modular probabilistic framework that robustly estimates the PSI and total coverage of alternativelyspliced exon junctions. Using this foundation, we framed the classic IID read sampling assumption as a Poisson model and termed the two types of positionspecific deviations in the actual data as sparse cover and read stacks. Using the established framework, we proposed three novel probabilistic methods of increasing complexity, which mitigate the effects of these two biases. We compared our methods' accuracy to each other and to existing approaches of estimating PSI with respect to a reference set of 26 RTPCR measurements from a human cell line. Our results showed a moderate increase in accuracy and a significant increase in consistency of our methods over the current stateoftheart for quantification of alternative splicing events. While we presented and referenced several methods for quantifying alternative splicing, our goal was not to pick a single champion that is superior to all others, but to compare the strengths and weaknesses of the various approaches. We hope that these advances will enable more sensitive downstream analyses, such as better determinants of differential splicing which can eventually lead to an improved RNA splicing code.
References
Mortazavi A, Wold B: Mapping and quantifying mammalian transcriptomes by RNASeq. Nat Meth. 2008, [http://dx.doi.org/10.1038/nmeth.1226]
Wang Z, Gerstein M, Snyder M: RNASeq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, [http://dx.doi.org/10.1038/nrg2484]
Barash Y, Calarco JA, Gao W, Pan Q, Wang X, Shai O, Blencowe BJ, Frey BJ: Deciphering the splicing code. Nature. 2010, [http://dx.doi.org/10.1038/nature09000]
Hansen KD, Brenner SE, Dudoit S: Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research. 2010, 38 (12): e131e131. 10.1093/nar/gkq224. [http://nar.oxfordjournals.org/content/38/12/e131.abstract]]
Roberts A, Trapnell C, Donaghey J, Rinn J, Pachter L: Improving RNASeq expression estimates by correcting for fragment bias. Genome Biology. 2011, 12 (3): [http://genomebiology.com/2011/12/3/R22]
Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ: Deep surveying of alternative splicing complexity in the human transcriptome by highthroughput sequencing. Nature Genetics. 2008
Katz Y, Wang ET, Airoldi EM, Burge CB: Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat Meth. 2010, 7 (12): 10091015. 10.1038/nmeth.1528.
Nicolae M, Mangul S, Mandoiu I, Zelikovsky A: Estimation of alternative splicing isoform frequencies from RNASeq data. Algorithms for Molecular Biology. 2011, 6: [http://www.almob.org/content/6/1/9]
Turro E, Su SY, Goncalves A, Coin L, Richardson S, Lewin A: Haplotype and isoform specific expression estimation using multimapping RNAseq reads. Genome Biology. 2011, 12 (2): [http://genomebiology.com/2011/12/2/R13]
Saltzman AL, Pan Q, Blencowe BJ: Regulation of alternative splicing by the core spliceosomal machinery. Genes and Development. 2011, 25: 373384. 10.1101/gad.2004811.
Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNASeq. Bioinformatics. 2009, 25 (9): 11051111. 10.1093/bioinformatics/btp120. [http://bioinformatics.oxfordjournals.org/content/25/9/1105.abstract]
Trapnell C, Williams BA, Wold BJ, Pachter L: Transcript assembly and quantification by RNASeq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotech. 2010, [http://dx.doi.org/10.1038/nbt.1621]
Srivastava S, Chen L: A twoparameter generalized Poisson model to improve the analysis of RNAseq data. Nucleic Acids Research. 2010, [http://nar.oxfordjournals.org/cgi/content/abstract/gkq670v1]
Davison A, Hinkley D: Bootstrap methods and their application. 1997, Cambridge Univ Pr
Acknowledgements
We kindly thank Yoseph Barash for his early involvement and Benjamin Blencowe for discussions on RNAseq and RTPCR. We also recognize the valuable comments and detailed critiques by our anonymous reviewers.
Funding: BK is supported by the NSF graduate research fellowship. BJF acknowledges funding from CIHR and NSERC. BJF is a Fellow of the Canadian Institute for Advanced Research and an NSERC E.W.R. Steacie Fellow.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 6, 2012: Proceedings of the Second Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMBseq 2012).
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
BK identified the positional biases, developed the Robust Mixture method, performed the analyses, and drafted the manuscript. HYX developed the Bootstrap method and wrote its description. LJL preprocessed the RNAseq data, and participated in the analysis. NJ developed the consistency ratio measure and revised the manuscript. BJF guided the study and proposed the Bootstrap method.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Kakaradov, B., Xiong, H.Y., Lee, L.J. et al. Challenges in estimating percent inclusion of alternatively spliced junctions from RNAseq data. BMC Bioinformatics 13 (Suppl 6), S11 (2012). https://doi.org/10.1186/1471210513S6S11
Published:
DOI: https://doi.org/10.1186/1471210513S6S11