Challenges in estimating percent inclusion of alternatively spliced junctions from RNAseq data
 Boyko Kakaradov^{1},
 Hui Yuan Xiong^{1},
 Leo J Lee^{1},
 Nebojsa Jojic^{2} and
 Brendan J Frey^{1}Email author
https://doi.org/10.1186/1471210513S6S11
© Kakaradov et al.; licensee BioMed Central Ltd. 2012
Published: 19 April 2012
Abstract
Transcript quantification is a longstanding problem in genomics and estimating the relative abundance of alternativelyspliced isoforms from the same transcript is an important special case. Both problems have recently been illuminated by highthroughput RNA sequencing experiments which are quickly generating large amounts of data. However, much of the signal present in this data is corrupted or obscured by biases resulting in nonuniform and nonproportional representation of sequences from different transcripts. Many existing analyses attempt to deal with these and other biases with various taskspecific approaches, which makes direct comparison between them difficult. However, two popular tools for isoform quantification, MISO and Cufflinks, have adopted a general probabilistic framework to model and mitigate these biases in a more general fashion. These advances motivate the need to investigate the effects of RNAseq biases on the accuracy of different approaches for isoform quantification. We conduct the investigation by building models of increasing sophistication to account for noise introduced by the biases and compare their accuracy to the established approaches.
We focus on methods that estimate the expression of alternativelyspliced isoforms with the percentsplicedin (PSI) metric for each exon skipping event. To improve their estimates, many methods use evidence from RNAseq reads that align to exon bodies. However, the methods we propose focus on reads that span only exonexon junctions. As a result, our approaches are simpler and less sensitive to exon definitions than existing methods, which enables us to distinguish their strengths and weaknesses more easily. We present several probabilistic models of of positionspecific read counts with increasing complexity and compare them to each other and to the current stateoftheart methods in isoform quantification, MISO and Cufflinks. On a validation set with RTPCR measurements for 26 cassette events, some of our methods are more accurate and some are significantly more consistent than these two popular tools. This comparison demonstrates the challenges in estimating the percent inclusion of alternatively spliced junctions and illuminates the tradeoffs between different approaches.
Introduction
Determining the relative abundance of gene transcripts in a cell  whether in relation to each other or in relation to corresponding transcripts in other cells  is an important and longstanding problem in genomics. Since introduction of RNAseq, a highthroughput experimental method of measuring the RNA content of a sample by reversetranscribing it and sequencing the resultant cDNA, this problem has been illuminated by vast amounts of data and by many methods for elucidating transcript abundance [1]. Current collections of RNAseq data are rapidly growing in multiple dimensions such as species, tissues, and conditions [2].
This data deluge necessitates more sophisticated and accurate analysis methods, which in turn create an opportunity to gain deeper insights into the role and regulation of transcript abundance in important developmental and disease processes. Undoubtedly, one important research area that can benefit from these advances is the study of RNA splicing, an essential cellular process that effectively increases the phenotypic complexity of eukaryotic organisms without necessitating an increase in their genetic complexity. Accurate measurements of the expression levels for isoforms from a large number of genes are especially useful for research into the molecular mechanisms that regulate alternative splicing in different tissues. For example, the recent advances in the RNA splicing code that determines the relative abundance of alternatively spliced isoforms [3] was made possible by highthroughput microarray technology. In principle, RNAseq can lead to much richer datasets at a fraction of the cost. Thus RNAseq technology can lead to significant new breakthroughts, as the code quality achieved by [3] leaves a lot of room for improvement. The focus of this paper  estimation of the percent inclusion of alternativelyspliced exons from RNAseq data  is a step toward a more accurate interpretation of the natural splicing code. This problem is complicated by several sources of bias in short read counts including those due to the cDNA fragmentation and primer amplification steps of current RNAseq protocols [4, 5]. These biases lead to widely varying abundances for reads from different positions in the transcript. We investigate this positionspecific bias further and suggest methods to mitigate it.
Specifically, we restrict our interest only to exonskipping events [6, 7]. The numerical quantity which captures relevant information for these events is termed percentsplicedin (PSI). For each exonskipping event, PSI is defined as the expression of isogorms containing the alternatively spliced exon (i.e. those containing a given cassette exon and its flanking constitutive exons) as a fraction of the total expression for both alternatively and constitutively spliced isoforms (i.e. those containing the flanking exons only) which is reported in percent. Accurate estimation of PSI is not only desirable on its own, but it can also be used to improve the resolution of differential splicing and thus improve the predictive power of the splicing code [3].
There are several recent tools for estimating relative abundance of isoforms, which deal with positionspecific biases in different ways [5, 7–9]. MISO can directly estimate PSI specifically for exonskipping events [7], while most others estimate the expression of whole isoforms from which a PSI value may be derived. This makes MISO the natural point of reference for our comparisons, but we also include Cufflinks [5] in the comparisons because of its popularity and explicit modeling of fragmentation and amplification biases. However, for the task of estimating PSI, Cufflinks' focus on multiexon isoforms appears to be detrimental, as we show in the Results section.
Our pursuit of robust estimates for PSI necessitates an appropriate measure of the uncertainty for these estimates. This additional necessity is crucial for the task of deciphering the natural RNA splicing code. Linking noisy RNAseq read counts with the sequence determinants of RNA splicing is a hard task that requires good measurement of splicing levels even in case of transcripts with minimal coverage. For this task it is just as important to quantify the range of possible PSI values supported by the RNAseq data, given that the positionspecific bias can dramatically influence these estimates. We start by framing the classic IID sampling assumption as a Poisson model and modify it to mitigate the effect of positionspecific biases. This leads to three methods of increasing complexity. We evaluate our models in terms of their accuracy and consistency. We compare our methods' accuracy to each other and to existing approaches of estimating PSI with respect to a reference set of 26 RTPCR measurements from a human cell line. As we discussed above, we are interested in developing algorithms that provide robust estimates: A handful of highly biased positions in the transcript, from which a much larger number of reads is obtained simply due to fragmentation bias, should not unduly influence the estimate of PSI. Our results show a moderate increase in accuracy and a significant increase in consistency of our methods over the current state of the art methods for quantifying of alternative splicing events.
Methods
RNAseq data
RNAseq data was generated from a HeLa cell line by the Blencowe Lab at the University of Toronto [10]. The protocol consisted of polyAselected RNA extraction, random hexamer primed reverse transcription, cDNA fragmentation (with mean insert size of 220nt), and 50nt pairedend sequencing by Illumina GA. This dataset is publicly available on the NCBI Gene Expression Omnibus with accession number GSE26463. 305 million RNAseq reads were sequenced and mapped to the reference human genome (NCBI build37, UCSC hg19) using TopHat, which is capable of reporting splitread alignments across splice junctions [11]. TopHat produced errorfree alignments for 66 million reads (about 22% of the total). For each exonexon junction, the reads that overlapped it by at least 8nt were selected and their positions were noted. Positions that contained reads mapping elsewhere were excluded. The number of 3' fragment ends (i.e. reads starts) around the junction was tabulated into a profile of read hits for each junction. This profile of read start counts is also called a read cover, in contrast to the more popular read coverage.
The existing tools for isoform quantification, MISO and Cufflinks were provided with the entire alignment, not just the reads mapping to junctions. MISO (version 0.2) and Cufflinks (version 1.2) were run with default parameters except for the pairedend read insert size and the number of samples from the appropriate posterior, which were set to 220 and 10000, respectively.
Native model
The first model we study makes the simplifying assumption that reads are sampled independently and identically distributed (IID) from the expressed isoforms. We refer to it as the "Native" model, because its key component, the Poisson arrival process, is a natural model for IID read coverage. This "Native" model has worked sufficiently well in the past for analysis in many respectable DNA and RNA sequencing studies [2].
Many simple models of RNAseq data assume, either explicitly or implicitly, that reads are sampled uniformly along the length of a transcript [1, 12]. However, actual RNAseq data do not follow this assumption because of multiple sequence and positionspecific biases inherent in the cDNA library preparation and sequencing [4, 5, 7, 13]. Still, we might expect this assumption to hold for sufficiently short regions on a transcript, such as the neighborhood around an exonexon junction. In this case, the number of read starts x_{ p }mapping to each position p near the junction should follow a Poisson distribution whose mean is estimated by $\stackrel{\u0303}{\alpha}=\frac{1}{P}{\sum}_{p}{x}_{p}$ where the region of interest spans positions {1, 2, ... P}. The mean and matching variance α will estimate both the overall expression for that junction and the model's uncertainty in that expression. Unfortunately, reads are not distributed uniformly, even along short regions with sufficient coverage. As shown on Figure 1, the read counts covering the region within 50nt of a representative constitutive junction are highly variable and nonuniform. The corresponding cover for the two alternative junctions (not shown) contains about twice as many read counts in total, but they are split over two neighborhoods of 50nt. In general, RNAseq data deviates from the Native Poisson model in two ways:

the high sparsity of the data (~ 80% of positions have no reads starting at them) causes$\stackrel{\u0303}{\alpha}$, the average cover for the region, to underestimate the expected abundance α.

the variance of the nonzero elements x_{ p }> 0 is three times larger than that dictated by the Native model.
Gaussian model
In order to alleviate the shortcomings of the Native model, we propose two simple modifications which result in a new Gaussian model that is more robust to the positionspecific biases present in RNAseq data. To deal with the sparse cover and its effect on the expected expression, α, we dismiss all unmappable positions, i.e. those positions which coincide with the start of reads that map elsewhere in the reference genome or transcriptome. This leaves only the set of position indexes Q which coincide with the hits of only uniquelymappable reads. Therefore, the normalized expression of a junction is $\gamma =\frac{1}{\leftQ\right}{\sum}_{q\in Q}{x}_{q}+\frac{1}{p}$ where we have added the pseudocount $\frac{1}{P}$ in order to avoid dividing by zero for junctions which have no uniquelymappable reads, e.g. those that come from homologous regions of the genome.
This approximation allows us to skip the Bayesian procedure and sampling approximation required by the Native model, since we can directly specify the posterior distribution of our estimate for PSI given the read counts around a junction: $P\left({\Psi}_{\mathsf{\text{Gaussian}}}{x}_{{p}^{\prime}}\right)~\mathcal{N}\left(\stackrel{\u0303}{\mu},{\stackrel{\u0303}{\sigma}}^{2}\right)$.
Bootstrap technique
To robustly estimate PSI without explicitly modeling sequence and position dependent bias, we propose a method based on randomly resampling the observed data. This method computes the degree of uncertainty in PSI by estimating the consistency within the observed dataset. It belongs to a general class of statistical methods called bootstraping that have been successfully used to model complex and unknown distributions [14].
where Gamma(θ, k) denote the real valued Gamma distribution with scale parameter θ and shape parameter k. In this application, the shape parameter is one plus the sum of the reads across positions. The Gamma random variable in the above equation incorporates our belief of likely values of isoform abundances (β) given the observed reads, with the IID assumption for read generation across positions. However, the IID assumption described above is highly incorrect, because of positiondependent effects introduced by RNAseq technologies. We use the bootstrap to assess the uncertainty induced by these effects as follows. Instead of summing over the reads at all positions, we generate a sample of P positions with replacement from the observed data and then sum the reads at those positions to produce an estimate of β as described above.
The above procedure is repeated to generate a distribution of β estimates, which can be used to form a distribution of PSI. In our approach, one million β^{ i }and β^{ e }are generated with which one million samples of Ψ_{bootstrap} are produced.
Robust mixture model
 1.
A zerocover component to explain the empty positions arising from sparse fragmentation bias.
 2.
A noise component to capture the read stacks arising from the other type of positional bias.
 3.
A Poisson component to capture the remaining signal in the read cover.
Formulating a mixture model allows us to explicitly capture each of the two types of bias alongside the underlying signal in RNAseq data.
Priors

PSI: Ψ_{λ} ~ Uniform[0, 1]
even though the empirical distribution is closer to a convex Beta distribution with preference for extreme values of Ψ_{λ}, we use the least informative prior in order to gain the most information about this hidden variable of interest [7].

Cover: C ~ Gamma(θ, k)
with scale parameter θ = 77.77 and shape parameter k = 0.77 estimated from C's empirical distribution.

Expression: A complex prior on λ^{ i }and λ^{ e }is induced by the priors on Ψ_{λ} and C through the relation in equation (11). We impose no further restriction on the distribution of these hidden variables.

Mixture: The weights of the three mixture components represent the relative strengths of the signal and the two noise models. The observed sparsity of RNAseq data ( where 80% of junctionneighboring positions have no read alignments starting from them) is an upper bound on the true sparsity because we expect to see zerocover positions in junctions with very low expression. Therefore we chose 60% sparsity as a reasonable compromise. Likewise, the observed readstack outlier rates for the Illumina platform is a lower bound on the actual fraction of outlier reads (3% of all junctionadjacent positions have a read count that is 10× higher than the simple average).${p}_{0}\left({m}_{p}\right)=\left\{\begin{array}{cc}\hfill 0.60\hfill & \hfill \mathsf{\text{Zero}}\phantom{\rule{0.3em}{0ex}}\mathsf{\text{Cover}}\left({m}_{p}=0\right)\hfill \\ \hfill 0.36\hfill & \hfill \mathsf{\text{Poisson}}\phantom{\rule{0.3em}{0ex}}\mathsf{\text{Model}}\phantom{\rule{0.3em}{0ex}}\left({m}_{p}=1\right)\hfill \\ \hfill 0.04\hfill & \hfill \mathsf{\text{Read}}\phantom{\rule{0.3em}{0ex}}\mathsf{\text{Stacks}}\left({m}_{p}=2\right)\hfill \end{array}\right.$(12)
Factors

Deterministic: λ^{ i }, λ^{ e }~ δ(λ^{ i }= Ψ_{ λ }* C)δ(λ^{ e }= C  λ^{ i })

Multinomial: m_{ p }~ Multinomial(c_{ z }, c_{ p }, c_{ s })
This factor allows our model to learn the actual mixture weights for each of the components from the observed data.

Mixture: We use a mixture factor in order to capture each of the two biases and the actual signal in separate components. The choice for each component is motivated by the form of the signal or noise it is designed to capture.${x}_{p}{m}_{p,}\lambda ~\left\{\begin{array}{cc}\hfill \delta \left({x}_{p}=0\right)\hfill & \hfill \mathsf{\text{Sparsity}}\left({m}_{p}=0\right)\hfill \\ \hfill \mathsf{\text{Poisson}}\left(\lambda \right)\hfill & \hfill \mathsf{\text{Signal}}\left({m}_{p}=1\right)\hfill \\ \hfill \mathsf{\text{Uniform}}\left[1,L\right]\hfill & \hfill \mathsf{\text{Noise}}\left({m}_{p}=2\right)\hfill \end{array}\right.$(13)
Practical considerations
Results and discussion
Accurate estimation of PSI
In order to evaluate the accuracy of our models and compare it to that of the existing methods, we selected a validation set of 26 cassette exons with reference PSI values derived from RTPCR experiments in HeLa cells [10]. The 26 events include 11 highexpression events with between 10 and 20 read starts per position, 8 mediumexpression events with about 1 read start per position, and 7 lowexpression events with 10 or fewer reads total across all 50 positions (≤ 0.2 read starts per position). Figure 3 compares the posterior distributions over PSI inferred by six different methods: our four methods described in the Methods section, and two popular tools for isoform quantification, MISO and Cufflinks. All tools shared the same input, but were able to extract varying amount of information from it. The shared TopHat alignment file included the mapping of reads to a reference set constructed only from the constitutive and alternative exons of the 26 cassette events. Our tools were able to use only the reads mapping across junctions, while MISO and Cufflinks was free to use the entire set of alignments. Furthermore, our methods did not benefit from the pairedend dependencies between the reads, while both MISO and Cufflinks were able to do so. To be fair, we note that Cufflinks is designed for wholetranscript quantification. Thus, we did not expect it to be competitive with the other methods on a highly restricted reference set consisting of only three exons per alternative splicing event
While limited, this comparison clearly shows that no particular method outperforms the others on every event. However, it does suggest that our methods are more accurate, especially when they agree with each other. We investigate the consistency of our methods in a later part of the Results section. Unfortunately there is no canonical way to measure the error between a distribution estimate and a point target. However, we modify three existing distance metrics between distributions and propose a new metric which allow us to compute the overall performance of the six methods on all 26 events. Given a PDF distribution of PSI estimates P(x) and a target value ψ described by discretized Gaussian distribution Q_{ ψ }(x) centered at the point target, ψ. We used an arbitrary standard deviation σ = 0.05 which is comparable to the accuracy needed for downstream applications of PSI estimates. The new metric directly computes the distance between a distribution and its target.

Variation distance, which measures the total deviation between the two distributions$V\left(P,{Q}_{\psi}\right)=\sum _{0\le x\le 1}P\left(x\right){Q}_{\psi}\left(x\right)$(14)

Disagreement distance between CDFs, which measures the maximum deviation. In our case, the maximum is attained at the mode of either P or Q_{ ψ }$S\left(P,{Q}_{\psi}\right)=\underset{0\le y\le 1}{\text{max}}\sum _{0\le x\le y}P\left(x\right){Q}_{\psi}\left(x\right)$(15)

KL divergence, which measures the asymmetric disagreement between P or Q_{ ψ }with respect to the latter${D}_{\mathsf{\text{KL}}}\left({Q}_{\psi}\parallel P\right)=\sum _{0\le x\le 1}{Q}_{\psi}\left(x\right)\text{log}\frac{P\left(x\right)}{{Q}_{\psi}\left(x\right)}$(16)

Novel confidenceweighted ${L}_{\frac{1}{2}}$ error distance, is designed to penalize distributions that distribute weight away from the target ψ${E}_{\frac{1}{2}}\left(P,\psi \right)=\sum _{0\le x\le 1}P\left(x\right){\u2225x\psi \u2225}_{\frac{1}{2}}$(17)
Accuracy
Error  Native  Gaussian  Mixture  Bootstrap  MISO  Cufflinks 

V  28.5  24.1  27.2  24.2  30.9  43.7 
S  12.90  15.26  15.87  15.22  9.87  12.65 
D _{KL}  264  102  94.2  92.0  220  1115 
E _{1/2}  9.34  7.08  6.62  6.65  9.28  14.65 
Consistent estimation of PSI
Consistency of the PSI estimates is especially important to the downstream uses of our methods. If only a randomly selected subset of positions are taken into account, the PSI estimate (and its uncertainty) should be very similar to the estimate that would be computed based on the complementary set of transcript positions. Thus we defined a measure of consistency of the estimator as the ratio of the average distance of the PSI distributions obtained from two different genes and the average distance from PSI distributions obtained from different position subsets of the same transcript. High values of this ratio indicated that using a smaller subset of the positions will not affect the estimate of PSI drastically, but that this is not achieved in a trivial way by always estimating either a high or a very low level of exon inclusion.
Runtime and efficiency
Runtime
Datasets:  Validation  HighThroughput 

RNAseq reads  66 Million  145 Million 
AS events  26  1051 
Cufflinks  16 min  75 min 
MISO  77 min  458 min 
Preprocess  4 min  11 min 
Gaussian  +1 sec  +2 min 
Native  +2 sec  +5 min 
Mixture  +6 sec  +17 min 
Bootstrap  +12 sec  +29 min 
Conclusion
This work addressed the problem of estimating relative abundances of alternativelyspliced cassette exons from the sparse and noisy evidence in RNAseq data. First, we investigated the raw data and reviewed known fragmentation biases resulting from current RNAseq protocols. Next, we identified positionspecific anomalies affected by these biases, and proposed a modular probabilistic framework that robustly estimates the PSI and total coverage of alternativelyspliced exon junctions. Using this foundation, we framed the classic IID read sampling assumption as a Poisson model and termed the two types of positionspecific deviations in the actual data as sparse cover and read stacks. Using the established framework, we proposed three novel probabilistic methods of increasing complexity, which mitigate the effects of these two biases. We compared our methods' accuracy to each other and to existing approaches of estimating PSI with respect to a reference set of 26 RTPCR measurements from a human cell line. Our results showed a moderate increase in accuracy and a significant increase in consistency of our methods over the current stateoftheart for quantification of alternative splicing events. While we presented and referenced several methods for quantifying alternative splicing, our goal was not to pick a single champion that is superior to all others, but to compare the strengths and weaknesses of the various approaches. We hope that these advances will enable more sensitive downstream analyses, such as better determinants of differential splicing which can eventually lead to an improved RNA splicing code.
Declarations
Acknowledgements
We kindly thank Yoseph Barash for his early involvement and Benjamin Blencowe for discussions on RNAseq and RTPCR. We also recognize the valuable comments and detailed critiques by our anonymous reviewers.
Funding: BK is supported by the NSF graduate research fellowship. BJF acknowledges funding from CIHR and NSERC. BJF is a Fellow of the Canadian Institute for Advanced Research and an NSERC E.W.R. Steacie Fellow.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 6, 2012: Proceedings of the Second Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMBseq 2012).
Authors’ Affiliations
References
 Mortazavi A, Wold B: Mapping and quantifying mammalian transcriptomes by RNASeq. Nat Meth. 2008, [http://dx.doi.org/10.1038/nmeth.1226]Google Scholar
 Wang Z, Gerstein M, Snyder M: RNASeq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, [http://dx.doi.org/10.1038/nrg2484]Google Scholar
 Barash Y, Calarco JA, Gao W, Pan Q, Wang X, Shai O, Blencowe BJ, Frey BJ: Deciphering the splicing code. Nature. 2010, [http://dx.doi.org/10.1038/nature09000]Google Scholar
 Hansen KD, Brenner SE, Dudoit S: Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research. 2010, 38 (12): e131e131. 10.1093/nar/gkq224. [http://nar.oxfordjournals.org/content/38/12/e131.abstract]]PubMed CentralView ArticlePubMedGoogle Scholar
 Roberts A, Trapnell C, Donaghey J, Rinn J, Pachter L: Improving RNASeq expression estimates by correcting for fragment bias. Genome Biology. 2011, 12 (3): [http://genomebiology.com/2011/12/3/R22]Google Scholar
 Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ: Deep surveying of alternative splicing complexity in the human transcriptome by highthroughput sequencing. Nature Genetics. 2008Google Scholar
 Katz Y, Wang ET, Airoldi EM, Burge CB: Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat Meth. 2010, 7 (12): 10091015. 10.1038/nmeth.1528.View ArticleGoogle Scholar
 Nicolae M, Mangul S, Mandoiu I, Zelikovsky A: Estimation of alternative splicing isoform frequencies from RNASeq data. Algorithms for Molecular Biology. 2011, 6: [http://www.almob.org/content/6/1/9]Google Scholar
 Turro E, Su SY, Goncalves A, Coin L, Richardson S, Lewin A: Haplotype and isoform specific expression estimation using multimapping RNAseq reads. Genome Biology. 2011, 12 (2): [http://genomebiology.com/2011/12/2/R13]Google Scholar
 Saltzman AL, Pan Q, Blencowe BJ: Regulation of alternative splicing by the core spliceosomal machinery. Genes and Development. 2011, 25: 373384. 10.1101/gad.2004811.PubMed CentralView ArticlePubMedGoogle Scholar
 Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNASeq. Bioinformatics. 2009, 25 (9): 11051111. 10.1093/bioinformatics/btp120. [http://bioinformatics.oxfordjournals.org/content/25/9/1105.abstract]PubMed CentralView ArticlePubMedGoogle Scholar
 Trapnell C, Williams BA, Wold BJ, Pachter L: Transcript assembly and quantification by RNASeq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotech. 2010, [http://dx.doi.org/10.1038/nbt.1621]Google Scholar
 Srivastava S, Chen L: A twoparameter generalized Poisson model to improve the analysis of RNAseq data. Nucleic Acids Research. 2010, [http://nar.oxfordjournals.org/cgi/content/abstract/gkq670v1]Google Scholar
 Davison A, Hinkley D: Bootstrap methods and their application. 1997, Cambridge Univ PrView ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.