Blind spots of quantitative RNA-seq: the limits for assessing abundance, differential expression, and isoform switching
© Rehrauer et al.; licensee BioMed Central Ltd. 2013
Received: 31 July 2013
Accepted: 19 December 2013
Published: 24 December 2013
RNA-seq is now widely used to quantitatively assess gene expression, expression differences and isoform switching, and promises to deliver results for the entire transcriptome. However, whether the transcriptional state of a gene can be captured accurately depends critically on library preparation, read alignment, expression estimation and the tests for differential expression and isoform switching. There are comparisons available for the individual steps but there is not yet a systematic investigation which specific genes are impacted by biases throughout the entire analysis workflow. It is especially unclear whether for a given gene, with current methods and protocols, expression changes and isoform switches can be detected.
For the human genes, we report their detectability under various conditions using different approaches. Overall, we find that the input material has the biggest influence and may, depending on the protocol and RNA degradation, exhibit already strong length-dependent over- and underrepresentation of transcripts. The alignment step aligns for 50% of the isoforms up to 99% of the reads correctly; only in the presence of transcript modifications mainly short isoforms will have a low alignment rate. In our dataset, we found that, depending on the aligner and the input material used, the expression estimation of up to 93% of the genes being accurate within a factor of two; with the deviations being due to ambiguous alignments. Detection of differential expression using a negative-binomial count model works reliably for our simulated data but is dependent on the count accuracy. Interestingly, using the fold-change instead of the p-value as a score for differential expression yields the same performance in the situation of three replicates and the true change being two-fold. Isoform switching is harder to detect and for at least 109 genes the isoform differences evade detection independent of the method used.
RNA-seq is a reliable tool but the repetitive nature of the human genome makes the origin of the reads ambiguous and limits the detectability for certain genes. RNA-seq does not equally well represent isoforms independent of their size which may range from ~200nt to ~100′000nt. Researchers are advised to verify that their target genes do not have extreme properties with respect to repeated regions, GC content, and isoform length and complexity.
RNA-seq has by now in most places replaced microarrays for the analysis of the human transcriptome. Initially, it has been praised as a method that measures expression in an unbiased fashion, free of background-signal and does not require sophisticated preprocessing . Without the dependency on specifically designed microarray probes, there is the expectation that in a given RNA-seq experiment the complete transcriptional state of the biological input can be captured. With the only limitation given by the sequencing depth, where for a given sequencing depth, low abundance transcripts may not be represented in the final set of reads.
Workflow steps towards quantitative RNA-seq together with example applications
Unbiased random transcript fragments (hom. coverage)
Coverage bias (inhom. coverage)
Variable transcript start + poly-A (mod. TSS + polyA)
Global; Transcriptome + Genome [tophat]
Local; Transcriptome + Genome [STAR]
Global; Transcriptome only [RSEM]
Read count (include multi-reads) [GenomicRanges: countOverlaps]
Read count(ignore multi-reads [HTSeq]
Isoform abundance model (resolve multi-reads [RSEM]
Significance using a negative binomial count model [edgeR:exactTest]
Log-ratio effect size
Differential isoform fractions [cuffdiff]
Differential splicing modules [DiffSplice]
Differential exons [DEXSeq]
We investigate here to which extend these tools live up to the promise of providing accurate quantitative results. To this end, we report overall performances as well as per gene performances. We report the relative impact of library representation biases and the subsequent analysis steps on the final quantitative result, in terms of expression, expression differences and isoform switching. We tackle the question whether these results are comprehensive or whether they are limited in principle due to sequencing errors and sequence ambiguity which can not be overcome within the limits of the current technological constraints. Special attention goes here to the sequence ambiguity, which is due to sequence homology in the genomes of many species.
We perform our analysis on human RNA-seq data and in each step we apply tools that can be considered as representatives of a major analysis paradigm. The choice of the methods is subjective and does not imply superior performance compared to competing methods. For a detailed comparison of competing methods we refer to the comparison papers mentioned below.
For the alignment of RNA-seq reads there are now many different aligners available and Fonseca et al.  provide an overview and characterize the aligners according to their features. The two major paradigms for RNA-seq reads are
Alignment to transcriptome: As a representative of this approach we consider RSEM 
pliced alignment to genome making use of transcript annotation. Here we consider
Where the first is limited to known isoforms, and the latter has the capability to discover new isoforms.
We classify expression estimation approaches into
Overlap counting: Count the reads that overlap a given genomic feature and directly use this as a quantitative estimate of the expression level of the feature; examples are the htseq-count  and the countOverlaps method in the Bioconductor package GenomicRanges [9, 10]
Isoform abundance: Model the read generation from the isoforms and estimate the isoform abundance based on the observed reads. The RSEM expression estimation follows this approach.
The overlap counting approach has the advantage that it can generate a gene level expression estimate without the need for knowing which specific isoforms are expressed. On the downside they are bound to confuse isoform switching with differential expression . Further, in the absence of a read generating model (see e.g. , they cannot make use of additional information for resolving ambiguously aligned reads to the same degree as for example RSEM does. Ambiguity occurs if a read aligns to multiple genomic positions or if for a given genomic position multiple overlapping features are defined. The count methods deal with ambiguity by either assigning the read to all compatible features (multiple counts), discarding the read (no count), assigning a fractional count that is reciprocal to the multiplicity. If overlapping isoforms at a given gene locus are prevalent throughout the genome, as it is true for the human genome, the counting approaches can only deliver reasonable expression estimates at the gene locus level, not at the isoform level. The isoform abundance approaches explicitly deal with these multi-reads within a statistical model; a comprehensive overview of the models is given by Pachter et al..
For differential expression, the use of negative binomial models that reflect the counting nature of the expression estimates is widely in use and an extensive performance comparison is available . We apply as a representative the edgeR  package and additionally use the simple log-ratio as an effect-size estimator for the expression change.
For the investigation whether there is a change in the relative abundances of the different isoforms that may be generated from a given gene there is again a variety of methods. In our comparison we do not discriminate between the different molecular mechanisms like alternative splicing, alternative transcription start sites, etc. that may be the cause of an observed isoform switch. Existing approaches either look at entire isoforms, like cuffdiff2 , or at specific expression events, like DEXSeq (exon usage)  and DiffSplice (junction and exon usage) ; a very good overview is given in Alamancos et al.. Again the approaches range from making full use of the annotated isoforms to complete de novo detection of splicing events simply based on local read and junction coverage.
With the methods being available and in use, it is unclear how accurate an entire workflow actually is. While Fonseca et al. report the number of reads aligned for each aligner, they do not assess which genes are affected. Soneson et al. compare the performance of the hypothesis test for differential expression under the assumption that the counts may have noise due to biological and technical variation. But they do not consider any systematic bias that might affect the counts of specific genes caused by wrong read alignments. In contrast to such horizontal comparison papers, this paper reports the performance of entire data analysis workflows and provides how well individual genes and isoforms can be quantified and how well differences can be assessed.
Results and discussion
Our analysis discusses the options listed in Table 1. In order to measure the accuracy of RNA-seq, we use simulated data generated with the Flux Simulator  where we can control the 5′-to-3′ bias of the transcript coverage, variations in transcription starts and variable poly-A tails. We use all RefSeq isoforms present in the UCSC hg19 genome annotation and simulate the data such that each isoform has same baseline transcript abundance. This is different from the biological situation but ensures that we obtain reads from every annotated isoform. For every sample we generate 10 Mio single-end reads of length 101 bp using the default error profile that the Flux Simulator provides. In the following we discuss the individual steps.
The representation biases of the isoforms together with isoform characteristics and more statistics can be found in Additional file 1.
Median value of the per-isoform mapping rate
Mod. TSS + polyA
Fraction of genes where the expression count bias is less than a factor of two
Alignment and count method
Mod. TSS + polyA
We show the density distributions of the count biases in Figure 2c. The comparison with the library representation and the read alignment biases shows that the counting biases affect the fewest genes.
Additionally, we verify how well RSEM can reproduce the isoform abundances. We find that RSEM does not correctly produce zero counts in the case where a gene has multiple isoforms but only one isoform is expressed. This observation is also reported by Mezlini et al.. It is caused by the fact that RSEM tends to bias the expression estimates of the individual isoforms towards the mean value of the corresponding gene locus.
Area under the curve for the differential expression
Alignment and count method
Number of genes where differential expression was not detectable
Alignment, count, and test method
Number of genes not detectable
The comparison demonstrates the power and benefit of the counting model but also shows that if only few replicates--in our case three--are available, the effect-size is a valuable score for differential expression. Corresponding findings were also reported for microarray data . In a related work on comparing hypothesis test based differential expression of counting data, Robles et al.  have shown that with 6 replicates there is a remarkable improvement of the power. This power comes mostly from the improved precision on the variance estimate. For RSEM, the distributions of the read counts do not follow strictly a Poisson or negative-binomial model, since the assignment of multi-alignments leads to fractional read counts and the posterior estimation adds additional noise to the estimate. However, we find that the Poisson approximation is reasonable and leads to accurate results.
Number of genes where alternative was not detectable
Alignment and isoform switching method
Number of genes not detectable
We have investigated the entire data analysis workflow starting from the input material to the biologically interpretable quantitative results. We find that library content is crucial and biases in the input material do strongly impact subsequent analysis results. To achieve highly accurate mappings for each gene it is essential that the alignment also tolerates variable transcription start sites and poly-A tails. While RSEM models poly-A tails, the aligners tophat and STAR do not. However the latter do cope with variable transcription start sites since they consider transcriptome alignments as well as genome alignments. We further find that overall mapping rates are not sufficiently informative since an aligner may systematically fail to align reads for a given gene. With respect to expression estimation, it is essential to resolve the ambiguous reads. Only by considering the ambiguous alignments an accurate expression estimation is possible for repeated regions or overlapping gene symbols. Differential expression using the count model is powerful but crucially depends on the accuracy of the counts. With as few as 3 replicates the negative binomial test, in our dataset, does not outperform the average effect-size as an indicator of differential expression. Since the read counts that are required as input for the counting tests are not normalized for transcript length they can lead to misinterpretation of isoform switching as differential expression. Detection of isoform switching does not reach the same accuracy as the detection of differential expression. This is due to the high similarity of isoforms and, depending on the gene locus, the potentially small number of aligned reads that do discriminate between the isoforms.
While RNA-seq can give an accurate estimation of gene expression as well as expression differences for the majority of genes, it may miss genes with extreme properties with respect to sequence length, GC content, and homologous regions. The detection of isoform switching may require significantly higher coverage to get satisfactory results. We refer to the additional files for detailed information on how well the transcriptional state of each gene and isoform can be assessed quantitatively.
For future analysis, it will be relevant to consider also single nucleotide polymorphisms in an analysis. Also, it will be interesting to investigate to which extend the results are applicable to other species. The current study assumes that all potential isoforms are known, however in practice the annotated isoforms will not represent the complete transcriptome and additional isoforms that are expressed may have adverse effects and lead to misinterpretations.
We used simulated data sets with a baseline expression where each isoform has the same transcript abundance. We consider as transcripts the Refseq transcripts in the hg19 build available from the UCSC Genome Browser web-site. For the differential expression we divide the genes in 10 chunks und choose always 1 chunk for two-fold upregulation and 1 chunk for two-fold downregulation. In total we use 11 conditions with 3 replicates each and each replicate having 10 Mio reads. We create 3 isoform switching datasets from the initial 11 conditions by shifting the entire expression on a gene locus into the first, second and third annotated isoform, respectively. For the methods below, if not mentioned differently, default options were used.
Read simulation with flux
For simulation we use the Flux read simulator v1.2. We simulate coverage bias by setting the fragmentation substrate to DNA. We choose the default options for the transcript modification to obtain transcripts with variable read starts and poly-A tail. We set both options to NaN and the fragmentation substrate to RNA to produce transcripts that follow exactly the annotated transcripts. We choose as read length 101 bp and use the default error distribution provided by Flux.
For read alignment, we use tophat v2.0.6 and provided tophat with the RefSeq transcript coordinates as a GTF file so that tophat would also align the reads to the known transcripts. Additionally, we used STAR v2.3.0e with the options“--outSAMstrandField intronMotif --outFilterMatchNmin 20--outFilterMismatchNmax 5 --outFilterMismatchNoverLmax 0.05--outFilterMultimapNmax 30″ that add additional strand information for spliced alignments, limit the number of mismatches to 5 and that increase the number multimapping for reads to 30. We use RSEM v1.2.3 for read alignment to the transcriptome and use RSEM also to compute expression estimates.
We compute expression with HTSeq v0.5.3p9 using the script HTS.scripts.count, as well as by using the countOverlaps method in the Bioconductor package GenomicRanges v1.10.7 and the expression output generated by RSEM.
We compute differential expression with the exactTest method in the Bioconductor package edgeR v3.0.6, and the t-test in the package genefilter v1.40.0. Additionally we compute the log-ratio as an effect size. We did not apply any normalization with respect to sequencing depth, since all samples had the same sequencing depth.
For the detection of isoform switching we use the Bioconductor package DEXSeq v1.4.0 which also provides a python scripts to generate the read counts for the pseudo-exons that are needed as input for the hypothesis test. The other two methods are cufflinks v2.1.1 (with the option for multi-read correction) and DiffSplice v0.1.1. We used all genes that were commonly detectable by DEXSeq and cuffdiff. This excludes genes where the same gene symbol is associated with more than one genomic locus because DEXSeq can not handle such situations. These genes have been removed and the GTF File from UCSC has been preprocessed with DEXSeq’s python script dexseq_prepare_annotation.py. DiffSplice does not make use of the gene annotation at all but reports alternatively spliced modules (ASM) independent of the annotation. In order to get comparable results we matched the ASM to the overlapping gene symbols.
Acknowledgements go to Marco Schmidt and Tanguy Le Carrour for providing and maintaining the computing infrastructure that was used.
- Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10: 57-63. 10.1038/nrg2484.PubMed CentralView ArticlePubMedGoogle Scholar
- Garber M, Grabherr MG, Guttman M, Trapnell C: Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011, 8: 469-477. 10.1038/nmeth.1613.View ArticlePubMedGoogle Scholar
- Alamancos GP, Agirre E, Eyras E: Methods to study splicing from high-throughput RNA Sequencing data. arXiv. 2013, -1304.5952Google Scholar
- Fonseca NA, Rung J, Brazma A, Marioni JC: Tools for mapping high-throughput sequencing data. Bioinformatics. 2012, 28: 3169-3177. 10.1093/bioinformatics/bts605.View ArticlePubMedGoogle Scholar
- Li W, Jiang T: Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics. 2012, 28: 2914-2921. 10.1093/bioinformatics/bts559.PubMed CentralView ArticlePubMedGoogle Scholar
- Kim D, Pertea G, Trapnell C, Pimentel H: TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology. 2013, 14: R36-10.1186/gb-2013-14-4-r36.PubMed CentralView ArticlePubMedGoogle Scholar
- Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR: STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013, 29: 15-21. 10.1093/bioinformatics/bts635.PubMed CentralView ArticlePubMedGoogle Scholar
- Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11: R106-R106. 10.1186/gb-2010-11-10-r106.PubMed CentralView ArticlePubMedGoogle Scholar
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5: R80-10.1186/gb-2004-5-10-r80.PubMed CentralView ArticlePubMedGoogle Scholar
- Aboyoun P, Pages H, Lawrence M: GenomicRanges: representation and manipulation of genomic intervals. R Package Version. 2012, 1: 12-14.Google Scholar
- Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L: Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol. 2012, 31: 46-53. 10.1038/nbt.2450.View ArticlePubMedGoogle Scholar
- Pachter L: Models for transcript quantification from RNA-Seq. arXiv preprint arXiv:1104.3889. 2011, -Google Scholar
- Soneson C, Delorenzi M: A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinforma. 2013, 14: 91-10.1186/1471-2105-14-91.View ArticleGoogle Scholar
- McCarthy DJ, Chen Y, Smyth GK: Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 2012, 40: 4288-4297. 10.1093/nar/gks042.PubMed CentralView ArticlePubMedGoogle Scholar
- Anders S, Reyes A, Huber W: Detecting differential usage of exons from RNA-seq data. Genome Res. 2012, 22: 2008-2017. 10.1101/gr.133744.111.PubMed CentralView ArticlePubMedGoogle Scholar
- Hu Y, Huang Y, Du Y, Orellana CF, Singh D, Johnson AR, Monroy A, Kuan PF, Hammond SM, Makowski L, Randell SH, Chiang DY, Hayes DN, Jones C, Liu Y, Prins JF, Liu J: DiffSplice: the genome-wide detection of differential splicing events with RNA-seq. Nucleic Acids Res. 2012, 41 (2): e39-PubMed CentralView ArticlePubMedGoogle Scholar
- Griebel T, Zacher B, Ribeca P, Raineri E, Lacroix V, Guigó R, Sammeth M: Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucleic Acids Res. 2012, 40 (20): 10073-10083. 10.1093/nar/gks666.PubMed CentralView ArticlePubMedGoogle Scholar
- Mortazavi A, Williams BA, Mccue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5: 621-628. 10.1038/nmeth.1226.View ArticlePubMedGoogle Scholar
- Hansen KD, Irizarry RA, Wu Z: Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics. 2012, 13: 204-216. 10.1093/biostatistics/kxr054.PubMed CentralView ArticlePubMedGoogle Scholar
- Mezlini AM, Smith EJ, Fiume M, Buske O, Savich G, Shah S, Aparicion S, Chiang D, Goldenberg A, Brudno M: IReckon: simultaneous isoform discovery and abundance estimation from RNA-seq data. Genome Res. 2012, 23 (3): 519-529.View ArticlePubMedGoogle Scholar
- Shi L, Jones WD, Jensen RV, Harris SC, Perkins RG, Goodsaid FM, Guo L, Croner LJ, Boysen C, Fang H, Qian F, Amur S, Bao W, Barbacioru CC, Bertholet V, Cao XM, Chu T-M, Collins PJ, Fan X-H, Frueh FW, Fuscoe JC, Guo X, Han J, Herman D, Hong H, Kawasaki ES, Li Q-Z, Luo Y, Ma Y, Mei N, et al: The balance of reproducibility, sensitivity, and specificity of lists of differentially expressed genes in microarray studies. BMC Bioinforma. 2008, 9 (9): S10-View ArticleGoogle Scholar
- Robles JA, Qureshi SE, Stephen SJ, Wilson SR, Burden CJ, Taylor JM: Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing. BMC Genomics. 2012, 13: 484-10.1186/1471-2164-13-484.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.