Reducing bias in RNA sequencing data: a novel approach to compute counts
© Finotello et al.; licensee BioMed Central Ltd. 2014
Published: 10 January 2014
Skip to main content
© Finotello et al.; licensee BioMed Central Ltd. 2014
Published: 10 January 2014
In the last decade, Next-Generation Sequencing technologies have been extensively applied to quantitative transcriptomics, making RNA sequencing a valuable alternative to microarrays for measuring and comparing gene transcription levels. Although several methods have been proposed to provide an unbiased estimate of transcript abundances through data normalization, all of them are based on an initial count of the total number of reads mapping on each transcript. This procedure, in principle robust to random noise, is actually error-prone if reads are not uniformly distributed along sequences, as happens indeed due to sequencing errors and ambiguity in read mapping.
Here we propose a new approach, called maxcounts, to quantify the expression assigned to an exon as the maximum of its per-base counts, and we assess its performance in comparison with the standard approach described above, which considers the total number of reads aligned to an exon. The two measures are compared using multiple data sets and considering several evaluation criteria: independence from gene-specific covariates, such as exon length and GC-content, accuracy and precision in the quantification of true concentrations and robustness of measurements to variations of alignments quality.
Both measures show high accuracy and low dependency on GC-content. However, maxcounts expression quantification is less biased towards long exons with respect to the standard approach. Moreover, it shows lower technical variability at low expressions and is more robust to variations in the quality of alignments.
In summary, we confirm that counts computed with the standard approach depend on the length of the feature they are summarized on, and are sensitive to the non-uniform distribution of reads along transcripts. On the opposite, maxcounts are robust to biases due to the non-uniformity distribution of reads and are characterized by a lower technical variability. Hence, we propose maxcounts as an alternative approach for quantitative RNA-sequencing applications.
In recent years, ultra-high-throughput sequencing technologies (also called Next-Generation Sequencing technologies, NGS) [1, 2] have been applied intensively in quantitative transcriptomics, making RNA sequencing (RNA-seq)  a valuable alternative to microarrays. While microarrays can only assay transcripts corresponding to probes, RNA-seq can, in principle, investigate at a finer level of detail all the transcripts present in a sample, characterizing their sequences and quantifying their abundances at the same time . The possibility of sequencing transcriptomes at single-base resolution has opened a wide frontier of applications in transcriptomics research: transcriptome profiling of non-model organisms [5, 6], novel transcripts discovery , quantification of allele-specific gene expression , investigation of RNA editing [9, 10] and "dual RNA-seq" of pathogen and host . In this work we focus on its application to quantitative transcriptomics, since RNA-seq is now widely used in place of microarrays for measuring and comparing gene transcription levels [4, 12]. The standard workflow of transcripts quantification with RNA-seq is the following: first, RNAs are extracted from the sample of interest and subjected to fragmentation; then, RNA fragments are reverse-transcribed into complementary DNAs (cDNAs); finally, cDNAs are ligated to adapters and subjected to ultra-high-throughput sequencing. The millions of short sequences generated, called reads, can be aligned to a reference genome or transcriptome to calculate counts (i.e. the number of reads aligned to each gene or transcript), which give a digital measure of transcript abundances in the original sample. However, this measure requires normalization to correct for systematic errors arising from several sources of bias. First of all, the largest fraction of the reads sequenced in a sample arises from a restricted subset of highly expressed genes [13, 14]; as a consequence, these genes account for most of the counts in a library, while the remaining genes are under-represented. Moreover, by definition, counts are intrinsically biased towards longer transcripts: longer transcripts are more likely to be sequenced than shorter ones, so counts depend not only on the true gene expression, but also on the length of transcribed isoforms [15–19]. In addition, recent works highlight other sequence-dependent sources of bias affecting NGS data [20–23]. In particular, many studies observe the presence of a GC-content effect: gene counts correlate with the fraction of "G" (guanine) and "C" (cytosine) bases in the nucleotidic sequence of a gene [23–25].
Although several methods have been proposed to normalize data, thus providing less biased estimates of transcript abundances, all of them are based on an initial count of the total number of reads mapping on each transcript [19, 24–26]. This procedure, in principle robust to random noise, might be error-prone if reads are not uniformly distributed along sequences, as happens indeed due to both sequencing errors and ambiguity in read mapping.
Non-uniformity of read coverage is mainly due to biases associated to the different steps of RNA-seq protocols. For instance, fragmentation methods based on restriction enzymes have recently been reported to be sequence-specific and far from being random . Reverse-transcription performed with poly-dT oligomers, which bind to poly-A tails, is strongly biased towards 3' end of transcripts [3, 4]. Conversely, reverse-transcription with random hexamers results in an under-representation of 3' ends [4, 27]. This bias is due to the reduced number of priming positions from which the reverse transcriptase enzyme can start cDNA synthesis. Furthermore, depending on their sequence, RNAs and cDNAs can form secondary structures that alternatively obstruct or facilitate the binding of reverse-transcription primers and sequencing adapters, resulting in different efficiency of the sequencing process . Since the first RNA-seq experiment , several changes in library preparations and sequencing protocols have been proposed pursuing the aim of having an unbiased representation of transcript abundances (e.g. postponing reverse transcription after fragmentation), but the non-uniformity of read coverage along transcripts remains an issue of state-of-the-art technologies .
In this study, we propose a novel method for computing counts, called maxcounts, with the aim of reducing systematic errors. Once reads have been aligned to a feature of interest (exon or single-isoform transcript), we exploit read coverage to obtain counts for every position in its sequence and we quantify its expression as the maximum of its "positional" counts. We assess maxcounts performance in comparison with the standard approach, which considers the total number reads mapped on an exon (called totcounts in the following). To do this we considered three human data sets [19, 30, 31], in which samples are taken from different tissues or cellular compartments, or from cells subjected to different growth conditions or treatments. All samples were sequenced with the Illumina technology (http://www.illumina.com), which is now the most commonly used NGS platform for RNA-seq . Data were sequenced with single- and paired-end protocols, and have different characteristics, which allow us to test our approach with respect to different features. In particular, in Jiang's experiment , endogenous RNAs were sequenced together with spike-in RNAs, which are single-isoform transcripts with known nucleotidic sequences and concentrations. We used these data as gold-standard to benchmark and compare totcounts and maxcounts estimates of RNA abundances.
The MAQC2 data set  consists of single-ended RNA-seq reads obtained from two different biological samples: (i) Ambion's Human Brain Reference RNA ("Brain"), a standard pooled from multiple donors and several brain regions; (ii) Stratagene's Universal Human Reference RNA ("UHR"), a mixture of total RNA extracted from ten different human cell lines (see "Additional file 1" for further details on data).
Griffith's data set  contains paired-end reads obtained sequencing two fluorouracil (5-FU)-resistant ("MIP5FU") and (5-FU)-sensitive ("MIP101") human colorectal cancer cell lines.
A subset of replicates from Jiang's data set  is also considered, in which paired-end RNA-seq libraries were sequenced after mixing endogenous RNA from a K-562 cell line, extracted from nucleus ("nucleus") or whole cell ("cell"), with RNA standards developed by the External RNA Control Consortium (ERCC). ERCC standards are in vitro synthesized RNAs whose nucleotidic sequences and concentrations are known. They can be used to assess whether the final quantification of an RNA-seq experiment correctly represents the composition of the original input.
In the first phase, reads were pre-processed to remove adapter sequences and read ends whose Phred quality was lower than 20, and to discard reads whose length after trimming was less than 33 bp. Output FASTQ files were re-formatted to recover the correspondence of paired-end reads, and to store in a separate file the singleton reads, whose mate was discarded during pre-processing.
In the second phase, paired-ends and singletons were mapped with TopHat  in a two-steps procedure. First, paired-end reads were mapped on the reference sequence to generate a BAM file of alignments and a file of junctions. Then, singletons were mapped with TopHat exploiting the information provided by junctions (see "Additional file 1" for further details on read mapping). Alignment files from paired-end and singleton reads were finally merged in a single BAM file using the merge utility of samtools .
In the last phase of post-processing, a filtered set of alignments was obtained after discarding multireads and reads whose similarity with the reference was lower than 97%. This analysis was performed using SAMsieve, a java in-house developed program (available upon request), which allows the user to filter alignments stored in SAM or BAM files based on several criteria (see "Additional file 1" for additional information about SAMsieve).
Totcounts were computed using bedtools . Exons (or spike-in transcripts of Jiang's data set) with average totcounts across replicates lower than 0.5 were discarded from our analysis. Before comparing or averaging replicates, differences in library sizes were corrected through Trimmed Mean of M-values (TMM) normalization .
where, Nij are counts for exon i in library j (not normalized via TMM), Li is the length of exon i and N.j = ∑i Nij is the sum of all counts in library j.
Within-lane full-quantile normalization of counts on exon length was performed using EDASeq . In order to correct for differences in library sizes, this normalization was used together with between-lane full-quantile normalization, also implemented in EDASeq.
In this work we consider exons instead of genes or transcripts as we intend to evaluate the different summarization methods described above without biases, possibly introduced by the choice of a transcription model (e.g. how overlapping genes or alternative spliced exons are considered).
Ideally, a measure of gene expression should: (i) be independent of gene-specific covariates such as transcript length and GC-content; (ii) be unbiased towards highly expressed genes; (iii) be an accurate estimate of the true gene expression levels; (iv) show low technical variance; (v) be robust to possible variations in the quality of alignments. In the following we assess the above properties for maxcounts in comparison with totcounts. Plots are shown for Jiang's data, since this data set allows also the assessment of accuracy in transcript quantification thanks to spike-in RNAs; results on MAQC2 and Griffith's are reported in Additional Files.
In all data sets, plots show an increasing trend of totcounts as exon length increases (see the cubic-spline fit represented by the orange line), revealing that longer exons tend to have higher counts than shorter ones. This bias is reduced, but not completely removed, in maxcounts. Plots for Jiang's data ("nucleus" libraries), depicted in Figure 2A, show no dependency of maxcounts on exon length. Conversely, for maxcounts in Griffith's and MAQC2 data sets a slight under-representation of exons shorter than 50 bp is still visible. We believe this behavior is explained by the difference in read length among the three data sets and the ability of TopHat to map them on splice junctions. Indeed, we observed that in MAQC2 and Griffith's data sets (36 bp reads) only 0.25-0.50% of aligned reads are mapped on splice junctions, as opposed to 2.5-11.5% of reads in Jiang's data set (75 bp reads). As a consequence, there is a decrease of counts over exons boundaries, which mainly affects short exons. In all the considered data sets, RPKM-normalized totcounts show a negative relationship with exon length due to an over-correction for length bias on short exons. On the opposite, full-quantile normalization completely removes exon length bias. Similarly, if applied to maxcounts, full-quantile normalization completely removes exon-length bias even on short exons (plots not shown).
We used the same approach to investigate GC-content effect, revealing a moderate bias due to GC-composition on almost all data sets (Figure 2B and Additional Files 2, 3, 4, 5, 6). As noted in previous studies, GC-content effect is not consistent across data sets [20, 24, 25, 36]. Interestingly, the correction for exon length bias via full-quantile normalization also corrects for GC-content bias all the considered data sets.
In the following assessments, we always show raw totcounts and their RPKM- and full-quantile-normalized versions. Given the low length bias characterizing maxcounts, we consider their raw, not-normalized version.
Summary of count distribution across exons
In particular, Table 1 reports the percentage of exons accounting for 50% and 90% of total counts or RPKMs in a sample, highlighting that, less than 40% of exons contain more than 90% of all totcounts in a library. RPKM-normalized totcounts are more evenly distributed across exons, but the least biased distribution is that of maxcounts, with a marked improvement on the more biased data sets (see, for example, how this bias is reduced on Griffith's data).
For all measures, plots show higher agreement with the gold-standard on Jiang's "nucleus" data, probably because of the higher number of replicates (six libraries) with respect to "cell" data (two libraries). All measures, with the exception of full-quantile-normalized totcounts, obtain high correlation with true concentrations, with RPKM-normalized totcounts and maxcounts having slightly better results than totcounts. Full-quantile normalization performed on totcounts, although eliminating length bias, possibly over-corrects data. Correlations with true concentrations of maxcounts, totcounts and RPKM-normalized totcounts, computed on all libraries of Jiang's data set, do not significantly differ (two-sided t-test, p-value > 0.05). On the contrary, full-quantile-normalized totcounts present the lowest correlation with spike-in RNAs concentrations (two-sided t-test, p-value < 1e-10). All methods do not depend on transcript abundances, except for full-quantile-normalized totcounts, which are less robust in estimating low-abundance transcripts (Additional File 8).
where X i are totcounts or maxcounts, averaged across libraries, for each transcript here considered. Ideally, Δ should be very small, to reflect the closeness of the true concentrations. Whereas totcounts produce a variation of 39%, maxcounts have a much smaller variation of 2%, overcoming read-coverage bias and providing very similar estimates for the transcripts here used as example. It is interesting to note that both transcripts show a reduced read coverage in correspondence to 3' end (Figure 5), a bias that is introduced during the reverse-transcription step performed with random hexamers (see "Background"). This bias is present in all transcripts of Jiang's data set (results not shown). Maxcounts approach is robust to 3' bias since it considers the bases with the highest read coverage along transcripts.
Maxcounts show the lowest variance at low and mean expressions, while totcounts present slightly lower variance at high expressions. In order to account for differences in the range of values, we also considered the coefficient of variation (CV, ratio between standard deviation and mean). Totcounts and maxcounts obtain comparable CV curves. Totcounts normalized with full-quantile are characterized by larger variance and CV with respect to both maxcounts and totcounts, while totcounts normalized with RPKM-normalized totcounts have the highest variance and CV.
where the expression at the denominator is used to avoid possible divisions by zero. Ideally, if a measure is stable to alignment filtering (that depends on the specific analysis pipeline defined by users), relative variation should be 0%. Here we consider raw maxcounts and totcounts, not subjected to any normalization, since we want to assess the direct impact that changes in alignment filtering have on count summarization.
Moreover, alignments filtering also impacts on the number of reads that can be used for quantification. Indeed, by applying this filter to any of the three data sets, about 30% of reads are discarded. Hence, the results reported above highlight that maxcounts are robust to variations in the number of considered reads.
Thanks to the advent and progress of NGS technologies, RNA-seq has rapidly become the method of choice for measuring and comparing gene transcription levels. In this methodology, the expression of a coding unit, such as a gene, transcript or exon, is estimated by considering the total number of reads that can be aligned on its sequence (totcounts). Despite being widely adopted, this digital measure of expression is not free from biases, and efforts are underway by the scientific community to develop novel methods for data normalization and bias correction. Here we propose an alternative approach for computing RNA-seq counts: maxcounts. We exploit read coverage along an exon to compute maxcounts as the maximum of its positional counts, i.e. the number of reads covering each base along its sequence.
We characterized and compared totcounts and maxcounts considering the desired features of a measure of expression, irrespectively of downstream applications: no dependence on covariates, such as exon length and GC-content, no over-representation of highly transcribed exons, accurate and precise estimation of true expression levels, low variance and high reproducibility.
Overall, totcounts always need normalization for exon-length since they present a strong bias. On the contrary, exon-length bias in maxcounts is strongly reduced, so they do not necessarily require normalization. If exon-length bias is corrected through within-lane full-quantile normalization, further correction for GC-content is not needed neither for totcounts nor for maxcounts. Moreover, with maxcounts the over-representation of highly expressed exons is reduced with respect to totcounts. When focusing on accuracy and precision of measurements, maxcounts together with RPKM-corrected totcounts most accurately reproduce real data, whereas maxcounts together with totcounts normalized with the full-quantile approach show the lowest variance. Finally, although the quality of alignments has a great impact on both methods, maxcounts approach outperforms totcounts in terms of robustness to variations in alignment filtering.
Consequently, we believe that maxcounts approach represents a valuable alternative to totcounts for measuring exon expression from RNA-seq data, since it has comparable or higher performance on all the evaluation criteria.
Although several improvements have been made to understand and correct for possible biases in the RNA-seq experimental protocol, read coverage along transcripts still shows sequence-specific variability and under-representation of specific regions. Maxcounts approach can overcome biases due to the non-uniformity of read coverage, selecting the best-represented transcript regions. Nevertheless, RNA-seq is a methodology still under active development, which will experience a fast improvement of experimental protocols and evolution of data characteristics. We made available the code for calculating maxcounts (see "Methods"), thus enabling its benchmarking on different data sets.
A possible limitation of the current implementation is represented by the use of exons, since the final user might be interested in a having gene or transcript counts. Future work will focus on the definition of transcription models that can be used to combine exon maxcounts into an accurate measure of gene or transcript expression. Finally, an important issue to be addressed in the near future is the impact of maxcounts on differential expression analysis. At the moment, a complete assessment is difficult because of the lack of good benchmarks: microarrays and quantitative PCR can be used to measure maxcounts precision, but might not capture the complete picture of gene expression since they present a lower sensitivity with respect to RNA-seq. For these reasons, we are currently generating an ad hoc data set to assess both differential expression at exon and transcript level and to focus on expression of alternative splicing variants.
Reads Per Kilobase of exon Model per million mapped reads
Trimmed Mean of M-values.
This research is supported by Fondazione CARIPARO ("RNA sequencing for quantitative transcriptomics" PhD Program), PRAT 2010 CPDA101217 ("Models of RNA sequencing data variability for quantitative transcriptomics") and AACSE Project. The authors would like to thank Dr. Malachi Griffith for providing data and support with data handling.
The publication costs for this article were funded by PRAT 2010 CPDA101217 ("Models of RNA sequencing data variability for quantitative transcriptomics").
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 1, 2014: Integrated Bio-Search: Selected Works from the 12th International Workshop on Network Tools and Applications in Biology (NETTAB 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S1.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.