- Research article
- Open Access
An integrative method to normalize RNA-Seq data
- Cyril Filloux†1, 2,
- Meersseman Cédric†1, 2, 3, 4,
- Philippe Romain1, 2,
- Forestier Lionel1, 2,
- Klopp Christophe5,
- Rocha Dominique3, 4,
- Maftah Abderrahman1, 2 and
- Petit Daniel1, 2Email author
© Filloux et al.; licensee BioMed Central Ltd. 2014
- Received: 11 December 2013
- Accepted: 9 June 2014
- Published: 14 June 2014
Transcriptome sequencing is a powerful tool for measuring gene expression, but as well as some other technologies, various artifacts and biases affect the quantification. In order to correct some of them, several normalization approaches have emerged, differing both in the statistical strategy employed and in the type of corrected biases. However, there is no clear standard normalization method.
We present a novel methodology to normalize RNA-Seq data, taking into account transcript size, GC content, and sequencing depth, which are the major quantification-related biases. In this study, we found that transcripts shorter than 600 bp have an underestimated expression level, while longer transcripts are even more overestimated that they are long. Second, it was well known that the higher the GC content (>50%), the more the transcripts are underestimated. Third, we demonstrated that the sequencing depth impacts the size bias and proposed a correction allowing the comparison of expression levels among many samples. The efficiency of our approach was then tested by comparing the correlation between normalized RNA-Seq data and qRT-PCR expression measurements. All the steps are automated in a program written in Perl and available on request.
The methodology presented in this article identifies and corrects different biases that influence RNA-Seq quantification, and provides more accurate estimations of gene expression levels. This method can be applied to compare expression quantifications from many samples, but preferentially from the same tissue. In order to compare samples from different tissue, a calibration using several reference genes will be required.
- Gene expression
The study of transcriptome has pushed forward by the development of next-generation sequencing technologies. RNA-Seq offers the possibility to get information on sequence and quantification of all transcribed genes, but extremely lowly expressed ones . As shown by these authors, this method differs from the microarrays which have limitations due to (i) the difficulty to design specific probes, leading to artifacts caused by cross-hybridization and (ii) the impossibility to detect expression for non-annotated genes. Expression quantification performed using qRT-PCR is more precise than microarrays, but is also not able to measure unknown genes. Moreover, the cost of TLDA (TaqMan Low Density Array - Applied Biosystems) for example, renders it unsuitable for large gene sets.
The RNA-Seq protocol is a succession of technical steps followed by quantification. According to Illumina technology, (i) a cDNA library from a given tissue is randomly fragmented by sonication, (ii) specific adapters are ligated for the assignation of each fragment to the corresponding sample, (iii) PCR amplification are performed, and (iv) amplified mRNA fragments with sizes ranging from 250 to 450 bp are isolated before being sequenced. The quantification of the sequenced fragments (called reads) begins with the mapping of each read onto the assembled genome or transcriptome, in order to count the number of reads assigned to each known or unknown gene. When there are several transcripts or close paralogues for a gene, the attribution of a read to the right transcript is not always possible depending on the read position: 5’-end fragments are expected to be more specific than 3’-end ones. The second step of quantification consists in removing four biases affecting read counts: (i) the number of reads increases with the size of the transcript [2–6], (ii) with the amount of the cDNA library [7, 8], (iii) sequencing efficiency decreases when the GC-content is too low or too high [9–12], and (iv) due to a PCR amplification step during the library preparation, PCR duplicates occur when two copies of the same cDNA fragment produce different clusters on the flow cell [13–15].
Since RNA-Seq emergence, a number of normalization methods have been developed to address one or two of the different biases [1–12, 14]. Our aim was to develop an integrated method able to correct all these sources of bias. In order to avoid RNA-Seq quantification problems linked to specific isoforms, unlike most studies, we only retained genes with a single transcript to determine the various equations and to perform the comparison . As for size effects, most of them are based on mathematical distribution models to compare expression levels between samples, but do not consider separately the opposite biases relative to size: short transcripts (<600 bp) are underestimated while longer ones are overestimated. As for the bias linked to GC content, we performed simple regression methods based on polynomial model. It appeared that sequencing depth has an effect on the equations driving the size and GC content corrections. Hence, unlike other methods, a further run of our program was performed to correct globally the read counts by taking into account size, GC content and total read numbers. In order to assess the efficiency of our approach, we calculated the correlation between corrected RNA-Seq counts and qRT-PCR quantifications.
Longissimus thoracis muscle biopsies were taken between the 7th and 9th ribs of 125 limousine bulls slaughtered at the age of 15.8 months. The samples were immediately frozen in liquid nitrogen and stored at -80°C. After grinding tissues using a FastPrep FP120 Homogenizer device (Thermo Savant) and micro-tubes “Lysing Matrix D” (MP Biomedicals), RNA extractions were performed with the RNeasy Midi/Maxi kit (Qiagen). The procedure and solution quantity were optimized for extraction from skeletal muscles and treatment with proteinase K as recommended by the supplier. The quality control of RNA step was done using RNA 6000 Nano Chips analyzed with 2100 Bioanalyzer instrument (Agilent Technologies). The 22 best ranking RNA samples were retained.
To verify the absence of degradation during the storage period, the quality of these 22 cattle samples were then checked again before preparing cDNA libraries according to the Illumina protocol. Briefly, mRNAs were isolated from total RNA by their polyA tails and cDNA libraries were built using random-hexamers. These cDNAs were fragmented by sonication, and specific adapters were then ligated to each fragment for the traceability of the sample. Ten cycles of PCR amplification were performed. Amplified mRNA with a size between 250 and 450 bp were then isolated before being sequenced in paired-end reads with a length of 100 bases using Illumina HighSeq2000 device (hosted at the INRA Genomic Platform of Toulouse, France).
RNA-Seq read counting
The first step consists in de-multiplexing the reads by recognizing specific adapter sequences to assign each read to the corresponding sample (three samples were pooled per flow cell lane). From 100 to 240 million paired-end reads were obtained per flow cell lane, corresponding to 27 to 91 million reads for each cDNA library. These paired-end reads were then mapped back to the bovine reference transcriptome, using Bos taurus known transcripts recorded in the Ensembl database v.61 (Website: ftp://ftp.ensembl.org/pub/release-75/fasta/bos_taurus/cdna/). This set contains 27,663 transcript sequences assigned to 21,734 known genes and pseudogenes. Paired-end reads located exactly on the same transcript were selected and counted. A total of 21,455 transcripts (17,605 genes) were identified, with at least one paired-end read within the 22 analyzed samples.
Among the 22 cattle samples, five of them were chosen to perform qRT-PCRs on the basis of a large range of total read numbers. These samples showed around 10.106 (1475), 13.106 (1455), 20.106 (1479), 24.106 (1345), and 30.106 reads (1476), respectively. These experiments were conducted using custom-made TLDA (Taq-Man Low Density Array) cards and ABI PRISM 7900HT sequence detector system (Applied Biosystems). The dataset was built with genes involved in glycosylation metabolism, named glyco-genes in the following. They concern glycosyl-transferases, glycosidases, sulfo-transferases, sugar carriers, and lectines. Among the around 800 genes recorded in the bovine genome (unpublished data), 372 were selected according to two criteria: the greatest diversity of the glyco-gene groups and the availability of primers provided by Applied Biosystems (https://bioinfo.appliedbiosystems.com/genome-database/gene-expression.html). Twelve housekeeping genes (18S RNA, TFIID transcription factor, etc., see ) were added as controls to complete the 384-microwells of each microfluidic card. The quantification was done using the SDS 2.3 software (Applied Biosystems) according to the ΔCt method (see the User Bulletin #2 for ABI PRISM 7700 of October 2001). Briefly, ΔCt corresponds to the threshold cycle (Ct) for each gene minus that of the mean of the twelve endogenous internal controls.
RNA-Seq data from public datasets (drosophila and human)
To validate the first steps of our method, it was necessary to consider public data dealing with other organisms than Bos taurus.
As for Drosophila, in the public dataset SRA: SRP009459/GEO: GSE33905 deposited by B.R. Graveley and co-workers, we downloaded 16 read sequence sets obtained from head of male and female adults (GSM838758 to GSM838760, GSM838763 to GSM838766 to GSM838780, and GSM838799 to GSM838802). The sequencing depth varied from 2.7 to 8.4 million reads, with a mean value close to 5 million reads.
As for Human, we considered the dataset SRA: SRP032775/GEO: GSE52166 deposited by R. Sanka and co-workers. In order to have homogeneous data, the read sequence sets came from whole blood of 20 individuals in a pre-infection state relatively to Plasmodium falciparum (GSM1335718, GSM1335720, GSM1335722 to GSM1335756). The sequencing depth varied from 21.2 to 72.9 million reads, with a mean value around 40 million reads.
Using STAR aligner software v.2.3.1f , the read sequences were splice-aligned onto the Drosophila v.BDGP5.75 or the Human genome v.GRCh37.75, respectively. Transcripts were quantified with sigcufflinks (available upon request at http://www.sigenae.org), a modified version of the cufflinks code  providing raw read counts per transcript, by using the GTF reference files provided by Ensembl (version 75).
Simulation of RNA-Seq data
As we suspected that the RNA fragment sizes have an impact on the behavior of read counts as a function of transcript sizes, it was useful to conduct simulation using a specific program. We first downloaded the transcript genes of Bos taurus chromosome 20 from Ensembl (version 75 – genome assembly UMD3.1). All the sequences were concatenated to obtain a single sequence of 255,601 bp. This sequence was then split into 231 genes in the FASTA format, with increasing sizes from 50 bp to 1,200 bp according to an arithmetic progression with common difference of 5. This file was submitted to rlsim . Default parameters were chosen, except for sequenced fragment range, and total read number (1 million). Three runs were launched, the first with 250–450 (mean 350) bp, the second with 450–650 (mean 550) bp, and the last 650–850 (mean 750) bp. For each transcript, the program assigns an expression level from a mixture of gamma distribution with two components with mean 5,000 and 10,000. Then, the simulation provides for each read its sequence and the assigned gene. We then calculated the number of reads for each gene using the program Fishing-net, written in Perl, available upon request from CF and DP.
As qRT-PCR quantification were used to validate our RNA-Seq normalization method, it was necessary to verify that qRT-PCR data were not subject to transcript size and GC content biases. As for transcript size, we tested a relationship with the ΔCt obtained by qRT-PCR for cattle sample 1479 (n = 233). Through polynomial equations of first and third orders, the p-values were 0.84 and 0.87, respectively. As for GC content in the same sample, the corresponding polynomial equations gave p-values of 0.57 and 0.96, respectively. We verified that for the four other cattle samples, no significant relationships were observed neither for transcript size nor for GC content (Additional file 1).
To compare qRT-PCR results with the RNA-Seq approach, several steps of correction are needed. The calculations concerned 14,676 genes for which only one transcript were detected in Cattle. We propose an integrated method called SGTR (transcript Size, GC content and Total Read number) that takes into account the effects of transcript size, GC content, and total read number. First, it was necessary to apply a log 2 transformation to raw counts to avoid large dispersion for high values according to [2, 10, 21] and .
Correction of transcript length biases
where y i corresponds to the mean read number for the size class x i .
We observed that the slope a 1 for shorter transcripts was higher than the one a 2 for longer transcripts, and verified that this trend was also true for the 21 other cattle samples analyzed by RNA-Seq. In particular, the 600 bp border remained constant. In other species (e.g. Drosophila and Human), we retrieved this 600 bp border in all the samples tested (16 drosophila head samples and 20 human whole blood samples). One example of each of this species is presented in Figures 1B and 1C. To correct the bias linked to transcript sizes, it was necessary to introduce two different equations corresponding to each part of the graph. As size 600 is a critical value, we decided to adjust all the read numbers to this size. First consider the left part; for a transcript of size S, we added the value “ a 1 (600 - S ) ” to the observed read number. Likewise, for transcripts > 600 bp, we removed the value “ a 2 ( S - 600) ”. As a result, the read numbers of all the transcripts were adjusted to the size 600 (Figure 1D).
To understand the significance of this 600 bp border, we hypothesized that it could be due to the length of the sequenced fragments. This idea was tested using the simulation procedure implemented in rlsim software. Three different fragment lengths were considered: 250–450 (mean 350) bp, 450–650 (mean 550) bp, and 650–850 (mean 750) bp, with a fixed total read number of 1 million. The results are summarized in Figure 1E, with LOESS smoothing. It is difficult to give a precise position of the break point between the two regression lines, but it is clear that the greater the sequenced fragments, the more the break point is shifted toward the right. Moreover, the slopes for the regression lines situated before the break points seem to be similar.
To assess the efficiency of our method, we calculated the Pearson correlations between qRT-PCR and RNA-Seq counts corrected by FPKM (Fragments Per Kilobase per Million mapped reads)  or SGTR for the five bovine samples. We choose FPKM as it is a one of the most frequently method used for normalization. Briefly, it consists in dividing the fragment counts by transcript size and the total number of reads, and adjusted to 1 kb and 1 million reads.
Comparison between FPKM and SGTR methods according to transcript size
log 2 (FPKM)
SGTR - Size
SGTR - Size and GC content
Size < 1,000 bp
1,000 – 2,000 bp
2,000 – 3,000 bp
3,000 – 4,000 bp
Size > 4,000 bp
Size < 1,000 bp
1,000 – 2,000 bp
2,000 – 3,000 bp
3,000 – 4,000 bp
Size > 4,000 bp
Size < 1,000 bp
1,000 – 2,000 bp
2,000 – 3,000 bp
3,000 – 4,000 bp
Size > 4,000 bp
Size < 1,000 bp
1,000 – 2,000 bp
2,000 – 3,000 bp
3,000 – 4,000 bp
Size > 4,000 bp
Size < 1,000 bp
1,000 – 2,000 bp
2,000 – 3,000 bp
3,000 – 4,000 bp
Size > 4,000 bp
Removing of the GC-content effect
where y represents the size-corrected mean read number and x the GC content.
where y corresponds to the predicted read number and x represents the difference between a GC content and 50%.
Third, we adjusted the size-corrected values by removing “g.x 3 + h.x 2 + i.x ” of this last function to all the transcripts. The Figure 3C illustrates the efficiency of GC bias correction.
For the 20 human samples, we obtained the same profiles of size-corrected read number according to GC content as in Cattle (data not shown). In contrast, for the 16 drosophila samples, the polynomial curves were different (Figure 3D and 3E). Nevertheless, the correction of the GC content bias using the previous procedure yielded a smoothing curve absolutely flat (Figure 3F), attesting the efficiency of our method.
The final step consisted in testing the effect of the GC content bias removing on the correlation between RNA-Seq counting and qRT-PCR quantification, in the case of bovine data. Except for the sample 1475 (10.106 reads), this last bias correction improved the global correlation with qRT-PCR quantifications relatively to the simple size correction by SGTR (Table 1). By comparison with log 2 (FPKM) correction, the removing of size and GC content biases improved the global correlation with qRT-PCR results, except for the sample 1455 which presented a low sequencing depth (13.106 reads) and showed similar results as log 2 (FPKM) correction. The Figure 2C illustrates the correlation obtained between SGTR including size and GC content corrections and qRT-PCR quantifications for the sample 1479. We observed a better proportionality than the one provided by log 2 (FPKM) correction (Figure 2B).
Adjustment according the total read number
where y i corresponds to the regression parameter for a total read number x i .
where y i corresponds to the corrected read number for the transcript x i .
where y i represents the corrected read number for the transcript x i .
where y i corresponds to the full-corrected read number, and x to the difference between the GC content and 50%.
Correction of the impact of total read numbers
SGTR - Size
SGTR - Size and GC content
Our results showed that non-transformed counts values from RNA-Seq presented worse correlations with qRT-PCR quantification than the log 2 -transformed ones, as already stressed by [2, 10, 21], and . The prior transformation of read counts by log 2 function was motivated by the variability of data corresponding to highly expressed genes, often observed in large size transcripts. We hypothesized that this transformation could also attenuate the overestimations due to PCR duplicates. Indeed, the more expressed the transcripts, the higher the probability to generate duplicates (several clusters of reads share exactly the same start and end) [13, 15]. Otherwise, certain authors have proposed to apply log 2 -transformed values to the data extracted from qRT-PCR [25, 26]. Given our regression curves, it is clear that for our samples, this correction is inappropriate (unpublished data).
As for transcript size correction, two strategies have been adopted by different authors. In the first one, the transcripts are ranked in quantiles containing identical numbers [2, 6, 7]. The advantage is a balanced distribution facilitating further statistical analysis. However, it is difficult to assign a mean read number to scaled sizes. In the second one, size classes are built irrespective of the number of genes per class , leading to an increasing dispersion for the classes of higher sizes (mainly due to lower number of genes). Both approaches allowed avoiding certain limitations implemented in RPKM (Reads Per Kilobase of exon model per Million mapped reads)  or FPKM  methods, where the number of read is simply divided by transcript size. The main difference consists in taking into consideration paired-reads in the FPKM method while only simple reads in the RPKM one.
We choose the second strategy because of the excellent regression quality of mean read numbers by size classes. We interpret the border 600 bp observed whatever the species dataset (Figure 1A-1C) as a result of sonication and selection of cDNA fragments between 250 and 450 bp. Indeed, fragments > 600 bp are all the more so represented that they are long [1, 3, 4, 27]. Conversely, the fragments < 600 bp are under-represented as many small segments were not sequenced. Moreover, the simulation conducted with rlsim confirmed our view, and showed that the border increases with the size of the sequenced fragments (Figure 1E). Hence, this proves the effect of the cDNA fragments size selection on the break point between the two regression lines. As a result, independent corrections are needed for both transcript sizes. This last point provided slightly better correction than the log 2 (FPKM) for transcripts < 1,000 bp (see Table 1). According to [14, 28] and , RNA-Seq protocol including PCR in the first steps introduced biases linked to GC content, as cDNA fragments with high GC and AT content are under-sequenced. To correct this bias, [10, 14] and  proposed to build GC-classes. In our method, we took into account the general trend by calculating a three order polynomial equation, which was used to correct the decrease over 50% GC content. The efficiency of our correction was sample-dependent and more precisely linked to sequencing depth. Indeed, for a low number of reads, the GC bias correction did not improve the normalization, in contrast to samples with higher sequencing depth. SGTR including Size and GC content corrections provide thus globally better results than log 2 (FPKM) (Table 2), which is in agreement with the conclusions of  and . We expect that the GC content correction should be more accurate if it was applied on gene segments (~300 to 500 bp) and not on full length transcripts, as there are variations along the sequence in their GC content.
Lastly, since the sequencing depth introduces effects on transcript size bias, we adjusted the TRN to 20 million reads in reason of its medium value. Hence, we modified the parameters a 1 , a 2 , and b 2 , but this step requires numerous samples to obtain reliable values. Finally, these size and TRN adjusted values were then corrected for GC content bias.
Our integrated method corrects some biases linked to transcript size and GC content, but also sequencing depth. However, it is striking that for the lowest sequencing depths (sample 1475: 10 million reads; 1455: 13 million reads) our correction gave worse or equal correlations with qRT-PCR values than log2(FPKM). In contrast, for read counts over 20 million, our method significantly improves the read counting, for the whole dataset and for most gene size classes. The question is to interpret this observation and several considerations have to be taken into account. First, in our samples, when the total number of reads is low, it is particularly true for transcript with sizes shorter than 600 pb, the regression equation between transcript size and read counts is less accurate than the one for transcript sizes longer than 600 pb. Second, the more expressed the transcripts (total read numbers over 20 million), the higher is the probability to generate duplicates and other biases induced by RNAseq. Our method can be compared to GAM (Generalized Additive Model) of , where the data are corrected for length, GC content, and dinucleotide frequency biases. However, these authors have shown that the correction of dinucleotide frequency biases did not improve results. Unlike GAM method, our model is not additive as we showed that the regression coefficient linked to transcript length depend on the sequencing depth. That was not the case for polynomial equation coefficients used to correct the GC content bias. Improvements are still needed to better take into account the variation of GC content per read in a given transcript, as the GC content is not homogeneous along the sequence. Protocols excluding PCR in first step could avoid this issue, and problems linked to PCR duplicates [13, 15, 28]. On the other hand, it is highly desirable to provide a good estimation of the number of reads corresponding to each transcript isoform. To overcome this issue, we took into account genes presenting only one transcript. In contrast to Human , this choice does not result in a dramatic loss of information as more than 50% of bovine genes have a single transcript in the available annotation file. The accurate determination of transcript size suffers from biases linked to cDNA library preparation. Indeed, it seems that random-hexamers present some favored and disfavored sites, so that specific regions are selected more easily than others leading to biases for low expressed genes [1, 31, 32]. RNA fragmentation before its reverse-transcription in cDNA reduces this bias leading to more uniform gene coverage . Nevertheless, these technical effects associated to library preparation as well as some variations observed between flow cells have always a smaller influence that the biological effect [6, 9]. Otherwise, the fine determination of TSSs (Transcription Start Sites) deduced from alignment of the reads onto the genome (and not onto the known transcripts) could further improve the accuracy of transcript size.
We demonstrated that our method is robust and suitable to compare the read counts of genes for numerous samples of the same tissue. All the steps described are sequentially automated within SGTR program written in Perl, and available upon request from RP and DP. The extension of our method to the normalization of the read numbers between different tissues requires considering a set of reference genes as calibrators.
All animal experimentation complied with the French Veterinary Authorities’ rules. No ethics approval was required by a specific committee, as the selected animals were not animals bred for experimental reasons.
We are grateful to Diane Esquerré and Olivier Bouchez (INRA, Toulouse) for their help in RNA library preparation and sequencing, respectively. We also thank Dr Mekki Boussaha (INRA, Jouy-en Josas) for helpful discussion. The RNA-Seq work was funded by the INRA Animal Genetics Department (BovRNA-Seq project). The sampling of the Limousin Longissimus thoracis biopsies was part of the Qualvigène project, funded by the French National Research Agency (contracts ANR-05-GANI-005 and ANR-05-GANI-017-01) and APIS GENE (contract 01-2005-QualviGenA-02). The qRT-PCR work was funded by the Limousin Regional Council.
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5: 621-628. 10.1038/nmeth.1226.View ArticlePubMedGoogle Scholar
- Cloonan N, Forrest ARR, Kolle G, Gardiner BBA, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, Robertson AJ, Perkins AC, Bruce SJ, Lee CC, Ranade SS, Peckham HE, Manning JM, McKernan KJ, Grimmond SM: Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods. 2008, 5: 613-619. 10.1038/nmeth.1223.View ArticlePubMedGoogle Scholar
- Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, Schmidt D, O’Keeffe S, Haas S, Vingron M, Lehrach H, Yaspo M-L: A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008, 321: 956-960. 10.1126/science.1160342.View ArticlePubMedGoogle Scholar
- Oshlack A, Wakefield MJ: Transcript length bias in RNA-seq data confounds systems biology. Biol Direct. 2009, 4: 14-10.1186/1745-6150-4-14.View ArticlePubMed CentralPubMedGoogle Scholar
- Robinson MD, Oshlack A: A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010, 11: R25-10.1186/gb-2010-11-3-r25.View ArticlePubMed CentralPubMedGoogle Scholar
- Bullard JH, Purdom E, Hansen KD, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010, 11: 94-10.1186/1471-2105-11-94.View ArticlePubMed CentralPubMedGoogle Scholar
- Robinson MD, Smyth GK: Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008, 9: 321-332.View ArticlePubMedGoogle Scholar
- Dillies M-A, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, Guernec G, Jagla B, Jouneau L, Laloë D, Le Gall C, Schaëffer B, Le Crom S, Guedj M, Jaffrézic F, The French StatOmique Consortium: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform. 2012, 14 (6): 671-683.View ArticlePubMedGoogle Scholar
- Srivastava S, Chen L: A two-parameter generalized Poisson model to improve the analysis of RNA-seq data. Nucleic Acids Res. 2010, 38: e170-10.1093/nar/gkq670.View ArticlePubMed CentralPubMedGoogle Scholar
- Risso D, Schwartz K, Sherlock G, Dudoit S: GC-content normalization for RNA-Seq data. BMC Bioinformatics. 2011, 12: 480-10.1186/1471-2105-12-480.View ArticlePubMed CentralPubMedGoogle Scholar
- Zheng W, Chung LM, Zhao H: Bias detection and correction in RNA-Sequencing data. BMC Bioinformatics. 2011, 12: 290-10.1186/1471-2105-12-290.View ArticlePubMed CentralPubMedGoogle Scholar
- Hansen KD, Irizarry RA, WU Z: Removing technical variability in RNA-seq data using conditional quantile normalization. Biostat Oxf Engl. 2012, 13: 204-216.View ArticleGoogle Scholar
- Mamanova L, Andrews RM, James KD, Sheridan EM, Ellis PD, Langford CF, Ost TWB, Collins JE, Turner DJ: FRT-seq: amplification-free, strand-specific, transcriptome sequencing. Nat Methods. 2010, 7: 130-132. 10.1038/nmeth.1417.View ArticlePubMed CentralPubMedGoogle Scholar
- Benjamini Y, Speed TP: Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 2012, 40: e72-10.1093/nar/gks001.View ArticlePubMed CentralPubMedGoogle Scholar
- Xu H, Luo X, Qian J, Pang X, Song J, Qian G, Chen J, Chen S: FastUniq: a fast de novo duplicates removal tool for paired short reads. PLoS One. 2012, 7 (12): e52249-10.1371/journal.pone.0052249.View ArticlePubMed CentralPubMedGoogle Scholar
- Li J, Jiang H, Wong WH: Modeling non-uniformity in short-read rates in RNA-Seq data. Genome Biol. 2010, 11: R50-10.1186/gb-2010-11-5-r50.View ArticlePubMed CentralPubMedGoogle Scholar
- Ermonval M, Petit D, Le Duc A, Kellermann O, Gallet P-F: Glycosylation-related genes are variably expressed depending on the differentiation state of a bioaminergic neuronal cell line: implication for the cellular prion protein. Glycoconj J. 2009, 26: 477-493. 10.1007/s10719-008-9198-5.View ArticlePubMedGoogle Scholar
- Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR: STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013, 29: 15-21. 10.1093/bioinformatics/bts635.View ArticlePubMed CentralPubMedGoogle Scholar
- Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L: Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012, 7: 562-578. 10.1038/nprot.2012.016.View ArticlePubMed CentralPubMedGoogle Scholar
- Sipos B, Slodkowicz G, Massingham T, Goldman N: Realistic simulations reveal extensive sample-specificity of RNA-seq biases. 2013, arXiv preprint arXiv:1308.3172Google Scholar
- Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18: 1509-1517. 10.1101/gr.079558.108.View ArticlePubMed CentralPubMedGoogle Scholar
- Sun Z, Zhu Y: Systematic comparison of RNA-Seq normalization methods using measurement error models. Bioinforma Oxf Engl. 2012, 28: 2584-2591. 10.1093/bioinformatics/bts497.View ArticleGoogle Scholar
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and abundance estimation from RNA-Seq reveals thousands of new transcripts and switching among isoforms. Nat Biotechnol. 2010, 28: 511-515. 10.1038/nbt.1621.View ArticlePubMed CentralPubMedGoogle Scholar
- Hammer Ø, Harper D, Ryan P: Past: paleontological statistics software package for education and data analysis. Palaeontol Electron. 2001, 4 (4): 9-178kb. http://www.palaeo-electronica.org/2001_1/past/issue1_01.htmGoogle Scholar
- Lee S, Seo CH, Lim B, Yang JO, Oh J, Kim M, Lee S, Lee B, Kang C, Lee S: Accurate quantification of transcriptome from RNA-Seq data by effective length normalization. Nucleic Acids Res. 2011, 39: e9-10.1093/nar/gkq1015.View ArticlePubMed CentralPubMedGoogle Scholar
- Jones DC, Ruzzo WL, Peng X, Katze MG: A new approach to bias correction in RNA-Seq. Bioinformatics. 2012, 28: 921-928. 10.1093/bioinformatics/bts055.View ArticlePubMed CentralPubMedGoogle Scholar
- Gao L, Fang Z, Zhang K, Zhi D, Cui X: Length bias correction for RNA-seq data in gene set analyses. Bioinformatics. 2011, 27: 662-669. 10.1093/bioinformatics/btr005.View ArticlePubMed CentralPubMedGoogle Scholar
- Kozarewa I, Ning Z, Quail MA, Sanders MJ, Berriman M, Turner DJ: Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of GC-biased genomes. Nat Methods. 2009, 6: 291-295. 10.1038/nmeth.1311.View ArticlePubMed CentralPubMedGoogle Scholar
- Aird D, Ross MG, Chen W-S, Danielsson M, Fennell T, Russ C, Jaffe DB, Nusbaum C, Gnirke A: Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011, 12: R18-10.1186/gb-2011-12-2-r18.View ArticlePubMed CentralPubMedGoogle Scholar
- Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras J-B, Stephens M, Gilad Y, Pritchard JK: Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010, 464: 768-772. 10.1038/nature08872.View ArticlePubMed CentralPubMedGoogle Scholar
- Hansen KD, Brenner SE, Dudoit S: Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010, 38: e131-10.1093/nar/gkq224.View ArticlePubMed CentralPubMedGoogle Scholar
- Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L: Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 2011, 12: R22-10.1186/gb-2011-12-3-r22.View ArticlePubMed CentralPubMedGoogle Scholar
- Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M: The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008, 320: 1344-1349. 10.1126/science.1158441.View ArticlePubMed CentralPubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.