Accuracy of RNA-Seq and its dependence on sequencing depth
© Cai et al; licensee BioMed Central Ltd. 2012
Published: 24 August 2012
Skip to main content
© Cai et al; licensee BioMed Central Ltd. 2012
Published: 24 August 2012
The cost of DNA sequencing has undergone a dramatical reduction in the past decade. As a result, sequencing technologies have been increasingly applied to genomic research. RNA-Seq is becoming a common technique for surveying gene expression based on DNA sequencing. As it is not clear how increased sequencing capacity has affected measurement accuracy of mRNA, we sought to investigate that relationship.
We empirically evaluate the accuracy of repeated gene expression measurements using RNA-Seq. We identify library preparation steps prior to DNA sequencing as the main source of error in this process. Studying three datasets, we show that the accuracy indeed improves with the sequencing depth. However, the rate of improvement as a function of sequence reads is generally slower than predicted by the binomial distribution. We therefore used the beta-binomial distribution to model the overdispersion. The overdispersion parameters we introduced depend explicitly on the number of reads so that the resulting statistical uncertainty is consistent with the empirical data that measurement accuracy increases with the sequencing depth. The overdispersion parameters were determined by maximizing the likelihood. We shown that our modified beta-binomial model had lower false discovery rate than the binomial or the pure beta-binomial models.
We proposed a novel form of overdispersion guaranteeing that the accuracy improves with sequencing depth. We demonstrated that the new form provides a better fit to the data.
To measure gene expression by RNA-Seq, RNA molecules are converted to DNA, sequenced, mapped to a gene database, and counted [1–3]. RNA-Seq then provides a digital readout of the gene expression levels. As the cost of next-generation sequencing drops rapidly, RNA-Seq may replace microarray methods in genome-wide surveys of gene expression. Compared to microarray technology, RNA-Seq has several advantages, including the ability to simultaneously detect mutations, discovering alternative transcript [4–6] and alternative splicing [7–10].
It is common to study the changes in gene expression under a perturbation. The perturbation can be, for example, the deletion of a gene, which is important in characterizing the function of a new gene, or it can be the stimulation of cells by a ligand, which is important in deciphering a pathway. Many experimental techniques, such as RNA interference , have been developed in recent years to make it easier to delete genes in mammalian cells. For an embryonic lethal gene in the mouse model, the Cre-lox system can be used to perform conditional gene knockout in a tissue-specific manner . These gene deletion techniques facilitate the study of gene functions for a large fraction of mammalian genes that remain to be characterized. Furthermore, two-sample comparisons apply when studying pathways through receptor stimulation. These methods have become increasingly popular for examining signal transduction pathways holistically. In such studies, the emphasis is on the function of genes or pathways and not on the genetic background in which the study is carried out. Therefore, one repeats the experiments in the same cell line or in mice with identical genetic backgrounds, and expects to find no genetic variation. In this situation, the difference in gene expression can be due to different methods of handling the biological samples (library preparation), as well as statistical fluctuations from the finite number of tags mapped to each gene. The uncertainty in the outcome of RNA-seq in repeated experiments of identical genetic background is yet to be characterized.
Such uncertainty affects the ability to affirm which genes are differentially expressed between a sample and a control. We focus on estimating the change in gene expression because the absolute amounts of RNA, by themselves, as measured by the RPKM (reads per kilobase of read length per million mapped reads) of the sequenced tag values , are not useful in most cases for biological interpretation. We hypothesize that experimental uncertainty is due primarily to the library preparation steps before sequencing, that it is intrinsic to the experimental protocol, and can therefore be characterized from repeated experiments. The expression difference is estimated based on the computation of a p-value, which can be calculated from repeated experiments using a t-test. However, since in RNA-Seq, the expression ratio derived from low sequence reads should have a larger error, it would be valuable to statistically estimate the error due to low counts . Binomial and beta-binomial [14, 15] distributions can be used to characterize small tag count fluctuations.
A fundamental question in RNA-Seq analysis is how the accuracy of measured gene expression change by RNA-Seq depend on the sequencing depth . Here the sequence depth means the total number of sequenced reads, which can be increased by using more lanes. A binomial distribution is often used to compare two RNA-Seq experiments. In this model, uncertainty approaches zero as where N is the tag counts for the gene. Indeed, the sequencing of the same DNA in different sequencing lanes produces errors consistent with the binomial distribution [13, 17]. However, comparisons of different samples have shown a dispersion larger than that given by the binomial distribution [18, 19]. A beta-binomial distribution appropriately describes the overdispersion. This type of distribution has been used for the analysis of differential gene expression levels in SAGE libraries , and to model peptide count data with both within- and between-sample variation in label-free tandem mass spectrometry-based proteomics . For the dispersion, the error is a sum of two parts: the first part goes to zero following , and the second part is a constant that is independent of N. The constant is ideal for describing the genetic variant. However, where genetic variations are not expected, it is inconsistent with the intuition that the accuracy of the measurement should improve with increasing depth of sequencing. In this paper, we provide empirical evidence that the error goes to zero as the tag count N increases, but at a slower rate than . We aim to characterize these overdispersions gene by gene, using a pair of replicate experiments. We compared the results in a dataset in which multiple replicates were also available. We used a form of overdispersion based on a beta-binomial distribution, but one in which the overdispersion parameter depends explicitly on the number of tags. The form we used was suggested during a study of the standard deviation as a function of the tag count. We demonstrated that the modified beta-binomial distribution improve performance.
The use of a proportion is a convenient way to compare two samples. Let n i and m i be the number of tags mapped to gene i. The proportion is defined as . It is convenient to use proportion because differences in proportion give rise to p-values using established statistics such as binomial and beta-binomial distributions. A proportion is also a convenient component of a normalization procedure.
In order to detect differential expression in two samples, we must determine the ratio of the counts in the two samples that corresponds to the same expression. One method, adapted in calculating the RPKM, assumes that the total number of tags sequenced, and equivalently the total amount of RNA, is a constant. The problem with RPKM normalization is that the number is dominated by a few genes that receive the highest sequence reads. These genes may or may not remain constant under the two experimental conditions. One could also use housekeeping genes such as POLR2A (polymeras II) or GAPDH in a normalization procedure. The problem with relying on a housekeeping gene is that the normalization depends on the choice of genes. Since the number of housekeeping genes is small, this normalization procedure is subject to fluctuation due to relatively small tag counts on these genes. Bullard et al. have shown good results with an upper-quartile normalization method .
This peak of histogram normalization is expected to be the most reasonable procedure for the Chiang dataset , which consists of the wild type and knockout versions of the TDP-43 gene (see Data and Methods for details). For this dataset, we expect the perturbation to the global gene expressions to be smaller than when comparing two different types of cells. Indeed, our peak of histogram normalization procedure resulted in a median of base-2 logarithm of expression difference ratio between the wild type and knockout gene of 0.014, which is to be compared to 0.025 for the median under the RPKM normalization procedure. This showed that peak normalization was comparable to and perhaps slightly better than RPKM normalization.
Normalization is performed according to the assumption that most of the genes do not change expression in the two experimental conditions. Although this convenient assumption is probably true in most cases, it has no ironclad biological justification.
Embryonic Stem Cells
Knock-out of TDP-43
UHR library A
UHR library B
Two estimations of γ from three datasets
Pairs of Experiments used in calculation
Under this assumption, for 0 < γ < 1, the asymptotic form of the variance of the proportion at large tag count N i = n i + m i according to the beta-binomial distribution is . Therefore the variance of the proportion of the modified beta-binomial distribution does approach zero at large N, but at a slower rate than in the binomial distribution.
We have investigated the error of RNA-Seq gene expression from repeated measurements. We have shown that the sequence reads from the same biological sample sequenced in different lanes follows a binomial distribution and that the library preparation steps prior to sequencing introduced larger variations from repeated experiments of the same biological specimen. We showed that the accuracy from repeated measurement improved with the sequencing depth. However the improvement with the tag counts was generally slower than predicted by the binomial distribution. We used a beta-binomial distribution to fit the inter-library overdispersion and introduced a parameterization of the overdispersion parameter that is consistent with the intuition that measurement accuracy should increase with the sequencing depth. We optimized the overdispersion parameters using maximum-likelihood estimation. We demonstrated better performance in lower FDR using our modified beta-binomial model.
Using the proportion of counts to estimate the gene expression difference has advantages over the RPKM expression. It has been shown recently that, in contrary to a naive presumption, the number of tags mapped to different positions in the same gene are highly non-uniform . The Poisson rate at different positions can fluctuate by a few hundred fold. However, the pattern of the variable rates along the position of the gene is highly reproducible, even when comparing experiments performed on different tissues. And such rates depend on the nucleotide composition of the local sequences. The main contribution of the variable Poisson rate is that it can be attributed to the hexamer primer in converting RNA to DNA . These data suggest that uneven PCR amplification could be the cause of the overdispersion that was observed. In estimating gene expression differences using a proportion, the highly variable Poisson rates do not need to be estimated. Such rates only enter the process indirectly through the dispersion. Not having to estimate the highly variable Poisson rates is therefore advantageous.
When the value of γ in Eq.(1) is zero, our model reverses back to the beta-binomial distribution. Interestingly, the γ values estimated in the different datasets were not the same. This phenomenon is similar to the GC bias in sequencing data, which also depends on the experiments . Therefore γ may be influenced by the experimental protocol that is used. The Caltech and Chiang datasets had similar values of γ. In these two datasets, the D i values are similar for the same gene (data not shown). This is quite consistent with the previous finding  that the variability of the Poisson rates is similar in different experiments. It would be interesting to study how D i may depend on the DNA sequences of the gene. Another parametrization of overdispersion is by position within a gene using Eq.(1). These possibilities will be explored in the future.
The normalization procedure using the peak of the histogram of proportion assumes that most genes remain unchanged in the two conditions being compared. In this normalization procedure, we fitted the highest peak in the histogram of proportion to a beta function. The maximum of the beta function determines the normalization proportion p n .
In RPKM normalization, we first count the total number of tags mapped to any gene in the RNA-Seq experiment. The number of tags mapped to a particular gene is divided by the total number of tags sequenced (the unit is millions of tags), and then divided by the number of nucleotides in the gene (the unit is thousands).
The three datasets we used are listed in Table 1.
The Chiang dataset consisted of five independent libraries of the deleted TDP-43 gene in the mouse. The data were derived from three independent clones of TDP-43 knockout embryonic stem (ES) cells and two independent clones of control ES cells. Raw reads were mapped to the University of California Santa Cruz mm9 genome library by efficient large-scale alignment of nucleotide databases. One gene deletion is an ideal case for testing normalization procedures with the assumption that most genes do not change.
The Caltech dataset consisted of two cells lines: GM12878 (normal blood) and H1hESC (embryonic stem cells), each with four libraries made independently from the same biological sample. The process involved raw Illumina reads on 2x75 datasets (RawData files on the download page, fasta format), which were run through Bowtie, version 0.9.8.1, with up to 2 mismatches. The resulting mappings were stored (RawData2 files, Bowtie format) for up to ten matches per read to the genome, spiked controls and UCSC knownGene splice junctions.
The Bullard dataset consisted of human brain reference RNA and human universal reference RNA as two library preparations. We used Bowtie, version 0.12.7, to align the reads to the genome (H. sapiens, NCBI 37.1 assembly). The Bowtie command we used to implement this mapping strategy was ./bowtie -a -v 2 -t -m 1 --best -strata h_sapiens_37_asm.
According to the likelihood ratio test, follows a Χ 2 distribution, where p i is the proportion for gene i and p n is the normalized proportion corresponding to no change in gene expression. This is the most convenient way to compute the p-value.
To determine the false discovery rate (FDR), we assumed that any gene deemed to be significantly differentially expressed at a given p-value were false when comparing two replicates sequenced from the same biological sample. We computed the FDR by dividing the number of falsely discovered genes at a given p-value with the number of significantly differentially expressed genes, comparing the sample to the control at the same p-value.
To determine the receiver operating characteristic (ROC), we first established a gold standard. Approximately one thousand genes in the Bullard dataset were previously assayed by RT-PCR in four independent experiments . Differentially expressed genes were determined by t-test by Bullard et al. . We used their results to draw an ROC curve when comparing the binomial and beta-binomial distributions for the Bullard dataset. For the Caltech and Chiang datasets, we assumed that the t-test provided a gold standard. In order to reduce errors for small tag counts, we required a gene to have more than 20 mapped tags. For the Caltech data, the Benjamini & Hochberg adjustment was applied to the p-value calculated by the t-test, using a cutoff of 0.05 . We could not use the FDR p-value adjustment on the Bullard dataset, as much fewer genes had differential expression levels detected from the wild/knockout samples. Therefore, we applied a cutoff of 0.05 to the p-value from the t-test and required a fold change larger than two.
We related the fold change in the gene expression level FC i to the optimized ratio p i and obtained, by definition, . This ratio has to be calibrated against the normalization of the entire experiment. We defined p n as no change. Therefore , where p n is the normalized ratio as determined over the entire dataset. Infinite values of FC i can be avoided by adding a pseudo-count to n i and m i so that 0 < p i < 1.
This work was partially supported by the NIH/NCI grant 5K25CA123344. HL was supported by a training fellowship from the Keck Center for Quantitative Biomedical Sciences of the Gulf Coast Consortia, on the Computational Cancer Biology Training Program from the Cancer Prevention and Research Institute of Texas (CPRIT No. RP101489).
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 13, 2012: Selected articles from The 8th Annual Biotechnology and Bioinformatics Symposium (BIOT-2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/13/S13/S1