Empirical bayes analysis of sequencing-based transcriptional profiling without replicates
© Wu et al. 2010
Received: 4 June 2010
Accepted: 16 November 2010
Published: 16 November 2010
Skip to main content
© Wu et al. 2010
Received: 4 June 2010
Accepted: 16 November 2010
Published: 16 November 2010
Recent technological advancements have made high throughput sequencing an increasingly popular approach for transcriptome analysis. Advantages of sequencing-based transcriptional profiling over microarrays have been reported, including lower technical variability. However, advances in technology do not remove biological variation between replicates and this variation is often neglected in many analyses.
We propose an empirical Bayes method, titled Analysis of Sequence Counts (ASC), to detect differential expression based on sequencing technology. ASC borrows information across sequences to establish prior distribution of sample variation, so that biological variation can be accounted for even when replicates are not available. Compared to current approaches that simply tests for equality of proportions in two samples, ASC is less biased towards highly expressed sequences and can identify more genes with a greater log fold change at lower overall abundance.
ASC unifies the biological and statistical significance of differential expression by estimating the posterior mean of log fold change and estimating false discovery rates based on the posterior mean. The implementation in R is available at http://www.stat.brown.edu/Zwu/research.aspx.
Recent technological advancements have made high throughput sequencing an increasingly popular approach for transcriptome analysis. Unlike microarrays, enumeration of transcript abundance with sequencing technology is based on direct counts of transcripts rather than relying on hybridization to probes. This has reduced the noise caused by cross-hybridization and the bias caused by the variation in probe binding efficiency. Sequencing-based transcription profiling does have other challenges. For example, whole transcript analysis produces data with transcript length bias . Other biases, including GC content, have also been reported . Nonetheless, sequencing based expression analysis has been shown to be more robust and have higher resolution compared to microarray platforms . Some researchers have predicted that it will eventually replace microarrays as the major platform for monitoring gene expression . The importance of replicates is well recognized in microarray analysis  and it is now standard practice to include biological replicates under each experimental condition. However, as of now, sequencing-based gene expression studies often do not include replicates [6–8], posing the question of whether the biological variation is, or can be, adequately addressed.
For illustration, we use data from Illumina Digital Gene Expression (DGE) tag profiling in this paper. However, our statistical methodology, and its implementation in R, are general for all sequencing-based technologies that quantify gene expression as counts instead of continuous measurements such as probe intensity in microarrays. In DGE, the 3' end of transcripts with a poly-A tail are captured by beads coated with oligo dT. Two restriction enzymes, NlaIII and Mmel are used to digest the captured transcripts, generating a 21-base fragment starting at the most 3' NlaIII site. The 21-base fragments are sequenced to quantify the transcriptome. Consider two samples in a comparison and let X 1 and X 2 be the counts of a particular sequence tag in the two samples. The most common approach is to consider the counts as a realization of binomial distribution B(N i , π i ), i = 1,2. where N i is the total number of sequences in a sample, representing sequencing depth. A statistical test for π 1 = π 2 can be conducted. The classical Z-test using the Gaussian approximation to the binomial distribution is proposed for the Serial Analysis of Gene Expression (SAGE) data [9, 10] and recently applied to DGE and other sequencing data [11–13], and Fisher's exact test has also been proposed . In other technologies, sequence counts may have to be combined at either the exon or full transcript level to form the counts X 1 and X 2.
The test for H 0 : π 1 = π 2 can be performed without replicates. However, rejection of the H 0 hypothesis simply implies difference between the two samples. Unless the proportion of a gene in the transcriptome is the same for all samples under the same condition (lack of within-class variation), we can not generalize the difference between two samples to the difference between two classes. The within-class biological variation among replicates leads to over dispersion in Binomial or Poisson models. Models accounting for over dispersion, such as a beta-binomial, have been introduced for the analysis of SAGE data when several replicates within each class are available [15, 16]. Robinson and Smyth  use a negative binomial model and squeeze tag-wise dispersion towards a common dispersion, estimated using all tags, with a weighted likelihood approach that yields an EB-like solution. edgeR , a Bioconductor package implementing this method, has been applied to both DGE and RNA-seq data with replicates. However, since replicates are still rare in high throughput sequencing, many researchers have been relying on simple tests of equal proportion with multiple testing correction. Another drawback in some current analysis methods, especially those applied to the no-replicate situation, is the use of Gaussian approximation of binomial distribution [11, 19], which does not work well with data that include low count numbers. In transcriptome analysis, due to the depth of sequencing, the majority of genes have low counts. Relying on Gaussian distribution often gives highly expressed genes favorable statistical power, such that genes that have a lower expression but exhibit greater extent of differential expression between samples are less likely to be discovered.
In this paper we present an empirical Bayes method, titled Analysis of Sequence Counts (ASC), to estimate the log fold change of transcription between two samples. We borrow information across sequences to estimate the hyper parameters representing the normal biological variation among replicates and the distribution of a transcriptome. The statistical model does not rely on Gaussian approximation of the binomial distribution for all tags and requires no special treatment of 0 counts. Differential expression is computed in the form of a shrinkage estimate of log fold change. This estimate is the basis for ranking genes. We also compute the posterior probability that the log fold change is greater than a biologically relevant threshold chosen by the user. In contrast to sorting genes simply by p-values, we focus on the biological significance (represented by the posterior expectation of log fold change) and provide uncertainty measure in the form of posterior probability.
Overlap between the top 1000 genes identified by different methods and the SAGE BetaBin ranking.
Bayes Error by SAGE BetaBin
Fisher's exact test
Overlap between the top 100 or top 1000 differentially expressed genes identified by edgeR on full data and by other statistics on data without replicate
edgeR on full data
We present a simple hierarchical model for sequencing-based gene expression data (e.g. DGE, RNAseq ect.) that provides a shrinkage estimate of differential expression in the form of posterior mean of log fold change. Even in experiments lacking replicates, we take advantage of the large number of sequences quantified in the same experiment and establish a prior distribution of difference between conditions. The differential expression of a gene is evaluated based on the posterior expectation of log fold change. This estimate takes into account the increased uncertainty for genes with smaller counts (demonstrated by more aggressive shrinking in Figure 4) yet still allows the identification of differential expression among genes with lower expression. Our measure of statistical uncertainty is the posterior probability that the differential expression is beyond a given threshold, thus the inference on differential expression avoids the problem of conflicting "statistical significance" versus "biological significance" seen in Z-tests.
It is not uncommon to use hierarchical models for gene expression data. Several models used in microarray data analysis [21, 22] add another level of hierarchy by assuming that only a fraction of genes may have been affected by any treatment, and the rest have absolutely no change. Therefore, δ|Z = 1 ~ N(0, τ 2) and δ = 0|Z = 0 with P(Z = 1) = p 0. This essentially assumes that the prior distribution of δ is a zero-inflated Gaussian distribution. We show that using a simple Gaussian prior provides good shrinkage without the extra layer of hierarchy, which greatly simplifies computation.
In biological terms, our model means that the mean gene expression levels between two populations are never absolutely equal for any gene. However, the difference for most genes are small. We use posterior expectation as the estimate of the magnitude of difference. McCarthy and Smyth  showed that testing the differential expression relative to a biologically meaningful threshold identifies more biologically meaningful genes. We take a similar approach and estimate the posterior probability that the differential expression is greater than a threshold. Therefore, we avoid genes with very subtle differential expression even if that difference is statistically significant between the two samples in comparison. The genes identified by ASC include those that are modestly expressed as well as highly expressed.
The diatom Thalassiosira pseudonana (Strain 1335 from the Center for the Culture of Marine Phytoplankton) was grown axenically in 24 hr light at 14°C in f/2 media [24, 25] made with Sargasso Sea water, Treatments consisting of phosphorus-limited medium (0.4 μM PO4) and phosphorus-replete medium (36 μM PO4) were grown in triplicate and are herein referred to as treatments A and B, respectively. Equal volumes of cell biomass from each replicate were pooled for the A or B treatments 96 hours after inoculation and harvested by gentle filtration. Filters were immediately frozen in liquid nitrogen and stored at -80°C.
Total RNA was extracted using the RNeasy Midi Kit (Qiagen), following the manufacturer's instructions with the following changes: RNA samples were processed with Qiashredder columns (Qiagen) to remove large cellular material and DNA was removed with an on-column DNAase digestion using RNase-free DNAase (Qiagen). A second DNA removal step was conducted using the Turbo DNA-free kit (Ambion, Austin, TX, USA)[B1]. The RNA was quantified in triplicate using the Mx3005 Quantitative PCR System (Stratagene) and the Quant-iT RiboGreen RNA Assay Kit (Invitrogen) and was analyzed for integrity by gel electrophoresis. Total RNA was sent to Illumina (Hayward, CA) and they constructed digital gene expression (DGE) libraries with NlaIII tags following their protocol. Sequencing libraries for NlaIII digested tags were constructed by Illumina and sequenced on their Genome Analyzer. 12,525,833 tags were sequenced from the A library and 13,431,745 tags were sequenced from the B library.
Here δ has the interpretation of log fold change in gene expression, and λ is a nuisance parameter representing the average (log) expression.
where Exp represents shifted exponential distribution with rate α and shift λ 0.
We obtain the posterior mean given the gene counts as an estimate of differential expression. We refer to as the shrinkage estimate of log fold change, which is sufficient to rank the genes. To evaluate the statistical significance, we compute the posterior probability P(|δ| > Δ0|x), where Δ0 is a user-defined effect size of biological significance. There is no closed form expression for the posterior distribution and we use numerical integration for the evaluation of the posterior mean and probability.
Thus we can estimate α as . Our default setting is q 1 = 0.8 and q 2 = 0.9. The posterior mean E[δ|x] is not sensitive to the choice of q 1, q 2 (Additional file 3, Figure S3). Again, due to the lack of memory property, the probability density of shifted exponential distributions with the same rate are proportional, thus the value of λ 0 does not affect the posterior distribution and does not need to be estimated.
To estimate τ, the parameter representing the biological variation among replicates, we borrow information across genes. Although for any given gene we only observe one total count under each condition, and thus the true differential expression and the biological variation cannot be identified, we assume that the majority of genes are not affected by the treatment, an assumption found to be reasonable in microarray data in many experiments. We can model τ as a function of λ, but Figure 2 suggests that the biological variation is rather constant across expression levels. Thus we estimate one global parameter τ. We start with the Gaussian approximation of the Binomial model . Since the total counts N are usually a very large integer, we also have, approximately, . The variance of log(p 1) decreases with rate of 1/N as π increases and becomes negligible compared to biological variation. Thus we simply estimate τ from the differences of log rpm with the highest average log rpm. We use inter quartile range instead of sample standard deviation to avoid influence of genes with extreme differential expression. . In practice we use total counts above 1000 and this allows us to have several thousand genes (over 4000 in our example) for the estimation.
We thank the reviewers for their insightful comments and suggestions that greatly strengthened the manuscript. We thank A. Drzewianowski for her assistance with laboratory experiments. Funding was provided by NSF OCE-0723677.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.