Empirical bayes analysis of sequencing-based transcriptional profiling without replicates
- Zhijin Wu^{1}Email author,
- Bethany D Jenkins^{2, 3},
- Tatiana A Rynearson^{3},
- Sonya T Dyhrman^{4},
- Mak A Saito^{5},
- Melissa Mercier^{3} and
- LeAnn P Whitney^{2}
https://doi.org/10.1186/1471-2105-11-564
© Wu et al; licensee BioMed Central Ltd. 2010
Received: 4 June 2010
Accepted: 16 November 2010
Published: 16 November 2010
Abstract
Background
Recent technological advancements have made high throughput sequencing an increasingly popular approach for transcriptome analysis. Advantages of sequencing-based transcriptional profiling over microarrays have been reported, including lower technical variability. However, advances in technology do not remove biological variation between replicates and this variation is often neglected in many analyses.
Results
We propose an empirical Bayes method, titled Analysis of Sequence Counts (ASC), to detect differential expression based on sequencing technology. ASC borrows information across sequences to establish prior distribution of sample variation, so that biological variation can be accounted for even when replicates are not available. Compared to current approaches that simply tests for equality of proportions in two samples, ASC is less biased towards highly expressed sequences and can identify more genes with a greater log fold change at lower overall abundance.
Conclusions
ASC unifies the biological and statistical significance of differential expression by estimating the posterior mean of log fold change and estimating false discovery rates based on the posterior mean. The implementation in R is available at http://www.stat.brown.edu/Zwu/research.aspx.
Background
Recent technological advancements have made high throughput sequencing an increasingly popular approach for transcriptome analysis. Unlike microarrays, enumeration of transcript abundance with sequencing technology is based on direct counts of transcripts rather than relying on hybridization to probes. This has reduced the noise caused by cross-hybridization and the bias caused by the variation in probe binding efficiency. Sequencing-based transcription profiling does have other challenges. For example, whole transcript analysis produces data with transcript length bias [1]. Other biases, including GC content, have also been reported [2]. Nonetheless, sequencing based expression analysis has been shown to be more robust and have higher resolution compared to microarray platforms [3]. Some researchers have predicted that it will eventually replace microarrays as the major platform for monitoring gene expression [4]. The importance of replicates is well recognized in microarray analysis [5] and it is now standard practice to include biological replicates under each experimental condition. However, as of now, sequencing-based gene expression studies often do not include replicates [6–8], posing the question of whether the biological variation is, or can be, adequately addressed.
For illustration, we use data from Illumina Digital Gene Expression (DGE) tag profiling in this paper. However, our statistical methodology, and its implementation in R, are general for all sequencing-based technologies that quantify gene expression as counts instead of continuous measurements such as probe intensity in microarrays. In DGE, the 3' end of transcripts with a poly-A tail are captured by beads coated with oligo dT. Two restriction enzymes, NlaIII and Mmel are used to digest the captured transcripts, generating a 21-base fragment starting at the most 3' NlaIII site. The 21-base fragments are sequenced to quantify the transcriptome. Consider two samples in a comparison and let X_{1} and X_{2} be the counts of a particular sequence tag in the two samples. The most common approach is to consider the counts as a realization of binomial distribution B(N_{ i } , π_{ i } ), i = 1,2. where N_{ i } is the total number of sequences in a sample, representing sequencing depth. A statistical test for π_{1} = π_{2} can be conducted. The classical Z-test using the Gaussian approximation to the binomial distribution is proposed for the Serial Analysis of Gene Expression (SAGE) data [9, 10] and recently applied to DGE and other sequencing data [11–13], and Fisher's exact test has also been proposed [14]. In other technologies, sequence counts may have to be combined at either the exon or full transcript level to form the counts X_{1} and X_{2}.
The test for H_{0} : π_{1} = π_{2} can be performed without replicates. However, rejection of the H_{0} hypothesis simply implies difference between the two samples. Unless the proportion of a gene in the transcriptome is the same for all samples under the same condition (lack of within-class variation), we can not generalize the difference between two samples to the difference between two classes. The within-class biological variation among replicates leads to over dispersion in Binomial or Poisson models. Models accounting for over dispersion, such as a beta-binomial, have been introduced for the analysis of SAGE data when several replicates within each class are available [15, 16]. Robinson and Smyth [17] use a negative binomial model and squeeze tag-wise dispersion towards a common dispersion, estimated using all tags, with a weighted likelihood approach that yields an EB-like solution. edgeR [18], a Bioconductor package implementing this method, has been applied to both DGE and RNA-seq data with replicates. However, since replicates are still rare in high throughput sequencing, many researchers have been relying on simple tests of equal proportion with multiple testing correction. Another drawback in some current analysis methods, especially those applied to the no-replicate situation, is the use of Gaussian approximation of binomial distribution [11, 19], which does not work well with data that include low count numbers. In transcriptome analysis, due to the depth of sequencing, the majority of genes have low counts. Relying on Gaussian distribution often gives highly expressed genes favorable statistical power, such that genes that have a lower expression but exhibit greater extent of differential expression between samples are less likely to be discovered.
In this paper we present an empirical Bayes method, titled Analysis of Sequence Counts (ASC), to estimate the log fold change of transcription between two samples. We borrow information across sequences to estimate the hyper parameters representing the normal biological variation among replicates and the distribution of a transcriptome. The statistical model does not rely on Gaussian approximation of the binomial distribution for all tags and requires no special treatment of 0 counts. Differential expression is computed in the form of a shrinkage estimate of log fold change. This estimate is the basis for ranking genes. We also compute the posterior probability that the log fold change is greater than a biologically relevant threshold chosen by the user. In contrast to sorting genes simply by p-values, we focus on the biological significance (represented by the posterior expectation of log fold change) and provide uncertainty measure in the form of posterior probability.
Modeling biological variation
Distribution of expression levels in a transcriptome
Results and Discussion
Comparison with other methods
Overlap between the top 1000 genes identified by different methods and the SAGE BetaBin ranking.
Bayes Error by SAGE BetaBin | ASC | DGEseq | Z-test | Fisher's exact test | EdgeR |
---|---|---|---|---|---|
≈ 0 | 320 | 189 | 178 | 180 | 259 |
≤ 0.01 | 391 | 254 | 242 | 244 | 332 |
≤ 0.05 | 516 | 404 | 397 | 398 | 436 |
Overlap between the top 100 or top 1000 differentially expressed genes identified by edgeR on full data and by other statistics on data without replicate
edgeR on full data | Without replicates | ||||
---|---|---|---|---|---|
ASC | edgeR | DGEseq | Z | Fisher | |
top 100 | 33 | 37 | 2 | 1 | 1 |
top 1000 | 260 | 263 | 89 | 82 | 86 |
Discussion
We present a simple hierarchical model for sequencing-based gene expression data (e.g. DGE, RNAseq ect.) that provides a shrinkage estimate of differential expression in the form of posterior mean of log fold change. Even in experiments lacking replicates, we take advantage of the large number of sequences quantified in the same experiment and establish a prior distribution of difference between conditions. The differential expression of a gene is evaluated based on the posterior expectation of log fold change. This estimate takes into account the increased uncertainty for genes with smaller counts (demonstrated by more aggressive shrinking in Figure 4) yet still allows the identification of differential expression among genes with lower expression. Our measure of statistical uncertainty is the posterior probability that the differential expression is beyond a given threshold, thus the inference on differential expression avoids the problem of conflicting "statistical significance" versus "biological significance" seen in Z-tests.
It is not uncommon to use hierarchical models for gene expression data. Several models used in microarray data analysis [21, 22] add another level of hierarchy by assuming that only a fraction of genes may have been affected by any treatment, and the rest have absolutely no change. Therefore, δ|Z = 1 ~ N(0, τ^{2}) and δ = 0|Z = 0 with P(Z = 1) = p_{0}. This essentially assumes that the prior distribution of δ is a zero-inflated Gaussian distribution. We show that using a simple Gaussian prior provides good shrinkage without the extra layer of hierarchy, which greatly simplifies computation.
In biological terms, our model means that the mean gene expression levels between two populations are never absolutely equal for any gene. However, the difference for most genes are small. We use posterior expectation as the estimate of the magnitude of difference. McCarthy and Smyth [23] showed that testing the differential expression relative to a biologically meaningful threshold identifies more biologically meaningful genes. We take a similar approach and estimate the posterior probability that the differential expression is greater than a threshold. Therefore, we avoid genes with very subtle differential expression even if that difference is statistically significant between the two samples in comparison. The genes identified by ASC include those that are modestly expressed as well as highly expressed.
Methods
DGE data generation
The diatom Thalassiosira pseudonana (Strain 1335 from the Center for the Culture of Marine Phytoplankton) was grown axenically in 24 hr light at 14°C in f/2 media [24, 25] made with Sargasso Sea water, Treatments consisting of phosphorus-limited medium (0.4 μ M PO_{4}) and phosphorus-replete medium (36 μ M PO_{4}) were grown in triplicate and are herein referred to as treatments A and B, respectively. Equal volumes of cell biomass from each replicate were pooled for the A or B treatments 96 hours after inoculation and harvested by gentle filtration. Filters were immediately frozen in liquid nitrogen and stored at -80°C.
Total RNA was extracted using the RNeasy Midi Kit (Qiagen), following the manufacturer's instructions with the following changes: RNA samples were processed with Qiashredder columns (Qiagen) to remove large cellular material and DNA was removed with an on-column DNAase digestion using RNase-free DNAase (Qiagen). A second DNA removal step was conducted using the Turbo DNA-free kit (Ambion, Austin, TX, USA)[B1]. The RNA was quantified in triplicate using the Mx3005 Quantitative PCR System (Stratagene) and the Quant-iT RiboGreen RNA Assay Kit (Invitrogen) and was analyzed for integrity by gel electrophoresis. Total RNA was sent to Illumina (Hayward, CA) and they constructed digital gene expression (DGE) libraries with NlaIII tags following their protocol. Sequencing libraries for NlaIII digested tags were constructed by Illumina and sequenced on their Genome Analyzer. 12,525,833 tags were sequenced from the A library and 13,431,745 tags were sequenced from the B library.
Hierarchical model for gene counts
Here δ has the interpretation of log fold change in gene expression, and λ is a nuisance parameter representing the average (log) expression.
where Exp represents shifted exponential distribution with rate α and shift λ_{0}.
We obtain the posterior mean $\tilde{\delta}=E[\delta |\mathbf{x}]$ given the gene counts as an estimate of differential expression. We refer to $\tilde{\delta}$ as the shrinkage estimate of log fold change, which is sufficient to rank the genes. To evaluate the statistical significance, we compute the posterior probability P(|δ| > Δ_{0}|x), where Δ_{0} is a user-defined effect size of biological significance. There is no closed form expression for the posterior distribution and we use numerical integration for the evaluation of the posterior mean and probability.
Estimation of hyper parameters
gives estimates$\{\begin{array}{l}\widehat{a}=-[\mathrm{log}(1-{q}_{1})-\mathrm{log}(1-{q}_{2})]/({\lambda}_{1}-{\lambda}_{2})\\ {\widehat{\lambda}}_{0}={\lambda}_{1}+\mathrm{log}(1-{q}_{1})/\widehat{\alpha}\end{array}$
Thus we can estimate α as $1/\{{\overline{X}}_{X>{\widehat{F}}^{-1}(q)}-{\widehat{F}}^{-1}(q)\}$. Our default setting is q_{1} = 0.8 and q_{2} = 0.9. The posterior mean E[δ|x] is not sensitive to the choice of q_{1}, q_{2} (Additional file 3, Figure S3). Again, due to the lack of memory property, the probability density of shifted exponential distributions with the same rate are proportional, thus the value of λ_{0} does not affect the posterior distribution and does not need to be estimated.
To estimate τ, the parameter representing the biological variation among replicates, we borrow information across genes. Although for any given gene we only observe one total count under each condition, and thus the true differential expression and the biological variation cannot be identified, we assume that the majority of genes are not affected by the treatment, an assumption found to be reasonable in microarray data in many experiments. We can model τ as a function of λ, but Figure 2 suggests that the biological variation is rather constant across expression levels. Thus we estimate one global parameter τ. We start with the Gaussian approximation of the Binomial model ${p}_{1}|{\pi}_{1}={x}_{1}/{N}_{1}\dot{~}N({\pi}_{1},{\pi}_{1}(1-{\pi}_{1})/{N}_{1})$. Since the total counts N are usually a very large integer, we also have, approximately, $\mathrm{log}({p}_{1})|{\pi}_{1}\dot{~}N\{\mathrm{log}({\pi}_{1})-(1-{\pi}_{1})/(2{N}_{1}{\pi}_{1}),(1-{\pi}_{1})/({N}_{1}{\pi}_{1})\}$. The variance of log(p_{1}) decreases with rate of 1/N as π increases and becomes negligible compared to biological variation. Thus we simply estimate τ from the differences of log rpm with the highest average log rpm. We use inter quartile range instead of sample standard deviation to avoid influence of genes with extreme differential expression. $\widehat{\tau}=IQR[\mathrm{log}({p}_{1})-\mathrm{log}({p}_{2})]/IQR[N(0,1)]$. In practice we use total counts above 1000 and this allows us to have several thousand genes (over 4000 in our example) for the estimation.
Declarations
Acknowledgements
We thank the reviewers for their insightful comments and suggestions that greatly strengthened the manuscript. We thank A. Drzewianowski for her assistance with laboratory experiments. Funding was provided by NSF OCE-0723677.
Authors’ Affiliations
References
- Oshlack A, Wakefield M: Transcript length bias in RNA-seq data confounds systems biology. Biology Direct 2009, 4: 14. 10.1186/1745-6150-4-14View ArticlePubMedPubMed CentralGoogle Scholar
- Dohm J, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic acids research 2008, 36(16):e105. 10.1093/nar/gkn425View ArticlePubMedPubMed CentralGoogle Scholar
- Hoen P, Ariyurek Y, Thygesen H, Vreugdenhil E, Vossen R, de Menezes R, Boer J, van Ommen G, den Dunnen J: Deep sequencing-based expression analysis shows major advances in robustness, resolution and inter-lab portability over five microarray platforms. Nucleic acids research 2008, 36(21):e141. 10.1093/nar/gkn705View ArticlePubMedPubMed CentralGoogle Scholar
- Li B, Ruotti V, Stewart R, Thomson J, Dewey C: RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 2010, 26(4):493. 10.1093/bioinformatics/btp692View ArticlePubMedPubMed CentralGoogle Scholar
- Lee M, Kuo F, Whitmore G, Sklar J: Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proceedings of the National Academy of Sciences of the United States of America 2000, 97(18):9834. 10.1073/pnas.97.18.9834View ArticlePubMedPubMed CentralGoogle Scholar
- Cheng L, Lu W, Kulkarni B, Pejovic T, Yan X, Chiang J, Hood L, Odunsi K, Lin B: Analysis of chemotherapy response programs in ovarian cancers by the next-generation sequencing technologies. Gynecologic Oncology 2010, 117: 159–169. 10.1016/j.ygyno.2010.01.041View ArticlePubMedPubMed CentralGoogle Scholar
- Marti E, Pantano L, Banez-Coronel M, Llorens F, Minones-Moyano E, Porta S, Sumoy L, Ferrer I, Estivill X: A myriad of miRNA variants in control and Huntington's disease brain regions detected by massively parallel sequencing. Nucleic acids research 2010.Google Scholar
- Cui L, Guo X, Qi Y, Qi X, Ge Y, Shi Z, Wu T, Shan J, Shan Y, Zhu Z, Wang H: Identification of microRNAs Involved in the Host Response to Enterovirus 71 Infection by a Deep Sequencing Approach. Journal of Biomedicine and Biotechnology 2010, 2010: 425–939. 10.1155/2010/425939Google Scholar
- Kal A, Van Zonneveld A, Benes V, Van Den Berg M, Koerkamp M, Albermann K, Strack N, Ruijter J, Richter A, Dujon B, et al.: Dynamics of gene expression revealed by comparison of serial analysis of gene expression transcript profiles from yeast grown on two different carbon sources. Molecular biology of the cell 1999, 10(6):1859.View ArticlePubMedPubMed CentralGoogle Scholar
- Schaaf G, van Ruissen F, van Kampen A, Kool M, Ruijter J: Statistical comparison of two or more SAGE libraries. Methods in Molecular Biology 2008, 387: 151–168. full_textView ArticlePubMedGoogle Scholar
- Wang L, Feng Z, Wang X, Wang X, Zhang X: DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics 2010, 26: 136. 10.1093/bioinformatics/btp612View ArticlePubMedGoogle Scholar
- Nygaard S, Jacobsen A, Lindow M, Eriksen J, Balslev E, Flyger H, Tolstrup N, Møller S, Krogh A, Litman T: Identification and analysis of miRNAs in human breast cancer and teratoma samples using deep sequencing. BMC Medical Genomics 2009, 2: 35.View ArticlePubMedPubMed CentralGoogle Scholar
- Hashimoto S, Qu W, Ahsan B, Ogoshi K, Sasaki A, Nakatani Y, Lee Y, Ogawa M, Ametani A, Suzuki Y, et al.: High-Resolution Analysis of the 5'-End Transcriptome Using a Next Generation DNA Sequencer. PLoS One 2009, 4: e4108. 10.1371/journal.pone.0004108View ArticlePubMedPubMed CentralGoogle Scholar
- Bloom J, Khan Z, Kruglyak L, Singh M, Caudy A: Measuring differential gene expression by short read sequencing: quantitative comparison to 2-channel gene expression microarrays. BMC genomics 2009, 10: 221. 10.1186/1471-2164-10-221View ArticlePubMedPubMed CentralGoogle Scholar
- Baggerly K, Deng L, Morris J, Aldaz C: Differential expression in SAGE: accounting for normal between-library variation. Bioinformatics 2003, 19(12):1477. 10.1093/bioinformatics/btg173View ArticlePubMedGoogle Scholar
- Vêncio R, Brentani H, Patrão D, Pereira C: Bayesian model accounting for within-class biological variability in Serial Analysis of Gene Expression(SAGE). BMC bioinformatics 2004, 5: 119. 10.1186/1471-2105-5-119View ArticlePubMedPubMed CentralGoogle Scholar
- Robinson M, Smyth G: Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 2007, 23(21):2881. 10.1093/bioinformatics/btm453View ArticlePubMedGoogle Scholar
- Robinson M, McCarthy D, Smyth G: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010, 26: 139. 10.1093/bioinformatics/btp616View ArticlePubMedPubMed CentralGoogle Scholar
- Stolovitzky G, Kundaje A, Held G, Duggar K, Haudenschild C, Zhou D, Vasicek T, Smith K, Aderem A, Roach J: Statistical analysis of MPSS measurements: Application to the study of LPS-activated macrophage gene expression. Proceedings of the National Academy of Sciences of the United States of America 2005, 102(5):1402. 10.1073/pnas.0406555102View ArticlePubMedPubMed CentralGoogle Scholar
- Robinson M, Smyth G: Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 2008, 9(2):321. 10.1093/biostatistics/kxm030View ArticlePubMedGoogle Scholar
- Lonnstedt I, Speed T: Replicated microarray data. Statistical Sinica 2002, 12: 31–46.Google Scholar
- Smyth GK: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 2004, 3: Article 3. 10.2202/1544-6115.1027View ArticleGoogle Scholar
- McCarthy D, Smyth G: Testing significance relative to a fold-change threshold is a TREAT. Bioinformatics 2009, 25(6):765. 10.1093/bioinformatics/btp053View ArticlePubMedPubMed CentralGoogle Scholar
- Guillard R, Ryther J: Studies of marine planktonic diatoms. I. Cyclotella nana Hustedt, and Detonula confervacea (Cleve) Gran. Canadian Journal of Microbiology 1962, 8: 229. 10.1139/m62-029View ArticlePubMedGoogle Scholar
- Guillard R: Culture of phytoplankton for feeding marine invertebrates. Culture of marine invertebrate animals 1975, 29–60.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.