Sample size calculation while controlling false discovery rate for differential expression analysis with RNA-sequencing experiments
- Ran Bi^{1} and
- Peng Liu^{1}Email author
https://doi.org/10.1186/s12859-016-0994-9
© Bi and Liu. 2016
Received: 5 November 2015
Accepted: 20 March 2016
Published: 31 March 2016
Abstract
Background
RNA-Sequencing (RNA-seq) experiments have been popularly applied to transcriptome studies in recent years. Such experiments are still relatively costly. As a result, RNA-seq experiments often employ a small number of replicates. Power analysis and sample size calculation are challenging in the context of differential expression analysis with RNA-seq data. One challenge is that there are no closed-form formulae to calculate power for the popularly applied tests for differential expression analysis. In addition, false discovery rate (FDR), instead of family-wise type I error rate, is controlled for the multiple testing error in RNA-seq data analysis. So far, there are very few proposals on sample size calculation for RNA-seq experiments.
Results
In this paper, we propose a procedure for sample size calculation while controlling FDR for RNA-seq experimental design. Our procedure is based on the weighted linear model analysis facilitated by the voom method which has been shown to have competitive performance in terms of power and FDR control for RNA-seq differential expression analysis. We derive a method that approximates the average power across the differentially expressed genes, and then calculate the sample size to achieve a desired average power while controlling FDR. Simulation results demonstrate that the actual power of several popularly applied tests for differential expression is achieved and is close to the desired power for RNA-seq data with sample size calculated based on our method.
Conclusions
Our proposed method provides an efficient algorithm to calculate sample size while controlling FDR for RNA-seq experimental design. We also provide an R package ssizeRNA that implements our proposed method and can be downloaded from the Comprehensive R Archive Network (http://cran.r-project.org).
Keywords
Background
During the past decade, next generation sequencing (NGS) technology has revolutionized genomic studies, and tremendous development has been made in terms of throughput, scalability, speed and sequencing cost. RNA-Sequencing (RNA-seq), also called Whole Transcriptome Shotgun Sequencing (WTSS), is a technology that uses the capabilities of NGS to study the entire transcriptome. Compared with microarray technologies that used to be the major tool for transcriptome studies, RNA-seq technologies have several advantages including a larger dynamic range of expression levels, less noise, higher throughput, and more power to detect gene fusions, single nucleotide variants and novel transcripts. Hence, RNA-seq technologies have been popularly applied in transcriptomic studies.
In a typical RNA-seq experiment, messenger RNA (mRNA) molecules are extracted from samples, fragmented, and reverse transcribed to double-stranded complementary DNA (cDNA). The cDNA fragments are then sequenced on a high-throughput platform, such as HiSeq by Illumina or SOLiD by Applied Biosystems. After sequencing, millions of DNA fragment sequences, called reads, are recorded and aligned to a reference genome. The number of reads mapped to each gene measures the expression level for that gene. Thus, RNA-seq provides discrete count data serving as measurements of mRNA expression levels, which is different from the fluorescence intensity measurements from microarray technologies that have been considered as continuous variables after transformation. As a result of high frequency of low integers, the statistical methods developed for analyzing microarray data are not directly applicable for RNA-seq data.
In the statistical analysis of RNA-seq data, identifying differentially expressed (DE) genes across treatments or conditions is a major step or main focus. A gene is considered to be DE across treatments or conditions if the mean read counts differ across treatment groups. Otherwise, we say the gene is equivalently expressed (EE). Many statistical methods have been proposed for the detection of DE genes with RNA-seq data. Some popular methods, including edgeR [1–4], DESeq [5] and DESeq2 [6], are based on the negative binomial (NB) distribution. QuasiSeq [7] presented quasi-likelihood methods with shrunken dispersion estimates. A more recently proposed method by the Smyth group [8] works with log-transformed count data and captures the mean-variance relationship of the log-count data through a precision weight for each observation (using a function called voom in their R package) and then applies the limma method [9] for differential expression analysis.
Due to the genetic complexity and high-dimensionality of the resulting datasets, RNA-seq experiments require complicated bioinformatic and statistical analysis in addition to the cost of experimental materials and sequencing. Many experiments only employ a small number of replicates, in which cases the power of statistical inference is limited. However, if the sample size is too large (which is rare), it is also a waste of experimental materials and manpower. For these reasons, one of the principal questions in designing an RNA-seq experiment is: how many biological replicates should be used to achieve a desired power? In other words, how large of the sample size do we need?
To answer this question, we need to determine a sample size that is required to achieve a desired power while controlling an appropriate error rate. When calculating sample size for a single test, type I error rate is commonly used. Fang and Cui [10] discussed a sample size formula for a single gene based on likelihood ratio test or Wald test. Hart et al. [11] and their associated R package RNASeqPower [12] proposed a sample size calculation method for any single gene based on score test while controlling type I error rate. However, for RNA-seq data analysis, tens of thousands of genes are simultaneously tested for differential expression, which requires the correction of multiple testing error, and false discovery rate (FDR) [13] has been the choice of error criterion in RNA-seq data analysis.
Several sample size calculation methods while controlling FDR have been proposed in microarray experiments. For example, Liu and Hwang [14] developed a method to calculate sample size given a desired power and a controlled level of FDR by finding the rejection region for the test procedure and hence power for each sample size. Hereafter, we call this sample size calculation method the LH method. Orr and Liu [15] assembled the ssize.fdr R package which implements the LH method.
However, sample size calculation for RNA-seq data analysis while controlling FDR is underdeveloped. Some earlier studies performed sample size and power estimation for RNA-seq experiments under Poisson distribution [16, 17], but the additional biological variation across RNA-seq samples yields overdispersion, which means the equal mean-variance relationship for the Poisson distribution does not adapt to the variability present in RNA-seq data. To account for overdispersion, the negative binomial distribution is more flexible to use. Li et al. [18] proposed a sample size determination method while controlling FDR based on the exact test implemented in edgeR that tests for genes differentially expressed between two treatments or conditions. This method calculates a sample size based on the minimum fold change of DE genes, the minimum average read counts of DE genes in the control group, and the maximum dispersion of DE genes under negative binomial models. As expected, such a method would be very conservative and not practically informative. The RnaSeqSampleSize R package [19] provides an estimation of sample size based on single read count and dispersion which implements Li et al.’s method. Also, instead of using the minimum average read counts and the maximum dispersion, RnaSeqSampleSize gives an estimation of sample size based on the read count and dispersion distributions estimated from real data, together with the minimum fold change, which is much better than Li et al.’s method, but would still be conservative due to the usage of the minimum fold change. The LH method is applicable as long as we can compute the power and type I error rate given a rejection region. However, there are no closed-form formulae for power for the popularly applied NB based methods. Then we have to rely on a lot of simulation to figure out quantities such as power and type I error rate for each sample size and each simulation setting [10]. Ching et al. [20] provided a power analysis tool that calculates the power for a given budget constraint for each size of samples, and then determined the sample size for a desired power. Wu et al. [21] introduced the concepts of stratified targeted power and false discovery cost, and estimated sample size by the evaluation of statistical power over a range of sample sizes based on simulation studies. Both Ching et al. and Wu et al.’s methods are simulation-based, thus we need to do a lot of simulations for power assessment for each sample size, which is time-consuming.
In this paper, we propose a much less computationally intensive method, which only demands one-time simulation, for sample size calculation in designing RNA-seq experiments. First, we use the voom method to model the mean-variance relationship of the log-count data of RNA-seq and produce a precision weight for each observation. Second, based on the normalized log-counts and associated precision weights, we estimate the distribution of weighted residual standard deviation of expression levels. Then for two-sample experiments, we derive a formula of the t test statistic in the weighted least squares setting and estimate the distribution of effect sizes for differential expression. Next, we apply the LH method to calculate the required sample size for a given desired power and a controlled FDR level. Our simulation demonstrates that the desired power is reached for data with the sample size calculated from our method for several popular tests for differential expression.
The article is organized as follows. The ‘Methods’ section describes our proposed method illustrated with the two-sample t-test. In the ‘Results and discussion’ section, we present four simulation studies based on either negative binomial distributions or real RNA-seq dataset, and our method provide reliable sample sizes for all simulation studies. The ‘Conclusions’ section discusses our results and some future work.
Methods
In this section, we first review the voom method [8] and the LH method of sample size calculation. Then, we introduce our approach for calculating sample size while controlling FDR in designing RNA-seq experiments.
The voom method
Suppose that an RNA-seq experiment includes a total of N samples. Each sample has been sequenced, and the resulting reads are aligned with a reference genome. The number of reads mapped to each reference gene is recorded. The RNA-seq data then consist of a matrix of read counts r _{ gij }, where g=1,2,…,G denotes gene g, i=1,2 denotes group where i=1 is for the control group and i=2 is for the treatment group, and j=1,2,…,n _{ i } denotes replicates in each group with N=n _{1}+n _{2}. The idea of the voom method proposed by Law et al. [8] is to use precision weights to account for the mean-variance relationship and apply weighted least square analysis to RNA-seq data.
To obtain a smooth mean-variance trend, Law et al. fit a LOWESS curve to the square root of residual standard deviations \(\eta _{g}^{1/2}\) as a function of average log-counts \(\tilde {r}_{g}\), where \( \tilde {r}_{g} = \bar {y}_{g} +log_{2}(\tilde {R} + 1) - log_{2}(10^{6}) \) with \(\bar {y}_{g}\) being the average log-cpm value for each gene g and \(\tilde {R}\) being the geometric mean of library sizes. Then for each observation y _{ gij }, the predicted square root residual standard deviation \(\hat {\eta }_{gij}^{1/2}\) is obtained to be the LOWESS fitted value corresponding to \(\hat {l}_{gij}\).
Finally, the voom precision weights are defined as the inverse variances \( w_{gij} = \frac {1}{\hat {\eta }_{gij}^{2}}\). Law et al. recommended analyzing the log-cpm data with weighted least squares, and the weights (w _{ gij }) are used to account for the mean-variance relationship in the log-cpm values. Assuming normal distribution for residual errors (ε _{ g }), methods such as t-tests or moderated t-tests can then be applied for differential expression analysis.
The LH method of sample size calculation
Outcomes when testing G hypotheses
Accept null | Reject null | Total | |
---|---|---|---|
True nulls | U | V | π _{0} G |
False nulls | T | S | (1−π _{0})G |
Total | W | R | G |
Both FDR and pFDR are widely used error rates to control in multiple testing encounted in genomic studies. In RNA-seq experiments, most often we end up detecting DE genes, i.e. R>0. Hence, in this paper, we do not differentiate between FDR and pFDR.
where α is the controlled level of FDR, T denotes the test statistic and Γ denotes the rejection region of the test. Then for each comparison, the LH method calculates the sample size as follows. First, for a fixed proportion of non-differentially expressed genes, π _{0}, and the level of FDR to control, α, they find a rejection region Γ that satisfies (1) for each sample size. Then for the selected rejection region Γ for each sample size, the power is calculated by P r(T∈Γ|H=1). According to the desired power, a sample size is determined.
The rejection region depends on the test applied for differential expression, and the method based on (1) can be applied to any multiple testing procedure where the same rejection region is used. This LH method can be implemented using an R package, ssize.fdr, developed by Orr and Liu [15], and applied for designing one-sample, two-sample, or multi-sample microarray experiments. The method would be applicable to RNA-seq experiments if we can calculate power and type I error rate given a rejection region.
Proposed method for RNA-seq experiments with two-sample comparison
For the popularly applied tests in RNA-seq differential expression analysis such as edgeR and DESeq, there are no closed-form expressions to calculate the two quantities P r(T∈Γ|H=0) and P r(T∈Γ|H=1). Hence, the LH method cannot be directly applicable to these methods. However, the recently proposed voom and limma analysis for RNA-seq data [8, 23] is based on weighted linear models and we can obtain tractable formulae for power and type I error rate. In this paper, our idea is to derive formulae to calculate power and type I error rate based on voom and weighted linear model analysis, and then apply the LH method for sample size calculation. We will use two-sample t-tests to illustrate our idea. Similar methods can be derived for other designs such as paired-sample or multiple treatments comparison.
where the estimated l o g _{2}-fold change between treatment and control group \(\hat {\beta }_{g2}\) and its standard error \(S.E.(\hat {\beta }_{g2})\) could be obtained through weighted least squares estimation.
can be viewed as the scaled effect size which is defined by weighted mean difference of log-cpm values. Here, \(\bar {w}_{g1\cdot } = \frac {1}{n_{1}} \sum _{j=1}^{n_{1}} w_{g1j}\), \(\bar {w}_{g2\cdot } = \frac {1}{n_{2}} \sum _{j=1}^{n_{2}} w_{g2j}\) and \(\bar {w}_{g\cdot \cdot } = \frac {1}{n_{1}+n_{2}} \sum _{i=1}^{2} \sum _{j=1}^{n_{i}} w_{gij}\). Details of the derivation for (3) is provided in the ??.
- 1.
For a given RNA-seq experiment, specify the following parameters:
G: total number of genes for testing;
π _{0}: proportion of non-DE genes;
α: FDR level to control;
pow: desired average power to achieve;
λ _{ g }: average read count for gene g=1,…,G in control group (without loss of generality, we assume that the normalization factors d _{ ij } are equal to 1 for all samples);
ϕ _{ g }: dispersion parameter for gene g;
δ _{ g }: fold change for gene g.
Note that λ _{ g } and ϕ _{ g } could be estimated from real data using methods such as edgeR.
- 2.
Simulate RNA-seq read count data from a NB distribution with given parameters in step 1.
- 3.
Use the voom and limma method to obtain the log-cpm value and the associated precision weight for each count, and then estimate effect size Δ _{ g } according to (4) for each gene g and parameters a, b for the prior of σ _{ g }.
- 4.Estimate μ _{ Δ } and σ _{ Δ } by fitting$$ \Delta_{g} \sim N\left(\mu_{\Delta}, \sigma_{\Delta}^{2}\right). $$
- 5.
Use the LH method to determine the sample size n to achieve desired power and controlled FDR level.
Results and discussion
In this section, we present four simulation studies to evaluate our proposed method for sample size calculation for RNA-seq experiments. In the first three simulation studies, we set the total number of genes to be G=10,000 and the desired average power to be 80 %. The last simulation is real data-based.
Simulation 1. Same set of parameters
We start from the simplest simulation setting where all genes share the same set of parameters for the NB distribution. Although such cases are unrealistic, they allow the method of Li et al. [18] to perform best because this method uses a single set of NB parameters (mean, dispersion, fold change) when calculating sample size. Hence, we use this simulation setting to study the performance of our method and compare it to the method of Li et al. We refer to the parameter settings from Table 1 in [18], and compare the resulting sample size and power calculated by both Li et al.’s method and our proposed method.
In the main manuscript, we present results for one of those parameter settings as an example: the proportion of non-DE genes π _{0}=0.99, the mean read counts for control group λ=5 with normalization factors d _{ ij }=1, dispersion parameter ϕ=0.1, FDR controlling at level 0.05, and fold change δ=2 for differentially expressed genes. Suppose r _{ gij } denotes the read count for gene g, group i and replicate j=1,2,…,n _{ i } in each group with n _{1}=n _{2}=n. Then, for EE genes, both r _{ g1j } and r _{ g2j } were drawn from N B(5,0.1); for DE genes, r _{ g1j } were drawn from N B(5,0.1) and r _{ g2j } were drawn from N B(10,0.1) or N B(2.5,0.1).
Sample size and anticipated average power calculated by our method and observed average power by the voom and limma pipeline while controlling FDR using q-value procedure based on parameters at different m for three simulation settings
Simulation 1 | Simulation 2 | Simulation 3 | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Our Method | Our Method | RnaSeqSampleSize | Our Method | ||||||||
m | Sample | Anticipated | Observed | Sample | Anticipated | Observed | Sample | Estimated | Sample | Anticipated | Observed |
size | power | power | size | power | power | size | power | size | power | power | |
50 | 35 | 0.802 | 0.858 | 13 | 0.810 | 0.876 | 9 | 0.780 | 22 | 0.800 | 0.801 |
100 | 34 | 0.814 | 0.846 | 13 | 0.814 | 0.876 | 9 | 0.724 | 22 | 0.803 | 0.801 |
200 | 34 | 0.815 | 0.846 | 13 | 0.823 | 0.876 | 9 | 0.765 | 22 | 0.804 | 0.801 |
500 | 33 | 0.817 | 0.833 | 13 | 0.827 | 0.876 | 9 | 0.769 | 22 | 0.805 | 0.801 |
1000 | 32 | 0.800 | 0.804 | 13 | 0.826 | 0.876 | 9 | 0.764 | 22 | 0.805 | 0.801 |
Comparison of sample size calculation methods, including the proposed method in this paper, Zhao et al.’s approach (RnaSeqSampleSize) and Wu et al.’s approach (PROPER).
Simulation 1 | Simulation 2 | Simulation 3 | |||||||
---|---|---|---|---|---|---|---|---|---|
Method | Sample | Power | Computation | Sample | Power | Computation | Sample | Power | Computation |
size | time | size | time | size | time | ||||
Our method | 34 | 0.815 | 17.3 sec | 13 | 0.823 | 16.8 sec | 22 | 0.804 | 15.5 sec |
RnaSeqSampleSize | 32 | 0.810 | 0.3 sec | 9 | 0.765 | 54.6 sec | 74 | 0.742 | 82.8 sec |
PROPER | 25 | 0.806 | 6.5 h | 10 | 0.804 | 1.5 h | 19 | 0.805 | 3.5 h |
Results for other parameter settings under m=200 are presented in the Additional file 1, with Li et al.’s results in the first row, and our results in the second row.
Simulation 2. Gene-specific mean and dispersion with fixed fold change
In the second simulation setting, we used a real RNA-seq dataset to generate gene-specific mean and dispersion parameters. A maize dataset was obtained from a study by Tausta et al. [25], who compared gene expression between bundle sheath and mesophyll cells of corn plants.
Similar to simulation 1, we generated 10,000 genes from N B(λ _{ g },ϕ _{ g }), with fold change δ=2 for DE genes, λ _{ g } and ϕ _{ g } from the means and dispersions estimated for each gene in the maize dataset. For EE genes, both r _{ g1j } and r _{ g2j } were drawn from N B(λ _{ g },ϕ _{ g }); for DE genes, r _{ g1j } were drawn from N B(λ _{ g },ϕ _{ g }) and r _{ g2j } were drawn from N B(2λ _{ g },ϕ _{ g }) or N B(0.5λ _{ g },ϕ _{ g }). The proportion of non-DE genes was π _{0}=0.8.
The RnaSeqSampleSize R package [19] could give an estimation of sample size and power by prior real data. They first use user-specified number of genes to estimate the gene read count and dispersion distribution, then sample_size_distribution and est_power_distribution functions will be used to determine sample size and actual power. When we used the same real dataset as our simulation setting 2, the sample size calculated by their method was 7, with actual power 0.774, which did not reach the desired power 0.8. We also tried to apply their method using our simulated data (with different m), the resulting sample size is larger (n=9). The power estimated by their method at n=9 are shown in Table 2, and all their estimated power were actually smaller than 0.8. PROPER started from an estimation of mean and dispersion parameters, which is similar to our method. The sample size calculated by their method is 10, with power 0.804 based on DE detection method edgeR. The comparison results of our proposed method and these three approaches are shown in the middle three columns of Table 3. Still, PROPER is much more time-consuming than the other two methods.
Simulation 3. Gene-specific mean and dispersion with different fold change
The last three columns in Table 2 give the sample size and power calculated by our method. As in simulation 2, varying the size of simulated data (m) did not result in different sample sizes. Anticipated and observed power curves are presented in Fig. 6(b), from which we notice that the three curves are almost indistinguishable after power reaches 60 %. This more realistic simulation demonstrates that our proposed method provides accurate power and sample size.
We also applied RnaSeqSampleSize to this simulation setting. Since their method is based on minimum fold change, such results will be conservative due to the variability of fold change, especially as in this case, the minimum fold change is close to 1. When we used the 10th percentile of fold change of DE genes as the “minimum” fold change, the sample size calculated by their method was 74, which is still much larger than what we actually need, but the power calculated by their method based on the “minimum” fold change was less than the desired power 0.8. PROPER gave a result of sample size 19 with power 0.805 based on DE detection method edgeR. The comparison results of our proposed method and these two approaches are shown in the last three columns of Table 3.
Based on results from simulations, our proposed method and RnaSeqSampleSize provided answers much faster than PROPER, and our proposed method and PROPER provided good sample size estimation. Overall, our proposed method worked the best while both accuracy and computation time are considered.
Simulation 4. Real data-based simulation
Conclusions
In recent years, RNA-seq technology has become a major platform to study gene expression. With large sample size, RNA-seq experiments would be rather costly; while insufficient sample size may result in unreliable statistical inference. Thus sample size calculation is a crucial issue when designing an RNA-seq experiment. Although we could use a lot of simulations for each sample size and determine the one that reach our desired power as suggested in [10, 20, 21], this requires generous calculation and lacks efficiency. Our method provides a quick calculation for sample size, which only demands one-time simulation. From the simulation studies in the section of Results and discussion, we demonstrate that our proposed method offers a reliable approach for sample size calculation for RNA-seq experiments.
For each gene g, when we use a two-sample t-test to do differential expression analysis, the effect size Δ _{ g } in formula (4) depends on the simulated sample size m. Larger m may lead to better estimation of the prior distributions and hence a more accurate sample size. Based on our simulation studies, the effect of m on the resulting sample size is not big, and m=200 should be enough for providing a relatively precise sample size.
The ordinary t-test instead of the moderated t-test [9] was used in ssizeRNA R package. Because the ordinary t-test is a bit less powerful than the moderated t-test, it tends to overestimate the sample size which might be the reason why our calculated sample size in simulation 2 is a little bit larger than what we actually need according to the observed power curves using voom and limma. However, the overestimation is not dramatic and far less than the method of Li et al. [18].
In this article, we illustrate our idea using a method for two-sample comparison with the t-test, because detecting differentially expressed genes between two treatment groups is the most common case in RNA-seq analysis. Our idea could be applied to multi-sample comparison with an F-test or tests for linear contrasts of treatment means as well.
- ■■■
source(“http://bioconductor.org/biocLite.R”)
- ■■■
biocLite(“ssizeRNA”)
Appendices
Appendix A: Derivation of Eq. (3)
Appendix B: Choice of rejection region Γ satisfying formula (1)
The integration in (6) with respect to Δ _{ g } could be avoided through mathematical derivation, and the integration with respect to σ _{ g } is approximated using static quadrature rules, which allows a stable calculation to get the root of c. Details of derivation could be found in ?? of [14].
Once the choice of c has been made for each sample size, power would be calculated accordingly by integrating over the prior distributions on effect size and residual variance. Hence based on the desired power, sample size is finally determined.
Declarations
Acknowledgements
This research was partially supported by the National Science Foundation Plant Genome Research Program under Award Number IOS-1127017.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
- Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007; 23:2881–87.View ArticlePubMedGoogle Scholar
- Robinson MD, Smyth GK. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008; 9:321–32.View ArticlePubMedGoogle Scholar
- Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26:139–140.View ArticlePubMedPubMed CentralGoogle Scholar
- McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 2012; 40:4288–97.View ArticlePubMedPubMed CentralGoogle Scholar
- Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010; 11:R106.View ArticlePubMedPubMed CentralGoogle Scholar
- Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biol. 2014; 15(12):550.View ArticlePubMedPubMed CentralGoogle Scholar
- Lund SP, Nettleton D, McCarthy DJ, Smyth GK. Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Stat Appl Genet Mol Biol. 2012; 11:Article 8.Google Scholar
- Law CW, Chen Y, Shi W, Smyth GK. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014; 15:R29.View ArticlePubMedPubMed CentralGoogle Scholar
- Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004; 3:Article 3.Google Scholar
- Fang Z, Cui X. Design and validation issues in RNA-seq experiments. Brief Bioinform. 2011; 12:280–87.View ArticlePubMedGoogle Scholar
- Hart SN, Therneau TM, Zhang Y, Poland GA, Kocher J-P. Calculating sample size estimates for RNA sequencing data. J Comput Biol. 2013; 20:970–78.View ArticlePubMedPubMed CentralGoogle Scholar
- Therneau T, Hart S, Kocher J-P. Calculating samplesSize estimates for RNA Seq studies. R package version 1.10.0. https://bioconductor.org/packages/release/bioc/html/RNASeqPower.html.
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B. 1995; 57:289–300.Google Scholar
- Liu P, Hwang JTG. Quick calculation for sample size while controlling false discovery rate with application to microarray analysis. Bioinformatics. 2007; 23(6):739–46.View ArticlePubMedGoogle Scholar
- Orr M, Liu P. Sample size estimation while controlling false discovery rate for microarray experiments using ssize.fdr package. The R J. 2009; 1(1, May 2009):47–53.Google Scholar
- Chen Z, Liu J, Ng HKT, Nadarajah S, Kaufman HL, Yang JY, Deng Y. Statistical methods on detecting differentially expressed genes for RNA-seq data. BMC Syst Biol. 2011; 5(Suppl 3):S1.View ArticleGoogle Scholar
- Li CI, Su PF, Guo Y, Shyr Y. Sample size calculation for differential expression analysis of RNA-seq data under poisson distribution. Int J Comput Biol Drug Des. 2013; 6:358–75.View ArticlePubMedGoogle Scholar
- Li CI, Su PF, Shyr Y. Sample size calculation based on exact test for assessing differential expression analysis in RNA-seq data. BMC Bioinforma. 2013; 14(1):357.View ArticleGoogle Scholar
- Zhao S, Li C, Guo Y, Sheng Q, Shyr Y. RnaSeqSampleSize: RnaSeqSampleSize. R package version 1.2.0. https://www.bioconductor.org/packages/release/bioc/html/RnaSeqSampleSize.html.
- Ching T, Huang S, Garmire LX. Power analysis and sample size estimation for RNA-Seq differential expression. RNA. 2014; 20(11):1684–96.View ArticlePubMedPubMed CentralGoogle Scholar
- Wu H, Wang C, Wu Z. PROPER: comprehensive power evaluation for differential expression using RNA-seq. Bioinformatics. 2015; 31:233–41.View ArticlePubMedPubMed CentralGoogle Scholar
- Storey JD. A direct approach to false discovery rates. J R Stat Soc B. 2002; 64:479–98.View ArticleGoogle Scholar
- Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015; 43(7):e47.View ArticlePubMedPubMed CentralGoogle Scholar
- Storey JD, Taylor JE, Siegmund D. Strong control, conservative point estimation and simultaneous rates: a unified approach. J R Stat Soc B. 2004; 66:187–205.View ArticleGoogle Scholar
- Tausta SL, Li P, Si Y, Gandotra N, Liu P, Sun Q, Brutnell TP, Nelson T. Developmental dynamics of Kranz cell transcriptional specificity in maize leaf reveals early onset of C4-related processes. J Exp Bot. 2014; 65:3543–55.View ArticlePubMedPubMed CentralGoogle Scholar
- Pickrell J, Marioni J, Pai A, Degner J, Engelhardt B, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010; 464:768–72.View ArticlePubMedPubMed CentralGoogle Scholar