 Methodology Article
 Open Access
 Published:
Sample size calculation while controlling false discovery rate for differential expression analysis with RNAsequencing experiments
BMC Bioinformatics volume 17, Article number: 146 (2016)
Abstract
Background
RNASequencing (RNAseq) experiments have been popularly applied to transcriptome studies in recent years. Such experiments are still relatively costly. As a result, RNAseq experiments often employ a small number of replicates. Power analysis and sample size calculation are challenging in the context of differential expression analysis with RNAseq data. One challenge is that there are no closedform formulae to calculate power for the popularly applied tests for differential expression analysis. In addition, false discovery rate (FDR), instead of familywise type I error rate, is controlled for the multiple testing error in RNAseq data analysis. So far, there are very few proposals on sample size calculation for RNAseq experiments.
Results
In this paper, we propose a procedure for sample size calculation while controlling FDR for RNAseq experimental design. Our procedure is based on the weighted linear model analysis facilitated by the voom method which has been shown to have competitive performance in terms of power and FDR control for RNAseq differential expression analysis. We derive a method that approximates the average power across the differentially expressed genes, and then calculate the sample size to achieve a desired average power while controlling FDR. Simulation results demonstrate that the actual power of several popularly applied tests for differential expression is achieved and is close to the desired power for RNAseq data with sample size calculated based on our method.
Conclusions
Our proposed method provides an efficient algorithm to calculate sample size while controlling FDR for RNAseq experimental design. We also provide an R package ssizeRNA that implements our proposed method and can be downloaded from the Comprehensive R Archive Network (http://cran.rproject.org).
Background
During the past decade, next generation sequencing (NGS) technology has revolutionized genomic studies, and tremendous development has been made in terms of throughput, scalability, speed and sequencing cost. RNASequencing (RNAseq), also called Whole Transcriptome Shotgun Sequencing (WTSS), is a technology that uses the capabilities of NGS to study the entire transcriptome. Compared with microarray technologies that used to be the major tool for transcriptome studies, RNAseq technologies have several advantages including a larger dynamic range of expression levels, less noise, higher throughput, and more power to detect gene fusions, single nucleotide variants and novel transcripts. Hence, RNAseq technologies have been popularly applied in transcriptomic studies.
In a typical RNAseq experiment, messenger RNA (mRNA) molecules are extracted from samples, fragmented, and reverse transcribed to doublestranded complementary DNA (cDNA). The cDNA fragments are then sequenced on a highthroughput platform, such as HiSeq by Illumina or SOLiD by Applied Biosystems. After sequencing, millions of DNA fragment sequences, called reads, are recorded and aligned to a reference genome. The number of reads mapped to each gene measures the expression level for that gene. Thus, RNAseq provides discrete count data serving as measurements of mRNA expression levels, which is different from the fluorescence intensity measurements from microarray technologies that have been considered as continuous variables after transformation. As a result of high frequency of low integers, the statistical methods developed for analyzing microarray data are not directly applicable for RNAseq data.
In the statistical analysis of RNAseq data, identifying differentially expressed (DE) genes across treatments or conditions is a major step or main focus. A gene is considered to be DE across treatments or conditions if the mean read counts differ across treatment groups. Otherwise, we say the gene is equivalently expressed (EE). Many statistical methods have been proposed for the detection of DE genes with RNAseq data. Some popular methods, including edgeR [1–4], DESeq [5] and DESeq2 [6], are based on the negative binomial (NB) distribution. QuasiSeq [7] presented quasilikelihood methods with shrunken dispersion estimates. A more recently proposed method by the Smyth group [8] works with logtransformed count data and captures the meanvariance relationship of the logcount data through a precision weight for each observation (using a function called voom in their R package) and then applies the limma method [9] for differential expression analysis.
Due to the genetic complexity and highdimensionality of the resulting datasets, RNAseq experiments require complicated bioinformatic and statistical analysis in addition to the cost of experimental materials and sequencing. Many experiments only employ a small number of replicates, in which cases the power of statistical inference is limited. However, if the sample size is too large (which is rare), it is also a waste of experimental materials and manpower. For these reasons, one of the principal questions in designing an RNAseq experiment is: how many biological replicates should be used to achieve a desired power? In other words, how large of the sample size do we need?
To answer this question, we need to determine a sample size that is required to achieve a desired power while controlling an appropriate error rate. When calculating sample size for a single test, type I error rate is commonly used. Fang and Cui [10] discussed a sample size formula for a single gene based on likelihood ratio test or Wald test. Hart et al. [11] and their associated R package RNASeqPower [12] proposed a sample size calculation method for any single gene based on score test while controlling type I error rate. However, for RNAseq data analysis, tens of thousands of genes are simultaneously tested for differential expression, which requires the correction of multiple testing error, and false discovery rate (FDR) [13] has been the choice of error criterion in RNAseq data analysis.
Several sample size calculation methods while controlling FDR have been proposed in microarray experiments. For example, Liu and Hwang [14] developed a method to calculate sample size given a desired power and a controlled level of FDR by finding the rejection region for the test procedure and hence power for each sample size. Hereafter, we call this sample size calculation method the LH method. Orr and Liu [15] assembled the ssize.fdr R package which implements the LH method.
However, sample size calculation for RNAseq data analysis while controlling FDR is underdeveloped. Some earlier studies performed sample size and power estimation for RNAseq experiments under Poisson distribution [16, 17], but the additional biological variation across RNAseq samples yields overdispersion, which means the equal meanvariance relationship for the Poisson distribution does not adapt to the variability present in RNAseq data. To account for overdispersion, the negative binomial distribution is more flexible to use. Li et al. [18] proposed a sample size determination method while controlling FDR based on the exact test implemented in edgeR that tests for genes differentially expressed between two treatments or conditions. This method calculates a sample size based on the minimum fold change of DE genes, the minimum average read counts of DE genes in the control group, and the maximum dispersion of DE genes under negative binomial models. As expected, such a method would be very conservative and not practically informative. The RnaSeqSampleSize R package [19] provides an estimation of sample size based on single read count and dispersion which implements Li et al.’s method. Also, instead of using the minimum average read counts and the maximum dispersion, RnaSeqSampleSize gives an estimation of sample size based on the read count and dispersion distributions estimated from real data, together with the minimum fold change, which is much better than Li et al.’s method, but would still be conservative due to the usage of the minimum fold change. The LH method is applicable as long as we can compute the power and type I error rate given a rejection region. However, there are no closedform formulae for power for the popularly applied NB based methods. Then we have to rely on a lot of simulation to figure out quantities such as power and type I error rate for each sample size and each simulation setting [10]. Ching et al. [20] provided a power analysis tool that calculates the power for a given budget constraint for each size of samples, and then determined the sample size for a desired power. Wu et al. [21] introduced the concepts of stratified targeted power and false discovery cost, and estimated sample size by the evaluation of statistical power over a range of sample sizes based on simulation studies. Both Ching et al. and Wu et al.’s methods are simulationbased, thus we need to do a lot of simulations for power assessment for each sample size, which is timeconsuming.
In this paper, we propose a much less computationally intensive method, which only demands onetime simulation, for sample size calculation in designing RNAseq experiments. First, we use the voom method to model the meanvariance relationship of the logcount data of RNAseq and produce a precision weight for each observation. Second, based on the normalized logcounts and associated precision weights, we estimate the distribution of weighted residual standard deviation of expression levels. Then for twosample experiments, we derive a formula of the t test statistic in the weighted least squares setting and estimate the distribution of effect sizes for differential expression. Next, we apply the LH method to calculate the required sample size for a given desired power and a controlled FDR level. Our simulation demonstrates that the desired power is reached for data with the sample size calculated from our method for several popular tests for differential expression.
The article is organized as follows. The ‘Methods’ section describes our proposed method illustrated with the twosample ttest. In the ‘Results and discussion’ section, we present four simulation studies based on either negative binomial distributions or real RNAseq dataset, and our method provide reliable sample sizes for all simulation studies. The ‘Conclusions’ section discusses our results and some future work.
Methods
In this section, we first review the voom method [8] and the LH method of sample size calculation. Then, we introduce our approach for calculating sample size while controlling FDR in designing RNAseq experiments.
The voom method
Suppose that an RNAseq experiment includes a total of N samples. Each sample has been sequenced, and the resulting reads are aligned with a reference genome. The number of reads mapped to each reference gene is recorded. The RNAseq data then consist of a matrix of read counts r _{ gij }, where g=1,2,…,G denotes gene g, i=1,2 denotes group where i=1 is for the control group and i=2 is for the treatment group, and j=1,2,…,n _{ i } denotes replicates in each group with N=n _{1}+n _{2}. The idea of the voom method proposed by Law et al. [8] is to use precision weights to account for the meanvariance relationship and apply weighted least square analysis to RNAseq data.
The method of voom starts from transforming the RNAseq count data to the logcounts per million (logcpm) value calculated by
where \(R_{ij} = \sum _{g=1}^{G} r_{gij} \) is the library size for the ith treatment and jth replicate. As has been done in [9], Law et al. then fit a linear model to the transformed data according to the experimental design. For each gene g, the following linear model
is fitted to \(\mathbf {y}_{g} = (y_{g11}, \dots, y_{g1n_{1}}, y_{g21}, \dots, y_{g2n_{2}})'\), the vector of logcpm values, where X is the design matrix with rows \(\mathbf {x}_{ij}^{T}\), β _{ g } is a vector of parameters that may be parameterized to include l o g _{2}fold changes between experimental conditions, and ε _{ g } is the error term with E(ε _{ g })=0.
Assuming that \( E(y_{gij}) = \mu _{gij} = \mathbf {x}_{ij}^{T} \boldsymbol {\beta }_{g} \), then by ordinary least squares, the above linear model is fitted for each gene g, which yields regression coefficient estimates \(\boldsymbol {\hat {\beta }}_{g}\), fitted values \(\hat {\mu }_{gij} = \mathbf {x}_{ij}^{T} \boldsymbol {\hat {\beta }}_{g} \), residual standard deviations η _{ g } and fitted l o g _{2}read counts
To obtain a smooth meanvariance trend, Law et al. fit a LOWESS curve to the square root of residual standard deviations \(\eta _{g}^{1/2}\) as a function of average logcounts \(\tilde {r}_{g}\), where \( \tilde {r}_{g} = \bar {y}_{g} +log_{2}(\tilde {R} + 1)  log_{2}(10^{6}) \) with \(\bar {y}_{g}\) being the average logcpm value for each gene g and \(\tilde {R}\) being the geometric mean of library sizes. Then for each observation y _{ gij }, the predicted square root residual standard deviation \(\hat {\eta }_{gij}^{1/2}\) is obtained to be the LOWESS fitted value corresponding to \(\hat {l}_{gij}\).
Finally, the voom precision weights are defined as the inverse variances \( w_{gij} = \frac {1}{\hat {\eta }_{gij}^{2}}\). Law et al. recommended analyzing the logcpm data with weighted least squares, and the weights (w _{ gij }) are used to account for the meanvariance relationship in the logcpm values. Assuming normal distribution for residual errors (ε _{ g }), methods such as ttests or moderated ttests can then be applied for differential expression analysis.
The LH method of sample size calculation
In genomic studies, we simultaneously test a large number of hypotheses, each relating to a gene. Hence, multiple testing is commonly used in the analysis. Assume there are G genes in total and each gene is tested for the significance of differential expression. Table 1 summarizes the various outcomes that occur when testing G hypotheses, where V is the number of false positives, R is the number of rejections among the G tests, and π _{0} is the proportion of nondifferentially expressed genes.
False discovery rate (FDR), defined by Benjamini and Hochberg [13], is the expected proportion of false positives among the rejected hypothesis:
while positive FDR (pFDR), proposed by Storey [22], is defined to be
Both FDR and pFDR are widely used error rates to control in multiple testing encounted in genomic studies. In RNAseq experiments, most often we end up detecting DE genes, i.e. R>0. Hence, in this paper, we do not differentiate between FDR and pFDR.
Liu and Hwang [14] proposed a method for a quick sample size calculation for microarray experiments while controlling FDR. Let H=0 represent no differential expression (null hypothesis is true) and H=1 represent differential expression (null hypothesis is false). Based on the definition of pFDR and assumptions in [22] (all tests are identical, independent and Bernoulli distributed with P r(H=0)=π _{0}, where π _{0} is the proportion of EE genes), they derived that
where α is the controlled level of FDR, T denotes the test statistic and Γ denotes the rejection region of the test. Then for each comparison, the LH method calculates the sample size as follows. First, for a fixed proportion of nondifferentially expressed genes, π _{0}, and the level of FDR to control, α, they find a rejection region Γ that satisfies (1) for each sample size. Then for the selected rejection region Γ for each sample size, the power is calculated by P r(T∈ΓH=1). According to the desired power, a sample size is determined.
The rejection region depends on the test applied for differential expression, and the method based on (1) can be applied to any multiple testing procedure where the same rejection region is used. This LH method can be implemented using an R package, ssize.fdr, developed by Orr and Liu [15], and applied for designing onesample, twosample, or multisample microarray experiments. The method would be applicable to RNAseq experiments if we can calculate power and type I error rate given a rejection region.
Proposed method for RNAseq experiments with twosample comparison
For the popularly applied tests in RNAseq differential expression analysis such as edgeR and DESeq, there are no closedform expressions to calculate the two quantities P r(T∈ΓH=0) and P r(T∈ΓH=1). Hence, the LH method cannot be directly applicable to these methods. However, the recently proposed voom and limma analysis for RNAseq data [8, 23] is based on weighted linear models and we can obtain tractable formulae for power and type I error rate. In this paper, our idea is to derive formulae to calculate power and type I error rate based on voom and weighted linear model analysis, and then apply the LH method for sample size calculation. We will use twosample ttests to illustrate our idea. Similar methods can be derived for other designs such as pairedsample or multiple treatments comparison.
Suppose our interest is to identify the differentially expressed (DE) genes between a treatment and a control group. Assuming that for gene g, group i and replicates j, we observe the RNAseq data read counts r _{ gij }, where the mean for gene g in group i is λ _{ gij }=d _{ ij } γ _{ gi }. Here, d _{ ij } stands for a normalization factor or effective library size that adjusts the sequencing depth for sample j in group i, γ _{ gi } stands for the normalized mean expression level of gene g in group i. Then for each gene g, to test for differential expression means to test the hypothesis:
As reviewed in the first part of the ‘Methods’ section, when applying the voom method, the RNAseq read counts r _{ gij } are transformed to logcpm values y _{ gij } with associated weights w _{ gij } and mean μ _{ gi } for each sample j in group i. With this parameterization, testing for DE means testing
where μ _{ g1} and μ _{ g2} are the expectation of logcpm values of gth gene for control and treatment group, respectively.
For each individual gene g, the weighted linear model
can be fitted to logcpm values
with design matrix
coefficients vector
unknown genespecific standard deviation σ _{ g }, and associated voom precision weights
Assuming \(\boldsymbol {\epsilon } \sim MVN(\boldsymbol {0},I_{n_{1}+n_{2}})\), where MVN stands for multivariate normal distribution, the ttest statistic for gene g is
where the estimated l o g _{2}fold change between treatment and control group \(\hat {\beta }_{g2}\) and its standard error \(S.E.(\hat {\beta }_{g2})\) could be obtained through weighted least squares estimation.
To make the ttest based method more straightforward to apply, we reparameterize the formula (2) to
where
can be viewed as the pooled sample standard deviation, which is an estimator of σ _{ g }, and
can be viewed as the scaled effect size which is defined by weighted mean difference of logcpm values. Here, \(\bar {w}_{g1\cdot } = \frac {1}{n_{1}} \sum _{j=1}^{n_{1}} w_{g1j}\), \(\bar {w}_{g2\cdot } = \frac {1}{n_{2}} \sum _{j=1}^{n_{2}} w_{g2j}\) and \(\bar {w}_{g\cdot \cdot } = \frac {1}{n_{1}+n_{2}} \sum _{i=1}^{2} \sum _{j=1}^{n_{i}} w_{gij}\). Details of the derivation for (3) is provided in the ??.
After generating the effect size Δ _{ g }, and the standard deviation σ _{ g } for each gene g, we could assume, as in [14], that the effect size follows a normal distribution
and the variance of logcpm values for each gene follows an inverse gamma distribution
with mean \(\frac {b}{a1}\). Then we apply the LH method to calculate the optimal sample size given desired power and controlled FDR level. See ?? for a brief review of the calculations in the LH method involving in choosing the rejection region Γ safisfying formula (1).
Our proposed method requires the estimation of hyperparameters μ _{ Δ }, σ _{ Δ }, a, and b. If a relatively large pilot dataset is available, these parameters can be estimated based on the pilot data. Otherwise, we can simulate data to obtain the values for these hyperparameters. It has been shown that the NB model fits real RNAseq data well [5]. In addition, many popularly applied tests for differential expression analysis of RNAseq data are based on NB models. Hence, we suggest to simulate data according to NB models, and then use such simulated data to obtain the estimates of μ _{ Δ }, σ _{ Δ }, a, and b, which are then used to calculate sample size. We outline our proposed procedure for sample size calculation as follows:

1.
For a given RNAseq experiment, specify the following parameters:
G: total number of genes for testing;
π _{0}: proportion of nonDE genes;
α: FDR level to control;
pow: desired average power to achieve;
λ _{ g }: average read count for gene g=1,…,G in control group (without loss of generality, we assume that the normalization factors d _{ ij } are equal to 1 for all samples);
ϕ _{ g }: dispersion parameter for gene g;
δ _{ g }: fold change for gene g.
Note that λ _{ g } and ϕ _{ g } could be estimated from real data using methods such as edgeR.

2.
Simulate RNAseq read count data from a NB distribution with given parameters in step 1.

3.
Use the voom and limma method to obtain the logcpm value and the associated precision weight for each count, and then estimate effect size Δ _{ g } according to (4) for each gene g and parameters a, b for the prior of σ _{ g }.

4.
Estimate μ _{ Δ } and σ _{ Δ } by fitting
$$ \Delta_{g} \sim N\left(\mu_{\Delta}, \sigma_{\Delta}^{2}\right). $$ 
5.
Use the LH method to determine the sample size n to achieve desired power and controlled FDR level.
Results and discussion
In this section, we present four simulation studies to evaluate our proposed method for sample size calculation for RNAseq experiments. In the first three simulation studies, we set the total number of genes to be G=10,000 and the desired average power to be 80 %. The last simulation is real databased.
Simulation 1. Same set of parameters
We start from the simplest simulation setting where all genes share the same set of parameters for the NB distribution. Although such cases are unrealistic, they allow the method of Li et al. [18] to perform best because this method uses a single set of NB parameters (mean, dispersion, fold change) when calculating sample size. Hence, we use this simulation setting to study the performance of our method and compare it to the method of Li et al. We refer to the parameter settings from Table 1 in [18], and compare the resulting sample size and power calculated by both Li et al.’s method and our proposed method.
In the main manuscript, we present results for one of those parameter settings as an example: the proportion of nonDE genes π _{0}=0.99, the mean read counts for control group λ=5 with normalization factors d _{ ij }=1, dispersion parameter ϕ=0.1, FDR controlling at level 0.05, and fold change δ=2 for differentially expressed genes. Suppose r _{ gij } denotes the read count for gene g, group i and replicate j=1,2,…,n _{ i } in each group with n _{1}=n _{2}=n. Then, for EE genes, both r _{ g1j } and r _{ g2j } were drawn from N B(5,0.1); for DE genes, r _{ g1j } were drawn from N B(5,0.1) and r _{ g2j } were drawn from N B(10,0.1) or N B(2.5,0.1).
After setting these simulation parameters in step 1, we follow steps 24 to simulate data and obtain the values of hyperparameters. To investigate the effect of this simulation step, we tried different sizes of simulated data, m=50,100,200,500,1000, where m is the sample size for each group in step 2 of our procedure. For each m, we generated read counts r _{ g1j } (control group) and r _{ g2j } (treatment group) from independent NB distributions for every gene g and sample j, g=1,…,G, j=1,…,m. After using voom and lmFit in the R package limma [9] to produce weights w _{ gij } for each observation, we then obtained effect size Δ _{ g } for each gene and parameters a, b for the prior distribution of \({\sigma _{g}^{2}}\). The fitted inverse gamma distributions of \({\sigma _{g}^{2}}\) for each m are shown as in Fig. 1, with vertical lines indicating the modes. It seems that the mode doesn’t change much, and the distribution of \({\sigma _{g}^{2}}\) shrinks towards the center as sample size gets larger.
After obtaining the fitted parameters, we calculated sample size according to our proposed method described in the third part of the ‘Methods’ section to achieve a desired power of 80 %. We then simulated data according to each calculated sample size and checked whether the desired power was achieved. In Table 2, the first three columns listed our simulation results corresponding to this simulation setting. As m increased from 50 to 100 to 1000, the calculated sample size dropped from 35 to 34 and 32, respectively. This decrease is expected because the parameters were estimated more precisely with larger m. For example, the distribution of \({\sigma _{g}^{2}}\) shrank as m increased as shown in Fig. 1. The effect on the resulting sample size is not big, at most with a difference of 3 (35 vs. 32).
We now choose a sample size n=32 and demonstrate this sample size indeed reaches the desired power 0.8. At n=32, we simulated 100 datasets and performed several popularly applied tests such as the edgeR exact test, the voom and limma method, DESeq, DESeq2 and QuasiSeq using the corresponding R packages. Desired power (0.8) was achieved for all testing methods when controlling FDR at 0.05 using qvalue procedure [24], and the observed FDR was controlled successfully under all the five methods. The results are shown in Fig. 2. For the voom and limma pipeline method, the observed power curves while FDR was controlled using the Benjamini and Hochberg’s method [13] and the qvalue procedure [24] and the power curve based on our calculation are shown in Fig. 3. The observed power was obtained by averaging actual power over 100 simulated datasets for each sample size. The observed power and the power calculated by our method are close with our calculation being a little conservative. Hence, our proposed method provides an accurate estimate of power, and the sample size calculated by our method is reliable.
Finally we would like to compare our method with other existing sample size calculation methods, including Li et al.’s approach [18, 19] and Wu et al.’s approach [21]. Li et al. proposed to calculate the sample size by “using a common \(\rho ^{*} = argmin_{g \in M_{1}} \{log_{2}(\rho _{g})\}\) minimum fold change”, where ρ _{ g } in their paper denotes the fold change and is equivalent to δ _{ g } in this paper. However, we found that the direction of fold change does matter when applying their code. If we set ρ _{ g }=2, the sample size calculated by their method is n=20, as presented in their Table 1. The plot of average power vs. nominal FDR for their method is shown in Fig. 4, from which we notice that the desired power (0.8) is not achieved at sample size n=20 when controlling FDR at 0.05. In fact, the observed power is 0.6166 when using the edgeR exact test based on which they derived their method. When applying the the voom and limma pipeline, the observed power is 0.4608 for sample size 20. If we set ρ _{ g }=0.5, then the sample size will be 32, same as our proposed method, and we get power of 0.8988 using the edgeR exact test and 0.8149 using the voom and limma pipeline for differential expression analysis. Wu et al. (PROPER) provided a simulationbased power evaluation tool, which requires a lot of simulations to assess the power for each sample size. Table 3 presents the computation time needed for the calculation. It took PROPER 6.5 hours to get the resulting sample size while the other two methods only needed seconds. PROPER is more than 1,300 fold timeconsuming than our proposed method. The resulting sample size from PROPER is 25, less than our proposed method. This is because PROPER is based on edgeR exact test, which tends to be more powerful than the voom and limma pipeline.
Results for other parameter settings under m=200 are presented in the Additional file 1, with Li et al.’s results in the first row, and our results in the second row.
Simulation 2. Genespecific mean and dispersion with fixed fold change
In the second simulation setting, we used a real RNAseq dataset to generate genespecific mean and dispersion parameters. A maize dataset was obtained from a study by Tausta et al. [25], who compared gene expression between bundle sheath and mesophyll cells of corn plants.
Similar to simulation 1, we generated 10,000 genes from N B(λ _{ g },ϕ _{ g }), with fold change δ=2 for DE genes, λ _{ g } and ϕ _{ g } from the means and dispersions estimated for each gene in the maize dataset. For EE genes, both r _{ g1j } and r _{ g2j } were drawn from N B(λ _{ g },ϕ _{ g }); for DE genes, r _{ g1j } were drawn from N B(λ _{ g },ϕ _{ g }) and r _{ g2j } were drawn from N B(2λ _{ g },ϕ _{ g }) or N B(0.5λ _{ g },ϕ _{ g }). The proportion of nonDE genes was π _{0}=0.8.
The fitted inverse gamma distributions of \({\sigma _{g}^{2}}\) for m= 50 and 1000 are very similar, as shown as in Fig. 5, where vertical lines indicate the modes. The middle three columns in Table 2 give the sample size and average power calculated by our ssizeRNA package. As shown in Table 2, the resulting sample sizes are all 13 when m ranges from 50 to 1000. This is expected because Fig. 5 indicates that the estimated distributions of \({\sigma _{g}^{2}}\) are very close using different m values for this dataset.
At n=13, we checked the plots of average power vs. nominal FDR and true FDR vs. nominal FDR, and the results were similar to those obtained in simulation 1. More specifically, the desired power (0.8) was achieved, and FDR was controlled successfully. Actually, the desired power can be reached at sample size n=11. Figure 6(a) gives the power curve calculated by our method based on hyperparameters estimated at m=1000 together with observed power curves with FDR controlled by the Benjamini and Hochberg’s method and the qvalue procedure, respectively. The anticipated power curve based on m=1000 is close to the other two observed power curves.
The RnaSeqSampleSize R package [19] could give an estimation of sample size and power by prior real data. They first use userspecified number of genes to estimate the gene read count and dispersion distribution, then sample_size_distribution and est_power_distribution functions will be used to determine sample size and actual power. When we used the same real dataset as our simulation setting 2, the sample size calculated by their method was 7, with actual power 0.774, which did not reach the desired power 0.8. We also tried to apply their method using our simulated data (with different m), the resulting sample size is larger (n=9). The power estimated by their method at n=9 are shown in Table 2, and all their estimated power were actually smaller than 0.8. PROPER started from an estimation of mean and dispersion parameters, which is similar to our method. The sample size calculated by their method is 10, with power 0.804 based on DE detection method edgeR. The comparison results of our proposed method and these three approaches are shown in the middle three columns of Table 3. Still, PROPER is much more timeconsuming than the other two methods.
Simulation 3. Genespecific mean and dispersion with different fold change
In this simulation, the setting is the same as the second simulation study, except that the fold change δ _{ g } was simulated from a lognormal distribution for differentially expressed genes. For EE genes, both r _{ g1j } and r _{ g2j } were drawn from N B(λ _{ g },ϕ _{ g }); for DE genes, r _{ g1j } were drawn from N B(λ _{ g },ϕ _{ g }) and r _{ g2j } were drawn from N B(λ _{ g } δ _{ g },ϕ _{ g }) or N B(λ _{ g }/δ _{ g },ϕ _{ g }) where
The last three columns in Table 2 give the sample size and power calculated by our method. As in simulation 2, varying the size of simulated data (m) did not result in different sample sizes. Anticipated and observed power curves are presented in Fig. 6(b), from which we notice that the three curves are almost indistinguishable after power reaches 60 %. This more realistic simulation demonstrates that our proposed method provides accurate power and sample size.
We also applied RnaSeqSampleSize to this simulation setting. Since their method is based on minimum fold change, such results will be conservative due to the variability of fold change, especially as in this case, the minimum fold change is close to 1. When we used the 10th percentile of fold change of DE genes as the “minimum” fold change, the sample size calculated by their method was 74, which is still much larger than what we actually need, but the power calculated by their method based on the “minimum” fold change was less than the desired power 0.8. PROPER gave a result of sample size 19 with power 0.805 based on DE detection method edgeR. The comparison results of our proposed method and these two approaches are shown in the last three columns of Table 3.
Based on results from simulations, our proposed method and RnaSeqSampleSize provided answers much faster than PROPER, and our proposed method and PROPER provided good sample size estimation. Overall, our proposed method worked the best while both accuracy and computation time are considered.
Simulation 4. Real databased simulation
Our method involves simulating data based on negative binomial distributions. To check the robustness of our method, we conducted a simulation based on a real RNAseq dataset from [26], which was upon an RNAseq experiment that sequenced 69 lymphoblastoid cell lines (LCL) derived from unrelated Nigerian individuals. We used the genes with minimum read counts across all individuals larger than 10, which results in 9154 genes. First, we estimated the mean and dispersion across all 69 individuals for each gene. Assume that fold change comes from a lognormal distribution as in simulation 3,
the proportion of nonDE genes being 80 %, to reach a desired power 0.8 while controlling FDR at 0.05, the sample size calculated by our method is 12 at m=200.
To check whether desired power can be achieved at the calculated sample size, we simulated 100 datasets. For each simulation, we randomly picked 24 out of the 69 individuals and randomly assigned 12 individuals to the control group and the remaining 12 individuals to the treatment group. Consider all 9154 genes among the 24 individuals as EE since the samples were randomly selected from the same population. Then we randomly generated 20 % of the 9154 genes to be DE, and their counts in the treatment group were multiplied by fold change δ _{ g } which were drawn from a l o g−n o r m a l(l o g(2),0.5l o g(2)) distribution. The scaled counts were rounded to the nearest integers. This strategy likely results in more realistic data because all counts come from real dataset and no distributional assumptions were imposed. The plot of average power vs. nominal FDR at n=12 is shown in Fig. 7(a), where desired power (0.8) was achieved for most testing methods, including edgeR, DESeq2, QuasiSeq, voom and limma methods, when controlling FDR at 0.05 using qvalue procedure. Figure 7(b) gives the power calculated by our method based on hyperparameters estimated at m=200. It also presents the observed average power curves when FDR was controlled by either the Benjamini and Hochberg’s method or the qvalue procedure. The anticipated power curve based on m=200 is close to the other two observed power curves. Hence, our proposed method also provides a reliable estimation of sample size and power in the most realistic simulation study.
Conclusions
In recent years, RNAseq technology has become a major platform to study gene expression. With large sample size, RNAseq experiments would be rather costly; while insufficient sample size may result in unreliable statistical inference. Thus sample size calculation is a crucial issue when designing an RNAseq experiment. Although we could use a lot of simulations for each sample size and determine the one that reach our desired power as suggested in [10, 20, 21], this requires generous calculation and lacks efficiency. Our method provides a quick calculation for sample size, which only demands onetime simulation. From the simulation studies in the section of Results and discussion, we demonstrate that our proposed method offers a reliable approach for sample size calculation for RNAseq experiments.
For each gene g, when we use a twosample ttest to do differential expression analysis, the effect size Δ _{ g } in formula (4) depends on the simulated sample size m. Larger m may lead to better estimation of the prior distributions and hence a more accurate sample size. Based on our simulation studies, the effect of m on the resulting sample size is not big, and m=200 should be enough for providing a relatively precise sample size.
The ordinary ttest instead of the moderated ttest [9] was used in ssizeRNA R package. Because the ordinary ttest is a bit less powerful than the moderated ttest, it tends to overestimate the sample size which might be the reason why our calculated sample size in simulation 2 is a little bit larger than what we actually need according to the observed power curves using voom and limma. However, the overestimation is not dramatic and far less than the method of Li et al. [18].
In this article, we illustrate our idea using a method for twosample comparison with the ttest, because detecting differentially expressed genes between two treatment groups is the most common case in RNAseq analysis. Our idea could be applied to multisample comparison with an Ftest or tests for linear contrasts of treatment means as well.
The R package ssizeRNA implements our proposed sample size calculation method for RNAseq experiments and it is freely available on the Comprehensive R Archive Network (http://cran.rproject.org). To install this package, start R and enter:

■■■
source(“http://bioconductor.org/biocLite.R”)

■■■
biocLite(“ssizeRNA”)
Appendices
Appendix A: Derivation of Eq. (3)
For each individual gene g, the weighted linear model
can be fitted to logcpm values
with design matrix
coefficients vector
unknown genespecific standard deviation σ _{ g }, associated voom precision weights
and error
Thus we could obtain the coefficient estimators
with variancecovariance matrix
where \({\sigma _{g}^{2}}\) is estimated by \({s_{g}^{2}}\)
with
Let v _{ gk } be the kth diagonal element of (X ^{T} W _{ g } X)^{−1}, where
Under the assumptions as made in Smyth (2004),
and
where d _{ g } is the residual degrees of freedom for the linear model of gene g, the ordinary ttest statistic will be
which follows an approximate tdistribution with d _{ g } degrees of freedom.
Assuming equal variance between treatment and control group, then the statistic for testing
for the gth gene is
where
with \(\bar {w}_{g1\cdot } = \frac {1}{n_{1}} \sum _{j=1}^{n_{1}} w_{g1j}\), \(\bar {w}_{g2\cdot } = \frac {1}{n_{2}} \sum _{j=1}^{n_{2}} w_{g2j}\) and \(\bar {w}_{g\cdot \cdot } = \frac {1}{n_{1}+n_{2}} \sum _{i=1}^{2} \sum _{j=1}^{n_{i}} w_{gij}\).
Appendix B: Choice of rejection region Γ satisfying formula (1)
For the twosample comparison with ttest statistics T _{ g } as in Eq. (3), we assume as in LH method that the effect size follows a normal distribution
and the variance of logcpm values for each gene follows an inverse gamma distribution
with mean \(\frac {b}{a1}\), then formula (1) becomes
where π _{1}(Δ _{ g }) and π _{2}(σ _{ g }) denote the probability distribution function (p.d.f.) of Δ _{ g } and σ _{ g } respectively. The numerator in (5) equals
where T _{ d }(·) denotes the cumulative distribution function (c.d.f.) of a central tdistribution with d degrees of freedom, and the denominator in (5) equals
Here, T _{ d }(·θ _{ g }) denotes the c.d.f. of a noncentral tdistribution with d degrees of freedom and noncentrality parameter
The integration in (6) with respect to Δ _{ g } could be avoided through mathematical derivation, and the integration with respect to σ _{ g } is approximated using static quadrature rules, which allows a stable calculation to get the root of c. Details of derivation could be found in ?? of [14].
Once the choice of c has been made for each sample size, power would be calculated accordingly by integrating over the prior distributions on effect size and residual variance. Hence based on the desired power, sample size is finally determined.
References
 1
Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007; 23:2881–87.
 2
Robinson MD, Smyth GK. Smallsample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008; 9:321–32.
 3
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26:139–140.
 4
McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNASeq experiments with respect to biological variation. Nucleic Acids Res. 2012; 40:4288–97.
 5
Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010; 11:R106.
 6
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNASeq data with DESeq2. Genome Biol. 2014; 15(12):550.
 7
Lund SP, Nettleton D, McCarthy DJ, Smyth GK. Detecting differential expression in RNAsequence data using quasilikelihood with shrunken dispersion estimates. Stat Appl Genet Mol Biol. 2012; 11:Article 8.
 8
Law CW, Chen Y, Shi W, Smyth GK. Voom: precision weights unlock linear model analysis tools for RNAseq read counts. Genome Biol. 2014; 15:R29.
 9
Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004; 3:Article 3.
 10
Fang Z, Cui X. Design and validation issues in RNAseq experiments. Brief Bioinform. 2011; 12:280–87.
 11
Hart SN, Therneau TM, Zhang Y, Poland GA, Kocher JP. Calculating sample size estimates for RNA sequencing data. J Comput Biol. 2013; 20:970–78.
 12
Therneau T, Hart S, Kocher JP. Calculating samplesSize estimates for RNA Seq studies. R package version 1.10.0. https://bioconductor.org/packages/release/bioc/html/RNASeqPower.html.
 13
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B. 1995; 57:289–300.
 14
Liu P, Hwang JTG. Quick calculation for sample size while controlling false discovery rate with application to microarray analysis. Bioinformatics. 2007; 23(6):739–46.
 15
Orr M, Liu P. Sample size estimation while controlling false discovery rate for microarray experiments using ssize.fdr package. The R J. 2009; 1(1, May 2009):47–53.
 16
Chen Z, Liu J, Ng HKT, Nadarajah S, Kaufman HL, Yang JY, Deng Y. Statistical methods on detecting differentially expressed genes for RNAseq data. BMC Syst Biol. 2011; 5(Suppl 3):S1.
 17
Li CI, Su PF, Guo Y, Shyr Y. Sample size calculation for differential expression analysis of RNAseq data under poisson distribution. Int J Comput Biol Drug Des. 2013; 6:358–75.
 18
Li CI, Su PF, Shyr Y. Sample size calculation based on exact test for assessing differential expression analysis in RNAseq data. BMC Bioinforma. 2013; 14(1):357.
 19
Zhao S, Li C, Guo Y, Sheng Q, Shyr Y. RnaSeqSampleSize: RnaSeqSampleSize. R package version 1.2.0. https://www.bioconductor.org/packages/release/bioc/html/RnaSeqSampleSize.html.
 20
Ching T, Huang S, Garmire LX. Power analysis and sample size estimation for RNASeq differential expression. RNA. 2014; 20(11):1684–96.
 21
Wu H, Wang C, Wu Z. PROPER: comprehensive power evaluation for differential expression using RNAseq. Bioinformatics. 2015; 31:233–41.
 22
Storey JD. A direct approach to false discovery rates. J R Stat Soc B. 2002; 64:479–98.
 23
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. Limma powers differential expression analyses for RNAsequencing and microarray studies. Nucleic Acids Res. 2015; 43(7):e47.
 24
Storey JD, Taylor JE, Siegmund D. Strong control, conservative point estimation and simultaneous rates: a unified approach. J R Stat Soc B. 2004; 66:187–205.
 25
Tausta SL, Li P, Si Y, Gandotra N, Liu P, Sun Q, Brutnell TP, Nelson T. Developmental dynamics of Kranz cell transcriptional specificity in maize leaf reveals early onset of C4related processes. J Exp Bot. 2014; 65:3543–55.
 26
Pickrell J, Marioni J, Pai A, Degner J, Engelhardt B, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010; 464:768–72.
Acknowledgements
This research was partially supported by the National Science Foundation Plant Genome Research Program under Award Number IOS1127017.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
RB and PL developed the method. RB implemented the ssizeRNA R package. RB and PL drafted the manuscript. Both authors read and approved the final manuscript.
Additional file
Additional file 1
This excel file contains comparison of resulting sample size and power between Li et al.’s method [18] and our proposed method for simulation 1, with parameter settings from Table 1 in [18]. The results are obtained under m=200, with Li’s result in the first row from each parameter setting, and our result in the second row. (XLS 49.2 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Bi, R., Liu, P. Sample size calculation while controlling false discovery rate for differential expression analysis with RNAsequencing experiments. BMC Bioinformatics 17, 146 (2016). https://doi.org/10.1186/s1285901609949
Received:
Accepted:
Published:
Keywords
 RNAseq
 FDR
 Experimental design
 Sample size calculation
 Power analysis