Power analysis for RNA-Seq differential expression studies using generalized linear mixed effects models

Power analysis becomes an inevitable step in experimental design of current biomedical research. Complex designs allowing diverse correlation structures are commonly used in RNA-Seq experiments. However, the field currently lacks statistical methods to calculate sample size and estimate power for RNA-Seq differential expression studies using such designs. To fill the gap, simulation based methods have a great advantage by providing numerical solutions, since theoretical distributions of test statistics are typically unavailable for such designs. In this paper, we propose a novel simulation based procedure for power estimation of differential expression with the employment of generalized linear mixed effects models for correlated expression data. We also propose a new procedure for power estimation of differential expression with the use of a bivariate negative binomial distribution for paired designs. We compare the performance of both the likelihood ratio test and Wald test under a variety of simulation scenarios with the proposed procedures. The simulated distribution was used to estimate the null distribution of test statistics in order to achieve the desired false positive control and was compared to the asymptotic Chi-square distribution. In addition, we applied the procedure for paired designs to the TCGA breast cancer data set. In summary, we provide a framework for power estimation of RNA-Seq differential expression under complex experimental designs. Simulation results demonstrate that both the proposed procedures properly control the false positive rate at the nominal level.

measurements from multiple samples of the same individual are correlated in nature. It is very challenging to estimate power in designing these RNA-Seq experiments as well as to do comparative analysis. Currently there is a need of developing statistical methods for sample size calculation and power estimation with correlated RNA-Seq data.
Since the emergence of RNA-Seq data, several papers have used Poisson or negative binomial (NB) distribution to model count-based expression data [7][8][9]. But these methods are based on generalized linear models with fixed effects, so they can not be directly applied to correlated expression data. To overcome this limitation, others proposed the generalized linear mixed effects model (GLMM) to model count-based expression data by adding random effects to allow diverse correlation structures while still assuming a Poisson distribution (Poisson-LMM) or NB distribution (NB-LMM) [10][11][12]. In addition, the bivariate negative binomial (BNB) distribution was introduced to model paired counts of brain lesions [13]. The BNB distribution is a compound distribution of two conditionally independent Poisson random variables for modeling paired counts with a Gamma random variable for modeling individual effects. Even though the BNB distribution has not yet been used for RNA-Seq data analysis, it is a great candidate for experiments using paired designs.
Several studies provide methods for sample size calculation and power estimation at the marginal level [14][15][16][17] or data set level [18][19][20] for testing differential expression of RNA-Seq experiments. However these methods were designed for experiments using independent samples and can not be directly applied to correlated expression data since they may lead to biased estimators of model parameters and result in a failure of proper error rate control. So far, there is lack of general methods for power analysis that can be applied to correlated RNA-Seq data. To overcome this deficiency, we are the first group to propose a BNB approach for paired designs and a more general GLMM approach for designs with diverse correlation structures. To ensure the false positive rate is properly controlled at the nominal level, we employ our previously published procedure for simulating the null distribution of test statistics [17] and also compare it against the asymptotic Chi-square distribution. To demonstrate performance of the new BNB and GLMM approaches, simulations were conducted under a variety of scenarios. A real TCGA data set was used for method application.

BNB model
The BNB distribution can be obtained by compounding two conditionally independent Poisson random variables X|G = g ∼ Poisson(μg) and Y |G = g ∼ Poisson(γ μg) with a Gamma random variable G ∼ Gamma(φ −1 , φ). The probability mass function for the joint distribution of (X, Y ) is Without loss of generality, we use γ to denote the fold ratio of a gene between two conditions. We are interested in testing hypothesis H 0 : γ = γ 0 vs. hypothesis H 1 : γ = γ 0 . A Wald test for testing log transformed γ is H 0 : log(γ ) = log(γ 0 ) vs. H 1 : log(γ ) = log(γ 0 ). The likelihood ratio test (LRT) statistic and the Wald test statistic for the above hypothesis with BNB distribution are defined in Rettiganti and Nagaraja [13].

GLMM
Poisson or NB distribution is modeled through the log link function of a linear predictor of mixed effects as where β are fixed effects and b are random effects following normal distributions. For inference on fixed-effects β, hypotheses H 0 : Lβ = 0 vs.H 1 : Lβ = 0 is tested by LRT or Wald test. Random effects b can be tested by z-statistic for difference from 0.

Empirical parametric test
In this study, we used our previously published simulation-based empirical parametric test for inferences [17] and the extended Bonferroni method for controlling per comparison error rate (PCER) [21].

Parameter setting
Count data were simulated from a Poisson-Gamma (BNB) distribution under two experimental conditions (e.g. baseline vs. treatment) for n subjects. The input parameters for power calculation at the marginal level are sample size n, mean expression μ at baseline, fold ratio γ between the two conditions, dispersion φ, and nominal false positive rate α. Parameter values for each are n = 5, 10, 15, 20, 25; μ = 3, 5, 10, 20, 100; γ =

TCGA data set
To study and demonstrate the proposed power estimation procedure in a real data application, we used the TCGA breast cancer data set as a pilot data set for designing a new study to detect differential expression using a paired design. The TCGA breast cancer data set BRCA (acquired in Feb. 2019 from FireBrowse) contains 1,212 tumor samples and 20,531 genes with non-zero counts. TMM method was used for normalization of all samples [22]. We chose the comparison between primary tumor samples and their matched normal samples (112 samples) for this case study. BNB model was fit to obtain parameter estimates for each gene. The estimated mean expression levels of matched normal samples are 14, 681, and 3,022 at the 20th, 50th, and 80th percentiles of all genes respectively. The estimated dispersion values are 0.07, 0.2, and 1 at the 20th, 50th, and 80th percentiles of all genes respectively. To demonstrate how to estimate power, we choose the mean expression level at the 20th percentile, the dispersion at the 80th percentile, and the  Fig. 6. We observe that actual false positive rates are inflated at most sample sizes for the Wald test, which is consistent with the simulation results. Figure 7 shows power as a function of sample sizes at the mean expression level of 14 and the dispersion of 1 for a 2-fold up-or down-regulation of primary tumor compared to matched normal using the empirical parametric test for both tests. The LRT has better performance than the Wald test in general. To design a new RNA-Seq study with at least 80% power for the LRT, we will need n = 5 samples per group to detect a 2-fold up-regulation or n = 9 samples per group to

Discussion
Several studies discovered excessively inflated false positive rates for differential expression detection by using popular NB methods (i.e. edgeR, DESeq2) in RNA-Seq data analysis [23][24][25][26]. A reasonable explanation for this phenomena is that these methods mainly rely on biased asymptotic Chi-square distribution for inferences, which results in the failure in false positive rate control, especially in experiments with small sample sizes. In consistency with this phenomena, our published paper shows the downward bias of critical values when an asymptotic Chi-square distribution is applied for both the LRT and Wald tests under the NB model in power analysis for RNA-Seq differential expression studies [17]. In that paper, we provided a solution so that false positive rates can be controlled at the nominal level with the use of the empirical parametric test for obtaining critical values of both tests. In this paper, we show that either the empirical parametric We observed that the LRT under the BNB model and the LRT under the Poisson-LMM model have similar performance under both the null and alternative hypotheses. The main reason of the equivalent performance of these two tests is that the employed BNB model is based on the Poisson-Gamma compound distribution, which is similar to fit the Poisson-LMM model. When testing γ > 1 for paired designs, the Wald test of BNB is recommended since it has the highest power among all four tests. But when testing γ < 1 for paired designs, either the LRT under the BNB model or the LRT under the Poisson-LMM model is recommended. This unbalanced effect on power for the Wald test is related to selected transformation functions g(.), which was also reported in the article by Rettiganti and Nagaraja [13]. We also notice that there are some negative values of LRT statistics under the NB-LMM model in our simulations. For these cases with negative values, the parameter space under the null hypothesis is not completely contained in the parameter space under the alternative hypothesis since the MLE estimates of the random effect under both hypotheses are not the same.
In almost all current studies that use a NB model for differential expression detection or power analysis, the dispersion parameter is assumed equal across conditions. We have proposed unequal dispersion parameters across conditions for power analysis [17]. Similarly, the proposed BNB model can be naturally extended to unequal dispersion parameters with both the LRT and Wald tests for paired designs. In addition, the proposed GLMM procedure can be applied to multiple factorial designs and allow multiple random effects. For these designs, the value or distribution of model parameters need to be pre-specified so that the procedure can simulate data from a known GLMM model under both hypotheses. In addition, our proposed method can be used to estimate power at the data set level, where the distribution under the null hypothesis of the LRT or Wald test can be simulated for groups of genes with similar expression profile to maintain a proper false positive rate control.