Evaluation of logistic regression models and effect of covariates for case–control study in RNASeq analysis
 Seung Hoan Choi^{1},
 Adam T. Labadorf^{2},
 Richard H. Myers^{2},
 Kathryn L. Lunetta^{1},
 Josée Dupuis^{1} and
 Anita L. DeStefano^{1, 2}Email author
DOI: 10.1186/s128590171498y
© The Author(s). 2017
Received: 25 August 2016
Accepted: 27 January 2017
Published: 6 February 2017
Abstract
Background
Next generation sequencing provides a count of RNA molecules in the form of short reads, yielding discrete, often highly nonnormally distributed gene expression measurements. Although Negative Binomial (NB) regression has been generally accepted in the analysis of RNA sequencing (RNASeq) data, its appropriateness has not been exhaustively evaluated. We explore logistic regression as an alternative method for RNASeq studies designed to compare cases and controls, where disease status is modeled as a function of RNASeq reads using simulated and Huntington disease data. We evaluate the effect of adjusting for covariates that have an unknown relationship with gene expression. Finally, we incorporate the data adaptive method in order to compare false positive rates.
Results
When the sample size is small or the expression levels of a gene are highly dispersed, the NB regression shows inflated TypeI error rates but the Classical logistic and Bayes logistic (BL) regressions are conservative. Firth’s logistic (FL) regression performs well or is slightly conservative. Large sample size and low dispersion generally make TypeI error rates of all methods close to nominal alpha levels of 0.05 and 0.01. However, TypeI error rates are controlled after applying the data adaptive method. The NB, BL, and FL regressions gain increased power with large sample size, large log2 foldchange, and low dispersion. The FL regression has comparable power to NB regression.
Conclusions
We conclude that implementing the data adaptive method appropriately controls TypeI error rates in RNASeq analysis. Firth’s logistic regression provides a concise statistical inference process and reduces spurious associations from inaccurately estimated dispersion parameters in the negative binomial framework.
Keywords
RNASequencing analysis Firth’s logistic regression Negative binomial regression Covariate effectBackground
Because the total number of reads of each sample will likely be different, a normalization step is required prior to performing differential expression inferences between two conditions. Normalization approaches have been evaluated elsewhere [1]. Based on these evaluations, we implemented the DESeq normalization approach in this study [2].
Estimation of the dispersion parameter (ϕ) of each gene is challenging with the small number of observations typically available in RNASeq studies. An overestimated dispersion may result in loss of power to detect DE genes and an underestimated dispersion parameter may increase false discoveries. Two of the most sophisticated and widely used software packages for identifying DE genes are DESeq2 and edgeR [3, 4], which estimate the dispersion parameter of each gene using empirical Bayes shrinkage and CoxReid adjusted profile likelihood methods, respectively.
Many RNASeq analysis methods have been evaluated in different settings including multigroup study designs [1, 5–9]. Soneson et al. [9] compared performance of RNASeq analysis tools (edgeR, DESeq, baySeq [10], NBPSeq [11], TSPM [12], EBSeq [13], NOIseq [14], SAMseq [15], ShrinkSeq [16] and limma [17]) using real and simulated data sets. They reported that when sample size is small, the results should be cautiously interpreted. However, when sample size is large, the limma using variance stabilizing transformation method performed well. Seyednasrollah et al. [18] evaluated eight computational methods such as edgeR, DESeq, baySeq, NOISeq, SAMseq, limma, Cuffdiff2 [19], and EBSeq using publically available human and mice RNASeq data. They concluded that no method fits for all situations and results from distinct methods could be largely different. Rapaport et al. [7] assessed commonly used analysis packages (Cuffdiff, edgeR, DESeq, PoissonSeq [20], baySeq, and limma) for RNASeq data. They analyzed human RNASeq data with those methods and emphasize the importance of large sample replicates to accurately detect association with genes. Tang et al. [6] included TCC [21], edgeR, DESeq, DESeq2, voom [22], SAMseq, PoissonSeq, baySeq, and EBSeq in their evaluation with multigroup data. The edgeR and DESeq2 packages were recommended in their assessment. DESeq2 and edgeR are generally accepted in the analysis of RNASeq data because these packages are designed to properly handle studies with small sample size and lowly expressed genes. However, the appropriateness of the NB model compared to the logistic model has not been exhaustively evaluated. Because many RNASeq studies are designed to compare cases and controls, we explore logistic regression as an alternative approach, in which disease status is modeled as a function of RNASeq reads. This is a reversal of the experimental and explanatory variables in the NB model in the RNASeq setting. An attractive feature of the logistic framework is that the estimation of a dispersion parameter for gene expression is not necessary. Although some studies analyzed their data modeled as a function of gene expression [23, 24], a comparison to standard NB methods was not conducted. We evaluate logistic regression models in which the dependent variable is disease status and gene expression is the independent variable.
Because of the substantial costs associated with RNASeq technology and the challenges in obtaining appropriate tissue sample sizes may be limited in some studies. When sample size is small the distribution of test statistics may not achieve the expected asymptotic distribution. Hence, we incorporated the data adaptive method into NB and logistic regressions because this approach estimates a recalibrated distribution of test statistics. We evaluated the validity of the data adaptive method in RNASeq studies.
Another important feature of differential expression analysis is covariate adjustment. Adjustment for confounders is crucial in protecting against spurious associations. We define a confounder as a covariate that is associated with both the experimental and explanatory variables. Covariates in RNASeq analysis are associated with disease status, technical artifacts from experimental methodology, or intrinsic biological properties of a system in RNASeq models. If these covariates affect the abundance measurements of RNASeq data, then the covariates could significantly confound the association between RNASeq and disease status. Again, we considered two approaches for differential expression analysis: 1) NB regression where gene expression values are the outcome variable and case–control status is the predictor variable and 2) logistic regression where case–control status is a function of gene expression values. If diseaseassociated covariates are not associated with gene expression, then these covariates are nonpredictive (NP) covariates in models with RNASeq as the outcome. The effect of adjusting for covariates, when the relationship between covariates and gene expression is not assessed through statistical tests or prior studies, has not been extensively evaluated in RNASeq studies using the NB framework. If we alternatively consider a logistic model, the NP covariates in the NB model become nonconfounding predictive (NCP) covariates in the logistic model, because the covariates are not associated with the independent variable (gene expression) but are associated with the dependent variable (disease status).
We use both simulated data sets and an application to a real Huntington’s disease (HD) RNASeq data set [25]. The results of this study will guide the selection of an appropriate regression model and guide decisions regarding covariate adjustment in RNASeq studies.
Methods
This study focuses on a gene as a unit; hence various genebased scenarios are considered.
Regression methods for analyzing RNASeq data
Negative binomial (NB) regression
NB regression uses the MaximumLikelihood (ML) fitting process [26]. The generalized linear model (GLM) framework is used by DESeq2 and edgeR. In the current study, GLM was implemented using the glm(,family = negative.binomial(1/ϕ)) function in the Rpackage “MASS” and utilized either the estimated dispersion from ML, Quasilikelihood (QL) or the true dispersion value from the simulation scenario. ML and QL methods are described in Additional file 1: Supplementary Method Section 1. In our real data application, the original data and permuted data sets were analyzed with DESeq2 [3].
Classical logistic (CL) regression
We conducted GLM in a logistic regression framework using the logit link function using the glm(,family = binomial) function in R. In the RNASeq setting CL regression may be limited by small sample size and complete separation. If the expression values of a gene are completely or nearly completely separated between case and control groups, which may occur when the effect size is large, the ML estimation from CL regression may fail to converge. Because complete separation may be an indicator of differential expression, we implemented Bayes and Firth’s logistic regressions, which overcome complete separation bias.
Bayes logistic (BL) regression
Gelman et al. [27] proposed a prior to estimate stable coefficients in a Bayesian framework, when data show separation. The proposed prior is the Cauchy distribution with center 0 and scale 2.5. They demonstrated that this flattailed distribution has robust inference in logistic regression and is computationally efficient. The procedure is implemented by incorporating an EM algorithm into iteratively reweighted least squares, using the bayesglm function in the Rpackage “arm”.
Firth’s logistic (FL) regression
The ML estimators may be biased due to small sample size and small total Fisher information. Firth proposed a method that eliminates firstorder bias in ML estimation by introducing a bias term in the likelihood function [28]. Heinze and Schemper demonstrated that Firth’s method is an ideal solution when the data show separation [29]. The logistf function in the Rpackage “logistf” was used.
Simulation study
Parameters and their values in simulation scenarios
Parameter  Values 

Design  Balanced, Unbalanced2, Unbalanced4 
Number of cases (N _{ D=1 })  10, 25, 75, 500 
Mean expression value in controls (μ _{ D=0 } )  50, 100, 1000, 10000 
Dispersion (ϕ)  0.01, 0.01, 0.5, 1 
Covariate OR (CovOR)  1, 1.2, 3, 5, 10 
log_{2} foldchange (log2fc)  0, 0.3, 0.6, 1.2, 2 
Number of Covariates  0, 1, 2, 3, 5, 10 
Generating simulated RNASeq data
For each scenario, the read counts (y _{ i }) were sampled from the NB distribution with mean (μ) and dispersion (ϕ) as specified in Table 1. We simulated 10,000 independent replicates per scenario using the following steps and evaluated TypeI error rate and power per scenario based on results from these 10,000 independent replicates.
Generating simulated covariate data
Analyzing simulated data
Scenarios for which log2fc = 0 are TypeI error studies. Otherwise, the scenarios are power studies. TypeI error rates, at significance (alpha) levels 0.05 and 0.01, were calculated based on replicates with converged results. For power studies, because different TypeI error rates were observed among the distinct regression methods, comparing the power of different regression methods using the same threshold is not appropriate. For fair comparison, an empirical threshold for each regression method was calculated. Then, we computed the empirical power of each regression method using their empirical thresholds. Although only positive log2fcs were simulated in those scenarios in which power was evaluated, we consistently used twosided tests for both TypeI error and power studies. The equations for TypeI error rate and empirical power are shown in Additional file 1: Supplementary Method Section 2.
Validation of data adaptive (DA) method using crossvalidation technique
The DA method reestimates a distribution of test statistics under the null hypothesis of no association [30]. The DA approach enables one to obtain a recalibrated distribution of test statistics because when sample size is small, the asymptotic distribution may not be appropriate. This method also avoids heavy computing burden compared to implementing permutation tests with all possible permutations. The results from the DA method are validated using a crossvalidation technique. Detailed procedures are provided in Additional file 1: Supplementary Method Section 3.
Huntington’s disease (HD) study
A real RNASeq data set was analyzed using the DESeq2 Rpackage, which implements a NB GLM. The data set was also analyzed utilizing R (v3.0.0) to implement CL, BL, and FL regressions. The logistic regressions modeled case–control status as a function of normalized counts of a gene and covariates. The publicly available HD data set [25] was downloaded from the GEO database (accession number GSE64810, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE64810). This data set contains 20 HD cases and 49 neurologically normal controls. Details of the HD data set are provided in Additional file 1: Supplementary Method Section 4.
Generating permuted HD Data
The original HD study used RNA Integrity Number (RIN) and Age at Death (AAD) as covariates. RIN was included due to the potential confounding effect between HD and the abundance of RNAs. To remove the effect of RIN in our permutations, at first, samples were divided by RIN categories. Then, each gene was resampled within each category of RIN. Because AAD was included in the regression model due to its association with HD, the relationship between HD and AAD was preserved during the permutation process. We generated 10,000 MonteCarlo permutations.
Analyzing permuted HD data
For the original HD data and each permutated data set, DE genes were identified using the NB model (Model A) as implemented in DESeq2. We also implemented the CL, BL, and FL regressions (Model B) to compare statistical models. In logistic models, we used normalized counts from DESeq2 as an independent variable. For these analyses, we adjusted for AAD and RIN.
The TypeI error rates at our alpha levels using supplementary equation (2.1), and the exact pvalues [31] were calculated with the results from the 10,000 permutations. The DA method was applied using our permutation results [30] to measure TypeI error rates and to obtain adjusted pvalues of each gene.
For HD data analysis, asymptotic pvalues, exact pvalues using 10,000 permutation results, and DA using 1000 pvalues were calculated and corrected for multiple testing by imposing a false discovery rate (FDR) of 0.05. To assess the model adequacy, QQplots of original, exact and DA pvalues were generated, and genomic inflation factors, λ _{ gc }, were calculated. This λ _{ gc } is the ratio of the median of the observed test statistics divided by the expected median of an asymptotic distribution. If a λ _{ gc } is greater than 1, this may suggest inflation in the test statistics and may indicate the presence of systemic bias such as hidden population structures, latent covariates, technical artifacts, etc. [32].
Analyzing HD data with simulated covariates
To evaluate the effect of covariates in a model, the same method for generating covariates in our simulation study was applied to the HD data set to create simulated covariates. In this real data application, we focused on a moderate covariate effect on HD status (CovOR = 1.2). The HD data were analyzed using the NB GLM in DESeq2 with Model A and using the FL regression with Model B with 1000 replicated sets of simulated covariates (C = 1, 2, 3, 5, or 10). The change of λ _{ gc } with the addition of a varying number of simulated covariates in a model was evaluated.
Results
Simulation results of NB vs. Logistic regressions
TypeI error simulations
TypeI error rates of regression methods from the balanced design having μ _{ D=0 } = 1000
Alpha  N _{ D=1 }  Disp  NB  CL  BL  FL 

0.05  10  0.01  0.066  0.023  0.023  0.044 
10  1  0.094  0.016  0.017  0.039  
25  0.01  0.057  0.044  0.040  0.049  
25  1  0.064  0.034  0.032  0.044  
75  0.01  0.054  0.050  0.048  0.051  
75  1  0.056  0.045  0.043  0.048  
500  0.01  0.049  0.049  0.048  0.049  
500  1  0.054  0.052  0.052  0.053  
0.01  10  0.01  0.018  <0.001  <0.001  0.007 
10  1  0.032  <0.001  0.001  0.007  
25  0.01  0.014  0.005  0.004  0.011  
25  1  0.019  0.002  0.002  0.009  
75  0.01  0.012  0.009  0.008  0.010  
75  1  0.013  0.007  0.006  0.010  
500  0.01  0.011  0.010  0.010  0.011  
500  1  0.011  0.010  0.010  0.011 
In most scenarios, the DA method reduces the inflation observed with NB regression and the controls conservativeness observed with the CL, BL, and FL regressions. However, with small sample size, conservative results are still observed at alpha level 0.01 for the CL and BL models. The DA method with NB and FL regressions showed wellcontrolled TypeI error rates at all alpha levels even with small sample size.
Power simulations
Application to RNASeq data in Huntington’s disease
TypeI error permutations
In the NB results from DESeq2, as dispersion increases, the TypeI error rates increase when genes are in the categories of (0, 0.05), (0.05, 0.15), and (0.15, 0.8). However, genes in the (0.8, 1.5), and (1.5, 10) categories exhibit decreasing TypeI error rates (Additional file 6: Figure S2). Genes in the (0.8, 1.5), and (1.5, 10) categories largely have very low mean expression values. After excluding genes having mean expression values less than 3, TypeI error rates increase as the estimated dispersion increases as shown in Fig. 2. These increasingly liberal TypeI error rates are observed at both alpha levels and are consistent with our simulation results.
In the CL, BL and FL regression results, we observe that genes in the categories of (0, 0.05), (0.05, 0.15), and (0.15, 0.8) produce increasingly conservative TypeI error rates at both alpha levels, as presented in Fig. 3. However, these increasingly conservative TypeI error rates are attenuated in the (0.8, 1.5), and (1.5, 10) categories (Additional file 7: Figure S3). Because we observe this inconsistent pattern of TypeI error rates among extremely lowly expressed genes in the DESeq2 results, we also examined the set of genes excluding those with mean expression values less than or equal to 3. After exclusion of genes with low expression, the remaining genes show consistent increasingly conservative TypeI error rates as dispersion increases as shown in Fig. 3. Although TypeI error rates from the FL regression are also more conservative when dispersion is large, TypeI error rates are relatively well controlled compared to CL and BL regressions. The TypeI error rates observed in the real data set using logistic regression confirm our simulation results.
The DA method controls TypeI error rates well for the DESeq2 results (Additional file 8: Figure S4) and the FL regression results (Additional file 9: Figure S5) at both alpha levels, regardless of dispersions of all genes. The TypeI error rates from CL and BL regressions at significance level of 0.01 are conservative as seen in Additional file 9: Figure S5.
We also observed the bias of the regression methods using permuted HD data sets. As shown in Additional file 10: Figure S6, FL regression revealed the smallest bias.
HD results from the NB vs. logistic regressions
Top 10 significant genes from FL regression among genes not significant in DESeq2
Gene  Mean. Exp. Case  Mean. Exp. Cont  Disp  NB.Pval  CL.Pval  BL.Pval  FL.Pval 

SLC1A6  374  554  0.29  0.039  4.3E04  4.5E04  3.2E06 
SERHL2  210  163  0.17  0.016  3.2E04  6.3E03  1.2E05 
KCNK9  315  454  0.30  0.063  3.0E04  9.2E04  1.7E05 
DISP2  687  937  0.21  0.047  5.5E04  8.2E04  4.3E05 
SPOCK2  12370  15649  0.09  0.010  8.9E04  1.1E03  8.0E05 
C20orf27  726  934  0.11  0.019  5.9E04  2.4E04  9.5E05 
IST1  3388  3134  0.02  0.009  5.7E04  4.5E03  9.6E05 
ARC  596  1058  0.40  0.030  1.1E03  1.2E03  1.0E04 
STRADB  980  844  0.03  0.013  1.4E03  1.5E03  1.0E04 
PCP4  734  1330  0.37  0.086  1.1E03  2.2E03  1.2E04 
Simulation results of various covariate models from the NB and FL regressions
The properties of the NB regression that are shown in Simulation results of NB vs. Logistic regressions are also presented in the simulation results with various covariate models.
TypeI error simulations
TypeI error rates with covariate models from balanced design of N _{ D=1 } = 10 and μ _{ D=0 } = 1000
Disp  CovOR  Ncov  Alpha = 0.05  Alpha = 0.01  

NB  FL  NB  FL  
0.01  1.2  1  0.071  0.049  0.023  0.01 
0.01  1.2  5  0.076  0.05  0.026  0.009 
0.01  5  1  0.061  0.041  0.018  0.006 
0.01  5  5  0.08  0.021  0.024  0.001 
1  1.2  1  0.103  0.041  0.036  0.006 
1  1.2  5  0.151  0.045  0.067  0.008 
1  5  1  0.102  0.039  0.037  0.007 
1  5  5  0.142  0.02  0.061  0.001 
The same covariates that are nonpredictive (NP) covariates in a NB model are nonconfounding predictive (NCP) covariates in logistic models. Unlike the NB regression, even with small sample size (Table 4), when the CovOR is small, the FL regression is robust regarding the increment in the number of NCP covariates. When CovOR is large, TypeI error rates from FL regression become very conservative as the number of NCP covariates increase. TypeI error rates are not affected by large dispersion. When sample size is increased, TypeI error rates are less affected by increased number of NCP covariates with large CovOR.
In all scenarios, the DA method controls TypeI error rates well in the NB and FL regressions at both alpha levels. The newly approximated distribution of test statistics diminishes deviated TypeI error rates that were not controlled when many covariates were included in the NB and FL models.
In addition to TypeI error rates, the bias is also calculated for each scenario. As shown in Additional file 16: Table S7, as dispersion increases, bias from NB regression increases. However, the bias from FL regression is robust with increasing dispersion. The bias does not change much for any of the models, as CovOR and the number of covariates changes.
Power simulations
In our simulation, the power of NB and FL regression is affected by three factors 1) Dispersion, 2) CovOR, and 3) Number of NP/NCP covariates in a model, when sample size is fixed. Large dispersion, large CovOR, and increasing number of NP/NCP covariates in a model decrease power. NB regression is less sensitive to an increase of NP covariates with small dispersion. As shown Fig. 4a, NB regression results in marginally more power than FL regression when the number of covariates is large and dispersion is small. When dispersion is large but CovOR is small, the loss of power in NB regression is more sensitive to the increase in number of NP covariates than in FL regression as seen in Fig. 4b. Especially, in Fig. 4b with the CovOR of 1.2, the power of FL regression with 10 covariates in a model is more powerful than NB regression with 10 covariates. Regardless of dispersion, when CovOR and the number of covariates in a model are large, NB regression shows slightly better power than FL regression. This is demonstrated in Fig. 4 with CovOR equal to 5.
HD results from covariate models
Summary of λ _{ gc } from HD analyses with simulated covariates
Ncov  Median_NB  SD_NB  Median_FL  SD_FL 

1  4.05  0.10  3.50  0.16 
2  4.02  0.14  3.46  0.22 
3  4.00  0.17  3.40  0.27 
5  3.93  0.22  3.28  0.35 
10  3.74  0.29  2.95  0.53 
Discussion
In this study, we propose using a logistic regression framework as an alternative to NB regression to analyze RNASeq data for case–control studies. We have shown in our simulations that FL regression performs well in terms of controlling TypeI error rates and shows comparable empirical power. The dispersion is not estimated in the logistic framework, thus avoiding potential false association resulting from incorrectly estimated dispersion. The simulations presented focused on single genes varying relevant parameters (mean, dispersion, log_{2}foldchange); transcriptomewide data were not simulated.
The TypeI error simulations presented demonstrate that NB regression has inflated TypeI error rates with small sample size. The degree of inflation is varied by the scale of the dispersion parameter with constant sample size. The relationship between increased TypeI error and dispersion was confirmed through permutation of a real data set. Although large sample size reduced the inflation from NB, the high cost of RNASeq technology and difficulty of obtaining certain sample tissues may preclude a larger sample size in some studies. The distinct TypeI error rates observed with varying dispersion parameter values may violate the general assumption that pvalues from nonDE genes follow a uniform distribution. However, the current simulation and permutation studies validate that the DA method is a suitable alternative approach that controls TypeI error rates in all regression methods.
The empirical power of the NB, BL, and FL regressions are comparable across all scenarios. Lower power was observed for CL regression, which appears to be driven by scenarios of complete separation and a failure to converge. With large log2fc and small dispersion, simulated data are likely to show complete separation; in these scenarios the NB, BL and FL regressions are more powerful. For many circumstances with small sample size, the CL regression demonstrated the lowest empirical power among all methods because the CL regression is not able to accommodate complete separation.
Analysis of the HD data showed λ _{ gc } was decreased after applying the DA method to the results from NB GLM in DESeq2 but was increased after applying DA method to the results from CL, BL and FL regression. The exact pvalues from 10,000 permutations revealed the same pattern. This pattern is consistent with our simulation results where TypeI error rates were inflated in the NB framework and conservative in the logistic framework when test statistics were compared with a theoretical asymptotic distribution.
Although it is unknown which genes are truly DE in the HD data set, we compared DE genes identified in the HD data by different statistical approaches. We found that SLC1A6 (solute carrier family 1, member 6; EAAT4) did not show evidence of association with HD when using DESeq2, but was highly significant when using FL regression, as shown in Table 3. SLC1A6, which is highly expressed in the cerebellum of human brain compared to other regions [33], showed lower levels of expression in prior studies of mood disorder diseases such as bipolar and major depression disorders in the striatum in situ hybridization study [34]. In addition, Utal et al. [35] showed that Purkinje cell protein 4 (PCP4), also known as PEP19, had dramatic reduction in HD. This gene was not significantly associated with HD status when using DESeq2 (pvalue = 0.086) but showed strong association when using FL regression (pvalue = 1.2 × 10^{−4}). We also found that some genes expressed highly in both cases and controls may not be detected in the NB framework, because it utilizes the ratio of mean expressions of cases and controls. For instance, the normalized mean expression value of SPOCK2 is 12,370 in cases and 15,649 in controls. Although the difference of the means is large, the gene might not be statistically significant due to the small effect size (log2foldchange = −0.34) in the NB framework. However, this gene is strongly associated with HD in our logistic framework (Table 3). SPOCK2, also known as SPARC/osteonectin and Testican2, plays an important role in the central nervous system [36]. As a member of the testican group, the expression in various neuronal cell types including cell types cerebral cortex, thalamus, hippocampus, cerebellum was reported by in situ hybridization [36]. Several studies showed evidence of associations with prostate, colon, and breast cancer and bronchopulmonary dysplasia [37, 38].
The top genes that showed associations exclusively in NB GLM, except for gene S100A11, have low average counts as shown in Additional file 13: Table S4. The estimated dispersions for these genes are also fairly large suggesting that they may be false positives.
The effect of including covariates has not been investigated for RNASeq studies. Identifying relationships with covariates for all genes is computationally demanding and existing software do not allow for defining genewise models for all genes, which makes this approach challenging. Therefore, RNASeq studies that include covariates in a single model applied to all genes will likely result in some gene expression models that include unassociated covariates. Hence, it is important to investigate the effect of NP covariates in RNASeq analysis.
Simulations that included NP covariates in the NB model showed inflated TypeI error rates and a loss of power. With large dispersion, this inflation and loss of power becomes severe. The TypeI error in the FL regression is not notably affected by the increment of number of NCP covariates when the CovOR is small. With large CovOR and increased number of NCP covariates, conservative TypeI error rates are observed. The DA method effectively controls the increase of TypeI error rates even with larger CovOR and high number of NP/NCP covariates. Our empirical power results show that the FL regression is more greatly influenced by the increase of covariates than the NB regression, when CovOR is large.
Our HD analyses with simulated NP/NCP covariates demonstrated that an increase in the number of NP/NCP covariates results in a less stable λ _{ gc }(Table 5). Adding more NP covariates in an NB model slightly decreases the median of λ _{ gc }, and hence an increased in the median of pvalues. In other words, many pvalues in a set are generally increased. Based on our simulation results with NP covariates, TypeI error rates of largely dispersed genes are likely to be inflated and power of differentially expressed genes are likely to be decreased. Therefore, this slightly decreased median may indicate that the loss of power is greater than the gain of TypeI error rates.
Adding more NCP covariates in a model also slightly decreases the median of the λ _{ gc } in FL regression. This decreased median might be caused by the loss of power, and this loss may occur from NCP covariates in a model according to our simulation results. Under a moderate CovOR, the number of NCP covariates in a model does not affect the TypeI error rates.
The change in the median λ _{ gc } with additional covariates is larger in FL regression than in NB regression because the FL regression results are solely affected by the loss of power. NB regression results are influenced by both increased TypeI error and decreased power. The standard deviation of λ _{ gc } is increased with adding NP/NCP covariates in a model. This means that the results generated from a model that includes many covariates is not likely to be reliable, even if these covariates are associated with case–control status but not gene expression.
Conclusions
In conclusion, unlike NB, CL and BL regressions, FL regression controls TypeI error rates well and maintains comparable power even with small sample size. Firth’s logistic regression is an excellent alternative to NB regression for analysis of RNASeq data in case–control studies. We recommend implementing the DA method in analysis of RNASeq data to appropriately control TypeI error rates. If computational burden of permutations required for the DA method precludes using this approach, FL regression is the best option for controlling TypeI errors with comparable power to NB regression. However, a parsimonious model is necessary to obtain robust results in the FL regression setting. This approach can be extended in multiple classes of disease status using a multinomial logistic regression method.
Abbreviations
 AAD:

Age at death
 BL:

Bayes logistic
 CL:

Classical logistic
 CovOR:

Covariate OR
 DA:

Data adaptive
 DE:

Differentially expressed
 FDR:

False discovery rate
 FL:

Firth’s logistic
 GLM:

Generalized linear model
 HD:

Huntington’s disease
 log2fc:

log_{2} foldchange
 ML:

Maximumlikelihood
 NB:

Negative Binomial
 NCP:

Nonconfounding predictive
 NGS:

Next generation sequencing
 NP:

Nonpredictive
 QL:

Quasilikelihood
 RIN:

RNA Integrity number
 RNA:

Ribonucleic acid
 RNASeq:

RNA sequencing
Declarations
Acknowledgements
Not applicable.
Funding
This work has been supported by the National Institute for Diabetes and Digestive and Kidney Disease/National Institutes of Health grant, 1R01DK099269 (ALD), the Jerry McDonald HD Research Fund (RHM) and Public Health Service, National Institutes of Health grants, National Institute of Neurological Disorders and Stroke/National Institute grant, 1R01NS073947 (RHM).
Availability of data and materials
The underlying R code to regenerate simulated datasets analyzed for the current study is provided in Additional file 18. HD RNASeq data set is available in the GEO database repository (http://dx.doi.org/10.1371/journal.pone.0143563) [25].
Authors’ contributions
SHC and ALD conceived the study. SHC, JD, KLL, and ALD designed the study. SHC performed analysis and drafted the manuscript. SHC, ATL, RHM, JD, KLL, and ALD confirmed analysis results and contributed to refinement of the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Dillies MA, Rau A, Aubert J, HennequetAntier C, Jeanmougin M, Servant N, et al. A comprehensive evaluation of normalization methods for Illumina highthroughput RNA sequencing data analysis. Brief Bioinform. 2013;14:671–83.View ArticlePubMedGoogle Scholar
 Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106. BioMed Central Ltd.View ArticlePubMedPubMed CentralGoogle Scholar
 Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 2014;15:550.View ArticlePubMedPubMed CentralGoogle Scholar
 Robinson MD, McCarthy DJ, Smyth GK. edger: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–40.View ArticlePubMedGoogle Scholar
 Seyednasrollah F, Laiho A, Elo LL. Comparison of software packages for detecting differential expression in RNAseq studies. Brief Bioinform. 2013;16:5970
 Tang M, Sun J, Shimizu K, Kadota K. Evaluation of methods for differential expression analysis on multigroup RNAseq count data. BMC Bioinformatics. 2015;16:361. BioMed Central.View ArticlePubMedPubMed CentralGoogle Scholar
 Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, et al. Comprehensive evaluation of differential gene expression analysis methods for RNAseq data. Genome Biol. 2013;14:R95.View ArticlePubMedPubMed CentralGoogle Scholar
 Landau WM, Liu P. Dispersion estimation and its effect on test performance in RNAseq data analysis: a simulationbased comparison of methods. Chen L, editor. PLoS One. 2013;8:e81415.View ArticlePubMedPubMed CentralGoogle Scholar
 Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNAseq data. BMC Bioinformatics. 2013;14:91.View ArticlePubMedPubMed CentralGoogle Scholar
 Hardcastle TJ, Kelly KA. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010;11:422.View ArticlePubMedPubMed CentralGoogle Scholar
 Di Y, Schafer DW, Cumbie JS, Chang JH. The NBP Negative Binomial Model for Assessing Differential Gene Expression from RNASeq. Stat Appl Genet Mol Biol. 2011;10:1–28.
 Auer PL, Doerge RW. A TwoStage Poisson Model for Testing RNASeq Data. Stat Appl Genet Mol Biol. 2011;10:1.
 Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BMG, et al. EBSeq: an empirical Bayes hierarchical model for inference in RNAseq experiments. Bioinformatics. 2013;29:1035–43.View ArticlePubMedPubMed CentralGoogle Scholar
 Tarazona S, GarciaAlcalde F, Dopazo J, Ferrer A, Conesa A. Differential expression in RNAseq: a matter of depth. Genome Res. 2011;21:2213–23.View ArticlePubMedPubMed CentralGoogle Scholar
 Li J, Tibshirani R. Finding consistent patterns: a nonparametric approach for identifying differential expression in RNASeq data. Stat Methods Med Res. 2013;22:519–36.View ArticlePubMedGoogle Scholar
 Van De Wiel MA, Leday GGR, Pardo L, Rue H, Van Der Vaart AW, Van Wieringen WN. Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors. Biostatistics. 2013;14:113–28.View ArticleGoogle Scholar
 Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:1–25.Google Scholar
 Seyednasrollah F, Laiho A, Elo LL. Comparison of software packages for detecting differential expression in RNAseq studies. Brief Bioinform. 2015;16:59–70.View ArticlePubMedGoogle Scholar
 Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNAseq. Nat Biotechnol. 2012;31:46–53.View ArticlePubMedGoogle Scholar
 Li J, Witten DM, Johnstone IM, Tibshirani R. Normalization, testing, and false discovery rate estimation for RNAsequencing data. Biostatistics. 2012;13:523–38.View ArticlePubMedGoogle Scholar
 Sun J, Nishiyama T, Shimizu K, Kadota K. TCC: an R package for comparing tag count data with robust normalization strategies. BMC Bioinformatics. 2013;14:219.View ArticlePubMedPubMed CentralGoogle Scholar
 Law CWC, Chen Y, Shi W, Smyth GGK. Voom: precision weights unlock linear model analysis tools for RNAseq read counts. Genome Biol. 2014;15:R29.View ArticlePubMedPubMed CentralGoogle Scholar
 Yu L, Chibnik LB, Srivastava GP, Pochet N, Yang J, Xu J, et al. Association of brain DNA methylation in SORL1, ABCA7, HLADRB5, SLC24A4, and BIN1 with pathological diagnosis of alzheimer disease. JAMA Neurol. 2015;72:15.View ArticlePubMedPubMed CentralGoogle Scholar
 Bennett DA, Yu L, De Jager PL. Building a pipeline to discover and validate novel therapeutic targets and lead compounds for alzheimer’s disease. Biochem Pharmacol. 2014;88:617–30.View ArticlePubMedPubMed CentralGoogle Scholar
 Labadorf A, Hoss AG, Lagomarsino V, Latourelle JC, Hadzi TC, Bregu J, et al. RNA sequence analysis of human huntington disease brain reveals an extensive increase in inflammatory and developmental gene expression. Ariga H, editor. PLoS One. 2015;10:e0143563. Public Library of Science.View ArticlePubMedPubMed CentralGoogle Scholar
 McCullagh P, Nelder JA. Generalized linear models. Second. London: Chapman and Hall/CRC Press; 1989.View ArticleGoogle Scholar
 Gelman A, Jakulin A, Pittau MG, Su YS. A weakly informative default prior distribution for logistic and other regression models. Ann Appl Stat. 2008;2:1360–83.View ArticleGoogle Scholar
 Firth D. Bias reduction of maximum likelihood estimates. Biometrika. 1993;80:27–38.View ArticleGoogle Scholar
 Heinze G, Schemper M. A solution to the problem of separation in logistic regression. Stat Med. 2002;21:2409–19.View ArticlePubMedGoogle Scholar
 Han F, Pan W. A dataadaptive sum test for disease association with multiple common or rare variants. Hum Hered. 2010;70:42–54.View ArticlePubMedPubMed CentralGoogle Scholar
 Phipson B, Smyth GK. Permutation Pvalues should never be zero: calculating exact Pvalues when permutations Are randomly drawn. Stat Appl Genet Mol Biol. 2010;9:Article39.PubMedGoogle Scholar
 Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55:997–1004.View ArticlePubMedGoogle Scholar
 Furuta A, Martin L, Lin CL, DykesHoberg M, Rothstein JD. Cellular and synaptic localization of the neuronal glutamate transporters excitatory amino acid transporter 3 and 4. Neuroscience. 1997;81:1031–42.View ArticlePubMedGoogle Scholar
 McCullumsmith R. Striatal excitatory amino acid transporter transcript expression in schizophrenia, bipolar disorder, and major depressive disorder. Neuropsychopharmacology. 2002;26:368–75. Nature Publishing Group.View ArticlePubMedGoogle Scholar
 Utal A, Stopka A, Roy M, Coleman P. PEP19 immunohistochemistry defines the basal ganglia and associated structures in the adult human brain, and is dramatically reduced in Huntington’s disease. Neuroscience. 1998;86:1055–63.View ArticlePubMedGoogle Scholar
 Vannahme C, Schübel S, Herud M, Gösling S, Hülsmann H, Paulsson M, et al. Molecular cloning of testican2. J Neurochem. 2002;73:12–20.View ArticleGoogle Scholar
 Hadchouel A, Durrmeyer X, Bouzigon E, Incitti R, Huusko J, Jarreau PH, et al. Identification of SPOCK2 as a susceptibility gene for bronchopulmonary dysplasia. Am J Respir Crit Care Med. 2011;184:1164–70.View ArticlePubMedPubMed CentralGoogle Scholar
 Chung W, KwabiAddo B, Ittmann M, Jelinek J, Shen L, Yu Y, et al. Identification of novel tumor markers in prostate, colon and breast cancer by unbiased methylation profiling. PLoS One. 2008;3:e2079.View ArticlePubMedPubMed CentralGoogle Scholar