# multiDE: a dimension reduced model based statistical method for differential expression analysis using RNA-sequencing data with multiple treatment conditions

- Guangliang Kang†
^{1}, - Li Du†
^{1}and - Hong Zhang
^{1, 2}Email author

**17**:248

https://doi.org/10.1186/s12859-016-1111-9

© The Author(s) 2016

**Received: **30 January 2016

**Accepted: **2 June 2016

**Published: **22 June 2016

## Abstract

### Background

The growing complexity of biological experiment design based on high-throughput RNA sequencing (RNA-seq) is calling for more accommodative statistical tools. We focus on differential expression (DE) analysis using RNA-seq data in the presence of multiple treatment conditions.

### Results

We propose a novel method, multiDE, for facilitating DE analysis using RNA-seq read count data with multiple treatment conditions. The read count is assumed to follow a log-linear model incorporating two factors (i.e., condition and gene), where an interaction term is used to quantify the association between gene and condition. The number of the degrees of freedom is reduced to one through the first order decomposition of the interaction, leading to a dramatically power improvement in testing DE genes when the number of conditions is greater than two. In our simulation situations, multiDE outperformed the benchmark methods (i.e. edgeR and DESeq2) even if the underlying model was severely misspecified, and the power gain was increasing in the number of conditions. In the application to two real datasets, multiDE identified more biologically meaningful DE genes than the benchmark methods. An R package implementing multiDE is available publicly at http://homepage.fudan.edu.cn/zhangh/softwares/multiDE.

### Conclusions

When the number of conditions is two, multiDE performs comparably with the benchmark methods. When the number of conditions is greater than two, multiDE outperforms the benchmark methods.

## Keywords

## Background

High-throughput RNA sequencing (RNA-seq) technologies are emerging rapidly in recent years, which are widely used in biological studies [1]. One of the most important biological problems is to identify differentially expressed genes between multiple experimental conditions. As a result, the key of analyzing these data lies in establishing an appropriate statistical model for RNA-seq count data and consequently preforming differential expression (DE) analysis. Many statistical methods have been developed to fulfill this commission [2, 3]. Methods such as DEGSeq assume that the read count follows a Poisson distribution [4]. Nevertheless, this assumption is violated in the presence of variance overdispersion due to technological and biological variations. An additional parameter can be used to account for the overdispersion. DESeq2 and edgeR, for example, both assume a negative binomial distribution [5, 6]. Much more efficient estimates of the dispersion parameters are obtained using an empirical Bayesian method, resulting in a considerable improvement of DE analysis power and a better control of false positive rate if the sample size is very small. These two methods have been implemented in the Bioconductor packages edgeR and DESeq2, respectively.

In many biological studies, biologists are interested in identifying those genes differently expressed between multiple conditions. For example, the transcriptome of several stages of embryo development or several subtypes of cancer simultaneously enable them to get insight into the sophisticated biological mechanism on a much more comprehensive point [7, 8]. Moreover, the read counts of various conditions can be correlated with each other if they are generated from the same subject. This further complicates the statistical analysis. When the number of conditions, *D*, is greater than two, an analysis of variance (ANOVA) model can be used to detect those genes differentially expressed between the conditions. The chi-squared test based the ANOVA model has *D*−1 degrees of freedom.

To improve power of DE analysis, we recently proposed a 1-df test based on a rank-reduced logistic ANOVA model for logarithm of the expected read count, where the read count was assumed to follow a Poisson-lognormal distribution [9]. This method was termed PLN-ANOVA. In this paper, we present a novel framework committing to facilitate the DE analysis of RNA-seq read count data generated by experiments with multiple conditions, which takes the correlation between samples (if any) into account. The negative binomial distribution is used to model the read count, where the effects of gene and condition on the read counts are incorporated through a two-factor log-linear model with both main effects and interaction effects, while the association between gene and condition is quantified through *D* interaction terms for each of *G* genes. We propose to reduce the dimension of independent interaction parameters from (*D*−1)(*G*−1) to *D*+*G*−2. A rigorous statistical estimation/test procedure is developed in this paper, which could be much more efficient compared with the ones without dimension reduction when *D*>2. The major difference between the current paper and [9] are two-fold: in the distribution assumptions on the read count are different, and the former can deal with both matched samples and unmatched samples but the later can only deal with matched samples.

## Methods

In this section, we describe a rank-reduced model for the RNA-seq read counts of biological samples drawn from multiple conditions, then develop an estimation/test procedure for DE analysis.

### Notation and model

Let *X* denote the read count (i.e., the number of reads mapped to a reference) for any interested gene of any sample. Let *δ* denote the library size factor for that sample, then we can normalize the read count by *Y*=[*X*/*δ*], where [ *a*] is the integer closest to *a*. The size factor can be estimated by any normalization method to be described in the next subsection. As in many statistical methods for analyzing RNA-seq read counts, we assume that *Y* follows a negative binomial distribution. That is, there is an underlying gene expression *Z* that follows a gamma distribution, and *Y* follows a Poisson distribution with expectation *Z*. This way, *Y* marginally follows a compound Poisson-gamma distribution (i.e., negative binomial distribution). In the following, we extend the negative binomial distribution to model the read counts of biological samples drawn from multiple conditions, where the samples can be either independent or correlated with each other.

Consider a study involving *D* conditions, and *n*
_{
d
} samples are drawn from the *d*th condition (*d*=1,…,*D*). Let *X*
_{
idg
} denote the read count of the *g*th gene for the *i*th sample in the *d*th condition. Let *Y*
_{
idg
}=[*X*
_{
idg
}/*δ*
_{
id
}] be a normalized read count, where *δ*
_{
id
} is a size factor for the *i*th sample in the *d*th condition. We assume the following conditions hold:

(C1) *Y*
_{
i1g
},…,*Y*
_{
iDg
} are conditionally independent of each other given their underlying gene expressions *Z*
_{
i1g
},…,*Z*
_{
iDg
}, and *Z*
_{
idg
} follows a gamma distribution with expectation *μ*
_{
dg
} and variance \(\phi _{g}\mu _{dg}^{2}\). As a result, *Y*
_{
idg
} marginally follows the negative binomial distribution with expectation *μ*
_{
dg
} and dispersion parameter *ϕ*
_{
g
} (i.e., the variance of *Y*
_{
idg
} is \(\mu _{dg}+\phi _{g}\mu _{dg}^{2}\)).

(C2) The correlation coefficient of \(Z_{id_{1}g}\) and \(Z_{id_{2}g}\), \(\rho _{d_{1}d_{2}}\), is free of gene *g*, which is equal to 0 for *d*
_{1}≠*d*
_{2} in the unmatched situation.

where *μ* is the grand mean, *α*
_{
d
} is the main effect for condition *d*, *β*
_{
g
} is the main effect for gene *g*, and *γ*
_{
dg
}:=*u*
_{
d
}
*v*
_{
g
} is the interaction effect between gene *g* and condition *d*.

(C4) There are a sufficiently large number of DE genes between *G* genes, i.e., *γ*
_{
dg
}≠0 for a large number of genes. As remarked below, this implies that *u*
_{
d
}≠0 for at least one *d*.

*u*

_{1}=1, or equivalently

*α*

_{ d }should be equal to zero so that there is no systematic gene expression difference between the

*D*conditions. The main genetic effect

*β*

_{ g }quantifies the relative expression of gene

*g*. Since \(\log \mu _{d_{1}g}-\log \mu _{d_{2}g}=\gamma _{d_{1}g}-\gamma _{d_{2}g}\) provided

*α*

_{ d }=0,

*γ*

_{ dg }quantifies the relative association strength between condition

*d*and the expression level for gene

*g*. If there is at least one gene differentially expressed between the

*D*conditions, then

*γ*

_{ dg }≠0 for at least one

*g*, hence

*u*

_{ d }≠0 for at least one

*d*under decomposition (1). Therefore, the relative association strength can be measured by

*v*

_{ g }, and the null hypothesis that the

*g*th gene is differentially expressed between the

*D*conditions can be formulated by

The key idea of model (1) is to approximate the interaction effect *γ*
_{
dg
} using the product of two terms that depend on *d* and *g* separately. This reduces the dimension of test problem from *D*−1 (the corresponding null hypothesis is *H*
_{
g
}:*γ*
_{1g
}=…=*γ*
_{
Dg
}) to one (the corresponding null hypothesis is *H*
_{
g
}:*v*
_{
g
}=0) for gene *g*. Obviously, the resultant test could be much more powerful than the chi-square test based on the ANOVA model without such decomposition if *D*>2.

### Read count normalization

In this subsection, we consider the calculation of the size factor *δ*
_{
id
}. It is widely known that raw counts are not directly comparable between genes due to differential gene lengths and sequencing depths, and reads per kilobase per million reads (RPKM) can be used to correct the resultant technical bias [11]. In DE analysis between multiple conditions, the gene length does not affect the analysis result since such DE analysis focuses on the same gene. However, the condition comparison could greatly suffer from sample specific effects such as sequencing depth and sample specific GC-content effect. The sample specific GC-content effect could arise if two or more samples are sequenced in the same lane. Several within-lane normalization methods (i.e., regression normalization, global-scaling normalization, and full-quantile normalization) can be used to correct the resultant technical bias [12]. On the other hand, such effect can be absorbed into sample specific sequencing depth if only a single sample is sequenced in each lane, and the following four between-lane normalization methods are designed for correcting the technical bias due to sequencing depth.

where *m*
_{
id
} is the sample median of {*X*
_{
i
d1},…,*X*
_{
idG
}}.

where *s*
_{
id
} is the sum of {*X*
_{
i
d1},…,*X*
_{
idG
}}.

where *q*
_{
id
} is the 75th percentile of {*X*
_{
i
d1},…,*X*
_{
idG
}}.

*i*in condition

*d*and sample

*i*

^{′}in condition

*d*

^{′}, the log-fold changes (M-values) are defined as

*i*

_{0}in condition

*d*

_{0}as a reference sample, then the size factor for sample

*i*in condition

*d*is defined as

*G*

^{∗}is the set of those genes with the upper and lower 30 % of the M-values removed. In edgeR that implements TMM, the sample with upper quartile closest to the mean upper quartile across all samples is used as the default reference. To minimize the variance of

*δ*

_{ id }, here the weight \(w_{idg}^{i_{0}d_{0}}\) is defined as inverse of the estimated variance of M-values:

*X*

_{ idg }can be normalized by

### Parameter estimation

In this subsection, we derive estimators of *μ*, *α*
_{
d
}, *β*
_{
g
}, *u*
_{
d
}, *v*
_{
g
}, *ϕ*
_{
g
}, and \(\rho _{d_{1}d_{2}}\). We first describe two existing methods for estimating *ϕ*
_{
g
}. Then we develop a simple estimation equation method for estimating *μ*, *α*
_{
d
}, *β*
_{
g
}, *u*
_{
d
}, and *v*
_{
g
}. Finally, we propose a method for estimating \(\rho _{d_{1}d_{2}}\) in the matched sample situation.

First, we estimate the dispersion parameter *ϕ*
_{
g
} using an empirical Bayes method or a parametric method via a robust gamma-family generalized linear model. The two dispersion estimation methods have been implemented in the Bioconductor packages edgeR and DESeq2, respectively.

*μ*,

*α*

_{ d },

*β*

_{ g },

*u*

_{ d }, and

*v*

_{ g }. Denote

*η*

_{ dg }= log

*μ*

_{ dg }, which can be estimated by

*μ*,

*α*

_{ d },

*β*

_{ g }, and

*γ*

_{ dg }:

*u*

_{ d }and

*v*

_{ g }[15]. Noting that \(\hat {\gamma }_{dg}\) has an asymptotic variance proportional to \(n_{d}^{-1}\), we can minimize the weighted sum of squares

*v*

_{ g }=0 and the corresponding information is pure noise for estimating

*u*

_{ d }. To eliminate the impact of such noise, we can modify (15) as

where *S* is a DE gene set determined by any existing method such as edgeR. Equations (16) and (17) can be easily solved in an iterative manner, and the algorithm for all genes can be greatly sped up via vectorization.

*Y*

_{1d g },…,

*Y*

_{ ndg }are identically distributed, we can estimate \(\text {corr}(Y_{id_{1}g},Y_{id_{2}g})\) with the sample correlation coefficient of \((Y_{1d_{1}g},Y_{1d_{2}g}),\ldots,(Y_{nd_{1}g},Y_{nd_{2}g})\) times a continuity correct term proposed by [16], which is denoted by \(r_{d_{1}d_{2}g}\). Here the continuity correct term is used to reduce the estimation bias due to a very small sample size. An estimator of \(\rho _{d_{1}d_{2}}\) takes the form

To construct a Wald test statistic for testing *H*
_{
g
}:*v*
_{
g
}=0, we need to estimate the variance of \(\hat v_{g}\), as detailed in the next subsection.

### Variance estimation and Wald test

*n*

_{ d }=

*n*for

*d*=1,…,

*D*, we need to further estimate \(\text {cov}(\hat {\eta }_{d_{1}g},\hat {\eta }_{d_{2}g})\) for

*d*

_{1}≠

*d*

_{2}. The resultant estimate of \(\text {var}(\hat v_{g})\) in the matched sample situation, \(\widehat {\text {var}}({\hat {v}_{g}})\), takes the form

whose null limiting distribution is the chi-squared distribution with one degree of freedom.

## Results

To evaluate the performance of the proposed method, we conducted both simulation studies and real data analyses. For comparison purpose, we considered two benchmark methods edgeR (version 3.12.1) and DESeq2 (version 1.10.1). The functions *estimateGLMTagwiseDisp* and *estimateDispersions* were used to estimate tagwise negative binomial dispersions in edgeR and DESeq2, respectively. In these two methods, the likelihood ratio tests were used for DE analysis and the default parameters were adopted. We also considered our recently developed method PLN-ANOVA, which was designed for matched samples. In both simulation studies and real data applications for multiDE, the dispersion estimates given by edgeR and DESeq2 produced DE analysis results that were very close to each other. Therefore, in the following studies, we only present the results of multiDE with the dispersion estimates given by edgeR.

### Simulation studies

We considered the situation where only a single sample was sequenced in each lane, as commonly done in real applications, so that the effect of sample specific GC-content can be absorbed into the library size. First we evaluated the four normalization methods using simulation data generated from a real dataset. Then we conducted a comparison study between two benchmark methods (i.e., edgeR and DESeq2) and multiDE equipped with MEDIAN using simulation data generated under model (1). Finally, we conducted a sensitivity analysis by generating data without the assumption of model (1).

#### Simulation based on a real dataset

In multiDE, any of the four methods (namely, MEDIAN, TOTAL, QUANTILE, and TMM) can be used to normalize the read counts.

We generated simulation data using a real data set from an embryonic stem cells study described in the next subsection. In this study, RNA-seq read counts at 6,526 genes were available from nine unrelated individuals. Three different conditions were considered, and each condition had three individuals. Refer to the next subsection for detailed description of this study. We randomly permutated the condition information for the nine samples, then randomly selected 600 DE genes. In each of these DE genes, the read counts of the second condition were the original ones multiplied by a factor of 1.1, and the read counts of the third condition were the original ones multiplied by a factor of 0.9. The rest 5,926 genes were regarded as equally expressed genes. Such experiment was repeated for 400 times, and the four normalization methods were applied to these datasets.

We also applied the two benchmark methods edgeR and DESeq2 to the above simulation datasets. Evidently, multiDE outperformed edgeR and DESeq2 even using the MEDIAN (Fig. 1), we therefore focus on this normalization method in the next two subsections.

#### Simulation based on model (1)

We considered two situations for generating the read count *X*
_{
idg
}, one was for unmatched samples and the other was for matched samples. The parameters shared the two situations were set to be the same. We let *D* = 2, 3, 4, or 5, and fixed the total number of genes to be *G*=10^{4}. Then we set *n*
_{
d
}=*D*+*d*−1 for *d*=1,…,*D* in the unmatched sample situation, and *n*=4 in the matched sample situation.

*μ*=4,

*α*

_{ d }=0 for

*d*=1,…,

*D*,

*v*

_{ g }=0 for equally expressed genes

*g*=1,…,9000,

*v*

_{ g }∼−|

*N*(0,0.32)| for up-regulated genes

*g*=9001,…,9500, and

*v*

_{ g }∼|

*N*(0,0.32)| for down-regulated genes

*g*=9501,…,10000. In addition, we randomly generated size factors

*δ*

_{ id }from the log-normal distribution with mean and varariance parameters 0 and 0.25

^{2}, main genetic effects

*β*

_{ g }from the normal distribution with mean 0 and variance 0.25

^{2}, dispersion parameters

*ϕ*

_{ g }from the gamma distribution with shape parameter 5 and rate parameter 20, where the shape and rate parameters were close to those for the embryonic stem cells data to be analyzed in the next subsection.

With the above parameters, we then generated underlying gene expressions (*Z*
_{
i1g
},…,*Z*
_{
iDg
}) from multivariate gamma distribution with correlation parameters \(\{\rho _{d_{1}d_{2}}:1\leq d_{1}\leq d_{2}\leq D\}\) using the Bioconductor package copula [17], and generated the read counts *X*
_{
idg
} from the Poisson distribution with expectation *Z*
_{
idg
}. We let \(\rho _{d_{1}d_{2}}=0\) in the unmatched situation and \(\rho _{d_{1}d_{2}}\sim U(0.2, 0.4)\) in the matched situation.

*u*

_{ d }and

*v*

_{ g }were estimated using multiDE. To evaluate the estimation accuracy of multiDE, we report in Figs. 2, 3, 4 and 5 the

*u*

_{ d }estimates and the biases of

*v*

_{ g }estimates in both matched and unmatched sample situations. Overall, the

*u*

_{ d }estimates were virtually close to the true ones and the estimation biases of

*v*

_{ g }were minor.

*p*-values for testing

*H*

_{0}:

*v*

_{ g }=0 using the four considered methods (PLN-ANOVA was applied only in the matched sample situation), and evaluated the performance of these methods through ROC curves and FDRs (Figs. 6, 7, 8 and 9). For

*N*=1,…,2000, we evaluated the FDRs of top

*N*identified DE genes. The FDR-adjusted

*p*-values were calculated using the R function

*p.adjust*, and the empirical FDRs were consequently obtained. In terms of FDRs, the four methods performed quite comparably when

*D*=2. On the other hand, multiDE and PLN-ANOVA had smaller FDRs compared with edgeR and DESeq2 when

*D*>2. According to the ROC curves, when

*D*=2, multiDE was comparable with the other two methods in the unmatched sample situation, and the former was even slightly more powerful in the matched sample situation. When

*D*>2, multiDE was evidently more powerful than the other methods including PLN-ANOVA, and the power gain steadily went up as the number of conditions increased.

#### Sensitivity analysis

*γ*

_{ dg }was generated from the uniform distribution on the interval (−0.75,0.75) for genes 9,001 through 10,000. Other parameters setting and data generation process were the same as before. We only present the results for unmatched sample situation since the results for matched sample situation were similar. Again, multiDE outperformed edgeR and DESeq2 in terms of both FDRs and ROC curves (Figs. 10 and 11). Interestingly, model misspecification did not alter the performance advantage of multiDE over the other methods.

### Real data analyses

In order to evaluate their relative performance on real data, multiDE, edgeR, and DESeq2 were applied to data from two biological experiments with multiple treatment conditions. Furthermore, PLN-ANOVA was applied to the first dataset with matched samples. In each of these two experiments, only a single sample was sequenced in each lane, so we only used between-lane normalization methods.

#### Psoriatic study

In this study, the major interest was to detect the influence of aryl hydrocarbon receptor (AhR) on RNA expression profiles of psoriatic lesion cells [18]. Each of eight patients were treated with culture treatment of DMSO (vehicle control), AhR agonist FICZ, and AhR antagonist CH-2233191. RNA-seq data were obtained using Illumina Genome Analyzer II platform for each of three treated lesion tissue samples. Therefore, this was a matched sample design. The RNA-seq read counts were derived from the GEO database (accession ID: GSE47944). We kept 13,416 genes with maximal read counts greater than 50 in each treatment condition.

With multiDE, the estimated *u*
_{
d
} for the three conditions (vehicle control, AhR agonist, and AhR antagonist) were 1, −4.20, and 3.20, respectively, which coincided with the fact that AhR-activating ligands reduced inflammation in the lesion of psoriasis patients and AhR antagonists upregulated inflammation.

*α*+ dendritic cells, the most powerful antigen presenting cell during inflammation [19]. The up regulation of HRNR, which encoded granular layer keratin bundling proteins, was closely related to psoriatic lesions [20]. As a chemoattractant of a serial of immune cells, encoded protein of CXCL10 also played an important role in psoriatic immune abnormality [21].

*P*-values of DE analysis for five housekeeping genes in the psoriatic study

Method | ACTB | GAPDH | NONO | PGK1 | PPIH |
---|---|---|---|---|---|

multiDE | 0.1081 | 0.3010 | 0.8429 | 0.1925 | 0.5984 |

edgeR | 0.1814 | 0.2139 | 0.9340 | 0.1907 | 0.8923 |

DESeq2 | 0.2247 | 0.1852 | 0.8984 | 0.1608 | 0.8084 |

PLN-ANOVA | 0.1411 | 0.3862 | 0.8282 | 0.2563 | 0.6134 |

#### Embryonic stem cells study

The second dataset on a study of Homo sapiens hormone embryonic stem cells was downloaded from the NCBI GEO database (accession ID: GSE36552). To find causal relationship between gene expression network and cellular phenotype, Yan et al. derived embryonic stem cells from donated human pre-implantation embryos, prepared cDNA and sequenced them by Illumina HiSeq 2000 [23].

RNA-seq samples were obtained from the embryonic stem cells of nine unrelated individuals, so this was an unmatched design and PLN-ANOVA was not applicable. The embryonic stem cells were obtained at the 2-cell stage, three at the 4-cell stage, and the other three at the 8-cell stage. We aligned the downloaded RNA-seq reads to human reference genome hg19 (UCSC release) using the bioinformatics tool *TopHat* [24], and counted the reads for each gene using the Python program *htseq-count* [25]. Altogether, 6,526 genes with the maximal counts greater than 50 in each treatment condition were kept.

With multiDE, the estimated *u*
_{
d
} for the three conditions (2-cell stage, 4-cell stage, and 8-cell stage) were 1, 0.94 and, −1.94, respectively, indicating that the gene expression difference between the 2-cell and 8-cell stages could be generally large, while the difference between the 2-cell and the 4-cell stages was generally minor.

Exclusively enriched GO terms using identified DE genes in the embryonic stem cells study

GO code | GO term description |
| |
---|---|---|---|

multiDE | GO:0030163 [BP] | Protein catabolic process | 1.72e-04 |

GO:0044257 [BP] | Cellular protein catabolic | 5.60e-04 | |

process | |||

GO:0005829 [CC] | Cytosol | 4.00e-03 | |

GO:0030530 [CC] | Nuclear ribonucleoprotein | 6.16e-03 | |

complex | |||

GO:0019787 [MF] | Small conjugating protein | 2.71e-02 | |

ligase activity | |||

GO:0031967 [CC] | Organelle envelope | 4.32e-02 | |

edgeR | GO:0010468 [BP] | Regulation of gene expression | 1.29e-03 |

GO:0022618 [BP] | Ribonucleoprotein complex | 1.27e-02 | |

assembly | |||

GO:0044452 [CC] | Nucleolar part | 1.61e-02 | |

GO:0006413 [BP] | Translational initiation | 4.86e-02 | |

DESeq2 | GO:0000279 [BP] | M phase | 1.31e-04 |

GO:0005819 [CC] | spindle | 3.70e-04 | |

GO:0030880 [CC] | RNA polymerase complex | 2.07e-03 | |

GO:0006259 [BP] | DNA metabolic process | 2.83e-03 |

*P*-values of DE analysis for four housekeeping genes in embryonic stem cells study

Methods | ACTB | GAPDH | PGK1 | PPIH |
---|---|---|---|---|

multiDE | 0.5419 | 0.1064 | 0.2056 | 0.7460 |

edgeR | 0.5351 | 0.1845 | 0.4914 | 0.9021 |

DESeq2 | 0.6540 | 0.0151 | 0.6484 | 0.9666 |

## Discussion

Models for fitting the distribution of read count data are essential for detecting DE genes. In experiments involving multiple conditions, it would be of great interest to detect those genes that are differentially expressed between at least two conditions. The traditional statistical methods are generally based on an ANOVA like framework, and the number of the degrees of freedom for detecting DE genes is equal to *D*−1 (*D* is the number of conditions). In this paper, we propose to reduce the number of the degrees of freedom from *D*−1 to one based on a new dimension reduced model. The new method multiDE based on this model can handle both matched and unmatched samples. If *D*>2, multiDE greatly outpermed the existing methods in our simulation studies, even if the model used to generate data was severely misspecified.

If only a single sample is sequenced in each lane, one needs only to correct the technical bias due to library size effect since the lane-specific efect can be absorbed into the library size factors. In multiDE, there are various options for estimating size factors. Four between-lane normalization methods (i.e., MEDIAN, TOTAL, QUANTILE, and TMM) can be used to estimate size factors. In our simulation study based on a real dataset, TMM slightly outperformed the other three normalization methods. Furthermore, in [14], TMM had been shown to be robust against outlying read counts and DE genes, and it outperformed other methods in simulation studies. Therefore, we recommend TMM in real data applications. If two or more samples are sequenced in the same lane, it would be advantageous to use any within-lane normalization method before between-lane normalization [12].

In multiDE, two methods implemented in edgeR and DESeq2 can be used to estimate dispersion parameters. The two dispersion estimation methods performed comparably in our simulation studies. When estimating *u*
_{
d
}, the size of gene set *S* can be specified to be the number of significant DE genes (after FDR adjustment) by any existing method such as edgeR or DESeq2.

## Conclusions

In this paper, the new method multiDE is developed based on a dimension-reduced model for the purpose of detecting DE genes between multiple conditions. Through both simulation studies and real data applications, multiDE was shown to outperform the existing benchmark methods. The proposed method multiDE has been implemented in an R package. This package requires that each condition has at least two biological replications, it takes RNA-seq read counts as input data and can be used to estimate fold changes and to conduct Wald tests for detecting DE genes between various conditions. Three functions are provided in multiDE, namely *normalization*, *dispersion*, and *multiDE*, which can be used to calculate size factors using four normalization methods (i.e., MEDIAN, TOTAL, QUANTILE, and TMM), to estimate dispersion parameters using two methods provided in edgeR and DESeq2, respectively, and to calculate DE *p*-values and fold changes and their standard errors. Using a desktop computer with a 3.20GHz CPU, it took multiDE only a few seconds to analyze two real datasets.

## Abbreviations

AhR, aryl hydrocarbon receptor; DE, differentially expressed; FDR, false discovery rate; NB, negative binomial; ROC, receiver operating characteristic

## Declarations

### ᅟ

ᅟ

### Funding

HZ was partially supported by the State Key Development Program for Basic Research of China 2012CB316505 and the National Natural Science Foundation of China 11371101.

### Availability of data and materials

Source code repository: http://homepage.fudan.edu.cn/zhangh/softwares/multiDE/. Psoriatic study dataset: Downloaded from the GEO database with accession ID GSE47944. Embryonic stem cells study dataset: Downloaded from the GEO database with accession ID GSE36552.

### Authors’ contributions

GK contributed to drafting the manuscript, developing the new statistical method, and analyzing both the simulation data and real data. LD contributed to analyzing the real data analysis and drafting the manuscript. HZ conceived the idea and designed the model. All authors read and approved the final manuscript.

### Competing interests

The authors declare that they have no competing interests.

### Consent for publication

Not applicable.

### Ethics approval and consent to participate

Not applicable.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## Authors’ Affiliations

## References

- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008; 5(7):621–8.View ArticlePubMedGoogle Scholar
- Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010; 11(1):94.View ArticlePubMedPubMed CentralGoogle Scholar
- Kvam VM, Liu P, Si Y. A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. Am J Bot. 2012; 99(2):248–56.View ArticlePubMedGoogle Scholar
- Wang L, Feng Z, Wang X, Wang X, Zhang X. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics. 2010; 26(1):136–8.View ArticlePubMedGoogle Scholar
- Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010; 11(10):106.View ArticleGoogle Scholar
- Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26(1):139–40.View ArticlePubMedGoogle Scholar
- Brandenberger R, Wei H, Zhang S, Lei S, Murage J, Fisk GJ, Li Y, Xu C, Fang R, Guegler K, et al. Transcriptome characterization elucidates signaling networks that control human ES cell growth and differentiation. Nat Biotechnol. 2004; 22(6):707–16.View ArticlePubMedGoogle Scholar
- Goldhirsch A, Wood W, Coates A, Gelber R, Thürlimann B, Senn HJ, et al. Strategies for subtypes–dealing with the diversity of breast cancer: highlights of the St Gallen International Expert Consensus on the Primary Therapy of Early Breast Cancer 2011. Ann Oncol. 2011; 22(8):1736–47.View ArticlePubMedPubMed CentralGoogle Scholar
- Zhang H, Xu J, Jiang N, Hu X, Luo Z. PLNseq: a multivariate Poisson lognormal distribution for high-throughput matched RNA-sequencing read count data. Stat Med. 2015; 34(9):1577–89.View ArticlePubMedGoogle Scholar
- Jung Y, Huang JZ, Hu J. Biomarker detection in association studies: modeling SNPs simultaneously via logistic ANOVA. J Am Stat Assoc. 2014; 109(508):1355–67.View ArticlePubMedPubMed CentralGoogle Scholar
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, Van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010; 28(5):511–5.View ArticlePubMedPubMed CentralGoogle Scholar
- Risso D, Schwartz K, Sherlock G, Dudoit S. GC-content normalization for RNA-Seq data. BMC Bioinformatics. 2011; 12(1):480.View ArticlePubMedPubMed CentralGoogle Scholar
- Li J, Witten DM, Johnstone IM, Tibshirani R. Normalization, testing, and false discovery rate estimation for rna-sequencing data. Biostatistics. 2011; 13(3):523–38.View ArticlePubMedPubMed CentralGoogle Scholar
- Robinson MD, Oshlack A, et al. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010; 11(3):25.View ArticleGoogle Scholar
- Ruppert D, Wand MP. Multivariate locally weighted least squares regression. Ann Stat. 1994; 22(3):1346–70.View ArticleGoogle Scholar
- Olkin I, Pratt JW. Unbiased estimation of certain correlation coefficients. Ann Math Stat. 1958; 29(1):201–11.View ArticleGoogle Scholar
- Kojadinovic I, Yan J, et al. Modeling multivariate distributions with continuous margins using the copula R package. J Stat Softw. 2010; 34(9):1–20.View ArticleGoogle Scholar
- Di Meglio P, Duarte JH, Ahlfors H, Owens ND, Li Y, Villanova F, Tosi I, Hirota K, Nestle FO, Mrowietz U, Gilchrist MJ, Stockinger B. Activation of the aryl hydrocarbon receptor dampens the severity of inflammatory skin conditions. Immunity. 2014; 40(6):989–1001.View ArticlePubMedPubMed CentralGoogle Scholar
- Pasparakis M, Haase I, Nestle FO. Mechanisms regulating skin immunity and inflammation. Nat Rev Immunol. 2014; 14(5):289–301.View ArticlePubMedGoogle Scholar
- Roberson ED, Bowcock AM. Psoriasis genetics: breaking the barrier. Trends Genet. 2010; 26(9):415–23.View ArticlePubMedPubMed CentralGoogle Scholar
- Nomura I, Gao B, Boguniewicz M, Darst MA, Travers JB, Leung DY. Distinct patterns of gene expression in the skin lesions of atopic dermatitis and psoriasis: a gene microarray analysis. J Allergy Clin Immunol. 2003; 112(6):1195–202.View ArticlePubMedGoogle Scholar
- Zhu J, He F, Hu S, Yu J. On the nature of human housekeeping genes. Trends Genet. 2008; 24(10):481–4.View ArticlePubMedGoogle Scholar
- Yan L, Yang M, Guo H, Yang L, Wu J, Li R, Liu P, Lian Y, Zheng X, Yan J, et al. Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells. Nat Struct Mol Biol. 2013; 20(9):1131–9.View ArticlePubMedGoogle Scholar
- Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009; 25(9):1105–11.View ArticlePubMedPubMed CentralGoogle Scholar
- Anders S, Pyl PT, Huber W. HTSeq–A Python framework to work with high-throughput sequencing data. Bioinformatics. 2004; 31(2):166–169.View ArticleGoogle Scholar
- Huang D, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2008; 4(1):44–57.View ArticleGoogle Scholar
- Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009; 37(1):1–13.View ArticleGoogle Scholar
- Frank KM, Sharpless NE, Gao Y, Sekiguchi JM, Ferguson DO, Zhu C, Manis JP, Horner J, DePinho RA, Alt FW. DNA ligase IV deficiency in mice leads to defective neurogenesis and embryonic lethality via the p53 pathway. Mol Cell. 2000; 5(6):993–1002.View ArticlePubMedGoogle Scholar
- Zhu H, Kavsak P, Abdollah S, Wrana JL, Thomsen GH. A SMAD ubiquitin ligase targets the BMP pathway and affects embryonic pattern formation. Nature. 1999; 400(6745):687–93.View ArticlePubMedGoogle Scholar
- Jurisicova A, Acton BM. Deadly decisions: the role of genes regulating programmed cell death in human preimplantation embryo development. Reproduction. 2004; 128(3):281–91.View ArticlePubMedGoogle Scholar
- Antczak M, Van Blerkom J. Temporal and spatial aspects of fragmentation in early human embryos: possible effects on developmental competence and association with the differential elimination of regulatory proteins from polarized domains. Hum Reprod. 1999; 14(2):429–47.View ArticlePubMedGoogle Scholar