Confident difference criterion: a new Bayesian differentially expressed gene selection algorithm with applications
 Fang Yu^{1}Email author,
 MingHui Chen^{2},
 Lynn Kuo^{2},
 Heather Talbott^{3} and
 John S. Davis^{4}
https://doi.org/10.1186/s1285901506643
© Yu et al.; licensee BioMed Central. 2015
Received: 24 October 2014
Accepted: 7 July 2015
Published: 7 August 2015
Abstract
Background
Recently, the Bayesian method becomes more popular for analyzing high dimensional gene expression data as it allows us to borrow information across different genes and provides powerful estimators for evaluating gene expression levels. It is crucial to develop a simple but efficient gene selection algorithm for detecting differentially expressed (DE) genes based on the Bayesian estimators.
Results
In this paper, by extending the twocriterion idea of Chen et al. (Chen MH, Ibrahim JG, Chi YY. A new class of mixture models for differential gene expression in DNA microarray data. J Stat Plan Inference. 2008;138:387–404), we propose two new gene selection algorithms for general Bayesian models and name these new methods as the confident difference criterion methods. One is based on the standardized differences between two mean expression values among genes; the other adds the differences between two variances to it. The proposed confident difference criterion methods first evaluate the posterior probability of a gene having different gene expressions between competitive samples and then declare a gene to be DE if the posterior probability is large. The theoretical connection between the proposed first method based on the means and the Bayes factor approach proposed by Yu et al. (Yu F, Chen MH, Kuo L. Detecting differentially expressed genes using alibrated Bayes factors. Statistica Sinica. 2008;18:783–802) is established under the normalnormalmodel with equal variances between two samples. The empirical performance of the proposed methods is examined and compared to those of several existing methods via several simulations. The results from these simulation studies show that the proposed confident difference criterion methods outperform the existing methods when comparing gene expressions across different conditions for both microarray studies and sequencebased highthroughput studies. A real dataset is used to further demonstrate the proposed methodology. In the real data application, the confident difference criterion methods successfully identified more clinically important DE genes than the other methods.
Conclusion
The confident difference criterion method proposed in this paper provides a new efficient approach for both microarray studies and sequencebased highthroughput studies to identify differentially expressed genes.
Keywords
Background
In the past decade, highthroughput molecular technologies have gained great popularity in gene expression profiling due to their capability of producing thousands of measurements for each of the assayed samples. The microarray technology and nextgeneration sequencing are two widely used highthroughput technologies. Nextgeneration sequencing improves upon Sanger dideoxy sequencing so that the number of sequencing reactions in a single run can be in millions. For example, in Nature (2008), Bentley et al. [4] and Wang et al. [34] reported the DNA sequence of a Nigerian individual and an Asian individual, respectively. Ley et al. [18] analyzed the genome sequence of a tumor sample. One common scientific question addressed by these highthroughput experiments is to identify the genes with differential expression between two biological conditions. Although the highthroughput technologies offer us rich biological information, they are highly errorprone because many genes are monitored at the same time with a relatively small sample size. Bayesian methods provide a good solution to this problem because they synthesize all the data by borrowing information across different genes and produce more efficient estimators for evaluating the gene expressions. They include linear models in LIMMA [28] where empirical Bayesian methods were used to obtain stable results even with small sample size. A more detailed description of the Bayesian statistical methods for microarray studies can be found in Dudoit et al. [7], Pan [25], and Kuo et al. [15]. Other Bayesian methods for RNASeq studies using next generation sequencing were reviewed by Kvam et al. [16] and Soneson and Delorenzi [29].
Yu et al. [36] pointed out that most statistical methods for microarray studies examined the differential expressions by testing on the equality of means of the logtransformed intensities between the treatment and control, which may not be appropriate for data with complex structures (for example, a mixture normal distributions with multiple modes). They proposed a calibrated Bayes factor (CBF) method to evaluate the ratio of the full data marginal likelihood under the alternative hypothesis that a gene is differentially expressed (DE) relative to that for the null hypothesis that a gene is equivalently expressed (EE) between two biological conditions. Although their approach has the potential for handling data with more complicated distributions, the computational cost of their method may increase greatly with the complexity of the model.
Chen et al. [6] employed a class of mixture models with two components to fit the microarray data with two biological conditions. To evaluate the differential expressions for each gene, they proposed a gene selection algorithm, namely the twocriterion method. Specifically, they calculated a posterior probability that there is at least a twofold change between the mean values of raw intensities under the two considered conditions. Then a gene is declared to be DE if the resulting posterior probability is large (say at least 0.7). Since the posterior probability is readily available once a Markov chain Monte Carlo sample is drawn from the posterior distribution, the gene selection algorithm proposed by them is quite easy to implement and computationally inexpensive. However, their approach does not consider general data distributions as that in the Bayes factor approach given by Yu et al. [36]. Assuming that the data under each biological condition follow a lognormal distribution as in [6], the mean value of raw intensities equals to exp(mean+variance/2) under each condition. Thus, the twocriterion method proposed by Chen et al. [6] that calculates the ratio of two means of the raw intensities depends on not only the difference between the two transformed means but also the difference between their variances. So, when the differences between the means and the differences between the variances are in opposite directions, the Chen et al. method may not be able to detect DE genes. Additionally, their paper neither provides a guidance on controlling the false discovery rate (FDR) nor carries out the performance comparison with other existing methods.
Our goal in this paper is to develop a simple but efficient gene selection algorithm so that it is not only computationally efficient, but also flexible in handling data with a complicated distribution as in Yu et al. [36]. We redevelop the twocriterion method proposed by Chen et al. [6] and construct two new gene selection algorithms for general Bayesian models. One is based on the differences between means and the other is based on both mean differences and variance differences. To differentiate the method proposed in Chen et al. [6], we name our methods as confident difference criterion methods and the two proposed confident difference criterion methods in this paper as Methods I and II. We show that the Method I, which compares the mean expressions from different conditions, is equivalent to the calibrated Bayes factor approach [36] when the raw intensities from two different biological conditions follow lognormal distributions with equal variance. We also address the multiple comparisons issue with a control of the false discovery rate. We further apply the proposed method to carry out analyses of microarray data with more than two conditions as well as sequencebased RNA data.
Method
Model for microarray data
We assume that the data, denoted by D _{ obs }, have already been preprocessed with appropriate transformation and normalization. Let T be the total number of biological conditions in the study. The data may contain two biological conditions (T=2) or multiple biological conditions (T>2). The common analytical objective is to detect differentially expressed (DE) genes across different biological conditions.
Let x _{ gtk } denote the preprocessed expression intensity of the g ^{ t h } gene in the k ^{ t h } sample under the t ^{ t h } condition for t=1,⋯,T. There are a total of G genes with sample size n _{ gt } under condition t. Thus, the data on gene g under each condition can be summarized using a vector: \(\mathbf {X}_{\textit {gt}}=(x_{gt1},\ldots,x_{gtn_{\textit {gt}}})\phantom {\dot {i}\!}\). We assume that the intensity, x _{ gtk }, k=1,⋯,n _{ gt }, t=1,2,⋯,T, follows a normal distribution \(\mathcal {N}\left (\mu _{\textit {gt}}, \sigma ^{2}_{\textit {gt}}\right)\) independently. The parameters μ _{ gt } and \(\sigma ^{2}_{\textit {gt}}\) denote the mean and variance of the intensities of gene g under condition t, respectively. We write the mean intensities as μ _{ gt } = μ _{ g } + δ _{ gt }/T, t = 1,…,T, where \(\sum ^{T}_{t=1}\delta _{\textit {gt}}\,=\,0\). For simplicity, we set \(\delta _{g1}=\sum ^{T}_{t=2} \delta _{\textit {gt}}\) under the first biological condition. We note that μ _{ g } defines the overall mean of the intensities across all biological conditions, and δ _{ gt }/T measures the difference in the mean intensity under biological condition t from the overall mean. In a microarray study with two biological conditions (T=2), the mean intensities μ _{ g1} and μ _{ g2} are written as μ _{ g1}=μ _{ g }−δ _{ g }/2 and μ _{ g2}=μ _{ g }+δ _{ g }/2, respectively. When a gene is DE, we expect that the distributions of the data differ at least under two biological conditions.
Hierarchial prior distributions
Noninformative conditionally conjugate priors are specified for all parameters. Specifically, we assume that the mean parameters \(\mu _{g} \sim \mathcal {N}(0,\tau ^{2})\) and \(\delta _{\textit {gt}} \sim \mathcal {N} (0, \omega ^{2})\) for t>1 and any g, and the variance parameters \(\sigma ^{2}_{\textit {gt}} \sim \mathcal {IG}(a_{t},b_{t})\). We set the variance parameters τ ^{2} and ω ^{2} in the normal priors to be 100 to obtain relatively noninformative priors. The shape parameter a _{ t } in the inverse gamma prior is set to be 2, so that the prior mean of \(\sigma _{\textit {gt}}^{2}\) equals b _{ t }. We further let the scale parameter b _{ t } follow a conditionally conjugate gamma prior with \(b_{t} \sim \mathcal {G}(c,d)\), where the hyperparameter c is specified as 1 and the hyperparameter \(d \sim \mathcal {IG}(a_{d},b_{d})\), in which a _{ d } and b _{ d } are both set to be 0.01 in the simulation study and 1 in the real data analysis. Our hierarchical priors for the variance parameters, which are often difficult to estimate, allow for borrowing the information across genes via \(b_{t} \sim \mathcal {G}(c,d)\) as well as biological conditions via \(d \sim \mathcal {IG}(a_{d},b_{d})\). We intend to specify a noninformative inversegamma prior for the parameter d. The value of “1" was specified for both a _{ d } and b _{ d } in the real data analysis since the real data had a smaller sample size than the simulated data in the simulation study. These values of the hyperparameters still led to noninformative priors since the prior mean and variance of d do not exist. However, these values allowed us to borrow a little but not too much information across different biological conditions under comparison.
Conditional posterior distributions
Let \(\bar {x}_{\textit {gt}}\) denote the average intensities of gene g under condition t and also let the vector \(\bar {\mathbf {X}}_{g}=\{\bar {x}_{g1},\cdots,\bar {x}_{\textit {gT}}\}\) denote the average intensities for gene g. Then, \(\bar {\mathbf {X}}_{g}\) follows a multivariate normal distribution with \(\bar {\mathbf {X}}_{g} \sim \mathcal {N}(\mathbf {A}\mathbf {\Theta }_{g},\mathbf {\Sigma }_{g})\), where Θ _{ g }=(μ _{ g },δ _{ g2},⋯,δ _{ gT })^{′} is a column vector of size T, and Σ _{ g } is a diagonal matrix of size T×T with the t ^{ t h } diagonal element \((\mathbf {\Sigma }_{g})_{t,t}=\sigma ^{2}_{\textit {gt}}/n_{\textit {gt}}\). Here A is a T×T matrix, in which all elements in the first column equals one, i.e., A _{ t1}=1 for t=1,⋯,T, all but the first element in the first row equals −1/T, i.e., A _{1t }=−1/T for t=2,⋯,T, all but the first diagonal element equals 1/T, i.e., A _{ t,t }=1/T for t=2,⋯,T, and all other elements equal zero. Since the parameters in Θ _{ g } independently follow normal prior distributions, then \(\mathbf {\Theta }_{g} \sim \mathcal {N}(\mathbf {0},\mathbf {\Sigma }_{0})\), where 0 is a vector of size T containing all zero’s and Σ _{0} is a diagonal matrix with the first diagonal element (Σ _{0})_{1,1}=τ ^{2} and all other diagonal elements equal to ω ^{2}, i.e., (Σ _{0})_{ t,t }=ω ^{2} for t=2,⋯,T. Therefore, the conditional posterior distribution of Θ _{ g } is a multivariate normal distribution with Θ _{ g }∼N(U _{ g },B _{ g }), where the inverse of the variance matrix \(\mathbf {B}_{g}^{1}=\left (\mathbf {A}'\mathbf {\Sigma }_{g}^{1} \mathbf {A}+\mathbf {\Sigma }_{0}^{1}\right)\), and the mean vector \(\mathbf {U}_{g}= \mathbf {B}_{g} \mathbf {A}' \mathbf {\Sigma }_{g}^{1} \bar {\mathbf {X}}_{g}\). The conditional posterior distribution of the variance parameter \(\sigma _{\textit {gt}}^{2}\) is an inversegamma distribution with \(\sigma _{\textit {gt}}^{2} \sim \mathcal {IG}\left (a_{t}+\frac {1}{2}n_{\textit {gt}}, b_{t}+\frac {1}{2}\sum _{k=1}^{n_{\textit {gt}}}\left (x_{\textit {gtk}}\mu _{\textit {gt}}\right)^{2}\right)\). The conditional posterior density of the hyperparameter b _{ t } is given by \(g\left (b_{t}\sigma _{1t}^{2},\cdots, \sigma _{\textit {Gt}}\right) \propto b_{t}^{(\mathrm {G} a_{t}/2+c)}\exp \left (\frac {b_{t}}{2}\sum {\sigma _{\textit {gt}}^{2}}\right)\times \left (\sum _{t'\ne t} b_{t}'+b_{t}+b_{d}\right)^{\mathrm {T} c+a_{d}}\). Consequently, we can apply the Gibbs sampling algorithm to sample the parameters b _{ t }, \(\sigma ^{2}_{\textit {gt}}\) and Θ _{ g } in turn from their respective conditional posterior distributions using the following steps: (1) sample b _{ t } for each condition t from its posterior density function \(g(b_{t}\sigma _{1t}^{2},\cdots, \sigma _{\textit {Gt}})\) via the MetropolisHastings algorithm; (2) sample \(\sigma ^{2}_{\textit {gt}}\) given b _{ t } and μ _{ gt } for each g and t from its inverse gamma posterior distribution with updated parameters \(a_{t}+\frac {1}{2}n_{\textit {gt}}\) and \(b_{t}+\frac {1}{2}\sum _{k=1}^{n_{\textit {gt}}}(x_{\textit {gtk}}\mu _{\textit {gt}})^{2}\); (3) sample Θ _{ g } given \(\sigma ^{2}_{\textit {gt}}\) for all g from their conditional multivariate normal posterior distribution, and calculate the μ _{ gt } based on the sampled values of Θ _{ g }.
Model for sequencebased data
Let \(\phantom {\dot {i}\!}\textit {\textbf {Y}}_{\textit {gt}}=(y_{gt1},\cdots y_{gtn_{\textit {gt}}})\) denote all n _{ gt } observed counts of the expressed tags of gene g under condition t for g=1,⋯,G and t=1,⋯,T. We assume that y _{ gkt } follows a negative binomial distribution, which is commonly used for the count data with overdispersion [2, 26]. Specifically, we assume y _{ gtk } follows \(\mathcal {NB}\left (\phi _{t},\frac {m_{\textit {tk}}\lambda _{\textit {gt}}}{\phi _{t}+m_{\textit {tk}}\lambda _{\textit {gt}}}\right)\), with mean m _{ tk } λ _{ gt } and variance \(m_{\textit {tk}}\lambda _{\textit {gt}}\left (1+m_{\textit {tk}}\lambda _{\textit {gt}}\phi _{t}^{1}\right)\). We set m _{ tk } to be the library size of the k ^{ t h } sample under the t ^{ t h } condition, which is the sum of all counts from this library. The dispersion parameter ϕ _{ t } is assumed to be positive, accounting for potential overdispersion in the data. When the dispersion parameter ϕ _{ t } gets extremely large, the value of \(\phi _{t}^{1}\) approaches to zero, and the negative binomial distribution becomes a Poisson distribution with a mean value of m _{ tk } λ _{ gt }. DE genes are expected to have different λ _{ gt }’s under different biological conditions.
Hierarchical prior distributions
We assume that each dispersion parameter ϕ _{ t } follows a gamma distribution, \(\phi _{t} \sim \mathcal {G}(\alpha _{\phi }, \beta _{\phi })\) independently over t and its scale parameter β _{ ϕ } follows an inverse gamma distribution with \(\beta _{\phi } \sim \mathcal {IG}(\zeta _{\phi },\eta _{\phi })\). We also assume that each gene expression parameter λ _{ gt } follows an inverse gamma distribution with \(\lambda _{\textit {gt}} \sim \mathcal {IG}(\alpha _{\lambda _{t}},\beta _{\lambda _{t}})\), where the scale parameter \(\beta _{\lambda _{t}} \sim \mathcal {G}(\zeta _{\lambda },\eta _{\lambda })\). In our simulation studies, we set all the hyperparameters \(\{\alpha _{\phi },\zeta _{\phi }, \eta _{\phi },\alpha _{\lambda _{t}},\zeta _{\lambda },\eta _{\lambda }\}\phantom {\dot {i}\!}\) to be one.
Conditional posterior distributions
Since a negative binomial distribution can be written as a Poissongamma distribution, we can rewrite the distribution of y _{ gtk } as \(y_{\textit {gtk}} \sim \mathcal {P}oi(\theta _{\textit {gtk}})\), and \(\theta _{\textit {gtk}} \sim \mathcal {G}\left (\phi _{t},m_{\textit {tk}} \lambda _{\textit {gt}} \phi _{t}^{1}\right)\). Then we can derive all the conditional posterior distributions for all of the parameters. Specifically, the conditional posterior distribution of θ _{ gtk } is a gamma distribution with \(\theta _{\textit {gtk}} \sim \mathcal {G}\left (y_{\textit {gtk}}+\phi _{t},\left [1+\frac {\phi _{t}}{m_{\textit {tk}}\lambda _{\textit {gt}}}\right ]^{1}\right)\), the kernel of the conditional posterior density of ϕ _{ t } is given by \(\prod _{\textit {gk}} \left [\frac {\phi _{t}^{\phi _{t}}}{\Gamma (\phi _{t})} \left (\frac {\theta _{\textit {gtk}}}{m_{\textit {tk}}\lambda _{\textit {gt}}}\right)^{\phi _{t}} \exp \left (\frac {\theta _{\textit {gtk}}}{m_{\textit {tk}}\lambda _{\textit {gt}}}\phi _{t}\right) \right ] \exp \left (\frac {\phi _{t}}{\beta _{\phi }}\right)\phi _{t}^{\alpha _{\phi }1} I(\phi _{t}>0)\), the conditional posterior distribution of λ _{ gt } is \(\mathcal {IG}\left (\sum _{k}\phi _{t}+\alpha _{\lambda _{t}}, \beta _{\lambda _{t}}+\sum _{k}\frac {\theta _{\textit {gtk}}\phi _{t}}{m_{\textit {tk}}}\right)\), and the hyperparameters β _{ ϕ }, and \(\beta _{\lambda _{t}}\) respectively have the conditional posterior distributions: \(\beta _{\phi } \sim \mathcal {IG}\left (T\alpha _{\phi }+\zeta _{\phi }, \sum _{t}\phi _{t}+\eta _{\phi }\right)\), and \(\beta _{\lambda _{t}} \sim \mathcal {G}\left (G\alpha _{\lambda }+\zeta _{\lambda },1/\left (1/\eta _{\lambda }+\sum _{g} 1/\lambda _{\textit {gt}}\right)\right)\). Let θ _{ t } denote a set containing all θ _{ gtk }’s and λ _{ t } as a set containing all λ _{ gt }’s for each condition t. We use the Gibbs sampling algorithm to sample parameters \(\phantom {\dot {i}\!}\{\boldsymbol {\theta }_{t}, \boldsymbol {\lambda }_{t}, \beta _{\lambda _{t}}\}\), ∀t, and β _{ ϕ } from their conditional posterior distributions. The conditional posterior distribution of ϕ _{ t } does not have a known distribution form. These parameters are sampled using the MetropolisHastings sampling algorithm from their conditional posterior distributions.
Confident difference criterion
Preliminary
The confident difference criterion method was extended from the twocriterion method, which was firstly proposed by Ibrahim et al. [13] to detect DE genes for microarray studies with two biological conditions. In this twocriterion method, the fold change between two conditions was defined as \(\xi _{g}=\exp \left (\mu _{g2}+0.5\sigma _{g2}^{2}\mu _{g1}\right.\) \(\left.0.5\sigma _{g1}^{2}\right)\), and the posterior probabilities of having at least two fold changes between two conditions, denoted as γ _{ g1}=P r(ξ _{ g }>2D _{ obs }) and γ _{ g2}=P r(ξ _{ g }<1/2D _{ obs }), were evaluated on each gene to quantify the evidence of its differential expression. A gene is declared to be DE genes if the calculated posterior probabilities γ _{ g1} or γ _{ g2} are sufficiently large. The twocriterion method is easy to compute and provides good false positive and false negative rates [6] for identifying DE genes from microarray studies with two biological conditions. However, the posterior probability γ _{ g } defined in this confident difference criterion method does not account for the posterior variability of the fold change, and may not work well for the data with multiple conditions due to the potential multiple comparisons problem since only two conditions can be compared at a time.
In this section, we will develop confident difference criterion using a similar idea of the existing twocriterion method to compare mean expressions (Method I) after taking into account the posterior variability of the mean intensity parameters for the microarray data with two biological conditions. Then we extend the newly developed confident difference criterion method for the microarray data with multiple biological conditions. Furthermore, we will develop another version of the confident difference criterion method to compare both means and variances of the expressions (Method II) for the microarray data. Finally, we extend the confident difference criterion method for comparing mean differential expressions of microarray data (Method I) to the analysis of RNASeq data (Method I).
Confident difference criterion for the comparison between mean expressions for the microarray data
Microarray study with two conditions
where \(\sigma _{\mu _{g2}\mu _{g1}}\) is the posterior standard deviation of μ _{ g2}−μ _{ g1}. Then we select a cutoff value γ _{0} (0<γ _{0}<1) and declare a gene to be DE if its posterior probability γ _{ g } is greater than the cutoff value γ _{0}.
Note that the choice of γ _{0} reflects how strong the evidence is for declaring DE genes. When a larger value is specified for γ _{0}, fewer genes will be selected to be DE. In the twocriterion method, Chen et al. [6] recommended to use a large cutoff value (ranging between 0.7 and 0.9) because they did not adjust for the posterior variability of the fold change when comparing the gene intensities between the two conditions. After adjusting for the posterior variability, γ _{ g } in (1) is quite different than the corresponding posterior probability under the twocriterion method of Chen et al. [6], as shown in the following proposition.
Proposition 1.
Assume that the difference in the mean intensities, μ _{ g2}−μ _{ g1}, follows a normal distribution. The proposed confident difference criterion method ensures that if γ _{ g }≥γ _{0}, then the maximum value of the posterior probabilities for the difference μ _{ g2}−μ _{ g1} being larger or smaller than zero, i.e., max{P r(μ _{ g2}−μ _{ g1}>0 D _{ obs }), P r(μ _{ g2}−μ _{ g1}<0 D _{ obs })}, is at least Φ(2−Φ ^{−1}(1+Φ(−2)−γ _{0})) for γ _{0}>Φ(−2), where Φ and Φ ^{−1} denote the cumulative distribution function (cdf) and the inverse cdf of the standard N(0,1) distribution, respectively. The detailed proof is presented in Additional file 1.
Connection with the CBF method for microarry study with two conditions
For a microarray study with two biological conditions, we assume that the preprocessed expression intensity from each biological condition follows a normal distribution with \(x_{\textit {gtk}} \sim N\left (\mu _{\textit {gt}},\sigma _{\textit {gt}}^{2}\right)\), and the parameters follow the prior distribution specified in the aforementioned Model for microarray data subsection. For simplicity, we assume that the equal number of intensities are observed from the same gene under different conditions, and they share the same known variance, i.e., n _{ g1}=n _{ g2}=n _{ g } and \(\sigma ^{2}_{g1}=\sigma ^{2}_{g2}={\sigma ^{2}_{g}}\). The proposed confident difference criterion method is used to detect differentially expressed genes. Alternatively, we can also apply the CBF method for the data analysis. To detect differentially expressed genes, we test on the null hypothesis that the mean intensities are equal (μ _{ g1}=μ _{ g2}) against the alternative hypothesis that the mean intensities are unequal (μ _{ g1}≠μ _{ g2}) between the two biological conditions. We use the same prior distributions as that in the confident difference criterion method under the alternative hypothesis, and similar prior distributions for the parameters under the null hypothesis. With simple algebra, we can show that the proposed confident difference criterion method for comparing the mean intensities between two biological conditions agrees with the CBF method under the condition stated in the following Proposition.
Proposition 2.
The confident difference criterion method comparing the mean intensities between the two biological conditions with a cutoff value of γ _{0} agrees with the CBF method for the hypotheses testing on whether the mean intensities are equal between the two biological conditions with a cutoff value of C _{0} if \(\Phi \left (2+E_{g}^{\ast }\right)\Phi \left (2+E_{g}^{\ast }\right)=1\gamma _{0}\), where \(E_{g}^{\ast }= \left [ \log \left (\frac {n_{g}\omega ^{2}}{2{\sigma ^{2}_{g}}}+1\right) 2\log (C_{0})\right ]^{\frac {1}{2}}\), provided that the cutoff value C _{0} is chosen so that the argument in the square root expression is nonnegative. The detailed proof is presented in Additional file 1.
Microarray study with multiple conditions
The confident difference criterion method can be extended to microarray studies with multiple biological conditions. Our primary interest of the study is to identify genes that have differential expressions at least between two biological conditions. Therefore, we define a quadratic form to quantify the differences in the gene expression intensities across different biological conditions, and conduct an overall test to determine whether the mean intensities are different at least under two biological conditions on each gene.
and declare gene g to be DE if γ _{ g }≥γ _{0}.
Confident difference criterion for comparison of both means and variances of expression for microarray data
where \(\boldsymbol {\Delta }_{\boldsymbol {\mu }, \boldsymbol {\sigma },g}=\left (\mu _{g2}\mu _{g1},\cdots, \mu _{\textit {gT}}\mu _{g1}, q\left (\sigma ^{2}_{g2}\right)q \left (\sigma ^{2}_{g1}\right),\cdots, q\left (\sigma ^{2}_{\textit {gT}}\right)q\left (\sigma ^{2}_{g1}\right)\right)'\) is a column vector of length 2T−2 containing the differences in both means and transformed variances of the intensities. The covariance matrix \(\Sigma _{\boldsymbol {\Delta }_{\boldsymbol {\mu },\boldsymbol {\sigma },g}}\phantom {\dot {i}\!}\) is the posterior covariance matrix of Δ _{ μ,σ,g }. Since q(·) is a onetoone transformation function, then we have \( \sigma ^{2}_{\textit {gt}}=\sigma ^{2}_{gt'}\) if and only if \( q\left (\sigma ^{2}_{\textit {gt}}\right)=q\left (\sigma ^{2}_{gt'}\right)\) for t≠t ^{′}. Thus, the same q(·) function has to be used across all the T treatment groups. The primary reason for introducing the transformation function q(·) is to make the distribution of \(q\left (\sigma ^{2}_{\textit {gt}}\right)\) more normal.
Similar to (2), we compute the posterior probability γ _{ g }=P r(Q _{ g }>CD _{ obs }), where C is chosen to be an integer, which is closest to the 95 ^{ t h } percentile of the chisquare distribution with d f=2T−2. For example, C will be chosen to be 9 when T=2, and 13 when T=3. In this paper, we consider the negative cube root transformation on the variance parameters \(\sigma ^{2}_{\textit {gt}}\)’s. The cube root transformation, also known as WilsonHilferty transformation, was derived by Wilson and Hilferty [35] to transform a chisquare variate to be approximately normally distributed. In the proposed gene selection algorithm below, the cutoff value γ _{0} will be automatically determined to control the false discovery rate to be less than a targeted level.
Confident difference criterion for sequencebased data
where \(\sigma _{\log (\lambda _{g2})\log (\lambda _{g1})}\) is the posterior standard deviation for the difference log(λ _{ g2})− log(λ _{ g1}). We then declare gene g to be DE if γ _{ g }≥γ _{0}, where 0<γ _{0}<1 is a predetermined credible level.
When the sequencebased data are collected from multiple conditions, the first biological condition will be considered as the reference condition and a column vector Δ _{ λ,g }=(log(λ _{ g2})− log(λ _{ g1}), log(λ _{ g3})− log(λ _{ g1}),⋯, log(λ _{ gT })− log(λ _{ g1}))^{′} contains the differences in the log scaled expression values between the nonreference conditions and the reference condition. Let \({\Sigma }_{\boldsymbol {\Delta }_{\lambda,g}}\phantom {\dot {i}\!}\) denote the posterior covariance matrix of Δ _{ λ,g }. Accordingly, the confident difference criterion method is defined as \(\gamma _{g}=Pr\left ({\boldsymbol {\Delta }}_{\lambda,g}'\mathbf {\Sigma }^{1}_{{\boldsymbol {\Delta }}_{\lambda,g}} {\boldsymbol {\Delta }}_{\lambda,g}>C_{\lambda }D_{\textit {obs}}\right)\), where C _{ λ } is an integer, which is closest to the 95 ^{ t h } percentile of the chisquare distribution with d f=T−1. Again, we declare gene g to be DE if γ _{ g }≥γ _{0}, where 0<γ _{0}<1.
From the Model for sequencebased data subsection, we see that the variance of the observed count y _{ gtk } is \(m_{\textit {tk}}\lambda _{\textit {gt}} \left (1+ m_{\textit {tk}} \lambda _{\textit {gt}} \phi ^{1}_{t}\right)\), which is a function of λ _{ gt } and ϕ _{ t }, for the sequence data. Since ϕ _{ t } does not depend on g, it is sufficient to compare the mean expressions under different conditions for determining DE genes for the sequence data.
False discovery rate and gene selection algorithm
The proposed confident difference criterion methods calculate the value of γ _{ g } on each gene, whose magnitude reflects the evidence of differential expression. When γ _{ g } is large enough, the gene will be declared to be DE. It is of great importance to determine how to choose the cutoff value γ _{0}.
We adopt the approach proposed by Tadesse et al. [31] to select the cutoff value γ _{0} for controlling the Bayesian FDR. Let V denote the number of incorrect decisions by identifying EE genes as DE genes and let R be the number of identified DE genes. Then the positive false discovery rate defined by Storey [30] is given by \(pFDR=E\left (\frac {V}{R}R>0\right)\).

Split the genes into two subsets containing n _{ E } EE genes (EEGENE) with calculated γ _{ g }<γ _{0} and n _{ D } DE genes (DEGENE) with γ _{ g }≥γ _{0}.

Note that a DE gene can be either up or down regulated under some condition compared to the reference condition in terms of means or variances of the expression values for microarray experiment or in terms of mean gene expressions for sequencebased experiment. Accordingly, the DE genes will be further split into a series of gene subsets based on the pattern of parameters in comparisons under different biological conditions. For example, in a microarray study with three biological conditions and the mean gene expressions are in comparison. Consider the first condition as the reference condition. The DE genes can be classified into four subsets: (i) genes with lower mean gene expressions under both conditions 2 and 3; (ii) genes with lower mean gene expressions under condition 2 but higher mean gene expressions under condition 3; (iii) genes with higher mean gene expressions under condition 2 but lower mean gene expressions under condition 3; and (iv) genes with higher mean gene expressions under both conditions 2 and 3. We denote these subsets of DE genes as D _{ ℓ },ℓ=1,⋯,L, where the number of subsets L=2^{ T−1} for the microarray study when only mean parameters are in comparison or the sequencedbased study; and L=4^{ T−1} for the microarray study when both mean and variance parameters are in comparison. We also denote the number of genes in the D _{ ℓ } DE gene subsets as n _{ D ℓ }.

For EE genes identified in previous steps, the same data distributions will be considered for the gene expression data from different biological conditions. Hierarchical priors similar to those proposed previously in the Model for microarray data section and the Model for sequencebased data subsection will be augmented separately for microarry data or sequencebased data. The posterior mean of each parameter defined in the distribution of the gene expression data will be calculated. The true parameters in the gene expression data distribution will be estimated using the average value of posterior mean of corresponding parameters from all EE genes. For all identified DE genes, the Markov chain Monte Carlo (MCMC) sampling values from previous steps when implementing the proposed confident difference criterion method will be used for calculating the posterior means of the parameters defined in the gene expression data distribution. For each differential gene expression pattern ℓ, the actual value of each parameter in the gene expression data distribution will be estimated using the average value of its posterior means across all genes in the subset D _{ ℓ } with this DE pattern.

Using the estimated values for the parameters in the gene expression data, the data will be simulated for κ×G genes (say κ=0.1), among which κ n _{ E } EE genes and \(\kappa n_{D_{\ell }}, \ell =1,\dots,L\) DE genes with a pattern of differential gene expression observed in the DE gene subset D _{ ℓ }, respectively.

The posterior probability γ _{ g } will be calculated for each gene based on the simulated data. Depending on whether γ _{ g }≥γ _{0}, the gene in the simulated data will also be claimed to be DE or EE.

Denote the total number of identified DE and EE genes from the simulated data as m _{ D } and m _{ E }. Then the probability for a EE genes claimed to be DE, P r(γ _{ g }≥γ _{0}H _{0g }), will be estimated as \(Pr(\gamma _{g} \ge \gamma _{0}H_{0g})=\frac {m_{E}}{\kappa n_{E}}\); and the probability for a DE genes claimed to be DE, P r(γ _{ g }≥γ _{0}H _{1g }), will be estimated as \(Pr(\gamma _{g} \ge \gamma _{0}H_{1g})=\frac {m_{D}}{\kappa n_{D}}\). Using (5), the estimated Bayesian FDR equals \(\widehat {bFDR}=\frac {m_{E}}{m_{D}+m_{E}}\).
Note that steps (4) to (6) provide a predictive approach to estimate bFDR when a certain value γ _{0} was used for identifying DE genes. Therefore, we can control the FDR at some prespecified value (i.e. 0.05) by choosing the corresponding cutoff value \(\hat {\gamma _{0}}\) as the minimum value of all cutoff values with an associated FDR no more than 0.05, or \(\hat {\gamma }_{0}=\min \{\gamma _{0}: (\widehat {bFDR}(\gamma) \le 0.05)\}\).
Results and discussion
In this section, two different simulation studies were conducted to investigate the performance of the proposed confident difference criterion methods on identifying DE genes for microarray or sequencebased studies, respectively. In addition, a real affymetrix dataset is used to further demonstrate the proposed methodology.
Simulation study I: Microarray data
Two settings were considered. In the first setting, the intensity values having different means and variances between two biological conditions on each DE gene are simulated. In the second setting, the data were simulated from three biological conditions, with DE genes having different mean and variance values between at least two biological conditions.
Setting 1 (Two conditions)
Fifty simulations were used in this study to investigate the performance of different versions of the confident difference criterion methods described in the Confident difference criterion section. In each simulation, there were 5000 genes in total and 500 DE genes with 10 replications under each of the two biological groups. The logscaled data were generated via \(x_{g1k} \overset {\text {iid}}{\sim } \mathcal {N}(\mu _{g}0.5 \delta _{g}, 0.2^{2}), x_{g2k} \overset {\text {iid}}{\sim } \mathcal {N}\left (\mu _{g}+0.5\delta _{g}, 0.9^{2}\right)\) with δ _{ g }=1,∀g=1,…,250 and δ _{ g }=−1,∀g=251,…,500 for the DE genes, and \(x_{g1k},x_{g2k} \overset {\text {iid}}{\sim } \mathcal {N}(\mu _{g}, 0.7^{2})\) for the remaining EE genes. The average intensities μ _{ g } were generated from an uniform distribution, where \(\mu _{g} \overset {\text {iid}}{\sim } \mathcal {U} (5,11)\) for all genes. Conditionally conjugate priors described in the Model for microarray data subsection were used for all parameters μ _{ g }, δ _{ g }, \(\sigma _{g1}^{2}\), \(\sigma _{g2}^{2}\) and \({\sigma _{g}^{2}}\).
The simulated data were analyzed using both Methods I and II of the confident difference criteroin methods. For each version, the cutoff value γ _{0} were set to be 0.4, 0.6, or the cutoff value controlling the FDR to be no more than 0.05, separately. The genes with the calculated posterior distribution values γ _{ g } via Equation (1) or Equation (3) less than the chosen γ _{0} were identified to be DE. To evaluate the performance of the confident difference criterion methods, the simulated datasets were also analyzed by four existing methods: Significant Analysis of Microarrays (SAM) [33], Linear Models for Microarray Data (LIMMA) [28], Semiparametric Hierarchical Method (SPH) [23], Empirical Bayesian Analysis of Gene Expression Data (EBarrays or EBA) [14]. All these existing methods allowed a control of the FDR for multiple comparisons. The genes were declared to be DE with FDR controlled at 0.05 for all these four methods.
Performance evaluation under Study I (Setting 1), (G=5000, 500 DE gene) ^{ # }
Cutoff  Method  CDE  CCDE  CCEE  FNR  FPR  FDR  FNDR 

γ _{0}  I  796.7(14.5)  466.7(4.7)  4169.9(13.9)  0.067(0.009)  0.073(0.003)  0.414(0.011)  0.008(0.001) 
(0.4)  II  864.4(13.8)  499.2(1.1)  4134.8(13.8)  0.002(0.002)  0.081(0.003)  0.422(0.009)  0.000(0.000) 
γ _{0}  I  526.5(9.8)  419.2(6.7)  4392.7(6.9)  0.162(0.013)  0.024(0.002)  0.204(0.011)  0.018(0.001) 
(0.6)  II  582.3(7.3)  493.3(3.0)  4411.0(6.9)  0.013(0.006)  0.020(0.002)  0.153(0.010)  0.002(0.001) 
FDR  I  296.4(13.7)  283.3(12.9)  4486.9(2.9)  0.433(0.026)  0.003(0.001)  0.044(0.009)  0.046(0.003) 
(0.05)  II  469.2(13.1)  450.7(10.2)  4481.5(4.4)  0.099(0.020)  0.004(0.001)  0.039(0.009)  0.011(0.002) 
SAM  330.5(16.0)  314.0(13.5)  4483.4(4.7)  0.372(0.027)  0.004(0.001)  0.050(0.013)  0.040(0.003)  
LIMMA  320.2(15.2)  304.9(13.7)  4484.7(4.1)  0.390(0.027)  0.003(0.001)  0.048(0.012)  0.042(0.003)  
SPH  192.0(10.6)  188.1(10.0)  4496.1(2.1)  0.624(0.020)  0.001(0.000)  0.020(0.011)  0.065(0.002)  
EBA  166.4(14.1)  158.8(13.3)  4492.3(2.2)  0.682(0.027)  0.002(0.000)  0.046(0.012)  0.071(0.003) 
Setting 2 (Three conditions)
The data were simulated from three biological conditions, and the first biological condition was considered as the reference group. A gene was set to be DE so that at least one group would be either up or down regulated from the reference group. Specifically, 500 DE genes out of 5000 genes were set in the data, and the log intensities of the DE genes were generated via \(x_{g1k} \overset {\text {iid}}{\sim } \mathcal {N}\left (\mu _{g1}, 0.2^{2}\right), x_{g2k} \overset {\text {iid}}{\sim } \mathcal {N}\left (\mu _{g1}+0.5\nu _{g1}, 0.5^{2}\right), x_{g3k} \overset {\text {iid}}{\sim } \mathcal {N}\left (\mu _{g1}+0.5 \nu _{g2}, 0.8^{2}\right)\). Depending on whether the gene was set to be DE in one or both conditions from reference group, the parameters ν _{ g1} and ν _{ g2} were set to have ν _{ g1}=ν _{ g2}=1.5 for g=1,⋯,62 (upregulated in both conditions); ν _{ g1}=1.5, ν _{ g2}=0 for g=63,⋯,125 (only upregulated in condition 2); ν _{ g1}=1.5, ν _{ g2}=−1.5 for g=126,⋯,187 (upregulated in condition 2, downregulated in condition 3); ν _{ g1}=0 and ν _{ g2}=1.5 for g=188,⋯,250 (only upregulated in condition 3); ν _{ g1}=0, ν _{ g2}=−1.5 for g=251,⋯,312 (only downregulated in condition 3); ν _{ g1}=−1.5, ν _{ g2}=1.5 for g=313,⋯, 375 (downregulated in condition 2, upregulated in condition 3); ν _{ g1}=−1.5, ν _{ g2}=0 for g=376, ⋯,437 (only downregulated in condition 2); ν _{ g1}=ν _{ g2}=−1.5, for g=438,⋯,500 (downregulated in both conditions). The remaining genes were EE and their log intensities were generated via \(x_{\textit {gtk}} \overset {\text {iid}}{\sim } \mathcal {N}\left (\mu _{g}, 0.6^{2}\right)\) for t=1,2,3 and g=501,⋯,5000. On all genes, the parameter μ _{ g1} were generated from an uniform distribution, i.e., \(\mu _{g1} \overset {\text {iid}}{\sim } \mathcal {U}(5,11)\). Each condition contained 10 replicates on each gene and 50 simulations were conducted.
Performance evaluation under Study I (Setting 2), (G=5000, 500 DE gene) ^{ # }
Cutoff  Method  CDE  CCDE  CCEE  FNR  FPR  FDR  FNDR 

γ _{0}  I  1086.5(22.5)  476.3(4.4)  3890.8(21.9)  0.045(0.009)  0.135(0.005)  0.561(0.009)  0.006(0.001) 
(0.4)  II  1388.1(25.7)  499.7(0.5)  3611.6(25.7)  0.001(0.001)  0.197(0.006)  0.640(0.007)  0.000(0.000) 
γ _{0}  I  656.8(13.8)  448.2(5.5)  4291.4(13.1)  0.104(0.011)  0.046(0.003)  0.317(0.014)  0.012(0.001) 
(0.6)  II  749.3(15.0)  496.0(1.8)  4246.7(15.1)  0.008(0.004)  0.056(0.003)  0.338(0.014)  0.001(0.000) 
FDR  I  357.1(8.7)  342.7(8.0)  4485.6(3.9)  0.315(0.016)  0.003(0.001)  0.040(0.011)  0.034(0.002) 
(0.05)  II  480.7(10.5)  458.7(7.7)  4478.0(5.7)  0.083(0.015)  0.005(0.001)  0.046(0.011)  0.009(0.002) 
SAM  326.9(12.9)  312.0(11.2)  4485.1(4.5)  0.376(0.022)  0.003(0.001)  0.045(0.013)  0.040(0.002)  
LIMMA  329.5(51.9)  309.9(21.6)  4480.4(31.8)  0.380(0.043)  0.004(0.007)  0.053(0.045)  0.041(0.004)  
EBA  190.4(6.7)  184.2(6.7)  4493.7(1.9)  0.632(0.013)  0.001(0.000)  0.033(0.010)  0.066(0.001) 
Simulation Study II: sequencebased data
The focus of this study is to investigate the performance of the proposed confident difference criterion method for identifying DE genes from sequencebased highthroughput experiments including SAGE and RNASeq studies.
Setting 1 (SAGE experiment)
The simulation proposed by Lu et al. [20] was used to conduct the simulation study. Specifically, 5000 genes were sampled under five libraries from each of the two conditions with fixed library sizes sampled uniformly between 30000 and 90000. A total of 500 genes were set to be DE genes. The data were generated from a negative binomial distribution, \(y_{\textit {gtk}} \overset {\text {iid}}{\sim } \mathcal {NB}(\phi _{t}, \frac {m_{\textit {tk}}\lambda _{\textit {gt}}}{\phi _{t}+m_{\textit {tk}}\lambda _{\textit {gt}}})\) for gene g, for a fixed library k of condition t, where m _{ tk } was the library size for library k under condition t; ϕ _{1} and ϕ _{2} denoted the dispersion parameters for data from the two conditions separately, and both set to be 0.4; λ _{ gt } measured the expression level of gene g under condition t and were set with different values when gene g is DE and the same value when gene g is EE. For g=1,⋯,250, we set λ _{ g1}=8E−4 and λ _{ g2}=2E−4 to include downregulated genes in condition 2. For g=251,⋯,500, we set λ _{ g1}=2E−4 and λ _{ g2}=8E−4 to include upregulated genes in condition 2. For other genes with g=501,⋯,5000, we set λ _{ g1}=λ _{ g2}=2E−4 to include EE genes. Fifty simulations were used in this study.
The proposed confident difference criterion method for RNASeq data was used to analyze the simulated data. The posterior probability γ _{ g } measuring the evidence of differential gene expression were estimated using average value of its posterior sampled values. The cutoff value γ _{0} for γ _{ g } were set to be 0.4, 0.6 or a value to control the FDR to be 0.05, separately. The genes with estimated γ _{ g } less than the chosen γ _{0} value were claimed to be DE. We also fit several other existing methods including edgeR [26], DESeq [2], BaySeq [10], NBPSeq [8], EBSeq [17], NOISeq [32], SAMSeq [19], and TSPM [3]. When the edgeR method was applied, we chose both options to estimate the common dispersion parameter for all tags and the tagwise dispersion parameters respectively. For the NOISeq method, we estimated and controlled the FDR using the method proposed by Newton et al. [23] for identifying DE genes.
Performance evaluation under Study II (Setting 1), (G=5000, 500 DE gene) ^{ # }
Method  CDE  CCDE  CCEE  FNR  FPR  FDR  FNDR 

twocri. (γ _{0}=0.4)  785.1(8.3)  465.6(2.5)  4180.5(7.6)  0.069(0.005)  0.071(0.002)  0.407(0.006)  0.008 (0.001) 
(γ _{0}=0.6)  509.4(6.8)  421.3(4.6)  4411.9(4.1)  0.157(0.009)  0.019(0.001)  0.173(0.007)  0.018(0.001) 
(ζ=0.05)  344.2(8.2)  328.8(6.3)  4484.6(2.6)  0.342(0.013)  0.003(0.001)  0.044(0.007)  0.037(0.001) 
edgeR ^{1}(ζ=0.05)  289.7(17.6)  278.1(16.5)  4488.3(3.6)  0.444(0.033)  0.003(0.001)  0.040(0.011)  0.047(0.003) 
edgeR ^{2}(ζ=0.05)  290.6(18.1)  276.4(16.8)  4485.8(3.8)  0.447(0.034)  0.003(0.001)  0.049(0.012)  0.047(0.003) 
DESeq(ζ=0.05)  297.2(21.3)  265.9(18.4)  4468.7(5.6)  0.468(0.037)  0.007(0.001)  0.105(0.016)  0.050(0.002) 
BaySeq(ζ=0.05)  203.1(22.8)  203.0(22.8)  4499.9(0.2)  0.594(0.046)  0.000(0.000)  0.000(0.001)  0.062(0.004) 
NBPSeq(ζ=0.05)  248.3(20.5)  239.8(19.3)  4491.5(4.0)  0.520(0.039)  0.002(0.001)  0.034(0.015)  0.055(0.004) 
EBSeq(ζ=0.05)  303.7(18.8)  257.7(14.9)  4454.0(6.8)  0.485(0.030)  0.010(0.002)  0.151(0.017)  0.052(0.003) 
NOISeq(ζ=0.05)  303.1(19.1)  294.4(17.6)  4491.3(3.2)  0.411(0.035)  0.002(0.001)  0.028(0.010)  0.044(0.004) 
SAMSeq(ζ=0.05)  134.2(45.2)  126.1(43.2)  4491.9(3.4)  0.748(0.086)  0.002(0.001)  0.061(0.022)  0.077(0.008) 
TSPM(ζ=0.05)  85.4 (19.2)  58.7 (15.4)  4473.3(6.5)  0.883(0.031)  0.006(0.001)  0.316(0.056)  0.090(0.003) 
Setting 2 (RNASeq experiment)
We used a similar simulation setting proposed by Kvam et al. [16] for illustrating the application of the proposed confident difference criterion method for RNASeq experiment. We still simulated 50 dataset, each dataset contained six libraries with three libraries from each of the two conditions on 5000 genes, among which 250 genes were set to be upregulated genes and another 250 genes were set to be downregulated genes in condition 2 versus condition 1. The overall mean expression levels across both conditions were generated from a gamma distribution with \(\lambda _{g} \sim \mathcal {G}(0.25, 600)\). To avoid including genes with low expression levels from both conditions as DE genes, we set the difference in the gene expression levels between conditions in two ways depending on whether the value of λ _{ g } is larger than one. Specifically, we generated ξ _{ g } from uniform distribution \(\mathcal {U}(3,20)\) for each gene. If the value of λ _{ g }>1, we let the fold change between the gene expression values of DE genes to be ξ _{ g }, or \(\lambda _{g1}=\lambda _{g}/\sqrt {\xi _{g}}\) and \(\lambda _{g2}=\lambda _{g}*\sqrt {\xi _{g}}\) for upregulated genes, and \(\lambda _{g1}=\lambda _{g}*\sqrt {\xi _{g}}\) and \(\lambda _{g2}=\lambda _{g}/\sqrt {\xi _{g}}\) for downregulated genes. If the value of λ _{ g }≤1, we let the absolute difference in the gene expression values to be ξ _{ g }, or we let λ _{ g1}=λ _{ g }+ξ _{ g } and λ _{ g2}=λ _{ g } for downregulated genes, and λ _{ g1}=λ _{ g } and λ _{ g2}=λ _{ g }+ξ _{ g } for upregulated genes in condition 2. For an EE gene, we had λ _{ g1}=λ _{ g2}=λ _{ g }.
Then we generated the data using negative binomial distribution of \(y_{\textit {gtk}} \overset {\text {iid}}{\sim } \mathcal {NB}(\phi _{t}, \frac {\lambda _{\textit {gt}}}{\phi _{t}+\lambda _{\textit {gt}}})\) for gene g, and the overdispersion parameters ϕ _{1} and ϕ _{2} were set to have ϕ _{1} = 1 and ϕ _{2} = 8 respectively for DE genes; and ϕ _{1}=ϕ _{2}=4 for EE genes.
Performance evaluation under Study II (Setting II), (G=5000, 500 DE gene) ^{ # }
Method  CDE  CCDE  CCEE  FNR  FPR  FDR  FNDR 

twocri.(γ _{0}=0.4)  654.6(5.4)  460.9(2.3)  4306.3(5.2)  0.078(0.005)  0.043(0.001)  0.295(0.006)  0.009(0.000) 
(γ _{0}=0.6)  490.0(3.9)  434.1(2.4)  4444.2(3.4)  0.132(0.005)  0.012(0.001)  0.114(0.006)  0.014(0.001) 
(ζ=0.05)  415.7(5.0)  400.6(3.5)  4484.9(2.7)  0.199(0.007)  0.003(0.001)  0.036(0.006)  0.022(0.001) 
edgeR ^{1}(ζ=0.05)  420.0(8.6)  411.7(8.4)  4491.7(2.8)  0.177(0.017)  0.002(0.001)  0.020(0.001)  0.020(0.002) 
edgeR ^{2}(ζ=0.05)  399.4(10.2)  386.3(9.8)  4486.9(4.6)  0.227(0.020)  0.003(0.001)  0.033(0.011)  0.025(0.002) 
DESeq(ζ=0.05)  443.6(15.9)  409.3(15.1)  4465.8(5.3)  0.181(0.030)  0.008(0.001)  0.077(0.011)  0.020(0.003) 
BaySeq(ζ=0.05)  331.0(15.5)  327.0(14.9)  4496.0(2.2)  0.346(0.030)  0.001(0.000)  0.012(0.006)  0.037(0.003) 
NBPSeq(ζ=0.05)  422.3(7.9)  412.7(7.9)  4490.4(3.1)  0.175(0.016)  0.002(0.001)  0.023(0.007)  0.019(0.002) 
EBSeq(ζ=0.05)  332.9(14.0)  248.4(11.1)  4415.6(9.4)  0.503(0.022)  0.019(0.002)  0.253(0.023)  0.054(0.002) 
NOISeq(ζ=0.05)  196.8(11.0)  191.4(10.8)  4494.6(2.5)  0.617(0.022)  0.001(0.001)  0.028(0.013)  0.064(0.002) 
SAMSeq(ζ=0.05)  274.3(15.7)  212.1(7.7)  4437.8(10.9)  0.576(0.015)  0.014(0.002)  0.226(0.029)  0.061(0.001) 
TSPM(ζ=0.05)  129.9(10.5)  80.1 (8.9)  4450.2(7.1)  0.840(0.018)  0.011(0.002)  0.383(0.046)  0.086(0.002) 
Real data analysis
We used a real data set obtained using customized Bovine Affymetrix arrays (Davis, Talbott, Yu, and Cupp, unpublished results) to illustrate the proposed method. Fifteen arrays composed of three replicate arrays under three biological conditions were produced to screen for DE genes associated with prostaglandin F2α(PGF) treatment after 30 min, 1 h, 2 h, and 4 h compared to the control treatment (saline). For simplicity, we focused on detecting genes using the confident difference criterion methods (Method I and Method II) that were regulated 1 h or 2 h after PGF treatment. The data were extracted, normalized and summarized using the Robust Multiarray Average (RMA) [12] method at the exon level via the Affymetrix expression console. The data set contains 21724 genes. Note that some genes may have multiple probe replicates ranging from one replicate to 266 replicates, and the data from different probes of the same gene may have large variation even after RMA normalization. We centered the data from each probe of the same gene to the mean log intensities of that gene, and excluded 3116 genes with only a single probe replicate from the analysis to make sure that the parameters were estimable. Additionally, we excluded 2137 low expression genes if twothirds or more (six out of nine) samples on this gene had gene expression values measured by the geometric mean expression values across different probes less than 10. Of the remaining 16471 genes with replicate probes, we used z _{ gjtk } to denote the k ^{ t h } biological replicate sample of the log2 scale gene expression intensity for probe j of gene g under condition t. Note that the index j was added to the previous notations for the log intensity values as data are available for multiple probes on the same gene. We assumed normal distribution for the log2 intensities with \(z_{\textit {gjtk}} \sim \mathcal {N}(\mu _{\textit {gtk}}, \sigma ^{2\ast }_{\textit {gt}})\), and the same prior for μ _{ gtk } as what we set for X _{ gt } in the Model for microarray data subsection. The variance parameters are assumed to follow inverse gamma distribution with \(\sigma ^{2\ast }_{\textit {gt}} \sim \mathcal {IG}(\alpha ^{\ast }_{t},\beta ^{\ast }_{t})\) with \(\alpha ^{\ast }_{t}=2\) and \(\beta ^{\ast }_{t} \sim \mathcal {G}(\alpha ^{\ast }_{0},\beta ^{\ast }_{0})\). We set \(\alpha ^{\ast }_{0}=1\) and \(\beta ^{\ast }_{0}\sim IG(\alpha ^{\ast },\beta ^{\ast })\) where both α ^{∗}=β ^{∗}=1. During computation for controlling the FDR, we reuse these settings of the prior distributions on the parameters μ _{ gtk } and \(\sigma ^{2\ast }_{\textit {gt}}\) for DE genes. For EE gens, we assume that \(z_{\textit {gjtk}} \sim \mathcal {N}(\mu _{\textit {gtk}},\sigma ^{2\ast }_{g})\), and make similar augment for the prior distributions of their parameters μ _{ gtk } and \(\sigma ^{2\ast }_{g}\) as the DE genes. The proposed confident difference criterion methods were applied to assess the evidence of differential expression on each gene and identify DE genes with the cutoff value equal to be 0.4, 0.6 or a value that controls the FDR at 0.05.
The genes identified solely by Method I or Method II were analyzed by Ingenuity Pathway Analysis (IPA, Build version: 313398M, Content version: 18841524 (Release Date: 20140624) to determine biological functions and pathways associated with the newly identified genes. Genes identified solely by Method I and not by SAM or EBarrays were analyzed by IPA which identified several major canonical pathways such as hepatic fibrosis / hepatic stellate cell activation, glucocortiocoid receptor signaling, agranulocyte adhesion and diapedesis, and role of IL17A in arthritis (Additional file 2: Table S1). Many of the canonical pathways identified have either established or potential roles in corpus luteum function indicating that Method I identified DE genes that are biologically relevant within the model. Method I also identified IL1B (P=2.12E−08) and TNF (P=3.03E−08) as upstream regulators of the genes found exclusively by Method I, which also fits with known and suspected mechanisms of PGF action within the corpus luteum [1, 24].
Genes identified solely by Method II were also analyzed by IPA which identified canonical pathways such as hepatic fibrosis/hepatic stellate cell activation [21], glucocorticoid receptor signaling, IL8 signaling, and granulocyte adhesion and diapedesis. Upstream regulators of gene found solely by Method II included: IL1B (P=4.56E−13), TGFB1 (P=1.19E−11), and IFNG (P=1.82E−11). The IPA results both concur with current literature and offer new insights into the possible mechanism(s) of action of PGF in the corpus luteum [1, 9, 11, 21]. These and similar canonical and regulatory functions were also identified when the complete dataset (30 min, 1 h, 2 h, and 4 h) was analyzed by IPA. These network functions are in agreement with the known or suspected changes in biological function in the corpus luteum following PGF treatment in several species [1, 5, 22, 27]. Several of the genes identified by Methods I and II are known to be involved in regulation of the fate of the corpus luteum after PGF treatment, and were also identified as DE genes in our larger data set and a similar microarray dataset examining the effects of PGF in the cow [22]. For example, genes that code for chemokines (e.g., CCL3 and CCL8) and prostaglandin synthesis (e.g., PTGS2) were found to be significantly upregulated at 1 and 2 h using Methods I and II which were not identified using LIMMA. However, CCL3, CCL8, and PTGS2 were all identified as significantly upregulated in later time points using LIMMA, which conservatively identifies DE genes. Therefore, it seems possible that Methods I and II may provide a more sensitive approach to identify the temporal patterns of gene expression.
Conclusion
In this paper, we have proposed a new differentially expressed gene selection algorithm, which controls the FDR based on predictive Bayesian estimates. The simulation studies empirically showed that the proposed confident difference criterion methods outperform the existing methods when comparing gene expressions across different conditions for both microarray studies and sequencebased highthroughput studies. For the analysis of the real data, the method II successfully identified more clinically important DE genes than the other methods. In comparison to Method I, the Method II provides a much better sensitivity rate, but slightly a lower specificity rate based on the simulation studies.
In scenarios where the data are not symmetrically distributed, we need to model the data with other types of distributions (e.g., a gamma distribution). The confident difference criterion method proposed for comparing both means and variances can also be extended to evaluate the differences in multiple parameters defined in the nonnormal data distributions. In addition, the Euclidean distances used in the proposed confident difference criterion method may also be extended to other types of distances to measure the difference among the distributions under two or more biological conditions. In the case of two conditions, the entropybased distance such as the KullbackLeibler (KL) divergence may be considered. However, the distribution of the entropybased statistics is quite difficult to characterize and, hence, it is quite challenging to choose the cutoff value for the entropy statistics. Such extensions need to be further investigated. Finally, we note that all models considered in this paper assume that the gene expressions are independent across genes. The proposed confident difference criterion methods do not require the independence assumption. However, the performance of the confident difference criterion methods under the correlated models need to be further examined.
Availability and requirements
All analyses results presented in this paper were obtained using codes developed in FORTRAN with IMSL library. We have also implemented the proposed method in R for windows (32 bits). The R codes can be obtained at the websites: http://www.unmc.edu/publichealth/departments/biostatistics/facultyandstaff/cdc_micro.zipand http://www.unmc.edu/publichealth/departments/biostatistics/facultyandstaff/cdc_RNASeq.zip.
Declarations
Acknowledgements
We would like to thank the Editor and two referees for their very helpful comments and suggestions, which have led to an improved version of the paper. This work was supported by COPH Dean’s Mentored Research Grant at UNMC (FY), NIH grants #GM 70335 and #P01 CA142538 (MHC), Agriculture and Food Research Initiative (AFRI) Competitive Grant no. 20116701520076 from the USDA National Institute of Food and Agriculture (NIFA) (JSD); the Department of Veterans Affairs (JSD), the Olson Center for Women’s Health (HT and JSD), and a AFRI NIFA Predoctoral Fellowship Award (HT).
Authors’ Affiliations
References
 Atli MO, Bender RW, Mehta V, Bastos MR, Luo W, Vezina CM, et al. Patterns of gene expression in the bovine corpus luteum following repeated intrauterine infusions of low doses of prostaglandin F_{2} α. Biol Reprod. 2012; 86(4):130.View ArticlePubMedPubMed CentralGoogle Scholar
 Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010; 11:R106.View ArticlePubMedPubMed CentralGoogle Scholar
 Auer PL, Doerge RW. A twostage poisson model for testing RNASeq data. Stat Appl Genet Mol Biol. 2011; 10:1–26.Google Scholar
 Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008; 456(7218):53–9.View ArticlePubMedPubMed CentralGoogle Scholar
 Bishop CV, Bogan RL, Hennebold JD, Stouffer RL. Analysis of microarray data from the macaque corpus luteum; the search for common themes in primate luteal regression. Mol Hum Reprod. 2011; 17(3):143–51.View ArticlePubMedGoogle Scholar
 Chen MH, Ibrahim JG, Chi YY. A new class of mixture models for differential gene expression in DNA microarray data. J Stat Plan Inference. 2008; 138:387–404.View ArticlePubMedPubMed CentralGoogle Scholar
 Dudroit S, Yang YH, Callow MJ, Speed TP. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica. 2002; 12:111–39.Google Scholar
 Di Y, Schafer DW, Cumbie JS, Chang JH. The NBP negative binomial model for assessing differential gene expression from RNASeq. Stat Appl Genet Mol Biol. 2011; 10(1):1–28.Google Scholar
 Galväo AM, FerreiraDias G, Skarzynski DJ. Cytokines and angiogenesis in the corpus luteum. Mediators Inflamm. 2013; 2013:420186.View ArticlePubMedPubMed CentralGoogle Scholar
 Hardcastle TJ, baySeq KellyKA. Empirical Bayesian analysis of patterns of differential expression in count data. BMC Bioinformatics. 2010; 11:422–35.View ArticlePubMedPubMed CentralGoogle Scholar
 Hou X, Arvisais EW, Jiang C, Chen DB, Roy SK, Pate JL, et al. Prostaglandin F2 α stimulates the expression and secretion of transforming growth factor B1 via induction of the early growth response 1 gene (EGR1) in the bovine corpus luteum. Mol Endocrinol. 2008; 22(2):403–414.View ArticlePubMedGoogle Scholar
 Irizarry RA, Hobbs B, Collin F, BeazerBarclay YD, Antonellis KJ, Scherf U, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003; 4(2):249–64.View ArticlePubMedGoogle Scholar
 Ibrahim JG, Chen MH, Gray RJ. Bayesian models for gene expression with DNA microarray data. J Am Stat Assoc. 2002; 97:88–99.View ArticleGoogle Scholar
 Kendziorski CM, Newton MA, Lan H, Gould MN. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Stat Med. 2003; 22:3899–914.View ArticlePubMedGoogle Scholar
 Kuo L, Yu F, Zhao Y. Statistical methods for identifying differentially expressed genes in replicated experiments: A review. In: Biswas A, Data S, Fine J, Segal M, editors. Statistical Advances in the Biomedical Sciences: Clinical Trials, Epidemiology, Survival Analysis, and Bioinformatics. Hoboken, NJ: WileyInterscience: 2008. p. 341–64.Google Scholar
 Kvam VM, Liu P, Si Y. A comparison of statistical methods for detecting differentially expressed genes from RNAseq data. Am J Bot. 2012; 99(2):248–56.View ArticlePubMedGoogle Scholar
 Leng N, Dawson JA, Stewart RM, Ruotti V, Rissman A, Smits B, et al. EBseq: An empirical Bayes hierarchical model for inference in RNAseq experiments. Bioinformatics. 2013; 29(8):1035–43.View ArticlePubMedPubMed CentralGoogle Scholar
 Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature. 2008; 456(7218):66–72.View ArticlePubMedPubMed CentralGoogle Scholar
 Li J, Tibshirani R. Finding consistent patterns: a nonparametric approach for identifying differential expression in RNAseq data. Stat Methods Med Res. 2013; 22:519–36.View ArticlePubMedGoogle Scholar
 Lu J, Tomfohr JK, Kepler TB. Identifying differential expression in multiple SAGE libraries: an overdispersed loglinear model approach. BMC Bioinformatics. 2005; 6:165.View ArticlePubMedPubMed CentralGoogle Scholar
 Maroni D, Davis JS. TGFB1 disrupts the angiogenic potential of microvascular endothelial cells of the corpus luteum. J Cell Sci. 2012; 124(14):2501–510.View ArticleGoogle Scholar
 Mondal M, Schilling B, Folger J, Steibel JP, Buchnick H, Zalman Y, et al. Deciphering the luteal transcriptome: potential mechanisms mediating stagespecific luteolytic response of the corpus luteum to prostaglandin F _{2} α. Physiol Genomics. 2011; 43(8):447–56.View ArticlePubMedGoogle Scholar
 Newton MA, Noueiry A, Sarkar D, Ahlquist P. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics. 2004; 5:155–76.View ArticlePubMedGoogle Scholar
 Okuda K, Sakumoto R. Multiple roles of TNF super family members in corpus luteum function. Reprod Biol Endocrinol. 2003; 1:95.View ArticlePubMedPubMed CentralGoogle Scholar
 Pan W. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics. 2002; 18:546–54.View ArticlePubMedGoogle Scholar
 Robinson MD, McCarthy DJ. Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26:139–40.View ArticlePubMedGoogle Scholar
 Romero JJ, Antoniazzi AQ, Smirnova NP, Webb BT, Yu F, Davis JS, et al. Pregnancyassociated genes contribute to antiluteolytic mechanisms in ovine corpus luteum. Physiol Genomics. 2013; 45(22):1095–1108.View ArticlePubMedPubMed CentralGoogle Scholar
 Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004; 3(1):Article 3.Google Scholar
 Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNASeq data. BMC Bioinformatics. 2013; 14:91.View ArticlePubMedPubMed CentralGoogle Scholar
 Storey JD. A direct approach to false discovery rates. J R Stat Soc Ser B. 2002; 64:479–98.View ArticleGoogle Scholar
 Tadesse MG, Ibrahim JG, Vannucci M, Gentleman R. Wavelet thresholding with Bayesian false discovery rate control. Biometrics. 2005; 61:25–35.View ArticlePubMedGoogle Scholar
 Tarazona S, GarcíaAlcalde F, Dopazo J, Ferrer A, Conesa A. Differential expression in RNASeq: a matter of depth. Genome Res. 2011; 21:2213–223.View ArticlePubMedPubMed CentralGoogle Scholar
 Tusher VG, Ti bshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2011; 98:5116–121.View ArticleGoogle Scholar
 Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, et al. The diploid genome sequence of an Asian individual. Nature. 2008; 456:60–65.View ArticlePubMedPubMed CentralGoogle Scholar
 Wilson EB, Hilferty MM. The distribution of chisquare. Proc Natl Acad Sci U S A. 1931; 17:684–88.View ArticlePubMedPubMed CentralGoogle Scholar
 Yu F, Chen MH, Kuo L. Detecting differentially expressed genes using calibrated Bayes factors. Statistica Sinica. 2008; 18:783–802.Google Scholar
 Zalman Y, Klipper E, Farberov S, Mondal M, Wee G, Folger JK. Regulation of AngiogenesisRelated Prostaglandin F2alphaInduced Genes in the Bovine Corpus Luteum. Biology of Reproduction. 2012; 86(3):92.View ArticlePubMedGoogle Scholar
Copyright
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.