 Methodology Article
 Open Access
 Published:
Confident difference criterion: a new Bayesian differentially expressed gene selection algorithm with applications
BMC Bioinformatics volumeÂ 16, ArticleÂ number:Â 245 (2015)
Abstract
Background
Recently, the Bayesian method becomes more popular for analyzing high dimensional gene expression data as it allows us to borrow information across different genes and provides powerful estimators for evaluating gene expression levels. It is crucial to develop a simple but efficient gene selection algorithm for detecting differentially expressed (DE) genes based on the Bayesian estimators.
Results
In this paper, by extending the twocriterion idea of Chen et al. (Chen MH, Ibrahim JG, Chi YY. A new class of mixture models for differential gene expression in DNA microarray data. J Stat Plan Inference. 2008;138:387â€“404), we propose two new gene selection algorithms for general Bayesian models and name these new methods as the confident difference criterion methods. One is based on the standardized differences between two mean expression values among genes; the other adds the differences between two variances to it. The proposed confident difference criterion methods first evaluate the posterior probability of a gene having different gene expressions between competitive samples and then declare a gene to be DE if the posterior probability is large. The theoretical connection between the proposed first method based on the means and the Bayes factor approach proposed by Yu et al. (Yu F, Chen MH, Kuo L. Detecting differentially expressed genes using alibrated Bayes factors. Statistica Sinica. 2008;18:783â€“802) is established under the normalnormalmodel with equal variances between two samples. The empirical performance of the proposed methods is examined and compared to those of several existing methods via several simulations. The results from these simulation studies show that the proposed confident difference criterion methods outperform the existing methods when comparing gene expressions across different conditions for both microarray studies and sequencebased highthroughput studies. A real dataset is used to further demonstrate the proposed methodology. In the real data application, the confident difference criterion methods successfully identified more clinically important DE genes than the other methods.
Conclusion
The confident difference criterion method proposed in this paper provides a new efficient approach for both microarray studies and sequencebased highthroughput studies to identify differentially expressed genes.
Background
In the past decade, highthroughput molecular technologies have gained great popularity in gene expression profiling due to their capability of producing thousands of measurements for each of the assayed samples. The microarray technology and nextgeneration sequencing are two widely used highthroughput technologies. Nextgeneration sequencing improves upon Sanger dideoxy sequencing so that the number of sequencing reactions in a single run can be in millions. For example, in Nature (2008), Bentley et al. [4] and Wang et al. [34] reported the DNA sequence of a Nigerian individual and an Asian individual, respectively. Ley et al. [18] analyzed the genome sequence of a tumor sample. One common scientific question addressed by these highthroughput experiments is to identify the genes with differential expression between two biological conditions. Although the highthroughput technologies offer us rich biological information, they are highly errorprone because many genes are monitored at the same time with a relatively small sample size. Bayesian methods provide a good solution to this problem because they synthesize all the data by borrowing information across different genes and produce more efficient estimators for evaluating the gene expressions. They include linear models in LIMMA [28] where empirical Bayesian methods were used to obtain stable results even with small sample size. A more detailed description of the Bayesian statistical methods for microarray studies can be found in Dudoit et al. [7], Pan [25], and Kuo et al. [15]. Other Bayesian methods for RNASeq studies using next generation sequencing were reviewed by Kvam et al. [16] and Soneson and Delorenzi [29].
Yu et al. [36] pointed out that most statistical methods for microarray studies examined the differential expressions by testing on the equality of means of the logtransformed intensities between the treatment and control, which may not be appropriate for data with complex structures (for example, a mixture normal distributions with multiple modes). They proposed a calibrated Bayes factor (CBF) method to evaluate the ratio of the full data marginal likelihood under the alternative hypothesis that a gene is differentially expressed (DE) relative to that for the null hypothesis that a gene is equivalently expressed (EE) between two biological conditions. Although their approach has the potential for handling data with more complicated distributions, the computational cost of their method may increase greatly with the complexity of the model.
Chen et al. [6] employed a class of mixture models with two components to fit the microarray data with two biological conditions. To evaluate the differential expressions for each gene, they proposed a gene selection algorithm, namely the twocriterion method. Specifically, they calculated a posterior probability that there is at least a twofold change between the mean values of raw intensities under the two considered conditions. Then a gene is declared to be DE if the resulting posterior probability is large (say at least 0.7). Since the posterior probability is readily available once a Markov chain Monte Carlo sample is drawn from the posterior distribution, the gene selection algorithm proposed by them is quite easy to implement and computationally inexpensive. However, their approach does not consider general data distributions as that in the Bayes factor approach given by Yu et al. [36]. Assuming that the data under each biological condition follow a lognormal distribution as in [6], the mean value of raw intensities equals to exp(mean+variance/2) under each condition. Thus, the twocriterion method proposed by Chen et al. [6] that calculates the ratio of two means of the raw intensities depends on not only the difference between the two transformed means but also the difference between their variances. So, when the differences between the means and the differences between the variances are in opposite directions, the Chen et al. method may not be able to detect DE genes. Additionally, their paper neither provides a guidance on controlling the false discovery rate (FDR) nor carries out the performance comparison with other existing methods.
Our goal in this paper is to develop a simple but efficient gene selection algorithm so that it is not only computationally efficient, but also flexible in handling data with a complicated distribution as in Yu et al. [36]. We redevelop the twocriterion method proposed by Chen et al. [6] and construct two new gene selection algorithms for general Bayesian models. One is based on the differences between means and the other is based on both mean differences and variance differences. To differentiate the method proposed in Chen et al. [6], we name our methods as confident difference criterion methods and the two proposed confident difference criterion methods in this paper as Methods I and II. We show that the Method I, which compares the mean expressions from different conditions, is equivalent to the calibrated Bayes factor approach [36] when the raw intensities from two different biological conditions follow lognormal distributions with equal variance. We also address the multiple comparisons issue with a control of the false discovery rate. We further apply the proposed method to carry out analyses of microarray data with more than two conditions as well as sequencebased RNA data.
Method
Model for microarray data
We assume that the data, denoted by D _{ obs }, have already been preprocessed with appropriate transformation and normalization. Let T be the total number of biological conditions in the study. The data may contain two biological conditions (T=2) or multiple biological conditions (T>2). The common analytical objective is to detect differentially expressed (DE) genes across different biological conditions.
Let x _{ gtk } denote the preprocessed expression intensity of the g ^{th} gene in the k ^{th} sample under the t ^{th} condition for t=1,â‹¯,T. There are a total of G genes with sample size n _{ gt } under condition t. Thus, the data on gene g under each condition can be summarized using a vector: \(\mathbf {X}_{\textit {gt}}=(x_{gt1},\ldots,x_{gtn_{\textit {gt}}})\phantom {\dot {i}\!}\). We assume that the intensity, x _{ gtk }, k=1,â‹¯,n _{ gt }, t=1,2,â‹¯,T, follows a normal distribution \(\mathcal {N}\left (\mu _{\textit {gt}}, \sigma ^{2}_{\textit {gt}}\right)\) independently. The parameters Î¼ _{ gt } and \(\sigma ^{2}_{\textit {gt}}\) denote the mean and variance of the intensities of gene g under condition t, respectively. We write the mean intensities as Î¼ _{ gt } = Î¼ _{ g } + Î´ _{ gt }/T, t = 1,â€¦,T, where \(\sum ^{T}_{t=1}\delta _{\textit {gt}}\,=\,0\). For simplicity, we set \(\delta _{g1}=\sum ^{T}_{t=2} \delta _{\textit {gt}}\) under the first biological condition. We note that Î¼ _{ g } defines the overall mean of the intensities across all biological conditions, and Î´ _{ gt }/T measures the difference in the mean intensity under biological condition t from the overall mean. In a microarray study with two biological conditions (T=2), the mean intensities Î¼ _{ g1} and Î¼ _{ g2} are written as Î¼ _{ g1}=Î¼ _{ g }âˆ’Î´ _{ g }/2 and Î¼ _{ g2}=Î¼ _{ g }+Î´ _{ g }/2, respectively. When a gene is DE, we expect that the distributions of the data differ at least under two biological conditions.
Hierarchial prior distributions
Noninformative conditionally conjugate priors are specified for all parameters. Specifically, we assume that the mean parameters \(\mu _{g} \sim \mathcal {N}(0,\tau ^{2})\) and \(\delta _{\textit {gt}} \sim \mathcal {N} (0, \omega ^{2})\) for t>1 and any g, and the variance parameters \(\sigma ^{2}_{\textit {gt}} \sim \mathcal {IG}(a_{t},b_{t})\). We set the variance parameters Ï„ ^{2} and Ï‰ ^{2} in the normal priors to be 100 to obtain relatively noninformative priors. The shape parameter a _{ t } in the inverse gamma prior is set to be 2, so that the prior mean of \(\sigma _{\textit {gt}}^{2}\) equals b _{ t }. We further let the scale parameter b _{ t } follow a conditionally conjugate gamma prior with \(b_{t} \sim \mathcal {G}(c,d)\), where the hyperparameter c is specified as 1 and the hyperparameter \(d \sim \mathcal {IG}(a_{d},b_{d})\), in which a _{ d } and b _{ d } are both set to be 0.01 in the simulation study and 1 in the real data analysis. Our hierarchical priors for the variance parameters, which are often difficult to estimate, allow for borrowing the information across genes via \(b_{t} \sim \mathcal {G}(c,d)\) as well as biological conditions via \(d \sim \mathcal {IG}(a_{d},b_{d})\). We intend to specify a noninformative inversegamma prior for the parameter d. The value of â€œ1" was specified for both a _{ d } and b _{ d } in the real data analysis since the real data had a smaller sample size than the simulated data in the simulation study. These values of the hyperparameters still led to noninformative priors since the prior mean and variance of d do not exist. However, these values allowed us to borrow a little but not too much information across different biological conditions under comparison.
Conditional posterior distributions
Let \(\bar {x}_{\textit {gt}}\) denote the average intensities of gene g under condition t and also let the vector \(\bar {\mathbf {X}}_{g}=\{\bar {x}_{g1},\cdots,\bar {x}_{\textit {gT}}\}\) denote the average intensities for gene g. Then, \(\bar {\mathbf {X}}_{g}\) follows a multivariate normal distribution with \(\bar {\mathbf {X}}_{g} \sim \mathcal {N}(\mathbf {A}\mathbf {\Theta }_{g},\mathbf {\Sigma }_{g})\), where Î˜ _{ g }=(Î¼ _{ g },Î´ _{ g2},â‹¯,Î´ _{ gT })^{â€²} is a column vector of size T, and Î£ _{ g } is a diagonal matrix of size TÃ—T with the t ^{th} diagonal element \((\mathbf {\Sigma }_{g})_{t,t}=\sigma ^{2}_{\textit {gt}}/n_{\textit {gt}}\). Here A is a TÃ—T matrix, in which all elements in the first column equals one, i.e., A _{ t1}=1 for t=1,â‹¯,T, all but the first element in the first row equals âˆ’1/T, i.e., A _{1t }=âˆ’1/T for t=2,â‹¯,T, all but the first diagonal element equals 1/T, i.e., A _{ t,t }=1/T for t=2,â‹¯,T, and all other elements equal zero. Since the parameters in Î˜ _{ g } independently follow normal prior distributions, then \(\mathbf {\Theta }_{g} \sim \mathcal {N}(\mathbf {0},\mathbf {\Sigma }_{0})\), where 0 is a vector of size T containing all zeroâ€™s and Î£ _{0} is a diagonal matrix with the first diagonal element (Î£ _{0})_{1,1}=Ï„ ^{2} and all other diagonal elements equal to Ï‰ ^{2}, i.e., (Î£ _{0})_{ t,t }=Ï‰ ^{2} for t=2,â‹¯,T. Therefore, the conditional posterior distribution of Î˜ _{ g } is a multivariate normal distribution with Î˜ _{ g }âˆ¼N(U _{ g },B _{ g }), where the inverse of the variance matrix \(\mathbf {B}_{g}^{1}=\left (\mathbf {A}'\mathbf {\Sigma }_{g}^{1} \mathbf {A}+\mathbf {\Sigma }_{0}^{1}\right)\), and the mean vector \(\mathbf {U}_{g}= \mathbf {B}_{g} \mathbf {A}' \mathbf {\Sigma }_{g}^{1} \bar {\mathbf {X}}_{g}\). The conditional posterior distribution of the variance parameter \(\sigma _{\textit {gt}}^{2}\) is an inversegamma distribution with \(\sigma _{\textit {gt}}^{2} \sim \mathcal {IG}\left (a_{t}+\frac {1}{2}n_{\textit {gt}}, b_{t}+\frac {1}{2}\sum _{k=1}^{n_{\textit {gt}}}\left (x_{\textit {gtk}}\mu _{\textit {gt}}\right)^{2}\right)\). The conditional posterior density of the hyperparameter b _{ t } is given by \(g\left (b_{t}\sigma _{1t}^{2},\cdots, \sigma _{\textit {Gt}}\right) \propto b_{t}^{(\mathrm {G} a_{t}/2+c)}\exp \left (\frac {b_{t}}{2}\sum {\sigma _{\textit {gt}}^{2}}\right)\times \left (\sum _{t'\ne t} b_{t}'+b_{t}+b_{d}\right)^{\mathrm {T} c+a_{d}}\). Consequently, we can apply the Gibbs sampling algorithm to sample the parameters b _{ t }, \(\sigma ^{2}_{\textit {gt}}\) and Î˜ _{ g } in turn from their respective conditional posterior distributions using the following steps: (1) sample b _{ t } for each condition t from its posterior density function \(g(b_{t}\sigma _{1t}^{2},\cdots, \sigma _{\textit {Gt}})\) via the MetropolisHastings algorithm; (2) sample \(\sigma ^{2}_{\textit {gt}}\) given b _{ t } and Î¼ _{ gt } for each g and t from its inverse gamma posterior distribution with updated parameters \(a_{t}+\frac {1}{2}n_{\textit {gt}}\) and \(b_{t}+\frac {1}{2}\sum _{k=1}^{n_{\textit {gt}}}(x_{\textit {gtk}}\mu _{\textit {gt}})^{2}\); (3) sample Î˜ _{ g } given \(\sigma ^{2}_{\textit {gt}}\) for all g from their conditional multivariate normal posterior distribution, and calculate the Î¼ _{ gt } based on the sampled values of Î˜ _{ g }.
Model for sequencebased data
Let \(\phantom {\dot {i}\!}\textit {\textbf {Y}}_{\textit {gt}}=(y_{gt1},\cdots y_{gtn_{\textit {gt}}})\) denote all n _{ gt } observed counts of the expressed tags of gene g under condition t for g=1,â‹¯,G and t=1,â‹¯,T. We assume that y _{ gkt } follows a negative binomial distribution, which is commonly used for the count data with overdispersion [2, 26]. Specifically, we assume y _{ gtk } follows \(\mathcal {NB}\left (\phi _{t},\frac {m_{\textit {tk}}\lambda _{\textit {gt}}}{\phi _{t}+m_{\textit {tk}}\lambda _{\textit {gt}}}\right)\), with mean m _{ tk } Î» _{ gt } and variance \(m_{\textit {tk}}\lambda _{\textit {gt}}\left (1+m_{\textit {tk}}\lambda _{\textit {gt}}\phi _{t}^{1}\right)\). We set m _{ tk } to be the library size of the k ^{th} sample under the t ^{th} condition, which is the sum of all counts from this library. The dispersion parameter Ï• _{ t } is assumed to be positive, accounting for potential overdispersion in the data. When the dispersion parameter Ï• _{ t } gets extremely large, the value of \(\phi _{t}^{1}\) approaches to zero, and the negative binomial distribution becomes a Poisson distribution with a mean value of m _{ tk } Î» _{ gt }. DE genes are expected to have different Î» _{ gt }â€™s under different biological conditions.
Hierarchical prior distributions
We assume that each dispersion parameter Ï• _{ t } follows a gamma distribution, \(\phi _{t} \sim \mathcal {G}(\alpha _{\phi }, \beta _{\phi })\) independently over t and its scale parameter Î² _{ Ï• } follows an inverse gamma distribution with \(\beta _{\phi } \sim \mathcal {IG}(\zeta _{\phi },\eta _{\phi })\). We also assume that each gene expression parameter Î» _{ gt } follows an inverse gamma distribution with \(\lambda _{\textit {gt}} \sim \mathcal {IG}(\alpha _{\lambda _{t}},\beta _{\lambda _{t}})\), where the scale parameter \(\beta _{\lambda _{t}} \sim \mathcal {G}(\zeta _{\lambda },\eta _{\lambda })\). In our simulation studies, we set all the hyperparameters \(\{\alpha _{\phi },\zeta _{\phi }, \eta _{\phi },\alpha _{\lambda _{t}},\zeta _{\lambda },\eta _{\lambda }\}\phantom {\dot {i}\!}\) to be one.
Conditional posterior distributions
Since a negative binomial distribution can be written as a Poissongamma distribution, we can rewrite the distribution of y _{ gtk } as \(y_{\textit {gtk}} \sim \mathcal {P}oi(\theta _{\textit {gtk}})\), and \(\theta _{\textit {gtk}} \sim \mathcal {G}\left (\phi _{t},m_{\textit {tk}} \lambda _{\textit {gt}} \phi _{t}^{1}\right)\). Then we can derive all the conditional posterior distributions for all of the parameters. Specifically, the conditional posterior distribution of Î¸ _{ gtk } is a gamma distribution with \(\theta _{\textit {gtk}} \sim \mathcal {G}\left (y_{\textit {gtk}}+\phi _{t},\left [1+\frac {\phi _{t}}{m_{\textit {tk}}\lambda _{\textit {gt}}}\right ]^{1}\right)\), the kernel of the conditional posterior density of Ï• _{ t } is given by \(\prod _{\textit {gk}} \left [\frac {\phi _{t}^{\phi _{t}}}{\Gamma (\phi _{t})} \left (\frac {\theta _{\textit {gtk}}}{m_{\textit {tk}}\lambda _{\textit {gt}}}\right)^{\phi _{t}} \exp \left (\frac {\theta _{\textit {gtk}}}{m_{\textit {tk}}\lambda _{\textit {gt}}}\phi _{t}\right) \right ] \exp \left (\frac {\phi _{t}}{\beta _{\phi }}\right)\phi _{t}^{\alpha _{\phi }1} I(\phi _{t}>0)\), the conditional posterior distribution of Î» _{ gt } is \(\mathcal {IG}\left (\sum _{k}\phi _{t}+\alpha _{\lambda _{t}}, \beta _{\lambda _{t}}+\sum _{k}\frac {\theta _{\textit {gtk}}\phi _{t}}{m_{\textit {tk}}}\right)\), and the hyperparameters Î² _{ Ï• }, and \(\beta _{\lambda _{t}}\) respectively have the conditional posterior distributions: \(\beta _{\phi } \sim \mathcal {IG}\left (T\alpha _{\phi }+\zeta _{\phi }, \sum _{t}\phi _{t}+\eta _{\phi }\right)\), and \(\beta _{\lambda _{t}} \sim \mathcal {G}\left (G\alpha _{\lambda }+\zeta _{\lambda },1/\left (1/\eta _{\lambda }+\sum _{g} 1/\lambda _{\textit {gt}}\right)\right)\). Let Î¸ _{ t } denote a set containing all Î¸ _{ gtk }â€™s and Î» _{ t } as a set containing all Î» _{ gt }â€™s for each condition t. We use the Gibbs sampling algorithm to sample parameters \(\phantom {\dot {i}\!}\{\boldsymbol {\theta }_{t}, \boldsymbol {\lambda }_{t}, \beta _{\lambda _{t}}\}\), âˆ€t, and Î² _{ Ï• } from their conditional posterior distributions. The conditional posterior distribution of Ï• _{ t } does not have a known distribution form. These parameters are sampled using the MetropolisHastings sampling algorithm from their conditional posterior distributions.
Confident difference criterion
Preliminary
The confident difference criterion method was extended from the twocriterion method, which was firstly proposed by Ibrahim et al. [13] to detect DE genes for microarray studies with two biological conditions. In this twocriterion method, the fold change between two conditions was defined as \(\xi _{g}=\exp \left (\mu _{g2}+0.5\sigma _{g2}^{2}\mu _{g1}\right.\) \(\left.0.5\sigma _{g1}^{2}\right)\), and the posterior probabilities of having at least two fold changes between two conditions, denoted as Î³ _{ g1}=P r(Î¾ _{ g }>2D _{ obs }) and Î³ _{ g2}=P r(Î¾ _{ g }<1/2D _{ obs }), were evaluated on each gene to quantify the evidence of its differential expression. A gene is declared to be DE genes if the calculated posterior probabilities Î³ _{ g1} or Î³ _{ g2} are sufficiently large. The twocriterion method is easy to compute and provides good false positive and false negative rates [6] for identifying DE genes from microarray studies with two biological conditions. However, the posterior probability Î³ _{ g } defined in this confident difference criterion method does not account for the posterior variability of the fold change, and may not work well for the data with multiple conditions due to the potential multiple comparisons problem since only two conditions can be compared at a time.
In this section, we will develop confident difference criterion using a similar idea of the existing twocriterion method to compare mean expressions (Method I) after taking into account the posterior variability of the mean intensity parameters for the microarray data with two biological conditions. Then we extend the newly developed confident difference criterion method for the microarray data with multiple biological conditions. Furthermore, we will develop another version of the confident difference criterion method to compare both means and variances of the expressions (Method II) for the microarray data. Finally, we extend the confident difference criterion method for comparing mean differential expressions of microarray data (Method I) to the analysis of RNASeq data (Method I).
Confident difference criterion for the comparison between mean expressions for the microarray data
Microarray study with two conditions
For a study with two biological conditions, Î¼ _{ g2}âˆ’Î¼ _{ g1} quantifies the difference in the mean intensities of gene g between the two conditions and its conditional posterior distribution follows a normal distribution. We define the posterior probability as
where \(\sigma _{\mu _{g2}\mu _{g1}}\) is the posterior standard deviation of Î¼ _{ g2}âˆ’Î¼ _{ g1}. Then we select a cutoff value Î³ _{0} (0<Î³ _{0}<1) and declare a gene to be DE if its posterior probability Î³ _{ g } is greater than the cutoff value Î³ _{0}.
Note that the choice of Î³ _{0} reflects how strong the evidence is for declaring DE genes. When a larger value is specified for Î³ _{0}, fewer genes will be selected to be DE. In the twocriterion method, Chen et al. [6] recommended to use a large cutoff value (ranging between 0.7 and 0.9) because they did not adjust for the posterior variability of the fold change when comparing the gene intensities between the two conditions. After adjusting for the posterior variability, Î³ _{ g } in (1) is quite different than the corresponding posterior probability under the twocriterion method of Chen et al. [6], as shown in the following proposition.
Proposition 1.
Assume that the difference in the mean intensities, Î¼ _{ g2}âˆ’Î¼ _{ g1}, follows a normal distribution. The proposed confident difference criterion method ensures that if Î³ _{ g }â‰¥Î³ _{0}, then the maximum value of the posterior probabilities for the difference Î¼ _{ g2}âˆ’Î¼ _{ g1} being larger or smaller than zero, i.e., max{P r(Î¼ _{ g2}âˆ’Î¼ _{ g1}>0 D _{ obs }), P r(Î¼ _{ g2}âˆ’Î¼ _{ g1}<0 D _{ obs })}, is at least Î¦(2âˆ’Î¦ ^{âˆ’1}(1+Î¦(âˆ’2)âˆ’Î³ _{0})) for Î³ _{0}>Î¦(âˆ’2), where Î¦ and Î¦ ^{âˆ’1} denote the cumulative distribution function (cdf) and the inverse cdf of the standard N(0,1) distribution, respectively. The detailed proof is presented in Additional file 1.
We note that the maximum value of the posterior probabilities for the difference Î¼ _{ g2}âˆ’Î¼ _{ g1} being larger or smaller than zero measures a Bayesian pvalue. Figure 1 shows a graphical presentation of the Proposition 1 with Î³ _{0} chosen to be 0.5 and 0.7, respectively. For example, we use \(\xi _{\mu _{g2}\mu _{g1}}\) to denote the posterior mean value of the difference Î¼ _{ g2}âˆ’Î¼ _{ g1}. When Î³ _{0}=0.5 and assuming that the posterior mean value \(\xi _{\mu _{g2}\mu _{g1}}>0\), \(\xi _{\mu _{g2}\mu _{g1}}\) is at least \(1.94 \sigma _{\mu _{g2}\mu _{g1}}\) away from zero. The maximum value of the posterior probabilities for the difference Î¼ _{ g2}âˆ’Î¼ _{ g1} being larger or smaller than zero, max{P r(Î¼ _{ g2}âˆ’Î¼ _{ g1}>0 D _{ obs }), P r(Î¼ _{ g2}âˆ’Î¼ _{ g1}<0 D _{ obs })}, will be at least Î¦(2âˆ’Î¦ ^{âˆ’1}(1+Î¦(âˆ’2)âˆ’0.5))=97.4 %. Therefore, we recommend to use a smaller cutoff value than the previous twocriterion method [6] when using (1) for identifying DE genes. Possible choices of the cutoff value Î³ _{0} may range from 0.4 to 0.7.
Connection with the CBF method for microarry study with two conditions
For a microarray study with two biological conditions, we assume that the preprocessed expression intensity from each biological condition follows a normal distribution with \(x_{\textit {gtk}} \sim N\left (\mu _{\textit {gt}},\sigma _{\textit {gt}}^{2}\right)\), and the parameters follow the prior distribution specified in the aforementioned Model for microarray data subsection. For simplicity, we assume that the equal number of intensities are observed from the same gene under different conditions, and they share the same known variance, i.e., n _{ g1}=n _{ g2}=n _{ g } and \(\sigma ^{2}_{g1}=\sigma ^{2}_{g2}={\sigma ^{2}_{g}}\). The proposed confident difference criterion method is used to detect differentially expressed genes. Alternatively, we can also apply the CBF method for the data analysis. To detect differentially expressed genes, we test on the null hypothesis that the mean intensities are equal (Î¼ _{ g1}=Î¼ _{ g2}) against the alternative hypothesis that the mean intensities are unequal (Î¼ _{ g1}â‰ Î¼ _{ g2}) between the two biological conditions. We use the same prior distributions as that in the confident difference criterion method under the alternative hypothesis, and similar prior distributions for the parameters under the null hypothesis. With simple algebra, we can show that the proposed confident difference criterion method for comparing the mean intensities between two biological conditions agrees with the CBF method under the condition stated in the following Proposition.
Proposition 2.
The confident difference criterion method comparing the mean intensities between the two biological conditions with a cutoff value of Î³ _{0} agrees with the CBF method for the hypotheses testing on whether the mean intensities are equal between the two biological conditions with a cutoff value of C _{0} if \(\Phi \left (2+E_{g}^{\ast }\right)\Phi \left (2+E_{g}^{\ast }\right)=1\gamma _{0}\), where \(E_{g}^{\ast }= \left [ \log \left (\frac {n_{g}\omega ^{2}}{2{\sigma ^{2}_{g}}}+1\right) 2\log (C_{0})\right ]^{\frac {1}{2}}\), provided that the cutoff value C _{0} is chosen so that the argument in the square root expression is nonnegative. The detailed proof is presented in Additional file 1.
Microarray study with multiple conditions
The confident difference criterion method can be extended to microarray studies with multiple biological conditions. Our primary interest of the study is to identify genes that have differential expressions at least between two biological conditions. Therefore, we define a quadratic form to quantify the differences in the gene expression intensities across different biological conditions, and conduct an overall test to determine whether the mean intensities are different at least under two biological conditions on each gene.
Considering the first biological condition as a reference condition, we define a column vector Î” _{ Î¼,g }=(Î¼ _{ g2}âˆ’Î¼ _{ g1},â‹¯,Î¼ _{ gT }âˆ’Î¼ _{ g1})^{â€²} to measure the difference in the mean intensities between each nonreference biological condition and the reference condition. Let \(\phantom {\dot {i}\!}\Sigma _{\boldsymbol {\Delta }_{\mu, g}}\) be the posterior covariance matrix of Î” _{ Î¼,g }. We then propose the quadratic form, \(\boldsymbol {\Delta }_{\mu, g}'\Sigma ^{1}_{\boldsymbol {\Delta }_{\mu, g}}\boldsymbol {\Delta }_{\mu, g}\), to quantify the differential gene expressions for all nonreference biological conditions compared to the reference condition. Under the null hypothesis that gene g is not DE, the quadratic form follows a chisquare distribution with d f=Tâˆ’1 when Î” _{ Î¼,g } is assumed to follow a multivariate normal distribution. We note that the multivariate normality holds asymptotically when the sample size is large. We choose an integer, denoted as C, which is closest to the 95 ^{th} percentile of the chisquare distribution. For example, for a microarray study with three biological conditions (i.e., T=3), the corresponding C value equals 6. Similar to (1), we compute the posterior probability
and declare gene g to be DE if Î³ _{ g }â‰¥Î³ _{0}.
Confident difference criterion for comparison of both means and variances of expression for microarray data
We note that the confident difference criterion method proposed so far only evaluates the differences in mean intensities. Recall that the Bayes factor approach in Yu et al. [36] is more desirable since it compares both means and variances of the intensities for each gene. Assume that the means and variances are equally important. An appropriate quadratic form can be constructed to quantify the overall difference between both the means and the variances under different conditions on each gene. Since the posterior distribution of \(\sigma _{\textit {gt}}^{2}\)â€™s is typically skewed, a stabilization transformation of the variance \(\sigma _{\textit {gt}}^{2}\) is required. Let q(.) denote a onetoone transformation function. The differences in both means and transformed variances of the intensities across different conditions can be summarized in a quadratic form given by
where \(\boldsymbol {\Delta }_{\boldsymbol {\mu }, \boldsymbol {\sigma },g}=\left (\mu _{g2}\mu _{g1},\cdots, \mu _{\textit {gT}}\mu _{g1}, q\left (\sigma ^{2}_{g2}\right)q \left (\sigma ^{2}_{g1}\right),\cdots, q\left (\sigma ^{2}_{\textit {gT}}\right)q\left (\sigma ^{2}_{g1}\right)\right)'\) is a column vector of length 2Tâˆ’2 containing the differences in both means and transformed variances of the intensities. The covariance matrix \(\Sigma _{\boldsymbol {\Delta }_{\boldsymbol {\mu },\boldsymbol {\sigma },g}}\phantom {\dot {i}\!}\) is the posterior covariance matrix of Î” _{ Î¼,Ïƒ,g }. Since q(Â·) is a onetoone transformation function, then we have \( \sigma ^{2}_{\textit {gt}}=\sigma ^{2}_{gt'}\) if and only if \( q\left (\sigma ^{2}_{\textit {gt}}\right)=q\left (\sigma ^{2}_{gt'}\right)\) for tâ‰ t ^{â€²}. Thus, the same q(Â·) function has to be used across all the T treatment groups. The primary reason for introducing the transformation function q(Â·) is to make the distribution of \(q\left (\sigma ^{2}_{\textit {gt}}\right)\) more normal.
Similar to (2), we compute the posterior probability Î³ _{ g }=P r(Q _{ g }>CD _{ obs }), where C is chosen to be an integer, which is closest to the 95 ^{th} percentile of the chisquare distribution with d f=2Tâˆ’2. For example, C will be chosen to be 9 when T=2, and 13 when T=3. In this paper, we consider the negative cube root transformation on the variance parameters \(\sigma ^{2}_{\textit {gt}}\)â€™s. The cube root transformation, also known as WilsonHilferty transformation, was derived by Wilson and Hilferty [35] to transform a chisquare variate to be approximately normally distributed. In the proposed gene selection algorithm below, the cutoff value Î³ _{0} will be automatically determined to control the false discovery rate to be less than a targeted level.
Confident difference criterion for sequencebased data
As discussed in the Model for sequencebased data subsection, the parameter Î» _{ gt } quantifies the expression level of gene g under condition t. The differences in Î» _{ gt }â€™s measure the relative differential expressions of gene g between the conditions. Note that the Î» _{ gt }â€™s likely have small values and their posterior distributions may be skewed. Therefore, we apply a log transformation on Î» _{ gt }â€™s and use the differences in logÎ» _{ gt }â€™s to quantify the differential gene expressions among different biological conditions. Similar to the confident difference criterion method for microarray data, we propose the confident difference criterion method for the sequencebased data with two biological conditions as follows. We first compute
where \(\sigma _{\log (\lambda _{g2})\log (\lambda _{g1})}\) is the posterior standard deviation for the difference log(Î» _{ g2})âˆ’ log(Î» _{ g1}). We then declare gene g to be DE if Î³ _{ g }â‰¥Î³ _{0}, where 0<Î³ _{0}<1 is a predetermined credible level.
When the sequencebased data are collected from multiple conditions, the first biological condition will be considered as the reference condition and a column vector Î” _{ Î»,g }=(log(Î» _{ g2})âˆ’ log(Î» _{ g1}), log(Î» _{ g3})âˆ’ log(Î» _{ g1}),â‹¯, log(Î» _{ gT })âˆ’ log(Î» _{ g1}))^{â€²} contains the differences in the log scaled expression values between the nonreference conditions and the reference condition. Let \({\Sigma }_{\boldsymbol {\Delta }_{\lambda,g}}\phantom {\dot {i}\!}\) denote the posterior covariance matrix of Î” _{ Î»,g }. Accordingly, the confident difference criterion method is defined as \(\gamma _{g}=Pr\left ({\boldsymbol {\Delta }}_{\lambda,g}'\mathbf {\Sigma }^{1}_{{\boldsymbol {\Delta }}_{\lambda,g}} {\boldsymbol {\Delta }}_{\lambda,g}>C_{\lambda }D_{\textit {obs}}\right)\), where C _{ Î» } is an integer, which is closest to the 95 ^{th} percentile of the chisquare distribution with d f=Tâˆ’1. Again, we declare gene g to be DE if Î³ _{ g }â‰¥Î³ _{0}, where 0<Î³ _{0}<1.
From the Model for sequencebased data subsection, we see that the variance of the observed count y _{ gtk } is \(m_{\textit {tk}}\lambda _{\textit {gt}} \left (1+ m_{\textit {tk}} \lambda _{\textit {gt}} \phi ^{1}_{t}\right)\), which is a function of Î» _{ gt } and Ï• _{ t }, for the sequence data. Since Ï• _{ t } does not depend on g, it is sufficient to compare the mean expressions under different conditions for determining DE genes for the sequence data.
False discovery rate and gene selection algorithm
The proposed confident difference criterion methods calculate the value of Î³ _{ g } on each gene, whose magnitude reflects the evidence of differential expression. When Î³ _{ g } is large enough, the gene will be declared to be DE. It is of great importance to determine how to choose the cutoff value Î³ _{0}.
We adopt the approach proposed by Tadesse et al. [31] to select the cutoff value Î³ _{0} for controlling the Bayesian FDR. Let V denote the number of incorrect decisions by identifying EE genes as DE genes and let R be the number of identified DE genes. Then the positive false discovery rate defined by Storey [30] is given by \(pFDR=E\left (\frac {V}{R}R>0\right)\).
We need to test the hypotheses of H _{0g }: gene g is EE versus H _{1g }: gene g is DE on each gene. We assume that all genes have the same probability of being EE, and DE, respectively, i.e., P r(H _{0g })â€™s are equal for all genes, and P r(H _{1g })â€™s are equal for all genes. Therefore, the Î³ _{ g }â€™s are independently and identically distributed. Following Tadesse et al. [31], the Bayesian FDR b F D R(Î³ _{0}) when using a cutoff value of Î³ _{0} for the confident difference criterion method is defined as
and P r(Î³ _{ g }â‰¥Î³ _{0})=P r(Î³ _{ g }â‰¥Î³ _{0}H _{0g })P r(H _{0g })+P r(Î³ _{ g }â‰¥Î³ _{0}H _{1g })P r(H _{1g }). To estimate the FDR, we need to compute P r(Î³ _{ g }â‰¥Î³ _{0}H _{0g }), P r(Î³ _{ g }â‰¥Î³ _{0}H _{1g }) and P r(H _{1g }). Note that gene g can be classified into DE or EE depending on whether Î³ _{ g }â‰¥Î³ _{0}. We reuse the data information and specify the prior probability P r(H _{1g }) as the proportion of genes classified as DE. Denote the total number of identified DE genes as n _{ D }. Then the probability of a gene being DE will be P r(H _{1g })=n _{ D }/G. Additionally, we estimate the true parameters in the gene expression data distributions from DE or EE genes as the posterior means of the corresponding parameters from the identified DE and EE genes, respectively. An algorithm using the posterior samples from DE or EE genes to estimate the aforementioned probabilities P r(Î³ _{ g }â‰¥Î³ _{0}H _{0g }), P r(Î³ _{ g }â‰¥Î³ _{0}H _{1g }) and b F D R(Î³ _{0}) is given as follows.

Split the genes into two subsets containing n _{ E } EE genes (EEGENE) with calculated Î³ _{ g }<Î³ _{0} and n _{ D } DE genes (DEGENE) with Î³ _{ g }â‰¥Î³ _{0}.

Note that a DE gene can be either up or down regulated under some condition compared to the reference condition in terms of means or variances of the expression values for microarray experiment or in terms of mean gene expressions for sequencebased experiment. Accordingly, the DE genes will be further split into a series of gene subsets based on the pattern of parameters in comparisons under different biological conditions. For example, in a microarray study with three biological conditions and the mean gene expressions are in comparison. Consider the first condition as the reference condition. The DE genes can be classified into four subsets: (i) genes with lower mean gene expressions under both conditions 2 and 3; (ii) genes with lower mean gene expressions under condition 2 but higher mean gene expressions under condition 3; (iii) genes with higher mean gene expressions under condition 2 but lower mean gene expressions under condition 3; and (iv) genes with higher mean gene expressions under both conditions 2 and 3. We denote these subsets of DE genes as D _{ â„“ },â„“=1,â‹¯,L, where the number of subsets L=2^{Tâˆ’1} for the microarray study when only mean parameters are in comparison or the sequencedbased study; and L=4^{Tâˆ’1} for the microarray study when both mean and variance parameters are in comparison. We also denote the number of genes in the D _{ â„“ } DE gene subsets as n _{ D â„“ }.

For EE genes identified in previous steps, the same data distributions will be considered for the gene expression data from different biological conditions. Hierarchical priors similar to those proposed previously in the Model for microarray data section and the Model for sequencebased data subsection will be augmented separately for microarry data or sequencebased data. The posterior mean of each parameter defined in the distribution of the gene expression data will be calculated. The true parameters in the gene expression data distribution will be estimated using the average value of posterior mean of corresponding parameters from all EE genes. For all identified DE genes, the Markov chain Monte Carlo (MCMC) sampling values from previous steps when implementing the proposed confident difference criterion method will be used for calculating the posterior means of the parameters defined in the gene expression data distribution. For each differential gene expression pattern â„“, the actual value of each parameter in the gene expression data distribution will be estimated using the average value of its posterior means across all genes in the subset D _{ â„“ } with this DE pattern.

Using the estimated values for the parameters in the gene expression data, the data will be simulated for ÎºÃ—G genes (say Îº=0.1), among which Îº n _{ E } EE genes and \(\kappa n_{D_{\ell }}, \ell =1,\dots,L\) DE genes with a pattern of differential gene expression observed in the DE gene subset D _{ â„“ }, respectively.

The posterior probability Î³ _{ g } will be calculated for each gene based on the simulated data. Depending on whether Î³ _{ g }â‰¥Î³ _{0}, the gene in the simulated data will also be claimed to be DE or EE.

Denote the total number of identified DE and EE genes from the simulated data as m _{ D } and m _{ E }. Then the probability for a EE genes claimed to be DE, P r(Î³ _{ g }â‰¥Î³ _{0}H _{0g }), will be estimated as \(Pr(\gamma _{g} \ge \gamma _{0}H_{0g})=\frac {m_{E}}{\kappa n_{E}}\); and the probability for a DE genes claimed to be DE, P r(Î³ _{ g }â‰¥Î³ _{0}H _{1g }), will be estimated as \(Pr(\gamma _{g} \ge \gamma _{0}H_{1g})=\frac {m_{D}}{\kappa n_{D}}\). Using (5), the estimated Bayesian FDR equals \(\widehat {bFDR}=\frac {m_{E}}{m_{D}+m_{E}}\).
Note that steps (4) to (6) provide a predictive approach to estimate bFDR when a certain value Î³ _{0} was used for identifying DE genes. Therefore, we can control the FDR at some prespecified value (i.e. 0.05) by choosing the corresponding cutoff value \(\hat {\gamma _{0}}\) as the minimum value of all cutoff values with an associated FDR no more than 0.05, or \(\hat {\gamma }_{0}=\min \{\gamma _{0}: (\widehat {bFDR}(\gamma) \le 0.05)\}\).
Results and discussion
In this section, two different simulation studies were conducted to investigate the performance of the proposed confident difference criterion methods on identifying DE genes for microarray or sequencebased studies, respectively. In addition, a real affymetrix dataset is used to further demonstrate the proposed methodology.
Simulation study I: Microarray data
Two settings were considered. In the first setting, the intensity values having different means and variances between two biological conditions on each DE gene are simulated. In the second setting, the data were simulated from three biological conditions, with DE genes having different mean and variance values between at least two biological conditions.
Setting 1 (Two conditions)
Fifty simulations were used in this study to investigate the performance of different versions of the confident difference criterion methods described in the Confident difference criterion section. In each simulation, there were 5000 genes in total and 500 DE genes with 10 replications under each of the two biological groups. The logscaled data were generated via \(x_{g1k} \overset {\text {iid}}{\sim } \mathcal {N}(\mu _{g}0.5 \delta _{g}, 0.2^{2}), x_{g2k} \overset {\text {iid}}{\sim } \mathcal {N}\left (\mu _{g}+0.5\delta _{g}, 0.9^{2}\right)\) with Î´ _{ g }=1,âˆ€g=1,â€¦,250 and Î´ _{ g }=âˆ’1,âˆ€g=251,â€¦,500 for the DE genes, and \(x_{g1k},x_{g2k} \overset {\text {iid}}{\sim } \mathcal {N}(\mu _{g}, 0.7^{2})\) for the remaining EE genes. The average intensities Î¼ _{ g } were generated from an uniform distribution, where \(\mu _{g} \overset {\text {iid}}{\sim } \mathcal {U} (5,11)\) for all genes. Conditionally conjugate priors described in the Model for microarray data subsection were used for all parameters Î¼ _{ g }, Î´ _{ g }, \(\sigma _{g1}^{2}\), \(\sigma _{g2}^{2}\) and \({\sigma _{g}^{2}}\).
The simulated data were analyzed using both Methods I and II of the confident difference criteroin methods. For each version, the cutoff value Î³ _{0} were set to be 0.4, 0.6, or the cutoff value controlling the FDR to be no more than 0.05, separately. The genes with the calculated posterior distribution values Î³ _{ g } via Equation (1) or Equation (3) less than the chosen Î³ _{0} were identified to be DE. To evaluate the performance of the confident difference criterion methods, the simulated datasets were also analyzed by four existing methods: Significant Analysis of Microarrays (SAM) [33], Linear Models for Microarray Data (LIMMA) [28], Semiparametric Hierarchical Method (SPH) [23], Empirical Bayesian Analysis of Gene Expression Data (EBarrays or EBA) [14]. All these existing methods allowed a control of the FDR for multiple comparisons. The genes were declared to be DE with FDR controlled at 0.05 for all these four methods.
Based on the identified gene list by each method, we calculated the number of claimed DE genes (CDE), the number of correctly claimed DE genes (CCDE), the number of correctly claimed EE genes (CCEE), the false positive rate (FPR), false negative rate (FNR), false discovery rate (FDR) and false nondiscovery rate (FNDR) for all considered methods. These results and their standard deviations reported in parentheses were summarized in Table 1. Note, for Methods I and II, the choice of Î³ _{0}=0.4 identified the a larger number of DE genes when compared to the choice of Î³ _{0}=0.6. While for the case with FDR control of 0.05, the Method II identified the largest number of DE genes among all six methods. We also compared the results of the confident difference criterion method with a control of FDR against all four existing methods. We expected a method with good performance will provide a good control of FDR and provide smaller error rates in terms of FPR, FNR and FNDR. Under both versions of the confident difference criterion method, the achieved FDR is close to but less than 0.05, implying that the proposed confident difference criterion methods provided a good control of the FDR. All four existing methods also obtained a control of FDR at 0.05 successfully, although the SPH method provides a conservative control of FDR with the empirical FDR equal to 0.02. Since all methods provided small error rates of FPR and FNDR, we put more weight to the comparison of the empirical FNRs among all applied methods. The results in Table 1 showed that Method II provided the smallest empirical FNR out of all methods by successfully identifying almost all truly DE genes; and Method I had comparable empirical FNR as the SAM and the LIMMA methods, and much smaller empirical FNR when compared to the SPH and the EBA methods.
Setting 2 (Three conditions)
The data were simulated from three biological conditions, and the first biological condition was considered as the reference group. A gene was set to be DE so that at least one group would be either up or down regulated from the reference group. Specifically, 500 DE genes out of 5000 genes were set in the data, and the log intensities of the DE genes were generated via \(x_{g1k} \overset {\text {iid}}{\sim } \mathcal {N}\left (\mu _{g1}, 0.2^{2}\right), x_{g2k} \overset {\text {iid}}{\sim } \mathcal {N}\left (\mu _{g1}+0.5\nu _{g1}, 0.5^{2}\right), x_{g3k} \overset {\text {iid}}{\sim } \mathcal {N}\left (\mu _{g1}+0.5 \nu _{g2}, 0.8^{2}\right)\). Depending on whether the gene was set to be DE in one or both conditions from reference group, the parameters Î½ _{ g1} and Î½ _{ g2} were set to have Î½ _{ g1}=Î½ _{ g2}=1.5 for g=1,â‹¯,62 (upregulated in both conditions); Î½ _{ g1}=1.5, Î½ _{ g2}=0 for g=63,â‹¯,125 (only upregulated in condition 2); Î½ _{ g1}=1.5, Î½ _{ g2}=âˆ’1.5 for g=126,â‹¯,187 (upregulated in condition 2, downregulated in condition 3); Î½ _{ g1}=0 and Î½ _{ g2}=1.5 for g=188,â‹¯,250 (only upregulated in condition 3); Î½ _{ g1}=0, Î½ _{ g2}=âˆ’1.5 for g=251,â‹¯,312 (only downregulated in condition 3); Î½ _{ g1}=âˆ’1.5, Î½ _{ g2}=1.5 for g=313,â‹¯, 375 (downregulated in condition 2, upregulated in condition 3); Î½ _{ g1}=âˆ’1.5, Î½ _{ g2}=0 for g=376, â‹¯,437 (only downregulated in condition 2); Î½ _{ g1}=Î½ _{ g2}=âˆ’1.5, for g=438,â‹¯,500 (downregulated in both conditions). The remaining genes were EE and their log intensities were generated via \(x_{\textit {gtk}} \overset {\text {iid}}{\sim } \mathcal {N}\left (\mu _{g}, 0.6^{2}\right)\) for t=1,2,3 and g=501,â‹¯,5000. On all genes, the parameter Î¼ _{ g1} were generated from an uniform distribution, i.e., \(\mu _{g1} \overset {\text {iid}}{\sim } \mathcal {U}(5,11)\). Each condition contained 10 replicates on each gene and 50 simulations were conducted.
The model similar to those described in Model for microarray data subsection and the proposed confident difference criterion methods including the Method I for comparing mean expressions and the Method II comparing both mean and variance expressions were applied to the simulated data. We considered three choices for the cutoff value Î³ _{0}, including prespecified values 0.4, 0.6, or a value with FDR controlled at 0.05, separately. The data were also analyzed by the SAM [33], LIMMA [28], and EBArrays [14] with the FDR controlled at 0.05. The SPH [23] were not used in the study as they were proposed for studies with two biological conditions only. The analytical results from all methods were compared based on four error rates including FPR, FNR, FDR and FNDR from each considered method (Table 2). The confident difference criterion methods including Method I and Method II as well as the existing methods except LIMMA all provided an empirical FDR no more than 0.05 successfully. Comparing to the existing methods, the proposed confident difference criterion methods provided comparable FPR and smaller FNR and FNDR. Method II of the confident difference criterion method compares both mean and variance values of the gene expression intensities across different biological conditions. This is a potential reason for the proposed method providing smaller FNR for microarray data analysis. The confident difference criterion method is particularly effective when both mean and variance of the expression intensities differ across biological conditions on the DE genes.
Simulation Study II: sequencebased data
The focus of this study is to investigate the performance of the proposed confident difference criterion method for identifying DE genes from sequencebased highthroughput experiments including SAGE and RNASeq studies.
Setting 1 (SAGE experiment)
The simulation proposed by Lu et al. [20] was used to conduct the simulation study. Specifically, 5000 genes were sampled under five libraries from each of the two conditions with fixed library sizes sampled uniformly between 30000 and 90000. A total of 500 genes were set to be DE genes. The data were generated from a negative binomial distribution, \(y_{\textit {gtk}} \overset {\text {iid}}{\sim } \mathcal {NB}(\phi _{t}, \frac {m_{\textit {tk}}\lambda _{\textit {gt}}}{\phi _{t}+m_{\textit {tk}}\lambda _{\textit {gt}}})\) for gene g, for a fixed library k of condition t, where m _{ tk } was the library size for library k under condition t; Ï• _{1} and Ï• _{2} denoted the dispersion parameters for data from the two conditions separately, and both set to be 0.4; Î» _{ gt } measured the expression level of gene g under condition t and were set with different values when gene g is DE and the same value when gene g is EE. For g=1,â‹¯,250, we set Î» _{ g1}=8Eâˆ’4 and Î» _{ g2}=2Eâˆ’4 to include downregulated genes in condition 2. For g=251,â‹¯,500, we set Î» _{ g1}=2Eâˆ’4 and Î» _{ g2}=8Eâˆ’4 to include upregulated genes in condition 2. For other genes with g=501,â‹¯,5000, we set Î» _{ g1}=Î» _{ g2}=2Eâˆ’4 to include EE genes. Fifty simulations were used in this study.
The proposed confident difference criterion method for RNASeq data was used to analyze the simulated data. The posterior probability Î³ _{ g } measuring the evidence of differential gene expression were estimated using average value of its posterior sampled values. The cutoff value Î³ _{0} for Î³ _{ g } were set to be 0.4, 0.6 or a value to control the FDR to be 0.05, separately. The genes with estimated Î³ _{ g } less than the chosen Î³ _{0} value were claimed to be DE. We also fit several other existing methods including edgeR [26], DESeq [2], BaySeq [10], NBPSeq [8], EBSeq [17], NOISeq [32], SAMSeq [19], and TSPM [3]. When the edgeR method was applied, we chose both options to estimate the common dispersion parameter for all tags and the tagwise dispersion parameters respectively. For the NOISeq method, we estimated and controlled the FDR using the method proposed by Newton et al. [23] for identifying DE genes.
The results using the proposed confident difference criterion methods and all fitted existing methods for RNASeq data were summarized in Table 3. Similar to the simulation study I for microarray data, Table 3 showed that the higher the cutoff value Î³ _{0}, the less number of genes were identified to be DE. The confident difference criterion method with a control of FDR at 0.05 achieved an empirical FDR of 0.044, and successfully identified 328.8 genes (65.8 %) on average out of 500 truly DE genes. Compared to other considered methods, the confident difference criterion method performed the best by providing the smallest FNR and FNDR while maintaining comparable FPR and a well controlled FDR. Out of the applied existing methods, the NOISeq method and edgeR method achieved the lowest FNR, and a FDR of no more than 0.05. The BaySeq method provided a conservative control of FDR, and achieved an empirical FDR of lower than 0.001 when controlling the FDR at 0.05. The DESeq, EBSeq and TSPM methods failed to control the FDR at 0.05. The SAMSeq method and TSPM method failed to identify most of the truly DE genes as DE genes, which was not surprising as the performance of both the SAMSeq and TSPM methods is highly sample size dependent as pointed out by Soneson and Delorenzi (2013) [29].
Setting 2 (RNASeq experiment)
We used a similar simulation setting proposed by Kvam et al. [16] for illustrating the application of the proposed confident difference criterion method for RNASeq experiment. We still simulated 50 dataset, each dataset contained six libraries with three libraries from each of the two conditions on 5000 genes, among which 250 genes were set to be upregulated genes and another 250 genes were set to be downregulated genes in condition 2 versus condition 1. The overall mean expression levels across both conditions were generated from a gamma distribution with \(\lambda _{g} \sim \mathcal {G}(0.25, 600)\). To avoid including genes with low expression levels from both conditions as DE genes, we set the difference in the gene expression levels between conditions in two ways depending on whether the value of Î» _{ g } is larger than one. Specifically, we generated Î¾ _{ g } from uniform distribution \(\mathcal {U}(3,20)\) for each gene. If the value of Î» _{ g }>1, we let the fold change between the gene expression values of DE genes to be Î¾ _{ g }, or \(\lambda _{g1}=\lambda _{g}/\sqrt {\xi _{g}}\) and \(\lambda _{g2}=\lambda _{g}*\sqrt {\xi _{g}}\) for upregulated genes, and \(\lambda _{g1}=\lambda _{g}*\sqrt {\xi _{g}}\) and \(\lambda _{g2}=\lambda _{g}/\sqrt {\xi _{g}}\) for downregulated genes. If the value of Î» _{ g }â‰¤1, we let the absolute difference in the gene expression values to be Î¾ _{ g }, or we let Î» _{ g1}=Î» _{ g }+Î¾ _{ g } and Î» _{ g2}=Î» _{ g } for downregulated genes, and Î» _{ g1}=Î» _{ g } and Î» _{ g2}=Î» _{ g }+Î¾ _{ g } for upregulated genes in condition 2. For an EE gene, we had Î» _{ g1}=Î» _{ g2}=Î» _{ g }.
Then we generated the data using negative binomial distribution of \(y_{\textit {gtk}} \overset {\text {iid}}{\sim } \mathcal {NB}(\phi _{t}, \frac {\lambda _{\textit {gt}}}{\phi _{t}+\lambda _{\textit {gt}}})\) for gene g, and the overdispersion parameters Ï• _{1} and Ï• _{2} were set to have Ï• _{1} = 1 and Ï• _{2} = 8 respectively for DE genes; and Ï• _{1}=Ï• _{2}=4 for EE genes.
All methods applied in setting I of simulation study II were also used for data analysis in this simulation study. The results in Table 4 displayed that the confident difference criterion method with a control of FDR at 0.05, the edgeR method with common dispersion parameter over genes, the edgeR with genewise dispersion parameter, the BaySeq, the NBPSeq, the NOISeq methods successfully controlled the FDR at 0.05. Additionally the confident difference criterion method, the NBPSeq method, the edgeR method with a common dispersion parameter over genes also provided a good and comparable control of FNR of less than 0.2, while maintaining low levels of FPR and FNDR.
Real data analysis
We used a real data set obtained using customized Bovine Affymetrix arrays (Davis, Talbott, Yu, and Cupp, unpublished results) to illustrate the proposed method. Fifteen arrays composed of three replicate arrays under three biological conditions were produced to screen for DE genes associated with prostaglandin F2Î±(PGF) treatment after 30 min, 1 h, 2 h, and 4 h compared to the control treatment (saline). For simplicity, we focused on detecting genes using the confident difference criterion methods (Method I and Method II) that were regulated 1 h or 2 h after PGF treatment. The data were extracted, normalized and summarized using the Robust Multiarray Average (RMA) [12] method at the exon level via the Affymetrix expression console. The data set contains 21724 genes. Note that some genes may have multiple probe replicates ranging from one replicate to 266 replicates, and the data from different probes of the same gene may have large variation even after RMA normalization. We centered the data from each probe of the same gene to the mean log intensities of that gene, and excluded 3116 genes with only a single probe replicate from the analysis to make sure that the parameters were estimable. Additionally, we excluded 2137 low expression genes if twothirds or more (six out of nine) samples on this gene had gene expression values measured by the geometric mean expression values across different probes less than 10. Of the remaining 16471 genes with replicate probes, we used z _{ gjtk } to denote the k ^{th} biological replicate sample of the log2 scale gene expression intensity for probe j of gene g under condition t. Note that the index j was added to the previous notations for the log intensity values as data are available for multiple probes on the same gene. We assumed normal distribution for the log2 intensities with \(z_{\textit {gjtk}} \sim \mathcal {N}(\mu _{\textit {gtk}}, \sigma ^{2\ast }_{\textit {gt}})\), and the same prior for Î¼ _{ gtk } as what we set for X _{ gt } in the Model for microarray data subsection. The variance parameters are assumed to follow inverse gamma distribution with \(\sigma ^{2\ast }_{\textit {gt}} \sim \mathcal {IG}(\alpha ^{\ast }_{t},\beta ^{\ast }_{t})\) with \(\alpha ^{\ast }_{t}=2\) and \(\beta ^{\ast }_{t} \sim \mathcal {G}(\alpha ^{\ast }_{0},\beta ^{\ast }_{0})\). We set \(\alpha ^{\ast }_{0}=1\) and \(\beta ^{\ast }_{0}\sim IG(\alpha ^{\ast },\beta ^{\ast })\) where both Î± ^{âˆ—}=Î² ^{âˆ—}=1. During computation for controlling the FDR, we reuse these settings of the prior distributions on the parameters Î¼ _{ gtk } and \(\sigma ^{2\ast }_{\textit {gt}}\) for DE genes. For EE gens, we assume that \(z_{\textit {gjtk}} \sim \mathcal {N}(\mu _{\textit {gtk}},\sigma ^{2\ast }_{g})\), and make similar augment for the prior distributions of their parameters Î¼ _{ gtk } and \(\sigma ^{2\ast }_{g}\) as the DE genes. The proposed confident difference criterion methods were applied to assess the evidence of differential expression on each gene and identify DE genes with the cutoff value equal to be 0.4, 0.6 or a value that controls the FDR at 0.05.
In addition, we analyzed the real data using the existing methods including SAM, LIMMA, and EBarrays as described in the Simulation Studies section for identification of DE genes. Since the existing methods were developed for data with single probe replicate on each gene, we calculated the mean log intensities over all probes for each biological sample on each gene to quantify the corresponding gene expression. The genes were declared to be DE if the false discovery rate was no more than 0.05. We used Venn diagrams to demonstrate the overlap of DE genes identified by Method I (Fig. 2, Left Panel) or Method II (Fig. 2, Right Panel), to the DE genes identified by SAM and EBarrays (Fig. 2). The results showed that more genes were identified to be DE by the proposed Method I and Method II than the existing methods. Specifically, 1050 DE genes were identified by Method II, while 896 genes were identified to be DE by either SAM or EBarrays. Of note 340 out of 353 DE genes identified by LIMMA were also identified by SAM (data not shown), and 951 of 991 DE genes identified by Method I were also identified as DE by Method II. We found that SAM identified 375 DE genes, all of which were also identified by other methods. For example, 358 (95.5 %) genes identified by SAM were also identified by Method I or II; and 342 (91.2 %) genes identified by SAM were also identified by EBarrays method. The EBarrays method identified 863 DE genes, out of which, 643 (74.5 %) genes were also identified by Method I or II. Method I identified 116 of the 324 genes identified by LIMMA when comparing all four time points versus control in the same dataset, while Method II called 105 out of 387 genes DE that were also called DE by LIMMA within the whole dataset. In addition, many genes identified to be DE only by Method II not by Method I show a linear trend among the average gene expression across conditions observed from samples collected with longer time after treatment, and larger data variations under the control condition than those observed at other time points after treatment. For example, the average log2 gene expression of THBS1 increased from 9.22 under control condition to 10.35 at 2h after treatment, and the standard deviation equaled 0.88 under the control condition, and 0.37 at 2 h after treatment. This gene was only detected to be DE by Method II and was shown to play roles in angiogenesis [37].
The genes identified solely by Method I or Method II were analyzed by Ingenuity Pathway Analysis (IPA, Build version: 313398M, Content version: 18841524 (Release Date: 20140624) to determine biological functions and pathways associated with the newly identified genes. Genes identified solely by Method I and not by SAM or EBarrays were analyzed by IPA which identified several major canonical pathways such as hepatic fibrosis / hepatic stellate cell activation, glucocortiocoid receptor signaling, agranulocyte adhesion and diapedesis, and role of IL17A in arthritis (Additional file 2: Table S1). Many of the canonical pathways identified have either established or potential roles in corpus luteum function indicating that Method I identified DE genes that are biologically relevant within the model. Method I also identified IL1B (P=2.12Eâˆ’08) and TNF (P=3.03Eâˆ’08) as upstream regulators of the genes found exclusively by Method I, which also fits with known and suspected mechanisms of PGF action within the corpus luteum [1, 24].
Genes identified solely by Method II were also analyzed by IPA which identified canonical pathways such as hepatic fibrosis/hepatic stellate cell activation [21], glucocorticoid receptor signaling, IL8 signaling, and granulocyte adhesion and diapedesis. Upstream regulators of gene found solely by Method II included: IL1B (P=4.56Eâˆ’13), TGFB1 (P=1.19Eâˆ’11), and IFNG (P=1.82Eâˆ’11). The IPA results both concur with current literature and offer new insights into the possible mechanism(s) of action of PGF in the corpus luteum [1, 9, 11, 21]. These and similar canonical and regulatory functions were also identified when the complete dataset (30 min, 1 h, 2 h, and 4 h) was analyzed by IPA. These network functions are in agreement with the known or suspected changes in biological function in the corpus luteum following PGF treatment in several species [1, 5, 22, 27]. Several of the genes identified by Methods I and II are known to be involved in regulation of the fate of the corpus luteum after PGF treatment, and were also identified as DE genes in our larger data set and a similar microarray dataset examining the effects of PGF in the cow [22]. For example, genes that code for chemokines (e.g., CCL3 and CCL8) and prostaglandin synthesis (e.g., PTGS2) were found to be significantly upregulated at 1 and 2 h using Methods I and II which were not identified using LIMMA. However, CCL3, CCL8, and PTGS2 were all identified as significantly upregulated in later time points using LIMMA, which conservatively identifies DE genes. Therefore, it seems possible that Methods I and II may provide a more sensitive approach to identify the temporal patterns of gene expression.
Conclusion
In this paper, we have proposed a new differentially expressed gene selection algorithm, which controls the FDR based on predictive Bayesian estimates. The simulation studies empirically showed that the proposed confident difference criterion methods outperform the existing methods when comparing gene expressions across different conditions for both microarray studies and sequencebased highthroughput studies. For the analysis of the real data, the method II successfully identified more clinically important DE genes than the other methods. In comparison to Method I, the Method II provides a much better sensitivity rate, but slightly a lower specificity rate based on the simulation studies.
In scenarios where the data are not symmetrically distributed, we need to model the data with other types of distributions (e.g., a gamma distribution). The confident difference criterion method proposed for comparing both means and variances can also be extended to evaluate the differences in multiple parameters defined in the nonnormal data distributions. In addition, the Euclidean distances used in the proposed confident difference criterion method may also be extended to other types of distances to measure the difference among the distributions under two or more biological conditions. In the case of two conditions, the entropybased distance such as the KullbackLeibler (KL) divergence may be considered. However, the distribution of the entropybased statistics is quite difficult to characterize and, hence, it is quite challenging to choose the cutoff value for the entropy statistics. Such extensions need to be further investigated. Finally, we note that all models considered in this paper assume that the gene expressions are independent across genes. The proposed confident difference criterion methods do not require the independence assumption. However, the performance of the confident difference criterion methods under the correlated models need to be further examined.
Availability and requirements
All analyses results presented in this paper were obtained using codes developed in FORTRAN with IMSL library. We have also implemented the proposed method in R for windows (32 bits). The R codes can be obtained at the websites: http://www.unmc.edu/publichealth/departments/biostatistics/facultyandstaff/cdc_micro.zipand http://www.unmc.edu/publichealth/departments/biostatistics/facultyandstaff/cdc_RNASeq.zip.
References
Atli MO, Bender RW, Mehta V, Bastos MR, Luo W, Vezina CM, et al. Patterns of gene expression in the bovine corpus luteum following repeated intrauterine infusions of low doses of prostaglandin F_{2} Î±. Biol Reprod. 2012; 86(4):130.
Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010; 11:R106.
Auer PL, Doerge RW. A twostage poisson model for testing RNASeq data. Stat Appl Genet Mol Biol. 2011; 10:1â€“26.
Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008; 456(7218):53â€“9.
Bishop CV, Bogan RL, Hennebold JD, Stouffer RL. Analysis of microarray data from the macaque corpus luteum; the search for common themes in primate luteal regression. Mol Hum Reprod. 2011; 17(3):143â€“51.
Chen MH, Ibrahim JG, Chi YY. A new class of mixture models for differential gene expression in DNA microarray data. J Stat Plan Inference. 2008; 138:387â€“404.
Dudroit S, Yang YH, Callow MJ, Speed TP. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica. 2002; 12:111â€“39.
Di Y, Schafer DW, Cumbie JS, Chang JH. The NBP negative binomial model for assessing differential gene expression from RNASeq. Stat Appl Genet Mol Biol. 2011; 10(1):1â€“28.
GalvÃ¤o AM, FerreiraDias G, Skarzynski DJ. Cytokines and angiogenesis in the corpus luteum. Mediators Inflamm. 2013; 2013:420186.
Hardcastle TJ, baySeq KellyKA. Empirical Bayesian analysis of patterns of differential expression in count data. BMC Bioinformatics. 2010; 11:422â€“35.
Hou X, Arvisais EW, Jiang C, Chen DB, Roy SK, Pate JL, et al. Prostaglandin F2 Î± stimulates the expression and secretion of transforming growth factor B1 via induction of the early growth response 1 gene (EGR1) in the bovine corpus luteum. Mol Endocrinol. 2008; 22(2):403â€“414.
Irizarry RA, Hobbs B, Collin F, BeazerBarclay YD, Antonellis KJ, Scherf U, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003; 4(2):249â€“64.
Ibrahim JG, Chen MH, Gray RJ. Bayesian models for gene expression with DNA microarray data. J Am Stat Assoc. 2002; 97:88â€“99.
Kendziorski CM, Newton MA, Lan H, Gould MN. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Stat Med. 2003; 22:3899â€“914.
Kuo L, Yu F, Zhao Y. Statistical methods for identifying differentially expressed genes in replicated experiments: A review. In: Biswas A, Data S, Fine J, Segal M, editors. Statistical Advances in the Biomedical Sciences: Clinical Trials, Epidemiology, Survival Analysis, and Bioinformatics. Hoboken, NJ: WileyInterscience: 2008. p. 341â€“64.
Kvam VM, Liu P, Si Y. A comparison of statistical methods for detecting differentially expressed genes from RNAseq data. Am J Bot. 2012; 99(2):248â€“56.
Leng N, Dawson JA, Stewart RM, Ruotti V, Rissman A, Smits B, et al. EBseq: An empirical Bayes hierarchical model for inference in RNAseq experiments. Bioinformatics. 2013; 29(8):1035â€“43.
Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature. 2008; 456(7218):66â€“72.
Li J, Tibshirani R. Finding consistent patterns: a nonparametric approach for identifying differential expression in RNAseq data. Stat Methods Med Res. 2013; 22:519â€“36.
Lu J, Tomfohr JK, Kepler TB. Identifying differential expression in multiple SAGE libraries: an overdispersed loglinear model approach. BMC Bioinformatics. 2005; 6:165.
Maroni D, Davis JS. TGFB1 disrupts the angiogenic potential of microvascular endothelial cells of the corpus luteum. J Cell Sci. 2012; 124(14):2501â€“510.
Mondal M, Schilling B, Folger J, Steibel JP, Buchnick H, Zalman Y, et al. Deciphering the luteal transcriptome: potential mechanisms mediating stagespecific luteolytic response of the corpus luteum to prostaglandin F _{2} Î±. Physiol Genomics. 2011; 43(8):447â€“56.
Newton MA, Noueiry A, Sarkar D, Ahlquist P. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics. 2004; 5:155â€“76.
Okuda K, Sakumoto R. Multiple roles of TNF super family members in corpus luteum function. Reprod Biol Endocrinol. 2003; 1:95.
Pan W. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics. 2002; 18:546â€“54.
Robinson MD, McCarthy DJ. Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26:139â€“40.
Romero JJ, Antoniazzi AQ, Smirnova NP, Webb BT, Yu F, Davis JS, et al. Pregnancyassociated genes contribute to antiluteolytic mechanisms in ovine corpus luteum. Physiol Genomics. 2013; 45(22):1095â€“1108.
Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004; 3(1):Article 3.
Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNASeq data. BMC Bioinformatics. 2013; 14:91.
Storey JD. A direct approach to false discovery rates. J R Stat Soc Ser B. 2002; 64:479â€“98.
Tadesse MG, Ibrahim JG, Vannucci M, Gentleman R. Wavelet thresholding with Bayesian false discovery rate control. Biometrics. 2005; 61:25â€“35.
Tarazona S, GarcÃaAlcalde F, Dopazo J, Ferrer A, Conesa A. Differential expression in RNASeq: a matter of depth. Genome Res. 2011; 21:2213â€“223.
Tusher VG, Ti bshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2011; 98:5116â€“121.
Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, et al. The diploid genome sequence of an Asian individual. Nature. 2008; 456:60â€“65.
Wilson EB, Hilferty MM. The distribution of chisquare. Proc Natl Acad Sci U S A. 1931; 17:684â€“88.
Yu F, Chen MH, Kuo L. Detecting differentially expressed genes using calibrated Bayes factors. Statistica Sinica. 2008; 18:783â€“802.
Zalman Y, Klipper E, Farberov S, Mondal M, Wee G, Folger JK. Regulation of AngiogenesisRelated Prostaglandin F2alphaInduced Genes in the Bovine Corpus Luteum. Biology of Reproduction. 2012; 86(3):92.
Acknowledgements
We would like to thank the Editor and two referees for their very helpful comments and suggestions, which have led to an improved version of the paper. This work was supported by COPH Deanâ€™s Mentored Research Grant at UNMC (FY), NIH grants #GM 70335 and #P01 CA142538 (MHC), Agriculture and Food Research Initiative (AFRI) Competitive Grant no. 20116701520076 from the USDA National Institute of Food and Agriculture (NIFA) (JSD); the Department of Veterans Affairs (JSD), the Olson Center for Womenâ€™s Health (HT and JSD), and a AFRI NIFA Predoctoral Fellowship Award (HT).
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authorsâ€™ contributions
FY, MHC and LK developed the method, and carried out the simulation and real data analysis. HT and JD provided the real data and conducted the Ingenuity pathway analysis. All authors contributed to the writing, proof reading and approval of the final manuscript.
Additional files
Additional file 1
Methods. Mathematical Proof for Propositions 1 and 2.
Additional file 2
Real Data Analysis Results. Canonical Pathways identified by Methods I and II.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Yu, F., Chen, MH., Kuo, L. et al. Confident difference criterion: a new Bayesian differentially expressed gene selection algorithm with applications. BMC Bioinformatics 16, 245 (2015). https://doi.org/10.1186/s1285901506643
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901506643
Keywords
 Bayesian
 Differential expression
 Microarray
 Nextgeneration sequencing