A two-sample Bayesian t-test for microarray data

Fox, Richard J; Dimmic, Matthew W

doi:10.1186/1471-2105-7-126

Research article
Open access
Published: 10 March 2006

A two-sample Bayesian t-test for microarray data

Richard J Fox¹ &
Matthew W Dimmic^2,3

BMC Bioinformatics volume 7, Article number: 126 (2006) Cite this article

12k Accesses
67 Citations
9 Altmetric
Metrics details

Abstract

Background

Determining whether a gene is differentially expressed in two different samples remains an important statistical problem. Prior work in this area has featured the use of t-tests with pooled estimates of the sample variance based on similarly expressed genes. These methods do not display consistent behavior across the entire range of pooling and can be biased when the prior hyperparameters are specified heuristically.

Results

A two-sample Bayesian t-test is proposed for use in determining whether a gene is differentially expressed in two different samples. The test method is an extension of earlier work that made use of point estimates for the variance. The method proposed here explicitly calculates in analytic form the marginal distribution for the difference in the mean expression of two samples, obviating the need for point estimates of the variance without recourse to posterior simulation. The prior distribution involves a single hyperparameter that can be calculated in a statistically rigorous manner, making clear the connection between the prior degrees of freedom and prior variance.

Conclusion

The test is easy to understand and implement and application to both real and simulated data shows that the method has equal or greater power compared to the previous method and demonstrates consistent Type I error rates. The test is generally applicable outside the microarray field to any situation where prior information about the variance is available and is not limited to cases where estimates of the variance are based on many similar observations.

Background

Determining whether a gene is differentially expressed under different conditions is an important statistical problem [1–3]. Consider, for example, a common microarray experiment: for a particular gene Y, one has an unordered set of measurements of log-expression levels $y_{1}^{c} \dots y_{n_{1}}^{c}$ in a control situation and $y_{1}^{t} \dots y_{n_{2}}^{t}$ in a treatment situation (or any non-control situation) (notation similar to [4]). The question of interest is whether any expression difference $Δ \bar{y} = \sum_{j = 1}^{n_{2}} y_{j}^{t} / n_{2} - \sum_{i = 1}^{n_{1}} y_{i}^{c} / n_{1} = {\bar{y}}_{2} - {\bar{y}}_{1}$ is a significant difference, or if it would be expected under the null hypothesis of no actual difference in expression.

Though multifactorial experimental designs are becoming increasingly popular [5–7], there continue to be experimentalists interested in analyzing two-sample designs. There are many approaches to determining whether genes are differentially expressed in such designs and there are a number of excellent reviews of the various methods [2, 3]. Many of the approaches include the use of a t-test [8–14], which is a common frequentist statistical approach to comparing a difference in sample means. Some have pointed out that the normality assumption used in t-tests may not always hold and other statistical techniques may be warranted [15, 16]. Even for those measurements where normality holds the t-test is difficult to apply because the number of replicates n₁ and n₂ are often quite small (typically, n₁ = n₂ = 1 to 3), leading to uncertainty in the sample variance and relatively low power. An early approach to dealing with this problem was based on the addition of a constant value to the estimate of the standard deviation [16, 17]. While the approach does tend to regularize the variance estimates, it is ad hoc and is not expected to exhibit statistically consistent behavior (unknown or incorrect Type I error rate under the null).

More rigorous Bayesian methods have been developed that incorporate prior information about the variance but require Markov chain sampling of the posterior density or numerical integration of the cumulative distribution function [18–21]. A very popular work by Baldi and Long [4] avoids such calculations while using a statistically justified regularization technique via construction of a probabilistic Bayesian framework that applies a prior probability to parameters of interest such as the expression means and variances. An analytically tractable solution can then be obtained by taking a point estimate for the variance. The chief advantages of this Bayesian framework are twofold: it allows the use of a regularized t-test to determine whether a difference in expression is significant, and it provides a natural method for incorporating information from putatively related measurements. It is this basic framework that we seek to extend in the work presented here.

As noted by Baldi and Long [4], the use of a point estimate for the variance is a compromise between a rigorous Bayesian approach and tractable calculation. As demonstrated below, this compromise can lead to bias in the specificity of the test and improper behavior when there are few prior degrees of freedom. Our hope here is to correct this and other issues (both theoretical and practical) towards the goal of improving the performance and understanding of microarray and general laboratory data analysis. To address these issues, we extend the basic framework put forth earlier, taking the model to a fully Bayesian approach by explicitly calculating the marginal posterior distribution for the difference in mean expression levels. This analytically tractable solution does not require simulation from the posterior distribution and obviates the need to use point estimates of the prior variance while overcoming the undesirable correlation between specificity and prior degrees of freedom. Moreover, a clear and statistically rigorous connection between the prior variance and the prior degrees of freedom is established, reducing the number of hyperparameters from two to one while yielding consistent Type I error rates in the process. It should be stressed that achieving consistent Type I error rates is not just a conceptual luxury. Indeed proper determination of important quantities such as the false discovery rate are of great economical significance and rely on having correct Type I error rates [22].

Method

Bayesian probability model

The original model proposed by Baldi and Long [4] is briefly reviewed here. The likelihood of the observed data, y, given the true expression level, μ, and variance, σ², for a single gene is assumed to follow a normal distribution:

p (y | μ, σ^{2}) \propto \prod_{i = 1}^{n} N (y_{i}; μ, σ^{2}) (1)

where n is the number of measurements in one sample. For the priors on μ and σ² the authors attempt to capture the a priori dependence between μ and σ² by factoring the joint prior as p(μ,σ²) = p(μ|σ²)p(σ²) and taking each factor as

p(μ|σ²) = N(μ;μ₀,σ²/λ₀) (2)

and

p(σ²) = I(σ²;ν₀, $σ_{0}^{2}$ ) (3)

where the prior probability of μ follows a normal distribution and σ² follows a scaled inverse gamma distribution. The hyperparameters μ₀ and λ₀ represent the prior beliefs regarding the value of μ and the confidence associated with that belief, while the hyperparameters $σ_{0}^{2}$ and ν₀ represent the prior beliefs regarding the value of σ² and the degrees of freedom or confidence associated with that belief. The authors subsequently let λ₀ → 0 later in their derivation, leading to a diffuse, noninformative prior on μ:

\lim_{λ_{0} \to 0} p (μ | σ^{2}) \propto \frac{1}{σ} (4)

The remaining 1/σ factor that comes from a diffuse normal results in a prior of the form p(μ,σ²) ∝ σ^-1I(σ²;ν₀, $σ_{0}^{2}$ ). Using Bayes' rule the authors obtain the following posterior distribution:

p(μ,σ²|y,α) ∝ N(μ;μ_n,σ²/λ_n)I(σ²;ν_n, $σ_{n}^{2}$ ) (5)

where α is a vector of hyperparameters (μ₀,λ₀,ν₀, $σ_{0}^{2}$ ) for the prior and

μ_{n} = \frac{λ_{0}}{λ_{0} + n} μ_{0} + \frac{n}{λ_{0} + n} \bar{y} (6)

λ_n= λ₀ + n (7)

ν_n= ν₀ + n (8)

ν_{n} σ_{n}^{2} = ν_{0} σ_{0}^{2} + (n - 1) s^{2} + \frac{λ_{0} n}{λ_{0} + n} {(\bar{y} - μ_{0})}^{2} (9)

The sufficient statistics are $\bar{y} = \sum_{i = 1}^{n} y_{i} / n$ and $s^{2} = \sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2} / (n - 1)$ . Finally, the authors calculate the posterior mean estimate or point estimate (PE) of the variance for one sample as

σ^{2} = \frac{ν_{n}}{ν_{n} - 2} σ_{n}^{2} (10)

This estimate is then used in the hypothesis testing procedures implemented in their Cyber-T software. Though this model is developed using a Bayesian framework to make a point estimate of the variance, the hypothesis testing is carried out using a classical frequentist t-test.

In the standard t-test, the variance is still defined when there are no prior degrees of freedom, but inspection of Eq. (10) shows that when the number of prior degrees of freedom ν₀ < 2 - n, the variance is undefined. This limiting behavior stems from the use of a point estimate for the variance. The total number of prior degrees of freedom used in the Cyber-T software is actually 2ν₀ as it counts ν₀ prior degrees of freedom for each sample in the two-sample test. However, for consistency and clarity we hereafter refer to the total number of prior degrees of freedom as ν₀. Therefore the total number of degrees of freedom used in the Cyber-T software for the two-sample hypothesis test then appears to be ν_t-test= ν₀ + n₁ + n₂ - 4, where n₁ is the number of experimental replicates of the gene in the control regime and n₂ is the number of replicates of the gene in the treatment regime. This equation for the total degrees of freedom also demonstrates undesired limiting behavior when ν₀ = 0, since positive degrees of freedom exist only when n₁ + n₂ > 4. This result contradicts what we would expect to see from a classic two-sample t-test with no prior estimate of the variance, i.e. a classical two-sample t-test has degrees of freedom n₁ + n₂ - 2 and positive degrees of freedom exist when n₁ + n₂ > 2.

Full two-sample model

Posterior distribution

Instead of using a point estimate for the variance to conduct a frequentist t-test, we can recast the problem to directly infer the posterior distribution for a two-sample case:

p(μ,Δμ,σ²|y) ∝ p(μ,Δμ,σ²)p(y|μ,Δμ,σ²) (11)

where μ is the mean expression level of the gene in the control regime, μ +Δμ the mean of the gene in the alternate experimental regime, and σ² the variance associated with the measurements. In contrast to the previous, point estimate model, in this formulation there is no explicit prior dependence between σ² and μ. While this is admittedly a simplification, this independence is not only a practical requirement for the full Bayesian integration, we also believe it is justified on several other grounds. First, while it has been widely observed that genes with different expression levels often have different variances [1], it is not clear how the prior used by Baldi and Long [4] to derive the point estimate of variance could capture the dependencies observed with actual microarray data. Their figures show that the logarithm of the mean expression level tends to decrease with increasing variance, yet the prior given by Eq. (2) asserts that the prior probability of the mean should be more diffuse – not necessarily smaller – with increasing variance. The observed trend between mean and variance cannot be captured by such a prior. Second, their method does include an implicit dependence between mean and variance by using other gene expression levels to determine the priors; indeed, this is what lends the method its power. Because the Bayesian formulation proposed here determines the prior hyperparameters in a similar fashion, it retains this implicit dependence. One may also consider a number of variance stabilizing transforms that may be applied in order to achieve constant variance across the expression spectrum [23, 24]. Finally, as shown below, simulations reveal that the method still retains power even when the true variance is allowed to vary over a substantial range.

Returning to the posterior distribution, we assume that the samples from each experimental regime follow a normal distribution, each with equal variances and (possibly) different means.

y_i~ N(μ,σ²) (12)

y_j~ N(μ + Δμ,σ²) (13)

This leads to the following posterior distributionp

\begin{matrix} p (μ, Δ μ, σ^{2} | y) \propto \\ p (μ, Δ μ, σ^{2}) \prod_{i = 1}^{n_{1}} \frac{1}{\sqrt{2 π σ}} \exp (- \frac{1}{2 σ^{2}} {(y_{i} - μ)}^{2}) \prod_{j = 1}^{n_{2}} \frac{1}{\sqrt{2 π σ}} \exp (- \frac{1}{2 σ^{2}} {((y_{j} - (μ + Δ μ))}^{2}) \end{matrix} (14)

where n₁ and n₂ are the number of measurements in each sample. We elect not to make use of the prior on μ used in the Baldi work, Eq. (2), or its limiting form, Eq. (4). Instead, we assume a completely flat prior on both μ and Δμ while retaining the prior for σ² given by Eq. (3). Similar to the earlier work, the observed dependence between μ and σ² is established by calculating $σ_{0}^{2}$ based on the variance of similarly expressed genes. However, in this formulation, instead of taking the average standard deviation of similarly expressed genes we choose to estimate the prior variance by totaling the sum-squared differences for each similar gene and dividing by the total prior degrees of freedom. This is a more statistically rigorous way of incorporating prior information and leads to more a more consistent test as will be discussed below.

Marginal posterior for Δμ

With the above definitions and assumptions in place, it can be shown (see Appendix) that the marginal posterior distribution follows a t distribution:

\frac{Δ μ - Δ \bar{y}}{σ_{n} \sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}} | y \sim t_{ν_{n}} (15)

where

ν_n= n₁ + n₂ + ν₀ - 2 (16)

ν_{n} σ_{n}^{2} = ν_{0} σ_{0}^{2} + (n_{1} - 1) s_{1}^{2} + (n_{2} - 1) s_{2}^{2} (17)

The sufficient statistics are $Δ \bar{y} = \sum_{j = 1}^{n_{2}} y_{j} / n_{2} - \sum_{i = 1}^{n_{1}} y_{i} / n_{1} = {\bar{y}}_{2} - {\bar{y}}_{1}$ , and the sum squared differences (n₁ - 1) $s_{1}^{2}$ and (n₂ - 1) $s_{2}^{2}$ , where $s_{1}^{2} = \sum_{i = 1}^{n_{1}} {(y_{i} - {\bar{y}}_{1})}^{2} / (n_{1} - 1)$ and $s_{2}^{2} = \sum_{j = 1}^{n_{2}} {(y_{j} - {\bar{y}}_{2})}^{2} / (n_{2} - 1)$ . This distribution has the feature that the standard t-test is obtained even when there are no prior degrees of freedom, ν₀ = 0, in which case the observed variance for each gene dominates the posterior density. A hypothesis test is performed by asserting a null hypothesis that the true difference in expression levels is zero, i.e. Δμ = 0 (though other values for the true difference, Δμ, can be used when attempting to identify genes that are differentially expressed by some threshold degree). When the posterior probability of no differential expression approaches zero, Pr(Δμ = 0|y) → 0, the null is rejected. It is worth noting that while the posterior probability distribution for no differential expression, Pr(Δμ = 0|y), follows the same frequentist distribution for the data, Pr(y|Δμ = 0) [25], the resulting probabilities have different interpretations. In the Bayesian approach the posterior probability is interpreted as the probability that there is no difference in expression levels. In the frequentist approach the probability is interpreted as the probability of observing such a difference from the null distribution of no differential expression given chance alone. When Δμ = 0 the Bayesian and frequentist methods give the same numerical results and one can ignore the interpretational differences when discussing such things as the false positive rate under the null. However, power analysis calculations (Δμ ≠ 0) are best conducted under a frequentist framework using a non-central t-distribution [25].

Estimating the prior variance

For m similarly expressed genes each having n replicates, the prior degrees of freedom for the variance can be calculated as

ν₀ = m(n - 1) (18)

The prior variance is then calculated as the total sum-square differences for each sample of similarly expressed genes divided by the prior degrees of freedom, namely

σ_{0}^{2} = \frac{\sum_{k = 1}^{m} \sum_{i = 1}^{n} {(y_{k, i} - {\bar{y}}_{k})}^{2}}{ν_{0}} (19)

where ${\bar{y}}_{k}$ is the mean response of gene k and y_k,iis the response i of gene k. In general the prior degrees of freedom and variance should be determined using these equations and not chosen separately from one another. While it is tempting to think of the prior variance as separate from the precision or credibility associated with that estimate (as represented by ν₀), a consistent hypothesis test requires that they be considered together based on the actual prior data collected.

Results

To assess the performance of the new test, a series of simulations were done to measure the false positive rate and statistical power under various conditions. The simulations consisted of the following steps:

1)
Draw random samples from two normal distributions of size n₁ and n₂ distributed as N(μ₁, $σ_{t e s t}^{2}$ ) and N(μ₂, $σ_{t e s t}^{2}$ ) respectively. This step represents the selection of two different experimental regimes for a gene of interest in order to test for differential expression. For power testing, the standardized effect (μ₁ - μ₂)/σ_test= 2 is used to generate the data by setting μ₁ = 2, μ₂ = 0, and σ_test= 1. For false positive rate testing μ₁ = 0.
2)
Draw an estimate of the prior variance from the following Chi-square distribution: $σ_{0}^{2} ~ σ_{t e s t}^{2} χ_{ν}^{2} / ν_{0}$ . This step simulates the task of estimating the prior variance from similarly expressed genes. It is mathematically equivalent to calculating the sum-squared differences of similarly expressed genes and dividing by their prior degrees of freedom, but it is computationally more efficient.
3)
Perform hypothesis tests using a) the method used by Baldi and Long [4] based on the mean posterior point estimate for the variance, PE(σ²) or just PE, and b) the new test proposed here based on the marginal posterior for the difference in expression level, MP(Δμ) or just MP. The hypothesis tests using each method are done at the α = 0.05 level using a two-sided, two-sample t-test. Test p-values below this level are counted as significant. To make valid comparisons, both methods used the same number of prior degrees of freedom.
4)
Steps 1 to 3 above are repeated 10,000 times.

Results of the power simulations for a number of sample sizes and prior degrees of freedom are shown in Table 1. Because the PE test cannot be used when ν₀ + n₁ + n₂ ≤ 4, the test has no power and no false positive rate when n₁ = n₂ = 2 and ν₀ = 0. Conversely, the MP test results in a classical t-test in this case and has a power of 0.2169 and a false positive rate of 0.0509, which is near the expected value given the α = 0.05 level of the test. Holding n₁ = n₂ = 2 but increasing the number of prior degrees of freedom to ν₀ = 2 (equal to 2 other similarly expressed genes used to determine the prior variance) the PE test can now be used but has a very low power of 0.0450, compared to the MP power of 0.3343. Overall, when the estimate of the variance is balanced between the prior and observed variances (i.e. the prior number of degrees of freedom is not large compared to the number of replicates in the test samples) the MP test is significantly more powerful. The difference in power between the two methods is less pronounced when ν₀ = 16 (the default value used in the Cyber-T software when n₁ = n₂ = 2). However, one should note that such a relatively high number for the prior degrees of freedom represents a strong degree of belief in the prior estimate of the variance relative to the variance in the genes of interest.

Table 1 Comparison of false positive rate and statistical power for fixed α. For a given significance level (α) the observed false positive rate (FP) and power were determined by simulation using either a point estimate for the variance (PE) or the marginal posterior distribution for Δμ (MP). Values are given for different sample sizes (n₁ = n₂ = n) and prior degrees of freedom (ν₀).

Full size table

These simulations reveal that the observed false positive rate for the PE test is substantially out of line with the nominal value of 0.05 for small sample replicates (n₁ = n₂ ≤ 3) and few prior degrees of freedom (ν₀ ≤ 16). Conversely, the MP test demonstrates consistent behavior for the observed false positive rate. However, it is possible to obtain the desired false positive rate for the PE method by iterating through values of α. Table 2 shows that, by adjusting the critical α value for the PE method until the observed false positive rate matches that of the MP method, both tests have similar power. The disadvantage is that α no longer represents the expected Type I error rate, and simulation or calibration is required whenever a new test is performed. Conversely, with the MP test, the critical value α represents an unbiased estimate of the Type I error rate.

Table 2 Comparison of α and statistical power for iterated false positive rate. The values are the same as those given in Table 1 except the significance level (α) was iterated for the PE method until its power matched that of the MP method where α was held constant.

Full size table

It is interesting to look for a simple correction to the degrees of freedom in order to make the two methods match each other. Unfortunately, there does not appear to be any simple correction available. The degrees of freedom used in the t-test itself would be easy to modify – though ad hoc, one could change the last term in the expression for the degrees of freedom used in the PE test (ν_t-test= ν₀ + n₁ + n₂ - 4) from -4 to -2 by simply adding two to ν₀. Unfortunately, the point estimate of the variance given by Eq. (10) is more problematic as it appears to give an increased estimate of the variance compared to the one used in the MP test, resulting in a decrease in both the t-statistic and the statistical power.

In the Cyber-T software [4] the prior variance is estimated by averaging the standard deviation over 100 similarly expressed genes. This has two problems associated with it. First, the measured standard deviation is a biased estimate and tends to be smaller than the true value for small sample sizes like n = 2 or 3, the average of these measured standard deviations will also be biased downward and in general $E (\sqrt{σ^{2}}) \neq \sqrt{E (σ^{2})}$ , where E(·) represents the expected value of a random variable. The correct way to estimate the variance based on prior observations is to calculate the total sum square differences for all the prior samples and divide by the total degrees of freedom, as shown in Eq. (19). Second, if similarly expressed genes for each of the two samples are used to estimate the prior variance and they have the same number of replicates, n, as the samples themselves, then estimating the standard deviation from m = 100 similarly expressed genes is tantamount to asserting ν₀ = 100·(n - 1) prior degrees of freedom. Setting ν₀ to less than the actual number of prior degrees of freedom used to calculate the prior variance results in a lower actual false positive rate than the rejection level α. For consistency if we fix the number of prior degrees of freedom that are used in the test, then we must choose a corresponding number of similarly expressed genes that together have the same number of prior degrees of freedom. This dependence between ν₀ and m is more consistent with the formulation of the prior itself and leads to consistent Type I error rates. When we calculate the prior variance in the statistically rigorous manner (by taking the total sum square differences of prior observations and dividing by the total prior degrees of freedom) the PE method becomes overly conservative, generating false positives with a rate of 0.0044 at n₁ = n₂ = 2, α = 0.05, and ν₀ = 2, this appears to be the result of using a point estimate for the prior variance that is larger than the expected value given similar genes, see Eq. (10).

As mentioned previously, to perform the full Bayesian integration required by the MP test, the assumption of a constant variance is required. To test the effect of violating this assumption, simulations were performed as described above, but in step 2 the variances for each sample were drawn from a uniform distribution in the range [0.05,1]. The same uniform distribution was used to generate prior variances by calculating the total sum-squared differences of m = ν₀/(n - 1) different draws (corresponding to m different genes) for the variance and dividing by the prior degrees of freedom, ν₀. The results for both the PE and MP methods are shown in Table 3, where the observed false positive rate for the MP method is used to calibrate the α value for the PE test so as to achieve nearly the same false positive rate for each method. Though both methods are seen to have nearly the same power for a given false positive rate, the PE method again shows large deviations between α and the observed false positive rate. However, the MP method shows more consistency between the observed and theoretical rates and is fairly robust to violations of the constant variance assumption.

Table 3 Comparison of α and statistical power for iterated false positive rate and variable variance. The values are the same as those given in Table 1 and Table 2 except the variance for each sample was drawn from a uniform distribution between 0.05 and 1 during the simulation. The significance level (α) for the PE method was iterated so that the observed false positive rate matched that of the MP method where α was held constant.

Full size table

To test whether these statistical improvements in the method translate to improvements in experimental data analysis, both methods were evaluated with the same microarray data [26] used by Baldi and Long [27]. The Arfin data matrix is composed of 8 columns by 1973 rows that contain gene expression levels used to study IHF regulated genes in Escherichia coli. Columns 1 to 4 correspond to 4 IHF⁺ strains while columns 5 to 8 correspond to 4 IHF^- strains. The 1973 rows correspond to the measured gene expression levels for each strain. There are actually more rows in the original data set but to be consistent with the earlier work, we only chose rows that had measurable expression levels for all 8 runs. As was done in the Long paper the 4 control and 4 treatment replicates were partitioned into 12 subsets that differ by at least two replicates so that quasi-independent pairwise comparisons could be made (i.e. 12v56, 12v78, 13v57, 13v68, 14v58, 14v67, 23v58, 23v67, 24v57, 24v68, 34v56, and 34v78). For the experimental evaluation we adopted the same strategy for pooling variances used by Baldi and Long. We sorted all the genes within each sample by the mean expression level and computed the prior variance by considering a window of the m/2 most similarly expressed genes around each gene of interest. For the two samples this corresponds to a total of m similarly expressed genes and Eq. (18) gives the total prior degrees of freedom, ν₀ = m.

To measure the consistency of a given testing method the number of common genes discovered between two quasi-independent subsets can be recorded. Genes that are discovered to be in common between subsets are more likely to be truly differentially expressed, thus higher a commonality percentage (genes discovered in common divided by the total number of genes discovered) is an indication that the testing method is more reliable. The number of common genes discovered can be recorded for the 66 possible ways of comparing the 12 subsets. Based on p-value rankings, the MP method achieved similar results (within sampling error) to those published using the Cyber-T software. The single hyperparameter (ν₀) was optimized for the MP method and a 95% confidence interval of [62.4, 67.8] common genes were identified out of the top 120. This compares well with the published results using Cyber-T where optimizing two parameters independently (ν₀ and m) achieved an average common gene count of 67. These results represent a substantial improvement over the 33 common genes identified using a zero parameter classical t-test.

Because the p-value rankings by themselves are not actually statistical tests, a more stringent approach to assessing the performance of each method was accomplished by identifying genes that are differentially expressed at some confidence level α. For the following tests we used a p-value of 0.01 for the null hypothesis cutoff (uncorrected for multiple tests). The results are presented in Table 4. In general, for a given number of prior degrees of freedom, the MP method consistently detects more genes as differentially expressed than does the PE method and more of these genes are commonly discovered when comparing them with quasi-independent subsets. In addition the fraction of common genes within those detected is generally higher for the MP method. The difference between the methods tends to decrease as the number of prior degrees of freedom is increased. The optimal performance (in terms of % commonality) is achieved at around ν₀ = 400, which corresponds to basing the local variance on about 10% of the 1972 similarly expressed genes within each sample (there are 3944 similarly expressed genes when considering both samples). As ν₀ increases past 400 the commonality percentage drops somewhat. Basing the variance on the combined estimated of all 3944 similar genes results in a difference-in-means test, and the drop in performance is similar to that seen by Long and Baldi when a simple difference in means was used to rank the expression differences [27].

Table 4 Experimental comparison of hypothesis testing methods. The microarray data consists of 1973 genes partitioned into 12 quasi independent n₁ = 2 and n₂ = 2 subsets using replicate experiments of 4 control and 4 treatment runs [26, 27]. The number of genes detected as differentially expressed using each method is listed for an α = 0.01 significance level (there was no correction for multiple testing) and different prior degrees of freedom (ν₀). The average number of common genes for the 66 unique comparisons of the 12 subsets is also listed along with the % of commonality (common/detected) for each test.

Full size table

It is important to stress that the above results for the PE method are not what one would expect using the implementation found in the Cyber-T software [4]. As mentioned before, this is because the existing software incorrectly calculates the pooled variance by averaging the standard deviation of similarly expressed genes. This results in a substantial error for example, when n₁ = n₂ = 2 and ν₀ = 400, the Type I error rate is approximately 0.125 (2.5 times higher than that predicted by the nominal rate α = 0.05). The authors of Cyber-T recommend smaller values of ν₀ where the overestimate of variance given by their point estimate, Eq. (10), is balanced by the underestimate of variance owing to incorrect pooling, thereby achieving (approximately) correct false positive rates by offsetting two competing biases.

Discussion and conclusion

The new marginal posterior (MP) formulation proposed here is uniformly at least as powerful as the previous point estimate (PE) method which it is based on and more powerful when the number of similarly expressed genes (and therefore prior degrees of freedom ν₀) is small. Moreover, unlike the existing regularization methods, the new formulation consistently maintains the proper false positive rates under the null hypothesis across the entire range of pooling. The MP formulation also makes clear the dependence between the prior degrees of freedom and prior variance and thus methods to estimate the single hyperparameter ν₀ are expected to be more robust and interpretable.

As mentioned previously, the key to the power of these Bayesian methods is the use of similarly-expressed genes to estimate the expected variance. Naturally, the more similar the variance in expression of the other genes is to the gene of interest, the greater the statistical consistency of the method. In generating the simulated data we have attempted to remain agnostic on the definition of "similarly-expressed", using variances that are either constant or drawn from a uniform distribution. Another Bayesian method [28] utilizes a clustering technique based on a series of t-statistics to group genes into prior categories in order to pool their prior variances. Because the Bayesian posterior probabilities used by that method are not interpretable as frequentist Type I error rates, it is difficult to set the desired false positive rate using such a method without resorting to simulation and iteration. Other clustering strategies include those based on non-parametric regression to obtain local variance estimates [29] and mixture models that group together similar variances [30] using maximum likelihood. These clustering strategies assume the variance is known based on the large number of observations in each group. This allows normal distributions to be used in deriving test statistics. Unfortunately, when the number of similarly expressed genes is small the use of a normal distribution is not justified as it assumes there is no uncertainty in the variance. The advantage of the MP formulation is that the normal approximation for the test statistic is not required – it is statistically robust across the entire range of pooling or clustering strategies (assuming the strategies work properly to group together genes of similar variance). In our experimental application of the method we have adopted a pooling strategy based on grouping together genes with similar expression levels. Thus the MP formulation can be used in situations where two or more control replicates can be compared with only single observations under treatment conditions; this is not possible with clustering strategies that require at least two replicates in each sample [30].

Though the MP test was developed in a Bayesian framework, there is a strong analogy with frequentist statistics. One person's prior is another person's data; if we treat the prior estimates of variance as just additional data collection then our hypothesis test using the t distribution is the same one as that used in any introductory statistics course. In either situation, one considers the estimate of variance as the total sum-of-squared differences for all observations of the variance divided by the total degrees of freedom. By "all observations" we mean both the two samples involved in the hypothesis test plus some number of replicates from other genes we believe to be similarly expressed and are good representatives of the variance. This is just like estimating the method standard deviation and conducting a normal t-test with that estimate along with the additional degrees of freedom that come with it. All of the power analysis calculations can then be done without recourse to simulations; for example, built-in functions in the desktop statistical package R (using the non-central t distribution) gave the same results we present by simulation for the MP method. It should be stressed that this Bayesian hypothesis test is not limited in any way to the analysis of microarray data. It can be applied to any experimental situation where prior beliefs about the variance associated with small sample sizes are used to achieve higher statistical power while maintaining consistent Type I error rates. This is particularly important when only a few similar observations of the variance are available, for example when only a handful of genes are under study across a range of treatment conditions.

The method is implemented in R and is freely available for download [31] or by contacting the corresponding author.

Appendix

Determining the marginal posterior distribution of Δμ

The posterior is the product of the prior and the likelihood:

p(μ,Δμ,σ²|y) ∝ p(μ,Δμ,σ²)p(y|μ,Δμ,σ²) (20)

The likelihood consists of the product of two normal distributions:

y_i~ N(μ,σ₂) (21)

y_j~ N(μ + Δμ,σ²) (22)

\begin{matrix} p (μ, Δ μ, σ^{2} | y) \propto \\ p (μ, Δ μ, σ^{2}) \prod_{i = 1}^{n_{1}} \frac{1}{\sqrt{2 π} σ} \exp (- \frac{1}{2 σ^{2}} {(y_{i} - μ)}^{2}) \prod_{j = 1}^{n_{2}} \frac{1}{\sqrt{2 π} σ} \exp (- \frac{1}{2 σ^{2}} {(y_{j} - (μ + Δ μ))}^{2}) \end{matrix} (23)

Combining the joint likelihood into a common exponential gives:

p (μ, Δ μ, σ^{2} | y) \propto p (μ, Δ μ, σ^{2}) σ^{- (n_{1} + n_{2})} \exp (- \frac{1}{2 σ^{2}} [\sum_{i = 1}^{n_{1}} {(y_{i} - μ)}^{2} + \sum_{j = 1}^{n_{2}} {(y_{j} - (μ + Δ μ))}^{2}]) (24)

Next we can define a variable that that only contains terms for the expression levels:

C_{μ} = \sum_{i = 1}^{n_{1}} {(y_{i} - μ)}^{2} + \sum_{j = 1}^{n_{2}} {(y_{j} - (μ + Δ μ))}^{2} (25)

p (μ, Δ μ, σ^{2} | y) \propto p (μ, Δ μ, σ^{2}) σ^{- (n_{1} + n_{2})} \exp (- \frac{C_{μ}}{2 σ^{2}}) (26)

Expanding the square leads to the following:

C_{μ} = \sum_{i = 1}^{n_{1}} (y_{i}^{2} - 2 y_{i} μ + μ^{2}) + \sum_{j = 1}^{n_{2}} (y_{j}^{2} - 2 y_{j} μ - 2 y_{j} Δ μ + μ^{2} + Δ μ^{2} + 2 μ Δ μ) (27)

{\bar{y}}_{1} = \frac{1}{n_{1}} \sum_{i = 1}^{n_{1}} y_{i} (28)

{\bar{y}}_{2} = \frac{1}{n_{2}} \sum_{j = 1}^{n_{2}} y_{j} (29)

C_{μ} = (n_{1} + n_{2}) μ^{2} - 2 (n_{1} {\bar{y}}_{1} + n_{2} {\bar{y}}_{2} - n_{2} Δ μ) μ + \sum_{i = 1}^{n_{1}} y_{j}^{2} + \sum_{j = 1}^{n_{2}} y_{j}^{2} + n_{2} Δ μ^{2} - 2 n_{2} {\bar{y}}_{2} Δ μ (30)

Next, we look to complete the square by adding and subtracting the same value k:

k = \frac{n_{1} {\bar{y}}_{1} + n_{2} {\bar{y}}_{2} - n_{2} Δ μ}{n_{1} + n_{2}} (31)

C_{μ} = (n_{1} + n_{2}) (μ^{2} - 2 k μ + k^{2} - k^{2}) + \sum_{i = 1}^{n_{1}} y_{i}^{2} + \sum_{j = 1}^{n_{2}} y_{j}^{2} + n_{2} Δ μ^{2} - 2 n_{2} {\bar{y}}_{2} Δ μ (32)

Now collapsing back to a single squared term in the exponent and defining a new variable, C_σ, we have

C_μ = (n₁ + n₂)(μ - k)² + C_σ (33)

where

C_{σ} = \sum_{i = 1}^{n_{1}} y_{i}^{2} + \sum_{j = 1}^{n_{2}} y_{j}^{2} + n_{2} Δ μ^{2} - 2 n_{2} {\bar{y}}_{2} Δ μ - (n_{1} + n_{2}) k^{2} (34)

The posterior density can then be written as

p (μ, Δ μ, σ^{2} | y) \propto p (μ, Δ μ, σ^{2}) σ^{- (n_{1} + n_{2})} \exp (- \frac{(n_{1} + n_{2})}{2 σ^{2}} {(μ - k)}^{2}) \exp (- \frac{C_{σ}}{2 σ^{2}}) (35)

We can now look to integrate out the true expression level, μ:

p (Δ μ, σ^{2} | y) \propto \int_{- \infty}^{+ \infty} p (μ, Δ μ, σ^{2} | y) d μ (36)

We have flat priors on μ and Δμ, allowing us to complete the integration over all μ.

p(μ,Δμ,σ²) ∝ p(σ²) (37)

p (Δ μ, σ^{2} | y) \propto p (σ^{2}) σ^{- (n_{1} + n_{2})} \exp (- \frac{C_{σ}}{2 σ^{2}}) \int_{- \infty}^{+ \infty} \exp (- \frac{(n_{1} + n_{2})}{2 σ^{2}} {(μ - k)}^{2}) d μ (38)

\int_{- \infty}^{+ \infty} \exp (- \frac{(n_{1} + n_{2}}{2 σ^{2}} {(μ - k)}^{2}) d μ \propto σ (39)

The posterior for the joint density of Δμ and σ² is then

p (Δ μ, σ^{2} | y) \propto p (σ^{2}) σ^{- (n_{1} + n_{2} - 1)} \exp (- \frac{C_{σ}}{2 σ^{2}}) (40)

We next look to integrate out σ² by defining the sufficient statistics associated with the variance:

(n_{1} - 1) s_{1}^{2} = \sum_{i = 1}^{n_{1}} {(y_{i} - {\bar{y}}_{1})}^{2} = \sum_{i = 1}^{n_{1}} y_{i}^{2} - 2 n_{1} {\bar{y}}_{1}^{2} + n_{1} {\bar{y}}_{1}^{2} = \sum_{i = 1}^{n_{1}} y_{i}^{2} - n_{1} {\bar{y}}_{1}^{2} (41)

(n_{2} - 1) s_{2}^{2} = \sum_{j = 1}^{n_{2}} y_{j}^{2} - n_{2} {\bar{y}}_{2}^{2} (42)

C_{σ} = (n_{1} - 1) s_{1}^{2} + (n_{2} - 1) s_{2}^{2} + C_{Δ μ} (43)

where

C_{Δ μ} = n_{1} {\bar{y}}_{1}^{2} + n_{2} {\bar{y}}_{2}^{2} + n_{2} Δ μ^{2} - 2 n_{2} {\bar{y}}_{2} Δ μ - (n_{1} + n_{2}) k^{2} (44)

The last term in Eq. (43) C Δμ, now contains all references to the null difference in means, Δμ. The term can be expanded by multiplying out k² and then simplified:

\begin{matrix} C_{Δ μ} = n_{1} {\bar{y}}_{1}^{2} + n_{2} {\bar{y}}_{2}^{2} + n_{2} Δ μ^{2} - 2 n_{2} {\bar{y}}_{2} Δ μ - \\ \frac{1}{n_{1} + n_{2}} (n_{1}^{2} {\bar{y}}_{1}^{2} + 2 n_{1} n_{2} {\bar{y}}_{1} {\bar{y}}_{2} + n_{2}^{2} {\bar{y}}_{2}^{2} - 2 n_{1} n_{2} {\bar{y}}_{1} Δ μ + n_{2}^{2} Δ μ^{2} - 2 n_{2}^{2} {\bar{y}}_{2} Δ_{μ}) 45 \\ = \frac{n_{1} n_{2}}{n_{1} + n_{2}} {(Δ μ - ({\bar{y}}_{2} - {\bar{y}}_{1}))}^{2} \end{matrix}

This can be inserted back into Eq. (43) to give

C_{σ} = (n_{1} - 1) s_{1}^{2} + (n_{2} - 1) s_{2}^{2} + \frac{n_{1} n_{2}}{n_{1} + n_{2}} {(Δ μ - ({\bar{y}}_{2} - {\bar{y}}_{2}))}^{2} (46)

The prior for variance is said to follow a scaled inverse Chi-square distribution:

p (σ^{2}) = I (σ^{2}; ν_{0}, σ_{0}^{2}) \propto σ^{- (ν_{0} + 2)} \exp (- \frac{1}{2 σ^{2}} ν_{0} σ_{0}^{2}) (47)

The posterior density is then

\begin{matrix} p (Δ μ, σ^{2} | y) \propto \\ σ^{- (n_{1} + n_{2} + ν_{0} + 1)} \exp (- \frac{1}{2 σ^{2}} [ν_{0} σ_{0}^{2} + (n_{1} - 1) s_{1}^{2} + (n_{2} - 1) s_{2}^{2} + \frac{n_{1} n_{2}}{n_{1} + n_{2}} {(Δ μ - ({\bar{y}}_{2} - {\bar{y}}_{1}))}^{2}]) (48) \end{matrix}

But now we can integrate out σ²

p (Δ μ | y) \propto \int_{0}^{\infty} p (Δ μ, σ^{2} | y) d σ^{2} (49)

With a change of variables

A = ν_{0} σ_{0}^{2} + (n_{1} - 1) s_{1}^{2} + (n_{2} - 1) s_{2}^{2} + \frac{n_{1} n_{2}}{n_{1} + n_{2}} {(Δ μ - ({\bar{y}}_{2} - {\bar{y}}_{1}))}^{2} (50)

z = \frac{A}{2 σ^{2}} (51)

p (Δ μ | y) \propto {\int_{0}^{\infty} (\frac{A}{2 z})}^{- (n_{1} + n_{2} + ν_{0} + 1) / 2} \exp (- z) (\frac{A}{2 z^{2}}) d z (52)

p (Δ μ | y) \propto A^{- (n_{1} + n_{2} + ν_{0} - 1) / 2} \int_{0}^{\infty} z^{(n_{1} + n_{2} + ν_{0} - 3) / 2} \exp (- z) d z (53)

The posterior density then becomes the well known t distribution.

p (Δ μ | y) \propto {[1 + \frac{{(Δ μ - ({\bar{y}}_{2} - {\bar{y}}_{1}))}^{2}}{ν_{n} (\frac{1}{n_{1}} + \frac{1}{n_{2}}) σ_{n}^{2}}]}^{- (ν_{n} + 1) / 2} (54)

where

ν_n= n₁ + n₂ + ν₀ - 2 (55)

ν_{n} σ_{n}^{2} = ν_{0} σ_{0}^{2} + (n_{1} - 1) s_{1}^{2} + (n_{2} - 1) s_{2}^{2} (56)

\frac{Δ μ - Δ \bar{y}}{σ_{n} \sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}} | y ~ t_{ν_{n}} (57)

References

Chen Y, Dougherty ER, Bittner ML: Ratio-based decisions and the quantitative analysis of cDNA microarray images. J Biomed Optics 1997, 2(4):364–374. 10.1117/12.281504
Article CAS Google Scholar
Cui X, Churchill G: Statistical tests for differential expression in cDNA microarray expriments. Genome Biology 2003, 4(4):210.0–210.1. 10.1186/gb-2003-4-4-210
Article Google Scholar
Nadon R, Shoemaker J: Statistical issues with microarrays: processing and analysis. Trends in Genetics 2002, 18(5):265–271. 10.1016/S0168-9525(02)02665-3
Article CAS PubMed Google Scholar
Baldi P, Long A: A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001, 17(6):509–519. 10.1093/bioinformatics/17.6.509
Article CAS PubMed Google Scholar
Townsend J: Multifactorial experimental design and the transitivity of ratios with spotted DNA microarrays. BMC Genomics 2003., 4(41):
Google Scholar
Vinciotti V, Khanin R, xD'Alimonte R, Liu X, Cattini N, Hotchkiss G, G. B, de Jesus O, Rasaiyaah J, Smith CP, Kellam P, Wit E: An experimental evaluation of a loop versus a reference design for two-channel microarrays. Bioinformatics 2005, 21(4):492–501. 10.1093/bioinformatics/bti022
Article CAS PubMed Google Scholar
Yang YH, Speed T: Design issues for cDNA microarray experiments. Nat Rev Genet 2002, 3(8):579–588.
CAS PubMed Google Scholar
Orian A, van Steensel B, Delrow J, Bussemaker HJ, Li L, Sawado T, Williams E, Loo LW, Cowley SM, Yost C, Pierce S, B.A. E, Parkhurst SM, Eisenman RN: Genomic binding by the Drosophila Myc, Max, Mad/Mnt transcription factor network. Genes Dev 2003, 17(9):1101–1114. 10.1101/gad.1066903
Article PubMed Central CAS PubMed Google Scholar
Sato N, Sanjuan IM, Heke M, Uchida M, Naef F, Brivanlou AH: Molecular signature of human embryonic stem cells and its comparison with the mouse. Dev Biol 2003, 260(2):404–413. 10.1016/S0012-1606(03)00256-2
Article CAS PubMed Google Scholar
Tompa R, McCallum CM, Delrow J, Henikoff JG, van Steensel B, Henikoff S: Genome-wide profiling of DNA methylation reveals transposon targets of CHROMOMETHYLASE3. Curr Biol 2002, 12(1):65–68. 10.1016/S0960-9822(01)00622-4
Article CAS PubMed Google Scholar
Hu L, Wang J, Baggerly K, Wang H, Fuller GN, Hamilton SR, Coombes KR, Zhang W: Obtaining reliable information from minute amounts of RNA using cDNA microarrays. BMC Genomics 2002., 3(16):
Google Scholar
Gu J, Gu X: Induced gene expression in human brain after the split from chimpanzee. Trends Genet 2003, 19(2):63–65. 10.1016/S0168-9525(02)00040-9
Article CAS PubMed Google Scholar
Pavlidis P, Li. Q, Noble WS: The effect of replication on gene expression microarray experiments. Bioinformatics 2003, 19(13):1620–1627. 10.1093/bioinformatics/btg227
Article CAS PubMed Google Scholar
Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A: False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics 2005, 21(13):3017–3024. 10.1093/bioinformatics/bti448
Article CAS PubMed Google Scholar
Thomas JG, Olson JM, Tapscott SJ, Zhao LP: An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Res 2001, 11(7):1227–1236. 10.1101/gr.165101
Article PubMed Central CAS PubMed Google Scholar
Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW: On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comp Biol 2001, 8(1):37–52. 10.1089/106652701300099074
Article CAS Google Scholar
Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 2000, 98(9):5116–5121. 10.1073/pnas.091062498
Article Google Scholar
Theilhaber J, Bushnell S, Jackson A, Fuchs R: Bayesian estimation of fold-changes in the analysis of gene expression: the PFOLD algorithm. J Comp Biol 2001, 8(6):585–614. 10.1089/106652701753307502
Article CAS Google Scholar
Townsend J: Resolution of large and small differences in gene expression using models for the Bayesian analysis of gene expression levels and spotted DNA microarrays. BMC Bioinformatics 2004., 5(54):
Google Scholar
Townsend JP, Hartl DL: Bayesian analysis of gene expression levels: statistical quantification of relative mRNA level across multiple strains or treatments. Genome Biol 2002., 3(12):
Google Scholar
Tseng GC, Oh MK, Rohlin L, Liao JC, Wong WH: Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Res 2001, 29(12):2549–2557. 10.1093/nar/29.12.2549
Article PubMed Central CAS PubMed Google Scholar
Reiner A, Yekutieli D, Benjamini Y: Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 2003, 19(3):368–375. 10.1093/bioinformatics/btf877
Article CAS PubMed Google Scholar
Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 2002, 18(Suppl. 1):S96-S104.
Article PubMed Google Scholar
Durbin BP, Hardin JS, Hawkins DM, Rocke DM: A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 2002, 18(Suppl. 1):S105-S110.
Article PubMed Google Scholar
DeGroot MH, Schervish MJ: Probability and Statistics. Addison Wesley; 2002.
Google Scholar
Arfin S, Long D, Ito E, Tolleri L, Riehle M, Paegle E, Hatfield GW: Global gene expression profiling in Escherichia coli K12. J Biol Chem 2000, 275(38):29672–29684. 10.1074/jbc.M002247200
Article CAS PubMed Google Scholar
Long D, Mangalam H, Chan B, Tolleri L, Hatfield GW, Baldi P: Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical framework. J Biol Chem 2001, 276(23):19937–19944. 10.1074/jbc.M010192200
Article CAS PubMed Google Scholar
Gottardo R, Pannucci J, Kuske C, Brettin T: Statistical analysis of microarray data: a Bayesian approach. Biostatistics 2003, 4(4):597–620. 10.1093/biostatistics/4.4.597
Article PubMed Google Scholar
Jain N, Thatte J, Braciale T, Ley K, O'Connell M, Lee J: Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics 2003, 19(15):1945–1951. 10.1093/bioinformatics/btg264
Article CAS PubMed Google Scholar
Delmar P, Robin S, Daudin JJ: VarMixt: efficient variance modeling for the differential analysis of replicated gene expression data. Bioinformatics 2005, 21(4):502–508. 10.1093/bioinformatics/bti023
Article CAS PubMed Google Scholar
Fox RJ, Dimmic MW: A Bayesian two-sample t-test.2006. [http://www.dimmic.net/supplement/]
Google Scholar

Download references

Acknowledgements

We would like to thank Magda Bartilson and Stephen DelCardayre for giving us the inspiration to develop the method. We also thank John Grate, Lori Giver and John Whitman for their generous support and feedback and for reviewing various versions of the manuscript.

Author information

Authors and Affiliations

Codexis, Inc, Redwood City, CA, 94063, USA
Richard J Fox
Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY, 14853, USA
Matthew W Dimmic
Divergence, Inc, St. Louis, MO, 63141, USA
Matthew W Dimmic

Authors

Richard J Fox
View author publications
You can also search for this author in PubMed Google Scholar
Matthew W Dimmic
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Richard J Fox.

Additional information

Authors' contributions

RJF conceived of and derived the statistical formulation, applied the method to various data sets and drafted the manuscript. MWD contributed major revisions to the manuscript and suggested ways of improving the study. Both authors contributed to the final version of the manuscript and approved it.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Fox, R.J., Dimmic, M.W. A two-sample Bayesian t-test for microarray data. BMC Bioinformatics 7, 126 (2006). https://doi.org/10.1186/1471-2105-7-126

Download citation

Received: 18 October 2005
Accepted: 10 March 2006
Published: 10 March 2006
DOI: https://doi.org/10.1186/1471-2105-7-126

A two-sample Bayesian t-test for microarray data

Abstract

Background

Results

Conclusion

Background

Method

Bayesian probability model

Full two-sample model

Posterior distribution

Marginal posterior for Δμ

Estimating the prior variance

Results

Discussion and conclusion

Appendix

Determining the marginal posterior distribution of Δμ

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Authors' contributions

Rights and permissions

About this article

Cite this article

Keywords

BMC Bioinformatics

Contact us

A two-sample Bayesian t-test for microarray data

Abstract

Background

Results

Conclusion

Background

Method

Bayesian probability model

Full two-sample model

Posterior distribution

Marginal posterior for Δμ

Estimating the prior variance

Results

Discussion and conclusion

Appendix

Determining the marginal posterior distribution of Δμ

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Authors' contributions

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us