A mixture model approach to sample size estimation in two-sample comparative microarray experiments
- Tommy S Jørstad^{1},
- Herman Midelfart^{1} and
- Atle M Bones^{1}Email author
https://doi.org/10.1186/1471-2105-9-117
© Jørstad et al; licensee BioMed Central Ltd. 2008
Received: 03 July 2007
Accepted: 25 February 2008
Published: 25 February 2008
Abstract
Background
Choosing the appropriate sample size is an important step in the design of a microarray experiment, and recently methods have been proposed that estimate sample sizes for control of the False Discovery Rate (FDR). Many of these methods require knowledge of the distribution of effect sizes among the differentially expressed genes. If this distribution can be determined then accurate sample size requirements can be calculated.
Results
We present a mixture model approach to estimating the distribution of effect sizes in data from two-sample comparative studies. Specifically, we present a novel, closed form, algorithm for estimating the noncentrality parameters in the test statistic distributions of differentially expressed genes. We then show how our model can be used to estimate sample sizes that control the FDR together with other statistical measures like average power or the false nondiscovery rate. Method performance is evaluated through a comparison with existing methods for sample size estimation, and is found to be very good.
Conclusion
A novel method for estimating the appropriate sample size for a two-sample comparative microarray study is presented. The method is shown to perform very well when compared to existing methods.
Background
One of the most frequently used experimental setups for microarrays is the two-sample comparative study, i.e. a study that compares expression levels in samples from two different experimental conditions. In the case of replicated two-sample comparisons statistical tests may be used to assess the significance of the measured differential expression. A natural test statistic for doing so is the t-statistic (see e.g. [1]), which will be our focus here. In the context of two-sample comparisons it is also convenient to introduce the concept of 'effect size'. In this paper effect size is taken to mean: the difference between two conditions in a gene's mean expression level, divided by the common standard deviation of the expression level measurements.
In an ordinary microarray experiment thousands of genes are measured simultaneously. Performing a statistical test for each gene leads to a multiple hypothesis testing problem, and a strategy is thus needed to control the number of false positives among the tests. A successful approach to this has been to control the false discovery rate (FDR) [2], or FDR-variations like the positive false discovery rate (pFDR) [3].
To obtain the wanted results from an experiment it is important that an appropriate sample size, i.e. number of biological replicates, is used. A goal can, for example, be set in terms of a specified FDR and average power, and a sample size chosen so that the goal may be achieved [4].
In the last few years many methods have been suggested that can help estimate the needed sample size. Some early approaches [5, 6] relied on simulation to see the effect of sample size on the FDR. Later work established explicit relationships between sample size and FDR. A common feature of the more recent methods is that they require knowledge of the distribution of effect sizes in the experiment to be run. In lack of this distribution there are two alternatives. The first alternative is simply specifying the distribution to be used. The choice may correspond to specific patterns of differential expression that one finds interesting, or it can be based on prior knowledge of how effect sizes are distributed. Many of the available methods discuss sample size estimates for specified distributions [7–11]. The second alternative is estimating the needed distribution from a pilot data set. Ferreira and Zwinderman [12] discuss one such approach. Assuming that the probability density functions for the test statistics are symmetric and belonging to a location family, they obtain the wanted distribution using a deconvolution estimator. One should note that, for the sample sizes often used in a microarray experiment, the noncentral density functions for t-statistics depart from these assumptions. Hu et al. [13] and Pounds and Cheng [14] discuss two different approaches. Both methods recognize that test statistics for differentially regulated genes are noncentrally distributed, and aim to estimate the corresponding noncentrality parameters. From the noncentrality parameters, effect sizes can be found. Hu et al. consider, as we do, t-statistics and estimate the noncentrality parameters by fitting a 3-component mixture model to the observed statistics of pilot data. Pounds and Cheng consider F-statistics and estimate a noncentrality parameter for each observation. They then rescale the estimates according to a Bayesian q-value interpretation [15]. A last approach that needs mention is that of Pawitan et al. [16], which fits a mixture model to observed t-statistics using a likelihood-based criterion. The approach is not explored as a sample size estimation method in the paper by Pawitan et al., but it can be put to this use.
In this article we introduce a mixture model approach to estimating the underlying distribution of noncentrality parameters, and thus also effect sizes, for t-statistics observed in pilot data. The number of mixture components used is not restricted, and we present a novel, closed form, algorithm for estimating the model parameters. We then demonstrate how this model can be used for sample size estimation. By examining the relationships between FDR and average power, and between FDR and the false nondiscovery rate (FNR), we are able to choose sample sizes that control these measures in pairs. To validate our model and sample size estimates, we test its performance on simulated data. We include the estimates made by the methods of Hu et al., Pounds and Cheng and Pawitan et al.
Results and Discussion
Notation, assumptions and test statistics
Throughout this text t_{ ν }(λ) represents the probability density function (pdf) of a t-distributed random variable with ν degrees of freedom and noncentrality parameter λ. A central t pdf, t_{ ν }(0), can also be written t_{ ν }. A t_{ ν }(λ) evaluated at x is written t_{ ν }(x; λ).
Assume gene expression measurements can be made in a pilot study. For a particular gene we denote the n_{1} measurements from condition 1 and the n_{2} from condition 2 by X_{1i}, (i = 1, ..., n_{1}) and X_{2j}, (j = 1, ..., n_{2}). Let (μ_{1}, ${\sigma}_{1}^{2}$) and (μ_{2}, ${\sigma}_{2}^{2}$) be expectation and variance for each X_{1i}and X_{2j}, respectively. For simplicity we focus in this paper on the case where ${\sigma}_{1}^{2}={\sigma}_{2}^{2}={\sigma}^{2}$. As is common in microarray data analysis, the X_{1i}s and X_{2j}s are assumed to be normally distributed random variables.
Measured expression levels are often transformed before normality is assumed.
where $\overline{d}={n}^{-1}{\displaystyle {\sum}_{i=1}^{n}({X}_{1i}-{X}_{2i})}$ and ${S}_{d}^{2}={(n-1)}^{-1}{\displaystyle {\sum}_{i=1}^{n}{({d}_{i}-\overline{d})}^{2}}$. Now, under H_{0} : μ_{1} = μ_{2}, T_{2} has as a t_{n - 1}pdf. If μ_{1} - μ_{2} = ξ, however, the pdf of T_{2} is a t_{n - 1}(ξ σ^{-1} $\sqrt{n}$).
In both experimental setups we note that the pdf of t-statistics for truly unregulated genes is t_{ ν }(0). For truly regulated genes the pdf is t_{ ν }(δ), with δ ≠ 0 reflecting the gene's level of differential expression. We also note that this δ is proportional to the gene's effect size, ξ/σ . The δ s can be considered realizations of some underlying random variable Δ, distributed as h(δ). Under our assumptions the observed t-scores should thus be modelled as a mixture of t_{ ν }(δ)-distributions, with the h(δ) as its mixing distribution. The h(δ) is not directly observed and must be estimated.
In the following, the t-statistics calculated in an experiment are assumed to be independent. This assumption, and the assumption that ${\sigma}_{1}^{2}={\sigma}_{2}^{2}={\sigma}^{2}$, may not hold in the microarray setting. In the Testing section we examine cases where these assumptions are not satisfied to see how results are affected.
Algorithm for estimating effect sizes
Let y_{ j }, j = 1, ..., m denote observed t-statistics for the m genes of an experiment, having Y_{ j }s as corresponding random variables. Let f(y) be their pdf. Our mixture model can then generally be stated asf(y; h) = ∫ t_{ ν }(y; δ) dh(δ),
The h(δ) is thus a distribution where Pr(δ = δ_{ i }) = π_{ i }, (i = 0, ..., g), and ${\sum}_{i=0}^{g}{\pi}_{i}=1$. The second form of f(y) in (3) is due to knowing that δ = 0 for unregulated genes.
We now aim to find the parameters of a model like (3) with a fixed number of components g + 1. It is clear that finding these parameters can be formulated as a missing data problem, which suggests the use of the EM-algorithm [18]. Although this approach has been discussed in earlier work [13, 16], a closed form EM-algorithm that solves the problem has not been available until now. The main difficulty with constructing the algorithm is the lack of a closed form expression for the noncentral t pdf. In the remainder of this section we show how the needed algorithm can be obtained.
where the z_{ j }is a realized value of the random Z_{ j }.
This form of the noncentral t pdf can be identified as a mixture of normal $N\left(\frac{\delta}{u},\frac{1}{{u}^{2}}\right)$ distributions, with a scaled χ_{ ν }mixing distribution for the random variable U. Based on the characterization in (4) we can introduce a new set of missing data u_{ ij }, (i = 0, ..., g; j = 1, ..., m) that are realizations of U_{ ij }, and defined so that (Y_{ j }|u_{ ij }, z_{ ij }= 1) follows a $N\left(\frac{\delta}{{u}_{ij}},\frac{1}{{u}_{ij}^{2}}\right)$ distribution. Restating the model in this form, as a mixture of mixtures, is a vital step in finding the closed form algorithm.
where π= (π_{1}, ..., π_{ g }), δ= (δ_{1}, ..., δ_{ g }) and f_{ c }(y|u, z) and g_{ c }(u) are the above-mentioned normal and scaled χ_{ ν }distribution, respectively. The E-step of the EM-algorithm requires the expectation of L_{ c }in (5) conditional on the data. Combining (4) with (5) we find that we need the expectationsE(Z_{ ij }|y_{ j }) and E(U_{ ij }|y_{ j }, z_{ ij }= 1).
where f_{ c }(u|y, z) is the pdf of (U_{ ij }|y_{ j }, z_{ ij }= 1). Note that g_{ c }(u|z) = g_{ c }(u) since U_{ ij }and Z_{ ij }are independent.
where $H{h}_{k}(x)={\displaystyle {\int}_{0}^{\infty}\frac{{w}^{k}}{k!}}{e}^{-\frac{1}{2}{(w+x)}^{2}}$dw for an integer k ≥ 0. The properties of the H h_{ k }-functions are discussed in [21]. A particularly nice property is that it satisfies the recurrence relation(k + 1) H h_{k + 1}(x) = - x H h_{ k }(x) + H h_{k - 1}(x).
With easily calculated $H{h}_{-1}(x)={e}^{-\frac{1}{2}{x}^{2}}$, $H{h}_{0}(x)={\displaystyle {\int}_{x}^{\infty}{e}^{-\frac{1}{2}{u}^{2}}}du$, we have a convenient way of computing (9).
On convergence, the estimated π and δ are used as the parameters of a h(δ) with g + 1 point masses. As discussed above we fix δ_{0} = 0.
An issue that has received much attention is estimating the proportion of true null hypotheses when many hypothesis tests are performed simultaneously (e.g. [3, 22, 23]). In the microarray context this amounts to estimating the proportion of truly unregulated genes among all genes considered. Referring to (3) we see that this quantity enters our model as π_{0}. To draw on the extensive work on π_{0}-estimation, we suggest using a known conservative π_{0}-estimate to guide the choice of some model parameters. This is discussed below. In our implementation we use the convex decreasing density estimate proposed in [24], but a different estimate may be input.
Assessing the appropriate number of components in a mixture model is a difficult problem that is not completely resolved. An often adopted strategy is using measures like the Akaike Information Criterion (AIC) [25] or the Bayesian Information Criterion (BIC) [26]. We find from simulation that, in our setting, these criteria seem to underestimate the needed number of components (refer to the Testing section for some of the simulation results). This is possibly due to the large proportion of unregulated genes often found in microarray data. With relatively few regulated genes, the gain in likelihood from fitting additional components is, for these criteria, not enough to justify the additional parameters used. In our implementation we use g = log_{2}((1 - ${\widehat{\pi}}_{0}$) m), where ${\widehat{\pi}}_{0}$ is the above-mentioned estimate. This choice is motivated by experience, but has proven itself adequate in our numerical studies. It also reflects the fact that a single component should provide sufficient explanation for the unregulated genes, while the remaining g components explain the regulated ones. A different g may be specified by users of the sample size method.
A complication that could arise when fitting the mixture model is that one or more of the ${\left\{{\delta}_{i}\right\}}_{i=1}^{g}$, could be assigned to δ = 0, or very close to it, thus affecting the fit of the central component. To avoid this, we define a small neighbourhood around δ_{0} = 0 from which all g remaining components are excluded. A δ_{ i }, i ≠ 0, that tries crossing into this neighbourhood while fitting the model, is simply halted at the neighbourhood boundary. The boundary is determined by finding the smallest $\tilde{\delta}$ for which it is possible to tell apart the t_{ ν }(0) distribution from a t_{ ν }($\tilde{\delta}$) one, based on samples of sizes ${\widehat{\pi}}_{0}m$ and (1 - ${\widehat{\pi}}_{0}$) m/g, respectively. The latter sample size assumes regulated genes to be evenly distributed among their components. The samples are taken as evenly spaced points within the 0.05 and 0.95 quantiles of the two distributions. The criterion used to check if the two samples originated from the same distribution is a two-sample Kolmogorov-Smirnov test with a significance level of 0.05. The rationale behind this criteron is that, for the g components associated with regulated genes, we only want to allow those that with reasonable certainty can be distinguished from the central component. Again, the ${\widehat{\pi}}_{0}$ used is the estimate discussed above.
Another difficulty related to fitting a mixture model is that the optimal fit is not unique, something which might cause convergence problems. The difficulty is due to the fact that permuting mixture components does not change the likelihood function. In our implementation we do not have any constraints that resolve this problem. We did, however, track the updates of our mixture components in a number of test runs and did not see this problem occur.
In summary, our approach provides estimates, ${\left\{{\widehat{\delta}}_{i}\right\}}_{i=0}^{g}$ and ${\left\{{\widehat{\pi}}_{i}\right\}}_{i=0}^{g}$, of the noncentrality parameters in the data and a set of weights. Together these quantities make up an estimate of the distribution h(δ). As seen in the section on test-statistics a δ is proportional to the effect size, ξ/σ. No estimates or assumptions are thus made on the numerical size of the means or variances in the data. We only estimate a set of mean shift to variance ratios.
Algorithm for estimating sample sizes for FDR control
Outcomes of m hypothesis tests
H_{0} accepted | H_{0} rejected | Total | |
---|---|---|---|
H _{ 0 } true | A _{0} | R _{0} | m _{0} |
H _{ 0 } false | A _{1} | R _{1} | m _{1} |
Total | M _{ A } | M _{ R } | m |
Table 1 summarizes the possible outcomes of m hypothesis tests. All table elements except m are random variables. From Table 1 the much used positive false discovery rate (pFDR) [3] can be defined as E(R_{0}/M_{ R }|M_{ R }> 0). In [3] it is also proven that, for a given significance region and under assumed independence for the tests, we havepFDR = Pr(H_{0} true|H_{0} rejected),
Equation (13) is important in sample size estimation as it provides the relationship between pFDR, significance region and sample size. Sample size will determine Pr(p <α) since the shape of the t-distributions depends on n_{1}, n_{2}. (An explanation is provided at the end this section). The application of (13) to sample size estimation was first discussed by Hu et al. [13].
In words, instead of specifying an α, one can specify E(M_{ R }). This way of obtaining α, however, has a shortcoming. It provides little direct information to the user about the experiment's ability to recognize regulated genes. In our view, a more informative way to choose α would be to let the user specify quantities such as average power or the false nondiscovery rate (FNR). We now discuss how this can be accomplished.
Power is defined as the probability of rejecting a hypothesis that is false. In the microarray multiple hypothesis setting, average power controls the proportion of regulated genes that is correctly identified. Setting α through an intuitively appealing measure as average power would thus be helpful. A relationship between α and average power can be found by rewriting the denominator of the right side in (13) asPr(p <α) = Pr(p <α|H_{0} true)π_{0} + Pr(p <α|H_{0} false)(1 - π_{0}).
Combining (15) and (13) one can now find sample sizes that achieve a specified pFDR and average power, with no need of specifying α.
Another interesting measure to control is the false nondiscovery rate (FNR), the expected proportion of false negatives among the hypotheses not rejected. In other words, the FNR controls the proportion of regulated genes erroneously accepted as unregulated. We use a version of the FNR discussed in [15] called the pFNR = E(A_{1}/M_{ A }|M_{ A }> 0), which, under the same assumptions as for the pFDR, can be stated aspFNR = Pr(H_{0} false|H_{0} accepted).
Again, specifying a pFNR will correspond to a specific choice of α. The pFNR approach to setting α could be interesting to use. One should note, however, that in the microarray setting this measure can sometimes be hard to apply. The reason is the potentially large M_{ A }, due to a high proportion of unregulated genes. A large M_{ A }makes pFNR numerically small, and a reasonable size may be hard to set.
where y_{0} is the t-score corresponding to a p-value cutoff α for testing the null hypothesis. Since f(y) is a weighted sum of t_{ ν }(δ)s, the integrals needed to be taken are simply quantiles of (noncentral) t-distributions. Sample size affects (17) because all t_{ ν }(δ)s to be integrated are on the form ${t}_{{s}_{1}(n)}({\xi}_{i}{\sigma}_{i}^{-1}{s}_{2}(n))$, with s_{1}(n) and s_{2}(n) dependent on n_{1} and n_{2}. (Refer to pdfs of (1) and (2)). If the effect sizes ${\left\{{\xi}_{i}{\sigma}_{i}^{-1}\right\}}_{i=0}^{g}$ and the weights ${\left\{{\pi}_{i}\right\}}_{i=0}^{g}$ were known we could evaluate (17) for all s_{1}(n), s_{2}(n).
For finding the root of S(n) we implement a bisection method.
Testing
To evaluate our approach we implemented the EM-algorithm and the sample size estimation method described above. We then ran tests on simulated data sets. For reasons discussed above we focused on controlling average power, along with the pFDR, in our sample size estimates.
When using simulated data sets it is possible to calculate the true sample size needed to achieve a given combination of pFDR and average power. To evaluate the performance of our method we compared our estimates to the true values. For comparison with existing approaches we also included the estimates made by the methods of Hu et al. [13], Pounds and Cheng [14] and Pawitan et al. [16].
Use of existing methods
In their paper Hu et al. discuss three different mixture models. In our comparison we used their truncated normal model, as this seemed to be the favored one. To produce sample size estimates using this model one needs to input the wanted pFDR and E(M_{ R }). As we wished to control pFDR along with average power, we calculated the E(M_{ R }) corresponding to each choice of pFDR and average power as E(M_{ R }) = average power·m(1 - π_{0})/(1 – pFDR). All tests were run with default parameters as found in the source code. In the implementation of Hu et al., however, three parameters (diffthres, iternum, gridsize) were missing default values. Reasonable choices (0.01,10,100) were kindly provided by the authors.
For the method of Pounds and Cheng one needs to input quantities called anticipated false discovery ratio (aFDR) and anticipated average power. For the estimates presented here we used the corresponding pFDR and average power combination. As Pounds and Cheng work with F-statistics, the t-statistics calculated in our tests were transformed accordingly. (If T follows a t_{ ν }distribution, then T^{2} is distributed as F_{1,ν}). In their implementation Pounds and Cheng set an upper limit, nmax, on the sample size estimates, and replace all estimates above nmax with nmax itself. The default value of nmax = 50 was replaced with nmax = 1000 in our tests. The reason for this was that sample size estimates of more than 50 could, and did, occur in the tests. Using a low nmax would then affect the comparison to the other methods. Apart from nmax all tests were run with default parameters as found in the source code.
In [16] Pawitan et al. discuss a method for fitting a mixture model to t-statistics. The fitted model is used to make an estimate of the proportion of unregulated genes, π_{0}. The use of their method for sample size estimation is mentioned, but is not further explored or tested. The only input needed to fit a model is the number of mixture components. In our tests, this number was determined using the AIC, as suggested by Pawitan et al. in their paper. In the implementation of Pawitan et al. the assignment of non-central components close to the central one is not restricted. A preliminary test run using their unadjusted model fit showed that sample size estimates were greatly deteriorated by this. In our tests we therefore adjusted their fitted model by collapsing all non-central components within a given threshold (|δ| < 0.75) into the central one. Our model adjustment corresponds to the π_{0}-estimation procedure used in the implementation of Pawitan et al. For our test runs we used our procedures to produce sample size estimates based on the models fitted by this method.
Test procedure and results
For the test results presented here we used m = 10000 genes, and n_{1} = n_{2} = 5 measurements per group. For the proportion of unregulated genes we examined the cases of π_{0} = 0.7 and π_{0} = 0.9.
In a first set of tests we considered sample size estimates in the case of normally distributed measurements and equal variances. In this setting we simulated data with and without a correlation structure. The true distribution of noncentrality parameters for the m_{1} = (1 - π_{0})m regulated genes was generated in the following way: A random sample was drawn from a N(1,0.5^{2}) and a N(-1,0.5^{2}) distribution. Both samples were of size m_{1}/2, and together they made up the ${\left\{{\xi}_{k}\right\}}_{k=1}^{{m}_{1}}$ for the regulated genes in data set. The measurement variances were set to ${\sigma}^{2}={\sigma}_{1}^{2}={\sigma}_{2}^{2}={0.5}^{2}$. The noncentrality parameters of the regulated genes were then calculated as ${\left\{{\xi}_{k}{\sigma}^{-1}\sqrt{{n}_{1}{n}_{2}{({n}_{1}+{n}_{2})}^{-1}}\right\}}_{k=1}^{{m}_{1}}$ (Refer to discussion of (1)). Based on our experience with microarray data analysis the above choices seem plausible for log-transformed microarray measurements. Since the true noncentrality parameters are all known, the true sample size needed to achieve a particular pFDR and average power can be calculated.
Correlation was introduced using a block correlation structure with block size 50. The reasoning behind such a structure was discussed in [27]. All genes, regulated and unregulated, were randomly assigned to their blocks. A correlation matrix for each group was then generated by first sampling random values from a uniform U(-1, 1) distribution into the off-diagonal elements of a symmetric matrix. Then, using the iteration procedure described in [28], we iterated to find the positive semidefinite matrix with unit diagonal that was closest to our randomly generated one.
Evaluating the number of mixture components.
π _{0} | pFDR | power | True | JMB (sd) | AIC (sd) |
---|---|---|---|---|---|
0.7 | 0.05 | 0.6 | 6 | 6 (0.3) | 7 (0.5) |
0.7 | 0.05 | 0.7 | 8 | 8 (0.5) | 8 (0.8) |
0.7 | 0.05 | 0.8 | 11 | 10 (1.1) | 15 (4.0) |
0.7 | 0.05 | 0.9 | 24 | 22 (2.5) | 52 (22.5) |
0.7 | 0.01 | 0.6 | 9 | 9 (0.7) | 9 (0.5) |
0.7 | 0.01 | 0.7 | 11 | 11 (0.8) | 12 (1.0) |
0.7 | 0.01 | 0.8 | 16 | 15 (1.8) | 25 (8.4) |
0.7 | 0.01 | 0.9 | 35 | 35 (3.6) | 79 (34.4) |
0.9 | 0.05 | 0.6 | 9 | 8 (1.1) | 11 (3.8) |
0.9 | 0.05 | 0.7 | 11 | 11 (2.4) | 15 (5.6) |
0.9 | 0.05 | 0.8 | 16 | 15 (4.1) | 21 (8.5) |
0.9 | 0.05 | 0.9 | 35 | 24 (7.9) | 33 (13.7) |
0.9 | 0.01 | 0.6 | 11 | 11 (1.9) | 15 (6.3) |
0.9 | 0.01 | 0.7 | 14 | 14 (3.2) | 21 (8.8) |
0.9 | 0.01 | 0.8 | 21 | 20 (6.3) | 29 (12.7) |
0.9 | 0.01 | 0.9 | 45 | 32 (11.1) | 44 (21.8) |
Tests with normally distributed measurements were also run, in which the ${\left\{{\xi}_{k}\right\}}_{k=1}^{{m}_{1}}$ were drawn from gamma distributions and where variances differed according to the model discussed below. Results were similar to those discussed above (not shown).
In a second set of tests we wanted to simulate data from a model having the characteristics of a true microarray experiment. We also wanted to see how sample size estimates were affected if the assumptions of normality and equal variances did not hold. To accomplish this we based our simulation on the Swirl data set, which is included in the limma software package [29], and on a model for gene expression level measurements discussed by Rocke and Durbin [30]. The model of Rocke and Durbin states thatw = α + μe^{ η }+ ∈,
where w is the intensity measurement, μ is the expression level, α is the background level and ∈ and η are error terms distributed as N (0, ${\sigma}_{\epsilon}^{2}$) and N (0, ${\sigma}_{\eta}^{2}$) respectively. Using the estimation method discussed in their paper we estimated the parameters of (18) for the Swirl data set. The estimated parameters $(\widehat{\alpha},{\widehat{\sigma}}_{\epsilon},{\widehat{\sigma}}_{\eta})$ were, for the mutant: (394.12, 150.84, 0.18), and for the wild-type: (612.99, 291.40, 0.19). To generate a set of log-ratios representative of the same data we performed a significance analysis as outlined in the limma user's guide. Using a cutoff level of 0.10 for the FDR-adjusted p-values, we obtained a set of 280 log-ratios for genes likely to be regulated. Log-ratios for the regulated genes in our tests were sampled from this set with replacement. The true expression levels were generated by sampling from the background-corrected mean intensities of the genes in the mutant data set. To simulate microarray data for two conditions the following procedure was used: A set of log-ratios, and the true expression levels for one condition, were sampled. The true expression levels for the other condition were then calculated. Using the above-mentioned model, with their respective sets of parameters, measurements were simulated for both conditions and then log-transformed. To introduce correlation in this setting we added a random effect, γ, to the log-transformed measurements for each correlated block of genes. The γ was drawn from a N (0, $0.5{\sigma}_{\eta}^{2}$) distribution. The block size was again assumed to be 50, and the genes were assigned randomly to each block. The true sample size requirements were in this case estimated by repeatedly drawing data sets from the given model and calculating their average power and FDR on a fine grid of cutoff values for the t-statistics. A direct calculation is possible since the regulated genes are known.
Evaluating sample size estimates from different methods.
No correlation | With correlation | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
π _{0} | pFDR | power | True | JMB (sd) | HZW (sd) | PC (sd) | PMMP (sd) | True | JMB (sd) | HZW (sd) | PC (sd) | PMMP (sd) |
0.7 | 0.05 | 0.6 | 6 | 6 (0.4) | 11 (0.0) | 11 (1.6) | 6 (0.4) | 6 | 6 (0.5) | 11 (0.0) | 11 (1.4) | 6 (0.5) |
0.7 | 0.05 | 0.7 | 8 | 8 (0.6) | 18 (0.2) | 14 (2.8) | 7 (1.0) | 8 | 8 (1.0) | 18 (0.0) | 14 (2.3) | 7 (0.9) |
0.7 | 0.05 | 0.8 | 11 | 11 (1.5) | 39 (0.7) | 20 (4.8) | 9 (2.5) | 11 | 11 (2.0) | 38 (0.5) | 19 (4.1) | 9 (2.0) |
0.7 | 0.05 | 0.9 | 23 | 24 (3.4) | 146 (2.4) | 30 (9.5) | 13 (6.5) | 23 | 23 (4.5) | 145 (1.6) | 29 (7.9) | 15 (6.8) |
0.7 | 0.01 | 0.6 | 9 | 9 (0.5) | 17 (0.5) | 16 (2.7) | 9 (0.9) | 9 | 9 (0.7) | 17 (0.5) | 16 (2.4) | 8 (0.8) |
0.7 | 0.01 | 0.7 | 11 | 11 (1.1) | 28 (0.5) | 21 (4.6) | 10 (1.7) | 11 | 11 (1.7) | 28 (0.5) | 21 (3.8) | 10 (1.7) |
0.7 | 0.01 | 0.8 | 16 | 16 (2.4) | 60 (1.1) | 28 (7.9) | 13 (4.5) | 16 | 16 (3.3) | 60 (0.8) | 27 (6.4) | 13 (3.4) |
0.7 | 0.01 | 0.9 | 34 | 37 (6.0) | 231 (4.2) | 42 (14.5) | 18 (10.0) | 32 | 36 (7.5) | 229 (2.8) | 40 (11.9) | 22 (11.3) |
0.9 | 0.05 | 0.6 | 9 | 9 (2.1) | 16 (0.5) | 24 (7.8) | 9 (3.9) | 8 | 9 (2.6) | 16 (0.5) | 23 (6.6) | 9 (3.1) |
0.9 | 0.05 | 0.7 | 11 | 11 (3.5) | 27 (0.8) | 31 (11.4) | 11 (7.1) | 10 | 12 (4.1) | 27 (0.8) | 30 (9.7) | 11 (6.6) |
0.9 | 0.05 | 0.8 | 16 | 16 (5.2) | 59 (1.7) | 41 (16.7) | 15 (11.1) | 14 | 16 (5.9) | 58 (1.7) | 41 (14.4) | 15 (11.8) |
0.9 | 0.05 | 0.9 | 34 | 26 (7.6) | 227(6.6) | 60 (26.8) | 21 (17.9) | 29 | 25 (8.5) | 225 (6.7) | 59 (23.1) | 22 (18.8) |
0.9 | 0.01 | 0.6 | 11 | 12 (3.3) | 22 (0.7) | 33 (11.8) | 12 (6.0) | 11 | 12 (4.1) | 22 (0.6) | 32 (9.9) | 12 (4.9) |
0.9 | 0.01 | 0.7 | 14 | 15 (5.3) | 38 (1.2) | 43 (16.9) | 15 (10.4) | 13 | 16 (6.0) | 37 (1.2) | 42 (14.4) | 15 (9.9) |
0.9 | 0.01 | 0.8 | 21 | 21 (7.4) | 82 (2.6) | 56 (24.3) | 20 (15.6) | 19 | 21 (8.4) | 81 (2.7) | 55 (20.9) | 21 (16.8) |
0.9 | 0.01 | 0.9 | 46 | 35 (10.0) | 318 (10.0) | 79 (37.2) | 27 (24.3) | 38 | 33 (11.3) | 316 (11.3) | 78 (32.0) | 29 (25.1) |
0.7 | 0.05 | 0.6 | 6 | 6 (0.2) | 6 (0.0) | 10 (1.0) | 6 (1.1) | 6 | 6 (0.3) | 6 (0.0) | 10 (1.1) | 5 (0.7) |
0.7 | 0.05 | 0.7 | 7 | 8 (0.7) | 8 (0.3) | 12 (1.7) | 8 (2.4) | 7 | 8 (0.7) | 8 (0.1) | 12 (1.8) | 7 (1.6) |
0.7 | 0.05 | 0.8 | 9 | 11 (1.4) | 16 (0.4) | 16 (2.9) | 10 (4.9) | 9 | 11 (1.5) | 16 (0.2) | 16 (3.2) | 10 (3.8) |
0.7 | 0.05 | 0.9 | 14 | 23 (4.1) | 56 (1.2) | 24 (5.5) | 18 (11.3) | 15 | 24 (5.3) | 56 (1.0) | 25 (6.1) | 14 (7.8) |
0.7 | 0.01 | 0.6 | 8 | 8 (0.6) | 8 (0.0) | 11 (1.7) | 8 (2.2) | 8 | 8 (0.7) | 8 (0.0) | 14 (1.8) | 8 (1.0) |
0.7 | 0.01 | 0.7 | 10 | 11 (1.1) | 12 0.1) | 14 (2.8) | 11 (4.3) | 10 | 11 (1.3) | 12 (0.1) | 18 (2.8) | 10 (3.1) |
0.7 | 0.01 | 0.8 | 13 | 16 (2.5) | 23 (0.5) | 24 (4.6) | 15 (8.4) | 15 | 16 (2.5) | 23 (0.3) | 24 (5.0) | 13 (6.4) |
0.7 | 0.01 | 0.9 | 26 | 36 (7.2) | 84 (1.8) | 35 (8.4) | 27 (17.8) | 30 | 37 (8.8) | 83 (1.3) | 34 (9.4) | 20 (12.0) |
0.9 | 0.05 | 0.6 | 8 | 10 (2.0) | 7 (0.4) | 21 (8.2) | 8 (1.6) | 8 | 10 (2.3) | 8 (0.4) | 24 (8.5) | 9 (3.6) |
0.9 | 0.05 | 0.7 | 9 | 13 (3.3) | 11 (0.7) | 27 (12.1) | 10 (3.9) | 9 | 13 (3.4) | 12 (0.7) | 32 (12.8) | 11 (5.7) |
0.9 | 0.05 | 0.8 | 12 | 18 (5.5) | 21 (1.2) | 37 (19.0) | 13 (8.0) | 13 | 19 (5.0) | 23 (1.5) | 44 (19.7) | 14 (8.7) |
0.9 | 0.05 | 0.9 | 24 | 31 (8.5) | 76 (4.6) | 54 (30.2) | 18 (14.5) | 25 | 31 (7.9) | 85 5.4) | 65 (32.6) | 20 (14.5) |
0.9 | 0.01 | 0.6 | 11 | 13 (3.1) | 9 (0.5) | 29 (12.6) | 11 (2.4) | 11 | 13 (3.6) | 10 (0.6) | 34 (13.2) | 12 (5.6) |
0.9 | 0.01 | 0.7 | 13 | 17 (5.0) | 16 (0.6) | 38 (18.5) | 14 (6.2) | 14 | 19 (5.0) | 17 (0.8) | 45 (19.6) | 15 (8.2) |
0.9 | 0.01 | 0.8 | 18 | 25 (8.1) | 28 (1.5) | 50 (27.2) | 17 (11.8) | 19 | 26 (7.1) | 31 (1.9) | 60 (29.2) | 19 (12.3) |
0.9 | 0.01 | 0.9 | 51 | 43 (11.4) | 103 (6.2) | 72 (42.8) | 23 (19.8) | 53 | 43 (10.9) | 115 (7.9) | 88 (46.3) | 26 (19.7) |
Note that, since the implementations of Hu et al. and Pounds and Cheng support only sample size estimates based on two-sample t-statistics (1), all tests listed are based on this statistic. Our implementation supports both types, and tests were also run to check the case of one-sample t-statistics (2). The results were similar to those of the two-sample t-statistics (not shown).
Conclusion
We have in this article discussed a mixture model approach to estimating the distribution of noncentrality parameters, and thus effect sizes, among regulated genes in a two-sample comparative microarray study. The model can be fitted to t-statistics calculated from pilot data for the study. We have also illustrated how the model can be used to estimate sample sizes that control the pFDR along with other statistical measures like average power or pFNR. In the microarray setting our results will often also be approximately valid when using the FDR and FNR instead of the pFDR and pFNR. This is due to, referring to Table 1, that one frequently will have Pr(M_{ R }) ≈ 1 and Pr(M_{ A }) ≈ 1 in this setting. Sample size estimation methods like the one presented are useful in the planning of any large scale microarray experiment.
We examined the accuracy of our sample size estimates by performing a series of numerical studies. The conclusions are that our estimates are reasonably accurate, and have low variance, for moderate cutoffs in the error measures used. For stringent cutoffs we see a larger variance and a somewhat lowered accuracy. Overall our method seems to provide better results than the available sample size estimation methods of Hu et al. and Pounds and Cheng. We have also evaluated a method by Pawitan et al. for fitting a mixture model and its use in sample size estimation. The results using the method are good, and our tests suggest that optimizing the method of Pawitan et al. for use in sample size estimation could be interesting.
The decreased accuracy for stringent cutoffs found in the estimates is probably due to the difficult task of describing the distribution of regulated genes well near the point of no regulation, that is near δ = 0. It is important that the characterization of such genes is precise, but also that it does not affect the estimated distribution of the unregulated genes. Our solution in this article was to introduce a small neighbourhood around δ = 0 in which no other components are fitted. Better ways of differentiating between the two distributions close to 0 could be a subject of further study.
In our tests we also checked how correlation among the genes would affect the sample size estimates. We found that the estimates were only moderately affected. Nevertheless, we believe that estimation methods that incorporate correlation among genes is an important topic for future studies.
For the microarray setting the use of a moderated t-statistic, as discussed in [31], is often more appropriate than the ordinary t-statistic. For our method to be directly applicable to this type of statistic one would need to know that moderated t-statistics for unregulated genes follow a central t-distribution, and one would need the distribution's degrees of freedom. In [31] this distributional result is shown to hold, with augmented degrees of freedom for the central t-distribution. One would also need to know that moderated t-statistics for a regulated gene, with some given degree of regulation, follow a noncentral t-distribution, and one would need the distribution's degrees of freedom and noncentrality parameter. For this second result we are not aware of any findings. If this second result is shown to hold, then our method can be applied to moderated t-statistics as well. Testing the performance of our algorithm with moderated t-statistics, even without the second result, could also be interesting.
In our work we focus on average power as the control measure used along with the pFDR. One should note that having an average power of 0.9 means being able to correctly identify 90% of the regulated genes. In many experiments achieving this may not be interesting. One example is studies that aim only at identifying the small set of marker genes that best distinguishes one sample from the other. Other examples are experiments that, when comparing a treated and untreated tissue sample, only are interested in the most heavily affected regulatory pathways. In both these examples a power well below 0.9 could suffice. Our estimates are in this case particularly useful since they seem to be both accurate, and have low variance, for moderate power cutoffs with a low pFDR.
Although our main goal in this paper was fitting a model to be used in sample size estimation we would like to emphasize that the fitted model itself does provide some interesting information. One example is a direct estimate of π_{0}, the proportion of unregulated genes, which we often found to be better than the one made by existing methods.
Other subjects for future work include a speed-up of the algorithm. The convergence of a straightforward EM-algorithm is known to be rather slow, and methods that improve on it will be implemented. Another issue that may be investigated further is picking the number of model components. As already mentioned, using a few more components than suggested by the traditionally used information criteria would often improve sample size estimates substantially. A different approach might be needed to choose the number of components in this setting.
The methods discussed in this article are applicable to any two-sample comparative multiple hypothesis testing situation where t-tests are used, and not only to problems in the microarray setting.
Availability
An implementation of the methods discussed in this paper for the R environment is available from the authors upon request.
Declarations
Acknowledgements
This work was supported by grants NFR 143250/140 and NFR 151991/S10 from the biotechnology and the functional genomics (FUGE) programs of the Norwegian Research Council (AMB, TSJ). Financial support was also given by the cross-disciplinary project "BIOEMIT – Prediction and modification in functional genomics: combining bioinfomatical, bioethical, biomedical, and biotechnlogical research" at the Norwegian University of Science and Technology (HM). The authors would like to thank Mette Langaas for reading the manuscript and making suggestions for improvements. We would also like to thank two anonymous reviewers for their careful reading and helpful advice.
Authors’ Affiliations
References
- Callow MJ, Dudoit S, Gong EL, Speed TP, Rubin EM: Microarray Expression Profiling Identifies Genes with Altered Expression in HDL-Deficient Mice. Genome Res 2000, 10: 2022–2029.PubMed CentralView ArticlePubMedGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 1995, 57: 289–300.Google Scholar
- Storey JD: A direct approach to false discovery rates. J R Stat Soc Ser B 2002, 64: 479–498.View ArticleGoogle Scholar
- Jørstad TS, Langaas M, Bones AM: Understanding sample size: what determines the required number of microarrays for an experiment? Trends Plant Sci 2007, 12: 46–50.View ArticlePubMedGoogle Scholar
- Gadbury GL, Page GP, Edwards J, Kayo T, Prolla TA, Weindruch R, Permana PA, Mountz JD, Allison DB: Power and sample size estimation in high dimensional biology. Stat Methods Med Res 2004, 13: 325–338.View ArticleGoogle Scholar
- Müller P, Parmigiani G, Robert C, Rousseau J: Optimal Sample Size for Multiple Testing: the Case of Gene Expression Microarrays. J Am Stat Assoc 2004, 99: 990–1001.View ArticleGoogle Scholar
- Jung SH: Sample size for FDR-control in microarray data analysis. Bioinformatics 2005, 21: 3097–3104.View ArticlePubMedGoogle Scholar
- Li SS, Bigler J, Lampe JW, Potter JD, Feng Z: FDR-controlling testing procedures and sample size determination for microarrays. Stat Med 2005, 24: 2267–2280.View ArticlePubMedGoogle Scholar
- Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A: False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics 2005, 21: 3017–3024.View ArticlePubMedGoogle Scholar
- Tibshirani R: A simple method for assessing sample sizes in microarray experiments. BMC Bioinformatics 2006, 7: 106.PubMed CentralView ArticlePubMedGoogle Scholar
- Liu P, Hwang JTG: Quick Calculation for Sample Size while Controlling False Discovery Rate with Application to Microarray Anaylsis. Bioinformatics 2007, 23: 739–746.View ArticlePubMedGoogle Scholar
- Ferreira JA, Zwinderman AH: Approximate Power and Sample Size Calculations with the Benjamini-Hochberg Method. Int J Biostat 2007, 2: Article 8.Google Scholar
- Hu J, Zou F, Wright FA: Practical FDR-based sample size calculations in microarray experiments. Bioinformatics 2005, 21: 3264–3272.View ArticlePubMedGoogle Scholar
- Pounds S, Cheng C: Sample Size Determination for the False Discovery Rate. Bioinformatics 2005, 21: 4263–4271.View ArticlePubMedGoogle Scholar
- Storey JD: The Positive False Discovery Rate: A Bayesian Interpretation and the q-value. Ann Stat 2003, 31: 2013–2035.View ArticleGoogle Scholar
- Pawitan Y, Murthy KRK, Michiels S, Ploner A: Bias in the estimation of the false discovery rate in microarray studies. Bioinformatics 2005, 21: 3865–3872.View ArticlePubMedGoogle Scholar
- Lindsay BG: The Geometry of Mixture Likelihoods: A General Theory. Ann Stat 1983, 11: 86–94.View ArticleGoogle Scholar
- Dempster AP, Laird NM, Rubin DB: Maximum Likelihood from Incomplete Data via the EM Algorithm. J R Stat Soc Ser B 1977, 39: 1–38.Google Scholar
- Johnson NL, Kotz S, Balakrishnan N: Continuous Univariate Distributions. Volume 2. second edition. John Wiley and Sons, Inc; 1995.Google Scholar
- McLachlan G, Peel D: Finite Mixture Models. John Wiley and Sons, Inc; 2000.View ArticleGoogle Scholar
- Jeffreys H, Jeffreys BS: Methods of Mathematical Physics. third edition. Cambridge University Press; 1972.Google Scholar
- Schweder T, Spjøtvoll E: Plots of p-values to evaluate many tests simultaneously. Biometrika 1982, 69: 493–502.View ArticleGoogle Scholar
- Allison DB, Gadbury GL, Heo M, Fernandez JR, Lee CK, Prolla TA, Weindruch R: A mixture model approach for the analysis of microarray gene expression data. Comput Stat Data An 2002, 39: 1–20.View ArticleGoogle Scholar
- Langaas M, Lindqvist BH, Ferkingstad E: Estimating the proportion of true null hypotheses, with application to DNA microarray data. J R Stat Soc Ser B 2005, 67: 555–572.View ArticleGoogle Scholar
- Akaike H: Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory Edited by: Petrov BN, Csaki F. 1973, 267–281.Google Scholar
- Schwarz G: Estimating the Dimension of a Model. Ann Stat 1978, 6: 461–464.View ArticleGoogle Scholar
- Storey JD: Invited comment on 'Resampling-based multiple testing for DNA microarray data analysis' by Ge, Dudoit, and Speed. Test 2003, 12: 1–77.View ArticleGoogle Scholar
- Higham NJ: Computing the nearest correlation matrix – a problem from finance. IMA J Numer Anal 2002, 22: 329–343.View ArticleGoogle Scholar
- Smyth GK: Limma: linear models for microarray data. In Bioinformatics and Computational Biology Solutions using R and Bioconductor. Edited by: Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W. Springer, New York; 2005:397–420.View ArticleGoogle Scholar
- Rocke DM, Durbin B: A Model for Measurement Error for Gene Expression Arrays. J Comput Biol 2001, 8: 557–569.View ArticlePubMedGoogle Scholar
- Smyth GK: Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments. Stat Appl Genet Mol Biol 2004, 3: Article 3.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.