 Methodology article
 Open access
 Published:
False discovery rate control in twostage designs
BMC Bioinformatics volume 13, Article number: 81 (2012)
Abstract
Background
For gene expression or gene association studies with a large number of hypotheses the number of measurements per marker in a conventional singlestage design is often low due to limited resources. Twostage designs have been proposed where in a first stage promising hypotheses are identified and further investigated in the second stage with larger sample sizes. For two types of twostage designs proposed in the literature we derive multiple testing procedures controlling the False Discovery Rate (FDR) demonstrating FDR control by simulations: designs where a fixed number of topranked hypotheses are selected and designs where the selection in the interim analysis is based on an FDR threshold. In contrast to earlier approaches which use only the secondstage data in the hypothesis tests (pilot approach), the proposed testing procedures are based on the pooled data from both stages (integrated approach).
Results
For both selection rules the multiple testing procedures control the FDR in the considered simulation scenarios. This holds for the case of independent observations across hypotheses as well as for certain correlation structures. Additionally, we show that in scenarios with small effect sizes the testing procedures based on the pooled data from both stages can give a considerable improvement in power compared to tests based on the secondstage data only.
Conclusion
The proposed hypothesis tests provide a tool for FDR control for the considered twostage designs. Comparing the integrated approaches for both selection rules with the corresponding pilot approaches showed an advantage of the integrated approach in many simulation scenarios.
Background
Modern experimental techniques in genetic research such as microarray experiments or gene association studies produce high dimensional data and often thousands of hypotheses are tested simultaneously to identify genetic markers. Due to limited resources, the number of measurements per marker in a conventional singlestage design is often low. Twostage designs have been proposed where in a first stage promising markers are identified from the set of all markers considered initially. Thus, hypotheses corresponding to unpromising markers can be dropped in the interim analysis such that the second stage is performed with the reduced set of selected hypotheses. Given limited total resources or budgets, this allows the allocation of a larger number of observations to more promising hypotheses. It has been shown that such sequential procedures are typically considerably more powerful than singlestage designs[1–8].
An important problem when drawing inference from data produced by such designs is the construction of hypothesis tests that control the False Discovery Rate (FDR). While the construction of such test procedures is straightforward if only the secondstage data is used for testing, tests that make use of the data from both stages need to account for the specific selection rule used to select hypotheses for the second stage.
For twostage procedures where in an interim analysis all hypotheses with an unadjusted firststage pvalue below a prefixed selection boundary γ_{1} are selected for the second stage, hypothesis tests based on the pooled data from both stages have been proposed that control FDR or the familywise error rate[7] (see[8] for a generalization to multistage designs). Selecting all hypotheses whose firststage pvalue lies below a fixed threshold has several implications. First, the number of hypotheses selected for the second stage is a random variable unknown a priori, which is an obstacle for researchers if resources are limited and the number of markers for which further measurements can be collected cannot be arbitrarily increased. Second, because rather large thresholds need to be applied in the interim analysis in order not to miss alternative hypotheses in spite of the small sample size, the fixed threshold rule will, with a high probability, select some hypotheses even under the global null hypothesis, even though resources could be saved in this scenario by stopping the experiment for futility at the interim analysis.
In this work we propose statistical tests to control the FDR in twostage designs with selection rules that are not based on a fixed threshold for the firststage pvalues. First, we consider designs where a fixed number of hypotheses is selected for the second stage (FNS design)[4]. The selected hypotheses are the topranked hypotheses according to the first stage pvalues. Second, we consider twostage designs where selection is based on a fixed FDR threshold α_{1} in the interim analysis (FDRS design). All hypotheses that can be rejected with a test controlling the FDR at level α_{1} are selected for the second stage[9]. For the FNS design the number of continued hypotheses is deterministic and the procedure continues to the second stage, regardless of the actual effect sizes observed. For the FDRS design in contrast, the number of selected hypotheses is a random variable. Furthermore, under the global null hypothesis (where the FDR coincides with the familywise error rate) with probability 1  α_{1} no hypothesis can be rejected with the interim test at FDR level α_{1} and the trial is stopped for futility.
A simple approach to construct hypothesis tests controlling the FDR for twostage designs is to consider tests based on the secondstage data only. Standard multiple testing procedures applied to the secondstage data will control the FDR. For the FDRS design Benjamini and Yekutieli[9] showed that the nominal level applied at the second stage may be even increased taking into account the interim selection threshold. However, these approaches do not make full use of the available data because the firststage observations are used for selection only. We construct tests for the FNS and FDRS designs that are based on sufficient test statistics of the data from both stages (the “integrated approach”). These tests are designed in analogy to group sequential tests and appear to control the FDR well if the number of hypotheses tested is large enough.
The paper is structured as follows: in the next section the testing problem and the selection rules are introduced. Then the results of a simulation study which investigates the actual FDR and compares the mean number of rejected alternatives of the integrated approach to the pilot approach are presented. Finally, a real data example and a short discussion are given.
Methods
The test problem
We consider an experiment to test m twosided null hypotheses H_{0i}: μ_{ i } = 0 versus H_{1i}: μ_{ i } ≠ 0, i = 1, …, m, for the mean of independent, normally distributed observations with variances\left(\right.separators="">\n \n \n \n \sigma \n \n \n i\n \n \n 2\n \n \n \n, i = 1, …, m. More general distributional scenarios are discussed at the end of this section.
To adjust for multiple testing we aim to control the FDR of the experiment. The FDR[10] is defined as the expectation of the proportion of erroneously rejected hypotheses V among all rejected hypotheses R, FDR = E(V / max{R, 1}). We apply the stepup BenjaminiHochberg (BH) procedure to control the FDR at level α. Denote the ordered pvalues of the m hypotheses by p_{(1)} ≤ p_{(2)} ≤ ⋯ ≤ p_{(m)} and let d = argmax_{ i }{p_{(i)} ≤ iα/m} denote the index of the largest pvalue p_{(i)} smaller than or equal to iα/m. Then the BHprocedure rejects all hypotheses H_{0i} such that p_{(i)} ≤ dα/m. For singlestage tests it has been shown by Benjamini and Yekutieli[11] that the BHprocedure controls the FDR at level π_{0}α, if the subset of test statistics corresponding to true null hypotheses are independent or positively regression dependent. Here, π_{0} denotes the (unknown) proportion of true null hypotheses among the m tested null hypotheses. Furthermore, the FDR is asymptotically controlled (for increasing number of hypotheses) if the limiting fraction of true null hypotheses is less than one and the test statistics are weakly dependent such that their empirical distribution functions converge almost surely (and some additional technical conditions hold)[12]. In the following we assume distributional scenarios where the BHprocedures controls the FDR.
The twostage procedure
In the firststage for each hypothesis n_{1} observations are collected. Then an interim analysis is performed and for each hypothesis a twosided firststage pvalue\left(\right.separators="">\n \n \n \n p\n \n \n i\n \n \n (\n 1\n )\n \n \n =\n 2\n (\n 1\n \n \Phi \n (\n \n \n \n z\n \n \n i\n \n \n (\n 1\n )\n \n \n \n )\n \n, i = 1, …, m, is calculated, where\left(\right.separators="">\n \n \n \n z\n \n \n i\n \n \n (\n 1\n )\n \n \n \n denotes the standardized firststage mean for hypothesis i and Φ the cumulative distribution function of the standard normal distribution. The firststage pvalues are ranked according to their magnitude and the m_{2} hypotheses with the smallest pvalue are selected for the second stage. The number of selected hypotheses m_{2} can be either a prefixed number or may depend on the firststage results. Below we consider several choices for m_{2}. In a second stage for each selected hypothesis n_{2} observations are collected. n_{2} is assumed to be fixed and does not depend on the number of selected hypotheses. We consider two different approaches to arrive at the final test decision: the “integrated approach”, where the test decision is based on the combined data of both stages and the “pilot approach”, where the test decision is based on the secondstage data only.
In the following we introduce several rules to determine the number of selected hypotheses m_{2}.
Selection rules for twostage designs
Selection according to a prefixed selection boundary γ_{1}
Twostage designs have been proposed[2, 7, 8] where a selection boundary γ_{1} is prespecified and in the interim analysis all hypotheses with a firststage pvalue smaller than γ_{1} are selected for the second stage. Then
where 1 is the indicator function which equals 1 if the condition in the parentheses is satisfied and 0 otherwise. Thus, m_{2} is a random variable.
Prefixed number of hypotheses selected for the second stage  FNS design
With this rule the value of m_{2} is fixed a priori and the m_{2} hypotheses with the smallest firststage pvalues are selected for the second stage. We denote this procedure FNS design (Fixed Number Selection).
Selection based on an FDR threshold  FDRS design
In this approach all hypotheses which are significant according to the BHprocedure at a prefixed level α_{1} > α in the interim analysis are selected for the second stage. Thus, the number of selected hypotheses m_{2} is a random variable which depends on the firststage results. If no hypothesis can be rejected at level α_{1} in the interim analysis and thus be carried over to the second stage, m_{2} is set to zero. In this case the whole experiment is stopped. Note that under the global null hypothesis, i.e. in the setting where all null hypotheses are true and thus π_{0} = 1, this occurs with a probability of 1  α_{1}.
FDR control
In the subsections below we review the FDR controlling test procedure for twostage designs where hypotheses are selected based on a prefixed selection boundary applied to the firststage pvalues. Both pilot and integrated designs are considered. We then propose generalizations of these test procedures to FNS and FDRS designs, showing that for each of the designs a corresponding data dependent selection boundary for the firststage pvalues can be defined that converges under suitable assumptions almost surely to a fixed value. The results are derived for independent data. However, in genetic data dependence of test statistics is frequently observed and even weak dependence may be a strong assumption. In the simulation study we thus investigate the performances of the procedures for several correlation structures.
Selection according to a prefixed selection boundary γ_{1}
Pilot approach
In the pilot approach the final test statistics are based on data from the second stage only. The firststage data are used for selecting promising hypotheses only. To control the FDR of the experiment, at the end of the trial for each hypothesis a twosided pvalue\left(\right.separators="">\n \n \n \n p\n \n \n i\n \n \n (\n 2\n )\n \n \n =\n 2\n (\n 1\n \n \Phi \n (\n \n \n \n z\n \n \n i\n \n \n (\n 2\n )\n \n \n )\n \n )\n \n, i = 1, …, m_{2}, is calculated, where\left(\right.separators="">\n \n \n \n z\n \n \n i\n \n \n (\n 2\n )\n \n \n \n denotes the standardized mean of the secondstage observations for hypothesis i. Then the BHprocedure is applied to these pvalues which are based on the secondstage data only. Because the firststage observations are used only for selection and do not enter the final test statistics, the BHprocedure controlling the FDR at the second stage controls the FDR overall.
Integrated approach
If the data from both stages are to be used in the final test decision, one can account for the selection in the interim analysis by calculating sequential pvalues p_{ i }, i = 1 …, m, based on the monotonic ordering of the sample space[13]. If\left(\right.separators="">\n \n \n \n p\n \n \n i\n \n \n (\n 1\n )\n \n \n \n \n \n \gamma \n \n \n 1\n \n \n \n then the twosided sequential pvalue is defined as
and if\left(\right.separators="">\n \n \n \n p\n \n \n i\n \n \n (\n 1\n )\n \n \n \u2264\n \n \n \gamma \n \n \n 1\n \n \n \n it is given by
where Z_{ i } denotes the standardized overall mean of the observations from both stages and\left(\right.separators="">\n \n \n \n Z\n \n \n i\n \n \n (\n 1\n )\n \n \n \n the standardized mean of the observations in the first stage. Furthermore,\left(\right.separators="">\n \n \n \n z\n \n \n i\n \n \n ,\n \n \n z\n \n \n i\n \n \n (\n 1\n )\n \n \n \n denote realizations of the random variables\left(\right.separators="">\n \n \n \n Z\n \n \n i\n \n \n ,\n \n \n Z\n \n \n i\n \n \n (\n 1\n )\n \n \n \n and\left(\right.separators="">\n \n \n \n c\n \n \n 1\n \n \n \n \gamma \n \n \n 1\n \n \n /\n 2\n \n \n \n the (1  γ_{1}/2)quantile of the standard normal distribution. If the stopping criterion is satisfied the sequential pvalue is just the classical fixed sample pvalue calculated from the firststage observations. Otherwise, the calculation of the sequential pvalue involves the numerical solution of an integral (see the Appendix). Finally, the BHprocedure is applied to the sequential pvalues p_{1}, …, p_{ m }.
In a twostage procedure with fixed per hypothesis sample sizes n_{1}, n_{2} and a fixed selection boundary γ_{1} the sequential pvalues are uniformly distributed under the null hypothesis[13].
If for the subset of true null hypotheses the observations are independent across hypotheses such that the sequential pvalues are independent as well, it follows that the FDR is controlled in such a twostage design. As described above FDR control holds also if the sequential pvalues corresponding to true null hypotheses are positive regression dependent[11].
Next we extend the test procedure to the FNS and FDRS design.
FNS design
Pilot approach
For the FNS selection rule the FDR control with the pilot approach is straightforward: the BHprocedure can be applied to the secondstage pvalues of the m_{2} selected hypotheses. Because the firststage data do not enter the final test statistics, FDR control is guaranteed under the assumption of positive regression dependency.
Integrated approach
To utilize the data from both stages for the final test decision, we propose to compute sequential pvalues in analogy to (2), replacing the fixed selection boundary γ_{1} (which is not defined for the FNS design) by the value of the largest firststage pvalue of all selected hypotheses, i.e., setting\left(\right.separators="">\n \n \n \n \gamma \n \n \n 1\n \n \n =\n \n \n p\n \n \n (\n \n \n m\n \n \n 2\n \n \n )\n \n \n \n. Thus, the threshold γ_{1} is now data dependent. Because this is not accounted for in the calculation of the sequential pvalues, they may no longer be uniformly distributed under the null hypothesis and it is not obvious if the FDR is still controlled. However, if the observations are sufficiently independent across hypotheses, such that the empirical distribution functions of the firststage test statistics converge almost surely as the number of tested hypotheses increases (and some additional technical conditions hold, see the Appendix), γ_{1} converges almost surely to a fixed number. Thus, asymptotically γ_{1} is deterministic and for large m and m_{2} the procedure becomes similar to the method with a prefixed selection boundary.
Note that with the integrated twostage testing procedure, hypotheses that have not been selected in the interim analysis can in principle be rejected in the final test. Especially, if m_{2} is small compared to the number of hypotheses for which the alternative holds and the effect sizes are large, hypotheses that were not selected at the interim analysis can be rejected at the end because for every hypothesis a sequential pvalue is calculated (even for nonselected ones). Rejection of nonselected hypotheses can occur in an overpowered FNS design where only few hypotheses are selected, but even the sequential pvalues corresponding to nonselected alternative hypotheses are small enough to lead to a rejection. If such rejections occur, this is an indication that the firststage sample size has been chosen too large and no secondstage sample would have been needed to reach sufficient power. While the efficiency of such a design can be improved by choosing appropriate firststage sample sizes and selection rules, the control of the FDR is not affected.
FDRS design
Pilot approach
As for the FNS selection rule, if the BHprocedure is applied at nominal level α to the secondstage pvalues of the m_{2} selected hypotheses (computed from the observations of the second stage only), FDR control is guaranteed. However, as Benjamini and Yekutieli[9] showed, if, in a first stage, hypotheses are selected that can be rejected with the BHprocedure at nominal level α_{1}, and, in a second stage, the selected hypotheses are tested at nominal level α_{2}, the FDR of the secondstage test is actually controlled at level α_{1}α_{2}π_{0}, given the test statistics at each stage are positively regression dependent[9]. Thus, if in the second stage the nominal level α/α_{1} is applied, the FDR is still controlled at level π_{0}α. In the following we consider the latter, improved procedure.
Integrated approach
Similar to the FNS rule we propose to compute sequential pvalues in analogy to (2) to utilize the data from both stages for the final test decision.
Again, the resulting threshold γ_{1} is data dependent: we set γ_{1} = m_{2}α_{1}/m where m_{2} is a random variable. Then γ_{1} is approximately equal to the largest firststage pvalue of all selected hypotheses. Thus the sequential pvalues may no longer be uniformly distributed under the null hypothesis and FDR control is in question. However, the following argument gives a heuristic for FDR control when the number of hypotheses is large. If for a positive proportion of hypotheses the alternative holds the empirical distribution functions of the firststage test statistics converge almost surely as the number of tested hypotheses increases (and some additional technical conditions hold, see the Appendix), it can be shown that γ_{1} converges almost surely. Hence, in these settings γ_{1} is asymptotically deterministic.
Under the global null hypothesis γ_{1} does not converge and simulations (see the Results section) show that the FDR is actually inflated. Therefore, we suggest the following modification of the test procedure. Let m_{ s } > 0 denote a positive constant. In cases where less than m_{ s } hypotheses are selected by the FDRS selection rule the threshold γ_{1} used in (2) is set to the m_{ s }smallest firststage pvalue, thus{\gamma}_{1}=max({p}_{\left({m}_{2}\right)}^{\left(1\right)},{p}_{\left({m}_{s}\right)}^{\left(1\right)}). Note that this modification increases the firststage critical boundary used in the calculation of the sequential pvalue.
Generalizations to other testing problems
The procedure can be directly generalized to two group comparisons, replacing the standardized means by the standardized mean between group differences. More generally, the sequential pvalue can be computed as in (2) if (under the null hypothesis) the cumulative test statistics follow a multivariate normal distribution with mean zero and variance one. In the actual computation based on (3) the term n_{1} / (n_{1} + n_{2}) (resp. n_{2} / (n_{1} + n_{2})) is then replaced by the correlation ρ(resp. 1  ρ). For example, for testing problems as the comparison of rates, or the analysis of covariances, the standardized means can be replaced by standardized efficient score statistics that are asymptotically normally distributed. The correlation between these test statistics is determined by the information fractions (see[14]).
Results
First we investigate the actual FDR of the proposed testing procedures for the FNS and FDRS selection rules. Additionally, to quantify the advantage in power of the integrated approach compared to the pilot approach, we report the mean number of rejected alternatives under different scenarios. We consider the onesample ztest for m twosided null hypotheses H_{0i}: μ_{ i } = 0 versus H_{1i}: μ_{ i } ≠ 0, i = 1, …, m, for the mean of normally distributed observations with nominal significance level α = 0.05. The simulations are performed for a wide range of scenarios. For a detailed description see Additional file1.
In the following we assume independence of test statistics across hypotheses. However, because this assumption is often not satisfied in genetic data, we also report simulations assuming several correlation structures.
All computations were performed using the statistical language R[15]. Rcode to reanalyse the real data application is available for download from the authors’ web pagehttp://statistics.msi.meduniwien.ac.at/index.php?page=pageszfnr.
Simulation results for the FNS procedure
Control of the error rate
Integrated approach: In all simulated scenarios the FDR is well controlled if m_{2} > 5 (see Additional file1). Only if a smaller m_{2} is chosen, the FDR may be inflated up to 0.11.
A heuristic explanation for this inflation is that for very small m_{2} the pvalue threshold corresponding to the FNS design,\left(\right.separators="">\n \n \n \n p\n \n \n (\n \n \n m\n \n \n 2\n \n \n )\n \n \n \n, is close to zero. For such low pvalue thresholds even small changes may lead to large changes in the sequential pvalue (2) and the approximation based on a fixed threshold is poor. Figure1C illustrates the decrease of the FDR with increasing m_{2} for two particular scenarios.
Pilot approach: For the pilot approach the FDR is controlled in all scenarios.
Mean number of rejected hypotheses
Table1 shows the mean number of rejected alternatives for selected scenarios for the integrated approach controlling the FDR and the improvement in percent compared to the pilot approach. While the simulation study investigating FDR control covers π_{0} values from 0.5 to 1 for the investigation of the power we consider settings where alternative hypotheses are sparse. These are settings where the advantage of twostage designs that select promising hypotheses at interim analysis is expected to be largest.
In all scenarios the integrated approach rejects more or the same number of alternative hypotheses than the pilot approach. The increase in rejections is up to 59%;. Figure1D shows the impact of m_{2} on the mean number of rejected alternatives for the integrated (black lines) and the pilot approach (grey lines): For Δ = 1 (solid lines) and very small m_{2}, the number of rejected alternatives is very small but it clearly increases with m_{2}. Here the difference to the pilot design is more distinct. For Δ = 1.6 the advantage of the integrated approach is only moderate.
Simulation results for the FDRS procedure
Control of the error rate
Integrated approach: For the original procedure (without the modified critical value) and π_{0} < 0.8 the FDR is controlled for all considered values of m, α_{1} and Δ (data not shown). For larger values of π_{0} the FDR may be inflated, especially if the effect size under the alternative is low such that the expected number of selected hypotheses for the second stage is very small. The inflation is, however, moderate and the maximal FDR over all simulation scenarios is 0.073 instead of the nominal 0.05.
The simulations for the modified procedure show that across all scenarios the FDR is controlled for m_{ s } = 6 (see Figure1A for an example and Additional file1 for all scenarios). Therefore, in the following we only report results of the modified procedure with m_{ s } = 6. Note that for some of the parameter values the modified procedure is strictly conservative.
For the pilot approach FDR control follows by theoretical arguments in[9] and is confirmed by the simulation study (within the simulation error, see Figure1A).
Mean number of rejected hypotheses
Table2 and Figure1B show that the number of rejected alternatives increases with α_{1} as expected. For small values of α_{1}, the pilot and the integrated approach have similar power values. In some settings for lower m the pilot design even slightly outperforms the integrated design. This is due to the fact that the modified FDRS procedure may be strictly conservative, especially if the number of selected hypotheses is low.
If the firststage sample size is increased, the advantage of the integrated approach increases: E.g., for n_{1} = n_{2} = 9 and π_{0} = 0.95, Δ = 1, α_{1} = 0.5, the mean number of rejected hypotheses is 22%; larger for the integrated approach than for the pilot approach.
Correlated test statistics
Test statistics from genetic data are often stochastically dependent across hypotheses. In this section we study the impact of correlation between test statistics on the FDR and consider autocorrelation, blockcorrelation[12] and equicorrelation[16]. Autocorrelation may occur for example in microarray data because of spatial artefacts on the array or in gene association studies due to correlation between neighbouring markers. A block correlation structure, also called clumpy correlation, may be induced in microarray data for example by pathways of genes that are commonly regulated[17]. Finally, equicorrelation can be due to ‘array effects’ in microarray analyses.
For autocorrelation we consider an order among hypotheses and assume an autoregressive correlation structure. Here the correlation between the test statistics for hypotheses i and j is given by ρ^{ij}. For blockcorrelation we assume that the test statistics are correlated in blocks of 20 hypotheses where the correlation between the test statistics within one block is ρ[12]. Hypotheses of different blocks are assumed to be independent. For equicorrelation we assume that for all pairs of hypotheses a pairwise correlation of ρ holds. For all correlation structures the alternatives are randomly distributed among the sequence of hypotheses. The simulations with correlated data were performed for the scenarios m ∈ {1000, 10000, 100000}, m_{2} ∈ {0.01m, 0.05m, 0.1m}, Δ ∈ {1, 1.6}), and π_{0} ∈ {0.95, 0.99,1} with correlation coefficient ρ = 0.5.
For blockcorrelation and autocorrelation the results are very similar to the independent case concerning the actual FDR. The mean number of rejected alternatives for the pilot and the integrated design are nearly identical (data not shown). For equicorrelated data the error rates of both selection procedures are maintained in all scenarios, even under the global null hypothesis. However, for most scenarios the procedure appears to be more conservative compared to the independent case. For scenarios with small Δ the mean number of rejected alternatives and the superiority of the integrated designs increases for the FDRS design (see Table3). For the FNS design the differences between the integrated and the pilot design decrease (see Table4). Note, however, that equicorrelation can be reduced or removed by adequate normalization[17].
Real data application
We reanalysed the microarray data set by Tian[18]. In this experiment gene expression data were compared between 137 patients where bone lytic lesions could be detected by magnetic resonance imaging and 36 controls where such lesions could not be detected for 12625 probe sets. We used the preprocessed data set by Jeffery[19].
To obtain balanced group sizes for the reanalysis we arbitrarily selected 36 patients from the bone lytic lesions group. The samples were arbitrarily allocated to the two stages and the pilot and the integrated approach were applied for the FNS and the FDRS procedure and different parameters: n_{1} = {6, 12} (n_{2} = 36  n_{1}), m_{2} = {10, 50, 100, 200}, α_{1} = {0.1, 0.2, 0.5, 0.8}, m_{ s } = 10. In the first stage for all procedures a twosided ttest was computed. For the integrated procedures we computed sequential pvalues based on (2) using a normal approximation where the critical values from the model with known variances are applied to the pvalues of the ttest.
Table5 shows that in most scenarios the integrated procedure rejects more hypotheses than the corresponding pilot procedure for both selection rules. This difference is larger for larger first stage sample sizes and for increasing parameters m_{2} or α_{1}, respectively. Only for small α_{1} and n_{1} = 6 the integrated and the pilot approach of the FDRS procedure reject approximately the same number of hypotheses. Note that no hypothesis was significant at the final test decision which was not considered in the second stage. Setting m_{ s } = 0 the results for the integrated FDRS procedure did not change.
Discussion and conclusion
In this paper we discussed several selection rules for twostage designs, where after an interim analysis only promising hypotheses are considered in the second stage.
For the choice of the selection rule, different criteria may apply. With the FNS design, the total number of observations is known in advance, which facilitates the planning of resources. However, this design does not adapt to the number of hypotheses that show an effect in the interim analysis. The latter can be achieved with the FDRS design, where, on the other hand, the total number of observations is random and the planning of resources becomes more difficult. As an extension one can consider an FDRS design where the overall number of observations (across all hypotheses and both stages) is fixed and the observations allocated to the second stage are equally distributed among the selected hypotheses. This comes at the cost of a decreasing per hypothesis power if for a larger number of hypotheses the alternative holds.
For the FNS design the testing procedures provided a sound control in the considered scenarios where more than 5 hypotheses are selected for the second stage for independent as well as for correlated data. Also for the modified FDRS procedure FDR control is given in all scenarios for m_{ s } > 5. Comparing the integrated approaches for both selection rules with the corresponding pilot approaches showed an advantage of the integrated approach in many scenarios. This holds particularly for the FNS design but in many scenarios also for the FDRS design. The advantage of the integrated design increases with the proportion of observations allocated to the first stage. This is in line with earlier findings[7, 8], where scenarios with small firststage sample sizes were considered and only small differences between the integrated and the pilot design have been observed. In particular, if the effect sizes in microarray studies are low (as, e.g., shown in examples in[20]) and the number of observations in the first stage is sufficiently large compared to the number of observations for the second stage, the integrated design is superior.
On the other hand, using only the secondstage data for testing has the advantage of increased flexibility and simplicity. For example, the pilot FNS procedure controls the FDR even if the hypotheses for the second stage are selected in an arbitrary way. Furthermore, standard (nonsequential) tests can be applied and FDR control can be shown analytically under suitable assumptions.
In the simulations the BHprocedure was applied to the sequential pvalues to control the FDR. As described above, this method is conservative if π_{0} < 1 as it controls the FDR actually at level π_{0}α. Following the suggestion of one of the referees, we additionally considered so called adaptive FDR controlling procedures that are based on an estimate of π_{0} (see Additional file2). Under independence these adaptive tests are less conservative then the BHtests, but did not exceed the nominal level in the considered simulation scenarios. However, as shown earlier (e.g.,[21]) under strong correlation adaptive procedures may inflate the FDR.
It is well known that twostage designs may lead to a considerable improvement in efficiency compared to singlestage designs[1–8] and this applies also to the procedures investigated in this paper (see Additional file3 for a simulation study comparing the twostage tests to corresponding singlestage designs). Furthermore, the methods can be extended to designs where an explicit early rejection boundary is applied in the interim analysis as in many groupsequential applications. In this case the calculation of the sequential pvalues is slightly modified (the integral boundaries depend on the early rejection boundaries). However, unless the fraction of hypotheses for which the alternative holds is large, it is expected that the addition of an early rejection boundary at the interim analysis has only a marginal impact on the efficiency of the procedure. Furthermore, for hypotheses that are rejected in the interim analysis based on few observations, a confirmation with a larger sample size might be important anyway.
Appendix
Asymptotic considerations
In this section we argue that asymptotically, for increasing number of hypotheses, the FNS and the FDRS selection rule are equivalent to a selection rule where hypotheses are selected based on a fixed threshold γ_{1}. Let\left(\right.separators="">\n \n R\n (\n \gamma \n )\n =\n \u266f\n {\n \n \n p\n \n \n i\n \n \n (\n 1\n )\n \n \n \u2264\n \gamma \n }\n \n denote the number of hypotheses with a firststage pvalue not exceeding γ, and V(γ) (S(γ)) the respective number of pvalues corresponding to true null (alternative) hypotheses, respectively. Let m_{0} and m_{1} denote the number of true null and alternative hypotheses, respectively. Consider the following assumptions:

1.
The empirical distribution functions of the firststage pvalues corresponding to the null and the alternative hypotheses converge almost surely, i.e., {lim}_{m\to \infty}V\left(\gamma \right)/{m}_{0}=\gamma and {lim}_{m\to \infty}S\left(\gamma \right)/{m}_{1}={F}_{1}\left(\gamma \right) exist for all γ ∈ (0, 1], where F _{1} is a continuous strictly increasing function.

2.
{lim}_{m\to \infty}{m}_{0}/m={\pi}_{0}^{\infty}
exists.
For the FNS procedure assume that{lim}_{m\to \infty}{m}_{2}/m=\delta exists and let G(p) = π_{0}p + (1  π_{0})F_{1}(p) denote the limiting distribution function of the firststage pvalues. By the continuity and strict monotonicity of G there is a unique{\gamma}_{1}^{\infty} such thatG\left({\gamma}_{1}^{\infty}\right)=\delta. Now,{lim}_{m\to \infty}{p}_{\left({m}_{2}\right)}={\gamma}_{1}^{\infty} almost surely. If we additionally assume that for all finite m the distribution of a (randomly chosen) pvalue is given by G(p) with density g, then γ_{1} is the m_{2}/m quantile of G. Thus, its variance is given by (m_{2}/m)(1  m_{2}/m)/(mg(m_{2}/m)^{2})[22].
For the FDRS procedure γ_{1} = m_{2}/mα corresponds to the critical value that results from the BHprocedure. If (1) and (2) hold and π_{0} < 1, it follows as in[12] (Theorem 5) that this critical value converges almost surely.
Computation of the twosided sequential pvalue
If the hypothesis H_{ i }, i = 1, …, m is selected for the second stage (i.e., if the firststage pvalue is smaller than γ_{1}), the twosided sequential pvalue given by (2) is calculated by numerical integration:
witha=\sqrt{\frac{{n}_{1}}{{n}_{1}+{n}_{2}}}\phantom{\rule{.5em}{0ex}}z andb=\sqrt{\frac{{n}_{2}}{{n}_{1}+{n}_{2}}}. Here φ(.), Φ(.), and\left(\right.separators="">\n \n \n \n c\n \n \n 1\n \n \n \n \gamma \n \n \n 1\n \n \n /\n 2\n \n \n \n denote the density, the cumulative distribution function, and the (1  γ_{1}/2)quantile of the standard normal distribution, respectively. If the firststage pvalue is larger than γ_{1},\left(\right.separators="">\n \n \n \n p\n \n \n i\n \n \n =\n \n \n p\n \n \n i\n \n \n (\n 1\n )\n \n \n \n.
Author’s contributions
Both authors contributed equally to the development of the methods, the design of the simulations, and to writing the paper. SZ conducted the simulations and data analyses. All authors read and approved the final manuscript.
Authors’ information
The views expressed are those of the author (MP) and should not be understood or quoted as being made on behalf of the European Medicines Agency or its scientific Committees.
References
Gail MH, Pfeiffer RM, Wheeler W, Pee D: Probability that a twostage genomewide association study will detect a diseaseassociated snp and implications for multistage designs. Ann Hum Genet 2008, 72: 812–820. 10.1111/j.14691809.2008.00467.x
Goll A, Bauer P: Twostage designs applying methods differing in costs. Bioinformatics 2007, 23: 1519–1526. 10.1093/bioinformatics/btm140
Moerkerke B, Goetghebeur E: Optimal screening for promising genes in 2stage designs. Biostatistics 2008, 9: 700–714. 10.1093/biostatistics/kxn002
Satagopan JM, Verbel DA, Venkatraman ES, Offit KE, Begg CB: Twostage designs for genedisease association studies. Biometrics 2002, 58: 163–170. 10.1111/j.0006341X.2002.00163.x
Van den Oord EJ, Sullivan PF: A framework for controlling false discovery rates and minimizing the amount of genotyping in the search for disease mutations. Human Heredity 2003, 56: 188–199. 10.1159/000076393
Victor A, Hommel G: Combining adaptive designs with control of the false discovery rate  a generalized definition for a global pvalue. Biometrical J 2007, 49: 94–106. 10.1002/bimj.200510311
Zehetmayer S, Bauer P, Posch M: Twostage designs for experiments with a large number of hypotheses. Bioinformatics 2005, 21: 3771–3777. 10.1093/bioinformatics/bti604
Zehetmayer S, Bauer P, Posch M: Optimized multistage designs controlling the false discovery or the family wise error rate. Stat Med 2008, 27: 4145–4160. 10.1002/sim.3300
Benjamini Y, Yekutieli D: Quantitative trait loci using the false discovery rate. Genetics 2005, 171: 783–90. 10.1534/genetics.104.036699
Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B 1995, 57: 289–300.
Benjamini Y, Yekutieli DL: The control of the false discovery rate in multiple testing under dependency. Ann Stat 2001, 29: 1165–1188. 10.1214/aos/1013699998
Storey JD, Taylor JE, Siegmund D: Strong Control, Conservative Point Estimation and Simultaneous Conservative Consistency of False Discovery Rates: a Unified Approach. J R Statist Soc B 2004, 66: 187–205. 10.1111/j.14679868.2004.00439.x
Tsiatis AA, Rosner GL, Mehta CR: Exact Confidence Intervals Following a Group Sequential Test. Biometrics 1984, 40: 797–803. 10.2307/2530924
Jennison C, Turnbull BW: Group Sequential Methods with Applications to Clinical Trials. Boca Raton: Chapman & Hall/CRC; 2000.
R Development Core Team: R: A language and environment for statistical computing.. Vienna, Austria: R Foundation for Statistical Computing; 2011. . [ISBN 3–900051–07–0] [http://www.Rproject.org/] []. [ISBN 3900051070]
Benjamini Y, Krieger AM, Yekutieli D: Adaptive linear stepup procedures that control the false discovery rate. Biometrika 2006, 93: 491–507. 10.1093/biomet/93.3.491
Qiu X, Qiu X, Brooks AI, Klebanov L, Yakovlev A: The effects of normalization of the correlation structure of microarray data. BMC Bioinf 2005, 6: 1–11. 10.1186/1471210561
Tian E, Zhan F, Walker R, Rasmussen E, Ma Y, Barlogie B, D SJ: The role of the Wntsignaling antagonist DKKI in the development of osteolytic lesions in multiple myeloma. New England J Med 2003, 349: 2483–2494. 10.1056/NEJMoa030847
Jeffery IB, Higgins DG, Culhane AC: Comparison and evaluation of methods for generating differentially expressed genes lists from microarray data. BMC Bioinf 2006, 7: 359–375. 10.1186/147121057359
Posch M, Zehetmayer S, Bauer P: Hunting for Significance with the False Discovery Rate. JASA 2009, 104: 836–840.
Zehetmayer S, Posch M: Post hoc power estimation in largescale multiple testing problems. Bioinf 2010, 26(8):1050–1056. 10.1093/bioinformatics/btq085
Serfling RJ: Approximation Theorems of Mathematical Statistics. New York: Wiley Series in Probability and Statistics, John Wiley & Sons Inc; 1980.
Acknowledgements
We would like to thank the two Reviewers for helpful suggestions and Julia Saperia for many helpful comments.
This work was supported by the Austrian Science Fund FWF (grant numbers T 401B12 and P23167).
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
Both authors have no competing interests.
Electronic supplementary material
12859_2011_5461_MOESM1_ESM.pdf
Additional file 1:We report the simulation scenarios and results of the simulation study assessing the FDR of the FNS and FDRS design (modified procedure with m_{ s }=6) for the case of independent test statistics as described in the results section of the manuscript. For each scenario at least 1000 simulation runs were performed. For scenarios with lower m the simulation runs were increased to 50000 (m = {100;500}), 20000 (m = 1000), and 10000 (m = 5000), because in these scenarios there is a higher variability of the false discovery proportion such that the estimator of the FDR converges slower. This also holds if m is large but π_{0} ≈ 1 or Δ is small. Therefore, for these scenarios the number of simulation runs was increased. The resulting FDR values were plotted as a function of α_{1} for the FDRS design (left column) or as a function of m_{2} for the FNS design (right column), respectively. (PDF 106 KB)
12859_2011_5461_MOESM2_ESM.pdf
Additional file 2:Results of a simulation study for twostage designs where an adaptive test procedures is applied based on an estimator for the proportion of true null hypotheses.(PDF 106 KB)
12859_2011_5461_MOESM3_ESM.pdf
Additional file 3:Two singlestage designs are compared to the results: For the first singlestage design the sample size for each hypothesis is n_{ 1 }, for the second design the sample size is n_{ 1 } + n_{ 2 }. For the first design we compare the gain in power of the integrated design and for the second design the attention lies on the reduction in costs. (PDF 106 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Zehetmayer, S., Posch, M. False discovery rate control in twostage designs. BMC Bioinformatics 13, 81 (2012). https://doi.org/10.1186/147121051381
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/147121051381