  • Methodology article
  • Open access

False discovery rate control in two-stage designs

Abstract

Background

For gene expression or gene association studies with a large number of hypotheses, the number of measurements per marker in a conventional single-stage design is often low due to limited resources. Two-stage designs have been proposed where promising hypotheses are identified in a first stage and further investigated in the second stage with larger sample sizes. For two types of two-stage designs proposed in the literature, we derive multiple testing procedures controlling the False Discovery Rate (FDR) and demonstrate FDR control by simulation: designs where a fixed number of top-ranked hypotheses are selected and designs where the selection in the interim analysis is based on an FDR threshold. In contrast to earlier approaches that use only the second-stage data in the hypothesis tests (pilot approach), the proposed testing procedures are based on the pooled data from both stages (integrated approach).

Results

For both selection rules the multiple testing procedures control the FDR in the considered simulation scenarios. This holds for the case of independent observations across hypotheses as well as for certain correlation structures. Additionally, we show that in scenarios with small effect sizes the testing procedures based on the pooled data from both stages can give a considerable improvement in power compared to tests based on the second-stage data only.

Conclusion

The proposed hypothesis tests provide a tool for FDR control for the considered two-stage designs. Comparing the integrated approaches for both selection rules with the corresponding pilot approaches showed an advantage of the integrated approach in many simulation scenarios.

Background

Modern experimental techniques in genetic research such as microarray experiments or gene association studies produce high dimensional data and often thousands of hypotheses are tested simultaneously to identify genetic markers. Due to limited resources, the number of measurements per marker in a conventional single-stage design is often low. Two-stage designs have been proposed where in a first stage promising markers are identified from the set of all markers considered initially. Thus, hypotheses corresponding to unpromising markers can be dropped in the interim analysis such that the second stage is performed with the reduced set of selected hypotheses. Given limited total resources or budgets, this allows the allocation of a larger number of observations to more promising hypotheses. It has been shown that such sequential procedures are typically considerably more powerful than single-stage designs[18].

An important problem when drawing inference from data produced by such designs is the construction of hypothesis tests that control the False Discovery Rate (FDR). While the construction of such test procedures is straightforward if only the second-stage data is used for testing, tests that make use of the data from both stages need to account for the specific selection rule used to select hypotheses for the second stage.

For two-stage procedures where in an interim analysis all hypotheses with an unadjusted first-stage p-value below a pre-fixed selection boundary γ1 are selected for the second stage, hypothesis tests based on the pooled data from both stages have been proposed that control FDR or the familywise error rate[7] (see[8] for a generalization to multistage designs). Selecting all hypotheses whose first-stage p-value lies below a fixed threshold has several implications. First, the number of hypotheses selected for the second stage is a random variable unknown a priori, which is an obstacle for researchers if resources are limited and the number of markers for which further measurements can be collected cannot be arbitrarily increased. Second, because rather large thresholds need to be applied in the interim analysis in order not to miss alternative hypotheses in spite of the small sample size, the fixed threshold rule will, with a high probability, select some hypotheses even under the global null hypothesis, even though resources could be saved in this scenario by stopping the experiment for futility at the interim analysis.

In this work we propose statistical tests to control the FDR in two-stage designs with selection rules that are not based on a fixed threshold for the first-stage p-values. First, we consider designs where a fixed number of hypotheses is selected for the second stage (FNS design)[4]. The selected hypotheses are the top-ranked hypotheses according to the first stage p-values. Second, we consider two-stage designs where selection is based on a fixed FDR threshold α1 in the interim analysis (FDRS design). All hypotheses that can be rejected with a test controlling the FDR at level α1 are selected for the second stage[9]. For the FNS design the number of continued hypotheses is deterministic and the procedure continues to the second stage, regardless of the actual effect sizes observed. For the FDRS design in contrast, the number of selected hypotheses is a random variable. Furthermore, under the global null hypothesis (where the FDR coincides with the familywise error rate) with probability 1 - α1 no hypothesis can be rejected with the interim test at FDR level α1 and the trial is stopped for futility.

A simple approach to construct hypothesis tests controlling the FDR for two-stage designs is to consider tests based on the second-stage data only. Standard multiple testing procedures applied to the second-stage data will control the FDR. For the FDRS design Benjamini and Yekutieli[9] showed that the nominal level applied at the second stage may even be increased, taking into account the interim selection threshold. However, these approaches do not make full use of the available data because the first-stage observations are used for selection only. We construct tests for the FNS and FDRS designs that are based on sufficient test statistics of the data from both stages (the “integrated approach”). These tests are designed in analogy to group sequential tests and appear to control the FDR well if the number of hypotheses tested is large enough.

The paper is structured as follows: in the next section the testing problem and the selection rules are introduced. Then the results of a simulation study which investigates the actual FDR and compares the mean number of rejected alternatives of the integrated approach to the pilot approach are presented. Finally, a real data example and a short discussion are given.

Methods

The test problem

We consider an experiment to test m two-sided null hypotheses H0i: μi = 0 versus H1i: μi ≠ 0, i = 1, …, m, for the means of independent, normally distributed observations with variances σi², i = 1, …, m. More general distributional scenarios are discussed at the end of this section.

To adjust for multiple testing we aim to control the FDR of the experiment. The FDR[10] is defined as the expectation of the proportion of erroneously rejected hypotheses V among all rejected hypotheses R, FDR = E(V / max{R, 1}). We apply the step-up Benjamini-Hochberg (BH) procedure to control the FDR at level α. Denote the ordered p-values of the m hypotheses by p(1) ≤ p(2) ≤ ⋯ ≤ p(m) and let d = argmax_i {p(i) ≤ iα/m} denote the index of the largest p-value p(i) smaller than or equal to iα/m. Then the BH-procedure rejects the d hypotheses with the smallest p-values, i.e., all H0i with p-value not exceeding dα/m. For single-stage tests it has been shown by Benjamini and Yekutieli[11] that the BH-procedure controls the FDR at level π0α if the test statistics corresponding to true null hypotheses are independent or positively regression dependent. Here, π0 denotes the (unknown) proportion of true null hypotheses among the m tested null hypotheses. Furthermore, the FDR is asymptotically controlled (for an increasing number of hypotheses) if the limiting fraction of true null hypotheses is less than one and the test statistics are weakly dependent such that their empirical distribution functions converge almost surely (and some additional technical conditions hold)[12]. In the following we assume distributional scenarios where the BH-procedure controls the FDR.
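The step-up rule above can be sketched in a few lines. The following is a minimal Python sketch (not the authors' R implementation); the function name and the example p-values are illustrative assumptions:

```python
import numpy as np

def bh_stepup(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: find d = argmax_i {p(i) <= i*alpha/m} and
    reject the d hypotheses with the smallest p-values. Returns a boolean mask."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                        # indices sorting p ascending
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        d = np.max(np.nonzero(below)[0]) + 1     # number of rejections
        reject[order[:d]] = True                 # reject the d smallest p-values
    return reject
```

Note that the step-up rule rejects all p-values up to the largest index satisfying the bound, even if some intermediate ordered p-value exceeds its own threshold.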

The two-stage procedure

In the first stage n1 observations are collected for each hypothesis. Then an interim analysis is performed and for each hypothesis a two-sided first-stage p-value p_i^(1) = 2(1 − Φ(|z_i^(1)|)), i = 1, …, m, is calculated, where z_i^(1) denotes the standardized first-stage mean for hypothesis i and Φ the cumulative distribution function of the standard normal distribution. The first-stage p-values are ranked according to their magnitude and the m2 hypotheses with the smallest p-values are selected for the second stage. The number of selected hypotheses m2 can either be a pre-fixed number or may depend on the first-stage results; below we consider several choices for m2. In the second stage n2 observations are collected for each selected hypothesis. n2 is assumed to be fixed and does not depend on the number of selected hypotheses. We consider two different approaches to arrive at the final test decision: the “integrated approach”, where the test decision is based on the combined data of both stages, and the “pilot approach”, where the test decision is based on the second-stage data only.
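As a concrete illustration of the interim analysis, the following Python sketch computes first-stage z-statistics and two-sided p-values and ranks them; the sample sizes, effect sizes and m2 = 100 are illustrative assumptions, not values from the paper:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
m, n1, m2 = 1000, 6, 100              # illustrative sizes (assumptions)
mu = np.zeros(m)
mu[:10] = 1.0                         # a few alternatives with effect size 1
x1 = rng.normal(mu, 1.0, size=(n1, m))  # first-stage data, sigma_i = 1

z1 = np.sqrt(n1) * x1.mean(axis=0)       # standardized first-stage means z_i^(1)
p1 = 2 * (1 - norm.cdf(np.abs(z1)))      # p_i^(1) = 2(1 - Phi(|z_i^(1)|))
selected = np.argsort(p1)[:m2]           # the m2 hypotheses with smallest p-values
```

The `selected` index set is then carried into the second stage; under the FNS rule its size is fixed in advance.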

In the following we introduce several rules to determine the number of selected hypotheses m2.

Selection rules for two-stage designs

Selection according to a prefixed selection boundary γ1

Two-stage designs have been proposed[2, 7, 8] where a selection boundary γ1 is pre-specified and in the interim analysis all hypotheses with a first-stage p-value smaller than γ1 are selected for the second stage. Then

m2 = Σ_{i=1}^{m} 1{p_i^(1) ≤ γ1},

where 1{·} is the indicator function which equals 1 if the condition in braces is satisfied and 0 otherwise. Thus, m2 is a random variable.

Pre-fixed number of hypotheses selected for the second stage - FNS design

With this rule the value of m2 is fixed a priori and the m2 hypotheses with the smallest first-stage p-values are selected for the second stage. We denote this procedure FNS design (Fixed Number Selection).

Selection based on an FDR threshold - FDRS design

In this approach all hypotheses which are significant according to the BH-procedure at a prefixed level α1  > α in the interim analysis are selected for the second stage. Thus, the number of selected hypotheses m2 is a random variable which depends on the first-stage results. If no hypothesis can be rejected at level α1 in the interim analysis and thus be carried over to the second stage, m2 is set to zero. In this case the whole experiment is stopped. Note that under the global null hypothesis, i.e. in the setting where all null hypotheses are true and thus π0 = 1, this occurs with a probability of 1 - α1.

FDR control

In the subsections below we review the FDR controlling test procedure for two-stage designs where hypotheses are selected based on a prefixed selection boundary applied to the first-stage p-values. Both pilot and integrated designs are considered. We then propose generalizations of these test procedures to FNS and FDRS designs, showing that for each of the designs a corresponding data dependent selection boundary for the first-stage p-values can be defined that converges under suitable assumptions almost surely to a fixed value. The results are derived for independent data. However, in genetic data dependence of test statistics is frequently observed and even weak dependence may be a strong assumption. In the simulation study we thus investigate the performances of the procedures for several correlation structures.

Selection according to a prefixed selection boundary γ1

Pilot approach

In the pilot approach the final test statistics are based on data from the second stage only; the first-stage data are used solely for selecting promising hypotheses. To control the FDR of the experiment, at the end of the trial a two-sided p-value p_i^(2) = 2(1 − Φ(|z_i^(2)|)), i = 1, …, m2, is calculated for each selected hypothesis, where z_i^(2) denotes the standardized mean of the second-stage observations for hypothesis i. Then the BH-procedure is applied to these p-values, which are based on the second-stage data only. Because the first-stage observations are used only for selection and do not enter the final test statistics, the BH-procedure controlling the FDR at the second stage controls the FDR overall.

Integrated approach

If the data from both stages are to be used in the final test decision, one can account for the selection in the interim analysis by calculating sequential p-values p_i, i = 1, …, m, based on the monotonic ordering of the sample space[13]. If p_i^(1) > γ1 then the two-sided sequential p-value is defined as

p_i = p_i^(1) = P_{H0}( |Z_i^(1)| ≥ |z_i^(1)| )
(1)

and if p_i^(1) ≤ γ1 it is given by

p_i = P_{H0}( {|Z_i| ≥ |z_i|} ∩ {|Z_i^(1)| ≥ c_{1−γ1/2}} ),
(2)

where Z_i denotes the standardized overall mean of the observations from both stages and Z_i^(1) the standardized mean of the observations in the first stage. Furthermore, z_i and z_i^(1) denote realizations of the random variables Z_i and Z_i^(1), and c_{1−γ1/2} is the (1 − γ1/2)-quantile of the standard normal distribution. If the stopping criterion is satisfied, the sequential p-value is just the classical fixed-sample p-value calculated from the first-stage observations. Otherwise, the calculation of the sequential p-value involves the numerical solution of an integral (see the Appendix). Finally, the BH-procedure is applied to the sequential p-values p1, …, pm.
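Since Z_i and Z_i^(1) are jointly normal with correlation √(n1/(n1+n2)) under H0, the probability in (2) can be evaluated as a bivariate normal probability rather than by explicit numerical integration. The following Python sketch (an assumption about implementation, not the authors' code) does this with scipy:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def sequential_pvalue(z, n1, n2, gamma1):
    """Two-sided sequential p-value (2) for a hypothesis selected at interim:
    P_H0(|Z| >= |z| and |Z1| >= c_{1-gamma1/2}), where the pooled statistic Z
    and the first-stage statistic Z1 have correlation sqrt(n1/(n1+n2))."""
    rho = np.sqrt(n1 / (n1 + n2))
    c = norm.ppf(1 - gamma1 / 2)
    a = abs(z)
    bvn = multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]])
    # By the point symmetry of the centred bivariate normal, the four
    # quadrant probabilities of {|Z|>=a, |Z1|>=c} collapse to two terms:
    p_low_low = bvn.cdf([-a, -c])                   # P(Z <= -a, Z1 <= -c)
    p_low_high = norm.cdf(-a) - bvn.cdf([-a, c])    # P(Z <= -a, Z1 >=  c)
    return 2 * (p_low_low + p_low_high)
```

A handy sanity check: for z = 0 the event {|Z| ≥ 0} is certain, so the value reduces to P(|Z1| ≥ c) = γ1.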

In a two-stage procedure with fixed per hypothesis sample sizes n1, n2 and a fixed selection boundary γ1 the sequential p-values are uniformly distributed under the null hypothesis[13].

If for the subset of true null hypotheses the observations are independent across hypotheses, such that the sequential p-values are independent as well, it follows that the FDR is controlled in such a two-stage design. As described above, FDR control also holds if the sequential p-values corresponding to true null hypotheses are positively regression dependent[11].

Next we extend the test procedure to the FNS and FDRS design.

FNS design

Pilot approach

For the FNS selection rule the FDR control with the pilot approach is straightforward: the BH-procedure can be applied to the second-stage p-values of the m2 selected hypotheses. Because the first-stage data do not enter the final test statistics, FDR control is guaranteed under the assumption of positive regression dependency.

Integrated approach

To utilize the data from both stages for the final test decision, we propose to compute sequential p-values in analogy to (2), replacing the fixed selection boundary γ1 (which is not defined for the FNS design) by the largest first-stage p-value of all selected hypotheses, i.e., setting γ1 = p_(m2)^(1), the m2-th smallest first-stage p-value. Thus, the threshold γ1 is now data dependent. Because this is not accounted for in the calculation of the sequential p-values, they may no longer be uniformly distributed under the null hypothesis and it is not obvious whether the FDR is still controlled. However, if the observations are sufficiently independent across hypotheses, such that the empirical distribution functions of the first-stage test statistics converge almost surely as the number of tested hypotheses increases (and some additional technical conditions hold, see the Appendix), γ1 converges almost surely to a fixed value. Thus, asymptotically γ1 is deterministic and for large m and m2 the procedure becomes similar to the method with a prefixed selection boundary.
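The integrated FNS analysis can be sketched end to end as follows. This is a compact Python illustration under simplifying assumptions (known unit variances, illustrative sample sizes and effect sizes chosen here, not values from the paper); the sequential p-value of (2) is evaluated as a bivariate normal probability:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(7)
m, m2, n1, n2, alpha = 200, 40, 6, 12, 0.05    # illustrative sizes (assumptions)
mu = np.zeros(m)
mu[:10] = 1.6                                  # 10 alternatives, effect size 1.6
x1 = rng.normal(mu, 1.0, (n1, m))              # first-stage observations
x2 = rng.normal(mu, 1.0, (n2, m))              # second-stage observations

# interim analysis: first-stage z-statistics, p-values and FNS threshold
z1 = np.sqrt(n1) * x1.mean(axis=0)
p1 = 2 * norm.sf(np.abs(z1))
gamma1 = np.sort(p1)[m2 - 1]                   # data-dependent threshold p(m2)^(1)

# pooled standardized means and sequential p-values as in (1)/(2)
z2 = np.sqrt(n2) * x2.mean(axis=0)
z = (np.sqrt(n1) * z1 + np.sqrt(n2) * z2) / np.sqrt(n1 + n2)
rho = np.sqrt(n1 / (n1 + n2))                  # corr(Z, Z1) under H0
c = norm.ppf(1 - gamma1 / 2)
bvn = multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]])

def seq_p(zi, p1i):
    if p1i > gamma1:                           # not selected: first-stage p-value, eq. (1)
        return p1i
    a = abs(zi)                                # selected: bivariate normal probability, eq. (2)
    p = 2 * (bvn.cdf([-a, -c]) + norm.cdf(-a) - bvn.cdf([-a, c]))
    return min(max(p, 0.0), 1.0)               # guard against numerical error in the integral

pseq = np.array([seq_p(zi, pi) for zi, pi in zip(z, p1)])

# final BH step-up test on the sequential p-values
ok = np.sort(pseq) <= alpha * np.arange(1, m + 1) / m
n_rej = ok.nonzero()[0].max() + 1 if ok.any() else 0
```

Note that `seq_p` is evaluated for every hypothesis, so non-selected hypotheses can in principle be rejected, as discussed above.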

Note that with the integrated two-stage testing procedure, hypotheses that have not been selected in the interim analysis can in principle be rejected in the final test, because a sequential p-value is calculated for every hypothesis (even for non-selected ones). In particular, if m2 is small compared to the number of hypotheses for which the alternative holds and the effect sizes are large, even the sequential p-values corresponding to non-selected alternative hypotheses may be small enough to lead to a rejection. Such rejections can occur in an overpowered FNS design where only few hypotheses are selected; they indicate that the first-stage sample size has been chosen too large and no second-stage sample would have been needed to reach sufficient power. While the efficiency of such a design can be improved by choosing appropriate first-stage sample sizes and selection rules, the control of the FDR is not affected.

FDRS design

Pilot approach

As for the FNS selection rule, if the BH-procedure is applied at nominal level α to the second-stage p-values of the m2 selected hypotheses (computed from the observations of the second stage only), FDR control is guaranteed. However, as Benjamini and Yekutieli[9] showed, if in a first stage hypotheses are selected that can be rejected with the BH-procedure at nominal level α1, and in a second stage the selected hypotheses are tested at nominal level α2, the FDR of the second-stage test is actually controlled at level α1α2π0, provided the test statistics at each stage are positively regression dependent. Thus, if in the second stage the nominal level α/α1 is applied, the FDR is still controlled at level π0α. In the following we consider the latter, improved procedure.
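The improved pilot FDRS procedure is just two BH applications at different levels. A minimal Python sketch (function names and example p-values are illustrative assumptions):

```python
import numpy as np

def bh_rejected(p, level):
    """Indices rejected by the BH step-up procedure at the given level."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    ok = p[order] <= level * np.arange(1, m + 1) / m
    return order[: ok.nonzero()[0].max() + 1] if ok.any() else np.array([], dtype=int)

def pilot_fdrs(p1, p2, alpha=0.05, alpha1=0.5):
    """Pilot FDRS design: stage-1 selection by BH at level alpha1, final test
    by BH at the improved level alpha/alpha1 applied to the second-stage
    p-values of the selected hypotheses only."""
    selected = bh_rejected(p1, alpha1)
    if selected.size == 0:
        return selected                     # m2 = 0: stop for futility
    final = bh_rejected(p2[selected], alpha / alpha1)
    return selected[final]
```

Running the final stage at level α/α1 instead of α is what exploits the Benjamini-Yekutieli result; with α = 0.05 and α1 = 0.5 the second-stage BH level doubles to 0.1.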

Integrated approach

Similar to the FNS rule we propose to compute sequential p-values in analogy to (2) to utilize the data from both stages for the final test decision.

Again, the resulting threshold γ1 is data dependent: we set γ1 = m2α1/m, where m2 is a random variable. Then γ1 is approximately equal to the largest first-stage p-value of all selected hypotheses. Thus the sequential p-values may no longer be uniformly distributed under the null hypothesis and FDR control is in question. However, the following argument gives a heuristic for FDR control when the number of hypotheses is large. If the alternative holds for a positive proportion of hypotheses and the empirical distribution functions of the first-stage test statistics converge almost surely as the number of tested hypotheses increases (and some additional technical conditions hold, see the Appendix), it can be shown that γ1 converges almost surely. Hence, in these settings γ1 is asymptotically deterministic.

Under the global null hypothesis γ1 does not converge, and simulations (see the Results section) show that the FDR is actually inflated. Therefore, we suggest the following modification of the test procedure. Let ms > 0 denote a positive constant. In cases where fewer than ms hypotheses are selected by the FDRS selection rule, the threshold γ1 used in (2) is set to the ms-smallest first-stage p-value, thus γ1 = max(p_(m2)^(1), p_(ms)^(1)). Note that this modification enlarges the first-stage boundary γ1 used in the calculation of the sequential p-value.
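The modified threshold can be computed directly from the sorted first-stage p-values, since for a sorted vector max(p_(m2), p_(ms)) is simply the entry at position max(m2, ms). A small Python sketch (the function name is an assumption):

```python
import numpy as np

def fdrs_gamma1(p1, alpha1, ms=6):
    """First-stage threshold for the integrated FDRS design with the proposed
    modification: gamma1 = max(p(m2), p(ms)), where m2 is the number of
    hypotheses selected by the BH procedure at level alpha1 at interim."""
    p = np.sort(np.asarray(p1, dtype=float))
    m = p.size
    ok = p <= alpha1 * np.arange(1, m + 1) / m
    m2 = ok.nonzero()[0].max() + 1 if ok.any() else 0   # interim BH rejections
    # p is sorted ascending, so this single lookup equals max(p(m2), p(ms))
    return p[max(m2, ms) - 1]
```

When the interim BH test selects nothing (m2 = 0), the modification still yields the positive threshold p_(ms)^(1), which is what tames the FDR inflation under the global null.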

Generalizations to other testing problems

The procedure can be directly generalized to two-group comparisons, replacing the standardized means by the standardized mean between-group differences. More generally, the sequential p-value can be computed as in (2) if (under the null hypothesis) the cumulative test statistics follow a multivariate normal distribution with mean zero and variance one. In the actual computation based on (3) the term n1 / (n1 + n2) (resp. n2 / (n1 + n2)) is then replaced by the correlation ρ (resp. 1 - ρ). For example, for testing problems such as the comparison of rates or the analysis of covariances, the standardized means can be replaced by standardized efficient score statistics that are asymptotically normally distributed. The correlation between these test statistics is determined by the information fractions (see[14]).

Results

First we investigate the actual FDR of the proposed testing procedures for the FNS and FDRS selection rules. Additionally, to quantify the advantage in power of the integrated approach compared to the pilot approach, we report the mean number of rejected alternatives under different scenarios. We consider the one-sample z-test for m two-sided null hypotheses H0i: μi = 0 versus H1i: μi ≠ 0, i = 1, …, m, for the mean of normally distributed observations with nominal significance level α = 0.05. The simulations are performed for a wide range of scenarios; for a detailed description see Additional file 1.

In the following we assume independence of test statistics across hypotheses. However, because this assumption is often not satisfied in genetic data, we also report simulations assuming several correlation structures.

All computations were performed using the statistical language R[15]. R-code to reanalyse the real data application is available for download from the authors’ web page http://statistics.msi.meduniwien.ac.at/index.php?page=pageszfnr.

Simulation results for the FNS procedure

Control of the error rate

Integrated approach: In all simulated scenarios the FDR is well controlled if m2 > 5 (see Additional file 1). Only if a smaller m2 is chosen may the FDR be inflated, up to 0.11.

A heuristic explanation for this inflation is that for very small m2 the p-value threshold corresponding to the FNS design, p_(m2)^(1), is close to zero. For such low thresholds even small changes in the threshold may lead to large changes in the sequential p-value (2), and the approximation based on a fixed threshold is poor. Figure 1C illustrates the decrease of the FDR with increasing m2 for two particular scenarios.

Figure 1

Power values and error rates. (A) and (C) show the actual FDR for the FDRS and the FNS design, respectively; (B) and (D) the corresponding mean number of rejected alternatives for n1 = 6, n2 = 12, α = 0.05, m = 1000, π0 = 0.99, ms = 6. The effect sizes are Δ = 1 (solid line) or Δ = 1.6 (dotted line). The integrated approach is depicted in black, the pilot approach in grey (50000 simulation runs per scenario).

Pilot approach: For the pilot approach the FDR is controlled in all scenarios.

Mean number of rejected hypotheses

Table 1 shows the mean number of rejected alternatives for selected scenarios for the integrated approach controlling the FDR and the improvement in percent compared to the pilot approach. While the simulation study investigating FDR control covers π0 values from 0.5 to 1, for the investigation of power we consider settings where alternative hypotheses are sparse. These are the settings where the advantage of two-stage designs that select promising hypotheses at the interim analysis is expected to be largest.

Table 1 FNS design

In all scenarios the integrated approach rejects at least as many alternative hypotheses as the pilot approach. The increase in rejections is up to 59%. Figure 1D shows the impact of m2 on the mean number of rejected alternatives for the integrated (black lines) and the pilot approach (grey lines): for Δ = 1 (solid lines) and very small m2, the number of rejected alternatives is very small but clearly increases with m2. Here the difference to the pilot design is more pronounced. For Δ = 1.6 the advantage of the integrated approach is only moderate.

Simulation results for the FDRS procedure

Control of the error rate

Integrated approach: For the original procedure (without the modified critical value) and π0 < 0.8 the FDR is controlled for all considered values of m, α1 and Δ (data not shown). For larger values of π0 the FDR may be inflated, especially if the effect size under the alternative is low such that the expected number of selected hypotheses for the second stage is very small. The inflation is, however, moderate and the maximal FDR over all simulation scenarios is 0.073 instead of the nominal 0.05.

The simulations for the modified procedure show that across all scenarios the FDR is controlled for ms = 6 (see Figure 1A for an example and Additional file 1 for all scenarios). Therefore, in the following we only report results of the modified procedure with ms = 6. Note that for some of the parameter values the modified procedure is strictly conservative.

For the pilot approach, FDR control follows from theoretical arguments in[9] and is confirmed by the simulation study (within the simulation error, see Figure 1A).

Mean number of rejected hypotheses

Table 2 and Figure 1B show that the number of rejected alternatives increases with α1, as expected. For small values of α1 the pilot and the integrated approach have similar power. In some settings with lower m the pilot design even slightly outperforms the integrated design. This is because the modified FDRS procedure may be strictly conservative, especially if the number of selected hypotheses is low.

Table 2 FDRS design

If the first-stage sample size is increased, the advantage of the integrated approach increases: e.g., for n1 = n2 = 9 and π0 = 0.95, Δ = 1, α1 = 0.5, the mean number of rejected hypotheses is 22% larger for the integrated approach than for the pilot approach.

Correlated test statistics

Test statistics from genetic data are often stochastically dependent across hypotheses. In this section we study the impact of correlation between test statistics on the FDR and consider auto-correlation, block-correlation[12] and equi-correlation[16]. Auto-correlation may occur for example in microarray data because of spatial artefacts on the array or in gene association studies due to correlation between neighbouring markers. A block correlation structure, also called clumpy correlation, may be induced in microarray data for example by pathways of genes that are commonly regulated[17]. Finally, equi-correlation can be due to ‘array effects’ in microarray analyses.

For auto-correlation we consider an order among hypotheses and assume an autoregressive correlation structure. Here the correlation between the test statistics for hypotheses i and j is given by ρ^|i-j|. For block-correlation we assume that the test statistics are correlated in blocks of 20 hypotheses, where the correlation between the test statistics within one block is ρ[12]. Hypotheses of different blocks are assumed to be independent. For equi-correlation we assume that a pairwise correlation of ρ holds for all pairs of hypotheses. For all correlation structures the alternatives are randomly distributed among the sequence of hypotheses. The simulations with correlated data were performed for the scenarios m ∈ {1000, 10000, 100000}, m2 ∈ {0.01m, 0.05m, 0.1m}, Δ ∈ {1, 1.6}, and π0 ∈ {0.95, 0.99, 1} with correlation coefficient ρ = 0.5.
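Standard-normal test statistics with these three dependence structures can be generated cheaply via autoregressive recursion and shared-factor constructions. The following Python sketch is one possible simulation setup (an assumption, not the authors' code); each vector has N(0, 1) margins with the stated within-structure correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
m, rho = 1000, 0.5

# auto-correlation: AR(1) recursion gives corr(Z_i, Z_j) = rho**|i-j|
z_auto = np.empty(m)
z_auto[0] = rng.standard_normal()
for i in range(1, m):
    z_auto[i] = rho * z_auto[i - 1] + np.sqrt(1 - rho**2) * rng.standard_normal()

# block-correlation: one shared factor per block of 20, pairwise corr rho within
block_factor = rng.standard_normal(m // 20).repeat(20)
z_block = np.sqrt(rho) * block_factor + np.sqrt(1 - rho) * rng.standard_normal(m)

# equi-correlation: a single factor shared by all hypotheses ('array effect')
g = rng.standard_normal()
z_equi = np.sqrt(rho) * g + np.sqrt(1 - rho) * rng.standard_normal(m)
```

Mean shifts for the alternative hypotheses would then be added to these vectors at randomly chosen positions, matching the randomly distributed alternatives described above.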

For block-correlation and auto-correlation the results concerning the actual FDR are very similar to the independent case. The mean numbers of rejected alternatives for the pilot and the integrated design are nearly identical (data not shown). For equi-correlated data the error rates of both selection procedures are maintained in all scenarios, even under the global null hypothesis. However, for most scenarios the procedure appears to be more conservative compared to the independent case. For scenarios with small Δ the mean number of rejected alternatives and the superiority of the integrated design increase for the FDRS design (see Table 3). For the FNS design the differences between the integrated and the pilot design decrease (see Table 4). Note, however, that equi-correlation can be reduced or removed by adequate normalization[17].

Table 3 FDRS design for equi-correlated data
Table 4 FNS design for equi-correlated data

Real data application

We reanalysed the microarray data set of Tian[18]. In this experiment gene expression data for 12625 probe sets were compared between 137 patients in whom bone lytic lesions could be detected by magnetic resonance imaging and 36 controls in whom no such lesions could be detected. We used the pre-processed data set of Jeffery[19].

To obtain balanced group sizes for the re-analysis we arbitrarily selected 36 patients from the bone lytic lesions group. The samples were arbitrarily allocated to the two stages, and the pilot and the integrated approach were applied for the FNS and the FDRS procedure with different parameters: n1 ∈ {6, 12} (n2 = 36 - n1), m2 ∈ {10, 50, 100, 200}, α1 ∈ {0.1, 0.2, 0.5, 0.8}, ms = 10. In the first stage a two-sided t-test was computed for all procedures. For the integrated procedures we computed sequential p-values based on (2) using a normal approximation, where the critical values from the model with known variances are applied to the p-values of the t-test.

Table 5 shows that in most scenarios the integrated procedure rejects more hypotheses than the corresponding pilot procedure, for both selection rules. The difference is larger for larger first-stage sample sizes and for increasing parameters m2 or α1, respectively. Only for small α1 and n1 = 6 do the integrated and the pilot approach of the FDRS procedure reject approximately the same number of hypotheses. Note that no hypothesis that was not selected for the second stage was significant in the final test. Setting ms = 0 did not change the results for the integrated FDRS procedure.

Table 5 Real data application

Discussion and conclusion

In this paper we discussed several selection rules for two-stage designs, where after an interim analysis only promising hypotheses are considered in the second stage.

For the choice of the selection rule, different criteria may apply. With the FNS design, the total number of observations is known in advance, which facilitates the planning of resources. However, this design does not adapt to the number of hypotheses that show an effect in the interim analysis. The latter can be achieved with the FDRS design, where, on the other hand, the total number of observations is random and the planning of resources becomes more difficult. As an extension one can consider an FDRS design where the overall number of observations (across all hypotheses and both stages) is fixed and the observations allocated to the second stage are equally distributed among the selected hypotheses. This comes at the cost of a decreasing per hypothesis power if for a larger number of hypotheses the alternative holds.

For the FNS design the testing procedures provided sound FDR control in the considered scenarios where more than 5 hypotheses are selected for the second stage, for independent as well as for correlated data. Also for the modified FDRS procedure FDR control is given in all scenarios for ms > 5. Comparing the integrated approaches for both selection rules with the corresponding pilot approaches showed an advantage of the integrated approach in many scenarios. This holds particularly for the FNS design but in many scenarios also for the FDRS design. The advantage of the integrated design increases with the proportion of observations allocated to the first stage. This is in line with earlier findings[7, 8], where scenarios with small first-stage sample sizes were considered and only small differences between the integrated and the pilot design were observed. In particular, if the effect sizes in microarray studies are low (as, e.g., shown in examples in[20]) and the number of observations in the first stage is sufficiently large compared to the number of observations in the second stage, the integrated design is superior.

On the other hand, using only the second-stage data for testing has the advantage of increased flexibility and simplicity. For example, the pilot FNS procedure controls the FDR even if the hypotheses for the second stage are selected in an arbitrary way. Furthermore, standard (non-sequential) tests can be applied and FDR control can be shown analytically under suitable assumptions.

In the simulations the BH-procedure was applied to the sequential p-values to control the FDR. As described above, this method is conservative if π0 < 1, as it actually controls the FDR at level π0α. Following the suggestion of one of the referees, we additionally considered so-called adaptive FDR controlling procedures that are based on an estimate of π0 (see Additional file 2). Under independence these adaptive tests are less conservative than the BH-tests but did not exceed the nominal level in the considered simulation scenarios. However, as shown earlier (e.g.,[21]), under strong correlation adaptive procedures may inflate the FDR.
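For concreteness, the BH step-up rule as applied to a vector of (sequential) p-values can be sketched as follows; this is a minimal illustration in Python, and the function name and example values are ours, not taken from the simulation code:

```python
# Minimal sketch of the Benjamini-Hochberg step-up procedure applied to a
# list of p-values (e.g., the sequential p-values of the tested hypotheses).
def bh_reject(pvals, alpha):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices sorted by p-value
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank  # largest rank fulfilling the step-up criterion
    return {order[j] for j in range(k)}  # indices of the rejected hypotheses

rejected = bh_reject([0.001, 0.012, 0.03, 0.9], alpha=0.05)
```

Because the procedure controls the FDR at level π0α, it is conservative whenever π0 < 1, which is precisely what motivates the adaptive variants discussed above.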

It is well known that two-stage designs may lead to a considerable improvement in efficiency compared to single-stage designs[1-8], and this applies also to the procedures investigated in this paper (see Additional file 3 for a simulation study comparing the two-stage tests to corresponding single-stage designs). Furthermore, the methods can be extended to designs where an explicit early rejection boundary is applied in the interim analysis, as in many group-sequential applications. In this case the calculation of the sequential p-values is slightly modified (the integral boundaries depend on the early rejection boundaries). However, unless the fraction of hypotheses for which the alternative holds is large, the addition of an early rejection boundary at the interim analysis is expected to have only a marginal impact on the efficiency of the procedure. Furthermore, for hypotheses that would be rejected in the interim analysis based on few observations, a confirmation with a larger sample size might be important anyway.

Appendix

Asymptotic considerations

In this section we argue that, asymptotically for an increasing number of hypotheses, the FNS and the FDRS selection rules are equivalent to a selection rule where hypotheses are selected based on a fixed threshold γ1. Let R(γ) = #{i : p_i^(1) ≤ γ} denote the number of hypotheses with a first-stage p-value not exceeding γ, and let V(γ) and S(γ) denote the corresponding numbers of p-values belonging to true null and alternative hypotheses, respectively. Let m0 and m1 denote the numbers of true null and alternative hypotheses. Consider the following assumptions:

  1. The empirical distribution functions of the first-stage p-values corresponding to the null and the alternative hypotheses converge almost surely, i.e., lim_{m → ∞} V(γ)/m0 = γ and lim_{m → ∞} S(γ)/m1 = F1(γ) exist for all γ ∈ (0, 1], where F1 is a continuous, strictly increasing function.

  2. lim_{m → ∞} m0/m = π0 exists.

For the FNS procedure assume that lim_{m → ∞} m2/m = δ exists and let G(p) = π0 p + (1 - π0)F1(p) denote the limiting distribution function of the first-stage p-values. By the continuity and strict monotonicity of G there is a unique γ1 such that G(γ1) = δ. Then lim_{m → ∞} p_(m2) = γ1 almost surely. If we additionally assume that for all finite m the distribution of a (randomly chosen) p-value is given by G with density g, then γ1 is the m2/m quantile of G, and the asymptotic variance of the sample quantile p_(m2) is given by (m2/m)(1 - m2/m)/(m g(γ1)^2)[22].
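As an illustration of how the limiting threshold γ1 can be obtained from G(γ1) = δ, the following sketch solves the equation by bisection, exploiting that G is continuous and strictly increasing. The alternative p-value distribution F1 and the numerical values are arbitrary choices for illustration, not taken from the paper:

```python
# Illustrative sketch: solve G(gamma1) = delta by bisection, where
# G(p) = pi0 * p + (1 - pi0) * F1(p) is continuous and strictly increasing.
def solve_gamma1(pi0, F1, delta, tol=1e-10):
    G = lambda p: pi0 * p + (1 - pi0) * F1(p)
    lo, hi = 0.0, 1.0  # G(0) = 0 and G(1) = 1 bracket delta
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if G(mid) < delta:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical example: pi0 = 0.9, a concave alternative F1(p) = p**0.2,
# and a limiting second-stage fraction delta = 0.3.
gamma1 = solve_gamma1(0.9, lambda p: p ** 0.2, 0.3)
```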

For the FDRS procedure, γ1 corresponds to the critical p-value threshold resulting from the BH-procedure applied to the first-stage p-values. If (1) and (2) hold and π0 < 1, it follows as in[12] (Theorem 5) that this critical value converges almost surely.

Computation of the two-sided sequential p-value

If the hypothesis H_i, i = 1, …, m, is selected for the second stage (i.e., if the first-stage p-value is smaller than γ1), the two-sided sequential p-value given by (2) is calculated by numerical integration:

$$
p_i = \int_{c_{1-\gamma_1/2}}^{\infty}\left[1-\Phi\left(\frac{|z_i|-a}{b}\right)\right]\varphi(z)\,dz
+ \int_{c_{1-\gamma_1/2}}^{\infty}\Phi\left(\frac{-|z_i|-a}{b}\right)\varphi(z)\,dz
+ \int_{-\infty}^{-c_{1-\gamma_1/2}}\left[1-\Phi\left(\frac{|z_i|-a}{b}\right)\right]\varphi(z)\,dz
+ \int_{-\infty}^{-c_{1-\gamma_1/2}}\Phi\left(\frac{-|z_i|-a}{b}\right)\varphi(z)\,dz
\qquad (3)
$$

with $a = \sqrt{n_1/(n_1+n_2)}\,z$ and $b = \sqrt{n_2/(n_1+n_2)}$. Here φ(·), Φ(·), and c_{1-γ1/2} denote the density, the cumulative distribution function, and the (1 - γ1/2)-quantile of the standard normal distribution, respectively. If the first-stage p-value is larger than γ1, p_i = p_i^(1).
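A direct numerical evaluation of the integrals in (3) can be sketched as follows. This is our illustration, not the authors' implementation: it uses only the Python standard library, computing Φ from math.erf, the normal quantile by bisection, and the integrals by composite Simpson quadrature truncated at |z| = 10:

```python
# Sketch: numerical evaluation of the two-sided sequential p-value (3).
# z_i is the pooled z-statistic, n1/n2 the stage-wise sample sizes, and
# gamma1 the first-stage selection threshold.
from math import erf, exp, pi, sqrt

def Phi(x):  # standard normal CDF via the error function
    return 0.5 * (1 + erf(x / sqrt(2)))

def phi(x):  # standard normal density
    return exp(-x * x / 2) / sqrt(2 * pi)

def norm_quantile(p):  # inverse of Phi by bisection (ample accuracy here)
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def sequential_p(z_i, n1, n2, gamma1, steps=20000):
    t = abs(z_i)
    b = sqrt(n2 / (n1 + n2))
    c = norm_quantile(1 - gamma1 / 2)  # first-stage selection boundary

    def integrand(z):
        a = sqrt(n1 / (n1 + n2)) * z  # conditional mean of the pooled statistic
        # P(Z >= |z_i| | Z1 = z) + P(Z <= -|z_i| | Z1 = z), weighted by phi(z)
        return ((1 - Phi((t - a) / b)) + Phi((-t - a) / b)) * phi(z)

    def simpson(lo, hi):  # composite Simpson's rule (steps must be even)
        h = (hi - lo) / steps
        s = integrand(lo) + integrand(hi)
        for k in range(1, steps):
            s += integrand(lo + k * h) * (4 if k % 2 else 2)
        return s * h / 3

    # integrate over the selection region |Z1| >= c, truncated at |z| = 10
    return simpson(c, 10.0) + simpson(-10.0, -c)
```

As a sanity check, setting gamma1 = 1 selects every hypothesis, and the sequential p-value then reduces to the ordinary two-sided p-value 2(1 - Φ(|z_i|)) of the pooled statistic.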

Authors’ contributions

Both authors contributed equally to the development of the methods, the design of the simulations, and to writing the paper. SZ conducted the simulations and data analyses. All authors read and approved the final manuscript.

Authors’ information

The views expressed are those of the author (MP) and should not be understood or quoted as being made on behalf of the European Medicines Agency or its scientific Committees.

References

  1. Gail MH, Pfeiffer RM, Wheeler W, Pee D: Probability that a two-stage genome-wide association study will detect a disease-associated SNP and implications for multistage designs. Ann Hum Genet 2008, 72: 812–820. 10.1111/j.1469-1809.2008.00467.x

  2. Goll A, Bauer P: Two-stage designs applying methods differing in costs. Bioinformatics 2007, 23: 1519–1526. 10.1093/bioinformatics/btm140

  3. Moerkerke B, Goetghebeur E: Optimal screening for promising genes in 2-stage designs. Biostatistics 2008, 9: 700–714. 10.1093/biostatistics/kxn002

  4. Satagopan JM, Verbel DA, Venkatraman ES, Offit KE, Begg CB: Two-stage designs for gene-disease association studies. Biometrics 2002, 58: 163–170. 10.1111/j.0006-341X.2002.00163.x

  5. Van den Oord EJ, Sullivan PF: A framework for controlling false discovery rates and minimizing the amount of genotyping in the search for disease mutations. Human Heredity 2003, 56: 188–199. 10.1159/000076393

  6. Victor A, Hommel G: Combining adaptive designs with control of the false discovery rate - a generalized definition for a global p-value. Biometrical J 2007, 49: 94–106. 10.1002/bimj.200510311

  7. Zehetmayer S, Bauer P, Posch M: Two-stage designs for experiments with a large number of hypotheses. Bioinformatics 2005, 21: 3771–3777. 10.1093/bioinformatics/bti604

  8. Zehetmayer S, Bauer P, Posch M: Optimized multi-stage designs controlling the false discovery or the family wise error rate. Stat Med 2008, 27: 4145–4160. 10.1002/sim.3300

  9. Benjamini Y, Yekutieli D: Quantitative trait loci analysis using the false discovery rate. Genetics 2005, 171: 783–790. 10.1534/genetics.104.036699

  10. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B 1995, 57: 289–300.

  11. Benjamini Y, Yekutieli D: The control of the false discovery rate in multiple testing under dependency. Ann Stat 2001, 29: 1165–1188. 10.1214/aos/1013699998

  12. Storey JD, Taylor JE, Siegmund D: Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J R Statist Soc B 2004, 66: 187–205. 10.1111/j.1467-9868.2004.00439.x

  13. Tsiatis AA, Rosner GL, Mehta CR: Exact confidence intervals following a group sequential test. Biometrics 1984, 40: 797–803. 10.2307/2530924

  14. Jennison C, Turnbull BW: Group Sequential Methods with Applications to Clinical Trials. Boca Raton: Chapman & Hall/CRC; 2000.

  15. R Development Core Team: R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2011. [http://www.R-project.org/] [ISBN 3-900051-07-0]

  16. Benjamini Y, Krieger AM, Yekutieli D: Adaptive linear step-up procedures that control the false discovery rate. Biometrika 2006, 93: 491–507. 10.1093/biomet/93.3.491

  17. Qiu X, Brooks AI, Klebanov L, Yakovlev A: The effects of normalization on the correlation structure of microarray data. BMC Bioinformatics 2005, 6: 1–11. 10.1186/1471-2105-6-1

  18. Tian E, Zhan F, Walker R, Rasmussen E, Ma Y, Barlogie B, Shaughnessy JD: The role of the Wnt-signaling antagonist DKK1 in the development of osteolytic lesions in multiple myeloma. New England J Med 2003, 349: 2483–2494. 10.1056/NEJMoa030847

  19. Jeffery IB, Higgins DG, Culhane AC: Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data. BMC Bioinformatics 2006, 7: 359–375. 10.1186/1471-2105-7-359

  20. Posch M, Zehetmayer S, Bauer P: Hunting for significance with the false discovery rate. JASA 2009, 104: 836–840.

  21. Zehetmayer S, Posch M: Post hoc power estimation in large-scale multiple testing problems. Bioinformatics 2010, 26(8): 1050–1056. 10.1093/bioinformatics/btq085

  22. Serfling RJ: Approximation Theorems of Mathematical Statistics. New York: John Wiley & Sons; 1980.


Acknowledgements

We would like to thank the two Reviewers for helpful suggestions and Julia Saperia for many helpful comments.

This work was supported by the Austrian Science Fund FWF (grant numbers T 401-B12 and P23167).

Author information

Correspondence to Sonja Zehetmayer.

Additional information

Competing interests

Both authors have no competing interests.

Electronic supplementary material


Additional file 1: Simulation scenarios and results of the simulation study assessing the FDR of the FNS and FDRS designs (modified procedure with m_s = 6) for independent test statistics, as described in the Results section of the manuscript. For each scenario at least 1000 simulation runs were performed. For scenarios with smaller m the number of runs was increased to 50000 (m = {100; 500}), 20000 (m = 1000), and 10000 (m = 5000), because the false discovery proportion is more variable in these scenarios and the estimator of the FDR converges more slowly; the same applies when m is large but π0 ≈ 1 or Δ is small. The resulting FDR values are plotted as a function of α1 for the FDRS design (left column) and as a function of m2 for the FNS design (right column). (PDF 106 KB)


Additional file 2: Results of a simulation study for two-stage designs where an adaptive test procedure, based on an estimator of the proportion of true null hypotheses, is applied. (PDF 106 KB)


Additional file 3: Comparison of the two-stage results with two single-stage designs: in the first single-stage design the sample size per hypothesis is n1, in the second it is n1 + n2. For the first design we quantify the gain in power of the integrated design; for the second the focus lies on the reduction in costs. (PDF 106 KB)


Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Zehetmayer, S., Posch, M. False discovery rate control in two-stage designs. BMC Bioinformatics 13, 81 (2012). https://doi.org/10.1186/1471-2105-13-81
