False discovery rate control in two-stage designs
- Sonja Zehetmayer^{1}Email author and
- Martin Posch^{2}
DOI: 10.1186/1471-2105-13-81
© Zehetmayer and Posch; licensee BioMed Central Ltd. 2012
Received: 30 September 2011
Accepted: 23 April 2012
Published: 6 May 2012
Abstract
Background
For gene expression or gene association studies with a large number of hypotheses the number of measurements per marker in a conventional single-stage design is often low due to limited resources. Two-stage designs have been proposed where in a first stage promising hypotheses are identified and further investigated in the second stage with larger sample sizes. For two types of two-stage designs proposed in the literature we derive multiple testing procedures controlling the False Discovery Rate (FDR) demonstrating FDR control by simulations: designs where a fixed number of top-ranked hypotheses are selected and designs where the selection in the interim analysis is based on an FDR threshold. In contrast to earlier approaches which use only the second-stage data in the hypothesis tests (pilot approach), the proposed testing procedures are based on the pooled data from both stages (integrated approach).
Results
For both selection rules the multiple testing procedures control the FDR in the considered simulation scenarios. This holds for the case of independent observations across hypotheses as well as for certain correlation structures. Additionally, we show that in scenarios with small effect sizes the testing procedures based on the pooled data from both stages can give a considerable improvement in power compared to tests based on the second-stage data only.
Conclusion
The proposed hypothesis tests provide a tool for FDR control for the considered two-stage designs. Comparing the integrated approaches for both selection rules with the corresponding pilot approaches showed an advantage of the integrated approach in many simulation scenarios.
Background
Modern experimental techniques in genetic research such as microarray experiments or gene association studies produce high dimensional data and often thousands of hypotheses are tested simultaneously to identify genetic markers. Due to limited resources, the number of measurements per marker in a conventional single-stage design is often low. Two-stage designs have been proposed where in a first stage promising markers are identified from the set of all markers considered initially. Thus, hypotheses corresponding to unpromising markers can be dropped in the interim analysis such that the second stage is performed with the reduced set of selected hypotheses. Given limited total resources or budgets, this allows the allocation of a larger number of observations to more promising hypotheses. It has been shown that such sequential procedures are typically considerably more powerful than single-stage designs[1–8].
An important problem when drawing inference from data produced by such designs is the construction of hypothesis tests that control the False Discovery Rate (FDR). While the construction of such test procedures is straightforward if only the second-stage data is used for testing, tests that make use of the data from both stages need to account for the specific selection rule used to select hypotheses for the second stage.
For two-stage procedures where in an interim analysis all hypotheses with an unadjusted first-stage p-value below a pre-fixed selection boundary γ_{1} are selected for the second stage, hypothesis tests based on the pooled data from both stages have been proposed that control FDR or the familywise error rate[7] (see[8] for a generalization to multistage designs). Selecting all hypotheses whose first-stage p-value lies below a fixed threshold has several implications. First, the number of hypotheses selected for the second stage is a random variable unknown a priori, which is an obstacle for researchers if resources are limited and the number of markers for which further measurements can be collected cannot be arbitrarily increased. Second, because rather large thresholds need to be applied in the interim analysis in order not to miss alternative hypotheses in spite of the small sample size, the fixed threshold rule will, with a high probability, select some hypotheses even under the global null hypothesis, even though resources could be saved in this scenario by stopping the experiment for futility at the interim analysis.
In this work we propose statistical tests to control the FDR in two-stage designs with selection rules that are not based on a fixed threshold for the first-stage p-values. First, we consider designs where a fixed number of hypotheses is selected for the second stage (FNS design)[4]. The selected hypotheses are the top-ranked hypotheses according to the first stage p-values. Second, we consider two-stage designs where selection is based on a fixed FDR threshold α_{1} in the interim analysis (FDRS design). All hypotheses that can be rejected with a test controlling the FDR at level α_{1} are selected for the second stage[9]. For the FNS design the number of continued hypotheses is deterministic and the procedure continues to the second stage, regardless of the actual effect sizes observed. For the FDRS design in contrast, the number of selected hypotheses is a random variable. Furthermore, under the global null hypothesis (where the FDR coincides with the familywise error rate) with probability 1 - α_{1} no hypothesis can be rejected with the interim test at FDR level α_{1} and the trial is stopped for futility.
A simple approach to construct hypothesis tests controlling the FDR for two-stage designs is to consider tests based on the second-stage data only. Standard multiple testing procedures applied to the second-stage data will control the FDR. For the FDRS design Benjamini and Yekutieli[9] showed that the nominal level applied at the second stage may be even increased taking into account the interim selection threshold. However, these approaches do not make full use of the available data because the first-stage observations are used for selection only. We construct tests for the FNS and FDRS designs that are based on sufficient test statistics of the data from both stages (the “integrated approach”). These tests are designed in analogy to group sequential tests and appear to control the FDR well if the number of hypotheses tested is large enough.
The paper is structured as follows: in the next section the testing problem and the selection rules are introduced. Then the results of a simulation study which investigates the actual FDR and compares the mean number of rejected alternatives of the integrated approach to the pilot approach are presented. Finally, a real data example and a short discussion are given.
Methods
The test problem
We consider an experiment to test m two-sided null hypotheses H_{0i}: μ_{ i } = 0 versus H_{1i}: μ_{ i } ≠ 0, i = 1, …, m, for the mean of independent, normally distributed observations with variances$\left(\right)close="">{\sigma}_{i}^{2}$, i = 1, …, m. More general distributional scenarios are discussed at the end of this section.
To adjust for multiple testing we aim to control the FDR of the experiment. The FDR[10] is defined as the expectation of the proportion of erroneously rejected hypotheses V among all rejected hypotheses R, FDR = E(V / max{R, 1}). We apply the step-up Benjamini-Hochberg (BH) procedure to control the FDR at level α. Denote the ordered p-values of the m hypotheses by p_{(1)} ≤ p_{(2)} ≤ ⋯ ≤ p_{(m)} and let d = argmax_{ i }{p_{(i)} ≤ iα/m} denote the index of the largest p-value p_{(i)} smaller than or equal to iα/m. Then the BH-procedure rejects all hypotheses H_{0i} such that p_{(i)} ≤ dα/m. For single-stage tests it has been shown by Benjamini and Yekutieli[11] that the BH-procedure controls the FDR at level π_{0}α, if the subset of test statistics corresponding to true null hypotheses are independent or positively regression dependent. Here, π_{0} denotes the (unknown) proportion of true null hypotheses among the m tested null hypotheses. Furthermore, the FDR is asymptotically controlled (for increasing number of hypotheses) if the limiting fraction of true null hypotheses is less than one and the test statistics are weakly dependent such that their empirical distribution functions converge almost surely (and some additional technical conditions hold)[12]. In the following we assume distributional scenarios where the BH-procedures controls the FDR.
The two-stage procedure
In the first-stage for each hypothesis n_{1} observations are collected. Then an interim analysis is performed and for each hypothesis a two-sided first-stage p-value$\left(\right)close="">{p}_{i}^{\left(1\right)}=2(1-\mathrm{\Phi}(|{z}_{i}^{\left(1\right)}|)$, i = 1, …, m, is calculated, where$\left(\right)close="">{z}_{i}^{\left(1\right)}$ denotes the standardized first-stage mean for hypothesis i and Φ the cumulative distribution function of the standard normal distribution. The first-stage p-values are ranked according to their magnitude and the m_{2} hypotheses with the smallest p-value are selected for the second stage. The number of selected hypotheses m_{2} can be either a pre-fixed number or may depend on the first-stage results. Below we consider several choices for m_{2}. In a second stage for each selected hypothesis n_{2} observations are collected. n_{2} is assumed to be fixed and does not depend on the number of selected hypotheses. We consider two different approaches to arrive at the final test decision: the “integrated approach”, where the test decision is based on the combined data of both stages and the “pilot approach”, where the test decision is based on the second-stage data only.
In the following we introduce several rules to determine the number of selected hypotheses m_{2}.
Selection rules for two-stage designs
Selection according to a prefixed selection boundary γ_{1}
where 1 is the indicator function which equals 1 if the condition in the parentheses is satisfied and 0 otherwise. Thus, m_{2} is a random variable.
Pre-fixed number of hypotheses selected for the second stage - FNS design
With this rule the value of m_{2} is fixed a priori and the m_{2} hypotheses with the smallest first-stage p-values are selected for the second stage. We denote this procedure FNS design (Fixed Number Selection).
Selection based on an FDR threshold - FDRS design
In this approach all hypotheses which are significant according to the BH-procedure at a prefixed level α_{1} > α in the interim analysis are selected for the second stage. Thus, the number of selected hypotheses m_{2} is a random variable which depends on the first-stage results. If no hypothesis can be rejected at level α_{1} in the interim analysis and thus be carried over to the second stage, m_{2} is set to zero. In this case the whole experiment is stopped. Note that under the global null hypothesis, i.e. in the setting where all null hypotheses are true and thus π_{0} = 1, this occurs with a probability of 1 - α_{1}.
FDR control
In the subsections below we review the FDR controlling test procedure for two-stage designs where hypotheses are selected based on a prefixed selection boundary applied to the first-stage p-values. Both pilot and integrated designs are considered. We then propose generalizations of these test procedures to FNS and FDRS designs, showing that for each of the designs a corresponding data dependent selection boundary for the first-stage p-values can be defined that converges under suitable assumptions almost surely to a fixed value. The results are derived for independent data. However, in genetic data dependence of test statistics is frequently observed and even weak dependence may be a strong assumption. In the simulation study we thus investigate the performances of the procedures for several correlation structures.
Selection according to a prefixed selection boundary γ_{1}
Pilot approach
In the pilot approach the final test statistics are based on data from the second stage only. The first-stage data are used for selecting promising hypotheses only. To control the FDR of the experiment, at the end of the trial for each hypothesis a two-sided p-value$\left(\right)close="">{p}_{i}^{\left(2\right)}=2(1-\mathrm{\Phi}(|{z}_{i}^{\left(2\right)})|)$, i = 1, …, m_{2}, is calculated, where$\left(\right)close="">{z}_{i}^{\left(2\right)}$ denotes the standardized mean of the second-stage observations for hypothesis i. Then the BH-procedure is applied to these p-values which are based on the second-stage data only. Because the first-stage observations are used only for selection and do not enter the final test statistics, the BH-procedure controlling the FDR at the second stage controls the FDR overall.
Integrated approach
where Z_{ i } denotes the standardized overall mean of the observations from both stages and$\left(\right)close="">{Z}_{i}^{\left(1\right)}$ the standardized mean of the observations in the first stage. Furthermore,$\left(\right)close="">{z}_{i},{z}_{i}^{\left(1\right)}$ denote realizations of the random variables$\left(\right)close="">{Z}_{i},{Z}_{i}^{\left(1\right)}$ and$\left(\right)close="">{c}_{1-{\gamma}_{1}/2}$ the (1 - γ_{1}/2)-quantile of the standard normal distribution. If the stopping criterion is satisfied the sequential p-value is just the classical fixed sample p-value calculated from the first-stage observations. Otherwise, the calculation of the sequential p-value involves the numerical solution of an integral (see the Appendix). Finally, the BH-procedure is applied to the sequential p-values p_{1}, …, p_{ m }.
In a two-stage procedure with fixed per hypothesis sample sizes n_{1}, n_{2} and a fixed selection boundary γ_{1} the sequential p-values are uniformly distributed under the null hypothesis[13].
If for the subset of true null hypotheses the observations are independent across hypotheses such that the sequential p-values are independent as well, it follows that the FDR is controlled in such a two-stage design. As described above FDR control holds also if the sequential p-values corresponding to true null hypotheses are positive regression dependent[11].
Next we extend the test procedure to the FNS and FDRS design.
FNS design
Pilot approach
For the FNS selection rule the FDR control with the pilot approach is straightforward: the BH-procedure can be applied to the second-stage p-values of the m_{2} selected hypotheses. Because the first-stage data do not enter the final test statistics, FDR control is guaranteed under the assumption of positive regression dependency.
Integrated approach
To utilize the data from both stages for the final test decision, we propose to compute sequential p-values in analogy to (2), replacing the fixed selection boundary γ_{1} (which is not defined for the FNS design) by the value of the largest first-stage p-value of all selected hypotheses, i.e., setting$\left(\right)close="">{\gamma}_{1}={p}_{\left({m}_{2}\right)}$. Thus, the threshold γ_{1} is now data dependent. Because this is not accounted for in the calculation of the sequential p-values, they may no longer be uniformly distributed under the null hypothesis and it is not obvious if the FDR is still controlled. However, if the observations are sufficiently independent across hypotheses, such that the empirical distribution functions of the first-stage test statistics converge almost surely as the number of tested hypotheses increases (and some additional technical conditions hold, see the Appendix), γ_{1} converges almost surely to a fixed number. Thus, asymptotically γ_{1} is deterministic and for large m and m_{2} the procedure becomes similar to the method with a prefixed selection boundary.
Note that with the integrated two-stage testing procedure, hypotheses that have not been selected in the interim analysis can in principle be rejected in the final test. Especially, if m_{2} is small compared to the number of hypotheses for which the alternative holds and the effect sizes are large, hypotheses that were not selected at the interim analysis can be rejected at the end because for every hypothesis a sequential p-value is calculated (even for non-selected ones). Rejection of non-selected hypotheses can occur in an overpowered FNS design where only few hypotheses are selected, but even the sequential p-values corresponding to non-selected alternative hypotheses are small enough to lead to a rejection. If such rejections occur, this is an indication that the first-stage sample size has been chosen too large and no second-stage sample would have been needed to reach sufficient power. While the efficiency of such a design can be improved by choosing appropriate first-stage sample sizes and selection rules, the control of the FDR is not affected.
FDRS design
Pilot approach
As for the FNS selection rule, if the BH-procedure is applied at nominal level α to the second-stage p-values of the m_{2} selected hypotheses (computed from the observations of the second stage only), FDR control is guaranteed. However, as Benjamini and Yekutieli[9] showed, if, in a first stage, hypotheses are selected that can be rejected with the BH-procedure at nominal level α_{1}, and, in a second stage, the selected hypotheses are tested at nominal level α_{2}, the FDR of the second-stage test is actually controlled at level α_{1}α_{2}π_{0}, given the test statistics at each stage are positively regression dependent[9]. Thus, if in the second stage the nominal level α/α_{1} is applied, the FDR is still controlled at level π_{0}α. In the following we consider the latter, improved procedure.
Integrated approach
Similar to the FNS rule we propose to compute sequential p-values in analogy to (2) to utilize the data from both stages for the final test decision.
Again, the resulting threshold γ_{1} is data dependent: we set γ_{1} = m_{2}α_{1}/m where m_{2} is a random variable. Then γ_{1} is approximately equal to the largest first-stage p-value of all selected hypotheses. Thus the sequential p-values may no longer be uniformly distributed under the null hypothesis and FDR control is in question. However, the following argument gives a heuristic for FDR control when the number of hypotheses is large. If for a positive proportion of hypotheses the alternative holds the empirical distribution functions of the first-stage test statistics converge almost surely as the number of tested hypotheses increases (and some additional technical conditions hold, see the Appendix), it can be shown that γ_{1} converges almost surely. Hence, in these settings γ_{1} is asymptotically deterministic.
Under the global null hypothesis γ_{1} does not converge and simulations (see the Results section) show that the FDR is actually inflated. Therefore, we suggest the following modification of the test procedure. Let m_{ s } > 0 denote a positive constant. In cases where less than m_{ s } hypotheses are selected by the FDRS selection rule the threshold γ_{1} used in (2) is set to the m_{ s }-smallest first-stage p-value, thus${\gamma}_{1}=max({p}_{\left({m}_{2}\right)}^{\left(1\right)},{p}_{\left({m}_{s}\right)}^{\left(1\right)})$. Note that this modification increases the first-stage critical boundary used in the calculation of the sequential p-value.
Generalizations to other testing problems
The procedure can be directly generalized to two group comparisons, replacing the standardized means by the standardized mean between group differences. More generally, the sequential p-value can be computed as in (2) if (under the null hypothesis) the cumulative test statistics follow a multivariate normal distribution with mean zero and variance one. In the actual computation based on (3) the term n_{1} / (n_{1} + n_{2}) (resp. n_{2} / (n_{1} + n_{2})) is then replaced by the correlation ρ(resp. 1 - ρ). For example, for testing problems as the comparison of rates, or the analysis of co-variances, the standardized means can be replaced by standardized efficient score statistics that are asymptotically normally distributed. The correlation between these test statistics is determined by the information fractions (see[14]).
Results
First we investigate the actual FDR of the proposed testing procedures for the FNS and FDRS selection rules. Additionally, to quantify the advantage in power of the integrated approach compared to the pilot approach, we report the mean number of rejected alternatives under different scenarios. We consider the one-sample z-test for m two-sided null hypotheses H_{0i}: μ_{ i } = 0 versus H_{1i}: μ_{ i } ≠ 0, i = 1, …, m, for the mean of normally distributed observations with nominal significance level α = 0.05. The simulations are performed for a wide range of scenarios. For a detailed description see Additional file1.
In the following we assume independence of test statistics across hypotheses. However, because this assumption is often not satisfied in genetic data, we also report simulations assuming several correlation structures.
All computations were performed using the statistical language R[15]. R-code to reanalyse the real data application is available for download from the authors’ web pagehttp://statistics.msi.meduniwien.ac.at/index.php?page=pageszfnr.
Simulation results for the FNS procedure
Control of the error rate
Integrated approach: In all simulated scenarios the FDR is well controlled if m_{2} > 5 (see Additional file1). Only if a smaller m_{2} is chosen, the FDR may be inflated up to 0.11.
Pilot approach: For the pilot approach the FDR is controlled in all scenarios.
Mean number of rejected hypotheses
FNS design
m= 1000 | m= 10000 | m= 100000 | |||||
---|---|---|---|---|---|---|---|
m _{2} | π _{0} | Δ = 1 | Δ = 1.6 | Δ = 1 | Δ = 1.6 | Δ = 1 | Δ = 1.6 |
0.01m | .95 | 6.1 (17%;) | 15.4 (58%;) | 58.7 (19%;) | 141.9 (46%;) | 583.7 (19%;) | 1406.8 (45%;) |
0.01m | .99 | 1.8 (13%;) | 5.0 (2%;) | 13.3 (19%;) | 41.9 (3%;) | 126.8 (20%;) | 407.4 (3%;) |
0.05m | .95 | 12.5 (21%;) | 26.7 (5%;) | 117.6 (23%;) | 260.3 (5%;) | 1166.9 (23%;) | 2590.3 (5%;) |
0.05m | .99 | 2.8 (25%;) | 6.2 (4%;) | 19.8 (35%;) | 53.8 (6%;) | 188.4 (36%;) | 523.7 (6%;) |
0.1m | .95 | 14.9 (27%;) | 30.1 (6%;) | 139.9 (29%;) | 293.9 (7%;) | 1388.2 (29%;) | 2926.5 (7%;) |
0.1m | .99 | 3.1 (33%;) | 6.5 (6%;) | 22.0 (47%;) | 57.0 (9%;) | 209.9 (49%;) | 554.7 (9%;) |
In all scenarios the integrated approach rejects more or the same number of alternative hypotheses than the pilot approach. The increase in rejections is up to 59%;. Figure1D shows the impact of m_{2} on the mean number of rejected alternatives for the integrated (black lines) and the pilot approach (grey lines): For Δ = 1 (solid lines) and very small m_{2}, the number of rejected alternatives is very small but it clearly increases with m_{2}. Here the difference to the pilot design is more distinct. For Δ = 1.6 the advantage of the integrated approach is only moderate.
Simulation results for the FDRS procedure
Control of the error rate
Integrated approach: For the original procedure (without the modified critical value) and π_{0} < 0.8 the FDR is controlled for all considered values of m, α_{1} and Δ (data not shown). For larger values of π_{0} the FDR may be inflated, especially if the effect size under the alternative is low such that the expected number of selected hypotheses for the second stage is very small. The inflation is, however, moderate and the maximal FDR over all simulation scenarios is 0.073 instead of the nominal 0.05.
The simulations for the modified procedure show that across all scenarios the FDR is controlled for m_{ s } = 6 (see Figure1A for an example and Additional file1 for all scenarios). Therefore, in the following we only report results of the modified procedure with m_{ s } = 6. Note that for some of the parameter values the modified procedure is strictly conservative.
For the pilot approach FDR control follows by theoretical arguments in[9] and is confirmed by the simulation study (within the simulation error, see Figure1A).
Mean number of rejected hypotheses
FDRS design
m= 1000 | m= 10000 | m= 100000 | |||||
---|---|---|---|---|---|---|---|
α _{1} | π _{0} | Δ = 1 | Δ = 1.6 | Δ = 1 | Δ = 1.6 | Δ = 1 | Δ = 1.6 |
0.1 | .95 | 2.5 (-1 %;) | 18.3 (0%;) | 17.7 (1%;) | 171.7 (0%;) | 166.2 (0%;) | 1703.0 (0%;) |
.99 | 0.4 (-4%;) | 3.3 (0%;) | 1.3 (-3%;) | 23.1 (0%;) | 7.4 (0%;) | 219.7 (0%;) | |
0.2 | .95 | 4.3 (1%;) | 21.7 (1%;) | 34.1 (2%;) | 206.4 (0%;) | 328.6 (2%;) | 2047.1 (0%;) |
.99 | 0.6 (-3%;) | 3.9 (0%;) | 2.3 (0%;) | 28.9 (0%;) | 15.8 (1%;) | 275.7 (0%;) | |
0.5 | .95 | 9.3 (8%;) | 27.4 (2%;) | 81.3 (7%;) | 264.1 (2%;) | 799.7 (7%;) | 2625.7 (2%;) |
.99 | 1.3 (4%;) | 5.1 (1%;) | 6.2 (5%;) | 39.8 (1%;) | 51.2 (5%;) | 382.0 (1%;) |
If the first-stage sample size is increased, the advantage of the integrated approach increases: E.g., for n_{1} = n_{2} = 9 and π_{0} = 0.95, Δ = 1, α_{1} = 0.5, the mean number of rejected hypotheses is 22%; larger for the integrated approach than for the pilot approach.
Correlated test statistics
Test statistics from genetic data are often stochastically dependent across hypotheses. In this section we study the impact of correlation between test statistics on the FDR and consider auto-correlation, block-correlation[12] and equi-correlation[16]. Auto-correlation may occur for example in microarray data because of spatial artefacts on the array or in gene association studies due to correlation between neighbouring markers. A block correlation structure, also called clumpy correlation, may be induced in microarray data for example by pathways of genes that are commonly regulated[17]. Finally, equi-correlation can be due to ‘array effects’ in microarray analyses.
For auto-correlation we consider an order among hypotheses and assume an autoregressive correlation structure. Here the correlation between the test statistics for hypotheses i and j is given by ρ^{|i-j|}. For block-correlation we assume that the test statistics are correlated in blocks of 20 hypotheses where the correlation between the test statistics within one block is ρ[12]. Hypotheses of different blocks are assumed to be independent. For equi-correlation we assume that for all pairs of hypotheses a pairwise correlation of ρ holds. For all correlation structures the alternatives are randomly distributed among the sequence of hypotheses. The simulations with correlated data were performed for the scenarios m ∈ {1000, 10000, 100000}, m_{2} ∈ {0.01m, 0.05m, 0.1m}, Δ ∈ {1, 1.6}), and π_{0} ∈ {0.95, 0.99,1} with correlation coefficient ρ = 0.5.
FDRS design for equi-correlated data
m= 1000 | m= 10000 | m= 100000 | |||||
---|---|---|---|---|---|---|---|
α _{1} | π _{0} | Δ = 1 | Δ = 1.6 | Δ = 1 | Δ = 1.6 | Δ = 1 | Δ = 1.6 |
0.1 | .95 | 2.7 (3%;) | 18.1 (0%;) | 20.6 (5%;) | 169.9 (0%;) | 180.2 (5%;) | 1682.2 (0%;) |
.99 | 0.4 (0%;) | 3.2 (0%;) | 2.0 (12%;) | 22.7 (0%;) | 16.4 (19%;) | 214.4 (1%;) | |
0.2 | .95 | 3.9 (9%;) | 21.5 (1%;) | 30.6 (12%;) | 203.4 (1%;) | 300.6 (14%;) | 2015.9 (1%;) |
.99 | 0.6 (7%;) | 3.9 (1%;) | 2.9 (26%;) | 28.3 (1%;) | 26.0 (33%;) | 269.5 (2%;) | |
0.5 | .95 | 7.3 (22%;) | 26.8 (3%;) | 60.5 (28%;) | 257.3 (4%;) | 576.0 (29%;) | 2554.7 (4%;) |
.99 | 1.1 (25%;) | 4.9 (3%;) | 5.7 (6%;) | 37.9 (4%;) | 48.8 (78%;) | 363.3 (4%;) |
FNS design for equi-correlated data
m= 1000 | m= 10000 | m= 100000 | |||||
---|---|---|---|---|---|---|---|
m _{2} | π _{0} | Δ = 1 | Δ = 1.6 | Δ = 1 | Δ = 1.6 | Δ = 1 | Δ = 1.6 |
0.01m | .95 | 8.0 (13%;) | 15.2 (53%;) | 77.3 (13%;) | 141.2 (43%;) | 768.0 (13%;) | 1392.8 (41%;) |
.99 | 2.4 (0%;) | 5.6 (1%;) | 17.7 (0%;) | 48.6 (1%;) | 168.5 (0%;) | 471.5 (1%;) | |
0.05m | .95 | 14.4 (12%;) | 28.5 (4%;) | 135.4 (13%;) | 278.9 (4%;) | 1342.9 (13%;) | 2783.1 (4%;) |
.99 | 3.0 (15%;) | 6.3 (3%;) | 21.6 (21%;) | 55.3 (4%;) | 207.9 (21%;) | 537.4 (5%;) | |
0.1m | .95 | 15.8 (21%;) | 30.6 (5%;) | 149.0 (22%;) | 299.7 (5%;) | 1473.5 (22%;) | 2990.8 (5%;) |
.99 | 3.2 (28%;) | 6.4 (5%;) | 22.8 (39%;) | 57.1 (8%;) | 216.4 (41%;) | 556.2 (8%;) |
Real data application
We reanalysed the microarray data set by Tian[18]. In this experiment gene expression data were compared between 137 patients where bone lytic lesions could be detected by magnetic resonance imaging and 36 controls where such lesions could not be detected for 12625 probe sets. We used the pre-processed data set by Jeffery[19].
To obtain balanced group sizes for the re-analysis we arbitrarily selected 36 patients from the bone lytic lesions group. The samples were arbitrarily allocated to the two stages and the pilot and the integrated approach were applied for the FNS and the FDRS procedure and different parameters: n_{1} = {6, 12} (n_{2} = 36 - n_{1}), m_{2} = {10, 50, 100, 200}, α_{1} = {0.1, 0.2, 0.5, 0.8}, m_{ s } = 10. In the first stage for all procedures a two-sided t-test was computed. For the integrated procedures we computed sequential p-values based on (2) using a normal approximation where the critical values from the model with known variances are applied to the p-values of the t-test.
Real data application
FNS | FDRS | ||||
---|---|---|---|---|---|
n_{1}/n_{2} | m _{2} | Rejections | α _{1} | Rejections | m_{2}^{FDRS} |
6 / 30 | 10 | 6 (1) | 0.1 | 0 (0) | 1 |
50 | 15 (10) | 0.2 | 1 (1) | 2 | |
100 | 30 (12) | 0.5 | 28 (21) | 85 | |
200 | 68 (30) | 0.8 | 345 (132) | 2291 | |
12 / 24 | 10 | 8 (4) | 0.1 | 3 (3) | 3 |
50 | 33 (8) | 0.2 | 51 (38) | 84 | |
100 | 60 (17) | 0.5 | 398 (150) | 1745 | |
200 | 109 (37) | 0.8 | 573 (99) | 5887 |
Discussion and conclusion
In this paper we discussed several selection rules for two-stage designs, where after an interim analysis only promising hypotheses are considered in the second stage.
For the choice of the selection rule, different criteria may apply. With the FNS design, the total number of observations is known in advance, which facilitates the planning of resources. However, this design does not adapt to the number of hypotheses that show an effect in the interim analysis. The latter can be achieved with the FDRS design, where, on the other hand, the total number of observations is random and the planning of resources becomes more difficult. As an extension one can consider an FDRS design where the overall number of observations (across all hypotheses and both stages) is fixed and the observations allocated to the second stage are equally distributed among the selected hypotheses. This comes at the cost of a decreasing per hypothesis power if for a larger number of hypotheses the alternative holds.
For the FNS design the testing procedures provided a sound control in the considered scenarios where more than 5 hypotheses are selected for the second stage for independent as well as for correlated data. Also for the modified FDRS procedure FDR control is given in all scenarios for m_{ s } > 5. Comparing the integrated approaches for both selection rules with the corresponding pilot approaches showed an advantage of the integrated approach in many scenarios. This holds particularly for the FNS design but in many scenarios also for the FDRS design. The advantage of the integrated design increases with the proportion of observations allocated to the first stage. This is in line with earlier findings[7, 8], where scenarios with small first-stage sample sizes were considered and only small differences between the integrated and the pilot design have been observed. In particular, if the effect sizes in microarray studies are low (as, e.g., shown in examples in[20]) and the number of observations in the first stage is sufficiently large compared to the number of observations for the second stage, the integrated design is superior.
On the other hand, using only the second-stage data for testing has the advantage of increased flexibility and simplicity. For example, the pilot FNS procedure controls the FDR even if the hypotheses for the second stage are selected in an arbitrary way. Furthermore, standard (non-sequential) tests can be applied and FDR control can be shown analytically under suitable assumptions.
In the simulations the BH-procedure was applied to the sequential p-values to control the FDR. As described above, this method is conservative if π_{0} < 1 as it controls the FDR actually at level π_{0}α. Following the suggestion of one of the referees, we additionally considered so called adaptive FDR controlling procedures that are based on an estimate of π_{0} (see Additional file2). Under independence these adaptive tests are less conservative then the BH-tests, but did not exceed the nominal level in the considered simulation scenarios. However, as shown earlier (e.g.,[21]) under strong correlation adaptive procedures may inflate the FDR.
It is well known that two-stage designs may lead to a considerable improvement in efficiency compared to single-stage designs[1–8] and this applies also to the procedures investigated in this paper (see Additional file3 for a simulation study comparing the two-stage tests to corresponding single-stage designs). Furthermore, the methods can be extended to designs where an explicit early rejection boundary is applied in the interim analysis as in many group-sequential applications. In this case the calculation of the sequential p-values is slightly modified (the integral boundaries depend on the early rejection boundaries). However, unless the fraction of hypotheses for which the alternative holds is large, it is expected that the addition of an early rejection boundary at the interim analysis has only a marginal impact on the efficiency of the procedure. Furthermore, for hypotheses that are rejected in the interim analysis based on few observations, a confirmation with a larger sample size might be important anyway.
Appendix
Asymptotic considerations
- 1.
The empirical distribution functions of the first-stage p-values corresponding to the null and the alternative hypotheses converge almost surely, i.e., ${lim}_{m\to \infty}V\left(\gamma \right)/{m}_{0}=\gamma $ and ${lim}_{m\to \infty}S\left(\gamma \right)/{m}_{1}={F}_{1}\left(\gamma \right)$ exist for all γ ∈ (0, 1], where F _{1} is a continuous strictly increasing function.
- 2.exists.${lim}_{m\to \infty}{m}_{0}/m={\pi}_{0}^{\infty}$
For the FNS procedure assume that${lim}_{m\to \infty}{m}_{2}/m=\delta $ exists and let G(p) = π_{0}p + (1 - π_{0})F_{1}(p) denote the limiting distribution function of the first-stage p-values. By the continuity and strict monotonicity of G there is a unique${\gamma}_{1}^{\infty}$ such that$G\left({\gamma}_{1}^{\infty}\right)=\delta $. Now,${lim}_{m\to \infty}{p}_{\left({m}_{2}\right)}={\gamma}_{1}^{\infty}$ almost surely. If we additionally assume that for all finite m the distribution of a (randomly chosen) p-value is given by G(p) with density g, then γ_{1} is the m_{2}/m quantile of G. Thus, its variance is given by (m_{2}/m)(1 - m_{2}/m)/(mg(m_{2}/m)^{2})[22].
For the FDRS procedure γ_{1} = m_{2}/mα corresponds to the critical value that results from the BH-procedure. If (1) and (2) hold and π_{0} < 1, it follows as in[12] (Theorem 5) that this critical value converges almost surely.
Computation of the two-sided sequential p-value
with$a=\sqrt{\frac{{n}_{1}}{{n}_{1}+{n}_{2}}}\phantom{\rule{.5em}{0ex}}z$ and$b=\sqrt{\frac{{n}_{2}}{{n}_{1}+{n}_{2}}}$. Here φ(.), Φ(.), and$\left(\right)close="">{c}_{1-{\gamma}_{1}/2}$ denote the density, the cumulative distribution function, and the (1 - γ_{1}/2)-quantile of the standard normal distribution, respectively. If the first-stage p-value is larger than γ_{1},$\left(\right)close="">{p}_{i}={p}_{i}^{\left(1\right)}$.
Author’s contributions
Both authors contributed equally to the development of the methods, the design of the simulations, and to writing the paper. SZ conducted the simulations and data analyses. All authors read and approved the final manuscript.
Authors’ information
The views expressed are those of the author (MP) and should not be understood or quoted as being made on behalf of the European Medicines Agency or its scientific Committees.
Declarations
Acknowledgements
We would like to thank the two Reviewers for helpful suggestions and Julia Saperia for many helpful comments.
This work was supported by the Austrian Science Fund FWF (grant numbers T 401-B12 and P23167).
Authors’ Affiliations
References
- Gail MH, Pfeiffer RM, Wheeler W, Pee D: Probability that a two-stage genome-wide association study will detect a disease-associated snp and implications for multistage designs. Ann Hum Genet 2008, 72: 812–820. 10.1111/j.1469-1809.2008.00467.xPubMed CentralView ArticlePubMed
- Goll A, Bauer P: Two-stage designs applying methods differing in costs. Bioinformatics 2007, 23: 1519–1526. 10.1093/bioinformatics/btm140View ArticlePubMed
- Moerkerke B, Goetghebeur E: Optimal screening for promising genes in 2-stage designs. Biostatistics 2008, 9: 700–714. 10.1093/biostatistics/kxn002PubMed CentralView ArticlePubMed
- Satagopan JM, Verbel DA, Venkatraman ES, Offit KE, Begg CB: Two-stage designs for gene-disease association studies. Biometrics 2002, 58: 163–170. 10.1111/j.0006-341X.2002.00163.xView ArticlePubMed
- Van den Oord EJ, Sullivan PF: A framework for controlling false discovery rates and minimizing the amount of genotyping in the search for disease mutations. Human Heredity 2003, 56: 188–199. 10.1159/000076393View ArticlePubMed
- Victor A, Hommel G: Combining adaptive designs with control of the false discovery rate - a generalized definition for a global p-value. Biometrical J 2007, 49: 94–106. 10.1002/bimj.200510311View Article
- Zehetmayer S, Bauer P, Posch M: Two-stage designs for experiments with a large number of hypotheses. Bioinformatics 2005, 21: 3771–3777. 10.1093/bioinformatics/bti604View ArticlePubMed
- Zehetmayer S, Bauer P, Posch M: Optimized multi-stage designs controlling the false discovery or the family wise error rate. Stat Med 2008, 27: 4145–4160. 10.1002/sim.3300View ArticlePubMed
- Benjamini Y, Yekutieli D: Quantitative trait loci using the false discovery rate. Genetics 2005, 171: 783–90. 10.1534/genetics.104.036699PubMed CentralView ArticlePubMed
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B 1995, 57: 289–300.
- Benjamini Y, Yekutieli DL: The control of the false discovery rate in multiple testing under dependency. Ann Stat 2001, 29: 1165–1188. 10.1214/aos/1013699998View Article
- Storey JD, Taylor JE, Siegmund D: Strong Control, Conservative Point Estimation and Simultaneous Conservative Consistency of False Discovery Rates: a Unified Approach. J R Statist Soc B 2004, 66: 187–205. 10.1111/j.1467-9868.2004.00439.xView Article
- Tsiatis AA, Rosner GL, Mehta CR: Exact Confidence Intervals Following a Group Sequential Test. Biometrics 1984, 40: 797–803. 10.2307/2530924View ArticlePubMed
- Jennison C, Turnbull BW: Group Sequential Methods with Applications to Clinical Trials. Boca Raton: Chapman & Hall/CRC; 2000.
- R Development Core Team: R: A language and environment for statistical computing.. Vienna, Austria: R Foundation for Statistical Computing; 2011. . [ISBN 3–900051–07–0] [http://www.R-project.org/] []. [ISBN 3-900051-07-0]
- Benjamini Y, Krieger AM, Yekutieli D: Adaptive linear step-up procedures that control the false discovery rate. Biometrika 2006, 93: 491–507. 10.1093/biomet/93.3.491View Article
- Qiu X, Qiu X, Brooks AI, Klebanov L, Yakovlev A: The effects of normalization of the correlation structure of microarray data. BMC Bioinf 2005, 6: 1–11. 10.1186/1471-2105-6-1View Article
- Tian E, Zhan F, Walker R, Rasmussen E, Ma Y, Barlogie B, D SJ: The role of the Wnt-signaling antagonist DKKI in the development of osteolytic lesions in multiple myeloma. New England J Med 2003, 349: 2483–2494. 10.1056/NEJMoa030847View Article
- Jeffery IB, Higgins DG, Culhane AC: Comparison and evaluation of methods for generating differentially expressed genes lists from microarray data. BMC Bioinf 2006, 7: 359–375. 10.1186/1471-2105-7-359View Article
- Posch M, Zehetmayer S, Bauer P: Hunting for Significance with the False Discovery Rate. JASA 2009, 104: 836–840.
- Zehetmayer S, Posch M: Post hoc power estimation in large-scale multiple testing problems. Bioinf 2010, 26(8):1050–1056. 10.1093/bioinformatics/btq085View Article
- Serfling RJ: Approximation Theorems of Mathematical Statistics. New York: Wiley Series in Probability and Statistics, John Wiley & Sons Inc; 1980.View Article
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.