A statistical approach to validation
In high-dimensional studies, the interesting results are the features that are statistically significant at a specified false discovery rate (FDR). The FDR can be thought of as the acceptable proportion of false positives among a set of significant results [3]. If 100 variables are significant at an FDR ≤ 5%, then we expect no more than 0.05 × 100 = 5 false discoveries. Applying an independent technology to confirm all 100 variables and finding 5 or fewer false positives would verify this claim. However, applying independent technologies or functional assays to individual discoveries is often costly and time-consuming.
Here we propose to experimentally test a random sample of significant results with an independent technology and confirm the false discovery rate with a statistical procedure. The approach consists of manually confirming a sample of n hits with the independent technology, determining the number of false positives, n_{FP}, and calculating the probability that the true proportion of false positives, Π_{0}, is less than the claimed FDR, \hat{\alpha}. When a subset of features is claimed to be significant at a specified FDR, α, then the expected proportion of false positives among the significant results is α. The expected proportion of false positives in the validation sample, Π_{0}, should then be approximately equivalent to the original FDR, so we can use it to confirm the original FDR estimate.
If the probability \mathrm{Pr}({\Pi}_{0}\le \widehat{\alpha}\mid {n}_{\mathrm{FP}},n) is larger than 0.5, then the validation substudy supports the original FDR estimate, although larger values are required to strongly support validation. Using the posterior distribution, it is also possible to calculate a posterior credible interval for the false discovery rate, Π_{0}, as a measure of variability. This probability represents a direct measurement of concordance, unlike statistics such as the correlation between the original and validation statistics, which measure only agreement and depend on the scale of the measurements being taken [14].
Calculating validation probabilities
Suppose that there are m significant hits at an FDR of Π_{0}, and n of them are sampled randomly for validation. For each of the n probes let δ_{i} = 1 if probe i is a false positive according to the independent technology and δ_{i} = 0 if not. Each probe may have a different probability of being a false positive, so Pr(δ_{i} = 1 | p_{i}) = p_{i}, where p_{i} is drawn from a distribution f(p) such that E_{f(p)}[p] = Π_{0}. The distribution of δ_{i} can be written as:
\begin{array}{ll}f\left({\delta}_{i}\right)& =\underset{0}{\overset{1}{\int}}{p}^{{\delta}_{i}}{(1-p)}^{1-{\delta}_{i}}\times f\left(p\right)\,\mathrm{d}p\\ & =\left\{\begin{array}{ll}\underset{0}{\overset{1}{\int}}p\times f\left(p\right)\,\mathrm{d}p={\Pi}_{0}& \mathrm{if}\ {\delta}_{i}=1\\ \underset{0}{\overset{1}{\int}}(1-p)\times f\left(p\right)\,\mathrm{d}p=1-{\Pi}_{0}& \mathrm{if}\ {\delta}_{i}=0\end{array}\right.\\ & ={\Pi}_{0}^{{\delta}_{i}}{(1-{\Pi}_{0})}^{1-{\delta}_{i}}\end{array}
(1)
So the number of false positives, n_{FP}, has a binomial distribution with parameter Π_{0}. We assume a Beta(a, b) conjugate prior distribution for Π_{0} [15], so the posterior distribution is a Beta(a + n_{FP}, b + n − n_{FP}) [16]. This result relies on the independence of the validation experiments, but does not rely on the rate of true or false positives. The potential reasons for dependence between validation tests include batch and other technical artifacts [9]. However, methods have been developed to address these potential sources of dependence [8, 17], which lead to nearly independent hypotheses [18].
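The binomial result in Equation 1 can be checked with a quick simulation: whatever the shape of the mixing distribution f(p), as long as its mean is Π_{0}, the marginal counts n_{FP} behave like a Binomial(n, Π_{0}). The sketch below is our own illustration in Python (the paper's functions are in R; the function name and the uniform choice of f(p) are our assumptions):

```python
import random

def simulate_nfp(n, draw_p, trials=20000, seed=0):
    """Draw p_i from a mixing distribution f(p), then delta_i ~ Bernoulli(p_i);
    return the simulated false-positive count n_FP from each trial."""
    rng = random.Random(seed)
    return [sum(rng.random() < draw_p(rng) for _ in range(n))
            for _ in range(trials)]

# Illustrative f(p) = Uniform(0, 0.2), which has mean Pi_0 = 0.1; the
# resulting n_FP should match Binomial(n = 10, Pi_0 = 0.1):
# mean = n * Pi_0 = 1.0 and variance = n * Pi_0 * (1 - Pi_0) = 0.9.
counts = simulate_nfp(10, lambda rng: rng.uniform(0.0, 0.2))
```

Only the mean of f(p) matters for the distribution of n_{FP}, which is what makes the Beta-Binomial analysis below valid without knowing f(p) itself.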
Using this posterior distribution it is straightforward to calculate the probability that the FDR (Π_{0}) is less than the claimed level \left(\hat{\alpha}\right): \mathrm{Pr}({\Pi}_{0}<\hat{\alpha}\mid {n}_{\mathrm{FP}},n). We can also calculate the posterior expected value of the FDR using the mean of the posterior, \frac{a+{n}_{\mathrm{FP}}}{a+b+n}, which gives an estimate of the actual FDR of the original results. For our analysis, we assumed a U(0,1) prior distribution for Π_{0} by setting the parameters of the Beta prior to a = b = 1. This prior is somewhat conservative, since it is likely that many of the results in the validation experiment will be true positives.
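Under the uniform Beta(1, 1) prior, the calculation reduces to evaluating a Beta CDF with integer parameters, which has a closed binomial-tail form. A minimal sketch in Python (our own illustrative translation, not the paper's R functions; the function names are ours):

```python
import math

def beta_cdf_int(x, a, b):
    """Regularized incomplete beta I_x(a, b) for integer a, b >= 1,
    via the identity I_x(a, b) = Pr(Binomial(a + b - 1, x) >= a)."""
    n = a + b - 1
    return sum(math.comb(n, k) * x**k * (1 - x)**(n - k) for k in range(a, n + 1))

def validation_probability(n_fp, n, alpha_hat, a=1, b=1):
    """Pr(Pi_0 < alpha_hat | n_FP, n) under the Beta(a + n_FP, b + n - n_FP) posterior."""
    return beta_cdf_int(alpha_hat, a + n_fp, b + n - n_fp)

def posterior_mean_fdr(n_fp, n, a=1, b=1):
    """Posterior expected value of the FDR, (a + n_FP) / (a + b + n)."""
    return (a + n_fp) / (a + b + n)
```

For example, validating n = 30 hits claimed at FDR ≤ 5% and observing no false positives gives validation_probability(0, 30, 0.05) ≈ 0.80, which supports the original FDR estimate.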
In some cases it may be useful to encode the belief that most of the results will be true positives in the prior by choosing values of a and b that put greater prior weight on higher validation probabilities. Specifically, a much less conservative prior choice would set the mean of the prior distribution to be the observed FDR for the validation targets in the original study, \hat{\alpha}=\frac{a}{a+b}. Since this choice permits a range of solutions for a and b, one could select the choice that maximizes the prior variance, \frac{ab}{{(a+b)}^{2}(a+b+1)}. This could be accomplished by choosing a to be a small value like 0.01 and setting b=\frac{1-\hat{\alpha}}{\hat{\alpha}}\times a.
For small values of \hat{\alpha}, this prior may influence the validation probabilities and make it difficult to compare validation at different FDR levels. This is a particular concern given the known variability in FDR estimates, particularly for small sample sizes or low FDR levels [19]. In general, conservative and non-adaptive prior distributions will lead to less potential for bias and greater comparability across FDR levels. Our R functions for calculating the validation probabilities allow for different choices of a and b. However, the validation probability is somewhat robust to prior choice (Results).
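The informative prior construction above amounts to one line of arithmetic. A hedged sketch (illustrative only; the function name is ours, and the default a = 0.01 follows the example in the text):

```python
def informative_prior(alpha_hat, a=0.01):
    """Beta(a, b) prior with mean alpha_hat; a small first parameter a
    pushes the prior variance alpha_hat*(1 - alpha_hat)/(a + b + 1)
    toward its maximum for that mean."""
    b = (1 - alpha_hat) / alpha_hat * a
    return a, b
```

For instance, informative_prior(0.05) returns a Beta prior with mean 0.05; shrinking a (and hence a + b) inflates the prior variance, as the variance formula above shows.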
Bootstrap confidence intervals for the validation probabilities
It may be of interest to determine the variability of the validation probability. One potential approach is to calculate a bootstrap confidence interval for the posterior probability [20]. The basic approach is as follows.

1. For b = 1,…,B bootstrap samples, take a random sample of size n with replacement from the n validation results. Calculate the number of false positives and the null statistic: {\mathrm{Pr}}^{b}({\Pi}_{0}<\hat{\alpha}\mid {n}_{\mathrm{FP}}^{b},n).

2. Calculate the 2.5th and 97.5th quantiles of the distribution of null statistics, {\mathrm{Pr}}^{b}({\Pi}_{0}<\hat{\alpha}\mid {n}_{\mathrm{FP}}^{b},n). Use these values as a 95% confidence interval for the validation probability.
The bootstrap is not justified for small sample sizes, and when the validation sample size is small, these bootstrap confidence intervals may not have the appropriate coverage.
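Steps 1 and 2 can be sketched as follows, assuming the uniform Beta(1, 1) prior so the posterior CDF has a closed binomial form (an illustrative Python translation; the paper's own implementation is in R, and the function names here are ours):

```python
import math
import random

def beta_cdf_int(x, a, b):
    """I_x(a, b) for integer a, b >= 1 via I_x(a, b) = Pr(Binomial(a + b - 1, x) >= a)."""
    n = a + b - 1
    return sum(math.comb(n, k) * x**k * (1 - x)**(n - k) for k in range(a, n + 1))

def bootstrap_ci(deltas, alpha_hat, B=2000, seed=1):
    """95% bootstrap interval for Pr(Pi_0 < alpha_hat | n_FP, n): resample the
    0/1 false-positive indicators with replacement and recompute the probability."""
    rng = random.Random(seed)
    n = len(deltas)
    stats = []
    for _ in range(B):
        n_fp = sum(rng.choices(deltas, k=n))          # bootstrap false-positive count
        stats.append(beta_cdf_int(alpha_hat, 1 + n_fp, 1 + n - n_fp))
    stats.sort()
    return stats[int(0.025 * B)], stats[int(0.975 * B)]
```

For n = 20 validated hits with one observed false positive, bootstrap_ci([1] + [0] * 19, 0.05) returns the endpoints of the interval.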
Choosing the FDR level and sample size
An important question for statistical validation is: how does one choose the FDR level and the validation sample size to use? To answer this question, suppose that in a given study, for each FDR cutoff q there are n_{sig}(q) significant genes. The goal is to find the minimum number of sampled results required to achieve a high validation probability, Pr(Π_{0} < q | n_{FP}, n), for the case where the results would be confirmed with a perfect independent technology. The minimum validation sample size for FDR cutoff q can be found by solving the following optimization problem:
\mathrm{min}\left\{n:\mathrm{Pr}({\Pi}_{0}<q\mid n\times q,n)>\text{Target Probability}\right\}
In other words, what is the minimum validation sample size needed to get at least the target validation probability, assuming that q × n false positives will be observed?
Here, as in any sample size calculation, we must estimate the effect size, in this case the expected number of false positives in the validation set. In our examples, we estimate the effect size as the observed FDR for the validation targets. However, our R functions allow for alternative choices of the expected FDR for each true FDR level. If a user chooses an FDR threshold that is higher than the observed FDR, the minimum sample size needed to confirm that higher threshold will be smaller.
This optimization problem can be solved for any specific study based only on the set of p-values from the original analysis. For a fixed target probability and a fixed false discovery rate threshold, the minimum sample size will be fixed as long as the number of significant features, n_{sig}(q), is sufficiently large. The reason is that the optimization is over only the single variable n once q and the target probability are fixed.
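The search over n can be sketched as a simple linear scan (our own illustrative Python, not the paper's R code; the numerical Beta CDF, the function names, and the separate expected-FDR argument are our assumptions):

```python
import math

def beta_cdf(x, a, b, steps=20000):
    """Regularized incomplete beta I_x(a, b) for real a, b >= 1 by Simpson's rule."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def density(t):
        if t <= 0.0:  # boundary values of the Beta density for a, b >= 1
            return 1.0 / math.exp(log_beta) if a == 1 else 0.0
        if t >= 1.0:
            return 1.0 / math.exp(log_beta) if b == 1 else 0.0
        return math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - log_beta)

    h = x / steps
    total = density(0.0) + density(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return total * h / 3.0

def min_validation_sample_size(q, expected_fdr, target=0.5, a=1.0, b=1.0, n_max=1000):
    """Smallest n with Pr(Pi_0 < q | n_FP = n * expected_fdr, n) > target, plugging
    the expected (possibly non-integer) false-positive count into the posterior."""
    for n in range(1, n_max + 1):
        n_fp = n * expected_fdr
        if beta_cdf(q, a + n_fp, b + n - n_fp) > target:
            return n
    return None  # target not reachable within n_max
```

With a target probability of 0.5, an FDR threshold q = 0.1, and an expected FDR of 0.05 for the sampled targets, the scan returns the first n whose plug-in validation probability exceeds the target.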
As an example of this procedure, we use the data from the first simulated study (of 100) in the errorless validation simulation described in the Results. Based on the p-values from that study, we calculated the minimum validation sample size needed at each FDR threshold to achieve a target validation probability of Pr(Π_{0} < q | n × q, n) ≥ 0.5 (Figure 2). For an FDR of 5%, it is not possible to achieve the desired validation probability. For increasing FDR cutoffs, the required validation sample size decreases. This is not surprising, since we have shown in the previous section that validation is more likely at higher FDR thresholds. The lower the FDR threshold used for validation, the more convincing the validation may be. An investigator can therefore calculate this curve for any given study and use the results to decide how many results to validate based on available resources.
If the authors have designed their study using our estimates of the minimum validation sample size and the validation probability is low, then it is likely that Π_{0} is greater than the claimed FDR level. If, however, they choose to validate many fewer targets than the minimum validation sample size suggests, it is ambiguous whether the sample size was too small or the FDR did not validate.
Calculating qPCR validation costs
Manually confirming genomic results with independent technologies or functional assays can be costly and time-consuming, since most validation technologies must be applied one gene, transcript, or protein at a time. There are a large number of validation technologies, but one of the most commonly used is quantitative PCR (qPCR). To compare the costs associated with different strategies, we use qPCR validation of gene expression studies as an example. The results presented here are representative of the results for any costly independent confirmation experiment.
We estimated the costs associated with two different qPCR technologies: SYBrGreen and TaqMan. For TaqMan we assumed that three genes and a reference gene were multiplexed in each reaction, which is theoretically possible but optimistic in practice. SYBrGreen reactions also included a reference gene, but were not assumed to be multiplexed. We calculated costs as follows: $250 for each TaqMan probe, $150 for Mastermix for each plate for both SYBrGreen and TaqMan, and $4 for each 96-well plate. We assumed each reaction was replicated three times to ensure accurate measurements, a typical approach in validation experiments. We also assumed that one research assistant, working full time and paid $40,000 per year, could run and analyze four 96-well plates per day. For the purposes of our analysis we assumed 22 working days per month.
Based on these assumptions, we can calculate the cost and time required for validating n_{genes} genes on n_{samples} samples with each technology. For TaqMan, after accounting for multiplexing, the number of plates run is \frac{{n}_{\mathrm{genes}}\times {n}_{\mathrm{samples}}}{96}, so the cost (C_{TaqMan}) and time (T_{TaqMan}) for the validation experiment are as follows.
\begin{array}{ll}{C}_{\mathrm{TaqMan}}& =\underbrace{\$250\times {n}_{\mathrm{genes}}}_{\text{Primer Costs}}\\ & \phantom{=}+\underbrace{(\$4+\$150)\times \left\lfloor \frac{{n}_{\mathrm{genes}}\times {n}_{\mathrm{samples}}}{96}\right\rfloor}_{\text{Reagent Costs}}\\ & \phantom{=}+\underbrace{\$40,000\times {T}_{\mathrm{TaqMan}}}_{\text{Personnel Costs}}\end{array}
(2)
\begin{array}{ll}{T}_{\mathrm{TaqMan}}\phantom{\rule{2.77695pt}{0ex}}& =\phantom{\rule{2.77695pt}{0ex}}\frac{1}{4\times 22\times 12}\times \lfloor \frac{{n}_{\mathrm{genes}}\times {n}_{\mathrm{samples}}}{96}\rfloor \phantom{\rule{2em}{0ex}}\end{array}
(3)
For the SYBrGreen validation, reactions cannot be multiplexed. However, these reactions also do not incur the primer costs of the TaqMan reactions. So the cost (C_{SG}) and time (T_{SG}) for the validation experiment are as follows.
\begin{array}{ll}{C}_{\mathrm{SG}}& =\underbrace{(\$4+\$150)\times \left\lfloor \frac{{n}_{\mathrm{genes}}\times {n}_{\mathrm{samples}}\times 3\times 2}{96}\right\rfloor}_{\text{Reagent Costs}}\\ & \phantom{=}+\underbrace{\$40,000\times {T}_{\mathrm{SG}}}_{\text{Personnel Costs}}\end{array}
(4)
\begin{array}{ll}{T}_{\mathrm{SG}}\phantom{\rule{2.77695pt}{0ex}}& =\phantom{\rule{2.77695pt}{0ex}}\frac{1}{4\times 22\times 12}\times \lfloor \frac{{n}_{\mathrm{genes}}\times {n}_{\mathrm{samples}}\times 3\times 2}{96}\rfloor \phantom{\rule{2em}{0ex}}\end{array}
(5)
In these equations, the terms inside the floor operators ⌊·⌋ represent the number of plates needed to run the reactions, which must be multiplied by the fixed costs for those plates. From these equations, it can be seen that manual confirmation of gene expression results using either TaqMan or SYBrGreen is costly and time-consuming. TaqMan is slightly more expensive, but slightly less time-consuming.
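Equations 2 through 5 can be evaluated directly. The sketch below is our own illustrative translation (function and constant names are ours; time is expressed in years, matching the $40,000-per-year salary in the personnel term, and the floor follows the equations as written):

```python
import math

PRIMER_COST = 250              # $ per TaqMan probe
PLATE_COST = 4 + 150           # $ per 96-well plate plus Mastermix
SALARY = 40_000                # $ per year for one research assistant
PLATES_PER_YEAR = 4 * 22 * 12  # 4 plates/day, 22 days/month, 12 months/year

def taqman_cost_time(n_genes, n_samples):
    """Cost ($) and time (years) from Equations 2 and 3."""
    plates = math.floor(n_genes * n_samples / 96)
    time = plates / PLATES_PER_YEAR
    return PRIMER_COST * n_genes + PLATE_COST * plates + SALARY * time, time

def sybrgreen_cost_time(n_genes, n_samples):
    """Cost ($) and time (years) from Equations 4 and 5; the factor 3 x 2 is
    three replicates for the target and the reference gene, unmultiplexed."""
    plates = math.floor(n_genes * n_samples * 3 * 2 / 96)
    time = plates / PLATES_PER_YEAR
    return PLATE_COST * plates + SALARY * time, time
```

For example, validating 96 genes on 10 samples requires 10 TaqMan plates but 60 SYBrGreen plates; the TaqMan primer costs dominate its budget, while SYBrGreen trades primer costs for plate count and personnel time, consistent with the comparison above.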