 Methodology article
 Open Access
A simple method for assessing sample sizes in microarray experiments
 Robert Tibshirani^{1}
https://doi.org/10.1186/1471-2105-7-106
© Tibshirani; licensee BioMed Central Ltd. 2006
 Received: 04 October 2005
 Accepted: 02 March 2006
 Published: 02 March 2006
Abstract
Background
In this short article, we discuss a simple method for assessing sample size requirements in microarray experiments.
Results
Our method starts with the output from a permutation-based analysis of a set of pilot data, e.g. from the SAM package. Then, for a given hypothesized mean difference and various sample sizes, we estimate the false discovery rate and false negative rate of a list of genes; these are also interpretable in terms of per-gene power and type I error. We also discuss application of our method to other kinds of response variables, for example survival outcomes.
Conclusion
Our method seems to be useful for sample size assessment in microarray experiments.
Keywords
 False Discovery Rate
 False Negative Rate
 Pilot Data
 Gene Score
 Permutation Distribution
Background
Assessment of sample sizes for microarray data is a tricky exercise. The data are complex, as are the biological questions that one might try to answer from such data. What assumptions should one make, and what quantities should be provided as output?
There have been a number of recent papers that address this problem. The authors of [2] use an ANOVA model and provide power calculations for various alternative models. In [4] a decision-theoretic approach is used together with a hierarchical Bayes model. The authors of [8] examine the roles of technical and biological variability in determining sample size. In [5] it is assumed that the genes are independent and have equal variance, and false discovery rates and sensitivities are reported. The ssize package [7] also assumes that the genes are independent, but uses pilot data to estimate the variance. It focuses on power and type I error. The proposal of [6] assumes independence of genes; the convenient (but unrealistic) case of equal correlation among all genes is also considered.
All of these approaches may have shortcomings, namely the assumption of equal variances or independence of genes (or both). These assumptions are often violated in real microarray data and can have a real impact on sample size calculations.
We avoid these assumptions in our proposal. We start with the output from a permutation-based analysis of a set of pilot data. From this we estimate the standard deviation of each gene and the overall null distribution of the genes. Then, for a given hypothesized mean difference, we estimate the false discovery rate (FDR) and false negative rate (FNR) of a list of genes. Many authors now favor the FDR over the family-wise error rate (FWER) as the appropriate error measure for microarray studies. The latter is the probability of at least one false positive call; given that we expect many false positive calls among thousands of genes, the FWER does not seem to be as relevant.
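To make the FWER remark concrete, the following small Python sketch compares the expected number of false positive calls with the FWER. The function names and the numbers (10,000 null genes, per-gene level 0.001) are hypothetical, and the FWER formula assumes independent tests purely for illustration; the method proposed here does not rely on that assumption.

```python
def expected_false_positives(m0, alpha):
    """Expected number of false positive calls among m0 null genes tested at level alpha."""
    return m0 * alpha


def fwer_if_independent(m0, alpha):
    """P(at least one false positive), ASSUMING the m0 tests are independent."""
    return 1 - (1 - alpha) ** m0


# Hypothetical numbers, for illustration only.
exp_fp = expected_false_positives(10000, 0.001)
fwer = fwer_if_independent(10000, 0.001)
print(exp_fp, fwer)
```

With about ten false positives expected, the probability of at least one is essentially 1, so the FWER says little about the quality of a gene list.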
Since the calculation is based on the gene scores from permutations of the data, the correlation in the genes is accounted for. Use of the permutation distribution avoids parametric assumptions about the distribution of individual genes. And by working with the scores rather than the raw data, we avoid the difficult task of simulating new data from a population having a complicated (and unknown) correlation structure.
We provide interpretation of our results both in terms of FDR and FNR, and in terms of power and type I error. Our proposal is implemented in the current version of the SAM package [1].
Our main focus is on microarray experiments for determining which genes are differentially expressed across two different experimental conditions, like treatment versus control. However, our approach is also applicable to other settings, for example studies that correlate survival time with gene expression.
We learned of the proposal in [3] from a referee; it was unknown to us at the time that this paper was written. The resampling-based approach in that paper is very close to the one described here. Some differences are a) by shifting the test statistics rather than the data, our method is applicable beyond the two-sample problem to general settings like survival data, and b) we report not only false discovery rates but also false negative rates in our assessment of sample sizes.
The proposed method
Possible outcomes from m hypothesis tests of a set of genes. The rows represent the true state of the population and the columns are the result of a data-based decision rule.
            Called Not Significant    Called Significant    Total
Null        U                         V                     m_{0}
Non-null    T                         S                     m_{1}
Total       m – R                     R                     m
We have FDR = V/R, FNR = T/(m – R), power = S/m_{1} and type I error = V/m_{0}. For simplicity, when assessing sample sizes we choose our rule so that the number of genes called significant (R) is the same as the number of non-null genes in the population (m_{1}). Then V = T, since V + S = R = m_{1} = T + S, and this implies that 1 – power = FDR and type I error = FNR. Hence, conveniently, the FDR can be interpreted as one minus the per-gene power, and the FNR as the per-gene type I error.
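As a quick numerical check of these identities, here is a minimal Python sketch. The function name `error_rates` and the counts are hypothetical; the counts are chosen so that R = m_{1}, and the sketch assumes all denominators are nonzero.

```python
def error_rates(U, V, T, S):
    """FDR, FNR, power and type I error from the four cells of the outcome table."""
    m0 = U + V      # truly null genes
    m1 = T + S      # truly non-null genes
    R = V + S       # genes called significant
    m = m0 + m1     # all genes
    return {
        "FDR": V / R,
        "FNR": T / (m - R),
        "power": S / m1,
        "type_I_error": V / m0,
    }


# Hypothetical counts with R = V + S = 100 and m1 = T + S = 100.
rates = error_rates(U=880, V=20, T=20, S=80)
print(rates)
```

Because V = T in this configuration, the FDR equals one minus the power and the FNR equals the type I error.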
Here are the details of the calculation for the two-class unpaired case (below we indicate changes necessary for other data types). Let x_{ij} be the expression for gene i in sample j; C_{j} is the set of indices for the n_{j} samples in group j, for j = 1 or 2. The two-sample unpaired t-statistic is
$d_i = \frac{\overline{x}_{i2} - \overline{x}_{i1}}{s_i} \qquad (1)$
where
$s_i = \left[ (1/n_1 + 1/n_2) \left\{ \sum_{j \in C_1} (x_{ij} - \overline{x}_{i1})^2 + \sum_{j \in C_2} (x_{ij} - \overline{x}_{i2})^2 \right\} / (n_1 + n_2 - 2) \right]^{1/2}$
Note that this is the gene score used in the SAM method; see the Remark below regarding the exchangeability constant. If σ_{i} is the true within-group standard deviation for gene i (assumed to be the same for each group), then s_{i}^{2} estimates
$\mathrm{var}(\overline{x}_{i2} - \overline{x}_{i1}) = \sigma_i^2 (1/n_1 + 1/n_2)$
Hence a shift of δ units in one gene for each sample in group 2 causes an average increase in the score d_{ i }of $\delta /({\sigma}_{i}\sqrt{1/{n}_{1}+1/{n}_{2}})$ (we assume that the proportion of samples in groups 1 and 2 remains the same as we vary the sample size).
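The score and the effect of a shift can be illustrated with a short Python sketch. The function names and the toy data are hypothetical. Note that shifting every group 2 value by δ leaves s_{i} unchanged, so the observed score moves by exactly δ/s_{i}; the expression in the text replaces s_{i} by its population counterpart.

```python
import math
from statistics import mean


def pooled_se(x1, x2):
    """Denominator s_i: pooled standard error of the difference in group means."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = mean(x1), mean(x2)
    ss = sum((x - m1) ** 2 for x in x1) + sum((x - m2) ** 2 for x in x2)
    return math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))


def t_score(x1, x2):
    """Score d_i: mean of group 2 minus mean of group 1, divided by s_i."""
    return (mean(x2) - mean(x1)) / pooled_se(x1, x2)


def expected_shift(delta, sigma, n1, n2):
    """Average increase in d_i when group 2 is shifted by delta."""
    return delta / (sigma * math.sqrt(1 / n1 + 1 / n2))


# Toy data for a single gene (hypothetical values).
group1 = [1.0, 2.0, 3.0, 4.0]
group2 = [2.0, 3.0, 4.0, 5.0]
d = t_score(group1, group2)

delta = 2.0
d_shifted = t_score(group1, [x + delta for x in group2])

# Quadrupling both group sizes doubles the expected shift.
shift_small = expected_shift(delta=1.0, sigma=1.0, n1=10, n2=10)
shift_large = expected_shift(delta=1.0, sigma=1.0, n1=40, n2=40)
```

This is why larger samples help: for a fixed δ and σ_{i}, the shift grows with the square root of the sample size while the null distribution of the scores stays on roughly the same scale.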
 1. Estimate the null distribution of the scores, and the per-gene standard deviation σ_{i}, by randomly permuting the class labels and recomputing the gene scores for the permuted data.
 2. For k (the number of truly changed genes) running from (say) 10 to m/2, do the following:
    - Sample a set of m scores from the permutation distribution of the scores.
    - Add $\delta /({\widehat{\sigma}}_{i}\sqrt{1/{n}_{1}+1/{n}_{2}})$, the average effect of a shift of δ in class 2, to a randomly chosen set of k of these scores.
    - Find the cutpoint c equal to the kth largest score in absolute value.
    - Estimate the FDR and FNR of the rule |d_{i}| > c. This is straightforward since we know which genes are truly non-null (they are the ones that were incremented above).
 3. Repeat Step 2 B times and report the median result for each k. We also report the 10th and 90th percentiles of the FDR across the B repetitions.
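The three steps above can be sketched in Python as follows. This is a minimal illustration on synthetic data, not the samr implementation. The function names are hypothetical, and several choices are assumptions made for the sketch: σ_{i} is estimated by the pooled within-group standard deviation of the pilot data, each repetition draws the whole score vector of one permutation, and the cutpoint uses >= so that exactly k genes are called when there are no ties.

```python
import math
import random
from statistics import median


def pooled_sd(x, labels):
    """Pooled within-group standard deviation of one gene (labels are 1 or 2)."""
    groups = {1: [], 2: []}
    for value, g in zip(x, labels):
        groups[g].append(value)
    ss = 0.0
    for values in groups.values():
        centre = sum(values) / len(values)
        ss += sum((v - centre) ** 2 for v in values)
    return math.sqrt(ss / (len(x) - 2))


def t_score(x, labels):
    """Two-sample unpaired t-statistic, group 2 minus group 1."""
    x1 = [v for v, g in zip(x, labels) if g == 1]
    x2 = [v for v, g in zip(x, labels) if g == 2]
    diff = sum(x2) / len(x2) - sum(x1) / len(x1)
    return diff / (pooled_sd(x, labels) * math.sqrt(1 / len(x1) + 1 / len(x2)))


def permutation_scores(data, labels, n_perm, rng):
    """Step 1: one vector of m scores for each random relabelling of the samples."""
    out = []
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)
        out.append([t_score(gene, perm) for gene in data])
    return out


def assess(perm_scores, sigma_hat, delta, n1, n2, k, B, rng):
    """Steps 2 and 3 for one value of k: median FDR and FNR over B repetitions."""
    m = len(sigma_hat)
    scale = math.sqrt(1 / n1 + 1 / n2)
    fdrs, fnrs = [], []
    for _ in range(B):
        scores = list(rng.choice(perm_scores))      # m null scores
        changed = set(rng.sample(range(m), k))      # truly non-null genes
        for i in changed:
            scores[i] += delta / (sigma_hat[i] * scale)
        # Cutpoint: k-th largest score in absolute value.
        c = sorted((abs(s) for s in scores), reverse=True)[k - 1]
        called = {i for i, s in enumerate(scores) if abs(s) >= c}
        fdrs.append(len(called - changed) / len(called))
        fnrs.append(len(changed - called) / (m - len(called)))
    return median(fdrs), median(fnrs)


# Synthetic pilot data (an assumption): 200 null genes, 5 samples per class.
rng = random.Random(1)
labels = [1] * 5 + [2] * 5
data = [[rng.gauss(0, 1) for _ in labels] for _ in range(200)]
perm_scores = permutation_scores(data, labels, n_perm=50, rng=rng)
sigma_hat = [pooled_sd(gene, labels) for gene in data]

results = {}
for n in (10, 40):  # hypothesized samples per class
    results[n] = assess(perm_scores, sigma_hat, delta=1.0, n1=n, n2=n, k=20, B=20, rng=rng)
    print(n, results[n])
```

Only the score vectors are resampled; no new expression data are simulated, which is what lets the gene-gene correlation in the pilot data carry over to the assessment.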
In our examples we use a relatively small number of repetitions (B = 20); this makes the procedure fast and gives sufficiently accurate estimates. For the two-sample problem, we typically require pilot data with at least 4 or 5 samples per class.
The results of this process provide information on how the FDR and FNR would improve if the sample size were increased. To get an idea of what values of the mean difference δ are appropriate or reasonable, one can look at the values ${\overline{x}}_{i2} - {\overline{x}}_{i1}$ among the significant genes in the pilot data.
This approach can be easily applied to other designs and other types of response parameters. For paired data, we take n_{1} = n_{2} = n/2 (remember that n is the total sample size), and all of the above recipe remains the same. For one-class data, the variance of the gene's mean is ${\sigma}_{i}^{2}$/n.
For survival data and Cox's proportional hazards model, the analogue of the mean difference between groups is the numerator of the partial likelihood score statistic, which we denote by r_{i}. Hence we define the gene-specific variance ${\sigma}_{i}^{2}$ via the relation var(r_{i}) = ${\sigma}_{i}^{2}$/n, and we interpret the shift parameter δ relative to r_{i}. The units of r_{i} are not very interpretable, however, so we use the pilot data as a guide. For example, if in our pilot data the genes that we call significant have r_{i} > 100, we can set δ = 100 in our sample size assessment.
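The variance relations for the different designs translate into different score shifts, which the following Python sketch collects in one place. The function name, the design labels, and the `frac_group1` parameter are hypothetical. The survival case reflects our reading that the score is r_{i} divided by its standard error.

```python
import math


def score_shift(delta, sigma, n, design="two-class", frac_group1=0.5):
    """Average shift in the gene score for total sample size n (illustrative only)."""
    if design == "two-class":
        n1 = n * frac_group1          # group proportions held fixed as n varies
        n2 = n - n1
        sd = sigma * math.sqrt(1 / n1 + 1 / n2)
    elif design == "paired":
        half = n / 2                  # n1 = n2 = n/2, same recipe as two-class
        sd = sigma * math.sqrt(1 / half + 1 / half)
    elif design in ("one-class", "survival"):
        sd = sigma / math.sqrt(n)     # variance sigma^2 / n
    else:
        raise ValueError(f"unknown design: {design}")
    return delta / sd


# Hypothetical values: delta = 1, sigma = 1, n = 16 samples in total.
paired_shift = score_shift(1.0, 1.0, 16, design="paired")
one_class_shift = score_shift(1.0, 1.0, 16, design="one-class")
print(paired_shift, one_class_shift)
```

In every case the shift grows like the square root of n, so the rest of the recipe is unchanged once the appropriate variance relation is substituted.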
Remark
In the SAM approach, the denominator s_{i} in the score (1) is replaced by s_{i} + s_{0}, where s_{0} is an exchangeability constant. It shrinks the scores of genes with expression near 0 (having s_{i} ≈ 0).
An example
Remember that the quantity on the horizontal axis – number of genes – refers to both the hypothesized number of truly nonnull genes, and the number of genes called significant.
We see that, depending on the number of genes truly changed at 2-fold, the sample size should be increased to 60 or 100 in order to get the FDR down to 10% or 5%. The false negative rate is consistently low throughout when n = 60 or 100.
Note the similarity between Figures 1 and 2. Of course with real data, the second method – generating data from the underlying model – would not be available, since the underlying model is unknown.
Discussion
We have presented a simple method for assessing sample sizes that starts with a permutation-based analysis of some pilot data. The method gives reasonably accurate estimates of false discovery rates and false negative rates as a function of the total number of samples. Our proposal is implemented in the SAM package, both the Excel add-in and the R package samr [1].
Declarations
Acknowledgements
We would like to thank two referees for helpful comments. The author was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183.
References
 1. Chu G, Narasimhan B, Tibshirani R, Tusher V: Significance analysis of microarrays (SAM) software. [http://www-stat.stanford.edu/~tibs/SAM/]
 2. Lee MLT, Whitmore GA: Power and sample size for microarray studies. Statistics in Medicine 2002, 21:3543–3570. 10.1002/sim.1335
 3. Li SS, Bigler J, Lampe JW, Potter JD, Feng Z: FDR-controlling testing procedures and sample size determination for microarrays. Statistics in Medicine 2005, 24:2267–2280. 10.1002/sim.2119
 4. Muller P, Parmigiani G, Robert C, Rousseau J: Optimal sample size for multiple testing: the case of gene expression microarrays. J Amer Statist Assoc 2005, 99:990–1001. 10.1198/016214504000001646
 5. Pawitan Y, Michiels S, Koscielny A, Gusnanto S, Ploner A: False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics 2005, 21:3017–24. 10.1093/bioinformatics/bti448
 6. Tsai CA, Wang SJ, Chen DT, Chen JJ: Sample size for gene expression microarray experiments. Bioinformatics 2005, 21:1502–1508.
 7. Warnes G, Liu P: Sample size estimation for microarray experiments. Submitted to Bioinformatics; ssize package in R. 2005.
 8. Wei C, Li J, Bumgartner R: Sample size for detecting differentially expressed genes in microarray experiments. BMC Genomics 2004, 5:1–10.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.