### Biological data

For the purposes of this study, we used the St. Jude Children's Research Hospital (SJCRH) Database on childhood leukemia, which is publicly available on the SJCRH website under the Supplemental Data section [12]. The whole SJCRH Database contains gene expression data on 335 subjects, each represented by a separate array (Affymetrix, Santa Clara, CA) reporting measurements on the same set of 12558 genes. We selected two groups of patients with hyperdiploid (Hyperdip) and T-cell acute lymphoblastic leukemia (TALL), respectively. The groups were balanced to include 43 patients each. Since the nature of our study was purely methodological, the choice of the data set was quite arbitrary; it was dictated solely by sample size considerations. The microarray data thus chosen were background corrected and normalized using the Bioconductor RMA software, which implements the quantile normalization procedure [13, 14] carried out at the probe feature level. After normalization, each gene is represented in the final data set by the logarithm (base 2) of its expression level.
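The RMA pipeline itself is an R/Bioconductor tool, but the quantile normalization step it performs can be sketched in a few lines. The snippet below is a simplified illustration only (ties are broken arbitrarily and the probe-level details of RMA are omitted); all function and variable names are our own, not the Bioconductor API.

```python
import numpy as np

def quantile_normalize(X):
    """Force every array (column of the genes-by-arrays matrix X)
    to share the same empirical distribution of values."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # per-column ranks
    reference = np.sort(X, axis=0).mean(axis=1)        # mean quantile curve
    return reference[ranks]

rng = np.random.default_rng(0)
raw = rng.lognormal(size=(1000, 6))          # 1000 genes on 6 arrays
log_expr = np.log2(quantile_normalize(raw))  # log2 expression levels
```

After this step, every column of `log_expr` has an identical empirical distribution, which is the defining property of quantile normalization.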

### Simulated data

Our simulation study was designed to illustrate the effect of correlation on the performance of gene selection procedures. We simulated 2*n* independent multivariate normal random vectors with an exchangeable correlation structure, each representing the log-intensities of 1255 genes, of which the first 125 were designated to be differentially expressed. Two sets of simulations were conducted, with the sample size chosen to be *n* = 15 and *n* = 43, respectively. We use the following self-explanatory notation for the four sets of simulated data: SIM15, SIM15CORR, SIM43, SIM43CORR. In total, 200 independent data sets, each consisting of 2*n* simulated vectors, were generated for each sample size. The marginal distributions of the log-intensities of "Not Different" genes were standard normal, while the log-intensities of "Different" genes followed the normal distribution with mean two and unit variance.

The exchangeable pairwise correlation structure was superimposed on the normal vectors with independent components as discussed in [1]. Briefly, we first generate a 1255 × 2*n* matrix with each entry being an independent realization of a standard normal random variable. To model a set of "Different" genes, we add a value of 2 to the first 125 rows in the first group and denote the resultant matrix by *X* = {*x*_{ij}}, *i* = 1, ..., 1255; *j* = 1, ..., 2*n*. All the elements *x*_{ij} of this matrix are stochastically independent, but those with *i* = 1, 2, ..., 125 and *j* = 1, 2, ..., *n* are normally distributed with mean 2 and unit variance. Expression levels of the genes outside this special set of 125 genes follow the standard normal distribution. Next we generate a 2*n*-dimensional random vector with independent and identically distributed components, each component having a standard normal distribution. Denote this vector by *A* = {*a*_{j}}, *j* = 1, ..., 2*n*. Define *y*_{ij} = √(1 − *ρ*) *x*_{ij} + √*ρ* *a*_{j}, *i* = 1, ..., 1255; *j* = 1, ..., 2*n*, so that for any *i*_{1} ≠ *i*_{2} and *j* we have corr(*y*_{i1,j}, *y*_{i2,j}) = *ρ*. In the present study, the correlated data were generated for a single value of the correlation coefficient, *ρ* = 0.6. This high correlation coefficient was chosen to demonstrate the effects of correlation more clearly. However, this value is not overly unrealistic, because the mean (over all gene pairs) correlation coefficient estimated from the raw data referred to in Section 2.1 is equal to 0.72.
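One simulated data set under this construction can be generated as follows. This is a minimal sketch: the matrix dimensions, the shift of 2, and *ρ* = 0.6 follow the text, while the function and variable names are our own.

```python
import numpy as np

def simulate_dataset(n=43, rho=0.6, n_genes=1255, n_diff=125, seed=1):
    """One simulated data set: a genes-by-arrays matrix with exchangeable
    pairwise correlation rho between genes on the same array."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_genes, 2 * n))   # independent entries x_ij
    X[:n_diff, :n] += 2.0                       # "Different" genes, group 1
    A = rng.standard_normal(2 * n)              # shared array-level effect a_j
    # y_ij = sqrt(1 - rho) * x_ij + sqrt(rho) * a_j
    return np.sqrt(1.0 - rho) * X + np.sqrt(rho) * A

Y = simulate_dataset()                          # SIM43CORR-style data set
```

Because every gene on array *j* shares the same term *a*_{j}, any two distinct genes have correlation *ρ* across arrays, while each gene keeps unit marginal variance.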

In order to see whether or not the stability of gene selection is related to power, our exploratory simulations were conducted under two different scenarios. Under the first scenario, the sample size was small (*n* = 15), so that the power was lower than 100%. Under the second scenario, the sample size was sufficiently large (*n* = 43) to attain 100% power.

### Resampling techniques

When analyzing biological data, we used a subsampling version of the delete-*d* jackknife method [5, 15, 16], which is technically equivalent to leave-*d*-out cross-validation. It can be proven that if √*n*/*d* → 0 and *n* − *d* → ∞, then the delete-*d* jackknife is consistent for the median [15]. Therefore, the general recommendation is to leave out more than *d* = √*n* but much fewer than *n* observations. A similar recommendation holds for the variance [15, 16]. We used *d* = 7 to perturb the data set, which is only slightly greater than √*n* ≈ 6.56 for *n* = 43, to be as close as possible to the most widely used delete-one version of the jackknife. It should be noted that the delete-1 jackknife method may be inconsistent for some estimators (sample quantiles being a typical example), and the delete-*d* jackknife was proposed to remedy this problem [16]. When implementing the delete-*d* jackknife method, we resorted to sampling without replacement because the empirical Bayes method is very sensitive to ties. As far as subsampling versions of the delete-*d* jackknife method are concerned, the schemes with and without replacement are essentially identical when the number of subsamples is large [16].
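Drawing the delete-*d* subsamples without replacement within a group can be sketched as follows; the helper name is ours.

```python
import numpy as np

def jackknife_subsamples(n, d, B, seed=2):
    """B delete-d jackknife subsamples: each keeps n - d of the n
    observation indices, drawn without replacement to avoid ties."""
    rng = np.random.default_rng(seed)
    return [rng.choice(n, size=n - d, replace=False) for _ in range(B)]

idx = jackknife_subsamples(n=43, d=7, B=200)  # d = 7 is just above sqrt(43)
```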

The total number of subsamples was typically equal to 200. In a separate study, we ascertained that the results for 1000 subsamples were largely similar. Let *Z* be the number of selected genes. The variance of *Z* is estimated by a resampling counterpart of the jackknife sample variance [16]:

Var(*Z*) = [(*n* − *d*)/(*dB*)] ∑_{j=1}^{B} (*Z*_{n−d,j} − *Z̄*_{n−d})^{2}, with *Z̄*_{n−d} = (1/*B*) ∑_{j=1}^{B} *Z*_{n−d,j},

where *B* is the total number of subsamples (*B* = 200) and *Z*_{n−d,j} is the statistic *Z* evaluated at the *j*th delete-*d* jackknife subsample. The variance of the number of selected genes was used as a criterion of stability of the testing procedures under study. The corresponding distributions were also estimated. Another criterion was the selection stability of each individual gene, measured by the frequency of its selection conditional on the event of being selected in at least one of the subsamples.
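The delete-*d* jackknife variance of *Z* can be computed from the *B* subsample values as below. The (*n* − *d*)/(*dB*) scaling follows Shao and Tu's delete-*d* jackknife variance estimator and is our reading of [16]; the function name is ours.

```python
import numpy as np

def jackknife_variance(Z_sub, n, d):
    """Delete-d jackknife variance estimate from B subsample values of the
    statistic Z.  The (n - d) / (d * B) scaling is an assumption, after
    Shao and Tu's delete-d jackknife estimator."""
    Z_sub = np.asarray(Z_sub, dtype=float)
    B = Z_sub.size
    return (n - d) / (d * B) * np.sum((Z_sub - Z_sub.mean()) ** 2)
```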

### Selection of differentially expressed genes

When resorting to the Bonferroni adjustment, one needs to compute unadjusted *p*-values from the sampling distribution of the test statistic under consideration. For the *t*-test we used quantiles of the Student distribution. Among the distribution-free methods, the Cramér-von Mises test [17] represents an appealing alternative to the Kolmogorov-Smirnov test. The reason is that the granularity of the Cramér-von Mises statistic (which causes granularity of the corresponding *p*-values) is much smaller than that for the Kolmogorov-Smirnov test. As a result, the *p*-values corresponding to the critical region increase much more steeply for the Kolmogorov-Smirnov test than for the Cramér-von Mises test, thereby making the Kolmogorov-Smirnov test less powerful.

To describe the Cramér-von Mises test, consider two independent samples *x*_{1}, *x*_{2}, ..., *x*_{m} and *y*_{1}, *y*_{2}, ..., *y*_{n} from distributions *F*(*x*) and *G*(*x*), respectively, and let *F*_{m} and *G*_{n} be their respective empirical distribution functions. We wish to test the null hypothesis **H**_{0}: *F*(*x*) = *G*(*x*) for all *x* versus the alternative: *F* ≠ *G*. The Cramér-von Mises test statistic for the hypothesis **H**_{0} is defined as the (squared) *L*_{2} distance between *F*_{m}(*x*) and *G*_{n}(*x*):

*W*^{2} = [*mn*/(*m* + *n*)] ∫ [*F*_{m}(*x*) − *G*_{n}(*x*)]^{2} d*H*_{m+n}(*x*),

where *H*_{m+n} denotes the empirical distribution function of the pooled sample.
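Evaluated directly from this definition, with d*H*_{m+n} placing mass 1/(*m* + *n*) on each pooled observation, the statistic can be computed as follows. This sketch is ours and covers only the statistic itself, not the exact-quantile machinery needed for small *p*-values.

```python
import numpy as np

def cvm_2samp(x, y):
    """Two-sample Cramér-von Mises statistic
    W^2 = mn/(m+n) * integral of (F_m - G_n)^2 dH_{m+n}."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    m, n = x.size, y.size
    pooled = np.concatenate([x, y])
    Fm = np.searchsorted(x, pooled, side="right") / m  # F_m at pooled points
    Gn = np.searchsorted(y, pooled, side="right") / n  # G_n at pooled points
    return m * n / (m + n) * np.mean((Fm - Gn) ** 2)
```

For identical samples the statistic is exactly zero, and it grows as the two empirical distribution functions pull apart.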

The asymptotic theory for the Cramér-von Mises test was developed by Anderson and Darling [18], Rosenblatt [19], Anderson [20], and Csörgő and Faraway [21]. The asymptotic approximation of the distribution of *W*^{2} under **H**_{0} is of little utility in microarray data analysis because it is very inaccurate whenever one works with the extremely small *p*-values required by FWER-controlling multiple testing procedures. For example, when *n* = *m* = 43 and the exact *p*-value for the Cramér-von Mises test is equal to 3.928 × 10^{-6}, its asymptotic approximation gives 6.897 × 10^{-6}, a much larger *p*-value. The above-mentioned exact *p*-value of 3.928 × 10^{-6} corresponds to an adjusted (by the Bonferroni method) *p*-value of about 0.049 when testing 12558 hypotheses, as in our application described in Section 3.1. Therefore, one needs an algorithm for computing exact quantiles of the Cramér-von Mises sampling distribution; we used the method proposed by Burr [22] for this purpose. It should be noted that the needed small *p*-values for the Cramér-von Mises test cannot be estimated with sufficient accuracy by permuting the test statistics, because the required number of permutations is astronomical [23] and cannot be accomplished with present-day hardware.

The Westfall-Young step-down algorithm [6] bypasses the stage of computing unadjusted *p*-values and goes directly to the estimation of adjusted *p*-values at a given level of the FWER. We carried out 10,000 permutations to model a null distribution of each test statistic. We also used the multiple testing adjustment proposed by Benjamini and Hochberg [10] and its modification by Benjamini and Yekutieli [11]. The more conservative Benjamini and Yekutieli procedure is warranted with normalized data because the quantile normalization is known to induce negative correlations in microarray data [1]. The nonparametric empirical Bayes method by Efron et al. [7–9] was one more method of choice in the present paper. We used kernel smoothing (with the Gaussian kernel) for density estimation to implement the empirical Bayes method. The threshold level of the posterior probability was set at 0.95.
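For concreteness, here is a minimal sketch of the Benjamini-Hochberg step-up procedure and of the Benjamini-Yekutieli variant, which simply replaces *q* by *q* divided by the harmonic sum ∑ 1/*i*. The function names are ours, not a library API.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """BH step-up: reject the k smallest p-values, where k is the largest
    i with p_(i) <= i * q / m.  Returns a boolean rejection mask."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # last index meeting the bound
        reject[order[: k + 1]] = True
    return reject

def benjamini_yekutieli(pvals, q=0.05):
    """BY modification: FDR control under arbitrary dependence."""
    m = len(pvals)
    return benjamini_hochberg(pvals, q / np.sum(1.0 / np.arange(1, m + 1)))
```

The BY correction is more conservative because the harmonic sum grows with the number of tests, which is the price paid for validity under dependence such as that induced by quantile normalization.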

To distinguish between different statistical procedures, we use the following notation:

B/t – *t*-test with Bonferroni adjustment;

B/KS – Kolmogorov-Smirnov test with Bonferroni adjustment;

B/CVM – Cramér-von Mises test with Bonferroni adjustment;

WY/t – *t*-test with Westfall-Young algorithm;

WY/KS – Kolmogorov-Smirnov test with Westfall-Young algorithm;

WY/CVM – Cramér-von Mises test with Westfall-Young algorithm;

BH/t – *t*-test with Benjamini-Hochberg adjustment;

BY/t – *t*-test with Benjamini-Yekutieli adjustment;

EB/t – *t*-test with gene selection by nonparametric empirical Bayes method.

### False discovery rate and power

We provide estimates of the FDR only for simulations. We do not report FDR estimates for the biological data because only indirect methods [24, 25] are available in this case; such methods introduce an additional variation in the estimates that is impossible to distinguish from the variation caused by a given selection procedure. In our simulation studies, the true FDR was estimated directly as the proportion of false discoveries among all discoveries. The sample mean (across the 200 simulated data sets) of this nonparametric estimate is reported together with the corresponding standard deviation. Only once (when applying the Kolmogorov-Smirnov test with Bonferroni adjustment to a sample of size *n* = 15) did a procedure make no discoveries, in which case we set the estimated FDR at zero (see [26] for the definition of the positive FDR). Since the expression levels of the 125 differentially expressed genes are identically distributed, the power can be defined as the expected proportion of correct discoveries among the 125 true alternative hypotheses. We provide the usual nonparametric estimates of the power thus defined and its standard deviation.
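With the truth known in simulation (the first 125 genes are the true alternatives), the per-data-set FDR and power estimates described above amount to the following sketch; the names are ours, and the zero-FDR convention for an empty discovery set follows the text.

```python
import numpy as np

def fdr_and_power(rejected, n_alt=125):
    """Empirical FDR and power for one simulated data set whose first
    n_alt genes are the true alternatives."""
    rejected = np.asarray(rejected, dtype=bool)
    n_disc = rejected.sum()
    n_false = rejected[n_alt:].sum()
    # convention: FDR set to 0 when nothing is selected
    fdr = n_false / n_disc if n_disc > 0 else 0.0
    power = rejected[:n_alt].mean()
    return fdr, power
```

Averaging these two quantities over the 200 simulated data sets gives the reported means and standard deviations.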

### Software

The relevant software is included in the Additional Material Files [see additional file 1].