DNA microarrays measuring gene expression continue to grow in popularity, furthering our understanding of the genetic operation of organisms ranging from humans to prokaryotes. Questions remain, however, about how best to interpret the wealth of gene-by-gene transcriptional levels measured by microarrays. Over the past few years, many statistical methods for analyzing gene expression data in the context of gene sets have been proposed to simplify gene expression data analysis and make it more impartial. Gene set methods are designed to aid the investigator in making biological sense of gene expression data by viewing the genes under study in the context of *a priori* identified, biologically relevant gene sets. Gene sets are groups of genes with some common characteristic (e.g. function or physical location in the genome). The most common methods of gene set analysis use either Fisher's Exact Test (FET) [1] or the newer Gene Set Enrichment Analysis (GSEA) method [2, 3]. While FET and GSEA are the most popular, many other methods have also been proposed (see, for example, [4–18] and the reviews of methods provided in [7, 19–21]).

Fisher's Exact Test was among the first methods proposed that used gene sets in the statistical analysis of microarray data. In order to use FET, the genes in the experiment must first be dichotomized by classifying each gene as "up/down-regulated" (differentially expressed) or "not regulated." One method of dichotomization identifies genes whose log-ratios of expression scores have absolute values above a certain cutoff as "up/down-regulated" and those below the cutoff as "not regulated"; see, for example, Schwartz *et al*. [22]. Once the genes have been dichotomized, FET compares the proportion of up/down-regulated genes in the set of interest to the proportion of up/down-regulated genes not in the set of interest. FET then uses the hypergeometric distribution to compute a p-value based on the difference in the two proportions. As recently noted by Allison *et al*. [19] and validated through simulation by Ben-Shaul *et al*. [17], the dichotomization necessary to use FET yields a loss of information, which translates directly into a loss of statistical power, making it more difficult to identify real differences in regulation as statistically significant. Despite this loss of power, FET is still often used in gene set analyses [22, 23].
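The dichotomize-then-test procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in any of the cited tools: the function names, the cutoff of 1.0, the toy log-ratios, and the choice of a one-sided (enrichment) tail are all assumptions made for the example.

```python
from math import comb

def dichotomize(log_ratios, cutoff=1.0):
    """Flag a gene as regulated when |log-ratio| exceeds the cutoff.

    The cutoff value is illustrative only; the text does not fix one.
    """
    return {gene: abs(lr) > cutoff for gene, lr in log_ratios.items()}

def fet_enrichment_pvalue(regulated, gene_set):
    """One-sided FET: probability of seeing at least k regulated genes in the set.

    Uses the hypergeometric distribution with
      N = genes on the array, K = regulated genes overall,
      n = genes in the set,   k = regulated genes in the set.
    """
    N = len(regulated)
    K = sum(regulated.values())
    in_set = [g for g in gene_set if g in regulated]
    n = len(in_set)
    k = sum(regulated[g] for g in in_set)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total

# Toy data (hypothetical): ten genes, three of which pass the cutoff.
log_ratios = {
    "g1": 2.0, "g2": -1.5, "g3": 1.2, "g4": 0.1, "g5": -0.2,
    "g6": 0.3, "g7": 0.0, "g8": -0.4, "g9": 0.5, "g10": -0.1,
}
regulated = dichotomize(log_ratios, cutoff=1.0)
p_value = fet_enrichment_pvalue(regulated, gene_set={"g1", "g2", "g4"})
print(round(p_value, 4))  # 22/120, i.e. 0.1833
```

Note that g4 (log-ratio 0.1) and g9 (log-ratio 0.5) are treated identically once dichotomized, which is the loss of information discussed above.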

Other methods, like GSEA [2, 3], have attempted to improve on FET. These newer methods do not dichotomize; instead, they use the full range of quantitative gene expression data available (i.e. the entire sorted list of log-ratios of gene expression values). We will call these methods *non-cutoff based methods*. In most cases, the newer gene set analysis methods that are alternatives to FET have been developed for human data. As such, they assume that there are multiple replicates of the data (one microarray/chip for each person in an experiment) in order to conduct permutation-based inference. For example, in an experiment comparing humans with a disease phenotype to those without, the statistic of interest (e.g. the GSEA statistic) is first computed using the true phenotypes. Then, subjects are randomly assigned a phenotype status many times, and each time the statistic of interest is recomputed. The statistic computed using the true phenotypes is then compared to the distribution of statistics based on random phenotype assignment in order to estimate the p-value (the measure of statistical significance of the strength of association between phenotype and regulation of the gene set). This type of permutation-based inference has been termed *subject-sampling* [20].
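A minimal sketch of subject-sampling follows. The gene-set statistic used here (the average difference in group means over the genes in the set) is a simple stand-in chosen for brevity, not the GSEA statistic. The function names, the toy expression values, the two-sided comparison, and the add-one correction in the p-value estimate are likewise assumptions of this example.

```python
import random
from statistics import mean

def set_statistic(expr, labels, gene_set):
    """Stand-in statistic: mean over set genes of (case mean - control mean)."""
    diffs = []
    for gene in gene_set:
        values = expr[gene]
        case = [v for v, lab in zip(values, labels) if lab == 1]
        ctrl = [v for v, lab in zip(values, labels) if lab == 0]
        diffs.append(mean(case) - mean(ctrl))
    return mean(diffs)

def subject_sampling_pvalue(expr, labels, gene_set, n_perm=1000, seed=0):
    """Estimate a two-sided p-value by shuffling phenotype labels across subjects."""
    rng = random.Random(seed)
    observed = set_statistic(expr, labels, gene_set)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random phenotype assignment, group sizes preserved
        if abs(set_statistic(expr, shuffled, gene_set)) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)

# Toy data (hypothetical): one chip per subject, 4 controls then 4 cases.
expr = {
    "geneA": [1, 1, 1, 1, 3, 3, 3, 3],
    "geneB": [0, 0, 0, 0, 2, 2, 2, 2],
}
labels = [0, 0, 0, 0, 1, 1, 1, 1]
observed, p_value = subject_sampling_pvalue(expr, labels, ["geneA", "geneB"])
print(observed, p_value)
```

With 4 subjects per group there are only 70 distinct labelings, so even perfectly separated data like this cannot yield an estimated p-value much below 2/70.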

While there are many experimental and analytic similarities in DNA microarray experiments across organisms, prokaryotic organisms require some unique experimental considerations [24]. In particular, DNA microarray experiments on prokaryotes typically have far fewer microarrays (chips) per experimental comparison. For this reason, subject-sampling (permuting across the phenotype) can be impossible, since a moderate number of microarrays is necessary to yield enough distinct permutations to estimate small p-values accurately. So, despite recent criticisms of FET, many of the non-cutoff based methods are not directly applicable to prokaryotic experiments.
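The limitation can be made concrete with a short calculation. Assuming a two-group design and counting each distinct assignment of phenotype labels to chips once, the smallest p-value a permutation test can report is one over the number of distinct labelings. The function name below is hypothetical.

```python
from math import comb

def smallest_permutation_pvalue(n_case, n_control):
    """Number of distinct phenotype labelings and the smallest attainable p-value."""
    labelings = comb(n_case + n_control, n_case)
    return labelings, 1 / labelings

for n in (2, 3, 4, 5, 10):
    count, p_min = smallest_permutation_pvalue(n, n)
    print(f"{n} vs {n} chips: {count} labelings, smallest p = {p_min:.4g}")
```

Under these assumptions, a 2 vs 2 comparison has only 6 labelings (smallest p of about 0.17) and a 3 vs 3 comparison has 20 (smallest p of 0.05), so p-values small enough to survive correction for testing many gene sets are out of reach.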

Two web-based software tools focused on prokaryotes are available for conducting gene set analysis. The FIVA tool [25] uses FET and a variant of FET proposed by Breitling *et al*. [16], which finds the optimal cutoff for "significant" vs. "non-significant" genes for each gene set. The JProGO tool [26] implements FET as well as three non-cutoff based methods that compare the association measures (e.g. log-ratios) of genes in the gene set of interest with the association measures of the genes outside of the gene set of interest. The three non-cutoff based methods are the t-test, the Kolmogorov-Smirnov (K-S) test, and the unpaired Wilcoxon (Mann-Whitney U) test.
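To illustrate what comparing genes inside a set with genes outside it involves, the sketch below computes the three test statistics on a toy list of log-ratios. It is not JProGO's implementation: the function names and data are hypothetical, the unequal-variance (Welch) form of the t statistic is an assumption, and only the statistics are computed, not their p-values.

```python
from math import sqrt
from statistics import mean, variance

def split_by_set(log_ratios, gene_set):
    """Separate log-ratios of genes in the set from those of all other genes."""
    inside = [v for g, v in log_ratios.items() if g in gene_set]
    outside = [v for g, v in log_ratios.items() if g not in gene_set]
    return inside, outside

def welch_t(x, y):
    """Two-sample t statistic without assuming equal variances."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))

def ks_statistic(x, y):
    """Largest absolute gap between the two empirical distribution functions."""
    points = sorted(set(x) | set(y))
    return max(
        abs(sum(v <= p for v in x) / len(x) - sum(v <= p for v in y) / len(y))
        for p in points
    )

def mann_whitney_u(x, y):
    """U for x: number of (x, y) pairs with x > y, counting ties as one half."""
    return sum((a > b) + 0.5 * (a == b) for a in x for b in y)

# Toy data (hypothetical): three set genes with clearly higher log-ratios.
log_ratios = {
    "g1": 2.0, "g2": 1.5, "g3": 1.2,
    "g4": 0.1, "g5": -0.2, "g6": 0.3, "g7": 0.0, "g8": -0.4,
}
inside, outside = split_by_set(log_ratios, {"g1", "g2", "g3"})
print(welch_t(inside, outside), ks_statistic(inside, outside), mann_whitney_u(inside, outside))
```

All three statistics use the full list of log-ratios and require no cutoff; they differ in which kind of departure (a shift in mean, any difference in distribution, or a shift in ranks) they are most sensitive to.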

The non-cutoff based methods implemented by JProGO, however, are less than optimal. Specifically, the t-test requires a normality assumption about the microarray data and is most powerful when testing for a difference in mean log-ratios between genes in and out of the set, unlike other methods, which also test for changes in standard deviation. The Wilcoxon test has received little consideration in the literature for use in gene set analysis, so it is unclear how valid and useful this approach is. The K-S test was considered by Efron and Tibshirani [8], who proposed an alternative measure, the MAXMEAN statistic. In their comparison, MAXMEAN performed better than K-S in both simulation and real data analysis. Efron and Tibshirani also compared MAXMEAN to a weighted K-S test (the aforementioned GSEA statistic), with a similar result.

After demonstrating that MAXMEAN is more powerful than K-S/GSEA, Efron and Tibshirani [8] compare MAXMEAN to two other statistics (which we call SUM and ABSSUM in this work). They argue that neither SUM nor ABSSUM is robust to changes in both the standard deviation and the mean, whereas MAXMEAN has power to detect both types of changes. These arguments are made in the context of experiments for which multiple replicates are available and, thus, subject-sampling is possible. These methods (GSEA, ABSSUM, SUM, and MAXMEAN) have not been evaluated in the context of microarray experiments for which there are few, if any, replicates available.
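The three statistics can be sketched as follows. MAXMEAN is written as the larger of the average positive part and the average negative part of the gene scores in the set, following Efron and Tibshirani's description. The definitions of SUM and ABSSUM used here (the plain sum and the sum of absolute values of the scores) are assumptions made for illustration, since this section does not give their exact form or scaling.

```python
def sum_stat(scores):
    """SUM (assumed form): sum of the gene scores in the set."""
    return sum(scores)

def abssum_stat(scores):
    """ABSSUM (assumed form): sum of the absolute gene scores in the set."""
    return sum(abs(v) for v in scores)

def maxmean_stat(scores):
    """MAXMEAN: larger of the mean positive part and the mean negative part."""
    m = len(scores)
    pos = sum(v for v in scores if v > 0) / m
    neg = sum(-v for v in scores if v < 0) / m
    return max(pos, neg)

# Toy gene-set scores (hypothetical).
shifted = [1, 1, 1, 1]    # every gene moves in the same direction
spread = [2, -2, 2, -2]   # genes move strongly in both directions

for name, scores in (("shifted", shifted), ("spread", spread)):
    print(name, sum_stat(scores), abssum_stat(scores), maxmean_stat(scores))
```

In the second toy set the positive and negative scores cancel, so SUM is zero even though every gene is strongly regulated, while MAXMEAN takes the same value for both sets. This illustrates only the cancellation weakness of SUM; it does not by itself establish the broader power comparison made by Efron and Tibshirani.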

In this work we consider five popular non-cutoff based gene set analysis techniques, originally proposed using subject-sampling (permuting the phenotype and, thus, requiring multiple chips), for use on prokaryotic microarray experiments with few replicates. We first demonstrate how these five methods can be implemented for experiments where subject-sampling is not possible. Then, we conduct a simulation study comparing the five non-cutoff based methods with each other and with FET for their ability to maintain the nominal α (type I error rate; a measure of false positives) while giving high statistical power (a measure of true positives) relative to the other methods. We then compare the non-cutoff based methods to each other and to FET on real microarray data sets obtained from experiments on *Salmonella enterica* serovar Typhimurium (*S. typhimurium*) and *Escherichia coli* K-12 (*E. coli*). Lastly, we consider the biological significance of the findings in light of the different methods used.