Filtering for increased power for microarray data analysis

Background Due to the large number of hypothesis tests performed during the process of routine analysis of microarray data, a multiple testing adjustment is certainly warranted. However, when the number of tests is very large and the proportion of differentially expressed genes is relatively low, the use of a multiple testing adjustment can result in very low power to detect those genes which are truly differentially expressed. Filtering allows for a reduction in the number of tests and a corresponding increase in power. Common filtering methods include filtering by variance, average signal or MAS detection call (for Affymetrix arrays). We study the effects of filtering in combination with the Benjamini-Hochberg method for false discovery rate control and q-value for false discovery rate estimation. Results Three case studies are used to compare three different filtering methods in combination with the two false discovery rate methods and three different preprocessing methods. For the case studies considered, filtering by detection call and variance (on the original scale) consistently led to an increase in the number of differentially expressed genes identified. On the other hand, filtering by variance on the log2 scale had a detrimental effect when paired with MAS5 or PLIER preprocessing methods, even when the testing was done on the log2 scale. A simulation study was done to further examine the effect of filtering by variance. We find that filtering by variance leads to higher power, often with a decrease in false discovery rate, when paired with either of the false discovery rate methods considered. This holds regardless of the proportion of genes which are differentially expressed or whether we assume dependence or independence among genes. Conclusion The case studies show that both detection call and variance filtering are viable methods of filtering which can increase the number of differentially expressed genes identified. The simulation study demonstrates that when paired with a false discovery rate method, filtering by variance can increase power while still controlling the false discovery rate. Filtering out 50% of probe sets seems reasonable as long as the majority of genes are not expected to be differentially expressed.


Background
Microarrays allow researchers to examine the expression of thousands of genes simultaneously. The primary goal of many microarray experiments is to identify a group of genes that is differentially expressed between two or more conditions. Such "differentially expressed genes" (DEGs) are identified through statistical testing. With tens of thousands of genes represented on an array and one or more hypotheses being tested for each gene, a multiple testing adjustment is certainly warranted. For expression studies involving microarrays, it has become common practice to focus on control of the false discovery rate (FDR). The false discovery rate is the expected proportion of incorrect rejections among the rejected hypotheses. Let V be the number of truly null hypotheses that are rejected and R be the total number of hypotheses that are rejected. Let Q be defined as V/R when R > 0 and let Q = 0 if R = 0. FDR is then defined as FDR = E(Q) [1].
Many procedures are available for estimating or controlling FDR. Benjamini and Hochberg proposed an intuitive procedure for controlling FDR [1]. Storey and Tibshirani offer the q-value method to estimate the FDR [2]. The qvalue is a measure of significance in terms of FDR. The qvalue of a particular feature (gene) is the expected proportion of false positives among all features as extreme or more extreme than the observed one. The q-value method uses an estimate of  0 , the proportion of p-values that correspond to tests in which the null hypothesis is true. Both the Benjamini-Hochberg and q-value methods are based on the assumption that the distribution of p-values corresponding to truly null hypotheses (the null distribution) follows a uniform distribution between zero and one. Additional FDR methods have been proposed by many authors, but we find the Benjamini-Hochberg and q-values methods to be the most commonly used methods.
FDR methods offer a substantial increase in power over methods that control family-wise error rate. However, low power can still be a problem when the proportion of differentially expressed genes is relatively low. In addition, researchers using standard manufactured arrays (i.e. Affymetrix GeneChips) have no control over the number of genes represented on the array. For example, the ATH1 (Arabidopsis) GeneChip contains approximately 22,500 probe sets, the MGU430 (mouse) GeneChip contains approximately 45,000 probe sets and the Wheat Gene-Chip contains roughly 61,000 probe sets. Hence, situations can arise where the number of tests is very large but the proportion of differentially expressed genes is relatively low, resulting in low power even when using an FDR method.
Filtering methods can be used to reduce the number of tests and therefore increase the power to detect true differences. An ideal filtering method would remove tests which are truly null (corresponding to genes that are equally expressed), while leaving those tests corresponding to genes which are truly differentially expressed. Several methods for filtering have been suggested including filtering by variance, signal, and MAS detection call.
All filtering methods discussed here can be applied without using information about treatment assignments. When filtering by variance, we remove genes with low variance across arrays (ignoring treatment). The rationale is that expression for equally expressed genes (EEGs) should not differ greatly between treatment groups, hence leading to small overall variance. The goal of filtering by signal is to filter out genes that have signal close to background level. Genes with low average signal (ignoring treatment) are removed. Filtering by MAS detection (or Present/ Absent) call is a common choice of investigators using Affymetrix GeneChips. The MAS detection call algorithm is based on the use of the Wilcoxon Signed Rank test to compare PM (Perfect Match) and MM (Mismatch) probes within a probe pair. A "call" of Present, Absent or Marginal is made for each probe set [3]. The idea of filtering by detection call is that if a transcript is not present in any sample, then clearly it cannot be differentially expressed. Hence, we filter out probe sets that are called Absent on all arrays.

Results
In order to evaluate the effect of filtering, we use three case studies as well as a simulation study. All programming was done in R using Bioconductor [4,5].
For the three case studies, we examine the effect of three filtering methods (variance, signal and detection call) as well as the results when no filtering is done. In order to facilitate direct comparisons between the filtering methods, we selected the same number of probe sets to be filtered out for all filtering methods.
Specifically, we found the number of probe sets not called Present on any array in a given experiment and hence filtered out by the detection call method. We then fix this to be the number of probe sets filtered out by the variance and signal filtering methods as well. In addition to the various filtering and FDR methods, we consider the RMA, MAS5 and PLIER methods for preprocessing. We note that all testing was done using expression values on the log 2 scale. However, we examined the effect of filtering by variance on both the log 2 and "original" scales. A 0.05 significance level was used for all methods.
For the simulation study, we start with simulated expression data and focus on the effect of filtering by variance. A 0.05 significance level was used for all methods.

Case Study: Wheat Data
A study was conducted to examine gene expression of resistant and susceptible lines of wheat grown in the presence and absence of the Russian wheat aphid. The Affymetrix GeneChip Wheat genome array (containing 61,290 probe sets representing 55,052 transcripts for all 42 wheat chromosomes) was used for this study. RNA samples were collected from wheat plants in 2 × 2 factorial design. The design was originally balanced, but one array was dropped due to concerns about array quality. Each array represents a pooled sample from five seedlings. The data used here consists of 11 arrays: 3 arrays representing the resistant wheat variety in the absence of the Russian wheat aphid, 2 arrays representing the resistant wheat variety in the presence of the Russian wheat aphid, 3 arrays representing susceptible wheat variety in the absence of the Russian wheat aphid, and 3 arrays representing the susceptible wheat variety in the presence of the Russian wheat aphid.
For the purposes of this paper, we focus on two comparisons of interest: (1) comparison of gene expression of the resistant wheat line in the presence and absence of the Russian wheat aphid and (2) comparison of gene expression of the resistant and susceptible wheat lines in the absence of the Russian wheat aphid. These two comparisons were selected because the first is expected to yield a large number of DEGs while the second should yield fewer DEGs. Testing for the two comparisons of interest was performed using an analysis of variance (ANOVA) model and contrasts of factor level means.
In order to facilitate direct comparisons between the filtering methods, we selected the same number of probe sets to be filtered out for all filtering methods. A total of 30,234 probe sets (49%) were not called Present on any of the 11 arrays and were therefore filtered out by the detection call filtering method.
Hence, when filtering by average signal (or variance), the probe sets with the smallest 30,234 average signal values (or variances) were filtered out. Figure 1A gives a histogram of p-values obtained from testing for DEGs for the first comparison with p-values corresponding to the filtered (low variance) probe sets overlaid in gray.
The number of probe sets corresponding to differentially expressed genes identified for each of the combinations of preprocessing (RMA, MAS5 and PLIER), filtering (none, MAS detection call, signal and variance on the log 2 and original scales) and FDR methods (none, Benjamini-Hochberg, q-value) are shown in Table 1 for both wheat comparisons. We see that for a given preprocessing and FDR method, filtering by detection call, signal or variance (on the original scale) leads to an increase in the number of DEGs identified. In contrast, in some cases, filtering by variance on the log 2 scale leads to a decrease in the Histogram of p-values for the Wheat case study and a simulation run  number of DEGs identified (as compared to unfiltered data) for MAS5 and PLIER preprocessing methods.

Case Study: Diabetes Data
A study was conducted to examine gene expression in the cardiac left ventricle using a rodent model of diabetic cardiomyopathy [6]. The Affymetrix Rat GeneChip 230 2.0 array (with 31,099 probe sets) was used for this investigation. RNA samples were collected from the cardiac left ventricles of 7 diabetes induced rats and 7 controls. Each sample was hybridized to a single array. The data can be obtained from the NCBI Gene Expression Omnibus (accession number GSE5606) [7]. A two-sample t-test assuming equal variances was used to identify differentially expressed genes.
Similar to the analysis for the wheat data, we selected the same number of probe sets to be filtered out for all filtering methods. A total of 10,473 probe sets (34%) were called Absent on all 14 arrays and were therefore filtered out by the MAS detection call filtering method. Hence, the same number of probe sets were removed for the other filtering methods. The number of probe sets corresponding to differentially expressed genes for each of the combinations of preprocessing, filtering and FDR methods are found in Table 1. We see that for a given preprocessing and FDR method, filtering by detection call, signal or variance (on the original scale) leads to an increase in the number of DEGs identified. In contrast, filtering by variance on the log 2 scale leads to a decrease in the number of DEGs identified (as compared to unfiltered data) for MAS5 and PLIER preprocessing methods.

Case Study: Smoking Data
A study was conducted to examine gene expression in the lungs of young mice exposed to 14 days of cigarette smoke [8]. The Affymetrix Mouse Genome 430 2.0 array (with 45,101 probe sets) was used for this investigation. RNA samples were collected from the lungs of 6 mice exposed to cigarette smoke and 4 controls. Each sample was hybridized to a single array. The data can be obtained from the NCBI Gene Expression Omnibus (accession number GSE7310) [7]. A two-sample t-test assuming equal variances was used to identify differentially expressed genes.
A total of 19,471 probe sets (43%) were called Absent on all 10 arrays and were therefore filtered out by the MAS detection filtering method. Hence, the same number of probe sets were removed for the other filtering methods. The number of probe sets corresponding to differentially expressed genes for each of the combinations of preprocessing, filtering and FDR methods are found in Table 1. We see that for a given preprocessing and FDR method, filtering by detection call or variance (on the original scale) leads to an increase in the number of DEGs identified. In contrast, filtering by variance on the log 2 scale leads to a decrease in the number of DEGs identified for MAS5 and PLIER preprocessing methods. We also observe a decrease in the number of DEGs identified when signal filtering is paired with RMA preprocessing.

Simulation Study
We simulated expression data under two models: when signal values between genes are independent and when the signal values between genes follow a "clumpy dependence" [9]. The data was simulated to correspond to two groups of five samples (arrays) with signal values generated for 50,000 genes for each sample. We considered true  0 values of 0.7, 0.8, 0.9, 0.95, and 0.98. A total of 1000 runs were used for each simulation scenario.
The signal value for gene g in sample k in block j and group i, was generated according to the model A proportion,  0 , of genes were randomly selected to have indicator variable I g = 0 (corresponding to EEGs) and the rest of genes have I g = 1 (corresponding to DEGs). The were chosen such that the distribution of the variance of Y ijkg is the same for both the dependent and independent models. Moreover, the distributions of F ig , B jk , and Z ijkg were selected so the distribution of p-values for the simulation study resembles the distribution of p-values seen in case studies. This is supported by the histogram of p-values shown in Figure 1.
For each run of the simulation, t-tests comparing the two groups were performed and the BH and q-value methods were applied, with and without filtering to the 50,000 resulting p-values. The t-tests were performed assuming equal variances for the two groups. Filtering was per- formed by variance, with the 25,000 genes with the lowest variances (ignoring group) being filtered out. An  = 0.05 level of significance was used for all FDR methods. A histograms of the p-values for a single run of the simulation with  0 = 0.9 for the independent case is shown in Figure   1B.

Power
The observed power for each method and each run was calculated as the proportion of true positives that were detected at the stated significance level of  = 0.05. The distribution of observed power for each of the FDR methods with and without filtering are shown in Figure 2 and summarized in Additional file 1 Table S1. As expected, the power for the two FDR methods increases as  0 decreases, demonstrating increased power as a higher proportion of genes are differentially expressed. More importantly, these results show that filtering by variance results in an overall gain in power for both FDR methods considered for both independent and dependent models. The gain in power due to filtering is fairly consistent across the range of  0 values. Not surprisingly, the power under the independent model was less variable than the corresponding power under the dependent model. However, the median power for a given value of  0 is about the same for independent and dependent models. Not unexpectedly (since BH is an FDR controlling procedure and therefore more conservative) we find that q-value has higher power than the Benjamini-Hochberg method for a given simulation scenario.

False Discovery Rate
The observed FDR for each method and each run was calculated as the proportion of false positives among the rejected hypotheses. This observed FDR was compared to the nominal FDR level of 0.05. The distribution of the observed false discovery rate for each of the simulation scenarios are shown in Figure 3 and summarized in Additional file 2 Power results from simulation study Figure 2 Power results from simulation study. Boxplots of observed power of tests for differential expression (0.05 significance level) for simulation runs both with and without variance filtering. Plots (A) and (B) are boxplots for simulation scenarios with a "clumpy dependence" among genes using the Benjamini-Hochberg (BH) and q-value methods, respectively. Plots (C) and (D) are boxplots for simulation scenarios modeling independence among genes using the BH and q-value methods, respectively.

Analysis of Different Filtering Thresholds
We examined the effect of different thresholds when filtering by variance. The observed power and FDR for a simulation run of the independent model with  0 = 0.80 across a range of variance quantiles (ranging from 0.05 to 0.95) is shown in Figure 4A and 4B. For instance, if the variance quantile is 0.10, then 10% of genes (with the lowest variances) are filtered out for BH and q-value methods.
For both FDR methods, the power increases as an increasing proportion is filtered out (corresponding to an increasing quantile) until the proportion (quantile) gets close to  0 . At the same time, the observed FDR for these methods stays close to or below the  level of 0.05. As the quantile used for the threshold becomes close to  0 , the power begins to decrease. This suggests that we are starting to remove genes that are truly differentially expressed. Hence both the BH and q-value methods have improved power (while still maintaining a desirable FDR level) if filtering is done at a level somewhat close to, but well below,  0 . Similar results were obtained for the dependent models.
We also examined the effect of filtering with different thresholds for the three case studies. The number of DEGs found when varying the proportion of genes filtered out for wheat comparison 1 (using RMA preprocessing paired with filtering by variance on the log 2 scale) is also shown in Figure 4C. For this comparison, the number of DEGs identified gradually increased for both Benjamini-Hochberg and q-value methods as the proportion filtered out increased until a threshold of about 0.60. The quantile at which the number of DEGs began to decrease is close to the q-value estimate of  0 ( 0 = 0.62). Similar results were seen for the other case studies and preprocessing methods, but these results are not shown here.

Discussion
McClintick and Edenberg previously studied the effects of filtering by MAS detection call and signal in combination with MAS5 and RMA preprocessing methods [10]. They recommend filtering out probe sets that are not called Present in at least 50% of samples in at least one treatment group. When using signal as a filtering criteria, they filtered out probe sets that did not have average signal greater than some threshold in at least one treatment  FDR results from simulation study Figure 3 FDR results from simulation study. Boxplots of observed FDR of tests for differential expression for simulations runs both with and without variance filtering. Plots (A) and (B) are boxplots for simulation scenarios with a "clumpy dependence" among genes using the Benjamini-Hochberg (BH) and q-value methods, respectively. Plots (C) and (D) are boxplots for simulation scenarios modeling independence among genes using the BH and q-value methods, respectively. The nominal FDR level = 0.05 is represented by the horizontal dashed lines on plots. group. Instead of filtering out probe sets that are not called Present in at least 50% of samples for at least one treatment group, we filtered out probe sets that were not called Present for any samples. A benefit of this method is that no knowledge of treatment assignments is used for filtering. In addition, in our experience, for moderately sized experiments (20 arrays or less) this method removes the vast majority of probe sets that would be removed using the 50% rule. However, as the number of arrays increases, it becomes more likely that a probe set corresponding to a truly unexpressed transcript will be called Present on at least one array just by chance. Hence we could see more dramatic differences between the two methods for larger experiments.
In their analysis, McClintick and Edenberg found filtering by MAS detection call to be superior to filtering by signal because it results in decreased FDR. Their logic for filtering out Absent called genes is clear, "Data for genes not actually expressed represent experimental noise and cannot increase true positives, but can (and do) generate false positives." While this is true, we must bear in mind that the MAS detection call is itself a statistical test and the truth of which genes are unexpressed is unknown. In addition, filtering by MAS detection call is not an option for spotted cDNA arrays or other types of manufactured arrays besides Affymetrix GeneChips.
We consider three different filtering methods in combination with two FDR methods and three preprocessing methods. For all case studies, preprocessing methods and FDR methods examined, filtering by detection call and variance (on the "original scale") increased the number of DEGs identified when compared to unfiltered data. In one case, filtering by signal (when paired with RMA preprocessing) lead to a decrease in the number of DEGs identified. In most cases, filtering by variance on the log 2 scale in combination with MAS5 and PLIER methods actually lead to a decrease in the number of DEGs identified. This is surprising since testing was conducted on the log 2 for all methods.
Filtering by variance using different thresholds for wheat case study and simulation study One reason the log 2 transformation is used is to stabilize the variance. However, it seems that for MAS5 and PLIER, this transformation over corrects and leads to increased variance for low expression transcripts. The result is that on log 2 scale, high expression genes tend to have relatively low variances.
In addition to the relationship between signal and variance, there is a tendency for high expression genes to be over-represented in the list of DEGs. To examine this, we calculated the proportion of DEGs (using a significance level of 0.05 without filtering or applying any multiple testing adjustment) that had average signal in the top 50%. Hence, if there was no relationship between average signal and significance, we would expect 50% of DEGs to have average signal in the top 50%. The actual proportions varied by case study and preprocessing method ranging between 45% and 84%. In only one case (PLIER applied to the smoking data), was this percentage less than 50%.
These relationships between signal and variance and signal and significance lead to removal of high expression genes when using the MAS5 or PLIER methods and filtering by variance on the log 2 scale. Since highly expressed genes are more likely to be identified as DEGs, then this filtering method tends to filter out genes that are likely to be differentially expressed. Filtering by variance on the original scale works better for these methods, even when testing is done on the log 2 scale. This can be seen by examining the histogram of p-values corresponding to those genes filtered out by variance (not shown). The distribution of p-values more closely approximates a uniform distribution when filtering by variance is done on the original scale for MAS5 and PLIER. We suggest that whatever filtering method researchers choose, they examine the distribution of p-values corresponding to those genes filtered out.
Filtering by detection call and variance (on the original scale) consistently led to an increase in the number of differentially expressed genes identified. This was true for both cases where a large proportion of genes are differen-tially expressed (i.e. wheat comparison 1) and a small proportion of genes are differentially expressed (i.e. Smoking data). However, we note that for other data sets we examined we were not able to identify any DEGs (using a multiple testing adjustment) either with or without filtering. It is possible that some of these are cases where no genes are differentially expressed. On the other hand, it could be that even after filtering, the power was still too low. Either way, if no DEGs were identified to begin with, there is certainly no harm in attempting filtering.
The simulation study focuses on filtering by variance. We note that the simulated data does not exactly mimic observed microarray results. Specifically, we did not consider the relationships between signal and variance and signal and significance. In addition, the simulation study applies filtering by variance on the same scale as testing and does not represent a specific preprocessing method. Because of these issues, there may be concerns about the generalizability of the simulation results. The key issues for extending the simulation results are the full distribution of p-values, the null distribution of p-values and the distribution of filtered out p-values. Regarding the full distribution of p-values, we choose simulation parameters to generate realistic distributions. Regarding the null distribution of p-values, we examined simulation scenarios that represented both dependent and independent cases. Regarding the distribution of filtered out p-values, we note that for both the case studies and the simulation, there were significant departures from the uniform distribution based on the Kolmogorov-Smirnov test (data not shown). Specifically, for all case studies, preprocessing methods and filtering methods, the K-S test rejected the assumption of uniformity (of the filtered out p-values) at the 0.05 significance level. For the simulations studies, the assumption of uniformity (of the filtered out p-values) was rejected more than 5% of the time at the 0.05 significance level (i.e. for  0 = 0.9 case, the assumption was rejected for 45% of independent runs and 82% of dependent runs). However, the departures from the uniform distribution seemed to be larger for the observed data.
Based on our simulation study, we find that filtering by variance results in increased power without an increase in the observed FDR when paired with BH or q-value methods. While only filtering by variance was used in the simulation study, it is expected that similar results could be found if filtering by detection call had been explored. This is supported by the large overlap in the number of probe sets identified by both the variance and detection call filtering methods for the case studies. Based on the three case studies examined, the percentage overlap in DEGs identified using detection call and variance filtering was consistently above 80% for all preprocessing methods and FDR methods (data not shown). This is based on variance filtering on the original scale for MAS5 and PLIER, but on both the original and log 2 scales for RMA.
While filtering by MAS detection call leads to some natural thresholds (i.e., filtering out probe sets which are not called Present on any array), it is not clear how to choose a threshold when filtering by variance. For the simulation, we removed 50% of the genes. As long as the majority of genes are not differentially expressed, then this seems like a reasonable choice. When we examined the effect of varying the proportion filtered out, we found that the power increased until the proportion filtered out approached  0 .
A similar effect was observed for the case studies when using 0 from the q-value method. Since a common assumption of microarray analysis is that the majority of genes will not be differentially expressed, filtering 50% of the values should be reasonable in most cases. As an example, when we filter out 50% of values by variance for the Diabetes data (for which  0 is estimated to be between 0.77 and 0.88 depending on preprocessing method) we see consistent gains in the number of DEGs identified as compared to the values presented in Table 1 (data not shown).
The filtering methods examined in this paper can be applied to data with any number of treatment groups. We note that in cases when there are three or more treatment groups, the global F-test could also be used for filtering. Specifically, those genes which do not pass the F-test would be removed from further testing (i.e. pairwise comparisons). A concern with this method is the need to control the overall error rate. Since false rejections when performing the F-test will affect false rejections when performing further testing, the FDR of the whole procedure must be controlled. Jiang and Doerge suggest a two-step procedure to control the overall FDR [11]. Though the two-step procedure is only appropriate for experiments involving three or more treatment groups, if there are more than three treatment groups, it becomes very complex because the possible configurations of means of the factor levels must be determined to apply the two-step procedure.
In this paper, we focus on the use of filtering to increase the number of differentially expressed genes identified in gene expression studies when using an FDR method. However, not all researchers use FDR to identify a group of differentially expressed genes. Recently, the MicroArray Quality Control (MAQC) project concluded "that a straightforward approach of fold-change ranking plus non stringent P cutoff can be successful in identifying reproducible gene lists" [12]. We believe that this method of identifying DEGs by using a p-value cutoff followed by ranking genes by absolute fold change can be improved by considering the false discovery rate. In particular, an estimate of the FDR can aid in the selection of an appropriate significance cutoff, one that will help control the number of false positives.

Conclusion
The need for the multiple testing adjustments to microarray data is well established. However, after applying an FDR method, the number of differentially expressed genes that are identified in the analysis is often greatly reduced and when the number of true DEGs is small relative to the number of tests, applying a multiple testing adjustment can result in a substantial loss in power. In this paper we examine the effect of filtering out probe sets in order to increase power. Three filtering criteria were considered: MAS detection call, variance, and average signal. Our analysis also considered the performance of two FDR methods (Benjamini-Hochberg and q-value) and three preprocessing methods (RMA, MAS5 and PLIER).
For the case studies considered, filtering by detection call and variance (on the original scale) consistently led to an increase in the number of DEGs identified. On the other hand, filtering by variance on the log 2 scale had a detrimental effect when paired with MAS5 and PLIER preprocessing methods, even when the testing was done on the log 2 scale. For a fixed preprocessing and FDR method, the DEGs identified with filtering by detection call and variance filtering (on the original scale for MAS and PLIER or either scale for RMA) were largely the same.
While we saw an increase in the number of DEGs identified for the case studies when filtering by variance was used in combination with an FDR method, we cannot determine whether this is due to an increase in power or false discovery rate. Hence a simulation study was performed to examine the issues of power and false discovery rate. The simulation study demonstrates that filtering by variance (with the median of the variances of the genes as a threshold) improves the power over a range of null proportions for the two FDR methods considered. The qvalue method has higher power than BH in all the cases considered both with or without filtering. The observed FDR is maintained close to or below the stated level for both FDR procedures. Overall, filtering by variance can effectively increase power while maintaining the stated FDR and performs especially well when paired with qvalue method.
Finally, we examined the effect of various thresholds for variance filtering. We found that filtering out 50% of probe sets seems reasonable as long as the majority of genes are expected to be equally expressed. This assump- tion can be checked based on the estimate of  0 provided by the q-value method.

Preprocessing Methods
All preprocessing was carried out in R using BioConductor. MAS5 [3] and RMA [13] expression indices were calculated using the affy package [14]. PLIER [15] expression indices were calculated using the plier package. We note that RMA and PLIER expression indices are calculated on the log 2 scale, so when we discuss the "original" scale for those methods, values have been transformed using f(x) = 2 x .

FDR Methods
Benjamini and Hochberg proposed a simple adjustment to the p-values from hypotheses tests to control the overall false discovery rate. Suppose one is testing m hypotheses resulting in m p-values. Let p (1)  p (2)  ʜ  p (m) be the ordered p-values with the corresponding hypotheses H (1)  H (2)  ʜ  H (m) . Let k be the largest i such that .
For a specific feature (gene), the q-value is the expected proportion of false positives among all features as extreme or more extreme than the one observed. Suppose one is testing m hypotheses and obtains m p-values, p 1 , p 2 ,..., p m , corresponding to these m hypotheses. If we assume that pvalues are uniformly distributed under the null hypothesis, then an estimate of the FDR is given by: where t is the level (threshold) at which you would like to control FDR, 0 is an estimate for the proportion of truly null hypothesis. The q-value of a feature i is estimated as The qvalue package [17] was used to calculate q-values.
We note that BH is an FDR controlling procedure (providing an upperbound on FDR), while q-value is an FDR estimation method. Because of this, BH is more conservative than the q-value method in most situations. This is reflected in Figures 2 and 3, where for a given simulation scenario, both the power and observed FDR tend to be lower for BH as compared to q-value. A more thorough comparison and discussion of these two FDR methods (as well as others) can be found in [9].

Filtering Methods
Three methods for filtering were considered in our analysis. If a probe set was "filtered out" by a particular method, the p-value for that probe set was not passed through to the FDR method and it could not be called differentially expressed.
When filtering by variance, the variance of signal values (ignoring treatment assignments) is calculated for each probe set. Probe sets are then ranked by variance, and the probe sets falling below some threshold are filtered out. For the simulation study, 50% of probe sets were filtered (except where otherwise noted). We note that for the case studies, filtering by variance was done on both the original and log 2 scales.
When filtering by signal, the mean signal (ignoring treatment assignments) is calculated for each probe set. Probe sets are then ranked by mean signal, and the probe sets falling below some threshold are filtered out. We note that average signal was calculated on the log 2 scale.
Filtering using the MAS detection call only applies when using Affymetrix arrays. For each probe set on each array, a detection call of Present, Absent or Marginal is made. The detection call is based on the Wilcoxon signed rank test performed using PM and MM values. Detection calls were made using the affy package [14]. For both the case studies and the simulation study, probe sets that were never called Present on any array (sample) were filtered out.

Authors' contributions
AMH and AJH designed the study and helped to draft the manuscript. AJH did all of the programming.