Affymetrix microarrays are high-throughput assays for measuring the expression levels of thousands of gene transcripts simultaneously. This type of microarray measures the expression of each transcript multiple times through a set of "probe pairs". Since the advent of the Affymetrix microarray, numerous methods have been proposed for producing numerical expression summaries for each transcript based on the probe pair data. Several systematic studies have compared a number of these methods on a common basis (e.g. [1–5]). These studies rely heavily on calibration data sets derived from spike-in, dilution series, and mixture experiments. Our goal here was to carry out a comparative study of Affymetrix array processing methods using data sets from typical biological experiments seeking differentially expressed genes in human tissue samples.

The following seven methods are considered here: Dchip [10], GCRMA-EB and GCRMA-MLE [11], MAS5 [12], PDNN [13], RMA [2, 3], and TM ([6, 7]; see also http://dot.ped.med.umich.edu:2000/pub/shared/Affymethods.html). While not every popular method is included in our study, several highly distinctive and original approaches are represented. For example, Dchip was one of the first approaches to attempt to learn probe weights directly from the probe data, and RMA pioneered the approach of disregarding the control mismatch probes. PDNN uses physical modeling to determine probe weights, while the two GCRMA methods use the GC content of the probe sequences to reduce variance in the mismatch (control) probe levels. The MAS5 method is the current default method provided by Affymetrix.

In addition to the six methods cited previously, we also include a method designated TM (trimmed mean). This is a simple method that has been used in a number of published investigations (e.g. [6, 7]), but has not been considered in any previous systematic comparison of Affymetrix processing methods. To produce the probe-set summary score, the differences between the perfect match (PM) and mismatch (MM) intensities of each probe pair are rank ordered, and the largest 20% and smallest 20% of these PM-MM differences are deleted. The mean of the remaining values is used as the summary score. The scores for all probe-sets are then quantile normalized to a reference array using a piecewise linear spline with 100 knots.
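The TM procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the function names, the rounding rule for the number of trimmed values, and the placement of the 100 knots at evenly spaced quantiles are all assumptions made here for concreteness.

```python
import numpy as np

def trimmed_mean_summary(pm, mm, trim=0.2):
    """Summary score for one probe set from its PM and MM intensities.

    Sorts the PM-MM differences, drops the top and bottom `trim` fraction,
    and averages the rest. Rounding the trim count down is an assumption.
    """
    diffs = np.sort(np.asarray(pm, dtype=float) - np.asarray(mm, dtype=float))
    k = int(np.floor(trim * diffs.size))   # values dropped from each end
    kept = diffs[k:diffs.size - k]
    return kept.mean()

def quantile_normalize_to_reference(scores, reference, n_knots=100):
    """Map one array's summary scores onto the distribution of a reference array.

    Knots are placed at evenly spaced quantiles (an assumption); between
    knots the mapping is linear, giving a piecewise linear spline.
    """
    scores = np.asarray(scores, dtype=float)
    probs = np.linspace(0.0, 1.0, n_knots)
    x_knots = np.quantile(scores, probs)      # quantiles of the array being normalized
    y_knots = np.quantile(reference, probs)   # matching quantiles of the reference
    return np.interp(scores, x_knots, y_knots)
```

With ten probe pairs, for example, the two largest and two smallest differences are discarded and the remaining six are averaged, so a single outlying probe pair has no influence on the summary score.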

An important feature of this study is the use of False Discovery Rate (FDR) to quantify the sensitivity of a processing method in terms of its ability to distinguish differentially expressed genes from genes having invariant expression. This is a highly relevant property, as differential expression analysis is the most common application of microarray data. A key advantage of using FDR to compare processing methods is that FDR values can be calculated accurately using real disease profiling data where the identities of differentially expressed genes are uncertain. In contrast, most previous systematic comparisons of array processing methods have focused on calibration data sets in which concentrations of certain genes were experimentally manipulated.

When it is highly likely that at least one gene is differentially expressed, the false discovery rate may be defined as the expected ratio of the number of false positive calls to the total number of positive calls in a differential expression analysis between two groups of samples [8]. If the groups are biologically distinct, a sensitive processing method should result in many genes with low FDR. Thus, to compare the performance of different array processing methods, we looked at two data sets in which a verified biological characteristic divided the samples into two classes, and compared the methods based on the number of genes having FDR smaller than various thresholds. For this to be a valid basis for comparison, the FDR values must be estimated with reasonable accuracy. Following other recent work (e.g. [9]), we used a permutation approach for this estimation, on the grounds that there is no reason to expect this approach to favor or disfavor any particular array processing method.
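To make the permutation approach concrete, the following Python sketch estimates the FDR at a given test statistic threshold. It is an illustration under stated assumptions, not the procedure used in this study: the choice of a Welch t-statistic, the function names, the default number of permutations, and the use of a single fixed threshold are all hypothetical.

```python
import numpy as np

def t_statistics(expr, labels):
    """Welch t-statistic per gene; rows of `expr` are genes, columns are samples."""
    a, b = expr[:, labels], expr[:, ~labels]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    return (a.mean(axis=1) - b.mean(axis=1)) / se

def permutation_fdr(expr, labels, threshold, n_perm=200, seed=0):
    """Return (expected false positives, positive calls, estimated FDR)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=bool)

    # Denominator: genes called in the actual data.
    n_called = int((np.abs(t_statistics(expr, labels)) >= threshold).sum())

    # Numerator: average number of genes called after shuffling the group
    # labels, which breaks any real association with the sample classes.
    null_counts = [
        (np.abs(t_statistics(expr, rng.permutation(labels))) >= threshold).sum()
        for _ in range(n_perm)
    ]
    expected_false = float(np.mean(null_counts))

    if n_called == 0:
        return expected_false, 0, float("nan")
    return expected_false, n_called, min(1.0, expected_false / n_called)
```

Running this on the expression matrix produced by each processing method, with the same sample labels and thresholds, gives counts of called genes at matched FDR levels that can be compared across methods.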

A small FDR results from a small numerator (the expected number of false positive calls), a large denominator (the total number of positive calls), or both. The denominator of the FDR depends on the actual data distribution, so variation in this value may be due to factors such as accuracy in modeling the physical and chemical nature of probe binding. Variation in the FDR numerator, however, depends only on the distribution of values produced for randomized data, a purely statistical quantity reflecting the tendency of the method to incorrectly produce test statistic outliers. Our results suggest that both factors are important in determining sensitivity. The best methods produce many large test statistic values in the actual data, and also produce consistently small test statistic values for randomized data. The poor performance of one method can be directly explained by its tendency to produce outlier expression values, leading to greater numbers of incorrectly large test statistics.

For overall comparison, we evaluated every pair of methods on the basis of whether the first method is expected to call at least one truly differentially expressed gene that is not also called by the second method. If this is not expected to occur, the second method is said to *strongly outperform* the first. Based on this comparison, two of the methods considered are clearly favored, two are inferior, and results for the other three methods are mixed.