 Methodology article
 Open Access
 Published:
A first principles approach to differential expression in microarray data analysis
BMC Bioinformatics volume 10, Article number: 292 (2009)
Abstract
Background
The disparate results from the methods commonly used to determine differential expression in Affymetrix microarray experiments may well result from the wide variety of probe set and probe level models employed. Here we take the approach of making the fewest assumptions about the structure of the microarray data. Specifically, we only require that, under the null hypothesis that a gene is not differentially expressed for specified conditions, for any probe position in the gene's probe set: a) the probe amplitudes are independent and identically distributed over the conditions, and b) the distributions of the replicated probe amplitudes are amenable to classical analysis of variance (ANOVA). Logamplitudes that have been standardized withinchip meet these conditions well enough for our approach, which is to perform ANOVA across conditions for each probe position, and then take the median of the resulting (1  p) values as a genelevel measure of differential expression.
Results
We applied the technique to the HGU133A, HGU95A, and "Golden Spike" spikein data sets. The resulting receiver operating characteristic (ROC) curves compared favorably with other published results. This procedure is quite sensitive, so much so that it has revealed the presence of probe sets that might properly be called "unanticipated positives" rather than "false positives", because plots of these probe sets strongly suggest that they are differentially expressed.
Conclusion
The median ANOVA (1p) approach presented here is a very simple methodology that does not depend on any specific probe level or probe models, and does not require any preprocessing other than withinchip standardization of probe level log amplitudes. Its performance is comparable to other published methods on the standard spikein data sets, and has revealed the presence of new categories of probe sets that might properly be referred to as "unanticipated positives" and "unanticipated negatives" that need to be taken into account when using spikedin data sets at "truthed" test beds.
Background
In this paper we introduce a very simple probelevel procedure for determining differential expression in singlecolor microarray experiments. It is not based upon any particular model for probes sets or gene expression, and depends on just two requirements for each probe set:

a)
under the null hypothesis that a gene is not differentially expressed for specified conditions, for any probe position in the gene's probe set the probe amplitudes are independent and identically distributed (IID) over the conditions, and

b)
at each probe position distributions of replicated probe amplitudes are amenable to classical analysis of variance (ANOVA).
After log transformation followed by withinchip standardization, the resulting withinchip standardized scores (zscores) meet requirement a), and within chips, logs of probe values are reasonably well modeled as being gammadistributed. Since ANOVA is quite robust with respect to the withintreatment distribution, b) holds as well. (Note that we could drop requirement b) and work with a nonparametric version of ANOVA. Classical parametric ANOVA is, however, more powerful than its nonparametric counterparts, so it makes sense to use it whenever feasible.) We hereafter assume that each CEL file's perfect match (pm) values have been log2 transformed, that a gamma distribution has been fit (using the CRAN R [1] fitdistr function) to the transformed data with the lower 0.1% and the upper 1% trimmed off, and that the log scores have been standardized by subtracting the mean of the gamma fit and dividing by the standard deviation of the fit. We will hereafter refer to the results of the transformation process as "standardized probe values" (or "withinchip standardized probe values" when it is important to make it clear that standardization does not take place across chips.) In order to focus on the "first principles" perspective and concepts presented here, we do not perform any background correction or normalization of probe sets in this paper. In practice, of course, doing such preprocessing prior to performing the ANOVAs could improve the effectiveness of the method when applied to experimental data. Withinchip standardization, however, has been carried out because it is in effect a general signal processing calibration procedure which ensures that probe amplitudes can be meaningfully compared across chips. It removes global chip effects which could otherwise be confounded with differential gene expression.
Given a condition we wish to check for differential expression, we first limit the data set to be processed to those chips that are part of the condition. For each probe set we then proceed from one probe position to the next. At each probe position we perform analysis of variance (ANOVA) on the all of the standardized probe values at that position. We apply CRAN R's aov function and retain the pvalue obtained from it. (The R lm function produces the same results, as would an independent two sample, equal variance ttest when two treatments are being analyzed for differential expression.) In Figure 1 we illustrate this concept with an example from the HGU133A Latin square experiment [2]. In the left portion of the figure we display the standardized probe set replicates for the selected conditions, in this case gene 205398_s_at hybridized at 256 and 512 pM. The table on the right in the figure lists, for each probe position, the pvalue resulting from performing analysis of variance on the probe values at that probe position. (Since we are comparing only two conditions here, we would have obtained the same pvalue from independent twosample, equal variance ttests.)
Several ways of combining the probe level ANOVA pvalues were examined. Perhaps the most conceptually appealing measure is the product of the p's. Under the null hypothesis that the gene under consideration is not differentially expressed for the specified conditions, it is reasonable to assume that the ANOVA results are independent from probe position to probe position. (Examination of the probe set plots for a small sample of nonspikedin genes supports this assumption, as well as the assumption that probe amplitudes are IID across conditions under the null hypothesis.) In that case, the product of the pvalues, referred to as "Total p" in Figure 1, is actually an overall pvalue for testing the null hypothesis. However, because there are a number of different probe set sizes on most arrays, it makes more sense to use the perprobe pvalue instead; i.e. the geometric mean of the pvalues (the nth root of the product of the p's, where n is the number of probes in the probe set). This allows for direct comparison of genes with different numbers of probes in their probe sets.
In practice, however, as a tool for assessing differential expression Total p can be overly influenced by a few large (nonsignificant) probe level pvalues, as can be the mean of the p's. Other summary measures such as the trimmed mean or trimmed geometric mean of the p's also do not appear to be as effective as the median in ranking genes in accordance with the known differences in concentration for the conditions being examined. As shown in the table in Figure 2, even after trimming the highest and lowest pvalues, Total p can, in some cases, produce a much lower ranking of a condition than would have been expected. The Total p ranking of the comparison of 64 versus 128 pM concentrations for gene 208010_s_at from the HGU133A Latin Square experiment was 279 out of the 22300 genes in the comparison. On the other hand, the rank based on the median of the ANOVA (1p)'s for the same condition was 6, which is much more consistent with the concentrations involved. While determining the "best" way of combining the probe level pvalues deserves a great deal more study, from this point on we will base our measure of differential expression on the median of the probe level pvalues. In order to make a larger measure correspond to the condition of being more differentially expressed, our measure of differential expression for the gene will be the median of the probe level ANOVA (1p)'s. This is a harmless change since the median of a set of (1p)'s is the same as (1  the median of the p's).
Even though we can perform ANOVA on any subset (of size two or more) of the experimental conditions, in this paper we will restrict our attention to looking for differential expression between two conditions. For clarity, let us refer to the two conditions as A and B and suppose that we specify the ANOVA model such that a positive ANOVA coefficient, as provided by the aov function, corresponds to the mean response under condition B being larger (by the amount of the coefficient) than the mean under condition A. Under these conditions we consider a refinement of our procedure, in which each probe position's (1p) is given the sign of the coefficient obtained from the acrossreplicates ANOVA for the probe. This additional step has two advantages. First, for many nonexpressed genes, the ANOVA coefficient is positive at some probe positions and negative at others. For these genes, the median of the signed (1p)'s will be closer in absolute value to 0 than will the median of the unsigned (1p)'s. Thus, when we use the median of the signed (1p)'s as the measure of differential expression, any gene with a score close to 0 is very likely not to be differentially expressed for the two conditions under consideration. Figure 3 provides an example of the improvement that can occur when signed (1p)'s are used. For the experimental condition depicted, based on the median of the (1p)'s the gene 217207_s_at, which is not spikedin for the HGU133A Latin Square experiment, ranks 56^{th} among the 22300 genes involved in this comparison. This is a higher ranking than that of 8 the 64 spikedin genes. On the other hand, when we used the median of the signed (1p)'s, the gene's rank is 10923, well below any of the spikedin genes. Second, when a gene is differentially expressed between conditions A and B, the sign of resulting signed median tells us whether the gene is upregulated or downregulated. While that is not an important consideration in spikein experiments, where the direction of the regulation is known beforehand, it can be very useful in realworld experiments.
In practice, when we work with signed (1p)'s we take the absolute value of the median of the signed (1p)'s as the measure of differential expression, retaining the median's sign in case we need to know whether the gene is upregulated or downregulated. The reason for working with the absolute value of the signed median is entirely pragmatic. If we retain the sign of median of the signed (1p's) there could be two groups of differentially expressed genes  those with median near 1 and those with median near +1. There is no problem with that, but the two widely separated clusters of differentially expressed gene produce very nonstandard looking ROC curves. In order to produce the familiarlooking ROC curves, we work with the absolute value of the median of the signed (1p)'s. In this case, all differentially expressed genes have a score near +1.
The median ANOVA (1p) approach described here readily lends itself to determination of an (unadjusted) pvalue for the hypotheses test that a gene is not differentially expressed for the conditions under consideration. Under our assumptions for the nondifferentially expressed condition, viz. a) at each probe position probe amplitudes are IID across the condition, and b) ANOVA results are independent from one probe position to another, it follows from a) that the ANOVA pvalues at a single probe have a uniform distribution on the interval [0,1], and from a) and b) that the sample median of the ANOVA pvalues over the probes in a probe set has a beta distribution whose two parameters depend on the number of probes in the probe set [3]. (More accurately, the median has a beta distribution only if there is an odd number of probes in the probe set. It is the mean of two beta distributions for probe sets with an even number of probes.) Regarding the absolute value of the signed median ANOVA (1p)'s, under the null hypothesis the signed median has a uniform distribution over the interval [1,1], so its absolute value also has a uniform distribution on [0, 1]. Thus its distribution under the null hypothesis is the same as that of the unsigned median.
Obtaining the pvalue (unadjusted for multiple hypothesis tests) for a median ANOVA (1p) score, x, is straightforward. For an odd number, n, of probes, pvalue = 1  pbeta(x, m, m) where pbeta is the beta cumulative distribution function in the CRAN R stats package and m = (n+1)/2. (For an even number, n, of probes, pvalue = 1p(x), where p is the cumulative distribution function of the mean of two beta distributions, (B(m,m+1) + B(m+1,m))/2, and m = n/2. We can obtain p(x) from the tools for working with distributions of sums of random variables found in the CRAN R distr package.) The unadjusted pvalues for the median ANOVA (1p) scores are shown in Figures 1, 2 and 3. Figure 4 shows the unadjusted pvalues for the hypothesis test as a function of the median ANOVA (1p) scores for the case of 11 probes per probe set, which is the case for the HGU133A chip. Figure 4(a) shows the entire curve, and 4(b) shows the portion of the curve of greatest importance.
Results
To assess the effectiveness of the ANOVAp approach, we examined its performance on the three spikein controlled experiments that are commonly used as test beds for differential expression procedures. These are the HGU133A and HGU95A Latin Square experiments [2] and the "Golden Spike" [4] experiment. For the two Latin Square designs we processed each contiguous pair of spikein concentration conditions (referred to as the "d = 1" condition in McGee and Chen [5], and corresponding to a twofold increase in concentration for most spikedin genes). Since the Golden Spike experiment entails only two conditions  Control versus Spikedin  there is only one comparison to consider.
HGU133A Latin Square experiment
As designed, the Affymetrix HGU133A Latin Square experiment [2] selected 42 transcripts and assigned them to 14 groups with 3 transcripts each in each group. Each group was spikedin at each of 14 concentrations from 0 to 512 pM, with 3 replicates per concentration. Withinchip and acrosschip group concentrations were organized in a Latin Square design, i.e. within chips, concentrations increase by a factor of two from group to group (wrapping from 512 pM to 0), and for each group, concentrations similarly increase by a factor of two from one experimental condition to the next. See Appendix A of [5] for a complete description of the Latin Square design. The CEL files from this experiment, together with needed metadata, are available for download from the Affymetrix web site.
Researchers in the field have noted that several additional probe sets should be considered as spikedin and have assigned those probe sets to groups with matching expression profiles [5]. We followed their recommendations and expanded the number known spikedin probe sets to 64.
Figures 5 and 6 show the relationships between the unadjusted pvalues obtained from our median ANOVA (1p) methodology and those obtained from RMA and probe level modeling (PLM) processing of Experimental Conditions 1 and 2 of the HGU133A Spikein Experiment. (We chose those conditions as an example because, as mentioned above, there is a considerably expanded set of highly differentially expressed genes involved in the comparison.) RMA and PLM unadjusted pvalues were obtained using the Bioconductor [6] affylmGUI package. As these plots indicate, the ANOVAp approach produces larger pvalues in the extremely low pvalue region (the region associated with the genes whose concentrations changes from 512 pM to 0), but this difference has no meaningful impact  after adjustment for multiple hypothesis testing all of these genes remain highly significant regardless of the methodology applied. As for the other genes, ANOVAp and RMA are in reasonably close agreement, and ANOVAp and PLM are in fair agreement (median ANOVA (1p)'s gives somewhat smaller pvalues for spikedin genes with twofold concentration differences, and PLM has on average somewhat smaller pvalues for the nonspikedin genes). Since the receiver operating characteristic (ROC) curves shown in the following figures provide essentially the same information about the relationships between processing methodologies in a more easily interpretable format, we have not included any other scatterplots comparing pvalues.
Figure 7 contains ROC curves comparing the performance of median ANOVA (1p), median signed ANOVA (1p), RMA and PLM over the full range of false positive rates for all d = 1 comparisons. These are comparisons in which experimental conditions increase by one (factor of 2) concentration step (plus those conditions in the Latin square where concentrations drop from 512 to 0 pM). For each d = 1 condition we obtained median ANOVA (1p), median signed ANOVA (1p), RMA and PLM scores for each of the 22300 HGU133A genes. RMA and PLM scores were obtained using affylmGUI with default settings. Then, for each of the differential expression measures, we combined its 14 d = 1 results into a single data structure from which we calculated an ROC curve. Figure 8 shows the very short initial portion of the ROC curves up to 150 false positives (roughly an average of 10 false positives per d = 1 comparison), the region of highest practical importance.
HGU95A Latin Square experiment
The HGU95A Latin Square design consists of 14 spikedin human gene groups in 14 experimental groups [2]. The concentration of the 14 gene groups in the first experiment is 0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 pM. Each subsequent experiment rotates the spikein concentrations by one group; i.e. experiment 2 begins with 0.25 pM and ends at 0 pM, on up to experiment 14, which begins with 1024 pM and ends with 512 pM. Each experiment contains at least 3 replicates. Replicates within each group result in a total of 59 CEL files. Most groups contain 1 gene, the exceptions being group 1, which contains 2 genes, and group 12, which is empty. (Specifically, transcript 407_at listed as present in group 12 is actually included in group 1.) See Table One of [7] for a tabular summary of the HGU95A Latin Square design.
The ROC curves in Figure 9 compare the performance of median ANOVA (1p), median signed ANOVA (1p), RMA and PLM over the full range of false positive rates for all d = 1 conditions, obtained in the same manner as for the HGU133A experiment. Figure 10 shows the initial portion of the ROC curves up to 450 false positives (roughly an average of 40 false positives per d = 1 comparison), the region of highest practical importance.
"Golden Spike" experiment
The Golden Spike [4] experiment involved 3 Control and 3 Spikedin Affymetrix DrosGenome1 GeneChip arrays, each of which has a total of 14010 probe sets. As described in [4], a total of 1331 probe sets had an increased concentration between the control and spikein samples, 2535 probe sets had equal concentration and the remaining 10144 probe sets were empty on both the control and spikein arrays. For the 1331 true positives, the log2 fold changes range from 0.26 to 2. The additional data files referenced in [4] provide access to a complete description of the experiment, as well as to the CEL files resulting from the experiment. Although the Golden Spike data set has been called into question by several authors [8, 9], its value as tool for comparing differential expression methods is recognized even by its critics [8]. We thus included the Golden Spike data set in our explication of the ANOVAp methods presented here.
The ROC curves in Figure 11 compare the performance of median ANOVA (1p), median signed ANOVA (1p), RMA and PLM over the full range of false positive rates for the Golden Spike experiment as described in [4]. Probe sets that were either equal or empty under both conditions were considered to be false positives, while those with increased concentrations in the spikedin chips were taken to be true positives. Figure 12 shows the initial portion of the ROC curves up to 50 false positives, the region of highest practical importance.
Discussion
Spikein Experiments, Differential Expression and Truth
In order to accurately assess the capabilities of procedures that decide which genes are differentially expressed, we need some completely characterized test data on which to base the assessments. There are special requirements for such "truthed" data sets, which are intended to be used to train or assess algorithms that will ultimately be applied to experimental data. The first and foremost requirement is that every probe set's condition and every comparison of a probe set across conditions in such a test bed be unambiguously assigned to some category, such as "Expressed" or "Not Expressed" ("Present" or "Absent" in Affymetrix MAS 5 terminology) in the case of a single condition, and "Differentially Expressed" or "Not Differentially Expressed" in the case of comparisons between conditions. Secondarily, measures of the degree of expression or differential expression for every condition, whether in the form of fold change or some other quantitative measure, are highly desirable. The various spikein experiments were devised to provide that test data. Indeed, the spikein experiments provide truth at the input stage of the experiment (i.e., accurate measures of the amounts of a variety of labeled probes in the mixes to be hybridized onto the chips).
The differential expression procedures discussed here and in a multitude of other publications do not however process the truthed hybridizing mixtures, which we will refer to as the "input truth" for the experiment. Instead these algorithms can only access the end result of the experiment, in the form of DAT or, more likely, CEL files. The process undergone by labeled probes from hybridization through scanning into DAT files to postprocessing of the results to create CEL files is highly nonlinear and not especially well characterized. Consequently "input truth" does necessarily translate into truth at the experiment's CEL or DAT file output stage, which we will refer to as the "output truth" of the experiment. Putting it more simply, knowing the difference between a probe set's initial concentrations across conditions does not guarantee that we know what the net effect on the resulting CEL files will be. There are many reasons why "input truth" might not translate to "output truth". Cross hybridization, SNP effects, incorrect sequencing of probe sets, probe set duplication and other types of erroneous probe set characterization have been identified as culprits. The use of BLAST and other online metadata and annotation sources has proved to be successful in making biologicallybased corrections to the collection of probe sets that are expected to be differentially expressed in the CEL files. This situation has been recognized by several authors. For example, McGee and Chen [5] address the issue by adding in 22 new genes to the "truth set" for the HGU133A spikein experiment. Similarly, Affymetrix [2] points out the need to correct the list of HGU95A spikedin genes. Perhaps more critical to the results of spikein experiments are the inaccuracies in the original probe set definitions that have been uncovered by more modern genomic techniques [10].
There is also a disturbing circularity with regard to truth and differential expression procedures in the literature. First, quite properly, analysis procedures are assessed based on their ability to get the "right answers" from the truthed data sets. On the other hand, what's considered "true" is in part based on the results of the data processing methods. For example in [5] McGee and Chen actually found 30 candidate new genes, but eliminated 8 of those for reasons that depend on which other genes were considered to be differentially expressed.
In this paper, in which we focus on first principles of statistical analysis and impose the fewest possible conditions on the data, we take a corresponding approach to the question of when a gene is differentially expressed. We consider a gene to be differentially expressed with respect to a specified test procedure if, for at least one pair of conditions in the experiment, we can reject the null hypothesis that the gene has the same distribution for the test statistic at a predetermined significance level, taking into account the multiple hypothesis testing environment in which the decision is being made. The adjusted pvalue corresponding to the median ANOVA (1p) score provides a quantitative indicator of whether the gene is expressed or not. Plots of the withinchip zscores of the log amplitudes of the probes in a probe set provide an invaluable visual tool for judging whether the adjusted pscore, which is based up a summary measure of probe amplitude distributions, provides an accurate assessment of the behavior of the probe set. For example, consider the probe sets from the HGU133A spikein experiment with gene IDs 204890_s_at, 204891_s_at, 203173_s_at and 213060_s_at. These genes, along with four others, were addressed in McGee and Chen [5], but were not included in their new definition of the spikedin data set. In the comparison of experimental condition 14 with experiment condition 1 (hereafter referred to as E14 vs. E1) of the HGU133A Latin Square design, these four genes ranked 4, 5, 6 and 7 for differential expression, respectively, according to each of the unsigned and signed median ANOVA (1p) approaches, RMA and PLM. The only genes that had higher scores were the three that belong to Group 1  the group for which there is a 512 to 0 pM change from condition 14 to condition 1. When we consider the combined results of all HGU133A d = 1 comparisons (which involves a total of 896 spikedin conditions), we find that these same four genes are among the top 100 genes for the ANOVA (1p) approaches, RMA and PLM. Furthermore they are the only such putatively nonspikedin genes in the top 200 results for any of the methods.
Figures 13 show the within chip zscores for the probes in the 204890_s_at and 204891_s_at probe sets for the E14 vs. E1 condition. Figure 14 contains the E14 vs. E1 profiles for 203173_s_at and 213060_s_at. Although the separation between E1 and E14 conditions is not as great for 213060_s_at as it is for the other three genes, it is still extremely difficult to believe that these profiles arose from a gene with identical probe amplitudes distributions at each probe position for conditions 1 and 14 (and that all four quantitative measures examined in this paper are wrong in ranking these among the very highest of genes). We believe that from the perspective of an algorithm developer or assessor, these should not be considered false positives. Instead, since their differential expression was not predicted at the time the spikein mixtures were prepared, we should call these "unanticipated positives", and not penalize any procedure that calls them differentially expressed. From an examination of these genes' profiles and quantitative scores over the various experimental conditions, it appears that they belong to Group 1 in the HGU133A spikein design.
Unanticipated positives are found in all three of the spikein data sets. Figure 15 contains examples from the HGU95A and Golden Spike experiments.
The introduction of "unanticipated positives" underscores the conundrum we face with the use of these spikein data sets as truthed test beds. On the one hand we would like to avoid subjective statements we have made, such as "it is still extremely difficult to believe that these profiles arose from a gene with identical probe amplitudes distributions at each probe position ...". In fact, if we had an unambiguously truthed data set to work with, we would not have to make any such statements. The problem is that there has been no established method for defining the truth in the very data sets that are used to assess the various algorithms applied to microarray data. If we use, say PLIER, to decide which genes "are" expressed or differentially expressed, then we should not be surprised if PLIER outperforms other methodologies on the resulting "truthed" data sets. Similarly using PLM, RMA, GCRMA, MBEI, etc to declare which are the truly expressed/differentially expressed genes, we should expect results to favor the procedure used to determine the "truth". Two ways of dealing with this problem come to mind. The first is to continue to use spikein data sets in the current manner of use. They will still be very valuable tools in algorithm analysis, design and training, but they will never achieve their intended status as definitive test beds. The second is to make a communitywide effort to decide, even if somewhat arbitrarily in some cases, on the status of each comparative condition (or at least each d = 1 comparison) for each gene in the data sets. (It might be necessary to define an "Ambiguous" status to comparative conditions that do not achieve communitywide consensus, akin to Affymetrix "Marginal" status for probe sets on chips.) This might be an effort too large and/or contentious to undertake in practice, but something along those lines is needed if all algorithms are to have a level playing field on which to be compared.
When it comes to dealing with what would be considered false negatives from the perspective of the spikein concentrations, the situation is a bit more complicated. First, there are some genes with high spikein concentrations that should have manifested differential expression in the CEL files, but for some reason did not. For example, 153553_at is a gene in the Golden Spike experiment with a log 2 fold change of 3. Yet it ranks 7246, 10007, 6034, and 6049 (out of 14010) according to the median ANOVA (1p), signed median ANOVA (1p), RMA and PLM measures of differential expression, respectively. Its profile in Figure 16 and its pvalues are that of a nonexpressed gene. In this case we have what might be called an "unanticipated negative". Second, very low concentrations might have been included in the spikein experimental design in order to assess the lower limits of sensitivity of the various differential expression algorithms, but they also serve to establish the lower limits at which differential expression actually occurs. Those genes for which differential expression does occur at the lowest concentrations do indeed provide the test bed for the sensitivity floor for any procedure. However, for many genes the numerous steps that take place after the preparation of the hybridizing mixture result in the gene not being characterized as differentially expressed in the CEL files. Because they are not actually expressed in the CEL files, these are not really "false negatives". Because it is no surprise that genes with very low concentrations wind up being nonexpressed, they really aren't "unexpected negatives" either. For such genes one should not penalize an algorithm for not being able to distinguish a difference that existed at the start of the experiment but did not make it through to the final CEL file product. Figure 17 provides an example of each of these types of genes. The problem is how to tell one type of gene from the other in an efficient manner. Notice that in these cases, for which the probe set profiles for the conditions overlap or cross each other, we needed to use the median of the signed ANOVA (1p)'s as the measure of differential expression. (When there is no overlap or crossing of profiles for the conditions, except for the sign of the median ANOVA (1p), it does not matter whether we use the signed or unsigned methodology).
Conclusion
Our first conclusion is that the median ANOVA (1p) approach and its median signed ANOVA (1p) variant presented here provide conceptually and computationally simple but effective measures of differential expression. Even though these methods do not clearly outperform existing methods, they perform reasonably well (as evidenced by Figures 4, 5, 6, 7, 8 and 9), and the associated probe set plots provide invaluable tools for those who want to look beyond summary measures in assessing differential expression. Furthermore, since the median is a very robust statistic it might be less sensitive than other methods to the redefinition of GeneChip probe sets that have been suggested and even implemented [10] by several authors.
The second conclusion is that if we wish to have effective test beds for assessing (or training) differential expression algorithms we cannot be satisfied with truths that have been established at the input side of the controlled experiments. If the spikein data sets are to be critical tools in differential expression research then we must establish truth at the point in the processing chain at which the algorithms begin. To penalize a differential expression methodology for not finding a gene which is differentially spikedin but not differentially expressed in the arrays, or vice versa, diminished the value of the spikein experiments. The additions or deletions of spikedin genes proposed by a few authors is a step in the right direction, but the real need it to have communitywide agreement on which genes are differentially expressed for which conditions, based on the contents of the CEL files rather than on the designs of the experiments. Biological resources will no doubt have a large role in predicting and explaining differences between the concentrationbased "input truths" and the imagebased "output truths" that differential expression tools work with. Nonetheless, the ultimate resolution of what is expressed and what is not expressed has to come from the CEL files themselves (if not the DAT that produced them).
Determining truth at the CEL file level for all pairs of conditions will take a lot of work, but that is a requirement for a test bed that can be trusted to assess the effectiveness of the various differential expression paradigms. However, in practice there will probably be "only" several hundred comparative gene conditions that will require close examination of probe set profiles, and initially it makes sense to focus on the d = 1 conditions. It may well be that for some genes there will not be communitywide agreement as to whether they are differentially expressed or not for some pairs of conditions. Figure 18 presents a possible example of such a gene from the Q versus A conditions of the HGU95A Latin Square design. Although the adjusted pvalues for median signed ANOVA (1p), RMA and PLM (8.7 × 10^{18}, 6.9 × 10^{12}, and 4.6 × 10^{16}, respectively) all strongly indicate differential expression, a biologist looking at the profile plots might well have second thoughts (either about differential expression or about the validity of the probe set itself). If researchers cannot agree if a collection of probe sets is differentially expressed or not, we cannot expect mechanized procedures to be in agreement either. In such cases, there may need to be a label in the CEL file truth metadata indicating the ambiguous status of the condition.
References
 1.
The Comprehensive R Archive Network[http://cran.rproject.org/]
 2.
Affymetrix Latin Square Experiments[http://www.affymetrix.com/analysis/download_center2.affx]
 3.
Gupta AK, Nadarajah S: Order Statistics and Records. In Handbook Of Beta Distribution And Its Applications. Volume Chapter 4. Edited by: Gupta AK, Nadarajah S. New York: Marcel Dekker; 2004:89–95.
 4.
Choe SE, Boutros M, Michelson AM, Church GM, Halfon MS: Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol 2005, 6(2):R16. 10.1186/gb200562r16
 5.
McGee M, Chen Z: New SpikedIn Probe Sets for the Affymetrix HGU133A Latin Square Experiment. COBRA Preprint Series 2006, Article 5. [http://biostats.bepress.com/cobra/ps/art5]
 6.
BioConductor[http://bioconductor.org/]
 7.
Ballman KV, Therneau TM: An Exploration of Affymetrix ProbeSet Intensities in SpikeIn Experiments. Tech Report #74 July 2005 Copyright 2005 Mayo Foundation;
 8.
Pearson RD: A comprehensive reanalysis of the Golden Spike data: Towards a benchmark for differential expression methods. BMC Bioinformatics 2008, 9: 164. 10.1186/147121059164
 9.
Irizarry RA, Cope LM, Wu Z: Featurelevel exploration of a published Affymetrix GeneChip control dataset. Genome Biology 2006, 7: 404. 10.1186/gb200678404
 10.
Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, Watson SJ, Meng F: Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Research 2005, 33(No 20):e175. 10.1093/nar/gni179
Acknowledgements
The author wishes to thank the reviewers of the earlier versions of this manuscript for their insights and recommendations for improving the content and focus of this paper.
Author information
Affiliations
Corresponding author
Additional information
Authors' contributions
There is a single author for this paper.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Rubin, R.A. A first principles approach to differential expression in microarray data analysis. BMC Bioinformatics 10, 292 (2009). https://doi.org/10.1186/1471210510292
Received:
Accepted:
Published:
Keywords
 Receiver Operating Characteristic Curve
 Probe Position
 Probe Amplitude
 Input Truth
 Golden Spike Data