- Methodology article
- Open Access

# A first principles approach to differential expression in microarray data analysis

- Robert A Rubin

*BMC Bioinformatics* **10**:292

https://doi.org/10.1186/1471-2105-10-292

© Rubin; licensee BioMed Central Ltd. 2009

**Received:** 1 April 2009 · **Accepted:** 16 September 2009 · **Published:** 16 September 2009

## Abstract

### Background

The disparate results from the methods commonly used to determine differential expression in Affymetrix microarray experiments may well result from the wide variety of probe set and probe level models employed. Here we take the approach of making the fewest assumptions about the structure of the microarray data. Specifically, we only require that, under the null hypothesis that a gene is not differentially expressed for specified conditions, for any probe position in the gene's probe set: a) the probe amplitudes are independent and identically distributed over the conditions, and b) the distributions of the replicated probe amplitudes are amenable to classical analysis of variance (ANOVA). Log-amplitudes that have been standardized within-chip meet these conditions well enough for our approach, which is to perform ANOVA across conditions for each probe position, and then take the median of the resulting (1 - p) values as a gene-level measure of differential expression.

### Results

We applied the technique to the HGU-133A, HG-U95A, and "Golden Spike" spike-in data sets. The resulting receiver operating characteristic (ROC) curves compared favorably with other published results. This procedure is quite sensitive, so much so that it has revealed the presence of probe sets that might properly be called "unanticipated positives" rather than "false positives", because plots of these probe sets strongly suggest that they are differentially expressed.

### Conclusion

The median ANOVA (1-p) approach presented here is a very simple methodology that does not depend on any specific probe level or probe set models, and does not require any pre-processing other than within-chip standardization of probe level log amplitudes. Its performance is comparable to other published methods on the standard spike-in data sets, and it has revealed the presence of new categories of probe sets that might properly be referred to as "unanticipated positives" and "unanticipated negatives" that need to be taken into account when using spiked-in data sets as "truthed" test beds.

## Keywords

- Receiver Operating Characteristic Curve
- Probe Position
- Probe Amplitude
- Input Truth
- Golden Spike Data

## Background

We make only the following two assumptions:

- a) under the null hypothesis that a gene is not differentially expressed for specified conditions, for any probe position in the gene's probe set the probe amplitudes are independent and identically distributed (IID) over the conditions, and

- b) at each probe position the distributions of replicated probe amplitudes are amenable to classical analysis of variance (ANOVA).

After log transformation followed by within-chip standardization, the resulting within-chip standardized scores (z-scores) meet requirement a), and within chips, logs of probe values are reasonably well modeled as being gamma-distributed. Since ANOVA is quite robust with respect to the within-treatment distribution, b) holds as well. (Note that we could drop requirement b) and work with a nonparametric version of ANOVA. Classical parametric ANOVA is, however, more powerful than its nonparametric counterparts, so it makes sense to use it whenever feasible.) We hereafter assume that each CEL file's perfect match (pm) values have been log2 transformed, that a gamma distribution has been fit (using the CRAN R [1] *fitdistr* function) to the transformed data with the lower 0.1% and the upper 1% trimmed off, and that the log scores have been standardized by subtracting the mean of the gamma fit and dividing by the standard deviation of the fit. We will hereafter refer to the results of the transformation process as "standardized probe values" (or "within-chip standardized probe values" when it is important to make it clear that standardization does not take place across chips.) In order to focus on the "first principles" perspective and concepts presented here, we do not perform any background correction or normalization of probe sets in this paper. In practice, of course, doing such pre-processing prior to performing the ANOVAs could improve the effectiveness of the method when applied to experimental data. Within-chip standardization, however, has been carried out because it is in effect a general signal processing calibration procedure which ensures that probe amplitudes can be meaningfully compared across chips. It removes global chip effects which could otherwise be confounded with differential gene expression.
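As a concrete sketch of this standardization step, the following uses Python with NumPy/SciPy rather than the paper's R *fitdistr* workflow; the function name, the trimming implementation, and the use of `scipy.stats.gamma.fit` as a stand-in for *fitdistr* are our own assumptions:

```python
import numpy as np
from scipy import stats

def standardize_chip(pm, lower_trim=0.001, upper_trim=0.01):
    """Within-chip standardization of perfect-match (pm) intensities.

    Log2-transform, fit a gamma distribution to the logs with the lower
    0.1% and upper 1% trimmed off, then z-score every log value using
    the mean and standard deviation of the fitted gamma.
    """
    logs = np.log2(np.asarray(pm, dtype=float))
    lo, hi = np.quantile(logs, [lower_trim, 1.0 - upper_trim])
    trimmed = logs[(logs >= lo) & (logs <= hi)]
    # scipy's gamma.fit plays the role of R's fitdistr here
    shape, loc, scale = stats.gamma.fit(trimmed)
    fit_mean = loc + shape * scale        # mean of the fitted gamma
    fit_sd = np.sqrt(shape) * scale       # sd of the fitted gamma
    return (logs - fit_mean) / fit_sd
```

Because the fit is to the trimmed logs, the standardized values have mean near 0 and standard deviation near 1 without being distorted by the extreme tails of the intensity distribution.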

For each probe position in a probe set, we perform an analysis of variance on the standardized probe values across the conditions of interest using the R *aov* function, and retain the p-value obtained from it. (The R *lm* function produces the same results, as would an independent two-sample, equal-variance t-test when two treatments are being analyzed for differential expression.) In Figure 1 we illustrate this concept with an example from the HGU-133A Latin square experiment [2]. In the left portion of the figure we display the standardized probe set replicates for the selected conditions, in this case gene 205398_s_at hybridized at 256 and 512 pM. The table on the right in the figure lists, for each probe position, the p-value resulting from performing analysis of variance on the probe values at that probe position. (Since we are comparing only two conditions here, we would have obtained the same p-value from independent two-sample, equal-variance t-tests.)
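The per-probe-position ANOVA step can be sketched as follows; this is Python rather than the paper's R *aov* call, and the function name and array layout are our own assumptions:

```python
import numpy as np
from scipy import stats

def probe_level_pvalues(probes_a, probes_b):
    """One ANOVA p-value per probe position.

    probes_a, probes_b: arrays of shape (n_probes, n_replicates) holding
    within-chip standardized values for conditions A and B.  With two
    conditions, one-way ANOVA reduces to the two-sample, equal-variance
    t-test, so the p-values agree with ttest_ind.
    """
    return np.array([
        stats.f_oneway(row_a, row_b).pvalue
        for row_a, row_b in zip(probes_a, probes_b)
    ])
```

With more than two conditions, the same `f_oneway` call simply takes one replicate vector per condition at each probe position.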

Several ways of combining the probe level ANOVA p-values were examined. Perhaps the most conceptually appealing measure is the product of the p's. Under the null hypothesis that the gene under consideration is not differentially expressed for the specified conditions, it is reasonable to assume that the ANOVA results are independent from probe position to probe position. (Examination of the probe set plots for a small sample of non-spiked-in genes supports this assumption, as well as the assumption that probe amplitudes are IID across conditions under the null hypothesis.) In that case, the product of the p-values, referred to as "Total p" in Figure 1, is actually an over-all p-value for testing the null hypothesis. However, because there are a number of different probe set sizes on most arrays, it makes more sense to use the per-probe p-value instead; i.e. the geometric mean of the p-values (the nth root of the product of the p's, where n is the number of probes in the probe set). This allows for direct comparison of genes with different numbers of probes in their probe sets.
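The per-probe (geometric mean) combination described above can be written in one line; the helper name below is our own:

```python
import numpy as np

def per_probe_pvalue(pvals):
    """Geometric mean of the probe-level ANOVA p-values.

    The raw product of n independent p-values ("Total p") depends on the
    probe-set size n, so taking the nth root puts probe sets with
    different numbers of probes on a common scale.
    """
    pvals = np.asarray(pvals, dtype=float)
    return float(np.exp(np.mean(np.log(pvals))))
```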

When only two conditions, A and B, are being compared, a positive coefficient for condition B in the fitted model, as reported by the R *aov* function, corresponds to the mean response under condition B being larger (by the amount of the coefficient) than the mean under condition A. Under these conditions we consider a refinement of our procedure, in which each probe position's (1-p) is given the sign of the coefficient obtained from the across-replicates ANOVA for the probe. This additional step has two advantages. First, for many non-expressed genes, the ANOVA coefficient is positive at some probe positions and negative at others. For these genes, the median of the signed (1-p)'s will be closer in absolute value to 0 than will the median of the unsigned (1-p)'s. Thus, when we use the median of the signed (1-p)'s as the measure of differential expression, any gene with a score close to 0 is very likely not to be differentially expressed for the two conditions under consideration. Figure 3 provides an example of the improvement that can occur when signed (1-p)'s are used. For the experimental condition depicted, based on the median of the (1-p)'s the gene 217207_s_at, which is not spiked-in for the HGU-133A Latin Square experiment, ranks 56^{th} among the 22300 genes involved in this comparison. This is a higher ranking than that of 8 of the 64 spiked-in genes. On the other hand, when we use the median of the signed (1-p)'s, the gene's rank is 10923, well below any of the spiked-in genes. Second, when a gene is differentially expressed between conditions A and B, the sign of the resulting signed median tells us whether the gene is up-regulated or down-regulated. While that is not an important consideration in spike-in experiments, where the direction of the regulation is known beforehand, it can be very useful in real-world experiments.

In practice, when we work with signed (1-p)'s we take the absolute value of the median of the signed (1-p)'s as the measure of differential expression, retaining the median's sign in case we need to know whether the gene is up-regulated or down-regulated. The reason for working with the absolute value of the signed median is entirely pragmatic. If we retain the sign of the median of the signed (1-p)'s, there can be two groups of differentially expressed genes: those with median near -1 and those with median near +1. There is no problem with that, but the two widely separated clusters of differentially expressed genes produce very non-standard looking ROC curves. In order to produce familiar-looking ROC curves, we work with the absolute value of the median of the signed (1-p)'s. In this case, all differentially expressed genes have a score near +1.
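The signed variant can be sketched as below. This is a Python stand-in for the R workflow, with names of our own choosing; with two conditions the sign of the ANOVA coefficient is simply the sign of the difference of condition means, which is what the sketch uses:

```python
import numpy as np
from scipy import stats

def signed_median_score(probes_a, probes_b):
    """Median signed ANOVA (1-p) score for one probe set.

    For each probe position: ANOVA p-value across the two conditions,
    given the sign of the mean difference (B minus A).  Returns the
    absolute value of the median of the signed (1-p)'s, together with
    the median's sign (up- vs. down-regulation).
    """
    signed = []
    for row_a, row_b in zip(probes_a, probes_b):
        p = stats.f_oneway(row_a, row_b).pvalue
        sign = 1.0 if np.mean(row_b) >= np.mean(row_a) else -1.0
        signed.append(sign * (1.0 - p))
    med = float(np.median(signed))
    return abs(med), np.sign(med)
```

For a consistently up-regulated probe set the score is near 1 with sign +1; for a non-expressed gene the mixed signs pull the median toward 0.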

The median ANOVA (1-p) approach described here readily lends itself to determination of an (unadjusted) p-value for the hypotheses test that a gene is not differentially expressed for the conditions under consideration. Under our assumptions for the non-differentially expressed condition, viz. a) at each probe position probe amplitudes are IID across the condition, and b) ANOVA results are independent from one probe position to another, it follows from a) that the ANOVA p-values at a single probe have a uniform distribution on the interval [0,1], and from a) and b) that the sample median of the ANOVA p-values over the probes in a probe set has a beta distribution whose two parameters depend on the number of probes in the probe set [3]. (More accurately, the median has a beta distribution only if there is an odd number of probes in the probe set. It is the mean of two beta distributions for probe sets with an even number of probes.) Regarding the absolute value of the signed median ANOVA (1-p)'s, under the null hypothesis the signed median has a uniform distribution over the interval [-1,1], so its absolute value also has a uniform distribution on [0, 1]. Thus its distribution under the null hypothesis is the same as that of the unsigned median.
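The order-statistic claim above is easy to check by simulation. This sketch (our own illustration, not from the paper) verifies that the median of 11 uniform values matches the first two moments of Beta(6, 6):

```python
import numpy as np

# Monte Carlo check of the null distribution claim: the median of
# n = 11 i.i.d. Uniform(0,1) p-values should follow Beta(6, 6), since
# the k-th order statistic of n uniforms is Beta(k, n + 1 - k) and the
# median is the 6th of 11.
rng = np.random.default_rng(0)
medians = np.median(rng.uniform(size=(200_000, 11)), axis=1)

theory_mean = 0.5                    # Beta(6, 6) is symmetric about 1/2
theory_var = 36.0 / (12**2 * 13)     # a*b / ((a+b)^2 * (a+b+1))
```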

For a probe set with an odd number, n, of probes, the unadjusted p-value corresponding to a median ANOVA (1-p) score of x is 1 - *pbeta*(x, m, m), where *pbeta* is the beta cumulative distribution function in the CRAN R *stats* package and m = (n+1)/2. (For an even number, n, of probes, p-value = 1 - p(x), where p is the cumulative distribution function of the mean of two beta distributions, (B(m, m+1) + B(m+1, m))/2, and m = n/2. We can obtain p(x) from the tools for working with distributions of sums of random variables found in the CRAN R *distr* package.) The unadjusted p-values for the median ANOVA (1-p) scores are shown in Figures 1, 2 and 3. Figure 4 shows the unadjusted p-values for the hypothesis test as a function of the median ANOVA (1-p) scores for the case of 11 probes per probe set, which is the case for the HGU-133A chip. Figure 4(a) shows the entire curve, and 4(b) shows the portion of the curve of greatest importance.
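A Python stand-in for the odd-n p-value computation (SciPy's `beta.sf` plays the role of 1 - *pbeta*; the even-n mixture case is omitted, and the function name is our own):

```python
from scipy import stats

def median_score_pvalue(x, n):
    """Unadjusted p-value for a median ANOVA (1-p) score of x.

    Under the null hypothesis the median of n i.i.d. uniform (1-p)'s
    follows Beta(m, m) with m = (n + 1) / 2, so the p-value is the
    upper-tail probability at x.  Odd n only; the even case is the
    mean of two beta distributions as described in the text.
    """
    if n % 2 == 0:
        raise ValueError("this sketch handles odd probe-set sizes only")
    m = (n + 1) / 2
    return float(stats.beta.sf(x, m, m))   # sf(x) = 1 - cdf(x)
```

For the HGU-133A chip (11 probes per probe set, m = 6) a score of 0.5 yields a p-value of exactly 0.5 by symmetry, and the p-value falls rapidly as the score approaches 1.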

## Results

To assess the effectiveness of the ANOVA-p approach, we examined its performance on the three spike-in controlled experiments that are commonly used as test beds for differential expression procedures. These are the HGU-133A and HGU-95A Latin Square experiments [2] and the "Golden Spike" [4] experiment. For the two Latin Square designs we processed each contiguous pair of spike-in concentration conditions (referred to as the "d = 1" condition in McGee and Chen [5], and corresponding to a two-fold increase in concentration for most spiked-in genes). Since the Golden Spike experiment entails only two conditions - Control versus Spiked-in - there is only one comparison to consider.

### HGU-133A Latin Square experiment

As designed, the Affymetrix HGU-133A Latin Square experiment [2] selected 42 transcripts and assigned them to 14 groups of 3 transcripts each. Each group was spiked-in at each of 14 concentrations ranging from 0 to 512 pM, with 3 replicates per concentration. Within-chip and across-chip group concentrations were organized in a Latin Square design: within chips, concentrations increase by a factor of two from group to group (wrapping from 512 pM to 0), and for each group, concentrations similarly increase by a factor of two from one experimental condition to the next. See Appendix A of [5] for a complete description of the Latin Square design. The CEL files from this experiment, together with the needed metadata, are available for download from the Affymetrix web site.
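The cyclic structure of the design can be illustrated as follows. The concentration list is the standard 14-value doubling series from 0 to 512 pM; the group-to-condition indexing `(g + c) % 14` is an illustrative convention of ours, not Affymetrix's published group numbering:

```python
import numpy as np

# 14 spike-in concentrations in pM: 0, then 0.125 doubling up to 512.
CONCS = [0.0] + [0.125 * 2**k for k in range(13)]

def latin_square(n=14):
    """n x n matrix: concentration of group g under condition c.

    Each row (group) and each column (condition) runs through all n
    concentrations exactly once, doubling with wraparound -- the
    defining property of the Latin Square layout.
    """
    return np.array([[CONCS[(g + c) % n] for c in range(n)]
                     for g in range(n)])
```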

Researchers in the field have noted that several additional probe sets should be considered as spiked-in and have assigned those probe sets to groups with matching expression profiles [5]. We followed their recommendations and expanded the number of known spiked-in probe sets to 64.

We also processed the data with RMA and PLM, as implemented in the Bioconductor [6] *affylmGUI* package, and compared the resulting p-values with those of the ANOVA-p approach in scatterplots. As these plots indicate, the ANOVA-p approach produces larger p-values in the extremely low p-value region (the region associated with the genes whose concentrations change from 512 pM to 0), but this difference has no meaningful impact: after adjustment for multiple hypothesis testing all of these genes remain highly significant regardless of the methodology applied. As for the other genes, ANOVA-p and RMA are in reasonably close agreement, and ANOVA-p and PLM are in fair agreement (the median ANOVA (1-p) approach gives somewhat smaller p-values for spiked-in genes with two-fold concentration differences, and PLM has on average somewhat smaller p-values for the non-spiked-in genes). Since the receiver operating characteristic (ROC) curves shown in the following figures provide essentially the same information about the relationships between processing methodologies in a more easily interpretable format, we have not included any other scatterplots comparing p-values.

For each d = 1 comparison we computed the RMA and PLM results using *affylmGUI* with default settings. Then, for each of the differential expression measures, we combined its 14 d = 1 results into a single data structure from which we calculated an ROC curve. Figure 8 shows the very short initial portion of the ROC curves up to 150 false positives (roughly an average of 10 false positives per d = 1 comparison), the region of highest practical importance.

### HGU-95A Latin Square experiment

The HGU-95A Latin Square design consists of 14 spiked-in human gene groups in 14 experiments [2]. The concentrations of the 14 gene groups in the first experiment are 0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 pM. Each subsequent experiment rotates the spike-in concentrations by one group; i.e. experiment 2 begins with 0.25 pM and ends at 0 pM, and so on up to experiment 14, which begins with 1024 pM and ends with 512 pM. Each experiment contains at least 3 replicates, and the replicates within the groups result in a total of 59 CEL files. Most groups contain 1 gene, the exceptions being group 1, which contains 2 genes, and group 12, which is empty. (Specifically, transcript 407_at, listed as present in group 12, is actually included in group 1.) See Table One of [7] for a tabular summary of the HG-U95A Latin Square design.

### "Golden Spike" experiment

The Golden Spike [4] experiment involved 3 Control and 3 Spiked-in Affymetrix DrosGenome1 GeneChip arrays, each of which has a total of 14010 probe sets. As described in [4], a total of 1331 probe sets had an increased concentration between the control and spike-in samples, 2535 probe sets had equal concentrations, and the remaining 10144 probe sets were empty on both the control and spike-in arrays. For the 1331 true positives, the log2 fold changes range from 0.26 to 2. The additional data files referenced in [4] provide access to a complete description of the experiment, as well as to the CEL files resulting from the experiment. Although the Golden Spike data set has been called into question by several authors [8, 9], its value as a tool for comparing differential expression methods is recognized even by its critics [8]. We thus included the Golden Spike data set in our explication of the ANOVA-p methods presented here.

## Discussion

### Spike-in Experiments, Differential Expression and Truth

In order to accurately assess the capabilities of procedures that decide which genes are differentially expressed, we need some completely characterized test data on which to base the assessments. There are special requirements for such "truthed" data sets, which are intended to be used to train or assess algorithms that will ultimately be applied to experimental data. The first and foremost requirement is that every probe set's condition and every comparison of a probe set across conditions in such a test bed be unambiguously assigned to some category, such as "Expressed" or "Not Expressed" ("Present" or "Absent" in Affymetrix MAS 5 terminology) in the case of a single condition, and "Differentially Expressed" or "Not Differentially Expressed" in the case of comparisons between conditions. Secondarily, measures of the degree of expression or differential expression for every condition, whether in the form of fold change or some other quantitative measure, are highly desirable. The various spike-in experiments were devised to provide that test data. Indeed, the spike-in experiments provide truth at the input stage of the experiment (i.e., accurate measures of the amounts of a variety of labeled probes in the mixes to be hybridized onto the chips).

The differential expression procedures discussed here and in a multitude of other publications do not, however, process the truthed hybridizing mixtures, which we will refer to as the "input truth" for the experiment. Instead these algorithms can only access the end result of the experiment, in the form of DAT or, more likely, CEL files. The process undergone by labeled probes from hybridization through scanning into DAT files to post-processing of the results to create CEL files is highly non-linear and not especially well characterized. Consequently "input truth" does not necessarily translate into truth at the experiment's CEL or DAT file output stage, which we will refer to as the "output truth" of the experiment. Putting it more simply, knowing the difference between a probe set's initial concentrations across conditions does not guarantee that we know what the net effect on the resulting CEL files will be. There are many reasons why "input truth" might not translate to "output truth". Cross hybridization, SNP effects, incorrect sequencing of probe sets, probe set duplication and other types of erroneous probe set characterization have been identified as culprits. The use of BLAST and other on-line metadata and annotation sources has proved to be successful in making biologically-based corrections to the collection of probe sets that are expected to be differentially expressed in the CEL files. This situation has been recognized by several authors. For example, McGee and Chen [5] address the issue by adding 22 new genes to the "truth set" for the HG-U133A spike-in experiment. Similarly, Affymetrix [2] points out the need to correct the list of HG-U95A spiked-in genes. Perhaps more critical to the results of spike-in experiments are the inaccuracies in the original probe set definitions that have been uncovered by more modern genomic techniques [10].

There is also a disturbing circularity with regard to truth and differential expression procedures in the literature. First, quite properly, analysis procedures are assessed based on their ability to get the "right answers" from the truthed data sets. On the other hand, what's considered "true" is in part based on the results of the data processing methods. For example in [5] McGee and Chen actually found 30 candidate new genes, but eliminated 8 of those for reasons that depend on which other genes were considered to be differentially expressed.

In this paper, in which we focus on first principles of statistical analysis and impose the fewest possible conditions on the data, we take a corresponding approach to the question of when a gene is differentially expressed. We consider a gene to be differentially expressed with respect to a specified test procedure if, for at least one pair of conditions in the experiment, we can reject the null hypothesis that the gene has the same distribution for the test statistic at a predetermined significance level, taking into account the multiple hypothesis testing environment in which the decision is being made. The adjusted p-value corresponding to the median ANOVA (1-p) score provides a quantitative indicator of whether the gene is expressed or not. Plots of the within-chip z-scores of the log amplitudes of the probes in a probe set provide an invaluable visual tool for judging whether the adjusted p-value, which is based upon a summary measure of probe amplitude distributions, provides an accurate assessment of the behavior of the probe set. For example, consider the probe sets from the HG-U133A spike-in experiment with gene IDs 204890_s_at, 204891_s_at, 203173_s_at and 213060_s_at. These genes, along with four others, were addressed in McGee and Chen [5], but were not included in their new definition of the spiked-in data set. In the comparison of experimental condition 14 with experimental condition 1 (hereafter referred to as E14 vs. E1) of the HGU-133A Latin Square design, these four genes ranked 4, 5, 6 and 7 for differential expression, respectively, according to each of the unsigned and signed median ANOVA (1-p) approaches, RMA and PLM. The only genes that had higher scores were the three that belong to Group 1 -- the group for which there is a 512 to 0 pM change from condition 14 to condition 1.
When we consider the combined results of all HGU-133A d = 1 comparisons (which involves a total of 896 spiked-in conditions), we find that these same four genes are among the top 100 genes for the ANOVA (1-p) approaches, RMA and PLM. Furthermore they are the only such putatively non-spiked-in genes in the top 200 results for any of the methods.

The introduction of "unanticipated positives" underscores the conundrum we face with the use of these spike-in data sets as truthed test beds. On the one hand we would like to avoid subjective statements such as "it is still extremely difficult to believe that these profiles arose from a gene with identical probe amplitude distributions at each probe position ...". In fact, if we had an unambiguously truthed data set to work with, we would not have to make any such statements. The problem is that there has been no established method for defining the truth in the very data sets that are used to assess the various algorithms applied to microarray data. If we use, say, PLIER to decide which genes "are" expressed or differentially expressed, then we should not be surprised if PLIER outperforms other methodologies on the resulting "truthed" data sets. Similarly, if we use PLM, RMA, GCRMA, MBEI, etc., to declare which are the truly expressed or differentially expressed genes, we should expect the results to favor the procedure used to determine the "truth". Two ways of dealing with this problem come to mind. The first is to continue to use spike-in data sets in the current manner. They will still be very valuable tools in algorithm analysis, design and training, but they will never achieve their intended status as definitive test beds. The second is to make a community-wide effort to decide, even if somewhat arbitrarily in some cases, on the status of each comparative condition (or at least each d = 1 comparison) for each gene in the data sets. (It might be necessary to assign an "Ambiguous" status to comparative conditions that do not achieve community-wide consensus, akin to the Affymetrix "Marginal" status for probe sets on chips.) This might be an effort too large and/or contentious to undertake in practice, but something along those lines is needed if all algorithms are to have a level playing field on which to be compared.

## Conclusion

Our first conclusion is that the median ANOVA (1-p) approach and its median signed ANOVA (1-p) variant presented here provide conceptually and computationally simple but effective measures of differential expression. Even though these methods do not clearly outperform existing methods, they perform reasonably well (as evidenced by Figures 4, 5, 6, 7, 8 and 9), and the associated probe set plots provide invaluable tools for those who want to look beyond summary measures in assessing differential expression. Furthermore, since the median is a very robust statistic, the approach might be less sensitive than other methods to the redefinitions of GeneChip probe sets that have been suggested and even implemented [10] by several authors.

The second conclusion is that if we wish to have effective test beds for assessing (or training) differential expression algorithms, we cannot be satisfied with truths that have been established at the input side of the controlled experiments. If the spike-in data sets are to be critical tools in differential expression research, then we must establish truth at the point in the processing chain at which the algorithms begin. To penalize a differential expression methodology for not finding a gene which is differentially spiked-in but not differentially expressed in the arrays, or vice versa, diminishes the value of the spike-in experiments. The additions or deletions of spiked-in genes proposed by a few authors are a step in the right direction, but the real need is to have community-wide agreement on which genes are differentially expressed for which conditions, based on the contents of the CEL files rather than on the designs of the experiments. Biological resources will no doubt have a large role in predicting and explaining differences between the concentration-based "input truths" and the image-based "output truths" that differential expression tools work with. Nonetheless, the ultimate resolution of what is expressed and what is not expressed has to come from the CEL files themselves (if not the DAT files that produced them).

Although the adjusted p-values for these probe sets (on the order of 10^{-18}, 6.9 × 10^{-12}, and 4.6 × 10^{-16}, respectively) all strongly indicate differential expression, a biologist looking at the profile plots might well have second thoughts (either about differential expression or about the validity of the probe set itself). If researchers cannot agree on whether a collection of probe sets is differentially expressed or not, we cannot expect mechanized procedures to be in agreement either. In such cases, there may need to be a label in the CEL file truth metadata indicating the ambiguous status of the condition.

## Declarations

### Acknowledgements

The author wishes to thank the reviewers of the earlier versions of this manuscript for their insights and recommendations for improving the content and focus of this paper.

## References

1. The Comprehensive R Archive Network [http://cran.r-project.org/]
2. Affymetrix Latin Square Experiments [http://www.affymetrix.com/analysis/download_center2.affx]
3. Gupta AK, Nadarajah S: Order Statistics and Records. In *Handbook of Beta Distribution and Its Applications*, Chapter 4. Edited by: Gupta AK, Nadarajah S. New York: Marcel Dekker; 2004:89–95.
4. Choe SE, Boutros M, Michelson AM, Church GM, Halfon MS: Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. *Genome Biol* 2005, 6(2):R16. doi:10.1186/gb-2005-6-2-r16
5. McGee M, Chen Z: New Spiked-In Probe Sets for the Affymetrix HGU-133A Latin Square Experiment. *COBRA Preprint Series* 2006, Article 5. [http://biostats.bepress.com/cobra/ps/art5]
6. BioConductor [http://bioconductor.org/]
7. Ballman KV, Therneau TM: An Exploration of Affymetrix Probe-Set Intensities in Spike-In Experiments. Mayo Foundation Technical Report #74; 2005.
8. Pearson RD: A comprehensive re-analysis of the Golden Spike data: Towards a benchmark for differential expression methods. *BMC Bioinformatics* 2008, 9:164. doi:10.1186/1471-2105-9-164
9. Irizarry RA, Cope LM, Wu Z: Feature-level exploration of a published Affymetrix GeneChip control dataset. *Genome Biology* 2006, 7:404. doi:10.1186/gb-2006-7-8-404
10. Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, Watson SJ, Meng F: Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. *Nucleic Acids Research* 2005, 33(20):e175. doi:10.1093/nar/gni179

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.