A comprehensive reanalysis of the Golden Spike data: Towards a benchmark for differential expression methods
Richard D Pearson
DOI: 10.1186/1471-2105-9-164
© Pearson; licensee BioMed Central Ltd. 2008
Received: 09 November 2007
Accepted: 26 March 2008
Published: 26 March 2008
Abstract
Background
The Golden Spike data set has been used to validate a number of methods for summarizing Affymetrix data sets, sometimes with seemingly contradictory results. Much less use has been made of this data set to evaluate differential expression methods. It has been suggested that this data set should not be used for method comparison due to a number of inherent flaws.
Results
We have used this data set in a comparison of methods which is far more extensive than any previous study. We outline six stages in the analysis pipeline where decisions need to be made, and show how the results of these decisions can lead to the apparently contradictory results previously found. We also show that, while flawed, this data set is still a useful tool for method comparison, particularly for identifying combinations of summarization and differential expression methods that are unlikely to perform well on real data sets. We describe a new benchmark, AffyDEComp, that can be used for such a comparison.
Conclusion
We conclude with recommendations for preferred Affymetrix analysis tools, and for the development of future spike-in data sets.
Background
The issue of method validation is of great importance to the microarray community; arguably more important than the development of new methods [1]. The microarray analyst is faced with a seemingly endless choice of methods, many of which give evidence to support their claims of being superior to other approaches, which at times can appear contradictory. Because of this, choice of methods is often determined not by a rigorous comparison of method performance, but by what a researcher is familiar with, what a researcher's colleagues have expertise in, or what was used in a researcher's favorite paper. Method validation is a difficult problem in microarray analysis because, for the vast majority of microarray data sets, we don't know what the "right answer" really is. For example, in a typical analysis of differential gene expression, we rarely know which genes are truly differentially expressed (DE) between different conditions. Perhaps even worse than this, we rarely have any strong evidence about the proportion of genes that are differentially expressed.
1. It uses data sets which only have a small number of DE spike-in probesets.
2. It only uses fold change (FC) as a metric for DE detection, and hence cannot be used to compare other competing DE methods.
More recently, the MicroArray Quality Control (MAQC) study [3] has developed a large number of reference data sets. The primary goal of this study was to show that microarray results can be reproducible; a secondary goal was to provide tools for benchmarking methods. The study concluded that using FC as a DE method gives results that are more reproducible than the other DE methods studied. However, the study could not give recommendations about other important metrics for DE methods such as sensitivity and specificity. The problem here is that we don't know for sure which genes are differentially expressed between the conditions. We could infer this by comparing results across the different microarray technologies used, but the different technologies may well have similar biases, invalidating the results. We could also infer which genes are differentially expressed by comparison with other technologies such as qRT-PCR, but again, there could be similar biases in these technologies. Furthermore, there are competing methods for detection of DE genes using qRT-PCR, so we may well get contradictory results when comparing different microarray DE methods against different qRT-PCR DE methods.
The "Golden Spike" data set of Choe et al. [4] includes two conditions, control (C) and sample (S), with 3 replicates per condition. Each array has 14,010 probesets. 3,866 of these probesets can be used to detect RNAs that have been spiked in. 2,535 of these spike-in probesets relate to RNAs that have been spiked in at equal concentrations in the two conditions. The remaining 1,331 probesets relate to RNAs that have been spiked in at higher concentrations in the S condition relative to the C condition. As such, this data set has a large number of probesets that are known to be DE, and a large number that are known to be not DE. This makes the Golden Spike data set potentially very valuable for validating DE methods.
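The probeset counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the variable names, and the use of the label "empty" for probesets with no spiked-in target, are illustrative choices rather than anything taken from the original analysis code.

```python
# Probeset counts quoted in the text for the Golden Spike arrays.
TOTAL_PROBESETS = 14010
SPIKED_IN = 3866
EQUAL_CONCENTRATION = 2535   # spiked in at the same level in C and S (known not DE)
HIGHER_IN_S = 1331           # spiked in at a higher level in S (known DE)

# The two spike-in groups should account for every spiked-in probeset.
assert EQUAL_CONCENTRATION + HIGHER_IN_S == SPIKED_IN

# Probesets with no spiked-in target ("empty" is an illustrative label here).
empty = TOTAL_PROBESETS - SPIKED_IN


def fraction(part, whole):
    """Proportion rounded to three decimal places."""
    return round(part / whole, 3)


summary = {
    "empty": empty,
    "fraction_spiked_in": fraction(SPIKED_IN, TOTAL_PROBESETS),
    "fraction_de_among_spiked_in": fraction(HIGHER_IN_S, SPIKED_IN),
    "fraction_de_overall": fraction(HIGHER_IN_S, TOTAL_PROBESETS),
}
print(summary)
```

Roughly a third of the spiked-in probesets are known to be DE, which is a far larger labelled set than is available in most real experiments.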
There have been criticisms of the Golden Spike data set from Dabney and Storey [5], Irizarry et al. [6] and Gaile and Miecznikowski [7]. The main criticisms of [5] and [7] center around the fact that the non-DE probesets in the Golden Spike data set have non-uniform p-value distributions. This implies that any measure of significance of DE will be incorrect. Significance measures are valuable because they allow a researcher to make principled decisions about how many genes might be DE, which is a goal towards which we should strive. Unfortunately, we still have no way of knowing for sure whether the non-uniform p-value distributions of the non-DE probesets seen in the Golden Spike data set are particular to this data set, or are a general feature of microarray data sets. Indeed, a recent study by Fodor et al. [8] has suggested non-uniform p-value distributions may be common. However, even if we cannot reliably predict the proportion of genes that are differentially expressed, we can still rank the genes from most likely to be DE to least likely to be DE. In many cases, a researcher might want a list of candidate genes which will be investigated further. A common though admittedly unprincipled approach is to choose the top N candidate genes where N is determined by available resources rather than statistical significance. In such situations it is the rank order of probability of being DE that is used. The tool that has been used most extensively for comparing methods on this data set is the receiver-operator characteristic (ROC) chart. The ROC chart only takes into account the rank order of DE probesets, and hence is not affected by concerns about non-uniform p-value distributions. Gaile and Miecznikowski [7] show that the Golden Spike data set is not suitable for comparison of methods of false discovery rate (FDR) control, but say nothing about whether or not the data set can be used for comparing methods of ranking genes by propensity to be DE.
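The claim that a ROC chart depends only on rank order can be illustrated with a small sketch. The p-values below are made up for illustration, and `auc_from_scores` is a hypothetical helper (a pairwise Mann-Whitney formulation of the area under the ROC curve), not code from the original analysis. Distorting the p-values with any order-preserving transformation leaves the area unchanged.

```python
import math


def auc_from_scores(scores, is_de):
    """Area under the ROC curve via the Mann-Whitney pairwise statistic.

    Returns the fraction of (DE, non-DE) pairs in which the DE probeset
    receives the higher score; tied pairs count as half.
    """
    pos = [s for s, d in zip(scores, is_de) if d]
    neg = [s for s, d in zip(scores, is_de) if not d]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical p-values for six probesets (smaller = more evidence of DE).
p_values = [0.001, 0.20, 0.03, 0.50, 0.04, 0.90]
is_de = [True, True, True, False, False, False]

# Convert to scores where larger means "more likely DE".
scores = [-p for p in p_values]
# A miscalibrated but order-preserving version of the same evidence.
distorted = [-math.sqrt(p) for p in p_values]

auc_original = auc_from_scores(scores, is_de)
auc_distorted = auc_from_scores(distorted, is_de)
print(auc_original, auc_distorted)
```

Both calls give the same area because the square root changes the values of the p-values but not their order.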
1. Spike-in concentrations are unrealistically high.
2. DE spike-ins are all one-way (upregulated).
3. Nominal concentrations and FC sizes are confounded.
While we agree that these are indeed undesirable characteristics, and would recommend the creation of new spike-in data sets that do not have these characteristics, we do not believe that these completely invalidate the use of the Golden Spike data set as a useful comparison tool.
Perhaps more serious is the artifact identified by Irizarry et al. [6]. They show that the FCs of the spike-ins that are spiked in at equal levels are lower than those of the "empty" probesets (i.e. those not spiked in). Schuster et al. [9] have recently suggested that this difference is due to differences in non-specific binding, which in turn is due to differences in amounts of labeled cRNA between the C and S conditions. We agree that this artifact invalidates comparison methods that use the set of all unchanging (equal FC and empty) probesets as true negatives when creating ROC charts. However, we argue that we can still use the Golden Spike data set as a valid benchmark by using ROC charts with just the equal FC probesets as our true negatives (i.e. by ignoring the empty probesets).
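A toy sketch of why the choice of true negatives matters is given below. The log fold changes are invented for illustration, with the empty probesets given slightly higher values than the equal spike-ins to mimic the direction of the artifact; the function names and the threshold are hypothetical and not taken from any of the cited studies.

```python
def evaluation_set(categories, negatives=("equal",)):
    """Select the probesets that take part in a ROC chart.

    categories: list of "de", "equal" or "empty", one per probeset.
    negatives:  categories to treat as true negatives; anything that is
                neither "de" nor listed here is ignored entirely.
    Returns (indices, labels), where labels are True for true positives.
    """
    indices, labels = [], []
    for i, cat in enumerate(categories):
        if cat == "de":
            indices.append(i)
            labels.append(True)
        elif cat in negatives:
            indices.append(i)
            labels.append(False)
    return indices, labels


def false_positive_rate(scores, categories, threshold, negatives):
    """Fraction of the chosen true negatives scoring at or above the threshold."""
    idx, labels = evaluation_set(categories, negatives)
    neg_scores = [scores[i] for i, lab in zip(idx, labels) if not lab]
    return sum(1 for s in neg_scores if s >= threshold) / len(neg_scores)


# Hypothetical log fold changes (S relative to C) for ten probesets.
categories = ["de", "de", "equal", "equal", "equal", "equal",
              "empty", "empty", "empty", "empty"]
log_fc = [1.2, 0.9, -0.1, 0.0, 0.1, -0.2, 0.4, 0.5, 0.3, 0.6]

fpr_equal = false_positive_rate(log_fc, categories, 0.35, ("equal",))
fpr_invariant = false_positive_rate(log_fc, categories, 0.35, ("equal", "empty"))
print(fpr_equal, fpr_invariant)
```

With the same scores and threshold, the apparent false-positive rate changes simply because the empty probesets are counted as negatives, which is the effect that ignoring them is meant to avoid.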
The Golden Spike data set has been used to validate many different methods for summarizing Affymetrix data sets. Choe et al. [4] originally used this data set to show that a modified form of MAS5.0 (which we will refer to as CP for Choe Preferred) outperforms RMA [10], GCRMA [11] and MBEI (the algorithm used in the dChip software) [12]. Liu et al. [13] used the data set to show that multi-mgMOS [14] can outperform CP. Hochreiter et al. [15] used the data set to show that FARMS outperforms RMA, MAS5.0 and MBEI, and that RMA outperforms MAS5.0 and MBEI, in apparent contradiction to Choe et al. [4]. Chen et al. [16] used the data to show that DFW and GCRMA outperform RMA, MAS5.0, MBEI, PLIER [17], FARMS and CP, again in apparent contradiction to Choe et al. [4]. All of these papers used some form of ROC curve in their analyses. The confusing and seemingly contradictory results make it difficult for typical Affymetrix users to decide between methods.
1. Summary statistic used (e.g. RMA, GCRMA, MAS5.0, etc.). Note that Choe et al. [4] broke this particular choice down to four separate sub-choices of methods for background correction, probe-level normalization, PM adjustment, and expression summary.
2. Post-summarization normalization method. Choe et al. [4] compared no further normalization against the use of a loess probeset-level normalization based on the known invariant probesets.
3. DE detection method. According to Table 1, Choe et al. [4] compared the t-test, Cyber-T and SAM.
4. Direction of differential expression. Choe et al. [4] used a 2-sided test (as opposed to, for example, a 1-sided test of upregulation).
5. Choice of true positives. Choe et al. [4] used all spike-in probesets with fold change (FC) greater than 1.
6. Choice of true negatives. Choe et al. [4] used all invariant probesets. This included both probesets that were spiked in at equal quantities, as well as the so-called "empty" probesets.
Table 1. Analysis choices of various studies of the "Golden Spike" data set. These are choices we believe were made for each of the six stages of the analysis pipeline we have outlined.

| Study | Summarization method | Post-summarization normalization | DE method | Direction | True positives | True negatives |
| --- | --- | --- | --- | --- | --- | --- |
| Choe et al. [4] | CP, MAS5.0, RMA, GCRMA, MBEI plus many variants of these | none, loess_invariant | t-test, Cyber-T, SAM | either | FC > 1 | invariant |
| Liu et al. [13] | CP, multi-mgMOS | loess_invariant | Cyber-T, PPLR | up | FC > 1 | invariant |
| Hochreiter et al. [15] | MAS5.0, RMA, MBEI and FARMS | none | SAM | up | FC > 1 | invariant |
| Chen et al. [16] | CP, MAS5.0, RMA, GCRMA, MBEI, PLIER, FARMS and DFW | none | FC | either | FC > 1 and FC = x (for all x) | invariant |
| Current study | CP, MAS5.0, RMA, GCRMA, MBEI, multi-mgMOS, FARMS, DFW, PLIER | none, loess_invariant, loess_equal, loess_all | FC, t-test, Cyber-T, limma and PPLR | either, up and down | FC > 1, low FC, medium FC, high FC and FC = x (for all x) | equal and invariant |
The most commonly used metric for assessing a DE detection method's performance is the Area Under the standard ROC Curve (AUC). This is typically calculated for the full ROC chart (i.e. FPR values from 0 to 1), but can also be calculated for a small portion of the chart (e.g. FPRs between 0 and 0.05). Other metrics that might be used are the number or proportion of true positives for a fixed number or proportion of false positives, or conversely the number or proportion of false positives for a fixed number or proportion of true positives.
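The metrics described above can all be derived from the same list of ROC points. The following is a minimal sketch using made-up scores and labels; the function names are hypothetical, and the partial AUC cut-off of 0.25 is chosen only because the toy example has four true negatives (the text's example uses FPRs up to 0.05).

```python
def roc_points(scores, labels):
    """ROC curve as (FPR, TPR) points, from the strictest to the laxest threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Handle tied scores together so the curve does not depend on their order.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc(points, max_fpr=1.0):
    """Trapezoidal area under the ROC curve for FPR values in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Interpolate linearly to the cut-off.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def tpr_at_fpr(points, fpr):
    """Largest TPR reached without exceeding the given FPR."""
    return max(y for x, y in points if x <= fpr)


def fpr_at_tpr(points, tpr):
    """Smallest FPR at which the given TPR is reached."""
    return min(x for x, y in points if y >= tpr)


# Hypothetical DE scores (larger = more likely DE) and true DE status.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [True, True, False, True, False, True, False, False]

points = roc_points(scores, labels)
full_auc = auc(points)
partial_auc = auc(points, max_fpr=0.25)
print(full_auc, partial_auc, tpr_at_fpr(points, 0.0), fpr_at_tpr(points, 0.5))
```

Note that the partial AUC is not rescaled here, so its maximum possible value equals the cut-off itself.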
In this study we have analyzed all combinations of the various options shown in the last row of Table 1. In addition, we have created charts displaying the data in different ways. In the next section we show how results can vary when making different choices at the stages of the analysis pipeline highlighted above. We also discuss what we believe are good choices. We detail a web resource called AffyDEComp which can be used as a limited benchmark for DE methods on Affymetrix data. We also highlight some issues of reproducibility in comparative studies. We conclude by making recommendations on choices of Affymetrix analysis methods, and desired characteristics of future spike-in data sets.
Results and Discussion
Direction of Differential Expression
The choice of whether 1-sided or 2-sided tests should be used for comparison of methods is debatable. A 1-sided test for downregulation is clearly not a sensible choice given that all the known DE genes are upregulated. We would expect a 1-sided test of upregulation to give the strongest results, given that all the unequal spike-ins are upregulated. However, in most real microarray data sets, we are likely to be interested in genes which show the highest likelihood of being DE, regardless of the direction of change. As such, we will continue to use both a 2-sided test, and a 1-sided test of upregulation in the remainder of the paper. In our comprehensive analysis, however, we also include results for 1-sided tests of downregulation for completeness.
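For ranking purposes, the three directions differ only in how a signed test statistic is turned into a score. The sketch below assumes a signed statistic (for example a t-statistic computed as S minus C); the statistic values and the function names are hypothetical.

```python
def de_score(statistic, direction="either"):
    """Map a signed test statistic (S minus C) to a 'more likely DE' score."""
    if direction == "either":   # 2-sided: size of the change only
        return abs(statistic)
    if direction == "up":       # 1-sided test of upregulation
        return statistic
    if direction == "down":     # 1-sided test of downregulation
        return -statistic
    raise ValueError("direction must be 'either', 'up' or 'down'")


def ranking(statistics, direction):
    """Probeset indices ordered from most to least likely DE."""
    return sorted(range(len(statistics)),
                  key=lambda i: -de_score(statistics[i], direction))


# Hypothetical statistics for five probesets; probeset 1 is strongly negative.
t_stats = [4.0, -5.0, 2.5, -0.5, 1.0]

rank_either = ranking(t_stats, "either")
rank_up = ranking(t_stats, "up")
rank_down = ranking(t_stats, "down")
print(rank_either, rank_up, rank_down)
```

In a data set where every true positive is upregulated, a strongly negative statistic such as that of probeset 1 heads the 2-sided ranking but falls to the bottom of the 1-sided upregulation ranking, which is why the 1-sided test is expected to give stronger results here.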
True negatives
Post-Summarization Normalization
We agree with Gaile and Miecznikowski [7] that "the invariant set of genes used for the preprocessing steps in Choe et al. should not have included the empty null probesets". As such, for the remainder of this paper we will not use the empty probesets in loess normalization. In our comprehensive analysis we also include, for completeness, results when using all of the following post-summarization normalization strategies: no post-summarization normalization, a loess normalization based on all spike-in probesets, a loess normalization based on all the unchanging probesets and a loess normalization based on the equal-valued spike-ins.
Differential Expression Detection Methods
Table 2. AUCs for 2-sided test of DE. This table shows AUC values for different combinations of summarization and DE detection methods. The 10 highest AUC values are highlighted in bold. Note that the PPLR method is only applicable to summarization methods that give uncertainty estimates as well as mean expression levels for each probeset. These results were calculated using only the equal spike-ins as true negatives, and all spike-ins with FC > 1 as true positives. A post-summarization loess normalization using the equal-valued spike-ins was used. The results in this table are for 2-sided tests of DE.

| Summarization method | limma | FC | t-test | Cyber-T | PPLR |
| --- | --- | --- | --- | --- | --- |
| mmgMOS | 0.903 | 0.861 | 0.902 | 0.919 | 0.922 |
| MAS5 | 0.884 | 0.848 | 0.879 | 0.905 | n/a |
| CP | 0.905 | 0.873 | 0.898 | 0.919 | 0.889 |
| PLIER | 0.898 | 0.889 | 0.889 | 0.911 | n/a |
| RMA | 0.881 | 0.885 | 0.858 | 0.886 | 0.860 |
| GCRMA | 0.890 | 0.902 | 0.883 | 0.909 | n/a |
| DFW | 0.764 | 0.815 | 0.732 | 0.703 | 0.806 |
| MBEI | 0.885 | 0.884 | 0.870 | 0.897 | 0.855 |
| FARMS | 0.842 | 0.891 | 0.805 | 0.844 | 0.772 |
Table 3. AUCs for 1-sided test of upregulation. This table shows AUC values for different combinations of summarization and DE detection methods. The 10 highest AUC values are highlighted in bold. Note that the PPLR method is only applicable to summarization methods that give uncertainty estimates as well as mean expression levels for each probeset. These results were calculated using only the equal spike-ins as true negatives, and all spike-ins with FC > 1 as true positives. A post-summarization loess normalization using the equal-valued spike-ins was used. The results in this table are for 1-sided tests of upregulation.

| Summarization method | limma | FC | t-test | Cyber-T | PPLR |
| --- | --- | --- | --- | --- | --- |
| mmgMOS | 0.940 | 0.920 | 0.938 | 0.949 | 0.951 |
| MAS5 | 0.924 | 0.908 | 0.921 | 0.934 | n/a |
| CP | 0.940 | 0.928 | 0.935 | 0.948 | 0.932 |
| PLIER | 0.934 | 0.929 | 0.930 | 0.941 | n/a |
| RMA | 0.929 | 0.932 | 0.914 | 0.932 | 0.917 |
| GCRMA | 0.926 | 0.946 | 0.921 | 0.944 | n/a |
| DFW | 0.817 | 0.918 | 0.794 | 0.830 | 0.912 |
| MBEI | 0.928 | 0.928 | 0.920 | 0.934 | 0.915 |
| FARMS | 0.883 | 0.938 | 0.847 | 0.908 | 0.893 |
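The AUC values for the 1-sided test of upregulation can be ranked programmatically to find the strongest combinations. The sketch below transcribes the values from the table above; it assumes that the columns are, in order, limma, FC, t-test, Cyber-T and PPLR, and that rows with only four values are missing the PPLR entry. Both assumptions are inferred from the table layout and caption rather than stated explicitly.

```python
# Column order assumed from the table header; None marks an assumed missing PPLR value.
DE_METHODS = ["limma", "FC", "t-test", "Cyber-T", "PPLR"]

AUC_ONE_SIDED_UP = {
    "mmgMOS": [0.940, 0.920, 0.938, 0.949, 0.951],
    "MAS5":   [0.924, 0.908, 0.921, 0.934, None],
    "CP":     [0.940, 0.928, 0.935, 0.948, 0.932],
    "PLIER":  [0.934, 0.929, 0.930, 0.941, None],
    "RMA":    [0.929, 0.932, 0.914, 0.932, 0.917],
    "GCRMA":  [0.926, 0.946, 0.921, 0.944, None],
    "DFW":    [0.817, 0.918, 0.794, 0.830, 0.912],
    "MBEI":   [0.928, 0.928, 0.920, 0.934, 0.915],
    "FARMS":  [0.883, 0.938, 0.847, 0.908, 0.893],
}

# Flatten to (AUC, summarization, DE method) and sort from best to worst.
combos = [
    (value, summarization, de_method)
    for summarization, row in AUC_ONE_SIDED_UP.items()
    for de_method, value in zip(DE_METHODS, row)
    if value is not None
]
combos.sort(reverse=True)

top5 = [(summ, de, value) for value, summ, de in combos[:5]]
for summ, de, value in top5:
    print(f"{summ}/{de}: {value:.3f}")
```

Under these assumptions the top of the list is the combination of mmgMOS with PPLR, followed by combinations involving Cyber-T and by GCRMA with FC.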
Figure 6 can be used for overall comparisons of DE methods. In general, we see that Cyber-T tends to outperform limma, and both of these methods generally outperform the use of standard t-tests. The performance of FC as a DE detection method varies much more, depending on the summarization method used. When FC is used in combination with DFW, FARMS or GCRMA, performance is generally amongst the best. However, performance of FC with RMA, MBEI and PLIER is less strong, and the combination of FC with multi-mgMOS, MAS5.0 or CP is particularly poor. Of the summarization methods that perform well with FC, FARMS and DFW have generally poor performance when used in combination with other methods. GCRMA has reasonable performance in combination with Cyber-T, but is in the lower half of summarization methods when used in combination with either limma or standard t-tests.
True positives
So far we have used all of the genes that are spiked in at higher concentrations in the S samples relative to the C samples as our true positives. This is perhaps the best and fairest way to determine overall performance of a DE detection method. However, we might also be interested in whether certain methods perform particularly well in "easier" or "more difficult" cases. Indeed, many analysts are only interested in genes which are determined not only to have a probability of being DE that is significant, but also to have a FC which is greater than some predetermined threshold. In order to determine which methods perform more strongly in "easy" or "difficult" cases, we can restrict our true positives to just those genes that are known to be DE by just a small FC, or to those that are very highly DE.
Comprehensive Analysis
1. AUC where equal-valued spike-ins are used as true negatives, spike-ins with FC > 1 are used as true positives, a post-summarization loess normalization based on the equal-valued spike-ins is used, and a 1-sided test of upregulation is the DE metric. This gives the values shown in Table 3.
2. As 1, but using a 2-sided test of DE. This gives the values shown in Table 2.
3. As 1, but with low FC spike-ins used as true positives. This gives the values shown in Figure 7.
4. As 1, but with medium FC spike-ins used as true positives. This gives the values shown in Figure 7.
5. As 1, but with high FC spike-ins used as true positives. This gives the values shown in Figure 7.
6. As 1, but with all unchanging probesets used as true negatives.
7. As 1, but with all unchanging probesets used as true negatives, and a post-summarization loess normalization based on the unchanging probesets.
8. As 1, but with a post-summarization loess normalization based on all spike-in probesets.
9. As 1, but with no post-summarization normalization.
10. As 1, but giving the AUC for FPRs up to 0.01.
11. The proportion of true positives without any false positives (i.e. the TPR for a FPR of 0), using the same conditions as 1.
12. The TPR for a FPR of 0.5, using the same conditions as 1.
13. The FPR for a TPR of 0.5, using the same conditions as 1.
We are happy to include other methods if they are made available through Bioconductor packages. We also intend to extend AffyDEComp to include future spike-in data sets as they become available. In this way we expect this web resource to become a valuable tool in comparing the performance of both summarization and DE detection methods.
Reproducible Research
1. provide full details of all parameter choices used in the paper's Methods section, or
2. make the code used to create the results available, ideally as supplementary information to ensure a permanent record.
We recommend that journals should not accept method comparison papers unless at least one of these is done. This paper was prepared as a "Sweave" document [25]. The source code for this document is a mixture of LaTeX and R code. We have made the source code available as Additional file 1. This means that all the code used to create all the results in this paper, and in AffyDEComp [24], is available and all results can be recreated using open source tools.
Conclusion
1. The use of a post-summarization loess normalization, with the equal spike-in probesets used as the subset to normalize with. This is also recommended by Gaile and Miecznikowski [7].
2. The use of a 1-sided test for upregulation of genes between the C and S conditions. This mimics the actual situation because all the non-equal spike-ins are upregulated.
3. The use of all upregulated probesets as the true positives for the ROC chart.
Using the above recommendations, we created ROC charts for all combinations of summarization and DE methods (Figure 5b and Table 3). This showed us that there was no clear DE detection method that stood out, but that what is important is the combination of summarization and DE method. We saw that the combination of multi-mgMOS and PPLR gave the largest AUC. One of the downsides with the PPLR approach is that there is no principled way of determining the proportion of genes that are DE, as is claimed by some FDR methods. Other combinations that had strong performance included GCRMA/FC, and Cyber-T used in conjunction with various normalization methods. By looking at very small FPRs (Figure 6), CP/Cyber-T, FARMS/FC and DFW/FC were all shown to be potentially valuable when identifying a small number of potential targets. If looking only for genes with larger FCs (Figure 7), RMA/FC was seen to give the strongest performance.
It should be noted that the design of this experiment could favor certain methods. We have seen that the intensities of the spike-in probesets are particularly high. Estimates of expression levels are known to be more accurate for high intensity probesets. This could favor the FC method of determining DE.
Furthermore, the replicates in the Golden Spike study are technical rather than biological, and hence the variability between arrays might be expected to be lower in this data set than in a typical data set. Again, this might favor the FC DE method.
We agree with Irizarry et al. [6] that the Golden Spike data set is flawed. In particular, we recognize that in creating ROC charts from just those probesets which were spiked in, we are using a data set where the probe intensities are higher than in many typical microarray data sets. Also, applying a post-summarization normalization is not something that many typical analysts will perform, but is believed to be necessary to overcome some of the limitations of this data set, namely that the experiment is unbalanced due to the fact that all the DE spike-ins are upregulated. We believe that using only the equal-valued spike-in probesets, both as true negatives and for the post-summarization normalization, is the most appropriate way of analyzing this particular data set. Furthermore, given the issues highlighted in the introduction regarding Affycomp and comparisons with qRT-PCR results, we believe that the Golden Spike data set is still the most appropriate tool for comparing DE methods. To this end we have created the AffyDEComp benchmark to enable researchers to compare DE methods. However, we should stress that we are not, at this stage, recommending that AffyDEComp be used as a reliable benchmark, as the Golden Spike data set might not be representative of data sets more generally. In particular, just because a method does well here doesn't necessarily mean that the method will do well generally. At this time, AffyDEComp might better be suited to identifying combinations of summarization and DE detection methods that perform particularly poorly.
1. Spike-in concentrations are realistic
2. DE spike-ins are a mixture of up- and downregulated
3. Nominal concentrations and FC sizes are not confounded
4. The number of arrays used is large enough to be representative of some of the larger studies being performed today
We believe that only by creating such data sets will we be able to ascertain whether the artifact noted by Irizarry et al. [6] is a peculiarity of the Golden Spike data set, or is a general feature of spike-in data sets. More importantly, the creation of such data sets should improve the AffyDEComp benchmark, and hence enable the community to better evaluate DE detection methods for Affymetrix data.
Methods
The raw data from the Choe et al. [4] study was originally downloaded from the author's website [26]. All analysis was carried out using the R language (version 2.6.0). MAS5.0, CP, RMA and MBEI expression measures were created using the Bioconductor [27] affy package (version 1.16.0). GCRMA expression measures were created using the Bioconductor gcrma package (version 2.10.0). PLIER expression measures were created using the Bioconductor plier package (version 1.8.0). multi-mgMOS expression measures were created using the Bioconductor puma package (version 1.4.1). FARMS expression measures were created using the FARMS package (version 1.1.1) from the author's website [28]. DFW expression measures were created using the affy package and code from the author's website [29]. Cyber-T results and loess normalization were obtained using the goldenspike package (version 0.4) [26]. All other analysis was carried out using the Bioconductor puma package (version 1.4.1).
The code used to create all results in this document is included as Additional file 1.
List of abbreviations
DE: differentially expressed or differential expression, as appropriate.
FC: fold change
MAQC: MicroArray Quality Control
ROC: receiver-operator characteristic
FPR: false-positive rate
TPR: true-positive rate
FDR: false-discovery rate
AUC: area under curve (in this paper this refers to the area under the ROC curve).
Declarations
Acknowledgements
The author thanks Magnus Rattray for a careful reading of the manuscript and useful comments. This work was supported by an NERC "Environmental Genomics/EPSRC" studentship.
References
1. Allison DB, Cui X, Page GP, Sabripour M: Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet 2006, 7: 55–65. 10.1038/nrg1749
2. Cope LM, Irizarry RA, Jaffee HA, Wu Z, Speed TP: A benchmark for Affymetrix GeneChip expression measures. Bioinformatics 2004, 20(3):323–31. 10.1093/bioinformatics/btg410
3. Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de Longueville F, Kawasaki ES, Lee KY, et al.: The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 2006, 24(9):1151–61. 10.1038/nbt1239
4. Choe SE, Boutros M, Michelson AM, Church GM, Halfon MS: Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol 2005, 6(2):R16. 10.1186/gb-2005-6-2-r16
5. Dabney AR, Storey JD: A reanalysis of a published Affymetrix GeneChip control dataset. Genome Biol 2006, 7(3):401. 10.1186/gb-2006-7-3-401
6. Irizarry RA, Cope LM, Wu Z: Feature-level exploration of a published Affymetrix GeneChip control dataset. Genome Biol 2006, 7(8):404. 10.1186/gb-2006-7-8-404
7. Gaile DP, Miecznikowski JC: Putative null distributions corresponding to tests of differential expression in the Golden Spike dataset are intensity dependent. BMC Genomics 2007, 8: 105. 10.1186/1471-2164-8-105
8. Fodor AA, Tickle TL, Richardson C: Towards the uniform distribution of null p-values on Affymetrix microarrays. Genome Biol 2007, 8(5):R69. 10.1186/gb-2007-8-5-r69
9. Schuster E, Blanc E, Partridge L, Thornton J: Estimation and correction of non-specific binding in a large-scale spike-in experiment. Genome Biology 2007, 8: R126. 10.1186/gb-2007-8-6-r126
10. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4(2):249–64. 10.1093/biostatistics/4.2.249
11. Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo F, Spencer F: A Model-Based Background Adjustment for Oligonucleotide Expression Arrays. Journal of the American Statistical Association 2004, 99(468):909–918. 10.1198/016214504000000683
12. Li C, Wong WH: Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA 2001, 98: 31–6. 10.1073/pnas.011404098
13. Liu X, Milo M, Lawrence ND, Rattray M: Probe-level measurement error improves accuracy in detecting differential gene expression. Bioinformatics 2006, 22(17):2107–13. 10.1093/bioinformatics/btl361
14. Liu X, Milo M, Lawrence ND, Rattray M: A tractable probabilistic model for Affymetrix probe-level analysis across multiple chips. Bioinformatics 2005, 21(18):3637–44. 10.1093/bioinformatics/bti583
15. Hochreiter S, Clevert DA, Obermayer K: A new summarization method for Affymetrix probe level data. Bioinformatics 2006, 22(8):943–9. 10.1093/bioinformatics/btl033
16. Chen Z, McGee M, Liu Q, Scheuermann RH: A distribution free summarization method for Affymetrix GeneChip arrays. Bioinformatics 2007, 23(3):321–7. 10.1093/bioinformatics/btl609
17. Hubbell E: PLIER White Paper. Affymetrix, Santa Clara, California 2005.
18. Smyth G: Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments. Statistical Applications in Genetics and Molecular Biology 2004, 3: Article 3. 10.2202/1544-6115.1027
19. Baldi P, Long AD: A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001, 17(6):509–19. 10.1093/bioinformatics/17.6.509
20. Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001, 98(9):5116–21. 10.1073/pnas.091062498
21. Lemieux S: Probe-level linear model fitting and mixture modeling results in high accuracy detection of differential gene expression. BMC Bioinformatics 2006, 7: 391. 10.1186/1471-2105-7-391
22. Hess A, Iyer H: Fisher's combined p-value for detecting differentially expressed genes using Affymetrix expression arrays. BMC Genomics 2007, 8: 96. 10.1186/1471-2164-8-96
23. Sing T, Sander O, Beerenwinkel N, Lengauer T: ROCR: visualizing classifier performance in R. Bioinformatics 2005, 21(20):3940–3941. 10.1093/bioinformatics/bti623
24. AffyDEComp [http://manchester.ac.uk/bioinformatics/affydecomp]
25. Leisch F: Sweave: Dynamic generation of statistical reports using literate data analysis. Compstat 2002, 575–580.
26. Golden Spike Experiment [http://www.elwood9.net/spike/]
27. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al.: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80. 10.1186/gb-2004-5-10-r80
28. FARMS package [http://www.bioinf.jku.at/software/farms/farms_1.1.1.tar.gz]
29. Distribution Free Weighted Fold Change Summarization Method (DFW) [http://faculty.smu.edu/mmcgee/dfwcode.pdf]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.