Volume 9 Supplement 9
The balance of reproducibility, sensitivity, and specificity of lists of differentially expressed genes in microarray studies
- Leming Shi1,
- Wendell D Jones2,
- Roderick V Jensen3,
- Stephen C Harris1,
- Roger G Perkins4,
- Federico M Goodsaid5,
- Lei Guo1,
- Lisa J Croner6,
- Cecilie Boysen7,
- Hong Fang4,
- Feng Qian4,
- Shashi Amur5,
- Wenjun Bao8,
- Catalin C Barbacioru9,
- Vincent Bertholet10,
- Xiaoxi Megan Cao4,
- Tzu-Ming Chu8,
- Patrick J Collins11,
- Xiao-hui Fan1, 12,
- Felix W Frueh5,
- James C Fuscoe1,
- Xu Guo13,
- Jing Han14,
- Damir Herman15,
- Huixiao Hong4,
- Ernest S Kawasaki16,
- Quan-Zhen Li17,
- Yuling Luo18,
- Yunqing Ma18,
- Nan Mei1,
- Ron L Peterson19,
- Raj K Puri14,
- Richard Shippy20,
- Zhenqiang Su1,
- Yongming Andrew Sun9,
- Hongmei Sun4,
- Brett Thorn4,
- Yaron Turpaz12,
- Charles Wang21,
- Sue Jane Wang5,
- Janet A Warrington13,
- James C Willey22,
- Jie Wu4,
- Qian Xie4,
- Liang Zhang23,
- Lu Zhang24,
- Sheng Zhong25,
- Russell D Wolfinger8 and
- Weida Tong1
© Shi et al; licensee BioMed Central Ltd. 2008
Published: 12 August 2008
Reproducibility is a fundamental requirement in scientific experiments. Some recent publications have claimed that microarrays are unreliable because lists of differentially expressed genes (DEGs) are not reproducible in similar experiments. Meanwhile, new statistical methods for identifying DEGs continue to appear in the scientific literature. The resultant variety of existing and emerging methods exacerbates confusion and continuing debate in the microarray community on the appropriate choice of methods for identifying reliable DEG lists.
Using the data sets generated by the MicroArray Quality Control (MAQC) project, we investigated the impact on the reproducibility of DEG lists of a few widely used gene selection procedures. We present comprehensive results from inter-site comparisons using the same microarray platform, cross-platform comparisons using multiple microarray platforms, and comparisons between microarray results and those from TaqMan – the widely regarded "standard" gene expression platform. Our results demonstrate that (1) previously reported discordance between DEG lists could simply result from ranking and selecting DEGs solely by statistical significance (P) derived from widely used simple t-tests; (2) when fold change (FC) is used as the ranking criterion with a non-stringent P-value cutoff filtering, the DEG lists become much more reproducible, especially when fewer genes are selected as differentially expressed, as is the case in most microarray studies; and (3) the instability of short DEG lists solely based on P-value ranking is an expected mathematical consequence of the high variability of the t-values; the more stringent the P-value threshold, the less reproducible the DEG list is. These observations are also consistent with results from extensive simulation calculations.
We recommend the use of FC-ranking plus a non-stringent P cutoff as a straightforward and baseline practice in order to generate more reproducible DEG lists. Specifically, the P-value cutoff should not be stringent (too small) and FC should be as large as possible. Our results provide practical guidance to choose the appropriate FC and P-value cutoffs when selecting a given number of DEGs. The FC criterion enhances reproducibility, whereas the P criterion balances sensitivity and specificity.
A fundamental step in most microarray experiments is determining one or more short lists of differentially expressed genes (DEGs) that distinguish biological conditions, such as disease from health. Challenges regarding the reliability of microarray results have largely been founded on the inability of researchers to replicate DEG lists across highly similar experiments. For example, Tan et al.  found only four common DEGs using an identical set of RNA samples across three popular commercial platforms. Independent studies by the groups of Ramalho-Santos  and Ivanova  of stem cell-specific genes using the same Affymetrix platform and similar study design found a disappointing six common DEGs among about 200 identified in each study . A comparative neurotoxicological study by Miller et al.  using the same set of RNA samples found only 11 common DEGs among 138 and 425, respectively, from Affymetrix and CodeLink platforms. All these studies ranked genes by P-value from simple t-tests, used a P threshold to identify DEG lists, and applied the concept of the Percentage of Overlapping Genes (POG), or the Venn diagram, between DEG lists as the measure of reproducibility.
Criticism of and concerns about microarrays continue to appear in some of the most prestigious scientific journals [6–10], leading to a growing negative perception regarding microarray reproducibility, and hence reliability. However, in reanalyzing the data set of Tan et al. , Shi et al.  found that cross-platform concordance was markedly improved when either simple fold change (FC) or Significance Analysis of Microarrays (SAM)  methods were used to rank order genes before determining DEG lists. The awareness that microarray reproducibility is sensitive to how DEGs are identified was, in fact, a major motivator for the MicroArray Quality Control (MAQC) project [11, 13, 14].
Several plausible explanations and solutions have been proposed to interpret and address the apparent lack of reproducibility and stability of DEG lists from microarray studies. The suggested improvements include larger sample sizes; novel, microarray-specific statistical methods; more accurate array annotation information obtained by mapping probe sequences across platforms [1, 17]; eliminating absent call genes from data analysis [11, 18, 19]; improving probe design to minimize cross-hybridization; standardizing manufacturing processes; and improving data quality by fully standardizing sample preparation and hybridization procedures.
The MAQC study  was specifically designed to address these previously identified sources of variability in DEG lists. Two distinct RNA samples, Stratagene Universal Human Reference RNA (i.e., MAQC sample A) and Ambion Human Brain Reference RNA (i.e., MAQC sample B), with thousands of differentially expressed genes, were prepared in sufficient quantities and distributed to three different laboratories for each of the five different commercial whole genome microarray platforms participating in the study. For each platform, each sample was analyzed using five technical replicates with standardized procedures for sample processing, hybridization, scanning, data acquisition, data preprocessing, and data normalization at each site. The probe sequence information was used to generate a stringent mapping of genes across the different platforms and 906 genes were further analyzed with TaqMan® assays using the same RNA samples.
In addition to assessing the technical performance of different microarray platforms, the MAQC study also discussed the idea of using fold-change ranking along with a non-stringent P-value cutoff for selecting DEGs [13, 21]. However, many of the detailed results supporting the idea have not been formally published. The MAQC project, while positively received by the community [23–27], also stimulated criticism from the statistical community about appropriate ways of identifying DEGs [22, 23, 27–33].
To help the microarray community better understand the issue at debate and move forward, in this study, we conducted a careful analysis of these MAQC data sets, along with numerical simulations and mathematical arguments. We demonstrate that the reported lack of reproducibility of DEG lists can be attributed in large part to identifying DEGs from simple t-tests without consideration of FC. The finding holds for intra-laboratory, inter-laboratory, and cross-platform comparisons independent of sample pairs and normalization methods, and is increasingly apparent with decreasing number of genes selected.
As a basic procedure for improving reproducibility while balancing specificity and sensitivity, choosing genes using a combination of FC-ranking and a P threshold was investigated. This joint criterion results in DEG lists with much higher POG, commensurate with better reproducibility, than lists generated by t-test P alone, even when selecting a relatively small number of genes. An FC criterion explicitly incorporates the measured quantity to enhance reproducibility, whereas a P criterion incorporates control of sensitivity and specificity. The results increase our confidence in the reproducibility of microarray studies while supporting a need for caution in the use of inferential statistics when selecting DEGs. While numerous more advanced statistical modeling techniques have been proposed and compared for selecting DEGs [16, 34, 35], the objectives here are to explain that the primary reason for microarray reproducibility concerns is failure to include an FC criterion during gene selection, and to recommend a simple and straightforward approach that concurrently satisfies statistical and reproducibility requirements. It should be stressed that robust methods are needed to meet stringent clinical requirements for reproducibility, sensitivity and specificity of microarray applications in, for example, clinical diagnostics and prognostics.
POG values for a number of gene selection scenarios employing P and/or FC are compared, and a numerical example (see side box) is provided that shows how the simple t-test, when sample size is small, results in selection of different genes purely by chance. While the data generated from the MAQC samples A and B lack biological variability, the results are supported by the toxicogenomic data of Guo et al. While P could be computed from many different statistical methods, for simplicity and consistency, throughout this article P is calculated with the two-tailed t-test that is widely employed in microarray data analysis.
Inter-site concordance for the same platform
The substantial difference in inter-site POG shown in Figures 1 and 2 is a direct result of applying different gene selection methods to the same data sets, and clearly depicts how perceptions of inter-site reproducibility can be affected for any microarray platform. While the emphasis here is on reproducibility in terms of POG, in practice, this criterion must be balanced against other desirable characteristics of gene lists, such as specificity and sensitivity (when the truth is binary) or mean squared error (when the truth is continuous), considerations that are discussed further in later sections.
Concordance between microarray and TaqMan® assays
Reproducibility of FC and t-statistic: different metrics for identifying differentially expressed genes (DEGs)
Joint FC and P rule illustrated with a volcano plot: ranking by FC, not by P
Concordance using other statistical tests
Gene selection in simulated datasets
The MAQC data, like data from actual experiments, allow evaluation of DEG list reproducibility, but not of truth. Statistics are used to estimate truth, often in terms of sensitivity and specificity, but the estimates are based on assumptions about data variance and error structure that are also unknown. Simulations where truth can be specified a priori are useful to conduct parametric evaluations of gene selection methods, and true false positives and negatives are then known. However, results are sensitive to assumptions regarding data structure and error that for microarrays remain poorly characterized.
For the MAQC-simulated data, either FC-ranking or FC-ranking combined with any of the P thresholds resulted in markedly higher POG than any P-ranking method, regardless of gene list length and coefficient of variation (CV) of replicates. The POG is ~100%, ~95%, and ~75% for replicate CV values of 2%, 10%, and 35%, respectively, decreasing to about 20–30% with an exceedingly high (100%) CV. In contrast, POG from P-ranking alone varies from a few percent to only ~10% when 500 genes are selected.
For the Medium- and Small-Delta simulated data sets, differences start to emerge between using FC alone and FC with P cutoff. From Figure 11, when variances in replicates become larger (CV > 10%), the reproducibility is greatly enhanced using FC-ranking with a suitable P cutoff versus FC or P by themselves. In addition, when variances are small (CV ≤ 10%), the reproducibility is essentially the same for FC with P or without. What is clear is that P by itself did not produce the most reproducible DEG list under any simulated condition.
Although P-ranking generally resulted in very low POG, a false positive was rarely produced, even for a list size of 500 (data not shown). Thus, the P criterion performed as expected, and identified mostly true positives. Unfortunately, the probability of selection of the same true positives with a fixed P cutoff in a replicated experiment appears small due to variation in the P statistic itself (see inset). FC-ranking by itself resulted in a large number of false positives with a large number of genes for the Medium and Small-Delta sets when genes with small FCs are selected as differentially expressed. These false positives were greatly reduced to the same level as for the P-ranking alone when FC-ranking was combined with a P-cutoff.
Variability of the two-sample t-statistic
The two-sample t-statistic with pooled variance is

$$t = \frac{\bar{X}_A - \bar{X}_B}{S_p \sqrt{1/n_A + 1/n_B}}$$

where $\bar{X}_A$ is the average of the log2 expression levels of sample A with $n_A$ replicates, $\bar{X}_B$ is the average of the log2 expression levels of sample B with $n_B$ replicates, $S_p^2 = (SS_A + SS_B)/(n_A + n_B - 2)$ is the pooled variance of samples A and B, and SS denotes the sum of squared errors. The numerator of the t-statistic is the fold-change (FC) in log2 scale and represents the signal level of the measurements (i.e., the magnitude of the difference between the expression levels of sample A and sample B). The denominator represents the noise components from the measurements of samples A and B. Thus, the t-statistic represents a measure of the signal-to-noise ratio. Therefore, the FC and the t-statistic (P) are two measures for the differences between the means of sample A and sample B. The t-statistic is intrinsically less reproducible than FC when the variance is small.
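The pooled-variance t-statistic described above can be sketched in a few lines of Python. This is only an illustration: the function name `pooled_t` and the replicate values are made up for the example and are not MAQC data.

```python
import math

def pooled_t(a, b):
    """Return (log2 fold change, t-statistic) for two groups of log2 values.

    Uses the pooled variance S_p^2 = (SS_A + SS_B) / (n_A + n_B - 2).
    """
    n_a, n_b = len(a), len(b)
    mean_a = sum(a) / n_a
    mean_b = sum(b) / n_b
    ss_a = sum((x - mean_a) ** 2 for x in a)  # sum of squared errors, sample A
    ss_b = sum((x - mean_b) ** 2 for x in b)  # sum of squared errors, sample B
    s_p2 = (ss_a + ss_b) / (n_a + n_b - 2)    # pooled variance
    log2_fc = mean_a - mean_b                 # numerator: the "signal"
    t = log2_fc / math.sqrt(s_p2 * (1 / n_a + 1 / n_b))  # signal / noise
    return log2_fc, t

# Hypothetical log2 expression values, five replicates per sample.
sample_a = [10.1, 9.9, 10.0, 10.2, 9.8]
sample_b = [8.0, 8.1, 7.9, 8.2, 7.8]
log2_fc, t = pooled_t(sample_a, sample_b)
print(f"log2 FC = {log2_fc:.2f}, t = {t:.2f}")
```

The numerator depends only on the two means, while the denominator depends on a variance estimated from very few replicates, which is where the extra variability of t comes from.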
Assume equal sample sizes $n_A = n_B = n$, a common variance $\sigma^2$, and a true difference $d$ between the two means (in log2 scale). The t-statistic then has a non-central t-distribution with non-centrality parameter

$$\delta = \frac{d}{\sigma\sqrt{2/n}}$$

and with mean and variance

$$E(t) = \delta \sqrt{\frac{\nu}{2}}\, \frac{\Gamma((\nu-1)/2)}{\Gamma(\nu/2)}, \qquad \mathrm{Var}(t) = \frac{\nu (1+\delta^2)}{\nu-2} - [E(t)]^2$$

where $\nu = 2n-2$ is the degrees of freedom of the non-central t-distribution. When $d = 0$ (the two means are equal), the t-statistic has a t-distribution with mean $E(t) = 0$ and $\mathrm{Var}(t) = \nu/(\nu-2)$. The variance of the t-statistic depends on the sample size $n$, the magnitude of the difference between the two means $d$, and the variance $\sigma^2$. On the other hand, the variance of the mean difference for the FC is $(2/n)\sigma^2$. That is, the variance of the FC depends only on the sample size $n$ and the variance $\sigma^2$, regardless of the magnitude of the difference $d$ between the two sample means.
In an MAQC data set, a typical sample variance for the log2 expression levels is approximately $\sigma^2 = 0.15^2$. With $n = 5$, the standard deviation of the FC (in log2 scale) is 0.09. For a differentially expressed gene with a 4-fold change between 5 replicates of sample A and 5 replicates of sample B, $d = 2$ and the t-values have a non-central t-distribution with $\nu = n_A + n_B - 2 = 8$ degrees of freedom and $\delta = 21.08$. From the equations above, the mean and the variance of the t-values are $E(t) = 23.35$ and $\mathrm{Var}(t) = 6.96^2$. Within two standard deviations of its expected value, the t-value ranges from 9.43 ($= 23.35 - 2 \times 6.96$) to 37.27 ($= 23.35 + 2 \times 6.96$), corresponding to P-values from $1 \times 10^{-5}$ to $3 \times 10^{-10}$, based on the Student's two-sided t-test with 8 degrees of freedom. In contrast, when $n = 5$ the standard deviation of the FC (in log2 scale) is 0.09, so within two standard deviations the FC ranges only from 3.53 ($= 2^{1.82}$) to 4.53 ($= 2^{2.18}$). In this case, this gene would be selected as differentially expressed using either an FC cutoff of 3.5 or a P cutoff of $1 \times 10^{-5}$. On the other hand, for a gene with a 2-fold change ($d = 1$), the t-statistic has a non-central t-distribution with $\delta = 10.54$. The mean and the variance of the t-statistic are $E(t) = 11.68$ and $\mathrm{Var}(t) = 3.62^2$, with a corresponding P of $3 \times 10^{-6}$ at $t = 11.68$. Using the same P cutoff, $1 \times 10^{-5}$, this gene is likely to be selected, with probability greater than 0.5. For the FC criterion, the FC ranges from 1.76 ($= 2^{0.82}$) to 2.26 ($= 2^{1.18}$) within two standard deviations. Using the same FC cutoff of 3.5, this gene is very unlikely to be selected. Thus, the top ranked gene list based on the FC is more reproducible than the top ranked gene list based on the P. The top ranked genes selected by a P cutoff may not be reproducible between experiments although both lists may contain mostly differentially expressed genes.
Reference: N. Johnson and S. Kotz (1970). Continuous Univariate Distributions – 2. Houghton Mifflin, Boston.
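The side-box numbers can be checked numerically. The sketch below uses the standard mean and variance of the non-central t-distribution, with σ = 0.15 and n = 5 taken from the text; the function names are ours, and normally distributed log2 values with equal variance in both samples are assumed.

```python
import math

def noncentral_t_moments(d, sigma, n):
    """Return (delta, E(t), SD(t)) for a two-sample t-test with n replicates per group.

    d is the true log2 difference, sigma the common standard deviation.
    """
    nu = 2 * n - 2                               # degrees of freedom
    delta = d / (sigma * math.sqrt(2.0 / n))     # non-centrality parameter
    c = math.sqrt(nu / 2.0) * math.gamma((nu - 1) / 2.0) / math.gamma(nu / 2.0)
    mean_t = delta * c
    var_t = nu * (1 + delta ** 2) / (nu - 2) - mean_t ** 2
    return delta, mean_t, math.sqrt(var_t)

def fc_sd(sigma, n):
    """Standard deviation of the log2 fold change (difference of two means)."""
    return sigma * math.sqrt(2.0 / n)

for d in (2.0, 1.0):  # 4-fold and 2-fold changes
    delta, mean_t, sd_t = noncentral_t_moments(d, sigma=0.15, n=5)
    print(f"d={d}: delta={delta:.2f}, E(t)={mean_t:.2f}, SD(t)={sd_t:.2f}, "
          f"SD(log2 FC)={fc_sd(0.15, 5):.2f}")
```

Relative to its mean, the t-statistic varies by roughly 30%, whereas the log2 FC of the 4-fold gene varies by only about 5%, which is the asymmetry the side box describes.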
A fundamental requirement in microarray experiments is that the identification of DEG lists must be reproducible if the data and scientific conclusions from them are to be credible. DEG lists are normally developed by rank-ordering genes in accordance with a suitable surrogate value to represent biological relevance, such as the magnitude of the differential expression (i.e., FC) or the measure of statistical significance (P) of the expression change, or both. The results show that concurrent use of both FC-ranking and P-cutoff as criteria to identify biologically relevant genes can be essential to attain reproducible DEG lists across laboratories and platforms.
A decade since the microarray-generated differential gene expression results of Schena et al and Lockhart et al were published, microarray usage has become ubiquitous. Over this time, many analytical techniques for identifying DEGs have been introduced and used. Early studies predominantly relied on the magnitude of differential expression change in experiments done with few if any replicates, with an FC cutoff typically of two used to reduce false positives. Mutch et al recommended using intensity-dependent FC cutoffs to reduce biased selection of genes with low expression.
Gene selection using statistical significance estimates became more prevalent during the last few years as studies with replicates became possible. Incorporation of a t-statistic in gene selection was intended to compensate for the heterogeneity of variances of genes . Haslett et al.  employed stringent values of both FC and P to determine DEGs. In recent years, there has been an increasing tendency to use P-ranking for gene selection. Kittleson et al.  selected genes with a FC cutoff of two and a very restrictive Bonferroni corrected P of 0.05 in a quest for a short list of true positive genes. Tan et al.  used P to rank genes. Correlation coefficient, which behaves similarly to the t-statistic, has also been widely used as a gene selection method in the identification of signature genes for classification purposes [15, 47, 48].
New and widely employed methods have appeared in recent years and implicitly correct for the large variance in the t-statistic that results when gene variance is estimated with a small number of samples. Allison et al.  collectively described these methods as "variance shrinkage" approaches. They include the popular permutation-based "SAM" procedure [5, 12, 49, 50], Bayesian-based approaches [43, 51] and others . Qin et al.  compared several variance shrinkage methods with a simple t-statistic and FC for spike-in gene identification on a two-color platform, concluding that all methods except P performed well. All these methods have the effect of reducing a gene's variance to be between the average for the samples, and the average over the arrays.
In some cases, however, the use of FC for gene selection was criticized and entirely abandoned. For example, Callow et al., using P alone for identifying DEGs, concluded that P alone eliminated the need for filtering low intensity spots because the t-statistic is uniformly distributed across the entire intensity range. Reliance on P alone to represent a gene's FC and variability in gene selection has become commonplace. Norris and Kahn describe how false discovery rate (FDR) has become so widely used as to constitute a standard to which microarray analyses are held. However, FDR usually employs a shrunken t-statistic, and genes are ranked and selected much as they are with P (see Figure 11).
Prior to MAQC, Irizarry et al.  compared data from five laboratories and three platforms using the CAT plots that are essentially the same as the POG graphs used in our study. Lists of less than 100 genes derived from FC-ranking showed 30 to 80% intra-site, inter-site, and inter-platform concordance. Interestingly, important disagreements were attributable to a small number of genes with large FC that they posit resulted from a laboratory effect due to inexperienced technicians and sequence-specific effects where some genes are not correctly measured.
Exactly how to best employ FC with P to identify genes is a function of both the nature of the data and the inevitable tradeoff between sensitivity and specificity that is familiar across research, clinical screening and diagnostics, and even drug discovery. How the tradeoff is made depends on the application. Fewer false negatives at the cost of more false positives may be desirable when the application is identifying a few hundred genes for further study, and FC-ranking with a non-stringent P-value cutoff from a simple t-test could be used to eliminate some noise. The gene list can be further evaluated in terms of gene function and biological pathway data, as illustrated in Guo et al. for toxicogenomic data. Even for relatively short gene lists, FC-ranking together with a non-stringent P cutoff should result in reproducible lists. In addition, DEG lists identified by the ranking of FC are much less susceptible to the impact of normalization methods. In fact, global scaling methods (e.g., median- or mean-scaling) do not change the relative ranking of genes based on FC; they do, however, impact gene ranking by P-value.
The tradeoffs between reproducibility, sensitivity, and specificity become pronounced when genes are selected by P alone without consideration of FC, especially when a stringent P cutoff is used to reduce false positives. When sample numbers are small, any gene's t-statistic can change considerably in repeated studies within or across laboratories or across platforms. Each study can select different significant genes, purely by chance. It is entirely possible that separately determined lists will have a small proportion of common genes even while each list comprises mostly true positives. This apparent lack of reproducibility of the gene lists is an expected outcome of statistical variation in the t-statistic for small numbers of sample replicates. In other words, each study fails to produce some, but not all, of the correct results. The side box provides a numerical example of how gene list discordance can result from variation in the t-statistic across studies. Decreasing the P cutoff will increase the proportion of true positives, but also diminish the number of selected genes, diminish genes common across lists, and increase false negatives. Importantly, selecting genes based on a small P cutoff derived from a simple t-test without consideration of FC renders the gene list non-reproducible.
Additional insight is gained by viewing gene selection from the perspective of the biologist ultimately responsible for interpreting microarray results. Statistically speaking, a microarray experiment tests 10,000 or more null hypotheses where essentially all genes have non-zero differential expression. Statistical tests attempt to account for an unknowable error structure, in order to eliminate the genes with low probability of biological relevance. To the biologist, however, the variance of a gene with a large FC in one microarray study may be irrelevant if a similar experiment again finds the gene to have a large FC; the second experiment would probably be considered a validating reproduction. This conclusion would be reasonable since the gene's P depends on a poor estimate of variance across few samples, whereas a repeated FC measurement is tangible reproducibility which tends to increase demonstrably with increasing FC. The biological interpreter can also consider knowledge of gene function and biological pathways before finally assigning biological relevance, and will be well aware that either P or FC is only another indicator regarding biological significance.
This study shows that genes with smaller expression fold changes generated from one platform or laboratory are, in general, less reproducible in another laboratory with the same or different platforms. However, it should be noted that genes with small fold changes may be biologically important . When a fixed FC cutoff is chosen, sensitivity could be sacrificed for reproducibility. Alternatively, when a high P cutoff (or no P cutoff) is used, specificity could be sacrificed for reproducibility. Ultimately, the acceptable trade-off is based on the specific question being asked or the need being addressed. When searching for a few reliable biomarkers, high FC and low P cutoffs can be used to produce a highly specific and reproducible gene list. When identifying the components of genetic networks involved in biological processes, a lower FC and higher P cutoff can be used to identify larger, more sensitive but less specific, gene lists. In this case, additional biological information about putative gene functions can be incorporated to identify reliable gene lists that are specific to the biological process of interest.
Truly differentially expressed genes should be more likely identified as differentially expressed by different platforms and from different laboratories than those genes with no differential expression between sample groups. In the microarray field, we usually do not have the luxury of knowing the "truth" in a given study. Therefore, it is not surprising that most microarray studies and data analysis protocols have not been adequately evaluated against the "truth". A reasonable surrogate of such "truth" could be the consensus of results from different microarray platforms, from different laboratories using the same platform, or from independent methods such as TaqMan® assays, as we have extensively explored in this study.
The fundamental scientific requirement of reproducibility is a critical dimension to consider along with balancing specificity and sensitivity when defining a gene list. Irreproducibility would render microarray technology generally, and any research result, specifically, vulnerable to criticism. New methods for the identification of DEGs continue to appear in the scientific literature. These methods are typically promoted in terms of improved sensitivity (power) while retaining nominal rates of specificity. However, reproducibility is seldom emphasized.
The results show that selecting DEGs based solely on P from a simple t-test most often predestines a poor concordance in DEG lists, particularly for small numbers of genes. In contrast, using FC-ranking in conjunction with a non-stringent P-cutoff results in more concordant gene lists concomitant with needed reproducibility, even for fairly small numbers of genes. Moreover, enhanced reproducibility holds for inter-site, cross-platform, and between microarray and TaqMan® assay comparisons, and is independent of platforms, sample pairs, and normalization methods. The results should increase confidence in the reproducibility of data produced by microarray technology and should also expand awareness that gene lists identified solely based on P will tend to be discordant. This work demonstrates the need for a shift from the common practice of selecting differentially expressed genes solely on the ranking of a statistical significance measure (e.g., t-statistic) to an approach that emphasizes fold-change, a quantity actually measured by microarray technology.
Conclusions and recommendations
A fundamental step of microarray studies is the identification of a small subset of DEGs from among tens of thousands of genes probed on the microarray. DEG lists must be concordant to satisfy the scientific requirement of reproducibility, and must also be specific and sensitive for scientific relevance. A baseline practice is needed for properly assessing reproducibility/concordance alongside specificity and sensitivity.
Reports of DEG list instability in the literature are often a direct consequence of comparing DEG lists derived from a simple t-statistic when the sample size is small and variability in variance estimation is large. Therefore, the practice of using P alone for gene selection should be discouraged.
A DEG list should be chosen in a manner that concurrently satisfies scientific objectives in terms of inherent tradeoffs between reproducibility, specificity, and sensitivity.
Using FC and P together balances reproducibility, specificity, and sensitivity. Control of specificity and sensitivity can be accomplished with a P criterion, while reproducibility is enhanced with an FC criterion. Sensitivity can also be improved by better platforms with greater dynamic range and lower variability or by increased sample sizes.
FC-ranking should be used in combination with a non-stringent P threshold to select a DEG list that is reproducible, specific, and sensitive, and a joint rule is recommended as a baseline practice.
MAQC data sets
The MAQC data sets analyzed in this study are available from GEO under series accession number GSE5350. Analyses identified differentially expressed genes between the primary samples A (Stratagene Universal Human Reference RNA, Catalog #740000) and B (Ambion Human Brain Reference RNA, Catalog #6050) of the MAQC study. Analyses are additionally limited to data sets from the following five commercial genome-wide microarray platforms: ABI (Applied Biosystems), AFX (Affymetrix), AG1 (Agilent one-color), GEH (GE Healthcare), and ILM (Illumina), and to the subset of "12,091" genes commonly probed by them. TaqMan® assay data for 906 genes are used to examine gene list comparability between microarrays and TaqMan® assays. For more information about the MAQC project and the data sets, refer to Shi et al .
The following manufacturer's preferred normalization methods were used: quantile normalization for ABI and ILM, PLIER for AFX, and median-scaling for AG1 and GEH . For quantile normalization (including PLIER), each test site is independently considered.
Gene ranking (selection) rules
Six gene ranking (selection) methods were examined: (1) FC (fold change ranking); (2) FC_P 0.05 (FC-ranking with P cutoff of 0.05); (3) FC_P 0.01 (FC-ranking with P cutoff of 0.01); (4) P (P-ranking, simple t-test assuming equal variance); (5) P _FC2 (P-ranking with FC cutoff of 2); (6) P _FC1.4 (P-ranking with FC cutoff of 1.4). When a cutoff value (e.g., P < 0.05) is imposed for a ranking metric (e.g., FC), it is likely that the lists of candidate genes that meet the cutoff value may not be the same for the two test sites or two platforms as a result of differences in inter-site or cross-platform variations. Such differences are part of the gene selection process and have been carried over to the gene ranking/selection stage.
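The six rules share one pattern: filter by the cutoff metric, then rank by the other metric. The sketch below illustrates that pattern under simplifying assumptions: per-gene log2 FC and P values are already computed, genes are ranked by absolute FC rather than separately by direction, and the function name `select_degs` and the five example genes are hypothetical.

```python
import math

def select_degs(genes, rank_by, n_select, p_cutoff=None, fc_cutoff=None):
    """Filter genes by optional cutoffs, then rank and keep the top n_select.

    genes: list of dicts with keys "gene", "log2_fc", "p".
    rank_by: "FC" (largest |log2 FC| first) or "P" (smallest P first).
    fc_cutoff is given on the linear scale (e.g. 2 for two-fold).
    """
    candidates = genes
    if p_cutoff is not None:
        candidates = [g for g in candidates if g["p"] < p_cutoff]
    if fc_cutoff is not None:
        min_log2 = math.log2(fc_cutoff)
        candidates = [g for g in candidates if abs(g["log2_fc"]) >= min_log2]
    if rank_by == "FC":
        ranked = sorted(candidates, key=lambda g: abs(g["log2_fc"]), reverse=True)
    elif rank_by == "P":
        ranked = sorted(candidates, key=lambda g: g["p"])
    else:
        raise ValueError("rank_by must be 'FC' or 'P'")
    return [g["gene"] for g in ranked[:n_select]]

# Made-up values for illustration only.
genes = [
    {"gene": "g1", "log2_fc": 3.0, "p": 0.03},
    {"gene": "g2", "log2_fc": 0.3, "p": 1e-6},
    {"gene": "g3", "log2_fc": -2.0, "p": 0.004},
    {"gene": "g4", "log2_fc": 2.5, "p": 0.20},
    {"gene": "g5", "log2_fc": 0.8, "p": 0.001},
]

fc_only = select_degs(genes, "FC", 2)                  # rule (1)
fc_p005 = select_degs(genes, "FC", 2, p_cutoff=0.05)   # rule (2)
p_only = select_degs(genes, "P", 2)                    # rule (4)
p_fc2 = select_degs(genes, "P", 2, fc_cutoff=2)        # rule (5)
print(fc_only, fc_p005, p_only, p_fc2)
```

In this toy case, P-ranking alone picks the gene with the smallest fold change (g2), while FC-ranking with a P cutoff of 0.05 drops the large-FC but noisy gene (g4).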
Evaluation criterion – POG (percentage of overlapping genes)
The POG (percentage of overlapping genes) calculation [11, 13] was applied in three types of comparisons: (1) Inter-site comparison using data from the three test sites of each platform; (2) Cross-platform comparison between ABI, AFX, AG1, GEH, and ILM using data from test site 1; for each sample pair, there are ten cross-platform pairs for comparison; (3) Microarray versus TaqMan® assay comparisons.
POG is calculated for many different cutoffs that can be considered as arbitrary.
The number of genes considered as differentially expressed is denoted as 2L, where L is both the number of genes up- and down-regulated. The number of genes available for ranking and selection in one direction, L, varies from 1 to 6000 (with a step of one) or when there are no more genes in one regulation direction, corresponding to 2L varying from 2 to 12,000. Directionality of gene regulation is considered in POG calculations; genes selected by two sites or platforms but with different regulation directionalities are considered as discordant. Therefore, POG can hardly reach 100% in reality.
The formula for calculating POG is: POG = 100*(DD+UU)/2L, where DD and UU are the numbers of commonly down- and up-regulated genes, respectively, from the two lists, and L is the number of genes selected in each of the up- and down-regulation directions. To avoid confusion over different denominators, in our POG calculations we deliberately selected an equal number, L, of up-regulated and down-regulated genes. The POG graphs shown in this study are essentially the same as the CAT (correspondence at the top) plots introduced by Irizarry et al. and the POG graphs that we introduced previously, except that in the current POG graphs the x-axis is in log-scale to emphasize the details when fewer genes are selected.
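As a worked illustration of the formula, the following Python sketch computes POG for two ranked lists. The function name `pog`, the dictionary layout, and the gene identifiers are hypothetical and chosen only for this example.

```python
def pog(list1, list2, L):
    """Percentage of overlapping genes between two DEG lists.

    list1, list2: {"up": [...], "down": [...]}, each a list of gene IDs
    ranked from strongest to weakest in that direction (hypothetical layout).
    L: number of genes taken from each direction, so each list has 2L genes.
    """
    uu = len(set(list1["up"][:L]) & set(list2["up"][:L]))      # commonly up
    dd = len(set(list1["down"][:L]) & set(list2["down"][:L]))  # commonly down
    return 100.0 * (dd + uu) / (2 * L)

site1 = {"up": ["g1", "g2", "g3"], "down": ["g7", "g8", "g9"]}
site2 = {"up": ["g1", "g3", "g8"], "down": ["g7", "g2", "g9"]}

pog_1 = pog(site1, site2, 1)   # 100.0: g1 and g7 agree
pog_2 = pog(site1, site2, 2)   # 50.0: only g1 and g7 are shared
pog_3 = pog(site1, site2, 3)   # about 66.7: g2 and g8 are discordant
```

Genes g2 and g8 appear in both lists at L = 3 but in opposite directions, so they do not count toward DD or UU. This is how directionality keeps POG below 100% even when the two lists contain the same genes.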
Most of the analyses in this study exclude flagging information; that is, the entire set of 12,091 genes is used in the analyses. Some calculations are limited to subsets of genes commonly detectable ("commonly present") by the two test sites or two platforms under comparison. To be denoted as "commonly present", a gene must be detected ("present") in the majority of replicates (e.g., three or more when there are five replicates) for each sample in a sample-pair comparison and for each test site or platform.
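The "commonly present" filter can be expressed compactly. The sketch below is only an illustration under stated assumptions: the function name and the input layout (one list of present/absent calls per sample and per site or platform) are hypothetical, and "majority" is read as strictly more than half of the replicates.

```python
def commonly_present(calls_by_group):
    """True if a gene is detected in the majority of replicates in every group.

    calls_by_group: one list of booleans per (sample, site or platform)
    combination, e.g. four lists for samples A and B at two test sites.
    """
    return all(sum(calls) > len(calls) / 2 for calls in calls_by_group)

# Five replicates per group: three or more "present" calls are required.
gene_ok = commonly_present([
    [True, True, True, False, False],   # sample A, site 1
    [True, True, True, True, False],    # sample B, site 1
    [True, True, True, True, True],     # sample A, site 2
    [True, False, True, True, False],   # sample B, site 2
])
gene_dropped = commonly_present([
    [True, True, True, True, True],
    [True, True, False, False, False],  # only two of five detected
])
```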
Gene selection simulation
A simulation was created to emulate the characteristics of the MAQC dataset. Fifteen thousand simulated genes were created where 5,000 were undifferentiated in expression between simulated biological samples A and B and 10,000 were differentiated but at various levels (exponential distribution for the log FC, where almost 4,000 are differentiated two-fold or higher, similar to a typical platform in the MAQC study, divided equally into up and down regulated genes). To simulate instances of technical or biological replicates, multiplicative noise (error) was added to the signal for each gene for each of five simulated replicates for each sample using an error distribution that would produce a replicate CV similar to that typically seen in the MAQC data set (i.e., the mean replicate CV would be roughly 10%). The CV for any given gene was randomly selected from a trimmed exponential distribution. To address a variety of additional error scenarios but preserving the same distribution of fold change, we also considered three additional mean CV values (2%, 35%, and 100%). To understand the impact of gene list size on the stability of the DEG list, list sizes of 10, 25, 100, and 500 genes were examined for each mean CV scenario. Several gene selection rules were compared: FC-ranking only, P-ranking only, and shrunken t-statistic ranking. Note: P-ranking is equivalent to t-statistic ranking as well as ranking based on FDR that monotonically transforms the P-value. In addition, shrunken t-statistic ranking is equivalent to ranking based on the test statistic used by SAM and related methods. In addition, rules based on FC-ranking with a P threshold were also compared (for P = 0.1, 0.01, 0.001, and 0.0001). Finally, to simulate differences in the variation patterns of analytes between platforms and even between laboratories, covariance between laboratories/platforms of the variance for each gene was included in the simulations. 
For a given mean CV, 20 or more simulated instances of five replicates of simulated biological samples A and B were created and DEG lists were prepared for each instance that were rank ordered using the methods described above. To determine reproducibility of a given method for a given mean CV under a given gene list size, the rank-ordered gene lists from these 20 instances were pair-wise compared for consistency and reproducibility. The results presented in the graphs are averages from those pair-wise comparisons.
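The simulation procedure can be approximated with the scaled-down Python sketch below. It is a rough illustration rather than the simulation used in the study, and several details are assumptions: the gene counts are one tenth of those described (500 undifferentiated and 1,000 differentiated), the mean |log2 FC| of 1.09 is chosen so that roughly 40% of differentiated genes exceed two-fold, the exponential CV distribution is trimmed at five times its mean, multiplicative noise is modelled as log-normal, and each gene keeps the same CV in every instance (no between-laboratory covariance term is modelled). Only FC-ranking and t-statistic ranking are compared; all function names are hypothetical.

```python
import itertools
import math
import random

def simulate_truth(n_null, n_diff, mean_abs_log2fc, mean_cv, rng):
    """Return one (true log2 FC, replicate CV) pair per simulated gene."""
    truth = []
    for i in range(n_null + n_diff):
        if i < n_null:
            delta = 0.0                                   # undifferentiated
        else:
            sign = 1.0 if i % 2 == 0 else -1.0            # half up, half down
            delta = sign * rng.expovariate(1.0 / mean_abs_log2fc)
        cv = min(rng.expovariate(1.0 / mean_cv), 5.0 * mean_cv)  # trimmed
        truth.append((delta, cv))
    return truth

def mean_and_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def one_instance(truth, n_rep, rng):
    """Simulate n_rep replicates of A and B; return (log2 FC, t) per gene."""
    stats = []
    for delta, cv in truth:
        # Log-normal noise with coefficient of variation cv, on the log2 scale.
        sd = math.sqrt(math.log(1.0 + cv * cv)) / math.log(2.0)
        ma, va = mean_and_var([rng.gauss(0.0, sd) for _ in range(n_rep)])
        mb, vb = mean_and_var([rng.gauss(delta, sd) for _ in range(n_rep)])
        se = math.sqrt((va + vb) / n_rep)   # equal-variance t-test, equal n
        diff = mb - ma
        stats.append((diff, diff / se if se > 0.0 else 0.0))
    return stats

def top_list(stats, column, size):
    """Top genes by |statistic|, tagged with direction (True = up)."""
    order = sorted(range(len(stats)),
                   key=lambda g: abs(stats[g][column]), reverse=True)
    return {(g, stats[g][column] > 0) for g in order[:size]}

def mean_pairwise_overlap(instances, column, size):
    """Average percentage overlap over all pairs of simulated instances."""
    lists = [top_list(s, column, size) for s in instances]
    overlaps = [len(x & y) / size for x, y in itertools.combinations(lists, 2)]
    return 100.0 * sum(overlaps) / len(overlaps)

rng = random.Random(0)
truth = simulate_truth(500, 1000, 1.09, 0.10, rng)        # mean CV of 10%
instances = [one_instance(truth, 5, rng) for _ in range(20)]
results = {(name, size): mean_pairwise_overlap(instances, column, size)
           for name, column in (("FC", 0), ("t", 1))
           for size in (10, 25, 100, 500)}
```

`results` maps each (ranking rule, list size) pair to the average pairwise overlap in percent. Changing the mean CV argument (for example to 0.02, 0.35, or 1.0) and the list sizes gives a rough way to explore the scenarios described above; adding a shrunken t-statistic or an FC-ranking rule with a P threshold would require extending `one_instance` and `top_list`.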
The actual MAQC data are characterized by large magnitudes of differential expression among the vast majority of the 12,091 common genes, with some 4000 exhibiting FC > 2 and hundreds with FC > 10. As such, the data may be atypical of commonplace microarray experiments with biological effects. Consequently, two other simulation data sets were created with far fewer genes with large FC, as might be expected in some actual microarray experiments. Specifically, the Small-Delta data set was created with fewer than 50 genes with FC > 2, a FC < 1.3 for most differentiated genes, and 10,000 undifferentiated genes. In addition, the variances of the genes were correlated in a manner similar to that observed in the MAQC data. The third simulated data set, termed the Medium-Delta set, had a large number of differentiated genes similar to the MAQC simulated data set, but with small FC similar to the Small-Delta set. Again, gene variances were correlated in a manner similar to that observed in the MAQC data.
This document has been reviewed in accordance with United States Food and Drug Administration (FDA) policy and approved for publication. Approval does not signify that the contents necessarily reflect the position or opinions of the FDA nor does mention of trade names or commercial products constitute endorsement or recommendation for use. The findings and conclusions in this report are those of the author(s) and do not necessarily represent the views of the FDA. James C. Willey is a consultant for and has significant financial interest in Gene Express, Inc.
List of abbreviations used
- A :
The MAQC sample A (Stratagene Universal Human Reference RNA)
- ABI :
Applied Biosystems microarray platform
- AFX :
Affymetrix microarray platform
- AG1 :
Agilent one-color microarray platform
- B :
The MAQC sample B (Ambion Human Brain Reference RNA)
- C :
The MAQC sample C (75%A+25%B mixture)
- CV :
Coefficient of variation
- D :
The MAQC sample D (25%A+75%B mixture)
- DEG :
Differentially expressed genes
- FC :
Fold change in expression levels
- GEH :
GE Healthcare microarray platform
- ILM :
Illumina microarray platform
- MAQC :
MicroArray Quality Control project
- P :
The P-value calculated from a two-tailed two-sample t-test assuming equal variance
- POG :
Percentage of Overlapping (common) Genes between two lists of differentially expressed genes. It is used as a measure of concordance of microarray results.
We thank participants of the MicroArray Quality Control (MAQC) project for generating the large data sets that were used in this study. Many MAQC participants contributed to the sometimes-heated discussions on the topic of this paper during MAQC teleconferences and face-to-face project meetings. The common conclusions and recommendations evolved from this extended discourse. Leming Shi and Feng Qian would like to dedicate this work in memory of Prof. Dr. Zhiliang Li of Chongqing University, China.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 9, 2008: Proceedings of the Fifth Annual MCBIOS Conference. Systems Biology: Bridging the Omics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S9
- Tan PK, Downey TJ, Spitznagel EL, Xu P, Fu D, Dimitrov DS, Lempicki RA, Raaka BM, Cam MC: Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res. 2003, 31 (19): 5676-5684. 10.1093/nar/gkg763.
- Ramalho-Santos M, Yoon S, Matsuzaki Y, Mulligan RC, Melton DA: "Stemness": transcriptional profiling of embryonic and adult stem cells. Science. 2002, 298 (5593): 597-600. 10.1126/science.1072530.
- Ivanova NB, Dimos JT, Schaniel C, Hackney JA, Moore KA, Lemischka IR: A stem cell molecular signature. Science. 2002, 298 (5593): 601-604. 10.1126/science.1073823.
- Fortunel NO, Otu HH, Ng HH, Chen J, Mu X, Chevassut T, Li X, Joseph M, Bailey C, Hatzfeld JA: Comment on " 'Stemness': transcriptional profiling of embryonic and adult stem cells" and "a stem cell molecular signature". Science. 2003, 302 (5644): 393; author reply 393. 10.1126/science.1086384.
- Miller RM, Callahan LM, Casaceli C, Chen L, Kiser GL, Chui B, Kaysser-Kranich TM, Sendera TJ, Palaniappan C, Federoff HJ: Dysregulation of gene expression in the 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine-lesioned mouse substantia nigra. J Neurosci. 2004, 24 (34): 7445-7454. 10.1523/JNEUROSCI.4204-03.2004.
- Miklos GL, Maleszka R: Microarray reality checks in the context of a complex disease. Nat Biotechnol. 2004, 22 (5): 615-621. 10.1038/nbt965.
- Michiels S, Koscielny S, Hill C: Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet. 2005, 365 (9458): 488-492. 10.1016/S0140-6736(05)17866-0.
- Ioannidis JP: Microarrays and molecular research: noise discovery?. Lancet. 2005, 365 (9458): 454-455.
- Frantz S: An array of problems. Nat Rev Drug Discov. 2005, 4 (5): 362-363. 10.1038/nrd1746.
- Marshall E: Getting the noise out of gene arrays. Science. 2004, 306 (5696): 630-631. 10.1126/science.306.5696.630.
- Shi L, Tong W, Fang H, Scherf U, Han J, Puri RK, Frueh FW, Goodsaid FM, Guo L, Su Z: Cross-platform comparability of microarray technology: intra-platform consistency and appropriate data analysis procedures are essential. BMC Bioinformatics. 2005, 6 (Suppl 2): S12. 10.1186/1471-2105-6-S2-S12.
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98 (9): 5116-5121. 10.1073/pnas.091062498.
- Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins J, Longueville Fd, Kawasaki ES, Lee KY: The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol. 2006, 24 (9): 1151-1161. 10.1038/nbt1239.
- Shi L, Perkins RG, Fang H, Tong W: Reproducible and reliable microarray results through quality control: good laboratory proficiency and appropriate data analysis practices are essential. Curr Opin Biotechnol. 2008, 19 (1): 10-18. 10.1016/j.copbio.2007.11.003.
- Ein-Dor L, Zuk O, Domany E: Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc Natl Acad Sci USA. 2006, 103 (15): 5923-5928. 10.1073/pnas.0601231103.
- Allison DB, Cui X, Page GP, Sabripour M: Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet. 2006, 7 (1): 55-65. 10.1038/nrg1749.
- Mecham BH, Klus GT, Strovel J, Augustus M, Byrne D, Bozso P, Wetmore DZ, Mariani TJ, Kohane IS, Szallasi Z: Sequence-matched probes produce increased cross-platform consistency and more reproducible biological results in microarray-based gene expression measurements. Nucleic Acids Res. 2004, 32 (9): e74. 10.1093/nar/gnh071.
- Barczak A, Rodriguez MW, Hanspers K, Koth LL, Tai YC, Bolstad BM, Speed TP, Erle DJ: Spotted long oligonucleotide arrays for human gene expression analysis. Genome Res. 2003, 13 (7): 1775-1785. 10.1101/gr.1048803.
- Shippy R, Sendera TJ, Lockner R, Palaniappan C, Kaysser-Kranich T, Watts G, Alsobrook J: Performance evaluation of commercial short-oligonucleotide microarrays and the impact of noise in making cross-platform correlations. BMC Genomics. 2004, 5 (1): 61. 10.1186/1471-2164-5-61.
- Hoffman EP: Expression profiling – best practices for data generation and interpretation in clinical trials. Nat Rev Genet. 2004, 5 (3): 229-237. 10.1038/nrg1297.
- Guo L, Lobenhofer EK, Wang C, Shippy R, Harris SC, Zhang L, Mei N, Chen T, Herman D, Goodsaid FM: Rat toxicogenomic study reveals analytical consistency across microarray platforms. Nat Biotechnol. 2006, 24 (9): 1162-1169. 10.1038/nbt1238.
- Shi L, Jones W, Jensen RV, Harris SC, Perkins R, Goodsaid FM, Guo L, Croner LJ, Boysen C, Fang H: The reproducibility of lists of differentially expressed genes in microarray studies. Nature Precedings. 2007, [http://precedings.nature.com/documents/306/version/302]
- Strauss E: Arrays of hope. Cell. 2006, 127 (4): 657-659. 10.1016/j.cell.2006.11.005.
- Eisenstein M: Microarrays: quality control. Nature. 2006, 442 (7106): 1067-1070. 10.1038/4421067a.
- Couzin J: Genomics. Microarray data reproduced, but some concerns remain. Science. 2006, 313 (5793): 1559. 10.1126/science.313.5793.1559a.
- Kiermer V: Microarray quality in the spotlight again. Nat Methods. 2006, 3 (10): 772. 10.1038/nmeth1006-772.
- Sage L: Do microarrays measure up?. Anal Chem. 2006, 78 (21): 7358-7360.
- Klebanov L, Qiu X, Welle S, Yakovlev A: Statistical methods and microarray data. Nat Biotechnol. 2007, 25 (1): 25-26. 10.1038/nbt0107-25.
- Shi L, Jones WD, Jensen RV, Wolfinger RD, Kawasaki ES, Herman D, Guo L, Goodsaid FM, Tong W: Reply to Statistical methods and microarray data. Nat Biotechnol. 2007, 25 (1): 26-27. 10.1038/nbt0107-26.
- Shi L: New hot paper comment. ESI Special Topics. 2007, [http://www.esi-topics.com/nhp/2007/march-07-LemingShi.html]
- Chen JJ, Wang SJ, Tsai CA, Lin CJ: Selection of differentially expressed genes in microarray data analysis. Pharmacogenomics J. 2007, 7 (3): 212-220. 10.1038/sj.tpj.6500412.
- Chen J, Hsueh HM, Delongchamp R, Lin CJ, Tsai CA: Reproducibility of microarray data: a further analysis of microarray quality control (MAQC) data. BMC Bioinformatics. 2007, 8 (1): 412. 10.1186/1471-2105-8-412.
- Perkel JM: Six things you won't find in the MAQC. The Scientist. 2007, 20 (11): 68.
- Qin LX, Kerr KF: Empirical evaluation of data transformations and ranking statistics for microarray analysis. Nucleic Acids Res. 2004, 32 (18): 5471-5479. 10.1093/nar/gkh866.
- Kim S, Lee J: Comparison of various statistical methods for identifying differential gene expression in replicated microarray data. Statistical Methods in Medical Research. 2006, 15: 1-18. 10.1191/0962280206sm423oa.
- Shippy R, Wolber P, Jones WD, Jensen RV, Fulmer-Smentek S, Johnson CD, Pine PS, Boysen C, Guo X, Chudin E: Using RNA sample titrations to assess microarray platform performance and normalization techniques. Nat Biotechnol. 2006, 24 (9): 1123-1131. 10.1038/nbt1241.
- Wang Y, Barbacioru C, Hyland F, Xiao W, Hunkapiller KL, Blake J, Chan F, Gonzalez C, Zhang L, Samaha RR: Large scale real-time PCR validation on gene expression measurements from two commercial long-oligonucleotide microarrays. BMC Genomics. 2006, 7: 59. 10.1186/1471-2164-7-59.
- Kuo WP, Liu F, Trimarchi J, Punzo C, Lombardi M, Sarang J, Whipple ME, Maysuria M, Serikawa K, Lee SY: A sequence-oriented comparison of gene expression measurements across different hybridization-based technologies. Nat Biotechnol. 2006, 24 (7): 832-840. 10.1038/nbt1217.
- Canales RD, Luo Y, Willey JC, Austermiller B, Barbacioru CC, Boysen C, Hunkapiller K, Jensen RV, Knight C, Lee KY: Evaluation of DNA microarray results with quantitative gene expression platforms. Nat Biotechnol. 2006, 24 (9): 1115-1122. 10.1038/nbt1236.
- Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995, 270 (5235): 467-470. 10.1126/science.270.5235.467.
- Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H: Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol. 1996, 14 (13): 1675-1680. 10.1038/nbt1296-1675.
- Mutch DM, Berger A, Mansourian R, Rytz A, Roberts MA: Microarray data analysis: a practical approach for selecting differentially expressed genes. Genome Biol. 2001, 2 (12): PREPRINT0009. 10.1186/gb-2001-2-12-preprint0009.
- Baldi P, Long AD: A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics. 2001, 17 (6): 509-519. 10.1093/bioinformatics/17.6.509.
- Haslett JN, Sanoudou D, Kho AT, Bennett RR, Greenberg SA, Kohane IS, Beggs AH, Kunkel LM: Gene expression comparison of biopsies from Duchenne muscular dystrophy (DMD) and normal skeletal muscle. Proc Natl Acad Sci USA. 2002, 99 (23): 15000-15005. 10.1073/pnas.192571199.
- Kittleson MM, Minhas KM, Irizarry RA, Ye SQ, Edness G, Breton E, Conte JV, Tomaselli G, Garcia JG, Hare JM: Gene expression in giant cell myocarditis: Altered expression of immune response genes. Int J Cardiol. 2005, 102 (2): 333-340. 10.1016/j.ijcard.2005.03.075.
- Tan FL, Moravec CS, Li J, Apperson-Hansen C, McCarthy PM, Young JB, Bond M: The gene expression fingerprint of human heart failure. Proc Natl Acad Sci USA. 2002, 99 (17): 11387-11392. 10.1073/pnas.162370099.
- Ein-Dor L, Kela I, Getz G, Givol D, Domany E: Outcome signature genes in breast cancer: is there a unique set?. Bioinformatics. 2005, 21 (2): 171-178. 10.1093/bioinformatics/bth469.
- Tan Y, Shi L, Tong W, Wang C: Multi-class cancer classification by total principal component regression (TPCR) using microarray gene expression data. Nucleic Acids Res. 2005, 33 (1): 56-65. 10.1093/nar/gki144.
- Kerr MK, Churchill GA: Experimental design for gene expression microarrays. Biostatistics. 2001, 2 (2): 183-201. 10.1093/biostatistics/2.2.183.
- Wellmer F, Riechmann JL, Alves-Ferreira M, Meyerowitz EM: Genome-wide analysis of spatial gene expression in Arabidopsis flowers. Plant Cell. 2004, 16 (5): 1314-1326. 10.1105/tpc.021741.
- Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW: On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol. 2001, 8 (1): 37-52. 10.1089/106652701300099074.
- Cui X, Hwang JT, Qui J, Blades NJ, Churchill GA: Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics. 2005, 6: 59-75. 10.1093/biostatistics/kxh018.
- Callow MJ, Dudoit S, Gong EL, Speed TP, Rubin EM: Microarray expression profiling identifies genes with altered expression in HDL-deficient mice. Genome Res. 2000, 10 (12): 2022-2029. 10.1101/gr.10.12.2022.
- Norris AW, Kahn CR: Analysis of gene expression in pathophysiological states: balancing false discovery and false negative rates. Proc Natl Acad Sci USA. 2006, 103 (3): 649-653. 10.1073/pnas.0510115103.
- Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC, Gabrielson E, Garcia JG, Geoghegan J, Germino G: Multiple-laboratory comparison of microarray platforms. Nat Methods. 2005, 2 (5): 345-350. 10.1038/nmeth756.
- Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD: Functional discovery via a compendium of expression profiles. Cell. 2000, 102 (1): 109-126. 10.1016/S0092-8674(00)00015-5.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.