Reproducibility of microarray data: a further analysis of microarray quality control (MAQC) data

Background Many researchers are concerned with the comparability and reliability of microarray gene expression data. Recent completion of the MicroArray Quality Control (MAQC) project provides a unique opportunity to assess reproducibility across multiple sites and the comparability across multiple platforms. The MAQC analysis presented for the conclusion of inter- and intra-platform comparability/reproducibility of microarray gene expression measurements is inadequate. We evaluate the reproducibility/comparability of the MAQC data for 12901 common genes in four titration samples generated from five high-density one-color microarray platforms and the TaqMan technology. We discuss some of the problems with the use of correlation coefficient as metric to evaluate the inter- and intra-platform reproducibility and the percent of overlapping genes (POG) as a measure for evaluation of a gene selection procedure by MAQC. Results A total of 293 arrays were used in the intra- and inter-platform analysis. A hierarchical cluster analysis shows distinct differences in the measured intensities among the five platforms. A number of genes show a small fold-change in one platform and a large fold-change in another platform, even though the correlations between platforms are high. An analysis of variance shows thirty percent of gene expressions of the samples show inconsistent patterns across the five platforms. We illustrated that POG does not reflect the accuracy of a selected gene list. A non-overlapping gene can be truly differentially expressed with a stringent cut, and an overlapping gene can be non-differentially expressed with non-stringent cutoff. In addition, POG is an unusable selection criterion. POG can increase or decrease irregularly as cutoff changes; there is no criterion to determine a cutoff so that POG is optimized. Conclusion Using various statistical methods we demonstrate that there are differences in the intensities measured by different platforms and different sites within platform. Within each platform, the patterns of expression are generally consistent, but there is site-by-site variability. Evaluation of data analysis methods for use in regulatory decision should take no treatment effect into consideration, when there is no treatment effect, "a fold-change cutoff with a non-stringent p-value cutoff" could result in 100% false positive error selection.


Background
Microarray technology provides powerful tools to measure expression levels of thousands of genes simultaneously. Gene expression data are increasingly being used in disease diagnosis, identifying biomarkers, and predicting clinical outcomes. However, many researchers are concerned with the comparability and reliability of microarray data [1][2][3][4]. Various studies have been published on comparing data reproducibility across different platforms or different laboratories with mixed results. Some have found that microarray experiments generated similar results obtained at different test sites and using different platforms [2][3][4][5]. Others showed little overlap among lists of differentially expressed genes across platforms [6][7][8]. Recent completion of the MicroArray Quality Control (MAQC) [9] project provides a unique opportunity to assess reproducibility of gene expression data across multiple sites and the comparability across multiple platforms [10].
In the MAQC project, transcript levels of four titration samples were measured on seven microarray platforms and three alternative gene expression technologies. Each microarray platform was generally tested at three independent sites with five replicates of each sample. The MAQC project has generated many manuscripts; each has a specific aim and objective in the data generation, presentation, and analysis. The analysis presented for the conclusion of inter-and intra-platform comparability/ reproducibility of microarray gene expression measurements [9] is inadequate. The MAQC design is essentially a factorial design with four titration samples and three sites. The MAQC analysis performed pairwise comparison between two samples within each site and never formally evaluated the consistency of expression of the four samples across sites (sample by site interaction). The assessment of inter platform comparability was limited to the fold change between two samples, namely, Samples A and B within each site in the analysis [9]. The three metrics were considered in that analysis: differential gene list overlap, log ratio compression, and log rank correlation. For example, the rank correlations of the log ratios of sample B to sample A between two platforms or two sites were calculated as a measure of comparability. The correlation is a measure of concordance. A low correlation is an indication of poor reproducibility, but a high correlation itself is no sufficient to conclude reproducibility. Furthermore, the differential gene list overlap is used as a measure of reproducibility to evaluate a gene selection procedure. The MAQC Consortium [9] suggested a fold-change cutoff with a non-stringent p-value cutoff as a baseline practice to improve reproducibility. Many researchers have questioned this approach [11][12][13][14][15]. This paper presents further analyses of the reproducibility/comparability of the MAQC data for five high-density one-color microarray platforms and the TaqMan technology. Various statistical techniques are used to evaluate inter-and intra-platform reproducibility and consistency. A "gold standard" data set is constructed for assessment of sensitivity, specificity, and accuracy for evaluation of individual platforms' performances on selection of differentially expressed genes. We discuss some of the problems with the use of correlation coefficient as metric to evaluate the inter-and intra-platform reproducibility and the use of differential gene list overlap for evaluation of gene selection by MAQC Consortium [1,16]. Figure 1 shows a hierarchical clustering analysis of the 293 arrays from five platforms, three sites, four samples, and five replicates. The arrays are well separated by platform, by sample, and then by site. The five major branches of the dendrogram represent the five platforms. This indicates that there are differences in the intensities as measured by the different platforms. Therefore, intensities measured by different platforms are not directly comparable. Within each platform, the samples were well separated, except for GEH. For the GEH platform, samples C and samples D do not completely cluster together. Furthermore, since sample C is a 75%A+25%B mixture, the sample pair A and C should be clustered together. Likewise, sample D is a 25%A+75%B mixture, so the sample pair B and D should be clustered together. All platforms show good discriminability among the four biological samples. Within each sample, the replicates from the same sites are generally clustered together; this indicates site effects in all five platforms.

Microarray Cross platform comparability
The correlation coefficients between platforms were evaluated for each of the four samples (A, B, C, and D) and each of the fold-changes (B/A, C/A, D/A, C/B, D/B, and D/ C). For a given sample (A, B, C, or D), all pairwise correlations between each of the (up to) 15 replicates in one platform and each of the (up to) 15 replicates in another platform were computed. A total of up to 2250 inter platform correlations were computed. The fold-changes were calculated for each site within each platform. The foldchange correlations across platforms were then calculated. There were 90 fold-change correlations. The summary statistics of the sample correlations and fold-change correlations are shown in Table 1. The median correlations are 0.74, 0.70, 0.71, and 0.68 for Samples A, B, C, and D, respectively. The highest correlation observed is 0.82 in Sample A; the smallest correlation is 0.45 observed in Sample D. The median fold-change correlation are 0.85, 0.75, 0.82, 0.84, 0.78, 0.78 for fold-changes B/A, C/A, D/ A, C/B, D/B, D/C, respectively. The highest correlation is 0.92 observed in fold-change B/A; the smallest correlation is 0.53 observed in fold-change C/A. The median foldchange correlations are higher than the sample correlations. Table 1 indicates that Sample A is more similar to C than Sample B is to D; and Sample A and C are the most similar among the four samples. Figure 2 shows the scatter plots of all pairwise comparisons of the fold-changes B/A (upper triangle of each square) and D/C (lower triangle of each square) for all genes from the five platforms. The scatter plots provide an assessment of agreement in the measured fold-changes between two platforms. The fold-change estimates for B/A are somewhat consistent across platforms. However, a Hierarchical clustering Figure 1 Hierarchical clustering. Hierarchical clustering of 293 arrays from the five microarray platforms, four samples, and three sites, in most cases, five technical replicates for each sample. The five platforms are colored: Affymetrix (AFX), Applied Biosystems (ABI), Agilent Technologies (AG1), GE Healthcare (GEH), and Illumina (ILM). The four samples are colored: A, B, C, and D. The three sites are colored: site 1, site 2, and site 3. The correlation coefficient of the standardized intensity measurements over the 12091 genes were calculated for all pairwise combinations of the 293 arrays. The one-minus-correlation is used for the distance metric.  small fold-change observed in one platform may have a large fold-change in another platform. Figure 3 provides more detailed scatter plots of various platform comparisons in Figure 2. The two lines represent a 2-fold change.
The points in the lower right or upper left region have a 2fold change in one platform and less than 2-fold change in the other platform. Quite a number of genes falls in these two regions. For the fold-change D/C in Figure 2, the range of the fold-changes for D/C is smaller than the range for B/A. It is rather variable, relatively. AFX appears to have smaller ranges than the other four platforms. Patterns of the expression of the four samples across the platforms are evaluated using a two-factor ANOVA model with interaction. The proportion of genes that showed a significant Sample*Platform interaction is 0.30 at the FDR = 1% significance level. That is, 30% of genes in which the four samples show inconsistent patterns of expression across the five platforms.

TaqMan and microarray platform comparability
The summary statistics of the sample correlation coefficients and fold-change correlation coefficients between the TaqMan and each of the five microarray platforms for all pairwise combinations are evaluated (see Table 2). The medians of the sample correlation coefficients range from 0.57 to 0.79. This range is comparable with the sample correlation coefficients observed between microarray platforms. The fold-change correlation coefficients are also comparable with the fold-change correlation coefficients observed among the microarray platforms. Figure 4 shows correlation coefficients of TaqMan versus microarray platforms.
Using an ANOVA for the data from each of the five platforms and Taqman, the proportions of the genes that have a significant Sample*Platform interaction are 0.72, 0.57, 0.49, 0.65, 0.39 for AFX, ABI, AG1, ILM, and GEH, respectively. These values are higher than the Sample*Platform interaction of 0.30 obtained from the inter microarray platform comparability discussed above. Figure 5 shows that Gene NM_000168 has good consistency of patterns of expression of four samples in all five platforms; but patterns between the each of the five platforms and Taqman for the four samples are inconsistent. The IDs for this genes are 205201_at, 100093, A_23_P111531, GE57983, GI_13518031-S, Hs00609233_m1 for 5 AFX, ABI, AG1, ILM, GEH, TAQ, respectively.

Analysis of titration response
The correlation coefficients between the observed responses for Sample C and Sample D and the expected responses predicted by Samples A and B are shown in Col-umns 2 and 3 of Table 3. The correlations are at least 90% in both samples C and D from all platforms.
The differences between the observed responses and the predicted responses were evaluated. The proportion of genes with a difference less than 0.5 in log 2 scale (this corresponds to 2 0.5 = 1.41 fold change) was calculated for each site. The averaged proportions from the three sites are shown in Columns 4 and 5 of Table 3. The AFX has the highest proportions, greater than 99%. These numbers appear inconsistent with the lowest correlation coefficients. The inconsistency is, perhaps, that AFX has a shorter expression range than other platforms.
Using the two-step goodness of fit procedure, the proportions of genes follow the titration trend are at least 90% (Columns 6-7 of Table 3). All analyses indicate a good, self-consistent relationship between the expression measurements from the four samples.
Scatter plots of the fold-change estimates between two platforms and the correlation coefficients

Within microarray platform reproducibility and individual platforms' discriminability
For each platform, the correlation coefficients between two sites were calculated for each of the four samples to assess intra-platform reproducibility across sites. There were seventy-five correlation coefficients for each sample.
In Table 4, Columns 2-5 show the median correlation coefficients between two arrays from two different sites. The median correlations are high; the smallest median correlation coefficient is 0.862 observed in Sample D from the GEH platform.
Using a two-factor ANOVA, the proportions of genes that are significant at the significance level of FDR = 1% are given in Columns 6-8. All platforms show good discriminability to distinguish the four samples, about 80%; all platforms show large site effects, ranged 70% to 99%. The proportions of Sample*Site interaction for the AFX, ABI, and GEH platforms are low; but, the proportions for the AG1 and GEH platforms are more than 10%.

Sensitivity, specificity, and accuracy in gene selection
Selection of differentially expressed genes is one of the most important goals of microarray experiment. However, it is difficult to validate whether the selected genes are truly differentially expressed, and those not selected genes are truly non-differentially expressed. The MAQC project used technical replicates (small variance) with two distinct biological samples (large difference). We examined the number of genes expressed differently between Sample A and Sample B across platforms. Of the 12,091 genes, 9879 have p-value ≤ 10-5 in at least 1 platform, 8265 have p-value ≤ 10-5 in at least 2, 6846 in 3, 5241 in 4 and 3128 in all five. These numbers are larger than the number commonly observed in typical microarray experiments. We constructed a "gold standard data" set of differentially expressed and non-differentially expressed genes for evaluation of individual platforms' performances. A gene is "differentially expressed" if it was shown to be significant (p ≤ 10 -5 ) in at least two of the five platforms. A gene is non-differentially expressed if its fold change was shown to be between 0.90 and 1/0.90 in at least two of the five platforms at the significance level 10 -3 . Excluding the overlapping genes, the "gold standard data" set selected 8,187 differentially expressed genes and 420 non-differentially expressed genes.
Individual platforms' performances from the "gold standard data" set are shown in Table 5 using the FWE = 0.05 and FDR = 0.05 cutoff criteria. The FWE criterion gives high specificity and the FDR criterion gives high sensitivity. At the FDR = 0.05, the FDR estimates are all well below 0.05. An explanation is that the Benjamin and Hochberg procedure 16  ance. ILM has the best specificity, but lower sensitivity.
The low sensitivity of the ILM platform is due to large σ 2 (Table 4), high specificity is because ILM does not select as many genes as AFX, ABI, and AG1 at the FWE and FDR levels.
The "gold standard data" is further used as a reference data set to examine the value of overlapping criterion as a measure of reproducibility in the evaluation of cross platform comparisons. At the FWE = 0.05, a specificity of 95% implies 21 false positives for AFX, ABI, AG1, and GEH. ILM correctly identifies 53% (= 4,339) of differentially expressed genes without a false positive. If 4,000 genes are selected, then the numbers of false identifications are 1, 3, 2, 1, and 14 for AFX, ABI, AG1, ILM, and GEH, respectively. That is, regardless of the percentages of the overlap gene list between two platforms, more than 99.8% of the 4,000 genes identified by each of the five platforms are truly differentially expressed between the samples A and B. In other words, each platform can correctly identify differentially expressed genes. The percentage of overlap is not a useful measure to evaluate selection of differential expressions.

Discussion
Accuracy and precision of an estimator are the accepted metrics to evaluate the reproducibility [18]. Accuracy is the expected difference between an estimate and the true value. In these samples the true values of the fold changes are unknown, and an evaluation of accuracy is arguably not possible. Further, the accuracy of the estimator would depend upon the background correction and the normalization applied to the observed intensities. The MAQC Consortium 1 adopted the platform manufacturers' recommended procedures, and no attempt was made here to evaluate alternative background corrections or normalizations. Precision measures the similarity of repeated measurements. Because the accuracy is difficult to validate, evaluation of reproducibility is often based on the differ-TaqMan and microarray platform comparability   ence (i.e., closeness) of the two measurements from different platforms or laboratories. Such differences reflect differences in accuracy as well as precision. The measurement for evaluation can be either the measured intensities or the intensity ratios (fold-changes), or both. Both the sample intensity and the fold-change are evaluated in our analysis. Five MAQC microarray platforms were compared in this analysis.
The estimate of a fold change involves a comparison between measurements (relative change) made under different conditions. The correlations for the fold change B/ A have the range from 0.78 to 0.92 across platforms (Table 1). Despite these good correlations, fold-changes estimates from two platforms can be very different ( Figure  3). Within platform the fold-change correlations across sites are 96.5% or higher (Table 4). Figure 6 is a scatter plot of the fold-changes for Site 1 versus Site 3. A small fraction of genes falls in lower right or upper left region. This is consistent with the ANOVA ( Table 4) that a small fraction of genes show significant Site*Sample interaction; that is, some of the fold-change estimates are not consistent across sites. The inconsistencies are typically small and unlikely to be consequential in studies where comparisons incorporate biological variation. The estimated fold changes at different sites within a platform appear to be reasonably reproduced. But the high correlations can be deceptive (Table 4). In the MAQC data, the high correlations reflect the extreme range in the foldchanges (B/A), which can exceed 1/1000 to 1000.
For a given gene, assume a linear relationship between the observed intensities and the mRNA concentrations, This provides a first approximation since we know that the intensity records contamination from cross hybridization, which will supply a component to β, and the intensity is subject to the efficiency of transcription/labeling/hybridization as well as arbitrary amplification associated with the dye/laser signal, which all supply components to α.
Platform and site differences presumably reflect differences in α and β. The fold-change of Sample A and Sample

B is
That is, the ratio of intensities does not estimate the true fold-change (platforms would not strictly reproduce foldchanges) unless β = 0. The MAQC data show that the foldchanges are more consistent across platforms than the intensities (Table 1); that is, a high correlation is observed over a broad range of fold-change estimates. An important implication of the above equation is that significant differences in intensities imply significant differences in   mRNA concentrations even though the ratio of intensities is a biased estimate of the true fold-change. Essentially, the genes identified will depend on the statistical power to resolve differences and this is likely to differ by the platform, laboratory, technician, and sample size for any given gene.
The reproducibility of the fold-change estimate entails an experimental design that needs to properly address the relevant sources of variation. Reproducible estimates of fold changes can be achieved in experiments of reasonable size provided that the experiment blocks on major sources of variation. In the MAQC data, the reproducibility of estimated fold changes arises by blocking measurements within platforms and sites. This result agrees with recent analyses of other data sets which also have demonstrated that good comparability/reproducibility can be achieved across platforms and laboratories (sites) [2,3].
Even though accuracy cannot be evaluated directly, samples C and D are known mixtures of samples A and B, and the mixing proportions should specify the relationship among the sample means ( Table 4). Assume that the intensities satisfy the mixing relationship, for 0 = ρ = 1, and in these data most of genes adhere to this mixing relationship for ρ = 0.75 and ρ = 0.25 within each site/platform (Table 4). Then inferences concerning differences in intensities imply differences in mRNA concentrations, The Pearson (and rank) correlation coefficients and the slope (and R) of the linear regression are the most common statistical measures to assess the agreement between two measurements [4,9,16,19,20]. The correlation coefficient is a measure of linear association between two platforms. These two metrics do not detect changes in location or scale. A correlation coefficient of 0.90 or higher does not necessarily imply reproducibility between Scatter plots of the fold-change estimates between two sites for the five platforms two measurements (Table 4 and Figure 6). The interaction effect from the two-factor ANOVA model can be used to determine if the pattern of responses is consistence across the platforms (or sites). Larkin et al. [3] compared Affymetrix GeneChip 430 and TIGR cDNA array and showed that 9% of genes had significant platform*treatment interaction. The proportions of significances for the MAQC data are more than 30%. One explanation is that the MAQC project used technical replicates, and the ANOVA test is powerful because of small residual variability. Finally, microarray is generally a comparative experiment, the data analysis is about the relative expression levels of a gene among the samples, rather than its absolute intensity measures of each sample. Comparisons are made about the expression levels for a gene in different samples but not about the level of expression of one gene in relation to other genes. Correlation coefficient is a useful metric to assess the reliability of the measurements between two arrays in the same laboratory.
In addition to the correlation coefficient and regression coefficient, the MAQC project proposed differential gene list overlap for a metric of reproducibility. The number of genes in common or percent of overlapping genes (POG) between two gene lists were used as a measure for evaluation of cross platform reproducibility [9]. Reproducibility of a selected gene list is not the same as reproducibility of gene expression measurements or accuracy. The presumption with this measure is that high values of POG are indicative of reproducibility and good accuracy, and a low POG value in two gene lists is indicative of inconsistency or inaccuracy. However, POG does not reflect the accuracy of a selected gene list. POG represents an overlapping of two gene lists with unknown accuracy. A non-overlapping gene can be truly differentially expressed with a stringent cut (Table 5), and an overlapping gene can be non-differentially expressed with non-stringent cutoff.
Consider selection of the first 100 genes from the AFX and ABI platforms, an approach by the MAQC Consortium [1]. Using the p-value ranking there are only 14 overlapping genes, POG = 14%. The minimum of the foldchanges for those 86 non-lapping genes is 8.5 with the pvalue less than 10 -29 . There are only 64 overlapping genes using the fold-change ranking. The minimum of foldchange for the 36 non-overlapping genes is about 4.2 with the p-value less than 10 -17 . In either approach, those nonoverlapping genes are truly differentially expressed. An analysis of the "gold standard" data set shows the same result. Finally POG is unusable as a selection criterion. POG can increase or decrease irregularly as a cutoff changes; There is no criterion to determine a cutoff so that the percentage of overlapping genes is optimized. However, POG will be 100% if all genes are selected; regardless how many genes are truly differentially expressed.
In selection of differentially expressed genes, it may be desirable to generate a more reproducible list. However, given that there are more than ten thousand genes in the MAQC experiment, it is naive to evaluate a gene selection procedure simply based on the POG with cutoffs of selecting tens or even hundreds of genes. The general goal of gene selection is to identify a list of differentially expressed genes as accurately as possible. Because of the variation of the data, it is not possible to have an optimal cutoff that simultaneously minimizes both false positive and false negative errors. The tradeoff between the two errors depends on the application. In class comparison, for example, procedures that allow very few false positives may be appropriate when a small number of genes are selected to be validated by a PCR. While in class prediction or class discovery setting, where the intent is to develop genomic profiles or classifiers, the omission of informative genes would have a much more serious consequence than the inclusion of non-informative genes. In such cases, procedures with fewer false negatives may be more desirable.
The p-value (statistical) approach is much more than a way of gene ranking; it provides a measure to estimate the false positive error probability for a decision. In efficacy or toxicity testing, the default assumption is that there is no treatment effect. Statistical tests are designed to show a positive effect for the clinical or pre-clinical data collected in a study. Evaluation of data analysis methods should have taken no treatment effect into consideration. An evaluation based only on the data with treatment effects, its recommendation and utility for use in regulatory confirmation are questionable. When there is no treatment effect, 'a fold-change cutoff with a non-stringent p-value cutoff' would result in 100% false positive error selection.
In microarray experiments, intensities recorded are sensitive to several of the conditions under which the measurements are made; it has been recognized that the intensities cannot be reproduced across platforms and sites. One of significant contributions of the MAQC project is the identification of 12,091 genes that are represented across the platforms with an objective to compare expression data generated at multiple test sites using several microarray platforms. MAQC Consortium does not have specific conclusions about inter-platform compatibility and have the conclusion of inter-platform reproducibility [9]. In this study, we show there are differences in the intensities measured by different platforms, within each platform there is site-by-site variability. However, a microarray experiment typically is conducted in one site using one particular platform. As alternatives, an adequate normalization method will be selected to normalize for optimal comparisons of expression levels between tissue samples.
In this regard, all of the five platforms perform well in terms of discriminability.

The MAQC Study
The MAQC protocols for sample processing are available at the MAQC website [10]. We consider five microarray platforms and the TaqMan alternative platform since the NCI platform had data from only two sites and the Eppendorf microarray platform and the other two alternative platforms had less than 300 genes. The five microarray platforms are: Applied Biosystems (ABI), Affymetrix (AFX), Agilent Technologies (AG1), GE Healthcare (GEH), and Illumina (ILM). The numbers of probes for the five platforms were between 32,878 (ABI) and 54,675 (AFX). The TaqMan platform consisted of 1,004 genes. Four pools were used, two RNA sources as well as two titrations of the original samples: Sample A) 100% Universal Human Reference RNA (UHRR); Sample B) 100% Human Brain Reference RNA (HBRR); Sample C) 75% UHRR: 25% HBRR; and Sample D) 25% UHRR: 75% HBRR. All five microarray platforms used "one-color" protocols with five replicates for each of the four samples in three sites. Hybridizations that failed to meet the quality control criterions were not used. The number of arrays in each platform range from 56 to 60. The total number of arrays considered is 293. The TaqMan platform had one site with four replicates. The MAQC project generated 12,091 common genes for cross platform comparison; 906 of the 12,091 were assayed by the TaqMan platform, but only 849 genes were analyzed with the threshold detectable limit of 35 cycles.
In the inter-platform comparison, in order to minimize potential biases due to differences in scaling among platforms the data were standardized within each array so that each array had the median 0 and variance 1.

Inter and intra platform comparisons Hierarchical clustering analysis
We used a hierarchical clustering analysis to assess similarity for the 293 arrays. A hierarchical clustering tree presents a binary dendrogram representing the association structure of pairwised arrays. The association between two arrays was measured in terms of their correlations. The correlation coefficient of the standardized intensity measurements over the 12091 genes were calculated for all pairwise combinations of the 293 arrays. The algorithm identified the pair of arrays with the smallest distance and groups them with a link, where distance is defined to be one-minus-correlation. The algorithm proceeds in a recursive manner to build the tree structure step by step.

Correlation coefficient
The correlation coefficient was used to assess inter or intra platform concordance. All pairwise correlation coefficients between two arrays from different sites or/and platforms were calculated for each of the four samples (A, B, C, and D) and each of the fold-changes (B/A, C/A, D/A, C/ B, D/B, and D/C).

Analysis of variance models
Two-factor ANOVA models with interaction were used to evaluate inter-and intra-platform reproducibility. The ANOVA for evaluation of the intra-platform performance was y ijk = m + Sample i + Site j + Sample*Site ij + Error, where y ijk was the log2 expression level for sample i, site j, and replicated k. for each gene. The main effects Sample and Site and the interaction Sample*Site were tested for each platform. The proportion of genes that showed significant sample effect is a measure of platform's discriminability. Similarly, the proportion of significant site effects is a measure of reproducibility, and the proportion of significant interactions is a measure of consistency of the expressions of the four samples across sites. The comparison across platform was tested similarly.

Assessment of titration mixture
The expression levels of the two mixture samples C and D were compared to the expected responses predicted by samples A and B to assess the ability of each platform to follow the titration relationship. Denote R . The (titration) correlation coefficient of R(C) and Sample C was computed to assess titration trend, likewise for the correlation coefficient of R(D) and D. The differences of R(C) and C and the differences of R(D) and D were evaluated.
We further proposed a two-step goodness-of-fit procedure to estimate the proportion of genes that follow a titration relationship. The titration relationship can be modelled by M1 t : y ijk = m + βConc + Site j + Error, where Conc is the concentration of Sample A. The goodness-of-fit of the titration model can be tested by comparing the model M1 t to the (full) ANOVA model.
The first step is to test for the sample effect: H0 t1 : µ A = µ B = µ C = µ D . Rejection of the hypothesis indicates a significant difference in expressions among the four samples.
For those genes that are significant, the second step is to test if the titration model can adequately fit the data: H0 t2 : M1 = M2. Rejection of the second hypothesis implies the rejection of the titration relationship. The proportion of the genes that are significant in the first hypothesis and not significant in the second hypothesis is a measure of consistency of the platform. The FDR = 0.05 and 0.01 wee used for the significant levels.

Sensitivity, specificity, and accuracy in gene Selection
We present a statistical approach to construct a gold standard set of differentially expressed and non-differentially expressed genes to evaluate the MAQC data and individual platforms' performance. In order to minimize biases due to differences in measured expression values on different platforms, the standardized data were used to construct the "gold set". But, the un-standardized data were used in the evaluation for individual platform performance. In this analysis, only Samples A and B were considered.
A gene is "differentially expressed" if it was shown to be significant (p = 10 -5 ) in at least two of the five platforms. We use this criterion in order to avoid potential false positive error. It will ensure that the probability of false positive is smaller than 10 -10 = 10 -5 × 10 -5 under the assumption of independence. Eight thousand two hundred sixty five (8,265) genes were selected. A gene is nondifferentially expressed if its fold change was shown to be between 0.90 and 1/0.90 in at least two of the five platforms at the significance level 10 -3 . Specifically, the nondifferentially expressed genes were selected by applying the equivalence test [21]: H 0 : |µ A -µ B | ≥ δ versus H 1 : |µ A -µ B | <δ, where µ A and µ B are the means of Samples A and B, respectively, and δ = log 2 1.11 = -log 2 0.90, (the equivalence limit). The significance level was set at 10 -3 since majority of genes are differentially expressed. Four hundred ninety eight (498) genes were selected. We eliminated 78 overlapping genes resulting in 8187 differentially expressed genes and 420 non-differentially expressed genes. These 8607 genes were used as the gold standard set to evaluate each platform performance using the Bonferroni and FDR at 0.05 significance as cutoff for gene selection criterions. The accuracy, sensitivity (proportion of true positives), specificity (proportion of true negatives), and FDR (proportion of false positives among the selected genes were computed.

Authors' contributions
JJC and RRD conceived the study and wrote the manuscript. JJC and HMH developed the methodology. CAT implemented the hierarchical clustering. CJL performed the analysis. All authors read and approved the final manuscript.