Empirical comparison of cross-platform normalization methods for gene expression data
© Rudy and Valafar; licensee BioMed Central Ltd. 2011
Received: 24 May 2011
Accepted: 7 December 2011
Published: 7 December 2011
Simultaneous measurement of gene expression on a genomic scale can be accomplished using microarray technology or by sequencing-based methods. Researchers who perform high-throughput gene expression assays often deposit their data in public databases, but heterogeneity of measurement platforms leads to challenges for the combination and comparison of data sets. Researchers wishing to perform cross-platform normalization face two major obstacles. First, a choice must be made about which method or methods to employ. Nine are currently available, and no rigorous comparison exists. Second, software for the selected method must be obtained and incorporated into a data analysis workflow.
Using two publicly available cross-platform testing data sets, cross-platform normalization methods are compared based on inter-platform concordance and on the consistency of gene lists obtained with transformed data. Scatter and ROC-like plots are produced and new statistics based on those plots are introduced to measure the effectiveness of each method. Bootstrapping is employed to obtain distributions for those statistics. The consistency of platform effects across studies is explored theoretically and with respect to the testing data sets.
Our comparisons indicate that four methods, DWD, EB, GQ, and XPN, are generally effective, while the remaining methods do not adequately correct for platform effects. Of the four successful methods, XPN generally shows the highest inter-platform concordance when treatment groups are equally sized, while DWD is most robust to differently sized treatment groups and consistently shows the smallest loss in gene detection. We provide an R package, CONOR, capable of performing the nine cross-platform normalization methods considered. The package can be downloaded at http://alborz.sdsu.edu/conor and is available from CRAN.
Simultaneous measurement of gene expression on a genomic scale can be accomplished using microarray technology or by sequencing-based methods [1–3]. Many high-throughput mRNA expression experiments produce data that can be of value to other researchers when analyzed in new contexts or in combination with data from other experiments. In particular, the statistical power and reproducibility of gene expression studies can be increased by combining data across multiple studies [4–6]. While next generation sequencing seems likely to replace microarrays for expression analysis in the near future, the large amount of microarray data already in existence could continue to be useful to researchers for many years to come.
Table 1. Characteristics of relevant microarray platforms

Applied Biosystems Human Genome Survey Microarray v2.0
  Probe: DNA oligonucleotide with 3' carbon spacer to reduce steric effects
  Manufacture: presynthesized and contact spotted
  Label: digoxigenin (DIG) UTP
  Detection: anti-DIG phosphatase-catalyzed chemiluminescence

Affymetrix HG-U133 Plus 2.0 GeneChip®
  Manufacture: in situ photolithography

Agilent Whole Human Genome Oligo Microarray, G4112A
  Manufacture: in situ inkjet printing
  Label: Cy3 or Cy5
  Detection: Cy3 or Cy5 fluorescence

Illumina Human-6 BeadChip, 48K v1.0
  Probe: DNA oligonucleotide with 29-base address sequence as linker
  Manufacture: presynthesized, immobilized on beads, and randomly deposited in wells
Chemiluminescence provides greater sensitivity for low levels of expression compared to fluorescence, but at the risk of saturation for highly expressed genes. The Applied Biosystems scanning procedure attempts to mitigate scanner saturation by using both a short and a long exposure to extend the dynamic range of its expression measurements. Probe sequences affect the binding constants between probes and target and non-target molecules, and therefore the sensitivity and specificity of each probe depends partially on its sequence. Salinity and composition of the hybridization solution, temperature, and incubation time of hybridization may also affect sensitivity and specificity. Data from two microarrays are directly comparable only if those microarrays are identical in all design parameters including probe sequences and have been subjected to similar hybridization conditions. Because no two platforms share the same set of probe sequences, no two platforms produce data that are directly comparable, even if all other variables are the same. For experimenters this restriction is not major. They need only ensure that all experiments are conducted using the same array platform and protocol. However, platform effects pose a significant problem for the re-analysis of data from multiple microarray studies.
Researchers who perform high-throughput gene expression assays often deposit their data in public databases such as ArrayExpress and Gene Expression Omnibus (GEO), the latter of which currently houses 630,845 assays distributed among 9,348 platforms. Heterogeneity of measurement platforms leads to challenges for the re-use of these large data sets, creating limitations for researchers wishing to combine them. Extensive effort has been directed toward assessing the reproducibility of differential expression measurements across different platforms. Several studies have found good agreement among gene expression profiles produced by different platforms [13–18], while other studies have had conflicting results [19–21]. Technical issues pertaining to such evaluations include homogeneity of RNA samples, consistency of experimental protocols, mapping of probes across platforms, and the statistical methods used to assess reproducibility (e.g. direct comparison vs. log ratios). Those studies in which good intra-platform reproducibility was achieved and log ratios were compared across platforms generally showed good inter-platform reproducibility for oligonucleotide-based arrays. One study focusing on probe mapping in particular found that reproducibility between Affymetrix and cDNA platforms could be substantially improved by sequence-based re-annotation, and another found that reproducibility was further improved by mapping probe sequences at the exon level. More recent studies generally show better cross-platform reproducibility than earlier ones. It seems clear that, at least under ideal conditions, differential expression analysis gives consistent results across platforms. It is therefore worth asking how data from different platforms might be combined in an analysis.
Models and techniques exist for the meta-analysis of microarray data from multiple studies and platforms [24–27], and these have been applied extensively to investigate questions of biological interest [28–33].
Cross-platform normalization differs from meta-analysis; the former involves direct comparison between expression measurements obtained from different platforms while the latter combines the results of intra-platform comparisons at a higher level. While meta-analysis techniques are extremely useful tools, they are limited to combining the results of studies that have tested the same hypothesis or compared the same treatments, and cannot easily be applied to the investigation of new hypotheses from existing data.
Cross-platform normalization methods have been developed for the combination of data sets collected using different microarray platforms. These methods are the Cross-Platform Normalization (XPN) method of Shabalin et al., Distance Weighted Discrimination (DWD), an Empirical Bayes (EB) method also known as ComBat, Median Rank Scores (MRS), Quantile Discretization (QD), Normalized Discretization (NorDi), the Distribution Transformation (DisTran), and a method known as Gene Quantiles (GQ), which was developed as part of the WebArrayDB service. In addition to these specialized methods, Quantile Normalization (QN), a method commonly employed for intra-platform normalization, has also been applied to cross-platform normalization. Many of the specialized methods include or are based closely on QN. Online analysis services currently offering some of these methods include WebArrayDB, ArrayMining, and DSGeo. In addition, QN is available as part of Bioconductor, and code for some methods can be obtained from their respective authors.
Cross-platform normalization could be a valuable resource to researchers. While several studies have employed it for microarray analysis [4–6, 43], it has not achieved the popularity of meta-analysis methods for the integration of results across studies and platforms. The online services listed above have received a total of 28 citations as of this writing according to Google Scholar. It is difficult to judge the number of relevant citations for many methods, as some have uses other than cross-platform normalization or are introduced in publications describing other techniques. The publication describing XPN does not introduce any other methods or experiments, nor does XPN have any obvious applications other than cross-platform normalization. That article has been cited 34 times since its publication in 2008 according to Google Scholar, with only nine of those citing papers satisfying a full-text search for the string "XPN". Those wishing to perform cross-platform normalization face three major obstacles. Firstly, a choice must be made about which method or methods to employ. While the authors of each method have demonstrated their methods on at least one example data set, to our knowledge no empirical comparison of cross-platform normalization methods similar to ours is available. In particular, no third-party empirical comparison has been attempted. The authors of XPN do provide a comparison of their method against several others, but their analysis was conducted on a more limited data set and does not make use of resampling or any other procedure to evaluate the robustness of their results. This is not necessarily a shortcoming of their paper, but merely the result of a difference in objectives. The authors presented a method, and it is left for others to provide an unbiased evaluation. Secondly, software for the selected method must be obtained and incorporated into a data analysis workflow.
The disunity of interfaces and software packages for cross-platform normalization makes this task quite difficult for researchers lacking advanced computer skills, especially for methods that are only available as part of an online service. Some methods also rely on proprietary software, which presents an additional obstacle to integration. Thirdly, current cross-platform normalization techniques are only applicable to a limited subset of data sets. All the methods listed above require that every treatment group or sample type be represented on each platform. If this restriction is violated then it becomes impossible to distinguish platform effects from treatment effects of interest, and the latter may end up being removed by normalization.
In this paper we provide a comparison of available methods based on the MicroArray Quality Control (MAQC) project data set and a human sperm data set containing data from multiple platforms (see Methods for details). Envisioning potential applications to large-scale databases or classification problems, we restrict our attention to cross-platform normalization performed without knowledge of treatment groups, and we examine the consequences of differently sized and missing treatment groups for the most successful methods. We also investigate the consistency of platform effects across different experiments. We have assembled an R package capable of performing all of the methods investigated with a unified interface and reasonable defaults for user-selectable parameters. Our package makes it possible for researchers to easily incorporate cross-platform normalization into existing workflows, especially workflows based on R or Bioconductor, and to experiment with multiple techniques without significant extra effort. Our package is available from the Comprehensive R Archive Network (CRAN). We explore a possible solution to the third difficulty above and show that it is insufficient in some cases.
Results and Discussion
By comparing these curves we obtain statistics for over- and under-detection of differentially expressed genes after cross-platform normalization as follows. Over-detection can be assessed by observing the difference between the ROC-like curve for genes detected with the cross-platform data set (the red curve) and the curve for the intersection of the cross-platform genes and the union of the single-platform genes (the blue curve), which gives the genes detected using the cross-platform data set but not using either single-platform data set. Under-detection can be assessed by observing the difference between the curve for the intersection of the two single-platform gene sets (the green curve) and the curve for the intersection of all three gene sets (the violet curve), which gives the proportion of genes detected by both platforms individually but not by the cross-platform data set. In our further analyses, the areas between these two pairs of curves are used as statistics to measure over- and under-detection, denoted o and u, respectively, of differentially expressed genes. These area statistics have the advantage of not depending on any arbitrary FDR cut-off, and to our knowledge have not been used previously.
It is not possible to rank cross-platform normalization methods by either one of the statistics o or u described above. For example, a method may bring o to nearly zero by removing all treatment effects. Such a method would not be desirable, and would result in a large u. Each of the statistics o and u guards against unrestricted optimization of the other. The o and u statistics are related to false positive and false negative rates, respectively, and in practice a good cross-platform normalization should strike a balance between the two.
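The o and u area statistics can be made concrete with a short sketch. This is our own illustrative Python code, not the implementation used in the study (the study's CONOR package is written in R, and the function names here are ours): a ROC-like curve is the empirical CDF of a list of q-values, and o and u are trapezoidal areas between pairs of such curves.

```python
import numpy as np

def roc_like_curve(q_values, cutoffs):
    """Empirical CDF of a list of q-values: the proportion of genes
    called significant at each q-value cut-off."""
    q = np.sort(np.asarray(q_values, dtype=float))
    return np.searchsorted(q, cutoffs, side="right") / q.size

def area_between(upper, lower, cutoffs):
    """Trapezoidal area between two curves sampled at the same cut-offs,
    used here as a proxy for the over-detection (o) and
    under-detection (u) area statistics."""
    diff = np.asarray(upper, dtype=float) - np.asarray(lower, dtype=float)
    return float(np.sum((diff[1:] + diff[:-1]) / 2.0 * np.diff(cutoffs)))
```

For o, `upper` would be the curve for genes detected with the cross-platform data and `lower` the curve for those also detected on at least one single platform; for u, the curves for the single-platform intersection with and without the cross-platform set.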
Sorted from least to greatest under-detection, the methods are ordered: DWD, EB, GQ, XPN, MRS, QN, DisTran. Again, there is a major jump between the fourth and fifth ranked methods. When sorted instead by over-detection, the order is: MRS, QN, DisTran, XPN, DWD, GQ, EB, with a major discrepancy between the third and fourth ranked methods. Inspection of the ROC-like curves shows that the lower levels of over-detection among MRS, DisTran, and QN can be understood in terms of lower total detection of differentially expressed genes. By all three measurements, the seven methods cluster into two groups. The first group, made up of DWD, EB, GQ, and XPN, is characterized by higher concordance and over-detection and lower under-detection, while the second group, consisting of DisTran, MRS, and QN, is characterized by lower concordance and over-detection and higher under-detection. The ROC-like plots for the two groups are qualitatively different (Figure 3). In the first group, the curve for the cross-platform normalized data lies between the intersection and union curves for the native differential expression sets. In the second group, the curve for the normalized data always lies below the native intersection curve.
Equally sized treatment groups
In terms of concordance (Figure 4), results largely agree with our initial evaluation. Methods can be divided into the same two groups. The rankings of DWD, EB, GQ, and XPN fluctuate depending on the platforms, with one of the XPN methods always showing the highest ranking. XPN, DWD, EB, and GQ all perform near the level of the resampling controls, while MRS, QN, and DisTran perform near the level of the non-normalized cross-platform control. Over-detection (Figure 5) showed the same pattern as in our initial comparison. We had stated that the reduced over-detection of MRS, QN, and DisTran may be due entirely to the reduced level of total detection for those methods. Here we show that the increased over-detection of XPN, DWD, EB, and GQ is still below the level of the resampling controls regardless of platform. Under-detection (Figure 6) was near but somewhat above the resampling controls for XPN, DWD, EB, and GQ for all platform combinations. MRS, QN, and DisTran fluctuated together depending on platform, always at a higher level than the other methods, and sometimes near the level of the negative control. DWD always showed the lowest level of under-detection, while an XPN variant was always the highest of XPN, DWD, EB, and GQ. Variance of under-detection for the XPN methods appears markedly higher than for DWD, EB, and GQ. A Brown-Forsythe Levene-type test based on the absolute deviations from the median showed that the pooled variance of the XPN methods differed from that of DWD, EB, and GQ at a significance level of less than 10^-20.
Overall, some results were consistent across all data sets tested. DWD, EB, GQ, and XPN outperform MRS, QN, and DisTran in both concordance and under-detection, with DWD showing the lowest under-detection and XPN the highest concordance in all cases. Over-detection never exceeded levels observed in resampling controls, even for the non-normalized control, which in fact showed no over-detection. Because all treatment groups were the same size for both platforms, platform effects would be expected to cancel out when comparing treatment groups. High variance within each treatment group due to platform effects explains the under-detection seen in the negative control. The performance of MRS, QN, and DisTran varied with the data sets tested; while they generally performed poorly, they still outperformed the negative control in the sperm data set, which is consistent with past successful applications of those methods.
Unequally sized treatment groups
Missing treatment groups
Exploring platform effects
where η_j is the location shift used to correct for platform effects, subtracted from each measurement to give the corrected value. The values n_ij are the number of repetitions for treatment i and platform j, and m is the total number of different treatment groups. It is assumed that the true values of the effects are known; in practice, η_j will be an estimate based on the data.
where the interaction terms C_ij are taken from the training data set. Depending on how the interaction terms of the training and target data sets differ, transference of parameters could result in the increased over-detection seen in our missing-treatment-group experiment. Other causes are also possible, including differences in scanners or image-processing procedures.
Uniformity of platform effects across different experiments is consistent with the model (1). Additivity is not, although empirically it appears to be a useful approximation, as indicated by the relative success of the location based methods. The accuracy of that approximation may be reduced, however, when employing a separate training set to estimate platform effects. This reduction was observed in our analyses, in particular in Figures 9 and 10. Model (1) is a sufficient explanation for all of our observations, which suggests it or the more general equation (4) may provide a good basis for an improved cross-platform normalization method that addresses the issue of non-identical treatment groups across platforms. However, fitting such a model would require a data set containing multiple matched treatment groups for any pair of platforms on which it could be used.
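The location-shift idea behind model (1) can be illustrated with a minimal sketch. The estimator below is our own simplification, not the estimator used by any of the methods compared (and the study's software is R, not Python): for each gene, the shift η_j for platform j is estimated as the average, over treatment groups, of the difference between the group's mean on platform j and its mean across all platforms, so that balanced treatment effects cancel out of the platform estimate.

```python
import numpy as np

def location_shift_correct(x, platform, treatment):
    """Illustrative per-gene location-shift correction.
    x: (genes, samples) expression matrix; platform and treatment are
    per-sample label arrays. For each platform j, eta_j is the average
    over treatment groups of (group mean on platform j minus group mean
    across platforms); eta_j is then subtracted from platform j's columns."""
    x = np.asarray(x, dtype=float)
    platform = np.asarray(platform)
    treatment = np.asarray(treatment)
    out = x.copy()
    for j in np.unique(platform):
        diffs = []
        for i in np.unique(treatment):
            in_group = treatment == i
            on_plat = in_group & (platform == j)
            if on_plat.any():
                diffs.append(x[:, on_plat].mean(axis=1) - x[:, in_group].mean(axis=1))
        eta_j = np.mean(diffs, axis=0)  # per-gene shift for platform j
        out[:, platform == j] -= eta_j[:, None]
    return out
```

With equally sized treatment groups on each platform this removes a purely additive platform effect while preserving treatment differences; with missing or unbalanced groups the estimate absorbs treatment effects, which is exactly the failure mode discussed above.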
To facilitate the application of cross-platform normalization by other researchers, we packaged all the implementations used in this work, including those obtained from other authors, using the R package mechanism. Our package, CONOR, includes documentation and provides a common interface for all methods, along with reasonable defaults for user-selectable parameters. The package can be downloaded at http://alborz.sdsu.edu/conor and is available from CRAN.
Of the four methods capable of successful cross-platform normalization, DWD showed the least loss of treatment information and XPN showed the greatest inter-platform concordance, although the latter was sometimes in excess of resample controls and might be interpreted as a slight over-correction. DWD was the most robust to variations in treatment group sizes between the two platforms. This result is somewhat surprising because XPN incorporates an assay clustering step designed to correct for such variations. Our clustering variant showed reduced sensitivity to treatment size disparity, and it is possible that further improvement to the clustering step of the XPN algorithm could result in improved robustness. It is no surprise that GQ and EB, which do not account for treatment group disparities, suffered reduced performance under such conditions. In general, those methods that employ location shifts (DWD, EB, GQ, and XPN) outperformed those that do not, and the performance of the methods that do not include such shifts was quite unsatisfactory. Many of those methods were not originally designed for cross-platform normalization, and their failure to accomplish such normalization does not imply that they are unsuitable for their other uses. The EB and XPN methods make use of distributional assumptions about the data. XPN uses a normally distributed residual for maximum likelihood estimation, while EB employs a complicated model including parametric prior distributions. For the log-transformed microarray data used in this study, these methods performed well. However, the distributional assumptions of these methods must be valid for their performance to be assured. When performed on non-log-transformed data, for example, these methods fail to produce good results. In cases in which normality is in doubt, appropriate transformations (such as log transformation) should be employed.
Our analyses considered cross-platform normalization in the absence of treatment group information. It is possible that superior methods to those investigated could be devised that make use of treatment information. The EB method is already able to accommodate treatment group membership information if it is available.
Our experiment in normalizing the human sperm data set using parameters derived from the MAQC data set shows that there may be some consistency in platform effects across different treatments. Larger differences between platform effects in the MAQC and human sperm studies may have resulted from differences in protocols between the two studies. However, they may also have resulted from larger disparity between gene expression patterns in sperm and those of other human tissues, in which case a more complicated model of platform effects might resolve the difference. Ideally, data sets like the MAQC project's could be used as "Rosetta Stones" for gene expression platforms, allowing data collected on one platform to be translated to be comparable with data from another platform regardless of treatment group disparities. This work has shown that a model including treatment-platform interaction terms will be required for such a system to be effective, and further investigation is required before such a system can be realized.
Next generation sequencing technology is replacing microarrays for the measurement of gene expression. The types of platform effects present in microarray data are probably not relevant for sequencing-based expression data. Nevertheless, existing databases, as well as microarray experiments that may be performed in the near future, represent a substantial resource. Cross-platform normalization has the potential to become a valuable tool for gene expression research by allowing researchers to combine and analyze existing data together with new data or in new contexts. While at least nine methods are currently available, we have shown that only four of those methods provide reasonable results on the MAQC and human sperm data sets. Although researchers are encouraged to draw their own conclusions based on their particular needs, two methods emerged as most effective. In cases in which false positives are to be preferred over false negatives, or in which treatment group sizes are not equal for the various single-platform data sets, DWD is the recommended cross-platform normalization method because of its lower under-detection and improved robustness. In cases in which false negatives are to be preferred to false positives, and for use with classifiers, XPN is recommended because of its superior cross-platform concordance. If there is doubt as to which situation applies, DWD is recommended, but there is no reason that multiple techniques could not be employed and the results compared. The existence of the CONOR package will make the latter option particularly painless. In cases in which treatment groups are missing from one platform or another, the DWD-based procedure described here is the only currently available method, but care must be taken to ensure that training data have sufficiently similar transcription profiles to the data being transformed.
Cross-platform normalization of large or complex datasets in which treatment groups are missing will require improved models of treatment-platform interaction effects. Future work should include the modification of XPN to improve robustness and allow for the use of a separate training set, as well as the investigation of other models of platform effects. This study has not examined every platform available. In particular, no cDNA arrays were included, and the effectiveness of cross-platform normalization with such arrays should be investigated before those methods are applied. While the types of platform biases present in microarrays are not relevant to sequencing-based assays, it is worth investigating the effects that different extraction, amplification, and sequencing methodologies have on such measurements, as well as how data from sequencing-based assays might be combined with microarray data. While the availability of an R package does much to make these methods accessible, there are some researchers who are not comfortable with the R command-style interface. For those researchers, a GUI application or web interface integrating all of those methods may be beneficial, although web interfaces are already available for some methods. If the transfer of platform parameters between data sets is improved, a web-based database of such platform parameters that could be integrated into cross-platform analyses would be extremely useful.
The MAQC data set contains assays of two distinct RNA samples, Stratagene Universal Human Reference RNA (treatment group A) and Ambion Human Brain Reference RNA (treatment group B), along with 3:1 (group C) and 1:3 (group D) mixtures of the two. Each sample was assayed repeatedly at two or more independent sites for each platform included. The platforms used in our analyses were the Applied Biosystems Human Genome Survey Microarray v2.0, the Affymetrix HG-U133 Plus 2.0 GeneChip®, the Agilent Whole Human Genome Oligo Microarray G4112A, and the Illumina Human-6 BeadChip 48K v1.0. These platforms were selected based on the availability of data and the variety of probe lengths, manufacturing techniques, and detection methods employed (see Table 1).
The human sperm data set contains assays of mRNA from sperm obtained from normally fertile men (group N) and teratozoospermic men (group T), assayed using the same AFX and ILM arrays as were used in MAQC, as well as another Illumina array, which was not used in this study. While some individual samples were assayed on both the AFX and ILM platforms, assays in the N or T group of each platform represent biological rather than technical replicates. Each assay comes from a unique sample and individual.
The MAQC and human sperm data were obtained from GEO [GEO: GSE5350 and GSE6969]. The data set for each microarray platform was subject to different preprocessing. Preprocessing did not make use of treatment group membership, as such knowledge would not be available or easily incorporated for some applications. For all data sets except ABI, quantile normalization was used as an intra-platform normalization strategy to remove assay effects. Quantile normalization is a well-established and simple method for intra-platform normalization. While other methods are available, we found quantile normalization to be sufficient for our purposes. All expression data were subject to log transformation, and it is worth noting that most cross-platform normalization methods failed to perform well on data that were not log-transformed in our initial trials (results not included). The methods that failed on non-log data were those that employed statistical models, and their poor performance is likely the result of violations of the distributional assumptions of those models.
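The quantile normalization used for intra-platform preprocessing can be sketched in a few lines. This is our own minimal Python illustration, not the Bioconductor implementation used in the study; ties and missing values are ignored.

```python
import numpy as np

def quantile_normalize(x):
    """Minimal quantile normalization of a (genes, arrays) matrix:
    average the sorted columns to obtain a reference distribution,
    then replace each value by the reference value at its rank,
    giving every array an identical empirical distribution."""
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)  # rank of each value within its column
    reference = np.sort(x, axis=0).mean(axis=1)        # mean of the order statistics
    return reference[ranks]
```

After this step all arrays on a platform share one distribution, which removes assay effects but, as discussed above, does nothing about inter-platform differences.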
The ABI data were downloaded already normalized from GEO using the GEOquery package from Bioconductor. These data were normalized by the MAQC authors using the Expression Array System Software suite from Applied Biosystems, which implements a platform-specific normalization sequence based on the specific properties of the ABI array and the 1700 Chemiluminescent Microarray Analyzer. The steps in this normalization sequence take advantage of co-localized control probes and signal-to-noise ratios obtained during image processing. Details can be found in the supplemental materials of the MAQC publication and in the document entitled "User Bulletin: Applied Biosystems 1700 Chemiluminescent Microarray Analyzer" issued by ABI. The downloaded data were natural-log transformed before use in our analyses. It should be noted that the ABI data set was the only set not subjected to quantile normalization.
Raw Affymetrix CEL files were pre-processed using the function justRMA from the affy package  of Bioconductor. For the experiments involving both MAQC and human sperm data, both sets were normalized together. For all other experiments, the MAQC and sperm data were normalized separately.
The raw Agilent data were processed using the limma package from Bioconductor. Background correction was performed using the "normexp" and "mle" options of the backgroundCorrect function. Quantile normalization was performed by the normalizeBetweenArrays function. Data from duplicate probes were then averaged and natural log transformation performed.
Raw Illumina data were acquired from GEO in text format. The mean signal for each probe type was extracted and subjected to quantile normalization (provided by the normalize.quantiles function from Bioconductor's preprocessCore package) and natural log transformation. For the experiments involving both MAQC and human sperm data, both sets were quantile normalized together. For all other experiments, the MAQC and sperm data were subjected to separate quantile normalization.
The MAQC publication provides mappings for all platforms involved in the study for 12,091 common genes as Supplementary Table 5, and we used that mapping for all cross-platform normalization experiments for both the MAQC and human sperm data sets. The mapping was accomplished by BLASTing probe sequences against the human RefSeq database. A detailed BLAST protocol is also provided in the supplementary materials of the MAQC publication, which are available from the Nature Biotechnology website. The RefSeq database has been improved and updated since 2006, and some justification is required for the use of gene annotations that are presently more than four years old. While it is generally advisable to use the most recent annotations available when analyzing microarray data, little is to be gained in this case by repeating the mapping with a more recent version of RefSeq. The number of additional probes that could be mapped using the current version of RefSeq is likely to be very small, especially when considering that the arrays used were designed before the original mapping was produced and that the human genome was already well mapped by 2006. As the purpose of this study is to compare the various methods for cross-platform normalization, and not to discover genes of biological interest, we believe the MAQC mapping to be sufficient.
Plots and statistics
Mean-mean plots and inter-platform concordance
Mean-mean expression scatter plots were produced by plotting the average expression value for each gene and treatment group on one platform against the average expression for the corresponding gene and treatment group of another platform. Treatment groups for MAQC data are equivalent to sets of technical replicates (since the same RNA pool was used for all assays), whereas treatment groups in the human sperm data set (samples obtained from normal and teratozoospermic individuals) represent biological replicates, although some technical replicates are included. Squared Pearson correlation between the x and y coordinates of each point in the mean-mean scatter plot was used as a statistic to measure inter-platform concordance. It has been pointed out by the editors that the concordance correlation coefficient is a more appropriate statistic for these purposes. Unfortunately, this issue was brought to our attention too late for the change to be made. We believe, however, that the conclusions of this work are not impacted by the use of the squared Pearson correlation for two reasons: firstly, that the mean-mean plots used in this study generally show a good fit to the line y = x, and secondly, that the conclusions drawn are supported by the other statistics used.
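The construction of the mean-mean points and the concordance statistic can be sketched as follows. This is an illustrative reconstruction, not the CONOR package code, and it assumes the two platforms assayed the same sample design (same treatment-group labels per column):

```python
import numpy as np

def mean_mean_points(plat1, plat2, groups):
    """Per-gene, per-treatment-group means on each platform.

    plat1, plat2: genes-by-samples expression matrices with matching
    sample designs; groups: treatment-group label for each column.
    """
    groups = np.asarray(groups)
    xs, ys = [], []
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        xs.append(plat1[:, idx].mean(axis=1))
        ys.append(plat2[:, idx].mean(axis=1))
    return np.concatenate(xs), np.concatenate(ys)

def concordance_r2(x, y):
    """Squared Pearson correlation of the mean-mean scatter points."""
    r = np.corrcoef(x, y)[0, 1]
    return r * r
```

With well-normalized data the points fall near the line y = x and the squared correlation approaches 1.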
Differential expression and ROC-like curves
Differential expression was assessed using the p-value of a two-sided Welch's t-test as a statistic. A p-value was obtained for each gene, and the resulting list of p-values was transformed into a list of q-values using the q-value package, available from Bioconductor. The q-value for a particular feature (or gene) is defined as the proportion of false positives expected when calling all features on a list up to and including that one significant, where here the list in question is the list of p-values obtained from the t-tests. The empirical cumulative distribution function (cdf) of the resulting list of q-values is then equivalent to the ROC-like curves presented. Similar curves have been used previously in the evaluation of methods for identifying differentially expressed genes.
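The relationship between the q-value list and the ROC-like curve can be illustrated with a small sketch. Benjamini-Hochberg adjusted p-values are used here as a simple stand-in for the Storey q-values produced by the qvalue package (the two are not identical; Storey's method also estimates the proportion of true nulls):

```python
import numpy as np

def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted p-values: a simplified stand-in for
    the Storey q-values computed by the qvalue package."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    adj = p[order] * n / np.arange(1, n + 1)
    adj = np.minimum.accumulate(adj[::-1])[::-1]  # enforce monotonicity
    q = np.empty(n)
    q[order] = np.minimum(adj, 1.0)
    return q

def roc_like_curve(qvals, fdr_grid):
    """Empirical cdf of the q-value list: the fraction of genes called
    significant at each FDR cutoff."""
    q = np.sort(np.asarray(qvals))
    return np.searchsorted(q, fdr_grid, side="right") / len(q)
```

For example, p-values [0.01, 0.02, 0.03, 0.5] adjust to [0.04, 0.04, 0.04, 0.5], so at an FDR cutoff of 0.05 the curve reports that 3 of 4 genes are detected.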
The union and intersection curves were obtained by taking the empirical cdf of the gene-wise minimum and maximum, respectively, of the two q-value lists. Areas between curves were obtained by numerical integration using the trapezoid method, as implemented in the caTools package available from CRAN. Curves were sampled at intervals of 0.001 on the FDR axis. Over-detection was measured as the area between the ROC-like curve for differentially expressed genes detected using cross-platform normalized data and the intersection of that curve with the curve representing the union of genes detected using each platform independently. Under-detection was measured as the area between the curve representing the intersection of genes detected using each platform independently and the intersection of that curve with the curve representing differentially expressed genes detected using cross-platform normalized data. The sample sizes were equalized for all curves during the resampling process.
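The union/intersection construction and the two area statistics can be sketched as follows, with the trapezoid rule written out explicitly (the actual analysis used the caTools implementation). The function names are illustrative:

```python
import numpy as np

GRID = np.arange(0.0, 1.001, 0.001)  # FDR axis sampled at 0.001, as in the text

def ecdf(q, grid=GRID):
    """Fraction of q-values at or below each grid point."""
    q = np.sort(np.asarray(q))
    return np.searchsorted(q, grid, side="right") / len(q)

def trapezoid_area(y, dx=0.001):
    """Trapezoid-rule integral of a curve sampled at spacing dx."""
    return dx * (y[1:] + y[:-1]).sum() / 2.0

def detection_areas(q_cross, q_plat1, q_plat2):
    """Over-detection: area where the cross-platform curve rises above
    the union curve. Under-detection: area where it falls below the
    intersection curve."""
    cross = ecdf(q_cross)
    union = ecdf(np.minimum(q_plat1, q_plat2))  # ecdf of gene-wise minimum q
    inter = ecdf(np.maximum(q_plat1, q_plat2))  # ecdf of gene-wise maximum q
    over = trapezoid_area(np.clip(cross - union, 0.0, None))
    under = trapezoid_area(np.clip(inter - cross, 0.0, None))
    return over, under
```

When the cross-platform q-values coincide with both platforms' q-values, all three curves coincide and both areas are zero.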
A smoothed bootstrapping procedure was used to obtain distributions for the concordance, over-detection, and under-detection statistics. The smoothing was accomplished through the addition of zero-centered Gaussian noise with a standard deviation of 0.1. Resampling was restricted by treatment groups. That is, for the MAQC data set, every bootstrap maintained the same proportion of data from samples A, B, C, and D. For the human sperm data set, the proportion of data from N and T samples was kept constant. Data used to produce the native ROC-like curves for each platform (and subsequent union and intersection curves) were produced by the same resampling and smoothing procedure, but with the sample size doubled for the relevant treatment groups in order to maintain the same total sample size for each ROC-like curve produced. Data for resampling (positive) control methods were generated in the same manner as the data for the native ROC-like curves, sampled independently to simulate repetition of the experiment on each platform separately. Computations were performed using R version 2.9.0 (2009-04-17) x86_64-redhat-linux-gnu on a Rocks 5.3 computing cluster. Bootstrapping results are available as additional file 4.
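The core of the smoothed, group-restricted bootstrap can be sketched as follows (a minimal reconstruction; the actual analysis was performed in R):

```python
import numpy as np

def smoothed_bootstrap(data, groups, rng, sd=0.1):
    """Resample assay columns with replacement within each treatment
    group (so group proportions stay fixed), then smooth by adding
    zero-centered Gaussian noise with standard deviation sd."""
    groups = np.asarray(groups)
    cols = np.concatenate([
        rng.choice(np.flatnonzero(groups == g), size=(groups == g).sum(), replace=True)
        for g in np.unique(groups)
    ])
    noise = rng.normal(0.0, sd, size=(data.shape[0], len(cols)))
    return data[:, cols] + noise

rng = np.random.default_rng(42)
data = rng.normal(size=(100, 8))  # 100 genes, 8 assays
boot = smoothed_bootstrap(data, ["A"] * 4 + ["B"] * 4, rng)
```

Each bootstrap replicate has the same shape and group composition as the original matrix; the added noise prevents resampled duplicates from being exactly identical.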
R code was obtained for all methods if available. R functions for GQ, MRS, and QD were taken from the source code for WebArrayDB. Code for EB and NorDi was obtained from the original authors of those methods. QN was available from the preprocessCore package of Bioconductor. To the best of our knowledge, no R implementations of DWD, DisTran, or XPN existed prior to this work. Implementation of DisTran was based on a description of the method. Because the DisTran method relies on treatment group information, which for the purposes of this comparison was not considered to be available, a k-means clustering step was added to estimate that information. The correct number of treatment groups was always used as the value of k. Implementation of DWD and XPN was based on the Matlab implementations of those methods provided by their authors. Both methods were tested on toy data sets to ensure agreement with the original implementations. For XPN, agreement was approximate due to the randomness inherent in the initial clustering step. DWD relies on the solution to a second-order cone program (SOCP). No appropriate SOCP solver existed in R prior to this work, so a recent smoothing Newton method algorithm was implemented and employed. Additional clustering features were added to XPN. In the original Matlab implementation, k-means clustering of assays was performed on data from both platforms together. If any cluster contained data from only one platform, clustering was simply repeated with a different set of initial centroids. To speed up the clustering process, an option to cluster data from the two platforms separately and match clusters based on the correlation of cluster centroids was added. XPN trials designated as "modified" or "mod" made use of this modified method.
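The cluster-matching option added to XPN can be sketched as a greedy assignment on centroid correlations. This is an illustrative reconstruction under the assumption that each platform's assays have already been k-means clustered separately, not the CONOR package code:

```python
import numpy as np

def match_clusters(centroids1, centroids2):
    """Pair platform-1 clusters with platform-2 clusters by the Pearson
    correlation of their centroids, taking the most confident matches
    first (greedy one-to-one assignment)."""
    k = centroids1.shape[0]
    corr = np.corrcoef(centroids1, centroids2)[:k, k:]  # k-by-k cross-correlation block
    pairs, used = {}, set()
    for i in np.argsort(-corr.max(axis=1)):  # rows with strongest best-match first
        j = max((c for c in range(k) if c not in used), key=lambda c: corr[i, c])
        pairs[int(i)] = j
        used.add(j)
    return pairs
```

If platform 2's clusters are simply a reordering of platform 1's, the assignment recovers that permutation.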
Some methods require the user to select one or more parameters manually. XPN and DisTran both require the user to select how many assay clusters are to be used. In all cases, the actual number of treatment groups was used. XPN also requires the user to select the number of gene clusters and the number of iterations to perform. Values of 3, 6, and 9 were used for gene clusters as indicated in the results section. Where no value is indicated, three gene clusters were used. For all XPN trials, 30 iterations were performed as recommended by the original authors. The EB method allows the user to select whether a parametric or non-parametric prior distribution is to be used. For all trials shown, the parametric prior was selected. Difficulties were encountered with the non-parametric option. NorDi requires the selection of p-value and alpha parameters, which were set at 0.01 and 0.05, respectively, in agreement with that method's use by its authors.
A detailed description of each method can be found in the references, with the exception of GQ. GQ is included as part of the WebArrayDB service and was invented by the authors of that service, but has never been described in a publication. GQ is a two-step process. First, the data are transformed by the MRS method. Second, the median expression value is calculated for each gene and platform combination in the MRS-normalized data. The second platform's medians are then subtracted from its data, and the first platform's medians are added, with the end result that each gene has the same median expression level on both platforms.
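The second, median-matching step of GQ can be written out directly from the description above (the MRS transformation of the first step is not shown; the function name is ours):

```python
import numpy as np

def gq_median_match(plat1, plat2):
    """Second step of GQ: shift each gene on platform 2 so that its
    median matches the platform-1 median. Platform 1 is unchanged.
    Inputs are genes-by-samples matrices of MRS-normalized data."""
    med1 = np.median(plat1, axis=1, keepdims=True)
    med2 = np.median(plat2, axis=1, keepdims=True)
    return plat1, plat2 - med2 + med1

rng = np.random.default_rng(3)
p1 = rng.normal(size=(10, 5))
p2 = rng.normal(loc=2.0, size=(10, 7))  # platform 2 has a shifted baseline
out1, out2 = gq_median_match(p1, p2)
```

After the shift, each gene's median on platform 2 equals its median on platform 1, removing additive per-gene platform offsets.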
The EB method is capable of taking treatment group or other covariate information into account. Because we are interested in applications in which such information may be unavailable, we did not utilize those capabilities in our analyses.
We wish to thank Barbara Bailey and Gary Hardiman for their advice and suggestions with regard to statistical methods and microarray related matters, respectively. We also thank Allison Shultz for her feedback regarding the manuscript and research methods and Suhail Anwar Khan and Tobias Wohlfrom for their feedback regarding research methods, software, and design. FV was funded in part by the National Institutes of Health (NIH) Grant U54-HL108460.
- Schena M, Shalon D, Davis RW, Brown P: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995, 270:467–479. 10.1126/science.270.5235.467
- Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial analysis of gene expression. Science 1995, 270(5235):484–487. 10.1126/science.270.5235.484
- Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 2009, 10:57–63. 10.1038/nrg2484
- Hu Z, Fan C, Oh D, Marron J, He X, Qaqish B, Livasy C, Carey L, Reynolds E, Dressler L: The molecular portraits of breast tumors are conserved across microarray platforms. BMC Genomics 2006, 7:96. 10.1186/1471-2164-7-96
- Jiang H, Deng Y, Chen H, Tao L, Sha Q: Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes. BMC Bioinformatics 2004, 5:81. 10.1186/1471-2105-5-81
- Warnat P, Eils R, Brors B: Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics 2005, 6:265. 10.1186/1471-2105-6-265
- Elvidge G: Microarray expression technology: from start to finish. Pharmacogenomics 2006, 7:123–134.
- Hardiman G: Microarray platforms--comparisons and contrasts. Pharmacogenomics 2004, 5(5):487–502.
- Shi L, Perkins RG, Tong W: Microarrays: Preparation, Microfluidics, Detection Methods, and Biological Applications. Volume 1. Springer Science; 2009.
- Wick I, Hardiman G: Biochip platforms as functional genomics tools for drug discovery. Current Opinion in Drug Discovery & Development 2005, 8(3):347–54.
- Parkinson H, Kapushesky M, Shojatalab M, Abeygunawardena N, Coulson R, Farne A, Holloway E, Kolesnykov N, Lilja P, Lukk M, Mani R, Rayner T, Sharma A, William E, Sarkans U, Brazma A: ArrayExpress--a public database of microarray experiments and gene expression profiles. Nucleic Acids Research 2007, 35:D747-D750. 10.1093/nar/gkl995
- Barrett T, Edgar R: Gene Expression Omnibus (GEO): Microarray data storage, submission, retrieval, and analysis. Methods Enzymol 2006, 411:352–369.
- Barnes M, Freudenberg J, Thompson S, Aronow B, Pavlidis P: Experimental comparison and cross-validation of the Affymetrix and Illumina gene expression analysis platforms. Nucleic Acids Research 2005, 33(18):5914. 10.1093/nar/gki890
- Kuo W, Liu F, Trimarchi J, Punzo C, Lombardi M, Sarang J, Whipple M, Maysuria M, Serikawa K, Lee S: A sequence-oriented comparison of gene expression measurements across different hybridization-based technologies. Nature Biotechnology 2006, 24(7):832–840. 10.1038/nbt1217
- Larkin J, Frank B, Gavras H, Sultana R, Quackenbush J: Independence and reproducibility across microarray platforms. Nature Methods 2005, 2:337. 10.1038/nmeth757
- Petersen D, Chandramouli G, Geoghegan J, Hilburn J, Paarlberg J, Kim C, Munroe D, Gangi L, Han J, Puri R: Three microarray platforms: an analysis of their concordance in profiling gene expression. BMC Genomics 2005, 6:63. 10.1186/1471-2164-6-63
- Shi L, Reid L, Jones W, Shippy R, Warrington J, Baker S, Collins P, de Longueville F, Kawasaki E, Lee K, Luo Y, Sun YA, Willey JC, Setterquist RA, Fischer GM, Tong W, Dragan YP, Dix DJ, Frueh FW, Goodsaid FM, Herman D, Jensen RV, Johnson CD, Lobenhofer EK, Puri RK, Scherf U, Thierry-Mieg J, Wang C, Wilson M, Wolber PK, Zhang L, Amur S, Bao W, Barbacioru CC, Lucas AB, Bertholet V, Boysen C, Bromly B, Brown D, Brunner A, Canales R, Cao XM, Cebula TA, Chen JJ, Cheng J, Chu T, Chudin E, Corson J, Corton JC, Croner LJ, Davies C, Davison TS, Delenstarr G, Deng X, Dorris D, Eklund AC, Fan X, Fang H, Fulmer-Smentek S, Fuscoe JC, Gallagher K, Ge W, Guo L, Guo X, Hager J, Haje PK, Han J, Han T, Harbottle HC, Harris SC, Hatchwell E, Hauser CA, Hester S, Hong H, Hurban P, Jackson SA, Ji H, Knight CR, Kuo WP, LeClerc JE, Levy S, Li Q, Liu C, Liu Y, Lombardi MJ, Ma Y, Magnuson SR, Maqsodi B, McDaniel T, Mei N, Myklebost O, Ning B, Novoradovskaya N, Orr MS, Osborn TW, Papallo A, Patterson TA, Perkins RG, Peters EH, Peterson R, Philips KL, Pine PS, Pusztai L, Qian F, Ren H, Rosen M, Rosenzweig BA, Samaha RR, Schena M, Schroth GP, Shchegrova S, Smith DD, Staedtler F, Su Z, Sun H, Tezak ZS, Thierry-Mieg D, Thompson KL, Tikhonova I, Turpaz Y, Vallanat B, Van C, Walker SJ, Wang SJ, Wolfinger YW, Wong A, Wu J, Xiao C, Xie Q, Yang W, Zhang L, Zhong S, Zong Y, Slikker W: The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnology 2006, 24(9):1151–1161. 10.1038/nbt1239
- Woo Y, Affourtit J, Daigle S, Viale A, Johnson K, Naggert J, Churchill G: A comparison of cDNA, oligonucleotide, and Affymetrix GeneChip gene expression microarray platforms. Journal of Biomolecular Techniques 2004, 15(4):276.
- Kothapalli R, Yoder S, Mane S, Loughran T: Microarray results: how accurate are they? BMC Bioinformatics 2002, 3:22. 10.1186/1471-2105-3-22
- Kuo W, Jenssen T, Butte A, Ohno-Machado L, Kohane IS: Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics 2002, 18(3):405–412. 10.1093/bioinformatics/18.3.405
- Tan P, Downey T, Jr ES, Xu P, Fu D, Dimitrov D, Lempicki R, Raaka B, Cam M: Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research 2003, 31(19):5676. 10.1093/nar/gkg763
- Carter S, Eklund A, Mecham B, Kohane I, Szallasi Z: Redefinition of Affymetrix probe sets by sequence overlap with cDNA microarray probes reduces cross-platform inconsistencies in cancer-associated gene expression measurements. BMC Bioinformatics 2005, 6:107. 10.1186/1471-2105-6-107
- Yauk C, Berndt M: Review of the literature examining the correlation among DNA microarray technologies. Environmental and Molecular Mutagenesis 2007, 48(5):380–394. 10.1002/em.20290
- Borozan I, Chen L, Paeper B, Heathcote J, Edwards A, Katze M, Zhang Z, McGilvray I: MAID: an effect size based model for microarray data integration across laboratories and platforms. BMC Bioinformatics 2008, 9:305. 10.1186/1471-2105-9-305
- Hong F, Breitling R: A comparison of meta-analysis methods for detecting differentially expressed genes in microarray experiments. Bioinformatics 2008, 24(3):374. 10.1093/bioinformatics/btm620
- Kugler K, Mueller L, Graber A: MADAM: an open source meta-analysis toolbox for R and Bioconductor. Source Code for Biology and Medicine 2010, 5:3.
- Ramasamy A, Mondry A, Holmes C, Altman D: Key issues in conducting a meta-analysis of gene expression microarray datasets. PLoS Medicine 2008, 5(9):e184. 10.1371/journal.pmed.0050184
- Assou S, Carrour TL, Tondeur S, Ström S, Gabelle A, Marty S, Nadal L, Pantesco V, Rème T, Hugnot J: A meta-analysis of human embryonic stem cells transcriptome integrated into a web-based expression atlas. Stem Cells 2007, 25(4):961–973. 10.1634/stemcells.2006-0352
- Grützmann R, Boriss H, Ammerpohl O, Lüttges J, Kalthoff H, Schackert H, Klöppel G, Saeger H, Pilarsky C: Meta-analysis of microarray data on pancreatic cancer defines a set of commonly dysregulated genes. Oncogene 2005, 24(32):5079–5088. 10.1038/sj.onc.1208696
- Mulligan M, Ponomarev I, Hitzemann R, Belknap J, Tabakoff B, Harris R, Crabbe J, Blednov Y, Grahame N, Phillips T, Finn DA, Hoffman PL, Iyer VR, Koob GF, Bergeson SE: Toward understanding the genetics of alcohol drinking through transcriptome meta-analysis. Proceedings of the National Academy of Sciences 2006, 103(16):6368–6373. 10.1073/pnas.0510188103
- Rhodes D, Barrette T, Rubin M, Ghosh D, Chinnaiyan A: Meta-analysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Research 2002, 62(15):4427.
- Rogic S, Pavlidis P: Meta-analysis of kindling-induced gene expression changes in the rat hippocampus. Frontiers in Neuroscience 2009, 3:53.
- Wirapati P, Sotiriou C, Kunkel S, Farmer P, Pradervand S, Haibe-Kains B, Desmedt C, Ignatiadis M, Sengstag T, Schutz F: Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures. Breast Cancer Research 2008, 10(4):R65. 10.1186/bcr2124
- Shabalin A, Tjelmeland H, Fan C, Perou C, Nobel A: Merging two gene-expression studies via cross-platform normalization. Bioinformatics 2008, 24(9):1154. 10.1093/bioinformatics/btn083
- Benito M, Parker J, Du Q, Wu J, Xiang D, Perou C, Marron J: Adjustment of systematic microarray data biases. Bioinformatics 2004, 20:105. 10.1093/bioinformatics/btg385
- Walker W, Liao I, Gilbert D, Wong B, Pollard KS, McCulloch CE, Lit L, Sharp FR: Empirical Bayes accommodation of batch-effects in microarray data using identical replicate reference samples: application to RNA expression profiling of blood from Duchenne muscular dystrophy patients. BMC Genomics 2008, 9:494. 10.1186/1471-2164-9-494
- Martinez R, Pasquier C, Pasquier N: GenMiner: mining informative association rules from genomic data. Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine 2007, 1:15–22.
- Xia XQ, Mcclelland M, Porwollik S, Song W, Cong X, Wang Y: WebArrayDB: cross-platform microarray data analysis and public data repository. Bioinformatics 2009, 25(18):2425–2429. 10.1093/bioinformatics/btp430
- Bolstad B, Irizarry R, Astrand M, Speed T: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19(2):185. 10.1093/bioinformatics/19.2.185
- Lacson R, Pitzer E, Kim J, Galante P, Hinske C, Ohno-Machado L: DSGeo: software tools for cross-platform analysis of gene expression data in GEO. Journal of Biomedical Informatics 2010, 43:709–715. 10.1016/j.jbi.2010.04.007
- Glaab E, Garibaldi J, Krasnogor N: ArrayMining: a modular web-application for microarray analysis combining ensemble and consensus methods with cross-study normalization. BMC Bioinformatics 2009, 10:358. 10.1186/1471-2105-10-358
- Gentleman R, Carey V, Bates D, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biology 2004, 5(10):R80. 10.1186/gb-2004-5-10-r80
- Yasrebi H, Sperisen P, Praz V, Bucher P: Can survival prediction be improved by merging gene expression data sets? PLoS One 2009, 4(10):e7431. 10.1371/journal.pone.0007431
- Platts A, Dix D, Chemes H, Thompson K, Goodrich R, Rockett J, Rawe V, Quintana S, Diamond M, Strader L: Success and failure in human spermatogenesis as revealed by teratozoospermic RNAs. Human Molecular Genetics 2007, 16(7):763. 10.1093/hmg/ddm012
- R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2010. ISBN 3-900051-07-0. http://www.R-project.org
- Comprehensive R Archive Network [http://cran.r-project.org]
- Metz C: Basic principles of ROC analysis. Seminars in Nuclear Medicine 1978, 8(4):283–298. 10.1016/S0001-2998(78)80014-2
- Noguchi K, Hui WLW, Gel YR, Gastwirth JL, Miao W: lawstat: An R package for biostatistics, public policy, and law. 2009. http://CRAN.R-project.org/package=lawstat. [R package version 2.3]
- Hekstra D, Taussig A, Magnasco M, Naef F: Absolute mRNA concentrations from sequence-specific calibration of oligonucleotide arrays. Nucleic Acids Research 2003, 31(7):1962. 10.1093/nar/gkg283
- Held G, Grinstein G, Tu Y: Modeling of DNA microarray data by using physical properties of hybridization. Proceedings of the National Academy of Sciences 2003, 100(13):7575. 10.1073/pnas.0832500100
- Dabney A, Storey JD, with assistance from Gregory R Warnes: qvalue: Q-value estimation for false discovery rate control. [R package version 1.22.0]
- Storey J, Tibshirani R: Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America 2003, 100(16):9440. 10.1073/pnas.1530509100
- Davis S, Meltzer P: GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics 2007, 23(14):1846. 10.1093/bioinformatics/btm254
- Applied Biosystems: User Bulletin: Applied Biosystems 1700 Chemiluminescent Microarray Analyzer. 2005.
- Gautier L, Cope L, Bolstad B, Irizarry R: Affy: analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004, 20(3):307. 10.1093/bioinformatics/btg405
- Smyth G: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 2004, 3.
- Storey J, Dai J, Leek J: The optimal discovery procedure for large-scale significance testing, with applications to comparative microarray experiments. Biostatistics 2007, 8(2):414.
- Lin LIK: A concordance correlation coefficient to evaluate reproducibility. Biometrics 1989, 45:255–268. 10.2307/2532051
- Tuszynski J: caTools: Tools: moving window statistics, GIF, Base64, ROC AUC, etc. 2009. http://CRAN.R-project.org/package=caTools. [R package version 1.10]
- Chi X, Liu S: A one-step smoothing Newton method for second-order cone programming. Journal of Computational and Applied Mathematics 2009, 223:114–123. 10.1016/j.cam.2007.12.023
- Chambers JM, Cleveland WS, Kleiner B, Tukey PA: Graphical Methods for Data Analysis. Volume 3. Wadsworth & Brooks/Cole; 1983:62.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.