Evaluation of methods for oligonucleotide array data via quantitative real-time PCR
© Qin et al. 2006
Received: 20 May 2005
Accepted: 17 January 2006
Published: 17 January 2006
Skip to main content
© Qin et al. 2006
Received: 20 May 2005
Accepted: 17 January 2006
Published: 17 January 2006
There are currently many different methods for processing and summarizing probe-level data from Affymetrix oligonucleotide arrays. It is of great interest to validate these methods and identify those that are most effective. There is no single best way to do this validation, and a variety of approaches is needed. Moreover, gene expression data are collected to answer a variety of scientific questions, and the same method may not be best for all questions. Only a handful of validation studies have been done so far, most of which rely on spike-in datasets and focus on the question of detecting differential expression. Here we seek methods that excel at estimating relative expression. We evaluate methods by identifying those that give the strongest linear association between expression measurements by array and the "gold-standard" assay.
Quantitative reverse-transcription polymerase chain reaction (qRT-PCR) is generally considered the "gold-standard" assay for measuring gene expression by biologists and is often used to confirm findings from microarray data. Here we use qRT-PCR measurements to validate methods for the components of processing oligo array data: background adjustment, normalization, mismatch adjustment, and probeset summary. An advantage of our approach over spike-in studies is that methods are validated on a real dataset that was collected to address a scientific question.
We initially identify three of six popular methods that consistently produced the best agreement between oligo array and RT-PCR data for medium- and high-intensity genes. The three methods are generally known as MAS5, gcRMA, and the dChip mismatch mode. For medium- and high-intensity genes, we identified use of data from mismatch probes (as in MAS5 and dChip mismatch) and a sequence-based method of background adjustment (as in gcRMA) as the most important factors in methods' performances. However, we found poor reliability for methods using mismatch probes for low-intensity genes, which is in agreement with previous studies.
We advocate use of sequence-based background adjustment in lieu of mismatch adjustment to achieve the best results across the intensity spectrum. No method of normalization or probeset summary showed any consistent advantages.
Affymetrix GeneChip® oligonucleotide arrays are a popular platform for the high-throughput analysis of gene expression in mRNA. Nguyen et al  give an introduction to the technology for quantitative scientists. Briefly, an oligonucleotide array contains 11–20 probe pairs for each gene. Probe pairs consist of an oligonucleotide that is a "perfect match" (PM) to a subsequence of the mRNA transcript for a gene and a corresponding "mismatch" (MM) oligo that differs from it in one base in the middle. These MM probes are meant to provide information on cross-hybridization.
Quantitative researchers have proposed a variety of methods for handling probe-level data from Affymetrix® oligonucleotide arrays. Methods employ different procedures for adjusting for background fluorescence, normalizing the data, incorporating the information from "mismatch" probes, and summarizing probesets (combining all the data from the different probes for a given gene). In particular, the value and proper use of data from MM probes have been subjects of some controversy . It is important to validate a method for its effectiveness in achieving scientific goals, such as estimating relative gene expression or detecting differentially expressed genes . Note that different methods may be preferable for different scientific goals .
Previously, spike-in studies have been used to study the variance and bias of different estimates of relative expression derived from oligo array data. These studies are useful and important, but are not the end of the story. First, spike-in datasets are inherently artificial, and may not realistically represent the operating characteristics of a methodology on real data . For example, the Affymetrix Latin Square Dataset studied by Bolstad et al  has only 42 genes changing from sample to sample. In addition, this dataset was used to develop several methods, so it is not appropriate to use for validation. Finally, a criterion often not considered in the spike-in studies is the accuracy of measurements across genes. Instead, Bolstad et al  largely considered measurements across RNA samples for single genes. Obviously, these problems are related, yet they are not identical.
Choe et al  conducted a study using an experiment where 100–200 RNAs were spiked-in at various fold-changes. All RNAs other than the spike-ins had the same level in all samples. Impressively, the authors considered over 100 different combinations of methods for background adjustment, normalization, use of MM probes, and probeset summary methods. Many of the study's conclusions are based on the shared features of the ten best-performing combinations. However, eight of those ten combinations used a normalization based on the known subset of genes that were constant between the RNAs that were compared. Such a normalization scheme could not be implemented in an actual experiment where the identity of unchanging genes is unknown. This casts some doubt on the generalizability of the study's findings. Further concerns about generalizability arise from the study's non-standard RNA production protocol. In addition, one of the study's RNA samples contained unlabeled poly(C) RNA, to unknown effect.
Among evaluations that do not rely on spike-in datasets, Ploner et al  favored methods that produced zero correlation, on average, between randomly selected pairs of genes. Though creative, this criterion unfortunately does not correspond to a scientific question of interest. Furthermore, the criterion might favor methods that "over-normalized" the data – removed signal as well as systematic biases. Shedden et al  identified methods that optimized sensitivity for detecting differentially expressed genes. The authors relied on estimates of false discovery rates rather than using data from an independent validation technique for comparison.
In contrast to the studies that use spike-in datasets, our study is based on a real dataset that was collected to answer biological questions. The studies by Choe et al  and Shedden et al  are directed at identifying the best methods for selecting differentially expressed genes, which are not necessarily the best methods for estimating relative expression. In contrast, we focus here on the problem of estimating relative expression. We do not mean to suggest that previous approaches lack merit. Rather, different approaches have advantages and disadvantages, and a plurality of studies is needed.
Biological Samples. RNA samples were from an unbalanced 2 × 2 factorial design. The 24 mice were young or old, wild-type or carried the MCAT transgene, which directs overexpression of human catalase to the mitochondrial cellular compartment. Transgene overexpression extends lifespan, and thus gene expression differences between MCAT and age-matched wild-type mice would be expected.
N = 6
N = 8
N = 5
N = 5
Methods under Evaluation. Summary of the six methodologies for oligonucleotide array data that were compared in this study. Details on the methodologies can be found in the references. An asterisk (*) marks components of methods that were studied in the follow-up analysis (see Table 5).
scaling by a constant*
subtract idealized mismatch*
Tukey biweight average*
by GC content of probe*
medianpolish* (robust fit of linear model)
whole array adjustment
medianpolish* (robust fit of linear model)
variance stabilizing transformation
medianpolish* (robust fit of linear model)
Our choice to use Pearson's correlation, r, is motivated by the following formula. While not the standard textbook definition of r, a more instructive approximate formula is
where β is the slope of the line for predicting Y from X, Var(X) is the variance of X, and Var(Y|X = x) is the variance of Y in groups that have the same value of X. In our application, X is the measurements from qRT-PCR and Y is the measurements from array. X is fixed, and so also Var(X), but Y depends on what method is applied to the array data. Since Var(Y|X = x) appears in the denominator, a method's performance is improved if it minimizes Var(Y|X = x). Therefore, this metric tends to favor methods with smaller variability. Similarly, the larger the slope between Y and X, the larger r is and the more favorable a method's performance. In this sense, by using Pearson's correlation, we simultaneously take into account both the variance and bias of the measurements produced by arrays. That is, we seek methods that achieve the right balance between variance and bias to yield the strongest association between array measurements and qRT-PCR. However, it is also of interest to specifically examine variance and bias, and we will come back to this.
Results of the leave-one-out sensitivity analysis. For each contrast, an individual gene is listed if its removal produced a change in the ranking of the six methodologies. The third column shows how the ranking of the six methodologies changed upon removal of the gene. Here, M = Mas5, G = gcRMA, R = RMA, V = VSN, D = dChip, D- = dChip.mm. Bold font highlights changes. Note that all changes in rankings, with two exceptions, were transpositions of two adjacent methods or a shuffle of three adjacent methods.
Results of the leave-two-out sensitivity analysis. The table shows that removing gene pairs affected only minor changes in our findings.
# gene-pairs considered (non-influential singleton genes)
# influential gene-pairs
# of these pairs that produce a single transposition of neighbors
# of these pairs that produce a shuffle of the top-three methods
As an exploratory aspect of our study, we sought to identify factors that influence the level of agreement between array and qRT-PCR measurements. We considered whether gene sequence GC content or Affy probe GC content was associated with agreement in the measurements produced by the two platforms. Neither variable showed a consistent association. See ProbeLevelAnalysis.doc for more information on this exploratory analysis.
We also failed to corroborate the finding of Etienne et al  that large distance between qRT-PCR probe and the Affy probe set leads to poor agreement between platforms. We believe the likely reason for this discrepancy is the difference in RT-PCR methods. Etienne et al  used standard 2 primer PCR followed by radioactive Southern blot hybridization. The use of a real time PCR machine in our study allowed greater assurances that all amplifications measured were consistent, specific, and within the appropriate linear range. Our use of the Taqman system with a fluorogenic minor groove binding probe also increased specificity and stabilized binding sites. These factors combined could reduce any sequence-specific error in qRT-PCR measurement.
Components of Methods Examined in Follow-Up Analysis. The table gives abbreviations for the methods in Table 2 that are studied in the follow-up analysis. These abbreviations are used in RESULTS and Figures 6–9.
Figures 7, 8, 9 also indicate that differences in normalization had a generally minor effect on results – performance changed little when normalization method varied while all other components were held constant. We do not identify any compelling evidence in favor of any particular method for normalization.
Just as we pointed to limitations of previous studies, it is important to point out limitations of this study. One limitation is that the genes assayed with qRT-PCR were not a random sample or even a representative sample of genes on the array (see METHODS). Genes were initially selected primarily for their biological interest, but then some of these candidates were excluded. Two elements of the selection process are notable. First, genes were selected if they appeared to be promising candidates for differential expression based on processing the data with gcRMA. This introduces a possible bias for gcRMA into our results. Second, genes with average signal intensity less than 2 were excluded. These factors resulted in the selection of primarily high- and medium-intensity genes. Previous work  has suggested that difficulties with methods that subtract mismatch data arise for low-intensity genes due to extreme variability. It is likely that the omission of low-intensity genes in our study favored MAS5 and the dChip mismatch model. The remaining criteria used to select genes were the availability of RT-PCR assays and whether existing knowledge of a gene made it an interesting candidate in the study of aging. We are unaware of any biases produced by these latter selection criteria.
We argue that correlation is a reasonable measure of agreement in this study because it accounts for both the bias and variance of measurements, favoring methods that find the right balance between the two. However, Figure 4 shows that "the right balance" really depends on signal intensity. For example, for highly expressed genes, the variability across methods is roughly comparable and so our metric favors methods with the least bias. Figure 3 and Figure 5 show that this is exactly what happens. For genes at the lowest level of intensity, methods that use mismatch probes have been found to be extremely variable , which is consistent with our data (Figure 4). For such genes our metric favors methods with lower variability even if the bias is large. Unfortunately, the qRT-PCR analysis did not include low intensity genes. While we had qRT-PCR data for some high-intensity genes, they tended to have smaller fold-changes across group and thus exerted less influence on the correlations (see Figure 2 and similar figures in AdditionalFigures.doc). Our correlation results really pertain to medium-intensity genes, where bias and variation both come into play as sources of error.
Our results, narrowly interpreted, favor MAS5, gcRMA, and the dChip mismatch model. However, our assessment of variability, together with previous studies that demonstrate the unreliability of using MM data for low-intensity genes , leads to a more precise conclusion. Specifically, the sequence-based background adjustment of gcRMA emerges as a method that may be most effective across the intensity spectrum.
We have treated the qRT-PCR measurements as "truth" because they are the gold standard laboratory measurement of gene expression. Yet qRT-PCR measurements are also subject to error. However, our sensitivity analysis should partly address this concern.
We reiterate that we have compared the performance of array methodologies for estimating relative gene expression levels for a chosen list of genes. We have not compared methods on their abilities to identify differentially expressed genes, which is an important goal that is related but not identical. Still, it is useful to compare our findings with other validation studies, including those that used other criteria to evaluate methods. Of the three recent studies [5, 7, 8], our results are somewhat consistent with Choe et al  and Ploner et al  and least consistent with Shedden et al. . Choe et al  concluded that (1) regional background adjustment is better than foregoing background adjustment, (2) using the MAS5 method for use of MM probe data is better than simple MM subtraction or discarding MM data, and (3) the probeset summary method used by gcRMA and RMA performs slightly better than the methods used by MAS5 or the dChip model. Our results suggest a more complicated scenario – that each of these sub-methods performs well if combined with particular other sub-methods. On the other hand, we clearly corroborate the finding of Choe et al  that no method of normalization appears to be advantageous, and that gcRMA and MAS5 perform well. Ploner et al  concluded that MAS5 gave better results than RMA or the dChip mismatch model. We also found that MAS5 outperformed RMA but it was not clearly better than the dChip mismatch model. The results of Shedden et al  favored the dChip method using MM subtraction over MAS5 and gcRMA, while these three performed comparably for the medium- and high-intensity genes in the main part of our study.
Using qRT-PCR data as an independent measurement tool, we compared the performance of six methodologies for the quantification of gene expression from Affymetrix oligonucleotide arrays. Three methods – MAS5, gcRMA, and the dChip mismatch model – performed better than VSN, dChip without mismatch, and RMA. The factor driving these results was whether a method used mismatch data or, alternatively, a sequence-based background adjustment. Other differences among methods, such as the normalization scheme, made little difference in overall performance. Further analysis of variability lead us to favor the sequence-based background adjustment over procedures using mismatch probes. In summary, for estimating relative expression using oligonucleotide array data, we advocate (1) foregoing methods that use mismatch subtraction and (2) using the sequence-based background adjustment method in gcRMA.
Total RNA was extracted from flash-frozen heart tissue using Trizol (Invitrogen, Carlsbad, CA) extraction followed by cleanup with the RNeasy kit (Qiagen, Valencia, CA). Samples were prepared for Affymetrix arrays using 7 μg total RNA and following the manufacturer's instructions for One Cycle Eukaryotic Target Preparation (Affymetrix, 701025 Rev. 5) including first and second cDNA generations from oligo-dT and linear in vitro transcription using biotinylated ribonucleotides (Enzo, Farmingdale, NY). Samples were hybridized to mgu74av2 arrays at the University of Washington Center for Expression Arrays according to recommended procedures (Affymetrix, 701028 Rev. 3).
Samples were prepared for qRT-PCR using the cDNA archive protocol (Applied Biosystems, Foster City, CA), which uses random hexamers for first strand cDNA synthesis. cDNA samples representing 50 ng total RNA and 2× Universal PCR master mix (Applied Biosystems) were loaded into each port of the 384 well Applied Biosystems Low Density Arrays, which were custom designed for this experiment, and run in an Applied Biosystems 7900 real time PCR instrument according to manufacturer's instructions. Technical and biological replicates were balanced across ports and cards for each of the groups to minimize any effect of loading port position or variability between cards. Each cDNA sample was run in duplicate on 2 different cards.
qRT-PCR data were quantified using SDS 2.1 (Applied Biosystems). For the Low Density Arrays, baseline and threshold were identified automatically by the software and adjusted manually where necessary and the settings were applied to all arrays. Gene expression values were normalized to the 18s endogenous control and corrected for measured efficiency as calculated by a standard curve run on one of the Low Density Arrays .
Genes were selected for qRT-PCR based on an analysis of the entire Affymetrix data set (all 24 individual samples) processed using gcRMA. Chosen genes either exhibited a large average fold-change or highly statistically significant evidence of differential expression as determined by the LIMMA package of Bioconductor . Therefore, selected genes tended to represent genes with large changes for at least one contrast and genes with smaller changes and lower variability.
In detail, we generated gene lists that contained the top 200 candidates for each of the 6 contrasts we study. These gene lists were filtered based on magnitude of signal (average log2(signal) >2) and magnitude of change (log2(ratio) >0.5) for each contrast. Some genes were chosen based an analysis of additional array data from pools of the RNA samples, choosing the 200 genes with the largest changes for the contrasts listed, and then filtering based on magnitude of signal and magnitude of change as above. 110 genes were selected by these methods. This number was then reduced based on availability of assays on the ABI low density arrays and availability of annotation to yield the 47 genes studied here. The 18s endogenous control is included on the Applied Biosystems Low Density Arrays by default and was used for normalization of the qRT-PCR data as described above.
All array data were processed in the statistical language 'R'  using the "affy" package in Bioconductor . The Bioconductor document available at  provides a useful overview of different methods. In our initial "overview" investigation, six different methodologies were applied to array data. Table 2 briefly summarizes the six methodologies in terms of the four stages of data processing (adjustment for background, normalization, use of data from MM probes, and summarizing of data across a probeset). References are provided for background on the methods and they are not described here.
In the follow-up analysis there are three options in each stage of data processing (Table 5), so nominally there would be 81 combinations. However, certain combinations resulted in zero or negative values and could not be further evaluated. See the FollowUpAnalysis.xls. We evaluated 56 combinations in total.
Our goal in this study is to validate methods for oligonucleotide array data for estimating the relative expression of genes in different RNA samples. However, it would be unwieldy to consider all 276 pairwise comparisons of the 24 RNA samples.
Instead, we consider summary contrasts. Each contrast uses averages among biological replicates.
We give results for six summary contrasts. Three contrasts use the data on all 24 animals (Figure 1): Age (the contrast between the 10 old mice and the 14 young mice), Genotype (the contrast between the 13 MCAT mice and the 11 wild-type mice), and the interaction between age and genotype. We also give results for three simple pairwise group contrasts: YWT vs. OWT (age differences among wild-type mice), YWT vs. YMCAT (genotype differences among young mice), and OWT vs. OMCAT (genotype differences among old mice). We worked with the processed data on the log scale and computed contrasts as differences in per-group averages. The contrasts are interpretable as log-fold changes.
For each contrast and each method for processing array data, we have a scatterplot. Each of the 47 data points in a scatterplot represents the log-fold-change for one gene for the given contrast as measured by array (vertical axis) and by qRT-PCR (horizontal axis). For a given contrast, the data plotted on the horizontal axis are the same. See Figure 1 for the YWT vs. OWT contrast and the six methods studied in our initial analysis.
We considered several metrics for quantifying agreement within a scatterplot. While it would be ideal if array measurements and qRT-PCR measurements agreed exactly, it is satisfactory for there to be a linear relationship. Therefore, we considered the most important characteristic of these plots to be the overall linear trend. This led us to use Pearson Correlation as a measure of agreement. AdditionalFigures.doc gives results for four other measures of agreement: mean squared error, median absolute error, Canberra distance, and Spearman correlation. However, findings using these four measures were inconclusive.
Our measure of agreement, Pearson Correlation, is not robust as it can be disproportionately influenced by individual data points. For example, a gene with a large change for a certain contrast could heavily drive the results for that contrast, but this would be misleading. Therefore, we performed a sensitivity analysis to ensure that our findings were not driven by one or two genes. We systematically removed the data for (1) all single genes and (2) all pairs of genes from the processed datasets, and then re-computed the correlation of the scatterplot with the removed gene(s). We called a gene (or gene-pair) "influential" if the ranking of the six methods changed upon removal of the gene (or gene-pair). For the leave-two-out sensitivity analysis, we only considered gene pairs that were not influential singleton genes.
This sensitivity analysis also takes the place of computing confidence intervals for the correlations we compute. Because the set of genes included in this study was not a random sample, such confidence intervals would not be valid.
This research was supported by grants from the National Institute of Aging, Grant # NIA P30 AG13280 to the UW Nathan Shock Center for the Basic Biology of Aging (RPB, FNH, NJL) and Grant # NIA F32AG21827-01 (NJL); and Public Health Services Grants from the National Institute for Environmental Health Sciences, Grant # NIEHS U19ES011387 to the FHCRC/UW Toxicogenomics Research Consortium (LXQ, RPB, DEM, KFK) and Grant # NIEHS P30ES07033 to the UW Center for Ecogenetics and Environmental Health (RPB, KFK). Additional support was provided by UW Department of Biostatistics Career Development Funds (KFK) and a fellowship from the Merck Research Laboratories (LXQ). We thank Larry Ruzzo, Roger Bumgarner, Rafael Irizarry, and two anonymous reviewers for helpful suggestions to improve the paper.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.