Probe set algorithms: is there a rational best bet?
© Seo and Hoffman. 2006
Received: 13 April 2006
Accepted: 30 August 2006
Published: 30 August 2006
Skip to main content
© Seo and Hoffman. 2006
Received: 13 April 2006
Accepted: 30 August 2006
Published: 30 August 2006
Affymetrix microarrays have become a standard experimental platform for studies of mRNA expression profiling. Their success is due, in part, to the multiple oligonucleotide features (probes) against each transcript (probe set). This multiple testing allows for more robust background assessments and gene expression measures, and has permitted the development of many computational methods to translate image data into a single normalized "signal" for mRNA transcript abundance. There are now many probe set algorithms that have been developed, with a gradual movement away from chip-by-chip methods (MAS5), to project-based model-fitting methods (dCHIP, RMA, others). Data interpretation is often profoundly changed by choice of algorithm, with disoriented biologists questioning what the "accurate" interpretation of their experiment is. Here, we summarize the debate concerning probe set algorithms. We provide examples of how changes in mismatch weight, normalizations, and construction of expression ratios each dramatically change data interpretation. All interpretations can be considered as computationally appropriate, but with varying biological credibility. We also illustrate the performance of two new hybrid algorithms (PLIER, GC-RMA) relative to more traditional algorithms (dCHIP, MAS5, Probe Profiler PCA, RMA) using an interactive power analysis tool. PLIER appears superior to other algorithms in avoiding false positives with poorly performing probe sets. Based on our interpretation of the literature, and examples presented here, we suggest that the variability in performance of probe set algorithms is more dependent upon assumptions regarding "background", than on calculations of "signal". We argue that "background" is an enormously complex variable that can only be vaguely quantified, and thus the "best" probe set algorithm will vary from project to project.
Current Affymetrix microarrays  use chemical synthesis of 25 mer oligonucleotides in 8 μm2 features on glass slides, resulting in a density of approximately 1 million oligonucleotides per 1.3 cm2 microarray. The impressive density of oligonucleotides provides adequate space on the chip ("real estate") for use of multiple probes per mRNA transcript. Thus, each transcript is queried by a probe set (typically 22 oligonucleotides on current arrays). The assessment of the amount of a specific mRNA in a solution from a tissue or cell is determined by the amount of hybridization to the different oligo features for that mRNA. While "hybridization" is a highly sensitive and specific assay, it is not perfectly sensitive or perfectly specific. Thus, it is critical to provide measures of non-specific hybridization and probe binding (background), and one method is to print mismatch probes, where a single base mutation in the center of each 25 mer probe should disrupt most "specific" hybridization. Current expression profiling microarrays use 11 perfect match and mismatch probe pairs for each probe set, although there may be many probe sets per transcript unit.
When the same data set is then normalized to time 0, the Y axis becomes a measure of "fold-change" between baseline (time 0), and peak expression (day 3) (Figure 2). As the graph now becomes a ratio of day 3 (numerator) and day 0 (denominator), the very different values for day 0 expression lead to very different ratios for the different probe set algorithms (Figure 2). In the example shown, the PCA probe set algorithm calculates a 90-fold increase in expression of myogenin (PCA), while the perfect match only probe set algorithms show only a 2-fold increase (RMA, dCHIP PMonly) (Figure 2). The volatility of the fold-change is a consequence of different calculations of the denominator for a ratio, and this is due to different assumptions concerning what is "background", and what is true "signal". If one takes this single example and extrapolates to all 40,000 probe sets on a microarray, one quickly understands why different probe set algorithms give very different interpretations of the same data.
There has been growing interest on the effect of probe set algorithms on data interpretation, with development of new probe set algorithms, and also comparisons between probe set algorithms (see  and  for reviews). A number of studies have found the lower variance of the perfect-match only algorithms, and other methods of reducing variance, to lead to more precision, with the conclusion that these provide the best performance [2, 8–12]. However, one recent paper studying the issues of effects of normalizations methods on variance and data interpretation noted, "Any normalization reduces variance and increases sensitivity with no clear winner" . Other authors have similarly cautioned against equating increased precision with improved performance, as precision may come at the cost of accuracy. The dramatic differences in fold-change interpretation of myogenin by different probe set algorithms shown in Figure 2 illustrate this point. The signal of the mismatch probes for myogenin is relatively high at time 0, but this mismatch signal does not rise proportionately with the increased myogenin perfect match signal seen at day 3. Thus, the greater precision of RMA and dCHIP PMonly comes at a cost of underestimating the fold-change at day 3. In other words, the balance of precision and accuracy would fall in favor of the greater accuracy of PCA, dCHIP diff, and MAS5.0 in this specific instance. Thus, appropriate assessments of "performance" may depend on a balance of precision and accuracy, and that this is likely project and probe set dependent [3, 14, 15].
An excellent R package to compare the performance of different probe set algorithms (Affycomp II)  has been used to compare over 50 probe set algorithms in a paper describing the effects on accuracy, precision, and sensitivity into low signals . Consistent with expectations, the authors found that background correction (e.g. using mismatch signal) improved accuracy but worsened precision. They proposed that GC-RMA, PLIER, ZL, and RSVD algorithms provided the best balance of accuracy and precision . This interpretation was based upon a spike-in data set, where a limited number of synthesized mRNAs were studied at varying concentrations.
We recently took a different approach, where three different large data sets with different biological variance intrinsic to each organism and experimental system (grass, rat, and human biopsy samples) were studied. We built an interactive analysis tool to query the statistical power of each probe set on the arrays in the project, with dynamic user-definition of sample size, effect size (fold-change), and statistical significance . We found that the same algorithms that were found to provide excellent accuracy using spike in data , tended to reduce variance to inappropriate levels, leading to high false positives . A recent paper found similar results where the majority of RMA-specific probe sets were unable to be validated by RT-PCR . A generalization from all of these studies is that probe set algorithms that utilize only the perfect match probes give much better sensitivity with less variance (enhanced accuracy), particularly at low signal levels. However, these same low signals are particularly susceptible to strong effects from cross-hybridization from related mRNA sequences (non-specific hybridization), and other sources of undesired background signal. It should be kept in mind that in "real experiment" (e.g. not spike in data), the levels of both the desired (specific) transcript are changing, as well as the levels of the cross-hybridizing (non-specific) transcripts. This serves to make the system highly (if not impossibly) complex, and the ideal balance of precision and accuracy could be project and organism specific. Thus, all probe set algorithms should be viewed as a necessary but imperfect compromise in data interpretation.
One could envision that the ideal probe set algorithm would optimize background correction and normalization for each project and each probe set in that project. While such approaches are doomed to be "data starved", recently, new "hybrid" probe set algorithms have emerged that attempt to better combine the precision of model based approaches, with more judicious use of background correction (mismatch penalty). These include the Affymetrix PLIER algorithm  and GC-RMA . The new hybrid algorithms have the potential to achieve a more ideal balance of accuracy and precision. To test this, we took a novel approach; namely statistical power analysis of "absent call only" probe sets in a microarray project. To explain this approach, the original MAS5.0 probe set algorithm looks at the signals from the 11 probe pairs in a probe set, and determines if the signals from perfect match probes exceed the signals from the paired mismatch probes. The statistical confidence that the ensembled perfect match signals from a probe set are significantly above background is derived as a "detection p value". If, on average, the perfect match probes show greater signal intensity than the corresponding mismatch, then the detection p value improves, to a threshold to where a "present call" is assigned (e.g. the target transcript is likely "present" in the sample tested). Whereas poor signal from the perfect match, and increasing signal from the mismatch, causes the detection p value to worsen, leading to an "absent call" determination. The "present call" is thus a statistically derived threshold reflecting a relative confidence that the desired mRNA is indeed present in the RNA sample being tested at a level significantly above background hybridization. An "absent call" suggests that the target mRNA is not present in the sample, or that there is non-target mRNAs binding to the probe set (non-desired cross-hybridization), or both. Thus, limiting data analyses to "present calls" can restrict analyses to the better performing probe sets and target mRNAs . dCHIP algorithms also have an option to utilize the MAS5.0 "present calls" as a form of noise filter, and we have recently shown that use of the detection p value as a weighting function improves the performance of all probe set algorithms .
We reasoned that use of only those probe sets that showed 100% absent calls by MAS5.0 algorithms would enrich for poor quality signals with considerable noise components, due to poorly performing probe sets, low level or absent target mRNAs, or cross-hybridization. By limiting our statistical power analysis to these 100% absent call probe sets, we hypothesized that this would lead to a sensitive test for performance of probe set algorithms with regards to balance of precision and accuracy, and further, an assessment of the potential for false positives.
Another method of illustrating this point is to analyze sample mixing data sets. We used a publicly available dataset of two different human tissues (PBMC and placenta) mixed in defined ratios (100:0, 95:5, 75:25, 50:50, 25:75, 0:100) . To enrich for signals near detection threshold, we selected all probe sets that were "present calls" in all 100% PBMC profiles, but absent calls in 75%, 50%, 25%, and 0% PBMC profiles. We then took all probe sets appropriately powered at n = 2 for PLIER and GC-RMA, and studies the ability of the two probe set algorithms to track dilutions of these low level PBMC-specific probe sets. Correlation analysis between PBMC RNA concentration and expression levels of the selected probe sets showed a statistically significant difference (F-Test p value = 1.43449E-05). As expected, GC-RMA (average R2 = 0.244803) was considerably less accurate than PLIER (R2 = 0.324908).
Efforts to use a single "best" probe set algorithm for all probe sets and projects may be an unattainable goal. Instead, selection of analysis methods for any particular project should be viewed as striving towards an appropriate balance of precision and accuracy. Emerging hybrid probe set algorithms, such as PLIER, may prove particularly powerful in reducing false positives. New tools such as Affycomp and HCE-power can be used to assess this balance.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.