One could envision that the ideal probe set algorithm would optimize background correction and normalization for each project and each probe set in that project. While such approaches are doomed to be "data starved", recently, new "hybrid" probe set algorithms have emerged that attempt to better combine the precision of model based approaches, with more judicious use of background correction (mismatch penalty). These include the Affymetrix PLIER algorithm [17] and GC-RMA [9]. The new hybrid algorithms have the potential to achieve a more ideal balance of accuracy and precision. To test this, we took a novel approach; namely statistical power analysis of "absent call only" probe sets in a microarray project. To explain this approach, the original MAS5.0 probe set algorithm looks at the signals from the 11 probe pairs in a probe set, and determines if the signals from perfect match probes exceed the signals from the paired mismatch probes. The statistical confidence that the ensembled perfect match signals from a probe set are significantly above background is derived as a "detection p value". If, on average, the perfect match probes show greater signal intensity than the corresponding mismatch, then the detection p value improves, to a threshold to where a "present call" is assigned (e.g. the target transcript is likely "present" in the sample tested). Whereas poor signal from the perfect match, and increasing signal from the mismatch, causes the detection p value to worsen, leading to an "absent call" determination. The "present call" is thus a statistically derived threshold reflecting a relative confidence that the desired mRNA is indeed present in the RNA sample being tested at a level significantly above background hybridization. An "absent call" suggests that the target mRNA is not present in the sample, or that there is non-target mRNAs binding to the probe set (non-desired cross-hybridization), or both. Thus, limiting data analyses to "present calls" can restrict analyses to the better performing probe sets and target mRNAs [18]. dCHIP algorithms also have an option to utilize the MAS5.0 "present calls" as a form of noise filter, and we have recently shown that use of the detection p value as a weighting function improves the performance of all probe set algorithms [19].
We reasoned that use of only those probe sets that showed 100% absent calls by MAS5.0 algorithms would enrich for poor quality signals with considerable noise components, due to poorly performing probe sets, low level or absent target mRNAs, or cross-hybridization. By limiting our statistical power analysis to these 100% absent call probe sets, we hypothesized that this would lead to a sensitive test for performance of probe set algorithms with regards to balance of precision and accuracy, and further, an assessment of the potential for false positives.
We used our public domain power analysis tool (HCE-power) [20] to study the effects on power analysis of two newer hybrid algorithms (PLIER, GC-RMA) relative to other popular algorithms (MAS5.0, RMA, and dCHIP difference model) (Figure 3). Power is the probability of calling a null hypothesis false when the null is actually false. By studying only poor signals (absent calls), many or most sufficiently powered probe sets will be prone to false positive results. Both RMA and GC-RMA showed statistically significant power for the large majority of these poorly performing probe sets, even at groups of n = 2. This suggests that both RMA and GC-RMA are subject to high false positive rates, consistent with conclusions of a recent study [3]. On the other hand, PLIER showed inability to appropriately power the large majority of probe sets, even at n = 10 per group (Figure 3). The exquisite precision of RMA and GC-RMA affords desired sensitivity into low signal ranges for these untrustworthy probe sets, but it is relatively unlikely that the derived signals correspond to specific hybridization. Thus, chip-to-chip fluctuations in background might lead to false positives from these probe sets when using RMA and also the GC-RMA hybrid algorithm. In the example shown, we can assume that the performance of the "absent call" probe sets is poor, with background signal and cross-hybridization to non-specific mRNAs leading to "signal" that is suspect. The data presented in Figure 3 suggests that improving precision through mathematical reduction of variance can become "inappropriate", and result in high risk for false positives. By this analysis, the PLIER algorithm appears the most stringent, with the lowest likelihood for false positives in poorly performing probe sets.
Another method of illustrating this point is to analyze sample mixing data sets. We used a publicly available dataset of two different human tissues (PBMC and placenta) mixed in defined ratios (100:0, 95:5, 75:25, 50:50, 25:75, 0:100) [21]. To enrich for signals near detection threshold, we selected all probe sets that were "present calls" in all 100% PBMC profiles, but absent calls in 75%, 50%, 25%, and 0% PBMC profiles. We then took all probe sets appropriately powered at n = 2 for PLIER and GC-RMA, and studies the ability of the two probe set algorithms to track dilutions of these low level PBMC-specific probe sets. Correlation analysis between PBMC RNA concentration and expression levels of the selected probe sets showed a statistically significant difference (F-Test p value = 1.43449E-05). As expected, GC-RMA (average R2 = 0.244803) was considerably less accurate than PLIER (R2 = 0.324908).
Log transformation of microarray data is another method that is frequently used to reduce variance, and hence improve precision [22–24]. We hypothesized that log transformation of the same data in Figure 3 would further mathematically reduce variance, improve precision and proportion of probe sets that are adequately powered, but with the disadvantage of leading to more potential false positives. Since the scale of signal values significantly influences power analysis results, we ran HCE-power with log2-scale data (Figure 4). Comparison of Figure 3 and Figure 4 confirmed our hypothesis: log transformation reduced variance and greatly increased the percentage of sufficiently powered probe sets (Figure 4). Interestingly, PLIER still showed the best performance in terms of the lowest potential false positive rates. This result suggests that the new dynamic weighting and error model for PLIER signal calculation is effective in reducing the influence of noise in the power calculation. It is important to note that the default signal calculations of the PLIER algorithm (obtained through R) include log transformation of the resulting signals. The data presented here suggests that log transformation of data should be used very judiciously; it may result in increased precision, but at a cost of accuracy. This interpretation is in agreement with other publications that have studied the balance of precision and accuracy [3, 22].