Correlation test to assess low-level processing of high-density oligonucleotide microarray data
© Ploner et al. 2005
Received: 08 December 2004
Accepted: 31 March 2005
Published: 31 March 2005
Skip to main content
© Ploner et al. 2005
Received: 08 December 2004
Accepted: 31 March 2005
Published: 31 March 2005
There are currently a number of competing techniques for low-level processing of oligonucleotide array data. The choice of technique has a profound effect on subsequent statistical analyses, but there is no method to assess whether a particular technique is appropriate for a specific data set, without reference to external data.
We analyzed coregulation between genes in order to detect insufficient normalization between arrays, where coregulation is measured in terms of statistical correlation. In a large collection of genes, a random pair of genes should have on average zero correlation, hence allowing a correlation test. For all data sets that we evaluated, and the three most commonly used low-level processing procedures including MAS5, RMA and MBEI, the housekeeping-gene normalization failed the test. For a real clinical data set, RMA and MBEI showed significant correlation for absent genes. We also found that a second round of normalization on the probe set level improved normalization significantly throughout.
Previous evaluation of low-level processing in the literature has been limited to artificial spike-in and mixture data sets. In the absence of a known gold-standard, the correlation criterion allows us to assess the appropriateness of low-level processing of a specific data set and the success of normalization for subsets of genes.
The spread of microarray technology has made possible the routine and simultaneous measurement of expression profiles for tens of thousands of genes. In the case of photolithographically synthesized high-density oligonucleotide arrays as described in , the technology for hybridizing RNA on chips and quantitating fluoresence-intensity data has been highly standardized and automated. The results are then related to the biology of interest, both through exploratory methods (e.g. ) and a large and growing number of sophisticated prediction and classification algorithms (e.g. ). Yet the very first step on which these procedures rely is still open to discussion: the derivation of a numerical summary value that is both representative of a gene's relative expression level and reasonably free of technical variation, summarily referred to as low-level analysis.
The need for a summary function is due to the setup of high-density oligonucleotide arrays, where each gene is probed by a set of paired oligonucleotides: one of each pair matches the target sequence on the probed gene perfectly (perfect match or PM oligo), the other has one altered central base-pair (mismatch or MM oligo), where the MMs serve to establish a reference for non-specific hybridisation. While the full set of PMs has been used successfully for detecting differential expression , there is usually a strong interest in having one number that represents the relative abundance of a gene on a chip. The most common summary measures use a non-model-based robust averaging of measurements in a probe set, such as Affymetrix's MAS5 expression value , or a model-based expression index (MBEI ) or a log-additive robust-multichip-average (RMA ) across chips.
The second crucial aspect of low-level analysis is the control of technical variation between chips, which is introduced by the measurement process during sample preparation, labelling, hybridization, and scanning. Technical variation of this kind and the need for a corrective normalization procedure are not specific to high-density oligonucleotide arrays, but are a general feature of mRNA measurement, e.g. for cDNA microarrays , northern-blot analysis or RT-PCR . Numerous procedures have been suggested, differing in their assumptions on what feature of the data remains constant across chips and can therefore be used for normalization .
Comparative evaluation of different approaches to low-level analysis has so far been limited to artificial data sets, where differential expression is due to spiked-in RNA or mixtures and dilutions of RNA from different sources [4, 10–11]. This has the obvious advantage that the true expression ratios are known (up to experimental error). Consequently, different approaches can be compared in regard to bias (when estimating fold change) and variance (when testing for differential expression). Results so far indicate that there is generally a trade-off between the two, and it seems fair to say that no current method is optimal under all circumstances.
The choice of low-level analysis and especially the choice of normalization have severe impact on the subsequent analysis of the expression data . Given the wide range of methods available, it would be useful to have a method for assessing their relative merits for a concrete data set, without reference to an external spike-in or dilution data set. This is especially true if we have to assume that our data set is not as well behaved as artificial data, either in terms of the percentage of differentially expressed genes or in terms of RNA quality, or both, as for the clinical data set on breast cancer described in the Methods section. In this paper, we propose that by studying coregulation or correlations between random pairs of genes, we can compare different summary measures and assess the effect of different normalization procedures. Our underlying hypothesis is that given a modern large-scale chip covering a large percentage of a species' genome, randomly selected pairs of genes will be on average uncorrelated. Note that we do not claim the absence of all biological correlation between genes, but rather that the number of connections between genes in regulatory pathways is small compared to the number of all possible combinations of genes; this argument is given more detail in the Discussion. Consequently, a low-level analysis strategy will be deemed suitable for a given data set, if the resulting normalized expression values are on average uncorrelated for randomly chosen pairs of genes. Lack of correlation is not assessed via formal tests, but by easily adaptable graphical tools that do not rely on stringent conditions for validity.
We proceed as follows: first, we establish relationships between lack of normalization and correlations between randomly selected genes for three important summary measures; then we show that the default normalization schemes associated with these summary measures do remove the correlations to a large degree, but not completely, with varying amounts of residual correlation. We also show that where available, housekeeping gene normalization is inferior to default normalization in removing random correlation, and we relate random correlation to the number of unexpressed genes in the data. We conclude by discussing the results and the underlying assumption of our approach as well as considerations for its practical implementation, and point out both limitations and possible extensions.
We first calculated raw unnormalized MAS5, RMA, and MBEI expression values for the breast cancer, dilution, and spike-in data sets as described in the Methods section. The breast cancer data set is an example of a clinical data set from a real patient population, which is expected to have greater biological variation than the dilution and spike-in data sets. We then computed the Pearson correlation coefficients for 5000 random pairs of probes for each data set.
We have also found that the technical correlation between genes is inversely related to the variability of the genes involved. This can be seen in the lower part of Figure 1, where the correlations between the random pairs are plotted against the product of their standard deviations: the average correlation (shown in blue) is highest for genes with small standard deviations and decreases with increasing variability. This fits well with what we would expect from assuming a simple additive chip effect as the source of chip-to-chip variation; even though this is certainly an oversimplification, the corresponding model fits the general shape of the data well enough (shown as red line in Figure 1; see Methods).
In the following, we will refer to unwanted correlation artifacts after normalization as residual correlation. Although we observed no residual correlation for the whole set of genes, there was no guarantee that this would hold for certain subsets of genes: an ideal normalization should remove the residual correlation for any sufficiently large subset of genes. Therefore, we investigated the pattern of correlations for pairs of genes with different intensity and variability across chips.
We previously described the systematic inverse relationship between correlation and variability. Although the default normalizations strongly reduced the scale of this correlation for all three expression measures, we still found a significant relationship between correlations and variability for RMA and MBEI, especially for the breast cancer data. The lower part of Figure 2 shows the average correlations between genes grouped by the product of their standard deviations; this is the same summary line as in Figure 1, but without plotting the individual points contributing to it. The residual correlations were smaller than before normalization, but the approximate confidence intervals show them to be highly significant. The shape of the relationship also changed and did no longer follow any simple model.
We found that the residual correlations were both absolutely larger and more significant for RMA than for MAS5. For MAS5, only the subset of genes with the lowest variability showed significant correlation, all of it positive and less than 0.05. In contrast, for RMA and MBEI, several of the low-variability classes showed significant positive correlation, up to 0.2 for the breast cancer data set. In addition, we observed small but significant negative correlations for genes in the middle range of variability for the breast cancer and dilution data.
Thus the analysis shows that RMA and MBEI do not provide properly normalized expression values for genes with low variability, particularly for the clinical data. We will explain this pattern later in terms of absence and intensity of genes.
The HGU133A chips that were used for the breast cancer study contain 100 probes for generic housekeeping genes, whose expression is assumed to be constant on average for most or all experimental conditions. Consequently, it has been suggested to use these housekeeping genes for normalization, by adjusting the expression level on each chip so that the average expression of the housekeeping genes is constant across chips (see Methods). To date, there is no convincing evidence whether this method actually works or not, and it seems that some research groups are using it.
In each tissue, only a limited number of genes will be expressed in quantities above the detection limit, usually much fewer than the number of genes available on modern large-scale chips. The purpose of pairing PM and MM probes is to detect which genes are reliably expressed (present genes), and for which genes the observed intensities are dominated by technical and biological noise (absent genes). The most common method of classifying genes as either present or absent is based a non-parametric test for the PM/MM pairs (Affymetrix's detection calls ).
There is currently no consensus on how to use these detection calls. All methods report expression values of all genes including the absent genes, so in principle the analyst might ignore the issue of absent genes and treat all genes as present. Intuitively the absent genes will be measured with a lot of noise, but will they be properly normalized, i.e., will the measurements be unbiased?
To provide more information, the average correlation for each subset was again plotted against variability; see lower half of Figure 6. Generally, the average correlation was highest for pairs of absent genes, indicating failure of normalization of measured expression of absent genes. This was most serious for RMA: excess correlations were consistently and strongly positive for absent pairs and negative for absent/present pairs for all data sets. Only for present pairs, correlations were mostly non-significant and small in absolute value. Correlations for MAS5 were throughout smaller and less significant, with no clear pattern between the three groups of pairs. MBEI showed the same pattern as RMA, though somewhat weaker.
While the absence or presence of a gene could be assessed via other potential quality control measures, the Affymetrix detection call seems to provide useful information for gene filtering.
Note that the summary curves of mean correlation shown in Figure 2 are the weighted means of the curves by presence status shown in Figure 6. We can, for example, explain that the high correlations at low variability for RMA in Figure 2 are mainly due to absent/absent pairs in the expression data. The slight negative dip for genes at the middle range of variability in Figure 2 is the effect of an incomplete cancelation between the positive correlations for absent/absent pairs and the negative correlations for absent/present pairs in this range.
Detection of a gene is trivially related to the relative abundance of its mRNA in the sample. Thus, genes that are expressed at the lower end of the detection range are much more likely to be absent. This might indicate that the relationship between the absence/presence of genes and their residual correlation is in fact due to their difference in abundance, and that by focusing on genes with a minimum expression level, we could avoid residual correlation altogether.
In summary it seems strongly preferable to define a gene filter according to absent/present calls than according to the gene intensity levels.
Note that correlation between the intensity and presence of genes is reflected by the number of pairs that contribute to each curve in Figure 8: there were relatively more present/present gene pairs and less absent/absent pairs at high intensities, and vice versa for low intensities; curves with lower pair counts have correspondingly wider confidence intervals.
The model described in the Methods section only assumes differences in mean intensity between chips. This corresponds to the simple global mean normalization commonly used for the MAS5 expression values. Figure 1 confirms that this model (shown in red) describes the average behavior of the correlations (shown in blue) adequately for all data sets, suggesting that global mean normalization is indeed suitable for MAS5 data.
Apart from MAS5, the model fits adequately only for the RMA-based correlations in the breast cancer data, suggesting that global mean normalization on the probe-set level may be attempted in this case, but that it is not generally suitable for RMA and MBEI data. Still, Figure 1 shows that correlations decrease systematically with the variability of the gene pairs for all expression measures, and it may be possible to describe this relationship by extending the simple model, e.g. by allowing the array effect θ in Equation 1 to be correlated with the gene effects ψ i .
The use of housekeeping genes seems reasonable when studying a small number of genes under controlled experimental settings, or where the choice of one or several housekeeping genes can be motivated biologically. For the breast cancer data, which was collected in a real clinical setting, where samples are both genetically heterogenous and potentially genomically unstable, it is much harder to believe in the common expression of housekeeping genes. Therefore we argue that the failure of housekeeping normalization in this example is not due to the procedure per se, but to our inability to identify a suitable set of housekeeping genes, and the use of the generic set of genes suggested by the chip manufacturer. Even for northern-blot analysis and RT-PCR, where housekeeping normalization is the default, an uncritical use of housekeeping genes has been shown to lead to unacceptable results .
It has been suggested that the generally much lower variability of RMA and MBEI for low-intensity probe sets is a clear advantage of these model-based expression measures over the simpler MAS5 [7, 10]. Our results however indicate that this low variability may well be misleading: RMA and MBEI values for absent probe sets, which constitute the vast majority of low-intensity probe sets, show the strongest residual correlation. This indicates that RMA and MBEI values for low-intensity probe sets that are reported without regard for their absence/presence status will be compromised by lack of normalization (Figures 6 and 8). It seems therefore that RMA and MBEI estimate expression of low-abundance genes in a biased, but very precise manner. Minimizing variability as much as possible only makes sense for unbiased estimato rs: if the variability of the estimate becomes small relative to the bias, we get a dangerous sense of confidence in an estimate that is not quite what we think it is. In the same way, the large variability of the MAS5 values at low intensities may well hide an amount of bias comparable to that of RMA and MBEI: as long as the variability of MAS5 is large compared to the bias, we will not be lead to make inappropriate conclusions based on possibly biased estimates; in that sense the MAS5 estimates for low-intensity genes are more honest and better normalized than the corresponding RMA and MBEI values. It is interesting to note that Bolstad et al. have already described the choice between different low-level approaches in terms of bias (when estimating fold change between conditions) and variance (when testing for differential expression between conditions) . Our results suggest that a) the same trade-off applies when looking directly at the expression values, instead of comparing aggregated fold changes and test statistics between different biological conditions, and b) the trade-off is more disadvantageous for the model-based expression measures than generally thought.
The underlying lack of normalization of RMA and MBEI for absent genes could be due to the computation of the expression values, or the normalization step, or a combination thereof. Preliminary results (not shown) indicate that the first step, the summarization of the individual probe intensities through the expression measure, seems to be responsible in both cases. If this can be confirmed, a possible explanation would be that the models used (log-linear for RMA and and multiplicative for MBEI) may not be appropriate for absent genes (but see also below).
In a recent paper, Choe et al. have evaluated the performance of a wide range of low-level analysis methods and test procedures in detecting differential expression in a carefully constructed spike-in data set . They report 70% sensitivity at 10% false discovery rate for their top-ranking combinations clearly there is still ample room for improvement in current low-level methodology. We want to outline here shortly how our approach could be used to guide this effort.
We found that renormalization reduced residual correlation for all data sets and all expression measures. Indeed, for MAS5 the correlations are not significantly different from zero at any lag, indicating perfect normalization as measured by our criterion. RMA and MBEI show strongly reduced levels of residual correlation, but are still well above the levels of the original MAS5 as seen in Figure 2.
It is interesting that the ranking of the original and renormalized expression measures in terms of normalization quality (i.e. renormalized MAS5 is best, followed by the original MAS5, followed by renormalized RMA and MBEI, followed by the original RMA and MBEI) corresponds closely to the ranking by performance in detecting differential expression found by Choe et al. (, Figure 7f). This suggests that the lack of normalization that our method is able to measure is indeed relevant for the ability to detect regulated genes.
Additionally, Figure 11 gives an indication of how the newly renormalized expression measures may be further improved. E.g. for the renormalized MAS5, there is clearly little need to work on the normalization aspect; modifications of the expression measure could instead aim at reducing the variability of MAS5 values, possibly by using the information in the MM probes as weights in the summary measure.
Renormalized RMA and MBEI on the other hand still suffer from insufficient normalization; as we perform already normalization steps on both the probe and the probe set level, it seems promising to focus on the intermediate steps like the fitting of the multi-chip model and to study whether these steps are prone to systematic biases.
The only condition for using the correlation test is a fairly large chip, with probes covering a wide range of the genome under study. For chips that are designed to study only a few related pathways or highly specialized tissues with only a couple of hundred probe sets, the zero correlation assumption may not hold, because the genes from which we want to sample randomly have already been pre-selected by the chip design. The example of the KEGG probe sets on the U133A chip suggests though that several thousand probe sets organized in a hundred and some pathways is a safe size.
It should be pointed out that this approach is not limited to high-density oligonucleotide chips. The same argument for between-chip normalization holds in principle for cDNA or any other two-color microarray system, although the usual intensity-based normalization between dye channels on the same chip simplifies the situation somewhat .
We have presented a simple graphical method for assessing the quality of low-level analysis of oligonucleotide array expression data. The main advantage of our approach lies in the fact that we do not make use of external reference data, but instead exploit the internal correlation structure of large expression data sets. This allows us to select, evaluate, and modify low-level procedures for specific data sets. In order to demonstrate the use of and usefulness of our approach, we have applied it to three large data sets and three widely used low-level methods (MAS5, RMA, MBEI). We found a number of interesting results: a) For a large breast cancer data set, normalizing to housekeeping genes does not work at all, regardless of expression measure; b) normalization quality for all three data sets and all three expression measures is closer related to the absence/presence status of a probe set than to its intensity level; c) RMA and MBEI normalize absent probe sets poorly for all three data sets; d) removing pre-dominantly absent probe sets improves normalization for all data sets and all expression measures. The cutoff percentage of absent calls for a probe set to be included in the analysis can be chosen based on our graphical criterion. We have also evaluated the effect of a second round of normalization on the probe set level data. We found that this improved normalization significantly for all three data sets, in a manner consistent with the observed improvements in the detection of gene regulation .
We used three data sets, two of which are publicly available from GeneLogic . (1) The dilution data set is a collection of 75 HGU95Av2 chips, on which RNA from two different sources (liver and nervous system) was hybridized in different concentrations and mixture ratios. (2) The spike-in data set consists of 94 HGU95Avl chips, for which eleven bacterial cRNA fragments were added in different concentrations and combinations to a base sample from an AML cell line. Both of these data sets have been widely used for assessing normalization methods and expression measures [4, 7, 10].
In contrast the RNA for the third data set was extracted from tumor tissue collected from a population-based breast cancer cohort at Karolinska Hospital, Stockholm. After processing the RNA, several quality control steps, and screening the patients on medical criteria resulted in data from 159 HGU133A chips. Details on data preparation, patient selection, and the definition of clinical parameters like hormone replacement therapy are given in [for personal communications see Hall P, Ploner A, Bjöhle J et al.].
MAS5 expression values were computed as described in . We used global mean normalization for the logged expression values as default, assuming that the mean across the logged expression values of all probes should be constant across all chips, and adjusting the level of each chip by adding a corrective constant to all probes. This is roughly equivalent to using the standard Affymetrix scaling factors on the raw data, but estimation of the corrective term on the log-scale has been found to be less variable . The corrected log values were used for the analysis.
For RMA, the individual PM probe values were background-corrected and quantilenormalized before computing the expression values, as described in .
The computation of MBEI expression values followed : PM and MM values were normalized separately to a baseline array of average PM and MM intensities. The baseline array was obtained via smoothing an empirically identified set of rank-invariant probes. A multiplicative model was fitted to the difference between normalized PM and MM values. Expression values were logged, with non-positive values set to missing.
Housekeeping gene normalization was based on the probes with suffix 2000_ on the HGU133A chip. The same principle as with the global mean normalization was employed, except that the correction constant was based on the average of the housekeeping genes. For the MAS5 values, two variants were considered: a) multiplicative correction (scaling factor) of the un-logarithmized expression values, referred to as 'raw housekeeping' in the legends for Figures 3 and 4b) additive correction of the logarithmized expression values, as for the global mean normalization, referred to as ' log housekeeping' in the figure legends. Housekeeping normalization for RMA and MBEI was done additively for the logarithmized expression values calculated from the unnormalized probe data; therefore, these are also addressed as 'log housekeeping' values in Figure 3.
The renormalization of the expression measures described in the Discussion was performed as in , using the iterated pairwise intensity-based normalization via smoothing loess curves described previously in . The renormalization was performed at the probe set level on the expression measures computed and normalized as described above; correspondingly, renormalized RMA and MBE values have been normalized on both the probe level (originally) and the probe set level (second round), whereas MAS5 has been normalized twice on the probe set level.
For each data set, we randomly selected 5000 pairs of probe sets from the collection of probe sets available on the different chip types. For each pair, we computed the Pearson correlation coefficient between the two probe sets across all chips in the underlying data set, resulting in a random sample of 5000 correlation coefficients. At the same time, we calculated for each pair of probe sets the product of the two standard deviations across all chips in the data set; the scatter plots in the lower part of Figure 1 show the resulting 5000 pairs of (product of standard deviation, correlation).
The summary curves shown on top (like in Figure 1) or instead (all other Figures) of the point scatter describe the average behaviour of the scattered cloud. They were produced by taking the range of values for the product of the standard deviations in the sample and splitting it into intervals containing an equal number of observations, typically around 500. For each interval, the mean of the correlations was plotted against the median of the product of standard deviations. The 95% confidence intervals of the means shown from Figure 2 onwards were computed based on normal approximation.
We assume as experimental unit one microarray chip with the associated samples from the biological population under study. Each chip yields observations y i for i = 1 ... n genes specified by the array design. We can write this as a random variable
Y i = θ + ψ i + ε i (1)
where θ is a random array effect, ψ i is a random gene effect, and ε i is the gene-specific measurement error. Note that this assumes random gene effects only in so far as we sample from the population, and we do not specify any treatment or experimental structure.
We assume that the random components are independent, and that the errors have expectation zero. The covariance between the observable expression values for two genes then simplifies to
the covariance between the unobservable 'real' gene expressions plus the variance of the array effect. Let's designate the variance of any Y i as . Now the correlations between the observable expression values for two genes can be written as
The first term is the contribution of the array effect to the correlation, which is the source of the correlation artifact. The second term varies across all possible pairs of genes, and we expect it to have zero average. Thus we get the inverse relationship
We can investigate this empirically. Given a set of microarrays, we can take a sample of random pairs of genes (i, j), then calculate their correlations r ij and standard deviations s i and s j . Under our hypothesis of zero average correlation and assuming that our simple model holds, the underlying pattern in the plot of (s i s j , r ij ) should follow this inverse relationship.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.