Lack of normalization is associated with random correlation
We first calculated raw unnormalized MAS5, RMA, and MBEI expression values for the breast cancer, dilution, and spike-in data sets as described in the Methods section. The breast cancer data set is an example of a clinical data set from a real patient population, which is expected to have greater biological variation than the dilution and spike-in data sets. We then computed the Pearson correlation coefficients for 5000 random pairs of probes for each data set.
As shown in the upper part of Figure 1, the distributions of the correlation coefficients are centered far away from zero for each data set and expression measure. There is clearly a large amount of excess correlation that is unrelated to biological relationships between genes. The similarity of expression between random pairs of genes across chips is due to technical differences between chips which have not been normalized out. This is a striking example of statistical confounding, where genes are apparently correlated for some underlying non-biological reason.
We have also found that the technical correlation between genes is inversely related to the variability of the genes involved. This can be seen in the lower part of Figure 1, where the correlations between the random pairs are plotted against the product of their standard deviations: the average correlation (shown in blue) is highest for genes with small standard deviations and decreases with increasing variability. This fits well with what we would expect from assuming a simple additive chip effect as the source of chip-to-chip variation; even though this is certainly an oversimplification, the corresponding model fits the general shape of the data well enough (shown as red line in Figure 1; see Methods).
Default normalization removes excess correlation
We calculated the same expression measures for the same data sets as above, but applied the default normalization procedure suggested for each expression measure: for MAS5 expression values, we normalized to the global mean within each array, for RMA values, we applied the quantile normalization, for MBEI we applied the invariant set normalization, see Methods. The upper part of Figure 2 shows that in all cases, the default normalization step was sufficient to remove excess correlation and center the distribution of the correlation coefficients at zero.
In the following, we will refer to unwanted correlation artifacts after normalization as residual correlation. Although we observed no residual correlation for the whole set of genes, there was no guarantee that this would hold for certain subsets of genes: an ideal normalization should remove the residual correlation for any sufficiently large subset of genes. Therefore, we investigated the pattern of correlations for pairs of genes with different intensity and variability across chips.
Genes with low variability are poorly normalized by RMA and MBEI
We previously described the systematic inverse relationship between correlation and variability. Although the default normalizations strongly reduced the scale of this correlation for all three expression measures, we still found a significant relationship between correlations and variability for RMA and MBEI, especially for the breast cancer data. The lower part of Figure 2 shows the average correlations between genes grouped by the product of their standard deviations; this is the same summary line as in Figure 1, but without plotting the individual points contributing to it. The residual correlations were smaller than before normalization, but the approximate confidence intervals show them to be highly significant. The shape of the relationship also changed and did no longer follow any simple model.
We found that the residual correlations were both absolutely larger and more significant for RMA than for MAS5. For MAS5, only the subset of genes with the lowest variability showed significant correlation, all of it positive and less than 0.05. In contrast, for RMA and MBEI, several of the low-variability classes showed significant positive correlation, up to 0.2 for the breast cancer data set. In addition, we observed small but significant negative correlations for genes in the middle range of variability for the breast cancer and dilution data.
Thus the analysis shows that RMA and MBEI do not provide properly normalized expression values for genes with low variability, particularly for the clinical data. We will explain this pattern later in terms of absence and intensity of genes.
Normalization on housekeeping genes fails to remove excess correlation
The HGU133A chips that were used for the breast cancer study contain 100 probes for generic housekeeping genes, whose expression is assumed to be constant on average for most or all experimental conditions. Consequently, it has been suggested to use these housekeeping genes for normalization, by adjusting the expression level on each chip so that the average expression of the housekeeping genes is constant across chips (see Methods). To date, there is no convincing evidence whether this method actually works or not, and it seems that some research groups are using it.
The correlation test given in Figure 3 shows that for the MAS5, RMA and MBEI methods of computing expression values, the housekeeping gene normalization failed to remove the excess correlation. There was nonzero average correlation over all genes, indicating a general failure of normalization. The systematic inverse relationship between correlation and variability were at higher levels throughout the range of variability compared to the default normalizations. The failure of housekeeping gene normalization was particularly severe for RMA.
Note that even if the amount of residual correlation shown in Figure 3 for MAS5 housekeeping-genenormalized values looks small, the impact on the subsequent high-level analysis can be serious. Figure 4 shows the distribution of 22283 gene-wise t-statistics for the housekeeping-normalized and the global-mean-normalized breast cancer data. Each t-statistic compares the mean expression level between (a) postmenopausal women who are users of hormone replacement therapy (HRT) versus (b) those who are not; see (for personal communications see Hall P, Ploner A, Bjöhle J et al.). The t-statistics for the housekeeping-normalized values are globally shifted below zero, indicating a genome-wide down-regulation of thousands of genes. In contrast, the t-statistics based on global-mean-normalized values are centered around zero, suggesting a much less pronounced difference between HRT users and non-users. In this example, the global-mean-normalized results are biologically much more plausible.
Absent genes are poorly normalized by RMA and MBEI
In each tissue, only a limited number of genes will be expressed in quantities above the detection limit, usually much fewer than the number of genes available on modern large-scale chips. The purpose of pairing PM and MM probes is to detect which genes are reliably expressed (present genes), and for which genes the observed intensities are dominated by technical and biological noise (absent genes). The most common method of classifying genes as either present or absent is based a non-parametric test for the PM/MM pairs (Affymetrix's detection calls [13]).
There is currently no consensus on how to use these detection calls. All methods report expression values of all genes including the absent genes, so in principle the analyst might ignore the issue of absent genes and treat all genes as present. Intuitively the absent genes will be measured with a lot of noise, but will they be properly normalized, i.e., will the measurements be unbiased?
In order to study the success of normalization of measured expression of absent genes, we classified all genes as either present or absent based on Affymetrix's present calls (see Methods). For all data sets, genes were most frequently either completely absent or completely present across all chips (Figure 5).
Consequently, the pairs of genes in our random samples could naturally be divided into three classes: those averaging few or no present calls between them, those averaging almost a 100% present calls, and those averaging around 50% present calls (upper part of Figure 6). These classes correspond naturally to pairs where both genes were mostly absent, or both mostly present, or where one was mostly absent and the other mostly present; by cutting at 33% and 67% average present calls as indicated in the histograms in the upper part of Figure 6, we managed to separate these groups evenly.
To provide more information, the average correlation for each subset was again plotted against variability; see lower half of Figure 6. Generally, the average correlation was highest for pairs of absent genes, indicating failure of normalization of measured expression of absent genes. This was most serious for RMA: excess correlations were consistently and strongly positive for absent pairs and negative for absent/present pairs for all data sets. Only for present pairs, correlations were mostly non-significant and small in absolute value. Correlations for MAS5 were throughout smaller and less significant, with no clear pattern between the three groups of pairs. MBEI showed the same pattern as RMA, though somewhat weaker.
This result implies that, at least in case of RMA and MBEI, measured expressions of absent genes were poorly normalized, so analyses of absent genes should be avoided or at least viewed with caution. This interpretation is supported by Figure 7, which shows the distribution of t-statistics comparing HRT-users and non-users as above, but only for genes that were not detected (absent) on all 159 chips (n = 4371); the distributions for MBEI and especially RMA indicate strong and wide-spread regulatory effects of HRT, which seems biologically implausible, especially for genes measured at the detection limit throughout the data set.
While the absence or presence of a gene could be assessed via other potential quality control measures, the Affymetrix detection call seems to provide useful information for gene filtering.
Note that the summary curves of mean correlation shown in Figure 2 are the weighted means of the curves by presence status shown in Figure 6. We can, for example, explain that the high correlations at low variability for RMA in Figure 2 are mainly due to absent/absent pairs in the expression data. The slight negative dip for genes at the middle range of variability in Figure 2 is the effect of an incomplete cancelation between the positive correlations for absent/absent pairs and the negative correlations for absent/present pairs in this range.
Residual correlation is only weakly related to the expression level of genes
Detection of a gene is trivially related to the relative abundance of its mRNA in the sample. Thus, genes that are expressed at the lower end of the detection range are much more likely to be absent. This might indicate that the relationship between the absence/presence of genes and their residual correlation is in fact due to their difference in abundance, and that by focusing on genes with a minimum expression level, we could avoid residual correlation altogether.
Figure 8 shows that this is not the case: when plotting correlations against standard deviations grouped by intensity in the breast cancer data, we found that the pattern of correlation depends more on the percentage of present calls than on the intensity level. The pattern we saw previously in Figure 6 was observed at different levels of intensity: (i) pairs where both genes are mostly absent tend to be positively correlated, (ii) pairs with one gene mostly absent and one gene mostly present tend to be negatively correlated, and (iii) genes where both partners are mostly present tend to be almost uncorrelated. This pattern is most pronounced at low and medium intensities, and it is stronger for RMA and MBEI, but it is consistently seen, also at high intensities and for MAS5 values.
In summary it seems strongly preferable to define a gene filter according to absent/present calls than according to the gene intensity levels.
Note that correlation between the intensity and presence of genes is reflected by the number of pairs that contribute to each curve in Figure 8: there were relatively more present/present gene pairs and less absent/absent pairs at high intensities, and vice versa for low intensities; curves with lower pair counts have correspondingly wider confidence intervals.
Filtering out absent genes reduces residual correlation
Figure 9 demonstrates for the breast cancer data how the filtering of genes with a large number of absent calls can reduce residual correlation for normalized expression values. In this case, the 5000 pairs of genes were randomly sampled from subsets of genes with an increasing percentage of present calls. Already by excluding genes that are always absent, the level of systematic correlation was reduced below 0.04 for all expression measures, though the pattern of positive correlations for genes with low variability was still present; by considering only genes with at least 20% present calls, we found that this pattern is reversed for RMA and MBEI, but not for MAS5. Further restrictions did not change this pattern, but increased the absolute level of residual correlation.