The spread of microarray technology has made possible the routine and simultaneous measurement of expression profiles for tens of thousands of genes. In the case of photolithographically synthesized high-density oligonucleotide arrays as described in [1], the technology for hybridizing RNA on chips and quantifying fluorescence-intensity data has been highly standardized and automated. The results are then related to the biology of interest, both through exploratory methods (e.g. [2]) and through a large and growing number of sophisticated prediction and classification algorithms (e.g. [3]). Yet the very first step on which these procedures rely is still open to discussion: the derivation of a numerical summary value that is both representative of a gene's relative expression level and reasonably free of technical variation, collectively referred to as low-level analysis.

The need for a summary function arises from the setup of high-density oligonucleotide arrays, where each gene is probed by a set of paired oligonucleotides: one of each pair matches the target sequence on the probed gene perfectly (perfect match or PM oligo), while the other has one altered central base pair (mismatch or MM oligo); the MMs serve to establish a reference for non-specific hybridization. While the full set of PMs has been used successfully for detecting differential expression [4], there is usually a strong interest in having one number that represents the relative abundance of a gene on a chip. The most common summary measures use a non-model-based robust averaging of measurements in a probe set, such as Affymetrix's MAS5 expression value [5], or fit a model-based expression index (MBEI) [6] or a log-additive robust multi-chip average (RMA) [7] across chips.
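To make the log-additive idea concrete, the following is a minimal sketch (not the reference implementation of RMA [7]) of fitting the model log2(PM[i, j]) ≈ probe effect[i] + chip effect[j] by Tukey's median polish, with the example intensity matrix being invented illustrative data:

```python
import numpy as np

def median_polish_summary(pm, n_iter=10):
    """Sketch of an RMA-style log-additive summary for one probe set.
    `pm` is a (probes x chips) array of PM intensities; iteratively
    sweeping out row and column medians fits probe and chip effects
    robustly on the log2 scale."""
    z = np.log2(pm).astype(float)
    row = np.zeros(z.shape[0])  # accumulated probe effects
    col = np.zeros(z.shape[1])  # accumulated chip effects
    for _ in range(n_iter):
        r = np.median(z, axis=1)      # sweep out row (probe) medians
        z -= r[:, None]; row += r
        c = np.median(z, axis=0)      # sweep out column (chip) medians
        z -= c[None, :]; col += c
    # per-chip expression value: overall probe level + chip effect
    return np.median(row) + col

# hypothetical probe set: 3 probes on 3 chips; chip 3 is roughly twice as bright
pm = np.array([[200., 210., 400.],
               [100., 110., 190.],
               [400., 380., 820.]])
print(median_polish_summary(pm))
```

Because the fit is on the log2 scale, a chip with doubled intensities receives an expression value about one unit higher; the medians make single outlying probes largely irrelevant, which is the point of the robust fit.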

The second crucial aspect of low-level analysis is the control of technical variation between chips, which is introduced by the measurement process during sample preparation, labelling, hybridization, and scanning. Technical variation of this kind and the need for a corrective normalization procedure are not specific to high-density oligonucleotide arrays, but are a general feature of mRNA measurement, e.g. for cDNA microarrays [8], northern blot analysis, or RT-PCR [9]. Numerous procedures have been suggested, differing in their assumptions about which feature of the data remains constant across chips and can therefore be used for normalization [10].
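As one illustrative example of such a procedure (one common choice among the many schemes compared in [10], not the only one), quantile normalization assumes that the overall intensity distribution is the same on every chip, and enforces this by mapping each chip's values onto the average empirical distribution. A minimal sketch, with invented example data:

```python
import numpy as np

def quantile_normalize(x):
    """Sketch of quantile normalization: force every chip (column) of the
    (genes x chips) matrix `x` to share the same empirical distribution.
    Ties are broken arbitrarily by the stable sort, a known simplification."""
    order = np.argsort(x, axis=0, kind="stable")   # per-chip ranks
    mean_quantiles = np.sort(x, axis=0).mean(axis=1)  # reference distribution
    out = np.empty_like(x, dtype=float)
    for j in range(x.shape[1]):
        out[order[:, j], j] = mean_quantiles       # rank -> reference value
    return out

# hypothetical intensities: 4 genes on 3 chips
x = np.array([[5., 4., 3.],
              [2., 1., 4.],
              [3., 4., 6.],
              [4., 2., 8.]])
xn = quantile_normalize(x)
# every column of xn now has identical sorted values
```

Within each chip the rank order of genes is preserved; only the values are replaced, so any chip-wide shift or scaling is removed by construction.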

Comparative evaluation of different approaches to low-level analysis has so far been limited to artificial data sets, where differential expression is due to spiked-in RNA or mixtures and dilutions of RNA from different sources [4, 10–11]. This has the obvious advantage that the true expression ratios are known (up to experimental error). Consequently, different approaches can be compared in regard to bias (when estimating fold change) and variance (when testing for differential expression). Results so far indicate that there is generally a trade-off between the two, and it seems fair to say that no current method is optimal under all circumstances.

The choice of low-level analysis, and especially the choice of normalization, has a severe impact on the subsequent analysis of the expression data [12]. Given the wide range of methods available, it would be useful to have a way of assessing their relative merits for a concrete data set, without reference to an external spike-in or dilution data set. This is especially true if we have to assume that our data set is not as well behaved as artificial data, either in terms of the percentage of differentially expressed genes or in terms of RNA quality, or both, as for the clinical data set on breast cancer described in the Methods section. In this paper, we propose that by studying coregulation, or correlations between random pairs of genes, we can compare different summary measures and assess the effect of different normalization procedures. Our underlying hypothesis is that given a modern large-scale chip covering a large percentage of a species' genome, randomly selected pairs of genes will be *on average* uncorrelated. Note that we do not claim the absence of all biological correlation between genes, but rather that the number of connections between genes in regulatory pathways is small compared to the number of all possible combinations of genes; this argument is developed in more detail in the Discussion. Consequently, a low-level analysis strategy will be deemed suitable for a given data set if the resulting normalized expression values are on average uncorrelated for randomly chosen pairs of genes. Lack of correlation is not assessed via formal tests, but via easily adaptable graphical tools that do not rely on stringent conditions for validity.
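The basic diagnostic idea can be sketched as follows; this is an illustrative simulation under the stated hypothesis, not the paper's actual code or data. Gene pairs are sampled at random and their across-chip Pearson correlations computed; a shared chip effect, mimicking an unnormalized technical artefact, shifts the whole distribution away from zero:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_pair_correlations(expr, n_pairs=1000):
    """Sample random gene pairs from the (genes x chips) matrix `expr`
    and return their across-chip Pearson correlation coefficients.
    Under the hypothesis of the paper, these should centre near zero
    for well-normalized data."""
    n_genes = expr.shape[0]
    i = rng.integers(0, n_genes, n_pairs)
    j = rng.integers(0, n_genes, n_pairs)
    keep = i != j                      # drop self-pairs
    i, j = i[keep], j[keep]
    a = expr[i] - expr[i].mean(axis=1, keepdims=True)
    b = expr[j] - expr[j].mean(axis=1, keepdims=True)
    num = (a * b).sum(axis=1)
    den = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    return num / den

# simulated data: 500 independent genes on 20 chips
expr = rng.normal(size=(500, 20))
r = random_pair_correlations(expr)            # centred near zero
chip = rng.normal(size=20)                    # shared per-chip artefact
r_unnorm = random_pair_correlations(expr + 2 * chip)  # shifted upwards
print(r.mean(), r_unnorm.mean())
```

In practice one would inspect the full distribution of coefficients graphically, as described above, rather than only its mean; the simulation merely shows why residual technical variation manifests as random correlation.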

We proceed as follows: first, we establish relationships between lack of normalization and correlations between randomly selected genes for three important summary measures; then we show that the default normalization schemes associated with these summary measures do remove the correlations to a large degree, but not completely, with varying amounts of residual correlation. We also show that, where available, housekeeping-gene normalization is inferior to the default normalization in removing random correlation, and we relate random correlation to the number of unexpressed genes in the data. We conclude by discussing the results and the underlying assumption of our approach, as well as considerations for its practical implementation, and point out both limitations and possible extensions.