### General model

Consider an experiment with genes *i* = 1,..., *I* from RNA extracts *c* = 1,..., *C*. For example, in oligonucleotide microarrays each slide measures the gene-expression levels of exactly one RNA extract, whereas for two-color microarrays each slide measures two RNA extracts, one in each channel. From now on, we refer to the RNA extracts, or replicates thereof, as *channels*. Let *x*_{c,i} be the true gene-expression level of gene *i* in channel *c* and let *y*_{c,i} be the corresponding observed gene-expression level. The relationship between the observed and the true expression levels can be written as

*y*_{c,i} = *f*_{c}(*x*_{c,i}) + *ε*_{c,i} (1)

where *f*_{c} is a channel-specific *measurement function*, which includes all steps in the gene-expression acquisition process. Most generally, we have that *E*[*ε*_{c,i}] = 0 and *V*[*ε*_{c,i}] = {\sigma}_{c,i}^{2}, where the variance can take any form. Importantly, the properties of *ε*_{c,i} are not well understood and depend both on the platform used and on which part of the process is studied. For this reason, and because of the many interesting effects that the affine transformation (presented below) generates by itself, we conduct this study under the assumption of noise-free data. Relationship (1) may be specified for subsets of genes or probes, e.g. print-tip [4], microtiter-plate or clone-library [5] groups. Spatial dependencies may also be modeled. However, to simplify the discussion that follows, we avoid such details.

Since inference is ideally based on *x*_{c,i}, the inverse of *f*_{c} has to be identified, something that, in theory, is possible if *f*_{c} is strictly increasing. Violation of this constraint has been observed in, for instance, two-color microarray data. This can be due to too high concentrations of fluorophores, which can quench the signal so much that the signal decreases when the concentration increases [6, 7]. Extreme saturation in the scanner, which is commonly observed when the PMT gain is set too high, results in censored signals, which in turn prevents a unique inverse of the measurement function from being found. This paper does not discuss saturation further, because we believe that saturation can and should be avoided.

### Dissection of the overall measurement function

Formally, each step in the microarray process can be seen as a function that takes a set of input objects and outputs another set of objects. The sequential nature of the process makes it possible to think of the measurement function *f*_{c} as a *composite function* (function of functions); *f*_{c} = *f*_{c,S} ◦ *f*_{c,S-1} ◦ ⋯ ◦ *f*_{c,1}, where *S* is the number of steps in the process. For instance, and of course simplified, it could be that *f*_{c,1} models the extraction of the RNA from the cell, *f*_{c,2} models the reverse transcription of RNA into cDNA and so on. Some of these *submeasurement functions* are shared by several channels and others are channel specific or even gene specific. Moreover, there may be joining subfunctions too, e.g. the hybridization of labeled cDNA sequences to the probes on the array. In this paper, measurement functions of different channels are treated independently.
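As an illustration, the composite structure can be sketched in a few lines of Python. The three subfunctions below (`f_c1`, `f_c2`, `f_c3`) are hypothetical placeholders, not models of the actual laboratory steps; the sketch merely shows that composing affine subfunctions again yields an affine overall measurement function, which motivates the affine model used below.

```python
from functools import reduce

def compose(*funcs):
    """Compose functions right to left: compose(f3, f2, f1)(x) = f3(f2(f1(x)))."""
    return reduce(lambda f, g: lambda x: f(g(x)), funcs)

# Hypothetical submeasurement functions for one channel (illustration only).
f_c1 = lambda x: 0.9 * x          # e.g. RNA extraction efficiency (scale only)
f_c2 = lambda x: 0.8 * x + 5.0    # e.g. labeling, adding a small offset
f_c3 = lambda x: 1.2 * x + 20.0   # e.g. scanning, adding a dark-noise offset

f_c = compose(f_c3, f_c2, f_c1)   # overall measurement function f_c = f_c3 o f_c2 o f_c1

# The composition of affine subfunctions is itself affine: f_c(x) = a_c + b_c * x,
# with b_c the product of the slopes and a_c the accumulated offsets.
b_c = 1.2 * 0.8 * 0.9
a_c = 1.2 * 5.0 + 20.0
assert abs(f_c(100.0) - (a_c + b_c * 100.0)) < 1e-9
```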

A first-order Taylor series expansion of an arbitrary measurement function *f*_{c}(*x*_{c,i}) has the form

*f*_{c}(*x*_{c,i}) = *a*_{c} + *b*_{c}*x*_{c,i} + *R*_{c}(*x*_{c,i}), ∀*c,i*. (2)

From the above dissection of the measurement function, it is easy to argue that some of the subfunctions may introduce an offset (bias) and that, for this reason, there ought to be an offset in *f*_{c} (we will use the terms bias and offset interchangeably). For instance, the offset may be due to non-uniformity of the reverse transcription, the labeling [7] or the hybridization, to dark noise in the PMT [8] or laser scatter light in the scanner, to background noise, to non-uniformity of the scanned glass slide [9], or to threshold effects, etc. In [10] it is shown how various background estimates based on different image-analysis methods may introduce bias. Similarly, we have shown that different scanners may introduce bias [11].

### The affine measurement function

In order to focus on the effects of *a*_{c} and *b*_{c}, but also because it results in the simplest parametric measurement function possible, we assume *R*_{c}(*x*_{c,i}) in (2) to be small. The *affine measurement function* is

*f*_{c}(*x*_{c,i}) = *a*_{c} + *b*_{c}*x*_{c,i}, ∀*c,i*, (3)

with unique inverse

{x}_{c,i}={f}_{c}^{-1}({y}_{c,i})=\frac{{y}_{c,i}-{a}_{c}}{{b}_{c}},\forall c,i,\left(4\right)

where *a*_{c} is the overall offset (bias) and *b*_{c} > 0 is the overall scale factor in channel *c*. The *a*_{c} parameters are commonly positive, but under certain circumstances, for instance, as demonstrated later, when two different measuring techniques are compared, the effective offset may be negative. Modeling microarray data by an affine transform is not novel [3, 12–14], but the reasons for it might have been different in those papers.
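A minimal sketch of the affine measurement function (3) and its unique inverse (4), assuming noise-free data; the function names and parameter values are illustrative only (the green channel of the {\mathcal{A}}_{1} example introduced later is used for the round trip):

```python
def affine_forward(x, a_c, b_c):
    """Observed signal under the affine measurement function (3): y = a_c + b_c * x."""
    return a_c + b_c * x

def affine_inverse(y, a_c, b_c):
    """Unique inverse (4), defined for b_c > 0: x = (y - a_c) / b_c."""
    if b_c <= 0:
        raise ValueError("the scale factor b_c must be positive for a unique inverse")
    return (y - a_c) / b_c

# Round trip with illustrative parameters (a_c, b_c) = (200, 1.4).
x_true = 350.0
y_obs = affine_forward(x_true, a_c=200.0, b_c=1.4)   # 200 + 1.4 * 350 = 690
assert abs(affine_inverse(y_obs, a_c=200.0, b_c=1.4) - x_true) < 1e-9
```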

### The log-ratio log-intensity transform

In two-color but also in oligonucleotide microarray experiments, it is convenient to do statistical analysis on the log-ratios and the log-intensities [15] of the gene-expression levels in two channels instead of on the expression levels directly. For gene *i* we have that

{M}_{i}={\mathrm{log}}_{2}\frac{{y}_{R,i}}{{y}_{G,i}}={\mathrm{log}}_{2}\frac{{f}_{R}({x}_{R,i})}{{f}_{G}({x}_{G,i})}\left(5\right)

\begin{array}{c}{A}_{i}=\frac{1}{2}{\mathrm{log}}_{2}({y}_{R,i}\cdot {y}_{G,i})\\ =\frac{1}{2}{\mathrm{log}}_{2}({f}_{R}({x}_{R,i})\cdot {f}_{G}({x}_{G,i})).\end{array}\left(6\right)

For simplicity, we denote channels 1 and 2 by *R* and *G*, which are mnemonics for the red and the green dyes commonly used in two-color microarray data. A rationale for this bijective transform (if the observed signals are positive) is that the main measure of interest, the fold change, is contained in one variable. However, since the transform is based on the observed expression levels and not the true ones, *M* alone does not carry all information about the biological fold change. This can be seen if the *true fold change* for an arbitrary gene *i* is considered;

*r*_{i} = *x*_{R,i}/*x*_{G,i} (7)

where *r*_{i} > 0. Dropping gene index *i* in (5) and (6), *M* and *A* can be written as functions of *x*_{G} and *r*, i.e. *M* = *g*_{r}(*x*_{G}) and *A* = *h*_{r}(*x*_{G}). Thus,

M={m}_{r}(A)={g}_{r}({h}_{r}^{-1}(A)),\left(8\right)

which shows that *M* is a function of *A* (and *r*). Hence, as discussed thoroughly below, commonly observed intensity-dependent effects in the log-ratios may contain valuable information, and consequently, applying normalization methods without care may result in loss of information and introduced bias.
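Since the (*M*, *A*) transform is a bijection of the positive signal pair (*y*_{R}, *y*_{G}), it can be inverted numerically; the back-transform below is derived directly from (5) and (6) and is a sketch of ours, not a function from any particular software package:

```python
import math

def ma_transform(y_R, y_G):
    """Log-ratio M (5) and log-intensity A (6) from positive observed signals."""
    M = math.log2(y_R / y_G)
    A = 0.5 * math.log2(y_R * y_G)
    return M, A

def ma_inverse(M, A):
    """Invert the bijection: y_R = 2^(A + M/2), y_G = 2^(A - M/2)."""
    return 2.0 ** (A + M / 2.0), 2.0 ** (A - M / 2.0)

# A four-fold observed ratio: M = log2(800/200) = 2.
M, A = ma_transform(800.0, 200.0)
y_R, y_G = ma_inverse(M, A)
assert abs(y_R - 800.0) < 1e-6 and abs(y_G - 200.0) < 1e-6
```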

### Log-ratios as a function of log-intensities with affine transformations

Under an affine transformation, the relationship between the observed log-ratios and the observed log-intensities for a fixed fold change *r*, omitting gene index *i*, is

\begin{array}{c}M={m}_{r}(A)={\mathrm{log}}_{2}r+{\mathrm{log}}_{2}\beta \\ +{\mathrm{log}}_{2}\frac{\frac{1}{2}\alpha (r)+\sqrt{\frac{1}{4}{[\alpha (r)]}^{2}+r\beta {2}^{2A}}}{-\frac{1}{2}\alpha (r)+\sqrt{\frac{1}{4}{[\alpha (r)]}^{2}+r\beta {2}^{2A}}}\end{array}\left(9\right)

where *α*(*r*) = *a*_{R} - *rβa*_{G} quantifies how much *M* depends on *A* at the given fold change, and *β* = *b*_{R}/*b*_{G} is the *relative scale factor* between the two channels compared. See Methods for details. Recall that log_{2}*r* is the variable of interest. The derivative of *M* with respect to *A* for a fixed fold change *r* is

{\frac{dM}{dA}|}_{{x}_{R}=r{x}_{G}}(A)=-\frac{\alpha (r)}{\sqrt{\frac{1}{4}{[\alpha (r)]}^{2}+r\beta {2}^{2A}}}.\left(10\right)

Consider a fixed *r* and define *α* = *α*(*r*). Then there are only two parameters in (9) and (10) that determine the shape of *m*_{r}(*A*), namely *α* and *β*. Consequently, when *a*_{R}, *a*_{G} ≠ 0, *M* is independent of *A* if and only if *α* = 0, that is, when *r* = (*b*_{G}*a*_{R})/(*b*_{R}*a*_{G}). For this particular value of *r*, we have that the observed log-ratio is *M* = log_{2}(*a*_{R}/*a*_{G}), which is independent of the scale factors. Moreover, for log-ratios of non-differential expressions, that is log_{2}*r* = 0, to be independent of *A*, it must be true that *b*_{G}*a*_{R} = *b*_{R}*a*_{G} or, equivalently, *b*_{R}/*b*_{G} = *a*_{R}/*a*_{G}. It is also clear from (10) that the scale parameters cannot introduce any curvature themselves, but only enhance or decrease curvature introduced by the offset. In addition, a relative scale different from one shifts the log-ratios up or down. Moreover, the size of the effect that the offset terms have on the log-ratios decreases as the intensity increases. At high intensities the only observable effect is that from the relative scale between the two channels. The observed log-ratio for non-differentially expressed genes at high intensity is *M*_{∞} ≈ log_{2}*β*. In the case of a linear transform (*a*_{R} = *a*_{G} = 0), *α* is (always) zero and *M* is therefore independent of *A* for *all* *r*. The remaining log-ratio bias is log_{2}*β*. If *a*_{R}, *a*_{G} > 0, the "weakest" observable data point is (*A*_{0}, *M*_{0}) = (1/2·log_{2}(*a*_{R}*a*_{G}), log_{2}(*a*_{R}/*a*_{G})), which is independent of both the gene-expression and the scale parameters. All fold-change curves converge to this point. In the left graph of Figure 1 the effect of the affine transform {\mathcal{A}}_{1} = {(*a*_{G}, *a*_{R}) = (200, 20), (*b*_{G}, *b*_{R}) = (1.4, 0.8)} at different fold changes is depicted. The different curves plotted are the functions *M* = *m*_{r}(*A*) for different fold changes. Note the asymmetry in curvature between up- and down-regulation. From the above discussion we know that the observed log-ratios are independent of the log-intensities for log_{2}*r* ≈ -2.51, with value *M*_{0} ≈ -3.32. The log-ratio for non-differentially expressed genes at high intensities is *M*_{∞} ≈ -0.81. A real-world example taken from [11], where the same array was scanned four times at various scanner PMT (sensitivity) settings, is shown in the right plot of Figure 1. Observed within-channel log-ratios *M*_{c} = log_{2}({y}_{c}^{(v)}/{y}_{c}^{(w)}) are plotted against the within-channel log-intensities *A*_{c} = log_{2}({y}_{c}^{(v)}{y}_{c}^{(w)})/2 for the red channel (*c* = *R*), where {y}_{c}^{(v)} and {y}_{c}^{(w)} are observations at two different scanner PMT settings. In this case it turned out that all scans share the same offset. For more details, see [11]. For another example, see Figure 9.
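Equation (9) can be checked numerically. The sketch below evaluates *m*_{r}(*A*) with the {\mathcal{A}}_{1} parameters and confirms two properties derived above: the *r* = 1 curve approaches *M*_{∞} = log_{2}*β* at high intensities, and all fold-change curves pass through the point (*A*_{0}, *M*_{0}):

```python
import math

def m_r(A, r, a_R, a_G, b_R, b_G):
    """Observed log-ratio M as a function of log-intensity A for a fixed fold change r, eq. (9)."""
    beta = b_R / b_G
    alpha = a_R - r * beta * a_G
    root = math.sqrt(0.25 * alpha**2 + r * beta * 2.0 ** (2.0 * A))
    return math.log2(r) + math.log2(beta) + math.log2((0.5 * alpha + root) / (-0.5 * alpha + root))

# The A_1 example: (a_G, a_R) = (200, 20), (b_G, b_R) = (1.4, 0.8).
pars = dict(a_R=20.0, a_G=200.0, b_R=0.8, b_G=1.4)
beta = 0.8 / 1.4

# At high intensities the r = 1 curve approaches M_inf = log2(beta), about -0.81.
assert abs(m_r(20.0, 1.0, **pars) - math.log2(beta)) < 1e-3

# All fold-change curves pass through (A0, M0) = (0.5*log2(a_R*a_G), log2(a_R/a_G)).
A0 = 0.5 * math.log2(20.0 * 200.0)
M0 = math.log2(20.0 / 200.0)
for r in (0.25, 1.0, 4.0):
    assert abs(m_r(A0, r, **pars) - M0) < 1e-9
```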

#### Bias in the log-ratios

From (9) we see that the bias in the log-ratios introduced by the affine transform is intensity dependent. This non-linearity can be observed as a propeller-shaped graph in Figure 2, where the log-ratios under the affine transform {\mathcal{A}}_{1} are plotted against the true log-ratios at different log-intensity levels. If a regression line is fitted between the affine-transformed log-ratios and the true log-ratios, the slope will always be *less* than one. Moreover, this is true for all normalization methods that do not overcompensate for channel offsets. This may explain why some studies, including a recent one [19], show that cDNA microarrays tend to compress the absolute log-ratios compared to oligonucleotide arrays and QRT-PCR [16–18]; the channel offsets in cDNA microarrays are probably larger. When [20] compared cDNA microarray log-ratios to Northern-blot log-ratios for their background-correction method, they found similar behavior, which emphasizes the close relationship between offset and background estimates. We will return to this later. The same pattern is seen in an *M* versus *M* scatter plot of non-normalized versus (affine) normalized data; see the right scatter plot in Figure 2. To visualize the intensity dependency of the log-ratios, only data points at certain log-intensity levels are plotted. For details on data, see Methods.

### Normalization in general

Depending on the design of the microarray experiment, we expect to observe different types of patterns in data. A typical example is where a subset of the genes studied is expected to be non-differentially expressed in a test sample compared to a reference sample. However, it is common that the patterns of the *observed* expression levels are not in line with the expected patterns of the *true* expression levels. Whenever this happens, various strategies can be adopted in order to make the normalized data meet the expectations. Normalization of microarray data is about identifying and removing such artifactual variations that are not due to noise or natural variability. An example is the intensity-dependent log-ratio artifact.

In the following section we will, with the affine model in mind, revisit various more or less well known normalization methods that directly or indirectly remove intensity-dependent artifacts. With the gained knowledge, we then propose a generic and robust multi-dimensional normalization method for affine transformed data.

To be more precise in what follows, we will refer to methods that correct for differences between observed and expected data, that is, that conform the signals to a standard or a norm, as *normalization methods*, where normalization has the meaning of conforming to expectations. Sometimes *calibration data*, also known as control data, which contains true relative or absolute expression levels, is available. Such data can be used to correct for discrepancies between observed and true expression levels. We refer to methods that use calibration (read *known*) data points to correct for artifacts as *calibration methods*. In this category we also include methods that are based on models for which we can find the inverse of the measurement function. For precise definitions, see the introduction of [21]. Calibration methods are not discussed further in this paper.

Typically, a normalization method is only capable of estimating *α* = *a*_{R} - *βa*_{G} for *r* = 1 in (9), and not the individual offset terms. This is because the often-used *assumption that most genes are non-differentially expressed* (and/or that there is an equal amount of up- and down-regulated genes) will only help us identify one fold-change curve, namely log_{2}*r* = 0. For a normalization method, like most calibration methods, to be able to estimate both *a*_{R} and *a*_{G}, more constraints are needed, and without known data this can only be done based on more assumptions. As more research is needed, we will not elaborate on such additional assumptions in this paper. Thus, the rest of this paper will only discuss normalization methods based on the commonly accepted assumption that it is possible to identify a set of genes that can be used to normalize the non-differentially expressed genes.

### Curve-fit normalization revisited

When [4] first observed the intensity-dependent effects on the log-ratios, they suggested a curve-fit normalization method that is often referred to as *lo(w)ess normalization*. The simplest version of this assumes that the majority of the genes are non-differentially expressed regardless of expression level, and for this reason the log-ratios are expected to be centered around zero at all intensities. Under the above assumption, curves estimated using robust local regression methods such as lowess [22, 23] or loess [24], or curves modeled by smoothing splines [25], will be good approximations of the *m*_{r=1}(*A*) function, which then can be subtracted from the observed log-ratios

*M* ← *M* - *m*_{r=1}(*A*) = *m*_{r}(*A*) - *m*_{r=1}(*A*). (11)

Under an affine transform, *m*_{r}(*A*) and *m*_{r=1}(*A*) are as in (9), but we do not know of a closed-form expression for (11). An example of a curve-fit normalization under the affine transform is depicted in Figure 3. Note that the asymmetry between up- and down-regulated genes is *not* corrected for. Moreover, if we look at the overlaid (log_{2}*x*_{G}, log_{2}*x*_{R}) grid in the left graph of Figure 3, we find that the curve-fit normalization warps it and removes the otherwise orthogonal relationship between log_{2}*x*_{R} and log_{2}*x*_{G} (if the (2*A*, *M*) plane is considered).
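A sketch of curve-fit normalization (11) under the assumption that most genes are non-differentially expressed. For self-containment, a binned running median stands in for the lowess/loess or spline fit used in practice; the function name and bin scheme are ours:

```python
import numpy as np

def curve_fit_normalize(M, A, n_bins=20):
    """Curve-fit normalization (11): estimate m_{r=1}(A) and subtract it from M.

    A binned running median stands in for the lowess/loess fit; it relies on the
    same assumption that most genes are non-differentially expressed.
    """
    M, A = np.asarray(M, float), np.asarray(A, float)
    edges = np.quantile(A, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.digitize(A, edges[1:-1])               # bin index in 0..n_bins-1
    centers = np.array([np.median(A[idx == k]) for k in range(n_bins)])
    medians = np.array([np.median(M[idx == k]) for k in range(n_bins)])
    return M - np.interp(A, centers, medians)       # subtract the estimated curve

# Non-differentially expressed genes observed through the affine transform A_1.
rng = np.random.default_rng(0)
x = rng.uniform(1, 5000, 2000)                      # true (equal) expression in both channels
yG, yR = 200 + 1.4 * x, 20 + 0.8 * x                # observed signals
M = np.log2(yR / yG)
A = 0.5 * np.log2(yR * yG)
M_norm = curve_fit_normalize(M, A)
assert abs(np.median(M_norm)) < 0.05                # curvature and bias largely removed
```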

### Perpendicular translation normalization revisited

The perpendicular (shift-log) normalization method proposed by [13] corrects for differences in the channel offsets. It normalizes log-ratios using a translation transform where a constant, *a* ∈ ℝ, is added to the signals in one channel and subtracted from the other;

\begin{array}{c}{y}_{R,i}\leftarrow {a}_{R}+{b}_{R}{x}_{R,i}+a;\forall i\\ {y}_{G,i}\leftarrow {a}_{G}+{b}_{G}{x}_{G,i}-a;\forall i.\end{array}\left(12\right)

We refer to this translation normalization transform as the *perpendicular translation normalization*, because it moves (*x*_{G}, *x*_{R}) perpendicular to the *x*_{R} = *x*_{G} line. From (9), we get that the observed log-ratios *m*_{r}(*A*) can be made independent of the intensities if and only if

a=\frac{r{b}_{R}{a}_{G}-{b}_{G}{a}_{R}}{{b}_{G}+r{b}_{R}},r>0.\left(13\right)

As this is a function of *r*, it is only for a single fold change at a time that this method can make *M* independent of *A*. The most common choice is *r* = 1, for which the optimal perpendicular shift is

a=\frac{{b}_{R}{a}_{G}-{b}_{G}{a}_{R}}{{b}_{G}+{b}_{R}},\left(14\right)

which is the weighted difference between *a*_{R} and *a*_{G} with weights *b*_{G}/(*b*_{G} + *b*_{R}) and *b*_{R}/(*b*_{G} + *b*_{R}), respectively. The distance from the *r* = 1 curve to the *M* = 0 curve for the optimal perpendicular shift is log_{2}*β*. In other words, the perpendicular shift normalization will not remove an overall bias in the log-ratios (although it is not hard to estimate *β* afterward). The optimal shift for {\mathcal{A}}_{1} is *a* = 60, with log_{2}*β* ≈ -0.81. The result of this normalization is depicted in Figure 4. Note that *m*_{r}(*A*) after normalization is constant for *r* = 1.

As suggested by [13], one way to find the optimal shift *a* is to minimize the curvature by minimizing the variation of the log-ratios after applying the shift *a*. To do this robustly, the median absolute deviation (MAD) can be used as a measure of variation;

\widehat{a}=\mathrm{arg}\underset{a}{\mathrm{min}}\underset{1\le i\le I}{\text{MAD}}({M}_{i}(a)).\left(15\right)

We have found that the variance of \widehat{a} is unnecessarily large.

A problem with the perpendicular translation normalization method, which is not related to estimator (15), is that the optimal shift can result in non-positive signals, making a huge number of expression ratios invalid. The normalization method discussed next does not have this problem, but on the other hand, it will work badly, or not at all, under certain conditions.
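Estimator (15) can be sketched with a coarse grid search standing in for a proper one-dimensional optimizer; shifts that would produce non-positive signals are simply skipped. Applied to noise-free data generated under {\mathcal{A}}_{1}, the search recovers the optimal perpendicular shift *a* = 60 derived above:

```python
import numpy as np

def mad(v):
    """Median absolute deviation (about the median), a robust measure of variation."""
    v = np.asarray(v, float)
    return np.median(np.abs(v - np.median(v)))

def perpendicular_shift(yR, yG, grid):
    """Estimate the perpendicular shift a by minimizing MAD(M_i(a)), eq. (15).

    Grid search sketch; shifts giving non-positive signals are skipped.
    """
    best_a, best_mad = None, np.inf
    for a in grid:
        r, g = yR + a, yG - a            # eq. (12): +a to red, -a to green
        if (r <= 0).any() or (g <= 0).any():
            continue
        m = mad(np.log2(r / g))
        if m < best_mad:
            best_a, best_mad = a, m
    return best_a

# Non-differentially expressed genes observed through A_1; the optimum is a = 60.
rng = np.random.default_rng(1)
x = rng.uniform(1, 5000, 2000)
yG, yR = 200 + 1.4 * x, 20 + 0.8 * x
a_hat = perpendicular_shift(yR, yG, grid=np.arange(-100, 201, 5))
assert abs(a_hat - 60) <= 5
```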

### Parallel translation normalization revisited

For historical reasons, but also because it contributes to our discussion about background correction, the shift-log method proposed by [26] for stabilizing (read decreasing or shrinking) the variance of the measured log-ratios is of interest. A side effect of this method is that it can correct for intensity-dependent curvature. It is based on a translation transform where the same constant, *a* ∈ ℝ, is added to the signals in both channels;

\begin{array}{c}{y}_{R,i}\leftarrow {a}_{R}+{b}_{R}{x}_{R,i}+a;\forall i\\ {y}_{G,i}\leftarrow {a}_{G}+{b}_{G}{x}_{G,i}+a;\forall i.\end{array}\left(16\right)

Because (16) moves data (*x*_{G}, *x*_{R}) parallel to the *x*_{R} = *x*_{G} line, it is referred to as the *parallel translation normalization*. Again, as this is a function of *r*, *M* can only be made independent of *A* for one unique *r* at a time, cf. (9). For *r* = 1 the optimal parallel shift is

a=\frac{{b}_{R}{a}_{G}-{b}_{G}{a}_{R}}{{b}_{G}-{b}_{R}},{b}_{G}\ne {b}_{R},\left(17\right)

which may be estimated as in (15). For example, for {\mathcal{A}}_{1} the optimal parallel shift is *a* = 220, with the *r* = 1 curve lying |log_{2}*β*| ≈ 0.81 units below the *M* = 0 line. The result of this normalization is depicted in Figure 5. From the above expression, we also see that an optimal value of *a* can indeed be negative. For example, if (*a*_{G}, *a*_{R}) = (200, 140) and (*b*_{G}, *b*_{R}) = (1.4, 0.8), the optimal parallel shift is *a* = -60, which corresponds to an effective offset of ({{a}^{\prime}}_{G}, {{a}^{\prime}}_{R}) = (140, 80). However, it can also result in non-positive signals and therefore undefined log-ratios. For example, with (*a*_{G}, *a*_{R}) = (20, 200) and (*b*_{G}, *b*_{R}) = (1.4, 0.8), the optimal parallel shift is *a* = -440, which corresponds to an effective offset of ({{a}^{\prime}}_{G}, {{a}^{\prime}}_{R}) = (-420, -240). Moreover, from (17) we see that when the scale parameters are equal there is no solution. This is because in such cases data are moved parallel to the *x*_{R} = *x*_{G} line, making it impossible to get closer. As in the case of the perpendicular shift normalization, the distance between the *r* = 1 curve and the *M* = 0 curve is log_{2}*β*. Hence, a parallel shift normalization will not remove an overall bias in the log-ratios either, and rescaling is necessary.
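The closed-form expression (17) reproduces the three worked examples above; the sketch below simply evaluates it, guarding against the degenerate case of equal scale parameters:

```python
def parallel_shift(a_R, a_G, b_R, b_G):
    """Optimal parallel shift for r = 1, eq. (17): a = (b_R*a_G - b_G*a_R) / (b_G - b_R)."""
    if b_G == b_R:
        raise ValueError("no solution when the scale parameters are equal")
    return (b_R * a_G - b_G * a_R) / (b_G - b_R)

# The three worked examples from the text, all with (b_G, b_R) = (1.4, 0.8):
assert abs(parallel_shift(a_R=20,  a_G=200, b_R=0.8, b_G=1.4) - 220) < 1e-9   # A_1
assert abs(parallel_shift(a_R=140, a_G=200, b_R=0.8, b_G=1.4) + 60)  < 1e-9   # negative shift
assert abs(parallel_shift(a_R=200, a_G=20,  b_R=0.8, b_G=1.4) + 440) < 1e-9   # non-positive signals
```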

### Single-channel translation normalization

A hybrid of the previous two methods is a normalization method that translates the signals in one of the channels at the time according to

\begin{array}{l}{y}_{R,i}\leftarrow {a}_{R}+{b}_{R}{x}_{R,i}+a\cdot I(a\ge 0);\forall i\\ {y}_{G,i}\leftarrow {a}_{G}+{b}_{G}{x}_{G,i}-a\cdot I(a<0);\forall i,\left(18\right)\end{array}

where I(·) is the indicator function and *a* ∈ ℝ. This will not generate non-positive signals, as only positive translations are applied. Moreover, because only one channel is shifted, an optimal shift will always be found.
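The indicator logic in (18) amounts to shifting only one channel, upward, regardless of the sign of *a*; a minimal sketch (the function name is ours):

```python
def single_channel_shift(yR, yG, a):
    """Single-channel translation (18): shift the red channel by a if a >= 0,
    otherwise shift the green channel by -a (> 0); only one channel moves."""
    if a >= 0:
        return yR + a, yG
    return yR, yG - a        # note: -a > 0, so the green signals are shifted upwards

assert single_channel_shift(100.0, 50.0, 10.0) == (110.0, 50.0)
assert single_channel_shift(100.0, 50.0, -10.0) == (100.0, 60.0)
```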

### Rescale normalization

The above translation normalization methods remove curvature for non-differentially expressed genes by adjusting the offset parameters in *α* = *a*_{R} - *βa*_{G}, keeping the relative scale *β* fixed. Similarly, if the offset parameters are kept fixed, curvature can be removed by adjusting the relative scale *β*. In [11] we show that the scanner may introduce scale (PMT) insensitive (read *fixed*) biases in the channels. Thus, by adjusting the PMT settings such that the curvature of the pre-scanned data is as small as possible, one minimizes |*α*| = |*a*_{R} - *βa*_{G}|. Indeed, this strategy may in practice be used by many. However, from the above we know that this can equally well be done numerically. It is much more important to adjust the PMT (and laser) settings such that the dynamic range of the signals is as large as possible. Furthermore, as scanner settings are often adjusted for each array separately, there will be a discrepancy between arrays, which in any case has to be normalized for.

### Dye-swap normalization revisited

*Dye-swap normalization*, also known as *reverse labeling* and *paired-slides normalization*, is a balanced experimental design for two-color microarrays that can be used whenever two technically replicated hybridizations are available. Consider an experiment with two sets of cell populations, *A* and *B*, for which relative gene expressions, {*r*_{i}}_{i}, are to be investigated. After cDNA is obtained through reverse transcription, the two samples are each split into two identical parts, one which is labeled with a red fluorescent dye and one which is labeled with a green fluorescent dye. The red cDNA cocktail from sample *A* is mixed with the green ditto from sample *B* and co-hybridized to the DNA on the first array. After scanning, expression levels {\{({f}_{{G}_{1}}({x}_{B,i}),{f}_{{R}_{1}}({x}_{A,i}))\}}_{i} are observed. The same is done for the remaining red-green pair, for which {\{({f}_{{G}_{2}}({x}_{A,i}),{f}_{{R}_{2}}({x}_{B,i}))\}}_{i} are observed. Dropping gene index *i*, the dye-swap normalization suggested by [27] is

\begin{array}{c}M=\frac{1}{2}({M}_{1}+{M}_{2})\\ =\frac{1}{2}\left({\mathrm{log}}_{2}\frac{{f}_{{R}_{1}}({x}_{A})}{{f}_{{G}_{1}}({x}_{B})}-{\mathrm{log}}_{2}\frac{{f}_{{R}_{2}}({x}_{B})}{{f}_{{G}_{2}}({x}_{A})}\right)\\ =\frac{1}{2}\left({\mathrm{log}}_{2}\frac{{f}_{{R}_{1}}({x}_{A})}{{f}_{{R}_{2}}({x}_{B})}+{\mathrm{log}}_{2}\frac{{f}_{{G}_{2}}({x}_{A})}{{f}_{{G}_{1}}({x}_{B})}\right)\\ =\frac{1}{2}({{M}^{\prime}}_{1}+{{M}^{\prime}}_{2})\end{array}\left(19\right)

and similarly for the log-intensities

\begin{array}{c}A=\frac{1}{2}({A}_{1}+{A}_{2})\\ =\frac{{\mathrm{log}}_{2}({f}_{{R}_{1}}({x}_{A}){f}_{{G}_{1}}({x}_{B}))+{\mathrm{log}}_{2}({f}_{{R}_{2}}({x}_{B}){f}_{{G}_{2}}({x}_{A}))}{4}\\ =\frac{{\mathrm{log}}_{2}({f}_{{R}_{1}}({x}_{A}){f}_{{R}_{2}}({x}_{B}))+{\mathrm{log}}_{2}({f}_{{G}_{2}}({x}_{A}){f}_{{G}_{1}}({x}_{B}))}{4}\\ =\frac{1}{2}({{A}^{\prime}}_{1}+{{A}^{\prime}}_{2}).\end{array}\left(20\right)

Thus, the result of a dye-swap can be written as the average of two "virtual" hybridizations ({{A}^{\prime}}_{1}, {{M}^{\prime}}_{1}) and ({{A}^{\prime}}_{2}, {{M}^{\prime}}_{2}). Moreover, if (and only if) the measurement functions of each dye are equal across the two arrays, that is, {f}_{{R}_{1}}={f}_{{R}_{2}} and {f}_{{G}_{1}}={f}_{{G}_{2}}, then the observed ratios will be identical to the true ratios *for non-differentially expressed genes*. For this to be true for differentially expressed genes as well, we know that the measurement functions also have to be linear, that is, affine with zero intercept.
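The self-normalization property of (19) can be illustrated numerically: with equal per-dye measurement functions across the two arrays, the dye-swap average recovers *M* = 0 exactly for non-differentially expressed genes, while affine offsets still compress the log-ratios of differentially expressed ones. The affine channels below are illustrative only:

```python
import math

def dye_swap_M(x_A, x_B, f_R, f_G):
    """Dye-swap normalized log-ratio (19), assuming f_R1 = f_R2 = f_R and f_G1 = f_G2 = f_G."""
    M1 = math.log2(f_R(x_A) / f_G(x_B))   # array 1: sample A in red, sample B in green
    M2 = math.log2(f_R(x_B) / f_G(x_A))   # array 2: dyes swapped
    return 0.5 * (M1 - M2)                # eq. (19): half the difference of the observed log-ratios

f_R = lambda x: 20 + 0.8 * x   # illustrative affine channels (offsets differ between dyes)
f_G = lambda x: 200 + 1.4 * x

# Non-differentially expressed gene (x_A = x_B): the true log-ratio 0 is recovered exactly.
assert abs(dye_swap_M(300.0, 300.0, f_R, f_G)) < 1e-12

# Differentially expressed gene (true log2 ratio = 1): the offsets still compress |M| below 1.
assert 0 < abs(dye_swap_M(600.0, 300.0, f_R, f_G)) < 1.0
```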

Several authors [28, 29] have reported that dye-swap normalization does remove curvature, but less successful results have also been reported [30]. To better understand why and when dye-swap normalization works or not, we dissect the measurement functions *f*_{c} of the four channels *c* = *R*_{1}, *G*_{1}, *R*_{2}, *G*_{2} into (*v*_{c} ◦ *u*_{c} ◦ *t*_{c} ◦ *s*_{c}), where *s*_{c} models all steps up to the point where the (not yet labeled) cDNA sample is obtained, *t*_{c} models the labeling, *u*_{c} models the following steps including the hybridization, and *v*_{c} models the scanning etc. As channels *R*_{1} and *G*_{2} are from sample *A* and the other two are from sample *B*, we know that {s}_{{R}_{1}}={s}_{{G}_{2}}={s}_{A} and {s}_{{R}_{2}}={s}_{{G}_{1}}={s}_{B}. Furthermore, if the labeling process is well controlled, we can assume that {t}_{{R}_{1}}\approx {t}_{{R}_{2}}\approx {t}_{R} and {t}_{{G}_{1}}\approx {t}_{{G}_{2}}\approx {t}_{G}. As channels *R*_{1} and *G*_{1} are hybridized to array 1 and the other two to array 2, we have that {u}_{{R}_{1}}\approx {u}_{{G}_{1}}\approx {u}_{1} and {u}_{{R}_{2}}\approx {u}_{{G}_{2}}\approx {u}_{2}. Moreover, if the same scanner settings are used for both arrays and everything else is equal, we have that {v}_{{R}_{1}}\approx {v}_{{R}_{2}}\approx {v}_{R} and {v}_{{G}_{1}}\approx {v}_{{G}_{2}}\approx {v}_{G}. The overall measurement functions for the channels are then approximately

\begin{array}{c}{f}_{{R}_{1}}\approx {v}_{R}\circ {u}_{1}\circ {t}_{R}\circ {s}_{A}\\ {f}_{{G}_{1}}\approx {v}_{G}\circ {u}_{1}\circ {t}_{G}\circ {s}_{B}\\ {f}_{{R}_{2}}\approx {v}_{R}\circ {u}_{2}\circ {t}_{R}\circ {s}_{B}\\ {f}_{{G}_{2}}\approx {v}_{G}\circ {u}_{2}\circ {t}_{G}\circ {s}_{A}.\end{array}\left(21\right)

For the dye-swap normalization to be efficient, we conclude that we must control the process of extracting the RNA etc. to an extent such that we can expect *s*_{A} ≈ *s*_{B}. Moreover, we must also be able to reproduce hybridizations well enough such that *u*_{1} ≈ *u*_{2}. If these requirements are met, data will be self-normalized. Turning to the affine model, from (19) we have, if {f}_{{R}_{1}}={f}_{{R}_{2}} and {f}_{{G}_{1}}={f}_{{G}_{2}}, that a dye-swap normalization of affinely transformed data gives

\begin{array}{l}{{M}^{\prime}}_{1}={\mathrm{log}}_{2}\frac{{a}_{R}+{b}_{R}{x}_{A}}{{a}_{R}+{b}_{R}{x}_{B}},\\ {{M}^{\prime}}_{2}={\mathrm{log}}_{2}\frac{{a}_{G}+{b}_{G}{x}_{A}}{{a}_{G}+{b}_{G}{x}_{B}},\left(22\right)\end{array}

and similarly for {{A}^{\prime}}_{1} and {{A}^{\prime}}_{2}. For both virtual arrays, the signals in both channels have undergone identical affine transformations. We know from before that identical transformations in both channels do not introduce curvature for the non-differentially expressed genes and that symmetry between up- and down-regulated genes is preserved, cf. perpendicular and parallel shift normalization. If the offsets in either of the two replicated channels are not equal ({a}_{{R}_{1}}\ne {a}_{{R}_{2}} or {a}_{{G}_{1}}\ne {a}_{{G}_{2}}), the dye-swap normalization will not work.

The above discussion assumed that the same cell samples have been replicated. If biological replicates are used, an additional source of variability is introduced. However, as long as it is possible to assume that for most genes {x}_{{A}_{1}}\approx {x}_{{A}_{2}} and {x}_{{B}_{1}}\approx {x}_{{B}_{2}}, dye-swap normalization should still perform well.

In [11] we observed that scanners can introduce channel-specific offsets that are stable over time, i.e. {a}_{{R}_{1}}={a}_{{R}_{2}} and {a}_{{G}_{1}}={a}_{{G}_{2}}. Assume that everything else is perfect, but the PMT is adjusted separately for each array, resulting in {b}_{{R}_{1}}\ne {b}_{{R}_{2}} or {b}_{{G}_{1}}\ne {b}_{{G}_{2}}, so that (22) is not obtained. This may be a reason why dye-swap normalization sometimes fails.

### Alternative dye-swap normalization

An alternative dye-swap normalization method is to average the observed expression levels *before* taking the logarithm

\begin{array}{c}M={\mathrm{log}}_{2}\frac{({f}_{{R}_{1}}({x}_{A})+{f}_{{G}_{2}}({x}_{A}))/2}{({f}_{{R}_{2}}({x}_{B})+{f}_{{G}_{1}}({x}_{B}))/2}\\ ={\mathrm{log}}_{2}\frac{{f}_{{R}_{1}}({x}_{A})+{f}_{{G}_{2}}({x}_{A})}{{f}_{{R}_{2}}({x}_{B})+{f}_{{G}_{1}}({x}_{B})},\end{array}\left(23\right)

and analogously for *A*. This approach uses the *arithmetic mean* of the observed signals whereas the previous dye-swap method used the *geometric mean*. To be able to say more about the difference between the two approaches, we turn to the affine transformation for which we have

\begin{array}{c}M={\mathrm{log}}_{2}\frac{{a}^{\prime}+{b}^{\prime}{x}_{A}}{{a}^{\prime}+{b}^{\prime}{x}_{B}}\\ A=\frac{{\mathrm{log}}_{2}({a}^{\prime}+{b}^{\prime}{x}_{A})({a}^{\prime}+{b}^{\prime}{x}_{B})}{2}\end{array}\left(24\right)

where *a*' = (*a*_{R} + *a*_{G})/2 and *b*' = (*b*_{R} + *b*_{G})/2. Again, we note that the dye-swap method makes the transforms in the resulting two virtual channels equal. Comparing the bias in log-intensities between the geometric and the arithmetic approaches, for the latter we have

{A}_{0}={\mathrm{log}}_{2}\frac{{a}_{R}+{a}_{G}}{2}\left(25\right)

whereas for the former we have

{A}_{0}={\mathrm{log}}_{2}\sqrt{{a}_{R}{a}_{G}}.\left(26\right)

Because (*a*_{R} + *a*_{G})/2 ≥ \sqrt{{a}_{R}{a}_{G}}, we conclude that the bias is always at least as large for the arithmetic as for the geometric dye swap. However, there are other differences too. For instance, if each microarray glass slide (the *u*_{c} functions above) introduces the same offset to both channels and this offset is different between arrays, but otherwise everything else is the same, that is, {a}_{{R}_{2}}={a}_{{R}_{1}}+a and {a}_{{G}_{2}}={a}_{{G}_{1}}+a, then the geometric dye swap fails whereas the arithmetic dye swap succeeds in removing the curvature.
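The comparison of (25) and (26) is a direct instance of the AM-GM inequality, which the sketch below verifies for a few illustrative offset pairs (function names are ours):

```python
import math

def A0_arithmetic(a_R, a_G):
    """Log-intensity bias under the arithmetic dye swap, eq. (25)."""
    return math.log2((a_R + a_G) / 2.0)

def A0_geometric(a_R, a_G):
    """Log-intensity bias under the geometric dye swap, eq. (26)."""
    return math.log2(math.sqrt(a_R * a_G))

# By the AM-GM inequality the arithmetic bias is never smaller, with equality iff a_R = a_G.
for a_R, a_G in [(20.0, 200.0), (140.0, 200.0), (50.0, 50.0)]:
    assert A0_arithmetic(a_R, a_G) >= A0_geometric(a_R, a_G) - 1e-12
```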

### Two-channel quantile normalization

Two-channel, or in general multi-channel, *quantile normalization* [31, 32] relies on the *assumption that the true gene-expression levels in the two biological samples are approximately equally distributed*. If the measurement functions in the two channels, say *f*_{R} and *f*_{G}, are different, then the distributions of the measured signals in the two channels are different even if the underlying distributions of true expression levels are identical. By estimating the distributions of the two channels and making them equal, for instance equal to an average distribution, the log-ratios for the *non-differentially* expressed genes become unbiased and independent of the intensities. Thus, making the density functions of measured data equal for the two channels is the same as making their transformation functions equal, say to *f*_{RG}, which makes *M* independent of *A* for non-differentially expressed genes. If *f*_{RG} could be made linear too, this would hold for all fold changes.
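The mechanics can be illustrated with a minimal two-channel quantile normalization sketch (Python; the data values are made up and ties are not handled): each channel's sorted values are replaced by the rank-wise average of the two channels, which forces the two empirical distributions to coincide.

```python
# Minimal two-channel quantile normalization sketch (pure Python, no tie
# handling); the signal values are invented for illustration.
def quantile_normalize(ch1, ch2):
    """Force both channels to share the same empirical distribution by
    replacing each value with the average of the two channels' values
    at the same rank."""
    n = len(ch1)
    order1 = sorted(range(n), key=lambda i: ch1[i])
    order2 = sorted(range(n), key=lambda i: ch2[i])
    mean_q = [(ch1[order1[r]] + ch2[order2[r]]) / 2 for r in range(n)]
    out1, out2 = [0.0] * n, [0.0] * n
    for r in range(n):
        out1[order1[r]] = mean_q[r]
        out2[order2[r]] = mean_q[r]
    return out1, out2

R = [120.0, 45.0, 300.0, 80.0]
G = [60.0, 20.0, 150.0, 40.0]   # same shape, different scale/offset
Rn, Gn = quantile_normalize(R, G)
print(sorted(Rn) == sorted(Gn))  # identical distributions afterwards
```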

For affine transformations, two-channel quantile normalization removes intensity-dependent effects, because the offsets *a*_{R} and *a*_{G} are identical after normalization. In addition, the constant log-ratio bias log_{2}*β* is also removed. Hence, two-channel quantile normalization can be considered both a method that corrects for differences in offset between two channels and a method that corrects for biases in the expression ratios. Figure 6 depicts the quantile normalization of {\mathcal{A}}_{1} transformed data. The curvature for non-differentially expressed genes is removed.

### Background subtraction as a normalization method

We have observed that log-ratios of *background signals* show the same intensity-dependent effects as those of *foreground signals*, which suggests that background signals undergo the same transformation as foreground signals. An example of this is shown in Figure 7, where background and foreground estimates are plotted in the same *M* versus *A* scatter plots. A probable reason for this is the existence of scanner biases [11]. A widely adopted rationale for background correction is the assumption that the region that defines the spot is contaminated with the same physical noise that can be observed in the surrounding regions. Background noise is believed to be due to dust particles, DNA-contaminated buffers, failed washing during printing or hybridization, cross-hybridization, etc. [20, 33]. This type of background noise is often assumed to add to the foreground signal. Thus, in order to obtain true signals, the background is subtracted from the foreground signal as

{y}_{c,i}\leftarrow {y}_{c,i}^{(\text{fg})}-{y}_{c,i}^{(\text{bg})}\left(27\right)

where {y}_{c,i}^{(\text{fg})} is the estimated foreground signal and {y}_{c,i}^{(\text{bg})} is the estimated background signal for channel *c* and spot *i*. Under a transformation that is dominated by an affine function at lower intensities (of the same level as the background), subtracting the background from the foreground will shift the biases toward zero, and background-subtracted signals will have less curvature in the (*A*,*M*) plane than non-background-subtracted signals (not shown). In this sense we can consider background subtraction to be a normalization method. However, just because the log-ratios become flatter as a function of the log-intensities, it does not follow that foreground regions are contaminated by the same noise as background regions; unnecessary noise may be introduced. Instead, it may be that the background estimates from the image analysis *happen* to be close to a non-image-related offset in the foreground signals. Moreover, different image-analysis software packages estimate the background signal with different algorithms, such as fixed-size circles, adaptive circles, morphological estimates, and pixel-intensity distributions. Although comparative studies have been conducted [10, 34], it is still not clear which background estimate is most correct. Some methods give higher background estimates than others, which means that they all correct for channel biases by different amounts, which, incidentally, is another argument for the existence of channel offsets. One method that makes use of this is [20], which emphasizes that the true signal can *not* be negative and uses a Bayesian approach to correct for this.
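A minimal sketch of eq. (27) in Python (all numbers are invented) also shows the practical complication that subtraction can produce non-positive signals at low intensities, for which log-ratios are undefined:

```python
# Sketch of eq. (27): per-spot background subtraction for one channel.
# All numeric values are made up for illustration.
fg = [55.0, 210.0, 48.0, 1200.0]   # estimated foreground signals y_fg
bg = [50.0, 52.0, 51.0, 49.0]      # estimated background signals y_bg
corrected = [f - b for f, b in zip(fg, bg)]
print(corrected)
# Low-intensity spots can end up non-positive, so their log-ratios are
# undefined -- one motivation for model-based corrections such as [20].
negatives = sum(c <= 0 for c in corrected)
print(negatives)
```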

### Result of a (relative) negative translation

If too much background is subtracted, or a threshold has to be passed before the reverse transcription takes place, one can imagine that *a*_{G}, *a*_{R} < 0. Negative bias also applies if the observed signals are compared, not to the true signals, but to the signals obtained by another measurement technique that has a larger bias. Examples of such comparisons are two-color microarray data compared to oligonucleotide (Affymetrix) data, or two-color microarray data compared to QRT-PCR data. Negative bias may also be observed when control clones, spike-ins, negative and positive controls, etc. are compared to the genes/ESTs of interest. The effect of a negative translation is depicted in Figure 8. The fan-out effect in the fold-change curves at the lower intensities is due to the negative translation. Note that this should not be mistaken for the fan-out effect due to decreasing signal-to-noise levels, in the same way as the lack of a fan-out effect due to a positive offset should not be mistaken for low noise.

### Robust affine normalization

From the above discussion, it is clear that it is essential to correct for channel offsets when normalizing gene-expression data. For two-channel data, we can obtain estimates of *a*_{R}, *a*_{G} and *β* as follows. For non-differentially expressed genes (without noise) we have that

*y*_{R,i}= *α* + *βy*_{G,i}; ∀*i*  (28)

with *α* = *a*_{R} - *βa*_{G} and *β* = *b*_{R}/*b*_{G}. Define y={\left\{{y}_{i}\right\}}_{i=1}^{I} where **y**_{i} = (*y*_{G,i}, *y*_{R,i}) and let

Q(\alpha ,\beta ;y)={\displaystyle \sum _{i=1}^{I}{w}_{i}{d}_{i}}{(\alpha ,\beta ;{y}_{i})}^{2}\left(29\right)

be our objective function, where *d*_{i}(*α*, *β*; **y**_{i}) > 0 is the orthogonal Euclidean distance between **y**_{i} and the line *L*(*α*, *β*) with intercept *α* and slope *β*. The estimates of *α* and *β* are then

(\widehat{\alpha},\widehat{\beta})=\mathrm{arg}\underset{(\alpha ,\beta )}{\mathrm{min}}Q(\alpha ,\beta ;y).\left(30\right)

With *w*_{i} = 1 for all observations we obtain standard principal component analysis (PCA), which minimizes the orthogonal distances in the *L*_{2} norm [35]. With *w*_{i} ≠ 1, (sample-) weighted PCA (WPCA), a special case of generalized PCA, is obtained [35, 36]. With weights *w*_{i} = 1/(*d*_{i}(\widehat{\alpha}, \widehat{\beta}; **y**_{i}) + *δ*) we can minimize the distances in the *L*_{1} norm, if we let *δ* → 0^{+}. The distance *d*_{i}(\widehat{\alpha}, \widehat{\beta}; **y**_{i}), which equals the sum of squares of the values of all but the first principal component, was first suggested by [37]. Thus, our choice of weight function down-weighs outliers as defined by [37] in order to obtain a robust estimate of *L*(*α*, *β*) corresponding to the first principal component. Our procedure is related to principal component analysis applied to an M-estimator of the covariance (scatter) matrix of data. The main difference is that we use weights *w* = *w*(*d*_{i}) = 1/(*δ* + *d*_{i}) based on the orthogonal distance *d*_{i} from **y**_{i} to *L*(*α*, *β*), whereas for M-estimation one uses weights *w* = *w*(*d*_{i}) based on a robustified Mahalanobis distance of **y**_{i}, which is computed from an M-estimator of the covariance matrix of data. M-estimation of location and scatter was first defined by [38], and subsequently applied to principal component analysis by [39]. For other, more recent papers on robust multivariate analysis, see [40, 41] and the references therein. Alternative robust estimators can be obtained by choosing other weight functions *w*(*d*_{i}), but we choose to optimize in *L*_{1}. Moreover, if one suspects a non-symmetric distribution of data points around the line, a trimmed version of the weight function may be considered. In practice, the above optimization can be performed by an *iterative reweighted principal component analysis* (IWPCA) scheme: for iteration *l* = 1,2,..., minimize (29) using WPCA, where {w}_{i}^{(1)} = 1 and {w}_{i}^{(l+1)} = 1/(*d*_{i}(*α*^{(l)}, *β*^{(l)}; **y**_{i}) + *δ*), with *δ* a small positive number that avoids infinite weights.
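The IWPCA scheme for the two-channel case can be sketched as follows (Python rather than the authors' R implementation; the synthetic data, iteration count, and outlier values are illustrative assumptions). Each iteration fits the first principal axis of the weighted 2-D scatter and then reweights by the inverse orthogonal distance:

```python
import math

# IWPCA sketch for the two-channel case, cf. eqs. (29)-(30): fit a robust
# line y_R = alpha + beta * y_G by iteratively reweighted PCA. delta = 0.02
# follows the text; data and iteration count are invented for illustration.
def wpca_line(ys, w):
    """One weighted-PCA step: return (alpha, beta) of the first
    principal axis of the weighted 2-D scatter of ys = [(y_G, y_R)]."""
    sw = sum(w)
    mx = sum(wi * g for wi, (g, r) in zip(w, ys)) / sw
    my = sum(wi * r for wi, (g, r) in zip(w, ys)) / sw
    sxx = sum(wi * (g - mx) ** 2 for wi, (g, r) in zip(w, ys)) / sw
    syy = sum(wi * (r - my) ** 2 for wi, (g, r) in zip(w, ys)) / sw
    sxy = sum(wi * (g - mx) * (r - my) for wi, (g, r) in zip(w, ys)) / sw
    # angle of the leading eigenvector of [[sxx, sxy], [sxy, syy]]
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    beta = math.tan(theta)
    return my - beta * mx, beta

def iwpca(ys, delta=0.02, iterations=50):
    w = [1.0] * len(ys)
    for _ in range(iterations):
        alpha, beta = wpca_line(ys, w)
        norm = math.sqrt(1 + beta ** 2)
        d = [abs(r - alpha - beta * g) / norm for g, r in ys]
        w = [1.0 / (di + delta) for di in d]   # L1-like reweighting
    return alpha, beta

# Points on y_R = 10 + 2*y_G plus two gross outliers.
ys = [(g, 10 + 2 * g) for g in range(1, 30)] + [(5.0, 120.0), (25.0, 5.0)]
alpha, beta = iwpca(ys)
print(round(alpha, 2), round(beta, 2))
```

With equal weights, the two outliers tilt the fit noticeably; the reweighting pulls the line back to the collinear majority, which is the robustness the text describes.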

As a last step, in order to get estimates of the four parameters *a*_{R}, *a*_{G}, *b*_{R}, and *b*_{G} from the two parameter estimates \widehat{\alpha} and \widehat{\beta}, we introduce additional constraints. Let *y*_{c,(1)} = min_{i} *y*_{c,i} for *c* = {*R, G*} and choose

\begin{array}{l}{\widehat{b}}_{G}=1\\ {\widehat{b}}_{R}=\widehat{\beta}\end{array}\left(31\right)

\begin{array}{c}{\widehat{a}}_{G}=\mathrm{max}\{{a}_{G};{a}_{G}<{y}_{G,(1)}\wedge \widehat{\alpha}+\widehat{\beta}{a}_{G}<{y}_{R,(1)}\}\\ {\widehat{a}}_{R}=\widehat{\alpha}+\widehat{\beta}{\widehat{a}}_{G}\end{array}\left(32\right)

to be the estimates of the bias and the scale parameters in model (3). Constraint (32) is only correct in the noise-free case. If we allow noise, say

*y*_{c,i}= *a*_{c} + *b*_{c}*x*_{c,i} + *ε*_{c,i},  (33)

where *E*[*ε*_{c,i}] = 0 and *V*[*ε*_{c,i}] = {\sigma}_{c,i}^{2} for *c* = {*R, G*}, it is possible that the bias terms *a*_{R} and *a*_{G} are larger than the smallest observed value in the respective channel. This is especially important if the distributions of *ε*_{c,i} for *c* = {*R, G*} have heavy negative tails. An alternative, which introduces negative estimates, is to replace *y*_{c,(1)} in (32) with *y*_{c,(j)} for some order index (*j*) such that *j* - 1 non-positive signals are obtained in channel *c*. Choosing an optimal value of *j* is currently being investigated by the authors, but is beyond the scope of this article. Furthermore, it has been observed that the noise in each channel is roughly proportional to the signal strength, that is, *σ*_{c,i} ∝ *x*_{c,i}. Thus, a positive side effect of the above estimation algorithm is that, contrary to having equal weights for all spots (*w*_{i} = 1), more weight is given to low-intensity spots than to high-intensity ones. This makes the method more robust to saturation and other non-linear effects that might occur at high intensities, effects for which classical line fits, which rely on homoscedasticity, would fail. Finally, with the backward transformation (4) based on the estimates ({\widehat{a}}_{G}, {\widehat{a}}_{R}, {\widehat{b}}_{G}, {\widehat{b}}_{R}), data are translated and rotated such that they fall around the diagonal line through (0, 0) and (1, 1).
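The identification step (31)-(32) and the subsequent backward transformation can be sketched as follows (Python; noise-free synthetic data, with \widehat{\alpha} and \widehat{\beta} assumed given, e.g. from the IWPCA fit):

```python
# Sketch of eqs. (31)-(32) plus the backward transform, assuming
# noise-free data; alpha_hat, beta_hat and the signals are invented.
alpha_hat, beta_hat = -10.0, 2.0

yG = [20.0, 35.0, 60.0, 110.0]
yR = [alpha_hat + beta_hat * g for g in yG]    # exactly on the fitted line

b_G, b_R = 1.0, beta_hat                        # eq. (31)
# eq. (32): the largest a_G not exceeding the smallest observed y_G and
# whose image alpha_hat + beta_hat * a_G does not exceed the smallest y_R
a_G = min(min(yG), (min(yR) - alpha_hat) / beta_hat)
a_R = alpha_hat + beta_hat * a_G

# backward transform: x_hat = (y - a_c) / b_c puts both channels on the
# diagonal through (0, 0) and (1, 1), with no negative signals here
xG = [(y - a_G) / b_G for y in yG]
xR = [(y - a_R) / b_R for y in yR]
print(xG == xR)
```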

To illustrate the affine normalization method, we have applied it to six two-color microarray data sets, each containing 240 spike-in controls designed to have log_{2} *r* = (-2, 0, +2) at various intensities (see also Methods). These controls were not used to estimate the normalization parameters. As shown in Figure 9, which is for one of the arrays, there is a small curvature for non-differentially expressed genes (and spike-ins) before normalization, a curvature that corresponds to -\widehat{\alpha} ≈ +7 > 0 (small positive derivative) at log_{2} *r* = 0, cf. (10). More importantly, the intensity-dependent effect is profound for the log_{2} *r* = ±2 controls. Affine normalization allowing no negative signals removes curvature (*α* ≈ 0) for log_{2} *r* = 0, but not for the log_{2} *r* = ±2 controls, which indicates equal affine transformations in both channels, cf. the right graph of Figure 6. If 5% negative signals are allowed, the log-ratios of all controls become roughly independent of intensity, which indicates that the observed signals are proportional to the concentrations of the spike-ins. All six arrays in this study show very similar properties.

#### Generalization to multiple channels and multiple arrays

A multi-dimensional version of the above algorithm can be summarized as follows. Say there are *N* arrays, each hybridized with *K* samples (colors), such that there are in total *C* = *NK* channels. Let **y**_{i} = (*y*_{1,i},..., *y*_{K,i},..., *y*_{(N-1)K+1,i},..., *y*_{NK,i}) be the *NK* observations for gene *i*. Thus, {**y**_{i}}_{i} spans an *NK*-dimensional space. Analogously to the above two-dimensional procedure, we can fit a robust line *L* through data in ℝ^{NK} and constrain the estimate of **a** = (*a*_{1},..., *a*_{NK}) by enforcing that **a** < **y**_{i}; ∀*i*, where < is the component-wise inequality. The backward transformation (4) translates and rotates data such that they lie along the diagonal line. By normalizing all arrays at once, signals from all hybridizations are brought onto the same scale, and no further so-called between-slide scale normalization is needed.
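The multi-channel line fit can be sketched with a power iteration on the *C*-dimensional scatter matrix (Python; unweighted for brevity, since the robust weights would enter exactly as in the two-channel case, and the three-channel data below are synthetic):

```python
import math

# Sketch of the multi-channel fit: a line through the C-dimensional cloud
# {y_i}, taken as the mean plus the leading principal direction, found by
# power iteration. Robust reweighting is omitted here for brevity.
def fit_line(points, iters=200):
    n, C = len(points), len(points[0])
    mean = [sum(p[c] for p in points) / n for c in range(C)]
    centered = [[p[c] - mean[c] for c in range(C)] for p in points]
    v = [1.0] * C
    for _ in range(iters):
        # v <- Cov * v, computed as sum_i (x_i . v) x_i
        new = [0.0] * C
        for x in centered:
            s = sum(xc * vc for xc, vc in zip(x, v))
            for c in range(C):
                new[c] += s * x[c]
        norm = math.sqrt(sum(c * c for c in new))
        v = [c / norm for c in new]
    return mean, v

# three channels sharing a common x, each with its own (made-up) affine map
xs = [float(x) for x in range(1, 20)]
pts = [(5 + 1.0 * x, 20 + 2.0 * x, 1 + 0.5 * x) for x in xs]
mean, v = fit_line(pts)
# the fitted direction should be proportional to (b_1, b_2, b_3) = (1, 2, 0.5)
print([round(c / v[0], 2) for c in v])
```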

To apply the multi-dimensional normalization, we must add the assumption that most genes are non-differentially expressed for *all* possible hybridization/channel pairs. For most experimental setups this is not a problem. For instance, in two-channel microarray experiments it is common to hybridize one test sample and one reference sample to the same array, where the reference is selected such that it does not differ too much from the test sample. The same reference is then used across arrays (in either channel). Thus, since each test-reference pair is "close", all test-test pairs should be approximately "close" too. Alternatively, all reference channels can be normalized together. Then, keeping the reference signals fixed, each test channel is normalized toward the corresponding reference channel.

An implementation of the above algorithm is made available in the R [42] package named *aroma* [43], which is platform independent. In addition, the methods are available as an R plugin [44] for BASE [45]. A typical call is normalizeAffine(rg), which normalizes all arrays and all channels in the microarray object rg at once. The first parameter that has to be specified in the above algorithm is *δ*. However, its value is not critical, and we have found that, for instance, *δ* = 0.02 works well in general; it is therefore the default value. The second parameter to be specified is the number of negative signals allowed after normalization. By default the method allows 5% negative signals, but any fraction (or absolute number) of negative signals can be specified. Moreover, the method can be applied to any subsets of genes separately, such as print-tip groups, clone groups and spike-ins. Finally, support for data-point weights has been implemented so that the influence each spot has on the estimation procedure can be specified (not to be mistaken for the iterative weights above). Such weights may, for instance, be calculated from spot-quality measures obtained by image-analysis methods.