Correction of scaling mismatches in oligonucleotide microarray data

Barenco, Martino; Stark, Jaroslav; Brewer, Daniel; Tomescu, Daniela; Callard, Robin; Hubank, Michael

doi:10.1186/1471-2105-7-251

Methodology article
Open access
Published: 09 May 2006

Correction of scaling mismatches in oligonucleotide microarray data

Martino Barenco^1,2,
Jaroslav Stark^3,1,
Daniel Brewer^2,1,
Daniela Tomescu¹,
Robin Callard^1,2 &
…
Michael Hubank^1,2

BMC Bioinformatics volume 7, Article number: 251 (2006) Cite this article

6621 Accesses
7 Citations
Metrics details

Abstract

Background

Gene expression microarray data is notoriously subject to high signal variability. Moreover, unavoidable variation in the concentration of transcripts applied to microarrays may result in poor scaling of the summarized data which can hamper analytical interpretations. This is especially relevant in a systems biology context, where systematic biases in the signals of particular genes can have severe effects on subsequent analyses. Conventionally it would be necessary to replace the mismatched arrays, but individual time points cannot be rerun and inserted because of experimental variability. It would therefore be necessary to repeat the whole time series experiment, which is both impractical and expensive.

Results

We explain how scaling mismatches occur in data summarized by the popular MAS5 (GCOS; Affymetrix) algorithm, and propose a simple recursive algorithm to correct them. Its principle is to identify a set of constant genes and to use this set to rescale the microarray signals. We study the properties of the algorithm using artificially generated data and apply it to experimental data. We show that the set of constant genes it generates can be used to rescale data from other experiments, provided that the underlying system is similar to the original. We also demonstrate, using a simple example, that the method can successfully correct existing imbalancesin the data.

Conclusion

The set of constant genes obtained for a given experiment can be applied to other experiments, provided the systems studied are sufficiently similar. This type of rescaling is especially relevant in systems biology applications using microarray data.

Background

Gene expression profiling using microarrays has become a popular technique in modern biochemical research. One of the commonest microarray platforms in use is the high-density oligonucleotide array introduced by Affymetrix (Santa Clara, CA). In the Affymetrix system, biotinylated cRNA generated from the sample of interest is hybridised to the array and detected using fluorescently-labelled streptavidin. A number of different expression summary algorithms are available to derive the concentration of each transcript from the intensity of fluorescence. These include MAS5 (Affymetrix) [1], RMA [2], MBEI [3, 4].

It is important to use the most accurate and precise methods for calculating gene expression levels from microarray data. MAS5, RMA and MBEI offer different solutions to this problem. On Affymetrix arrays, transcripts are represented by multiple (typically 11) pairs of 25-mer oligonucleotides termed a probeset. One of each pair is a perfect match (PM) for the target and the other is a mismatch control (MM). In MAS5, signal values from MM oligonucleotides are subtracted from PM oligonucleotide signals (if the MM value is greater than the PM value, an "ideal mismatch" (IM) value is subtracted from the PM value). RMA and MBEI use only the perfect match signal. In MAS5 there is no attempt to correct non-linearities in the signal intensity response resulting for example from different overall transcript concentrations. In other words, MAS5 treats microarrays individually in contrast to RMA and MBEI, which apply a ranking-type algorithm to arrive at a signal value, and require multiple microarrays. However, it is important to note that the nomalization steps applied at the probe level in both MBEI and RMA even out non-linearities across microarray experiments, but do not correct them. The other distinctive characteristic of MAS5 is that anaysis of spike-in data has shown it to be more accurate than the other methods cited [5, 6]. This higher accuracy, however, comes at the cost of a lower precision -or higher variablity- at low expression values. These differences in method result in differences in performance and a trade-off between accuracy and precision. Researchers select their preferred expression summary algorithm depending on their particular experimental application.

To date, expression profiling has been largely limited to applications in class discovery or prediction (geared towards molecular diagnostics), and in relatively simple functional studies involving direct comparisons of microarrays. Increasingly however, there is a move towards the mathematical modelling of dynamic gene networks. These systems biology applications typically require a sequence of microarray measurements in time which are applied in conjunction with dynamical models. In this context, measurement accuracy is paramount otherwise it would be impossible to arrive at accurate solutions to kinetic parameter estimates, such as transcript degradation rates. Therefore in systems biology, the accuracy of the measurements takes precedence over other considerations such as their precision. The higher accuracy achieved by MAS5 makes it the method of choice for gene network modelling.

Systems biology applications also raise practical issues. Time course data collection is demanding on resources as replicate RNA samples must be isolated for each timepoint. Because of experimental variability, individual timepoints cannot be rerun separately. Rejection of data from outlier arrays in a time course experiment should therefore entail the replacement of the whole time course. In complex experimental systems this is often unfeasible on the grounds of both practicality and cost. Consequently there is a demand for methods for renormalising, and thus retaining, sub-optimal arrays. These methods would benefit all microarray applications, but would be of particular importance where the data is applied in modeling contexts. In this paper we therefore concentrate on renormalisation of MAS5 generated data to more accurately reflect expression values.

In MAS5, in order to make microarrays comparable, the trimmed average of all signals on a microarray is set to some predefined value. Unlike other methods [2, 7], this simple scaling does not attempt to even out distortions in signal response (See Figure 1A). A knock-on effect is that, because the Affymetrix scaling procedure takes into account almost all the signals on the microarrays, those genes lying in the median range have a scaling mismatch (See Figure 1A first panel). Data imbalances that result become particularly visible in time course applications (Figure 1B). In this paper, we propose a procedure that corrects these scaling mismatches. We do not aim to correct for distortions occurring across the whole signal range but we rather try to concentrate on those genes whose intensity is in the median signal range because they are particularly important in modelling situations. It is also in this range that scaling mismatches can have have the most deleterious effects because they are commensurate -if not bigger- than the measurement error (at low expression level, measurement errors swamp these scaling mismatches). The problem of signal saturation affecting strongly expressed genes has been identified and partially corrected elsewhere [8, 9]. In this paper, and for notational convenience only, we will use the term "gene" in place of "probe set", even though a given gene can have several probe sets.

Results and discussion

Our method supposes there exists a set of constant genes, i.e. genes whose expression levels are unaffected under the experimental conditions, and that these genes can be used to accurately normalize the data. This idea of using constant genes has already been proposed [10], it was however in the context of two-colour cDNA microarrays. Our method consists of identifying this set of constant genes in an iterative fashion. In a first step a set of constant genes is singled out on the basis of some dispersion criteria, we call this set the initial set. The genes belonging to this set are used to normalize the microarrays. In a subsequent step, a new set of unchanging genes is defined using the same criteria. Several such steps are performed in sequence until the set of constant genes is unchanged from one step to the next. At this point the method is said to have converged and the corresponding set of constant probe sets is used to rescale the whole experiment (for a thorough description see Methods section).

To study the properties of the method, we generated artificial data. This allowed us to explore its convergence, increase its efficiency and define criteria for detecting hypothetical (and unlikely) situations in which a set of constant genes is non-existent. The results of these purely in silico explorations are described in the the first part of this section, the description of the setup is deferred to the methods section. In the second part, we apply the algorithm to real experimental data and show that it converges to a unique set of genes and that the set of constant genes can be re-used with a different microarray dataset, provided that the tissues involved are sufficiently similar. We demonstrate, using a simple time course interpolation example, that the resulting rescaling does indeed correct imbalances in the data and compare these results with another popular normalization procedure (quantile normalization) and with RMA-summarized data.

In Silico explorations

Artificial data was generated for 10'000 genes on 7 microarrays, a small portion (5%) of these genes was set to be constant. The scaling of these data was then perturbed with varying severity and for each perturbation our rescaling algorithm was applied. This setup is thoroughly described in the Methods section.

Each individual run was assigned to one of the three following groups:

Good runs were those for which the size of the final set exceeded 100.
In contrast, poor runs were those for which the size of the final was less than or equal to 100.
In the special case where the algorithm failed to find a big enough (> 2) initial set, three genes known to be constant were forced into the initial set (recall that 5% of the genes were set to be constant). We call these helped runs.

For each run, we computed an accuracy index (see methods) indicating whether the artificially induced perturbations were accurately corrected. This index can take non-negative values and must be as close to zero as possible. We also monitored the speed of convergence, i.e. the number of iterations steps the algorithm takes before convergence.

Artificial data: results

From this first series of runs we can conclude that the amount of perturbation affects the speed of convergence, the bigger the perturbation, the longer it takes for the algorithm to converge (Figure 2, top plot). The number of steps does not exceed 20, which makes the algorithm extremely fast (< 1 minute on a standard PC). Secondly, the threshold of 100 used to separate good and poor runs correctly identifies those runs that yield a good accuracy (Figure 2, median plot). All good and helped runs, with a final set size far higher than our set threshold of 100 have a very good accuracy (< 1%) whereas poor runs, with a much smaller set size are not accurate. Note also that the threshold of 100 could have been safely set anywhere between 50 and 350 (Figure 2, bottom plot).

Helped runs constitute a special case. Despite starting with a high perturbation – which is why they need help in the first place – they end up being very similar to good runs with both a good accuracy and a large final set. The inclusion of genes known to be constant accelerates convergence so that helped runs converge within 5–7 iterations. Helped runs occur mainly when the perturbation is large. This demonstrates that feeding some known constant genes into the initial iteration is a good way to "seed" the algorithm and has the added bonus of speeding up the (already quick) process. With real data, standard housekeeping genes (Affymetrix identifier ranging from 20000_s _at to 20099_s _at on the U133 Array) can fulfill that role.

Although they might not end up in the final set, our experience is that the inclusion of some of these in the initial set does indeed set the algorithm onto the right path.

Poor runs were defined by setting a limit on the size of the final set. For these cases, it is easy to verify that the final set, on top of being small, has a small "basin of attraction". We perturbed the initial guess by removing and adding genes. This produced different final sets thereby confirming the poor quality of these runs. In contrast, good and helped runsproduced stable final sets (data not shown). Hence, poor quality is fairly easy to detect and can be remedied by supplying the algorithm with a small set of genes that are suspected of being constant, i.e. to create a "helped" run.

In the good and helped runs, the final set of constant genes contains around 400 genes. Hence, not all constant genes (recall that 500 of those were generated) are included in this final set. A less severe selection criterion would select more constant genes. The final set also includes a small number of genes (< 10) that were not expected to be constant. But, as the accuracy plot testifies, their presence does not have an adverse effect on the final result. With real data, the boundary between constant and non-constant genes is blurry, and lightly downregulated genes are likely to be balanced by lightly upregulated ones. This hypothesis is also at the heart of the centralization algorithm presented in [11].

From this first series of simulations, we conclude that the quality of a run can simply be assessed by the size of the final set and that a good way to ensure fast and accurate convergence is to feed the starting set with genes known to be constant. In a real situation, standard housekeeping genes can be used to that effect.

Algorithm performance in the absence of constant genes

In the controlled experiments described above, we set a small proportion of genes to be constant. To further test the main features of the algorithm we ran it in the absence of constant genes, the results are shown in Figure 3. All the runs are flagged as poor because of the small size of the final sets. The diversity of these runs is reflected in the variety of both the size of the final sets and the measured accuracies. The latter are all above 10%. Thus, the algorithm fares badly in the absence of constant genes but this absence is easily deduced from the lack of a large strongly attracting final set. With the real experimental data presented below, this does not happen, and we believe that in most real situations there should be a sufficiently large subset of constant genes.

In summary, we recommend including prior information, by supplying the algorithm with a set of genes known or likely to be constant. A strong sign that the algorithm has converged correctly is that the resulting final set is large, unique and stable, i.e. immune to the inclusion of random genes and/or the deletions of genes belonging to it. Finally, if these properties of the final set are not met, then this suggests that there are no constant genes. In our experience, this is unlikely to happen in reality.

Experimental data

Cells from the human T-cell line (MOLT4) were submitted to ionising radiation of 5 Gy using a ¹³⁷Cs source. RNA was prepared and labelled using standard procedures. Affymetrix U133A microarrays were then run on biological triplicates of transcripts taken 2,4,6,8,10 and 12 hours after irradiation. Triplicates were also taken at 0 hours, i.e. just before irradiation. These 21 microarrays vary in average signal intensity, by this we mean that some are "darker" than others, indicating a lower transcript concentration. We will refer to this data set as the 5Gy experiment. Another set of chips were run under different conditions, the irradiation dose was smaller (1Gy) than in the previous time series (5Gy). Biological triplicates were taken 4, 8 and 12 hours after irradiation, the 0 hours microarrays were the same as in the 5Gy experiment, therefore there is a slight overlap between the two experiments. This dataset, consisting of 12 microarrays, will be referred to as the 1Gy experiment. The data was summarized with MAS5 using standard parameters.

5Gy experiment

We first applied the algorithm to the 5Gy experiment. As a measure of the variability of a gene, we used the second formula (5) given in the methods section and fixed the variability limit to 0.2 to match our estimation of the relative error. This condition was applied to each replicate separately (if a replicate failed, then the gene would be excluded from the constant set) rather than globally on the 21 microarrays.

Further conditions (denoted in the methods section equation (6) by C(i, A))were as follows:

To avoid including genes with a low expression and absent genes an upper limit of 0.001 was set on the detection p-value provided by MAS5.
An upper limit on the signal value of the 0 hours microarray of the first replicate was set to 2000.

This limit empirically corresponds to the point where distortions in the upper range of the signal become apparent (see Figure 1).

The first replicate of the 0 hours microarray was set to be the reference microarray. The algorithm converged in six steps to an invariant final set composed of 1194 genes out of a total of 1675 genes that passed the 0.001 p-value and 2000 intensity filtering. The resulting rescaling factors are given in Table 1.

Table 1 Rescaling factors obtained by the recursive procedure (5Gy experiment)

Full size table

To test the stability of the final set, several further runs were performed using perturbed versions of the original data. To do this, we chose initial values of $p_{a}^{0}$ not equal to 1 ( $p_{a}^{0}$ is the initial perturbation coefficient, see methods). For example, $p_{a}^{0}$ = 1 + ε, where ε is a random variable uniformly distributed on (-1/2,1/2). In all of these additional runs the algorithm converged to a very similar final set (the overlap was > 99% in each case), which confirms its robustness of the original set.

Unsurprisingly, it is the darkest microarrays that require the most intensive rescaling (replicate 2: 8 hours/replicate 3: 8 and 10 hours): Table 2 below shows the original scaling factors applied by MAS5. However this correlation breaks down for lower scaling factors. It is important to stress that our rescaling factors do not replace MAS5's original scaling factors but complement them. For example, the total scaling applied to the raw data obtained from the microarray run at eight hours (sample 2) is 3.010 * 1.22.

Table 2 Original scaling factors obtained with MAS5

Full size table

As intended, the expression levels of the genes composing the constant set span a smaller range than the overall expression level measured on a microarray (Figure 4). Note that the majority of gene with signals below 20 are considered absent by the MAS5 algorithm.

1Gy experiment

Our rescaling algorithm was run with the 1Gy experiment data set under the same conditions as in the 5Gy experiment, resulting in a set of 1568 constant genes. In this 1Gy experiment, 2944 genes in total passed the 0.001 p-value and 2000 signal intensity filtering. The corresponding rescaling factors are shown in Table 3 below.

Table 3 Rescaling factors obtained by the recursive procedure (1Gy experiment)

Full size table

With respect to the 5Gy experiment, the quality of the microarrays used in this second experiment is more uniform, which results in rescaling factors that are closer to 1. Notice as well that the number of genes included in the constant set is larger: 1568 instead of 1194. This is also because in the 1Gy experiment there were less microarrays of lower quality, causing more genes in this experiment to pass the filtering step.

Lists of constant genes are re-useable

Of the 1568 genes flagged as constant in the 1Gy experiment, 826 are also in the the list of constant genes for the 5Gy experiment (hence 742 genes are included in the 1Gy list only and 368 in the 5Gy list only). Despite this incomplete overlap, we show below that the usage of such lists does not have to be restricted to the experiments from which they were extracted.

We found that calculating rescaling factors for the 1Gy data set using the list of 1194 genes identified though the 5Gy experiment produced values very similar to those obtained with the 1Gy data alone (Figure 5, right). The biggest absolute difference between the two sets of values occurs for the first sample microarray 4 hours after irradiation and is equal to 0.013 (the average of these absolute differences across all microarrays is equal to 0.005).

The same conclusion can be reached when the experiment roles are swapped. The rescaling factors for the 5Gy data obtained using the 1Gy set of constant genes are compared with the original values in Figure 5 (left plot) and the biggest absolute difference is 0.024 for the third replicate 2 hours after irradiation, the average of these absolute differences being equal to 0.007.

Therefore, provided that the experimental conditions are not too dissimilar (same cell line and treatment), we conclude that the list of constant genes obtained through a given experiment can be safely used to calibrate data obtained from another experiment.

Why is rescaling necessary?

The rescaling factors shown in Table 1 might seem small with respect to the measurement error that prevails in microarray experiments and they may have little impact in simple screening or comparison-type analyses. However, increasing numbers of microarray experiments are carried out in a systems biology context (see [12] for a short review), in which proper scaling of the data is crucial to obtain sound interpretations.

To illustrate this point, we performed a simple time course analysis. Intermediate points (at 2,4,6,8 and 10 hours) were "predicted" using a linear interpolation procedure on neighbouring time points, for each replicate and gene (Figure 6). Each plot shows the quartiles of relative interpolation error ((true value -interpolated value)/true value) for the first and the second replicate, before (Figure 6A) and after rescaling (Figure 6B). Non-rescaled data is severely skewed, this is especially the case for those time points suffering from ascaling mismatch such as the one illustrated in figure 1 (replicate 2, 8 hours). Rescaling the data corrects for this mismatch: the interpolation errors are much more centered in the rescaled situation. We have used this rescaled data in a modeling context to predict targets of p53, an important transcription factor involved in the cell response to DNA damage [13].

We also compared our method to another popular normalization method: quantile normalization [14]. This normalization step is the one used in the RMA summary method, albeit at the probe level. We applied it instead to MAS5 summarized values and the results are shown in figure 6C. As can be readily seen from this example, quantile normalization fails to accurately correct for scaling imbalances in the data. Similarly, we performed the same interpolation analysis on RMA-summarized probe-level data (figure 6D). RMA does not perform well in this situation, we believe this is because the lesser quality microarrays have a deleterious effect on the whole dataset.

Conclusion

Inaccurate scaling of median range signal values can have significant deleterious effects on mathematical modelling of microarray data. Experimental correction of problematic arrays can be expensive and impractical. We have therefore developed a method for rescaling arrays to avoid the unnecessary loss of otherwise valid data. We present an algorithm designed to identify a set of constant genes in oligonucleotide microarray experiments. Applying this algorithm to artificially generated data allowed us to identify the conditions under which this set can be used for the correction of scaling mismatches. We applied the algorithm successfully to real experimental data and showed that the set of constant genes obtained for a given experiment can be generally applied, provided the systems studied are sufficiently similar. This type of rescaling is especially relevant, if not crucial, in modelling applications of microarray data.

Methods

Description of the algorithm

Let x_i,abe the signal for gene i as it is measured in microarray a. We hypothesise that this value can be decomposed as follows:

x_i,a= k_as_i,aexp(ε), (1)

where s_i,ais the "true" signal for gene i in microarray a (the value of the signal depends on the microarray as well, but for purely biological reasons). The third term exp(ε) is a multiplicative noise term, ε is a random variable identically distributed across all genes and microarrays with mean zero, and standard deviation typically around 0.15 – 0.20 which matches the average relative error (of 15% – 20%) generally observed in microarray data. The focus here is on the first component k_a, which results from inaccurate normalisation. In other words, one wishes to find a rescaling factor p_asuch that

p_ak_a= 1.

Note that there is no "true" level for the signal, so that a more accurate formulation would be that one wishes to find p_asuch that p_ak_a= C for all microarrays a, where C is some user-defined constant. However, we will define a particular microarray r to be the reference (k_r= 1) in order to retain the above formulation.

The proposed algorithm to determine the p_a's is simple, it assumes that there exists a set of genes U whose expression is unchanged across all microarrays:

U = {i : s_i,a= s_i> 0 ∀ a}.

If U is known, an accurate estimate of p_acan be obtained via a ratio of geometric averages for the relevant microarrays:

{\tilde{p}}_{a} = \frac{{(\prod_{i \in U} x_{i, r})}^{1 / # (U)}}{{(\prod_{i \in U} x_{i, a})}^{1 / # (U)}}, (2)

where #(U) denotes the number of elements in the set U. We show below that ${\tilde{p}}_{a}$ is a good approximation for p_a:

{\tilde{p}}_{a}

(1)

≈ p_a. (3)

Our aim is to determine U. To do this, we first determine a set of genes U₁ which have a low "variability" $V$ . As a measure of variability, we could use the coefficient of variation, which is given by

V (i, A) = {\bar{x}}_{i}^{- 1} {(\frac{1}{n - 1} \sum_{a = 1}^{n} {(x_{i, a} - {\bar{x}}_{i})}^{2})}^{1 / 2} (4)

where A is the set A = {1, ..., n} (the set of microarrays), x_i,a, a = 1, ..., n denotes n observations and ${\bar{x}}_{i}$ is the arithmetic mean of the x_i,aacross all a's. We thus average the readings for a particular gene across several microarrays. However, because of the multiplicative context, it is preferable to use a geometric average, so that

V (i, A) = {(\prod_{a = 1}^{n} M a x (\frac{x_{i, a}}{{\bar{x}}_{i}^{g}}, \frac{{\bar{x}}_{i}^{g}}{x_{i, a}}))}^{\frac{1}{n - 1}} - 1, (5)

where ${\bar{x}}_{i}^{g}$ is the geometric mean. We therefore define

U₁ = {i : $V$ (i, A) <v_* $C$ (i, A)},

where A is the set of microarrays and v_* is a parameter. $C$ (i, A) is a boolean variable that denotes another condition (or set of conditions) on the i th gene and is not of direct concern here. For example, via $C$ (i, A), we might try and avoid including in the set of constant genes those genes that are flagged absent by MAS5 throughout the experiment.

We then assume U₁ is U and we use (2) to compute estimated rescaling factors ${\tilde{p}}_{a}$ . The x_i,a's can then be updated, but this is likely to modify the composition of the set constant genes so that we might have to perform a similar step again, therefore the set U has to be determined in an iterative fashion. The full description of the k th step is

1. Determine a putative set of "constant" genes using

U_k= {i : (i, A, k - 1) <v*, $C$ (i, A)}. (6)

In the equation above, a parameter k - 1 has been added to the function $V$ , to underline the fact that the coefficient of variation has to be determined with updated values of the signal.

2. Update the rescaling factors using

p_{a}^{k} = p_{a}^{k - 1} {(\prod_{i \in U_{k}} \frac{x_{i, r}^{k - 1}}{x_{i, a}^{k - 1}})}^{1 / # (U_{k})},

or (the formula is equivalent)

p_{a}^{k} = {(\prod_{i \in U_{k}} \frac{x_{i, r}^{k - 1}}{x_{i, a}})}^{1 / # (U_{k})} .

3. Update the signals using

x_{i, a}^{k} = p_{a}^{k} \frac{x_{i, a}^{k - 1}}{p_{a}^{k - 1}},

or (again, equivalently)

x_{i, a}^{k} = p_{a}^{k} x_{i, a} .

The algorithm has converged at step m if U_mand U_m+1are identical sets. To start the algorithm at k = 1 we require initial values $x_{i, a}^{0}$ . These are given by

x_{i, a}^{0}

(2)

= $p_{a}^{0}$ x_{i, a}, (7)

where we will typically fix $p_{a}^{0}$ = 1.

Several criteria have to be met for the algorithm to be reliable. First, one has to verify that the algorithm does converge: it has to find the final set U_mand not wander aimlessly among the huge number of gene sets. Second, the convergence has to be reasonably quick and robust. Lastly, one wants to ensure that the final set of genes U_mreturned by the algorithm is indeed a set of constant genes. These conditions are difficult to verify analytically, which is why we resorted to numeric simulation. These simulations are presented in the Results and Methods section. A R script implementing the algorithm in the case of simple replication and no extra conditions has been included as supplementary material.

A set of constant genes can be used for rescaling

We want here to prove that (3) holds and in particular demonstrate that the approximation is a good one. Using (1) we have

{\tilde{p}}_{a} = \frac{k_{r}}{k_{a}} \frac{{(\prod_{i \in U} s_{i})}^{1 / # (U)}}{{(\prod_{i \in U} s_{i})}^{1 / # (U)}} \frac{{(\prod_{i \in U} \exp (ε_{i, r}))}^{1 / # (U)}}{{(\prod_{i \in U} \exp (ε_{i, a}))}^{1 / # (U)}} . (8)

We have k_r= 1, the second ratio on the right hand side is equal to 1. Turning to the third term in (8), we see that its expectation can be written as

E (exp (ε)),

where

ε = \frac{1}{# (U)} \sum_{i \in U} ε_{i, r} - \frac{1}{# (U)} \sum_{i \in U} ε_{i, a} . (9)

For the sake of this argument, we assume the relative errors for the gene expression measurements to be iid and equal to 20%. Applying the central limit theorem to (9), one can see that ε is Gaussian with parameters μ = 0 and a = (2 * 0.2)/ $\sqrt{# (U)}$ . Thus, exp(ε) is lognormal with the same parameters. The expectation of a lognormal distribution is exp(μ + σ²/2), hence

E (exp (ε)) = exp(0 + 0.08/#(U)).

The expectation of this third term is slightly bigger than 1 but since the size on the set U is typically in the hundreds, the difference is considerably less than 1%. Replacing in (8) we thus have, as required

{\tilde{p}}_{a} \approx \frac{1}{k_{a}} = p_{a} .

where the approximation is more than good enough for our practical purposes.

In silico explorations

Artificial data generation

We generated artificial signals for 10,000 genes on 7 microarrays using the following procedure. For the i th gene in the a th microarray we generate observed signals x_i,aaccording to

x_i,a= k_as_i,aexp(ε),

The first term k_ais the scaling mismatch we want the algorithm to uncover. The third term exp(ε) is an error term, where ε is set to be Gaussian with mean zero and standard deviation 0.2. The second term s_i,ais generated using

s_{i, a} = 10^{G},

where $G$ is a Gaussian random variable with mean 1.5 and standard deviation 0.4. Furthermore, a small portion of these genes (5% or 500) are set to be constant, that is for i ε U, we have s_i,a= s_i.

Alhough real experiments yield more complex expression patterns, this artificial data is amply sufficient to capture the main features of our method.

To simulate the scaling mismatches k_a, we generate them by

k_a= 1 + K(u - 0.5), (10)

where u is a random variable uniformly distributed on the unit interval and K is a parameter capturing the severity of the mismatch. Without loss of generality, we set k₁ = 1 because choosing an arbitrary reference array is required by our method.

We generated 191 different datasets and ran the algorithm on each of them. Each dataset differed from the others by the choice of the constant K in (10). The 191 values for K ranged uniformly between 0 and 1.9.

Each run was then assigned to one of the following three groups on the basis of the size of the final set U_m(where the index m denotes the number of steps it took to the algorithm to converge):

Good runs were those for which the size of the final set U_mexceeded 100.
In contrast, poor runs were those for which the size of U_mwas less than or equal to 100.
In the special case where the algorithm failed to find a big enough (> 2) initial set U₁, three genes known to be constant were forced into the initial set (recall that 5% of the genes were set to be constant). We call these helped runs.

For each run r, we computed an accuracy index using

a c c u r a c y = \frac{1}{6} \sum_{a = 2}^{7} | 1 - p_{a} k_{a} | (11)

where p_ais the rescaling factor recovered from thealgorithm. Ideally, we want the rescaling factors p_ato perfectly compensate the scaling factors k_aso that p_ak_a= 1. Consequently, the accuracy index above, which can take non-negative values, must be as close to zero as possible. We also monitored the speed of convergence i.e. the number of iterations steps m the algorithm takes before convergence.

Availability

An R script implementing the algorithm in the case of simple replication and no extra conditions has been included as additional file 1 (rescale.R). The data is available in Array Express (accession number E-MEXP-549).

References

Affymetrix: Affymetrix Microarray Suite User Guide, version 5. Affymetrix, Santa Clara, CA; 2001.
Google Scholar
Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003, 31(4):el5. 10.1093/nar/gng015
Article Google Scholar
Li C, Hung Wong W: Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol 2001, 2(8):RESEARCH0032.
PubMed Central CAS PubMed Google Scholar
Li C, Wong WH: Model-based analysis of oligonucleotidearrays: expression index computation and outlier detection. Proc Natl Acad Sci U S A 2001, 98: 31–6. 10.1073/pnas.011404098
Article PubMed Central CAS PubMed Google Scholar
Ploner A, Miller LD, Hall P, Bergh J, Pawitan Y: Correlation test to assess low-level processing of high-density oligonucleotide microarray data. BMC Bioinformatics 2005, 6: 80. 10.1186/1471-2105-6-80
Article PubMed Central PubMed Google Scholar
Choe SE, Boutros M, Michelson AM, Church CM, Halfon MS: Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol 2005, 6(2):R16. 10.1186/gb-2005-6-2-r16
Article PubMed Central PubMed Google Scholar
Kepler TB, Crosby L, Morgan KT: Normalization and analysis of DNA microarray data by self-consistency and local regression. Genome Biol 2002, 3(7):RESEARCH0037. 10.1186/gb-2002-3-7-research0037
Article PubMed Central PubMed Google Scholar
Held GA, Grinstein G, Tu Y: Modeling of DNA microarray data by using physical properties of hybridization. Proc Natl Acad Sci U S A 2003.
Google Scholar
Naef F, Socci ND, Magnasco M: A study of accuracy and precision in oligonucleotide arrays: extracting more signal at large concentrations. Bioinformatics 2003, 19(2):178–84. 10.1093/bioinformatics/19.2.178
Article CAS PubMed Google Scholar
Tseng GC, Oh MK, Rohlin L, Liao JC, Wong WH: Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Res 2001, 29(12):2549–57. 10.1093/nar/29.12.2549
Article PubMed Central CAS PubMed Google Scholar
Zien A, Aigner T, Zimmer R, Lengauer T: Centralization: a new method for the normalization of gene expression data. Bioinformatics 2001, 17(Suppl 1):8323–31.
Article Google Scholar
Stark J, Callard R, Hubank M: From the top down: towardsa predictive biology of signalling networks. Trends Biotechnol 2003, 21(7):290–3. 10.1016/S0167-7799(03)00140-9
Article CAS PubMed Google Scholar
Barenco M, Tomescu D, Brewer D, Callard R, Stark J, Hubank M: Ranked predictions of p53 targets using Hidden Variable Dynamic Modelling (HVDM). Genome Biology 2006, 7(3):R25. 10.1186/gb-2006-7-3-r25
Article PubMed Central PubMed Google Scholar
Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19(2):185–93. 10.1093/bioinformatics/19.2.185
Article CAS PubMed Google Scholar
Gautier L, Cope L, Bolstad B, Irizarry R: affy – Analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004, 20(3):307–15. 10.1093/bioinformatics/btg405
Article CAS PubMed Google Scholar
Gentleman RC, et al.: Bioconductor: open softwaredevelopment for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80. 10.1186/gb-2004-5-10-r80
Article PubMed Central PubMed Google Scholar

Download references

Acknowledgements

Part of the analyses and graphs presented in this article were produced using the Bioconductor package under R, in particular the "affy" library [15, 16]. M. Barenco and D. Tomescu are supported by an Exploiting Genomics Initiative grant from the BBSRC, Ref.:39/EGM16102. D. Brewer holds a CHRAT studentship by the ICH. The authors would like to thank the reviewers for their useful remarks.

Author information

Authors and Affiliations

lnstitute of Child Health, University College London, UK
Martino Barenco, Jaroslav Stark, Daniel Brewer, Daniela Tomescu, Robin Callard & Michael Hubank
CoMPLEX, University College London, UK
Martino Barenco, Daniel Brewer, Robin Callard & Michael Hubank
Department of Mathematics, Imperial College London, UK
Jaroslav Stark

Authors

Martino Barenco
View author publications
You can also search for this author in PubMed Google Scholar
Jaroslav Stark
View author publications
You can also search for this author in PubMed Google Scholar
Daniel Brewer
View author publications
You can also search for this author in PubMed Google Scholar
Daniela Tomescu
View author publications
You can also search for this author in PubMed Google Scholar
Robin Callard
View author publications
You can also search for this author in PubMed Google Scholar
Michael Hubank
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Martino Barenco.

Additional information

Authors' contributions

MB designed the method in collaboration with JS and wrote the paper. DT and MH ran the microarrays. RC and DB provided useful advice. MH, JS and RC designed and coordinated the project.

Electronic supplementary material

12859_2005_990_MOESM1_ESM.R

Additional File 1: An R script implementing the algorithm in the case of simple replication and no extra conditions has been included as additional file (rescale.R). (R 2 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Barenco, M., Stark, J., Brewer, D. et al. Correction of scaling mismatches in oligonucleotide microarray data. BMC Bioinformatics 7, 251 (2006). https://doi.org/10.1186/1471-2105-7-251

Download citation

Received: 09 November 2005
Accepted: 09 May 2006
Published: 09 May 2006
DOI: https://doi.org/10.1186/1471-2105-7-251

Correction of scaling mismatches in oligonucleotide microarray data

Abstract

Background

Results

Conclusion

Background

Results and discussion

In Silico explorations

Artificial data: results

Algorithm performance in the absence of constant genes

Experimental data

5Gy experiment

1Gy experiment

Lists of constant genes are re-useable

Why is rescaling necessary?

Conclusion

Methods

Description of the algorithm

A set of constant genes can be used for rescaling

In silico explorations

Artificial data generation

Availability

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Authors' contributions

Electronic supplementary material

12859_2005_990_MOESM1_ESM.R

Authors’ original submitted files for images

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us