Evaluation of microarray data normalization procedures using spike-in experiments

Abstract

Background

Recently, a large number of methods for the analysis of microarray data have been proposed but there are few comparisons of their relative performances. By using so-called spike-in experiments, it is possible to characterize the analyzed data and thereby enable comparisons of different analysis methods.

Results

A spike-in experiment using eight in-house produced arrays was used to evaluate established and novel methods for filtration, background adjustment, scanning, channel adjustment, and censoring. The S-plus package EDMA, a stand-alone tool providing characterization of analyzed cDNA-microarray data obtained from spike-in experiments, was developed and used to evaluate 252 normalization methods. For all analyses, the sensitivities at low false positive rates were observed together with estimates of the overall bias and the standard deviation. In general, there was a trade-off between the ability of the analyses to identify differentially expressed genes (i.e. the analyses' sensitivities) and their ability to provide unbiased estimators of the desired ratios. Virtually all analyses underestimated the magnitude of the regulations; often less than 50% of the true regulation was observed. Moreover, the bias depended on the underlying mRNA-concentration; low concentrations resulted in high bias. Many of the analyses had relatively low sensitivities, but analyses that used either the constrained model (i.e. a procedure that combines data from several scans) or partial filtration (a novel method for treating data from so-called not-found spots) had, with few exceptions, high sensitivities. These methods gave considerably higher sensitivities than some commonly used analysis methods.

Conclusion

The use of spike-in experiments is a powerful approach for evaluating microarray preprocessing procedures. Analyzed data are characterized by the properties of the observed log-ratios and the analyses' ability to detect differentially expressed genes. If bias is not a major problem, we recommend the use of either the CM-procedure or partial filtration.

Background

A large number of methods for low-level analysis of microarray data have been developed, but the relative merits of these methods are generally not easy to assess [1]. Analytical methods are commonly motivated by theoretical properties or by how well they perform on simulated microarray data [2–4]. Neither approach is fully satisfactory, since both rely on model assumptions that are not necessarily supported by empirical studies. For real data, the true values are not known and therefore cannot be characterized and used for evaluation. Data from spike-in experiments, where the mRNA-ratios of a set of artificial clones are known, can be used to determine the relative merits of a set of analysis methods [1, 5]. The design of a spike-in experiment needs to be based on assumptions about how real microarray data behave. However, these assumptions are generally less restrictive than the ones needed for simulating microarray data. The presented evaluation study used eight in-house produced spike-in microarrays (the Lucidea array) with approximately 10,000 spots, 4,000 of which were spiked at different concentrations in the two channels, i.e. corresponded to differentially expressed (DE) genes. In comparison with the spike-in study performed by Qin et al. [5], our spike-in array encompasses more DE-genes, allowing us to obtain reliable estimates of the methods' abilities to detect DE-genes (i.e. the methods' sensitivities).

Microarray studies are often used to screen for DE-genes, in which case the sensitivity and specificity of the study are of interest. The Receiver Operating Characteristic curve (ROC-curve) shows the relationship between sensitivity and specificity and can be used to characterize the classification properties of a study [6]. Alternatively, pre-processed microarray data are sometimes used in so-called high-level analyses (e.g. cluster analysis and classification). In this case, the sensitivity and specificity of detection are no longer the most appropriate framework for evaluation; rather, the properties of the normalized log-ratios need to be understood.

In this article, 252 normalization procedures were evaluated. We simultaneously evaluated three filtration methods, two techniques for background adjustment, seven scanning procedures, two ways of dealing with negative intensities, and four censoring approaches. The majority of these methods are well established, but we also introduced some novel ones: partial filtration handles data from spots not properly identified by the image analysis software, C-spot inclusion handles negative background-adjusted intensities, and censoring moderates extreme ratios caused by weakly expressed genes.

Results

The general model

We consider a two-channel cDNA-microarray experiment with a reference channel and a treated channel. Let µ denote the true expression level of a gene. The raw intensities cannot be used directly to identify DE-genes, since they are influenced by systematic variation. Normalization aims to remove this systematic variation and produce normalized log-ratios, which are used to calculate test-statistics that rank the genes according to how likely they are to be DE. Genes with test-statistics above a user-defined cut-off value are classified as DE. For each gene and array, the normalized log-ratio should be an observation of the true log-ratio of interest, i.e. $\log_2(\mu_{Treated}/\mu_{Reference})$. The methods used for transforming raw data into normalized log-ratios constitute a normalization procedure. We consider raw data generated from arrays that have been scanned at four laser settings, where the normalization procedures involve filtration, background adjustment, merging data from different scans, channel normalization, and censoring (Figure 1).

Figure 1

Description of the evaluation study. The evaluation study was based on raw data from eight Lucidea arrays (Figure 2). All arrays were scanned at four settings and image analysis was applied to extract raw data. The data were normalized in five steps. Filtration and background adjustment were applied to each of the raw data sets; three procedures for filtration and two procedures for background adjustment were considered. In the third step, data from the four scans were "merged" together. Seven merging procedures were considered, four of which were single-scan procedures that used data from only one scan. Two procedures for channel adjustment and four censoring approaches were considered. In total, 252 unique analyses were conducted. For each analysis, the normalized log-ratios were used to conduct a test that generated a list of genes classified as DE. Properties of the average normalized log-ratios and the analyses' classification properties were used to characterize and evaluate the analyses.

Design of the spike-in microarray

In a spike-in experiment, control clones (i.e. artificial clones designed to avoid cross-hybridization) are printed on an array. We will refer to control clones as genes even though they do not correspond to actual genes. Prior to labeling, the biological samples are spiked with control genes of known concentration. All other experimental steps are identical to a standard two-channel microarray experiment. The evaluation study presented in this paper used an in-house produced cDNA-array (the Lucidea array), consisting of 12 non-differentially expressed (NDE) genes and 8 DE-genes from the Lucidea Universal ScoreCard (Amersham Biosciences), where each gene was spotted 480 times. The NDE-genes were spiked with different RNA-concentrations ranging from zero to very high concentrations. The DE-genes were either three-fold or ten-fold up- or down-regulated, and were spiked with low or high RNA-concentrations; see Figure 2. Technical details about array production and experimental protocols are described in the methods section.

Figure 2

Description of the Lucidea array. The Lucidea array has 9,600 spots from 20 genes; each gene was replicated 480 times. Eight genes were spiked with different mRNA-concentrations in the treated and reference channels (i.e. the DE-genes, red dots 11–18). The remaining 12 genes were spiked equally in both channels at different mRNA-concentrations (i.e. the NDE-genes, blue dots 1–10). Three NDE-genes, two expressed at concentration zero and one at concentration one, were clustered together and are in this paper treated as one gene.

Methods and tools for evaluation of spike-in microarray data

An analysis involving image analysis, normalization, and a test generates normalized log-ratios and test-statistics (e.g. the B-statistic [2] or the absolute value of the t-statistic). A gene is classified as DE if its test-statistic is above a user-determined cut-off value c. The experiment's sensitivity is the proportion of DE-genes that are correctly classified. The false positive rate (FPR) is the proportion of NDE-genes that are falsely classified, while the false discovery rate (FDR) is the proportion of NDE-genes among the genes classified as DE.
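
As a minimal sketch of these definitions (our own illustration with hypothetical names, not code from the EDMA package), the three rates can be computed from a vector of test-statistics and the known DE/NDE status of the spike-in genes:

    import numpy as np

    def classification_rates(stat, is_de, cutoff):
        """Sensitivity, false positive rate (FPR), and false discovery rate (FDR)
        when genes with test-statistics above a cut-off are classified as DE."""
        called_de = stat > cutoff
        tp = np.sum(called_de & is_de)       # DE-genes correctly classified
        fp = np.sum(called_de & ~is_de)      # NDE-genes falsely classified
        sensitivity = tp / np.sum(is_de)
        fpr = fp / np.sum(~is_de)
        fdr = fp / max(tp + fp, 1)           # proportion of NDE-genes among the called genes
        return sensitivity, fpr, fdr

    # Toy example: DE-genes receive, on average, larger test-statistics.
    rng = np.random.default_rng(0)
    is_de = np.arange(10000) < 100                    # 1% DE-genes
    stat = rng.normal(loc=np.where(is_de, 3.0, 0.0))  # simulated test-statistics
    print(classification_rates(stat, is_de, cutoff=2.0))

Varying the cut-off c and plotting the sensitivity against the FPR traces out the ROC-curve discussed below.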

The analyzed microarray data are characterized by the properties of the normalized log-ratios and by their classification properties (Figure 1). The latter are characterized by the ROC-curve, which describes the relationship between the sensitivity and the FPR. If the purpose is to screen for DE-genes, it is sufficient to consider the classification properties. However, unbiased estimates of the normalized log-ratios are important for understanding the biology and are desirable when combining results from different techniques (e.g. cDNA-arrays, short-oligo arrays, and qRT-PCR). Furthermore, if the bias depends on the genes' mRNA-concentrations (e.g. the bias is higher for DE-genes expressed at low concentrations than for those expressed at high concentrations), then results from high-level analyses such as clustering or classification can be misleading. We propose that the following properties should be considered when evaluating analysis methods:

  I. High sensitivity at the user-acceptable FDR.

  II. The expected values of the average normalized log-ratios of DE-genes should be close to the true log-ratios, and the bias should not depend on the genes' mRNA-concentrations.

The relative importance of these properties depends on the purpose of the experiment.

The properties of the normalized log-ratios can be summarized with estimates of the overall bias and standard deviation. Consider an experiment with $r$ DE-genes, where the $k$th gene is replicated $n_k$ times and has true log-ratio $\psi_k$. The reflected bias and the selected bias are measures of property II. The reflected bias is estimated by

$$\hat{b}_{DE} = \frac{\sum_{k=1}^{r}\sum_{i=1}^{n_k}\mathrm{sign}(\psi_k)\,\left(\bar{M}_{ki}-\psi_k\right)}{n_{DE}}, \qquad (2)$$

where $\bar{M}_{ki}$ is the average normalized log-ratio of spot $i$ of gene $k$, taken over all arrays, and $n_{DE}$ is the total number of DE-spots on the array. The selected bias is estimated in the same way as the reflected bias, except that only the genes classified as DE are used. For some problems, the selected bias may be a more relevant measure than the reflected bias. In order to determine whether the average normalized log-ratios have a strong linear relationship to the true log-ratios, it is necessary to study the bias of the DE-genes individually. The overall standard deviation of the DE-genes can be estimated by

$$\hat{s}_{DE} = \sqrt{\frac{\sum_{k=1}^{r}\sum_{i=1}^{n_k}\left(\bar{M}_{ki}-\bar{\bar{M}}_{k}\right)^{2}}{n_{DE}-r}}, \qquad (3)$$

where $\bar{\bar{M}}_{k}$ is the average taken over all arrays and replicates.
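
Given the per-array normalized log-ratios, the two summary measures follow directly from Relations 2 and 3. The sketch below is our own illustration; the data layout and the function name are assumptions, not EDMA code:

    import numpy as np

    def reflected_bias_and_sd(M, true_log_ratio):
        """Overall reflected bias and standard deviation of the DE-genes.

        M              : dict mapping DE-gene k to an array of shape (n_arrays, n_k)
                         of normalized log-ratios (rows = arrays, columns = replicate spots)
        true_log_ratio : dict mapping DE-gene k to its true log2-ratio psi_k
        """
        bias_sum, sum_sq, n_de, r = 0.0, 0.0, 0, 0
        for k, psi in true_log_ratio.items():
            M_bar = np.nanmean(M[k], axis=0)     # average over arrays for each replicate spot
            M_barbar = np.nanmean(M[k])          # average over all arrays and replicates
            bias_sum += np.sum(np.sign(psi) * (M_bar - psi))
            sum_sq += np.sum((M_bar - M_barbar) ** 2)
            n_de += M_bar.size                   # total number of DE-spots
            r += 1                               # number of DE-genes
        return bias_sum / n_de, np.sqrt(sum_sq / (n_de - r))

The selected bias is obtained in the same way by restricting the input to the genes that were classified as DE.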

Description of the data analyses used in the evaluation study

Eight hybridized spike-in Lucidea arrays were scanned at four settings (laser power/PMT): 70/70, 80/80, 90/90, and 100/100 (in that order), where the numbers were percentages of maximum values. These scans are referred to as the 70, 80, 90, and 100 scans. The pre-processing procedures considered in this work involved seven consecutive steps: image analysis, filtration, background adjustment, merging data from several scans, channel adjustment, censoring, and calculations of test-statistics (Figure 1). The analyses were carried out using ScanArrayExpress 2.1 (PerkinElmer), Bioconductor [7], the Aroma package [8], and the in-house S-Plus library UmeaSAMED. The evaluations were carried out using the in-house S-Plus library EDMA [9].

Image analysis

The standard way to conduct image analysis is to analyze the two images obtained from one scan together, so that the spots are equally defined for both channels. We propose an alternative method, combined image analysis, where the scan's images are analyzed together with images from a second scan (commonly the highest scan in the experiment, in our case the 100 scan), so that the spots are equally defined for all four images. This approach is possible since ScanArrayExpress allows four images to be analyzed simultaneously. All 252 evaluated analyses used combined image analysis. The median of the spots' pixel values was used to calculate the intensities. For one array, additional image analyses were done in the standard way using both ScanArrayExpress 2.1 and GenePix 5.0 (Axon Instruments Inc); both programs generate "flags" indicating whether the spots were properly identified [10, 11]. The percentage of so-called not-found spots was used to characterize the different image analyses (Table 1). Combined image analysis using additional images from a higher scan improves spot finding and thereby the quality of the data.

Table 1 Combined and standard image analysis. For one of the scanned Lucidea arrays, image analyses were performed in three different ways; standard analysis using GenePix, standard analysis using ScanArrayExpress, and combined analysis using ScanArrayExpress. The standard method analyzed images from one scan so that the spots were equally defined for both channels, while the combined method analyzed images from two scans so that the spots were equally defined for all four images. Here "70+100" means that images from the 70 scan were analyzed together with images from the 100 scan. The table shows the percentage of spots that the image analyses failed to identify.

Filtration

Intensities from not-found spots (i.e. spots not properly identified by the image analysis software) were treated in three different ways:

  I. Complete filtration: the intensities were treated as missing values.

  II. Partial filtration: the intensities were treated as missing values during normalization, but prior to calculating test-statistics the spots' log-ratios were set to zero. In the special case where all arrays generated not-found spots for a gene, the gene was removed from the experiment.

  III. No filtration: the intensities were treated as intensities from found spots.

Complete filtration is commonly used, while partial filtration is a novel method. The idea behind partial filtration is that spots called "not found" commonly arise from genes that are not expressed in either channel and can therefore be regarded as NDE-genes.
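
The three filtration rules can be illustrated with a short sketch (our own, operating on per-array log-ratios rather than on the raw intensities handled in the study):

    import numpy as np

    def apply_filtration(log_ratios, not_found, mode):
        """Handle not-found spots according to one of the three filtration rules.

        log_ratios : array of shape (n_arrays, n_spots) with log-ratios
        not_found  : boolean array of the same shape flagging not-found spots
        mode       : "complete", "partial", or "none"
        """
        M = log_ratios.astype(float).copy()
        if mode == "complete":
            M[not_found] = np.nan                     # treat as missing values throughout
        elif mode == "partial":
            # Missing during normalization; set to zero before the test statistic,
            # assuming not-found spots come from genes expressed in neither channel.
            M[not_found] = 0.0
            # Spots (genes) that are not found on any array are removed entirely.
            M[:, not_found.all(axis=0)] = np.nan
        elif mode == "none":
            pass                                      # keep the observed values unchanged
        return M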

Background adjustment

The analyses either applied no background adjustment, or applied the standard background adjustment in which the local background intensities are subtracted from the observed foreground intensities. Background adjustment divided the spots into three groups: A-spots, where both the reference and the treated background-adjusted intensities (ba-intensities) were positive; B-spots, where exactly one of the two ba-intensities was negative; and C-spots, where both ba-intensities were negative.
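
The classification into A-, B-, and C-spots follows directly from the signs of the background-adjusted intensities; a minimal sketch with illustrative names:

    import numpy as np

    def classify_spots(fg_ref, bg_ref, fg_trt, bg_trt):
        """Background-adjust both channels and label each spot A, B, or C."""
        ba_ref = fg_ref - bg_ref                    # background-adjusted (ba) intensities
        ba_trt = fg_trt - bg_trt
        pos_ref, pos_trt = ba_ref > 0, ba_trt > 0
        label = np.where(pos_ref & pos_trt, "A",    # both ba-intensities positive
                np.where(pos_ref | pos_trt, "B",    # exactly one ba-intensity positive
                         "C"))                      # both ba-intensities negative
        return ba_ref, ba_trt, label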

Merging data from several scans

The scans generated 16-bit images and, since the median was used to calculate the spot intensities, all intensities were integers between 0 and $2^{16}-1$. Henceforth, intensities equal to the maximum value will be called saturated. One common approach to dealing with saturation is to adjust the scanner settings such that only a small fraction of the intensities is saturated. Two alternative approaches, restricted linear scaling (RLS) and the constrained model (CM) [12], combine intensities from two or more scans in order to extend the linear range of the experiment. RLS is a slight modification of the algorithm suggested by Dudley et al. [13]. Seven scanning procedures were considered:

  I. Using data from the 70 scan.

  II. Using data from the 80 scan.

  III. Using data from the 90 scan.

  IV. Using data from the 100 scan.

  V. RLS using combined data from the 70, 80, and 90 scans (RLS 90).

  VI. RLS using combined data from the 80, 90, and 100 scans (RLS 100).

  VII. CM using data from all four scans with the 70 data as baseline.

The 80 scan can be thought of as a standard scan since it was the highest scan where only a small fraction (< 0.2%) of the intensities were saturated. From a practical point of view, the CM and RLS procedures demand more scanning and image analyses. In addition, databases created for microarray data storage do not commonly support data from several scans, e.g., BASE [14]. The storage problem can usually be solved by creating additional software [15].
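
The exact RLS algorithm is not reproduced here. As an illustration of the underlying idea of Dudley et al. [13] (a simplified two-scan version whose details are our own assumptions), spots that are saturated in the higher scan are replaced by intensities from a lower scan, rescaled with a linear fit computed on spots that are unsaturated in both scans:

    import numpy as np

    SATURATION = 2**16 - 1   # maximum intensity of a 16-bit scan

    def extend_range(high_scan, low_scan):
        """Combine two scans of one channel into a single extended-range signal."""
        saturated = high_scan >= SATURATION
        usable = ~saturated & (low_scan < SATURATION) & (low_scan > 0)
        slope, intercept = np.polyfit(low_scan[usable], high_scan[usable], deg=1)
        extended = high_scan.astype(float)
        extended[saturated] = slope * low_scan[saturated] + intercept
        return extended

The CM-procedure instead combines the data from all four scans by fitting a common model; see [12] for the details.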

Channel adjustment

The print-tip lowess normalization [16] was used to remove systematic differences between the channels. For the Lucidea experiment, only data from NDE-genes were used to fit the lowess curves; however, data from all A-spots were adjusted. Clearly, this approach is an idealization, since the DE-genes in a real experiment affect the estimated curves. However, if the proportion of DE-genes is small, and if the true log-ratios are symmetrically distributed around zero, then this effect is small. For two analyses, different proportions of DE-genes were used to fit the lowess curves; inclusion of a small number of DE-genes had a marginal effect on the analyses' sensitivities and biases (Table 2). Intensities from B-spots were treated as missing values, while intensities from C-spots were treated in two different ways:

Table 2 Print-tip lowess normalization. In two of the analyses (80.I.1 and 80.II.1, see Table 3), the percentage of DE-genes allowed to influence the lowess curve estimates was kept at five different levels: 0, 1, 6, 11, and 40%. The latter was obtained when all DE-genes were included. For each analysis, the sensitivities at the 0.05, 0.1, and 0.5% FPRs were observed together with the reflected bias.

  I. Removed: the C-spots' intensities were treated as missing values.

  II. Included: the C-spots' log-ratios were set to zero prior to calculating the test-statistics.

The rationale for including C-spots is again that these spots commonly arise from genes that are not expressed in either channel and therefore have mRNA log-ratios equal to zero.
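
As an illustration of the channel-adjustment step, the sketch below subtracts a lowess curve of the log-ratios against the average log-intensities within each print-tip group. It relies on the lowess smoother in statsmodels; the data layout, the frac value, and the option to restrict the fit to NDE-genes are our assumptions rather than the exact implementation of [16]:

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    def print_tip_lowess(M, A, print_tip, use_for_fit=None, frac=0.4):
        """Subtract a lowess curve of M (log-ratio) versus A (average log-intensity),
        fitted separately within each print-tip (sub-grid) group."""
        if use_for_fit is None:
            use_for_fit = np.ones_like(M, dtype=bool)
        M_norm = np.array(M, dtype=float)
        for tip in np.unique(print_tip):
            in_tip = print_tip == tip
            fit = in_tip & use_for_fit                    # e.g. only NDE-genes, as in the Lucidea evaluation
            smoothed = lowess(M[fit], A[fit], frac=frac)  # columns: sorted A values, fitted M values
            curve = np.interp(A[in_tip], smoothed[:, 0], smoothed[:, 1])
            M_norm[in_tip] = M[in_tip] - curve
        return M_norm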

Censoring of low intensities

A-spot intensities were censored such that all intensities below a user-defined censoring value κ were increased to this value. In this work, the censoring values 1, 8, 64, and 512 were used. In practice, minimal censoring (i.e. κ = 1) is equivalent to applying no censoring at all. The idea behind censoring is to moderate the variance of the weakly expressed genes. How to select an optimal censoring value remains to be determined. In this paper, background adjustment is a spot-level procedure, whereas censoring is applied globally to an array.
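
Censoring itself is a one-line operation on the background-adjusted A-spot intensities (a sketch, with kappa denoting the censoring value κ):

    import numpy as np

    def censor(intensity, kappa):
        """Raise every A-spot intensity below the censoring value kappa to kappa.
        With kappa = 1 the positive integer intensities are left unchanged,
        which corresponds to the minimal censoring used in the paper."""
        return np.maximum(intensity, kappa)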

Test-statistics

The statistics generated by the B-test were used as test-statistics.

Empirical results of the evaluation study

The data generated by the eight hybridized Lucidea arrays were normalized in 252 ways, as described in Table 3. The notation S.P.κ refers to a normalization that used scanning procedure S and procedure P (I–IX in Table 3) with censoring value κ. The censoring value (among 1, 8, 64, and 512) generating the highest sensitivity for a group of analyses using scanning approach S and procedure P will be referred to as the group's optimal censoring value.

Table 3 Description of the analyses used in the evaluation study. The nine procedures for filtration, background adjustment, and channel adjustment that were evaluated. These procedures were combined with seven scanning procedures (70, 80, 90, 100, RLS 90, RLS 100, and CM) and four censoring values (1, 8, 64, and 512). Combined image analysis using ScanArrayExpress, print-tip lowess, and the B-test were used by all analyses.

For all normalizations, the properties of the analyzed data were summarized by observing the sensitivity at the 0.05%, 0.1%, and 0.5% FPRs (Tables 4, 5, 6), the overall reflected bias (Table 7), and the overall standard deviation (Table 8). Figure 3 shows the bias and sensitivity for an interesting subclass of analyses. Six normalizations were selected as particularly interesting and were investigated in some detail (Table 9, Figures 4, 5, 6).

Table 4 Sensitivity at the 0.05% false positive rate. The estimated sensitivity (%) obtained when the FPR was fixed at 0.05%. If approximately 1% of the genes are DE, then an observed sensitivity in the range of 40–80% converts to an observed false discovery rate of between 5 and 11%.
Table 5 Sensitivity at the 0.1 % false positive rate. The estimated sensitivity (%) obtained when the FPR was fixed at 0.1%.
Table 6 Sensitivity at the 0.5% false positive rate. The estimated sensitivity (%) obtained when the FPR was fixed at 0.5%.
Table 7 Reflected bias. The estimated reflected bias of the DE-genes as defined in Relation 2. A reflected bias equal to -1 implies that only 50% of the true regulation was observed.
Table 8 Standard deviation of the DE-genes. The estimated overall standard deviation of the DE-genes as defined in Relation 3.
Table 9 Estimated measures for six selected analyses. The table shows sensitivity (%), reflected bias, and selected bias for six analyses.
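
The conversion between sensitivity and FDR quoted in the caption of Table 4 follows from the definitions. Assuming that a fraction $\pi = 0.01$ of the genes is DE and that the FPR is fixed at 0.05% (0.0005),

$$\mathrm{FDR} = \frac{(1-\pi)\,\mathrm{FPR}}{(1-\pi)\,\mathrm{FPR} + \pi \cdot \mathrm{sensitivity}},$$

so a sensitivity of 80% gives FDR = 0.000495/0.008495 ≈ 5.8%, and a sensitivity of 40% gives FDR = 0.000495/0.004495 ≈ 11.0%, consistent with the 5–11% range quoted in the caption.
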
Figure 3

Reflected bias and sensitivity for selected normalizations. Reflected bias and sensitivity for 24 normalizations using four types of procedures for filtration, background adjustment, and censoring: complete filtration without background adjustment and censoring (black), complete filtration with background adjustment but without censoring (blue), complete filtration with background adjustment and optimal censoring (red), and partial filtration without background adjustment and censoring (green).

Figure 4

MA-plots for the selected normalizations. MA-plots for six normalizations. The coloring corresponds to genes 1–18 in Figure 2: NDE-genes (1–10) are blue, highly-regulated genes (i.e. ratio 1/10 or 10) expressed at high concentrations (12, 18) are dark green, highly-regulated genes expressed at low concentrations (11, 17) are light green, moderately-regulated genes (i.e. ratio 1/3 or 3) expressed at high concentrations (14, 16) are brown, and moderately-regulated genes expressed at low concentrations (13, 15) are colored red.

Figure 5

ROC-curves for selected normalizations. ROC-curves for six normalizations: 80.I.1 (black), 80.II.1 (red), 80.II.64 (dark red), 80.V.1 (green), CM.I.1 (blue), and CM.II.1 (yellow).

Figure 6

Description of individual genes for selected normalizations. Six measures are presented: the percentage of NDE-genes included¹ in the statistical analysis (6A), the overall standard deviation of the NDE-genes (6B), the percentage of DE-genes included¹ in the statistical analysis (6C), the overall standard deviation of the DE-genes (6D), the reflected bias of the DE-genes (6E), and the sensitivity of the DE-genes (6F). Details of the genes (1–18) are found in Figure 2. In 6C–6F, genes with the same magnitude of regulation are presented together. The analyses are marked with different numbers: 80.I.1 (1), 80.II.1 (2), 80.II.64 (3), 80.V.1 (4), CM.I.1 (5), and CM.II.1 (6). A star indicates that values are equal. ¹A gene was included in the statistical analysis if at least one of the arrays was able to generate an observation of the desired log-ratio. In addition, analyses that used partial filtration did not include genes for which all arrays generated so-called not-found spots.

Normalizations without background adjustment did not benefit from censoring; therefore only results observed at the minimal censoring are presented in this paper. The number of C-spots was low when either partial or complete filtration was used (data not shown), and consequently including C-spots (III and VI) gave results similar to excluding C-spots (II and V), so only results from the latter analyses are presented in this paper. Partial filtration with background adjustment (V) performed considerably better (higher sensitivity and lower bias) than partial filtration without background adjustment (IV) (Tables 4, 5, 6, 7). Normalizations without filtration (VII, VIII, and IX) and with minimal censoring had very low sensitivities. In this case, optimal censoring gave considerably higher sensitivities, but these were still lower than the sensitivities obtained when procedure V was used (Tables 4, 5, 6, 7).

Henceforth, we concentrate on what are arguably the four most interesting procedures: complete filtration without background adjustment (I), complete filtration with background adjustment and minimal censoring (II.1), complete filtration with background adjustment and optimal censoring (II.op), and partial filtration with background adjustment (V). These procedures were combined with the seven scanning procedures to give a subclass of 24 normalizations. The properties of these normalizations are discussed below.

Overall bias and standard deviation

For all analyses except one (70.II.1), the reflected bias was negative and the magnitude of the regulation was underestimated (Table 7). Background adjustment had a positive effect on the bias, while censoring and the use of partial filtration resulted in high bias (Table 7, Figure 3). The 70- and 80-scan procedures resulted in relatively small bias, while the CM-procedure gave high bias (Table 7, Figure 3). The standard deviation was generally high among normalizations using partial filtration, and low among methods using complete filtration without background adjustment or the CM-procedure (Table 8).

Sensitivity at the 0.05% false positive rate

Normalizations that used complete filtration and minimal censoring (I and II.1) generally had low sensitivities (26–51%); the exceptions were analyses 80.I.1, CM.I.1, and CM.II.1. Analyses that used partial filtration had relatively high sensitivities (66–74%) (Table 4, Figure 3). Normalizations that used the CM-procedure or the 80-scan procedure (except 80.II.1) had high sensitivities (68–78%) (Table 4, Figure 3).

Sensitivity at the 0.1 and 0.5% false positive rates

The results at the 0.1% FPR were similar to those obtained at the 0.05% FPR, although the variability between the analyses was smaller (Table 5). The variability at the 0.5% FPR was even smaller, and the analyses' sensitivities varied between 65 and 88% (Table 6). Normalizations that used the 70- or 100-scan procedure had, with one exception (70.II.8), the smallest sensitivities, while analyses that used the CM-procedure had among the highest sensitivities (Table 6).

A detailed comparison between six selected normalizations

Four of the best performing normalizations, 80.II.64, 80.V.1, CM.I.1, and CM.II.1, together with 80.I.1 and 80.II.1 (arguably standard normalizations), were selected for further comparison. The MA-plots (average normalized log-ratios plotted against average log-intensities) for the six analyses are shown in Figure 4. Note the characteristic rocket shape formed by the NDE-intensities for analysis 80.II.1; this analysis had low bias and low sensitivity (Table 9, Figure 5). The other analyses avoided the typical rocket shape and thereby achieved higher sensitivity but also higher bias (Table 9, Figure 4). The trade-off between an analysis' ability to identify DE-genes and its ability to obtain low bias has previously been discussed, e.g. [5].

A large proportion of the genes expressed at very low concentrations were removed prior to the test, independent of which normalization was used (Figure 6A). For analysis 80.II.1, the standard deviation of the NDE-genes decreased with the mRNA-concentration, indicating that the majority of the extreme NDE-log-ratios were caused by genes expressed at low concentrations (Figure 6B).

For analysis 80.V.1 the selected bias (estimated from data on correctly classified DE-genes) was considerably lower than the reflected bias (Table 9). It follows that analysis 80.V.1 divides the DE-genes into two categories: correctly classified genes with relatively low bias and falsely classified genes with high bias. Genes with several so-called not-found spots are likely to fall into the second category. Spot finding was positively correlated with the mRNA-concentration (Figure 6A,C), and DE-genes with low mRNA-concentrations had higher bias, higher standard deviation, and lower sensitivity than genes with high mRNA-concentrations (Figure 4, 6D,E,F).

For most of the other normalizations, DE-genes expressed at low concentrations had considerably higher bias and lower sensitivity than genes expressed at high concentrations (Figure 4, 6E,F). Highly-regulated genes (DE-genes that were either tenfold up- or down-regulated) expressed at low concentrations (i.e. 11 and 17 in Figure 2) had the highest bias (Figure 6E). These genes were expressed at lower concentrations than any of the other DE-genes. Interestingly, the highly-regulated genes 12 and 18 had equal or higher bias than the moderately-regulated genes 14 and 16, even though they were expressed at higher concentrations. This suggests that the magnitude of the regulation affects the bias, so that highly-regulated genes generally have higher bias than moderately-regulated genes.

Discussion

In some important respects our experiment differs from an ordinary microarray experiment. Most importantly, the Lucidea data were not influenced by any biological variation, and it is unclear how adding biological variation would influence the relative ranking of the evaluated analyses. Furthermore, the evaluation is based on eight arrays, and it is possible that the number of arrays in an experiment affects the relative ranking of the analyses. Non-expressed genes that are switched on and expressed genes that are switched off can be very interesting; genes of this type are not present on the Lucidea array. From an experimental point of view we have the complication that each array was scanned at four settings (laser power/PMT): 70/70, 80/80, 90/90, and 100/100 (in that order), and it is possible that data from the higher settings lost some information due to photobleaching. The sensitivities, biases, and standard deviations presented in this paper are all point estimates; the uncertainties of these estimates are not considered. Despite these limitations, the Lucidea experiment gives valuable information about the relative performance of the evaluated analysis methods.

In microarray analyses, one of the most important and difficult problems is to select a cut-off value in order to control the false discovery rate. This problem is not considered in this paper.

The 252 evaluated analyses represent only a small fraction of all available pre-processing procedures. The background adjustment, the print-tip lowess normalization, and the B-test used in the evaluation are all widely used methods, but not necessarily the best methods available. For example, it is possible that better results can be obtained using more advanced background adjustment procedures [17]. Furthermore, the inclusion of C-spots, partial filtration, and censoring generate log-ratios that are affected by censored intensities. Although the B-test is a robust test [2], it was not designed to handle censored observations. All analyses used the same type of image analysis. It is possible that there exist image analysis methods with significantly better spot finding properties than ScanArrayExpress and that the use of such methods could change the relative ranking of the evaluated normalization procedures.

Both partial filtration and the inclusion of C-spots are built on the idea that not-found spots and C-spots are most likely observations from non-expressed genes. However, occasionally these spots arise due to experimental failures. The probability that not-found spots and C-spots arise from non-expressed genes might be determined by considering observations from all arrays simultaneously. Analyses using this information are likely to perform better than the methods suggested in this paper.

Considerable space was devoted to censoring, even though there is no established method for determining the optimal censoring value. Nevertheless, censoring increased the sensitivities of most analyses that used background adjustment, sometimes dramatically. We find these results promising and believe that they can serve as an inspiration for further research.

Conclusion

The use of spike-in experiments is a powerful approach for evaluating microarray preprocessing procedures. The sensitivities at low false positive rates and the reflected bias are useful measures for characterizing analyzed microarray data.

In general, there was a trade-off between the ability of the analyses to identify DE-genes and their ability to provide unbiased estimators of the desired ratios. No single analysis achieved both low bias and high sensitivity. The magnitude of the regulation of the DE-genes was underestimated by almost all analyses; often less than 50% of the true regulation was observed (i.e. reflected bias < -1). Moreover, the bias depended on the underlying mRNA-concentrations; DE-genes expressed at low concentrations generally had higher bias than genes expressed at high concentrations.

When very low false positive rates were considered (e.g. 0.05%), many of the analyses had relatively low sensitivities. However, analyses that used either the CM-procedure or partial filtration had, with few exceptions, high or very high sensitivities. If bias is not a major problem, we strongly recommend the use of either the CM-procedure or partial filtration, which give considerably higher sensitivities than some commonly used analysis methods.

Methods

The Lucidea arrays

The arrays were in-house produced cDNA-arrays [18] consisting of 20 clones from the Lucidea Universal ScoreCard (Figure 2). The clones were dissolved in 50% DMSO and printed on UltraGAPS slides (Corning Life Science) using a MicroGrid II arrayer (BioRobotics). Each clone was printed 480 times in 48 identically designed sub-grids. Besides the Lucidea genes, several other genes were printed on the array, but data from these genes were excluded prior to normalization.

Labeling, hybridization, and scanning

Eight Lucidea arrays were hybridized. In short, first-strand cDNA-synthesis was performed using Superscript II (Invitrogen), incorporating aminoallyl-dUTP (Amersham Biosciences). Five µl of the Lucidea Universal ScoreCard reference and test spike mix RNA, together with 25 µg of total RNA from the murine cell line J774.1, were used in the respective reactions. The fluorophores Cy3 and Cy5 (Amersham Biosciences) were coupled to the aminoallyl group in the test and reference reactions, respectively. The labeled cDNA was purified, and cDNA from the test and reference reactions was mixed and dissolved in DIG Easy Hyb (Roche) supplemented with tRNA (Sigma) and fish sperm DNA (Sigma). The arrays were hybridized overnight at 37°C. Washing was performed in a GeneTAC hybridization station (Genomic Solutions) at 50°C in 0.1 × SSC, 0.1% SDS, followed by 0.1 × SSC at 20°C. Each array was scanned using a ScanArray 4000XL (PerkinElmer) at four different settings.

Image analysis

The images were analyzed with ScanArrayExpress, using the adaptive circle method with a nominal diameter of 150 µm. The median was used to calculate both the foreground and background intensities. One array was also analyzed with GenePix, using the circular feature with a nominal diameter of 150 µm.

References

  1. Mehta T, Tanik M, Allison DB: Towards sound epistemological foundations of statistical methods for high-dimensional biology. Nat Genet 2004, 36(9):943–947. 10.1038/ng1422

  2. Lönnstedt I, Speed TP: Replicated microarray data. Statistica Sinica 2002, 12: 31–46.

  3. Wolfinger RD, Gibson G, Wolfinger ED, Bennett L, Hamadeh H, Bushel P, Afshari C, Paules RS: Assessing gene significance from cDNA microarray expression data via mixed models. J Comput Biol 2001, 8(6):625–637. 10.1089/106652701753307520

  4. Kerr MK: Linear models for microarray data analysis: hidden similarities and differences. J Comput Biol 2003, 10(6):891–901. 10.1089/106652703322756131

  5. Qin LX, Kerr KF: Empirical evaluation of data transformations and ranking statistics for microarray analysis. Nucleic Acids Res 2004, 32(18):5471–5479. 10.1093/nar/gkh866

  6. Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A: False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics 2005, 21(13):3017–3024. 10.1093/bioinformatics/bti448

  7. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80. 10.1186/gb-2004-5-10-r80

  8. Bengtsson H: Aroma - An R Object-oriented Microarray Analysis environment. Preprint in Mathematical Sciences 2004:18, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden 2004.

  9. EDMA [http://www.umu.se/climi/bact/Microarray/EDMA]

  10. PerkinElmer Life Sciences: ScanArrayExpress 2.0. User Manual. 2002.

  11. Axon Instruments Inc: GenePix Pro 5.0, User's Guide & Tutorial. 2003.

  12. Bengtsson H, Jonsson G, Vallon-Christersson J: Calibration and assessment of channel-specific biases in microarray data with extended dynamical range. BMC Bioinformatics 2004, 5(1):177. 10.1186/1471-2105-5-177

  13. Dudley AM, Aach J, Steffen MA, Church GM: Measuring absolute expression with microarrays with a calibrated reference sample and an extended signal intensity range. Proc Natl Acad Sci USA 2002, 99(11):7554–7559. 10.1073/pnas.112683499

  14. Saal LH, Troein C, Vallon-Christersson J, Gruvberger S, Borg A, Peterson C: BioArray Software Environment (BASE): a platform for comprehensive management and analysis of microarray data. Genome Biol 2002, 3(8):SOFTWARE0003. 10.1186/gb-2002-3-8-software0003

  15. Sjödin A, Bylesjö M, Skogström O, Eriksson D, Nilsson P, Rydén P, Jansson S, Karlsson J: UPSC-BASE – Populus transcriptomics online. Plant Journal 2005, in press.

  16. Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP: Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 2002, 30(4):e15. 10.1093/nar/30.4.e15

  17. Yin W, Chen T, Zhou SX, Chakraborty A: Background correction for cDNA microarray images using the TV+L1 model. Bioinformatics 2005, 21(10):2410–2416. 10.1093/bioinformatics/bti341

  18. Lucidea Array [http://www.umu.se/climi/bact/Microarray/LucideaArray.pdf]

Acknowledgements

We thank Linda Stenman for valuable assistance during the production and hybridization of the Lucidea arrays. This work was supported by grants from the Knut and Alice Wallenberg Foundation through the Wallenberg Consortium North, the Kempe Foundation, the Faculty of Medicine at Umeå University and the Faculty of Science and Technology at Umeå University.

Author information

Corresponding author

Correspondence to Patrik Rydén.

Additional information

Authors' contributions

PR and LNO designed the Lucidea array. LN and LNO produced the arrays. LN carried out the laboratory work. BH performed the image analyses. HA invented the combined image analysis method and PR conceived the other novel analysis methods presented in this work. ML and PR created EDMA and carried out the evaluation study. All authors helped to draft the manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Cite this article

Rydén, P., Andersson, H., Landfors, M. et al. Evaluation of microarray data normalization procedures using spike-in experiments. BMC Bioinformatics 7, 300 (2006). https://doi.org/10.1186/1471-2105-7-300
