Two-stage normalization using background intensities in cDNA microarray data
© Yoon et al; licensee BioMed Central Ltd. 2004
Received: 07 April 2004
Accepted: 21 July 2004
Published: 21 July 2004
In the microarray experiment, many undesirable systematic variations are commonly observed. Normalization is the process of removing such variation that affects the measured gene expression levels. Normalization plays an important role in the earlier stage of microarray data analysis. The subsequent analysis results are highly dependent on normalization. One major source of variation is the background intensities. Recently, some methods have been employed for correcting the background intensities. However, all these methods focus on defining signal intensities appropriately from foreground and background intensities in the image analysis. Although a number of normalization methods have been proposed, no systematic methods have been proposed using the background intensities in the normalization process.
In this paper, we propose a two-stage method adjusting for the effect of background intensities in the normalization process. The first stage fits a regression model to adjust for the effect of background intensities and the second stage applies the usual normalization method such as a nonlinear LOWESS method to the background-adjusted intensities. In order to carry out the two-stage normalization method, we consider nine different background measures and investigate their performances in normalization. The performance of two-stage normalization is compared to those of global median normalization as well as intensity dependent nonlinear LOWESS normalization. We use the variability among the replicated slides to compare performance of normalization methods.
For the selected background measures, the proposed two-stage normalization method performs better than global or intensity dependent nonlinear LOWESS normalization method. Especially, when there is a strong relationship between the background intensity and the signal intensity, the proposed method performs much better. Regardless of background correction methods used in the image analysis, the proposed two-stage normalization method can be applicable as long as both signal intensity and background intensity are available.
cDNA microarrays consist of thousands of individual DNA sequences printed in a high density array on a glass slide. After being reverse-transcribed into cDNA and labeled using red (Cy5) and green (Cy3) fluorescent dyes, two target mRNA samples are hybridized with the arrayed DNA sequences or probes. Then, the relative abundance of these spotted DNA sequences can be measured. The ratio of the fluorescence intensity for each spot represents the relative abundance of the corresponding DNA sequence.
In cDNA microarray experiments, there are many sources of systematic variation. Normalization is the process of removing such variation that affects the measured gene expression levels. The main idea of normalization is to adjust for artifact differences in intensity of the two labels. Such differences result from differences in affinity of the two labels for DNA, differences in amounts of sample and label used, differences in photomultiplier tube and laser voltage settings and differences in photon emission response to laser excitation. Although normalization alone cannot control all systematic variations, normalization plays an important role in the earlier stage of microarray data analysis.
Many normalization methods have been proposed by using the statistical regression models. Kerr et al.  and Kerr et al.  suggested the ANOVA model approach. Wolfinger et al.  proposed a mixed effect model for normalization. Schadt et al.  proposed smoothing splines with generalized cross-validation (GCVSS). Kepler et al.  used a local polynomial regression to estimate the normalized expression levels as well as the expression level dependent error variance. Yang et al.  summarized a number of normalization methods for dual labeled microarrays such as global normalization and robust locally weighted scatter plot smoothing (LOWESS, Cleveland ). Workman et al.  proposed a robust nonlinear method for normalization using array signal distribution analysis and cubic splines. Wang et al.  suggested iterative normalization of cDNA microarray data to estimate normalization coefficients and to identify control gene set. Chen et al.  presented subset normalization to adjust for location biases combined with global normalization for intensity biases.
After performing two dye cDNA microarray experiments, we get foreground and background intensities from red channel and green channel, respectively. Although a complex modeling approach can be used, the signal intensity is usually computed by subtracting the background intensity from the foreground intensity. Thus, the noise in the background intensity may have a large effect on the signal intensity.
In order to perform the two-stage normalization method, we consider nine different background measures and investigate their performances in normalization. A detailed description on background measures is given in Methods section. Also, Methods section describes the proposed two-stage normalization methods. Results section describes the results from NCI 60 cDNA microarray experiment, which illustrates the effects of background intensities (Zhou et al. ). In addition, some comparative results are presented from cDNA microarrays of cortical stem cells of rat (Park et al. ) and those from kidney, liver, and testis cells from mice (Pritchard et al. ). The performance of two-stage normalization is compared to those of global normalization as well as intensity dependent nonlinear LOWESS normalization. We use the variability among the replicated slides to compare the performance of normalization methods. For certain selected background measures, the proposed two-stage normalization performs better than global or intensity dependent nonlinear normalization method. Finally, Conclusion section summarizes the concluding remarks.
We propose a two-stage normalization method for the cDNA microarray data analysis using background intensities. At the first stage, we adjust for the effect of background intensities on M; at the second stage, we correct bias on M caused by other sources of systematic variation.
Stage 1. Background normalization
Let g fi and g bi be the means(or medians) of the i th foreground and background intensity of green channel, respectively; r fi and r bi be the corresponding means(or medians) of red channel, respectively. Then for each spot, we have two pairs of intensities: (g fi , g bi ) and (r fi , r bi ), i = 1,..., p, where p is the number of spots in a slide. (For simplicity, we omit the subscript and define A and M using notation of Yang et al.  as follows:
M = [log(r f - r b ) - log(g f - g b )] = [log(R) - log(G)], (2)
Y 1 = log(r b ), (3)
Y 2 = log(r f / r b ), (4)
Y 4 = log(g b ), (6)
Y 5 = log(g f / g b ), (7)
For simplicity, let Y k (k = 1,...,9) be an appropriate background intensity defined by red channel (a), green channel (b), or two channels (c). Then, fit the nonlinear LOWESS curve as follows:
M k = C(1) (Y k ), (12)
M k (1) = M k - C(1) (Y k ), for k = 1,...,9 (13)
where C(1)(·) represents the LOWESS curve and M k (1) is the residual from the curve. Note that M k (1) is the log-ratio of relative intensities after removing the effect of background intensities. For these ratios, we can perform the usual MA normalization at the second stage.
Stage 2. MA normalization
In the second stage, we perform the normalization process as follows:
M k (1) = C(2) (A), (14)
M k (2) = M k (1) - C(2) (A), (15)
where C(2) (·) is the LOWESS curve and M k (2) is the residuals from the curve in the second stage.
Note that at the second stage any normalization method can be applied including a simple global normalization method.
Results of NCI 60 data
We first apply the proposed two-stage normalization method to a microarray data set of the NCI 60 cancer cell lines. These cell lines derived from human tumors have been widely used for investigations on various drugs and molecular targets (http://discover.nci.nih.gov). The National Cancer Institute's Developmental Therapeutic Programs (http://dtp.nci.nih.gov) has been studying a large number of anti- cancer drug compounds and molecular targets on the 60 cancer cell lines (Weinstein et al. ). In particular, the NCI 60 microarray data have been frequently reanalyzed as an experimental model due to the inaccessibility to human tumor tissues for various studies on cancer. Using HCT116, one of the colon cell lines in the NCI 60 panel, Zhuo et al.  performed gene expression profile of dose- and time-dependent effects by the topoisomerase inhibitor I camptothecin compound (CPT). We here use a subset of the array data set consisting of ten slides. These slides were randomly selected to demonstrate the proposed method. Each slide contains 2,208 spotted clone sequences. We also apply global median normalization and intensity dependent nonlinear LOWESS normalization to this data set.
From ten slides we choose one slide to illustrate the proposed method. Figure 2 shows the plots of M versus Y k , where M = log(R/G) and Y k , k = 1,...,9 are background measures described in Methods section. The correlation coefficients between M and Y k 's (k = 1,...,9) are 0.2025, 0.6238, 0.6256, -0.0184, 0.5065, 0.5096, 0.1291, 0.5707, and 0.5729, respectively. Background measures Y2 and Y3 tend to have higher correlations than others.
The goal of this study is to compare performances of normalization methods. We compare two-stage normalization to global median normalization and intensity dependent nonlinear LOWESS normalization. Following the idea of Park et al. , we use the variability among the replicated slides as comparison measures, which can be estimated by the mean square error (MSE). For each gene, we can calculate MSE l (l = 1,..., number of gene) which is the variance estimator for each gene derived from replicated slides. The main idea is that the better the normalization method, the smaller the variation among the replicated observations. Here, we use three different sets of microarray data: colorectal cancer data of NCI 60 (Zhou, et al. ), cortical stem cells data (Park, et al. ), and mouse gene expression data (Pritchard et al. ).
The goal of cortical stem cells study is to identify genes that are associated with neuronal differentiation of cortical stem cells. In this experiment, there are 3,840 genes in each slide from two experimental groups for comparison measured at six different time points (12 hrs, 1 day, 2 days, 3 days, 4 days, 5 days). All experiments were replicated three times, thus we have 36 slides for the analysis.
The objective of mouse gene expression study of Pritchard et al.  is to assess natural differences in murine gene expression. A 5406-clone spotted cDNA microarray was used to measure transcript levels in the kidney, liver, and testis from each of 6 normal male C57BL6 mice. Experiments were replicated four times per each mouse organ, two red fluorescent dye sample and two green dye samples. Since there are three organs, we have three sets of microarrays. In each organ, there are 24 slides available for the analysis.
In this comparative study, all five microarray data sets were used: colorectal cancer data set from NCI 60, cortical stem cells data set from Park et al. , and three organ data sets from Pritchard et al. . Since results are similar among three organs, we only present the results of kidney. For simplicity, denote CCD for colorectal cancer data, SCD for stem cells data, and KD for kidney data, respectively.
In microarray studies, many undesirable systematic variations are commonly observed. Normalization becomes a standard process for removing some of the variation that affects the measured gene expression levels. One major source of variation is the background intensities. Recently, some methods have been employed for adjusting for the background intensities. However, all these methods focus on defining signal intensities appropriately from foreground and background intensities during the image analysis (Kooperberg et al. , Edwards ). Although a number of normalization methods have been proposed, no systematic methods have been proposed using the background intensities in the normalization process.
In this paper, we propose two-stage normalization for adjusting for the effect of background intensities systematically. The motivating idea is that the noise caused by background intensities may increase the variability in signal intensities. Although we use the log-transformed ratio of two channels denoted by M in most subsequent analysis, the noise caused by background intensities may still remain in M even after normalization. The two-stage normalization may be quite effective especially when there is a high correlation between M and background measures such as Y2 and Y3.
Among nine background measures, we recommend two background measures Y2 and Y3 based on the results of the comparative studies. For these two background measures, we show that the two-stage normalization method always performs better than the global normalization methods and the nonlinear LOWESS normalization method.
We wondered if the relative good performance of two-stage normalization using Y2 and Y3 is due to low intensities. We investigated this problem for NCI 60 data after removing spots with low intensities. The spots whose ratio of foreground and background intensities were smaller than 1.5 were removed in the analysis. This new data set also provided quite similar results.
The main reasons why background measures Y2 and Y3 perform better than other backgrounds are as follows. The background fluorescence might be relatively strong in the Cy5 channel due to interaction between the slide substrate and the hybridization materials. This effect is weaker in the Cy3 channel. It might be also possible that the background fluorescence in the Cy5 channel inflates the values of r b without correspondingly inflating the values of r f . This means that for weakly-responding spots, the r f and r b values are similar. This produces very low values in M, Y2 and Y3 for these spots. Note values under 5 for log(R) but not for log (G) in Figure 1. Also note downward outliers but not upward outliers for M in all panes in Figure 2. These artefacts in M are partially corrected by regressing against Y2 or Y3. However, the effectiveness of the proposed method for other data may depend on the background fluorescence artefacts.
The comparison is based on the variability measures derived from the replicated microarray samples. These variability measures can be easily derived from any replicated microarray experiment. Although we have studied only a limited number of data sets, our findings are quite consistent.
For these data sets, we also conduct a similar analysis to see the effect of print-tips in normalization process. The results were consistent to those in Figure 5 and not reported here. Background measures Y2 and Y3 yielded better results than other background measures.
In addition, we performed the hypothesis test for M equal to zero and counted the number of genes that have different expression level between two channels. The number of significant gene with different expression value is 909 for the original data before normalization, 326 for the data after global normalization, 302 for the data set after LOWESS normalization, and 297 for the data after two-stage normalization. The numbers of significant genes do not differ much among normalization methods.
The proposed method can be applicable when both background and foreground intensities are available after the image analysis. Many image software provides both signal intensity as well as background intensity of spots. In most cases, the signal intensity is defined by simple subtraction such as (r f - r b ) for the red channel. Our method was illustrated using the usual subtraction (r f - r b ). Although our method starts with a most common approach based on the usual subtraction, it can be applied to any other models for defining spot intensity r = f(r f ,r b ), where r is the signal intensity, as long as a background intensity is available. Our method can be applicable as long as the image analysis provides the signal intensity and background intensity. For example, our method using Y2 builds on the relationship between r = f(r f ,r b ) and r b .
Figure 7 also shows that the median values of correlations of Y2 and Y3 are higher than 0.5 for both CCD and SCD, while those for KD are smaller than 0.5. Note that CCD and SCD have more reduction in variability than KD in the two-stage normalization. For lower correlations of Y k s do not reduce variability much compared to the usual LOWESS normalization.
The authors wish to thank two anonymous referees for many helpful comments and suggestions that improved the manuscript greatly. This work was supported by a grant from the Korean Ministry of Science and Technology (Korean Systems and Biology Research Grant, M1030970000-03B5007).
- Kerr MK, Martin M, Churchill GA: Analysis of variance for gene expression microarray data. J Comput Biol 2000, 7: 819–837. 10.1089/10665270050514954View ArticlePubMedGoogle Scholar
- Kerr MK, Martin M, Churchill GA: Experimental design for gene expression microarrays. Biostatics 2001, 2: 183–201. 10.1093/biostatistics/2.2.183View ArticleGoogle Scholar
- Wolfinger RD, Gibson G, Wolfinger ED, Bennett L, Hamadeh H, Bushel P, Afshari C, Paules RS: Assessing gene significance from cDNA microarray expression data via mixed models. J Comput Biol 2001, 8: 625–37. 10.1089/106652701753307520View ArticlePubMedGoogle Scholar
- Schadt EE, Li C, Ellis B, Wong WH: Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. J Cell Biochem Suppl 2001, 37(Suppl):120–5. 10.1002/jcb.10073View ArticlePubMedGoogle Scholar
- Kepler TB, Crosby L, Morgan KT: Normalization and analysis of DNA microarray data by self-consistency and local regression. Genome Biology 2002, 3: RESEARCH0037. 10.1186/gb-2002-3-7-research0037PubMed CentralView ArticlePubMedGoogle Scholar
- Yang YH, Dudoit S, Luu DM, Peng V, Ngai J, Speed TP: Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research 2002, 30: e15. 10.1093/nar/30.4.e15PubMed CentralView ArticlePubMedGoogle Scholar
- Cleveland WS: Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 1974, 74: 829–836.View ArticleGoogle Scholar
- Workman C, Jensen LJ, Jarmer H, Berka R, Gautier L, Nielser HB, Saxild HH, Nielsen C, Brunak S, Knudsen S: A new non-linear normalization method for reducing variability in DNA microarray experiments. Genome Biol 2002, 3: research0048. 10.1186/gb-2002-3-9-research0048PubMed CentralView ArticlePubMedGoogle Scholar
- Wang Y, Lu J, Lee R, Gu Z, Clarke R: Iterative normalization of cDNA microarray data. IEEE Trans Inf Technol Biomed 2002, 6: 29–37. 10.1109/4233.992159View ArticlePubMedGoogle Scholar
- Chen YJ, Kodell R, Sistare F, Thompson KL, Morris S, Chen JJ: Normalization methods for analysis of microarray gene-expression data. J Biopharm Stat 2003, 13: 57–74. 10.1081/BIP-120017726View ArticlePubMedGoogle Scholar
- Yang MC, Ruan QG, Yang JJ, Eckenrode S, Wu S, McIndoe RA, She JX: A statistical method for flagging weak spots improves normalization and ratio estimates in microarrays. Physiol Genomics 2001, 7: 45–53.View ArticlePubMedGoogle Scholar
- Kim JH, Kim HY, Lee YS: A novel method using edge detection for signal extraction from cDNA microarray image analysis. Exp Mol Med 2001, 33: 83–8.View ArticlePubMedGoogle Scholar
- Kim JH, Shin DM, Lee YS: Effect of local background intensities in the normalization of cDNA microarray data with a skewed expression profiles. Exp Mol Med 2002, 34: 224–32.View ArticlePubMedGoogle Scholar
- Kooperberg C, Fazzio TG, Delrow JJ, Tsukiyama T: Improved background correction for spotted DNA microarrays. J Comput Biol 2002, 9: 55–66. 10.1089/10665270252833190View ArticlePubMedGoogle Scholar
- Edwards D: Non-linear normalization and background correction in one-channel cDNA microarray studies. Bioinformatics 2003, 19: 825–833. 10.1093/bioinformatics/btg083View ArticlePubMedGoogle Scholar
- Zhou Y, Gwadry FG, Reinhold WC, Miller LD, Smith LH, Scherf U, Liu ET, Kohn KW, Pommier Y, Weinstein JN: Transcriptional Regulation of Mitotic Genes by Camptothecin-induced DNA Damage: Microarray Analysis of Dose- and Time-dependent Effects. Cancer Research 2002, 62: 1688–1695.PubMedGoogle Scholar
- Park T, Yi SG, Lee S, Lee SY, Yoo DH, Ahn Jl, Lee YS: Statistical tests for identifying differentially expressed genes in time course microarray experiments. Bioinformatics 2003, 19: 694–703. 10.1093/bioinformatics/btg068View ArticlePubMedGoogle Scholar
- Pritchard CC, Hsu L, Delrow J, Nelson PS: Project normal: Defining normal variance in mouse gene expression. PNAS 2001, 98(6):13266–71. 10.1073/pnas.221465998PubMed CentralView ArticlePubMedGoogle Scholar
- Weinstein JN, Myers TG, O'Connor PM, Friend SH, Fornace AJ Jr, Kohn KW, Fojo T, Bates SE, Rubinstein LV, Anderson NL, Buolamwini JK, van Osdol WW, Monks AP, Scudiero DA, Sausville EA, Zaharevitz DW, Bunow B, Viswanadhan VN, Johnson GS, Wittes RE, Paull KD: An information-intensive approach to the molecular pharmacology of cancer. Science 1997, 275: 343–9. 10.1126/science.275.5298.343View ArticlePubMedGoogle Scholar
- Park T, Yi SG, Kang SH, Lee S, Lee YS, Simon R: Evaluation of normalization methods for microarray data. BMC Bioinformatics 2003, 4: 33. 10.1186/1471-2105-4-33PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.