A robust two-way semi-linear model for normalization of cDNA microarray data

Background Normalization is a basic step in microarray data analysis. A proper normalization procedure ensures that the intensity ratios provide meaningful measures of relative expression values. Methods We propose a robust semiparametric method in a two-way semi-linear model (TW-SLM) for normalization of cDNA microarray data. This method does not make the usual assumptions underlying some of the existing methods. For example, it does not assume that: (i) the percentage of differentially expressed genes is small; or (ii) the numbers of up- and down-regulated genes are about the same, as required in the LOWESS normalization method. We conduct simulation studies to evaluate the proposed method and use a real data set from a specially designed microarray experiment to compare the performance of the proposed method with that of the LOWESS normalization approach. Results The simulation results show that the proposed method performs better than the LOWESS normalization method in terms of mean square errors for estimated gene effects. The results of analysis of the real data set also show that the proposed method yields more consistent results between the direct and the indirect comparisons and also can detect more differentially expressed genes than the LOWESS method. Conclusions Our simulation studies and the real data example indicate that the proposed robust TW-SLM method works at least as well as the LOWESS method and works better when the underlying assumptions for the LOWESS method are not satisfied. Therefore, it is a powerful alternative to the existing normalization methods.


Background
Microarray technology has become a useful tool for quan-titatively monitoring gene expression patterns and has been widely used in functional genomics [1,2]. In a cDNA microarray experiment, cDNA segments representing a collection of transcripts and Expressed Sequence Tags (ESTs) are amplified by PCR and spotted in high density on glass microscope slides using a robotic system to produce cDNA microarrays. Each microarray contains thousands of such PCR products, named cDNA probes, which serve as reporters for the expression of the respective transcripts that represent the collection of genes or ESTs. The cDNA microarrays are queried in a co-hybridization assay using two fluorescently labeled biosamples derived from RNA obtained from the cell populations of interest. One sample is labeled with fluorescent dye Cy5 (red), and another with fluorescent dye Cy3 (green). Hybridization is assayed using a confocal laser scanner to measure fluorescence intensities, allowing simultaneous determination of the relative expression levels of all the genes represented on the slide [3].
A basic question in analyzing cDNA microarray data is normalization, the purpose of which is to remove systematic bias in the observed expression values by establishing a normalization curve across the whole dynamic range. A proper normalization method ensures that the normalized intensity ratios provide meaningful measures of relative expression levels. Normalization is needed because many factors, including different efficiency of dye incorporation, difference in the amount of RNA labeled between the two channels, uneven hybridizations, difference in the printing pin heads, among others, may cause bias in the observed expression values. Therefore, proper normalization is a critical component in the analysis of microarray data and can have important impact on higher level analysis such as detection of differentially expression genes, classification, and cluster analysis.
Many normalization methods have been proposed in the literature. The earliest normalization method for cDNA microarray data goes back to Chen et al. [4] who proposed a ratio-based method. Yang et al. [5] summarized several normalization methods for cDNA microarray data such as global normalization, dye-swap normalization, blockwise normalization, and scale normalization. They also proposed a locally weighted scatter plot smoothing (LOWESS [6]) method for intensity dependent normalization. Quackenbush [7] and Bilban et al. [8] provided good reviews on normalization methods for cDNA microarray data. Tseng et al. [9] proposed using a rank based procedure to first select a set of invariant genes that are likely to be constantly expressed and then carrying out LOWESS normalization using this set of genes. But as pointed out by Tseng et al., selected invariant genes may not cover the whole dynamic range of the expression values, and extrapolation is needed to fill in the gaps that are not covered by the invariant genes. Kepler et al. [10] also first estimated a set of "constantly expressed genes" and then used the LOWESS method. Wang et al. [11] proposed an iterative normalization method for cDNA microarray data by estimating a normalization coefficient and identifying control genes. Workman et al. [12] used array signal distribution analysis for a robust non-linear method of normalization. Park et al. [13] compared a number of normalization methods, including global, linear and LOWESS normalization methods. Wolfinger et al. [14] used a mixed model for normalization. They proposed a normalization model for normalization and a gene model for inference and these two models are related by the residual terms in the normalization model. A constant normalization factor assumption is needed in this method. Fan et al. [15] considered a Semi-linear-In-slide Model (SLIM) method using within-array replications. The SLIM method requires replication of a subset of the genes in an array. If the number of replicated genes is small, the expression values of the replicated genes may not cover the entire dynamic range or reflect spatial variation in an array. Fan et al. [16] generalized the SLIM method to account for across-array information, resulting in an aggregated SLIM, so that replication within an array is no longer required. Huang et al. [17] proposed a twoway semi-linear model (TW-SLM) for normalization of cDNA microarray data. They used the least squares method for estimating the normalization curves based on B-splines. This method does not require the assumptions required by the LOWESS normalization method, i.e. (i) a small fraction of genes are differentially expressed or (ii) there is symmetry in the expression levels of up-and down-regulated genes.
It is well known that the least squares method is not resistant to outliers which arise often in cDNA microarray experiments because of many sources of variations. In this paper, we propose a robust method for normalization in the framework of the TW-SLM. We conduct simulation studies and use a real cDNA microarray data set to compare the proposed method with the LOWESS normalization method.

Simulation study
Simulation was conducted to compare the mean square errors (MSE) and biases of estimated gene expression levels between the proposed robust TW-SLM and LOWESS normalization methods, between the proposed method and the TW-SLM using OLS. The MSE for the jth gene is calculated as the following: where N is the total number of replicates for each simulation, J is the number of unique genes, β j is the true gene expression level (base two log scale) for gene j, is the estimated value for β j , is the mean of for N replicates, j = 1, 2,..., J, where J is the total number of genes. The data simulation procedure is based on the method proposed by Balagurunathan et al. [18]. In each simulation, we generated 10 slides with twelve blocks in each, and 500 genes in each block. We repeated 100 times for each simulation. The simulation procedure can be summarized in the following steps: 1. Simulate true signal intensity for each gene j using the exponential distribution with the mean of 3,000, i.e. I j ẽ xp(λ = 1/3000), for j = 1,..., J; 2. Simulate fluorescent intensity for the Cy5 channel and the Cy3 channel with the normal distribution, respectively. Suppose the coefficients of variation for intensity in the Cy5 channel and the Cy3 channel are α rj and α gj , respectively, then the fluorescent intensity on the two channels can be generated by the normal distribution with mean I j and standard deviations α rj I j and α gj I j for the red channel and the green channel, respectively. Let R j and G j represent simulated fluorescent intensity for the Cy5 channel and the Cy3 channel for gene j, respectively; 3. Simulate differentially expressed genes. Suppose γ × 100% genes are differentially expressed in the whole simulated gene set, then the ratio of the expression level for gene j can be generated by t j = 10 ±b with b ~ Beta (1.7,4.8).
The sign ± will determine if the gene is up-or down-regulated. The probability of the up-regulated genes within those γ × 100% differentially expressed genes is given as an input parameter. For the genes that are not differentially expressed, the b takes value zero; 4. Incorporate the t j into signal intensity of gene j. The R j and G j will be adjusted by adding the simulated expression ratio t j through the following formulae: for j = 1,..., J; . Simulate a fluorescent system with the imperfect response characteristics. Due to various reasons, such as the unequal amount of mRNA for the two channels, different labeling efficiencies, or uneven laser powers at the scanning stage [18], actual intensity in the two channels are not exactly the same. More over, fluorescent intensity is not necessarily linearly related to the expression levels. Balagurunathan et al. proposed the following functional family, to distort the response characteristic functions of observed fluorescent intensity for the two channels, which are expressed as and , respectively.
So four parameter values need to be determined for each channel before simulation. Different parameter values in the two channels will control the shape of the ratio vs. signal intensity plots (R-I plots); 6. Simulate background noise for each channel. The mean of background noise is determined by one input parameter: the signal to noise ratio (SNR) and the true mean of signal. The SNR is the ratio between the true mean of the signal and the true mean of background noise. The SNR controls variability of background noise. The normal distribution with a given mean value is used in simulating background noise. Variance of background noise will be controlled by the input parameters α br and α bg for the Cy5 channel and the Cy3 channel, respectively. These two parameters are the ratios between the mean and the standard deviations of background noise for the two channels, respectively. Simulated signal intensity for the two channels, and , are adjusted by subtracting background noise in each channel. Let and still denote background adjusted signal intensity for the two channels; 7. Add noise to the signal intensity for each channel. Finally, the signal intensity of each channel is generated by given as input parameters to control variability of fluorescent signal intensity.
We simulated two situations, one is the no-dye bias case and another one is the shape case (dye bias exists). R-I plots of twelve blocks on one slide for two simulated cases are shown in Figures 2 and 3, respectively. We considered five different percentage levels of differentially expressed genes: 1%, 5%, 10%, 20%, and 40%. The ratio of the up-regulated genes to the down-regulated genes takes three values, i.e., 1:1, 3:1, and 9:1 at each percentage level of differentially expressed genes. In addition, based on the suggestion of a reviewer, we simulated an extreme case for the scenario in Figure 3, in which 70% genes are all up-regulated and the remaining ones are not differentially expressed.
The trend of MSEs and biases of estimated gene expression levels are similar between the robust TW-SLM and the LOWESS normalization methods across different levels of the ratios between the up-regulated genes and the downregulated genes. This trend also exists in the extreme case. We present the results of the following two scenarios: (a) a 9:1 ratio between the number of the up-regulated genes and that of the down-regulated genes and, (b) the extreme case. Tables 1 and 3 present MSEs, Tables 2 and 4 show

Percentage of DEG
biases of estimated gene expression levels. MSEs and biases for the extreme case (70% of the genes are up-regulated) are presented in the bottom of Tables 3 and 4, which are displayed in Figures 6 and 7, respectively. The robust TW-SLM method has smaller means of MSEs than the LOWESS normalization method and the TW-SLM using OLS, respectively. Also the ranges of MSEs for the proposed method are also smaller than those using the LOWESS method and the TW-SLM with OLS, respectively.
Comparing the different robust weight functions, means of MSEs are slightly smaller using Tukey's weight function than that using Huber's weight function. These results are observed across different percentage levels of differentially expressed genes. Biases for estimated gene expression levels distributed similarly between the proposed method and the LOWESS normalization method. But the ranges of the biases for the proposed method are smaller than those of the LOWESS normalization method and the TW-SLM using OLS, respectively. These observations are true in both simulated situations.
The extreme case is an example where the proposed method does better than the LOWESS method (Tables 3  and 4, Figures 6 and 7). Estimates using the LOWESS method are downward biased in this case. This is what we would expect because the LOWESS method fits normalization curves through the majority of genes, which are mostly up-regulated here. In contrast, the TW-SLM method does not need the either of the two assumptions needed by the LOWESS method, neither of which is satisfied here. The distributions of MSEs and biases between the TW-SLM using OLS and the LOWESS method are similar for cases where there is a relatively small percentage of differentially expressed genes. However, the TW-SLM with OLS performs better than the LOWESS when a larger proportion of genes are differentially expressed. It appears that the more deviation from the two assumptions required by the LOWESS, the better the TW-SLM performs. This trend is consistent with findings in our previous work [17].

An example
In this section, a real data set was analyzed to compare consistency of the LOWESS normalization method and the proposed robust TW-SLM method. A collection of human placenta cDNAs comprising 7,042 clones was identified and used as the probe set for cDNA microarray fabrication in this study [19]. An example of R-l plots for twelve blocks of slide one with no-dye bias for two channels, 10% genes are differentially expressed Figure 2 An example of R-l plots for twelve blocks of slide one with no-dye bias for two channels, 10% genes are differentially expressed. An example of shaped R-l plots for twelve blocks of slide one with the dye bias for two channels, 10% genes are differentially expressed Figure 3 An example of shaped R-l plots for twelve blocks of slide one with the dye bias for two channels, 10% genes are differentially expressed. between the human placenta RNA and the PS reference RNA, and between the Universal Human Reference RNA and the PS reference RNA. The design of this experiment is depicted in Figure 1.
After hybridization, slides were scanned with the Axon instruments 4000B scanner. The 633 and 532 lasers are used for excitation of the Cy5 and Cy3 fluorophores, respectively. For each of the three types of hybridizations (i.e., the human placenta vs. the universal reference, the A boxplot for comparing mean square errors among normalization methods for the case that 70% genes are all up-regulated (Table 3) Figure 6 A boxplot for comparing mean square errors among normalization methods for the case that 70% genes are all up-regulated (Table 3). human placenta vs. the PS reference, and the universal reference vs. the PS reference), there are four slides, including two dye-swapped slides. Each clone was printed three times on different blocks on each slide. Background adjusted medians for the Cy5 and Cy3 channels were used as expression levels. We removed negative controls including "Human Cot1", "PolyA" and "Empty" in the analysis.
To evaluate the proposed method, we compare it with the LOWESS method by examining which method produces more consistent results between the direct comparison A boxplot for comparing biases among normalization methods for the case that 70% genes are all up-regulated (Table 4) Figure 7 A boxplot for comparing biases among normalization methods for the case that 70% genes are all up-regulated (Table 4).    and the indirect comparison of human placenta and universal human reference RNA samples as described above (see also Figure 1). The rationale is that the results from the direct comparison design and the indirect comparison design should be similar, because the same RNA samples are compared in both designs, albeit the indirect comparison is through a third common reference. Therefore, a better normalization method is the one that yields more consistent results between the direct and indirect comparison experiments.
The data were normalized using the LOWESS normalization method and the robust TW-SLM with Tukey's robust weight function separately. Significance analysis was carried out for the normalized data for each method by comparing gene expression levels in the human placenta tissue relative to the universal reference. One sample t-test was used for the direct comparison and two-sample t-test was used for the indirect comparison. We used 10 -5 and 10 -3 as cutoff points for p-values to determine if clones are statistical significant or not. Consistency of estimated relative gene expression levels was compared between the direct design and the indirect design for each method. We also compared overlap between the LOWESS normalization method and the robust TW-SLM for each design. The results are presented in Figures 4 and 5.
We used 10 -5 as a cutoff point for p-values in Figure 4. Using the robust TW-SLM normalization and the t-tests, there are 2,907 genes with p-value less than 10 -5 in the direct comparison and 2,791 in the indirect comparison. There are 1,713 genes common in these two sets of genes with p-value less than 10 -5 , which account for about 59% (1713/2907) in the direct comparison and about 61% (1713/2791) in the indirect comparison.
In comparison, using the LOWESS normalization and the t-tests, there are 1,447 genes with p-value less than 10 -5 in the direct comparison and 1,045 in the indirect comparison. The number of overlapping genes with p-value less than 10 -5 is 467, which is around 32% (467/1447) in the direct comparison and about 44% (467/1045) in the indirect comparison. It is clear that the proposed method performs more consistent between the direct comparison and the indirect comparison.
We also examined overlap between the LOWESS and robust TW-SLM methods for the two comparisons. In the direct comparison, about 79% (1141/1447) of the genes found to be significant based on the LOWESS method are also found to be significant based on the robust TW-SLM method. But they only account for about 40% (1141/ 2907) of the significant genes detected based on the robust TW-SLM method. In the indirect comparison, about 71% (738/1045) of the significant genes based on the LOWESS method are also found to be significant based on the robust TW-SLM method. But they only account for about 26% (738/2791) significant genes detected based on the robust TW-SLM method.
In our analysis, we used background adjusted intensity values. How to adjust background is an important issue in microarray data analysis. To evaluate if background affects our conclusions, we repeated the comparison analysis without adjusting background for the intensity values in both channels, the results are presented in Tables 5 and 6. We see from these tables that the overall results are similar to those using background adjusted intensity values in normalization. This is what we would expect because of low and uniform distributed background noise in all arrays in this example (data description is not shown).
Therefore, the robust TW-SLM method yields more consistent results between the direct comparison and the indirect comparison with the human placenta and the universal human reference RNA samples. In addition, the robust TW-SLM method detects more significant genes for a given cutoff p-value. This makes sense biologically because most of the 7,042 genes specifically discovered from human placenta are expected to have differential expressions relative to the universal reference RNAs. We would expect that the similar comparison results will be got if we compare the TW-SLM using OLS or Huber's weight function with the LOWESS method because the normalization curves for the TW-SLMs (TW-SLM:OLS, TW-SLM:Huber, TW-SLM:Tukey) are similar, but all these three curves are different from the LOWESS normalization curve (Figure 8).

Discussion
We have proposed a robust TW-SLM normalization method for cDNA microarray data. It is interesting to compare the proposed normalization method with the existing methods, such as the widely used LOWESS normalization proposed by Yang et al. (2001) [5] and further discussed by Tseng et al. (2001) [9]. In the LOWESS method, normalization is done separately by first fitting a separate curve for each slide through the R-I plot of logintensity ratios versus log-intensity products. In comparison, the proposed method uses all the slides in estimating each normalization curve, using the gene effects β j as the thread linking these slides. In addition, in the proposed method, the normalization curves φ i and gene effects β j are estimated simultaneously. With this approach, there is no need to assume that the percentage of genes with differential expression levels is small or the expression levels of up-and down-regulated genes are symmetric, or when one of these assumptions is not satisfied, to use dye-swap normalization, which in turn requires the assumption that the two normalization curves are symmetric. (How-ever, we note that dye-swap as a design strategy is useful to balance the experimental conditions and reduce bias due to different dye incorporation efficiencies.) An underlying condition required for the proposed method is independence of arrays, which is satisfied in a typical microarray experiment. Further theoretical conditions for the TW-SLM can be found in the paper by Huang et al. [17].
We have only considered the proposed robust TW-SLM method for the simple direct comparison design described in the Methods section. We can easily extend the method to more complicated designs. For example, we can adapt the proposed robust method to the TW-SLM that accommodates the design where a gene is printed multiple times. Such a design is helpful for improving the precision and for assessing the quality of an array using the coefficient of variation (Tseng et al. 2001 [9]). We can also adapt the robust TW-SLM to incorporate control genes with known concentration ratios in estimating the normalization curves. Model (1) can be easily extended to block-wise normalization by treating different blocks as separate arrays and normalization can be carried out as what we did here. Block-wise normalization considers spatial variation within an array. We did block-wise normalization on the data sets in the example and compared the results with that using the LOWESS method (Tables 5  and 6). The proposed method still outperforms the LOWESS method if we use block-wise normalization in this example.

Conclusions
In our simulation studies, the proposed method performs better than the LOWESS normalization method in terms of MSEs of estimated gene effects in the simulation models we considered. Analysis of the probe set reference data set [19] shows that the proposed method yields more consistent results between the direct and indirect comparisons than the LOWESS normalization method. In addition, the proposed method is more sensitive in detecting differentially expressed genes than the LOWESS method. Therefore, we believe that the proposed robust TW-SLM method is a powerful alternative to the existing normalization methods. We have coded the proposed method in an R package which is available from the corresponding authors.

Methods
We first describe the TW-SLM. For simplicity, we focus on the case of comparing two cell populations, in which two cDNA samples from the respective cell populations are competitively hybridized on the same array. Let n be the number of slides, and J be the number of genes in the study. Let R ij represent background corrected signal intensity from the Cy5 channel and G ij the background corrected signal intensity from the Cy3 channel, and let y ij = log 2 (R ij /G ij ), x ij = (1/2) log 2 (R ij × G ij ), for gene j on slide i. We assume that there is only one spot for each gene on each slide. The TW-SLM [17] is it is well known that least squares estimates are not robust against outliers which often arise in microarray experiments. Therefore, we propose to use the robust method [20] for estimating φ i and β j . This is done by minimizing the objective function where ρ is an appropriately chosen function for robust estimation, λ is the collection of the coefficients in the spline representations of φ i described below, σ is the scale parameter, and α is a constant to be described below. We note here that estimation of φ i , β j are done jointly and uses data from all the arrays. This is different from the LOWESS normalization method in which estimation of normalization curves are done array by array. We The expressions of these functions are given in the Appendix. We choose commonly used constants in the literature for Huber's and Tukey's functions, i.e., H = 1.345 and k = 4.685. The influence of the choice of these constants on normalization methods is beyond the scope of this study.
We use the cubic B-splines [21,22] to approximate the normalization curves φ i . Specifically, let b 1 ,..., b K be K Bspline basis functions. We approximate φ i by We estimate the parameters in model (1) by minimizing objective function (3) using an iterative procedure. Two steps, a location step and a scale step, will be used in the computation.

Location step
We use the following vector and matrix notations in describing the location step: where and because of identifiability requirement in the TW-SLM. We can solve these equations iteratively to obtain and . The derivations of these equations are given in the Appendix.

Scale step
According to Huber's proposal [23], the estimation equation for σ is , and N is the total number of observations in the data set. In general, equation (8)  In order to obtain the consistent scale estimator at the normal distribution and obtain the classic estimates when using the least squares objective function, i.e., , we used the constant suggested by Huber [23], where E Φ denotes expectation with respect to the standard normal distribution function Φ.
The procedure described above is called an iterative reweighted least squares (IWLS) algorithm that is used in many non-least squares estimation problems. The implementation of the IWLS algorithm can be carried out using the following steps:   according to equation (5), and set ; 6. Calculate β (m) given σ (m-1) and using equation (7), and set ; 7. Go to step 2 and iteratively update the estimators of parameters and the weights between steps 2 and 6 until convergence.
We derive estimation equations for location parameters presented in the Methods section in this appendix. Again the notations from the Methods section: ψ(x) = ρ'(x). As defined in equation (5) and Plugging W i into equations (10) and (11) and setting them to zeros, and solving these two equations and yielding estimation equations for in equation (6) and in equation (7). They are Let then equation (7) can also be expressed as  Let x = y -b/3a, divided by a in the both sides of equation (13), and plugs x into equation (13), then we get the Cardan's cubic equation Let p = c/a -b 2 /3a 2 , q = d/a -bc/(3a 2 ) + 2b 3 /(27a 3 ), the above equation becomes y 3 + py + q = 0. (15) The determinant for Cardan's equation (15) is It can be shown that the determined function ∆ must be positive. The first term in the determinant equation must be positive because of the square function and q cannot be zero. If we can show that the p is greater or equal to zero, then the ∆ must be positive. Because So the ∆ is positive if only if 3ac -b 2 is non-negative. We can see that According to the Cauchy inequality [24], we have Therefore, the p must be non-negative and the ∆ must be positive. Thus there is only one real root for equation (15), that is Then the solution for σ in equation (12) is