- Methodology article
- Open access
- Published:
A robust two-way semi-linear model for normalization of cDNA microarray data
BMC Bioinformatics volume 6, Article number: 14 (2005)
Abstract
Background
Normalization is a basic step in microarray data analysis. A proper normalization procedure ensures that the intensity ratios provide meaningful measures of relative expression values.
Methods
We propose a robust semiparametric method in a two-way semi-linear model (TW-SLM) for normalization of cDNA microarray data. This method does not make the usual assumptions underlying some of the existing methods. For example, it does not assume that: (i) the percentage of differentially expressed genes is small; or (ii) the numbers of up- and down-regulated genes are about the same, as required in the LOWESS normalization method. We conduct simulation studies to evaluate the proposed method and use a real data set from a specially designed microarray experiment to compare the performance of the proposed method with that of the LOWESS normalization approach.
Results
The simulation results show that the proposed method performs better than the LOWESS normalization method in terms of mean square errors for estimated gene effects. The results of analysis of the real data set also show that the proposed method yields more consistent results between the direct and the indirect comparisons and also can detect more differentially expressed genes than the LOWESS method.
Conclusions
Our simulation studies and the real data example indicate that the proposed robust TW-SLM method works at least as well as the LOWESS method and works better when the underlying assumptions for the LOWESS method are not satisfied. Therefore, it is a powerful alternative to the existing normalization methods.
Background
Microarray technology has become a useful tool for quantitatively monitoring gene expression patterns and has been widely used in functional genomics [1, 2]. In a cDNA microarray experiment, cDNA segments representing a collection of transcripts and Expressed Sequence Tags (ESTs) are amplified by PCR and spotted in high density on glass microscope slides using a robotic system to produce cDNA microarrays. Each microarray contains thousands of such PCR products, named cDNA probes, which serve as reporters for the expression of the respective transcripts that represent the collection of genes or ESTs. The cDNA microarrays are queried in a co-hybridization assay using two fluorescently labeled biosamples derived from RNA obtained from the cell populations of interest. One sample is labeled with fluorescent dye Cy5 (red), and another with fluorescent dye Cy3 (green). Hybridization is assayed using a confocal laser scanner to measure fluorescence intensities, allowing simultaneous determination of the relative expression levels of all the genes represented on the slide [3].
A basic question in analyzing cDNA microarray data is normalization, the purpose of which is to remove systematic bias in the observed expression values by establishing a normalization curve across the whole dynamic range. A proper normalization method ensures that the normalized intensity ratios provide meaningful measures of relative expression levels. Normalization is needed because many factors, including different efficiency of dye incorporation, difference in the amount of RNA labeled between the two channels, uneven hybridizations, difference in the printing pin heads, among others, may cause bias in the observed expression values. Therefore, proper normalization is a critical component in the analysis of microarray data and can have important impact on higher level analysis such as detection of differentially expression genes, classification, and cluster analysis.
Many normalization methods have been proposed in the literature. The earliest normalization method for cDNA microarray data goes back to Chen et al. [4] who proposed a ratio-based method. Yang et al. [5] summarized several normalization methods for cDNA microarray data such as global normalization, dye-swap normalization, block-wise normalization, and scale normalization. They also proposed a locally weighted scatter plot smoothing (LOWESS [6]) method for intensity dependent normalization. Quackenbush [7] and Bilban et al. [8] provided good reviews on normalization methods for cDNA microarray data. Tseng et al. [9] proposed using a rank based procedure to first select a set of invariant genes that are likely to be constantly expressed and then carrying out LOWESS normalization using this set of genes. But as pointed out by Tseng et al., selected invariant genes may not cover the whole dynamic range of the expression values, and extrapolation is needed to fill in the gaps that are not covered by the invariant genes. Kepler et al. [10] also first estimated a set of "constantly expressed genes" and then used the LOWESS method. Wang et al. [11] proposed an iterative normalization method for cDNA microarray data by estimating a normalization coefficient and identifying control genes. Workman et al. [12] used array signal distribution analysis for a robust non-linear method of normalization. Park et al. [13] compared a number of normalization methods, including global, linear and LOWESS normalization methods. Wolfinger et al. [14] used a mixed model for normalization. They proposed a normalization model for normalization and a gene model for inference and these two models are related by the residual terms in the normalization model. A constant normalization factor assumption is needed in this method. Fan et al. [15] considered a Semi-linear-In-slide Model (SLIM) method using within-array replications. The SLIM method requires replication of a subset of the genes in an array. If the number of replicated genes is small, the expression values of the replicated genes may not cover the entire dynamic range or reflect spatial variation in an array. Fan et al. [16] generalized the SLIM method to account for across-array information, resulting in an aggregated SLIM, so that replication within an array is no longer required. Huang et al. [17] proposed a two-way semi-linear model (TW-SLM) for normalization of cDNA microarray data. They used the least squares method for estimating the normalization curves based on B-splines. This method does not require the assumptions required by the LOWESS normalization method, i.e. (i) a small fraction of genes are differentially expressed or (ii) there is symmetry in the expression levels of up- and down-regulated genes.
It is well known that the least squares method is not resistant to outliers which arise often in cDNA microarray experiments because of many sources of variations. In this paper, we propose a robust method for normalization in the framework of the TW-SLM. We conduct simulation studies and use a real cDNA microarray data set to compare the proposed method with the LOWESS normalization method.
Results
Simulation study
Simulation was conducted to compare the mean square errors (MSE) and biases of estimated gene expression levels between the proposed robust TW-SLM and LOWESS normalization methods, between the proposed method and the TW-SLM using OLS. The MSE for the j th gene is calculated as the following:
that is,
, where N is the total number of replicates for each simulation, J is the number of unique genes, β j is the true gene expression level (base two log scale) for gene j, is the estimated value for β j , is the mean of for N replicates, j = 1, 2,..., J, where J is the total number of genes. The data simulation procedure is based on the method proposed by Balagurunathan et al. [18]. In each simulation, we generated 10 slides with twelve blocks in each, and 500 genes in each block. We repeated 100 times for each simulation. The simulation procedure can be summarized in the following steps:
-
1.
Simulate true signal intensity for each gene j using the exponential distribution with the mean of 3,000, i.e. I j ~ exp(λ = 1/3000), for j = 1,..., J;
-
2.
Simulate fluorescent intensity for the Cy5 channel and the Cy3 channel with the normal distribution, respectively. Suppose the coefficients of variation for intensity in the Cy5 channel and the Cy3 channel are α rj and α gj , respectively, then the fluorescent intensity on the two channels can be generated by the normal distribution with mean I j and standard deviations α rj I j and α gj I j for the red channel and the green channel, respectively. Let R j and G j represent simulated fluorescent intensity for the Cy5 channel and the Cy3 channel for gene j, respectively;
-
3.
Simulate differentially expressed genes. Suppose γ × 100% genes are differentially expressed in the whole simulated gene set, then the ratio of the expression level for gene j can be generated by t j = 10±bwith b ~ Beta(1.7,4.8). The sign ± will determine if the gene is up- or down-regulated. The probability of the up-regulated genes within those γ × 100% differentially expressed genes is given as an input parameter. For the genes that are not differentially expressed, the b takes value zero;
-
4.
Incorporate the t j into signal intensity of gene j. The R j and G j will be adjusted by adding the simulated expression ratio t j through the following formulae: for j = 1,..., J;
-
5.
Simulate a fluorescent system with the imperfect response characteristics. Due to various reasons, such as the unequal amount of mRNA for the two channels, different labeling efficiencies, or uneven laser powers at the scanning stage [18], actual intensity in the two channels are not exactly the same. More over, fluorescent intensity is not necessarily linearly related to the expression levels. Balagurunathan et al. proposed the following functional family,
to distort the response characteristic functions of observed fluorescent intensity for the two channels, which are expressed as
and , respectively. So four parameter values need to be determined for each channel before simulation. Different parameter values in the two channels will control the shape of the ratio vs. signal intensity plots (R-I plots);
-
6.
Simulate background noise for each channel. The mean of background noise is determined by one input parameter: the signal to noise ratio (SNR) and the true mean of signal. The SNR is the ratio between the true mean of the signal and the true mean of background noise. The SNR controls variability of background noise. The normal distribution with a given mean value is used in simulating background noise. Variance of background noise will be controlled by the input parameters α br and α bg for the Cy5 channel and the Cy3 channel, respectively. These two parameters are the ratios between the mean and the standard deviations of background noise for the two channels, respectively. Simulated signal intensity for the two channels, and , are adjusted by subtracting background noise in each channel. Let and still denote background adjusted signal intensity for the two channels;
-
7.
Add noise to the signal intensity for each channel. Finally, the signal intensity of each channel is generated by
with
, where α 1 ~ U (a 1, b 1), α 2 ~ U (c 1, d 1), α 3 ~ U (a 2, b 2), α 4 ~ U (c 2, d 2). The a 1, b 1, c 1, d 1, a 2, b 2, c 2, d 2 are given as input parameters to control variability of fluorescent signal intensity.
We simulated two situations, one is the no-dye bias case and another one is the shape case (dye bias exists). R-I plots of twelve blocks on one slide for two simulated cases are shown in Figures 2 and 3, respectively. We considered five different percentage levels of differentially expressed genes: 1%, 5%, 10%, 20%, and 40%. The ratio of the up-regulated genes to the down-regulated genes takes three values, i.e., 1:1, 3:1, and 9:1 at each percentage level of differentially expressed genes. In addition, based on the suggestion of a reviewer, we simulated an extreme case for the scenario in Figure 3, in which 70% genes are all up-regulated and the remaining ones are not differentially expressed.
The trend of MSEs and biases of estimated gene expression levels are similar between the robust TW-SLM and the LOWESS normalization methods across different levels of the ratios between the up-regulated genes and the down-regulated genes. This trend also exists in the extreme case. We present the results of the following two scenarios: (a) a 9:1 ratio between the number of the up-regulated genes and that of the down-regulated genes and, (b) the extreme case. Tables 1 and 3 present MSEs, Tables 2 and 4 show biases of estimated gene expression levels. MSEs and biases for the extreme case (70% of the genes are up-regulated) are presented in the bottom of Tables 3 and 4, which are displayed in Figures 6 and 7, respectively. The robust TW-SLM method has smaller means of MSEs than the LOWESS normalization method and the TW-SLM using OLS, respectively. Also the ranges of MSEs for the proposed method are also smaller than those using the LOWESS method and the TW-SLM with OLS, respectively.
Comparing the different robust weight functions, means of MSEs are slightly smaller using Tukey's weight function than that using Huber's weight function. These results are observed across different percentage levels of differentially expressed genes. Biases for estimated gene expression levels distributed similarly between the proposed method and the LOWESS normalization method. But the ranges of the biases for the proposed method are smaller than those of the LOWESS normalization method and the TW-SLM using OLS, respectively. These observations are true in both simulated situations.
The extreme case is an example where the proposed method does better than the LOWESS method (Tables 3 and 4, Figures 6 and 7). Estimates using the LOWESS method are downward biased in this case. This is what we would expect because the LOWESS method fits normalization curves through the majority of genes, which are mostly up-regulated here. In contrast, the TW-SLM method does not need the either of the two assumptions needed by the LOWESS method, neither of which is satisfied here.
The distributions of MSEs and biases between the TW-SLM using OLS and the LOWESS method are similar for cases where there is a relatively small percentage of differentially expressed genes. However, the TW-SLM with OLS performs better than the LOWESS when a larger proportion of genes are differentially expressed. It appears that the more deviation from the two assumptions required by the LOWESS, the better the TW-SLM performs. This trend is consistent with findings in our previous work [17].
An example
In this section, a real data set was analyzed to compare consistency of the LOWESS normalization method and the proposed robust TW-SLM method. A collection of human placenta cDNAs comprising 7,042 clones was identified and used as the probe set for cDNA microarray fabrication in this study [19].
Three kinds of RNA samples were used which include: (i) a common reference RNA obtained by in vitro transcription from a pool of cDNAs in equal amount comprising the entire probe set (PS); (ii) the "Universal Human Reference RNA" from Stratagene, a pool of RNAs derived from 10 different cell lines; and (iii) human full-term placenta RNA. The original goal of the study was to evaluate the performance of the PS RNA as a reference RNA in comparison with that of Stratagene's universal reference RNA.
In this study, the Universal Human Reference RNA and the human placenta RNA were treated as two experimental samples. The PS RNA was used as the reference against which the two other bio-samples were compared. In the simple direct comparison, gene expression values were obtained through direct hybridizations between the human placenta RNA and the Universal Human Reference RNA. In the indirect comparison using the PS set as the common reference, hybridizations were performed between the human placenta RNA and the PS reference RNA, and between the Universal Human Reference RNA and the PS reference RNA. The design of this experiment is depicted in Figure 1.
After hybridization, slides were scanned with the Axon instruments 4000B scanner. The 633 and 532 lasers are used for excitation of the Cy5 and Cy3 fluorophores, respectively. For each of the three types of hybridizations (i.e., the human placenta vs. the universal reference, the human placenta vs. the PS reference, and the universal reference vs. the PS reference), there are four slides, including two dye-swapped slides. Each clone was printed three times on different blocks on each slide. Background adjusted medians for the Cy5 and Cy3 channels were used as expression levels. We removed negative controls including "Human Cot1", "PolyA" and "Empty" in the analysis.
To evaluate the proposed method, we compare it with the LOWESS method by examining which method produces more consistent results between the direct comparison and the indirect comparison of human placenta and universal human reference RNA samples as described above (see also Figure 1). The rationale is that the results from the direct comparison design and the indirect comparison design should be similar, because the same RNA samples are compared in both designs, albeit the indirect comparison is through a third common reference. Therefore, a better normalization method is the one that yields more consistent results between the direct and indirect comparison experiments.
The data were normalized using the LOWESS normalization method and the robust TW-SLM with Tukey's robust weight function separately. Significance analysis was carried out for the normalized data for each method by comparing gene expression levels in the human placenta tissue relative to the universal reference. One sample t-test was used for the direct comparison and two-sample t-test was used for the indirect comparison. We used 10-5 and 10-3 as cutoff points for p-values to determine if clones are statistical significant or not. Consistency of estimated relative gene expression levels was compared between the direct design and the indirect design for each method. We also compared overlap between the LOWESS normalization method and the robust TW-SLM for each design. The results are presented in Figures 4 and 5.
We used 10-5 as a cutoff point for p-values in Figure 4. Using the robust TW-SLM normalization and the t-tests, there are 2,907 genes with p-value less than 10-5 in the direct comparison and 2,791 in the indirect comparison. There are 1,713 genes common in these two sets of genes with p-value less than 10-5, which account for about 59% (1713/2907) in the direct comparison and about 61% (1713/2791) in the indirect comparison.
In comparison, using the LOWESS normalization and the t-tests, there are 1,447 genes with p-value less than 10-5 in the direct comparison and 1,045 in the indirect comparison. The number of overlapping genes with p-value less than 10-5 is 467, which is around 32% (467/1447) in the direct comparison and about 44% (467/1045) in the indirect comparison. It is clear that the proposed method performs more consistent between the direct comparison and the indirect comparison.
We also examined overlap between the LOWESS and robust TW-SLM methods for the two comparisons. In the direct comparison, about 79% (1141/1447) of the genes found to be significant based on the LOWESS method are also found to be significant based on the robust TW-SLM method. But they only account for about 40% (1141/2907) of the significant genes detected based on the robust TW-SLM method. In the indirect comparison, about 71% (738/1045) of the significant genes based on the LOWESS method are also found to be significant based on the robust TW-SLM method. But they only account for about 26% (738/2791) significant genes detected based on the robust TW-SLM method.
In our analysis, we used background adjusted intensity values. How to adjust background is an important issue in microarray data analysis. To evaluate if background affects our conclusions, we repeated the comparison analysis without adjusting background for the intensity values in both channels, the results are presented in Tables 5 and 6. We see from these tables that the overall results are similar to those using background adjusted intensity values in normalization. This is what we would expect because of low and uniform distributed background noise in all arrays in this example (data description is not shown).
Therefore, the robust TW-SLM method yields more consistent results between the direct comparison and the indirect comparison with the human placenta and the universal human reference RNA samples. In addition, the robust TW-SLM method detects more significant genes for a given cutoff p-value. This makes sense biologically because most of the 7,042 genes specifically discovered from human placenta are expected to have differential expressions relative to the universal reference RNAs. We would expect that the similar comparison results will be got if we compare the TW-SLM using OLS or Huber's weight function with the LOWESS method because the normalization curves for the TW-SLMs (TW-SLM:OLS, TW-SLM:Huber, TW-SLM:Tukey) are similar, but all these three curves are different from the LOWESS normalization curve (Figure 8).
Discussion
We have proposed a robust TW-SLM normalization method for cDNA microarray data. It is interesting to compare the proposed normalization method with the existing methods, such as the widely used LOWESS normalization proposed by Yang et al. (2001) [5] and further discussed by Tseng et al. (2001) [9]. In the LOWESS method, normalization is done separately by first fitting a separate curve for each slide through the R-I plot of log-intensity ratios versus log-intensity products. In comparison, the proposed method uses all the slides in estimating each normalization curve, using the gene effects β j as the thread linking these slides. In addition, in the proposed method, the normalization curves φ i and gene effects β j are estimated simultaneously. With this approach, there is no need to assume that the percentage of genes with differential expression levels is small or the expression levels of up- and down-regulated genes are symmetric, or when one of these assumptions is not satisfied, to use dye-swap normalization, which in turn requires the assumption that the two normalization curves are symmetric. (However, we note that dye-swap as a design strategy is useful to balance the experimental conditions and reduce bias due to different dye incorporation efficiencies.) An underlying condition required for the proposed method is independence of arrays, which is satisfied in a typical microarray experiment. Further theoretical conditions for the TW-SLM can be found in the paper by Huang et al. [17].
We have only considered the proposed robust TW-SLM method for the simple direct comparison design described in the Methods section. We can easily extend the method to more complicated designs. For example, we can adapt the proposed robust method to the TW-SLM that accommodates the design where a gene is printed multiple times. Such a design is helpful for improving the precision and for assessing the quality of an array using the coefficient of variation (Tseng et al. 2001 [9]). We can also adapt the robust TW-SLM to incorporate control genes with known concentration ratios in estimating the normalization curves. Model (1) can be easily extended to block-wise normalization by treating different blocks as separate arrays and normalization can be carried out as what we did here. Block-wise normalization considers spatial variation within an array. We did block-wise normalization on the data sets in the example and compared the results with that using the LOWESS method (Tables 5 and 6). The proposed method still outperforms the LOWESS method if we use block-wise normalization in this example.
Conclusions
In our simulation studies, the proposed method performs better than the LOWESS normalization method in terms of MSEs of estimated gene effects in the simulation models we considered. Analysis of the probe set reference data set [19] shows that the proposed method yields more consistent results between the direct and indirect comparisons than the LOWESS normalization method. In addition, the proposed method is more sensitive in detecting differentially expressed genes than the LOWESS method. Therefore, we believe that the proposed robust TW-SLM method is a powerful alternative to the existing normalization methods. We have coded the proposed method in an R package which is available from the corresponding authors.
Methods
We first describe the TW-SLM. For simplicity, we focus on the case of comparing two cell populations, in which two cDNA samples from the respective cell populations are competitively hybridized on the same array. Let n be the number of slides, and J be the number of genes in the study. Let R ij represent background corrected signal intensity from the Cy5 channel and G ij the background corrected signal intensity from the Cy3 channel, and let y ij = log2(R ij /G ij ), x ij = (1/2) log2(R ij × G ij ), for gene j on slide i. We assume that there is only one spot for each gene on each slide. The TW-SLM [17] is
y ij = φ i (x ij ) + β j + ∈ ij , i = 1,..., n, j = 1,..., J (1)
In this model, the observed log intensity ratio is decomposed into three components. The first component is φ i which is the intensity dependent normalization curve for slide i, the second component is β j which represents the relative expression value of the j th gene after normalization, the last one is the residual error term. Let be a robust estimator of the i th normalization curve φ i based on this model described above. The normalized data are
Huang et al. (2004) [17] considered the least squares method for estimating φ i and β j in the TW-SLM. However, it is well known that least squares estimates are not robust against outliers which often arise in microarray experiments. Therefore, we propose to use the robust method [20] for estimating φ i and β j . This is done by minimizing the objective function
where ρ is an appropriately chosen function for robust estimation, λ is the collection of the coefficients in the spline representations of φ i described below, σ is the scale parameter, and α is a constant to be described below. We note here that estimation of φ i , β j are done jointly and uses data from all the arrays. This is different from the LOWESS normalization method in which estimation of normalization curves are done array by array.
We consider two ρ functions: Huber's ρ function and Tukey's biweight function. Huber's ρ function is
Tukey's biweight function is
Two other usefull functions derived from ρ, ψ and χ, will be used repeatedly in the description of the algorithm below. They are defined as
ψ(x) = ρ'(x), χ(x) = x ψ(x) - ρ(x). (4)
The expressions of these functions are given in the Appendix. We choose commonly used constants in the literature for Huber's and Tukey's functions, i.e., H = 1.345 and k = 4.685. The influence of the choice of these constants on normalization methods is beyond the scope of this study.
We use the cubic B-splines [21, 22] to approximate the normalization curves φ i . Specifically, let b 1,..., b K be K B-spline basis functions. We approximate φ i by
where b(x) = (1, b 1(x),..., b K (x))' and λ i = (λ i 0, λ i 1,..., λ iK )'.
We estimate the parameters in model (1) by minimizing objective function (3) using an iterative procedure. Two steps, a location step and a scale step, will be used in the computation.
Location step
We use the following vector and matrix notations in describing the location step:
B i = (b(x i 1), b(x i 2),..., b(x iJ ))',
y i = (y i 1, y i 2,..., y iJ )'.
Let
and let
for i = 1,..., n. Given the scale parameter σ, and satisfy the equations:
where
and because of identifiability requirement in the TW-SLM. We can solve these equations iteratively to obtain and . The derivations of these equations are given in the Appendix.
Scale step
According to Huber's proposal [23], the estimation equation for σ is
where r ij = y ij - b'(x ij ) , and N is the total number of observations in the data set. In general, equation (8) does not have an explicit solution. So we use the following updating equation to compute the estimated scale parameter σ,
In order to obtain the consistent scale estimator at the normal distribution and obtain the classic estimates when using the least squares objective function, i.e.,
, we used the constant suggested by Huber [23],
where E Φ denotes expectation with respect to the standard normal distribution function Φ.
The procedure described above is called an iterative reweighted least squares (IWLS) algorithm that is used in many non-least squares estimation problems. The implementation of the IWLS algorithm can be carried out using the following steps:
-
1.
Initialize
for j = 1,..., J and σ (0) = 1, , for i = 1,..., n, j = 1,..., J;
-
2.
Calculate
according to equation (6) given β (m-1), σ (m-1)and for i = 1,..., n, j = 1,..., J, m = 1,...;
-
3.
Check convergence of λ i , β, and σ. If the convergence criteria is met, then stop, otherwise continue;
-
4.
Update σ (m)by equation (9) given , , σ (m-1), and , and set σ (m-1)= σ (m);
-
5.
Calculate weight
given β (m-1), and σ (m)according to equation (5), and set ;
-
6.
Calculate β (m)given σ (m-1)and using equation (7), and set ;
-
7.
Go to step 2 and iteratively update the estimators of parameters and the weights between steps 2 and 6 until convergence.
Appendix
Derivation of and
We derive estimation equations for location parameters presented in the Methods section in this appendix. Again the notations from the Methods section:
b(x) = (1, b 1(x),..., b K (x))',
λ i = (λ i 0, λ i 1,..., λ iK )'.
φ i (x ij ) can be approximated by a linear combination of B-spline basis functions, i.e. b'(x ij )λ i , where b k (x ij ) is the k th B-spline basis function of x ij . Let A = (a 1, a 2,..., a n )', C = (c 1, c 2,..., c n )', and define
Given scale parameter σ, the first partial derivatives of S(λ, β, σ) (3) with respect to λ and β can be expressed in the matrix form as
where B i ; = (b(x i1 ), b(x i2 ),..., b(x iJ ))', y i = (y i 1, y i 2,..., y iJ )', ψ(x) = ρ'(x). As defined in equation (5)
and
W i = diag(w i 1, w i 2,..., w iJ ),
Plugging W i into equations (10) and (11) and setting them to zeros, and solving these two equations and yielding estimation equations for in equation (6) and in equation (7). They are
Let
then equation (7) can also be expressed as
= (Z'WZ)-1 Z'W(y - B ).
The solution of (Z'WZ)-1 can be explicitly calculates using the following matrix,
where
, and for j = 1,..., J. We can get the explicit solution of after doing some linear algebra. It is
for j = 2,..., J. And because of identifiability requirement in model (1).
Derivation of scale parameter estimator
The ψ and χ functions derived from Huber's ρ(z) function are
The related weight function has the form
where H is a constant.
The constant α used in the scale step for Huber's robust estimation can be calculated as the following
where Φ is the distribution function of the standard normal distribution, N is the total number of observations in the dataset, and p is the total number of parameters in the model.
The ψ and χ functions derived from Tukey's ρ(z) function are
The associated weight function has the form
where k is a constant and the constant a in the scale step takes value
where
is the Chi-square probability function with n degrees of freedom evaluated at s.
When Tukey's weight function is used, equation (8) is solved directly for the estimator of σ instead of iteratively updating equation (9) in our R program. It can be shown that equation (8) for Tukey's χ(z) function has an unique real root. This real root is just the solution for the estimator of σ. Let n* be the total number of observations that satisfy the second case of Tukey's χ function, i.e. | z | >k, let J* be the total number of clones that satisfy the first case of the χ, i.e. | z | ≤ k. Replacing z by and plugging Tukey's χ into equation (8), we get
where
Let
Then equation (12) becomes
ax 3 + bx 2 + cx + d = 0. (13)
Let x = y - b/3a, divided by a in the both sides of equation (13), and plugs x into equation (13), then we get the Cardan's cubic equation
Let p = c/a - b 2/3a 2, q = d/a - bc/(3a 2) + 2b 3/(27a 3), the above equation becomes
y 3 + py + q = 0. (15)
The determinant for Cardan's equation (15) is
It can be shown that the determined function Δ must be positive. The first term in the determinant equation must be positive because of the square function and q cannot be zero. If we can show that the p is greater or equal to zero, then the Δ must be positive. Because
So the Δ is positive if only if 3ac - b 2 is non-negative. We can see that
According to the Cauchy inequality [24], we have
Therefore, the p must be non-negative and the Δ must be positive. Thus there is only one real root for equation (15), that is
Then the solution for σ in equation (12) is
References
Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995, 270: 467–470.
Brown PO, Bostein D: Exploring the new world of the genome with microarrays. Nature Genetics 1999, 21(suppl 1):33–37. 10.1038/4462
Hedge P, Qi R, Abernathy K, Gay C, Dharap S, Gaspard R, Earle-Hughes J, Snesrud E, Lee N, Quackenbush J: A concise guide to cDNA microarray analysis. Biotechniques 2000, 29: 548–562.
Chen Y, Dougherty ER, Bittner ML: Ratio-based decisions and the quantitative analysis of cDNA microarray images. Journal of Biomedical Optics 1997, 2(4):364–374. 10.1117/1.429838
Yang YH, Dudoit S, Luu P, Speed TP: Normalization for cDNA microarray. Microarrays: Optical Technologies and Informatics SPIE, Society for Optical Engineering, San Jose, CA 2001., 4266:
Cleveland WS: Robust locally weighted regression and smoothing scatter plots. Journal of American Statistical Association 1979, 74: 829–836.
Quackenbush J: Microarray data normalization and transformation. Nat Genet 2002, 32(supplement):496–501. 10.1038/ng1032
Bilban M, Buehler LK, Head S, Desoye G, Quaranta V: Normalizing DNA microarray data. Current Issues in Molecular Biology 2002, 4: 57–64.
Tseng GC, Oh MK, Rohlin L, Liao JC, Wong WH: Issues in cDNA microarray analysis:quality filtering, channel normalization, models of variation and assessemnt of gene effects. Nucleic Acids Research 2001, 29(12):2549–2557. 10.1093/nar/29.12.2549
Kepler TB, Crosby L, Morgan KT: Normalization and analysis of DNA microarray data by self-consistency and local regression. Genome Biol 2002, 3(7):1–12. 10.1186/gb-2002-3-7-research0037
Wang Y, Lu J, Lee R, Gu Z, Clarke R: Iterative normalization of cDNA microarray data. IEEE Transactions on Information Technology in Biomedicine 2002, 6: 29–37. 10.1109/4233.992159
Workman C, Jensen LJ, Jarmer H, Berka R, Gautier L, Nielsen HB, Saxild HH, Nielsen C, Brunak S, Knudsen S: A new non-linear normalization method for reducing variability in DNA microarray experiments. Genome Biology 2002, 3(9):1–16. 10.1186/gb-2002-3-9-research0048
Park T, Yi SG, Kang SH, Lee SY, Lee YS, Simon R: Evaluation of normalization methods for microarray data. BMC Bioinformatics 2003, 4: 33–45. 10.1186/1471-2105-4-33
Wolfinger RD, Gibson G, Wolfinger ED, Bennett L, Hamadeh H, Bushel P, Afshari C, Paules RS: Assessing gene significance from cDNA microarray expression data via mixed models. Journal of Computational Biology 2001, 8(6):625–637. 10.1089/106652701753307520
Fan J, Tam P, Vande Woude G, Ren Y: Normalization and analysis of cDNA microarrays using within-array replications applied to neuroblastoma cell response to a cytokine. PNAS 2004, 101: 1135–1140. 10.1073/pnas.0307557100
Fan J, Peng H, Huang T: Semilinear high-dimensional model for normalization of microarray data: a theoretical analysis and partial consistency. Journal of the American Statistical Association 2005, in press.
Huang J, Wang D, Zhang C: A two-way semi-linear model for normalization and analysis of cDNA microarray data. Joural of the American Statistical Association 2005, in press.
Balagurunathan Y, Dougherty ER, Chen Y, Bittner ML, Trent JM: Simulation of cDNA microarrays via a parameterized random signal model. Journal of Biomedical Optics 2002, 7(3):507–523. 10.1117/1.1486246
Xie H, Wang D, Manzella L, Huang J, Soares MB: Probe set as a common reference and a semi-linear normalization method for cDNA microarray experiments. Preprint, Department of Pediatrics, University of Iowa 2004.
Huber PJ: Robust estimation of a location parameter. Annals Mathematical Statistics 1964, 35: 73–101.
Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning. Springer. 2001.
de Boor C: A Practical Guide to Splines. Springer-Verlag, New York; 1978.
Huber PJ: Robust Statistics. John Wiley & Sons; 1981.
Abramowitz M, Stegun IA, (Eds): Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Table. Dover Pubns; 1974.
Acknowledgements
This work is supported in part by grant HL72288 from the National Heart, Lung and Blood Institute.
Author information
Authors and Affiliations
Corresponding authors
Additional information
Authors' contributions
DW devised and implemented the procedure described in the paper, drafted and finalized the manuscript. JH helped with writing and revising the manuscript. JH and MBS supervised and provided support for this work. HX and LM conducted the experiment that generated the data set used in the example.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Wang, D., Huang, J., Xie, H. et al. A robust two-way semi-linear model for normalization of cDNA microarray data. BMC Bioinformatics 6, 14 (2005). https://doi.org/10.1186/1471-2105-6-14
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/1471-2105-6-14