Comparison of normalization methods for CodeLink Bioarray data

Background The quality of microarray data can seriously affect the accuracy of downstream analyses. In order to reduce variability and enhance signal reproducibility in these data, many normalization methods have been proposed and evaluated, most of which are for data obtained from cDNA microarrays and Affymetrix GeneChips. CodeLink Bioarrays are a newly emerged, single-color oligonucleotide microarray platform. To date, there are no reported studies that evaluate normalization methods for CodeLink Bioarrays. Results We compared five existing normalization approaches, in terms of both noise reduction and signal retention: Median (suggested by the manufacturer), CyclicLoess, Quantile, Iset, and Qspline. These methods were applied to two real datasets (a time course dataset and a lung disease-related dataset) generated by CodeLink Bioarrays and were assessed using multiple statistical significance tests. Compared to Median, CyclicLoess and Qspline exhibit a significant and the most consistent improvement in reduction of variability and retention of signal. CyclicLoess appears to retain more signal than Qspline. Quantile reduces more variability than Median in both datasets, yet fails to consistently retain more signal in the time course dataset. Iset does not improve over Median in either noise reduction or signal enhancement in the time course dataset. Conclusion Median is insufficient either to reduce variability or to retain signal effectively for CodeLink Bioarray data. CyclicLoess is a more suitable approach for normalizing these data. CyclicLoess also seems to be the most effective method among the five different normalization strategies examined.


Background
DNA microarrays have made possible the expression profiling of thousands of genes in a single experiment. They have been used in a wide range of applications, e.g., disease classification and characterization [1][2][3][4], discovery of new disease subtypes [5,6], and identification of novel genes or gene regulatory networks for biological mechanistic research [7,8]. Depending on the probe types used, DNA microarrays can be classified as either cDNA or oligonucleotide microarrays. Additionally, depending on the A challenge inherent for the normalization of microarray data is the lack of a gold standard (e.g., a 'perfect' validation set that can reflect the complexity of real data). It is, therefore, not easy to identify the best normalization method(s) (which can both remove noise and retain signal most effectively) for any array platform. Two criteria have been employed to compare normalization strategies [17][18][19]: one is the ability to reduce noise in the data and the other is the ability to retain biological signal. To assess noise reduction, diagnostic plots have been employed to visualize and identify noise in the data. For example, the MA plot, which is the scatter plot of average log intensity values (A) of the green and red fluorescent dyes (in twocolor cDNA microarrays) vs. log intensity ratio (M) of the two fluorescent dyes, has been used to identify intensity-dependent dye effect in cDNA microarray data [19]; while the spatial plot, a spatial representation of the spots on a microarray color-coded by their intensity values, has been used to evaluate spatial-dependent dye effect in cDNA microarray data [20]. Other approaches have used the coefficient of variation [21,22], correlation [21] and variance [23] in replicate arrays, as quantitative, unambiguous measures to evaluate the effectiveness of normalization methods in removing variability.
It is relatively ambiguous to evaluate normalization methods in terms of signal retention, because there is no ground truth for total signal in real biological samples. Prevailing approaches compare the ability to predict a fixed number of known differentially expressed genes using t-statistics [19] and adjusted p-values [24] and to reveal spike-in genes [17]. When little is known about differentially expressed genes in the data (a common situation when real datasets are used), signal has been estimated by calculating the overabundance of differentially expressed genes or informative genes from real datasets [18]. The leave-one-out cross-validation k-NN classification error has also been employed as a functional measure for comparing normalization methods in real datasets [16].
Previously we have developed a strategy for the evaluation of normalization approaches for Affymetrix GeneChips [18]. In the present study, we employed a similar strategy to compare five existing normalization methods (Median, CyclicLoess, Quantile, Iset, and Qspline) for CodeLink Bioarray data. These methods were designed for high-density oligonucleotide microarrays, and some of them (Median, CyclicLoess, Quantile, and Iset) have already been validated using GeneChip data [17,18]. We applied these normalization approaches individually to two real datasets obtained from CodeLink Bioarrays. We examined intensity-dependent differences in the normalized datasets and assessed variability of the normalized datasets, using replicate microarrays and/or positive control probes on the arrays. Finally, to assess the effectiveness of the normalization methods in retaining signal, we compared the numbers of differentially expressed genes detected from the normalized datasets using multiple statistical significance tests. Computer programs in this work were implemented using the R language for statistical computing and graphics [25].

Noise in the normalized time course dataset
Intensity-dependent differences in the normalized data We used pairwise MA plots to examine the ability of the normalization methods to remove intensity-dependent differences between each pair of arrays in the technical replicates. Figure 1 shows the MA plots of two pairs of arrays in the TRC1 replicate data, either non-normalized (Nonorm), or normalized with the five normalization methods. These arrays exhibit obvious intensity-dependent differences between "pair 1" (array 1 vs. array 4) and "pair 2" (array 4 vs. array 5) (shown as curved loess lines in the plots) (Figure 1, Nonorm). For the Median-normalized data, although the loess lines are centered near zero, they still have curvature, indicating that Median does not remove intensity-dependent differences adequately (Figure 1, Median). CyclicLoess, Quantile and Qspline remove intensity-dependent differences almost completely; Iset, however, fails to remove all the intensitydependent differences (Figure 1).

Variability of the normalized data
We calculated the coefficient of variation (CV) of the normalized intensity values for each transcript across all arrays in each set of the normalized technical replicates. Figure 2 shows the density plots of the CVs for Nonorm and the five normalization methods. The CV of Nonorm in TRC1 peaks more to the left side with a wider spread than those in TRC2 and TRC3, suggesting that the quality of the TRC1 replicates is not as good as those of TRC2 and TRC3. This is consistent with our observation that TRC1 contains two 'corrupted' pairs of arrays that show obvious intensity-dependent differences in the MA plots ( Figure  1). Figure 2 shows that in TRC1, the five normalization methods reduce the data variability to different degrees: Iset and CyclicLoess have overall lower CVs (and thus perform better) than Median, Quantile and Qspline, as indicated by the CV density plots of the former methods peak more to the left side of those of the latter approaches; Quantile and Qspline have slightly lower CVs than Median. In TRC2 (Figure 2), Iset fails to improve over Median, whereas the performances of the other four normalization methods (CyclicLoess, Quantile, Qspline, and Median) are similar. In TRC3, Median is outperformed by all the other four normalization strategies ( Figure 2). Finally, we calculated the CVs of the normalized intensity values for positive control probes across all arrays in the time course dataset. Our results show that CyclicLoess, Quantile and Qspline reduce variability from the normalized data more effectively than Nonorm, Median and Iset. The means of the CVs for CyclicLoess, Quantile and Qspline are 57%, 54% and 56% respectively, whereas the means of the CVs for Nonorm, Median and Iset are 72%, 66% and 61% respectively. CyclicLoess has a slightly higher mean CV than Quantile and Qspline, but the differences are not statistically significant (Welch's t-tests, pvalues > 0.05).
These results suggest that overall CyclicLoess reduces variability most effectively and consistently than Median in both the 'corrupted' (TRC1) and good-quality (TRC2 and MA plots of two pairs of microarrays in the TRC1 replicate set from the time course dataset Figure 1 MA plots of two pairs of microarrays in the TRC1 replicate set from the time course dataset. The plots in each row show the results from non-normalized data (row 1) and data normalized with each of the five normalization methods (row 2-6). Columns depict two pairs of microarrays, pair 1 (array 1 vs. array 4, left) and pair 2 (array 4 vs. array 5, right), in the TRC1 replicates that exhibit obvious intensity-dependent differences between the arrays. The yellow line in each plot shows the loess fitting of the entire data in the plot.

Qspline 2
A M TRC3) data. Although Quantile and Qspline perform as well as CyclicLoess in the good-quality data, CyclicLoess outperforms them in the 'corrupted' data. Iset fails to improve over Median consistently.

Signal in the normalized time course dataset
Since the aim of an effective normalization method is to remove noise while retaining biological signal in the data, we next compared the effectiveness of the normalization approaches in signal retention. As no spike-in datasets were available for CodeLink Bioarrays, we were unable to determine total signal in the data. Instead, we estimated signal by calculating the number of differentially expressed genes in the data. We expected that the more signal retained, the more differentially expressed genes should be revealed. Although similar methodology has been shown effective in our previous work [18], we developed a statistical model and used simulated data to illustrate the validity of this intuition (see additional file 1: Simulation model and data for detail).
We then used Welch's two sample t-tests and Wilcoxon rank sum tests, with and without permutation, to estimate the numbers of differentially expressed genes in the nonnormalized data and in the data normalized with the five normalization methods. For each statistical test, Figure 3 shows the numbers of differentially expressed genes estimated from the normalized data at different time points. We ranked all the normalization methods at each time point based on the numbers of differentially expressed genes detected (Tables 1, 2, 3, 4). The more differentially expressed genes detected, the higher the rank of the normalization method. The mean, median and standard deviation of the ranks across all the time points were calculated for each normalization approach and are shown in Tables 1, 2, 3, 4 and Figure 4. For all the statistical tests performed, CyclicLoess has the highest mean and median ranks among all the normalization methods; the mean and median ranks of Qspline are also consistently higher than those of Median; Quantile fails to outperform Median in Wilcoxon rank sum tests (without permuta-Comparison of variability of the normalized data in the three sets of technical replicates from the time course dataset  Notably, compared to the other normalization methods, Iset has the highest standard deviations of the ranks across all time points in almost all the statistical tests (3 out of the 4 tests) (Tables 1, 2, 3 and Figure 4). For example, the ranks of Iset range from 2 (in 7-day and 30-day treatments) to 6 (in 1-day and 3-day treatments) in Welch's ttests without permutation (Table 1). It has been shown previously that the performance of Iset is dependent on the choice of the baseline array [17]; our results here sug-gest that even with a fixed choice of the baseline array, its performance may also depend on the arrays to be normalized.
In addition to the comparisons between the control group and each test group, we used the Analysis of Variance (ANOVA) to estimate the number of differentially expressed genes whose intensity values varied with the days of the treatment. CyclicLoess reveals the largest number of differentially expressed genes (68 genes) with ANOVA. Iset, Qspline and Quantile reveal slightly fewer numbers of differentially expressed genes (66, 56 and 54 Comparison of the numbers of differentially expressed genes estimated from the normalized time course dataset using multiple statistical tests Figure 3 Comparison of the numbers of differentially expressed genes estimated from the normalized time course dataset using multiple statistical tests. The x-axis in each plot shows the days when the rats were treated. The y-axis represents the numbers of differentially expressed genes detected between the control vs. each test group using the following sta- genes, respectively), but they still significantly outperform Median and Nonorm (which reveal only 24 and 9 differentially expressed genes, respectively).
The comparison of the normalization methods in the time course dataset can be summarized as follows. For noise removal and signal retention, CyclicLoess demonstrates the greatest and most consistent improvement over Median; Qspline exhibits consistent yet moderate improvement over Median. Quantile performs consistently better than Median for variability reduction, yet does not do so for signal detection. Iset fails to improve over Median consistently for either noise reduction or signal retention.
Since CyclicLoess, Quantile and Qspline exhibit considerable improvement over Median in the time course dataset, we compared them in greater detail using another dataset, the IPF dataset (see Methods), which contains larger and more balanced numbers of arrays for both the control (control patients) and test groups (pulmonary fibrotic patients). To focus the comparison on these methods, we excluded Nonorm and Iset from further analyses.

Variability of the normalized IPF dataset
Since there were no technical replicates in the IPF dataset, we compared the four normalization methods (Cycli-cLoess, Quantile, Qspline, and Median) for noise removal using the positive control probes on the CodeLink Bioarrays. The CV of the normalized intensity values across all arrays was calculated for each positive control probe in the IPF dataset processed with the normalization methods.
Our results show that CyclicLoess, Quantile and Qspline all have significantly lower mean CVs (79%, 76% and 74%, respectively) than Median (89%). CyclicLoess has a slightly higher, yet not statistically significant mean CV than Quantile and Qspline (Welch's t-tests, p-values > 0.05). Table 5 and Figure 5 show the numbers of differentially expressed genes estimated from the normalized IPF dataset using Welch's t-tests and Wilcoxon rank sum tests, with Table 2: Ranks of the normalization methods in Wilcoxon tests (without permutation) in the time course dataset. For each time point, the normalization strategies were ranked based on the numbers of differentially expressed genes estimated from the normalized data using Wilcoxon rank sum tests without permutation. "Rank" and the "Mean", "Median" and "Standard Deviation" of the ranks are defined as described in Table 1 Table 3: Ranks of the normalization methods in Welch's t-tests (with permutation) in the time course dataset. For each time point, the normalization strategies were ranked based on the numbers of differentially expressed genes estimated from the normalized data using Welch's t-tests with permutation. "Rank" and the "Mean", "Median" and "Standard Deviation" of the ranks are defined as described in Table 1 and without permutation. In three of the four statistical tests (t-tests with permutation and Wilcoxon tests with and without permutation, Figure 5), CyclicLoess reveals the largest numbers of differentially expressed genes (164, 331 and 297 genes, respectively) ( Table 5), which are significantly greater than the numbers of differentially expressed genes revealed by Median (108, 279 and 227 genes, respectively). In all the statistical tests, Qspline reveals more differentially expressed genes than Quantile, which in turn outperforms Median (Table 5).

Signal in the normalized IPF dataset
Overall, comparative results of the four normalization methods in the IPF dataset agree with most of those from the time course dataset: CyclicLoess and Qspline exhibit significant and consistent improvement over Median in both noise reduction and signal retention; CyclicLoess reveals slightly more signal (differentially expressed genes) than Qspline. Quantile outperforms Median for both noise reduction and signal retention in all four statistical tests, which is in contrast to its performance in the time course dataset where it fails to reveal more signal than Median in some tests (e.g., Wilcoxon rank sum tests, without permutation).

Discussion
CodeLink Bioarrays are recently introduced, single-color oligonucleotide microarrays, which differ from Affymetrix GeneChips in the following aspects [22,26]: 1) CodeLink Bioarrays use a single pre-synthesized, pre-validated 30mer probe to detect each target transcript, whereas Gene-Chips use multiple in-situ synthesized, 25-mer probes; and 2) the surface of CodeLink Bioarrays is made of 3dimensional aqueous gel matrix, whereas that of Affymetrix GeneChips is made of 2-dimensional glass matrix. These characteristics suggest that CodeLink Bioarrays behave differently from GeneChips and may require different normalization strategies from the ones optimized for GeneChips.
In this study, in order to determine the best normalization method(s) for CodeLink Bioarrays, we compared five existing approaches designed for high-density oligonucleotide microarrays. These methods have been applied previously to Affymetrix GeneChip data. Our goal is to provide a guideline for practitioners in the choice of a 'proper' normalization method that removes variability and retains signal effectively for CodeLink Bioarray data and thus to ensure the validity of downstream data analyses. Using our criteria, the Median normalization method (recommended by the manufacturer) is insufficient for   Figure 4 Ranks of the normalization methods in the normalized time course dataset. The mean, median and standard deviation of the ranks of the normalization method are defined in Table 1. The bar plots are visual representation of the results shown in Tables 1-4 (the "Mean Rank", "Median Rank" and "Standard Deviation" columns). In each plot, mean ranks are shown in pink, median ranks are in blue, and standard deviations of the ranks are shown as the error bars on top of the "Mean" rank bars.

Mean Median
A major difference between CyclicLoess, Qspline, Quantile, and Iset can be explained as follows. CyclicLoess and Qspline have more relaxed assumptions on microarray data than Quantile and Iset. The former methods require only that genes on the arrays are randomly distributed (i.e., the numbers of up-and down-regulated genes are similar), whereas the latter approaches require either that the data on different arrays have the identical distribution (Quantile) or that few genes on the arrays are differen-tially expressed (Iset). Since we expected neither identical nor very different gene expression patterns between the control vs. test groups in the examined datasets, Cycli-cLoess and Qspline seem to be more reasonable choices for normalizing these data. This may account for their better performances in this work. Our results also partially agree with another comparative study using Affymetrix GeneChip data [27], which showed that CyclicLoess reduced variability and retained signal more effectively Comparison of the numbers of differentially expressed genes estimated from the normalized IPF dataset using multiple statisti-cal tests Figure 5 Comparison of the numbers of differentially expressed genes estimated from the normalized IPF dataset using multiple statistical tests. The x-axis shows the numbers of differentially expressed genes estimated using the following sta-

Adjusted p value
Median CyclicLoess Quantile Qspline than Quantile. This suggests that CyclicLoess (and Qspline) may be more suitable than Quantile (and Iset) for normalizing single-color, oligonucleotide microarray data.
Besides the difference mentioned above, two baselinearray approaches, Qspline and Iset, also differ in the following way. Although both methods use a subset of genes to estimate intensity-dependent differences between a pair of microarrays for normalization, Qspline chooses these genes evenly over the entire range of the genes on the arrays. Iset, however, uses rank invariant genes, which are usually in small numbers (about 300 -1000 genes in the examined datasets) and thus may be insufficient for estimating intensity effect accurately in some arrays. This may account for the unstable performance of Iset in the time course dataset. Moreover, although intuitively normalization should be more effective if a 'proper' set of 'housekeeping' genes can be selected, the effectiveness of such approach could be limited by the still unanswered question as to whether these genes exists in higher organisms [28].
Since Qspline is a baseline-array approach, a concern could be raised that its performance may depend on the choice of the baseline array. Indeed, it has been shown that the performance of Iset varied when different individual arrays were used as the baseline array [17]. However, when the 'synthetic' baseline array was employed such that the intensity value of each gene in this array is equal to the median or mean of the intensity values of the same gene across all arrays, the performance of Iset seemed to be stable compared to Nonorm and Quantile Normalization [17]. This suggests that these 'synthetic' baseline arrays (the ones we used in this study for Iset and Qspline) are less biased and therefore are better choices for baseline-array approaches.
In addition to the examined normalization methods, there are other approaches that can be applied to Code-Link Bioarray data. For example, mean cyclic loess [29] and fast linear loess [27] were designed as speed-up versions of CyclicLoess (which, as has been often noted, has the drawback of being computationally slow). Although proposed independently, mean cyclic loess and fast linear loess are virtually identical methods. Unlike CyclicLoess, which adjusts intensity-dependent differences between each pair of arrays by iterating over all pairs of arrays in the dataset (see Methods), the speed-up methods eliminate the iteration step. Instead, for each target array, they first make a reference array by having the intensity value of each gene equal to the mean intensity value of the same gene across the rest of the arrays, and then estimate and adjust intensity-dependent differences in the target array with respect to the reference array. These methods are sim-ilar to Qspline when the latter chooses a similar baseline array (i.e., using the array-wise mean intensity value for each gene to construct the baseline array, the one we used in this study).
There are two possible limitations in this study. The first is that in the time course dataset, the sample sizes of the control vs. test groups were not well balanced (i.e., the control group contained 14 arrays, while the test groups contained only 4 -6 arrays each). Since a single statistical test could be biased or less powerful for estimating differentially expressed genes in unbalanced datasets, we used multiple statistical tests (parametric and non-parametric in nature, with and without permutation) to assess the performances of the normalization strategies. The conclusion of the comparative performances of the normalization methods was made based on the results from all the statistical tests. Moreover, we confirmed the evaluation results from the time course dataset using the IPF dataset. The latter dataset was better balanced in terms of the sample sizes of the control vs. test groups (11 and 15 arrays, respectively), and therefore allowed more accurate estimation of the numbers of the differentially expressed genes in the normalized data.
The second possible limitation of this study is that, since no spike-in datasets were available for CodeLink Bioarrays, two real CodeLink Bioarray datasets were used instead. It would be more informative if a fixed number of known differentially expressed genes were present in the data. However, this information is often unknown for real microarray datasets. Although spike-in or dilution data has been shown to be useful for evaluation of normalization methods [17,30], they are too clean in terms of the 'background' genes, whose intensity values are constant between control vs. test groups. In real data, these 'background' genes rarely exist. In fact, in the two examined datasets, we neither knew the number of 'background' genes in the control vs. test groups, nor did we even know the number of these genes within each group (since individual samples of each group were different). For this reason, normalization strategies that perform well in spike-in or dilution data may not perform equally well in real data. It may be the most informative if normalization methods can be evaluated using both real and spike-in microarray data.

Datasets
Two real datasets were used in this study.

Time course dataset
These data were collected to test the difference between a control group of rats (exposed to a treatment for 0 days) and test groups of rats exposed to a treatment for either 1, 3, 7, 14 or 30 days. The control group contained 14 arrays. The 14-day treatment group contained 6 arrays, and the other test groups contained 4 arrays each. Thus, the dataset contained 36 arrays. For the control group, there were 3 sets of technical replicates, sets TRC1, TRC2 and TRC3, which contained 5, 5 and 4 arrays, respectively. The arrays used in this dataset were CodeLink UniSet Rat I Bioarrays containing pre-validated oligonucleotide probes targeting about 10K transcripts in the rat genome.

IPF dataset
This dataset was generated to compare expression profiles of control lungs vs. lungs from patients with idiopathic pulmonary fibrosis (IPF). A total of 26 microarrays were obtained from 11 controls and 15 patients. The arrays used were CodeLink UniSet Human I Bioarrays containing pre-validated oligonucleotide probes targeting about 10K transcripts in the human genome. The arrays are available at the Gene Expression Omnibus (GEO accession number GSE 2052).
Both of the above CodeLink Bioarray platforms contain 68 bacterial control probes on each array, of which 18 are positive control probes (which can be used to monitor the quality of microarray experiments, see below) and 50 are negative control probes (which can be used to determine the low limit of signal). Each control probe is spotted 6 times on each array.

Microarray protocol
A CodeLink Bioarray experiment involves the following steps. Total RNAs are first prepared from a biological sample. Then a set of bacterial mRNAs of known concentrations (which are provided by the manufacturers and have complementary sequences to the positive control probes on the Bioarrays) are spiked in as positive controls. The mixed mRNAs are reverse transcribed into cDNAs and amplified into cRNAs, using in vitro transcription. The cRNAs are labeled with a fluorescent dye and hybridized to a CodeLink Bioarray presynthesized with probes. Finally the array is washed and scanned. The intensity value (signal) of the fluorescent dye detected from each probe on the array is proportional to the abundance of the targeting transcript in the sample of interest. The quality of these processes can be monitored by detecting the signal of the positive control probes on the Bioarrays.
Since some normalization methods (e.g., Quantile and Qspline) do not perform well with missing values in microarray data, missing values in the raw data were imputed before normalization using the following procedure. For each dataset, we first removed genes which contained missing values in more than 5% of the microarrays, then used the k-NNimpute algorithm [31] to fill in missing values for the remaining genes. The k-NNimpute algo-rithm was implemented using the Bioconductor package/ function pamr/pamr.knnimpute(k = 10) [32].

Normalization methods
All of the examined normalization methods are available at http://www.bioconductor.org [33].

Global approaches
We compared three global normalization methods, Median, CyclicLoess and Quantile. For reference, we also included Nonorm, which does not perform any data transformation on raw intensity values, S, of the spots on an array, which is defined as S = S f -S b , where S f is the fluorescent signal of the spots and S b is the local background intensity values of the spots. Median normalization scales the raw intensity values on an array using the median of the raw intensity value, i.e., S n = S/median(S), where S n is the normalized intensity values of the spots. Median is recommended by the manufacturer for normalization and serves as a baseline method in this study (Note: this should not be confused with the 'baseline-array' approaches described below). The Median-normalized CodeLink Bioarray data was obtained directly from the software provided by the manufacturer.
CyclicLoess [17] adjusts intensity-dependent differences between pairs of arrays with the aid of the MA plot (which, when used on data obtained from single-color microarrays, is the scatter plot of average log intensity values [A] of a pair of arrays vs. log intensity ratio [M] of the same arrays) [17]. CyclicLoess is an inter-microarray variant of loess-based normalization methods originally applied to two-color cDNA microarrays to adjust the intensity-dependent dye effect within a microarray [19,24]. In two-color cDNA microarrays, a test-and a reference sample (labeled with two fluorescent dyes of distinct colors, i.e., red and green, respectively) are hybridized to the same array; the differences of gene expression between these samples can be obtained by comparing the red and green fluorescent intensities of each gene on the same array. Thus intra-microarray loessbased normalization approaches can be used to adjust intensity-dependent differences in each microarray. In single-color microarrays, however, only one sample is hybridized to each microarray, and the differences of gene expression between different samples can only be obtained by comparing intensity values of each gene on different arrays. Inter-microarray normalization methods (e.g., CyclicLoess studied here), therefore, have to be used to estimate and minimize intensity-dependent differences in these microarrays.
CyclicLoess uses the MA plot and loess smoothing [34] to estimate intensity-dependent differences in a pair of microarrays, and then removes these differences by centering the loess line to zero. This procedure is carried out in a pairwise manner for all the arrays in a dataset and iterated until intensity-dependent differences are removed from all arrays. CyclicLoess was implemented using the Bioconductor package/function affy/normalize.loess(data, epsilon = 1, log.it = F, span = 0.4, maxit = 2) [35]. The parameter epsilon in normalize.loess measures intensity-dependent differences in the data and thus serves as a criterion for the procedure to stop iterating. Our results show that when epsilon is smaller than 1, intensity-dependent differences in the data are negligible. The smaller value of epsilon does not produce better results in terms of variability reduction and signal retention (data not shown). It took CyclicLoess two iterations in the time course dataset and one iteration in the IPF dataset to satisfy the stopping criterion (epsilon < 1).
Quantile [17] normalizes data in the following way. It first ranks data on each array, and then assigns data of the same rank across all arrays the mean of the data (of the same rank). Quantile normalization was implemented using the Bioconductor package/function affy/normalize.quantiles(data).

Baseline-array approaches
We compared two baseline-array methods, Iset and Qspline. These two strategies share several similarities: 1) both need to choose a baseline-array to estimate intensitydependent differences in target arrays; 2) both use a spline smoothing technique to remove intensity-dependent differences in target arrays; 3) both estimate smoothing curves for normalization using a subset of the genes on the arrays; and 4) both are rank-based methods. However, they differ in their choices of the subset of genes for fitting normalization curves. Iset chooses these genes by selecting a set of rank-invariant genes (or so called "housekeeping genes") in the target array with respect to the baseline array [36,37], while Qspline does so by first ranking all the genes on the target array and the baseline array respectively, and then by using quantiles of the ranked genes to estimate smoothing curves [38].
Iset was implemented in the Bioconductor package/function affy/normalize.invariantset ("target", "ref", prd.td = c(0.003, 0.007)), where "target" is the data matrix from the target array and "ref" is the data matrix from the baseline array. We chose the baseline array for Iset as follows: the intensity value of each gene is the median of the intensity values of the same gene across all arrays. Qspline was implemented in the Bioconductor package/function affy/ normalize.qspline (data, fit.iters = 5, min.offset = 5, spar = 0, p.min = 0, p.max = 1.0). We chose the baseline array for Qspline using the default option, in which each gene in the baseline array had the intensity value equal to the mean of the intensity values of the same gene across all arrays.

Detection of noise in the normalized datasets
We applied the five normalization methods individually to the time course and IPF datasets. In order to assess the effectiveness of the normalization methods in removing noise, we first used three sets of technical replicate microarrays, sets TRC1, TRC2 and TRC3, from the time course dataset. For each set of the replicates, we used pairwise MA plots to examine intensity-dependent differences in the data that are normalized individually with the normalization methods; we then calculated the coefficient of variation ( Since the IPF dataset did not contain technical replicates, we measured and compared the CVs for each positive control probe across all arrays in the IPF data normalized with the normalization methods. The normalized intensity values used in this section for calculating CVs are on the scale of the raw intensity data (i.e., not log-transformed values).

Detection of signal in the normalized datasets
A negative control threshold T NC is used (as suggested by CodeLink Bioarrays) to monitor the low limit of signal [39], where T NC is defined as T NC = (80% trimmed mean of negative control probes) + (3 standard deviations of the 80% trimmed population of negative control probes). In order to minimize the effects of low signal probes (whose intensity values are smaller than the negative control threshold) on signal detection, we replaced intensity values of these probes with T NC .
Log-transformed, normalized intensity values were used in the analyses described below.
In order to compare the effectiveness of the normalization methods in enhancing signal reproducibility, we detected signal in the normalized datasets. We assume that signal quality can be estimated by the number of differentially expressed genes detected (that is, the more signal retained in the normalized data, the more differentially expressed genes should be revealed). We first developed a simulation model and verified this intuition using simulated data (see additional file1: Simulation model and data for detail).
We then used multiple statistical significance tests, both parametric (e.g., Welch's two-sample t-tests and ANOVA) and non-parametric (e.g., Wilcoxon rank sum tests), to detect differentially expressed genes in the normalized datasets. Raw p-values of the t-statistics and Wilcoxon rank sum tests were computed using the null distribution of the test statistics as well as the permutation tests of the sample labels. These p-values were then adjusted for multiple testing using the false discovery rate (FDR) controlling procedure of Benjamini & Hochberg [40], which seems to balance the control of both false-positive and false-negative error rates better than other procedures (e.g., Bonferroni, the Benjamini & Yekutieli FDR and SAM) [41]. The cut-off adjusted p-value of 0.05 was used to determine differentially expressed genes for these twosample statistical tests.
In addition to pair-wise comparisons between the control and test groups, we used ANOVA to detect genes whose intensity values varied with the length of the treatment. Since we believe that there was a non-linear relationship between the response-(intensity values of genes) and the explanatory variables (days of the treatment), we used a quadratic regression model to fit these variables. For transcript k, where 1 ≤ k ≤ K and K is the total number of the transcripts on the array, let j be the treatment group, 1 ≤ j ≤ 6; for array i in treatment group j and 1 ≤ i ≤ N j , where N j is the number of the arrays for treatment group j, let s k,i,j denote the intensity value of transcript k on array i in treatment group j and d i,j be the day of treatment for array i in treatment group j, i.e., d i,j = {0,1,3,7,14,30}. The quadratic regression model can be written as: ANOVA was used to estimate statistical significance of the model parameters β 1 and β 2 for all transcripts, 1 ≤ k ≤ K.
The p-values of the F statistics were adjusted using the Benjamini & Hochberg FDR procedure. To ensure that both β 1 and β 2 are non-zero and thus minimize the risk of false positives, a stringent criterion, the adjusted p-value < 0.01, was used to determine differentially expressed genes.

Authors' contributions
WW designed and performed computational experiments, and drafted the manuscript. ND performed microarray experiments. GCT conducted the simulation study and edited the manuscript. TR read and edited the manuscript. EPX and NK participated in experimental design and in drafting the manuscript. All authors contributed to, read and approved the final manuscript.