### Datasets

Two real datasets were used in this study.

#### Time course dataset

These data were collected to test the difference between a control group of rats (exposed to a treatment for 0 days) and test groups of rats exposed to a treatment for either 1, 3, 7, 14 or 30 days. The control group contained 14 arrays. The 14-day treatment group contained 6 arrays, and the other test groups contained 4 arrays each. Thus, the dataset contained 36 arrays. For the control group, there were 3 sets of technical replicates, sets TRC1, TRC2 and TRC3, which contained 5, 5 and 4 arrays, respectively. The arrays used in this dataset were CodeLink UniSet Rat I Bioarrays containing pre-validated oligonucleotide probes targeting about 10K transcripts in the rat genome.

#### IPF dataset

This dataset was generated to compare expression profiles of control lungs vs. lungs from patients with idiopathic pulmonary fibrosis (IPF). A total of 26 microarrays were obtained from 11 controls and 15 patients. The arrays used were CodeLink UniSet Human I Bioarrays containing pre-validated oligonucleotide probes targeting about 10K transcripts in the human genome. The arrays are available at the Gene Expression Omnibus (GEO accession number GSE 2052).

Both of the above CodeLink Bioarray platforms contain 68 bacterial control probes on each array, of which 18 are positive control probes (which can be used to monitor the quality of microarray experiments, see below) and 50 are negative control probes (which can be used to determine the low limit of signal). Each control probe is spotted 6 times on each array.

### Microarray protocol

A CodeLink Bioarray experiment involves the following steps. Total RNAs are first prepared from a biological sample. Then a set of bacterial mRNAs of known concentrations (which are provided by the manufacturers and have complementary sequences to the positive control probes on the Bioarrays) are spiked in as positive controls. The mixed mRNAs are reverse transcribed into cDNAs and amplified into cRNAs, using *in vitro* transcription. The cRNAs are labeled with a fluorescent dye and hybridized to a CodeLink Bioarray presynthesized with probes. Finally the array is washed and scanned. The intensity value (signal) of the fluorescent dye detected from each probe on the array is proportional to the abundance of the targeting transcript in the sample of interest. The quality of these processes can be monitored by detecting the signal of the positive control probes on the Bioarrays.

Since some normalization methods (*e.g*., Quantile and Qspline) do not perform well with missing values in microarray data, missing values in the raw data were imputed before normalization using the following procedure. For each dataset, we first removed genes which contained missing values in more than 5% of the microarrays, then used the *k*-NNimpute algorithm [31] to fill in missing values for the remaining genes. The *k*-NNimpute algorithm was implemented using the Bioconductor package/function **pamr/pamr.knnimpute(k = 10)** [32].

### Normalization methods

All of the examined normalization methods are available at http://www.bioconductor.org[33].

#### Global approaches

We compared three global normalization methods, Median, CyclicLoess and Quantile. For reference, we also included Nonorm, which does not perform any data transformation on raw intensity values, *S*, of the spots on an array, which is defined as *S* = *S*_{
f
}- *S*_{
b
}, where *S*_{
f
}is the fluorescent signal of the spots and *S*_{
b
}is the local background intensity values of the spots. Median normalization scales the raw intensity values on an array using the median of the raw intensity value, *i.e*., *S*_{
n
}= *S*/*median*(*S*), where *S*_{
n
}is the normalized intensity values of the spots. Median is recommended by the manufacturer for normalization and serves as a baseline method in this study (Note: this should not be confused with the 'baseline-array' approaches described below). The Median-normalized CodeLink Bioarray data was obtained directly from the software provided by the manufacturer.

CyclicLoess [17] adjusts intensity-dependent differences between pairs of arrays with the aid of the MA plot (which, when used on data obtained from single-color microarrays, is the scatter plot of average log intensity values [*A*] of a pair of arrays vs. log intensity ratio [*M*] of the same arrays) [17]. CyclicLoess is an inter-microarray variant of loess-based normalization methods originally applied to two-color cDNA microarrays to adjust the intensity-dependent dye effect within a microarray [19, 24]. In two-color cDNA microarrays, a test- and a reference sample (labeled with two fluorescent dyes of distinct colors, *i.e*., red and green, respectively) are hybridized to the same array; the differences of gene expression between these samples can be obtained by comparing the red and green fluorescent intensities of each gene on the same array. Thus intra-microarray loess-based normalization approaches can be used to adjust intensity-dependent differences in each microarray. In single-color microarrays, however, only one sample is hybridized to each microarray, and the differences of gene expression between different samples can only be obtained by comparing intensity values of each gene on different arrays. Inter-microarray normalization methods (*e.g*., CyclicLoess studied here), therefore, have to be used to estimate and minimize intensity-dependent differences in these microarrays.

CyclicLoess uses the MA plot and loess smoothing [34] to estimate intensity-dependent differences in a pair of microarrays, and then removes these differences by centering the loess line to zero. This procedure is carried out in a pairwise manner for all the arrays in a dataset and iterated until intensity-dependent differences are removed from all arrays. CyclicLoess was implemented using the Bioconductor package/function **affy/normalize.loess(data, epsilon = 1, log.it = F, span = 0.4, maxit = 2)** [35]. The parameter epsilon in **normalize.loess** measures intensity-dependent differences in the data and thus serves as a criterion for the procedure to stop iterating. Our results show that when epsilon is smaller than 1, intensity-dependent differences in the data are negligible. The smaller value of epsilon does not produce better results in terms of variability reduction and signal retention (data not shown). It took CyclicLoess two iterations in the time course dataset and one iteration in the IPF dataset to satisfy the stopping criterion (epsilon < 1).

Quantile [17] normalizes data in the following way. It first ranks data on each array, and then assigns data of the same rank across all arrays the mean of the data (of the same rank). Quantile normalization was implemented using the Bioconductor package/function **affy/normalize.quantiles(data)**.

#### Baseline-array approaches

We compared two baseline-array methods, Iset and Qspline. These two strategies share several similarities: 1) both need to choose a baseline-array to estimate intensity-dependent differences in target arrays; 2) both use a spline smoothing technique to remove intensity-dependent differences in target arrays; 3) both estimate smoothing curves for normalization using a subset of the genes on the arrays; and 4) both are rank-based methods. However, they differ in their choices of the subset of genes for fitting normalization curves. Iset chooses these genes by selecting a set of rank-invariant genes (or so called "housekeeping genes") in the target array with respect to the baseline array [36, 37], while Qspline does so by first ranking all the genes on the target array and the baseline array respectively, and then by using quantiles of the ranked genes to estimate smoothing curves [38].

Iset was implemented in the Bioconductor package/function **affy/normalize.invariantset ("target", "ref", prd.td = c(0.003, 0.007))**, where "target" is the data matrix from the target array and "ref" is the data matrix from the baseline array. We chose the baseline array for Iset as follows: the intensity value of each gene is the median of the intensity values of the same gene across all arrays. Qspline was implemented in the Bioconductor package/function **affy/normalize.qspline (data, fit.iters = 5, min.offset = 5, spar = 0, p.min = 0, p.max = 1.0)**. We chose the baseline array for Qspline using the default option, in which each gene in the baseline array had the intensity value equal to the mean of the intensity values of the same gene across all arrays.

### Detection of noise in the normalized datasets

We applied the five normalization methods individually to the time course and IPF datasets. In order to assess the effectiveness of the normalization methods in removing noise, we first used three sets of technical replicate microarrays, sets TRC1, TRC2 and TRC3, from the time course dataset. For each set of the replicates, we used pairwise MA plots to examine intensity-dependent differences in the data that are normalized individually with the normalization methods; we then calculated the coefficient of variation (CV) of the normalized intensity values for each transcript (gene) across all arrays. Specifically, if we let *S*_{
k
}denote the vector of normalized intensity values for transcript *k* in technical replicate set *j*, *S*_{
k
}= (*s*_{k,1},..., *s*_{k,m},...*s*_{k,M}) where *m* denotes the *m* th array in set *j* and 1 ≤ *m* ≤ *M*. The CV of *S*_{
k
}is computed as follows:

CV (*S*_{
k
}) = standard deviation (*S*_{
k
})/mean (*S*_{
k
}) × 100%.

Finally, we exploited the redundancy in the positive control probes on the arrays and measured the CVs for these probes across all arrays in the time course dataset. Similarly as above, we let *S*_{
c
}denote the vector of normalized intensity values for the positive control probe *c* on array *n*, and *S*_{
c
}= (*s*_{c,1,1},...*s*_{c,p,1},...*s*_{c,6,1},...,*s*_{c,p,n},...*s*_{c,6,N}), where *p* denotes the *p* th positive control probe *c* on array *n*, 1 ≤ *p* ≤ 6 (*i.e*., each positive control probe is spotted six times on each array), where 1 ≤ *n* ≤ *N* and *N* (= 36) is the total number of arrays in the entire time course datasets. The CV of *S*_{
c
}is computed as follows:

CV (*S*_{
c
}) = standard deviation (*S*_{
c
})/mean (*S*_{
c
}) × 100%.

Since the IPF dataset did not contain technical replicates, we measured and compared the CVs for each positive control probe across all arrays in the IPF data normalized with the normalization methods. The normalized intensity values used in this section for calculating CVs are on the scale of the raw intensity data (*i.e*., not log-transformed values).

### Detection of signal in the normalized datasets

A negative control threshold *T*_{
NC
}is used (as suggested by CodeLink Bioarrays) to monitor the low limit of signal [39], where *T*_{
NC
}is defined as *T*_{
NC
}= (80% trimmed mean of negative control probes) + (3 standard deviations of the 80% trimmed population of negative control probes). In order to minimize the effects of low signal probes (whose intensity values are smaller than the negative control threshold) on signal detection, we replaced intensity values of these probes with *T*_{
NC
}.

Log-transformed, normalized intensity values were used in the analyses described below.

In order to compare the effectiveness of the normalization methods in enhancing signal reproducibility, we detected signal in the normalized datasets. We assume that signal quality can be estimated by the number of differentially expressed genes detected (that is, the more signal retained in the normalized data, the more differentially expressed genes should be revealed). We first developed a simulation model and verified this intuition using simulated data (see additional file1: Simulation model and data for detail).

We then used multiple statistical significance tests, both parametric (*e.g*., Welch's two-sample *t*-tests and ANOVA) and non-parametric (*e.g*., Wilcoxon rank sum tests), to detect differentially expressed genes in the normalized datasets. Raw *p*-values of the *t*-statistics and Wilcoxon rank sum tests were computed using the null distribution of the test statistics as well as the permutation tests of the sample labels. These *p*-values were then adjusted for multiple testing using the false discovery rate (FDR) controlling procedure of Benjamini & Hochberg [40], which seems to balance the control of both false-positive and false-negative error rates better than other procedures (*e.g*., Bonferroni, the Benjamini & Yekutieli FDR and SAM) [41]. The cut-off adjusted *p*-value of 0.05 was used to determine differentially expressed genes for these two-sample statistical tests.

In addition to pair-wise comparisons between the control and test groups, we used ANOVA to detect genes whose intensity values varied with the length of the treatment. Since we believe that there was a non-linear relationship between the response- (intensity values of genes) and the explanatory variables (days of the treatment), we used a quadratic regression model to fit these variables. For transcript *k*, where 1 ≤ *k* ≤ *K* and *K* is the total number of the transcripts on the array, let *j* be the treatment group, 1 ≤ *j* ≤ 6; for array *i* in treatment group *j* and 1 ≤ *i* ≤ *N*_{
j
}, where *N*_{
j
}is the number of the arrays for treatment group *j*, let *s*_{k,i,j}denote the intensity value of transcript *k* on array *i* in treatment group *j* and *d*_{i,j}be the day of treatment for array *i* in treatment group *j*, *i.e*., *d*_{i,j}= {0,1,3,7,14,30}. The quadratic regression model can be written as:

\begin{array}{c}{s}_{k,i,j}={\beta}_{0}+{\beta}_{1}{d}_{i,j}+{\beta}_{2}{d}_{i,j}^{2}+{\epsilon}_{k,i,j},\\ {\epsilon}_{k,i,j}~N\left(0,{\sigma}^{2}\right).\end{array}

ANOVA was used to estimate statistical significance of the model parameters *β*_{1} and *β*_{2} for all transcripts, 1 ≤ *k* ≤ *K*. The *p*-values of the F statistics were adjusted using the Benjamini & Hochberg FDR procedure. To ensure that both *β*_{1} and *β*_{2} are non-zero and thus minimize the risk of false positives, a stringent criterion, the adjusted *p*-value < 0.01, was used to determine differentially expressed genes.