Normalization of oligonucleotide arrays based on the least-variant set of genes
© Calza et al. 2008
Received: 13 June 2007
Accepted: 05 March 2008
Published: 05 March 2008
It is well known that the normalization step of microarray data analysis makes a difference to the downstream results. All normalization methods rely on certain assumptions, so differences in results can often be traced to different sensitivities to violation of those assumptions. Illustrating this lack of robustness, in one striking spike-in experiment all existing normalization methods fail because of an imbalance between up- and down-regulated genes. It therefore remains important to develop a normalization method that is robust against violation of the standard assumptions.
We develop a new algorithm based on identification of the least-variant set (LVS) of genes across the arrays. The array-to-array variation is evaluated in the robust linear model fit of pre-normalized probe-level data. The genes are then used as a reference set for a non-linear normalization. The method is applicable to any existing expression summaries, such as MAS5 or RMA.
We show that LVS normalization outperforms other normalization methods when the standard assumptions are not satisfied. In the complex spike-in study, LVS performs similarly to the ideal (in practice unknown) housekeeping-gene normalization. An R package called lvs is available at http://www.meb.ki.se/~yudpaw.
High-throughput microarray technologies are becoming the norm in genetic and molecular research. Nevertheless, some steps in the preprocessing of the data prior to the main analyses remain problematic, as there is no universally accepted procedure for background correction, expression-value summarization and normalization. Here we focus on the normalization step of Affymetrix expression arrays, whose main purpose is to remove any systematic non-biological array-to-array variation. It is well known that (i) noisy technical variation exists between arrays due to many factors such as mRNA quantities, scanner settings, instrument operator, etc., and (ii) the choice of normalization method can have a substantial impact on the final results.
Absolute and relative (in parentheses) concordances between different normalization algorithms applied to MAS5 expression values. The numbers of genes declared DE are determined using the local false discovery rate at the 0.1% limit. Percentages in parentheses are relative to the method in the column headings. The last line reports the percentage of over-expressed genes among those declared DE.
All normalization methods need a reference set of genes that do not vary between samples. Most methods in fact use the whole set of genes as the reference set, a choice justified by two key assumptions [3, 9]: (i) the great majority of genes do not vary between samples, and (ii) the distribution of up- and down-regulated genes is approximately symmetric. Under these assumptions, the simple global-mean normalization, for example, makes all arrays have the same mean. The methods are not robust against violation of these assumptions: when either assumption fails, existing normalization procedures cannot be trusted. The problem is that, in practice, these assumptions are rarely checked. Furthermore, it is not usually stated what proportion of genes the 'great majority' should be, but statistically we should probably expect at least 90%. A much smaller proportion would undermine the methods; for example, if 40% of the genes vary, it is no longer credible that the global mean should be constant across the arrays.
Spike-in experiments have been the key tool for establishing current normalization schemes. Most such experiments are quite simple, involving only a few spike-in genes, and for them most existing normalization methods work well. However, in the so-called Golden-Spike data , almost 4,000 of the 14,010 genes are spiked, among which 1,331 are differentially expressed (DE) and 2,535 are nonDE.
The Golden-Spike experiment is contrived, but it is important in revealing the lack of robustness of the existing normalization methods. Because all the DE genes are up-regulated, violating the balanced-regulation assumption, all of the current normalization methods fail on this dataset. In many real studies, unbalanced regulation might reasonably happen [11–13]. This scenario has already been investigated in two-color microarrays  and raises some question marks over the existing procedures. A different sensitivity to violation of this assumption might explain differences in performance of the methods. Searching for a safer and more robust normalization procedure has been the motivation for this paper.
The so-called housekeeping genes, i.e. genes involved in the basic maintenance of the cell, might be considered a perfect reference set for normalization. In fact, they are used for normalization of PCR assays [15, 16]. To survive, every cell is supposed to express them at approximately the same level , so we do not expect the expression of these genes to vary between samples. Affymetrix arrays contain a set of possible housekeeping genes, usually used for quality-control procedures, but also suggested as the optimal reference for normalizing the arrays. Nevertheless, a broad body of evidence exists that genes traditionally considered housekeeping genes are in reality not invariant under a range of experimental and pathological conditions [18–20]. Specific examples of the failure of normalization based on a priori housekeeping genes were given in .
A data-driven procedure for identifying genes that do not vary across samples, and therefore might be a good reference set for normalization, leads to the so-called 'invariant-set normalization' [22, 23]. The procedure selects the set of genes to use as a reference for normalization in a pairwise fashion. This is done by selecting genes whose ranks are invariant between each sample and a reference distribution, e.g. a pseudo-median sample.
Our approach here is also based on data-driven housekeeping genes by identifying genes that vary the least between arrays. Instead of using pairwise comparison between samples, we exploit the total information from all the samples. The information is extracted from the probe-level data, by partitioning the observed variability into array-to-array variation, within-probeset variation and residual variation. Probesets whose array-to-array variability are below a given quantile are called the 'least-variant set' (LVS), and they provide the reference set for normalization.
To summarize our novel contribution, we have developed the LVS normalization and studied its performance in several spike-in experiments. We show that LVS normalization (i) performs similarly to other normalization methods when the standard assumptions are satisfied, but (ii) performs better when the standard assumptions are violated. In fact, in the latter case, LVS performs similarly to the ideal (in practice unknown) housekeeping-gene normalization. This means that the LVS normalization is robust against violation of the standard assumptions of normalization, so it is more widely and more safely applicable than the current normalization methods.
The LVS normalization is based on a two-step procedure. The first step operates at the probe-level data in order to estimate the component of variance due to array-to-array variability. This step is a multi-array procedure, using all the arrays in order to identify genes that vary least between the arrays. If the number of samples is large, a proper subset should be used for faster computation. The second step involves a non-linear fit of the LVS genes from individual arrays against those from a reference array, such as a pseudo-median array.
To identify the LVS genes, we analyse the probe-level data. Each probeset may contain from 8 to 20 pairs of perfect match (PM) and mismatch (MM) probes. First, the PM data is corrected for background; in our examples we use the so-called ideal mismatch (IMM) , but in principle any background correction method may be used. In Figure S3, Additional file 1, we show that different methods would produce similar LVS genes. Then, for each gene, specify the model
log2(PM_ij) = μ + α_i + β_j + ε_ij,   (1)
where PM_ij is the background-corrected PM value for the jth probe on the ith array, i = 1,...,n and j = 1,...,J; μ is the grand-mean parameter, α_i is the ith array effect, and β_j is the jth probe effect. This log-linear model was the basis for the RMA summary measure . It is similar to, but not the same as, the Li and Wong model , which uses a multiplicative term plus additive noise, so one cannot take logarithms to obtain a log-linear model.
The array-to-array variation of each gene is measured by the statistic

χ² = α̂' V⁻¹ α̂,   (2)

where α̂ is the vector of estimated α_i's, and V is its estimated covariance matrix. These quantities are available from the robust linear model fit.
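As a concrete illustration, the per-probeset fit and the array-effect statistic α̂'V⁻¹α̂ can be sketched as follows. This is a minimal NumPy sketch using ordinary least squares with sum-to-zero constraints; the paper's implementation uses a robust (M-estimator) fit, and the function name is ours:

```python
import numpy as np

def array_effect_stat(pm):
    """Fit log2(PM_ij) = mu + alpha_i + beta_j + eps_ij for one probeset
    (pm: n_arrays x n_probes matrix of background-corrected PM values)
    by OLS with sum-to-zero constraints, and return the array-effect
    statistic chi2 = alpha_hat' V^{-1} alpha_hat and the residual SD.
    (The paper uses a robust M-estimator fit; OLS keeps the sketch
    self-contained.)"""
    n, J = pm.shape
    y = np.log2(pm).ravel()                      # stack rows: array-major
    # Design: intercept, n-1 array contrasts, J-1 probe contrasts
    X = np.zeros((n * J, 1 + (n - 1) + (J - 1)))
    X[:, 0] = 1.0
    for i in range(n):
        for j in range(J):
            r = i * J + j
            if i < n - 1:
                X[r, 1 + i] = 1.0
            else:                                # last array = -sum of others
                X[r, 1:n] = -1.0
            if j < J - 1:
                X[r, n + j] = 1.0
            else:                                # last probe = -sum of others
                X[r, n:] = -1.0
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (n * J - X.shape[1])    # residual variance
    a = beta[1:n]                                # alpha_hat (free contrasts)
    V = s2 * XtX_inv[1:n, 1:n]                   # its covariance block
    chi2 = a @ np.linalg.solve(V, a)
    return chi2, np.sqrt(s2)
```

Genes with genuine array-to-array variation yield a large χ² relative to their residual noise, which is exactly the contrast the LVS selection step exploits.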
The array effect α i includes both the technical artifact t i and real biological effect b i , so that
α_i = t_i + b_i.   (3)
The ideal housekeeping genes are those with b i ≡ 0 for all i, thus allowing for the estimation of the remaining systematic variation that comes solely from technical sources. The LVS genes are the data-based estimation of these housekeeping genes.
Thus, in practice, to determine the LVS genes, we fit a nonparametric quantile function [27, 28] of the array χ² statistic (on the square-root scale) as a function of the logarithm of the residual standard deviation (SD), and declare the genes that fall under the curve to be the LVS genes. In analogy with classical linear models, where the conditional mean of the response variable is modelled as a function of some covariates, quantile regression models any chosen quantile of the response variable as a function of the covariates. In our application the quantile function is 'nonparametric' in the sense that it is not based on an explicit functional form, but on local smoothing of the data. We need to set a proportion τ, below which we expect genes not to vary between samples. In our experience τ = 60% is a reasonable choice, since we expect no more than 40% of the genes to vary significantly between arrays. Nevertheless, it is possible to conceive of an experimental situation where a higher proportion of genes is expected to be regulated, so the user needs to tune the value of τ accordingly.
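The selection step can be sketched as follows. The paper fits a smooth nonparametric quantile regression curve; as a simple stand-in, this sketch (function name ours) uses a binned running τ-quantile of √χ² against log residual SD:

```python
import numpy as np

def select_lvs(chi2, resid_sd, tau=0.6, n_bins=10):
    """Pick the least-variant set: genes whose sqrt(chi2) array-effect
    statistic falls below a tau-quantile curve of sqrt(chi2) as a
    function of log(residual SD).  A binned running quantile stands in
    for the nonparametric quantile regression used in the paper."""
    x = np.log(resid_sd)
    y = np.sqrt(chi2)
    # equal-count bins along the log-SD axis
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    # tau-quantile of y within each bin, evaluated at each gene's bin
    curve = np.array([np.quantile(y[idx == b], tau) for b in range(n_bins)])
    return y < curve[idx]
```

By construction roughly a fraction τ of the genes falls below the curve, matching the role of τ described above.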
This step works on a multi-array basis, requiring all the arrays for the analysis. For a large number of arrays, the memory requirement can be reduced by analysing a limited number of probes at a time. Furthermore, to reduce the computational burden, the analysis may also be performed on a random sub-sample of the data.
Once the LVS genes are identified, the normalization algorithm works on the individual arrays by fitting a spline smoother between the arrays and an arbitrary reference array. The latter is, for example, a pseudo-median array or any user-specified array. The curve fitted through the least variant genes is then used to map intensities of all the genes in each array to be normalized.
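The normalization step above can be sketched as follows. The paper fits a spline smoother through the LVS genes; this stand-in (function name ours) fits a piecewise-linear curve through binned medians of the LVS genes and maps all intensities through it:

```python
import numpy as np

def lvs_normalize(sample, reference, lvs_mask, n_knots=20):
    """Normalize one array against a reference array (both on the log2
    scale): fit a curve through the LVS genes only, then map every gene
    in the sample through it.  Binned medians + linear interpolation
    stand in for the spline smoother used in the paper."""
    x = sample[lvs_mask]
    y = reference[lvs_mask]
    order = np.argsort(x)
    x, y = x[order], y[order]
    # knots: per-bin medians of the (sorted) LVS intensities
    bins = np.array_split(np.arange(x.size), n_knots)
    knots_x = np.array([np.median(x[b]) for b in bins])
    knots_y = np.array([np.median(y[b]) for b in bins])
    # apply the normalization curve to ALL genes, LVS or not
    return np.interp(sample, knots_x, knots_y)
```

Because the curve is estimated only from the LVS genes, DE genes cannot drag the normalization curve towards themselves, which is the point of the method.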
This normalization step might be performed either after expression summarization or at the probe-level, prior to other preprocessing procedures. Any expression summary (e.g. MAS5-style, RMA-style, etc.) might be used. For all analyses in this paper, step 2 is applied after background correction and expression summarization. Finally, note that this step is single-array based, i.e., not requiring all the samples at the same time. This reduces the memory requirement during computation.
The so-called global or constant normalization method is the one typically used by the Affymetrix Microarray Suite . Each sample is rescaled to have its mean set to some arbitrary target value (usually 500). This is achieved by dividing each sample's values by a scaling factor obtained as the ratio between the sample mean and the target value. While the standard MAS5 algorithm works on the original scale, in our implementation we work on the log scale with a zero target value, simply subtracting its mean from each sample.
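On the log scale this reduces to centring each sample, e.g. (function name ours):

```python
import numpy as np

def global_normalize(log_exprs):
    """Global (constant) normalization on the log scale: centre each
    sample (column of a genes x samples matrix) by subtracting its own
    mean -- the log-scale analogue of MAS5's scaling to a common
    target intensity."""
    return log_exprs - log_exprs.mean(axis=0, keepdims=True)
```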
While the global normalization assumes a linear relationship between the arrays, the invariant-set normalization [22, 23] uses a non-linear regression to normalize data. A subset of genes is first selected based on comparing the ranks of the expression values in each sample to a reference array. The idea is that invariant genes, supposedly nonDE genes, should consistently have low ranks in each sample. A local regression is then fitted on the subset of invariant genes to get a normalization curve.
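The rank-based selection at the heart of the invariant-set method can be sketched as follows. Li and Wong's procedure iterates the selection with an intensity-dependent threshold; this simplified stand-in (function name ours) uses a single fixed threshold on the normalized rank difference:

```python
import numpy as np

def invariant_set(sample, reference, threshold=0.05):
    """Select a rank-invariant gene set: genes whose normalized rank
    (rank / n genes) differs between a sample and the reference by less
    than a fixed threshold.  A single fixed threshold stands in for the
    iterated, intensity-dependent scheme of the original method."""
    n = sample.size
    rank_s = np.argsort(np.argsort(sample)) / n     # normalized ranks
    rank_r = np.argsort(np.argsort(reference)) / n
    return np.abs(rank_s - rank_r) < threshold
```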
Quantile normalization [3, 4] is a multichip procedure, where the expression distribution (across genes) of each sample is forced to be the same as the distribution of a reference sample. The reference may be any of the samples or a derived one, such as the pseudo-median sample.
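A compact sketch of quantile normalization (function name ours), using the common choice of the mean of the sorted values across samples as the reference distribution:

```python
import numpy as np

def quantile_normalize(exprs):
    """Force every sample (column of a genes x samples matrix) to share
    the same distribution: each column's sorted values are replaced by
    the across-column mean of the sorted values, reassigned according
    to each column's own ranks."""
    order = np.argsort(exprs, axis=0)
    ranks = np.argsort(order, axis=0)               # rank of each value
    mean_sorted = np.sort(exprs, axis=0).mean(axis=1)  # reference distribution
    return mean_sorted[ranks]
```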
Three freely-available data sets have been used to evaluate the proposed normalization procedure. All of these data sets are from spike-in experiments, i.e. produced by controlled experiments with known RNA intensities or predefined mutual relationships.
The so-called 'Golden-Spike' experiment for Affymetrix arrays designed by  provides a dataset of 3,860 RNA species, where 100–200 RNAs were spiked in at fold-change (FC) level ranging from 1.2 to 4-fold, while a set of 2,551 RNA species was spiked-in at a constant (FC = 1) level. Data were designed as a two-group comparison, spike-in (S) versus control (C) (n = 3 each), with overall 9.5% genes over-expressed in S versus C. Out of 14,010 probesets on this DrosGenome1 chip, 1,331 had FC>1, among which 650 had FC>2, 2,535 had FC = 1 and 10,144 were 'empty'. The FC1 genes are thus the ideal housekeeping genes, and provide the perfect reference set for normalization of this dataset. This dataset is chosen to represent a study where there is an imbalance between up- and down-regulated genes, so the current normalization methods are expected to fail.
The other two spike-in data sets were produced by Affymetrix  and are part of the assessment procedure at the Affycomp II website . Dataset HGU133A is based on a latin-square experiment with 42 arrays and 42 spiked-in genes at concentrations ranging from 0.0 to 512 pM. Each concentration was assayed in three replicates, and each array contains 22,283 probesets.
Dataset HGU95A spike-in is one of the datasets used in the original assessment by . It consists of 20 experiments arranged in a latin-square design, with 14 genes spiked in at 14 different concentrations ranging from 0.0 to 1024 pM. Each concentration has two replicates, and each array contains 12,626 probesets.
Figure 1 plots, for the Golden-Spike data, the square root of the array-effect test statistic against the logarithm of the residual standard deviation, i.e. the array-to-array variability versus the residual variability from the probe-level linear model. Black points correspond to probesets that were spiked in with a nominal fold change ≥ 2, while the black line is the quantile regression curve with τ = 0.6. A total of 8,409 genes lying below this curve were used as the least-variant set of genes for normalization.
Using probe intensities summarized according to the MAS5 algorithm, we performed several normalizations, i.e. global normalization (also known as 'constant normalization'), quantile normalization and invariant-set normalization. Additionally, we performed both a global normalization and a loess normalization using the set of genes with known FC = 1. Since the FC1 genes are the ideal housekeeping genes, these should provide the theoretically best normalization method.
Similar results for RMA expression values (after RMA background correction, and median polish summarization of PM values) are given in Figure S1, Additional file 1.
The LVS algorithm was evaluated also on the well-known Affymetrix spike-in experiments, namely the HGU-95Av2 and the HGU133A data. The performance of the LVS method was tested within the framework of the affy competition [33, 34], using for expression summarization and background correction the methods adopted by the standard MAS5 and RMA algorithms. The reports automatically created by the Affycomp II website  are available in the Additional files 2, 3, 4, 5.
We have presented a new algorithm called LVS for normalizing high-density oligonucleotide arrays based on the set of genes with the least variability across samples. Array-to-array variation is estimated through a robust linear model fitted at the probe level. In noisy and complex datasets such as the Golden-Spike data, LVS normalization outperforms other normalization procedures. In simple spike-in experiments where very few genes are expressed or spiked, all normalization methods should work equally well. In real experiments, the normalization step does make a difference [2, 36], and in that case it is safer to use a more robust method that relies on fewer assumptions.
In the Golden-Spike data it seems clear that the signal observed for unexpressed genes was due to both experimental artifacts and nonspecific hybridizations. The latter occurs because probes associated with unexpressed genes might bind to other mRNA species with high concentration. When these are differentially expressed, the non-specific binding will lead to false discoveries. So, assuming that probes associated with unexpressed genes should be nonDE might indeed be wrong; in recent large arrays, these probes can be expected to be the majority. The complexity of non-specific binding also means that it is important to have more realistic spike-in experiments with a large number of expressed genes.
Any normalization procedure is supposed to equalize the distribution of non-varying genes across samples, thus correcting for any random or non-biological systematic variation. Obviously the determination of the non-varying genes must rely on pre-normalized data. Most current procedures make certain assumptions that would allow one to use the full collection of genes as the reference set, and no analysis is needed to identify them. LVS exploits the information in the probe-level data to determine a pre-specified proportion of least-variant features across samples. In contrast with the invariant-set method, which uses pairwise comparisons between each array and a reference array, LVS uses the full collection of arrays. This partly explains the better performance of the LVS compared to the invariant-set method for the Golden-Spike data.
Because of the intrinsic random noise in microarray experiments, a sufficiently large number of genes should be selected for normalization. Normalization based on a small set of genes, such as the housekeeping-gene or invariant-set normalization, might be ineffective with noisy data. The LVS algorithm allows a reasonably large proportion of genes to be selected as a reference set; we get a stable result over a range of proportions from 40–60%.
One of the key assumptions in current normalization procedures is that there is a balance between up- and down-regulated genes. This explains the failure of the current procedures on the Golden-Spike data. In real studies, can we always expect balanced expression? In lab experiments, e.g. with knock-out mice, an unbalanced proportion of over/under-expressed genes may reasonably occur. Haslett et al. , for example, reported a marked bias towards over-expression among muscle-related genes (135 of the 185 genes declared DE). A similar unbalanced pattern was reported by  and .
The problem is that, in practice, the assumptions underlying the normalization procedures are rarely checked, so it is never certain that the data are properly normalized. For example, even for clinical data with balanced expression levels, we showed in  that the commonly-used quantile normalization was biased for low-intensity genes. This means that we need a robust and safe procedure that relies on fewer assumptions. We believe that LVS normalization is a step in that direction.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.