Delineation of amplification, hybridization and location effects in microarray data yields better-quality normalization
© Hulsman et al; licensee BioMed Central Ltd. 2010
Received: 8 June 2009
Accepted: 26 March 2010
Published: 26 March 2010
Oligonucleotide arrays have become one of the most widely used high-throughput tools in biology. Due to their sensitivity to experimental conditions, normalization is a crucial step when comparing measurements from these arrays. Normalization is, however, far from a solved problem. Frequently, we encounter datasets with significant technical effects that currently available methods are not able to correct.
We show that by a careful decomposition of probe specific amplification, hybridization and array location effects, a normalization can be performed that allows for a much improved analysis of these data. Identification of the technical sources of variation between arrays has allowed us to build statistical models that are used to estimate how the signal of individual probes is affected, based on their properties. This enables a model-based normalization that is probe-specific, in contrast with the signal intensity distribution normalization performed by many current methods. Next to this, we propose a novel way of handling background correction, enabling the use of background information to weight probes during summarization. Testing of the proposed method shows a much improved detection of differentially expressed genes over earlier proposed methods, even when tested on (experimentally tightly controlled and replicated) spike-in datasets.
When a limited number of arrays are available, or when arrays are run in different batches, technical effects have a large influence on the measured expression of genes. We show that a detailed modelling and correction of these technical effects allows for an improved analysis in these situations.
Most applications of oligonucleotide arrays, such as finding differential expressed genes or network reverse engineering, involve the comparison of different arrays. Since array measurements are highly sensitive to the experimental conditions, comparison of arrays can be problematic. This is especially the case when experiments have been performed in different batches or experiments. Several normalization methods have been developed to handle this problem (e.g. MAS 5.0 , VSN , RMA , PDNN , PLIER , GCRMA ).
In this work we focus on differences between arrays caused by amplification, hybridization and location-based effects. Often used normalization methods such as RMA do not take into account these significant technical effects, while methods such as PDNN and GC-RMA only take into account the hybridization effect (although in a different way than we propose). We introduce a new normalization method that takes into account all these effects and improves performance over the existing methods. Although this study focuses on Affymetrix GeneChips, the resulting method can also be applied to other oligonucleotide platforms.
Background removal after normalization phase
Microarray pre-processing methods often have a background signal subtraction phase (bias reduction, i.e. removing signal consisting of optical background and signal due to non-specific binding), followed by a normalization phase (variance reduction) and summarization of the probe sets. The primary goal of the first phase is to improve accuracy (i.e. match the true signal level more closely), while the second phase is performed to improve precision (i.e. consistency of measurements over different arrays). Current pre-processing methods differ mostly in the method of bias reduction. For example, MAS 5.0 uses mismatch probes to estimate the background, RMA uses a general background distribution and GCRMA a sequence-based model. For normalization, often general distribution-correcting methods such as quantile normalization  or loess normalization  are used. One can perform these methods using a single reference array, or use multiple reference arrays as is done in the PTR method .
Our first attempt at removing technical effects during the bias reduction phase did show that, although one can improve accuracy, it is hard to not simultaneously decrease precision. The reason for this is that the estimated correction factors (biases) can be large and are estimated for each array separately with models that are simplifications of reality. To solve this, we perform the background subtraction phase after the normalization phase. Consequently, within the normalization phase, differences in technical effects between arrays are corrected. For example, in the case of sequence effects we perform now variance normalization (Figures 2c, d) instead of bias removal (Figures 2a, b). As we normalize not only the true signal but also the background signal, this allows us to use the same background estimate for all arrays during the background removal phase. This way we can still improve accuracy, while simultaneously preventing the reduction of precision.
Background removal within summarization phase
In general it is true that in cases that the measured fold change between arrays is low one is less certain that there is actually a 'real' fold change as opposed to situations where there is a large fold change. An important factor influencing the measured fold change is (again) the background signal. Probes with a relatively large background signal will generally have a smaller fold change, if the fold change is calculated over the whole signal (background + foreground).
Currently, most methods remove the background before the summarization of the probe sets. Although this reduces signal and fold change bias, it also obscures the 'real' fold change. That is, the fold changes of probes with a large background signal will be blown up more than those of other probes. In fact, probes which measure only background signal for some arrays could get infinite fold changes if this was not prevented by limiting the amount of background subtracted. This has a major impact on the summarization of probe sets, as such probes become more important than they should.
One could choose to perform no background subtraction, preventing the dominance of high-background probes, at the cost of increased bias. But, even then probes that measure mostly background signal influence the summarization outcome. Therefore, in our approach, we have moved the background removal not only after the normalization phase, but into the summarization phase. This allows us to model the importance of the probes during summarization according to the amount of 'true' signal they measure.
Results and Discussion
Using this signal, we perform in the next step array location correction (L(.)) and distribution correction using quantile normalization (Q(.)) (Figure 4, step 2b). The reason to do these corrections separately is that these effects cannot be effciently estimated using the approach of equation 3. As technical effect estimations can be dependent on each other, one could choose to perform the normalization steps in an iterative fashion (i.e. repeatedly applying each step until convergence). We found, however, that repeating the first part (step 2a) of the normalization is unnecessary, as it did not add noticeable performance. For the second part (step 2b), one could first perform array location correction and then quantile normalization, or the other way around. However, as quantile normalization is a fast and idempotent procedure, we perform quantile normalization both before and after array location correction, i.e. . Also in this case we did not find noticeable performance advantages over a non-iterative solution.
In the next subsections we discuss how to estimate the reference signal components (Equation 1), as well as the hybridization and amplification model (Equation 2). Then we discuss the array location correction performed during step 2b, and the final step in the algorithm, the summarization and background removal (Figure 4, step 3).
Estimating signal components
The hybridization model τ j is used three times: to estimate the reference background signal (Equation 5) as well as to estimate the background and foreground hybridization differences between arrays (Equation 2). The separate background and foreground hybridization difference model represents the notion that the non-specific hybridization process differs from the hybridization process of the targeted sequence .
The hybridization model is based on the probe sequence. Earlier studies have indicated that the position of nucleotides on the probe are important [4, 8]. Furthermore,  as well as  suggested to fit dinucleotide binding strength, based on hybridization experiments in earlier studies. Each dinucleotide is presumed to have a multiplicative effect on the signal. As additive models are easier to handle, we use the following transformation: . This allows us to estimate the effects as additive components of . is modelled according to two parts. One part models the influene of nucleotide pairs in the probe sequence (part 1, τ(1),j). The other part models the influence of the number of certain nucleotides present in the probe sequence (part 2, τ(2),j).
where ϕp,qis a knot weight q for nucleotide pair p, and Bp,q(m) is the corresponding B-spline basis function factor for position m on the probe. Furthermore, I((ζj,(m,m+1)= p) is an indicator function determining if probe j has nucleotide pair p on positions (m, m + 1).
Experimentally, we found that Q = 5 knots with degree 3 (cubic B-splines) are able to fit the signal well. Our results indicate that there is enough data to estimate the 230 parameters within ϕ without significant overfitting. This is to be expected as there are often more than 200,000 probes on a microarray.
Amplification effect correction
where probe-specific vector p j contains for each dinucleotide the number of occurences between the middle of probe j and the 3' end of its probe set, and the array-specific vector a i contains the parameters for each dinucleotide determining its role in amplification differences. An F-test shows that this model significantly improves the amount of amplification effect our model is able to fit over the location-only linear model (Additional file 1, Section S5).
Array location effects
Estimating the (per-probe) array location effects within the model itself would lead to an unrealistic large number of parameters. For that reason, we estimate the array location effects separately. That is, for a given input signal x ij , we calculate an output signal y ij = L(x ij ) corrected for location effects. In the 'Algorithm overview' section it is described how this function is used in the second normalization step. Although we cannot exclude that location effects could affect the optical background signal, we found that normally only the hybridization signal (both background and foreground) is affected (see Additional file 1, Section S6). Furthermore, even if there are location effects in the optical background they will not have a significant impact on the signal of expressed genes. For this reason, we estimate the location effect only for the hybridization signal. Assuming that the input signal has already been normalized for optical background (as we do in Figure 4, step 2a), we can use a common optical background estimate .
We use log scaling as the location effects affect the signal multiplicatively. Next, all calculated residues ϵ ij are mapped to the arrays on the location of their corresponding probe j (see Figure 3). If there is a location effect in a certain region of an array this will show as residues with a negative or positive bias. To robustly estimate this effect, we calculate for each probe on each array the median of the residues in the neighborhood. However, not every place on the array contains a probe, furthermore we do not use the mismatch probes. To handle the empty probe positions as well as probes near the border of the array, we choose to use a fixed size median box filter of 9 × 9, which takes into account only the valid probes under it. This means that the number of probes used by the filter depends on its position. The result of this filter are location effect values λ ij = median_box_filter(ϵ i , j) for each probe on each array. To obtain the corrected signal we calculate .
To make the model identifiable, we use the constraints ∑ i α ik = 0 ∀k and ∑ j ρ j = 0. When the mean true signal level for a probe is low (i.e. ), the array effect α ik will have less influence on the magnitude of the residual signal. Due to this, the probe will have less influence on the final summarized signal.
A second modification we make is the removal of outlier probes. This has been suggested before in . If the Huber M-estimator (see Additional file 1, Section S3) assigns a low weight to the measured values of a probe (i.e. < 0.9) for more than one-third of the arrays we remove the probe entirely as its quality for the other arrays is also questionable. However, we make certain that we keep always more than 5 probes within a probe set.
The fold change values α ik are used for further analyses. These fold changes are corrected for bias as the background signal has been removed. Using bias-corrected fold changes can be advantageous when performing for example network reverse engineering. However, as discussed in the introduction, one looses information on the reliability of the fold change. This is one of the reasons that methods such as MAS 5.0 and PLIER (which perform a strong background correction, removing most of the bias) do not perform that well when used with differential expression detection algorithms based on the magnitude of the fold change. To prevent this, we have added an option to backscale the fold changes using . This is the final output of the algorithm and is used in the subsequent analyses.
Differentially expressed gene detection performance
An overview of the results for the two spike-in experiments
fold chg. single repl.
fold chg. all
fold chg. single repl.
fold chg. all
RDN [qq, loc]
RDN [qq, hyb]
RDN [qq, hyb, amp]
RDN [qq, hyb, loc]
Every spike-in experiment was performed three times. We compared the performance both by using the methods on single replicates (and averaging the performance over the three replicates) as well as by applying the methods on the complete dataset simultaneously. As expected, the performance gain when using each replicate separately is larger, as replication is performed to reduce the effect of unwanted differences between arrays. However, even with replication our method outperforms other methods. In this context, it is also interesting to look at the SAM test statistic score. Many studies do not have multiple replicates for each sample (as is done here), but use multiple samples per class. When comparing such classes, often the SAM(-related) test statistic score is used, which takes into account variability within a class to determine the ranking. Reducing variability caused by technical effects can improve its results. Here we see that RDN performs better than many of the other normalization methods. The slightly higher SAM score of GCRMA for the HG_U95A dataset can be attributed partially to its use of a low probe expression cutoff, thereby removing probe sets with very low expression from consideration. If we apply a similar cut-off to the same amount of probes as in GCRMA, we obtain a better SAM-score of 0.76. To determine the performance effect of individual components of RDN, we ran our method with certain normalization steps disabled (Table 1). When performing only a standard quantile normalization in combination with the new summarization procedure, RDN already outperforms most of the other methods. Enabling correction for location-specific effects as well as hybridization effects further increases the performance. Amplification correction on the other hand does not significantly add to the performance for these experiments. This is as expected, as there are no significant differences in amplification between the arrays (see Figure 1a, the lines are more or less parallel), presumably because the amplification was done once for all arrays, which in practice is not realistic for most experiments.
Differential gene finding - hMSC dataset
Signal bias and background estimation
To correct for signal bias, the estimated background is removed from the signal during summarization (Equation 14). To determine the remaining signal bias after this procedure, we compared the corrected expression levels for the spike-in probe sets with their actual spike-in concentrations (Figures 6c, 6d). As these are log-log plots, unbiased signals should result in straight lines. In these plots, we see that PLIER and PDNN are the best performing method in terms of bias reduction (see also Tables S6 and S7 in Additional file 1, containing the r2 measure as well as the slope for each of the methods). By default, RDN does not focus on bias correction as unbiased fold changes can negatively impact differential expression detection, as discussed in the introduction. In fact, after summarization it re-adds the background signal (backscaling). Due to this, RDN has approximately the same bias as RMA_NBG, a method which does not remove background signal. However, if one requires more accurate signals one can also use a different setting for RDN, i.e. no backscaling and no underestimation of background (η = 1). With this setting, RDN has a summed r2 measure between PDNN and PLIER.
Signal precision for low, medium and high expression spike-ins
Next to signal accuracy (improved by background removal), one can also look at signal precision (improved by normalization). However, one cannot use measures such as the standard deviation of a probe signal over multiple arrays. The reason for this is that background removal blows up the fold changes, especially of low expression probes. As each method uses different strategies for background removal, one can not directly compare such measures. A similar problem affects comparisons of signal precision using spike-in probes, where one determines the performance of the spike-in probe sets for different spike-in concentration ranges. For example, when testing low-expression spike-in probe sets, the performance for a method without background substraction will be affected by a relatively large number of high-expression false positives. For methods with background subtraction, it is the other way around.
Another method to quantify precision is to rank replicate array pairs w.r.t. non-replicate array pairs based on the difference in expression between the arrays in the pair. Then one can compare the ranking using a method such as the AUC score. We did this on the hMSC dataset (Additional file 1, Section S1) and found that RDN outperforms all other methods.
Inspection of dataset after normalization
The results show that the performance improvements shown cannot be attributed to one component of the algorithm. Rather, it is a collection of improvements obtained by normalizing the different technical effects, removing the background after the normalization procedure, and, using background information in the summarization procedure.
Our method currently assumes that, on average, the expressions of genes remains similar accross multiple arrays, allowing us to identify the technical effects. This assumption is also used in other widely used normalization methods such as quantile normalization. The limitation of these type of methods is that they cannot be used when all genes change their expression simultaneously upwards or downwards. RDN requires more computation time than other normalization methods (10 minutes per array on a recent dual core 3 Ghz Intel processor). However, this amount of time is still small when compared to the time it takes to perform the microarray experiment. Furthermore, it is easy to parallelize the algorithm. Compared to the number of probes (> 200.000) our method uses a relatively limited numer of parameters. This prevents overfitting, so that only large scale effects affecting many probes simultaneously are removed. For platforms with even more probes one can choose to estimate the parameters with a subset of the probes. Currently, the amplification correction is based upon the assumption that probe sets are close to the poly-A tail of the probe set. This is not always the case. We found that especially probe sets further away from the poly-A tail can have large variations in signal for all probes, correlating with the amplification differences. This is not corrected by our current method. Given reliable distances towards the poly-A tail this could be added relatively easily.
The approach shown in this paper could also be useful for other oligonucleotide platforms. In this work we have mainly focused on gene expression. However, it could also be applied to ChIP-chip or tiling experiments.
We have introduced a new normalization method that corrects for hybridization, amplification and array location effects that occur when measuring expression using oligonucleotide arrays. The proposed Robust Difference Normalization (RDN) corrects these technical effects by removing the differences in measured expression over different arrays instead of correcting for signal bias for each array. In this way, the proposed normalization procedure focuses on controlling the precision instead of the accuracy of the measured expression, which is more important for some applications (like, for example, detecting differentially expressed genes). We have shown that the proposed RDN method increases performance, even on experimentally tightly controlled and three times replicated spike-in data sets. The method will be most useful for studies consisting of few independent replicates or those showing large batch effects.
Spike-in experiments have often been used for validation of normalization methods. We use two Latin Square experiments performed by Affymetrix on the HG-U95A and the HG-U133A platforms. These consist of respectively 59 and 42 arrays, and each measure 14 different spike-in gene groups. For each spike-in group, 14 different concentrations are measured. For the HG-U95A experiments these concentrations are 0, 0.25, 0.5 ... 1024 pmol, while for the HG-U133A experiment the concentrations are 0, 0.13, 0.25 ... 512 pmol. Fold changes between subsequent steps are equal to 2. For every measurement there are three replicates. The HG-U95A platform has extra replicates for some of the measurements. See Additional file 1, Section S2 for a description of which spike-in probe sets were used.
We also validated our method on a dataset measuring human mesenchymal stem cells (hMSCs). In this experiment, bone marrow aspirates were obtained from the acetabulum or iliac crest of patients undergoing hip replacement surgery, who had given written informed consent. Human mesenchymal stem cells (hMSCs) were isolated and proliferated as described previously in . To analyze the gene expression profile we seeded hMSCs at 1000 cells/cm2 and upon reaching near confluence RNA was isolated using an RNeasy mini kit (Qiagen). Quality and quantity were analyzed by gel electrophoresis and spectrophotometrically. For 62 donors, RNA microarray experiments were performed on the HG-U133A 2.0 platform, in three different batches of respectively 21, 21 and 20 donors. Although experiments were performed at the same microarray facility, and with arrays from the same production batch, signficant technical differences were observed between the batches. Differences are likely due to the fact that the batches are performed at a different dates with several months in between. To validate our method, we used two different binary labelsets, namely the gender of the donors and the location (actabulum or iliac crest) from where the cells were obtained in the body.
Using the Affymetrix spike-in datasets, we compared the performance of different normalization methods to detect differential expression between different spike-in concentrations. Each spike-in experiment contains 14 concentration groups, with each next group in the series having a 2-fold higher concentration. To make the comparison a more realistic approximation of reality, we only compared two subsequent concentration groups (limiting the comparison to 2-fold differences, including the step from 0 to the next lowest concentration). The resulting number of spike-in concentration pairs is 13 * |spike-ins|. We determine the fold-change or SAM test statistic  for the array pairs with the subsequent concentration groups. As is usually done, to determine the performance we calculate an area under the ROC curve (AUC score) using the spike-in pairs with 2-fold concentration difference as positive examples, and the non-spike-in pairs as negative examples. We limit the number of false positives to 50 for each array pair (also called ROC50 score), corresponding to regular practice (e.g. ) where one is only interested in results with a limited number of false positives.
The quantile-quantile normalization is performed by mapping the signal distribution of different arrays to the distribution of the median array, after which the measurements from all arrays are sorted and replaced with values from the median array .
The software is available as a Matlab Toolbox at http://bioinformatics.tudelft.nl/users/marc-hulsman
- Affymetrix: Affymetrix Microarray Suite User Guide (version 5). Tech. rep., Affymetrix Santa Clara 2002.Google Scholar
- Huber W, Heydebreck von A, Sultmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 2002, 18: S96-S104.View ArticlePubMedGoogle Scholar
- Irizarry R, Hobbs B, Collin F, Beazer-Barclay Y, Antonellis K, Scherf U, Speed T: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4(2):249. 10.1093/biostatistics/4.2.249View ArticlePubMedGoogle Scholar
- Zhang L, Miles MF, Aldape KD: A model of molecular interactions on short oligonucleotide microarrays. Nature Biotechnology 2003, 21(7):818. 10.1038/nbt836View ArticlePubMedGoogle Scholar
- Hubbell E, Liu W, Mei R: Supplemental data: robust estimators for expression analysis. Tech rep Affymetrix 2004.Google Scholar
- Wu Z, Irizarry RA: Stochastic Models Inspired by Hybridization Theory for Short Oligonucleotide Arrays. Journal of Computational Biology 2005, 12(6):882. 10.1089/cmb.2005.12.882View ArticlePubMedGoogle Scholar
- Gautier L, Cope L, Bolstad B, Irizarry R: affy-analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004, 20(3):307. 10.1093/bioinformatics/btg405View ArticlePubMedGoogle Scholar
- Naef F, Magnasco MO: Solving the riddle of the bright mismatches: Labeling and effective binding in oligonucleotide arrays. Physical Review E 2003, 68: 011906. 10.1103/PhysRevE.68.011906View ArticleGoogle Scholar
- Binder H, Kirsten T, Loeffler M, Stadler P: Sensitivity of Microarray Oligonucleotide Probes: Variability and Effect of Base Composition. Journal of Physical Chemistry B 2004, 108(46):18003. 10.1021/jp049593gView ArticleGoogle Scholar
- Held GA, Grinstein G, Tu Y: Relationship between gene expression and observed intensities in DNA microarrays-a modeling study. Nucleic Acids Research 2006, 34(9):e70. 10.1093/nar/gkl122View ArticlePubMedPubMed CentralGoogle Scholar
- Skvortsov D, Abdueva D, Christina C, Schaub B, Tavare S: Explaining differences in saturation levels for Affymetrix GeneChip(R) arrays. Nucleic Acids Research 2007, 35(12):4154. 10.1093/nar/gkm348View ArticlePubMedPubMed CentralGoogle Scholar
- Cheng C, Li LM: Sub-array normalization subject to differentiation. Nucleic Acids Research 2005, 33(17):5565–5573. 10.1093/nar/gki844View ArticlePubMedPubMed CentralGoogle Scholar
- Li LM, Cheng C: Differentiation Detection in Microarray Normalization. In DNA Microarray Normalization. Edited by: Stafford P. CRC press; 2008:19–39.View ArticleGoogle Scholar
- Bolstad B, Irizarry R, Astrand M, Speed T: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19: 185–193. 10.1093/bioinformatics/19.2.185View ArticlePubMedGoogle Scholar
- Dudoit S, Yang Y, Callow M, Speed T: Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 2002, 12: 111–140.Google Scholar
- Ge H, Cheng C, Li L: A probe-treatment-reference(PTR) model for the analysis of oligonucleotide expression microarrays. BMC bioinformatics 2008, 9: 194. 10.1186/1471-2105-9-194View ArticlePubMedPubMed CentralGoogle Scholar
- Huber P, Wiley J: Robust statistics. John Wiley & Sons, Inc, New York; 1981. full_textView ArticleGoogle Scholar
- Bolstad B: Low-level Analysis of High-density Oligonucleotide Array Data: Background, Normalization and Summarization. PhD thesis. University of California; 2004.Google Scholar
- Both SK, Muijsenberg J, van Blitterswijk C, de Boer J, de Bruijn J: A Rapid and Efficient Method for Expansion of Human Mesenchymal Stem Cells. Tissue Engineering 2007, 13: 3. 10.1089/ten.2005.0513View ArticlePubMedGoogle Scholar
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences 2001, 98(9):5116. 10.1073/pnas.091062498View ArticleGoogle Scholar
- Irizarry RA, Wu Z, Jaffee H: Comparison of Affymetrix GeneChip expression measures. Bioinformatics 2006, 22(7):789. 10.1093/bioinformatics/btk046View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.