Spotting effect in microarray experiments
© Mary-Huard et al; licensee BioMed Central Ltd. 2004
Received: 09 January 2004
Accepted: 19 May 2004
Published: 19 May 2004
Microarray data must be normalized because they suffer from multiple biases. We have identified a source of spatial experimental variability that significantly affects data obtained with Cy3/Cy5 spotted glass arrays. It yields a periodic pattern altering both signal (Cy3/Cy5 ratio) and intensity across the array.
Using the variogram, a geostatistical tool, we characterized the observed variability, called here the spotting effect because it most probably arises during steps in the array printing procedure.
The spotting effect is not appropriately corrected by current normalization methods, even by those addressing spatial variability. Importantly, the spotting effect may alter differential and clustering analysis.
Microarray technology is probably the most successful technology in the field of functional genomics. Biologists use it to analyze gene expression at the genome scale by comparing the levels of messenger RNAs present in matched biological samples, for example samples grown under contrasting conditions or with different genetic configurations. Microarray data can be used for differential analysis, to identify genes whose expression strongly depends on the nature of the samples, as well as for clustering analysis, to identify coexpressed genes. Microarray data show a high level of variability. Some of this variability is relevant because it corresponds to the differential expression of genes. Unfortunately, a large portion results from undesirable biases introduced during the many technical steps of the experimental procedure. Several sources of experimental noise have already been addressed, such as the dye (fluorophore), the fluorescence level and the print-tips, and statistical methods have been proposed to normalize data for the related effects ([1, 2]).
In this paper, we describe an experimental bias and use statistical methods to investigate the distribution of the signal across the microarray area. We use the variogram to analyze spatial dissimilarities between spots on the slide. Although spatial signal distribution across the slide has already been studied ([3, 4]), the bias we report here has never before been explicitly characterized. We also present two experiments that give clues about the nature of the spotting effect. Finally, we investigate whether the effect can be corrected and how well the usual normalization procedures do so.
The datasets analyzed were produced with glass arrays and the two-color labeling strategy, in which two conditions are compared directly. In these experiments, mRNA samples are collected from case and reference cells. The two corresponding cDNA samples are synthesized, labeled with either the Cy3 (green) or the Cy5 (red) fluorophore, mixed, and hybridized simultaneously to a single array. For each DNA feature (representing a gene) printed and bound on the array, the fluorescence emitted by the hybridized labeled cDNA is measured in the Cy3 and Cy5 channels. Both fluorescence measurements are compared to define the relative gene expression in case versus reference cells. We designate these two values G and R and define the signal and intensity associated with a given gene as follows:
the signal associated with a gene is the logarithm of the ratio R/G. This quantity is used to identify differentially expressed genes;
the intensity is defined as the logarithm of the product R × G (or log(R × G)/2).
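As a minimal sketch of these two definitions, the Python snippet below computes the signal and the intensity for one spot. The text does not state the base of the logarithm, so base 2 is an assumption here, as is the function name `signal_and_intensity`; the intensity uses the halved form log(R × G)/2.

```python
import math

def signal_and_intensity(r, g):
    """Return (signal, intensity) for one spot.

    r, g: background-corrected red (Cy5) and green (Cy3) fluorescence.
    Base-2 logarithms are an assumption; the text only says "logarithm".
    """
    if r <= 0 or g <= 0:
        raise ValueError("R and G must be positive to take logarithms")
    signal = math.log2(r / g)            # log ratio R/G
    intensity = math.log2(r * g) / 2.0   # log(R x G)/2
    return signal, intensity

# Example spot: R = 1024, G = 256
m, a = signal_and_intensity(1024.0, 256.0)
```

With these example values the ratio is 4 and the product is 2^18, so the signal is 2 and the intensity is 9.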
The goal of normalization is to correct the signal for experimental bias. Most existing normalization procedures do not specifically correct for potential spatial effects. The few that do only consider sources of variation that are restricted locally. For instance, the print-tip effect acts as a block effect, where the blocks are defined by the cluster of spots printed on the array with the same print-tip (). The goal of this study is to determine whether normalization that corrects for additional spatial effects is necessary or whether current normalization models are sufficient.
Detection of the spotting effect in multiple microarray datasets
Characteristics of the datasets studied
Lab. or Database | Cols × Rows/Block | Nber of Rep.
(Micro grid II) | 21 × 20 | row N and N + 10
UHN Toronto (Tor270) | 24 × 25 | col N and N + 1
 | 24 × 24 |
 | 24 × 21 |
where Y_s is the signal measured at spot s, and the remaining terms are the mean spotting row effect, the block effect β_bl and the intensity effect Int_s. A spotting row effect means that a single parameter is estimated for all the rows spotted at the same time. For example, in the self-hybridized Arabidopsis slide dataset, the row effect is the same for rows 1, 11, ..., 231.
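To make the grouping of rows concrete, the following Python sketch maps each slide row to its spotting group and averages the signal within each group. It assumes a slide of 240 rows in which rows N, N + 10, N + 20, ... are printed in the same step, as in the Arabidopsis example. The function names are hypothetical, and the plain group mean ignores the block and intensity terms of the model, so it is a simplification rather than the fitted model itself.

```python
from collections import defaultdict

def spotting_group(row, period=10):
    """Group index shared by 1-based rows row, row + period, row + 2*period, ..."""
    return (row - 1) % period

def spotting_row_means(signal_by_row, period=10):
    """Estimate one parameter per spotting group as the mean signal.

    signal_by_row maps a 1-based row number to the list of signals in that row.
    Block and intensity effects are ignored here (simplifying assumption).
    """
    pooled = defaultdict(list)
    for row, values in signal_by_row.items():
        pooled[spotting_group(row, period)].extend(values)
    return {group: sum(v) / len(v) for group, v in pooled.items()}

# Rows that share a parameter with row 1 on a 240-row slide
rows_with_row_1 = [r for r in range(1, 241) if spotting_group(r) == 0]
```

Here `rows_with_row_1` contains rows 1, 11, ..., 231, which matches the example given in the text.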
Nature of the spotting effect
The spotting effect could be explained in different ways. The amount of material deposited on or bound to the slide and the shape of DNA spots can be affected by multiple factors, such as the time during which the print-tips are soaked in the DNA source microtiter plates, the time during which the print-tips touch the slides, the speed at which the print-tips move, the concentration and the salinity of the DNA solutions, the temperature and the relative humidity of the arrayer printing cabinet, and the physicochemical characteristics of the print-tips and of the glass surface. With most spotting robots, the printing of high density arrays containing thousands of features lasts for hours and subtle changes in spotting conditions may, therefore, alter all these factors. For example, DNA solutions may evaporate over time. In that regard, the spotting effect may be related to the "time-of-print" effect reported in .
Alternatively, the spotting effect may reflect DNA source plate variations because all DNA features printed simultaneously originate from the same plate. To test this hypothesis, we analyzed results from two microarray experiments for which the plate effect was controlled. In the first experiment, slides were printed from a single 384-well plate containing human cDNA amplicons and hybridized to targets prepared from RNA isolated from primary CD4+ T cells. In the second experiment, slides were printed with oligonucleotides of 70 bases, synthesized according to a different chemistry, and hybridized to cDNA from various developmental stages of Plasmodium falciparum. The oligonucleotides have the same length and are resuspended in solutions showing a narrow concentration range. In both cases, the spotting effect was greatly reduced (data not shown). This observation suggests that the plate effect is a major component of the spotting effect.
Spotting effect and normalization
Whether the spotting effect can be corrected depends on how the probes are distributed among the plates. Three cases can be distinguished:
1. Probes are arranged according to their biological characteristics, for instance intergenic regions separated from transcription units, or genes expected to be differentially expressed grouped together in particular plates. In this case, it is impossible to tell whether a significant plate effect is due to coexpression of genes belonging to the same class or to technical artifacts.
2. Probes are arranged according to their chromosomal order. Such a structure may lead to significant differences between plates if genes with similar expression profiles are spatially clustered in the genome (silent neighboring genes in heterochromatic regions, for example). Such spatial clustering has recently been observed in several organisms ([9, 10]) and may affect many others.
3. Probes are randomly distributed among plates. Most human array experiments satisfy this assumption. The results presented in Section 4 show that this configuration does not cause the spotting effect to disappear.
A normalization procedure to correct the effect is advisable only in the last case because, in the first two, regardless of the importance of the spotting bias, the correction would unavoidably alter the biological information contained in the data. Thus, the effect can considerably affect the conclusions of experiments corresponding to the first two cases. In particular, results of experiments studying gene similarity or the relationship between relative chromosomal position and coexpression could be substantially distorted, as also pointed out by Balazsi et al. [11].
Assuming that the experiment of interest corresponds to the third case, one has to investigate whether a specific normalization for the spotting effect is needed or whether standard normalization is sufficient. We present here the consequences of applying the normalization procedure proposed by Yang et al. [1], one of the most widely used methods in the microarray community, to the self-hybridized Arabidopsis cDNA array data. Only results obtained with background-corrected signals and the global loess normalization procedure are presented. The analysis performed on background-uncorrected data, or with the print-tip loess normalization procedure, gave similar results.
Because of the strong pattern observed in rows, the spotting effect may be treated as a row effect per block or as a global row effect across the slide. Preliminary results suggest that such models adequately correct the periodic spatial bias described in various microarray datasets. Row models are advantageous because they rely exclusively on the geometrical information that is embedded in the data and that is mandatory according to the "Minimum Information About a Microarray Experiment" (MIAME) guidelines. In contrast, plate origin information is very rarely available and is difficult to integrate into statistical analysis, considering that successive technical steps usually take place in microtiter plates of different formats (e.g. 96-, 384- and 1536-well plates) during spotted microarray DNA production.
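The two row models can be illustrated with a short Python sketch that centers the signal row by row, either across the whole slide (global row effect) or within each block (row effect per block). Using the median as the estimate of the row effect, and the name `remove_row_effect`, are assumptions made for illustration; the text does not specify an estimator. Such centering is only sensible when probes are randomly distributed among plates.

```python
from statistics import median

def remove_row_effect(signal, block_cols=None):
    """Subtract an estimated row effect from a slide of log ratios.

    signal: list of slide rows, each a list of floats.
    block_cols: number of columns per block. If None, one effect is
    estimated per slide row (global row effect); otherwise one effect is
    estimated per row segment of that width (row effect per block).
    The median is used as the estimate (illustrative assumption).
    """
    corrected = []
    for row in signal:
        if not row:
            corrected.append([])
            continue
        width = block_cols or len(row)
        new_row = []
        for start in range(0, len(row), width):
            segment = row[start:start + width]
            centre = median(segment)
            new_row.extend(value - centre for value in segment)
        corrected.append(new_row)
    return corrected

# Toy slide: one row whose two blocks carry different offsets
slide = [[1.0, 1.0, 3.0, 3.0], [0.0, 0.0, 0.0, 0.0]]
global_fit = remove_row_effect(slide)
per_block_fit = remove_row_effect(slide, block_cols=2)
```

In this toy example the global model leaves a residual block offset in the first row, whereas the per-block model removes it entirely.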
We have observed that transcription profiling datasets obtained with spotted glass microarrays and the two-color labeling strategy show a bias that leads to periodic patterns according to rows and columns of the array grid. These patterns affect the entire area and alter both signal and intensity. We propose that such patterns result from artifacts introduced during the DNA feature preparation into microtiter plates or the slide printing procedure because features spotted together yield the most similar signals.
Color swaps are now routinely included in microarray experimental designs to correct for labeling biases. They consist of repeated hybridizations in which the case and reference samples are labeled at least once with each of the Cy3 and Cy5 fluorophores. Preliminary analyses indicate that the spotting effect is reduced when raw data from opposite color swap hybridizations are combined (unpublished results). This observation is consistent with the fact that the spotting effect depends on the position of the spots on the slide and that the relative spot position remains the same from slide to slide in most setups. We suggest that the reduction of the spotting effect resulting from the combination of raw opposite color datasets may constitute additional justification for the inclusion of color swaps in microarray experiments.
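The following Python sketch illustrates why combining opposite color swaps could reduce a position-dependent bias. It rests on a strong assumption that is not established in the text: the bias is additive on the log ratio and identical at a given position on both slides. Under that assumption, the true signal changes sign between the two slides while the bias does not, so half the difference of the two log ratios cancels the bias. All names and values are illustrative.

```python
def combine_dye_swap(m_forward, m_swapped):
    """Combine spot-wise log ratios from a forward and a dye-swapped slide.

    Returns (forward - swapped) / 2 for each spot, which cancels any
    additive term that is identical on both slides.
    """
    if len(m_forward) != len(m_swapped):
        raise ValueError("both slides must contain the same spots")
    return [(f - s) / 2.0 for f, s in zip(m_forward, m_swapped)]

# Simulated example: a periodic position bias added to both slides
true_signal = [1.0, 0.0, -0.5, 2.0]
position_bias = [0.25, -0.125, 0.25, -0.125]
forward = [t + b for t, b in zip(true_signal, position_bias)]
swapped = [-t + b for t, b in zip(true_signal, position_bias)]
combined = combine_dye_swap(forward, swapped)
```

In real data the bias is unlikely to be exactly the same on both slides, so a reduction rather than a complete cancellation would be expected.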
We have shown that the variogram is an efficient tool to display spatial correlations between spots. Furthermore, it is possible to test the null hypothesis that no spatial correlation exists (for instance the Moran test described in ). Such tests could be performed together with the variogram analysis as part of the data normalization procedure to investigate the significance of observed spatial biases and to evaluate the need and efficiency of different correction methods.
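As an illustration of such a test, the Python sketch below computes Moran's I on a rectangular grid of spots and assesses it with a random permutation test. Several choices here are assumptions and may differ from the test described in the cited reference: neighbors are spots sharing an edge, all weights equal 1, and the p-value is one-sided (positive spatial correlation) and obtained by shuffling spot positions. The function names are hypothetical, and the grid must not be constant.

```python
import random

def morans_i(grid):
    """Moran's I with edge-sharing neighbours and unit weights."""
    n_rows, n_cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    mean = sum(values) / len(values)
    dev = [[v - mean for v in row] for row in grid]
    denom = sum(d * d for row in dev for d in row)
    cross, weight_sum = 0.0, 0
    for i in range(n_rows):
        for j in range(n_cols):
            for di, dj in ((0, 1), (1, 0)):   # right and lower neighbour
                ni, nj = i + di, j + dj
                if ni < n_rows and nj < n_cols:
                    # symmetric weights: count the pair in both directions
                    cross += 2 * dev[i][j] * dev[ni][nj]
                    weight_sum += 2
    return (len(values) / weight_sum) * (cross / denom)

def permutation_p_value(grid, n_perm=999, seed=0):
    """One-sided permutation p-value for positive spatial correlation."""
    rng = random.Random(seed)
    observed = morans_i(grid)
    n_cols = len(grid[0])
    flat = [v for row in grid for v in row]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(flat)
        shuffled = [flat[k:k + n_cols] for k in range(0, len(flat), n_cols)]
        if morans_i(shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy slide: upper half high, lower half low
toy = [[1.0] * 6 for _ in range(3)] + [[0.0] * 6 for _ in range(3)]
stat, p_value = permutation_p_value(toy)
```

A value of I close to 0 is expected when there is no spatial structure; the toy slide gives a clearly positive statistic.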
We have proved that the spotting effect is statistically significant, is as important as other effects that are commonly corrected, and should be taken into account in normalization procedures. This effect is expected to increase the number of false positives and negatives in classical microarray studies. In differential analysis, some rows or columns may contain artificially high or low numbers of "differentially expressed genes". In clustering analysis, genes may be associated because of a similarity caused by the spotting effect.
Description of the first test slide
The Arabidopsis glass slide (Corning GAP II) studied in the Results section was hybridized with Cy3 and Cy5 labeled cDNA samples, both prepared from the same mRNA extracted from Arabidopsis flower buds (self-hybridization). The microarray structure consists of 4 × 12 blocks, each with 20 rows and 21 columns. cDNA sequences were spotted in duplicate, i.e. rows N and N+10 (for N = 1 to 10) in the same block were printed with the same series of amplicons. The robot printing head consisted of 48 (4 × 12) print-tips, each defining a block. During a single printing step, the robot printed 48 spots on a slide (each corresponding to a different DNA feature), separated by 20 rows in one direction and 21 columns in the other (the distance between print-tips); then, within a fraction of a second, the robot arm moved laterally and printed the duplicate spots 10 rows away before moving to the next slide. Once all slides were spotted with a given set of 48 duplicated amplicons, the robot washed all print-tips simultaneously, loaded them with the next set of 48 amplicons, and resumed printing. Each set was printed on all slides in approximately 2 min and the entire procedure lasted 16 h for the 10000 duplicated cDNAs.
The structure of the spatial distribution of the signal on a slide can be studied with a geostatistical tool called a variogram ([13, 14]). In geostatistics, the variogram has been used to detect departures from stationarity in the data. In the microarray data analysis context, it is a useful exploratory tool to study spatial correlations due to systematic biases. A variogram (also called semi-variogram) is defined (2), and estimated (3), for a distance d and a variable Y, as follows:
\[ V(d) = \frac{1}{2}\,\mathrm{E}\left[(Y_{s_i} - Y_{s_j})^2\right] \quad \text{for spots } s_i, s_j \text{ separated by a distance } d \tag{2} \]
\[ \hat{V}(d) = \frac{1}{2\,|N(d)|} \sum_{(s_i, s_j) \in N(d)} (Y_{s_i} - Y_{s_j})^2 \tag{3} \]
where N(d) is the set of all possible pairs of spots (s_i, s_j) separated by a distance d, and |N(d)| is the cardinality of N(d). As implied by expression (2), V(d) decreases when the number of similar points separated by the distance d increases.
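A minimal Python sketch of the variogram estimate is given below. For simplicity, it assumes that the distance d is a lag in rows, so N(d) contains the pairs of spots in the same column whose row indices differ by d; this restriction and the name `row_lag_variogram` are illustrative choices, not necessarily those used for the analyses reported here.

```python
def row_lag_variogram(grid, max_lag):
    """Empirical semi-variogram for row lags d = 1..max_lag.

    grid: list of slide rows, each a list of signals.
    For each lag d, averages (Y_si - Y_sj)^2 / 2 over all pairs of spots
    in the same column that are d rows apart.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    result = {}
    for d in range(1, max_lag + 1):
        total, count = 0.0, 0
        for i in range(n_rows - d):
            for j in range(n_cols):
                diff = grid[i][j] - grid[i + d][j]
                total += diff * diff
                count += 1
        if count:                      # skip lags with no pair of spots
            result[d] = total / (2 * count)
    return result

# Toy slide with a row pattern of period 2
toy = [[float(i % 2)] * 3 for i in range(6)]
variogram = row_lag_variogram(toy, max_lag=4)
```

On this toy slide the estimate drops to 0 at lags 2 and 4 and equals 0.5 at lags 1 and 3, which is the kind of periodic dip that reveals rows sharing a common bias.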
E.C. was supported by a fellowship from Agence Nationale de Recherches sur le SIDA.
1. Yang Y, Dudoit S, Luu P, Speed T: Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 2002, 30(4):e15. doi:10.1093/nar/30.4.e15
2. Quackenbush J: Microarray data normalization and transformation. Nature Genet 2002, 32: 496–501. doi:10.1038/ng1032
3. Schuchhardt J, Beule D, Malik A, Wolski E, Eickhoff H, Lehrach H, Herzel H: Normalization strategies for cDNA microarrays. Nucleic Acids Res 2000, 28: e47. doi:10.1093/nar/28.10.e47
4. Workman C, Jensen L, Jarmer H, Berka R, Gautier L, Nielser H, Saxild H, Nielsen C, Brunak S, Knudsen S: A new non-linear normalization method for reducing variability in DNA microarray experiments. Genome Biol 2002, 3(9):1–16. doi:10.1186/gb-2002-3-9-research0048
5. Lieb J, Liu X, Botstein D, Brown P: Promoter-specific binding of Rap1 revealed by genome-wide maps of protein-DNA association. Nature Genet 2001, 28(4):327–34. doi:10.1038/ng569
6. Zhu G, Spellman P, Volpe T, Brown P, Botstein D, Davis T, Futcher B: Two yeast forkhead genes regulate the cell cycle and pseudohyphal growth. Nature 2000, 406(6791):90–4. doi:10.1038/35017581
7. Searle S: Linear Models. New York: John Wiley & Sons, Inc; 1971.
8. Ball C, Chen Y, Panavally S, Sherlock G, Speed T, Spellman P, Yang Y: Section 7: An introduction to microarray bioinformatics. In DNA Microarrays: A Molecular Cloning Manual. Cold Spring Harbor Press; 2003.
9. Cohen B, Mitra R, Hughes J, Church G: A computational analysis of whole-genome expression data reveals chromosomal domains of gene expression. Nature Genet 2000, 26: 183–186. doi:10.1038/79896
10. Spellman P, Rubin G: Evidence for large domains of similarly expressed genes in the Drosophila genome. J Biol 2002, 1: 1–5. doi:10.1186/1475-4924-1-5
11. Balazsi G, Kay K, Barabasi A, Oltvai Z: Spurious spatial periodicity of co-expression in microarray data due to printing design. Nucleic Acids Res 2003, 31: 4425–4433. doi:10.1093/nar/gkg485
12. Banerjee S, Carlin B, Gelfand A: Hierarchical Modeling and Analysis for Spatial Data. Monographs on Statistics and Applied Probability. Chapman and Hall/CRC Press; 2004.
13. Jowett G: The accuracy of systematic sampling from conveyor belts. Applied Statistics 1952, 1: 50–59.
14. Cressie N: Statistics for spatial data. Wiley series in probability. Wiley; 1997.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.