Assessing probe-specific dye and slide biases in two-color microarray data
© Lu et al; licensee BioMed Central Ltd. 2008
Received: 13 December 2007
Accepted: 19 July 2008
Published: 19 July 2008
A primary reason for using two-color microarrays is that hybridizing two samples, labeled with different dyes, to the same probes on the same slide is supposed to adjust for many factors that introduce noise and errors into the analysis. Most users assume that any differences between the dyes can be adjusted out by standard normalization methods, so that measures such as log ratios on the same slide are reliable measures of comparative expression. However, even after normalization, probe-specific dye and slide variation remains in the data. We define a method to quantify the amount of dye-by-probe and slide-by-probe interaction. This serves as a diagnostic, both visual and numeric, of the existence of probe-specific dye bias. We show how this improved the performance of two-color array analysis for genomic arrays of biological samples ranging from rice to human tissue.
We develop a procedure for quantifying the extent of probe-specific dye and slide bias in two-color microarrays. The primary output is a graphical diagnostic of the extent of the bias, based on the empirical cumulative distribution function (ECDF), though numerical results are also obtained.
We show that the dye and slide biases were high for human and rice genomic arrays in two gene expression facilities, even after standard intensity-based normalization. We also describe how this diagnostic allowed the problems causing the probe-specific bias to be addressed, resulting in important improvements in performance. The R package LMGene, which contains the method described in this paper, is available for download from Bioconductor.
One of the major tasks in the analysis of high-dimensional biological assay data such as gene expression arrays is to detect differential expression in a comparative experiment. Using two-color microarrays is supposed to adjust for the noise introduced by many factors on the same slide, including spot size and conformation. Standard data pre-processing methods for two-color data include normalization of the differences between the two dye channels, after which most users believe that the normalized measurements are relatively free of dye bias. However, probe-specific dye bias and slide bias can be high even after standard normalization, which may cause problems when one expects to identify many genes that are significantly differentially expressed.
This dye bias has received some recent attention [1–8]. These papers generally provide computational methods to detect and correct for dye bias, at least in some circumstances. Correction can include use of gene-specific dye bias terms in an ANOVA, for example. Even when this is done, dye bias may still cause significant harm by introducing large amounts of noise that prevent identification of significantly differentially expressed genes. We present a graphical method of assessing this problem that can be used for process improvement and to compare array platforms.
Standard normalization methods are based on the entire set of probe intensities of the arrays, while the conclusions of comparative experiments are drawn for specific probes. One common approach to the analysis is the gene-by-gene linear model, which is fitted to the normalized log or glog intensity data separately for each probe. In the routine gene-by-gene linear model, the mean square (MS) of each factor measures that factor's contribution to the variance and is also the basis of the F-statistic for testing the factor's effect. So, for each probe, the relative sizes of the mean squares can serve as comparative measures of the contributions of the specific factors to the overall variation.
For the standard F statistic, we consider the ratios of each mean square to an appropriate error term, which is usually also a mean square. We propose instead as a diagnostic to consider the ratio of each mean square to the sum of all the mean squares, so that we obtain for each gene a set of mean-square ratios that sum to 1, which are thus free of scaling specific to a given probe. To assess the overall magnitudes of these quantities, we plot the empirical cumulative distribution functions (ECDF) of the variability proportion of each factor across the whole set of probes in a single plot, serving as the diagnostic graphic tool for showing the relative magnitude of the probe specific dye-bias after normalization. Since the linear model is on a probe-by-probe basis, the dye bias we are measuring is in fact the dye-by-probe interaction. Similarly, including slide as one of the factors could also provide an assessment of the relative size of the slide-by-probe interaction effect. The lower a line is in the plot, the larger the effect's mean square is stochastically across probes.
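The mean-square proportions and their ECDFs can be sketched in a few lines. This is a minimal illustration and not the LMGene implementation. The function names and the mean-square values are hypothetical, and it assumes that the residual mean square is included in the sum used as the denominator.

```python
def mean_square_proportions(mean_squares):
    """Divide each factor's mean square by the sum over all factors.

    mean_squares: dict mapping factor name -> mean square for one probe.
    The returned proportions sum to 1 for that probe.
    """
    total = sum(mean_squares.values())
    return {name: ms / total for name, ms in mean_squares.items()}


def ecdf(values):
    """Return the ECDF as a list of (x, F(x)) points, sorted by x."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


# Hypothetical mean squares for four probes (illustrative numbers only).
probes = [
    {"dye": 4.0, "slide": 2.0, "treat": 1.0, "residual": 1.0},
    {"dye": 0.5, "slide": 0.5, "treat": 3.0, "residual": 1.0},
    {"dye": 6.0, "slide": 1.0, "treat": 2.0, "residual": 1.0},
    {"dye": 1.0, "slide": 1.0, "treat": 1.0, "residual": 1.0},
]

props = [mean_square_proportions(p) for p in probes]

# One ECDF curve per factor, computed across all probes.
curves = {name: ecdf([p[name] for p in props]) for name in probes[0]}
```

Each entry of `curves` could then be drawn as a step function on shared axes. A curve that lies lower in the plot corresponds to a factor whose proportions tend to be larger across probes.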
Results and methods
In most of the cases shown in this paper, the linear model, which includes the factors of interest (dye, slide, treatment, and sample replicate), can be written as:

$$y_{ijkl} = \alpha + \mathrm{dye}_i + \mathrm{slide}_j + \mathrm{treat}_k + \mathrm{sample}_l + \epsilon_{ijkl},$$

where the index i refers to the different channels (dyes), the index j to arrays, the index k to treatment levels, and the index l to the sample replicates within each treatment level.
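As a rough illustration of how per-probe mean squares arise from such a model, the sketch below computes main-effect mean squares for a single probe. It rests on strong simplifying assumptions: the design is balanced with mutually orthogonal factors, so each sum of squares can be read off the level means, and the sample-replicate term is omitted. The function name, the dye-swap layout, and the intensities are all hypothetical, and a general design would need a proper least-squares fit.

```python
from statistics import mean


def main_effect_mean_squares(y, factors):
    """Mean squares for main effects in a balanced, orthogonal design.

    y: intensities for one probe, one value per (slide, channel).
    factors: dict mapping factor name -> list of level labels aligned with y.
    """
    grand = mean(y)
    resid_ss = sum((v - grand) ** 2 for v in y)  # start from the total SS
    resid_df = len(y) - 1
    result = {}
    for name, labels in factors.items():
        levels = sorted(set(labels))
        ss = 0.0
        for level in levels:
            group = [v for v, lab in zip(y, labels) if lab == level]
            ss += len(group) * (mean(group) - grand) ** 2
        df = len(levels) - 1
        result[name] = ss / df
        resid_ss -= ss  # what is left after all factors is the residual SS
        resid_df -= df
    result["residual"] = resid_ss / resid_df
    return result


# Hypothetical dye-swap layout: 4 slides x 2 channels, treatments A and B.
design = {
    "dye":   ["Cy3", "Cy5", "Cy3", "Cy5", "Cy3", "Cy5", "Cy3", "Cy5"],
    "slide": [1, 1, 2, 2, 3, 3, 4, 4],
    "treat": ["A", "B", "B", "A", "A", "B", "B", "A"],
}
# Noise-free illustrative intensities: baseline 10, Cy5 adds 1, treatment B adds 2.
y = [10.0, 13.0, 12.0, 11.0, 10.0, 13.0, 12.0, 11.0]

ms = main_effect_mean_squares(y, design)
```

For these made-up values the mean squares are 2.0 for dye, 0.0 for slide, 8.0 for treatment, and 0.0 for the residual, so the dye term accounts for a fifth of the summed mean squares for this probe.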
Consider as a first example an experiment conducted on slides spotted and hybridized at a UC Davis array facility. The experimental objective was to study the effects of oxygen concentration on gene expression before and at confluence in human keratinocyte cell cultures. Three different oxygen concentrations were used, with two replicates in each condition. For each condition, the labeled sample was hybridized with a common reference. In each case, the sample was labeled with Cy3 and the reference with Cy5 in one replicate, and the reverse labeling was used in the other replicate.
Average Mean Squares from ANOVA for Oxygen Experiment
A second example shows a comparison of the analysis of the same RNA on two different two-color array platforms. The samples were from human skin biopsies exposed in vivo to controlled radiation doses incidental to radiation therapy, but with accurate dosimetry [13, 14]. Patients were treated in a standard fashion for their localized prostate cancer, and the areas of their abdominal wall skin that would receive 1, 10, and 100 cGy of radiation exposure, respectively, were marked at the time of the patient's first radiation treatment. Prior to any radiation therapy, patients had a control biopsy, at 0 dose. In this component of the study, there were 8 patients, and the data for the array comparison are the four samples from patient 5. The samples were run on two different array platforms: arrays spotted by a UC Davis array facility and Agilent Human Whole Genome arrays run by Icoria's Paradigm Array Labs. The model used had log intensity as a linear function either of dose or of modified log dose (mld), which was -1, 0, 1, and 2 for the doses 0, 1, 10, and 100 cGy. This is log10 dose, except that the 0 dose is treated as if it were 0.1 cGy.
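The modified log dose can be written as a one-line transformation. The sketch below is only an illustration of the definition given in the text; the function name and the `zero_as` parameter are hypothetical.

```python
import math


def modified_log_dose(dose_cgy, zero_as=0.1):
    """log10 of the dose in cGy, treating a dose of 0 as `zero_as` cGy."""
    dose = zero_as if dose_cgy == 0 else dose_cgy
    return math.log10(dose)


doses = [0, 1, 10, 100]
mld = [modified_log_dose(d) for d in doses]
```

Up to floating-point rounding, `mld` holds the values -1, 0, 1, and 2 stated in the text.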
Experimental Design for IR Study with UC Davis Arrays
Experimental Design for IR Study with Agilent Arrays
Some care must be taken in the analysis of these data. Unlike a reference design study, we are not analyzing the log ratios. Instead, we analyze the separate values for each gene on each array and each dye channel. There is only one biological sample for each dose, and the variation between replicate measurements of the same RNA is not an appropriate denominator for a test of significance of the regression. We could specify this as a mixed model, with separate random effects for the sample (with 4 levels) and replicates within sample, but fitting such models by maximum likelihood results in many estimation failures using standard software, because the model must be fitted separately for each of thousands of genes. Instead, we first fit a model with dose or mld as a quantitative variable, then fit another model with dose or mld as a factor. We then use the decrease in the residual sum of squares from the first model to the second. We quantify the within-sample variation in this alternative way, which gives us two degrees of freedom.
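The two-model comparison can be sketched for a single gene as follows. This is a simplified illustration, not the authors' code. It assumes that mld is the only predictor (the dye and slide terms of the full model are left out), and the function names and intensities are hypothetical. With four dose levels, the straight-line model uses two parameters and the factor model uses four, which is where the two degrees of freedom come from.

```python
def rss_linear(x, y):
    """Residual sum of squares from a straight-line fit of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))


def rss_factor(x, y):
    """Residual sum of squares when x is a factor (one mean per level)."""
    groups = {}
    for a, b in zip(x, y):
        groups.setdefault(a, []).append(b)
    rss = 0.0
    for values in groups.values():
        m = sum(values) / len(values)
        rss += sum((v - m) ** 2 for v in values)
    return rss


def rss_decrease_mean_square(x, y):
    """Decrease in RSS from the linear to the factor model, per degree of freedom."""
    df = len(set(x)) - 2  # levels minus the two parameters of the line
    return (rss_linear(x, y) - rss_factor(x, y)) / df, df


# Hypothetical log intensities for one gene, two replicate measurements per dose.
mld = [-1, -1, 0, 0, 1, 1, 2, 2]
y = [8.0, 8.2, 9.1, 8.9, 9.0, 9.2, 10.1, 9.9]

mean_square, df = rss_decrease_mean_square(mld, y)
```

The resulting mean square measures how far the per-dose means depart from the fitted line, and it does not use the spread between replicate measurements of the same RNA.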
Significant Genes from Two Platforms
Most microarray users assume that any differences between the dyes that may cause problems in an analysis can be handled by standard methods of normalization. However, probe-specific dye bias and slide bias remain in two-color microarray data after normalization. We developed a procedure for quantifying their extent. The primary graphical diagnostic showed that probe-specific dye and slide biases exist and can be quite large after normalization in arrays from rice and humans, in two facilities. This tool guided improvements in the array facility at UC Davis that essentially eliminated the problematic dye bias behavior.
This work was supported by grants from the National Human Genome Research Institute R01-HG003352 (RL, DMR), National Institute of Environmental Health Sciences P42-ES04699 (RL, RHR, DMR), National Cancer Institute P30 CA093373-04 (DMR), Air Force Office of Scientific Research FA9550-06-1-0132 and FA9550-07-1-0146 (RL, ZG, DMR), U.S. Department of Energy Office of Biological and Environmental Research, DE-FG03-01ER63237 (ZG), Campus Laboratory Collaboration Grant from UCOP (ZG), UC Davis Health Systems (DMR). The authors would also like to thank John Tillinghast, Zhi-Wei Lu and Wei Jiang for software support.
- Dobbin KK, Shih JH, Simon RM: Comment on 'Evaluation of the gene-specific dye bias in cDNA microarray experiments'. Bioinformatics 2005, 21(12):2803–2804. doi:10.1093/bioinformatics/bti428
- Dobbin KK, Kawasaki ES, Petersen DW, Simon RM: Characterizing dye bias in microarray experiments. Bioinformatics 2005, 21(10):2430–2437. doi:10.1093/bioinformatics/bti378
- Dombkowski A, Thibodeau B, Starcevic S, Novak R: Gene-specific dye bias in microarray reference designs. FEBS Letters 2004.
- Martin-Magniette ML, Aubert J, Cabannes E, Daudin JJ: Answer to the comments of K. Dobbin, J. Shih and R. Simon on the paper 'Evaluation of the gene-specific dye-bias in cDNA microarray experiments'. Bioinformatics 2005, 21(14):3065. doi:10.1093/bioinformatics/bti479
- Martin-Magniette ML, Aubert J, Cabannes E, Daudin JJ: Evaluation of the gene-specific dye bias in cDNA microarray experiments. Bioinformatics 2005, 21(9):1995–2000. doi:10.1093/bioinformatics/bti302
- Rosenzweig B, Pine P, Domon O, Morris S, Chen J, Sistare F: Dye-bias correction in dual-labeled cDNA microarray gene expression measurements. Environmental Health Perspectives 2004, 112(4):480–487.
- Dabney AR, Storey JD: A new approach to intensity-dependent normalization of two-channel microarrays. Biostatistics 2007, 8(1):128–139. doi:10.1093/biostatistics/kxj038
- Soo KB, Soo-Jin K, Saet-Byul L, Won H, Kun-Soo K: Simple method to correct gene-specific dye bias from partial dye swap information of a DNA microarray experiment. Journal of Microbiology and Biotechnology 2005, 15(6):1377–1383.
- Rocke DM: Design and Analysis of Experiments with High Throughput Biological Assay Data. Semin Cell Dev Biol 2004, 15(6):703–713.
- Kerr MK, Martin M, Churchill GA: Analysis of Variance for gene expression microarray data. Journal of Computational Biology 2000, 7: 819–837. doi:10.1089/10665270050514954
- Yang YH, Dudoit S, Luu P, Speed T: Normalization for cDNA microarray data. Microarrays: Optical Technologies and Informatics, 4266 of Proceedings of SPIE 2001.
- Smyth GK, Yang YH, Speed T: Statistical issues in microarray data analysis. Methods Mol Biol 2003, 224: 111–136.
- Goldberg Z, Rocke DM, Schwietert C, Berglund SR, Santana A, Jones A, Lehmann J, Stern R, Lu R, Siantar CH: Human in vivo dose response to controlled, low dose low LET ionizing radiation exposure. Clinical Cancer Research 2006, 12: 3723–3729. doi:10.1158/1078-0432.CCR-05-2625
- Lehmann J, Stern RL, Daly TP, Rocke DM, Schwietert CW, Jones GE, Arnold ML, Siantar CLH, Goldberg Z: Dosimetry for Quantitative Analysis of the Effects of Low-Dose Ionizing Radiation in Radiation Therapy Patients. Radiation Research 2006, 165: 240–247. doi:10.1667/RR3480.1
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.