Simulation of DNA array hybridization experiments and evaluation of critical parameters during subsequent image and data analysis
© Wierling et al; licensee BioMed Central Ltd. 2002
Received: 18 June 2002
Accepted: 22 October 2002
Published: 22 October 2002
Gene expression analyses based on complex hybridization measurements have increased rapidly in recent years and have given rise to a huge amount of bioinformatic tools such as image analyses and cluster analyses. However, the amount of work done to integrate and evaluate these tools and the corresponding experimental procedures is not high. Although complex hybridization experiments are based on a data production pipeline that incorporates a significant amount of error parameters, the evaluation of these parameters has not been studied yet in sufficient detail.
In this paper we present simulation studies on several error parameters arising in complex hybridization experiments. A general tool was developed that allows the design of exactly defined hybridization data incorporating, for example, variations of spot shapes, spot positions and local and global background noise. The simulation environment was used to judge the influence of these parameters on subsequent data analysis, for example image analysis and the detection of differentially expressed genes. As a guide for simulating expression data real experimental data were used and model parameters were adapted to these data. Our results show how measurement error can be balanced by the analysis tools.
We describe an implemented model for the simulation of DNA-array experiments. This tool was used to judge the influence of critical parameters on the subsequent image analysis and differential expression analysis. Furthermore the tool can be used to guide future experiments and to improve performance by better experimental design. Series of simulated images varying specific parameters can be downloaded from our web-site: http://www.molgen.mpg.de/~lh_bioinf/projects/simulation/biotech/
DNA-array technology is nowadays frequently used for the generation of genome-wide gene expression profiles (see The chipping forecast, Nature Genetics Suppl. 21, 1999 for a review). The technology is based on the hybridization of labeled ssDNA to its complementary strand called probe. Different probes are fixed as spots on planar surfaces, like glass slides or nylon filters. The arrays are scanned and hybridization signals of the spots are quantified by suitable image analysis software. To gain further biological relevant information complex hybridizations from parallel experiments with different target samples as well as experimental repetitions are carried out. Further data evaluation of these hybridization signals by statistical tests and clustering algorithms yields information about differentially expressed or coregulated genes.
The reliability of data produced by these experiments and their reproducibility are crucial for this research. To ensure both reliability and reproducibility a sophisticated experimental design is necessary. This includes for example the identification of error parameters that affect the hybridization data during the data generation process. Influences of systematic and statistical errors due to biotechnical methods (for example mRNA preparation, PCR, hybridization), as well as due to devices and array-media (for example robots, filters, glass-slides) and their effects on evaluation software and algorithms (image analysis, statistical tests, clustering algorithms) must be estimated. These sources of error are frequently discussed in the context of callibration and normalization of microarray data (e.g. [2, 4, 6, 9]). Here we present a computer simulation, that takes into account several sources of error. It enables scientists to judge which parameters are critical and how the experimental design or data evaluation might be improved.
On the other hand creating simulated data without practical consideration is less helpful because it might lead to artificial data sets that estimate and quantify parameters that are not relevant for the analysis of hybridization data. Thus, data should be adapted and linked to real experiments.
Our tool is designed for that purpose. Hybridization signal intensities taken from experimental data are the input; these data were derived as mean values from six filters each of which spotted with the same set of 14208 zebrafish cDNA clones and hybridized independently with the same complex target of an mRNA pool from zebrafish gastrula stage embryos. The output are series of filter images containing well-defined error parameters. In each series only a single parameter was varied at once in order to measure its effects on data analysis. The range of parameter variation was adapted to real experiments (experimental reference).
After creating the simulated data the effect of the error parameters were measured on the subsequent data analysis pipeline. We highlight two modules of this pipeline: Image analysis and statistical analysis of differentially expressed genes, although the simulation tool is not restricted to these applications. We chose image analysis because it is the first module of the data analysis and builds the basis for all further research and statistical analysis of differentially expressed genes because it is one of the most utilized applications of gene arrays.
The images were analyzed with three different image processing programs. Parameters that are judged in this paper are variations of the spot positions caused by different experimental artifacts and different sources of background noise. For gene expression profiling twelve filters with varying local background and experimentally determined signal variations were simulated, six of them correspond to hybridizations with a treatment and six of them correspond to hybridizations with a complex control target. We analyzed how many experimental repetitions are necessary to detect a given level of differential expression. Here, the significance of the differential expression was judged by P values computed by the Welch t-test (cf. ).
Our results show that the simulation tool is a valuable resource for the identification and the rating of sources of error arising in hybridization experiments. The simulated sets can be used as benchmark tests for new data analysis modules such as image analyses coming up in the course of gene expression data analysis.
Implementation of the simulation tool
Definition, modelling and critical effects of simulation parameters.
spot shift (Gaussian distribution)
SD from ideal position
SD > 0.15–0.2 mm 16.7–22.2%(2)
block shift (Gaussian distribution)
SD from ideal position
SD > 0.12–0.167 mm 13.3–18.6 %(2,3)
a) two-dimensional Gaussian distribution
a) no variation (fixed SD = 0.1482 mm)
b) Crater spot distribution
b) radius of crater
b) radius > 0.1995 mm 22.2 %(2,4,5)
c) Plateau spot distribution
c) no variation (fixed radius of cylindric plateau spot = 0.342 mm)
additive signal from a Gaussian distribution
fixed mean/SD derived from experimental data
additive signal from fractal clouds
mean signal/background ratio < 25
The quality of an expression analysis strongly depends on the distribution of the signal intensities and the spot positions on the filter (e.g. outshining effects). To have a realistic situation results of real experiments were used as input data for the construction of the artificial data and the statistical expression analysis.
Experimental macroarray data
A detailed description of the cDNA clone array design, mRNA labeling, hybridization and data capture is given in . PCR products of 14208 zebrafish cDNA clones of a representative library from gastrula stage embryos  and 2304 copies of an Arabidopsis thaliana cDNA clone were spotted on nylon filter membranes. Clones were spotted in a rectangular grid of blocks with 25 spots (5 × 5) per block by the use of a gadget with 16 × 24 pins corresponding to a 384-well microtiter plate. Figure 1B illustrates the filter design. Due to the experimental procedure a filter is divided into six fields of 384 blocks each. For the 5 × 5 spotting pattern each block contains 25 spots. The zebrafish target derived from mRNA of gastrula stage embryos (6 hours post fertilization) was hybridized to six filter replicates which were spotted with the same set of clones. Each clone was spotted twice in the same block (duplicate) to improve reproducibility.
Design of artificial sample sets
In order to detect differentially expressed genes the cDNA clone array is hybridized in real experiments with two mRNA targets of different origin: one target commonly refers to a reference tissue (control), the second target refers to a certain chemical treatment, a mutant or a disease (treatment).
Generation of signal intensities
Schuchhardt et al. have shown that a strong correlation exists for spot intensities spotted by the same pin. Spots in the same block are spotted by the same pin. Clones that are spotted in different blocks are spotted by different pins. Thus the amount of material that is transfered to the array varies from pin to pin, and this relative pin specific variation can be described for the 384 pins of a gadget by the following pin distribution P(Y):
P(Y) = N (1, ); σ1 = 0.43 (1)
Here N(1, ) denotes a Gaussian normal distribution with mean 1 and variance . The standard deviation, σ1, was derived from experimental data. Clones with identical 384-well microtiter plate positions are spotted by the same pin. In the experimental reference A. thaliana cDNA of identical amplicons were spotted in each block as a control. Based on this information the mean CV over all pins was calculated and used as σ1.
On one filter the signal distribution P(X ij ) of replicates is defined as follows:
P(X ij ) = N (y i ·z j , (y i ·z j ·σ2)2); σ2 = 0.2 (2)
with i ∈ ; i = [1, w]
j ∈ ; j = [1, m]
z j is the mean signal for clone j taken from the median signal distribution of experimental data (cf. Figure 2), y i denotes the pin dependent factor for pin i derived from the distribution, P(Y). For the simulations presented in this paper the number of pins is w = 384 and the number of clones is m = 14208. Using the duplicate correlation (0.8) of the constant experimental A. thaliana clone signals and σ1 one can calculate σ2 = 0.2, because they are associated with each other (proof is not shown). Thus σ2 is the CV for identical PCR-products that were spotted by the same pin.
The simulated images are generated by an intensity function, which yields for each pixel k an intensity value. The presented model is based on empirical assumptions. It is given by a continuous function of the position r on the filter, I(r), as follows:
where A j is the given spot intensity, g is a function that describes the local and global background, ε denotes a stochastic perturbation, and |r-r j | is the Euclidean distance to the center of spot j. The nine spot centers closest to r are considered, due to the fact, that the pixelized spot shape is given by a square 19 × 19 pixel matrix and the usual distance between two spot centers is 7.89 pixel for the image resolution used in this paper (0.114 mm/pixel).
Here f(|r-r j |) is a spot shape distribution which describes the spot shape (see below).
with N = 16 for a 16 bit image and r i is the center of the pixel k. The square brackets denotes the integer function, that returns the largest integer less than or equal to the value in brackets.
The spot intensities A j are taken from a real experiment (see above, intensity distribution see Fig. 2).
To determine the location r j of the spots we assume that the probes are spotted approximately in an orthogonal grid.
Local distortions of the spots are considered. Due to the experimental procedure two different spot distortions are introduced: spot shifting and pin shifting. Both of them are modeled by randomly Gaussian distributed shifting of the spot-centers relative to their theoretical spot-centers. For spot shifting the distortions are independent for each spot; for pin shifting they are equal for all spots of one block of 5 × 5 spots, because they were spotted by the same pin.
Due to the experimental procedure of the array preparation, the array surface type, and the nature of the fixed DNA material, the spot shapes are different. Here we introduced three distribution models of spot shapes that are based on experimental evidence:
(a) a normalized two-dimensional Gaussian distribution with a given SD (σ):
(b) a normalized two-dimensional Gaussian distribution with a given SD (σ1) of which another concentric Gaussian-distribution (SD = σ2) with a scaling-factor S = (0,1) is subtracted. The resulting spot resembles a crater like spot shape.
These spot models were used because they are commonly observable with spotted array data on nylon and glass supports respectively and are frequently assumed as quantification models by image analysis programs. More irregular spot shapes that do not have a common spot distribution can also be observed (e.g. ), but are not considered for this paper.
Two different sources of background noise can be distinguished: a global background due to the scanner noise or filter surface and a local background due to inhomogeneous hybridization to the filter that looks like smear.
Global background noise
The global background is described by a randomly Gaussian distributed noise that is equal for the whole filter. It can be varied by its mean and SD.
Local background noise
As a model for the local background fractal clouds as described in  are used. They are generated with the midpoint displacement method with a fractal dimension of 0.4 and then scaled to a given minimum/maximum-range, which defines the intensity level of this background. The model was chosen for local background, because the intensity level of a given pixel depends on its neighbors. This results in images that look quite the same as the background of experimental images. By the use of a pseudo random number generator reproducible fractals were created.
Data evaluation and quality measurement
To illustrate the power of using simulated data for the judgment of image analysis software, we used the following programs: (1.) FA: which was developed at the Max-Planck-Institute for Molecular Genetics . It is fully automated – no manual effort for the positioning of the grid is necessary, (2.) AIDA: Raytest, Germany http://www.raytest.de, which needs some manual interaction for the positioning of the grid, (3.) Visual Grid: GPC Biotech, Germany http://www.gpc-ag.com, for which the whole grid has to be adapted manually. These programs have been chosen, because they are frequently used at our institute and have already been utilized for massive image analysis (FA ; Visual Grid ). Furthermore, they are representative for the different levels of automation of image analyses.
Evaluation of gridfind and quantification quality
The following two steps are essential for the analysis of hybridization images: gridfind and quantification. First the gridfind has to locate the exact positions of the spots and then the signal intensities are assigned to each spot by the quantification. For instance the image analysis FA does a Gaussian spot shape fit for quantification . The performance of the different image analysis programs are tested by the following quality parameters:
The mean distance between simulated and calculated spot centers.
The Pearson correlation between simulated and calculated intensities.
The first parameter measures the quality of the gridfind. The second is a measure for the quality of the whole image processing.
Statistical evaluation of differential expression
For testing statistical significance of differential expression we calculated P values according to the Welch test . This test is an unpaired t-test. It assumes that the two samples ("treatment" and "control") are distributed according to Gaussian distributions with means, μtreatment and μcontrol respectively, and judges the hypothesis if μtreatment = μcontrol. Here, in contrast to Student's t-test, it is not assumed that both sample distributions have the same variance. The test statistic, T, has the form
The quality of an expression profile analysis based on array data is highly dependent on the number of repeated sample measurements, and of the array preparation, hybridization and signal quantification procedure. The latter can be increased either by improved array preparation and hybridization or better algorithms of the image analysis software that can handle preparation errors. The improvement of both methods is limited. Major critical parameters are local distortions of the spots, variations of the spot shape and outshining effects due to neighbor spots or massive background noise. These parameters have been analyzed in this paper (see Table 1). In the following we simulated series of images by changing only one parameter at once.
Spot shifting was simulated with SDs between 0 and 0.342 mm from its ideal positions (Fig. 3). The mean distance between adjacent spot centers was 0.9 mm. For the three image analysis programs which were under investigation this parameter became critical (correlation < 0.95) for the quantification, which is also influenced by the quality of the gridfind, for SDs in the range of 0.15–0.2 mm (Fig. 3B). This is about a fifth of the distance of adjacent spot centers.
In figure 3C we focused only on the quality of the gridfind. The error given by the mean distance of the calculated spot center after image analysis from its simulated center is relatively linear to its perturbation for all tested image analysis programs. The low quality for Aida for small perturbations is due to a missing sub-pixel precision. This means, that if e.g. the simulated spot center is not identical with the center of a pixel, the output-result from Aida lacks this sub-pixel precision.
The error due to pin variations is a systematic error for all spots in the same block, because they were spotted by the same pin (Fig. 4A). Perturbations with SDs between 0 and 0.2 mm were simulated. This error became critical (correlation < 0.95) for SDs of the pin shifting greater than 0.12 mm for Visual Grid and greater than 0.167 mm for FA. The error of the gridfind was linear to its perturbation (Fig. 4C). Here again the low quality for Aida for small perturbations is due to the missing sub-pixel precision.
In the following all images have constant Gaussian spot-shapes and all spot centers are located at the ideal grid nodes. Thus the gridfind has only to cope with the background noise.
Global background noise
From the border (non-spotted) area of an experimental filter image with a 16 bit depth the noise level was found to be about 16000 with a standard deviation of about 4000; the distribution is similar to Gaussian (data not shown).
Local background noise
Influence of background noise on the expression test
We tested the influence of local background noise on the quality of the expression analysis in dependence of the number of repetitions. The significance of differentially expressed genes was judged by the use of the Welch test as described in .
Figure 9E shows a comparison of the CVs for sample size 12 of the input data signals and the signals quantified by the three different image processing programs. The medians of the CVs are increasing in the following order: input data (0.19), FA (0.21), AIDA (0.29), Visual Grid (0.34). This result shows that data reproducibility increases with the level of automation of the image analysis programs.
Discussion and Conclusions
In this paper we presented a simulation for complex hybridization experiments. This was used to judge critical experimental parameters in the light of the following data analysis. We studied critical parameter of the image analysis by the use of three different image analysis programs representing different levels of automation of the gridfinding and signal quantification. We showed that local distortions of the spot centers like non systematic spot shifting as well as systematic errors resulting in block shifting due to pin errors did not become critical for the reference experiments with the image analysis programs. Also global background noise did not become critical for the experiments studied in this paper. Local background noise might become critical for filter experiments in some cases. Here we showed by the use of fractal clouds as background – which looks very similar to the smear in real experiments – that a mean signal/background ratio below 13 might become critical for some image analysis. However, for the automation of complex hybridizations it might be very helpful to check these parameters during the following data analysis pipeline. This can help to identifiy bad experiments more efficiently. Furthermore it might help to detect sources of error during the experimental procedure or improvements that were made. Although it is possible to get a higher quality of the results by an improvement of the experimental procedure and data analysis algorithms, it is always limited (not at least by the available resources). Furthermore variations of biological material can be expected. To cope with this limitations repetitions of the experiments are indispensable. Not at least due to the fact that array experiments are still very expensive one wants to know how many repetitions are necessary to ensure a certain quality for your expression analysis. For this purpose we did statistical analysis with 4, 8 and 12 repetitions using a Gaussian distributed noise of the input data with σ2 = 0.2. Here the image analyses had to cope with changing local backgrounds with the same intensity level. The results of the statistical analysis indicate that for the different image analyses expression ratios below 2 become critical. The relatively poor performance for Visual Grid indicated by the distribution of the CVs is probably due to the fact that this program does no local alignment of the spot position. Since here ideal spot positions were simulated this can explain the relatively good correlation found in Fig. 8 for this program. But due to the manual positioning of the global grid this might become a significant source of error. AIDA and FA do local alignments for the spot positions whereby this source of noise due to manual interaction does not occur.
Automated expression analysis by chip technology will become more and more important in the future, e.g. in biology for comprehensive studies of any kind of developmental processes or in medicine for the study of genetically reasoned diseases. Therefore it is essential to have a well characterized chip technology and subsequent data analysis. This can be supported significantly by well defined models and a whole process simulation. By using well characterized radioactively labeled filter cDNA-arrays, we showed in this paper, that the simulation of this biotechnological method reveals for several parameters the level when they become critical for the follow up data analysis and how this can be improved. Furthermore the simulation environment can also be easily used for the study of cDNA arrays based on glass slides, where e.g. background noise seems to be less critical, but distortions of spot positions and less well characterized spot shapes are more critical.
coefficient of variation
This work has been financed by the Max-Planck-Society and the BMBF grant No. 01GR0105. We thank M. Albrecht for performing image analysis of experimental images.
- Clark MD, Henning S, Herwig R, Clifton SW, Marra MA, Lehrach H, Johnson SL, the WU-GSC EST group: An oligonucleotide fingerprint normalized and EST characterized zebrafish cDNA library. Genome Research 2001, 11: 1594–1602. 10.1101/gr.186901PubMed CentralView ArticlePubMedGoogle Scholar
- Dudoit S, Yang YH, Speed TP, Callow MJ: Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 2002, 12(1):111–139.Google Scholar
- Herwig R, Aanstad P, Clark M, Lehrach H: Statistical evaluation of differential expression on cDNA nylon arrays with replicated experiments. Nucleic Acids Research 2001, 29(23):e117. 10.1093/nar/29.23.e117PubMed CentralView ArticlePubMedGoogle Scholar
- Huber W, von Heydebreck A, Sültmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 2002, 18(Suppl. 1):S96-S104.View ArticlePubMedGoogle Scholar
- Jain AN, Tokuyasu TA, Snijders AM, Segraves R, Albertson DG, Pinkel D: Fully automatic quantification of microarray image data. Genome Research 2002, 12: 325–332. 10.1101/gr.210902PubMed CentralView ArticlePubMedGoogle Scholar
- Kepler TB, Crosby L, Morgan KT: Normalization and analysis of DNA microarray data by self-consistency and local regression. Genome Biology 2002, 3(7):1–0037. 10.1186/gb-2002-3-7-research0037View ArticleGoogle Scholar
- Salin H, Vujasinovic T, Mazurie A, Maitrejean S, Menini C, Mallet J, Dumas S: A novel sensitive microarray approach for differential screening using probes labelled with two different radioelements. Nucleic Acids Research 2002, 30(4):e17. 10.1093/nar/30.4.e17PubMed CentralView ArticlePubMedGoogle Scholar
- Saupe D: Algorithms for random fractals. In The Science of Fractal Images. (Edited by: HO Peitgen, D Saupe). Springer-Verlag, New York, Berlin, Heidelberg 1988.Google Scholar
- Schuchhardt J, Beule D, Malik A, Wolski E, Eickhoff H, Lehrach H, Herzel H: Normalization strategies for cDNA microarrays. Nucleic Acids Research 2000, 28(10):e47. 10.1093/nar/28.10.e47PubMed CentralView ArticlePubMedGoogle Scholar
- Steinfath M, Wruck W, Seidel H, Radelof U, Lehrach H, O'Brien J: Automated image analysis for array hybridization experiments:. Bioinformatics 2001, 17: 634–641. 10.1093/bioinformatics/17.7.634View ArticlePubMedGoogle Scholar
- Welch BL: The generalization of Student's problem when several different population variances are involved. Biometrika 1947, 34: 28–35.PubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.