Simulation of DNA array hybridization experiments and evaluation of critical parameters during subsequent image and data analysis

Wierling, Christoph K; Steinfath, Matthias; Elge, Thorsten; Schulze-Kremer, Steffen; Aanstad, Pia; Clark, Matthew; Lehrach, Hans; Herwig, Ralf

doi:10.1186/1471-2105-3-29

Research article
Open access
Published: 22 October 2002


Christoph K Wierling¹,
Matthias Steinfath¹,
Thorsten Elge²,
Steffen Schulze-Kremer²,
Pia Aanstad³,
Matthew Clark¹,
Hans Lehrach¹ &
Ralf Herwig¹

BMC Bioinformatics volume 3, Article number: 29 (2002)


Abstract

Background

Gene expression analyses based on complex hybridization measurements have increased rapidly in recent years and have given rise to a large number of bioinformatic tools, such as image analysis and cluster analysis programs. However, comparatively little work has been done to integrate and evaluate these tools and the corresponding experimental procedures. Although complex hybridization experiments are based on a data production pipeline that incorporates a significant number of error parameters, these parameters have not yet been evaluated in sufficient detail.

Results

In this paper we present simulation studies on several error parameters arising in complex hybridization experiments. A general tool was developed that allows the design of exactly defined hybridization data incorporating, for example, variations of spot shapes, spot positions and local and global background noise. The simulation environment was used to judge the influence of these parameters on subsequent data analysis, for example image analysis and the detection of differentially expressed genes. As a guide for simulating expression data real experimental data were used and model parameters were adapted to these data. Our results show how measurement error can be balanced by the analysis tools.

Conclusions

We describe an implemented model for the simulation of DNA-array experiments. This tool was used to judge the influence of critical parameters on the subsequent image analysis and differential expression analysis. Furthermore the tool can be used to guide future experiments and to improve performance by better experimental design. Series of simulated images varying specific parameters can be downloaded from our web-site: http://www.molgen.mpg.de/~lh_bioinf/projects/simulation/biotech/

Background

DNA-array technology is nowadays frequently used for the generation of genome-wide gene expression profiles (see The chipping forecast, Nature Genetics Suppl. 21, 1999 for a review). The technology is based on the hybridization of labeled ssDNA to its complementary strand, called the probe. Different probes are fixed as spots on planar surfaces, such as glass slides or nylon filters. The arrays are scanned and the hybridization signals of the spots are quantified by suitable image analysis software. To gain further biologically relevant information, complex hybridizations from parallel experiments with different target samples as well as experimental repetitions are carried out. Further evaluation of these hybridization signals by statistical tests and clustering algorithms yields information about differentially expressed or coregulated genes.

The reliability of data produced by these experiments and their reproducibility are crucial for this research. To ensure both reliability and reproducibility a sophisticated experimental design is necessary. This includes, for example, the identification of error parameters that affect the hybridization data during the data generation process. Influences of systematic and statistical errors due to biotechnical methods (for example mRNA preparation, PCR, hybridization), as well as due to devices and array media (for example robots, filters, glass slides), and their effects on evaluation software and algorithms (image analysis, statistical tests, clustering algorithms) must be estimated. These sources of error are frequently discussed in the context of calibration and normalization of microarray data (e.g. [2, 4, 6, 9]). Here we present a computer simulation that takes into account several sources of error. It enables scientists to judge which parameters are critical and how the experimental design or data evaluation might be improved.

On the other hand, creating simulated data without practical consideration is less helpful, because it might lead to artificial data sets that estimate and quantify parameters that are not relevant for the analysis of hybridization data. Thus, simulated data should be adapted and linked to real experiments.

Our tool is designed for that purpose. Hybridization signal intensities taken from experimental data are the input; these data were derived as mean values from six filters, each of which was spotted with the same set of 14208 zebrafish cDNA clones and hybridized independently with the same complex target of an mRNA pool from zebrafish gastrula stage embryos. The output is a series of filter images containing well-defined error parameters. In each series only a single parameter was varied at a time in order to measure its effects on data analysis. The range of parameter variation was adapted to real experiments (experimental reference).

After creating the simulated data, the effect of the error parameters on the subsequent data analysis pipeline was measured. We highlight two modules of this pipeline, image analysis and statistical analysis of differentially expressed genes, although the simulation tool is not restricted to these applications. We chose image analysis because it is the first module of the data analysis and forms the basis for all further research, and statistical analysis of differentially expressed genes because it is one of the most widely used applications of gene arrays.

The images were analyzed with three different image processing programs. Parameters that are judged in this paper are variations of the spot positions caused by different experimental artifacts and different sources of background noise. For gene expression profiling twelve filters with varying local background and experimentally determined signal variations were simulated, six of them correspond to hybridizations with a treatment and six of them correspond to hybridizations with a complex control target. We analyzed how many experimental repetitions are necessary to detect a given level of differential expression. Here, the significance of the differential expression was judged by P values computed by the Welch t-test (cf. [3]).

Our results show that the simulation tool is a valuable resource for the identification and the rating of sources of error arising in hybridization experiments. The simulated sets can be used as benchmark tests for new data analysis modules such as image analyses coming up in the course of gene expression data analysis.

Methods

Implementation of the simulation tool

The simulation tool is written in the object-oriented scripting language Python http://www.python.org. Some computation-intensive functions are implemented in C and can be used as modules in Python. Objects like filters, spots or hybridization data are stored as persistent objects by the use of Zope http://www.zope.org. Figure 1A illustrates the implemented simulation pipeline. It takes as input a set of expression data (in this paper we used an experimental signal distribution of hybridization data) and their positions on the array. During the simulation pipeline several perturbations can be performed. Signal intensities can change due to the up- or down-regulation of gene expression, independent perturbations (that affect signal differences of identically spotted duplicates) or a systematic error during the spotting process due to pin-dependent differences in the amount of transferred PCR product. Systematic or non-systematic spot position errors and varying spot shapes are also considered. These perturbations result in the input data (a filter object, which references its spot objects) used for the array image simulation. Depending on the type of array (filter or glass slide), different levels of global or local background noise can be considered here. The simulation parameters that are under investigation in this paper are listed in Table 1. The output of one array simulation is a parameter file (that contains the values of the variation parameters), a file of the input data for the array image simulation (that contains signal and background intensities and the spot positions) and the image as a 16 bit TIFF file.

Table 1 Definition, modelling and critical effects of simulation parameters.


Data sets

The quality of an expression analysis strongly depends on the distribution of the signal intensities and the spot positions on the filter (e.g. outshining effects). To obtain a realistic situation, results of real experiments were used as input data for the construction of the artificial data and for the statistical expression analysis.

Experimental macroarray data

A detailed description of the cDNA clone array design, mRNA labeling, hybridization and data capture is given in [3]. PCR products of 14208 zebrafish cDNA clones of a representative library from gastrula stage embryos [1] and 2304 copies of an Arabidopsis thaliana cDNA clone were spotted on nylon filter membranes. Clones were spotted in a rectangular grid of blocks with 25 spots (5 × 5) per block by the use of a gadget with 16 × 24 pins corresponding to a 384-well microtiter plate. Figure 1B illustrates the filter design. Due to the experimental procedure a filter is divided into six fields of 384 blocks each. For the 5 × 5 spotting pattern each block contains 25 spots. The zebrafish target derived from mRNA of gastrula stage embryos (6 hours post fertilization) was hybridized to six filter replicates which were spotted with the same set of clones. Each clone was spotted twice in the same block (duplicate) to improve reproducibility.

Design of artificial sample sets

In order to detect differentially expressed genes the cDNA clone array is hybridized in real experiments with two mRNA targets of different origin: one target commonly refers to a reference tissue (control), the second target refers to a certain chemical treatment, a mutant or a disease (treatment).

In our simulation set-up the signals for the control target hybridization were taken from a signal distribution derived from corresponding experimental data of 14208 clones (see above); the experimental images were analyzed with the in-house developed image analysis program FA [10], and the medians and coefficients of variation (CV = standard deviation/mean) were calculated from the replicates of each clone. Figure 2 shows the distributions of these medians and CVs. If reproducibility is perfect the CV is 0; the poorer the reproducibility, the higher the CV. The CVs of the raw data are most frequent in the interval between 0.4 and 0.5 (Fig. 2B). These values are fairly high, since a CV of 0.5, for example, means that the standard deviation of the replicates is 50 % of their mean. However, we deliberately work with an upper bound for the variability of the initial data, since the error parameters can then be identified more clearly. In published studies the CV is in the range of 10 %–25 % (e.g. [3, 7]), since raw data undergo intensive normalization and calibration. The signals for the treatment target hybridization were derived from the medians of the experimental reference signals by up-regulating 5000 randomly chosen clones (35.2 % of all clones). The coefficients of these up-regulations (the expression ratios) are uniformly distributed between 1 and 10. The signals of the other 9208 clones remained unchanged. Both signal sets consist of values for the 14208 clones that were screened for differentially expressed genes. The input signal intensity for the spots corresponding to the constant A. thaliana cDNA clones of the experimental reference was always the same. For the expression analysis six images were simulated for each of the two signal sets. Signal intensity variations as described in the following paragraph and local background noise variations (see below) were carried out for each filter. The spotting order was identical to that of the experimental reference.
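The construction of the treatment signal set described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the seed and the constant control value in the usage lines are hypothetical, and the use of the sample standard deviation in the CV is an assumption (the text only states CV = standard deviation/mean).

```python
import random
from statistics import mean, stdev


def coefficient_of_variation(replicates):
    """CV = standard deviation / mean (sample SD is an assumption here)."""
    return stdev(replicates) / mean(replicates)


def make_treatment_signals(control, n_up, max_ratio=10.0, seed=0):
    """Up-regulate n_up randomly chosen clones by a ratio drawn uniformly
    from [1, max_ratio]; all other clones keep their control signal."""
    rng = random.Random(seed)  # fixed seed keeps the design reproducible
    up_indices = set(rng.sample(range(len(control)), n_up))
    ratios = [rng.uniform(1.0, max_ratio) if i in up_indices else 1.0
              for i in range(len(control))]
    treatment = [c * r for c, r in zip(control, ratios)]
    return treatment, ratios, up_indices


# Usage with placeholder control signals (hypothetical constant value).
control_signals = [100.0] * 14208
treatment_signals, ratios, up_indices = make_treatment_signals(
    control_signals, n_up=5000)
fraction_up = len(up_indices) / len(control_signals)  # about 0.352
```

In a real run the placeholder control signals would be replaced by the per-clone medians of the experimental reference.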

Simulation model

Generation of signal intensities

Schuchhardt et al. [9] have shown that a strong correlation exists for the intensities of spots spotted by the same pin. Spots in the same block are spotted by the same pin. Clones that are spotted in different blocks are spotted by different pins. Thus the amount of material that is transferred to the array varies from pin to pin, and this relative pin-specific variation can be described for the 384 pins of a gadget by the following pin distribution P(Y):

P(Y) = N (1, σ₁²); σ₁ = 0.43 (1)

Here N(1, σ₁²) denotes a Gaussian normal distribution with mean 1 and variance σ₁². The standard deviation, σ₁, was derived from experimental data. Clones with identical 384-well microtiter plate positions are spotted by the same pin. In the experimental reference A. thaliana cDNA of identical amplicons were spotted in each block as a control. Based on this information the mean CV over all pins was calculated and used as σ₁.

On one filter the signal distribution P(X_ij) of replicates is defined as follows:

P(X_ij) = N (y_i·z_j, (y_i·z_j·σ₂)²); σ₂ = 0.2 (2)

with i ∈ ℕ; i = [1, w]

j ∈ ℕ; j = [1, m]

z_j is the mean signal for clone j taken from the median signal distribution of experimental data (cf. Figure 2), y_i denotes the pin dependent factor for pin i derived from the distribution, P(Y). For the simulations presented in this paper the number of pins is w = 384 and the number of clones is m = 14208. Using the duplicate correlation (0.8) of the constant experimental A. thaliana clone signals and σ₁ one can calculate σ₂ = 0.2, because they are associated with each other (proof is not shown). Thus σ₂ is the CV for identical PCR-products that were spotted by the same pin.
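As a minimal sketch of how signals could be drawn from the two distributions P(Y) and P(X_ij), the following Python fragment uses only the standard library. The function names, the seed and the clone mean of 1000 are hypothetical. Note that a Gaussian with mean 1 and SD 0.43 occasionally yields negative pin factors; how the original tool treats such draws is not stated, so the sketch simply takes the absolute value when forming the standard deviation.

```python
import random

SIGMA_PIN = 0.43  # sigma_1, pin-to-pin variation from equation (1)
SIGMA_REP = 0.2   # sigma_2, variation of replicates from equation (2)


def draw_pin_factors(n_pins, rng):
    """Draw one relative transfer factor y_i per pin from N(1, sigma_1^2)."""
    return [rng.gauss(1.0, SIGMA_PIN) for _ in range(n_pins)]


def draw_spot_signal(pin_factor, clone_mean, rng):
    """Draw one spot signal X_ij from N(y_i*z_j, (y_i*z_j*sigma_2)^2)."""
    mu = pin_factor * clone_mean
    # abs() guards against a negative SD for the rare negative pin factor
    # (an assumption of this sketch, not taken from the paper).
    return rng.gauss(mu, abs(mu) * SIGMA_REP)


rng = random.Random(0)                 # seeded for reproducibility
pin_factors = draw_pin_factors(384, rng)
clone_mean = 1000.0                    # hypothetical median signal z_j
duplicates = [draw_spot_signal(pin_factors[0], clone_mean, rng)
              for _ in range(2)]       # two spots of one clone, same pin
```

Because both duplicates share the pin factor y_i, their signals are correlated across pins, which is the effect the pin distribution is meant to capture.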

Filter model

The simulated images are generated by an intensity function, which yields for each pixel k an intensity value. The presented model is based on empirical assumptions. It is given by a continuous function of the position r on the filter, I(r), as follows:

I(r) = Σ_j A_j·f(|r − r_j|) + g(r) + ε (3)

where A_j is the given spot intensity, g is a function that describes the local and global background, ε denotes a stochastic perturbation, and |r − r_j| is the Euclidean distance to the center of spot j. The sum runs over the nine spot centers closest to r, because the pixelized spot shape is given by a square 19 × 19 pixel matrix and the usual distance between two spot centers is 7.89 pixels for the image resolution used in this paper (0.114 mm/pixel).

Here f(|r-r_j|) is a spot shape distribution which describes the spot shape (see below).

The pixel intensity I_k of pixel k is given by:

I_k = min([I(r_k)], 2^N − 1) (4)

with N = 16 for a 16 bit image, where r_k is the center of pixel k. The square brackets denote the integer function, which returns the largest integer less than or equal to the value in brackets.

The spot intensities A_j are taken from a real experiment (see above, intensity distribution see Fig. 2).

To determine the location r_j of the spots we assume that the probes are spotted approximately in an orthogonal grid.
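The filter model can be illustrated with a short Python sketch that evaluates one pixel. Several points are assumptions rather than statements from the paper: spot contributions, background and noise are combined additively, a Gaussian spot shape is used, the value is clipped at 0 and saturates at 2^N − 1, and all names are hypothetical. Coordinates are in pixel units.

```python
import math


def gaussian_shape(dist, sigma):
    """Normalized two-dimensional Gaussian evaluated at distance dist."""
    return math.exp(-dist ** 2 / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)


def pixel_value(px, py, spots, sigma, background=0.0, noise=0.0, bits=16):
    """Integer intensity of the pixel centered at (px, py).

    spots is a list of (x, y, amplitude) tuples; only the nine spot
    centers closest to the pixel contribute, as described in the text.
    """
    nearest = sorted(spots,
                     key=lambda s: math.hypot(px - s[0], py - s[1]))[:9]
    intensity = background + noise
    for sx, sy, amplitude in nearest:
        intensity += amplitude * gaussian_shape(
            math.hypot(px - sx, py - sy), sigma)
    # Floor to an integer and keep the result inside the N-bit range.
    return max(0, min(2 ** bits - 1, math.floor(intensity)))


# Usage: a 3 x 3 grid of spots with the 7.89 pixel spacing from the text.
spacing = 7.89
grid = [(col * spacing, row * spacing, 1000.0)
        for row in range(3) for col in range(3)]
center_value = pixel_value(spacing, spacing, grid, sigma=2.0)
spacing_mm = spacing * 0.114  # about 0.9 mm between adjacent spot centers
```

Restricting the sum to the nine nearest spots keeps the cost per pixel constant regardless of the total number of spots on the filter.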

Local distortions

Local distortions of the spots are considered. Due to the experimental procedure two different spot distortions are introduced: spot shifting and pin shifting. Both are modeled by random, Gaussian-distributed shifts of the spot centers relative to their theoretical positions. For spot shifting the distortions are independent for each spot; for pin shifting they are equal for all spots of one block of 5 × 5 spots, because these were spotted by the same pin.
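The difference between the two distortion types can be sketched as follows. The sketch assumes independent Gaussian shifts in the x and y directions, which the text does not specify, and the function and parameter names are hypothetical.

```python
import random


def shifted_centers(ideal_centers, block_of_spot, spot_sd, pin_sd, seed=0):
    """Return perturbed spot centers.

    ideal_centers: list of (x, y) positions on the ideal grid
    block_of_spot: block index for each spot (same length)
    spot_sd:       SD of the independent per-spot shift
    pin_sd:        SD of the shift shared by all spots of one block
    """
    rng = random.Random(seed)
    # One offset per block: every spot of a block was set by the same pin.
    pin_offsets = {}
    for block in sorted(set(block_of_spot)):
        pin_offsets[block] = (rng.gauss(0.0, pin_sd), rng.gauss(0.0, pin_sd))
    perturbed = []
    for (x, y), block in zip(ideal_centers, block_of_spot):
        dx, dy = pin_offsets[block]
        # The per-spot shift is drawn independently for every spot.
        perturbed.append((x + dx + rng.gauss(0.0, spot_sd),
                          y + dy + rng.gauss(0.0, spot_sd)))
    return perturbed


# Usage: two blocks with two spots each, positions in mm.
ideal = [(0.0, 0.0), (0.9, 0.0), (10.0, 0.0), (10.9, 0.0)]
blocks = [0, 0, 1, 1]
pin_only = shifted_centers(ideal, blocks, spot_sd=0.0, pin_sd=0.1, seed=1)
```

With spot_sd set to zero, all spots of a block move together, which mimics a purely systematic pin error.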

Spot shape

Due to the experimental procedure of the array preparation, the array surface type, and the nature of the fixed DNA material, the spot shapes are different. Here we introduced three distribution models of spot shapes that are based on experimental evidence:

(a) a normalized two-dimensional Gaussian distribution with a given SD (σ): f(|r − r_j|) = 1/(2πσ²) · exp(−|r − r_j|²/(2σ²))

(b) a normalized two-dimensional Gaussian distribution with a given SD (σ₁) from which another concentric Gaussian distribution (SD = σ₂), scaled by a factor S ∈ (0, 1), is subtracted. The resulting spot has a crater-like shape.

(c) a normalized cylindrical shape with a given radius d that forms a plateau-like spot: f(|r − r_j|) = 1/(πd²) for |r − r_j| ≤ d, and 0 otherwise

These spot models were used because they are commonly observable with spotted array data on nylon and glass supports respectively and are frequently assumed as quantification models by image analysis programs. More irregular spot shapes that do not have a common spot distribution can also be observed (e.g. [5]), but are not considered for this paper.
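The three spot shape models can be written as radial functions, as in the following Python sketch. The Gaussian and the plateau follow directly from their definitions as normalized distributions. For the crater shape the normalization is an assumption: the difference of the two Gaussians is divided by (1 − S) so that it integrates to one, which may differ from the original tool. All names are hypothetical.

```python
import math


def gaussian_shape(dist, sigma):
    """(a) Normalized two-dimensional Gaussian."""
    return math.exp(-dist ** 2 / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)


def crater_shape(dist, sigma_outer, sigma_inner, scale):
    """(b) Outer Gaussian minus a scaled concentric inner Gaussian.

    Dividing by (1 - scale) renormalizes the volume to one (assumption).
    The center stays non-negative only if
    scale <= (sigma_inner / sigma_outer) ** 2.
    """
    difference = (gaussian_shape(dist, sigma_outer)
                  - scale * gaussian_shape(dist, sigma_inner))
    return difference / (1.0 - scale)


def plateau_shape(dist, radius):
    """(c) Flat cylinder of the given radius with unit volume."""
    return 1.0 / (math.pi * radius ** 2) if dist <= radius else 0.0


def radial_integral(shape, upper=40.0, steps=40000):
    """Integrate shape(r) * 2*pi*r over [0, upper] with the midpoint rule."""
    dr = upper / steps
    return sum(shape((k + 0.5) * dr) * 2 * math.pi * (k + 0.5) * dr * dr
               for k in range(steps))


# Usage: each shape should integrate to roughly one over the plane.
volume_gauss = radial_integral(lambda r: gaussian_shape(r, 2.0))
volume_crater = radial_integral(lambda r: crater_shape(r, 2.0, 1.0, 0.2))
volume_plateau = radial_integral(lambda r: plateau_shape(r, 3.0))
```

Because all three functions have unit volume, multiplying them by a spot intensity A_j yields spots with the same total signal but different profiles.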

Background noise

Two different sources of background noise can be distinguished: a global background due to the scanner noise or filter surface and a local background due to inhomogeneous hybridization to the filter that looks like smear.

Global background noise

The global background is described by a randomly Gaussian distributed noise that is equal for the whole filter. It can be varied by its mean and SD.

Local background noise

As a model for the local background, fractal clouds as described in [8] are used. They are generated with the midpoint displacement method with a fractal dimension of 0.4 and then scaled to a given minimum/maximum range, which defines the intensity level of this background. This model was chosen for the local background because the intensity level of a given pixel depends on its neighbors. The resulting images closely resemble the background of experimental images. By the use of a pseudo-random number generator, reproducible fractals were created.
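The midpoint displacement idea can be shown with a one-dimensional Python sketch; the tool itself generates two-dimensional clouds, so this is a simplification. The roughness factor that shrinks the displacement at each level is a hypothetical parameter of this sketch and is not the same quantity as the fractal dimension quoted above. The seeded generator makes the profile reproducible, and the final step rescales it to a chosen minimum/maximum range.

```python
import random


def midpoint_displacement(levels, roughness, seed):
    """Build a 1D fractal profile with 2**levels + 1 points."""
    rng = random.Random(seed)          # same seed gives the same profile
    points = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    amplitude = 1.0
    for _ in range(levels):
        refined = []
        for left, right in zip(points, points[1:]):
            # Displace the midpoint of every segment by a random amount.
            mid = (left + right) / 2 + rng.uniform(-amplitude, amplitude)
            refined.extend([left, mid])
        refined.append(points[-1])
        points = refined
        amplitude *= roughness         # smaller displacements on finer scales
    return points


def rescale(values, low, high):
    """Map values linearly onto the interval [low, high]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [low] * len(values)
    return [low + (v - vmin) * (high - low) / (vmax - vmin) for v in values]


# Usage: background profile scaled to a hypothetical range of 0 to 500.
profile = rescale(midpoint_displacement(levels=6, roughness=0.5, seed=42),
                  0.0, 500.0)
```

Neighboring values of the profile are correlated because each new point starts from the average of its two neighbors, which is the property the text gives as the reason for choosing this model.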

Data evaluation and quality measurement

Image analysis

To illustrate the power of using simulated data for the judgment of image analysis software, we used the following programs: (1.) FA: which was developed at the Max-Planck-Institute for Molecular Genetics [10]. It is fully automated – no manual effort for the positioning of the grid is necessary, (2.) AIDA: Raytest, Germany http://www.raytest.de, which needs some manual interaction for the positioning of the grid, (3.) Visual Grid: GPC Biotech, Germany http://www.gpc-ag.com, for which the whole grid has to be adapted manually. These programs have been chosen, because they are frequently used at our institute and have already been utilized for massive image analysis (FA [10]; Visual Grid [3]). Furthermore, they are representative for the different levels of automation of image analyses.

Evaluation of gridfind and quantification quality

The following two steps are essential for the analysis of hybridization images: gridfind and quantification. First the gridfind has to locate the exact positions of the spots, and then the quantification assigns a signal intensity to each spot. For instance, the image analysis program FA performs a Gaussian spot shape fit for quantification [10]. The performance of the different image analysis programs is tested by the following quality parameters:

The mean distance between simulated and calculated spot centers.
The Pearson correlation between simulated and calculated intensities.

The first parameter measures the quality of the gridfind. The second is a measure for the quality of the whole image processing.
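Both quality parameters are straightforward to compute. The following Python sketch is a minimal illustration with hypothetical function names and toy data, not the evaluation code used for the paper.

```python
import math


def mean_center_distance(simulated, calculated):
    """Mean Euclidean distance between paired (x, y) spot centers."""
    distances = [math.hypot(sx - cx, sy - cy)
                 for (sx, sy), (cx, cy) in zip(simulated, calculated)]
    return sum(distances) / len(distances)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (norm_x * norm_y)


# Usage with toy data: one spot is found 5 units away, one exactly.
gridfind_error = mean_center_distance([(0.0, 0.0), (1.0, 1.0)],
                                      [(3.0, 4.0), (1.0, 1.0)])
quantification_quality = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The Pearson correlation is insensitive to a constant scaling of the calculated intensities, so it rewards programs that preserve relative signal levels even if their absolute units differ from the simulation input.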

Statistical evaluation of differential expression

For testing the statistical significance of differential expression we calculated P values according to the Welch test [11]. This test is an unpaired t-test. It assumes that the two samples ("treatment" and "control") are distributed according to Gaussian distributions with means μ_treatment and μ_control, respectively, and tests the hypothesis that μ_treatment = μ_control. Here, in contrast to Student's t-test, it is not assumed that both sample distributions have the same variance. The test statistic, T, has the form

T = (x̄ − ȳ) / √(s_x²/n + s_y²/m)

Here, x̄ and ȳ denote the sample means, s_x² and s_y² denote the sample variances, and n and m are the respective sizes of the treatment and the control sample. High and low values of the test statistic then indicate significantly different sample means. This test has been applied in several studies on differential expression of array data, for example [3] and [2].
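The Welch statistic for one clone can be computed with the Python standard library, as in the sketch below. The function name and the toy signals are hypothetical. The degrees of freedom use the Welch–Satterthwaite approximation, which is the usual companion of this test but is not spelled out in the text. Converting T to a P value additionally requires the t-distribution, for example via `scipy.stats.ttest_ind(treatment, control, equal_var=False)`.

```python
import math
from statistics import mean, variance


def welch_t(treatment, control):
    """Return Welch's T and the approximate degrees of freedom."""
    n, m = len(treatment), len(control)
    var_t, var_c = variance(treatment), variance(control)  # sample variances
    squared_error = var_t / n + var_c / m
    t_statistic = (mean(treatment) - mean(control)) / math.sqrt(squared_error)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    degrees = squared_error ** 2 / ((var_t / n) ** 2 / (n - 1)
                                    + (var_c / m) ** 2 / (m - 1))
    return t_statistic, degrees


# Usage with toy replicate signals of a single clone.
t_value, dof = welch_t([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

In the setting of the paper, each sample would hold the signals of one clone from the replicate filters, with up to six values per condition.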

Results

The quality of an expression profile analysis based on array data depends strongly on the number of repeated sample measurements and on the array preparation, hybridization and signal quantification procedures. The quality of the latter can be increased either by improved array preparation and hybridization or by better image analysis algorithms that can handle preparation errors. The improvement achievable by either approach is limited. Major critical parameters are local distortions of the spots, variations of the spot shape and outshining effects due to neighboring spots or massive background noise. These parameters have been analyzed in this paper (see Table 1). In the following we simulated series of images by changing only one parameter at a time.

Local distortion

In the following the spots have a constant Gaussian shape without background noise. Thus only the effects of local distortions are tested. Figures 3 and 4 show the influence of spot shifts on the gridfind and quantification.

Spot shifting

Spot shifting was simulated with SDs between 0 and 0.342 mm from its ideal positions (Fig. 3). The mean distance between adjacent spot centers was 0.9 mm. For the three image analysis programs which were under investigation this parameter became critical (correlation < 0.95) for the quantification, which is also influenced by the quality of the gridfind, for SDs in the range of 0.15–0.2 mm (Fig. 3B). This is about a fifth of the distance of adjacent spot centers.

In Figure 3C we focused only on the quality of the gridfind. The error, given by the mean distance between the spot center calculated by the image analysis and the simulated center, grows approximately linearly with the perturbation for all tested image analysis programs. The low quality of AIDA for small perturbations is due to missing sub-pixel precision: if, for example, the simulated spot center does not coincide with the center of a pixel, the AIDA output cannot reflect this offset.

Pin shifting

The error due to pin variations is a systematic error for all spots in the same block, because they were spotted by the same pin (Fig. 4A). Perturbations with SDs between 0 and 0.2 mm were simulated. This error became critical (correlation < 0.95) for SDs of the pin shifting greater than 0.12 mm for Visual Grid and greater than 0.167 mm for FA. The error of the gridfind was linear in the perturbation (Fig. 4C). Here again the low quality of AIDA for small perturbations is due to the missing sub-pixel precision.

Figure 5 shows the distribution of block center shifts measured for experimental data (the block centers were manually determined with Visual Grid). With respect to the results above, this means that for the majority of blocks the error due to pin shifting is not in the critical range. In general, however, this can become a critical parameter, depending strongly on the devices used (e.g. spotting robots).